Having originated in the academic world and remained the province of PhD students and researchers for years, audio analytics is now being used in UK cities both on its own and to complement video analytics. Similar applications are likely to follow elsewhere in Europe. This article considers developments and discusses recent projects.
By Chris Gomersall, London
Audio analytics has been developed in part to address the limitations of video analytics, notably the fact that in a busy scene, crucial images are often obscured and activity can be hard to follow. Anybody involved in high-end intelligent scene analysis will know that the difference between two people fighting and, say, putting on their coats, can be extremely difficult to detect. Discriminating between good-natured horseplay and real violence is even harder. For the rules-based, non-intuitive intelligence of analytics software, the activities are often too similar for a distinction to be made.
These examples typify the many situations in which audio analytics can enhance scene analysis. The scenarios underline the kind of challenges faced by Ipsotek, a world leader in audio analytics and the only company to offer true integration of video and audio analytics. These are of course easy phrases to trot out but the claims can be substantiated with case studies, some of them already mature, while most rival offerings have yet to be used in the field.
Every additional trigger halves false alarms
One environment in which audio analytics can work effectively alongside video analytics is the car park, where camera views are often impeded by pillars and low ceilings. Visual triggers can easily be augmented here with a simple audio alarm on the sound of laminated glass breaking. A typical scenario might combine three conditions: video analytics to track an object the size and shape of a car; the time taken to travel between two points (a measure of speed), which is likely to be shorter than usual in the event of theft; and a third filter to detect the shattered glass. The analytics provider would require each condition to be met for the scenario to be valid and an alarm event to occur. It is our experience that every additional trigger in a scenario will halve the incidence of false alarms, so making the system more credible and useful to the client.
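The car-park scenario can be sketched as a simple conjunction of triggers. This is a minimal illustration only, not Ipsotek's implementation: the field names, the eight-second transit threshold and the baseline figure are assumptions made for the example, and the estimate merely applies the halving rule of thumb quoted above.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """One candidate event. All field names are hypothetical."""
    car_sized_object: bool   # video: tracked object matches a car's size and shape
    transit_seconds: float   # video: time taken between two reference points
    glass_break_heard: bool  # audio: breaking-glass detector fired


def scenario_valid(obs: Observation, min_normal_transit: float = 8.0) -> bool:
    """Alarm only when every condition is met (logical AND of all triggers)."""
    too_fast = obs.transit_seconds < min_normal_transit  # assumed threshold
    return obs.car_sized_object and too_fast and obs.glass_break_heard


def estimated_false_alarms(baseline: float, triggers: int) -> float:
    """Apply the rule of thumb: each additional trigger halves false alarms."""
    if triggers < 1:
        raise ValueError("a scenario needs at least one trigger")
    return baseline * 0.5 ** (triggers - 1)
```

Under these assumptions, a scenario that produced 40 false alarms with one trigger would be expected to produce 10 with three, since `estimated_false_alarms(40.0, 3)` returns `10.0`.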
Think of the last time you saw violence on a city street. The likelihood is that you heard the fight before you saw it. Twenty percent of verbal aggression incidents result in physical assault. Audio analytics can be crucial in establishing which party started a fight and it can also reveal if an attack is racially motivated. There is evidence to suggest that judges are willing to sentence more severely when there is an audio file to support video footage, providing the two sets of data are time stamped.
[Figure: Assault, leading to panic alarm]
In the UK, street aggression has been exacerbated by recent changes in alcohol licensing laws and a climate of "binge" drinking which contrasts with European café culture. To Ipsotek's surprise, it has become apparent that 70 percent of street violence at night time in bar and club areas is initiated by ladies with their menfolk only joining the fray when blows are exchanged.
Evidential weight
Readers of WIK who have spent time in a CCTV control room will know that cameras rarely pan round to give images of a violent incident until it is already in progress, often with one party lying on the ground. Being omni-directional, audio analytics can capture sound and provide a useful audit trail of events even if the microphone is pointing in the wrong direction. Such an audit trail can be crucial in either confirming or discrediting somebody's version of events in a court of law.
Audio analytics is now advancing so that irrelevant sounds can be filtered out effectively. The intelligence inherent in leading audio analytics systems allows the software not only to ignore ambient sounds such as vehicle engines, footsteps, the slamming of doors and wail of sirens, but to learn about sounds that are peculiar to an area during an optimisation phase. Thus, in a recent project for the London Borough of Hackney, Ipsotek's engineers identified the noise made by scaffolders while they dropped planks of wood as a potential false alarm since the sound approximated to that of a sawn-off shotgun which was in our library of sounds on which to alert. Detailed comparison allowed our software developers to discriminate against the extraneous sound.
Replicating the human ear
Ipsotek's ability to filter out sounds that create false alarms is based on a cochleogram, a digital facsimile of the part of the human ear that processes and responds to information. Vital to our work in this area are hair cells which vibrate at certain wavelengths and respond to certain frequencies. Our software developers have taught a computer how to process audio information in the same way as the human ear and, crucially, how to filter out extraneous audio signals in the manner of a human being. A business person holding a telephone conversation will focus largely on what is being said to him; he will filter out interference on the phone line, the air-conditioning in his office, the whirr of PC servers and traffic on the street. When an audio analytics system is used in the field, a similar process takes place. The software will evaluate extraneous noises, reject them and focus on the trigger sounds it has been trained to recognise.
Exceptional buy-in at shop-floor level
Hackney, a suburb in the east of London, has been the site of one of Ipsotek's major projects. Within days of installing our audio analytics, security officers were being alerted to gunshots they would not otherwise have known about. Operator buy-in is exceptional and it has been gratifying for Ipsotek engineers to see security staff spin round in their chairs the moment our equipment creates an alarm, before studying the site schematic intently to see the location of the incident. Control rooms have been quick to realise that a muzzle flash lasts only 250 milliseconds and video analytics is often ineffective in alerting police to the use of firearms.
Our developers have created a comprehensive library of gun shot audio files covering a range of firearms, calibers and ammunition as well as distances. In a demonstration of how seriously the British government takes the company's product offering and research programme, Ipsotek was recently allowed unprecedented access to the Metropolitan Police Service's specialist firearms command, Operation Trident, and our sound library is to a large extent the result of this cooperation. Trident is a dedicated unit that was set up following a series of shootings - many drug-related - in London boroughs.
Aggression is not a binary event
[Figure: Vocal aggression - true or not. Photos: Ipsotek]
Belligerence and hostility in the human voice can be plotted graphically, with normal conversation forming a base line and the adrenal-fed exchanges of people fighting at the top of the scale with a value of 100. Aggression happens on a continuum; it is not a "yes-no" event as in binary code. The interest of control room operatives and police in values along the scale will depend on the location and there will inevitably be varying expectations and tolerances. Taking the city of Hamburg as an example, audio feeds with an aggression value of, say, 50, might cause interest from street microphones in a genteel stockbrokers' suburb but it would require a reading in the nineties from the Reeperbahn or an adjacent red light district before officers had cause for concern.
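The idea of location-dependent tolerance can be sketched as a lookup of alert thresholds on the 0 to 100 scale. The location names and the two threshold values below are assumptions that simply echo the Hamburg illustration; they are not calibrated figures from any deployment.

```python
# Hypothetical per-location alert thresholds on the 0-100 aggression scale.
THRESHOLDS = {
    "residential_suburb": 50,   # assumed: a mid-scale reading already draws interest
    "nightlife_district": 90,   # assumed: only readings in the nineties matter
}


def should_alert(location: str, aggression: float) -> bool:
    """Return True when a reading reaches the threshold set for its location."""
    if not 0 <= aggression <= 100:
        raise ValueError("aggression must lie on the 0-100 scale")
    return aggression >= THRESHOLDS[location]
```

With these assumed values, a reading of 55 would raise an alert in the suburb but not in the nightlife district, which is the point of treating aggression as a continuum and not as a single yes-no trigger.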
Similarly, security staff who noticed high-to-middling levels of aggression from one party only, on an audio feed from a microphone fitted in a bank manager's office, would simply conclude that a customer had been denied an overdraft.
In addition to the project in London, Ipsotek is involved in audio analytics at a major town in the north of England and a city in Scotland. A single microphone at the Scottish location has produced nine true activations in three weeks, and 150 activations annually per microphone is an average figure. At these socially-deprived northern locations, police and local government are only concerned with the upper levels where people are in the state of adrenal aggression that is consistent with public order offences. It is virtually impossible to replicate the changes in the vocal cords that occur when somebody is pumped up with adrenalin to the point of violence and to date, only one method-trained actress has deceived the Ipsotek system.
The mechanics
Technical issues regarding installation are not complicated but it is crucial to work with integrators who appreciate the principles of audio analysis and microphone positioning. On no account become involved with a mainstream CCTV installer whose only understanding of audio is what he picked up playing bass guitar in his school band!
Microphones should normally be placed six metres above street level, with a range of approximately 12 metres in urban locations. Installers should note that, in open conditions, sound pressure roughly halves (a fall of about 6 dB) each time distance is doubled. Obvious sources of masking noise such as building sites should be avoided and microphones should be pole or building-mounted. Always pay heed to the local knowledge of taxi drivers and shopkeepers (particularly fast food outlets) when considering where to place microphones in nightlife areas. Microphones should be connected via balanced, screened XLR units and UDP audio can be streamed to street cabinet fibre switches using Barix Instreamer 100 IP converters. The resulting audio-to-Ethernet conversion creates a high-quality MP3 stream that can be distributed via an IP-based network or the Internet.
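The halving rule can be put into numbers with the inverse-distance law. The sketch below assumes that the level in question is sound pressure from a point source in open air; reflections from buildings and street furniture will make real urban figures differ, so it is a planning aid and not a prediction. The function names are illustrative.

```python
import math


def relative_pressure(reference_m: float, distance_m: float) -> float:
    """Sound pressure at distance_m as a fraction of that at reference_m."""
    if reference_m <= 0 or distance_m <= 0:
        raise ValueError("distances must be positive")
    return reference_m / distance_m


def level_drop_db(reference_m: float, distance_m: float) -> float:
    """Fall in sound pressure level, in dB, between the two distances."""
    if reference_m <= 0 or distance_m <= 0:
        raise ValueError("distances must be positive")
    return 20.0 * math.log10(distance_m / reference_m)
```

On these assumptions, moving from 6 metres to 12 metres halves the pressure, a fall of about 6 dB, and a source at the 12-metre edge of the range is roughly 21.6 dB quieter than the same source at 1 metre.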
In the UK, Ipsotek is able to deliver data to the fibre media converters of the principal telephone provider (British Telecom) and onwards to rs1000 fibre optic cable. In this way the country's existing sunk infrastructure is exploited with no disruption for work by civil engineers.
No question of indiscriminate eavesdropping
Ipsotek is scrupulous in discarding all audio data that does not contain an incident. The company will never condone or facilitate indiscriminate monitoring of conversations or continuous feeds by anybody, however high in the management chain. Ipsotek takes pains to distance itself from such practice which is in fact exceptionally rare. The industry is set for pan-European debate on how audio analytics ought to be used and it should be noted that the UK Deputy Information Commissioner, Jonathan Bamford, is on record as saying that he regards audio analytics as less invasive than video scene evaluation.
Ipsotek's library of sounds is extensive but our current focus is on aggression, panic, gun shots and broken glass. As successful case studies are noted by central and local government, the range of applications will widen and audio analytics will gain wider acceptance. There is increasing interest from banks, prisons and cash in transit operators. Video analytics is versatile and effective; indeed Ipsotek excels at it and invests heavily in the technology. But there are occasions when event evaluation must come from another direction.
About the author:
Chris Gomersall is Chief Executive Officer of London-based Ipsotek. Ipsotek is a world leader in audio analytics and the only company to offer true integration of video and audio analytics.
Contact: info@ipsotek.com