Two-Dimensional Spatialization of Sound

1. Background

In 1997, a system was designed at CRC to study the use of directional sound as a situational awareness enhancement for dismounted soldiers. Four wearable sets were constructed for outdoor field trials, each set containing a laptop computer, differential GPS, wireless LAN transceiver, headphones and a head-mounted compass. Using the GPS position and head bearing data, the bearing of the other three sets relative to the wearer's head could be calculated. Voice communication (via the wireless LAN) between the four dispersed individuals was spatialized on the horizontal plane so that the sound arriving at the ears of each listener would appear to come from the actual direction of the speaker. Ambient sound was received by artificial ears mounted on the headphones so that it could be presented to the listener without loss of directional cues, while protecting the soldier's hearing.

2. Description

The 2D sound tool is an ANSI C library of functions which are called by a client program. The main functions, whose client-side use is sketched below, are:

- soundscapeUpdate(), which specifies the coordinates of the sound source and listener and the positions of the walls of the room which they occupy, and
- spatialize(), which, given a vector of monaural speech samples, produces a vector of binaural spatialized speech samples based on the previously supplied coordinates.

The soundscape uses Head Related Transfer Functions (HRTFs) generated using the KEMAR head model [3]. These have been calculated in steps of 10 degrees of arc for elevations from -40 degrees to +90 degrees; only the data for 0 degrees elevation has been used. The HRTF data was originally created by Bill Gardner and Keith Martin at the MIT Media Lab.

Differential HRTFs (DHRTFs) are generated in which the shadowed ear response (source on the opposite side of the head from the ear) is inverse filtered with the unshadowed ear response (source on the same side of the head as the ear). The unshadowed ear then receives the unmodified sound, whereas the shadowed ear receives the sound modified using the DHRTF. There were several reasons for doing this:

- the impact on intelligibility is minimized,
- the effect of microphone placement in the ear canals is eliminated,
- delays common to both right and left ear HRTFs are removed, and
- the computational load is substantially decreased.

The KEMAR HRTFs are known to be deficient with regard to front-back reversals, in which a sound placed behind the listener will be perceived to be in front of the listener, and vice versa. To overcome this, a set of "common mode" cues is superimposed on the DHRTFs. These cues exploit the observation that for wavelengths shorter than the diameter of the head, the sound shadow increases gradually with frequency [2]. The shadow varies sinusoidally with respect to azimuth and has a maximum value of 10 dB at the Nyquist frequency ([1] p. 62) at an azimuth of 225 degrees on the shadowed side.

The spatialized sound thus produced lacks the sensation of coming from "outside the head". To overcome this, the first-order reverberations of the sound from the walls of the room are calculated using two-dimensional ray tracing ([1] p. 184) with some assumptions about sound absorption by the walls. The four reflected sound sources are treated as separate sound rays, and the right and left ear responses are calculated as above for each ray. Judicious choice of room dimensions and listener location also helps to reduce front-back reversals.
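The library's actual prototypes are not reproduced in this note, so the following client-side sketch is illustrative only: the argument lists of soundscapeUpdate() and spatialize(), the 16-bit sample type and the interleaved left/right output format are assumptions made for the example, not the library's definitions. It shows the calling pattern described above: update the soundscape geometry whenever positions change, then filter each frame of monaural speech into binaural output.

    /* Illustrative client sketch only: these prototypes are assumed for the
     * example and are not the library's actual header. */
    #include <stddef.h>

    void soundscapeUpdate(float srcX, float srcY,     /* sound source position (m)       */
                          float lstX, float lstY,     /* listener position (m)           */
                          float roomW, float roomL);  /* wall positions, given here as
                                                         the size of a rectangular room  */
    void spatialize(const short *monoIn,              /* monaural speech samples         */
                    short *binauralOut,               /* assumed interleaved L/R output  */
                    size_t nSamples);

    #define FRAME 512

    int main(void)
    {
        static short mono[FRAME];
        static short binaural[2 * FRAME];

        /* Geometry update: source at (3, 4) m, listener at (1, 1) m, in a
         * 10 m x 8 m room.  This is the call in which the composite left and
         * right ear filters (DHRTF + common mode cues + reflections) are built. */
        soundscapeUpdate(3.0f, 4.0f, 1.0f, 1.0f, 10.0f, 8.0f);

        /* Per-frame processing: monaural speech in, binaural speech out. */
        spatialize(mono, binaural, FRAME);

        return 0;
    }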
The range of the sound and its reverberations is also taken into account. Sound intensity is in general proportional to the inverse square of range; however, this implies that a sound source placed very close to the head could be very loud. Instead, the sound level is normalized: the direct sound ray is not scaled, and the reflected rays are attenuated in proportion to the square of the ratio of the range of the direct ray to the range of the reflected ray. (This means that the level of the reflected rays is greatest when the sound source is placed at a distance from the head; a small worked example of this scaling is sketched after the references.)

When soundscapeUpdate() is called, the differential and common mode cues for each of the five rays are convolved with the filters for the right and left ears to create two filters representing the composite response for each ear. Finally, the right and left ear filters are "pruned" to remove taps with very small weights, to reduce the computational load. When spatialize() is called, the input monaural sound is passed through these filters to produce a binaural sound containing the spatialization cues.

3. References

[1] D. R. Begault, "3-D Sound for Virtual Reality and Multimedia", Academic Press, 1994.
[2] J. M. Loomis, C. Hebert and J. G. Cicinelli, "Active Localization of Virtual Sounds", J. Acoust. Soc. Am., Vol. 88, No. 4, October 1990.
[3] W. G. Gardner and K. D. Martin, "HRTF Measurements of a KEMAR", J. Acoust. Soc. Am., Vol. 97, No. 6, pp. 3907-3908, 1995.
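Worked example of the reflected-ray scaling (referenced in Section 2 above). This is a stand-alone sketch for illustration; reflectedRayGain() is a hypothetical helper written for the example, not a function of the library.

    #include <stdio.h>

    /* Hypothetical helper, not part of the library: the direct ray is left
     * unscaled, and each reflected ray is attenuated by the square of the
     * ratio of the direct range to the reflected range, as described above. */
    static double reflectedRayGain(double directRange, double reflectedRange)
    {
        double ratio = directRange / reflectedRange;
        return ratio * ratio;
    }

    int main(void)
    {
        /* Source 2 m from the listener, first-order reflected path 6 m long:
         * the reflected ray is scaled to (2/6)^2 = 1/9 of the direct level.
         * As the source moves away, the direct and reflected path lengths
         * become comparable, so the reflected rays become relatively
         * stronger, as noted in Section 2. */
        printf("near source: gain = %.3f\n", reflectedRayGain(2.0, 6.0));
        printf("far source:  gain = %.3f\n", reflectedRayGain(20.0, 24.0));
        return 0;
    }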