DNN-based localisation under reverberant conditions

The Two!Ears Auditory Model comes with several knowledge sources that work together to estimate the perceived azimuth of a sound source, see Localisation knowledge sources for a summary. One stage of this process is the mapping of extracted features like ITDs and ILDs to the perceived azimuth angle. This mapping is highly influenced by the environment; for example, in a room the ITD values look quite different than in an anechoic chamber. That is the reason why we have different knowledge sources that perform this mapping: DnnLocationKS, GmmLocationsKS, and ItdLocationKS. ItdLocationKS utilises a simple lookup table for the mapping and works well in the case of Prediction of localisation in spatial audio systems. GmmLocationsKS is at the moment trained only for anechoic conditions. In this example we have a look at DnnLocationKS, which was trained with a multi-conditional training approach to work under reverberant conditions [MaEtAl2015dnn]. Apart from that, DnnLocationKS works in the same way as GmmLocationsKS and connects with LocalisationDecisionKS and HeadRotationKS to solve front-back confusions.

In this example we will have a look at localisation in a larger room, namely using the BRIR data set measured at TU Berlin in room Auditorium 3, which provides six different loudspeaker positions as possible sound sources. All files for this example can be found in the examples/localisation_DNNs folder, which contains the following files:

BlackboardDnnNoHeadRotation.xml
BlackboardDnn.xml
estimateAzimuth.m
localise.m
resetBinauralSimulator.m
setupBinauralSimulator.m

The setup is very similar to Localisation with and without head rotations, with a few exceptions. First, the setup of the Binaural simulator differs as we use BRIRs instead of HRTFs and have one impulse response set for every sound source. The initial configuration of the Binaural simulator is provided by the setupBinauralSimulator function:

sim = simulator.SimulatorConvexRoom();
set(sim, ...
    'BlockSize',            4096, ...
    'SampleRate',           44100, ...
    'NumberOfThreads',      1, ...
    'LengthOfSimulation',   1, ...
    'Renderer',             @ssr_brs, ...
    'Verbose',              false, ...
    'Sources',              {simulator.source.Point()}, ...
    'Sinks',                simulator.AudioSink(2) ...
    );
set(sim.Sinks, ...
    'Name',                 'Head', ...
    'Position',             [ 0.00  0.00  0.00]' ...
    );
set(sim.Sources{1}, ...
    'AudioBuffer',          simulator.buffer.Ring(1) ...
    );
set(sim.Sources{1}.AudioBuffer, ...
    'File', 'sound_databases/grid_subset/s1/bbaf2n.wav' ...
    );

Here, we configure the simulator to use the @ssr_brs renderer, which is needed for BRIRs, and define the speech signal to use, but we don’t provide a BRIR yet as this will be done on the fly later on.
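
Providing the BRIR later on simply means assigning an IRDataset to the source before the simulator is initialised. The following snippet, taken from the loop in localise.m shown further below, illustrates this for the first source:

sim.Sources{1}.IRDataset = simulator.DirectionalIR( ...
    'impulse_responses/qu_kemar_rooms/auditorium3/QU_KEMAR_Auditorium3_src1_xs+0.00_ys+3.97.sofa');
sim.rotateHead(90, 'absolute'); % look along the y-axis, towards src1
sim.Init = true;                % initialise the simulator with this BRIR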

We have two different configuration files for setting up the Blackboard system. One important step for DnnLocationKS is to define a resampling of the ear signals, as it is trained for 16000 Hz at the moment. As an example, we list the file BlackboardDnn.xml:

<?xml version="1.0" encoding="utf-8"?>
<blackboardsystem>

    <dataConnection Type="AuditoryFrontEndKS">
       <Param Type="double">16000</Param>
    </dataConnection>

    <KS Name="loc" Type="DnnLocationKS">
        <Param Type="char">MCT-DIFFUSE</Param>
    </KS>
    <KS Name="dec" Type="LocalisationDecisionKS">
        <Param Type="int">1</Param>
    </KS>
    <KS Name="rot" Type="HeadRotationKS">
        <Param Type="ref">robotConnect</Param>
    </KS>

    <Connection Mode="replaceOld" Event="AgendaEmpty">
        <source>scheduler</source>
        <sink>dataConnect</sink>
    </Connection>
    <Connection Mode="replaceOld">
        <source>dataConnect</source>
        <sink>loc</sink>
    </Connection>
    <Connection Mode="add">
        <source>loc</source>
        <sink>dec</sink>
    </Connection>
    <Connection Mode="add" Event="RotateHead">
        <source>dec</source>
        <sink>rot</sink>
    </Connection>

</blackboardsystem>

Here, we use different knowledge sources that work together in order to solve the localisation task: AuditoryFrontEndKS for extracting auditory cues from the ear signals sampled at 16 kHz, and DnnLocationKS, LocalisationDecisionKS, and HeadRotationKS for the actual localisation task. The Param tags specify parameters that are passed to the knowledge sources. After defining which knowledge sources we will use, we connect them with the Connection tags. For more information on configuring the blackboard see Configuration.
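
The blackboard itself is built from such an XML file and executed inside the estimateAzimuth helper. As a rough sketch, assuming the BlackboardSystem class with its setRobotConnect, buildFromXml, and run methods (the way the perceived azimuth is read back afterwards is only indicated), this step could look like:

bbs = BlackboardSystem(0);           % create a blackboard system
bbs.setRobotConnect(sim);            % connect it to the Binaural simulator
bbs.buildFromXml(blackboardConfig);  % e.g. 'BlackboardDnn.xml'; instantiates KSs and connections
bbs.run();                           % process the ear signals
% The perceived azimuth can then be read from the blackboard data, e.g. via
% something like bbs.blackboard.getData('perceivedAzimuths') (name assumed here).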

In the other blackboard configuration file, BlackboardDnnNoHeadRotation.xml, we set up a blackboard that uses DnnLocationKS without confusion solving by head rotation.
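
That file is not listed here; assuming it mirrors BlackboardDnn.xml, it would simply drop HeadRotationKS and the RotateHead connection, roughly like this (a sketch, details may differ from the shipped file):

<?xml version="1.0" encoding="utf-8"?>
<blackboardsystem>

    <dataConnection Type="AuditoryFrontEndKS">
       <Param Type="double">16000</Param>
    </dataConnection>

    <!-- same localisation KSs as before, but no HeadRotationKS -->
    <KS Name="loc" Type="DnnLocationKS">
        <Param Type="char">MCT-DIFFUSE</Param>
    </KS>
    <KS Name="dec" Type="LocalisationDecisionKS"/>

    <Connection Mode="replaceOld" Event="AgendaEmpty">
        <source>scheduler</source>
        <sink>dataConnect</sink>
    </Connection>
    <Connection Mode="replaceOld">
        <source>dataConnect</source>
        <sink>loc</sink>
    </Connection>
    <Connection Mode="add">
        <source>loc</source>
        <sink>dec</sink>
    </Connection>

</blackboardsystem>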

Now, everything is prepared and we can start Matlab in order to perform the localisation. You can just run the following command to see it in action; afterwards we will have a look at what happened:

>> localise

-------------------------------------------------------------------------
Source direction   DnnLocationKS w head rot.   DnnLocationKS wo head rot.
-------------------------------------------------------------------------
        0                 0                       -180
      -52               -55                        -55
     -131               -45                        -45
        0                 0                       -180
       30                30                         25
      -30               -30                        -30
-------------------------------------------------------------------------

As you can see, the model with head rotation returned better results than the model without head rotation; in particular, the front-back confusions for the sources at 0° are resolved.

Now, we have a look into the details of the localise() function. We will only talk about the parts that are responsible for the localisation task, not for printing the results to the screen. First, we start the Binaural simulator (using the setupBinauralSimulator function from above) and define the sources we are going to synthesise:

brirs = { ...
    'impulse_responses/qu_kemar_rooms/auditorium3/QU_KEMAR_Auditorium3_src1_xs+0.00_ys+3.97.sofa'; ...
    'impulse_responses/qu_kemar_rooms/auditorium3/QU_KEMAR_Auditorium3_src2_xs+4.30_ys+3.42.sofa'; ...
    'impulse_responses/qu_kemar_rooms/auditorium3/QU_KEMAR_Auditorium3_src3_xs+2.20_ys-1.94.sofa'; ...
    'impulse_responses/qu_kemar_rooms/auditorium3/QU_KEMAR_Auditorium3_src4_xs+0.00_ys+1.50.sofa'; ...
    'impulse_responses/qu_kemar_rooms/auditorium3/QU_KEMAR_Auditorium3_src5_xs-0.75_ys+1.30.sofa'; ...
    'impulse_responses/qu_kemar_rooms/auditorium3/QU_KEMAR_Auditorium3_src6_xs+0.75_ys+1.30.sofa'; ...
    };
headOrientation = 90; % towards y-axis (facing src1)
sourceAngles = [90, 38.5, -41.4, 90, 120, 60] - headOrientation; % phi = atan2d(ys,xs)

After that, we loop over the different sources, load the corresponding BRIR into the Binaural simulator, and run the Blackboard system via the estimateAzimuth function:

for ii = 1:length(sourceAngles)
    direction = sourceAngles(ii);
    sim.Sources{1}.IRDataset = simulator.DirectionalIR(brirs{ii});
    sim.rotateHead(headOrientation, 'absolute');
    sim.Init = true;
    % DnnLocationKS w head rot.
    phi1 = estimateAzimuth(sim, 'BlackboardDnn.xml');
    resetBinauralSimulator(sim, headOrientation);
    % DnnLocationKS wo head rot.
    phi2 = estimateAzimuth(sim, 'BlackboardDnnNoHeadRotation.xml');
    sim.ShutDown = true;
end

As we run two different blackboards after each other, we have to reinitialise the Binaural simulator in between.
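
The reinitialisation is handled by the resetBinauralSimulator function. A minimal sketch, assuming the simulator's ReInit flag (the actual file ships with the example and may differ in detail):

function resetBinauralSimulator(sim, headOrientation)
% Restore the initial head orientation and re-initialise the simulator so
% the next blackboard run starts from a clean state.
sim.rotateHead(headOrientation, 'absolute');
sim.ReInit = true; % assumed re-initialisation flag of the Binaural simulator
end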