Segmentation with and without priming¶
The Blackboard system of the Two!Ears Auditory Model is equipped with a SegmentationKS knowledge source which is capable of generating soft-masks for auditory features in the time-frequency domain. The segmentation framework relies on a probabilistic clustering approach to assign individual time-frequency units to sound sources that are present in the scene. This assignment can either be computed unsupervised or exploit additional prior knowledge about potential source positions provided by the user or e.g. by the DnnLocationKS knowledge source.
Note
To run the examples, an instance of the SegmentationKS knowledge source has to be trained first. Please refer to (Re)train the segmentation stage for details.
This example will demonstrate how the SegmentationKS knowledge source is
properly initialised with and without prior knowledge and how the hypotheses
which are generated by the segmentation framwork can be used within the
Blackboard system. The example can be found in the examples/segmentation
folder
which consists of the following files:
demo_segmentation_clean.m
demo_segmentation_noisy.m
demo_segmentation_priming.m
demo_train_segmentation.m
segmentation_blackboard_clean.xml
segmentation_blackboard_noise.xml
segmentation_config.xml
test_scene_clean.xml
test_scene_noise.xml
training_scene.xml
Note
The SegmentationKS
knowledge source is based on Matlab functions that
were introduced in release R2013b. Therefore, it is currently not possible
to use SegmentationKS
with earlier versions of Matlab.
The example contains three different demo scenes, namely
demo_segmentation_clean.m
, demo_segmentation_noisy.m
and
demo_segmentation_priming.m
. The first demo shows the segmentation framework
for three speakers in anechoic and undisturbed acoustic conditions without
providing prior knowledge about the speaker positions. The scene parameters are
specified in the corresponding test_scene_clean.xml
file:
<?xml version="1.0" encoding="utf-8"?>
<scene
Renderer="ssr_binaural"
BlockSize="4096"
SampleRate="44100"
LengthOfSimulation = "3"
HRIRs="impulse_responses/qu_kemar_anechoic/QU_KEMAR_anechoic_3m.sofa">
<source Name="Speaker1"
Type="point"
Position="0.8660 0.5 1.75">
<buffer ChannelMapping="1"
Type="fifo"
File="sound_databases/IEEE_AASP/speech/speech08.wav"/>
</source>
<source Name="Speaker2"
Type="point"
Position="1 0 1.75">
<buffer ChannelMapping="1"
Type="fifo"
File="sound_databases/IEEE_AASP/speech/speech14.wav"/>
</source>
<source Name="Speaker3"
Type="point"
Position="0.8660 -0.5 1.75">
<buffer ChannelMapping="1"
Type="fifo"
File="sound_databases/IEEE_AASP/speech/speech07.wav"/>
</source>
<sink Name="Head"
Position="0 0 1.75"
UnitX="1 0 0"
UnitZ="0 0 1"/>
</scene>
The speaker positions described here correspond to angular positions at -30°, 0°
and 30°, respectively. These positions will be fixed for all conditions in this
demo. For more documentation on specifying an acoustic scene, see
Configuration using XML Scene Description. Additionally, the file
segmentation_blackboard_clean.xml
contains the necessary information to
build a Blackboard system with the corresponding SegmentationKS (see
Setting up the blackboard for details):
<?xml version="1.0" encoding="utf-8"?>
<blackboardsystem>
<dataConnection Type="AuditoryFrontEndKS"/>
<KS Name="seg" Type="SegmentationKS">
<Param Type="char">DemoKS</Param>
<Param Type="double">3</Param>
<Param Type="int">3</Param>
<Param Type="int">0</Param>
</KS>
<Connection Mode="replaceOld" Event="AgendaEmpty">
<source>scheduler</source>
<sink>dataConnect</sink>
</Connection>
<Connection Mode="replaceOld">
<source>dataConnect</source>
<sink>seg</sink>
</Connection>
</blackboardsystem>
The SegmentationKS knowledge source takes four parameters as input
arguments. The first parameter is the name of the knowledge source instance
which contains previously trained localisation models. For further information
about training this specific knowledge source, please refer to
(Re)train the segmentation stage. The second parameter defines the
block size in seconds on which the segmentation should be performed. In this
demo, a block size of 3 seconds is assumed for all cases. The third parameter
specifies the number of sources which are assumed to be present in a scene and
the fourth parameter is a flag which can be either set to 0 or 1, indicating if
an additional background estimation should be performed. If this is the case,
the model assumes that individual time-frequency units can either be associated
with a sound source or with background noise, which is helpful in noisy acoustic
environments but can also degrade performance if no or little background noise
is present. As no background noise is assumed in the first demo, this parameter
is set to zero accordingly. Running the script demo_segmentation_clean.m
will produce a result similar to Fig. Fig. 52.
The second demo file demo_segmentation_noisy.m
provides essentially the same
acoustic configuration as the first demo with additional diffuse background
noise. This is specified in the corresponding test_scene_noise.xml
configuration file:
<?xml version="1.0" encoding="utf-8"?>
<scene
Renderer="ssr_binaural"
BlockSize="4096"
SampleRate="44100"
LengthOfSimulation = "3"
HRIRs="impulse_responses/qu_kemar_anechoic/QU_KEMAR_anechoic_3m.sofa">
<source Name="Speaker1"
Type="point"
Position="0.8660 0.5 1.75">
<buffer ChannelMapping="1"
Type="fifo"
File="sound_databases/IEEE_AASP/speech/speech08.wav"/>
</source>
<source Name="Speaker2"
Type="point"
Position="1 0 1.75">
<buffer ChannelMapping="1"
Type="fifo"
File="sound_databases/IEEE_AASP/speech/speech14.wav"/>
</source>
<source Name="Speaker3"
Type="point"
Position="0.8660 -0.5 1.75">
<buffer ChannelMapping="1"
Type="fifo"
File="sound_databases/IEEE_AASP/speech/speech07.wav"/>
</source>
<source Type="pwd"
Name="Noise"
Azimuths="0 30 60 90 120 150 180 210 240 270 300 330">
<buffer ChannelMapping="1 2 3 4 5 6 7 8 9 10 11 12"
Type="noise"
Variance="0.02"
Mean="0.0"/>
</source>
<sink Name="Head"
Position="0 0 1.75"
UnitX="1 0 0"
UnitZ="0 0 1"/>
</scene>
To account for the background noise during the estimation process, the
corresponding flag in the blackboard configuration file
segmentation_blackboard_noise.xml
is set to one:
<?xml version="1.0" encoding="utf-8"?>
<blackboardsystem>
<dataConnection Type="AuditoryFrontEndKS"/>
<KS Name="seg" Type="SegmentationKS">
<Param Type="char">DemoKS</Param>
<Param Type="double">3</Param>
<Param Type="int">3</Param>
<Param Type="int">1</Param>
</KS>
<Connection Mode="replaceOld" Event="AgendaEmpty">
<source>scheduler</source>
<sink>dataConnect</sink>
</Connection>
<Connection Mode="replaceOld">
<source>dataConnect</source>
<sink>seg</sink>
</Connection>
</blackboardsystem>
Running the corresponding script demo_segmentation_noisy.m
will generate
an additional soft-mask for the background noise which is shown in Fig.
Fig. 53.
Note
The background noise estimation procedure is based on the assumption that the noise present in the scene is diffuse and hence its directions of arrival follow a uniform distribution around the unit circle. If this condition is not valid and directional noise sources are present, considering them as additional sources instead of using the background estimation procedure might yield better results.
Finally, the third demo shows the possibilities of priming the
SegmentationKS knowledge source, which means providing prior knowledge about
the source positions before the segmentation is actually performed. For this
purpose, the implementation of SegmentationKS provides an additional
function setFixedPositions()
which can be used to manually specify the
positions of the sound sources. In the script
demo_segmentation_priming.m
, this is done in the following way:
1 2 3 | % Provide prior knowledge of the two speaker locations
prior = [-deg2rad(30); deg2rad(30); 0];
bbs.blackboard.KSs{2}.setFixedPositions(prior);
|
It is also possible to exploit this functionality dynamically during runtime by using intermediate results of the DnnLocationKS knowledge source from the blackboard. Note that angular positions are handled in radians within the SegmentationKS framework, hence position estimates in degrees must be converted accordingly. A possible result for this demo is shown in Fig. Fig. 54.