Train sound type identification models

Part of the Two!Ears Auditory Model is the identity knowledge source, IdentityKS, which can be instantiated (multiple times) to identify the type of auditory objects, like “speech”, “fire”, “knock” etc. Each IdentityKS needs a source type model; this example shows one possibility for training such a model. Have a look at the Identification of sound types to see how these models are used in the Blackboard system.

The base folder for this example is examples/train_identification_model, with the example script file being trainAndTestCleanModel.m. Later in the model training process, new directories with names like test_1vsAll_training will be created by the training pipeline, containing log files of the training, file lists of the used training and testing data, and of course the trained models. These are the models that can be used with the IdentityKS. To see if everything is working, just run

>> trainAndTestCleanModel;

Example step-through

To dive into the example, start Matlab, navigate into the example directory, and open trainAndTestCleanModel.m, which contains a function (also usable as a script). Let’s have a look before firing it up!

Caching dir

The code starts with a snippet that checks for the existence of a cache directory and creates it if it does not already exist. The pipeline uses this cache directory for saving and re-using intermediate results.
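A check of this kind can be written with MATLAB's built-in exist and mkdir. The following is only a sketch: the variable name cacheSystemDir matches the one passed to the pipeline further down, but the directory location is a placeholder chosen here, not the path used by the example script.

```matlab
% Placeholder location (assumption); the example script defines its own path.
cacheSystemDir = fullfile( tempdir, 'twoears_idtrain_cache' );

% exist(..., 'dir') returns 7 if the folder is already there.
if exist( cacheSystemDir, 'dir' ) ~= 7
    mkdir( cacheSystemDir );
end
```

Running the snippet a second time leaves the existing directory untouched, which is what allows cached results to survive between runs.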

Feature and model creators

The next code paragraph creates the basic pipeline object of type TwoEarsIdTrainPipe, and then sets its basic components: the block creator, the feature creator, the label creator and the model creator.

pipe = TwoEarsIdTrainPipe('cacheSystemDir', cacheSystemDir);
pipe.blockCreator = BlockCreators.MeanStandardBlockCreator( 0.5, 0.5/3 );
pipe.featureCreator = FeatureCreators.FeatureSet5Blockmean();
oneVsRestLabeler = LabelCreators.MultiEventTypeLabeler( ...
                                  'types', {{classname}}, 'negOut', 'rest' );
pipe.labelCreator = oneVsRestLabeler;
pipe.modelCreator = ModelTrainers.GlmNetLambdaSelectTrainer( ...
    'performanceMeasure', @PerformanceMeasures.BAC2, ...
    'cvFolds', 4, ...
    'alpha', 0.99 ); % giving higher numerical stability (instead of 1.0)

In this case, an (almost purely) L1-regularized sparse logistic regression model will be trained through the use of the GlmNetLambdaSelectTrainer, which is a wrapper for GLMNET. A large set of auditory features is used in this model, computed and compiled by the FeatureSet5Blockmean feature creator. The block creator is responsible for cutting these auditory features into blocks, one per time window. The label creator uses the ground truth files to assign each block a label that the model can target. Have a look into the respective sections to learn more!
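For orientation, the objective that GLMNET minimises for logistic regression can be written as below. This is the standard elastic-net formulation from the GLMNET documentation, not something taken from the example script; the symbols (N blocks with feature vectors x_i and labels y_i in {0, 1}, coefficients beta, intercept beta_0) are notation introduced here.

```latex
\min_{\beta_0,\,\beta}\;
  -\frac{1}{N}\sum_{i=1}^{N}
     \Bigl[\, y_i\,(\beta_0 + x_i^{\top}\beta)
            - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr) \Bigr]
  \;+\; \lambda \Bigl[\, \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
                       + \alpha\,\lVert\beta\rVert_1 \Bigr]
```

With alpha = 1 the penalty is pure L1 (lasso). The value alpha = 0.99 used above keeps the penalty almost entirely L1, so the model stays sparse, while the small L2 share is what the code comment refers to as higher numerical stability. The regularisation strength lambda is the quantity selected by cross-validation.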

Training and testing sets

The models will be trained using a particular set of sounds, specified in the trainset flist. For this example, the IEEE AASP single event sounds serve as training material. There are sounds for several classes like “laughter”, “keys”, “speech”, etc. If you don’t call the trainAndTestCleanModel function with a different class name, a model for the “speech” class will be trained (this is specified in the third line). Regardless of the class the model is trained for, all sounds listed in the flist (have a look) will be used for training – but only the ones belonging to the model class will serve as “positive” examples.
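The one-vs-rest idea can be illustrated with a few lines of plain MATLAB. This is a sketch of the principle only, not the implementation of the label creator: the list of file classes is made up, and the target values +1 and -1 are an assumption chosen for illustration.

```matlab
classname = 'speech';   % class the model is trained for

% Hypothetical class memberships of five training files (not taken from the flist).
fileClasses = { 'speech', 'keys', 'laughter', 'speech', 'knock' };

% One-vs-rest targets: +1 for the model class, -1 for everything else.
labels = -ones( size( fileClasses ) );
labels( strcmp( fileClasses, classname ) ) = 1;

disp( labels );
```

All five files contribute to training, but only the two “speech” entries receive the positive label. Changing classname to 'keys' would flip the roles without changing the file list.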

pipe.trainset = 'learned_models/IdentityKS/trainTestSets/IEEE_AASP_80pTrain_TrainSet_1.flist';

Scene configuration

A “clean” scene configuration is used to train this model.

sc = SceneConfig.SceneConfiguration();
sc.addSource( SceneConfig.PointSource( ...
                      'data', SceneConfig.FileListValGen( 'pipeInput' ) ) );

The sound sources are positioned at 0° azimuth relative to the head, there is no interfering noise, and no reverberation (free-field conditions). Have a look into the respective part of the training pipeline documentation to get to know the many possibilities for configuring the acoustic training scene.

Running the pipeline

After everything is set up, the pipeline has to be initialised and can then be run.

pipe.init( sc, 'fs', 16000 );
modelPath = pipe.pipeline.run( 'modelName', classname, ...
                               'modelPath', 'test_1vsAll_training' );

Initialisation can take some time depending on the files for training and testing, and whether they are available through a local copy of the Two!Ears database, through the download cache of the remote Two!Ears database, or whether they have to be downloaded from there first. The time needed for actually running the pipeline can vary substantially, depending on

  • the total accumulated length of sound files used
  • the scene configuration – using reverberation or noise interference makes the binaural simulation take longer
  • the features having to be extracted by the Auditory front-end
  • the type of model (training) – there are big differences here, as the computational effort can be much higher for some models than for others (GLMNET, the one used here, is pretty fast)
  • and whether the files have been processed in this configuration before or not. The pipeline saves intermediate files after each processing stage (binaural simulation, auditory front-end, feature creation) for each sound file and each configuration, and it finds those files later if a file is to be processed in the same (or partly the same) configuration. This way, a lot of time-consuming preprocessing can be saved. You can try it: interrupt the preprocessing at any moment by hitting ctrl+c, and restart the script. You will see that files and stages that were already processed are not processed again.
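The re-use of intermediate results described in the last point follows a common caching pattern, sketched below in plain MATLAB. It is not the pipeline's actual code: an in-memory containers.Map stands in for the files in the cache directory, and the file names and the configuration string are hypothetical.

```matlab
% In-memory stand-in for the cache directory (assumption for illustration).
cache = containers.Map();

files = { 'a.wav', 'b.wav', 'a.wav' };   % hypothetical inputs, one repeated
cfg   = 'clean_fs16000';                 % hypothetical configuration key

nComputed = 0;
for ii = 1:numel( files )
    key = [files{ii} '|' cfg];           % one entry per file and configuration
    if ~isKey( cache, key )
        % Placeholder for the expensive processing stage.
        cache(key) = sprintf( 'features of %s', files{ii} );
        nComputed = nComputed + 1;
    end
end

fprintf( 'Processed %d of %d requests.\n', nComputed, numel( files ) );
```

Because the key combines file and configuration, the repeated request for a.wav is served from the cache, while the same file under a different configuration would be processed again.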

The arguments 'fs', 16000 indicate that the pipeline will operate on 16 kHz signals. The pipeline takes care of resampling sounds that are not already sampled at this rate.
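To get a feeling for what such a rate conversion does, here is a minimal sketch using resample from MATLAB's Signal Processing Toolbox. It does not show how the pipeline resamples internally; the original rate of 44.1 kHz and the noise signal are assumptions made for the example.

```matlab
fsTarget = 16000;            % rate the pipeline is initialised with
fsOrig   = 44100;            % assumed rate of some input sound

x = randn( fsOrig, 1 );      % one second of noise as a stand-in signal

% resample(x, p, q) changes the rate by the factor p/q.
y = resample( x, fsTarget, fsOrig );

fprintf( '%d samples at %d Hz -> %d samples at %d Hz\n', ...
         numel( x ), fsOrig, numel( y ), fsTarget );
```

One second of audio stays one second long; only the number of samples representing it changes.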

After successful training, you should see something like

Running: MultiConfigurationsEarSignalProc

Running: MultiConfigurationsAFEmodule

Running: MultiConfigurationsFeatureProc

Running: GatherFeaturesProc

##   Training model "speech"

==  Training model on trainSet...

Run on full trainSet...
GlmNet training with alpha=0.990000
   size(x) = 5040x846

Run cv to determine best lambda...
Starting run 1 of CV... GlmNet training with alpha=0.990000
   size(x) = 4111x846


Calculate Performance for all lambdas...................................................Done

-- Model is saved at C:\projekte\twoEars\twoears-examples\train_identification_model\test_1vsAll_training --

The path printed at the end of the training run, which is also returned in modelPath, is the location of the model on your drive.

Model testing

A new model has been trained. To test it, we repeat the above code, except that this time we configure the pipeline to use test data instead of training data.

pipe.trainset = [];
pipe.testset = 'learned_models/IdentityKS/trainTestSets/IEEE_AASP_80pTrain_TestSet_1.flist';

The testset specifies the files used for testing the trained model. It is not necessary for model creation; it only serves as an immediate way of getting feedback about the model’s performance after training. Of course, the testset should only contain files that have not been used for training, in order to test for the model’s generalisation performance.
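Whether two sets are really disjoint is easy to check once the file names are available as cell arrays. The sketch below uses made-up lists; how the names are read from the flist files is left out, since it depends on the file format.

```matlab
% Hypothetical file lists (assumption); in practice these come from the flists.
trainFiles = { 'speech/a.wav', 'keys/b.wav', 'laughter/c.wav' };
testFiles  = { 'speech/d.wav', 'keys/e.wav' };

% intersect returns the entries present in both cell arrays.
overlap = intersect( trainFiles, testFiles );

if isempty( overlap )
    disp( 'Train and test sets are disjoint.' );
else
    fprintf( 'Shared file: %s\n', overlap{:} );
end
```

Any file reported as shared would make the test performance look better than the model's true generalisation performance.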

After successful testing, you should see something like

==  Testing model on testSet...

##   "speech" Performance: 0.942548

 -- Model is saved at C:\projekte\twoEars\twoears-examples\train_identification_model\test_1vsAll_testing --

The stated performance is on the test set, and the path afterwards indicates the location of the model on your drive.