Simple audio classification with torch

This article translates Daniel Falbel’s ‘Simple Audio Classification’ post from tensorflow/keras to torch/torchaudio. The main goal is to introduce torchaudio and illustrate its contributions to the torch ecosystem. Here we focus on a popular dataset, the audio loader, and the spectrogram transformer. An interesting by-product is the parallel between torch and tensorflow, which sometimes shows their differences and sometimes their similarities.

Download and Import

torchaudio has speechcommand_dataset built in. By default it filters out background_noise, and it lets you choose between versions v0.01 and v0.02.

library(torch)
library(torchaudio)

# set an existing folder here to cache the dataset
DATASETS_PATH <- "~/datasets/"

# 1.4GB download
df <- speechcommand_dataset(
  root = DATASETS_PATH, 
  url = "speech_commands_v0.01",
  download = TRUE
)

# folder excluded from the dataset: _background_noise_
df$EXCEPT_FOLDER
# [1] "_background_noise_"

# number of audio files
length(df)
# [1] 64721

# a sample
sample <- df[1]

sample$waveform[, 1:10]

torch_tensor
0.0001 *
 0.9155  0.3052  1.8311  1.8311 -0.3052  0.3052  2.4414  0.9155 -0.9155 -0.6104
[ CPUFloatType{1,10} ]

sample$sample_rate
# 16000
sample$label
# bed

plot(sample$waveform[1], type = "l", col = "royalblue", main = sample$label)

Figure 1: Sample waveform for ‘bed’.

Classes

df$classes

 [1] "bed"    "bird"   "cat"    "dog"    "down"   "eight"  "five"  
 [8] "four"   "go"     "happy"  "house"  "left"   "marvin" "nine"  
[15] "no"     "off"    "on"     "one"    "right"  "seven"  "sheila"
[22] "six"    "stop"   "three"  "tree"   "two"    "up"     "wow"   
[29] "yes"    "zero"

Generator Dataloader

torch::dataloader has the same task as the data_generator defined in the original article. It is responsible for preparing batches (including shuffling, padding, one-hot encoding, etc.) and for orchestrating parallelism and device I/O.

In torch, we do this by passing the train/test subset to torch::dataloader and encapsulating all the batch setup logic inside a collate_fn() function.
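The train_subset and test_subset objects used in the code that follows are not defined in this section. A minimal sketch of one way to build the index split in base R is shown here; the 70/30 ratio and the seed are assumptions for illustration, not necessarily the values behind the results reported later.

```r
# Assumption: a 70/30 random split; the ratio and the seed are illustrative.
n_files <- 64721  # length(df) reported above

set.seed(6)
id_train <- sample(n_files, size = floor(0.7 * n_files))
id_test  <- setdiff(seq_len(n_files), id_train)

# Every file index lands in exactly one of the two subsets.
length(id_train) + length(id_test) == n_files
```

With torch loaded, these indices can be turned into subsets with torch::dataset_subset(df, id_train) and torch::dataset_subset(df, id_test).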

At this point, dataloader(train_subset) would not work because the samples are not padded. So we have to build our own collate_fn() with a padding strategy.

We recommend the following approach when implementing collate_fn():

1. Begin with collate_fn <- function(batch) browser().
2. Instantiate the dataloader with this collate_fn().
3. Create an environment by calling enumerate(dataloader), so you can ask the dataloader to retrieve a batch.
4. Run environment[[1]][[1]]. You should now be taken inside collate_fn(), with access to the batch input object.
5. Build the logic.

collate_fn <- function(batch) {
  browser()
}

ds_train <- dataloader(
  train_subset, 
  batch_size = 32, 
  shuffle = TRUE, 
  collate_fn = collate_fn
)

ds_train_env <- enumerate(ds_train)
ds_train_env[[1]][[1]]

The final collate_fn() pads the waveforms to length 16001 and then stacks everything together. At this point there are no spectrograms yet. We are going to make the spectrogram transformation part of the model architecture.

pad_sequence <- function(batch) {
    # Make all tensors in a batch the same length by padding with zeros
    batch <- sapply(batch, function(x) (x$t()))
    batch <- torch::nn_utils_rnn_pad_sequence(batch, batch_first = TRUE, padding_value = 0.)
    return(batch$permute(c(1, 3, 2)))
  }

# Final collate_fn
collate_fn <- function(batch) {
 # Input structure:
 # list of 32 lists: list(waveform, sample_rate, label, speaker_id, utterance_number)
 # Transpose it
 batch <- purrr::transpose(batch)
 tensors <- batch$waveform
 targets <- batch$label_index

 # Group the list of tensors into a batched tensor
 tensors <- pad_sequence(tensors)
 
 # target encoding
 targets <- torch::torch_stack(targets)

 list(tensors = tensors, targets = targets) # (64, 1, 16001)
}

The batch structure is as follows:

batch[[1]]: waveforms, a tensor with dimensions (32, 1, 16001)
batch[[2]]: targets, a tensor with dimensions (32, 1)
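To make the padding step concrete without loading torch, here is a small base R sketch of the same idea: pad every sequence with trailing zeros up to the longest one in the batch, then stack into a (batch, channel, time) array. Plain numeric vectors stand in for waveform tensors, and the values are made up.

```r
# Plain numeric vectors stand in for waveform tensors (an assumption made
# so that the sketch runs without torch); the values are illustrative.
waveforms <- list(c(0.1, 0.2, 0.3), c(0.5, 0.6), c(0.9))

# Pad each vector with trailing zeros up to the longest length in the batch.
pad_to <- max(lengths(waveforms))
padded <- lapply(waveforms, function(w) c(w, rep(0, pad_to - length(w))))

# Fill a (time, channel, batch) array, then reorder to (batch, channel, time),
# mirroring the (32, 1, 16001) layout described above.
batch_array <- array(unlist(padded), dim = c(pad_to, 1, length(padded)))
batch_array <- aperm(batch_array, c(3, 2, 1))

dim(batch_array)
```

The second sequence keeps its two values and receives one trailing zero, which is the role padding_value = 0 plays in pad_sequence().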

In addition, torchaudio comes with three loaders: av_loader, tuner_loader, and audiofile_loader, with more to come. set_audio_backend() is used to set one of them as the audio loader. Their performance differs depending on the audio format (mp3 or wav). There is no perfect choice yet: tuner_loader is best for mp3 and audiofile_loader is best for wav, but neither can partially load a sample from an audio file without first bringing all the data into memory.

For a given audio backend, we need to pass it to each worker through the worker_init_fn() argument.

ds_train <- dataloader(
  train_subset, 
  batch_size = 128, 
  shuffle = TRUE, 
  collate_fn = collate_fn,
  num_workers = 16,
  worker_init_fn = function(.) {torchaudio::set_audio_backend("audiofile_loader")},
  worker_globals = c("pad_sequence") # pad_sequence is needed for collate_fn
)

ds_test <- dataloader(
  test_subset, 
  batch_size = 64, 
  shuffle = FALSE, 
  collate_fn = collate_fn,
  num_workers = 8,
  worker_globals = c("pad_sequence") # pad_sequence is needed for collate_fn
)

Model definition

Instead of keras::keras_model_sequential(), we are going to define a torch::nn_module(). As referenced in the original article, the model is based on the architecture for MNIST from this tutorial, and we will call it ‘DanielNN’.

dan_nn <- torch::nn_module(
  "DanielNN",
  
  initialize = function(
    window_size_ms = 30, 
    window_stride_ms = 10
  ) {
    
    # spectrogram spec
    window_size <- as.integer(16000*window_size_ms/1000)
    stride <- as.integer(16000*window_stride_ms/1000)
    fft_size <- as.integer(2^trunc(log(window_size, 2) + 1))
    n_chunks <- length(seq(0, 16000, stride))
    
    self$spectrogram <- torchaudio::transform_spectrogram(
      n_fft = fft_size, 
      win_length = window_size, 
      hop_length = stride, 
      normalized = TRUE, 
      power = 2
    )
    
    # convs 2D
    self$conv1 <- torch::nn_conv2d(in_channels = 1, out_channels = 32, kernel_size = c(3,3))
    self$conv2 <- torch::nn_conv2d(in_channels = 32, out_channels = 64, kernel_size = c(3,3))
    self$conv3 <- torch::nn_conv2d(in_channels = 64, out_channels = 128, kernel_size = c(3,3))
    self$conv4 <- torch::nn_conv2d(in_channels = 128, out_channels = 256, kernel_size = c(3,3))
    
    # denses
    self$dense1 <- torch::nn_linear(in_features = 14336, out_features = 128)
    self$dense2 <- torch::nn_linear(in_features = 128, out_features = 30)
  },
  
  forward = function(x) {
    x %>% # (64, 1, 16001)
      self$spectrogram() %>% # (64, 1, 257, 101)
      torch::torch_add(0.01) %>%
      torch::torch_log() %>%
      self$conv1() %>%
      torch::nnf_relu() %>%
      torch::nnf_max_pool2d(kernel_size = c(2,2)) %>%
      
      self$conv2() %>%
      torch::nnf_relu() %>%
      torch::nnf_max_pool2d(kernel_size = c(2,2)) %>%
      
      self$conv3() %>%
      torch::nnf_relu() %>%
      torch::nnf_max_pool2d(kernel_size = c(2,2)) %>%
      
      self$conv4() %>%
      torch::nnf_relu() %>%
      torch::nnf_max_pool2d(kernel_size = c(2,2)) %>%
      
      torch::nnf_dropout(p = 0.25) %>%
      torch::torch_flatten(start_dim = 2) %>%
      
      self$dense1() %>%
      torch::nnf_relu() %>%
      torch::nnf_dropout(p = 0.5) %>%
      self$dense2() 
  }
)
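The value in_features = 14336 for dense1 and the (257, 101) spectrogram shape noted in forward() can be checked with plain arithmetic. The sketch below assumes the spectrogram centers its frames (the input is padded by n_fft / 2 on both sides, the usual STFT default), that the convolutions use no padding, and that pooling rounds down.

```r
# Spectrogram settings copied from the model definition above.
sample_rate <- 16000
window_size <- as.integer(sample_rate * 30 / 1000)           # 480 samples
stride      <- as.integer(sample_rate * 10 / 1000)           # 160 samples
fft_size    <- as.integer(2^trunc(log(window_size, 2) + 1))  # 512

# One-sided spectrum: n_fft / 2 + 1 frequency bins.
n_freq <- fft_size %/% 2 + 1
# Assumption: centered frames, giving 1 + floor(length / hop) frames
# for a 16001-sample waveform.
n_frames <- 1 + 16001 %/% stride

# Each block is a 3x3 convolution without padding followed by 2x2 max pooling.
shape <- c(n_freq, n_frames)
for (block in 1:4) {
  shape <- (shape - 3 + 1) %/% 2
}

# conv4 has 256 output channels, which are flattened with the spatial dims.
flat_features <- 256 * prod(shape)
c(n_freq = n_freq, n_frames = n_frames, flat_features = flat_features)
```

Changing window_size_ms or window_stride_ms changes the spectrogram shape, so in_features of dense1 would need to be recomputed in the same way.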

model <- dan_nn()


device <- torch::torch_device(if(torch::cuda_is_available()) "cuda" else "cpu")
model$to(device = device)

print(model)

An `nn_module` containing 2,226,846 parameters.

── Modules ──────────────────────────────────────────────────────
● spectrogram: <Spectrogram> #0 parameters
● conv1: <nn_conv2d> #320 parameters
● conv2: <nn_conv2d> #18,496 parameters
● conv3: <nn_conv2d> #73,856 parameters
● conv4: <nn_conv2d> #295,168 parameters
● dense1: <nn_linear> #1,835,136 parameters
● dense2: <nn_linear> #3,870 parameters
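The parameter counts printed above can be reproduced by hand, which is a handy sanity check when changing the architecture. This sketch uses only the layer sizes from the model definition; the helper names conv_params and dense_params are made up for illustration.

```r
# Conv layer: in * out * kernel_h * kernel_w weights plus one bias per output channel.
conv_params <- function(in_ch, out_ch, k = 3) in_ch * out_ch * k * k + out_ch
# Linear layer: in * out weights plus one bias per output feature.
dense_params <- function(in_f, out_f) in_f * out_f + out_f

counts <- c(
  conv1  = conv_params(1, 32),
  conv2  = conv_params(32, 64),
  conv3  = conv_params(64, 128),
  conv4  = conv_params(128, 256),
  dense1 = dense_params(14336, 128),
  dense2 = dense_params(128, 30)
)

counts
sum(counts)
```

Most of the parameters sit in dense1, so the spectrogram resolution (which determines its input size) dominates the model size.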

Model fitting

Unlike in tensorflow, there is no model %>% compile(...) step in torch, so we are going to set the loss criterion, optimizer strategy, and evaluation metrics explicitly in the training loop.

loss_criterion <- torch::nn_cross_entropy_loss()
optimizer <- torch::optim_adadelta(model$parameters, rho = 0.95, eps = 1e-7)
metrics <- list(acc = yardstick::accuracy_vec)
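nn_cross_entropy_loss() works on raw scores (logits) and applies log-softmax internally, which is why the model ends with dense2 and no softmax layer. As a rough illustration of what the criterion computes for a single example, here is a base R sketch with made-up logits for three classes.

```r
# Hypothetical logits for one example over three classes (illustrative values).
logits <- c(2.0, 1.0, 0.1)
target <- 1  # 1-based class index, as used by torch for R

# Log-softmax, with the maximum subtracted first for numerical stability.
log_softmax <- function(z) {
  shifted <- z - max(z)
  shifted - log(sum(exp(shifted)))
}

# Cross-entropy for one example: negative log-probability of the target class.
loss <- -log_softmax(logits)[target]
loss
```

Over a batch, the criterion averages these per-example values by default.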

Training loop

library(glue)
library(progress)

pred_to_r <- function(x) {
  classes <- factor(df$classes)
  classes[as.numeric(x$to(device = "cpu"))]
}

set_progress_bar <- function(total) {
  progress_bar$new(
    total = total, clear = FALSE, width = 70,
    format = ":current/:total [:bar] - :elapsed - loss: :loss - acc: :acc"
  )
}

epochs <- 20
losses <- c()
accs <- c()

for(epoch in seq_len(epochs)) {
  pb <- set_progress_bar(length(ds_train))
  pb$message(glue("Epoch {epoch}/{epochs}"))
  coro::loop(for(batch in ds_train) {
    optimizer$zero_grad()
    predictions <- model(batch[[1]]$to(device = device))
    targets <- batch[[2]]$to(device = device)
    loss <- loss_criterion(predictions, targets)
    loss$backward()
    optimizer$step()
    
    # eval reports
    prediction_r <- pred_to_r(predictions$argmax(dim = 2))
    targets_r <- pred_to_r(targets)
    acc <- metrics$acc(targets_r, prediction_r)
    accs <- c(accs, acc)
    loss_r <- as.numeric(loss$item())
    losses <- c(losses, loss_r)
    
    pb$tick(tokens = list(loss = round(mean(losses), 4), acc = round(mean(accs), 4)))
  })
}



# test
predictions_r <- c()
targets_r <- c()
coro::loop(for(batch_test in ds_test) {
  predictions <- model(batch_test[[1]]$to(device = device))
  targets <- batch_test[[2]]$to(device = device)
  predictions_r <- c(predictions_r, pred_to_r(predictions$argmax(dim = 2)))
  targets_r <- c(targets_r, pred_to_r(targets))
})
val_acc <- metrics$acc(factor(targets_r, levels = 1:30), factor(predictions_r, levels = 1:30))
cat(glue("val_acc: {val_acc}\n\n"))

Epoch 1/20
[W SpectralOps.cpp:590] Warning: The function torch.rfft is deprecated and will be removed in a future PyTorch release. Use the new torch.fft module functions, instead, by importing torch.fft and calling torch.fft.fft or torch.fft.rfft. (function operator())
354/354 [=========================] -  1m - loss: 2.6102 - acc: 0.2333
Epoch 2/20
354/354 [=========================] -  1m - loss: 1.9779 - acc: 0.4138
Epoch 3/20
354/354 [============================] -  1m - loss: 1.62 - acc: 0.519
Epoch 4/20
354/354 [=========================] -  1m - loss: 1.3926 - acc: 0.5859
Epoch 5/20
354/354 [==========================] -  1m - loss: 1.2334 - acc: 0.633
Epoch 6/20
354/354 [=========================] -  1m - loss: 1.1135 - acc: 0.6685
Epoch 7/20
354/354 [=========================] -  1m - loss: 1.0199 - acc: 0.6961
Epoch 8/20
354/354 [=========================] -  1m - loss: 0.9444 - acc: 0.7181
Epoch 9/20
354/354 [=========================] -  1m - loss: 0.8816 - acc: 0.7365
Epoch 10/20
354/354 [=========================] -  1m - loss: 0.8278 - acc: 0.7524
Epoch 11/20
354/354 [=========================] -  1m - loss: 0.7818 - acc: 0.7659
Epoch 12/20
354/354 [=========================] -  1m - loss: 0.7413 - acc: 0.7778
Epoch 13/20
354/354 [=========================] -  1m - loss: 0.7064 - acc: 0.7881
Epoch 14/20
354/354 [=========================] -  1m - loss: 0.6751 - acc: 0.7974
Epoch 15/20
354/354 [=========================] -  1m - loss: 0.6469 - acc: 0.8058
Epoch 16/20
354/354 [=========================] -  1m - loss: 0.6216 - acc: 0.8133
Epoch 17/20
354/354 [=========================] -  1m - loss: 0.5985 - acc: 0.8202
Epoch 18/20
354/354 [=========================] -  1m - loss: 0.5774 - acc: 0.8263
Epoch 19/20
354/354 [==========================] -  1m - loss: 0.5582 - acc: 0.832
Epoch 20/20
354/354 [=========================] -  1m - loss: 0.5403 - acc: 0.8374
val_acc: 0.876705979296493
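When reading the log above, note that losses and accs are created once before the epoch loop and never reset, so the loss and acc shown in the progress bar are running means over all batches seen since the start of training, not per-epoch values. A small sketch with made-up per-batch accuracies shows how such a running mean lags behind the most recent batches.

```r
# Hypothetical per-batch accuracies that improve over time (illustrative only).
batch_acc <- c(0.2, 0.4, 0.6, 0.8, 0.9, 0.9)

# What the progress bar reports: the mean over every batch seen so far.
running_mean <- cumsum(batch_acc) / seq_along(batch_acc)
running_mean

# The running mean trails the accuracy of the latest batch.
tail(running_mean, 1) < tail(batch_acc, 1)
```

This is one plausible reason why the final training acc in the log sits below val_acc; resetting losses and accs at the start of each epoch would give per-epoch figures instead.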

Making predictions

We have already calculated all the predictions for test_subset, so let’s recreate the alluvial plot from the original article.

library(dplyr)
library(alluvial)
df_validation <- data.frame(
  pred_class = df$classes[predictions_r],
  class = df$classes[targets_r]
)
x <-  df_validation %>%
  mutate(correct = pred_class == class) %>%
  count(pred_class, class, correct)

alluvial(
  x %>% select(class, pred_class),
  freq = x$n,
  col = ifelse(x$correct, "lightblue", "red"),
  border = ifelse(x$correct, "lightblue", "red"),
  alpha = 0.6,
  hide = x$n < 20
)

Figure 2: Model performance: actual labels <–> Predicted label.

The model accuracy is 87.7%, which is somewhat worse than the tensorflow version in the original post. Nonetheless, all the conclusions from the original post still hold true.
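Beyond a single accuracy figure, the same actual-versus-predicted pairs can be summarised as a confusion table, from which overall accuracy and per-class recall follow directly. The sketch below uses toy labels for three classes; they are for illustration only and are not results from this article.

```r
# Toy labels for illustration only; they are not results from the article.
classes <- c("bed", "bird", "cat")
targets_toy     <- factor(c("bed", "bed", "bird", "cat", "cat", "cat"), levels = classes)
predictions_toy <- factor(c("bed", "bird", "bird", "cat", "cat", "bed"), levels = classes)

# Rows are actual classes, columns are predicted classes.
confusion <- table(actual = targets_toy, predicted = predictions_toy)

# Correct predictions sit on the diagonal.
overall_accuracy <- sum(diag(confusion)) / sum(confusion)
per_class_recall <- diag(confusion) / rowSums(confusion)

confusion
overall_accuracy
per_class_recall
```

Applied to the real data, the two factors would be built from df_validation$class and df_validation$pred_class with levels set to df$classes.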

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Illustrations reused from other sources do not fall under this license and can be recognized by the note “Illustration of…” in the caption.

Citation

To give attribution, please cite this work as follows:

Damiani (2021, Feb. 4). Posit AI Blog: Simple audio classification with torch. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2021-02-04-simple-audio-classification-with-torch/

BibTeX citation

@misc{athossimpleaudioclassification,
  author = {Damiani, Athos},
  title = {Posit AI Blog: Simple audio classification with torch},
  url = {https://blogs.rstudio.com/tensorflow/posts/2021-02-04-simple-audio-classification-with-torch/},
  year = {2021}
}
