Audio Processing Seems Hard? Build a Speech Recognition Model with TensorFlow

Date: 2019-02-17 23:00:32

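In this article we walk through an example that uses TensorFlow to classify short sound clips. It covers enough of the basics for you to build your own speech recognition model, and with further study you can apply the same concepts to larger, more complex audio files.

The complete code for this example is available on GitHub.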

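Getting the data

Collecting data is one of the hard problems in data science. Plenty of data is available, but not all of it is easy to use for a machine learning problem, so you have to make sure the data is clean, labeled, and complete.

For this example we use a set of audio files released by Google, which are available on GitHub.

First, we create a new Conducto pipeline. This is where you build, train, and test the model, and you can share a link to it with anyone who is interested: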

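    import os
    import pathlib

    import conducto as co
    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers, models
    # Import path assumed for the TensorFlow 2.x releases current when this was written
    from tensorflow.keras.layers.experimental import preprocessing

    ###
    # Main Pipeline
    ###
    def main() -> co.Serial:
        path = "/conducto/data/pipeline"
        root = co.Serial(image=get_image())

        # Get data from keras for testing and training
        root["Get Data"] = co.Exec(run_whole_thing, f"{path}/raw")
        return root

Then we start writing the run_whole_thing function: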

    def run_whole_thing(out_dir):
        os.makedirs(out_dir, exist_ok=True)

        # Set seed for experiment reproducibility
        seed = 55
        tf.random.set_seed(seed)
        np.random.seed(seed)

        data_dir = pathlib.Path("data/mini_speech_commands")

Next, set up the directory that holds the audio files:

        if not data_dir.exists():
            # Get the files from external source and put them in an accessible directory
            tf.keras.utils.get_file(
                "mini_speech_commands.zip",
                origin="//data/mini_speech_commands.zip",
                extract=True)

Preprocessing the data

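Now that the data is saved in the right directory, we can split it into training, test, and validation datasets.

First, we need to write a few functions that preprocess the data so that our model can work with it.

The data has to be in a format the algorithm understands. We are going to use a convolutional neural network, so the data needs to be converted into images.

The first function converts a binary audio file into a tensor: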

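    # Convert the binary audio file to a tensor
    def decode_audio(audio_binary):
        # decode_wav returns (audio, sample_rate); only the audio is needed here
        audio, _ = tf.audio.decode_wav(audio_binary)
        # Drop the trailing channel axis (the clips are mono)
        return tf.squeeze(audio, axis=-1)

Now that we have a tensor holding the raw data, we need the label that goes with it. The following function gets an audio file's label from its file path: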

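    # Get the label (yes, no, up, down, etc) for an audio file.
    def get_label(file_path):
        parts = tf.strings.split(file_path, os.path.sep)
        # The label is the name of the directory that contains the file
        return parts[-2]

Next, we need to associate each audio file with its correct label. The following function does that and returns a tuple that TensorFlow can work with: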

    # Create a tuple that has the labeled audio files
    def get_waveform_and_label(file_path):
        label = get_label(file_path)
        audio_binary = tf.io.read_file(file_path)
        waveform = decode_audio(audio_binary)
        return waveform, label

Earlier we briefly mentioned using a convolutional neural network (CNN). This is one way to approach a speech recognition model. CNNs generally work well on image data, which helps reduce preprocessing time.
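To make the decoding and labelling helpers above concrete, here is a minimal pure-Python sketch of the same two ideas: the label is the name of the file's parent directory, and 16-bit PCM samples are scaled into floats. The helper names (`decode_pcm16`, `make_test_wav`), the example path, and the divisor 32768 are assumptions for illustration; this is not how `tf.audio.decode_wav` is implemented.

```python
import io
import os
import struct
import wave


def get_label(file_path):
    """Return the name of the directory that directly contains the file."""
    return os.path.normpath(file_path).split(os.path.sep)[-2]


def decode_pcm16(wav_bytes):
    """Decode mono 16-bit PCM WAV bytes into a list of floats."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        assert wav_file.getsampwidth() == 2, "this sketch expects 16-bit samples"
        frames = wav_file.readframes(wav_file.getnframes())
    samples = struct.unpack(f"<{len(frames) // 2}h", frames)
    # Assumption: scale signed 16-bit integers by 32768 to land in [-1.0, 1.0)
    return [sample / 32768.0 for sample in samples]


def make_test_wav(samples, sample_rate=16000):
    """Build a tiny in-memory WAV file (hypothetical helper for this demo)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buffer.getvalue()


# Hypothetical path that follows the <data_dir>/<label>/<file> layout
example_path = os.path.join("data", "mini_speech_commands", "yes", "clip_0.wav")
wav_bytes = make_test_wav([0, 16384, -32768])
waveform, label = decode_pcm16(wav_bytes), get_label(example_path)
print(label, waveform)
```

The result is the same kind of `(waveform, label)` pair that `get_waveform_and_label` returns, only with Python lists and strings instead of tensors.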

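We take advantage of this by converting the audio files into spectrograms. A spectrogram is an image of a signal's frequency spectrum as it changes over time. A raw audio file is just a sequence of amplitude samples, so we write a function that converts the audio data into an image: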

    # Convert audio files to images
    def get_spectrogram(waveform):
        # Padding for files with less than 16000 samples
        zero_padding = tf.zeros([16000] - tf.shape(waveform), dtype=tf.float32)

        # Concatenate audio with padding so that all audio clips will be of the same length
        waveform = tf.cast(waveform, tf.float32)
        equal_length = tf.concat([waveform, zero_padding], 0)

        # Short-time Fourier transform, then keep only the magnitude
        spectrogram = tf.signal.stft(
            equal_length, frame_length=255, frame_step=128)
        spectrogram = tf.abs(spectrogram)

        return spectrogram

Now that the data is formatted as images, we need to attach the correct labels to those images, much as we did for the raw audio files:

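    # Label the images created from the audio files and return a tuple
    def get_spectrogram_and_label_id(audio, label):
        spectrogram = get_spectrogram(audio)
        # Add a channel axis so the spectrogram looks like a one-channel image
        spectrogram = tf.expand_dims(spectrogram, -1)
        # `commands` is the array of label names built when the data is split
        label_id = tf.argmax(label == commands)
        return spectrogram, label_id

The last helper function applies all of the steps above to any set of audio files passed to it: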

    # Preprocess any audio files
    def preprocess_dataset(files, autotune, commands):
        # Creates the dataset
        files_ds = tf.data.Dataset.from_tensor_slices(files)

        # Matches audio files with correct labels
        output_ds = files_ds.map(get_waveform_and_label,
                                 num_parallel_calls=autotune)

        # Matches audio file images to the correct labels
        output_ds = output_ds.map(
            get_spectrogram_and_label_id, num_parallel_calls=autotune)

        return output_ds

With all of these helper functions in place, we can split the data.
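The padding and framing parameters used in `get_spectrogram` determine the size of the image the CNN receives. The sketch below works through that arithmetic in plain Python. It assumes the transform does not pad the end of the signal and picks the smallest power of two that encloses the frame length as its FFT size; the function names are hypothetical.

```python
def pad_to_length(samples, target_length=16000):
    """Right-pad a list of samples with zeros up to target_length."""
    return list(samples) + [0.0] * (target_length - len(samples))


def stft_output_shape(num_samples, frame_length=255, frame_step=128):
    """Return (num_frames, num_frequency_bins) for an STFT without end padding."""
    # One frame at offset 0, plus one for every further full step that fits
    num_frames = 1 + (num_samples - frame_length) // frame_step
    # Assumption: FFT size is the smallest power of two >= frame_length
    fft_length = 1
    while fft_length < frame_length:
        fft_length *= 2
    # A real-valued FFT keeps fft_length // 2 + 1 unique frequency bins
    return num_frames, fft_length // 2 + 1


padded = pad_to_length([0.1, -0.2, 0.3])
print(len(padded))
print(stft_output_shape(len(padded)))
```

Under those assumptions a one-second clip at 16,000 samples becomes a 124 x 129 spectrogram, and `tf.expand_dims` then adds the single channel axis. Note that this kind of padding only lengthens short clips; it does not truncate clips longer than 16,000 samples.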

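Splitting the data into datasets

Converting the audio files into images makes the data easier to handle with a CNN, which is why we wrote all of those helper functions. A few more steps make splitting the data simpler.

First, we get the list of all possible commands for the audio files, which we use elsewhere in the code: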

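    # Get all of the commands for the audio files
    commands = np.array(tf.io.gfile.listdir(str(data_dir)))
    # The README is not a command directory, so drop it
    commands = commands[commands != "README.md"]

Then we get a list of all the files in the data directory and shuffle it, so that each dataset receives a random selection of files: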

    # Get a list of all the files in the directory
    filenames = tf.io.gfile.glob(str(data_dir) + "/*/*")

    # Shuffle the file names so that random bunches can be used as the training, testing, and validation sets
    filenames = tf.random.shuffle(filenames)

    # Create the list of files for training data
    train_files = filenames[:6400]

    # Create the list of files for validation data
    validation_files = filenames[6400: 6400 + 800]

    # Create the list of files for test data
    test_files = filenames[-800:]

Now that the training, validation, and test files are cleanly separated, we can preprocess them so they are ready for building and testing the model. Here we use autotune so that tf.data adjusts its parallelism settings dynamically at runtime:

    autotune = tf.data.AUTOTUNE

The first example just shows how the preprocessing works, and it gives us the spectrogram_ds value that we need later:

    # Get the converted audio files for training the model
    files_ds = tf.data.Dataset.from_tensor_slices(train_files)
    waveform_ds = files_ds.map(
        get_waveform_and_label, num_parallel_calls=autotune)
    spectrogram_ds = waveform_ds.map(
        get_spectrogram_and_label_id, num_parallel_calls=autotune)

Now that the preprocessing steps are clear, we can use the helper function to process all of the datasets:

    # Preprocess the training, test, and validation datasets
    train_ds = preprocess_dataset(train_files, autotune, commands)
    validation_ds = preprocess_dataset(
        validation_files, autotune, commands)
    test_ds = preprocess_dataset(test_files, autotune, commands)

Training processes a fixed number of examples in each iteration of an epoch, so we set a batch size:

    # Batch datasets for training and validation
    batch_size = 64
    train_ds = train_ds.batch(batch_size)
    validation_ds = validation_ds.batch(batch_size)

Finally, we can use caching to reduce latency while training the model:

    # Reduce latency while training
    train_ds = train_ds.cache().prefetch(autotune)
    validation_ds = validation_ds.cache().prefetch(autotune)

At last, our datasets are in a form the model can be trained on.
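The split and batching steps above can be sketched without TensorFlow to check the bookkeeping. This is only an illustration: the file names are made up, the total of 8,000 files is an assumption chosen so the three slices do not overlap, and `split_files` and `num_batches` are hypothetical helpers.

```python
import math
import random


def split_files(filenames, num_train=6400, num_val=800, num_test=800, seed=55):
    """Shuffle the file names, then slice them into three lists."""
    shuffled = list(filenames)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:num_train]
    validation = shuffled[num_train:num_train + num_val]
    # Taken from the end, as in the article; overlaps validation if fewer
    # than num_train + num_val + num_test files exist
    test = shuffled[-num_test:]
    return train, validation, test


def num_batches(num_examples, batch_size=64):
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)


# Assumption: 8,000 clips in total (hypothetical names)
filenames = [f"clip_{i}.wav" for i in range(8000)]
train, validation, test = split_files(filenames)
print(len(train), len(validation), len(test))
print(num_batches(len(train)), num_batches(len(validation)))
```

With a batch size of 64, the 6,400 training files form 100 full batches per epoch, while the 800 validation files form 12 full batches plus one half-size batch.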

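Building the model

With the datasets clearly defined, we can go ahead and build the model. We are using a CNN, so we need the shape of the data to give the input layer the correct shape, and then we build the model layer by layer as a sequence: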

    # Build model
    for spectrogram, _ in spectrogram_ds.take(1):
        input_shape = spectrogram.shape

    num_labels = len(commands)

    # Learn the mean and variance of the training spectrograms
    norm_layer = preprocessing.Normalization()
    norm_layer.adapt(spectrogram_ds.map(lambda x, _: x))

    model = models.Sequential([
        layers.Input(shape=input_shape),
        preprocessing.Resizing(32, 32),
        norm_layer,
        layers.Conv2D(32, 3, activation="elu"),
        layers.Conv2D(64, 3, activation="elu"),
        layers.MaxPooling2D(),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(128, activation="elu"),
        layers.Dropout(0.5),
        layers.Dense(num_labels),
    ])

    model.summary()

We then configure the model with an optimizer, a loss, and a metric, aiming for the best accuracy we can get:

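    # Configure built model with losses and metrics
    model.compile(
        optimizer=tf.keras.optimizers.Adam(),
        # The last Dense layer outputs raw logits, hence from_logits=True
        loss=tf.keras.losses.SparseCategoricalCrossentropy(
            from_logits=True),
        metrics=["accuracy"],
    )

The model is built, and all that remains is to train it.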
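Because the final `Dense` layer has no activation, the model outputs raw scores (logits) and the loss is configured with `from_logits=True`. The sketch below shows, in plain Python, what that means for a single example: the logits go through a softmax, and the loss is the negative log of the probability assigned to the true label id. The function names and the example logits are made up for illustration.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def sparse_categorical_crossentropy(logits, label_id):
    """Loss for one example whose true class is the integer label_id."""
    return -math.log(softmax(logits)[label_id])


# Hypothetical logits for a three-command model
logits = [2.0, 0.5, -1.0]
print(softmax(logits))
print(sparse_categorical_crossentropy(logits, 0))  # true label has the top score
print(sparse_categorical_crossentropy(logits, 2))  # true label has the lowest score
```

The loss is small when the true label has the highest score and grows as the model puts its probability mass elsewhere. "Sparse" refers to the labels being integer ids, like the `label_id` values produced during preprocessing, rather than one-hot vectors.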

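Training the model

After all the work of preprocessing the data and building the model, training is relatively simple. We decide how many epochs to run with the training and validation datasets: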

    # Finally train the model and return info about each epoch
    EPOCHS = 10
    model.fit(
        train_ds,
        validation_data=validation_ds,
        epochs=EPOCHS,
        # Stop early once the monitored metric stops improving for 2 epochs
        callbacks=tf.keras.callbacks.EarlyStopping(verbose=1, patience=2),
    )

The model is now trained, and the next step is to test it.
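The `EarlyStopping` callback with `patience=2` means training may end before all 10 epochs have run. Here is a rough sketch of that stopping rule, assuming the callback watches the validation loss and treats any decrease as an improvement. The function `epochs_run` and the loss values are made up for illustration, and this is not the Keras implementation.

```python
def epochs_run(val_losses, patience=2):
    """Return how many epochs run before the patience rule stops training."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best value: remember it and reset the counter
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # Patience never ran out, so every epoch was used
    return len(val_losses)


# Hypothetical validation losses, one per epoch
history = [0.90, 0.70, 0.65, 0.66, 0.67, 0.60]
print(epochs_run(history))
```

In this made-up history the loss fails to improve in epochs 4 and 5, so training stops after epoch 5 and never reaches the better value in epoch 6. A larger `patience` trades longer training for a lower risk of stopping on a temporary plateau.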

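Testing the model

Now that we have a model with roughly 83% accuracy, it is time to see how it performs on new data. We take the test dataset and separate the audio data from the labels: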

    # Test the model
    test_audio = []
    test_labels = []

    for audio, label in test_ds:
        test_audio.append(audio.numpy())
        test_labels.append(label.numpy())

    test_audio = np.array(test_audio)
    test_labels = np.array(test_labels)

Then we feed the audio data to the model and check whether it predicts the correct labels:

    # See how accurate the model is when making predictions on the test dataset
    # The predicted class is the index of the largest logit
    y_pred = np.argmax(model.predict(test_audio), axis=1)
    y_true = test_labels

    test_acc = sum(y_pred == y_true) / len(y_true)

    print(f"Test set accuracy: {test_acc:.0%}")

Completing the pipeline

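Only a small piece of code is left to finish the pipeline and make it shareable with anyone. It defines the image used in the Conducto pipeline and handles running the file: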

    ###
    # Pipeline Helper functions
    ###
    def get_image():
        return co.Image(
            "python:3.8-slim",
            copy_dir=".",
            reqs_py=["conducto", "tensorflow", "keras"],
        )

    if __name__ == "__main__":
        co.main(default=main)

Now you can run python pipeline.py in your terminal, and it should give you a link to a new Conducto pipeline.

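Conclusion

This is one way to tackle an audio processing problem, but depending on the data you want to analyze it can get much more complex. If you build it in a pipeline, it is easy to share with colleagues and to get help or feedback when you run into errors.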
