Abstract



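To address the weak speech-information representation of Mel-frequency cepstral coefficient (MFCC) features in deep models, this paper proposes a feature extraction method in which log Mel filter-bank (Fbank) features are re-extracted by a convolutional neural network (CNN). The method is combined with DFSMN to build the acoustic model CNN-DFSMN, which performs the speech-to-pinyin task. Experimental results show that, compared with other feature extraction methods, re-extracting Fbank features with a CNN gives stronger speech-information representation and a lower character error rate (CER).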



The Design and Implementation of the Speech Recognition System Based on Python


With the development of the Internet, voice files have become increasingly accessible. An open question is how to efficiently extract the key information from a recording, isolate the content that people are interested in, and present it intuitively to the user. This paper takes DFSMN as the acoustic model and introduces the Transformer model to recast speech recognition as a translation task, which has theoretical significance and research value.

This paper describes several mainstream deep learning models in the field of speech recognition. Based on deep learning theory, it designs the overall scheme of a TensorFlow-based continuous speech recognition system. It then focuses on the shortcomings of the speech feature extraction method and of the Transformer language model, and optimizes both.

Mel-frequency cepstral coefficient (MFCC) features represent speech information weakly in deep models. To address this, a feature extraction method is introduced in which log Mel filter-bank (Fbank) features are re-extracted by a convolutional neural network (CNN). This method is combined with DFSMN to construct the acoustic model CNN-DFSMN, which performs the speech-to-pinyin task. The experimental results show that, compared with other feature extraction methods, re-extracting Fbank features with a CNN gives stronger representation ability and a lower character error rate (CER).
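The character error rate mentioned above can be illustrated with a short sketch. It assumes the conventional definition, edit distance between the reference and the hypothesis divided by the reference length, and treats each pinyin syllable as one token. The thesis does not state its exact scoring procedure, so the function names `edit_distance` and `cer` and the sample syllables are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # token missing from hyp
                            curr[j - 1] + 1,      # extra token in hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one syllable of the reference is missing.
reference = ["jin1", "tian1", "tian1", "qi4"]
hypothesis = ["jin1", "tian1", "qi4"]
print(cer(reference, hypothesis))
```

Because the denominator is the reference length, the value can exceed 1 when the hypothesis contains many extra tokens.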

To address the complex computation and insufficient generalization ability of the Transformer language model, an improved attention computation method based on the Hadamard matrix is proposed. The method uses Hadamard matrices generated with different threshold values to produce a new attention matrix. Experimental results show that, compared with the original Transformer model, the improved model using the Hadamard matrix reduces both the recognition time and the CER of the language model.

Key words: Python, speech recognition, speech processing, TensorFlow, model

Key words: Haojing College of Shaanxi University of Science & Technology, Undergraduates

3 Design of the Speech Recognition System Scheme


3.1 Speech Signal Preprocessing




import os
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow import keras

from model import ResNetModel


def load_data():
    # Load the feature array and the label array
    x = np.load('train_data/data.npy')
    y = np.load('train_data/label.npy')
    num = len(Counter(y))
    print("Number of classes:", num)
    return x, y, num


if __name__ == '__main__':
    data, label, label_num = load_data()
    # Add a trailing channel axis: (N, H, W) -> (N, H, W, 1)
    data = data.reshape((data.shape[0], data.shape[1], data.shape[2], 1))
    # Model parameters
    model_param = {"label_count": label_num, "num_b": 20}
    data_shape = (data.shape[1], data.shape[2], data.shape[3])
    Resnetmodel = ResNetModel(input_shape=data_shape, classes=model_param['label_count'])
    ResNet_model = Resnetmodel.ResNet50()
    ResNet_model.summary()
    learning_rate = 1e-4
    num_epochs = 10
    batch_size = 16
    # Directory for the training logs
    log_dir = os.path.join("logs/")
    if not os.path.exists(log_dir):
        os.mkdir(log_dir)
    # Training optimizer; Adam by default
    optimizer = keras.optimizers.Adam(learning_rate)
    tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)
    # Compile the model to configure the learning process
    ResNet_model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    # Train the model, holding out 20% of the data for validation
    history = ResNet_model.fit(data, label, epochs=num_epochs, batch_size=batch_size,
                               callbacks=[tensorboard_callback], validation_split=0.2)
    # Save the trained model
    model_path = 'models_save/resnet_model.h5'
    ResNet_model.save(model_path)
    print("Training finished; model saved to:", model_path)
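The script above loads precomputed feature arrays but does not show how they are produced. As a minimal sketch of the log Mel filter-bank (Fbank) features named in the abstract, the following NumPy code applies pre-emphasis, framing, a Hamming window, a power spectrum, and triangular Mel filters. The parameter values (16 kHz sampling, 25 ms frames, 10 ms step, 40 filters, pre-emphasis 0.97) are common defaults assumed here, not values taken from the thesis, and the function names are hypothetical.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters spaced evenly on the Mel scale: (n_mels, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):        # rising edge
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):       # falling edge
            fbank[m - 1, k] = (right - k) / (right - center)
    return fbank


def log_fbank(signal, sample_rate=16000, frame_len=400, frame_step=160,
              n_fft=512, n_mels=40):
    """Log Mel filter-bank features, shape (n_frames, n_mels)."""
    # Pre-emphasis boosts high frequencies (0.97 is an assumed coefficient)
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Slice the signal into overlapping frames and apply a Hamming window
    n_frames = 1 + (len(emphasized) - frame_len) // frame_step
    idx = np.arange(frame_len)[None, :] + frame_step * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # Power spectrum of each frame (zero-padded to n_fft)
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    # Mel filter energies, then log compression (epsilon avoids log(0))
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel_energy + 1e-10)


# Synthetic one-second 440 Hz tone standing in for a real recording
t = np.arange(16000) / 16000.0
signal = np.sin(2.0 * np.pi * 440.0 * t)
features = log_fbank(signal)
print(features.shape)
```

A batch of such feature matrices stacked along a first axis would have the (N, H, W) layout that the training script reshapes to (N, H, W, 1). Whether `train_data/data.npy` was actually built this way is not stated in the source.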
