Text Mining for Sentiment Analysis

Date: 2019-09-19 08:53:42

In this tutorial, I will explore some text mining techniques for sentiment analysis. We'll look at how to prepare textual data. After that, we will try two different classifiers to infer the tweets' sentiment. We will tune the hyperparameters of both classifiers with grid search. Finally, we evaluate the performance on a set of metrics: precision, recall and the F1 score.

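
As a rough illustration of the metrics named above, here is a minimal sketch that computes precision, recall and F1 per class from two label lists. It uses only the standard library, and the labels in `y_true` and `y_pred` are made up for the example; they are not results from the airline data set.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class was wrong
            fn[truth] += 1   # true class was missed
    metrics = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


# Hypothetical labels, invented for illustration only
y_true = ["negative", "negative", "negative", "neutral", "positive", "positive"]
y_pred = ["negative", "negative", "neutral", "neutral", "positive", "negative"]

metrics = per_class_metrics(y_true, y_pred)
for label, (p, r, f1) in metrics.items():
    print(f"{label:<10} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Precision asks how many of the tweets predicted as a class really belong to it; recall asks how many tweets of a class were found. F1 is the harmonic mean of the two.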

For this project, we'll be working with the Twitter US Airline Sentiment data set on Kaggle. It contains the tweet's text and one variable with three possible sentiment values. Let's start by importing the packages and configuring some settings.


import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', -1)  # pandas 1.0 and later expect None instead of -1
from time import time
import re
import string
import os
import emoji
from pprint import pprint
import collections
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
sns.set(font_scale=1.3)
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.externals import joblib  # newer scikit-learn releases dropped this; use "import joblib" there
import gensim
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import warnings
warnings.filterwarnings('ignore')
np.random.seed(37)

Loading the data

We read in the comma-separated file we downloaded from Kaggle Datasets. We shuffle the data frame in case the classes are sorted; applying the reindex method to a permutation of the original indices does that. In this notebook, we will work with the text variable and the airline_sentiment variable.

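
To make the shuffling idea concrete without the data set, the sketch below permutes a list of row indices and then looks the rows up in that order. This mirrors what reindexing on a permutation does, but it uses plain Python lists instead of pandas. The `rows` list is hypothetical, and the seed simply reuses the value 37 from the setup code.

```python
import random

# Hypothetical rows, sorted by class on purpose (stand-in for the tweets)
rows = [
    ("late again", "negative"),
    ("lost my bag", "negative"),
    ("flight at 9am", "neutral"),
    ("gate changed", "neutral"),
    ("great crew", "positive"),
    ("smooth flight", "positive"),
]

rng = random.Random(37)                  # fixed seed keeps the shuffle reproducible
permuted_index = list(range(len(rows)))  # the original indices 0..n-1
rng.shuffle(permuted_index)              # turn them into a random permutation

# Look each row up in the permuted order; no row is lost or duplicated
shuffled = [rows[i] for i in permuted_index]

for text, sentiment in shuffled:
    print(f"{sentiment:<10}{text}")
```

Because a permutation contains every index exactly once, the shuffled data holds the same rows as before, only in a different order.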

df = pd.read_csv('../input/Tweets.csv')
df = df.reindex(np.random.permutation(df.index))
df = df[['text', 'airline_sentiment']]

Exploratory Data Analysis

Target variable

There are three class labels we will predict: negative, neutral or positive.


The class labels are imbalanced, as the chart produced below shows. We should keep this in mind during the model training phase. With the factorplot function of the seaborn package (renamed catplot in seaborn 0.9), we can visualize the distribution of the target variable.

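
Alongside the chart, the imbalance can also be checked numerically. The following is a small sketch that counts labels and reports each class's share. The `labels` list is invented for illustration and does not reflect the real proportions in the data set.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: (count, share)} ordered from most to least frequent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.most_common()}


# Hypothetical labels; the real values come from df.airline_sentiment
labels = ["negative"] * 6 + ["neutral"] * 3 + ["positive"] * 1

distribution = class_distribution(labels)
for label, (n, share) in distribution.items():
    print(f"{label:<10}{n:>3}  {share:.0%}")
```

A large gap between the shares is a hint that plain accuracy could be misleading, which is one reason to look at precision and recall per class.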

sns.factorplot(x="airline_sentiment", data=df, kind="count", size=6, aspect=1.5, palette="PuBuGn_d")
plt.show()

Input variable

To analyze the text variable we create a class TextCounts. In this class we compute some basic statistics on the text variable.


count_words: number of words in the tweet
count_mentions: number of referrals to other Twitter accounts, which start with a @
count_hashtags: number of tag words, preceded by a #
count_capital_words: number of uppercase words, which are sometimes used to “shout” and express (negative) emotions
count_excl_quest_marks: number of question or exclamation marks
count_urls: number of links in the tweet, preceded by http(s)
count_emojis: number of emoji, which might be a good sign of the sentiment

class TextCounts(BaseEstimator, TransformerMixin):

    def count_regex(self, pattern, tweet):
        return len(re.findall(pattern, tweet))

    def fit(self, X, y=None, **fit_params):
        # fit method is used when specific operations need to be done on the train data, but not on the test data
        return self

    def transform(self, X, **transform_params):
        count_words = X.apply(lambda x: self.count_regex(r'\w+', x))
        count_mentions = X.apply(lambda x: self.count_regex(r'@\w+', x))
        count_hashtags = X.apply(lambda x: self.count_regex(r'#\w+', x))
        count_capital_words = X.apply(lambda x: self.count_regex(r'\b[A-Z]{2,}\b', x))
        count_excl_quest_marks = X.apply(lambda x: self.count_regex(r'!|\?', x))
        count_urls = X.apply(lambda x: self.count_regex(r'http.?://[^\s]+[\s]?', x))
        # We will replace the emoji symbols with a description, which makes using a regex for counting easier
        # Moreover, it will result in having more words in the tweet
        count_emojis = X.apply(lambda x: emoji.demojize(x)).apply(lambda x: self.count_regex(r':[a-z_&]+:', x))

        df = pd.DataFrame({'count_words': count_words,
                           'count_mentions': count_mentions,
                           'count_hashtags': count_hashtags,
                           'count_capital_words': count_capital_words,
                           'count_excl_quest_marks': count_excl_quest_marks,
                           'count_urls': count_urls,
                           'count_emojis': count_emojis})
        return df

tc = TextCounts()
df_eda = tc.fit_transform(df.text)
df_eda['airline_sentiment'] = df.airline_sentiment
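
To see what these counters produce, the sketch below applies the regex-based ones to a single made-up tweet using only the `re` module. The patterns are assumed to match those in the class; in particular, the leading `\b` in the capital-word pattern is an assumption. The emoji count is left out because it depends on the third-party emoji package.

```python
import re

# Patterns assumed to mirror the regex-based counters of TextCounts
PATTERNS = {
    "count_words": r"\w+",
    "count_mentions": r"@\w+",
    "count_hashtags": r"#\w+",
    "count_capital_words": r"\b[A-Z]{2,}\b",
    "count_excl_quest_marks": r"!|\?",
    "count_urls": r"http.?://[^\s]+[\s]?",
}


def count_features(tweet):
    """Count the non-overlapping matches of every pattern in one tweet."""
    return {name: len(re.findall(pattern, tweet))
            for name, pattern in PATTERNS.items()}


# Hypothetical tweet, written for this example
tweet = "@united my flight was CANCELLED again!!! Why? #fail http://t.co/abc123"

features = count_features(tweet)
for name, value in features.items():
    print(f"{name:<24}{value}")
```

Note that the word count also picks up the handle, the hashtag and the pieces of the URL (http, t, co, abc123), since `\w+` matches any run of word characters.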

It could be interesting to see how the TextCounts variables relate to the class variable. So we write a function show_dist that provides descriptive statistics and a plot per target class.


def show_dist(df, col):
    print('Descriptive stats for {}'.format(col))
    print('-'*(len(col)+22))
    print(df.groupby('airline_sentiment')[col].describe())
    bins = np.arange(df[col].min(), df[col].max() + 1)
    g = sns.Face
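
The descriptive-statistics part of this idea can be sketched without pandas. The hypothetical helper `describe_by_class` below groups one count column by sentiment and reports a few summary values per class. The records are invented, and pandas' `describe` reports more than this (for example the standard deviation and quartiles).

```python
from collections import defaultdict
from statistics import mean, median


def describe_by_class(records, col, target="airline_sentiment"):
    """Group the values of `col` by class label and summarize each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[target]].append(record[col])
    return {
        label: {
            "count": len(values),
            "mean": mean(values),
            "median": median(values),
            "min": min(values),
            "max": max(values),
        }
        for label, values in sorted(groups.items())
    }


# Hypothetical rows standing in for df_eda
records = [
    {"airline_sentiment": "negative", "count_words": 24},
    {"airline_sentiment": "negative", "count_words": 20},
    {"airline_sentiment": "neutral", "count_words": 10},
    {"airline_sentiment": "neutral", "count_words": 14},
    {"airline_sentiment": "positive", "count_words": 12},
]

summary = describe_by_class(records, "count_words")
for label, stats in summary.items():
    print(label, stats)
```

Comparing the per-class summaries side by side gives a first impression of whether a count feature might help separate the sentiment classes.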
