
Python Data Analysis and Machine Learning 47: Wikipedia Entry EDA

Date: 2024-06-26 07:31:09


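Contents

1. Data source
2. Converting floats to integers
3. Getting the page language
4. Analyzing the time series by language
5. Time series of individual English entries
6. Most-viewed entries per language
Reference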

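1. Data source

train_1.csv:

Daily page views for each Wikipedia entry, one row per page and one column per day.

2. Converting floats to integers

A float64 value takes 8 bytes, so converting the view counts to integers (int32 here, 4 bytes each) roughly halves memory use, from about 610 MB to about 306 MB in the run recorded here, and can speed up later processing.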

Code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re

# Load the data; missing values become 0 so the columns can hold integers
train = pd.read_csv('E:/file/train_1.csv').fillna(0)
print(train.head())
print(train.info())
print("########################################################")

# float64 columns use more memory, so downcast them to integers
for col in train.columns[1:]:
    train[col] = pd.to_numeric(train[col], downcast='integer')
print(train.head())
print(train.info())
print("########################################################")

Test output:

   Page                 ...  -12-31
0  _all-access_spider   ...  20.0
1  _all-access_spider   ...  20.0
2  _all-access_spider   ...  17.0
3  _all-access_spider   ...  11.0
4  _all-access_s...     ...  10.0
[5 rows x 551 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145063 entries, 0 to 145062
Columns: 551 entries, Page to -12-31
dtypes: float64(550), object(1)
memory usage: 609.8+ MB
None
########################################################
   Page                 ...  -12-31
0  _all-access_spider   ...  20
1  _all-access_spider   ...  20
2  _all-access_spider   ...  17
3  _all-access_spider   ...  11
4  _all-access_s...     ...  10
[5 rows x 551 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145063 entries, 0 to 145062
Columns: 551 entries, Page to -12-31
dtypes: int32(550), object(1)
memory usage: 305.5+ MB
None
########################################################
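As a minimal sketch of the saving, the snippet below applies the same pd.to_numeric(..., downcast='integer') call to a small made-up frame and compares column memory before and after. The downcast_counts helper and the demo data are illustrative assumptions, not part of the original script. NaN cannot be stored in a NumPy integer column, which is why the script fills missing values with 0 before downcasting.

```python
import numpy as np
import pandas as pd


def downcast_counts(df, skip=1):
    """Return a copy whose columns after the first `skip` are downcast to integers."""
    out = df.copy()
    for col in out.columns[skip:]:
        out[col] = pd.to_numeric(out[col], downcast='integer')
    return out


# Made-up data: 1000 pages x 5 days of whole-number view counts stored as float64.
values = np.arange(1000 * 5, dtype='float64').reshape(1000, 5) * 50
demo = pd.DataFrame(values, columns=[f'day{i}' for i in range(5)])
demo.insert(0, 'Page', [f'page{i}' for i in range(1000)])

small = downcast_counts(demo)

# Memory used by the numeric columns only, in bytes.
before = int(demo.iloc[:, 1:].memory_usage(index=False).sum())
after = int(small.iloc[:, 1:].memory_usage(index=False).sum())
print(before, after)
print(small.dtypes)
```

pandas picks the smallest integer type that can hold every value in a column, so columns with small counts may end up narrower than int32.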

3. Getting the page language

Code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from collections import Counter

# Load the data
train = pd.read_csv('E:/file/train_1.csv').fillna(0)

# float64 columns use more memory; uncomment to downcast them to integers
# for col in train.columns[1:]:
#     train[col] = pd.to_numeric(train[col], downcast='integer')

# Extract the language code from the page name
def get_language(page):
    res = re.search('[a-z][a-z].', page)
    # print(res.group()[0:2])
    if res:
        return res.group()[0:2]
    return 'na'

train['lang'] = train.Page.map(get_language)

# Count the pages per language
print(Counter(train.lang))

Test output:

Counter({'en': 24108, 'ja': 20431, 'de': 18547, 'na': 17855, 'fr': 17802, 'zh': 17229, 'ru': 15022, 'es': 14069})
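The pattern '[a-z][a-z].' as shown accepts any two lowercase letters followed by any character, so it can pick up letters from the entry title rather than the language code. The sketch below assumes page names follow the form <title>_<lang>.wikipedia.org_<access>_<agent>, which is an assumption about the dataset rather than something this page shows in full, and anchors the match on that domain. The strict_language helper and the sample page names are hypothetical.

```python
import re

# Assumed page-name layout: <title>_<lang>.wikipedia.org_<access>_<agent>
LANG_PATTERN = re.compile(r'_([a-z]{2})\.wikipedia\.org_')


def loose_language(page):
    """The pattern as shown in the article: two lowercase letters plus any character."""
    res = re.search('[a-z][a-z].', page)
    return res.group()[0:2] if res else 'na'


def strict_language(page):
    """Return the two-letter code before '.wikipedia.org', or 'na' if there is none."""
    res = LANG_PATTERN.search(page)
    return res.group(1) if res else 'na'


# Hypothetical page names used only for illustration.
samples = [
    '2NE1_zh.wikipedia.org_all-access_spider',
    'Main_Page_en.wikipedia.org_desktop_all-agents',
    'File:Example.jpg_commons.wikimedia.org_all-access_all-agents',
]
for page in samples:
    print(loose_language(page), strict_language(page))
```

With the stricter pattern, pages hosted outside the language editions (for example on a media domain) fall through to 'na', which matches how the article later labels that group as Media.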

4. Analyzing the time series by language

Code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from collections import Counter

# Load the data
train = pd.read_csv('E:/file/train_1.csv').fillna(0)

# float64 columns use more memory; uncomment to downcast them to integers
# for col in train.columns[1:]:
#     train[col] = pd.to_numeric(train[col], downcast='integer')

# Extract the language code from the page name
def get_language(page):
    res = re.search('[a-z][a-z].', page)
    # print(res.group()[0:2])
    if res:
        return res.group()[0:2]
    return 'na'

train['lang'] = train.Page.map(get_language)

# Split the pages by language into a dict; iloc[:, 0:-1] drops the 'lang' column
lang_sets = {}
for lang in ['en', 'ja', 'de', 'na', 'fr', 'zh', 'ru', 'es']:
    lang_sets[lang] = train[train.lang == lang].iloc[:, 0:-1]

# Average daily views per page for each language
sums = {}
for key in lang_sets:
    sums[key] = lang_sets[key].iloc[:, 1:].sum(axis=0) / lang_sets[key].shape[0]

days = [r for r in range(sums['en'].shape[0])]

# Plot the per-language averages
fig = plt.figure(1, figsize=[10, 10])
plt.ylabel('Views per Page')
plt.xlabel('Day')
plt.title('Pages in Different Languages')
labels = {'en': 'English', 'ja': 'Japanese', 'de': 'German',
          'na': 'Media', 'fr': 'French', 'zh': 'Chinese',
          'ru': 'Russian', 'es': 'Spanish'}

for key in sums:
    plt.plot(days, sums[key], label=labels[key])

plt.legend()
plt.show()

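Test output:

English pages receive clearly more views per page than pages in the other languages.

The spikes in the middle of a series usually correspond to trending events, when page views rise sharply.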
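The curves in this section are per-language means: the daily sum of views divided by the number of pages in that language. As a sketch on made-up data, the same averages can be obtained in one step with groupby. The toy frame and its column names are assumptions for illustration only.

```python
import pandas as pd

# Made-up frame with the same layout as `train`: page name, day columns, language.
toy = pd.DataFrame({
    'Page': ['a', 'b', 'c'],
    'day1': [10, 30, 4],
    'day2': [20, 40, 6],
    'lang': ['en', 'en', 'ja'],
})
day_cols = ['day1', 'day2']

# Manual version, mirroring the article: column sums divided by the page count.
en_rows = toy[toy.lang == 'en']
manual_en = en_rows[day_cols].sum(axis=0) / en_rows.shape[0]

# groupby version: mean views per page for every language at once.
mean_views = toy.groupby('lang')[day_cols].mean()

print(mean_views)
```

Each row of mean_views can then be passed to plt.plot in place of the entries of the sums dict.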

5. Time series of individual English entries

Code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from collections import Counter

# Load the data
train = pd.read_csv('E:/file/train_1.csv').fillna(0)

# float64 columns use more memory; uncomment to downcast them to integers
# for col in train.columns[1:]:
#     train[col] = pd.to_numeric(train[col], downcast='integer')

# Extract the language code from the page name
def get_language(page):
    res = re.search('[a-z][a-z].', page)
    # print(res.group()[0:2])
    if res:
        return res.group()[0:2]
    return 'na'

train['lang'] = train.Page.map(get_language)

# Split the pages by language into a dict; iloc[:, 0:-1] drops the 'lang' column
lang_sets = {}
for lang in ['en', 'ja', 'de', 'na', 'fr', 'zh', 'ru', 'es']:
    lang_sets[lang] = train[train.lang == lang].iloc[:, 0:-1]

# Average daily views per page for each language
sums = {}
for key in lang_sets:
    sums[key] = lang_sets[key].iloc[:, 1:].sum(axis=0) / lang_sets[key].shape[0]

days = [r for r in range(sums['en'].shape[0])]

# Plot the daily views of the idx-th page of the given language
def plot_entry(key, idx):
    data = lang_sets[key].iloc[idx, 1:]
    fig = plt.figure(1, figsize=(10, 5))
    plt.plot(days, data)
    plt.xlabel('day')
    plt.ylabel('views')
    # The original row label doubles as the row position in train (default RangeIndex)
    plt.title(train.iloc[lang_sets[key].index[idx], 0])
    plt.show()

idx = [1, 5, 10, 50, 100, 250, 500, 750, 1000, 1500, 2000, 3000, 4000, 5000]
for i in idx:
    plot_entry('en', i)
    plt.show()

Test output:

The remaining plots are omitted.
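The loop in this section opens one window per entry. As an alternative sketch, several entries can be drawn on one grid of subplots so they are easier to compare. The plot_entries helper, its ncols parameter and the toy frame are hypothetical; the frame is assumed to hold the page name in its first column and daily views in the rest, like each value of lang_sets.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_entries(frame, rows, ncols=2):
    """Plot the given row positions of `frame` on one grid of subplots.

    `frame` is assumed to hold the page name in column 0 and daily views after it.
    """
    nrows = -(-len(rows) // ncols)  # ceiling division
    fig, axes = plt.subplots(nrows, ncols,
                             figsize=(5 * ncols, 3 * nrows), squeeze=False)
    flat = axes.ravel()
    days = range(frame.shape[1] - 1)
    for ax, r in zip(flat, rows):
        ax.plot(days, frame.iloc[r, 1:].astype(float).to_numpy())
        ax.set_title(str(frame.iloc[r, 0]))
        ax.set_xlabel('day')
        ax.set_ylabel('views')
    for ax in flat[len(rows):]:
        ax.set_visible(False)  # hide unused panels
    fig.tight_layout()
    return fig


# Made-up data for illustration.
toy = pd.DataFrame({
    'Page': ['page_a', 'page_b', 'page_c'],
    'd1': [1, 5, 2],
    'd2': [3, 4, 8],
    'd3': [2, 9, 1],
})
fig = plot_entries(toy, rows=[0, 1, 2])
```

In an interactive session the Agg line can be dropped and plt.show() called on the result; with the article's data the call would look like plot_entries(lang_sets['en'], idx).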

6. Most-viewed entries per language

Code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from collections import Counter

# Load the data
train = pd.read_csv('E:/file/train_1.csv').fillna(0)

# float64 columns use more memory; uncomment to downcast them to integers
# for col in train.columns[1:]:
#     train[col] = pd.to_numeric(train[col], downcast='integer')

# Extract the language code from the page name
def get_language(page):
    res = re.search('[a-z][a-z].', page)
    # print(res.group()[0:2])
    if res:
        return res.group()[0:2]
    return 'na'

train['lang'] = train.Page.map(get_language)

# Split the pages by language into a dict; iloc[:, 0:-1] drops the 'lang' column
lang_sets = {}
for lang in ['en', 'ja', 'de', 'na', 'fr', 'zh', 'ru', 'es']:
    lang_sets[lang] = train[train.lang == lang].iloc[:, 0:-1]

# Average daily views per page for each language
sums = {}
for key in lang_sets:
    sums[key] = lang_sets[key].iloc[:, 1:].sum(axis=0) / lang_sets[key].shape[0]

days = [r for r in range(sums['en'].shape[0])]

# Rank the pages of each language by total views and remember the top one
npages = 5
top_pages = {}
for key in lang_sets:
    print(key)
    sum_set = pd.DataFrame(lang_sets[key][['Page']])
    # Sum only the day columns; including the string 'Page' column fails on recent pandas
    sum_set['total'] = lang_sets[key].iloc[:, 1:].sum(axis=1)
    sum_set = sum_set.sort_values('total', ascending=False)
    print(sum_set.head(10))
    top_pages[key] = sum_set.index[0]
    print('\n\n')

# Plot the daily views of the top page of each language
for key in top_pages:
    fig = plt.figure(1, figsize=(10, 5))
    cols = train.columns
    cols = cols[1:-1]
    data = train.loc[top_pages[key], cols]
    plt.plot(days, data)
    plt.xlabel('Days')
    plt.ylabel('Views')
    plt.title(train.loc[top_pages[key], 'Page'])
    plt.show()

Test output:

en
        Page                                                total
38573   _all-access_all-agents                              1.206618e+10
9774    _desktop_all-agents                                 8.774497e+09
74114   _mobile-web_all-agents                              3.153985e+09
39180   Special:_all-access_all...                          1.304079e+09
10403   Special:_desktop_all-ag...                          1.011848e+09
74690   Special:_mobile-web_all...                          2.921628e+08
39172   Special:_all-access_all-a...                        1.339931e+08
10399   Special:_desktop_all-agents                         1.332859e+08
33644   _all-access_spider                                  1.290204e+08
34257   Special:_all-access_spider                          1.243102e+08

ja
        Page                                                total
120336  メインページ_all-access_all-agents                    210753795.0
86431   メインページ_desktop_all-agents                       134147415.0
123025  特別:検索_all-access_all-agents                       70316929.0
89202   特別:検索_desktop_all-agents                          69215206.0
57309   メインページ_mobile-web_all-agents                    66459122.0
119609  特別:最近の更新_all-access_all-agents                 17662791.0
88897   特別:最近の更新_desktop_all-agents                    17627621.0
119625  真田信繁_all-access_all-agents                        10793039.0
123292  特別:外部リンク検索_all-access_all-agents             10331191.0
89463   特別:外部リンク検索_desktop_all-agents                10327917.0

de
        Page                                                total
139119  Wikipedia:_all-acce...                              1.603934e+09
116196  Wikipedia:_mobile-w...                              1.112689e+09
67049   Wikipedia:_desktop_...                              4.269924e+08
140151  Spezial:_all-access_all-...                         2.234259e+08
66736   Spezial:_desktop_all-agents                         2.196368e+08
140147  Spezial:_all-access_a...                            4.029181e+07
138800  Special:_all-access_all...                          3.988154e+07
68104   Spezial:_desktop_all-...                            3.535523e+07
68511   Special:MyPage/toolserverhelferleinconfig.js_d...   3.258496e+07
137765  _all-access_all-agents                              3.173246e+07

na
        Page                                                total
45071   Special:_all-acces...                               67150638.0
81665   Special:_desktop_a...                               63349756.0
45056   Special:_al...                                      53795386.0
45028   _all-access_all...                                  52732292.0
81644   Special:_de...                                      48061029.0
81610   _desktop_all-ag...                                  39160923.0
46078   Special:RecentChangesLinked_commons.wikimedia....   28306336.0
45078   Special:_all...                                     23733805.0
81671   Special:_des...                                     2544.0
82680   Special:RecentChangesLinked_commons.wikimedia....   21915202.0

fr
        Page                                                total
27330   Wikipédia:_a...                                     868480667.0
55104   Wikipédia:_m...                                     611302821.0
7344    Wikipédia:_d...                                     239589012.0
27825   Spécial:_all-access_...                             95666374.0
8221    Spécial:_desktop_all...                             88448938.0
26500   Sp?cial:_all-access_all...                          76194568.0
6978    Sp?cial:_desktop_all-ag...                          76185450.0
131296  Wikipédia:_a...                                     63860799.0
26993   Organisme_de_placement_collectif_en_valeurs_mo...   36647929.0
7213    Organisme_de_placement_collectif_en_valeurs_mo...   36624145.0

zh
        Page                                                total
28727   Wikipedia:首页_all-access_all-a...                   123694312.0
61350   Wikipedia:首页_desktop_all-agents                    66435641.0
105844  Wikipedia:首页_mobile-web_all-a...                   50887429.0
28728   Special:搜索_all-access_all-agents                   48678124.0
61351   Special:搜索_desktop_all-agents                      48203843.0
28089   _all-access_all-ag...                               11485845.0
30960   Special:链接搜索_all-access_all-a...                 10320403.0
63510   Special:链接搜索_desktop_all-agents                  10320336.0
60711   _desktop_all-agents                                 7968443.0
30446   瑯琊榜_(電視劇)_all-access_all-agents                 5891589.0

ru
        Page                                                total
99322   Заглавная_страница_all-access...                    1.086019e+09
103123  Заглавная_страница_desktop_al...                    7.428800e+08
17670   Заглавная_страница_mobile-web...                    3.279304e+08
99537   Служебная:Поиск_all-access_al...                    1.037643e+08
103349  Служебная:Поиск_desktop_all-a...                    9.866417e+07
100414  Служебная:Ссылки_сюда_all-acc...                    2.510200e+07
104195  Служебная:Ссылки_сюда_desktop...                    2.505816e+07
97670   Special:_all-access_all...                          2.437457e+07
101457  Special:_desktop_all-ag...                          2.195847e+07
98301   Служебная:Вход_all-access_all...                    1.216259e+07

es
        Page                                                total
92205   Wikipedia:_all-access_...                           751492304.0
95855   Wikipedia:_mobile-web_...                           565077372.0
90810   Especial:_all-access_al...                          194491245.0
71199   Wikipedia:_desktop_all...                           165439354.0
69939   Especial:_desktop_all-a...                          160431271.0
94389   Especial:_mobile-web_al...                          34059966.0
90813   Especial:_all-access_al...                          33983359.0
143440  Wikipedia:_all-access_...                           31615409.0
93094   Lali_Espó_all-access_all-...                        26602688.0
69942   Especial:_desktop_all-a...                          25747141.0

The remaining output is omitted.
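The rankings in this section are led by main pages and special pages such as search rather than by ordinary entries. The sketch below shows one possible way to rank pages by total views with nlargest while optionally skipping such pages. The top_pages_by_total helper, its skip_prefixes parameter and the toy data are hypothetical, and which prefixes to skip depends on the language edition.

```python
import pandas as pd


def top_pages_by_total(frame, n=3, skip_prefixes=()):
    """Return the n pages with the largest total views.

    `frame` is assumed to hold the page name in a 'Page' column first,
    followed only by numeric day columns.
    """
    prefixes = tuple(skip_prefixes)
    # str.startswith with an empty tuple is always False, so nothing is skipped by default.
    keep = ~frame['Page'].map(lambda p: p.startswith(prefixes))
    totals = frame.loc[keep].iloc[:, 1:].sum(axis=1)
    top = totals.nlargest(n)
    out = top.to_frame('total')
    out.insert(0, 'Page', frame.loc[top.index, 'Page'])
    return out


# Made-up data for illustration.
toy = pd.DataFrame({
    'Page': ['Special:Search', 'Main', 'Cat', 'Dog'],
    'd1': [100, 50, 5, 7],
    'd2': [100, 60, 5, 1],
})
all_top = top_pages_by_total(toy, n=2)
entry_top = top_pages_by_total(toy, n=2, skip_prefixes=('Special:',))
print(all_top)
print(entry_top)
```

Summing only the day columns also avoids mixing the string Page column into the row totals.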

Reference:

/course/introduction.htm?courseId=1003590004#/courseDetail?tab=1
