Nested tables can't be read in a single call, but you can roll your own HTML reader and use read_html on the individual table cells:
import pandas as pd
import io
import bs4

with open('up_pf00344.test.html') as f:
    html = f.read()
soup = bs4.BeautifulSoup(html, 'lxml')
results = soup.find(attrs={'id': 'results'})

# get first visible header row as dataframe headers
for row in results.thead.find_all('tr'):
    if 'display:none' not in row.get('style', ''):
        df = pd.DataFrame(columns=[col.get_text() for col in row.find_all('th')])
        break

# append all table rows to dataframe
for row in results.tbody.find_all('tr', recursive=False):
    if 'display:none' in row.get('style', ''):
        continue
    df_row = []
    for col in row.find_all('td', recursive=False):
        # cells containing a nested table become DataFrames; others stay plain text
        table = col.find_all('table')
        # wrap in StringIO: newer pandas deprecates passing literal HTML strings
        df_row.append(pd.read_html(io.StringIO(str(col)))[0] if table else col.get_text())
    df.loc[len(df)] = df_row
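Calling df.iloc[0].map(type) on the result shows the type of each cell in the first row: str for plain-text cells and pandas DataFrame for cells that contained a nested table.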
Bonus: since the table rows have an id, you can use it as the DataFrame index with df.loc[row.get('id')] = df_row instead of df.loc[len(df)] = df_row.
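
A minimal sketch of the id-as-index variant, assuming a small made-up page: the inline HTML, the row ids (row_a, row_b, row_c) and the column names are hypothetical stand-ins for the real file, and the cells are plain text only to keep it short. It uses the built-in 'html.parser' so that it runs without lxml.

```python
import bs4
import pandas as pd

# Hypothetical stand-in for the real page: two visible rows and one hidden row.
html = """
<table id="results">
  <thead><tr><th>name</th><th>score</th></tr></thead>
  <tbody>
    <tr id="row_a"><td>alpha</td><td>1</td></tr>
    <tr id="row_b" style="display:none"><td>hidden</td><td>0</td></tr>
    <tr id="row_c"><td>gamma</td><td>3</td></tr>
  </tbody>
</table>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')
results = soup.find(attrs={'id': 'results'})

# Use the header row for the column names.
header = results.thead.find('tr')
df = pd.DataFrame(columns=[th.get_text() for th in header.find_all('th')])

for row in results.tbody.find_all('tr', recursive=False):
    if 'display:none' in row.get('style', ''):
        continue  # skip hidden rows, as in the main loop
    cells = [td.get_text() for td in row.find_all('td', recursive=False)]
    df.loc[row.get('id')] = cells  # the row id becomes the index label

print(df.loc['row_c', 'name'])  # prints: gamma
```

With the id as the index, a row can be looked up by label rather than by position. This only works cleanly if every visible row has a unique id; a row without one would get None as its label.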