Help needed with BS4
<body><html> <table border="1" width="100%" cellspacing="0" cellpadding="1"> <tr bgcolor="#3366FF"> <td align="left" width="10%" valign="top"><font color="#FFFFFF"> Date </font></td> <td align="left" width="10%" valign="top"><font color="#FFFFFF"> Day </font></td> <td align="left" width="10%" valign="top"><font color="#FFFFFF"> Time </font></td> <td align="left" width="10%" valign="top"><font color="#FFFFFF"> Course </font></td> <td align="left" width="40%" valign="top"><font color="#FFFFFF"> Course Title </font></td> <td align="left" width="10%" valign="top"><font color="#FFFFFF"> Duration </font></td> <tr align="yes" valign="yes" bgcolor="#99CCFF"> <td align="left" width="10%" valign="top"> 24 November 2017 </td> <td align="left" width="10%" valign="top"> Friday </td> <td align="left" width="10%" valign="top"> 9.00 am </td> <td align="left" width="10%" valign="top"> AC1101 </td> <td align="left" width="40%" valign="top"> ACCOUNTING I </td> <td align="left" width="10%" valign="top"> 2.5 </td> </tr> <tr align="yes" valign="yes" bgcolor="#FFFFFF"> <td align="left" width="10%" valign="top"> 24 November 2017 </td> <td align="left" width="10%" valign="top"> Friday </td> <td align="left" width="10%" valign="top"> 9.00 am </td> <td align="left" width="10%" valign="top"> AD1101 </td> <td align="left" width="40%" valign="top"> FINANCIAL ACCOUNTING </td> <td align="left" width="10%" valign="top"> 2.5 </td> </tr> <tr align="yes" valign="yes" bgcolor="#99CCFF"> <td align="left" width="10%" valign="top"> 24 November 2017 </td> <td align="left" width="10%" valign="top"> Friday </td> <td align="left" width="10%" valign="top"> 9.00 am </td> <td align="left" width="10%" valign="top"> BA3201 </td> <td align="left" width="40%" valign="top"> LIFE CONTINGENCIES AND DEMOGRAPHY </td> <td align="left" width="10%" valign="top"> 3 </td> <tr align="yes" valign="yes" bgcolor="#FFFFFF"> </table> </body></html>
I have an HTML file like the one above and want to export it to JSON in this format:
{"AC1101":{"date":"21 April 2017","day":"Friday","time":"9.00 am","code":"AC1101","name":"ACCOUNTING I","duration":"2.5"},"AD1101":{"date":"21 April 2017","day":"Friday","time":"9.00 am","code":"AD1101","name":"FINANCIAL ACCOUNTING","duration":"2.5"},"BA2201":{"date":"21 April 2017","day":"Friday","time":"9.00 am","code":"BA2201","name":"ACTUARIAL ECONOMICS","duration":"2.5"}}
https://gist.github.com/wudaown/c4f46daa4bd6edc42b8d870fd77c7322
How can I do this export with bs4? I would rather not use regular expressions.
Thanks.
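#!/usr/bin/python3
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# Parse the file; naming the parser explicitly avoids bs4's "no parser specified" warning
with open('tmp.html') as f:
    soup = BeautifulSoup(f, 'html.parser')

# Collect the stripped text of every cell, in document order
data = []
for i in soup.find_all('td'):
    data.append(i.text.strip())

# Six cells per row; row 0 is the header, so start from row 1
r = len(data) // 6
d = dict()
for i in range(1, r):
    d.update({data[3+i*6]: {'date': data[0+i*6], 'day': data[1+i*6], 'time': data[2+i*6], 'code': data[3+i*6], 'name': data[4+i*6], 'duration': data[5+i*6]}})
for k, v in d.items():
    print(k, v)
----------------------- Selected replies below -----------------------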
A:
In [1]: from lxml import etree
In [2]: with open('tmp.html','r') as f:
   ...:     tree=etree.HTML(f.read())
In [10]: tmp=tree.xpath('//tr')
In [29]: import json
In [37]: out=list()
    ...: for tmp1 in tmp[1:]:
    ...:     i=0
    ...:     dict_d={1:'Date',2:'Day',3:'Time',4:'Course',5:' Course Title',6:'Duration'}
    ...:     t1=dict()
    ...:     for t in tmp1:
    ...:         i=i+1
    ...:         t2=t.xpath('text()')[0]
    ...:         t1[dict_d[i]]=t2
    ...:     out.append(t1)
In [45]: out2=dict()
    ...: for o in out:
    ...:     try:
    ...:         out2[o['Course']]={'Course Title':o[' Course Title'],'Date':o['Date'],'Day':o['Day'],'Duration':o['Duration'],'Time':o['Time']}
    ...:     except:
    ...:         pass
In [46]: out2
Out[46]:
{' AC1101 ': {'Course Title': ' ACCOUNTING I ',
'Date': ' 24 November 2017 ',
'Day': ' Friday ',
'Duration': ' 2.5 ',
'Time': ' 9.00 am '},
' AD1101 ': {'Course Title': ' FINANCIAL ACCOUNTING ',
'Date': ' 24 November 2017 ',
'Day': ' Friday ',
'Duration': ' 2.5 ',
'Time': ' 9.00 am '},
' BA3201 ': {'Course Title': ' LIFE CONTINGENCIES AND DEMOGRAPHY ',
'Date': ' 24 November 2017 ',
'Day': ' Friday ',
'Duration': ' 3 ',
'Time': ' 9.00 am '}}
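Note that the keys and values in this output still carry the padding spaces from the HTML cells (for example ' AC1101 '). A minimal sketch of one way to clean that up afterwards; `strip_nested` and the `padded` sample are hypothetical names used only for illustration, and the sketch assumes every value is a string:

```python
def strip_nested(table):
    """Return a copy with surrounding whitespace removed from keys and values."""
    return {
        code.strip(): {field.strip(): value.strip() for field, value in row.items()}
        for code, row in table.items()
    }


# A cut-down sample in the same shape as the output above
padded = {' AC1101 ': {'Course Title': ' ACCOUNTING I ', 'Duration': ' 2.5 '}}
cleaned = strip_nested(padded)
print(cleaned)
```

Calling `.strip()` on each cell while building the rows would do the same job earlier; this version just leaves the original session untouched.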
A:
from lxml import etree

with open('tmp.html','r') as f:
    tree=etree.HTML(f.read())
tmp=tree.xpath('//tr')

# Column position -> field name
dict_d={1:'Date',2:'Day',3:'Time',4:'Course',5:'Course Title',6:'Duration'}
out=list()
for tmp1 in tmp[1:]:  # skip the header row
    i=0
    t1=dict()
    for t in tmp1:
        i=i+1
        t2=t.xpath('text()')[0]  # cell text, still padded with spaces
        t1[dict_d[i]]=t2
    out.append(t1)

# Re-key the rows by course code
out2=dict()
for o in out:
    try:
        out2[o['Course']]={'Course Title':o['Course Title'],'Date':o['Date'],'Day':o['Day'],'Duration':o['Duration'],'Time':o['Time']}
    except KeyError:
        # the trailing empty <tr> yields an empty dict; skip it
        pass
print(out2)
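Since the question asked for bs4 specifically, here is a minimal sketch of the same row-by-row idea with BeautifulSoup and no regular expressions. The `FIELDS` names follow the JSON in the question; `table_to_dict` and `SAMPLE` are made-up names for illustration. It assumes the first `<tr>` is the header and that every data row has exactly six cells.

```python
import json

from bs4 import BeautifulSoup

# Field names taken from the JSON format in the question, in column order
FIELDS = ["date", "day", "time", "code", "name", "duration"]


def table_to_dict(html):
    """Map each course code to a dict holding that row's fields."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    # Assumption: the first <tr> is the header row, so skip it
    for row in soup.find_all("tr")[1:]:
        # recursive=False keeps only this row's own cells, in case an
        # unclosed <tr> makes later rows nest inside it
        cells = [td.get_text(strip=True)
                 for td in row.find_all("td", recursive=False)]
        if len(cells) != len(FIELDS):
            continue  # skip empty or incomplete rows
        record = dict(zip(FIELDS, cells))
        result[record["code"]] = record
    return result


# Cut-down sample in the same shape as the question's table,
# including the unclosed header row and the empty trailing row
SAMPLE = """
<table>
<tr><td><font> Date </font></td><td><font> Day </font></td>
<td><font> Time </font></td><td><font> Course </font></td>
<td><font> Course Title </font></td><td><font> Duration </font></td>
<tr><td> 24 November 2017 </td><td> Friday </td><td> 9.00 am </td>
<td> AC1101 </td><td> ACCOUNTING I </td><td> 2.5 </td></tr>
<tr><td> 24 November 2017 </td><td> Friday </td><td> 9.00 am </td>
<td> AD1101 </td><td> FINANCIAL ACCOUNTING </td><td> 2.5 </td></tr>
<tr>
</table>
"""

result = table_to_dict(SAMPLE)
print(json.dumps(result, indent=2))
```

For the real file, something like `table_to_dict(open('tmp.html').read())` should work the same way, and `json.dump(result, fp)` would write the result to a file instead of printing it.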
A: Why not use pyquery? (smirk)