modules:beautifulsoup:start [2011/03/26 04:01] (current version)
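====== Beautiful Soup Tutorial ======
Beautiful Soup is a Python module for parsing HTML/XML, and it is quite powerful. I recently read its documentation carefully and finally understood some of it.
I plan to study it properly and collect notes on Beautiful Soup usage on this wiki along the way, since the official documentation is not very easy to follow.

Official [[modules:BeautifulSoup:start]] page: http://www.crummy.com/software/BeautifulSoup/
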
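===== Downloading and Installing BeautifulSoup =====
Download it from: \\
http://www.crummy.com/software/BeautifulSoup/

Installation is simple: BeautifulSoup consists of a single file, so copying that file into your working directory is enough. After that it can be imported in any of the following ways:
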
<code python>
from BeautifulSoup import BeautifulSoup          # For processing HTML
from BeautifulSoup import BeautifulStoneSoup     # For processing XML
import BeautifulSoup                             # To get everything
</code>

===== Creating a BeautifulSoup Object =====
A BeautifulSoup object is created from a string of HTML text.

The following code creates a BeautifulSoup object:

<code python>
from BeautifulSoup import BeautifulSoup
doc = ['<html><head><title>PythonClub.org</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.',
       '</html>']
soup = BeautifulSoup(''.join(doc))
</code>

===== Finding Specific Elements in HTML =====

BeautifulSoup lets you reach a specific HTML element directly with the "." operator.

==== Searching by HTML tag: finding the HTML title ====
Use soup.html.head.title to get the title tag, then read its tag name and string value from it.
<code python>
>>> soup.html.head.title
<title>PythonClub.org</title>
>>> soup.html.head.title.name
u'title'
>>> soup.html.head.title.string
u'PythonClub.org'
</code>
You can also skip the intermediate tags and locate the element directly with soup.title:
<code python>
>>> soup.title
<title>PythonClub.org</title>
</code>

==== Searching by HTML content: finding the whole tag that contains a given string ====

The following example finds the text nodes that contain "para", then uses .parent to reach the enclosing HTML tag and its contents:
<code python>
>>> import re
>>> soup.findAll(text=re.compile("para"))
[u'This is paragraph ', u'This is paragraph ']
>>> soup.findAll(text=re.compile("para"))[0].parent
<p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>
>>> soup.findAll(text=re.compile("para"))[0].parent.contents
[u'This is paragraph ', <b>one</b>, u' of ptyhonclub.org.']
</code>
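
The result above contains only the text nodes in which "para" appears somewhere inside the string (here, inside the word "paragraph"), which suggests that a compiled pattern is applied with search-style matching rather than matching from the start of the string. The sketch below illustrates that assumption using only the standard `re` module; the `strings` and `ids` lists are hypothetical sample data, not output taken from Beautiful Soup.

```python
import re

# Hypothetical sample data: text nodes and id values similar to those in the page above.
strings = ["This is paragraph ", "one", " of ptyhonclub.org."]
ids = ["firstpara", "secondpara", "paragraph-note"]

unanchored = re.compile("para")   # matches "para" anywhere in the string
anchored = re.compile("para$")    # matches only when the string ends with "para"

# search() scans the whole string, so "paragraph" counts as a hit for "para".
text_hits = [s for s in strings if unanchored.search(s)]
# The "$" anchor rejects "paragraph-note" because "para" is not at the end.
id_hits = [i for i in ids if anchored.search(i)]

print(text_hits)  # ['This is paragraph ']
print(id_hits)    # ['firstpara', 'secondpara']
```

Anchoring the pattern with `^` or `$` is the usual way to narrow such a search when a plain substring match returns too much.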

==== Searching HTML content by attribute ====

The id used here is an HTML attribute (the one CSS selectors commonly refer to). It can be passed either as a keyword argument or through the attrs dictionary; both calls return the same result:

<code python>
import re

soup.findAll(id=re.compile("para$"))
# [<p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>,
#  <p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.</p>]

soup.findAll(attrs={'id' : re.compile("para$")})
# [<p id="firstpara" align="center">This is paragraph <b>one</b> of ptyhonclub.org.</p>,
#  <p id="secondpara" align="blah">This is paragraph <b>two</b> of pythonclub.org.</p>]
</code>
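
The examples on this page use the Beautiful Soup 3 API (a `BeautifulSoup` module and `findAll`). For readers on Python 3, the following is a rough sketch of the same queries in Beautiful Soup 4, assuming the `bs4` package is installed. The sample HTML is a simplified variant of the document above with explicitly closed `<p>` tags, because different parsers repair unclosed tags differently; the variable names are illustrative only.

```python
import re
from bs4 import BeautifulSoup  # Beautiful Soup 4; assumed to be installed

# Simplified, well-formed variant of the sample document used on this page.
html = (
    '<html><head><title>PythonClub.org</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)

# bs4 takes the parser name explicitly; "html.parser" ships with Python.
soup = BeautifulSoup(html, "html.parser")

# Dot access works as in the examples above.
title_text = soup.title.string

# find_all() is the bs4 spelling of findAll(); attributes can be keyword arguments.
para_ids = [p["id"] for p in soup.find_all(id=re.compile("para$"))]

# string= searches text nodes; .parent gives the enclosing tag.
text_parents = [s.parent.name for s in soup.find_all(string=re.compile("para"))]

print(title_text)
print(para_ids)
print(text_parents)
```

The structure of the queries is the same as in the Beautiful Soup 3 examples; only the import path, the explicit parser argument, and the method spelling differ.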
===== Understanding BeautifulSoup in Depth =====

  * [[modules:beautifulsoup:encode|BeautifulSoup and encodings]]

  * [[modules:beautifulsoup:tricks|BeautifulSoup tips and tricks]]