python-files:htmlparser

@@ Line 1: / Line 1: @@
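+====== Parsing HTML Files with Python's HTMLParser ======
+HTMLParser is a module in Python's standard library. It is simple to use and makes it easy to analyze HTML files.\\
+This article briefly covers how to use HTMLParser.\\
+To use it, define a class that inherits from HTMLParser and override the methods:
+  * handle_starttag(tag, attrs)
+  * handle_startendtag(tag, attrs)
+  * handle_endtag(tag)
+to implement the behavior you need.
+tag is the name of the HTML tag, and attrs is a list of (attribute, value) tuples.\\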
+HTMLParser automatically converts tag names and attribute names to lowercase; attribute values are left unchanged.
+The following example extracts all links from a piece of HTML:
+<code python>
+# Python 3 moved this class to html.parser; fall back to the Python 2 module name.
+try:
+    from html.parser import HTMLParser
+except ImportError:
+    from HTMLParser import HTMLParser
+
+class MyHTMLParser(HTMLParser):
+    def __init__(self):
+        HTMLParser.__init__(self)
+        self.links = []  # href values collected from <a> tags
+
+    def handle_starttag(self, tag, attrs):
+        # Tag and attribute names arrive already lowercased.
+        if tag == "a":
+            for (name, value) in attrs:
+                if name == "href":
+                    self.links.append(value)
+
+if __name__ == "__main__":
+    html_code = """
+    <a href="www.google.com"> google.com</a>
+    <A Href="www.pythonclub.org"> PythonClub </a>
+    <A HREF = "www.sina.com.cn"> Sina </a>
+    """
+    hp = MyHTMLParser()
+    hp.feed(html_code)
+    hp.close()
+    print(hp.links)
+</code>
+The output is:
+<file>
+['www.google.com', 'www.pythonclub.org', 'www.sina.com.cn']
+</file>
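To see when each of the three handler methods listed earlier is called, and that only names (not attribute values) are lowercased, a small tracing subclass can help. This is a minimal sketch for Python 3; the names TraceParser and trace are made up for illustration and are not part of the library.

```python
from html.parser import HTMLParser  # Python 3 name; Python 2 used the HTMLParser module


class TraceParser(HTMLParser):
    """Hypothetical helper: records every handler call as a tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # Overriding this replaces the default, which would call
        # handle_starttag() and then handle_endtag().
        self.events.append(("startend", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def trace(html):
    """Feed html to a fresh TraceParser and return the recorded events."""
    parser = TraceParser()
    parser.feed(html)
    parser.close()
    return parser.events


if __name__ == "__main__":
    for event in trace('<A HREF="Page.HTML">x</A><BR/>'):
        print(event)
```

Each recorded tuple starts with the kind of handler that fired, so it is easy to check which method a given piece of markup ends up in before writing the real parser.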
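+To extract the link from an image tag such as
+<file>
+<img src='http://www.google.com/intl/zh-CN_ALL/images/logo.gif' />
+</file>
+override handle_startendtag(tag, attrs), which the parser calls for self-closing tags written in the XHTML style (ending in "/>"). Its default implementation simply calls handle_starttag() and then handle_endtag().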
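A possible way to collect image links along those lines is sketched below for Python 3. The names ImageSrcParser and extract_image_links are illustrative assumptions, not library names. The sketch also handles handle_starttag, on the assumption that some pages write <img> without the trailing slash.

```python
from html.parser import HTMLParser  # Python 3 name; Python 2 used the HTMLParser module


class ImageSrcParser(HTMLParser):
    """Hypothetical helper: collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing tags such as <img ... />.
        self._collect(tag, attrs)

    def handle_starttag(self, tag, attrs):
        # Called for <img ...> written without the trailing slash.
        self._collect(tag, attrs)

    def _collect(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                # value is None for attributes that have no value at all
                if name == "src" and value is not None:
                    self.images.append(value)


def extract_image_links(html):
    """Return the src values of all <img> tags found in html."""
    parser = ImageSrcParser()
    parser.feed(html)
    parser.close()
    return parser.images


if __name__ == "__main__":
    sample = "<img src='http://www.google.com/intl/zh-CN_ALL/images/logo.gif' />"
    print(extract_image_links(sample))
```

Because handle_startendtag is overridden here, the default forwarding to handle_starttag no longer happens, so each image is recorded only once.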

python-files/htmlparser.txt · Last modified: 2010/06/02 01:18 (external edit)