This page shows the differences between the revision you selected and the current version.
python-basic:difflib [2010/06/02 01:18] |
python-basic:difflib [2010/06/02 01:18] (current version) |
||
---|---|---|---|
Line 1: | Line 1: | ||
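+ | ====== Python difflib|SequenceMatcher|Differ|HtmlDiff Usage ====== | ||
+ | ===== Introduction ===== | ||
+ | difflib is the module Python provides for comparing sequences (for example, lists of strings) and reporting their differences. \\ | ||
+ | It implements three classes: \\ | ||
+ | * SequenceMatcher: compares sequences of any type (including strings) | ||
+ | * Differ: compares sequences of text lines | ||
+ | * HtmlDiff: outputs the comparison result as HTML | ||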
+ | |||
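The article only demonstrates SequenceMatcher, so here is a minimal sketch of the other two classes. The sample lines and the column labels "old" and "new" are made up for illustration; only `Differ.compare` and `HtmlDiff.make_table` come from difflib itself.

```python
import difflib

# Hypothetical sample input: each list element is one line of text.
old_lines = ["pythonclub.org is wonderful\n", "second line\n"]
new_lines = ["Pythonclub.org also wonderful\n", "second line\n"]

# Differ yields a line-by-line delta. Each output line starts with
# "- " (only in old), "+ " (only in new), "  " (in both) or "? " (hint).
delta = list(difflib.Differ().compare(old_lines, new_lines))
print("".join(delta))

# HtmlDiff renders the same comparison as a side-by-side HTML table.
html_table = difflib.HtmlDiff().make_table(old_lines, new_lines, "old", "new")
print(html_table[:80])
```

`make_table` returns only the table markup; `make_file` would return a complete HTML document instead.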
+ | ===== SequenceMatcher Example ===== | ||
+ | |||
+ | ==== Code: ==== | ||
+ | |||
+ | <code python> | ||
+ | import difflib | ||
+ | from pprint import pprint | ||
+ | |||
+ | a = 'pythonclub.org is wonderful' | ||
+ | b = 'Pythonclub.org also wonderful' | ||
+ | # Construct the SequenceMatcher | ||
+ | s = difflib.SequenceMatcher(None, a, b) | ||
+ | |||
+ | # Get the matching blocks | ||
+ | print("s.get_matching_blocks():") | ||
+ | pprint(s.get_matching_blocks()) | ||
+ | |||
+ | print("s.get_opcodes():") | ||
+ | for tag, i1, i2, j1, j2 in s.get_opcodes(): | ||
+ |     print("%7s a[%d:%d] (%s) b[%d:%d] (%s)" % (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2])) | ||
+ |     # Implement your own handling here | ||
+ | |||
+ | </code> | ||
+ | |||
+ | ==== Output: ==== | ||
+ | <file> | ||
+ | s.get_matching_blocks(): | ||
+ | [(1, 1, 14), (16, 17, 1), (17, 19, 10), (27, 29, 0)] | ||
+ | |||
+ | s.get_opcodes(): | ||
+ | replace a[0:1] (p) b[0:1] (P) | ||
+ | equal a[1:15] (ythonclub.org ) b[1:15] (ythonclub.org ) | ||
+ | replace a[15:16] (i) b[15:17] (al) | ||
+ | equal a[16:17] (s) b[17:18] (s) | ||
+ | insert a[17:17] () b[18:19] (o) | ||
+ | equal a[17:27] ( wonderful) b[19:29] ( wonderful) | ||
+ | </file> | ||
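As one possible way to fill in the "implement your own handling" step, the sketch below replays the opcodes to rebuild `b` from `a`. The helper name `apply_opcodes` is hypothetical, not part of difflib.

```python
import difflib

def apply_opcodes(a, b):
    """Rebuild b by replaying the opcodes that turn a into b."""
    matcher = difflib.SequenceMatcher(None, a, b)
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged text can be copied from a.
            pieces.append(a[i1:i2])
        elif tag in ("replace", "insert"):
            # New text has to come from b.
            pieces.append(b[j1:j2])
        # "delete": a[i1:i2] is simply dropped.
    return "".join(pieces)

a = 'pythonclub.org is wonderful'
b = 'Pythonclub.org also wonderful'
rebuilt = apply_opcodes(a, b)
print(rebuilt == b)
```

The same loop shape works for any opcode consumer, for instance one that highlights `replace` and `insert` ranges instead of concatenating them.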
+ | |||
+ | |||
+ | ===== SequenceMatcher find_longest_match BUG ===== | ||
+ | <code python> | ||
+ | import difflib | ||
+ | |||
+ | str1 = "Poor Impulse Control: A Good Babysitter Is Hard To Find" | ||
+ | |||
+ | str2 = """ A Good Babysitter Is Hard To Find This is Frederick | ||
+ | by Leo Lionni, the first book I picked for myself. | ||
+ | I was in kindergarten, I believe, which would be either 1968 or 1969. | ||
+ | Frederick has a specific lesson for children about how art is as | ||
+ | important in life as bread, but there's a secondary consideration | ||
+ | I took away: if we pool our talents our lives are immeasurably better. | ||
+ | Curiously, this book is the story of my life, however one interprets | ||
+ | those things. I expect Mickey Rooney to show up any time with a barn | ||
+ | and a plan for a show, though my mom is not making costumes. My sisters | ||
+ | own a toy store with a fantastic selection of imaginative children's books. | ||
+ | I try not to open them because I can't close them and put them back. | ||
+ | My tantrums are setting a bad example for the kids. Anyway, I mention | ||
+ | this because yesterday was Mr. Rogers' 40th anniversary. I appreciate | ||
+ | the peaceful gentleman more as time passes, as I play with finger puppets | ||
+ | in department meetings, as I eye hollow trees for Lady Elaine Fairchild | ||
+ | infestations. Maybe Pete can build me trolley tracks!Labels: To Take | ||
+ | Your Heart Away """ | ||
+ | |||
+ | s = difflib.SequenceMatcher(None, str1, str2) | ||
+ | print("%d %d" % (len(str1), len(str2))) | ||
+ | # The upper bounds are exclusive, so pass the full lengths | ||
+ | start_a, start_b, length = s.find_longest_match(0, len(str1), 0, len(str2)) | ||
+ | print("%d %d %d" % (start_a, start_b, length)) | ||
+ | print(str1[start_a:start_a + length]) | ||
+ | </code> | ||
+ | |||
+ | Output: | ||
+ | <file> | ||
+ | 55 1116 | ||
+ | 0 1048 1 | ||
+ | P | ||
+ | |||
+ | Python version: | ||
+ | Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on | ||
+ | win32 | ||
+ | Type "help", "copyright", "credits" or "license" for more information. | ||
+ | >>> | ||
+ | </file> | ||
+ | The longest match should actually be "A Good Babysitter Is Hard To Find". | ||
+ | |||
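+ | ==== Solution ==== | ||
+ | Swap str1 and str2 so that the longer string comes first, i.e. len(str1) > len(str2). \\ | ||
+ | The output is then the expected result. \\ | ||
+ | **Thanks to davies(at)newsmth** \\ | ||
+ | |||
+ | It turns out this is a known bug: http://psf.upfronthosting.co.za/roundup/tracker/issue1528074 \\ | ||
+ | The second string must stay under 200 characters: at 200 or more, difflib treats frequently occurring characters as junk. \\ | ||
+ | Workaround: pass the longer string as the first argument and the shorter one as the second. |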
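The workaround can be sketched as follows. The padded `long_text` is a made-up stand-in for the article text above, and `longest_common` is a hypothetical helper. The second call uses the `autojunk` parameter, which only exists in newer Python versions (added in 2.7.1 and 3.2), so it would not run on the Python 2.5.1 shown earlier.

```python
import difflib

short_text = "Poor Impulse Control: A Good Babysitter Is Hard To Find"
# Hypothetical stand-in for the long text: the shared phrase plus padding,
# long enough (200+ characters) to trigger the heuristic described above.
long_text = (" A Good Babysitter Is Hard To Find "
             + "this sentence is only padding. " * 10)

def longest_common(a, b, **kwargs):
    """Return the longest block of a that also occurs in b."""
    matcher = difflib.SequenceMatcher(None, a, b, **kwargs)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

# Workaround from the text: longer string first, shorter string second.
swapped = longest_common(long_text, short_text)

# Alternative on newer Pythons: keep the order, switch the heuristic off.
no_autojunk = longest_common(short_text, long_text, autojunk=False)

print(repr(swapped))
print(repr(no_autojunk))
```

Swapping works on every version because the heuristic only inspects the second sequence. Where it is available, `autojunk=False` avoids having to reorder the arguments.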