Python difflib | SequenceMatcher | Differ | HtmlDiff Usage

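Introduction

difflib is the Python standard-library module for comparing sequences (strings, lists) and reporting their differences.
Its three main classes are:

  • SequenceMatcher compares sequences of any type (including strings)
  • Differ compares sequences of text lines
  • HtmlDiff renders the comparison result as HTML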
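
The rest of this page concentrates on SequenceMatcher, so here is a brief sketch of the other two classes. The sample lines and the "old"/"new" column labels are assumptions made up for illustration.

```python
import difflib

# Hypothetical sample input: two short lists of text lines.
old_lines = ["pythonclub.org is wonderful", "difflib compares sequences"]
new_lines = ["Pythonclub.org also wonderful", "difflib compares sequences"]

# Differ works on sequences of lines. Each output line starts with a
# two-character code: "- " (only in old), "+ " (only in new),
# "  " (in both) or "? " (hint marking the characters that changed).
delta = list(difflib.Differ().compare(old_lines, new_lines))
for line in delta:
    print(line.rstrip("\n"))

# HtmlDiff renders a side-by-side comparison as an HTML table.
html_table = difflib.HtmlDiff().make_table(old_lines, new_lines, "old", "new")
print("<table" in html_table)
```

HtmlDiff also has make_file(), which wraps the same table in a complete HTML page.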

SequenceMatcher Example

Code:

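import difflib
from pprint import pprint

a = 'pythonclub.org is wonderful'
b = 'Pythonclub.org also wonderful'
# Build the SequenceMatcher; None means no element is treated as junk
s = difflib.SequenceMatcher(None, a, b)

# Matching blocks as (start in a, start in b, length); the last entry is a
# zero-length sentinel. tuple() keeps the output compact on newer Pythons,
# where the blocks are Match named tuples.
print("s.get_matching_blocks():")
pprint([tuple(m) for m in s.get_matching_blocks()])
print("")
print("s.get_opcodes():")
# Each opcode says how to turn a[i1:i2] into b[j1:j2]
for tag, i1, i2, j1, j2 in s.get_opcodes():
    print("%7s a[%d:%d] (%s) b[%d:%d] (%s)" % (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2]))
    # Implement your own handling here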

Output:

s.get_matching_blocks():
[(1, 1, 14), (16, 17, 1), (17, 19, 10), (27, 29, 0)]

s.get_opcodes():
replace a[0:1] (p) b[0:1] (P)
  equal a[1:15] (ythonclub.org ) b[1:15] (ythonclub.org )
replace a[15:16] (i) b[15:17] (al)
  equal a[16:17] (s) b[17:18] (s)
 insert a[17:17] () b[18:19] (o)
  equal a[17:27] ( wonderful) b[19:29] ( wonderful)
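
The opcodes form an edit script, so one thing the loop body can do is replay them to rebuild b from a. A minimal sketch follows; apply_opcodes is a hypothetical helper written for this page, not a difflib function.

```python
import difflib

a = 'pythonclub.org is wonderful'
b = 'Pythonclub.org also wonderful'
s = difflib.SequenceMatcher(None, a, b)

def apply_opcodes(a, b, opcodes):
    """Rebuild b from a by following the edit script (hypothetical helper)."""
    pieces = []
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == 'equal':
            pieces.append(a[i1:i2])   # shared run, copied from a
        elif tag in ('replace', 'insert'):
            pieces.append(b[j1:j2])   # new text, taken from b
        # 'delete' adds nothing: a[i1:i2] is simply dropped
    return ''.join(pieces)

rebuilt = apply_opcodes(a, b, s.get_opcodes())
print(rebuilt == b)

# ratio() summarises the same comparison as a similarity score in [0, 1].
similarity = s.ratio()
print(round(similarity, 3))
```

The same loop shape works for lists of lines or tokens, since SequenceMatcher only requires hashable elements.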

SequenceMatcher find_longest_match BUG

import difflib
 
str1 = "Poor Impulse Control: A Good Babysitter Is Hard To Find"
 
str2 = """     A Good Babysitter Is Hard To Find    This is Frederick
by Leo Lionni, the first book I picked for myself.
I was in kindergarten, I believe, which would be either 1968 or 1969.
Frederick has a specific lesson for children about how art is as
important in life as bread, but there's a secondary consideration
I took away: if we pool our talents our lives are immeasurably better.
Curiously, this book is the story of my life, however one interprets
those things. I expect Mickey Rooney to show up any time with a barn
and a plan for a show, though my mom is not making costumes. My sisters
own a toy store with a fantastic selection of imaginative children's books.
I try not to open them because I can't close them and put them back.
My tantrums are setting a bad example for the kids. Anyway, I mention
this because yesterday was Mr. Rogers' 40th anniversary. I appreciate
the peaceful gentleman more as time passes, as I play with finger puppets
in department meetings, as I eye hollow trees for Lady Elaine Fairchild
infestations. Maybe Pete can build me trolley tracks!Labels: To Take
Your Heart Away   """
 
s = difflib.SequenceMatcher(None, str1, str2)
print("%d %d" % (len(str1), len(str2)))
# The upper bounds are exclusive, so pass the full lengths
start_a, start_b, length = s.find_longest_match(0, len(str1), 0, len(str2))
print("%d %d %d" % (start_a, start_b, length))
print(str1[start_a:start_a + length])

Output:

55 1116
0 1048 1
P

Version:
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>>

However, the longest match should be "A Good Babysitter Is Hard To Find".

Solution

Swap str1 and str2 so that the longer string comes first, i.e. len(str1) > len(str2).
The output is then the expected result.
Thanks to davies(at)newsmth.
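
The swap can be tried on a smaller, self-contained input. The sketch below is illustrative only: body is a synthetic stand-in for the long blog text (the matching phrase followed by repeated filler), and longest_common is a hypothetical helper, not part of difflib.

```python
import difflib

title = "Poor Impulse Control: A Good Babysitter Is Hard To Find"
# Synthetic stand-in for the long text (an assumption for illustration):
# the phrase to be found, followed by well over 200 characters of filler.
body = ("     A Good Babysitter Is Hard To Find    "
        + "the quick brown fox jumps over the lazy dog. " * 30)

def longest_common(first, second):
    """Return the longest block SequenceMatcher reports (hypothetical helper)."""
    s = difflib.SequenceMatcher(None, first, second)
    i, j, size = s.find_longest_match(0, len(first), 0, len(second))
    return first[i:i + size]

long_second = longest_common(title, body)   # long string second: problem case
long_first = longest_common(body, title)    # long string first: the workaround
print(repr(long_second))
print(repr(long_first))
```

With the long string passed first, the reported block should contain the whole phrase, while the original order returns only a short fragment.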

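It turns out this is a known bug: http://psf.upfronthosting.co.za/roundup/tracker/issue1528074
The second string must be shorter than 200 characters.
Workaround: pass the longer string as the first argument and the shorter one as the second.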
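
On newer interpreters there is a more direct option. According to the standard difflib documentation, the behaviour comes from an "automatic junk" heuristic: when the second sequence has at least 200 items, any item that makes up more than about 1% of it is ignored while matching. Python 2.7.1 and 3.2 added an autojunk argument to switch this off. A minimal sketch, with a synthetic body as an assumed stand-in for the long text:

```python
import difflib

title = "Poor Impulse Control: A Good Babysitter Is Hard To Find"
# Synthetic stand-in for the long text (an assumption for illustration).
body = ("     A Good Babysitter Is Hard To Find    "
        + "the quick brown fox jumps over the lazy dog. " * 30)

# Default behaviour: the heuristic is active because len(body) >= 200.
default = difflib.SequenceMatcher(None, title, body)
d = default.find_longest_match(0, len(title), 0, len(body))

# autojunk=False disables the heuristic; the argument order is unchanged.
exact = difflib.SequenceMatcher(None, title, body, autojunk=False)
m = exact.find_longest_match(0, len(title), 0, len(body))

default_text = title[d.a:d.a + d.size]
fixed_text = title[m.a:m.a + m.size]
print(repr(default_text))
print(repr(fixed_text))
```

Disabling the heuristic can make matching slower on large inputs, so it is worth doing only when exact results matter.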

python-basic/difflib.txt · Last modified: 2010/06/02 09:18 (external edit)