我正在寻找一种方法来测试一个给定的字符串是否在整个字符串中重复自己。
例子:
[
'0045662100456621004566210045662100456621', # '00456621'
'0072992700729927007299270072992700729927', # '00729927'
'001443001443001443001443001443001443001443', # '001443'
'037037037037037037037037037037037037037037037', # '037'
'047619047619047619047619047619047619047619', # '047619'
'002457002457002457002457002457002457002457', # '002457'
'001221001221001221001221001221001221001221', # '001221'
'001230012300123001230012300123001230012300123', # '00123'
'0013947001394700139470013947001394700139470013947', # '0013947'
'001001001001001001001001001001001001001001001001001', # '001'
'001406469760900140646976090014064697609', # '0014064697609'
]
是重复自己的字符串,和
[
'004608294930875576036866359447',
'00469483568075117370892018779342723',
'004739336492890995260663507109',
'001508295625942684766214177978883861236802413273',
'007518796992481203',
'0071942446043165467625899280575539568345323741',
'0434782608695652173913',
'0344827586206896551724137931',
'002481389578163771712158808933',
'002932551319648093841642228739',
'0035587188612099644128113879',
'003484320557491289198606271777',
'00115074798619102416570771',
]
是一些不这样做的例子。
我给出的字符串的重复部分可能相当长,字符串本身可能有500个或更多字符,因此循环每个字符试图构建一个模式,然后检查模式与字符串的其余部分似乎非常慢。再乘以几百个字符串,我看不出任何直观的解决方案。
我研究了一下正则表达式,当你知道你在寻找什么,或者至少知道你在寻找的模式的长度时,它们似乎很有用。不幸的是,我两个都不知道。
我怎么知道一个字符串是否在重复它自己,如果是的话,最短的重复子序列是什么?
这里有一个使用正则表达式的解决方案。
import re
REPEATER = re.compile(r"(.+?)\1+$")
def repeated(s):
match = REPEATER.match(s)
return match.group(1) if match else None
遍历问题中的例子:
examples = [
'0045662100456621004566210045662100456621',
'0072992700729927007299270072992700729927',
'001443001443001443001443001443001443001443',
'037037037037037037037037037037037037037037037',
'047619047619047619047619047619047619047619',
'002457002457002457002457002457002457002457',
'001221001221001221001221001221001221001221',
'001230012300123001230012300123001230012300123',
'0013947001394700139470013947001394700139470013947',
'001001001001001001001001001001001001001001001001001',
'001406469760900140646976090014064697609',
'004608294930875576036866359447',
'00469483568075117370892018779342723',
'004739336492890995260663507109',
'001508295625942684766214177978883861236802413273',
'007518796992481203',
'0071942446043165467625899280575539568345323741',
'0434782608695652173913',
'0344827586206896551724137931',
'002481389578163771712158808933',
'002932551319648093841642228739',
'0035587188612099644128113879',
'003484320557491289198606271777',
'00115074798619102416570771',
]
for e in examples:
sub = repeated(e)
if sub:
print("%r: %r" % (e, sub))
else:
print("%r does not repeat." % e)
... 产生如下输出:
'0045662100456621004566210045662100456621': '00456621'
'0072992700729927007299270072992700729927': '00729927'
'001443001443001443001443001443001443001443': '001443'
'037037037037037037037037037037037037037037037': '037'
'047619047619047619047619047619047619047619': '047619'
'002457002457002457002457002457002457002457': '002457'
'001221001221001221001221001221001221001221': '001221'
'001230012300123001230012300123001230012300123': '00123'
'0013947001394700139470013947001394700139470013947': '0013947'
'001001001001001001001001001001001001001001001001001': '001'
'001406469760900140646976090014064697609': '0014064697609'
'004608294930875576036866359447' does not repeat.
'00469483568075117370892018779342723' does not repeat.
'004739336492890995260663507109' does not repeat.
'001508295625942684766214177978883861236802413273' does not repeat.
'007518796992481203' does not repeat.
'0071942446043165467625899280575539568345323741' does not repeat.
'0434782608695652173913' does not repeat.
'0344827586206896551724137931' does not repeat.
'002481389578163771712158808933' does not repeat.
'002932551319648093841642228739' does not repeat.
'0035587188612099644128113879' does not repeat.
'003484320557491289198606271777' does not repeat.
'00115074798619102416570771' does not repeat.
正则表达式(.+?)\1+$分为三部分:
(.+?)是一个匹配组,包含至少一个(但尽可能少)任意字符(因为+?是贪婪的,)。
\1+检查第一部分中匹配组是否至少重复一次。
$检查字符串的结尾,以确保在重复的子字符串之后没有额外的非重复内容(并且使用re.match()确保在重复的子字符串之前没有非重复文本)。
在Python 3.4及以后版本中,你可以放弃$而使用re.fullmatch(),或者(至少在任何Python 2.3版本中)使用re.search()和正则表达式^(.+?)\1+$,所有这些都取决于个人的喜好。
这里有一个使用正则表达式的解决方案。
import re
REPEATER = re.compile(r"(.+?)\1+$")
def repeated(s):
match = REPEATER.match(s)
return match.group(1) if match else None
遍历问题中的例子:
examples = [
'0045662100456621004566210045662100456621',
'0072992700729927007299270072992700729927',
'001443001443001443001443001443001443001443',
'037037037037037037037037037037037037037037037',
'047619047619047619047619047619047619047619',
'002457002457002457002457002457002457002457',
'001221001221001221001221001221001221001221',
'001230012300123001230012300123001230012300123',
'0013947001394700139470013947001394700139470013947',
'001001001001001001001001001001001001001001001001001',
'001406469760900140646976090014064697609',
'004608294930875576036866359447',
'00469483568075117370892018779342723',
'004739336492890995260663507109',
'001508295625942684766214177978883861236802413273',
'007518796992481203',
'0071942446043165467625899280575539568345323741',
'0434782608695652173913',
'0344827586206896551724137931',
'002481389578163771712158808933',
'002932551319648093841642228739',
'0035587188612099644128113879',
'003484320557491289198606271777',
'00115074798619102416570771',
]
for e in examples:
sub = repeated(e)
if sub:
print("%r: %r" % (e, sub))
else:
print("%r does not repeat." % e)
... 产生如下输出:
'0045662100456621004566210045662100456621': '00456621'
'0072992700729927007299270072992700729927': '00729927'
'001443001443001443001443001443001443001443': '001443'
'037037037037037037037037037037037037037037037': '037'
'047619047619047619047619047619047619047619': '047619'
'002457002457002457002457002457002457002457': '002457'
'001221001221001221001221001221001221001221': '001221'
'001230012300123001230012300123001230012300123': '00123'
'0013947001394700139470013947001394700139470013947': '0013947'
'001001001001001001001001001001001001001001001001001': '001'
'001406469760900140646976090014064697609': '0014064697609'
'004608294930875576036866359447' does not repeat.
'00469483568075117370892018779342723' does not repeat.
'004739336492890995260663507109' does not repeat.
'001508295625942684766214177978883861236802413273' does not repeat.
'007518796992481203' does not repeat.
'0071942446043165467625899280575539568345323741' does not repeat.
'0434782608695652173913' does not repeat.
'0344827586206896551724137931' does not repeat.
'002481389578163771712158808933' does not repeat.
'002932551319648093841642228739' does not repeat.
'0035587188612099644128113879' does not repeat.
'003484320557491289198606271777' does not repeat.
'00115074798619102416570771' does not repeat.
正则表达式(.+?)\1+$分为三部分:
(.+?)是一个匹配组,包含至少一个(但尽可能少)任意字符(因为+?是贪婪的,)。
\1+检查第一部分中匹配组是否至少重复一次。
$检查字符串的结尾,以确保在重复的子字符串之后没有额外的非重复内容(并且使用re.match()确保在重复的子字符串之前没有非重复文本)。
在Python 3.4及以后版本中,你可以放弃$而使用re.fullmatch(),或者(至少在任何Python 2.3版本中)使用re.search()和正则表达式^(.+?)\1+$,所有这些都取决于个人的喜好。