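I'm after a regex that validates a full, complex UK postcode, and matches only when the whole input string is a postcode. All of the uncommon postcode forms must be covered as well as the usual ones. For instance:

Matches

CW3 9SS
SE5 0EG
SE50EG
se5 0eg
WC2H 7LT

No match

aWC2H 7LT
WC2H 7LTa
WC2H

How do I solve this problem?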


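Current answer

I'd recommend taking a look at the UK Government Data Standard for postcodes [link now dead; archive of XML, see Wikipedia for discussion]. It gives a brief description of the data, and the attached XML schema provides a regular expression. It may not be exactly what you want, but it would be a good starting point. The RegEx differs slightly from the XML, because the definition given allows a P character in the third position of the format A9A 9AA.

The RegEx supplied by the UK Government was: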

([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))\s?[0-9][A-Za-z]{2})

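As pointed out in the Wikipedia discussion, this will allow some non-real postcodes (e.g. those starting AA or ZY), and they do provide a more rigorous test that you could try.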
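The pattern above has no anchors, so on its own it would also find a postcode embedded in a longer string. One way to get the whole-string behaviour asked for in the question is to apply it with a full-match call. The following is a minimal Python sketch under that assumption; the helper name `is_postcode` is mine and is not part of the government standard.

```python
import re

# The government-supplied pattern, copied verbatim and split across two
# string literals only to keep the line length manageable.
GOV_PATTERN = re.compile(
    r"([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})"
    r"|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?))))\s?[0-9][A-Za-z]{2})"
)


def is_postcode(text: str) -> bool:
    """Return True only if the entire string matches (hypothetical helper)."""
    # fullmatch requires the match to span the whole string, so stray
    # leading or trailing characters cause a failure.
    return GOV_PATTERN.fullmatch(text) is not None


if __name__ == "__main__":
    samples = ["CW3 9SS", "SE5 0EG", "SE50EG", "se5 0eg", "WC2H 7LT",
               "aWC2H 7LT", "WC2H 7LTa", "WC2H"]
    for sample in samples:
        print(f"{sample!r}: {is_postcode(sample)}")
```

In a regex flavour without a full-match call, wrapping the whole pattern in `^(?:...)$` should have the same effect; the wrapping group matters because the pattern contains a top-level alternation.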

Other answers

I needed a version that would work in SAS with PRXMATCH and the related functions, so I came up with this:

^[A-PR-UWYZ](([A-HK-Y]?\d\d?)|(\d[A-HJKPSTUW])|([A-HK-Y]\d[ABEHMNPRV-Y]))\s?\d[ABD-HJLNP-UW-Z]{2}$

Test cases and notes:

/* 
Notes
The letters QVX are not used in the 1st position.
The letters IJZ are not used in the second position.
The only letters to appear in the third position are ABCDEFGHJKPSTUW when the structure starts with A9A.
The only letters to appear in the fourth position are ABEHMNPRVWXY when the structure starts with AA9A.
The final two letters do not use the letters CIKMOV, so as not to resemble digits or each other when hand-written.
*/

/*
    Bits and pieces
    1st position (any):         [A-PR-UWYZ]         
    2nd position (if letter):   [A-HK-Y]
    3rd position (A1A format):  [A-HJKPSTUW]
    4th position (AA1A format): [ABEHMNPRV-Y]
    Last 2 positions:           [ABD-HJLNP-UW-Z]    
*/


data example;
infile cards truncover;
input valid 1. postcode &$10. Notes &$100.;
flag = prxmatch('/^[A-PR-UWYZ](([A-HK-Y]?\d\d?)|(\d[A-HJKPSTUW])|([A-HK-Y]\d[ABEHMNPRV-Y]))\s?\d[ABD-HJLNP-UW-Z]{2}$/',strip(postcode));
cards;
1  EC1A 1BB  Special case 1
1  W1A 0AX   Special case 2
1  M1 1AE    Standard format
1  B33 8TH   Standard format
1  CR2 6XH   Standard format
1  DN55 1PT  Standard format
0  QN55 1PT  Bad letter in 1st position
0  DI55 1PT  Bad letter in 2nd position
0  W1Z 0AX   Bad letter in 3rd position
0  EC1Z 1BB  Bad letter in 4th position
0  DN55 1CT  Bad letter in 2nd group
0  A11A 1AA  Invalid digits in 1st group
0  AA11A 1AA  1st group too long
0  AA11 1AAA  2nd group too long
0  AA11 1AAA  2nd group too long
0  AAA 1AA   No digit in 1st group
0  AA 1AA    No digit in 1st group
0  A 1AA     No digit in 1st group
0  1A 1AA    Missing letter in 1st group
0  1 1AA     Missing letter in 1st group
0  11 1AA    Missing letter in 1st group
0  AA1 1A    Missing letter in 2nd group
0  AA1 1     Missing letter in 2nd group
;
run;
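For readers without SAS, the same pattern can be exercised in Python. This is only a sketch: the `flag` function is a hypothetical stand-in for the `flag` column in the data step (1 for a match, 0 otherwise), and the cases are a subset of the cards above.

```python
import re

# Same pattern as in the PRXMATCH call, without the SAS '/' delimiters.
SAS_PATTERN = re.compile(
    r"^[A-PR-UWYZ](([A-HK-Y]?\d\d?)|(\d[A-HJKPSTUW])|([A-HK-Y]\d[ABEHMNPRV-Y]))"
    r"\s?\d[ABD-HJLNP-UW-Z]{2}$"
)


def flag(postcode: str) -> int:
    """Return 1 if the stripped postcode matches, else 0 (hypothetical helper)."""
    return 1 if SAS_PATTERN.search(postcode.strip()) else 0


# (expected, postcode) pairs taken from the test cases above.
CASES = [
    (1, "EC1A 1BB"), (1, "W1A 0AX"), (1, "M1 1AE"),
    (1, "B33 8TH"), (1, "CR2 6XH"), (1, "DN55 1PT"),
    (0, "QN55 1PT"), (0, "DI55 1PT"), (0, "W1Z 0AX"),
    (0, "EC1Z 1BB"), (0, "DN55 1CT"), (0, "A11A 1AA"),
    (0, "AA11A 1AA"), (0, "AA1 1A"),
]

if __name__ == "__main__":
    for expected, postcode in CASES:
        print(f"{postcode:<10} expected={expected} flag={flag(postcode)}")
```

Note that the pattern is upper-case only, so lower-case input such as `se5 0eg` would need to be upper-cased first.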

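First half of the postcode

Valid formats

[A-Z][A-Z][0-9][A-Z]
[A-Z][A-Z][0-9][0-9]
[A-Z][0-9][0-9]
[A-Z][A-Z][0-9]
[A-Z][A-Z][A-Z]
[A-Z][0-9][A-Z]
[A-Z][0-9]

Exceptions

Position 1 - QVX not used
Position 2 - IJZ not used, except in GIR 0AA
Position 3 - only AEHMNPRTVXY used
Position 4 - only ABEHMNPRVWXY used

Second half of the postcode

[0-9][A-Z][A-Z]

Exceptions

Positions 2 and 3 - CIKMOV not used

Remember that not all possible codes are in use, so these rules are a necessary but not a sufficient condition for a valid code. It might be easier just to match against a list of all valid codes?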
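Matching against a list could look roughly like the sketch below. Everything here is an assumption for illustration: the function names are mine, and the three postcodes are placeholders taken from the question, standing in for a complete, regularly refreshed list from an authoritative source.

```python
def normalise(postcode: str) -> str:
    """Upper-case and remove all whitespace so 'se5 0eg' equals 'SE50EG'."""
    return "".join(postcode.split()).upper()


def load_known_postcodes(lines) -> set:
    """Build a lookup set from an iterable of postcode strings (hypothetical)."""
    return {normalise(line) for line in lines if line.strip()}


# Placeholder data only; a real deployment would read the full list from a file.
KNOWN = load_known_postcodes(["CW3 9SS", "SE5 0EG", "WC2H 7LT"])


def is_known_postcode(postcode: str, known: set = KNOWN) -> bool:
    """Return True if the normalised postcode is present in the lookup set."""
    return normalise(postcode) in known


if __name__ == "__main__":
    for sample in ["se5 0eg", "SE50EG", "WC2H", "ZZ99 9ZZ"]:
        print(f"{sample!r}: {is_known_postcode(sample)}")
```

A set lookup is cheap at run time; the cost of this approach lies in obtaining the list and keeping it current.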

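Postcodes are subject to change, and the only true way of validating a postcode is to have the complete list of postcodes and see whether it's there.

But regular expressions are useful because they:

- are easy to use and implement
- are short
- are quick to run
- are quite easy to maintain (compared to a full list of postcodes)
- still catch most input errors

But regular expressions tend to be difficult to maintain, especially for someone who didn't come up with them in the first place. So the expression must be:

- as easy to understand as possible
- relatively future-proof

That means most of the regular expressions in this answer aren't good enough. For example, I can see that [A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRV-Y] is going to match a postcode area of the form AA1A, but it will be a pain in the neck if and when a new postcode area gets added, because it's difficult to understand which postcode areas it matches.

I also want my regular expression to capture the first and second halves of the postcode as parenthesised matches.

So I've come up with this: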

(GIR(?=\s*0AA)|(?:[BEGLMNSW]|[A-Z]{2})[0-9](?:[0-9]|(?<=N1|E1|SE1|SW1|W1|NW1|EC[0-9]|WC[0-9])[A-HJ-NP-Z])?)\s*([0-9][ABD-HJLNP-UW-Z]{2})

In PCRE format it can be written as follows:

/^
  ( GIR(?=\s*0AA) # Match the special postcode "GIR 0AA"
    |
    (?:
      [BEGLMNSW] | # There are 8 single-letter postcode areas
      [A-Z]{2}     # All other postcode areas have two letters
      )
    [0-9] # There is always at least one number after the postcode area
    (?:
      [0-9] # And an optional extra number
      |
      # Only certain postcode areas can have an extra letter after the number
      (?<=N1|E1|SE1|SW1|W1|NW1|EC[0-9]|WC[0-9])
      [A-HJ-NP-Z] # Possible letters here may change, but [IO] will never be used
      )?
    )
  \s*
  ([0-9][ABD-HJLNP-UW-Z]{2}) # The last two letters cannot be [CIKMOV]
$/x

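For me this is the right balance between validating as much as possible and, at the same time, future-proofing the expression and keeping it easy to maintain.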
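The pattern is written for PCRE. Python's built-in `re` module requires every alternative inside a lookbehind to have the same width, so a direct copy fails to compile there. The following is a sketch of one possible port, assuming it is acceptable to split the lookbehind into a two-character and a three-character variant; the name `split_postcode` and the upper-casing step are my additions.

```python
import re

POSTCODE_RE = re.compile(
    r"""
    ^
    ( GIR(?=\s*0AA)                 # the special postcode "GIR 0AA"
      |
      (?: [BEGLMNSW] | [A-Z]{2} )   # one- or two-letter postcode area
      [0-9]                         # at least one digit follows the area
      (?:
        [0-9]                       # optional second digit
        |
        # The single PCRE lookbehind is split by width (2, then 3 characters)
        # because Python's re only accepts fixed-width lookbehinds.
        (?: (?<=N1|E1|W1) | (?<=SE1|SW1|NW1|EC[0-9]|WC[0-9]) )
        [A-HJ-NP-Z]                 # optional trailing letter
      )?
    )
    \s*
    ( [0-9][ABD-HJLNP-UW-Z]{2} )    # second half; never uses CIKMOV
    $
    """,
    re.VERBOSE,
)


def split_postcode(text: str):
    """Return (first_half, second_half), or None if there is no match."""
    match = POSTCODE_RE.match(text.strip().upper())
    return match.groups() if match else None


if __name__ == "__main__":
    for sample in ["WC2H 7LT", "se5 0eg", "SE50EG", "GIR 0AA", "M1A 1AE", "WC2H"]:
        print(f"{sample!r}: {split_postcode(sample)}")
```

The two capture groups give the first and second halves directly, which is the behaviour the answer asks for.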

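Through empirical testing and observation, as well as confirming with https://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom#Validation, here is my version of a Python regex that parses and validates a UK postcode:

UK_POSTCODE_REGEX = r'(?P<postcode_area>[A-Z]{1,2})(?P<district>(?:[0-9]{1,2})|(?:[0-9][A-Z]))(?P<sector>[0-9])(?P<postcode>[A-Z]{2})'

This regex is simple and has capture groups. It does not include all of the validations for legal UK postcodes; it only takes into account the letter versus number positions.

Here is how I would use it in code: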

import re
from dataclasses import dataclass


@dataclass
class UKPostcode:
    postcode_area: str
    district: str
    sector: str  # regex groups are strings, and the tests compare against '1', '0', ...
    postcode: str

    # https://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom#Validation
    # Original author of this regex: @jontsai
    # NOTE TO FUTURE DEVELOPER:
    # Verified through empirical testing and observation, as well as confirming with the Wiki article
    # If this regex fails to capture all valid UK postcodes, then I apologize, for I am only human.
    UK_POSTCODE_REGEX = r'(?P<postcode_area>[A-Z]{1,2})(?P<district>(?:[0-9]{1,2})|(?:[0-9][A-Z]))(?P<sector>[0-9])(?P<postcode>[A-Z]{2})'

    @classmethod
    def from_postcode(cls, postcode):
        """Parses a string into a UKPostcode

        Returns a UKPostcode or None
        """
        m = re.match(cls.UK_POSTCODE_REGEX, postcode.replace(' ', ''))

        if m:
            uk_postcode = UKPostcode(
                postcode_area=m.group('postcode_area'),
                district=m.group('district'),
                sector=m.group('sector'),
                postcode=m.group('postcode')
            )
        else:
            uk_postcode = None

        return uk_postcode


def parse_uk_postcode(postcode):
    """Wrapper for UKPostcode.from_postcode
    """
    uk_postcode = UKPostcode.from_postcode(postcode)
    return uk_postcode
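One detail worth illustrating: `re.match` anchors only at the start of the string, so trailing characters after a valid-looking postcode are ignored. The sketch below contrasts it with `re.fullmatch`; the helper names `parse_loose` and `parse_strict` are hypothetical and not part of the answer's code.

```python
import re

# Same pattern as UK_POSTCODE_REGEX in the answer.
UK_POSTCODE_REGEX = r'(?P<postcode_area>[A-Z]{1,2})(?P<district>(?:[0-9]{1,2})|(?:[0-9][A-Z]))(?P<sector>[0-9])(?P<postcode>[A-Z]{2})'


def parse_loose(postcode: str):
    """Match from the start only, as the answer's from_postcode does."""
    return re.match(UK_POSTCODE_REGEX, postcode.replace(' ', ''))


def parse_strict(postcode: str):
    """Require the whole (space-stripped) string to match."""
    return re.fullmatch(UK_POSTCODE_REGEX, postcode.replace(' ', ''))


if __name__ == "__main__":
    for sample in ["WC2H 7LT", "WC2H 7LTa", "WC2H", "QZ99 9ZZ"]:
        loose = parse_loose(sample)
        strict = parse_strict(sample)
        print(f"{sample!r}: loose={bool(loose)} strict={bool(strict)}")
```

The last sample has a plausible shape but starts with letters that the other answers rule out, which shows what "only letter versus number positions" means in practice.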

Here are the unit tests:

import pytest


@pytest.mark.parametrize(
    'postcode, expected', [
        # https://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom#Validation
        (
            'EC1A1BB',
            UKPostcode(
                postcode_area='EC',
                district='1A',
                sector='1',
                postcode='BB'
            ),
        ),
        (
            'W1A0AX',
            UKPostcode(
                postcode_area='W',
                district='1A',
                sector='0',
                postcode='AX'
            ),
        ),
        (
            'M11AE',
            UKPostcode(
                postcode_area='M',
                district='1',
                sector='1',
                postcode='AE'
            ),
        ),
        (
            'B338TH',
            UKPostcode(
                postcode_area='B',
                district='33',
                sector='8',
                postcode='TH'
            )
        ),
        (
            'CR26XH',
            UKPostcode(
                postcode_area='CR',
                district='2',
                sector='6',
                postcode='XH'
            )
        ),
        (
            'DN551PT',
            UKPostcode(
                postcode_area='DN',
                district='55',
                sector='1',
                postcode='PT'
            )
        )
    ]
)
def test_parse_uk_postcode(postcode, expected):
    uk_postcode = parse_uk_postcode(postcode)
    assert uk_postcode == expected