如何检查给定的字符串是否是有效的URL地址?

我对正则表达式的知识是基本的,不允许我从我已经在网上看到的数百个正则表达式中进行选择。


当前回答

对于Python,这是Django 1.5.1中使用的验证正则表达式的实际URL:

import re
regex = re.compile(
        r'^(?:http|ftp)s?://'  # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
        r'localhost|'  # localhost...
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|'  # ...or ipv4
        r'\[?[A-F0-9]*:[A-F0-9:]+\]?)'  # ...or ipv6
        r'(?::\d+)?'  # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)

这既处理ipv4和ipv6地址,也处理端口和GET参数。

在代码44行中找到。

其他回答

我用这个:/ ((https ?: \ / \ / | ftp: \ / \ / | www \) \ S + \[^。\ n ]+((?:\([^)]*\))|[^.,;:?!"'\ n \)\]<* ])+)/

它很短,但它处理了一些边缘情况,比如某些以括号结尾的维基百科链接(https://en.wikipedia.org/wiki/Sally_(name),这里投票最多的答案似乎没有涵盖这一点。

下面是Android源代码的Java版本。这是我找到的最好的一个。

public static final Matcher WEB  = Pattern.compile(new StringBuilder()                 
.append("((?:(http|https|Http|Https|rtsp|Rtsp):")                      
.append("\\/\\/(?:(?:[a-zA-Z0-9\\$\\-\\_\\.\\+\\!\\*\\'\\(\\)")                         
.append("\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,64}(?:\\:(?:[a-zA-Z0-9\\$\\-\\_")                         
.append("\\.\\+\\!\\*\\'\\(\\)\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,25})?\\@)?)?")                         
.append("((?:(?:[a-zA-Z0-9][a-zA-Z0-9\\-]{0,64}\\.)+")   // named host                            
.append("(?:")   // plus top level domain                         
.append("(?:aero|arpa|asia|a[cdefgilmnoqrstuwxz])")                         
.append("|(?:biz|b[abdefghijmnorstvwyz])")                         
.append("|(?:cat|com|coop|c[acdfghiklmnoruvxyz])")                         
.append("|d[ejkmoz]")                         
.append("|(?:edu|e[cegrstu])")                         
.append("|f[ijkmor]")                         
.append("|(?:gov|g[abdefghilmnpqrstuwy])")                         
.append("|h[kmnrtu]")                         
.append("|(?:info|int|i[delmnoqrst])")                         
.append("|(?:jobs|j[emop])")                         
.append("|k[eghimnrwyz]")                         
.append("|l[abcikrstuvy]")                         
.append("|(?:mil|mobi|museum|m[acdghklmnopqrstuvwxyz])")                         
.append("|(?:name|net|n[acefgilopruz])")                         
.append("|(?:org|om)")                         
.append("|(?:pro|p[aefghklmnrstwy])")                         
.append("|qa")                         
.append("|r[eouw]")                         
.append("|s[abcdeghijklmnortuvyz]")                         
.append("|(?:tel|travel|t[cdfghjklmnoprtvwz])")                         
.append("|u[agkmsyz]")                         
.append("|v[aceginu]")                         
.append("|w[fs]")                         
.append("|y[etu]")                         
.append("|z[amw]))")                         
.append("|(?:(?:25[0-5]|2[0-4]") // or ip address                                                 
.append("[0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\\.(?:25[0-5]|2[0-4][0-9]")                             
.append("|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(?:25[0-5]|2[0-4][0-9]|[0-1]")                         
.append("[0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}")                         
.append("|[1-9][0-9]|[0-9])))")                         
.append("(?:\\:\\d{1,5})?)") // plus option port number                             
.append("(\\/(?:(?:[a-zA-Z0-9\\;\\/\\?\\:\\@\\&\\=\\#\\~")  // plus option query params                         
.append("\\-\\.\\+\\!\\*\\'\\(\\)\\,\\_])|(?:\\%[a-fA-F0-9]{2}))*)?")                         
.append("(?:\\b|$)").toString()                 
).matcher("");

获取URL的部分(Regex)这篇文章讨论了解析URL以识别其各个组件。如果您想检查URL是否格式良好,它应该足以满足您的需求。

如果你需要检查它是否有效,你最终将不得不尝试访问另一端的任何东西。

不过,一般来说,使用框架或其他库提供的函数可能会更好。许多平台都包含了解析url的函数。例如,有Python的urlparse模块,在。net中你可以使用System模块。类的构造函数,作为验证URL的一种方法。

我无法找到我正在寻找的正则表达式,所以我修改了一个正则表达式来满足我的要求,显然现在它似乎工作得很好。我的要求是:

匹配带有协议的url (www.gooogle.com) 使用查询参数和路径匹配url (http://subdomain.web-site.com/cgi-bin/perl.cgi?key1=value1&key2=value2e) 不要匹配有不可接受字符的url(例如。' '£),例如:(www.google.com/somthing"/somethingmore)

以下是我的想法,任何建议都很感激:

@Test
    public void testWebsiteUrl(){
        String regularExpression = "((http|ftp|https):\\/\\/)?[\\w\\-_]+(\\.[\\w\\-_]+)+([\\w\\-\\.,@?^=%&amp;:/~\\+#]*[\\w\\-\\@?^=%&amp;/~\\+#])?";

        assertTrue("www.google.com".matches(regularExpression));
        assertTrue("www.google.co.uk".matches(regularExpression));
        assertTrue("http://www.google.com".matches(regularExpression));
        assertTrue("http://www.google.co.uk".matches(regularExpression));
        assertTrue("https://www.google.com".matches(regularExpression));
        assertTrue("https://www.google.co.uk".matches(regularExpression));
        assertTrue("google.com".matches(regularExpression));
        assertTrue("google.co.uk".matches(regularExpression));
        assertTrue("google.mu".matches(regularExpression));
        assertTrue("mes.intnet.mu".matches(regularExpression));
        assertTrue("cse.uom.ac.mu".matches(regularExpression));

        assertTrue("http://www.google.com/path".matches(regularExpression));
        assertTrue("http://subdomain.web-site.com/cgi-bin/perl.cgi?key1=value1&key2=value2e".matches(regularExpression));
        assertTrue("http://www.google.com/?queryparam=123".matches(regularExpression));
        assertTrue("http://www.google.com/path?queryparam=123".matches(regularExpression));

        assertFalse("www..dr.google".matches(regularExpression));

        assertFalse("www:google.com".matches(regularExpression));

        assertFalse("https://www@.google.com".matches(regularExpression));

        assertFalse("https://www.google.com\"".matches(regularExpression));
        assertFalse("https://www.google.com'".matches(regularExpression));

        assertFalse("http://www.google.com/path'".matches(regularExpression));
        assertFalse("http://subdomain.web-site.com/cgi-bin/perl.cgi?key1=value1&key2=value2e'".matches(regularExpression));
        assertFalse("http://www.google.com/?queryparam=123'".matches(regularExpression));
        assertFalse("http://www.google.com/path?queryparam=12'3".matches(regularExpression));

    }

我一直在写一篇深入的文章,讨论使用正则表达式进行URI验证。它基于RFC3986。

正则表达式URI验证

虽然这篇文章还不完整,但我已经提出了一个PHP函数,它在验证HTTP和FTP url方面做得非常好。以下是当前版本:

// function url_valid($url) { Rev:20110423_2000
//
// Return associative array of valid URI components, or FALSE if $url is not
// RFC-3986 compliant. If the passed URL begins with: "www." or "ftp.", then
// "http://" or "ftp://" is prepended and the corrected full-url is stored in
// the return array with a key name "url". This value should be used by the caller.
//
// Return value: FALSE if $url is not valid, otherwise array of URI components:
// e.g.
// Given: "http://www.jmrware.com:80/articles?height=10&width=75#fragone"
// Array(
//    [scheme] => http
//    [authority] => www.jmrware.com:80
//    [userinfo] =>
//    [host] => www.jmrware.com
//    [IP_literal] =>
//    [IPV6address] =>
//    [ls32] =>
//    [IPvFuture] =>
//    [IPv4address] =>
//    [regname] => www.jmrware.com
//    [port] => 80
//    [path_abempty] => /articles
//    [query] => height=10&width=75
//    [fragment] => fragone
//    [url] => http://www.jmrware.com:80/articles?height=10&width=75#fragone
// )
function url_valid($url) {
    if (strpos($url, 'www.') === 0) $url = 'http://'. $url;
    if (strpos($url, 'ftp.') === 0) $url = 'ftp://'. $url;
    if (!preg_match('/# Valid absolute URI having a non-empty, valid DNS host.
        ^
        (?P<scheme>[A-Za-z][A-Za-z0-9+\-.]*):\/\/
        (?P<authority>
          (?:(?P<userinfo>(?:[A-Za-z0-9\-._~!$&\'()*+,;=:]|%[0-9A-Fa-f]{2})*)@)?
          (?P<host>
            (?P<IP_literal>
              \[
              (?:
                (?P<IPV6address>
                  (?:                                                (?:[0-9A-Fa-f]{1,4}:){6}
                  |                                                ::(?:[0-9A-Fa-f]{1,4}:){5}
                  | (?:                          [0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){4}
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,1}[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){3}
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,2}[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){2}
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,3}[0-9A-Fa-f]{1,4})?::   [0-9A-Fa-f]{1,4}:
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,4}[0-9A-Fa-f]{1,4})?::
                  )
                  (?P<ls32>[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}
                  | (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
                       (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
                  )
                |   (?:(?:[0-9A-Fa-f]{1,4}:){0,5}[0-9A-Fa-f]{1,4})?::   [0-9A-Fa-f]{1,4}
                |   (?:(?:[0-9A-Fa-f]{1,4}:){0,6}[0-9A-Fa-f]{1,4})?::
                )
              | (?P<IPvFuture>[Vv][0-9A-Fa-f]+\.[A-Za-z0-9\-._~!$&\'()*+,;=:]+)
              )
              \]
            )
          | (?P<IPv4address>(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
                               (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))
          | (?P<regname>(?:[A-Za-z0-9\-._~!$&\'()*+,;=]|%[0-9A-Fa-f]{2})+)
          )
          (?::(?P<port>[0-9]*))?
        )
        (?P<path_abempty>(?:\/(?:[A-Za-z0-9\-._~!$&\'()*+,;=:@]|%[0-9A-Fa-f]{2})*)*)
        (?:\?(?P<query>       (?:[A-Za-z0-9\-._~!$&\'()*+,;=:@\\/?]|%[0-9A-Fa-f]{2})*))?
        (?:\#(?P<fragment>    (?:[A-Za-z0-9\-._~!$&\'()*+,;=:@\\/?]|%[0-9A-Fa-f]{2})*))?
        $
        /mx', $url, $m)) return FALSE;
    switch ($m['scheme']) {
    case 'https':
    case 'http':
        if ($m['userinfo']) return FALSE; // HTTP scheme does not allow userinfo.
        break;
    case 'ftps':
    case 'ftp':
        break;
    default:
        return FALSE;   // Unrecognized URI scheme. Default to FALSE.
    }
    // Validate host name conforms to DNS "dot-separated-parts".
    if ($m['regname']) { // If host regname specified, check for DNS conformance.
        if (!preg_match('/# HTTP DNS host name.
            ^                      # Anchor to beginning of string.
            (?!.{256})             # Overall host length is less than 256 chars.
            (?:                    # Group dot separated host part alternatives.
              [A-Za-z0-9]\.        # Either a single alphanum followed by dot
            |                      # or... part has more than one char (63 chars max).
              [A-Za-z0-9]          # Part first char is alphanum (no dash).
              [A-Za-z0-9\-]{0,61}  # Internal chars are alphanum plus dash.
              [A-Za-z0-9]          # Part last char is alphanum (no dash).
              \.                   # Each part followed by literal dot.
            )*                     # Zero or more parts before top level domain.
            (?:                    # Explicitly specify top level domains.
              com|edu|gov|int|mil|net|org|biz|
              info|name|pro|aero|coop|museum|
              asia|cat|jobs|mobi|tel|travel|
              [A-Za-z]{2})         # Country codes are exactly two alpha chars.
              \.?                  # Top level domain can end in a dot.
            $                      # Anchor to end of string.
            /ix', $m['host'])) return FALSE;
    }
    $m['url'] = $url;
    for ($i = 0; isset($m[$i]); ++$i) unset($m[$i]);
    return $m; // return TRUE == array of useful named $matches plus the valid $url.
}

这个函数使用了两个正则表达式;一个用于匹配有效通用uri的子集(具有非空主机的绝对uri),另一个用于验证DNS“点分隔部分”主机名。虽然这个函数目前只验证HTTP和FTP方案,但它的结构使它可以很容易地扩展以处理其他方案。