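How can I check whether a given string is a valid URL?

My knowledge of regular expressions is basic, so I cannot choose among the hundreds of regular expressions I have already seen on the web.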


Current answer

I think I found a more generic regexp for validating URLs, particularly websites

​(https?:\/\/)?(www\.)[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,4}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)|(https?:\/\/)?(www\.)?(?!ww)[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,4}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)

It does not allow, for example, www.something, http://www, or http://www.something

Check it here: http://regexr.com/3e4a2
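To try the claims above outside regexr, here is a minimal Java sketch. It assumes the expression is applied to the whole string with `matches()`. The class name `WebsiteRegexDemo` and the helper `isWebsite` are made up for this illustration and are not part of the answer; the pattern itself is the one shown above, with backslashes doubled for a Java string literal.

```java
import java.util.regex.Pattern;

class WebsiteRegexDemo {
    // The answer's expression, split at its top-level "|" for readability.
    static final Pattern WEBSITE = Pattern.compile(
        "(https?:\\/\\/)?(www\\.)[-a-zA-Z0-9@:%._\\+~#=]{2,256}\\.[a-z]{2,4}\\b([-a-zA-Z0-9@:%_\\+.~#?&//=]*)"
        + "|(https?:\\/\\/)?(www\\.)?(?!ww)[-a-zA-Z0-9@:%._\\+~#=]{2,256}\\.[a-z]{2,4}\\b([-a-zA-Z0-9@:%_\\+.~#?&//=]*)");

    // Hypothetical helper: true only if the entire input matches the expression.
    static boolean isWebsite(String candidate) {
        return WEBSITE.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "www.google.com", "google.com", "http://www.google.com/path?q=1",
            "www.something", "http://www", "http://www.something"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + isWebsite(sample));
        }
    }
}
```

The first three samples are accepted and the last three are rejected, which is the behavior the answer describes.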

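Other answers

I tried to formulate my own version of a URL regex. My requirement was to capture instances in a string where a possible URL could be cse.uom.ac.mu. Note that it is not preceded by http or www.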

String regularExpression = "((((ht{2}ps?://)?)((w{3}\\.)?))?)[^.&&[a-zA-Z0-9]][a-zA-Z0-9.-]+[^.&&[a-zA-Z0-9]](\\.[a-zA-Z]{2,3})";

assertTrue("www.google.com".matches(regularExpression));
assertTrue("www.google.co.uk".matches(regularExpression));
assertTrue("http://www.google.com".matches(regularExpression));
assertTrue("http://www.google.co.uk".matches(regularExpression));
assertTrue("https://www.google.com".matches(regularExpression));
assertTrue("https://www.google.co.uk".matches(regularExpression));
assertTrue("google.com".matches(regularExpression));
assertTrue("google.co.uk".matches(regularExpression));
assertTrue("google.mu".matches(regularExpression));
assertTrue("mes.intnet.mu".matches(regularExpression));
assertTrue("cse.uom.ac.mu".matches(regularExpression));

//cannot contain 2 '.' after www
assertFalse("www..dr.google".matches(regularExpression));

//cannot contain 2 '.' just before com
assertFalse("www.dr.google..com".matches(regularExpression));

// to test case where url www must be followed with a '.'
assertFalse("www:google.com".matches(regularExpression));

// to test case where url www must be followed with a '.'
//assertFalse("http://wwwe.google.com".matches(regularExpression));

// to test case where www must be followed directly by a '.'
assertFalse("https://www@.google.com".matches(regularExpression));

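I have been writing an in-depth article on URI validation using regular expressions. It is based on RFC 3986.

Regular Expression URI Validation

Although the article is not yet complete, I have come up with a PHP function that does a pretty good job of validating HTTP and FTP URLs. Here is the current version: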

// function url_valid($url) { Rev:20110423_2000
//
// Return associative array of valid URI components, or FALSE if $url is not
// RFC-3986 compliant. If the passed URL begins with: "www." or "ftp.", then
// "http://" or "ftp://" is prepended and the corrected full-url is stored in
// the return array with a key name "url". This value should be used by the caller.
//
// Return value: FALSE if $url is not valid, otherwise array of URI components:
// e.g.
// Given: "http://www.jmrware.com:80/articles?height=10&width=75#fragone"
// Array(
//    [scheme] => http
//    [authority] => www.jmrware.com:80
//    [userinfo] =>
//    [host] => www.jmrware.com
//    [IP_literal] =>
//    [IPV6address] =>
//    [ls32] =>
//    [IPvFuture] =>
//    [IPv4address] =>
//    [regname] => www.jmrware.com
//    [port] => 80
//    [path_abempty] => /articles
//    [query] => height=10&width=75
//    [fragment] => fragone
//    [url] => http://www.jmrware.com:80/articles?height=10&width=75#fragone
// )
function url_valid($url) {
    if (strpos($url, 'www.') === 0) $url = 'http://'. $url;
    if (strpos($url, 'ftp.') === 0) $url = 'ftp://'. $url;
    if (!preg_match('/# Valid absolute URI having a non-empty, valid DNS host.
        ^
        (?P<scheme>[A-Za-z][A-Za-z0-9+\-.]*):\/\/
        (?P<authority>
          (?:(?P<userinfo>(?:[A-Za-z0-9\-._~!$&\'()*+,;=:]|%[0-9A-Fa-f]{2})*)@)?
          (?P<host>
            (?P<IP_literal>
              \[
              (?:
                (?P<IPV6address>
                  (?:                                                (?:[0-9A-Fa-f]{1,4}:){6}
                  |                                                ::(?:[0-9A-Fa-f]{1,4}:){5}
                  | (?:                          [0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){4}
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,1}[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){3}
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,2}[0-9A-Fa-f]{1,4})?::(?:[0-9A-Fa-f]{1,4}:){2}
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,3}[0-9A-Fa-f]{1,4})?::   [0-9A-Fa-f]{1,4}:
                  | (?:(?:[0-9A-Fa-f]{1,4}:){0,4}[0-9A-Fa-f]{1,4})?::
                  )
                  (?P<ls32>[0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}
                  | (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
                       (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
                  )
                |   (?:(?:[0-9A-Fa-f]{1,4}:){0,5}[0-9A-Fa-f]{1,4})?::   [0-9A-Fa-f]{1,4}
                |   (?:(?:[0-9A-Fa-f]{1,4}:){0,6}[0-9A-Fa-f]{1,4})?::
                )
              | (?P<IPvFuture>[Vv][0-9A-Fa-f]+\.[A-Za-z0-9\-._~!$&\'()*+,;=:]+)
              )
              \]
            )
          | (?P<IPv4address>(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
                               (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))
          | (?P<regname>(?:[A-Za-z0-9\-._~!$&\'()*+,;=]|%[0-9A-Fa-f]{2})+)
          )
          (?::(?P<port>[0-9]*))?
        )
        (?P<path_abempty>(?:\/(?:[A-Za-z0-9\-._~!$&\'()*+,;=:@]|%[0-9A-Fa-f]{2})*)*)
        (?:\?(?P<query>       (?:[A-Za-z0-9\-._~!$&\'()*+,;=:@\\/?]|%[0-9A-Fa-f]{2})*))?
        (?:\#(?P<fragment>    (?:[A-Za-z0-9\-._~!$&\'()*+,;=:@\\/?]|%[0-9A-Fa-f]{2})*))?
        $
        /mx', $url, $m)) return FALSE;
    switch ($m['scheme']) {
    case 'https':
    case 'http':
        if ($m['userinfo']) return FALSE; // HTTP scheme does not allow userinfo.
        break;
    case 'ftps':
    case 'ftp':
        break;
    default:
        return FALSE;   // Unrecognized URI scheme. Default to FALSE.
    }
    // Validate host name conforms to DNS "dot-separated-parts".
    if ($m['regname']) { // If host regname specified, check for DNS conformance.
        if (!preg_match('/# HTTP DNS host name.
            ^                      # Anchor to beginning of string.
            (?!.{256})             # Overall host length is less than 256 chars.
            (?:                    # Group dot separated host part alternatives.
              [A-Za-z0-9]\.        # Either a single alphanum followed by dot
            |                      # or... part has more than one char (63 chars max).
              [A-Za-z0-9]          # Part first char is alphanum (no dash).
              [A-Za-z0-9\-]{0,61}  # Internal chars are alphanum plus dash.
              [A-Za-z0-9]          # Part last char is alphanum (no dash).
              \.                   # Each part followed by literal dot.
            )*                     # Zero or more parts before top level domain.
            (?:                    # Explicitly specify top level domains.
              com|edu|gov|int|mil|net|org|biz|
              info|name|pro|aero|coop|museum|
              asia|cat|jobs|mobi|tel|travel|
              [A-Za-z]{2})         # Country codes are exactly two alpha chars.
              \.?                  # Top level domain can end in a dot.
            $                      # Anchor to end of string.
            /ix', $m['host'])) return FALSE;
    }
    $m['url'] = $url;
    for ($i = 0; isset($m[$i]); ++$i) unset($m[$i]);
    return $m; // return TRUE == array of useful named $matches plus the valid $url.
}

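This function uses two regular expressions: one to match a subset of valid generic URIs (absolute URIs with a non-empty host), and a second to validate that the host name consists of DNS "dot-separated parts". Although the function currently validates only the HTTP and FTP schemes, it is structured so that it can easily be extended to handle other schemes.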
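The same two-stage idea can be sketched in Java. This is not a port of the PHP function: it swaps the first regex for `java.net.URI` (which parses according to the older RFC 2396 with amendments, not RFC 3986), accepts any alphabetic top-level label instead of an explicit TLD list, and rejects IP-literal hosts for brevity. The class name `TwoStageUrlCheck` and the method `isValid` are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

class TwoStageUrlCheck {
    private static final Set<String> SCHEMES = Set.of("http", "https", "ftp", "ftps");

    // Stage 2: dot-separated labels of 1..63 chars, alphanumeric at both ends,
    // dashes allowed inside; the last label is letters only (an assumption).
    private static final Pattern DNS_HOST = Pattern.compile(
        "(?=.{1,255}$)"
        + "(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\\.)*"
        + "[A-Za-z]{2,63}\\.?");

    static boolean isValid(String url) {
        URI uri;
        try {
            uri = new URI(url); // Stage 1: generic URI syntax check
        } catch (URISyntaxException e) {
            return false;
        }
        String scheme = uri.getScheme();
        String host = uri.getHost();
        if (scheme == null || host == null) return false; // need an absolute URI with a host
        scheme = scheme.toLowerCase(Locale.ROOT);
        if (!SCHEMES.contains(scheme)) return false;
        // Mirror the PHP function: no userinfo for http/https.
        if (scheme.startsWith("http") && uri.getUserInfo() != null) return false;
        return DNS_HOST.matcher(host).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("http://www.jmrware.com:80/articles?height=10&width=75#fragone"));
        System.out.println(isValid("gopher://example.com/"));
    }
}
```

Keeping the scheme check separate from the host check is what makes it straightforward to add further schemes later, as the answer points out.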

I just wrote a blog post presenting a good solution for recognizing URLs in the most commonly used formats, such as:

www.google.com
http://www.google.com
mailto:somebody@google.com
somebody@google.com
www.url-with-querystring.com/?url=has-querystring

The regular expression used is:

/((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?)/

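URL regular expressions from the Android Open Source Project

Introduction

The Android Open Source Project (AOSP) contains several code blocks with URL regular expressions in Patterns.java. Because they use unicode, it is hard for non-Java users to extract the regex patterns, so I wrote some code to do that. The patterns contain unicode, whose string literal syntax differs between programming languages, so I added two formats for each regex pattern. For example, Java uses the \uUNICODE_NUMBER format, while PHP uses \u{UNICODE_NUMBER}.

Pattern named "WEB_URL"

API documentation description:

Regular expression pattern to match most of RFC 3987 Internationalized URLs, also known as IRIs.

Regular expression in unicode \uUNICODE_NUMBER (Java, Python, Ruby) format: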

(((?:(?i:http|https|rtsp|ftp)://(?:(?:[a-zA-Z0-9\$\-\_\.\+\!\*\'\(\)\,\;\?\&\=]|(?:\%[a-fA-F0-9]{2})){1,64}(?:\:(?:[a-zA-Z0-9\$\-\_\.\+\!\*\'\(\)\,\;\?\&\=]|(?:\%[a-fA-F0-9]{2})){1,25})?\@)?)?(?:(([a-zA-Z0-9[\u00a0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\ud800\udc00-\ud83f\udffd\ud840\udc00-\ud87f\udffd\ud880\udc00-\ud8bf\udffd\ud8c0\udc00-\ud8ff\udffd\ud900\udc00-\ud93f\udffd\ud940\udc00-\ud97f\udffd\ud980\udc00-\ud9bf\udffd\ud9c0\udc00-\ud9ff\udffd\uda00\udc00-\uda3f\udffd\uda40\udc00-\uda7f\udffd\uda80\udc00-\udabf\udffd\udac0\udc00-\udaff\udffd\udb00\udc00-\udb3f\udffd\udb44\udc00-\udb7f\udffd&&[^\u00a0[\u2000-\u200a]\u2028\u2029\u202f\u3000]]](?:[a-zA-Z0-9[\u00a0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\ud800\udc00-\ud83f\udffd\ud840\udc00-\ud87f\udffd\ud880\udc00-\ud8bf\udffd\ud8c0\udc00-\ud8ff\udffd\ud900\udc00-\ud93f\udffd\ud940\udc00-\ud97f\udffd\ud980\udc00-\ud9bf\udffd\ud9c0\udc00-\ud9ff\udffd\uda00\udc00-\uda3f\udffd\uda40\udc00-\uda7f\udffd\uda80\udc00-\udabf\udffd\udac0\udc00-\udaff\udffd\udb00\udc00-\udb3f\udffd\udb44\udc00-\udb7f\udffd&&[^\u00a0[\u2000-\u200a]\u2028\u2029\u202f\u3000]]_\-]{0,61}[a-zA-Z0-9[\u00a0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\ud800\udc00-\ud83f\udffd\ud840\udc00-\ud87f\udffd\ud880\udc00-\ud8bf\udffd\ud8c0\udc00-\ud8ff\udffd\ud900\udc00-\ud93f\udffd\ud940\udc00-\ud97f\udffd\ud980\udc00-\ud9bf\udffd\ud9c0\udc00-\ud9ff\udffd\uda00\udc00-\uda3f\udffd\uda40\udc00-\uda7f\udffd\uda80\udc00-\udabf\udffd\udac0\udc00-\udaff\udffd\udb00\udc00-\udb3f\udffd\udb44\udc00-\udb7f\udffd&&[^\u00a0[\u2000-\u200a]\u2028\u2029\u202f\u3000]]]){0,1}\.)+(xn\-\-[\w\-]{0,58}\w|[a-zA-Z[\u00a0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\ud800\udc00-\ud83f\udffd\ud840\udc00-\ud87f\udffd\ud880\udc00-\ud8bf\udffd\ud8c0\udc00-\ud8ff\udffd\ud900\udc00-\ud93f\udffd\ud940\udc00-\ud97f\udffd\ud980\udc00-\ud9bf\udffd\ud9c0\udc00-\ud9ff\udffd\uda00\udc00-\uda3f\udffd\uda40\udc00-\uda7f\udffd\uda80\udc00-\udabf\udffd\udac0\udc00-\udaff\udffd\udb00\udc00-\udb3f\udffd\udb44\udc00-\udb7f\udffd&&[^\u
00a0[\u2000-\u200a]\u2028\u2029\u202f\u3000]]]{2,63})|((25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[0-9]))))(?:\:\d{1,5})?)([/\?](?:(?:[a-zA-Z0-9[\u00a0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\ud800\udc00-\ud83f\udffd\ud840\udc00-\ud87f\udffd\ud880\udc00-\ud8bf\udffd\ud8c0\udc00-\ud8ff\udffd\ud900\udc00-\ud93f\udffd\ud940\udc00-\ud97f\udffd\ud980\udc00-\ud9bf\udffd\ud9c0\udc00-\ud9ff\udffd\uda00\udc00-\uda3f\udffd\uda40\udc00-\uda7f\udffd\uda80\udc00-\udabf\udffd\udac0\udc00-\udaff\udffd\udb00\udc00-\udb3f\udffd\udb44\udc00-\udb7f\udffd&&[^\u00a0[\u2000-\u200a]\u2028\u2029\u202f\u3000]];/\?:@&=#~\-\.\+!\*'\(\),_\$])|(?:%[a-fA-F0-9]{2}))*)?(?:\b|$|^))```

Regular expression in unicode \u{UNICODE_NUMBER} (PHP) format:

(((?:(?i:http|https|rtsp|ftp)://(?:(?:[a-zA-Z0-9\$\-\_\.\+\!\*\'\(\)\,\;\?\&\=]|(?:\%[a-fA-F0-9]{2})){1,64}(?:\:(?:[a-zA-Z0-9\$\-\_\.\+\!\*\'\(\)\,\;\?\&\=]|(?:\%[a-fA-F0-9]{2})){1,25})?\@)?)?(?:(([a-zA-Z0-9[\u{00a0}-\u{d7ff}\u{f900}-\u{fdcf}\u{fdf0}-\u{ffef}\u{d800}\u{dc00}-\u{d83f}\u{dffd}\u{d840}\u{dc00}-\u{d87f}\u{dffd}\u{d880}\u{dc00}-\u{d8bf}\u{dffd}\u{d8c0}\u{dc00}-\u{d8ff}\u{dffd}\u{d900}\u{dc00}-\u{d93f}\u{dffd}\u{d940}\u{dc00}-\u{d97f}\u{dffd}\u{d980}\u{dc00}-\u{d9bf}\u{dffd}\u{d9c0}\u{dc00}-\u{d9ff}\u{dffd}\u{da00}\u{dc00}-\u{da3f}\u{dffd}\u{da40}\u{dc00}-\u{da7f}\u{dffd}\u{da80}\u{dc00}-\u{dabf}\u{dffd}\u{dac0}\u{dc00}-\u{daff}\u{dffd}\u{db00}\u{dc00}-\u{db3f}\u{dffd}\u{db44}\u{dc00}-\u{db7f}\u{dffd}&&[^\u{00a0}[\u{2000}-\u{200a}]\u{2028}\u{2029}\u{202f}\u{3000}]]](?:[a-zA-Z0-9[\u{00a0}-\u{d7ff}\u{f900}-\u{fdcf}\u{fdf0}-\u{ffef}\u{d800}\u{dc00}-\u{d83f}\u{dffd}\u{d840}\u{dc00}-\u{d87f}\u{dffd}\u{d880}\u{dc00}-\u{d8bf}\u{dffd}\u{d8c0}\u{dc00}-\u{d8ff}\u{dffd}\u{d900}\u{dc00}-\u{d93f}\u{dffd}\u{d940}\u{dc00}-\u{d97f}\u{dffd}\u{d980}\u{dc00}-\u{d9bf}\u{dffd}\u{d9c0}\u{dc00}-\u{d9ff}\u{dffd}\u{da00}\u{dc00}-\u{da3f}\u{dffd}\u{da40}\u{dc00}-\u{da7f}\u{dffd}\u{da80}\u{dc00}-\u{dabf}\u{dffd}\u{dac0}\u{dc00}-\u{daff}\u{dffd}\u{db00}\u{dc00}-\u{db3f}\u{dffd}\u{db44}\u{dc00}-\u{db7f}\u{dffd}&&[^\u{00a0}[\u{2000}-\u{200a}]\u{2028}\u{2029}\u{202f}\u{3000}]]_\-]{0,61}[a-zA-Z0-9[\u{00a0}-\u{d7ff}\u{f900}-\u{fdcf}\u{fdf0}-\u{ffef}\u{d800}\u{dc00}-\u{d83f}\u{dffd}\u{d840}\u{dc00}-\u{d87f}\u{dffd}\u{d880}\u{dc00}-\u{d8bf}\u{dffd}\u{d8c0}\u{dc00}-\u{d8ff}\u{dffd}\u{d900}\u{dc00}-\u{d93f}\u{dffd}\u{d940}\u{dc00}-\u{d97f}\u{dffd}\u{d980}\u{dc00}-\u{d9bf}\u{dffd}\u{d9c0}\u{dc00}-\u{d9ff}\u{dffd}\u{da00}\u{dc00}-\u{da3f}\u{dffd}\u{da40}\u{dc00}-\u{da7f}\u{dffd}\u{da80}\u{dc00}-\u{dabf}\u{dffd}\u{dac0}\u{dc00}-\u{daff}\u{dffd}\u{db00}\u{dc00}-\u{db3f}\u{dffd}\u{db44}\u{dc00}-\u{db7f}\u{dffd}&&[^\u{00a0}[\u{2000}-\u{200a}]\u{2028}\u{2029}\u{202f}\u{3000}]]]){0,1}\.)+(xn\-\-[\w\
-]{0,58}\w|[a-zA-Z[\u{00a0}-\u{d7ff}\u{f900}-\u{fdcf}\u{fdf0}-\u{ffef}\u{d800}\u{dc00}-\u{d83f}\u{dffd}\u{d840}\u{dc00}-\u{d87f}\u{dffd}\u{d880}\u{dc00}-\u{d8bf}\u{dffd}\u{d8c0}\u{dc00}-\u{d8ff}\u{dffd}\u{d900}\u{dc00}-\u{d93f}\u{dffd}\u{d940}\u{dc00}-\u{d97f}\u{dffd}\u{d980}\u{dc00}-\u{d9bf}\u{dffd}\u{d9c0}\u{dc00}-\u{d9ff}\u{dffd}\u{da00}\u{dc00}-\u{da3f}\u{dffd}\u{da40}\u{dc00}-\u{da7f}\u{dffd}\u{da80}\u{dc00}-\u{dabf}\u{dffd}\u{dac0}\u{dc00}-\u{daff}\u{dffd}\u{db00}\u{dc00}-\u{db3f}\u{dffd}\u{db44}\u{dc00}-\u{db7f}\u{dffd}&&[^\u{00a0}[\u{2000}-\u{200a}]\u{2028}\u{2029}\u{202f}\u{3000}]]]{2,63})|((25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[0-9]))))(?:\:\d{1,5})?)([/\?](?:(?:[a-zA-Z0-9[\u{00a0}-\u{d7ff}\u{f900}-\u{fdcf}\u{fdf0}-\u{ffef}\u{d800}\u{dc00}-\u{d83f}\u{dffd}\u{d840}\u{dc00}-\u{d87f}\u{dffd}\u{d880}\u{dc00}-\u{d8bf}\u{dffd}\u{d8c0}\u{dc00}-\u{d8ff}\u{dffd}\u{d900}\u{dc00}-\u{d93f}\u{dffd}\u{d940}\u{dc00}-\u{d97f}\u{dffd}\u{d980}\u{dc00}-\u{d9bf}\u{dffd}\u{d9c0}\u{dc00}-\u{d9ff}\u{dffd}\u{da00}\u{dc00}-\u{da3f}\u{dffd}\u{da40}\u{dc00}-\u{da7f}\u{dffd}\u{da80}\u{dc00}-\u{dabf}\u{dffd}\u{dac0}\u{dc00}-\u{daff}\u{dffd}\u{db00}\u{dc00}-\u{db3f}\u{dffd}\u{db44}\u{dc00}-\u{db7f}\u{dffd}&&[^\u{00a0}[\u{2000}-\u{200a}]\u{2028}\u{2029}\u{202f}\u{3000}]];/\?:@&=#~\-\.\+!\*'\(\),_\$])|(?:%[a-fA-F0-9]{2}))*)?(?:\b|$|^))

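Other patterns

Patterns.java contains more patterns, but posting them all would exceed the Stack Overflow post length limit. I will post their API descriptions here so that you know they exist and what they are for. I have also added the Kotlin code used to output these patterns below.

Pattern named "WEB_URL_WITHOUT_PROTOCOL"

Description:

Regular expression to match strings that do not start with a supported protocol. The TLDs are expected to be one of the known TLDs.

Definition: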

"("
+ WORD_BOUNDARY
+ "(?<!:\\/\\/)"
+ "("
+ "(?:" + STRICT_DOMAIN_NAME + ")"
+ "(?:" + PORT_NUMBER + ")?"
+ ")"
+ "(?:" + PATH_AND_QUERY + ")?"
+ WORD_BOUNDARY
+ ")";

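Pattern named "WEB_URL_WITH_PROTOCOL"

Description:

Regular expression to match strings that start with a supported protocol. The rules for domain names and TLDs are more relaxed. TLDs are optional.

Definition: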

"("
+ WORD_BOUNDARY
+ "(?:"
+ "(?:" + PROTOCOL + "(?:" + USER_INFO + ")?" + ")"
+ "(?:" + RELAXED_DOMAIN_NAME + ")?"
+ "(?:" + PORT_NUMBER + ")?"
+ ")"
+ "(?:" + PATH_AND_QUERY + ")?"
+ WORD_BOUNDARY
+ ")";

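Pattern named "AUTOLINK_WEB_URL"

Description:

Regular expression pattern to match IRIs. If a string starts with http(s)://, the expression tries to match the URL structure with a relaxed rule for TLDs. If the string does not start with http(s)://, the TLDs are expected to be one of the known TLDs.

Definition: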

"(" + WEB_URL_WITH_PROTOCOL + "|" + WEB_URL_WITHOUT_PROTOCOL + ")")

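The way these definitions are assembled from smaller string constants can be illustrated with a reduced Java sketch. Every building block below is a hypothetical, heavily simplified stand-in, not the AOSP value: the real constants cover unicode ranges, user info, word boundaries and a long TLD list, while this sketch knows only three TLDs. It only shows how the "with protocol" and "without protocol" branches combine into one alternation.

```java
import java.util.regex.Pattern;

class AutolinkSketch {
    // Simplified stand-ins for the AOSP building blocks (assumptions, not the real values).
    static final String PROTOCOL = "(?i:http|https)://";
    static final String RELAXED_DOMAIN_NAME = "[a-zA-Z0-9][a-zA-Z0-9.-]*";            // TLD optional
    static final String STRICT_DOMAIN_NAME = "(?:[a-zA-Z0-9-]+\\.)+(?:com|org|net)";  // known TLDs only
    static final String PORT_NUMBER = ":\\d{1,5}";
    static final String PATH_AND_QUERY = "/\\S*";

    // Branch 1: protocol present, relaxed domain rules.
    static final String WITH_PROTOCOL =
        "(?:" + PROTOCOL + RELAXED_DOMAIN_NAME
        + "(?:" + PORT_NUMBER + ")?"
        + "(?:" + PATH_AND_QUERY + ")?)";

    // Branch 2: no protocol (not preceded by "://"), strict domain rules.
    static final String WITHOUT_PROTOCOL =
        "(?:(?<!://)" + STRICT_DOMAIN_NAME
        + "(?:" + PORT_NUMBER + ")?"
        + "(?:" + PATH_AND_QUERY + ")?)";

    static final Pattern AUTOLINK =
        Pattern.compile("(" + WITH_PROTOCOL + "|" + WITHOUT_PROTOCOL + ")");

    // Hypothetical helper: true only if the entire input matches.
    static boolean isLink(String candidate) {
        return AUTOLINK.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLink("http://localhost:8080/x")); // protocol branch
        System.out.println(isLink("example.com/path"));        // strict branch
        System.out.println(isLink("localhost"));               // neither branch
    }
}
```

In this sketch a bare host such as localhost is only accepted when a protocol precedes it, which mirrors the relaxed versus strict distinction in the description above.
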
Code to output the patterns from AOSP Patterns.java

This code is written in Kotlin (a JVM-based language). It converts the regex patterns from AOSP Patterns.java into a readable format:

import java.util.regex.Pattern

fun createPattern(pattern: Pattern, unicodeStringFormat: String): String =
    pattern.toString().flatMap {
        val charCode = it.code
        if (charCode > 126) {
            unicodeStringFormat.format(charCode).toList()
        } else {
            listOf(it)
        }
    }.joinToString("")

fun main() {
    val unicodeStringFormatJava = "\\u%04x"
    val unicodeStringFormatPHP = "\\u{%04x}"

    // Pattern: WEB_URL
    println(createPattern(Patterns.WEB_URL, unicodeStringFormatJava))
    println(createPattern(Patterns.WEB_URL, unicodeStringFormatPHP))

    // Pattern: AUTOLINK_WEB_URL
    println(createPattern(Patterns.AUTOLINK_WEB_URL, unicodeStringFormatJava))
    println(createPattern(Patterns.AUTOLINK_WEB_URL, unicodeStringFormatPHP))

    // Pattern: WEB_URL_WITH_PROTOCOL (variable modified to public visibility)
    println(createPattern(Patterns.WEB_URL_WITH_PROTOCOL.toPattern(), unicodeStringFormatJava))
    println(createPattern(Patterns.WEB_URL_WITH_PROTOCOL.toPattern(), unicodeStringFormatPHP))

    // Pattern: WEB_URL_WITHOUT_PROTOCOL (variable modified to public visibility)
    println(createPattern(Patterns.WEB_URL_WITHOUT_PROTOCOL.toPattern(), unicodeStringFormatJava))
    println(createPattern(Patterns.WEB_URL_WITHOUT_PROTOCOL.toPattern(), unicodeStringFormatPHP))
}

Regardless of how broad the question is, I am posting this for anyone in the future who is looking for something simple. I don't think any single regular expression can validate a URL in a way that fits all needs; it depends on your requirements. In my case, I just needed to verify that a URL is in the form domain.extension, and I wanted to allow www or any other subdomain, as in blog.domain.extension. I don't care about http(s), because my app has a field that says "enter the URL", so it is obvious what the entered string is.

Here is the regex:

/^(www\.|[a-zA-Z0-9](.*[a-zA-Z0-9])?\.)?((?!www)[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9])\.[a-z]{2,5}(:[0-9]{1,5})?$/i

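The first block in this regex is:

(www\.|[a-zA-Z0-9](.*[a-zA-Z0-9])?\.)? --> here we check whether the URL starts with www. or with [a-zA-Z0-9](.*[a-zA-Z0-9])?, which means a letter or digit + (any character (0 or more times) + another letter or digit), followed by a dot

Note that in (.*[a-zA-Z0-9])?\.)? the part we translated as (any character (0 or more times) + another letter or digit) is optional (it may or may not be present). That is why we grouped it in parentheses followed by the question mark ?

The whole block discussed so far is also placed in parentheses and followed by ?, which means that www or any other word (representing the subdomain) is optional.

The second part is: ((?!www)[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9])\. --> this represents the "domain" part. It can be any word (except www) that starts with a letter or digit + any other letters or digits (including the dash "-") repeated one or more times, and ends with a letter or digit, followed by a dot.

The last part is [a-z]{2,5} --> this represents the "extension". It can be any letters repeated 2 to 5 times, so it can be com, net, org, art, basically any extension of that length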
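To see the three blocks working together, here is a minimal Java sketch that compiles the same expression, with `Pattern.CASE_INSENSITIVE` standing in for the trailing /i flag. The class name `SimpleDomainCheck` and the helper `looksLikeDomain` are invented for this illustration.

```java
import java.util.regex.Pattern;

class SimpleDomainCheck {
    static final Pattern DOMAIN = Pattern.compile(
        "^(www\\.|[a-zA-Z0-9](.*[a-zA-Z0-9])?\\.)?"          // optional www. or subdomain block
        + "((?!www)[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9])\\."  // domain part, must not start with www
        + "[a-z]{2,5}"                                       // extension
        + "(:[0-9]{1,5})?$",                                 // optional port
        Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: true if the whole input has the domain.extension shape.
    static boolean looksLikeDomain(String candidate) {
        return DOMAIN.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "domain.com", "www.domain.com", "blog.domain.com:8080", "www.com", "domain"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + looksLikeDomain(sample));
        }
    }
}
```

The first three samples are accepted and the last two are rejected. One thing worth knowing: because the subdomain block contains .*, it accepts any characters before the last subdomain character, so http://blog.domain.com passes even though http://domain.com does not.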