如何检查给定的字符串是否是有效的URL地址?
我对正则表达式的知识是基本的,不允许我从我已经在网上看到的数百个正则表达式中进行选择。
如何检查给定的字符串是否是有效的URL地址?
我对正则表达式的知识是基本的,不允许我从我已经在网上看到的数百个正则表达式中进行选择。
当前回答
下面是针对这种情况的最佳和最匹配的正则表达式
^(?:http(?:s)?:\/\/)?(?:www\.)?(?:[\w-]*)\.\w{2,}$
其他回答
下面是Android源代码的Java版本。这是我找到的最好的一个。
public static final Matcher WEB = Pattern.compile(new StringBuilder()
.append("((?:(http|https|Http|Https|rtsp|Rtsp):")
.append("\\/\\/(?:(?:[a-zA-Z0-9\\$\\-\\_\\.\\+\\!\\*\\'\\(\\)")
.append("\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,64}(?:\\:(?:[a-zA-Z0-9\\$\\-\\_")
.append("\\.\\+\\!\\*\\'\\(\\)\\,\\;\\?\\&\\=]|(?:\\%[a-fA-F0-9]{2})){1,25})?\\@)?)?")
.append("((?:(?:[a-zA-Z0-9][a-zA-Z0-9\\-]{0,64}\\.)+") // named host
.append("(?:") // plus top level domain
.append("(?:aero|arpa|asia|a[cdefgilmnoqrstuwxz])")
.append("|(?:biz|b[abdefghijmnorstvwyz])")
.append("|(?:cat|com|coop|c[acdfghiklmnoruvxyz])")
.append("|d[ejkmoz]")
.append("|(?:edu|e[cegrstu])")
.append("|f[ijkmor]")
.append("|(?:gov|g[abdefghilmnpqrstuwy])")
.append("|h[kmnrtu]")
.append("|(?:info|int|i[delmnoqrst])")
.append("|(?:jobs|j[emop])")
.append("|k[eghimnrwyz]")
.append("|l[abcikrstuvy]")
.append("|(?:mil|mobi|museum|m[acdghklmnopqrstuvwxyz])")
.append("|(?:name|net|n[acefgilopruz])")
.append("|(?:org|om)")
.append("|(?:pro|p[aefghklmnrstwy])")
.append("|qa")
.append("|r[eouw]")
.append("|s[abcdeghijklmnortuvyz]")
.append("|(?:tel|travel|t[cdfghjklmnoprtvwz])")
.append("|u[agkmsyz]")
.append("|v[aceginu]")
.append("|w[fs]")
.append("|y[etu]")
.append("|z[amw]))")
.append("|(?:(?:25[0-5]|2[0-4]") // or ip address
.append("[0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\\.(?:25[0-5]|2[0-4][0-9]")
.append("|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(?:25[0-5]|2[0-4][0-9]|[0-1]")
.append("[0-9]{2}|[1-9][0-9]|[1-9]|0)\\.(?:25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}")
.append("|[1-9][0-9]|[0-9])))")
.append("(?:\\:\\d{1,5})?)") // plus option port number
.append("(\\/(?:(?:[a-zA-Z0-9\\;\\/\\?\\:\\@\\&\\=\\#\\~") // plus option query params
.append("\\-\\.\\+\\!\\*\\'\\(\\)\\,\\_])|(?:\\%[a-fA-F0-9]{2}))*)?")
.append("(?:\\b|$)").toString()
).matcher("");
我使用这个正则表达式:
((https?:)?//)?(([\d\w]|%[a-fA-f\d]{2,2})+(:([\d\w]|%[a-fA-f\d]{2,2})+)?@)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,63}(:[\d]+)?(/([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-f\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)?
同时支持:
http://stackoverflow.com
https://stackoverflow.com
和:
//stackoverflow.com
我试着制定我的url版本。我的需求是在一个字符串中捕获实例,其中可能的url可以是cse.uom.ac.mu -注意它的前面没有http或www
String regularExpression = "((((ht{2}ps?://)?)((w{3}\\.)?))?)[^.&&[a-zA-Z0-9]][a-zA-Z0-9.-]+[^.&&[a-zA-Z0-9]](\\.[a-zA-Z]{2,3})";
assertTrue("www.google.com".matches(regularExpression));
assertTrue("www.google.co.uk".matches(regularExpression));
assertTrue("http://www.google.com".matches(regularExpression));
assertTrue("http://www.google.co.uk".matches(regularExpression));
assertTrue("https://www.google.com".matches(regularExpression));
assertTrue("https://www.google.co.uk".matches(regularExpression));
assertTrue("google.com".matches(regularExpression));
assertTrue("google.co.uk".matches(regularExpression));
assertTrue("google.mu".matches(regularExpression));
assertTrue("mes.intnet.mu".matches(regularExpression));
assertTrue("cse.uom.ac.mu".matches(regularExpression));
//cannot contain 2 '.' after www
assertFalse("www..dr.google".matches(regularExpression));
//cannot contain 2 '.' just before com
assertFalse("www.dr.google..com".matches(regularExpression));
// to test case where url www must be followed with a '.'
assertFalse("www:google.com".matches(regularExpression));
// to test case where url www must be followed with a '.'
//assertFalse("http://wwwe.google.com".matches(regularExpression));
// to test case where www must be preceded with a '.'
assertFalse("https://www@.google.com".matches(regularExpression));
如果你真的在搜索终极匹配,你可能会在“一个好的Url正则表达式?”
但是,一个真正匹配所有可能域并允许rfc允许的任何内容的正则表达式是可怕的长且不可读的,相信我;-)
来自Android开源项目的URL正则表达式
介绍
Android开源项目(AOSP)在Patterns.java中包含多个带有URL正则表达式的代码块。由于使用unicode,非java用户很难从中提取regex模式,因此我编写了一些代码来完成这项工作。因为regex模式包含unicode,其文字字符串语法因编程语言而不同,所以我为每个regex模式添加了两种格式。 例如,Java使用\uUNICODE_NUMBER格式,而PHP使用\u{UNICODE_NUMBER}。
名为“WEB_URL”的模式
API文档描述:
正则表达式模式,以匹配大部分RFC 3987国际化url,即iri。
正则表达式在unicode \uUNICODE_NUMBER (Java, Python, Ruby)格式:
(((?:(?i:http|https|rtsp|ftp)://(?:(?:[a-zA-Z0-9\$\-\_\.\+\!\*\'\(\)\,\;\?\&\=]|(?:\%[a-fA-F0-9]{2})){1,64}(?:\:(?:[a-zA-Z0-9\$\-\_\.\+\!\*\'\(\)\,\;\?\&\=]|(?:\%[a-fA-F0-9]{2})){1,25})?\@)?)?(?:(([a-zA-Z0-9[\u00a0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\ud800\udc00-\ud83f\udffd\ud840\udc00-\ud87f\udffd\ud880\udc00-\ud8bf\udffd\ud8c0\udc00-\ud8ff\udffd\ud900\udc00-\ud93f\udffd\ud940\udc00-\ud97f\udffd\ud980\udc00-\ud9bf\udffd\ud9c0\udc00-\ud9ff\udffd\uda00\udc00-\uda3f\udffd\uda40\udc00-\uda7f\udffd\uda80\udc00-\udabf\udffd\udac0\udc00-\udaff\udffd\udb00\udc00-\udb3f\udffd\udb44\udc00-\udb7f\udffd&&[^\u00a0[\u2000-\u200a]\u2028\u2029\u202f\u3000]]](?:[a-zA-Z0-9[\u00a0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\ud800\udc00-\ud83f\udffd\ud840\udc00-\ud87f\udffd\ud880\udc00-\ud8bf\udffd\ud8c0\udc00-\ud8ff\udffd\ud900\udc00-\ud93f\udffd\ud940\udc00-\ud97f\udffd\ud980\udc00-\ud9bf\udffd\ud9c0\udc00-\ud9ff\udffd\uda00\udc00-\uda3f\udffd\uda40\udc00-\uda7f\udffd\uda80\udc00-\udabf\udffd\udac0\udc00-\udaff\udffd\udb00\udc00-\udb3f\udffd\udb44\udc00-\udb7f\udffd&&[^\u00a0[\u2000-\u200a]\u2028\u2029\u202f\u3000]]_\-]{0,61}[a-zA-Z0-9[\u00a0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\ud800\udc00-\ud83f\udffd\ud840\udc00-\ud87f\udffd\ud880\udc00-\ud8bf\udffd\ud8c0\udc00-\ud8ff\udffd\ud900\udc00-\ud93f\udffd\ud940\udc00-\ud97f\udffd\ud980\udc00-\ud9bf\udffd\ud9c0\udc00-\ud9ff\udffd\uda00\udc00-\uda3f\udffd\uda40\udc00-\uda7f\udffd\uda80\udc00-\udabf\udffd\udac0\udc00-\udaff\udffd\udb00\udc00-\udb3f\udffd\udb44\udc00-\udb7f\udffd&&[^\u00a0[\u2000-\u200a]\u2028\u2029\u202f\u3000]]]){0,1}\.)+(xn\-\-[\w\-]{0,58}\w|[a-zA-Z[\u00a0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\ud800\udc00-\ud83f\udffd\ud840\udc00-\ud87f\udffd\ud880\udc00-\ud8bf\udffd\ud8c0\udc00-\ud8ff\udffd\ud900\udc00-\ud93f\udffd\ud940\udc00-\ud97f\udffd\ud980\udc00-\ud9bf\udffd\ud9c0\udc00-\ud9ff\udffd\uda00\udc00-\uda3f\udffd\uda40\udc00-\uda7f\udffd\uda80\udc00-\udabf\udffd\udac0\udc00-\udaff\udffd\udb00\udc00-\udb3f\udffd\udb44\udc00-\udb7f\udffd&&[^\u00a0[\u2000-\u200a]\u2028\u2029\u202f\u3000]]]{2,63})|((25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[0-9]))))(?:\:\d{1,5})?)([/\?](?:(?:[a-zA-Z0-9[\u00a0-\ud7ff\uf900-\ufdcf\ufdf0-\uffef\ud800\udc00-\ud83f\udffd\ud840\udc00-\ud87f\udffd\ud880\udc00-\ud8bf\udffd\ud8c0\udc00-\ud8ff\udffd\ud900\udc00-\ud93f\udffd\ud940\udc00-\ud97f\udffd\ud980\udc00-\ud9bf\udffd\ud9c0\udc00-\ud9ff\udffd\uda00\udc00-\uda3f\udffd\uda40\udc00-\uda7f\udffd\uda80\udc00-\udabf\udffd\udac0\udc00-\udaff\udffd\udb00\udc00-\udb3f\udffd\udb44\udc00-\udb7f\udffd&&[^\u00a0[\u2000-\u200a]\u2028\u2029\u202f\u3000]];/\?:@&=#~\-\.\+!\*'\(\),_\$])|(?:%[a-fA-F0-9]{2}))*)?(?:\b|$|^))```
unicode \u{UNICODE_NUMBER} (PHP)格式的正则表达式:
(((?:(?i:http|https|rtsp|ftp)://(?:(?:[a-zA-Z0-9\$\-\_\.\+\!\*\'\(\)\,\;\?\&\=]|(?:\%[a-fA-F0-9]{2})){1,64}(?:\:(?:[a-zA-Z0-9\$\-\_\.\+\!\*\'\(\)\,\;\?\&\=]|(?:\%[a-fA-F0-9]{2})){1,25})?\@)?)?(?:(([a-zA-Z0-9[\u{00a0}-\u{d7ff}\u{f900}-\u{fdcf}\u{fdf0}-\u{ffef}\u{d800}\u{dc00}-\u{d83f}\u{dffd}\u{d840}\u{dc00}-\u{d87f}\u{dffd}\u{d880}\u{dc00}-\u{d8bf}\u{dffd}\u{d8c0}\u{dc00}-\u{d8ff}\u{dffd}\u{d900}\u{dc00}-\u{d93f}\u{dffd}\u{d940}\u{dc00}-\u{d97f}\u{dffd}\u{d980}\u{dc00}-\u{d9bf}\u{dffd}\u{d9c0}\u{dc00}-\u{d9ff}\u{dffd}\u{da00}\u{dc00}-\u{da3f}\u{dffd}\u{da40}\u{dc00}-\u{da7f}\u{dffd}\u{da80}\u{dc00}-\u{dabf}\u{dffd}\u{dac0}\u{dc00}-\u{daff}\u{dffd}\u{db00}\u{dc00}-\u{db3f}\u{dffd}\u{db44}\u{dc00}-\u{db7f}\u{dffd}&&[^\u{00a0}[\u{2000}-\u{200a}]\u{2028}\u{2029}\u{202f}\u{3000}]]](?:[a-zA-Z0-9[\u{00a0}-\u{d7ff}\u{f900}-\u{fdcf}\u{fdf0}-\u{ffef}\u{d800}\u{dc00}-\u{d83f}\u{dffd}\u{d840}\u{dc00}-\u{d87f}\u{dffd}\u{d880}\u{dc00}-\u{d8bf}\u{dffd}\u{d8c0}\u{dc00}-\u{d8ff}\u{dffd}\u{d900}\u{dc00}-\u{d93f}\u{dffd}\u{d940}\u{dc00}-\u{d97f}\u{dffd}\u{d980}\u{dc00}-\u{d9bf}\u{dffd}\u{d9c0}\u{dc00}-\u{d9ff}\u{dffd}\u{da00}\u{dc00}-\u{da3f}\u{dffd}\u{da40}\u{dc00}-\u{da7f}\u{dffd}\u{da80}\u{dc00}-\u{dabf}\u{dffd}\u{dac0}\u{dc00}-\u{daff}\u{dffd}\u{db00}\u{dc00}-\u{db3f}\u{dffd}\u{db44}\u{dc00}-\u{db7f}\u{dffd}&&[^\u{00a0}[\u{2000}-\u{200a}]\u{2028}\u{2029}\u{202f}\u{3000}]]_\-]{0,61}[a-zA-Z0-9[\u{00a0}-\u{d7ff}\u{f900}-\u{fdcf}\u{fdf0}-\u{ffef}\u{d800}\u{dc00}-\u{d83f}\u{dffd}\u{d840}\u{dc00}-\u{d87f}\u{dffd}\u{d880}\u{dc00}-\u{d8bf}\u{dffd}\u{d8c0}\u{dc00}-\u{d8ff}\u{dffd}\u{d900}\u{dc00}-\u{d93f}\u{dffd}\u{d940}\u{dc00}-\u{d97f}\u{dffd}\u{d980}\u{dc00}-\u{d9bf}\u{dffd}\u{d9c0}\u{dc00}-\u{d9ff}\u{dffd}\u{da00}\u{dc00}-\u{da3f}\u{dffd}\u{da40}\u{dc00}-\u{da7f}\u{dffd}\u{da80}\u{dc00}-\u{dabf}\u{dffd}\u{dac0}\u{dc00}-\u{daff}\u{dffd}\u{db00}\u{dc00}-\u{db3f}\u{dffd}\u{db44}\u{dc00}-\u{db7f}\u{dffd}&&[^\u{00a0}[\u{2000}-\u{200a}]\u{2028}\u{2029}\u{202f}\u{3000}]]]){0,1}\.)+(xn\-\-[\w\-]{0,58}\w|[a-zA-Z[\u{00a0}-\u{d7ff}\u{f900}-\u{fdcf}\u{fdf0}-\u{ffef}\u{d800}\u{dc00}-\u{d83f}\u{dffd}\u{d840}\u{dc00}-\u{d87f}\u{dffd}\u{d880}\u{dc00}-\u{d8bf}\u{dffd}\u{d8c0}\u{dc00}-\u{d8ff}\u{dffd}\u{d900}\u{dc00}-\u{d93f}\u{dffd}\u{d940}\u{dc00}-\u{d97f}\u{dffd}\u{d980}\u{dc00}-\u{d9bf}\u{dffd}\u{d9c0}\u{dc00}-\u{d9ff}\u{dffd}\u{da00}\u{dc00}-\u{da3f}\u{dffd}\u{da40}\u{dc00}-\u{da7f}\u{dffd}\u{da80}\u{dc00}-\u{dabf}\u{dffd}\u{dac0}\u{dc00}-\u{daff}\u{dffd}\u{db00}\u{dc00}-\u{db3f}\u{dffd}\u{db44}\u{dc00}-\u{db7f}\u{dffd}&&[^\u{00a0}[\u{2000}-\u{200a}]\u{2028}\u{2029}\u{202f}\u{3000}]]]{2,63})|((25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[0-9]))))(?:\:\d{1,5})?)([/\?](?:(?:[a-zA-Z0-9[\u{00a0}-\u{d7ff}\u{f900}-\u{fdcf}\u{fdf0}-\u{ffef}\u{d800}\u{dc00}-\u{d83f}\u{dffd}\u{d840}\u{dc00}-\u{d87f}\u{dffd}\u{d880}\u{dc00}-\u{d8bf}\u{dffd}\u{d8c0}\u{dc00}-\u{d8ff}\u{dffd}\u{d900}\u{dc00}-\u{d93f}\u{dffd}\u{d940}\u{dc00}-\u{d97f}\u{dffd}\u{d980}\u{dc00}-\u{d9bf}\u{dffd}\u{d9c0}\u{dc00}-\u{d9ff}\u{dffd}\u{da00}\u{dc00}-\u{da3f}\u{dffd}\u{da40}\u{dc00}-\u{da7f}\u{dffd}\u{da80}\u{dc00}-\u{dabf}\u{dffd}\u{dac0}\u{dc00}-\u{daff}\u{dffd}\u{db00}\u{dc00}-\u{db3f}\u{dffd}\u{db44}\u{dc00}-\u{db7f}\u{dffd}&&[^\u{00a0}[\u{2000}-\u{200a}]\u{2028}\u{2029}\u{202f}\u{3000}]];/\?:@&=#~\-\.\+!\*'\(\),_\$])|(?:%[a-fA-F0-9]{2}))*)?(?:\b|$|^))
其他模式
java包含更多的模式,但发布它们将达到Stackoverflow的帖子长度限制。但我将在这里发布它们的API描述,以便您了解它们的存在和用途。我还在下面添加了使用Kotlin输出这些模式的代码。
名为“WEB_URL_WITHOUT_PROTOCOL”的模式
描述:
正则表达式,用于匹配不以受支持协议开头的字符串。这些顶级域名预计将是已知顶级域名之一。
定义:
"("
+ WORD_BOUNDARY
+ "(?<!:\\/\\/)"
+ "("
+ "(?:" + STRICT_DOMAIN_NAME + ")"
+ "(?:" + PORT_NUMBER + ")?"
+ ")"
+ "(?:" + PATH_AND_QUERY + ")?"
+ WORD_BOUNDARY
+ ")";
名为WEB_URL_WITH_PROTOCOL的模式
描述:
正则表达式,以匹配以受支持协议开头的字符串。域名和顶级域名的规则更加宽松。tld是可选的。
定义:
"("
+ WORD_BOUNDARY
+ "(?:"
+ "(?:" + PROTOCOL + "(?:" + USER_INFO + ")?" + ")"
+ "(?:" + RELAXED_DOMAIN_NAME + ")?"
+ "(?:" + PORT_NUMBER + ")?"
+ ")"
+ "(?:" + PATH_AND_QUERY + ")?"
+ WORD_BOUNDARY
+ ")";
名为AUTOLINK_WEB_URL的模式
描述:
正则表达式模式来匹配IRIs。如果字符串以 http(s)://表达式尝试用 放宽顶级域名规则。如果字符串不是以http(s)://开头 顶级域名应该是已知顶级域名之一。
定义:
"(" + WEB_URL_WITH_PROTOCOL + "|" + WEB_URL_WITHOUT_PROTOCOL + ")")
从AOSP patterns .java输出模式的代码
这段代码是用Kotlin(一种基于Java JVM的语言)编写的。If将regex模式从AOSP patterns .java转换为可读的格式:
import java.util.regex.Pattern
fun createPattern(pattern: Pattern, unicodeStringFormat: String): String =
pattern.toString().flatMap {
val charCode = it.code
if (charCode > 126) {
unicodeStringFormat.format(charCode).toList()
} else {
listOf(it)
}
}.joinToString("")
fun main() {
val unicodeStringFormatJava = "\\u%04x"
val unicodeStringFormatPHP = "\\u{%04x}"
// Pattern: WEB_URL
println(createPattern(Patterns.WEB_URL, unicodeStringFormatJava))
println(createPattern(Patterns.WEB_URL, unicodeStringFormatPHP))
// Pattern: AUTOLINK_WEB_URL
println(createPattern(Patterns.AUTOLINK_WEB_URL, unicodeStringFormatJava))
println(createPattern(Patterns.AUTOLINK_WEB_URL, unicodeStringFormatPHP))
// Pattern: WEB_URL_WITH_PROTOCOL (variable modified to public visibility)
println(createPattern(Patterns.WEB_URL_WITH_PROTOCOL.toPattern(), unicodeStringFormatJava))
println(createPattern(Patterns.WEB_URL_WITH_PROTOCOL.toPattern(), unicodeStringFormatPHP))
// Pattern: WEB_URL_WITHOUT_PROTOCOL (variable modified to public visibility)
println(createPattern(Patterns.WEB_URL_WITHOUT_PROTOCOL.toPattern(), unicodeStringFormatJava))
println(createPattern(Patterns.WEB_URL_WITHOUT_PROTOCOL.toPattern(), unicodeStringFormatPHP))
}