如何检查给定的字符串是否是有效的URL地址?
我对正则表达式的知识是基本的,不允许我从我已经在网上看到的数百个正则表达式中进行选择。
如何检查给定的字符串是否是有效的URL地址?
我对正则表达式的知识是基本的,不允许我从我已经在网上看到的数百个正则表达式中进行选择。
当前回答
关于眼睑问题,他的回答是“这是基于我对URI规范的阅读。”谢谢眼睑,你的是我寻求的完美的解决方案,因为它是基于URI规范!出色的工作。:)
我不得不做了两个修改。第一个在PHP (v5.2.10)中使用preg_match()函数使regexp正确匹配IP地址url的方法。
我不得不在管道周围的“IP Address”上方的行中添加一组括号:
)|((\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])\.){3}(?#
不知道为什么。
我还将顶级域名的最小长度从3个字母减少到2个字母,以支持.co。英国和类似国家。
最后的代码:
/^(https?|ftp):\/\/(?# protocol
)(([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+(?# username
)(:([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+)?(?# password
)@)?(?# auth requires @
)((([a-z0-9]\.|[a-z0-9][a-z0-9-]*[a-z0-9]\.)*(?# domain segments AND
)[a-z][a-z0-9-]*[a-z0-9](?# top level domain OR
)|((\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])\.){3}(?#
)(\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])(?# IP address
))(:\d+)?(?# port
))(((\/+([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)*(?# path
)(\?([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)(?# query string
)?)?)?(?# path and query string optional
)(#([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)?(?# fragment
)$/i
这个修改后的版本没有根据URI规范进行检查,所以我不能保证它的合规性,它被修改为处理本地网络环境中的URL和两位tld以及其他类型的Web URL,并在我使用的PHP设置中更好地工作。
作为PHP代码:
define('URL_FORMAT',
'/^(https?):\/\/'. // protocol
'(([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+'. // username
'(:([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+)?'. // password
'@)?(?#'. // auth requires @
')((([a-z0-9]\.|[a-z0-9][a-z0-9-]*[a-z0-9]\.)*'. // domain segments AND
'[a-z][a-z0-9-]*[a-z0-9]'. // top level domain OR
'|((\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])\.){3}'.
'(\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])'. // IP address
')(:\d+)?'. // port
')(((\/+([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)*'. // path
'(\?([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)'. // query string
'?)?)?'. // path and query string optional
'(#([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)?'. // fragment
'$/i');
下面是一个PHP测试程序,它使用正则表达式验证各种url:
<?php
define('URL_FORMAT',
'/^(https?):\/\/'. // protocol
'(([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+'. // username
'(:([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+)?'. // password
'@)?(?#'. // auth requires @
')((([a-z0-9]\.|[a-z0-9][a-z0-9-]*[a-z0-9]\.)*'. // domain segments AND
'[a-z][a-z0-9-]*[a-z0-9]'. // top level domain OR
'|((\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])\.){3}'.
'(\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])'. // IP address
')(:\d+)?'. // port
')(((\/+([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)*'. // path
'(\?([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)'. // query string
'?)?)?'. // path and query string optional
'(#([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)?'. // fragment
'$/i');
/**
* Verify the syntax of the given URL.
*
* @access public
* @param $url The URL to verify.
* @return boolean
*/
function is_valid_url($url) {
if (str_starts_with(strtolower($url), 'http://localhost')) {
return true;
}
return preg_match(URL_FORMAT, $url);
}
/**
* String starts with something
*
* This function will return true only if input string starts with
* niddle
*
* @param string $string Input string
* @param string $niddle Needle string
* @return boolean
*/
function str_starts_with($string, $niddle) {
return substr($string, 0, strlen($niddle)) == $niddle;
}
/**
* Test a URL for validity and count results.
* @param url url
* @param expected expected result (true or false)
*/
$numtests = 0;
$passed = 0;
function test_url($url, $expected) {
global $numtests, $passed;
$numtests++;
$valid = is_valid_url($url);
echo "URL Valid?: " . ($valid?"yes":"no") . " for URL: $url. Expected: ".($expected?"yes":"no").". ";
if($valid == $expected) {
echo "PASS\n"; $passed++;
} else {
echo "FAIL\n";
}
}
echo "URL Tests:\n\n";
test_url("http://localserver/projects/public/assets/javascript/widgets/UserBoxMenu/widget.css", true);
test_url("http://www.google.com", true);
test_url("http://www.google.co.uk/projects/my%20folder/test.php", true);
test_url("https://myserver.localdomain", true);
test_url("http://192.168.1.120/projects/index.php", true);
test_url("http://192.168.1.1/projects/index.php", true);
test_url("http://projectpier-server.localdomain/projects/public/assets/javascript/widgets/UserBoxMenu/widget.css", true);
test_url("https://2.4.168.19/project-pier?c=test&a=b", true);
test_url("https://localhost/a/b/c/test.php?c=controller&arg1=20&arg2=20", true);
test_url("http://user:password@localhost/a/b/c/test.php?c=controller&arg1=20&arg2=20", true);
echo "\n$passed out of $numtests tests passed.\n\n";
?>
再次感谢regex的虚实!
其他回答
这可能不是正则表达式的工作,而是您所选语言中的现有工具的工作。您可能希望使用已经编写、测试和调试过的现有代码。
在PHP中,使用parse_url函数。
URI模块。
Ruby: URI模块。
.NET: 'Uri'类
正则表达式并不是你在每个涉及字符串的问题上挥舞的魔杖。
这不是一个正则表达式,但完成了同样的事情(仅Javascript):
function isAValidUrl(url) {
try {
new URL(url);
return true;
} catch(e) {
return false;
}
}
关于眼睑问题,他的回答是“这是基于我对URI规范的阅读。”谢谢眼睑,你的是我寻求的完美的解决方案,因为它是基于URI规范!出色的工作。:)
我不得不做了两个修改。第一个在PHP (v5.2.10)中使用preg_match()函数使regexp正确匹配IP地址url的方法。
我不得不在管道周围的“IP Address”上方的行中添加一组括号:
)|((\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])\.){3}(?#
不知道为什么。
我还将顶级域名的最小长度从3个字母减少到2个字母,以支持.co。英国和类似国家。
最后的代码:
/^(https?|ftp):\/\/(?# protocol
)(([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+(?# username
)(:([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+)?(?# password
)@)?(?# auth requires @
)((([a-z0-9]\.|[a-z0-9][a-z0-9-]*[a-z0-9]\.)*(?# domain segments AND
)[a-z][a-z0-9-]*[a-z0-9](?# top level domain OR
)|((\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])\.){3}(?#
)(\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])(?# IP address
))(:\d+)?(?# port
))(((\/+([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)*(?# path
)(\?([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)(?# query string
)?)?)?(?# path and query string optional
)(#([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)?(?# fragment
)$/i
这个修改后的版本没有根据URI规范进行检查,所以我不能保证它的合规性,它被修改为处理本地网络环境中的URL和两位tld以及其他类型的Web URL,并在我使用的PHP设置中更好地工作。
作为PHP代码:
define('URL_FORMAT',
'/^(https?):\/\/'. // protocol
'(([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+'. // username
'(:([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+)?'. // password
'@)?(?#'. // auth requires @
')((([a-z0-9]\.|[a-z0-9][a-z0-9-]*[a-z0-9]\.)*'. // domain segments AND
'[a-z][a-z0-9-]*[a-z0-9]'. // top level domain OR
'|((\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])\.){3}'.
'(\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])'. // IP address
')(:\d+)?'. // port
')(((\/+([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)*'. // path
'(\?([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)'. // query string
'?)?)?'. // path and query string optional
'(#([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)?'. // fragment
'$/i');
下面是一个PHP测试程序,它使用正则表达式验证各种url:
<?php
define('URL_FORMAT',
'/^(https?):\/\/'. // protocol
'(([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+'. // username
'(:([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+)?'. // password
'@)?(?#'. // auth requires @
')((([a-z0-9]\.|[a-z0-9][a-z0-9-]*[a-z0-9]\.)*'. // domain segments AND
'[a-z][a-z0-9-]*[a-z0-9]'. // top level domain OR
'|((\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])\.){3}'.
'(\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])'. // IP address
')(:\d+)?'. // port
')(((\/+([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)*'. // path
'(\?([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)'. // query string
'?)?)?'. // path and query string optional
'(#([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)?'. // fragment
'$/i');
/**
* Verify the syntax of the given URL.
*
* @access public
* @param $url The URL to verify.
* @return boolean
*/
function is_valid_url($url) {
if (str_starts_with(strtolower($url), 'http://localhost')) {
return true;
}
return preg_match(URL_FORMAT, $url);
}
/**
* String starts with something
*
* This function will return true only if input string starts with
* niddle
*
* @param string $string Input string
* @param string $niddle Needle string
* @return boolean
*/
function str_starts_with($string, $niddle) {
return substr($string, 0, strlen($niddle)) == $niddle;
}
/**
* Test a URL for validity and count results.
* @param url url
* @param expected expected result (true or false)
*/
$numtests = 0;
$passed = 0;
function test_url($url, $expected) {
global $numtests, $passed;
$numtests++;
$valid = is_valid_url($url);
echo "URL Valid?: " . ($valid?"yes":"no") . " for URL: $url. Expected: ".($expected?"yes":"no").". ";
if($valid == $expected) {
echo "PASS\n"; $passed++;
} else {
echo "FAIL\n";
}
}
echo "URL Tests:\n\n";
test_url("http://localserver/projects/public/assets/javascript/widgets/UserBoxMenu/widget.css", true);
test_url("http://www.google.com", true);
test_url("http://www.google.co.uk/projects/my%20folder/test.php", true);
test_url("https://myserver.localdomain", true);
test_url("http://192.168.1.120/projects/index.php", true);
test_url("http://192.168.1.1/projects/index.php", true);
test_url("http://projectpier-server.localdomain/projects/public/assets/javascript/widgets/UserBoxMenu/widget.css", true);
test_url("https://2.4.168.19/project-pier?c=test&a=b", true);
test_url("https://localhost/a/b/c/test.php?c=controller&arg1=20&arg2=20", true);
test_url("http://user:password@localhost/a/b/c/test.php?c=controller&arg1=20&arg2=20", true);
echo "\n$passed out of $numtests tests passed.\n\n";
?>
再次感谢regex的虚实!
我创建了一个类似于RFC3987和其他RFC文档提供的@eyelidlessness的正则表达式(PCRE)。@eyelidlessness和我的正则表达式之间的主要区别主要是可读性和URN支持。
下面的正则表达式是一个整体(而不是与PHP混合),所以它可以很容易地用于不同的语言(只要它们支持PCRE)
测试这个正则表达式最简单的方法是使用regex101,并使用适当的修饰符(gmx)复制粘贴下面的代码和测试字符串。
要在PHP中使用这个正则表达式,请将下面的正则表达式插入到以下代码中:
$regex = <<<'EOD'
// Put the regex here
EOD;
可以通过以下方式匹配不带方案的链路: 要匹配一个没有方案的链接(例如john.doe@gmail.com或www.google.com/pathtofile.php?query),替换此部分:
(?:
(?<scheme>
(?<urn>urn)|
(?&d_scheme)
)
:
)?
用这个:
(?:
(?<scheme>
(?<urn>urn)|
(?&d_scheme)
)
:
)?
但是请注意,通过替换它,regex并不是100%可靠。 使用gmx修饰符的Regex (PCRE)用于下面的多行测试字符串
(?(DEFINE)
# Definitions
(?<ALPHA>[\p{L}])
(?<DIGIT>[0-9])
(?<HEX>[0-9a-fA-F])
(?<NCCHAR>
(?&UNRESERVED)|
(?&PCT_ENCODED)|
(?&SUB_DELIMS)|
@
)
(?<PCHAR>
(?&UNRESERVED)|
(?&PCT_ENCODED)|
(?&SUB_DELIMS)|
:|
@|
\/
)
(?<UCHAR>
(?&UNRESERVED)|
(?&PCT_ENCODED)|
(?&SUB_DELIMS)|
:
)
(?<RCHAR>
(?&UNRESERVED)|
(?&PCT_ENCODED)|
(?&SUB_DELIMS)
)
(?<PCT_ENCODED>%(?&HEX){2})
(?<UNRESERVED>
((?&ALPHA)|(?&DIGIT)|[-._~])
)
(?<RESERVED>(?&GEN_DELIMS)|(?&SUB_DELIMS))
(?<GEN_DELIMS>[:\/?#\[\]@])
(?<SUB_DELIMS>[!$&'()*+,;=])
# URI Parts
(?<d_scheme>
(?!urn)
(?:
(?&ALPHA)
((?&ALPHA)|(?&DIGIT)|[+-.])*
(?=:)
)
)
(?<d_hier_part_slashes>
(\/{2})?
)
(?<d_authority>(?&d_userinfo)?)
(?<d_userinfo>(?&UCHAR)*)
(?<d_ipv6>
(?![^:]*::[^:]*::[^:]*)
(
(
((?&HEX){0,4})
:
){1,7}
((?&d_ipv4)|:|(?&HEX){1,4})
)
)
(?<d_ipv4>
((?&octet)\.){3}
(?&octet)
)
(?<octet>
(
25[]0-5]|
2[0-4](?&DIGIT)|
1(?&DIGIT){2}|
[1-9](?&DIGIT)|
(?&DIGIT)
)
)
(?<d_reg_name>(?&RCHAR)*)
(?<d_urn_name>(?&UCHAR)*)
(?<d_port>(?&DIGIT)*)
(?<d_path>
(
\/
((?&PCHAR)*)*
(?=\?|\#|$)
)
)
(?<d_query>
(
((?&PCHAR)|\/|\?)*
)?
)
(?<d_fragment>
(
((?&PCHAR)|\/|\?)*
)?
)
)
^
(?<link>
(?:
(?<scheme>
(?<urn>urn)|
(?&d_scheme)
)
:
)
(?(urn)
(?:
(?<namespace_identifier>[0-9a-zA-Z\-]+)
:
(?<namespace_specific_string>(?&d_urn_name)+)
)
|
(?<hier_part>
(?<slashes>(?&d_hier_part_slashes))
(?<authority>
(?:
(?<userinfo>(?&d_authority))
@
)?
(?<host>
(?<ipv4>\[?(?&d_ipv4)\]?)|
(?<ipv6>\[(?&d_ipv6)\])|
(?<domain>(?&d_reg_name))
)
(?:
:
(?<port>(?&d_port))
)?
)
(?<path>(?&d_path))?
)
(?:
\?
(?<query>(?&d_query))
)?
(?:
\#
(?<fragment>(?&d_fragment))
)?
)
)
$
测试字符串
# Valid URIs
ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm
ftp://ftp.is.co.za/rfc/rfc1808.txt
http://www.ietf.org/rfc/rfc2396.txt
ldap://[2001:db8::7]/c=GB?objectClass?one
mailto:John.Doe@example.com
news:comp.infosystems.www.servers.unix
tel:+1-816-555-1212
telnet://192.0.2.16:80/
urn:isbn:0451450523
urn:oid:2.16.840
urn:isan:0000-0000-9E59-0000-O-0000-0000-2
urn:oasis:names:specification:docbook:dtd:xml:4.1.2
http://localhost/test/somefile.php?query=someval&variable=value#fragment
http://[2001:db8:a0b:12f0::1]/test
ftp://username:password@domain.com/path/to/file/somefile.html?queryVariable=value#fragment
https://subdomain.domain.com/path/to/file.php?query=value#fragment
https://subdomain.example.com/path/to/file.php?query=value#fragment
mailto:john.smith(comment)@example.com
mailto:user@[2001:DB8::1]
mailto:user@[255:192:168:1]
mailto:M.Handley@cs.ucl.ac.uk
http://localhost:4433/path/to/file?query#fragment
# Note that the example below IS a valid as it does follow RFC standards
localhost:4433/path/to/file
# These work with the optional scheme group although I'd suggest making the scheme mandatory as misinterpretations can occur
john.doe@gmail.com
www.google.com/pathtofile.php?query
[192a:123::192.168.1.1]:80/path/to/file.html?query#fragment
我写了一个很棒的版本,你可以运行
它匹配以下url(这对我来说已经足够好了)
public static void main(args) {
String url = "go to http://www.m.abut.ly/abc its awesome"
url = url.replaceAll(/https?:\/\/w{0,3}\w*?\.(\w*?\.)?\w{2,3}\S*|www\.(\w*?\.)?\w*?\.\w{2,3}\S*|(\w*?\.)?\w*?\.\w{2,3}[\/\?]\S*/ , { it ->
"woof${it}woof"
})
println url
}
http://google.com
http://google.com/help.php
http://google.com/help.php?a=5
http://www.google.com
http://www.google.com/help.php
http://www.google.com?a=5
google.com?a=5
google.com/help.php
google.com/help.php?a=5
http://www.m.google.com/help.php?a=5 (and all its permutations)
www.m.google.com/help.php?a=5 (and all its permutations)
m.google.com/help.php?a=5 (and all its permutations)
对于任何不以http或www开头的url,重要的是它们必须包含/或?
我打赌这可以稍作调整,但它的工作非常好,因为它是如此简短和紧凑……因为你可以把它分成三份:
找到任何以http开头的内容:
https?:\/\/w{0,3}\w*?\.\w{2,3}\S*
找到任何以www开头的东西:
www\.\w*?\.\w{2,3}\S*
或者找到任何必须有一个文本,然后一个点,然后至少两个字母,然后一个?或/:
\w*?\.\w{2,3}[\/\?]\S*