我如何检查一个URL是否存在(不是404)在PHP?


当前回答

karim79的get_headers()解决方案并没有为我工作,因为我得到了疯狂的结果与Pinterest。

get_headers(): SSL operation failed with code 1. OpenSSL Error messages: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed

Array
(
    [url] => https://www.pinterest.com/jonathan_parl/
    [exists] => 
)

get_headers(): Failed to enable crypto

Array
(
    [url] => https://www.pinterest.com/jonathan_parl/
    [exists] => 
)

get_headers(https://www.pinterest.com/jonathan_parl/): failed to open stream: operation failed

Array
(
    [url] => https://www.pinterest.com/jonathan_parl/
    [exists] => 
) 

不管怎样,这个开发人员演示了cURL比get_headers()快得多:

http://php.net/manual/fr/function.get-headers.php#104723

由于许多人要求karim79修复的是cURL解决方案,这里是我今天构建的解决方案。

/**
* Send an HTTP request to a the $url and check the header posted back.
*
* @param $url String url to which we must send the request.
* @param $failCodeList Int array list of code for which the page is considered invalid.
*
* @return Boolean
*/
public static function isUrlExists($url, array $failCodeList = array(404)){

    $exists = false;

    if(!StringManager::stringStartWith($url, "http") and !StringManager::stringStartWith($url, "ftp")){

        $url = "https://" . $url;
    }

    if (preg_match(RegularExpression::URL, $url)){

        $handle = curl_init($url);


        curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);

        curl_setopt($handle, CURLOPT_SSL_VERIFYPEER, false);

        curl_setopt($handle, CURLOPT_HEADER, true);

        curl_setopt($handle, CURLOPT_NOBODY, true);

        curl_setopt($handle, CURLOPT_USERAGENT, true);


        $headers = curl_exec($handle);

        curl_close($handle);


        if (empty($failCodeList) or !is_array($failCodeList)){

            $failCodeList = array(404); 
        }

        if (!empty($headers)){

            $exists = true;

            $headers = explode(PHP_EOL, $headers);

            foreach($failCodeList as $code){

                if (is_numeric($code) and strpos($headers[0], strval($code)) !== false){

                    $exists = false;

                    break;  
                }
            }
        }
    }

    return $exists;
}

让我来解释一下旋度选项:

CURLOPT_RETURNTRANSFER:返回一个字符串,而不是在屏幕上显示调用页面。

CURLOPT_SSL_VERIFYPEER: cUrl不会签出证书

CURLOPT_HEADER:在字符串中包含头文件

CURLOPT_NOBODY:不要在字符串中包含body

CURLOPT_USERAGENT:一些站点需要它才能正常运行(例如:https://plus.google.com)


附加说明:在这个函数中,我使用Diego Perini的正则表达式在发送请求之前验证URL:

const URL = "%^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@|\d{1,3}(?:\.\d{1,3}){3}|(?:(?:[a-z\d\x{00a1}-\x{ffff}]+-?)*[a-z\d\x{00a1}-\x{ffff}]+)(?:\.(?:[a-z\d\x{00a1}-\x{ffff}]+-?)*[a-z\d\x{00a1}-\x{ffff}]+)*(?:\.[a-z\x{00a1}-\x{ffff}]{2,6}))(?::\d+)?(?:[^\s]*)?$%iu"; //@copyright Diego Perini

附加说明2:我将标题字符串和用户标题[0]分开,以确保只验证返回代码和消息(例如:200、404、405等)。

附加说明3:有时仅验证代码404是不够的(参见单元测试),因此有一个可选的$failCodeList参数提供所有要拒绝的代码列表。

当然,这里还有单元测试(包括所有流行的社交网络)来证明我的代码是合法的:

public function testIsUrlExists(){

//invalid
$this->assertFalse(ToolManager::isUrlExists("woot"));

$this->assertFalse(ToolManager::isUrlExists("https://www.facebook.com/jonathan.parentlevesque4545646456"));

$this->assertFalse(ToolManager::isUrlExists("https://plus.google.com/+JonathanParentL%C3%A9vesque890800"));

$this->assertFalse(ToolManager::isUrlExists("https://instagram.com/mariloubiz1232132/", array(404, 405)));

$this->assertFalse(ToolManager::isUrlExists("https://www.pinterest.com/jonathan_parl1231/"));

$this->assertFalse(ToolManager::isUrlExists("https://regex101.com/546465465456"));

$this->assertFalse(ToolManager::isUrlExists("https://twitter.com/arcadefire4566546"));

$this->assertFalse(ToolManager::isUrlExists("https://vimeo.com/**($%?%$", array(400, 405)));

$this->assertFalse(ToolManager::isUrlExists("https://www.youtube.com/user/Darkjo666456456456"));


//valid
$this->assertTrue(ToolManager::isUrlExists("www.google.ca"));

$this->assertTrue(ToolManager::isUrlExists("https://www.facebook.com/jonathan.parentlevesque"));

$this->assertTrue(ToolManager::isUrlExists("https://plus.google.com/+JonathanParentL%C3%A9vesque"));

$this->assertTrue(ToolManager::isUrlExists("https://instagram.com/mariloubiz/"));

$this->assertTrue(ToolManager::isUrlExists("https://www.facebook.com/jonathan.parentlevesque"));

$this->assertTrue(ToolManager::isUrlExists("https://www.pinterest.com/"));

$this->assertTrue(ToolManager::isUrlExists("https://regex101.com"));

$this->assertTrue(ToolManager::isUrlExists("https://twitter.com/arcadefire"));

$this->assertTrue(ToolManager::isUrlExists("https://vimeo.com/"));

$this->assertTrue(ToolManager::isUrlExists("https://www.youtube.com/user/Darkjo666"));
}

祝大家取得巨大成功,

Jonathan Parent-Lévesque在蒙特利尔报道

其他回答

以上所有解决方案+额外的糖。(终极AIO解决方案)

/**
 * Check that given URL is valid and exists.
 * @param string $url URL to check
 * @return bool TRUE when valid | FALSE anyway
 */
function urlExists ( $url ) {
    // Remove all illegal characters from a url
    $url = filter_var($url, FILTER_SANITIZE_URL);

    // Validate URI
    if (filter_var($url, FILTER_VALIDATE_URL) === FALSE
        // check only for http/https schemes.
        || !in_array(strtolower(parse_url($url, PHP_URL_SCHEME)), ['http','https'], true )
    ) {
        return false;
    }

    // Check that URL exists
    $file_headers = @get_headers($url);
    return !(!$file_headers || $file_headers[0] === 'HTTP/1.1 404 Not Found');
}

例子:

var_dump ( urlExists('http://stackoverflow.com/') );
// Output: true;

当从php中判断url是否存在时,有几件事需要注意:

Is the url itself valid (a string, not empty, good syntax), this is quick to check server side. Waiting for a response might take time and block code execution. Not all headers returned by get_headers() are well formed. Use curl (if you can). Prevent fetching the entire body/content, but only request the headers. Consider redirecting urls: Do you want the first code returned? Or follow all redirects and return the last code? You might end up with a 200, but it could redirect using meta tags or javascript. Figuring out what happens after is tough.

请记住,无论你使用什么方法,等待回复都需要时间。 所有代码都可能(很可能)停止,直到您知道结果或请求超时。

例如:如果url无效或不可达,下面的代码可能需要很长时间才能显示页面:

<?php
$urls = getUrls(); // some function getting say 10 or more external links

foreach($urls as $k=>$url){
  // this could potentially take 0-30 seconds each
  // (more or less depending on connection, target site, timeout settings...)
  if( ! isValidUrl($url) ){
    unset($urls[$k]);
  }
}

echo "yay all done! now show my site";
foreach($urls as $url){
  echo "<a href=\"{$url}\">{$url}</a><br/>";
}

下面的函数可能会有帮助,你可能想修改它们以适应你的需要:

    function isValidUrl($url){
        // first do some quick sanity checks:
        if(!$url || !is_string($url)){
            return false;
        }
        // quick check url is roughly a valid http request: ( http://blah/... ) 
        if( ! preg_match('/^http(s)?:\/\/[a-z0-9-]+(\.[a-z0-9-]+)*(:[0-9]+)?(\/.*)?$/i', $url) ){
            return false;
        }
        // the next bit could be slow:
        if(getHttpResponseCode_using_curl($url) != 200){
//      if(getHttpResponseCode_using_getheaders($url) != 200){  // use this one if you cant use curl
            return false;
        }
        // all good!
        return true;
    }
    
    function getHttpResponseCode_using_curl($url, $followredirects = true){
        // returns int responsecode, or false (if url does not exist or connection timeout occurs)
        // NOTE: could potentially take up to 0-30 seconds , blocking further code execution (more or less depending on connection, target site, and local timeout settings))
        // if $followredirects == false: return the FIRST known httpcode (ignore redirects)
        // if $followredirects == true : return the LAST  known httpcode (when redirected)
        if(! $url || ! is_string($url)){
            return false;
        }
        $ch = @curl_init($url);
        if($ch === false){
            return false;
        }
        @curl_setopt($ch, CURLOPT_HEADER         ,true);    // we want headers
        @curl_setopt($ch, CURLOPT_NOBODY         ,true);    // dont need body
        @curl_setopt($ch, CURLOPT_RETURNTRANSFER ,true);    // catch output (do NOT print!)
        if($followredirects){
            @curl_setopt($ch, CURLOPT_FOLLOWLOCATION ,true);
            @curl_setopt($ch, CURLOPT_MAXREDIRS      ,10);  // fairly random number, but could prevent unwanted endless redirects with followlocation=true
        }else{
            @curl_setopt($ch, CURLOPT_FOLLOWLOCATION ,false);
        }
//      @curl_setopt($ch, CURLOPT_CONNECTTIMEOUT ,5);   // fairly random number (seconds)... but could prevent waiting forever to get a result
//      @curl_setopt($ch, CURLOPT_TIMEOUT        ,6);   // fairly random number (seconds)... but could prevent waiting forever to get a result
//      @curl_setopt($ch, CURLOPT_USERAGENT      ,"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1");   // pretend we're a regular browser
        @curl_exec($ch);
        if(@curl_errno($ch)){   // should be 0
            @curl_close($ch);
            return false;
        }
        $code = @curl_getinfo($ch, CURLINFO_HTTP_CODE); // note: php.net documentation shows this returns a string, but really it returns an int
        @curl_close($ch);
        return $code;
    }
    
    function getHttpResponseCode_using_getheaders($url, $followredirects = true){
        // returns string responsecode, or false if no responsecode found in headers (or url does not exist)
        // NOTE: could potentially take up to 0-30 seconds , blocking further code execution (more or less depending on connection, target site, and local timeout settings))
        // if $followredirects == false: return the FIRST known httpcode (ignore redirects)
        // if $followredirects == true : return the LAST  known httpcode (when redirected)
        if(! $url || ! is_string($url)){
            return false;
        }
        $headers = @get_headers($url);
        if($headers && is_array($headers)){
            if($followredirects){
                // we want the last errorcode, reverse array so we start at the end:
                $headers = array_reverse($headers);
            }
            foreach($headers as $hline){
                // search for things like "HTTP/1.1 200 OK" , "HTTP/1.0 200 OK" , "HTTP/1.1 301 PERMANENTLY MOVED" , "HTTP/1.1 400 Not Found" , etc.
                // note that the exact syntax/version/output differs, so there is some string magic involved here
                if(preg_match('/^HTTP\/\S+\s+([1-9][0-9][0-9])\s+.*/', $hline, $matches) ){// "HTTP/*** ### ***"
                    $code = $matches[1];
                    return $code;
                }
            }
            // no HTTP/xxx found in headers:
            return false;
        }
        // no headers :
        return false;
    }

cURL可以返回HTTP代码,我不认为所有额外的代码是必要的?

function urlExists($url=NULL)
    {
        if($url == NULL) return false;
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_TIMEOUT, 5);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $data = curl_exec($ch);
        $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch); 
        if($httpcode>=200 && $httpcode<300){
            return true;
        } else {
            return false;
        }
    }

很快:

function http_response($url){
    $resURL = curl_init(); 
    curl_setopt($resURL, CURLOPT_URL, $url); 
    curl_setopt($resURL, CURLOPT_BINARYTRANSFER, 1); 
    curl_setopt($resURL, CURLOPT_HEADERFUNCTION, 'curlHeaderCallback'); 
    curl_setopt($resURL, CURLOPT_FAILONERROR, 1); 
    curl_exec ($resURL); 
    $intReturnCode = curl_getinfo($resURL, CURLINFO_HTTP_CODE); 
    curl_close ($resURL); 
    if ($intReturnCode != 200 && $intReturnCode != 302 && $intReturnCode != 304) { return 0; } else return 1;
}

echo 'google:';
echo http_response('http://www.google.com');
echo '/ ogogle:';
echo http_response('http://www.ogogle.com');

karim79的get_headers()解决方案并没有为我工作,因为我得到了疯狂的结果与Pinterest。

get_headers(): SSL operation failed with code 1. OpenSSL Error messages: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed

Array
(
    [url] => https://www.pinterest.com/jonathan_parl/
    [exists] => 
)

get_headers(): Failed to enable crypto

Array
(
    [url] => https://www.pinterest.com/jonathan_parl/
    [exists] => 
)

get_headers(https://www.pinterest.com/jonathan_parl/): failed to open stream: operation failed

Array
(
    [url] => https://www.pinterest.com/jonathan_parl/
    [exists] => 
) 

不管怎样,这个开发人员演示了cURL比get_headers()快得多:

http://php.net/manual/fr/function.get-headers.php#104723

由于许多人要求karim79修复的是cURL解决方案,这里是我今天构建的解决方案。

/**
* Send an HTTP request to a the $url and check the header posted back.
*
* @param $url String url to which we must send the request.
* @param $failCodeList Int array list of code for which the page is considered invalid.
*
* @return Boolean
*/
public static function isUrlExists($url, array $failCodeList = array(404)){

    $exists = false;

    if(!StringManager::stringStartWith($url, "http") and !StringManager::stringStartWith($url, "ftp")){

        $url = "https://" . $url;
    }

    if (preg_match(RegularExpression::URL, $url)){

        $handle = curl_init($url);


        curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);

        curl_setopt($handle, CURLOPT_SSL_VERIFYPEER, false);

        curl_setopt($handle, CURLOPT_HEADER, true);

        curl_setopt($handle, CURLOPT_NOBODY, true);

        curl_setopt($handle, CURLOPT_USERAGENT, true);


        $headers = curl_exec($handle);

        curl_close($handle);


        if (empty($failCodeList) or !is_array($failCodeList)){

            $failCodeList = array(404); 
        }

        if (!empty($headers)){

            $exists = true;

            $headers = explode(PHP_EOL, $headers);

            foreach($failCodeList as $code){

                if (is_numeric($code) and strpos($headers[0], strval($code)) !== false){

                    $exists = false;

                    break;  
                }
            }
        }
    }

    return $exists;
}

让我来解释一下旋度选项:

CURLOPT_RETURNTRANSFER:返回一个字符串,而不是在屏幕上显示调用页面。

CURLOPT_SSL_VERIFYPEER: cUrl不会签出证书

CURLOPT_HEADER:在字符串中包含头文件

CURLOPT_NOBODY:不要在字符串中包含body

CURLOPT_USERAGENT:一些站点需要它才能正常运行(例如:https://plus.google.com)


附加说明:在这个函数中,我使用Diego Perini的正则表达式在发送请求之前验证URL:

const URL = "%^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@|\d{1,3}(?:\.\d{1,3}){3}|(?:(?:[a-z\d\x{00a1}-\x{ffff}]+-?)*[a-z\d\x{00a1}-\x{ffff}]+)(?:\.(?:[a-z\d\x{00a1}-\x{ffff}]+-?)*[a-z\d\x{00a1}-\x{ffff}]+)*(?:\.[a-z\x{00a1}-\x{ffff}]{2,6}))(?::\d+)?(?:[^\s]*)?$%iu"; //@copyright Diego Perini

附加说明2:我将标题字符串和用户标题[0]分开,以确保只验证返回代码和消息(例如:200、404、405等)。

附加说明3:有时仅验证代码404是不够的(参见单元测试),因此有一个可选的$failCodeList参数提供所有要拒绝的代码列表。

当然,这里还有单元测试(包括所有流行的社交网络)来证明我的代码是合法的:

public function testIsUrlExists(){

//invalid
$this->assertFalse(ToolManager::isUrlExists("woot"));

$this->assertFalse(ToolManager::isUrlExists("https://www.facebook.com/jonathan.parentlevesque4545646456"));

$this->assertFalse(ToolManager::isUrlExists("https://plus.google.com/+JonathanParentL%C3%A9vesque890800"));

$this->assertFalse(ToolManager::isUrlExists("https://instagram.com/mariloubiz1232132/", array(404, 405)));

$this->assertFalse(ToolManager::isUrlExists("https://www.pinterest.com/jonathan_parl1231/"));

$this->assertFalse(ToolManager::isUrlExists("https://regex101.com/546465465456"));

$this->assertFalse(ToolManager::isUrlExists("https://twitter.com/arcadefire4566546"));

$this->assertFalse(ToolManager::isUrlExists("https://vimeo.com/**($%?%$", array(400, 405)));

$this->assertFalse(ToolManager::isUrlExists("https://www.youtube.com/user/Darkjo666456456456"));


//valid
$this->assertTrue(ToolManager::isUrlExists("www.google.ca"));

$this->assertTrue(ToolManager::isUrlExists("https://www.facebook.com/jonathan.parentlevesque"));

$this->assertTrue(ToolManager::isUrlExists("https://plus.google.com/+JonathanParentL%C3%A9vesque"));

$this->assertTrue(ToolManager::isUrlExists("https://instagram.com/mariloubiz/"));

$this->assertTrue(ToolManager::isUrlExists("https://www.facebook.com/jonathan.parentlevesque"));

$this->assertTrue(ToolManager::isUrlExists("https://www.pinterest.com/"));

$this->assertTrue(ToolManager::isUrlExists("https://regex101.com"));

$this->assertTrue(ToolManager::isUrlExists("https://twitter.com/arcadefire"));

$this->assertTrue(ToolManager::isUrlExists("https://vimeo.com/"));

$this->assertTrue(ToolManager::isUrlExists("https://www.youtube.com/user/Darkjo666"));
}

祝大家取得巨大成功,

Jonathan Parent-Lévesque在蒙特利尔报道