如何在PHP中解析和处理HTML/XML？

如何解析HTML/XML并从中提取信息？

当前回答

您可以尝试的另一个选项是QueryPath。它的灵感来自jQuery，但在服务器上使用PHP，并在Drupal中使用。

2011-05-31 15:12:06

其他回答

SimpleHtmlDom的第三方替代方案使用DOM而不是字符串解析：phpQuery、Zend_DOM、QueryPath和FluentDom。

2010-09-07 08:57:59

如果您熟悉jQuery选择器，可以使用ScarletsQuery for PHP

<pre><?php
include "ScarletsQuery.php";

// Load the HTML content and parse it
$html = file_get_contents('https://www.lipsum.com');
$dom = Scarlets\Library\MarkupLanguage::parseText($html);

// Select meta tag on the HTML header
$description = $dom->selector('head meta[name="description"]')[0];

// Get 'content' attribute value from meta tag
print_r($description->attr('content'));

$description = $dom->selector('#Content p');

// Get element array
print_r($description->view);

这个库通常需要不到1秒的时间来处理脱机html。它还接受无效的HTML或标记属性上缺少引号。

2018-08-16 12:35:01

顺便说一下，这通常被称为屏幕刮擦。我为此使用的库是SimpleHTMLDomParser。

2010-08-26 17:20:17

XML_HTMLMax相当稳定——即使不再维护它。另一种选择是通过HtmlTidy将HTML导入，然后用标准的XML工具解析它。

2008-11-15 19:55:44

我创建了一个名为PHPPowertools/DOM Query的库，它允许您像使用jQuery一样抓取HTML5和XML文档。

在后台，它使用symfony/DomCrawler将CSS选择器转换为XPath选择器。它总是使用相同的DomDocument，即使在将一个对象传递给另一个对象时也是如此，以确保良好的性能。

示例用法：

namespace PowerTools;

// Get file content
$htmlcode = file_get_contents('https://github.com');

// Define your DOMCrawler based on file string
$H = new DOM_Query($htmlcode);

// Define your DOMCrawler based on an existing DOM_Query instance
$H = new DOM_Query($H->select('body'));

// Passing a string (CSS selector)
$s = $H->select('div.foo');

// Passing an element object (DOM Element)
$s = $H->select($documentBody);

// Passing a DOM Query object
$s = $H->select( $H->select('p + p'));

// Select the body tag
$body = $H->select('body');

// Combine different classes as one selector to get all site blocks
$siteblocks = $body->select('.site-header, .masthead, .site-body, .site-footer');

// Nest your methods just like you would with jQuery
$siteblocks->select('button')->add('span')->addClass('icon icon-printer');

// Use a lambda function to set the text of all site blocks
$siteblocks->text(function( $i, $val) {
    return $i . " - " . $val->attr('class');
});

// Append the following HTML to all site blocks
$siteblocks->append('<div class="site-center"></div>');

// Use a descendant selector to select the site's footer
$sitefooter = $body->select('.site-footer > .site-center');

// Set some attributes for the site's footer
$sitefooter->attr(array('id' => 'aweeesome', 'data-val' => 'see'));

// Use a lambda function to set the attributes of all site blocks
$siteblocks->attr('data-val', function( $i, $val) {
    return $i . " - " . $val->attr('class') . " - photo by Kelly Clark";
});

// Select the parent of the site's footer
$sitefooterparent = $sitefooter->parent();

// Remove the class of all i-tags within the site's footer's parent
$sitefooterparent->select('i')->removeAttr('class');

// Wrap the site's footer within two nex selectors
$sitefooter->wrap('<section><div class="footer-wrapper"></div></section>');

[...]

支持的方法：

[x] $（1）[x] $.parseHTML[x] $.parseXML[x] $.parseJSON[x] $选择添加[x] $selection.addClass[x] $selection.after[x] $selection.append[x] $选择属性[x] $选择之前[x] $selection.children[x] $选择最接近[x] $selection.contents[x] $选择分离[x] $selection.每个[x] $selection.eq[x] $selection.empty（2）[x] $selection.find[x] $selection.first[x] $selection.get[x] $selection.insert之后[x] $selection.insertBefore[x] $selection.last[x] $selection.parent[x] $selection.parents[x] $selection.remove[x] $selection.removeAttr[x] $selection.removeClass[x] $selection.text[x] $selection.wrap

出于明显原因，重命名为“select”重命名为“void”，因为“empty”是PHP中的保留字

注：

该库还包括自己的零配置自动加载器，用于PSR-0兼容库。所包含的示例应该可以开箱即用，无需任何额外配置。或者，您可以将其与composer一起使用。

2015-07-09 14:33:16

如何在PHP中解析和处理HTML/XML？

推荐文章

最新文章

标签