Using regular expressions to parse HTML: why not?

You know... there's a widespread mentality that you CAN'T do it, and I think everyone on both sides of the fence is both right and wrong. You CAN do it, but it takes a little more processing than just running one regex against the input. Take this (I wrote it inside of an hour) as an example. It assumes the HTML is completely valid, but depending on the language you use to apply the regex, you could repair the HTML first to make sure it will succeed. For example, remove closing tags that are not supposed to be there, such as </img>, then add the trailing forward slash to self-closing elements that are missing it, and so on.

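I'd use this in the context of writing a library that lets me retrieve HTML elements in a way akin to JavaScript's [x].getElementsByTagName(). I'd just splice together the functionality I wrote in the DEFINE section of the regex and use it to step into the element tree, one level at a time.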

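So, will this be the final, 100% answer for validating HTML? No. But it's a start, and with a little more work it can be done. However, trying to do it all inside a single regex execution is neither practical nor efficient.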
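To illustrate the "regex plus a little more processing" idea, here is a minimal Python sketch. It is not the regex referred to above: Python's built-in `re` module has no recursive subpatterns, so this swaps in a different technique, where one regex finds the tags and a small stack pairs them up. The function name `get_elements_by_tag_name` is hypothetical, and the code assumes valid HTML with no `>` inside attribute values.

```python
import re

def get_elements_by_tag_name(html, tag):
    """Return the outer HTML of every <tag> element, in document order.

    Assumes well-formed input; stray closing tags are ignored.
    """
    # Group 1 is "/" for a closing tag, group 2 is "/" for a self-closing tag.
    tag_re = re.compile(r"<(/?)%s\b[^>]*?(/?)>" % re.escape(tag), re.IGNORECASE)
    found = []   # (start offset, outer HTML) pairs
    stack = []   # start offsets of elements that are still open
    for m in tag_re.finditer(html):
        closing, self_closing = m.group(1), m.group(2)
        if closing:
            if stack:
                start = stack.pop()
                found.append((start, html[start:m.end()]))
        elif self_closing:
            found.append((m.start(), m.group(0)))
        else:
            stack.append(m.start())
    found.sort()  # order by where each element starts
    return [text for _, text in found]

html = '<div id="a"><div id="b">x</div><p>y</p></div><div id="c"/>'
for element in get_elements_by_tag_name(html, "div"):
    print(element)
```

The regex only recognizes individual tags; the nesting is handled by the stack, which is the extra processing a single regex execution cannot reasonably do.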

2015-11-22 15:03:21

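As far as parsing goes, regular expressions are useful in the "lexical analysis" (lexer) stage, where the input is broken down into tokens. They are less useful in the actual "build a parse tree" stage.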

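For an HTML parser, I'd expect it to accept only well-formed HTML, and that requires capabilities beyond what a regular expression can do (they cannot "count" and make sure that a given number of opening elements is balanced by the same number of closing elements).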
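A rough Python sketch of that division of labor, under simplifying assumptions: a single regex acts as the lexer, and a stack does the counting. The names `TOKEN_RE`, `VOID_ELEMENTS`, `tokenize` and `is_balanced` are made up for this example, the void-element list is partial, and input the token regex does not recognize (such as a stray `<`) is silently skipped.

```python
import re

# Lexer stage: one regex splits the input into comment, tag and text tokens.
TOKEN_RE = re.compile(
    r"(?P<comment><!--.*?-->)"
    r"|(?P<close></(?P<close_name>[A-Za-z][A-Za-z0-9]*)\s*>)"
    r"|(?P<open><(?P<open_name>[A-Za-z][A-Za-z0-9]*)\b[^>]*?(?P<selfclose>/?)>)"
    r"|(?P<text>[^<]+)",
    re.DOTALL,
)

# Elements that never take a closing tag (partial list, for illustration only).
VOID_ELEMENTS = {"br", "hr", "img", "input", "meta", "link"}

def tokenize(html):
    """Yield (kind, value) pairs: 'open', 'close', 'selfclose', 'comment' or 'text'."""
    for m in TOKEN_RE.finditer(html):
        if m.group("close"):
            yield "close", m.group("close_name").lower()
        elif m.group("open"):
            name = m.group("open_name").lower()
            if m.group("selfclose") or name in VOID_ELEMENTS:
                yield "selfclose", name
            else:
                yield "open", name
        elif m.group("comment"):
            yield "comment", m.group("comment")
        else:
            yield "text", m.group("text")

def is_balanced(html):
    """Parser stage: check that every opening tag has a matching closing tag."""
    stack = []
    for kind, value in tokenize(html):
        if kind == "open":
            stack.append(value)
        elif kind == "close":
            if not stack or stack.pop() != value:
                return False
    return not stack

print(is_balanced("<div><p>hi</p></div>"))   # True
print(is_balanced("<div><p>hi</div></p>"))   # False
```

The stack is exactly the memory that a regular expression lacks: its depth can grow without bound, whereas a regular expression corresponds to a machine with a fixed number of states.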

2009-02-26 14:34:11

Two quick reasons:

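- Writing a regex that can stand up to malicious input is hard; much harder than using a prebuilt tool.
- Writing a regex that can cope with the ridiculous markup you will inevitably run into is hard; much harder than using a prebuilt tool.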

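As for the suitability of regexes for parsing in general: they aren't suitable. Have you ever seen the kinds of regexes you would need to parse most languages?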

2009-02-26 14:29:02

Keep in mind that while HTML itself isn't regular, parts of the page you are looking at might be regular.

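For example, it is an error for <form> tags to be nested; if the web page is working correctly, then using a regular expression to grab a <form> would be completely reasonable.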

I recently did some web scraping using only Selenium and regular expressions. I got away with it because the data I wanted was inside a <form> and laid out in a simple table format (so I could even count on <table>, <tr> and <td> being non-nested, which is actually highly unusual). To some degree, regular expressions were almost necessary, because some of the structure I needed to access was delimited by comments. (Beautiful Soup can give you comments, but it would have been difficult to grab the comment-delimited blocks using Beautiful Soup.)

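But if I had had to worry about nested tables, my approach simply would not have worked! I would have had to fall back on Beautiful Soup. Even then, however, you can sometimes use a regular expression to grab the chunk you need and drill down from there.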
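A small Python sketch of that kind of extraction, under the same assumptions the answer relies on (one non-nested <form>, non-nested table tags). The page fragment, the `BEGIN ROWS` / `END ROWS` comment markers and the function name `extract_rows` are all invented for illustration; they are not the markers from the scraped site.

```python
import re

# Hypothetical page fragment; the comment markers are invented for this example.
PAGE = """
<form action="/search">
<!-- BEGIN ROWS -->
<table>
<tr><td>alpha</td><td>1</td></tr>
<tr><td>beta</td><td>2</td></tr>
</table>
<!-- END ROWS -->
</form>
"""

FORM_RE = re.compile(r"<form\b[^>]*>(.*?)</form>", re.DOTALL | re.IGNORECASE)
BLOCK_RE = re.compile(r"<!-- BEGIN ROWS -->(.*?)<!-- END ROWS -->", re.DOTALL)
ROW_RE = re.compile(r"<tr\b[^>]*>(.*?)</tr>", re.DOTALL | re.IGNORECASE)
CELL_RE = re.compile(r"<td\b[^>]*>(.*?)</td>", re.DOTALL | re.IGNORECASE)

def extract_rows(page):
    """Return table rows as lists of cell strings, or [] if nothing matches."""
    form = FORM_RE.search(page)
    if form is None:
        return []
    # Narrow the search to the comment-delimited block inside the form.
    block = BLOCK_RE.search(form.group(1))
    if block is None:
        return []
    return [
        [cell.strip() for cell in CELL_RE.findall(row)]
        for row in ROW_RE.findall(block.group(1))
    ]

print(extract_rows(PAGE))  # [['alpha', '1'], ['beta', '2']]
```

The lazy `.*?` stops at the first closing tag it sees, so a table nested inside a cell would cut a row short. That is the failure mode described above.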

2013-02-12 18:34:47

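Because there are many ways to "screw up" HTML that browsers treat rather liberally, and it would take considerable effort to reproduce that liberal behavior with regular expressions so as to cover every case, your regex will inevitably fail on some special cases, and that could introduce serious security holes into your system.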
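One well-known instance of such a special case, sketched in Python. The "sanitizer" below is deliberately naive and the names `SCRIPT_RE` and `naive_strip_scripts` are hypothetical; it shows a single failure pattern, not an exhaustive list.

```python
import re

# A deliberately naive filter: delete anything that looks like a script element.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script>", re.DOTALL | re.IGNORECASE)

def naive_strip_scripts(html):
    """Remove script elements with a single regex pass (unsafe, for illustration)."""
    return SCRIPT_RE.sub("", html)

# Ordinary input is handled as intended.
print(naive_strip_scripts("<p>hi</p><script>alert(1)</script>"))
# <p>hi</p>

# Crafted input: removing the inner, empty script elements splices the
# surrounding fragments back together into a working script element.
crafted = "<scr<script></script>ipt>alert(1)</scr<script></script>ipt>"
print(naive_strip_scripts(crafted))
# <script>alert(1)</script>
```

Running the substitution repeatedly until nothing changes would close this particular hole, but other routes, such as event-handler attributes, would remain open, which is why a prebuilt parser or sanitizer is the safer choice.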

2009-02-26 14:29:35
