I have a small problem with XPath contains and dom4j...
Let's say my XML is
<Home>
<Addr>
<Street>ABC</Street>
<Number>5</Number>
<Comment>BLAH BLAH BLAH <br/><br/>ABC</Comment>
</Addr>
</Home>
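Let's say I want to find all the nodes that have ABC in their text, given the root element...
So the XPath I would need to write is
//*[contains(text(),'ABC')]
However, this is not what dom4j returns: the query returns only the Street element and not the Comment element. Is this a dom4j problem, or a gap in my understanding of how XPath works?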
The DOM makes the Comment element a composite element with four child nodes (two of them text nodes):
[Text = 'XYZ'][BR][BR][Text = 'ABC']
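I would assume that the query should still return the element, since it should find the element and run contains on it, but it doesn't...
The following query returns the element, but it returns far more than just the element: it returns the parent elements as well, which is undesirable for this problem.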
//*[contains(.,'ABC')]
Does anyone know an XPath query that would return just the elements <Street/> and <Comment/>?
Modern answer that covers XPath 1.0 vs XPath 2.0+ behavior...
This XPath,
//*[contains(text(),'ABC')]
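behaves differently in XPath 1.0 and in later versions of XPath (2.0+).
Common behavior
//* selects all elements in the document.
[] filters those elements according to the predicate expressed inside it.
contains(string, substring) within the predicate keeps only those elements for which substring is a substring of string.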
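XPath 1.0 behavior
When contains(string, substring) is passed a node-set as its first argument, it converts the node-set to a string by taking the string-value of the first node in the node-set.
For //*[contains(text(),'ABC')], that node-set is, for each element in the document, all of its child text nodes.
Since only the first child text node is used, the expectation that all child text nodes are tested for containing the 'ABC' substring is violated.
This leads to counter-intuitive results for anyone unfamiliar with the conversion rule above.
XPath 1.0 online example shows that only one 'ABC' is selected.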
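To see the XPath 1.0 behavior locally, here is a minimal sketch that runs the query against the question's XML using the XPath engine bundled with the JDK (javax.xml.xpath, which implements XPath 1.0). This is not dom4j itself; the class name XPath10FirstNodeDemo and the helper select are made up for illustration, and the XML is written on one line for brevity.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses the JDK's XPath 1.0 engine, not dom4j.
final class XPath10FirstNodeDemo {
    static final String XML =
        "<Home><Addr><Street>ABC</Street><Number>5</Number>"
        + "<Comment>BLAH BLAH BLAH <br/><br/>ABC</Comment></Addr></Home>";

    // Returns the names of the elements selected by the expression, in document order.
    static List<String> select(String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate(expression, doc, XPathConstants.NODESET);
        List<String> names = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            names.add(nodes.item(i).getNodeName());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Only the first child text node of each element is tested,
        // so Comment ("BLAH BLAH BLAH " comes first) is not selected.
        System.out.println(select("//*[contains(text(),'ABC')]")); // prints [Street]
    }
}
```

The result matches what the question reports from dom4j, which suggests the behavior comes from the XPath 1.0 conversion rule rather than from dom4j.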
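XPath 2.0+ behavior
It is an error to call contains(string, substring) with a sequence of more than one item as the first argument.
This corrects the counter-intuitive behavior described above for XPath 1.0.
XPath 2.0 online example shows a typical error message caused by the conversion error particular to XPath 2.0+.
Common solutions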
If you wish to include descendant elements (beyond children), test against the string value of the element as a single string rather than against the individual string values of its child text nodes. This XPath,
//*[contains(.,'ABC')]
selects your targeted Street and Comment elements and also their Addr and Home ancestor elements because those too have 'ABC' as substrings of their string values.
Online example shows ancestors being selected too.
If you wish to exclude descendant elements (beyond children), this XPath,
//*[text()[contains(.,'ABC')]]
selects only your targeted Street and Comment because only those elements have text node children whose string values contain the 'ABC' substring. This holds for all versions of XPath.
Online example shows only Street and Comment being selected.
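The two solutions can be compared side by side with a short sketch, again using the JDK's built-in XPath 1.0 engine rather than dom4j. The class name ContainsFixesDemo and the helper select are hypothetical, and the XML is the question's document written on one line.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class comparing the two XPath expressions from the answer.
final class ContainsFixesDemo {
    static final String XML =
        "<Home><Addr><Street>ABC</Street><Number>5</Number>"
        + "<Comment>BLAH BLAH BLAH <br/><br/>ABC</Comment></Addr></Home>";

    // Returns the names of the elements selected by the expression, in document order.
    static List<String> select(String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate(expression, doc, XPathConstants.NODESET);
        List<String> names = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            names.add(nodes.item(i).getNodeName());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // String value of the whole element: ancestors match too.
        System.out.println(select("//*[contains(.,'ABC')]"));
        // prints [Home, Addr, Street, Comment]

        // Each child text node tested on its own: only direct owners match.
        System.out.println(select("//*[text()[contains(.,'ABC')]]"));
        // prints [Street, Comment]
    }
}
```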
The <Comment> tag contains two text nodes and two <br> nodes as children.
Your XPath expression was
//*[contains(text(),'ABC')]
To pick this apart,
* is a selector that matches any element (i.e. tag) -- it returns a node-set.
The [] are a conditional that operates on each individual node in that node set. It matches if any of the individual nodes it operates on match the conditions inside the brackets.
text() is a selector that matches all of the text nodes that are children of the context node -- it returns a node set.
contains is a function that operates on a string. If it is passed a node set, the node set is converted into a string by returning the string-value of the node in the node-set that is first in document order. Hence, it can match only the first text node in your <Comment> element -- namely BLAH BLAH BLAH. Since that doesn't match, you don't get a <Comment> in your results.
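The node-set-to-string conversion described above can be observed directly by asking an XPath 1.0 engine for string(...) of the Comment element's text nodes. The sketch below uses the JDK's javax.xml.xpath engine; the class name NodeSetToStringDemo and the helper evalString are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class showing how XPath 1.0 turns a node-set into a string.
final class NodeSetToStringDemo {
    static final String XML =
        "<Home><Addr><Street>ABC</Street><Number>5</Number>"
        + "<Comment>BLAH BLAH BLAH <br/><br/>ABC</Comment></Addr></Home>";

    // Evaluates the expression against the sample document and returns the result as a string.
    static String evalString(String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));
        return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Two text nodes are selected, but only the first one is converted.
        System.out.println("[" + evalString("string(//Comment/text())") + "]");
        // prints [BLAH BLAH BLAH ]

        // The second text node has to be addressed individually to be seen.
        System.out.println("[" + evalString("string(//Comment/text()[2])") + "]");
        // prints [ABC]

        // The string value of the element concatenates all descendant text.
        System.out.println("[" + evalString("string(//Comment)") + "]");
        // prints [BLAH BLAH BLAH ABC]
    }
}
```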
You need to change this to
//*[text()[contains(.,'ABC')]]
* is a selector that matches any element (i.e. tag) -- it returns a node-set.
The outer [] are a conditional that operates on each individual node in that node set -- here it operates on each element in the document.
text() is a selector that matches all of the text nodes that are children of the context node -- it returns a node set.
The inner [] are a conditional that operates on each node in that node set -- here each individual text node. Each individual text node is the starting point for any path in the brackets, and can also be referred to explicitly as . within the brackets. It matches if any of the individual nodes it operates on match the conditions inside the brackets.
contains is a function that operates on a string. Here it is passed an individual text node (.). Since it is passed the second text node in the <Comment> tag individually, it will see the 'ABC' string and be able to match it.
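As a rough illustration of what the nested predicate does, the same selection can be written as two explicit loops over a W3C DOM tree: the outer loop plays the role of //* and the inner loop plays the role of text()[contains(.,'ABC')]. This is only a sketch under simplifying assumptions (it ignores CDATA sections and assumes the parser does not split a run of text into adjacent text nodes); the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical hand-written equivalent of //*[text()[contains(.,needle)]].
final class TextChildWalkDemo {
    static List<String> elementsWithMatchingTextChild(String xml, String needle) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        List<String> names = new ArrayList<>();
        // Outer step (//*): visit every element in document order.
        NodeList elements = doc.getElementsByTagName("*");
        for (int i = 0; i < elements.getLength(); i++) {
            Node element = elements.item(i);
            // Inner step (text()[...]): test each child text node on its own.
            NodeList children = element.getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child.getNodeType() == Node.TEXT_NODE
                        && child.getNodeValue().contains(needle)) {
                    names.add(element.getNodeName());
                    break; // one matching text node is enough for this element
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Home><Addr><Street>ABC</Street><Number>5</Number>"
            + "<Comment>BLAH BLAH BLAH <br/><br/>ABC</Comment></Addr></Home>";
        System.out.println(elementsWithMatchingTextChild(xml, "ABC")); // prints [Street, Comment]
    }
}
```

Because every text node is checked individually, the second text node of Comment gets its turn, which is exactly what the single contains(text(),'ABC') call never gives it in XPath 1.0.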