I'm trying to use sed to clean up lines of URLs and extract just the domain.

So from:

http://www.suepearson.co.uk/product/174/71/3816/

I want:

http://www.suepearson.co.uk/

(With or without the trailing slash, it doesn't matter.)

I have tried:

 sed 's|\(http:\/\/.*?\/\).*|\1|'

and (escaping the non-greedy quantifier):

sed 's|\(http:\/\/.*\?\/\).*|\1|'

But I can't seem to get the non-greedy quantifier (?) to work, so it always ends up matching the whole string.


Current answer

sed 's|\(http:\/\/www\.[a-z.0-9]*\/\).*|\1|' works too.
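
A more general variant of the same idea, as a minimal sketch: instead of listing the characters a host name may contain, match "anything that is not a slash" with [^/]*, which stops at the first / without needing a non-greedy quantifier. As far as I know, ? is an ordinary character in sed's basic regular expressions and GNU's \? means "zero or one", so neither of them makes .* lazy. The variable names below are only illustrative.

```shell
# Sample URL taken from the question.
url='http://www.suepearson.co.uk/product/174/71/3816/'

# [^/]* cannot cross a slash, so the capture ends at the first "/" after the host.
domain=$(printf '%s\n' "$url" | sed 's|\(http://[^/]*/\).*|\1|')

printf '%s\n' "$domain"
```

This should print http://www.suepearson.co.uk/ and does not depend on the host starting with www or containing only lowercase letters, digits and dots.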

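Other answers

Unfortunately, as mentioned, sed does not support this. To get around it, I suggest the next best thing (actually even better): vim's sed-like capabilities.

Define this in your .bash_profile: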

vimdo() { vim $2 --not-a-term -c "$1"  -es +"w >> /dev/stdout" -cq!  ; }

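This creates a headless vim to execute a command.

Now you can do, for example:

echo $PATH | vimdo "%s_\c:[a-zA-Z0-9\\/]\{-}python[a-zA-Z0-9\\/]\{-}:__g" -

to filter out python in $PATH.

Use - to make vimdo read its input from the pipe.

Most of the syntax is the same, but Vim has more advanced features, and \{-} is its standard non-greedy quantifier. See :help regexp.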

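Since PCRE is also tagged here, we can use GNU grep with a lazy (non-greedy) match, .*?, in the regex. It stops at the first, nearest occurrence, as opposed to .*, which is greedy and runs up to the last occurrence of the match.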

grep -oP '^http[s]?:\/\/.*?/' Input_file

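Explanation: grep's -oP options are used here, where -P enables PCRE regular expressions and -o prints only the matched part. The regex in the grep command matches a leading http or https followed by :// and then everything up to the next occurrence of /. Because we used .*?, it stops at the first / after http:// or https://.

Simulating a lazy (non-greedy) quantifier in sed

And all other regex flavors!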
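
A quick way to try this without an input file, as a sketch that assumes a GNU grep built with PCRE support (-P is not available in every grep). The second URL is made up for illustration and only shows that the optional s is handled.

```shell
# Sample input: the URL from the question plus a hypothetical https URL.
urls='http://www.suepearson.co.uk/product/174/71/3816/
https://example.com/a/b/c'

# -o prints only the matched part; -P enables PCRE, so .*? is lazy
# and stops at the first "/" after the scheme.
domains=$(printf '%s\n' "$urls" | grep -oP '^https?://.*?/')

printf '%s\n' "$domains"
```

Each input line should be reduced to its scheme and host, one per line.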

Finding the first occurrence of an expression:

POSIX ERE (using the -r option)

Regex:

(EXPRESSION).*|.

Sed:

sed -r 's/(EXPRESSION).*|./\1/g' # the global `g` modifier must be on

Example (finding the first sequence of digits):

$ sed -r 's/([0-9]+).*|./\1/g' <<< 'foo 12 bar 34'
12

How does it work?

This regex relies on the alternation |. At each position the engine tries to pick the longest match (this is a POSIX rule that a couple of other engines follow as well), which means it goes with . until a match is found for ([0-9]+).*. But order matters too. Since the global flag is set, the engine keeps matching character by character up to the end of the input string or our target. As soon as the first and only capturing group on the left side of the alternation, (EXPRESSION), is matched, the rest of the line is consumed immediately as well by .*. We now hold our value in the first capturing group.

POSIX BRE

Regex:

\(\(\(EXPRESSION\).*\)*.\)*

Sed:

sed 's/\(\(\(EXPRESSION\).*\)*.\)*/\3/'

Example (finding the first sequence of digits):

$ sed 's/\(\(\([0-9]\{1,\}\).*\)*.\)*/\3/' <<< 'foo 12 bar 34'
12

This one is like the ERE version but with no alternation involved. That's all. At each position the engine tries to match a digit. If one is found, the following digits are consumed and captured and the rest of the line is matched immediately. Otherwise, since * means zero or more, it skips over the second capturing group \(\([0-9]\{1,\}\).*\)* and arrives at a dot . to match a single character, and this process continues.

Finding the first occurrence of a delimited expression:

This approach matches the very first occurrence of a string that is delimited. We can call it a block of string.

sed 's/\(END-DELIMITER-EXPRESSION\).*/\1/; s/\(\(START-DELIMITER-EXPRESSION.*\)*.\)*/\1/g'

Input string:

foobar start block #1 end barfoo start block #2 end

-EDE: end
-SDE: start

$ sed 's/\(end\).*/\1/; s/\(\(start.*\)*.\)*/\1/g'

Output:

start block #1 end

The first regex, \(end\).*, matches and captures the first end delimiter, end, and substitutes the whole match with the captured characters, which is the end delimiter. At this stage our output is: foobar start block #1 end. Then the result is passed to the second regex, \(\(start.*\)*.\)*, which is the same as the POSIX BRE version above. It matches a single character if the start delimiter start is not matched; otherwise it matches and captures the start delimiter and matches the rest of the characters.
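
The ERE trick can be wrapped in a small helper, shown here as a sketch only. first_match is a hypothetical name, and the function assumes a sed that accepts -E (GNU sed treats it like -r) and an expression that contains no / character.

```shell
# first_match EXPR: reads lines on stdin and keeps only the first match of the
# ERE given as $1 on each line. Every character before the match is consumed by
# the "." alternative and replaced with an empty \1; everything after the match
# is swallowed by ".*".
first_match() {
    sed -E "s/($1).*|./\\1/g"
}

printf 'foo 12 bar 34\n' | first_match '[0-9]+'
```

One consequence of the design: a line that contains no match at all comes out empty, because every character is taken by the . alternative.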


Directly answering your question

Using approach #2 (delimited expression), you should select two appropriate expressions:

EDE: [^:/]\/
SDE: http:

Usage:

$ sed 's/\([^:/]\/\).*/\1/g; s/\(\(http:.*\)*.\)*/\1/' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'

Output:

http://www.suepearson.co.uk/

Note: this won't work when the start and end delimiters are identical.

echo "/home/one/two/three/myfile.txt" | sed 's|\(.*\)/.*|\1|'

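Don't bother, I got it on another forum :)

Neither basic nor extended POSIX/GNU regex recognizes the non-greedy quantifier; you need a later regex flavor. Fortunately, Perl regex for this context is pretty easy to get: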

perl -pe 's|(http://.*?/).*|\1|'