I'm trying to use sed to clean up lines of URLs and extract just the domain.

So from:

http://www.suepearson.co.uk/product/174/71/3816/

I want:

http://www.suepearson.co.uk/

(With or without the trailing slash; it doesn't matter.)

I have tried:

 sed 's|\(http:\/\/.*?\/\).*|\1|'

and (escaping the non-greedy quantifier):

sed 's|\(http:\/\/.*\?\/\).*|\1|'

but I can't seem to get the non-greedy quantifier (?) to work, so it always ends up matching the whole string.


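Current answer

Here is how to robustly do non-greedy matching of multi-character strings with sed. Say you want to change every foo...bar to <foo...bar>, so that, for example, this input: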

$ cat file
ABC foo DEF bar GHI foo KLM bar NOP foo QRS bar TUV

should become this output:

ABC <foo DEF bar> GHI <foo KLM bar> NOP <foo QRS bar> TUV

To do that, you convert foo and bar to individual characters and then use a negated bracket expression of those characters between them:

$ sed 's/@/@A/g; s/{/@B/g; s/}/@C/g; s/foo/{/g; s/bar/}/g; s/{[^{}]*}/<&>/g; s/}/bar/g; s/{/foo/g; s/@C/}/g; s/@B/{/g; s/@A/@/g' file
ABC <foo DEF bar> GHI <foo KLM bar> NOP <foo QRS bar> TUV

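In the above:

1. s/@/@A/g; s/{/@B/g; s/}/@C/g converts { and } to placeholder strings that cannot exist in the input, so those characters are then free to stand in for foo and bar.
2. s/foo/{/g; s/bar/}/g converts foo and bar to { and } respectively.
3. s/{[^{}]*}/<&>/g performs the operation we want: converting foo...bar to <foo...bar>.
4. s/}/bar/g; s/{/foo/g converts { and } back to foo and bar.
5. s/@C/}/g; s/@B/{/g; s/@A/@/g converts the placeholder strings back to their original characters.

Note that the above does not rely on any particular string being absent from the input, because it manufactures such strings in the first step. Nor does it care which occurrence of a particular regexp you want to match, because you can use {[^{}]*} as many times as necessary in the expression to isolate the actual match you want, and/or use sed's numeric occurrence flag, e.g. to replace only the 2nd occurrence: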

$ sed 's/@/@A/g; s/{/@B/g; s/}/@C/g; s/foo/{/g; s/bar/}/g; s/{[^{}]*}/<&>/2; s/}/bar/g; s/{/foo/g; s/@C/}/g; s/@B/{/g; s/@A/@/g' file
ABC foo DEF bar GHI <foo KLM bar> NOP foo QRS bar TUV
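The steps above can be wrapped in a small shell function. This is only a minimal sketch, assuming a sed whose basic regular expressions treat { and } as literal characters (GNU sed does); the function name wrap_foo_bar is hypothetical. The sample input deliberately contains literal @, { and } to show that the placeholder round trip leaves them intact.

```shell
# Hypothetical helper: wraps every shortest foo...bar span in <...>.
# Reads lines on stdin, writes the result to stdout.
wrap_foo_bar() {
  sed -e 's/@/@A/g; s/{/@B/g; s/}/@C/g' \
      -e 's/foo/{/g; s/bar/}/g' \
      -e 's/{[^{}]*}/<&>/g' \
      -e 's/}/bar/g; s/{/foo/g' \
      -e 's/@C/}/g; s/@B/{/g; s/@A/@/g'
}

# The first -e frees up { and } by escaping any literal ones,
# the second maps foo/bar onto those single characters,
# the third does the actual match, and the last two undo the mapping.
printf '%s\n' 'a{b}@ foo x bar foo y bar' | wrap_foo_bar
```

Splitting the script across several -e options is just for readability; sed runs them in order as one script.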

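Other answers

by Christoph Sieghart

The trick to getting non-greedy matching in sed is to match all characters except the one that terminates the match. I know, it's a no-brainer, but I wasted precious minutes on it, and shell scripts should, after all, be quick and easy. So in case somebody else needs it:

Greedy matching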

% echo "<b>foo</b>bar" | sed 's/<.*>//g'
bar

Non-greedy matching

% echo "<b>foo</b>bar" | sed 's/<[^>]*>//g'
foobar
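Applied to the URL from the question, the same idea might look like the sketch below. It assumes the input lines start with http:// and that the first / after the scheme ends the domain; the function name extract_domain is hypothetical.

```shell
# Hypothetical helper: keeps "http://" plus everything up to and
# including the first following slash, and drops the rest of the line.
extract_domain() {
  # [^/]* cannot run past a slash, so the match stops at the first one,
  # which is what the non-greedy .*? was meant to do.
  sed 's|\(http://[^/]*/\).*|\1|'
}

printf '%s\n' 'http://www.suepearson.co.uk/product/174/71/3816/' | extract_domain
```

Using | as the s delimiter avoids having to escape the slashes in the pattern.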

Neither basic nor extended POSIX/GNU regex recognizes the non-greedy quantifier; you need a later regex flavor. Fortunately, Perl regex is easy to get for this context:

perl -pe 's|(http://.*?/).*|\1|'

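There is still hope of solving this with pure (GNU) sed. Although it is not a generic solution, in some cases you can use a "loop" to eliminate all the unnecessary parts of the string, like this: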

sed -r -e ":loop" -e 's|(http://.+)/.*|\1|' -e "t loop"

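-r: use extended regular expressions (for + and unescaped parentheses)
":loop": define a new label named "loop"
-e: add a command to sed
"t loop": jump back to the label "loop" if there was a successful substitution

The only problem here is that it also cuts off the last separator character ('/'). If you really need it, you can put it back after the "loop" has finished by appending this additional command to the end of the previous command line: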

-e "s,$,/,"
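Put together, the loop and the final append might look like the following sketch. It assumes a sed that accepts -E for extended regular expressions (recent GNU sed accepts it as an equivalent of -r); the function name strip_path_loop is hypothetical.

```shell
# Hypothetical helper: repeatedly chops off the last "/..." segment after
# the host, then restores a single trailing slash.
strip_path_loop() {
  sed -E \
      -e ':loop' \
      -e 's|(http://.+)/.*|\1|' \
      -e 't loop' \
      -e 's,$,/,'
}

# Each pass removes one trailing path segment; the loop ends when no
# slash is left after "http://", and the last command appends "/".
printf '%s\n' 'http://www.suepearson.co.uk/product/174/71/3816/' | strip_path_loop
```

Single quotes are used around every sed command here so the shell never tries to interpret the $ in the last one.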

Simulating a lazy (non-greedy) quantifier in sed

And in all other regex flavors!

1. Finding the first occurrence of an expression:

POSIX ERE (using the -r option)

Regex: (EXPRESSION).*|.

Sed:

sed -r 's/(EXPRESSION).*|./\1/g' # The global `g` modifier must be on

Example (finding the first sequence of digits):

$ sed -r 's/([0-9]+).*|./\1/g' <<< 'foo 12 bar 34'
12

How does it work?

This regex benefits from an alternation |. At each position the engine tries to pick the longest match (this is a POSIX standard, which a couple of other engines follow as well), which means it goes with . until a match is found for ([0-9]+).*. But order is important too.

Since the global flag is set, the engine continues matching character by character up to the end of the input string or our target. As soon as the first and only capturing group on the left side of the alternation, (EXPRESSION), is matched, the rest of the line is consumed immediately as well by .*. We now hold our value in the first capturing group.

POSIX BRE

Regex: \(\(\(EXPRESSION\).*\)*.\)*

Sed:

sed 's/\(\(\(EXPRESSION\).*\)*.\)*/\3/'

Example (finding the first sequence of digits):

$ sed 's/\(\(\([0-9]\{1,\}\).*\)*.\)*/\3/' <<< 'foo 12 bar 34'
12

This one is like the ERE version but with no alternation involved. That's all. At each position the engine tries to match a digit. If one is found, the following digits are consumed and captured, and the rest of the line is matched immediately. Otherwise, since * means zero or more, it skips over the second capturing group \(\([0-9]\{1,\}\).*\)* and arrives at the dot . to match a single character, and the process continues.

2. Finding the first occurrence of a delimited expression:

This approach matches the very first occurrence of a string that is delimited. We can call it a block of string.

sed 's/\(END-DELIMITER-EXPRESSION\).*/\1/; s/\(\(START-DELIMITER-EXPRESSION.*\)*.\)*/\1/g'

Input string:

foobar start block #1 end barfoo start block #2 end

EDE: end
SDE: start

$ sed 's/\(end\).*/\1/; s/\(\(start.*\)*.\)*/\1/g'

Output:

start block #1 end

The first regex, \(end\).*, matches and captures the first end delimiter, end, and substitutes the whole match with the captured characters, which is the end delimiter. At this stage our output is: foobar start block #1 end.

The result is then passed to the second regex, \(\(start.*\)*.\)*, which is the same as the POSIX BRE version above. It matches a single character if the start delimiter start is not matched; otherwise it matches and captures the start delimiter and matches the rest of the characters.


Directly answering your question

Using approach #2 (delimited expression), you should select two appropriate expressions:

EDE: [^:/]\/
SDE: http:

Usage:

$ sed 's/\([^:/]\/\).*/\1/g; s/\(\(http:.*\)*.\)*/\1/' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'

Output:

http://www.suepearson.co.uk/

Note: this will not work with identical delimiters.

If you have GNU grep, you can use Perl regex:

grep -Po '^https?://([^/]+)(?=)' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'
http://www.suepearson.co.uk

Or, to get everything after the domain, use:

grep -Po '^https?://([^/]+)\K.*' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'
/product/174/71/3816/