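I am trying to use sed to clean up lines of URLs and extract just the domain.

So from:

http://www.suepearson.co.uk/product/174/71/3816/

I want:

http://www.suepearson.co.uk/

(With or without the trailing slash, it doesn't matter.)

I have tried:

sed 's|\(http:\/\/.*?\/\).*|\1|'

and (escaping the non-greedy quantifier)

sed 's|\(http:\/\/.*\?\/\).*|\1|'

but I can't seem to get the non-greedy quantifier (?) to work, so it always ends up matching the whole string.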


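Current answer

Non-greedy solution for more than a single character

This thread is really old, but I assume people still need it. Let's say you want to kill everything up to the very first occurrence of HELLO. You cannot say [^HELLO]… because a bracket expression excludes single characters, not a whole word.

So a nice solution involves two steps, assuming you can spare a unique word that you do not expect in the input, say top_sekrit.

In this case we can: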

s/HELLO/top_sekrit/     #will only replace the very first occurrence
s/.*top_sekrit//        #kill everything till end of the first HELLO

Of course, with a simpler input you could use a smaller word, or maybe even a single character.

HTH!
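
As a rough illustration of the two steps above, here is a minimal sketch that chains both substitutions in a single sed call. The function name strip_through_first_hello and the sample input are invented for this example, and it assumes top_sekrit never occurs in the input.

```shell
# Hypothetical helper: delete everything up to and including the first HELLO.
# Assumes the placeholder top_sekrit never occurs in the input.
strip_through_first_hello() {
  sed -e 's/HELLO/top_sekrit/' -e 's/.*top_sekrit//'
}

# Sample input invented for illustration; the second HELLO should survive.
printf '%s\n' 'aaa HELLO bbb HELLO ccc' | strip_through_first_hello
```

By contrast, a plain s/.*HELLO// is greedy and would delete everything through the last HELLO on the line.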

Other answers

Simulating a lazy (non-greedy) quantifier in sed

And all other regex flavors!

Finding first occurrence of an expression:

POSIX ERE (using -r option)

Regex:

(EXPRESSION).*|.

Sed:

sed -r 's/(EXPRESSION).*|./\1/g' # Global `g` modifier should be on

Example (finding first sequence of digits):

$ sed -r 's/([0-9]+).*|./\1/g' <<< 'foo 12 bar 34'
12

How does it work?

This regex benefits from an alternation |. At each position the engine tries to pick the longest match (this is a POSIX rule that a couple of other engines follow as well), which means it goes with . until a match is found for ([0-9]+).*. But order is important too.

Since the global flag is set, the engine keeps matching character by character up to the end of the input string or our target. As soon as the first and only capturing group on the left side of the alternation, (EXPRESSION), is matched, the rest of the line is consumed immediately by .* as well. We now hold our value in the first capturing group.

POSIX BRE

Regex:

\(\(\(EXPRESSION\).*\)*.\)*

Sed:

sed 's/\(\(\(EXPRESSION\).*\)*.\)*/\3/'

Example (finding first sequence of digits):

$ sed 's/\(\(\([0-9]\{1,\}\).*\)*.\)*/\3/' <<< 'foo 12 bar 34'
12

This one is like the ERE version but with no alternation involved. That's all. At each position the engine tries to match a digit. If one is found, the following digits are consumed and captured, and the rest of the line is matched immediately. Otherwise, since * means zero or more, it skips over the second capturing group \(\([0-9]\{1,\}\).*\)* and arrives at the dot . to match a single character, and this process continues.

Finding first occurrence of a delimited expression:

This approach matches the very first occurrence of a string that is delimited. We can call it a block of string.

sed 's/\(END-DELIMITER-EXPRESSION\).*/\1/; s/\(\(START-DELIMITER-EXPRESSION.*\)*.\)*/\1/g'

Input string:

foobar start block #1 end barfoo start block #2 end

EDE: end
SDE: start

$ sed 's/\(end\).*/\1/; s/\(\(start.*\)*.\)*/\1/g'

Output:

start block #1 end

The first regex \(end\).* matches and captures the first end delimiter, end, and substitutes the whole match with the captured characters, which is the end delimiter. At this stage our output is: foobar start block #1 end.

The result is then passed to the second regex \(\(start.*\)*.\)*, which is the same as the POSIX BRE version above. It matches a single character if the start delimiter start is not matched; otherwise it matches and captures the start delimiter and matches the rest of the characters.
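
To make the ERE variant above easier to try out, here is a small sketch that wraps it in a shell function. The name first_number and the second sample line are invented for illustration, and the sketch assumes GNU sed, where -r enables POSIX ERE.

```shell
# Hypothetical helper: print the first run of digits found on each input line.
# Assumes GNU sed (-r enables POSIX ERE). A line without digits becomes empty,
# because every character is then replaced by the (empty) group \1.
first_number() {
  sed -r 's/([0-9]+).*|./\1/g'
}

printf '%s\n' 'foo 12 bar 34' | first_number
printf '%s\n' 'order 7 of 9' | first_number
```

The g flag matters here: without it only the first character of the line would be replaced and the rest would be printed unchanged.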


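Directly answering your question

Using approach #2 (delimited expression), you should select two appropriate expressions:

EDE: [^:/]\/
SDE: http:

Usage:

$ sed 's/\([^:/]\/\).*/\1/g; s/\(\(http:.*\)*.\)*/\1/' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'

Output:

http://www.suepearson.co.uk/

Note: this will not work with identical delimiters.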

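I haven't seen this answer yet, so here is how to do it with vi or vim:

vi -c '%s/\(http:\/\/.\{-}\/\).*/\1/ge | wq' file &>/dev/null

This runs the vi :%s substitution globally (the trailing g), does not raise an error if the pattern is not found (e), then saves the resulting changes to disk and quits. The &>/dev/null keeps the GUI from briefly flashing on screen, which can be annoying.

Sometimes I like to use vi for super complicated regexes, because (1) perl is dying, (2) vim has a very advanced regex engine, and (3) I am already intimately familiar with vi regexes from my day-to-day document editing.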

This can be done using cut:

echo "http://www.suepearson.co.uk/product/174/71/3816/" | cut -d'/' -f1-3
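
To show why fields 1 to 3 are the right ones, here is a short sketch that prints the individual fields cut sees when the URL is split on /. The variable name url is just a convenience for this example.

```shell
url='http://www.suepearson.co.uk/product/174/71/3816/'

# Splitting on "/" gives: field 1 = "http:", field 2 = "" (the gap between
# the two slashes), field 3 = the host, field 4 onward = the path segments.
printf '%s\n' "$url" | cut -d'/' -f1     # scheme part
printf '%s\n' "$url" | cut -d'/' -f3     # host only
printf '%s\n' "$url" | cut -d'/' -f1-3   # scheme plus host, rejoined with "/"
```

Because cut rejoins the selected fields with the same delimiter, -f1-3 gives the URL up to, but not including, the slash after the host.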

With sed, I usually implement a non-greedy search by matching anything except the separator, up to the separator:

echo "http://www.suon.co.uk/product/1/7/3/" | sed -n 's;\(http://[^/]*\)/.*;\1;p'

Output:

http://www.suon.co.uk

Here is how it works:

-n: do not print lines by default.

s/<pattern>/<replace>/p: search for the pattern, replace it, and print the result.

Use ; instead of / as the delimiter of the s command to make it easier to type, so s;<pattern>;<replace>;p.

\( ... \) remembers the match between the brackets, later accessible as \1, \2, ...

http:// is matched literally.

[] matches any one of the characters inside it: [ab/] means either a or b or /.

A leading ^ inside [] means "not", so [^/] means any character except /.

* repeats the previous item, so [^/]* means any run of characters other than /.

So far, sed -n 's;\(http://[^/]*\) means: search for http:// followed by any characters except /, and remember what you found.

We want to search until the end of the domain, so we stop at the next / by adding another / at the end: sed -n 's;\(http://[^/]*\)/'. We also want to match the rest of the line after the domain, so we add .*

Now the match remembered in group 1 (\1) is the domain, so replace the matched line with what was saved in group \1 and print: sed -n 's;\(http://[^/]*\)/.*;\1;p'

If you want to keep the slash after the domain, move that slash inside the remembered group:

echo "http://www.suon.co.uk/product/1/7/3/" | sed -n 's;\(http://[^/]*/\).*;\1;p'

Output:

http://www.suon.co.uk/
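
As a possible variation on the negated-class technique above (not part of the original answer), the sketch below also accepts https and URLs that have no path at all. The name url_origin and the https://example.com input are invented for illustration, and the sketch assumes a sed that supports -E for extended regular expressions.

```shell
# Hypothetical helper: keep only the scheme and host of each URL line.
# Assumes a sed that supports -E (extended regular expressions).
# [^/]* stops at the first "/" after the host, which is what makes the
# match behave as if it were non-greedy; the trailing .* eats the rest.
url_origin() {
  sed -E 's;^(https?://[^/]*).*;\1;'
}

printf '%s\n' 'http://www.suon.co.uk/product/1/7/3/' 'https://example.com' | url_origin
```

Unlike the -n ... p form, this prints every input line, so lines that do not look like URLs pass through unchanged rather than being dropped.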