Best way to simulate "group by" from bash?

Assume you have a file that contains IP addresses, one address per line:

10.0.10.1
10.0.10.1
10.0.10.3
10.0.10.2
10.0.10.1

You need a shell script that counts how many times each IP address appears in the file. For the input above, you need the following output:

10.0.10.1 3
10.0.10.2 1
10.0.10.3 1

One way to do this is:

cat ip_addresses |uniq |while read ip
do
    echo -n $ip" "
    grep -c $ip ip_addresses
done

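However, it is really far from being efficient.

How would you solve this problem more efficiently using bash?

(One thing to add: I know it can be solved from perl or awk. I am interested in a better solution in bash, not in those languages.)

ADDITIONAL INFO:

Suppose that the source file is 5GB and the machine running the algorithm has 4GB of memory. So sort is not an efficient solution, and neither is reading the file more than once.

I liked the hashtable-like solution. Can anybody provide improvements to that solution?

ADDITIONAL INFO #2:

Some people asked why I would bother doing it in bash when it is much easier in, for example, perl. The reason is that perl was not available to me on the machine where I had to do this. It was a custom-built Linux machine without most of the tools I am used to. And I think it is an interesting problem.

So please, don't blame the question; just ignore it if you don't like it. :-)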

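The quick and dirty method is as follows:

cat ip_addresses | sort -n | uniq -c

If you need to use the values in bash, you can assign the whole command to a bash variable and then loop through the results.

If the sort command is omitted, you will not get the correct results, because uniq only looks at successive identical lines.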
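
Capturing the command's output in a variable and looping over it might look like the following minimal sketch. The function name count_ips and the reordering into "address count" are assumptions made for illustration; only the file name ip_addresses comes from the question.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "address count" pairs for the file given as $1.
count_ips() {
    local counts count ip
    # Capture the whole pipeline's output in a bash variable.
    counts=$(sort -- "$1" | uniq -c)
    # Loop over the captured lines; read strips the padding uniq puts
    # in front of each count.
    while read -r count ip; do
        [[ -n $ip ]] || continue   # skip the empty line produced by empty input
        printf '%s %s\n' "$ip" "$count"
    done <<< "$counts"
}
```

It would be called as count_ips ip_addresses. The loop itself runs inside bash, but sort and uniq are still external processes.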

2008-12-19 12:18:32

sort ip_addresses | uniq -c

This will print the count first, but other than that it should be exactly what you want.

2008-12-19 12:22:35

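sort can be omitted only if identical lines are already adjacent in the input, because uniq collapses consecutive duplicates only:

uniq -c <source_file>

or, if the source list is in a variable:

echo "$list" | uniq -c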

2008-12-19 12:28:01

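It seems that you either have to use a large amount of code to simulate hashes in bash to get linear behavior, or stick to the quadratic or superlinear versions.

Among those versions, saua's solution is the best (and simplest):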

sort -n ip_addresses.txt | uniq -c

I found http://unix.derkeiler.com/Newsgroups/comp.unix.shell/2005-11/0118.html. But it's ugly as hell...
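
For comparison, bash 4.0 and later have built-in associative arrays (declare -A), which make a hash-table style single pass short to write. The following is a minimal sketch, assuming such a bash is available on the machine; the function name count_lines is hypothetical. It reads the input once and keeps one counter per distinct line, so memory use grows with the number of distinct addresses rather than with the file size. A pure bash read loop is still likely to be slow on multi-gigabyte input.

```shell
#!/usr/bin/env bash
# Requires bash >= 4.0 (associative arrays).
# count_lines is a hypothetical name: reads lines from stdin and prints
# "line count" for every distinct line, in unspecified order.
count_lines() {
    local -A seen=()      # the hash table: line -> number of occurrences
    local line
    # "|| [[ -n $line ]]" also handles a last line without a trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        [[ -n $line ]] || continue            # an empty string is not a valid key
        seen[$line]=$(( ${seen[$line]:-0} + 1 ))
    done
    for line in "${!seen[@]}"; do
        printf '%s %s\n' "$line" "${seen[$line]}"
    done
}
```

It would be called as count_lines < ip_addresses. The order of the keys is not defined, so the output needs a final sort if the order matters.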

2008-12-19 12:33:18

I'd have done it like this:

perl -e 'while (<>) {chomp; $h{$_}++;} for $k (keys %h) {print "$k $h{$k}\n";}' ip_addresses

but uniq might work for you.

2008-12-19 16:52:49

I understand you are looking for something in Bash, but in case someone else is looking for something in Python, you might want to consider this:

mySet = set()
for line in open("ip_address_file.txt"):
    line = line.rstrip()
    mySet.add(line)

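As values in a set are unique by default and Python is pretty good at this stuff, you might win something here. I haven't tested the code, so it might be bugged, but this should get you there. And if you want to count occurrences, using a dict instead of a set is easy to implement.

Edit: I'm a lousy reader, so I answered wrong. Here's a snippet with a dict that counts occurrences.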

mydict = {}
for line in open("ip_address_file.txt"):
    line = line.rstrip()
    if line in mydict:
        mydict[line] += 1
    else:
        mydict[line] = 1

The dictionary mydict now holds the unique IPs as keys and the number of times they occurred as values.

2008-12-20 15:10:58

You could probably use the file system itself as a hash table. Pseudo-code as follows:

for every entry in the ip address file; do
  let addr denote the ip address;

  if file "addr" does not exist; then
    create file "addr";
    write the number "1" in the file;
  else
    read the number from "addr";
    increase the number by 1 and write it back;
  fi
done

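In the end, all you need to do is traverse all the files and print the file names and the numbers in them. Alternatively, instead of keeping a count, you could append a space or a newline to the file each time, and in the end just look at the file size in bytes.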
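
The pseudo-code might be turned into bash roughly as follows. This is a sketch under assumptions: the function name fs_count and the scratch directory passed as an argument are illustrative, every input line must be usable as a file name (true for IP addresses), and the count starts at 1 on the first occurrence so that the stored number equals the number of occurrences.

```shell
#!/usr/bin/env bash
# fs_count DIR: count the lines read from stdin, using one file per distinct
# line inside DIR (which must exist and be empty) as the "hash table".
fs_count() {
    local dir=$1 addr n f
    while IFS= read -r addr; do
        [[ -n $addr ]] || continue
        if [[ -e $dir/$addr ]]; then
            read -r n < "$dir/$addr"                 # read the stored count
            printf '%s\n' "$(( n + 1 ))" > "$dir/$addr"
        else
            printf '1\n' > "$dir/$addr"              # first occurrence
        fi
    done
    # Walk the directory and print "name count" for every file.
    for f in "$dir"/*; do
        [[ -e $f ]] || continue                      # glob matched nothing
        read -r n < "$f"
        printf '%s %s\n' "${f##*/}" "$n"
    done
}
```

A possible call is fs_count "$(mktemp -d)" < ip_addresses. Only builtins and redirections are used, at the cost of one small file per distinct address.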

2008-12-20 15:30:34

The canonical solution is the one mentioned by another respondent:

sort | uniq -c

It is shorter and more concise than what can be written in Perl or awk.

You write that you don't want to use sort, because the data's size is larger than the machine's main memory size. Don't underestimate the implementation quality of the Unix sort command. Sort was used to handle very large volumes of data (think the original AT&T's billing data) on machines with 128k (that's 131,072 bytes) of memory (PDP-11). When sort encounters more data than a preset limit (often tuned close to the size of the machine's main memory) it sorts the data it has read in main memory and writes it into a temporary file. It then repeats the action with the next chunks of data. Finally, it performs a merge sort on those intermediate files. This allows sort to work on data many times larger than the machine's main memory.
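
With GNU sort, the size of the in-memory buffer and the location of the temporary files can be set explicitly with -S (--buffer-size) and -T (--temporary-directory). The sketch below is an illustration, not a tuned recipe: the function name, the 1G default and the /tmp default are placeholder assumptions, and a stripped-down sort on a custom-built system may not support these options.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around GNU sort: count duplicate lines in file $1 while
# capping sort's memory buffer ($2, default 1G) and putting its temporary
# files in directory $3 (default /tmp), which needs enough free space.
count_big_file() {
    local file=$1 buffer=${2:-1G} tmpdir=${3:-/tmp}
    # LC_ALL=C makes sort compare raw bytes, which is enough for grouping.
    LC_ALL=C sort -S "$buffer" -T "$tmpdir" -- "$file" | uniq -c
}
```

For the 5GB file from the question, a call such as count_big_file ip_addresses 1G /some/scratch/dir (the directory is a placeholder) keeps the buffer well below the 4GB of memory and lets sort merge its temporary files on disk.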

2008-12-20 16:02:55

I think awk associative arrays are also handy in this case:

$ awk '{count[$1]++}END{for(j in count) print j,count[j]}' ips.txt

There is a post on group by here.

2008-12-21 15:06:35

To sum up multiple fields based on a group of existing fields, use the example below (replace $1, $2, $3, $4 according to your requirements):

cat file

US|A|1000|2000
US|B|1000|2000
US|C|1000|2000
UK|1|1000|2000
UK|1|1000|2000
UK|1|1000|2000

awk 'BEGIN { FS=OFS=SUBSEP="|"}{arr[$1,$2]+=$3+$4 }END {for (i in arr) print i,arr[i]}' file

US|A|3000
US|B|3000
US|C|3000
UK|1|9000

2010-04-10 10:42:40

Solution (group by, like in MySQL)

grep -ioh "facebook\|xing\|linkedin\|googleplus" access-log.txt | sort | uniq -c | sort -n

Result

3249  googleplus
4211 linkedin
5212 xing
7928 facebook

2014-02-14 09:08:38

cat ip_addresses | sort | uniq -c | sort -nr | awk '{print $2 " " $1}'

This command will give you the desired output.

2014-07-25 22:28:45

Most of the other solutions count duplicates. If you really need to group key-value pairs, try this:

Here is my example data:

find . | xargs md5sum
fe4ab8e15432161f452e345ff30c68b0 a.txt
30c68b02161e15435ff52e34f4fe4ab8 b.txt
30c68b02161e15435ff52e34f4fe4ab8 c.txt
fe4ab8e15432161f452e345ff30c68b0 d.txt
fe4ab8e15432161f452e345ff30c68b0 e.txt

This will print the key-value pairs grouped by the md5 checksum.

cat table.txt | awk '{print $1}' | sort | uniq | xargs -I {} grep {} table.txt
30c68b02161e15435ff52e34f4fe4ab8 b.txt
30c68b02161e15435ff52e34f4fe4ab8 c.txt
fe4ab8e15432161f452e345ff30c68b0 a.txt
fe4ab8e15432161f452e345ff30c68b0 d.txt
fe4ab8e15432161f452e345ff30c68b0 e.txt

2015-11-11 21:02:38

GROUP BY under bash

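Regarding this SO thread, there are several different answers for different needs.

1. Counting IPs as in the SO request (GROUP BY IP address).

As IPs are easy to convert to a single integer, for a small bunch of addresses, if you need to repeat this kind of operation many times, using a pure bash function could be a lot more efficient!

Pure bash (no fork!)

There is a way, using a bash function. This way is very quick as there is no fork!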

countIp () { 
    local -a _ips=(); local _a
    while IFS=. read -a _a ;do
        ((_ips[_a<<24|${_a[1]}<<16|${_a[2]}<<8|${_a[3]}]++))
    done
    for _a in ${!_ips[@]} ;do
        printf "%.16s %4d\n" \
          $(($_a>>24)).$(($_a>>16&255)).$(($_a>>8&255)).$(($_a&255)) ${_ips[_a]}
    done
}

Note: IP addresses are converted to 32-bit unsigned integer values, which are used as array indexes. This uses simple bash arrays!
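
The index arithmetic used by countIp can be looked at in isolation. In this sketch the helper names ip2int and int2ip are hypothetical; they pack the four octets into one integer and unpack it again with the same shifts and masks as the function above.

```shell
#!/usr/bin/env bash
# Pack a dotted-quad IPv4 address into a single integer.
ip2int() {
    local -a o
    IFS=. read -r -a o <<< "$1"
    echo $(( o[0] << 24 | o[1] << 16 | o[2] << 8 | o[3] ))
}

# Unpack the integer back into dotted-quad form.
int2ip() {
    echo "$(( $1 >> 24 & 255 )).$(( $1 >> 16 & 255 )).$(( $1 >> 8 & 255 )).$(( $1 & 255 ))"
}

ip2int 10.0.10.1      # prints 167774721
int2ip 167774721      # prints 10.0.10.1
```

Since bash expands the indexes of an indexed array in ascending order, looping over ${!_ips[@]} yields the addresses in numeric order without a separate sort.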

time countIp < ip_addresses 
10.0.10.1    3
10.0.10.2    1
10.0.10.3    1
real    0m0.001s
user    0m0.004s
sys     0m0.000s

time sort ip_addresses | uniq -c
      3 10.0.10.1
      1 10.0.10.2
      1 10.0.10.3
real    0m0.010s
user    0m0.000s
sys     0m0.000s

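On my host, doing so is a lot quicker than using forks, up to approximately 1,000 addresses, but it takes approximately 1 entire second when I try to sort and count 10,000 addresses.

2. GROUP BY duplicates (file content)

By using checksums, you could identify duplicate files somewhere: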

find . -type f -exec sha1sum {} + |
    sort |
        sed '
          :a;
          $s/^[^ ]\+ \+//;
          N;
          s/^\([^ ]\+\) \+\([^ ].*\)\n\1 \+\([^ ].*\)$/\1 \2\o11\3/;
          ta;
          s/^[^ ]\+ \+//;
          P;
          D;
          ba
    '

This will print all duplicates, one group per line, separated by tabulation ($'\t' or octal 011; you could change /\1 \2\o11\3/; to /\1 \2|\3/; to use | as the separator).

./b.txt   ./e.txt
./a.txt   ./c.txt    ./d.txt

Could be written as (with | as separator):

find . -type f -exec sha1sum {} + | sort | sed ':a;$s/^[^ ]\+ \+//;N;
  s/^\([^ ]\+\) \+\([^ ].*\)\n\1 \+\([^ ].*\)$/\1 \2|\3/;ta;s/^[^ ]\+ \+//;P;D;ba'

Pure bash way

By using nameref, you could build bash arrays holding all duplicates:

declare -iA sums='()'
while IFS=' ' read -r sum file ;do
    declare -n list=_LST_$sum
    list+=("$file")
    sums[$sum]+=1
done < <(
    find . -type f -exec sha1sum {} +
)

From there, you have a bunch of arrays holding all duplicate file names as separate elements:

for i in ${!sums[@]};do
     declare -n list=_LST_$i
     printf "%d %d %s\n" ${sums[$i]} ${#list[@]} "${list[*]}"
done

This may output something like:

2 2 ./e.txt ./b.txt
3 3 ./c.txt ./a.txt ./d.txt

Here the count of files per checksum (${sums[$shasum]}) matches the count of elements in the array ${_LST_ShAsUm[@]}.

for i in ${!sums[@]};do
    declare -n list=_LST_$i
    echo ${list[@]@A}
done

declare -a _LST_22596363b3de40b06f981fb85d82312e8c0ed511=([0]="./e.txt" [1]="./b.txt")
declare -a _LST_f572d396fae9206628714fb2ce00f72e94f2258f=([0]="./c.txt" [1]="./a.txt" [2]="./d.txt")

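Note that this method can handle spaces and special characters in filenames!

3. GROUP BY columns in a table

As an efficient example using awk was provided by Anonymous, here is a pure bash solution.

So you want to sum up column 3 through the last column, grouped by columns 1 and 2, with table.txt looking like this: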

US|A|1000|2000
US|B|1000|2000
US|C|1000|2000
UK|1|1000|2000
UK|1|1000|2000|3000
UK|1|1000|2000|3000|4000

For tables that are not too big, you could:

myfunc() {
    local -iA restabl='()';
    local IFS=+
    while IFS=\| read -ra ar; do
        restabl["${ar[0]}|${ar[1]}"]+="${ar[*]:2}"
    done
    for i in ${!restabl[@]} ;do
        printf '%s|%s\n' "$i" "${restabl[$i]}"
    done
}

Could output something like:

myfunc <table.txt 
UK|1|19000
US|A|3000
US|C|3000
US|B|3000

To have the table sorted:

myfunc() {
    local -iA restabl='()';
    local IFS=+ sorted=()
    while IFS=\| read -ra ar; do
        sorted[64#${ar[0]}${ar[1]}]="${ar[0]}|${ar[1]}"
        restabl["${ar[0]}|${ar[1]}"]+="${ar[*]:2}"
    done
    for i in ${sorted[@]} ;do
        printf '%s|%s\n' "$i" "${restabl[$i]}"
    done
}

Must return:

myfunc <table.txt
UK|1|19000
US|A|3000
US|B|3000
US|C|3000

2018-02-18 12:31:58

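This does not answer the counting part of the original question, but this question is the first search engine result when searching for what I wanted to achieve, so I thought this may help someone, as it relates to "group by" functionality.

I wanted to order files based on a grouping of them, where the presence of some string in the filename determined the group.

It uses a temporary grouping/ordering prefix, which is removed after sorting. The sed substitute expressions (s#pattern#replacement#g) match the target string and prepend to the line an integer corresponding to the desired sort order of that target string. Then the grouping prefix is removed with cut.

Note that the sed expressions could be joined (e.g. sed -e '<expr>;<expr>;<expr>;'), but here they are split for readability.

It's not pretty and probably not fast (I'm dealing with fewer than 50 items), but it's at least conceptually simple and doesn't require learning awk.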

#!/usr/bin/env bash

for line in $(find /etc \
    | sed -E -e "s#^(.*${target_string_A}.*)#${target_string_A_sort_index}:\1#;" \
    | sed -E -e "s#^(.*${target_string_B}.*)#${target_string_B_sort_index}:\1#;" \
    | sed -E -e "s#^/(.*)#00:/\1#;" \
    | sort \
    | cut -c4-
)
do
    echo "${line}"
done

For example, given the input

/this/is/a/test/a
/this/is/a/test/b
/this/is/a/test/c
/this/is/a/special/test/d
/this/is/a/another/test/e

#!/usr/bin/env bash

for line in $(find /etc \
    | sed -E -e "s#^(.*special.*)#10:\1#;" \
    | sed -E -e "s#^(.*another.*)#05:\1#;" \
    | sed -E -e "s#^/(.*)#00:/\1#;" \
    | sort \
    | cut -c4-
)
do
    echo "${line}"
done

/this/is/a/test/a
/this/is/a/test/b
/this/is/a/test/c
/this/is/a/another/test/e
/this/is/a/special/test/d
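
As noted above, the sed expressions can be joined into a single invocation. The following sketch applies the same prefix, sort, strip idea to paths read from standard input instead of find; the function name group_sort is hypothetical, while the strings special and another and their sort indexes are taken from the example.

```shell
#!/usr/bin/env bash
# group_sort: read paths from stdin, order them by group, print them back.
group_sort() {
    # 1. Prefix each line with a two-digit sort index and a colon:
    #    10 for "special", 05 for "another", 00 for lines still starting with "/".
    # 2. Sort on the prefixed lines.
    # 3. Strip the three-character prefix again.
    sed -E -e 's#^(.*special.*)#10:\1#; s#^(.*another.*)#05:\1#; s#^/(.*)#00:/\1#' \
        | sort \
        | cut -c4-
}
```

Feeding it the five example paths, for instance with printf '%s\n' ... | group_sort, gives the same order as the listing above.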

2022-06-21 11:28:02

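Import the data into an sqlite db and use SQL syntax (just another idea). I know it's too much for this example, but it is useful for complex queries with multiple files (tables).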

#!/bin/bash
trap clear_db EXIT
clear_db(){ rm -f "mydb$$"; }

# add header to input_file (IP)
INPUT_FILE=ips.txt

# import file into db
sqlite3 -csv mydb$$ ".import ${INPUT_FILE} mytable"

# using sql statements on table 'mytable' 
sqlite3 mydb$$ -separator " "  "SELECT IP, COUNT(*) FROM mytable GROUP BY IP;"

10.0.10.1 3
10.0.10.2 1
10.0.10.3 1

2022-06-21 13:03:28

A combination of awk + sort (with the version sort flag) is probably the fastest (if your environment has awk at all):

echo "${input...}" |
{m,g}awk '{ __[$+_]++ } END { for(_ in __) { print "",+__[_],_ } }' FS='^$' OFS='\t' |
gsort -t$'\t' -k 3,3 -V

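Only the post-GROUP-BY summary rows are sent to the sort utility, which is a far less system-intensive sort than pre-sorting the input rows for no reason.

For small inputs, e.g. fewer than 1,000 rows or so, just directly sort | uniq -c it.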

    3   10.0.10.1
    1   10.0.10.2
    1   10.0.10.3

2022-06-22 01:38:27
