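I hope this question isn't considered too basic for this forum, but we'll see. I'm wondering how to refactor some code that runs many times so that it performs better.

Say I'm creating a word frequency list using a Map (probably a HashMap), where each key is a String holding the word being counted and the value is an Integer that is incremented each time a token of that word is found.

In Perl, incrementing such a value is trivially easy: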

$map{$word}++;

But in Java it's much more complicated. Here is the way I'm currently doing it:

int count = map.containsKey(word) ? map.get(word) : 0;
map.put(word, count + 1);

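This of course relies on the autoboxing feature in the newer Java versions. I wonder if you can suggest a more efficient way of incrementing such a value. Are there even good performance reasons for eschewing the Collections framework and using something else instead?

Update: I've tested several of the answers. See below.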


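Current answer

Some test results

I've gotten a lot of good answers to this question (thanks, folks), so I decided to run some tests and figure out which method is actually fastest. The five methods I tested are these: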

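- the "ContainsKey" method that I presented in the question
- the "TestForNull" method suggested by Aleksandar Dimitrov
- the "AtomicLong" method suggested by Hank Gay
- the "Trove" method suggested by jrudolph
- the "MutableInt" method suggested by phax.myopenid.com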

Method

Here's what I did...

1. created five classes that were identical except for the differences shown below. Each class had to perform an operation typical of the scenario I presented: opening a 10MB file and reading it in, then performing a frequency count of all the word tokens in the file. Since this took an average of only 3 seconds, I had it perform the frequency count (not the I/O) 10 times.
2. timed the loop of 10 iterations but not the I/O operation and recorded the total time taken (in clock seconds), essentially using Ian Darwin's method in the Java Cookbook.
3. performed all five tests in series, and then did this another three times.
4. averaged the four results for each method.
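
The procedure above can be sketched roughly as follows. This is not the original benchmark code: the class name FrequencyTimer, the generated in-memory token list standing in for the 10MB file, and the use of System.nanoTime are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical harness; the names are illustrative, not from the original tests.
class FrequencyTimer {

    // One counting pass, using the ContainsKey approach from the question.
    static Map<String, Integer> countWords(List<String> tokens) {
        Map<String, Integer> freq = new HashMap<String, Integer>();
        for (String word : tokens) {
            int count = freq.containsKey(word) ? freq.get(word) : 0;
            freq.put(word, count + 1);
        }
        return freq;
    }

    // Times `iterations` counting passes. The tokens are already in memory,
    // so no I/O is included in the measurement.
    static double timeSeconds(List<String> tokens, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            countWords(tokens);
        }
        return (System.nanoTime() - start) / 1e9;
    }

    public static void main(String[] args) {
        // Stand-in for the tokens read from the 10MB file (assumption).
        List<String> tokens = new ArrayList<String>();
        for (int i = 0; i < 1000000; i++) {
            tokens.add("word" + (i % 5000));
        }
        System.out.println("10 passes took " + timeSeconds(tokens, 10) + " seconds");
    }
}
```

A harness this simple is sensitive to JIT warm-up and garbage collection, which is one reason to repeat the whole series several times and average, as in steps 3 and 4.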

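Results

I'll present the results first and the code below for those who are interested.

The ContainsKey method was, as expected, the slowest, so I'll give the speed of each method in comparison to the speed of that method.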

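- ContainsKey: 30.654 seconds (baseline)
- AtomicLong: 29.780 seconds (1.03 times as fast)
- TestForNull: 28.804 seconds (1.06 times as fast)
- Trove: 26.313 seconds (1.16 times as fast)
- MutableInt: 25.747 seconds (1.19 times as fast)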
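
Each multiplier is just the baseline time divided by that method's time. A small sketch of the arithmetic, using the averaged times reported above (the class name Speedups is only illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

class Speedups {

    // Speedup relative to the baseline: baseline time / method time.
    static double speedup(double baselineSeconds, double methodSeconds) {
        return baselineSeconds / methodSeconds;
    }

    public static void main(String[] args) {
        double baseline = 30.654; // ContainsKey
        Map<String, Double> times = new LinkedHashMap<String, Double>();
        times.put("AtomicLong", 29.780);
        times.put("TestForNull", 28.804);
        times.put("Trove", 26.313);
        times.put("MutableInt", 25.747);
        for (Map.Entry<String, Double> e : times.entrySet()) {
            System.out.printf("%s: %.2f%n", e.getKey(), speedup(baseline, e.getValue()));
        }
    }
}
```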

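Conclusions

It would appear that only the MutableInt method and the Trove method are significantly faster, in that only they give a performance boost of more than 10%. However, if threading is an issue, AtomicLong might be more attractive than the others (I'm not really sure). I also ran TestForNull with final variables, but the difference was negligible.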
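
To illustrate the threading point, here is a rough sketch of several threads updating the same AtomicLong-based map. The class name, thread count, and iteration count are made up for the example, and it says nothing about relative speed.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical demo class; not part of the original tests.
class ConcurrentCount {

    // Same two-step increment as the AtomicLong variant in the benchmark.
    static void increment(ConcurrentMap<String, AtomicLong> map, String word) {
        map.putIfAbsent(word, new AtomicLong(0));
        map.get(word).incrementAndGet();
    }

    // Starts `threads` threads that each count `word` `perThread` times,
    // then returns the final count.
    static long countInParallel(final String word, int threads, final int perThread) {
        final ConcurrentMap<String, AtomicLong> map =
            new ConcurrentHashMap<String, AtomicLong>();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(new Runnable() {
                public void run() {
                    for (int i = 0; i < perThread; i++) {
                        increment(map, word);
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return map.get(word).get();
    }

    public static void main(String[] args) {
        // No increments are lost, so this prints 400000.
        System.out.println(countInParallel("foo", 4, 100000));
    }
}
```

With a plain HashMap and Integer values, the same loop could lose updates, because the get-then-put sequence is not atomic.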

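Note that I haven't profiled memory usage in the different scenarios. I'd be happy to hear from anybody who has good insights into how the MutableInt and Trove methods would be likely to affect memory usage.

Personally, I find the MutableInt method the most attractive, since it doesn't require loading any third-party classes. So unless I discover problems with it, that's the way I'm most likely to go.

The code

Here is the crucial code from each method.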

ContainsKey

import java.util.HashMap;
import java.util.Map;
...
Map<String, Integer> freq = new HashMap<String, Integer>();
...
int count = freq.containsKey(word) ? freq.get(word) : 0;
freq.put(word, count + 1);

TestForNull

import java.util.HashMap;
import java.util.Map;
...
Map<String, Integer> freq = new HashMap<String, Integer>();
...
Integer count = freq.get(word);
if (count == null) {
    freq.put(word, 1);
}
else {
    freq.put(word, count + 1);
}

AtomicLong

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;
...
final ConcurrentMap<String, AtomicLong> map = 
    new ConcurrentHashMap<String, AtomicLong>();
...
map.putIfAbsent(word, new AtomicLong(0));
map.get(word).incrementAndGet();

Trove

import gnu.trove.TObjectIntHashMap;
...
TObjectIntHashMap<String> freq = new TObjectIntHashMap<String>();
...
freq.adjustOrPutValue(word, 1, 1);

MutableInt

import java.util.HashMap;
import java.util.Map;
...
class MutableInt {
  int value = 1; // note that we start at 1 since we're counting
  public void increment () { ++value;      }
  public int  get ()       { return value; }
}
...
Map<String, MutableInt> freq = new HashMap<String, MutableInt>();
...
MutableInt count = freq.get(word);
if (count == null) {
    freq.put(word, new MutableInt());
}
else {
    count.increment();
}
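
Putting the MutableInt fragments together, a complete runnable version might look like the sketch below. The sample sentence, the whitespace split, and the class name MutableIntDemo are assumptions for illustration; the original tests tokenized a 10MB file.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper class around the MutableInt fragments above.
class MutableIntDemo {

    static Map<String, MutableInt> count(String text) {
        Map<String, MutableInt> freq = new HashMap<String, MutableInt>();
        for (String word : text.split("\\s+")) {
            MutableInt count = freq.get(word);
            if (count == null) {
                freq.put(word, new MutableInt());
            } else {
                count.increment(); // mutates in place: no re-boxing, no second put
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        Map<String, MutableInt> freq = count("the cat and the hat and the bat");
        System.out.println(freq.get("the").get()); // prints 3
        System.out.println(freq.get("and").get()); // prints 2
    }
}

class MutableInt {
    int value = 1; // starts at 1: an instance is created on a word's first sighting
    public void increment() { ++value; }
    public int get() { return value; }
}
```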

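Other answers

There are a couple of approaches:

1. Use a Bag algorithm, like the sets contained in Google Collections.
2. Create a mutable container which you can use in the Map: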


    class My {
        String word;
        int count;

        // Constructor needed for new My("word")
        My(String word) { this.word = word; }
    }

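and use put("word", new My("word")); then you can check whether the entry exists and increment its count when adding.

Avoid rolling your own solution using lists, because if you end up with inner-loop searching and sorting, your performance will stink. The first HashMap solution is actually quite fast, but a proper one like that found in Google Collections is probably better.

Counting words using Google Collections looks something like this: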



    HashMultiset s = new HashMultiset();
    s.add("word");
    s.add("word");
    System.out.println(""+s.count("word") );


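Using the HashMultiset is quite elegant, because a bag algorithm is just what you need when counting words.

It's always a good idea to look at the Google Collections Library for this kind of thing. In this case a Multiset will do the trick: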

Multiset bag = Multisets.newHashMultiset();
String word = "foo";
bag.add(word);
bag.add(word);
System.out.println(bag.count(word)); // Prints 2

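There are Map-like methods for iterating over keys, entries, and so on. Internally the implementation currently uses a HashMap<E, AtomicInteger>, so you will not incur boxing costs.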
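
To see why a HashMap<E, AtomicInteger> avoids boxing, here is a minimal bag sketch built the same way. It is not the Google Collections source; SimpleBag and its methods are hypothetical names used only to illustrate the idea.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical minimal bag; an illustration, not the library's implementation.
class SimpleBag<E> {
    private final Map<E, AtomicInteger> counts = new HashMap<E, AtomicInteger>();

    // Adds one occurrence. Existing counters are mutated in place,
    // so no Integer object is allocated per increment.
    public void add(E element) {
        AtomicInteger counter = counts.get(element);
        if (counter == null) {
            counts.put(element, new AtomicInteger(1));
        } else {
            counter.incrementAndGet();
        }
    }

    // Returns how many times the element was added (0 if never).
    public int count(E element) {
        AtomicInteger counter = counts.get(element);
        return counter == null ? 0 : counter.get();
    }

    public static void main(String[] args) {
        SimpleBag<String> bag = new SimpleBag<String>();
        bag.add("foo");
        bag.add("foo");
        System.out.println(bag.count("foo")); // prints 2
    }
}
```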

The Functional Java library's TreeMap data structure has an update method in the latest trunk head:

public TreeMap<K, V> update(final K k, final F<V, V> f)

Example usage:

import static fj.data.TreeMap.empty;
import static fj.function.Integers.add;
import static fj.pre.Ord.stringOrd;
import fj.data.TreeMap;

public class TreeMap_Update
  {public static void main(String[] a)
    {TreeMap<String, Integer> map = empty(stringOrd);
     map = map.set("foo", 1);
     map = map.update("foo", add.f(1));
     System.out.println(map.get("foo").some());}}

This program prints "2".

You should be aware that your original attempt

int count = map.containsKey(word) ? map.get(word) : 0;

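contains two potentially expensive operations on a map, namely containsKey and get. The former performs an operation potentially quite similar to the latter, so you're doing the same work twice!

If you look at the API for Map, get operations usually return null when the map does not contain the requested element.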

Note that this makes a solution like

map.put( key, map.get(key) + 1 );

dangerous, since it might yield NullPointerExceptions. You should check for null first.

Also note, and this is very important, that HashMaps can contain nulls by definition. So not every returned null says "there is no such element". In this respect, containsKey behaves differently from get in actually telling you whether there is such an element. Refer to the API for details.

For your case, however, you might not want to distinguish between a stored null and "noSuchElement". If you don't want to permit nulls you might prefer a Hashtable. Using a wrapper library as was already proposed in other answers might be a better solution than manual treatment, depending on the complexity of your application.

To complete the answer (and I forgot to put that in at first, thanks to the edit function!), the best way of doing it natively is to get the value into a final variable, check for null, and put it back incremented by 1. The variable should be final because it's immutable anyway. The compiler might not need this hint, but it's clearer that way.

final HashMap<Object, Integer> map = generateRandomHashMap();
final Object key = fetchSomeKey();
final Integer i = map.get(key);
if (i != null) {
    map.put(key, i + 1);
} else {
    // do something
}

If you do not want to rely on autoboxing, you should say something like map.put(key, Integer.valueOf(1 + i.intValue())); instead.
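
The get, null-check, put sequence described above can be wrapped in a small helper. This is only a sketch: the class and method names (Counts, increment) are invented for the example, and it assumes the map never stores null values, so a null result always means the key is absent.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; names are illustrative.
class Counts {

    // Increments the count for key with a single lookup.
    // Assumes null values are never stored, so null means "absent".
    static <K> void increment(Map<K, Integer> map, K key) {
        final Integer current = map.get(key);
        if (current == null) {
            map.put(key, 1);
        } else {
            map.put(key, current + 1);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<String, Integer>();
        increment(map, "foo");
        increment(map, "foo");
        System.out.println(map.get("foo")); // prints 2
    }
}
```

On Java 8 and later, map.merge(key, 1, Integer::sum) expresses the same update in one call.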
