Summary

All Collection.contains(obj) methods are not the same!

This article is a real world case study of the Big O differences between various implementations of Java’s Collection interface.   I found and fixed a grievous O(n^2) algorithm by using the right data structure.

Background

I was asked to investigate why some pages in our web application would save session data very quickly while another problem page would take literally tens of minutes. The application had at its core a Stateful Session Bean that held dirty objects which would be persisted to the database in a single transaction. Sure, the easy pages didn’t contain very much data to persist and we knew the problem page contains many times more data, but certainly not that much more data to cause 20 minute request times!

After I implemented the fix, the page elapsed time dropped from 20+ minutes to ~10 seconds. What did I do? I used the right data structure.

Data Structures and the Big O

The application used a Vector to store dirty objects. A Vector was used for two reasons: 1) the original engineers thought synchronization was important and 2) order was important for referential integrity. A Vector’s internal synchronization was unneeded because only a single user’s request thread ever access the application. The ordering, however, was extremely important because you couldn’t add a person’s data without first adding the person!

The problem page in the web app had to add thousands of rows of data to the database, hence there were thousands of dirty objects waiting in the cache for persistence. As the application created or dirtied objects, it checked its cache (the Vector) before adding it. You wouldn’t want the data to be persisted twice.

How did the app check its cache? vector.contains(obj);

The problem with vector.contains(obj) and list.contains(obj) is that they are O(n), which means they scale linearly. Put another way, it gets slower the more items you put into it. The page that created thousands of objects to persist got progressively slower with each object it created.

The solution was to switching to a LinkedHashSet which perserves order for referential integrity while providing O(1) performance for set.contains(obj) because all the objects are hashed.

The real problem was even worse, of course, because the app checked the cache each time before it added a new object.  This represents a good ol’ fashioned O(n^2) algorithm.

To be fair to the original developers, they wrote the application in Java 1.3 and LinkedHashSet was implemented in 1.4. Also, I don’t think they anticipated having a single page in the application generate thousands of objects.

Sample Code

Below is a simple program to highlight the performance differences between various Collection.contains(obj) methods

Elapsed times (in ms):

Vector: 3663
List: 3690
Set: 15
LinkedSet: 12

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
package mgt.perf;
 
import java.util.*;
 
public class ContainsExample {
 
    private int collectionCount = 10000;
    private int testCount = 50000;
 
    public static void main(String[] args) {
        new ContainsExample().start();
    }
 
    private void start() {
 
        Collection vector = new Vector();
        Collection list = new ArrayList();
        Collection set = new HashSet();
        Collection linkedSet = new LinkedHashSet();
 
        populate(vector);
        populate(list);
        populate(set);
        populate(linkedSet);
 
        System.out.println("Elapsed times\n");
        System.out.println("    Vector:" + test(vector));
        System.out.println("      List:" + test(list));
        System.out.println("       Set:" + test(set));
        System.out.println(" LinkedSet:" + test(linkedSet));
    }
 
    private void populate(Collection set) {
        for (int i = 0; i < collectionCount; i++) {
            set.add(i);
        }
    }
 
    private long test(Collection collection) {
        Random rnd = new Random(System.currentTimeMillis());
        long started = System.currentTimeMillis();
        for (int i = 0; i < testCount; i++) {
            int lookFor = rnd.nextInt(collectionCount);
            if (!collection.contains(lookFor)) {
                throw new IllegalStateException(lookFor + " really should be in the collection");
            }
        }
        long elapsed = System.currentTimeMillis() - started;
        return elapsed;
    }
 
}