# Stream, Sequence, Iterator – a story of laziness and sad JVM benchmarking noises Many programming languages have started to include more functional features in their standard libraries. One of those features is lazy collections, for lack of a better term, which seem to have a different name in each language (we’ll just call them iterators here) and sometimes vastly differing implementations. One thing they all have in common, though, is a lack of trust in their performance. For almost every language out there that offers lazy iterators, there will be people telling you not to use them for performance reasons, more often than not without any data to back that up. I was personally interested in this because, being a Java/Kotlin developer, I use Java’s Streams and Kotlin’s Sequences almost every day with relatively little regard for potential performance implications. They are intuitive to write and are easy to reason about, which is usually much more important than the results of a thousand microbenchmarks, so please don’t stop using your favorite language feature because it’s 2.8% slower than the alternative. Most code is already bad enough as is without desperate optimization attempts. Still, I wanted to know how they compare to imperative code. There are some resources on this for Java 8’s Stream API, but Kotlin’s Sequences seem to just be accepted as more convenient Streams[^convenience]. [^convenience]: If you’ve ever used them, you’ll know what I mean. Java’s Streams are built in a way that allows for easy parallelism, but brings its own problems and limitations for the usage. ## What *is* an iterator? You can think of an iterator as a pipeline. It lets you write code as a sequence of instructions to be applied to all elements of a container. Let’s use a simple example to demonstrate this. We want to take all numbers from 1 to 100,000, multiply each of them by 2, and then add all of them.[^sum] [^sum]: You could also just compute the sum and take that \* 2, but we specifically want that intermediate step for the example. First, the imperative solution: ```kotlin var sum = 0 for (i in 1..100_000) { sum += i * 2 } return sum ``` and now the functional version using a Sequence (Kotlin’s name for Streams/iterators): ```kotlin return (1..100_000).asSequence() .map { it * 2 } .sum() ``` An iterator not a list, and it doesn’t support indexing,[^index] because it doesn’t actually contain any data. It just knows how to get or compute it for you, but you don’t know how it does that. You don’t even always know when (or if at all) an iterator will end (in this case, we do, because we create the Sequence from the range `1..100_000`, meaning it will produce the numbers from 1 to 100,00 before it ends). You can tell an iterator to produce or emit data if you want to use it (which is often called ‘consuming’ because if you read data from the pipeline, it’s usually gone), or you can add a new step to it and hand the new pipeline to someone else, who can then consume it or add even more steps. [^index]: Or any other operation like it. No `iterator[0]`, no `iterator.get(0)` or whatever your favorite language uses. An operation like `iterator.last()` might exist, but it will consume the entire iterator instead of just accessing the last element. An important aspect to note is: adding an operation to the pipeline does nothing until someone actually starts reading from it, and even then, only the elements that are consumed are computed.[^inf] This makes it possible to operate on huge data sets while keeping memory usage low, because only the currently active element has to be held in memory. [^inf]: This is what makes infinite iterators possible. They can be very useful and are used a lot in functional languages, but they’re not today’s topic. ## Cold, hard numbers We’ll use that small example from the last section as our first example: take a range of numbers, double each number, and compute the sum – except this time, we’ll do the numbers from 1 to 1 billion. Since everything we’re doing is lazy, memory usage shouldn’t be an issue. I will use different implementations to solve them and benchmark all of them. Here are the different approaches I came up with: - a simple for loop in Java - Java’s LongStream - a for each loop with a range in Kotlin - Java’s LongStream called from Kotlin[^ktjava] - Java’s Stream wrapped in a Kotlin Sequence - a Kotlin range wrapped in a Sequence - Kotlin’s Sequence with a generator to create the range [^ktjava]: Mainly to make sure there is no performance difference between the two. The benchmarks were executed on an Intel Xeon E3-1271 v3 with 32 GB of RAM, running Arch Linux with kernel 5.4.20-1-lts, using the (at the time of writing) latest OpenJDK preview build (`15-ea+17-717`), Kotlin 1.4-M1, and [jmh](https://openjdk.java.net/projects/code-tools/jmh/) version 1.23. The bytecode target was set to Java 15 for the Java code and Java 13 for Kotlin (newer versions are currently unsupported). Source code for the Java tests: ```java public long stream() { return LongStream.range(1, upper) .map(l -> l * 2) .sum(); } public long loop() { long sum = 0; for (long i = 0; i < upper; i++) { sum += i * 2; } return sum; } ``` and for Kotlin: ```kotlin fun stream() = LongStream.range(1, upper) .map { it * 2 } .sum() fun loop(): Long { var sum = 0L for (l in 1L until upper) { sum += l * 2 } return sum } fun streamWrappedInSequence() = LongStream.range(1L, upper) .asSequence() .map { it * 2 } .sum() fun sequence() = (1 until upper).asSequence() .map { it * 2 } .sum() fun withGenerator() = generateSequence(0L, { it + 1L }) .take(upper.toInt()) .map { it * 2 } .sum() ``` with `const val upper = 1_000_000_000L`.[^`1 until upper` is used in these examples because unlike `lower..upper`, `until` is end-inclusive like Java’s LongStream.range().] Without wasting any more of your time, here are the results: ```plaintext Benchmark Mode Cnt Score Error Units Java.loop avgt 25 446.055 ± 0.677 ms/op Java.stream avgt 25 601.424 ± 12.606 ms/op Kotlin.loop avgt 25 446.600 ± 1.164 ms/op Kotlin.sequence avgt 25 2732.604 ± 6.644 ms/op Kotlin.stream avgt 25 593.353 ± 1.408 ms/op Kotlin.streamWrappedInSequence avgt 25 3829.209 ± 33.569 ms/op Kotlin.withGenerator avgt 25 8374.149 ± 880.647 ms/op ``` (full JMH output)[https://ruru.moe/pSK13p8]. Unsurprisingly, using Streams from Java and Kotlin is almost identical in terms of performance. The same is true for imperative loops, meaning Kotlin ranges introduce no overhead compared to incrementing for loops. More surprisingly, using Sequences is an order of magnitude slower. That was not at all according to my expectations, so I investigated. As it turns out, Java’s `LongStream` exists because Stream is *much* slower. The JVM has to use `Long` rather than `long` when the type is used for generics, which involves an additional boxing step and the allocation for the `Long` object.[^primitives] Still, we now know that Streams have about 25% overhead compared to the simple loop for this example, that generating sequences is a comparatively slow process, and that wrapping Streams comes at a considerable cost (compared to a sequence created from a range). [^primitives]: The JVM has a few primitive types, such as `int`, `char`, or array types. They are different from any other type because they cannot be `null`. Every regular type on the JVM extends `java.lang.Object` and is just a reference that is being passed around. The primitives are values, not references, so there’s a lot less overhead involved. Unfortunately, primitives can’t be used as generic types, so a list of longs will always convert the `long` to `Long` before adding it. That last point seemed odd, so I attached a profiler to see where the CPU time is lost. ![Flamegraph of `streamWrappedInSequence()`](https://i.kageru.moe/knT2Eg.png) We can see that the `LongStream` can produce a `PrimitiveIterator.OfLong` that is used as a source for the Sequence. The operation of boxing a primitive `long` into an object `Long` (that’s the `Long.valueOf()` step) takes almost as long as advancing the underlying iterator itself. 7.7% of the CPU time is spent in `Sequence.hasNext()`. The exact breakdown of that looks as follows: ![Checking if a Sequence has more elements](https://i.kageru.moe/k4NHhR.png) The Sequence introduces very little overhead here, as it just delegates to `hasNext()` of the underlying iterator. Worth noting is that the iterator calls `accept` as part of `hasNext()`, which will already advance the underlying iterator. The value returned by that will be stored temporarily until `nextLong()` is called. ```java public boolean tryAdvance(LongConsumer consumer) { final long i = from; if (i < upTo) { from++; consumer.accept(i); return true; } // more stuff down here } ``` where `consumer.accept()` is ```java public void accept(T t) { valueReady = true; nextElement = t; } ``` Knowing this, I have to wonder why `nextLong()` takes as long as it does. Looking at [the implementation](https://github.com/openjdk/jdk/blob/6bab0f539fba8fb441697846347597b4a0ade428/src/java.base/share/classes/java/util/Spliterators.java#L756), I don’t understand where all that time is going. `hasNext()` should always be called before `next()`, so `next()` just has to return a precomputed value. Nevertheless, we can now explain the performance difference with the additional boxing step. Primitives good; everything else bad. With that in mind, I wrote a second test that avoids the unboxing issue to compare Streams and Sequences. The next snippet uses a simple wrapper class that guarantees that we have no primitives to execute a few operations on a Stream/Sequence. I’ll use this opportunity to also compare parallel and sequential streams. The steps are simple: Take a long -> create a LongWrapper from it, double the contained value (which creates a new LongWrapper), extract the value, compute the sum. That may sound overcomplicated, but it’s sadly close to the reality of enterprise code. Wrapper types are everywhere. ```kotlin inner class LongWrapper(val value: Long) { fun double() = LongWrapper(value * 2) } fun sequence(): Long = (1 until upper).asSequence() .map(::LongWrapper) .map(LongWrapper::double) .map(LongWrapper::value) .sum() fun stream(): Optional = StreamSupport.stream((1 until upper).spliterator(), false) .map(::LongWrapper) .map(LongWrapper::double) .map(LongWrapper::value) .reduce(Long::plus) fun parallelStream(): Optional = StreamSupport.stream((1 until upper).spliterator(), true) .map(::LongWrapper) .map(LongWrapper::double) .map(LongWrapper::value) .reduce(Long::plus) fun loop(): Long { var sum = 0L for (l in 1 until upper) { val wrapper = LongWrapper(l) val doubled = wrapper.double() sum += doubled.value } return sum } ``` The results here paint a different picture: ``` NonPrimitive.loop avgt 25 445.992 ± 0.642 ms/op NonPrimitive.sequence avgt 25 27257.399 ± 342.686 ms/op NonPrimitive.stream avgt 25 44673.318 ± 1325.832 ms/op NonPrimitive.parallelStream avgt 25 33856.919 ± 249.911 ms/op ``` Full results are in the (JMH log from earlier)[https://ruru.moe/pSK13p8]. The overhead of Java streams is much higher than that of Kotlin Sequences, and even a parallel Stream is slower than using a Sequence. even though Sequences only use a single thread, but both are miles behind the simple for loop. My first assumption was that the compiler optimized away the wrapper type and just added the longs, but looking at [the bytecode](https://p.kageru.moe/AUJKiG), the constructor invocation and the `double()` method calls are still there. It’s hard to know what the JIT does at runtime, but the numbers certainly suggest that the wrapper is simply optimized away. The profiler report wasn’t helpful either, which further leads me to believe that the JIT just deletes the method and inlines the calculations. This tells us that not only do Streams/Sequences have a very measurable overhead, but they severely limit the optimizer’s (be it compile-time or JIT) ability to understand the code, which can lead to significant slowdowns in code that can be optimized. Obviously, code that doesn’t rely on the optimizer as much won’t be affected to the same degree. ## Conclusion Overall, I think that Kotlin’s Sequences are a good addition to the language. They fall behind Streams when working with primitives because the Java standard library has subtypes for many generic constructs to more efficiently handle primitive types, but in most real-world JVM applications (that being enterprise-level bloatware), primitives are the exception rather than the rule. Still, Kotlin has some types that optimize for this, such as `LongIterator`, but without a `LongSequence` to go with it, the boxing will still happen eventually, and all the performance gains are void. I hope that we can get a few more types like it in the future, which will be especially useful once Kotlin/Native reaches maturity and starts being used for small/embedded hardware. Apart from the performance, Sequences are also a lot easier to understand and even extend than Streams. Implementing your own Sequence requires barely more than an implementation of the underlying iterator, as can be seen in [CoalescingSequence](https://git.kageru.moe/kageru/Sekwences/src/branch/master/src/main/kotlin/moe/kageru/sekwences/CoalescingSequence.kt) which I implemented last year to get a feeling for how all of this works. Streams on the other hand are a lot more complex. They actually extend `Consumer`, so a `Stream` is just a `void consume(T input)` that can be called repeatedly. That makes it a lot harder to grasp where data is coming from and how it is requested, at least to me. Simplicity is often underrated in software, but I consider it a huge plus for Sequences. I will continue to use them liberally, unless I find myself in a situation where I need to process a huge number of primitives. And even then, I now know that Java’s Streams are a good choice. 25% might sound like a lot, but it’s more than worth it if it means leaving code that is much easier to understand and modify for the next person. Unless you’re actually in a very performance-critical part of your application, but if you ever find yourself in that situation, you should switch to a different language. On that note: I was originally going to include Rust’s iterators here for comparison, but rustc optimized away all of my benchmarks with [constant time solutions](https://godbolt.org/z/iJaWVP). It’s a fascinating topic, and I might write a separate blog post where I dissect some of the assembly that rustc/LLVM produced, but I need to properly understand it myself first.