I recently needed to implement a function that yields each element of an iterable only once. It turns out there are a few ways to do that in Python, so I decided to profile them.

Warning: the code requires a modern version of Python because of the typing. I opted out of the "new" generics syntax introduced in 3.12, but still.
```python
from collections import OrderedDict
from collections.abc import Iterable
import random
import time
from typing import TypeVar

import numpy as np
import pandas as pd

T = TypeVar("T")


def dedup1(it: Iterable[T]) -> Iterable[T]:
    seen = set()
    for item in it:
        if item not in seen:
            seen.add(item)
            yield item


def dedup2(it: Iterable[T]) -> Iterable[T]:
    seen = set()
    # Binding seen.add to a variable reduces the cost of attribute
    # fetching within a tight loop.
    seen_add = seen.add
    # Here, 'or' allows us to add the item to 'seen' when it doesn't
    # already exist there in a single line.
    return (item for item in it if not (item in seen or seen_add(item)))


def dedup3(it: Iterable[T]) -> Iterable[T]:
    yield from dict.fromkeys(it)


def dedup4(it: Iterable[T]) -> Iterable[T]:
    seen = set()
    # Binding seen.add to a variable reduces the cost of attribute
    # fetching within a tight loop.
    seen_add = seen.add
    for item in it:
        if item in seen:
            continue
        seen_add(item)
        yield item


def dedup5(it: Iterable[T]) -> Iterable[T]:
    yield from set(it)


def dedup6(it: Iterable[T]) -> Iterable[T]:
    yield from np.unique(it)


def dedup7(it: Iterable[T]) -> Iterable[T]:
    yield from pd.unique(pd.Series(it))


def dedup8(it: Iterable[T]) -> Iterable[T]:
    yield from OrderedDict.fromkeys(it)


lists = []
for i in range(10000):
    lists.append([])
    for j in range(1000):
        lists[i].append(random.randint(0, 10))

for dedup in (dedup1, dedup2, dedup3, dedup4, dedup5, dedup6, dedup7, dedup8):
    start = time.perf_counter()
    for list_ in lists:
        x = list(dedup(list_))
    end = time.perf_counter()
    print(end - start)
    print(x)
```
Most of the implementations come from https://rednafi.com/python/deduplicate_iterables_while_preserving_order/.
- `dedup1` is the most basic implementation that preserves order.
- `dedup2` binds `seen.add` to a variable and uses an `or` hack to make the last line a "clever" one-liner.
- `dedup3` uses `dict.fromkeys`, because since Python 3.7 `dict` preserves insertion order.
- `dedup4` is the same as `dedup2`, but doesn't use the hack.
- `dedup5` is the most basic implementation that doesn't preserve order.
- `dedup6` uses NumPy's `np.unique`.
- `dedup7` uses pandas' `pd.unique`.
- `dedup8`, for the sake of consistency with `dedup3`, uses `OrderedDict.fromkeys`.
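The order-preserving vs. non-preserving distinction is easy to see in isolation (a small sketch, independent of the benchmark code):

```python
items = [3, 1, 3, 2, 1]

# dict.fromkeys keeps first-seen order (guaranteed since Python 3.7).
print(list(dict.fromkeys(items)))  # [3, 1, 2]

# set() deduplicates too, but its iteration order is not guaranteed,
# so we sort here only to get a stable display.
print(sorted(set(items)))  # [1, 2, 3]
```
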
Output from an example run:

```
> python main.py
0.10713764899992384
[6, 9, 3, 7, 2, 5, 1, 4, 10, 8, 0]
0.11175034099960612
[6, 9, 3, 7, 2, 5, 1, 4, 10, 8, 0]
0.08072049600013997
[6, 9, 3, 7, 2, 5, 1, 4, 10, 8, 0]
0.10612205899997207
[6, 9, 3, 7, 2, 5, 1, 4, 10, 8, 0]
0.05768413699934172
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
0.33301541400032875
[np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9), np.int64(10)]
1.6896407409994936
[np.int64(6), np.int64(9), np.int64(3), np.int64(7), np.int64(2), np.int64(5), np.int64(1), np.int64(4), np.int64(10), np.int64(8), np.int64(0)]
0.14495360199998686
[6, 9, 3, 7, 2, 5, 1, 4, 10, 8, 0]
```
As you can see, the non-order-preserving version is, expectedly, the fastest. Among the order-preserving implementations, `dict.fromkeys` is slightly faster than the rest. NumPy and pandas take extremely long in comparison and change the type of the items.

What is mildly interesting:

- `OrderedDict.fromkeys` takes almost twice as much time as `dict.fromkeys`.
- NOT using the one-liner hack makes the variable-binding implementation faster.
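The method-binding trick can also be checked in isolation with `timeit` (a sketch; the absolute numbers and the size of the gap will vary by machine and interpreter):

```python
import timeit

data = list(range(1000)) * 10


def with_attribute_lookup():
    seen = set()
    for item in data:
        if item not in seen:
            seen.add(item)  # attribute fetched on every miss


def with_bound_method():
    seen = set()
    seen_add = seen.add  # fetch the bound method once
    for item in data:
        if item not in seen:
            seen_add(item)


for fn in (with_attribute_lookup, with_bound_method):
    print(fn.__name__, timeit.timeit(fn, number=200))
```
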