Profiling unique implementations in Python

I recently needed to implement a function that yields each element only once from an iterable.

It turns out that there are a few ways that can be done in python, so I decided to profile them.

Warning: the code requires a modern version of Python due to the typing used. I opted not to use the “new” generic syntax introduced in 3.12, but it still needs a recent interpreter.

from collections import OrderedDict
from collections.abc import Iterable
import random
import time
from typing import TypeVar

import pandas as pd
import numpy as np

T = TypeVar("T")


def dedup1(it: Iterable[T]) -> Iterable[T]:
    seen = set()
    for item in it:
        if item not in seen:
            seen.add(item)
            yield item


def dedup2(it: Iterable[T]) -> Iterable[T]:
    seen = set()

    # Binding seen.add to a variable reduces the cost of attribute fetching within a tight loop
    seen_add = seen.add

    # Here, 'or' allows us to add the item to 'seen' when it doesn't
    # already exist there in a single line.
    return (item for item in it if not (item in seen or seen_add(item)))


def dedup3(it: Iterable[T]) -> Iterable[T]:
    yield from dict.fromkeys(it)


def dedup4(it: Iterable[T]) -> Iterable[T]:
    seen = set()

    # Binding seen.add to a variable reduces the cost of attribute fetching within a tight loop
    seen_add = seen.add

    for item in it:
        if item in seen:
            continue
        seen_add(item)
        yield item


def dedup5(it: Iterable[T]) -> Iterable[T]:
    yield from set(it)


def dedup6(it: Iterable[T]) -> Iterable[T]:
    yield from np.unique(it)


def dedup7(it: Iterable[T]) -> Iterable[T]:
    yield from pd.unique(pd.Series(it))

def dedup8(it: Iterable[T]) -> Iterable[T]:
    yield from OrderedDict.fromkeys(it)

lists = []
for i in range(10000):
    lists.append([])
    for j in range(1000):
        lists[i].append(random.randint(0, 10))

for dedup in (dedup1, dedup2, dedup3, dedup4, dedup5, dedup6, dedup7, dedup8):
    start = time.perf_counter()
    for list_ in lists:
        x = list(dedup(list_))
    end = time.perf_counter()
    print(end - start)
    print(x)

Most of the implementations come from https://rednafi.com/python/deduplicate_iterables_while_preserving_order/

dedup1 is the most basic implementation that preserves order.

dedup2 binds seen.add to a local variable and uses an or hack to turn the last line into a “clever” one-liner.
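The trick works because set.add returns None: for a new item the membership check is False, so the or falls through to seen_add(item), which adds the item as a side effect and evaluates to None (falsy), making the whole not (...) True. A minimal sketch:

```python
seen = set()
seen_add = seen.add

# First time: 1 is not in seen, so seen_add(1) runs (returning None),
# and not (False or None) is True -- the item would be yielded.
first = not (1 in seen or seen_add(1))

# Second time: 1 is already in seen, so not (True or ...) is False,
# and short-circuiting means seen_add is never called again.
second = not (1 in seen or seen_add(1))

print(first, second, seen)  # True False {1}
```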

dedup3 uses dict.fromkeys because since python 3.7 dict preserves the order of insertion.
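A quick illustration of that insertion-order guarantee:

```python
items = [3, 1, 3, 2, 1]

# dict.fromkeys keeps the first occurrence of each key, in insertion order.
print(list(dict.fromkeys(items)))  # [3, 1, 2]
```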

dedup4 is the same as dedup2, but uses an explicit loop instead of the one-liner hack.

dedup5 is the most basic implementation that doesn’t preserve order.
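Incidentally, the set-based output in the run below happens to come out sorted only because CPython hashes small non-negative integers to themselves, so they land in hash-table slots in numeric order; this is an implementation detail, not an ordering guarantee:

```python
# In CPython, hash(n) == n for small non-negative integers, so a set of
# small ints often iterates in sorted order -- an implementation detail.
print([hash(n) for n in range(5)])  # [0, 1, 2, 3, 4]
print(list(set([6, 9, 3, 7, 2, 5, 1, 4, 10, 8, 0])))
```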

dedup6 uses numpy’s np.unique.
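Worth noting: np.unique returns a sorted array rather than first-seen order, and its elements are NumPy scalars rather than Python ints. A small check:

```python
import numpy as np

arr = np.unique([3, 1, 3, 2])
print(arr.tolist())  # [1, 2, 3] -- sorted, not first-seen order
print(isinstance(arr[0], np.integer))  # True -- NumPy scalar, not plain int
```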

dedup7 uses pandas’ pd.unique.
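Unlike np.unique, pd.unique does preserve first-seen order (the pandas docs state it does not sort), though it still returns NumPy scalars for numeric input:

```python
import pandas as pd

uniq = pd.unique(pd.Series([3, 1, 3, 2, 1]))
print(list(uniq))  # values in first-seen order: 3, 1, 2
```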

dedup8, for the sake of consistency with dedup3, uses OrderedDict.fromkeys.

Output of an example run:

> python main.py
0.10713764899992384
[6, 9, 3, 7, 2, 5, 1, 4, 10, 8, 0]
0.11175034099960612
[6, 9, 3, 7, 2, 5, 1, 4, 10, 8, 0]
0.08072049600013997
[6, 9, 3, 7, 2, 5, 1, 4, 10, 8, 0]
0.10612205899997207
[6, 9, 3, 7, 2, 5, 1, 4, 10, 8, 0]
0.05768413699934172
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
0.33301541400032875
[np.int64(0), np.int64(1), np.int64(2), np.int64(3), np.int64(4), np.int64(5), np.int64(6), np.int64(7), np.int64(8), np.int64(9), np.int64(10)]
1.6896407409994936
[np.int64(6), np.int64(9), np.int64(3), np.int64(7), np.int64(2), np.int64(5), np.int64(1), np.int64(4), np.int64(10), np.int64(8), np.int64(0)]
0.14495360199998686
[6, 9, 3, 7, 2, 5, 1, 4, 10, 8, 0]

As you can see, as expected, the non-order-preserving version is the fastest.

Of the order-preserving implementations, dict.fromkeys is slightly faster than the rest.

NumPy and pandas take far longer by comparison, and both change the type of the items (to NumPy scalars).

What is mildly interesting is that:

  • OrderedDict.fromkeys takes almost twice as long as plain dict.fromkeys
  • NOT using the one-liner hack makes the variable-binding implementation slightly faster
https://vulwsztyn.codeberg.page/posts/profiling-unique-implementations-in-python/
Author
Artur Mostowski
Published at
2025-09-04
License
CC BY-NC-SA 4.0