Learning from bidict
¶
Below is an outline of some of the more fascinating
and lesser-known Python corners I got to explore further
thanks to working on bidict
.
If you would like to learn more about any of the topics below, you may find reading bidict’s code particularly interesting.
I’ve sought to optimize the code not just for correctness and performance, but also to make for a clear and enjoyable read, illuminating anything that could otherwise be obscure or subtle.
I hope it brings you some of the joy it’s brought me. 😊
Python syntax hacks¶
bidict
used to support
(ab)using a specialized form of Python’s slice syntax
for getting and setting keys by value:
>>> element_by_symbol = bidict(H='hydrogen')
>>> element_by_symbol['H'] # [normal] syntax for the forward mapping
'hydrogen'
>>> element_by_symbol[:'hydrogen'] # [:slice] syntax for the inverse (no longer supported)
'H'
See this code for how this was implemented, and #19 for why this was dropped.
Code structure¶
bidict
s come in every combination of
mutable, immutable, ordered, and unordered types,
implementing Python’s various
relevant
collections
interfaces
as appropriate.
Factoring the code to maximize reuse, modularity, and adherence to SOLID design principles (while not missing any chances for special-case optimizations) has been one of the most fun parts of working on bidict.
To see how this is done, check out this code:
Data structures are amazing¶
Data structures are one of the most fascinating and important building blocks of programming and computer science.
It’s all too easy to lose sight of the magic when having to implement them for computer science courses or job interview questions. Part of this is because many of the most interesting real-world details get left out, and you miss all the value that comes from ongoing, direct practical application.
Bidict shows how fundamental data structures can be implemented in Python for important real-world usage, with practical concerns at top of mind. Come to catch sight of a real, live, industrial-strength linked list in the wild. Stay for the rare, exotic bidirectional mapping breeds you’ll rarely see at home. 1
- 1
To give you a taste:
A regular
bidict
encapsulates two regular dicts, keeping them in sync to preserve the bidirectional mapping invariants. Since dicts are unordered, regular bidicts are unordered too. How should we extend this to implement an ordered bidict?We’ll still need two backing mappings to store the forward and inverse associations. To store the ordering, we use a (circular, doubly-) linked list. This allows us to e.g. delete an item in any position in O(1) time.
Interestingly, the nodes of the linked list encode only the ordering of the items; the nodes themselves contain no key or value data. The two backing mappings associate the key and value data with the nodes, providing the final pieces of the puzzle.
Can we use dicts for the backing mappings, as we did for the unordered bidict? It turns out that dicts aren’t enough—the backing mappings must actually be (unordered) bidicts themselves!
Check out _orderedbase.py to see this in action.
Property-based testing is revolutionary¶
When your automated tests run, are they only checking the test cases you happened to hard-code into your test suite? How do you know these test cases aren’t missing some important edge cases?
With property-based testing, you describe the types of test case inputs your functions accept, along with the properties that should hold for all inputs. Rather than having to think up your test case inputs manually and hard-code them into your test suite, they get generated for you dynamically, in much greater quantity and edge case-exercising diversity than you could come up with by hand. This dramatically increases test coverage and confidence that your code is correct.
Bidict never would have survived so many refactorings with so few bugs if it weren’t for property-based testing, enabled by the amazing Hypothesis library. It’s game-changing.
Check out bidict’s property-based tests to see this in action.
Python surprises, gotchas, regrets¶
See nan as a Key.
What should happen when checking equality of several ordered mappings that contain the same items but in a different order? What about when comparing with an unordered mapping?
Check out what Python’s
collections.OrderedDict
does, and the surprising results:>>> from collections import OrderedDict >>> d = dict([(0, 1), (2, 3)]) >>> od = OrderedDict([(0, 1), (2, 3)]) >>> od2 = OrderedDict([(2, 3), (0, 1)]) >>> d == od True >>> d == od2 True >>> od == od2 False >>> class MyDict(dict): ... __hash__ = lambda self: 0 ... >>> class MyOrderedDict(OrderedDict): ... __hash__ = lambda self: 0 ... >>> d = MyDict([(0, 1), (2, 3)]) >>> od = MyOrderedDict([(0, 1), (2, 3)]) >>> od2 = MyOrderedDict([(2, 3), (0, 1)]) >>> len({d, od, od2}) 1 >>> len({od, od2, d}) 2
According to Raymond Hettinger (Python core developer responsible for much of Python’s collections), this design was a mistake (e.g. it violates the Liskov substitution principle and the transitive property of equality), but it’s too late now to fix.
Fortunately, it wasn’t too late for bidict to learn from this. Hence __eq__() is order-insensitive for ordered bidicts, and their separate
equals_order_sensitive()
method.If you define a custom
__eq__()
on a class, it will not be used for!=
comparisons on Python 2 automatically; you must explicitly add an__ne__()
implementation that calls your__eq__()
implementation. If you don’t,object.__ne__()
will be used instead, which behaves likeis not
. Python 3 thankfully fixes this.
Better memory usage through __slots__
¶
Using __slots__ dramatically reduces memory usage in CPython and speeds up attribute access to boot. Must be careful with pickling and weakrefs though! See BidictBase.__getstate__().
Better memory usage through weakref
¶
A bidict
and its inverse use weakref
to avoid creating a strong reference cycle,
so that when you release your last reference to a bidict,
its memory is reclaimed immediately in CPython
rather than having to wait for the next garbage collection.
See bidict Avoids Reference Cycles.
The (doubly) linked lists that back ordered bidicts also use weakrefs to avoid creating strong reference cycles.
Subclassing namedtuple()
classes¶
To get the performance benefits, intrinsic sortability, etc.
of namedtuple()
while customizing behavior, state, API, etc.,
you can subclass a namedtuple()
class.
Just make sure to include __slots__ = ()
,
or you’ll lose a lot of the performance benefits.
Here’s an example:
>>> from collections import namedtuple
>>> from itertools import count
>>> class Node(namedtuple('_Node', 'cost tiebreaker data parent depth')):
... """Represent nodes in a graph traversal. Suitable for use with e.g. heapq."""
...
... __slots__ = ()
... _counter = count() # break ties between equal-cost nodes, avoid comparing data
...
... # Give call sites a cleaner API for creating new Nodes
... def __new__(cls, cost, data, parent=None):
... tiebreaker = next(cls._counter)
... depth = parent.depth + 1 if parent else 0
... return super().__new__(cls, cost, tiebreaker, data, parent, depth)
...
... def __repr__(self):
... return 'Node(cost={cost}, data={data!r})'.format(**self._asdict())
>>> start = Node(cost=0, data='foo')
>>> child = Node(cost=5, data='bar', parent=start)
>>> child
Node(cost=5, data='bar')
>>> child.parent
Node(cost=0, data='foo')
>>> child.depth
1
namedtuple()
-style dynamic class generation¶
See the implementation
of namedbidict()
.
API Design¶
How to deeply integrate with Python’s collections
and other built-in APIs?
Beyond implementing
collections.abc.Mapping
, bidicts implement additional APIs thatdict
andOrderedDict
implement (e.g.setdefault()
,popitem()
, etc.).When creating a new API, making it familiar, memorable, and intuitive is hugely important to a good user experience.
Thanks to
Hashable
’s implementingabc.ABCMeta.__subclasshook__()
, any class that implements the required methods of theHashable
interface (namely,__hash__()
) makes it a virtual subclass already, no need to explicitly extend. I.e. As long asFoo
implements a__hash__()
method,issubclass(Foo, Hashable)
will always be True, no need to explicitly subclass viaclass Foo(Hashable): ...
How to make your own open ABC like
Hashable
?Override
__subclasshook__()
to check for the interface you require. SeeBidirectionalMapping
’s old (correct) implementation (this was later removed due to lack of use and maintenance cost when it was discovered that a bug was introduced in v0.15.0).Interesting consequence of the
__subclasshook__()
design: the “subclass” relation becomes intransitive. e.g.object
is a subclass ofHashable
,list
is a subclass ofobject
, butlist
is not a subclass ofHashable
.
What if you needed to derive from a second metaclass? Be careful to avoid “TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases”. See the great write-up in https://blog.ionelmc.ro/2015/02/09/understanding-python-metaclasses/.
collections.abc.Mapping
andcollections.abc.MutableMapping
don’t implement__subclasshook__()
, so you must either explicitly subclass them (in which case you inherit their concrete method implementations) or useabc.ABCMeta.register()
(to register as a virtual subclass without inheriting any of the implementation).Notice that Python provides
collections.abc.Reversible
but nocollections.abc.Ordered
orcollections.abc.OrderedMapping
. This was proposed in bpo-28912 but rejected. Would have been useful for bidict’s__repr__()
implementation (see_base.py
), and potentially for interop with other ordered mapping implementations such as SortedDict.
How to make APIs Pythonic?
See the Zen of Python.
“Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.”
Manifested in bidict’s default
on_dup
class attribute (seeON_DUP_DEFAULT
).“Readability counts.”
“There should be one – and preferably only one – obvious way to do it.”
An early version of bidict allowed using the
~
operator to access.inverse
and a special slice syntax likeb[:val]
to look up a key by value, but these were removed in preference to the more obvious and readable.inverse
-based spellings.
Python’s data model¶
What happens when you implement a custom
__eq__()
? e.g. What’s the difference betweena == b
andb == a
when onlya
is an instance of your class? See the great write-up in https://eev.ee/blog/2012/03/24/python-faq-equality/ for the answer.If an instance of your special mapping type is being compared against a mapping of some foreign mapping type that contains the same items, should your
__eq__()
method return true?Bidict says yes, again based on the Liskov substitution principle. Only returning true when the types matched exactly would violate this. And returning
NotImplemented
would cause Python to fall back on using identity comparison, which is not what is being asked for.(Just for fun, suppose you did only return true when the types matched exactly, and suppose your special mapping type were also hashable. Would it be worth having your
__hash__()
method include your type as well as your items? The only point would be to reduce collisions when multiple instances of different types contained the same items and were going to be inserted into the samedict
orset
, since they’d now be unequal but would hash to the same value otherwise.)Making an immutable type hashable (so it can be inserted into
dict
s andset
s): Must implement__hash__()
such thata == b ⇒ hash(a) == hash(b)
. See theobject.__hash__()
andobject.__eq__()
docs, and the implementation offrozenbidict
.Consider
FrozenOrderedBidict
: its__eq__()
is order-insensitive. So all contained items must participate in the hash order-insensitively.Can use collections.abc.Set._hash which provides a pure Python implementation of the same hash algorithm used to hash
frozenset
s. (SinceItemsView
extendsSet
,bidict.frozenbidict.__hash__()
just callsItemsView(self)._hash()
.)Does this argue for making
collections.abc.Set._hash()
non-private?Why isn’t the C implementation of this algorithm directly exposed in CPython? The only way to use it is to call
hash(frozenset(self.items()))
, which wastes memory allocating the ephemeral frozenset, and time copying all the items into it before they’re hashed.
Unlike other attributes, if a class implements
__hash__()
, any subclasses of that class will not inherit it. It’s like Python implicitly adds__hash__ = None
to the body of every class that doesn’t explicitly define__hash__
. So if you do want a subclass to inherit a base class’s__hash__()
implementation, you have to set that manually, e.g. by adding__hash__ = BaseClass.__hash__
in the class body. SeeFrozenOrderedBidict
.This is consistent with the fact that
object
implements__hash__()
, but subclasses ofobject
that override__eq__()
are not hashable by default.
Using
__new__()
to bypass default object initialization, e.g. for bettercopy()
performance. See _base.py.Overriding
object.__getattribute__()
for custom attribute lookup. See SortedBidict Recipes.Using
object.__getstate__()
,object.__setstate__()
, andobject.__reduce__()
to make an object pickleable that otherwise wouldn’t be, due to e.g. using weakrefs, as bidicts do (covered further below).
Portability¶
Python 2 vs. Python 3
As affects bidict, mostly
dict
API changes, but also functions likezip()
,map()
,filter()
, etc.See the
__ne__()
gotcha for Python 2 above.Borrowing methods from other classes:
In Python 2, must grab the
.im_func
/__func__
attribute off the borrowed method to avoid gettingTypeError: unbound method ...() must be called with ... instance as first argument
See the old implementation of
FrozenOrderedBidict
.
CPython vs. PyPy (and other Python implementations)
gc / weakref
Hence
test_bidicts_freed_on_zero_refcount()
in test_properties.py is skipped outside CPython.primitives’ identities, nan, etc.
Other interesting stuff in the standard library¶
reprlib
andreprlib.recursive_repr()
(but not needed for bidict because there’s no way to insert a bidict into itself)