Learning from bidict
#
Working on bidict has taken me to some of the most interesting and unexpected places I’ve gotten to visit in many years of programming. (When I started this project 15+ years ago, I’d never heard of things like higher-kinded types. Thanks to bidict, I not only learned about them, I got to share a practical example with Guido where they would be beneficial for Python.)
The problem space that bidict inhabits is abundant with beautiful symmetries, delightful surprises, and rich opportunities to come up with elegant solutions.
You can check out bidict’s source to see for yourself. I’ve sought to optimize the code not just for correctness and performance, but also for clarity, maintainability, and to make for an enjoyable read.
See below for more, and feel free to let me know what you think. I hope reading bidict’s code brings you some of the joy that bidict has brought me.
Code structure#
bidict
s come in
mutable, immutable, and ordered variants,
implementing Python’s various
relevant
collections
interfaces
as appropriate.
Factoring the code to maximize reuse, modularity, and adherence to SOLID design principles (while not missing any chances for specialized optimizations) has been one of the most fun parts of working on bidict.
To see how this is done, check out the code starting with __init__.py, and then follow the path suggested in the “code review nav” comments at the top of the file:
Data structures are amazing#
Data structures are one of the most fascinating and important building blocks of programming and computer science.
It’s all too easy to lose sight of the magic when having to implement them for computer science courses or job interview questions. Part of this is because many of the most interesting real-world details get left out, and you miss all the value that comes from ongoing, direct practical application.
Bidict shows how fundamental data structures can be implemented in Python for real-world usage, with practical concerns at top of mind.
OrderedBidict
's design#
A regular bidict
encapsulates two regular dicts,
keeping them in sync to preserve the bidirectional mapping invariants.
How should we extend this to implement OrderedBidict
?
From BidictBase
,
OrderedBidictBase
inherits the use of two regular dicts
to store the contents of the forward and inverse items.
To store the _ordering_ of the items,
we use a doubly-linked list
(much like OrderedDict
),
allowing us to e.g. move any item to the front
of the bidict in constant time.
Interestingly, the nodes of the linked list encode only the ordering of the items; the nodes themselves contain no key or value data. An additional backing mapping associates the key/value data with the nodes, providing the final piece of the puzzle.
And since the implementation needs to not only
look up nodes by key/value, but also key/value by node,
we use a bidict
for this internally.
Bidicts all the way down!
Python syntax hacks#
bidict used to support a specialized form of Python’s slice syntax for getting and setting keys by value:
element_by_symbol = bidict(H='hydrogen')
# [normal] syntax for the forward mapping lookup:
element_by_symbol['H'] # ==> 'hydrogen'
# [:slice] syntax for the inverse lookup (no longer supported):
element_by_symbol[:'hydrogen'] # ==> 'H'
See this code for how this was implemented.
It’s super cool when you find a way to bend Python’s syntax to support new use cases like this that stll feel like they fit well into the language, especially given that Python (wisely) limits how much you can customize its syntax.
Property-based testing is incredible#
When your automated tests run, are they only checking the test cases that you happened to think of when writing your tests? How do you know you aren’t missing some important edge cases?
With property-based testing, you describe the _types_ of the test case inputs that your APIs accept, along with the properties that should hold for all valid inputs. Rather than having to think of your test case inputs manually and hard-code them into your test suite, they get generated for you dynamically, in much greater quantity and diversity than you would typically come up with by hand. This dramatically increases test coverage and confidence that your code is correct with much less actual test code.
Bidict never would have survived so many refactorings with so few bugs if it weren’t for property-based testing, enabled by the amazing Hypothesis library.
Check out bidict’s property-based tests to see this in action.
Python surprises#
What should happen when checking equality of several ordered mappings that contain the same items but in a different order?
First let’s see how
collections.OrderedDict
works. The results may surprise you:>>> from collections import OrderedDict >>> x = OrderedDict({1: 1, 2: 2}) >>> y = {1: 1, 2: 2} >>> z = OrderedDict({2: 2, 1: 1}) >>> x == y True >>> y == z True >>> x == z # !!! False
So
collections.OrderedDict
violates the transitive property of equality. This can lead to some even more unusual behavior than the above. As an example, let’s see what would happen ifbidict.frozenbidict.__eq__()
behaved this way:>>> class BadFrozenBidict(BidictBase): ... __hash__ = frozenbidict.__hash__ ... ... def __eq__(self, other): # (deliberately simplified) ... # Override to be order-sensitive, like collections.OrderedDict: ... return all(i == j for (i, j) in zip(self.items(), other.items())) >>> x = BadFrozenBidict({1: 1, 2: 2}) >>> y = frozenbidict({1: 1, 2: 2}) >>> z = BadFrozenBidict({2: 2, 1: 1}) >>> x == y True >>> y == z True >>> x == z # !!! False >>> set1 = {x, y, z} >>> len(set1) 2 >>> set2 = {y, x, z} >>> len(set2) # !!! 1
According to Raymond Hettinger, the Python core developer who built Python’s collections foundation,
collections.OrderedDict
's__eq__()
implementation should have been order-insensitive. Making it order-sensitive violates the transitive property of equality as well as the Liskov substitution principle. It’s too late now to change this forcollections.OrderedDict
.But at least it’s not too late to learn from this. Hence OrderedBidict()'s __eq__() is order-insensitive, even for ordered bidicts. For an order-sensitive equality check, bidict provides the separate
equals_order_sensitive()
method, thanks to Raymond’s advice.See nan as a Key.
Better memory usage through __slots__
#
Using __slots__ speeds up attribute access, and can dramatically reduce memory usage in CPython when creating many instances of the same class.
As an example,
the Node
class used internally
(in the linked list that backs
OrderedBidictBase
)
uses slots for better performance at scale,
since there are as many node instances kept in memory
as there are items in every ordered bidict in memory.
See: _orderedbase.py
Note that extra care must be taken when using slots with pickling and weakrefs; see the code for more.
Better memory usage through weakref
#
A bidict
and its inverse use weakref
to
avoid creating a reference cycle.
As a result, when you drop your last reference to a bidict,
its memory is reclaimed immediately in CPython
rather than having to wait for the next garbage collection.
See: _base.py
As another example,
the Node
class used internally by
OrderedBidictBase
uses weakrefs to avoid creating reference cycles
in the doubly-linked lists used
to encode the ordering of inserted items.
See: _orderedbase.py
Using descriptors for managed attributes#
To abstract the details of creating and dereferencing
the weakrefs that OrderedBidictBase
's
aforementioned doubly-linked list nodes use
to refer to their neighbor nodes,
a WeakAttr
descriptor is used to
manage access to these attributes automatically.
See: _orderedbase.py
The implicit __class__
reference#
Anytime you have to reference the exact class of an instance
(and not a potential subclass) from within a method body,
you can use the implicit, lexically-scoped __class__
reference
rather than hard-coding the current class’s name.
See: https://docs.python.org/3/reference/datamodel.html#executing-the-class-body
Subclassing namedtuple
classes#
To get the performance benefits, intrinsic sortability, etc.
of NamedTuple
(or namedtuple()
)
while customizing behavior, API, etc.,
you can subclass.
See the OnDup class in _dup.py for an example.
Here’s another example:
>>> from collections import namedtuple
>>> from itertools import count
>>> class Node(namedtuple('_Node', 'cost tiebreaker data parent depth')):
... """Represent nodes in a graph traversal. Suitable for use with e.g. heapq."""
...
... __slots__ = ()
... _counter = count() # break ties between equal-cost nodes, avoid comparing data
...
... # Give call sites a cleaner API for creating new Nodes
... def __new__(cls, cost, data, parent=None):
... tiebreaker = next(cls._counter)
... depth = parent.depth + 1 if parent else 0
... return super().__new__(cls, cost, tiebreaker, data, parent, depth)
...
... def __repr__(self):
... return 'Node(cost={cost}, data={data!r})'.format(**self._asdict())
>>> start = Node(cost=0, data='foo')
>>> child = Node(cost=5, data='bar', parent=start)
>>> child
Node(cost=5, data='bar')
>>> child.parent
Node(cost=0, data='foo')
>>> child.depth
1
namedtuple
-style dynamic class generation#
See the implementation
of namedbidict
(it was since removed due to low usage).
API Design#
How to deeply integrate with Python’s collections
and other built-in APIs?
Beyond implementing
collections.abc.Mapping
, bidicts implement additional APIs thatdict
andOrderedDict
implement (e.g.setdefault()
,popitem()
, etc.).When creating a new API, making it familiar, memorable, and intuitive is hugely important to a good user experience.
Thanks to
Hashable
’s implementingabc.ABCMeta.__subclasshook__()
, any class that implements the required methods of theHashable
interface (namely,__hash__()
) makes it a virtual subclass already, no need to explicitly extend. I.e. As long asFoo
implements a__hash__()
method,issubclass(Foo, Hashable)
will always be True, no need to explicitly subclass viaclass Foo(Hashable): ...
How to make your own open ABC like
Hashable
?What if you needed to derive from a second metaclass? Be careful to avoid “TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases”. See the great write-up in https://blog.ionelmc.ro/2015/02/09/understanding-python-metaclasses/.
collections.abc.Mapping
andcollections.abc.MutableMapping
don’t implement__subclasshook__()
, so you must either explicitly subclass them (in which case you inherit their concrete method implementations) or useabc.ABCMeta.register()
(to register as a virtual subclass without inheriting any of the implementation).Notice that Python provides
collections.abc.Reversible
but nocollections.abc.Ordered
orcollections.abc.OrderedMapping
. See: https://bugs.python.org/issue28912See the Zen of Python for how to make APIs Pythonic.
The following Zen of Python guidelines have been particularly influential for bidict: - “Errors should never pass silently. Unless explicitly silenced. - “In the face of ambiguity, refuse the temptation to guess.” - “Readability counts.” - “There should be one – and preferably only one – obvious way to do it.”
Python’s data model#
What happens when you implement a custom
__eq__()
? e.g. What’s the difference betweena == b
andb == a
when onlya
is an instance of your class? See the great write-up in https://eev.ee/blog/2012/03/24/python-faq-equality/ for the answer.Making an immutable type hashable (so it can be inserted into
dict
s andset
s): Must implement__hash__()
such thata == b ⇒ hash(a) == hash(b)
. See theobject.__hash__()
andobject.__eq__()
docs, and the implementation offrozenbidict
.Consider
frozenbidict
: its__eq__()
is order-insensitive. So all contained items must participate in the hash order-insensitively.Can use collections.abc.Set._hash which provides a pure Python implementation of the same hash algorithm used to hash
frozenset
s. (SinceItemsView
extendsSet
,bidict.frozenbidict.__hash__()
just callsItemsView(self)._hash()
.)See also https://bugs.python.org/issue46684
Unlike other attributes, if a class implements
__hash__()
, any subclasses of that class will not inherit it. It’s like Python implicitly adds__hash__ = None
to the body of every class that doesn’t explicitly define__hash__
. So if you do want a subclass to inherit a base class’s__hash__()
implementation, you have to set that manually, e.g. by adding__hash__ = BaseClass.__hash__
in the class body.This is consistent with the fact that
object
implements__hash__()
, but subclasses ofobject
that override__eq__()
are not hashable by default.
Overriding
object.__getattribute__()
for custom attribute lookup. See SortedBidict Recipes.Using
object.__getstate__()
,object.__setstate__()
, andobject.__reduce__()
to make an object pickleable that otherwise wouldn’t be, due to e.g. using weakrefs, as bidicts do (covered further below).
Portability#
CPython vs. PyPy (and other Python implementations)
gc / weakref
Hence
test_bidicts_freed_on_zero_refcount()
in test_properties.py is skipped outside CPython.primitives’ identities, nan, etc.
Python 2 vs. Python 3
As affects bidict, mostly
dict
API changes, but also functions likezip()
,map()
,filter()
, etc.__ne__()
fixed in Python 3Borrowing methods from other classes:
In Python 2, must grab the
.im_func
/__func__
attribute off the borrowed method to avoid gettingTypeError: unbound method ...() must be called with ... instance as first argument
Other interesting stuff in the standard library#
reprlib
andreprlib.recursive_repr()
(but not needed for bidict because there’s no way to insert a bidict into itself)
Tools#
See the Thanks page for some of the fantastic tools for software verification, performance, code quality, etc. that bidict has provided a great opportunity to learn and use.