Dataclasses in Python 3.7
TL;DR: Initially, I was interested in the dataclasses feature that was added to the Python 3.7 standard library. However, after a closer look, it’s not clear to me whether dataclasses provide enough value. Third-party libraries such as attrs and pydantic offer considerably more. Moreover, by adding dataclasses to the standard library, Python now has 3 (or 4) overlapping mechanisms (namedtuple, typing.NamedTuple, “classic” classes, and dataclasses) for defining a core concept in the Python programming language.
A demonstration and comparison of the 3 different models is shown below.
For the demo, I’ll use a Person data model with an id (int), a name (str), and an optional favorite color (Optional[str]). The Person model will also have a custom hash using only the id. I’m a fan of immutable classes, so the container data model will be immutable if possible.
Here’s an example using the “classic” class style leveraging Python 3 type annotations.
Nothing particularly interesting for any Python dev. We do have to write a bit of boilerplate in the constructor; however, this also enables some convenience translation layers (e.g., converting a datetime string to a datetime instance). I’ve omitted the immutability aspect due to the boilerplate that would be added.
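A minimal sketch of the classic style might look like the following. This is an illustration only: the `__eq__` and `__repr__` methods and the `favorite_color` attribute name are assumptions I've added so that the id-only hash stays consistent with equality and instances print readably.

```python
from typing import Optional


class Person:
    """'Classic' class version of the Person model (illustrative sketch)."""

    def __init__(self, id: int, name: str, favorite_color: Optional[str] = None) -> None:
        # Constructor boilerplate: every field is assigned by hand.
        self.id = id
        self.name = name
        self.favorite_color = favorite_color

    def __hash__(self) -> int:
        # Hash on the id only.
        return hash(self.id)

    def __eq__(self, other: object) -> bool:
        # Assumption: equality also uses only the id, to match the hash.
        if not isinstance(other, Person):
            return NotImplemented
        return self.id == other.id

    def __repr__(self) -> str:
        return (
            f"Person(id={self.id!r}, name={self.name!r}, "
            f"favorite_color={self.favorite_color!r})"
        )
```

Nothing here prevents `person.name = "other"` after construction; making it immutable would mean adding properties or a custom `__setattr__`.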
We can also write this using the
typing.NamedTuple (or an untyped version using the Python 2 style
Nothing too interesting here either. This approach does have well-understood downsides due to the tuple nature of the underlying design. Specific downsides include index access and comparison operators (e.g.,
__le__) which leverage the sortability of tuples and can potentially introduce unexpected behaviors.
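To make the tuple-related downsides concrete, here is a small sketch using typing.NamedTuple. The field name `favorite_color` and the sample values are assumptions for illustration.

```python
from typing import NamedTuple, Optional


class Person(NamedTuple):
    id: int
    name: str
    favorite_color: Optional[str] = None

    def __hash__(self) -> int:
        # Hash on the id only; tuple equality still compares every field.
        return hash(self.id)


alice = Person(1, "Alice")
bob = Person(2, "Bob", "blue")

# Index access works because a Person is still a tuple.
second_item = alice[1]

# Ordering is inherited from tuple: fields are compared left to right.
alice_sorts_first = alice < bob

# A Person compares equal to any plain tuple holding the same values.
equals_plain_tuple = alice == (1, "Alice", None)
```

All three of these expressions succeed without any opt-in from the class author, which is rarely what you want from a domain model. On the plus side, instances are immutable for free: assigning to `alice.name` raises an AttributeError.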
Let’s use the new Python 3.7 dataclasses approach.
The dataclasses design is a declarative approach that yields a terse and clean interface for defining a class in Python.
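A minimal sketch of the same Person model as a frozen dataclass could look like this. The `favorite_color` name is an assumption; the explicit `__hash__` is kept by the decorator because it is defined in the class body.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Person:
    id: int
    name: str
    favorite_color: Optional[str] = None

    def __hash__(self) -> int:
        # Explicit hash on the id only; the generated __eq__ still
        # compares all fields.
        return hash(self.id)


person = Person(1, "Ada")
```

With `frozen=True`, assigning to `person.name` raises `dataclasses.FrozenInstanceError`, and `__init__`, `__repr__`, and `__eq__` are generated from the annotations.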
It’s important to note that none of the 3 approaches in the standard lib supports any mechanism for type checking at runtime. The dataclasses API has a __post_init__ hook that can be used to add any custom validation (type checking or otherwise).
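As a rough illustration of both points, the sketch below first shows that annotations alone are not enforced, then adds hand-written checks in `__post_init__`. The class name `UncheckedPerson` and the specific validation rules (such as rejecting an empty name) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UncheckedPerson:
    id: int
    name: str


# No error is raised: the annotations are not enforced at runtime.
unchecked = UncheckedPerson(id="not-an-int", name=123)


@dataclass(frozen=True)
class Person:
    id: int
    name: str
    favorite_color: Optional[str] = None

    def __post_init__(self) -> None:
        # Runs after the generated __init__; reading fields is fine
        # even on a frozen instance.
        if not isinstance(self.id, int):
            raise TypeError(f"id must be an int, got {type(self.id).__name__}")
        if not isinstance(self.name, str):
            raise TypeError(f"name must be a str, got {type(self.name).__name__}")
        if not self.name:
            raise ValueError("name must not be empty")
        if self.favorite_color is not None and not isinstance(self.favorite_color, str):
            raise TypeError("favorite_color must be a str or None")
```

Note that every check lives in one method, so the validation logic is separated from the field it applies to.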
I think it’s useful to dig a bit deeper and understand the development requirements, or conversely, the non-goals of the
dataclasses API. PEP-557 provides some key insights (for context, one of the libraries mentioned in the quote below is
Here are a few useful snippets from the PEP.
One main design goal of Data Classes is to support static type checkers. The use of PEP 526 syntax is one example of this, but so is the design of the fields() function and the @dataclass decorator. Due to their very dynamic nature, some of the libraries mentioned above are difficult to use with static type checkers.
It’s not clear to me if these “very dynamic” libraries still have an issue with popular static analysis tools such as
mypy (as of
attrs is supported). Nevertheless, it’s very clear from the PEP that there is a strong opinion that Python is a dynamic language and that the purpose of adding types is to annotate functions with metadata for static analysis tools. Adding types is not for runtime type checking, nor is it meant to imply that type signatures are required.
Another useful snippet to provide context.
Data Classes are not, and are not intended to be, a replacement mechanism for all of the above libraries. But being in the standard library will allow many of the simpler use cases to instead leverage Data Classes. Many of the libraries listed have different feature sets, and will of course continue to exist and prosper.
Historically, Python has taken a heavily batteries-included approach to the standard library. To put the size of the standard library into perspective, there’s a recent PEP-594 to remove “dead batteries” from the standard library. The list of packages to remove is quite interesting. This batteries-included approach motivates questions about the standard library.
Does the Python standard library need to be any bigger?
Why can’t (or shouldn’t)
dataclasses style or functionality be off-loaded to community-developed libraries (e.g.,
traitlets)? Does a “half” or “minimal” solution really add enough value?
Does the Python standard lib need yet another competing mechanism to define a class or data container?
Are all of these packages in the standard lib really that useful in practice? For example, is the CSV parser used when
Do these features in the standard library bog down the Python core team?
A recent talk at the Python Language Summit in May of 2019, “Batteries included, But They’re Leaking” by Amber Brown, brought up some contentious ideas on the current state of the standard library (in practice, that ship sailed many moons ago, and I’m not sure there’s a lot of constructive discussion that can be had on this topic).
I don’t really have any solid answers to any of these questions. Two of the most popular Python libs,
numpy are both outside of the standard library (for good reasons that may be orthogonal to motivations of adding
dataclasses to the standard library) and are thriving.
Without per-field validation hooks, among other features, I’m struggling to find the usefulness of
dataclasses for Python developers, particularly in the context of current polished third-party libraries, such as
For comparison, let’s take a look at the third-party libraries that have inspired
Similar to the
dataclasses approach, the types in
attrs are only used as annotations (e.g.,
__annotations__) and are not enforced at runtime. However, the general validation mechanism trivially enables runtime type validation. For the purposes of this example, let’s also add custom validation on the
name of the
Person as well as adding type checking.
The abstract of Raymond Hettinger’s talk from PyCon has an interesting summary of
Dataclasses are shown to be the next step in a progression of data aggregation tools: tuple, dict, simple class, bunch recipe, named tuples, records, attrs, and then dataclasses. Each builds upon the one that came before, adding expressiveness at the expense of complexity.
I’m not sure I completely agree. The dataclasses implementation looks closer to attrs-lite than the “next step of progression”.
Another alternative is pydantic, which has a somewhat more opinionated design. It also has a nice Schema abstraction to communicate core metadata on the fields, as well as first-class support for serialization hooks. The
pydantic library also has a
dataclasses wrapper layer that can be accessed via
Here’s an example of defining our
Person data model.
Overall, I like the general style and approach; however, it does have a few quirks. Specifically, the keyword-only usage as well as the unexpected casting behavior of
The pydantic API also supports rich metadata that could be useful for generating command-line interfaces for a given schema data model and for emitting JSON Schema.
- dataclasses offers a terse syntax for defining a class or data container that has type annotations using a code generation approach
- field metadata can be used to define defaults, communicate which fields should be used in
- __post_init__ hook that can be used for validation
- dataclasses by design does not do type validation. It only adds __annotations__ to the data container for static analyzers to consume, such as
- dataclasses is now in the standard lib; this means feature enhancements, bug fixes, and backwards compatibility are now coupled to the official Python release process
- Raymond’s PyCon talk mentions the end-to-end development time on dataclasses was 200+ hrs
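As a rough illustration of the field metadata point in the list above, the sketch below uses `field()` to set a default, to keep fields out of the generated hash, and to attach free-form metadata. The `"description"` key and the `favorite_color` name are assumptions for the example; dataclasses itself ignores the contents of `metadata`.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass(frozen=True)
class Person:
    id: int
    # hash=False keeps this field out of the generated __hash__;
    # it still takes part in the generated __eq__.
    name: str = field(hash=False, metadata={"description": "Display name"})
    favorite_color: Optional[str] = field(default=None, hash=False)


# fields() exposes the per-field configuration for introspection.
descriptions = {f.name: f.metadata.get("description") for f in fields(Person)}

ada = Person(1, "Ada")
grace = Person(1, "Grace")
same_hash = hash(ada) == hash(grace)  # only the id feeds the hash
```

This gives an id-only hash without writing `__hash__` by hand, but there is still no per-field slot for a validator.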
Initially, I was intrigued by the addition of dataclasses to the standard library. However, after a deeper dive into dataclasses, it’s not clear to me that these are particularly useful for Python developers. I believe third-party solutions such as
pydantic might be a better fit due to their validation hooks and richer feature sets. It will be interesting to see the adoption of dataclasses by both the Python core and third-party developers.
For a deeper look and comparison of the 3 (or 4) models to define a class or data container in Python, please consult this Notebook in this Gist
Best on all your Python-ing!