Dataclasses in Python 3.7
TL;DR: Initially, I was interested in the dataclasses feature that was added to the Python 3.7 standard library. However, after a closer look, it’s not clear to me that dataclasses provide enough value. Third-party libraries such as attrs and pydantic offer considerably more. Moreover, by adding dataclasses to the standard library, Python now has 3 (or 4) overlapping mechanisms (namedtuple, typing.NamedTuple, “classic” classes, and dataclasses) for defining a core concept in the Python programming language.
In Python 3.7, dataclasses were added to the standard library. There’s a terrific talk by Raymond Hettinger that is a great introduction to the history and design of dataclasses.
A demonstration and comparison of the 3 different models is shown below.
For the demo, I’ll use a Person data model with an id (int), a name (str), and an optional favorite color (Optional[str]). The Person model will also have custom __eq__ and __hash__ methods that use only the Person’s id. I’m a fan of immutable classes, so the data model will be immutable where possible.
Here’s an example using the “classic” class style leveraging Python 3 type annotations.
Nothing particularly interesting for any Python dev. We do have to write a bit of boilerplate in the constructor; however, this also enables some convenient translation layers (e.g., converting a datetime string to a datetime instance). I’ve omitted the immutability aspect due to the boilerplate it would add.
from typing import Optional
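A minimal sketch of such a classic class, assuming equality and hashing are based only on id and skipping immutability as noted above. The exact method bodies are my own illustration:

```python
from typing import Optional


class Person:
    """Classic-style data container; equality and hashing use only `id`."""

    def __init__(self, id: int, name: str, favorite_color: Optional[str] = None) -> None:
        self.id = id
        self.name = name
        self.favorite_color = favorite_color

    def __repr__(self) -> str:
        return (
            f"Person(id={self.id!r}, name={self.name!r}, "
            f"favorite_color={self.favorite_color!r})"
        )

    def __eq__(self, other: object) -> bool:
        # Returning NotImplemented lets Python fall back for unrelated types.
        if not isinstance(other, Person):
            return NotImplemented
        return self.id == other.id

    def __hash__(self) -> int:
        return hash(self.id)
```

Two Person instances with the same id compare equal and collapse to a single entry in a set, regardless of name or favorite color.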
We can also write this using typing.NamedTuple (or an untyped version using the Python 2 style collections.namedtuple).
from typing import NamedTuple, Optional
Nothing too interesting here either. This approach does have well-understood downsides due to the tuple nature of the underlying design. Specific downsides include index access and comparison operators (e.g., __le__) that leverage the sortability of tuples and can introduce unexpected behavior.
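To make those downsides concrete, here is a small sketch of a NamedTuple-based Person. It keeps the default tuple equality (no custom __eq__) so that the tuple behavior is easy to see; the sample values are made up for illustration:

```python
from typing import NamedTuple, Optional


class Person(NamedTuple):
    id: int
    name: str
    favorite_color: Optional[str] = None


ralph = Person(1, "Ralph", "blue")
stephen = Person(2, "Stephen")

# Fields are also reachable by position, like any tuple.
assert ralph[1] == "Ralph"

# Ordering falls back to element-wise tuple comparison (id first, then name, ...).
assert ralph < stephen

# A Person even compares equal to a plain tuple holding the same values.
assert ralph == (1, "Ralph", "blue")

# Instances are immutable: assigning to a field raises AttributeError.
try:
    ralph.name = "Ralf"
except AttributeError:
    pass
```

Immutability comes for free, but so do ordering and positional access, whether or not they make sense for the model.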
Let’s use the new Python 3.7 dataclasses approach.
from dataclasses import dataclass, field
The dataclasses design is a declarative approach that yields a terse and clean interface for defining a class in Python.
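As a rough sketch of the declarative style, assuming frozen=True for immutability and compare=False to keep name and favorite color out of equality and hashing:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Person:
    id: int
    # compare=False excludes a field from the generated __eq__ and __hash__.
    name: str = field(compare=False)
    favorite_color: Optional[str] = field(default=None, compare=False)
```

With frozen=True and the default eq=True, a __hash__ is generated as well, so two instances with the same id are equal and hash the same, and assigning to a field raises dataclasses.FrozenInstanceError.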
It’s important to note that none of the 3 approaches in the standard lib supports any mechanism for type checking at runtime. The dataclasses API has a __post_init__ hook that can be used to add any custom validation (type checking or otherwise).
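A minimal sketch of what such a hook might look like for the Person model. The specific checks and error messages are my own choices; dataclasses does not derive any of this from the annotations:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Person:
    id: int
    name: str = field(compare=False)
    favorite_color: Optional[str] = field(default=None, compare=False)

    def __post_init__(self) -> None:
        # Called by the generated __init__ after all fields have been assigned.
        if not isinstance(self.id, int):
            raise TypeError(f"id must be an int, got {type(self.id).__name__}")
        if not isinstance(self.name, str) or not self.name:
            raise ValueError("name must be a non-empty str")
        if self.favorite_color is not None and not isinstance(self.favorite_color, str):
            raise TypeError("favorite_color must be a str or None")
```

Note that every check is written by hand for the whole class in one method, which is the main contrast with the per-field validators offered by third-party libraries.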
I think it’s useful to dig a bit deeper and understand the development requirements, or conversely, the non-goals of the dataclasses API. PEP 557 provides some key insights (for context, one of the libraries mentioned in the quote below is attrs).
Here are a few useful snippets from the PEP.
One main design goal of Data Classes is to support static type checkers. The use of PEP 526 syntax is one example of this, but so is the design of the fields() function and the @dataclass decorator. Due to their very dynamic nature, some of the libraries mentioned above are difficult to use with static type checkers.
It’s not clear to me if these “very dynamic” libraries still have an issue with popular static analysis tools such as mypy (as of mypy 0.570, attrs is supported). Nevertheless, it’s very clear from the PEP that there is a strong opinion that Python is a dynamic language and that adding types is a way to annotate functions with metadata for static analysis tools. Adding types is not for runtime type checking, nor is it meant to imply that type signatures are required.
Another useful snippet to provide context.
Data Classes are not, and are not intended to be, a replacement mechanism for all of the above libraries. But being in the standard library will allow many of the simpler use cases to instead leverage Data Classes. Many of the libraries listed have different feature sets, and will of course continue to exist and prosper.
Historically, Python has taken a very heavy batteries-included approach to the standard library. To put the size of the standard library into perspective, there’s a recent PEP 594 to remove “dead batteries” from the standard library. The list of packages to remove is quite interesting. This batteries-included approach motivates some questions about the standard library.
Does the Python standard library need to be any bigger?
Why can’t (or shouldn’t) dataclasses style or functionality be off-loaded to the community to develop libraries (e.g., attrs and traitlets)? Does a “half” or “minimal” solution really add enough value?
Does the Python standard lib need yet another competing mechanism to define a class or data container?
Are all of these packages in the standard lib really that useful in practice? For example, is the CSV parser used when pandas.read_csv exists?
Do these features in the standard library bog down the Python core team?
A recent talk at the Python Language Summit in May of 2019, “Batteries Included, But They’re Leaking” by Amber Brown, brought up some contentious ideas on the current state of the standard library (in practice, that ship sailed many moons ago and I’m not sure there’s a lot of constructive discussion to be had on this topic).
I don’t really have any solid answers to any of these questions. Two of the most popular Python libs, requests and numpy, are both outside of the standard library (for good reasons that may be orthogonal to the motivations for adding dataclasses to the standard library) and are thriving.
Without per-field validation hooks, among other features, I’m struggling to find the usefulness of dataclasses for Python developers, particularly in the context of current polished third-party libraries, such as attrs, pydantic, or traitlets.
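To illustrate the gap, here is a hand-rolled sketch of per-field validation on top of dataclasses. The "validator" metadata key and the non_empty helper are hypothetical conventions I made up for this example; dataclasses itself attaches no meaning to field metadata:

```python
from dataclasses import dataclass, field, fields
from typing import Any, Optional


def non_empty(field_name: str, value: Any) -> None:
    """Hypothetical per-field validator: reject empty values."""
    if not value:
        raise ValueError(f"{field_name} must not be empty")


@dataclass(frozen=True)
class Person:
    id: int
    # "validator" is an arbitrary metadata key chosen for this sketch.
    name: str = field(compare=False, metadata={"validator": non_empty})
    favorite_color: Optional[str] = field(default=None, compare=False)

    def __post_init__(self) -> None:
        # dataclasses ignores metadata, so each validator is dispatched by hand.
        for f in fields(self):
            validator = f.metadata.get("validator")
            if validator is not None:
                validator(f.name, getattr(self, f.name))
```

It works, but the dispatch loop is exactly the kind of plumbing that libraries with built-in validators already provide.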
For comparison, let’s take a look at the third-party libraries that have inspired dataclasses.
Third Party Libraries
Let’s take a quick look at two of the third-party libraries, attrs and pydantic, that inspired dataclasses. I believe both of these libraries are supported by mypy.
Attrs
Similar to the dataclasses approach, the types in attrs are only used as annotations (e.g., __annotations__) and are not enforced at runtime. However, the general validation mechanism trivially enables runtime type validation. For the purposes of this example, let’s also add custom validation on the name of the Person, as well as type checking.
from typing import Optional
The abstract of Raymond Hettinger’s talk from PyCon has an interesting summary of dataclasses.
Dataclasses are shown to be the next step in a progression of data aggregation tools: tuple, dict, simple class, bunch recipe, named tuples, records, attrs, and then dataclasses. Each builds upon the one that came before, adding expressiveness at the expense of complexity.
I’m not sure I completely agree. The dataclasses implementation looks closer to attrs-lite than the “next step of progression”.
Pydantic
Another alternative is pydantic, which has a somewhat more opinionated design. It also has a nice Schema abstraction to communicate core metadata on the fields, as well as first-class support for serialization hooks. The pydantic library also has a dataclasses wrapper layer that can be accessed via pydantic.dataclasses.
Here’s an example of defining our Person data model.
from typing import Optional
Overall, I like the general style and approach; however, it does have a few quirks. Specifically, the keyword-only usage as well as the unexpected casting of ints to strs.
The pydantic API also supports rich metadata that could be useful for generating command-line interfaces for a given data model and for emitting JSON Schema.
from pydantic import BaseModel, validator, ValidationError, Schema
Summary And Conclusion
- dataclasses offers a terse syntax for defining a class or data container that has type annotations using a code generation approach
- dataclasses field metadata can be used to define defaults and to communicate which fields should be used in __eq__, __hash__, __le__, etc.
- dataclasses has a __post_init__ hook that can be used for validation
- dataclasses by design does not do type validation. It only adds __annotations__ to the data container for static analyzers, such as mypy, to consume
- Since dataclasses is now in the standard lib, feature enhancements, bug fixes, and backwards compatibility are now coupled to the official Python release process
- Raymond’s PyCon talk mentions that the end-to-end development time on dataclasses was 200+ hrs
Initially, I was intrigued by the addition of dataclasses to the standard library. However, after a deeper dive into dataclasses, it’s not clear to me that they are particularly useful for Python developers. I believe third-party solutions such as attrs or pydantic might be a better fit due to their validation hooks and richer feature sets. It will be interesting to see the adoption of dataclasses by both the Python core team and third-party developers.
For a deeper look and comparison of the 3 (or 4) models to define a class or data container in Python, please consult the Notebook in this Gist.
Best on all your Python-ing!