Exploring TypedDict in Python 3.8
This post explores the new `TypedDict` feature in Python and how to leverage `TypedDict` combined with the static analysis tool `mypy` to improve the robustness of your Python code.
PEP-589
`TypedDict` was proposed in PEP-589 and accepted for Python 3.8. A few key quotes from PEP-589 provide context and motivation for the problem that `TypedDict` is attempting to address.
This PEP proposes a type constructor typing.TypedDict to support the use case where a dictionary object has a specific set of string keys, each with a value of a specific type.
More generally, representing pure data objects using only Python primitive types such as dictionaries, strings and lists has had certain appeal. They are easy to serialize and deserialize even when not using JSON. They trivially support various useful operations with no extra effort, including pretty-printing (through str() and the pprint module), iteration, and equality comparisons.
This particular section of the PEP is interesting and suggests that `TypedDict` can be particularly useful for retrofitting legacy code (with type annotations).
Dataclasses are a more recent alternative to solve this use case, but there is still a lot of existing code that was written before dataclasses became available, especially in large existing codebases where type hinting and checking has proven to be helpful. Unlike dictionary objects, dataclasses don’t directly support JSON serialization, though there is a third-party package that implements it
The reference implementation is defined in `mypy_extensions` and can be installed for Python 3.7 (e.g., `pip install mypy_extensions`); in Python 3.8, use `typing.TypedDict` directly.
The following examples were run with mypy 0.711, and the code shown below can be obtained from this gist.
Motivation: Dictionary-Mania
Here's a common example where a type-checking tool (e.g., `mypy`) won't be able to help you catch type errors in your code. The block below is a minimal sketch of the idea (the field names and values are illustrative):

```python
def example_0() -> int:
    # A plain dict has no declared structure, so mypy can't help here.
    movie = {"title": "Blade Runner", "year": 1982}
    movie["yearr"] = "1982"  # typo'd key and wrong value type: mypy stays silent
    return 0
```
However, with `TypedDict`, you can define a structural-typing-ish interface to `dict` for a specific data model. Python < 3.8 requires `from mypy_extensions import TypedDict`, whereas Python >= 3.8 uses `from typing import TypedDict`.
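A version-agnostic import can smooth this over; this is my own sketch, not from the original post:

```python
try:
    from typing import TypedDict  # Python >= 3.8
except ImportError:
    from mypy_extensions import TypedDict  # Python 3.7 (pip install mypy_extensions)
```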
Let's create a simple `Movie` data model example and explore how `mypy` can be used to help catch type errors.
Example 1: Basic Usage of TypedDict
A minimal definition looks something like this (the exact fields beyond the class name are illustrative):

```python
class Movie(TypedDict):
    title: str
    year: int
```
To keep code that purposely contains errors (for `mypy` to catch) runnable, let's define a helper function that requires a specific `Exception` type to be raised. A sketch of the helper:
```python
import logging

log = logging.getLogger(__name__)

def require_error(ex_type, fn, *args, **kwargs):
    # Call fn and require that it raises ex_type.
    try:
        fn(*args, **kwargs)
    except ex_type as ex:
        log.info("Got expected error %r", ex)
        return
    raise AssertionError(f"Expected {ex_type.__name__} was not raised")
```
Example 2: Exploring Mutating and Assignment of TypedDicts
Let's mutate the `Movie` `TypedDict` instance and explore how `mypy` can catch type errors during assignment.
```python
def example_02() -> int:
    m = Movie(title="Blade Runner", year=1982)

    m["year"] = 2017      # ok
    m["year"] = "2017"    # mypy error: value must be an int
    m["extra"] = 1        # mypy error: "extra" is not a valid key

    m2: Movie = {"title": "Arrival", "year": 2016}
    m.update(m2)          # ok: merging the same TypedDict type
    return 0
```
There are a few interesting items to note.

- `mypy` will catch assignment errors
- The current version of `mypy` will get a bit confused with `dict` methods, such as `.clear()`. Moreover, `.clear()` will also yield `KeyError`s (related, see the `total=False` keyword of `TypedDict`)
- `mypy` will only allow merging dicts that are the same type. You can't mix a `TypedDict` and a raw dict without `mypy` raising an issue (sketched below)
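Here's my own minimal illustration of that last point:

```python
def merge_movie(m: Movie) -> Movie:
    extra = {"year": 2049}  # inferred as Dict[str, int], not Movie
    # mypy flags this merge: a plain dict is not compatible with Movie
    return {**m, **extra}
```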
Example 3: TypedDict's total Keyword Argument
There's a `total` keyword to `TypedDict` that communicates that the dict does not need to be completely well formed. This is particularly interesting in how `mypy` interprets the types.
For example, `X` with `alpha`, `beta`, and `gamma` as `int`s will be:
```python
class X(TypedDict, total=False):
    alpha: int
    beta: int
    gamma: int
```
Let's dive deeper using a variation of the previously defined `Movie` example, using `total=False` to explore how `mypy` interprets the 'incomplete' data model.
```python
class Movie2(TypedDict, total=False):
    title: str
    year: int

def example_03() -> int:
    m: Movie2 = {"title": "Blade Runner"}  # ok: with total=False, year may be omitted
    # mypy treats m["year"] as an int, but at runtime this raises a KeyError
    year = m["year"]
    return 0
```
Finally, let's explore how `isinstance` works with `TypedDict`.
Example 4: TypedDict and isinstance
```python
def example_04() -> int:
    m = Movie(title="Blade Runner", year=1982)
    # raises TypeError at runtime: TypedDict doesn't support isinstance checks
    isinstance(m, Movie)
    return 0
```
The important item to note here is that you can NOT use `isinstance` with `TypedDict`; Python will raise a runtime `TypeError`. Specifically, the error you'll see is shown below.
```
TypeError: TypedDict does not support instance and class checks
```
Summary
- `TypedDict` + `mypy` can be valuable to help catch type errors in Python and can help with heavy dictionary-centric interfaces in your application or library
- `TypedDict` can be used in Python 3.7 using the `mypy_extensions` package
- `TypedDict` can be used in Python 2.7 using `mypy_extensions` and the 2.7 'namedtuple-esque' syntax style (e.g., `Movie = TypedDict('Movie', {'title': str, 'year': int})`)
- Using the `total=False` keyword to `TypedDict` can introduce large holes in the static typechecking process, yielding `KeyError`s. The keyword `total=False` should be used judiciously (if at all)
- `isinstance` should not be used with `TypedDict` as it will raise a runtime `TypeError` exception
- Be mindful when using `TypedDict` methods such as `.clear()`
- `TypedDict` introduces a new (somewhat) competing data modeling alternative to dataclasses, `typing.NamedTuple`, "classic" classes, and third-party libraries such as pydantic and attrs. It's not completely clear to me how all these competing data model abstractions are going to age gracefully
I believe `TypedDict` can be a valuable tool to help improve the clarity of interfaces, specifically in legacy code that is a bit dictionary-mania heavy. However, for new code, I would suggest avoiding `TypedDict` in favor of thin data models, such as pydantic and attrs.
Best to you and your Python-ing.
P.S. A runnable form of the code used in the post can be found in this gist.
Python Dashboards with Panel: Kicking the Tires
PyViz recently published the first official release (0.6.0) of Panel. Overall, I'm digging the iterative development model of building dashboard components within Jupyter lab/notebook.
Here's an example notebook that demonstrates creating a few dashboard components in Panel. The raw notebook can be launched using mybinder.org.
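To give a flavor of the API, here's a minimal component of my own sketching (not from the notebook); the widget names and layout are illustrative:

```python
import panel as pn

pn.extension()

slider = pn.widgets.IntSlider(name="n", start=1, end=10, value=3)

@pn.depends(slider.param.value)
def view(n):
    # return any object Panel knows how to render
    return f"slider value = {n}"

dashboard = pn.Column(slider, view)
dashboard.servable()  # serve via `panel serve app.py`, or use .show() locally
```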
Python 3.8.0b1 Positional-Only Arguments and the Walrus Operator
On June 4th, 2019, Python 3.8.0b1 was released. The official changelog is here.
There are two interesting syntactic changes/features that were added which I believe are useful to explore in some depth: specifically, the new "walrus" `:=` operator and the new Positional-Only function parameter feature.
Walrus
First, the "walrus" expression operator (`:=`) defined in PEP-572:
…naming sub-parts of a large expression can assist an interactive debugger, providing useful display hooks and partial results. Without a way to capture sub-expressions inline, this would require refactoring of the original code; with assignment expressions, this merely requires the insertion of a few name := markers. Removing the need to refactor reduces the likelihood that the code be inadvertently changed as part of debugging (a common cause of Heisenbugs), and is easier to dictate to another programmer.
A (contrived) example using Python 3.8.0b1 built from source "3.8.0b1+ (heads/3.8:23f41a64ea)"; the block below is an illustrative sketch:

```python
xs = list(range(0, 10))

# without the walrus operator: assign, then test
n = len(xs)
if n > 5:
    print(f"List is too long ({n} elements)")

# with the walrus operator: assign within the expression
if (n := len(xs)) > 5:
    print(f"List is too long ({n} elements)")
```
The idea is to use an expression-based approach to remove unnecessary chatter and the potential bugs of storing local state.
Another simple example (an illustrative sketch):

```python
# Python 3.8
import re

rx = re.compile(r"\d+")

# pre-walrus: the match object must be assigned in a separate statement
m = rx.match("1234")
if m is not None:
    print(m.group(0))

# with the walrus operator, assignment happens within the test
if (m := rx.match("1234")) is not None:
    print(m.group(0))
```
As a side note, many of these "None-ish" based examples in the PEP (somewhat mechanically) look like `map`, `flatMap`, or `foreach` on `Option[T]` cases in Scala.
Python doesn't really do this well due to its inside-out nature of composing maps/filters/generators (versus a left-to-right model). Nevertheless, here's a sketch to demonstrate the general idea using a functional-centric approach.

```python
# rx is the compiled regex from the previous example
def processor(sx):
    return rx.match(sx)

items = ["1234", "alpha", "5678"]

# map/filter composed inside-out, rather than left to right
results = [m.group(0) for m in map(processor, items) if m is not None]
print(results)  # ['1234', '5678']
```
The "Exceptional cases" described in the PEP are worth investigating in more detail. There are several cases where "Valid, though probably confusing" is used.
For example:
```python
y := f(x)    # INVALID: statement-level use must be parenthesized
(y := f(x))  # Valid, though not recommended
```
Note that the “walrus” operator can also be used in function definitions.
```python
def foo(answer = p := 42):  # INVALID
    ...

def foo(answer=(p := 42)):  # Valid, though probably confusing
    ...
```
Positional Only Args
The other interesting feature added to Python 3.8 is Positional-Only arguments in function definitions.
For as long as I can recall, Python has had this fundamental feature (or bug) in how functions or methods are called.
For example:

```python
def f(x, y=1234):
    return x + y

# all of these call styles are valid and equivalent
f(1)
f(1, 2)
f(1, y=2)
f(x=1, y=2)
```
Often this fundamental ambiguity of function call "style" isn't really a big deal. However, it can leak local variable names as part of the public interface. As a result, minor variable renaming can potentially break interfaces. It's also not clear what should be a keyword-only argument or a positional-only argument with a default. For example, simply changing `f(x, y=1234)` to `f(n, y=1234)` can potentially break the interface depending on the call "style".
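For example (an illustrative sketch):

```python
def f(n, y=1234):  # x renamed to n
    return n + y

# f(x=1, y=2)  # TypeError: f() got an unexpected keyword argument 'x'
```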
I've worked with a few developers over the years who viewed this as a feature and thought that this style made the API calls more explicit. For example:

```python
def compute(alpha, beta, gamma):
    return alpha + beta - gamma

# keyword-style calls of positional args, for explicitness
compute(alpha=90, beta=120, gamma=70)
```
I never really liked this looseness of positional vs. keyword and would (if possible) try to discourage its use. Regardless, it can be argued this is a feature of the language (at least in Python <= 3.7). I would guess that many Python developers are leveraging the dict-unpacking style to call functions as well.

```python
d = dict(alpha=90, gamma=70, beta=120)
compute(**d)
```
In Python 3.0, function definitions using Keyword-Only arguments were added (see PEP-3102 from 2006) using the `*` delimiter. All arguments to the right of the `*` are Keyword-Only arguments.

```python
def f(a, b, *, c=1234):
    return a + b + c

f(1, 2, c=3)     # ok
# f(1, 2, 3)     # TypeError: c is keyword-only
```
Unfortunately, this still leaves a fundamental issue with clearly defining function arguments. There are three cases: Positional-Only, Positional or Keyword, and Keyword-Only. PEP-3102 only solves the Keyword-Only case and doesn’t address the other two cases.
Hence in Python < 3.8, there's still a fundamental ambiguity when defining a function and how it can be called.
For example (an illustrative sketch):

```python
# pre-3.8: there is no way to declare that x is positional-only
def g(x, y=1234):
    return x + y

g(1, 2)      # positional style
g(x=1, y=2)  # keyword style; renaming x would break this caller
```
Starting with Python 3.8.0, a Positional-Only parameters mechanism was added. The details are captured in PEP-570.
Similar to the `*` delimiter added in Python 3.0 for Keyword-Only args, a `/` delimiter was added to clearly delineate Positional-Only (or conversely Keyword-Only) args in function or method definitions. This makes the three cases of function arguments unambiguous in how they should be called.
Here are a few examples (sketches of the different delimiter placements):

```python
def f0(a, b, c=1234, /):
    # every parameter is positional-only
    return a + b + c

def f1(a, b, /, c=1234):
    # a and b are positional-only; c can be positional or keyword
    return a + b + c

def f2(a, /, b, *, c=1234):
    # a is positional-only, b is either, c is keyword-only
    return a + b + c
```
Combining the `/` and `*` with type annotations yields:

```python
def f4(a: int, b: int, /, *, c: int = 1234) -> int:
    return a + b + c
```
We can dive a bit deeper and inspect the function signature via `inspect`.

```python
import inspect

sig = inspect.signature(f4)
print(sig)  # (a: int, b: int, /, *, c: int = 1234) -> int
```
Let’s inspect each example:
```python
def pf(f):
    # print the function name with its rendered signature
    print(f"{f.__name__}{inspect.signature(f)}")
```
Running in the REPL yields:
```python
funcs = (f0, f1, f2, f2, f2, f4)
for func in funcs:
    pf(func)
```
Note, you can use `bind` to bind arguments to the signature and call the func (the bound arguments must also adhere to the correct positional/keyword constraints in the function of interest).
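For example (a sketch):

```python
sig = inspect.signature(f4)
ba = sig.bind(1, 2, c=3)     # ok
f4(*ba.args, **ba.kwargs)

# sig.bind(a=1, b=2)  # TypeError: 'a' is positional-only
```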
Also, it's worth noting that both scipy and numpy have been using this `/` style in their docs for some time.
When Should I Start Adopting these Features?
If you're a library developer that has packages on PyPI, it might not be clear when it's "safe" to start leveraging these features. I was only able to find one source of Python 3 adoption data, and as a result, I'm only able to outline a very crude model.
On December 23, 2016, Python 3.6 was officially released. In the Fall of 2018, JetBrains released the Python Developer Survey, which contains the Python 2/3 breakdown as well as the breakdown of different versions within Python 3. As of Fall 2018, 54% of Python 3 developers were using Python 3.6.x. Therefore, using this very crude model, if you assume that the rates of adoption of 3.6 and 3.8 are the same, and if the minimum threshold of adoption of 3.8 is 54%, then you'll need to wait approximately 2 years before starting to leverage these 3.8 features.
When you do plan to leverage these 3.8-specific features and push a package to the Python Package Index, I would suggest clearly defining the supported Python version in `setup.py`. See the official packaging docs for more details.
```python
# setup.py
from setuptools import setup

setup(
    name="my-package",        # illustrative package name
    version="0.1.0",
    python_requires=">=3.8",  # communicate the minimum supported Python version
)
```
Summary and Conclusion
- Python 3.8 added the "walrus" operator `:=` that enables the results of expressions to be assigned and used inline
- It's recommended to read the "Exceptional cases" section of PEP-572 for a better understanding of where to (and not to) use the `:=` operator
- Python 3.8 added Positional-Only function definitions using the `/` delimiter
- Defining functions with Positional-Only arguments will require a trailing `/` in the definition. E.g., `def adder(n, m, /): return 0`
- There are changes in the standard lib to communicate. It's not clear how locked down or backwards compatible the interface changes are. Here's a random example of the function signature of functools.partial being updated to use `/`
- Positional-Only arguments should improve consistency of API calls across Python runtimes (e.g., CPython and PyPy)
- The Positional-Only PEP-570 outlines improvements in performance; however, I wasn't able to find any performance studies on this topic
- Migrating to 3.8 might involve potentially breaking API changes based on the usage of `/` in the Python 3.8 standard lib
- For authors of core libraries on PyPI, I would recommend using the crude approximation (described above) of roughly 2 years before being able to adopt the new 3.8 features
- For `mypy` users, you might want to make sure you investigate the supported versions of Python 3.8 (more details in the compatibility matrix)
I understand the general motivation to solve core friction points or ambiguities at the language level; however, the new syntactic changes are a little too noisy for my tastes, specifically the Positional-Only `/` combined with the `*` and type annotations. Regardless, the (Python 3.8) ship has sailed long ago. I would encourage the Python community to periodically track and provide feedback on the current PEPs to help guide the evolution of the Python programming language. And finally, Python 3.8.0 (beta and future 3.8 RCs) bugs should be filed at https://bugs.python.org.
Best to you and your Python-ing!
Further Reading
- Full Python 3.8.0b1 release notes
- Hacker News discussion on Positional-Only feature
- Alexander Hutner covering walrus feature
- Bug Tracker
- Python 3.8 ChangeLog
- Python PEPs
P.S. A reminder that the PSF has a Q2 2019 fundraiser that ends June 30th.
Dataclasses in Python 3.7
TL;DR: Initially, I was interested in the `dataclasses` feature that was added to the Python 3.7 standard library. However, after a closer look, it's not clear to me if `dataclasses` provides enough value. Third-party libraries such as attrs and pydantic offer considerably more value. Moreover, by adding `dataclasses` to the standard library, Python now has 3 (or 4) overlapping mechanisms (namedtuple, `typing.NamedTuple`, "classic" classes, and dataclasses) for defining a core concept in the Python programming language.
In Python 3.7, dataclasses were added to the standard library. There's a terrific talk by Raymond Hettinger that is a great introduction to the history and design of `dataclasses`.
A demonstration and comparison of the 3 different models is shown below.
For the demo, I'll use a `Person` data model with an id (`int`), a name (`str`), and an optional favorite color (`Optional[str]`). The `Person` model will also have custom `eq` and `hash` implementations using only the `Person`'s `id`. I'm a fan of immutable classes, so the container data model will be immutable if possible.
Here's an example using the "classic" class style leveraging Python 3 type annotations (the block below is a sketch of the approach). Nothing particularly interesting for any Python dev. We do have to write a bit of boilerplate in the constructor; however, this also enables some convenient translation layers (e.g., converting a datetime as a string to a datetime instance). I've omitted the immutability aspect due to the boilerplate that would be added.

```python
from typing import Optional

class Person:
    def __init__(self, id: int, name: str, favorite_color: Optional[str] = None):
        self.id = id
        self.name = name
        self.favorite_color = favorite_color

    def __repr__(self) -> str:
        return f"Person(id={self.id}, name={self.name!r}, favorite_color={self.favorite_color!r})"

    def __eq__(self, other) -> bool:
        # equality (and hash) are defined only by the Person's id
        return isinstance(other, Person) and self.id == other.id

    def __hash__(self) -> int:
        return hash(self.id)
```
We can also write this using `typing.NamedTuple` (or an untyped version using the Python 2 style `collections.namedtuple`); again, a sketch:

```python
from typing import NamedTuple, Optional

class Person(NamedTuple):
    id: int
    name: str
    favorite_color: Optional[str] = None

    def __eq__(self, other) -> bool:
        return isinstance(other, Person) and self.id == other.id

    def __hash__(self) -> int:
        return hash(self.id)
```
Nothing too interesting here either. This approach does have well-understood downsides due to the tuple nature of the underlying design. Specific downsides include index access and comparison operators (e.g., `__le__`) which leverage the sortability of tuples and can potentially introduce unexpected behaviors, as sketched below.
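A quick illustration of the tuple leakage (my own sketch):

```python
p1 = Person(1, "Ada")
p2 = Person(2, "Grace")

p1[0]    # index access works because Person is a tuple
p1 < p2  # tuple-wise ordering kicks in: potentially surprising
```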
Let’s use the new Python 3.7 dataclasses approach.
```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(frozen=True)  # frozen gives immutability and a generated __hash__
class Person:
    id: int                      # only id participates in eq/hash
    name: str = field(compare=False)
    favorite_color: Optional[str] = field(default=None, compare=False)
```
The `dataclasses` design is a declarative-centric approach, yielding a terse and clean interface to define a class in Python.
It's important to note that none of the 3 approaches in the standard lib support any mechanism for type checking at runtime. The `dataclasses` API has a `__post_init__` hook that can be used to add any custom validation (typechecking or otherwise).
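For example, a hand-rolled check might look like this (my sketch; the class name is hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TypedPoint:
    x: int

    def __post_init__(self):
        # dataclasses won't check types at runtime; do it by hand
        if not isinstance(self.x, int):
            raise TypeError(f"x must be an int, got {type(self.x).__name__}")
```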
I think it's useful to dig a bit deeper and understand the development requirements, or conversely, the non-goals of the `dataclasses` API. PEP-557 provides some key insights (for context, one of the libraries mentioned in the quote below is `attrs`).
Here are a few useful snippets from the PEP.
One main design goal of Data Classes is to support static type checkers. The use of PEP 526 syntax is one example of this, but so is the design of the fields() function and the @dataclass decorator. Due to their very dynamic nature, some of the libraries mentioned above are difficult to use with static type checkers.
It's not clear to me if these "very dynamic" libraries still have an issue with popular static analysis tools such as `mypy` (as of mypy 0.57, `attrs` is supported). Nevertheless, it's very clear from the PEP that there is a strong opinion that Python is a dynamic language and that adding types is to annotate functions with metadata for static analysis tools. Adding types is not for runtime type checking, nor is it to imply that type signatures are required.
Another useful snippet to provide context.
Data Classes are not, and are not intended to be, a replacement mechanism for all of the above libraries. But being in the standard library will allow many of the simpler use cases to instead leverage Data Classes. Many of the libraries listed have different feature sets, and will of course continue to exist and prosper.
Historically, Python has had a very heavy batteries-included approach to the standard library. To put the size of the standard library into perspective, there's a recent PEP-594 to remove "dead batteries" from the standard library. The list of packages to remove is quite interesting. This batteries-included approach motivates some questions about the standard library.
- Does the Python standard library need to be any bigger?
- Why can't (or shouldn't) `dataclasses`-style functionality be off-loaded to the community to develop as libraries (e.g., `attrs` and `traitlets`)? Is a "half" or "minimal" solution really adding enough value?
- Does the Python standard lib need yet another competing mechanism to define a class or data container?
- Are all of these packages in the standard lib really that useful in practice? For example, is the CSV parser used when `pandas.read_csv` exists?
- Do these features in the standard library bog down the Python core team?
A recent talk at the Python Language Summit in May of 2019, "Batteries Included, But They're Leaking" by Amber Brown, brought up some contentious ideas on the current state of the standard library (in practice, that ship has sailed many moons ago and I'm not sure there's a lot of constructive discussion that can be had on this topic).
I don't really have any solid answers to any of these questions. Two of the most popular Python libs, `requests` and `numpy`, are both outside of the standard library (for good reasons that may be orthogonal to the motivations for adding `dataclasses` to the standard library) and are thriving.
Without per-`field` validation hooks, among other features, I'm struggling to find the usefulness of `dataclasses` for Python developers, particularly in the context of currently polished third-party libraries such as `attrs`, `pydantic`, or `traitlets`.
For comparison, let's take a look at the third-party libraries that inspired `dataclasses`.
Third Party Libraries
Let's take a quick look into two of the third-party libraries, attrs and pydantic, that inspired `dataclasses`. I believe both of these libraries are supported by `mypy`.
Attrs
Similar to the `dataclasses` approach, the types in `attrs` are only used as annotations (e.g., `__annotations__`) and are not enforced at runtime. However, the general validation mechanism trivially enables runtime type validation. For the purposes of this example, let's also add custom validation on the `name` of the `Person` as well as adding type checking. A sketch of the approach:
```python
from typing import Optional

import attr


def _validate_name(instance, attribute, value):
    # custom per-field validator: require a non-empty name
    if not value.strip():
        raise ValueError(f"Invalid name '{value}'")


@attr.s(auto_attribs=True, frozen=True)
class Person:
    id: int = attr.ib(validator=attr.validators.instance_of(int))
    name: str = attr.ib(validator=[attr.validators.instance_of(str), _validate_name])
    favorite_color: Optional[str] = attr.ib(
        default=None,
        validator=attr.validators.optional(attr.validators.instance_of(str)),
    )


p = Person(id=1, name="Ada")
# Person(id=1, name="")  # raises ValueError at construction time
```
The abstract of Raymond Hettinger's talk from PyCon has an interesting summary of `dataclasses`.
Dataclasses are shown to be the next step in a progression of data aggregation tools: tuple, dict, simple class, bunch recipe, named tuples, records, attrs, and then dataclasses. Each builds upon the one that came before, adding expressiveness at the expense of complexity.
I'm not sure I completely agree. The `dataclasses` implementation looks closer to `attrs`-lite than the "next step of progression".
Pydantic
Another alternative is pydantic, which has a bit more opinionated design. It also has a nice `Schema` abstraction to communicate core metadata on the fields, as well as first-class support for serialization hooks. The pydantic library also has a `dataclasses` wrapper layer that can be accessed via `pydantic.dataclasses`.
Here's an example of defining our `Person` data model (sketched below).
```python
from typing import Optional

from pydantic import BaseModel, validator


class Person(BaseModel):
    id: int
    name: str
    favorite_color: Optional[str] = None

    @validator("name")
    def name_must_not_be_empty(cls, v):
        # custom per-field validation hook
        if not v.strip():
            raise ValueError(f"Invalid name '{v}'")
        return v


p = Person(id=1, name="Ada")  # note: pydantic models are keyword-only
```
Overall, I like the general style and approach; however, it does have a few quirks. Specifically, the keyword-only usage as well as the unexpected casting behavior of `int`s to `str`s.
The pydantic API also supports rich metadata that could be useful for generating command-line interfaces for a given schema data model, as well as emitting JSON Schema. A sketch:
```python
import json

from pydantic import BaseModel, validator, ValidationError, Schema


class Person(BaseModel):
    id: int = Schema(..., description="Unique Person id")
    name: str = Schema(..., description="The Person's name")
    favorite_color: str = Schema(None, description="Optional favorite color")


# the field metadata is carried through to the emitted JSON Schema
print(json.dumps(Person.schema(), indent=2))
```
Summary and Conclusion
- `dataclasses` offers a terse syntax for defining a class or data container that has type annotations, using a code-generation approach
- `dataclasses` `field` metadata can be used to define defaults and to communicate which fields should be used in `eq`, `hash`, `lt`, etc.
- `dataclasses` has a `__post_init__` hook that can be used for validation
- `dataclasses`, by design, does not do type validation. It only adds `__annotations__` to the data container for static analyzers, such as `mypy`, to consume
- Since `dataclasses` is now in the standard lib, feature enhancements, bug fixes, and backwards compatibility are now coupled to the official Python release process
- Raymond's PyCon talk mentions that the end-to-end development time on `dataclasses` was 200+ hours
Initially, I was intrigued by the addition of `dataclasses` to the standard library. However, after a deeper dive into `dataclasses`, it's not clear to me that it is particularly useful for Python developers. I believe third-party solutions such as `attrs` or `pydantic` might be a better fit due to their validation hooks and richer feature sets. It will be interesting to see the adoption of `dataclasses` by both the Python core as well as third-party developers.
For a deeper look and comparison of the 3 (or 4) models to define a class or data container in Python, please consult the Notebook in this Gist.
Best on all your Python-ing!
Series: Functional Programming Techniques In Python
This is a 4-part series that explores functional-centric design styles and patterns in Python.
Part 1 (notebook) starts with the different mechanisms for defining functions in Python and quickly moves to using closures and `functools.partial`. We then add `functools.reduce` and composition with `compose` to our toolbox. Finally, we conclude by adding lazy `map` and `filter` to our toolbox and create a data pipeline that takes a stream of records and computes common statistics using a max heap as the reducer.
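To give a flavor of the toolbox, here's a tiny sketch of my own (not from the notebook) combining `functools.partial` with a `reduce`-based `compose`:

```python
from functools import partial, reduce

def compose(*funcs):
    # right-to-left composition: compose(f, g)(x) == f(g(x))
    return reduce(lambda f, g: lambda x: f(g(x)), funcs)

def add(a, b):
    return a + b

inc = partial(add, 1)  # fix the first argument of add
double = lambda x: 2 * x

inc_then_double = compose(double, inc)
assert inc_then_double(3) == 8  # (3 + 1) * 2
```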
In Part 2 (notebook), we explore building a REST client using a functional-centric design style. Using an iterative approach, we build up a REST client from small functions, leveraging closures and passing functions as first-class citizens to methods. To conclude, we wrap the API and expose the REST client via a simple Python class.
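A tiny sketch of the style (my own illustration; the URL and endpoint are hypothetical):

```python
import requests

def to_fetcher(base_url):
    # closure capturing the base URL; the returned function
    # can be passed around as a first-class citizen
    def fetch(segment, **params):
        return requests.get(f"{base_url}/{segment}", params=params).json()
    return fetch

fetch = to_fetcher("http://localhost:8080/api")  # hypothetical service
status = fetch("status")
```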
Part 3 (notebook) follows in a similar spirit to Part 2. We build a commandline interface leveraging `argparse` from the Python standard library. Sometimes libraries such as `argparse` can have rough edges or friction points in the API that introduce duplication or design issues. Part 3 focuses on iteratively building up an expressive commandline interface for a subparser-based commandline tool, using closures and functions to smooth out the rough edges of `argparse`. There are also examples of using a Strategy-ish pattern with type-annotated functions to enable configuring logging as well as custom error handling.
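For a flavor of the approach, here's a sketch of mine (not from the notebook) using small functions to de-duplicate subparser setup:

```python
import argparse

def add_output_arg(p):
    # one option per helper; helpers compose across subparsers,
    # smoothing out argparse's duplication-prone API
    p.add_argument("-o", "--output", default="out.json", help="Output path")
    return p

def to_parser():
    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers(dest="command")
    add_output_arg(subparsers.add_parser("run"))
    add_output_arg(subparsers.add_parser("report"))
    return parser

args = to_parser().parse_args(["run", "-o", "data.json"])
```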
Part 4 (notebook) concludes with some gotchas with regards to scope in closures, a brief example of decorators, and a few suggestions for leveraging function-centric designs in your APIs or applications.
If you're an OO wizard, a Data Scientist/Analyst, or a backend dev, this series can be useful for adding another design approach to your toolbelt when designing APIs or programs.
Best to you and your Python’ing!
Now
This is a "now page" which itemizes the professional and personal work and other priorities that I'm concentrating on right now.
What I'm Currently Working On
- Leveraging Scala programming language to build robust and maintainable code
- Ammonite for scripting and REPL-ish driven development in Scala
- Dashboard tooling using Panel
- Deeper dive into Advanced design patterns in Akka
- Exploring different workflow technologies (e.g., apache-airflow, prefect) for scientific computing and data analysis
- R for data exploration and visualization utilizing the most excellent dplyr and ggplot2
- Exploring PEPs targeting Python >= 3.8 to better understand the direction and future iterations of the Python Programming Language