
Exploring TypedDict in Python 3.8

This post explores the new TypedDict feature in Python and how to leverage it with the static analysis tool mypy to improve the robustness of your Python code.

PEP-589

TypedDict was proposed in PEP-589 and accepted in Python 3.8.

A few key quotes from PEP-589 can provide context and motivation for the problem that TypedDict is attempting to address.

This PEP proposes a type constructor typing.TypedDict to support the use case where a dictionary object has a specific set of string keys, each with a value of a specific type.

More generally, representing pure data objects using only Python primitive types such as dictionaries, strings and lists has had certain appeal. They are easy to serialize and deserialize even when not using JSON. They trivially support various useful operations with no extra effort, including pretty-printing (through str() and the pprint module), iteration, and equality comparisons.

This particular section of the PEP is interesting and suggests that TypedDict can be particularly useful for retrofitting legacy code (with type annotations).

Dataclasses are a more recent alternative to solve this use case, but there is still a lot of existing code that was written before dataclasses became available, especially in large existing codebases where type hinting and checking has proven to be helpful. Unlike dictionary objects, dataclasses don’t directly support JSON serialization, though there is a third-party package that implements it

The reference implementation is defined in mypy_extensions and can be used in Python 3.7 (e.g., pip install mypy_extensions); in Python 3.8, TypedDict is available from the standard library as typing.TypedDict.

The following examples are run with mypy 0.711 and can be obtained from this gist.

Motivation: Dictionary-Mania

Here’s a common example where a type-checking tool (e.g., mypy) won’t be able to help you catch type errors in your code.

def example_0() -> int:
    """Simple example of using a raw dict and how mypy won't catch
    these errors with the keys
    """

    m = dict(name='Star Wars', year=1234)

    # mypy will NOT catch this error
    t = m['name'] + 100
    return t

However, with TypedDict, you can define a structural-typing-ish interface to dict for a specific data model.

Using Python < 3.8 requires from mypy_extensions import TypedDict, whereas Python >= 3.8 uses from typing import TypedDict.
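If you need to support both, a version-guarded import keeps the rest of the module uniform. A minimal sketch:

import sys

if sys.version_info >= (3, 8):
    from typing import TypedDict
else:
    # backport for Python 3.7 (pip install mypy_extensions)
    from mypy_extensions import TypedDict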

Let’s create a simple Movie data model example and explore how mypy can be used to help catch type errors.

Example 1: Basic Usage of TypedDict

class Movie(TypedDict):
    name: str
    year: int


def example_01():
    m = Movie(name='Star Wars', year=1977)
    # or
    m2: Movie = dict(name='Star Wars', year=1977)

    # the type checker will catch this
    n = m['name'] + 100

To enable runnable code that purposely has errors that can be caught by mypy, let’s define a helper function to require a specific Exception type to be raised.

import logging
log = logging.getLogger(__name__)


def log_expected_error(ex, fx):
    raised_error = False
    try:
        return fx()
    except ex as e:
        raised_error = True
        log.info(f"Got Expected error `{e}` of type {ex} from {fx.__name__}")
    finally:
        if not raised_error:
            log.error(f"Expected {fx} to raise {ex}")

Example 2: Exploring Mutation and Assignment of TypedDicts

Let’s mutate the Movie TypedDict instance and explore how mypy can catch type errors during assignment.

def example_02() -> int:
    m = Movie(name='Star Wars', year=1977)
    log.info(m)

    # mypy will catch this
    m['name'] = 11111

    def f() -> int:
        m['year'] = m['year'] + 'asfdsasdf'
        return 0

    log_expected_error(TypeError, f)

    # Use dict methods to mutate
    # Note, the current version of mypy is confused
    # by this and generates `"Movie" has no attribute "clear"`
    m.clear()

    def f2() -> int:
        # mypy won't catch this KeyError from .clear()
        return m['year'] + 100

    log_expected_error(KeyError, f2)

    # Can we mix Movie and a raw dict?
    d2 = dict(extras=True, alpha=1234, name=12345, year='1978')

    # mypy will flag a type error here
    m.update(d2)

    log.info(m)

    # Update a Movie with a Movie
    m2 = Movie(name='Star Wars', year=1977)
    new_m = Movie(name='Movie Title', year=1234)

    # Both of these are proper Movie TypedDicts,
    # hence no mypy type error
    m2.update(new_m)
    log.info(m2)

There are a few interesting items to note.

  • mypy will catch assignment errors
  • The current version of mypy will get a bit confused by dict methods, such as .clear(). Moreover, after .clear(), key access can yield KeyErrors at runtime (related: see the total=False keyword of TypedDict)
  • mypy will only allow merging dicts of the same type. You can’t mix a TypedDict and a raw dict without mypy raising an issue

Example 3: TypedDict's total Keyword Argument

There’s a total keyword argument to TypedDict that communicates that the dict does not need to be completely well formed. This is particularly interesting in how mypy interprets the types.

For example, X with alpha, beta, and gamma as optional int keys can be written as:

class X(TypedDict, total=False):
    alpha: int
    beta: int
    gamma: int

x: X = dict()
x['alpha'] = 1
x['beta'] = 2
x['gamma'] = 3

Let’s dive deeper with a variation of the previously defined Movie example, using total=False to explore how mypy interprets the ‘incomplete’ data model.

from typing import Optional


class Movie2(TypedDict, total=False):
    name: str
    year: int
    release_year: Optional[int]


def example_03() -> int:
    """
    Explore defining an 'incomplete' Movie data model and how
    None/Null checking works with mypy
    """

    m = Movie2(name='Star Wars')
    log.info(m)

    def f() -> int:
        # mypy will catch this
        m['name'] = 1234
        return 0

    # Use dict methods to mutate
    # mypy is confused by this. The error is:
    # `"Movie2" has no attribute "clear"`
    m.clear()

    def f2() -> int:
        # mypy doesn't catch this KeyError.
        # I don't think it treats the type
        # as Optional[int]
        m['year'] = m['year'] + 100
        return 0

    log_expected_error(KeyError, f2)

    # Explicit test with release_year, which
    # is fundamentally Optional[int]

    def f3() -> int:
        # mypy WILL catch this None-related error
        m['release_year'] = m['release_year'] + 100
        return 0

    log_expected_error(KeyError, f3)

    # This works as expected
    m2 = Movie2(name='Star Wars', release_year=2049)

    # This works as expected and mypy won't raise an error
    if m2['release_year'] is not None:
        t = m2['release_year'] + 10

Finally, let’s explore how isinstance works with TypedDict.

Example 4: TypedDict and isinstance

def example_04() -> int:
    """Testing isinstance"""

    m = Movie(name='Movie', year=1234)

    def f() -> int:
        is_true = isinstance(m, dict)
        return 0

    # This is a bit unexpected: this will raise
    # an exception at runtime. mypy flags it as
    # `Cannot use isinstance() with a TypedDict type`
    def f2() -> int:
        is_true = isinstance(m, Movie)
        return 0

    log_expected_error(TypeError, f2)

The important item to note here is that you can NOT use isinstance with a TypedDict; Python will raise a TypeError at runtime. Specifically, the error you’ll see is shown below.

TypeError: TypedDict does not support instance and class checks

Summary

  • TypedDict + mypy can be valuable to help catch type errors in Python and can help with heavy dictionary-centric interfaces in your application or library
  • TypedDict can be used in Python 3.7 using the mypy_extensions package
  • TypedDict can be used in Python 2.7 using mypy_extensions and the ‘namedtuple-esque’ functional syntax (e.g., Movie = TypedDict('Movie', {'name': str, 'year': int}); see the sketch after this list)
  • Using the total=False keyword with TypedDict can introduce large holes in the static type-checking process, yielding KeyErrors at runtime. The keyword total=False should be used judiciously (if at all)
  • isinstance should not be used with TypedDict as it will raise a runtime TypeError exception
  • Be mindful when using dict methods such as clear() on a TypedDict
  • TypedDict introduces a new (somewhat) competing data modeling alternative to dataclasses, typing.NamedTuple, “classic” classes, and third-party libraries such as pydantic and attrs. It’s not completely clear to me how all these competing data model abstractions are going to age gracefully
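For reference, here is a minimal sketch of the functional (‘namedtuple-esque’) syntax mentioned above; it is also handy when keys are not valid Python identifiers:

Movie = TypedDict('Movie', {'name': str, 'year': int})

m = Movie(name='Star Wars', year=1977)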

I believe TypedDict can be a valuable tool to help improve the clarity of interfaces, specifically in legacy code that is a bit dictionary-mania heavy. However, for new code, I would suggest avoiding TypedDict in favor of thin data models, such as pydantic and attrs.

Best to you and your Python-ing.

P.S. A runnable form of the code used in the post can be found in this gist.

Python Dashboards with Panel: Kicking the Tires

PyViz recently released the first official release (0.60) of Panel. Overall, I’m digging the iterative development model of developing dashboard components within Jupyter lab/notebook.

Here’s an example notebook that demonstrates creating a few dashboard components in Panel. The raw notebook can be launched using mybinder.org.

Python 3.8.0b1 Positional-Only Arguments and the Walrus Operator

On June 4th 2019, Python 3.8.0b1 was released. The official changelog is here.

There are two interesting syntactic changes/features that were added which I believe are useful to explore in some depth. Specifically, the new “walrus” := operator and the new Positional-Only function parameter feature.

Walrus

First, the “walrus” expression operator (:=), defined in PEP-572:

…naming sub-parts of a large expression can assist an interactive debugger, providing useful display hooks and partial results. Without a way to capture sub-expressions inline, this would require refactoring of the original code; with assignment expressions, this merely requires the insertion of a few name := markers. Removing the need to refactor reduces the likelihood that the code be inadvertently changed as part of debugging (a common cause of Heisenbugs), and is easier to dictate to another programmer.

A (contrived) example using Python 3.8.0b1 built from source “3.8.0b1+ (heads/3.8:23f41a64ea)”

>>> xs = list(range(0, 10))
>>> xs
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> [(x, y) for x in xs if (y := x * 2) < 5]
[(0, 0), (1, 2), (2, 4)]

The idea is to use an expression-based approach to remove unnecessary chatter and the potential bugs of storing local state.

Another simple example:

>>> # Python 3.8
>>> import re
>>> rx = re.compile(r'([A-z]*)\.([A-z]*)')
>>> def g(first, last): return f"First: {first} Last: {last}"
>>> names = ['ralph', 'steve.smith']
>>> for name in names:
... if (match := rx.match(name)): print(g(*match.groups()))
First: steve Last: smith

As a side note, many of these “None-ish” examples in the PEP (somewhat mechanically) look like map, flatMap, or foreach on Option[T] in Scala.

Python doesn’t really do this well due to its inside-out nature of composing map/filter/generators (versus a left-to-right model). Nevertheless, here’s an example to demonstrate the general idea using a functional-centric approach.

>>> def processor(sx): return rx.match(sx)
>>> def not_none(x): return x is not None
>>> def printer(x): print(g(*x.groups()))
>>> _ = list(map(printer, filter(not_none, map(processor, names))))
First: steve Last: smith

The “Exceptional cases” described in the PEP are worth investigating in more detail. There are several cases where “Valid, though probably confusing” is used.

For example:

y := f(x)  # INVALID
(y := f(x)) # Valid, though not recommended

Note that the “walrus” operator can also be used in function definitions.

def foo(answer = p := 42): return "" # INVALID
def foo(answer=(p := 42)): return "" # Valid, though not great style

Positional Only Args

The other interesting feature added to Python 3.8 is Positional-Only arguments in function definitions.

For as long as I can recall, Python has had this fundamental feature (or bug) in how functions or methods are called.

For example,

>>> def f(x, y=1234): return x + y
>>> f(1)
1235
>>> f(1, 2)
3
>>> f(x=1, y=2)
3
>>> f(y=1, x=2)
3

Often this fundamental ambiguity of function call “style” isn’t really a big deal. However, it can leak local variable names as part of the public interface. As a result, minor variable renaming can potentially break interfaces. It’s also not clear what should be a keyword-only argument or a positional-only argument with a default. For example, simply changing f(x, y=1234) to f(n, y=1234) can potentially break the interface depending on the call “style”.
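A quick sketch of that breakage:

def f(x, y=1234): return x + y
f(x=1)    # works, returns 1235

def f(n, y=1234): return n + y
f(x=1)    # TypeError: f() got an unexpected keyword argument 'x'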

I’ve worked with a few developers over the years that viewed this as a feature and thought that this style made the API calls more explicit. For example:

def compute(alpha, beta, gamma):
    return 0

compute(alpha=90, gamma=80, beta=120)

I never really liked this looseness of positional vs keyword and would (if possible) try to discourage its use. Regardless, it can be argued that this is a feature of the language (at least in Python <= 3.7). I would guess that many Python developers are also leveraging the dict-unpacking style to call functions.

d = dict(alpha=90, gamma=70, beta=120)
compute(**d)

In Python 3.0, Keyword-Only arguments in function definitions were added (see PEP-3102 from 2006) using the * delimiter. All arguments to the right of the * are Keyword-Only arguments.

def f(a, b, *, c=1234):
    return a + b + c

Unfortunately, this still leaves a fundamental issue with clearly defining function arguments. There are three cases: Positional-Only, Positional-or-Keyword, and Keyword-Only. PEP-3102 only addresses the Keyword-Only case and doesn’t address the other two.

Hence in Python < 3.8, there’s still a fundamental ambiguity when defining a function and how it can be called.

For example:

def f(a, b=1, *, c=1234):
    return a + b + c

f(1, 2, c=3)
f(a=1, b=2, c=3)

Starting with Python 3.8.0, a Positional-Only parameters mechanism was added. The details are captured in PEP-570.

Similar to the * delimiter added in Python 3.0 for Keyword-Only args, a / delimiter was added to clearly delineate Positional-Only args in function or method definitions. This makes the three cases of function arguments unambiguous in how they can be called.

Here are a few examples:

def f0(a, b, c=1234, /):
    # a, b, c are positional-only args
    return a + b + c

def f1(a, b, /, c=1234):
    # a, b must be positional,
    # c can be positional or keyword
    return a + b + c

def f2(a, b, *, c=1234):
    # a, b can be keyword or positional,
    # but c MUST be keyword
    return a + b + c

def f3(a, b, /, *, c=1234):
    # a, b only positional,
    # c only keyword
    # e.g., f3(1, 2, c=3)
    return a + b + c

Combining the / and * with the type annotations yields:

def f4(a: int, b: int, /, *, c: int = 1234):
    # this can only be called as
    # f4(1, 2, c=3)
    return a + b + c

We can dive a bit deeper and inspect the function signature via inspect.

import inspect

def extractor(f):
    sx = inspect.signature(f)
    return [(v.name, v.kind, v.default) for v in sx.parameters.values()]

Let’s inspect each example:

def pf(f):
    print(f"Function {f.__name__} with type annotations {f.__annotations__}")
    print(extractor(f))

Running in the REPL yields:

>>> funcs = (f0, f1, f2, f3, f4)
>>> _ = list(map(pf, funcs))
Function f0 with type annotations {}
[('a', <_ParameterKind.POSITIONAL_ONLY: 0>, <class 'inspect._empty'>), ('b', <_ParameterKind.POSITIONAL_ONLY: 0>, <class 'inspect._empty'>), ('c', <_ParameterKind.POSITIONAL_ONLY: 0>, 1234)]
Function f1 with type annotations {}
[('a', <_ParameterKind.POSITIONAL_ONLY: 0>, <class 'inspect._empty'>), ('b', <_ParameterKind.POSITIONAL_ONLY: 0>, <class 'inspect._empty'>), ('c', <_ParameterKind.POSITIONAL_OR_KEYWORD: 1>, 1234)]
Function f2 with type annotations {}
[('a', <_ParameterKind.POSITIONAL_OR_KEYWORD: 1>, <class 'inspect._empty'>), ('b', <_ParameterKind.POSITIONAL_OR_KEYWORD: 1>, <class 'inspect._empty'>), ('c', <_ParameterKind.KEYWORD_ONLY: 3>, 1234)]
Function f3 with type annotations {}
[('a', <_ParameterKind.POSITIONAL_ONLY: 0>, <class 'inspect._empty'>), ('b', <_ParameterKind.POSITIONAL_ONLY: 0>, <class 'inspect._empty'>), ('c', <_ParameterKind.KEYWORD_ONLY: 3>, 1234)]
Function f4 with type annotations {'a': <class 'int'>, 'b': <class 'int'>, 'c': <class 'int'>}
[('a', <_ParameterKind.POSITIONAL_ONLY: 0>, <class 'inspect._empty'>), ('b', <_ParameterKind.POSITIONAL_ONLY: 0>, <class 'inspect._empty'>), ('c', <_ParameterKind.KEYWORD_ONLY: 3>, 1234)]

Note, you can use inspect.Signature.bind to check a set of arguments against a signature (the arguments must adhere to the positional and keyword constraints of the function of interest).
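For example, a quick sketch using f4 from above:

>>> sig = inspect.signature(f4)
>>> sig.bind(1, 2, c=3)        # adheres to the signature
<BoundArguments (a=1, b=2, c=3)>
>>> sig.bind(a=1, b=2, c=3)    # raises TypeError: 'a' parameter is positional only, ...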

Also, it’s worth noting that both scipy and numpy have been using this / style in the docs for some time.

When Should I Start Adopting these Features?

If you’re a library developer with packages on PyPI, it might not be clear when it’s “safe” to start leveraging these features. I was only able to find one source of Python 3 adoption data and, as a result, I’m only able to outline a very crude model.

On December 23, 2016, Python 3.6 was officially released. In the Fall of 2018, JetBrains released the Python Developer Survey, which contains the Python 2/3 breakdown as well as the breakdown of different versions within Python 3. As of Fall 2018, 54% of Python 3 developers were using Python 3.6.x. Therefore, using this very crude model: if you assume that the rates of adoption of 3.6 and 3.8 are the same, and that the minimum threshold of adoption for 3.8 is 54%, then you’ll need to wait approximately two years before starting to leverage these 3.8 features.

Jetbrains Python Survey

When you do plan to leverage these 3.8-specific features and push a package to the Python Package Index, I would suggest clearly defining the Python version in setup.py. See the official packaging docs for more details.

# setup.py
setup(python_requires='>=3.8')

Summary and Conclusion

  • Python 3.8 added the “walrus” operator := that enables naming the result of an expression within the expression itself
  • It’s recommended to read the Exceptional Cases section of the PEP for a better understanding of where (and where not) to use the := operator
  • Python 3.8 added Positional-Only function definitions using the / delimiter
  • Defining functions with Positional-Only arguments requires a trailing / in the definition. E.g., def adder(n, m, /): return n + m
  • There are related changes in the standard lib to be aware of. It’s not clear how locked down or backwards compatible these interface changes are. Here’s a random example of the function signature of functools.partial being updated to use /
  • Positional-Only arguments should improve consistency of API calls across Python runtimes (e.g., CPython and PyPy)
  • The Positional-Only PEP-570 outlines improvements in performance; however, I wasn’t able to find any performance studies on this topic
  • Migrating to 3.8 might involve potentially breaking API changes based on the usage of / in the Python 3.8 standard lib
  • For core library authors with libs on PyPI, I would recommend using the crude approximation (described above) of approximately two years before adopting the new 3.8 features
  • For mypy users, you might want to verify mypy’s supported versions of Python 3.8 (more details in the compatibility matrix)

I understand the general motivation to solve core friction points or ambiguities at the language level; however, the new syntactic changes are a little too noisy for my tastes, specifically the Positional-Only / combined with the * and type annotations. Regardless, the (Python 3.8) ship has sailed long ago. I would encourage the Python community to periodically track and provide feedback on the current PEPs to help guide the evolution of the Python programming language. And finally, Python 3.8.0 (beta and future 3.8 RC) bugs should be filed at https://bugs.python.org.

Best to you and your Python-ing!

Further Reading

P.S. A reminder that the PSF has a Q2 2019 fundraiser that ends June 30th.

Dataclasses in Python 3.7

TL;DR: Initially, I was interested in the dataclasses feature that was added to the Python 3.7 standard library. However, after a closer look, it’s not clear to me if dataclasses provide enough value. Third-party libraries such as attrs and pydantic offer considerably more value. Moreover, by adding dataclasses to the standard library, Python now has 3 (or 4) overlapping mechanisms (namedtuple, typing.NamedTuple, “classic” classes, and dataclasses) for defining a core concept in the Python programming language.

In Python 3.7, dataclasses were added to the standard library. There’s a terrific talk by Raymond Hettinger that is a great introduction to the history and design of dataclasses.

A demonstration and comparison of the 3 different models is shown below.

For the demo, I’ll use a Person data model with an id (int), a name (str), and an optional favorite color (Optional[str]). The Person model will also have custom eq and hash implementations using only the Person’s id. I’m a fan of immutable classes, so the container data model will be immutable where possible.

Here’s an example using the “classic” class style leveraging Python 3 type annotations.

Nothing particularly interesting here for any Python dev. We do have to write a bit of boilerplate in the constructor; however, this also enables some convenience translation layers (e.g., converting a datetime as a string to a datetime instance). I’ve omitted the immutability aspect due to the boilerplate that would be added.

from typing import Optional

class Person(object):
    def __init__(self, id: int, name: str, favorite_color: Optional[str] = None):
        self.id = id
        self.name = name
        self.favorite_color = favorite_color

    def __repr__(self):
        return "<{} id:{} name:{}>".format(self.__class__.__name__, self.id, self.name)

    def __hash__(self):
        return hash(self.id)

    def __eq__(self, other):
        if self.__class__ == other.__class__:
            return self.id == other.id
        return False

We can also write this using the typing.NamedTuple (or an untyped version using the Python 2 style collections.namedtuple).

from typing import NamedTuple, Optional

class Person(NamedTuple):
    id: int
    name: str
    favorite_color: Optional[str] = None

    def __hash__(self):
        return hash(self.id)

    def __eq__(self, other):
        if self.__class__ == other.__class__:
            return self.id == other.id
        return False

Nothing too interesting here either. This approach does have well understood downsides due to the tuple nature of the underlying design. Specific downsides include index access and comparison operators (e.g., __le__), which leverage the sortability of tuples and can potentially introduce unexpected behaviors.
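A quick REPL illustration of those leaks, using the NamedTuple-based Person above:

>>> p = Person(id=1, name='Alice')
>>> p[0]                  # index access leaks the tuple nature
1
>>> p < (0, 'Bob', None)  # tuple ordering applies even against a plain tuple
False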

Let’s use the new Python 3.7 dataclasses approach.

from typing import Optional
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Person:
    id: int
    name: str = field(hash=False, compare=False)
    favorite_color: Optional[str] = field(default_factory=lambda: None, hash=False, compare=False)

The dataclasses design is a declarative centric approach yielding a terse and clean interface to define a class in Python.

It’s important to note that none of the 3 approaches in the standard lib support any mechanism for type checking at runtime. The dataclasses API has a __post_init__ hook that can be used to add custom validation (type checking or otherwise).
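For example, a minimal sketch of runtime validation via that hook (the checks here are illustrative; dataclasses itself won’t perform them):

from dataclasses import dataclass

@dataclass(frozen=True)
class ValidatedPerson:
    id: int
    name: str

    def __post_init__(self):
        # illustrative runtime checks; dataclasses does not enforce types
        if not isinstance(self.id, int):
            raise TypeError(f"id must be an int, got {type(self.id)}")
        if not self.name:
            raise ValueError("name must be a non-empty string")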

I think it’s useful to dig a bit deeper and understand the development requirements, or conversely, the non-goals of the dataclasses API. PEP-557 provides some key insights (for context, one of the libraries mentioned in the quote below is attrs).

Here are a few useful snippets from the PEP.

One main design goal of Data Classes is to support static type checkers. The use of PEP 526 syntax is one example of this, but so is the design of the fields() function and the @dataclass decorator. Due to their very dynamic nature, some of the libraries mentioned above are difficult to use with static type checkers.

It’s not clear to me if these “very dynamic” libraries still have an issue with popular static analysis tools such as mypy (as of mypy 0.57, attrs is supported). Nevertheless, it’s very clear from the PEP that there is a strong opinion that Python is a dynamic language and that adding types is to annotate functions with metadata for static analysis tools. Adding types is not for runtime type checking, nor does it imply that type signatures are required.

Another useful snippet to provide context.

Data Classes are not, and are not intended to be, a replacement mechanism for all of the above libraries. But being in the standard library will allow many of the simpler use cases to instead leverage Data Classes. Many of the libraries listed have different feature sets, and will of course continue to exist and prosper.

Historically, Python has taken a very heavy batteries-included approach to the standard library. To put the size of the standard library into perspective, there’s a recent PEP-594 to remove “dead batteries” from the standard library. The list of packages to remove is quite interesting. This batteries-included approach motivates questions about the standard library.

  • Does the Python standard library need to be any bigger?

  • Why can’t (or shouldn’t) the dataclasses style or functionality be off-loaded to the community to develop as libraries (e.g., attrs and traitlets)? Does a “half” or “minimal” solution really add enough value?

  • Does Python standard lib need yet another competing mechanism to define a class or data container?

  • Are all of these packages in the standard lib really that useful in practice? For example, is the CSV parser used when pandas.read_csv exists?

  • Do these features in the standard library bog down the Python core team?

A recent talk at the Python Language Summit in May of 2019, “Batteries Included, But They’re Leaking” by Amber Brown, brought up some contentious ideas on the current state of the standard library (in practice, that ship has sailed many moons ago and I’m not sure there’s a lot of constructive discussion to be had on this topic).

I don’t really have any solid answers to any of these questions. Two of the most popular Python libs, requests and numpy, are both outside of the standard library (for good reasons that may be orthogonal to the motivations for adding dataclasses to the standard library) and are thriving.

Without per-field validation hooks, among other features, I’m struggling to find the usefulness of dataclasses for Python developers, particularly in the context of currently polished third-party libraries, such as attrs, pydantic, or traitlets.

For comparison, let’s take a look at the third-party libraries that have inspired dataclasses.

Third Party Libraries

Let’s take a quick look into two of the third-party libraries, attrs and pydantic that inspired dataclasses. I believe both of these libraries are supported by mypy.

Attrs

Similar to the dataclasses approach, the types in attrs are only used as annotations (e.g., __annotations__) and are not enforced at runtime. However, the general validation mechanism trivially enables runtime type validation. For the purposes of this example, let’s also add custom validation on the name of the Person, as well as type checking.

from typing import Optional

import attr
from attr.validators import instance_of as vof


def is_positive_nonzero(instance, attribute, value):
    if value < 1:
        raise ValueError(f"Value {value} must be >0")


def check_non_empty_str(inst, attribute, value):
    if not value:
        raise ValueError("name must be a non-empty string")


@attr.s(auto_attribs=True, frozen=True)
class Person:
    id: int = attr.ib(validator=[vof(int), is_positive_nonzero])
    name: str = attr.ib(validator=[vof(str), check_non_empty_str])
    favorite_color: Optional[str] = attr.ib(default=None, validator=[vof((str, type(None)))])

    def __hash__(self):
        return hash(self.id)

    def __eq__(self, other):
        if self.__class__ == other.__class__:
            return self.id == other.id
        return False

The abstract of Raymond Hettinger’s talk from Pycon has an interesting summary of dataclasses.

Dataclasses are shown to be the next step in a progression of data aggregation tools: tuple, dict, simple class, bunch recipe, named tuples, records, attrs, and then dataclasses. Each builds upon the one that came before, adding expressiveness at the expense of complexity.

I’m not sure I completely agree. The dataclasses implementation looks closer to attrs-lite, than the “next step of progression”.

Pydantic

Another alternative is pydantic. This is a bit more of an opinionated design. It also has a nice Schema abstraction to communicate core metadata on the fields, as well as first-class support for serialization hooks. The pydantic library also has a dataclasses wrapper layer that can be accessed via pydantic.dataclasses.

Here’s an example of defining our Person data model.

from typing import Optional
from pydantic import BaseModel, validator, PositiveInt


class Person(BaseModel):
    class Config:
        allow_mutation = False
        validate_all = True

    id: PositiveInt
    name: str
    favorite_color: Optional[str] = None

    @validator('name')
    def validate_name(cls, v):
        if not v:
            raise ValueError("name can't be empty")
        return v

    def __hash__(self):
        return hash(self.id)

    def __eq__(self, other):
        if self.__class__ == other.__class__:
            return self.id == other.id
        return False

Overall, I like the general style and approach; however, it does have a few quirks, specifically the keyword-only usage as well as the unexpected casting behavior of ints to strs.
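A quick sketch of both quirks (behavior as of the pydantic version current at the time of writing):

>>> p = Person(id=1, name=12345)
>>> p.name                # the int was silently cast to a str
'12345'
>>> Person(1, 'Alice')    # raises TypeError; BaseModel only accepts keyword arguments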

The pydantic API also supports rich metadata that could be useful for generating commandline interfaces for a given schema data model, as well as emitting JSON Schema.

from typing import Optional
from pydantic import BaseModel, validator, ValidationError, Schema, PositiveInt


class Person(BaseModel):
    class Config:
        allow_mutation = False

    # '...' means the value is required.
    id: PositiveInt = Schema(..., title="User Id")
    name: str = Schema(..., title="User name")
    favorite_color: Optional[str] = Schema(
        None,
        title="User Favorite Color",
        description="Favorite Color. Provided as case-sensitive English spelling")

    @validator('name')
    def validate_name(cls, v):
        if not v:
            raise ValueError("name can't be empty")
        return v

    def __hash__(self):
        return hash(self.id)

    def __eq__(self, other):
        if self.__class__ == other.__class__:
            return self.id == other.id
        return False

# Person.schema() will emit the JSON Schema as a dict.

Summary And Conclusion

  • dataclasses offer a terse syntax for defining a class or data container with type annotations using a code generation approach
  • dataclasses field metadata can be used to define defaults and to communicate which fields should be used in __eq__, __hash__, __le__, etc.
  • dataclasses have a __post_init__ hook that can be used for validation
  • By design, dataclasses do not do type validation. They only add __annotations__ to the data container for static analyzers, such as mypy, to consume
  • Since dataclasses are now in the standard lib, feature enhancements, bug fixes, and backwards compatibility are coupled to the official Python release process
  • Raymond’s Pycon talk mentions the end-to-end development time on dataclasses was 200+ hrs

Initially, I was intrigued by the addition of dataclasses to the standard library. However, after a deeper dive into dataclasses, it’s not clear to me that they are particularly useful for Python developers. I believe third-party solutions such as attrs or pydantic might be a better fit due to their validation hooks and richer feature sets. It will be interesting to see the adoption of dataclasses by both the Python core as well as third-party developers.

For a deeper look and comparison of the 3 (or 4) models to define a class or data container in Python, please consult the Notebook in this Gist.

Best on all your Python-ing!

Series: Functional Programming Techniques In Python

This is a 4-part series that explores functional-centric design styles and patterns in Python.

Part 1 (notebook) starts with different mechanisms for defining functions in Python and quickly moves to using closures and functools.partial. We then add functools.reduce and composition with compose to our toolbox. Finally, we conclude by adding lazy map and filter to our toolbox and create a data pipeline that takes a stream of records and computes common statistics using a max heap as the reducer.

In Part 2 (notebook), we explore building a REST client using a functional-centric design style. Using an iterative approach, we build up a REST client from small functions, leveraging closures and passing functions as first-class citizens to methods. To conclude, we wrap the API and expose the REST client via a simple Python class.

Part 3 (notebook) follows in a similar spirit to Part 2. We build a commandline interface leveraging argparse from the Python standard library. Sometimes libraries such as argparse can have rough edges or friction points in the API that introduce duplication or design issues. Part 3 focuses on iteratively building up an expressive commandline interface to a subparser commandline tool using closures and functions to smooth out the rough edges of argparse. There are also examples of using a Strategy-ish pattern with type-annotated functions to enable configuring logging as well as custom error handling.

Part 4 (notebook) concludes with some gotchas with regards to scope in closures, a brief example of decorators, and a few suggestions for leveraging function-centric designs in your APIs or applications.

If you’re an OO wizard, a Data Scientist/Analyst, or a backend dev, this series can be useful for adding another design approach to your toolbelt when designing APIs or programs.

Best to you and your Python’ing!

Functional Programming Techniques in Python Part 4

Functional Programming Techniques in Python Part 3

Functional Programming Techniques in Python Part 2

Functional Programming Techniques in Python Part 1

Now

This is a “now page” which itemizes the professional and personal work and other priorities that I’m concentrating on right now.

What I’m Currently Working On

  • Leveraging the Scala programming language to build robust and maintainable code
  • Ammonite for scripting and REPL-ish driven development in Scala
  • Dashboard tooling using Panel
  • Deeper dive into Advanced design patterns in Akka
  • Exploring different workflow technologies (e.g., apache-airflow, prefect) for scientific computing and data analysis
  • R for data exploration and visualization utilizing the most excellent dplyr and ggplot2
  • Exploring PEPs targeting Python >= 3.8 to better understand the direction and future iterations of the Python programming language