Three packages that port the tidyverse to Python

reference

Working in Python but miss tidyverse syntax? These packages can help.

Published

May 9, 2022

A woman holding an instrument while a man who is facing away from us paints her, the room has luxurious decorations. — Johannes Vermeer, The Allegory of Painting (1666)

As I’ve been saying every year for the past seven years or so, I am learning Python. (It’s been a journey.)

Python packages like pandas have several ways to work with data. There are several options for indexing, slicing, etc. They have a lot of flexibility but also a lot of conventions to remember.

I am familiar with the grammar of the tidyverse, which provides a consistent set of verbs to solve common data manipulation challenges. I investigated ways to port tidyverse-like verbs to Python (hopefully making Python a little easier to grasp).

Here are three packages that do just that.

siuba

The siuba package, created by Michael Chow, allows you to use dplyr-like syntax with pandas. Siuba ports over several functions, including select(), filter(), mutate(), summarize(), and arrange(). The package also allows you to use group_by() and a >> pipe.

Let’s check out a few examples using the palmerpenguins dataset (R, Python).

from palmerpenguins import load_penguins
penguins = load_penguins()
from siuba import group_by, summarize, _

(penguins
  >> group_by(_.species)
  >> summarize(n = _.species.count())
)

	species	n
0	Adelie	152
1	Chinstrap	68
2	Gentoo	124

library(palmerpenguins)
library(dplyr)

(penguins %>% 
    group_by(species) %>%
    summarize(n = n()))

# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124

Thanks to the documentation and interactive tutorials available for siuba, it’s easy to see the parallels and differences with dplyr so that you can craft these data manipulation tasks yourself.

from palmerpenguins import load_penguins
penguins = load_penguins()
from siuba import select

(penguins
  >> select(-_.isalpha(), _.species)
  >> group_by(_.species)
  >> summarize(
      bill_length_mm = _.bill_length_mm.mean(),
      bill_depth_mm = _.bill_depth_mm.mean(),
      flipper_length_mm = _.flipper_length_mm.mean(),
      body_mass_g = _.body_mass_g.mean()
  )
)

	species	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g
0	Adelie	38.791391	18.346358	189.953642	3700.662252
1	Chinstrap	48.833824	18.420588	195.823529	3733.088235
2	Gentoo	47.504878	14.982114	217.186992	5076.016260

(penguins %>%
  group_by(species) %>%
  summarize(across(!where(is.character), mean, na.rm = TRUE)))

# A tibble: 3 × 8
  species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>      <dbl>          <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie        NA           38.8          18.3              190.       3701.
2 Chinstrap     NA           48.8          18.4              196.       3733.
3 Gentoo        NA           47.5          15.0              217.       5076.
# … with 2 more variables: sex <dbl>, year <dbl>

plotnine

The plotnine package, created by Hassan Kibirige, lets you use a grammar of graphics for Python.

You can use siuba and plotnine together, similar to how you would use dplyr and ggplot2 together.

from siuba import *
from plotnine import *
from palmerpenguins import load_penguins
penguins = load_penguins()

(penguins
 # using siuba pipe
 >> ggplot(aes(x = 'flipper_length_mm', y = 'body_mass_g'))
 # creating plotnine plot
  + geom_point(aes(color = 'species', shape = 'species'),
             size = 3,
             alpha = 0.8)
  + theme_minimal()
  + labs(title = "Penguin size, Palmer Station LTER",
         subtitle = "Flipper length and body mass for Adelie, Chinstrap, and Gentoo Penguins",
        x = "Flipper length (mm)",
        y = "Body mass (g)",
        color = "Penguin species",
        shape = "Penguin species"))

Folks have heuristics to translate ggplot2 code to plotnine. These help understand the nuances between the two.

pyjanitor

OK, this one is cheating a bit because the janitor package by Sam Firke is not part of the tidyverse. One more package that uses ‘tidyverse-like’ verbs is pyjanitor. With pyjanitor, you can clean column names, identify duplicate entries, and more.

from janitor import clean_names
import pandas as pd
import numpy as np

example_df = {
    'Terrible Name 1': ['name1', 'name2', 'name3', 'name4'],
    'PascalCase': [150.0, 200.0, 300.0, 400.0],
    'this_has_punctuation?': [np.nan, np.nan, np.nan, np.nan],
}

pd.DataFrame.from_dict(example_df).clean_names()

	terrible_name_1	pascalcase	this_has_punctuation_
0	name1	150.0	NaN
1	name2	200.0	NaN
2	name3	300.0	NaN
3	name4	400.0	NaN

Python’s pandas allows you to method chain with pipes. They can be used with pyjanitor, as well.

from janitor import clean_names, remove_empty
import pandas as pd
import numpy as np

example_df = {
    'Terrible Name 1': ['name1', 'name2', 'name3', 'name4'],
    'PascalCase': [150.0, 200.0, 300.0, 400.0],
    'this_has_punctuation?': [np.nan, np.nan, np.nan, np.nan],
}

(pd.DataFrame.from_dict(example_df)
    .pipe(clean_names)
    .pipe(remove_empty)
)

	terrible_name_1	pascalcase
0	name1	150.0
1	name2	200.0
2	name3	300.0
3	name4	400.0

Conclusion

While Python syntax and conventions are still on my “to-learn” list, it is helpful to know there are packages that can bring the familiarity of the tidyverse to Python.

Additional resources

I came across other packages that port the tidyverse to Python. I didn’t investigate them deeply but wanted to share in case they are helpful.

If you recommend these packages or know of others, please let me know on Twitter!

2024-01-13 Note

Adam Austin started a thread of dplyr-alike Python packages on Mastodon.

Working in Python but miss #rstats' tidyverse? 🐍 Here are three+ packages that port tidyverse-like syntax to use in your #pydata work.

Blog post: https://t.co/ScGycjYyOo
— Isabella Velásquez (@ivelasq3) May 9, 2022