I couldn't find a function that summarizes the content of a Polars DataFrame the way glimpse and summary do in R. Is there one?
Polars has a describe method:
df = pl.DataFrame({
'a': [1.0, 2.8, 3.0],
'b': [4, 5, 6],
"c": [True, False, True]
})
df.describe()
shape: (5, 4)
╭──────────┬───────┬─────┬──────╮
│ describe ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 │
╞══════════╪═══════╪═════╪══════╡
│ "mean" ┆ 2.267 ┆ 5 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "std" ┆ 1.102 ┆ 1 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "min" ┆ 1 ┆ 4 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "max" ┆ 3 ┆ 6 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "median" ┆ 2.8 ┆ 5 ┆ null │
This reports descriptive statistics per column, like R's summary. I have not used glimpse before, but a quick Google search suggests it does something similar to Polars' head, except with the output stacked vertically, so it is easier to digest when there are many columns.
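As an update: newer Polars versions also ship a glimpse method on DataFrame itself, which prints one line per column (name, dtype, first few values), much like R's glimpse (the exact output format varies by version):
df.glimpse()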
Related
Let's say I want to make a list of functions, i.e. aggs=['sum','std','mean','min','max']
then if I have an arbitrary df
df=pl.DataFrame({'a':[1,2,3], 'b':[2,3,4]})
I want to be able to do something like (this obviously doesn't work)
df.with_columns([pl.col('a').x() for x in aggs])
Is there a way to do that? aggs need not be a list of strings; that's just the easiest way to express my intent for the purpose of this question. It would also need to leave room for .suffix().
I know I could write a function that hard-codes all the aggs and takes an arbitrary df as a parameter, which is my backup plan, so I'm hoping for something that resembles the above.
Would this work for you?
df.with_columns([getattr(pl.col("a"), x)().suffix("_" + x) for x in aggs])
shape: (3, 7)
┌─────┬─────┬───────┬───────┬────────┬───────┬───────┐
│ a ┆ b ┆ a_sum ┆ a_std ┆ a_mean ┆ a_min ┆ a_max │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ f64 ┆ f64 ┆ i64 ┆ i64 │
╞═════╪═════╪═══════╪═══════╪════════╪═══════╪═══════╡
│ 1 ┆ 2 ┆ 6 ┆ 1.0 ┆ 2.0 ┆ 1 ┆ 3 │
│ 2 ┆ 3 ┆ 6 ┆ 1.0 ┆ 2.0 ┆ 1 ┆ 3 │
│ 3 ┆ 4 ┆ 6 ┆ 1.0 ┆ 2.0 ┆ 1 ┆ 3 │
└─────┴─────┴───────┴───────┴────────┴───────┴───────┘
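Note: on newer Polars versions (1.x), Expr.suffix has moved into the name namespace, so the same getattr idea would read:
df.with_columns([getattr(pl.col("a"), x)().name.suffix("_" + x) for x in aggs])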
I have a collection of Polars expressions being used to generate features for an ML model. I'd like to add a Poisson CDF feature to this collection whilst maintaining lazy execution (with its benefits of speed, caching, etc.). So far I have not found an easy way of achieving this.
I've been able to get the result I'd like outside of the desired lazy expression framework with:
import polars as pl
from scipy.stats import poisson
df = pl.DataFrame({"count": [9,2,3,4,5], "expected_count": [7.7, 0.2, 0.7, 1.1, 7.5]})
result = poisson.cdf(df["count"].to_numpy(), df["expected_count"].to_numpy())
df = df.with_column(pl.Series(result).alias("poisson_cdf"))
However, in reality I'd like this to look like:
df = pl.DataFrame({"count": [9,2,3,4,5], "expected_count": [7.7, 0.2, 0.7, 1.1, 7.5]})
df = df.select(
[
... # bunch of other expressions here
poisson_cdf()
]
)
where poisson_cdf is some polars expression like:
def poisson_cdf():
    # this is just for illustration; clearly won't work
return scipy.stats.poisson.cdf(pl.col("count"), pl.col("expected_count")).alias("poisson_cdf")
I also tried using a struct made up of "count" and "expected_count" with apply, as advised in the docs for applying custom functions. However, my dataset is several million rows in reality, leading to absurd execution times.
Any advice or guidance here would be appreciated. Ideally there exists an expression like this somewhere out there? Thanks in advance!
If scipy.stats.poisson.cdf were implemented as a proper NumPy universal function, it would be possible to use it directly on Polars expressions, but it is not. Fortunately, the Poisson CDF is just the regularized upper incomplete gamma function, F(k; μ) = Q(k + 1, μ), for which SciPy supplies gammaincc, a true ufunc that can be used in Polars expressions:
>>> import polars as pl
>>> from scipy.special import gammaincc
>>> df = pl.select(pl.arange(0, 10).alias('k'))
>>> df.with_columns(cdf=gammaincc(pl.col('k') + 1, 4.0))
shape: (10, 2)
┌─────┬──────────┐
│ k ┆ cdf │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪══════════╡
│ 0 ┆ 0.018316 │
│ 1 ┆ 0.091578 │
│ 2 ┆ 0.238103 │
│ 3 ┆ 0.43347 │
│ ... ┆ ... │
│ 6 ┆ 0.889326 │
│ 7 ┆ 0.948866 │
│ 8 ┆ 0.978637 │
│ 9 ┆ 0.991868 │
└─────┴──────────┘
The result is the same as that returned by poisson.cdf:
>>> from scipy.stats import poisson
>>> _.with_columns(cdf2=pl.lit(poisson.cdf(df['k'], 4)))
shape: (10, 3)
┌─────┬──────────┬──────────┐
│ k ┆ cdf ┆ cdf2 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 │
╞═════╪══════════╪══════════╡
│ 0 ┆ 0.018316 ┆ 0.018316 │
│ 1 ┆ 0.091578 ┆ 0.091578 │
│ 2 ┆ 0.238103 ┆ 0.238103 │
│ 3 ┆ 0.43347 ┆ 0.43347 │
│ ... ┆ ... ┆ ... │
│ 6 ┆ 0.889326 ┆ 0.889326 │
│ 7 ┆ 0.948866 ┆ 0.948866 │
│ 8 ┆ 0.978637 ┆ 0.978637 │
│ 9 ┆ 0.991868 ┆ 0.991868 │
└─────┴──────────┴──────────┘
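Since gammaincc applied to a Polars expression simply builds another expression, it should also slot into a lazy query, which was the original goal. A minimal sketch (scalar rate, as above):
>>> out = (
...     pl.LazyFrame({'k': list(range(10))})
...     .with_columns(cdf=gammaincc(pl.col('k') + 1, 4.0))
...     .collect()
... )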
It sounds like you want to use .map() instead of .apply(), which will pass whole columns at once.
df.select([
pl.all(),
# ...
pl.struct(["count", "expected_count"])
.map(lambda x:
poisson.cdf(x.struct.field("count"), x.struct.field("expected_count")))
.flatten()
.alias("poisson_cdf")
])
shape: (5, 3)
┌───────┬────────────────┬─────────────┐
│ count ┆ expected_count ┆ poisson_cdf │
│ ---   ┆ ---            ┆ ---         │
│ i64   ┆ f64            ┆ f64         │
╞═══════╪════════════════╪═════════════╡
│ 9     ┆ 7.7            ┆ 0.75308     │
│ 2     ┆ 0.2            ┆ 0.998852    │
│ 3     ┆ 0.7            ┆ 0.994247    │
│ 4     ┆ 1.1            ┆ 0.994565    │
│ 5     ┆ 7.5            ┆ 0.241436    │
└───────┴────────────────┴─────────────┘
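If you're on a newer Polars version (1.x), .map() has been renamed .map_batches(); the same approach would then look roughly like this (the lambda receives the struct column as a Series):
df.select(
    pl.all(),
    pl.struct('count', 'expected_count')
    .map_batches(lambda s: pl.Series(
        poisson.cdf(s.struct.field('count'), s.struct.field('expected_count'))))
    .alias('poisson_cdf')
)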
You want to take advantage of the fact that SciPy has a set of functions which are NumPy ufuncs, as those still give fast columnar operation through the NumPy API.
Specifically, you want the pdtr function from scipy.special.
You then want to use reduce rather than map or apply, as those are for generic Python functions and aren't going to perform as well.
So if we have...
df = pl.DataFrame({"count": [9,2,3,4,5], "expected_count": [7.7, 0.2, 0.7, 1.1, 7.5]})
result = poisson.cdf(df["count"].to_numpy(), df["expected_count"].to_numpy())
df = df.with_columns(pl.Series(result).alias("poission_cdf"))
then we can add to it with
df=df.with_columns([
pl.reduce(f=pdtr, exprs=[pl.col('count'),pl.col('expected_count')]).alias("poicdf")
])
df
shape: (5, 4)
┌───────┬────────────────┬──────────────┬──────────┐
│ count ┆ expected_count ┆ poisson_cdf ┆ poicdf │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════╪════════════════╪══════════════╪══════════╡
│ 9 ┆ 7.7 ┆ 0.75308 ┆ 0.75308 │
│ 2 ┆ 0.2 ┆ 0.998852 ┆ 0.998852 │
│ 3 ┆ 0.7 ┆ 0.994247 ┆ 0.994247 │
│ 4 ┆ 1.1 ┆ 0.994565 ┆ 0.994565 │
│ 5 ┆ 7.5 ┆ 0.241436 ┆ 0.241436 │
└───────┴────────────────┴──────────────┴──────────┘
You can see it gives the same answer.
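And because pl.reduce builds a regular Polars expression, the same feature should carry over to a lazy query unchanged, e.g.:
out = df.lazy().with_columns(
    pl.reduce(f=pdtr, exprs=[pl.col('count'), pl.col('expected_count')]).alias('poicdf')
).collect()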
E.g. if I have
import polars as pl
df = pl.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
how would I find the cumulative sum of each row?
Expected output:
a b
0 1 5
1 2 7
2 3 9
Here's the equivalent in pandas:
>>> import pandas as pd
>>> pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]}).cumsum(axis=1)
a b
0 1 5
1 2 7
2 3 9
but I can't figure out how to do it in polars
Edit: Polars 0.14.18 and later
As of Polars 0.14.18, we can use the new polars.cumsum function to simplify this. (Note: this is slightly different from the polars.Expr.cumsum Expression, in that it acts as a root Expression.)
Using the same DataFrame as below:
my_cols = [s.name for s in df if s.is_numeric()]
(
df
.select([
pl.exclude(my_cols),
pl.cumsum(my_cols).alias('result')
])
.unnest('result')
)
shape: (3, 4)
┌─────┬─────┬─────┬─────┐
│ id ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ a ┆ 1 ┆ 5 ┆ 12 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ 2 ┆ 7 ┆ 15 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ c ┆ 3 ┆ 9 ┆ 18 │
└─────┴─────┴─────┴─────┘
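Note for Polars 1.x: cumsum was renamed cum_sum, and the horizontal root expression is now pl.cum_sum_horizontal, which likewise returns a struct (one field per input column) that you can unnest. A sketch under that assumption:
(
    df
    .select([
        pl.exclude(my_cols),
        pl.cum_sum_horizontal(my_cols).alias('result')
    ])
    .unnest('result')
)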
Before Polars 0.14.18
Polars is column-oriented, and as such does not have the concept of an axis. Still, we can use the list evaluation context to solve this.
First, let's expand your data slightly:
df = pl.DataFrame({
"id": ['a', 'b', 'c'],
"a": [1, 2, 3],
"b": [4, 5, 6],
"c": [7, 8, 9],
})
df
shape: (3, 4)
┌─────┬─────┬─────┬─────┐
│ id ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ a ┆ 1 ┆ 4 ┆ 7 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ 2 ┆ 5 ┆ 8 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ c ┆ 3 ┆ 6 ┆ 9 │
└─────┴─────┴─────┴─────┘
The Algorithm
Here's a general-purpose performant algorithm that will solve this. We'll walk through it below.
my_cols = [s.name for s in df if s.is_numeric()]
(
df
.with_column(
pl.concat_list(my_cols)
.arr.eval(pl.element().cumsum())
.arr.to_struct(name_generator=lambda idx: my_cols[idx])
.alias('result')
)
.drop(my_cols)
.unnest('result')
)
shape: (3, 4)
┌─────┬─────┬─────┬─────┐
│ id ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ a ┆ 1 ┆ 5 ┆ 12 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ 2 ┆ 7 ┆ 15 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ c ┆ 3 ┆ 9 ┆ 18 │
└─────┴─────┴─────┴─────┘
How it works
First, we'll select the names of the numeric columns. You can name these explicitly if you like, e.g., my_cols=['a','b','c'].
Next, we'll gather up the column values into a list using polars.concat_list.
my_cols = [s.name for s in df if s.is_numeric()]
(
df
.with_column(
pl.concat_list(my_cols)
.alias('result')
)
)
shape: (3, 5)
┌─────┬─────┬─────┬─────┬───────────┐
│ id ┆ a ┆ b ┆ c ┆ result │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ list[i64] │
╞═════╪═════╪═════╪═════╪═══════════╡
│ a ┆ 1 ┆ 4 ┆ 7 ┆ [1, 4, 7] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 2 ┆ 5 ┆ 8 ┆ [2, 5, 8] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ 3 ┆ 6 ┆ 9 ┆ [3, 6, 9] │
└─────┴─────┴─────┴─────┴───────────┘
From here, we'll use the arr.eval context to run our cumsum on the list.
my_cols = [s.name for s in df if s.is_numeric()]
(
df
.with_column(
pl.concat_list(my_cols)
.arr.eval(pl.element().cumsum())
.alias('result')
)
)
shape: (3, 5)
┌─────┬─────┬─────┬─────┬────────────┐
│ id ┆ a ┆ b ┆ c ┆ result │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ list[i64] │
╞═════╪═════╪═════╪═════╪════════════╡
│ a ┆ 1 ┆ 4 ┆ 7 ┆ [1, 5, 12] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 2 ┆ 5 ┆ 8 ┆ [2, 7, 15] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ 3 ┆ 6 ┆ 9 ┆ [3, 9, 18] │
└─────┴─────┴─────┴─────┴────────────┘
Next, we'll break the list into a struct using arr.to_struct, and name the fields the corresponding names from our selected numeric columns.
my_cols = [s.name for s in df if s.is_numeric()]
(
df
.with_column(
pl.concat_list(my_cols)
.arr.eval(pl.element().cumsum())
.arr.to_struct(name_generator=lambda idx: my_cols[idx])
.alias('result')
)
)
shape: (3, 5)
┌─────┬─────┬─────┬─────┬───────────┐
│ id ┆ a ┆ b ┆ c ┆ result │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ struct[3] │
╞═════╪═════╪═════╪═════╪═══════════╡
│ a ┆ 1 ┆ 4 ┆ 7 ┆ {1,5,12} │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 2 ┆ 5 ┆ 8 ┆ {2,7,15} │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ 3 ┆ 6 ┆ 9 ┆ {3,9,18} │
└─────┴─────┴─────┴─────┴───────────┘
And finally, we'll use unnest to break the struct into columns. (But first we must drop the original columns or else we'll get two columns with the same name.)
my_cols = [s.name for s in df if s.is_numeric()]
(
df
.with_column(
pl.concat_list(my_cols)
.arr.eval(pl.element().cumsum())
.arr.to_struct(name_generator=lambda idx: my_cols[idx])
.alias('result')
)
.drop(my_cols)
.unnest('result')
)
shape: (3, 4)
┌─────┬─────┬─────┬─────┐
│ id ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ a ┆ 1 ┆ 5 ┆ 12 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ 2 ┆ 7 ┆ 15 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ c ┆ 3 ┆ 9 ┆ 18 │
└─────┴─────┴─────┴─────┘
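For newer Polars versions (1.x): with_column became with_columns, and the arr namespace for list columns became list (with Expr.cumsum now cum_sum). Assuming list.to_struct accepts the field names via its fields argument, the same pipeline would read:
(
    df
    .with_columns(
        pl.concat_list(my_cols)
        .list.eval(pl.element().cum_sum())
        .list.to_struct(fields=my_cols)
        .alias('result')
    )
    .drop(my_cols)
    .unnest('result')
)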
There may be a simpler and faster way, but here is the programmatic solution.
1. Concatenate the values along the columns into a list
2. Calculate the cumulative sum over the list (the result is still a list)
3. Get the values for each column from the result
import polars as pl
df = pl.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
df.select([
pl.concat_list(pl.all())
.arr.eval(pl.element().cumsum())
.alias('cs')
]).select([
pl.col('cs').arr.get(i).alias(name)
for i, name in enumerate(df.columns)
])
shape: (3, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 7 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 9 │
└─────┴─────┘
I have a Polars dataframe:
┌─────────┬─────────────────────────┬──────┬───────┐
│ Name    ┆ Purchase Time           ┆ Size ┆ Color │
│ ---     ┆ ---                     ┆ ---  ┆ ---   │
│ str     ┆ datetime[μs]            ┆ i64  ┆ str   │
╞═════════╪═════════════════════════╪══════╪═══════╡
│ T-Shirt ┆ 2022-02-14 14:40:09.100 ┆ 12   ┆ Blue  │
└─────────┴─────────────────────────┴──────┴───────┘
And I would like to convert each row of this dataframe into an object which contains information extracted from that row, e.g.:
PurchasedObject(Name, Size, Color)
Is it possible with polars to create a new column in the dataframe, which contains for each row the corresponding object?
Which would be the best way to achieve this in Polars?
Thank you!
Polars has a struct datatype that can be used to pack columns together in a single datatype.
import polars as pl
from datetime import datetime
pl.DataFrame({
"name": ["t-shirt"],
"purchased": [datetime(2022, 2, 4)],
"size": [12],
"color": ["blue"]
}).with_column(
pl.struct(["name", "size", "color"]).alias("purchased_struct")
)
shape: (1, 5)
┌─────────┬─────────────────────┬──────┬───────┬───────────────────────┐
│ name ┆ purchased ┆ size ┆ color ┆ purchased_struct │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ datetime[μs] ┆ i64 ┆ str ┆ struct[3] │
╞═════════╪═════════════════════╪══════╪═══════╪═══════════════════════╡
│ t-shirt ┆ 2022-02-04 00:00:00 ┆ 12 ┆ blue ┆ {"t-shirt",12,"blue"} │
└─────────┴─────────────────────┴──────┴───────┴───────────────────────┘
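If you then want actual Python objects, a struct Series converts to a list of dicts via to_list, so you could construct the (hypothetical) PurchasedObject from the question roughly like this, assuming the frame above is bound to df:
from dataclasses import dataclass

@dataclass
class PurchasedObject:
    name: str
    size: int
    color: str

objects = [PurchasedObject(**row) for row in df['purchased_struct'].to_list()]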
So consider this snippet
import polars as pl
df = pl.DataFrame({'class': ['a', 'a', 'b', 'b'], 'name': ['Ron', 'Jon', 'Don', 'Von'], 'score': [0.2, 0.5, 0.3, 0.4]})
df.groupby('class').agg([pl.col('score').max()])
This gives me:
class score
b 0.4
a 0.5
But I want the entire row of the group that corresponded to the maximum score. I can do a join with the original dataframe like
sdf = df.groupby('class').agg([pl.col('score').max()])
sdf.join(df, on=['class', 'score'])
To get
class score name
a 0.5 Jon
b 0.4 Von
Is there any way to avoid the join and include the name column as part of the groupby aggregation?
You can use a sort_by expression to sort your observations in each group by score, and then use the last expression to take the last observation.
For example, to take all columns:
df.groupby('class').agg([
pl.all().sort_by('score').last(),
])
shape: (2, 3)
┌───────┬──────┬───────┐
│ class ┆ name ┆ score │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞═══════╪══════╪═══════╡
│ a ┆ Jon ┆ 0.5 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ Von ┆ 0.4 │
└───────┴──────┴───────┘
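Aside: on newer Polars versions (1.x), groupby is spelled group_by; the aggregation itself carries over unchanged:
df.group_by('class').agg(pl.all().sort_by('score').last())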
Edit: using over
If you have more than one observation that is the max, another easy way to get all rows is to use over.
For example, if your data has two students in class b ('Von' and 'Yvonne') who tied for highest score:
df = pl.DataFrame(
{
"class": ["a", "a", "b", "b", "b"],
"name": ["Ron", "Jon", "Don", "Von", "Yvonne"],
"score": [0.2, 0.5, 0.3, 0.4, 0.4],
}
)
df
shape: (5, 3)
┌───────┬────────┬───────┐
│ class ┆ name ┆ score │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞═══════╪════════╪═══════╡
│ a ┆ Ron ┆ 0.2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ Jon ┆ 0.5 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ Don ┆ 0.3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ Von ┆ 0.4 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ Yvonne ┆ 0.4 │
└───────┴────────┴───────┘
df.filter(pl.col('score') == pl.col('score').max().over('class'))
shape: (3, 3)
┌───────┬────────┬───────┐
│ class ┆ name ┆ score │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞═══════╪════════╪═══════╡
│ a ┆ Jon ┆ 0.5 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ Von ┆ 0.4 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ Yvonne ┆ 0.4 │
└───────┴────────┴───────┘