Polars get grouped rows where column value is maximum

Polars get grouped rows where column value is maximum - python

So consider this snippet
import polars as pl
df = pl.DataFrame({'class': ['a', 'a', 'b', 'b'], 'name': ['Ron', 'Jon', 'Don', 'Von'], 'score': [0.2, 0.5, 0.3, 0.4]})
df.groupby('class').agg([pl.col('score').max()])
This gives me:
class score
b 0.4
a 0.5
But I want the entire row of the group that corresponded to the maximum score. I can do a join with the original dataframe like
sdf = df.groupby('class').agg([pl.col('score').max()])
sdf.join(df, on=['class', 'score'])
To get
class score name
a 0.5 Jon
b 0.4 Von
Is there any way to avoid the join and include the name column as part of the groupby aggregation?

You can use a sort_by expression to sort your observations in each group by score, and then use the last expression to take the last observation.
For example, to take all columns:
df.groupby('class').agg([
pl.all().sort_by('score').last(),
])
shape: (2, 3)
┌───────┬──────┬───────┐
│ class ┆ name ┆ score │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞═══════╪══════╪═══════╡
│ a ┆ Jon ┆ 0.5 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ Von ┆ 0.4 │
└───────┴──────┴───────┘
Edit: using over
If you have more than one observation that is the max, another easy way to get all rows is to use over.
For example, if your data has two students in class b ('Von' and 'Yvonne') who tied for highest score:
df = pl.DataFrame(
{
"class": ["a", "a", "b", "b", "b"],
"name": ["Ron", "Jon", "Don", "Von", "Yvonne"],
"score": [0.2, 0.5, 0.3, 0.4, 0.4],
}
)
df
shape: (5, 3)
┌───────┬────────┬───────┐
│ class ┆ name ┆ score │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞═══════╪════════╪═══════╡
│ a ┆ Ron ┆ 0.2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ Jon ┆ 0.5 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ Don ┆ 0.3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ Von ┆ 0.4 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ Yvonne ┆ 0.4 │
└───────┴────────┴───────┘
df.filter(pl.col('score') == pl.col('score').max().over('class'))
shape: (3, 3)
┌───────┬────────┬───────┐
│ class ┆ name ┆ score │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞═══════╪════════╪═══════╡
│ a ┆ Jon ┆ 0.5 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ Von ┆ 0.4 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ Yvonne ┆ 0.4 │
└───────┴────────┴───────┘

Related

Split a dataframe into n dataframes by column value in polars

I have a large Polars dataframe that I'd like to split into n number of dataframes given the size. Like take dataframe and split it into 2 or 3 or 5 dataframes.
There are several observations that will show up for each column and would like to choose splitting into a chosen number of dataframes. A simple example is like the following where I am splitting on a specific id, but would like to have similar behave, but more like split into 2 approximately even dataframes since the full example has a large number of identifiers.
df = pl.DataFrame({'Identifier': [1234,1234, 2345,2345],
'DateColumn': ['2022-02-13','2022-02-14', '2022-02-13',
'2022-02-14']
})
df2 = df.with_columns(
[
pl.col('DateColumn').str.strptime(pl.Date).cast(pl.Date)
]
)
print(df)
┌────────────┬────────────┐
│ Identifier ┆ DateColumn │
│ --- ┆ --- │
│ i64 ┆ str │
╞════════════╪════════════╡
│ 1234 ┆ 2022-02-13 │
│ 1234 ┆ 2022-02-14 │
│ 2345 ┆ 2022-02-13 │
│ 2345 ┆ 2022-02-14 │
└────────────┴────────────┘
df1 = df.filter(
pl.col('Identifier')==1234
)
df2 = df.filter(
pl.col('Identifier')==2345
)
print(df1)
shape: (2, 2)
┌────────────┬────────────┐
│ Identifier ┆ DateColumn │
│ --- ┆ --- │
│ i64 ┆ str │
╞════════════╪════════════╡
│ 1234 ┆ 2022-02-13 │
│ 1234 ┆ 2022-02-14 │
└────────────┴────────────┘
print(df2)
┌────────────┬────────────┐
│ Identifier ┆ DateColumn │
│ --- ┆ --- │
│ i64 ┆ str │
╞════════════╪════════════╡
│ 2345 ┆ 2022-02-13 │
│ 2345 ┆ 2022-02-14 │
└────────────┴────────────┘

If you want to divide your DataFrame by let's say your identifier, the best way to do so is use the partition_by method.
df = pl.DataFrame({
"foo": ["A", "A", "B", "B", "C"],
"N": [1, 2, 2, 4, 2],
"bar": ["k", "l", "m", "m", "l"],
})
df.partition_by(groups="foo", maintain_order=True)
[shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ N ┆ bar │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ A ┆ 1 ┆ k │
│ A ┆ 2 ┆ l │
└─────┴─────┴─────┘,
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ N ┆ bar │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ B ┆ 2 ┆ m │
│ B ┆ 4 ┆ m │
└─────┴─────┴─────┘,
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ N ┆ bar │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ C ┆ 2 ┆ l │
└─────┴─────┴─────┘]
https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.partition_by.html
This automatically divides the DataFrame by values in a column.

How to Write Poisson CDF as Python Polars Expression

I have a collection of polars expressions being used to generate features for an ML model. I'd like to add a poission cdf feature to this collection whilst maintaining lazy execution (with benefits of speed, caching etc...). I so far have not found an easy way of achieving this.
I've been able to get the result I'd like outside of the desired lazy expression framework with:
import polars as pl
from scipy.stats import poisson
df = pl.DataFrame({"count": [9,2,3,4,5], "expected_count": [7.7, 0.2, 0.7, 1.1, 7.5]})
result = poisson.cdf(df["count"].to_numpy(), df["expected_count"].to_numpy())
df = df.with_column(pl.Series(result).alias("poission_cdf"))
However, in reality I'd like this to look like:
df = pl.DataFrame({"count": [9,2,3,4,5], "expected_count": [7.7, 0.2, 0.7, 1.1, 7.5]})
df = df.select(
[
... # bunch of other expressions here
poisson_cdf()
]
)
where poisson_cdf is some polars expression like:
def poisson_cdf():
# this is just for illustration, clearly wont work
return scipy.stats.poisson.cdf(pl.col("count"), pl.col("expected_count")).alias("poisson_cdf")
I also tried using a struct made up of "count" and "expected_count" and apply like advised in the docs when applying custom functions. However, my dataset is several millions of rows in reality - leading to absurd execution time.
Any advice or guidance here would be appreciated. Ideally there exists an expression like this somewhere out there? Thanks in advance!

If scipy.stats.poisson.cdf was implemented as a proper numpy universal function, it would be possible to use it directly on polars expressions, but it is not. Fortunately, Poisson CDF is almost the same as regularized upper incomplete gamma function for which scipy supplies gammaincc which can be used in polars expressions:
>>> import polars as pl
>>> from scipy.special import gammaincc
>>> df = pl.select(pl.arange(0, 10).alias('k'))
>>> df.with_columns(cdf=gammaincc(pl.col('k') + 1, 4.0))
shape: (10, 2)
┌─────┬──────────┐
│ k ┆ cdf │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪══════════╡
│ 0 ┆ 0.018316 │
│ 1 ┆ 0.091578 │
│ 2 ┆ 0.238103 │
│ 3 ┆ 0.43347 │
│ ... ┆ ... │
│ 6 ┆ 0.889326 │
│ 7 ┆ 0.948866 │
│ 8 ┆ 0.978637 │
│ 9 ┆ 0.991868 │
└─────┴──────────┘
The result is the same as returned by poisson.cdf:
>>> _.with_columns(cdf2=pl.lit(poisson.cdf(df['k'], 4)))
shape: (10, 3)
┌─────┬──────────┬──────────┐
│ k ┆ cdf ┆ cdf2 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 │
╞═════╪══════════╪══════════╡
│ 0 ┆ 0.018316 ┆ 0.018316 │
│ 1 ┆ 0.091578 ┆ 0.091578 │
│ 2 ┆ 0.238103 ┆ 0.238103 │
│ 3 ┆ 0.43347 ┆ 0.43347 │
│ ... ┆ ... ┆ ... │
│ 6 ┆ 0.889326 ┆ 0.889326 │
│ 7 ┆ 0.948866 ┆ 0.948866 │
│ 8 ┆ 0.978637 ┆ 0.978637 │
│ 9 ┆ 0.991868 ┆ 0.991868 │
└─────┴──────────┴──────────┘

It sounds like you want to use .map() instead of .apply() - which will pass whole columns at once.
df.select([
pl.all(),
# ...
pl.struct(["count", "expected_count"])
.map(lambda x:
poisson.cdf(x.struct.field("count"), x.struct.field("expected_count")))
.flatten()
.alias("poisson_cdf")
])
shape: (5, 3)
┌───────┬────────────────┬─────────────┐
│ count | expected_count | poisson_cdf │
│ --- | --- | --- │
│ i64 | f64 | f64 │
╞═══════╪════════════════╪═════════════╡
│ 9 | 7.7 | 0.75308 │
│ 2 | 0.2 | 0.998852 │
│ 3 | 0.7 | 0.994247 │
│ 4 | 1.1 | 0.994565 │
│ 5 | 7.5 | 0.241436 │
└───────┴────────────────┴─────────────┘

You want to take advantage of the fact that scipy has a set of functions which are numpy ufuncs as those
still have fast columnar operation through the NumPy API.
Specifically you want the pdtr function.
You then want to use reduce rather than map or apply as those are for generic python functions and aren't going to perform as well.
So if we have...
df = pl.DataFrame({"count": [9,2,3,4,5], "expected_count": [7.7, 0.2, 0.7, 1.1, 7.5]})
result = poisson.cdf(df["count"].to_numpy(), df["expected_count"].to_numpy())
df = df.with_columns(pl.Series(result).alias("poission_cdf"))
then we can add to it with
df=df.with_columns([
pl.reduce(f=pdtr, exprs=[pl.col('count'),pl.col('expected_count')]).alias("poicdf")
])
df
shape: (5, 4)
┌───────┬────────────────┬──────────────┬──────────┐
│ count ┆ expected_count ┆ poission_cdf ┆ poicdf │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════╪════════════════╪══════════════╪══════════╡
│ 9 ┆ 7.7 ┆ 0.75308 ┆ 0.75308 │
│ 2 ┆ 0.2 ┆ 0.998852 ┆ 0.998852 │
│ 3 ┆ 0.7 ┆ 0.994247 ┆ 0.994247 │
│ 4 ┆ 1.1 ┆ 0.994565 ┆ 0.994565 │
│ 5 ┆ 7.5 ┆ 0.241436 ┆ 0.241436 │
└───────┴────────────────┴──────────────┴──────────┘
You can see it gives the same answer.

Take cumsum of each row in polars

E.g. if I have
import polars as pl
df = pl.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
how would I find the cumulative sum of each row?
Expected output:
a b
0 1 5
1 2 7
2 3 9
Here's the equivalent in pandas:
>>> import pandas as pd
>>> pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]}).cumsum(axis=1)
a b
0 1 5
1 2 7
2 3 9
but I can't figure out how to do it in polars

Edit: Polars 0.14.18 and later
As of Polars 0.14.18, we can use the new polars.cumsum function to simplify this. (Note: this is slightly different than the polars.Expr.cumsum Expression, in that it acts as a root Expression.)
Using the same DataFrame as below:
my_cols = [s.name for s in df if s.is_numeric()]
(
df
.select([
pl.exclude(my_cols),
pl.cumsum(my_cols).alias('result')
])
.unnest('result')
)
shape: (3, 4)
┌─────┬─────┬─────┬─────┐
│ id ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ a ┆ 1 ┆ 5 ┆ 12 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ 2 ┆ 7 ┆ 15 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ c ┆ 3 ┆ 9 ┆ 18 │
└─────┴─────┴─────┴─────┘
Before Polars 0.14.18
Polars is column-oriented, and as such does not have the concept of a axis. Still, we can use the list evaluation context to solve this.
First, let's expand you data slightly:
df = pl.DataFrame({
"id": ['a', 'b', 'c'],
"a": [1, 2, 3],
"b": [4, 5, 6],
"c": [7, 8, 9],
})
df
shape: (3, 4)
┌─────┬─────┬─────┬─────┐
│ id ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ a ┆ 1 ┆ 4 ┆ 7 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ 2 ┆ 5 ┆ 8 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ c ┆ 3 ┆ 6 ┆ 9 │
└─────┴─────┴─────┴─────┘
The Algorithm
Here's a general-purpose performant algorithm that will solve this. We'll walk through it below.
my_cols = [s.name for s in df if s.is_numeric()]
(
df
.with_column(
pl.concat_list(my_cols)
.arr.eval(pl.element().cumsum())
.arr.to_struct(name_generator=lambda idx: my_cols[idx])
.alias('result')
)
.drop(my_cols)
.unnest('result')
)
shape: (3, 4)
┌─────┬─────┬─────┬─────┐
│ id ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ a ┆ 1 ┆ 5 ┆ 12 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ 2 ┆ 7 ┆ 15 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ c ┆ 3 ┆ 9 ┆ 18 │
└─────┴─────┴─────┴─────┘
How it works
First, we'll select the names of the numeric columns. You can name these explicitly if you like, e.g., my_cols=['a','b','c'].
Next, we'll gather up the column values into a list using polars.concat_list.
my_cols = [s.name for s in df if s.is_numeric()]
(
df
.with_column(
pl.concat_list(my_cols)
.alias('result')
)
)
shape: (3, 5)
┌─────┬─────┬─────┬─────┬───────────┐
│ id ┆ a ┆ b ┆ c ┆ result │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ list[i64] │
╞═════╪═════╪═════╪═════╪═══════════╡
│ a ┆ 1 ┆ 4 ┆ 7 ┆ [1, 4, 7] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 2 ┆ 5 ┆ 8 ┆ [2, 5, 8] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ 3 ┆ 6 ┆ 9 ┆ [3, 6, 9] │
└─────┴─────┴─────┴─────┴───────────┘
From here, we'll use the arr.eval context to run our cumsum on the list.
my_cols = [s.name for s in df if s.is_numeric()]
(
df
.with_column(
pl.concat_list(my_cols)
.arr.eval(pl.element().cumsum())
.alias('result')
)
)
shape: (3, 5)
┌─────┬─────┬─────┬─────┬────────────┐
│ id ┆ a ┆ b ┆ c ┆ result │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ list[i64] │
╞═════╪═════╪═════╪═════╪════════════╡
│ a ┆ 1 ┆ 4 ┆ 7 ┆ [1, 5, 12] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 2 ┆ 5 ┆ 8 ┆ [2, 7, 15] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ 3 ┆ 6 ┆ 9 ┆ [3, 9, 18] │
└─────┴─────┴─────┴─────┴────────────┘
Next, we'll break the list into a struct using arr.to_struct, and name the fields the corresponding names from our selected numeric columns.
my_cols = [s.name for s in df if s.is_numeric()]
(
df
.with_column(
pl.concat_list(my_cols)
.arr.eval(pl.element().cumsum())
.arr.to_struct(name_generator=lambda idx: my_cols[idx])
.alias('result')
)
)
shape: (3, 5)
┌─────┬─────┬─────┬─────┬───────────┐
│ id ┆ a ┆ b ┆ c ┆ result │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ struct[3] │
╞═════╪═════╪═════╪═════╪═══════════╡
│ a ┆ 1 ┆ 4 ┆ 7 ┆ {1,5,12} │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 2 ┆ 5 ┆ 8 ┆ {2,7,15} │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ 3 ┆ 6 ┆ 9 ┆ {3,9,18} │
└─────┴─────┴─────┴─────┴───────────┘
And finally, we'll use unnest to break the struct into columns. (But first we must drop the original columns or else we'll get two columns with the same name.)
my_cols = [s.name for s in df if s.is_numeric()]
(
df
.with_column(
pl.concat_list(my_cols)
.arr.eval(pl.element().cumsum())
.arr.to_struct(name_generator=lambda idx: my_cols[idx])
.alias('result')
)
.drop(my_cols)
.unnest('result')
)
shape: (3, 4)
┌─────┬─────┬─────┬─────┐
│ id ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ a ┆ 1 ┆ 5 ┆ 12 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ 2 ┆ 7 ┆ 15 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ c ┆ 3 ┆ 9 ┆ 18 │
└─────┴─────┴─────┴─────┘

There may be a simpler and faster way, but here is the programmatic solution.
Concatenate the values along the columns into a list
Calculate the cumulative sum over the list (the result is still a list)
Get values for each column in the result
import polars as pl
df = pl.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
df.select([
pl.concat_list(pl.all())
.arr.eval(pl.element().cumsum())
.alias('cs')
]).select([
pl.col('cs').arr.get(i).alias(name)
for i, name in enumerate(df.columns)
])
shape: (3, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 7 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 9 │
└─────┴─────┘

Convert each dataframe row to Object in Polars

I have a Polars dataframe:
┌───────────┬────────────┬────────────┬────────────┬
│ Name ┆ Purchase ┆ Size ┆ Color ┆
│ ┆ Time ┆ ┆ ┆
│ --- ┆ --- ┆ --- ┆ ┆
│ str ┆ datetime[μ ┆ i64 ┆ --- ┆
│ ┆ s] ┆ ┆ str ┆
╞═══════════╪════════════╪════════════╪════════════╪
│ T-Shirt ┆ 2022-02-14 ┆ 12 ┆ Blue ┆
│ ┆ 14:40:09.1 ┆ ┆ ┆
│ ┆ 00 ┆ ┆ ┆
└───────────┴────────────┴────────────┴────────────
And I would like to convert each row of this dataframe into an object which contains information extracted from the rows, e.g:
PurchasedObject(Name, Size, Color)
Is it possible with polars to create a new column in the dataframe, which contains for each row the corresponding object?
Which would be the best way to achieve this in Polars?
Thank you!

Polars has a struct datatype that can be used to pack columns together in a single datatype.
import polars as pl
from datetime import datetime
pl.DataFrame({
"name": ["t-shirt"],
"purchased": [datetime(2022, 2, 4)],
"size": [12],
"color": ["blue"]
}).with_column(
pl.struct(["name", "size", "color"]).alias("purchased_struct")
)
shape: (1, 5)
┌─────────┬─────────────────────┬──────┬───────┬───────────────────────┐
│ name ┆ purchased ┆ size ┆ color ┆ purchased_struct │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ datetime[μs] ┆ i64 ┆ str ┆ struct[3] │
╞═════════╪═════════════════════╪══════╪═══════╪═══════════════════════╡
│ t-shirt ┆ 2022-02-04 00:00:00 ┆ 12 ┆ blue ┆ {"t-shirt",12,"blue"} │
└─────────┴─────────────────────┴──────┴───────┴───────────────────────┘

Polars python equivalent to glimpse and summary in R

I couldn't find a function that would summarize the content in the polars dataframe just like glimpse and summary do it in R?

Polars has a describe method:
df = pl.DataFrame({
'a': [1.0, 2.8, 3.0],
'b': [4, 5, 6],
"c": [True, False, True]
})
df.describe()
shape: (5, 4)
╭──────────┬───────┬─────┬──────╮
│ describe ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 │
╞══════════╪═══════╪═════╪══════╡
│ "mean" ┆ 2.267 ┆ 5 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "std" ┆ 1.102 ┆ 1 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "min" ┆ 1 ┆ 4 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "max" ┆ 3 ┆ 6 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "median" ┆ 2.8 ┆ 5 ┆ null │
Which reports, like R's summary, descriptive statistics per column. I have not used glimpse before, but a quick Google suggests it does something similar to Polar's head, but then with the output stacked vertically, so it is easier to digest when there are many columns.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Polars get grouped rows where column value is maximum - python

Related

Split a dataframe into n dataframes by column value in polars

How to Write Poisson CDF as Python Polars Expression

Take cumsum of each row in polars

Convert each dataframe row to Object in Polars

Polars python equivalent to glimpse and summary in R

Categories

Resources