How to Write Poisson CDF as Python Polars Expression

I have a collection of polars expressions being used to generate features for an ML model. I'd like to add a Poisson CDF feature to this collection whilst maintaining lazy execution (with its benefits of speed, caching, etc.). So far I have not found an easy way of achieving this.
I've been able to get the result I'd like outside of the desired lazy expression framework with:
import polars as pl
from scipy.stats import poisson
df = pl.DataFrame({"count": [9,2,3,4,5], "expected_count": [7.7, 0.2, 0.7, 1.1, 7.5]})
result = poisson.cdf(df["count"].to_numpy(), df["expected_count"].to_numpy())
df = df.with_columns(pl.Series(result).alias("poisson_cdf"))
However, in reality I'd like this to look like:
df = pl.DataFrame({"count": [9,2,3,4,5], "expected_count": [7.7, 0.2, 0.7, 1.1, 7.5]})
df = df.select(
[
... # bunch of other expressions here
poisson_cdf()
]
)
where poisson_cdf is some polars expression like:
def poisson_cdf():
    # this is just for illustration, clearly won't work
    return scipy.stats.poisson.cdf(pl.col("count"), pl.col("expected_count")).alias("poisson_cdf")
I also tried building a struct from "count" and "expected_count" and using apply, as the docs advise for custom functions. However, my real dataset has several million rows, which leads to absurd execution times.
Any advice or guidance here would be appreciated. Ideally there exists an expression like this somewhere out there? Thanks in advance!

If scipy.stats.poisson.cdf were implemented as a proper NumPy universal function (ufunc), it could be used directly on polars expressions, but it is not. Fortunately, the Poisson CDF can be written in terms of the regularized upper incomplete gamma function Q: P(X ≤ k) = Q(k + 1, μ). SciPy supplies Q as the ufunc scipy.special.gammaincc, which can be used in polars expressions:
>>> import polars as pl
>>> from scipy.special import gammaincc
>>> df = pl.select(pl.arange(0, 10).alias('k'))
>>> df.with_columns(cdf=gammaincc(pl.col('k') + 1, 4.0))
shape: (10, 2)
┌─────┬──────────┐
│ k ┆ cdf │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪══════════╡
│ 0 ┆ 0.018316 │
│ 1 ┆ 0.091578 │
│ 2 ┆ 0.238103 │
│ 3 ┆ 0.43347 │
│ ... ┆ ... │
│ 6 ┆ 0.889326 │
│ 7 ┆ 0.948866 │
│ 8 ┆ 0.978637 │
│ 9 ┆ 0.991868 │
└─────┴──────────┘
The result is the same as returned by poisson.cdf:
>>> from scipy.stats import poisson
>>> _.with_columns(cdf2=pl.lit(poisson.cdf(df['k'], 4)))
shape: (10, 3)
┌─────┬──────────┬──────────┐
│ k ┆ cdf ┆ cdf2 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 │
╞═════╪══════════╪══════════╡
│ 0 ┆ 0.018316 ┆ 0.018316 │
│ 1 ┆ 0.091578 ┆ 0.091578 │
│ 2 ┆ 0.238103 ┆ 0.238103 │
│ 3 ┆ 0.43347 ┆ 0.43347 │
│ ... ┆ ... ┆ ... │
│ 6 ┆ 0.889326 ┆ 0.889326 │
│ 7 ┆ 0.948866 ┆ 0.948866 │
│ 8 ┆ 0.978637 ┆ 0.978637 │
│ 9 ┆ 0.991868 ┆ 0.991868 │
└─────┴──────────┴──────────┘
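The identity used here, P(X ≤ k) = Q(k + 1, μ), can also be sanity-checked without SciPy or Polars by summing the Poisson probability mass function directly. The sketch below does that for μ = 4.0 and prints values that can be compared with the first rows of the table above; poisson_cdf_direct is a hypothetical helper name, not part of any library.

```python
import math

def poisson_cdf_direct(k: int, mu: float) -> float:
    """P(X <= k) for X ~ Poisson(mu), by summing the probability mass function."""
    return sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k + 1))

# Compare with the first rows of the table above (mu = 4.0).
for k in range(4):
    print(k, round(poisson_cdf_direct(k, 4.0), 6))
```

The direct sum is only practical for small k; for whole columns the ufunc route shown above is the one to use.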

It sounds like you want to use .map() instead of .apply(), since .map() passes whole columns at once.
df.select([
pl.all(),
# ...
pl.struct(["count", "expected_count"])
.map(lambda x:
poisson.cdf(x.struct.field("count"), x.struct.field("expected_count")))
.flatten()
.alias("poisson_cdf")
])
shape: (5, 3)
┌───────┬────────────────┬─────────────┐
│ count | expected_count | poisson_cdf │
│ --- | --- | --- │
│ i64 | f64 | f64 │
╞═══════╪════════════════╪═════════════╡
│ 9 | 7.7 | 0.75308 │
│ 2 | 0.2 | 0.998852 │
│ 3 | 0.7 | 0.994247 │
│ 4 | 1.1 | 0.994565 │
│ 5 | 7.5 | 0.241436 │
└───────┴────────────────┴─────────────┘

You want to take advantage of the fact that SciPy has a set of functions that are NumPy ufuncs, as those still have fast columnar operation through the NumPy API.
Specifically, you want the pdtr function from scipy.special, which computes the Poisson CDF.
You then want to use reduce rather than map or apply, as those are for generic Python functions and aren't going to perform as well.
So if we have...
df = pl.DataFrame({"count": [9,2,3,4,5], "expected_count": [7.7, 0.2, 0.7, 1.1, 7.5]})
result = poisson.cdf(df["count"].to_numpy(), df["expected_count"].to_numpy())
df = df.with_columns(pl.Series(result).alias("poission_cdf"))
then we can add to it with
from scipy.special import pdtr
df = df.with_columns([
    pl.reduce(f=pdtr, exprs=[pl.col('count'), pl.col('expected_count')]).alias("poicdf")
])
df
shape: (5, 4)
┌───────┬────────────────┬──────────────┬──────────┐
│ count ┆ expected_count ┆ poission_cdf ┆ poicdf │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════╪════════════════╪══════════════╪══════════╡
│ 9 ┆ 7.7 ┆ 0.75308 ┆ 0.75308 │
│ 2 ┆ 0.2 ┆ 0.998852 ┆ 0.998852 │
│ 3 ┆ 0.7 ┆ 0.994247 ┆ 0.994247 │
│ 4 ┆ 1.1 ┆ 0.994565 ┆ 0.994565 │
│ 5 ┆ 7.5 ┆ 0.241436 ┆ 0.241436 │
└───────┴────────────────┴──────────────┴──────────┘
You can see it gives the same answer.

Related

Split a dataframe into n dataframes by column value in polars

I have a large Polars dataframe that I'd like to split into n dataframes of roughly equal size, for example 2, 3, or 5 dataframes.
Each identifier appears in several rows, and I'd like to choose how many dataframes to split into. A simple example follows, where I split on a specific id. I'd like similar behaviour, but splitting into, say, 2 approximately even dataframes, since the full dataset has a large number of identifiers.
df = pl.DataFrame({'Identifier': [1234,1234, 2345,2345],
'DateColumn': ['2022-02-13','2022-02-14', '2022-02-13',
'2022-02-14']
})
df2 = df.with_columns(
[
pl.col('DateColumn').str.strptime(pl.Date).cast(pl.Date)
]
)
print(df)
┌────────────┬────────────┐
│ Identifier ┆ DateColumn │
│ --- ┆ --- │
│ i64 ┆ str │
╞════════════╪════════════╡
│ 1234 ┆ 2022-02-13 │
│ 1234 ┆ 2022-02-14 │
│ 2345 ┆ 2022-02-13 │
│ 2345 ┆ 2022-02-14 │
└────────────┴────────────┘
df1 = df.filter(
pl.col('Identifier')==1234
)
df2 = df.filter(
pl.col('Identifier')==2345
)
print(df1)
shape: (2, 2)
┌────────────┬────────────┐
│ Identifier ┆ DateColumn │
│ --- ┆ --- │
│ i64 ┆ str │
╞════════════╪════════════╡
│ 1234 ┆ 2022-02-13 │
│ 1234 ┆ 2022-02-14 │
└────────────┴────────────┘
print(df2)
┌────────────┬────────────┐
│ Identifier ┆ DateColumn │
│ --- ┆ --- │
│ i64 ┆ str │
╞════════════╪════════════╡
│ 2345 ┆ 2022-02-13 │
│ 2345 ┆ 2022-02-14 │
└────────────┴────────────┘
If you want to divide your DataFrame by, say, your identifier, the best way to do so is to use the partition_by method.
df = pl.DataFrame({
"foo": ["A", "A", "B", "B", "C"],
"N": [1, 2, 2, 4, 2],
"bar": ["k", "l", "m", "m", "l"],
})
df.partition_by(groups="foo", maintain_order=True)
[shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ N ┆ bar │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ A ┆ 1 ┆ k │
│ A ┆ 2 ┆ l │
└─────┴─────┴─────┘,
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ N ┆ bar │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ B ┆ 2 ┆ m │
│ B ┆ 4 ┆ m │
└─────┴─────┴─────┘,
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ N ┆ bar │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ C ┆ 2 ┆ l │
└─────┴─────┴─────┘]
https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.partition_by.html
This automatically divides the DataFrame by values in a column.
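partition_by yields one DataFrame per distinct value. If the goal is a fixed number n of roughly equal parts, one option is to map each identifier to a bucket number first and then partition on that bucket. The plain-Python sketch below shows only the bucket assignment; assign_buckets is a hypothetical helper, not a Polars function, and it balances the number of identifiers per bucket, not the number of rows.

```python
def assign_buckets(identifiers, n):
    """Map each distinct identifier to one of n buckets, round-robin."""
    unique_ids = sorted(set(identifiers))
    return {ident: i % n for i, ident in enumerate(unique_ids)}

ids = [1234, 1234, 2345, 2345, 3456, 4567, 4567]
buckets = assign_buckets(ids, 2)
print(buckets)  # {1234: 0, 2345: 1, 3456: 0, 4567: 1}
```

The resulting mapping could be joined back onto the DataFrame as an extra column, which partition_by can then split on.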

How can I rotate/shift/increment one particular column's values in Polars DataFrame?

I have a polars dataframe as follows:
df = pl.DataFrame(
dict(
day=[1, 1, 1, 3, 3, 3, 5, 5, 8, 8, 9, 9, 9],
value=[1, 2, 2, 3, 5, 2, 1, 2, 7, 3, 5, 3, 4],
)
)
I want to incrementally rotate the values in column 'day'. By incremental rotation, I mean: change each value to the next larger value that exists in the column, and if the value is already the largest, change it to null/None.
Basically, the result I expect should be the following:
pl.DataFrame(
dict(
day=[3, 3, 3, 5, 5, 5, 8, 8, 9, 9, None, None, None],
value=[1, 2, 2, 3, 5, 2, 1, 2, 7, 3, 5, 3, 4],
)
)
Is there some particular polars-python idiomatic way to achieve this?
If day is sorted - you could group together - shift - then explode back?
(df.groupby("day", maintain_order=True)
.agg_list()
.with_columns(pl.col("day").shift(-1))
.explode(pl.exclude("day")))
shape: (13, 2)
┌──────┬───────┐
│ day | value │
│ --- | --- │
│ i64 | i64 │
╞══════╪═══════╡
│ 3 | 1 │
│ 3 | 2 │
│ 3 | 2 │
│ 5 | 3 │
│ 5 | 5 │
│ 5 | 2 │
│ 8 | 1 │
│ 8 | 2 │
│ 9 | 7 │
│ 9 | 3 │
│ null | 5 │
│ null | 3 │
│ null | 4 │
└──────┴───────┘
Perhaps another approach is to .rank() the column.
.search_sorted() for rank + 1 could find the positions of the next "group".
The max values could be nulled out then passed to .take() to get the new values.
(df.with_columns(
pl.col("day").rank("dense")
.cast(pl.Int64)
.alias("rank"))
.with_columns(
pl.col("rank")
.search_sorted(pl.col("rank") + 1)
.alias("idx"))
.with_columns(
pl.when(pl.col("idx") != pl.col("idx").max())
.then(pl.col("idx"))
.alias("idx"))
.with_columns(
pl.col("day").take(pl.col("idx"))
.alias("new"))
)
shape: (13, 5)
┌─────┬───────┬──────┬──────┬──────┐
│ day | value | rank | idx | new │
│ --- | --- | --- | --- | --- │
│ i64 | i64 | i64 | u32 | i64 │
╞═════╪═══════╪══════╪══════╪══════╡
│ 1 | 1 | 1 | 3 | 3 │
│ 1 | 2 | 1 | 3 | 3 │
│ 1 | 2 | 1 | 3 | 3 │
│ 3 | 3 | 2 | 6 | 5 │
│ 3 | 5 | 2 | 6 | 5 │
│ 3 | 2 | 2 | 6 | 5 │
│ 5 | 1 | 3 | 8 | 8 │
│ 5 | 2 | 3 | 8 | 8 │
│ 8 | 7 | 4 | 10 | 9 │
│ 8 | 3 | 4 | 10 | 9 │
│ 9 | 5 | 5 | null | null │
│ 9 | 3 | 5 | null | null │
│ 9 | 4 | 5 | null | null │
└─────┴───────┴──────┴──────┴──────┘
Feels like I'm missing an obvious simpler approach here..
@jqurious, what I'd recommend for remapping values is a join. Joins are heavily optimized and scale very well, especially on machines with a good number of cores.
As an example, let's benchmark some solutions.
First, some data
Let's use enough data to avoid spurious results from "microbenchmarking" using tiny datasets. (I see this all too often - tiny datasets with benchmark results down to a few microseconds or milliseconds.)
On my 32-core system with 512 GB of RAM, that means expanding the dataset to one billion records. (Choose a different value below as appropriate for your computing platform.)
import polars as pl
import numpy as np
import time
rng = np.random.default_rng(1)
nbr_rows = 1_000_000_000
df = pl.DataFrame(
dict(
day=rng.integers(1, 1_000_000, nbr_rows),
value=rng.integers(1, 1_000_000, nbr_rows),
)
).with_row_count()
df
shape: (1000000000, 3)
┌───────────┬────────┬────────┐
│ row_nr ┆ day ┆ value │
│ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 │
╞═══════════╪════════╪════════╡
│ 0 ┆ 473189 ┆ 747152 │
│ 1 ┆ 511822 ┆ 298575 │
│ 2 ┆ 755167 ┆ 868027 │
│ 3 ┆ 950463 ┆ 289295 │
│ ... ┆ ... ┆ ... │
│ 999999996 ┆ 828237 ┆ 503917 │
│ 999999997 ┆ 909996 ┆ 447681 │
│ 999999998 ┆ 309104 ┆ 588174 │
│ 999999999 ┆ 485525 ┆ 198567 │
└───────────┴────────┴────────┘
Assumption: Not sorted
Let's suppose that we cannot assume that the data is sorted by day. (We'll have to adapt the solutions somewhat.)
Join
Here are the results using a join. If you watch your CPU usage, for example using top in Linux, you'll see that the algorithm is heavily multi-threaded. It spends the majority of its time spread across all cores of your system.
start = time.perf_counter()
(
df
.join(
df
.select(pl.col('day').unique().sort())
.with_columns(
pl.col('day').shift(-1).alias('new_day')
),
how='inner',
on='day',
)
)
print(time.perf_counter() - start)
shape: (1000000000, 4)
┌───────────┬────────┬────────┬─────────┐
│ row_nr ┆ day ┆ value ┆ new_day │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ i64 │
╞═══════════╪════════╪════════╪═════════╡
│ 0 ┆ 473189 ┆ 747152 ┆ 473190 │
│ 1 ┆ 511822 ┆ 298575 ┆ 511823 │
│ 2 ┆ 755167 ┆ 868027 ┆ 755168 │
│ 3 ┆ 950463 ┆ 289295 ┆ 950464 │
│ ... ┆ ... ┆ ... ┆ ... │
│ 999999996 ┆ 828237 ┆ 503917 ┆ 828238 │
│ 999999997 ┆ 909996 ┆ 447681 ┆ 909997 │
│ 999999998 ┆ 309104 ┆ 588174 ┆ 309105 │
│ 999999999 ┆ 485525 ┆ 198567 ┆ 485526 │
└───────────┴────────┴────────┴─────────┘
>>> print(time.perf_counter() - start)
20.85321443199973
groupby-explode
Now let's try the groupby-explode solution. This algorithm will spend a good share of time in single-threaded mode.
I've had to add a sort after the grouping step because the algorithm assumes sorted data in the steps after it.
start = time.perf_counter()
(
df
.groupby("day", maintain_order=False)
.agg_list()
.sort(['day'])
.with_columns(pl.col("day").shift(-1))
.explode(pl.exclude("day"))
)
print(time.perf_counter() - start)
shape: (1000000000, 3)
┌──────┬───────────┬────────┐
│ day ┆ row_nr ┆ value │
│ --- ┆ --- ┆ --- │
│ i64 ┆ u32 ┆ i64 │
╞══════╪═══════════╪════════╡
│ 2 ┆ 197731 ┆ 4093 │
│ 2 ┆ 3154732 ┆ 433246 │
│ 2 ┆ 4825468 ┆ 436316 │
│ 2 ┆ 4927362 ┆ 83493 │
│ ... ┆ ... ┆ ... │
│ null ┆ 993596728 ┆ 25604 │
│ null ┆ 995160321 ┆ 575415 │
│ null ┆ 996690852 ┆ 490825 │
│ null ┆ 999391650 ┆ 92113 │
└──────┴───────────┴────────┘
>>> print(time.perf_counter() - start)
54.04602192300081
rank
Now, the rank method. This algorithm will spend nearly all its time in single-threaded mode.
I've also had to add a sort here, as the ranks are assumed to be sorted in the search_sorted step.
start = time.perf_counter()
(
df
.sort(['day'])
.with_columns(
pl.col("day").rank("dense").cast(pl.Int64).alias("rank")
)
.with_columns(
pl.col("rank").search_sorted(pl.col("rank") + 1).alias("idx")
)
.with_columns(
pl.when(pl.col("idx") != pl.col("idx").max())
.then(pl.col("idx"))
.alias("idx")
)
.with_columns(
pl.col("day").take(pl.col("idx")).alias("new")
)
)
print(time.perf_counter() - start)
shape: (1000000000, 6)
┌───────────┬────────┬────────┬────────┬──────┬──────┐
│ row_nr ┆ day ┆ value ┆ rank ┆ idx ┆ new │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ i64 ┆ u32 ┆ i64 │
╞═══════════╪════════╪════════╪════════╪══════╪══════╡
│ 197731 ┆ 1 ┆ 4093 ┆ 1 ┆ 1907 ┆ 2 │
│ 3154732 ┆ 1 ┆ 433246 ┆ 1 ┆ 1907 ┆ 2 │
│ 4825468 ┆ 1 ┆ 436316 ┆ 1 ┆ 1907 ┆ 2 │
│ 4927362 ┆ 1 ┆ 83493 ┆ 1 ┆ 1907 ┆ 2 │
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
│ 993596728 ┆ 999999 ┆ 25604 ┆ 999999 ┆ null ┆ null │
│ 995160321 ┆ 999999 ┆ 575415 ┆ 999999 ┆ null ┆ null │
│ 996690852 ┆ 999999 ┆ 490825 ┆ 999999 ┆ null ┆ null │
│ 999391650 ┆ 999999 ┆ 92113 ┆ 999999 ┆ null ┆ null │
└───────────┴────────┴────────┴────────┴──────┴──────┘
>>> print(time.perf_counter() - start)
98.63108555600047
Assumption: Sorted by day
If we can assume that our data is already sorted by day, we can cut out unnecessary steps in our algorithms - as well as see some decent increases in speed.
We'll sort the data first and re-run our algorithms. Note that sorting sets the sorted flag on the day column, which allows algorithms to take shortcuts to increase speed. (If not sorting manually, then the set_sorted method can be used to tell Polars that the column is pre-sorted.)
df = df.sort(['day'])
df
shape: (1000000000, 3)
┌───────────┬────────┬────────┐
│ row_nr ┆ day ┆ value │
│ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 │
╞═══════════╪════════╪════════╡
│ 197731 ┆ 1 ┆ 4093 │
│ 3154732 ┆ 1 ┆ 433246 │
│ 4825468 ┆ 1 ┆ 436316 │
│ 4927362 ┆ 1 ┆ 83493 │
│ ... ┆ ... ┆ ... │
│ 993596728 ┆ 999999 ┆ 25604 │
│ 995160321 ┆ 999999 ┆ 575415 │
│ 996690852 ┆ 999999 ┆ 490825 │
│ 999391650 ┆ 999999 ┆ 92113 │
└───────────┴────────┴────────┘
Join
The code employing a join needs no changes; however, it does see an incredible speedup.
start = time.perf_counter()
(
df
.join(
df
.select(pl.col('day').unique().sort())
.with_columns(
pl.col('day').shift(-1).alias('new_day')
),
how='inner',
on='day',
)
)
print(time.perf_counter() - start)
shape: (1000000000, 4)
┌───────────┬────────┬────────┬─────────┐
│ row_nr ┆ day ┆ value ┆ new_day │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ i64 │
╞═══════════╪════════╪════════╪═════════╡
│ 197731 ┆ 1 ┆ 4093 ┆ 2 │
│ 3154732 ┆ 1 ┆ 433246 ┆ 2 │
│ 4825468 ┆ 1 ┆ 436316 ┆ 2 │
│ 4927362 ┆ 1 ┆ 83493 ┆ 2 │
│ ... ┆ ... ┆ ... ┆ ... │
│ 993596728 ┆ 999999 ┆ 25604 ┆ null │
│ 995160321 ┆ 999999 ┆ 575415 ┆ null │
│ 996690852 ┆ 999999 ┆ 490825 ┆ null │
│ 999391650 ┆ 999999 ┆ 92113 ┆ null │
└───────────┴────────┴────────┴─────────┘
>>> print(time.perf_counter() - start)
8.71159654099938
Note the same exact join algorithm now finishes in only 8.7 seconds rather than 20.9 seconds, largely due to the data being pre-sorted, and the sorted flag being set on day.
groupby-explode
We'll eliminate the superfluous sort within the algorithm, and re-run it.
start = time.perf_counter()
(
df
.groupby("day", maintain_order=True)
.agg_list()
.with_columns(pl.col("day").shift(-1))
.explode(pl.exclude("day"))
)
print(time.perf_counter() - start)
shape: (1000000000, 3)
┌──────┬───────────┬────────┐
│ day ┆ row_nr ┆ value │
│ --- ┆ --- ┆ --- │
│ i64 ┆ u32 ┆ i64 │
╞══════╪═══════════╪════════╡
│ 2 ┆ 197731 ┆ 4093 │
│ 2 ┆ 3154732 ┆ 433246 │
│ 2 ┆ 4825468 ┆ 436316 │
│ 2 ┆ 4927362 ┆ 83493 │
│ ... ┆ ... ┆ ... │
│ null ┆ 993596728 ┆ 25604 │
│ null ┆ 995160321 ┆ 575415 │
│ null ┆ 996690852 ┆ 490825 │
│ null ┆ 999391650 ┆ 92113 │
└──────┴───────────┴────────┘
>>> print(time.perf_counter() - start)
8.249637401000655
Note how this algorithm takes slightly less time than the join algorithm, all due to the assumption of day being pre-sorted.
rank
Again, we'll now eliminate the superfluous sort and re-run the algorithm.
start = time.perf_counter()
(
df
.with_columns(
pl.col("day").rank("dense").cast(pl.Int64).alias("rank")
)
.with_columns(
pl.col("rank").search_sorted(pl.col("rank") + 1).alias("idx")
)
.with_columns(
pl.when(pl.col("idx") != pl.col("idx").max())
.then(pl.col("idx"))
.alias("idx")
)
.with_columns(
pl.col("day").take(pl.col("idx")).alias("new")
)
)
print(time.perf_counter() - start)
shape: (1000000000, 6)
┌───────────┬────────┬────────┬────────┬──────┬──────┐
│ row_nr ┆ day ┆ value ┆ rank ┆ idx ┆ new │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ i64 ┆ u32 ┆ i64 │
╞═══════════╪════════╪════════╪════════╪══════╪══════╡
│ 197731 ┆ 1 ┆ 4093 ┆ 1 ┆ 1907 ┆ 2 │
│ 3154732 ┆ 1 ┆ 433246 ┆ 1 ┆ 1907 ┆ 2 │
│ 4825468 ┆ 1 ┆ 436316 ┆ 1 ┆ 1907 ┆ 2 │
│ 4927362 ┆ 1 ┆ 83493 ┆ 1 ┆ 1907 ┆ 2 │
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
│ 993596728 ┆ 999999 ┆ 25604 ┆ 999999 ┆ null ┆ null │
│ 995160321 ┆ 999999 ┆ 575415 ┆ 999999 ┆ null ┆ null │
│ 996690852 ┆ 999999 ┆ 490825 ┆ 999999 ┆ null ┆ null │
│ 999391650 ┆ 999999 ┆ 92113 ┆ 999999 ┆ null ┆ null │
└───────────┴────────┴────────┴────────┴──────┴──────┘
>>> print(time.perf_counter() - start)
48.90440067800046
Although this algorithm now takes roughly half the time, it's not quite as fast as the join or groupby-explode algorithms.
Of course, wall-clock performance is not the end-all-be-all. But when problems scale up, joins are particularly good tools, even when we cannot make assumptions regarding the sorted-ness of our data.
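The lookup table that the join builds (each distinct day paired with the next larger one) can be sketched in plain Python. This is only an illustration of the logic on the small example from the question, not a replacement for the Polars join; next_day is a hypothetical name.

```python
day = [1, 1, 1, 3, 3, 3, 5, 5, 8, 8, 9, 9, 9]

# Sorted distinct values, each paired with its successor (None for the largest).
unique_days = sorted(set(day))
next_day = dict(zip(unique_days, unique_days[1:] + [None]))

# Remap every row through the lookup table; the input need not be sorted.
new_day = [next_day[d] for d in day]
print(new_day)
# [3, 3, 3, 5, 5, 5, 8, 8, 9, 9, None, None, None]
```

Because the lookup is built from the distinct values only, the remapping itself does not depend on the row order, which is why the join version works on unsorted data.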

Python Polars find the length of a string in a dataframe

I am trying to count the number of letters in a string in Polars.
I could probably just use an apply method and get the len(Name).
However, I was wondering if there is a polars specific method?
import polars as pl
mydf = pl.DataFrame(
{"start_date": ["2020-01-02", "2020-01-03", "2020-01-04"],
"Name": ["John", "Joe", "James"]})
print(mydf)
│start_date ┆ Name │
│ --- ┆ --- │
│ str ┆ str │
╞════════════╪═══════╡
│ 2020-01-02 ┆ John │
│ 2020-01-03 ┆ Joe │
│ 2020-01-04 ┆ James │
In the end John would have 4, Joe would have 3 and James would have 5.
I thought something like below might work based on the Pandas equivalent
# Assume that it's a Pandas DataFrame
mydf['count'] = mydf['Name'].str.len()
# Polars equivalent - ERRORs
mydf = mydf.with_columns(
pl.col('Name').str.len().alias('count')
)
You can use either of:
.str.lengths(), which counts the number of bytes in the UTF-8 string (faster)
.str.n_chars(), which counts the number of characters
mydf.with_columns([
pl.col("Name").str.lengths().alias("len")
])
┌────────────┬───────┬─────┐
│ start_date ┆ Name ┆ len │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ u32 │
╞════════════╪═══════╪═════╡
│ 2020-01-02 ┆ John ┆ 4 │
│ 2020-01-03 ┆ Joe ┆ 3 │
│ 2020-01-04 ┆ James ┆ 5 │
└────────────┴───────┴─────┘
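The two counts differ only for non-ASCII text. The plain-Python lines below illustrate the distinction between a byte count and a character count; they do not call Polars, and "José" is an example name that is not in the question's data.

```python
names = ["John", "Joe", "Jos\u00e9"]  # the last name ends in a two-byte character

# name -> (byte count in UTF-8, character count)
counts = {name: (len(name.encode("utf-8")), len(name)) for name in names}

for name, (n_bytes, n_chars) in counts.items():
    print(name, n_bytes, n_chars)
```

For the names in the question both counts agree, so the faster byte count is sufficient there.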

Sample from each group in polars dataframe?

I'm looking for a function along the lines of
df.groupby('column').agg(sample(10))
so that I can take ten or so randomly-selected elements from each group.
This is specifically so I can read in a LazyFrame and work with a small sample of each group as opposed to the entire dataframe.
Update:
One approximate solution is:
df = lf.groupby('column').agg(
pl.all().sample(.001)
)
df = df.explode(df.columns[1:])
Update 2
That approximate solution is just the same as sampling the whole dataframe and doing a groupby after. No good.
Let's start with some dummy data:
n = 100
seed = 0
df = pl.DataFrame(
{
"groups": (pl.arange(0, n, eager=True) % 5).shuffle(seed=seed),
"values": pl.arange(0, n, eager=True).shuffle(seed=seed)
}
)
df
shape: (100, 2)
┌────────┬────────┐
│ groups ┆ values │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞════════╪════════╡
│ 0 ┆ 55 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 0 ┆ 40 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 57 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 99 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 87 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1 ┆ 96 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3 ┆ 43 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 44 │
└────────┴────────┘
This gives us 5 groups of 100 / 5 = 20 elements each. Let's verify that:
df.groupby("groups").agg(pl.count())
shape: (5, 2)
┌────────┬───────┐
│ groups ┆ count │
│ --- ┆ --- │
│ i64 ┆ u32 │
╞════════╪═══════╡
│ 1 ┆ 20 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ 20 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4 ┆ 20 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ 20 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0 ┆ 20 │
└────────┴───────┘
Sample our data
Now we are going to use a window function to take a sample of our data.
df.filter(
pl.arange(0, pl.count()).shuffle().over("groups") < 10
)
shape: (50, 2)
┌────────┬────────┐
│ groups ┆ values │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞════════╪════════╡
│ 0 ┆ 85 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 0 ┆ 0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 84 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 19 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 87 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1 ┆ 96 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3 ┆ 43 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 44 │
└────────┴────────┘
For every group in over("groups"), the pl.arange(0, pl.count()) expression creates a row index. We then shuffle that range so that we take a sample and not a slice, and keep only the index values that are lower than 10. This creates a boolean mask that we can pass to the filter method.
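Stripped of Polars, the same idea looks as follows: give every row a position within its group, shuffle those positions, and keep the rows whose shuffled position is below the sample size. This is a plain-Python sketch of the logic only; sample_per_group is a hypothetical helper and the seed parameter is an assumption added for reproducibility.

```python
import random
from collections import defaultdict

def sample_per_group(groups, n, seed=0):
    """Return a boolean mask keeping at most n random rows per group."""
    rng = random.Random(seed)
    rows_by_group = defaultdict(list)
    for row, g in enumerate(groups):
        rows_by_group[g].append(row)

    mask = [False] * len(groups)
    for rows in rows_by_group.values():
        # Shuffled within-group index, analogous to arange(...).shuffle().over(...)
        positions = list(range(len(rows)))
        rng.shuffle(positions)
        for row, pos in zip(rows, positions):
            mask[row] = pos < n
    return mask

groups = [i % 5 for i in range(100)]
mask = sample_per_group(groups, 10)
print(sum(mask))  # 50: ten rows from each of the five groups
```

Groups with fewer than n rows are kept whole, since every shuffled position is then below n.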
A solution using lambda
df = (
lf.groupby('column')
.apply(lambda x: x.sample(10))
)
We can try making our own groupby-like functionality and sampling from the filtered subsets.
samples = []
cats = df.get_column('column').unique().to_list()
for cat in cats:
    samples.append(df.filter(pl.col('column') == cat).sample(10))
samples = pl.concat(samples)
I found partition_by in the documentation. This should be more efficient, since the groups are at least made through the API and in a single pass over the dataframe. Unfortunately, sampling each group is still linear.
pl.concat([x.sample(10) for x in df.partition_by(groups="column")])
Third attempt, sampling indices:
import numpy as np
import random
indices = df.groupby("group").agg(pl.col("value").agg_groups()).get_column("value").to_list()
sampled = np.array([random.sample(x, 10) for x in indices]).flatten()
df[sampled]

Polars python equivalent to glimpse and summary in R

I couldn't find a function that summarizes the contents of a polars dataframe the way glimpse and summary do in R. Is there one?
Polars has a describe method:
df = pl.DataFrame({
'a': [1.0, 2.8, 3.0],
'b': [4, 5, 6],
"c": [True, False, True]
})
df.describe()
shape: (5, 4)
╭──────────┬───────┬─────┬──────╮
│ describe ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 │
╞══════════╪═══════╪═════╪══════╡
│ "mean" ┆ 2.267 ┆ 5 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "std" ┆ 1.102 ┆ 1 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "min" ┆ 1 ┆ 4 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "max" ┆ 3 ┆ 6 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "median" ┆ 2.8 ┆ 5 ┆ null │
╰──────────┴───────┴─────┴──────╯
This reports descriptive statistics per column, like R's summary. I have not used glimpse before, but a quick search suggests it does something similar to Polars' head, with the output stacked vertically so that it is easier to digest when there are many columns.
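As a cross-check of what the table reports, the figures for column a can be reproduced with Python's standard library. The sketch below suggests that the std row in the output shown above is the sample standard deviation (dividing by n - 1); it does not call Polars, so treat that reading as an inference from these three numbers only.

```python
import statistics

a = [1.0, 2.8, 3.0]

mean_a = round(statistics.mean(a), 3)
std_a = round(statistics.stdev(a), 3)   # sample standard deviation (n - 1)
median_a = statistics.median(a)

print(mean_a, std_a, median_a)  # 2.267 1.102 2.8
```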
