I am trying to count the number of letters in a string in Polars.
I could probably just use an apply method and take len(Name) per row.
However, I was wondering if there is a Polars-specific method?
import polars as pl
mydf = pl.DataFrame(
{"start_date": ["2020-01-02", "2020-01-03", "2020-01-04"],
"Name": ["John", "Joe", "James"]})
print(mydf)
shape: (3, 2)
┌────────────┬───────┐
│ start_date ┆ Name  │
│ ---        ┆ ---   │
│ str        ┆ str   │
╞════════════╪═══════╡
│ 2020-01-02 ┆ John  │
│ 2020-01-03 ┆ Joe   │
│ 2020-01-04 ┆ James │
└────────────┴───────┘
In the end John would have 4, Joe would have 3 and James would have 5.
I thought something like below might work based on the Pandas equivalent
# Assume that its a Pandas Dataframe
mydf['count'] = mydf['Name'].str.len()
# Polars equivalent - ERRORs
mydf = mydf.with_columns(
pl.col('Name').str.len().alias('count')
)
You can use:
.str.lengths(), which counts the number of bytes in the UTF-8 string (doc) - faster
.str.n_chars(), which counts the number of characters (doc)
The two only differ for non-ASCII data; see the sketch after the example below.
mydf.with_columns([
pl.col("Name").str.lengths().alias("len")
])
shape: (3, 3)
┌────────────┬───────┬─────┐
│ start_date ┆ Name ┆ len │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ u32 │
╞════════════╪═══════╪═════╡
│ 2020-01-02 ┆ John ┆ 4 │
│ 2020-01-03 ┆ Joe ┆ 3 │
│ 2020-01-04 ┆ James ┆ 5 │
└────────────┴───────┴─────┘
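Here is a minimal sketch of the distinction between the two; the column names are my own, and the difference only shows up for non-ASCII data:

import polars as pl

df = pl.DataFrame({"Name": ["José", "Joe"]})
df.with_columns([
    pl.col("Name").str.lengths().alias("n_bytes"),  # "é" takes 2 bytes in UTF-8, so "José" -> 5
    pl.col("Name").str.n_chars().alias("n_chars"),  # "José" -> 4 characters
])

Note that more recent Polars releases rename these to .str.len_bytes() and .str.len_chars(), so check the version you're on.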
I have a large Polars dataframe that I'd like to split into n dataframes given a target count, e.g. take the dataframe and split it into 2, 3, or 5 dataframes.
Several observations show up for each identifier, and I would like to choose the number of dataframes to split into. A simple example follows, where I split on a specific id, but I would like similar behaviour that instead splits into, say, 2 approximately even dataframes, since the full dataset has a large number of identifiers.
df = pl.DataFrame({
    'Identifier': [1234, 1234, 2345, 2345],
    'DateColumn': ['2022-02-13', '2022-02-14', '2022-02-13', '2022-02-14']
})
df2 = df.with_columns(
    pl.col('DateColumn').str.strptime(pl.Date)
)
print(df)
shape: (4, 2)
┌────────────┬────────────┐
│ Identifier ┆ DateColumn │
│ --- ┆ --- │
│ i64 ┆ str │
╞════════════╪════════════╡
│ 1234 ┆ 2022-02-13 │
│ 1234 ┆ 2022-02-14 │
│ 2345 ┆ 2022-02-13 │
│ 2345 ┆ 2022-02-14 │
└────────────┴────────────┘
df1 = df.filter(
pl.col('Identifier')==1234
)
df2 = df.filter(
pl.col('Identifier')==2345
)
print(df1)
shape: (2, 2)
┌────────────┬────────────┐
│ Identifier ┆ DateColumn │
│ --- ┆ --- │
│ i64 ┆ str │
╞════════════╪════════════╡
│ 1234 ┆ 2022-02-13 │
│ 1234 ┆ 2022-02-14 │
└────────────┴────────────┘
print(df2)
shape: (2, 2)
┌────────────┬────────────┐
│ Identifier ┆ DateColumn │
│ --- ┆ --- │
│ i64 ┆ str │
╞════════════╪════════════╡
│ 2345 ┆ 2022-02-13 │
│ 2345 ┆ 2022-02-14 │
└────────────┴────────────┘
If you want to divide your DataFrame by, say, your identifier, the best way to do so is to use the partition_by method.
df = pl.DataFrame({
"foo": ["A", "A", "B", "B", "C"],
"N": [1, 2, 2, 4, 2],
"bar": ["k", "l", "m", "m", "l"],
})
df.partition_by(groups="foo", maintain_order=True)
[shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ N ┆ bar │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ A ┆ 1 ┆ k │
│ A ┆ 2 ┆ l │
└─────┴─────┴─────┘,
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ N ┆ bar │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ B ┆ 2 ┆ m │
│ B ┆ 4 ┆ m │
└─────┴─────┴─────┘,
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ N ┆ bar │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ C ┆ 2 ┆ l │
└─────┴─────┴─────┘]
https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.partition_by.html
This automatically divides the DataFrame by values in a column.
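The question also asked for a fixed number of roughly even frames rather than one frame per identifier. One hedged sketch, not from the docs: derive a bucket column that spreads identifiers round-robin over n buckets, then partition on that (the _bucket name and the rank trick are my own):

import polars as pl

df = pl.DataFrame({
    "Identifier": [1234, 1234, 2345, 2345, 3456, 3456],
    "DateColumn": ["2022-02-13", "2022-02-14"] * 3,
})

n = 2  # desired number of output dataframes
parts = [
    part.drop("_bucket")
    for part in df.with_columns(
        # dense rank numbers the distinct identifiers 1..k;
        # modulo n assigns each identifier to one of n buckets
        (pl.col("Identifier").rank("dense") % n).alias("_bucket")
    ).partition_by("_bucket")
]

This keeps all rows for a given identifier in the same output frame, which is usually what you want when splitting for parallel work.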
I have a collection of Polars expressions being used to generate features for an ML model. I'd like to add a Poisson CDF feature to this collection whilst maintaining lazy execution (with its benefits of speed, caching, etc.). So far I have not found an easy way of achieving this.
I've been able to get the result I'd like outside of the desired lazy expression framework with:
import polars as pl
from scipy.stats import poisson
df = pl.DataFrame({"count": [9,2,3,4,5], "expected_count": [7.7, 0.2, 0.7, 1.1, 7.5]})
result = poisson.cdf(df["count"].to_numpy(), df["expected_count"].to_numpy())
df = df.with_columns(pl.Series(result).alias("poisson_cdf"))
However, in reality I'd like this to look like:
df = pl.DataFrame({"count": [9,2,3,4,5], "expected_count": [7.7, 0.2, 0.7, 1.1, 7.5]})
df = df.select(
[
... # bunch of other expressions here
poisson_cdf()
]
)
where poisson_cdf is some polars expression like:
def poisson_cdf():
    # this is just for illustration; it clearly won't work
    return scipy.stats.poisson.cdf(pl.col("count"), pl.col("expected_count")).alias("poisson_cdf")
I also tried using a struct made up of "count" and "expected_count" with apply, as advised in the docs on applying custom functions. However, my dataset is several million rows in reality, leading to absurd execution times.
Any advice or guidance here would be appreciated. Ideally there exists an expression like this somewhere out there? Thanks in advance!
If scipy.stats.poisson.cdf were implemented as a proper NumPy universal function, it would be possible to use it directly on Polars expressions, but it is not. Fortunately, the Poisson CDF is almost the same as the regularized upper incomplete gamma function, for which scipy supplies gammaincc, which can be used in Polars expressions:
>>> import polars as pl
>>> from scipy.special import gammaincc
>>> df = pl.select(pl.arange(0, 10).alias('k'))
>>> df.with_columns(cdf=gammaincc(pl.col('k') + 1, 4.0))
shape: (10, 2)
┌─────┬──────────┐
│ k ┆ cdf │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪══════════╡
│ 0 ┆ 0.018316 │
│ 1 ┆ 0.091578 │
│ 2 ┆ 0.238103 │
│ 3 ┆ 0.43347 │
│ ... ┆ ... │
│ 6 ┆ 0.889326 │
│ 7 ┆ 0.948866 │
│ 8 ┆ 0.978637 │
│ 9 ┆ 0.991868 │
└─────┴──────────┘
The result is the same as returned by poisson.cdf:
>>> from scipy.stats import poisson
>>> _.with_columns(cdf2=pl.lit(poisson.cdf(df['k'], 4)))
shape: (10, 3)
┌─────┬──────────┬──────────┐
│ k ┆ cdf ┆ cdf2 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 │
╞═════╪══════════╪══════════╡
│ 0 ┆ 0.018316 ┆ 0.018316 │
│ 1 ┆ 0.091578 ┆ 0.091578 │
│ 2 ┆ 0.238103 ┆ 0.238103 │
│ 3 ┆ 0.43347 ┆ 0.43347 │
│ ... ┆ ... ┆ ... │
│ 6 ┆ 0.889326 ┆ 0.889326 │
│ 7 ┆ 0.948866 ┆ 0.948866 │
│ 8 ┆ 0.978637 ┆ 0.978637 │
│ 9 ┆ 0.991868 ┆ 0.991868 │
└─────┴──────────┴──────────┘
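The identity behind this: for X ~ Poisson(mu), P(X <= k) equals the regularized upper incomplete gamma function Q(k + 1, mu), which is exactly what gammaincc computes. A quick hedged check on the numbers from the question:

import numpy as np
from scipy.special import gammaincc
from scipy.stats import poisson

k = np.array([9, 2, 3, 4, 5])
mu = np.array([7.7, 0.2, 0.7, 1.1, 7.5])

# P(X <= k) for X ~ Poisson(mu) == Q(k + 1, mu)
assert np.allclose(gammaincc(k + 1, mu), poisson.cdf(k, mu))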
It sounds like you want to use .map() instead of .apply(), which will pass whole columns at once.
df.select([
    pl.all(),
    # ...
    pl.struct(["count", "expected_count"])
      .map(lambda x:
          poisson.cdf(x.struct.field("count"), x.struct.field("expected_count")))
      .flatten()
      .alias("poisson_cdf")
])
shape: (5, 3)
┌───────┬────────────────┬─────────────┐
│ count ┆ expected_count ┆ poisson_cdf │
│ ---   ┆ ---            ┆ ---         │
│ i64   ┆ f64            ┆ f64         │
╞═══════╪════════════════╪═════════════╡
│ 9     ┆ 7.7            ┆ 0.75308     │
│ 2     ┆ 0.2            ┆ 0.998852    │
│ 3     ┆ 0.7            ┆ 0.994247    │
│ 4     ┆ 1.1            ┆ 0.994565    │
│ 5     ┆ 7.5            ┆ 0.241436    │
└───────┴────────────────┴─────────────┘
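Since the question asked for lazy execution: unlike .apply, .map also runs fine inside a LazyFrame query, receiving whole Series at once. A hedged sketch of the same expression in a lazy pipeline:

import polars as pl
from scipy.stats import poisson

lazy = pl.DataFrame({
    "count": [9, 2, 3, 4, 5],
    "expected_count": [7.7, 0.2, 0.7, 1.1, 7.5],
}).lazy()

out = lazy.select([
    pl.all(),
    pl.struct(["count", "expected_count"])
      .map(lambda s: poisson.cdf(s.struct.field("count"),
                                 s.struct.field("expected_count")))
      .flatten()
      .alias("poisson_cdf"),
]).collect()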
You want to take advantage of the fact that scipy has a set of functions which are NumPy ufuncs, as those still have fast columnar operation through the NumPy API. Specifically, you want the pdtr function from scipy.special.
You then want to use reduce rather than map or apply, as those are for generic Python functions and aren't going to perform as well.
So if we have...
from scipy.special import pdtr
from scipy.stats import poisson

df = pl.DataFrame({"count": [9, 2, 3, 4, 5], "expected_count": [7.7, 0.2, 0.7, 1.1, 7.5]})
result = poisson.cdf(df["count"].to_numpy(), df["expected_count"].to_numpy())
df = df.with_columns(pl.Series(result).alias("poisson_cdf"))
then we can add to it with
df = df.with_columns([
    pl.reduce(f=pdtr, exprs=[pl.col('count'), pl.col('expected_count')]).alias("poicdf")
])
df
shape: (5, 4)
┌───────┬────────────────┬──────────────┬──────────┐
│ count ┆ expected_count ┆ poisson_cdf  ┆ poicdf   │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════╪════════════════╪══════════════╪══════════╡
│ 9 ┆ 7.7 ┆ 0.75308 ┆ 0.75308 │
│ 2 ┆ 0.2 ┆ 0.998852 ┆ 0.998852 │
│ 3 ┆ 0.7 ┆ 0.994247 ┆ 0.994247 │
│ 4 ┆ 1.1 ┆ 0.994565 ┆ 0.994565 │
│ 5 ┆ 7.5 ┆ 0.241436 ┆ 0.241436 │
└───────┴────────────────┴──────────────┴──────────┘
You can see it gives the same answer.
Does anyone know how to parse YYYY Week (e.g. 201901) into a date column in Polars?
I have tried this code but it throws an error. Thanks.
import polars as pl
pl.DataFrame(
{
"week": [201901, 201902, 201903, 201942, 201943, 201944]
}).with_columns(pl.col('week').cast(pl.Utf8).str.strptime(pl.Date, fmt='%Y%U').alias("date"))
This seems like a bug (although one in the underlying Rust crate chrono rather than in Polars itself). I tried base Python's strptime and it ignores the %U, giving the first of the year in all cases. So you can either do string manipulation and math like this (assuming you don't need an exact result):
pl.DataFrame({
    "week": [201901, 201902, 201903, 201942, 201943, 201944]
}) \
    .with_columns(pl.col('week').cast(pl.Utf8)) \
    .with_columns([pl.col('week').str.slice(0, 4).cast(pl.Int32).alias('year'),
                   pl.col('week').str.slice(4, 2).cast(pl.Int32).alias('week')]) \
    .select((pl.date(pl.col('year'), 1, 1) + pl.duration(days=(pl.col('week') - 1) * 7)).alias('date'))
If you look at the definition of %U, it's supposed to be based on the xth Sunday of the year, whereas my math is just multiplying by 7.
Another approach is to make a df of dates, compute their strftime, and then join the dfs. That might look like this:
from datetime import datetime

dfdates = pl.DataFrame({'date': pl.date_range(datetime(2019, 1, 1), datetime(2019, 12, 31), '1d').cast(pl.Date())}) \
    .with_columns(pl.col('date').dt.strftime("%Y%U").alias('week')) \
    .groupby('week').agg(pl.col('date').min())
And then joining it with what you have
pl.DataFrame({
"week": [201901, 201902, 201903, 201942, 201943, 201944]
}).with_columns(pl.col('week').cast(pl.Utf8())).join(dfdates, on='week')
shape: (6, 2)
┌────────┬────────────┐
│ week ┆ date │
│ --- ┆ --- │
│ str ┆ date │
╞════════╪════════════╡
│ 201903 ┆ 2019-01-20 │
│ 201944 ┆ 2019-11-03 │
│ 201902 ┆ 2019-01-13 │
│ 201943 ┆ 2019-10-27 │
│ 201942 ┆ 2019-10-20 │
│ 201901 ┆ 2019-01-06 │
└────────┴────────────┘
That's really weird mate, it looks like only dates in 2019 are broken; take a look at my example below:
pl.DataFrame(
{
"week": [
202201,
202202,
202203,
202242,
202243,
202244,
202101,
202102,
202103,
202142,
202143,
202144,
201901,
201902,
201903,
201942,
201943,
201944,
201801,
201802,
201803,
201842,
201843,
201844,
]
}
).with_columns(pl.format("{}0", "week")).with_columns(
pl.col("week").str.strptime(pl.Date, fmt="%Y%W%w", strict=False).alias("teste")
)
shape: (24, 2)
┌─────────┬────────────┐
│ week ┆ teste │
│ --- ┆ --- │
│ str ┆ date │
╞═════════╪════════════╡
│ 2022010 ┆ 2022-01-09 │
│ 2022020 ┆ 2022-01-16 │
│ 2022030 ┆ 2022-01-23 │
│ 2022420 ┆ 2022-10-23 │
│ 2022430 ┆ 2022-10-30 │
│ 2022440 ┆ 2022-11-06 │
│ 2021010 ┆ 2021-01-10 │
│ 2021020 ┆ 2021-01-17 │
│ 2021030 ┆ 2021-01-24 │
│ 2021420 ┆ 2021-10-24 │
│ 2021430 ┆ 2021-10-31 │
│ 2021440 ┆ 2021-11-07 │
│ 2019010 ┆ null │
│ 2019020 ┆ null │
│ 2019030 ┆ null │
│ 2019420 ┆ null │
│ 2019430 ┆ null │
│ 2019440 ┆ null │
│ 2018010 ┆ 2018-01-07 │
│ 2018020 ┆ 2018-01-14 │
│ 2018030 ┆ 2018-01-21 │
│ 2018420 ┆ 2018-10-21 │
│ 2018430 ┆ 2018-10-28 │
│ 2018440 ┆ 2018-11-04 │
└─────────┴────────────┘
Besides the bug, I always use the following expression to parse week counts into proper dates:
.with_columns(pl.format("{}0", "week")).with_columns(pl.col("week").str.strptime(pl.Date, fmt="%Y%W%w", strict=False))
It is important to note that it is necessary to concatenate a weekday to really parse this pattern; I think this is mentioned in the comments on the other post.
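For completeness, here is a self-contained hedged version of that pattern on the original integer weeks (appending weekday "0" so %Y%W%w can parse; years where the pattern misbehaves come back as null thanks to strict=False):

import polars as pl

df = pl.DataFrame({"week": [202201, 202242, 201801]})
out = (
    df.with_columns(pl.format("{}0", "week"))  # e.g. 202201 -> "2022010"
      .with_columns(
          pl.col("week").str.strptime(pl.Date, fmt="%Y%W%w", strict=False).alias("date")
      )
)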
I'm looking for a function along the lines of
df.groupby('column').agg(sample(10))
so that I can take ten or so randomly-selected elements from each group.
This is specifically so I can read in a LazyFrame and work with a small sample of each group as opposed to the entire dataframe.
Update:
One approximate solution is:
df = lf.groupby('column').agg(
pl.all().sample(.001)
)
df = df.explode(df.columns[1:])
Update 2
That approximate solution is just the same as sampling the whole dataframe and doing a groupby after. No good.
Let's start with some dummy data:
n = 100
seed = 0
df = pl.DataFrame(
{
"groups": (pl.arange(0, n, eager=True) % 5).shuffle(seed=seed),
"values": pl.arange(0, n, eager=True).shuffle(seed=seed)
}
)
df
shape: (100, 2)
┌────────┬────────┐
│ groups ┆ values │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞════════╪════════╡
│ 0 ┆ 55 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 0 ┆ 40 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 57 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 99 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 87 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1 ┆ 96 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3 ┆ 43 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 44 │
└────────┴────────┘
This gives us 100 / 5 = 5 groups of 20 elements each. Let's verify that:
df.groupby("groups").agg(pl.count())
shape: (5, 2)
┌────────┬───────┐
│ groups ┆ count │
│ --- ┆ --- │
│ i64 ┆ u32 │
╞════════╪═══════╡
│ 1 ┆ 20 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ 20 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4 ┆ 20 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ 20 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0 ┆ 20 │
└────────┴───────┘
Sample our data
Now we are going to use a window function to take a sample of our data.
df.filter(
pl.arange(0, pl.count()).shuffle().over("groups") < 10
)
shape: (50, 2)
┌────────┬────────┐
│ groups ┆ values │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞════════╪════════╡
│ 0 ┆ 85 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 0 ┆ 0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 84 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 19 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 87 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1 ┆ 96 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3 ┆ 43 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 44 │
└────────┴────────┘
For every group in over("groups"), the pl.arange(0, pl.count()) expression creates a row index. We then shuffle that range so that we take a sample and not a slice. Finally, we keep only the index values lower than 10. This creates a boolean mask that we can pass to the filter method.
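If you need the sample to be reproducible, .shuffle accepts a seed (the same way the dummy data above was shuffled with seed=0):

df.filter(
    pl.arange(0, pl.count()).shuffle(seed=0).over("groups") < 10
)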
A solution using lambda
df = (
lf.groupby('column')
.apply(lambda x: x.sample(10))
)
We can try making our own groupby-like functionality and sampling from the filtered subsets.
samples = []
cats = df.get_column('column').unique().to_list()
for cat in cats:
samples.append(df.filter(pl.col('column') == cat).sample(10))
samples = pl.concat(samples)
Found partition_by in the documentation. This should be more efficient, since at least the groups are made with the API in a single pass over the dataframe. Sampling each group is still linear, unfortunately.
pl.concat([x.sample(10) for x in df.partition_by(groups="column")])
Third attempt, sampling indices:
import numpy as np
import random
indices = df.groupby("group").agg(pl.col("value").agg_groups()).get_column("value").to_list()
sampled = np.array([random.sample(x, 10) for x in indices]).flatten()
df[sampled]
Is there a function that summarizes the content of a Polars dataframe the way glimpse and summary do in R? I couldn't find one.
Polars has a describe method:
df = pl.DataFrame({
'a': [1.0, 2.8, 3.0],
'b': [4, 5, 6],
"c": [True, False, True]
})
df.describe()
shape: (5, 4)
╭──────────┬───────┬─────┬──────╮
│ describe ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 │
╞══════════╪═══════╪═════╪══════╡
│ "mean" ┆ 2.267 ┆ 5 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "std" ┆ 1.102 ┆ 1 ┆ null │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "min" ┆ 1 ┆ 4 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "max" ┆ 3 ┆ 6 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ "median" ┆ 2.8 ┆ 5 ┆ null │
This reports, like R's summary, descriptive statistics per column. I have not used glimpse before, but a quick search suggests it does something similar to Polars' head, but with the output stacked vertically, so it is easier to digest when there are many columns.
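If you specifically want a glimpse-style vertical view, a small helper is easy to write. This is a hedged sketch; the glimpse name and output layout below are my own (recent Polars versions have since added a built-in df.glimpse(), if yours is new enough):

import polars as pl

def glimpse(df: pl.DataFrame, n: int = 5) -> None:
    # one line per column: name, dtype, and the first n values
    print(f"Rows: {df.height}, Columns: {df.width}")
    for name, dtype in zip(df.columns, df.dtypes):
        print(f"$ {name} <{dtype}> {df[name].head(n).to_list()}")

glimpse(df)
# Rows: 3, Columns: 3
# $ a <Float64> [1.0, 2.8, 3.0]
# $ b <Int64> [4, 5, 6]
# $ c <Boolean> [True, False, True]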