I would like to find a way to distribute the values of a DataFrame among the rows of another DataFrame using polars (without iterating through the rows).
I have a dataframe with the amounts to be distributed:
Name  Amount
A     100
B     300
C     250
And a target DataFrame to which I want to append the distributed values (in a new column) using the common "Name" column.
Name  Item  Price
A     x1    40
A     x2    60
B     y1    50
B     y2    150
B     y3    200
C     z1    400
The rows in the target are sorted and the assigned amount should match the price in each row (as long as there is enough amount remaining).
So the result in this case should look like this:
Name  Item  Price  Assigned amount
A     x1    40     40
A     x2    60     60
B     y1    50     50
B     y2    150    150
B     y3    200    100
C     z1    400    250
In this example, the amount for A can be fully distributed, so each assigned amount equals the price. For the last item of B and for C, however, the prices are too high, so only the remaining amount is assigned.
Is there an efficient way to do this?
My initial solution was to calculate the cumulative sum of the Price in a new column in the target dataframe, then left join the source DataFrame and subtract the values of the cumulative sum. This would work if the amount is high enough, but for the last item of B and C I would get negative values and not the remaining amount.
Edit
Example dataframes:
import polars as pl
df1 = pl.DataFrame({"Name": ["A", "B", "C"], "Amount": [100, 300, 250]})
df2 = pl.DataFrame({"Name": ["A", "A", "B", "B", "B", "C"], "Item": ["x1", "x2", "y1", "y2", "y3", "z"],"Price": [40, 60, 50, 150, 200, 400]})
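For reference, a rough sketch of my cumulative-sum attempt using the example dataframes above (not the exact code, but it shows where the negatives come from):
(
    df2.join(df1, on="Name")
    .with_columns(pl.col("Price").cumsum().over("Name").alias("cum_price"))
    .with_columns((pl.col("Amount") - pl.col("cum_price")).alias("remaining"))
)
# "remaining" ends up as -100 for the last item of B and -150 for C,
# instead of the 100 and 250 I want.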
#jqurious, good answer. This might be slightly more succinct:
(
    df2.join(df1, on="Name")
    .with_columns(
        pl.min([
            pl.col('Price'),
            pl.col('Amount') -
            pl.col('Price').cumsum().shift_and_fill(1, 0).over('Name')
        ])
        .clip_min(0)
        .alias('assigned')
    )
)
shape: (6, 5)
┌──────┬──────┬───────┬────────┬──────────┐
│ Name ┆ Item ┆ Price ┆ Amount ┆ assigned │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 ┆ i64 │
╞══════╪══════╪═══════╪════════╪══════════╡
│ A ┆ x1 ┆ 40 ┆ 100 ┆ 40 │
│ A ┆ x2 ┆ 60 ┆ 100 ┆ 60 │
│ B ┆ y1 ┆ 50 ┆ 300 ┆ 50 │
│ B ┆ y2 ┆ 150 ┆ 300 ┆ 150 │
│ B ┆ y3 ┆ 200 ┆ 300 ┆ 100 │
│ C ┆ z ┆ 400 ┆ 250 ┆ 250 │
└──────┴──────┴───────┴────────┴──────────┘
You can take the minimum value of the Price or the Difference.
.clip_min(0) can be used to replace the negatives.
[Edit: See #ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ's answer for a neater way to write this.]
(
    df2
    .join(df1, on="Name")
    .with_columns(
        cumsum = pl.col("Price").cumsum().over("Name"))
    .with_columns(
        assigned = pl.col("Amount") - (pl.col("cumsum") - pl.col("Price")))
    .with_columns(
        assigned = pl.min(["Price", "assigned"]).clip_min(0))
)
shape: (6, 6)
┌──────┬──────┬───────┬────────┬────────┬──────────┐
│ Name | Item | Price | Amount | cumsum | assigned │
│ --- | --- | --- | --- | --- | --- │
│ str | str | i64 | i64 | i64 | i64 │
╞══════╪══════╪═══════╪════════╪════════╪══════════╡
│ A | x1 | 40 | 100 | 40 | 40 │
│ A | x2 | 60 | 100 | 100 | 60 │
│ B | y1 | 50 | 300 | 50 | 50 │
│ B | y2 | 150 | 300 | 200 | 150 │
│ B | y3 | 200 | 300 | 400 | 100 │
│ C | z | 400 | 250 | 400 | 250 │
└──────┴──────┴───────┴────────┴────────┴──────────┘
This assumes the order of the df is the order of priority; if not, sort it first.
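For example, something along these lines (just a sketch, assuming the Item column reflects the priority within each Name):
df2 = df2.sort(["Name", "Item"])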
You first want to join your two dfs, then make a helper column that is the cumulative sum of Price less Price. I call that spent. It's really a potential spend, because there's no guarantee it doesn't go over Amount.
Add another two helper columns: one for the difference between Amount and spent, which we'll call have1, as that's the amount we have left. In the sample data this didn't come up, but we need to make sure this isn't less than 0, so we add another column that is just literally zero; we'll call it z.
Add another helper column, have2, which is the greater value between z and have1.
Lastly, we'll determine the Assigned amount as the smaller value between have2 and Price.
df1.join(df2, on='Name') \
    .with_columns((pl.col("Price").cumsum() - pl.col("Price")).over("Name").alias("spent")) \
    .with_columns([(pl.col("Amount") - pl.col("spent")).alias("have1"), pl.lit(0).alias('z')]) \
    .with_columns(pl.concat_list([pl.col('z'), pl.col('have1')]).arr.max().alias('have2')) \
    .with_columns(pl.concat_list([pl.col('have2'), pl.col("Price")]).arr.min().alias("Assigned amount")) \
    .select(["Name", "Item", "Price", "Assigned amount"])
You can reduce this to a single nested expression like this...
df1.join(df2, on='Name') \
    .select(["Name", "Item", "Price",
             pl.concat_list([
                 pl.concat_list([
                     pl.repeat(0, pl.count()),
                     pl.col("Amount") - (pl.col("Price").cumsum() - pl.col("Price")).over("Name")
                 ]).arr.max(),
                 pl.col("Price")
             ]).arr.min().alias("Assigned amount")
    ])
shape: (6, 4)
┌──────┬──────┬───────┬─────────────────┐
│ Name ┆ Item ┆ Price ┆ Assigned amount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 │
╞══════╪══════╪═══════╪═════════════════╡
│ A ┆ x1 ┆ 40 ┆ 40 │
│ A ┆ x2 ┆ 60 ┆ 60 │
│ B ┆ y1 ┆ 50 ┆ 50 │
│ B ┆ y2 ┆ 150 ┆ 150 │
│ B ┆ y3 ┆ 200 ┆ 100 │
│ C ┆ z ┆ 400 ┆ 250 │
└──────┴──────┴───────┴─────────────────┘
I am currently trying to replicate ngroup behaviour in polars to get consecutive group indexes (the dataframe will be grouped over two columns). For the R crowd, this would be achieved in the dplyr world with dplyr::group_indices or the newer dplyr::cur_group_id.
As shown in the repro, I've tried a couple of avenues without much success; both approaches miss group sequentiality and merely return row counts by group.
Quick repro:
import polars as pl
import pandas as pd
df = pd.DataFrame(
{
"id": ["a", "a", "a", "a", "b", "b", "b", "b"],
"cat": [1, 1, 2, 2, 1, 1, 2, 2],
}
)
df_pl = pl.from_pandas(df)
print(df.groupby(["id", "cat"]).ngroup())
# This is the desired behaviour
# 0 0
# 1 0
# 2 1
# 3 1
# 4 2
# 5 2
# 6 3
# 7 3
print(df_pl.select(pl.count().over(["id", "cat"])))
# This is only counting observation by group
# ┌───────┐
# │ count │
# │ --- │
# │ u32 │
# ╞═══════╡
# │ 2 │
# │ 2 │
# │ 2 │
# │ 2 │
# │ 2 │
# │ 2 │
# │ 2 │
# │ 2 │
# └───────┘
print(df_pl.groupby(["id", "cat"]).agg([pl.count().alias("test")]))
# shape: (4, 3)
# ┌─────┬─────┬──────┐
# │ id ┆ cat ┆ test │
# │ --- ┆ --- ┆ --- │
# │ str ┆ i64 ┆ u32 │
# ╞═════╪═════╪══════╡
# │ a ┆ 1 ┆ 2 │
# │ a ┆ 2 ┆ 2 │
# │ b ┆ 1 ┆ 2 │
# │ b ┆ 2 ┆ 2 │
# └─────┴─────┴──────┘
Edit
As #jqurious points out we can use rank for this:
(df_pl.with_row_count("idx")
    .select(
        pl.first("idx").over(["id", "cat"]).rank("dense") - 1
    )
)
shape: (8, 1)
┌─────┐
│ idx │
│ --- │
│ u32 │
╞═════╡
│ 0 │
│ 0 │
│ 1 │
│ 1 │
│ 2 │
│ 2 │
│ 3 │
│ 3 │
└─────┘
The following might be more clear:
df = pl.DataFrame(
{
"id": ["a", "a", "a", "a", "b", "b", "b", "b"],
"cat": [1, 1, 2, 2, 1, 1, 2, 2],
}
)
(
# Add row count to each line to create an index.
df.with_row_count("idx")
# Group on id and cat column.
.groupby(
["id", "cat"],
maintain_order=True,
)
.agg(
# Create a list of all index positions per group.
pl.col("idx")
)
# Add a new row count for each group.
.with_row_count("ngroup")
# Expand idx list column to separate rows.
.explode("idx")
# Reorder columns.
.select(["idx", "ngroup", "id", "cat"])
# Optionally sort by original order.
.sort("idx")
)
┌─────┬────────┬─────┬─────┐
│ idx ┆ ngroup ┆ id ┆ cat │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ str ┆ i64 │
╞═════╪════════╪═════╪═════╡
│ 0 ┆ 0 ┆ a ┆ 1 │
│ 1 ┆ 0 ┆ a ┆ 1 │
│ 2 ┆ 1 ┆ a ┆ 2 │
│ 3 ┆ 1 ┆ a ┆ 2 │
│ 4 ┆ 2 ┆ b ┆ 1 │
│ 5 ┆ 2 ┆ b ┆ 1 │
│ 6 ┆ 3 ┆ b ┆ 2 │
│ 7 ┆ 3 ┆ b ┆ 2 │
└─────┴────────┴─────┴─────┘
I have a polars dataframe as follows:
df = pl.DataFrame(
dict(
day=[1, 1, 1, 3, 3, 3, 5, 5, 8, 8, 9, 9, 9],
value=[1, 2, 2, 3, 5, 2, 1, 2, 7, 3, 5, 3, 4],
)
)
I want to incrementally rotate the values in column 'day'. By incremental rotation I mean: for each value, change it to the next larger value that exists in the column, and if the value is already the largest, change it to null/None.
Basically, the result I expect should be the following:
pl.DataFrame(
dict(
day=[3, 3, 3, 5, 5, 5, 8, 8, 9, 9, None, None, None],
value=[1, 2, 2, 3, 5, 2, 1, 2, 7, 3, 5, 3, 4],
)
)
Is there some particular polars-python idiomatic way to achieve this?
If day is sorted - you could group together - shift - then explode back?
(df.groupby("day", maintain_order=True)
    .agg_list()
    .with_columns(pl.col("day").shift(-1))
    .explode(pl.exclude("day")))
shape: (13, 2)
┌──────┬───────┐
│ day | value │
│ --- | --- │
│ i64 | i64 │
╞══════╪═══════╡
│ 3 | 1 │
│ 3 | 2 │
│ 3 | 2 │
│ 5 | 3 │
│ 5 | 5 │
│ 5 | 2 │
│ 8 | 1 │
│ 8 | 2 │
│ 9 | 7 │
│ 9 | 3 │
│ null | 5 │
│ null | 3 │
│ null | 4 │
└──────┴───────┘
Perhaps another approach is to .rank() the column.
.search_sorted() for rank + 1 could find the positions of the next "group".
The max values could be nulled out then passed to .take() to get the new values.
(df.with_columns(
pl.col("day").rank("dense")
.cast(pl.Int64)
.alias("rank"))
.with_columns(
pl.col("rank")
.search_sorted(pl.col("rank") + 1)
.alias("idx"))
.with_columns(
pl.when(pl.col("idx") != pl.col("idx").max())
.then(pl.col("idx"))
.alias("idx"))
.with_columns(
pl.col("day").take(pl.col("idx"))
.alias("new"))
)
shape: (13, 5)
┌─────┬───────┬──────┬──────┬──────┐
│ day | value | rank | idx | new │
│ --- | --- | --- | --- | --- │
│ i64 | i64 | i64 | u32 | i64 │
╞═════╪═══════╪══════╪══════╪══════╡
│ 1 | 1 | 1 | 3 | 3 │
│ 1 | 2 | 1 | 3 | 3 │
│ 1 | 2 | 1 | 3 | 3 │
│ 3 | 3 | 2 | 6 | 5 │
│ 3 | 5 | 2 | 6 | 5 │
│ 3 | 2 | 2 | 6 | 5 │
│ 5 | 1 | 3 | 8 | 8 │
│ 5 | 2 | 3 | 8 | 8 │
│ 8 | 7 | 4 | 10 | 9 │
│ 8 | 3 | 4 | 10 | 9 │
│ 9 | 5 | 5 | null | null │
│ 9 | 3 | 5 | null | null │
│ 9 | 4 | 5 | null | null │
└─────┴───────┴──────┴──────┴──────┘
Feels like I'm missing an obvious simpler approach here..
#jqurious, what I'd recommend for remapping values is a join. Joins are heavily optimized and scale very well, especially on machines with a good number of cores.
As an example, let's benchmark some solutions.
First, some data
Let's use enough data to avoid spurious results from "microbenchmarking" using tiny datasets. (I see this all too often - tiny datasets with benchmark results down to a few microseconds or milliseconds.)
On my 32-core system with 512 GB of RAM, that means expanding the dataset to one billion records. (Choose a different value below as appropriate for your computing platform.)
import polars as pl
import numpy as np
import time
rng = np.random.default_rng(1)
nbr_rows = 1_000_000_000
df = pl.DataFrame(
dict(
day=rng.integers(1, 1_000_000, nbr_rows),
value=rng.integers(1, 1_000_000, nbr_rows),
)
).with_row_count()
df
shape: (1000000000, 3)
┌───────────┬────────┬────────┐
│ row_nr ┆ day ┆ value │
│ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 │
╞═══════════╪════════╪════════╡
│ 0 ┆ 473189 ┆ 747152 │
│ 1 ┆ 511822 ┆ 298575 │
│ 2 ┆ 755167 ┆ 868027 │
│ 3 ┆ 950463 ┆ 289295 │
│ ... ┆ ... ┆ ... │
│ 999999996 ┆ 828237 ┆ 503917 │
│ 999999997 ┆ 909996 ┆ 447681 │
│ 999999998 ┆ 309104 ┆ 588174 │
│ 999999999 ┆ 485525 ┆ 198567 │
└───────────┴────────┴────────┘
Assumption: Not sorted
Let's suppose that we cannot assume that the data is sorted by day. (We'll have to adapt the solutions somewhat.)
Join
Here's the results using a join. If you watch your CPU usage, for example using top in Linux, you'll see that the algorithm is heavily multi-threaded. It spends the majority of its time spread across all cores of your system.
start = time.perf_counter()
(
df
.join(
df
.select(pl.col('day').unique().sort())
.with_columns(
pl.col('day').shift(-1).alias('new_day')
),
how='inner',
on='day',
)
)
print(time.perf_counter() - start)
shape: (1000000000, 4)
┌───────────┬────────┬────────┬─────────┐
│ row_nr ┆ day ┆ value ┆ new_day │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ i64 │
╞═══════════╪════════╪════════╪═════════╡
│ 0 ┆ 473189 ┆ 747152 ┆ 473190 │
│ 1 ┆ 511822 ┆ 298575 ┆ 511823 │
│ 2 ┆ 755167 ┆ 868027 ┆ 755168 │
│ 3 ┆ 950463 ┆ 289295 ┆ 950464 │
│ ... ┆ ... ┆ ... ┆ ... │
│ 999999996 ┆ 828237 ┆ 503917 ┆ 828238 │
│ 999999997 ┆ 909996 ┆ 447681 ┆ 909997 │
│ 999999998 ┆ 309104 ┆ 588174 ┆ 309105 │
│ 999999999 ┆ 485525 ┆ 198567 ┆ 485526 │
└───────────┴────────┴────────┴─────────┘
>>> print(time.perf_counter() - start)
20.85321443199973
groupby-explode
Now let's try the groupby-explode solution. This algorithm will spend a good share of time in single-threaded mode.
I've had to add a sort after the grouping step because the algorithm assumes sorted data in the steps after it.
start = time.perf_counter()
(
df
.groupby("day", maintain_order=False)
.agg_list()
.sort(['day'])
.with_columns(pl.col("day").shift(-1))
.explode(pl.exclude("day"))
)
print(time.perf_counter() - start)
shape: (1000000000, 3)
┌──────┬───────────┬────────┐
│ day ┆ row_nr ┆ value │
│ --- ┆ --- ┆ --- │
│ i64 ┆ u32 ┆ i64 │
╞══════╪═══════════╪════════╡
│ 2 ┆ 197731 ┆ 4093 │
│ 2 ┆ 3154732 ┆ 433246 │
│ 2 ┆ 4825468 ┆ 436316 │
│ 2 ┆ 4927362 ┆ 83493 │
│ ... ┆ ... ┆ ... │
│ null ┆ 993596728 ┆ 25604 │
│ null ┆ 995160321 ┆ 575415 │
│ null ┆ 996690852 ┆ 490825 │
│ null ┆ 999391650 ┆ 92113 │
└──────┴───────────┴────────┘
>>> print(time.perf_counter() - start)
54.04602192300081
rank
Now, the rank method. This algorithm will spend nearly all its time in single-threaded mode.
I've also had to add a sort here, as the ranks are assumed to be sorted in the search_sorted step.
start = time.perf_counter()
(
df
.sort(['day'])
.with_columns(
pl.col("day").rank("dense").cast(pl.Int64).alias("rank")
)
.with_columns(
pl.col("rank").search_sorted(pl.col("rank") + 1).alias("idx")
)
.with_columns(
pl.when(pl.col("idx") != pl.col("idx").max())
.then(pl.col("idx"))
.alias("idx")
)
.with_columns(
pl.col("day").take(pl.col("idx")).alias("new")
)
)
print(time.perf_counter() - start)
shape: (1000000000, 6)
┌───────────┬────────┬────────┬────────┬──────┬──────┐
│ row_nr ┆ day ┆ value ┆ rank ┆ idx ┆ new │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ i64 ┆ u32 ┆ i64 │
╞═══════════╪════════╪════════╪════════╪══════╪══════╡
│ 197731 ┆ 1 ┆ 4093 ┆ 1 ┆ 1907 ┆ 2 │
│ 3154732 ┆ 1 ┆ 433246 ┆ 1 ┆ 1907 ┆ 2 │
│ 4825468 ┆ 1 ┆ 436316 ┆ 1 ┆ 1907 ┆ 2 │
│ 4927362 ┆ 1 ┆ 83493 ┆ 1 ┆ 1907 ┆ 2 │
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
│ 993596728 ┆ 999999 ┆ 25604 ┆ 999999 ┆ null ┆ null │
│ 995160321 ┆ 999999 ┆ 575415 ┆ 999999 ┆ null ┆ null │
│ 996690852 ┆ 999999 ┆ 490825 ┆ 999999 ┆ null ┆ null │
│ 999391650 ┆ 999999 ┆ 92113 ┆ 999999 ┆ null ┆ null │
└───────────┴────────┴────────┴────────┴──────┴──────┘
>>> print(time.perf_counter() - start)
98.63108555600047
Assumption: Sorted by day
If we can assume that our data is already sorted by day, we can cut out unnecessary steps in our algorithms - as well as see some decent increases in speed.
We'll sort the data first and re-run our algorithms. Note that sorting sets the sorted flag on the day column, which allows algorithms to take shortcuts to increase speed. (If not sorting manually, the set_sorted method can be used to tell Polars that the column is pre-sorted.)
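For example, a minimal sketch of marking the column as pre-sorted (assuming it genuinely is already sorted; the exact spelling may differ between Polars versions):
# flag "day" as sorted so the fast paths can be used
df = df.with_columns(pl.col("day").set_sorted())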
df = df.sort(['day'])
df
shape: (1000000000, 3)
┌───────────┬────────┬────────┐
│ row_nr ┆ day ┆ value │
│ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 │
╞═══════════╪════════╪════════╡
│ 197731 ┆ 1 ┆ 4093 │
│ 3154732 ┆ 1 ┆ 433246 │
│ 4825468 ┆ 1 ┆ 436316 │
│ 4927362 ┆ 1 ┆ 83493 │
│ ... ┆ ... ┆ ... │
│ 993596728 ┆ 999999 ┆ 25604 │
│ 995160321 ┆ 999999 ┆ 575415 │
│ 996690852 ┆ 999999 ┆ 490825 │
│ 999391650 ┆ 999999 ┆ 92113 │
└───────────┴────────┴────────┘
Join
The code employing a join needs no changes; however, it does see an incredible speedup.
start = time.perf_counter()
(
df
.join(
df
.select(pl.col('day').unique().sort())
.with_columns(
pl.col('day').shift(-1).alias('new_day')
),
how='inner',
on='day',
)
)
print(time.perf_counter() - start)
shape: (1000000000, 4)
┌───────────┬────────┬────────┬─────────┐
│ row_nr ┆ day ┆ value ┆ new_day │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ i64 │
╞═══════════╪════════╪════════╪═════════╡
│ 197731 ┆ 1 ┆ 4093 ┆ 2 │
│ 3154732 ┆ 1 ┆ 433246 ┆ 2 │
│ 4825468 ┆ 1 ┆ 436316 ┆ 2 │
│ 4927362 ┆ 1 ┆ 83493 ┆ 2 │
│ ... ┆ ... ┆ ... ┆ ... │
│ 993596728 ┆ 999999 ┆ 25604 ┆ null │
│ 995160321 ┆ 999999 ┆ 575415 ┆ null │
│ 996690852 ┆ 999999 ┆ 490825 ┆ null │
│ 999391650 ┆ 999999 ┆ 92113 ┆ null │
└───────────┴────────┴────────┴─────────┘
>>> print(time.perf_counter() - start)
8.71159654099938
Note the same exact join algorithm now finishes in only 8.7 seconds rather than 20.9 seconds, largely due to the data being pre-sorted, and the sorted flag being set on day.
groupby-explode
We'll eliminate the superfluous sort within the algorithm, and re-run it.
start = time.perf_counter()
(
df
.groupby("day", maintain_order=True)
.agg_list()
.with_columns(pl.col("day").shift(-1))
.explode(pl.exclude("day"))
)
print(time.perf_counter() - start)
shape: (1000000000, 3)
┌──────┬───────────┬────────┐
│ day ┆ row_nr ┆ value │
│ --- ┆ --- ┆ --- │
│ i64 ┆ u32 ┆ i64 │
╞══════╪═══════════╪════════╡
│ 2 ┆ 197731 ┆ 4093 │
│ 2 ┆ 3154732 ┆ 433246 │
│ 2 ┆ 4825468 ┆ 436316 │
│ 2 ┆ 4927362 ┆ 83493 │
│ ... ┆ ... ┆ ... │
│ null ┆ 993596728 ┆ 25604 │
│ null ┆ 995160321 ┆ 575415 │
│ null ┆ 996690852 ┆ 490825 │
│ null ┆ 999391650 ┆ 92113 │
└──────┴───────────┴────────┘
>>> print(time.perf_counter() - start)
8.249637401000655
Note how this algorithm takes slightly less time than the join algorithm, all due to the assumption of day being pre-sorted.
rank
Again, we'll eliminate the superfluous sort and re-run the algorithm.
start = time.perf_counter()
(
df
.with_columns(
pl.col("day").rank("dense").cast(pl.Int64).alias("rank")
)
.with_columns(
pl.col("rank").search_sorted(pl.col("rank") + 1).alias("idx")
)
.with_columns(
pl.when(pl.col("idx") != pl.col("idx").max())
.then(pl.col("idx"))
.alias("idx")
)
.with_columns(
pl.col("day").take(pl.col("idx")).alias("new")
)
)
print(time.perf_counter() - start)
shape: (1000000000, 6)
┌───────────┬────────┬────────┬────────┬──────┬──────┐
│ row_nr ┆ day ┆ value ┆ rank ┆ idx ┆ new │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ i64 ┆ u32 ┆ i64 │
╞═══════════╪════════╪════════╪════════╪══════╪══════╡
│ 197731 ┆ 1 ┆ 4093 ┆ 1 ┆ 1907 ┆ 2 │
│ 3154732 ┆ 1 ┆ 433246 ┆ 1 ┆ 1907 ┆ 2 │
│ 4825468 ┆ 1 ┆ 436316 ┆ 1 ┆ 1907 ┆ 2 │
│ 4927362 ┆ 1 ┆ 83493 ┆ 1 ┆ 1907 ┆ 2 │
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
│ 993596728 ┆ 999999 ┆ 25604 ┆ 999999 ┆ null ┆ null │
│ 995160321 ┆ 999999 ┆ 575415 ┆ 999999 ┆ null ┆ null │
│ 996690852 ┆ 999999 ┆ 490825 ┆ 999999 ┆ null ┆ null │
│ 999391650 ┆ 999999 ┆ 92113 ┆ 999999 ┆ null ┆ null │
└───────────┴────────┴────────┴────────┴──────┴──────┘
>>> print(time.perf_counter() - start)
48.90440067800046
Although this algorithm now takes roughly half the time, it's not quite as fast as the join or groupby-explode algorithms.
Of course, wall-clock performance is not the end-all-be-all. But when problems scale up, joins are particularly good tools, even when we cannot make assumptions regarding the sorted-ness of our data.
In Pandas I can do the following:
data = pd.DataFrame(
{
"era": ["01", "01", "02", "02", "03", "10"],
"pred1": [1, 2, 3, 4, 5,6],
"pred2": [2,4,5,6,7,8],
"pred3": [3,5,6,8,9,1],
"something_else": [5,4,3,67,5,4],
})
pred_cols = ["pred1", "pred2", "pred3"]
ERA_COL = "era"
DOWNSAMPLE_CROSS_VAL = 10
test_split = ['01', '02', '10']
test_split_index = data[ERA_COL].isin(test_split)
downsampled_train_split_index = train_split_index[test_split_index].index[::DOWNSAMPLE_CROSS_VAL]
data.loc[test_split_index, "pred1"] = somefunction()["another_column"]
How can I achieve the same in Polars? I tried to do some data.filter(****) = somefunction()["another_column"], but the filter output is not assignable with Polars.
Let's see if I can help. It would appear that what you want to accomplish is to replace a subset/filtered portion of a column with values derived from one or more other columns.
For example, if you are attempting to accomplish this:
ERA_COL = "era"
test_split = ["01", "02", "10"]
test_split_index = data[ERA_COL].isin(test_split)
data.loc[test_split_index, "pred1"] = -2 * data["pred3"]
print(data)
>>> print(data)
era pred1 pred2 pred3 something_else
0 01 -6 2 3 5
1 01 -10 4 5 4
2 02 -12 5 6 3
3 02 -16 6 8 67
4 03 5 7 9 5
5 10 -2 8 1 4
We would accomplish the above in Polars using a when/then/otherwise expression:
(
    pl.from_pandas(data)
    .with_column(
        pl.when(pl.col(ERA_COL).is_in(test_split))
        .then(-2 * pl.col('pred3'))
        .otherwise(pl.col('pred1'))
        .alias('pred1')
    )
)
shape: (6, 5)
┌─────┬───────┬───────┬───────┬────────────────┐
│ era ┆ pred1 ┆ pred2 ┆ pred3 ┆ something_else │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═══════╪═══════╪═══════╪════════════════╡
│ 01 ┆ -6 ┆ 2 ┆ 3 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 01 ┆ -10 ┆ 4 ┆ 5 ┆ 4 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 02 ┆ -12 ┆ 5 ┆ 6 ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 02 ┆ -16 ┆ 6 ┆ 8 ┆ 67 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 03 ┆ 5 ┆ 7 ┆ 9 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10 ┆ -2 ┆ 8 ┆ 1 ┆ 4 │
└─────┴───────┴───────┴───────┴────────────────┘
Is this what you were looking to accomplish?
A few things as general points.
polars syntax doesn't attempt to match that of pandas.
In polars, you can only assign a whole df; you can't assign part of a df.
polars doesn't use an internal index, so there's no index to record.
For your problem, assuming there isn't already a natural index, you'd want to make an explicit index.
pldata = pl.from_pandas(data).with_row_count(name='myindx')
then recording the index would be
test_split_index = pldata.filter(pl.col(ERA_COL).is_in(test_split)).select('myindx').to_series()
For your last bit on the final assignment, without knowing anything about somefunction my best guess is that you'd want to do that with a join.
Maybe something like:
pldata = pldata.join(
    pl.from_pandas(somefunction()['another_column'])
        .with_column(test_split_index.alias('myindx')),
    on='myindx'
)
Note that your test_split_index is actually a boolean mask rather than the index, whereas the above uses the actual index positions, so take that with a grain of salt.
All that being said, polars has free copies of data, so rather than keeping track of index positions manually (as error prone as that can be), you can just make two new dfs: under the hood they only reference the data, they don't make a physical copy of it.
Something like:
testdata=pldata.filter(pl.col(ERA_COL).is_in(test_split))
traindata=pldata.filter(~pl.col(ERA_COL).is_in(test_split))
I'm looking for a function along the lines of
df.groupby('column').agg(sample(10))
so that I can take ten or so randomly-selected elements from each group.
This is specifically so I can read in a LazyFrame and work with a small sample of each group as opposed to the entire dataframe.
Update:
One approximate solution is:
df = lf.groupby('column').agg(
pl.all().sample(.001)
)
df = df.explode(df.columns[1:])
Update 2
That approximate solution is just the same as sampling the whole dataframe and doing a groupby after. No good.
Let's start with some dummy data:
n = 100
seed = 0
df = pl.DataFrame(
{
"groups": (pl.arange(0, n, eager=True) % 5).shuffle(seed=seed),
"values": pl.arange(0, n, eager=True).shuffle(seed=seed)
}
)
df
shape: (100, 2)
┌────────┬────────┐
│ groups ┆ values │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞════════╪════════╡
│ 0 ┆ 55 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 0 ┆ 40 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 57 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 99 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 87 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1 ┆ 96 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3 ┆ 43 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 44 │
└────────┴────────┘
This gives us 100 / 5 = 5 groups of 20 elements each. Let's verify that:
df.groupby("groups").agg(pl.count())
shape: (5, 2)
┌────────┬───────┐
│ groups ┆ count │
│ --- ┆ --- │
│ i64 ┆ u32 │
╞════════╪═══════╡
│ 1 ┆ 20 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ 20 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4 ┆ 20 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ 20 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0 ┆ 20 │
└────────┴───────┘
Sample our data
Now we are going to use a window function to take a sample of our data.
df.filter(
pl.arange(0, pl.count()).shuffle().over("groups") < 10
)
shape: (50, 2)
┌────────┬────────┐
│ groups ┆ values │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞════════╪════════╡
│ 0 ┆ 85 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 0 ┆ 0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 84 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 19 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 87 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1 ┆ 96 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3 ┆ 43 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 44 │
└────────┴────────┘
For every group in over("groups"), the pl.arange(0, pl.count()) expression creates a row index. We then shuffle that range so that we take a sample and not a slice. Then we keep only the index values that are lower than 10. This creates a boolean mask that we can pass to the filter method.
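Since the question mentions LazyFrames: the same filter works lazily as well, and a seed can be passed to shuffle for reproducibility (a sketch):
(df.lazy()
    .filter(pl.arange(0, pl.count()).shuffle(seed=0).over("groups") < 10)
    .collect())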
A solution using lambda
df = (
lf.groupby('column')
.apply(lambda x: x.sample(10))
)
We can try making our own groupby-like functionality and sampling from the filtered subsets.
samples = []
cats = df.get_column('column').unique().to_list()
for cat in cats:
    samples.append(df.filter(pl.col('column') == cat).sample(10))
samples = pl.concat(samples)
Found partition_by in the documentation; this should be more efficient, since at least the groups are made with the API in a single pass over the dataframe. Sampling each group is still linear, unfortunately.
pl.concat([x.sample(10) for x in df.partition_by(groups="column")])
Third attempt, sampling indices:
import numpy as np
import random
indices = df.groupby("group").agg(pl.col("value").agg_groups()).get_column("value").to_list()
sampled = np.array([random.sample(x, 10) for x in indices]).flatten()
df[sampled]
I have a question regarding filling null values: is it possible to backfill data from other columns, as in pandas?
Working pandas example on how to backfill data:
df.loc[:, ['A', 'B', 'C']] = df[['A', 'B', 'C']].fillna(
value={'A': df['D'],
'B': df['D'],
'C': df['D'],
})
Polars example where I tried to backfill data from column D into column A if the value is null, but it's not working:
df = pl.DataFrame(
{"date": ["2020-01-01 00:00:00", "2020-01-07 00:00:00", "2020-01-14 00:00:00"],
"A": [3, 4, 7],
"B": [3, 4, 5],
"C": [0, 1, 2],
"D": [1, 2, 5]})
df = df.with_column(pl.col("date").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S"))
date_range = df.select(
    pl.arange(df["date"][0], df["date"][-1] + 1, step=1000 * 60 * 60 * 24)
    .cast(pl.Datetime)
    .alias("date")
)
df = (date_range.join(df, on="date", how="left"))
df['D'] = df['D'].fill_null("forward")
print(df)
df[:, ['A']] = df[['A']].fill_null({
    'A': df['D']
})
print(df)
Kind regards,
Tom
In the example you show and the accompanying pandas code, the fillna doesn't fill any null values, because the other columns are also NaN. So I am going to assume that you want to fill missing values with values from another column that doesn't have missing values, but correct me if I am wrong.
import polars as pl
from polars import col
df = pl.DataFrame({
"a": [0, 1, 2, 3, None, 5, 6, None, 8, None],
"b": range(10),
})
out = df.with_columns([
pl.when(col("a").is_null()).then(col("b")).otherwise(col("a")).alias("a"),
pl.when(col("a").is_null()).then(col("b").shift(1)).otherwise(col("a")).alias("a_filled_lag"),
pl.when(col("a").is_null()).then(col("b").mean()).otherwise(col("a")).alias("a_filled_mean")
])
print(out)
In the example above, we use a when -> then -> otherwise expression to fill missing values with another column's values. Think of it as an if/else expression, but applied to whole columns.
I gave 3 examples: one where we fill with the other column's value, one where we fill with the lagged value, and one where we fill with the mean value of the other column.
The snippet above produces:
shape: (10, 4)
┌─────┬─────┬──────────────┬───────────────┐
│ a ┆ b ┆ a_filled_lag ┆ a_filled_mean │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ f64 │
╞═════╪═════╪══════════════╪═══════════════╡
│ 0 ┆ 0 ┆ 0 ┆ 0.0 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 1 ┆ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 2 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 3 ┆ 3 ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ 5 ┆ 5 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 6 ┆ 6 ┆ 6 ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 7 ┆ 7 ┆ 6 ┆ 4.5 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 8 ┆ 8 ┆ 8 ┆ 8 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 9 ┆ 9 ┆ 8 ┆ 4.5 │
└─────┴─────┴──────────────┴───────────────┘
The fillna() method is used to fill null values in pandas.
df['D'] = df['D'].fillna(df['A'].mean())
The above code replaces null values in column D with the mean of column A.
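For completeness, a rough Polars equivalent of that pandas line (a sketch; fill_null accepts an expression):
# replace nulls in column D with the mean of column A
df = df.with_columns(pl.col("D").fill_null(pl.col("A").mean()))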