polars: n_unique(), but as a window function - python

I need a way to find out how many unique pairs of values from two columns are in a certain context.
Basically like n_unique, but as a window function.
To illustrate with a toy example:
import polars as pl
dataframe = pl.DataFrame({
    'context': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'column1': [1, 1, 0, 1, 0, 0, 1, 0, 1],
    'column2': [1, 0, 0, 0, 1, 1, 1, 0, 1]
    # unique:    1  2  3  1  2  -  1  2  -
    # n_unique:  -- 3 --  -- 2 --  -- 2 --
})
I would like to write:
dataframe = (
    dataframe
    .with_column(
        pl.n_unique(['column1', 'column2']).over('context').alias('n_unique')
    )
)
to get the number of unique value pairs from column1, column2 within the window of column 'context'. But that does not work.
One attempt I made was this:
(dataframe
    .with_column(
        pl.concat_list(['column1', 'column2']).alias('pair')
    )
    .with_column(
        pl.n_unique('pair').over('context')
    )
)
To be honest, I wasn't really expecting this to work, and indeed it doesn't:
PanicException: this operation is not implemented/valid for this dtype: List(Int64)
But then what can be done? The "Power BI way" would be to string-merge the columns, but isn't there a proper way?
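For reference, that string-merge workaround might look something like this (just a sketch: cast both columns to strings and concatenate them before counting):
(dataframe
    .with_column(
        (pl.col('column1').cast(pl.Utf8) + pl.lit('_') + pl.col('column2').cast(pl.Utf8))
        .n_unique()
        .over('context')
        .alias('n_unique')
    )
)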
EDIT: I found one way, but I don't like it...
(dataframe
    .with_column(
        # count the unique 'column2' values once per ('context', 'column1') group...
        pl.when(pl.all().cumcount().over(['context', 'column1']) == 0)
        .then(pl.n_unique('column2').over(['context', 'column1']))
        .otherwise(pl.lit(0))
        .alias('n_unique')
    )
    .with_column(
        # ...then sum those partial counts within each context
        pl.col('n_unique').sum().over('context')
    )
)

All expressions are this functional construct: Fn(Series) -> Series. Meaning that if you want an expression to work over multiple columns, you must pack those columns into a single Series.
We can easily do this with the Struct data type.
import polars as pl
df = pl.DataFrame({
    'context': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'column1': [1, 1, 0, 1, 0, 0, 1, 0, 1],
    'column2': [1, 0, 0, 0, 1, 1, 1, 0, 1]
})

df.with_column(
    pl.struct(["column1", "column2"]).n_unique().over("context").alias("n_unique")
)
shape: (9, 4)
┌─────────┬─────────┬─────────┬──────────┐
│ context ┆ column1 ┆ column2 ┆ n_unique │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ u32 │
╞═════════╪═════════╪═════════╪══════════╡
│ 1 ┆ 1 ┆ 1 ┆ 3 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 1 ┆ 0 ┆ 3 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 0 ┆ 0 ┆ 3 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 1 ┆ 0 ┆ 2 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 0 ┆ 1 ┆ 2 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 1 ┆ 1 ┆ 2 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 0 ┆ 0 ┆ 2 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 1 ┆ 1 ┆ 2 │
└─────────┴─────────┴─────────┴──────────┘
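If you only need one count per context, rather than having it broadcast back over every row, the same struct expression can be used in a plain groupby as well (a sketch along the same lines as above):
df.groupby("context").agg(
    pl.struct(["column1", "column2"]).n_unique().alias("n_unique")
)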

Related

Selecting columns based on a condition in Polars

I want to select columns in a Polars DataFrame based on a condition. In my case, I want to select all string columns that have less than 100 unique values. Naively I tried:
df.select((pl.col(pl.Utf8)) & (pl.all().n_unique() < 100))
which gave me an error, probably due to the second part of the expression.
df.select(pl.all().n_unique() < 100)
This doesn't select columns but instead returns a single row DataFrame of bool values. I'm new to polars and still can't quite wrap my head around the expression API, I guess. What am I doing wrong?
It's helpful if you include an example to save others from having to create one.
df = pl.DataFrame({
    "col1": ["A", "B", "C", "D"],
    "col2": ["A", "A", "C", "A"],
    "col3": ["A", "B", "A", "B"],
    "col4": [1, 2, 3, 4],
})
You are selecting the string columns with pl.col(pl.Utf8)
>>> df.select(pl.col(pl.Utf8))
shape: (4, 3)
┌──────┬──────┬──────┐
│ col1 | col2 | col3 │
│ --- | --- | --- │
│ str | str | str │
╞══════╪══════╪══════╡
│ A | A | A │
│ B | A | B │
│ C | C | A │
│ D | A | B │
└──────┴──────┴──────┘
You can chain .n_unique() to the pl.col() to run it just on those columns.
>>> df.select(pl.col(pl.Utf8).n_unique() < 3)
shape: (1, 3)
┌───────┬──────┬──────┐
│ col1 | col2 | col3 │
│ --- | --- | --- │
│ bool | bool | bool │
╞═══════╪══════╪══════╡
│ false | true | true │
└───────┴──────┴──────┘
You can loop over this result and extract the .name for each true column.
There is no .is_true() but .all() is equivalent.
>>> [ col.name for col in df.select(pl.col(pl.Utf8).n_unique() < 3) if col.all() ]
['col2', 'col3']
You can then select just those columns:
df.select(
    col.name
    for col in df.select(pl.col(pl.Utf8).n_unique() < 3)
    if col.all()
)
shape: (4, 2)
┌──────┬──────┐
│ col2 | col3 │
│ --- | --- │
│ str | str │
╞══════╪══════╡
│ A | A │
│ A | B │
│ C | A │
│ A | B │
└──────┴──────┘
You could get the names of the columns by doing a melt followed by a groupby, but I'm not too sure how to turn this into an expression.
df = pl.DataFrame(
    {
        "val1": ["a", "b", "c"],
        "val2": ["d", "d", "d"],
    }
)
columns = (
    df.select(pl.col(pl.Utf8))
    .melt()
    .groupby("variable")
    .agg(pl.col("value").n_unique())
    .filter(pl.col("value") >= 3)
    .get_column("variable")
    .to_list()
)
df.select(columns)

Can I get elements from column of lists by list of indexes?

Is there a method in (Py)Polars to subset list elements in a column of lists according to a list of indexes in another column? As far as I can tell, arr.get() accepts only an integer and does not accept expressions (like pl.col('prices').arr.get(pl.col('idxs').arr.first())).
Can I get something like:
df = pl.DataFrame(
    {'idxs': [[0], [1], [0, 2]],
     'prices': [[0.0, 3.5], [4.6, 0.0], [0.0, 7.8, 0.0]]}
)
(df
    .with_column(
        pl.col('prices').arr.get(pl.col('idxs')).alias('zero_prices')
    )
)
This can be resolved by applying a Python UDF to pl.struct(pl.all()), like:
def get_zero_prices(cols):
    return [float(el) for i, el in enumerate(cols['prices']) if i in cols['idxs']]

(df
    .with_column(
        pl.struct(pl.all()).apply(lambda x: get_zero_prices(x)).alias('zero_prices')
    )
)
But this doesn't look very idiomatic.
What you want is to be able to utilize the full expression API whilst operating on certain sub-elements or groups. That's what a groupby is!
So ideally we groom our DataFrame into a state where every group corresponds to the elements of one of our lists.
First we start with some data, and then we add a row count that will represent our unique groups.
df = pl.DataFrame({
    "idx": [[0], [1], [0, 2]],
    "array": [["a", "b"], ["c", "d"], ["e", "f", "g"]]
}).with_row_count("row_nr")
print(df)
shape: (3, 3)
┌────────┬───────────┬─────────────────┐
│ row_nr ┆ idx ┆ array │
│ --- ┆ --- ┆ --- │
│ u32 ┆ list[i64] ┆ list[str] │
╞════════╪═══════════╪═════════════════╡
│ 0 ┆ [0] ┆ ["a", "b"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ [1] ┆ ["c", "d"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ [0, 2] ┆ ["e", "f", "g"] │
└────────┴───────────┴─────────────────┘
Next we explode by the "idx" column so that we can create the groups for our groupby.
df = df.explode("idx")
print(df)
shape: (4, 3)
┌────────┬─────┬─────────────────┐
│ row_nr ┆ idx ┆ array │
│ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ list[str] │
╞════════╪═════╪═════════════════╡
│ 0 ┆ 0 ┆ ["a", "b"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 1 ┆ ["c", "d"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 0 ┆ ["e", "f", "g"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ ["e", "f", "g"] │
└────────┴─────┴─────────────────┘
Finally we can apply the groupby and take the subelements for each list/group.
(df
    .groupby("row_nr")
    .agg([
        pl.col("array").first(),
        pl.col("idx"),
        pl.col("array").first().take(pl.col("idx")).alias("arr_taken")
    ])
)
This returns:
shape: (3, 4)
┌────────┬─────────────────┬───────────┬────────────┐
│ row_nr ┆ array ┆ idx ┆ arr_taken │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ list[str] ┆ list[i64] ┆ list[str] │
╞════════╪═════════════════╪═══════════╪════════════╡
│ 0 ┆ ["a", "b"] ┆ [0] ┆ ["a"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ ["c", "d"] ┆ [1] ┆ ["d"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ ["e", "f", "g"] ┆ [0, 2] ┆ ["e", "g"] │
└────────┴─────────────────┴───────────┴────────────┘
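One caveat: a groupby does not guarantee the original row order, so if that matters, a sort on the helper column can be chained onto the aggregation above (a small sketch):
(df
    .groupby("row_nr")
    .agg([
        pl.col("array").first(),
        pl.col("idx"),
        pl.col("array").first().take(pl.col("idx")).alias("arr_taken")
    ])
    .sort("row_nr")
)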

DataFrame challenge: mapping ID to value in different row. Preferably with Polars

Consider this example:
import polars as pl
df = pl.DataFrame({
    'ID': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10'],
    'Name': ['A', '', '', '', 'B', '', 'C', '', '', 'D', ''],
    'Element': ['', '4', '4', '0', '', '4', '', '0', '9', '', '6']
})
The 'Name' is linked to an 'ID'. This ID is used as a value in the 'Element' column. How do I map the correct 'Name' to the elements? Also I want to group the elements by 'Name' ('Name_list'), count them and sort by counted values ('E_count').
The resulting df would be:
Name_list  Element  E_count
---------------------------
'B'        '4'        3
'A'        '0'        2
'C'        '6'        1
'D'        '9'        1
Feedback very much appreciated; even a Pandas solution.
Here's a Polars solution. We'll use a join to link the ID and Element columns (after some filtering and summarizing).
import polars as pl
(
    df.select(["Name", "ID"])
    .filter(pl.col("Name") != "")
    .join(
        df.groupby("Element").agg(pl.count().alias("E_count")),
        left_on="ID",
        right_on="Element",
        how="left",
    )
    .sort('E_count', reverse=True)
    .rename({"Name": "Name_list", "ID": "Element"})
)
Note: this differs from the solution listed in your answer. The Name D is associated with ID 9 (not 10).
shape: (4, 3)
┌───────────┬─────────┬─────────┐
│ Name_list ┆ Element ┆ E_count │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ u32 │
╞═══════════╪═════════╪═════════╡
│ B ┆ 4 ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ A ┆ 0 ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ C ┆ 6 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ D ┆ 9 ┆ 1 │
└───────────┴─────────┴─────────┘
You can also use the polars.Series.value_counts method, which looks somewhat cleaner:
import polars as pl
(
    df.select(["Name", "ID"])
    .filter(pl.col("Name") != "")
    .join(
        df.get_column("Element").value_counts(),
        left_on="ID",
        right_on="Element",
        how="left",
    )
    .sort("counts", reverse=True)
    .rename({"Name": "Name_list", "ID": "Element", "counts": "E_count"})
)
If I understood your problem correctly, then you could use pandas and do the following:
import pandas as pd

countdf = pd.merge(df, df[['ID', 'Name']], left_on='Element', right_on='ID', how='inner')
countdf = pd.DataFrame(countdf.groupby('Name_y')['Element'].count())
result = pd.merge(countdf, df[['Name', 'ID']], left_on='Name_y', right_on='Name', how='left')
result[['Name', 'ID', 'Element']]
Using pandas: we can use map to map the values, with a where condition to keep Name from becoming null. Lastly, it's a groupby.
df['Name'] = df['Name'].where(
    cond=df['Element'] == "",
    other=df[df['Element'] != ""]['Element'].map(lambda x: df[df['ID'] == x]['Name'].tolist()[0]),
    axis=0,
)
df[df['Element'] != ""].groupby(['Name', 'Element']).count().reset_index()
Name Element ID
0 A 0 2
1 B 4 3
2 C 6 1
3 D 9 1
Try this, you don't need groupby nor joins, just map and value_counts:
df.drop('Element', axis=1)\
.query('Name != "" ')\
.assign(E_count = df['ID'].map(df['Element'].value_counts()))
Output:
ID Name E_count
0 0 A 2.0
4 4 B 3.0
6 6 C 1.0
9 9 D 1.0

Pandas create a unique id for each row based on a condition

I have a dataset where one of the columns is as below. I'd like to create a new column based on the condition below.
For values in column_name, if a 1 is present, create a new ID; if a 0 is present, also create a new ID. But if 1 is repeated over several contiguous rows, the ID should be the same for all of those rows. The sample output can be seen below.
column_name
1
0
0
1
1
1
1
0
0
1
column_name -- ID
1 -- 1
0 -- 2
0 -- 3
1 -- 4
1 -- 4
1 -- 4
1 -- 4
0 -- 5
0 -- 6
1 -- 7
Say your Series is
s = pd.Series([1, 0, 0, 1, 1, 1, 1, 0, 0, 1])
Then you can use:
>>> ((s != 1) | (s.shift(1) != 1)).cumsum()
0 1
1 2
2 3
3 4
4 4
5 4
6 4
7 5
8 6
9 7
dtype: int64
This checks that either the current entry is not 1, or that the previous entry is not 1, and then performs a cumulative sum on the result.
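For instance, to attach it to a frame as a new column (a small usage sketch, assuming the column is named column_name as in the question):
import pandas as pd

df = pd.DataFrame({'column_name': [1, 0, 0, 1, 1, 1, 1, 0, 0, 1]})
s = df['column_name']
# a new group starts whenever the current value or the previous value is not 1
df['ID'] = ((s != 1) | (s.shift(1) != 1)).cumsum()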
Essentially leveraging the fact that a 1 in the Series lagged by another 1 should be treated as part of the same group, while every 0 calls for an increment. One of four things will happen:
1) 0 with a preceding 0 : Increment by 1
2) 0 with a preceding 1 : Increment by 1
3) 1 with a preceding 1 : Increment by 0
4) 1 with a preceding 0: Increment by 1
(
    (df['column_name'] + df['column_name'].shift(1))  # Series with values 0, 1, or 2 (first entry is NaN)
    .fillna(0)                                         # fill the first entry with 0
    .isin([0, 1])                                      # True for cases 1, 2, and 4 described above, else False (case 3)
    .astype('int')                                     # convert the booleans to integers
    .cumsum()
)
Output:
0 1
1 2
2 3
3 4
4 4
5 4
6 4
7 5
8 6
9 7
At this stage I would just use a regular Python for loop:
column_name = pd.Series([1, 0, 0, 1, 1, 1, 1, 0, 0, 1])
ID = [1]
for i in range(1, len(column_name)):
    ID.append(ID[-1] + ((column_name[i] + column_name[i-1]) < 2))
print(ID)
>>> [1, 2, 3, 4, 4, 4, 4, 5, 6, 7]
And then you can assign ID as a column in your dataframe
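For example (a sketch, assuming the data lives in a dataframe called df built from the Series above):
df = pd.DataFrame({'column_name': column_name})
df['ID'] = ID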

Python: use apply on groups separately after grouping dataframe

My data frame looks like this:
┌────┬──────┬──────┐
│ No │ col1 │ col2 │
├────┼──────┼──────┤
│ 1 │ A │ 5.0 │
│ 1 │ B1 │ 10.0 │
│ 1 │ B2 │ 20.0 │
│ 2 │ A │ 0.0 │
│ 2 │ B1 │ 0.0 │
│ 2 │ C1 │ 0.0 │
│ 3 │ A │ 0.0 │
│ 3 │ B1 │ 5.0 │
│ 3 │ C1 │ 20.0 │
│ 3 │ C2 │ 30.0 │
└────┴──────┴──────┘
First I used groupby to group the data frame by column No.
I would like to do three things now:
1) get a list of No's where col2 == 0.0 in all rows of the group (in this case No. 2)
2) get a list of No's where col2 != 0.0 for col1 == 'A' but at least one other row of the group has col2 == 0.0 (in this case No. 3)
3) get a list of No's where at least one row has col2 == 0.0 (No. 2 and 3)
Sorry for asking three issues at once. Hope that is ok.
Thank you:)
You can use:
g = df['col2'].eq(0).groupby(df['No'])
a = g.all()
a = a.index[a].tolist()
print (a)
[2]
b1 = (df['col2'].ne(0) & df['col1'].eq('A')).groupby(df['No']).any()
b2 = (df['col2'].eq(0) & df['col1'].ne('A')).groupby(df['No']).any()
b = b1 & b2
b = b.index[b].tolist()
print (b)
[]
c = g.any()
c = c.index[c].tolist()
print (c)
[2,3]
Another solution would be a custom function that returns boolean values per group, and finally creates a dictionary with the 3 lists:
def f(x):
    a = x['col2'].eq(0)
    b1 = x['col2'].ne(0) & x['col1'].eq('A')
    b2 = a & x['col1'].ne('A')
    b = b1.any() & b2.any()
    return pd.Series([a.all(), b, a.any()], index=list('abc'))
m = df.groupby('No').apply(f)
print (m)
a b c
No
1 False False False
2 True False True
3 False False True
fin = {x: m[x].index[m[x]].tolist() for x in m.columns}
print (fin)
{'a': [2], 'b': [], 'c': [2, 3]}
