Selecting columns based on a condition in Polars - python

I want to select columns in a Polars DataFrame based on a condition. In my case, I want to select all string columns that have less than 100 unique values. Naively I tried:
df.select((pl.col(pl.Utf8)) & (pl.all().n_unique() < 100))
which gave me an error that is probably due to the second part of the expression.
df.select(pl.all().n_unique() < 100)
This doesn't select columns; instead it returns a single-row DataFrame of bool values. I'm new to Polars and can't quite wrap my head around the expression API yet. What am I doing wrong?

It's helpful if you include an example to save others from having to create one.
df = pl.DataFrame({
    "col1": ["A", "B", "C", "D"],
    "col2": ["A", "A", "C", "A"],
    "col3": ["A", "B", "A", "B"],
    "col4": [1, 2, 3, 4],
})
You are selecting the string columns with pl.col(pl.Utf8)
>>> df.select(pl.col(pl.Utf8))
shape: (4, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ str  ┆ str  ┆ str  │
╞══════╪══════╪══════╡
│ A    ┆ A    ┆ A    │
│ B    ┆ A    ┆ B    │
│ C    ┆ C    ┆ A    │
│ D    ┆ A    ┆ B    │
└──────┴──────┴──────┘
You can chain .n_unique() to the pl.col() to run it just on those columns.
>>> df.select(pl.col(pl.Utf8).n_unique() < 3)
shape: (1, 3)
┌───────┬──────┬──────┐
│ col1  ┆ col2 ┆ col3 │
│ ---   ┆ ---  ┆ ---  │
│ bool  ┆ bool ┆ bool │
╞═══════╪══════╪══════╡
│ false ┆ true ┆ true │
└───────┴──────┴──────┘
You can loop over this result and extract the .name of each column that is true.
There is no .is_true(), but since each column here holds a single boolean, .all() is equivalent.
>>> [ col.name for col in df.select(pl.col(pl.Utf8).n_unique() < 3) if col.all() ]
['col2', 'col3']
You can then select just those columns:
df.select(
    col.name for col in
    df.select(pl.col(pl.Utf8).n_unique() < 3)
    if col.all()
)
shape: (4, 2)
┌──────┬──────┐
│ col2 ┆ col3 │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ A    ┆ A    │
│ A    ┆ B    │
│ C    ┆ A    │
│ A    ┆ B    │
└──────┴──────┘
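If you need this pattern in more than one place, it can be wrapped in a small helper function. This is just a sketch; the helper name and the max_unique parameter are made up for illustration:
import polars as pl

def low_cardinality_string_columns(df, max_unique=100):
    # Each column of the mask holds a single boolean, so .all() just reads it.
    mask = df.select(pl.col(pl.Utf8).n_unique() < max_unique)
    return [col.name for col in mask if col.all()]

# usage: df.select(low_cardinality_string_columns(df))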

You could get the names of the columns by doing a melt followed by a groupby, but I'm not sure how to turn this into a single expression:
df = pl.DataFrame(
    {
        "val1": ["a", "b", "c"],
        "val2": ["d", "d", "d"],
    }
)
columns = (
    df.select(pl.col(pl.Utf8))
    .melt()
    .groupby("variable")
    .agg(pl.col("value").n_unique())
    .filter(pl.col("value") < 3)
    .get_column("variable")
    .to_list()
)
df.select(columns)
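With this toy data, "val1" has 3 unique values and "val2" has 1; filtering for fewer than 3 unique values (matching the question's "less than" condition) keeps only "val2", so df.select(columns) returns just that column.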

Related

How to make dummy columns only on variables with an appropriate number of categories and a sufficient category share in the column?

I have a DataFrame in Python Pandas like below (with both types of columns: numeric and object):
data types:
COL1 - numeric
COL2 - object
COL3 - object
COL1 | COL2 | COL3 | ... | COLn
-----|------|------|-----|-----
111  | A    | Y    | ... | ...
222  | A    | Y    | ... | ...
333  | B    | Z    | ... | ...
444  | C    | Z    | ... | ...
555  | D    | P    | ... | ...
And I need to apply dummy coding (pandas.get_dummies()) only to the categorical variables that satisfy:
a maximum of 3 categories per variable
a minimum share of 0.4 for each category within the variable
So, for example:
COL2 does not meet requirement no. 1 (it has 4 different categories: A, B, C, D), so remove it
In COL3 the category "P" does not meet requirement no. 2 (its share is 1/5 = 0.2), so use only categories "Y" and "Z" for dummy coding
So, as a result I need something like below:
COL1 | COL3_Y | COL3_Z | ... | COLn
-----|--------|--------|------|------
111 | 1 | 0 | ... | ...
222 | 1 | 0 | ... | ...
333 | 0 | 1 | ... | ...
444 | 0 | 1 | ... | ...
555 | 0 | 0 | ... | ...
With the following toy dataframe:
import pandas as pd
df = pd.DataFrame(
    {
        "COL1": [111, 222, 333, 444, 555],
        "COL2": ["A", "A", "B", "C", "D"],
        "COL3": ["Y", "Y", "Z", "Z", "P"],
        "COL4": ["U", "U", "W", "V", "V"],
    }
)
Here is one way to do it:
# Save first column in a new dataframe for later use
new_df = df["COL1"]
# Get number of unique values in each column
s = df.nunique()  # COL1 5, COL2 4, COL3 3, COL4 3
# Filter out columns with too many categories
tmp = df.loc[:, s[s <= 3].index]
# Filter out values with an insufficient percentage,
# get dummies and concat the new columns
for col in tmp.columns:
    frq = tmp[col].value_counts() / tmp.shape[0]
    other_tmp = tmp[col]
    other_tmp = other_tmp[
        other_tmp.isin(frq[frq >= 0.4].index.get_level_values(0).tolist())
    ]
    other_tmp = pd.get_dummies(other_tmp)
    new_df = pd.concat([new_df, other_tmp], axis=1)
# Cleanup
new_df = new_df.fillna(0).astype(int)
Then:
print(new_df)
# Output
   COL1  Y  Z  U  V
0   111  1  0  1  0
1   222  1  0  1  0
2   333  0  1  0  0
3   444  0  1  0  1
4   555  0  0  0  1
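For comparison, the same filtering can be sketched more compactly with value_counts(normalize=True). This is only a sketch under the question's assumptions (COL1 is the identifier column to keep, at most 3 categories, at least a 0.4 share per category):
keep = [c for c in df.select_dtypes("object").columns if df[c].nunique() <= 3]
parts = [df[["COL1"]]]
for col in keep:
    share = df[col].value_counts(normalize=True)
    frequent = share[share >= 0.4].index
    # Categories below the threshold become NaN and therefore get all-zero dummies
    parts.append(pd.get_dummies(df[col].where(df[col].isin(frequent))))
new_df = pd.concat(parts, axis=1).fillna(0).astype(int)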

polars: n_unique(), but as a window function

I need a way to find out how many unique pairs of values from two columns are in a certain context.
Basically like n_unique, but as a window function.
To illustrate with a toy example:
import polars as pl
dataframe = pl.DataFrame({
    'context': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'column1': [1, 1, 0, 1, 0, 0, 1, 0, 1],
    'column2': [1, 0, 0, 0, 1, 1, 1, 0, 1]
    # unique:   1  2  3  1  2  -  1  2  -
    # n_unique: -- 3 --  -- 2 --  -- 2 --
})
I would like to write:
dataframe = (
    dataframe
    .with_column(
        pl.n_unique(['column1', 'column2']).over('context').alias('n_unique')
    )
)
to get the number of unique value pairs from column1, column2 within the window of column 'context'. But that does not work.
One attempt I made was this:
(dataframe
    .with_column(
        pl.concat_list(['column1', 'column2']).alias('pair')
    )
    .with_column(
        pl.n_unique('pair').over('context')
    )
)
To be honest, I wasn't really expecting this to work, and indeed it doesn't:
PanicException: this operation is not implemented/valid for this dtype: List(Int64)
But then what can be done? The "Power BI way" would be to merge the columns into a single string, but isn't there a proper way?
EDIT: I found one way, but I don't like it...
(dataframe
    .with_column(
        pl.when(pl.all().cumcount().over(['context', 'column1']) == 0)
        .then(pl.n_unique('column2').over(['context', 'column1']))
        .otherwise(pl.lit(0))
        .alias('n_unique')
    )
    .with_column(
        pl.col('n_unique').sum().over('context')
    )
)
All expressions are a functional construct Fn(Series) -> Series, meaning that if you want to compute something over multiple columns, you must ensure those columns end up packed into a single input Series.
We can easily do this by packing them into a Struct data type.
import polars as pl
df = pl.DataFrame({
    'context': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'column1': [1, 1, 0, 1, 0, 0, 1, 0, 1],
    'column2': [1, 0, 0, 0, 1, 1, 1, 0, 1]
})
df.with_column(
    pl.struct(["column1", "column2"]).n_unique().over("context").alias("n_unique")
)
shape: (9, 4)
┌─────────┬─────────┬─────────┬──────────┐
│ context ┆ column1 ┆ column2 ┆ n_unique │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ u32 │
╞═════════╪═════════╪═════════╪══════════╡
│ 1 ┆ 1 ┆ 1 ┆ 3 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 1 ┆ 0 ┆ 3 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 0 ┆ 0 ┆ 3 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 1 ┆ 0 ┆ 2 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 0 ┆ 1 ┆ 2 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 1 ┆ 1 ┆ 2 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 0 ┆ 0 ┆ 2 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 1 ┆ 1 ┆ 2 │
└─────────┴─────────┴─────────┴──────────┘
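If you only need one count per context rather than broadcasting it back to every row, the same struct trick works inside a groupby; a minimal sketch:
df.groupby("context").agg(
    pl.struct(["column1", "column2"]).n_unique().alias("n_unique")
)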

Can I get elements from a column of lists by a list of indexes?

Is there a method in (Py)Polars to subset list elements in a column of lists according to a list of indexes in another column? arr.get() accepts only an integer and does not accept expressions (like pl.col('prices').arr.get(pl.col('idxs').arr.first()))?
Can I get something like:
df = pl.DataFrame(
    {'idxs': [[0], [1], [0, 2]],
     'prices': [[0.0, 3.5], [4.6, 0.0], [0.0, 7.8, 0.0]]}
)
(df
    .with_column(
        pl.col('prices').arr.get(pl.col('idxs')).alias('zero_prices')
    )
)
This can be resolved by applying a Python UDF to pl.struct(pl.all()), like:
def get_zero_prices(cols):
    return [float(el) for i, el in enumerate(cols['prices']) if i in cols['idxs']]

(df
    .with_column(
        pl.struct(pl.all()).apply(lambda x: get_zero_prices(x)).alias('zero_prices')
    )
)
But this does not look very idiomatic.
What you want is to be able to utilize the full expression API whilst operating on certain sub-elements or groups. That's what a groupby is!
So ideally we groom our DataFrame into a state where every group corresponds to the elements of one of our lists.
First we start with some data, and then we add a row count that will represent our unique groups.
df = pl.DataFrame({
    "idx": [[0], [1], [0, 2]],
    "array": [["a", "b"], ["c", "d"], ["e", "f", "g"]]
}).with_row_count("row_nr")
print(df)
shape: (3, 3)
┌────────┬───────────┬─────────────────┐
│ row_nr ┆ idx ┆ array │
│ --- ┆ --- ┆ --- │
│ u32 ┆ list[i64] ┆ list[str] │
╞════════╪═══════════╪═════════════════╡
│ 0 ┆ [0] ┆ ["a", "b"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ [1] ┆ ["c", "d"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ [0, 2] ┆ ["e", "f", "g"] │
└────────┴───────────┴─────────────────┘
Next we explode by the "idx" column so that we can create the groups for our groupby.
df = df.explode("idx")
print(df)
shape: (4, 3)
┌────────┬─────┬─────────────────┐
│ row_nr ┆ idx ┆ array │
│ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ list[str] │
╞════════╪═════╪═════════════════╡
│ 0 ┆ 0 ┆ ["a", "b"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 1 ┆ ["c", "d"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 0 ┆ ["e", "f", "g"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ ["e", "f", "g"] │
└────────┴─────┴─────────────────┘
Finally we can apply the groupby and take the subelements for each list/group.
(df
    .groupby("row_nr")
    .agg([
        pl.col("array").first(),
        pl.col("idx"),
        pl.col("array").first().take(pl.col("idx")).alias("arr_taken")
    ])
)
This returns:
shape: (3, 4)
┌────────┬─────────────────┬───────────┬────────────┐
│ row_nr ┆ array ┆ idx ┆ arr_taken │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ list[str] ┆ list[i64] ┆ list[str] │
╞════════╪═════════════════╪═══════════╪════════════╡
│ 0 ┆ ["a", "b"] ┆ [0] ┆ ["a"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ ["c", "d"] ┆ [1] ┆ ["d"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ ["e", "f", "g"] ┆ [0, 2] ┆ ["e", "g"] │
└────────┴─────────────────┴───────────┴────────────┘

DataFrame challenge: mapping ID to value in different row. Preferably with Polars

Consider this example:
import polars as pl
df = pl.DataFrame({
    'ID': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10'],
    'Name': ['A', '', '', '', 'B', '', 'C', '', '', 'D', ''],
    'Element': ['', '4', '4', '0', '', '4', '', '0', '9', '', '6']
})
The 'Name' is linked to an 'ID'. This ID is used as a value in the 'Element' column. How do I map the correct 'Name' to the elements? Also I want to group the elements by 'Name' ('Name_list'), count them and sort by counted values ('E_count').
The resulting df would be:
Name_list  Element  E_count
---------------------------
'B'        '4'      3
'A'        '0'      2
'C'        '6'      1
'D'        '9'      1
Feedback very much appreciated; even a Pandas solution.
Here's a Polars solution. We'll use a join to link the ID and Element columns (after some filtering and summarizing).
import polars as pl
(
    df.select(["Name", "ID"])
    .filter(pl.col("Name") != "")
    .join(
        df.groupby("Element").agg(pl.count().alias("E_count")),
        left_on="ID",
        right_on="Element",
        how="left",
    )
    .sort('E_count', reverse=True)
    .rename({"Name": "Name_list", "ID": "Element"})
)
Note: this differs from the solution listed in your answer. The Name D is associated with ID 9 (not 10).
shape: (4, 3)
┌───────────┬─────────┬─────────┐
│ Name_list ┆ Element ┆ E_count │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ u32 │
╞═══════════╪═════════╪═════════╡
│ B ┆ 4 ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ A ┆ 0 ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ C ┆ 6 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ D ┆ 9 ┆ 1 │
└───────────┴─────────┴─────────┘
You can also use the polars.Series.value_counts method, which looks somewhat cleaner:
import polars as pl
(
    df.select(["Name", "ID"])
    .filter(pl.col("Name") != "")
    .join(
        df.get_column("Element").value_counts(),
        left_on="ID",
        right_on="Element",
        how="left",
    )
    .sort("counts", reverse=True)
    .rename({"Name": "Name_list", "ID": "Element", "counts": "E_count"})
)
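Series.value_counts here returns a two-column DataFrame of the unique Element values and their counts, which is exactly what the join consumes and why "counts" is renamed to "E_count" at the end.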
If I understood your problem correctly, then you could use pandas and do the following:
countdf = pd.merge(df, df[['ID', 'Name']], left_on='Element', right_on='ID', how='inner')
countdf = pd.DataFrame(countdf.groupby('Name_y')['Element'].count())
result = pd.merge(countdf, df[['Name', 'ID']], left_on='Name_y', right_on='Name', how='left')
result[['Name', 'ID', 'Element']]
Using pandas
We can use map to map the values, and the where condition keeps Name from becoming null. Lastly, it's a groupby.
df['Name'] = df['Name'].where(
    cond=df['Element'] == "",
    other=df[df['Element'] != ""]['Element'].map(lambda x: df[df['ID'] == x]['Name'].tolist()[0]),
    axis=0,
)
df[df['Element'] != ""].groupby(['Name', 'Element']).count().reset_index()
  Name Element  ID
0    A       0   2
1    B       4   3
2    C       6   1
3    D       9   1
Try this; you don't need groupby or joins, just map and value_counts:
df.drop('Element', axis=1)\
  .query('Name != "" ')\
  .assign(E_count=df['ID'].map(df['Element'].value_counts()))
Output:
  ID Name  E_count
0  0    A      2.0
4  4    B      3.0
6  6    C      1.0
9  9    D      1.0
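E_count comes out as float because the mapped Series contains NaN for IDs that never appear in Element. If integer counts are preferred, a fillna before the cast does it; a small sketch of the same chain:
df.drop('Element', axis=1)\
  .query('Name != "" ')\
  .assign(E_count=df['ID'].map(df['Element'].value_counts()).fillna(0).astype(int))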

Python: use apply on groups separately after grouping dataframe

My data frame looks like this:
┌────┬──────┬──────┐
│ No │ col1 │ col2 │
├────┼──────┼──────┤
│ 1 │ A │ 5.0 │
│ 1 │ B1 │ 10.0 │
│ 1 │ B2 │ 20.0 │
│ 2 │ A │ 0.0 │
│ 2 │ B1 │ 0.0 │
│ 2 │ C1 │ 0.0 │
│ 3 │ A │ 0.0 │
│ 3 │ B1 │ 5.0 │
│ 3 │ C1 │ 20.0 │
│ 3 │ C2 │ 30.0 │
└────┴──────┴──────┘
First I used groupby to group the data frame by column No.
I would like to do three things now:
get a list of No values where col2 == 0.0 in all rows of the group (in this case No. 2)
get a list of No's where col2 != 0.0 for col1 == 'A' but at least one other row of the group has col2 == 0.0 (in this case No. 3)
get a list of No's where at least one row contains col2 == 0.0 (No. 2 and 3)
Sorry for asking three issues at once. Hope that is ok.
Thank you:)
You can use:
g = df['col2'].eq(0).groupby(df['No'])
a = g.all()
a = a.index[a].tolist()
print (a)
[2]
b1 = (df['col2'].ne(0) & df['col1'].eq('A')).groupby(df['No']).any()
b2 = (df['col2'].eq(0) & df['col1'].ne('A')).groupby(df['No']).any()
b = b1 & b2
b = b.index[b].tolist()
print (b)
[]
c = g.any()
c = c.index[c].tolist()
print (c)
[2, 3]
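In each case the pattern is the same: build a boolean mask, reduce it per group with all/any, and then a.index[a] (boolean indexing on the index) keeps the group labels where the reduction is True.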
Another solution is a custom function that returns a boolean Series per group; applying it gives a boolean DataFrame, from which we finally create a dictionary with the 3 lists:
def f(x):
    a = x['col2'].eq(0)
    b1 = x['col2'].ne(0) & x['col1'].eq('A')
    b2 = a & x['col1'].ne('A')
    b = b1.any() & b2.any()
    return pd.Series([a.all(), b, a.any()], index=list('abc'))
m = df.groupby('No').apply(f)
print (m)
       a      b      c
No
1  False  False  False
2   True  False   True
3  False  False   True
fin = {x: m[x].index[m[x]].tolist() for x in m.columns}
print (fin)
{'a': [2], 'b': [], 'c': [2, 3]}
