I want to select columns in a Polars DataFrame based on a condition. In my case, I want to select all string columns that have less than 100 unique values. Naively I tried:
df.select((pl.col(pl.Utf8)) & (pl.all().n_unique() < 100))
which gave me an error, probably due to the second part of the expression.
df.select(pl.all().n_unique() < 100)
This doesn't select columns but instead returns a single row DataFrame of bool values. I'm new to polars and still can't quite wrap my head around the expression API, I guess. What am I doing wrong?
It's helpful if you include an example to save others from having to create one.
df = pl.DataFrame({
    "col1": ["A", "B", "C", "D"],
    "col2": ["A", "A", "C", "A"],
    "col3": ["A", "B", "A", "B"],
    "col4": [1, 2, 3, 4],
})
You are selecting the string columns with pl.col(pl.Utf8)
>>> df.select(pl.col(pl.Utf8))
shape: (4, 3)
┌──────┬──────┬──────┐
│ col1 | col2 | col3 │
│ --- | --- | --- │
│ str | str | str │
╞══════╪══════╪══════╡
│ A | A | A │
│ B | A | B │
│ C | C | A │
│ D | A | B │
└──────┴──────┴──────┘
You can chain .n_unique() to the pl.col() to run it just on those columns.
>>> df.select(pl.col(pl.Utf8).n_unique() < 3)
shape: (1, 3)
┌───────┬──────┬──────┐
│ col1 | col2 | col3 │
│ --- | --- | --- │
│ bool | bool | bool │
╞═══════╪══════╪══════╡
│ false | true | true │
└───────┴──────┴──────┘
You can loop over this result and extract the .name for each true column.
There is no .is_true() but .all() is equivalent.
>>> [ col.name for col in df.select(pl.col(pl.Utf8).n_unique() < 3) if col.all() ]
['col2', 'col3']
You can then select just those columns:
df.select(
    col.name for col in
    df.select(pl.col(pl.Utf8).n_unique() < 3)
    if col.all()
)
shape: (4, 2)
┌──────┬──────┐
│ col2 | col3 │
│ --- | --- │
│ str | str │
╞══════╪══════╡
│ A | A │
│ A | B │
│ C | A │
│ A | B │
└──────┴──────┘
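To tie this back to the original question's threshold of 100, the same pattern can be wrapped in a small helper. This is just a sketch of mine; the function name and signature are not part of Polars:

import polars as pl

def select_low_cardinality_strings(df: pl.DataFrame, max_unique: int = 100) -> pl.DataFrame:
    # one-row boolean frame: true where a string column has fewer than
    # max_unique distinct values
    mask = df.select(pl.col(pl.Utf8).n_unique() < max_unique)
    # keep the names of the columns whose single boolean value is true
    keep = [col.name for col in mask if col.all()]
    return df.select(keep)

select_low_cardinality_strings(df, max_unique=3)  # selects col2 and col3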
You could get the names of the columns by doing a melt followed by a groupby, but I'm not too sure how to turn this into an expression:
df = pl.DataFrame(
    {
        "val1": ["a", "b", "c"],
        "val2": ["d", "d", "d"],
    }
)
columns = (
    df.select(pl.col(pl.Utf8))
    .melt()
    .groupby("variable")
    .agg(pl.col("value").n_unique())
    .filter(pl.col("value") >= 3)
    .get_column("variable")
    .to_list()
)
df.select(columns)
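For the original question (keep string columns with fewer than 100 unique values) the filter direction would flip; a quick sketch of the same approach with that threshold:

columns = (
    df.select(pl.col(pl.Utf8))
    .melt()
    .groupby("variable")
    .agg(pl.col("value").n_unique())
    .filter(pl.col("value") < 100)   # keep columns with fewer than 100 unique values
    .get_column("variable")
    .to_list()
)
df.select(columns)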
I need a way to find out how many unique pairs of values from two columns are in a certain context.
Basically like n_unique, but as a window function.
To illustrate with a toy example:
import polars as pl
dataframe = pl.DataFrame({
    'context': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'column1': [1, 1, 0, 1, 0, 0, 1, 0, 1],
    'column2': [1, 0, 0, 0, 1, 1, 1, 0, 1]
    # unique:    1  2  3  1  2  -  1  2  -
    # n_unique:  -- 3 --  -- 2 --  -- 2 --
})
I would like to write:
dataframe = (
    dataframe
    .with_column(
        pl.n_unique(['column1', 'column2']).over('context').alias('n_unique')
    )
)
to get the number of unique value pairs from column1, column2 within the window of column 'context'. But that does not work.
One attempt I made was this:
(dataframe
    .with_column(
        pl.concat_list(['column1', 'column2']).alias('pair')
    )
    .with_column(
        pl.n_unique('pair').over('context')
    )
)
To be honest, I wasn't really expecting this to work, and indeed it doesn't:
PanicException: this operation is not implemented/valid for this dtype: List(Int64)
But then what can be done? The "Power BI way" would be to string merge columns, but isn't there a proper way?
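For reference, the "string merge" workaround I mean would look roughly like this (just a sketch; I'm assuming pl.concat_str and its sep argument here):

(dataframe
    .with_column(
        pl.concat_str(
            [pl.col('column1').cast(pl.Utf8), pl.col('column2').cast(pl.Utf8)],
            sep='_',
        ).alias('pair_key')
    )
    .with_column(
        pl.col('pair_key').n_unique().over('context').alias('n_unique')
    )
)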
EDIT: I found one way, but I don't like it...
(dataframe
    .with_column(
        pl.when(pl.all().cumcount().over(['context', 'column1']) == 0)
        .then(pl.n_unique('column2').over(['context', 'column1']))
        .otherwise(pl.lit(0))
        .alias('n_unique')
    )
    .with_column(
        pl.col('n_unique').sum().over('context')
    )
)
All expressions are the functional construct Fn(Series) -> Series, meaning that if you want to compute something over multiple columns, you must ensure that there are multiple columns in the input Series.
We can easily do this by packing them into a Struct data type.
import polars as pl

df = pl.DataFrame({
    'context': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'column1': [1, 1, 0, 1, 0, 0, 1, 0, 1],
    'column2': [1, 0, 0, 0, 1, 1, 1, 0, 1]
})

df.with_column(
    pl.struct(["column1", "column2"]).n_unique().over("context").alias("n_unique")
)
shape: (9, 4)
┌─────────┬─────────┬─────────┬──────────┐
│ context ┆ column1 ┆ column2 ┆ n_unique │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ u32 │
╞═════════╪═════════╪═════════╪══════════╡
│ 1 ┆ 1 ┆ 1 ┆ 3 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 1 ┆ 0 ┆ 3 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 0 ┆ 0 ┆ 3 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 1 ┆ 0 ┆ 2 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 0 ┆ 1 ┆ 2 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 1 ┆ 1 ┆ 2 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 0 ┆ 0 ┆ 2 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 1 ┆ 1 ┆ 2 │
└─────────┴─────────┴─────────┴──────────┘
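As a side note (my own addition, not part of the original answer), the same struct expression also works in a plain groupby if you only need one row per context rather than a window column:

df.groupby("context").agg(
    pl.struct(["column1", "column2"]).n_unique().alias("n_unique")
)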
Is there a method in (Py)Polars to subset list elements in a column of lists according to a list of indexes in another column? It seems arr.get() accepts only an integer and does not accept expressions (like pl.col('prices').arr.get(pl.col('idxs').arr.first())).
Can I get something like:
df = pl.DataFrame(
    {'idxs': [[0], [1], [0, 2]],
     'prices': [[0.0, 3.5], [4.6, 0.0], [0.0, 7.8, 0.0]]}
)
(df
    .with_column(
        pl.col('prices').arr.get(pl.col('idxs')).alias('zero_prices')
    )
)
This can be resolved by applying a Python UDF to pl.struct(pl.all()), like this:
def get_zero_prices(cols):
    # keep only the prices whose position appears in the row's list of indexes
    return [float(el) for i, el in enumerate(cols['prices']) if i in cols['idxs']]

(df
    .with_column(
        pl.struct(pl.all()).apply(lambda x: get_zero_prices(x)).alias('zero_prices')
    )
)
But this doesn't look very idiomatic.
What you want is to be able to utilize the full expression API whilst operating on certain sub-elements or groups. That's what a groupby is!
So ideally we groom our DataFrame into a state where every group corresponds to the elements of our lists.
First we start with some data and then add a row count that will represent our unique groups.
df = pl.DataFrame({
    "idx": [[0], [1], [0, 2]],
    "array": [["a", "b"], ["c", "d"], ["e", "f", "g"]]
}).with_row_count("row_nr")

print(df)
shape: (3, 3)
┌────────┬───────────┬─────────────────┐
│ row_nr ┆ idx ┆ array │
│ --- ┆ --- ┆ --- │
│ u32 ┆ list[i64] ┆ list[str] │
╞════════╪═══════════╪═════════════════╡
│ 0 ┆ [0] ┆ ["a", "b"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ [1] ┆ ["c", "d"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ [0, 2] ┆ ["e", "f", "g"] │
└────────┴───────────┴─────────────────┘
Next we explode by the "idx" column so that we can create the groups for our groupby.
df = df.explode("idx")
print(df)
shape: (4, 3)
┌────────┬─────┬─────────────────┐
│ row_nr ┆ idx ┆ array │
│ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ list[str] │
╞════════╪═════╪═════════════════╡
│ 0 ┆ 0 ┆ ["a", "b"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 1 ┆ ["c", "d"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 0 ┆ ["e", "f", "g"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ ["e", "f", "g"] │
└────────┴─────┴─────────────────┘
Finally we can apply the groupby and take the subelements for each list/group.
(df
    .groupby("row_nr")
    .agg([
        pl.col("array").first(),
        pl.col("idx"),
        pl.col("array").first().take(pl.col("idx")).alias("arr_taken")
    ])
)
This returns:
shape: (3, 4)
┌────────┬─────────────────┬───────────┬────────────┐
│ row_nr ┆ array ┆ idx ┆ arr_taken │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ list[str] ┆ list[i64] ┆ list[str] │
╞════════╪═════════════════╪═══════════╪════════════╡
│ 0 ┆ ["a", "b"] ┆ [0] ┆ ["a"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ ["c", "d"] ┆ [1] ┆ ["d"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ ["e", "f", "g"] ┆ [0, 2] ┆ ["e", "g"] │
└────────┴─────────────────┴───────────┴────────────┘
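One caveat worth adding (my own note, not from the original answer): a groupby does not guarantee the original row order, so if order matters you may want to sort on row_nr afterwards, roughly like this:

(df
    .groupby("row_nr")
    .agg([
        pl.col("array").first(),
        pl.col("idx"),
        pl.col("array").first().take(pl.col("idx")).alias("arr_taken")
    ])
    .sort("row_nr")   # restore the original row order
)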
Consider this example:
import polars as pl

df = pl.DataFrame({
    'ID': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10'],
    'Name': ['A', '', '', '', 'B', '', 'C', '', '', 'D', ''],
    'Element': ['', '4', '4', '0', '', '4', '', '0', '9', '', '6']
})
The 'Name' is linked to an 'ID'. This ID is used as a value in the 'Element' column. How do I map the correct 'Name' to the elements? Also I want to group the elements by 'Name' ('Name_list'), count them and sort by counted values ('E_count').
The resulting df would be:
Name_list  Element  E_count
---------------------------
'B'        '4'      3
'A'        '0'      2
'C'        '6'      1
'D'        '9'      1
Feedback very much appreciated; even a Pandas solution.
Here's a Polars solution. We'll use a join to link the ID and Element columns (after some filtering and summarizing).
import polars as pl

(
    df.select(["Name", "ID"])
    .filter(pl.col("Name") != "")
    .join(
        df.groupby("Element").agg(pl.count().alias("E_count")),
        left_on="ID",
        right_on="Element",
        how="left",
    )
    .sort("E_count", reverse=True)
    .rename({"Name": "Name_list", "ID": "Element"})
)
Note: this differs from the solution listed in your answer. The Name D is associated with ID 9 (not 10).
shape: (4, 3)
┌───────────┬─────────┬─────────┐
│ Name_list ┆ Element ┆ E_count │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ u32 │
╞═══════════╪═════════╪═════════╡
│ B ┆ 4 ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ A ┆ 0 ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ C ┆ 6 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ D ┆ 9 ┆ 1 │
└───────────┴─────────┴─────────┘
You can also use the polars.Series.value_counts method, which looks somewhat cleaner:
import polars as pl

(
    df.select(["Name", "ID"])
    .filter(pl.col("Name") != "")
    .join(
        df.get_column("Element").value_counts(),
        left_on="ID",
        right_on="Element",
        how="left",
    )
    .sort("counts", reverse=True)
    .rename({"Name": "Name_list", "ID": "Element", "counts": "E_count"})
)
If I understood your problem correctly, then you could use pandas and do the following:
countdf = pd.merge(df,df[['ID','Name']],left_on='Element',right_on='ID',how='inner')
countdf = pd.DataFrame(countdf.groupby('Name_y')['Element'].count())
result = pd.merge(countdf,df[['Name','ID']],left_on='Name_y',right_on='Name',how='left')
result[['Name','ID','Element']]
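If I'm reading the merges right, the same idea can be written a little more compactly (a sketch only, with the merge suffixes named explicitly):

import pandas as pd

# map each Element back to the Name that owns that ID, then count occurrences
merged = df.merge(df[['ID', 'Name']], left_on='Element', right_on='ID',
                  how='inner', suffixes=('', '_owner'))
counts = (
    merged.groupby(['Name_owner', 'Element'])['ID']
          .count()
          .reset_index(name='E_count')
          .sort_values('E_count', ascending=False)
)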
Using pandas: we can use map to map the values and a where condition to avoid setting Name to null. Lastly, it's a groupby.
df['Name'] = df['Name'].where(
    cond=df['Element'] == "",
    other=df[df['Element'] != ""]['Element'].map(lambda x: df[df['ID'] == x]['Name'].tolist()[0]),
    axis=0,
)
df[df['Element'] != ""].groupby(['Name','Element']).count().reset_index()
Name Element ID
0 A 0 2
1 B 4 3
2 C 6 1
3 D 9 1
Try this; you don't need groupby or joins, just map and value_counts:
df.drop('Element', axis=1)\
.query('Name != "" ')\
.assign(E_count = df['ID'].map(df['Element'].value_counts()))
Output:
ID Name E_count
0 0 A 2.0
4 4 B 3.0
6 6 C 1.0
9 9 D 1.0
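One note to add (my own, not from the original answer): E_count comes out as a float because IDs that never appear in Element map to NaN. If you want integer counts, you could fill and cast:

out = (
    df.drop('Element', axis=1)
      .query('Name != "" ')
      .assign(E_count=df['ID'].map(df['Element'].value_counts()))
)
out['E_count'] = out['E_count'].fillna(0).astype(int)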
I have a dataframe which looks like this
df = pd.DataFrame({'A': ['C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7'],
                   'x': [2, 2, 3, 2, 3, 1, 3],
                   'maxValue_1': [2, 1, 2, 3, 4, 2, 1]})
df
A x maxValue_1
0 C1 2 2
1 C2 2 1
2 C3 3 2
3 C4 2 3
4 C5 3 4
5 C6 1 2
6 C7 3 1
maxValue_2 = 2
I need to check whether column 'x' is greater than or equal to max(df.maxValue_1, maxValue_2).
Resulting dataframe should look like this.
A x maxValue_1 result
0 C1 2 2 True
1 C2 2 1 True
2 C3 3 2 True
3 C4 2 3 False
4 C5 3 4 False
5 C6 1 2 False
6 C7 3 1 True
How can I code this in an efficient manner without having to add variable 'maxValue_2' to the dataframe?
df['result'] = df['x'] >= np.maximum(df['maxValue_1'], maxValue_2)
print(df)
Prints:
A x maxValue_1 result
0 C1 2 2 True
1 C2 2 1 True
2 C3 3 2 True
3 C4 2 3 False
4 C5 3 4 False
5 C6 1 2 False
6 C7 3 1 True
df['result'] = df.apply(lambda row: row.x >= max(row.maxValue_1, maxValue_2), axis=1)
To accomplish this, we’ll use numpy’s built-in where() function. This function takes three arguments in sequence: the condition we’re testing for, the value to assign to our new column if that condition is true, and the value to assign if it is false. It looks like this:
np.where(condition, value if condition is true, value if condition is false)
And the function maximum() to get the max of the given values:
df['result'] = np.where(df['x'] >= np.maximum(df['maxValue_1'], maxValue_2), True, False)
OUTPUT:
A x maxValue_1 result
0 C1 2 2 True
1 C2 2 1 True
2 C3 3 2 True
3 C4 2 3 False
4 C5 3 4 False
5 C6 1 2 False
6 C7 3 1 True
You can also define the series of maxes through a list comprehension and compare it to the x series like so:
df['result'] = df['x'] >= [max(maxValue_2, row['maxValue_1']) for idx, row in df.iterrows()]
I have found that the apply method is extremely useful for this type of problem. We want to apply some function, model_map, to the dataframe.
def model_map(row):
    # compare this row's x against the larger of its maxValue_1 and the scalar maxValue_2
    if row['x'] >= max(row['maxValue_1'], maxValue_2):
        return True
    else:
        return False

df['result'] = df.apply(lambda row: model_map(row), axis=1)
This will give you a nice way to create a column based on a function.
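For larger frames, a row-wise apply will be noticeably slower than the vectorized answers above. An equivalent vectorized sketch (my own addition, using Series.clip) would be:

# clip imposes the scalar floor maxValue_2 on maxValue_1, then the comparison is column-wise
df['result'] = df['x'] >= df['maxValue_1'].clip(lower=maxValue_2)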