I have a Polars dataframe:
┌─────────┬─────────────────────────┬──────┬───────┐
│ Name    ┆ Purchase Time           ┆ Size ┆ Color │
│ ---     ┆ ---                     ┆ ---  ┆ ---   │
│ str     ┆ datetime[μs]            ┆ i64  ┆ str   │
╞═════════╪═════════════════════════╪══════╪═══════╡
│ T-Shirt ┆ 2022-02-14 14:40:09.100 ┆ 12   ┆ Blue  │
└─────────┴─────────────────────────┴──────┴───────┘
And I would like to convert each row of this dataframe into an object that contains information extracted from the row, e.g.:
PurchasedObject(Name, Size, Color)
Is it possible with Polars to create a new column in the dataframe that contains, for each row, the corresponding object?
What would be the best way to achieve this in Polars?
Thank you!
Polars has a struct datatype that can be used to pack several columns together into a single datatype.
import polars as pl
from datetime import datetime
pl.DataFrame({
    "name": ["t-shirt"],
    "purchased": [datetime(2022, 2, 4)],
    "size": [12],
    "color": ["blue"]
}).with_column(
    pl.struct(["name", "size", "color"]).alias("purchased_struct")
)
shape: (1, 5)
┌─────────┬─────────────────────┬──────┬───────┬───────────────────────┐
│ name ┆ purchased ┆ size ┆ color ┆ purchased_struct │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ datetime[μs] ┆ i64 ┆ str ┆ struct[3] │
╞═════════╪═════════════════════╪══════╪═══════╪═══════════════════════╡
│ t-shirt ┆ 2022-02-04 00:00:00 ┆ 12 ┆ blue ┆ {"t-shirt",12,"blue"} │
└─────────┴─────────────────────┴──────┴───────┴───────────────────────┘
Let's say I want to make a list of functions, i.e. aggs = ['sum', 'std', 'mean', 'min', 'max']
then if I have an arbitrary df
df=pl.DataFrame({'a':[1,2,3], 'b':[2,3,4]})
I want to be able to do something like (this obviously doesn't work)
df.with_columns([pl.col('a').x() for x in aggs])
Is there a way to do that? aggs need not be a list of strings; that's just the easiest way to express my intent for the purpose of this question. Additionally, it'd need to have room for .suffix().
I know I could write a function that hard-codes all the aggs and takes an arbitrary df as a parameter, which is my backup plan, so I'm hoping for something that resembles the above.
Would this work for you?
df.with_columns([getattr(pl.col("a"), x)().suffix("_" + x) for x in aggs])
shape: (3, 7)
┌─────┬─────┬───────┬───────┬────────┬───────┬───────┐
│ a ┆ b ┆ a_sum ┆ a_std ┆ a_mean ┆ a_min ┆ a_max │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ f64 ┆ f64 ┆ i64 ┆ i64 │
╞═════╪═════╪═══════╪═══════╪════════╪═══════╪═══════╡
│ 1 ┆ 2 ┆ 6 ┆ 1.0 ┆ 2.0 ┆ 1 ┆ 3 │
│ 2 ┆ 3 ┆ 6 ┆ 1.0 ┆ 2.0 ┆ 1 ┆ 3 │
│ 3 ┆ 4 ┆ 6 ┆ 1.0 ┆ 2.0 ┆ 1 ┆ 3 │
└─────┴─────┴───────┴───────┴────────┴───────┴───────┘
I've got a Polars DataFrame that I want to calculate a 5-year simple moving average on. However, I don't want to just groupby_dynamic on the year column; I also have a geographical unit that I want to group by.
For example:
┌────────────┬─────────┬──────────┬────────────────────────────┬─────┬────────────┬────────────────────────────┬────────────────┬───────────┐
│ year ┆ state ┆ district ┆ candidate ┆ ... ┆ totalvotes ┆ Candidate_Name ┆ State_District ┆ Last_Name │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ str ┆ i64 ┆ str ┆ ┆ i64 ┆ str ┆ str ┆ str │
╞════════════╪═════════╪══════════╪════════════════════════════╪═════╪════════════╪════════════════════════════╪════════════════╪═══════════╡
│ 1976-01-01 ┆ alabama ┆ 1 ┆ BILL DAVENPORT ┆ ... ┆ 157170 ┆ bill davenport ┆ alabama-1 ┆ davenport │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1976-01-01 ┆ alabama ┆ 1 ┆ JACK EDWARDS ┆ ... ┆ 157170 ┆ jack edwards ┆ alabama-1 ┆ edwards │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1976-01-01 ┆ alabama ┆ 1 ┆ WRITEIN ┆ ... ┆ 157170 ┆ writein ┆ alabama-1 ┆ writein │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1976-01-01 ┆ alabama ┆ 2 ┆ J CAROLE KEAHEY ┆ ... ┆ 156362 ┆ j carole keahey ┆ alabama-2 ┆ keahey │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1976-01-01 ┆ alabama ┆ 2 ┆ WILLIAM L "BILL" DICKINSON ┆ ... ┆ 156362 ┆ william l "bill" dickinson ┆ alabama-2 ┆ dickinson │
Specifically, with this example, I want to groupby year and calculate the 5 year smooth moving average for the particular State_District.
What would be the most efficient way to do this?
I've tried:
mapped = election_lab.filter(
    pl.col("party") == "DEMOCRAT"
).groupby_dynamic(
    ["year", "State_District"], every="5y"
).agg(
    pl.apply(exprs=["candidatevotes", "totalvotes"], f=lambda x: x[0] / x[1]).alias("Dem_Vote_Share")
)
This one understandably indicated that I couldn't pass a list to groupby_dynamic. So I thought I might do it in stages.
So I tried:
mapped = election_lab.filter(
    pl.col("party") == "DEMOCRAT"
).groupby(
    ["year", "State_District"]
).agg(
    pl.apply(exprs=["candidatevotes", "totalvotes"], f=lambda x: x[0] / x[1]).alias("Dem_Vote_Share")
)

mapped_dynamic = mapped.groupby_dynamic(
    "year", every="5y"
).agg(
    pl.avg("Dem_Vote_Share").alias("SMA")
)
But this doesn't do quite what I want either, and just returned a column of all nulls.
E.g. if I have
import polars as pl
df = pl.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
how would I find the cumulative sum of each row?
Expected output:
a b
0 1 5
1 2 7
2 3 9
Here's the equivalent in pandas:
>>> import pandas as pd
>>> pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]}).cumsum(axis=1)
a b
0 1 5
1 2 7
2 3 9
but I can't figure out how to do it in polars
Edit: Polars 0.14.18 and later
As of Polars 0.14.18, we can use the new polars.cumsum function to simplify this. (Note: this is slightly different from the polars.Expr.cumsum expression, in that it acts as a root expression.)
Using the same DataFrame as below:
my_cols = [s.name for s in df if s.is_numeric()]
(
    df
    .select([
        pl.exclude(my_cols),
        pl.cumsum(my_cols).alias('result')
    ])
    .unnest('result')
)
shape: (3, 4)
┌─────┬─────┬─────┬─────┐
│ id ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ a ┆ 1 ┆ 5 ┆ 12 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ 2 ┆ 7 ┆ 15 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ c ┆ 3 ┆ 9 ┆ 18 │
└─────┴─────┴─────┴─────┘
Before Polars 0.14.18
Polars is column-oriented, and as such does not have the concept of an axis. Still, we can use the list evaluation context to solve this.
First, let's expand your data slightly:
df = pl.DataFrame({
    "id": ['a', 'b', 'c'],
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "c": [7, 8, 9],
})
df
shape: (3, 4)
┌─────┬─────┬─────┬─────┐
│ id ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ a ┆ 1 ┆ 4 ┆ 7 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ 2 ┆ 5 ┆ 8 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ c ┆ 3 ┆ 6 ┆ 9 │
└─────┴─────┴─────┴─────┘
The Algorithm
Here's a general-purpose performant algorithm that will solve this. We'll walk through it below.
my_cols = [s.name for s in df if s.is_numeric()]
(
    df
    .with_column(
        pl.concat_list(my_cols)
        .arr.eval(pl.element().cumsum())
        .arr.to_struct(name_generator=lambda idx: my_cols[idx])
        .alias('result')
    )
    .drop(my_cols)
    .unnest('result')
)
shape: (3, 4)
┌─────┬─────┬─────┬─────┐
│ id ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ a ┆ 1 ┆ 5 ┆ 12 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ 2 ┆ 7 ┆ 15 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ c ┆ 3 ┆ 9 ┆ 18 │
└─────┴─────┴─────┴─────┘
How it works
First, we'll select the names of the numeric columns. You can name these explicitly if you like, e.g., my_cols = ['a', 'b', 'c'].
Next, we'll gather up the column values into a list using polars.concat_list.
my_cols = [s.name for s in df if s.is_numeric()]
(
    df
    .with_column(
        pl.concat_list(my_cols)
        .alias('result')
    )
)
shape: (3, 5)
┌─────┬─────┬─────┬─────┬───────────┐
│ id ┆ a ┆ b ┆ c ┆ result │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ list[i64] │
╞═════╪═════╪═════╪═════╪═══════════╡
│ a ┆ 1 ┆ 4 ┆ 7 ┆ [1, 4, 7] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 2 ┆ 5 ┆ 8 ┆ [2, 5, 8] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ 3 ┆ 6 ┆ 9 ┆ [3, 6, 9] │
└─────┴─────┴─────┴─────┴───────────┘
From here, we'll use the arr.eval context to run our cumsum on the list.
my_cols = [s.name for s in df if s.is_numeric()]
(
    df
    .with_column(
        pl.concat_list(my_cols)
        .arr.eval(pl.element().cumsum())
        .alias('result')
    )
)
shape: (3, 5)
┌─────┬─────┬─────┬─────┬────────────┐
│ id ┆ a ┆ b ┆ c ┆ result │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ list[i64] │
╞═════╪═════╪═════╪═════╪════════════╡
│ a ┆ 1 ┆ 4 ┆ 7 ┆ [1, 5, 12] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 2 ┆ 5 ┆ 8 ┆ [2, 7, 15] │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ 3 ┆ 6 ┆ 9 ┆ [3, 9, 18] │
└─────┴─────┴─────┴─────┴────────────┘
Next, we'll break the list into a struct using arr.to_struct, and name the fields the corresponding names from our selected numeric columns.
my_cols = [s.name for s in df if s.is_numeric()]
(
    df
    .with_column(
        pl.concat_list(my_cols)
        .arr.eval(pl.element().cumsum())
        .arr.to_struct(name_generator=lambda idx: my_cols[idx])
        .alias('result')
    )
)
shape: (3, 5)
┌─────┬─────┬─────┬─────┬───────────┐
│ id ┆ a ┆ b ┆ c ┆ result │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ struct[3] │
╞═════╪═════╪═════╪═════╪═══════════╡
│ a ┆ 1 ┆ 4 ┆ 7 ┆ {1,5,12} │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 2 ┆ 5 ┆ 8 ┆ {2,7,15} │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ 3 ┆ 6 ┆ 9 ┆ {3,9,18} │
└─────┴─────┴─────┴─────┴───────────┘
And finally, we'll use unnest to break the struct into columns. (But first we must drop the original columns or else we'll get two columns with the same name.)
my_cols = [s.name for s in df if s.is_numeric()]
(
    df
    .with_column(
        pl.concat_list(my_cols)
        .arr.eval(pl.element().cumsum())
        .arr.to_struct(name_generator=lambda idx: my_cols[idx])
        .alias('result')
    )
    .drop(my_cols)
    .unnest('result')
)
shape: (3, 4)
┌─────┬─────┬─────┬─────┐
│ id ┆ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ a ┆ 1 ┆ 5 ┆ 12 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ 2 ┆ 7 ┆ 15 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ c ┆ 3 ┆ 9 ┆ 18 │
└─────┴─────┴─────┴─────┘
There may be a simpler and faster way, but here is the programmatic solution.
Concatenate the values along the columns into a list
Calculate the cumulative sum over the list (the result is still a list)
Get values for each column in the result
import polars as pl

df = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.select([
    pl.concat_list(pl.all())
    .arr.eval(pl.element().cumsum())
    .alias('cs')
]).select([
    pl.col('cs').arr.get(i).alias(name)
    for i, name in enumerate(df.columns)
])
shape: (3, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 7 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 9 │
└─────┴─────┘
So I have a Polars dataframe looking as such
df = pl.DataFrame(
    {
        "ItemId": [15148, 15148, 24957],
        "SuffixFactor": [19200, 200, 24],
        "ItemRand": [254, -1, -44],
        "Stat0": ['+5 Defense', '+$i Might', '+9 Vitality'],
        "Amount": ['', '7', '']
    }
)
I want to replace $i in the column "Stat0" with the value of Amount whenever Stat0 contains $i.
I have tried a couple different things such as:
df = df.with_column(
    pl.col('Stat0').str.replace(r'\$i', pl.col('Amount'))
)
Expected result
result = pl.DataFrame(
    {
        "ItemId": [15148, 15148, 24957],
        "SuffixFactor": [19200, 200, 24],
        "ItemRand": [254, -1, -44],
        "Stat0": ['+5 Defense', '+7 Might', '+9 Vitality'],
        "Amount": ['', '7', '']
    }
)
But this doesn't seem to work.
I hope someone can help.
Best regards
Edit: Polars >= 0.14.4
As of Polars 0.14.4, the replace and replace_all expressions allow an Expression for the value parameter. Thus, we can solve this more simply as:
df.with_column(
    pl.col('Stat0').str.replace(r'\$i', pl.col('Amount'))
)
shape: (3, 5)
┌────────┬──────────────┬──────────┬─────────────┬────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ Stat0 ┆ Amount │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ str ┆ str │
╞════════╪══════════════╪══════════╪═════════════╪════════╡
│ 15148 ┆ 19200 ┆ 254 ┆ +5 Defense ┆ │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15148 ┆ 200 ┆ -1 ┆ +7 Might ┆ 7 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 24957 ┆ 24 ┆ -44 ┆ +9 Vitality ┆ │
└────────┴──────────────┴──────────┴─────────────┴────────┘
Polars < 0.14.4
The problem is that the replace method does not take an Expression, only a constant. Thus, we cannot use a column as the replacement value.
We can get around this in two ways.
Slow: using apply
This method uses Python code to perform the replacement. Since apply executes Python bytecode, it will be slow. If your DataFrame is small, this won't be too painful.
(
    df
    .with_column(
        pl.struct(['Stat0', 'Amount'])
        .apply(lambda cols: cols['Stat0'].replace('$i', cols['Amount']))
        .alias('Stat0')
    )
)
shape: (3, 5)
┌────────┬──────────────┬──────────┬─────────────┬────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ Stat0 ┆ Amount │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ str ┆ str │
╞════════╪══════════════╪══════════╪═════════════╪════════╡
│ 15148 ┆ 19200 ┆ 254 ┆ +5 Defense ┆ │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15148 ┆ 200 ┆ -1 ┆ +7 Might ┆ 7 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 24957 ┆ 24 ┆ -44 ┆ +9 Vitality ┆ │
└────────┴──────────────┴──────────┴─────────────┴────────┘
Fast: using split_exact and when/then/otherwise
This method uses all Polars Expressions. As such, it will be much faster, especially for large DataFrames.
(
    df
    .with_column(
        pl.col('Stat0').str.split_exact('$i', 1)
    )
    .unnest('Stat0')
    .with_column(
        pl.when(pl.col('field_1').is_null())
        .then(pl.col('field_0'))
        .otherwise(pl.concat_str(['field_0', 'Amount', 'field_1']))
        .alias('Stat0')
    )
    .drop(['field_0', 'field_1'])
)
shape: (3, 5)
┌────────┬──────────────┬──────────┬────────┬─────────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ Amount ┆ Stat0 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ str ┆ str │
╞════════╪══════════════╪══════════╪════════╪═════════════╡
│ 15148 ┆ 19200 ┆ 254 ┆ ┆ +5 Defense │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 15148 ┆ 200 ┆ -1 ┆ 7 ┆ +7 Might │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 24957 ┆ 24 ┆ -44 ┆ ┆ +9 Vitality │
└────────┴──────────────┴──────────┴────────┴─────────────┘
How it works: we first split the Stat0 column on $i using split_exact. This will produce a struct.
(
    df
    .with_column(
        pl.col('Stat0').str.split_exact('$i', 1)
    )
)
shape: (3, 5)
┌────────┬──────────────┬──────────┬──────────────────────┬────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ Stat0 ┆ Amount │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ struct[2] ┆ str │
╞════════╪══════════════╪══════════╪══════════════════════╪════════╡
│ 15148 ┆ 19200 ┆ 254 ┆ {"+5 Defense",null} ┆ │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15148 ┆ 200 ┆ -1 ┆ {"+"," Might"} ┆ 7 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 24957 ┆ 24 ┆ -44 ┆ {"+9 Vitality",null} ┆ │
└────────┴──────────────┴──────────┴──────────────────────┴────────┘
Notice that when Stat0 does not contain $i, the second member of the struct is null. We'll use this fact to our advantage.
In the next step, we break the struct into separate columns, using unnest.
(
    df
    .with_column(
        pl.col('Stat0').str.split_exact('$i', 1)
    )
    .unnest('Stat0')
)
shape: (3, 6)
┌────────┬──────────────┬──────────┬─────────────┬─────────┬────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ field_0 ┆ field_1 ┆ Amount │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ str ┆ str ┆ str │
╞════════╪══════════════╪══════════╪═════════════╪═════════╪════════╡
│ 15148 ┆ 19200 ┆ 254 ┆ +5 Defense ┆ null ┆ │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15148 ┆ 200 ┆ -1 ┆ + ┆ Might ┆ 7 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 24957 ┆ 24 ┆ -44 ┆ +9 Vitality ┆ null ┆ │
└────────┴──────────────┴──────────┴─────────────┴─────────┴────────┘
This creates two new columns: field_0 and field_1.
From here, we use when/then/otherwise and concat_str to construct the final result
Basically:
when $i does not appear in the Stat0 column, then the string is not split, and field_1 is null, so we can use the value in field_0 as is.
when $i does appear in Stat0, then the string is split into two parts: field_0 and field_1. We simply concatenate the parts back together, putting Amount in the middle.
(
    df
    .with_column(
        pl.col('Stat0').str.split_exact('$i', 1)
    )
    .unnest('Stat0')
    .with_column(
        pl.when(pl.col('field_1').is_null())
        .then(pl.col('field_0'))
        .otherwise(pl.concat_str(['field_0', 'Amount', 'field_1']))
        .alias('Stat0')
    )
)
shape: (3, 7)
┌────────┬──────────────┬──────────┬─────────────┬─────────┬────────┬─────────────┐
│ ItemId ┆ SuffixFactor ┆ ItemRand ┆ field_0 ┆ field_1 ┆ Amount ┆ Stat0 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ str ┆ str ┆ str ┆ str │
╞════════╪══════════════╪══════════╪═════════════╪═════════╪════════╪═════════════╡
│ 15148 ┆ 19200 ┆ 254 ┆ +5 Defense ┆ null ┆ ┆ +5 Defense │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 15148 ┆ 200 ┆ -1 ┆ + ┆ Might ┆ 7 ┆ +7 Might │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 24957 ┆ 24 ┆ -44 ┆ +9 Vitality ┆ null ┆ ┆ +9 Vitality │
└────────┴──────────────┴──────────┴─────────────┴─────────┴────────┴─────────────┘
When using a Polars DataFrame, it sometimes displays dots instead of the actual data. I don't know what I'm missing (I have no problem when using pandas). Can you tell me what I should do?
import polars as pl
import pandas as pd
class new:
    xyxy = '124'

a = [[[0.45372647047042847, 0.7791867852210999, 0.05796612799167633,
       0.08813457936048508, 0.9122178554534912, 0, 'corn'],
      [0.5337053537368774, 0.605276882648468, 0.043029140681028366,
       0.06894499808549881, 0.8814031481742859, 0, 'corn'],
      [0.47244399785995483, 0.5134297609329224, 0.03258286789059639,
       0.054770857095718384, 0.8650641441345215, 0, 'corn'],
      [0.4817340672016144, 0.42551395297050476, 0.02438574656844139,
       0.04052922874689102, 0.8646907806396484, 0, 'corn'],
      [0.5215370059013367, 0.4616119861602783, 0.027680961415171623,
       0.04423023760318756, 0.8433780670166016, 0, 'corn'],
      [0.5168840885162354, 0.4077163636684418, 0.021290680393576622,
       0.034322340041399, 0.8073480129241943, 0, 'corn'],
      [0.4868599772453308, 0.3901885747909546, 0.01746474765241146,
       0.02876533754169941, 0.631712794303894, 0, 'corn'],
      [0.5133631825447083, 0.3870452046394348, 0.014495659619569778,
       0.02186509035527706, 0.6174931526184082, 0, 'corn'],
      [0.5155017375946045, 0.3974197208881378, 0.01627129688858986,
       0.03393130749464035, 0.4413506090641022, 0, 'corn']]]

ca = 'xmin', 'ymin', 'xmax', 'ymax', 'confidence', 'class', 'name'  # xyxy columns
cb = 'xcenter', 'ycenter', 'width', 'height', 'confidence', 'class', 'name'  # xywh columns

for k, c in zip(['xyxy', 'xyxyn', 'xywh', 'xywhn'], [ca, ca, cb, cb]):
    setattr(new, k, [pl.DataFrame(x, columns=c, orient="row") for x in a])
    # setattr(new, k, [pd.DataFrame(x, columns=c, orient="row") for x in a])

print(new.xyxy[0])
Use polars.Config.set_tbl_rows to control the number of displayed rows:
pl.Config.set_tbl_rows(1000)
print(new.xyxy[0])
# Output
shape: (9, 7)
┌────────────────┬────────────────┬────────────────┬────────────────┬───────────────┬───────┬──────┐
│ xmin ┆ ymin ┆ xmax ┆ ymax ┆ confidence ┆ class ┆ name │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ i64 ┆ str │
╞════════════════╪════════════════╪════════════════╪════════════════╪═══════════════╪═══════╪══════╡
│ 0.453726470470 ┆ 0.779186785221 ┆ 0.057966127991 ┆ 0.088134579360 ┆ 0.91221785545 ┆ 0 ┆ corn │
│ 42847 ┆ 0999 ┆ 67633 ┆ 48508 ┆ 34912 ┆ ┆ │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.533705353736 ┆ 0.605276882648 ┆ 0.043029140681 ┆ 0.068944998085 ┆ 0.88140314817 ┆ 0 ┆ corn │
│ 8774 ┆ 468 ┆ 028366 ┆ 49881 ┆ 42859 ┆ ┆ │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.472443997859 ┆ 0.513429760932 ┆ 0.032582867890 ┆ 0.054770857095 ┆ 0.86506414413 ┆ 0 ┆ corn │
│ 95483 ┆ 9224 ┆ 59639 ┆ 718384 ┆ 45215 ┆ ┆ │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.481734067201 ┆ 0.425513952970 ┆ 0.024385746568 ┆ 0.040529228746 ┆ 0.86469078063 ┆ 0 ┆ corn │
│ 6144 ┆ 50476 ┆ 44139 ┆ 89102 ┆ 96484 ┆ ┆ │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.521537005901 ┆ 0.461611986160 ┆ 0.027680961415 ┆ 0.044230237603 ┆ 0.84337806701 ┆ 0 ┆ corn │
│ 3367 ┆ 2783 ┆ 171623 ┆ 18756 ┆ 66016 ┆ ┆ │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.516884088516 ┆ 0.407716363668 ┆ 0.021290680393 ┆ 0.034322340041 ┆ 0.80734801292 ┆ 0 ┆ corn │
│ 2354 ┆ 4418 ┆ 576622 ┆ 399 ┆ 41943 ┆ ┆ │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.486859977245 ┆ 0.390188574790 ┆ 0.017464747652 ┆ 0.028765337541 ┆ 0.63171279430 ┆ 0 ┆ corn │
│ 3308 ┆ 9546 ┆ 41146 ┆ 69941 ┆ 3894 ┆ ┆ │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.513363182544 ┆ 0.387045204639 ┆ 0.014495659619 ┆ 0.021865090355 ┆ 0.61749315261 ┆ 0 ┆ corn │
│ 7083 ┆ 4348 ┆ 569778 ┆ 27706 ┆ 84082 ┆ ┆ │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0.515501737594 ┆ 0.397419720888 ┆ 0.016271296888 ┆ 0.033931307494 ┆ 0.44135060906 ┆ 0 ┆ corn │
│ 6045 ┆ 1378 ┆ 58986 ┆ 64035 ┆ 41022 ┆ ┆ │
└────────────────┴────────────────┴────────────────┴────────────────┴───────────────┴───────┴──────┘