Polars column subtraction order does not matter (apparently) - python

I would like to use polars, but I ran into a problem when subtracting a 1x3 numpy array from three columns of a DataFrame: it does not seem to matter in which order the subtraction is applied:
import numpy as np
import polars as pl
# create polars dataframe:
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df = pl.DataFrame(data, columns=['x', 'y', 'z']).with_columns(
    pl.all().cast(pl.Float64)
)
# subtraction array:
arr = np.array([2, 5, 8], dtype=np.float64)
# subtract the array from the DataFrame
df.with_columns((
    pl.col('x') - arr[0],
    pl.col('y') - arr[1],
    pl.col('z') - arr[2],
))
"""
This one is correct, top row should be negative and bottom row positive
| | x | y | z |
|---:|----:|----:|----:|
| 0 | -1 | -1 | -1 |
| 1 | 0 | 0 | 0 |
| 2 | 1 | 1 | 1 |
"""
df.with_columns((
    arr[0] - pl.col('x'),
    arr[1] - pl.col('y'),
    arr[2] - pl.col('z'),
))
"""
This one is incorrect. The top row should be positive and the bottom row should
be negative.
| | x | y | z |
|---:|----:|----:|----:|
| 0 | -1 | -1 | -1 |
| 1 | 0 | 0 | 0 |
| 2 | 1 | 1 | 1 |
"""

Can't reproduce this, looks fine to me as of 0.16.5:
In [57]: df.with_columns((
...: pl.col('x') - arr[0],
...: pl.col('y') - arr[1],
...: pl.col('z') - arr[2],
...: ))
...:
Out[57]:
shape: (3, 3)
┌──────┬──────┬──────┐
│ x ┆ y ┆ z │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞══════╪══════╪══════╡
│ -1.0 ┆ -1.0 ┆ -1.0 │
│ 0.0 ┆ 0.0 ┆ 0.0 │
│ 1.0 ┆ 1.0 ┆ 1.0 │
└──────┴──────┴──────┘
In [58]: df.with_columns((
...: arr[0] - pl.col('x'),
...: arr[1] - pl.col('y'),
...: arr[2] - pl.col('z'),
...: ))
Out[58]:
shape: (3, 3)
┌──────┬──────┬──────┐
│ x ┆ y ┆ z │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞══════╪══════╪══════╡
│ 1.0 ┆ 1.0 ┆ 1.0 │
│ 0.0 ┆ 0.0 ┆ 0.0 │
│ -1.0 ┆ -1.0 ┆ -1.0 │
└──────┴──────┴──────┘
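As an aside (not part of the original answer), the same per-column subtraction can be written more compactly with a comprehension over the column/value pairs; a sketch, assuming the df and arr defined in the question:
# sketch: subtract each element of arr from its matching column
df.with_columns([
    pl.col(name) - float(value)
    for name, value in zip(['x', 'y', 'z'], arr)
])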

Related

How to use apply better in Polars?

I have a polars dataframe illustrated as follows.
import polars as pl
df = pl.DataFrame(
    {
        "a": [1, 4, 3, 2, 8, 4, 5, 6],
        "b": [2, 3, 1, 3, 9, 7, 6, 8],
        "c": [1, 1, 1, 1, 2, 2, 2, 2],
    }
)
The task I have is:
- groupby column "c"
- for each group, check whether all numbers from column "a" are less than the corresponding values from column "b".
- If so, just return a column the same as "a" in the groupby context.
- Otherwise, apply a third-party function called "convert" which takes two numpy arrays and returns a single numpy array of the same size. In my case, I can first convert columns "a" and "b" to numpy arrays and supply them as inputs to "convert". Finally, return the array returned from "convert" (probably transformed to a polars series first) in the groupby context.
So, for the example above, the output I want is as follows (exploded after groupby for better illustration).
shape: (8, 2)
┌─────┬─────┐
│ c ┆ a │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 1 │
│ 1 ┆ 3 │
│ 1 ┆ 1 │
│ 1 ┆ 2 │
│ 2 ┆ 8 │
│ 2 ┆ 4 │
│ 2 ┆ 5 │
│ 2 ┆ 6 │
└─────┴─────┘
With the assumption,
>>> import numpy as np
>>> convert(np.array([1, 4, 3, 2]), np.array([2, 3, 1, 3]))
np.array([1, 3, 1, 2])
# [1, 4, 3, 2] is from column a of df when column c is 1, and [2, 3, 1, 3] comes from column b of df when column c is 1.
# I have to apply my custom python function 'convert' for the c == 1 group, because not all values in a are smaller than those in b according to the task description above.
My question is: how am I supposed to implement this logic in a performant, polars-idiomatic way without sacrificing too much of the speed gained from running Rust code and parallelization?
The reason I ask is that, from my understanding, using apply with a custom python function will slow down the program, but in certain scenarios I will not need to resort to the third-party function at all. So, is there any way to get the best of both worlds: keep the full benefits of polars where no third-party function is required, and only apply the third-party function when necessary?
It sounds like you want to find matching groups:
(
    df
    .with_row_count()
    .filter(
        (pl.col("a") >= pl.col("b"))
        .any()
        .over("c"))
)
shape: (4, 4)
┌────────┬─────┬─────┬─────┐
│ row_nr | a | b | c │
│ --- | --- | --- | --- │
│ u32 | i64 | i64 | i64 │
╞════════╪═════╪═════╪═════╡
│ 0 | 1 | 2 | 1 │
│ 1 | 4 | 3 | 1 │
│ 2 | 3 | 1 | 1 │
│ 3 | 2 | 3 | 1 │
└────────┴─────┴─────┴─────┘
And apply your custom function over each group.
(
    df
    .with_row_count()
    .filter(
        (pl.col("a") >= pl.col("b"))
        .any()
        .over("c"))
    .select(
        pl.col("row_nr"),
        pl.apply(
            ["a", "b"],  # np.minimum is just for example purposes
            lambda s: np.minimum(s[0], s[1]))
        .over("c"))
)
shape: (4, 2)
┌────────┬─────┐
│ row_nr | a │
│ --- | --- │
│ u32 | i64 │
╞════════╪═════╡
│ 0 | 1 │
│ 1 | 3 │
│ 2 | 1 │
│ 3 | 2 │
└────────┴─────┘
(Note: there may be some useful information in How to Write Poisson CDF as Python Polars Expression with regards to scipy/numpy ufuncs and potentially avoiding .apply())
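For instance, single-input numpy ufuncs usually dispatch on polars expressions and stay inside the expression engine; whether a two-input ufunc such as np.minimum dispatches the same way may depend on the polars version. A sketch, assuming the df from the question:
# sketch: a numpy ufunc applied to a polars expression returns an expression
df.with_columns(np.sqrt(pl.col("a")).alias("sqrt_a"))
# a two-input ufunc may also work directly, depending on the polars version:
df.with_columns(np.minimum(pl.col("a"), pl.col("b")).alias("row_min"))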
You can then .join() the result back into the original data.
(
    df
    .with_row_count()
    .join(
        df
        .with_row_count()
        .filter(
            (pl.col("a") >= pl.col("b"))
            .any()
            .over("c"))
        .select(
            pl.col("row_nr"),
            pl.apply(
                ["a", "b"],
                lambda s: np.minimum(s[0], s[1]))
            .over("c")),
        on="row_nr",
        how="left")
)
shape: (8, 5)
┌────────┬─────┬─────┬─────┬─────────┐
│ row_nr | a | b | c | a_right │
│ --- | --- | --- | --- | --- │
│ u32 | i64 | i64 | i64 | i64 │
╞════════╪═════╪═════╪═════╪═════════╡
│ 0 | 1 | 2 | 1 | 1 │
│ 1 | 4 | 3 | 1 | 3 │
│ 2 | 3 | 1 | 1 | 1 │
│ 3 | 2 | 3 | 1 | 2 │
│ 4 | 8 | 9 | 2 | null │
│ 5 | 4 | 7 | 2 | null │
│ 6 | 5 | 6 | 2 | null │
│ 7 | 6 | 8 | 2 | null │
└────────┴─────┴─────┴─────┴─────────┘
You can then fill in the nulls.
.with_columns(
    pl.col("a_right").fill_null(pl.col("a")))

Split value between polars DataFrame rows

I would like to find a way to distribute the values of a DataFrame among the rows of another DataFrame using polars (without iterating through the rows).
I have a dataframe with the amounts to be distributed:
| Name | Amount |
|------|--------|
| A    | 100    |
| B    | 300    |
| C    | 250    |
And a target DataFrame to which I want to append the distributed values (in a new column) using the common "Name" column.
| Name | Item | Price |
|------|------|-------|
| A    | x1   | 40    |
| A    | x2   | 60    |
| B    | y1   | 50    |
| B    | y2   | 150   |
| B    | y3   | 200   |
| C    | z1   | 400   |
The rows in the target are sorted and the assigned amount should match the price in each row (as long as there is enough amount remaining).
So the result in this case should look like this:
| Name | Item | Price | Assigned amount |
|------|------|-------|-----------------|
| A    | x1   | 40    | 40              |
| A    | x2   | 60    | 60              |
| B    | y1   | 50    | 50              |
| B    | y2   | 150   | 150             |
| B    | y3   | 200   | 100             |
| C    | z1   | 400   | 250             |
In this example, we can distribute the amounts for A so that they are the same as the prices. However, for the last item of B and for C we assign the remaining amounts, as the prices are too high.
Is there an efficient way to do this?
My initial solution was to calculate the cumulative sum of the Price in a new column in the target dataframe, then left join the source DataFrame and subtract the values of the cumulative sum. This would work if the amount were high enough, but for the last item of B and for C I would get negative values instead of the remaining amount.
Edit
Example dataframes:
import polars as pl
df1 = pl.DataFrame({"Name": ["A", "B", "C"], "Amount": [100, 300, 250]})
df2 = pl.DataFrame({"Name": ["A", "A", "B", "B", "B", "C"], "Item": ["x1", "x2", "y1", "y2", "y3", "z"],"Price": [40, 60, 50, 150, 200, 400]})
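For reference, the cumulative-sum attempt described above might look like the following sketch; it shows why the plain subtraction goes negative once the prices exceed the amount:
# sketch of the naive approach: Amount minus the running total of Price per Name
(
    df2.join(df1, on="Name")
    .with_columns(
        (pl.col("Amount") - pl.col("Price").cumsum().over("Name")).alias("remaining"))
)
# "remaining" ends up at -100 for B/y3 and -150 for C/z instead of the
# assigned amounts (100 and 250), which is the problem described above.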
#jqurious, good answer. This might be slightly more succinct:
(
    df2.join(df1, on="Name")
    .with_columns(
        pl.min([
            pl.col('Price'),
            pl.col('Amount') -
            pl.col('Price').cumsum().shift_and_fill(1, 0).over('Name')
        ])
        .clip_min(0)
        .alias('assigned')
    )
)
shape: (6, 5)
┌──────┬──────┬───────┬────────┬──────────┐
│ Name ┆ Item ┆ Price ┆ Amount ┆ assigned │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 ┆ i64 │
╞══════╪══════╪═══════╪════════╪══════════╡
│ A ┆ x1 ┆ 40 ┆ 100 ┆ 40 │
│ A ┆ x2 ┆ 60 ┆ 100 ┆ 60 │
│ B ┆ y1 ┆ 50 ┆ 300 ┆ 50 │
│ B ┆ y2 ┆ 150 ┆ 300 ┆ 150 │
│ B ┆ y3 ┆ 200 ┆ 300 ┆ 100 │
│ C ┆ z ┆ 400 ┆ 250 ┆ 250 │
└──────┴──────┴───────┴────────┴──────────┘
You can take the minimum value of the Price or the Difference.
.clip_min(0) can be used to replace the negatives.
[Edit: See #ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ's answer for a neater way to write this.]
(
    df2
    .join(df1, on="Name")
    .with_columns(
        cumsum = pl.col("Price").cumsum().over("Name"))
    .with_columns(
        assigned = pl.col("Amount") - (pl.col("cumsum") - pl.col("Price")))
    .with_columns(
        assigned = pl.min(["Price", "assigned"]).clip_min(0))
)
shape: (6, 6)
┌──────┬──────┬───────┬────────┬────────┬──────────┐
│ Name | Item | Price | Amount | cumsum | assigned │
│ --- | --- | --- | --- | --- | --- │
│ str | str | i64 | i64 | i64 | i64 │
╞══════╪══════╪═══════╪════════╪════════╪══════════╡
│ A | x1 | 40 | 100 | 40 | 40 │
│ A | x2 | 60 | 100 | 100 | 60 │
│ B | y1 | 50 | 300 | 50 | 50 │
│ B | y2 | 150 | 300 | 200 | 150 │
│ B | y3 | 200 | 300 | 400 | 100 │
│ C | z | 400 | 250 | 400 | 250 │
└──────┴──────┴───────┴────────┴────────┴──────────┘
This assumes the order of the df is the order of priority; if not, sort it first.
You first want to join your two dfs, then make a helper column that is the cumsum of Price less Price; I call that spent. It's really a potential spend, because there's no guarantee it doesn't go over Amount.
Add two more helper columns: one for the difference between Amount and spent, which we'll call have1, as that's the amount we have. In the sample data this didn't come up, but we need to make sure this isn't less than 0, so we add another column which is literally just zero; we'll call it z.
Add another helper column which is the greater value of 0 and have1; we'll call it have2.
Lastly, we determine the Assigned amount as the smaller value of have2 and Price.
df1.join(df2, on='Name') \
    .with_columns((pl.col("Price").cumsum() - pl.col("Price")).over("Name").alias("spent")) \
    .with_columns([(pl.col("Amount") - pl.col("spent")).alias("have1"), pl.lit(0).alias('z')]) \
    .with_columns(pl.concat_list([pl.col('z'), pl.col('have1')]).arr.max().alias('have2')) \
    .with_columns(pl.concat_list([pl.col('have2'), pl.col("Price")]).arr.min().alias("Assigned amount")) \
    .select(["Name", "Item", "Price", "Assigned amount"])
You can reduce this to a single nested expression like this...
df1.join(df2, on='Name') \
    .select(["Name", "Item", "Price",
        pl.concat_list([
            pl.concat_list([
                pl.repeat(0, pl.count()),
                pl.col("Amount") - (pl.col("Price").cumsum() - pl.col("Price")).over("Name")
            ]).arr.max(),
            pl.col("Price")
        ]).arr.min().alias("Assigned amount")
    ])
shape: (6, 4)
┌──────┬──────┬───────┬─────────────────┐
│ Name ┆ Item ┆ Price ┆ Assigned amount │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 │
╞══════╪══════╪═══════╪═════════════════╡
│ A ┆ x1 ┆ 40 ┆ 40 │
│ A ┆ x2 ┆ 60 ┆ 60 │
│ B ┆ y1 ┆ 50 ┆ 50 │
│ B ┆ y2 ┆ 150 ┆ 150 │
│ B ┆ y3 ┆ 200 ┆ 100 │
│ C ┆ z ┆ 400 ┆ 250 │
└──────┴──────┴───────┴─────────────────┘

How to compute a group weighted average controlling for null values in Polars?

A group weighted average without nulls is pretty straightforward:
import polars as pl
data = {"id":[1, 1, 2, 2], "a" : [2, 1, 1, 3], "b":[0,1,2,3], "weights":[0.5, 1, 0.2, 3]}
df = pl.DataFrame(data)
weighted_average = (pl.col(["a", "b"]) * pl.col("weights")).sum() / pl.col("weights").sum()
df.groupby("id").agg(weighted_average)
shape: (2, 3)
┌─────┬──────────┬──────────┐
│ id ┆ a ┆ b │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 │
╞═════╪══════════╪══════════╡
│ 1 ┆ 1.333333 ┆ 0.666667 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2.875 ┆ 2.9375 │
└─────┴──────────┴──────────┘
However, the result for a column whose group contains None/null values is invalid. This is because the last term of the expression (the weight sum in the denominator) is not subset by the null mask of the column containing nulls.
Example:
data = {"id":[1, 1, 2, 2], "a" : [2, None, 1, 3], "b":[0,1,2,3], "weights":[0.5, 1, 0.2, 3]}
df = pl.DataFrame(data)
weighted_average = (pl.col(["a", "b"]) * pl.col("weights")).sum() / pl.col("weights").sum()
df.groupby("id").agg(weighted_average)
shape: (2, 3)
┌─────┬──────────┬──────────┐
│ id ┆ a ┆ b │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 │
╞═════╪══════════╪══════════╡
│ 2 ┆ 2.875 ┆ 2.9375 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 0.666667 ┆ 0.666667 │
└─────┴──────────┴──────────┘
The value of column a for group 1 should be equal to 2 (2 * 0.5 / 0.5) but is instead calculated as 2 * 0.5 / (0.5 + 1) = 0.66.
How do I get the right results? I.e. how do I subset the denominator by the other column's null mask when required?
You can mask out the weights by the null value of the other column.
data = {
    "id": [1, 1, 2, 2],
    "a": [2, None, 1, 3],
    "b": [0, 1, 2, 3],
    "weights": [0.5, 1, 0.2, 3]
}
df = pl.DataFrame(data)
# mask out the weights that are null in 'a' or 'b'
masked_weights = pl.col("weights") * pl.col(["a", "b"]).is_not_null()
weighted_average = (pl.col(["a", "b"]) * pl.col("weights")).sum() / masked_weights.sum()
(df.groupby("id")
    .agg(weighted_average)
)
Outputs:
shape: (2, 3)
┌─────┬───────┬──────────┐
│ id ┆ a ┆ b │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 │
╞═════╪═══════╪══════════╡
│ 1 ┆ 2.0 ┆ 0.666667 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2.875 ┆ 2.9375 │
└─────┴───────┴──────────┘

Assigning to a subset of a Dataframe (with a selection or other method) in python Polars

In Pandas I can do the following:
import pandas as pd
data = pd.DataFrame(
    {
        "era": ["01", "01", "02", "02", "03", "10"],
        "pred1": [1, 2, 3, 4, 5, 6],
        "pred2": [2, 4, 5, 6, 7, 8],
        "pred3": [3, 5, 6, 8, 9, 1],
        "something_else": [5, 4, 3, 67, 5, 4],
    })
pred_cols = ["pred1", "pred2", "pred3"]
ERA_COL = "era"
DOWNSAMPLE_CROSS_VAL = 10
test_split = ['01', '02', '10']
test_split_index = data[ERA_COL].isin(test_split)
downsampled_train_split_index = train_split_index[test_split_index].index[::DOWNSAMPLE_CROSS_VAL]
data.loc[test_split_index, "pred1"] = somefunction()["another_column"]
How can I achieve the same in Polars? I tried to do some data.filter(****) = somefunction()["another_column"], but the filter output is not assignable with Polars.
Let's see if I can help. It would appear that what you want to accomplish is to replace a subset/filtered portion of a column with values derived from one or more other columns.
For example, if you are attempting to accomplish this:
ERA_COL = "era"
test_split = ["01", "02", "10"]
test_split_index = data[ERA_COL].isin(test_split)
data.loc[test_split_index, "pred1"] = -2 * data["pred3"]
print(data)
>>> print(data)
era pred1 pred2 pred3 something_else
0 01 -6 2 3 5
1 01 -10 4 5 4
2 02 -12 5 6 3
3 02 -16 6 8 67
4 03 5 7 9 5
5 10 -2 8 1 4
We would accomplish the above in Polars using a when/then/otherwise expression:
(
    pl.from_pandas(data)
    .with_column(
        pl.when(pl.col(ERA_COL).is_in(test_split))
        .then(-2 * pl.col('pred3'))
        .otherwise(pl.col('pred1'))
        .alias('pred1')
    )
)
shape: (6, 5)
┌─────┬───────┬───────┬───────┬────────────────┐
│ era ┆ pred1 ┆ pred2 ┆ pred3 ┆ something_else │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═══════╪═══════╪═══════╪════════════════╡
│ 01 ┆ -6 ┆ 2 ┆ 3 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 01 ┆ -10 ┆ 4 ┆ 5 ┆ 4 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 02 ┆ -12 ┆ 5 ┆ 6 ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 02 ┆ -16 ┆ 6 ┆ 8 ┆ 67 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 03 ┆ 5 ┆ 7 ┆ 9 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10 ┆ -2 ┆ 8 ┆ 1 ┆ 4 │
└─────┴───────┴───────┴───────┴────────────────┘
Is this what you were looking to accomplish?
A few things as general points:
- polars syntax doesn't attempt to match that of pandas.
- In polars, you can only assign a whole df; you can't assign part of a df.
- polars doesn't use an internal index, so there's no index to record.
For your problem, assuming there isn't already a natural index, you'd want to make an explicit index.
pldata = pl.from_pandas(data).with_row_count(name='myindx')
then recording the index would be
test_split_index = pldata.filter(pl.col(ERA_COL).is_in(test_split)).select('myindx').to_series()
For your last bit on the final assignment, without knowing anything about somefunction my best guess is that you'd want to do that with a join.
Maybe something like:
pldata = pldata.join(
    pl.from_pandas(some_function()['another_column'])
        .with_column(test_split_index.alias('myindx')),
    on='myindx')
Your test_split_index is actually a boolean mask, not the index, whereas the above uses an actual index, so take that with a grain of salt.
All that being said, polars copies of data are essentially free, so rather than keeping track of index positions manually (as error-prone as that can be), you can just make 2 new dfs: under the hood, polars just references the data, it doesn't make a physical copy of it.
Something like:
testdata=pldata.filter(pl.col(ERA_COL).is_in(test_split))
traindata=pldata.filter(~pl.col(ERA_COL).is_in(test_split))

fill_null() values with other columns data

I have a question regarding filling null values: is it possible to backfill data from other columns as in pandas?
Working pandas example on how to backfill data:
df.loc[:, ['A', 'B', 'C']] = df[['A', 'B', 'C']].fillna(
    value={'A': df['D'],
           'B': df['D'],
           'C': df['D'],
           })
Polars example where I tried to backfill data from column D to column A if the value is null, but it's not working:
df = pl.DataFrame(
    {"date": ["2020-01-01 00:00:00", "2020-01-07 00:00:00", "2020-01-14 00:00:00"],
     "A": [3, 4, 7],
     "B": [3, 4, 5],
     "C": [0, 1, 2],
     "D": [1, 2, 5]})
df = df.with_column(pl.col("date").str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S"))
date_range = df.select(
    pl.arange(df["date"][0], df["date"][-1] + 1, step=1000*60*60*24)
    .cast(pl.Datetime)
    .alias("date"))
df = (date_range.join(df, on="date", how="left"))
df['D'] = df['D'].fill_null("forward")
print(df)
df[:, ['A']] = df[['A']].fill_null({
    'A': df['D']
})
print(df)
Kind regards,
Tom
In the example you show and the accompanying pandas code, the fillna doesn't fill any null values, because the other columns are also NaN. So I am going to assume that you want to fill missing values with the values of another column that doesn't have missing values, but correct me if I am wrong.
import polars as pl
from polars import col
df = pl.DataFrame({
    "a": [0, 1, 2, 3, None, 5, 6, None, 8, None],
    "b": range(10),
})
out = df.with_columns([
    pl.when(col("a").is_null()).then(col("b")).otherwise(col("a")).alias("a"),
    pl.when(col("a").is_null()).then(col("b").shift(1)).otherwise(col("a")).alias("a_filled_lag"),
    pl.when(col("a").is_null()).then(col("b").mean()).otherwise(col("a")).alias("a_filled_mean"),
])
print(out)
In the example above, we use a when -> then -> otherwise expression to fill missing values with another column's values. Think of them as if/else expressions, but on whole columns.
I gave 3 examples: one where we fill with that column's value, one where we fill with the lagged value, and one where we fill with the mean value of the other column.
The snippet above produces:
shape: (10, 4)
┌─────┬─────┬──────────────┬───────────────┐
│ a ┆ b ┆ a_filled_lag ┆ a_filled_mean │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ f64 │
╞═════╪═════╪══════════════╪═══════════════╡
│ 0 ┆ 0 ┆ 0 ┆ 0.0 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 1 ┆ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 2 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 3 ┆ 3 ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ 5 ┆ 5 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 6 ┆ 6 ┆ 6 ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 7 ┆ 7 ┆ 6 ┆ 4.5 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 8 ┆ 8 ┆ 8 ┆ 8 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 9 ┆ 9 ┆ 8 ┆ 4.5 │
└─────┴─────┴──────────────┴───────────────┘
The fillna() method is used to fill null values in pandas.
df['D'] = df['D'].fillna(df['A'].mean())
The above code will replace null values of the D column with the mean value of the A column.
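A Polars equivalent would use fill_null with an expression; a sketch, assuming the df from this question:
# sketch: fill nulls in D with the mean of A, or fill A row-wise from D
df = df.with_columns([
    pl.col("D").fill_null(pl.col("A").mean()),
    pl.col("A").fill_null(pl.col("D")),
])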
