I asked a similar question before and it got no reply, so I thought I'd take a different approach and see if anyone knows how to do this.
First I'll tell you my goal and what I already know:
I am currently cleaning a dataset and need to backward fill the dataset to get rid of some of the NaN values.
In the image below, I would like to backward fill each NaN cell from a row with the same X value, and only fill NaN cells in rows where the Y value is 1.
This image shows what outcome I would like
I already know I can use
df.loc[df['Y'] == 1] = df.loc[:,].bfill(limit=1)
to get it to only fill cells that are matching with a Y value row of 1 (hence the bottom Na cell is not filled).
Here is my question: the code above also fills the middle NaN, because the Y value in that row is 1. That is fine for the top cell, because the source cell and the NaN cell both have an X value of 1, but for the middle NaN the X values are 2 and 3. So, is there a way to fill only cells that share the same X value with the row below? (The X values of the source and the NaN must match; if they don't, nothing should happen.)
Thanks!
We can try with loc + groupby bfill:
df.loc[df['Y'] == 1, 'Z'] = df.groupby('X')['Z'].bfill()
groupby ensures that each group of X values is treated independently, and bfill backfills within each group. The mask df['Y'] == 1 ensures that only rows with a Y value of 1 are updated.
df:
X Y Z
0 1 1 2.0
1 1 2 2.0
2 2 1 NaN
3 3 1 3.0
4 3 2 NaN
5 4 1 4.0
Initial Frame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'X': [1, 1, 2, 3, 3, 4],
'Y': [1, 2, 1, 1, 2, 1],
'Z': [np.nan, 2, np.nan, 3, np.nan, 4]})
df:
X Y Z
0 1 1 NaN
1 1 2 2.0
2 2 1 NaN
3 3 1 3.0
4 3 2 NaN
5 4 1 4.0
Edit: to bfill all columns except X and Y, use:
df.loc[df['Y'] == 1, df.columns.difference(['X', 'Y'])] = df.groupby('X').bfill()
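As a rough, self-contained sketch of the multi-column variant: the frame below reuses the X, Y and Z values from the example above and adds a hypothetical second value column, W (not part of the original question), to show that each non-key column is backfilled inside its own X group.

```python
import numpy as np
import pandas as pd

# X, Y and Z come from the example above; W is a made-up extra column.
df = pd.DataFrame({'X': [1, 1, 2, 3, 3, 4],
                   'Y': [1, 2, 1, 1, 2, 1],
                   'Z': [np.nan, 2, np.nan, 3, np.nan, 4],
                   'W': [np.nan, 20, np.nan, np.nan, 30, 40]})

# Every column except the keys X and Y
value_cols = df.columns.difference(['X', 'Y'])

# Backfill within each X group; the result keeps the original row index
filled = df.groupby('X')[list(value_cols)].bfill()

# Only rows where Y == 1 take the backfilled values
df.loc[df['Y'] == 1, value_cols] = filled
print(df)
```

Row 2 stays NaN in both columns because its X group has no later value to borrow from.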
Try using shift:
df.loc[df['Y'].eq(1) & df['X'].shift(-1).eq(df['X']), 'Z'] = df['Z'].bfill(limit=1)
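The two approaches agree on the sample data, but they are not equivalent in general. The sketch below uses a hypothetical three-row frame (not from the question) where one X group has two consecutive NaN rows: the groupby version backfills as far as needed within the group, while the shift version only looks one row ahead.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: a single X group with two consecutive NaN rows
df = pd.DataFrame({'X': [1, 1, 1],
                   'Y': [1, 1, 2],
                   'Z': [np.nan, np.nan, 5.0]})

# groupby variant: backfill as far as needed inside the X group
by_group = df.copy()
by_group.loc[by_group['Y'] == 1, 'Z'] = by_group.groupby('X')['Z'].bfill()

# shift variant: the next row must have the same X, and bfill is limited to 1
by_shift = df.copy()
mask = by_shift['Y'].eq(1) & by_shift['X'].shift(-1).eq(by_shift['X'])
by_shift.loc[mask, 'Z'] = by_shift['Z'].bfill(limit=1)

print(by_group['Z'].tolist())
print(by_shift['Z'].tolist())
```

Here the groupby variant fills both NaN rows with 5.0, whereas the shift variant leaves the first row as NaN. Which behaviour is wanted depends on the data.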
I have a data frame like this:
id  value
a   2
a   4
a   3
a   5
b   1
b   4
b   3
c   1
c   nan
c   5
The resulting data frame should contain a new column, 'average', whose values are computed as follows:
group by id
the first row of 'average' in each group equals the corresponding value in 'value'
every other row of 'average' in the group equals the mean of all previous rows of 'value' (excluding the current row)
The resulting data frame must be:
id  value  average
a   2      2
a   4      2
a   3      3
a   5      3
b   1      1
b   4      1
b   3      2.5
c   1      1
c   nan    1
c   5      1
You can group the dataframe by id, calculate the expanding mean of the value column within each group, and shift it by one row so that each row only sees the rows before it. The first row of each group is then NaN, so ffill along axis=1 over the value and average columns to copy the value into average for those rows:
out = (df
       .assign(average=df
               .groupby(['id'])['value']
               .transform(lambda x: x.expanding().mean().shift(1))
               )
       )
out[['value', 'average']] = out[['value', 'average']].ffill(axis=1)
OUTPUT:
id value average
0 a 2.0 2.0
1 a 4.0 2.0
2 a 3.0 3.0
3 a 5.0 3.0
4 b 1.0 1.0
5 b 4.0 1.0
6 b 3.0 2.5
7 c 1.0 1.0
8 c NaN 1.0
9 c 5.0 1.0
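To make the expanding-mean step easier to follow, here is a self-contained sketch that rebuilds the question's frame and runs the same idea end to end. It uses fillna with the value column in place of the row-wise ffill, which is my own substitution rather than the answer's exact code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': list('aaaabbbccc'),
                   'value': [2, 4, 3, 5, 1, 4, 3, 1, np.nan, 5]})

# Mean of all previous rows within each id (NaN for the first row of a group)
prev_mean = (df.groupby('id')['value']
               .transform(lambda s: s.expanding().mean().shift(1)))

# Rows with no usable history (the first row of each group here)
# fall back to their own value
df['average'] = prev_mean.fillna(df['value'])
print(df)
```

For group a, the expanding mean of 2, 4, 3, 5 is 2, 3, 3, 3.5; shifting by one row gives NaN, 2, 3, 3, and the NaN is replaced by the row's own value, 2.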
Here is a solution which, I think, satisfies the requirements. Here, the first row in a group of ids is simply passing its value to the average column. For every other row, we take the average where the index is smaller than the current index.
You may want to specify how you want to handle the NaN values. In the code below they are ignored, because Series.mean skips NaN by default.
import numpy as np
import pandas as pd

df = pd.DataFrame([
    ['a', 2],
    ['a', 4],
    ['a', 3],
    ['a', 5],
    ['b', 1],
    ['b', 4],
    ['b', 3],
    ['c', 1],
    ['c', np.nan],
    ['c', 5]
], columns=['id', 'value'])

id_level_frames = []
for group, frame in df.groupby('id'):
    print(group)
    # Reset the index so each id-level frame is numbered from 0
    frame = frame.reset_index()
    for index, row in frame.iterrows():
        if index == 0:
            # First row of the group: copy its own value
            frame.at[index, 'average'] = row['value']
        else:
            # Mean of all earlier rows in the group; NaN values are skipped
            earlier_rows = frame[frame.index < index]
            frame.at[index, 'average'] = earlier_rows['value'].mean()
    id_level_frames.append(frame)

final_df = pd.concat(id_level_frames)
I would like to calculate a new column "change".
The value of the new column shall be calculated as follows:
X / Z (one cell above)
--> first row would be empty.
--> second row would be 1 / 6 ≈ 0.17
--> third row would be 5 / 10 = 0.5 ...and so on.
import pandas as pd

df = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'Z': [6, 10, 8, 6, 6, 15],
                   'X': [2, 1, 5, 2, 3, 20]})
var = df.columns[1]
In the forum I found this:
df['change'] = df['X'] / df[var].shift(1)
It works fine, but I need to apply it per group with groupby and can't get it to work.
I tried this:
df['change'] = df.groupby('id').apply(lambda x: x['X'] / x[var].shift(1))
But I get an error:
"incompatible index of inserted column with frame index"
I am afraid I have not fully understood this lambda function.
Any ideas how to get that right?
Thanks in advance!
You can just divide the X column by Z shifted within each group:
df["change"] = df["X"] / df.groupby("id")["Z"].shift(1)
print(df)
Prints:
id Z X change
0 a 6 2 NaN
1 a 10 1 0.166667
2 a 8 5 0.500000
3 b 6 2 NaN
4 b 6 3 0.500000
5 b 15 20 3.333333
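A likely reason the apply attempt failed is that its result is typically indexed by both the group key and the original row label, so it cannot be assigned back as a column. The sketch below, using the question's data, shows that a grouped shift keeps the frame's own index, which is what lets the division line up row by row.

```python
import pandas as pd

df = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'Z': [6, 10, 8, 6, 6, 15],
                   'X': [2, 1, 5, 2, 3, 20]})

# Shift Z by one row inside each id group; the first row of each group is NaN
prev_z = df.groupby('id')['Z'].shift(1)

# The shifted Series keeps the frame's own index, so it aligns row by row
assert prev_z.index.equals(df.index)

df['change'] = df['X'] / prev_z
print(df)
```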
I have a dataframe that looks like this:
import pandas as pd
df = pd.DataFrame({'AA': [1, 1, 2, 2], 'BB': [3, 0, -1, 3.4]})
Now, I need to compare the values in 'AA' and 'BB' then create and fill 'CC' column with the lower values.
I tried many ways. I think there should be an efficient way to do this without defining another function. What can I try next?
Use min with axis=1:
df.min(axis=1)
Out[263]:
0 1.0
1 0.0
2 -1.0
3 2.0
dtype: float64
Use min and axis, selecting only the series you need:
df['CC'] = df[['AA', 'BB']].min(axis=1)
df
AA BB CC
0 1 3.0 1.0
1 1 0.0 0.0
2 2 -1.0 -1.0
3 2 3.4 2.0
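For exactly two columns, an element-wise alternative is NumPy's minimum. This is only a sketch of a second option, not something the answers above proposed. Note one difference: np.minimum propagates NaN, whereas min(axis=1) skips it by default.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'AA': [1, 1, 2, 2], 'BB': [3, 0, -1, 3.4]})

# Element-wise minimum of exactly two columns
df['CC'] = np.minimum(df['AA'], df['BB'])
print(df)
```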
I have 2 columns, which we'll call x and y. I want to create a new column called xy:
x    y    xy
1         1
2         2
     4    4
     8    8
There shouldn't be any conflicting values, but if there are, y takes precedence. If it makes the solution easier, you can assume that x will always be NaN where y has a value.
It could be quite simple if your example is accurate:
df = df.fillna(0)  # needed first if the blanks are NaN; fillna returns a new frame
df['xy'] = df['x'] + df['y']
Note that if your columns currently hold strings rather than numbers, convert them first:
df = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
df['xy'] = df.sum(axis=1)
Another option, joining the columns as strings:
df['xy'] =df[['x','y']].astype(str).apply(''.join,1)
#df[['x','y']].astype(str).apply(''.join,1)
Out[655]:
0 1.0
1 2.0
2
3 4.0
4 8.0
dtype: object
You can also use NumPy:
import pandas as pd, numpy as np
df = pd.DataFrame({'x': [1, 2, np.nan, np.nan],
'y': [np.nan, np.nan, 4, 8]})
arr = df.values
df['xy'] = arr[~np.isnan(arr)].astype(int)
print(df)
x y xy
0 1.0 NaN 1
1 2.0 NaN 2
2 NaN 4.0 4
3 NaN 8.0 8
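The question also asks that y take precedence if both columns have a value. A minimal sketch of that rule, assuming the blanks are NaN, is to start from y and fall back to x. The last row of the frame below is hypothetical and was added only to show a conflict.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the last row has a value in both columns
df = pd.DataFrame({'x': [1, 2, np.nan, np.nan, 5],
                   'y': [np.nan, np.nan, 4, 8, 6]})

# Take y where it exists, otherwise fall back to x
df['xy'] = df['y'].fillna(df['x'])
print(df)
```

Unlike the flattening approach, this does not rely on each row having exactly one non-NaN value.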
I am having trouble filling Pandas dataframes with values from lists of unequal lengths.
nx_lists_into_df is a list of numpy arrays.
I get the following error:
ValueError: Length of values does not match length of index
The code is below:
import pandas as pd
from itertools import zip_longest
from numpy import array

# Column headers
df_cols = ["f1","f2"]
# Create one dataframe for each sheet
df1 = pd.DataFrame(columns=df_cols)
df2 = pd.DataFrame(columns=df_cols)
# Create list of dataframes to iterate through
df_list = [df1, df2]
# Lists to be put into the dataframes
nx_lists_into_df = [[array([0, 1, 3, 4, 7]),
                     array([2, 5, 6, 8])],
                    [array([0, 1, 2, 6, 7]),
                     array([3, 4, 5, 8])]]
# Loop through each sheet (i.e. each round of k folds)
for df, test_index_list in zip_longest(df_list, nx_lists_into_df):
    counter = -1
    # Loop through each column in that sheet (i.e. each fold)
    for col in df_cols:
        print(col)
        counter += 1
        # Add 1 to each index value to start indexing at 1
        df[col] = test_index_list[counter] + 1
Thank you for your help.
Edit: This is how the result should hopefully look:-
print(df1)
f1 f2
0 0 2
1 1 5
2 3 6
3 4 8
4 7 NaN
print(df2)
f1 f2
0 0 3
1 1 4
2 2 5
3 6 8
4 7 NaN
We'll leverage pd.Series to attach an appropriate index to each array, which lets us use the pd.DataFrame constructor without it complaining about unequal lengths.
df1, df2 = (
    pd.DataFrame(dict(zip(df_cols, map(pd.Series, d))))
    for d in nx_lists_into_df
)
print(df1)
f1 f2
0 0 2.0
1 1 5.0
2 3 6.0
3 4 8.0
4 7 NaN
print(df2)
f1 f2
0 0 3.0
1 1 4.0
2 2 5.0
3 6 8.0
4 7 NaN
Setup
from numpy import array
nx_lists_into_df = [[array([0, 1, 3, 4, 7]),
array([2, 5, 6, 8])],
[array([0, 1, 2, 6, 7]),
array([3, 4, 5, 8])]]
# Column headers
df_cols = ["f1","f2"]
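The padded column above comes out as float because NaN is a float value. If integer display matters, one possible variation (my assumption, not part of the answer) is to build each Series with pandas' nullable integer dtype, "Int64", so the missing entry becomes <NA> and the values stay integers.

```python
import numpy as np
import pandas as pd

df_cols = ["f1", "f2"]
arrays = [np.array([0, 1, 3, 4, 7]), np.array([2, 5, 6, 8])]

# Wrapping each array in a nullable-integer Series keeps the values as
# integers; the shorter column is padded with <NA> instead of NaN.
df1 = pd.DataFrame({col: pd.Series(arr, dtype="Int64")
                    for col, arr in zip(df_cols, arrays)})
print(df1)
print(df1.dtypes)
```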
You could predefine the size of your DataFrames (by setting the index range to the length of the longest column you want to add [or any size bigger than the longest column]) like so:
df1 = pd.DataFrame(columns=df_cols, index=range(5))
df2 = pd.DataFrame(columns=df_cols, index=range(5))
print(df1)
f1 f2
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
(df2 is the same)
The DataFrame will be filled with NaNs automatically.
Then you use .loc to access each entry separately like so:
df_list = [df1, df2]  # rebuild the list so it refers to the pre-sized frames
for x in range(len(nx_lists_into_df)):
    for col_idx, y in enumerate(nx_lists_into_df[x]):
        df_list[x].loc[range(len(y)), df_cols[col_idx]] = y
print(df1)
f1 f2
0 0 2
1 1 5
2 3 6
3 4 8
4 7 NaN
print(df2)
f1 f2
0 0 3
1 1 4
2 2 5
3 6 8
4 7 NaN
The first loop iterates over the first dimension of your array (or the number of DataFrames you want to create).
The second loop iterates over the column values for the DataFrame, where y are the values for the current column and df_cols[col_idx] is the respective column (f1 or f2).
Since the row & col indices are the same size as y, you don't get the length mismatch.
Also check out the enumerate(iterable, start=0) function to get around those "counter" variables.
Hope this helps.
If I understand correctly, this is possible via pd.concat.
But see @pir's solution for an extendable version.
# Lists to be put into the dataframes
nx_lists_into_df = [[array([0, 1, 3, 4, 7]),
array([2, 5, 6, 8])],
[array([0, 1, 2, 6, 7]),
array([3, 4, 5, 8])]]
df1 = pd.concat([pd.DataFrame({'A': nx_lists_into_df[0][0]}),
pd.DataFrame({'B': nx_lists_into_df[0][1]})],
axis=1)
# A B
# 0 0 2.0
# 1 1 5.0
# 2 3 6.0
# 3 4 8.0
# 4 7 NaN
df2 = pd.concat([pd.DataFrame({'C': nx_lists_into_df[1][0]}),
pd.DataFrame({'D': nx_lists_into_df[1][1]})],
axis=1)
# C D
# 0 0 3.0
# 1 1 4.0
# 2 2 5.0
# 3 6 8.0
# 4 7 NaN
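The pd.concat idea can be generalised so that the frames and columns are not written out by hand. The following is a sketch under the assumption that every inner list has one array per name in df_cols; it keeps the question's column names f1 and f2 rather than A to D.

```python
import numpy as np
import pandas as pd

df_cols = ["f1", "f2"]
nx_lists_into_df = [[np.array([0, 1, 3, 4, 7]), np.array([2, 5, 6, 8])],
                    [np.array([0, 1, 2, 6, 7]), np.array([3, 4, 5, 8])]]

# One frame per inner list; each array becomes a named Series, and
# concat along axis=1 pads the shorter column with NaN.
frames = [pd.concat([pd.Series(arr, name=col)
                     for col, arr in zip(df_cols, arrays)], axis=1)
          for arrays in nx_lists_into_df]
df1, df2 = frames
print(df1)
print(df2)
```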