how to fill dataframe with former value? [duplicate]

This question already has answers here:
How to replace NaNs by preceding or next values in pandas DataFrame?
(10 answers)
Closed 3 years ago.
I import the data from an Excel file, but the formatting of merged cells in the Excel file does not carry over into Python, so I have to modify the data in Python.
For example, the data I import into Python looks like this:
0 aa
1 NaN
2 NaN
3 NaN
4 b
5 NaN
6 NaN
7 NaN
8 NaN
9 ccc
10 NaN
11 NaN
12 NaN
13 dd
14 NaN
15 NaN
16 NaN
The result I want is:
0 aa
1 aa
2 aa
3 aa
4 b
5 b
6 b
7 b
8 b
9 ccc
10 ccc
11 ccc
12 ccc
13 dd
14 dd
15 dd
16 dd
I tried using a for loop to fix the problem, but it took a lot of time and I have a huge dataset. I do not know if there is a faster way to do it.

Looks like .fillna() is your friend – quoting the documentation:
We can also propagate non-null values forward or backward.
>>> df
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
>>> df.fillna(method='ffill')
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 3.0 4.0 NaN 5
3 3.0 3.0 NaN 4
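Applied to your single-column example this is a one-liner. Note that in recent pandas versions the method argument to fillna is deprecated, so the equivalent df.ffill() is the preferred spelling. A minimal sketch, assuming your data sits in a column named 'col' (the column name is illustrative):
import pandas as pd
import numpy as np

df = pd.DataFrame({'col': ['aa', np.nan, np.nan, np.nan, 'b', np.nan]})

# Forward-fill: each NaN takes the most recent non-null value above it.
df['col'] = df['col'].ffill()
print(df)
#   col
# 0  aa
# 1  aa
# 2  aa
# 3  aa
# 4   b
# 5   b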

This is exactly what the .fillna() function in pandas is for.

You can get your desired result with the help of the apply and fillna methods:
import pandas as pd
import numpy as np

df = pd.DataFrame(data={'A': ['a', np.nan, np.nan, 'b', np.nan]})

l = []

def change(value):
    # 'bhale' is the sentinel substituted for NaN below; replace it with
    # the last real value seen, otherwise remember the value for later.
    if value == "bhale":
        return l[-1]
    else:
        l.append(value)
        return value

# First convert the NaN values into a sentinel string value like `bhale` here
df['A'] = df['A'].fillna('bhale')
df['A'] = df['A'].apply(change)  # Using the apply method.
df
I hope this helps.

Related

Drop group if another column only has duplicate values - pandas dataframe

My question is quite similar to this one: Drop group if another column has duplicate values - pandas dataframe
I have the following dataframe:
letter value many other variables
A 5
A 5
A 8
A 9
B 3
B 10
B 10
B 4
B 5
B 9
C 10
C 10
C 10
D 6
D 8
D 10
E 5
E 5
E 5
F 4
F 4
And when grouping it by letter I want to remove all the resulting groups in which value contains only a single repeated value, thus getting a result like this:
letter value many other variables
A 5
A 5
A 8
A 9
B 3
B 10
B 10
B 4
B 5
B 9
D 6
D 8
D 10
I am afraid that if I use the duplicated() function similarly to the question I mentioned at the beginning, I would be deleting groups (or the rows in) 'A' and 'B', which should rather stay in their place.
You have several possibilities.
Using duplicated and groupby.transform:
m = (df.duplicated(subset=['letter', 'value'], keep=False)
       .groupby(df['letter']).transform('all')
     )
out = df[~m]
NB. this won't drop groups with a single row.
Using groupby.transform and nunique:
out = df[df.groupby('letter')['value'].transform('nunique').gt(1)]
NB. this will drop groups with a single row.
Output:
letter value many other variables
0 A 5 NaN NaN NaN
1 A 5 NaN NaN NaN
2 A 8 NaN NaN NaN
3 A 9 NaN NaN NaN
4 B 3 NaN NaN NaN
5 B 10 NaN NaN NaN
6 B 10 NaN NaN NaN
7 B 4 NaN NaN NaN
8 B 5 NaN NaN NaN
9 B 9 NaN NaN NaN
13 D 6 NaN NaN NaN
14 D 8 NaN NaN NaN
15 D 10 NaN NaN NaN
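If you prefer a single expression, groupby.filter is an equivalent spelling of the nunique approach (like it, this drops groups with a single row):
# Keep only groups whose 'value' column has more than one distinct value.
out = df.groupby('letter').filter(lambda g: g['value'].nunique() > 1)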

Python How to drop rows of Pandas DataFrame whose value in a certain column is NaN

I have this DataFrame and want to keep only the records whose "Total" column is not NaN, and to drop the records in which more than two of the columns A~E are NaN:
A B C D E Total
1 1 3 5 5 8
1 4 3 5 5 NaN
3 6 NaN NaN NaN 6
2 2 5 9 NaN 8
i.e. something like df.dropna(....) to get this resulting dataframe:
A B C D E Total
1 1 3 5 5 8
2 2 5 9 NaN 8
Here's my code
import pandas as pd
dfInputData = pd.read_csv(path)
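# Note: axis=1 drops whole *columns* that contain any NaN, not rows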
dfInputData = dfInputData.dropna(axis=1,how = 'any')
RowCnt = dfInputData.shape[0]
But it looks like no modification has been made, and there is not even an error.
Please help!! Thanks
Use boolean indexing: count the missing values across all columns except Total, and require non-missing values in Total:
df = df[df.drop('Total', axis=1).isna().sum(axis=1).le(2) & df['Total'].notna()]
print (df)
A B C D E Total
0 1 1 3.0 5.0 5.0 8.0
3 2 2 5.0 9.0 NaN 8.0
Or filter columns between A:E:
df = df[df.loc[:, 'A':'E'].isna().sum(axis=1).le(2) & df['Total'].notna()]
print (df)
A B C D E Total
0 1 1 3.0 5.0 5.0 8.0
3 2 2 5.0 9.0 NaN 8.0
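A dropna-based alternative is also possible, since thresh counts the required non-NA values. A sketch, assuming "more than two NaN" in A~E means keeping rows with at least three non-NA values among those five columns:
# First require Total to be non-NaN, then require >= 3 non-NA values in A~E.
df = df.dropna(subset=['Total']).dropna(subset=list('ABCDE'), thresh=3)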

How to merge multiple dataframe columns within a common dataframe in pandas in the fastest way possible?

Need to perform the following operation on a pandas dataframe df inside a for loop with 50 iterations or more:
Column 'X' of df has to be merged with column 'X' of df1,
Column 'Y' of df has to be merged with column 'Y' of df2,
Column 'Z' of df has to be merged with column 'Z' of df3,
Column 'W' of df has to be merged with column 'W' of df4.
The columns that are common to all 5 dataframes (df, df1, df2, df3 and df4) are A, B, C and D.
EDIT
The shapes of the dataframes all differ: df is the master dataframe with the maximum number of rows, and the other 4 dataframes each have fewer rows than df, varying among themselves. So while merging the columns I need to make sure that the rows of both dataframes are matched first.
Input df
A B C D X Y Z W
1 2 3 4 nan nan nan nan
2 3 4 5 nan nan nan nan
5 9 7 8 nan nan nan nan
4 8 6 3 nan nan nan nan
df1
A B C D X Y Z W
2 3 4 5 100 nan nan nan
4 8 6 3 200 nan nan nan
df2
A B C D X Y Z W
1 2 3 4 nan 50 nan nan
df3
A B C D X Y Z W
1 2 3 4 nan nan 1000 nan
4 8 6 3 nan nan 2000 nan
df4
A B C D X Y Z W
2 3 4 5 nan nan nan 25
5 9 7 8 nan nan nan 35
4 8 6 3 nan nan nan 45
Output df
A B C D X Y Z W
1 2 3 4 nan 50 1000 nan
2 3 4 5 100 nan nan 25
5 9 7 8 nan nan nan 35
4 8 6 3 200 nan 2000 45
What is the most efficient and fastest way to achieve this? I tried using 4 separate combine_first statements, but that doesn't seem to be the most efficient way.
Can this be done with just 1 line of code instead?
Any help will be appreciated. Many thanks in advance.
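One possible sketch of the combine_first approach you mention, with the row matching done via the index; it assumes A, B, C, D together uniquely identify a row, and note that combine_first sorts the result, so df's original row order is not preserved:
import pandas as pd

keys = ['A', 'B', 'C', 'D']
out = df.set_index(keys)
for other in (df1, df2, df3, df4):
    # Fill the NaNs in out from the matching rows of the other frame,
    # aligned on the A/B/C/D key.
    out = out.combine_first(other.set_index(keys))
out = out.reset_index()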

how to merge two dataframes if the index and length both do not match?

I have two data frames, predictor_df and solution_df, like this:
predictor_df
1000 A B C
1001 1 2 3
1002 4 5 6
1003 7 8 9
1004 Nan Nan Nan
and a solution_df
0 D
1 10
2 11
3 12
The reason for the names is that predictor_df is used to do some analysis on its columns to arrive at solution_df. My analysis drops the rows with NaN values in predictor_df, hence the shorter solution_df.
Now I want to know how to join these two dataframes to obtain my final dataframe:
A B C D
1 2 3 10
4 5 6 11
7 8 9 12
Nan Nan Nan
Please guide me through it. Thanks in advance.
Edit: I tried to merge the two dataframes, but the result comes out like this:
A B C D
1 2 3 Nan
4 5 6 Nan
7 8 9 Nan
Nan Nan Nan
Edit 2: also, when I do pd.concat([predictor_df, solution_df], axis=1),
it becomes like this:
A B C D
Nan Nan Nan 10
Nan Nan Nan 11
Nan Nan Nan 12
Nan Nan Nan Nan
You could use reset_index with drop=True, which resets both indexes to the default integer index so that concat aligns the rows positionally:
pd.concat([df_1.reset_index(drop=True), df_2.reset_index(drop=True)], axis=1)
A B C D
0 1 2 3 10.0
1 4 5 6 11.0
2 7 8 9 12.0
3 NaN NaN NaN NaN
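An equivalent sketch using DataFrame.join, which also aligns on the (reset) integer index:
predictor_df.reset_index(drop=True).join(solution_df.reset_index(drop=True))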

How can I use apply with pandas rolling_corr()

I posted this a while ago but no one could solve the problem.
First, let's create some correlated DataFrames and call rolling_corr(), with dropna() since I am going to sparse them up later, and with no min_periods set, as I want to keep the results robust and consistent with the set window:
import numpy as np
from pandas import DataFrame, rolling_corr  # old pandas API

hey = (DataFrame(np.random.random((15, 3))) + .2).cumsum()
hoo = (DataFrame(np.random.random((15, 3))) + .2).cumsum()
hey_corr = rolling_corr(hey.dropna(), hoo.dropna(), 4)
gives me
In [388]: hey_corr
Out[388]:
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 0.991087 0.978383 0.992614
4 0.974117 0.974871 0.989411
5 0.966969 0.972894 0.997427
6 0.942064 0.994681 0.996529
7 0.932688 0.986505 0.991353
8 0.935591 0.966705 0.980186
9 0.969994 0.977517 0.931809
10 0.979783 0.956659 0.923954
11 0.987701 0.959434 0.961002
12 0.907483 0.986226 0.978658
13 0.940320 0.985458 0.967748
14 0.952916 0.992365 0.973929
now when I sparse it up it gives me...
hey.ix[5:8,0] = np.nan
hey.ix[6:10,1] = np.nan
hoo.ix[5:8,0] = np.nan
hoo.ix[6:10,1] = np.nan
hey_corr_sparse = rolling_corr(hey.dropna(),hoo.dropna(), 4)
hey_corr_sparse
Out[398]:
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 0.991273 0.992557 0.985773
4 0.953041 0.999411 0.958595
11 0.996801 0.998218 0.992538
12 0.994919 0.998656 0.995235
13 0.994899 0.997465 0.997950
14 0.971828 0.937512 0.994037
chunks of data are missing; it looks like we only have data where the dropna() can form a complete window across the dataframe.
I can solve the problem with an ugly iter-fudge as follows...
hey_corr_sparse = DataFrame(np.nan, index=hey.index, columns=hey.columns)
for i in hey_corr_sparse.columns:
    hey_corr_sparse.ix[:, i] = rolling_corr(hey.ix[:, i].dropna(), hoo.ix[:, i].dropna(), 4)
hey_corr_sparse
Out[406]:
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 0.991273 0.992557 0.985773
4 0.953041 0.999411 0.958595
5 NaN 0.944246 0.961917
6 NaN NaN 0.941467
7 NaN NaN 0.963183
8 NaN NaN 0.980530
9 0.993865 NaN 0.984484
10 0.997691 NaN 0.998441
11 0.978982 0.991095 0.997462
12 0.914663 0.990844 0.998134
13 0.933355 0.995848 0.976262
14 0.971828 0.937512 0.994037
Does anyone in the community know if it is possible to make this an array function that gives this result? I've attempted to use .apply but drawn a blank. Is it even possible to .apply a function that works on two data structures (hey and hoo in this example)?
many thanks, LW
you can try this:
>>> def sparse_rolling_corr(ts, other, window):
... return rolling_corr(ts.dropna(), other[ts.name].dropna(), window).reindex_like(ts)
...
>>> hey.apply(sparse_rolling_corr, args=(hoo, 4))
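For readers on modern pandas, where the top-level rolling_corr no longer exists, a sketch of the same idea against the rolling API:
def sparse_rolling_corr(ts, other, window):
    # Correlate only the non-null stretch, then reindex back to the full index.
    return ts.dropna().rolling(window).corr(other[ts.name].dropna()).reindex_like(ts)

hey.apply(sparse_rolling_corr, args=(hoo, 4))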