Imputing values into a dataframe based on another dataframe and a condition - python

Suppose I have the following dataframes:
df1 = pd.DataFrame({'col1':['a','b','c','d'],'col2':[1,2,3,4]})
df2 = pd.DataFrame({'col3':['a','x','a','c','b']})
I'd like to look up values in df1 and create a new column in df2 that holds the matching values from col2; where there is no match, I want to impute 0. The result should look like the following:
col3 col4
0 a 1
1 x 0
2 a 1
3 c 3
4 b 2

Use Series.map with Series.fillna:
df2['col2'] = df2['col3'].map(df1.set_index('col1')['col2']).fillna(0).astype(int)
print (df2)
col3 col2
0 a 1
1 x 0
2 a 1
3 c 3
4 b 2
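As a side note on the astype(int) call: Series.map returns NaN for keys missing from the lookup Series, which upcasts the column to float, so filling and casting restores integers. A minimal sketch of the intermediate result, using the question's df1 and df2:
# without fillna, unmatched keys come back as NaN and the dtype becomes float
s = df2['col3'].map(df1.set_index('col1')['col2'])
print (s)
0    1.0
1    NaN
2    1.0
3    3.0
4    2.0
Name: col3, dtype: float64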
Or DataFrame.merge, which is better if you need to append multiple columns:
df = df2.merge(df1.rename(columns={'col1':'col3'}), how='left').fillna(0)
print (df)
col3 col2
0 a 1.0
1 x 0.0
2 a 1.0
3 c 3.0
4 b 2.0
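If you need integer output from the merge variant too, the left join introduces NaN for unmatched keys and upcasts col2 to float; a sketch (same data as above) that fills and casts afterwards:
df = df2.merge(df1.rename(columns={'col1':'col3'}), how='left')
df['col2'] = df['col2'].fillna(0).astype(int)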

Related

Concat pandas dataframes in Python with different row size without getting NaN values

I have to combine some dataframes in Python. I've tried to combine them using the concat operation, but I am getting NaN values because each dataframe has a different number of rows. For example:
DATAFRAME 1:
col1
0 1
DATAFRAME 2:
col2
0 5
DATAFRAME 3:
col3
0 7
1 8
2 9
COMBINED DATAFRAME:
col1 col2 col3
0 1.0 5.0 7
1 NaN NaN 8
2 NaN NaN 9
In this example, dataframe 1 and dataframe 2 only have 1 row. However, dataframe 3 has 3 rows. When I combine these 3 dataframes, I get NaN values for columns col1 and col2 in the new dataframe. I'd like to get a dataframe where the values for col1 and col2 are always the same. In this case, the expected dataframe would look like this:
EXPECTED DATAFRAME:
col1 col2 col3
0 1 5 7
1 1 5 8
2 1 5 9
Any idea? Thanks in advance
Use concat and ffill:
df = pd.concat([df1, df2, df3], axis=1).ffill()
You can use ffill() on your merged dataframe to fill in the blanks with the previous value:
df.ffill()
col1 col2 col3
0 1 5 7
1 1 5 8
2 1 5 9
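For reference, a self-contained sketch of this approach, assuming the three frames from the question:
import pandas as pd

df1 = pd.DataFrame({'col1': [1]})
df2 = pd.DataFrame({'col2': [5]})
df3 = pd.DataFrame({'col3': [7, 8, 9]})

# align on the row index, then forward-fill the shorter frames downward
df = pd.concat([df1, df2, df3], axis=1).ffill()
Note that ffill only works here because the shorter frames share row index 0 with the longer one, so there is a value at the top of each column to propagate.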

Duplicating each column of a csv and changing values of every column cell based on a condition in python

I am a new user to everything, especially Python and pandas. I have a .csv file with more than 1000 columns and around 250 rows. The row values are either 0 or 1, or empty cells. An example of the csv file is given below:
ID col1 col2 col3 col4 . . ............... col1000
1 1 0 1 1
2 0 1 1
3 1 0 0
.
.
.
.
250 0 1 0 0
There are two things that I want to do:
First, I want to duplicate all 1000 columns (except the ID column), keeping the same cell values and column names as the originals, and place each copied column next to its original in the following order:
col1 col1 col2 col2 col3 col3 col4 col4 ...... col1000 col1000
Second, I want to replace the values in the cells based on the following conditions:
If the original cell is 1, the value in the copied column should remain 1; if the original cell is 0, the value in the copied column should be changed to -1. If the original cell is empty, then both the original and the copied cell should be filled with 0.
The output csv file will be:
ID col1 col1 col2 col2 col3 col3 col4 col4 . ........... col1000 col1000
1 1 1 0 -1 0 0 1 1 1 1
2 0 -1 0 0 1 1 1 1 0 0
3 0 0 1 1 0 0 0 -1 0 -1
.
.
.
.
250 0 -1 1 1 0 0 0 -1 0 -1
I am not able to solve it and would really appreciate it if someone could help me out. Thanks...
You can try this to see if it works.
import pandas as pd
import numpy as np
Starting Data
df = pd.DataFrame({'col1':[1,0,np.nan,np.nan,1],'col2':[1,0,np.nan,np.nan,1],'col3':[1,0,np.nan,np.nan,1]})
First make a copy of the original df.
df_copy = df.copy()
Then replace the values in the copy based on criteria above.
columns = df_copy.columns
df_copy[columns] = np.where(df_copy[columns]==0,-1,df_copy[columns])
Then fill the blank values with 0.
df_copy = df_copy.fillna(0)
Add a column count for sorting.
df.loc['total'] = np.arange(len(df.columns))
df_copy.loc['total'] = np.arange(len(df_copy.columns))
Then concatenate the two df's together
new_df = pd.concat([df,df_copy],axis=1)
Sort the columns using the column count row, then drop the row from the new df
new_df = new_df.sort_values(by='total',axis=1)
new_df = new_df.loc[~new_df.index.isin(['total'])]
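One caveat on the sort step above: after the concat, the 'total' row holds tied values (0, 1, 2, 0, 1, 2), and the default quicksort does not guarantee the order of ties. To make sure each original column lands strictly before its copy, the sort can request a stable algorithm explicitly:
new_df = new_df.sort_values(by='total', axis=1, kind='stable')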
You can do the following steps; the trick is to use the column index to get the correct column sequence:
# create copied data and concat into original
df2 = pd.concat([df, df.replace(0,-1).fillna(0).drop('ID', axis=1)], axis=1)
# since column names are duplicated, we need to work with positional indices
cols = [x for x in df2.columns if x != 'ID']
cols = dict(enumerate(cols))
# get correct index for column names
cols_index = [x[0] for x in sorted(cols.items(), key=lambda x: x[1])]
# fix column names
idcol = df2[['ID']]
df2 = df2.drop('ID', axis=1).iloc[:,cols_index]
# add the ID column
df2 = pd.concat([idcol, df2], axis=1).fillna(0)
print(df2)
ID col1 col1 col2 col2 col3 col3
0 1 1.0 1.0 1.0 1.0 1.0 1.0
1 2 0.0 -1.0 0.0 -1.0 0.0 -1.0
2 3 0.0 0.0 0.0 0.0 0.0 0.0
3 4 0.0 0.0 0.0 0.0 0.0 0.0
4 5 1.0 1.0 1.0 1.0 1.0 1.0
Sample Data
df = pd.DataFrame({'ID': list(range(1,6)),
                   'col1':[1,0,np.nan,np.nan,1],
                   'col2':[1,0,np.nan,np.nan,1],
                   'col3':[1,0,np.nan,np.nan,1]})
You can use this (a similar approach to the other answer, using pandas built-in functions to replace):
df2 = df.copy().replace(0,-1).fillna(0).drop(['ID'], axis=1)
df = pd.concat([df.fillna(0), df2], axis=1)
Sample input and output (for a different input sample, but the comparison of the columns is apparent):
input:
ID col1 col2 col3 col4 col1000
0 1 1 0 1 1.0 NaN
1 2 0 1 1 NaN NaN
2 3 1 0 0 NaN NaN
250 250 0 1 0 0.0 NaN
output:
ID col1 col2 col3 col4 ... col1000 col1 col2 col3 col4 ... col1000
0 1 1 0 1 1.0 0.0 1 -1 1 1.0 0.0
1 2 0 1 1 0.0 0.0 -1 1 1 0.0 0.0
2 3 1 0 0 0.0 0.0 1 -1 -1 0.0 0.0
...
250 250 0 1 0 0.0 0.0 -1 1 -1 -1.0 0.0
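For completeness, here is a compact sketch that combines both answers' ideas and interleaves the columns with a stable sort on the duplicated names (the small ID/col sample is hypothetical):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'col1': [1, 0, np.nan],
                   'col2': [0, 1, np.nan]})

data = df.drop(columns='ID')
copied = data.replace(0, -1).fillna(0)             # 1 stays 1, 0 -> -1, empty -> 0
out = pd.concat([data.fillna(0), copied], axis=1)  # originals first, copies second
out = out.sort_index(axis=1, kind='stable')        # interleave col1, col1, col2, col2, ...
out.insert(0, 'ID', df['ID'])
The stable sort guarantees that, among columns sharing a name, the original keeps its place ahead of the copy.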

Group by within a groupby then averaging

Let's say I have a dataframe (I'll just use a simple example) that looks like this:
import pandas as pd
df = {'Col1':[3,4,2,6,5,7,3,4,9,7,1,3],
      'Col2':['B','B','B','B','A','A','A','A','C','C','C','C'],
      'Col3':[1,1,2,2,1,1,2,2,1,1,2,2]}
df = pd.DataFrame(df)
Which gives a dataframe like so:
Col1 Col2 Col3
0 3 B 1
1 4 B 1
2 2 B 2
3 6 B 2
4 5 A 1
5 7 A 1
6 3 A 2
7 4 A 2
8 9 C 1
9 7 C 1
10 1 C 2
11 3 C 2
What I want to do involves several steps:
1) For each unique value in Col2, and for each unique value in Col3, average Col1. So a desired output would be:
Avg Col2 Col3
1 3.5 B 1
2 4 B 2
3 6 A 1
4 3.5 A 2
5 8 C 1
6 2 C 2
2) Now, for each unique value in Col3, I want the highest average and the corresponding value in Col2. So
Best Avg Col2 Col3
1 8 C 1
2 4 B 2
My attempt has been using df.groupby(['Col3','Col2'], as_index = False).agg({'Col1':'mean'}).groupby(['Col3']).agg({'Col1':'max'})
This gives me the highest average for each Col3 value, but not the corresponding Col2 label. Thank you for any help you can give!
After your first groupby, do sort_values + drop_duplicates:
g1=df.groupby(['Col3','Col2'], as_index = False).agg({'Col1':'mean'})
g1.sort_values('Col1').drop_duplicates('Col3',keep='last')
Out[569]:
Col3 Col2 Col1
4 2 B 4.0
2 1 C 8.0
Or, in case you have duplicated max values of the mean:
g1[g1.Col1==g1.groupby('Col3').Col1.transform('max')]
Do the following (I modified your code slightly, to make it a bit shorter):
df2 = df.groupby(['Col3','Col2'], as_index = False).mean()
When you print the result, for your input, you will get:
Col3 Col2 Col1
0 1 A 6.0
1 1 B 3.5
2 1 C 8.0
3 2 A 3.5
4 2 B 4.0
5 2 C 2.0
Then run:
res = df2.iloc[df2.groupby('Col3').Col1.idxmax()]
When you print the result, you will get:
Col3 Col2 Col1
2 1 C 8.0
4 2 B 4.0
As you can see, idxmax gives the index of the row with the maximal element (for each group), and this result can be used as the argument of iloc.
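One caveat worth adding: idxmax returns index labels, not positions. Passing them to iloc only works here because df2 has a default RangeIndex; with any other index, loc is the safe choice:
# idxmax yields labels, so loc (label-based) is the robust variant
res = df2.loc[df2.groupby('Col3').Col1.idxmax()]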

Set dataframe column using values from matching indices in another dataframe

I would like to set values in col2 of DF1 using the value held at the matching index of col2 in DF2:
DF1:
col1 col2
index
0 a
1 b
2 c
3 d
4 e
5 f
DF2:
col1 col2
index
2 a x
3 d y
5 f z
DF3:
col1 col2
index
0 a NaN
1 b NaN
2 c x
3 d y
4 e NaN
5 f z
If I just try to set DF1['col2'] = DF2['col2'] then col2 comes out as all NaN values in DF3 - I take it this is because the indices are different. However, when I try to use map() to do something like:
DF1.index.to_series().map(DF2['col2'])
then I still get the same NaN column, but I thought it would map the values over where the index matches...
What am I not getting?
You need join or assign:
df = df1.join(df2['col2'])
print (df)
col1 col2
index
0 a NaN
1 b NaN
2 c x
3 d y
4 e NaN
5 f z
Or:
df1 = df1.assign(col2=df2['col2'])
#same like
#df1['col2'] = df2['col2']
print (df1)
col1 col2
index
0 a NaN
1 b NaN
2 c x
3 d y
4 e NaN
5 f z
If nothing matches and all values are NaN, check whether the indices have the same dtype in both dataframes:
print (df1.index.dtype)
print (df2.index.dtype)
If not, then use astype:
df1.index = df1.index.astype(int)
df2.index = df2.index.astype(int)
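A minimal repro of this dtype pitfall, assuming DF2's index was read in as strings (e.g. from a CSV):
df1 = pd.DataFrame({'col1': list('abcdef')}, index=range(6))
df2 = pd.DataFrame({'col1': ['a','d','f'], 'col2': ['x','y','z']},
                   index=['2','3','5'])   # string labels
df1['col2'] = df2['col2']                 # all NaN: label '2' does not match 2
df2.index = df2.index.astype(int)
df1['col2'] = df2['col2']                 # now aligns at 2, 3 and 5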
Bad solution (check index 2, where col1 from df1 is overwritten):
df = df2.combine_first(df1)
print (df)
col1 col2
index
0 a NaN
1 b NaN
2 a x
3 d y
4 e NaN
5 f z
You can simply concat, as you are combining based on the index:
df = pd.concat([df1['col1'], df2['col2']], axis=1)
col1 col2
index
0 a NaN
1 b NaN
2 c x
3 d y
4 e NaN
5 f z

replace value based on other dataframe

There are two dataframes with the same columns and index, and the columns are in the same order. I call them tableA and tableB.
tableA = pd.DataFrame({'col1':[np.NaN,1,2],'col2':[2,3,np.NaN]})
tableB = pd.DataFrame({'col1':[2,4,2],'col2':[2,3,5]})
tableA           tableB
  col1 col2        col1 col2
0  NaN    2      0    2    2
1    1    3      1    4    3
2    2  NaN      2    2    5
I want to replace values in tableB with 'NA' wherever the value at the same position in tableA is NaN.
For now, I use a loop to do it column by column.
for n in range(tableB.shape[1]):
    tableB.iloc[:,n] = tableB.iloc[:,n].where(pd.isnull(tableA.iloc[:,n])==False, 'NA')
tableB
col1 col2
0 NA 2
1 4 5
2 2 NA
Is there another way to do it without using a loop? I have tried replace but it only changes the first column.
tableB.replace(pd.isnull(tableA), 'NA', inplace=True)  # only adjusts the first column
Thanks for your help!
I think you need where or numpy.where:
1.
df = tableB.where(tableA.notnull())
print (df)
col1 col2
0 NaN 2.0
1 4.0 3.0
2 2.0 NaN
2.
df = pd.DataFrame(np.where(tableA.notnull(), tableB, np.nan),
                  columns=tableB.columns,
                  index=tableB.index)
print (df)
col1 col2
0 NaN 2.0
1 4.0 3.0
2 2.0 NaN
You could use mask:
In [7]: tableB.mask(tableA.isnull())
Out[7]:
col1 col2
0 NaN 2.0
1 4.0 3.0
2 2.0 NaN
Or assign directly with a boolean mask:
tableB[tableA.isnull()] = np.nan
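As a quick sanity check, using the question's tableA and tableB: mask replaces values where the condition is True, while where replaces them where it is False, so the two calls produce the same frame.
a = tableB.where(tableA.notnull())   # keep where tableA is non-null
b = tableB.mask(tableA.isnull())     # replace where tableA is null
assert a.equals(b)                   # identical results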
