Pandas create a data frame based on two other 'sub' frames - python

I have two Pandas data frames. df1 has columns ['a','b','c'] and df2 has columns ['a','c','d']. Now, I create a new data frame df3 with columns ['a','b','c','d'].
I want to fill df3 with all the inputs from df1 and df2. For example, if I have x rows in df1, and y rows in df2, then I will have x+y rows in df3.
Which Pandas function fills the new dataframe based on partial columns?

Example data:
df1 = pd.DataFrame({'a':[1, 2, 3], 'b':[2, 3, 4], 'd':['h', 'j', 'k']})
df2 = pd.DataFrame({'a':[5, 6, 7], 'b':[1, 1, 1], 'c':[2, 2, 2]})
Code:
df1.append(df2)
Out:
a b c d
0 1 2 NaN h
1 2 3 NaN j
2 3 4 NaN k
0 5 1 2.0 NaN
1 6 1 2.0 NaN
2 7 1 2.0 NaN

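Use the DataFrame's append function (only available before pandas 2.0): https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html
anotherFrame = df1.append(df2, ignore_index=True)
Another way is merge: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html
Note that an outer merge joins on the shared columns, so rows that match on all of them are combined into one row rather than stacked.
df1.merge(df2, how='outer')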

How about:
df1 = pd.DataFrame({"a": [1,2], "b": [3,4], "c": [5,6]})
df2 = pd.DataFrame({"a": [7,8], "c": [9,10], "d": [11,12]})
df3 = df1.append(df2, sort=False)
df3
a b c d
0 1 3.0 5 NaN
1 2 4.0 6 NaN
0 7 NaN 9 11.0
1 8 NaN 10 12.0
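For pandas 2.0 and later, where DataFrame.append no longer exists, the same row-wise stacking can be written with pd.concat. This is a minimal sketch reusing the frames from the last answer; ignore_index=True is my own choice here, to get a fresh 0..n-1 index instead of the repeated labels shown above.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
df2 = pd.DataFrame({"a": [7, 8], "c": [9, 10], "d": [11, 12]})

# Stack the rows of both frames; a column missing from one frame is filled with NaN.
# sort=False keeps the columns in order of first appearance (a, b, c, d).
df3 = pd.concat([df1, df2], ignore_index=True, sort=False)
print(df3)
```

The result has x+y rows and the union of the columns, which is what the question asks for.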

Related

replace with multiple conditions not updating in pandas

I am trying to replace a value based on the row index, and for only certain columns in a dataframe.
For columns b and c, I want to replace the value 1 with np.nan, for rows 1, 2 and 3.
df = pd.DataFrame(data={'a': ['"dog", "cat"', '"dog"', '"mouse"', '"mouse", "cat", "bird"', '"circle", "square"', '"circle"', '"triangle", "square"', '"circle"'],
'b': [1,1,3,4,5,1,2,3],
'c': [3,4,1,3,2,1,0,0],
'd': ['a','a','b','c','b','c','d','e'],
'id': ['group1','group1','group1','group1', 'group2','group2','group2','group2']})
I am using the following line, but it's not updating in place, and if I try assigning it, it returns only the subset of amended rows rather than an updated version of the original dataframe.
df[df.index.isin([1,2,3])][['b','c']].replace(1, np.nan, inplace=True)
You could do it like this:
df.loc[1:3, ['b', 'c']] = df.loc[1:3, ['b', 'c']].replace(1, np.nan)
Output:
>>> df
a b c d id
0 "dog", "cat" 1.0 3.0 a group1
1 "dog" NaN 4.0 a group1
2 "mouse" 3.0 NaN b group1
3 "mouse", "cat", "bird" 4.0 3.0 c group1
4 "circle", "square" 5.0 2.0 b group2
5 "circle" 1.0 1.0 c group2
6 "triangle", "square" 2.0 0.0 d group2
7 "circle" 3.0 0.0 e group2
A more dynamic version:
cols = ['b', 'c']
rows = slice(1, 3) # or [1, 2, 3] if you want
df.loc[rows, cols] = df.loc[rows, cols].replace(1, np.nan)
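The original attempt has no effect because df[mask][cols] builds a temporary copy, so the in-place replace never reaches df. Below is a small sketch of the .loc approach on toy data (the frame is made up for illustration, with float columns so that NaN fits without a dtype change). Note that .loc label slices include both endpoints, so slice(1, 3) covers rows 1, 2 and 3.

```python
import numpy as np
import pandas as pd

# Toy frame, hypothetical data; floats so NaN can be stored without upcasting
df = pd.DataFrame({"b": [1.0, 1.0, 3.0, 4.0, 1.0],
                   "c": [3.0, 4.0, 1.0, 3.0, 1.0]})

rows, cols = slice(1, 3), ["b", "c"]
# Run replace on the selected block, then write the result back through .loc
df.loc[rows, cols] = df.loc[rows, cols].replace(1, np.nan)
print(df)
```

Rows 0 and 4 keep their 1s because they fall outside the selected rows.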

Pandas new dataframe that has sum of columns from another

I'm struggling to figure out how to do a couple of transformations with pandas. I want a new dataframe with the sum of the values from the columns in the original. I also want to be able to merge two of these 'summed' dataframes.
Example #1: Summing the columns
Before:
A B C D
1 4 7 0
2 5 8 1
3 6 9 2
After:
A B C D
6 15 24 3
Right now I'm getting the sums of the columns I'm interested in, storing them in a dictionary, and creating a dataframe from the dictionary. I feel like there is a better way to do this with pandas that I'm not seeing.
Example #2: merging 'summed' dataframes
Before:
A B C D F
6 15 24 3 1
A B C D E
1 2 3 4 2
After:
A B C D E F
7 17 27 7 2 1
First question:
Summing the columns
Use sum, then convert the resulting Series to a DataFrame and transpose it:
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
'C': [7, 8, 9], 'D': [0, 1, 2]})
df1 = df1.sum().to_frame().T
print(df1)
# Output:
A B C D
0 6 15 24 3
Second question:
Merging 'summed' dataframes
Use combine
df2 = pd.DataFrame({'A': [1], 'B': [2], 'C': [3], 'D': [4], 'E': [2]})
out = df1.combine(df2, sum, fill_value=0)
print(out)
# Output:
A B C D E
0 7 17 27 7 2
First part, use DataFrame.sum() to sum the columns then convert Series to dataframe by .to_frame() and finally transpose:
df_sum = df.sum().to_frame().T
Result:
print(df_sum)
A B C D
0 6 15 24 3
Second part, use DataFrame.add() with parameter fill_value, as follows:
df_sum2 = df1.add(df2, fill_value=0)
Result:
print(df_sum2)
A B C D E F
0 7 17 27 7 2.0 1.0
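Both parts can be put together on the question's data as a self-contained sketch. The names raw, summed, left, right and total are mine; the cast back to int at the end is optional and assumes no NaN remains after the addition.

```python
import pandas as pd

raw = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6],
                    "C": [7, 8, 9], "D": [0, 1, 2]})

# Part 1: column sums as a one-row DataFrame
summed = raw.sum().to_frame().T

# Part 2: add two one-row frames whose columns only partly overlap.
# fill_value=0 treats a column missing from one frame as 0 instead of NaN.
left = pd.DataFrame({"A": [6], "B": [15], "C": [24], "D": [3], "F": [1]})
right = pd.DataFrame({"A": [1], "B": [2], "C": [3], "D": [4], "E": [2]})
total = left.add(right, fill_value=0).astype(int)
print(summed)
print(total)
```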

Merging a dataframe with another dataframe with constant values from the first dataframe

I would like to merge two data frames, df1 and df2:
import pandas as pd
df1 = pd.DataFrame({
'A': ['a'],
'B': ['b'],
'C': ['c']
})
df2 = pd.DataFrame({
'W': [1, 2, 3],
'X': [4, 5, 6],
'Y': [7, 8, 9],
'Z': [10, 11, 12]
})
df1: (will always have only one row)
df2: (can have any number of rows)
I want to merge them so that all the columns of df1 are added to df2, with every row holding the same values from df1.
I have tried:
df3 = pd.concat([df1,df2], sort=False, axis=1)
But this gives me NaNs:
But I want all the rows to have the same constant values that are present in df1, like:
I would also like the new columns from df1 to come before the columns of df2, as above.
What is the most efficient way to achieve this?
We can do a cross join by merging on an artificially created key:
df1.assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1)
A B C W X Y Z
0 a b c 1 4 7 10
1 a b c 2 5 8 11
2 a b c 3 6 9 12
Use DataFrame.assign with the first row of df1, then restore the column order with DataFrame.reindex:
df3 = df2.assign(**df1.iloc[0]).reindex(df1.columns.union(df2.columns, sort=False),axis=1)
print (df3)
A B C W X Y Z
0 a b c 1 4 7 10
1 a b c 2 5 8 11
2 a b c 3 6 9 12
Or reindex df1 by df2.index with method='ffill' to repeat its single row, then concatenate:
df3 = pd.concat([df1.reindex(df2.index, method='ffill'),df2], sort=False, axis=1)
print (df3)
A B C W X Y Z
0 a b c 1 4 7 10
1 a b c 2 5 8 11
2 a b c 3 6 9 12
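On pandas 1.2 or later the artificial key is not needed, because merge accepts how='cross', which produces the Cartesian product directly. A sketch using the frames from the question:

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["a"], "B": ["b"], "C": ["c"]})
df2 = pd.DataFrame({"W": [1, 2, 3], "X": [4, 5, 6],
                    "Y": [7, 8, 9], "Z": [10, 11, 12]})

# Pair every row of df1 (just one here) with every row of df2.
# The columns of df1 come first because df1 is the left frame.
df3 = df1.merge(df2, how="cross")
print(df3)
```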

Python: How to replace missing values column wise by median

I have a dataframe as follows
df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.45, 2.33, np.nan], 'C': [4, 5, 6], 'D': [4.55, 7.36, np.nan]})
I want to replace the missing values, i.e. np.nan, in a generic way. For this I have created a function as follows:
def treat_mis_value_nu(df):
    df_nu = df.select_dtypes(include=['number'])
    lst_null_col = df_nu.columns[df_nu.isnull().any()].tolist()
    if len(lst_null_col)>0:
        for i in lst_null_col:
            if df_nu[i].isnull().sum()/len(df_nu[i])>0.10:
                df_final_nu = df_nu.drop([i],axis=1)
            else:
                df_final_nu = df_nu[i].fillna(df_nu[i].median(),inplace=True)
    return df_final_nu
When I apply this function as follows
df_final = treat_mis_value_nu(df)
I am getting a dataframe as follows
A B C
0 1 1.0 4
1 2 2.0 5
2 3 NaN 6
So it has actually removed column D correctly, but failed to remove column B.
I know there have been discussions on this topic in the past (here). Am I still missing something?
Use:
df = pd.DataFrame({'A': [1, 2, 3,5,7], 'B': [1.45, 2.33, np.nan, np.nan, np.nan],
'C': [4, 5, 6,8,7], 'D': [4.55, 7.36, np.nan,9,10],
'E':list('abcde')})
print (df)
A B C D E
0 1 1.45 4 4.55 a
1 2 2.33 5 7.36 b
2 3 NaN 6 NaN c
3 5 NaN 8 9.00 d
4 7 NaN 7 10.00 e
def treat_mis_value_nu(df):
    # keep only the numeric columns
    df_nu = df.select_dtypes(include=['number'])
    # keep only the columns that contain NaNs
    df_nu = df_nu.loc[:, df_nu.isnull().any()]
    # columns to remove: share of NaNs above the threshold (mean of the mask is the same as sum/len)
    cols_to_drop = df_nu.columns[df_nu.isnull().mean() > 0.30]
    # fill missing values with the column medians, then drop the columns above the threshold
    return df.fillna(df_nu.median()).drop(cols_to_drop, axis=1)
print (treat_mis_value_nu(df))
A C D E
0 1 4 4.55 a
1 2 5 7.36 b
2 3 6 8.18 c
3 5 8 9.00 d
4 7 7 10.00 e
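A variant of the same idea with the threshold exposed as a parameter, as a hedged sketch: the function name fill_or_drop and the parameter max_missing are hypothetical, not from the answer. It drops numeric columns whose share of missing values exceeds the threshold and fills the remaining gaps with column medians.

```python
import numpy as np
import pandas as pd

def fill_or_drop(df, max_missing=0.30):
    """Drop numeric columns with too many NaNs, median-fill the rest."""
    numeric = df.select_dtypes(include="number")
    # Share of missing values per numeric column
    missing_share = numeric.isna().mean()
    cols_to_drop = missing_share[missing_share > max_missing].index
    kept = df.drop(columns=cols_to_drop)
    # numeric_only=True skips non-numeric columns when computing medians
    return kept.fillna(kept.median(numeric_only=True))

df = pd.DataFrame({"A": [1, 2, 3, 5, 7],
                   "B": [1.45, 2.33, np.nan, np.nan, np.nan],
                   "C": [4, 5, 6, 8, 7],
                   "D": [4.55, 7.36, np.nan, 9, 10],
                   "E": list("abcde")})
out = fill_or_drop(df)
print(out)
```

With the default threshold, B (60% missing) is dropped and the single gap in D (20% missing) is filled with the median of the remaining values.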
I would recommend looking at scikit-learn's SimpleImputer transformer (called Imputer, in sklearn.preprocessing, in older releases). I don't think it can drop columns, but it can definitely fill them in a 'generic way', for example by filling in missing values with the median of the relevant column.
You could use it like this:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
num_df = df.values
names = df.columns.values
# fit_transform learns the column medians and fills the NaNs in one step
df_final = pd.DataFrame(imputer.fit_transform(num_df), columns=names)
If you have additional transformations you would like to make you could consider making a transformation Pipeline or could even make your own transformers to do bespoke tasks.

join 2 pandas dataframes with different number of columns

Consider we have 2 dataframes:
df = pd.DataFrame(columns = ['a','b','c']) ##empty
d = {'a': [1, 2], 'b': [3, 4]}
df1 = pd.DataFrame(data=d)
How can I join them so that the result is this:
a b c
1 3 NaN
2 4 NaN
Use reindex by columns from df:
df = pd.DataFrame(columns = ['a','b','c'])
d = {'a': [1, 2], 'b': [3, 4]}
df1 = pd.DataFrame(data=d).reindex(columns=df.columns)
print (df1)
a b c
0 1 3 NaN
1 2 4 NaN
Difference between the solutions: if the columns are not sorted, you get different output:
#different order
df = pd.DataFrame(columns = ['c','a','b'])
d = {'a': [1, 2], 'b': [3, 4]}
df1 = pd.DataFrame(data=d)
print (df1.reindex(columns=df.columns))
c a b
0 NaN 1 3
1 NaN 2 4
print (df1.merge(df,how='left'))
a b c
0 1 3 NaN
1 2 4 NaN
How can I join them
If you already have the dataframe somewhere (rather than creating a new one), do:
df1.merge(df,how='left')
a b c
0 1 3 NaN
1 2 4 NaN
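Note: the result keeps df1's column order, followed by the columns found only in df, rather than following the column order of df. If df's columns are already in that order (as when they are sorted here), this works fine; otherwise it does not.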
