Equivalent of R rbind.fill in Python Pandas - python

R's plyr package has rbind.fill(), which appends data frames with unequal numbers of columns.
Is there a similar function for a pandas DataFrame in Python?

You are looking for the function concat:
import pandas as pd
df1 = pd.DataFrame({'col1':['a','b'],'col2':[33,44]})
df2 = pd.DataFrame({'col3':['dog'],'col2':[32], 'col4':[1]})
In [8]: pd.concat([df1, df2])
Out[8]:
  col1  col2 col3  col4
0    a    33  NaN   NaN
1    b    44  NaN   NaN
0  NaN    32  dog     1
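As a minimal runnable sketch of the same call (with ignore_index=True added here so the row labels are not duplicated as in the output above):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['a', 'b'], 'col2': [33, 44]})
df2 = pd.DataFrame({'col3': ['dog'], 'col2': [32], 'col4': [1]})

# concat takes the union of columns; cells with no source value become NaN
out = pd.concat([df1, df2], ignore_index=True)
print(out)
```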

Related

Does NaN interfere with column concatenation in pandas?

I am attempting to merge two columns which contain strings and NaNs.
When I attempt to merge them I cannot deal with the NaN values.
df['col3'] = df['col1'] + df['col2']
returns only my col2 values.
df['col3'] = df['col1'].map(str) + df['col2'].map(str)
returns my NaNs attached to each other.
If I don't use .map(str) then the NaN values don't concatenate at all.
Is there a way to concatenate two dataframe columns so that if either of them is NaN it isn't concatenated, unless both are NaN, in which case I do want NaN returned?
Example:
df
col0 col1 col2 col3
X    A    nan  A
Y    nan  B    B
Z    nan  nan  nan
W    ''   B    B
You could select the last two columns and forward-fill along the rows:
df['col3'] = df[['col1', 'col2']].ffill(axis=1)['col2']
  col0 col1 col2 col3
0    X    A  NaN    A
1    Y  NaN    B    B
2    Z  NaN  NaN  NaN
3    W   ''    B    B
This is a job for fillna:
df['col3'] = df.col2.fillna(df.col1)
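A runnable sketch of this one-liner on the question's frame; note the '' in col1 is an empty string, not NaN, so fillna treats it as a valid value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col0': ['X', 'Y', 'Z', 'W'],
                   'col1': ['A', np.nan, np.nan, ''],
                   'col2': [np.nan, 'B', np.nan, 'B']})

# coalesce: take col2, falling back to col1 where col2 is NaN
df['col3'] = df['col2'].fillna(df['col1'])
print(df)
```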
Apply np.where; in case both values exist, combine them:
import numpy as np
import pandas as pd
df = pd.DataFrame(data={"col1": ["A", np.nan, "B", np.nan, "C"],
                        "col2": [np.nan, "B", np.nan, np.nan, "d"]})
df['col3'] = np.where(df['col1'].isnull(), df['col2'],
                      np.where(df['col2'].isnull(), df['col1'],
                               df['col1'] + df['col2']))
  col1 col2 col3
0    A  NaN    A
1  NaN    B    B
2    B  NaN    B
3  NaN  NaN  NaN
4    C    d   Cd
fillna() and replace() are what you're looking for; here's a fully working example:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'col1': ["A", "B", "C", np.nan],
    'col2': ["D", "E", np.nan, np.nan]
})
df['col3']= df['col1'].fillna('') + df['col2'].fillna('')
df['col3'] = df['col3'].replace('', np.nan)
print(df)
It first replaces NaN values with an empty string, and then, where both were empty, replaces the result with NaN again.
Output:
  col1 col2 col3
0    A    D   AD
1    B    E   BE
2    C  NaN    C
3  NaN  NaN  NaN

Add row values of all columns when a particular column value is null until it gets the not null values?

I have a data frame like this:
df
col1 col2 col3 col4
A    12   34   XX
B    20   25   PP
B    nan  nan  nan
nan  P    54   nan
nan  R    nan  nan
nan  nan  nan  PQ
C    D    32   SS
R    S    32   RS
If a col1 value is null, I want to combine the values of the other columns with the preceding rows until a non-null element is found in col1.
The data frame i am looking for should look like:
col1 col2 col3 col4
A    12   34   XX
B    20   25   PP
B    PR   54   PQ
C    D    32   SS
R    S    32   RS
How can I do this most efficiently using python/pandas?
If you want to process all columns as strings: first forward-fill the missing values in col1, replace the remaining NaNs with empty strings, convert all values to strings, and concatenate per group:
df['col1'] = df['col1'].ffill()
df = df.set_index('col1').fillna('').astype(str).groupby(level=0).sum().reset_index()
print (df)
  col1 col2  col3 col4
0    A   12  34.0   XX
1    B   PR  54.0   PQ
2    C    D  32.0   SS
print (df.dtypes)
col1    object
col2    object
col3    object
col4    object
dtype: object
If you need to process the numeric columns with an aggregate method such as mean, use a lambda function with an if-else:
df['col1'] = df['col1'].ffill()
c = df.select_dtypes(object).columns
df[c] = df[c].fillna('')
f = lambda x: x.mean() if np.issubdtype(x.dtype, np.number) else ''.join(x)
df = df.groupby('col1').agg(f).reset_index()
print (df)
  col1 col2  col3 col4
0    A   12  34.0   XX
1    B   PR  54.0   PQ
2    C    D  32.0   SS
print (df.dtypes)
col1     object
col2     object
col3    float64
col4     object
dtype: object
EDIT: If the same col1 value can start several separate blocks, use a new helper column so those blocks are not merged:
df['new'] = df['col1'].notna().cumsum()
df['col1'] = df['col1'].ffill()
c = df.select_dtypes(object).columns
df[c] = df[c].fillna('')
f = lambda x: x.mean() if np.issubdtype(x.dtype, np.number) else ''.join(x)
df = df.groupby(['col1', 'new']).agg(f).reset_index(level=1, drop=True).reset_index()
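A runnable sketch of this helper-column variant, on made-up two-column data where the same col1 label starts two separate blocks (grouping on col1 alone would wrongly merge them):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['A', np.nan, 'A', np.nan],
                   'col2': ['x', 'y', 'p', 'q']})

# every non-null col1 value starts a new block
df['new'] = df['col1'].notna().cumsum()
df['col1'] = df['col1'].ffill()
out = (df.groupby(['col1', 'new'])['col2']
         .agg(''.join)
         .reset_index(level=1, drop=True)
         .reset_index())
print(out)
```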

Loading dictionary values into the row values of a dataframe in pandas

I have a python dictionary
{1: 'cat',
 2: 'dog',
 3: 'sheep',
 4: 'foo',
 5: 'bar',
 6: 'fish',
 7: 'lion',
 8: 'shark',
 9: 'zebra',
 10: 'snake'}
Also I have pandas dataframe as following
df:
ID     col1  col2  col2  col4
18938  1     Nan   5     Nan
17839  Nan   2     Nan   8
72902  3     5     9     Nan
78298  7     Nan   Nan   6
Now I am trying to replace or map the values of each cell in each column using the dictionary, and then concatenate all the column values into a new column.
The new df should look like:
ID     col1   col2  col2   col4   new_col
18938  cat    Nan   bar    Nan    cat|bar
17839  Nan    dog   Nan    shark  dog|shark
72902  sheep  bar   zebra  Nan    sheep|bar|zebra
78298  lion   Nan   Nan    fish   lion|fish
I am trying to achieve the 2nd step, concatenating all the columns, using the code
df['new_col'] = df.drop('ID', axis=1).agg(lambda x: '|'.join(x.dropna().astype(str).values), axis=1)
but I am unable to get the first step working.
I used
df = df.columns.map(dict)
but it is not giving me the expected answer.
You could try this:
import numpy as np
df = df.set_index('ID')
d1 = pd.concat([df[i].replace('Nan', np.nan).dropna().astype(int).map(d) for i in df.columns], axis=1)
d1['new_col'] = d1.apply(lambda x: '|'.join(x.dropna()), axis=1)
print(d1)
Or if you want slightly slower but more concise code:
d1 = df.apply(lambda x: x.replace('Nan', np.nan).dropna().astype(int).map(d))
d1['new_col'] = d1.apply(lambda x: '|'.join(x.dropna()), axis=1)
d1
Output:
        col1 col2 col2.1   col4          new_col
ID
17839    NaN  dog    NaN  shark        dog|shark
18938    cat  NaN    bar    NaN          cat|bar
72902  sheep  bar  zebra    NaN  sheep|bar|zebra
78298   lion  NaN    NaN   fish        lion|fish
Use df.replace() with your dictionary (here named d, since dict shadows the built-in):
df = df.replace(d)
Note that if the keys in your dictionary are strings you may need regex=True:
df = df.replace(d, regex=True)
Example:
import pandas as pd
d = {1: "cat",
     2: "dog",
     3: "sheep",
     4: "foo",
     5: "bar",
     6: "fish",
     7: "lion",
     8: "shark",
     9: "zebra",
     10: "snake"}
df = pd.DataFrame({'ID': [123, 456], 'col1': [1, 2], 'col2': [5, 6]})
df = df.replace(d)
print(df)
Output:
    ID col1 col2
0  123  cat  bar
1  456  dog fish
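Putting both of the question's steps together in one runnable sketch (the dictionary is trimmed to the keys used; column names follow the question):

```python
import numpy as np
import pandas as pd

d = {1: 'cat', 2: 'dog', 5: 'bar', 8: 'shark'}

df = pd.DataFrame({'ID': [18938, 17839],
                   'col1': [1, np.nan],
                   'col2': [np.nan, 2],
                   'col3': [5, np.nan],
                   'col4': [np.nan, 8]})

# step 1: map the codes through the dictionary
mapped = df.drop(columns='ID').replace(d)
# step 2: join the non-null names row-wise
df['new_col'] = mapped.agg(lambda x: '|'.join(x.dropna().astype(str)), axis=1)
print(df[['ID', 'new_col']])
```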

Pandas Combining two rows into one [duplicate]

Given the following dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'COL1': ['A', np.nan, 'A'],
                   'COL2': [np.nan, 'A', 'A']})
df
  COL1 COL2
0    A  NaN
1  NaN    A
2    A    A
I would like to create a column ('COL3') that uses the value from COL1 per row unless that value is null (or NaN). If the value is null (or NaN), I'd like for it to use the value from COL2.
The desired result is:
  COL1 COL2 COL3
0    A  NaN    A
1  NaN    A    A
2    A    A    A
Thanks in advance!
In [8]: df
Out[8]:
  COL1 COL2
0    A  NaN
1  NaN    B
2    A    B

In [9]: df["COL3"] = df["COL1"].fillna(df["COL2"])

In [10]: df
Out[10]:
  COL1 COL2 COL3
0    A  NaN    A
1  NaN    B    B
2    A    B    A
You can use np.where to conditionally set column values.
df = df.assign(COL3=np.where(df.COL1.isnull(), df.COL2, df.COL1))
>>> df
  COL1 COL2 COL3
0    A  NaN    A
1  NaN    A    A
2    A    A    A
If you don't mind mutating the values in COL2, you can update them directly to get your desired result.
df = pd.DataFrame({'COL1': ['A', np.nan, 'A'],
                   'COL2': [np.nan, 'B', 'B']})
>>> df
  COL1 COL2
0    A  NaN
1  NaN    B
2    A    B
df.COL2.update(df.COL1)
>>> df
  COL1 COL2
0    A    A
1  NaN    B
2    A    A
Using .combine_first, which gives precedence to non-null values in the Series or DataFrame calling it:
import pandas as pd
import numpy as np
df = pd.DataFrame({'COL1': ['A', np.nan, 'A'],
                   'COL2': [np.nan, 'B', 'B']})
df['COL3'] = df.COL1.combine_first(df.COL2)
Output:
  COL1 COL2 COL3
0    A  NaN    A
1  NaN    B    B
2    A    B    A
If we modify your df slightly, you will see that this works; in fact it will work for any number of columns, so long as each row has at least one valid value:
In [5]:
df = pd.DataFrame({'COL1': ['B', np.nan, 'B'],
                   'COL2': [np.nan, 'A', 'A']})
df
Out[5]:
  COL1 COL2
0    B  NaN
1  NaN    A
2    B    A
In [6]:
df.apply(lambda x: x[x.first_valid_index()], axis=1)
Out[6]:
0    B
1    A
2    B
dtype: object
first_valid_index will return the index value (in this case column) that contains the first non-NaN value:
In [7]:
df.apply(lambda x: x.first_valid_index(), axis=1)
Out[7]:
0    COL1
1    COL2
2    COL1
dtype: object
So we can use this to index into the Series.
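The same "first valid value per row" idea can also be written without apply, as a row-wise backfill (a sketch, assuming string columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'COL1': ['B', np.nan, 'B'],
                   'COL2': [np.nan, 'A', 'A']})

# backfill along each row, then take the first column:
# it now holds the first non-null value of that row
df['COL3'] = df.bfill(axis=1).iloc[:, 0]
print(df)
```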
You can also use mask, which replaces the values where COL1 is NaN with the values from COL2:
In [8]: df.assign(COL3=df['COL1'].mask(df['COL1'].isna(), df['COL2']))
Out[8]:
  COL1 COL2 COL3
0    A  NaN    A
1  NaN    A    A
2    A    A    A

How to repeat the index in a pandas pivot table so that NaN fills record-less rows?

Pardon my poorly phrased question--I'm not sure how to word it.
Given this pandas pivot table,
df = pd.DataFrame({'col1': list('AABB'),
                   'col2': list('acab'),
                   'values': [1, 3, 4, 5]})
pt = pd.pivot_table(df,
                    index=['col1', 'col2'],
                    values='values',
                    aggfunc='sum')
Output:
           values
col1 col2
A    a          1
     c          3
B    a          4
     b          5
How can I make the pivot table output this instead:
           values
col1 col2
A    a          1
     b        NaN
     c          3
B    a          4
     b          5
     c        NaN
If you convert your column to the category data type (new in pandas 0.15!) you will get the aggregation that you are after:
df.col2 = df.col2.astype('category')
In [378]: df.groupby(['col1','col2']).sum()
Out[378]:
           values
col1 col2
A    a          1
     b        NaN
     c          3
B    a          4
     b          5
     c        NaN
