Is there any way to join a Series to a DataFrame directly?
The join would be on a field of the dataframe and on the index of the series.
The only way I found was to convert the series to a dataframe first, as in the code below.
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['a'] = np.arange(0, 4)
df['b'] = np.arange(100, 104)
s = pd.Series(data=np.arange(100, 103))
# this doesn't work
# myjoin = pd.merge(df, s, how='left', left_on='a', right_index=True)
# this does
s = s.reset_index()
# s becomes a DataFrame
# note you cannot reset the index of a series inplace
myjoin = pd.merge(df, s, how='left', left_on='a', right_on='index')
print(myjoin)
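For what it's worth, DataFrame.join seems to offer a more direct route: it accepts a Series as long as the Series has a name, and joins a column of the frame against the Series' index. A sketch using the same data (before the reset_index above):
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': np.arange(0, 4), 'b': np.arange(100, 104)})
s = pd.Series(data=np.arange(100, 103))
# values of column 'a' are looked up in the Series' index;
# the Series' name becomes the new column
myjoin = df.join(s.rename('c'), on='a')
print(myjoin)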
I think pandas.concat (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html) might help.
It supports inner/outer joins, for example.
pd.concat((df,s), axis=1)
Out[26]:
a b 0
0 0 100 100
1 1 101 101
2 2 102 102
3 3 103 NaN
In [27]: pd.concat((df,s), axis=1, join='inner')
Out[27]:
a b 0
0 0 100 100
1 1 101 101
2 2 102 102
This is a very late answer, but what worked for me was the following: build a DataFrame whose columns are the values you want to retrieve from your Series, give each Series the name you want as its index label, and append the Series to that DataFrame (if a Series has extra elements, they are added to the DataFrame as new columns, which can be convenient in some applications).
Then join this final DataFrame, on that index, to the original DataFrame you want to expand (see the sketch below). Agreed, it is not direct, but it is still the most convenient way if you have a lot of Series, instead of converting each one to a DataFrame first.
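A minimal sketch of that workflow, with made-up column and key names:
import pandas as pd
df = pd.DataFrame({'key': ['x', 'y'], 'val': [1, 2]})
# each Series is named after the key it should join on
s1 = pd.Series({'extra1': 10, 'extra2': 20}, name='x')
s2 = pd.Series({'extra1': 30, 'extra2': 40}, name='y')
# collect the Series into one lookup frame; the Series name becomes the index label
lookup = pd.DataFrame(columns=['extra1', 'extra2'])
for s in (s1, s2):
    lookup.loc[s.name] = s
# join the lookup frame to the original DataFrame on the 'key' column
result = df.join(lookup, on='key')
print(result)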
Try concat():
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['a'] = np.arange(0, 4)
df['b'] = np.arange(100, 104)
s = pd.Series(data=np.arange(100, 103))
new_df = pd.concat((df, s), axis=1)
print(new_df)
This prints:
a b 0
0 0 100 100
1 1 101 101
2 2 102 102
3 3 103 NaN
I want to replace some rows of some columns in a bigger pandas df with data from a smaller pandas df. The column names are the same in both.
I tried using combine_first, but it only updates the null values.
For example, let's say df1.shape is (100, 25) and df2.shape is (10, 5):
df1
A B C D E F G ... X Y Z
1 abc 10.20 0 pd.NaT
df2
A B C D E
1 abc 15.20 1 10
Now, after replacing, df1 should look like:
A B C D E F G ... X Y Z
1 abc 15.20 1 10 ...
The condition for replacing values in df1 is where df1.A == df2.A and df1.B == df2.B.
How can it be achieved in the most pythonic way? Any help will be appreciated.
I'm not sure I fully understood your question, but does this solve your problem?
df1 = pd.DataFrame(data={'A':[1],'B':[2],'C':[3],'D':[4]})
df2 = pd.DataFrame(data={'A':[1],'B':[2],'C':[5],'D':[6]})
new_df=pd.concat([df1,df2]).drop_duplicates(['A','B'],keep='last')
print(new_df)
output:
A B C D
0 1 2 5 6
You could play with a MultiIndex.
First, let us create the DataFrames you are working with:
import numpy as np
import pandas as pd
from string import ascii_uppercase
cols = pd.Index(list(ascii_uppercase))
vals = np.arange(100*len(cols)).reshape(100, len(cols))
df = pd.DataFrame(vals, columns=cols)
df1 = pd.DataFrame(vals[:10,:5], columns=cols[:5])
Then turn A and B into indices:
df = df.set_index(["A","B"])
df1 = df1.set_index(["A","B"])*1.5 # multiply just to make the other values different
df.loc[df1.index, df1.columns] = df1
df = df.reset_index()
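For reference, DataFrame.update can achieve a similar aligned overwrite once both frames are indexed on A and B; a minimal self-contained sketch with made-up values:
import pandas as pd
big = pd.DataFrame({'A': [1, 2, 3], 'B': ['x', 'y', 'z'], 'C': [10, 20, 30]})
small = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y'], 'C': [15, 25]})
big = big.set_index(['A', 'B'])
# update aligns on the index and overwrites only the cells that small provides
big.update(small.set_index(['A', 'B']))
big = big.reset_index()
print(big)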
My dataframes are like the ones below:
df1
id c1
1 abc
2 def
3 ghi
df2
id set1
1 [123,456]
2 [789]
When I join df1 and df2 with final_df = df1.merge(df2, how='left'), it gives me:
final_df
id c1 set1
1 abc [123,456]
2 def [789]
3 ghi NaN
I'm using the code below to replace NaN with an empty list []:
for row in final_df.loc[final_df.set1.isnull(), 'set1'].index:
    final_df.at[row, 'set1'] = []
The issue is that if df2 is an empty DataFrame, it gives:
ValueError: setting an array element with a sequence.
PS: I'm using pandas version 0.23.4.
Pandas is not designed to be used with series of lists. You lose all vectorised functionality and any manipulations on such series involve inefficient, Python-level loops.
One work-around is to define a series of empty lists:
res = df1.merge(df2, how='left')
empty = pd.Series([[] for _ in range(len(res.index))], index=res.index)
res['set1'] = res['set1'].fillna(empty)
print(res)
id c1 set1
0 1 abc [123, 456]
1 2 def [789]
2 3 ghi []
A better idea at this point, if viable, is to split your lists into separate series:
res = res.join(pd.DataFrame(res.pop('set1').values.tolist()))
print(res)
id c1 0 1
0 1 abc 123.0 456.0
1 2 def 789.0 NaN
2 3 ghi NaN NaN
This is not ideal, but it will get your work done:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[1,'abc'],[2,'def'],[3,'ghi']], columns=['id', 'c1'])
df2 = pd.DataFrame([[1,[123,456]],[2,[789]]], columns=['id', 'set1'])
df=pd.merge(df1,df2, how='left', on='id')
df['set1'] = df['set1'].fillna(0)
df['set1'] = df['set1'].apply(lambda x: [] if x == 0 else x)
print(df)
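A variant that also copes with an empty df2 (the case that raised the ValueError above), assuming 'set1' should simply become an empty list wherever the merge produced NaN:
import pandas as pd
df1 = pd.DataFrame([[1, 'abc'], [2, 'def'], [3, 'ghi']], columns=['id', 'c1'])
# an empty right-hand frame with the expected dtypes
df2 = pd.DataFrame({'id': pd.Series(dtype='int64'), 'set1': pd.Series(dtype='object')})
final_df = df1.merge(df2, how='left', on='id')
# anything that is not already a list (i.e. NaN from the merge) becomes []
final_df['set1'] = final_df['set1'].apply(lambda x: x if isinstance(x, list) else [])
print(final_df)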
I want to append a Series to a DataFrame where Series's index matches DataFrame's columns using pd.concat, but it gives me surprises:
df = pd.DataFrame(columns=['a', 'b'])
sr = pd.Series(data=[1,2], index=['a', 'b'], name=1)
pd.concat([df, sr], axis=0)
Out[11]:
a b 0
a NaN NaN 1.0
b NaN NaN 2.0
What I expected is of course:
df.append(sr)
Out[14]:
a b
1 1 2
It really surprises me that pd.concat is not index-columns aware. So is it true that if I want to concat a Series as a new row to a DF, then I can only use df.append instead?
You need a DataFrame from the Series, via to_frame and a transpose:
a = pd.concat([df, sr.to_frame(1).T])
print (a)
a b
1 1 2
Detail:
print (sr.to_frame(1).T)
a b
1 1 2
Or use setting with enlargement:
df.loc[1] = sr
print (df)
a b
1 1 2
"df.loc[1] = sr" will drop the column if it isn't in df
df = pd.DataFrame(columns = ['a','b'])
sr = pd.Series({'a':1,'b':2,'c':3})
df.loc[1] = sr
df will be like:
a b
1 1 2
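If you want to keep such extra entries ('c' here) as new columns instead, one option is to concatenate the transposed Series, as in the earlier answer; a small sketch:
import pandas as pd
base = pd.DataFrame(columns=['a', 'b'])
sr = pd.Series({'a': 1, 'b': 2, 'c': 3})
# concatenating the transposed Series keeps 'c' as a new column
out = pd.concat([base, sr.to_frame(1).T], sort=False)
print(out)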
I have a dataframe with some columns containing NaN. I'd like to drop the columns that have a certain number of NaN. For example, in the following code, I'd like to drop any column with 2 or more NaN. In this case, column 'C' will be dropped and only 'A' and 'B' will be kept. How can I implement it?
import pandas as pd
import numpy as np
dff = pd.DataFrame(np.random.randn(10,3), columns=list('ABC'))
dff.iloc[3,0] = np.nan
dff.iloc[6,1] = np.nan
dff.iloc[5:8,2] = np.nan
print(dff)
There is a thresh param for dropna: it is the minimum number of non-NaN values a column needs in order to be kept, so you can pass the length of your df minus the number of NaN values you are willing to tolerate:
In [13]:
dff.dropna(thresh=len(dff) - 2, axis=1)
Out[13]:
A B
0 0.517199 -0.806304
1 -0.643074 0.229602
2 0.656728 0.535155
3 NaN -0.162345
4 -0.309663 -0.783539
5 1.244725 -0.274514
6 -0.254232 NaN
7 -1.242430 0.228660
8 -0.311874 -0.448886
9 -0.984453 -0.755416
So the above drops any column that has fewer than len(dff) - 2 non-NaN values (i.e., the number of rows minus 2).
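One nuance: because thresh counts the non-NaN values required to keep a column, a column with exactly 2 NaN still survives thresh=len(dff) - 2. If the rule is strictly "drop columns with 2 or more NaN", the threshold presumably needs to be one higher:
# keep only columns with at most 1 NaN, i.e. drop those with 2 or more
dff.dropna(thresh=len(dff) - 1, axis=1)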
You can use a conditional list comprehension:
>>> dff[[c for c in dff if dff[c].isnull().sum() < 2]]
A B
0 -0.819004 0.919190
1 0.922164 0.088111
2 0.188150 0.847099
3 NaN -0.053563
4 1.327250 -0.376076
5 3.724980 0.292757
6 -0.319342 NaN
7 -1.051529 0.389843
8 -0.805542 -0.018347
9 -0.816261 -1.627026
Here is a possible solution:
s = dff.isnull().apply(sum, axis=0) # count the number of nan in each column
print(s)
A 1
B 1
C 3
dtype: int64
for col in dff:
    if s[col] >= 2:
        del dff[col]
Or
for c in dff:
    if sum(dff[c].isnull()) >= 2:
        dff.drop(c, axis=1, inplace=True)
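The same rule can also be written as a single boolean column selection, without any explicit loop:
# keep only the columns that contain fewer than 2 NaN values
dff = dff.loc[:, dff.isnull().sum() < 2]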
I recommend the drop method. This is an alternative solution:
dff.drop(dff.columns[dff.isnull().sum() >= 2], axis=1)
Say you have to drop columns having more than 70% null values.
data.drop(data.loc[:, list((100 * (data.isnull().sum() / len(data.index)) > 70))].columns, axis=1)
You can also take another approach, as below, for dropping columns having a certain number of NaN values:
df = df.drop( columns= [x for x in df if df[x].isna().sum() > 5 ])
For dropping columns having a certain percentage of NaN values:
df = df.drop(columns= [x for x in df if round((df[x].isna().sum()/len(df)*100),2) > 20 ])
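If you prefer to work with fractions directly, isna().mean() gives the per-column share of NaN, which reads a little more cleanly (same hypothetical 20% cutoff):
df = df.drop(columns=[x for x in df if df[x].isna().mean() > 0.20])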
I'm having trouble using pd.merge after groupby. Here's my hypothetical:
import pandas as pd
from pandas import DataFrame
import numpy as np
df1 = DataFrame({'key': [1,1,2,2,3,3],
                 'var11': np.random.randn(6),
                 'var12': np.random.randn(6)})
df2 = DataFrame({'key': [1,2,3],
                 'var21': np.random.randn(3),
                 'var22': np.random.randn(3)})
#group var11 in df1 by key
grouped = df1['var11'].groupby(df1['key'])
# calculate the mean of var11 by key
grouped = grouped.mean()
print(grouped)
key
1 1.399430
2 0.568216
3 -0.612843
dtype: float64
print(grouped.index)
Int64Index([1, 2, 3], dtype='int64')
print(df2)
key var21 var22
0 1 -0.381078 0.224325
1 2 0.836719 -0.565498
2 3 0.323412 -1.616901
df2 = pd.merge(df2, grouped, left_on = 'key', right_index = True)
At this point, I get IndexError: list index out of range.
When using groupby, the grouping variable ('key' in this example) becomes the index for the resultant series, which is why I specify 'right_index = True'. I've tried other syntax without success. Any advice?
I think you should just do this:
In [140]:
df2 = pd.merge(df2,
               grouped.to_frame('mean'),
               left_on='key',
               right_index=True)
print(df2)
key var21 var22 mean
0 1 0.324476 0.701254 0.400313
1 2 -1.270500 0.055383 -0.293691
2 3 0.804864 0.566747 0.628787
[3 rows x 4 columns]
The reason your merge didn't work is that grouped is a Series, not a DataFrame.
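For what it's worth, more recent pandas versions (0.24+, if I recall correctly) also let you merge a named Series directly, so a rename is enough:
# grouped is the Series of per-key means; its name becomes the merged column's name
df2 = pd.merge(df2, grouped.rename('mean'), left_on='key', right_index=True)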