Merging pandas DataFrames with deleting some part of data [duplicate] - python

I have these two DataFrames that I would like to merge, without deleting any data of a but deleting the non-matching rows of b:
a = pd.DataFrame({'coname': ['Apple','Microsoft','JPMorgan','Facebook','Intel','McKinsey'],
                  'eva': [20, 18, 73, 62, 56, 92],
                  'ratio': [4, 7, 1, 6, 9, 8]
                  })
b = pd.DataFrame({'coname': ['Apple','Microsoft','JPMorgan','Netflix','Total','Ford'],
                  'city': ['Cupertino','Seattle','NYC','Palo Alto','Paris', 'Detroit'],
                  'state': ['CA','WA','NY','CA','Ile de France', 'MI']
                  })
I want the following output (EDITED):
coname eva ratio city state
0 Apple 20.0 4.0 Cupertino CA
1 Microsoft 18.0 7.0 Seattle WA
2 JPMorgan 73.0 1.0 NYC NY
3 Facebook 62.0 6.0 NaN NaN
4 Intel 56.0 9.0 NaN NaN
5 McKinsey 92.0 8.0 NaN NaN
I have tried
a = pd.merge(a, b, on='coname', how='outer')
for i in a['coname']:
    if i in b['coname']:
        a.drop(i)
but I only get this:
coname eva ratio city state
0 Apple 20.0 4.0 Cupertino CA
1 Microsoft 18.0 7.0 Seattle WA
2 JPMorgan 73.0 1.0 NYC NY
3 Facebook 62.0 6.0 NaN NaN
4 Intel 56.0 9.0 NaN NaN
5 McKinsey 92.0 8.0 NaN NaN
6 Netflix NaN NaN Palo Alto CA
7 Total NaN NaN Paris Ile de France
8 Ford NaN NaN Detroit MI

how='left' will do the trick:
pd.merge(a,b, on = 'coname', how='left')
Result:
      coname  eva  ratio       city state
0      Apple   20      4  Cupertino    CA
1  Microsoft   18      7    Seattle    WA
2   JPMorgan   73      1        NYC    NY
3   Facebook   62      6        nan   nan
4      Intel   56      9        nan   nan
5   McKinsey   92      8        nan   nan
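A side note on why the attempted loop left the merge untouched: the test i in b['coname'] checks membership against the Series index (0..5), not its values, so it never matches a company name, and a.drop(i) would drop by index label and return a new frame rather than modifying a in place anyway. If you did want to post-process the outer merge instead, a small sketch (assuming a still holds the original frame) is:
merged = pd.merge(a, b, on='coname', how='outer')
merged = merged[merged['coname'].isin(a['coname'])]  # keep only conames present in a
but the how='left' merge above gives the same result more directly.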

Related

How to get all column's first values of consecutive groups and also max of different bins of the respective group in pandas dataframe?

I have a dataframe as shown here:
pandas dataframe
I'm grouping consecutively by the 'Name' column for total counts and consecutive counts, and for the 'Age' column I'm applying min and max, to generate a dataframe like this:
Then I take only the first value of every column for each consecutive group, like this:
Then I try to get all column values where the max 'Age' between 5 and 20 is present for each consecutive group, and concat that dataframe with the dataframe of first values. But I got this output:
while the expected output is:
Also, this is for a single bin, i.e. 5-20; how do I include more than one bin? For example, if one bin is 5-20 and the next bin is 25-40, the expected output is:
For the above outputs, this is the code I have written:
import numpy as np
import pandas as pd
# initialize list of lists
data = [['tom', 10], ['tom', 5], ['nick', 15], ['juli', 14], ['tom', 20],['tom', 10], ['tom', 10], ['juli', 17], ['tom', 30], ['nick', 19], ['juli', 24], ['juli', 29],['tom', 0], ['juli', 76]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Age'])
# print dataframe.
print("df = ",df)
print("")
# acquire min, max, count, consecutive same names
df['min'] = df.groupby(df['Name'].ne(df['Name'].shift()).cumsum())['Age'].transform('min')#df.groupby("Name",sort=False)['Age'].transform('min')
df['max'] = df.groupby(df['Name'].ne(df['Name'].shift()).cumsum())['Age'].transform('max')#df.groupby("Name",sort=False)['Age'].transform('max')
df['count'] = df.groupby("Name",sort=False)['Name'].transform('count')
df['cons'] = df.groupby(df['Name'].ne(df['Name'].shift()).cumsum())['Name'].transform('size')
print(df)
# take the first column values of every consecutive group
df_t = df
temp_df=df.groupby(df['Name'].ne(df['Name'].shift()).cumsum(),as_index=False)[df.columns].agg('first')
print("")
print("temp_df = ",temp_df)
df_t = df_t.reset_index()
df_t = df_t.drop(['index'], axis=1)
print("df_t = ", df_t)
# check max of bin 5-20 for every consecutive group
df_t1 = df_t.groupby(df_t['Name'].ne(df_t['Name'].shift()).cumsum(),as_index=False).apply(lambda x:x['Age'][(x['Age'] >= 5) & (x['Age'] < 20)].agg(lambda y : y.idxmax()))
print("")
print("df_t1 = ", df_t1)
# checking for condition if value is np array
a = df_t1.tolist()
b=[]
c = np.array([2])
c = c.astype('int64')
for i in a:
    if type(i) == type(c[0]):
        b.append(i)
    else:
        continue
df_t1 = df_t.iloc[b]
print("")
print("output df_t1 = ", df_t1)
# concat the bin max and first value df
concatdf = pd.concat([temp_df, df_t1],axis=1)
print("")
print("concatdf = ", concatdf)
Thank you in advance :)
You can greatly simplify your code by doing a single groupby for almost all indicators, except the cumulated count.
Then just mask your data according to your criterion and concatenate.
I believe this does what you want:
group = df['Name'].ne(df['Name'].shift()).cumsum()
df2 = (df
       .groupby(group, as_index=False)
       .agg(**{'Name': ('Name', 'first'),
               'Age': ('Age', 'first'),
               'min': ('Age', 'min'),
               'max': ('Age', 'max'),
               'cons': ('Age', 'count')
               })
       .assign(count=lambda d: d.groupby('Name')['cons'].transform('sum'))
       )
out = pd.concat([df2, df2.where(df2['max'].between(5,20))], axis=1)
output:
Name Age min max cons count Name Age min max cons count
0 tom 10 5 10 2 7 tom 10.0 5.0 10.0 2.0 7.0
1 nick 15 15 15 1 2 nick 15.0 15.0 15.0 1.0 2.0
2 juli 14 14 14 1 5 juli 14.0 14.0 14.0 1.0 5.0
3 tom 20 10 20 3 7 tom 20.0 10.0 20.0 3.0 7.0
4 juli 17 17 17 1 5 juli 17.0 17.0 17.0 1.0 5.0
5 tom 30 30 30 1 7 NaN NaN NaN NaN NaN NaN
6 nick 19 19 19 1 2 nick 19.0 19.0 19.0 1.0 2.0
7 juli 24 24 29 2 5 NaN NaN NaN NaN NaN NaN
8 tom 0 0 0 1 7 NaN NaN NaN NaN NaN NaN
9 juli 76 76 76 1 5 NaN NaN NaN NaN NaN NaN
For more bins:
bins = [(5,20), (25,40)]
out = pd.concat([df2]+[df2.where(df2['max'].between(a,b)) for a,b in bins], axis=1)
output:
Name Age min max cons count Name Age min max cons count Name Age min max cons count
0 tom 10 5 10 2 7 tom 10.0 5.0 10.0 2.0 7.0 NaN NaN NaN NaN NaN NaN
1 nick 15 15 15 1 2 nick 15.0 15.0 15.0 1.0 2.0 NaN NaN NaN NaN NaN NaN
2 juli 14 14 14 1 5 juli 14.0 14.0 14.0 1.0 5.0 NaN NaN NaN NaN NaN NaN
3 tom 20 10 20 3 7 tom 20.0 10.0 20.0 3.0 7.0 NaN NaN NaN NaN NaN NaN
4 juli 17 17 17 1 5 juli 17.0 17.0 17.0 1.0 5.0 NaN NaN NaN NaN NaN NaN
5 tom 30 30 30 1 7 NaN NaN NaN NaN NaN NaN tom 30.0 30.0 30.0 1.0 7.0
6 nick 19 19 19 1 2 nick 19.0 19.0 19.0 1.0 2.0 NaN NaN NaN NaN NaN NaN
7 juli 24 24 29 2 5 NaN NaN NaN NaN NaN NaN juli 24.0 24.0 29.0 2.0 5.0
8 tom 0 0 0 1 7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 juli 76 76 76 1 5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
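If you want the concatenated bin blocks to carry distinct column names instead of repeated Name/Age/min/max headers, a small variation (the suffix scheme here is my own illustration, not part of the answer above) is:
bins = [(5, 20), (25, 40)]
out = pd.concat(
    [df2] + [df2.where(df2['max'].between(lo, hi)).add_suffix(f'_{lo}_{hi}')
             for lo, hi in bins],
    axis=1)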

pandas groupby rolling behaviour

Here is my pandas DataFrame:
df = pd.DataFrame({
    'location': ['USA','USA','USA','USA', 'France','France','France','France'],
    'date': ['2020-11-20','2020-11-21','2020-11-22','2020-11-23', '2020-11-20','2020-11-21','2020-11-22','2020-11-23'],
    'dm': [5.,4.,2.,2.,17.,3.,3.,7.]
})
For a given location (so groupby is needed), I want the rolling mean of dm over 2 days. If I use this:
df['rolling']=df.groupby('location').dm.rolling(2).mean().values
I obtain this incorrect result:
location date dm rolling
0 USA 2020-11-20 5.0 NaN
1 USA 2020-11-21 4.0 10.0
2 USA 2020-11-22 2.0 3.0
3 USA 2020-11-23 2.0 5.0
4 France 2020-11-20 17.0 NaN
5 France 2020-11-21 3.0 4.5
6 France 2020-11-22 3.0 3.0
7 France 2020-11-23 7.0 2.0
While it should be:
location date dm rolling
0 USA 2020-11-20 5.0 NaN
1 USA 2020-11-21 4.0 4.5
2 USA 2020-11-22 2.0 3.0
3 USA 2020-11-23 2.0 2.0
4 France 2020-11-20 17.0 NaN
5 France 2020-11-21 3.0 10
6 France 2020-11-22 3.0 3.0
7 France 2020-11-23 7.0 5.0
Two questions:
what is my syntax actually doing?
what is the correct way to proceed?
The problem is that groupby creates a new level of MultiIndex, so to match the original index values it is necessary to remove that level with Series.reset_index(level=0, drop=True). If you use .values instead, there is no alignment by index, so the order of the groups ends up different, as in your output:
df['rolling']=df.groupby('location').dm.rolling(2).mean().reset_index(level=0, drop=True)
print (df)
location date dm rolling
0 USA 2020-11-20 5.0 NaN
1 USA 2020-11-21 4.0 4.5
2 USA 2020-11-22 2.0 3.0
3 USA 2020-11-23 2.0 2.0
4 France 2020-11-20 17.0 NaN
5 France 2020-11-21 3.0 10.0
6 France 2020-11-22 3.0 3.0
7 France 2020-11-23 7.0 5.0
Details:
print (df.groupby('location').dm.rolling(2).mean())
location
France 4 NaN
5 10.0
6 3.0
7 5.0
USA 0 NaN
1 4.5
2 3.0
3 2.0
Name: dm, dtype: float64
print (df.groupby('location').dm.rolling(2).mean().reset_index(level=0, drop=True))
4 NaN
5 10.0
6 3.0
7 5.0
0 NaN
1 4.5
2 3.0
3 2.0
Name: dm, dtype: float64
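As a side note (a sketch of my own, not part of the answer above): groupby().transform keeps the original index, so it sidesteps the MultiIndex alignment issue entirely:
df['rolling'] = df.groupby('location')['dm'].transform(lambda s: s.rolling(2).mean())
This gives the same corrected result as the reset_index approach.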

Python Pandas Dataframe Dynamic Adding column

My data frame is below:
Date Country GDP
0 2011 United States 345.0
1 2012 United States 0.0
2 2013 United States 457.0
3 2014 United States 577.0
4 2015 United States 0.0
5 2016 United States 657.0
6 2011 UK 35.0
7 2012 UK 64.0
8 2013 UK 54.0
9 2014 UK 67.0
10 2015 UK 687.0
11 2016 UK 0.0
12 2011 China 34.0
13 2012 China 54.0
14 2013 China 678.0
15 2014 China 355.0
16 2015 China 5678.0
17 2016 China 345.0
I want to calculate each country's percentage of the total GDP of all three countries in each year. I would like to add one more column, called perc, to the dataframe:
I implemented the code below:
import pandas as pd

countrylist = ['United States', 'UK', 'China']
for country in countrylist:
    for year in range(2011, 2016):
        df['perc'] = (df['GDP'][(df['Country']==country) & (df['Date']==year)]).astype(float)/df['GDP'][df['Date']==year].sum()
        print(df['perc'])
My output is like
0 0.833333
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
0 NaN
1 0.0
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
0 NaN
1 NaN
2 0.384357
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
....
I noticed that my previous results get wiped out when a new loop iteration starts, so ultimately I only keep the last perc value. I should probably provide some position info when assigning df['perc'], such as:
df['perc'][(df['Country']==country) & (df['Date']==year)] = (df['GDP'][(df['Country']==country) & (df['Date']==year)]).astype(float)/df['GDP'][df['Date']==year].sum()
But it doesn't work. How can I dynamically insert the values?
Ideally, I should have:
Date Country GDP perc
0 2011 United States 345.0 0.81
1 2012 United States 0.0 0.0
2 2013 United States 457.0 0.23
3 2014 United States 577.0 xx
4 2015 United States 0.0 xx
5 2016 United States 657.0 xx
6 2011 UK 35.0 xx
7 2012 UK 64.0 xx
8 2013 UK 54.0 xx
9 2014 UK 67.0 xx
10 2015 UK 687.0 xx
11 2016 UK 0.0 xx
12 2011 China 34.0 xx
13 2012 China 54.0 xx
14 2013 China 678.0 xx
15 2014 China 355.0 xx
16 2015 China 5678.0 xx
17 2016 China 345.0 xx
You can use transform('sum') here:
df.GDP/df.groupby('Date').GDP.transform('sum')
Out[161]:
0 0.833333
1 0.000000
2 0.384357
3 0.577578
4 0.000000
5 0.655689
6 0.084541
7 0.542373
8 0.045416
9 0.067067
10 0.107934
11 0.000000
12 0.082126
13 0.457627
14 0.570227
15 0.355355
16 0.892066
17 0.344311
Name: GDP, dtype: float64
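To get the new column the question asks for, assign that expression straight back to the frame (a one-line sketch):
df['perc'] = df.GDP / df.groupby('Date').GDP.transform('sum')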

In python, reading multiple CSV's, with different headers, into one dataframe

I have dozens of csv files with similar (but not always exactly the same) headers. For instance, one has:
Year Month Day Hour Minute Direct Diffuse D_Global D_IR Zenith Test_Site
One has:
Year Month Day Hour Minute Direct Diffuse2 D_Global D_IR U_Global U_IR Zenith Test_Site
(Notice one lacks "U_Global" and "U_IR", the other has "Diffuse2" instead of "Diffuse")
I know how to pass multiple csv's into my script, but how do I have each csv fill only the columns for which it actually has values, and put NaN in all the other columns of that row?
Ideally I'd have something like:
'Year','Month','Day','Hour','Minute','Direct','Diffuse','Diffuse2','D_Global','D_IR','U_Global','U_IR','Zenith','Test_Site'
1992,1,1,0,3,-999.00,-999.00,"Nan",-999.00,-999.00,"Nan","Nan",122.517,"BER"
2013,5,30,15,55,812.84,270.62,"Nan",1078.06,-999.00,"Nan","Nan",11.542,"BER"
2004,9,1,0,1,1.04,79.40,"Nan",78.67,303.58,61.06,310.95,85.142,"ALT"
2014,12,1,0,1,0.00,0.00,"Nan",-999.00,226.95,0.00,230.16,115.410,"ALT"
The other caveat is that this dataframe needs to be appended to: it has to persist as the multiple csv files are passed into it. I think I'll probably have it write out to its own csv at the end (it's eventually going to NetCDF4).
Assuming you have the following CSV files:
test1.csv:
year,month,day,Direct
1992,1,1,11
2013,5,30,11
2004,9,1,11
test2.csv:
year,month,day,Direct,Direct2
1992,1,1,21,201
2013,5,30,21,202
2004,9,1,21,203
test3.csv:
year,month,day,File3
1992,1,1,text1
2013,5,30,text2
2004,9,1,text3
2016,1,1,unmatching_date
Solution:
import glob
import pandas as pd

files = glob.glob(r'd:/temp/test*.csv')

def get_merged(files, **kwargs):
    df = pd.read_csv(files[0], **kwargs)
    for f in files[1:]:
        df = df.merge(pd.read_csv(f, **kwargs), how='outer')
    return df

print(get_merged(files))
Output:
year month day Direct Direct Direct2 File3
0 1992 1 1 11.0 21.0 201.0 text1
1 2013 5 30 11.0 21.0 202.0 text2
2 2004 9 1 11.0 21.0 203.0 text3
3 2016 1 1 NaN NaN NaN unmatching_date
UPDATE: usual idiomatic pd.concat(list_of_dfs) solution wouldn't work here, because it's joining by indexes:
In [192]: pd.concat([pd.read_csv(f) for f in glob.glob(file_mask)], axis=0, ignore_index=True)
Out[192]:
Direct Direct Direct2 File3 day month year
0 NaN 11.0 NaN NaN 1 1 1992
1 NaN 11.0 NaN NaN 30 5 2013
2 NaN 11.0 NaN NaN 1 9 2004
3 21.0 NaN 201.0 NaN 1 1 1992
4 21.0 NaN 202.0 NaN 30 5 2013
5 21.0 NaN 203.0 NaN 1 9 2004
6 NaN NaN NaN text1 1 1 1992
7 NaN NaN NaN text2 30 5 2013
8 NaN NaN NaN text3 1 9 2004
9 NaN NaN NaN unmatching_date 1 1 2016
In [193]: pd.concat([pd.read_csv(f) for f in glob.glob(file_mask)], axis=1, ignore_index=True)
Out[193]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 1992.0 1.0 1.0 11.0 1992.0 1.0 1.0 21.0 201.0 1992 1 1 text1
1 2013.0 5.0 30.0 11.0 2013.0 5.0 30.0 21.0 202.0 2013 5 30 text2
2 2004.0 9.0 1.0 11.0 2004.0 9.0 1.0 21.0 203.0 2004 9 1 text3
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016 1 1 unmatching_date
or using index_col=None explicitly:
In [194]: pd.concat([pd.read_csv(f, index_col=None) for f in glob.glob(file_mask)], axis=0, ignore_index=True)
Out[194]:
Direct Direct Direct2 File3 day month year
0 NaN 11.0 NaN NaN 1 1 1992
1 NaN 11.0 NaN NaN 30 5 2013
2 NaN 11.0 NaN NaN 1 9 2004
3 21.0 NaN 201.0 NaN 1 1 1992
4 21.0 NaN 202.0 NaN 30 5 2013
5 21.0 NaN 203.0 NaN 1 9 2004
6 NaN NaN NaN text1 1 1 1992
7 NaN NaN NaN text2 30 5 2013
8 NaN NaN NaN text3 1 9 2004
9 NaN NaN NaN unmatching_date 1 1 2016
In [195]: pd.concat([pd.read_csv(f, index_col=None) for f in glob.glob(file_mask)], axis=1, ignore_index=True)
Out[195]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 1992.0 1.0 1.0 11.0 1992.0 1.0 1.0 21.0 201.0 1992 1 1 text1
1 2013.0 5.0 30.0 11.0 2013.0 5.0 30.0 21.0 202.0 2013 5 30 text2
2 2004.0 9.0 1.0 11.0 2004.0 9.0 1.0 21.0 203.0 2004 9 1 text3
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2016 1 1 unmatching_date
The following more idiomatic solution works, but it changes original order of columns and rows / data:
In [224]: dfs = [pd.read_csv(f, index_col=None) for f in glob.glob(r'd:/temp/test*.csv')]
...:
...: common_cols = list(set.intersection(*[set(x.columns.tolist()) for x in dfs]))
...:
...: pd.concat((df.set_index(common_cols) for df in dfs), axis=1).reset_index()
...:
Out[224]:
month day year Direct Direct Direct2 File3
0 1 1 1992 11.0 21.0 201.0 text1
1 1 1 2016 NaN NaN NaN unmatching_date
2 5 30 2013 11.0 21.0 202.0 text2
3 9 1 2004 11.0 21.0 203.0 text3
Can't pandas take care of this automagically?
http://pandas.pydata.org/pandas-docs/stable/merging.html#concatenating-using-append
If your indices overlap, don't forget to add 'ignore_index=True'
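For illustration, a minimal sketch of that concat approach (the directory is an assumption, and sort=False needs a reasonably recent pandas); columns missing from a file simply come out as NaN:
import glob
import pandas as pd

frames = [pd.read_csv(f) for f in glob.glob('./csv_files/*.csv')]
combined = pd.concat(frames, ignore_index=True, sort=False)  # unmatched columns filled with NaN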
First, run through all the files to define the common headers:
import os

csv_path = './csv_files'
csv_separator = ','
full_headers = []
for fn in os.listdir(csv_path):
    with open(os.path.join(csv_path, fn), 'r') as f:
        headers = f.readline().rstrip('\n').split(csv_separator)
        # collect any header not seen yet, preserving order
        full_headers += [h for h in headers if h not in full_headers]
Then write your header line into your output file, and run through all the files again to fill it.
You can use csv.DictReader(open('myfile.csv')) to easily match each row's values to their designated columns.
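A minimal sketch of that two-pass idea with the csv module (the paths and output file name are assumptions):
import csv
import os

csv_path = './csv_files'
out_path = 'combined.csv'

# first pass: collect the union of all headers, preserving first-seen order
full_headers = []
for fn in sorted(os.listdir(csv_path)):
    with open(os.path.join(csv_path, fn), newline='') as f:
        for h in csv.DictReader(f).fieldnames:
            if h not in full_headers:
                full_headers.append(h)

# second pass: write one combined file, filling missing columns with 'Nan'
with open(out_path, 'w', newline='') as out:
    writer = csv.DictWriter(out, fieldnames=full_headers, restval='Nan')
    writer.writeheader()
    for fn in sorted(os.listdir(csv_path)):
        with open(os.path.join(csv_path, fn), newline='') as f:
            writer.writerows(csv.DictReader(f))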

Pandas Sort Multiindex by Group Sum

Given the following data frame:
import pandas as pd
df = pd.DataFrame({'County': ['A','B','C','D','A','B','C','D','A','B','C','D','A','B','C','D','A','B'],
                   'Hospital': ['a','b','c','d','e','a','b','c','e','a','b','c','d','e','a','b','c','e'],
                   'Enrollment': [44,55,42,57,95,54,27,55,81,54,65,23,89,76,34,12,1,67],
                   'Year': ['2012','2012','2012','2012','2012','2012','2012','2012','2012','2013',
                            '2013','2013','2013','2013','2013','2013','2013','2013']})
d2=pd.pivot_table(df,index=['County','Hospital'],columns=['Year'])#.sort_columns
d2
Enrollment
Year 2012 2013
County Hospital
A a 44.0 NaN
c NaN 1.0
d NaN 89.0
e 88.0 NaN
B a 54.0 54.0
b 55.0 NaN
e NaN 71.5
C a NaN 34.0
b 27.0 65.0
c 42.0 NaN
D b NaN 12.0
c 55.0 23.0
d 57.0 NaN
I need to sort the data frame such that County is sorted in descending order by the sum of Enrollment for the most recent year (I want to avoid using '2013' directly), like this:
Enrollment
Year 2012 2013
County Hospital
B a 54 54
b 55 NaN
e NaN 71.5
C a NaN 34
b 27 65
c 42 NaN
A a 44 NaN
c NaN 1
d NaN 89
e 88 NaN
D b NaN 12
c 55 23
d 57 NaN
Then, I'd like the hospitals within each county sorted in descending order by their 2013 enrollments, like this:
Enrollment
Year 2012 2013
County Hospital
B e NaN 71.5
a 54 54
b 55 NaN
C b 27 65
a NaN 34
c 42 NaN
A d NaN 89
c NaN 1
a 44 NaN
e 88 NaN
D c 55 23
b NaN 12
d 57 NaN
So far, I've tried using groupby to get the sums and merge them back, but have not had any luck:
d2.groupby('County').sum()
Thanks in advance!
You could:
max_col = max(d2.columns.get_level_values(1)) # get column 2013
d2['sum'] = d2.groupby(level='County').transform('sum').loc[:, ('Enrollment', max_col)]
d2 = d2.sort_values(['sum', ('Enrollment', max_col)], ascending=[False, False])
to get:
Enrollment sum
Year 2012 2013
County Hospital
B e NaN 71.5 125.5
a 54.0 54.0 125.5
b 55.0 NaN 125.5
C b 27.0 65.0 99.0
a NaN 34.0 99.0
c 42.0 NaN 99.0
A d NaN 89.0 90.0
c NaN 1.0 90.0
a 44.0 NaN 90.0
e 88.0 NaN 90.0
D c 55.0 23.0 35.0
b NaN 12.0 35.0
d 57.0 NaN 35.0
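If you don't want to keep the helper column in the final result, you can drop it after sorting (a small follow-up sketch; level=0 targets the added 'sum' label in the MultiIndex columns):
d2 = d2.drop(columns='sum', level=0)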
