Combining multiple files in pandas - python

Assume the file1 is:
State Date
0 NSW 01/02/16
1 NSW 01/03/16
3 VIC 01/04/16
...
100 TAS 01/12/17
File 2 is:
State 01/02/16 01/03/16 01/04/16 .... 01/12/17
0 VIC 10000 12000 14000 .... 17600
1 NSW 50000
....
Now I would like to join these two files based on Date.
In other words, I want to match file1's Date column against the date columns of file2.

I believe you need melt with merge. The on parameter can be omitted, because merge then joins on all columns that have the same name in both DataFrames:
import pandas as pd
df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
df = df2.melt('State', var_name='Date', value_name='col').merge(df1, how='right')
print (df)
State Date col
0 NSW 01/02/16 50000.0
1 NSW 01/03/16 NaN
2 VIC 01/04/16 14000.0
3 TAS 01/12/17 NaN
Solution with left join:
df = df1.merge(df2.melt('State', var_name='Date', value_name='col'), how='left')
print (df)
State Date col
0 NSW 01/02/16 50000.0
1 NSW 01/03/16 NaN
2 VIC 01/04/16 14000.0
3 TAS 01/12/17 NaN
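For readers without the CSV files, here is a minimal self-contained sketch of the left-join variant. The inline frames are stand-ins for file1 and file2, and their values are illustrative only; the missing NSW figures are an assumption made to mirror the NaN rows in the output above.

```python
import pandas as pd

# Inline stand-ins for file1 and file2 (illustrative values, not real data).
df1 = pd.DataFrame({
    "State": ["NSW", "NSW", "VIC", "TAS"],
    "Date": ["01/02/16", "01/03/16", "01/04/16", "01/12/17"],
})
df2 = pd.DataFrame({
    "State": ["VIC", "NSW"],
    "01/02/16": [10000, 50000],
    "01/03/16": [12000, None],
    "01/04/16": [14000, None],
})

# Wide -> long: one row per (State, Date) pair.
long_df2 = df2.melt(id_vars="State", var_name="Date", value_name="col")

# The left join keeps every row of df1. With no `on`, merge joins on the
# columns shared by both frames, here State and Date.
result = df1.merge(long_df2, how="left")
print(result)
```

Rows of df1 with no matching (State, Date) pair in the melted frame, such as TAS, keep their place and get NaN in col.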

You can melt the second data frame to a long format, then merge with first data frame to get the values.
import pandas as pd
df1 = pd.DataFrame({'State': ['NSW','NSW','VIC'],
'Date': ['01/02/16', '01/03/16', '01/04/16']})
df2 = pd.DataFrame([['VIC',10000,12000,14000],
['NSW',50000,60000,62000]],
columns=['State', '01/02/16', '01/03/16', '01/04/16'])
df1.merge(pd.melt(df2, id_vars=['State'], var_name='Date'), on=['State', 'Date'])
# returns:
Date State value
0 01/02/16 NSW 50000
1 01/03/16 NSW 60000
2 01/04/16 VIC 14000

Related

Replace column value in one pandas DataFrame with column in another pandas DataFrame with conditions

I have the following 3 pandas DataFrames. I want to replace the company and division columns with the IDs from their respective company and division DataFrames.
pd_staff:
id name company division
P001 John Sunrise Headquarter
P002 Jane Falcon Digital Research & Development
P003 Joe Ashford Finance
P004 Adam Falcon Digital Sales
P004 Barbara Sunrise Human Resource
pd_company:
id name
1 Sunrise
2 Falcon Digital
3 Ashford
pd_division:
id name
1 Headquarter
2 Research & Development
3 Finance
4 Sales
5 Human Resource
This is the end result that I am trying to produce
id name company division
P001 John 1 1
P002 Jane 2 2
P003 Joe 3 3
P004 Adam 2 4
P004 Barbara 1 5
I have tried to combine Staff and Company using this code
pd_staff.loc[pd_staff['company'].isin(pd_company['name']), 'company'] = pd_company.loc[pd_company['name'].isin(pd_staff['company']), 'id']
which produces
id name company
P001 John 1.0
P002 Jane NaN
P003 Joe NaN
P004 Adam NaN
P004 Barbara NaN
You can do:
pd_staff['company'] = pd_staff['company'].map(pd_company.set_index('name')['id'])
pd_staff['division'] = pd_staff['division'].map(pd_division.set_index('name')['id'])
print(pd_staff)
id name company division
0 P001 John 1 1
1 P002 Jane 2 2
2 P003 Joe 3 3
3 P004 Adam 2 4
4 P004 Barbara 1 5
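One thing to watch with map is that any name missing from the lookup table becomes NaN. The sketch below shows one way to check for that before overwriting the column; the "Unknown Co" row is a hypothetical value added purely to trigger the case.

```python
import pandas as pd

pd_staff = pd.DataFrame({
    "id": ["P001", "P002", "P003"],
    "name": ["John", "Jane", "Joe"],
    "company": ["Sunrise", "Falcon Digital", "Unknown Co"],  # last value is hypothetical
})
pd_company = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Sunrise", "Falcon Digital", "Ashford"],
})

# Series indexed by company name that holds the company id.
lookup = pd_company.set_index("name")["id"]

mapped = pd_staff["company"].map(lookup)

# Names absent from the lookup come back as NaN, so list them first.
unmatched = pd_staff.loc[mapped.isna(), "company"].tolist()
print(unmatched)
```

If the list is empty, it should be safe to assign mapped back to pd_staff['company'].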
This will achieve the desired results
df_merge = df.merge(df2, how = 'inner', right_on = 'name', left_on = 'company', suffixes=('', '_y'))
df_merge = df_merge.merge(df3, how = 'inner', left_on = 'division', right_on = 'name', suffixes=('', '_z'))
df_merge = df_merge[['id', 'name', 'id_y', 'id_z']]
df_merge.columns = ['id', 'name', 'company', 'division']
df_merge.sort_values('id')
First, let's modify the company and division DataFrames a little:
df2.rename(columns={'name':'company'},inplace=True)
df3.rename(columns={'name':'division'},inplace=True)
Then
df1=df1.merge(df2,on='company',how='left').merge(df3,on='division',how='left')
df1=df1[['id_x','name','id_y','id']]
df1.rename(columns={'id_x':'id','id_y':'company','id':'division'},inplace=True)
Use apply with a function that replaces the values. The value to look up and its replacement should come from the second Excel file. Here I am replacing Sunrise with 1 because that is its ID in the second Excel file.
import pandas as pd
df = pd.read_excel('teste.xlsx')
df2 = pd.read_excel('ids.xlsx')
def altera(df33, field='Sunrise', new_field='1'):  # default values are for demonstration only; in practice pass them in from the second Excel file
    return df33.replace(field, new_field)
df.loc[:, 'company'] = df['company'].apply(altera)
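To avoid hard-coding a single pair, the replacement pairs could be built from the lookup table and passed to Series.replace in one call. This is only a sketch: the inline frames stand in for the two Excel files, and their contents are assumptions.

```python
import pandas as pd

# Stand-ins for the two Excel files (contents are assumed for illustration).
df = pd.DataFrame({"company": ["Sunrise", "Falcon Digital", "Ashford"]})
df2 = pd.DataFrame({"id": [1, 2], "name": ["Sunrise", "Falcon Digital"]})

# Build {name: id} from the lookup table.
mapping = dict(zip(df2["name"], df2["id"]))

# replace swaps every value found in the mapping and leaves the rest as they are.
df["company"] = df["company"].replace(mapping)
print(df)
```

Unlike map, replace leaves unmatched values such as Ashford unchanged instead of turning them into NaN.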

Merge two dataframes with subheaders

So I have my first dataframe that has countries as headers and infected and death values as subheaders,
df
Dates Antigua & Barbuda Australia
Infected Dead Infected Dead
2020-01-22 0 0 0 0...
2020-01-23 0 0 0 0...
...
then I have my second dataframe,
df_indicators
Dates Location indicator_1 indicator_2 .....
2020-04-24 Afghanistan 0 0
2020-04-25 Afghanistan 0 0
...
2020-04-24 Yemen 0 0
2020-04-25 Yemen 0 0
I want to merge the dataframes so that the indicator columns become subheaders of the countries column like in df with the infected and dead subheaders.
What I want to produce is something like this,
df_merge
Dates Antigua & Barbuda
Infected Dead indicator_1 indicator_2....
2020-04-24 0 0 0 0...
There are so many indicators, all named differently, that I don't feel I can list them all, so I'm not sure if there's an easy way to do this.
Thank you in advance for any help!
Because there are duplicates, first aggregate by mean, then reshape with DataFrame.unstack and DataFrame.swaplevel:
df2 = df_indicators.groupby(['Dates','Location']).mean().unstack().swaplevel(0,1,axis=1)
Or with DataFrame.pivot_table:
df2 = (df_indicators.pivot_table(index='Dates', columns='Location', aggfunc='mean')
         .swaplevel(0,1,axis=1))
Finally, join the two and sort the MultiIndex in the columns:
df = pd.concat([df, df2], axis=1).sort_index(axis=1)
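The whole reshape-and-join can be tried on a tiny example. The sketch below assumes the first frame uses Dates as its index and has two-level columns; the country, dates and indicator values are made up for illustration.

```python
import pandas as pd

dates = pd.to_datetime(["2020-04-24", "2020-04-25"])

# First frame: Dates as index, (country, measure) as two-level columns.
df = pd.DataFrame(
    [[0, 0], [1, 0]],
    index=pd.Index(dates, name="Dates"),
    columns=pd.MultiIndex.from_product([["Australia"], ["Infected", "Dead"]]),
)

# Second frame: long format, one row per (Dates, Location).
df_indicators = pd.DataFrame({
    "Dates": dates,
    "Location": ["Australia", "Australia"],
    "indicator_1": [0.5, 0.7],
    "indicator_2": [3.0, 4.0],
})

# Aggregate duplicates, move Location into the columns, then put it on top.
df2 = (df_indicators.groupby(["Dates", "Location"]).mean()
       .unstack()
       .swaplevel(0, 1, axis=1))

# Align on the Dates index and sort so each country's columns sit together.
df_merge = pd.concat([df, df2], axis=1).sort_index(axis=1)
print(df_merge)
```

No indicator column is named explicitly, so this approach should carry over to any number of indicators as long as they are numeric.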

Join on previous year with additional calculations

Assume that I have a Pandas data frame that looks like this:
df = pd.DataFrame({
"YEAR":[2000,2000,2001,2001,2002],
"VISITORS":[100,2000,200,300,250],
"SALES":[5000,2500,23500,1512,3510],
"MONTH":[1,2,1,2,1],
"LOCATION":["Loc1", "Loc2", "Loc1" , "Loc2" , "Loc1"]})
I want to join this data frame with the previous year's data from the same data frame, on the MONTH and LOCATION columns.
I tried this:
def calculate(df):
    result_all_years = []
    for current_year in df["YEAR"].unique():
        df_previous = df.copy()
        df_previous = df_previous[df_previous["YEAR"] == current_year - 1]
        df_previous.rename(
            columns={
                "VISITORS": "VISITORS_LAST_YEAR",
                "SALES": "SALES_LAST_YEAR",
                "YEAR": "PREVIOUS_YEAR",
            },
            inplace=True,
        )
        df_current = df[df["YEAR"] == current_year]
        df_current = df_current.merge(
            df_previous,
            how="left",
            on=["MONTH", "LOCATION"]
        )
        # There are many similar calculations and additional columns to be added like the following:
        df_current["SALES_DIFF"] = df_current["SALES"] - df_current["SALES_LAST_YEAR"]
        result_all_years.append(df_current)
    return pd.concat(result_all_years, ignore_index=True).round(3)
The code in the calculate function works fine, but is there a faster way to do this?
Try merging the dataframe with itself and manipulating the result accordingly:
diff_df = pd.merge(df, df, left_on = [df['YEAR'], df['MONTH'], df['LOCATION']], suffixes=('', '_PREV'),
right_on = [df['YEAR']+1, df['MONTH'], df['LOCATION']])
diff_df = diff_df[['YEAR', 'YEAR_PREV', 'MONTH', 'LOCATION','VISITORS','VISITORS_PREV','SALES','SALES_PREV']]
diff_df = diff_df.assign(VISITORS_DIFF = (diff_df['VISITORS_PREV'] - diff_df['VISITORS']),
SALES_DIFF = (diff_df['SALES_PREV'] - diff_df['SALES']))
Output
YEAR YEAR_PREV MONTH LOCATION VISITORS VISITORS_PREV SALES SALES_PREV VISITORS_DIFF SALES_DIFF
2001 2000 1 Loc1 200 100 23500 5000 -100 -18500
2001 2000 2 Loc2 300 2000 1512 2500 1700 988
2002 2001 1 Loc1 250 200 3510 23500 -50 19990
IIUC, you can merge the dataframe with itself using the incremented YEAR:
(df.merge(df.assign(YEAR=df['YEAR']+1).drop(columns=['MONTH']),
on=['YEAR', 'LOCATION'],
how='left',
suffixes=('', '_LAST_YEAR'))
.assign(SALES_DIFF=lambda d: d['SALES']-d['SALES_LAST_YEAR'],
LAST_YEAR=lambda d: d['YEAR'].sub(1).mask(d['SALES_DIFF'].isna())
)
)
output:
YEAR VISITORS SALES MONTH LOCATION VISITORS_LAST_YEAR SALES_LAST_YEAR SALES_DIFF LAST_YEAR
0 2000 100 5000 1 Loc1 NaN NaN NaN NaN
1 2000 2000 2500 2 Loc2 NaN NaN NaN NaN
2 2001 200 23500 1 Loc1 100.0 5000.0 18500.0 2000.0
3 2001 300 1512 2 Loc2 2000.0 2500.0 -988.0 2000.0
4 2002 250 3510 1 Loc1 200.0 23500.0 -19990.0 2001.0
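Since the question asks for a join on MONTH and LOCATION, a variant that keeps MONTH as a join key may be closer to the original loop. This is a sketch under that assumption, using the sample data from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "YEAR": [2000, 2000, 2001, 2001, 2002],
    "VISITORS": [100, 2000, 200, 300, 250],
    "SALES": [5000, 2500, 23500, 1512, 3510],
    "MONTH": [1, 2, 1, 2, 1],
    "LOCATION": ["Loc1", "Loc2", "Loc1", "Loc2", "Loc1"],
})

# Shift every row forward one year so it lines up with the following year's row.
previous = df.assign(YEAR=df["YEAR"] + 1)

# One left merge replaces the per-year loop; rows with no earlier year get NaN.
result = df.merge(
    previous,
    on=["YEAR", "MONTH", "LOCATION"],
    how="left",
    suffixes=("", "_LAST_YEAR"),
)
result["SALES_DIFF"] = result["SALES"] - result["SALES_LAST_YEAR"]
print(result)
```

On this sample the result matches the output above, because each location only ever appears with one month.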

replace row data in pandas based on another dataframe

I've a sample dataframe
name
0 Newyork
1 Los Angeles
2 Ohio
3 Washington DC
4 Kentucky
Also I've a second dataframe
name ratio
0 Newyork 1:2
1 Kentucky 3:7
2 Florida 1:5
3 SF 2:9
How can I replace the values in the name column of df2 with "Not Available" if the name is present in df1?
Desired result:
name ratio
0 Not Available 1:2
1 Not Available 3:7
2 Florida 1:5
3 SF 2:9
Use numpy.where:
import numpy as np
df2['name'] = np.where(df2['name'].isin(df1['name']), 'Not Available', df2['name'])
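A runnable version using the sample frames from the question might look like this. The intermediate name in_df1 is introduced here only for readability.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "name": ["Newyork", "Los Angeles", "Ohio", "Washington DC", "Kentucky"],
})
df2 = pd.DataFrame({
    "name": ["Newyork", "Kentucky", "Florida", "SF"],
    "ratio": ["1:2", "3:7", "1:5", "2:9"],
})

# Boolean mask: True where the df2 name also appears in df1.
in_df1 = df2["name"].isin(df1["name"])

# Pick the placeholder where the mask is True, the original name otherwise.
df2["name"] = np.where(in_df1, "Not Available", df2["name"])
print(df2)
```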

Pandas MultiIndex DataFrame Sorting

I am looking for a way to sort a column in a DataFrame with multiple index levels. In my DataFrame index level 0 is state name ("STNAME") and index level 1 is city name ("CTYNAME").
My initial DataFrame looks like this:
In:
df = census_df
df = df.set_index(["STNAME" ,"CTYNAME"])
df = df.loc[: ,["CENSUS2010POP"]]
print(df.head())
Out:
CENSUS2010POP
STNAME CTYNAME
Alabama Alabama 4779736
Autauga County 54571
Baldwin County 182265
Barbour County 27457
Bibb County 22915
However, when I try to apply sorting to "CENSUS2010POP" column it ruins all the hierarchy:
In:
df = census_df
df = df.set_index(["STNAME" ,"CTYNAME"])
df = df.loc[: ,["CENSUS2010POP"]]
df = df.sort_values("CENSUS2010POP")
print (df.head())
Out:
CENSUS2010POP
STNAME CTYNAME
Texas Loving County 82
Hawaii Kalawao County 90
Texas King County 286
Kenedy County 416
Nebraska Arthur County 460
I am wondering if there's a way to sort by both the column and the index levels.
Any help would be much appreciated.
You can add STNAME to the sort_values call:
df.sort_values(['STNAME','CENSUS2010POP'])
On random data:
np.random.seed(1)
df = pd.DataFrame({
'STNAME':[0]*4+[1]*4,
'CTYNAME':[0,1,2,3]*2,
'CENSUS2010POP':np.random.randint(10,100,8)
}).set_index(['STNAME', 'CTYNAME'])
Output is:
CENSUS2010POP
STNAME CTYNAME
0 3 19
1 22
0 47
2 82
1 1 15
3 74
0 85
2 89
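If the largest counties should come first within each state, sort_values also accepts one ascending flag per key. The sketch below uses a handful of made-up rows rather than the real census file, so the population figures are illustrative only.

```python
import pandas as pd

# Illustrative rows only; not taken from the actual census data.
df = pd.DataFrame({
    "STNAME": ["Texas", "Texas", "Hawaii", "Hawaii"],
    "CTYNAME": ["Loving County", "King County", "Kalawao County", "Honolulu County"],
    "CENSUS2010POP": [82, 286, 90, 953207],
}).set_index(["STNAME", "CTYNAME"])

# STNAME is an index level and CENSUS2010POP a column; sort_values accepts both.
# States go in ascending order, populations in descending order within each state.
result = df.sort_values(["STNAME", "CENSUS2010POP"], ascending=[True, False])
print(result)
```

Rows of the same state stay together because STNAME is the first sort key.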
