Merge two dataframes with subheaders - python

So I have my first dataframe, which has countries as headers and Infected and Dead values as subheaders:
df
Dates       Antigua & Barbuda   Australia
            Infected  Dead      Infected  Dead
2020-01-22         0     0             0     0  ...
2020-01-23         0     0             0     0  ...
...
Then I have my second dataframe:
df_indicators
Dates       Location     indicator_1  indicator_2  ...
2020-04-24  Afghanistan            0            0
2020-04-25  Afghanistan            0            0
...
2020-04-24  Yemen                  0            0
2020-04-25  Yemen                  0            0
I want to merge the dataframes so that the indicator columns become subheaders of the countries column like in df with the infected and dead subheaders.
What I want to produce is something like this:
df_merge
Dates       Antigua & Barbuda
            Infected  Dead  indicator_1  indicator_2  ...
2020-04-24         0     0            0            0
There are so many indicators, all named differently, that I don't feel I can list them all, so I'm not sure if there's a way I can do this easily.
Thank you in advance for any help!

Because there are duplicates, first aggregate by mean and then reshape with DataFrame.unstack and DataFrame.swaplevel:
df2 = df_indicators.groupby(['Dates','Location']).mean().unstack().swaplevel(0,1,axis=1)
Or with DataFrame.pivot_table:
df2 = (df_indicators.pivot_table(index='Dates', columns='Location', aggfunc='mean')
                    .swaplevel(0, 1, axis=1))
And finally join, sorting the MultiIndex in the columns:
df = pd.concat([df, df2], axis=1).sort_index(axis=1)
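A minimal self-contained sketch of the whole pipeline on toy data (the indicator names and values here are made up for illustration):
import pandas as pd

# Toy wide frame like df: countries on top, Infected/Dead underneath
cols = pd.MultiIndex.from_product(
    [['Antigua & Barbuda', 'Australia'], ['Infected', 'Dead']])
df = pd.DataFrame([[0, 0, 0, 0], [1, 0, 2, 1]],
                  index=pd.Index(['2020-04-24', '2020-04-25'], name='Dates'),
                  columns=cols)

# Toy long frame like df_indicators (indicator names are made up)
df_indicators = pd.DataFrame({
    'Dates': ['2020-04-24', '2020-04-25'] * 2,
    'Location': ['Antigua & Barbuda'] * 2 + ['Australia'] * 2,
    'indicator_1': [0, 1, 2, 3],
    'indicator_2': [4, 5, 6, 7],
})

# Aggregate duplicates by mean, then lift Location into the top column level
df2 = (df_indicators.groupby(['Dates', 'Location']).mean()
                    .unstack()
                    .swaplevel(0, 1, axis=1))

# Join, then sort the column MultiIndex so each country's columns sit together
df_merge = pd.concat([df, df2], axis=1).sort_index(axis=1)
print(df_merge)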

Related

Counting String Values in Pivot Across Multiple Columns

I'd like to use Pandas to pivot a table into multiple columns, and get the count of their values.
In this example table:
LOCATION  ADDRESS   PARKING TYPE
AAA0001   123 MAIN  LARGE LOT
AAA0001   123 MAIN  SMALL LOT
AAA0002   456 TOWN  LARGE LOT
AAA0003   789 AVE   MEDIUM LOT
AAA0003   789 AVE   MEDIUM LOT
How do I pivot out this table to show total counts of each string within "Parking Type"? Maybe my mistake is calling this a "pivot?"
Desired output:
LOCATION  ADDRESS   SMALL LOT  MEDIUM LOT  LARGE LOT
AAA0001   123 MAIN          1           0          1
AAA0002   456 TOWN          0           0          1
AAA0003   789 AVE           0           2          0
Currently, I have a pivot going, but it is only counting the values of the first column, and leaving everything else as 0s. Any guidance would be amazing.
Current Code:
pivot = pd.pivot_table(df, index=["LOCATION"], columns=['PARKING TYPE'], aggfunc=len)
pivot = pivot.reset_index()
pivot.columns = pivot.columns.to_series().apply(lambda x: "".join(x))
You could use pd.crosstab:
out = (pd.crosstab(index=[df['LOCATION'], df['ADDRESS']], columns=df['PARKING TYPE'])
         .reset_index()
         .rename_axis(columns=[None]))
or you could use pivot_table (but you have to pass "ADDRESS" into the index as well):
out = (pd.pivot_table(df, index=['LOCATION', 'ADDRESS'], columns=['PARKING TYPE'],
                      values='ADDRESS', aggfunc=len, fill_value=0)
         .reset_index()
         .rename_axis(columns=[None]))
Output:
LOCATION ADDRESS LARGE LOT MEDIUM LOT SMALL LOT
0 AAA0001 123 MAIN 1 0 1
1 AAA0002 456 TOWN 1 0 0
2 AAA0003 789 AVE 0 2 0
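For reference, a self-contained sketch that rebuilds the sample table and reproduces this output with the crosstab approach (the frame construction here is my own, not from the original post):
import pandas as pd

# Rebuild the sample table from the question
df = pd.DataFrame({
    'LOCATION': ['AAA0001', 'AAA0001', 'AAA0002', 'AAA0003', 'AAA0003'],
    'ADDRESS': ['123 MAIN', '123 MAIN', '456 TOWN', '789 AVE', '789 AVE'],
    'PARKING TYPE': ['LARGE LOT', 'SMALL LOT', 'LARGE LOT',
                     'MEDIUM LOT', 'MEDIUM LOT'],
})

# Count parking types per (LOCATION, ADDRESS) pair
out = (pd.crosstab(index=[df['LOCATION'], df['ADDRESS']],
                   columns=df['PARKING TYPE'])
         .reset_index()
         .rename_axis(columns=[None]))
print(out)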
You can use get_dummies() and then a grouped sum to get a row per your groups:
>>> pd.get_dummies(df, columns=['PARKING TYPE']).groupby(['LOCATION','ADDRESS'],as_index=False).sum()
LOCATION ADDRESS PARKING TYPE_LARGE LOT PARKING TYPE_MEDIUM LOT PARKING TYPE_SMALL LOT
0 AAA0001 123 MAIN 1 0 1
1 AAA0002 456 TOWN 1 0 0
2 AAA0003 789 AVE 0 2 0
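If the PARKING TYPE_ prefix that get_dummies adds is unwanted, one option (a small sketch of my own, not part of the answer above) is to strip it afterwards:
out = (pd.get_dummies(df, columns=['PARKING TYPE'])
         .groupby(['LOCATION', 'ADDRESS'], as_index=False)
         .sum())
# get_dummies names the dummy columns "PARKING TYPE_<value>"; drop the prefix
out.columns = [c.replace('PARKING TYPE_', '') for c in out.columns]
print(out)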

Summary of data for each month

I have health diagnosis data for the last year, and I would like to get the count of diagnoses for each month. Here is my data:
import pandas as pd

cars2 = {'ID': [22, 100, 47, 35, 60],
         'Date': ['2020-04-11', '2021-04-12', '2020-05-13', '2020-05-14', '2020-06-15'],
         'diagnosis': ['bacteria sepsis', 'bacteria sepsis', 'Sepsis', 'Risk sepsis', 'Neonatal sepsis'],
         'outcome': ['alive', 'alive', 'dead', 'alive', 'dead']}
df2 = pd.DataFrame(cars2, columns=['ID', 'Date', 'diagnosis', 'outcome'])
print(df2)
How can I get diagnosis counts for each month? For example, how many diagnoses of bacteria sepsis did we have in a given month? The final result should be a table showing value counts of diagnosis for each month.
If you want to see results per month, you can use pivot_table.
df2.pivot_table(index=['outcome','diagnosis'], columns=pd.to_datetime(df2['Date']).dt.month, aggfunc='size', fill_value=0)
Date 4 5 6
outcome diagnosis
alive Risk sepsis 0 1 0
bacteria sepsis 2 0 0
dead Neonatal sepsis 0 0 1
Sepsis 0 1 0
4, 5 and 6 are the months in your dataset.
Try playing around with the parameters here; you might be able to land on a view that better suits your ideal result.
I modified your dataframe by setting the Date column as index:
import pandas as pd

cars2 = {'ID': [22, 100, 47, 35, 60],
         'Date': ['2020-04-11', '2021-04-12', '2020-05-13', '2020-05-14', '2020-06-15'],
         'diagnosis': ['bacteria sepsis', 'bacteria sepsis', 'Sepsis', 'Risk sepsis', 'Neonatal sepsis'],
         'outcome': ['alive', 'alive', 'dead', 'alive', 'dead']}
df2 = pd.DataFrame(cars2, columns=['ID', 'Date', 'diagnosis', 'outcome'])
df2.index = pd.to_datetime(df2['Date'])  # set the Date column as the index (converted to datetime)
df2.drop('Date', inplace=True, axis=1)   # drop the original Date column
print(df2)
If you group the dataframe by a pd.Grouper and the columns you want to group with (diagnosis and outcome):
df2.groupby([pd.Grouper(freq='M'), 'diagnosis','outcome']).count()
Output:
ID
Date diagnosis outcome
2020-04-30 bacteria sepsis alive 1
2020-05-31 Risk sepsis alive 1
Sepsis dead 1
2020-06-30 Neonatal sepsis dead 1
2021-04-30 bacteria sepsis alive 1
Note: the freq='M' in pd.Grouper groups the dataframe by month. You can read more about the freq argument in the pandas documentation on offset aliases.
Edit: Assign the grouped dataframe to new_df and reset the index levels other than Date:
new_df = df2.groupby([pd.Grouper(freq='M'), 'diagnosis','outcome']).count()
new_df.reset_index(level=[1,2],inplace=True)
Iterate over each month and get the table separately inside df_list:
import numpy as np  # np.unique is used below

df_list = []  # this will contain a separate table for each month
for month in np.unique(new_df.index):
    df_list += [pd.DataFrame(new_df.loc[[month]])]

df_list[0]  # get the first dataframe in df_list
will return:
diagnosis outcome ID
Date
2020-04-30 bacteria sepsis alive 1
First you need to create a month variable through the to_datetime() function; then you can group by month and take value_counts() within each month:
import pandas as pd
df2['month'] = pd.to_datetime(df2['Date']).dt.month
df2.groupby('month').apply(lambda x: x['diagnosis'].value_counts())
month
4 bacteria sepsis 2
5 Risk sepsis 1
Sepsis 1
6 Neonatal sepsis 1
Name: diagnosis, dtype: int64
I think that by "for each month" you mean not just the month figure, but the year-month combination. As such, let's approach it as follows:
First, we create a 'year-month' column from the Date column. Then we use .groupby() on this new year-month column and take .value_counts() on the diagnosis column, as follows:
df2['year-month'] = pd.to_datetime(df2['Date']).dt.strftime("%Y-%m")
df2.groupby('year-month')['diagnosis'].value_counts().to_frame(name='Count').reset_index()
Result:
year-month diagnosis Count
0 2020-04 bacteria sepsis 1
1 2020-05 Risk sepsis 1
2 2020-05 Sepsis 1
3 2020-06 Neonatal sepsis 1
4 2021-04 bacteria sepsis 1
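As a closely related variant (a sketch, not from the original answers): pandas Period objects encode the year and the month together, so grouping on dt.to_period('M') gives the same year-month breakdown without formatting strings:
# Period('M') carries both year and month, so no strftime is needed
df2['month'] = pd.to_datetime(df2['Date']).dt.to_period('M')
df2.groupby('month')['diagnosis'].value_counts().to_frame(name='Count').reset_index()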

How to suppress a pandas dataframe?

I have this data frame:
                      age     Income   Job yrs
Churn Own Home
0     0         39.497576  42.540247  7.293301
      1         42.667392  58.975215  8.346974
1     0         44.499774  45.054619  7.806146
      1         47.615546  60.187945  8.525210
Born from this line of code:
gb = df3.groupby(['Churn', 'Own Home'])[['age', 'Income', 'Job yrs']].mean()
I want to "suppress" or unstack this data frame so that it looks like this:
   Churn  Own Home    age  Income  Job yrs
0      0         0  39.49   42.54     7.29
1      0         1  42.66   58.97     8.34
2      1         0  44.49   45.05     7.80
3      1         1  47.61   60.18     8.52
I have tried using both .stack() and .unstack() with no luck, also I was not able to find anything online talking about this. Any help is greatly appreciated.
Your DataFrame has a MultiIndex that you can revert to a single index using the command:
gb.reset_index(level=[0,1])
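A minimal self-contained sketch of the round trip, with made-up numbers standing in for the original data:
import pandas as pd

# Toy stand-in for df3 (the numbers are made up)
df3 = pd.DataFrame({
    'Churn':    [0, 0, 1, 1],
    'Own Home': [0, 1, 0, 1],
    'age':      [39.5, 42.7, 44.5, 47.6],
    'Income':   [42.5, 59.0, 45.1, 60.2],
    'Job yrs':  [7.3, 8.3, 7.8, 8.5],
})

gb = df3.groupby(['Churn', 'Own Home'])[['age', 'Income', 'Job yrs']].mean()

# Move both index levels back into ordinary columns
flat = gb.reset_index(level=[0, 1])
print(flat)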

multiple files combination in pandas

Assume file1 is:
State Date
0 NSW 01/02/16
1 NSW 01/03/16
3 VIC 01/04/16
...
100 TAS 01/12/17
File 2 is:
State 01/02/16 01/03/16 01/04/16 .... 01/12/17
0 VIC 10000 12000 14000 .... 17600
1 NSW 50000
....
Now I would like to join these two files based on Date.
In other words, I want to match file1's Date column against file2's date columns.
I believe you need melt with merge; the on parameter can be omitted here because the merge uses all the columns that are shared by both DataFrames:
df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
df = df2.melt('State', var_name='Date', value_name='col').merge(df1, how='right')
print (df)
State Date col
0 NSW 01/02/16 50000.0
1 NSW 01/03/16 NaN
2 VIC 01/04/16 14000.0
3 TAS 01/12/17 NaN
Solution with left join:
df = df1.merge(df2.melt('State', var_name='Date', value_name='col'), how='left')
print (df)
State Date col
0 NSW 01/02/16 50000.0
1 NSW 01/03/16 NaN
2 VIC 01/04/16 14000.0
3 TAS 01/12/17 NaN
You can melt the second data frame to a long format, then merge it with the first data frame to get the values.
import pandas as pd

df1 = pd.DataFrame({'State': ['NSW', 'NSW', 'VIC'],
                    'Date': ['01/02/16', '01/03/16', '01/04/16']})
df2 = pd.DataFrame([['VIC', 10000, 12000, 14000],
                    ['NSW', 50000, 60000, 62000]],
                   columns=['State', '01/02/16', '01/03/16', '01/04/16'])

df1.merge(pd.melt(df2, id_vars=['State'], var_name='Date'), on=['State', 'Date'])
# returns:
Date State value
0 01/02/16 NSW 50000
1 01/03/16 NSW 60000
2 01/04/16 VIC 14000

Pandas Multi-Index EWMA: Comparing same minute over multiple days

I am trying to plug a data set into Pandas and am doing something a bit unique with the approach.
I have a data set that looks like the following:
Date, Time, Venue, Volume, SummedVolume
2015-09-14, 09:30, NYSE, 1000, 10000
2015-09-14, 09:31, NYSE, 1100, 10100
However, I have this data sliced by minute per date. I have files going back a number of days, so I call a certain number of them and concat them into my DataFrame, typically using the last 20 days.
What I would like to do is use pandas ewma to do an EWMA on the exact same minute of the day, across those 20 days, by Venue. The result would compare the 09:30 minute across the last 20 days for NYSE, using an alpha of 0.5 (which I think would be span=20 in this case). Obviously, sorting the data so that the oldest data is at the back and the newest data is at the front is critical, so I am doing that as well; the data can't be in a random order.
Right now I am able to get pandas to do simple math (means, etc.) on this data set using groupby on Time and Venue (shown below). However, when I try to do an ewma on this, I get errors about not being able to do an ewma on a non-unique data set, which is reasonable. But adding the Date into the MultiIndex kind of wrecks being able to compare the same exact minute to that minute on other dates.
Can anyone think of a solution here?
frame = pd.DataFrame()
concat = []
for fn in files:
    df = pd.read_csv(fn, index_col=None, header=0)
    concat.append(df)
frame = pd.concat(concat)
df = pd.DataFrame(frame)

if conf == "VenueStats":
    grouped = df.groupby(['time', 'Venue'], sort=True)
elif conf == "SymbolStats":
    grouped = df.groupby(['time', 'Symbol'], sort=True)

stats = grouped.mean().astype(int)
stats.to_csv('out.csv')
Initial output from df.head() before the mean (I changed the Venue names and values to 0 since this is sensitive information):
Date Time Venue Volume SummedVolume
0 2015-09-14 17:00 NYSE 0 0
1 2015-09-14 17:00 ARCA 0 0
2 2015-09-14 17:00 AMEX 0 0
3 2015-09-14 17:00 NASDAQ 0 0
4 2015-09-14 17:00 BATS 0 0
Output from stats.head() after the mean:
Volume SummedVolume
Time Venue
00:00 NYSE 0 0
ARCA 0 0
AMEX 0 0
NASDAQ 0 0
BATS 0 0
Here is what changes from doing a mean (above) to when I try to do the ewma:
for fn in files:
    df = pd.read_csv(fn, index_col=[0,1,2], header=0)  # 0=Date, 1=Time, 2=Venue
    concat.append(df)
frame = pd.concat(concat)
df = pd.DataFrame(frame, columns=['Volume', 'SummedVolume'])

if conf == "VenueStats":
    stats = df.groupby(df.index).apply(lambda x: pd.ewma(x, span=20))
elif conf == "SymbolStats":
    stats = df.groupby(df.index).apply(lambda x: pd.ewma(x, span=20))
Here is the df.head() from the ewma version and the stats.head() from the ewma version (they look the same):
Volume SummedVolume
Date Time Venue
2015-09-14 17:00 NYSE 0 0
ARCA 0 0
AMEX 0 0
NASDAQ 0 0
BATS 0 0
Volume SummedVolume
Date Time Venue
2015-09-14 17:00 NYSE 0 0
ARCA 0 0
AMEX 0 0
NASDAQ 0 0
BATS 0 0
You want to pivot your data so that dates run down one axis and time along the other.
It is difficult to work on this problem without some reproducible data, but the solution would be something like this:
df2 = (df.reset_index()
         .groupby(['tradeDate', 'time', 'exchange'])
         .first()  # given that the data is unique by the selected grouping
         .unstack(['exchange', 'time']))
pd.ewma(df2, span=20)
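Note that pd.ewma was removed in later pandas versions. Here is a sketch of the same idea with the current ewm accessor, using toy data and the Date/Time/Venue column names from the question:
import pandas as pd

# Toy long frame with one row per (Date, Time, Venue); values are placeholders
df = pd.DataFrame({
    'Date': ['2015-09-14', '2015-09-14', '2015-09-15', '2015-09-15'],
    'Time': ['09:30', '09:31', '09:30', '09:31'],
    'Venue': ['NYSE'] * 4,
    'Volume': [1000, 1100, 1050, 1150],
})

# Pivot so dates run down the rows and (Venue, Time) across the columns;
# each column then holds the same minute of day across successive days
wide = df.pivot_table(index='Date', columns=['Venue', 'Time'], values='Volume')

# EWMA down each column, i.e. smoothing each venue/minute pair across days
smoothed = wide.ewm(span=20).mean()
print(smoothed)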
