Selective summation of columns in a pandas dataframe - python

The COVID-19 tracking project (api described here) provides data on many aspects of the pandemic. Each row of the JSON is one day's data for one state. As many people know, the pandemic is hitting different states differently -- New York and its neighbors hardest first, with other states being hit later. Here is a subset of the data:
date,state,positive,negative
20200505,AK,371,22321
20200505,CA,56212,723690
20200505,NY,321192,707707
20200505,WY,596,10319
20200504,AK,370,21353
20200504,CA,54937,692937
20200504,NY,318953,688357
20200504,WY,586,9868
20200503,AK,368,21210
20200503,CA,53616,662135
20200503,NY,316415,669496
20200503,WY,579,9640
20200502,AK,365,21034
20200502,CA,52197,634606
20200502,NY,312977,646094
20200502,WY,566,9463
To get the entire data set I am doing this:
import pandas as pd
all_states = pd.read_json("https://covidtracking.com/api/v1/states/daily.json")
I would like to be able to summarize the data by adding up the values for one column, but only for certain states; and then adding up the same column, for the states not included before. I was able to do this, for instance:
not_NY = all_states[all_states['state'] != 'NY'].groupby(['date'], as_index = False).hospitalizedCurrently.sum()
This creates a new dataframe from all_states, grouped by date, and summing for all the states that are not "NY". What I want to do, though, is exclude multiple states with something like a "not in" function (this doesn't work):
not_tristate = all_states[all_states['state'] not in ['NY','NJ','CT']].groupby(['date'], as_index = False).hospitalizedCurrently.sum()
Is there a way to do that? An alternate approach I tried is to create a new dataframe as a pivot table, with one row per date, one column per state, like this:
pivot_states = all_states.pivot_table(index = 'gooddate', columns = 'state', values = 'hospitalizedCurrently', aggfunc='sum')
but this still leaves me with creating new columns from summing only some columns. In SQL, I would solve the problem like this:
SELECT all_states.Date AS [Date], Sum(IIf([all_states]![state] In ("NY","NJ","CT"),[all_states]![hospitalizedCurrently],0)) AS tristate, Sum(IIf([all_states]![state] Not In ("NY","NJ","CT"),[all_states]![hospitalizedCurrently],0)) AS not_tristate
FROM all_states
GROUP BY all_states.Date
ORDER BY all_states.Date;
The end result I am looking for is like this (using the sample data above and summing on the 'positive' column, with 'NY' standing in for 'tristate'):
date,not_tristate,tristate,total
20200502,53128,312977,366105
20200503,54563,316415,370978
20200504,55893,318953,374846
20200505,57179,321192,378371
Any help would be welcome.

To get the expected output, you can group by 'date' together with an np.where mask of whether the state isin the states you want, sum 'positive', then unstack and assign to get the total column:
import numpy as np

df_f = all_states.groupby(['date',
                           np.where(all_states['state'].isin(["NY", "NJ", "CT"]),
                                    'tristate', 'not_tristate')])\
                 ['positive'].sum()\
                 .unstack()\
                 .assign(total=lambda x: x.sum(axis=1))
print(df_f)
          not_tristate  tristate   total
date
20200502         53128    312977  366105
20200503         54563    316415  370978
20200504         55893    318953  374846
20200505         57179    321192  378371
Or with pivot_table, you get a similar result with:
print(all_states.assign(state=np.where(all_states['state'].isin(["NY", "NJ", "CT"]),
                                       'tristate', 'not_tristate'))
                .pivot_table(index='date', columns='state', values='positive',
                             aggfunc='sum', margins=True))
state     not_tristate  tristate      All
date
20200502         53128    312977   366105
20200503         54563    316415   370978
20200504         55893    318953   374846
20200505         57179    321192   378371
All             220763   1269537  1490300
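If you want the result shaped exactly like the CSV sample in the question (a 'total' column instead of pandas' 'All' margin, and date back as a regular column), here is a minimal variation of the same idea, shown only as a sketch:
import numpy as np

# same grouping as above, but with an explicit 'total' column and a flat index
out = (all_states.assign(state=np.where(all_states['state'].isin(["NY", "NJ", "CT"]),
                                        'tristate', 'not_tristate'))
                 .pivot_table(index='date', columns='state',
                              values='positive', aggfunc='sum')
                 .assign(total=lambda x: x.sum(axis=1))
                 .reset_index())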

You can exclude multiple states by using isin together with the NOT operator (~):
all_states[~(all_states['state'].isin(["NY", "NJ", "CT"]))]
So, your code would be:
not_tristate = all_states[~(all_states['state'].isin(['NY','NJ','CT']))].groupby(['date'], as_index = False).hospitalizedCurrently.sum()
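If you also need the complementary tristate sum alongside it, one possible follow-up (a sketch only, using the 'positive' column from the sample data and illustrative column names) is to build both groupings from the same mask and merge them on date:
mask = all_states['state'].isin(['NY', 'NJ', 'CT'])

# sum one column per date for each group, then merge the two results
tristate = (all_states[mask].groupby('date', as_index=False)['positive'].sum()
                            .rename(columns={'positive': 'tristate'}))
not_tristate = (all_states[~mask].groupby('date', as_index=False)['positive'].sum()
                                 .rename(columns={'positive': 'not_tristate'}))

summary = not_tristate.merge(tristate, on='date')
summary['total'] = summary['not_tristate'] + summary['tristate']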

Related

Is there a better way to group by a category, and then select values based on different column values in Pandas?

I have an issue where I want to group by a date column, sort by a time column, and grab the resulting values in the values column.
The data looks something like this:
        time      value        date
0  12.850000  19.195359  08-22-2019
1   9.733333  13.519543  09-19-2019
2  14.083333   9.191413  08-26-2019
3  16.616667  18.346598  08-19-2019
...
Where every date can occur multiple times, recording values at different points
during the day.
I wanted to group by date, and extract the minimum and maximum values of those groupings so I did this:
dayMin = df.groupby('date').value.min()
which gives me a Series object that is fairly easy to manipulate. The issue
comes up when I want to group by 'date', sort by 'time', then grab the 'value'.
What I did was:
dayOpen = df.groupby('date').apply(lambda df: df[ df.time == df.time.min() ])['value']
which almost worked, resulting in a DataFrame of:
date
08-19-2019  13344    17.573522
08-20-2019  12798    19.496609
08-21-2019   2009    20.033917
08-22-2019   5231    19.393700
08-23-2019  12848    17.784213
08-26-2019    417     9.717627
08-27-2019   6318     7.630234
I figured out how to clean up those nasty indexes to the left, name the column, and even concat with my dayMin Series to achieve my goal.
Ultimately my question is if there is a nicer way to perform these data manipulations that follow the general pattern of: "Group by column A, perform filtering or sorting operation on column B, grab resulting values from column C" for future applications.
Thank you in advance :)
You can sort the data frame before calling groupby:
first_of_day = df.sort_values('time').groupby('date').head(1)
This should work for you:
df.sort_values('time').groupby(['date'])['value'].agg([('Min' , 'min'), ('Max', 'max')])
For this small example, the result is a DataFrame indexed by date with 'Min' and 'Max' columns (result screenshot omitted).
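For the general pattern "group by column A, pick the row with the smallest (or largest) column B, grab column C", another common idiom, sketched here with the column names from the question, is idxmin/idxmax followed by .loc:
# index labels of the rows with the earliest 'time' within each 'date',
# then pull the 'value' at those rows
idx = df.groupby('date')['time'].idxmin()
day_open = df.loc[idx, ['date', 'value']].set_index('date')['value']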

Remove extra index in a dataframe

I would like to remove the extra index called service_type_id, which I have not included in my code but which appears for no apparent reason. I am using Python.
My code is
data_tr = data.groupby(['transaction_id', 'service_type']).sum().unstack().reset_index().fillna(0).set_index('transaction_id')
The output is a table with an extra index level (screenshot omitted).
I believe it has something to do with the groupby and unstack. Kindly highlight why there is an extra index and what my code should be.
The dataset
https://drive.google.com/file/d/1XZVfXbgpV0l3Oewgh09Vw5lVCchK0SEh/view?usp=sharing
I hope pandas.DataFrame.droplevel can do the job for your query.
import pandas as pd
df = pd.read_csv('Dataset - Transaction.csv')
data_tr = df.groupby(['transaction_id', 'service_type']).sum().unstack().reset_index().fillna(0).set_index('transaction_id').droplevel(0,1)
data_tr.head(2)
Output
df.groupby(['transaction_id', 'service_type']).sum() takes the sum of the numerical field service_type_id:
data_tr = df.groupby(['transaction_id', 'service_type']).sum().unstack()
print(data_tr.columns)
MultiIndex([('service_type_id', '3 Phase Wiring'),
            ('service_type_id', 'AV Equipment'),
            ...
            ('service_type_id', 'Yoga Lessons'),
            ('service_type_id', 'Zumba Classes')],
           names=[None, 'service_type'], length=188)
#print(data_tr.info())
Initially there was only one column (service_type_id) and two index levels (transaction_id, service_type). After you unstack, service_type becomes part of tuple-like column labels (a MultiIndex), where each service type holds the value of service_type_id. droplevel(0, 1) converts the columns from a MultiIndex to a single Index, as follows:
print(data_tr.columns)
Index(['3 Phase Wiring', ......, 'Zumba Classes'],
      dtype='object', name='service_type', length=188)
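For a quick, standalone feel for what droplevel(0, 1) does to MultiIndex columns, here is a toy example unrelated to the transaction dataset:
import pandas as pd

toy = pd.DataFrame([[1, 2]],
                   columns=pd.MultiIndex.from_tuples(
                       [('service_type_id', '3 Phase Wiring'),
                        ('service_type_id', 'AV Equipment')]))

print(toy.columns)                       # two-level column MultiIndex
print(toy.droplevel(0, axis=1).columns)  # Index(['3 Phase Wiring', 'AV Equipment'], ...)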
It looks like you are trying to make a pivot table of transaction_id and service_type, using service_type_id as the value. The reason you are getting the extra index is that sum generates a sum for every (numerical) column.
For insight, try executing just
data.groupby(['transaction_id', 'service_type']).sum()
Since the data uses the label service_type_id, I assume the sum actually only serves the purpose of getting the id value out. A cleaner way to get the desired result is using a pivot:
data_tr = data[['transaction_id', 'service_type', 'service_type_id']] \
              .pivot(index='transaction_id',
                     columns='service_type',
                     values='service_type_id') \
              .fillna(0)
Depending on how you like your data structure, you can follow up with a .reset_index()
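One caveat to be aware of (an assumption about the data, not something verified against the linked CSV): pivot raises a ValueError if any (transaction_id, service_type) pair appears more than once. If that happens, pivot_table with an explicit aggfunc is the more forgiving option:
# handles duplicate (transaction_id, service_type) pairs by taking the first id
data_tr = data.pivot_table(index='transaction_id',
                           columns='service_type',
                           values='service_type_id',
                           aggfunc='first').fillna(0)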

groupby.mean function dividing by pre-group count rather than post-group count

So I have the following dataset of trade flows that track imports, exports, by reporting country and partner countries. After I remove some unwanted columns, I edit my data frame such that trade flows between country A and country B is showing. I'm left with something like this:
[screenshot of my data frame omitted]
My issue is that I want to be able to take the average of imports and exports for every partner country ('partner_code') per year, but when I run the following:
x = df[(df.location_code.isin(["IRN"])) &
       (df.partner_code.isin(['TCD']))]
grouped = x.groupby(['partner_code']).mean()
I end up getting an average of all exports divided by all instances where there is a 'product_id' (so a much higher number) rather than averaging imports or exports by total for all the years.
Taking the average of five example export values gives an incorrect result (screenshots of the sample values and the wrong average omitted).
In pandas, we can group by multiple columns; based on my understanding, you want to group by partner, country, and year.
The following line would work:
df = df.groupby(['partner_code', 'location_code', 'year'])[['import_value', 'export_value']].mean()
Please note that the resulting dataframe has a MultiIndex.
For reference, the official documentation: DataFrame.groupby documentation
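If you would rather have flat columns than a MultiIndex, a short follow-up sketch is to reset the index after the aggregation:
result = (df.groupby(['partner_code', 'location_code', 'year'])
            [['import_value', 'export_value']]
            .mean()
            .reset_index())  # partner_code, location_code, year become regular columns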

Percent Change in for-loop

I have a dataframe where I've set both the District and the Year as a multilevel index. I want to calculate the percentage change for each column ('DEM', 'REP', etc) for each district for each year.
I have consulted this previous SO question and tried using the following code:
for idx, districts_bydistrict_select in districts_bydistrict.groupby(level=[0, 1]):
    y = districts_bydistrict.pct_change()
    print(pd.DataFrame(y))
However it is not recognizing to start the pct_change() calculation over when there is a new District. I realize I am probably missing some part of the for-loop.
You can simply specify the district level in your groupby, so the calculation restarts for each district:
districts_bydistrict.groupby(level='DISTRICTS').pct_change()
You can unstack the districts so that you just have time in the index, compute pct_change, and then restack the districts.
districts_bydistrict.unstack('DISTRICTS').pct_change().stack()
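A self-contained toy illustration of restarting pct_change per district (made-up numbers; the level name 'DISTRICTS' is an assumption borrowed from the unstack example above):
import pandas as pd

idx = pd.MultiIndex.from_product([['A', 'B'], [2018, 2019, 2020]],
                                 names=['DISTRICTS', 'Year'])
toy = pd.DataFrame({'DEM': [10, 12, 15, 20, 18, 27],
                    'REP': [5, 5, 10, 8, 4, 4]}, index=idx)

# the percentage change restarts at NaN for each district
print(toy.groupby(level='DISTRICTS').pct_change())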

Iterate through and overwrite specific values in a pandas dataframe

I have a large dataframe collating a bunch of basketball data (screenshot below). Every column to the right of Opp Lineup is a dummy variable indicating if that player (indicated in the column name) is in the current lineup (the last part of the column name is team name, which needs to be compared to the opponent column to make sure two players with the same number and name on different teams don't mess it up). I know several ways of iterating through a pandas dataframe (iterrows, itertuples, iteritems), but I don't know the way to accomplish what I need to, which is for each line in each column:
Compare the team (columnname.split()[2:]) to the Opponent column (except for LSU players)
See if the name (columnname.split()[:2]) is in Opp Lineup or, for LSU players, lineup
If the above conditions are satisfied, replace that value with 1, otherwise leave it as 0
What is the best method for looping through the dataframe and accomplishing this task? Speed doesn't really matter in this instance. I understand all of the logic involved, except I'm not familiar enough with pandas to know how to loop through it, and trying various things I've seen on Google isn't working.
Consider a reshape/pivot solution as your data is in wide format but you need to compare values row-wise in long format. So, first melt your data so all column headers become an actual column 'Player' and its corresponding value to 'IsInLineup'. Run your conditional comparison for dummy values, and then pivot back to original structure with players across column headers. Of course, I do not have actual data to test this example fully.
# MELT
reshapedf = pd.melt(df, id_vars=['Opponent', 'Lineup', 'Minutes', 'Plus Minus',
                                 'Plus Minus Per Minute', 'Opp Lineup'],
                    var_name='Player', value_name='IsInLineup')

# APPLY FUNCTION (SPLITTING VALUE AND THEN JOINING FOR SUBSET STRING)
reshapedf['IsInLineup'] = reshapedf.apply(
    lambda row: (' '.join(row['Player'].split(' ')[:2]) in row['Opp Lineup'] and
                 ' '.join(row['Player'].split(' ')[2:]) in row['Opponent']) * 1, axis=1)

# PIVOT (UNMELT)
df2 = reshapedf.pivot_table(index=['Opponent', 'Lineup', 'Minutes', 'Plus Minus',
                                   'Plus Minus Per Minute', 'Opp Lineup'],
                            columns='Player').reset_index()
df2.columns = df2.columns.droplevel(0).rename(None)
df2.columns = df.columns
If the above lambda function looks a little complex, try the equivalent apply with a defined function:
# APPLY FUNCTION (SPLITTING VALUE AND THEN JOINING FOR SUBSET STRING)
def f(row):
    if (' '.join(row['Player'].split(' ')[:2]) in row['Opp Lineup'] and
            ' '.join(row['Player'].split(' ')[2:]) in row['Opponent']):
        return 1
    else:
        return 0

reshapedf['IsInLineup'] = reshapedf.apply(f, axis=1)
I ended up using a workaround. I iterated through the dataframe with df.iterrows, and on each iteration built a temporary list in which I checked for the value I wanted and appended 0 or 1. Then I simply inserted that list into the dataframe. Possibly not the most memory-efficient approach, but it worked.
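A rough sketch of that iterrows workaround, with a purely hypothetical player column name since the real column headers are not shown (the LSU special case from the question is omitted for brevity):
# hypothetical column header of the form '<first> <last> <team>'
col = 'John Smith StateU'
name = ' '.join(col.split(' ')[:2])   # 'John Smith'
team = ' '.join(col.split(' ')[2:])   # 'StateU'

flags = []
for _, row in df.iterrows():
    # 1 if the player's team matches the opponent and the name appears in the opposing lineup
    in_lineup = (team in row['Opponent']) and (name in row['Opp Lineup'])
    flags.append(1 if in_lineup else 0)

df[col] = flags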
