Grouping with Python

I have a dataset I am trying to group by some common values and then sum up some other values. The tricky part is that I want to add some sort of weighting that keeps the largest number; I'll try to elaborate below:
I've created a dummy data frame that is along the lines of my data just for example purposes:
import pandas as pd

df = pd.DataFrame({'Family': ['Contactors', 'Contactors', 'Contactors'],
                   'Cell': ['EP&C', 'EXR', 'C&S'],
                   'Visits': ['25620', '626', '40']})
This produces a table like so:
       Family  Cell Visits
0  Contactors  EP&C  25620
1  Contactors   EXR    626
2  Contactors   C&S     40
So, in this example I would want all of the 'Contactors' to be grouped up under EP&C (as this has the highest visits to start with), but I would like all of the visits summed up and the other 'Cell' values dropped, so I would be left with something like this:
       Family  Cell  Visits
0  Contactors  EP&C   26286
Could anyone advise?
Thanks.

IIUC, you can use:
(df
 # convert Visits to numeric (it is stored as strings)
 .assign(Visits=pd.to_numeric(df['Visits']))
 # ensure the top row per group has the highest visits
 .sort_values(by=['Family', 'Visits'], ascending=False)
 # form groups per Family
 .groupby('Family', sort=False, as_index=False)
 # aggregate per group: Cell (first row, i.e. the top) and Visits (sum of rows)
 .agg({'Cell': 'first', 'Visits': 'sum'})
)
output:
Family Cell Visits
0 Contactors EP&C 26286
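If you prefer named aggregation (available in pandas 0.25+), an equivalent spelling is the sketch below; it reuses the dummy frame from the question and produces the same single row:
import pandas as pd

df = pd.DataFrame({'Family': ['Contactors', 'Contactors', 'Contactors'],
                   'Cell': ['EP&C', 'EXR', 'C&S'],
                   'Visits': ['25620', '626', '40']})

out = (df
       .assign(Visits=pd.to_numeric(df['Visits']))   # Visits is stored as strings
       .sort_values('Visits', ascending=False)       # highest visits first
       .groupby('Family', as_index=False)
       .agg(Cell=('Cell', 'first'), Visits=('Visits', 'sum'))
       )
print(out)   # Family Contactors, Cell EP&C, Visits 26286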

Related

Calculating the moving average for each unique value in a column

I have a csv file with a value that increases with time for n cities, like this:
city,date,value
saopaulo,2020-01-01,5
riodejaneiro,2020-01-01,3
curitiba,2020-01-01,7
...
saopaulo,2020-05-01,31
riodejaneiro,2020-05-01,55
curitiba,2020-05-01,41
What I want to do is to calculate the moving average of the column "value", but for each "city" separately.
I loaded the csv into a pandas dataframe, but if I calculate df["value"].rolling(3), it will calculate the moving average but for all the cities together.
What I want is to create a new column with the moving average but for each city. I was thinking about groupby, but I don't know exactly how to implement this.
You can use groupby:
df.groupby('city')['value'].rolling(3).mean()
To assign:
df['roll'] = df.groupby('city')['value'].rolling(3).mean().droplevel(0)
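If you want the rows ordered by date within each city before rolling, a transform keeps the output aligned with the original rows. A minimal sketch; the filename is hypothetical and it assumes df has a unique index so the assignment aligns by label:
import pandas as pd

df = pd.read_csv('cities.csv')   # hypothetical filename for the csv in the question

df['roll'] = (df
              .sort_values(['city', 'date'])
              .groupby('city')['value']
              .transform(lambda s: s.rolling(3).mean()))   # aligned back to df by index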
Here you go:
def rolling_mean(group: pd.DataFrame) -> pd.DataFrame:
    # Whatever operation you want to do with the cities.
    # For each city, `group` will be a dataframe of that city's rows without the city column.
    # I'm guessing you'd like to set the date as a sorted index and calculate your
    # moving average based on that, but if that's not the case, modify this function.
    return group.set_index('date').sort_index().rolling(3).mean()

df.groupby("city").apply(rolling_mean)  # Use .reset_index() if you don't need the MultiIndex.
Maybe do this (suppose your dataframe is named df):
from collections import defaultdict

data = defaultdict(list)
for place, date, value in df.values:
    data[place].append(value)

new_df = pd.DataFrame(dict(data))
Now you have a new dataframe with each city in its own column, so you can apply your function to each column (in a for loop):
saopaulo riodejaneiro curitiba
0 5 3 7
1 31 55 41
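With the cities as columns, rolling already works column-wise, so a single call covers every city (assuming the rows of new_df are in chronological order):
rolling_means = new_df.rolling(3).mean()   # one moving-average column per city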

Printing cells that correspond to largest values in original data set

I first found a cycle in a manufacturing process. I collected the 2 largest pressure values from the given cycles and printed them to a new sheet. I now need to capture the corresponding time to where the largest values land. This portion of my code looks like this:
df2 = df.groupby('group')['Pressure'].nlargest(2).rename_axis(index=['group', 'row_index'])
df2 = df.groupby('group')['Date/Time']
A sample snippet of the data I am trying to extract can be seen here:
Any help on this would be appreciated!
You can sort the data frame and take the last 2 rows per group. Typing this blind, as you did not provide sample data:
df2 = (
    df.sort_values(['group', 'Pressure'])
      .groupby('group', sort=False)
      .tail(2)
)
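Because tail(2) keeps whole rows, the timestamps come along automatically; to keep only the columns of interest, a sketch using the column names mentioned in the question (adjust them to your sheet):
peaks = df2[['group', 'Date/Time', 'Pressure']]   # two largest pressures per group, with their times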

Is there a better way to group by a category, and then select values based on different column values in Pandas?

I have an issue where I want to group by a date column, sort by a time column, and grab the resulting values in the values column.
The data looks something like this:
time value date
0 12.850000 19.195359 08-22-2019
1 9.733333 13.519543 09-19-2019
2 14.083333 9.191413 08-26-2019
3 16.616667 18.346598 08-19-2019
...
Where every date can occur multiple times, recording values at different points
during the day.
I wanted to group by date, and extract the minimum and maximum values of those groupings so I did this:
dayMin = df.groupby('date').value.min()
which gives me a Series object that is fairly easy to manipulate. The issue
comes up when I want to group by 'date', sort by 'time', then grab the 'value'.
What I did was:
dayOpen = df.groupby('date').apply(lambda df: df[ df.time == df.time.min() ])['value']
which almost worked, resulting in a DataFrame of:
date
08-19-2019 13344 17.573522
08-20-2019 12798 19.496609
08-21-2019 2009 20.033917
08-22-2019 5231 19.393700
08-23-2019 12848 17.784213
08-26-2019 417 9.717627
08-27-2019 6318 7.630234
I figured out how to clean up those nasty indexes to the left, name the column, and even concat with my dayMin Series to achieve my goal.
Ultimately my question is if there is a nicer way to perform these data manipulations that follow the general pattern of: "Group by column A, perform filtering or sorting operation on column B, grab resulting values from column C" for future applications.
Thank you in advance :)
You can sort the data frame before calling groupby:
first_of_day = df.sort_values('time').groupby('date').head(1)
This should work for you:
df.sort_values('time').groupby('date')['value'].agg([('Min', 'min'), ('Max', 'max')])
For this small example, the result is a DataFrame indexed by date with Min and Max columns.
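For the general pattern in the question (group by column A, order by column B, take values from column C), one compact spelling is named aggregation after sorting, a sketch assuming pandas 0.25+:
summary = (df
           .sort_values('time')                       # order by time within each date
           .groupby('date')['value']
           .agg(Open='first', Min='min', Max='max'))  # first-by-time plus min and max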

Applying changes to entire dataframe using group by

I am trying to apply changes to a dataframe for values that are only returned (to the best of my knowledge) by using groupby. What I want is to find the minimum date value for each company so that I can set the first value in several columns to 0 (in this case df2['Research and Development Expense Lag'] and df2['Capital Expenditures Lag']). Here is what I have so far, a groupby that returns those minimum date values for each company:
df2.groupby('Ticker Symbol').apply(
    lambda d: d[d['Data Date'] == d['Data Date'].min()])
You are on the right track. You can get the index values for those rows and then use them with .loc[] to change values in those two columns:
df2.loc[
    df2.groupby('Ticker Symbol')
       .apply(lambda d: d[d['Data Date'] == d['Data Date'].min()])
       .index
       .get_level_values(1),
    ['Research and Development Expense Lag', 'Capital Expenditures Lag']
] = 0
The .get_level_values(1) function serves to extract the second level of the MultiIndex. The first level will contain Ticker Symbol values.
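A shorter route to the same row labels is idxmin, a sketch assuming each company's earliest 'Data Date' sits on a single row (idxmin returns only the first matching label per group):
idx = df2.groupby('Ticker Symbol')['Data Date'].idxmin()   # index of the earliest date per company
df2.loc[idx, ['Research and Development Expense Lag',
              'Capital Expenditures Lag']] = 0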

Selective summation of columns in a pandas dataframe

The COVID-19 tracking project (api described here) provides data on many aspects of the pandemic. Each row of the JSON is one day's data for one state. As many people know, the pandemic is hitting different states differently -- New York and its neighbors hardest first, with other states being hit later. Here is a subset of the data:
date,state,positive,negative
20200505,AK,371,22321
20200505,CA,56212,723690
20200505,NY,321192,707707
20200505,WY,596,10319
20200504,AK,370,21353
20200504,CA,54937,692937
20200504,NY,318953,688357
20200504,WY,586,9868
20200503,AK,368,21210
20200503,CA,53616,662135
20200503,NY,316415,669496
20200503,WY,579,9640
20200502,AK,365,21034
20200502,CA,52197,634606
20200502,NY,312977,646094
20200502,WY,566,9463
To get the entire data set I am doing this:
import pandas as pd
all_states = pd.read_json("https://covidtracking.com/api/v1/states/daily.json")
I would like to be able to summarize the data by adding up the values for one column, but only for certain states; and then adding up the same column, for the states not included before. I was able to do this, for instance:
not_NY = all_states[all_states['state'] != 'NY'].groupby(['date'], as_index = False).hospitalizedCurrently.sum()
This creates a new dataframe from all_states, grouped by date, and summing for all the states that are not "NY". What I want to do, though, is exclude multiple states with something like a "not in" function (this doesn't work):
not_tristate = all_states[all_states['state'] not in ['NY','NJ','CT']].groupby(['date'], as_index = False).hospitalizedCurrently.sum()
Is there a way to do that? An alternate approach I tried is to create a new dataframe as a pivot table, with one row per date, one column per state, like this:
pivot_states = all_states.pivot_table(index = 'gooddate', columns = 'state', values = 'hospitalizedCurrently', aggfunc='sum')
but this still leaves me with creating new columns from summing only some columns. In SQL, I would solve the problem like this:
SELECT all_states.Date AS [Date], Sum(IIf([all_states]![state] In ("NY","NJ","CT"),[all_states]![hospitalizedCurrently],0)) AS tristate, Sum(IIf([all_states]![state] Not In ("NY","NJ","CT"),[all_states]![hospitalizedCurrently],0)) AS not_tristate
FROM all_states
GROUP BY all_states.Date
ORDER BY all_states.Date;
The end result I am looking for is like this (using the sample data above and summing on the 'positive' column, with 'NY' standing in for 'tristate'):
date,not_tristate,tristate,total
20200502,53128,312977,366105
20200503,54563,316415,370978
20200504,55893,318953,374846
20200505,57179,321192,378371
Any help would be welcome.
To get the expected output, you can group by date and by an np.where condition on whether the state isin the list you want, sum the positive column, unstack, and assign to get the total column:
import numpy as np

df_f = (all_states
        .groupby(['date',
                  np.where(all_states['state'].isin(["NY", "NJ", "CT"]),
                           'tristate', 'not_tristate')])
        ['positive'].sum()
        .unstack()
        .assign(total=lambda x: x.sum(axis=1))
        )
print(df_f)
not_tristate tristate total
date
20200502 53128 312977 366105
20200503 54563 316415 370978
20200504 55893 318953 374846
20200505 57179 321192 378371
Or with pivot_table, you get a similar result:
print(all_states
      .assign(state=np.where(all_states['state'].isin(["NY", "NJ", "CT"]),
                             'tristate', 'not_tristate'))
      .pivot_table(index='date', columns='state', values='positive',
                   aggfunc='sum', margins=True))
state not_tristate tristate All
date
20200502 53128 312977 366105
20200503 54563 316415 370978
20200504 55893 318953 374846
20200505 57179 321192 378371
All 220763 1269537 1490300
You can exclude multiple states by using isin with the NOT (~) operator:
all_states[~(all_states['state'].isin(["NY", "NJ", "CT"]))]
So, your code would be:
not_tristate = all_states[~(all_states['state'].isin(['NY','NJ','CT']))].groupby(['date'], as_index = False).hospitalizedCurrently.sum()
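If you need both halves, the same mask can be reused (a sketch, with the column names from your code):
mask = all_states['state'].isin(['NY', 'NJ', 'CT'])
tristate = all_states[mask].groupby('date', as_index=False).hospitalizedCurrently.sum()
not_tristate = all_states[~mask].groupby('date', as_index=False).hospitalizedCurrently.sum()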
