I have a dataframe that looks like this:
[1]: https://i.stack.imgur.com/KnHba.png
Essentially, there is a distributor name column, a sales column, and a MM/DD/YYYY Date column.
For each distributor, by month, I want the sum of sales.
What I tried:
df = df.groupby(df['Distributor Name'],df.Date.dt.month)['Sales'].sum()
This throws an error: "unhashable type: Series". It works when I remove Distributor Name, but I don't want just the overall monthly sales; I want the monthly sales BY distributor.
Thanks in advance!!
Joel
The correct way to group by multiple columns is to put them in a list as the first argument:
result = df.groupby(['Distributor Name', df['Date'].dt.month])['Sales'].sum()
This creates a pandas Series with a MultiIndex, with Distributor Name and the month as index levels. If you wish to create a dataframe with three columns (Distributor Name, Date, Sales), you can reset the index of this Series:
result = result.reset_index()
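A minimal, self-contained sketch with made-up data (the column names are taken from the question, the values are invented):

```python
import pandas as pd

# Hypothetical sample data mirroring the question's columns
df = pd.DataFrame({
    'Distributor Name': ['Acme', 'Acme', 'Beta', 'Beta'],
    'Date': pd.to_datetime(
        ['01/05/2021', '01/20/2021', '01/10/2021', '02/01/2021'],
        format='%m/%d/%Y'),
    'Sales': [100, 50, 30, 70],
})

# Group by distributor and calendar month, summing sales
result = df.groupby(['Distributor Name', df['Date'].dt.month])['Sales'].sum()

# Flatten the MultiIndex into ordinary columns
flat = result.reset_index()
print(flat)
```

Note that mixing a column label (`'Distributor Name'`) and a Series (`df['Date'].dt.month`) in the same list is allowed, which is what makes this pattern work.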
Please, I need help; I recently started learning Python. How do I merge rows with the same “PatientID” and the same “Resource” into one row, with “StartDate” and “EndDate” as the average of the merged rows?
Given df is the name for the pandas.DataFrame containing your data.
To get the earliest StartDate and EndDate of each patient's resource (swap .min() for .mean() below if you want the average instead, as the question asks), you can write:
# Group by the 'PatientID' and 'Resource' columns
grouped_df = df.groupby(['PatientID', 'Resource'])
# Select the earliest `StartDate` and `EndDate` per group.
grouped_df = grouped_df[['StartDate', 'EndDate']].min()
# Remove levels from the index.
grouped_df.reset_index(inplace=True)
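A small sketch with made-up data (column names taken from the question, values invented; both datetime columns must already be parsed as datetimes):

```python
import pandas as pd

# Hypothetical data with the question's column names
df = pd.DataFrame({
    'PatientID': [1, 1, 2],
    'Resource': ['Bed', 'Bed', 'Bed'],
    'StartDate': pd.to_datetime(['2021-01-01', '2021-01-03', '2021-02-01']),
    'EndDate': pd.to_datetime(['2021-01-05', '2021-01-07', '2021-02-03']),
})

# One row per (PatientID, Resource), keeping the earliest dates
grouped_df = (df.groupby(['PatientID', 'Resource'])[['StartDate', 'EndDate']]
                .min()
                .reset_index())
print(grouped_df)
```

Replacing `.min()` with `.mean()` also works on datetime columns and gives the average instead of the earliest date.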
Hello, a Python newbie here.
I have a dataframe that shows the product and how much they sold on each date
I need to change this dataframe to show the aggregate amount of units sold.
This is just an example dataframe and the actual dataframe that I am dealing with contains hundreds of products and 3 years worth of sales data.
I would appreciate an efficient way to do this.
Thank you in advance!!
If product is a column, use DataFrame.set_index with DataFrame.cumsum to compute the cumulative sum across the date columns:
df1 = df.set_index('product').cumsum(axis=1)
If product is already the index:
df1 = df.cumsum(axis=1)
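A toy sketch of the wide layout assumed here (one row per product, one column per date; the data is invented):

```python
import pandas as pd

# Hypothetical wide frame: one row per product, one column per date
df = pd.DataFrame({
    'product': ['A', 'B'],
    '2021-01-01': [1, 4],
    '2021-01-02': [2, 5],
    '2021-01-03': [3, 6],
})

# Running total of units sold, accumulated left-to-right across the dates
df1 = df.set_index('product').cumsum(axis=1)
print(df1)
```

Because `cumsum(axis=1)` sums across columns, `product` has to be moved into the index first so only the numeric date columns are accumulated.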
I have a dataframe of Covid-19 deaths by country. Countries are identified in the Country column. Sub-national classification is based on the Province column.
I want to generate a dataframe which sums all columns based on the value in the Country column (except the first 2, which are geographical data). In short, for each date, I want to compress the observations for all provinces of a country such that I get a single number for each country.
Right now, I am able to do that for a single date:
import pandas as pd
url = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
       'csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
raw = pd.read_csv(url)
del raw['Lat']
del raw['Long']
raw.rename({'Country/Region': 'Country', 'Province/State': 'Province'}, axis=1, inplace=True)
raw2 = raw.groupby('Country')['6/29/20'].sum()
How can I achieve this for all dates?
You can use iloc to select all the date columns at once. Since you deleted Lat and Long, only Province and Country remain before the dates, so the date columns start at position 2:
raw2 = raw.iloc[:, 2:].groupby(raw.Country).sum()
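A toy sketch of this column-slice-then-groupby pattern (invented data; as in the cleaned frame above, only Province and Country precede the date columns, so the dates start at position 2):

```python
import pandas as pd

# Hypothetical frame mimicking the cleaned layout: Province, Country, then dates
raw = pd.DataFrame({
    'Province': ['A', 'B', None],
    'Country': ['X', 'X', 'Y'],
    '6/28/20': [1, 2, 3],
    '6/29/20': [4, 5, 6],
})

# Slice off the two geographic columns, then sum every date column per country
raw2 = raw.iloc[:, 2:].groupby(raw.Country).sum()
print(raw2)
```

Grouping a column slice by a Series from the original frame works because pandas aligns the grouping Series with the slice by row index.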
This is my first question here. I have COVID data in Python: the distribution of COVID cases across several provinces in each country.
What should I do if I want each country to have only one row (per date), and drop the province column?
Thank you
You need to group by date and country and sum the cases; selecting only the cases column before aggregating keeps the province column out of the result automatically:
e.g. if you have a dataframe df, with columns date, country, province, and cases, then you should do:
grouped_df = df.groupby(['date', 'country'], as_index=False)['cases'].sum()
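A minimal sketch with made-up data (the column names date, country, province, and cases are assumed from the question). Selecting just `cases` before summing avoids trying to aggregate the text `province` column:

```python
import pandas as pd

# Hypothetical long-format data matching the described columns
df = pd.DataFrame({
    'date': ['2020-06-01', '2020-06-01', '2020-06-01'],
    'country': ['X', 'X', 'Y'],
    'province': ['A', 'B', None],
    'cases': [10, 20, 5],
})

# One row per (date, country); province disappears because it isn't selected
grouped_df = df.groupby(['date', 'country'], as_index=False)['cases'].sum()
print(grouped_df)
```

`as_index=False` keeps date and country as ordinary columns instead of a MultiIndex, so no reset_index is needed afterwards.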
I have a DataFrame with a date_time column. The date_time column contains a date and time. I also managed to convert the column to a datetime object.
I want to create a new DataFrame containing all the rows of a specific DAY.
I managed to do it when I set the date column as the index and used the "loc" method.
Is there a way to do it even if the date column is not set as the index? I only found a method which returns the rows between two days.
You can use the groupby() function. Since the date_time column also contains times, group by the calendar date rather than the full timestamp. Let's say your dataframe is df:
df_group = df.groupby(df['date_time'].dt.date)
Now you can access the rows of any single day by passing that day (as a datetime.date) to the get_group function:
df_group.get_group(date_here)
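A small sketch with made-up data (the column name date_time comes from the question, the values are invented), showing both the groupby approach and a plain boolean mask, which also answers this without touching the index:

```python
import pandas as pd
import datetime

# Hypothetical frame with a datetime column holding dates and times
df = pd.DataFrame({
    'date_time': pd.to_datetime(
        ['2021-03-01 08:00', '2021-03-01 17:30', '2021-03-02 09:15']),
    'value': [1, 2, 3],
})

# Option 1: boolean mask on the normalized (midnight) timestamp
day = pd.Timestamp('2021-03-01')
same_day = df[df['date_time'].dt.normalize() == day]

# Option 2: group by the calendar date and pull out one group
by_day = df.groupby(df['date_time'].dt.date)
also_same_day = by_day.get_group(datetime.date(2021, 3, 1))
print(same_day)
```

The mask version is usually simpler when you only need one day; the groupby version is handy when you want to iterate over every day in turn.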