How to count the total sales by year, month - python

I have a big CSV (17,985 rows) with sales on different days. The CSV looks like this:
Customer  Date       Sale
Larry     1/2/2018   20$
Mike      4/3/2020   40$
John      12/5/2017  10$
Sara      3/2/2020   90$
Charles   9/8/2022   75$
Below is how many times that exact day appears in my csv (how many sales were made that day):
occur = df.groupby(['Date']).size()
occur
2018-01-02 32
2018-01-03 31
2018-01-04 42
2018-01-05 192
2018-01-06 26
I tried crosstab, groupby and several other methods, but either the totals don't add up or the result is all NaN:
new_df['total_sales_that_month'] = df.groupby('Date')['Sale'].sum()
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
..
17980 NaN
17981 NaN
17982 NaN
17983 NaN
17984 NaN
I want to group them by year and month in a dataframe, based on total sales. Using dt.year and dt.month I managed to do this:
month  year
1      2020
1      2020
7      2019
8      2019
2      2018
...    ...
4      2020
4      2020
4      2020
4      2020
4      2020
What I want to have is: month/year/total_sales_that_month. What method should I apply? This is the expected output:
Month Year Total_sale_that_month
1 2018 420$
2 2018 521$
3 2018 124$
4 2018 412$
5 2018 745$

You can use groupby + sum, but first you have to strip the '$' from the Sale column and convert it to numeric:
# Clean your dataframe first
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['Sale'] = df['Sale'].str.strip('$').astype(float)
out = (df.groupby([df['Date'].dt.month.rename('Month'),
                   df['Date'].dt.year.rename('Year')])
         ['Sale'].sum()
         .rename('Total_sale_that_month')
         # .astype(str).add('$')  # uncomment if '$' matters
         .reset_index())
Output:
>>> out
Month Year Total_sale_that_month
0 2 2018 20.0
1 2 2020 90.0
2 3 2020 40.0
3 5 2017 10.0
4 8 2022 75.0
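A variant of the same idea (a sketch I'm adding, not part of the original answer) groups on a monthly Period instead of two separate month/year keys, which keeps the result in chronological order:

```python
import pandas as pd

# Sample data matching the question's layout
df = pd.DataFrame({
    "Customer": ["Larry", "Mike", "John", "Sara", "Charles"],
    "Date": ["1/2/2018", "4/3/2020", "12/5/2017", "3/2/2020", "9/8/2022"],
    "Sale": ["20$", "40$", "10$", "90$", "75$"],
})

# Same cleaning as above: day-first dates, strip the '$'
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)
df["Sale"] = df["Sale"].str.strip("$").astype(float)

# One monthly Period key instead of separate month/year columns
out = (df.groupby(df["Date"].dt.to_period("M"))["Sale"].sum()
         .rename("Total_sale_that_month")
         .reset_index())
print(out)
```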

I'll share my code, using pivot_table, reset_index and sorting; adapt the column names to yours:
df["Dt_Customer_Y"] = pd.DatetimeIndex(df['Dt_Customer']).year
df["Dt_Customer_M"] = pd.DatetimeIndex(df['Dt_Customer']).month
pvtt = df.pivot_table(index=['Dt_Customer_Y', 'Dt_Customer_M'], aggfunc={'Income':sum})
pvtt.reset_index().sort_values(['Dt_Customer_Y', 'Dt_Customer_M'])
Dt_Customer_Y Dt_Customer_M Income
0 2012 1 856039.0
1 2012 2 487497.0
2 2012 3 921940.0
3 2012 4 881203.0
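The same monthly totals can also come straight from groupby without a pivot table; a sketch with invented data, since the original Dt_Customer/Income frame isn't shown:

```python
import pandas as pd

# Hypothetical stand-in for the Dt_Customer / Income columns
df = pd.DataFrame({
    "Dt_Customer": pd.to_datetime(["2012-01-05", "2012-01-20", "2012-02-11"]),
    "Income": [500000.0, 356039.0, 487497.0],
})

# Renaming the derived keys gives the grouped frame its column names directly
out = (df.groupby([df["Dt_Customer"].dt.year.rename("Dt_Customer_Y"),
                   df["Dt_Customer"].dt.month.rename("Dt_Customer_M")])
         ["Income"].sum()
         .reset_index())
print(out)
```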

Related

Add two dataframe rows?

I have a df with a year column; I'm trying to combine each row with the next one in the dataframe.
df
year
0 2020
1 2019
2 2018
3 2017
4 2016
Final df
year combine
0 2020 2020-2019
1 2019 2019-2018
2 2018 2018-2017
3 2017 2017-2016
4 2016 NaN
Let us do shift
df['combine'] = df.year.astype(str) + '-' + df.year.astype(str).shift(-1)
df
Out[302]:
year combine
0 2020 2020-2019
1 2019 2019-2018
2 2018 2018-2017
3 2017 2017-2016
4 2016 NaN
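For reference, a self-contained sketch of the shift trick:

```python
import pandas as pd

df = pd.DataFrame({"year": [2020, 2019, 2018, 2017, 2016]})

# Pair each year with the next row's year; the last row has no successor, so NaN
df["combine"] = df["year"].astype(str) + "-" + df["year"].astype(str).shift(-1)
print(df)
```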

Pandas group by id and year(date), but show year for all years, not just those which are present in id?

I have years of transaction data which I am working with by customer id. The transaction information is at the invoice level: an id could easily have multiple invoices on the same day, or no invoices for years. I am attempting to create dataframes which contain sums of invoices by customer by year, but which also show years where no invoices were added. Something akin to:
tmp = invoices[invoices['invoice_year'].isin([2018, 2019, 2020])]
tmp = tmp.groupby(['id', pd.Grouper(key = 'invoice_date', freq = 'Y')])['sales'].sum()
This would return something akin to:
id invoice_year sales
1 2018 483982.20
1 2019 3453
1 2020 453533
2 2018 243
2 2020 23423
3 2020 2330202
However the desired output would be:
id invoice_year sales
1 2018 483982.20
1 2019 3453
1 2020 453533
2 2018 243
2 2019 nan
2 2020 23423
3 2018 nan
3 2019 nan
3 2020 2330202
Ideas?
Let's suppose the original values are defined in the dataframe named df then you can try the following:
output = (df.groupby(['id', 'invoice_date'])['val'].sum()
            .unstack(fill_value=0)
            .stack()
            .reset_index(name='val'))
Alternatively, you can first create the column invoice_year:
df['invoice_year'] = df['invoice_date'].dt.year
And repeat the same code, this outputs:
id invoice_year val
0 1 2018 1
1 1 2019 1
2 1 2020 0
3 2 2018 1
4 2 2019 0
5 2 2020 1
6 3 2018 0
7 3 2019 1
8 3 2020 1
Using the following data as example:
df = pd.DataFrame({'id': [1]*2 + [2]*2 + [3]*2,
                   'invoice_date': pd.to_datetime(['2018-12-01', '2019-12-01', '2020-12-01']*2),
                   'val': [1]*6})
Stefan has posted a comment that may help. Simply passing dropna=False to your .groupby seems like the best bet; but, you could also take the approach where you bring the NaNs back afterward, which may be required on earlier versions of pandas that don't have the dropna=False parameter:
id invoice_year sales
1 2018 483982.20
1 2019 3453
1 2020 453533
2 2018 243
2 2020 23423
3 2020 2330202
You can use pd.MultiIndex.from_product and reindex the dataframe from a newly created index called idx:
i, iy = df['id'], df['invoice_year']
idx = pd.MultiIndex.from_product([range(i.min(), i.max() + 1),
                                  range(iy.min(), iy.max() + 1)],
                                 names=[i.name, iy.name])
df = df.set_index([i.name, iy.name]).reindex(idx).reset_index()
df
Out[1]:
id invoice_year sales
0 1 2018 483982.2
1 1 2019 3453.0
2 1 2020 453533.0
3 2 2018 243.0
4 2 2019 NaN
5 2 2020 23423.0
6 3 2018 NaN
7 3 2019 NaN
8 3 2020 2330202.0

How do I perform ordered selection on multiple Columns by Value

I have a dataframe including a month and year column. Both contain strings, i.e. 'September' and '2013'. How do I select all rows between September 2013 and May 2008 in one step?
df1 = stats_month_census_2[(stats_month_census_2['year'] <= '2013')
& (stats_month_census_2['year'] >= '2008')]
df2 = df1[...]
After the code above, I was going to do the same thing again, but I am having a hard time coming up with clever code to simply get rid of the rows later than September 2013 (October to December) and earlier than May 2008. I could hard-code this easily, but there must be a more pythonic way of doing this...
Or you could try the following if you are looking for rows that fall between 2008 and 2013, as you asked in the post ("select all rows between September 2013 and May 2008"), using pandas.Series.between:
Dataset borrowed from @jezrael. DataFrame for demonstration purposes:
>>> stats_month_census_2
year month data
0 2008 April 1
1 2008 May 3
2 2008 June 4
3 2013 September 6
4 2013 October 5
5 2014 November 6
6 2014 December 7
Using pandas.Series.between()
>>> stats_month_census_2[stats_month_census_2['year'].between(2008, 2013, inclusive='both')]  # inclusive=True on pandas < 1.3
year month data
0 2008 April 1
1 2008 May 3
2 2008 June 4
3 2013 September 6
4 2013 October 5
If it's just a matter of datetime format, you can simply try below:
>>> stats_month_census_2[stats_month_census_2['year'].between('2008-05', '2013-09', inclusive='both')]
year month data
1 2008-05-01 May 3
2 2008-06-01 June 4
3 2013-09-01 September 6
Using DataFame.query :
>>> stats_month_census_2.query('"2008-05" <= year <= "2013-09"')
year month data
1 2008-05-01 May 3
2 2008-06-01 June 4
3 2013-09-01 September 6
Using isin method: Select the rows between two dates
>>> stats_month_census_2[stats_month_census_2['year'].isin(pd.date_range('2008-05-01', '2013-09-01'))]
year month data
1 2008-05-01 May 3
2 2008-06-01 June 4
3 2013-09-01 September 6
Or you can even pass it like below:
>>> stats_month_census_2[stats_month_census_2['year'].isin(pd.date_range('2008-05', '2013-09'))]
year month data
1 2008-05-01 May 3
2 2008-06-01 June 4
3 2013-09-01 September 6
Using the loc method, slicing between start and end dates found via the index:
Start = stats_month_census_2[stats_month_census_2['year'] =='2008-05'].index[0]
End = stats_month_census_2[stats_month_census_2['year']=='2013-09'].index[0]
>>> stats_month_census_2.loc[Start:End]
year month data
1 2008-05-01 May 3
2 2008-06-01 June 4
3 2013-09-01 September 6
Note: just out of curiosity, since @jezrael asked in a comment, I'm adding how to convert the year column into datetime format. In the example DataFrame below we have two distinct columns, year and month, where the year column holds only years and the month column holds the month name as a string. So we first need to convert the month string into an int, then join the year and month together (assigning day 1 to all rows) using pd.to_datetime.
df
year month data
0 2008 April 1
1 2008 May 3
2 2008 June 4
3 2013 September 6
4 2013 October 5
5 2014 November 6
6 2014 December 7
Above is the raw DataFrame before the datetime conversion, so I'm taking the approach below, which I learned over time via SO itself.
1 - First, convert the month names into int form and assign them to a new column called Month, so we can use it for the conversion later.
df['Month'] = pd.to_datetime(df.month, format='%B').dt.month
2 - Then convert the year column into a proper datetime format by assigning the result back to the year column itself (in-place, so to speak).
df['year'] = pd.to_datetime(df[['year', 'Month']].assign(Day=1))
Now we have the desired DataFrame, and the year column is in datetime form:
print(df)
year month data Month
0 2008-04-01 April 1 4
1 2008-05-01 May 3 5
2 2008-06-01 June 4 6
3 2013-09-01 September 6 9
4 2013-10-01 October 5 10
5 2014-11-01 November 6 11
6 2014-12-01 December 7 12
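Putting the two steps together as one runnable sketch (assuming the same year/month/data layout as the example above):

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2008, 2008, 2013],
    "month": ["April", "May", "September"],
    "data": [1, 3, 6],
})

# Step 1: month name -> month number via the %B format code
df["Month"] = pd.to_datetime(df["month"], format="%B").dt.month

# Step 2: assemble year + Month (+ day 1) into a real datetime column
df["year"] = pd.to_datetime(df[["year", "Month"]].assign(Day=1))
print(df)
```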
You can easily convert the columns into a DateTime column using pd.to_datetime
>>df
month year
0 January 2000
1 April 2001
2 July 2002
3 February 2010
4 February 2018
5 March 2014
6 June 2012
7 June 2011
8 May 2009
9 November 2016
>>df['date'] = pd.to_datetime(df['month'].astype(str) + '-' + df['year'].astype(str), format='%B-%Y')
>>df
month year date
0 January 2000 2000-01-01
1 April 2001 2001-04-01
2 July 2002 2002-07-01
3 February 2010 2010-02-01
4 February 2018 2018-02-01
5 March 2014 2014-03-01
6 June 2012 2012-06-01
7 June 2011 2011-06-01
8 May 2009 2009-05-01
9 November 2016 2016-11-01
>>df[(df.date <= "2013-09") & (df.date >= "2008-05") ]
month year date
3 February 2010 2010-02-01
6 June 2012 2012-06-01
7 June 2011 2011-06-01
8 May 2009 2009-05-01
You can create DatetimeIndex and then select by partial string indexing:
stats_month_census_2 = pd.DataFrame({
'year': [2008, 2008, 2008, 2013,2013],
'month': ['April','May','June','September','October'],
'data':[1,3,4,6,5]
})
print (stats_month_census_2)
year month data
0 2008 April 1
1 2008 May 3
2 2008 June 4
3 2013 September 6
4 2013 October 5
s = stats_month_census_2.pop('year').astype(str) + stats_month_census_2.pop('month')
#if need year and month columns
#s = stats_month_census_2['year'].astype(str) + stats_month_census_2['month']
stats_month_census_2.index = pd.to_datetime(s, format='%Y%B')
print (stats_month_census_2)
data
2008-04-01 1
2008-05-01 3
2008-06-01 4
2013-09-01 6
2013-10-01 5
print (stats_month_census_2['2008':'2013'])
data
2008-04-01 1
2008-05-01 3
2008-06-01 4
2013-09-01 6
2013-10-01 5
print (stats_month_census_2['2008-05':'2013-09'])
data
2008-05-01 3
2008-06-01 4
2013-09-01 6
Or create column and use between with boolean indexing:
s = stats_month_census_2['year'].astype(str) + stats_month_census_2['month']
stats_month_census_2['date'] = pd.to_datetime(s, format='%Y%B')
print (stats_month_census_2)
year month data date
0 2008 April 1 2008-04-01
1 2008 May 3 2008-05-01
2 2008 June 4 2008-06-01
3 2013 September 6 2013-09-01
4 2013 October 5 2013-10-01
df = stats_month_census_2[stats_month_census_2['date'].between('2008-05', '2013-09')]
print (df)
year month data date
1 2008 May 3 2008-05-01
2 2008 June 4 2008-06-01
3 2013 September 6 2013-09-01
Unfortunately, selecting between whole years this way with the datetime column is not possible; for that you need pygo's solution with the year column:
#wrong output
df = stats_month_census_2[stats_month_census_2['date'].between('2008', '2013')]
print (df)
year month data date
0 2008 April 1 2008-04-01
1 2008 May 3 2008-05-01
2 2008 June 4 2008-06-01
Another solution. Let's assume the df looks like below:
series name Month Year
0 fertility rate May 2008
1 CO2 emissions June 2009
2 fertility rate September 2013
3 fertility rate October 2013
4 CO2 emissions December 2014
Create a calendar dictionary mapping and save it in a new column:
import calendar
d = dict((v,k) for k,v in enumerate(calendar.month_abbr))
stats_month_census_2['month_int'] = stats_month_census_2.Month.apply(lambda x: x[:3]).map(d)
>>stats_month_census_2
series name Month Year month_int
0 fertility rate May 2008 5
1 CO2 emissions June 2009 6
2 fertility rate September 2013 9
3 fertility rate October 2013 10
4 CO2 emissions December 2014 12
Filter using Series.between() (note inclusive=True became inclusive='both' in pandas 1.3):
stats_month_census_2[stats_month_census_2.month_int.between(5, 9, inclusive='both')
                     & stats_month_census_2.Year.between(2008, 2013, inclusive='both')]
Output:
series name Month Year month_int
0 fertility rate May 2008 5
1 CO2 emissions June 2009 6
2 fertility rate September 2013 9

How to merge several rows into one row based on a column with specific value in Pandas

I have a DataFrame like this:
item_id revenue month year
1 10.0 01 2014
1 5.0 02 2013
1 6.0 04 2013
1 7.0 03 2013
2 2.0 01 2013
2 3.0 03 2013
3 5.0 04 2013
And I am trying to get the revenue of each item from January to March 2013, like the following DataFrame:
item_id revenue year
1 12.0 2013
2 5.0 2013
3 0 2013
BUT, I am confused on how to implement it in Pandas. Any help would be appreciated.
You can slice first, then groupby and reindex to include 0 values.
month_start, month_end = 1, 3
year = 2013
res = (df.loc[df['month'].between(month_start, month_end) & df['year'].eq(year)]
         .groupby('item_id')['revenue'].sum()
         .reindex(df['item_id'].unique()).fillna(0)
         .reset_index()
         .assign(year=year))
print(res)
item_id revenue year
0 1 12.0 2013
1 2 5.0 2013
2 3 0.0 2013
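As a self-contained check of the slice-then-reindex approach (a sketch using the question's data):

```python
import pandas as pd

df = pd.DataFrame({
    "item_id": [1, 1, 1, 1, 2, 2, 3],
    "revenue": [10.0, 5.0, 6.0, 7.0, 2.0, 3.0, 5.0],
    "month":   [1, 2, 4, 3, 1, 3, 4],
    "year":    [2014, 2013, 2013, 2013, 2013, 2013, 2013],
})

res = (df.loc[df["month"].between(1, 3) & df["year"].eq(2013)]
         .groupby("item_id")["revenue"].sum()
         .reindex(df["item_id"].unique())  # restore items with no sales in range
         .fillna(0)
         .reset_index()
         .assign(year=2013))
print(res)
```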
You can use groupby first, then the sum method, to get the desired output.
df.groupby(['year', 'item_id']).sum().reset_index().drop('month', axis=1).set_index('item_id')
year revenue
item_id
1 2013 18.0
2 2013 5.0
3 2013 5.0
1 2014 10.0

Pandas Grouper by weekday?

I have a pandas dataframe where the index is the date, from year 2007 to 2017.
I'd like to calculate the mean of each weekday for each year. I am able to group by year:
groups = df.groupby(pd.Grouper(freq='A'))  # TimeGrouper('A') in older pandas
years = DataFrame()
for name, group in groups:
    years[name.year] = group.values
This is how I create a new dataframe (years) with one column per year of the time series.
If I want to see the statistics of each years (for example, the mean):
print(years.mean())
But now I would like to separate each day of the week for each year, in order to obtain the mean of each weekday for all of them.
The only thing I know is:
year = df[df.index.year == 2007]
day_week = year[year.index.weekday == 2]
The problem with this is that I have to change 7 times the day of the week, and then repeat this for 11 years (my time series begins on 2007 and ends on 2017), so I must do it 77 times!
Is there a way to group time by years and weekday in order to make this faster?
It seems you need to group by DatetimeIndex.year together with DatetimeIndex.weekday:
rng = pd.date_range('2017-04-03', periods=10, freq='10M')
df = pd.DataFrame({'a': range(10)}, index=rng)
print (df)
a
2017-04-30 0
2018-02-28 1
2018-12-31 2
2019-10-31 3
2020-08-31 4
2021-06-30 5
2022-04-30 6
2023-02-28 7
2023-12-31 8
2024-10-31 9
df1 = df.groupby([df.index.year, df.index.weekday]).mean()
print (df1)
a
2017 6 0
2018 0 2
2 1
2019 3 3
2020 0 4
2021 2 5
2022 5 6
2023 1 7
6 8
2024 3 9
df1 = df.groupby([df.index.year, df.index.weekday]).mean().reset_index()
df1 = df1.rename(columns={'level_0':'years','level_1':'weekdays'})
print (df1)
years weekdays a
0 2017 6 0
1 2018 0 2
2 2018 2 1
3 2019 3 3
4 2020 0 4
5 2021 2 5
6 2022 5 6
7 2023 1 7
8 2023 6 8
9 2024 3 9
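A small variant (my sketch, not from the answer) names the grouping keys up front, so no level_0/level_1 rename is needed afterwards:

```python
import pandas as pd

# Three of the example dates, chosen to land on different weekdays
rng = pd.to_datetime(["2017-04-30", "2018-02-28", "2018-12-31"])
df = pd.DataFrame({"a": [0, 1, 2]}, index=rng)

# Named index-derived keys become the column names after reset_index
df1 = (df.groupby([df.index.year.rename("years"),
                   df.index.weekday.rename("weekdays")])
         .mean()
         .reset_index())
print(df1)
```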
