Aggregate columns with same date (sum) in csv - python

My code is returning the following data in CSV
Quantity Date of purchase
1 17 May 2022 at 5:40:20PM BST
1 2 Apr 2022 at 7:41:29PM BST
1 2 Apr 2022 at 6:42:05PM BST
1 29 Mar 2022 at 12:34:56PM BST
1 29 Mar 2022 at 10:52:54AM BST
1 29 Mar 2022 at 12:04:52AM BST
1 28 Mar 2022 at 4:49:34PM BST
1 28 Mar 2022 at 11:13:37AM BST
1 27 Mar 2022 at 8:53:05PM BST
1 27 Mar 2022 at 5:10:21PM BST
I am trying to keep only the dates and sum the quantities that share the same date. This is the code I have so far:
data = read_csv("products_sold_history_data.csv")
data['Date of purchase'] = pandas.to_datetime(data['Date of purchase'] , format='%d-%m-%Y').dt.date
but it's giving me an error. Can anyone please help? How can I take only the dates from the Date of purchase column and then add up the quantity values for the same date?

The date format in your data does not match the format you specified (format='%d-%m-%Y').
You can either specify the correct format explicitly or let pandas infer it by omitting the format argument:
pandas.to_datetime(data['Date of purchase']).dt.date
If you want to specify the format explicitly, it has to match your data. Note that a 12-hour time with AM/PM needs %I rather than %H for the hour, otherwise %p is ignored:
pandas.to_datetime(data['Date of purchase'], format='%d %b %Y at %I:%M:%S%p %Z')
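For the summing part of the question, here is a minimal sketch. It assumes English month abbreviations and that every value contains the literal " at " separator; the sample frame copies a few rows from the question, and the column name `date` is just an illustrative choice.

```python
import pandas as pd

# A few rows copied from the question's CSV output.
data = pd.DataFrame({
    "Quantity": [1, 1, 1, 1],
    "Date of purchase": [
        "17 May 2022 at 5:40:20PM BST",
        "2 Apr 2022 at 7:41:29PM BST",
        "2 Apr 2022 at 6:42:05PM BST",
        "29 Mar 2022 at 12:34:56PM BST",
    ],
})

# Keep only the text before " at " and parse it as day, month abbreviation, year.
date_text = data["Date of purchase"].str.split(" at ").str[0]
data["date"] = pd.to_datetime(date_text, format="%d %b %Y").dt.date

# Sum the quantities that fall on the same calendar date.
daily = data.groupby("date", as_index=False)["Quantity"].sum()
print(daily)
```

Dropping the time text before parsing sidesteps the AM/PM and time zone handling, which should be fine here because only the calendar date is needed.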

Here is one way to do it, where the date is created on the fly as a temporary field and does not become part of the DF.
Also, IIUC you're not concerned with the time part, and only the date is needed for summing.
Extract the date part using a regex, create a temp field dte using DataFrame.assign, and then use groupby to sum up the quantity:
df.assign(dte=pd.to_datetime(
        df['purchase'].str.extract(r'(.*)(at)')[0].str.strip())
    ).groupby('dte')['qty'].sum().reset_index()
dte qty
0 2022-02-06 3
1 2022-02-07 3
2 2022-02-08 2
3 2022-02-09 2
4 2022-02-10 2
5 2022-02-11 3
6 2022-02-14 1
7 2022-02-15 1
8 2022-02-19 1

Related

Calculate the Number of Users at the Start of the Month

I have a table which looks like this:
ID  Start Date  End Date
1   01/01/2022  29/01/2022
2   03/01/2022
3   15/01/2022
4   01/02/2022  01/03/2022
5   01/03/2022  01/05/2022
6   01/04/2022
So, for every row I have the start date of the contract with the user and the end date. If the contract is still active, there is no end date.
I'm trying to get a table that looks like this:
Feb  Mar  Apr  Jun
3    3    4    3
Which counts the number of active users on the first day of the month.
What is the most efficient way to calculate this?
At the moment the only idea that has come to mind is to use a scaffold table containing the dates I'm interested in (the first day of every month) and build the new table from that.
But is there a better way to solve this? I would love to find a more efficient approach, since I need to repeat exactly the same calculation for the number of users at the start of each week.
This might help:
import pandas as pd

# initializing dataframe
df = pd.DataFrame({'start': ['01/01/2022', '03/01/2022', '15/01/2022', '01/02/2022', '01/03/2022', '01/04/2022'],
                   'end': ['29/01/2022', '', '', '01/03/2022', '01/05/2022', '']})
# cleaning datetime (the empty ones are replaced with the max exit)
df['start'] = pd.to_datetime(df['start'], format='%d/%m/%Y')
df['end'] = pd.to_datetime(df['end'], format='%d/%m/%Y', errors='coerce')
df['end'] = df['end'].fillna(df['end'].max())
# first day of every month between the earliest start and the latest end
dt_range = pd.date_range(start=df['start'].min(), end=df['end'].max(), freq='MS')
# count the contracts active on each of those days
# (DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first)
rows = [{'month': dat.strftime('%B - %Y'),
         'number': int(((df['start'] <= dat) & (df['end'] >= dat)).sum())}
        for dat in dt_range]
df2 = pd.DataFrame(rows, columns=['month', 'number'])
Output:
month number
0 January - 2022 1
1 February - 2022 3
2 March - 2022 4
3 April - 2022 4
4 May - 2022 4
Or, if you want the format as in your question:
df2.set_index('month').T
month January - 2022 February - 2022 March - 2022 April - 2022 May - 2022
number 1 3 4 4 4
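If the loop becomes a bottleneck, a loop-free variant is possible. The following is only a sketch using NumPy broadcasting: every contract is compared against every check date at once, and a missing end date is treated as "still active". The helper name `active_on` is made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "start": pd.to_datetime(
        ["01/01/2022", "03/01/2022", "15/01/2022", "01/02/2022", "01/03/2022", "01/04/2022"],
        format="%d/%m/%Y"),
    "end": pd.to_datetime(
        ["29/01/2022", None, None, "01/03/2022", "01/05/2022", None],
        format="%d/%m/%Y"),
})

def active_on(df, dates):
    """Count the rows whose [start, end] interval contains each date (hypothetical helper)."""
    starts = df["start"].to_numpy()[:, None]   # shape (n_rows, 1)
    ends = df["end"].to_numpy()[:, None]       # shape (n_rows, 1), NaT means no end date
    check = dates.to_numpy()[None, :]          # shape (1, n_dates)
    still_open = np.isnat(ends)
    active = (starts <= check) & (still_open | (ends >= check))
    return pd.Series(active.sum(axis=0), index=dates, name="number")

month_starts = pd.date_range("2022-01-01", "2022-05-01", freq="MS")
counts = active_on(df, month_starts)
print(counts)
```

The same function should work for the weekly version by passing a different set of dates, for example `pd.date_range("2022-01-03", "2022-05-02", freq="W-MON")` for Mondays.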

Dealing with 'OutOfBoundsDatetime: Out of bounds nanosecond timestamp:'

I am working on a DataFrame looking at baseball games Date and their Attendance so I can create a Calendar Heatmap.
Date Attendance
1 Apr 7 44723.0
2 Apr 8 42719.0
3 Apr 9 36139.0
4 Apr 10 41253.0
5 Apr 11 20480.0
I've tried different solutions that I've come across...
- df['Date'] = df['Date'].astype('datetime64[ns]')
- df['Date'] = pd.to_datetime(df['Date'])
but I'll get the error of
'Out of bounds nanosecond timestamp: 1-04-07 00:00:00'.
From looking at my data, I don't even have a date that goes with that timestamp. I also looked at other posts on this site, and 1 potential problem is that my Dates are NOT zero padded? Could that be the cause?
You can convert to datetime if you supply a format, for example:
df
Out[33]:
Date Attendance
1 Apr 7 44723.0
2 Apr 8 42719.0
3 Apr 9 36139.0
4 Apr 10 41253.0
5 Apr 11 20480.0
pd.to_datetime(df["Date"], format="%b %d")
Out[35]:
1 1900-04-07
2 1900-04-08
3 1900-04-09
4 1900-04-10
5 1900-04-11
Name: Date, dtype: datetime64[ns]
If you're unhappy with the base year 1900, you can add a date offset, for example
df["datetime"] = pd.to_datetime(df["Date"], format="%b %d")
df["datetime"] += pd.tseries.offsets.DateOffset(years=100)
df["datetime"]
1 2000-04-07
2 2000-04-08
3 2000-04-09
4 2000-04-10
5 2000-04-11
Name: datetime, dtype: datetime64[ns]
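Another option, sketched here under the assumption that all games belong to one known season, is to append the year to the string before parsing. The value of `season_year` is a placeholder, since the question does not state the actual year.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["Apr 7", "Apr 8", "Apr 9", "Apr 10", "Apr 11"],
    "Attendance": [44723.0, 42719.0, 36139.0, 41253.0, 20480.0],
})

season_year = 2022  # assumption: placeholder, the question does not give the year

# "Apr 7" becomes "Apr 7 2022", which matches month abbreviation, day, year.
df["datetime"] = pd.to_datetime(df["Date"] + f" {season_year}", format="%b %d %Y")
print(df)
```

This avoids the intermediate 1900 dates altogether.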

How can I order the table by month and then show it in plot chart? Python

I want to order the table by year and by month.
sort_values doesn't work for me.
After that I need to show it in a line chart with months over time.
How can I do it?
df10=df.groupby(['year','month'],as_index=False).Sales.sum()
df10
year month Sales
0 2018 Apr 452546547.720000
1 2018 Aug 452830473.750001
2 2018 Dec 525888501.900000
3 2018 Feb 417589010.130000
4 2018 Jan 506665837.860000
5 2018 Jul 527113871.520000
6 2018 Jun 489527703.960000
7 2018 Mar 471807206.670001
8 2018 May 517740285.600000
9 2018 Nov 417862539.330000
10 2018 Oct 441153829.710001
11 2018 Sep 450298873.800000
12 2019 Apr 440397073.890000
13 2019 Feb 408684717.060001
14 2019 Jan 511212275.310001
15 2019 Mar 455560627.320000
16 2019 May 571120956.510000
sns.lineplot(x='month',y='Sales',data=df10)
To sort by month, you need to have the month as a number, or as text that sorts correctly. Either way, refer to my code below to get the month as a number, then plot the df however you like.
from time import strptime
df['month_num'] = [strptime(x, '%b').tm_mon for x in df['month']]
df = df.sort_values(['year', 'month_num'])
data['y-m'] = data['year'].astype(str) +'-'+ data['month']
data['y-m'] = pd.to_datetime(data['y-m'])
sns.lineplot(y='Sales',x='y-m',data=data)
plt.xticks(rotation=45)
plt.show()
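When sorting by dates, you first need to convert the month names to something that orders correctly.
The key parameter of sort_values helps you with that. It is applied to each column in by separately and receives the whole column as a Series:
df10.sort_values(by=['year', 'month'], key=lambda col: pd.to_datetime(col, format='%b').dt.month if col.name == 'month' else col)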
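As a self-contained sketch of the "real datetime column" idea: combine year and month into one datetime, sort by it, and use that column as the x axis. The frame below is a small sample of the question's table with rounded values, and the column name `date` is only an illustrative choice.

```python
import pandas as pd

# Small sample of the grouped table from the question (values rounded).
df10 = pd.DataFrame({
    "year": [2018, 2018, 2018, 2019, 2019],
    "month": ["Apr", "Feb", "Jan", "Feb", "Jan"],
    "Sales": [452546547.72, 417589010.13, 506665837.86, 408684717.06, 511212275.31],
})

# "2018-Apr" style strings parse with year plus month abbreviation;
# the day defaults to the first of the month.
df10["date"] = pd.to_datetime(
    df10["year"].astype(str) + "-" + df10["month"], format="%Y-%b"
)

# Chronological order, regardless of how the month names sort alphabetically.
df10 = df10.sort_values("date").reset_index(drop=True)
print(df10)
```

The sorted frame can then be passed to a plotting call such as `sns.lineplot(x='date', y='Sales', data=df10)`, so the axis runs in time order.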

Loop for starting month and starting year to end month and year python

Database df:
month year data
Jan 2017 ggg
Feb 2015 jhjj
Jan 2018 hjhj
Mar 2018 hjhj
and so on
Code:
def data_from_start_month_to_end_month:
    for y in range(start_year,end_year):
        do something
        for m in range(start_month,13):
            df = df[(df['month'] == m)&(df['year']== y)]
    return df
This will start from the starting month and year, but if the end month is not December it won't work.
Output I want:
start_month = Sep
start_year = 2000
end_month = Feb
end_year = 2019 say
So the loop should run from Sep 2000 to Feb 2019 and extract the data for those rows only (but I need the function to be generic, not hard-coded).
Can anyone help?
You can use the below function which uses series.between after converting the inputs to datetime:
def myf(df, start_month, start_year, end_month, end_year):
    s = pd.to_datetime(df['month'] + df['year'].astype(str), format='%b%Y')
    start = pd.to_datetime(start_month + str(start_year), format='%b%Y')
    end = pd.to_datetime(end_month + str(end_year), format='%b%Y')
    return df[s.between(start, end)]
myf(df,'Sep',2000,'Feb',2017)
month year data
0 Jan 2017 ggg
1 Feb 2015 jhjj
If the month column is numeric, use format='%m%Y' instead of format='%b%Y' for the column:
def myf1(df, start_month, start_year, end_month, end_year):
    s = pd.to_datetime(df['month'].astype(str) + df['year'].astype(str), format='%m%Y')
    start = pd.to_datetime(start_month + str(start_year), format='%b%Y')
    end = pd.to_datetime(end_month + str(end_year), format='%b%Y')
    return df[s.between(start, end)]
Example df:
month year data
0 1 2017 ggg
1 2 2015 jhjj
2 1 2018 hjhj
3 3 2018 hjhj
myf1(df,'Sep',2000,'Feb',2017)
month year data
0 1 2017 ggg
1 2 2015 jhjj
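If an actual loop over the months is still needed (for the "do something" step in the question), one possible sketch is to generate the months with `pd.period_range`, which handles the year rollover and an end month other than December. The function name `month_periods` is made up for this example.

```python
import pandas as pd

def month_periods(start_month, start_year, end_month, end_year):
    """Return every month from start to end, inclusive (hypothetical helper)."""
    start = pd.to_datetime(f"{start_month}{start_year}", format="%b%Y").to_period("M")
    end = pd.to_datetime(f"{end_month}{end_year}", format="%b%Y").to_period("M")
    return pd.period_range(start=start, end=end, freq="M")

months = month_periods("Sep", 2000, "Feb", 2019)

# Each element is a monthly Period exposing .year and .month.
for p in months[:3]:
    print(p.year, p.month)
```

Inside the loop, `p.year` and `p.month` can be compared against the year and month columns of the frame.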

sort pandas dataframe of month names in correct order

I have a dataframe with names of months of the year, i.e. Jan, Feb, March etc.,
and I want to sort the data first by month, then by category so it looks like
Month_Name | Cat
Jan 1
Jan 2
Jan 3
Feb 1
Feb 2
Feb 3
Without a custom sort function, you can easily add a temporary column which is the index of the month, and then sort by that:
import datetime
months = {datetime.datetime(2000, i, 1).strftime("%b"): i for i in range(1, 13)}
df["month_number"] = df["Month_Name"].map(months)
df.sort_values(by=[...])
You may wish to take advantage of pandas' good date parsing when reading in your dataframe, though: if you store the dates as dates instead of string month names then you'll be able to sort natively by them.
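An alternative that needs no helper column is an ordered categorical. This is a sketch assuming three-letter English abbreviations; the sample data is invented to match the layout in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Month_Name": ["Feb", "Jan", "Feb", "Jan", "Jan", "Feb"],
    "Cat": [2, 3, 1, 1, 2, 3],
})

# Calendar order written out explicitly so it does not depend on the locale.
month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# An ordered categorical sorts by position in month_order, not alphabetically.
df["Month_Name"] = pd.Categorical(df["Month_Name"], categories=month_order, ordered=True)
df = df.sort_values(["Month_Name", "Cat"]).reset_index(drop=True)
print(df)
```

Values that are not in `month_order` (for example "March" spelled out) would become NaN, so the labels may need normalising first.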
Use the Sort_Dataframeby_MonthandNumeric_cols function to sort a dataframe by month and by a numeric column.
You need to install the two packages shown below:
pip install sorted-months-weekdays
pip install sort-dataframeby-monthorweek
Example:
import pandas as pd
from sorted_months_weekdays import *
from sort_dataframeby_monthorweek import *
df = pd.DataFrame([['Jan',23],['Jan',16],['Dec',35],['Apr',79],['Mar',53],['Mar',12],['Feb',3]], columns=['Month','Sum'])
df
Out[11]:
Month Sum
0 Jan 23
1 Jan 16
2 Dec 35
3 Apr 79
4 Mar 53
5 Mar 12
6 Feb 3
To get the dataframe sorted by month and numeric column, I used the function mentioned above:
Sort_Dataframeby_MonthandNumeric_cols(df = df, monthcolumn='Month',numericcolumn='Sum')
Out[12]:
Month Sum
0 Jan 16
1 Jan 23
2 Feb 3
3 Mar 12
4 Mar 53
5 Apr 79
6 Dec 35
