How can I make a plot counting the elements that fall in a specific time group? I have a dataframe like this:
Incident_Number Submit_Date Description
001 05/04/2017 12:00:45 Problem1
002 05/05/2017 13:00:00 Problem2
003 05/05/2017 14:00:00 Problem3
004 07/05/2017 19:00:00 Problem4
005 07/06/2017 08:00:00 Problem5
How could I make a line plot that shows the total incidents by month, date, weekday, or year? I tried groupby, but that takes many lines: first extracting the month, year, and date, and then converting back to datetime to visualize. Any ideas?
Thanks for your help
Start with converting Submit_Date into a datetime (if it is not a datetime yet) and making it the index:
df['Submit_Date'] = pd.to_datetime(df['Submit_Date'])
df.set_index('Submit_Date', inplace=True)
Now you can resample your data at any frequency and plot it. For example, resample by 1 month (get monthly counts):
df.resample('1M').count()['Description'].plot()
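The same pattern covers the other groupings you mention; here is a small sketch (the weekday grouping uses groupby, since resample only handles calendar frequencies):

# Weekly or yearly counts: just change the resample rule.
df.resample('W').count()['Description'].plot()
df.resample('Y').count()['Description'].plot()

# Counts per weekday (Mon=0 .. Sun=6) are a groupby, not a resample.
df.groupby(df.index.dayofweek)['Description'].count().plot()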
At the moment I am working on a time series project.
I have daily data points over a 5-year timespan. In between there are some days with 0 values and some days are missing.
For example:
2015-01-10 343
2015-03-10 128
Day 2 of October is missing.
In order to build a good time series model, I want to resample the data to monthly frequency:
df.individuals.resample("M").sum()
but I am getting the following output:
2015-01-31 343.000000
2015-02-28 NaN
2015-03-31 64.500000
Somehow the months are completely wrong.
The expected output would look like this:
2015-31-10 Sum of all days
2015-30-11 Sum of all days
2015-31-12 Sum of all days
Pandas is interpreting your date as %Y-%m-%d.
You should explicitly specify your date format before doing the resample.
Try this:
df.index = pd.to_datetime(df.index, format="%Y-%d-%m")
>>> df.resample("M").sum()
2015-10-31 471
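To see why the months came out wrong, here is a minimal sketch of the two parses (the column name individuals is taken from your snippet):

import pandas as pd

s = pd.Series([343, 128], index=['2015-01-10', '2015-03-10'], name='individuals')

# Default parsing reads these as %Y-%m-%d: Jan 10 and Mar 10.
pd.to_datetime(s.index)

# With the explicit format they become Oct 1 and Oct 3, as intended.
pd.to_datetime(s.index, format='%Y-%d-%m')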
I am working on time-series data with two columns, date and quantity. The data is day-wise. I want to add up all the quantity for a month and roll it into a single date.
date is my index column.
Example
quantity
date
2018-01-03 30
2018-01-05 45
2018-01-19 30
2018-02-09 10
2018-02-19 20
Output :
quantity
date
2018-01-01 105
2018-02-01 30
Thanks in advance!!
You can downsample to combine the data for each month and aggregate it by chaining the sum method.
df.resample("M").sum()
Check out the pandas user guide on resampling here.
You'll need to make sure your index is in datetime format for this to work. So first do: df.index = pd.to_datetime(df.index). Hat tip to sammywemmy for the same advice in the comments.
You can also use groupby to get the same result.
df.index = pd.to_datetime(df.index)
df.groupby(df.index.strftime('%Y-%m-01')).sum()
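If you want the month-start dates from your expected output directly, a small variation is to resample with the 'MS' (month start) alias instead of 'M':

# 'MS' labels each bin with the first day of the month: 2018-01-01, 2018-02-01, ...
df.resample('MS').sum()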
I have two time series, df1:
day cnt
2020-03-01 135006282
2020-03-02 145184482
2020-03-03 146361872
2020-03-04 147702306
2020-03-05 148242336
and df2:
day cnt
2017-03-01 149104078
2017-03-02 149781629
2017-03-03 151963252
2017-03-04 147384922
2017-03-05 143466746
The problem is that the sensors I'm measuring are sensitive to the day of the week, so on a Sunday, for instance, they will produce less cnt. Now I need to compare the time series over two different years, 2017 and 2020, but to do that I have to align the months (March, in this case) on the matching day of the week, and plot them accordingly. How do I "shift" the data to make the series comparable?
The ISO calendar represents a date as a tuple (year, week number, weekday). In pandas these are available through the dt accessor as year, isocalendar().week, and weekday. So assuming that the day column actually contains Timestamps (convert it first with to_datetime if it does not), you could do:
df1['Y'] = df1.day.dt.year
df1['W'] = df1.day.dt.isocalendar().week  # dt.weekofyear was removed in pandas 2.0
df1['D'] = df1.day.dt.weekday
Then you could align the dataframes on the W and D columns, for example with a merge:
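A minimal sketch of that alignment, assuming df2 gets the same three helper columns and that matching on ISO week and weekday is acceptable for your data (the suffixes are illustrative):

df2['Y'] = df2.day.dt.year
df2['W'] = df2.day.dt.isocalendar().week
df2['D'] = df2.day.dt.weekday

# Rows pair up when they fall on the same ISO week number and weekday.
aligned = df1.merge(df2, on=['W', 'D'], suffixes=('_2020', '_2017'))
aligned[['cnt_2020', 'cnt_2017']].plot()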
March 2017 started on a Wednesday.
March 2020 started on a Sunday.
So, delete the last 3 days of March 2017,
and delete the first Sunday, Monday, and Tuesday from March 2020.
This way you have comparable days:
df1['cnt2020'] = df1['cnt']
df2['cnt2017'] = df2['cnt']
df1 = df1.iloc[3:, 2]   # drop the first 3 days of March 2020 (Sun, Mon, Tue)
df2 = df2.iloc[:-3, 2]  # drop the last 3 days of March 2017
Since you don't want to plot the date, but want the months to align, make a new dataframe with both columns and an index column. This way you will have 3 columns: index (0-27), 2017, and 2020. The index will represent the day offset within the month.
# Reset the indices so the two series align positionally.
new_df = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
If you also want to plot the days of the week on the x axis, check out this link to learn how to get the day of the week from a date and then change the x tick labels, roughly as sketched here:
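A rough sketch of those tick labels, assuming the new_df built above and matplotlib (the start date 2020-03-04 follows from the trimming, since both series now begin on a Wednesday):

import matplotlib.pyplot as plt

ax = new_df.plot()
# Weekday names for the aligned days.
labels = pd.date_range('2020-03-04', periods=len(new_df)).day_name()
ax.set_xticks(range(len(new_df)))
ax.set_xticklabels(labels, rotation=90)
plt.show()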
Sorry for the written step-by-step; if it all sounds confusing, I can type the whole code later for you.
Currently my data looks like :
user_ID order_number order_start_date order_value week_day
237 135950 1594878.0 2018-01-01 534.0 Monday
235 32911 1594942.0 2018-01-01 89.0 Monday
232 208474 1594891.0 2018-01-01 85.0 Monday
231 9048 1594700.0 2018-01-01 224.0 Monday
228 134896 1594633.0 2018-01-01 449.0 Monday
What I want to achieve is to group the records by user_ID, take the min and max date for each user, and find the difference between them in days. Where I am struggling:
Groupby does not inherently support a min-max difference aggregation.
It is not possible to perform numerical operations such as mean() on a datetime series that exists as a column in a dataframe, though it is possible for an individual series.
Any help?
I feel like your description was practically the pseudocode!
output = df.groupby('user_ID')['order_start_date'].apply(lambda g: g.max()-g.min())
You can then get the difference in days as numbers (rather than timedeltas):
output = [i / pd.Timedelta(days=1) for i in output]
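As a side note, the division is also vectorized over the whole timedelta Series, so the list comprehension is not strictly needed:

# Equivalent, without the Python-level loop:
output = output / pd.Timedelta(days=1)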
The output on your example data is all 0 because there is only one entry per user; is this what you expect?
As for taking the mean, you just need to represent the dates as seconds since some epoch and then take the average. I had tried converting everything to timedeltas since an old date and then averaging, but this post does it better and works well with groupby. Here's a test scenario where it's all data for one user_ID and the dates go from Jan 1st to Jan 5th, 2020:
import numpy as np

df.loc[:, 'user_ID'] = 1111
df['order_start_date'] = pd.date_range('01-01-2020', '01-05-2020', periods=5)
# View the datetimes as integer seconds so they can be averaged.
df['order_start_date'] = np.array(df['order_start_date'], dtype='datetime64[s]').view('i8')
output = df.groupby('user_ID')['order_start_date'].mean().astype('datetime64[s]')
Results:
user_ID
1111 2020-01-03
I have the following dataframe which is at a day level:
BillDate S2Rate
4 2019-06-04 4686.5
3 2019-06-03 1557.5
2 2019-05-21 10073.5
1 2019-05-19 6501.5
0 2019-05-18 1378.0
I want to calculate the week-over-week (WoW) percentage change and the WoW increase or decrease using this data. How do I do this?
Also, how do I replicate this for year-over-year (YoY) and day-over-day?
You should use resample. Then you can use functions like pct_change and diff to get the differences:
# df["BillDate"] = pd.to_datetime(df["BillDate"])
week_over_week = df.set_index("BillDate").resample("W").sum()
week_over_week_pct = week_over_week.pct_change()
week_over_week_increase = week_over_week.diff()
You can replace the parameter for resample with "D" for day over day, "Y" for year over year and many other options for more complex time ranges.
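For instance, the day-over-day and year-over-year variants are the same pattern with a different rule:

base = df.set_index("BillDate")
day_over_day = base.resample("D").sum().pct_change()
year_over_year = base.resample("Y").sum().pct_change()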
Set BillDate as index after coercing it to a datetime
df.set_index(pd.to_datetime(df['BillDate']), inplace=True)
df
Get rid of BillDate from columns now that you moved it to index
df.drop(columns=['BillDate'], inplace=True)
Resample to required period, calculate sum and percentage change
df.resample('W')['S2Rate'].sum().pct_change().to_frame()
Please note that resample labels each bin with the last date in the period:
'W' - sets the date to the week's Sunday
'M' - sets the date to the last date in the month
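For instance, on the sample data above the weekly sums come out labeled by Sundays (values computed from the question's table):

df.resample('W')['S2Rate'].sum()
# 2019-05-19     7879.5   (2019-05-18 + 2019-05-19)
# 2019-05-26    10073.5
# 2019-06-02        0.0   (no bills that week)
# 2019-06-09     6244.0   (2019-06-03 + 2019-06-04)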