Here's a quick problem that I, at first, dismissed as easy. An hour in, and I'm not so sure!
So, I have a list of Python datetime objects, and I want to graph them. The x-values are the year and month, and the y-values would be the amount of date objects in this list that happened in this month.
Perhaps an example will demonstrate this better (dd/mm/yyyy):
[28/02/2018, 01/03/2018, 16/03/2018, 17/05/2018]
-> ([02/2018, 03/2018, 04/2018, 05/2018], [1, 2, 0, 1])
My first attempt tried to simply group by date and year, along the lines of:
import itertools
group = itertools.groupby(dates, lambda date: date.strftime("%b/%Y"))
graph = zip(*[(k, len(list(v)) for k, v in group]) # format the data for graphing
As you've probably noticed though, this will group only by dates that are already present in the list. In my example above, the fact that none of the dates occurred in April would have been overlooked.
Next, I tried finding the starting and ending dates, and looping over the months between them:
import datetime
data = [[], [],]
for year in range(min_date.year, max_date.year):
for month in range(min_date.month, max_date.month):
k = datetime.datetime(year=year, month=month, day=1).strftime("%b/%Y")
v = sum([1 for date in dates if date.strftime("%b/%Y") == k])
data[0].append(k)
data[1].append(v)
Of course, this only works if min_date.month is smaller than max_date.month which is not necessarily the case if they span multiple years. Also, its pretty ugly.
Is there an elegant way of doing this?
Thanks in advance
EDIT: To be clear, the dates are datetime objects, not strings. They look like strings here for the sake of being readable.
I suggest use pandas:
import pandas as pd
dates = ['28/02/2018', '01/03/2018', '16/03/2018', '17/05/2018']
s = pd.to_datetime(pd.Series(dates), format='%d/%m/%Y')
s.index = s.dt.to_period('m')
s = s.groupby(level=0).size()
s = s.reindex(pd.period_range(s.index.min(), s.index.max(), freq='m'), fill_value=0)
print (s)
2018-02 1
2018-03 2
2018-04 0
2018-05 1
Freq: M, dtype: int64
s.plot.bar()
Explanation:
First create Series from list of dates and convert to_datetimes.
Create PeriodIndex by Series.dt.to_period
groupby by index (level=0) and get counts by GroupBy.size
Add missing periods by Series.reindex by PeriodIndex created by max and min values of index
Last plot, e.g. for bars - Series.plot.bar
using Counter
dates = list()
import random
import collections
for y in range(2015,2019):
for m in range(1,13):
for i in range(random.randint(1,4)):
dates.append("{}/{}".format(m,y))
print(dates)
counter = collections.Counter(dates)
print(counter)
for your problem with dates with no occurrences you can use the subtract method of Counter
generate a list with all range of dates, each date will appear on the list only once, and then you can use subtract
like so
tmp_date_list = ["{}/{}".format(m,y) for y in range(2015,2019) for m in range(1,13)]
counter.subtract(tmp_date_list)
Related
Assuming that I have a series made of daily values:
dates = pd.date_range('1/1/2004', periods=365, freq="D")
ts = pd.Series(np.random.randint(0,101, 365), index=dates)
I need to use .groupby or .reduce with a fixed schema of dates.
Use of the ts.resample('8d') isn't an option as dates need to not fluctuate within the month and the last chunk of the month needs to be flexible to address the different lengths of the months and moreover in case of a leap year.
A list of dates can be obtained through:
g = dates[dates.day.isin([1,8,16,24])]
How I can group or reduce my data to the specific schema so I can compute the sum, max, min in a more elegant and efficient way than:
for i in range(0,len(g)-1):
ts.loc[(dec[i] < ts.index) & (ts.index < dec[i+1])]
Well from calendar point of view, you can group them to calendar weeks, day of week, months and so on.
If that is something that you would be intrested in, you could do that easily with datetime and pandas for example:
import datetime
df['week'] = df['date'].dt.week #create week column
df.groupby(['week'])['values'].sum() #sum values by weeks
I have a little problem with the .loc function.
Here is the code:
date = df.loc [df ['date'] == d] .index [0]
d is a specific date (e.g. 21.11.2019)
The problem is that the weekend can take days. In the dataframe in the column date there are no values for weekend days. (contains calendar days for working days only)
Is there any way that if d is on the weekend he'll take the next day?
I would have something like index.get_loc, method = bfill
Does anyone know how to implement that for .loc?
IIUC you want to move dates of format: dd.mm.yyyy to nearest Monday, if they happen to fall during the weekend, or leave them as they are, in case they are workdays. The most efficient approach will be to just modify d before you pass it to pandas.loc[...] instead of looking for the nearest neighbour.
What I mean is:
import datetime
d="22.12.2019"
dt=datetime.datetime.strptime(d, "%d.%m.%Y")
if(dt.weekday() in [5,6]):
dt=dt+datetime.timedelta(days=7-dt.weekday())
d=dt.strftime("%d.%m.%Y")
Output:
23.12.2019
Edit
In order to just take first date, after or on d, which has entry in your dataframe try:
import datetime
df['date']=pd.to_datetime(df['date'], format='%d.%m.%Y')
dt=datetime.datetime.strptime(d, "%d.%m.%Y")
d=df.loc[df ['date'] >= d, 'date'].min()
dr.loc[df['date']==d]...
...
I have the following data set:
df
OrderDate Total_Charged
7/9/2017 5
7/9/2017 5
7/20/2017 10
8/20/2017 6
9/20/2019 1
...
I want to make a bar plot with month_year (X-axis) and Total charged per month/year i.e. sum it over month and year. Firstly, I want to groupby month and year and next make the plot.However, I get error on the first step:
df["OrderDate"]=pd.to_datetime(df['OrderDate'])
monthly_orders=df.groupby([(df.index.year),(df.index.month)]).sum()["Total_Charged"]
Got following error:
AttributeError: 'RangeIndex' object has no attribute 'year'
What am I doing wrong (what does the error mean)? How can i fix it?
Not sure why you're grouping by the index there. If you want the group by year and month respectively you could do the following:
df["OrderDate"]=pd.to_datetime(df['OrderDate'])
df.groupby([df.OrderDate.dt.year, df.OrderDate.dt.month]).sum().plot.bar()
pandas.DataFrame.resample
This is a versatile option, that easily implements aggregation over various time ranges (e.g. weekly, daily, quarterly, etc)
Code:
A more expansive dataset:
This code block sets up the sample dataset.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# list of dates
first_date = datetime(2017, 1, 1)
last_date = datetime(2019, 9, 20)
x = 4
list_of_dates = [date for date in np.arange(first_date, last_date, timedelta(days=x)).astype(datetime)]
df = pd.DataFrame({'OrderDate': list_of_dates,
'Total_Charged': [np.random.randint(10) for _ in range(len(list_of_dates))]})
Using resample for Monthly Sum:
requires a datetime index
df.OrderDate = pd.to_datetime(df.OrderDate)
df.set_index('OrderDate', inplace=True)
monthly_sums = df.resample('M').sum()
monthly_sums.plot.bar(figsize=(8, 6))
plt.show()
An example with Quarterly Avg:
this shows the versatility of resample compared to groupby
Quarterly would not be easily implemented with groupby
quarterly_avg = df.resample('Q').mean()
quarterly_avg.plot.bar(figsize=(8, 6))
plt.show()
So I have a pandas date_range like so
dates = pd.date_range(start='2005-1-1', end='2014-12-31', freq='D')
I want to remove all the extra days resulting from leap years.
I do a for loop
for each in index:
if each.month==2 and each.day==29:
print(each) # I actually want to delete this item from dates
But my problem is that I don't know how to delete the item. The regular python list methods and functions doesn't work.
I've looked everywhere on SO. I've looked at the documentation for pandas.date_range but found nothing
Any help will be appreciated.
You probably want to use drop to remove the rows.
import pandas as pd
dates = pd.date_range(start='2005-1-1', end='2014-12-31', freq='D')
leap = []
for each in dates:
if each.month==2 and each.day ==29:
leap.append(each)
dates = dates.drop(leap)
You could try creating two Series objects to store the months and days separately and use them as masks.
dates = pd.date_range(start='2005-1-1', end='2014-12-31', freq='D') #All dates between range
days = dates.day #Store all the days
months = dates.month #Store all the months
dates = dates[(days != 29) & (months != 2)] #Filter dates using a mask
Just to check if the approach works, If you change the != condition to ==, we can see the dates you wish to eliminate.
UnwantedDates = dates[(days == 29) & (months == 2)]
Output:
DatetimeIndex(['2008-02-29', '2012-02-29'], dtype='datetime64[ns]', freq=None)
You can try:
dates = dates[~dates['Date'].str.contains('02-29')]
In place of Date you will have to put the name of the column where the dates are stored.
You don't have to use the for loop so it is faster to run.
I have a list of lists composed of dates in excel float format (every minute since July 5, 1996) and an integer value associated with each date like this: [[datetime,integer]...]. I need to create a new list composed of all of the dates (no hours or minutes) and the sum of the values for all of the datetimes within that date. In other words, what is the sum of the values for each date when listolists[x][0] >= math.floor(listolists[x][0]) and listolists[x][0] < math.floor(listolists[x][0]). Thanks
Since you didn't provide any actual data (just the data structure you used, nested lists), I created some dummy data below to demonstrate how you might do a SUMIFS-type of problem in Python.
from datetime import datetime
import numpy as np
import pandas as pd
dates_list = []
# just take one month as an example of how to group by day
year = 2015
month = 12
# generate similar data to what you might have
for day in range(1, 32):
for hour in range(1, 24):
for minute in range(1, 60):
dates_list.append([datetime(year, month, day, hour, minute), np.random.randint(20)])
# unpack these nested list pairs so we have all of the dates in
# one list, and all of the values in the other
# this makes it easier for pandas later
dates, values = zip(*dates_list)
# to eventually group by day, we need to forget about all intra-day data, e.g.
# different hours and minutes. we only care about the data for a given day,
# not the by-minute observations. So, let's set all of the intra-day values to
# some constant for easier rolling-up of these dates.
new_dates = []
for d in dates:
new_d = d.replace(hour = 0, minute = 0)
new_dates.append(new_d)
# throw the new dates and values into a pandas.DataFrame object
df = pd.DataFrame({'new_dates': new_dates, 'values': values})
# here's the SUMIFS function you're looking for
grouped = df.groupby('new_dates')['values'].sum()
Let's see the results:
>>> print(grouped.head())
new_dates
2015-12-01 12762
2015-12-02 13292
2015-12-03 12857
2015-12-04 12762
2015-12-05 12561
Name: values, dtype: int64
Edit: If you want these new grouped data back in the nested list format, just do this:
new_list = [[date, value] for date, value in zip(grouped.index, grouped)]
Thanks everyone. This is the simplest code I could come up with that doesn't require panda:
for row in listolist:
for k in (0, 1):
row[k] = math.floor(float(row[k]))
date = {}
for d,v in listolist:
if d in date:
date[math.floor(d)].append(v)
else:
date[math.floor(d)] = [v]
result = [(d,sum(v)) for d,v in date.items()]