Pandas- Split dataset based on overlapping time periods - python

I have reporting time periods that start on Mondays, end on Sundays, and run for 5 weeks. For example:
11/20/2017 - 12/24/2017 = t1
11/27/2017 - 12/31/2017 = t2
I have a dataframe that consists of 6 of these periods (starting 11/20/2017) and I'm trying to split it into 6 dataframes for each time period using the LeaveDate column. My data looks like this:
Barcode LeaveDate
ABC123 2017-11-22
ABC124 2017-12-04
ABC125 2017-12-15
As the dataframe is separated, some of the barcodes will fall into multiple periods- that's OK. I know I can do:
df['period'] = df['LeaveDate'].dt.to_period('M-SUN')
df['week'] = df['period'].dt.week
To get single weeks, but I don't know how to define a "multi-week" period. The problem is also that a barcode can fall under multiple periods, so it needs to be output to multiple dataframes. Any ideas? Thanks!

There might be a more succinct solution, but this should work (will give you a dictionary of DataFrames, one for each period):
import pandas as pd

df = pd.DataFrame([['ABC123', '2017-11-22'],
                   ['ABC124', '2017-12-04'],
                   ['ABC125', '2017-12-15']],
                  columns=['Barcode', 'LeaveDate'])
df['LeaveDate'] = pd.to_datetime(df['LeaveDate'])

periods = [('2017-11-20', '2017-12-24'), ('2017-11-27', '2017-12-31')]
results = {}
for period in periods:
    period_df = df[(df['LeaveDate'] >= period[0]) & (df['LeaveDate'] <= period[1])]
    results[period] = period_df
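If you'd rather not type out all six period boundaries by hand, here is a minimal sketch that generates them, assuming the six periods start on consecutive Mondays from 2017-11-20 and each span 5 weeks (35 days), as described in the question:
starts = pd.date_range('2017-11-20', periods=6, freq='W-MON')
periods = [(start, start + pd.Timedelta(days=34)) for start in starts]
# e.g. the first tuple is (Timestamp('2017-11-20'), Timestamp('2017-12-24'))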

Related

Using pandas and datetime in Jupyter to see during what hours no sales were ever made (on any day)

So I have sales data that I'm trying to analyze. I have datetime data ["Order Date Time"] and I'd like to see the most common hours for sales but more importantly I'd like to see what minutes have NO sales.
I have been spinning my wheels for a while and I can't get my brain around a solution. Any help is greatly appreciated.
I import the data:
df = pd.read_excel('Audit Period.xlsx')
print(df)
I clean up the data:
# Drop rows where "Order Date Time" is null
time_df = df[df["Order Date Time"].notnull()]
# Keep only the "Order Date Time" column and reset the index
time_df = time_df[["Order Date Time"]].reset_index(drop=True)
# Select the first 10 rows
time_df.head(10)
I convert to datetime and I look at the month totals:
# Convert "Order Date Time" to datetime
time_df = time_df.copy()
time_df["Order Date Time"] = time_df["Order Date Time"].apply(pd.to_datetime)
time_df = time_df.set_index(time_df["Order Date Time"])
# Group by month
grouped = time_df.resample("M").count()
time_df = pd.DataFrame({"count": grouped.values.flatten()}, index=grouped.index)
time_df.head(10)
I try to group by hour, but that gives me totals per day/hour rather than totals per hour of day (e.g. every order ever made at noon):
# Group by hour
grouped = time_df.resample("2H").count()
time_df = pd.DataFrame({"count": grouped.values.flatten()}, index=grouped.index)
time_df.head(10)
And that is where I'm stuck. I'm trying to integrate the below suggestions but can't quite get a grasp on them yet. Any help would be appreciated.
Not sure if this is the most brilliant solution, but I would start by generating a dataframe at the level of detail I wanted, whether that is 1-hour intervals, 5-minute intervals, etc. Then in your df with all the actual data, you could do your grouping as you currently are doing it above. Once it is grouped, join the two. That way you have one dataframe that includes empty rows associated with time spans with no records. The tricky part will just be making sure you have your date and time formatted in a way that it will match and join properly.
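A minimal sketch of that idea at one-hour granularity, assuming df is the frame read from the Excel file in the question (reindexing against a full index of hours plays the role of the join described above):
import pandas as pd

orders = df[df["Order Date Time"].notnull()].copy()
orders["Order Date Time"] = pd.to_datetime(orders["Order Date Time"])

# orders per hour of day (0-23), counted across all days
per_hour = orders.groupby(orders["Order Date Time"].dt.hour).size()

# reindex against every possible hour so hours with no sales appear as 0
per_hour = per_hour.reindex(range(24), fill_value=0)

# hours of the day in which no sale was ever made
no_sales_hours = per_hour[per_hour == 0].index.tolist()
print(no_sales_hours)
The same pattern works at 5-minute granularity by grouping on a rounded timestamp and reindexing against a full range of 5-minute slots.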

Pandas Columns with Date and Time - How to sort?

I have concatenated several csv files into one dataframe to make a combined csv file. But one of the columns has both date and time (e.g 02:33:01 21-Jun-2018) after being converted to date_time format. However when I call
new_dataframe = old_dataframe.sort_values(by = 'Time')
It sorts the dataframe by time, completely ignoring the date.
Index Time Depth(ft) Pit Vol(bbl) Trip Tank(bbl)
189147 00:00:00 03-May-2018 2283.3578 719.6753 54.2079
3875 00:00:00 07-May-2018 5294.7308 1338.7178 29.5781
233308 00:00:00 20-May-2018 8073.7988 630.7964 41.3574
161789 00:00:01 05-May-2018 122.2710 353.6866 58.9652
97665 00:00:01 01-May-2018 16178.8666 769.1328 66.0688
How do I get it to sort by date and then time, so that April's days come first and appear in chronological order?
In order to sort by date first and then time, your Time column needs to be in the order date followed by time. Currently, it's the opposite.
You can do this:
df['Time'] = df['Time'].str.split(' ').str[::-1].apply(lambda x: ' '.join(x))
df['Time'] = pd.to_datetime(df['Time'])
Now sort your df by Time like this:
df.sort_values('Time')
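As an aside, if the Time column really is stored as strings in the exact "HH:MM:SS DD-Mon-YYYY" form shown above (an assumption), a sketch that skips the token reversal and hands pd.to_datetime an explicit format would be:
# parse "00:00:00 03-May-2018" style strings directly
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S %d-%b-%Y')
new_dataframe = df.sort_values('Time')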

How to find the top 10 performing values of each week in python?

I would like to return the top 10 performing (by average) variables for each week in my DataFrame. It is about 2 years' worth of data.
I am using Python to figure this out but, would also eventually like to do it in SQL.
I have been able to produce code that returns the top 10 for the latest week, but would like results for every week.
Creating a df with the datetime range:
range_max = rtbinds['pricedate'].max()
range_min = range_max - datetime.timedelta(days=7)
sliced_df = rtbinds[(rtbinds['pricedate'] >= range_min)
& (rtbinds['pricedate'] <= range_max)]
Grouping, then sorting by 'shadow':
sliced_df.groupby(['pricedate','cons_name']).aggregate(np.mean) \
    .sort_values('shadow').head(10)
This returns the following for the first week of data:
pricedate cons_name shadow
2019-04-26 TEMP71_24753 -643.691
2019-04-27 TMP175_24736 -508.062
2019-04-25 TMP109_22593 -383.263
2019-04-23 TEMP48_24759 -376.967
2019-04-29 TEMP71_24753 -356.476
TMP175_24736 -327.230
TMP273_23483 -303.234
2019-04-27 TEMP71_24753 -294.377
2019-04-28 TMP175_24736 -272.603
TMP109_22593 -270.887
But I would like a result that returns the top 10 for each week, going back to the earliest date in my data.
Heads up: pd.sort_values sorts in ascending order by default, so when you take head(10) you actually get the 10 lowest (most negative) values.
Now for your problem, here is a solution.
First we need to create some columns to identify the week of the year (rtbinds is renamed df):
df['year'] = df['pricedate'].apply(lambda x: x.year)
df['week'] = df['pricedate'].apply(lambda x: x.isocalendar()[1])
Then we will group the data by ['year', 'week', 'cons_name'] :
df2 = df.groupby(['year', 'week', 'cons_name'], as_index=False).aggregate(np.mean)
You should now get a dataframe where, for each (year, week), there is only one record per cons_name, holding the mean shadow.
Then we will take the top 10 for each (year, week):
def udf(df):
    return df.sort_values('shadow').head(10)

df2.groupby(['year', 'week'], as_index=False).apply(udf)
This should give you the result you want.
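A slightly more concise sketch of the same idea, keeping the ascending order discussed above so "top 10" means the 10 most negative shadow values (top10_per_week is just an illustrative name):
# nsmallest(10, 'shadow') is equivalent to sort_values('shadow').head(10)
top10_per_week = (df2.groupby(['year', 'week'], as_index=False)
                     .apply(lambda g: g.nsmallest(10, 'shadow')))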

Multiple day wise plots in timeseries dataframe pandas

My data frame looks like this:
In [1]: df.head()
Out[1]:
Datetime Value
2018-04-21 14:08:30.761 offline
2018-04-21 14:08:40.761 offline
2018-04-21 14:08:50.761 offline
2018-04-21 14:09:00.761 offline
2018-04-21 14:09:10.761 offline
I have data for 2 weeks. I want to plot Value against time (hours:minutes) for each day of the week. Seeing the data one week at a time would also work.
I took a slice for a single day and created a plot using plotly.
In[9]: df['numval'] = df.Value.apply(lambda x: 1 if x == 'online' else -1)
In[10]: df.iplot()
If I can have multiple plots similar to this for Sunday to Saturday using a few lines, it would speed up my work.
Suggestions:
Something where I can pass weekday (0-6), time (x axis) and Value (y axis) as arguments and it would create 7 plots.
In[11]: df['weekday'] = df.index.weekday
In[12]: df['weekdayname'] = df.index.weekday_name
In[13]: df['time'] = df.index.time
Any library would work, as I just want to see the data and will need to test out modifications to it.
Optional: a distribution curve, similar to a KDE, over the data would be nice.
This may not be the exact answer you are looking for. Just giving an approach which could be helpful.
The approach here is to group the data based on date and then generate a plot for each group. For this you need to split the Datetime column into two columns, date and time. The code below will do that:
datetime_series = df['Datetime']
date_series = pd.Series(dtype=str)
time_series = pd.Series(dtype=str)
for datetime_string in datetime_series:
    date, time = datetime_string.split(" ")
    date_s = pd.Series(date, dtype=str)
    time_s = pd.Series(time, dtype=str)
    date_series = date_series.append(date_s, ignore_index=True)
    time_series = time_series.append(time_s, ignore_index=True)
The code above will give you two separate pandas Series, one for date and the other for time. Now you can add them as columns to your dataframe:
df['date'] = date_series
df['time'] = time_series
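If the Datetime column can be parsed as datetimes (an assumption; the loop above treats the values as plain strings), a shorter vectorized sketch of the same split would be:
df['Datetime'] = pd.to_datetime(df['Datetime'])
df['date'] = df['Datetime'].dt.date   # datetime.date objects
df['time'] = df['Datetime'].dt.time   # datetime.time objects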
Now you can use groupby functionality to group the data based on date and plot data for each group. Something like this:
First replace 'offline' with value 0:
df1 = df.replace(to_replace='offline',value=0)
Now group the data based on date and plot:
for title, group in df1.groupby('date'):
    group.plot(x='time', y='Value', title=title)
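The question ultimately asks for one plot per weekday rather than per calendar date. A minimal sketch of that variant, reusing the weekdayname and numval columns already created in the question and assuming the Datetime index from there (pandas' built-in .plot/matplotlib rather than plotly is also an assumption; hourfloat is a helper name introduced here):
# numeric hour-of-day so matplotlib has a plain numeric x axis
df['hourfloat'] = df.index.hour + df.index.minute / 60
for title, group in df.groupby('weekdayname'):
    group.plot(x='hourfloat', y='numval', title=title)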

Sum large pandas dataframe based on smaller date ranges

I have a large pandas dataframe that has hourly data associated with it. I then want to parse that into "monthly" data that sums the hourly data. However, the months aren't necessarily calendar months; they typically start in the middle of one month and end in the middle of the next month.
I could build a list of the "months" that each of these date ranges fall into and loop through it, but I would think there is a much better way to do this via pandas.
Here's my current code, the last line throws an error and is the crux of the question:
dates = pd.Series(pd.date_range('1/1/2015 00:00','3/31/2015 23:45',freq='1H'))
nums = np.random.randint(0,100,dates.count())
df = pd.DataFrame({'date':dates, 'num':nums})
month = pd.DataFrame({'start':['1/4/2015 00:00','1/24/2015 00:00'], 'end':['1/23/2015 23:00','2/23/2015 23:00']})
month['start'] = pd.to_datetime(month['start'])
month['end'] = pd.to_datetime(month['end'])
month['num'] = df['num'][(df['date'] >= month['start']) & (df['date'] <= month['end'])].sum()
I would want an output similar to:
start end num
0 2015-01-04 2015-01-23 23:00:00 33,251
1 2015-01-24 2015-02-23 23:00:00 39,652
but of course, I'm not getting that.
pd.merge_asof is only available with pandas 0.19+.
Combination of pd.merge_asof + query + groupby:
pd.merge_asof(df, month, left_on='date', right_on='start') \
.query('date <= end').groupby(['start', 'end']).num.sum().reset_index()
explanation
pd.merge_asof
From docs
For each row in the left DataFrame, we select the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key. Both DataFrames must be sorted by the key.
But this only takes into account the start date.
query
I take care of the end date with query, since after pd.merge_asof I now conveniently have end in my dataframe.
groupby
I trust this part is obvious.
Maybe you can add a number of days and then convert to a period:
# create data
dates = pd.Series(pd.date_range('1/1/2015 00:00','3/31/2015 23:45',freq='1H'))
nums = np.random.randint(0,100,dates.count())
df = pd.DataFrame({'date':dates, 'num':nums})
# offset days and then create period
df['periods'] = (df.date + pd.tseries.offsets.Day(23)).dt.to_period('M')
# group and sum
df.groupby('periods')['num'].sum()
Output
periods
2015-01 10051
2015-02 34229
2015-03 37311
2015-04 26655
You can then shift the dates back and make new columns
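For that last step, a minimal sketch reusing the 23-day offset from above (summed and start are just illustrative names):
summed = df.groupby('periods')['num'].sum().reset_index()
# shift the period start back by the same 23 days to recover the
# start date of each original range
summed['start'] = summed['periods'].dt.start_time - pd.tseries.offsets.Day(23)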
