I have a data set with an irregularly spaced datetime column as my index, a numeric measurement, and three columns that each indicate whether a safety mechanism is activated to block the measurement. Example:
DateTime Safe1 Safe2 Safe3 Measurement
1/8/2013 6:06 N Y N
1/8/2013 6:23 N Y N
1/8/2013 6:40 N N N 28
1/8/2013 6:57 N N N 31
I need to resample the data using Pandas in order to create clean half-hour interval data, taking the mean of values where any exist. Of course, this removes the three safety string columns.
However, I would like to include a column that indicates Y if any of the safety mechanisms was activated at any point during the half-hour interval.
How do I get this string column in the resampled data, showing Y whenever a Y was present among the three safety mechanism columns in the raw data, with no value in Measurement for those intervals?
Desired Output based upon above:
DateTime Safe1 Measurement
1/8/2013 6:00 Y
1/8/2013 6:30 N 29.5
I don't think it's possible to do what you want with the resample function alone, as there's not much customisation you can do there. We can instead use a grouper with a groupby operation.
First, create the data:
import pandas as pd

index = ['1/8/2013 6:06', '1/8/2013 6:23', '1/8/2013 6:40', '1/8/2013 6:57']
data = {'Safe1': ['N', 'N', 'N', 'N'],
        'Safe2': ['Y', 'Y', 'N', 'N'],
        'Safe3': ['N', 'N', 'N', 'N'],
        'Measurement': [0, 0, 28, 31]}
df = pd.DataFrame(index=index, data=data)
df.index = pd.to_datetime(df.index)
df
Output:
Measurement Safe1 Safe2 Safe3
2013-01-08 06:06:00 0 N Y N
2013-01-08 06:23:00 0 N Y N
2013-01-08 06:40:00 28 N N N
2013-01-08 06:57:00 31 N N N
Then let's add a helper column, called Safe, that will be a concatenation of all the Safex columns. If there's at least one Y in the Safe column, we'll know that the safety mechanism was activated.
df['Safe'] = df['Safe1'] + df['Safe2'] + df['Safe3']
print(df)
Output:
Measurement Safe1 Safe2 Safe3 Safe
2013-01-08 06:06:00 0 N Y N NYN
2013-01-08 06:23:00 0 N Y N NYN
2013-01-08 06:40:00 28 N N N NNN
2013-01-08 06:57:00 31 N N N NNN
Finally, we define a custom function that returns Y if there's at least one Y among the strings passed to it.
That custom function is applied to the Safe column after grouping by 30-minute intervals:
def func(x):
    x = ''.join(x.values)
    return 'Y' if 'Y' in x else 'N'

# pd.TimeGrouper was deprecated and later removed; pd.Grouper is its replacement
df.groupby(pd.Grouper(freq='30Min')).agg({'Measurement': 'mean', 'Safe': func})
Output:
Safe Measurement
2013-01-08 06:00:00 Y 0.0
2013-01-08 06:30:00 N 29.5
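As an aside, the string concatenation can be swapped for a vectorised any-check; a small sketch of the same idea:
# True where any of the three columns holds 'Y'; map back to Y/N strings
any_safe = df[['Safe1', 'Safe2', 'Safe3']].eq('Y').any(axis=1)
df['Safe'] = any_safe.map({True: 'Y', False: 'N'})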
Here's an answer using the pandas built-in resample function.
First combine the 3 Safe values into a single column:
df['Safe'] = df.Safe1 + df.Safe2 + df.Safe3
Turn the 3-letter strings into a 0-1 variable:
df.Safe = df.Safe.apply(lambda x: 1 if 'Y' in x else 0)
Write a custom resampling function for the Safe column:
def f(x):
    if sum(x) > 0:
        return 'Y'
    else:
        return 'N'
Finally, resample:
# Renaming with a dict in Series.agg was deprecated and later removed,
# so aggregate and convert to a frame instead
df.resample('30T').Safe.agg(f).to_frame().join(df.resample('30T').Measurement.mean())
Output:
Safe Measurement
2013-01-08 06:00:00 Y 0.0
2013-01-08 06:30:00 N 29.5
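For reference, on pandas >= 0.25 the two aggregations can be collapsed into one call with named aggregation; a sketch, reusing the 0/1 Safe helper column built above:
out = df.groupby(pd.Grouper(freq='30Min')).agg(
    Safe=('Safe', lambda s: 'Y' if s.sum() > 0 else 'N'),
    Measurement=('Measurement', 'mean'),
)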
I manually resample the dates (easy if it amounts to rounding)...
Here is an example
from datetime import datetime, timedelta
from random import randint, seed

import pandas as pd
from tabulate import tabulate

def df_to_md(df):
    print(tabulate(df, tablefmt="pipe", headers="keys"))

seed(42)
# Create records: tom should score ~90%, dick ~50% and poor harry only ~10%
people = ['tom', 'dick', 'harry']
avg_score = [90, 50, 10]
date_times = list(pd.date_range(datetime.now() - timedelta(days=2), datetime.now(), freq='5 min').values)
scale = 1 + int(len(date_times) / len(people))
score = [randint(i, 100) * i / 10000 for i in avg_score * scale]
df = pd.DataFrame.from_records(list(zip(date_times, people * scale, score)), columns=['When', 'Who', 'Status'])

# Tom should score well
df_to_md(df[df.Who == 'tom'].head())
The table is in Markdown format - just to ease my cut and paste...
| | When | Who | Status |
|---:|:---------------------------|:------|---------:|
| 0 | 2019-06-18 14:07:17.457124 | tom | 0.9 |
| 3 | 2019-06-18 14:22:17.457124 | tom | 0.846 |
| 6 | 2019-06-18 14:37:17.457124 | tom | 0.828 |
| 9 | 2019-06-18 14:52:17.457124 | tom | 0.9 |
| 12 | 2019-06-18 15:07:17.457124 | tom | 0.819 |
Harry scores badly
df_to_md(df[df.Who=='harry'].head())
| | When | Who | Status |
|---:|:---------------------------|:------|---------:|
| 2 | 2019-06-18 14:17:17.457124 | harry | 0.013 |
| 5 | 2019-06-18 14:32:17.457124 | harry | 0.038 |
| 8 | 2019-06-18 14:47:17.457124 | harry | 0.023 |
| 11 | 2019-06-18 15:02:17.457124 | harry | 0.079 |
| 14 | 2019-06-18 15:17:17.457124 | harry | 0.064 |
Let's get the average per hour per person:
def round_to_hour(t):
    # Rounds to the nearest hour by adding an hour if minute >= 30
    return (t.replace(second=0, microsecond=0, minute=0, hour=t.hour)
            + timedelta(hours=t.minute // 30))
And generate a new column using this method.
df['WhenRounded'] = df.When.apply(round_to_hour)
df_to_md(df[df.Who=='tom'].head())
This should be tom's data - showing original and rounded.
| | When | Who | Status | WhenRounded |
|---:|:---------------------------|:------|---------:|:--------------------|
| 0 | 2019-06-18 14:07:17.457124 | tom | 0.9 | 2019-06-18 14:00:00 |
| 3 | 2019-06-18 14:22:17.457124 | tom | 0.846 | 2019-06-18 14:00:00 |
| 6 | 2019-06-18 14:37:17.457124 | tom | 0.828 | 2019-06-18 15:00:00 |
| 9 | 2019-06-18 14:52:17.457124 | tom | 0.9 | 2019-06-18 15:00:00 |
| 12 | 2019-06-18 15:07:17.457124 | tom | 0.819 | 2019-06-18 15:00:00 |
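As an aside, pandas can do this rounding natively via the dt accessor; a sketch (note dt.round breaks exact half-hour ties differently from the helper above, rounding half to even):
# Vectorised nearest-hour rounding, equivalent to round_to_hour except
# at exact :30 boundaries
df['WhenRounded'] = df['When'].dt.round('H')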
We can resample... by grouping and using an aggregation function.
Group by the rounded date and the person (Datetime and Str objects) - we want the mean value in this case, but other aggregations are also available.
df_resampled = df.groupby(by=['WhenRounded', 'Who']).agg({'Status': 'mean'}).reset_index()
# Output in Markdown format
df_to_md(df_resampled[df_resampled.Who=='tom'].head())
| | WhenRounded | Who | Status |
|---:|:--------------------|:------|---------:|
| 2 | 2019-06-18 14:00:00 | tom | 0.873 |
| 5 | 2019-06-18 15:00:00 | tom | 0.83925 |
| 8 | 2019-06-18 16:00:00 | tom | 0.86175 |
| 11 | 2019-06-18 17:00:00 | tom | 0.84375 |
| 14 | 2019-06-18 18:00:00 | tom | 0.8505 |
Let's check the mean for tom at 14:00 (only the first two readings round down to 14:00):
print("Check tom 14:00 .87300 ... {:6.5f}".format((.900 + .846) / 2))
Check tom 14:00 .87300 ... 0.87300
Hope this assists
My data frame has daily values from 2005-01-01 to 2021-10-31.
Date       |   C1    |   C2
----------------------------
2005-01-01 |  2.7859 | -7.790
2005-01-02 | -0.7756 | -0.97
2005-01-03 | -6.892  |  2.770
2005-01-04 |  2.785  | -0.97
...
2021-10-28 |  6.892  |  2.785
2021-10-29 |  2.785  | -6.892
2021-10-30 | -6.892  | -0.97
2021-10-31 | -0.7756 |  2.34
I want to downsample this data frame to get quarterly values as follows.
Date       |   C1    |   C2
----------------------------
2005-03-31 |  2.7859 | -7.790
2005-06-30 | -0.7756 | -0.97
2005-09-30 | -6.892  |  2.770
2005-12-31 |  2.785  | -0.97
I tried to do it with the Pandas resample method, but it requires an aggregation method.
df = df.resample('Q').mean()
I don't want an aggregated value; I want the value at the quarter-end date as it is.
Your code works except you are not using the right function. Replace mean by last:
import numpy as np
import pandas as pd

dti = pd.date_range('2005-01-01', '2021-10-31', freq='D')
df = pd.DataFrame(np.random.random((len(dti), 2)), columns=['C1', 'C2'], index=dti)
dfQ = df.resample('Q').last()
print(dfQ)
# Output:
C1 C2
2005-03-31 0.653733 0.334182
2005-06-30 0.425229 0.316189
2005-09-30 0.055675 0.746406
2005-12-31 0.394051 0.541684
2006-03-31 0.525208 0.413624
... ... ...
2020-12-31 0.662081 0.887147
2021-03-31 0.824541 0.363729
2021-06-30 0.064824 0.621555
2021-09-30 0.126891 0.549009
2021-12-31 0.126217 0.044822
[68 rows x 2 columns]
You can do this:
df = df[df.index.is_quarter_end]
This keeps only the rows whose dates fall on the last day of a quarter (assuming those quarter-end dates are actually present in the daily index).
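Another option worth knowing, under the same assumption that every quarter-end date exists in the index, is DataFrame.asfreq, which re-indexes to the new frequency without any aggregation; a sketch:
# Picks the value at each quarter-end date; dates missing from the
# original index would come back as NaN rather than raising
dfQ = df.asfreq('Q')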
I am trying to run a simple calculation over the values of each row within a group of a dataframe, but I'm having trouble with the syntax. I think I'm specifically getting confused about what data object I should return, i.e. dataframe vs. series etc.
For context, I have a bunch of stock values for each product I am tracking and I want to estimate the number of sales via a custom function which essentially does the following:
# Because stock can go up and down, I'm looking to record the difference
# when the stock is less than the previous stock number from the previous row.
# How do I access each row of the dataframe and then return the series I need?
def get_stock_sold(x):
    # Written in pseudo
    stock_sold = previous_stock_no - current_stock_no if current_stock_no < previous_stock_no else 0
    return pd.Series(stock_sold)
I then have the following dataframe:
# 'Order' is a date in the real dataset.
data = {
    'id': ['1', '1', '1', '2', '2', '2'],
    'order': [1, 2, 3, 1, 2, 3],
    'current_stock': [100, 150, 90, 50, 48, 30]
}
df = pd.DataFrame(data)
df = df.sort_values(by=['id', 'order'])
df['previous_stock'] = df.groupby('id')['current_stock'].shift(1)
I'd like to create a new column (stock_sold) and apply the logic from above to each row within the grouped dataframe object:
df['stock_sold'] = df.groupby('id').apply(get_stock_sold)
Desired output would look as follows:
| id | order | current_stock | previous_stock | stock_sold |
|----|-------|---------------|----------------|------------|
| 1 | 1 | 100 | NaN | 0 |
| | 2 | 150 | 100.0 | 0 |
| | 3 | 90 | 150.0 | 60 |
| 2 | 1 | 50 | NaN | 0 |
| | 2 | 48 | 50.0 | 2 |
| | 3 | 30 | 48.0 | 18 |
Try:
df["previous_stock"] = df.groupby("id")["current_stock"].shift()
df["stock_sold"] = np.where(
df["current_stock"] > df["previous_stock"].fillna(0),
0,
df["previous_stock"] - df["current_stock"],
)
print(df)
Prints:
id order current_stock previous_stock stock_sold
0 1 1 100 NaN 0.0
1 1 2 150 100.0 0.0
2 1 3 90 150.0 60.0
3 2 1 50 NaN 0.0
4 2 2 48 50.0 2.0
5 2 3 30 48.0 18.0
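For what it's worth, the same result can be had without np.where; a sketch using Series.clip:
# Negative differences (stock went up) are clipped to zero, and the NaN
# from each group's first row becomes 0, matching the desired output
df["stock_sold"] = (df["previous_stock"] - df["current_stock"]).clip(lower=0).fillna(0)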
How can I perform the below manipulation with pandas?
I have this dataframe :
weight | Date | dateDay
43 | 09/03/2018 08:48:48 | 09/03/2018
30 | 10/03/2018 23:28:48 | 10/03/2018
45 | 12/03/2018 04:21:44 | 12/03/2018
25 | 17/03/2018 00:23:32 | 17/03/2018
35 | 18/03/2018 04:49:01 | 18/03/2018
39 | 19/03/2018 20:14:37 | 19/03/2018
I want this :
weight | Date | dateDay | Fun_Cum
43 | 09/03/2018 08:48:48 | 09/03/2018 | NULL
30 | 10/03/2018 23:28:48 | 10/03/2018 | -13
45 | 12/03/2018 04:21:44 | 12/03/2018 | NULL
25 | 17/03/2018 00:23:32 | 17/03/2018 | NULL
35 | 18/03/2018 04:49:01 | 18/03/2018 | 10
39 | 19/03/2018 20:14:37 | 19/03/2018 | 4
Pseudo code:
If Day does not follow Day-1 => Fun_Cum is NULL;
Else (weight day) - (weight day-1)
Thank you
This is one way using pd.Series.diff and pd.Series.shift. You can take the difference between consecutive datetime elements and access the pd.Series.dt.days attribute.
import numpy as np

df['Fun_Cum'] = df['weight'].diff()
df.loc[(df.dateDay - df.dateDay.shift()).dt.days != 1, 'Fun_Cum'] = np.nan
print(df)
weight Date dateDay Fun_Cum
0 43 2018-03-09 2018-03-09 NaN
1 30 2018-03-10 2018-03-10 -13.0
2 45 2018-03-12 2018-03-12 NaN
3 25 2018-03-17 2018-03-17 NaN
4 35 2018-03-18 2018-03-18 10.0
5 39 2018-03-19 2018-03-19 4.0
# import pandas as pd
# from datetime import datetime
# to_datetime = lambda d: datetime.strptime(d, '%d/%m/%Y')
# df = pd.read_csv('d.csv', converters={'dateDay': to_datetime})
The part above is only needed if you are reading from a file; otherwise .shift() is all you need:
a = df
b = df.shift()
# .where keeps the difference only where the days are consecutive and
# yields NaN otherwise (multiplying by the boolean would give 0 instead)
df["Fun_Cum"] = (a.weight - b.weight).where((a.dateDay - b.dateDay).dt.days == 1)
I have a DataFrame (df) that looks like the following:
+----------+----+
| dd_mm_yy | id |
+----------+----+
| 01-03-17 | A |
| 01-03-17 | B |
| 01-03-17 | C |
| 01-05-17 | B |
| 01-05-17 | D |
| 01-07-17 | A |
| 01-07-17 | D |
| 01-08-17 | C |
| 01-09-17 | B |
| 01-09-17 | B |
+----------+----+
This is the end result I would like to compute:
+----------+----+-----------+
| dd_mm_yy | id | cum_count |
+----------+----+-----------+
| 01-03-17 | A | 1 |
| 01-03-17 | B | 1 |
| 01-03-17 | C | 1 |
| 01-05-17 | B | 2 |
| 01-05-17 | D | 1 |
| 01-07-17 | A | 2 |
| 01-07-17 | D | 2 |
| 01-08-17 | C | 1 |
| 01-09-17 | B | 2 |
| 01-09-17 | B | 3 |
+----------+----+-----------+
Logic
To calculate the cumulative occurrences of values in id, but within a specified time window, for example 4 months, i.e. every 5th month the counter resets to one.
To get the plain cumulative occurrences we can use df.groupby('id').cumcount() + 1.
Focusing on id = B, we see that the 2nd occurrence of B is 2 months after the first, so cum_count = 2. The next occurrence of B is at 01-09-17; looking back 4 months we only find one other occurrence, so cum_count = 2, etc.
My approach is to call a helper function from df.groupby('id').transform. I feel this is more complicated and slower than it could be, but it seems to work.
# test data
date id cum_count_desired
2017-03-01 A 1
2017-03-01 B 1
2017-03-01 C 1
2017-05-01 B 2
2017-05-01 D 1
2017-07-01 A 2
2017-07-01 D 2
2017-08-01 C 1
2017-09-01 B 2
2017-09-01 B 3
# preprocessing
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
# Encode the ID strings to numbers to have a column
# to work with after grouping by ID
df['id_code'] = pd.factorize(df['id'])[0]
# solution
def cumcounter(x):
    y = [x.loc[d - pd.DateOffset(months=4):d].count() for d in x.index]
    gr = x.groupby('date')
    adjust = gr.rank(method='first') - gr.size()
    # add the adjustment as an array so the list is added elementwise
    # rather than extended
    return y + adjust.values

df['cum_count'] = df.groupby('id')['id_code'].transform(cumcounter)
# output
df[['id', 'id_code', 'cum_count_desired', 'cum_count']]
id id_code cum_count_desired cum_count
date
2017-03-01 A 0 1 1
2017-03-01 B 1 1 1
2017-03-01 C 2 1 1
2017-05-01 B 1 2 2
2017-05-01 D 3 1 1
2017-07-01 A 0 2 2
2017-07-01 D 3 2 2
2017-08-01 C 2 1 1
2017-09-01 B 1 2 2
2017-09-01 B 1 3 3
The need for adjust
If the same ID occurs multiple times on the same day, the slicing approach that I use will overcount each of the same-day IDs, because the date-based slice immediately grabs all of the same-day values when the list comprehension encounters the date on which multiple IDs show up. Fix:
1. Group the current DataFrame by date.
2. Rank each row in each date group.
3. Subtract from these ranks the total number of rows in each date group. This produces a date-indexed Series of ascending negative integers, ending at 0.
4. Add these non-positive integer adjustments to y.
This only affects one row in the given test data -- the second-last row, because B appears twice on the same day.
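To see the adjustment in isolation, here is a tiny illustration with two same-day rows (synthetic values, purely for demonstration):
import pandas as pd

s = pd.Series([1, 1], index=pd.to_datetime(['2017-09-01', '2017-09-01']))
gr = s.groupby(level=0)
# ranks are [1, 2] and the group size is 2, so the adjustment is [-1, 0]
print(gr.rank(method='first') - gr.size())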
Including or excluding the left endpoint of the time interval
To count rows as old as or newer than 4 calendar months ago, i.e., to include the left endpoint of the 4-month time interval, leave this line unchanged:
y = [x.loc[d - pd.DateOffset(months=4):d].count() for d in x.index]
To count rows strictly newer than 4 calendar months ago, i.e., to exclude the left endpoint of the 4-month time interval, use this instead:
y = [x.loc[d - pd.DateOffset(months=4, days=-1):d].count() for d in x.index]
You can extend the groupby with a grouper:
df['cum_count'] = df.groupby(['id', pd.Grouper(freq='4M', key='date')]).cumcount()
Out[48]:
date id cum_count
0 2017-03-01 A 0
1 2017-03-01 B 0
2 2017-03-01 C 0
3 2017-05-01 B 0
4 2017-05-01 D 0
5 2017-07-01 A 0
6 2017-07-01 D 1
7 2017-08-01 C 0
8 2017-09-01 B 0
9 2017-09-01 B 1
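Note that cumcount is 0-based, so add 1 to match the desired 1-based counts; also note that pd.Grouper(freq='4M') creates fixed 4-month bins rather than a rolling 4-month lookback, so the numbers can differ from the windowed approach above:
# 1-based counts within fixed 4-month bins
df['cum_count'] = df.groupby(['id', pd.Grouper(freq='4M', key='date')]).cumcount() + 1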
We can make use of .apply row-wise, working on a sliced df. The slice is built using relativedelta from dateutil.
from dateutil.relativedelta import relativedelta
import pandas as pd

def get_cum_sum(slice, row):
    if slice.shape[0] == 0:
        return 1
    return slice[slice['id'] == row.id].shape[0]

d = {'dd_mm_yy': ['01-03-17','01-03-17','01-03-17','01-05-17','01-05-17','01-07-17','01-07-17','01-08-17','01-09-17','01-09-17'],
     'id': ['A','B','C','B','D','A','D','C','B','B']}
df = pd.DataFrame(data=d)
df['dd_mm_yy'] = pd.to_datetime(df['dd_mm_yy'], format='%d-%m-%y')
df['cum_sum'] = df.apply(lambda current_row: get_cum_sum(df[(df.index <= current_row.name) & (df.dd_mm_yy >= (current_row.dd_mm_yy - relativedelta(months=+4)))], current_row), axis=1)
>>> df
dd_mm_yy id cum_sum
0 2017-03-01 A 1
1 2017-03-01 B 1
2 2017-03-01 C 1
3 2017-05-01 B 2
4 2017-05-01 D 1
5 2017-07-01 A 2
6 2017-07-01 D 2
7 2017-08-01 C 1
8 2017-09-01 B 2
9 2017-09-01 B 3
I thought about using .rolling, but months are not a fixed period, so it might not work.
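That said, if approximating 4 calendar months with a fixed number of days is acceptable, a time-based rolling count does exist; a rough sketch (122 days is an assumption, not an exact 4 months):
# Rolling offset windows need a monotonic DatetimeIndex, which the
# sorted dd_mm_yy column provides; count a dummy column per id
counts = (
    df.assign(n=1)
      .set_index('dd_mm_yy')
      .groupby('id')['n']
      .rolling('122D')
      .count()
)
print(counts)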
There are a number of things I would typically do in SQL and Excel that I'm trying to do with Pandas. There are a few different wrangling problems here, combined into one question because they all have the same goal.
I have a data frame df in python with three columns:
| EventID | PictureID | Date
0 | 1 | A | 2010-01-01
1 | 2 | A | 2010-02-01
2 | 3 | A | 2010-02-15
3 | 4 | B | 2010-01-01
4 | 5 | C | 2010-02-01
5 | 6 | C | 2010-02-15
EventIDs are unique. PictureIDs are not unique, although PictureID + Date are distinct.
I. First I would like to add a new column:
df['period'] = the month and year that the event falls into beginning 2010-01.
II. Second, I would like to 'melt' the data into some new dataframe that counts the number of events for a given PictureID in a given period. I'll use examples with just two periods.
| PictureID | Period | Count
0 | A | 2010-01 | 1
1 | A | 2010-02 | 2
2 | B | 2010-01 | 1
3 | C | 2010-02 | 2
So that I can then stack (?) this new data frame into something that provides period counts for all unique PictureIDs:
| PictureID | 2010-01 | 2010-02
0 | A | 1 | 2
1 | B | 1 | 0
2 | C | 0 | 2
My sense is that pandas is built to do this sort of thing easily, is that correct?
[Edit: Removed a confused third part.]
For the first two parts you can do:
>>> df['Period'] = df['Date'].map(lambda d: d.strftime('%Y-%m'))
>>> df
EventID PictureID Date Period
0 1 A 2010-01-01 00:00:00 2010-01
1 2 A 2010-02-01 00:00:00 2010-02
2 3 A 2010-02-15 00:00:00 2010-02
3 4 B 2010-01-01 00:00:00 2010-01
4 5 C 2010-02-01 00:00:00 2010-02
5 6 C 2010-02-15 00:00:00 2010-02
>>> grouped = df[['Period', 'PictureID']].groupby('Period')
>>> grouped['PictureID'].value_counts().unstack(0).fillna(0)
Period 2010-01 2010-02
A 1 2
B 1 0
C 0 2
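As an aside, more recent pandas offers shortcuts for both steps; a sketch, assuming the Date column is already datetime64:
>>> # Month labels without strftime, via a PeriodIndex-backed column
>>> df['Period'] = df['Date'].dt.to_period('M').astype(str)
>>> # One-call pivot: rows are PictureIDs, columns are periods, absent pairs are 0
>>> pd.crosstab(df['PictureID'], df['Period'])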
For the third part, either I haven't understood the question well, or you haven't posted the correct numbers in the example, since the count for the A in the 3rd row should be 2, and for the C in the 6th row should be 1, if the period is six months...
Either way you should do something like this:
>>> ts = df.set_index('Date')
>>> ts.resample('6M', ...)
Update: This is a pretty ugly way to do it; I think I saw a better way, but I can't find the SO question. Still, this will also get the job done...
def for_half_year(row, data):
    date = row['Date']
    pid = row['PictureID']
    # Do this 6 month checking better
    if '__start' not in data or (date - data['__start']).days > 6*30:
        # Reset values
        for key in data:
            data[key] = 0
        data['__start'] = date
    data[pid] = data.get(pid, -1) + 1
    return data[pid]
df['PastSix'] = df.apply(for_half_year, args=({},), axis=1)
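For comparison, a rough modern sketch with fixed six-month bins (not the sliding reset above, so numbers can differ at bin edges), assuming Date is already datetime64:
# cumcount is 0-based, matching the data.get(pid, -1) + 1 start of 0
df['PastSix'] = df.groupby(['PictureID', pd.Grouper(key='Date', freq='6M')]).cumcount()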