grouping based on time frequency and string column? - python

I am trying to combine a set of strings based on a timestamp and an ID. I want to group the data into 5-minute intervals starting from the first time occurrence.
Data:
ID | Q | Timestamp |
1 | a > b | 24/06/2017 18:11|
1 | b > b | 24/06/2017 18:12|
1 | b > c | 24/06/2017 18:13|
1 | c > d | 24/06/2017 18:14|
1 | c > e | 24/06/2017 18:17|
2 | a > b | 24/06/2017 18:12|
2 | b > c | 24/06/2017 18:15|
Desired Result:
ID | Q | Timestamp |
1 | a > b > b > b > b > c > c > d| 24/06/2017 18:11|
1 | c > e | 24/06/2017 18:17|
2 | a > b > b > c | 24/06/2017 18:12|
I am currently trying to use this:
grouped = df.groupby([pd.Grouper(freq='5M'), 'ID']).agg(lambda x: '>'.join(set(x)))
However, it's not quite there: it breaks the timestamp, fails to join in time order, and appears to only handle the first timeframe.
Any help would be much appreciated.

You are grouping with a 5-month frequency:
5M = 5 months.
5min or 5T = 5 minutes.
See the pandas documentation on time series offset aliases.
If you group with a 5-minute frequency, you will get bins whose start minutes are evenly divisible by 5 (in this case starting at 18:10), for example:
import pandas as pd

ids = [*[1]*5, 2]
q = [f'{i:02}' for i in range(6)]
dates = pd.date_range('2017-06-24 18:11', periods=6, freq='1min')
df = pd.DataFrame({'ids': ids, 'q': q, 'dates': dates})
df
ids q dates
0 1 00 2017-06-24 18:11:00
1 1 01 2017-06-24 18:12:00
2 1 02 2017-06-24 18:13:00
3 1 03 2017-06-24 18:14:00
4 1 04 2017-06-24 18:15:00
5 2 05 2017-06-24 18:16:00
Grouping with a 5-minute frequency gives you this:
grouped = df.groupby([pd.Grouper(key='dates',freq='5min'), 'ids']).agg(lambda x: '>'.join(set(x)))
grouped
q
dates ids
2017-06-24 18:10:00 1 02>03>01>00
2017-06-24 18:15:00 1 04
2 05
If you want 18:11 to be your start date, you can offset your data and then offset it back:
df['dates'] -= pd.offsets.Minute(1)
grouped = df.groupby([pd.Grouper(key='dates',freq='5min'), 'ids']).agg(lambda x: '>'.join(set(x))).reset_index()
grouped['dates'] += pd.offsets.Minute(1)
grouped
dates ids q
0 2017-06-24 18:11:00 1 04>00>03>02>01
1 2017-06-24 18:16:00 2 05
thus achieving the desired result.
A more general answer is to offset the data by the difference between the minimum date and the nearest earlier minute that is evenly divisible by n (in your case n=5 and the minimum is 18:11), as sketched below.
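For illustration, a minimal sketch of that general approach (assuming Timestamp has already been parsed with pd.to_datetime and using the column names from the question; joining in row order rather than with set() also keeps the strings in time order, which the question asks for):
import pandas as pd

n = 5  # bin width in minutes
start = df['Timestamp'].min()
offset = start - start.floor(f'{n}min')      # distance back to the nearest multiple of n minutes

shifted = df.assign(Timestamp=df['Timestamp'] - offset)
grouped = (shifted.sort_values('Timestamp')
                  .groupby([pd.Grouper(key='Timestamp', freq=f'{n}min'), 'ID'])['Q']
                  .agg(' > '.join)           # join in time order instead of using set()
                  .reset_index())
grouped['Timestamp'] += offset               # shift the bin labels back to the original clock
Note the bins here are anchored at the overall first timestamp; anchoring per ID would need a per-group offset.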

Related

Optimizing a Pandas DataFrame Transformation to Link two Columns

Given the following df:
SequenceNumber | ID | CountNumber | Side | featureA | featureB
0 0 | 0 | 3 | Sell | 4 | 2
1 0 | 1 | 1 | Buy | 12 | 45
2 0 | 2 | 1 | Buy | 1 | 4
3 0 | 3 | 1 | Buy | 3 | 36
4 1 | 0 | 1 | Sell | 5 | 11
5 1 | 1 | 1 | Sell | 7 | 12
6 1 | 2 | 2 | Buy | 5 | 35
I want to create a new df such that, for every SequenceNumber value, it takes the rows with CountNumber == 1 and creates new rows: if Side == 'Buy', put their ID in a column named To; otherwise put their ID in a column named From. The remaining empty column out of From and To then takes the ID of the row with CountNumber > 1 (there is only one such row per SequenceNumber value). The rest of the features should be preserved.
NOTE: basically each SequenceNumber represents one transactions that has either one seller and multiple buyers, or vice versa. I am trying to create a database that links the buyers and sellers where From is the Seller ID and To is the Buyer ID.
The output should look like this:
SequenceNumber | From | To | featureA | featureB
0 0 | 0 | 1 | 12 | 45
1 0 | 0 | 2 | 1 | 4
2 0 | 0 | 3 | 3 | 36
3 1 | 0 | 2 | 5 | 11
4 1 | 1 | 2 | 7 | 12
I implemented a method that does this, however I am using for loops, which take a long time to run on large data. I am looking for a faster, scalable method. Any suggestions?
Here is the original df:
df = pd.DataFrame({'SequenceNumber': [0, 0, 0, 0, 1, 1, 1],
                   'ID': [0, 1, 2, 3, 0, 1, 2],
                   'CountNumber': [3, 1, 1, 1, 1, 1, 2],
                   'Side': ['Sell', 'Buy', 'Buy', 'Buy', 'Sell', 'Sell', 'Buy'],
                   'featureA': [4, 12, 1, 3, 5, 7, 5],
                   'featureB': [2, 45, 4, 36, 11, 12, 35]})
You can reshape with a pivot, select the features to keep with a mask and rework the output with groupby.first then concat:
features = list(df.filter(like='feature'))

out = (
    # repeat each row CountNumber times (duplicating the rows with CountNumber > 1)
    df.loc[df.index.repeat(df['CountNumber'])]
      # rename Sell/Buy into from/to and number the rows per group so the pivot index is unique
      .assign(Side=lambda d: d['Side'].map({'Sell': 'from', 'Buy': 'to'}),
              n=lambda d: d.groupby(['SequenceNumber', 'Side']).cumcount())
      # mask the features where CountNumber > 1
      .assign(**{f: lambda d, f=f: d[f].mask(d['CountNumber'].gt(1)) for f in features})
      .drop(columns='CountNumber')
      # reshape with a pivot
      .pivot(index=['SequenceNumber', 'n'], columns='Side')
)

out = (
    pd.concat([out['ID'], out.drop(columns='ID').groupby(level=0, axis=1).first()], axis=1)
      .reset_index('SequenceNumber')
)
Output:
SequenceNumber from to featureA featureB
n
0 0 0 1 12.0 45.0
1 0 0 2 1.0 4.0
2 0 0 3 3.0 36.0
0 1 0 2 5.0 11.0
1 1 1 2 7.0 12.0
Alternative using a merge, as suggested by ifly6:
features = list(df.filter(like='feature'))

df1 = df.query('Side=="Sell"').copy()
df1[features] = df1[features].mask(df1['CountNumber'].gt(1))
df2 = df.query('Side=="Buy"').copy()
df2[features] = df2[features].mask(df2['CountNumber'].gt(1))

out = (df1.merge(df2, on='SequenceNumber').rename(columns={'ID_x': 'from', 'ID_y': 'to'})
          .set_index(['SequenceNumber', 'from', 'to'])
          .filter(like='feature')
          .pipe(lambda d: d.groupby(d.columns.str.replace('_.*?$', '', regex=True), axis=1).first())
          .reset_index()
       )
Output:
SequenceNumber from to featureA featureB
0 0 0 1 12.0 45.0
1 0 0 2 1.0 4.0
2 0 0 3 3.0 36.0
3 1 0 2 5.0 11.0
4 1 1 2 7.0 12.0
Initial response, which gets the answer half complete: split the data into sellers and buyers, then merge it against itself on the sequence number:
ndf = df.query('Side == "Sell"').merge(
    df.query('Side == "Buy"'), on='SequenceNumber', suffixes=['_sell', '_buy']) \
    .rename(columns={'ID_sell': 'From', 'ID_buy': 'To'})
I then drop the side variable.
ndf = ndf.drop(columns=[i for i in ndf.columns if i.startswith('Side')])
This creates a very wide table:
SequenceNumber From CountNumber_sell featureA_sell featureB_sell To CountNumber_buy featureA_buy featureB_buy
0 0 0 3 4 2 1 1 12 45
1 0 0 3 4 2 2 1 1 4
2 0 0 3 4 2 3 1 3 36
3 1 0 1 5 11 2 2 5 35
4 1 1 1 7 12 2 2 5 35
This leaves you, however, with two featureA and featureB columns. I don't think your question clearly establishes which one takes precedence. Please provide more information on that.
Do you select the side with the lower CountNumber? Is it when CountNumber == 1? If the latter, then just null out the entries at the merge stage, do the merge, and then forward fill the appropriate columns to recover the proper values.
Re nulling: if you null the portions of featureA and featureB where CountNumber is not 1, you can then create new versions of those columns after the merge by forward filling and selecting.
import numpy as np

s = df.query('Side == "Sell"').copy()
s.loc[s['CountNumber'] != 1, ['featureA', 'featureB']] = np.nan
b = df.query('Side == "Buy"').copy()
b.loc[b['CountNumber'] != 1, ['featureA', 'featureB']] = np.nan

ndf = s.merge(
    b, on='SequenceNumber', suffixes=['_sell', '_buy']) \
    .rename(columns={'ID_sell': 'From', 'ID_buy': 'To'})

ndf['featureA'] = ndf[['featureA_buy', 'featureA_sell']] \
    .ffill(axis=1).iloc[:, -1]
ndf['featureB'] = ndf[['featureB_buy', 'featureB_sell']] \
    .ffill(axis=1).iloc[:, -1]

ndf = ndf.drop(
    columns=[i for i in ndf.columns if i.startswith('Side')
             or i.endswith('_sell') or i.endswith('_buy')])
The final version of ndf then is:
SequenceNumber From To featureA featureB
0 0 0 1 12.0 45.0
1 0 0 2 1.0 4.0
2 0 0 3 3.0 36.0
3 1 0 2 5.0 11.0
4 1 1 2 7.0 12.0
Here is an alternative approach:
df1 = df.loc[df['CountNumber'] == 1].copy()
df1['From'] = (df1['ID'].where(df1['Side'] == 'Sell',
                               df1['SequenceNumber'].map(
                                   df.loc[df['CountNumber'] > 1].set_index('SequenceNumber')['ID'])))
df1['To'] = (df1['ID'].where(df1['Side'] == 'Buy',
                             df1['SequenceNumber'].map(
                                 df.loc[df['CountNumber'] > 1].set_index('SequenceNumber')['ID'])))
df1 = df1.drop(['ID', 'CountNumber', 'Side'], axis=1)
df1 = df1[['SequenceNumber', 'From', 'To', 'featureA', 'featureB']]
df1.reset_index(drop=True, inplace=True)
print(df1)
SequenceNumber From To featureA featureB
0 0 0 1 12 45
1 0 0 2 1 4
2 0 0 3 3 36
3 1 0 2 5 11
4 1 1 2 7 12

Keep pandas dataframe columns and their order in pivot table

I have a dataframe:
df = pd.DataFrame({'No': [123, 123, 123, 523, 523, 523, 765],
                   'Type': ['A', 'B', 'C', 'A', 'C', 'D', 'A'],
                   'Task': ['First', 'Second', 'First', 'Second', 'Third', 'First', 'Fifth'],
                   'Color': ['blue', 'red', 'blue', 'black', 'red', 'red', 'red'],
                   'Price': [10, 5, 1, 12, 12, 12, 18],
                   'Unit': ['E', 'E', 'E', 'E', 'E', 'E', 'E'],
                   'Pers.ID': [45, 6, 6, 43, 1, 9, 2]
                   })
So it looks like this:
df
+-----+------+--------+-------+-------+------+---------+
| No | Type | Task | Color | Price | Unit | Pers.ID |
+-----+------+--------+-------+-------+------+---------+
| 123 | A | First | blue | 10 | E | 45 |
| 123 | B | Second | red | 5 | E | 6 |
| 123 | C | First | blue | 1 | E | 6 |
| 523 | A | Second | black | 12 | E | 43 |
| 523 | C | Third | red | 12 | E | 1 |
| 523 | D | First | red | 12 | E | 9 |
| 765 | A | First | red | 18 | E | 2 |
+-----+------+--------+-------+-------+------+---------+
then I created a pivot table:
piv = pd.pivot_table(df, index=['No','Type','Task'])
Result:
Pers.ID Price
No Type Task
123 A First 45 10
B Second 6 5
C First 6 1
523 A Second 43 12
C Third 1 12
D First 9 12
765 A Fifth 2 18
As you can see, the problems are:
multiple columns are gone (Color and Unit)
The order of the columns Price and Pers.ID is not the same as in the original dataframe.
I tried to fix this by executing:
cols = list(df.columns)
piv = pd.pivot_table(df, index=['No','Type','Task'], values = cols)
but the result is the same.
I read other posts but none of them matched my problem in a way that I could use it.
Thank you!
EDIT: desired output
Color Price Unit Pers.ID
No Type Task
123 A First blue 10 E 45
B Second red 5 E 6
C First blue 1 E 6
523 A Second black 12 E 43
C Third red 12 E 1
D First red 12 E 9
765 A Fifth red 18 E 2
I think the problem is that pivot_table's default aggregation function is mean, so string columns are excluded. A custom function is needed; the column order is also changed, so a reindex is necessary:
import numpy as np

f = lambda x: x.sum() if np.issubdtype(x.dtype, np.number) else ', '.join(x)
cols = df.columns[~df.columns.isin(['No','Type','Task'])].tolist()

piv = (pd.pivot_table(df,
                      index=['No','Type','Task'],
                      values=cols,
                      aggfunc=f).reindex(columns=cols))
print (piv)
Color Price Unit Pers.ID
No Type Task
123 A First blue 10 E 45
B Second red 5 E 6
C First blue 1 E 6
523 A Second black 12 E 43
C Third red 12 E 1
D First red 12 E 9
765 A Fifth red 18 E 2
Another solution with groupby and the same aggregation function; here the ordering is not a problem:
df = (df.groupby(['No','Type','Task'])
        .agg(lambda x: x.sum() if np.issubdtype(x.dtype, np.number) else ', '.join(x)))
print (df)
Color Price Unit Pers.ID
No Type Task
123 A First blue 10 E 45
B Second red 5 E 6
C First blue 1 E 6
523 A Second black 12 E 43
C Third red 12 E 1
D First red 12 E 9
765 A Fifth red 18 E 2
But if you only need to set the first 3 columns as a MultiIndex:
df = df.set_index(['No','Type','Task'])
print (df)
Color Price Unit Pers.ID
No Type Task
123 A First blue 10 E 45
B Second red 5 E 6
C First blue 1 E 6
523 A Second black 12 E 43
C Third red 12 E 1
D First red 12 E 9
765 A Fifth red 18 E 2

Counting cumulative occurrences of values based on date window in Pandas

I have a DataFrame (df) that looks like the following:
+----------+----+
| dd_mm_yy | id |
+----------+----+
| 01-03-17 | A |
| 01-03-17 | B |
| 01-03-17 | C |
| 01-05-17 | B |
| 01-05-17 | D |
| 01-07-17 | A |
| 01-07-17 | D |
| 01-08-17 | C |
| 01-09-17 | B |
| 01-09-17 | B |
+----------+----+
This is the end result I would like to compute:
+----------+----+-----------+
| dd_mm_yy | id | cum_count |
+----------+----+-----------+
| 01-03-17 | A | 1 |
| 01-03-17 | B | 1 |
| 01-03-17 | C | 1 |
| 01-05-17 | B | 2 |
| 01-05-17 | D | 1 |
| 01-07-17 | A | 2 |
| 01-07-17 | D | 2 |
| 01-08-17 | C | 1 |
| 01-09-17 | B | 2 |
| 01-09-17 | B | 3 |
+----------+----+-----------+
Logic
To calculate the cumulative occurrences of values in id, but within a specified time window, for example 4 months (i.e. every 5th month the counter resets to one).
To get the cumulative occurrences we can use df.groupby('id').cumcount() + 1.
Focusing on id = B, we see that the 2nd occurrence of B is after 2 months, so cum_count = 2. The next occurrence of B is at 01-09-17; looking back 4 months we only find one other occurrence, so cum_count = 2, etc.
My approach is to call a helper function from df.groupby('id').transform. I feel this is more complicated and slower than it could be, but it seems to work.
# test data
date id cum_count_desired
2017-03-01 A 1
2017-03-01 B 1
2017-03-01 C 1
2017-05-01 B 2
2017-05-01 D 1
2017-07-01 A 2
2017-07-01 D 2
2017-08-01 C 1
2017-09-01 B 2
2017-09-01 B 3
# preprocessing
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
# Encode the ID strings to numbers to have a column
# to work with after grouping by ID
df['id_code'] = pd.factorize(df['id'])[0]
# solution
def cumcounter(x):
    y = [x.loc[d - pd.DateOffset(months=4):d].count() for d in x.index]
    gr = x.groupby('date')
    adjust = gr.rank(method='first') - gr.size()
    y += adjust
    return y

df['cum_count'] = df.groupby('id')['id_code'].transform(cumcounter)
# output
df[['id', 'id_code', 'cum_count_desired', 'cum_count']]
            id  id_code  cum_count_desired  cum_count
date
2017-03-01 A 0 1 1
2017-03-01 B 1 1 1
2017-03-01 C 2 1 1
2017-05-01 B 1 2 2
2017-05-01 D 3 1 1
2017-07-01 A 0 2 2
2017-07-01 D 3 2 2
2017-08-01 C 2 1 1
2017-09-01 B 1 2 2
2017-09-01 B 1 3 3
The need for adjust
If the same ID occurs multiple times on the same day, the slicing approach that I use will overcount each of the same-day IDs, because the date-based slice immediately grabs all of the same-day values when the list comprehension encounters the date on which multiple IDs show up. Fix:
Group the current DataFrame by date.
Rank each row in each date group.
Subtract from these ranks the total number of rows in each date group. This produces a date-indexed Series of ascending negative integers, ending at 0.
Add these non-positive integer adjustments to y.
This only affects one row in the given test data -- the second-last row, because B appears twice on the same day.
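As a standalone illustration of that adjustment (a hypothetical two-row example, not part of the test data above): for the two same-day B rows the per-day ranks are [1, 2] and the day's group size is 2, giving adjustments [-1, 0]:
import pandas as pd

x = pd.Series([1, 1],
              index=pd.DatetimeIndex(['2017-09-01', '2017-09-01'], name='date'))
gr = x.groupby('date')
adjust = gr.rank(method='first') - gr.size()
print(adjust.tolist())   # [-1.0, 0.0] -- cancels the over-count on the duplicated day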
Including or excluding the left endpoint of the time interval
To count rows as old as or newer than 4 calendar months ago, i.e., to include the left endpoint of the 4-month time interval, leave this line unchanged:
y = [x.loc[d - pd.DateOffset(months=4):d].count() for d in x.index]
To count rows strictly newer than 4 calendar months ago, i.e., to exclude the left endpoint of the 4-month time interval, use this instead:
y = [x.loc[d - pd.DateOffset(months=4, days=-1):d].count() for d in x.index]
You can extend the groupby with a grouper:
df['cum_count'] = df.groupby(['id', pd.Grouper(freq='4M', key='date')]).cumcount()
Out[48]:
date id cum_count
0 2017-03-01 A 0
1 2017-03-01 B 0
2 2017-03-01 C 0
3 2017-05-01 B 0
4 2017-05-01 D 0
5 2017-07-01 A 0
6 2017-07-01 D 1
7 2017-08-01 C 0
8 2017-09-01 B 0
9 2017-09-01 B 1
We can also make use of .apply row-wise, working on a sliced df. The slice is based on relativedelta from dateutil.
from dateutil.relativedelta import relativedelta
import pandas as pd

def get_cum_sum(slice, row):
    if slice.shape[0] == 0:
        return 1
    return slice[slice['id'] == row.id].shape[0]

d = {'dd_mm_yy': ['01-03-17','01-03-17','01-03-17','01-05-17','01-05-17','01-07-17','01-07-17','01-08-17','01-09-17','01-09-17'],
     'id': ['A','B','C','B','D','A','D','C','B','B']}
df = pd.DataFrame(data=d)
df['dd_mm_yy'] = pd.to_datetime(df['dd_mm_yy'], format='%d-%m-%y')
df['cum_sum'] = df.apply(lambda current_row: get_cum_sum(
    df[(df.index <= current_row.name) & (df.dd_mm_yy >= (current_row.dd_mm_yy - relativedelta(months=+4)))],
    current_row), axis=1)
>>> df
dd_mm_yy id cum_sum
0 2017-03-01 A 1
1 2017-03-01 B 1
2 2017-03-01 C 1
3 2017-05-01 B 2
4 2017-05-01 D 1
5 2017-07-01 A 2
6 2017-07-01 D 2
7 2017-08-01 C 1
8 2017-09-01 B 2
9 2017-09-01 B 3
I also wondered whether it is feasible to use .rolling, but months are not a fixed period, so it might not work.
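For what it is worth, here is a rough sketch of what the .rolling idea could look like if you accept a fixed-length stand-in for the window (122 days approximating 4 calendar months is an assumption, and because months vary in length the counts can differ from the relativedelta result above); it uses the df built just above:
# approximate 4-month rolling count per id; '122D' is an assumed stand-in for 4 months
rolled = (df.assign(one=1)
            .set_index('dd_mm_yy')
            .sort_index()
            .groupby('id')['one']
            .rolling('122D')
            .sum()
            .astype(int))
print(rolled)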

Drop Duplicates in a DataFrame if Timestamps are Close, but not Identical

Imagine that I've got the following DataFrame
A | B | C | D
-------------------------------
2000-01-01 00:00:00 | 1 | 1 | 1
2000-01-01 00:04:30 | 1 | 2 | 2
2000-01-01 00:04:30 | 2 | 3 | 3
2000-01-02 00:00:00 | 1 | 4 | 4
And I want to drop rows where the B values are equal and the values in A are "close", say, within five minutes of each other. So in this case drop the first two rows, but keep the last two.
So, instead of doing df.drop_duplicates(subset=['A', 'B'], inplace=True, keep=False), I'd like something that's more like df.drop_duplicates(subset=['A', 'B'], inplace=True, keep=False, func={'A': some_func}), with:
def some_func(ts1, ts2):
    delta = ts1 - ts2
    return abs(delta.total_seconds()) >= 5 * 60
Is there a way to do this in Pandas?
m = df.groupby('B').A.apply(lambda x: x.diff().dt.seconds < 300)
m2 = df.B.duplicated(keep=False) & (m | m.shift(-1))
df[~m2]
A B C D
2 2000-01-01 00:04:30 2 3 3
3 2000-01-02 00:00:00 1 4 4
Details
m gets a mask of all rows within 5 minutes of each other.
m
0 False
1 True
2 False
3 False
Name: A, dtype: bool
m2 is the final mask of all items that must be dropped.
m2
0 True
1 True
2 False
3 False
dtype: bool
I break down the steps here; you can test with your real data to see whether it works or not:
df['dropme']=df.A.diff().shift(-1).dt.seconds/60
df['dropme2']=df.A
df.loc[df.dropme<=5,'dropme2']=1
df.drop_duplicates(['dropme2'],keep=False).drop(['dropme','dropme2'],axis=1)
Out[553]:
A B C D
2 2000-01-01 00:04:30 2 3 3
3 2000-01-02 00:00:00 1 4 4
Write a function that accepts a data frame, calculates the delta between two successive timestamps, and returns the filtered dataframe. Then groupby & apply.
import pandas as pd
import datetime

# this one preserves 1 row from two or more closeby rows.
def filter_window(df):
    df['filt'] = (df.A - df.A.shift(1)) / datetime.timedelta(minutes=1)
    df['filt'] = df.filt.fillna(10.0)
    df = df[(df.filt > 5.0) | pd.isnull(df.filt)]
    return df[['A', 'C', 'D']]

df2 = df.groupby('B').apply(filter_window).reset_index()
# With your sample dataset, this is the output of df2
A B C D
0 2000-01-01 00:00:00 1 1 1
1 2000-01-02 00:00:00 1 4 4
2 2000-01-01 00:04:30 2 3 3
# this one drops all closeby rows.
def filter_window2(df):
    df['filt'] = (df.A - df.A.shift(1)) / datetime.timedelta(minutes=1)
    df['filt2'] = (df.A.shift(-1) - df.A) / datetime.timedelta(minutes=1)
    df['filt'] = df.filt.fillna(df.filt2)
    df = df[(df.filt > 5.0) | pd.isnull(df.filt)]
    return df[['A', 'C', 'D']]

df3 = df.groupby('B').apply(filter_window2).reset_index()
# With your sample dataset, this is the output of df3
A B C D
0 2000-01-02 00:00:00 1 4 4
1 2000-01-01 00:04:30 2 3 3

Dataframe Wrangling with Dates and Periods in Pandas

There are a number of things I would typically do in SQL and Excel that I'm trying to do with Pandas. There are a few different wrangling problems here, combined into one question because they all have the same goal.
I have a data frame df in python with three columns:
| EventID | PictureID | Date
0 | 1 | A | 2010-01-01
1 | 2 | A | 2010-02-01
2 | 3 | A | 2010-02-15
3 | 4 | B | 2010-01-01
4 | 5 | C | 2010-02-01
5 | 6 | C | 2010-02-15
EventIDs are unique. PictureIDs are not unique, although PictureID + Date are distinct.
I. First I would like to add a new column:
df['period'] = the month and year that the event falls into beginning 2010-01.
II. Second, I would like to 'melt' the data into some new dataframe that counts the number of events for a given PictureID in a given period. I'll use examples with just two periods.
| PictureID | Period | Count
0 | A | 2010-01 | 1
1 | A | 2010-02 | 2
2 | B | 2010-01 | 1
3 | C | 2010-02 | 2
So that I can then stack (?) this new data frame into something that provides period counts for all unique PictureIDs:
| PictureID | 2010-01 | 2010-02
0 | A | 1 | 2
1 | B | 1 | 0
2 | C | 0 | 2
My sense is that pandas is built to do this sort of thing easily, is that correct?
[Edit: Removed a confused third part.]
For the first two parts you can do:
>>> df['Period'] = df['Date'].map(lambda d: d.strftime('%Y-%m'))
>>> df
EventID PictureID Date Period
0 1 A 2010-01-01 00:00:00 2010-01
1 2 A 2010-02-01 00:00:00 2010-02
2 3 A 2010-02-15 00:00:00 2010-02
3 4 B 2010-01-01 00:00:00 2010-01
4 5 C 2010-02-01 00:00:00 2010-02
5 6 C 2010-02-15 00:00:00 2010-02
>>> grouped = df[['Period', 'PictureID']].groupby('Period')
>>> grouped['PictureID'].value_counts().unstack(0).fillna(0)
Period 2010-01 2010-02
A 1 2
B 1 0
C 0 2
For the third part, either I haven't understood the question well, or you haven't posted the correct numbers in the example, since the count for the A in the 3rd row should be 2, and for the C in the 6th row should be 1, if the period is six months...
Either way you should do something like this:
>>> ts = df.set_index('Date')
>>> ts.resample('6M', ...)
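One way the elided resample call might be completed (a sketch only, assuming 6-month bins and a count of events per PictureID; not necessarily what was originally intended):
>>> counts = ts.groupby('PictureID').resample('6M')['EventID'].count()
>>> counts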
Update: This is a pretty ugly way to do it; I think I saw a better way, but I can't find the SO question. Still, this will also get the job done...
def for_half_year(row, data):
    date = row['Date']
    pid = row['PictureID']
    # Do this 6 month checking better
    if '__start' not in data or (date - data['__start']).days > 6*30:
        # Reset values
        for key in data:
            data[key] = 0
        data['__start'] = date
    data[pid] = data.get(pid, -1) + 1
    return data[pid]

df['PastSix'] = df.apply(for_half_year, args=({},), axis=1)
