How do I index this dataset with DateTime? - python

I'm a bit new to Python and data processing, so sorry if this is a newbie question.
So I have a large 3D tensor(?) dataset that looks something like this:
data = [[[a], [b]], [[c], [d]] ... ]
And each 2D tensor in the dataset is connected to a timestamp i.e.
2018-09-29 05:00:00 -> [[a], [b]]
2018-09-29 06:00:00 -> [[c], [d]]
...
And each dataset, i.e. a, b, c, d, contains the same columns, i.e.:
a.head()
| val1 | val2 | val3 |
----------------------
| 1 | 3 | 2 |
| 3 | 5 | 6 |
| 4 | 1 | 3 |
...
I need to create a multivariable index, that is, a timestamp should refer to a matrix.
I've tried with:
dfs = [[[a], [b]], [[c], [d]] ... ]
dates = ['2018-09-29 05:00:00', '2018-09-29 06:00:00']
x = pd.concat(dfs, keys=pd.to_datetime(dates))
which creates an outermost index with dates, but I have no way of reaching this index. When I list the keys with x.keys(), I only get the columns for a, b... i.e. val1, val2, val3. That is, it creates this kind of table:
| val1 | val2 | val3 |
----------------------
2018-09-29 05:00:00 | 1 | 3 | 2 |
| 3 | 5 | 6 |
| 4 | 1 | 3 |
----------------------
2018-09-29 06:00:00 | 1 | 3 | 2 |
| 3 | 5 | 6 |
| 4 | 1 | 3 |
So how do I create this DateTime indexing of multivariate values effectively? How can I access the timestamp keys? Are there better ways of doing this?
Edit
I.e. how can I achieve this, as shown in the pandas reshaping guide:
a b
variable val1 val2 val3 val1 val2 val3
date
2018-09-29 05:00:00 0.469112 -1.135632 0.119209 -2.104569 0.938225 -2.271265
2018-09-29 06:00:00 0.469112 -1.135632 0.119209 -2.104569 0.938225 -2.271265

Not sure if this is what you want to do, but I tried to create a little toy example,
as specified in your question. So we have 2d matrices referenced by timestamps:
import pandas as pd
import numpy as np
data = {
    '2018-09-29 05:00:00': np.arange(9).reshape(3, 3),
    '2018-10-29 05:00:00': np.arange(9, 18).reshape(3, 3),
    '2018-11-29 05:00:00': np.arange(18, 27).reshape(3, 3)
}
Then I just vertically stack the data and create an index like so:
matrices = []
index = []
for k, v in data.items():
    matrices.append(v)
    for _ in range(v.shape[0]):
        index.append(k)
The dataframe would look like this:
df = pd.DataFrame(np.vstack(matrices), index=index)
print(df)
# 0 1 2
# 2018-09-29 05:00:00 0 1 2
# 2018-09-29 05:00:00 3 4 5
# 2018-09-29 05:00:00 6 7 8
# 2018-10-29 05:00:00 9 10 11
# 2018-10-29 05:00:00 12 13 14
# 2018-10-29 05:00:00 15 16 17
# 2018-11-29 05:00:00 18 19 20
# 2018-11-29 05:00:00 21 22 23
# 2018-11-29 05:00:00 24 25 26
If you want the data for a specific timestamp, all you have to do is use the loc method:
print(df.loc['2018-09-29 05:00:00'])
# 0 1 2
# 2018-09-29 05:00:00 0 1 2
# 2018-09-29 05:00:00 3 4 5
# 2018-09-29 05:00:00 6 7 8
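Since the timestamps ended up in the index, you can also list them directly; a quick sketch (my addition) of how to get the unique timestamp keys:
print(df.index.unique())
# Index(['2018-09-29 05:00:00', '2018-10-29 05:00:00', '2018-11-29 05:00:00'], dtype='object')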
Hope this helps.
Edit:
You can convert the strings to Timestamps as well with pd.Timestamp(...) and keep on querying with strings. I'm not too knowledgeable about the dos and don'ts of pandas Timestamps, though.
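For example, a minimal sketch of that idea (my addition, reusing the toy data above): build the index from real Timestamps and keep querying with strings:
index = [pd.Timestamp(k) for k, v in data.items() for _ in range(v.shape[0])]
df = pd.DataFrame(np.vstack(list(data.values())), index=index)
print(df.loc['2018-09-29 05:00:00'])  # string lookups still work on a DatetimeIndex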
Edit 2:
You could save objects in the cells instead and include the whole NumPy matrix as one cell entry, but then you would lose the ability to query single rows/columns of a matrix.
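A rough sketch of that variant (my addition), keeping one whole matrix per cell of a Series:
s = pd.Series({pd.Timestamp(k): v for k, v in data.items()})
print(s.loc['2018-10-29 05:00:00'])     # the whole 3x3 matrix
print(s.loc['2018-10-29 05:00:00'][0])  # rows/columns now need plain numpy indexing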

Related

Pandas group by then count & sum based on date range +/- x-days

I want to get a count & sum of values over a +/- 7 day period of a column, after the dataframe has been grouped by a certain column.
Example data (edited to reflect my real dataset):
group | date | amount
-------------------------------------------
A | 2017-12-26 04:20:20 | 50000.0
A | 2018-01-17 00:54:15 | 60000.0
A | 2018-01-27 06:10:12 | 150000.0
A | 2018-02-01 01:15:06 | 100000.0
A | 2018-02-11 05:05:34 | 150000.0
A | 2018-03-01 11:20:04 | 150000.0
A | 2018-03-16 12:14:01 | 150000.0
A | 2018-03-23 05:15:07 | 150000.0
A | 2018-04-02 10:40:35 | 150000.0
group by group then sum based on date-7 < date < date+7
Results that I want:
group | date | amount | grouped_sum
-----------------------------------------------------------
A | 2017-12-26 04:00:00 | 50000.0 | 50000.0
A | 2018-01-17 00:00:00 | 60000.0 | 60000.0
A | 2018-01-27 06:00:00 | 150000.0 | 250000.0
A | 2018-02-01 01:00:00 | 100000.0 | 250000.0
A | 2018-02-11 05:05:00 | 150000.0 | 150000.0
A | 2018-03-01 11:00:04 | 150000.0 | 150000.0
A | 2018-03-16 12:00:01 | 150000.0 | 150000.0
A | 2018-03-23 05:00:07 | 100000.0 | 100000.0
A | 2018-04-02 10:00:00 | 100000.0 | 100000.0
Quick snippet to achieve the dataset:
group = 9 * ['A']
date = pd.to_datetime(['2017-12-26 04:20:20', '2018-01-17 00:54:15',
'2018-01-27 06:10:12', '2018-02-01 01:15:06',
'2018-02-11 05:05:34', '2018-03-01 11:20:04',
'2018-03-16 12:14:01', '2018-03-23 05:15:07',
'2018-04-02 10:40:35'])
amount = [50000.0, 60000.0, 150000.0, 100000.0, 150000.0,
150000.0, 150000.0, 150000.0, 150000.0]
df = pd.DataFrame({'group':group, 'date':date, 'amount':amount})
Bit of explanation (these refer to the simplified A/B example used in the answers below):
2nd row is 40 because it sums the data for A on 2018-01-14 and 2018-01-15
4th row is 30 because it sums the data for B on 2018-02-03 + the next 7 days
6th row is 30 because it sums the data for B on 2018-02-05 + the previous 7 days.
I don't have any idea how to do a sum over a date range. I might be able to do it this way:
1. Create another column that shows date-7 and date+7 for each row
group | date | amount | date-7 | date+7
-------------------------------------------------------------
A | 2017-12-26 | 50000.0 | 2017-12-19 | 2018-01-02
A | 2018-01-17 | 60000.0 | 2018-01-10 | 2018-01-24
2. Calculate the amount within the date range: df[(df.group == 'A') & (df.date > df['date-7']) & (df.date < df['date+7'])].amount.sum()
But this method is quite tedious.
EDIT (2018-09-01):
I found the method below, based on #jezrael's answer, which works for me but only for a single group:
t = pd.Timedelta(7, unit='d')
def g(row):
    res = df[(df.created > row.created - t) & (df.created < row.created + t)].amount.sum()
    return res
df['new'] = df.apply(g, axis=1)
The problem here is that we need to loop over each row and over each group:
t = pd.Timedelta(7, unit='d')
def f(x):
    return x.apply(lambda y: x.loc[x['date'].between(y['date'] - t,
                                                     y['date'] + t,
                                                     inclusive=False), 'amount'].sum(), axis=1)
df['new'] = df.groupby('group', group_keys=False).apply(f)
print (df)
group date amount new
0 A 2018-01-01 10 10.0
1 A 2018-01-14 20 40.0
2 A 2018-01-15 20 40.0
3 B 2018-02-03 10 30.0
4 B 2018-02-04 10 30.0
5 B 2018-02-05 10 30.0
Thanks for improvement by #jpp:
def f(x, t):
    return x.apply(lambda y: x.loc[x['date'].between(y['date'] - t,
                                                     y['date'] + t,
                                                     inclusive=False), 'amount'].sum(), axis=1)
df['new'] = df.groupby('group', group_keys=False).apply(f, pd.Timedelta(7, unit='d'))
Verify solution:
t = pd.Timedelta(7, unit='d')
df = df[df['group'] == 'A']
def test(y):
    a = df.loc[df['date'].between(y['date'] - t, y['date'] + t, inclusive=False)]
    print(a)
    print(a['amount'])
    return a['amount'].sum()
group date amount
0 A 2018-01-01 10
0 10
Name: amount, dtype: int64
group date amount
1 A 2018-01-14 20
2 A 2018-01-15 20
1 20
2 20
Name: amount, dtype: int64
group date amount
1 A 2018-01-14 20
2 A 2018-01-15 20
1 20
2 20
Name: amount, dtype: int64
df['new'] = df.apply(test,axis=1)
print (df)
group date amount new
0 A 2018-01-01 10 10
1 A 2018-01-14 20 40
2 A 2018-01-15 20 40
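If the row-wise apply turns out to be slow, a vectorized per-group sketch of the same +/- 7 day logic is possible (my own addition, not from the answers above; it builds an n x n comparison matrix per group, so it assumes each group is reasonably small):
import numpy as np

def window_sum(g, t=np.timedelta64(7, 'D')):
    d = g['date'].to_numpy()
    a = g['amount'].to_numpy()
    # mask[i, j] is True when row j falls strictly inside (d[i] - 7 days, d[i] + 7 days)
    mask = (d[None, :] > d[:, None] - t) & (d[None, :] < d[:, None] + t)
    return pd.Series(mask @ a, index=g.index)

df['new'] = df.groupby('group', group_keys=False).apply(window_sum)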
Add a column with the first day of the week:
df['week_start'] = df['date'].dt.to_period('W').apply(lambda x: x.start_time)
Result:
group date amount week_start
0 A 2018-01-01 10 2017-12-26
1 A 2018-01-14 20 2018-01-09
2 A 2018-01-15 20 2018-01-09
3 B 2018-02-03 10 2018-01-30
4 B 2018-02-04 10 2018-01-30
5 B 2018-02-05 10 2018-01-30
Group by the new column and find the weekly total amount:
grouped_sum = df.groupby('week_start')['amount'].sum().reset_index()
Result:
week_start amount
0 2017-12-26 10
1 2018-01-09 40
2 2018-01-30 30
Merge dataframes on week_start:
pd.merge(df.drop('amount', axis=1), grouped_sum, on='week_start').drop('week_start', axis=1)
Result:
group date amount
0 A 2018-01-01 10
1 A 2018-01-14 40
2 A 2018-01-15 40
3 B 2018-02-03 30
4 B 2018-02-04 30
5 B 2018-02-05 30
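Note that this groups by calendar weeks rather than a +/- 7 day window around each row, so the numbers can differ from the approach above. To end up with the grouped_sum column name from the question, you could also rename after the merge (my addition):
out = (pd.merge(df.drop('amount', axis=1), grouped_sum, on='week_start')
         .rename(columns={'amount': 'grouped_sum'})
         .drop('week_start', axis=1))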

Counting cumulative occurrences of values based on date window in Pandas

I have a DataFrame (df) that looks like the following:
+----------+----+
| dd_mm_yy | id |
+----------+----+
| 01-03-17 | A |
| 01-03-17 | B |
| 01-03-17 | C |
| 01-05-17 | B |
| 01-05-17 | D |
| 01-07-17 | A |
| 01-07-17 | D |
| 01-08-17 | C |
| 01-09-17 | B |
| 01-09-17 | B |
+----------+----+
This is the end result I would like to compute:
+----------+----+-----------+
| dd_mm_yy | id | cum_count |
+----------+----+-----------+
| 01-03-17 | A | 1 |
| 01-03-17 | B | 1 |
| 01-03-17 | C | 1 |
| 01-05-17 | B | 2 |
| 01-05-17 | D | 1 |
| 01-07-17 | A | 2 |
| 01-07-17 | D | 2 |
| 01-08-17 | C | 1 |
| 01-09-17 | B | 2 |
| 01-09-17 | B | 3 |
+----------+----+-----------+
Logic
Calculate the cumulative occurrences of values in id, but within a specified time window, for example 4 months, i.e. every 5th month the counter resets to one.
To get the cumulative occurrences we can use df.groupby('id').cumcount() + 1
Focusing on id = B, we see that the 2nd occurrence of B is after 2 months, so cum_count = 2. The next occurrence of B is at 01-09-17; looking back 4 months we only find one other occurrence, so cum_count = 2, etc.
My approach is to call a helper function from df.groupby('id').transform. I feel this is more complicated and slower than it could be, but it seems to work.
# test data
date id cum_count_desired
2017-03-01 A 1
2017-03-01 B 1
2017-03-01 C 1
2017-05-01 B 2
2017-05-01 D 1
2017-07-01 A 2
2017-07-01 D 2
2017-08-01 C 1
2017-09-01 B 2
2017-09-01 B 3
# preprocessing
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
# Encode the ID strings to numbers to have a column
# to work with after grouping by ID
df['id_code'] = pd.factorize(df['id'])[0]
# solution
def cumcounter(x):
    y = [x.loc[d - pd.DateOffset(months=4):d].count() for d in x.index]
    gr = x.groupby('date')
    adjust = gr.rank(method='first') - gr.size()
    # add the adjustments element-wise (y is a plain list, so build a new
    # list instead of relying on list += Series, which would extend the list)
    return [int(count + adj) for count, adj in zip(y, adjust)]
df['cum_count'] = df.groupby('id')['id_code'].transform(cumcounter)
# output
df[['id', 'id_code', 'cum_count_desired', 'cum_count']]
id id_code cum_count_desired cum_count
date
2017-03-01 A 0 1 1
2017-03-01 B 1 1 1
2017-03-01 C 2 1 1
2017-05-01 B 1 2 2
2017-05-01 D 3 1 1
2017-07-01 A 0 2 2
2017-07-01 D 3 2 2
2017-08-01 C 2 1 1
2017-09-01 B 1 2 2
2017-09-01 B 1 3 3
The need for adjust
If the same ID occurs multiple times on the same day, the slicing approach that I use will overcount each of the same-day IDs, because the date-based slice immediately grabs all of the same-day values when the list comprehension encounters the date on which multiple IDs show up. Fix:
Group the current DataFrame by date.
Rank each row in each date group.
Subtract from these ranks the total number of rows in each date group. This produces a date-indexed Series of ascending negative integers, ending at 0.
Add these non-positive integer adjustments to y.
This only affects one row in the given test data -- the second-last row, because B appears twice on the same day.
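To make the adjustment concrete, here is a small illustration (my addition) mirroring id 'B', where two rows share 2017-09-01:
s = pd.Series([1, 1, 1, 1],
              index=pd.to_datetime(['2017-03-01', '2017-05-01',
                                    '2017-09-01', '2017-09-01']))
s.index.name = 'date'
gr = s.groupby('date')
print(gr.rank(method='first') - gr.size())  # -> 0.0, 0.0, -1.0, 0.0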
Including or excluding the left endpoint of the time interval
To count rows as old as or newer than 4 calendar months ago, i.e., to include the left endpoint of the 4-month time interval, leave this line unchanged:
y = [x.loc[d - pd.DateOffset(months=4):d].count() for d in x.index]
To count rows strictly newer than 4 calendar months ago, i.e., to exclude the left endpoint of the 4-month time interval, use this instead:
y = [x.loc[d - pd.DateOffset(months=4, days=-1):d].count() for d in x.index]
You can extend the groupby with a grouper:
df['cum_count'] = df.groupby(['id', pd.Grouper(freq='4M', key='date')]).cumcount()
Out[48]:
date id cum_count
0 2017-03-01 A 0
1 2017-03-01 B 0
2 2017-03-01 C 0
3 2017-05-01 B 0
4 2017-05-01 D 0
5 2017-07-01 A 0
6 2017-07-01 D 1
7 2017-08-01 C 0
8 2017-09-01 B 0
9 2017-09-01 B 1
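Note that this counts within fixed 4-month bins (and starts at 0) rather than looking back 4 months from each individual row, so it can disagree with the desired output (e.g. the A row on 2017-07-01). For 1-based counts, a small tweak (my addition):
df['cum_count'] = df.groupby(['id', pd.Grouper(freq='4M', key='date')]).cumcount() + 1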
We can make use of .apply row-wise to work on a sliced df as well. The slice is based on relativedelta from dateutil.
from dateutil.relativedelta import relativedelta

def get_cum_sum(slice, row):
    if slice.shape[0] == 0:
        return 1
    return slice[slice['id'] == row.id].shape[0]

d = {'dd_mm_yy': ['01-03-17', '01-03-17', '01-03-17', '01-05-17', '01-05-17',
                  '01-07-17', '01-07-17', '01-08-17', '01-09-17', '01-09-17'],
     'id': ['A', 'B', 'C', 'B', 'D', 'A', 'D', 'C', 'B', 'B']}
df = pd.DataFrame(data=d)
df['dd_mm_yy'] = pd.to_datetime(df['dd_mm_yy'], format='%d-%m-%y')
df['cum_sum'] = df.apply(lambda current_row: get_cum_sum(
    df[(df.index <= current_row.name) &
       (df.dd_mm_yy >= (current_row.dd_mm_yy - relativedelta(months=+4)))],
    current_row), axis=1)
>>> df
dd_mm_yy id cum_sum
0 2017-03-01 A 1
1 2017-03-01 B 1
2 2017-03-01 C 1
3 2017-05-01 B 2
4 2017-05-01 D 1
5 2017-07-01 A 2
6 2017-07-01 D 2
7 2017-08-01 C 1
8 2017-09-01 B 2
9 2017-09-01 B 3
I was wondering whether .rolling is feasible, but months are not a fixed period, so it might not work.

Aggregate Data Based on Rows Python

I have a data set that looks like this:
Date | ID | Task | Description
2016-01-06 00:00:00 | 1 | 010 | This is text
2016-01-06 00:10:00 | 1 | 020 | This is text
2016-01-06 00:20:00 | 1 | 010 | This is text
2016-01-06 01:00:00 | 1 | 020 | This is text
2016-01-06 01:10:00 | 1 | 030 | This is text
2016-02-06 00:00:00 | 2 | 010 | This is text
2016-02-06 00:10:00 | 2 | 020 | This is text
2016-02-06 00:20:00 | 2 | 010 | This is text
2016-02-06 01:00:00 | 2 | 020 | This is text
2016-02-06 01:01:00 | 2 | 030 | This is text
Task 020 usually occurs after task 010, which means that when task 020 starts, task 010 ends. The same applies to task 020: if any other task comes after it, it means that task 020 has stopped.
I need to group by Task, calculating the average duration, total sum, and count of each type of task for each ID, so I am looking for something like this:
ID | Task | Average | Sum | Count
1 | 010 | 25 | 50 | 2
1 | 020 | 10 | 20 | 2
etc | etc | etc | etc | etc
There are more IDs but I only care about 010 and 020, so whatever number is returned from them is acceptable.
Can someone help me on how to do this in Python?
I think it's a simple .groupby() that you need. Your sample output doesn't show any complicated linking between timestamps and Task or ID.
counts = df.groupby(['ID','Task']).size()
will give you the count of each unique ID/Task combination in your data. To do a sum or average, it's similar, but you need a column with something to sum.
See here for more details.
It seems you need agg with groupby, but the sample has no numeric column, so col was added:
print (df)
Date ID Task Description col
0 2016-01-06 00:00:00 1 010 This is text 1
1 2016-01-06 00:10:00 1 020 This is text 2
2 2016-01-06 00:20:00 1 010 This is text 6
3 2016-01-06 01:00:00 1 020 This is text 1
4 2016-01-06 01:10:00 1 030 This is text 3
5 2016-02-06 00:00:00 2 010 This is text 1
6 2016-02-06 00:10:00 2 020 This is text 8
7 2016-02-06 00:20:00 2 010 This is text 9
8 2016-02-06 01:00:00 2 020 This is text 1
df = df.groupby(['ID','Task'])['col'].agg(['sum','size', 'mean']).reset_index()
print (df)
ID Task sum size mean
0 1 010 7 2 3.5
1 1 020 3 2 1.5
2 1 030 3 1 3.0
3 2 010 10 2 5.0
4 2 020 9 2 4.5
If you need to aggregate datetimes, it is a bit more complicated, because you need timedeltas:
df.Date = pd.to_timedelta(df.Date).dt.total_seconds()
df = (df.groupby(['ID','Task'])['Date']
        .agg(['sum','size', 'mean']).astype(np.int64).reset_index())
df['sum'] = pd.to_timedelta(df['sum'])
df['mean'] = pd.to_timedelta(df['mean'])
print (df)
ID Task sum size mean
0 1 010 00:00:02.904078 2 00:00:01.452039
1 1 020 00:00:02.904081 2 00:00:01.452040
2 1 030 00:00:01.452042 1 00:00:01.452042
3 2 010 00:00:02.909434 2 00:00:01.454717
4 2 020 00:00:02.909437 2 00:00:01.454718
For finding difference in column date:
print (df.Date.dtypes)
object
#if dtype of column is not datetime, first convert
df.Date = pd.to_datetime(df.Date )
print (df.Date.diff())
0 NaT
1 0 days 00:10:00
2 0 days 00:10:00
3 0 days 00:40:00
4 0 days 00:10:00
5 30 days 22:50:00
6 0 days 00:10:00
7 0 days 00:10:00
8 0 days 00:40:00
9 0 days 00:01:00
Name: Date, dtype: timedelta64[ns]
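Putting those pieces together for the original question, here is a hedged sketch (my own, assuming a task's duration is the gap until the next event with the same ID, and that Task values are strings like '010'):
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date'])
# duration of a task = time until the next event within the same ID
# (the last event of each ID gets NaT and is ignored by the aggregations)
df['duration'] = df.groupby('ID')['Date'].shift(-1) - df['Date']
out = (df[df['Task'].isin(['010', '020'])]
         .groupby(['ID', 'Task'])['duration']
         .agg(['mean', 'sum', 'count'])
         .reset_index())
print(out)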

Grouping dataframe by each 6 hour and generating a new column

I have this dataframe (type could be 1 or 2):
user_id | timestamp | type
1 | 2015-5-5 12:30 | 1
1 | 2015-5-5 14:00 | 2
1 | 2015-5-5 15:00 | 1
I want to group my data by six hours and when doing this I want to keep type as:
1 (if there is only 1 within that 6 hour frame)
2 (if there is only 2 within that 6 hour frame) or
3 (if there was both 1 and 2 within that 6 hour frame)
Here is my code:
df = df.groupby(['user_id', pd.TimeGrouper(freq='6H')]).mean()
which produces:
user_id | timestamp | type
1 | 2015-5-5 12:00 | 4
However, I want to get 3 instead of 4. I wonder how I can replace the mean() in my groupby code to produce the desired output?
Try this:
In [54]: df.groupby(['user_id', pd.Grouper(key='timestamp', freq='6H')]) \
.agg({'type':lambda x: x.unique().sum()})
Out[54]:
type
user_id timestamp
1 2015-05-05 12:00:00 3
P.S. It'll only work with the given types (1, 2), as their sum is 3.
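A slightly more robust variant that doesn't rely on 1 + 2 == 3 could look like this (my own sketch):
def combine(types):
    u = set(types)
    return 3 if u == {1, 2} else u.pop()

df.groupby(['user_id', pd.Grouper(key='timestamp', freq='6H')])['type'].agg(combine)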
Another data set:
In [56]: df
Out[56]:
user_id timestamp type
0 1 2015-05-05 12:30:00 1
1 1 2015-05-05 14:00:00 1
2 1 2015-05-05 15:00:00 1
3 1 2015-05-05 20:00:00 1
In [57]: df.groupby(['user_id', pd.Grouper(key='timestamp', freq='6H')]).agg({'type':lambda x: x.unique().sum()})
Out[57]:
type
user_id timestamp
1 2015-05-05 12:00:00 1
2015-05-05 18:00:00 1

Dataframe Wrangling with Dates and Periods in Pandas

There are a number of things I would typically do in SQL and excel that I'm trying to do with Pandas. There are a few different wrangling problems here, combined into one question because they all have the same goal.
I have a data frame df in python with three columns:
| EventID | PictureID | Date
0 | 1 | A | 2010-01-01
1 | 2 | A | 2010-02-01
2 | 3 | A | 2010-02-15
3 | 4 | B | 2010-01-01
4 | 5 | C | 2010-02-01
5 | 6 | C | 2010-02-15
EventIDs are unique. PictureIDs are not unique, although PictureID + Date are distinct.
I. First I would like to add a new column:
df['period'] = the month and year that the event falls into beginning 2010-01.
II. Second, I would like to 'melt' the data into some new dataframe that counts the number of events for a given PictureID in a given period. I'll use examples with just two periods.
| PictureID | Period | Count
0 | A | 2010-01 | 1
1 | A | 2010-02 | 2
2 | B | 2010-01 | 1
3 | C | 2010-02 | 2
So that I can then stack (?) this new data frame into something that provides period counts for all unique PictureIDs:
| PictureID | 2010-01 | 2010-02
0 | A | 1 | 2
1 | B | 1 | 0
2 | C | 0 | 2
My sense is that pandas is built to do this sort of thing easily, is that correct?
[Edit: Removed a confused third part.]
For the first two parts you can do:
>>> df['Period'] = df['Date'].map(lambda d: d.strftime('%Y-%m'))
>>> df
EventID PictureID Date Period
0 1 A 2010-01-01 00:00:00 2010-01
1 2 A 2010-02-01 00:00:00 2010-02
2 3 A 2010-02-15 00:00:00 2010-02
3 4 B 2010-01-01 00:00:00 2010-01
4 5 C 2010-02-01 00:00:00 2010-02
5 6 C 2010-02-15 00:00:00 2010-02
>>> grouped = df[['Period', 'PictureID']].groupby('Period')
>>> grouped['PictureID'].value_counts().unstack(0).fillna(0)
Period 2010-01 2010-02
A 1 2
B 1 0
C 0 2
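An alternative sketch for parts I and II using pandas Periods and a crosstab (my addition, not part of the original answer):
df['Period'] = pd.to_datetime(df['Date']).dt.to_period('M')
print(pd.crosstab(df['PictureID'], df['Period']))
# Period     2010-01  2010-02
# PictureID
# A                1        2
# B                1        0
# C                0        2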
For the third part, either I haven't understood the question well, or you haven't posted the correct numbers in the example, since the count for the A in the 3rd row should be 2, and for the C in the 6th row should be 1, if the period is six months...
Either way you should do something like this:
>>> ts = df.set_index('Date')
>>> ts.resample('6M', ...)
Update: This is a pretty ugly way to do it; I think I saw a better way, but I can't find the SO question. Still, this will also get the job done...
def for_half_year(row, data):
    date = row['Date']
    pid = row['PictureID']
    # Do this 6 month checking better
    if '__start' not in data or (date - data['__start']).days > 6*30:
        # Reset values
        for key in data:
            data[key] = 0
        data['__start'] = date
    data[pid] = data.get(pid, -1) + 1
    return data[pid]

df['PastSix'] = df.apply(for_half_year, args=({},), axis=1)
