Indexing/Binning Time Series - Python

I have a dataframe like the one below
ID Date
111 1.1.2018
222 5.1.2018
333 7.1.2018
444 8.1.2018
555 9.1.2018
666 13.1.2018
and I would like to bin them into 5 days intervals.
The output should be
ID Date Bin
111 1.1.2018 1
222 5.1.2018 1
333 7.1.2018 2
444 8.1.2018 2
555 9.1.2018 2
666 13.1.2018 3
How can I do this in Python, please?
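For reference, here is a minimal sketch that rebuilds the sample frame above, so the answers below can be run as-is (the dates are day-first strings):
# Sketch: reproduce the sample frame from the question.
import pandas as pd

df = pd.DataFrame({
    'ID': [111, 222, 333, 444, 555, 666],
    'Date': ['1.1.2018', '5.1.2018', '7.1.2018', '8.1.2018', '9.1.2018', '13.1.2018'],
})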

Looks like groupby + ngroup does it:
df['Date'] = pd.to_datetime(df.Date, errors='coerce', dayfirst=True)
df['Bin'] = df.groupby(pd.Grouper(freq='5D', key='Date')).ngroup() + 1
df
ID Date Bin
0 111 2018-01-01 1
1 222 2018-01-05 1
2 333 2018-01-07 2
3 444 2018-01-08 2
4 555 2018-01-09 2
5 666 2018-01-13 3
If you don't want to mutate the Date column, you can first call assign for a copy-based assignment and then do the groupby:
df['Bin'] = df.assign(
    Date=pd.to_datetime(df.Date, errors='coerce', dayfirst=True)
).groupby(pd.Grouper(freq='5D', key='Date')).ngroup() + 1
df
ID Date Bin
0 111 1.1.2018 1
1 222 5.1.2018 1
2 333 7.1.2018 2
3 444 8.1.2018 2
4 555 9.1.2018 2
5 666 13.1.2018 3

One way is to create an array of bin edges from your date range and use numpy.digitize:
import numpy as np

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
date_ranges = (pd.date_range(df['Date'].min(), df['Date'].max(), freq='5D')
                 .astype(np.int64).values)
df['Bin'] = np.digitize(df['Date'].astype(np.int64).values, date_ranges)
Result:
ID Date Bin
0 111 2018-01-01 1
1 222 2018-01-05 1
2 333 2018-01-07 2
3 444 2018-01-08 2
4 555 2018-01-09 2
5 666 2018-01-13 3
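For context, np.digitize returns, for each value, the index of the bin it falls into: values below the first edge get 0, and values at or beyond the last edge get len(edges). A tiny illustration with plain integers, analogous to the 5-day date edges above:
import numpy as np

edges = np.array([0, 5, 10])       # bin edges, analogous to date_ranges
values = np.array([1, 5, 7, 9, 13])
np.digitize(values, edges)         # -> array([1, 2, 2, 2, 3])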

Related

Transpose by grouping a Dataframe having both numeric and string variables

I have the DataFrame below and I want to convert it into the desired output shown further down:
import pandas as pd
df = pd.DataFrame({'ID': [111, 111, 111, 222, 222, 333],
                   'class': ['merc', 'humvee', 'bmw', 'vw', 'bmw', 'merc'],
                   'imp': [1, 2, 3, 1, 2, 1]})
print(df)
ID class imp
0 111 merc 1
1 111 humvee 2
2 111 bmw 3
3 222 vw 1
4 222 bmw 2
5 333 merc 1
Desired output:
ID 0 1 2
0 111 merc humvee bmw
1 111 1 2 3
2 222 vw bmw
3 222 1 2
4 333 merc
5 333 1
I wish to transpose the entire dataframe, but grouped by a particular column (ID in this case) while maintaining the row order.
My attempt: I tried using .set_index() and .unstack(), but it did not work.
Use GroupBy.cumcount for a counter and then reshape with DataFrame.stack and Series.unstack:
df1 = (df.set_index(['ID', df.groupby('ID').cumcount()])
         .stack()
         .unstack(1, fill_value='')
         .reset_index(level=1, drop=True)
         .reset_index())
print (df1)
ID 0 1 2
0 111 merc humvee bmw
1 111 1 2 3
2 222 vw bmw
3 222 1 2
4 333 merc
5 333 1
Another method would be to use groupby and concat. Although this is not totally dynamic, it works fine if you only have two columns to work with, namely class and imp; a more general variant is sketched after the output below.
s = df.set_index([df['ID'], df.groupby('ID').cumcount()]).unstack(1)
df1 = pd.concat([s['class'], s['imp']], axis=0).sort_index().fillna('')
print(df1)
idx 0 1 2
ID
111 merc humvee bmw
111 1 2 3
222 vw bmw
222 1 2
333 merc
333 1
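If you want the same concat approach without hard-coding class and imp, one possible generalization (a sketch along the same lines, not part of the original answer) is to iterate over every column except ID:
s = df.set_index([df['ID'], df.groupby('ID').cumcount()]).unstack(1)
value_cols = [c for c in df.columns if c != 'ID']   # ['class', 'imp'] here
df1 = pd.concat([s[c] for c in value_cols], axis=0).sort_index().fillna('')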

Count ids with distance smaller than a certain value

My data consists of ids, each with a certain distance to a point. The goal is to count, for each id, how many rows have a distance equal to or smaller than the radius.
Following example shows my DataFrame:
id distance radius
111 0.5 1
111 2 1
111 1 1
222 1 2
222 3 2
333 5 3
333 4 3
The output should look like this:
id count
111 2
222 1
333 0
You can do:
df['distance'].le(df['radius']).groupby(df['id']).sum()
Output:
id
111 2.0
222 1.0
333 0.0
dtype: float64
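If integer counts are preferred over the float output above, the boolean sum can simply be cast afterwards (a small tweak, not part of the original answer):
df['distance'].le(df['radius']).groupby(df['id']).sum().astype(int)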
Or you can do:
(df.loc[df.distance <= df.radius, 'id']
   .value_counts()
   .reindex(df['id'].unique(), fill_value=0)
)
Output:
111 2
222 1
333 0
Name: id, dtype: int64

Grouping MultiIndex in Pandas

I'm selecting some data in Spark like this:
base = spark.sql("""
SELECT
...
...
""")
print(base.count())
base.cache()
base=base.toPandas()
base['yyyy_mm_dd'] = pd.to_datetime(base['yyyy_mm_dd'])
base.set_index("yyyy_mm_dd", inplace=True)
This gives me a dataframe which looks like this:
id aggregated_field aggregated_field2
yyyy_mm_dd
I want to group by yyyy_mm_dd and id and sum the aggregated fields. This way I can see, per day, the total sum of the aggregated fields for every provider. I'll then want to aggregate this monthly. This is what I've done:
agg = base.groupby(['yyyy_mm_dd', 'id'])[['aggregated_field','aggregated_field2']].sum()
My dataframe now looks like this:
aggregated_field aggregated_field2
yyyy_mm_dd id
Finally, I try to resample() to monthly:
agg = agg.resample('M').sum()
Then I get this error:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'MultiIndex'
I'm not sure why, since I converted yyyy_mm_dd to a datetime index earlier.
Edit: The output I'm looking for is this:
yyyy_mm_dd id aggregated_metric aggregated_metric2
2019-01-01 1 ... ...
2
3
2019-01-02 1
2
3
Maybe you will find this useful:
Solution 1 (employing pd.Period and its natural display of monthly data)
>>> import pandas as pd
>>> base = \
    pd.DataFrame(
        {
            'yyyy_mm_dd': ['2012-01-01', '2012-01-01', '2012-01-02', '2012-01-02',
                           '2012-02-01', '2012-02-01', '2012-02-02', '2012-02-02'],
            'id': [1, 2, 1, 2, 1, 2, 1, 2],
            'aggregated_field': [0, 1, 2, 3, 4, 5, 6, 7],
            'aggregated_field2': [100, 101, 102, 103, 104, 105, 106, 107]
        }
    )
>>> base
yyyy_mm_dd id aggregated_field aggregated_field2
0 2012-01-01 1 0 100
1 2012-01-01 2 1 101
2 2012-01-02 1 2 102
3 2012-01-02 2 3 103
4 2012-02-01 1 4 104
5 2012-02-01 2 5 105
6 2012-02-02 1 6 106
7 2012-02-02 2 7 107
>>> base['yyyy_mm_dd'] = pd.to_datetime(base['yyyy_mm_dd'])
>>> base['yyyy_mm'] = base['yyyy_mm_dd'].dt.to_period('M')
>>> agg = base.groupby(['yyyy_mm', 'id'])[['aggregated_field','aggregated_field2']].sum()
>>> agg
aggregated_field aggregated_field2
yyyy_mm id
2012-01 1 2 202
2 4 204
2012-02 1 10 210
2 12 212
Solution 2 (stick to datetime64)
>>> import pandas as pd
>>> base = \
    pd.DataFrame(
        {
            'yyyy_mm_dd': ['2012-01-01', '2012-01-01', '2012-01-02', '2012-01-02',
                           '2012-02-01', '2012-02-01', '2012-02-02', '2012-02-02'],
            'id': [1, 2, 1, 2, 1, 2, 1, 2],
            'aggregated_field': [0, 1, 2, 3, 4, 5, 6, 7],
            'aggregated_field2': [100, 101, 102, 103, 104, 105, 106, 107]
        }
    )
>>> base
yyyy_mm_dd id aggregated_field aggregated_field2
0 2012-01-01 1 0 100
1 2012-01-01 2 1 101
2 2012-01-02 1 2 102
3 2012-01-02 2 3 103
4 2012-02-01 1 4 104
5 2012-02-01 2 5 105
6 2012-02-02 1 6 106
7 2012-02-02 2 7 107
>>> base['yyyy_mm_dd'] = pd.to_datetime(base['yyyy_mm_dd'])
>>> base['yyyy_mm_dd_month_start'] = base['yyyy_mm_dd'].values.astype('datetime64[M]')
>>> agg = base.groupby(['yyyy_mm_dd_month_start', 'id'])[['aggregated_field','aggregated_field2']].sum()
>>> agg
aggregated_field aggregated_field2
yyyy_mm_dd_month_start id
2012-01-01 1 2 202
2 4 204
2012-02-01 1 10 210
2 12 212
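For completeness: the original error appears because the daily groupby leaves a (yyyy_mm_dd, id) MultiIndex, and resample() only works on a plain DatetimeIndex, TimedeltaIndex, or PeriodIndex. A third option, sketched here rather than taken from the answer above, is to group by month directly with pd.Grouper (labels fall on the month end):
# Sketch: assumes yyyy_mm_dd is a datetime column as in the sample base above;
# if it is the index instead, use pd.Grouper(level='yyyy_mm_dd', freq='M').
agg_monthly = (base.groupby([pd.Grouper(key='yyyy_mm_dd', freq='M'), 'id'])
                   [['aggregated_field', 'aggregated_field2']]
                   .sum())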

Pandas calculating difference between rows

My goal is to calculate, for each Id, the number of days between each Start/Begin row and that Id's End row. I know I have to group by Id, but I'm not sure how to compute the difference between the Day values.
I tried df['length'] = -(df.groupby('Id')['Day'].diff()). This doesn't compare to End; it only gives the difference between consecutive rows, i.e. whenever the Status changes.
df
Id Day Status
111 1 Start
111 5 End
222 2 Begin
222 7 End
333 1 Start
333 3 Begin
333 7 End
Ideal result would be:
Id Day Status Length
111 1 Start 4
111 5 End
222 2 Begin 5
222 7 End
333 1 Start 6 (since we Start on Day 1 and End on day 7)
333 3 Begin 4 (since we Begin on Day 3 and End on day 7)
333 7 End
Thank you
Here's another method with groupby + transform -
v = df.groupby('Id').Day.transform('last') - df.Day
df['Length'] = v.mask(v == 0) # or v.mask(df.Status.eq('End'))
df
Id Day Status Length
0 111 1 Start 4.0
1 111 5 End NaN
2 222 2 Begin 5.0
3 222 7 End NaN
4 333 1 Start 6.0
5 333 3 Begin 4.0
6 333 7 End NaN
Timings
df = pd.concat([df] * 1000000, ignore_index=True)
# apply + iloc
%timeit df.groupby('Id').Day.apply(lambda x : x.iloc[-1]-x).replace(0,np.nan)
1 loop, best of 3: 1.49 s per loop
# transform + mask
%%timeit
v = df.groupby('Id').Day.transform('last') - df.Day
df['Length'] = v.mask(v == 0)
1 loop, best of 3: 294 ms per loop
By using apply with .iloc
df.groupby('Id').Day.apply(lambda x : x.iloc[-1]-x).replace(0,np.nan)
Out[187]:
0 4.0
1 NaN
2 5.0
3 NaN
4 6.0
5 4.0
6 NaN
Name: Day, dtype: float64
After assigning it back:
df['Length']=df.groupby('Id').Day.apply(lambda x : x.iloc[-1]-x).replace(0,np.nan)
df
Out[189]:
Id Day Status Length
0 111 1 Start 4.0
1 111 5 End NaN
2 222 2 Begin 5.0
3 222 7 End NaN
4 333 1 Start 6.0
5 333 3 Begin 4.0
6 333 7 End NaN

Modify timestamps to sequence per ID

I have a Pandas dataframe (Python 3.5.1) with a timestamp column and an ID column.
Timestamp ID
0 2016-04-01T00:15:36.688 123
1 2016-04-01T00:12:52.688 123
2 2016-04-01T00:35:41.688 543
3 2016-04-01T00:01:12.688 543
4 2016-03-31T23:50:59.688 123
5 2016-04-01T01:05:52.688 543
I would like to sequence the timestamps per ID.
Timestamp ID Sequence
0 2016-04-01T00:15:36.688 123 3
1 2016-04-01T00:12:52.688 123 2
2 2016-04-01T00:35:41.688 543 2
3 2016-04-01T00:01:12.688 543 1
4 2016-03-31T23:50:59.688 123 1
5 2016-04-01T01:05:52.688 543 3
What is the best way to order the timestamps per ID, and generate a sequence number unique to each ID?
You can use sort_values(), groupby() and cumcount():
In [10]: df['Sequence'] = df.sort_values('Timestamp').groupby('ID').cumcount() + 1
In [11]: df
Out[11]:
Timestamp ID Sequence
0 2016-04-01 00:15:36.688 123 3
1 2016-04-01 00:12:52.688 123 2
2 2016-04-01 00:35:41.688 543 2
3 2016-04-01 00:01:12.688 543 1
4 2016-03-31 23:50:59.688 123 1
5 2016-04-01 01:05:52.688 543 3
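An equivalent result can also be obtained by ranking the timestamps within each ID; this is a sketch of an alternative, not part of the answer above:
# Alternative sketch: parse the strings and rank within each ID
# (method='first' keeps the original row order for exact ties).
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df['Sequence'] = df.groupby('ID')['Timestamp'].rank(method='first').astype(int)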
