Indexing/Binning Time Series - Python

I have a dataframe like the one below
ID Date
111 1.1.2018
222 5.1.2018
333 7.1.2018
444 8.1.2018
555 9.1.2018
666 13.1.2018
and I would like to bin them into 5 days intervals.
The output should be
ID Date Bin
111 1.1.2018 1
222 5.1.2018 1
333 7.1.2018 2
444 8.1.2018 2
555 9.1.2018 2
666 13.1.2018 3
How can I do this in Python, please?
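For reference, here is a minimal sketch that rebuilds the sample frame above, so the answers below can be run as-is (the dates are day-first strings):
# Sketch: reproduce the sample frame from the question.
import pandas as pd

df = pd.DataFrame({
    'ID': [111, 222, 333, 444, 555, 666],
    'Date': ['1.1.2018', '5.1.2018', '7.1.2018', '8.1.2018', '9.1.2018', '13.1.2018'],
})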

Looks like groupby + ngroup does it:
df['Date'] = pd.to_datetime(df.Date, errors='coerce', dayfirst=True)
df['Bin'] = df.groupby(pd.Grouper(freq='5D', key='Date')).ngroup() + 1
df
ID Date Bin
0 111 2018-01-01 1
1 222 2018-01-05 1
2 333 2018-01-07 2
3 444 2018-01-08 2
4 555 2018-01-09 2
5 666 2018-01-13 3
If you don't want to mutate the Date column, you can first call assign for a copy-based assignment and then do the groupby:
df['Bin'] = df.assign(
    Date=pd.to_datetime(df.Date, errors='coerce', dayfirst=True)
).groupby(pd.Grouper(freq='5D', key='Date')).ngroup() + 1
df
ID Date Bin
0 111 1.1.2018 1
1 222 5.1.2018 1
2 333 7.1.2018 2
3 444 8.1.2018 2
4 555 9.1.2018 2
5 666 13.1.2018 3

One way is to create an array of bin edges from your date range and use numpy.digitize:
import numpy as np

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
date_ranges = (pd.date_range(df['Date'].min(), df['Date'].max(), freq='5D')
                 .astype(np.int64).values)
df['Bin'] = np.digitize(df['Date'].astype(np.int64).values, date_ranges)
Result:
ID Date Bin
0 111 2018-01-01 1
1 222 2018-01-05 1
2 333 2018-01-07 2
3 444 2018-01-08 2
4 555 2018-01-09 2
5 666 2018-01-13 3
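For context, np.digitize returns, for each value, the index of the bin it falls into: values below the first edge get 0, and values at or beyond the last edge get len(edges). A tiny illustration with plain integers, analogous to the 5-day date edges above:
import numpy as np

edges = np.array([0, 5, 10])       # bin edges, analogous to date_ranges
values = np.array([1, 5, 7, 9, 13])
np.digitize(values, edges)         # -> array([1, 2, 2, 2, 3])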

Related

Transpose by grouping a Dataframe having both numeric and string variables

I have the DataFrame below and I want to convert it into the desired output shown further down:
import pandas as pd
df = pd.DataFrame({'ID': [111, 111, 111, 222, 222, 333],
                   'class': ['merc', 'humvee', 'bmw', 'vw', 'bmw', 'merc'],
                   'imp': [1, 2, 3, 1, 2, 1]})
print(df)
ID class imp
0 111 merc 1
1 111 humvee 2
2 111 bmw 3
3 222 vw 1
4 222 bmw 2
5 333 merc 1
Desired output:
ID 0 1 2
0 111 merc humvee bmw
1 111 1 2 3
2 222 vw bmw
3 222 1 2
4 333 merc
5 333 1
I wish to transpose the entire dataframe, but grouped by a particular column (ID in this case) while maintaining the row order.
My attempt: I tried using .set_index() and .unstack(), but it did not work.
Use GroupBy.cumcount for a counter and then reshape with DataFrame.stack and Series.unstack:
df1 = (df.set_index(['ID', df.groupby('ID').cumcount()])
         .stack()
         .unstack(1, fill_value='')
         .reset_index(level=1, drop=True)
         .reset_index())
print (df1)
ID 0 1 2
0 111 merc humvee bmw
1 111 1 2 3
2 222 vw bmw
3 222 1 2
4 333 merc
5 333 1
Another method would be to use groupby and concat. Although this is not totally dynamic, it works fine if you only have two columns to work with, namely class and imp; a more general variant is sketched after the output below.
s = df.set_index([df['ID'], df.groupby('ID').cumcount()]).unstack(1)
df1 = pd.concat([s['class'], s['imp']], axis=0).sort_index().fillna('')
print(df1)
idx 0 1 2
ID
111 merc humvee bmw
111 1 2 3
222 vw bmw
222 1 2
333 merc
333 1
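If you want the same concat approach without hard-coding class and imp, one possible generalization (a sketch along the same lines, not part of the original answer) is to iterate over every column except ID:
s = df.set_index([df['ID'], df.groupby('ID').cumcount()]).unstack(1)
value_cols = [c for c in df.columns if c != 'ID']   # ['class', 'imp'] here
df1 = pd.concat([s[c] for c in value_cols], axis=0).sort_index().fillna('')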

Count ids with distance smaller than a certain value

My data consists of ids, each with a certain distance to a point. The goal is to count, for each id, how many rows have a distance equal to or smaller than the radius.
Following example shows my DataFrame:
id distance radius
111 0.5 1
111 2 1
111 1 1
222 1 2
222 3 2
333 5 3
333 4 3
The output should look like this:
id count
111 2
222 1
333 0
You can do:
df['distance'].le(df['radius']).groupby(df['id']).sum()
Output:
id
111 2.0
222 1.0
333 0.0
dtype: float64
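If integer counts are preferred over the float output above, the boolean sum can simply be cast afterwards (a small tweak, not part of the original answer):
df['distance'].le(df['radius']).groupby(df['id']).sum().astype(int)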
Or you can do:
(df.loc[df.distance <= df.radius, 'id']
   .value_counts()
   .reindex(df['id'].unique(), fill_value=0)
)
Output:
111 2
222 1
333 0
Name: id, dtype: int64

Grouping MultiIndex in Pandas

I'm selecting some data in Spark like this:
base = spark.sql("""
SELECT
...
...
""")
print(base.count())
base.cache()
base=base.toPandas()
base['yyyy_mm_dd'] = pd.to_datetime(base['yyyy_mm_dd'])
base.set_index("yyyy_mm_dd", inplace=True)
This gives me a dataframe which looks like this:
id aggregated_field aggregated_field2
yyyy_mm_dd
I want to group by yyyy_mm_dd and id and sum the aggregated fields. This way I can see, per day, the total sum of the aggregated fields for every provider. I'll then want to aggregate this monthly. This is what I've done:
agg = base.groupby(['yyyy_mm_dd', 'id'])[['aggregated_field','aggregated_field2']].sum()
My dataframe now looks like this:
aggregated_field aggregated_field2
yyyy_mm_dd id
Finally, I try to resample() to monthly:
agg = agg.resample('M').sum()
Then I get this error:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'MultiIndex'
I'm not sure why, since I converted yyyy_mm_dd to a datetime index earlier.
Edit: The output I'm looking for is this:
yyyy_mm_dd id aggregated_metric aggregated_metric2
2019-01-01 1 ... ...
2
3
2019-01-02 1
2
3
Maybe you will find this useful:
Solution 1 (employing pd.Period and its natural display of monthly data)
>>> import pandas as pd
>>> base = \
    pd.DataFrame(
        {
            'yyyy_mm_dd': ['2012-01-01', '2012-01-01', '2012-01-02', '2012-01-02',
                           '2012-02-01', '2012-02-01', '2012-02-02', '2012-02-02'],
            'id': [1, 2, 1, 2, 1, 2, 1, 2],
            'aggregated_field': [0, 1, 2, 3, 4, 5, 6, 7],
            'aggregated_field2': [100, 101, 102, 103, 104, 105, 106, 107]
        }
    )
>>> base
yyyy_mm_dd id aggregated_field aggregated_field2
0 2012-01-01 1 0 100
1 2012-01-01 2 1 101
2 2012-01-02 1 2 102
3 2012-01-02 2 3 103
4 2012-02-01 1 4 104
5 2012-02-01 2 5 105
6 2012-02-02 1 6 106
7 2012-02-02 2 7 107
>>> base['yyyy_mm_dd'] = pd.to_datetime(base['yyyy_mm_dd'])
>>> base['yyyy_mm'] = base['yyyy_mm_dd'].dt.to_period('M')
>>> agg = base.groupby(['yyyy_mm', 'id'])[['aggregated_field','aggregated_field2']].sum()
>>> agg
aggregated_field aggregated_field2
yyyy_mm id
2012-01 1 2 202
2 4 204
2012-02 1 10 210
2 12 212
Solution 2 (stick to datetime64)
>>> import pandas as pd
>>> base = \
    pd.DataFrame(
        {
            'yyyy_mm_dd': ['2012-01-01', '2012-01-01', '2012-01-02', '2012-01-02',
                           '2012-02-01', '2012-02-01', '2012-02-02', '2012-02-02'],
            'id': [1, 2, 1, 2, 1, 2, 1, 2],
            'aggregated_field': [0, 1, 2, 3, 4, 5, 6, 7],
            'aggregated_field2': [100, 101, 102, 103, 104, 105, 106, 107]
        }
    )
>>> base
yyyy_mm_dd id aggregated_field aggregated_field2
0 2012-01-01 1 0 100
1 2012-01-01 2 1 101
2 2012-01-02 1 2 102
3 2012-01-02 2 3 103
4 2012-02-01 1 4 104
5 2012-02-01 2 5 105
6 2012-02-02 1 6 106
7 2012-02-02 2 7 107
>>> base['yyyy_mm_dd'] = pd.to_datetime(base['yyyy_mm_dd'])
>>> base['yyyy_mm_dd_month_start'] = base['yyyy_mm_dd'].values.astype('datetime64[M]')
>>> agg = base.groupby(['yyyy_mm_dd_month_start', 'id'])[['aggregated_field','aggregated_field2']].sum()
>>> agg
aggregated_field aggregated_field2
yyyy_mm_dd_month_start id
2012-01-01 1 2 202
2 4 204
2012-02-01 1 10 210
2 12 212
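For completeness: the original error appears because the daily groupby leaves a (yyyy_mm_dd, id) MultiIndex, and resample() only works on a plain DatetimeIndex, TimedeltaIndex, or PeriodIndex. A third option, sketched here rather than taken from the answer above, is to group by month directly with pd.Grouper (labels fall on the month end):
# Sketch: assumes yyyy_mm_dd is a datetime column as in the sample base above;
# if it is the index instead, use pd.Grouper(level='yyyy_mm_dd', freq='M').
agg_monthly = (base.groupby([pd.Grouper(key='yyyy_mm_dd', freq='M'), 'id'])
                   [['aggregated_field', 'aggregated_field2']]
                   .sum())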

Pandas calculating difference between rows

My goal is to calculate, for each Id, the number of days between each Start/Begin row and that Id's End row. I know I have to group by Id, but I'm not sure how to compute the difference between the Day values.
I tried df['length'] = -(df.groupby('Id')['Day'].diff()). This doesn't compare to End; it only gives the difference between consecutive rows, i.e. whenever the Status changes.
df
Id Day Status
111 1 Start
111 5 End
222 2 Begin
222 7 End
333 1 Start
333 3 Begin
333 7 End
Ideal result would be:
Id Day Status Length
111 1 Start 4
111 5 End
222 2 Begin 5
222 7 End
333 1 Start 6 (since we Start on Day 1 and End on day 7)
333 3 Begin 4 (since we Begin on Day 3 and End on day 7)
333 7 End
Thank you
Here's another method with groupby + transform -
v = df.groupby('Id').Day.transform('last') - df.Day
df['Length'] = v.mask(v == 0) # or v.mask(df.Status.eq('End'))
df
Id Day Status Length
0 111 1 Start 4.0
1 111 5 End NaN
2 222 2 Begin 5.0
3 222 7 End NaN
4 333 1 Start 6.0
5 333 3 Begin 4.0
6 333 7 End NaN
Timings
df = pd.concat([df] * 1000000, ignore_index=True)
# apply + iloc
%timeit df.groupby('Id').Day.apply(lambda x : x.iloc[-1]-x).replace(0,np.nan)
1 loop, best of 3: 1.49 s per loop
# transform + mask
%%timeit
v = df.groupby('Id').Day.transform('last') - df.Day
df['Length'] = v.mask(v == 0)
1 loop, best of 3: 294 ms per loop
By using apply with .iloc
df.groupby('Id').Day.apply(lambda x : x.iloc[-1]-x).replace(0,np.nan)
Out[187]:
0 4.0
1 NaN
2 5.0
3 NaN
4 6.0
5 4.0
6 NaN
Name: Day, dtype: float64
After assigning it back:
df['Length']=df.groupby('Id').Day.apply(lambda x : x.iloc[-1]-x).replace(0,np.nan)
df
Out[189]:
Id Day Status Length
0 111 1 Start 4.0
1 111 5 End NaN
2 222 2 Begin 5.0
3 222 7 End NaN
4 333 1 Start 6.0
5 333 3 Begin 4.0
6 333 7 End NaN

Modify timestamps to sequence per ID

I have a Pandas dataframe (Python 3.5.1) with a timestamp column and an ID column.
Timestamp ID
0 2016-04-01T00:15:36.688 123
1 2016-04-01T00:12:52.688 123
2 2016-04-01T00:35:41.688 543
3 2016-04-01T00:01:12.688 543
4 2016-03-31T23:50:59.688 123
5 2016-04-01T01:05:52.688 543
I would like to sequence the timestamps per ID.
Timestamp ID Sequence
0 2016-04-01T00:15:36.688 123 3
1 2016-04-01T00:12:52.688 123 2
2 2016-04-01T00:35:41.688 543 2
3 2016-04-01T00:01:12.688 543 1
4 2016-03-31T23:50:59.688 123 1
5 2016-04-01T01:05:52.688 543 3
What is the best way to order the timestamps per ID, and generate a sequence number unique to each ID?
You can use sort_values(), groupby() and cumcount():
In [10]: df['Sequence'] = df.sort_values('Timestamp').groupby('ID').cumcount() + 1
In [11]: df
Out[11]:
Timestamp ID Sequence
0 2016-04-01 00:15:36.688 123 3
1 2016-04-01 00:12:52.688 123 2
2 2016-04-01 00:35:41.688 543 2
3 2016-04-01 00:01:12.688 543 1
4 2016-03-31 23:50:59.688 123 1
5 2016-04-01 01:05:52.688 543 3
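An equivalent result can also be obtained by ranking the timestamps within each ID; this is a sketch of an alternative, not part of the answer above:
# Alternative sketch: parse the strings and rank within each ID
# (method='first' keeps the original row order for exact ties).
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df['Sequence'] = df.groupby('ID')['Timestamp'].rank(method='first').astype(int)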
