Pandas count number of online devices at a time - python

I have the following dataframe, a log that records each time a device connects to a new gateway:
device  gateway  time
222     1        2021-01-01 05:02:03
222     2        2021-01-02 06:02:04
222     1        2021-01-03 02:02:53
223     3        2021-01-01 01:22:08
...     ...      ...
222     1        2021-02-01 12:32:23
For every minute and every gateway, I want to know how many devices are currently connected to that gateway:
gateway  minute               count
1        2021-01-01 00:00:00  0
2        2021-01-01 00:00:00  0
3        2021-01-01 00:00:00  0
1        2021-01-01 00:01:00  0
...      ...                  ...
1        2021-01-01 05:02:00  1
1        2021-01-01 05:03:00  1
1        2021-01-01 05:04:00  1
1        2021-01-01 05:05:00  1
...      ...                  ...
1        2021-01-02 06:02:00  0
...      ...                  ...
How can I accomplish this using pandas?

Try groupby with Grouper, which counts the log rows per gateway and minute:
df.groupby(['gateway', pd.Grouper(freq='min', key='time')]).size().reset_index(name='count')  # 'min' replaces the deprecated 'T' alias
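
Note that the groupby above counts log rows (connection events) per gateway and minute. If the goal is a running count of devices still attached, one possible sketch is below. It assumes a device stays on a gateway until its next log row, that timestamps are floored to the minute as in the expected output, and that a device's most recent gateway keeps it until the end of the log. The names `log`, `events`, `net`, `connected` and `long` are made up for illustration.

```python
import pandas as pd

# Small sample in the shape of the question's log (assumed data).
log = pd.DataFrame({
    "device": [222, 222, 222, 223],
    "gateway": [1, 2, 1, 3],
    "time": pd.to_datetime([
        "2021-01-01 05:02:03",
        "2021-01-02 06:02:04",
        "2021-01-03 02:02:53",
        "2021-01-01 01:22:08",
    ]),
})

# A device occupies a gateway from this row until its next row.
log = log.sort_values(["device", "time"])
log["start"] = log["time"].dt.floor("min")
log["end"] = log.groupby("device")["start"].shift(-1)

# +1 when a device arrives at a gateway, -1 when it moves on.
joins = log[["gateway", "start"]].rename(columns={"start": "minute"}).assign(delta=1)
leaves = (
    log.dropna(subset=["end"])[["gateway", "end"]]
    .rename(columns={"end": "minute"})
    .assign(delta=-1)
)
events = pd.concat([joins, leaves])

# Net change per minute and gateway, then a running total over every minute.
net = events.groupby(["minute", "gateway"])["delta"].sum().unstack(fill_value=0)
minutes = pd.date_range(net.index.min().floor("D"), net.index.max(), freq="min")
connected = net.reindex(minutes, fill_value=0).cumsum()

# Long format (gateway, minute, count), as in the question.
long = (
    connected.rename_axis("minute")
    .reset_index()
    .melt(id_vars="minute", var_name="gateway", value_name="count")
)
print(connected.loc[pd.Timestamp("2021-01-01 05:02:00")])
```

Because only the change points are stored before the cumulative sum, the work grows with the number of log rows plus the number of minutes, not with their product.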

Related

Group rows by certain time period depending on other factors

What I start with is a large dataframe (more than a million entries) of this structure:
id datetime indicator other_values ...
1 2020-01-14 00:12:00 0 ...
1 2020-01-17 00:23:00 1 ...
...
1 2021-02-01 00:00:00 0 ...
2 2020-01-15 00:05:00 0 ...
2 2020-03-10 00:07:00 0 ...
...
2 2021-05-22 00:00:00 1 ...
...
The data is sorted by id and then datetime, but has no other specific order. The dataset is not complete (there is not data for every day, and there can be multiple entries on the same day).
Now, for each row where indicator==1, I want to collect every row with the same id and a datetime that is at most 10 days earlier. All rows that are not within range of an indicator can be dropped. Ideally the result is saved as a dataset of time series, each of which will later be fed to a neural network. (There can be more than one indicator==1 case per id; the other values should be kept.)
An example for one id: I want to convert this
id datetime indicator other_values ...
1 2020-01-14 00:12:00 0 ...
1 2020-01-17 00:23:00 1 ...
1 2020-01-17 00:13:00 0 ...
1 2020-01-20 00:05:00 0 ...
1 2020-03-10 00:07:00 0 ...
1 2020-05-19 00:00:00 0 ...
1 2020-05-20 00:00:00 1 ...
into this
id datetime group other_values ...
1 2020-01-14 00:12:00 A ...
1 2020-01-17 00:23:00 A ...
1 2020-01-17 00:13:00 A ...
1 2020-05-19 00:00:00 B ...
1 2020-05-20 00:00:00 B ...
or some similar way of labelling the groups A, B, ... .
A naive Python for-loop is not an option because it takes ages on a dataset this size.
There is probably a clever way to use df.groupby('id'), df.groupby('id').agg(...), df.sort_values(...) or df.apply(), but I just do not see it.
Here is a way to do it with pd.merge_asof(). Let's create our data:
import pandas as pd
data = {'id': [1, 1, 1, 1, 1, 1, 1],
        'datetime': ['2020-01-14 00:12:00',
                     '2020-01-17 00:23:00',
                     '2020-01-17 00:13:00',
                     '2020-01-20 00:05:00',
                     '2020-03-10 00:07:00',
                     '2020-05-19 00:00:00',
                     '2020-05-20 00:00:00'],
        'ind': [0, 1, 0, 0, 0, 0, 1]
        }
df = pd.DataFrame(data)
# pd.to_datetime is used because astype('datetime64') without a unit fails on pandas 2.x
df['datetime'] = pd.to_datetime(df['datetime'])
Data:
id datetime ind
0 1 2020-01-14 00:12:00 0
1 1 2020-01-17 00:23:00 1
2 1 2020-01-17 00:13:00 0
3 1 2020-01-20 00:05:00 0
4 1 2020-03-10 00:07:00 0
5 1 2020-05-19 00:00:00 0
6 1 2020-05-20 00:00:00 1
Next, let's add a date column (the datetime truncated to midnight) and pull all dates where the indicator is 1.
df['date'] = df['datetime'].dt.normalize()
df2 = df.loc[df['ind'] == 1, ['id', 'date', 'ind']].rename({'ind': 'ind2'}, axis=1)
Which gives us this:
df:
id datetime ind date
0 1 2020-01-14 00:12:00 0 2020-01-14
1 1 2020-01-17 00:23:00 1 2020-01-17
2 1 2020-01-17 00:13:00 0 2020-01-17
3 1 2020-01-20 00:05:00 0 2020-01-20
4 1 2020-03-10 00:07:00 0 2020-03-10
5 1 2020-05-19 00:00:00 0 2020-05-19
6 1 2020-05-20 00:00:00 1 2020-05-20
df2:
id date ind2
1 1 2020-01-17 1
6 1 2020-05-20 1
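Now let's join them using pd.merge_asof() with direction='forward' and a tolerance of 10 days. Each row is matched to the next indicator date of the same id, provided that date is at most 10 days ahead; rows with no such date get NaN in ind2.
# ind is kept on the left side so that it still appears in the output below
df = pd.merge_asof(df, df2, by='id', on='date', tolerance=pd.Timedelta('10d'), direction='forward')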
Which gives us this:
id datetime ind date ind2
0 1 2020-01-14 00:12:00 0 2020-01-14 1.0
1 1 2020-01-17 00:23:00 1 2020-01-17 1.0
2 1 2020-01-17 00:13:00 0 2020-01-17 1.0
3 1 2020-01-20 00:05:00 0 2020-01-20 NaN
4 1 2020-03-10 00:07:00 0 2020-03-10 NaN
5 1 2020-05-19 00:00:00 0 2020-05-19 1.0
6 1 2020-05-20 00:00:00 1 2020-05-20 1.0
Next, let's work on creating groups. A new group starts on a matched row (ind2 is 1) when any of these three rules holds:
The previous value of ind2 is NaN
The previous id is not the current id (we're at the first row of a new id)
The date is more than 10 days after the previous row's date
With these rules, we can create a Boolean which we can then cumulatively sum to create our groups.
starts_group = df['ind2'].eq(1) & (
    df['ind2'].shift().isna()  # a comparison with == np.NaN is always False, so use isna()
    | (df['id'].shift() != df['id'])
    | (df['date'] - df['date'].shift() > pd.Timedelta('10d'))
)
df['group_id'] = starts_group.cumsum()
id datetime ind date ind2 group_id
0 1 2020-01-14 00:12:00 0 2020-01-14 1.0 1
1 1 2020-01-17 00:23:00 1 2020-01-17 1.0 1
2 1 2020-01-17 00:13:00 0 2020-01-17 1.0 1
3 1 2020-01-20 00:05:00 0 2020-01-20 NaN 1
4 1 2020-03-10 00:07:00 0 2020-03-10 NaN 1
5 1 2020-05-19 00:00:00 0 2020-05-19 1.0 2
6 1 2020-05-20 00:00:00 1 2020-05-20 1.0 2
Now we need to drop all the rows where ind2 is NaN, remove the helper columns date and ind2, and we're done.
df = df.dropna(subset=['ind2']).drop(['date', 'ind2'], axis=1)
Final output:
id datetime ind group_id
0 1 2020-01-14 00:12:00 0 1
1 1 2020-01-17 00:23:00 1 1
2 1 2020-01-17 00:13:00 0 1
5 1 2020-05-19 00:00:00 0 2
6 1 2020-05-20 00:00:00 1 2
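
The walkthrough above uses a single id. With several ids, `pd.merge_asof()` needs both frames sorted by the `on` key across the whole frame, not only within each id. The sketch below shows one way to handle that. It is a variation, not the answer's exact method: it labels groups by the matched indicator date instead of the Boolean cumulative sum, and the sample data and the names `flags`, `flag_date`, `merged` and `kept` are assumptions made for illustration.

```python
import pandas as pd

# Assumed sample with two ids; id 1 has two indicator rows.
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2],
    "datetime": pd.to_datetime([
        "2020-01-14 00:12:00",
        "2020-01-17 00:23:00",
        "2020-01-20 00:05:00",
        "2020-05-20 00:00:00",
        "2020-01-15 00:05:00",
        "2020-03-10 00:07:00",
        "2020-03-12 00:00:00",
    ]),
    "ind": [0, 1, 0, 1, 0, 0, 1],
})
df["date"] = df["datetime"].dt.normalize()

# One row per indicator; flag_date survives the merge as the group key.
flags = (
    df.loc[df["ind"] == 1, ["id", "date"]]
    .assign(flag_date=lambda d: d["date"])
    .sort_values("date")
)

# merge_asof requires the "on" column to be sorted in both frames.
merged = pd.merge_asof(
    df.sort_values("date"),
    flags,
    by="id",
    on="date",
    direction="forward",
    tolerance=pd.Timedelta("10d"),
)

# Keep matched rows only and number each (id, flag_date) pair.
kept = merged.dropna(subset=["flag_date"]).sort_values(["id", "datetime"])
kept["group_id"] = kept.groupby(["id", "flag_date"]).ngroup() + 1
print(kept[["id", "datetime", "ind", "group_id"]])
```

Keying on the matched indicator date also keeps two windows apart when they follow each other closely, since each row belongs to the first indicator at or after it.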
I'm not aware of a way to do this with df.agg, but you can put your for loop inside the groupby using .apply(). That way, your comparisons/lookups can be done on smaller tables, then groupby will handle the re-concatenation:
import datetime
import uuid
import pandas as pd
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "datetime": [
        '2020-01-14 00:12:00',
        '2020-01-17 00:23:00',
        '2021-02-01 00:00:00',
        '2020-01-15 00:05:00',
        '2020-03-10 00:07:00',
        '2021-05-22 00:00:00',
    ],
    "indicator": [0, 1, 0, 0, 0, 1]
})
df.datetime = pd.to_datetime(df.datetime)
timedelta = datetime.timedelta(days=10)
def consolidate(grp):
    grp = grp.copy()  # work on a copy of the group
    grp['Group'] = None
    for time in grp.loc[grp.indicator == 1, 'datetime']:
        # label every row in the 10 days up to and including this indicator;
        # .loc avoids the chained assignment grp['Group'][mask] = ...
        grp.loc[grp['datetime'].between(time - timedelta, time), 'Group'] = uuid.uuid4()
    return grp.dropna(subset=['Group'])
df.groupby('id').apply(consolidate)
If there are multiple rows with indicator == 1 in each id grouping, then the for loop will apply in index order (so a later group might overwrite an earlier group). If you can be certain that there is only one indicator == 1 in each grouping, we can simplify the consolidate function:
def consolidate(grp):
    time = grp.loc[grp.indicator == 1, 'datetime'].iloc[0]
    grp = grp[grp['datetime'].between(time - timedelta, time)].copy()
    grp['Group'] = uuid.uuid4()
    return grp

How to add a day to a date column?

I want to add one day to every value in the date column of this dataframe:
value B N S date
date
2020-12-31 1 11 0 2020-12-31
2021-01-01 3 80 0 2021-01-01
2021-01-02 4 99 0 2021-01-02
2021-01-03 3 78 0 2021-01-03
2021-01-04 0 50 0 2021-01-04
to make it like this:
value B N S date
date
2020-12-31 1 11 0 2021-01-01
2021-01-01 3 80 0 2021-01-02
2021-01-02 4 99 0 2021-01-03
2021-01-03 3 78 0 2021-01-04
2021-01-04 0 50 0 2021-01-05
How can I do this?
df['date']=pd.to_datetime(df['date']).add(pd.offsets.Day(1))
df
value B N S date
0 2020-12-31 1 11 0 2021-01-01
1 2021-01-01 3 80 0 2021-01-02
2 2021-01-02 4 99 0 2021-01-03
3 2021-01-03 3 78 0 2021-01-04
4 2021-01-04 0 50 0 2021-01-05
You can temporarily convert to datetime to add a DateOffset:
df['date'] = (pd.to_datetime(df['date'])
.add(pd.DateOffset(days=1))
.dt.strftime('%Y-%m-%d') # optional
)
Output:
value B N S date
0 2020-12-31 1 11 0 2021-01-01
1 2021-01-01 3 80 0 2021-01-02
2 2021-01-02 4 99 0 2021-01-03
3 2021-01-03 3 78 0 2021-01-04
4 2021-01-04 0 50 0 2021-01-05
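
For a fixed one-day shift, adding a plain `pd.Timedelta` is another option that should give the same result as the offsets above. A minimal self-contained sketch, assuming the date column holds strings in `YYYY-MM-DD` form (the three sample rows are illustrative):

```python
import pandas as pd

# Assumed sample: dates stored as strings.
df = pd.DataFrame({"date": ["2020-12-31", "2021-01-01", "2021-01-04"]})

# Parse, shift by exactly 24 hours, then format back to strings (optional).
shifted = pd.to_datetime(df["date"]) + pd.Timedelta(days=1)
df["date"] = shifted.dt.strftime("%Y-%m-%d")
print(df)
```

Month and year boundaries are handled by the datetime arithmetic, so 2020-12-31 becomes 2021-01-01.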

Mean between two datetimes; if NaN, get last non-NaN value

Yesterday I asked this question (with some good answers) which is very similar, but slightly different from the problem I'm presented with now. Say I have the following pd.DataFrame (dict):
eff_timestamp val id begin_timestamp end_timestamp
0 2021-01-01 00:00:00 -0.710230 1 2021-01-01 02:00:00 2021-01-01 05:30:00
1 2021-01-01 01:00:00 0.121464 1 2021-01-01 02:00:00 2021-01-01 05:30:00
2 2021-01-01 02:00:00 -0.156328 1 2021-01-01 02:00:00 2021-01-01 05:30:00
3 2021-01-01 03:00:00 0.788685 1 2021-01-01 02:00:00 2021-01-01 05:30:00
4 2021-01-01 04:00:00 0.505210 1 2021-01-01 02:00:00 2021-01-01 05:30:00
5 2021-01-01 05:00:00 -0.738344 1 2021-01-01 02:00:00 2021-01-01 05:30:00
6 2021-01-01 06:00:00 0.266910 1 2021-01-01 02:00:00 2021-01-01 05:30:00
7 2021-01-01 07:00:00 -0.587401 1 2021-01-01 02:00:00 2021-01-01 05:30:00
8 2021-01-02 00:00:00 -0.160692 2 2021-01-02 12:00:00 2021-01-02 15:30:00
9 2021-01-02 01:00:00 0.306354 2 2021-01-02 12:00:00 2021-01-02 15:30:00
10 2021-01-02 02:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
11 2021-01-02 03:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
12 2021-01-02 04:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
13 2021-01-02 05:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
14 2021-01-02 06:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
15 2021-01-02 07:00:00 -0.349705 2 2021-01-02 12:00:00 2021-01-02 15:30:00
I would like to get the mean of val for each unique id, using only the rows whose eff_timestamp lies between begin_timestamp and end_timestamp. If no rows satisfy that criterion, I'd like to get the last value for that id before that period. Note that in this example, id=2 has no rows that satisfy the criterion. Previously I could slice the data so that only the rows between begin_timestamp and end_timestamp were kept, and then use a groupby. The solution from my previous post then replaces the NaN value in the groupby result. However, in the example above, id=2 has no rows at all that satisfy the criterion, so no NaN value is created that could be replaced. So if I slice the data based on the criterion above:
sliced = df[(df.eff_timestamp > df.begin_timestamp) & (df.eff_timestamp < df.end_timestamp)]
sliced
>>>
eff_timestamp val id begin_timestamp end_timestamp
3 2021-01-01 03:00:00 0.788685 1 2021-01-01 02:00:00 2021-01-01 05:30:00
4 2021-01-01 04:00:00 0.505210 1 2021-01-01 02:00:00 2021-01-01 05:30:00
5 2021-01-01 05:00:00 -0.738344 1 2021-01-01 02:00:00 2021-01-01 05:30:00
sliced.groupby('id').val.mean()
>>>
id
1 0.185184
Name: val, dtype: float64
This result only includes id=1 with the mean value, but there is no value for id=2. How would I, instead of the mean, include the last available value for id=2, which is -0.349705?
Create a temporary Boolean column between_time, then group by id. Inside apply, check whether any row of that id lies within the range: if so, take the mean of those rows, otherwise take the value at last_valid_index.
result = (
    df.assign(
        between_time=(df.eff_timestamp > df.begin_timestamp) & (df.eff_timestamp < df.end_timestamp))
    .groupby('id')
    .apply(
        lambda x: x.loc[x['between_time'], 'val'].mean()
        if x['between_time'].any()
        else x.loc[x['val'].last_valid_index(), 'val']
    )
)
OUTPUT:
id
1 0.185184
2 -0.349705
dtype: float64
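
The same fallback can probably be written without a per-group lambda, which may matter on large frames. The sketch below is one possible variant, not the answer's method: it computes the in-window mean and the last non-NaN value per id separately, then fills the gaps. The sample data and the names `in_window`, `window_mean` and `last_valid` are assumptions. One difference to be aware of: if an id has in-window rows that are all NaN, this version falls back to the last valid value, while the apply version returns NaN.

```python
import numpy as np
import pandas as pd

# Assumed sample: id 1 has rows inside its window, id 2 has none.
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2],
    "eff_timestamp": pd.to_datetime([
        "2021-01-01 01:00", "2021-01-01 02:00",
        "2021-01-01 03:00", "2021-01-01 04:00",
        "2021-01-02 00:00", "2021-01-02 01:00", "2021-01-02 02:00",
    ]),
    "val": [1.0, 2.0, 4.0, 8.0, 5.0, 7.0, np.nan],
    "begin_timestamp": pd.to_datetime(
        ["2021-01-01 01:30"] * 4 + ["2021-01-02 12:00"] * 3
    ),
    "end_timestamp": pd.to_datetime(
        ["2021-01-01 03:30"] * 4 + ["2021-01-02 15:30"] * 3
    ),
})

# Rows strictly inside their own [begin, end] window.
in_window = (df.eff_timestamp > df.begin_timestamp) & (df.eff_timestamp < df.end_timestamp)
window_mean = df[in_window].groupby("id")["val"].mean()

# Last non-NaN value per id, used where no in-window mean exists.
last_valid = df.dropna(subset=["val"]).groupby("id")["val"].last()
result = window_mean.reindex(last_valid.index).fillna(last_valid)
print(result)
```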

Pandas: time column addition and repeating all rows for a month

I'd like to expand my dataframe by repeating every row for each hour of a month and adding the timestamp as a time column
Original df
money food
0 1 2
1 4 5
2 5 7
Output:
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 00:01:00
2 1 2 2020-01-01 00:02:00
...
2230 5 7 2020-01-31 00:22:00
2231 5 7 2020-01-31 00:23:00
where 2231 = out_rows_number-1 = month_days_number*hours_per_day*orig_rows_number - 1
What is the proper way to perform it?
Use a cross join via DataFrame.merge with a new DataFrame holding every hour of the month, created by date_range:
df1 = pd.DataFrame({'a': 1,
                    'time': pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')})
# the constant helper column a pairs every row of df with every hour in df1
df = df.assign(a=1).merge(df1, on='a', how='outer').drop('a', axis=1)
print (df)
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 01:00:00
2 1 2 2020-01-01 02:00:00
3 1 2 2020-01-01 03:00:00
4 1 2 2020-01-01 04:00:00
... ... ...
2227 5 7 2020-01-31 19:00:00
2228 5 7 2020-01-31 20:00:00
2229 5 7 2020-01-31 21:00:00
2230 5 7 2020-01-31 22:00:00
2231 5 7 2020-01-31 23:00:00
[2232 rows x 3 columns]
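
On pandas 1.2 or later, `merge` accepts `how='cross'` directly, so the helper column is not needed. A minimal sketch of the same expansion; the names `hours` and `expanded` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"money": [1, 4, 5], "food": [2, 5, 7]})

# Every hour of January 2020: 31 days * 24 hours = 744 timestamps.
hours = pd.DataFrame(
    {"time": pd.date_range("2020-01-01", "2020-01-31 23:00:00", freq="h")}
)

# Cartesian product: each original row is repeated once per hour.
expanded = df.merge(hours, how="cross")
print(expanded.shape)
```

The row count is the number of original rows times the number of hours, here 3 * 744.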

Create a Series from a Pandas DataFrame by choosing an element from different columns on each row

My goal is to create a Series from a Pandas DataFrame by choosing an element from different columns on each row.
For example, I have the following DataFrame:
In [171]: pred[:10]
Out[171]:
0 1 2
Timestamp
2010-12-21 00:00:00 0 0 1
2010-12-20 00:00:00 1 1 1
2010-12-17 00:00:00 1 1 1
2010-12-16 00:00:00 0 0 1
2010-12-15 00:00:00 1 1 1
2010-12-14 00:00:00 1 1 1
2010-12-13 00:00:00 0 0 1
2010-12-10 00:00:00 1 1 1
2010-12-09 00:00:00 1 1 1
2010-12-08 00:00:00 0 0 1
And, I have the following series:
In [172]: useProb[:10]
Out[172]:
Timestamp
2010-12-21 00:00:00 1
2010-12-20 00:00:00 2
2010-12-17 00:00:00 1
2010-12-16 00:00:00 2
2010-12-15 00:00:00 2
2010-12-14 00:00:00 2
2010-12-13 00:00:00 0
2010-12-10 00:00:00 2
2010-12-09 00:00:00 2
2010-12-08 00:00:00 0
I would like to create a new series, usePred, that takes the values from pred, based on the column information in useProb to return the following:
In [172]: usePred[:10]
Out[172]:
Timestamp
2010-12-21 00:00:00 0
2010-12-20 00:00:00 1
2010-12-17 00:00:00 1
2010-12-16 00:00:00 1
2010-12-15 00:00:00 1
2010-12-14 00:00:00 1
2010-12-13 00:00:00 0
2010-12-10 00:00:00 1
2010-12-09 00:00:00 1
2010-12-08 00:00:00 0
This last step is where I fail. I've tried things like:
usePred = pd.DataFrame(index = pred.index)
for row in usePred:
    usePred['PREDS'].ix[row] = pred.ix[row, useProb[row]]
And, I've tried:
usePred['PREDS'] = pred.iloc[:,useProb]
I googled and searched Stack Overflow for hours, but can't seem to solve the problem.
One solution could be to use get_dummies (which should be more efficient than apply):
In [11]: (pd.get_dummies(useProb) * pred).sum(axis=1)
Out[11]:
Timestamp
2010-12-21 00:00:00 0
2010-12-20 00:00:00 1
2010-12-17 00:00:00 1
2010-12-16 00:00:00 1
2010-12-15 00:00:00 1
2010-12-14 00:00:00 1
2010-12-13 00:00:00 0
2010-12-10 00:00:00 1
2010-12-09 00:00:00 1
2010-12-08 00:00:00 0
dtype: float64
You could use an apply with a couple of locs:
In [21]: pred.apply(lambda row: row.loc[useProb.loc[row.name]], axis=1)
Out[21]:
Timestamp
2010-12-21 00:00:00 0
2010-12-20 00:00:00 1
2010-12-17 00:00:00 1
2010-12-16 00:00:00 1
2010-12-15 00:00:00 1
2010-12-14 00:00:00 1
2010-12-13 00:00:00 0
2010-12-10 00:00:00 1
2010-12-09 00:00:00 1
2010-12-08 00:00:00 0
dtype: int64
The trick is that you have access to the row's index label via the name property.
Here is another way to do it using DataFrame.lookup (deprecated in pandas 1.2 and removed in pandas 2.0, so this needs an older pandas):
pred.lookup(row_labels=pred.index,
            col_labels=pred.columns[useProb['0']])
It seems to be exactly what you need, except that care must be taken to supply values which are labels. For example, if pred.columns are strings, and useProb['0'] values are integers, then we could use
pred.columns[useProb['0']]
so that the values passed to the col_labels parameter are proper label values.
For example,
import io
import pandas as pd
# io.StringIO replaces io.BytesIO, which rejects str input on Python 3
content = io.StringIO('''\
Timestamp  0  1  2
2010-12-21 00:00:00  0  0  1
2010-12-20 00:00:00  1  1  1
2010-12-17 00:00:00  1  1  1
2010-12-16 00:00:00  0  0  1
2010-12-15 00:00:00  1  1  1
2010-12-14 00:00:00  1  1  1
2010-12-13 00:00:00  0  0  1
2010-12-10 00:00:00  1  1  1
2010-12-09 00:00:00  1  1  1
2010-12-08 00:00:00  0  0  1''')
# columns are separated by two or more spaces; a regex separator needs the python engine
pred = pd.read_table(content, sep=r'\s{2,}', engine='python', parse_dates=True, index_col=[0])
content = io.StringIO('''\
Timestamp  0
2010-12-21 00:00:00  1
2010-12-20 00:00:00  2
2010-12-17 00:00:00  1
2010-12-16 00:00:00  2
2010-12-15 00:00:00  2
2010-12-14 00:00:00  2
2010-12-13 00:00:00  0
2010-12-10 00:00:00  2
2010-12-09 00:00:00  2
2010-12-08 00:00:00  0''')
useProb = pd.read_table(content, sep=r'\s{2,}', engine='python', parse_dates=True, index_col=[0])
# DataFrame.lookup only exists on pandas < 2.0
print(pd.Series(pred.lookup(row_labels=pred.index,
                            col_labels=pred.columns[useProb['0']]),
                index=pred.index))
yields
Timestamp
2010-12-21 0
2010-12-20 1
2010-12-17 1
2010-12-16 1
2010-12-15 1
2010-12-14 1
2010-12-13 0
2010-12-10 1
2010-12-09 1
2010-12-08 0
dtype: int64
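
Since `DataFrame.lookup` is gone in pandas 2.0, a replacement along the following lines may be useful. It is a sketch under assumptions, not part of the original answers: it turns the column labels into positions with `Index.get_indexer` and then picks one cell per row with NumPy integer indexing. The four-row sample and the names `col_pos` and `row_pos` are illustrative, and `useProb` is assumed to be a Series of column labels.

```python
import numpy as np
import pandas as pd

# Assumed sample: first four rows of the question's data, integer column labels.
idx = pd.to_datetime(["2010-12-21", "2010-12-20", "2010-12-17", "2010-12-16"])
pred = pd.DataFrame(
    [[0, 0, 1], [1, 1, 1], [1, 1, 1], [0, 0, 1]], index=idx, columns=[0, 1, 2]
)
useProb = pd.Series([1, 2, 1, 2], index=idx)

# Align useProb to pred's rows, then map each column label to its position.
# get_indexer returns -1 for a label that is not a column, which would
# silently select the last column, so validate labels first if unsure.
col_pos = pred.columns.get_indexer(useProb.reindex(pred.index))
row_pos = np.arange(len(pred))

# Pair row i with column col_pos[i] to pick exactly one cell per row.
usePred = pd.Series(pred.to_numpy()[row_pos, col_pos], index=pred.index)
print(usePred)
```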
