Pandas groupby user and count number of events between 2 timestamps - python

I have a DF1 where each row represents an "event". Each event has the columns "user", and "time":
DF1:
"user","time"
user1,2022-11-14 00:00:04
user2,2022-11-16 21:34:45
user1,2022-11-14 00:15:22
user3,2022-11-17 15:32:25
...
The "time" value is any timestamp in one week: from 2022-11-14 and 2022-11-20. There are 10k different users, and 27M events.
I have to divide the week in 8h time-slots (so 21 slots in total), and for each user, I need to look if that I can see any event of that user in each slot.
Then, I should create a DF2 (in which each row is a user) with 21 columns (one for each slot), with numbers 0 or 1: 0 if I have not seen the user in that slot, and 1 if I have seen the user in that slot.
DF2:
"user","slot1","slot2","slot3",...,"slot21"
user1,1,0,0,0,0,0,...,0
user2,0,0,1,1,1,0,...,0
user3,1,1,1,0,0,1,...,1
...
(After that, I will need to order DF2 and plot it as a sparse matrix, but that is another story...)
I have managed to fill a single row of DF2, but it takes 30 seconds for 1 user, doing it this way:
slots = pd.date_range('2022-11-14', periods=22, freq='8h')
row = []
for i in np.arange(0, slots.value_counts().sum() - 1):
    if DF1[(DF1.user == "user1") & (DF1.time.between(slots[i], slots[i+1]))].shape[0] >= 1:
        row.append(1)
    else:
        row.append(0)
print(row)
So running this process for the 10k users would take almost 4 days...
Does anyone have an idea of how I can create DF2 in a quicker way?
Maybe something like DF1.groupby('user').time and then what else?
It can be done in pandas or in any other way, or even in a different language, as long as I get the sparse matrix in DF2!
Any help would be much appreciated!

Use crosstab with cut to count the values; if you need a 0/1 output, compare for not equal to 0 and cast to integers:
df = (pd.crosstab(df['user'],
                  pd.cut(df['time'], bins=slots, labels=False))
        .ne(0)
        .astype(int)
        .add_prefix('slot'))
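For context, here is a minimal end-to-end sketch of that approach on synthetic data (the user names, event counts and random seed below are made up purely for illustration). Note that cut(..., labels=False) returns bin indices starting at 0, so the columns come out as slot0..slot20; shift the labels if you need slot1..slot21:
import numpy as np
import pandas as pd

# Synthetic DF1: 1,000 random events for 5 users across the week (assumed data)
rng = np.random.default_rng(0)
DF1 = pd.DataFrame({
    'user': rng.choice([f'user{i}' for i in range(1, 6)], size=1000),
    'time': pd.Timestamp('2022-11-14')
            + pd.to_timedelta(rng.integers(0, 7 * 24 * 3600, size=1000), unit='s'),
})

# 22 edges -> 21 slots of 8 hours
slots = pd.date_range('2022-11-14', periods=22, freq='8h')

# One row per user, one column per slot, 1 if the user appears in that slot
DF2 = (pd.crosstab(DF1['user'],
                   pd.cut(DF1['time'], bins=slots, labels=False))
         .ne(0)
         .astype(int)
         .add_prefix('slot'))
print(DF2.head())
On 27M rows this is a single grouped count rather than 10k x 21 filters, which is where the speed-up comes from.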

Related

Assigning a value to the same dates fulfilling a condition in a more efficient way in a dataframe

I have the following dataframe called df1 that contains data for a number of regions in the column NUTS_ID:
The index, called Date, has all the days of 2010. That is, for each code in NUTS_ID I have every day of 2010 (all days of the year for AT1, AT2 and so on). I created a list containing the dates corresponding to workdays, and I want to add a column with 0 for non-workdays and 1 for workdays.
For this, I simply used a for loop that checks day by day if it's in the workday list I created:
for day in df1.index:
    if day not in workdays_list:
        df1.loc[day, 'Workday'] = 0  # assigning 0 to non-workdays
    else:
        df1.loc[day, 'Workday'] = 1  # assigning 1 to workdays
This works well enough if the dataset is not big. But with some of the datasets I'm processing this takes a very long time. I would like to ask for ideas in order to do the process faster and more efficient. Thank you in advance for your input.
EDIT: One of the things I have thought is that maybe a groupby could be helpful, but I don't know if that is correct.
You can use np.where with isin to check if your Date (i.e. your index) is in the list you created:
import numpy as np
df1['Workday'] = np.where(df1.index.isin(workdays_list),1,0)
I can't reproduce your dataset, but something along those lines should work.
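As a tiny self-contained illustration of that pattern (the index dates and workdays_list below are invented for the example):
import numpy as np
import pandas as pd

# Toy frame indexed by date (assumed structure, for illustration only)
df1 = pd.DataFrame(
    {'NUTS_ID': ['AT1'] * 5},
    index=pd.date_range('2010-01-01', periods=5, freq='D', name='Date'),
)
workdays_list = [pd.Timestamp('2010-01-04'), pd.Timestamp('2010-01-05')]

# Vectorised flag: 1 where the index date is in the workday list, 0 otherwise
df1['Workday'] = np.where(df1.index.isin(workdays_list), 1, 0)
print(df1)
This replaces the row-by-row .loc assignments with a single vectorised pass over the index.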

How to filter NA values in columns of pydatatable?

I have a datatable as follows:
DT_EX = dt.Frame({
    'country': ['a', 'a', 'a', 'a'],
    'id': [3, 3, 3, 3],
    'shop': ['dmart', 'dmart', 'dmart', 'dmart'],
    'beef': [23, None, None, None],
    'eggs': [None, 33, None, None],
    'fork': [None, None, 10, None],
    'veg': [None, None, None, 40]})
Its output is:
And I would like to convert it to a datatable that has no NAs in its columns, as shown in this output:
Could you please explain how to do this operation (removing NAs) in py-datatable? Would dt.isna() be helpful in this case?
One way around it would be to select the first three columns (they have no nulls) and extend them with the sum of the remaining columns:
from datatable import f, first, sum
DT_EX[:,first(f[:3]).extend(sum(f[3:]))]
    country  id  shop    beef  eggs  fork  veg
 0  a         3  dmart     23    33    10   40
UPDATE: a simpler solution, from another related question:
DT_EX[:, sum(f[3:]), f[:3]]
So I have one more subgroup of items; here is a new DT:
DT_EX = dt.Frame({
    'country': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c'],
    'id': [3, 3, 3, 3, 4, 4, 4, 4],
    'shop': ['dmart', 'dmart', 'dmart', 'dmart', 'amzn', 'amzn', 'amzn', 'amzn'],
    'beef': [23, None, None, None, 93, None, None, None],
    'eggs': [None, 33, None, None, None, 103, None, None],
    'fork': [None, None, 10, None, None, None, 210, None],
    'veg': [None, None, None, 40, None, None, None, 340]})
I have tried to apply the recommended logic to it, as shown in the attached screenshot.
In the second code chunk it has summed up each column (beef, eggs, fork, veg), and in the third code chunk I did a grouping on the first three columns. That gives a correct output, but it adds duplicate columns, and another observation is that it fills NA values with 0, which can be seen for the 'c' observation.
Would you have any other ideas or suggestions for it?
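Not an authoritative answer, but for the grouped case a by() clause may be what you are after: summing the numeric columns per (country, id, shop) group collapses each group into one row without duplicating the key columns. Note that, as you observed, an all-NA column within a group will still come out as 0 rather than NA. A sketch under those assumptions:
import datatable as dt
from datatable import f, by

DT_EX = dt.Frame({
    'country': ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c'],
    'id': [3, 3, 3, 3, 4, 4, 4, 4],
    'shop': ['dmart', 'dmart', 'dmart', 'dmart', 'amzn', 'amzn', 'amzn', 'amzn'],
    'beef': [23, None, None, None, 93, None, None, None],
    'eggs': [None, 33, None, None, None, 103, None, None],
    'fork': [None, None, 10, None, None, None, 210, None],
    'veg': [None, None, None, 40, None, None, None, 340]})

# One row per (country, id, shop); sum() skips NAs within each group
result = DT_EX[:, dt.sum(f[3:]), by("country", "id", "shop")]
print(result)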

Change one row in a pandas dataframe based on the value of another row

I have a pandas DataFrame with data from an icecream freezer. Several columns describe the different temperatures in the system as well as some other things.
One column, named 'Defrost status', tells me with boolean values when the freezer was defrosting to remove excess ice.
Those 'defrosts' are what I am interested in, so I added another column named "around_defrost". This column currently only has NaN values, but I want to change them to True whenever there is a defrost within 30 minutes of that specific row in the dataframe.
The data is recorded every minute, so 30 minutes means that the 30 rows before a defrost and the 30 rows after it need to be set to True.
I have tried to do this with iterrows, itertuples and by playing with the indexes as seen in the attached screenshot, but no success so far. If anyone has a good idea of how this could be done, I'd really appreciate it!
You need to use dataframe.rolling:
df = df.sort_values("Time")  # sort by Time
df['around_defrost'] = df['Defrost status'].rolling(60, center=True, min_periods=0).apply(
    lambda x: True if True in x else False, raw=True)
EDIT: you may need rolling(61, center=True) since you want to consider the row in question AND 30 before and after.
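If the Python-level lambda turns out to be slow on a large frame, a rolling max over the boolean column should produce the same flag fully vectorised. A sketch, assuming one row per minute and a boolean 'Defrost status' column (the toy data below is invented):
import pandas as pd

# Toy data: one row per minute, a single defrost at minute 100 (assumed shape)
df = pd.DataFrame({
    'Time': pd.date_range('2022-01-01', periods=200, freq='min'),
    'Defrost status': [False] * 200,
})
df.loc[100, 'Defrost status'] = True

df = df.sort_values('Time')
# 61-row centred window = the row itself plus the 30 rows before and after it
df['around_defrost'] = (df['Defrost status']
                        .rolling(61, center=True, min_periods=1)
                        .max()
                        .astype(bool))
print(df['around_defrost'].sum())  # 61 rows flagged around the defrost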

Assign value to dataframe from another dataframe based on two conditions

I am trying to assign values from a column in df2['values'] to a column df1['values']. However values should only be assigned if:
df2['category'] is equal to the df1['category'] (rows are part of the same category)
df1['date'] is in df2['date_range'] (date is in a certain range for a specific category)
So far I have this code, which works, but is far from efficient, since it takes me two days to process the two dfs (df1 has ca. 700k rows).
for i in df1.category.unique():
    for j in df2.category.unique():
        if i == j:  # matching categories
            for ia, ra in df1.loc[df1['category'] == i].iterrows():
                for ib, rb in df2.loc[df2['category'] == j].iterrows():
                    if df1['date'][ia] in df2['date_range'][ib]:
                        df1.loc[ia, 'values'] = rb['values']
                        break
I read that I should try to avoid using for-loops when working with dataframes. List comprehensions are great, however since I do not have a lot of experience yet, I struggle formulating more complicated code.
How can I approach this problem more efficiently? What essential aspects should I think about when iterating over dataframes with conditions?
The code above tends to skip some rows or assigns them wrongly, so I need to do a cleanup afterwards. And the biggest problem is that it is really slow.
Thank you.
Some df1 insight:
df1.head()
date category
0 2015-01-07 f2
1 2015-01-26 f2
2 2015-01-26 f2
3 2015-04-08 f2
4 2015-04-10 f2
Some df2 insight:
df2.date_range[0]
DatetimeIndex(['2011-11-02', '2011-11-03', '2011-11-04', '2011-11-05',
'2011-11-06', '2011-11-07', '2011-11-08', '2011-11-09',
'2011-11-10', '2011-11-11', '2011-11-12', '2011-11-13',
'2011-11-14', '2011-11-15', '2011-11-16', '2011-11-17',
'2011-11-18'],
dtype='datetime64[ns]', freq='D')
The other two columns of df2:
df2[['values','category']].head()
values category
0 01 f1
1 02 f1
2 2.1 f1
3 2.2 f1
4 03 f1
Edit: Corrected erroneous code and added OP input from a comment
Alright, so if you want to join the dataframes on matching categories, you can merge them:
import pandas as pd
df3 = df1.merge(df2, on = "category")
Next, since date is a timestamp and the "date_range" is actually generated from two columns (per the OP's comment), we use instead:
mask = (df3["startdate"] <= df3["date"]) & (df3["date"] <= df3["enddate"])
subset = df3.loc[mask]
Now we get back to df1 and merge on the common dates while keeping all the values from df1. This will create NaN for the subset values where they didn't match with df1 in the earlier merge.
As such, we set df1["values"] where the entries in common are not NaN and we leave them be otherwise.
import numpy as np  # np is used below

common_dates = df1.merge(subset, on="date", how="left")  # keeping df1 values
df1["values"] = np.where(common_dates["values_y"].notna(),
                         common_dates["values_y"], df1["values"])
N.B.: If more than one df1["date"] matches the date range, you'll have to drop some values, otherwise duplicates will mess up the alignment described above.
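Putting the pieces above together, a minimal self-contained sketch (the startdate/enddate columns and the toy values are assumptions based on the OP's comment, not the actual data):
import numpy as np
import pandas as pd

# Toy frames mimicking the shapes described in the question (assumed data)
df1 = pd.DataFrame({
    'date': pd.to_datetime(['2015-01-07', '2015-01-26', '2015-04-08']),
    'category': ['f2', 'f2', 'f2'],
    'values': [np.nan, np.nan, np.nan],
})
df2 = pd.DataFrame({
    'category': ['f2'],
    'startdate': pd.to_datetime(['2015-01-01']),
    'enddate': pd.to_datetime(['2015-01-31']),
    'values': [42],
})

# 1) join on category (df1 values -> values_x, df2 values -> values_y)
df3 = df1.merge(df2, on="category")
# 2) keep only rows whose date falls inside the category's range
mask = (df3["startdate"] <= df3["date"]) & (df3["date"] <= df3["enddate"])
subset = df3.loc[mask]
# 3) bring the matched df2 values back onto df1 by date
common_dates = df1.merge(subset, on="date", how="left")
df1["values"] = np.where(common_dates["values_y"].notna(),
                         common_dates["values_y"], df1["values"])
print(df1)  # the first two rows get 42, the third stays NaN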
You could accomplish the first point (df2['category'] is equal to df1['category']) with the use of a join.
You could then use a for loop to filter out the data points of df1['date'] inside the merged dataframe that are not contained in df2['date_range']. Unfortunately I would need more information about the content of df1['date'] and df2['date_range'] to write code here that would do exactly that.

how to unstack a pandas dataframe with two sets of variables

I have a table that looks like this. Read from a CSV file, so no levels, no fancy indices, etc.
ID  date1      amount1  date2      amount2
x   15/1/2015  100      15/1/2016  80
The actual file I have goes up to date5 and amount5.
How can I convert it to:
ID  date       amount
x   15/1/2015  100
x   15/1/2016  80
If I only had one variable, I would use pandas.melt(), but with two variables I really don't know how to do it quickly.
I could do it manually by exporting to an in-memory sqlite3 database and doing a union. Doing unions in pandas is more annoying because, unlike SQL, it requires all field names to be the same, so in pandas I'd have to create a temporary dataframe for each pair and rename its fields: a dataframe for date1 and amount1 with the fields renamed to just date and amount, then the same for all the other pairs, and only then could I do pandas.concat.
Any suggestions? Thanks!
Here is one way:
>>> pandas.concat(
... [pandas.melt(x, id_vars='ID', value_vars=x.columns[1::2].tolist(), value_name='date'),
... pandas.melt(x, value_vars=x.columns[2::2].tolist(), value_name='amount')
... ],
... axis=1
... ).drop('variable', axis=1)
  ID       date  amount
0  x  15/1/2015     100
1  x  15/1/2016      80
The idea is to do two melts, one for each set of columns, then concat them. This assumes that the two kinds of columns are in alternating order, so that the columns[1::2] and columns[2::2] select them correctly. If not, you'd have to modify that part of it to choose the columns you want.
You can also do it with the little-known lreshape:
>>> pandas.lreshape(x, {'date': x.columns[1::2], 'amount': x.columns[2::2]})
  ID       date  amount
0  x  15/1/2015     100
1  x  15/1/2016      80
However, lreshape is not really documented and it's not clear if it's supposed to be used.
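For reference, a runnable sketch of the double-melt approach with the sample row reconstructed from the question (the frame x and its column names are assumptions):
import pandas as pd

# Rebuild the sample wide table (assumed layout: ID, then alternating dateN/amountN)
x = pd.DataFrame({
    'ID': ['x'],
    'date1': ['15/1/2015'], 'amount1': [100],
    'date2': ['15/1/2016'], 'amount2': [80],
})

dates = pd.melt(x, id_vars='ID', value_vars=x.columns[1::2].tolist(), value_name='date')
amounts = pd.melt(x, value_vars=x.columns[2::2].tolist(), value_name='amount')

result = pd.concat([dates, amounts], axis=1).drop('variable', axis=1)
print(result)
#   ID       date  amount
# 0  x  15/1/2015     100
# 1  x  15/1/2016      80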
If I assume that the columns always repeat, a simple trick provides the solution you want.
The trick lies in making a list of lists of the columns that go together, then looping over that main list, appending as necessary. It does involve a call to pd.DataFrame() each time the loop runs; I am a bit pressed for time right now to find a way to avoid that. But it works as you would expect, and for a small file you should not have any problems with run time.
In [1]: columns = [['date1', 'amount1'], ['date2', 'amount2'], ...]

In [2]: df_clean = pd.DataFrame(columns=['date', 'amount'])
        for cols in columns:
            df_clean = df_clean.append(pd.DataFrame(df.loc[:, cols].values,
                                                    columns=['date', 'amount']),
                                       ignore_index=True)
        df_clean

Out[2]:         date  amount
        0  15/1/2015     100
        1  15/1/2016      80
The neat thing about this is that it only runs over the DataFrame once per column pair, picking up all the rows under the columns it is looping over. So if you have 5 column pairs, each with n rows under them, the loop only runs 5 times. On each run it appends all n rows under those columns to the clean DataFrame, giving you a consistent result. You can then eliminate any NaN values and sort by date, or do whatever else you want with the clean DF.
What do you think, does this beat creating an in-memory sqlite3 database?
