Pandas resampling with multi-index

I have a DataFrame containing information about stores. It looks like the following:
date | store_id | x
2019-01-01| 1 | 5
2019-01-01| 2 | 1
2019-01-05| 1 | 3
...
The multi-index is [date, store_id]. Note that the dates are not unique.
I want to resample the data at an hourly level, but only for the days present in the date column, i.e. I don't want to fill in every hour in between. Furthermore, I want to fill in the value of x for every hour that is created. So the desired result for the above example would be:
date | store_id | x
2019-01-01 00:00:00| 1 | 5
2019-01-01 01:00:00| 1 | 5
2019-01-01 02:00:00| 1 | 5
...
2019-01-01 23:00:00| 1 | 5
2019-01-01 00:00:00| 2 | 1
2019-01-01 01:00:00| 2 | 1
2019-01-01 02:00:00| 2 | 1
...
2019-01-01 23:00:00| 2 | 1
2019-01-05 00:00:00| 1 | 3
2019-01-05 01:00:00| 1 | 3
2019-01-05 02:00:00| 1 | 3
...
2019-01-05 23:00:00| 1 | 3

Define the following "replication" function:
def repl(row):
    return pd.DataFrame({'date': pd.date_range(start=row.date, periods=24, freq='H'),
                         'store_id': row.store_id,
                         'x': row.x})
It "replicates" the source row (parameter), returning a sequence of rows
with the given date, for consecutive hours.
Then:
- reset the index, so that all columns are "normal" columns,
- apply this function to each row,
- convert the resulting Series of DataFrames into a list (of DataFrames),
- concatenate the result.
The code to do it is:
pd.concat(df.reset_index().apply(repl, axis=1).tolist(), ignore_index=True)
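Put together, a minimal runnable sketch (the sample frame is rebuilt from the question, so the literal values are an assumption):
import pandas as pd

# Rebuild the sample frame from the question, with [date, store_id] as the index.
df = pd.DataFrame({'date': pd.to_datetime(['2019-01-01', '2019-01-01', '2019-01-05']),
                   'store_id': [1, 2, 1],
                   'x': [5, 1, 3]}).set_index(['date', 'store_id'])

def repl(row):
    # One row per hour of the source row's date, carrying store_id and x along.
    return pd.DataFrame({'date': pd.date_range(start=row.date, periods=24, freq='H'),
                         'store_id': row.store_id,
                         'x': row.x})

result = pd.concat(df.reset_index().apply(repl, axis=1).tolist(), ignore_index=True)
print(result)  # 72 rows: 24 hourly rows per source row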

add values on a negative pandas df based on condition date

I have a dataframe which contains the credit of a user; each row shows how much credit the user has on a given day.
A user loses 1 credit per day.
I need a way to code the following: if a user has accumulated credit in the past, that credit fills the days on which the credit was 0.
An example of refilling past credits:
import pandas as pd
from datetime import datetime
data = pd.DataFrame({'credit_refill': [0, 0, 0, 2, 0, 0, 1],
                     'date': pd.date_range('01-01-2021', '01-07-2021')})
data['credit_after_consuming'] = data.credit_refill - 1
Looks like:
| | credit_refill | date | credit_after_consuming |
|---:|----------------:|:--------------------|-------------------------:|
| 0 | 0 | 2021-01-01 00:00:00 | -1 |
| 1 | 0 | 2021-01-02 00:00:00 | -1 |
| 2 | 0 | 2021-01-03 00:00:00 | -1 |
| 3 | 2 | 2021-01-04 00:00:00 | 1 |
| 4 | 0 | 2021-01-05 00:00:00 | -1 |
| 5 | 0 | 2021-01-06 00:00:00 | -1 |
| 6 | 1 | 2021-01-07 00:00:00 | 0 |
The logic should be: as you can see, for the first three days the user would have credit -1, until the 4th of January, when the user has 2 days of credit, one used that day and the other consumed on the 5th.
In total there would be 3 days (the first ones without credits).
If at the start of the week a user picks up 7 credits, the whole week is covered.
Another case would be
| | credit_refill | date | credit_after_consuming |
|---:|----------------:|:--------------------|-------------------------:|
| 0 | 2 | 2021-01-01 00:00:00 | 1 |
| 1 | 0 | 2021-01-02 00:00:00 | -1 |
| 2 | 0 | 2021-01-03 00:00:00 | -1 |
| 3 | 0 | 2021-01-04 00:00:00 | -1 |
| 4 | 1 | 2021-01-05 00:00:00 | 0 |
| 5 | 0 | 2021-01-06 00:00:00 | -1 |
In this case the participant would run out of credits on the 3rd and 4th days, because they have 2 credits on the 1st: one is consumed the same day and the other on the 2nd.
Then on the 5th they would refill and consume it the same day, running out of credits again on the 6th.
I feel it's like some variation of cumsum but I can't manage to get the expected results.
I could sum over all days and fill the 0s with the accumulated credits, but I have to take into account that I can only refill with credits accumulated in the past.
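As an illustration of the described logic (credits only accumulate from past refills, and one credit is consumed per covered day), a minimal running-balance sketch; the has_credit column name is just illustrative:
import pandas as pd

data = pd.DataFrame({'credit_refill': [0, 0, 0, 2, 0, 0, 1],
                     'date': pd.date_range('01-01-2021', '01-07-2021')})

# Running balance: refills become available on their day, then one credit
# is consumed per day if the balance allows it.
balance = 0
has_credit = []
for refill in data['credit_refill']:
    balance += refill
    if balance > 0:
        balance -= 1
        has_credit.append(True)
    else:
        has_credit.append(False)
data['has_credit'] = has_credit
print(data)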

comparing values of 2 columns from same pandas dataframe & returning value of 3rd column based on comparison

I'm trying to compare values between 2 columns in the same pandas dataframe, and wherever a match is found I want to return the value from that row but from a 3rd column.
Basically, if the following is the dataframe df:
| date | date_new | category | value |
| --------- | ---------- | -------- | ------ |
|2016-05-11 | 2018-05-15 | day | 1000.0 |
|2020-03-28 | 2018-05-11 | night | 2220.1 |
|2018-05-15 | 2020-03-28 | day | 142.8 |
|2018-05-11 | 2019-01-29 | night | 1832.9 |
I want to add a new column, say value_new, which is obtained by looking up each value of date_new in the date column and then checking whether both rows have the same category value.
Steps of the transformation:
- 1. for each value in date_new, look for a match in date
- 2. if a match is found, check whether the values in the category column also match
- 3. if both matches in the above steps are fulfilled, pick the corresponding value from the value column of that row; otherwise leave it blank.
So I would want the final dataframe to look something like this:
| date | date_new | category | value | value_new |
| --------- | ---------- | -------- | ------ | --------- |
|2016-05-11 | 2018-05-15 | day | 1000.0 | 142.8 |
|2020-03-28 | 2018-05-11 | night | 2220.1 | 1832.9 |
|2018-05-15 | 2020-03-28 | day | 142.8 | None |
|2018-05-11 | 2016-05-11 | day | 1832.9 | 1000.0 |
Use DataFrame.merge with a left join and assign the new column:
df['value_new'] = df.merge(df,
                           left_on=['date_new', 'category'],
                           right_on=['date', 'category'],
                           how='left')['value_y']
print(df)
date date_new category value value_new
0 2016-05-11 2018-05-15 day 1000.0 142.8
1 2020-03-28 2018-05-11 night 2220.1 NaN
2 2018-05-15 2020-03-28 day 142.8 NaN
3 2018-05-11 2016-05-11 day 1832.9 1000.0
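For completeness, a self-contained version of the above; the frame is rebuilt from the printed output, so the literal values are an assumption:
import pandas as pd

df = pd.DataFrame({
    'date':     ['2016-05-11', '2020-03-28', '2018-05-15', '2018-05-11'],
    'date_new': ['2018-05-15', '2018-05-11', '2020-03-28', '2016-05-11'],
    'category': ['day', 'night', 'day', 'day'],
    'value':    [1000.0, 2220.1, 142.8, 1832.9],
})

# Self-merge: each row's (date_new, category) is looked up against (date, category);
# the matched row's value comes back suffixed as 'value_y'.
df['value_new'] = df.merge(df,
                           left_on=['date_new', 'category'],
                           right_on=['date', 'category'],
                           how='left')['value_y']
print(df)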

How to delete specific rows in pandas dataframe if a condition is met

I have a pandas dataframe with a few thousand rows and only one column. The structure of the content is as follows:
| 0
0 | Score 1
1 | Date 1
2 | Group 1
3 | Score 1
4 | Score 2
5 | Date 2
6 | Group 2
7 | Score 2
8 | Score 3
9 | Date 3
10| Group 3
11| ...
12| ...
13| Score (n-1)
14| Score n
15| Date n
16| Group n
I need to delete all rows with index i if "Score" in row(i) and "Score" in row(i+1). Any suggestion on how to achieve this?
The expected output is as follows:
| 0
0 | Score 1
1 | Date 1
2 | Group 1
3 | Score 2
4 | Date 2
5 | Group 2
6 | Score 3
7 | Date 3
8 | Group 3
9 | ...
10| ...
11| Score n
12| Date n
13| Group n
Given
>>> df
0
0 Score 1
1 Date 1
2 Group 1
3 Score 1
4 Score 2
5 Date 2
6 Group 2
7 Score 2
8 Score 3
9 Date 3
you can use
>>> mask = df.assign(shift=df[0].shift(-1)).apply(lambda s: s.str.contains('Score')).all(1)
>>> df[~mask].reset_index(drop=True)
0
0 Score 1
1 Date 1
2 Group 1
3 Score 2
4 Date 2
5 Group 2
6 Score 3
7 Date 3
Although if I were you, I would fix the format of the data first, as the commenters already pointed out.
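For reference, an equivalent way to build the same mask with plain boolean Series (the is_score name is just illustrative):
# True where a row contains 'Score' AND the next row also contains 'Score'
is_score = df[0].str.contains('Score')
mask = is_score & is_score.shift(-1, fill_value=False)
df[~mask].reset_index(drop=True)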

Pandas: get observations by timestamp

I got a list of dynamic values (e.g. observations). It records all value changes of an entity (e.g. display).
df
+----+---------------------+-----------------+---------+
| | time | display_index | value |
|----+---------------------+-----------------+---------|
| 0 | 2017-11-06 13:00:00 | 1 | val1 |
| 1 | 2017-11-06 14:00:00 | 1 | val2 |
| 2 | 2017-11-06 15:00:00 | 1 | val1 |
| 3 | 2017-11-06 13:30:00 | 2 | val3 |
| 4 | 2017-11-06 14:05:00 | 2 | val4 |
| 5 | 2017-11-06 15:30:00 | 2 | val1 |
+----+---------------------+-----------------+---------+
Now I got a second list of timestamps and I'm interested in the values that each display has shown at that time. Note that the first timestamp (13:00) for display_index 2 is before any value is even known for that one (first record is 13:30).
df_times
+----+---------------------+-----------------+
| | time | display_index |
|----+---------------------+-----------------|
| 0 | 2017-11-06 13:20:00 | 1 |
| 1 | 2017-11-06 13:40:00 | 1 |
| 2 | 2017-11-06 13:00:00 | 2 |
| 3 | 2017-11-06 14:00:00 | 2 |
+----+---------------------+-----------------+
I tried calculating the period between both timestamps and choosing the observation with the minimum value for that period:
df_merged = df_times.merge(df, on='display_index', how='outer', suffixes=['','_measured'])
df_merged['seconds'] = (df_merged.time_measured - df_merged.time).astype('timedelta64[s]')
df_merged['seconds'] = df_merged['seconds'].apply(math.fabs)
df_merged = df_merged.sort_values('seconds').groupby(['time', 'display_index'], as_index=False).first()
print(tabulate(df_merged, headers='keys', tablefmt='psql'))
+----+---------------------+-----------------+---------------------+---------+-----------+
| | time | display_index | time_measured | value | seconds |
|----+---------------------+-----------------+---------------------+---------+-----------|
| 0 | 2017-11-06 13:00:00 | 2 | 2017-11-06 13:30:00 | val3 | 1800 |
| 1 | 2017-11-06 13:20:00 | 1 | 2017-11-06 13:00:00 | val1 | 1200 |
| 2 | 2017-11-06 13:40:00 | 1 | 2017-11-06 14:00:00 | val2 | 1200 |
| 3 | 2017-11-06 14:00:00 | 2 | 2017-11-06 14:05:00 | val4 | 300 |
+----+---------------------+-----------------+---------------------+---------+-----------+
The problem is that the last values for display 1 and 2 are wrong since they are still showing another value at that time. It should be val1 for display 1 and val3 for display 2. What I'm actually looking for is the observation that was last seen before the timestamp. So how to do this?
Here's the code that I used:
import pandas as pd
from tabulate import tabulate
import math
values = [("2017-11-06 13:00", 1, 'val1'),
("2017-11-06 14:00", 1, 'val2'),
("2017-11-06 15:00", 1, 'val1'),
("2017-11-06 13:30", 2, 'val3'),
("2017-11-06 14:05", 2, 'val4'),
("2017-11-06 15:30", 2, 'val1'),
]
labels = ['time', 'display_index', 'value']
df = pd.DataFrame.from_records(values, columns=labels)
df['time'] = pd.to_datetime(df['time'])
print(tabulate(df, headers='keys', tablefmt='psql'))
values = [("2017-11-06 13:20", 1),
("2017-11-06 13:40", 1),
("2017-11-06 13:00", 2),
("2017-11-06 14:00", 2),
]
labels = ['time', 'display_index']
df_times = pd.DataFrame.from_records(values, columns=labels)
df_times['time'] = pd.to_datetime(df_times['time'])
print(tabulate(df_times, headers='keys', tablefmt='psql'))
df_merged = df_times.merge(df, on='display_index', how='outer', suffixes=['','_measured'])
df_merged['seconds'] = (df_merged.time_measured - df_merged.time).astype('timedelta64[s]')
df_merged['seconds'] = df_merged['seconds'].apply(math.fabs)
df_merged = df_merged.sort_values('seconds').groupby(['time', 'display_index'], as_index=False).first()
print(tabulate(df_merged, headers='keys', tablefmt='psql'))
This is a perfect use case for pd.merge_asof
Note: I think you got the second row wrong.
# dataframes need to be sorted
df_times = df_times.sort_values(['time', 'display_index'])
df = df.sort_values(['time', 'display_index'])
pd.merge_asof(
df_times, df.assign(time_measured=df.time),
on='time', by='display_index', direction='forward'
).assign(seconds=lambda d: d.time_measured.sub(d.time).dt.total_seconds())
time display_index value time_measured seconds
0 2017-11-06 13:00:00 2 val3 2017-11-06 13:30:00 1800.0
1 2017-11-06 13:20:00 1 val2 2017-11-06 14:00:00 2400.0
2 2017-11-06 13:40:00 1 val2 2017-11-06 14:00:00 1200.0
3 2017-11-06 14:00:00 2 val4 2017-11-06 14:05:00 300.0
Explanation
pd.merge_asof, for every row in the left argument, attempts to locate a matching row in the right argument.
Since we passed direction='forward', it will look forward from the row in the left argument and find the next value.
I needed a way to capture the time_measured column. Since merge_asof snags the time column, I assigned it as a different column that I can use as intended. The use of df.assign(time_measured=df.time) just duplicates the column for later use.
I use assign again, this time to add a new column, seconds. When using assign, you can pass an array of equal length to the dataframe, a Series whose values will align on the index, or a callable that gets passed the dataframe calling assign. This is what I did: the lambda takes the calling dataframe, finds the difference between those two date columns, and converts the resulting series of timedeltas to seconds.
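As a side note: if what's wanted is the value the display was actually showing at the query time (the last observation at or before each timestamp, as the question phrases it), the same call with direction='backward', which is merge_asof's default, returns that. A small sketch:
# Same sorted frames as above; 'backward' picks the last observation at or
# before each query time (NaN where no earlier observation exists).
pd.merge_asof(
    df_times, df.assign(time_measured=df.time),
    on='time', by='display_index', direction='backward'
)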

Aggregating on 5 minute windows in pyspark

I have the following dataframe df:
User | Datetime | amount | length
A | 2016-01-01 12:01 | 10 | 20
A | 2016-01-01 12:03 | 6 | 10
A | 2016-01-01 12:05 | 1 | 3
A | 2016-01-01 12:06 | 3 | 5
B | 2016-01-01 12:01 | 10 | 20
B | 2016-01-01 12:02 | 8 | 20
And I want to use pyspark efficiently to aggregate over a 5-minute time window and do some calculations, e.g. calculate the average amount & length for every user for every 5-minute time window. The df will look like this:
User | Datetime | amount | length
A | 2016-01-01 12:00 | 8 | 15
B | 2016-01-01 12:00 | 9 | 20
A | 2016-01-01 12:05 | 2 | 4
How can I achieve this in the most efficient way?
In pandas I used:
df.groupby(['cs_username', pd.TimeGrouper('5Min')]).apply(...)
Unfortunately, in pyspark this won't look as cool as in pandas ;-)
You can try casting date to timestamp and using modulo, for example:
import pyspark.sql.functions as F
seconds = 300
seconds_window = F.from_unixtime(F.unix_timestamp('date') - F.unix_timestamp('date') % seconds)
dataframe.withColumn('5_minutes_window', seconds_window)
Then you can simply group by the new column and perform the requested aggregations.
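For example, a sketch of the full grouping step (column names are taken from the question's dataframe, so treat them as assumptions; it also assumes Datetime is already a timestamp or a string in the default yyyy-MM-dd HH:mm:ss format):
import pyspark.sql.functions as F

seconds = 300
# Floor each timestamp to the start of its 5-minute bucket.
seconds_window = F.from_unixtime(
    F.unix_timestamp('Datetime') - F.unix_timestamp('Datetime') % seconds
)

# Average amount and length per user per 5-minute window.
result = (
    df.withColumn('5_minutes_window', seconds_window)
      .groupBy('User', '5_minutes_window')
      .agg(F.avg('amount').alias('avg_amount'),
           F.avg('length').alias('avg_length'))
)
result.show()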
