Pandas resampling with multi-index - python

I have a DataFrame containing information about stores. It looks like the following:
date | store_id | x
2019-01-01| 1 | 5
2019-01-01| 2 | 1
2019-01-05| 1 | 3
The multi-index is [date,store_id]. Note that the dates are not unique.
I want to resample the data at an hourly level, but only for the days present in the date column, i.e. I don't want to fill in every hour in between. Furthermore, I want to fill in the value of x for every hour that is created. So the desired result for the above example would be
date | store_id | x
2019-01-01 00:00:00| 1 | 5
2019-01-01 01:00:00| 1 | 5
2019-01-01 02:00:00| 1 | 5
2019-01-01 23:00:00| 1 | 5
2019-01-01 00:00:00| 2 | 1
2019-01-01 01:00:00| 2 | 1
2019-01-01 02:00:00| 2 | 1
2019-01-01 23:00:00| 2 | 1
2019-01-05 00:00:00| 1 | 3
2019-01-05 01:00:00| 1 | 3
2019-01-05 02:00:00| 1 | 3
2019-01-05 23:00:00| 1 | 3

Define the following "replication" function:
def repl(row):
    # 24 hourly timestamps starting at the row's date
    return pd.DataFrame({'date': pd.date_range(row.date, periods=24, freq='H'),
                         'store_id': row.store_id, 'x': row.x})
It "replicates" the source row (the parameter), returning a sequence of rows
with the given date, one for each consecutive hour.
Then:
- reset the index, so that all columns are "normal" columns,
- apply this function to each row,
- convert the resulting Series of DataFrames into a list of DataFrames,
- concatenate the result.
The code to do it is:
pd.concat(df.reset_index().apply(repl, axis=1).tolist(), ignore_index=True)
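As a self-contained check of this approach, the sketch below rebuilds the three-row example from the question and expands each row into 24 hourly rows. It is a variation rather than the exact one-liner: it iterates with itertuples instead of apply(axis=1), and it passes the frequency as pd.Timedelta(hours=1) so that it does not depend on which spelling of the hourly alias the installed pandas version prefers. The sample frame is reconstructed from the question.

```python
import pandas as pd

# Sample data from the question, indexed by [date, store_id]
df = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-01", "2019-01-01", "2019-01-05"]),
    "store_id": [1, 2, 1],
    "x": [5, 1, 3],
}).set_index(["date", "store_id"])


def repl(row):
    # One source row -> 24 rows, one per hour of that row's day
    return pd.DataFrame({
        "date": pd.date_range(row.date, periods=24, freq=pd.Timedelta(hours=1)),
        "store_id": row.store_id,
        "x": row.x,
    })


# reset_index() turns the index levels into ordinary columns,
# so each named tuple exposes .date, .store_id and .x
frames = [repl(row) for row in df.reset_index().itertuples(index=False)]
result = pd.concat(frames, ignore_index=True)

print(result.head(3))
print(len(result))  # 3 source rows * 24 hours
```

Only the days that actually occur in the source get expanded; the hours between 2019-01-01 and 2019-01-05 are never created.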


add values on a negative pandas df based on condition date

I have a dataframe which contains the credit of a user; each row shows how much credit the user received on a given day.
A user loses 1 credit per day.
I need a way to code this: if a user has accumulated credit in the past, it should fill the following days on which the credit was 0.
An example of refilling past credits:
import pandas as pd
from datetime import datetime
data = pd.DataFrame({'credit_refill':[0,0,0,2,0,0,1],'date':pd.date_range('01-01-2021','01-07-2021')})
data['credit_after_consuming'] = data['credit_refill'] - 1
Looks like:
| | credit_refill | date | credit_after_consuming |
| 0 | 0 | 2021-01-01 00:00:00 | -1 |
| 1 | 0 | 2021-01-02 00:00:00 | -1 |
| 2 | 0 | 2021-01-03 00:00:00 | -1 |
| 3 | 2 | 2021-01-04 00:00:00 | 1 |
| 4 | 0 | 2021-01-05 00:00:00 | -1 |
| 5 | 0 | 2021-01-06 00:00:00 | -1 |
| 6 | 1 | 2021-01-07 00:00:00 | 0 |
The logic should be as follows: for the first three days the user has a credit of -1. On the 4th of January the user receives 2 days of credit; one is used that day and the other one is consumed on the 5th.
In total there would be 3 days (the first one without credits).
If a user picks up 7 credits at the start of the week, the whole week is covered.
Another case would be
| | credit_refill | date | credit_after_consuming |
| 0 | 2 | 2021-01-01 00:00:00 | 1 |
| 1 | 0 | 2021-01-02 00:00:00 | -1 |
| 2 | 0 | 2021-01-03 00:00:00 | -1 |
| 3 | 0 | 2021-01-04 00:00:00 | -1 |
| 4 | 1 | 2021-01-05 00:00:00 | 0 |
| 5 | 0 | 2021-01-06 00:00:00 | -1 |
In this case the participant would run out of credits on the 3rd and 4th day, because they have 2 credits on the 1st: one is consumed the same day and the other on the 2nd.
Then on the 5th they would refill and consume the credit the same day, and run out of credits again on the 6th day.
I feel it's some variation of cumsum, but I can't manage to get the expected results.
I could sum over all days and fill the 0's with the accumulated total, but I have to take into account that I can only refill with credits accumulated in the past.
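One way to read the rules above is as a running balance that only ever carries forward: each day's refill is added first, then one credit is consumed if any is available. The sketch below follows that reading; the helper name consume_credits and the covered column are made up for illustration, and the plain loop is an assumption about the intended logic rather than a cumsum-based solution.

```python
import pandas as pd


def consume_credits(refills):
    """Return one boolean per day: True if that day is covered by a credit.

    Assumption: credits bought on a day can be used that day or later,
    never earlier, and exactly one credit is consumed per covered day.
    """
    balance = 0
    covered = []
    for refill in refills:
        balance += refill          # today's refill is available today
        if balance > 0:
            balance -= 1           # consume one credit for today
            covered.append(True)
        else:
            covered.append(False)  # nothing left, day is not covered
    return covered


data = pd.DataFrame({
    "credit_refill": [0, 0, 0, 2, 0, 0, 1],
    "date": pd.date_range("2021-01-01", "2021-01-07"),
})
data["covered"] = consume_credits(data["credit_refill"])
print(data)

# Second case from the question
second = consume_credits([2, 0, 0, 0, 1, 0])
print(second)
```

For the second case this marks days 3, 4 and 6 as uncovered, which matches the description in the question.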

comparing values of 2 columns from same pandas dataframe & returning value of 3rd column based on comparison

I'm trying to compare values between 2 columns in the same pandas dataframe, and wherever a match is found I want to return the value from that row, but from a 3rd column.
Basically if the following is dataframe df
| date | date_new | category | value |
| --------- | ---------- | -------- | ------ |
|2016-05-11 | 2018-05-15 | day | 1000.0 |
|2020-03-28 | 2018-05-11 | night | 2220.1 |
|2018-05-15 | 2020-03-28 | day | 142.8 |
|2018-05-11 | 2019-01-29 | night | 1832.9 |
I want to add a new column, say value_new, filled with values from value: for every date in date_new, look for the same date in date, and then check whether both rows have the same category.
[steps of transformation]
- 1. for each value in date_new look for a match in date
- 2. if match found, compare if values in category column also match
- 3. if both the matches in above steps fulfilled, pick the corresponding value from value column from the row where both the matches fulfilled, otherwise leave blank.
So, I would finally want the final dataframe to look something like this.
| date | date_new | category | value | value_new |
| --------- | ---------- | -------- | ------ | --------- |
|2016-05-11 | 2018-05-15 | day | 1000.0 | 142.8 |
|2020-03-28 | 2018-05-11 | night | 2220.1 | 1832.9 |
|2018-05-15 | 2020-03-28 | day | 142.8 | None |
|2018-05-11 | 2016-05-11 | day | 1832.9 | 1000.0 |
Use DataFrame.merge with a left join and assign the result to a new column:
df['value_new'] = df.merge(df,
                           left_on=['date_new','category'],
                           right_on=['date','category'], how='left')['value_y']
print (df)
date date_new category value value_new
0 2016-05-11 2018-05-15 day 1000.0 142.8
1 2020-03-28 2018-05-11 night 2220.1 NaN
2 2018-05-15 2020-03-28 day 142.8 NaN
3 2018-05-11 2016-05-11 day 1832.9 1000.0
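For reference, here is a runnable sketch of the self-merge, using the four rows from the printed output above as sample data. The intermediate name merged is only for illustration. It relies on the (date, category) pairs being unique, so that the left join keeps one output row per input row.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2016-05-11", "2020-03-28", "2018-05-15", "2018-05-11"]),
    "date_new": pd.to_datetime(["2018-05-15", "2018-05-11", "2020-03-28", "2016-05-11"]),
    "category": ["day", "night", "day", "day"],
    "value": [1000.0, 2220.1, 142.8, 1832.9],
})

# Left side: look up (date_new, category); right side: match on (date, category).
# Overlapping non-key columns get the default suffixes _x (left) and _y (right).
merged = df.merge(
    df,
    left_on=["date_new", "category"],
    right_on=["date", "category"],
    how="left",
)

# Row order follows the left frame, so the values can be assigned positionally.
df["value_new"] = merged["value_y"].to_numpy()
print(df)
```

Rows without a matching (date, category) pair end up with NaN in value_new.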

How to delete specific rows in pandas dataframe if a condition is met

I have a pandas dataframe with a few thousand rows and only one column. The structure of the content is as follows:
| 0
0 | Score 1
1 | Date 1
2 | Group 1
3 | Score 1
4 | Score 2
5 | Date 2
6 | Group 2
7 | Score 2
8 | Score 3
9 | Date 3
10| Group 3
11| ...
12| ...
13| Score (n-1)
14| Score n
15| Date n
16| Group n
I need to delete all rows with index i if "Score" in row(i) and "Score" in row(i+1). Any suggestion on how to achieve this?
The expected output is as follows:
| 0
0 | Score 1
1 | Date 1
2 | Group 1
3 | Score 2
4 | Date 2
5 | Group 2
6 | Score 3
7 | Date 3
8 | Group 3
9 | ...
10| ...
11| Score n
12| Date n
13| Group n
>>> df
0 Score 1
1 Date 1
2 Group 1
3 Score 1
4 Score 2
5 Date 2
6 Group 2
7 Score 2
8 Score 3
9 Date 3
You can use
>>> mask = df.assign(shift=df[0].shift(-1)).apply(lambda s: s.str.contains('Score', na=False)).all(axis=1)
>>> df[~mask].reset_index(drop=True)
0 Score 1
1 Date 1
2 Group 1
3 Score 2
4 Date 2
5 Group 2
6 Score 3
7 Date 3
Although if I were you, I would fix the format of the data first, as the commenters already pointed out.
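The same idea can be written as two boolean Series, which may be easier to follow: one marks the rows containing "Score", the other marks the rows whose successor contains "Score". This is a minimal sketch on the ten sample rows; the names is_score and next_is_score are illustrative.

```python
import pandas as pd

df = pd.DataFrame({0: [
    "Score 1", "Date 1", "Group 1", "Score 1",
    "Score 2", "Date 2", "Group 2", "Score 2",
    "Score 3", "Date 3",
]})

# True where the row itself contains "Score"
is_score = df[0].str.contains("Score")
# True where the *next* row contains "Score"; the last row has no successor
next_is_score = is_score.shift(-1, fill_value=False)

# Drop row i when both row i and row i+1 are "Score" rows
result = df[~(is_score & next_is_score)].reset_index(drop=True)
print(result)
```

Using fill_value=False keeps the shifted Series boolean, so a trailing "Score" row is never dropped by accident.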

Pandas: get observations by timestamp

I have a list of dynamic values (e.g. observations). It records all value changes of an entity (e.g. a display).
| | time | display_index | value |
| 0 | 2017-11-06 13:00:00 | 1 | val1 |
| 1 | 2017-11-06 14:00:00 | 1 | val2 |
| 2 | 2017-11-06 15:00:00 | 1 | val1 |
| 3 | 2017-11-06 13:30:00 | 2 | val3 |
| 4 | 2017-11-06 14:05:00 | 2 | val4 |
| 5 | 2017-11-06 15:30:00 | 2 | val1 |
I also have a second list of timestamps, and I'm interested in the value that each display was showing at that time. Note that the first timestamp (13:00) for display_index 2 is before any value is even known for that display (its first record is at 13:30).
| | time | display_index |
| 0 | 2017-11-06 13:20:00 | 1 |
| 1 | 2017-11-06 13:40:00 | 1 |
| 2 | 2017-11-06 13:00:00 | 2 |
| 3 | 2017-11-06 14:00:00 | 2 |
I tried calculating the period between both timestamps and chose the observation with the minimum value for that period:
df_merged = df_times.merge(df, on='display_index', how='outer', suffixes=['','_measured'])
df_merged['seconds'] = (df_merged.time_measured - df_merged.time).astype('timedelta64[s]')
df_merged['seconds'] = df_merged['seconds'].apply(math.fabs)
df_merged = df_merged.sort_values('seconds').groupby(['time', 'display_index'], as_index=False).first()
print(tabulate(df_merged, headers='keys', tablefmt='psql'))
| | time | display_index | time_measured | value | seconds |
| 0 | 2017-11-06 13:00:00 | 2 | 2017-11-06 13:30:00 | val3 | 1800 |
| 1 | 2017-11-06 13:20:00 | 1 | 2017-11-06 13:00:00 | val1 | 1200 |
| 2 | 2017-11-06 13:40:00 | 1 | 2017-11-06 14:00:00 | val2 | 1200 |
| 3 | 2017-11-06 14:00:00 | 2 | 2017-11-06 14:05:00 | val4 | 300 |
The problem is that the last values for display 1 and 2 are wrong since they are still showing another value at that time. It should be val1 for display 1 and val3 for display 2. What I'm actually looking for is the observation that was last seen before the timestamp. So how to do this?
Here's the code that I used:
import pandas as pd
from tabulate import tabulate
import math
values = [("2017-11-06 13:00", 1, 'val1'),
("2017-11-06 14:00", 1, 'val2'),
("2017-11-06 15:00", 1, 'val1'),
("2017-11-06 13:30", 2, 'val3'),
("2017-11-06 14:05", 2, 'val4'),
("2017-11-06 15:30", 2, 'val1')]
labels = ['time', 'display_index', 'value']
df = pd.DataFrame.from_records(values, columns=labels)
df['time'] = pd.to_datetime(df['time'])
print(tabulate(df, headers='keys', tablefmt='psql'))
values = [("2017-11-06 13:20", 1),
("2017-11-06 13:40", 1),
("2017-11-06 13:00", 2),
("2017-11-06 14:00", 2)]
labels = ['time', 'display_index']
df_times = pd.DataFrame.from_records(values, columns=labels)
df_times['time'] = pd.to_datetime(df_times['time'])
print(tabulate(df_times, headers='keys', tablefmt='psql'))
df_merged = df_times.merge(df, on='display_index', how='outer', suffixes=['','_measured'])
df_merged['seconds'] = (df_merged.time_measured - df_merged.time).astype('timedelta64[s]')
df_merged['seconds'] = df_merged['seconds'].apply(math.fabs)
df_merged = df_merged.sort_values('seconds').groupby(['time', 'display_index'], as_index=False).first()
print(tabulate(df_merged, headers='keys', tablefmt='psql'))
This is a perfect use case for pd.merge_asof
Note: I think you got the second row wrong.
# dataframes need to be sorted
df_times = df_times.sort_values(['time', 'display_index'])
df = df.sort_values(['time', 'display_index'])
pd.merge_asof(
    df_times, df.assign(time_measured=df.time),
    on='time', by='display_index', direction='forward'
).assign(seconds=lambda d: d.time_measured.sub(d.time).dt.total_seconds())
time display_index value time_measured seconds
0 2017-11-06 13:00:00 2 val3 2017-11-06 13:30:00 1800.0
1 2017-11-06 13:20:00 1 val2 2017-11-06 14:00:00 2400.0
2 2017-11-06 13:40:00 1 val2 2017-11-06 14:00:00 1200.0
3 2017-11-06 14:00:00 2 val4 2017-11-06 14:05:00 300.0
For every row in the left argument, pd.merge_asof attempts to locate a matching row in the right argument.
Since we passed direction='forward', it looks forward from the row in the left argument and finds the next value.
I needed a way to capture the time_measured column. Since merge_asof consumes the time column of the right frame, I copied it to a separate column that I can use afterwards; df.assign(time_measured=df.time) just duplicates the column for later use.
I use assign again, this time to add a new column, seconds. assign accepts an array of the same length as the dataframe, a Series whose values are aligned on the index, or a callable that receives the dataframe on which assign is called. I used the last option: the lambda takes the calling dataframe, computes the difference between the two date columns, and converts the resulting series of timedeltas to seconds.
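The question asks for the observation that was last seen before each timestamp. Under that reading, a minimal sketch would use direction='backward' (the default of pd.merge_asof) instead of 'forward'. The frame names observations and queries are illustrative; the data is copied from the question.

```python
import pandas as pd

observations = pd.DataFrame.from_records(
    [("2017-11-06 13:00", 1, "val1"),
     ("2017-11-06 14:00", 1, "val2"),
     ("2017-11-06 15:00", 1, "val1"),
     ("2017-11-06 13:30", 2, "val3"),
     ("2017-11-06 14:05", 2, "val4"),
     ("2017-11-06 15:30", 2, "val1")],
    columns=["time", "display_index", "value"],
)
observations["time"] = pd.to_datetime(observations["time"])

queries = pd.DataFrame.from_records(
    [("2017-11-06 13:20", 1),
     ("2017-11-06 13:40", 1),
     ("2017-11-06 13:00", 2),
     ("2017-11-06 14:00", 2)],
    columns=["time", "display_index"],
)
queries["time"] = pd.to_datetime(queries["time"])

# Both frames must be sorted by the "on" key.
# direction="backward" picks, per display, the latest observation
# at or before each query time.
result = pd.merge_asof(
    queries.sort_values("time"),
    observations.assign(time_measured=observations["time"]).sort_values("time"),
    on="time",
    by="display_index",
    direction="backward",
)
print(result)
```

The query at 13:00 for display 2 gets no match, because no observation for that display exists yet; its value is NaN and its time_measured is NaT.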

Aggregating on 5 minute windows in pyspark

I have the following dataframe df:
User | Datetime | amount | length
A | 2016-01-01 12:01 | 10 | 20
A | 2016-01-01 12:03 | 6 | 10
A | 2016-01-01 12:05 | 1 | 3
A | 2016-01-01 12:06 | 3 | 5
B | 2016-01-01 12:01 | 10 | 20
B | 2016-01-01 12:02 | 8 | 20
I want to use pyspark to efficiently aggregate over a 5 minute time window and do some calculations, for example calculate the average amount and length for every user for every 5 minute time window. The df will then look like this:
User | Datetime | amount | length
A | 2016-01-01 12:00 | 8 | 15
B | 2016-01-01 12:00 | 9 | 20
A | 2016-01-01 12:05 | 2 | 4
How can I achieve this in the most efficient way?
In pandas I used:
df.groupby(['cs_username', pd.TimeGrouper('5Min')]).apply(...)
Unfortunately, in pyspark this won't look so cool like in pandas ;-)
You can try casting date to timestamp and using modulo, for example:
import pyspark.sql.functions as F
seconds = 300
seconds_window = F.from_unixtime(F.unix_timestamp('date') - F.unix_timestamp('date') % seconds)
dataframe.withColumn('5_minutes_window', seconds_window)
Then you can simply group by the new column and perform the requested aggregations.
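The modulo arithmetic itself can be checked on a small pandas frame before porting it to Spark. The sketch below is a pandas analogue, not pyspark code: it converts the timestamps to epoch seconds, subtracts the remainder modulo 300 to get the start of each 5 minute window, and averages per user and window. The column name window_start is made up for illustration; the data is the sample from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "User": ["A", "A", "A", "A", "B", "B"],
    "Datetime": pd.to_datetime([
        "2016-01-01 12:01", "2016-01-01 12:03", "2016-01-01 12:05",
        "2016-01-01 12:06", "2016-01-01 12:01", "2016-01-01 12:02",
    ]),
    "amount": [10, 6, 1, 3, 10, 8],
    "length": [20, 10, 3, 5, 20, 20],
})

seconds = 300  # window size: 5 minutes

# Seconds since the Unix epoch, as integers
epoch = (df["Datetime"] - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

# Subtracting the remainder floors each timestamp to its window start
df["window_start"] = pd.to_datetime(epoch - epoch % seconds, unit="s")

result = (
    df.groupby(["User", "window_start"], as_index=False)[["amount", "length"]]
      .mean()
)
print(result)
```

On this sample the window starts are 12:00 and 12:05, and the result has one row per user and window with the mean amount and length.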
