Updating Pandas DataFrame Specific Range - python

I have a Pandas DataFrame whose index is a DatetimeIndex (hourly stepped), whose columns are room names, and where each cell is a set().
room_a room_b ... room_az
2017-01-01 12:00 {} {} ... {}
2017-01-01 13:00 {} {} ... {}
2017-01-01 14:00 {} {} ... {}
...
2019-12-12 23:00 {} {} ... {}
I have to add each person to the rooms they occupied, for the hours during which they occupied them. The data comes from another DataFrame that looks like this:
index person_id room beg_effective_dt_tm end_effective_dt_tm
1 55 room_a 2017-01-01 15:45:33 2017-01-15 10:33:54
2 55 room_a 2017-01-25 09:15:55 2017-02-15 15:33:42
3 10 room_a 2017-01-05 12:10:33 2017-02-10 09:33:25
4 10 room_b 2017-02-10 09:34:15 2017-03-25 10:14:15
...
15000 55 room_z 2019-05-10 12:15:45 2019-05-10 15:33:25
15001 60 room_x 2019-06-02 15:10:33 2019-08-10 10:33:42
...
n
So I tried
for _, row in enumerate(df_origin.itertuples(), 1):
    interval_start = hour_rounder(row.beg_effective_dt_tm)
    interval_finish = hour_rounder(row.end_effective_dt_tm)
    df_sets.update(
        df_sets.loc[
            interval_start:interval_finish, row.room
        ].apply(lambda s: s.add(row.person_id))
    )
But for each row, this code updates the entire column for that room, ignoring the time span.
Without apply, the series is selected correctly, but apply ends up setting the whole column.
What am I missing here?
How can I implement this idea?
Thanks in advance.
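For reference, here is a minimal sketch (not a tested answer) of one way to restrict the update to the time slice. It assumes hour_rounder, df_origin and df_sets exist as above, that the DatetimeIndex is sorted, and that every cell of df_sets holds its own independent set object (e.g. built with applymap(lambda _: set()) rather than one shared set):
for row in df_origin.itertuples():
    interval_start = hour_rounder(row.beg_effective_dt_tm)
    interval_finish = hour_rounder(row.end_effective_dt_tm)
    # set.add mutates in place and returns None, so there is nothing useful
    # to feed back into update(); mutating the selected cells directly is enough.
    for cell in df_sets.loc[interval_start:interval_finish, row.room]:
        cell.add(row.person_id)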

Related

Python dataframe find closest date for each ID

I have a dataframe like this:
data = {
    'SalePrice': [10, 10, 10, 20, 20, 3, 3, 1, 4, 8, 8],
    'HandoverDateA': ['2022-04-30', '2022-04-30', '2022-04-30', '2022-04-30', '2022-04-30',
                      '2022-04-30', '2022-04-30', '2022-04-30', '2022-04-30', '2022-03-30', '2022-03-30'],
    'ID': ['Tom', 'Tom', 'Tom', 'Joseph', 'Joseph', 'Ben', 'Ben', 'Eden', 'Tim', 'Adam', 'Adam'],
    'Tranche': ['Red', 'Red', 'Red', 'Red', 'Red', 'Blue', 'Blue', 'Red', 'Red', 'Red', 'Red'],
    'Totals': [100, 100, 100, 50, 50, 90, 90, 70, 60, 70, 70],
    'Sent': ['2022-01-18', '2022-02-19', '2022-03-14', '2022-03-14', '2022-04-22',
             '2022-03-03', '2022-02-07', '2022-01-04', '2022-01-10', '2022-01-15', '2022-03-12'],
    'Amount': [20, 10, 14, 34, 15, 60, 25, 10, 10, 40, 20],
    'Opened': ['2021-12-29', '2021-12-29', '2021-12-29', '2022-12-29', '2022-12-29',
               '2021-12-19', '2021-12-19', '2021-12-29', '2021-12-29', '2021-12-29', '2021-12-29'],
}
I need to find the Sent date which is closest to the HandoverDate. I've seen plenty of examples that work when you give a single date to search for, but here the date I want to be closest to can change for every ID. I have tried to adapt the following:
def nearest(items, pivot):
    return min([i for i in items if i <= pivot], key=lambda x: abs(x - pivot))
I also tried to write a loop where I make a dataframe for each ID, use max on the date column, and then stick them together, but it's incredibly slow!
Thanks for any suggestions :)
IIUC, you can use:
data[['HandoverDateA', 'Sent']] = data[['HandoverDateA', 'Sent']].apply(pd.to_datetime)

out = data.loc[data['HandoverDateA']
               .sub(data['Sent']).abs()
               .groupby(data['ID']).idxmin()]
Output:
SalePrice HandoverDateA ID Tranche Totals Sent Amount Opened
10 8 2022-03-30 Adam Red 70 2022-03-12 20 2021-12-29
5 3 2022-04-30 Ben Blue 90 2022-03-03 60 2021-12-19
7 1 2022-04-30 Eden Red 70 2022-01-04 10 2021-12-29
4 20 2022-04-30 Joseph Red 50 2022-04-22 15 2022-12-29
8 4 2022-04-30 Tim Red 60 2022-01-10 10 2021-12-29
2 10 2022-04-30 Tom Red 100 2022-03-14 14 2021-12-29
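For completeness, a self-contained sketch of the same approach; it assumes the dict from the question is first wrapped in a DataFrame:
import pandas as pd

df = pd.DataFrame(data)  # 'data' is the dict from the question
df[['HandoverDateA', 'Sent']] = df[['HandoverDateA', 'Sent']].apply(pd.to_datetime)

# For each ID, the absolute difference between HandoverDateA and Sent is
# grouped by ID; idxmin() returns the index of the smallest difference per
# group, and .loc keeps exactly those rows.
out = df.loc[df['HandoverDateA'].sub(df['Sent']).abs().groupby(df['ID']).idxmin()]
print(out)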
Considering that the goal is to find the sent date which is closest to the HandoverDate, one approach would be as follows.
First of all, create the dataframe df from data with pandas.DataFrame
import pandas as pd
df = pd.DataFrame(data)
Then, make sure that the columns HandoverDateA and Sent are of datetime using pandas.to_datetime
df['HandoverDateA'] = pd.to_datetime(df['HandoverDateA'])
df['Sent'] = pd.to_datetime(df['Sent'])
Then, in order to make it more convenient, create a column, diff, to store the absolute value of the difference between the columns HandoverDateA and Sent
df['diff'] = (df['HandoverDateA'] - df['Sent']).dt.days.abs()
With that column, one can simply sort by that column as follows
df = df.sort_values(by=['diff'])
[Out]:
SalePrice HandoverDateA ID ... Amount Opened diff
4 20 2022-04-30 Joseph ... 15 2022-12-29 8
10 8 2022-03-30 Adam ... 20 2021-12-29 18
2 10 2022-04-30 Tom ... 14 2021-12-29 47
5 3 2022-04-30 Ben ... 60 2021-12-19 58
8 4 2022-04-30 Tim ... 10 2021-12-29 110
7 1 2022-04-30 Eden ... 10 2021-12-29 116
and the first row is the one where Sent is closest to HandOverDateA.
With the column diff, one option to get the one where diff is minimum is with pandas.DataFrame.query as follows
df = df.query('diff == diff.min()')
[Out]:
SalePrice HandoverDateA ID Tranche ... Sent Amount Opened diff
4 20 2022-04-30 Joseph Red ... 2022-04-22 15 2022-12-29 8
Notes:
For more information on sorting dataframes by columns, read my answer here.

Pandas returning 0 string as 0%

I'm doing an evaluation of how many stores report back and in how much time (same day (0), 1 day (1), etc.), but when I calculate the percentage of the total, all same-day stores return 0% of the total.
I tried turning the column into object, float and int, but with the same result.
DF['T_days'] = (DF['day included in the server'] - DF['day of sale']).dt.days
creates my T_days column and fills it with the number of days based on the two datetime columns. This works fine. And:
DF['Percentage'] = (DF['T_days'] /DF['T_days'].sum()) * 100
returns this table. I know what I should do, but not how to do it.
COD_store  date in server  Date bought  T_days  Percentage
1          2021-12-03      2021-12-02   1       0.013746
1          2021-12-03      2021-12-02   1       0.013746
922        2022-01-27      2022-01-10   17      0.233677
922        2022-01-27      2022-01-10   17      0.233677
...        ...             ...          ...     ...
65         2022-01-12      2022-01-12   0       0.0
new DF after groupby:
T_DIAS
0 0.000000
1 1.374570
2 0.192440
3 15.793814
7 0.384880
17 82.254296
Name: Percentage, dtype: float64
I know I should divide the resulting day counts by the total number of rows in DF and then group them by days, but my search on how to do this turned up nothing. BTW: I already have a separate DF for those days and percentages.
Expected table:
T_days  Percentage
0       50
2       30
3       10
4       3
5       7
DF['T_days'].value_counts(normalize=True) * 100
worked. Afterwards I turned the resulting Series into a DataFrame to make it easier to use.
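As a hedged sketch of that last step (the column names 'T_days' and 'Percentage' are taken from the expected table above, and DF is assumed to exist as in the question):
# Share of rows per value of T_days, expressed as a percentage.
percent = DF['T_days'].value_counts(normalize=True) * 100

# Turn the Series into a two-column DataFrame and sort by day count.
result = percent.rename_axis('T_days').reset_index(name='Percentage').sort_values('T_days')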

Calculate difference between dates for sequential pandas rows based on conditional column value

I need to find the number of days between a request date and its most recent offer date for each apartment number. My example dataframe looks like the first 3 columns below and I'm trying to figure out how to calculate the 'days_since_offer' column. The apartment and or_date columns are already sorted.
apartment offer_req_type or_date days_since_offer
A request 12/4/2019 n/a
A request 12/30/2019 n/a
A offer 3/4/2020 0
A request 4/2/2020 29
A request 6/4/2020 92
A request 8/4/2020 153
A offer 12/4/2020 0
A request 1/1/2021 28
B offer 1/1/2019 0
B request 8/1/2019 212
B offer 10/1/2019 0
B request 1/1/2020 92
B request 9/1/2020 244
B offer 1/1/2021 0
B request 1/25/2021 24
I tried to create a new function which sort of gives me what I want if I pass it the dates for a single apartment. When I use the apply function it gives me an error though: "SpecificationError: Function names must be unique if there is no new column names assigned".
def func(attr, date_ser):
    offer_dt = date(1900, 1, 1)
    lapse_days = []
    for row in range(len(attr)):
        if attr[row] == 'offer':
            offer_dt = date_ser[row]
            lapse_days.append(-1)
        else:
            lapse_days.append(date_ser[row] - offer_dt)
    print(lapse_days)
    return lapse_days

df['days_since_offer'] = df.apply(func(df['offer_req_type'], df['or_date']))
I also tried to use groupby + diff functions like this and this but it's not the answer that I need:
df.groupby('offer_req_type').or_date.diff().dt.days
I also looked into using the shift method, but I'm not necessarily looking at sequential rows every time.
Any pointers on why my function is failing or if there is a better way to get the date differences that I need using a groupby method would be helpful!
I have played around and I am certainly not claiming this to be the best way. I used df.apply() (edit: see below for alternative without df.apply()).
import numpy as np
import pandas as pd
# SNIP: removed the df creation part for brevity.
df["or_date"] = pd.to_datetime(df["or_date"])
df.drop("days_since_offer", inplace=True, axis="columns")
def get_last_offer(row: pd.Series, df: pd.DataFrame):
    if row["offer_req_type"] == "offer":
        return
    temp_df = df[(df.apartment == row['apartment']) & (df.offer_req_type == "offer") & (df.or_date < row["or_date"])]
    if temp_df.empty:
        return
    else:
        x = row["or_date"]
        y = temp_df.iloc[-1:, -1:]["or_date"].values[0]
        return x - y

df["days_since_offer"] = df.apply(lambda row: get_last_offer(row, df), axis=1)
print(df)
This returns the following df:
0 A request 2019-12-04 NaT
1 A request 2019-12-30 NaT
2 A offer 2020-03-04 NaT
3 A request 2020-04-02 29 days
4 A request 2020-06-04 92 days
5 A request 2020-08-04 153 days
6 A offer 2020-12-04 NaT
7 A request 2021-01-01 28 days
8 B offer 2019-01-01 NaT
9 B request 2019-08-01 212 days
10 B offer 2019-10-01 NaT
11 B request 2020-01-01 92 days
12 B request 2020-09-01 336 days
13 B offer 2021-01-01 NaT
14 B request 2021-01-25 24 days
EDIT
I was wondering whether I could find a way without df.apply(). I ended up with the following lines (replace everything from the line def get_last_offer() onwards in the previous code bit):
df["offer_dates"] = np.where(df['offer_req_type'] == 'offer', df['or_date'], pd.NaT)
# OLD: df["offer_dates"].ffill(inplace=True)
df["offer_dates"] = df.groupby("apartment")["offer_dates"].ffill()
df["diff"] = pd.to_datetime(df["or_date"]) - pd.to_datetime(df["offer_dates"])
df.drop("offer_dates", inplace=True, axis="columns")
This creates a helper column (df['offer_dates']) which is filled for every row whose offer_req_type is 'offer'. It is then forward-filled within each apartment, meaning that every NaT value is replaced with the previous valid value. Then we calculate the df['diff'] column, with the exact same result. I like this version better because it is cleaner and takes 4 lines rather than 12 lines of code :)
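If plain integer day counts are wanted in the question's days_since_offer column rather than Timedeltas, a small follow-up under the same assumptions would be:
# The subtraction yields Timedelta values ("29 days"); .dt.days extracts the
# number of days, leaving NaN for rows that have no previous offer.
df["days_since_offer"] = df["diff"].dt.days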

How to set a default value for a specific column in which other columns interact with Pandas?

I have a DataFrame that creates a CSV with this function:
def create_data(date, place, value):
    can_spend = 190
    try:
        file = open(filename, 'r+')
        data_set = pd.read_csv(filename, index_col=False)
        frame = pd.DataFrame(data_set, columns=['Left', 'Date', 'Place', 'Spent'])
        frame = frame.append({"Left": can_spend, "Date": date, "Place": place, "Spent": value}, ignore_index=True)
        frame['Date'] = pd.to_datetime(frame['Date'])
        frame['Week'] = frame['Date'].dt.weekofyear
        # write the data-set to the csv
        frame.to_csv(filename, index=None, header=True)
    except IOError:
        file = open(filename, "w")
        frame = pd.DataFrame(columns=['Left', 'Date', 'Place', 'Spent'])
        frame.to_csv(filename, index=None, header=True)
This DataFrame is going to store a small portion of my personal budget. I have a set spending limit that I want each entry in the frame to subtract from, based on the week (the spending limit resets each week).
Here is how I add data to the DataFrame:
def create_new_entry(self):
    get_date = input("Date: ")
    get_place = input("Place: ")
    get_amount = float(input("Amount: "))
    create_data(get_date, get_place, get_amount)
Here is how I would like the DataFrame to look:
"Left" column will default to the value of 190 each week
Left Date Place Spent Week
0 146.69 2019-01-02 Walmart 43.31 1
1 92.46 2019-01-05 Kroger 54.23 1
2 72.46 2019-01-06 Kroger 20.00 1
# Here is where "Left" will reset on new week
3 170.00 2019-01-08 Kroger 20.00 2
How can I accomplish this?
This can be done with groupby and cumsum with a single line of code. Do not add the 'Left' column while reading and creating the dataframe (I mean, you can, but it will be overwritten anyway).
Suppose then that, after reading and first manipulation to create the useful 'Week' column, your df is:
Date Place Spent Week
0 2019-01-02 Walmart 43.31 1
1 2019-01-05 Kroger 54.23 1
2 2019-01-06 Kroger 20.00 1
3 2019-01-08 Walmart 20.00 2
4 2019-01-09 Walmart 30.00 2
5 2019-01-10 Kroger 10.00 2
Then you can create the 'Left' column like:
can_spend = 190
df['Left'] = df.groupby('Week').apply(lambda x : can_spend - x['Spent'].cumsum()).reset_index(drop=True)
And df will become:
Date Place Spent Week Left
0 2019-01-02 Walmart 43.31 1 146.69
1 2019-01-05 Kroger 54.23 1 92.46
2 2019-01-06 Kroger 20.00 1 72.46
3 2019-01-08 Walmart 20.00 2 170.00
4 2019-01-09 Walmart 30.00 2 140.00
5 2019-01-10 Kroger 10.00 2 130.00
A brief explanation: groupby creates subsets of the dataframe, grouping rows with the same value in the 'Week' column. The apply method does the vectorized calculation to get the remaining amount for each subset (week). reset_index(drop=True) is needed because otherwise the index built by groupby will not match the index of df, raising an error.
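As a side note, here is a sketch of a variant (not part of the answer above) of the same groupby/cumsum idea that needs no reset_index, because a grouped cumsum stays aligned with df's index:
can_spend = 190
# Cumulative spending within each week, subtracted from the weekly budget.
df['Left'] = can_spend - df.groupby('Week')['Spent'].cumsum()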

Column operations on pandas groupby object

I have a dataframe df that looks like this:
id Category Time
1 176 12 00:00:00
2 4956 2 00:00:00
3 583 4 00:00:04
4 9395 2 00:00:24
5 176 12 00:03:23
which is basically a set of ids and the category of item they used at a particular Time. I use df.groupby('id') and then I want to see whether they used the same category or a different one and assign True or False respectively (or NaN if that was the first item for that particular id). I also filtered the data to remove all the ids with only one Time.
For example one of the groups may look like
id Category Time
1 176 12 00:00:00
2 176 12 00:03:23
3 176 2 00:04:34
4 176 2 00:04:54
5 176 2 00:05:23
and I want to perform an operation to get
id Category Time Transition
1 176 12 00:00:00 NaN
2 176 12 00:03:23 False
3 176 2 00:04:34 True
4 176 2 00:04:54 False
5 176 2 00:05:23 False
I thought about doing an apply of some sort on the Category column after the groupby, but I am having trouble figuring out the right function.
You don't need a groupby here; you just need a sort and a shift.
import numpy as np

df.sort_values(['id', 'Time'], inplace=True)
df['Transition'] = df.Category != df.Category.shift(1)
df.loc[df.id != df.id.shift(1), 'Transition'] = np.nan
I haven't tested this, but it should do the trick.
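Untested here as well, but a minimal self-contained sketch of that idea on the single-id example from the question (column names as above) might look like:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [176, 176, 176, 176, 176],
    'Category': [12, 12, 2, 2, 2],
    'Time': ['00:00:00', '00:03:23', '00:04:34', '00:04:54', '00:05:23'],
})

df = df.sort_values(['id', 'Time'])
# True whenever the category differs from the previous row...
df['Transition'] = df['Category'] != df['Category'].shift(1)
# ...and NaN on the first row of each id, where there is nothing to compare to.
df.loc[df['id'] != df['id'].shift(1), 'Transition'] = np.nan
print(df)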
