Comparing values of 2 columns from the same pandas dataframe & returning the value of a 3rd column based on the comparison - python

I'm trying to compare values between 2 columns in the same pandas dataframe, and wherever a match is found I want to return the value of that row from a 3rd column.
Basically, if the following is dataframe df:
| date | date_new | category | value |
| --------- | ---------- | -------- | ------ |
|2016-05-11 | 2018-05-15 | day | 1000.0 |
|2020-03-28 | 2018-05-11 | night | 2220.1 |
|2018-05-15 | 2020-03-28 | day | 142.8 |
|2018-05-11 | 2019-01-29 | night | 1832.9 |
I want to add a new column, say value_new, which is obtained by looking up every date value in date_new among the date values in date, and then checking whether both rows have the same category value.
Steps of the transformation:
1. For each value in date_new, look for a match in date.
2. If a match is found, check whether the values in the category column also match.
3. If both conditions above are fulfilled, pick the corresponding value from the value column of the row where they are fulfilled; otherwise leave it blank.
So I would finally want the dataframe to look something like this:
| date | date_new | category | value | value_new |
| --------- | ---------- | -------- | ------ | --------- |
|2016-05-11 | 2018-05-15 | day | 1000.0 | 142.8 |
|2020-03-28 | 2018-05-11 | night | 2220.1 | 1832.9 |
|2018-05-15 | 2020-03-28 | day | 142.8 | None |
|2018-05-11 | 2016-05-11 | day | 1832.9 | 1000.0 |

Use DataFrame.merge with a left join and assign the new column:
df['value_new'] = df.merge(df,
                           left_on=['date_new', 'category'],
                           right_on=['date', 'category'],
                           how='left')['value_y']
print(df)
date date_new category value value_new
0 2016-05-11 2018-05-15 day 1000.0 142.8
1 2020-03-28 2018-05-11 night 2220.1 NaN
2 2018-05-15 2020-03-28 day 142.8 NaN
3 2018-05-11 2016-05-11 day 1832.9 1000.0
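The same lookup can also be written without the self-merge; here is a minimal sketch, assuming the (date, category) pairs in df are unique:
lookup = df.set_index(['date', 'category'])['value']
idx = pd.MultiIndex.from_arrays([df['date_new'], df['category']])
df['value_new'] = lookup.reindex(idx).to_numpy()  # NaN where no (date, category) match exists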

Related

add values on a negative pandas df based on condition date

I have a dataframe which contains a user's credit; each row shows how much credit was refilled on a given day.
A user loses 1 credit per day.
I need a way to express that credit a user has accumulated in the past fills the days on which the refill was 0.
An example of refilling past credits:
import pandas as pd
data = pd.DataFrame({'credit_refill': [0, 0, 0, 2, 0, 0, 1],
                     'date': pd.date_range('01-01-2021', '01-07-2021')})
data['credit_after_consuming'] = data.credit_refill - 1
Looks like:
| | credit_refill | date | credit_after_consuming |
|---:|----------------:|:--------------------|-------------------------:|
| 0 | 0 | 2021-01-01 00:00:00 | -1 |
| 1 | 0 | 2021-01-02 00:00:00 | -1 |
| 2 | 0 | 2021-01-03 00:00:00 | -1 |
| 3 | 2 | 2021-01-04 00:00:00 | 1 |
| 4 | 0 | 2021-01-05 00:00:00 | -1 |
| 5 | 0 | 2021-01-06 00:00:00 | -1 |
| 6 | 1 | 2021-01-07 00:00:00 | 0 |
The logic should be: as you can see, for the first three days the user would have credit -1, until the 4th of January, where the user gets 2 days of credit; one is used that day and the other is consumed on the 5th.
In total there would be 3 days (the first ones without credits).
If at the start of the week a user picks up 7 credits, the whole week is covered.
Another case would be:
| | credit_refill | date | credit_after_consuming |
|---:|----------------:|:--------------------|-------------------------:|
| 0 | 2 | 2021-01-01 00:00:00 | 1 |
| 1 | 0 | 2021-01-02 00:00:00 | -1 |
| 2 | 0 | 2021-01-03 00:00:00 | -1 |
| 3 | 0 | 2021-01-04 00:00:00 | -1 |
| 4 | 1 | 2021-01-05 00:00:00 | 0 |
| 5 | 0 | 2021-01-06 00:00:00 | -1 |
In this case the participant would run out of credits on the 3rd and 4th days, because they have 2 credits on the 1st: one is consumed that same day and the other on the 2nd.
Then on the 5th a credit would be refilled and consumed the same day, and the user runs out of credits again on the 6th.
I feel it's like some variation of cumsum, but I can't manage to get the expected results.
I could sum over all days and fill the 0's with those accumulated credits, but I have to take into account that I can only refill with credits accumulated in the past.
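For reference, here is a minimal sketch of one possible reading of that rule: keep a running balance that is clipped at zero before each refill, so a past shortfall is never covered retroactively. This is an illustration under that assumption, not a verified answer.
import pandas as pd

data = pd.DataFrame({'credit_refill': [2, 0, 0, 0, 1, 0],
                     'date': pd.date_range('2021-01-01', '2021-01-06')})

balance = 0
after_consuming = []
for refill in data['credit_refill']:
    # clip at zero before refilling: credit can only come from past accumulation, not past debt
    balance = max(balance, 0) + refill - 1
    after_consuming.append(balance)

data['credit_after_consuming'] = after_consuming
data['out_of_credit'] = data['credit_after_consuming'] < 0
On the second example this marks the 3rd, 4th and 6th days as out of credit, which matches the description above.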

How to dimensionalize a pandas dataframe

I'm looking for a more elegant way of doing this, other than a for-loop and unpacking manually...
Imagine I have a dataframe that looks like this
| id | value | date | name |
| -- | ----- | ---------- | ---- |
| 1 | 5 | 2021-04-05 | foo |
| 1 | 6 | 2021-04-06 | foo |
| 5 | 7 | 2021-04-05 | bar |
| 5 | 9 | 2021-04-06 | bar |
If I wanted to dimensionalize this, I could split it up into two different tables. One, perhaps, would contain "meta" information about the person, and the other serving as "records" that would all relate back to one person... a pretty simple idea as far as SQL-ian ideas go...
The resulting tables would look like this...
Meta
| id | name |
| -- | ---- |
| 1 | foo |
| 5 | bar |
Records
| id | value | date |
| -- | ----- | ---------- |
| 1 | 5 | 2021-04-05 |
| 1 | 6 | 2021-04-06 |
| 5 | 7 | 2021-04-05 |
| 5 | 9 | 2021-04-06 |
My question is, how can I achieve this "dimensionalizing" of a dataframe with pandas, without having to write a for loop on the unique id key field and unpacking manually?
Think about this not as "splitting" the existing dataframe, but as creating two new dataframes from the original. You can do this in a couple of lines:
meta = df[['id', 'name']].drop_duplicates()  # select the relevant columns and remove duplicates
records = df.drop('name', axis=1)  # replicate the original dataframe but drop the name column
You could also call drop_duplicates on the subset of columns you want to keep. For the second dataframe, drop the name column:
df1 = df.drop_duplicates(['id', 'name']).loc[:,['id', 'name']] # perigon's answer is simpler with df[['id','name']].drop_duplicates()
df2 = df.drop('name', axis=1)
df1, df2
Output:
( id name
0 1 foo
2 5 bar,
id value date
0 1 5 2021-04-05
1 1 6 2021-04-06
2 5 7 2021-04-05
3 5 9 2021-04-06)
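As a quick sanity check (a sketch using the meta and records names from the first answer above), the original rows can be reconstructed by joining the two pieces back together on id:
restored = records.merge(meta, on='id', how='left')  # re-attaches name to every record row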

Check a value against row below, return if current row is the first unique value

Dear stackoverflow friends,
I need the help of senior pandas users for a pretty easy task that I cannot solve myself.
Here's a df with the recorded entrances of every person; however, there are multiple ins per person (they get in, check out for the lunch break, get in again).
The df is sorted ascending by person, date and entrance time.
I need to extract the 1st recorded entrance, thus excluding the others (e.g. after the lunch break).
To get to the desired output (1st_stamp) I need to check whether the current row is the first in of the day (for the same person, of course); then a "y" would appear in the column "1st_stamp".
It's tricky because some people have only 1 entrance (e.g. Person N. 3), some have 2 (Person N. 2), and some split their lunch into two breaks, so they have 3 recorded entrances (Person N. 7).
How would you go about solving this riddle?
PS: being able to clean this data is of enormous use for staff planning processes.
Thank you dears :)
+-------------+------------+------------------+----------+-----------+
| name | Date | start | tstart | 1st_stamp |
+-------------+------------+------------------+----------+-----------+
| Person N. 1 | 13/08/2020 | 13/08/2020 07:00 | 07:00:00 | y |
| Person N. 1 | 13/08/2020 | 13/08/2020 13:10 | 13:10:00 | n |
| Person N. 2 | 13/08/2020 | 13/08/2020 10:00 | 10:00:00 | y |
| Person N. 2 | 13/08/2020 | 13/08/2020 13:46 | 13:46:00 | n |
| Person N. 3 | 13/08/2020 | 13/08/2020 09:00 | 09:00:00 | y |
| Person N. 4 | 13/08/2020 | 13/08/2020 08:00 | 08:00:00 | y |
| Person N. 4 | 13/08/2020 | 13/08/2020 13:04 | 13:04:00 | n |
| Person N. 4 | 13/08/2020 | NaT | NaT | n |
| Person N. 5 | 13/08/2020 | 13/08/2020 10:00 | 10:00:00 | y |
| Person N. 6 | 13/08/2020 | 13/08/2020 07:00 | 07:00:00 | y |
| Person N. 6 | 13/08/2020 | 13/08/2020 13:29 | 13:29:00 | n |
| Person N. 7 | 13/08/2020 | 13/08/2020 08:00 | 08:00:00 | y |
| Person N. 7 | 13/08/2020 | 13/08/2020 14:01 | 14:01:00 | n |
| Person N. 7 | 13/08/2020 | 13/08/2020 16:00 | 16:00:00 | n |
+-------------+------------+------------------+----------+-----------+
If I understood correctly, you want to create the 1st_stamp column, right?
To create the 1st_stamp column, here is one way to approach it:
import numpy as np
import pandas as pd

# 1. Convert to datetime if it isn't already
df['start'] = pd.to_datetime(df['start'])
# 2. Partition the data by name and date, and rank rows by start datetime
df['order'] = df.groupby(['name', 'Date'])['start'].rank(method='min')
# 3. Flag whether each row is the earliest entrance of its group
df['1st_stamp'] = np.where(df['order'] == 1, 'y', 'n')
df
The 2nd step is adapted from another Stack Overflow answer.
This will create the order column; if you don't need it, you can just delete it with del df['order'].
Ensure the column is a datetime:
df['start'] = pd.to_datetime(df['start'])
To return the first time per person and day, you can do something like:
df.groupby(['name', 'Date'])['tstart'].first()
Or the first time and the count of entries:
grouped = df.groupby(['name', 'Date']).agg({'tstart': ['min', 'count']})
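If the goal is the y/n flag on every row rather than a collapsed frame, a further sketch (not from the answers above, assuming numpy is imported as np as in the first answer) is to compare each start against the per-person, per-day minimum:
df['1st_stamp'] = np.where(
    df['start'] == df.groupby(['name', 'Date'])['start'].transform('min'),
    'y', 'n')  # NaT rows never equal the minimum, so they are flagged 'n'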

How to calculate average percentage change using groupby

I want to create a dataframe that calculates the average percentage change over a time period.
The target dataframe would look like this:
| | City | Ave_Growth |
|---|------|------------|
| 0 | A | 0.0 |
| 1 | B | -0.5 |
| 2 | C | 0.5 |
While simplified, the real data would be cities with average changes over the past 7 days.
The original dataset, df_bycity, looks like this:
| | City | Date | Case_Num |
|---|------|------------|----------|
| 0 | A | 2020-01-01 | 1 |
| 1 | A | 2020-01-02 | 1 |
| 2 | A | 2020-01-03 | 1 |
| 3 | B | 2020-01-01 | 3 |
| 4 | C | 2020-01-03 | 3 |
While simplified, this represents the real data: some cities have fewer cases, some have more, and in some cities there will be days with no reported cases. But I would like to calculate the average change over the last seven days from today.
I tried the following code but I'm not getting the results I want:
df_bycity.groupby(['City','Date']).pct_change()
Case_Num
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
Obviously I'm using either pct_change or groupby incorrectly. I'm just learning this.
Can anyone point me in the right direction? Thanks.
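For reference, a minimal sketch of one way to get the per-city average growth (an illustration of the idea, not a verified answer): grouping by Date as well as City leaves every group with a single row, so pct_change has nothing to compare against. Grouping by City alone and averaging the day-over-day changes is closer to the target:
df_bycity = df_bycity.sort_values(['City', 'Date'])
df_bycity['pct_change'] = df_bycity.groupby('City')['Case_Num'].pct_change()
ave_growth = (df_bycity.groupby('City')['pct_change']
              .mean()
              .reset_index(name='Ave_Growth'))
With the simplified data, cities B and C come out as NaN rather than -0.5 and 0.5, since each has only one row and there is no previous day to compare against.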

Conditionally setting rows in pandas groupby

I have a (simplified) dataframe like:
+--------+-----------+-------+
| type | estimated | value |
+--------+-----------+-------+
| type_a | TRUE | 1 |
| type_a | TRUE | 2 |
| type_a | | 3 |
| type_b | | 4 |
| type_b | | 5 |
| type_b | | 6 |
+--------+-----------+-------+
I'd like to group and sum it into two rows:
+--------+-----------+-------+
| type | estimated | value |
+--------+-----------+-------+
| type_a | TRUE | 6 |
| type_b | | 15 |
+--------+-----------+-------+
However, I want the grouped row's 'estimated' column to be TRUE if any of the rows grouped to form it were estimated. If my groupby includes the 'estimated' column, then the rows won't be grouped together.
My idea was to iterate through each group, e.g. (pseudocode):
grouped = df.groupby('type')
for group in grouped:
    group['flag'] = 0
    for row in group:
        if row['estimated'] == True:
            group['flag'] = 1
Then after grouping I could set estimated = True on all the rows with a non-zero 'flag'.
I'm having some trouble figuring out how to iterate through the rows of groups, and the solution seems pretty hacky. Also, you shouldn't edit something you're iterating over. Is there a solution/better way?
You want groupby with agg:
df.groupby('type').agg(dict(estimated='any', value='sum')).reset_index()
type value estimated
0 type_a 6 True
1 type_b 15 False
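The same result can also be written with named aggregation (available since pandas 0.25), which makes the output column names explicit; this is a minor variant rather than a different approach:
df.groupby('type', as_index=False).agg(estimated=('estimated', 'any'),
                                       value=('value', 'sum'))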
