Manipulate pandas columns with datetime - python

Please see this SO post: Manipulating pandas columns.
There I shared this dataframe:
+----------+------------+-------+-----+------+
| Location | Date | Event | Key | Time |
+----------+------------+-------+-----+------+
| i2 | 2019-03-02 | 1 | a | |
| i2 | 2019-03-02 | 1 | a | |
| i2 | 2019-03-02 | 1 | a | |
| i2 | 2019-03-04 | 1 | a | 2 |
| i2 | 2019-03-15 | 2 | b | 0 |
| i9 | 2019-02-22 | 2 | c | 0 |
| i9 | 2019-03-10 | 3 | d | |
| i9 | 2019-03-10 | 3 | d | 0 |
| s8 | 2019-04-22 | 1 | e | |
| s8 | 2019-04-25 | 1 | e | |
| s8 | 2019-04-28 | 1 | e | 6 |
| t14 | 2019-05-13 | 3 | f | |
+----------+------------+-------+-----+------+
This is a follow-up question. Consider two more columns after Date as shown below.
+-----------------------+----------------------+
| Start Time (hh:mm:ss) | Stop Time (hh:mm:ss) |
+-----------------------+----------------------+
| 13:24:38 | 14:17:39 |
| 03:48:36 | 04:17:20 |
| 04:55:05 | 05:23:48 |
| 08:44:34 | 09:13:15 |
| 19:21:05 | 20:18:57 |
| 21:05:06 | 22:01:50 |
| 14:24:43 | 14:59:37 |
| 07:57:32 | 09:46:21 |
| 19:21:05 | 20:18:57 |
| 21:05:06 | 22:01:50 |
| 14:24:43 | 14:59:37 |
| 07:57:32 | 09:46:21 |
+-----------------------+----------------------+
The task remains the same: to get the time difference, but now in hours, between the Stop Time of the first row and the Start Time of the last row for each Key.
Based on the answer, I was trying something like this:
df['Time'] = df.groupby(['Location', 'Event']).Date.\
    transform(lambda x: (x.iloc[-1] - x.iloc[0]))[~df.duplicated(['Location', 'Event'], keep='last')]
df['Time_h'] = df.groupby(['Location', 'Event'])['Start Time (hh:mm:ss)', 'Stop Time (hh:mm:ss)'].\
    transform(lambda x, y: (x.iloc[-1] - y.iloc[0]))[~df.duplicated(['Location', 'Event'], keep='last')]
# The second line errors on transform: it feeds the lambda one column at a time, so a two-argument lambda fails
to get the difference in days and hours separately and then combine. Is there a better way?
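One way to avoid the separate days/hours step entirely (a sketch, assuming the frame above is df and that Date can be combined with the two time-of-day columns into full timestamps; Time_h follows the naming in the attempt above):

import pandas as pd

# Build full start/stop timestamps from Date plus the time-of-day columns
start = pd.to_datetime(df['Date'].astype(str) + ' ' + df['Start Time (hh:mm:ss)'])
stop = pd.to_datetime(df['Date'].astype(str) + ' ' + df['Stop Time (hh:mm:ss)'])

# Hours between the last row's Start Time and the first row's Stop Time per group,
# written only on the last row of each (Location, Event) group as in the original answer
keys = [df['Location'], df['Event']]
diff = start.groupby(keys).transform('last') - stop.groupby(keys).transform('first')
df['Time_h'] = (diff.dt.total_seconds() / 3600)[~df.duplicated(['Location', 'Event'], keep='last')]

Subtracting full timestamps yields a Timedelta directly, so days and hours never need to be combined by hand.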

Related

merge / concat two dataframes on column values and drop subsequent rows from the resulting dataframe

I have 2 data frames
df1
| email           | ack |
| --------------- | --- |
| first#abc.com   | 1   |
| second#abc.com  | 1   |
| third#abc.com   | 1   |
| fourth#abc.com  | 1   |
| fifth#abc.com   | 1   |
| sixth#abc.com   | 1   |
| seventh#abc.com | 1   |
| eight#abc.com   | 1   |
df2
| email           | ack | name | date       |
| --------------- | --- | ---- | ---------- |
| first#abc.com   | 0   | abc  | 01/01/2022 |
| second#abc.com  | 0   | xyz  | 01/02/2022 |
| third#abc.com   | 0   | mno  | 01/03/2022 |
| fourth#abc.com  | 0   | pqr  | 01/04/2022 |
| fifth#abc.com   | 0   | adam | 01/05/2022 |
| sixth#abc.com   | 0   | eve  | 01/06/2022 |
| seventh#abc.com | 0   | mary | 01/07/2022 |
| eight#abc.com   | 0   | john | 01/08/2022 |
| nine#abc.com    | 0   | kate | 01/09/2022 |
| ten#abc.com     | 0   | matt | 01/10/2022 |
How do I merge the above two dataframes so as to replace the values in the 'ack' column of df2 wherever applicable, i.e., matching on email address?
result
df2
| email           | ack | name | date       |
| --------------- | --- | ---- | ---------- |
| first#abc.com   | 1   | abc  | 01/01/2022 |
| second#abc.com  | 1   | xyz  | 01/02/2022 |
| third#abc.com   | 1   | mno  | 01/03/2022 |
| fourth#abc.com  | 1   | pqr  | 01/04/2022 |
| fifth#abc.com   | 1   | adam | 01/05/2022 |
| sixth#abc.com   | 1   | eve  | 01/06/2022 |
| seventh#abc.com | 1   | mary | 01/07/2022 |
| eight#abc.com   | 1   | john | 01/08/2022 |
| nine#abc.com    | 0   | kate | 01/09/2022 |
| ten#abc.com     | 0   | matt | 01/10/2022 |
I tried a left join and an outer join, but both appended rows instead of updating the existing ones.
Assuming df1['ack'] is always 1, the following code should work:
df2.loc[df2['email'].isin(df1['email']), 'ack'] = 1
In English:
If df2['email'] is found in df1['email'], set df2['ack'] = 1
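If df1['ack'] is not always 1, a more general sketch (assuming email values are unique in df1) maps the actual values across and keeps df2's own value where there is no match:

s = df1.set_index('email')['ack']
df2['ack'] = df2['email'].map(s).fillna(df2['ack']).astype(int)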

How to get the column values of a Dataframe into another dataframe as a new column after matching the values in columns that both dataframes have?

I'm trying to create a new column in a DataFrame and fill it with values stored in a different dataframe, by first comparing the values of columns that both dataframes have. For example:
df1 >>>
| name | team | week | dates | interceptions | pass_yds | rating |
| ---- | ---- | -----| ---------- | ------------- | --------- | -------- |
| maho | KC | 1 | 2020-09-10 | 0 | 300 | 105 |
| went | PHI | 1 | 2020-09-13 | 2 | 225 | 74 |
| lock | DEN | 1 | 2020-09-14 | 0 | 150 | 89 |
| dris | DEN | 2 | 2020-09-20 | 1 | 220 | 95 |
| went | PHI | 2 | 2020-09-20 | 2 | 250 | 64 |
| maho | KC | 2 | 2020-09-21 | 1 | 245 | 101 |
df2 >>>
| name | team | week | catches | rec_yds | rec_tds |
| ---- | ---- | -----| ------- | ------- | ------- |
| ertz | PHI | 1 | 5 | 58 | 1 |
| fant | DEN | 2 | 6 | 79 | 0 |
| kelc | KC | 2 | 8 | 105 | 1 |
| fant | DEN | 1 | 3 | 29 | 0 |
| kelc | KC | 1 | 6 | 71 | 1 |
| ertz | PHI | 2 | 7 | 91 | 2 |
| goed | PHI | 2 | 2 | 15 | 0 |
I want to create a dates column in df2 with the values of the dates stored in the dates column in df1 after matching the teams and the weeks columns. After the matching, df2 in this example should look something like this:
df2 >>>
| name | team | week | catches | rec_yds | rec_tds | dates |
| ---- | ---- | -----| ------- | ------- | ------- | ---------- |
| ertz | PHI | 1 | 5 | 58 | 1 | 2020-09-13 |
| fant | DEN | 2 | 6 | 79 | 0 | 2020-09-20 |
| kelc | KC | 2 | 8 | 105 | 1 | 2020-09-20 |
| fant | DEN | 1 | 3 | 29 | 0 | 2020-09-14 |
| kelc | KC | 1 | 6 | 71 | 1 | 2020-09-10 |
| ertz | PHI | 2 | 7 | 91 | 2 | 2020-09-20 |
| goed | PHI | 2 | 2 | 15 | 0 | 2020-09-20 |
I'm looking for an optimal solution. I've already tried nested for loops and comparing the week and team columns from both dataframes together but that hasn't worked. At this point I'm all out of ideas. Please help!
Disclaimer: The actual DataFrames I'm working with are a lot larger. They have a lot more rows, columns, and values (i.e. a lot more teams in the team columns, a lot more dates in the dates columns, and a lot more weeks in the week columns)
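A left merge on the shared keys should do this in one shot (a sketch, assuming the frames above are df1 and df2):

# Pull only the key columns plus dates from df1; dropping duplicate
# (team, week) pairs first ensures the merge cannot multiply rows in df2
dates = df1[['team', 'week', 'dates']].drop_duplicates(['team', 'week'])
df2 = df2.merge(dates, on=['team', 'week'], how='left')

Because merge matches on column values rather than position, this scales to the much larger frames mentioned in the disclaimer without any nested loops.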

Transform a Pandas dataframe into one with multi-level columns

I have the following pandas dataframe, where the column id is the dataframe index
+----+-----------+------------+-----------+------------+
| | price_A | amount_A | price_B | amount_b |
|----+-----------+------------+-----------+------------|
| 0 | 0.652826 | 0.941421 | 0.823048 | 0.728427 |
| 1 | 0.400078 | 0.600585 | 0.194912 | 0.269842 |
| 2 | 0.223524 | 0.146675 | 0.375459 | 0.177165 |
| 3 | 0.330626 | 0.214981 | 0.389855 | 0.541666 |
| 4 | 0.578132 | 0.30478 | 0.789573 | 0.268851 |
| 5 | 0.0943601 | 0.514878 | 0.419333 | 0.0170096 |
| 6 | 0.279122 | 0.401132 | 0.722363 | 0.337094 |
| 7 | 0.444977 | 0.333254 | 0.643878 | 0.371528 |
| 8 | 0.724673 | 0.0632807 | 0.345225 | 0.935403 |
| 9 | 0.905482 | 0.8465 | 0.585653 | 0.364495 |
+----+-----------+------------+-----------+------------+
And I want to convert this dataframe into a multi-column dataframe that looks like this:
+----+-----------+------------+-----------+------------+
|    |           A            |           B            |
+----+-----------+------------+-----------+------------+
| id |   price   |   amount   |   price   |   amount   |
|----+-----------+------------+-----------+------------|
| 0 | 0.652826 | 0.941421 | 0.823048 | 0.728427 |
| 1 | 0.400078 | 0.600585 | 0.194912 | 0.269842 |
| 2 | 0.223524 | 0.146675 | 0.375459 | 0.177165 |
| 3 | 0.330626 | 0.214981 | 0.389855 | 0.541666 |
| 4 | 0.578132 | 0.30478 | 0.789573 | 0.268851 |
| 5 | 0.0943601 | 0.514878 | 0.419333 | 0.0170096 |
| 6 | 0.279122 | 0.401132 | 0.722363 | 0.337094 |
| 7 | 0.444977 | 0.333254 | 0.643878 | 0.371528 |
| 8 | 0.724673 | 0.0632807 | 0.345225 | 0.935403 |
| 9 | 0.905482 | 0.8465 | 0.585653 | 0.364495 |
+----+-----------+------------+-----------+------------+
I've tried transforming my old pandas dataframe into a dict this way:
dict = {"A": df[["price_a","amount_a"]], "B":df[["price_b", "amount_b"]]}
df = pd.DataFrame(dict, index=df.index)
But I had no success. How can I do that?
Try renaming columns manually:
df.columns=pd.MultiIndex.from_tuples([x.split('_')[::-1] for x in df.columns])
df.index.name='id'
Output:
A B b
price amount price amount
id
0 0.652826 0.941421 0.823048 0.728427
1 0.400078 0.600585 0.194912 0.269842
2 0.223524 0.146675 0.375459 0.177165
3 0.330626 0.214981 0.389855 0.541666
4 0.578132 0.304780 0.789573 0.268851
5 0.094360 0.514878 0.419333 0.017010
6 0.279122 0.401132 0.722363 0.337094
7 0.444977 0.333254 0.643878 0.371528
8 0.724673 0.063281 0.345225 0.935403
9 0.905482 0.846500 0.585653 0.364495
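If the lowercase b really is a typo in the source frame, one possible normalization (my assumption) before building the MultiIndex:

df.columns = df.columns.str.replace('_b', '_B', regex=False)  # assumes '_b' only appears as the group suffix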
You can split the column names on the underscore and convert to a tuple. Once you map each split column name to a tuple, pandas will convert the Index to a MultiIndex for you. From there we just need to call swaplevel to get the letter level to come first and reassign to the dataframe.
Note: in my input dataframe I replaced the column name "amount_b" with "amount_B"; that lined up with your expected output, so I assumed it was a typo.
df.columns = df.columns.str.split("_", expand=True).swaplevel()
print(df)
A B
price amount price amount
0 0.652826 0.941421 0.823048 0.728427
1 0.400078 0.600585 0.194912 0.269842
2 0.223524 0.146675 0.375459 0.177165
3 0.330626 0.214981 0.389855 0.541666
4 0.578132 0.304780 0.789573 0.268851
5 0.094360 0.514878 0.419333 0.017010
6 0.279122 0.401132 0.722363 0.337094
7 0.444977 0.333254 0.643878 0.371528
8 0.724673 0.063281 0.345225 0.935403
9 0.905482 0.846500 0.585653 0.364495

Python Pivot Table based on multiple criteria

I asked this question earlier: SUMIFS in python jupyter.
However, I just realized that the solution didn't work, because clients can switch in and switch out on different dates; basically they have to switch out first before they can switch in.
Here is the dataframe (sorted based on the date):
+---------------+--------+---------+-----------+--------+
| Switch In/Out | Client | Quality | Date | Amount |
+---------------+--------+---------+-----------+--------+
| Out | 1 | B | 15-Aug-19 | 360 |
| In | 1 | A | 16-Aug-19 | 180 |
| In | 1 | B | 17-Aug-19 | 180 |
| Out | 1 | A | 18-Aug-19 | 140 |
| In | 1 | B | 18-Aug-19 | 80 |
| In | 1 | A | 19-Aug-19 | 60 |
| Out | 2 | B | 14-Aug-19 | 45 |
| Out | 2 | C | 15-Aug-20 | 85 |
| In | 2 | C | 15-Aug-20 | 130 |
| Out | 2 | A | 20-Aug-19 | 100 |
| In | 2 | A | 22-Aug-19 | 30 |
| In | 2 | B | 23-Aug-19 | 30 |
| In | 2 | C | 23-Aug-19 | 40 |
+---------------+--------+---------+-----------+--------+
I would then create a new column and divide them into different transactions.
+---------------+--------+---------+-----------+--------+------+
| Switch In/Out | Client | Quality | Date | Amount | Rows |
+---------------+--------+---------+-----------+--------+------+
| Out | 1 | B | 15-Aug-19 | 360 | 1 |
| In | 1 | A | 16-Aug-19 | 180 | 1 |
| In | 1 | B | 17-Aug-19 | 180 | 1 |
| Out | 1 | A | 18-Aug-19 | 140 | 2 |
| In | 1 | B | 18-Aug-19 | 80 | 2 |
| In | 1 | A | 19-Aug-19 | 60 | 2 |
| Out | 2 | B | 14-Aug-19 | 45 | 3 |
| Out | 2 | C | 15-Aug-20 | 85 | 3 |
| In | 2 | C | 15-Aug-20 | 130 | 3 |
| Out | 2 | A | 20-Aug-19 | 100 | 4 |
| In | 2 | A | 22-Aug-19 | 30 | 4 |
| In | 2 | B | 23-Aug-19 | 30 | 4 |
| In | 2 | C | 23-Aug-19 | 40 | 4 |
+---------------+--------+---------+-----------+--------+------+
With this, I can apply the pivot formula and take it from there.
But how do I build this column in Python? In Excel I can just use multiple SUMIFS and compare the ins and outs, but that approach doesn't carry over to Python.
Thank you!
One simple solution is to iterate and apply a check (a function) to each element, with the result becoming a new column: map.
Using df.index.map we get the index of each item to pass as an argument, so we can look at neighbouring values and compare. In your case the aim is to identify each change to "Out", keeping a counter.
import pandas as pd

switchInOut = ["Out", "In", "In", "Out", "In", "In",
               "Out", "Out", "In", "Out", "In", "In", "In"]
df = pd.DataFrame(switchInOut, columns=['Switch In/Out'])

counter = 1

def changeToOut(i):
    global counter
    # A new transaction starts when an "Out" directly follows an "In"
    if df["Switch In/Out"].get(i) == "Out" and df["Switch In/Out"].get(i - 1) == "In":
        counter += 1
    return counter

rows = df.index.map(changeToOut)
df["Rows"] = rows
df
Result:
+----+-----------------+--------+
| | Switch In/Out | Rows |
|----+-----------------+--------|
| 0 | Out | 1 |
| 1 | In | 1 |
| 2 | In | 1 |
| 3 | Out | 2 |
| 4 | In | 2 |
| 5 | In | 2 |
| 6 | Out | 3 |
| 7 | Out | 3 |
| 8 | In | 3 |
| 9 | Out | 4 |
| 10 | In | 4 |
| 11 | In | 4 |
| 12 | In | 4 |
+----+-----------------+--------+
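A vectorized alternative (my own sketch, not part of the answer above): a new transaction starts exactly when an "Out" directly follows an "In", so a cumulative sum of that condition reproduces the counter without a global variable.

s = df['Switch In/Out']
df['Rows'] = (s.eq('Out') & s.shift().eq('In')).cumsum() + 1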

Interpolate in SQL based on subgroup in django models

I have the following sheetinfo model with the following data:
| Trav      | Group  | Subgroup | Sheet_num | T_val |
| --------- | ------ | -------- | --------- | ----- |
| SAT123A01 | SAT123 | A        | 1         | 400   |
| SAT123A02 | SAT123 | A        | 2         | 0     |
| SAT123A03 | SAT123 | A        | 3         | 0     |
| SAT123A04 | SAT123 | A        | 4         | 0     |
| SAT123A05 | SAT123 | A        | 5         | 500   |
| SAT123B05 | SAT123 | B        | 5         | 400   |
| SAT123B04 | SAT123 | B        | 4         | 0     |
| SAT123B03 | SAT123 | B        | 3         | 0     |
| SAT123B02 | SAT123 | B        | 2         | 500   |
| SAT124A01 | SAT124 | A        | 1         | 400   |
| SAT124A02 | SAT124 | A        | 2         | 0     |
| SAT124A03 | SAT124 | A        | 3         | 0     |
| SAT124A04 | SAT124 | A        | 4         | 475   |
I would like to interpolate and update the table with the correct T_val.
Formula is:
new_t_val = delta / (cnt - 1) * (sheet_num - 1) + min_tvc_of_subgroup
where delta is the spread between the subgroup's endpoint T_vals and cnt is the number of sheets in the subgroup. For subgroup A of SAT123: delta = 500 - 400 = 100 and cnt = 5, so the step is 25 and sheet 2 gets 400 + 25 * 1 = 425.
For instance the top 5:
| Trav      | Group  | Subgroup | Sheet_num | T_val |
| --------- | ------ | -------- | --------- | ----- |
| SAT123A01 | SAT123 | A        | 1         | 400   |
| SAT123A02 | SAT123 | A        | 2         | 425   |
| SAT123A03 | SAT123 | A        | 3         | 450   |
| SAT123A04 | SAT123 | A        | 4         | 475   |
| SAT123A05 | SAT123 | A        | 5         | 500   |
I have a django query that works to update the data, however it is SLOW and stops after a while (due to type errors etc.)
My question is there a way to accomplish this in SQL?
The ability to do this as one database call doesn't exist in stock Django. 3rd party packages exist though: https://github.com/aykut/django-bulk-update
Example of how that package works:

rows = Model.objects.all()
for row in rows:
    # Modify rows as appropriate; delta, cnt and min_tvc_of_subgroup are
    # assumed to be computed per (Group, Subgroup)
    row.T_val = delta / (cnt - 1) * (row.sheet_num - 1) + min_tvc_of_subgroup
Model.objects.bulk_update(rows)
For datasets up to the 1,000,000 range, this should have reasonable performance. Most of the bottleneck when iterating through and .save()-ing each object is the per-object overhead of a database call; the Python part is reasonably fast. The example above makes only two database calls, so it should be perhaps an order of magnitude or two faster.
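If the per-subgroup values are not already at hand, here is a plain-Python sketch (assuming lowercase field names like group, subgroup, sheet_num and t_val on the Sheetinfo model, and that each subgroup's first and last sheets in Sheet_num order carry the known endpoint T_vals):

from collections import defaultdict

rows = list(Sheetinfo.objects.all())
groups = defaultdict(list)
for r in rows:
    groups[(r.group, r.subgroup)].append(r)

for members in groups.values():
    members.sort(key=lambda r: r.sheet_num)
    # Interpolate linearly from the first sheet's value to the last sheet's value
    lo, hi = members[0].t_val, members[-1].t_val
    step = (hi - lo) / (len(members) - 1) if len(members) > 1 else 0
    for i, r in enumerate(members):
        r.t_val = lo + step * i

Sheetinfo.objects.bulk_update(rows, ['t_val'])

This keeps it at two database calls; note the bulk_update(objs, fields) signature shown here is the one in stock Django 2.2+, which differs slightly from the third-party package above.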
