Python Pivot Table based on multiple criteria

I asked about this in an earlier question, SUMIFS in python jupyter.
However, I just realized that the solution doesn't work because clients can switch in and switch out on different dates. Basically, they have to switch out first before they can switch in.
Here is the dataframe (sorted based on the date):
+---------------+--------+---------+-----------+--------+
| Switch In/Out | Client | Quality | Date | Amount |
+---------------+--------+---------+-----------+--------+
| Out | 1 | B | 15-Aug-19 | 360 |
| In | 1 | A | 16-Aug-19 | 180 |
| In | 1 | B | 17-Aug-19 | 180 |
| Out | 1 | A | 18-Aug-19 | 140 |
| In | 1 | B | 18-Aug-19 | 80 |
| In | 1 | A | 19-Aug-19 | 60 |
| Out | 2 | B | 14-Aug-19 | 45 |
| Out | 2 | C | 15-Aug-20 | 85 |
| In | 2 | C | 15-Aug-20 | 130 |
| Out | 2 | A | 20-Aug-19 | 100 |
| In | 2 | A | 22-Aug-19 | 30 |
| In | 2 | B | 23-Aug-19 | 30 |
| In | 2 | C | 23-Aug-19 | 40 |
+---------------+--------+---------+-----------+--------+
I would then create a new column and divide them into different transactions.
+---------------+--------+---------+-----------+--------+------+
| Switch In/Out | Client | Quality | Date | Amount | Rows |
+---------------+--------+---------+-----------+--------+------+
| Out | 1 | B | 15-Aug-19 | 360 | 1 |
| In | 1 | A | 16-Aug-19 | 180 | 1 |
| In | 1 | B | 17-Aug-19 | 180 | 1 |
| Out | 1 | A | 18-Aug-19 | 140 | 2 |
| In | 1 | B | 18-Aug-19 | 80 | 2 |
| In | 1 | A | 19-Aug-19 | 60 | 2 |
| Out | 2 | B | 14-Aug-19 | 45 | 3 |
| Out | 2 | C | 15-Aug-20 | 85 | 3 |
| In | 2 | C | 15-Aug-20 | 130 | 3 |
| Out | 2 | A | 20-Aug-19 | 100 | 4 |
| In | 2 | A | 22-Aug-19 | 30 | 4 |
| In | 2 | B | 23-Aug-19 | 30 | 4 |
| In | 2 | C | 23-Aug-19 | 40 | 4 |
+---------------+--------+---------+-----------+--------+------+
With this, I can apply the pivot formula and take it from there.
How do I do this in Python? In Excel I can just use multiple SUMIFS and compare in and out, but that approach doesn't carry over to Python.
Thank you!

One simple solution is to iterate over the rows and apply a check (a function) to each element, with the result becoming a new column: in other words, map.
Using df.index.map, the index of each item is passed as an argument, so we can look up and compare values. In your case the aim is to identify each change to "Out" while keeping a counter.
import pandas as pd

switchInOut = ["Out", "In", "In", "Out", "In", "In",
               "Out", "Out", "In", "Out", "In", "In", "In"]
df = pd.DataFrame(switchInOut, columns=['Switch In/Out'])

counter = 1

def changeToOut(i):
    global counter
    # Start a new transaction when an "Out" row follows an "In" row
    if df["Switch In/Out"].get(i) == "Out" and df["Switch In/Out"].get(i - 1) == "In":
        counter += 1
    return counter

rows = df.index.map(changeToOut)
df["Rows"] = rows
df
Result:
+----+-----------------+--------+
| | Switch In/Out | Rows |
|----+-----------------+--------|
| 0 | Out | 1 |
| 1 | In | 1 |
| 2 | In | 1 |
| 3 | Out | 2 |
| 4 | In | 2 |
| 5 | In | 2 |
| 6 | Out | 3 |
| 7 | Out | 3 |
| 8 | In | 3 |
| 9 | Out | 4 |
| 10 | In | 4 |
| 11 | In | 4 |
| 12 | In | 4 |
+----+-----------------+--------+
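A vectorized alternative (a sketch over the same column): a new transaction starts whenever an "Out" row directly follows an "In" row, so a cumulative sum of those starts gives the Rows column without a Python-level loop. The pivot_table call at the end is only a guess at the aggregation you want and assumes the full dataframe with Client, Quality and Amount as in the question.
import pandas as pd

# New transaction whenever "Out" follows "In"; cumsum numbers the transactions.
is_out = df["Switch In/Out"].eq("Out")
new_txn = is_out & df["Switch In/Out"].shift().eq("In")
df["Rows"] = new_txn.cumsum() + 1

# A SUMIFS-like summary once the full dataframe is available (assumed columns).
pivot = pd.pivot_table(df, index=["Client", "Rows", "Quality"],
                       columns="Switch In/Out", values="Amount",
                       aggfunc="sum", fill_value=0)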

Related

Transform a pandas DataFrame into a DataFrame with multi-level columns

I have the following pandas dataframe, where the column id is the dataframe index
+----+-----------+------------+-----------+------------+
| | price_A | amount_A | price_B | amount_b |
|----+-----------+------------+-----------+------------|
| 0 | 0.652826 | 0.941421 | 0.823048 | 0.728427 |
| 1 | 0.400078 | 0.600585 | 0.194912 | 0.269842 |
| 2 | 0.223524 | 0.146675 | 0.375459 | 0.177165 |
| 3 | 0.330626 | 0.214981 | 0.389855 | 0.541666 |
| 4 | 0.578132 | 0.30478 | 0.789573 | 0.268851 |
| 5 | 0.0943601 | 0.514878 | 0.419333 | 0.0170096 |
| 6 | 0.279122 | 0.401132 | 0.722363 | 0.337094 |
| 7 | 0.444977 | 0.333254 | 0.643878 | 0.371528 |
| 8 | 0.724673 | 0.0632807 | 0.345225 | 0.935403 |
| 9 | 0.905482 | 0.8465 | 0.585653 | 0.364495 |
+----+-----------+------------+-----------+------------+
And I want to convert this dataframe into a multi-column dataframe that looks like this:
+----+-----------+------------+-----------+------------+
| | A | B |
+----+-----------+------------+-----------+------------+
| id | price | amount | price | amount |
|----+-----------+------------+-----------+------------|
| 0 | 0.652826 | 0.941421 | 0.823048 | 0.728427 |
| 1 | 0.400078 | 0.600585 | 0.194912 | 0.269842 |
| 2 | 0.223524 | 0.146675 | 0.375459 | 0.177165 |
| 3 | 0.330626 | 0.214981 | 0.389855 | 0.541666 |
| 4 | 0.578132 | 0.30478 | 0.789573 | 0.268851 |
| 5 | 0.0943601 | 0.514878 | 0.419333 | 0.0170096 |
| 6 | 0.279122 | 0.401132 | 0.722363 | 0.337094 |
| 7 | 0.444977 | 0.333254 | 0.643878 | 0.371528 |
| 8 | 0.724673 | 0.0632807 | 0.345225 | 0.935403 |
| 9 | 0.905482 | 0.8465 | 0.585653 | 0.364495 |
+----+-----------+------------+-----------+------------+
I've tried transforming my old pandas dataframe into a dict this way:
dict = {"A": df[["price_a","amount_a"]], "B":df[["price_b", "amount_b"]]}
df = pd.DataFrame(dict, index=df.index)
But I had no success. How can I do that?
Try renaming columns manually:
df.columns=pd.MultiIndex.from_tuples([x.split('_')[::-1] for x in df.columns])
df.index.name='id'
Output:
A B b
price amount price amount
id
0 0.652826 0.941421 0.823048 0.728427
1 0.400078 0.600585 0.194912 0.269842
2 0.223524 0.146675 0.375459 0.177165
3 0.330626 0.214981 0.389855 0.541666
4 0.578132 0.304780 0.789573 0.268851
5 0.094360 0.514878 0.419333 0.017010
6 0.279122 0.401132 0.722363 0.337094
7 0.444977 0.333254 0.643878 0.371528
8 0.724673 0.063281 0.345225 0.935403
9 0.905482 0.846500 0.585653 0.364495
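The stray lowercase b level comes from the "amount_b" column name. If that spelling is intentional rather than a typo, one option is to upper-case the suffix while building the MultiIndex (a sketch, assuming every column follows the name_letter pattern):
# Split each name on the last underscore, upper-case the letter, and use
# (letter, name) tuples so the letter becomes the outer level.
df.columns = pd.MultiIndex.from_tuples(
    [(suffix.upper(), name) for name, suffix in (c.rsplit('_', 1) for c in df.columns)]
)
df.index.name = 'id'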
You can split the column names on the underscore and convert to a tuple. Once you map each split column name to a tuple, pandas will convert the Index to a MultiIndex for you. From there we just need to call swaplevel to get the letter level to come first and reassign to the dataframe.
Note: in my input dataframe I replaced the column name "amount_b" with "amount_B" because it lined up with your expected output, so I assumed it was a typo.
df.columns = df.columns.str.split("_", expand=True).swaplevel()
print(df)
A B
price amount price amount
0 0.652826 0.941421 0.823048 0.728427
1 0.400078 0.600585 0.194912 0.269842
2 0.223524 0.146675 0.375459 0.177165
3 0.330626 0.214981 0.389855 0.541666
4 0.578132 0.304780 0.789573 0.268851
5 0.094360 0.514878 0.419333 0.017010
6 0.279122 0.401132 0.722363 0.337094
7 0.444977 0.333254 0.643878 0.371528
8 0.724673 0.063281 0.345225 0.935403
9 0.905482 0.846500 0.585653 0.364495

Manipulate pandas columns with datetime

Please see this SO post Manipulating pandas columns
I shared this dataframe:
+----------+------------+-------+-----+------+
| Location | Date | Event | Key | Time |
+----------+------------+-------+-----+------+
| i2 | 2019-03-02 | 1 | a | |
| i2 | 2019-03-02 | 1 | a | |
| i2 | 2019-03-02 | 1 | a | |
| i2 | 2019-03-04 | 1 | a | 2 |
| i2 | 2019-03-15 | 2 | b | 0 |
| i9 | 2019-02-22 | 2 | c | 0 |
| i9 | 2019-03-10 | 3 | d | |
| i9 | 2019-03-10 | 3 | d | 0 |
| s8 | 2019-04-22 | 1 | e | |
| s8 | 2019-04-25 | 1 | e | |
| s8 | 2019-04-28 | 1 | e | 6 |
| t14 | 2019-05-13 | 3 | f | |
+----------+------------+-------+-----+------+
This is a follow-up question. Consider two more columns after Date as shown below.
+-----------------------+----------------------+
| Start Time (hh:mm:ss) | Stop Time (hh:mm:ss) |
+-----------------------+----------------------+
| 13:24:38 | 14:17:39 |
| 03:48:36 | 04:17:20 |
| 04:55:05 | 05:23:48 |
| 08:44:34 | 09:13:15 |
| 19:21:05 | 20:18:57 |
| 21:05:06 | 22:01:50 |
| 14:24:43 | 14:59:37 |
| 07:57:32 | 09:46:21 |
| 19:21:05 | 20:18:57 |
| 21:05:06 | 22:01:50 |
| 14:24:43 | 14:59:37 |
| 07:57:32 | 09:46:21 |
+-----------------------+----------------------+
The task remains the same: to get the time difference, but now in hours, between the Stop Time of the first row and the Start Time of the last row for each Key.
Based on the answer, I was trying something like this:
df['Time'] = df.groupby(['Location','Event']).Date.\
    transform(lambda x: (x.iloc[-1] - x.iloc[0]))[~df.duplicated(['Location','Event'], keep='last')]
df['Time_h'] = df.groupby(['Location','Event'])['Start Time (hh:mm:ss)','Stop Time (hh:mm:ss)'].\
    transform(lambda x, y: (x.iloc[-1] - y.iloc[0]))[~df.duplicated(['Location','Event'], keep='last')]  # This gives an error on transform
to get the difference in days and hours separately and then combine. Is there a better way?
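A transform lambda only ever sees one column at a time, which is why the two-argument version fails. One possible workaround (a sketch: it builds full timestamps from Date plus the time columns and groups by Key, as the task describes, rather than by Location and Event):
import pandas as pd

# Complete timestamps from the date and the start/stop time-of-day columns.
start = pd.to_datetime(df['Date'].astype(str) + ' ' + df['Start Time (hh:mm:ss)'])
stop = pd.to_datetime(df['Date'].astype(str) + ' ' + df['Stop Time (hh:mm:ss)'])

# (last Start) - (first Stop) per Key, in hours, written only on the last row
# of each group to mirror the pattern used for the Time column.
elapsed = start.groupby(df['Key']).transform('last') - stop.groupby(df['Key']).transform('first')
df['Time_h'] = (elapsed.dt.total_seconds() / 3600)[~df.duplicated('Key', keep='last')]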

Logical indexing in pandas dataframes [duplicate]

This question already has answers here:
How do I Pandas group-by to get sum?
(11 answers)
Closed 3 years ago.
I have some data like this:
+-----------+---------+-------+
| Duration | Outcome | Event |
+-----------+---------+-------+
| 421 | 0 | 1 |
| 421 | 0 | 1 |
| 261 | 0 | 1 |
| 24 | 0 | 1 |
| 27 | 0 | 1 |
| 613 | 0 | 1 |
| 2454 | 0 | 1 |
| 227 | 0 | 1 |
| 2560 | 0 | 1 |
| 229 | 0 | 1 |
| 2242 | 0 | 1 |
| 6680 | 0 | 1 |
| 1172 | 0 | 1 |
| 5656 | 0 | 1 |
| 5082 | 0 | 1 |
| 7239 | 0 | 1 |
| 127 | 0 | 1 |
| 128 | 0 | 1 |
| 128 | 0 | 1 |
| 7569 | 1 | 1 |
| 324 | 0 | 2 |
| 6395 | 0 | 2 |
| 6196 | 0 | 2 |
| 31 | 0 | 2 |
| 228 | 0 | 2 |
| 274 | 0 | 2 |
| 270 | 0 | 2 |
| 275 | 0 | 2 |
| 232 | 0 | 2 |
| 7310 | 0 | 2 |
| 7644 | 1 | 2 |
| 6949 | 0 | 3 |
| 6903 | 1 | 3 |
| 6942 | 0 | 4 |
| 7031 | 1 | 4 |
+-----------+---------+-------+
Now, for each Event, with the Outcome 0/1 considered as Fail/Pass, I want to sum the total Duration of Fail/Pass events separately in 2 new columns (or 1, whatever ensures readability).
I'm new to dataframes and I feel significant logical indexing is involved here. What is the best way to approach this problem?
df.groupby(['Event', 'Outcome'])['Duration'].sum()
So you group by both the event then the outcome, look at the duration column then take the sum of each group.
You can also try:
pd.pivot_table(index='Event',
               columns='Outcome',
               values='Duration',
               data=df,
               aggfunc='sum')
which gives you a table with two columns:
+---------+-------+------+
| Outcome | 0 | 1 |
+---------+-------+------+
| Event | | |
+---------+-------+------+
| 1 | 35691 | 7569 |
| 2 | 21535 | 7644 |
| 3 | 6949 | 6903 |
| 4 | 6942 | 7031 |
+---------+-------+------+
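If you prefer to start from the groupby result, the same layout can be reached by unstacking Outcome (a sketch; the Fail/Pass labels are my own choice, not names from your data):
summary = (
    df.groupby(['Event', 'Outcome'])['Duration'].sum()  # total Duration per (Event, Outcome)
      .unstack('Outcome')                               # Outcome values become the columns
      .rename(columns={0: 'Fail', 1: 'Pass'})
)
print(summary)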

Numpy version of finding the highest and lowest value locations within an interval of another column?

Given the following numpy array, how can I find the locations of the highest and lowest values of column 0 within each interval marked in column 1, using numpy?
import numpy as np
data = np.array([
[1879.289,np.nan],[1879.281,np.nan],[1879.292,1],[1879.295,1],[1879.481,1],[1879.294,1],[1879.268,1],
[1879.293,1],[1879.277,1],[1879.285,1],[1879.464,1],[1879.475,1],[1879.971,1],[1879.779,1],
[1879.986,1],[1880.791,1],[1880.29,1],[1879.253,np.nan],[1878.268,np.nan],[1875.73,1],[1876.792,1],
[1875.977,1],[1876.408,1],[1877.159,1],[1877.187,1],[1883.164,1],[1883.171,1],[1883.495,1],
[1883.962,1],[1885.158,1],[1885.974,1],[1886.479,np.nan],[1885.969,np.nan],[1884.693,1],[1884.977,1],
[1884.967,1],[1884.691,1],[1886.171,1],[1886.166,np.nan],[1884.476,np.nan],[1884.66,1],[1882.962,1],
[1881.496,1],[1871.163,1],[1874.985,1],[1874.979,1],[1871.173,np.nan],[1871.973,np.nan],[1871.682,np.nan],
[1872.476,np.nan],[1882.361,1],[1880.869,1],[1882.165,1],[1881.857,1],[1880.375,1],[1880.66,1],
[1880.891,1],[1880.377,1],[1881.663,1],[1881.66,1],[1877.888,1],[1875.69,1],[1875.161,1],
[1876.697,np.nan],[1876.671,np.nan],[1879.666,np.nan],[1877.182,np.nan],[1878.898,1],[1878.668,1],[1878.871,1],
[1878.882,1],[1879.173,1],[1878.887,1],[1878.68,1],[1878.872,1],[1878.677,1],[1877.877,1],
[1877.669,1],[1877.69,1],[1877.684,1],[1877.68,1],[1877.885,1],[1877.863,1],[1877.674,1],
[1877.676,1],[1877.687,1],[1878.367,1],[1878.179,1],[1877.696,1],[1877.665,1],[1877.667,np.nan],
[1878.678,np.nan],[1878.661,1],[1878.171,1],[1877.371,1],[1877.359,1],[1878.381,1],[1875.185,1],
[1875.367,np.nan],[1865.492,np.nan],[1865.495,1],[1866.995,1],[1866.672,1],[1867.465,1],[1867.663,1],
[1867.186,1],[1867.687,1],[1867.459,1],[1867.168,1],[1869.689,1],[1869.693,1],[1871.676,1],
[1873.174,1],[1873.691,np.nan],[1873.685,np.nan]
])
In the third column below (Min/Max) you can see where the max and min are for each interval.
+-------+----------+-----------+---------+
| index | Value | Intervals | Min/Max |
+-------+----------+-----------+---------+
| 0 | 1879.289 | np.nan | |
| 1 | 1879.281 | np.nan | |
| 2 | 1879.292 | 1 | |
| 3 | 1879.295 | 1 | |
| 4 | 1879.481 | 1 | |
| 5 | 1879.294 | 1 | |
| 6 | 1879.268 | 1 | -1 | min
| 7 | 1879.293 | 1 | |
| 8 | 1879.277 | 1 | |
| 9 | 1879.285 | 1 | |
| 10 | 1879.464 | 1 | |
| 11 | 1879.475 | 1 | |
| 12 | 1879.971 | 1 | |
| 13 | 1879.779 | 1 | |
| 17 | 1879.986 | 1 | |
| 18 | 1880.791 | 1 | 1 | max
| 19 | 1880.29 | 1 | |
| 55 | 1879.253 | np.nan | |
| 56 | 1878.268 | np.nan | |
| 57 | 1875.73 | 1 | -1 |min
| 58 | 1876.792 | 1 | |
| 59 | 1875.977 | 1 | |
| 60 | 1876.408 | 1 | |
| 61 | 1877.159 | 1 | |
| 62 | 1877.187 | 1 | |
| 63 | 1883.164 | 1 | |
| 64 | 1883.171 | 1 | |
| 65 | 1883.495 | 1 | |
| 66 | 1883.962 | 1 | |
| 67 | 1885.158 | 1 | |
| 68 | 1885.974 | 1 | 1 | max
| 69 | 1886.479 | np.nan | |
| 70 | 1885.969 | np.nan | |
| 71 | 1884.693 | 1 | |
| 72 | 1884.977 | 1 | |
| 73 | 1884.967 | 1 | |
| 74 | 1884.691 | 1 | -1 | min
| 75 | 1886.171 | 1 | 1 | max
| 76 | 1886.166 | np.nan | |
| 77 | 1884.476 | np.nan | |
| 78 | 1884.66 | 1 | 1 | max
| 79 | 1882.962 | 1 | |
| 80 | 1881.496 | 1 | |
| 81 | 1871.163 | 1 | -1 | min
| 82 | 1874.985 | 1 | |
| 83 | 1874.979 | 1 | |
| 84 | 1871.173 | np.nan | |
| 85 | 1871.973 | np.nan | |
| 86 | 1871.682 | np.nan | |
| 87 | 1872.476 | np.nan | |
| 88 | 1882.361 | 1 | 1 | max
| 89 | 1880.869 | 1 | |
| 90 | 1882.165 | 1 | |
| 91 | 1881.857 | 1 | |
| 92 | 1880.375 | 1 | |
| 93 | 1880.66 | 1 | |
| 94 | 1880.891 | 1 | |
| 95 | 1880.377 | 1 | |
| 96 | 1881.663 | 1 | |
| 97 | 1881.66 | 1 | |
| 98 | 1877.888 | 1 | |
| 99 | 1875.69 | 1 | |
| 100 | 1875.161 | 1 | -1 | min
| 101 | 1876.697 | np.nan | |
| 102 | 1876.671 | np.nan | |
| 103 | 1879.666 | np.nan | |
| 111 | 1877.182 | np.nan | |
| 112 | 1878.898 | 1 | |
| 113 | 1878.668 | 1 | |
| 114 | 1878.871 | 1 | |
| 115 | 1878.882 | 1 | |
| 116 | 1879.173 | 1 | 1 | max
| 117 | 1878.887 | 1 | |
| 118 | 1878.68 | 1 | |
| 119 | 1878.872 | 1 | |
| 120 | 1878.677 | 1 | |
| 121 | 1877.877 | 1 | |
| 122 | 1877.669 | 1 | |
| 123 | 1877.69 | 1 | |
| 124 | 1877.684 | 1 | |
| 125 | 1877.68 | 1 | |
| 126 | 1877.885 | 1 | |
| 127 | 1877.863 | 1 | |
| 128 | 1877.674 | 1 | |
| 129 | 1877.676 | 1 | |
| 130 | 1877.687 | 1 | |
| 131 | 1878.367 | 1 | |
| 132 | 1878.179 | 1 | |
| 133 | 1877.696 | 1 | |
| 134 | 1877.665 | 1 | -1 | min
| 135 | 1877.667 | np.nan | |
| 136 | 1878.678 | np.nan | |
| 137 | 1878.661 | 1 | 1 | max
| 138 | 1878.171 | 1 | |
| 139 | 1877.371 | 1 | |
| 140 | 1877.359 | 1 | |
| 141 | 1878.381 | 1 | |
| 142 | 1875.185 | 1 | -1 | min
| 143 | 1875.367 | np.nan | |
| 144 | 1865.492 | np.nan | |
| 145 | 1865.495 | 1 | -1 | min
| 146 | 1866.995 | 1 | |
| 147 | 1866.672 | 1 | |
| 148 | 1867.465 | 1 | |
| 149 | 1867.663 | 1 | |
| 150 | 1867.186 | 1 | |
| 151 | 1867.687 | 1 | |
| 152 | 1867.459 | 1 | |
| 153 | 1867.168 | 1 | |
| 154 | 1869.689 | 1 | |
| 155 | 1869.693 | 1 | |
| 156 | 1871.676 | 1 | |
| 157 | 1873.174 | 1 | 1 | max
| 158 | 1873.691 | np.nan | |
| 159 | 1873.685 | np.nan | |
+-------+----------+-----------+---------+
I must specify upfront that this question has already been answered here with a pandas solution. That solution performs reasonably, at about 300 seconds for a table of around 1 million rows. But after some more testing, I see that if the table has over 3 million rows, the execution time increases dramatically to over 2500 seconds or more. This is obviously too long for such a simple task. How would the same problem be solved with numpy?
Here's one NumPy approach -
# Rows that belong to an interval (second column not NaN)
mask = ~np.isnan(data[:,1])
# Interval boundaries: indices where the mask flips False->True (starts)
# and True->False (stops)
s0 = np.flatnonzero(mask[1:] > mask[:-1])+1
s1 = np.flatnonzero(mask[1:] < mask[:-1])+1
lens = s1 - s0                                  # length of each interval
tags = np.repeat(np.arange(len(lens)), lens)    # interval id for every masked row
# Sort values within each interval; the first/last entry of each sorted
# chunk is the position of the min/max among the masked rows
idx = np.lexsort((data[mask,0], tags))
starts = np.r_[0,lens.cumsum()]
# Offsets translate positions among masked rows back to row numbers in data
offsets = np.r_[s0[0], s0[1:] - s1[:-1]]
offsets_cumsum = offsets.cumsum()
min_ids = idx[starts[:-1]] + offsets_cumsum
max_ids = idx[starts[1:]-1] + offsets_cumsum
# -1 marks the minimum of each interval, 1 the maximum, NaN everywhere else
out = np.full(data.shape[0], np.nan)
out[min_ids] = -1
out[max_ids] = 1
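To check the result against the table above, the marker column can simply be appended to the original array:
result = np.column_stack([data, out])   # columns: value, interval flag, -1/NaN/1 marker
print(result[:10])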
So this is a bit of a cheat since it uses scipy:
import numpy as np
from scipy import ndimage

# NaN rows in the second column separate the intervals; a cumulative sum of the
# NaN marker gives every row a label, and all rows of one interval share a label.
markers = np.isnan(data[:, 1])
groups = np.cumsum(markers)
# Only ask for the labels that actually occur on interval (non-NaN) rows
mins, maxs, min_idx, max_idx = ndimage.extrema(
    data[:, 0], labels=groups, index=np.unique(groups[~markers]))
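min_idx and max_idx come back as positions (1-tuples for 1-D input), so one possible way to turn them into the same -1/NaN/1 column as the pure NumPy answer is:
# Flatten the position tuples and scatter the markers into a fresh column.
out = np.full(data.shape[0], np.nan)
out[np.ravel(min_idx).astype(int)] = -1
out[np.ravel(max_idx).astype(int)] = 1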

Interpolate in SQL based on subgroup in django models

I have the following sheetinfo model with the following data:
| Trav | Group | Subgroup | Sheet_num | T_val |
| SAT123A01 | SAT123 | A | 1 | 400 |
| SAT123A02 | SAT123 | A | 2 | 0 |
| SAT123A03 | SAT123 | A | 3 | 0 |
| SAT123A04 | SAT123 | A | 4 | 0 |
| SAT123A05 | SAT123 | A | 5 | 500 |
| SAT123B05 | SAT123 | B | 5 | 400 |
| SAT123B04 | SAT123 | B | 4 | 0 |
| SAT123B03 | SAT123 | B | 3 | 0 |
| SAT123B02 | SAT123 | B | 2 | 500 |
| SAT124A01 | SAT124 | A | 1 | 400 |
| SAT124A02 | SAT124 | A | 2 | 0 |
| SAT124A03 | SAT124 | A | 3 | 0 |
| SAT124A04 | SAT124 | A | 4 | 475 |
I would like to interpolate and update the table with the correct T_val.
Formula is:
new_t_val = delta / (cnt -1) * sheet_num + min_tvc_of_subgroup
For instance the top 5:
| Trav | Group | Subgroup | Sheet_num | T_val |
| SAT123A01 | SAT123 | A | 1 | 400 |
| SAT123A02 | SAT123 | A | 2 | 425 |
| SAT123A03 | SAT123 | A | 3 | 450 |
| SAT123A04 | SAT123 | A | 4 | 475 |
| SAT123A05 | SAT123 | A | 5 | 500 |
I have a Django query that updates the data, but it is SLOW and stops after a while (due to type errors, etc.).
My question is: is there a way to accomplish this in SQL?
The ability to do this as one database call doesn't exist in stock Django. Third-party packages exist, though: https://github.com/aykut/django-bulk-update
Example of how that package works:
rows = Model.objects.all()
for row in rows:
    # Modify rows as appropriate
    row.T_val = delta / (cnt - 1) * row.sheet_num + min_tvc_of_subgroup
Model.objects.bulk_update(rows)
For datasets up to the 1,000,000 range, this should have reasonable performance. Most of the bottleneck in iterating through and .save()-ing each object is the overhead on a database call. The python part is reasonably fast. The above example has only two database calls so it will be perhaps an order of magnitude or two faster.
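Since the loop above still needs delta, cnt and the subgroup minimum, here is a fuller sketch that computes them per (Group, Subgroup) in Python before writing everything back in one call. The model and field names (Sheetinfo, group, subgroup, sheet_num, t_val) are guesses from the table, and the built-in QuerySet.bulk_update used at the end needs Django 2.2 or newer:
from collections import defaultdict

from myapp.models import Sheetinfo  # hypothetical app / model names

by_subgroup = defaultdict(list)
for row in Sheetinfo.objects.all():
    by_subgroup[(row.group, row.subgroup)].append(row)

to_update = []
for rows in by_subgroup.values():
    rows.sort(key=lambda r: r.sheet_num)
    cnt = len(rows)
    known = [r.t_val for r in rows if r.t_val]        # the non-zero endpoint values
    delta = max(known) - min(known)
    min_tvc_of_subgroup = min(known)
    for r in rows:
        if not r.t_val:                               # only fill the missing values
            # Formula as stated in the question; to reproduce the sample output
            # (400, 425, 450, ...) the multiplier may need to be (sheet_num - 1).
            r.t_val = delta / (cnt - 1) * r.sheet_num + min_tvc_of_subgroup
            to_update.append(r)

Sheetinfo.objects.bulk_update(to_update, ['t_val'])   # one query instead of many saves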
