Retain value in Python for groups

I have data that looks like the table below.
country item week stock FCST
A 1 1 105 3
A 1 2 105 6
A 1 3 105 9
A 1 4 105 4
A 1 5 105 7
A 1 6 105 4
A 1 7 105 7
The task I wish to perform is assigning the closing stock of the current week as the opening stock of the next week. In the table above my stock quantity is 105 at the very beginning; based on the forecast (FCST column) it decreases each week, and the result is assigned as the closing stock of that same week. That closing stock should then become the opening stock for the following week.
I have done the same in SAS using a RETAIN statement, but I have no idea how to replicate it in Python.
Also note that this operation has to be performed for every country-item combination. (We cannot always shift the value to take as the opening stock, as a new item might have a different current stock.)
Can anyone help me out with this?
My output should look like the table below.
country item week stock FCST OPENING_STOCK CLOSING_STK
A 1 1 105 3 105 102
A 1 2 105 6 102 96
A 1 3 105 9 96 87
A 1 4 105 4 87 83
A 1 5 105 7 83 76
A 1 6 105 4 76 72
A 1 7 105 7 72 65
Thanks in advance.

The code I used to solve the issue is pasted below.
import numpy as np

df.sort_values(by=['ITM_CD', 'Country', 'WEEK'], inplace=True)
df['CONCAT'] = df['Country'].astype(str) + df['ITM_CD'].astype(str)

# Number the weeks within each country-item group
df['TMP1'] = 1
grouper = (df['CONCAT'] != df['CONCAT'].shift()).cumsum()
df['WEEK_NO'] = df.groupby(grouper)['TMP1'].cumsum()

# Running forecast total within each group
df['FCST1'] = df.groupby(grouper)['FCST'].cumsum()

# Closing stock = starting stock minus cumulative forecast, floored at 0
result = df['STOCK'] - df['FCST1']
df['CLOSING'] = result
df['CLOSING'] = np.where(df['CLOSING'] < 0, 0, df['CLOSING'])

# Opening stock = starting stock in week 1, otherwise previous week's closing
df['OPENING'] = np.where(df['WEEK_NO'] == 1, df['STOCK'], result.shift(1))
df['OPENING'] = np.where(df['OPENING'] < 0, 0, df['OPENING'])
I have also done some extra manipulation, such as making all negative values 0.
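As an aside, the negative-to-zero step can also be written with clip; a minimal sketch, assuming the column names from the snippet above:

# equivalent to the np.where(... < 0, 0, ...) lines
df['CLOSING'] = df['CLOSING'].clip(lower=0)
df['OPENING'] = df['OPENING'].clip(lower=0)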

This works. combine_first is used to fill the gaps in df.opening (the first week of each group, where shift leaves NaN):
import pandas as pd
df = pd.DataFrame({
'country': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'A', 5: 'A', 6: 'A', 7: 'A', 8: 'A', 9: 'A', 10: 'A', 11: 'A', 12: 'A', 13: 'B', 14: 'B', 15: 'B', 16: 'B', 17: 'B', 18: 'B', 19: 'B', 20: 'B', 21: 'B', 22: 'B'},
'item': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 2, 8: 1, 9: 2, 10: 2, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 2, 17: 3, 18: 2, 19: 1, 20: 2, 21: 1, 22: 3},
'week': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 7, 8: 8, 9: 8, 10: 9, 11: 9, 12: 10, 13: 1, 14: 2, 15: 3, 16: 3, 17: 3, 18: 4, 19: 4, 20: 5, 21: 5, 22: 5},
'stock': {0: 105, 1: 105, 2: 105, 3: 105, 4: 105, 5: 105, 6: 105, 7: 94, 8: 105, 9: 94, 10: 94, 11: 105, 12: 105, 13: 100, 14: 100, 15: 100, 16: 200, 17: 300, 18: 200, 19: 100, 20: 200, 21: 100, 22: 300},
'FCST': {0: 3, 1: 6, 2: 9, 3: 4, 4: 7, 5: 4, 6: 7, 7: 2, 8: 1, 9: -5, 10: 2, 11: 8, 12: 6, 13: 2, 14: 6, 15: 8, 16: 3, 17: 7, 18: 8, 19: 9, 20: 3, 21: 5, 22: 6}
})
groups = df.groupby(["country", "item"])

# Closing stock: starting stock minus the running forecast within each group
df["closing"] = df.stock - groups.FCST.cumsum()

# Opening stock: previous week's closing; NaN in the first week of each group
df["opening"] = groups.closing.shift(1)

# Fill the first week of each group with the starting stock
df["opening"] = df["opening"].combine_first(df.stock)
Outputs:
country item week stock FCST closing opening
0 A 1 1 105 3 102 105.0
1 A 1 2 105 6 96 102.0
2 A 1 3 105 9 87 96.0
3 A 1 4 105 4 83 87.0
4 A 1 5 105 7 76 83.0
5 A 1 6 105 4 72 76.0
6 A 1 7 105 7 65 72.0
7 A 2 7 94 2 92 94.0
8 A 1 8 105 1 64 65.0
9 A 2 8 94 -5 97 92.0
10 A 2 9 94 2 95 97.0
11 A 1 9 105 8 56 64.0
12 A 1 10 105 6 50 56.0
13 B 1 1 100 2 98 100.0
14 B 1 2 100 6 92 98.0
15 B 1 3 100 8 84 92.0
16 B 2 3 200 3 197 200.0
17 B 3 3 300 7 293 300.0
18 B 2 4 200 8 189 197.0
19 B 1 4 100 9 75 84.0
20 B 2 5 200 3 186 189.0
21 B 1 5 100 5 70 75.0
22 B 3 5 300 6 287 293.0
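For anyone unfamiliar with combine_first: it keeps the caller's values and falls back to the other object wherever the caller is NaN. A minimal sketch with made-up values:

import numpy as np
import pandas as pd

a = pd.Series([np.nan, 96.0, 87.0])   # e.g. opening after shift: first week is NaN
b = pd.Series([105.0, 105.0, 105.0])  # starting stock
print(a.combine_first(b).tolist())    # [105.0, 96.0, 87.0]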


Remove rows in dataframe by overlapping groups based on coordinates

I have a dataframe such as
Seq Chrm start end length score
A C1 1 50 49 12
B C1 3 55 52 12
C C1 6 60 54 12
Cbis C1 6 60 54 11
D C1 70 120 50 12
E C1 78 111 33 12
F C2 350 400 50 12
A C2 349 400 51 12
B C2 450 500 50 12
Within each Chrm, for each set of rows whose start-end coordinates overlap, I would like to keep the row with the longest length AND, as a tiebreaker, the highest score.
For example in C1:
Seq Chrm start end length score
A C1 1 50 49 12
B C1 3 55 52 12
C C1 6 60 54 12
Cbis C1 6 60 54 11
D C1 70 120 50 12
E C1 78 111 33 12
The start-end coordinates of A, B, C and Cbis all overlap each other, and those of D and E overlap each other.
In the A, B, C, Cbis group the longest are C and Cbis with 54, so I keep the one with the highest score, which is C (12). In the D, E group, the longest is D with 50.
So I keep only rows C and D here.
If I do the same for other Chrm I should then get the following output:
Seq Chrm start end length score
C C1 6 60 54 12
D C1 70 120 50 12
A C2 349 400 51 12
B C2 450 500 50 12
Here is the dataframe in dict format if it helps:
{'Seq': {0: 'A', 1: 'B', 2: 'C', 3: 'Cbis', 4: 'D', 5: 'E', 6: 'F', 7: 'A', 8: 'B'}, 'Chrm': {0: 'C1', 1: 'C1', 2: 'C1', 3: 'C1', 4: 'C1', 5: 'C1', 6: 'C2', 7: 'C2', 8: 'C2'}, 'start': {0: 1, 1: 3, 2: 6, 3: 6, 4: 70, 5: 78, 6: 350, 7: 349, 8: 450}, 'end': {0: 50, 1: 55, 2: 60, 3: 60, 4: 120, 5: 111, 6: 400, 7: 400, 8: 500}, 'length': {0: 49, 1: 52, 2: 54, 3: 54, 4: 50, 5: 33, 6: 50, 7: 51, 8: 50}, 'score': {0: 12, 1: 12, 2: 12, 3: 11, 4: 12, 5: 12, 6: 12, 7: 12, 8: 12}}
Edit for Corralien:
If I used this table :
Seq Chrm start end length score
A C1 12414 14672 49 12
B C1 12414 14741 52 12
C C1 12414 14744 54 12
It does not place A, B and C in the same overlapping group...
{'Seq': {0: 'A', 1: 'B', 2: 'C'}, 'Chrm': {0: 'C1', 1: 'C1', 2: 'C1'}, 'start': {0: 12414, 1: 12414, 2: 12414}, 'end': {0: 14672, 1: 14741, 2: 14744}, 'length': {0: 49, 1: 52, 2: 54}, 'score': {0: 12, 1: 12, 2: 12}}
Create virtual groups and keep the best row (length, score) for each group:
Suppose this dataframe:
>>> df
Seq Chrm start end length score
0 A C1 1 50 49 12
1 B C1 3 55 52 12
2 C C1 6 60 54 12
3 Cbis C1 6 60 54 11
4 D C1 70 120 50 12
5 E C1 78 111 33 12
6 F C2 350 400 50 12
7 A C2 349 400 51 12
8 B C2 450 500 50 12
9 A C1 12414 14672 49 12
10 B C1 12414 14741 52 12
11 C C1 12414 14744 54 12
Create groups:
# A row starts a new group when its start lies at or beyond the previous row's end
is_overlapped = lambda x: x['start'] >= x['end'].shift(fill_value=-1)

df['group'] = df.sort_values(['Chrm', 'start', 'end']) \
                .groupby('Chrm').apply(is_overlapped).droplevel(0).cumsum()

out = df.sort_values(['group', 'length', 'score'], ascending=[True, False, False]) \
        .groupby(df['group']).head(1)
Output:
>>> out
Seq Chrm start end length score group
2 C C1 6 60 54 12 1
4 D C1 70 120 50 12 2
11 C C1 12414 14744 54 12 3
7 A C2 349 400 51 12 4
8 B C2 450 500 50 12 5
# Groups
>>> df
Seq Chrm start end length score group
0 A C1 1 50 49 12 1
1 B C1 3 55 52 12 1
2 C C1 6 60 54 12 1
3 Cbis C1 6 60 54 11 1
4 D C1 70 120 50 12 2
5 E C1 78 111 33 12 2
6 F C2 350 400 50 12 4
7 A C2 349 400 51 12 4
8 B C2 450 500 50 12 5
9 A C1 12414 14672 49 12 3
10 B C1 12414 14741 52 12 3
11 C C1 12414 14744 54 12 3
You can drop the group column with out.drop(columns='group') but I left it to illustrate the virtual groups.
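To see why this works: within each Chrm (in sorted order), the lambda flags every row whose start lies at or beyond the previous row's end, i.e. every row that opens a new non-overlapping block, and the running sum of those flags yields the group ids. A standalone sketch on the C1 coordinates:

import pandas as pd

s = pd.DataFrame({'start': [1, 3, 6, 70, 78], 'end': [50, 55, 60, 120, 111]})
new_block = s['start'] >= s['end'].shift(fill_value=-1)  # True at each break
print(new_block.cumsum().tolist())  # [1, 1, 1, 2, 2] -> two virtual groups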

I cannot make my ideal DataFrame

There is CSV data like:
No,User,A,B,C,D
1 Tom 100 120 110 90
1 Juddy 89 90 100 110
1 Bob 99 80 90 100
2 Tom 80 100 100 70
2 Juddy 79 90 80 70
2 Bob 88 90 95 90
...
I want to transform this CSV data into a DataFrame like:
     Tom_A  Tom_B  Tom_C  Tom_D  Juddy_A  Juddy_B  Juddy_C  Juddy_D  Bob_A  Bob_B  Bob_C  Bob_D
No
1      100    120    110     90       89       90      100      110     99     80     90    100
2       80    100    100     70       79       90       80       70     88     90     95     90
I ran this code:
import pandas as pd
csv = pd.read_csv("user.csv", header=0, index_col='No', sep=r'\s|,', engine='python')
but the output is not my ideal one. I cannot work out how to build combined columns like Tom_A, Tom_B, Juddy_A, which are not in the CSV.
How should I fix my code?
Setup
df = pd.DataFrame({'No': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2}, 'User': {0: 'Tom', 1: 'Juddy', 2: 'Bob', 3: 'Tom', 4: 'Juddy', 5: 'Bob'}, 'A': {0: 100, 1: 89, 2: 99, 3: 80, 4: 79, 5: 88}, 'B': {0: 120, 1: 90, 2: 80, 3: 100, 4: 90, 5: 90}, 'C': {0: 110, 1: 100, 2: 90, 3: 100, 4: 80, 5: 95}, 'D': {0: 90, 1: 110, 2: 100, 3: 70, 4: 70, 5: 90}})
You want pivot_table:
out = df.pivot_table(index='No', columns='User')
A B C D
User Bob Juddy Tom Bob Juddy Tom Bob Juddy Tom Bob Juddy Tom
No
1 99 89 100 80 90 120 90 100 110 100 110 90
2 88 79 80 90 90 100 95 80 100 90 70 70
To get the prefix:
out.columns = out.columns.swaplevel(0,1).to_series().str.join('_')
Bob_A Juddy_A Tom_A Bob_B Juddy_B Tom_B Bob_C Juddy_C Tom_C Bob_D Juddy_D Tom_D
No
1 99 89 100 80 90 120 90 100 110 100 110 90
2 88 79 80 90 90 100 95 80 100 90 70 70
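Equivalently, the flattened names can be built with a comprehension over the (value, user) column tuples; a small sketch, assuming the same out from pivot_table above:

out.columns = [f'{user}_{metric}' for metric, user in out.columns]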

Pandas - unstack/pivot with multiple index

I have a melted DataFrame I would like to pivot but cannot manage to do so using 2 columns as index.
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': {0: 'XYZ', 1: 'XYZ', 2: 'XYZ', 3: 'XYZ', 4: 'XYZ', 5: 'XYZ', 6: 'XYZ', 7: 'XYZ', 8: 'XYZ', 9: 'XYZ', 10: 'ABC', 11: 'ABC', 12: 'ABC', 13: 'ABC', 14: 'ABC', 15: 'ABC', 16: 'ABC', 17: 'ABC', 18: 'ABC', 19: 'ABC'}, 'B': {0: '01/01/2017', 1: '02/01/2017', 2: '03/01/2017', 3: '04/01/2017', 4: '05/01/2017', 5: '01/01/2017', 6: '02/01/2017', 7: '03/01/2017', 8: '04/01/2017', 9: '05/01/2017', 10: '01/01/2017', 11: '02/01/2017', 12: '03/01/2017', 13: '04/01/2017', 14: '05/01/2017', 15: '01/01/2017', 16: '02/01/2017', 17: '03/01/2017', 18: '04/01/2017', 19: '05/01/2017'}, 'C': {0: 'Price', 1: 'Price', 2: 'Price', 3: 'Price', 4: 'Price', 5: 'Trading', 6: 'Trading', 7: 'Trading', 8: 'Trading', 9: 'Trading', 10: 'Price', 11: 'Price', 12: 'Price', 13: 'Price', 14: 'Price', 15: 'Trading', 16: 'Trading', 17: 'Trading', 18: 'Trading', 19: 'Trading'}, 'D': {0: '100', 1: '101', 2: '102', 3: '103', 4: '104', 5: 'Yes', 6: 'Yes', 7: 'Yes', 8: 'Yes', 9: 'Yes', 10: '50', 11: np.nan, 12: '48', 13: '47', 14: '46', 15: 'Yes', 16: 'No', 17: 'Yes', 18: 'Yes', 19: 'Yes'}})
So:
A B C D
XYZ 01/01/2017 Price 100
XYZ 02/01/2017 Price 101
XYZ 03/01/2017 Price 102
XYZ 04/01/2017 Price 103
XYZ 05/01/2017 Price 104
XYZ 01/01/2017 Trading Yes
XYZ 02/01/2017 Trading Yes
XYZ 03/01/2017 Trading Yes
XYZ 04/01/2017 Trading Yes
XYZ 05/01/2017 Trading Yes
ABC 01/01/2017 Price 50
ABC 02/01/2017 Price
ABC 03/01/2017 Price 48
ABC 04/01/2017 Price 47
ABC 05/01/2017 Price 46
ABC 01/01/2017 Trading Yes
ABC 02/01/2017 Trading No
ABC 03/01/2017 Trading Yes
ABC 04/01/2017 Trading Yes
ABC 05/01/2017 Trading Yes
Would become:
A B Trading Price
ABC 01/01/2017 Yes 50
02/01/2017 No
03/01/2017 Yes 48
04/01/2017 Yes 47
05/01/2017 Yes 46
XYZ 01/01/2017 Yes 100
02/01/2017 Yes 101
03/01/2017 Yes 102
04/01/2017 Yes 103
05/01/2017 Yes 104
or:
ABC XYZ
Trading Price Trading Price
01/01/2017 Yes 50 Yes 100
02/01/2017 No Yes 101
03/01/2017 Yes 48 Yes 102
04/01/2017 Yes 47 Yes 103
05/01/2017 Yes 46 Yes 104
I thought this could simply be done with pivot, but I get an error:
df.pivot(index=['A', 'B'], columns = ['C'], values = ['D'] )
Traceback (most recent call last):
File "<ipython-input-41-afcc34979ff8>", line 1, in <module>
df.pivot(index=['A', 'B'], columns = ['C'], values = ['D'] )
File "C:\Miniconda\lib\site-packages\pandas\core\frame.py", line 3951, in pivot
return pivot(self, index=index, columns=columns, values=values)
File "C:\Miniconda\lib\site-packages\pandas\core\reshape\reshape.py", line 377, in pivot
index=MultiIndex.from_arrays([index, self[columns]]))
File "C:\Miniconda\lib\site-packages\pandas\core\series.py", line 248, in __init__
raise_cast_failure=True)
File "C:\Miniconda\lib\site-packages\pandas\core\series.py", line 3027, in _sanitize_array
raise Exception('Data must be 1-dimensional')
Exception: Data must be 1-dimensional
In R this is quickly done with gather/spread.
Thanks!
Is that what you want?
In [23]: df.pivot_table(index=['A','B'], columns='C', values='D', aggfunc='first')
Out[23]:
C Price Trading
A B
ABC 01/01/2017 50 Yes
02/01/2017 NaN No
03/01/2017 48 Yes
04/01/2017 47 Yes
05/01/2017 46 Yes
XYZ 01/01/2017 100 Yes
02/01/2017 101 Yes
03/01/2017 102 Yes
04/01/2017 103 Yes
05/01/2017 104 Yes
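Note that aggfunc='first' does the real work here: pivot_table aggregates with 'mean' by default, which cannot handle the non-numeric 'Yes'/'No' values in D, while 'first' simply picks the single value present for each (A, B) and C combination.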
I found the following is possible:
df.set_index(['A', 'C', 'B']).unstack().T
Out[59]:
A ABC XYZ
C Price Trading Price Trading
B
D 01/01/2017 50 Yes 100 Yes
02/01/2017 NaN No 101 Yes
03/01/2017 48 Yes 102 Yes
04/01/2017 47 Yes 103 Yes
05/01/2017 46 Yes 104 Yes
And:
df.set_index(['A', 'B', 'C']).unstack()
Out[61]:
D
C Price Trading
A B
ABC 01/01/2017 50 Yes
02/01/2017 NaN No
03/01/2017 48 Yes
04/01/2017 47 Yes
05/01/2017 46 Yes
XYZ 01/01/2017 100 Yes
02/01/2017 101 Yes
03/01/2017 102 Yes
04/01/2017 103 Yes
05/01/2017 104 Yes
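As a side note, newer pandas versions (1.1+, if I remember correctly) accept list-likes for pivot's index argument, so the call that raised the exception above should work directly there, given that the (A, B, C) combinations are unique:

out = df.pivot(index=['A', 'B'], columns='C', values='D')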

Multiple binary columns to one column

I have a CSV file dataset that contains 21 columns, the first 10 columns are numbers and I don't want to change them. The next 10 columns are binary data and contain only 1 and 0 in it, one "1" and the others are "0", and the last column is the given label.
the example data looks like below
2596,51,3,258,0,510,221,232,148,6279,24(10th column),0,0,0,0,0,1(16th column),0,0,0,0,2(the last column)
Suppose I load the data into a matrix: can I keep the first 10 columns and the last column unchanged, and convert the middle 10 columns into one column? After the transformation, I want the column value to be based on the index of the "1" in the row; for the row above, the wanted result is
2596,51,3,258,0,510,221,232,148,6279,24,6(it's 6 because the "1" is on 6th column of the binary data),2 #12 columns in total
Can I achieve this using NumPy, scikit-learn or something else?
This should do it if the data is loaded into a NumPy array (in is a reserved word in Python, so the array is called arr here):
import numpy as np

out = np.c_[arr[:, :11], np.where(arr[:, 11:-1])[1] + 1, arr[:, -1]]
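This works because np.where on a 2-D array returns the row and column indices of the nonzero entries; since every row holds exactly one 1, the column indices (plus 1 for 1-based positions) are exactly the wanted values. A small sketch:

import numpy as np

m = np.array([[0, 0, 1, 0],
              [0, 1, 0, 0]])
rows, cols = np.where(m)  # one nonzero per row, returned in row order
print(cols + 1)           # [3 2]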
from io import StringIO
import pandas as pd

csv = StringIO("2596,51,3,258,0,510,221,232,148,6279,24,0,0,0,0,0,1,0,0,0,0,2"
               "\n1,2,3,4,5,6,7,8,9,10,11,0,0,0,0,1,0,0,0,0,0,1")
df = pd.read_csv(csv, header=None)
df = pd.concat(objs=[df[df.columns[:11]],
                     df[df.columns[11:-1]].idxmax(axis=1) - 10,
                     df[df.columns[-1]]], axis=1)
print(df)
Output:
0 1 2 3 4 5 6 7 8 9 10 0 21
0 2596 51 3 258 0 510 221 232 148 6279 24 6 2
1 1 2 3 4 5 6 7 8 9 10 11 5 1
Data:
In [135]: df
Out[135]:
0 1 2 3 4 5 6 7 8 9 ... 12 13 14 15 16 17 18 19 20 21
0 2596 51 3 258 0 510 221 232 148 6279 ... 0 0 0 0 1 0 0 0 0 2
1 2596 51 3 258 0 510 221 232 148 6279 ... 0 0 0 0 0 0 0 0 1 2
[2 rows x 22 columns]
Solution:
df = pd.read_csv('/path/to/file.csv', header=None)
In [137]: df.iloc[:, :11] \
             .join(df.iloc[:, 11:21].dot(range(1, 11)).to_frame(11)) \
             .join(df.iloc[:, -1])
Out[137]:
0 1 2 3 4 5 6 7 8 9 10 11 21
0 2596 51 3 258 0 510 221 232 148 6279 24 6 2
1 2596 51 3 258 0 510 221 232 148 6279 24 10 2
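The dot call works because each binary row contains exactly one 1, so the dot product with the weights 1..10 returns the 1-based position of that 1. A tiny sketch:

import pandas as pd

binary = pd.DataFrame([[0, 0, 0, 0, 0, 1, 0, 0, 0, 0]])
print(binary.dot(range(1, 11)))  # 0    6  -> the 1 sits in the 6th binary column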
Setup
df = pd.DataFrame({0: {2596: 51},
1: {2596: 3},
2: {2596: 258},
3: {2596: 0},
4: {2596: 510},
5: {2596: 221},
6: {2596: 232},
7: {2596: 148},
8: {2596: 6279},
9: {2596: 24},
10: {2596: 0},
11: {2596: 0},
12: {2596: 0},
13: {2596: 0},
14: {2596: 0},
15: {2596: 1},
16: {2596: 0},
17: {2596: 0},
18: {2596: 0},
19: {2596: 0},
20: {2596: 2}})
Solution
import numpy as np

# find the position of the 1 within the 10 binary columns (columns 10..19 here)
df.iloc[:, 10] = np.argmax(df.iloc[:, 10:20].values, axis=1) + 1
# select the first 10 columns, the position column and the label column
df.iloc[:, list(range(11)) + [20]]
Out[2167]:
0 1 2 3 4 5 6 7 8 9 10 20
2596 51 3 258 0 510 221 232 148 6279 24 6 2

Pandas Very Simple Percent of total size from Group by

I'm having trouble with a seemingly easy operation. What is the most succinct way to get a percent of total from a group-by operation such as df.groupby('col1').size()? My DF after grouping looks like the result below, and I just want the percent of total. I remember using a variation of this statement in the past but cannot get it to work now: percent = totals.div(totals.sum(1), axis=0)
Original DF:
A B C
0 77 3 98
1 77 52 99
2 77 58 61
3 77 3 93
4 77 31 99
5 77 53 51
6 77 2 9
7 72 25 78
8 34 41 34
9 44 95 27
Result:
df1.groupby('A').size() / df1.groupby('A').size().sum()
A
34 0.1
44 0.1
72 0.1
77 0.7
Here is what I came up with so far, which seems a pretty reasonable way to do this:
df.groupby('col1').size().apply(lambda x: float(x) / df.groupby('col1').size().sum() * 100)
I don't know if I'm missing something, but it looks like you could simply do:
df.groupby('A').size() * 100 / len(df)
or
df.groupby('A').size() * 100 / df.shape[0]
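If I'm not mistaken, value_counts(normalize=True) gives the same ratios directly (note it excludes NaN by default, which matters when A has missing values, as in the last example below):

df['A'].value_counts(normalize=True)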
I'm getting good performance (3.73 s) on a DF of shape (3e6, 59) by using:
df.groupby('col1').size().apply(lambda x: float(x) / df.groupby('col1').size().sum() * 100)
How about:
df = pd.DataFrame({'A': {0: 77, 1: 77, 2: 77, 3: 77, 4: 77, 5: 77, 6: 77, 7: 72, 8: 34, 9: None},
'B': {0: 3, 1: 52, 2: 58, 3: 3, 4: 31, 5: 53, 6: 2, 7: 25, 8: 41, 9: 95},
'C': {0: 98, 1: 99, 2: 61, 3: 93, 4: 99, 5: 51, 6: 9, 7: 78, 8: 34, 9: 27}})
>>> df.groupby('A').size().divide(sum(df['A'].notnull()))
A
34 0.111111
72 0.111111
77 0.777778
dtype: float64
>>> df
A B C
0 77 3 98
1 77 52 99
2 77 58 61
3 77 3 93
4 77 31 99
5 77 53 51
6 77 2 9
7 72 25 78
8 34 41 34
9 NaN 95 27
