Month-on-month comparison using Python

We have the dataset below with month-on-month data and need to determine the percentage change between months.
ID Jan 1 Feb 2 Mar 3
1 50 40 60 55 45 37
2 100 92 100 80 100 30
3 110 89 110 0 120 119
4 200 195 0 0 125 120
5 0 0 0 0 125 120
The percentage is calculated as column 1 / Jan * 100 (and likewise 2 / Feb, 3 / Mar).
If that percentage is below 90, we need to list that column's number in the Result column, comma separated.
Expected result:
ID Jan 1 %_1 Feb 2 %_2 Mar 3 %_3 Result
1 50 40 80 60 55 91.67 45 37 82.22 1,3
2 100 92 92 100 80 80 100 30 30 2,3
3 110 89 80.91 110 0 0 120 119 99.17 1,2
4 200 195 97.5 0 0 0 125 120 96 1,2
5 0 0 0 0 0 0 125 120 96 1,2
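For the small table above, a minimal sketch of the computation (column names reconstructed from the question; only the first three rows are used here, to sidestep the division-by-zero handling needed for rows 4 and 5):

```python
import pandas as pd

# Reconstruction of the question's table (first three rows)
df = pd.DataFrame({
    'ID': [1, 2, 3],
    'Jan': [50, 100, 110], '1': [40, 92, 89],
    'Feb': [60, 100, 110], '2': [55, 80, 0],
    'Mar': [45, 100, 120], '3': [37, 30, 119],
})

pairs = [('Jan', '1'), ('Feb', '2'), ('Mar', '3')]
for month, col in pairs:
    # e.g. %_1 = column 1 / Jan * 100
    df[f'%_{col}'] = (df[col] / df[month] * 100).round(2)

pct = df[[f'%_{col}' for _, col in pairs]]
cols = pd.Index([col for _, col in pairs])
# boolean.dot(strings) concatenates the names of the columns below 90:
# True * '1,' gives '1,', False * '1,' gives ''
df['Result'] = pct.lt(90).dot(cols + ',').str.rstrip(',')
print(df[['ID', '%_1', '%_2', '%_3', 'Result']])
```

This reproduces the expected Result values 1,3 / 2,3 / 1,2 for the first three rows.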

EDIT:
import pandas as pd

L = {1: 91.0, 2: 105.0, 3: 96.0, 4: 126.0, 5: 125.0, 6: 139.0, 7: 120.0,
     8: 145.0, 9: 116.0,
     'Apr': 134.0, 'Aug': 150.0, 'Feb': 108.0, 'Jan': 91.0,
     'Jul': 128.0, 'Jun': 147.0,
     'Mar': 102.0, 'May': 134.0, 'Sep': 116.0, 'id': 494}
L1 = {1: 10.0, 2: 105.0, 3: 96.0, 4: 126.0, 5: 20.0, 6: 139.0, 7: 120.0, 8: 52.0, 9: 116.0,
      'Apr': 134.0, 'Aug': 150.0, 'Feb': 108.0, 'Jan': 91.0, 'Jul': 128.0, 'Jun': 147.0,
      'Mar': 102.0, 'May': 134.0, 'Sep': 12.0, 'id': 496}
df = pd.DataFrame([L, L1])
#convert id to index
df1 = df.set_index('id')
#test if columns names are months
mask = pd.to_datetime(df1.columns, format='%b', errors='coerce').notna()
#convert months to categoricals and sorting
df2 = df1.loc[:, mask]
cats = ['Jan', 'Feb', 'Mar', 'Apr','May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
df2.columns = pd.CategoricalIndex(df2.columns, categories=cats, ordered=True)
df2 = df2.sort_index(axis=1)
# print (df2)
#extract not months columns
df3 = df1.loc[:, ~mask]
# print (df3)
#VERY IMPORTANT TEST IF BOTH LENGTHS MATCH
print (len(df2.columns) == len(df3.columns))
#divide by df2 converted to numpy
df4 = df3.div(df2.to_numpy()).mul(100)
# print (df4)
#created new column by dot for matrix multiplication
res = df4.lt(90).dot(df4.columns.astype(str) + ',').str.strip(',')
# print (res)
#dict for replace columns names
d = dict(zip(df3.columns, '%_' + df2.columns.astype(str) + ' ' + df3.columns.astype(str) ))
# print (d)
#join together
df = pd.concat([df3, df4.rename(columns=d), res.rename('Result')], axis=1)
# print (df)
#change ordering
order = [i for x in df3.columns for i in (x, d[x])] + ['Result']
# print (order)
df = df[order]
print (df)
1 %_Jan 1 2 %_Feb 2 3 %_Mar 3 4 %_Apr 4 \
id
494 91.0 100.000000 105.0 97.222222 96.0 94.117647 126.0 94.029851
496 10.0 10.989011 105.0 97.222222 96.0 94.117647 126.0 94.029851
5 %_May 5 6 %_Jun 6 7 %_Jul 7 8 %_Aug 8 \
id
494 125.0 93.283582 139.0 94.557823 120.0 93.75 145.0 96.666667
496 20.0 14.925373 139.0 94.557823 120.0 93.75 52.0 34.666667
9 %_Sep 9 Result
id
494 116.0 100.000000
496 116.0 966.666667 1,5,8

Python Pandas Merge Two Dataframes Based on Another Correspondence Dataframe

I'm working with 2 dataframes. Dataframe1 is for parking sites. Dataframe2 is for sensors. Correspondence dataframe shows which sensor is in which site.
Dataframe1:
Site Time Available Capacity
0 19E 12:00 5 10
1 19E 13:00 4 10
2 44E 12:00 8 22
3 44E 13:00 11 22
Dataframe2:
Sensor Time Temp Precipitation
0 113 12:00 74 0.01
1 113 13:00 76 0.02
2 114 12:00 75 0.00
3 114 13:00 77 0.00
Correspondence dataframe:
Site Sensor
0 19E 113
1 44E 114
2 58E 115
...
I'd like to combine dataframes 1 and 2 based on the correspondence dataframe, and also on the 'Time' column. Both dataframes use 1-hour intervals.
Expected result:
Site Time Available Capacity Sensor Time Temp Precipitation
0 19E 12:00 5 10 113 12:00 74 0.01
1 19E 13:00 4 10 113 13:00 76 0.02
2 44E 12:00 8 22 114 12:00 75 0.00
3 44E 13:00 11 22 114 13:00 77 0.00
You can use the code below to generate raw materials:
import pandas as pd

df1 = pd.DataFrame({
    'Site': {0: '19E', 1: '19E', 2: '44E', 3: '44E'},
    'Time': {0: '12:00', 1: '13:00', 2: '12:00', 3: '13:00'},
    'Available': {0: 5, 1: 4, 2: 8, 3: 11},
    'Capacity': {0: 10, 1: 10, 2: 22, 3: 22}})
df2 = pd.DataFrame({
    'Sensor': {0: 113, 1: 113, 2: 114, 3: 114},
    'Time': {0: '12:00', 1: '13:00', 2: '12:00', 3: '13:00'},
    'Tem': {0: 74, 1: 76, 2: 75, 3: 77},
    'Precipitation': {0: 0.01, 1: 0.02, 2: 0.00, 3: 0.00}})
cor_df = pd.DataFrame({
    'Site': {0: '19E', 1: '44E', 2: '58E'},
    'Sensor': {0: 113, 1: 114, 2: 115}})
Use Series.map to map Site to Sensor and then DataFrame.merge on Sensor and Time:
lookup = cor_df.set_index("Site").squeeze()
res = df1.assign(Sensor=df1["Site"].map(lookup)).merge(df2, on=["Sensor", "Time"])
print(res)
Output
Site Time Available Capacity Sensor Tem Precipitation
0 19E 12:00 5 10 113 74 0.01
1 19E 13:00 4 10 113 76 0.02
2 44E 12:00 8 22 114 75 0.00
3 44E 13:00 11 22 114 77 0.00
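An equivalent sketch chains two merges instead, so no intermediate lookup Series is needed (data reconstructed from the question):

```python
import pandas as pd

df1 = pd.DataFrame({'Site': ['19E', '19E', '44E', '44E'],
                    'Time': ['12:00', '13:00', '12:00', '13:00'],
                    'Available': [5, 4, 8, 11],
                    'Capacity': [10, 10, 22, 22]})
df2 = pd.DataFrame({'Sensor': [113, 113, 114, 114],
                    'Time': ['12:00', '13:00', '12:00', '13:00'],
                    'Tem': [74, 76, 75, 77],
                    'Precipitation': [0.01, 0.02, 0.00, 0.00]})
cor_df = pd.DataFrame({'Site': ['19E', '44E', '58E'],
                       'Sensor': [113, 114, 115]})

# first attach each site's Sensor id, then join the sensor readings on Sensor + Time;
# the default inner join drops sites (like 58E) with no rows in df1
res = df1.merge(cor_df, on='Site').merge(df2, on=['Sensor', 'Time'])
print(res)
```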

Remove rows in a dataframe by overlapping groups based on coordinates

I have a dataframe such as
Seq Chrm start end length score
A C1 1 50 49 12
B C1 3 55 52 12
C C1 6 60 54 12
Cbis C1 6 60 54 11
D C1 70 120 50 12
E C1 78 111 33 12
F C2 350 400 50 12
A C2 349 400 51 12
B C2 450 500 50 12
Within each specific Chrm, for each group of overlapping start/end coordinates, I would like to keep the row with the longest length value and, in case of a tie, the highest score value.
For example in C1:
Seq Chrm start end length score
A C1 1 50 49 12
B C1 3 55 52 12
C C1 6 60 54 12
Cbis C1 6 60 54 11
D C1 70 120 50 12
E C1 78 111 33 12
The coordinates from start to end of A, B, C and Cbis all overlap, and D and E overlap each other.
In the A, B, C, Cbis group the longest are C and Cbis with length 54; between those I keep the one with the highest score, which is C (12). In the D, E group the longest is D with length 50.
So I keep only the rows C and D here.
If I do the same for other Chrm I should then get the following output:
Seq Chrm start end length score
C C1 6 60 54 12
D C1 70 120 50 12
A C2 349 400 51 12
B C2 450 500 50 12
Here is the dataframe in dic format if it can help :
{'Seq': {0: 'A', 1: 'B', 2: 'C', 3: 'Cbis', 4: 'D', 5: 'E', 6: 'F', 7: 'A', 8: 'B'}, 'Chrm': {0: 'C1', 1: 'C1', 2: 'C1', 3: 'C1', 4: 'C1', 5: 'C1', 6: 'C2', 7: 'C2', 8: 'C2'}, 'start': {0: 1, 1: 3, 2: 6, 3: 6, 4: 70, 5: 78, 6: 350, 7: 349, 8: 450}, 'end': {0: 50, 1: 55, 2: 60, 3: 60, 4: 120, 5: 111, 6: 400, 7: 400, 8: 500}, 'length': {0: 49, 1: 52, 2: 54, 3: 54, 4: 50, 5: 33, 6: 50, 7: 51, 8: 50}, 'score': {0: 12, 1: 12, 2: 12, 3: 11, 4: 12, 5: 12, 6: 12, 7: 12, 8: 12}}
Edit for Corralien :
If I used this table :
Seq Chrm start end length score
A C1 12414 14672 49 12
B C1 12414 14741 52 12
C C1 12414 14744 54 12
It does not place A, B and C in the same overlapping group...
{'Seq': {0: 'A', 1: 'B', 2: 'C'}, 'Chrm': {0: 'C1', 1: 'C1', 2: 'C1'}, 'start': {0: 12414, 1: 12414, 2: 12414}, 'end': {0: 14672, 1: 14741, 2: 14744}, 'length': {0: 49, 1: 52, 2: 54}, 'score': {0: 12, 1: 12, 2: 12}}
Create virtual groups and keep the best row (length, score) for each group:
Suppose this dataframe:
>>> df
Seq Chrm start end length score
0 A C1 1 50 49 12
1 B C1 3 55 52 12
2 C C1 6 60 54 12
3 Cbis C1 6 60 54 11
4 D C1 70 120 50 12
5 E C1 78 111 33 12
6 F C2 350 400 50 12
7 A C2 349 400 51 12
8 B C2 450 500 50 12
9 A C1 12414 14672 49 12
10 B C1 12414 14741 52 12
11 C C1 12414 14744 54 12
Create groups:
is_overlapped = lambda x: x['start'] >= x['end'].shift(fill_value=-1)
df['group'] = df.sort_values(['Chrm', 'start', 'end']) \
                .groupby('Chrm').apply(is_overlapped).droplevel(0).cumsum()
out = df.sort_values(['group', 'length', 'score'], ascending=[True, False, False]) \
        .groupby(df['group']).head(1)
Output:
>>> out
Seq Chrm start end length score group
2 C C1 6 60 54 12 1
4 D C1 70 120 50 12 2
11 C C1 12414 14744 54 12 3
7 A C2 349 400 51 12 4
8 B C2 450 500 50 12 5
# Groups
>>> df
Seq Chrm start end length score group
0 A C1 1 50 49 12 1
1 B C1 3 55 52 12 1
2 C C1 6 60 54 12 1
3 Cbis C1 6 60 54 11 1
4 D C1 70 120 50 12 2
5 E C1 78 111 33 12 2
6 F C2 350 400 50 12 4
7 A C2 349 400 51 12 4
8 B C2 450 500 50 12 5
9 A C1 12414 14672 49 12 3
10 B C1 12414 14741 52 12 3
11 C C1 12414 14744 54 12 3
You can drop the group column with out.drop(columns='group') but I left it to illustrate the virtual groups.
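Note that comparing each start only with the immediately previous end can split a group when a short interval is nested inside a longer one. A sketch of a more defensive variant compares against the running maximum of end per Chrm instead (same data as the question's dict):

```python
import pandas as pd

df = pd.DataFrame({'Seq': ['A', 'B', 'C', 'Cbis', 'D', 'E', 'F', 'A', 'B'],
                   'Chrm': ['C1'] * 6 + ['C2'] * 3,
                   'start': [1, 3, 6, 6, 70, 78, 350, 349, 450],
                   'end': [50, 55, 60, 60, 120, 111, 400, 400, 500],
                   'length': [49, 52, 54, 54, 50, 33, 50, 51, 50],
                   'score': [12, 12, 12, 11, 12, 12, 12, 12, 12]})

df_sorted = df.sort_values(['Chrm', 'start', 'end'])
# a row starts a new group only when it begins at or after every end seen so far
prev_max_end = df_sorted.groupby('Chrm')['end'].transform(
    lambda s: s.cummax().shift(fill_value=-1))
df['group'] = (df_sorted['start'] >= prev_max_end).cumsum()

# keep the best row (longest, then highest score) per group
out = (df.sort_values(['group', 'length', 'score'], ascending=[True, False, False])
         .groupby('group').head(1))
print(out)
```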

Retain value in Python for groups

I have data which looks like the below.
country item week stock FCST
A 1 1 105 3
A 1 2 105 6
A 1 3 105 9
A 1 4 105 4
A 1 5 105 7
A 1 6 105 4
A 1 7 105 7
The task I wish to perform is assigning the closing stock of the current week as the opening stock of the next week. In the table above my stock quantity was 105 at the very beginning; based on the forecast (FCST column) it decreases each week to give the closing stock, and that closing stock should then become the opening stock of the following week.
I have done the same in SAS using the retain statement, but I have no idea how to replicate it in Python.
Note that this operation must be performed for every country-item combination (we cannot always just shift the value to get the opening stock, as a new item might have a different current stock).
Can anyone help me out with this?
My Output should look like below table.
country item week stock FCST OPENING_STOCK CLOSING_STK
A 1 1 105 3 105 102
A 1 2 105 6 102 96
A 1 3 105 9 96 87
A 1 4 105 4 87 83
A 1 5 105 7 83 76
A 1 6 105 4 76 72
A 1 7 105 7 72 65
Thanks in advance.
The code i have used to solve the issue is pasted below.
df.sort_values(by=['ITM_CD','Country','WEEK'],inplace=True)
df['CONCAT']=df['Country']+df['ITM_CD']
#CALCULATE BEGINING STOCK EVERY WEEK
df['TMP1']=1
grouper = (df["CONCAT"]!= df["CONCAT"].shift()).cumsum()
df["WEEK_NO"] = df.groupby(grouper)['TMP1'].cumsum()
df["FCST1"] = df.groupby(grouper)['FCST'].cumsum()
result = df.CURR_STCK_TOT - df.FCST1
df["CLOSING"] = result
df["CLOSING"] = np.where(df["CLOSING"]<0,0,df["CLOSING"])
df["OPENING"] = np.where(df["WEEK_NO"]==1,df["STOCK"],result.shift(1))
df["OPENING"] = np.where(df["OPENING"]<0,0,df["OPENING"])
I also did some extra manipulation, like clamping all negative values to 0.
Now it works.
combine_first is used to fill gaps in df.opening
import pandas as pd
df = pd.DataFrame({
'country': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'A', 5: 'A', 6: 'A', 7: 'A', 8: 'A', 9: 'A', 10: 'A', 11: 'A', 12: 'A', 13: 'B', 14: 'B', 15: 'B', 16: 'B', 17: 'B', 18: 'B', 19: 'B', 20: 'B', 21: 'B', 22: 'B'},
'item': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 2, 8: 1, 9: 2, 10: 2, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 2, 17: 3, 18: 2, 19: 1, 20: 2, 21: 1, 22: 3},
'week': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 7, 8: 8, 9: 8, 10: 9, 11: 9, 12: 10, 13: 1, 14: 2, 15: 3, 16: 3, 17: 3, 18: 4, 19: 4, 20: 5, 21: 5, 22: 5},
'stock': {0: 105, 1: 105, 2: 105, 3: 105, 4: 105, 5: 105, 6: 105, 7: 94, 8: 105, 9: 94, 10: 94, 11: 105, 12: 105, 13: 100, 14: 100, 15: 100, 16: 200, 17: 300, 18: 200, 19: 100, 20: 200, 21: 100, 22: 300},
'FCST': {0: 3, 1: 6, 2: 9, 3: 4, 4: 7, 5: 4, 6: 7, 7: 2, 8: 1, 9: -5, 10: 2, 11: 8, 12: 6, 13: 2, 14: 6, 15: 8, 16: 3, 17: 7, 18: 8, 19: 9, 20: 3, 21: 5, 22: 6}
})
df_new = pd.DataFrame(columns=df.columns)
groups = df.groupby(["country", "item"])
df["closing"] = df.stock - groups.FCST.cumsum()
df["opening"] = groups.closing.shift(1)
df["opening"] = df["opening"].combine_first(df.stock)
Outputs:
country item week stock FCST closing opening
0 A 1 1 105 3 102 105.0
1 A 1 2 105 6 96 102.0
2 A 1 3 105 9 87 96.0
3 A 1 4 105 4 83 87.0
4 A 1 5 105 7 76 83.0
5 A 1 6 105 4 72 76.0
6 A 1 7 105 7 65 72.0
7 A 2 7 94 2 92 94.0
8 A 1 8 105 1 64 65.0
9 A 2 8 94 -5 97 92.0
10 A 2 9 94 2 95 97.0
11 A 1 9 105 8 56 64.0
12 A 1 10 105 6 50 56.0
13 B 1 1 100 2 98 100.0
14 B 1 2 100 6 92 98.0
15 B 1 3 100 8 84 92.0
16 B 2 3 200 3 197 200.0
17 B 3 3 300 7 293 300.0
18 B 2 4 200 8 189 197.0
19 B 1 4 100 9 75 84.0
20 B 2 5 200 3 186 189.0
21 B 1 5 100 5 70 75.0
22 B 3 5 300 6 287 293.0
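If the closing stock must be floored at zero, as in the question's np.where step, the per-group cumulative sum no longer works, because hitting the floor resets the running balance. A hypothetical per-group loop (column names taken from the question, shown on the single-group sample):

```python
import pandas as pd

df = pd.DataFrame({'country': ['A'] * 7, 'item': [1] * 7,
                   'week': range(1, 8), 'stock': [105] * 7,
                   'FCST': [3, 6, 9, 4, 7, 4, 7]})

def run_stock(g):
    # walk the weeks in order, carrying each closing stock into the next opening
    opening = g['stock'].iloc[0]
    rows = []
    for fcst in g['FCST']:
        closing = max(opening - fcst, 0)  # floor at zero
        rows.append((opening, closing))
        opening = closing
    return pd.DataFrame(rows, columns=['OPENING_STOCK', 'CLOSING_STK'], index=g.index)

out = pd.concat([df, df.groupby(['country', 'item'], group_keys=False).apply(run_stock)],
                axis=1)
print(out)
```

This reproduces the expected output table (opening 105 down to 72, closing 102 down to 65).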

Find and replace the values in between two indexes in a pandas dataframe

I am new to Python and pandas. I have two dataframes, which are like,
df1 =
DocumentId offset feature
0 0 2000
0 7 2000
0 16 0
0 27 0
0 36 0
0 40 0
0 46 0
0 57 0
0 63 0
0 78 0
0 88 0
0 91 0
0 103 2200
1 109 0
1 113 2200
1 126 2200
1 131 2200
1 137 2200
1 142 0
1 152 0
1 157 200
1 159 200
1 161 200
1 167 0
1 170 200
Now, I have another dataframe which looks like:
start end Previous_Three Next_Three
7.0 103.0 [2000.0, 2000.0] [2200.0, 0.0, 2200.0]
103.0 113.0 [2200.0, 0.0, 0.0] [2200.0, 2200.0, 2200.0]
137.0 157.0 [2200.0, 2200.0, 2200.0] [200.0, 200.0, 200.0]
161.0 170.0 [200.0, 200.0, 200.0] [200.0, 0.0, 200.0]
Now, these start and end values are offsets from the first dataframe.
I am trying to replace the 0s in the feature column of the first dataframe.
In df1, between offsets 7 and 103, all the feature values are 0.
My logic is: if Previous_Three and Next_Three are equal, then I replace all the values between them in the first dataframe with the value from that array.
So if it matches (in my given data it does not match), the output would look like:
DocumentId offset feature
0 0 2000
0 7 2000
0 16 2000
0 27 2000
0 36 2000
0 40 2000
0 46 2000
0 57 2000
0 63 2000
0 78 2000
0 88 2000
0 91 2000
0 103 2200
So, it will be like this.
The same goes for the next start and end offsets: the values between 103 and 113, where the arrays to check are [2200.0, 0.0, 0.0] and [2200.0, 2200.0, 2200.0].
What I tried is:
def printfun(start, end, previous_three, next_three):
    # print(np.array_equal(previous_three, next_three))
    if np.array_equal(previous_three, next_three):
        print('going', start)
        start_index = list(final_output[final_output['OFFSET'] == start].index)[0]
        end_index = list(final_output[final_output['OFFSET'] == end].index)
        print(start_index)

But I don't understand how to go further from here. Can anyone help me with this?

for i, row in final_output.iterrows():
    value = final_output.loc[start_index:end_index]
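Since both the equality check and the fill are row-wise over the second dataframe, a hedged sketch of the described logic could look like the following (shown on a miniature df1/df2 whose Previous_Three and Next_Three do match, so the fill actually fires; column names taken from the question):

```python
import numpy as np
import pandas as pd

# Miniature df1: zeros sit between offsets 7 and 103
df1 = pd.DataFrame({'DocumentId': [0] * 6,
                    'offset': [0, 7, 16, 27, 91, 103],
                    'feature': [2000.0, 2000.0, 0.0, 0.0, 0.0, 2200.0]})
# One span whose surrounding triplets happen to be equal
df2 = pd.DataFrame({'start': [7.0], 'end': [103.0],
                    'Previous_Three': [[2000.0, 2000.0, 2000.0]],
                    'Next_Three': [[2000.0, 2000.0, 2000.0]]})

for _, row in df2.iterrows():
    if np.array_equal(row['Previous_Three'], row['Next_Three']):
        fill = row['Previous_Three'][0]
        # rows strictly between start and end whose feature is 0
        between = df1['offset'].between(row['start'], row['end'], inclusive='neither')
        df1.loc[between & df1['feature'].eq(0), 'feature'] = fill

print(df1)
```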

Multiple binary columns to one column

I have a CSV dataset with 21 columns. The first 10 columns are numbers and I don't want to change them. The next 10 columns are binary, with exactly one "1" per row and the rest "0". The last column is the label.
the example data looks like below
2596,51,3,258,0,510,221,232,148,6279,24(10th column),0,0,0,0,0,1(16th column),0,0,0,0,2(the last column)
Suppose I load the data into a matrix, can I keep the first 10 columns and the last column unchanged, and convert the middle 10 columns into one column? After transformation, I want the column value to be based on the index of the "1" in the row, like the row above, the wanted result is
2596,51,3,258,0,510,221,232,148,6279,24,6(it's 6 because the "1" is on 6th column of the binary data),2 #12 columns in total
Can I achieve this using NumPy, scikit-learn or something else?
This should do it if the data is loaded into a NumPy array arr (in is a Python keyword, so it can't be used as the variable name):
out = np.c_[arr[:, :11], np.where(arr[:, 11:-1])[1] + 1, arr[:, -1]]
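A quick runnable check of that one-liner on the row from the question (using arr as the variable name, since in is reserved in Python):

```python
import numpy as np

arr = np.array([[2596, 51, 3, 258, 0, 510, 221, 232, 148, 6279, 24,
                 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2]])
# keep columns 0-10, replace the 10 binary columns with the 1-based
# position of the "1" (np.where on the slice returns its column index),
# and keep the label column
out = np.c_[arr[:, :11], np.where(arr[:, 11:-1])[1] + 1, arr[:, -1]]
print(out)
```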
from io import StringIO
import pandas as pd

csv = StringIO("2596,51,3,258,0,510,221,232,148,6279,24,0,0,0,0,0,1,0,0,0,0,2"
               "\n1,2,3,4,5,6,7,8,9,10,11,0,0,0,0,1,0,0,0,0,0,1")
df = pd.read_csv(csv, header=None)
df = pd.concat(objs=[df[df.columns[:11]],
                     df[df.columns[11:-1]].idxmax(axis=1) - 10,
                     df[df.columns[-1]]], axis=1)
print(df)
Output:
0 1 2 3 4 5 6 7 8 9 10 0 21
0 2596 51 3 258 0 510 221 232 148 6279 24 6 2
1 1 2 3 4 5 6 7 8 9 10 11 5 1
Data:
In [135]: df
Out[135]:
0 1 2 3 4 5 6 7 8 9 ... 12 13 14 15 16 17 18 19 20 21
0 2596 51 3 258 0 510 221 232 148 6279 ... 0 0 0 0 1 0 0 0 0 2
1 2596 51 3 258 0 510 221 232 148 6279 ... 0 0 0 0 0 0 0 0 1 2
[2 rows x 22 columns]
Solution:
df = pd.read_csv('/path/to/file.csv', header=None)
In [137]: df.iloc[:, :11] \
              .join(df.iloc[:, 11:21].dot(range(1, 11)).to_frame(11)) \
              .join(df.iloc[:, -1])
Out[137]:
0 1 2 3 4 5 6 7 8 9 10 11 21
0 2596 51 3 258 0 510 221 232 148 6279 24 6 2
1 2596 51 3 258 0 510 221 232 148 6279 24 10 2
Setup
df = pd.DataFrame({0: {2596: 51},
1: {2596: 3},
2: {2596: 258},
3: {2596: 0},
4: {2596: 510},
5: {2596: 221},
6: {2596: 232},
7: {2596: 148},
8: {2596: 6279},
9: {2596: 24},
10: {2596: 0},
11: {2596: 0},
12: {2596: 0},
13: {2596: 0},
14: {2596: 0},
15: {2596: 1},
16: {2596: 0},
17: {2596: 0},
18: {2596: 0},
19: {2596: 0},
20: {2596: 2}})
Solution
#find the index of the column with value 1 within the 10 columns
df.iloc[:,10] = np.argmax(df.iloc[:,10:20].values,axis=1)+1
#select the first 10 columns, the position column and the label column
df.iloc[:,list(range(11))+[20]]
Out[2167]:
0 1 2 3 4 5 6 7 8 9 10 20
2596 51 3 258 0 510 221 232 148 6279 24 6 2
