I have a melted DataFrame that I would like to pivot, but I cannot manage to do so using two columns as the index.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': {0: 'XYZ', 1: 'XYZ', 2: 'XYZ', 3: 'XYZ', 4: 'XYZ', 5: 'XYZ', 6: 'XYZ', 7: 'XYZ', 8: 'XYZ', 9: 'XYZ', 10: 'ABC', 11: 'ABC', 12: 'ABC', 13: 'ABC', 14: 'ABC', 15: 'ABC', 16: 'ABC', 17: 'ABC', 18: 'ABC', 19: 'ABC'},
    'B': {0: '01/01/2017', 1: '02/01/2017', 2: '03/01/2017', 3: '04/01/2017', 4: '05/01/2017', 5: '01/01/2017', 6: '02/01/2017', 7: '03/01/2017', 8: '04/01/2017', 9: '05/01/2017', 10: '01/01/2017', 11: '02/01/2017', 12: '03/01/2017', 13: '04/01/2017', 14: '05/01/2017', 15: '01/01/2017', 16: '02/01/2017', 17: '03/01/2017', 18: '04/01/2017', 19: '05/01/2017'},
    'C': {0: 'Price', 1: 'Price', 2: 'Price', 3: 'Price', 4: 'Price', 5: 'Trading', 6: 'Trading', 7: 'Trading', 8: 'Trading', 9: 'Trading', 10: 'Price', 11: 'Price', 12: 'Price', 13: 'Price', 14: 'Price', 15: 'Trading', 16: 'Trading', 17: 'Trading', 18: 'Trading', 19: 'Trading'},
    'D': {0: '100', 1: '101', 2: '102', 3: '103', 4: '104', 5: 'Yes', 6: 'Yes', 7: 'Yes', 8: 'Yes', 9: 'Yes', 10: '50', 11: np.nan, 12: '48', 13: '47', 14: '46', 15: 'Yes', 16: 'No', 17: 'Yes', 18: 'Yes', 19: 'Yes'}})
So:
A B C D
XYZ 01/01/2017 Price 100
XYZ 02/01/2017 Price 101
XYZ 03/01/2017 Price 102
XYZ 04/01/2017 Price 103
XYZ 05/01/2017 Price 104
XYZ 01/01/2017 Trading Yes
XYZ 02/01/2017 Trading Yes
XYZ 03/01/2017 Trading Yes
XYZ 04/01/2017 Trading Yes
XYZ 05/01/2017 Trading Yes
ABC 01/01/2017 Price 50
ABC 02/01/2017 Price
ABC 03/01/2017 Price 48
ABC 04/01/2017 Price 47
ABC 05/01/2017 Price 46
ABC 01/01/2017 Trading Yes
ABC 02/01/2017 Trading No
ABC 03/01/2017 Trading Yes
ABC 04/01/2017 Trading Yes
ABC 05/01/2017 Trading Yes
Would become:
A B Trading Price
ABC 01/01/2017 Yes 50
02/01/2017 No
03/01/2017 Yes 48
04/01/2017 Yes 47
05/01/2017 Yes 46
XYZ 01/01/2017 Yes 100
02/01/2017 Yes 101
03/01/2017 Yes 102
04/01/2017 Yes 103
05/01/2017 Yes 104
or:
ABC XYZ
Trading Price Trading Price
01/01/2017 Yes 50 Yes 100
02/01/2017 No Yes 101
03/01/2017 Yes 48 Yes 102
04/01/2017 Yes 47 Yes 103
05/01/2017 Yes 46 Yes 104
I thought this could simply be done with pivot, but I get an error:
df.pivot(index=['A', 'B'], columns = ['C'], values = ['D'] )
Traceback (most recent call last):
File "<ipython-input-41-afcc34979ff8>", line 1, in <module>
df.pivot(index=['A', 'B'], columns = ['C'], values = ['D'] )
File "C:\Miniconda\lib\site-packages\pandas\core\frame.py", line 3951, in pivot
return pivot(self, index=index, columns=columns, values=values)
File "C:\Miniconda\lib\site-packages\pandas\core\reshape\reshape.py", line 377, in pivot
index=MultiIndex.from_arrays([index, self[columns]]))
File "C:\Miniconda\lib\site-packages\pandas\core\series.py", line 248, in __init__
raise_cast_failure=True)
File "C:\Miniconda\lib\site-packages\pandas\core\series.py", line 3027, in _sanitize_array
raise Exception('Data must be 1-dimensional')
Exception: Data must be 1-dimensional
In R this would quickly be done with gather/spread.
Thanks!
Is that what you want?
In [23]: df.pivot_table(index=['A','B'], columns='C', values='D', aggfunc='first')
Out[23]:
C Price Trading
A B
ABC 01/01/2017 50 Yes
02/01/2017 NaN No
03/01/2017 48 Yes
04/01/2017 47 Yes
05/01/2017 46 Yes
XYZ 01/01/2017 100 Yes
02/01/2017 101 Yes
03/01/2017 102 Yes
04/01/2017 103 Yes
05/01/2017 104 Yes
I found the following is possible:
df.set_index(['A', 'C', 'B']).unstack().T
Out[59]:
A ABC XYZ
C Price Trading Price Trading
B
D 01/01/2017 50 Yes 100 Yes
02/01/2017 NaN No 101 Yes
03/01/2017 48 Yes 102 Yes
04/01/2017 47 Yes 103 Yes
05/01/2017 46 Yes 104 Yes
And:
df.set_index(['A', 'B', 'C']).unstack()
Out[61]:
D
C Price Trading
A B
ABC 01/01/2017 50 Yes
02/01/2017 NaN No
03/01/2017 48 Yes
04/01/2017 47 Yes
05/01/2017 46 Yes
XYZ 01/01/2017 100 Yes
02/01/2017 101 Yes
03/01/2017 102 Yes
04/01/2017 103 Yes
05/01/2017 104 Yes
I'm working with two dataframes: dataframe 1 holds parking sites and dataframe 2 holds sensors. A correspondence dataframe shows which sensor is at which site.
Dataframe1:
Site Time Available Capacity
0 19E 12:00 5 10
1 19E 13:00 4 10
2 44E 12:00 8 22
3 44E 13:00 11 22
Dataframe2:
Sensor Time Temp Precipitation
0 113 12:00 74 0.01
1 113 13:00 76 0.02
2 114 12:00 75 0.00
3 114 13:00 77 0.00
Correspondence dataframe:
Site Sensor
0 19E 113
1 44E 114
2 58E 115
...
I'd like to combine dataframes 1 and 2 based on the correspondence dataframe and on the 'Time' column. The time intervals are both one hour in the two dataframes.
Expected result:
Site Time Available Capacity Sensor Time Temp Precipitation
0 19E 12:00 5 10 113 12:00 74 0.01
1 19E 13:00 4 10 113 13:00 76 0.02
2 44E 12:00 8 22 114 12:00 75 0.00
3 44E 13:00 11 22 114 13:00 77 0.00
You can use the code below to generate the raw data:
import pandas as pd
df1 = pd.DataFrame({
'Site': {0: '19E', 1: '19E', 2: '44E', 3: '44E'},
'Time': {0: '12:00', 1: '13:00', 2: '12:00', 3: '13:00'},
'Available': {0: 5, 1: 4, 2: 8, 3: 11},
'Capacity': {0: 10, 1: 10, 2: 22, 3: 22}})
df2 = pd.DataFrame({
'Sensor': {0: 113, 1: 113, 2: 114, 3: 114},
'Time': {0: '12:00', 1: '13:00', 2: '12:00', 3: '13:00'},
'Temp': {0: 74, 1: 76, 2: 75, 3: 77},
'Precipitation': {0: 0.01, 1: 0.02, 2: 0.00, 3: 0.00}})
cor_df = pd.DataFrame({
'Site': {0: '19E', 1: '44E', 2: '58E'},
'Sensor': {0: 113, 1: 114, 2: 115}})
Use Series.map to map Site to Sensor and then DataFrame.merge on Sensor and Time:
lookup = cor_df.set_index("Site").squeeze()
res = df1.assign(Sensor=df1["Site"].map(lookup)).merge(df2, on=["Sensor", "Time"])
print(res)
Output
Site Time Available Capacity Sensor Temp Precipitation
0 19E 12:00 5 10 113 74 0.01
1 19E 13:00 4 10 113 76 0.02
2 44E 12:00 8 22 114 75 0.00
3 44E 13:00 11 22 114 77 0.00
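An equivalent route, sketched here under the assumption that an inner join is acceptable, chains two plain merges through the correspondence table instead of building a lookup Series:

```python
import pandas as pd

# Minimal copies of the question's frames ('Temp' as in the question text).
df1 = pd.DataFrame({'Site': ['19E', '19E', '44E', '44E'],
                    'Time': ['12:00', '13:00', '12:00', '13:00'],
                    'Available': [5, 4, 8, 11],
                    'Capacity': [10, 10, 22, 22]})
df2 = pd.DataFrame({'Sensor': [113, 113, 114, 114],
                    'Time': ['12:00', '13:00', '12:00', '13:00'],
                    'Temp': [74, 76, 75, 77],
                    'Precipitation': [0.01, 0.02, 0.00, 0.00]})
cor_df = pd.DataFrame({'Site': ['19E', '44E', '58E'],
                       'Sensor': [113, 114, 115]})

# The first merge attaches the Sensor id to each site row; the second joins
# the sensor readings on both Sensor and Time. Sites without sensor data
# (here 58E) drop out because merge defaults to an inner join.
res = df1.merge(cor_df, on='Site').merge(df2, on=['Sensor', 'Time'])
print(res)
```

Pass how='left' to the second merge if site rows without a matching sensor reading should be kept with NaNs.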
We have the dataset below with month-on-month data and need to determine the percentage between months.
ID Jan 1 Feb 2 Mar 3
1 50 40 60 55 45 37
2 100 92 100 80 100 30
3 110 89 110 0 120 119
4 200 195 0 0 125 120
5 0 0 0 0 125 120
The percentage is calculated as 1/Jan*100, i.e. column "1" divided by column "Jan" times 100 (and likewise "2"/Feb, "3"/Mar).
If a percentage is below 90, we need to mark that column number in the Result column, comma separated.
Expected result:
ID Jan 1 %_1 Feb 2 %_2 Mar 3 %_3 Result
1 50 40 80 60 55 91.67 45 37 82.22 1,3
2 100 92 92 100 80 80 100 30 30 2,3
3 110 89 80.91 110 0 0 120 119 99.17 1,2
4 200 195 97.5 0 0 0 125 120 96 1,2
5 0 0 0 0 0 0 125 120 96 1,2
EDIT:
L = {1: 91.0, 2: 105.0, 3: 96.0, 4: 126.0, 5: 125.0, 6: 139.0, 7: 120.0,
8: 145.0, 9: 116.0,
'Apr': 134.0, 'Aug': 150.0, 'Feb': 108.0, 'Jan': 91.0,
'Jul': 128.0, 'Jun': 147.0,
'Mar': 102.0, 'May': 134.0, 'Sep': 116.0, 'id': 494}
L1 = {1: 10.0, 2: 105.0, 3: 96.0, 4: 126.0, 5: 20.0, 6: 139.0, 7: 120.0, 8: 52.0, 9: 116.0,
'Apr': 134.0, 'Aug': 150.0, 'Feb': 108.0, 'Jan': 91.0, 'Jul': 128.0, 'Jun': 147.0,
'Mar': 102.0, 'May': 134.0, 'Sep': 12.0, 'id': 496}
df = pd.DataFrame([L, L1])
#convert id to index
df1 = df.set_index('id')
#test if columns names are months
mask = pd.to_datetime(df1.columns, format='%b', errors='coerce').notna()
#convert months to categoricals and sorting
df2 = df1.loc[:, mask]
cats = ['Jan', 'Feb', 'Mar', 'Apr','May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
df2.columns = pd.CategoricalIndex(df2.columns, categories=cats, ordered=True)
df2 = df2.sort_index(axis=1)
# print (df2)
#extract not months columns
df3 = df1.loc[:, ~mask]
# print (df3)
#VERY IMPORTANT TEST IF BOTH LENGTHS MATCH
print (len(df2.columns) == len(df3.columns))
#divide by df2 converted to numpy
df4 = df3.div(df2.to_numpy()).mul(100)
# print (df4)
#created new column by dot for matrix multiplication
res = df4.lt(90).dot(df4.columns.astype(str) + ',').str.strip(',')
# print (res)
#dict for replace columns names
d = dict(zip(df3.columns, '%_' + df2.columns.astype(str) + ' ' + df3.columns.astype(str) ))
# print (d)
#join together
df = pd.concat([df3, df4.rename(columns=d), res.rename('Result')], axis=1)
# print (df)
#change ordering
order = [i for x in df3.columns for i in (x, d[x])] + ['Result']
# print (order)
df = df[order]
print (df)
1 %_Jan 1 2 %_Feb 2 3 %_Mar 3 4 %_Apr 4 \
id
494 91.0 100.000000 105.0 97.222222 96.0 94.117647 126.0 94.029851
496 10.0 10.989011 105.0 97.222222 96.0 94.117647 126.0 94.029851
5 %_May 5 6 %_Jun 6 7 %_Jul 7 8 %_Aug 8 \
id
494 125.0 93.283582 139.0 94.557823 120.0 93.75 145.0 96.666667
496 20.0 14.925373 139.0 94.557823 120.0 93.75 52.0 34.666667
9 %_Sep 9 Result
id
494 116.0 100.000000
496 116.0 966.666667 1,5,8
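The .dot step above is the core trick: matrix-multiplying a boolean mask by the column labels (each suffixed with a comma) concatenates, per row, the labels of the True cells, because True * 'x,' gives 'x,' and False * 'x,' gives ''. A stripped-down illustration with made-up percentages:

```python
import pandas as pd

# Three percentage columns for two rows; column labels are the month numbers.
pct = pd.DataFrame({1: [80.0, 95.0], 2: [91.7, 80.0], 3: [82.2, 30.0]})

mask = pct.lt(90)                        # True where the value is below 90
labels = mask.columns.astype(str) + ','  # '1,', '2,', '3,'
# Boolean @ string dot product concatenates the labels of the True cells,
# then the trailing comma is stripped off.
result = mask.dot(labels).str.strip(',')
print(result.tolist())  # ['1,3', '2,3']
```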
I have data which looks like below.
country item week stock FCST
A 1 1 105 3
A 1 2 105 6
A 1 3 105 9
A 1 4 105 4
A 1 5 105 7
A 1 6 105 4
A 1 7 105 7
The task I wish to perform is assigning the closing stock of the current week as the opening stock of the next week. In the table above my stock quantity was 105 at the very beginning; based on the forecast (FCST column) it decreases, giving the closing stock of the same week. That closing stock should then become the opening stock for the next week.
I have done the same in SAS using the retain statement, but I have no idea how to replicate it in Python.
Also note that this operation has to be performed for every country-item combination. (We cannot always shift the value to take as the opening stock, as a new item might have a different current stock.)
Can anyone help me out with this?
My Output should look like below table.
country item week stock FCST OPENING_STOCK CLOSING_STK
A 1 1 105 3 105 102
A 1 2 105 6 102 96
A 1 3 105 9 96 87
A 1 4 105 4 87 83
A 1 5 105 7 83 76
A 1 6 105 4 76 72
A 1 7 105 7 72 65
Thanks in advance.
The code I have used to solve the issue is pasted below (column names harmonized with the dataset shown above).
df.sort_values(by=['country', 'item', 'week'], inplace=True)
df['CONCAT'] = df['country'] + df['item'].astype(str)
# running week number and cumulative forecast per country-item group
df['TMP1'] = 1
grouper = (df['CONCAT'] != df['CONCAT'].shift()).cumsum()
df['WEEK_NO'] = df.groupby(grouper)['TMP1'].cumsum()
df['FCST1'] = df.groupby(grouper)['FCST'].cumsum()
# closing stock = starting stock minus cumulative forecast, floored at 0
result = df['stock'] - df['FCST1']
df['CLOSING'] = result
df['CLOSING'] = np.where(df['CLOSING'] < 0, 0, df['CLOSING'])
# opening stock = previous week's closing (the starting stock in week 1)
df['OPENING'] = np.where(df['WEEK_NO'] == 1, df['stock'], result.shift(1))
df['OPENING'] = np.where(df['OPENING'] < 0, 0, df['OPENING'])
I have also done some extra manipulation, making all negative values 0.
Now it works.
combine_first is used to fill the gaps in df.opening (the first row of each group, where shift leaves NaN).
import pandas as pd
df = pd.DataFrame({
'country': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'A', 5: 'A', 6: 'A', 7: 'A', 8: 'A', 9: 'A', 10: 'A', 11: 'A', 12: 'A', 13: 'B', 14: 'B', 15: 'B', 16: 'B', 17: 'B', 18: 'B', 19: 'B', 20: 'B', 21: 'B', 22: 'B'},
'item': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 2, 8: 1, 9: 2, 10: 2, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 2, 17: 3, 18: 2, 19: 1, 20: 2, 21: 1, 22: 3},
'week': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 7, 8: 8, 9: 8, 10: 9, 11: 9, 12: 10, 13: 1, 14: 2, 15: 3, 16: 3, 17: 3, 18: 4, 19: 4, 20: 5, 21: 5, 22: 5},
'stock': {0: 105, 1: 105, 2: 105, 3: 105, 4: 105, 5: 105, 6: 105, 7: 94, 8: 105, 9: 94, 10: 94, 11: 105, 12: 105, 13: 100, 14: 100, 15: 100, 16: 200, 17: 300, 18: 200, 19: 100, 20: 200, 21: 100, 22: 300},
'FCST': {0: 3, 1: 6, 2: 9, 3: 4, 4: 7, 5: 4, 6: 7, 7: 2, 8: 1, 9: -5, 10: 2, 11: 8, 12: 6, 13: 2, 14: 6, 15: 8, 16: 3, 17: 7, 18: 8, 19: 9, 20: 3, 21: 5, 22: 6}
})
groups = df.groupby(["country", "item"])
# closing stock: starting stock minus the cumulative forecast within each group
df["closing"] = df.stock - groups.FCST.cumsum()
# the previous week's closing becomes the current week's opening
df["opening"] = df.groupby(["country", "item"])["closing"].shift(1)
# the first week of each group has no previous closing, so fall back to stock
df["opening"] = df["opening"].combine_first(df.stock)
Outputs:
country item week stock FCST closing opening
0 A 1 1 105 3 102 105.0
1 A 1 2 105 6 96 102.0
2 A 1 3 105 9 87 96.0
3 A 1 4 105 4 83 87.0
4 A 1 5 105 7 76 83.0
5 A 1 6 105 4 72 76.0
6 A 1 7 105 7 65 72.0
7 A 2 7 94 2 92 94.0
8 A 1 8 105 1 64 65.0
9 A 2 8 94 -5 97 92.0
10 A 2 9 94 2 95 97.0
11 A 1 9 105 8 56 64.0
12 A 1 10 105 6 50 56.0
13 B 1 1 100 2 98 100.0
14 B 1 2 100 6 92 98.0
15 B 1 3 100 8 84 92.0
16 B 2 3 200 3 197 200.0
17 B 3 3 300 7 293 300.0
18 B 2 4 200 8 189 197.0
19 B 1 4 100 9 75 84.0
20 B 2 5 200 3 186 189.0
21 B 1 5 100 5 70 75.0
22 B 3 5 300 6 287 293.0
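If, as in the question's follow-up, negative values should be floored at zero, Series.clip can replace the np.where calls. A sketch on a made-up frame; note this clamps the displayed values the same way the question's code does, it does not model a true carry-forward once stock hits zero:

```python
import pandas as pd

# Toy frame where the cumulative forecast eventually exceeds the stock.
df = pd.DataFrame({'country': ['A'] * 4, 'item': [1] * 4,
                   'week': [1, 2, 3, 4],
                   'stock': [10, 10, 10, 10],
                   'FCST': [4, 4, 4, 4]})

groups = df.groupby(['country', 'item'])
# closing stock floored at zero
df['closing'] = (df['stock'] - groups['FCST'].cumsum()).clip(lower=0)
# previous week's closing becomes the opening; week 1 falls back to stock
df['opening'] = (df.groupby(['country', 'item'])['closing'].shift(1)
                   .fillna(df['stock'])
                   .clip(lower=0))
print(df)
```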
There is CSV data like
No,User,A,B,C,D
1 Tom 100 120 110 90
1 Juddy 89 90 100 110
1 Bob 99 80 90 100
2 Tom 80 100 100 70
2 Juddy 79 90 80 70
2 Bob 88 90 95 90
...
I want to transform this CSV data into a DataFrame like
    Tom_A  Tom_B  Tom_C  Tom_D  Juddy_A  Juddy_B  Juddy_C  Juddy_D  Bob_A  Bob_B  Bob_C  Bob_D
No
1     100    120    110     90       89       90      100      110     99     80     90    100
2      80    100    100     70       79       90       80       70     88     90     95     90
I ran this code,
import pandas as pd
csv = pd.read_csv("user.csv", header=0, index_col='No', sep=r'\s|,', engine='python')
but the output is not what I want. I cannot work out how to create columns like Tom_A, Tom_B, and Juddy_A, which are not in the CSV.
How should I fix my code?
Setup
df = pd.DataFrame({'No': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2}, 'User': {0: 'Tom', 1: 'Juddy', 2: 'Bob', 3: 'Tom', 4: 'Juddy', 5: 'Bob'}, 'A': {0: 100, 1: 89, 2: 99, 3: 80, 4: 79, 5: 88}, 'B': {0: 120, 1: 90, 2: 80, 3: 100, 4: 90, 5: 90}, 'C': {0: 110, 1: 100, 2: 90, 3: 100, 4: 80, 5: 95}, 'D': {0: 90, 1: 110, 2: 100, 3: 70, 4: 70, 5: 90}})
You want pivot_table:
out = df.pivot_table(index='No', columns='User')
A B C D
User Bob Juddy Tom Bob Juddy Tom Bob Juddy Tom Bob Juddy Tom
No
1 99 89 100 80 90 120 90 100 110 100 110 90
2 88 79 80 90 90 100 95 80 100 90 70 70
To get the prefix:
out.columns = out.columns.swaplevel(0,1).to_series().str.join('_')
Bob_A Juddy_A Tom_A Bob_B Juddy_B Tom_B Bob_C Juddy_C Tom_C Bob_D Juddy_D Tom_D
No
1 99 89 100 80 90 120 90 100 110 100 110 90
2 88 79 80 90 90 100 95 80 100 90 70 70
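An equivalent way to flatten the MultiIndex is to map '_'.join over the swapped column tuples, which some may find more explicit than to_series().str.join. A sketch on a reduced setup frame:

```python
import pandas as pd

# Small version of the setup frame: two users, two value columns.
df = pd.DataFrame({'No': [1, 1, 2, 2],
                   'User': ['Tom', 'Juddy', 'Tom', 'Juddy'],
                   'A': [100, 89, 80, 79],
                   'B': [120, 90, 100, 90]})

out = df.pivot_table(index='No', columns='User')
# Columns are (value, user) tuples, e.g. ('A', 'Tom'); swap the levels and
# join each tuple with '_' to get flat labels like 'Tom_A'.
out.columns = out.columns.swaplevel(0, 1).map('_'.join)
print(sorted(out.columns))
```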
I have a CSV file dataset that contains 21 columns. The first 10 columns are numbers and I don't want to change them. The next 10 columns are binary data containing only 1s and 0s, with a single "1" per row, and the last column is the given label.
The example data looks like this:
2596,51,3,258,0,510,221,232,148,6279,24(10th column),0,0,0,0,0,1(16th column),0,0,0,0,2(the last column)
Suppose I load the data into a matrix. Can I keep the first 10 columns and the last column unchanged, and convert the middle 10 columns into one column? After the transformation, I want the column value to be based on the index of the "1" in the row. For the row above, the wanted result is
2596,51,3,258,0,510,221,232,148,6279,24,6(it's 6 because the "1" is in the 6th column of the binary data),2 #12 columns in total
Can I achieve this using NumPy, scikit-learn or something else?
This should do it if the data is loaded into a NumPy array. Note that in is a reserved word in Python and cannot be a variable name, so the array is called arr here:
out = np.c_[arr[:, :11], np.where(arr[:, 11:-1])[1] + 1, arr[:, -1]]
from io import StringIO
import pandas as pd
csv = StringIO("2596,51,3,258,0,510,221,232,148,6279,24,0,0,0,0,0,1,0,0,0,0,2"
"\n1,2,3,4,5,6,7,8,9,10,11,0,0,0,0,1,0,0,0,0,0,1")
df = pd.read_csv(csv, header=None)
df = pd.concat(objs=[df[df.columns[:11]],
df[df.columns[11:-1]].idxmax(axis=1) - 10,
df[df.columns[-1]]], axis=1)
print(df)
Output:
0 1 2 3 4 5 6 7 8 9 10 0 21
0 2596 51 3 258 0 510 221 232 148 6279 24 6 2
1 1 2 3 4 5 6 7 8 9 10 11 5 1
Data:
In [135]: df
Out[135]:
0 1 2 3 4 5 6 7 8 9 ... 12 13 14 15 16 17 18 19 20 21
0 2596 51 3 258 0 510 221 232 148 6279 ... 0 0 0 0 1 0 0 0 0 2
1 2596 51 3 258 0 510 221 232 148 6279 ... 0 0 0 0 0 0 0 0 1 2
[2 rows x 22 columns]
Solution:
df = pd.read_csv('/path/to/file.csv', header=None)
In [137]: df.iloc[:, :11] \
.join(df.iloc[:, 11:21].dot(range(1,11)).to_frame(11)) \
.join(df.iloc[:, -1])
Out[137]:
0 1 2 3 4 5 6 7 8 9 10 11 21
0 2596 51 3 258 0 510 221 232 148 6279 24 6 2
1 2596 51 3 258 0 510 221 232 148 6279 24 10 2
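The .dot(range(1, 11)) step works because each row of the ten binary columns is one-hot: dotting it with the positions 1..10 returns the position of the single 1. A minimal NumPy illustration:

```python
import numpy as np

# Each row is one-hot over the ten binary columns.
onehot = np.array([[0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
                   [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]])

# Dot with the 1-based positions picks out where the single 1 sits.
positions = onehot.dot(np.arange(1, 11))
print(positions)  # [6 5]
```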
Setup
df = pd.DataFrame({0: {2596: 51},
1: {2596: 3},
2: {2596: 258},
3: {2596: 0},
4: {2596: 510},
5: {2596: 221},
6: {2596: 232},
7: {2596: 148},
8: {2596: 6279},
9: {2596: 24},
10: {2596: 0},
11: {2596: 0},
12: {2596: 0},
13: {2596: 0},
14: {2596: 0},
15: {2596: 1},
16: {2596: 0},
17: {2596: 0},
18: {2596: 0},
19: {2596: 0},
20: {2596: 2}})
Solution
#find the index of the column with value 1 within the 10 columns
df.iloc[:,10] = np.argmax(df.iloc[:,10:20].values,axis=1)+1
#select the first 10 columns, the position column and the label column
df.iloc[:,list(range(11))+[20]]
Out[2167]:
0 1 2 3 4 5 6 7 8 9 10 20
2596 51 3 258 0 510 221 232 148 6279 24 6 2