I have a CSV dataset with 22 columns. The first 11 columns are numeric and I don't want to change them. The next 10 columns are binary, containing only 1s and 0s with exactly one "1" per row, and the last column is the label.
The example data looks like this:
2596,51,3,258,0,510,221,232,148,6279,24(11th column),0,0,0,0,0,1(6th binary column),0,0,0,0,2(the label column)
Suppose I load the data into a matrix. Can I keep the first 11 columns and the last column unchanged, and collapse the middle 10 columns into a single column? After the transformation, that column's value should be the 1-based position of the "1" in the binary block. For the row above, the wanted result is
2596,51,3,258,0,510,221,232,148,6279,24,6,2 # 13 columns in total; the new value is 6 because the "1" is in the 6th column of the binary block
Can I achieve this using NumPy, scikit-learn or something else?
This should do it if the data is loaded into a NumPy array. Note that in is a Python keyword, so the array needs a different name, e.g. arr:
out = np.c_[arr[:, :11], np.where(arr[:, 11:-1])[1] + 1, arr[:, -1]]
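A minimal, self-contained sketch of that one-liner on the sample row from the question:
import numpy as np

arr = np.array([[2596, 51, 3, 258, 0, 510, 221, 232, 148, 6279, 24,
                 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2]])
# np.where(...)[1] gives the column offset of each 1 within the binary block
out = np.c_[arr[:, :11], np.where(arr[:, 11:-1])[1] + 1, arr[:, -1]]
print(out)  # [[2596 51 3 258 0 510 221 232 148 6279 24 6 2]]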
Alternatively, with pandas, idxmax locates the column holding the 1 in each row:
from io import StringIO
import pandas as pd
csv = StringIO("2596,51,3,258,0,510,221,232,148,6279,24,0,0,0,0,0,1,0,0,0,0,2"
"\n1,2,3,4,5,6,7,8,9,10,11,0,0,0,0,1,0,0,0,0,0,1")
df = pd.read_csv(csv, header=None)
df = pd.concat(objs=[df[df.columns[:11]],
                     df[df.columns[11:-1]].idxmax(axis=1) - 10,
                     df[df.columns[-1]]], axis=1)
print(df)
Output:
      0   1  2    3  4    5    6    7    8     9  10  0  21
0  2596  51  3  258  0  510  221  232  148  6279  24  6   2
1     1   2  3    4  5    6    7    8    9    10  11  5   1
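Why the - 10 works: idxmax returns the label of the first maximal column in each row, and the one-hot block carries the integer labels 11 through 20, so subtracting 10 yields the 1-based position within the block. A tiny illustration:
import pandas as pd

row = pd.Series([0, 0, 1, 0], index=[11, 12, 13, 14])
print(row.idxmax())       # 13
print(row.idxmax() - 10)  # 3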
Data:
In [135]: df
Out[135]:
0 1 2 3 4 5 6 7 8 9 ... 12 13 14 15 16 17 18 19 20 21
0 2596 51 3 258 0 510 221 232 148 6279 ... 0 0 0 0 1 0 0 0 0 2
1 2596 51 3 258 0 510 221 232 148 6279 ... 0 0 0 0 0 0 0 0 1 2
[2 rows x 22 columns]
Solution:
df = pd.read_csv('/path/to/file.csv', header=None)
In [137]: df.iloc[:, :11] \
.join(df.iloc[:, 11:21].dot(range(1,11)).to_frame(11)) \
.join(df.iloc[:, -1])
Out[137]:
      0   1  2    3  4    5    6    7    8     9  10  11  21
0  2596  51  3  258  0  510  221  232  148  6279  24   6   2
1  2596  51  3  258  0  510  221  232  148  6279  24  10   2
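The dot product works because a one-hot row picks out exactly one of the weights 1 through 10. A minimal illustration with a hypothetical 4-column one-hot frame:
import pandas as pd

onehot = pd.DataFrame([[0, 0, 1, 0]])
print(onehot.dot(range(1, 5)))  # the 1 sits in the 3rd column, so the result is 3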
Setup
df = pd.DataFrame({0: {2596: 51},
1: {2596: 3},
2: {2596: 258},
3: {2596: 0},
4: {2596: 510},
5: {2596: 221},
6: {2596: 232},
7: {2596: 148},
8: {2596: 6279},
9: {2596: 24},
10: {2596: 0},
11: {2596: 0},
12: {2596: 0},
13: {2596: 0},
14: {2596: 0},
15: {2596: 1},
16: {2596: 0},
17: {2596: 0},
18: {2596: 0},
19: {2596: 0},
20: {2596: 2}})
Solution
import numpy as np

# find the 1-based position of the 1 within the 10 binary columns (10..19)
df.iloc[:, 10] = np.argmax(df.iloc[:, 10:20].values, axis=1) + 1
# select the first 10 feature columns, the new position column (10) and the label (20)
df.iloc[:, list(range(11)) + [20]]
Out[2167]:
         0  1    2  3    4    5    6    7     8   9  10  20
2596    51  3  258  0  510  221  232  148  6279  24   6   2
Related
I have a data frame that looks like the following.
0      1       2            3        4            5
0: 2   57: 9   None         436: 77  11469: 1018  203: 44
0: 0   57: 15  None         436: 47  None         203: 89
0: 45  57: 0   11469: 1116  436: 7   None         203: 0
0: 1   57: 23  None         436: 0   11469: 18    None
0: 23  57: 5   None         436: 63  None         203: 4
Here, each cell holds a distance and a time, in meters and seconds (57: 9 means 57 meters and 9 seconds). I want to rename the columns so that the meter value becomes the column name and the seconds value remains as the cell value. Moreover, cells that are None should be replaced by zero (0).
Desired output:
0   57  11469  436  11469  203
2    9      0   77   1018   44
0   15      0   47      0   89
45   0   1116    7      0    0
1   23      0    0     18    0
23   5      0   63      0    4
I am new to Python, so I don't know how to achieve this.
First split each column by ": " and keep the last part, replacing missing values with 0. For the column names, forward-fill the missing values, take the last row, split by ": " and keep the first part:
df1 = df.apply(lambda x: x.str.split(': ').str[-1]).fillna(0)
df1.columns = df.ffill().iloc[-1].str.split(': ').str[0].tolist()
print (df1)
    0  57  11469  436  11469  203
0   2   9      0   77   1018   44
1   0  15      0   47      0   89
2  45   0   1116    7      0    0
3   1  23      0    0     18    0
4  23   5      0   63      0    4
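An equivalent sketch using a regex extract per column instead of split (this assumes every non-empty cell has the form "meters: seconds"):
# grab the seconds after ": " for the values, and the leading meters for the names
df1 = df.apply(lambda s: s.str.extract(r': (\d+)$', expand=False)).fillna(0)
df1.columns = df.ffill().iloc[-1].str.extract(r'^(\d+):', expand=False).tolist()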
I'm working with two dataframes plus a correspondence table. Dataframe1 describes parking sites, Dataframe2 describes sensors, and the correspondence dataframe shows which sensor is at which site.
Dataframe1:
Site Time Available Capacity
0 19E 12:00 5 10
1 19E 13:00 4 10
2 44E 12:00 8 22
3 44E 13:00 11 22
Dataframe2:
Sensor Time Temp Precipitation
0 113 12:00 74 0.01
1 113 13:00 76 0.02
2 114 12:00 75 0.00
3 114 13:00 77 0.00
Correspondence dataframe:
Site Sensor
0 19E 113
1 44E 114
2 58E 115
...
I'd like to combine dataframes 1 and 2 based on the correspondence dataframe and on the 'Time' column. Both dataframes use 1-hour intervals.
Expected result:
Site Time Available Capacity Sensor Time Temp Precipitation
0 19E 12:00 5 10 113 12:00 74 0.01
1 19E 13:00 4 10 113 13:00 76 0.02
2 44E 12:00 8 22 114 12:00 75 0.00
3 44E 13:00 11 22 114 13:00 77 0.00
You can use the code below to generate the sample dataframes:
import pandas as pd

df1 = pd.DataFrame({
    'Site': {0: '19E', 1: '19E', 2: '44E', 3: '44E'},
    'Time': {0: '12:00', 1: '13:00', 2: '12:00', 3: '13:00'},
    'Available': {0: 5, 1: 4, 2: 8, 3: 11},
    'Capacity': {0: 10, 1: 10, 2: 22, 3: 22}})
df2 = pd.DataFrame({
    'Sensor': {0: 113, 1: 113, 2: 114, 3: 114},
    'Time': {0: '12:00', 1: '13:00', 2: '12:00', 3: '13:00'},
    'Temp': {0: 74, 1: 76, 2: 75, 3: 77},
    'Precipitation': {0: 0.01, 1: 0.02, 2: 0.00, 3: 0.00}})
cor_df = pd.DataFrame({
    'Site': {0: '19E', 1: '44E', 2: '58E'},
    'Sensor': {0: 113, 1: 114, 2: 115}})
Use Series.map to map Site to Sensor and then DataFrame.merge on Sensor and Time:
lookup = cor_df.set_index("Site").squeeze()
res = df1.assign(Sensor=df1["Site"].map(lookup)).merge(df2, on=["Sensor", "Time"])
print(res)
Output
  Site   Time  Available  Capacity  Sensor  Temp  Precipitation
0  19E  12:00          5        10     113    74           0.01
1  19E  13:00          4        10     113    76           0.02
2  44E  12:00          8        22     114    75           0.00
3  44E  13:00         11        22     114    77           0.00
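An equivalent route, for comparison, is to merge the correspondence table in first and then join the sensor frame on both keys (a sketch built only from the frames defined above):
res = df1.merge(cor_df, on="Site").merge(df2, on=["Sensor", "Time"])
print(res)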
I have a dataframe such as
Seq Chrm start end length score
A C1 1 50 49 12
B C1 3 55 52 12
C C1 6 60 54 12
Cbis C1 6 60 54 11
D C1 70 120 50 12
E C1 78 111 33 12
F C2 350 400 50 12
A C2 349 400 51 12
B C2 450 500 50 12
And I would like, within each Chrm, to group rows whose start-end coordinates overlap, and within each group keep the row with the longest length AND, among those, the highest score.
For example in C1:
Seq Chrm start end length score
A C1 1 50 49 12
B C1 3 55 52 12
C C1 6 60 54 12
Cbis C1 6 60 54 11
D C1 70 120 50 12
E C1 78 111 33 12
The coordinates from start to end of A, B, C and Cbis all overlap, and D and E overlap each other.
In the A/B/C/Cbis group the longest are C and Cbis with 54; of those I keep the one with the highest score, which is C (12). In the D/E group the longest is D with 50.
So I keep only rows C and D here.
If I do the same for other Chrm I should then get the following output:
Seq Chrm start end length score
C C1 6 60 54 12
D C1 70 120 50 12
A C2 349 400 51 12
B C2 450 500 50 12
Here is the dataframe in dic format if it can help :
{'Seq': {0: 'A', 1: 'B', 2: 'C', 3: 'Cbis', 4: 'D', 5: 'E', 6: 'F', 7: 'A', 8: 'B'}, 'Chrm': {0: 'C1', 1: 'C1', 2: 'C1', 3: 'C1', 4: 'C1', 5: 'C1', 6: 'C2', 7: 'C2', 8: 'C2'}, 'start': {0: 1, 1: 3, 2: 6, 3: 6, 4: 70, 5: 78, 6: 350, 7: 349, 8: 450}, 'end': {0: 50, 1: 55, 2: 60, 3: 60, 4: 120, 5: 111, 6: 400, 7: 400, 8: 500}, 'length': {0: 49, 1: 52, 2: 54, 3: 54, 4: 50, 5: 33, 6: 50, 7: 51, 8: 50}, 'score': {0: 12, 1: 12, 2: 12, 3: 11, 4: 12, 5: 12, 6: 12, 7: 12, 8: 12}}
Edit for Corralien:
If I use this table:
Seq Chrm start end length score
A C1 12414 14672 49 12
B C1 12414 14741 52 12
C C1 12414 14744 54 12
It does not put A, B and C in the same overlapping group...
{'Seq': {0: 'A', 1: 'B', 2: 'C'}, 'Chrm': {0: 'C1', 1: 'C1', 2: 'C1'}, 'start': {0: 12414, 1: 12414, 2: 12414}, 'end': {0: 14672, 1: 14741, 2: 14744}, 'length': {0: 49, 1: 52, 2: 54}, 'score': {0: 12, 1: 12, 2: 12}}
Create virtual groups and keep the best row (length, score) for each group:
Suppose this dataframe:
>>> df
Seq Chrm start end length score
0 A C1 1 50 49 12
1 B C1 3 55 52 12
2 C C1 6 60 54 12
3 Cbis C1 6 60 54 11
4 D C1 70 120 50 12
5 E C1 78 111 33 12
6 F C2 350 400 50 12
7 A C2 349 400 51 12
8 B C2 450 500 50 12
9 A C1 12414 14672 49 12
10 B C1 12414 14741 52 12
11 C C1 12414 14744 54 12
Create groups:
# after sorting, a row starts a new group when its start is at or beyond the previous row's end
is_overlapped = lambda x: x['start'] >= x['end'].shift(fill_value=-1)
df['group'] = df.sort_values(['Chrm', 'start', 'end']) \
                .groupby('Chrm').apply(is_overlapped).droplevel(0).cumsum()
# within each group, keep the row with the longest length, then the highest score
out = df.sort_values(['group', 'length', 'score'], ascending=[True, False, False]) \
        .groupby(df['group']).head(1)
Output:
>>> out
Seq Chrm start end length score group
2 C C1 6 60 54 12 1
4 D C1 70 120 50 12 2
11 C C1 12414 14744 54 12 3
7 A C2 349 400 51 12 4
8 B C2 450 500 50 12 5
# Groups
>>> df
Seq Chrm start end length score group
0 A C1 1 50 49 12 1
1 B C1 3 55 52 12 1
2 C C1 6 60 54 12 1
3 Cbis C1 6 60 54 11 1
4 D C1 70 120 50 12 2
5 E C1 78 111 33 12 2
6 F C2 350 400 50 12 4
7 A C2 349 400 51 12 4
8 B C2 450 500 50 12 5
9 A C1 12414 14672 49 12 3
10 B C1 12414 14741 52 12 3
11 C C1 12414 14744 54 12 3
You can drop the group column with out.drop(columns='group') but I left it to illustrate the virtual groups.
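The grouping idea in isolation, for clarity: after sorting, a row opens a new group exactly when its start lies at or beyond the previous row's end. A minimal sketch:
import pandas as pd

s = pd.DataFrame({'start': [1, 3, 70], 'end': [50, 55, 120]})
new_group = s['start'] >= s['end'].shift(fill_value=-1)
print(new_group.cumsum().tolist())  # [1, 1, 2] -> the first two rows share a group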
I have data which looks like the table below.
country item week stock FCST
A 1 1 105 3
A 1 2 105 6
A 1 3 105 9
A 1 4 105 4
A 1 5 105 7
A 1 6 105 4
A 1 7 105 7
The task I wish to perform is assigning the closing stock of the current week as the opening stock of the next week. In the table above my stock quantity was 105 at the very beginning; based on the forecast (FCST column) it decreases, giving the closing stock of the same week, and that closing stock should become the opening stock of the following week.
I have done the same in SAS using a RETAIN statement, but I have no idea how to replicate it in Python.
Also note that this operation has to be performed for every country-item combination (we cannot always shift the value to get the opening stock, as a new item might have a different current stock).
Can anyone help me out with this?
My output should look like the table below.
country item week stock FCST OPENING_STOCK CLOSING_STK
A 1 1 105 3 105 102
A 1 2 105 6 102 96
A 1 3 105 9 96 87
A 1 4 105 4 87 83
A 1 5 105 7 83 76
A 1 6 105 4 76 72
A 1 7 105 7 72 65
Thanks in advance.
The code I used to solve the issue is pasted below.
import numpy as np

# column names here follow the sample data above
df.sort_values(by=['item', 'country', 'week'], inplace=True)
df['CONCAT'] = df['country'] + df['item'].astype(str)

# calculate beginning stock every week
df['TMP1'] = 1
grouper = (df['CONCAT'] != df['CONCAT'].shift()).cumsum()
df['WEEK_NO'] = df.groupby(grouper)['TMP1'].cumsum()
df['FCST1'] = df.groupby(grouper)['FCST'].cumsum()
result = df['stock'] - df['FCST1']
df['CLOSING'] = result
df['CLOSING'] = np.where(df['CLOSING'] < 0, 0, df['CLOSING'])
df['OPENING'] = np.where(df['WEEK_NO'] == 1, df['stock'], result.shift(1))
df['OPENING'] = np.where(df['OPENING'] < 0, 0, df['OPENING'])
I also did some extra manipulation, such as making all negative values 0. Now it works.
combine_first is used to fill the gaps in df.opening (the first week of each country-item group, where there is no previous closing stock):
import pandas as pd
df = pd.DataFrame({
'country': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'A', 5: 'A', 6: 'A', 7: 'A', 8: 'A', 9: 'A', 10: 'A', 11: 'A', 12: 'A', 13: 'B', 14: 'B', 15: 'B', 16: 'B', 17: 'B', 18: 'B', 19: 'B', 20: 'B', 21: 'B', 22: 'B'},
'item': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 2, 8: 1, 9: 2, 10: 2, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 2, 17: 3, 18: 2, 19: 1, 20: 2, 21: 1, 22: 3},
'week': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 7, 8: 8, 9: 8, 10: 9, 11: 9, 12: 10, 13: 1, 14: 2, 15: 3, 16: 3, 17: 3, 18: 4, 19: 4, 20: 5, 21: 5, 22: 5},
'stock': {0: 105, 1: 105, 2: 105, 3: 105, 4: 105, 5: 105, 6: 105, 7: 94, 8: 105, 9: 94, 10: 94, 11: 105, 12: 105, 13: 100, 14: 100, 15: 100, 16: 200, 17: 300, 18: 200, 19: 100, 20: 200, 21: 100, 22: 300},
'FCST': {0: 3, 1: 6, 2: 9, 3: 4, 4: 7, 5: 4, 6: 7, 7: 2, 8: 1, 9: -5, 10: 2, 11: 8, 12: 6, 13: 2, 14: 6, 15: 8, 16: 3, 17: 7, 18: 8, 19: 9, 20: 3, 21: 5, 22: 6}
})
groups = df.groupby(["country", "item"])
df["closing"] = df.stock - groups.FCST.cumsum()
# re-group so the freshly added closing column is picked up, then shift it
# down one week within each country-item group
df["opening"] = df.groupby(["country", "item"])["closing"].shift(1)
# the first week of each group has no previous closing, so fall back to stock
df["opening"] = df["opening"].combine_first(df.stock)
Outputs:
country item week stock FCST closing opening
0 A 1 1 105 3 102 105.0
1 A 1 2 105 6 96 102.0
2 A 1 3 105 9 87 96.0
3 A 1 4 105 4 83 87.0
4 A 1 5 105 7 76 83.0
5 A 1 6 105 4 72 76.0
6 A 1 7 105 7 65 72.0
7 A 2 7 94 2 92 94.0
8 A 1 8 105 1 64 65.0
9 A 2 8 94 -5 97 92.0
10 A 2 9 94 2 95 97.0
11 A 1 9 105 8 56 64.0
12 A 1 10 105 6 50 56.0
13 B 1 1 100 2 98 100.0
14 B 1 2 100 6 92 98.0
15 B 1 3 100 8 84 92.0
16 B 2 3 200 3 197 200.0
17 B 3 3 300 7 293 300.0
18 B 2 4 200 8 189 197.0
19 B 1 4 100 9 75 84.0
20 B 2 5 200 3 186 189.0
21 B 1 5 100 5 70 75.0
22 B 3 5 300 6 287 293.0
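If negative stock should be floored at zero, as in the asker's SAS port, both derived columns can be clipped afterwards. A one-line sketch:
df[["opening", "closing"]] = df[["opening", "closing"]].clip(lower=0)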
I want to find the indexes where a new range of 100 values begins.
In the case below, since the first row is 0, the next index would be the next number above 100 (7).
At index 7, the value is 104, so the next index would be next number above 204 (15).
At index 15, the value is 205, so the next index would be the next number above 305 (n/a).
Therefore the output would be [0, 7, 15].
0 0
1 0
2 4
3 10
4 30
5 65
6 92
7 104
8 108
9 109
10 123
11 132
12 153
13 160
14 190
15 205
16 207
17 210
18 240
19 254
20 254
21 254
22 263
23 273
24 280
25 293
You can use zfill to left-pad the values to three digits and then group by the hundreds digit:
# convert number to string
df['grp'] = df['b'].astype(str).str.zfill(3).str[0]
print(df)
a b grp
0 0 0 0
1 1 0 0
2 2 4 0
3 3 10 0
4 4 30 0
5 5 65 0
6 6 92 0
7 7 104 1
8 8 108 1
9 9 109 1
10 10 123 1
11 11 132 1
12 12 153 1
13 13 160 1
14 14 190 1
15 15 205 2
# get first row from each group
ix = df.groupby('grp').first()['a'].to_numpy()
print(ix)
array([ 0, 7, 15])
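Note that zfill(3) only works while the values stay below 1000; integer division by 100 gives the same hundreds bucket without that limit, e.g. a sketch:
df['grp'] = df['b'] // 100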
For sorted data, we can use searchsorted -
In [98]: df.head()
Out[98]:
A
0 0
1 0
2 4
3 10
4 30
In [143]: df.A.searchsorted(np.arange(0,df.A.iloc[-1],100))
Out[143]: array([ 0, 7, 15])
If you need the positions mapped back to the dataframe/series index, index df.index with the result -
In [101]: df.index[_]
Out[101]: Int64Index([0, 7, 15], dtype='int64')