I'm not even sure if the title makes sense.
I have a pandas dataframe with 3 columns: x, y, time. There are a few thousand rows. Example below:
x y time
0 225 0 20.295270
1 225 1 21.134015
2 225 2 21.382298
3 225 3 20.704367
4 225 4 20.152735
5 225 5 19.213522
.......
900 437 900 27.748966
901 437 901 20.898460
902 437 902 23.347935
903 437 903 22.011992
904 437 904 21.231041
905 437 905 28.769945
906 437 906 21.662975
.... and so on
What I want to do is retrieve the rows that have the smallest time for each x and y. Basically, for every element on the y axis I want to find the row with the smallest time value, but I want to exclude rows that have time 0.0, which happens when x has the same value as y.
So, for example, the fastest way to get to y=0 is by starting from x=225, and so on; it could therefore be the case that x repeats itself, but for a different y.
e.g.
x y time
225 0 20.295270
438 1 19.648954
27 20 4.342732
9 438 17.884423
225 907 24.560400
So far I have tried groupby, but I'm only getting rows where x is the same as y.
print(df.groupby('y', sort=False)['time'].idxmin())
y
0 0
1 1
2 2
3 3
4 4
The one below just returns the df that I already have.
df.loc[df.groupby("y")["time"].idxmin()]
Just to point out one thing: I'm open to options other than groupby; if there are other ways, that is very good.
You need to first remove the rows where time equals 0 by boolean indexing, and then use your solution:
df = df[df['time'] != 0]
df2 = df.loc[df.groupby("y")["time"].idxmin()]
A similar alternative is to filter with query:
df = df.query('time != 0')
df2 = df.loc[df.groupby("y")["time"].idxmin()]
Or use sort_values with drop_duplicates:
df2 = df[df['time'] != 0].sort_values(['y','time']).drop_duplicates('y')
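Putting the pieces together, here is a minimal end-to-end sketch (the column names x, y, time come from the example above; the toy data is made up for illustration only):
import pandas as pd

# toy data: two candidate starting points (x) for each destination (y)
df = pd.DataFrame({'x':    [225, 438, 225, 438],
                   'y':    [0,   0,   1,   1],
                   'time': [20.3, 25.1, 21.1, 19.6]})

# drop the time == 0 rows first, then keep the fastest row per y
filtered = df[df['time'] != 0]
fastest = filtered.loc[filtered.groupby('y')['time'].idxmin()]
print(fastest)  # one row per y: (225, 0, 20.3) and (438, 1, 19.6)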
I have a column in a DataFrame that contains a string from which I must retrieve two pieces of information by different separators:
ID STR
280 11040402-38.58551%;11050101-9.29086%;11070101-52.12363%
351 11130203-35%;11130230-65%
510 11070103-69%
655 11090103-41.63463%;11160102-58.36537%
666 11130205-50.00%;11130207-50%
I have been trying to use the .apply method on this series together with a lambda function to do the splitting in one go, to no avail:
df['STR'].apply(lambda x: y.split('-') for y in x.split(';'))
Ideally, not only would I be able to split the string in one go, but I would also separate the left side of the - from the right side:
ID STR.LEFT STR.RIGHT
280 [11040402, 11050101, 11070101] [38.58551%, 9.29086%, 52.12363%]
351 [11130203, 11130230] [35%, 65%]
510 [11070103] [69%]
655 [11090103, 11160102] [41.63463%, 58.36537%]
666 [11130205, 11130207] [50.00%, 50%]
I believe this could be achievable with .apply and slicing, but any other solution is welcome.
You can try splitting several times:
# set ID as index so it survives the stack/groupby round trip
df.set_index('ID', inplace=True)
# one row per 'left-right' pair, indexed by ID
new_series = df.STR.str.split(';', expand=True).stack().reset_index(level=-1, drop=True)
# split each pair into two columns
new_df = new_series.str.split('-', expand=True)
new_df.groupby('ID').agg(list).reset_index()
Output:
ID 0 1
-- ---- ------------------------------------ --------------------------------------
0 280 ['11040402', '11050101', '11070101'] ['38.58551%', '9.29086%', '52.12363%']
1 351 ['11130203', '11130230'] ['35%', '65%']
2 510 ['11070103'] ['69%']
3 655 ['11090103', '11160102'] ['41.63463%', '58.36537%']
4 666 ['11130205', '11130207'] ['50.00%', '50%']
str.split
Assuming the pattern is always 'l-r;l-r;l-r...':
s = df.STR.str.split('-|;')
df[['ID']].join(pd.concat({'STR.LEFT': s.str[::2], 'STR.RIGHT': s.str[1::2]}, axis=1))
ID STR.LEFT STR.RIGHT
0 280 [11040402, 11050101, 11070101] [38.58551%, 9.29086%, 52.12363%]
1 351 [11130203, 11130230] [35%, 65%]
2 510 [11070103] [69%]
3 655 [11090103, 11160102] [41.63463%, 58.36537%]
4 666 [11130205, 11130207] [50.00%, 50%]
If you want to explode these lists into separate rows:
import numpy as np

s = df.STR.str.split('-|;')
i = np.arange(len(df)).repeat(s.str.len() // 2)
d = {'STR.LEFT': np.concatenate(s.str[::2]),
     'STR.RIGHT': np.concatenate(s.str[1::2])}
df[['ID']].iloc[i].assign(**d).reset_index(drop=True)
ID STR.LEFT STR.RIGHT
0 280 11040402 38.58551%
1 280 11050101 9.29086%
2 280 11070101 52.12363%
3 351 11130203 35%
4 351 11130230 65%
5 510 11070103 69%
6 655 11090103 41.63463%
7 655 11160102 58.36537%
8 666 11130205 50.00%
9 666 11130207 50%
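Another sketch, assuming pandas 0.25 or newer (for Series.explode), which produces the long form directly and can be re-aggregated into lists:
# one row per 'left-right' pair, indexed by ID
s = df.set_index('ID')['STR'].str.split(';').explode()

# split each pair into the two requested columns
long_form = (s.str.split('-', expand=True)
              .rename(columns={0: 'STR.LEFT', 1: 'STR.RIGHT'})
              .reset_index())

# back to one row per ID with lists, as in the desired output
lists = long_form.groupby('ID').agg(list).reset_index()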
A single str.extractall call will suffice to extract the pairs into separate columns. You can then aggregate them into lists using groupby.
(df['STR'].str.extractall(r'([^;]+)-([^;]+)')
          .groupby(level=0)
          .agg(list)
          .set_axis(['STR.LEFT', 'STR.RIGHT'], axis=1, inplace=False))
STR.LEFT STR.RIGHT
0 [11040402, 11050101, 11070101] [38.58551%, 9.29086%, 52.12363%]
1 [11130203, 11130230] [35%, 65%]
2 [11070103] [69%]
3 [11090103, 11160102] [41.63463%, 58.36537%]
4 [11130205, 11130207] [50.00%, 50%]
To join with ID, you use just that: join.
(df['STR'].str.extractall(r'([^;]+)-([^;]+)')
          .groupby(level=0)
          .agg(list)
          .set_axis(['STR.LEFT', 'STR.RIGHT'], axis=1, inplace=False)
          .join(df['ID']))
STR.LEFT STR.RIGHT ID
0 [11040402, 11050101, 11070101] [38.58551%, 9.29086%, 52.12363%] 280
1 [11130203, 11130230] [35%, 65%] 351
2 [11070103] [69%] 510
3 [11090103, 11160102] [41.63463%, 58.36537%] 655
4 [11130205, 11130207] [50.00%, 50%] 666
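A small variation (a sketch, not part of the original answer): named capture groups label the columns directly, which avoids the set_axis call:
out = (df['STR'].str.extractall(r'(?P<left>[^;]+)-(?P<right>[^;]+)')
                .groupby(level=0)
                .agg(list)
                .rename(columns={'left': 'STR.LEFT', 'right': 'STR.RIGHT'})
                .join(df['ID']))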
I have a dataframe which is as follows:
imagename,locationName,brandname,x,y,w,h,xdiff,ydiff
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,0,490,177,82,0,0
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,1,491,182,78,1,1
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,3,450,94,45,2,-41
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,5,451,95,48,2,1
95-20180407-215120-235505-00050.jpg,DUGOUT,VIVO,167,319,36,38,162,-132
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,446,349,99,90,279,30
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,455,342,84,93,9,-7
95-20180407-215120-235505-00050.jpg,Shirt,GOIBIBO,559,212,70,106,104,-130
It's a CSV dump. From this I want to group by imagename and brandname. Wherever the values in xdiff and ydiff are both less than 10, I want to remove the second line.
For example, from the first two lines I want to delete the second line; similarly, from lines 3 and 4 I want to delete line 4.
I could do this quickly in R using dplyr's group_by, lag and lead functions. However, I am not sure how to combine the different functions in Python to achieve this. This is what I have tried so far:
df[df.groupby(['imagename','brandname']).xdiff.transform() <= 10]
I'm not sure what function I should call within transform, or how to include ydiff too.
The expected output is as follows:
imagename,locationName,brandname,x,y,w,h,xdiff,ydiff
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,0,490,177,82,0,0
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,3,450,94,45,2,-41
95-20180407-215120-235505-00050.jpg,DUGOUT,VIVO,167,319,36,38,162,-132
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,446,349,99,90,279,30
95-20180407-215120-235505-00050.jpg,Shirt,GOIBIBO,559,212,70,106,104,-130
You can take the individual groupby frames and apply the condition through the apply function:
# xdiff only:
# df.groupby(['imagename', 'brandname'], group_keys=False).apply(lambda x: x.iloc[range(0, len(x), 2)] if x['xdiff'].lt(10).any() else x)
df.groupby(['imagename', 'brandname'], group_keys=False).apply(
    lambda x: x.iloc[range(0, len(x), 2)]
    if (x['xdiff'].lt(10).any() and x['ydiff'].lt(10).any())
    else x
)
Out:
imagename locationName brandname x y w h xdiff ydiff
2 95-20180407-215120-235505-00050.jpg Shirt DHFL 3 450 94 45 2 -41
5 95-20180407-215120-235505-00050.jpg Shirt DHFL 446 349 99 90 279 30
7 95-20180407-215120-235505-00050.jpg Shirt GOIBIBO 559 212 70 106 104 -130
0 95-20180407-215120-235505-00050.jpg Shirt SAMSUNG 0 490 177 82 0 0
4 95-20180407-215120-235505-00050.jpg DUGOUT VIVO 167 319 36 38 162 -132
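An alternative sketch, assuming the intent (as suggested by the expected output) is to keep the first row of each (imagename, brandname) group and drop any later row whose xdiff and ydiff are both below 10 in absolute value:
# True for the first row of each group
first_in_group = df.groupby(['imagename', 'brandname']).cumcount() == 0
# True wherever either offset is a large jump
big_jump = (df['xdiff'].abs() >= 10) | (df['ydiff'].abs() >= 10)
result = df[first_in_group | big_jump]
This keeps the original row order and reproduces the expected output shown above.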
I'm given a set of the following data:
week A B C D E
1 243 857 393 621 194
2 644 576 534 792 207
3 946 252 453 547 436
4 560 100 864 663 949
5 712 734 308 385 303
I’m asked to find the sum of each column for specified rows/a specified number of weeks, and then plot those numbers onto a bar chart to compare A-E.
Assuming I have the rows I need (e.g. df.iloc[2:4,:]), what should I do next? My assumption is that I need to create a mask with a single row that includes the sum of each column, but I'm not sure how I go about doing that.
I know how to do the final step (i.e. .plot(kind='bar')); I just need to know what the middle step is to obtain the sums I need.
To select by position you can use iloc with sum and Series.plot.bar:
df.iloc[2:4].sum().plot.bar()
Or, if you want to select by index labels (here weeks), use loc:
df.loc[2:4].sum().plot.bar()
The difference is that iloc excludes the last position:
print (df.loc[2:4])
A B C D E
week
2 644 576 534 792 207
3 946 252 453 547 436
4 560 100 864 663 949
print (df.iloc[2:4])
A B C D E
week
3 946 252 453 547 436
4 560 100 864 663 949
And if you also need to filter columns by position:
df.iloc[2:4, :4].sum().plot.bar()
And by labels (weeks for the rows, names for the columns):
df.loc[2:4, list('ABCD')].sum().plot.bar()
All you need to do is call .sum() on your subset of the data:
df.iloc[2:4,:].sum()
Returns:
week 7
A 1506
B 352
C 1317
D 1210
E 1385
dtype: int64
Furthermore, for plotting, I think you can probably get rid of the week column (as the sum of week numbers is unlikely to mean anything):
df.iloc[2:4,1:].sum().plot(kind='bar')
# or
df[list('ABCDE')].iloc[2:4].sum().plot(kind='bar')
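For completeness, a minimal runnable sketch of the whole pipeline on the sample data (the numbers are copied from the table above; matplotlib is assumed to be installed):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'week': [1, 2, 3, 4, 5],
                   'A': [243, 644, 946, 560, 712],
                   'B': [857, 576, 252, 100, 734],
                   'C': [393, 534, 453, 864, 308],
                   'D': [621, 792, 547, 663, 385],
                   'E': [194, 207, 436, 949, 303]})

# weeks 3 and 4 (positions 2 and 3), columns A-E only
totals = df.iloc[2:4, 1:].sum()
totals.plot(kind='bar')
plt.show()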
I have this data frame and I would like to calculate a new column as the mean of salary_1, salary_2 and salary_3:
df = pd.DataFrame({
'salary_1': [230, 345, 222],
'salary_2': [235, 375, 292],
'salary_3': [210, 385, 260]
})
salary_1 salary_2 salary_3
0 230 235 210
1 345 375 385
2 222 292 260
How can I do it in pandas in the most efficient way? Actually I have many more columns and I don't want to write them out one by one.
Something like this:
salary_1 salary_2 salary_3 salary_mean
0 230 235 210 (230+235+210)/3
1 345 375 385 ...
2 222 292 260 ...
Use .mean. By specifying the axis you can take the average across the row or the column.
df['average'] = df.mean(axis=1)
df
returns
salary_1 salary_2 salary_3 average
0 230 235 210 225.000000
1 345 375 385 368.333333
2 222 292 260 258.000000
If you only want the mean of a few you can select only those columns. E.g.
df['average_1_3'] = df[['salary_1', 'salary_3']].mean(axis=1)
df
returns
salary_1 salary_2 salary_3 average_1_3
0 230 235 210 220.0
1 345 375 385 365.0
2 222 292 260 241.0
An easy way to solve this problem is shown below:
col = df.loc[:, "salary_1":"salary_3"]
where "salary_1" is the start column name and "salary_3" is the end column name.
df['salary_mean'] = col.mean(axis=1)
df
This will give you a new dataframe with a new column that shows the mean of all the other columns.
This approach is really helpful when you have a large set of columns, or when you need to operate on only some selected columns rather than on all of them.
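If the relevant columns share a common prefix, a sketch using DataFrame.filter (the 'salary_' prefix comes from the example above) avoids spelling out the slice endpoints:
# select every column whose name contains 'salary_' and average across each row
df['salary_mean'] = df.filter(like='salary_').mean(axis=1)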
I am aligning two dataframes which look like the following:
Dataframe 1
Timestamp L_x L_y L_a R_x R_y R_a
2403950 621.3 461.3 313 623.3 461.8 260
2404050 622.5 461.3 312 623.3 462.6 260
2404150 623.1 461.5 311 623.4 464 261
2404250 623.6 461.7 310 623.7 465.4 261
2404350 623.8 461.5 309 623.9 466.1 261
Dataframe 2
This dataframe contains the timestamps at which a particular event occurred.
Timestamp
0 2404030
1 2404050
2 2404250
3 2404266
4 2404282
5 2404298
6 2404314
7 2404330
8 2404350
9 2404382
All timestamps are in milliseconds. As you can see, the first dataframe is resampled to 100 milliseconds. What I want to do is align the two dataframes based on a count, i.e. count how many events occur during each 100 millisecond bin. For example, during the first 100 millisecond bin of dataframe 1 (2403950 - 2404049), only one event from the second dataframe occurs, the one at 2404030, and so on. The aligned table should look like the following:
Timestamp L_x L_y L_a R_x R_y R_a count
2403950 621.3 461.3 313 623.3 461.8 260 1
2404050 622.5 461.3 312 623.3 462.6 260 1
2404150 623.1 461.5 311 623.4 464 261 0
2404250 623.6 461.7 310 623.7 465.4 261 6
2404350 623.8 461.5 309 623.9 466.1 261 2
Thank you for your help and suggestion.
You want to perform integer division of the timestamp by the bin width (i.e. a // 100), but you first need to add 50 to it given your bucketing. Then convert it back into the correct units by multiplying by 100 and subtracting 50.
Now, group on this new index and perform a count.
You then merge these counts to your original dataframe and do some formatting operations to get the data in the desired shape. Make sure to fill NaNs with zero.
df2['idx'] = (df2.Timestamp + 50) // 100 * 100 - 50
counts = df2.groupby('idx').count()
>>> counts
Timestamp
idx
2403950 1
2404050 1
2404250 6
2404350 2
df_new = df.merge(counts, how='left', left_on='Timestamp', right_index=True, suffixes=['', '_'])
columns = list(df_new)
columns[-1] = 'count'
df_new.columns = columns
df_new['count'].fillna(0, inplace=True)
>>> df_new
Timestamp L_x L_y L_a R_x R_y R_a count
0 2403950 621.3 461.3 313 623.3 461.8 260 1
1 2404050 622.5 461.3 312 623.3 462.6 260 1
2 2404150 623.1 461.5 311 623.4 464.0 261 0
3 2404250 623.6 461.7 310 623.7 465.4 261 6
4 2404350 623.8 461.5 309 623.9 466.1 261 2
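An alternative sketch with NumPy, assuming df's Timestamp column is sorted, the bins are 100 ms wide, and every event in df2 falls at or after the first bin start:
import numpy as np

# index of the bin (row of df) whose start is at or before each event
edges = df['Timestamp'].to_numpy()
bin_idx = np.searchsorted(edges, df2['Timestamp'].to_numpy(), side='right') - 1

# count events per bin; minlength guarantees one entry per row of df
df['count'] = np.bincount(bin_idx, minlength=len(df))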