pandas group by multiple columns and remove rows based on multiple conditions - python
I have a dataframe which is as follows:
imagename,locationName,brandname,x,y,w,h,xdiff,ydiff
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,0,490,177,82,0,0
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,1,491,182,78,1,1
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,3,450,94,45,2,-41
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,5,451,95,48,2,1
95-20180407-215120-235505-00050.jpg,DUGOUT,VIVO,167,319,36,38,162,-132
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,446,349,99,90,279,30
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,455,342,84,93,9,-7
95-20180407-215120-235505-00050.jpg,Shirt,GOIBIBO,559,212,70,106,104,-130
It's a CSV dump. From this I want to group by imagename and brandname. Wherever the values in xdiff and ydiff are less than 10, remove the second line.
For example, from the first two lines I want to delete the second line; similarly, from lines 3 and 4 I want to delete line 4.
I could do this quickly in R using dplyr's group_by, lag, and lead functions. However, I am not sure how to combine the equivalent functions in Python to achieve this. This is what I have tried so far:
df[df.groupby(['imagename','brandname']).xdiff.transform() <= 10]
I am not sure what function I should call within transform, or how to include ydiff too.
The expected output is as follows:
imagename,locationName,brandname,x,y,w,h,xdiff,ydiff
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,0,490,177,82,0,0
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,3,450,94,45,2,-41
95-20180407-215120-235505-00050.jpg,DUGOUT,VIVO,167,319,36,38,162,-132
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,446,349,99,90,279,30
95-20180407-215120-235505-00050.jpg,Shirt,GOIBIBO,559,212,70,106,104,-130
You can take the individual groupby frames and apply the conditions through the apply function:
# checking xdiff only:
# df.groupby(['imagename','brandname'], group_keys=False).apply(lambda x: x.iloc[range(0, len(x), 2)] if x['xdiff'].lt(10).any() else x)

# checking both xdiff and ydiff: keep every other row of a group (rows 0, 2, 4, ...)
# whenever the group contains any xdiff < 10 and any ydiff < 10, otherwise keep the group unchanged
df.groupby(['imagename','brandname'], group_keys=False).apply(
    lambda x: x.iloc[range(0, len(x), 2)]
    if (x['xdiff'].lt(10).any() and x['ydiff'].lt(10).any())
    else x
)
Out:
                             imagename locationName brandname    x    y    w    h  xdiff  ydiff
2  95-20180407-215120-235505-00050.jpg        Shirt      DHFL    3  450   94   45      2    -41
5  95-20180407-215120-235505-00050.jpg        Shirt      DHFL  446  349   99   90    279     30
7  95-20180407-215120-235505-00050.jpg        Shirt   GOIBIBO  559  212   70  106    104   -130
0  95-20180407-215120-235505-00050.jpg        Shirt   SAMSUNG    0  490  177   82      0      0
4  95-20180407-215120-235505-00050.jpg       DUGOUT      VIVO  167  319   36   38    162   -132
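If you'd rather express the dplyr lag logic directly, here is a sketch using shift, pandas' analogue of lag. It assumes the intended rule is "drop a row when its x and y each moved less than 10 from the previous detection of the same brand in the same image", recomputing the differences per group rather than relying on the file-wide xdiff/ydiff columns:

import pandas as pd

g = df.groupby(['imagename', 'brandname'])
xd = df['x'].sub(g['x'].shift()).abs()   # per-group lag difference in x
yd = df['y'].sub(g['y'].shift()).abs()   # per-group lag difference in y
# keep the first row of each group (NaN diff) plus any row that moved
# at least 10 in either direction
out = df[xd.isna() | xd.ge(10) | yd.ge(10)]

On the sample above this reproduces the expected output, and it avoids the positional every-other-row assumption of the apply version.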
Related
Using groupby() for a dataframe in pandas resulted Index Error
I have this dataframe:

     x    y      z parameter
0   26   24   25         Age
1   35   37   36         Age
2   57   52   54.5       Age
3  160  164  162         Hgt
4  182  163  172.5       Hgt
5  175  167  171         Hgt
6   95   71   83         Wgt
7  110   68   89         Wgt
8   89   65   77         Wgt

I'm using pandas to get this final result:

     x    y parameter
0  160  164       Hgt
1  182  163       Hgt
2  175  167       Hgt

I'm using groupby() to extract and isolate rows based on the same parameter Hgt from the original dataframe. First, I added a column to set it as an index:

df = df.insert(0,'index', [count for count in range(df.shape[0])], True)

And the dataframe came out like this:

   index    x    y      z parameter
0      0   26   24   25         Age
1      1   35   37   36         Age
2      2   57   52   54.5       Age
3      3  160  164  162         Hgt
4      4  182  163  172.5       Hgt
5      5  175  167  171         Hgt
6      6   95   71   83         Wgt
7      7  110   68   89         Wgt
8      8   89   65   77         Wgt

Then, I used the following code to group based on index and extract the columns I need:

df1 = df.groupby('index')[['x', 'y','parameter']]

And the output was:

     x    y parameter
0   26   24       Age
1   35   37       Age
2   57   52       Age
3  160  164       Hgt
4  182  163       Hgt
5  175  167       Hgt
6   95   71       Wgt
7  110   68       Wgt
8   89   65       Wgt

After that, I used the following code to isolate only Hgt values:

df2 = df1[df1['parameter'] == 'Hgt']

When I ran df2, I got an error saying:

IndexError: Column(s) ['x', 'y', 'parameter'] already selected

Am I missing something here? What should I do to get the final result?
Because you asked what you did wrong, let me point to useless/bad code. Without any judgement (this is just to help you improve future code), almost everything is incorrect. It feels like a succession of complicated ways to do useless things. Let me give some details:

df = df.insert(0,'index', [count for count in range(df.shape[0])], True)

This seems a very convoluted way to do df.reset_index(). Even [count for count in range(df.shape[0])] could have been simplified by using range(df.shape[0]) directly. But this step is not even needed for a groupby, as you can group by index level:

df.groupby(level=0)

But... the groupby is useless anyway, as you only have single-membered groups. Also, when you do:

df1 = df.groupby('index')[['x', 'y','parameter']]

df1 is not a dataframe but a DataFrameGroupBy object. Very useful to store in a variable when you know what you're doing; however, this is causing the error in your case, as you thought it was a DataFrame. You need to apply an aggregation or transformation method of the DataFrameGroupBy object to get back a DataFrame, which you didn't (likely because, as seen above, there isn't much interesting to do on single-membered groups). So when you run:

df1[df1['parameter'] == 'Hgt']

again, all is wrong, as df1['parameter'] is equivalent to:

df.groupby('index')[['x', 'y','parameter']]['parameter']

(the cause of the error, as you select 'parameter' twice). Even if you removed this error, the equality comparison would give a single True/False, as you still have a DataFrameGroupBy and not a DataFrame, and this would incorrectly try to subselect a nonexistent column of the DataFrameGroupBy.

I hope it helped!
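To make the GroupBy-versus-DataFrame distinction concrete, here is a minimal sketch (toy data, not the full frame from the question):

import pandas as pd

df = pd.DataFrame({'x': [26, 160], 'y': [24, 164], 'parameter': ['Age', 'Hgt']})

grouped = df.groupby(level=0)[['x', 'y', 'parameter']]
print(type(grouped))           # DataFrameGroupBy -- not a DataFrame, so boolean indexing fails
print(type(grouped.first()))   # an aggregation such as first() returns a DataFrame again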
Do you really need groupby?

>>> df.loc[df['parameter'] == 'Hgt', ['x', 'y', 'parameter']].reset_index(drop=True)
     x    y parameter
0  160  164       Hgt
1  182  163       Hgt
2  175  167       Hgt
Check if a row in one DataFrame exists in another, BASED ON SPECIFIC COLUMNS ONLY
I have two Pandas DataFrames with different numbers of columns. df1 is a single-row DataFrame:

     a   X0   b     Y0   c
0  233  100  56  shark -23

df2, instead, is a multiple-row DataFrame:

       d   X0   e   f     Y0   g    h
0   snow  201  32  36    cat  58  336
1   rain  176  99  15  tiger  63  845
2    sun  193  81  42    dog  48  557
3  storm  100  74  18  shark  39  673   # <-- This row
4  cloud  214  56  27   wolf  66  406

I would like to verify whether df1's row is in df2, but considering the X0 AND Y0 columns only, ignoring all the other columns. In this example df1's row matches df2's row at index 3, which has 100 in X0 and 'shark' in Y0. The output for this example is: True

Note: True/False as output is enough for me; I don't care about the index of the matched row.

I found similar questions, but all of them check the entire row...
Use df.merge with an if condition check on len:

In [219]: if len(df1[['X0', 'Y0']].merge(df2)):
     ...:     print(True)
     ...:
True

OR:

In [225]: not (df1[['X0', 'Y0']].merge(df2)).empty
Out[225]: True
Try this:

df2[(df2.X0.isin(df1.X0)) & (df2.Y0.isin(df1.Y0))]

Output:

       d   X0   e   f     Y0   g    h
3  storm  100  74  18  shark  39  673
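One caveat worth adding (my observation, not part of the original answer): isin tests each column independently, so with a multi-row df1 this could flag a df2 row whose X0 comes from one df1 row and whose Y0 from another. For the single-row df1 here it is exact, and the plain True/False the question asked for is just:

bool(len(df2[(df2.X0.isin(df1.X0)) & (df2.Y0.isin(df1.Y0))]))   # True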
duplicated

df2.append(df1).duplicated(['X0', 'Y0']).iat[-1]

True

Save a tad bit of time:

df2[['X0', 'Y0']].append(df1[['X0', 'Y0']]).duplicated().iat[-1]
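A side note on compatibility: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the same idea would be written with pd.concat (a sketch of the equivalent, not part of the original answer):

import pandas as pd

pd.concat([df2[['X0', 'Y0']], df1[['X0', 'Y0']]]).duplicated().iat[-1]  # True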
Extract information from an Excel (by updating arrays) with Excel / Python
I have an Excel file with thousands of columns in the following format:

Member No.    X    Y    Z
1000         25   60  -30
            -69   38   68
             45    2   43
1001         24   55   79
              4   -7   89
             78   51   -2
1002         45  -55  149
             94   77 -985
             -2  559   56

I need a way to get a new table with the absolute maximum value from each column. In this example, something like:

Member No.    X    Y    Z
1000         69   60   68
1001         78   55   89
1002         94  559  985

I have tried it in Excel (using VLOOKUP to find the "Member Number" in the first row and then HLOOKUP to find the values from the rows thereafter), but the problem is that the HLOOKUP command is not automatically updated with the new array (the array in which member number 1001 is), so my solution works for member 1000 but not for 1001 and 1002; it always searches for the new value ONLY in the 1st row (i.e. the row with member number 1000).

I also tried reading the file with Python, but I am not well-versed enough to make much headway: once the dataset has been read, how do I tell it to read the next 3 rows and get the (absolute) maximum in each column?

Can someone please help? Solution required in Python 3 or Excel (ideally Excel 2014).
The below solution will get you your desired output using Python. I first ffill to fill in the blanks in your Member No. column (axis=0 means row-wise). Then I convert your dataframe values to positive using abs. Lastly, grouping on Member No., I take the max value of each column. Assuming your dataframe is called data:

import pandas as pd

data['Member No.'] = data['Member No.'].ffill(axis=0).astype(int)
data = abs(data)
res = data.groupby('Member No.').apply(lambda x: x.max()).drop('Member No.', axis=1).reset_index()

Which will print you:

   Member No.   X    Y    Z    A    B    C
0        1000  69   60   68   60   74   69
1        1001  78   55   89   78   92   87
2        1002  94  559  985  985  971  976

Note that I added extra columns to your sample data to make sure that all the columns will return their max() value.
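For completeness, an end-to-end sketch starting from the Excel file itself. The file name members.xlsx is hypothetical, and this assumes the blank cells under Member No. come in as NaN:

import pandas as pd

data = pd.read_excel('members.xlsx')   # hypothetical file; needs an engine such as openpyxl
data['Member No.'] = data['Member No.'].ffill().astype(int)
# take absolute values, then the max of each column per member
res = data.set_index('Member No.').abs().groupby(level=0).max().reset_index()
print(res)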
Replace Not Correctly Removing for Object Variable Type
I'm trying to remove the ":30" portion of values in my First variable. The First variable's data type is object. Here are a few examples of the First variable and the counts (ignore the counts):

11a       211
7p        178
4p        127
2:30p     112
11:30a    108
1p        107
12p       105
9a        100
10p        85
2p         24
10:30a     12
6p          5
9:30a       2
9p          2
12:30a      2
8p          2

I wrote the following code, which runs without any errors; however, when I run the value counts, it still shows times with a ":30". The NewFirst variable datatype is int64. Not quite sure what I'm doing wrong here.

bad_chars = ":30"

DF["NewFirst"] = DF.First.replace(bad_chars,'')

DF["NewFirst"].value_counts()

The desired output would have the NewFirst values like:

11a    211
7p     178
4p     127
2p     112
11a    108
1p     107
12p    105
9a     100
10p     85
2p      24
10a     12
6p       5
9a       2
9p       2
12a      2
8p       2
You shouldn't be looping over the characters in bad_chars. That will remove every 3 and 0 character, so 10p will become 1p, and 3a will become a. You should just replace the whole bad_chars string, with no loop. You also need to use the .str accessor:

DF["NewFirst"] = DF["First"].str.replace(bad_chars, '')
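A tiny sketch of the difference (toy Series, not the full data):

import pandas as pd

s = pd.Series(['11a', '2:30p', '10:30a'])

# Series.replace matches whole cell values, and nothing here equals ':30':
print(s.replace(':30', '').tolist())                    # ['11a', '2:30p', '10:30a']

# .str.replace works on substrings inside each string:
print(s.str.replace(':30', '', regex=False).tolist())   # ['11a', '2p', '10a']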
Selecting rows with lowest values based on combination two columns from pandas
I'm not even sure if the title makes sense. I have a pandas dataframe with 3 columns: x, y, time. There are a few thousand rows. Example below:

       x    y       time
0    225    0  20.295270
1    225    1  21.134015
2    225    2  21.382298
3    225    3  20.704367
4    225    4  20.152735
5    225    5  19.213522
.......
900  437  900  27.748966
901  437  901  20.898460
902  437  902  23.347935
903  437  903  22.011992
904  437  904  21.231041
905  437  905  28.769945
906  437  906  21.662975
.... and so on

What I want to do is retrieve the rows that have the smallest time associated with x and y. Basically, for every value of y, I want to find the row with the smallest time value, but I want to exclude those that have time 0.0. Time is 0.0 when x has the same value as y. So, for example, the fastest way to get to y-0 is by starting from x-225, and so on; therefore x may repeat itself, but for a different y. e.g.

   x    y       time
 225    0  20.295270
 438    1  19.648954
  27   20   4.342732
   9  438  17.884423
 225  907  24.560400

I tried groupby up until now, but I'm only getting the same x as y:

print(df.groupby('id_y', sort=False)['time'].idxmin())

y
0    0
1    1
2    2
3    3
4    4

The one below just returns the df that I already have:

df.loc[df.groupby("id_y")["time"].idxmin()]

Just to point out one thing: I'm open to options other than groupby; if there are other ways, that is very good.
So you first need to remove the rows where time is 0, by boolean indexing, and then use your solution:

df = df[df['time'] != 0]
df2 = df.loc[df.groupby("y")["time"].idxmin()]

A similar alternative, filtering with query:

df = df.query('time != 0')
df2 = df.loc[df.groupby("y")["time"].idxmin()]

Or use sort_values with drop_duplicates:

df2 = df[df['time'] != 0].sort_values(['y','time']).drop_duplicates('y')
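A quick check of the last variant on a toy frame (made-up numbers, just to show the mechanics):

import pandas as pd

df = pd.DataFrame({
    'x':    [225, 438, 0, 225],
    'y':    [0, 1, 0, 1],
    'time': [20.3, 19.6, 0.0, 21.1],
})

# drop the zero-time self-pairs, then keep the fastest row per y
out = df[df['time'] != 0].sort_values(['y', 'time']).drop_duplicates('y')
print(out)

#      x  y  time
# 0  225  0  20.3
# 1  438  1  19.6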