Using groupby() on a dataframe in pandas resulted in an IndexError - python

I have this dataframe:
     x    y      z  parameter
0   26   24     25        Age
1   35   37     36        Age
2   57   52   54.5        Age
3  160  164    162        Hgt
4  182  163  172.5        Hgt
5  175  167    171        Hgt
6   95   71     83        Wgt
7  110   68     89        Wgt
8   89   65     77        Wgt
I'm using pandas to get this final result:
     x    y  parameter
0  160  164        Hgt
1  182  163        Hgt
2  175  167        Hgt
I'm using groupby() to extract and isolate the rows whose parameter is Hgt from the original dataframe.
First, I added a column to set it as an index:
df = df.insert(0,'index', [count for count in range(df.shape[0])], True)
And the dataframe came out like this:
   index    x    y      z  parameter
0      0   26   24     25        Age
1      1   35   37     36        Age
2      2   57   52   54.5        Age
3      3  160  164    162        Hgt
4      4  182  163  172.5        Hgt
5      5  175  167    171        Hgt
6      6   95   71     83        Wgt
7      7  110   68     89        Wgt
8      8   89   65     77        Wgt
Then, I used the following code to group based on index and extract the columns I need:
df1 = df.groupby('index')[['x', 'y','parameter']]
And the output was:
     x    y  parameter
0   26   24        Age
1   35   37        Age
2   57   52        Age
3  160  164        Hgt
4  182  163        Hgt
5  175  167        Hgt
6   95   71        Wgt
7  110   68        Wgt
8   89   65        Wgt
After that, I used the following code to isolate only Hgt values:
df2 = df1[df1['parameter'] == 'Hgt']
When I ran df2, I got an error saying:
IndexError: Column(s) ['x', 'y', 'parameter'] already selected
Am I missing something here? What should I do to get the final result?

Because you asked what you did wrong, let me point out the unnecessary/incorrect code.
Without any judgement (this is just to help you improve future code): almost every step here is either incorrect or unnecessary. It feels like a succession of complicated ways to do things that are not needed. Let me give some details:
df = df.insert(0,'index', [count for count in range(df.shape[0])], True)
This seems a very convoluted way to do df.reset_index(). Even [count for count in range(df.shape[0])] could have been simplified to range(df.shape[0]) directly. (Also note that DataFrame.insert works in place and returns None, so assigning its result back to df would actually leave you with None.)
But this step is not even needed for a groupby as you can group by index level:
df.groupby(level=0)
But... the groupby is useless anyway, as you only have single-member groups.
Also, when you do:
df1 = df.groupby('index')[['x', 'y','parameter']]
df1 is not a DataFrame but a DataFrameGroupBy object. Storing one in a variable is very useful when you know what you're doing, but here it is what causes your error, because you treated it as a DataFrame. You need to apply an aggregation or transformation method of the DataFrameGroupBy object to get a DataFrame back, which you didn't do (likely because, as seen above, there isn't much interesting to do with single-member groups).
So when you run:
df1[df1['parameter'] == 'Hgt']
again, everything is wrong: df1['parameter'] is equivalent to df.groupby('index')[['x', 'y', 'parameter']]['parameter'], which is the cause of the error, since you select 'parameter' twice. Even if you removed this error, the equality comparison would give a single True/False because you still have a DataFrameGroupBy and not a DataFrame, and df1[...] would then incorrectly try to select a nonexistent column of the DataFrameGroupBy.
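For completeness, a minimal sketch of the distinction (grouping on 'parameter', which, unlike your row index, actually forms multi-row groups): only after an aggregation do you get a DataFrame back.

# sketch: df is the original dataframe with columns x, y, z, parameter
grouped = df.groupby('parameter')[['x', 'y']]
print(type(grouped))    # DataFrameGroupBy, not a DataFrame
means = grouped.mean()  # an aggregation -> a regular DataFrame
print(type(means))      # DataFrame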
I hope it helped!

Do you really need groupby?
>>> df.loc[df['parameter'] == 'Hgt', ['x', 'y', 'parameter']].reset_index(drop=True)
     x    y  parameter
0  160  164        Hgt
1  182  163        Hgt
2  175  167        Hgt
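If you still want a groupby in the picture, a small sketch (same df as in the question) is to group on 'parameter' and pull out the single group you need with get_group:

>>> (df.groupby('parameter')
...    .get_group('Hgt')[['x', 'y', 'parameter']]
...    .reset_index(drop=True))

Both approaches return the three Hgt rows; the boolean .loc filter above is just the more direct route.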

Compare row-wise elements of a single column: if there are 2 continuous L then select the lowest from the High column and ignore the other; conversely, if there are 2 continuous H then select the highest

High D_HIGH D_HIGH_H
33 46.57 0 0L
0 69.93 42 42H
1 86.44 68 68H
34 56.58 83 83L
35 67.12 125 125L
2 117.91 158 158H
36 94.51 186 186L
3 120.45 245 245H
4 123.28 254 254H
37 83.20 286 286L
In column D_HIGH_H each value ends with an L or an H.
If there are two continuous H, the one having the highest value in the High column has to be selected and the other has to be ignored (deleted).
If there are two continuous L, the one having the lowest value in the High column has to be selected and the other has to be ignored (deleted).
If the sequence is H, L, H, L then no changes are to be made.
Output I want is as follows:
High D_HIGH D_HIGH_H
33 46.57 0 0L
1 86.44 68 68H
34 56.58 83 83L
2 117.91 158 158H
36 94.51 186 186L
4 123.28 254 254H
37 83.20 286 286L
I tried various options using list and map but they did not work out. I also tried groupby but reached no logical conclusion.
Here's one way:
# group id: consecutive rows sharing the same trailing letter get the same id
g = ((l := df['D_HIGH_H'].str[-1]) != l.shift()).cumsum()

def f(x):
    # keep the row with the largest D_HIGH for an 'H' run,
    # the smallest for an 'L' run
    if (x['D_HIGH_H'].str[-1] == 'H').any():
        return x.nlargest(1, 'D_HIGH')
    return x.nsmallest(1, 'D_HIGH')

df.groupby(g, as_index=False).apply(f)
Output:
High D_HIGH D_HIGH_H
0 33 46.57 0 0L
1 1 86.44 68 68H
2 34 56.58 83 83L
3 2 117.91 158 158H
4 36 94.51 186 186L
5 4 123.28 254 254H
6 37 83.20 286 286L
You can use extract to get the letter, then compute a custom group and groupby.apply with a function that depends on the letter:
# extract the trailing letter
s = df['D_HIGH_H'].str.extract(r'(\D)$', expand=False)

# group by successive letters, then take the idxmin/idxmax
# depending on the type of letter
keep = (df['High']
        .groupby([s, s.ne(s.shift()).cumsum()], sort=False)
        .apply(lambda x: x.idxmin() if x.name[0] == 'L' else x.idxmax())
        .tolist()
        )
out = df.loc[keep]
Output:
High D_HIGH D_HIGH_H
33 46.57 0 0L
1 86.44 68 68H
34 56.58 83 83L
2 117.91 158 158H
36 94.51 186 186L
4 123.28 254 254H
37 83.20 286 286L

Extract information from an Excel (by updating arrays) with Excel / Python

I have an Excel file with thousands of columns in the following format:
Member No.      X      Y      Z
1000           25     60    -30
              -69     38     68
               45      2     43
1001           24     55     79
                4     -7     89
               78     51     -2
1002           45    -55    149
               94     77   -985
               -2    559     56
I need a way to get a new table with the absolute maximum value of each column for each member. In this example, something like:
Member No.      X      Y      Z
1000           69     60     68
1001           78     55     89
1002           94    559    985
I have tried it in Excel (using VLOOKUP to find the "Member Number" in the first row of a block and then HLOOKUP to find the values from the rows thereafter), but the problem is that the HLOOKUP array is not automatically updated to the next block (the block containing member number 1001), so my solution works for member 1000 but not for 1001 and 1002: it always searches for the new value ONLY in the first block (the rows with member number 1000).
I also tried reading the file with Python, but I am not well-versed enough to make much headway. Once the dataset has been read, how do I read the next 3 rows for each member and get the (absolute) maximum in each column?
Can someone please help? Solution required in Python 3 or Excel (ideally, Excel 2014).
The below solution will get you your desired output using Python.
I first ffill to fill in the blanks in your Member No. column (axis=0 fills down the rows). Then I convert your dataframe values to positive using abs. Lastly, I group by Member No. and take the max of every column.
Assuming your dataframe is called data:
import pandas as pd

data['Member No.'] = data['Member No.'].ffill(axis=0).astype(int)
data = data.abs()
res = (data.groupby('Member No.').apply(lambda x: x.max())
           .drop('Member No.', axis=1)
           .reset_index())
Which will print you:
Member No. X Y Z A B C
0 1000 69 60 68 60 74 69
1 1001 78 55 89 78 92 87
2 1002 94 559 985 985 971 976
Note that I added extra columns in your sample data to make sure that all the columns will return their max() value.
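If you are starting from the Excel file itself rather than an already-loaded DataFrame, a quick sketch (the filename and sheet name below are placeholders for your actual workbook):

import pandas as pd

# hypothetical file/sheet names - adjust to your workbook
data = pd.read_excel('members.xlsx', sheet_name='Sheet1')

From there, the code above produces res as shown.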

Pandas: How to (cleanly) unpivot two columns with same category?

I'm trying to unpivot two columns inside a pandas dataframe. The transformation I seek would be the inverse of this question.
We start with a dataset that looks like this:
import pandas as pd
import numpy as np
df_orig = pd.DataFrame(data=np.random.randint(255, size=(4, 5)),
                       columns=['accuracy', 'time_a', 'time_b', 'memory_a', 'memory_b'])
df_orig
accuracy time_a time_b memory_a memory_b
0 6 118 170 102 239
1 241 9 166 159 162
2 164 70 76 228 121
3 228 121 135 128 92
I wish to unpivot both the memory and time columns, obtaining this dataset as the result:
df
accuracy memory category time
0 6 102 a 118
1 241 159 a 9
2 164 228 a 70
3 228 128 a 121
12 6 239 b 170
13 241 162 b 166
14 164 121 b 76
15 228 92 b 135
So far I have managed to get my desired output using df.melt() twice plus some extra commands:
df = df_orig.copy()

# Unpivot memory columns
df = df.melt(id_vars=['accuracy', 'time_a', 'time_b'],
             value_vars=['memory_a', 'memory_b'],
             value_name='memory',
             var_name='mem_cat')

# Unpivot time columns
df = df.melt(id_vars=['accuracy', 'memory', 'mem_cat'],
             value_vars=['time_a', 'time_b'],
             value_name='time',
             var_name='time_cat')

# Keep only the 'a'/'b' as categories
df.mem_cat = df.mem_cat.str[-1]
df.time_cat = df.time_cat.str[-1]

# Keep only the rows whose categories match (DIRTY!)
df = df[df.mem_cat == df.time_cat]

# Remove the duplicated category column
df = df.drop(columns='time_cat').rename(columns={'mem_cat': 'category'})
Given how easy it was to solve the inverse question, I believe my code is way too complex. Can anyone do it better?
Use wide_to_long:
np.random.seed(123)
df_orig = pd.DataFrame(data=np.random.randint(255, size=(4, 5)),
                       columns=['accuracy', 'time_a', 'time_b', 'memory_a', 'memory_b'])

df = (pd.wide_to_long(df_orig.reset_index(),
                      stubnames=['time', 'memory'],
                      i='index',
                      j='category',
                      sep='_',
                      suffix=r'\w+')
        .reset_index(level=1)
        .reset_index(drop=True)
        .rename_axis(None))
print(df)
category accuracy time memory
0 a 254 109 66
1 a 98 230 83
2 a 123 57 225
3 a 113 126 73
4 b 254 126 220
5 b 98 17 106
6 b 123 214 96
7 b 113 47 32

Comparing results in dataframe and grouping results

I have a dataset consisting of emails and how similar they are to each other, measured by a score.
emlgroup1 emlgroup2 scores
79 1739.eml 1742.eml 100
130 1742.eml 1739.eml 100
153 1743.eml 1744.eml 99
157 1743.eml 1748.eml 82
170 1744.eml 1743.eml 99
175 1744.eml 1748.eml 82
231 1747.eml 1750.eml 85
242 1748.eml 1743.eml 82
243 1748.eml 1744.eml 82
282 1750.eml 1747.eml 85
What I want to do now is group them automatically like so and put that in a new dataframe with one column.
group 1: 1739.eml, 1742.eml
group 2: 1743.eml, 1744.eml, 1748.eml
group 3: 1747.eml, 1750.eml
Desired Output:
Col 1
1 1739.eml 1742.eml
2 1743.eml 1744.eml 1748.eml
3 1747.eml 1750.eml
I am getting stuck at the logic part where it splits the data into another group/cluster. I'm really new to posting on Stack Overflow so I hope I am not committing any sins. Thanks in advance!
This is a network problem; you can solve it using networkx:
import networkx as nx

# each email is a node, each row of the dataframe is an edge
G = nx.from_pandas_edgelist(df, 'emlgroup1', 'emlgroup2')

# every connected component is one group of related emails
l = list(nx.connected_components(G))
l
[{'1739.eml', '1742.eml'}, {'1744.eml', '1743.eml', '1748.eml'}, {'1747.eml', '1750.eml'}]
pd.Series(l).to_frame('col 1')
col 1
0 {1739.eml, 1742.eml}
1 {1744.eml, 1743.eml, 1748.eml}
2 {1747.eml, 1750.eml}
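If you want each group as a single space-separated string (closer to your desired output) instead of a set, a small follow-up sketch on the same list l:

# join each connected component into one string per row
pd.DataFrame({'Col 1': [' '.join(sorted(group)) for group in l]})

sorted() is only there to make the order inside each group deterministic.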

Series of Strings to Arrays

I have some image arrays that I'm trying to run a regression on, but somehow I'm importing the CSV file as a series of strings instead of a series of arrays.
In: image_train = pd.read_csv('image_train_data.csv')
In: image_train['image_array'].head()
Out: 0 [73 77 58 71 68 50 77 69 44 120 116 83 125 120...
1 [7 5 8 7 5 8 5 4 6 7 4 7 11 5 9 11 5 9 17 11 1...
2 [169 122 65 131 108 75 193 196 192 218 221 222...
3 [154 179 152 159 183 157 165 189 162 174 199 1...
4 [216 195 180 201 178 160 210 184 164 212 188 1...
Name: image_array, dtype: object
When I try to run the regression using image_train['image_array'] I get
ValueError: could not convert string to float: '[255 255 255 255 255 255 255 255...
The array is a string.
Is there a way to transform the strings to arrays for the entire series?
You can use converters to describe how you want to read that field in. The easiest way would be to define your own converter to treat that column as a list, e.g.:
import ast

def conv(x):
    # '[73 77 58 ...]' -> '[73,77,58,...]' -> Python list of ints
    return ast.literal_eval(','.join(x.split(' ')))

image_train = pd.read_csv('image_train_data.csv', converters={'image_array': conv})
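Once the column contains real lists, you will probably want a 2-D array for the regression; a sketch, assuming every image has the same number of values:

import numpy as np

# stack each row's list into one row of a feature matrix
X = np.vstack(image_train['image_array'].tolist())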
While AChampion's solution looks good, I went ahead and found another solution:
image_train['image_array'].str.findall(r'\d+').apply(lambda x: list(map(int, x)))
Which would be useful if you already had it loaded and didn't want to/couldn't load it again.
Here's another solution that works well for just evaluating a literal string representation of a list:
pd.eval(image_train['image_array'])
However, if it's separated by spaces you could do:
pd.eval(image_train['image_array'].str.replace(' ', ','))
