I have this dataframe:
x y z parameter
0 26 24 25 Age
1 35 37 36 Age
2 57 52 54.5 Age
3 160 164 162 Hgt
4 182 163 172.5 Hgt
5 175 167 171 Hgt
6 95 71 83 Wgt
7 110 68 89 Wgt
8 89 65 77 Wgt
I'm using pandas to get this final result:
x y parameter
0 160 164 Hgt
1 182 163 Hgt
2 175 167 Hgt
I'm using groupby() to extract and isolate the rows that share the parameter Hgt from the original dataframe.
First, I added a column to set it as an index:
df = df.insert(0,'index', [count for count in range(df.shape[0])], True)
And the dataframe came out like this:
index x y z parameter
0 0 26 24 25 Age
1 1 35 37 36 Age
2 2 57 52 54.5 Age
3 3 160 164 162 Hgt
4 4 182 163 172.5 Hgt
5 5 175 167 171 Hgt
6 6 95 71 83 Wgt
7 7 110 68 89 Wgt
8 8 89 65 77 Wgt
Then, I used the following code to group based on index and extract the columns I need:
df1 = df.groupby('index')[['x', 'y','parameter']]
And the output was:
x y parameter
0 26 24 Age
1 35 37 Age
2 57 52 Age
3 160 164 Hgt
4 182 163 Hgt
5 175 167 Hgt
6 95 71 Wgt
7 110 68 Wgt
8 89 65 Wgt
After that, I used the following code to isolate only Hgt values:
df2 = df1[df1['parameter'] == 'Hgt']
When I ran df2, I got an error saying:
IndexError: Column(s) ['x', 'y', 'parameter'] already selected
Am I missing something here? What to do to get the final result?
Since you asked what you did wrong, let me point out the useless/bad code.
Without any judgement (this is just to help you improve future code), almost everything is incorrect. It feels like a succession of complicated ways to do unnecessary things. Let me give some details:
df = df.insert(0,'index', [count for count in range(df.shape[0])], True)
This seems like a very convoluted way to do df.reset_index(). Even [count for count in range(df.shape[0])] could have been simplified by using range(df.shape[0]) directly. Also note that DataFrame.insert works in place and returns None, so df = df.insert(...) actually sets df to None.
But this step is not even needed for a groupby as you can group by index level:
df.groupby(level=0)
But... the groupby is useless anyway, as you only have single-membered groups.
Also, when you do:
df1 = df.groupby('index')[['x', 'y','parameter']]
df1 is not a DataFrame but a DataFrameGroupBy object. Storing one in a variable is very useful when you know what you're doing; here, however, it is causing the error because you thought it was a DataFrame. You need to apply an aggregation or transformation method of the DataFrameGroupBy object to get back a DataFrame, which you didn't (likely because, as seen above, there isn't much of interest to do on single-membered groups).
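To illustrate the distinction, here is a minimal sketch with a tiny made-up frame, showing that the column selection on a groupby is still a DataFrameGroupBy, and that only an aggregation returns a DataFrame:

```python
import pandas as pd

# Tiny made-up frame, just to show the types involved
df = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1, 2, 3]})

gb = df.groupby('g')[['v']]
print(type(gb).__name__)   # DataFrameGroupBy, not DataFrame

out = gb.sum()             # an aggregation gives you back a DataFrame
print(type(out).__name__)  # DataFrame
```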
So when you run:
df1[df1['parameter'] == 'Hgt']
again, all is wrong, as df1['parameter'] is equivalent to df.groupby('index')[['x', 'y','parameter']]['parameter'] (the cause of the error, as you select 'parameter' twice). Even if you removed this error, the equality comparison would give a single True/False, since you still have a DataFrameGroupBy and not a DataFrame, and this would then incorrectly try to subselect a nonexistent column of the DataFrameGroupBy.
I hope it helped!
Do you really need groupby?
>>> df.loc[df['parameter'] == 'Hgt', ['x', 'y', 'parameter']].reset_index(drop=True)
x y parameter
0 160 164 Hgt
1 182 163 Hgt
2 175 167 Hgt
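If you would rather keep a groupby, get_group can pull a single group out as a DataFrame; a minimal sketch rebuilding the question's data:

```python
import pandas as pd

# Rebuilding the question's frame (z omitted, as it isn't needed here)
df = pd.DataFrame({'x': [26, 35, 57, 160, 182, 175, 95, 110, 89],
                   'y': [24, 37, 52, 164, 163, 167, 71, 68, 65],
                   'parameter': ['Age'] * 3 + ['Hgt'] * 3 + ['Wgt'] * 3})

# get_group returns the rows of one group as a plain DataFrame
out = df.groupby('parameter').get_group('Hgt').reset_index(drop=True)
print(out)
```

This is still more machinery than the boolean-mask version above, but it is handy when you need several groups one at a time.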
I have an Excel file with thousands of rows in the following format:
Member No.    X      Y      Z
1000          25     60     -30
              -69    38     68
              45     2      43
1001          24     55     79
              4      -7     89
              78     51     -2
1002          45     -55    149
              94     77     -985
              -2     559    56
I need a way such that I shall get a new table with the absolute maximum value from each column. In this example, something like:
Member No.    X     Y      Z
1000          69    60     68
1001          78    55     89
1002          94    559    985
I have tried it in Excel (using VLOOKUP to find the "Member Number" in the first row, then HLOOKUP to find the values in the rows thereafter), but the problem is that the HLOOKUP array is not automatically updated to the next block (the one containing member number 1001). So my solution works for member 1000 but not for 1001 and 1002, because it always searches for the new value only in the first block (the one starting with member number 1000).
I also tried reading the file with Python, but I am not well-versed enough to make much headway: once the dataset has been read, how do I take the next 3 rows and get the (absolute) maximum in each column?
Can someone please help? A solution in Python 3 or Excel (ideally Excel 2014) is needed.
The solution below will get you your desired output using Python.
I first use ffill to fill in the blanks in your Member No. column (filling downward along the index). Then I convert the dataframe values to absolute values with abs. Lastly, I group by Member No. and take the max of every column.
Assuming your dataframe is called data:
import pandas as pd
data['Member No.'] = data['Member No.'].ffill(axis=0).astype(int)
data = data.abs()
res = (data.groupby('Member No.').apply(lambda x: x.max())).drop('Member No.',axis=1).reset_index()
Which will print you:
Member No. X Y Z A B C
0 1000 69 60 68 60 74 69
1 1001 78 55 89 78 92 87
2 1002 94 559 985 985 971 976
Note that I added extra columns in your sample data to make sure that all the columns will return their max() value.
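For reference, here is a self-contained version of the same steps, using only the X/Y/Z columns from the question's sample (column names assumed as shown):

```python
import pandas as pd
import numpy as np

# Sample data from the question: 'Member No.' appears only on the first
# row of each three-row block, so the other rows hold NaN.
data = pd.DataFrame({
    'Member No.': [1000, np.nan, np.nan, 1001, np.nan, np.nan, 1002, np.nan, np.nan],
    'X': [25, -69, 45, 24, 4, 78, 45, 94, -2],
    'Y': [60, 38, 2, 55, -7, 51, -55, 77, 559],
    'Z': [-30, 68, 43, 79, 89, -2, 149, -985, 56],
})

data['Member No.'] = data['Member No.'].ffill().astype(int)  # fill the gaps downward
data = data.abs()                                            # absolute values everywhere
res = data.groupby('Member No.').max().reset_index()         # max of each column per member
print(res)
```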
I am trying to apply a custom function that takes two arguments to two particular columns of a grouped dataframe.
I have tried apply on a groupby dataframe, but any suggestion is welcome.
I have the following dataframe:
id y z
115 10 820
115 12 960
115 13 1100
144 25 2500
144 55 5500
144 65 960
144 68 6200
144 25 2550
146 25 2487
146 25 2847
146 25 2569
146 25 2600
146 25 2382
And I would like to apply a custom function with two arguments and get the result by id.
def train_logmodel(x, y):
    ##.........
    return x
data.groupby('id')[['y','z']].apply(train_logmodel)
TypeError: train_logmodel() missing 1 required positional argument: 'y'
I would like to know how to pass 'y' and 'z' in order to estimate the desired column 'x' by each id.
The expected output example:
id x
115 0.23
144 0.45
146 0.58
It is a little different from the question: How to apply a function to two columns of Pandas dataframe
In this case we have to deal with a groupby dataframe, which works slightly differently from a plain dataframe.
Thanks in advance!
Not knowing your train_logmodel function, I can only give a general example here. Your function should take one argument; from that argument you get the columns inside the function:
def train_logmodel(data):
    return (data.z / data.y).min()
df.groupby('id').apply(train_logmodel)
Result:
id
115 80.000000
144 14.769231
146 95.280000
I am reading a column from an excel file using openpyxl.
I have written code to read the column of data I need from Excel, but the data is separated by empty cells.
I want to group these data wherever the cell value is not None into 19 sets of countries, so that I can later calculate the mean and standard deviation for the 19 countries.
I don't want to hard-code it using list slices. Instead I want to save these integers to a list (or list of lists) using a loop, but I'm not sure how, because this is my first project with Python.
Here's my code:
#Read PCT rankings project ratified results
#Beta
import openpyxl
wb=openpyxl.load_workbook('PCT rankings project ratified results.xlsx', data_only=True)
sheet=wb.get_sheet_by_name('PCT by IP firms')
row_counter=sheet.max_row
column_counter=sheet.max_column
print(row_counter)
print(column_counter)
#iterating over the column of patent filings, trying to use empty cells to flag the loop
#so it appends/stores the list of numbers before reaching the next non-empty cell,
#repeating every time this happens (expecting 19 times)
list=[]
for row in range(4, sheet.max_row + 1):
    patent = sheet['I' + str(row)].value
    print(patent)
    if patent == None:
        list.append(patent)
print(list)
This is the output from Python, giving you a visualisation of what I am trying to do.
Column I:
412
14
493
488
339
273
238
226
200
194
153
164
151
126
None
120
None
None
133
77
62
79
24
0
30
20
16
0
6
9
11
None
None
None
None
608
529
435
320
266
264
200
272
134
113
73
23
12
52
21
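One way to split such a run on the None separators, without hard-coded slices, is itertools.groupby; a minimal sketch with a short made-up list:

```python
from itertools import groupby

# Made-up short list standing in for the column values above:
# split on the None separators, keeping each run of numbers as one group
values = [412, 14, None, 120, None, None, 133, 77]

groups = [list(run)
          for has_value, run in groupby(values, key=lambda v: v is not None)
          if has_value]
print(groups)  # [[412, 14], [120], [133, 77]]
```

Appending every non-None cell value to one list while iterating, then feeding that list to this snippet, would give the 19 country groups.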
I want to divide my data frame into groups in two stages, based on two columns. The first stage gives 20 groups, based on the 20 integers in the first column. Each of those groups is then further divided into groups based on the value of the second column (200 integers, in intervals of 10). Any idea how I can do this? The data frame looks something like this:
60 150
60 155
60 156
61 155
61 166
62 132
62 145
62 167
63 172
63 180
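A possible sketch (column names 'a' and 'b' are assumed, since the sample frame has no headers): group by the first column, then bin the second column into width-10 intervals and group by both keys at once:

```python
import pandas as pd

# Sample rows from the question; 'a' and 'b' are assumed column names
df = pd.DataFrame({'a': [60, 60, 60, 61, 61, 62, 62, 62, 63, 63],
                   'b': [150, 155, 156, 155, 166, 132, 145, 167, 172, 180]})

df['b_bin'] = (df['b'] // 10) * 10   # 150-159 -> 150, 160-169 -> 160, ...
groups = df.groupby(['a', 'b_bin'])  # one group per (first column, bin) pair

for key, g in groups:
    print(key, len(g))
```

pd.cut with explicit bin edges would work as well if the intervals need different boundaries.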