For a school project I need to implement the following function.
Write a function select(df, col1, col2) that takes a dataframe and two column labels and returns a multi-indexed Series with the fraction of occurrences of each possible value of col2 given the values of col1.
For example select(df_test, 'Do you ever gamble?', 'Lottery Type') would yield
No risk yes 0.433099
risk no 0.566901
Yes risk yes 0.548872
risk no 0.451128
Note that the sum of Lottery Type:risk yes + Lottery Type:risk no is 1.0.
The original dataframe was much larger, but I managed to group and aggregate part of the way using gr = df.groupby([col1, col2], as_index=True).count(), which produced the smaller dataframe below:
Do you ever smoke cigarettes? Do you ever drink alcohol? Have you ever been skydiving? Do you ever drive above the speed limit? Have you ever cheated on your significant other? Do you eat steak? How do you like your steak prepared? Gender Age Household Income Education Location (Census Region)
Do you ever gamble? Lottery Type
No risk no 155 157 156 157 155 157 121 147 147 121 147 145
risk yes 120 120 120 119 120 120 89 117 117 94 116 117
Yes risk no 114 114 113 113 114 114 99 110 110 96 109 110
risk yes 141 142 141 142 142 141 116 133 133 113 133 133
(In the original post this was shown as an image of the dataframe above, because the text rendering is messy.) So my question is: how can I aggregate to get, say, the percentage of people who don't smoke versus the percentage who do? I tried custom aggregation functions but couldn't figure it out; the function below just throws a TypeError:
.agg(lambda x: sum(x)/len(x))
TypeError: unsupported operand type(s) for +: 'int' and 'str'
Have a look at pivot_table. https://stackoverflow.com/a/40302194/5478373 has a good example of how to use pivot_table to sum the totals, then divide the results by that total and multiply by 100:
group = (pd.pivot_table(df,
                        ...
                        aggfunc=np.sum)
           .div(len(df.index))
           .mul(100))
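As a sketch of an alternative to pivot_table: groupby combined with value_counts(normalize=True) computes the per-group fractions directly and returns the multi-indexed Series the question asks for. The miniature dataframe below is made up for illustration; only the two column names come from the question.

```python
import pandas as pd

# Tiny made-up stand-in for the survey data (values are illustrative only)
df_test = pd.DataFrame({
    'Do you ever gamble?': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No'],
    'Lottery Type':        ['risk yes', 'risk no', 'risk yes',
                            'risk no', 'risk yes', 'risk no'],
})

def select(df, col1, col2):
    # value_counts(normalize=True) computes fractions within each col1 group,
    # yielding a Series indexed by a (col1, col2) MultiIndex
    return df.groupby(col1)[col2].value_counts(normalize=True)

print(select(df_test, 'Do you ever gamble?', 'Lottery Type'))
```

Each group's fractions sum to 1.0, matching the property noted in the question.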
I have this dataframe in Python:
   Winner_height  Winner_rank  Loser_height  Loser_rank
0            183           15           185          32
1            195           42           178          12
And I would like to get a combined dataframe keeping the information about both players, plus a field that identifies the winner (0 if it is player 1, 1 if it is player 2), as below:
   Player_1_height  Player_1_rank  Player_2_height  Player_2_rank  Winner
0              183             15              185             32       0
1              178             12              195             42       1
Is there an efficient way to mix groups of columns with pandas, i.e. without drawing a random number for each row and creating a duplicate database?
Thanks in advance
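One vectorized approach (a sketch, not the only option): draw all the random bits at once with NumPy and use np.where to route each winner/loser pair into the player columns, so there is no per-row Python loop and no duplicated dataframe. Column names follow the question; the seed is arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Winner_height': [183, 195],
    'Winner_rank':   [15, 42],
    'Loser_height':  [185, 178],
    'Loser_rank':    [32, 12],
})

rng = np.random.default_rng(0)
# One random bit per row: 0 -> winner stored as Player_1, 1 -> winner stored as Player_2
winner = rng.integers(0, 2, size=len(df))

mixed = pd.DataFrame({
    'Player_1_height': np.where(winner == 0, df['Winner_height'], df['Loser_height']),
    'Player_1_rank':   np.where(winner == 0, df['Winner_rank'],   df['Loser_rank']),
    'Player_2_height': np.where(winner == 0, df['Loser_height'],  df['Winner_height']),
    'Player_2_rank':   np.where(winner == 0, df['Loser_rank'],    df['Winner_rank']),
    'Winner': winner,
})
```

This still draws one random bit per row, but does so in a single vectorized call rather than row by row.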
I have this dataframe:
x y z parameter
0 26 24 25 Age
1 35 37 36 Age
2 57 52 54.5 Age
3 160 164 162 Hgt
4 182 163 172.5 Hgt
5 175 167 171 Hgt
6 95 71 83 Wgt
7 110 68 89 Wgt
8 89 65 77 Wgt
I'm using pandas to get this final result:
x y parameter
0 160 164 Hgt
1 182 163 Hgt
2 175 167 Hgt
I'm using groupby() to extract and isolate the rows with the parameter Hgt from the original dataframe.
First, I added a column to set it as an index:
df = df.insert(0,'index', [count for count in range(df.shape[0])], True)
And the dataframe came out like this:
index x y z parameter
0 0 26 24 25 Age
1 1 35 37 36 Age
2 2 57 52 54.5 Age
3 3 160 164 162 Hgt
4 4 182 163 172.5 Hgt
5 5 175 167 171 Hgt
6 6 95 71 83 Wgt
7 7 110 68 89 Wgt
8 8 89 65 77 Wgt
Then, I used the following code to group based on index and extract the columns I need:
df1 = df.groupby('index')[['x', 'y','parameter']]
And the output was:
x y parameter
0 26 24 Age
1 35 37 Age
2 57 52 Age
3 160 164 Hgt
4 182 163 Hgt
5 175 167 Hgt
6 95 71 Wgt
7 110 68 Wgt
8 89 65 Wgt
After that, I used the following code to isolate only Hgt values:
df2 = df1[df1['parameter'] == 'Hgt']
When I ran df2, I got an error saying:
IndexError: Column(s) ['x', 'y', 'parameter'] already selected
Am I missing something here? What should I do to get the final result?
Because you asked what you did wrong, let me point out the useless/bad code.
Without any judgement (this is just to help you improve future code), almost everything is incorrect. It feels like a succession of complicated ways to do useless things. Let me give some details:
df = df.insert(0,'index', [count for count in range(df.shape[0])], True)
This seems a very convoluted way to do df.reset_index(). Even [count for count in range(df.shape[0])] could have been simplified by using range(df.shape[0]) directly. Worse, DataFrame.insert works in place and returns None, so df = df.insert(...) actually replaces df with None.
But this step is not even needed for a groupby as you can group by index level:
df.groupby(level=0)
But... the groupby is useless anyway, as you only have single-member groups.
Also, when you do:
df1 = df.groupby('index')[['x', 'y','parameter']]
df1 is not a dataframe but a DataFrameGroupBy object. Storing one in a variable is useful when you know what you're doing, but here it causes the error because you treated it as a DataFrame. You need to apply an aggregation or transformation method of the DataFrameGroupBy object to get a DataFrame back, which you didn't (likely because, as seen above, there isn't much interesting to do with single-member groups).
So when you run:
df1[df1['parameter'] == 'Hgt']
again, all is wrong: df1['parameter'] is equivalent to df.groupby('index')[['x', 'y','parameter']]['parameter'], which is the cause of the error, since you select 'parameter' twice. Even if you removed that error, the equality comparison would not behave as expected because df1 is still a DataFrameGroupBy, not a DataFrame, and the indexing would try to subselect a nonexistent column of the DataFrameGroupBy.
I hope it helped!
Do you really need groupby?
>>> df.loc[df['parameter'] == 'Hgt', ['x', 'y', 'parameter']].reset_index(drop=True)
x y parameter
0 160 164 Hgt
1 182 163 Hgt
2 175 167 Hgt
I have a df with numbers in the second column. Each number represents the length of a DNA sequence. I would like to create two new columns, where the first one says where this sequence starts and the second one says where it ends.
This is my current df:
Names LEN
0 Ribosomal_S9: 121
1 Ribosomal_S8: 129
2 Ribosomal_L10: 100
3 GrpE: 166
4 DUF150: 141
.. ... ...
115 TIGR03632: 117
116 TIGR03654: 175
117 TIGR03723: 314
118 TIGR03725: 212
119 TIGR03953: 188
[120 rows x 2 columns]
And this is what I am trying to get
Names LEN Start End
0 Ribosomal_S9: 121 0 121
1 Ribosomal_S8: 129 121 250
2 Ribosomal_L10: 100 250 350
3 GrpE: 166 350 516
4 DUF150: 141 516 657
.. ... ... ... ..
115 TIGR03632: 117
116 TIGR03654: 175
117 TIGR03723: 314
118 TIGR03725: 212
119 TIGR03953: 188
[120 rows x 4 columns]
Can anyone please point me in the right direction?
Use DataFrame.assign with new columns created by Series.cumsum; for Start, shift the cumulative sum with Series.shift:
#convert column to integers
df['LEN'] = df['LEN'].astype(int)
#alternative for replace non numeric to missing values
#df['LEN'] = pd.to_numeric(df['LEN'], errors='coerce')
s = df['LEN'].cumsum()
df = df.assign(Start = s.shift(fill_value=0), End = s)
print(df)
Names LEN Start End
0 Ribosomal_S9: 121 0 121
1 Ribosomal_S8: 129 121 250
2 Ribosomal_L10: 100 250 350
3 GrpE: 166 350 516
4 DUF150: 141 516 657
I am trying to apply a custom function that takes two arguments to two particular columns of a grouped dataframe.
I have tried apply on the groupby dataframe, but any suggestion is welcome.
I have the following dataframe:
id y z
115 10 820
115 12 960
115 13 1100
144 25 2500
144 55 5500
144 65 960
144 68 6200
144 25 2550
146 25 2487
146 25 2847
146 25 2569
146 25 2600
146 25 2382
And I would like to apply a custom function with two arguments and get the result by id.
def train_logmodel(x, y):
##.........
return x
data.groupby('id')[['y','z']].apply(train_logmodel)
TypeError: train_logmodel() missing 1 required positional argument: 'y'
I would like to know how to pass 'y' and 'z' in order to estimate the desired column 'x' by each id.
The expected output example:
id x
115 0.23
144 0.45
146 0.58
It is a little different from the question: How to apply a function to two columns of Pandas dataframe
In this case we have to deal with a grouped dataframe, which works slightly differently from a regular dataframe.
Thanks in advance!
Not knowing your train_logmodel function, I can only give a general example here. Your function should take one argument, the group; from this argument you get the columns inside your function:
def train_logmodel(data):
return (data.z / data.y).min()
df.groupby('id').apply(train_logmodel)
Result:
id
115 80.000000
144 14.769231
146 95.280000
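If train_logmodel genuinely needs extra arguments beyond the group itself, GroupBy.apply forwards positional and keyword arguments to the function. A sketch with a hypothetical function body and an assumed extra parameter scale (the real training logic is unknown):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [115, 115, 144, 144],
    'y':  [10, 12, 25, 55],
    'z':  [820, 960, 2500, 5500],
})

def train_logmodel(group, scale):
    # Hypothetical stand-in for the real training logic
    return (group['z'] / group['y']).min() * scale

# Any arguments after the function are passed through to it for every group
result = df.groupby('id').apply(train_logmodel, scale=2.0)
```

This keeps the one-group-per-call pattern from the answer above while letting you parameterize the function.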
I have a dataset consisting of emails and how similar they are to each other, correlated by a score.
emlgroup1 emlgroup2 scores
79 1739.eml 1742.eml 100
130 1742.eml 1739.eml 100
153 1743.eml 1744.eml 99
157 1743.eml 1748.eml 82
170 1744.eml 1743.eml 99
175 1744.eml 1748.eml 82
231 1747.eml 1750.eml 85
242 1748.eml 1743.eml 82
243 1748.eml 1744.eml 82
282 1750.eml 1747.eml 85
What I want to do now is group them automatically like so and put that in a new dataframe with one column.
group 1: 1739.eml, 1742.eml
group 2: 1743.eml, 1744.eml, 1748.eml
group 3: 1747.eml, 1750.eml
Desired Output:
Col 1
1 1739.eml 1742.eml
2 1743.eml 1744.eml 1748.eml
3 1747.eml 1750.eml
I am getting stuck at the logic where the data splits into another group/cluster. I'm really new to posting on Stack Overflow, so I hope I'm not committing any sins. Thanks in advance!
This is a network problem; you can solve it using networkx:
import networkx as nx
G = nx.from_pandas_edgelist(df, 'emlgroup1', 'emlgroup2')
l = list(nx.connected_components(G))
l
[{'1739.eml', '1742.eml'}, {'1744.eml', '1743.eml', '1748.eml'}, {'1747.eml', '1750.eml'}]
pd.Series(l).to_frame('col 1')
col 1
0 {1739.eml, 1742.eml}
1 {1744.eml, 1743.eml, 1748.eml}
2 {1747.eml, 1750.eml}
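If you want the groups rendered as the single space-separated column from the desired output rather than raw sets, a small follow-up conversion could look like this (starting from the connected-components list shown above, hard-coded here so the snippet stands alone):

```python
import pandas as pd

# Connected components as produced by networkx above
l = [{'1739.eml', '1742.eml'},
     {'1744.eml', '1743.eml', '1748.eml'},
     {'1747.eml', '1750.eml'}]

# Sort each set for a stable ordering, join into one string per group,
# and number the groups starting at 1 as in the desired output
out = pd.DataFrame({'Col 1': [' '.join(sorted(group)) for group in l]},
                   index=range(1, len(l) + 1))
print(out)
```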