This question already has answers here:
Preparing an aggregate dataframe for publication
(2 answers)
Closed 2 years ago.
I would like to group this dataframe by the unique values of the priority and Alias columns to create a LaTeX report:
Alias Number Duration(h) priority
A 23834 8111.130497 120
B 16453 6773.243598 120
C 15988 8347.042753 120
A 19 113.475702 139
B 16 113.476042 139
So I tried:
df = df.groupby(['priority', 'Alias'])
df
The terminal returns:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002377285CA00>
The expected result:
priority Alias Number Duration(h)
120 A 23834 8111.130497
B 16453 6773.243598
C 15988 8347.042753
139 A 19 113.475702
B 16 113.476042
I don't understand why the terminal returns this... Thanks for your time!
Your data are already grouped by priority and Alias, because every combination of values for these two columns is unique in your dataset. Note also that groupby() returns a lazy DataFrameGroupBy object, which is why the terminal prints only its repr; nothing is computed until you apply an aggregation to it. Here it's just a matter of visualizing the frame better, and I think the set_index() recommended above is the correct answer.
You can also bring the priority column in front of Alias.
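A minimal sketch of that suggestion, reconstructing the question's data (variable names are illustrative):
import pandas as pd

df = pd.DataFrame({
    'Alias': ['A', 'B', 'C', 'A', 'B'],
    'Number': [23834, 16453, 15988, 19, 16],
    'Duration(h)': [8111.130497, 6773.243598, 8347.042753,
                    113.475702, 113.476042],
    'priority': [120, 120, 120, 139, 139],
})

# Every (priority, Alias) pair is unique, so a MultiIndex already yields
# the "grouped" layout shown in the expected result.
out = df.set_index(['priority', 'Alias']).sort_index()
print(out)

# to_latex() renders the MultiIndex frame for the report.
print(out.to_latex())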
This question already has answers here:
Split / Explode a column of dictionaries into separate columns with pandas
(13 answers)
Closed 9 months ago.
So here's my simple example (the JSON field in my actual dataset is very nested, so I'm unpacking things one level at a time). I need to keep certain columns of the dataset after json_normalize().
https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html
(Screenshots of the starting dataframe, an Excel mockup of the expected result, and the actual output appeared here; the code below reproduces the start and actual frames.)
import json
import pandas as pd
d = {'report_id': [100, 101, 102], 'start_date': ["2021-03-12", "2021-04-22", "2021-05-02"],
'report_json': ['{"name":"John", "age":30, "disease":"A-Pox"}', '{"name":"Mary", "age":22, "disease":"B-Pox"}', '{"name":"Karen", "age":42, "disease":"C-Pox"}']}
df = pd.DataFrame(data=d)
display(df)
df = pd.json_normalize(df['report_json'].apply(json.loads), max_level=0, meta=['report_id', 'start_date'])
display(df)
Looking at the documentation on json_normalize(), I think the meta parameter is what I need to keep report_id and start_date, but it doesn't seem to be working: the fields I expect to keep do not appear in the final dataset.
Does anyone have advice? Thank you.
As you're dealing with a pretty simple JSON along a structured index, you can just normalize your frame and then use .join to join along your axis.
from ast import literal_eval

# Parse each JSON string into a dict, flatten it to columns,
# and join the result back on the original index.
df.join(
    pd.json_normalize(df['report_json'].map(literal_eval))
).drop('report_json', axis=1)
report_id start_date name age disease
0 100 2021-03-12 John 30 A-Pox
1 101 2021-04-22 Mary 22 B-Pox
2 102 2021-05-02 Karen 42 C-Pox
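Since the strings are valid JSON, json.loads from the question works just as well as literal_eval here; a minimal equivalent:
import json

out = df.join(
    pd.json_normalize(df['report_json'].map(json.loads))
).drop('report_json', axis=1)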
This question already has answers here:
Create new column based on values from other columns / apply a function of multiple columns, row-wise in Pandas
(8 answers)
Closed 2 years ago.
I am trying to build a column based on another one. The new column should contain the values that meet certain criteria and 0 where they do not.
For example, a bank-balance column will have negative and positive values; the new overdraft column should hold the negative value in the appropriate row and 0 where the balance is greater than 0.
Bal Ovr
21 0
-34 -34
45 0
-32 -32
The final result should look like the above.
Assuming your dataframe is called df, you can use np.where and do:
import numpy as np
df['Ovr'] = np.where(df['Bal'] < 0, df['Bal'], 0)
which will create a column called Ovr, holding 0 where Bal is positive and the value of Bal where it is negative.
df["over"] = df.Bal.apply(lambda x: 0 if x>0 else x)
Additional method to enrich your coding skills. However, it isn't needed for such easy tasks.
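For completeness, a vectorized alternative is clip(upper=0), which keeps the negative balances and caps everything else at 0 (a sketch using the column names from the question):
import pandas as pd

df = pd.DataFrame({'Bal': [21, -34, 45, -32]})

# Values above 0 are replaced by 0; negatives pass through unchanged.
df['Ovr'] = df['Bal'].clip(upper=0)
print(df)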
This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 3 years ago.
I have a folder with around 1000 .txt files, and I need to run the same code on each of them. One particular thing I need to keep track of is the count of a particular haplotype. I used
hap = df['hap'].value_counts().to_frame() to create a new data frame with the counts of each haplotype. It looks something like this.
hap count
1 347
5 171
3 168
7 140
6 56
11 51
9 33
2 31
10 3
I was wondering if there was a way for me to extract the count of just haplotype 7 and store its value in a variable.
One method I have used is the df['haplotype'].tolist() command followed by a for loop with a basic if-else clause that counts the occurrences of haplotype 7 in the list. But I'm curious to know if I can access it in the manner I've described above.
You can get the row where hap is 7 using the [] operator; .iloc[0] then extracts the scalar count:
cnt = df[df['hap'] == 7]['count'].iloc[0]
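One caveat: value_counts().to_frame() puts the haplotype values in the index rather than in a 'hap' column, so depending on how the frame was built you may need .loc instead. A small sketch with toy data:
import pandas as pd

df = pd.DataFrame({'hap': [1, 1, 1, 7, 7, 5]})
counts = df['hap'].value_counts().to_frame()

# The haplotype is the index here, so select the row with .loc and take
# its first (only) cell to get a plain integer.
n7 = counts.loc[7].iloc[0]
print(n7)  # 2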
This question already has answers here:
Apply function to each row of pandas dataframe to create two new columns
(5 answers)
How to add multiple columns to pandas dataframe in one assignment?
(13 answers)
Closed 3 years ago.
I am trying to create multiple new dataframe columns using a function. When I run the simple code below, however, I get the error, "KeyError: "['AdjTime1' 'AdjTime2'] not in index."
How can I correct this to add the two new columns ('AdjTime1' & 'AdjTime2') to my dataframe?
Thanks!
import pandas as pd
df = pd.DataFrame({'Runner':['Wade','Brian','Jason'],'Time':[80,75,98]})
def adj_speed(row):
    adjusted_speed1 = row['Time'] * 1.5
    adjusted_speed2 = row['Time'] * 2.0
    return adjusted_speed1, adjusted_speed2

df[['AdjTime1', 'AdjTime2']] = df.apply(adj_speed, axis=1)
Just do something like this (assuming you have a list of factors you want to multiply Time by):
l = [1.5, 2.0]
for e, i in enumerate(l):
    df['AdjTime' + str(e + 1)] = df.Time * i
print(df)
Runner Time AdjTime1 AdjTime2
0 Wade 80 120.0 160.0
1 Brian 75 112.5 150.0
2 Jason 98 147.0 196.0
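Alternatively, the original apply() approach from the question works once the returned tuple is expanded into columns; a minimal sketch using result_type='expand':
import pandas as pd

df = pd.DataFrame({'Runner': ['Wade', 'Brian', 'Jason'], 'Time': [80, 75, 98]})

def adj_speed(row):
    return row['Time'] * 1.5, row['Time'] * 2.0

# result_type='expand' turns each returned tuple into two columns,
# which can then be assigned in a single step.
df[['AdjTime1', 'AdjTime2']] = df.apply(adj_speed, axis=1, result_type='expand')
print(df)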
This question already has answers here:
Data loading using arrays in Python
(3 answers)
Closed 4 years ago.
I have a pandas data frame like this:
TransactionID ProductID
1 132
1 256
1 985
2 321
3 451
3 219
I want to group by the 'TransactionID' and assign the 'ProductID' to a list, like this:
list = [[132, 256, 985], [321], [451, 219]]
What is the proper way to perform this task?
Thanks in advance!
Something like this might help.
You simply group by TransactionID, take the ProductID from each group, and convert it to a list:
grouped_list = list(df.groupby('TransactionID')['ProductID'].apply(list))
As mentioned in the comments, it is not good to use 'list' as your variable name: doing so shadows the built-in list constructor by rebinding the name to your grouped result.
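A runnable version on the question's sample data might look like this (tolist() just converts the resulting Series of lists into a plain list):
import pandas as pd

df = pd.DataFrame({
    'TransactionID': [1, 1, 1, 2, 3, 3],
    'ProductID': [132, 256, 985, 321, 451, 219],
})

# Collect each transaction's ProductIDs into a list, then gather the lists.
grouped_list = df.groupby('TransactionID')['ProductID'].apply(list).tolist()
print(grouped_list)  # [[132, 256, 985], [321], [451, 219]]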
The following also works, though it is not as clean:
result = [list(i.ProductID) for i in dict(list(df.groupby("TransactionID"))).values()]