I'm using pandas.qcut to divide data into 5 groups, and I want to label each group based on its qcut min and max values.
For example, I tried it on the "age" column of my DataFrame:
df['age group'] = pd.qcut(df['age'], 5)
and it resulted in
Categories (5, interval[float64]): [(37.999, 61.0] < (61.0, 67.0] < (67.0, 73.0] < (73.0, 78.0] < (78.0, 93.0]]
The expected result is to label each group automatically based on its min and max values, e.g.
Category 1's label would be "37.999 to 60.999", etc.
For now I did the labeling manually by looking at each category's range. How should I define the labels to get the expected result? Thanks!
You can redefine the categories:
df['age group'] = pd.qcut(df['age'], 5)
df['age group'].cat.categories = [f'{i.left} to {i.right}' for i in df['age group'].cat.categories]
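Note that in recent pandas versions, assigning to .cat.categories directly is no longer supported. A minimal sketch of the same idea using rename_categories (same column names as above) would be:
df['age group'] = pd.qcut(df['age'], 5)
df['age group'] = df['age group'].cat.rename_categories(
    [f'{i.left} to {i.right}' for i in df['age group'].cat.categories]
)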
Shown below are the details from a DataFrame
Below is the syntax used to add a percentage column:
df1 = df[['Attrition', 'Gender', 'Job_Satisfaction']]
df1 = df1.groupby(['Attrition','Gender'])['Job_Satisfaction'].value_counts().reset_index(name='count')
df1['%'] = 100 * df1['count']/ df1.groupby(['Attrition','Gender','Job_Satisfaction'])['count'].transform('sum')
df1 = df1.sort_values(by=['Gender','Attrition','Job_Satisfaction'])
df1
Below are the results I get:
How can I add a percentage column as shown below?
You can normalize using groupby.transform('sum') and multiply by 100:
df['%'] = df['count'].div(df.groupby('Gender')['count'].transform('sum')).mul(100)
For a string:
df['%'] = (df['count']
           .div(df.groupby('Gender')['count'].transform('sum'))
           .mul(100).astype(int).astype(str).add('%')
          )
The denominator you want for the percentage is the total count per Gender, so df1.groupby(['Attrition','Gender','Job_Satisfaction']) was incorrect.
Use df1.groupby(['Gender']) instead.
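For reference, here is a minimal runnable sketch on made-up data (the values below are hypothetical stand-ins for the DataFrame shown in the question) illustrating that the denominator from transform('sum') is computed per Gender:
import pandas as pd

# hypothetical counts per Attrition/Gender/Job_Satisfaction combination
df1 = pd.DataFrame({
    'Attrition': ['No', 'No', 'Yes', 'Yes'],
    'Gender': ['Female', 'Male', 'Female', 'Male'],
    'Job_Satisfaction': [1, 2, 1, 2],
    'count': [30, 50, 10, 10],
})

# each row's share of its Gender total (Female total = 40, Male total = 60 here)
df1['%'] = df1['count'].div(df1.groupby('Gender')['count'].transform('sum')).mul(100)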
I have a DataFrame containing data with age binned in separate rows, as below:
VALUE,AGE
10, 0-4
20, 5-9
30, 10-14
40, 15-19
.. .. .....
So, basically, the age is grouped in 5 year bins. I'd like to have 10 year bins, that is, 0-9,10-19 etc. What I'm after is the VALUE, but for 10-year based age bins, so the values would be:
VALUE,AGE
30, 0-9
70, 10-19
I can do it by shifting and adding, and taking every second row of the resulting dataframe, but is there any smart, more general way built into Pandas to do this?
Here's a "dumb" version, based on this answer - just sum every 2 rows:
In[0]
df.groupby(df.index // 2).sum()
Out[0]:
   VALUE
0     30
1     70
I say "dumb" because this method doesn't factor in the age cut offs, it just happens to align with them. So say if the age ranges are variable, or if you have data that start at 5-9 instead of 0-4, this will likely cause an issue. You also have to rename the index as it is unclear.
A "smarter" version would be to actually create bins with pd.cut and use that to group the data, based on the ages for each row:
In[0]
df['MAX_AGE'] = df['AGE'].str.split('-').str[-1].astype(int)
bins = [0,10,20]
out = df.groupby(pd.cut(df['MAX_AGE'], bins=bins, right=False)).sum().drop('MAX_AGE',axis=1)
Out[0]:
          VALUE
MAX_AGE
[0, 10)      30
[10, 20)     70
Explanation:
Use pandas.Series.str methods to get the maximum age for each row and store it in a column "MAX_AGE".
Create bins at 10-year cutoffs.
Use pd.cut to assign the data into bins based on the max age of each row, then groupby on these bins and sum. Note that since we specify right=False, the bins depicted in the index mean 0-9 and 10-19.
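If you also want plain labels like "0-9" and "10-19" in the index instead of interval objects (an assumption about the desired presentation, not part of the original answer), one possible sketch is:
# turn the [0, 10) style intervals into "0-9" style labels
out.index = [f'{int(i.left)}-{int(i.right) - 1}' for i in out.index]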
For reference, here is the data I was using:
import pandas as pd
VALUE = [10,20,30,40,]
AGE = ['0-4','5-9','10-14','15-19']
df = pd.DataFrame({'VALUE':VALUE,
'AGE':AGE})
This should work as long as the age groups are all in 5-year increments. It finds where the upper number is odd and groups it with what came before, stopping at the last odd number.
The code below splits the string to get the numerical values:
df['lower'] = df['AGE'].str.split('-').str[0]
df['upper'] = df['AGE'].str.split('-').str[1]
df[['lower','upper']] = df[['lower','upper']].astype(int)
Then it applies the grouping logic and rebuilds the AGE labels to represent the desired ranges.
df['VALUE'] = df.groupby((df['upper'] % 2 == 1).shift().fillna(0).cumsum())['VALUE'].transform('sum')
df = df.drop_duplicates(subset = ['VALUE'],keep = 'last')
df['lower'] = df['lower'] - 5
df[['lower','upper']] = df[['lower','upper']].astype(str)
df['AGE'] = df['lower'] + '-' + df['upper']
df = df.drop(columns = ['lower','upper'])
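Running this on the sample DataFrame from the previous answer should give something like:
   VALUE    AGE
1     30    0-9
3     70  10-19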
I have a df with hundreds of thousands of rows, and am creating a new dataframe which only contains the top quantile of rows for some group of values:
quantiles = (df.groupby(['Person', 'Date'])['Value'].apply(lambda x: pd.qcut(x, 4, labels=[0, 0.25, 0.5, 1], duplicates='drop')))
When I run it, I get:
ValueError: Bin labels must be one fewer than the number of bin edges
After trying to change the number of bins to 5 I still get the same error.
How can I fix this?
I was facing the same issue and did this to overcome it.
bins: the number of slices the data is cut into
labels: the labels you assign to those slices
This error appears when there are more labels than bins.
Follow these steps:
Step 1: Don't pass labels initially.
train['MasVnrArea'] = pd.qcut(train['MasVnrArea'],
q=5,duplicates='drop')
Checking value_counts on the result gives:
(-0.001, 16.0] 880
(205.2, 1600.0] 292
(16.0, 205.2] 288
Name: MasVnrArea, dtype: int64
Step 2:
Now we can see that only three categories are possible after binning, so assign labels accordingly. In my case it is 3, so I pass 3 labels.
bin_labels_MasVnrArea = ['Platinum_MasVnrArea',
'Diamond_MasVnrArea','Supreme_MasVnrArea']
train['MasVnrArea'] = pd.qcut(train['MasVnrArea'],
q=5,labels=bin_labels_MasVnrArea,duplicates='drop')
Please watch this video on bins for a clear understanding.
https://www.youtube.com/watch?v=HofOMf8RgjM
It's happening because, after duplicate bin edges are dropped for some of the groups, the number of labels passed is greater than the number of bins actually created. You have to make it dynamic: wherever the error arises, generate only as many labels as there are bins.
Method 1:
df.groupby(key)['sales'].transform(lambda x: pd.qcut(x, min(len(x),3), labels=range(1,min(len(x),3)+1), duplicates = 'drop'))
Here I'm making at most 3 cuts, but if that's not possible it falls back to 1 or 2.
Method 2:
import pandas as pd
import numpy as np

# rank first so that ties do not collapse the quantile edges
df['rank'] = df.groupby([key])['sales'].transform(lambda x: x.rank(method='first'))

def q_cut(x, cuts):
    # never ask for more bins (or labels) than there are unique values in the group
    unique_var = len(np.unique(x))
    labels1 = range(1, min(unique_var, cuts) + 1)
    return pd.qcut(x, min(unique_var, cuts), labels=labels1, duplicates='drop')

df.groupby([key])['rank'].transform(lambda x: q_cut(x, 10))
Following up on Amit Bidlan's answer: if we want the labels to be integers instead of the original interval strings (for plotting reasons, for example), I would do it like this:
train['range_ori'] = pd.qcut(train['MasVnrArea'], q=5, duplicates='drop')
grouplist = train["range_ori"].unique().tolist()
grouplist.sort()
group_mapping = {g:i for i, g in enumerate(grouplist)}
train['group'] = train['range_ori'].apply(lambda x: group_mapping[x])
The advantage is that this handles a dynamic number of bins, with no need to check the bin count before setting the labels.
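As a side note, a shorter sketch that should give the same kind of integer coding (this relies on qcut's labels=False option rather than the mapping above) is:
# labels=False returns integer bin codes (0, 1, 2, ...) directly
train['group'] = pd.qcut(train['MasVnrArea'], q=5, labels=False, duplicates='drop')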
I have a pandas groupby object that returns the counts of each gene type, roughly as shown below (column headers formatted manually for clarity):
counts = df.groupby(["ID", "Gene"]).size()
counts
ID       Gene     Count
1_1_1    SMARCB1      1
         smad        12
1_1_10   SMARCB1      2
         smad        17
1_1_100  SMARCB1      3
I need to get the within group zscore, and then return the Gene with the highest zscore.
I've tried the following, but it seems to be calculating zscores across the whole dataset and does not return the correct zscore:
zscore = lambda x: (x - x.mean()) / x.std()
counts = df.groupby(["ID", "Match"]).size().pipe(zscore)
I've tried with transform and gotten the same results.
I tried:
counts = match_df.groupby(["ID", "Match"]).size().apply(zscore)
Which gives me the following error:
'int' object has no attribute 'mean'
Whatever I try, it doesn't give the correct output. The zscores for the first two lines should be [-1,1] in which case I would return the row for 1_1_1 SMARCB1. Etc. Thanks!
Update
Thanks to help from @ZaxR and switching to numpy mean and standard deviation, I was able to solve this as shown below. This solution also provides a summary dataframe of the raw counts and zscores for each gene:
import numpy as np

# group by id and gene match and sum hits to each molecule
counts = df.groupby(["ID", "Match"]).size()
# calculate zscore by feature for molecule counts
# features that only align to one molecule are given a score of 1
zscore = lambda x: (x - np.mean(x)) / np.std(x)
zscores = counts.groupby('ID').apply(zscore).fillna(1).to_frame('Zscore')
# group results back together with counts and output to
# merge with positions and save to file
zscore_df = zscores.reset_index()
zscore_df.columns = ["ID", "Match", "Zscore"]
count_df = counts.reset_index()
count_df.columns = ["ID", "Match", "Counts"]
zscore_df["Counts"] = count_df["Counts"]
# select gene with best zscore meeting threshold
max_df = zscore_df[zscore_df.groupby('ID')['Zscore'].transform(max) \
== zscore_df['Zscore']]
The reason why df.groupby(["ID", "Gene"]).size().apply(zscore) doesn't work is that apply calls the lambda element by element, so zscore receives a single integer rather than a Series. An integer has no .mean() method, which is why you get the 'int' object has no attribute 'mean' error.
Update
I think this should do it:
# Setup code
df = pd.DataFrame({"ID": ["1_1_1", "1_1_1", "1_1_10", "1_1_10", "1_1_100"],
"Gene": ["SMARCB1", "smad", "SMARCB1", "smad", "SMARCB1"],
"Count": [1, 12, 2, 17, 3]})
df = df.set_index(['ID', 'Gene'])
# Add the within-group zscore for every row (stored here as 'std_dev')
# Note: .transform(zscore) would also work
df['std_dev'] = df.groupby('ID')['Count'].apply(zscore)
# Find the max standard deviation for each group and
# use that as a mask for the original df
df[df.groupby('ID')['std_dev'].transform(max) == df['std_dev']]
Out:
                Count   std_dev
ID      Gene
1_1_1   smad       12  0.707107
1_1_10  smad       17  0.707107
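If you only need the single best row per ID, an equivalent sketch (an alternative to the mask above, not part of the original answer) would be to use idxmax:
# select the row with the highest zscore within each ID group
df.loc[df.groupby('ID')['std_dev'].idxmax()]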
My goal is to group n records by 4, say for example:
0-3
4-7
8-11
etc.
find the max() value of each group of 4 based on one column among other columns, and create a new dataset or CSV file. The max() operation would be performed on one column, while the other columns remain as they are.
Based on the research I have done here (Stack Overflow), I tried to customize and apply the following solution to my dataset, but it wasn't giving me what I expected:
# Group by every 4 row until the len(dataset)
groups = dataset.groupby(pd.cut(dataset.index, range(0,len(dataset),3)))
needataset = groups.max()
I'm getting results similar to the following:
Column 1 Column 2 ... Column n
0. (0,3]
1. (3,6]
The targeted column for the max() operation also did not produce the expected result.
I will appreciate any guide to tackling the problem.
This example should help you. Here I create a DataFrame of some random values between 0 and 100 and group those values into bins of width 5 labelled "0 - 4", "5 - 9", and so on (sort_values is really important; it will make your life easier):
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.random.randint(0, 100, 5)})
df = df.sort_values(by='value')
labels = ["{0} - {1}".format(i, i + 4) for i in range(0, 100, 5)]
df['group'] = pd.cut(df.value, range(0, 105, 5), right=False, labels=labels)
groups = df["group"].unique()
Then I create an array for max values
max_vals = np.zeros((len(groups)))
for i, group in enumerate(groups):
max_vals[i] = max(df[df["group"] == group]["value"])
And then a DataFrame out of those max values
pd.DataFrame({"group": groups, "max value": max_vals})