Pandas 're-binning' a DataFrame - python

A DataFrame containing data with age binned in separate rows, as below:
VALUE,AGE
10, 0-4
20, 5-9
30, 10-14
40, 15-19
.. .. .....
So, basically, the age is grouped in 5-year bins. I'd like to have 10-year bins instead, that is, 0-9, 10-19, etc. What I'm after is the VALUE, but for 10-year age bins, so the values would be:
VALUE,AGE
30, 0-9
70, 10-19
I can do it by shifting and adding, and then taking every second row of the resulting DataFrame, but is there a smarter, more general way built into Pandas to do this?
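For reference, the shift-and-add version I mean looks something like this (a sketch, assuming df holds the VALUE/AGE frame above):
summed = df['VALUE'] + df['VALUE'].shift(-1)  # add each row to the one after it
out = summed.iloc[::2]                        # keep every second row: 30.0, 70.0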

Here's a "dumb" version, based on this answer - just sum every 2 rows:
In[0]
df.groupby(df.index // 2).sum()
Out[0]:
VALUE
0 30
1 70
I say "dumb" because this method doesn't factor in the age cut offs, it just happens to align with them. So say if the age ranges are variable, or if you have data that start at 5-9 instead of 0-4, this will likely cause an issue. You also have to rename the index as it is unclear.
A "smarter" version would be to actually create bins with pd.cut and use that to group the data, based on the ages for each row:
In[0]
df['MAX_AGE'] = df['AGE'].str.split('-').str[-1].astype(int)
bins = [0,10,20]
out = df.groupby(pd.cut(df['MAX_AGE'], bins=bins, right=False)).sum().drop('MAX_AGE',axis=1)
Out[0]:
VALUE
MAX_AGE
[0, 10) 30
[10, 20) 70
Explanation:
Use the pandas.Series.str methods to pull the maximum age out of each row and store it in a column "MAX_AGE".
Create bins at 10-year cut-offs.
Use pd.cut to assign each row to a bin based on its max age, then groupby those bins and sum. Note that since we specify right=False, the half-open intervals [0, 10) and [10, 20) in the index correspond to ages 0-9 and 10-19.
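If you prefer the index to read 0-9 / 10-19 rather than interval notation, a small follow-up (a sketch, assuming the out frame from above):
out.index = ['{}-{}'.format(iv.left, iv.right - 1) for iv in out.index]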
For reference, here is the data I was using:
import pandas as pd
VALUE = [10, 20, 30, 40]
AGE = ['0-4', '5-9', '10-14', '15-19']
df = pd.DataFrame({'VALUE': VALUE,
                   'AGE': AGE})

This should work as long as the ages are all in 5-year increments. It finds the rows where the upper number is odd and groups each one with the row that came before it, stopping at the last odd number.
First, split the string to get the numerical bounds:
df['lower'] = df['AGE'].str.split('-').str[0]
df['upper'] = df['AGE'].str.split('-').str[1]
df[['lower','upper']] = df[['lower','upper']].astype(int)
Then apply the grouping logic and rebuild the AGE column to represent the desired 10-year period:
df['VALUE'] = df.groupby((df['upper'] % 2 == 1).shift().fillna(0).cumsum())['VALUE'].transform('sum')
df = df.drop_duplicates(subset=['VALUE'], keep='last')  # note: assumes the summed VALUEs are distinct
df['lower'] = df['lower'] - 5
df[['lower', 'upper']] = df[['lower', 'upper']].astype(str)
df['AGE'] = df['lower'] + '-' + df['upper']
df = df.drop(columns=['lower', 'upper'])
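For what it's worth, a more compact sketch that doesn't depend on the 5-year spacing at all: derive each row's lower bound and group by decade (assuming the original df from the question):
df['lower'] = df['AGE'].str.split('-').str[0].astype(int)
out = df.groupby(df['lower'] // 10)['VALUE'].sum()
out.index = ['{}-{}'.format(d * 10, d * 10 + 9) for d in out.index]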

Related

Pandas - How to groupby, calculate difference between first and last row, calculate max, and select the corresponding group in original frame

I have a Pandas dataframe with two columns I am interested in: a categorical label and a timestamp. Presumably what I'm trying to do would also work with ordered numerical data. The dataframe is already sorted by timestamp in ascending order. I want to find out which label spans the longest time window and select only the values associated with it in the original dataframe.
I have tried grouping the df by label, calculating the difference, and selecting the maximum (longest time window) successfully; however, I'm having trouble finding an expression to select the corresponding values in the original df using this information.
Consider this example with numerical values:
d = {'cat': ['A','A','A','A','A','A','B','B','B','B','C','C','C','C','C','C','C'],
     'val': [1,3,5,6,8,9,0,5,10,20,4,5,6,7,8,9,10]}
df = pd.DataFrame(data=d)
Here I would expect something equivalent to df.loc[df.cat == 'B'] since B has the maximum difference of all the categories.
df.groupby('cat').val.apply(lambda x: x.max() - x.min()).max()
gives me the correct difference, but I have no idea how to use this to select the correct category in the original df.
You can go for idxmax to get the category that gave rise to the maximum peak-to-peak value within groups (np.ptp computes maximum minus minimum). Then you can index with loc as you said, or use query:
>>> import numpy as np
>>> max_cat = df.groupby("cat").val.apply(np.ptp).idxmax()
>>> max_cat
'B'
>>> df.query("cat == @max_cat")  # or df.loc[df.cat == max_cat]
cat val
6 B 0
7 B 5
8 B 10
9 B 20
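If you'd rather not import numpy for this, a pure-pandas equivalent of the same idea (a sketch):
>>> max_cat = df.groupby("cat").val.agg(lambda s: s.max() - s.min()).idxmax()
>>> df.loc[df.cat == max_cat]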

To assign a range of values as 0 or 1 in an existing dataframe

I wanted to assign 0,1,2,3 numeric values in my existing dataframe. Following is my code:
for i in train_df['age']:
    train_df[i].values[0:30] = 0
    train_df[i].values[30:40] = 1
    train_df[i].values[40:50] = 2
    train_df[i].values[50:60] = 3
Is there any better way to assign and change the values?
(Please note the age column contains integer values from 20-60, so I want to assign 20-30 as 0, 30-40 as 1, and so on.)
Can you try:
train_df['new_age_group'] = pd.cut(train_df.age, [0, 30, 40, 50, 60], include_lowest=True)
To get numeric values like you showed, numpy.searchsorted also works. Note that it is the array of bin edges that must be sorted, not the age column:
arry = np.array([30, 40, 50])  # interior bin edges; ages below 30 map to 0
train_df['new_age_group'] = arry.searchsorted(train_df.age, side='right')  # side='right' puts age 30 in group 1
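As a quick sanity check of how searchsorted maps ages to group numbers (hypothetical values):
>>> import numpy as np
>>> import pandas as pd
>>> ages = pd.Series([20, 25, 30, 35, 45, 55])
>>> np.array([30, 40, 50]).searchsorted(ages, side='right')
array([0, 0, 1, 1, 2, 3])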

How to define function for pandas qcut label?

I'm using pandas.qcut to divide data into 5 groups, and I want to label each group based on its qcut min and max values.
For example, I tried it on the "age" column of my data frame:
df['age group'] = pd.qcut(df['age'], 5)
and it resulted in
Categories (5, interval[float64]): [(37.999, 61.0] < (61.0, 67.0] < (67.0, 73.0] < (73.0, 78.0] < (78.0, 93.0]]
The expected result is a label generated automatically for each group based on its min and max values, e.g. category 1's label would be "37.999 to 60.999", etc.
For now I did the labeling manually by looking at each category's range. How should I define the labels to make this automatic? Thanks!
You can rename the categories (assigning to .cat.categories directly was deprecated and later removed, so use rename_categories):
df['age group'] = pd.qcut(df['age'], 5)
df['age group'] = df['age group'].cat.rename_categories(
    [f'{i.left} to {i.right}' for i in df['age group'].cat.categories])
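Alternatively, a sketch that builds the labels from the bin edges up front (retbins=True returns the edges), so a second qcut call can apply them directly; this assumes the edges are distinct so the labels are unique:
_, edges = pd.qcut(df['age'], 5, retbins=True)
labels = [f'{edges[i]} to {edges[i + 1]}' for i in range(len(edges) - 1)]
df['age group'] = pd.qcut(df['age'], 5, labels=labels)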

How to group records with Pandas cut()?

My goal is to group n records by 4, say for example:
0-3
4-7
8-11
etc.
find the max() value of each group of 4 based on one column, and create a new dataset or CSV file. The max() operation would be performed on one column, while the other columns remain as they are.
Based on the research I have done here (Stack Overflow), I tried to adapt the following solution to my dataset, but it didn't give me what I expected:
# group every few rows up to len(dataset)
groups = dataset.groupby(pd.cut(dataset.index, range(0, len(dataset), 3)))
needataset = groups.max()
I'm getting results similar to the following:
Column 1 Column 2 ... Column n
0. (0,3]
1. (3,6]
The targeted column for the max() operation also did not produce the expected result.
I will appreciate any guide to tackling the problem.
This example should help you. Here I create a DataFrame of 5 random values between 0 and 99, and bin them into groups of width 5 labeled "0 - 4", "5 - 9", and so on (sort_values is really important, it will make your life easier):
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.random.randint(0, 100, 5)})
df = df.sort_values(by='value')
labels = ["{0} - {1}".format(i, i + 4) for i in range(0, 100, 5)]
df['group'] = pd.cut(df.value, range(0, 105, 5), right=False, labels=labels)
groups = df["group"].unique()
Then I create an array for the max values:
max_vals = np.zeros(len(groups))
for i, group in enumerate(groups):
    max_vals[i] = max(df[df["group"] == group]["value"])
And then a DataFrame out of those max values
pd.DataFrame({"group": groups, "max value": max_vals})

Python: Iterate over dataframes of different lengths, and compute new value with repeat values

EDIT: I realized I set up my example incorrectly; the corrected version follows:
I have two dataframes:
df1 = pd.DataFrame({'x values': [11, 12, 13], 'time': [1, 2.2, 3.5]})
df2 = pd.DataFrame({'x values': [11, 21, 12, 43], 'time': [1, 2.1, 2.6, 3.1]})
What I need to do is iterate over both of these dataframes, and compute a new value, which is a ratio of the x values in df1 and df2. The difficulty comes in because these dataframes are of different lengths.
If I just wanted to compute values pairwise, I know I could use something like zip, or even map. Unfortunately, I don't want to drop any values. Instead, I need to compare the time columns of the two frames to determine whether to carry a value over from a previous time into the computation for the next time period.
So for instance, I would compute the first ratio:
df1["x values"][0]/df2["x values"][0]
Then for the second ratio I check which frame updates next; in this case it is df2, since its next timestamp comes first (2.1 < 2.2), so:
df1["x values"][0]/df2["x values"][1]
For the third, df1 updates next (2.2 < 2.6), so the third computation would be:
df1["x values"][1]/df2["x values"][1]
The only time both values should be used to compute the ratio from the same "position" is if the times in the two dataframes are equal.
And so on. I'm very confused as to whether or not this is possible to execute using something like a lambda function, or itertools. I've made some attempts, but most have yielded errors. Any help would be appreciated.
Here is what I ended up doing. Hopefully it helps clarify what my question was. Also, if anyone can think of a more pythonic way to do this, I would appreciate the feedback.
#add a column indicating which 'type' of dataframe it is
df1['type'] = pd.Series('type1', index=df1.index)
df2['type'] = pd.Series('type2', index=df2.index)
#concatenate the dataframes
df = pd.concat((df1, df2), axis=0, ignore_index=True)
#sort by time
df = df.sort_values(by='time').reset_index()
#we create empty lists in order to track records
#in a way that will let us compute ratios
x1 = []
x2 = []
#we will iterate through the dataframe line by line
for i in range(0, len(df)):
    #if the row contains data from df1
    if df["type"][i] == "type1":
        #we append the x value for that type
        x1.append(df["x values"][i])
        #if the x2 list contains exactly 1 value
        if len(x2) == 1:
            #we pad it to match the number of x1
            #records collected up to that point
            #this is useful if one series starts before the other
            for j in range(1, len(x1) - 1):
                x2.append(x2[0])
        #if x2 already has values, add a copy of the previous
        #x2 record to correspond to the new x1 record
        if len(x2) > 0:
            x2.append(x2[-1])
    #if the row contains data from df2
    if df["type"][i] == "type2":
        #we append the x value for that type
        x2.append(df["x values"][i])
        #if the x1 list contains exactly 1 value
        if len(x1) == 1:
            #we pad it to match the number of x2
            #records collected up to that point
            for j in range(1, len(x2) - 1):
                #the original appended x2[0] here, which
                #looks like a typo; x1 is padded with x1[0]
                x1.append(x1[0])
        #if x1 already has values, add a copy of the previous
        #x1 record to correspond to the new x2 record
        if len(x1) > 0:
            x1.append(x1[-1])
#combine the records
new_df = pd.DataFrame({'Type 1': x1, 'Type 2': x2})
#compute the ratio
new_df['Ratio'] = new_df['Type 1'] / new_df['Type 2']
You can merge the two dataframes on time and then calculate ratios
new_df = df1.merge(df2, on = 'time', how = 'outer')
new_df['ratio'] = new_df['x values_x'] / new_df['x values_y']
You get (note: this output reflects the question's original integer-time example, where the timestamps match exactly):
time x values_x x values_y ratio
0 1 11 11 1.000000
1 2 12 21 0.571429
2 2 12 12 1.000000
3 3 13 43 0.302326
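With the edited float times, exact matches are rare, so an outer merge will mostly produce NaNs. Something like pd.merge_asof, which matches each row to the most recent earlier timestamp, may be closer to the carry-forward logic described in the question; a sketch using the corrected df1/df2:
new_df = pd.merge_asof(df2.sort_values('time'), df1.sort_values('time'),
                       on='time', suffixes=('_2', '_1'))
new_df['ratio'] = new_df['x values_1'] / new_df['x values_2']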
