I am trying to determine which ages in a dataframe fall between 0 and 10. I have written the following, but it only prints 'Yes', even though not all values fall between 0 and 10:
x = df['Age']
for i in x:
    if df['Age'].between(0, 10, inclusive=True).any():
        print('Yes')
    else:
        print('No')
I am doing this with the intention of creating a new column in the dataframe that will categorize people based on whether they fall into an age group, i.e., 0-10, 11-20, etc...
Thanks for any help!
If you want to create a new column, assign to the column:
df['Child'] = df['Age'].between(0, 10, inclusive='both')  # inclusive=True is deprecated; recent pandas takes 'both'
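For example, with a small made-up frame (and a recent pandas, where inclusive takes 'both' rather than True):

import pandas as pd

df = pd.DataFrame({'Age': [3, 25, 8]})  # hypothetical data
df['Child'] = df['Age'].between(0, 10, inclusive='both')
print(df)
#    Age  Child
# 0    3   True
# 1   25  False
# 2    8   True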
Since you say you want to create a new column in the dataframe that categorizes people by age group (0-10, 11-20, etc.), pd.cut is what you are looking for:
pd.cut(df['Age'], list(range(0, df['Age'].max() + 10, 10)))
For example:
df = pd.DataFrame({'Age': [10, 7, 15, 24, 66, 43]})
then the above gives you:
0 (0, 10]
1 (0, 10]
2 (10, 20]
3 (20, 30]
4 (60, 70]
5 (40, 50]
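If you would rather see readable labels than Interval objects, pd.cut also accepts a labels argument; a quick sketch (the '1-10' label format and the 'AgeGroup' column name are just one choice):

bins = list(range(0, df['Age'].max() + 10, 10))
labels = [f'{lo + 1}-{hi}' for lo, hi in zip(bins, bins[1:])]  # one label per bin
df['AgeGroup'] = pd.cut(df['Age'], bins, labels=labels)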
How can I replace pd intervals with integers?
import pandas as pd
df = pd.DataFrame()
df['age'] = [43, 76, 27, 8, 57, 32, 12, 22]
age_band = [0,10,20,30,40,50,60,70,80,90]
df['age_bands'] = pd.cut(df['age'], bins=age_band, ordered=True)
output:
age age_bands
0 43 (40, 50]
1 76 (70, 80]
2 27 (20, 30]
3 8 (0, 10]
4 57 (50, 60]
5 32 (30, 40]
6 12 (10, 20]
7 22 (20, 30]
Now I want to add another column that replaces the bands with a single number (int), but I could not get it to work. For example, this did not work:
df['age_code'] = df['age_bands'].replace({'(40, 50]': 4})
How can I get a column that looks like this?
age_bands age_code
0 (40, 50] 4
1 (70, 80] 7
2 (20, 30] 2
3 (0, 10] 0
4 (50, 60] 5
5 (30, 40] 3
6 (10, 20] 1
7 (20, 30] 2
Assuming you want to take the first digit from every interval, you can use Series.apply to achieve this as follows:
df["age_code"] = df["age_bands"].apply(lambda band: str(band)[1])  # the character after '(' is the first digit of the left edge
However, note this may not be very efficient for a large dataframe, and it only works while the left edge's first digit equals the desired code.
To convert the column values to an int dtype, you can use pd.to_numeric:
df["age_code"] = pd.to_numeric(df['age_code'])
As the column contains pd.Interval objects, you can use their left property:
df['age_code'] = df['age_bands'].apply(lambda interval: interval.left // 10)  # left bin edge, floor-divided by the bin width
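Applied to the example frame above, this yields exactly the codes asked for:

print(df['age_code'].tolist())
# [4, 7, 2, 0, 5, 3, 1, 2]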
You can do that by simply adding a second pd.cut and defining its labels argument.
import pandas as pd
df = pd.DataFrame()
df['age'] = [43, 76, 27, 8, 57, 32, 12, 22]
age_band = [0,10,20,30,40,50,60,70,80,90]
df['age_bands'] = pd.cut(df['age'], bins=age_band, ordered=True)
# This is the part you need to add
age_labels = [0, 1, 2, 3, 4, 5, 6, 7, 8]
df['age_code'] = pd.cut(df['age'], bins=age_band, labels=age_labels, ordered=True)
print(df)
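Output:
   age age_bands  age_code
0   43  (40, 50]         4
1   76  (70, 80]         7
2   27  (20, 30]         2
3    8   (0, 10]         0
4   57  (50, 60]         5
5   32  (30, 40]         3
6   12  (10, 20]         1
7   22  (20, 30]         2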
You can create a dictionary of bins and map it to the age_bands column. Build it from the column's full category list rather than its unique() values; unique() skips empty bins (no age falls in (60, 70] here), which would shift (70, 80] down to code 6 instead of 7:
bins_dict = {interval: idx for idx, interval in enumerate(df['age_bands'].cat.categories)}
df['age_code'] = df.age_bands.map(bins_dict).astype(int)
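As a side note, pd.cut already stores these positions as category codes, so the same column can be produced directly (a shorthand for the dictionary above):

df['age_code'] = df['age_bands'].cat.codes  # 0-based position of each interval in the bin order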
I have a large dataframe where I want to use groupby and nlargest to look for the second, third, fourth and fifth largest values of each group. I have over 500 groups, and each group has over 1000 values. I also have other columns in the dataframe which I want to keep after applying groupby and nlargest. My dataframe looks like this:
df = pd.DataFrame({
    'group': [1, 2, 3, 3, 4, 5, 6, 7, 7, 8],
    'a': [4, 5, 3, 1, 2, 20, 10, 40, 50, 30],
    'b': [20, 10, 40, 50, 30, 4, 5, 3, 1, 2],
    'c': [25, 20, 5, 15, 10, 25, 20, 5, 15, 10]
})
To look for the second, third, fourth largest and so on of each group for column a, I use
secondlargest = df.groupby(['group'], as_index=False)['a'].apply(lambda grp: grp.nlargest(2).min())
which returns
group a
0 1 4
1 2 5
2 3 1
3 4 2
4 5 20
5 6 10
6 7 40
7 8 30
I need columns b and c present in this resulting dataframe. I use the following to subset the original dataframe but it returns an empty dataframe. How should I modify the code?
secondsubset = df[df.groupby(['group'])['a'].apply(lambda grp: grp.nlargest(2).min())]
If I understand your goal correctly, you should be able to just drop as_index=False, use idxmin instead of min, and pass the result to df.loc:
df.loc[df.groupby('group')['a'].apply(lambda grp: grp.nlargest(2).idxmin())]
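For the sample frame above, this keeps columns b and c alongside the per-group result:

   group   a   b   c
0      1   4  20  25
1      2   5  10  20
3      3   1  50  15
4      4   2  30  10
5      5  20   4  25
6      6  10   5  20
7      7  40   3   5
9      8  30   2  10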
You can use agg with a lambda. It is neater:
df.groupby('group').agg(lambda grp: grp.nlargest(2).min())
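Note, though, that agg applies the lambda to every remaining column independently, so b and c hold each column's own second-largest value rather than the values from the row where a is second largest:

        a   b   c
group
1       4  20  25
2       5  10  20
3       1  40   5
4       2  30  10
5      20   4  25
6      10   5  20
7      40   1   5
8      30   2  10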
I have a function that returns a list of lists (see the example below). The first list is an identification number. The remaining lists identify an item and then values related to the item. What I'm trying to do is take all of the lists other than the first and place them in a pandas dataframe. I know how to take an entire list of lists and create a df:
data = lists
df = pd.DataFrame(data)
Can anyone help me make a dataframe minus the first list? If you have a suggestion to make this question easier to understand, or a link to where this is already solved, I'd appreciate the help. I searched Stack Overflow but couldn't find a question on point. If this is just a dumb way to do it for some reason, I'm new, and it would be helpful to point me in a better direction as well. Some of the lists have a lot of entries, and I want to drop them into a dataframe to do some analysis on them with pandas. The returned value looks like this:
(['pE7464AFD1F'],
[['t29', 1, 15, 50],
['t248', 1, 15, 15],
['t140', 1, 15, 33],
['t121', 1, 15, 41],
['t221', 1, 15, 19]])
Unless I'm misunderstanding your question, you have a tuple containing an identification number, and a list of lists which represent your data.
You're just looking to separate the two from each other, and turn the data into a dataframe.
import pandas as pd
identifier, data = (['pE7464AFD1F'],
[['t29', 1, 15, 50],
['t248', 1, 15, 15],
['t140', 1, 15, 33],
['t121', 1, 15, 41],
['t221', 1, 15, 19]])
df = pd.DataFrame(data)
# For Display
print(df)
Output:
0 1 2 3
0 t29 1 15 50
1 t248 1 15 15
2 t140 1 15 33
3 t121 1 15 41
4 t221 1 15 19
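If you want meaningful headers instead of 0-3, pd.DataFrame also accepts a columns argument; the names here are just placeholders, since the question does not say what the values mean:

df = pd.DataFrame(data, columns=['item', 'qty', 'price', 'score'])  # hypothetical column names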
I have a dataframe with multiple columns, to which I added a new column for age intervals:
# Create Age Intervals
bins = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
df['age_intervals'] = pd.cut(df['age'], bins)
Now, I have another column named no_show that states whether a person shows up for the appointment or not, using the values 0 or 1. Using the code below, I'm able to group the data by age_intervals:
df[['no_show','age_intervals']].groupby('age_intervals').count()
Output:
age_intervals no_show
(0, 5] 8192
(5, 10] 7017
(10, 15] 5719
(15, 20] 7379
(20, 25] 6750
But how can I also split the no_show data by its values 0 and 1? For example, in the age interval (0, 5], out of 8192 rows, 3291 have no_show 0 and 4901 have 1, and so on.
An easy way would be to group on both columns and use size(), which returns a Series:
df.groupby(['age_intervals', 'no_show']).size()
This will return a Series with divided values depending on both the age_intervals column and the no_show column.
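If you would rather see the 0/1 counts side by side as columns, one option (a sketch of the same groupby) is to unstack the result:

counts = df.groupby(['age_intervals', 'no_show']).size().unstack(fill_value=0)

Using the numbers from the question, the (0, 5] row would then show 3291 under 0 and 4901 under 1.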
I was trying to group/rank in Python like we do in SAS with Proc Rank. The code I tried is:
Merge_Data['FrSeg'] = Merge_Data['Frequency'].rank(method='dense').astype(int)
It gives me an output with the same numbers; I would like to group them into 3 ranks. For example, Frequency from 1-10 in rank 1, 11-20 in rank 2, and 21 and above in rank 3. Frequency (the number of orders placed, if you want to know) has min=1 and max=68.
Thanks for your help in advance
You might be interested in pd.cut from the pandas package:

import pandas as pd

# dataframe to hold the list of values
df = pd.DataFrame({'FrSeg': [33, 66, 26, 5, 16, 31, 34, 10, 17, 40]})

# set the rank ranges
ranges = [0, 10, 20, 68]

# pandas.cut returns the half-open bin to which each value of 'FrSeg' belongs
print(df.groupby(pd.cut(df['FrSeg'], ranges)).count())
output:
FrSeg
FrSeg
(0, 10] 2
(10, 20] 2
(20, 68] 6
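If you want the actual 1/2/3 rank column rather than counts, the same pd.cut call accepts a labels argument; a sketch against the frame above (the 'FrSegRank' column name is just an illustration):

# label each Frequency value with its rank band: (0, 10] -> 1, (10, 20] -> 2, (20, 68] -> 3
df['FrSegRank'] = pd.cut(df['FrSeg'], ranges, labels=[1, 2, 3]).astype(int)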