How to group ages into numbered groups in Python

I have a dataset which I want to group by age.
Here is the first part of the dataset:
It is a simulation of inventory data. Größe ("size") is the number of people with the given age, Alter ("age") is 15 here, Risiko ("risk") gives every person a number, and Geschlecht ("gender") is feminine or masculine.
I want to add a column "Group" and give every person with age 15-19 one number, every person with age 20-24 the next number, and so on.
How can I do this?

You can use map with a helper function to create the new column like so:
def return_age_from_range(age):
    # the max value in range is excluded, so remember to add +1 to the range you want
    if age in range(15, 20):
        return 1
    elif age in range(20, 25):
        return 2
    # and so on...

df['Group'] = df['Alter'].map(return_age_from_range)
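If the bins are regular, pd.cut can do the same in one call; this is a sketch of an alternative, not part of the original answer, and note that it returns a categorical column:
# bins are left-closed with right=False: [15, 20), [20, 25), [25, 30)
df['Group'] = pd.cut(df['Alter'], bins=[15, 20, 25, 30], right=False, labels=[1, 2, 3])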

Use numpy.select:
In [488]: import numpy as np
In [489]: conds = [df['Alter'].between(15,19), df['Alter'].between(20,24), df['Alter'].between(25,29)]
In [490]: choices = [1,2,3]
In [493]: df['Group'] = np.select(conds, choices)
In [494]: df
Out[494]:
Größe Risiko Geschlecht Alter Group
0 95 1 F 15 1
1 95 2 F 15 1
2 95 3 M 15 1
3 95 4 F 15 1
4 95 5 M 15 1
5 95 6 M 15 1
6 95 7 M 15 1
7 95 8 F 15 1
8 95 9 M 15 1
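One caveat worth adding (not part of the original answer): if an age falls outside every condition, np.select fills in its default, which is 0 unless you pass one, so a sentinel makes gaps visible:
df['Group'] = np.select(conds, choices, default=-1)  # -1 flags ages outside all bins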

Related

Efficient lookup between pandas column values and a list of values

I have a list of n elements, let's say:
[5,30,60,180,240]
And a DataFrame with the following characteristics:
id1 id2 feat1
1 1 40
1 2 40
1 3 40
1 4 40
2 6 87
2 7 87
2 8 87
The combination of id1 + id2 is unique, but all records with a common id1 share the value of feat1. I would like to write a function to run via groupby + apply (or whatever is faster) that creates a column called 'closest_number'. For a given id1 + id2 (or just id1, since the records share feat1), 'closest_number' is the element of the list closest to the value in the feat1 column.
Desired output:
id1 id2 feat1 closest_number
1 1 40 30
1 2 40 30
1 3 40 30
1 4 40 30
2 6 87 60
2 7 87 60
2 8 87 60
If this were a standard two-array lookup problem I could do:
import numpy as np

def get_closest(array, values):
    # make sure array is a numpy array (it must be sorted for searchsorted)
    array = np.array(array)
    # get insert positions
    idxs = np.searchsorted(array, values, side="left")
    # find indexes where the previous index is closer
    prev_idx_is_less = ((idxs == len(array)) |
                        (np.fabs(values - array[np.maximum(idxs - 1, 0)]) <
                         np.fabs(values - array[np.minimum(idxs, len(array) - 1)])))
    idxs[prev_idx_is_less] -= 1
    return array[idxs]
And if I apply this to the columns there, I will get as output:
array([30, 60])
However, I will not get any information about which rows correspond to 30 and which to 60.
What would be the optimal way of doing this? As my list of elements is very small, I have created distance columns in my dataset and then selected the one that gives me the minimum distance.
But I assume there should be a more elegant way of doing this.
Use get_closest as follows:
# the lookup list from the question
lst = np.array([5, 30, 60, 180, 240])
# obtain the series with index id1 and values feat1
vals = df.groupby("id1")["feat1"].first().rename("closest_number")
# find the closest values and assign them back
vals[:] = get_closest(lst, vals)
# merge the series into the original DataFrame
res = df.merge(vals, right_index=True, left_on="id1", how="left")
print(res)
Output
id1 id2 feat1 closest_number
0 1 1 40 30
1 1 2 40 30
2 1 3 40 30
3 1 4 40 30
4 2 6 87 60
5 2 7 87 60
6 2 8 87 60
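Since get_closest is already vectorized, a shorter route is to skip the groupby entirely and pass the whole column (a sketch, assuming the lookup list is sorted, as np.searchsorted requires):
df['closest_number'] = get_closest([5, 30, 60, 180, 240], df['feat1'].values)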

Counting the number of entries in a dataframe that satisfy multiple criteria

I have a dataframe with 9 columns, two of which are gender and smoker status. Every row in the dataframe is a person, and each column is their entry on a particular trait.
I want to count the number of entries that satisfy the condition of being both a smoker and male.
I have tried using a sum function:
maleSmoke = sum(1 for i in data['gender'] if i is 'm' and i in data['smoker'] if i is 1 )
but this always returns 0. This method works when I only check one criterion, however, and I can't figure out how to expand it to a second.
I also tried writing a function that counted its way through every entry into the dataframe but this also returns 0 for all entries.
def countSmokeGender(df):
    maleSmoke = 0
    femaleSmoke = 0
    maleNoSmoke = 0
    femaleNoSmoke = 0
    for i in range(20000):
        if df['gender'][i] is 'm' and df['smoker'][i] is 1:
            maleSmoke = maleSmoke + 1
        if df['gender'][i] is 'f' and df['smoker'][i] is 1:
            femaleSmoke = femaleSmoke + 1
        if df['gender'][i] is 'm' and df['smoker'][i] is 0:
            maleNoSmoke = maleNoSmoke + 1
        if df['gender'][i] is 'f' and df['smoker'][i] is 0:
            femaleNoSmoke = femaleNoSmoke + 1
    return maleSmoke, femaleSmoke, maleNoSmoke, femaleNoSmoke
I've tried pulling out the data sets as numpy arrays and counting those but that wasn't working either.
Are you using pandas? Assuming you are: note that your attempts return 0 because 'is' tests object identity rather than equality, so use '==' for comparisons. With pandas you can simply do this:
# How many male smokers
len(df[(df['gender']=='m') & (df['smoker']==1)])
# How many female smokers
len(df[(df['gender']=='f') & (df['smoker']==1)])
# How many male non-smokers
len(df[(df['gender']=='m') & (df['smoker']==0)])
# How many female non-smokers
len(df[(df['gender']=='f') & (df['smoker']==0)])
Or, you can use groupby:
df.groupby(['gender'])['smoker'].sum()
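To get all four counts in one shot, grouping by both columns also works (a sketch along the same lines, not part of the original answer):
df.groupby(['gender', 'smoker']).size()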
Another alternative, which is great for data exploration: .pivot_table
With a DataFrame like this
id gender smoker other_trait
0 0 m 0 0
1 1 f 1 1
2 2 m 1 0
3 3 m 1 1
4 4 f 1 0
.. .. ... ... ...
95 95 f 0 0
96 96 f 1 1
97 97 f 0 1
98 98 m 0 0
99 99 f 1 0
you could do
result = df.pivot_table(
    index="smoker", columns="gender", values="id", aggfunc="count"
)
to get a result like
gender f m
smoker
0 32 16
1 27 25
If you want to display the partial counts you can add the margins=True option and get
gender f m All
smoker
0 32 16 48
1 27 25 52
All 59 41 100
If you don't have a column to count over (you can't use smoker and gender because they are used for the labels) you could add a dummy column:
result = df.assign(dummy=1).pivot_table(
    index="smoker", columns="gender", values="dummy", aggfunc="count",
    margins=True
)
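For pure counting, pd.crosstab reaches the same table without needing a values column or a dummy (a sketch, not part of the original answer):
pd.crosstab(df['smoker'], df['gender'], margins=True)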

New column based on multi-column condition

import pandas as pd

df = pd.DataFrame({
    'cakeName': ['A','B','C','D','E','F','G','H'],
    'chocolate%': ['20','70','30','50','50','10','75','20'],
    'milk%': ['50','20','40','0','30','80','15','10'],
    'straberry%': ['30','10','30','50','20','10','10','70'],
})
df.head(10)
I would like to create a new column 'cakeType' based on the column values.
Objective:
- scan through each cakeName
- if a single ingredient stands out, >= 75, return a value in 'cakeType'
- for example: cake 'G' has chocolate% >= 75, so return 'choco', etc.
- else, if no ingredient reaches 75, it is just a 'normal cake'
I searched the forum for an answer, but nothing quite fits, as I will have many, many ingredient columns, so scanning each row for a value >= 75 seems the better way to do it?
Thanks a lot
Method 1: np.select:
Good use case for np.select, where we define our conditions and select the matching choices, with a default value if none of the conditions is met. Note that the sample data stores the percentages as strings, so cast them to numbers first:
import numpy as np

# the sample data holds strings such as '20', so cast the ingredient columns to int
cols = ['chocolate%', 'milk%', 'straberry%']
df[cols] = df[cols].astype(int)

conditions = [
    df['chocolate%'].ge(75),
    df['milk%'].ge(75),
    df['straberry%'].ge(75)
]
choices = ['choco', 'milk', 'strawberry']
df['cakeType'] = np.select(conditions, choices, default='normal cake')
cakeName chocolate% milk% straberry% cakeType
0 A 20 50 30 normal cake
1 B 70 20 10 normal cake
2 C 30 40 30 normal cake
3 D 50 0 50 normal cake
4 E 50 30 20 normal cake
5 F 10 80 10 milk
6 G 75 15 10 choco
7 H 20 10 70 normal cake
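Since the question mentions many ingredient columns, the conditions and choices can also be built programmatically (a sketch assuming the percentage columns are numeric, as after the cast above, and that every ingredient column name ends with '%'):
ingredient_cols = [c for c in df.columns if c.endswith('%')]
conditions = [df[c].ge(75) for c in ingredient_cols]
choices = [c.rstrip('%') for c in ingredient_cols]  # 'chocolate%' -> 'chocolate'
df['cakeType'] = np.select(conditions, choices, default='normal cake')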
Method 2: idxmax, Series.where and fillna:
First we get, for each row, the name of the first column whose value is >= 75. Then we mask the rows where no value reaches 75 and fill those with 'normal cake'; the .str[:-1] strips the trailing '%' from the column name:
m1 = df.iloc[:, 1:].ge(75).idxmax(axis=1)
newcol = m1.where(df.iloc[:, 1:].ge(75).any(axis=1)).str[:-1].fillna('normal cake')
df['cakeType'] = newcol
cakeName chocolate% milk% straberry% cakeType
0 A 20 50 30 normal cake
1 B 70 20 10 normal cake
2 C 30 40 30 normal cake
3 D 50 0 50 normal cake
4 E 50 30 20 normal cake
5 F 10 80 10 milk
6 G 75 15 10 chocolate
7 H 20 10 70 normal cake

Python: Apply function to each row of a Pandas DataFrame and return **new data frame**

I am trying to apply a function to each row of a data frame. The tricky part is that the function returns a new data frame for each processed row. Assume the columns of this data frame can easily be derived from the processed row.
At the end, the result should be all these data frames (one for each processed row) concatenated. I intentionally do not provide sample code, because the simplest of solution proposals will do, as long as the 'tricky' part is fulfilled.
I have spent hours digging through the docs and Stack Overflow to find a solution. As usual, the pandas docs are so devoid of practical examples beyond the simplest operations that I just couldn't figure it out. I also made sure not to miss any duplicate questions. Thanks a lot.
It is unclear what you are trying to achieve, but I doubt you need to create separate dataframes.
The example below shows how you can take a dataframe, subset it to your columns of interest, apply a function foo to one of the columns and then apply a second function bar that returns multiple values.
import pandas as pd

df = pd.DataFrame({
    'first_name': ['john', 'nancy', 'jolly'],
    'last_name': ['smith', 'drew', 'rogers'],
    'A': [1, 4, 7],
    'B': [2, 5, 8],
    'C': [3, 6, 9]
})
>>> df
first_name last_name A B C
0 john smith 1 2 3
1 nancy drew 4 5 6
2 jolly rogers 7 8 9
def foo(first_name):
    return 2 if first_name.startswith('j') else 1

def bar(first_name):
    return (2, 0) if first_name.startswith('j') else (1, 3)

columns_of_interest = ['first_name', 'A']
df_new = pd.concat([
    df[columns_of_interest].assign(x=df.first_name.apply(foo)),
    df.first_name.apply(bar).apply(pd.Series)], axis=1)
>>> df_new
first_name A x 0 1
0 john 1 2 2 0
1 nancy 4 1 1 3
2 jolly 7 2 2 0
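The integer column names 0 and 1 come from the tuple that bar returns; if you want named columns you can rename them (the names below are made up for illustration):
df_new = df_new.rename(columns={0: 'bar_a', 1: 'bar_b'})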
Assuming the function you are applying to each row is called f
pd.concat({i: f(row) for i, row in df.iterrows()})
Working example:
import numpy as np

df = pd.DataFrame(np.arange(25).reshape(5, 5), columns=list('ABCDE'))

def f(row):
    # duplicate the row under keys 'x' and 'y', drop column C, add a constant column S
    return pd.concat([row] * 2, keys=['x', 'y']).unstack().drop('C', axis=1).assign(S=99)

pd.concat({i: f(row) for i, row in df.iterrows()})
A B D E S
0 x 0 1 3 4 99
y 0 1 3 4 99
1 x 5 6 8 9 99
y 5 6 8 9 99
2 x 10 11 13 14 99
y 10 11 13 14 99
3 x 15 16 18 19 99
y 15 16 18 19 99
4 x 20 21 23 24 99
y 20 21 23 24 99
Or
df.groupby(level=0).apply(lambda x: f(x.squeeze()))
A B D E S
0 x 0 1 3 4 99
y 0 1 3 4 99
1 x 5 6 8 9 99
y 5 6 8 9 99
2 x 10 11 13 14 99
y 10 11 13 14 99
3 x 15 16 18 19 99
y 15 16 18 19 99
4 x 20 21 23 24 99
y 20 21 23 24 99
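Either way, the result carries a MultiIndex whose outer level is the original row label, so each per-row sub-frame stays addressable (a quick usage note, not part of the original answer):
result = pd.concat({i: f(row) for i, row in df.iterrows()})
result.loc[0]  # the sub-frame produced from row 0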
I would do it this way, although I note that .apply is possibly what you are looking for.
import pandas as pd
import numpy as np

np.random.seed(7)
orig = pd.DataFrame(np.random.rand(6, 3))
orig.columns = ['F1', 'F2', 'F3']

res = []
for i, r in orig.iterrows():
    # sum the values in this row
    tot = 0
    for col in r:
        tot = tot + col
    rv = {'res': tot}
    # build a one-row DataFrame from the dict
    a = pd.DataFrame.from_dict(rv, orient='index', dtype=np.float64)
    res.append(a)

res[0].head()
This should return a list of one-row DataFrames, where res[0] is built from a dict like {'res': tot} holding the sum of the first row.
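To get the single concatenated frame the question asks for, the list can be combined at the end (one extra line, assuming res as built above):
final = pd.concat(res)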

How to add complementary intervals in a pandas DataFrame

Let's say that I have a signal of 100 samples, L=100.
In this signal I found some intervals that I label as "OK". The intervals are stored in a Pandas DataFrame that looks like this:
import numpy as np
import pandas as pd

c = pd.DataFrame(np.array([[10,26],[50,84]]), columns=['Start','End'])
c['Value'] = 'OK'
How can I add the complementary intervals in another DataFrame in order to have something like this:
d = pd.DataFrame(np.array([[0,9],[10,26],[27,49],[50,84],[85,100]]),columns=['Start','End'])
d['Value']=['Check','OK','Check','OK','Check']
You can use the first DataFrame to create the second one and merge, as suggested by @jezrael:
d = pd.DataFrame({
    "Start": [0] + sorted(pd.concat([c.Start, c.End + 1])),
    "End": sorted(pd.concat([c.Start - 1, c.End])) + [100]
})
d = pd.merge(d, c, how='left')
d['Value'] = d['Value'].fillna('Check')
# reindex_axis was removed from pandas; plain column selection keeps the order
d = d[["Start", "End", "Value"]]
Output
Start End Value
0 0 9 Check
1 10 26 OK
2 27 49 Check
3 50 84 OK
4 85 100 Check
I think you need:
d = pd.merge(d, c, how='left')
d['Value'] = d['Value'].fillna('Check')
print (d)
Start End Value
0 0 9 Check
1 10 26 OK
2 27 49 Check
3 50 84 OK
4 85 100 Check
EDIT:
You can use numpy.concatenate with numpy.sort, numpy.column_stack and the DataFrame constructor to build the new df. Finally, merge and use fillna with a dict to fill the Value column:
s = np.sort(np.concatenate([[0], c['Start'].values, c['End'].values + 1]))
e = np.sort(np.concatenate([c['Start'].values - 1, c['End'].values, [100]]))
d = pd.DataFrame(np.column_stack([s,e]), columns=['Start','End'])
d = pd.merge(d, c, how='left').fillna({'Value':'Check'})
print (d)
Start End Value
0 0 9 Check
1 10 26 OK
2 27 49 Check
3 50 84 OK
4 85 100 Check
EDIT1:
New boundary values are added to c with loc (a sentinel Start of max_val + 1, so the last complementary interval ends at max_val), the frame is reshaped to a Series with stack and shifted, and the complementary DataFrame is then rebuilt with unstack:
b = c.copy()
max_val = 100
min_val = 0
c.loc[-1, 'Start'] = max_val + 1
a = c[['Start','End']].stack(dropna=False).shift().fillna(min_val - 1).astype(int).unstack()
a['Start'] = a['Start'] + 1
a['End'] = a['End'] - 1
a['Value'] = 'Check'
print (a)
Start End Value
0 0 9 Check
1 27 49 Check
-1 85 100 Check
d = pd.concat([b, a]).sort_values('Start').reset_index(drop=True)
print (d)
Start End Value
0 0 9 Check
1 10 26 OK
2 27 49 Check
3 50 84 OK
4 85 100 Check
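If you prefer something more explicit, the complementary intervals can also be built with a plain loop (a minimal sketch, assuming the original c is sorted, non-overlapping, and the signal spans 0 to 100):
gaps, prev_end = [], -1
for start, end in zip(c['Start'], c['End']):
    if start > prev_end + 1:
        # everything between the previous interval and this one is unlabelled
        gaps.append((prev_end + 1, start - 1, 'Check'))
    prev_end = end
if prev_end < 100:
    gaps.append((prev_end + 1, 100, 'Check'))
d = pd.concat([c, pd.DataFrame(gaps, columns=['Start', 'End', 'Value'])])
d = d.sort_values('Start', ignore_index=True)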
