Under the sex column of my DataFrame, each row contains either several elements (female, female) or (male, female), or a single one (female). Now I want to count the number of male and female entries for each row. Can you help me, please?
import collections

# my attempt so far: inspect one entry and try Counter on it
liste = df_fundinv['borrower_genders'][0]
type(liste)
collections.Counter(liste)
0 female
1 female, female
2 female
3 female
4 female
...
646828 female
646829 female
646830 female, female
646831 female, female
646832 female
Name: borrower_genders, Length: 646833, dtype: object
gender.values
# 'männlich' is German for 'male'; my attempt at counting occurrences:
männlich = {}
for element in gender.values:
    männlich[element] = männlich.get(element, 0) + 1
männlich
What I expect (two new columns, male and female):
        borrower_genders  male  female
0       female               0       1
1       female, female       0       2
2       female               0       1
3       female               0       1
4       female               0       1
...
646828  female               0       1
646829  female               0       1
646830  female, female       0       2
646831  female, female       0       2
646832  female               0       1

[646833 rows x 3 columns]
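For reference, one way to get those counts (a minimal sketch with made-up data, assuming borrower_genders holds comma-separated gender strings) is to split each string and count the two values per row:

```python
import pandas as pd

# Hypothetical miniature of the borrower_genders column
df_fundinv = pd.DataFrame({
    'borrower_genders': ['female', 'female, female', 'female, male', 'male']
})

# Split on commas and strip whitespace, then count each gender per row.
# (Counting on the raw string would be wrong: 'male' is a substring of 'female'.)
split = df_fundinv['borrower_genders'].str.split(',').apply(
    lambda genders: [g.strip() for g in genders])
df_fundinv['female'] = split.apply(lambda g: g.count('female'))
df_fundinv['male'] = split.apply(lambda g: g.count('male'))
print(df_fundinv)
```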
Related
imagine I have dataframe like this:
item name gender
banana tom male
banana kate female
apple kate female
kiwi jim male
apple tom male
banana kimmy female
kiwi kate female
banana tom male
Is there any way to drop the rows where the person bought fewer than 2 items? I also don't want to drop duplicates. So the output I want looks like this:
item name gender
banana tom male
banana kate female
apple kate female
apple tom male
kiwi kate female
banana tom male
#sammywemmy's solution:
df.loc[df.groupby('name').item.transform('size').ge(2)]
groupby groups rows with the same name together
# Get Each Group
print(df.groupby('name').apply(lambda s: s.reset_index()))
index item name gender
name
jim 0 3 kiwi jim male
kate 0 1 banana kate female
1 2 apple kate female
2 6 kiwi kate female
kimmy 0 5 banana kimmy female
tom 0 0 banana tom male
1 4 apple tom male
2 7 banana tom male
transform puts a value in every row that represents the group size (number of rows)
# Turn Each Item Into The Number of Rows in The Group
df['group_size'] = df.groupby('name')['item'].transform('size')
print(df)
item name gender group_size
0 banana tom male 3
1 banana kate female 3
2 apple kate female 3
3 kiwi jim male 1
4 apple tom male 3
5 banana kimmy female 1
6 kiwi kate female 3
7 banana tom male 3
This could have been done on any column in this case:
# Turn Each Item Into The Number of Rows in The Group
df['group_size'] = df.groupby('name')['gender'].transform('size')
print(df)
item name gender group_size
0 banana tom male 3
1 banana kate female 3
2 apple kate female 3
3 kiwi jim male 1
4 apple tom male 3
5 banana kimmy female 1
6 kiwi kate female 3
7 banana tom male 3
Notice how now each row has the corresponding group size at the end. tom has 3 instances so every name == tom row has 3 in group_size.
ge converts to a Boolean index based on the >= relational operator
# Add Condition To determine if the row should be kept or not
df['should_keep'] = df.groupby('name')['item'].transform('size').ge(2)
print(df)
item name gender group_size should_keep
0 banana tom male 3 True
1 banana kate female 3 True
2 apple kate female 3 True
3 kiwi jim male 1 False
4 apple tom male 3 True
5 banana kimmy female 1 False
6 kiwi kate female 3 True
7 banana tom male 3 True
loc uses the Boolean index to get the desired rows
print(df.groupby('name')['item'].transform('size').ge(2))
0 True
1 True
2 True
3 False
4 True
5 False
6 True
7 True
Name: item, dtype: bool
loc will include any index that is True; any index that is False will be excluded (indexes 3 and 5 are False, so they are not included).
All together:
import pandas as pd
df = pd.DataFrame({'item': {0: 'banana', 1: 'banana', 2: 'apple',
3: 'kiwi', 4: 'apple', 5: 'banana',
6: 'kiwi', 7: 'banana'},
'name': {0: 'tom', 1: 'kate', 2: 'kate',
3: 'jim', 4: 'tom', 5: 'kimmy',
6: 'kate', 7: 'tom'},
'gender': {0: 'male', 1: 'female',
2: 'female', 3: 'male',
4: 'male', 5: 'female',
6: 'female', 7: 'male'}})
print(df.loc[df.groupby('name')['item'].transform('size').ge(2)])
item name gender
0 banana tom male
1 banana kate female
2 apple kate female
4 apple tom male
6 kiwi kate female
7 banana tom male
I have a massive dataframe df with around 10 million rows:
df.sort_values(['pair','x1','x2'])
x1 x1gen x2 x2gen y1 y1gen y2 y2gen pair
-------------------------------------------------------------------------------
A male H female a male d male 0
A male W male a male d male 0 (*)
A male KK female a male d male 0 (**)
B female C male a male d male 0 (-)
B female W male a male d male 0 (*)
B female BB female a male d male 0
B female KK female a male d male 0 (**)
F male W male a male d male 0 (*)
A male T female b female d male 1
A male BB female b female d male 1
B female C male b female d male 1 (-)
D male E male b female d male 1
A male C male b female e female 2
...
Each column can be explained by the following:
x1gen is the gender of x1, x2gen is the gender of x2, and so on.
x1 cites y1 and x2 cites y2.
Each pair of y1 and y2 is assigned a unique pair value.
My objective is to find four values per unique pair:
male citing male
male citing female
female citing male
female citing female
where, each citation network should not be counted more than once.
For example, in the given sample, x2 = W appears three times in pair = 0 (see (*)), so it should be counted once, not three times. The same applies to x2 = KK in pair = 0 (see (**)). However, the same reference can be counted again if it belongs to a new pair (C -> d in (-) is counted once for pair = 0 and once for pair = 1).
Hence, for the first pair pair = 0, the objective values are:
male citing male = 4 (A -> a, F -> a, W -> d, C -> d)
male citing female = 0
female citing male = 4 (B -> a, H -> d, KK -> d, BB -> d)
female citing female = 0
What I initially did was use a for loop and a set of if statements, creating four lists separately for x1 and x2:
mm = [1]
mf = [0]
fm = [0]
ff = [0]
mm1 = 1
mf1 = 0
fm1 = 0
ff1 = 0
for i in range(1, len(df)):
    if df['pair'][i] == df['pair'][i-1]:
        if df['x1'][i] != df['x1'][i-1]:
            if df['x1gen'][i] == 'male':
                if df['y1gen'][i] == 'male':
                    mm1 += 1
                else:
                    mf1 += 1
            else:
                if df['y1gen'][i] == 'male':
                    fm1 += 1
                else:
                    ff1 += 1
...
and the gist is analogous (the actual code is MANY lines long, but just repeats those lines). As one can tell, this is HIGHLY inefficient (it takes around 120 minutes).
What is the optimal way to find these values without such inefficient string matching?
You can try the following:
import io
import re
import pandas as pd
# this just recreates the dataframe
s = '''
x1 x1gen x2 x2gen y1 y1gen y2 y2gen pair
A male H female a male d male 0
A male W male a male d male 0
A male KK female a male d male 0
B female C male a male d male 0
B female W male a male d male 0
B female BB female a male d male 0
B female KK female a male d male 0
F male W male a male d male 0
A male T female b female d male 1
A male BB female b female d male 1
B female C male b female d male 1
D male E male b female d male 1
A male C male b female e female 2
'''
s = re.sub(r" +", " ", s)
df = pd.read_csv(io.StringIO(s), sep=" ")
print(df)
It gives:
x1 x1gen x2 x2gen y1 y1gen y2 y2gen pair
0 A male H female a male d male 0
1 A male W male a male d male 0
2 A male KK female a male d male 0
3 B female C male a male d male 0
4 B female W male a male d male 0
5 B female BB female a male d male 0
6 B female KK female a male d male 0
7 F male W male a male d male 0
8 A male T female b female d male 1
9 A male BB female b female d male 1
10 B female C male b female d male 1
11 D male E male b female d male 1
12 A male C male b female e female 2
Counting citation pairs:
# count x1-> y1 pairs
df1 = df.drop_duplicates(subset=['x1', 'y1', 'pair'])
c1 = (df1['x1gen'] + '_' + df1['y1gen']).value_counts()
# count x2-> y2 pairs
df2 = df.drop_duplicates(subset=['x2', 'y2', 'pair'])
c2 = (df2['x2gen'] + '_' + df2['y2gen']).value_counts()
# add results
c1.add(c2, fill_value=0).astype(int)
This gives:
female_female 1
female_male 6
male_female 4
male_male 6
Computing results for each pair separately:
def cit_count(g):
    # count x1 -> y1 pairs
    df1 = g.drop_duplicates(subset=['x1', 'y1'])
    c1 = (df1['x1gen'] + '_' + df1['y1gen']).value_counts()
    # count x2 -> y2 pairs
    df2 = g.drop_duplicates(subset=['x2', 'y2'])
    c2 = (df2['x2gen'] + '_' + df2['y2gen']).value_counts()
    # add results
    return c1.add(c2, fill_value=0)
print(df.groupby('pair').apply(cit_count).unstack().fillna(0).astype(int))
It gives:
female_female female_male male_female male_male
pair
0 0 4 0 4
1 1 2 2 2
2 0 0 2 0
I have two dataframes:
df1:
Gender Registered
female 1
male 0
female 0
female 1
male 1
male 0
df2:
Gender
female
female
male
male
I want to modify df2, so that there is a new column 'Count' with the count of registered = 1 for corresponding gender values from df1. For example, in df1 there are 2 registered females and 1 registered male. I want to transform the df2 so that the output is as follows:
output:
Gender Count
female 2
female 2
male 1
male 1
I tried many things and got close, but couldn't make it fully work.
sum + map:
v = df1.groupby('Gender').Registered.sum()
df2.assign(Count=df2.Gender.map(v))
Gender Count
0 female 2
1 female 2
2 male 1
3 male 1
pd.merge
pd.merge(df2, df1.groupby('Gender', as_index=False).sum())
Gender Registered
0 female 2
1 female 2
2 male 1
3 male 1
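If the new column should be named Count as in the desired output, the merged Registered column can be renamed afterwards (a small sketch recreating both frames):

```python
import pandas as pd

df1 = pd.DataFrame({'Gender': ['female', 'male', 'female', 'female', 'male', 'male'],
                    'Registered': [1, 0, 0, 1, 1, 0]})
df2 = pd.DataFrame({'Gender': ['female', 'female', 'male', 'male']})

# merge the per-gender sums into df2, then rename the column
out = (pd.merge(df2, df1.groupby('Gender', as_index=False).Registered.sum())
         .rename(columns={'Registered': 'Count'}))
print(out)
```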
I'm trying to solve the Titanic survival problem from Kaggle. It's my first step in actually learning machine learning. I have a problem where the gender column causes an error. The stack trace says could not convert string to float: 'female'. How did you come across this issue? I don't want solutions, just a practical approach to this problem, because I do need the gender column to build my model.
This is my code:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
train_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\train.csv"
train_data = pd.read_csv(train_path)
columns_of_interest = ['Survived','Pclass', 'Sex', 'Age']
filtered_titanic_data = train_data.dropna(axis=0)
x = filtered_titanic_data[columns_of_interest]
y = filtered_titanic_data.Survived
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)
titanic_model = DecisionTreeRegressor()
titanic_model.fit(train_x, train_y)
val_predictions = titanic_model.predict(val_x)
print(filtered_titanic_data)
There are a couple of ways to deal with this, and it depends on what you're looking for:
You could encode your categories to numeric values, i.e. transform each level of your category to a distinct number,
or
dummy code your category, i.e. turn each level of your category into a separate column, which gets a value of 0 or 1.
In many machine learning applications, factors are better handled as dummy codes.
Note that in the case of a 2-level category, encoding to numeric according to the methods outlined below is essentially equivalent to dummy coding: all the values that are not level 0 are necessarily level 1. In fact, in the dummy code example I've given below, there is redundant information, as I've given each of the 2 classes its own column. It's just to illustrate the concept. Typically, one would only create n-1 columns, where n is the number of levels, and the omitted level is implied (i.e. make a column for Female, and all the 0 values are implied to be Male).
Encoding Categories to numeric:
Method 1: pd.factorize
pd.factorize is a simple, fast way of encoding to numeric:
For example, if your column gender looks like this:
>>> df
gender
0 Female
1 Male
2 Male
3 Male
4 Female
5 Female
6 Male
7 Female
8 Female
9 Female
df['gender_factor'] = pd.factorize(df.gender)[0]
>>> df
gender gender_factor
0 Female 0
1 Male 1
2 Male 1
3 Male 1
4 Female 0
5 Female 0
6 Male 1
7 Female 0
8 Female 0
9 Female 0
Method 2: categorical dtype
Another way would be to use category dtype:
df['gender_factor'] = df['gender'].astype('category').cat.codes
This would result in the same output
Method 3: sklearn.preprocessing.LabelEncoder()
This method comes with some bonuses, such as easy back transforming:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
# Transform the gender column
df['gender_factor'] = le.fit_transform(df.gender)
>>> df
gender gender_factor
0 Female 0
1 Male 1
2 Male 1
3 Male 1
4 Female 0
5 Female 0
6 Male 1
7 Female 0
8 Female 0
9 Female 0
# Easy to back transform:
df['gender_factor'] = le.inverse_transform(df.gender_factor)
>>> df
gender gender_factor
0 Female Female
1 Male Male
2 Male Male
3 Male Male
4 Female Female
5 Female Female
6 Male Male
7 Female Female
8 Female Female
9 Female Female
Dummy Coding:
Method 1: pd.get_dummies
df.join(pd.get_dummies(df.gender))
gender Female Male
0 Female 1 0
1 Male 0 1
2 Male 0 1
3 Male 0 1
4 Female 1 0
5 Female 1 0
6 Male 0 1
7 Female 1 0
8 Female 1 0
9 Female 1 0
Note, if you want to omit one column to get a non-redundant dummy code (see my note at the beginning of this answer), you can use:
df.join(pd.get_dummies(df.gender, drop_first=True))
gender Male
0 Female 0
1 Male 1
2 Male 1
3 Male 1
4 Female 0
5 Female 0
6 Male 1
7 Female 0
8 Female 0
9 Female 0
Let's say I have a pandas DataFrame like this:
import pandas as pd
a=pd.Series({'Country':'Italy','Name':'Augustina','Gender':'Female','Number':1})
b=pd.Series({'Country':'Italy','Name':'Piero','Gender':'Male','Number':2})
c=pd.Series({'Country':'Italy','Name':'Carla','Gender':'Female','Number':3})
d=pd.Series({'Country':'Italy','Name':'Roma','Gender':'Female','Number':4})
e=pd.Series({'Country':'Greece','Name':'Sophia','Gender':'Female','Number':5})
f=pd.Series({'Country':'Greece','Name':'Zeus','Gender':'Male','Number':6})
df=pd.DataFrame([a,b,c,d,e,f])
then I set a MultiIndex, like
df.set_index(['Country','Gender'],inplace=True)
Now, I would like to know how to count how many people are from Italy, or how many Greek females I have in the dataframe.
I've tried
df['Italy'].count()
and
df['Greece']['Female'].count()
Neither of them works.
Thanks
I think you need groupby with aggregating size:
What is the difference between size and count in pandas?
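The difference, in brief (an aside with made-up data): size counts all rows in a group, while count excludes NaN values:

```python
import numpy as np
import pandas as pd

tmp = pd.DataFrame({'Gender': ['Female', 'Female', 'Male'],
                    'Number': [1.0, np.nan, 3.0]})
# size counts the NaN row; count does not
print(tmp.groupby('Gender')['Number'].size())
print(tmp.groupby('Gender')['Number'].count())
```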
a=pd.DataFrame([{'Country':'Italy','Name':'Augustina','Gender':'Female','Number':1}])
b=pd.DataFrame([{'Country':'Italy','Name':'Piero','Gender':'Male','Number':2}])
c=pd.DataFrame([{'Country':'Italy','Name':'Carla','Gender':'Female','Number':3}])
d=pd.DataFrame([{'Country':'Italy','Name':'Roma','Gender':'Female','Number':4}])
e=pd.DataFrame([{'Country':'Greece','Name':'Sophia','Gender':'Female','Number':5}])
f=pd.DataFrame([{'Country':'Greece','Name':'Zeus','Gender':'Male','Number':6}])
df=pd.concat([a,b,c,d,e,f], ignore_index=True)
print (df)
Country Gender Name Number
0 Italy Female Augustina 1
1 Italy Male Piero 2
2 Italy Female Carla 3
3 Italy Female Roma 4
4 Greece Female Sophia 5
5 Greece Male Zeus 6
print (df.groupby('Country').size())
Country
Greece 2
Italy 4
dtype: int64
print (df.groupby(['Country', 'Gender']).size())
Country Gender
Greece Female 1
Male 1
Italy Female 3
Male 1
dtype: int64
If you need only some of the sizes, select by MultiIndex with xs or slicers:
df.set_index(['Country','Gender'],inplace=True)
print (df)
Name Number
Country Gender
Italy Female Augustina 1
Male Piero 2
Female Carla 3
Female Roma 4
Greece Female Sophia 5
Male Zeus 6
print (df.xs('Italy', level='Country'))
Name Number
Gender
Female Augustina 1
Male Piero 2
Female Carla 3
Female Roma 4
print (len(df.xs('Italy', level='Country').index))
4
print (df.xs(('Greece', 'Female'), level=('Country', 'Gender')))
Name Number
Country Gender
Greece Female Sophia 5
print (len(df.xs(('Greece', 'Female'), level=('Country', 'Gender')).index))
1
#KeyError: 'MultiIndex Slicing requires
#the index to be fully lexsorted tuple len (2), lexsort depth (0)'
df.sort_index(inplace=True)
idx = pd.IndexSlice
print (df.loc[idx['Italy', :],:])
Name Number
Country Gender
Italy Female Augustina 1
Female Carla 3
Female Roma 4
Male Piero 2
print (len(df.loc[idx['Italy', :],:].index))
4
print (df.loc[idx['Greece', 'Female'],:])
Name Number
Country Gender
Greece Female Sophia 5
print (len(df.loc[idx['Greece', 'Female'],:].index))
1