I have a dataset with the following unique values in one of its columns.
df['Gender'].unique()
array(['Female', 'M', 'Male', 'male', 'm', 'Male-ish', 'maile',
'Trans-female', 'Cis Female', 'something kinda male?', 'Cis Male',
'queer/she/they', 'non-binary', 'Make', 'Nah', 'All', 'Enby',
'fluid', 'Genderqueer', 'Androgyne', 'Agender', 'Guy (-ish) ^_^',
'male leaning androgynous', 'Male ', 'Man', 'msle', 'Neuter',
'queer', 'A little about you', 'Malr',
'ostensibly male, unsure what that really means'], dtype=object)
As you can see, there are obvious cases where a row should be listed as 'Male' (I'm referring to the cases where 'Male' is misspelled, of course). How can I replace these values with 'Male' without calling the replace function ten times? This is the code I have tried:
x = 0
while x <= 11:
    for i in df['Gender']:
        if i[0:2] == 'Ma':
            print('Male')
        elif i[0] == 'm':
            print('Male')
    x += 1
However, I just get a print of a bunch of "Male".
Edit: I want to convert the following values to 'Male': 'M', 'male', 'm', 'maile', 'Make', 'Man', 'msle', 'Malr', 'Male '
Create a list with all the nicknames of Male:
males_list = ['M', 'male', 'm', 'maile', 'Make', 'Man', 'msle', 'Malr', 'Male ']
And then replace them with:
df.loc[df['Gender'].isin(males_list), 'Gender'] = 'Male'
btw: There is almost always a better solution than looping over the rows in pandas, not just in cases like this.
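An even shorter alternative, sketched on a made-up frame rather than the original data: replace() accepts a list of values and a single replacement, so the whole cleanup is one call.

```python
import pandas as pd

# Toy frame standing in for the original data (assumed values)
df = pd.DataFrame({'Gender': ['Female', 'M', 'male', 'Make', 'Male ']})

males_list = ['M', 'male', 'm', 'maile', 'Make', 'Man', 'msle', 'Malr', 'Male ']

# replace() maps every value in the list to the single replacement
df['Gender'] = df['Gender'].replace(males_list, 'Male')
print(df['Gender'].tolist())  # ['Female', 'Male', 'Male', 'Male', 'Male']
```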
I would use the map function as it allows you to create any custom logic. So for instance, by looking at your code, something like this would do the trick:
def correct_gender(text):
    if text[0:2] == 'Ma' or text[0] == 'm':
        return "Male"
    return text

df["Gender"] = df["Gender"].map(correct_gender)
If I understand you correctly, you want a more generalized approach. We can use a regex to check whether the word starts with M, or contains the letters "ma" preceded by a whitespace character, so we don't catch 'Female':
(?i): makes the match case-insensitive
(?<=\s)ma: a lookbehind that matches "ma" only when it is preceded by whitespace
df.loc[df['Gender'].str.contains(r'(?i)^M|(?<=\s)ma'), 'Gender'] = 'Male'
Output
Gender
0 Female
1 Male
2 Male
3 Male
4 Male
5 Male
6 Male
7 Trans-female
8 Cis Female
9 Male
10 Male
11 queer/she/they
12 non-binary
13 Male
14 Nah
15 All
16 Enby
17 fluid
18 Genderqueer
19 Androgyne
20 Agender
21 Guy (-ish) ^_^
22 Male
23 Male
24 Male
25 Male
26 Neuter
27 queer
28 A little about you
29 Male
30 Male
Related
I'm working on transforming a dataframe to show the top 3 earners.
The dataframe looks like this
data = {'Name': ['Allistair', 'Bob', 'Carrie', 'Diane', 'Allistair', 'Bob', 'Carrie','Evelyn'], 'Sale': [20, 21, 19, 18, 5, 300, 35, 22]}
df = pd.DataFrame(data)
print(df)
Name Sale
0 Allistair 20
1 Bob 21
2 Carrie 19
3 Diane 18
4 Allistair 5
5 Bob 300
6 Carrie 35
7 Evelyn 22
In my actual dataset, I have several more columns and rows, and I want to print out and get to
something like
Name Sale
0 Bob 321
1 Carrie 35
2 Allistair 25
Every iteration that I've searched through doesn't quite get there because I get
'Name' is both an index level and a column label, which is ambiguous.
Use groupby:
>>> df.groupby('Name').sum().sort_values('Sale', ascending=False)
Sale
Name
Bob 321
Carrie 54
Allistair 25
Evelyn 22
Diane 18
Thanks to @Andrej Kasely above,
df.groupby("Name")["Sale"].sum().nlargest(3)
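Putting that one-liner together with the question's sample data, the result can be sketched as follows (note that Carrie's actual total is 54, since groupby sums both of her rows):

```python
import pandas as pd

data = {'Name': ['Allistair', 'Bob', 'Carrie', 'Diane', 'Allistair', 'Bob', 'Carrie', 'Evelyn'],
        'Sale': [20, 21, 19, 18, 5, 300, 35, 22]}
df = pd.DataFrame(data)

# Sum sales per name, then keep only the three largest totals
top3 = df.groupby('Name')['Sale'].sum().nlargest(3)
print(top3)
# Name
# Bob          321
# Carrie        54
# Allistair     25
# Name: Sale, dtype: int64
```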
So I have a pandas dataframe with rows of tokenized strings in a column named story. I also have a list of words in a list called selected_words. I am trying to count the instances of any of the selected_words in each of the rows in the column story.
The code I used before that had worked is
CCwordsCount=df4.story.str.count('|'.join(selected_words))
This is now giving me NaN values for every row.
Below is the first few rows of the column story in df4. The dataframe contains a little over 400 rows of NYTimes Articles.
0 [it, was, a, curious, choice, for, the, good, ...
1 [when, he, was, a, yale, law, school, student,...
2 [video, bitcoin, has, real, world, investors, ...
3 [bitcoin, s, wild, ride, may, not, have, been,...
4 [amid, the, incense, cheap, art, and, herbal, ...
5 [san, francisco, eight, years, ago, ernie, all...
This is the list of selected_words
selected_words = ['accept', 'believe', 'trust', 'accepted', 'accepts', 'trusts', 'believes', \
'acceptance', 'trusted', 'trusting', 'accepting', 'believes', 'believing', 'believed',\
'normal', 'normalize', ' normalized', 'routine', 'belief', 'faith', 'confidence', 'adoption', \
'adopt', 'adopted', 'embrace', 'approve', 'approval', 'approved', 'approves']
Link to my df4 .csv file
Each story entry appears to be the string representation of a list, not an actual list.
Use map to strip the brackets from the string before applying the str accessor, as follows.
CCwordsCount = df4.story.map(lambda x: ''.join(x[1:-1])).str.count('|'.join(selected_words))
print(CCwordsCount.head(20)) # Show first 20 story results
Output
0 1
1 2
2 5
3 7
4 0
5 1
6 10
7 8
8 2
9 2
10 8
11 0
12 0
13 2
14 0
15 4
16 2
17 9
18 0
19 0
Name: story, dtype: int64
Explanation
Each story entry is a list that was converted to a string, so basically:
"['it', 'was', 'a', 'curious', 'choice', 'for', 'the', 'good', 'wife', ...]"
The leading '[' and trailing ']' are dropped by slicing off the first and last character:
map(lambda x: ''.join(x[1:-1]))
(Since x is already a string, ''.join(x[1:-1]) is equivalent to just x[1:-1].) This leaves the quoted words separated by commas; for the first row the result is the string:
'it', 'was', 'a', 'curious', 'choice', 'for', ...
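A minimal, self-contained sketch of the same technique on made-up rows (the real df4 comes from the linked CSV file):

```python
import pandas as pd

# Made-up stand-in for df4: each story is the string form of a token list
df4 = pd.DataFrame({'story': [
    "['people', 'accept', 'and', 'believe', 'bitcoin']",
    "['no', 'matching', 'words', 'here']",
]})

selected_words = ['accept', 'believe', 'trust']

# Strip the brackets, then count regex matches of any selected word
counts = df4.story.map(lambda x: ''.join(x[1:-1])).str.count('|'.join(selected_words))
print(counts.tolist())  # [2, 0]
```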
The .find() function can be useful here, and this can be implemented in many different ways. If you don't have any other purpose for the raw articles, they can be treated as one big string. Then try this; you could also put them in a dictionary and loop over it.
def find_words(text, words):
    return [word for word in words if word in text]
sentences = "0 [it, was, a, curious, choice, for, the, good, 1 [when, he, was, a, yale, law, school, student, 2 [video, bitcoin, has, real, world, investors, 3 [bitcoin, s, wild, ride, may, not, have, been, 4 [amid, the, incense, cheap, art, and, herbal, 5 [san, francisco, eight, years, ago, ernie, all"
search_keywords=['accept', 'believe', 'trust', 'accepted', 'accepts', 'trusts', 'believes', \
'acceptance', 'trusted', 'trusting', 'accepting', 'believes', 'believing', 'believed',\
'normal', 'normalize', ' normalized', 'routine', 'belief', 'faith', 'confidence', 'adoption', \
'adopt', 'adopted', 'embrace', 'approve', 'approval', 'approved', 'approves', 'good']
found = find_words(sentences, search_keywords)
print(found)
Note: I didn't have a pandas DataFrame in mind when I created this.
I have a large dataset and would like to filter it to only show rows which contain a particular substring (In the following example, 'George') (also bonus points if you tell me how to pass multiple substrings)
For example, if I start with the code
data = {
'Employee': ['George Brazil', 'Tim Swak', 'Rajesh Mahta', 'Kristy Karns', 'Jamie Hyneman', 'Pamela Voss', 'Tyrone Johnson', 'Anton Lafreu'],
'Director': ['Omay Wanja', 'Omay Wanja', 'George Stafford', 'Omay Wanja', 'George Stafford', 'Kristy Karns', 'Carissa Karns', 'Christie Karns'],
'Supervisor': ['George Foreman', 'Mary Christiemas', 'Omay Wanja', 'CEO PERSON', 'CEO PERSON', 'CEO PERSON', 'CEO PERSON', 'George of the jungle'],
'A series of ints to make this more complex': [1,0,1,4 , 1, 3, 3, 7]}
df = pd.DataFrame(data, index = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df
Employee Director Supervisor A series of ints to make this more complex
a George Brazil Omay Wanja George Foreman 1
b Tim Swak Omay Wanja Mary Christiemas 0
c Rajesh Mahta George Stafford Omay Wanja 1
d Kristy Karns Omay Wanja CEO PERSON 4
e Jamie Hyneman George Stafford CEO PERSON 1
f Pamela Voss Kristy Karns CEO PERSON 3
g Tyrone Johnson Carissa Karns CEO PERSON 3
h Anton Lafreu Christie Karns George of the jungle 7
I would like to then perform an operation such that it returns the dataframe but with only rows a, c, e, and h, because they are the only rows which contain the substring 'George'
Try this
filters = 'George'
df[df.apply(lambda row: row.astype(str).str.contains(filters).any(), axis=1)]
edited to return subset
You can chain an or condition (|) for each column. There's probably a more elegant way to get it to work, but this will do.
df[df['Employee'].str.contains("George") | df['Director'].str.contains("George") | df['Supervisor'].str.contains("George")]
From your code, it seems you only want the rows that have 'George' in columns ['Employee', 'Director', 'Supervisor']. If so, try this:
# Lambda solution for first `n` columns
mask = df.iloc[:, 0:3].apply(lambda x: x.str.contains('George')).sum(axis=1) > 0
df[mask]
# Lambda solution with named columns
mask = df[['Employee','Director','Supervisor']].apply(lambda x: x.str.contains('George')).sum(axis=1) > 0
df[mask]
# Trivial solution
df[(df['Employee'].str.contains('George')) | (df['Director'].str.contains('George')) | (df['Supervisor'].str.contains('George'))]
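For the bonus question about multiple substrings, one common approach is to join them into a regex alternation with '|'. A sketch on the question's string columns (the int column is left out here, since str.contains needs string data):

```python
import pandas as pd

data = {
    'Employee': ['George Brazil', 'Tim Swak', 'Rajesh Mahta', 'Kristy Karns',
                 'Jamie Hyneman', 'Pamela Voss', 'Tyrone Johnson', 'Anton Lafreu'],
    'Director': ['Omay Wanja', 'Omay Wanja', 'George Stafford', 'Omay Wanja',
                 'George Stafford', 'Kristy Karns', 'Carissa Karns', 'Christie Karns'],
    'Supervisor': ['George Foreman', 'Mary Christiemas', 'Omay Wanja', 'CEO PERSON',
                   'CEO PERSON', 'CEO PERSON', 'CEO PERSON', 'George of the jungle'],
}
df = pd.DataFrame(data, index=list('abcdefgh'))

# '|' means "or" in a regex, so this keeps rows containing ANY of the substrings
substrings = ['George', 'Rajesh']
pattern = '|'.join(substrings)
mask = df.apply(lambda col: col.str.contains(pattern)).any(axis=1)
print(df[mask].index.tolist())  # ['a', 'c', 'e', 'h']
```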
I'm trying to calculate the average of a subset of a subset of data.
For example, imagine my data is
**Family Name / Gender / Grade**
Smith / Male / 90
Smith / Male / 85
Smith / Female / 65
Smith / Female / 100
Johns / Male / 95
Johns / Male / 45
Johns / Female / 20
Johns / Female / 100
So what I am trying to do is calculate the average grades of the females in the Smith family. The answer would be (65+100)/2.
I know how to calculate the mean, but I do not know how to break it into subcategories twice.
My code is:
numpy.mean(students.grade)
I also tried a method where I did:
smith_family = students[students['Family Name'] == 'Smith']
np.mean(smith_family.grades)
But this method isn't scalable because I would have to manually type in every family name.
I made up the data; I'm actually doing it with animals and people's ratings of animals but its the same concept.
P.S. I'm using Python.
You'll use groupby here:
students[students['Family Name'] == 'Smith'].groupby('Gender').Grade.mean()
You can set a MultiIndex and group by its levels (the level= argument to mean was removed in recent pandas):
df.set_index(['FamilyName','Gender']).groupby(level=[0,1], sort=False).mean()
Out[271]:
Grade
FamilyName Gender
Smith Male 87.5
Female 82.5
Johns Male 70.0
Female 60.0
Instead of typing in every family name (which would be tedious as you mentioned) you can just groupby the name column. Additionally, you can do a second level groupby using gender to give combinations of both name and gender. Then calculate the mean for each subgroup:
import pandas as pd
df = pd.DataFrame({'Name': ['Smith', 'Smith', 'Smith', 'Smith', 'Johns', 'Johns', 'Johns', 'Johns'],
'Gender': ['Male', 'Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Female'],
'Grade': [90, 85, 65, 100, 95, 45, 20, 100]})
df.groupby(['Name', 'Gender']).mean()
Which will give you:
Grade
Name Gender
Johns Female 60.0
Male 70.0
Smith Female 82.5
Male 87.5
Use groupby!
students = pd.DataFrame({'Family Name': ['Smith', 'Smith', 'Smith', 'Smith', 'Johns', 'Johns', 'Johns', 'Johns'], 'Gender': ['Male', 'Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Female'], 'Grade': [90, 85, 65, 100, 95, 45, 20, 100]})
students.groupby(['Family Name', 'Gender']).mean()
Here's a link to the documentation for pandas.DataFrame.groupby. Good luck!
Use the groupby method in pandas. First convert the array to a DataFrame object:
df = pandas.DataFrame(values, index=index)
then group by the family name and calculate the mean or sum for each group.
df.groupby('Family Name').mean()
It appears to me that you have a jumble of strings, family name, gender, and grade, that aren't organized at all and, as a result, you are struggling to figure
out how to make sense of it all. Here is a time where object-oriented programming is excellent.
Rather than storing a bunch of variables:
family_name_1 = "smith"
gender_1 = "male"
grade_1 = 95
family_name_2 = "johns"
#...
You could create a class, called, say, Person, with three instance variables:
class Person:
    family_name: str
    gender: str
    grade: int
Now, your class needs a constructor, so you could make a Person and tell the program what that specific Person's family name, gender, and grade are. Inside your code for the class, you will need something like this:
    def __init__(self, family_name, gender, grade):
        self.family_name = family_name
        self.gender = gender
        self.grade = grade
Now, you are finished setting up your Person class. Next, you are going to want to populate by creating new people:
bob = Person("smith", "male", 95)
Not only is this easier to type than what was above, your code is now much more organized. The next thing you'll need is a list of people so you can average them together:
people = [Person("smith", "female", 97), Person("johns", "male", 60)] #...
For averaging all the people's grade's, I actually wouldn't use numpy, rather, something like this:
total = 0
number = 0
for person in people:
    if person.gender == "female" and person.family_name == "smith":
        total += person.grade
        number += 1
average = total / number
print(average)
If you feed all of your data into the list as I did above and use my for loop, you should get the average of all the grades of all the Smith females. I hope you understand, and, please, someone correct me if my syntax is wrong - it's been a while since I've used Python!
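Pulling the pieces of this answer together into one runnable sketch (the people and grades below are made up for illustration):

```python
class Person:
    def __init__(self, family_name, gender, grade):
        self.family_name = family_name
        self.gender = gender
        self.grade = grade

people = [
    Person("smith", "male", 90),
    Person("smith", "female", 65),
    Person("smith", "female", 100),
    Person("johns", "female", 20),
]

# Average the grades of the Smith females only
total = 0
number = 0
for person in people:
    if person.gender == "female" and person.family_name == "smith":
        total += person.grade
        number += 1
average = total / number
print(average)  # 82.5
```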
I wanted to parse a given file line by line. The file has a format of
'name age gender hobby1 hobby2...'.
The first thing that came to mind was to use a named tuple of the form namedtuple('info',['name','age', 'gender','hobby']).
How can I save the data in my file to a list of tuples with the corresponding value. I tried using line.split() but I couldn't figure out how I can save the space separated hobbies to info.hobby.
Input file
If I understand you correctly, you can use pandas and pass a space (' ') as the sep if the data is like this:
name age gender hobby1 hobby2
steve 12 male xyz abc
bob 29 male swimming golfing
alice 40 female reading cooking
tom 50 male sleeping
and here is syntax for method described above:
import pandas as pd
df = pd.read_csv('file.txt', sep=' ')
df.fillna(' ', inplace=True)
df['hobby'] = df[['hobby1', 'hobby2']].apply(lambda i: ' '.join(i), axis=1)
df.drop(['hobby1', 'hobby2'], axis=1, inplace=True)
print(df)
out:
name age gender hobby
0 steve 12 male xyz abc
1 bob 29 male swimming golfing
2 alice 40 female reading cooking
3 tom 50 male sleeping
EDIT: added your data from comment above
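The namedtuple the question mentions can also be filled directly with line.split() and star-unpacking, which gathers the variable number of space-separated hobbies into a list. A sketch using the sample lines above in place of the actual file:

```python
from collections import namedtuple

Info = namedtuple('Info', ['name', 'age', 'gender', 'hobby'])

# Stand-in for the file's lines (same sample data as above)
lines = [
    "steve 12 male xyz abc",
    "bob 29 male swimming golfing",
    "alice 40 female reading cooking",
    "tom 50 male sleeping",
]

records = []
for line in lines:
    # Star-unpacking gathers everything after gender into a list of hobbies
    name, age, gender, *hobbies = line.split()
    records.append(Info(name, int(age), gender, hobbies))

print(records[0].hobby)  # ['xyz', 'abc']
```

With a real file, the same loop would iterate over open('file.txt') instead of the hard-coded list.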