I have a dataframe as follows:
student_id  gender  major      admitted
35377       female  Chemistry  False
56105       male    Physics    True
etc.
How do I find the admission rate for females?
I have tried:
df.loc[(df['gender'] == "female") & (df['admitted'] == "True")].sum()
But this returns an error:
TypeError: invalid type comparison
I guess the last column is Boolean. Can you try this:
df[df['gender'] == "female"]['admitted'].sum()
The .loc itself is fine; the error comes from comparing the Boolean admitted column to the string "True". Compare against the Boolean value instead: df[(df['gender'] == "female") & (df['admitted'] == True)].sum()
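Since the question asks for a rate rather than a count, a minimal sketch (assuming admitted really is a Boolean column, as suggested above) takes the mean of the Boolean values:

# The mean of a Boolean column is the fraction of True values,
# i.e. the admission rate among female students.
female_admission_rate = df.loc[df['gender'] == 'female', 'admitted'].mean()
print(female_admission_rate)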
I have two columns in my data frame for gender, derived from the first name and the middle name. I want to create a third column for overall gender: where either column says male or female, that value should override an unknown in the other column. I've written the following function but end up with the error below:
# Assign gender for names so that number of female names can be counted.
d = gender.Detector()
trnY['Gender_first'] = trnY['first'].map(lambda x: d.get_gender(x))
trnY['Gender_mid'] = trnY['middle'].map(lambda x: d.get_gender(x))
# merge the two gender columns:
def gender(x):
    if ['Gender_first'] == male or ['Gender_mid'] == male:
        return male
    elif ['Gender_first'] == female or ['Gender_mid'] == female:
        return female
    else:
        return unknown
trnY['Gender'] = trnY.apply(gender)
trnY
Error:
--> 50 trnY['Gender'] = trnY.apply(gender)
ValueError: Unable to coerce to Series, the length must be 21: given 1
If you want to use apply() on rows, you should pass it the parameter axis=1. In your case:
def gender(x):
    # d.get_gender() returns strings, so compare against string values.
    if x['Gender_first'] == 'male' or x['Gender_mid'] == 'male':
        return 'male'
    elif x['Gender_first'] == 'female' or x['Gender_mid'] == 'female':
        return 'female'
    else:
        return 'unknown'

trnY['Gender'] = trnY.apply(gender, axis=1)
This should solve your problem.
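As a side note, a vectorized alternative to the row-wise apply is sketched below with numpy.select (this assumes the detector returns the plain strings 'male', 'female', and 'unknown'):

import numpy as np

# Conditions are checked in order; the first match wins for each row.
conditions = [
    (trnY['Gender_first'] == 'male') | (trnY['Gender_mid'] == 'male'),
    (trnY['Gender_first'] == 'female') | (trnY['Gender_mid'] == 'female'),
]
trnY['Gender'] = np.select(conditions, ['male', 'female'], default='unknown')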
This doesn't work:
def rator(row):
    if row['country'] == 'Canada':
        row['stars'] = 3
    elif row['points'] >= 95:
        row['stars'] = 3
    elif row['points'] >= 85:
        row['stars'] = 2
    else:
        row['stars'] = 1
    return row
with_stars = reviews.apply(rator, axis='columns')
But this works:
def rator(row):
    if row['country'] == 'Canada':
        return 3
    elif row['points'] >= 95:
        return 3
    elif row['points'] >= 85:
        return 2
    else:
        return 1
with_stars = reviews.apply(rator, axis='columns')
I'm practicing on Kaggle, and reading through their tutorial as well as the documentation. I am a bit confused by the concept.
I understand that the apply() method acts on an entire row of a DataFrame, while map() acts on each element in a column, and that apply() is supposed to return a DataFrame while map() returns a Series.
Just not sure how the mechanics work here, since it's not letting me return rows inside the function...
some of the data:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco -1.447138 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe #kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos -1.447138 15.0 Douro NaN NaN Roger Voss #vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
Index(['country', 'description', 'designation', 'points', 'price', 'province',
'region_1', 'region_2', 'taster_name', 'taster_twitter_handle', 'title',
'variety', 'winery'],
dtype='object')
https://www.kaggle.com/residentmario/summary-functions-and-maps
When you use apply, the function is applied to each row (or column, depending on the axis parameter). When your function returns a scalar, as in your second snippet, the return value of apply is not a DataFrame but a Series built from those return values. That means your second piece of code returns the star rating of each row, which is used to build a new Series, so a better name for the result would be star_ratings instead of with_stars.
If you want to append this Series to your original dataframe you can use:
star_ratings = reviews.apply(rator, axis='columns')
reviews['stars'] = star_ratings
or, more succinctly:
reviews['stars'] = reviews.apply(rator, axis='columns')
As for why your first piece of code does not work: you are trying to add a new column, and you are not supposed to mutate the passed object. The official docs state:
Functions that mutate the passed object
can produce unexpected behavior or
errors and are not supported
To better understand the differences between map and apply please see the different responses to this question, as they present many different and correct viewpoints.
You shouldn't use apply with a function that modifies the input. You could change your code to this:
def rator(row):
    new_row = row.copy()
    if row['country'] == 'Canada':
        new_row['stars'] = 3
    elif row['points'] >= 95:
        new_row['stars'] = 3
    elif row['points'] >= 85:
        new_row['stars'] = 2
    else:
        new_row['stars'] = 1
    return new_row
with_stars = reviews.apply(rator, axis='columns')
However, it's simpler to return just the value you care about rather than an entire row just to change one column. If you write rator to return a single value but still want a full dataframe, you can do with_stars = reviews.copy() and then with_stars['stars'] = reviews.apply(rator, axis='columns'). Also, if an if branch ends with a return, a plain if suffices after it instead of elif. You can also simplify this kind of binning with cut.
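For illustration, here is a vectorized sketch using numpy.select instead of apply (an alternative to the cut suggestion above, assuming the reviews frame and column names from the question):

import numpy as np

# Conditions are evaluated in order, so a Canadian wine gets 3 stars
# even if its points are below 95.
conditions = [
    reviews['country'] == 'Canada',
    reviews['points'] >= 95,
    reviews['points'] >= 85,
]
reviews['stars'] = np.select(conditions, [3, 3, 2], default=1)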
I am using the Titanic dataset to practice some filtering on the data. I need to find the youngest passengers who didn't survive. I have got this result so far:
df_kids = df[(df["Survived"] == 0)][["Age","Name","Sex"]].sort_values("Age").head(10)
df_kids
Now I need to say how many of them are male and how many are female. I have tried a loop, but it gives me zero for both lists every time. I don't know what I am doing wrong:
list_m = list()
list_f = list()
for i in df_kids:
    if [["Sex"] == "male"]:
        list_m.append(i)
    else:
        list_f.append(i)
len(list_m)
len(list_f)
Could you help me, please?
Thanks a lot!
You can create a boolean mask. For example:
male_mask = df_kids['Sex'] == 'male'
And use it:
male = df_kids[male_mask]
female = df_kids[~male_mask] # Assuming Sex is either male or female
You can now use the shape attribute if you are only interested in the counts.
print(male.shape[0])
print(female.shape[0])
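If you only need the counts, a shorter sketch (assuming the same df_kids frame) is to use value_counts on the Sex column:

# Counts of each distinct value in the Sex column (male, female).
print(df_kids['Sex'].value_counts())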
I want to return "Action" in the Genre column for all the movies that contain or start with "Sp", using pandas.
For Example:
Movies Rating Genre
Spider 4.8 Action
Spies 2.5 Action
Special 5.0 Comedy
I've tried the str.contains method but still no luck.
The code below will return rows where the Genre is 'Action' and the movie name contains 'Sp' as a substring.
df.loc[(df['Movies'].str.contains('Sp')) & (df['Genre'] == 'Action')]
The code below will return rows where the Genre is 'Action' and the movie name starts with 'Sp'.
df.loc[( df['Movies'].str.startswith('Sp')) & (df['Genre'] == 'Action')]
Use loc to filter your frame using boolean indexing and assign a value to a column. Change the string in startswith and contains to whatever value you want.
df.loc[df['Movies'].str.contains('es') | df['Movies'].str.startswith('Sp'), 'Genre'] = 'Action'
I don't know if it is the most efficient way, but you can do the following:
def filter(x, y):
    # Return "Action" for movies starting with "Sp"; otherwise keep the genre.
    if x[:2] == "Sp":
        return "Action"
    else:
        return y

df["Genre"] = df[["Movies", "Genre"]].apply(lambda z: filter(z.Movies, z.Genre), axis=1)
Assuming that your DataFrame (df) is:
Movies Rating Genre
0 Spider 4.8 Action
1 Spies 2.5 Action
2 Special 5.0 Comedy
You can filter the DataFrame with:
import pandas as pd
df[(df.Genre == 'Action') & (df.Movies.str.startswith('Sp'))]
The output will be:
Movies Rating Genre
0 Spider 4.8 Action
1 Spies 2.5 Action
I have a dataframe and I need to filter it according to the following conditions
CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'ACTION' & count_GENRE >= 1
CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'ROMANCE' & count_GENRE >= 1
CITY == 'Mumbai' & LANGUAGE == 'Hindi' & count_LANGUAGE >= 1 & GENRE == 'ACTION'
When I try to do that with
df1 = df.query(condition1)
df2 = df.query(condition2)
I am getting a memory error (since my dataframe is huge).
So I planned to filter on the main condition first and then on the sub-conditions, so that the load is smaller and the performance is better.
By parsing the above conditions, I somehow managed to get:
main_filter = "CITY == 'Mumbai'"
sub_cond1 = "LANGUAGE == 'English'"
sub_cond1_cond1 = "GENRE == 'ACTION' & count_GENRE >= 1"
sub_cond1_cond2 = "GENRE == 'ROMANCE' & count_GENRE >= 1"
sub_cond2 = "LANGUAGE == 'Hindi' & count_LANGUAGE >= 1"
sub_cond2_cond1 = "GENRE == 'COMEDY'"
So think of it as a tree structure (not binary, of course, and actually it is not a tree at all).
Now I want to follow a multiprocessing approach (nested: subprocess under subprocess).
Now I want something like
on level 1
df = df_main.query(main_filter)
on level 2
df1 = df.query(sub_cond1)
df2 = df.query(sub_cond2)
on level 3
df11 = df1.query(sub_cond1_cond1)
df12 = df1.query(sub_cond1_cond2)
df21 = df2.query(sub_cond2_cond1) ######like this
So the problem is how to pass the conditions properly to each level (perhaps by storing all the conditions in a list; I have not really thought that through yet).
NB: the result of each filtration should be exported to a separate CSV.
Ex:
df11.to_csv("CITY == 'Mumbai' & LANGUAGE == 'English' & GENRE == 'ACTION' & count_GENRE >= 1")
As a beginner I don't know how to use multiprocessing (its syntax, way of execution, etc., particularly for this), but I got the task anyway, hence I am not able to post any code.
So can anybody give a code example to achieve this?
If you have any better idea (class objects or node traversal), please suggest it.
This looks like a problem suitable for dask, the python module that helps you deal with larger-than-memory data.
I will show how to solve this problem using the dask.dataframe. Let's start by creating some data:
import pandas as pd
from collections import namedtuple
Record = namedtuple('Record', "CITY LANGUAGE GENRE count_GENRE count_LANGUAGE")
cities = ['Mumbai', 'Chennai', 'Bengalaru', 'Kolkata']
languages = ['English', 'Hindi', 'Spanish', 'French']
genres = ['Action', 'Romance', 'Comedy', 'Drama']
import random
df = pd.DataFrame([Record(random.choice(cities),
random.choice(languages),
random.choice(genres),
random.choice([1,2,3]),
random.choice([1,2,3])) for i in range(4000000)])
df.to_csv('temp.csv', index=False)
print(df.head())
CITY LANGUAGE GENRE count_GENRE count_LANGUAGE
0 Chennai Spanish Action 2 1
1 Bengalaru English Drama 2 3
2 Kolkata Spanish Action 2 1
3 Mumbai French Romance 1 2
4 Chennai French Action 2 3
The data created above has 4 million rows, and occupies 107 MB. It is not larger-than-memory, but good enough to use in this example.
Below I show the transcript of a python session where I filtered the data according to the criteria in the question:
>>> import dask.dataframe as dd
>>> dask_df = dd.read_csv('temp.csv', header=0)
>>> dask_df.npartitions
4
# We see above that dask.dataframe has decided to split the
# data into 4 partitions
# We now execute the query:
>>> result = dask_df[(dask_df['CITY'] == 'Mumbai') &
... (dask_df['LANGUAGE'] == 'English') &
... (dask_df['GENRE'] == 'Action') &
... (dask_df['count_GENRE'] > 1)]
>>>
# The line above takes very little time to execute. In fact, nothing has
# really been computed yet. Behind the scenes dask has created a plan to
# execute the query, but has not yet pulled the trigger.
# The result object is a dask dataframe:
>>> type(result)
<class 'dask.dataframe.core.DataFrame'>
>>> result
dd.DataFrame<series-slice-read-csv-temp.csv-fc62a8c019c213f4cd106801b9e45b29[elemwise-cea80b0dd8dd29ae325a9db1896b027c], divisions=(None, None, None, None, None)>
# We now pull the trigger by calling the compute() method on the dask
# dataframe. The execution of the line below takes a few seconds:
>>> dfout = result.compute()
# The result is a regular pandas dataframe:
>>> type(dfout)
<class 'pandas.core.frame.DataFrame'>
# Of our 4 million records, only ~40k match the query:
>>> len(dfout)
41842
>>> dfout.head()
CITY LANGUAGE GENRE count_GENRE count_LANGUAGE
225 Mumbai English Action 2 3
237 Mumbai English Action 3 2
306 Mumbai English Action 3 3
335 Mumbai English Action 2 2
482 Mumbai English Action 2 3
I hope this gets you started on the solution to your problem. For more info on dask see the tutorial and examples.
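To also cover the requirement of exporting each filter result to its own CSV, one possible sketch is to keep the full condition strings in a list and loop over them (the file names here are illustrative, and the genre values are assumed to be capitalized as in the generated data above):

conditions = [
    "CITY == 'Mumbai' and LANGUAGE == 'English' and GENRE == 'Action' and count_GENRE >= 1",
    "CITY == 'Mumbai' and LANGUAGE == 'English' and GENRE == 'Romance' and count_GENRE >= 1",
    "CITY == 'Mumbai' and LANGUAGE == 'Hindi' and count_LANGUAGE >= 1 and GENRE == 'Action'",
]
for i, cond in enumerate(conditions):
    # query() works on dask dataframes as well; compute() materializes
    # the result as a pandas dataframe before writing it out.
    dask_df.query(cond).compute().to_csv('filter_{}.csv'.format(i), index=False)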