I am using the Titanic dataset to practice filtering data. I need to find the youngest passengers who didn't survive. This is what I have so far:
df_kids = df[(df["Survived"] == 0)][["Age","Name","Sex"]].sort_values("Age").head(10)
df_kids
Now I need to count how many of them are male and how many are female. I have tried a loop, but it gives me zero for both lists every time, and I don't know what I'm doing wrong:
list_m = list()
list_f = list()
for i in df_kids:
    if [["Sex"] == "male"]:
        list_m.append(i)
    else:
        list_f.append(i)
len(list_m)
len(list_f)
Could you help me, please?
Thanks a lot!
You can create a boolean mask. For example:
male_mask = df_kids['Sex'] == 'male'
And use it:
male = df_kids[male_mask]
female = df_kids[~male_mask] # Assuming Sex is either male or female
If you are only interested in the counts, you can now use the shape attribute:
print(male.shape[0])
print(female.shape[0])
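Alternatively, value_counts gives both counts in one call; a small sketch on the same df_kids frame:

# Count how many rows have each value in the Sex column
print(df_kids['Sex'].value_counts())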
I am working on a course with low coding requirements and am stuck on one step.
I have this code that creates a list of restaurants and the number of reviews each has:
# Filter the rated restaurants
df_rated = df[df['rating'] != 'Not given'].copy()
df_rated['rating'] = df_rated['rating'].astype('int')
df_rating_count = df_rated.groupby(['restaurant_name'])['rating'].count().sort_values(ascending = False).reset_index()
df_rating_count.head()
From there I am supposed to create a list limited to restaurants with more than 50 reviews, starting from this base:
# Get the restaurant names that have rating count more than 50
rest_names = df_rating_count['______________']['restaurant_name']
# Filter to get the data of restaurants that have rating count more than 50
df_mean_4 = df_rated[df_rated['restaurant_name'].isin(rest_names)].copy()
# Group the restaurant names with their ratings and find the mean rating of each restaurant
df_mean_4.groupby(['_______'])['_______'].mean().sort_values(ascending = False).reset_index().dropna() ## Complete the code to find the mean rating
Where I am stuck is on the first step.
rest_names = df_rating_count['______________']['restaurant_name']
I am pretty confident in the other 2 steps.
df_mean_4 = df_rated[df_rated['restaurant_name'].isin(rest_names)].copy()
df_mean_4.groupby(['restaurant_name'])['rating'].mean().sort_values(ascending = False).reset_index().dropna()
I have frankly tried so many different things I don't even know where to start.
Does anyone have any hints to at least point me in the right direction?
You can index and filter using []:
# Get the restaurant names that have rating count more than 50
rest_names = df_rating_count[df_rating_count['rating'] > 50]['restaurant_name']
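For instance, on a toy frame with the same columns (the numbers here are made up), the boolean mask keeps only the rows with more than 50 ratings:

import pandas as pd

# Hypothetical stand-in for df_rating_count
toy_counts = pd.DataFrame({'restaurant_name': ['A', 'B', 'C'],
                           'rating': [120, 35, 77]})
print(list(toy_counts[toy_counts['rating'] > 50]['restaurant_name']))  # ['A', 'C']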
# Function to determine the revenue
def compute_rev(x):
    if x > 20:
        return x * 0.25
    elif x > 5:
        return x * 0.15
    else:
        return x * 0
## Write the appropriate column name to compute the revenue
df['Revenue'] = df['________'].apply(compute_rev)
df.head()
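To illustrate how apply works here, a quick sketch using the compute_rev function above; 'order_value' is just a placeholder name, since the exercise leaves the real column blank:

import pandas as pd

# Hypothetical data; 'order_value' stands in for the blanked-out column
orders = pd.DataFrame({'order_value': [30, 10, 4]})
orders['Revenue'] = orders['order_value'].apply(compute_rev)
print(orders)
#    order_value  Revenue
# 0           30      7.5
# 1           10      1.5
# 2            4      0.0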
I'm once again asking for help on iterating over a list. This is the problem that eludes me this time:
I have a table that contains various combinations of countries with their respective trade flows.
Since trade goes both ways, my list has, for example, one value for ALB-ARM (how much Albania traded with Armenia that year) and then, further down the list, another value for ARM-ALB (the other way around).
I want to sum these two trade values for every pair of countries. I've been experimenting with some code, but I quickly realise that all my approaches are wrong.
How do I even set it up? I feel like it's too hard with a loop, and that it would be easy with some function that I don't even know exists.
Example data in Table format:
from astropy.table import Table
country1 = ["ALB","ALB","ARM","ARM","AZE","AZE"]
country2 = ["ARM","AZE","ALB","AZE","ALB","ARM"]
flow = [500,0,200,300,90,20]
t = Table([country1,country2,flow],names=["1","2","flow"],meta={"Header":"Table"})
and the expected output would be:
trade = [700,90,700,320,90,320]
result = Table([country1,country2,flow,trade],names=["1","2","flow","trade"],meta={"Header":"Table"})
Thank you all in advance!
Maybe this could help:
country1 = ["ALB","ALB","ARM","ARM","AZE","AZE"]
country2 = ["ARM","AZE","ALB","AZE","ALB","ARM"]
flow = [500,0,200,300,90,20]
trade = []
pairs = map(lambda t: '-'.join(t), zip(country1, country2))
flow_map = dict(zip(pairs, flow))
for left_country, right_country in zip(country1, country2):
    trade.append(flow_map['-'.join((left_country, right_country))] + flow_map['-'.join((right_country, left_country))])
print(trade)
outputs:
[700, 90, 700, 320, 90, 320]
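One thing to watch: if a reverse pair is ever missing from the data, the flow_map lookup will raise a KeyError. Using dict.get with a default of 0 guards against that; a sketch with the same variables:

# Treat a missing reverse pair as zero flow
trade = [flow_map['-'.join((a, b))] + flow_map.get('-'.join((b, a)), 0)
         for a, b in zip(country1, country2)]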
I am working with a Dataset that contains the information of every March Madness game since 1985. I want to know which teams have won it all and how many times each.
I masked the main dataset and created a new one containing only information about the championship game. Now I am trying to create a loop that compares the scores of the two teams that played in the championship game, detects the winner, and adds that team to a list. This is what the dataset looks like: https://imgur.com/tXhPYSm
tourney = pd.read_csv('ncaa.csv')
champions = tourney.loc[tourney['Region Name'] == "Championship", ['Year','Seed','Score','Team','Team.1','Score.1','Seed.1']]
list_champs = []
for i in champions:
    if champions['Score'] > champions['Score.1']:
        list_champs.append(i['Team'])
    else:
        list_champs.append(i['Team.1'])
Why do you need to loop through the DataFrame?
Basic filtering should work well. Something like this:
champs1 = champions.loc[champions['Score'] > champions['Score.1'], 'Team']
champs2 = champions.loc[champions['Score'] < champions['Score.1'], 'Team.1']
list_champs = list(champs1) + list(champs2)
A minimalist change (not the most efficient) to get your code working:
tourney = pd.read_csv('ncaa.csv')
champions = tourney.loc[tourney['Region Name'] == "Championship", ['Year','Seed','Score','Team','Team.1','Score.1','Seed.1']]
list_champs = []
for _, row in champions.iterrows():  # iterrows yields (index, row) pairs
    if row['Score'] > row['Score.1']:
        list_champs.append(row['Team'])
    else:
        list_champs.append(row['Team.1'])
Otherwise, you could simply do:
champions.apply(lambda row: row['Team'] if row['Score'] > row['Score.1'] else row['Team.1'], axis=1).values
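Since the end goal is which teams have won it all and how many times each, you could also vectorize the pick with numpy.where and count the titles with value_counts; a sketch assuming the same champions frame:

import numpy as np
import pandas as pd

# Pick the higher-scoring team in each championship game
winners = np.where(champions['Score'] > champions['Score.1'],
                   champions['Team'], champions['Team.1'])

# Number of titles per team
print(pd.Series(winners).value_counts())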
I have a dataframe from my "Big Year", and I would like to determine how many unique bird species each participant saw.
I've tried using a list comprehension and for loops to iterate over each row and determine if it's unique using .is_unique(), but that seems to be the source of much of my distress. I can get a list of all the unique species with .unique(), quite nicely, but I would like to somehow get the people associated with those birds.
df = pd.DataFrame({'Species':['woodpecker', 'woodpecker', 'dove', 'mockingbird'], 'Birder':['Steve', 'Ben','Ben','Greg']})
ben_unique_bird = [x for x in range(len(df['Species'])) if df['Birder'][x]=='Ben' and df['Species'][x].is_unique()]
Edit: I think I was unclear. I want to get a list of the birds that each person saw that no one else did. So the output would be something like (Steve, 0), (Ben, 1), (Greg, 1), in whatever format.
Thanks!
You can collect the (Birder, Species) pairs and drop duplicates with dict.fromkeys (a comprehension can't reference the list it is still building):
df = pd.DataFrame({'Species':['woodpecker', 'woodpecker', 'dove', 'mockingbird'], 'Birder':['Steve', 'Ben','Ben','Greg']})
matches = list(dict.fromkeys((row.Birder, row.Species) for row in df.itertuples(index=False)))
This gives a list of tuples as output:
[('Steve', 'woodpecker'), ('Ben', 'woodpecker'), ('Ben', 'dove'), ('Greg', 'mockingbird')]
Name of the unique birds they saw:
ben_unique_bird = df[df['Birder'] == 'Ben']['Species'].unique()
Number of unique birds they saw:
len(df[df['Birder'] == 'Ben']['Species'].unique())
Recommended method to get a table:
df.groupby(['Birder']).agg({"Species": lambda x: x.nunique()})
The same method, broken down:
for i in df['Birder'].unique():
    print("Name", i,
          "Distinct count", len(df[df['Birder'] == i]['Species'].unique()),
          "distinct bird names", df[df['Birder'] == i]['Species'].unique())
You can create a helper series via pd.DataFrame.duplicated, negated so that species seen by only one person are flagged True, and then use GroupBy + sum:
counts = df.assign(unique_flag=~df['Species'].duplicated(keep=False))\
           .groupby('Birder')['unique_flag'].sum().astype(int)
for name, count in counts.items():
    print(f'{name} saw {count} bird(s) that no one else saw')
Result:
Ben saw 1 bird(s) that no one else saw
Greg saw 1 bird(s) that no one else saw
Steve saw 0 bird(s) that no one else saw
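If you want the names of those birds rather than just the counts, the same duplicated flag can be reused; a small sketch on the df from the question:

# Keep only species that appear once in the whole frame
unique_sightings = df[~df['Species'].duplicated(keep=False)]
print(unique_sightings.groupby('Birder')['Species'].apply(list))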
I figured out a terrible way of doing what I want, but it works. Please let me know if you have a more efficient way of doing this, because I know there has to be one.
data = pd.DataFrame({'Species':['woodpecker', 'woodpecker', 'dove', 'mockingbird'], 'Birder':['Steve', 'Ben','Ben','Greg']})
ben_birds = []
steve_birds = []
greg_birds = []
# Get all the names of the birds that people saw and put them in a list
for index, row in data.iterrows():
    if row['Birder'] == 'Ben':
        ben_birds.append(row['Species'])
    elif row['Birder'] == 'Steve':
        steve_birds.append(row['Species'])
    else:
        greg_birds.append(row['Species'])
duplicates = []
# Compare the lists to look for duplicates, and make a new list with those
for bird in ben_birds:
    if (bird in steve_birds) or (bird in greg_birds):
        duplicates.append(bird)
for bird in steve_birds:
    if bird in greg_birds:
        duplicates.append(bird)
# If any of the duplicates are in a list, remove those birds
# (iterate over a copy, since removing from a list while looping over it skips elements)
for bird in list(ben_birds):
    if bird in duplicates:
        ben_birds.remove(bird)
for bird in list(steve_birds):
    if bird in duplicates:
        steve_birds.remove(bird)
for bird in list(greg_birds):
    if bird in duplicates:
        greg_birds.remove(bird)
print(f'Ben saw {len(ben_birds)} Birds that no one else saw')
print(f'Steve saw {len(steve_birds)} Birds that no one else saw')
print(f'Greg saw {len(greg_birds)} Birds that no one else saw')
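For a more compact take on the same idea, you could count how many people saw each species with collections.Counter and keep only the species seen by a single person; a sketch against the same data frame:

from collections import Counter

# Drop repeat sightings by the same person, then count sightings per species
deduped = data.drop_duplicates()
species_counts = Counter(deduped['Species'])

for birder, group in deduped.groupby('Birder')['Species']:
    unique = [s for s in group if species_counts[s] == 1]
    print(f'{birder} saw {len(unique)} Birds that no one else saw')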
I am trying to perform a Principal Component Analysis for work. While I have been successful in getting the principal components laid out, I don't really know how to assign the resulting component score to each line item. I am looking for output sort of like this:
Town PrinComponent 1 PrinComponent 2 PrinComponent 3
Columbia 0.31989 -0.44216 -0.44369
Middletown -0.37101 -0.24531 -0.47020
Harrisburg -0.00974 -0.06105 0.32792
Newport -0.38678 0.40935 -0.62996
The scikit-learn docs are not being helpful in this circumstance. Can anybody explain how I can reach this output?
The code i have so far is below.
def perform_PCA(df):
    threshold = 0.1
    pca = decomposition.PCA(n_components=3)
    numpyMatrix = df.to_numpy(dtype=float)
    scaled_data = preprocessing.scale(numpyMatrix)
    pca.fit(scaled_data)
    pca.transform(scaled_data)
    pca_components_df = pd.DataFrame(data=pca.components_, columns=df.columns.values)
    # print(pca_components_df)
    # pca_components_df.to_csv('pca_components_df.csv')
    filtered = pca_components_df[abs(pca_components_df) > threshold]
    trans_filtered = filtered.T
    # print(filtered.T)  # Transposed DataFrame
    trans_filtered.to_csv('trans_filtered.csv')
    print(pca.explained_variance_ratio_)
I pumped the transformed array into the data portion of the DataFrame function, and then defined the index and columns by passing them to columns= and index= respectively.
pd.DataFrame(data=transformed, columns=["PC1", "PC2"], index=df.index)
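Putting the pieces together, a minimal sketch; this assumes df holds the numeric features and is indexed by Town, and the function and column names are mine:

import pandas as pd
from sklearn import decomposition, preprocessing

def pca_scores(df, n_components=3):
    # Scale the features, then fit PCA and keep the transformed scores
    scaled = preprocessing.scale(df.to_numpy(dtype=float))
    pca = decomposition.PCA(n_components=n_components)
    scores = pca.fit_transform(scaled)  # one row of component scores per town
    cols = [f'PrinComponent {i + 1}' for i in range(n_components)]
    return pd.DataFrame(scores, columns=cols, index=df.index)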