I have two dataframes and wanted to check whether each cell in the player column of df1 exists in the last_name column of df2. I merged on the player column: if the player is present in df2 the merge shows the cell, and if not it shows NaN (which is what I wanted). Next I want to add a description column, but only for the non-NaN values. How can I add a description for all the values that aren't NaN?
df3 = df1.merge(df2, left_on='player', right_on='last_name', how='left')
df1
   player     team position
    Tatum  Celtics       SF
    Brown  Celtics       SG
    Smart  Celtics       PG
  Horford  Celtics        C
  Brogdon  Celtics       PG
Gallinari  Celtics        F
df2
last_name      team position
   Durant      Nets       SF
    James    Lakers       SF
    Smart   Celtics       PG
  Horford   Celtics        C
    Davis    Lakers        C
    Curry  Warriors       PG
I renamed the last_name column to matched_player for readability:
df3.rename(columns={'last_name': 'matched_player'}, inplace=True)
Output (df3):
   player     team position matched_player
    Tatum  Celtics       SF            nan
    Brown  Celtics       SG            nan
    Smart  Celtics       PG          Smart
  Horford  Celtics        C        Horford
  Brogdon  Celtics       PG            nan
Gallinari  Celtics        F            nan
Expected output:
   player     team position matched_player        description
    Tatum  Celtics       SF            nan
    Brown  Celtics       SG            nan
    Smart  Celtics       PG          Smart  a player from df1
  Horford  Celtics        C        Horford  a player from df1
  Brogdon  Celtics       PG            nan
Gallinari  Celtics        F            nan
You can try np.where:
import numpy as np

# label rows where the merge found a match; leave the rest empty
df3['description'] = np.where(df3['matched_player'].notna(), 'a player from df1', '')
# or, equivalently, with the condition inverted
df3['description'] = np.where(df3['matched_player'].isna(), '', 'a player from df1')
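If you prefer to stay within pandas, an equivalent sketch (assuming the same df3 as above) starts from an empty column and fills in only the matched rows with .loc:
# pandas-only equivalent: assign the description only where a match exists
df3['description'] = ''
df3.loc[df3['matched_player'].notna(), 'description'] = 'a player from df1'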
Related question:
Suppose I have two dataframes
df_1
city state salary
New York NY 85000
Chicago IL 65000
Miami FL 75000
Dallas TX 78000
Seattle WA 96000
df_2
city state taxes
New York NY 15000
Chicago IL 5000
Miami FL 6500
Next, I join the two dataframes
joined_df = df_1.merge(df_2, how='inner', on='city')
The result:
joined_df
       city state_x  salary state_y  taxes
0  New York      NY   85000      NY  15000
1   Chicago      IL   65000      IL   5000
2     Miami      FL   75000      FL   6500
Is there any way I can stack the two dataframes on top of each other, joining on the city, instead of extending the row horizontally, like below?
Requested:
joined_df
       city state  salary  taxes
   New York    NY   85000
   New York    NY          15000
    Chicago    IL   65000
    Chicago    IL           5000
      Miami    FL   75000
      Miami    FL           6500
How can I do this in pandas?
In this case we might need to use merge to restrict to the relevant rows before concat, if we need to consider both city and state.
import pandas as pd

# inner-merge to keep only rows present in both frames, then keep each frame's own columns
rel_df_1 = df_1.merge(df_2)[df_1.columns]
rel_df_2 = df_2.merge(df_1)[df_2.columns]

df = pd.concat([rel_df_1, rel_df_2]).sort_values(['city', 'state'])
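A small optional follow-up: the concatenated result keeps each source frame's original index, so a sketch like the one below (same frames as above) rebuilds a clean RangeIndex and makes the within-city ordering explicit:
# ignore_index discards the duplicated source indexes; a stable sort
# keeps each city's salary row ahead of its taxes row
df = (pd.concat([rel_df_1, rel_df_2], ignore_index=True)
        .sort_values(['city', 'state'], kind='stable'))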
You can use append (a shortcut for concat) to achieve that:
result = df_1.append(df_2, sort=False)
If your dataframes have overlapping indexes, you can use:
result = df_1.append(df_2, ignore_index=True, sort=False)
Also, you can look at the pandas documentation on merging and concatenating for more information.
UPDATE: After appending your dataframes, you can filter the result to keep only the rows whose city appears in both dataframes:
result = result.loc[result['city'].isin(df_1['city'])
                    & result['city'].isin(df_2['city'])]
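Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the equivalent is pd.concat:
# pd.concat replaces the removed DataFrame.append
result = pd.concat([df_1, df_2], ignore_index=True, sort=False)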
Try with stack():
stacked = df_1.merge(df_2, on=["city", "state"]).set_index(["city", "state"]).stack()
output = pd.concat([stacked.where(stacked.index.get_level_values(-1) == "salary"),
                    stacked.where(stacked.index.get_level_values(-1) == "taxes")],
                   axis=1,
                   keys=["salary", "taxes"]) \
           .droplevel(-1) \
           .reset_index()
>>> output
city state salary taxes
0 New York NY 85000.0 NaN
1 New York NY NaN 15000.0
2 Chicago IL 65000.0 NaN
3 Chicago IL NaN 5000.0
4 Miami FL 75000.0 NaN
5 Miami FL NaN 6500.0
Could someone help? I am only trying to remove apostrophes from string text in my dataframe, but I am not sure what I am missing.
I have tried regular expressions, replace, and renaming, but I can't seem to get rid of them.
country designation points price \
0 US Martha's Vineyard 96.0 235.0
1 Spain Carodorum Selección Especial Reserva 96.0 110.0
2 US Special Selected Late Harvest 96.0 90.0
3 US Reserve 96.0 65.0
4 France La Brûlade 95.0 66.0
province region_1 region_2 variety \
0 California Napa Valley Napa Cabernet Sauvignon
1 Northern Spain Toro NaN Tinta de Toro
2 California Knights Valley Sonoma Sauvignon Blanc
3 Oregon Willamette Valley Willamette Valley Pinot Noir
4 Provence Bandol NaN Provence red blend
winery last_year_points
0 Heitz 94
1 Bodega Carmen Rodríguez 92
2 Macauley
df.columns=df.columns.str.replace("''","")
df.Designation=df.Designation.str.replace("''","")
import re
re.sub("\'+",'',df.Designation)
df.rename(Destination={'Martha's Vineyard:'Mathas'}, inplace=True)
Error message: SyntaxError: invalid syntax
See the code snippet below; it solves the problem by combining an inline lambda function with the replace method of the string object.
import pandas as pd
df = pd.DataFrame({'Name': ["Tom's", "Jerry's", "Harry"]})
print(df, '\n')
      Name
0    Tom's
1  Jerry's
2    Harry
# Remove any apostrophes using a lambda and str.replace (note: this replaces df with a Series)
df = df['Name'].apply(lambda x: str(x).replace("'", ""))
print(df, '\n')
0      Toms
1    Jerrys
2     Harry
Name: Name, dtype: object
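A simpler vectorized sketch (assuming the same df as above) uses the pandas .str accessor directly and keeps df a DataFrame:
# vectorized: clean the whole column at once, no Python-level lambda
df['Name'] = df['Name'].str.replace("'", "", regex=False)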
I want to split one column from my dataframe into multiple columns, then attach those columns back to my original dataframe and divide my original dataframe based on whether the split columns include a specific string.
I have a dataframe that has a column with values separated by semicolons like below.
import pandas as pd
data = {'ID':['1','2','3','4','5','6','7'],
'Residence':['USA;CA;Los Angeles;Los Angeles', 'USA;MA;Suffolk;Boston', 'Canada;ON','USA;FL;Charlotte', 'NA', 'Canada;QC', 'USA;AZ'],
'Name':['Ann','Betty','Carl','David','Emily','Frank', 'George'],
'Gender':['F','F','M','M','F','M','M']}
df = pd.DataFrame(data)
Then I split the column as below, and separated the split column into two based on whether it contains the string USA or not.
address = df['Residence'].str.split(';',expand=True)
country = address[0] != 'USA'
USA, nonUSA = address[~country], address[country]
Now if you run USA and nonUSA, you'll note that there are extra columns in nonUSA, and also a row with no country information. So I got rid of those NA values.
USA.columns = ['Country', 'State', 'County', 'City']
nonUSA.columns = ['Country', 'State']
nonUSA = nonUSA.dropna(axis=0, subset=[1])
nonUSA = nonUSA[nonUSA.columns[0:2]]
Now I want to attach USA and nonUSA to my original dataframe, so that I will get two dataframes that look like below:
USAdata = pd.DataFrame({'ID':['1','2','4','7'],
'Name':['Ann','Betty','David','George'],
'Gender':['F','F','M','M'],
'Country':['USA','USA','USA','USA'],
'State':['CA','MA','FL','AZ'],
'County':['Los Angeles','Suffolk','Charlotte','None'],
'City':['Los Angeles','Boston','None','None']})
nonUSAdata = pd.DataFrame({'ID':['3','6'],
'Name':['Carl','Frank'],
'Gender':['M','M'],
'Country':['Canada', 'Canada'],
'State':['ON','QC']})
I'm stuck here though. How can I split my original dataframe based on whether Residence includes USA, and attach the split columns from Residence (USA and nonUSA) back to my original dataframe?
(Also, I just uploaded everything I had so far, but I'm curious if there's a cleaner/smarter way to do this.)
The original data has a unique index, and it is not changed in the following code for either DataFrame, so you can use concat to join the two pieces together and then add them to the original with DataFrame.join or concat with axis=1:
address = df['Residence'].str.split(';',expand=True)
country = address[0] != 'USA'
USA, nonUSA = address[~country], address[country]
USA.columns = ['Country', 'State', 'County', 'City']
nonUSA = nonUSA.dropna(axis=0, subset=[1])
nonUSA = nonUSA[nonUSA.columns[0:2]]
# changed order (compared with the question) to avoid an error
nonUSA.columns = ['Country', 'State']
df = pd.concat([df, pd.concat([USA, nonUSA])], axis=1)
Or:
df = df.join(pd.concat([USA, nonUSA]))
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NaN NaN
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 NaN NaN
3 Charlotte None
4 NaN NaN
5 NaN NaN
6 None None
But it seems it is possible to simplify:
c = ['Country', 'State', 'County', 'City']
df[c] = df['Residence'].str.split(';',expand=True)
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NA None
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 None None
3 Charlotte None
4 None None
5 None None
6 None None
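To finish the original request, here is a follow-up sketch (assuming the first, join-based df above, where the row with no country information has Country NaN) that splits the combined frame into the two requested dataframes:
# USA rows keep all four address parts
USAdata = df[df['Country'] == 'USA'].drop(columns='Residence')
# non-USA rows only have Country and State, so drop the always-empty columns
nonUSAdata = df[df['Country'].notna() & df['Country'].ne('USA')].drop(columns=['Residence', 'County', 'City'])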
I have a dataframe with more than 100 columns, and I am trying to replace numbers all across the dataframe, in every column that contains numbers (int, float, or any other numeric format).
I know how to take care of a column separately, but I am looking for some smart code that efficiently replaces a value with -5 if it is below 0 and with 888 if it is above 50.
Below is the code.
import numpy as np
import pandas as pd
df = pd.DataFrame({'Name': ['Avery Bradley', 'Jae Crowder', 'John Holland', 'R.J. Hunter'],
'Team': ['Boston Celtics',
'Boston Celtics',
'Boston Celtics',
'Boston Celtics'],
'Number1': [0.0, 999.0, -30.0, 28.0],
'Number2': [1000, 500, -10, 25],
'Position': ['PG', 'SF', 'SG', 'SG']})
#df["Number1"].values[df["Number1"] > 50] = 999
#df["Number1"].values[df["Number1"] < 0] = -5
df[ df > 50 ] = 888
df[ df < 0 ] = -5
You can use select_dtypes with np.select for multiple conditions here:
m = df.select_dtypes(np.number)
df[m.columns] = np.select([m > 50, m < 0], [888, -5], m)
print(df)
Name Team Number1 Number2 Position
0 Avery Bradley Boston Celtics 0.0 888.0 PG
1 Jae Crowder Boston Celtics 888.0 888.0 SF
2 John Holland Boston Celtics -5.0 -5.0 SG
3 R.J. Hunter Boston Celtics 28.0 25.0 SG
Use:
c = df.select_dtypes(np.number).columns
df[c] = df[c].mask(df[c] > 50, 888)
df[c] = df[c].mask(df[c] < 0, -5)
print (df)
Name Team Number1 Number2 Position
0 Avery Bradley Boston Celtics 0.0 888 PG
1 Jae Crowder Boston Celtics 888.0 888 SF
2 John Holland Boston Celtics -5.0 -5 SG
3 R.J. Hunter Boston Celtics 28.0 25 SG
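Another equivalent sketch uses DataFrame.where (the inverse of mask), keeping the values that satisfy each condition and replacing the rest:
# keep values <= 50 (else 888), then keep values >= 0 (else -5)
num = df.select_dtypes(np.number)
df[num.columns] = num.where(num <= 50, 888).where(lambda d: d >= 0, -5)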
I'm trying to print all the fields that have England in them. The current code I have prints all the nationalities into a txt file for me, but I want just the England fields to print. The page I'm pulling from is https://www.premierleague.com/players
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.premierleague.com/players")
c = r.content
soup = BeautifulSoup(c, "html.parser")
players = open("playerslist.txt", "w+")
for playerCountry in soup.findAll("span", {"class": "playerCountry"}):
    players.write(playerCountry.text.strip())
    players.write("\n")
Just check whether it's not equal to 'England' and, if so, skip to the next item in the list:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.premierleague.com/players")
c = r.content
soup = BeautifulSoup(c, "html.parser")
players = open("playerslist.txt", "w+")
for playerCountry in soup.findAll("span", {"class": "playerCountry"}):
    if playerCountry.text.strip() != 'England':
        continue
    players.write(playerCountry.text.strip())
    players.write("\n")
Or, you could just use pandas.read_html() and a couple of lines of code. The snippet below demonstrates the filter with != 'England' (flip it to == 'England' to keep only the England rows the question asks for):
import pandas as pd
df = pd.read_html("https://www.premierleague.com/players")[0]
print(df.loc[df['Nationality'] != 'England'])
Prints:
Player Position Nationality
2 Charlie Adam Midfielder Scotland
3 Adrián Goalkeeper Spain
4 Adrien Silva Midfielder Portugal
5 Ibrahim Afellay Midfielder Netherlands
6 Benik Afobe Forward The Democratic Republic Of Congo
7 Sergio Agüero Forward Argentina
9 Soufyan Ahannach Midfielder Netherlands
10 Ahmed Hegazi Defender Egypt
11 Nathan Aké Defender Netherlands
14 Toby Alderweireld Defender Belgium
15 Aleix García Midfielder Spain
17 Ali Gabr Defender Egypt
18 Allan Nyom Defender Cameroon
19 Allan Souza Midfielder Brazil
20 Joe Allen Midfielder Wales
22 Marcos Alonso Defender Spain
23 Paulo Alves Midfielder Portugal
24 Daniel Amartey Midfielder Ghana
25 Jordi Amat Defender Spain
27 Ethan Ampadu Defender Wales
28 Nordin Amrabat Forward Morocco
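As a hypothetical follow-up tying this back to the original txt file (assuming 'England' appears verbatim in the Nationality column), the filtered names can be written out directly:
# write only the England players to the txt file, one name per line
england = df.loc[df['Nationality'] == 'England', 'Player']
england.to_csv("playerslist.txt", index=False, header=False)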