Remove any apostrophes from string - Python Pandas

Could someone help? I am only trying to remove any apostrophes from the string text in my data frame, and I'm not sure what I'm missing.
I have tried regular expressions, replace, and renaming, but I can't seem to get rid of them.
country designation points price \
0 US Martha's Vineyard 96.0 235.0
1 Spain Carodorum Selección Especial Reserva 96.0 110.0
2 US Special Selected Late Harvest 96.0 90.0
3 US Reserve 96.0 65.0
4 France La Brûlade 95.0 66.0
province region_1 region_2 variety \
0 California Napa Valley Napa Cabernet Sauvignon
1 Northern Spain Toro NaN Tinta de Toro
2 California Knights Valley Sonoma Sauvignon Blanc
3 Oregon Willamette Valley Willamette Valley Pinot Noir
4 Provence Bandol NaN Provence red blend
winery last_year_points
0 Heitz 94
1 Bodega Carmen Rodríguez 92
2 Macauley
df.columns=df.columns.str.replace("''","")
df.Designation=df.Designation.str.replace("''","")
import re
re.sub("\'+",'',df.Designation)
df.rename(Destination={'Martha's Vineyard:'Mathas'}, inplace=True)
Error message: SyntaxError: invalid syntax

See the code snippet below, which solves the problem using a lambda together with the string replace method.
import pandas as pd

df = pd.DataFrame({'Name': ["Tom's", "Jerry's", "Harry"]})
print(df, '\n')

      Name
0    Tom's
1  Jerry's
2    Harry

# Remove any apostrophes using a lambda and str.replace
df['Name'] = df['Name'].apply(lambda x: str(x).replace("'", ""))
print(df, '\n')

     Name
0    Toms
1  Jerrys
2   Harry
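Applied to the frame in the question, a minimal sketch would look like this (assuming the column is really named designation, as in the printed head, and that you want to change a cell value rather than a column label):
# Strip every apostrophe from the designation column; regex=False treats the
# quote as a literal character rather than as a regular expression.
df['designation'] = df['designation'].str.replace("'", "", regex=False)

# To change one specific value, use replace() on the column; rename() targets
# labels, not cell values. Mixing quote styles (double quotes around a string
# that contains an apostrophe) avoids the SyntaxError from the original attempt.
df['designation'] = df['designation'].replace({"Martha's Vineyard": "Marthas"})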

Related

Draw a Map of cities in python

I have a ranking of cities across the world in a variable called rank_2000 that looks like this:
Seoul
Tokyo
Paris
New_York_Greater
Shizuoka
Chicago
Minneapolis
Boston
Austin
Munich
Salt_Lake
Greater_Sydney
Houston
Dallas
London
San_Francisco_Greater
Berlin
Seattle
Toronto
Stockholm
Atlanta
Indianapolis
Fukuoka
San_Diego
Phoenix
Frankfurt_am_Main
Stuttgart
Grenoble
Albany
Singapore
Washington_Greater
Helsinki
Nuremberg
Detroit_Greater
TelAviv
Zurich
Hamburg
Pittsburgh
Philadelphia_Greater
Taipei
Los_Angeles_Greater
Miami_Greater
MannheimLudwigshafen
Brussels
Milan
Montreal
Dublin
Sacramento
Ottawa
Vancouver
Malmo
Karlsruhe
Columbus
Dusseldorf
Shenzen
Copenhagen
Milwaukee
Marseille
Greater_Melbourne
Toulouse
Beijing
Dresden
Manchester
Lyon
Vienna
Shanghai
Guangzhou
San_Antonio
Utrecht
New_Delhi
Basel
Oslo
Rome
Barcelona
Madrid
Geneva
Hong_Kong
Valencia
Edinburgh
Amsterdam
Taichung
The_Hague
Bucharest
Muenster
Greater_Adelaide
Chengdu
Greater_Brisbane
Budapest
Manila
Bologna
Quebec
Dubai
Monterrey
Wellington
Shenyang
Tunis
Johannesburg
Auckland
Hangzhou
Athens
Wuhan
Bangalore
Chennai
Istanbul
Cape_Town
Lima
Xian
Bangkok
Penang
Luxembourg
Buenos_Aires
Warsaw
Greater_Perth
Kuala_Lumpur
Santiago
Lisbon
Dalian
Zhengzhou
Prague
Changsha
Chongqing
Ankara
Fuzhou
Jinan
Xiamen
Sao_Paulo
Kunming
Jakarta
Cairo
Curitiba
Riyadh
Rio_de_Janeiro
Mexico_City
Hefei
Almaty
Beirut
Belgrade
Belo_Horizonte
Bogota_DC
Bratislava
Dhaka
Durban
Hanoi
Ho_Chi_Minh_City
Kampala
Karachi
Kuwait_City
Manama
Montevideo
Panama_City
Quito
San_Juan
What I would like to do is a map of the world where those cities are colored according to their position in the ranking above. I am open to other approaches to the representation (such as bubbles whose size increases with the city's position in the rank, or, if necessary, plotting only a sample of cities taken from the top, the middle and the bottom of the ranking).
Thank you,
Federico
Your question has two parts: finding the location of each city, and then drawing the cities on the map. Assuming you already have the latitude and longitude of each city, here's how you could tackle the latter part.
I like Folium (https://pypi.org/project/folium/) for drawing maps. Here's an example that draws a circle for each city, with its position in the list used to determine the size of that circle.
import folium

cities = [
    {'name': 'Seoul', 'coords': [37.5639715, 126.9040468]},
    {'name': 'Tokyo', 'coords': [35.5090627, 139.2094007]},
    {'name': 'Paris', 'coords': [48.8588787, 2.2035149]},
    {'name': 'New York', 'coords': [40.6976637, -74.1197631]},
    # etc. etc.
]

m = folium.Map(zoom_start=15)

for counter, city in enumerate(cities):
    circle_size = 5 + counter
    folium.CircleMarker(
        location=city['coords'],
        radius=circle_size,
        popup=city['name'],
        color="crimson",
        fill=True,
        fill_color="crimson",
    ).add_to(m)

m.save('map.html')
Output: (screenshot of the rendered map omitted)
You may need to adjust the circle_size calculation a little to work with the number of cities you want to include.
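For the first part (getting coordinates from the city names), one option is a geocoding library such as geopy; this is only a sketch, assuming the free Nominatim service is acceptable and that the names in rank_2000 resolve cleanly (some, like San_Francisco_Greater, may need tidying first):
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="rank_2000_map")  # Nominatim requires a user_agent

def lookup(city_name):
    # Underscores in the ranking ("New_York_Greater") confuse the geocoder,
    # so replace them with spaces before querying.
    location = geolocator.geocode(city_name.replace("_", " "))
    return [location.latitude, location.longitude] if location else None

print(lookup("Seoul"))  # e.g. [37.56..., 126.97...]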

How to only extract the full words of a string in Python?

I want to extract only the full words of a string.
I have this df:
Students Age
0 Boston Terry Emma 23
1 Tommy Julien Cambridge 20
2 London 21
3 New York Liu 30
4 Anna-Madrid+ Pauline 26
5 Mozart Cambridge 27
6 Gigi Tokyo Lily 18
7 Paris Diane Marie Dive 22
And I want to extract the FULL words from the string, NOT parts of them (for example, 'iu' should only match when it appears as a word on its own, not the 'iu' inside 'Liu', because 'Liu' is not 'iu').
cities = ['Boston', 'Cambridge', 'Bruxelles', 'New York', 'London', 'Amsterdam', 'Madrid', 'Tokyo', 'Paris']
liked_names = ['Emma', 'Pauline', 'Tommy Julien', 'iu']
Desired df:
Students Age Cities Liked Names
0 Boston Terry Emma 23 Boston Emma
1 Tommy Julien Cambridge 20 Cambridge Tommy Julien
2 London 21 London NaN
3 New York Liu 30 New York NaN
4 Anna-Madrid+ Pauline 26 Madrid Pauline
5 Mozart Cambridge 27 Cambridge NaN
6 Gigi Tokyo Lily 18 Tokyo NaN
7 Paris Diane Marie Dive 22 Paris NaN
I tried this code:
pat = f'({"|".join(cities)})'
df['Cities'] = df['Students'].str.extract(pat, expand=False)
pat = f'({"|".join(liked_names)})'
df['Liked Names'] = df['Students'].str.extract(pat, expand=False)
My code for the cities works; I just need to fix the issue for 'Liked Names'.
How can I make this work? Thanks a lot!!!
I think what you are looking for are word boundaries. In a regular expression they can be expressed with \b. An ugly (albeit working) solution is to modify the liked_names list to include word boundaries and then run your code:
import pandas as pd

l = [
    ["Boston Terry Emma", 23],
    ["Tommy Julien Cambridge", 20],
    ["London", 21],
    ["New York Liu", 30],
    ["Anna-Madrid+ Pauline", 26],
    ["Mozart Cambridge", 27],
    ["Gigi Tokyo Lily", 18],
    ["Paris Diane Marie Dive", 22],
]
cities = [
    "Boston",
    "Cambridge",
    "Bruxelles",
    "New York",
    "London",
    "Amsterdam",
    "Madrid",
    "Tokyo",
    "Paris",
]
liked_names = ["Emma", "Pauline", "Tommy Julien", "iu"]

# Here we modify liked_names to include word boundaries.
liked_names = [r"\b" + n + r"\b" for n in liked_names]

df = pd.DataFrame(l, columns=["Students", "Age"])

pat = f'({"|".join(cities)})'
df["Cities"] = df["Students"].str.extract(pat, expand=False)

pat = f'({"|".join(liked_names)})'
df["Liked Names"] = df["Students"].str.extract(pat, expand=False)

print(df)
A nicer solution would be to include the word boundaries when building the regular expression itself.
I first tried using \s, i.e. whitespace, but that did not work at the end of the string, so \b was the solution. You can check https://regular-expressions.mobi/wordboundaries.html?wlr=1 for some details.
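For example, here is a sketch of that nicer variant (assuming liked_names still holds the plain names, i.e. before the \b mutation above):
# Build the boundaries into the pattern instead of mutating the list.
pat = "(" + "|".join(r"\b{}\b".format(n) for n in liked_names) + ")"
df["Liked Names"] = df["Students"].str.extract(pat, expand=False)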
You can try this regex:
liked_names = ["Emma", "Pauline", "Tommy Julien", "iu"]
pat = (
    "(" + "|".join(r"[a-zA-Z]*{}[a-zA-Z]*".format(n) for n in liked_names) + ")"
)
df["Liked Names"] = df["Students"].str.extract(pat)
print(df)
Prints:
Students Age Liked Names
0 Boston Terry Emma 23 Emma
1 Tommy Julien Cambridge 20 Tommy Julien
2 London 21 NaN
3 New York Liu 30 Liu
4 Anna-Madrid+ Pauline 26 Pauline
5 Mozart Cambridge 27 NaN
6 Gigi Tokyo Lily 18 NaN
7 Paris Diane Marie Dive 22 NaN
You can do an additional check to see whether the matched name really occurs in the Students column.
import numpy as np

def check(row):
    if row['Liked Names'] == row['Liked Names']:
        # `Liked Names` is not NaN (NaN != NaN), so get all candidate words
        patterns = row['Students'].split(' ')
        # Keep the match only if every word of `Liked Names` appears in `Students`
        isAllMatched = all([name in patterns for name in row['Liked Names'].split(' ')])
        if not isAllMatched:
            return np.nan
        else:
            return row['Liked Names']
    else:
        # `Liked Names` is NaN, so still return NaN
        return np.nan

df['Liked Names'] = df.apply(check, axis=1)
# print(df)
Students Age Cities Liked Names
0 Boston Terry Emma 23 Boston Emma
1 Tommy Julien Cambridge 20 Cambridge Tommy Julien
2 London 21 London NaN
3 New York Liu 30 New York NaN
4 Anna-Madrid+ Pauline 26 Madrid Pauline
5 Mozart Cambridge 27 Cambridge NaN
6 Gigi Tokyo Lily 18 Tokyo NaN
7 Paris Diane Marie Dive 22 Paris NaN

How do you remove sections from a csv file using pandas?

I am following along with this project guide and I reached a segment where I'm not exactly sure how the code works. Can someone explain the following block of code please:
to_drop = ['Edition Statement',
           'Corporate Author',
           'Corporate Contributors',
           'Former owner',
           'Engraver',
           'Contributors',
           'Issuance type',
           'Shelfmarks']

df.drop(to_drop, inplace=True, axis=1)
This is the format of the csv file before the previous code is executed:
Identifier Edition Statement Place of Publication \
0 206 NaN London
1 216 NaN London; Virtue & Yorston
2 218 NaN London
3 472 NaN London
4 480 A new edition, revised, etc. London
Date of Publication Publisher \
0 1879 [1878] S. Tinsley & Co.
1 1868 Virtue & Co.
2 1869 Bradbury, Evans & Co.
3 1851 James Darling
4 1857 Wertheim & Macintosh
Title Author \
0 Walter Forbes. [A novel.] By A. A A. A.
1 All for Greed. [A novel. The dedication signed... A., A. A.
2 Love the Avenger. By the author of “All for Gr... A., A. A.
3 Welsh Sketches, chiefly ecclesiastical, to the... A., E. S.
4 [The World in which I live, and my place in it... A., E. S.
Contributors Corporate Author \
0 FORBES, Walter. NaN
1 BLAZE DE BURY, Marie Pauline Rose - Baroness NaN
2 BLAZE DE BURY, Marie Pauline Rose - Baroness NaN
3 Appleyard, Ernest Silvanus. NaN
4 BROOME, John Henry. NaN
Corporate Contributors Former owner Engraver Issuance type \
0 NaN NaN NaN monographic
1 NaN NaN NaN monographic
2 NaN NaN NaN monographic
3 NaN NaN NaN monographic
4 NaN NaN NaN monographic
Flickr URL \
0 http://www.flickr.com/photos/britishlibrary/ta...
1 http://www.flickr.com/photos/britishlibrary/ta...
2 http://www.flickr.com/photos/britishlibrary/ta...
3 http://www.flickr.com/photos/britishlibrary/ta...
4 http://www.flickr.com/photos/britishlibrary/ta...
Shelfmarks
0 British Library HMNTS 12641.b.30.
1 British Library HMNTS 12626.cc.2.
2 British Library HMNTS 12625.dd.1.
3 British Library HMNTS 10369.bbb.15.
4 British Library HMNTS 9007.d.28.
Which part of the code tells pandas to remove the columns and not the rows? What do inplace=True and axis=1 mean?
This is quite basic pandas; I'd suggest working through a free tutorial. Anyway, this code block removes the columns whose names are stored in to_drop.
For a data frame named df, we remove columns using this command:
df.drop([...], axis=1, inplace=True)
where the list contains the columns we want to drop, axis=1 means the labels are dropped column-wise rather than row-wise, and inplace=True makes the change permanent, i.e. it is applied to the original dataframe instead of returning a new one.
You can also write the above command as
df.drop(['Edition Statement',
         'Corporate Author',
         'Corporate Contributors',
         'Former owner',
         'Engraver',
         'Contributors',
         'Issuance type',
         'Shelfmarks'], inplace=True, axis=1)
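Equivalently, newer pandas versions let you name the axis explicitly with the columns= keyword, which some people find clearer; this is a sketch, assuming pandas 0.21 or later:
# Same effect, but returns a new frame instead of modifying df in place.
df = df.drop(columns=to_drop)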
Here is a quite basic guide to pandas for your future queries: Introduction to pandas

Filter and drop rows by proportion python

I have a dataframe called wine that contains a bunch of rows I need to drop.
How do I drop all rows whose value in the 'country' column makes up less than 1% of the whole?
Here are the proportions:
#proportion of wine countries in the data set
wine.country.value_counts() / len(wine.country)
US 0.382384
France 0.153514
Italy 0.100118
Spain 0.070780
Portugal 0.062186
Chile 0.056742
Argentina 0.042835
Austria 0.034767
Germany 0.028928
Australia 0.021434
South Africa 0.010233
New Zealand 0.009069
Israel 0.006133
Greece 0.004493
Canada 0.002526
Hungary 0.001755
Romania 0.001558
...
I got lazy and didn't include all of the results, but I think you catch my drift. I need to drop all rows whose country has a proportion of less than 0.01.
Here is the head of my dataframe:
country designation points price province taster_name variety year price_category
Portugal Avidagos 87 15.0 Douro Roger Voss Portuguese Red 2011.0 low
You can use something like this:
df = df[df.proportion >= .01]
From that dataset it should give you something like this:
US 0.382384
France 0.153514
Italy 0.100118
Spain 0.070780
Portugal 0.062186
Chile 0.056742
Argentina 0.042835
Austria 0.034767
Germany 0.028928
Australia 0.021434
South Africa 0.010233
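Note that the wine frame shown above doesn't actually have a proportion column, so you would first have to attach each country's share to its rows. A minimal sketch of that idea, assuming the frame is called wine:
# Map each row's country to that country's share of the whole dataset ...
proportion = wine['country'].map(wine['country'].value_counts(normalize=True))
# ... then keep only rows whose country accounts for at least 1% of all rows.
wine = wine[proportion >= 0.01]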
Figured it out:
country_filter = wine.country.value_counts(normalize=True) > 0.01
country_index = country_filter[country_filter.values == True].index
wine = wine[wine.country.isin(list(country_index))]

Printing three rows of output given list of strings using format()

Hi guys I got another CSC110 question. I'm just trying to learn the optimal way to do things. I'm sure this will be pretty easy.
Basically I need to output the names of some countries in a standard output that looks like this:
Afghanistan Albania Armenia
Bangladesh Benin Bhutan
Bolivia Burkina Faso Burundi
Cabo Verde Cambodia Cameroon
Central African Republic Chad Comoros
Congo Cote D'Ivoire D.P.R. Of Korea
D.R. Of The Congo Djibouti Egypt
El Salvador Eritrea Ethiopia
Gambia Georgia Ghana
Guatemala Guinea Guinea-Bissau
Guyana Haiti Honduras
India Indonesia Kenya
Kiribati Kosovo Kyrgyzstan
Lao People'S Dr Lesotho Liberia
Madagascar Malawi Mali
Marshall Islands Mauritania Micronesia (Fs Of)
Mongolia Morocco Mozambique
Myanmar Nepal Nicaragua
Niger Nigeria Pakistan
Papua New Guinea Paraguay Philippines
Republic Of Moldova Rwanda Samoa
Sao Tome And Principe Senegal Sierra Leone
Solomon Islands Somalia South Sudan
Sri Lanka State Of Palestine Sudan
Swaziland Syrian Arab Republic Tajikistan
Togo U.R. Of Tanzania: Mainland Uganda
Ukraine Uzbekistan Vanuatu
Viet Nam Yemen Zambia
Zanzibar Zimbabwe
I have written a function that does this called table(countries). While what I have written works, it doesn't seem like the most efficient way to do this. I don't have to use the format() function for the assignment, but it is what I'm most comfortable with. Feel free to show me another way if there is a better one, but please remember this is my first programming language/class.
Here is the code I have written:
def table(countries):
    counter = 0  # For counting when I've printed 3 columns
    for outer in range(len(countries)):
        print(format(countries[outer], '30'), end='')
        counter += 1
        if counter == 3:
            counter = 0
            print()  # Start a new row
Thanks in advance!
You can use the modulo operator and won't need another variable.
def table(countries):
    for outer in range(len(countries)):
        if outer % 3 == 0 and outer > 0:
            print()  # Start a new row every three countries
        print(format(countries[outer], '30'), end='')
You can try list comprehension:
countries = ['aaa','bbb','ccc','dd','eeeee','fff','ggggggg']
print('\n'.join([" ".join([country.ljust(30) for country in countries[i:i+3]]) for i in range(0,len(countries),3)]))
which will result in:
aaa                            bbb                            ccc
dd                             eeeee                          fff
ggggggg
First, we split countries into groups of 3 each: for i in range(0, len(countries), 3)
Then, we turn each country in that sublist into a fixed-length string, padded with spaces to a length of 30: [country.ljust(30) for country in countries[i:i+3]]
After that, we join each sublist into one string: " ".join(...)
And at the end, we join each of those sublist strings into one string with an end-of-line character: '\n'.join(...)
Worth noting that you are going to have trailing spaces at the end of each line; if that is unwanted, you can call rstrip() to get rid of them.
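Since you mentioned being open to other approaches, here is one more sketch that keeps the row logic explicit without an extra counter (the cols and width parameters are just illustrative defaults, not anything required by the assignment):
def table(countries, cols=3, width=30):
    # Walk the list in steps of `cols`, so each slice is one printed row.
    for i in range(0, len(countries), cols):
        row = countries[i:i + cols]
        # f-string alignment does the same padding as format(value, '30').
        print("".join(f"{name:<{width}}" for name in row))

table(['aaa', 'bbb', 'ccc', 'dd', 'eeeee', 'fff', 'ggggggg'])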
