Changing values in a column based on a match - python

I have a Pandas DataFrame that contains the names of Brazilian universities, but sometimes the names appear in short form and sometimes in long form (for example, the Universidade Federal do Rio de Janeiro is sometimes identified as UFRJ).
The DataFrame looks like this:
| college |
|----------------------------------------|
| Universidade Federal do Rio de Janeiro |
| UFRJ |
| Universidade de Sao Paulo |
| USP |
| Catholic University of Minas Gerais |
I have another DataFrame that holds, in separate columns, the short name and the long name of SOME (not all) of those universities. It looks like this:
| long_name | short_name |
|----------------------------------------|------------|
| Universidade Federal do Rio de Janeiro | UFRJ |
| Universidade de Sao Paulo | USP |
What I want is to replace all short names with long names, so that in this example the college column of the first DataFrame becomes:
| college |
|----------------------------------------|
| Universidade Federal do Rio de Janeiro |
| Universidade Federal do Rio de Janeiro |
| Universidade de Sao Paulo |
| Universidade de Sao Paulo |
| Catholic University of Minas Gerais | <--- note: this one does not have a match, so it stays the same
Is there a way to do that using pandas and numpy (or any other library)?

Use Series.map with a Series built from the second DataFrame. Values without a match become missing values, so Series.fillna is added to restore the original value:
df1['college'] = (df1['college'].map(df2.set_index('short_name')['long_name'])
.fillna(df1['college']))
print (df1)
college
0 Universidade Federal do Rio de Janeiro
1 Universidade Federal do Rio de Janeiro
2 Universidade de Sao Paulo
3 Universidade de Sao Paulo
4 Catholic University of Minas Gerais
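For reference, here is a self-contained sketch of the same approach that can be run as-is. The frame names df1 and df2 follow the answer, the data is the sample from the question, and the intermediate name short_to_long is only introduced here for readability.

```python
import pandas as pd

df1 = pd.DataFrame({
    "college": [
        "Universidade Federal do Rio de Janeiro",
        "UFRJ",
        "Universidade de Sao Paulo",
        "USP",
        "Catholic University of Minas Gerais",
    ]
})
df2 = pd.DataFrame({
    "long_name": ["Universidade Federal do Rio de Janeiro", "Universidade de Sao Paulo"],
    "short_name": ["UFRJ", "USP"],
})

# Series indexed by short name, with the long name as the value.
short_to_long = df2.set_index("short_name")["long_name"]

# map() yields NaN where there is no match; fillna() restores the original value.
df1["college"] = df1["college"].map(short_to_long).fillna(df1["college"])
print(df1)
```

For exact matches like these, df1["college"].replace(short_to_long.to_dict()) should give the same result in a single step.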


Scraping news title from a page with bs4 in python [closed]

I was trying to scrape the "entry-title" of the latest news items on the site "https://www.abafg.it/category/avvisi/", but my code prints [ ] instead. What am I doing wrong?
I tried to scrape the class "entry-title" so that I could save the title, the link the news item leads to, and the publication date.
The entry-title class is not on the link's a tag, but on the h2 wrapped around it.
You can use
names = [h.a for h in soup.find_all('h2', class_='entry-title')]
But I think using CSS selectors would be better here
names = soup.select('h2.entry-title > a[href]')
which selects any a tag that has an href attribute and whose parent is an h2 of class entry-title.
Then,
for a in names: print(a.get_text().strip(), a.get('href'))
will print
AVVISO LEZIONI DI SCULTURA : PROF.BORRELLI https://www.abafg.it/avviso-lezioni-di-scultura-prof-borrelli/
ORARIO DELLE LEZIONI A.A.2022/2023 IN VIGORE DAL 21 NOVEMBRE 2022 https://www.abafg.it/orario-delle-lezioni-a-a-2022-2023-in-vigore-dal-21-novembre-2022/
PROROGA BANDO AFFIDAMENTI INTERNI D.D. N. 3 DEL 4.11.2022 https://www.abafg.it/proroga-bando-affidamenti-interni-d-d-n-3-del-4-11-2022/
D.D. n. 7 del 15.11.2022 DECRETO GRADUATORIA PROVVISORIA ABPR19 https://www.abafg.it/d-d-n-7-del-15-11-2022-decreto-graduatoria-provvisoria-abpr19/
D.D. n. 5 DEL 10.11.2022 DECRETO DI NOMINA COMMISSIONE ABPR19 https://www.abafg.it/d-d-n-5-del-10-11-2022-decreto-di-nomina-commissione-abpr19/
RIAPERTURA BANDO AFFIDAMENTI INTERNI D.D. N. 3 DEL 4.11.2022 https://www.abafg.it/riapertura-bando-affidamenti-interni-d-d-n-4-del-4-11-2022/
D.D.81 del 26.10.2022 GRADUATORIA DEFINITIVA ABST48 STORIA DELLE ARTI APPLICATE https://www.abafg.it/d-d-81-del-26-10-2022-graduatoria-definitiva-abst48-storia-delle-arti-applicate/
AVVISO PRESENTAZIONE DOMANDE CULTORE DELLA MATERIA A.A.22.23-SCADENZA 11.11.2022 https://www.abafg.it/avviso-presentazione-domande-cultore-della-materia-a-a-22-23-scadenza-11-11-2022/
D.D. N.78 DEL 19/10/2022 BANDO GRADUATORIE D’ISTITUTO-SCADENZA 9/11/2022. https://www.abafg.it/d-d-n-78-bando-graduatorie-distituto-scadenza-9-11-2022/
ORARIO PROVVISIORIO DELLE LEZIONI A.A. 2022/2023: TRIENNIO E BIENNIO https://www.abafg.it/orario-provvisiorio-delle-lezioni-a-a-2022-2023-triennio-e-biennio/
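To try this without hitting the site, here is a small offline sketch. The HTML string is a simplified, hypothetical stand-in for the page markup (the example.com links and titles are made up), and the search on the a tag is only a guess at what the original code did.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical stand-in for the page markup:
# the class sits on <h2>, not on the <a> inside it.
html = """
<article>
  <h2 class="entry-title"><a href="https://example.com/first/"> First notice </a></h2>
  <h2 class="entry-title"><a href="https://example.com/second/">Second notice</a></h2>
  <h2 class="other"><a href="https://example.com/ignored/">Ignored</a></h2>
</article>
"""
soup = BeautifulSoup(html, "html.parser")

# Searching for <a class="entry-title"> finds nothing, which gives an empty list.
wrong = soup.find_all("a", class_="entry-title")

# Selecting <a href> elements that are direct children of <h2 class="entry-title"> works.
links = soup.select("h2.entry-title > a[href]")
rows = [(a.get_text().strip(), a.get("href")) for a in links]

print(wrong)
for title, href in rows:
    print(title, href)
```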
Edit: to save the printed text to a file, you could first build it as one string with .join
asText = '\n'.join([f'{a.get_text().strip()} {a.get("href")}' for a in names])
and then you could save it with
with open('./resources/titles.txt', 'w', encoding='utf-8') as f:
    f.write(asText)
If you want something easier to read, I suggest using pandas (to_markdown requires the tabulate package to be installed)
import pandas
asDF = pandas.DataFrame([{
    'title': a.get_text().strip(), 'link': a.get('href')
} for a in names])
asText = asDF.to_markdown(index=False)
and now asText looks like
| title | link |
|:---------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------|
| ORARIO DELLE LEZIONI A.A.2022/2023 IN VIGORE DAL 21 NOVEMBRE 2022 | https://www.abafg.it/orario-delle-lezioni-a-a-2022-2023-in-vigore-dal-21-novembre-2022/ |
| PROROGA BANDO AFFIDAMENTI INTERNI D.D. N. 3 DEL 4.11.2022 | https://www.abafg.it/proroga-bando-affidamenti-interni-d-d-n-3-del-4-11-2022/ |
| D.D. n. 7 del 15.11.2022 DECRETO GRADUATORIA PROVVISORIA ABPR19 | https://www.abafg.it/d-d-n-7-del-15-11-2022-decreto-graduatoria-provvisoria-abpr19/ |
| D.D. n. 5 DEL 10.11.2022 DECRETO DI NOMINA COMMISSIONE ABPR19 | https://www.abafg.it/d-d-n-5-del-10-11-2022-decreto-di-nomina-commissione-abpr19/ |
| RIAPERTURA BANDO AFFIDAMENTI INTERNI D.D. N. 3 DEL 4.11.2022 | https://www.abafg.it/riapertura-bando-affidamenti-interni-d-d-n-4-del-4-11-2022/ |
| D.D.81 del 26.10.2022 GRADUATORIA DEFINITIVA ABST48 STORIA DELLE ARTI APPLICATE | https://www.abafg.it/d-d-81-del-26-10-2022-graduatoria-definitiva-abst48-storia-delle-arti-applicate/ |
| AVVISO PRESENTAZIONE DOMANDE CULTORE DELLA MATERIA A.A.22.23-SCADENZA 11.11.2022 | https://www.abafg.it/avviso-presentazione-domande-cultore-della-materia-a-a-22-23-scadenza-11-11-2022/ |
| D.D. N.78 DEL 19/10/2022 BANDO GRADUATORIE D’ISTITUTO-SCADENZA 9/11/2022. | https://www.abafg.it/d-d-n-78-bando-graduatorie-distituto-scadenza-9-11-2022/ |
| ORARIO PROVVISIORIO DELLE LEZIONI A.A. 2022/2023: TRIENNIO E BIENNIO | https://www.abafg.it/orario-provvisiorio-delle-lezioni-a-a-2022-2023-triennio-e-biennio/ |
| GRADUATORIA DEFINITIVA ABST47 STILE,STORIA DELL’ARTE E DEL COSTUME | https://www.abafg.it/graduatoria-definitiva-abst47-stilestoria-dellarte-e-del-costume/ |
And then, instead of TXT, you could also save it as CSV with
asDF.to_csv('./resources/titles.csv', index=False)
so that you can view it as a spreadsheet

Python pandas for manipulating text & inconsistent data

How do I extract specific text from one column in pandas when the format is inconsistent? For example, like this:
Area | Owners
Bali Island: 4600 | John
Java Island:7200 | Van Hour
Hallo Island : 2400| Petra
and the format would be like this
Area | Owners | Area Number
Bali Island: 4600 | John | 4600
Java Island:7200 | Van Hour | 7200
Hallo Island : 2400| Petra | 2400
You could use str.extract:
df['Area Number'] = df['Area'].str.extract(r'(\d+)$')
output:
Area Owners Area Number
0 Bali Island: 4600 John 4600
1 Java Island:7200 Van Hour 7200
2 Hallo Island : 2400 Petra 2400
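Here is a runnable sketch of the same idea on the sample rows. Two details are my own additions rather than part of the answer: the pattern tolerates trailing spaces, and the result is converted to numbers, since str.extract returns strings.

```python
import pandas as pd

df = pd.DataFrame({
    "Area": ["Bali Island: 4600", "Java Island:7200", "Hallo Island : 2400"],
    "Owners": ["John", "Van Hour", "Petra"],
})

# Capture the digits at the end of the string, allowing stray trailing spaces.
# expand=False returns a Series rather than a one-column DataFrame.
digits = df["Area"].str.extract(r"(\d+)\s*$", expand=False)

# str.extract yields strings; convert when a numeric column is needed.
df["Area Number"] = pd.to_numeric(digits)
print(df)
```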

TypeError: read_json() got an unexpected keyword argument 'delimiter'

I am trying to delimit a flat json file in Python 3 (via Jupyter), in order to create an extra column. Pandas automatically reads and produces rows between "...". When I print without a delimiter it reads the file just fine. Here the first four rows:
0 <h1>lorum ipsum|
1 <h2>lorum ipsum|
2
3 <h5>lorum ipsum...
However, I would like to split off an extra column every time the JSON file has a >, but I receive an extensive error that I do not understand. What am I doing wrong?
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-38-647ecd72fd56> in <module>
1 import sys
2 import pandas as pd
----> 3 df = pd.read_json('/filepath/doc.json' , delimiter='>', engine='python', header=None)
4 print (df)
~/opt/anaconda3/lib/python3.8/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
197 else:
198 kwargs[new_arg_name] = new_arg_value
--> 199 return func(*args, **kwargs)
200
201 return cast(F, wrapper)
~/opt/anaconda3/lib/python3.8/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
297 )
298 warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 299 return func(*args, **kwargs)
300
301 return wrapper
TypeError: read_json() got an unexpected keyword argument 'delimiter'
The code that produces the error is:
import pandas as pd
df = pd.read_json('/path/file.json' , delimiter='>', engine='python', header=None)
print (df)
@Ron
Sure. It was originally a pdf, which I scraped into json, the best I could. With headers, sub headers and paragraphs.
["<h1>Amsterdam Circulair|", "<h2>leren door te doen|", "", "<h5>In 2 jaar met ruim 20 projecten in de praktijk laten zien dat het kan|", "<h5>2|", "", "<s1>Voorpagina:| Een medewerkster van Gro-Holland verzamelt de op koffieprut | gegroeide oesterzwammen, waarna zij naar de restaurants van La | Place gaan, de leverancier van de koffieprut. Foto: Olaf Kraak.| Bron: Trouw|", "", "<h2>Inhoud|", "", "<h5>Voorwoord| 4|", "<h5>1 | Inleiding | 6|", "<h5>2 | Wat hebben het onderzoek en de marktconsultatie opgeleverd? | 7|", "", "<p>2.1 | Onderzoeksrapport Amsterdam Circulair | 7|", "<p>2.2 | Resultaat marktconsultatie | 7|", "<p>2.3 | Duurzaamheidsraad | 8|", "", "<h5>3 | Wat doen we al? | 9|", "<h5> 4 | Hoe gaan we verder? | 10|", "", "<p>4.1 | Inzet gemeente | 10|", "<p>4.2 | Inzet 2016-2017 uitgewerkt | 12|", "<p>4.3 | Wat vragen we anderen | 18|", "", "<s1>4.3.1 | Rijk | 18|", "<s1>4.3.2 | Bedrijfsleven | 18|", "<s1>4.3.3 | Amsterdammers | 18|", "", "<h5>5 | Samengevat | 19|", "", "<p>Versie 16 november 2016|", "<h5>3|", "", "<h2>Voorwoord|", "", "<p>De transformatie van onze huidige economie naar een duurzame, circulaire economie is noodzakelijk en onontkoombaar. De | \u2018circulaire economie\u2019 is anno 2016 nog een toekomstbeeld. De vraag die nu voorligt is daarom: hoe moeilijk of makkelijk gaan we | het elkaar maken om dit toekomstbeeld om te zetten in realiteit?|", "<p>Vast staat dat de overheid niet kan bepalen dat \u2018de\u2019 economie maar circulair moet worden, maar dat ze een faciliterende en | stimulerende taak heeft. Juist bedrijven en instellingen moeten hun rol kunnen pakken in een economie in transitie.|", "<p>De bijdrage die de gemeente Amsterdam daarom op dit moment het beste kan leveren, is aantonen dat de circulaire economie | werkelijkheid kan worden. Dit gaan we doen door de komende twee jaar met twintig projecten te laten zien dat circulair werken | kan.
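The TypeError arises because delimiter is not a parameter of pd.read_json; delimiter and header are read_csv options. One possible way around it, sketched under the assumption that the file really is a flat JSON array of strings like the sample above, is to load the list first and split the tag from the text afterwards. The names raw, items, tag and text are only illustrative, and the three entries are copied from the sample.

```python
import json
import pandas as pd

# A few entries copied from the sample above; in practice the list would come
# from json.load() on the real file.
raw = '["<h1>Amsterdam Circulair|", "<h2>leren door te doen|", "", "<h5>Voorwoord| 4|"]'
items = json.loads(raw)

s = pd.Series(items, name="raw")

# Split each entry into the tag between "<" and ">" and the text after it.
df = s.str.extract(r"^<(?P<tag>[^>]+)>(?P<text>.*)$")

# Empty strings do not match the pattern and come back as all-NaN rows; drop them.
df = df.dropna(how="all").reset_index(drop=True)
print(df)
```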

Pandas: How to map the values of a Dataframe to another Dataframe?

I am totally new to Python and just learning with some use cases I have.
I have two DataFrames. The main one has a Country column that I need to fill in. The other has a column named 'Countries' whose values need to be mapped into the main DataFrame by matching on its column named 'Data'.
(Please accept my apology if this question has already been answered)
Below is the Main DataFrame:
Name Data | Country
----------------------------- | ---------
Arjun Kumar Reddy las Vegas |
Divya london Khosla |
new delhi Pragati Kumari |
Will London Turner |
Joseph Mascurenus Bombay |
Jason New York Bourne |
New york Vice Roy |
Joseph Mascurenus new York |
Peter Parker California |
Bruce (istanbul) Wayne |
Below is the Referenced DataFrame:
Data | Countries
-------------- | ---------
las Vegas | US
london | UK
New Delhi | IN
London | UK
bombay | IN
New York | US
New york | US
new York | US
California | US
istanbul | TR
Moscow | RS
Cape Town | SA
And what I want in the result will look like below:
Name Data | Country
----------------------------- | ---------
Arjun Kumar Reddy las Vegas | US
Divya london Khosla | UK
new delhi Pragati Kumari | IN
Will London Turner | UK
Joseph Mascurenus Bombay | IN
Jason New York Bourne | US
New york Vice Roy | US
Joseph Mascurenus new York | US
Peter Parker California | US
Bruce (istanbul) Wayne | TR
Please note that the two DataFrames are not the same size.
I thought of using map or the fuzzywuzzy package, but couldn't achieve the result.
Find the country key that matches in the reference dataframe and extract it.
import re
regex = '(' + ')|('.join(ref_df['Data']) + ')'
df['key'] = df['Name Data'].str.extract(regex, flags=re.I).bfill(axis=1)[0]
>>> df
Name Data key
0 Arjun Kumar Reddy las Vegas las Vegas
1 Bruce (istanbul) Wayne istanbul
2 Joseph Mascurenus new York new York
>>> ref_df
Data Country
0 las Vegas US
1 new York US
2 istanbul TR
Merge the two DataFrames on the extracted key.
pd.merge(df, ref_df, left_on='key', right_on='Data')
Name Data key Data Country
0 Arjun Kumar Reddy las Vegas las Vegas las Vegas US
1 Bruce (istanbul) Wayne istanbul istanbul TR
2 Joseph Mascurenus new York new York new York US
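One thing to watch with this approach: the key is extracted case-insensitively, so "Bombay" in the text would not merge with "bombay" in the reference frame. Below is a sketch of a variant that lower-cases both sides and uses map instead of merge. It uses a subset of the question's data, keeps the question's column name Countries, and the names lookup, pattern and key are my own.

```python
import re
import pandas as pd

df = pd.DataFrame({"Name Data": [
    "Arjun Kumar Reddy las Vegas",
    "Joseph Mascurenus Bombay",
    "Bruce (istanbul) Wayne",
    "Peter Parker California",
]})
ref_df = pd.DataFrame({
    "Data": ["las Vegas", "bombay", "istanbul", "California", "Moscow"],
    "Countries": ["US", "IN", "TR", "US", "RS"],
})

# Lower-case lookup so "Bombay" in the text can match "bombay" in the reference.
lookup = ref_df.assign(key=ref_df["Data"].str.lower()).drop_duplicates("key")
lookup = lookup.set_index("key")["Countries"]

# One alternation of escaped place names; re.escape guards against regex metacharacters.
pattern = "(" + "|".join(re.escape(k) for k in lookup.index) + ")"

# Extract the first matching place name, ignoring case, then normalize it.
key = df["Name Data"].str.extract(pattern, flags=re.I, expand=False).str.lower()

# map() keeps every row of df; rows without a match would get NaN.
df["Country"] = key.map(lookup)
print(df)
```

If some place names are prefixes of others, sorting the alternatives by length (longest first) before joining them is probably a good idea.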
It looks like everything is sorted so you can merge on index
mdf.merge(rdf, left_index=True, right_index=True)

How to split a pandas string to extract middle names?

I want to split names of individuals into multiple strings. I am able to extract the first name and last name quite easily, but I have problems extracting the middle name or names as these are quite different in each scenario.
The data would look like this:
ID| Complete_Name | Type
1 | JERRY, Ben | "I"
2 | VON HELSINKI, Olga | "I"
3 | JENSEN, James Goodboy Dean | "I"
4 | THE COMPANY | "C"
5 | CRUZ, Juan S. de la | "I"
Whereby there are names with only a first and last name and names with something in between or two middle names. How can I extract the middle names from a Pandas dataframe? I can already extract the first and last names.
df = pd.read_csv("list.pip", sep="|")
df["First Name"] = np.where(df["Type"]=="I",df['Complete_Name'].str.split(',').str.get(1) , df[""])
df["Last Name"] = np.where(df["Type"]=="I",df['Complete_Name'].str.split(' ').str.get(1) , df[""])
The desired results should look like this:
ID| Complete_Name | Type | First Name | Middle Name | Last Name
1 | JERRY, Ben | "I" | Ben | | JERRY
2 | VON HELSINKI, Olga | "I" | Olga | | VON HELSINKI
3 | JENSEN, James Goodboy Dean | "I" | James | Goodboy Dean| JENSEN
4 | THE COMPANY | "C" | | |
5 | CRUZ, Juan S. de la | "I" | Juan | S. de la | CRUZ
A single str.extract call will work here:
p = r'^(?P<Last_Name>.*), (?P<First_Name>\S+)\b\s*(?P<Middle_Name>.*)'
u = df.loc[df.Type == "I", 'Complete_Name'].str.extract(p)
pd.concat([df, u], axis=1).fillna('')
ID Complete_Name Type Last_Name First_Name Middle_Name
0 1 JERRY, Ben I JERRY Ben
1 2 VON HELSINKI, Olga I VON HELSINKI Olga
2 3 JENSEN, James Goodboy Dean I JENSEN James Goodboy Dean
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I CRUZ Juan S. de la
Regex Breakdown
^ # Start-of-line
(?P<Last_Name> # First named capture group - Last Name
.* # Match anything until...
)
, # ...we see a comma
\s # whitespace
(?P<First_Name> # Second capture group - First Name
\S+ # Match all non-whitespace characters
)
\b # Word boundary
\s* # Optional whitespace chars (mostly housekeeping)
(?P<Middle_Name> # Third capture group - Zero or more middle names
.* # Match everything till the end of string
)
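To see what each group captures, the pattern can be tried with the standard re module on the names from the question. This is only a quick check outside pandas; the names list and the results dict are mine.

```python
import re

# Same pattern as in the answer above.
p = r'^(?P<Last_Name>.*), (?P<First_Name>\S+)\b\s*(?P<Middle_Name>.*)'

names = [
    "JERRY, Ben",
    "VON HELSINKI, Olga",
    "JENSEN, James Goodboy Dean",
    "CRUZ, Juan S. de la",
    "THE COMPANY",
]

results = {}
for name in names:
    m = re.match(p, name)
    # A name without a comma does not match, so no groups are available for it.
    results[name] = m.groupdict() if m else None
    print(name, "->", results[name])
```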
I think you can do:
# take the complete_name column and split it multiple times
df2 = (df.loc[df['Type'].eq('I'),'Complete_Name'].str
.split(',', expand=True)
.fillna(''))
# remove extra spaces
for x in df2.columns:
    df2[x] = [v.strip() for v in df2[x]]
# split the name on first space and join it
df2 = pd.concat([df2[0],df2[1].str.split(' ', n=1, expand=True)], axis=1)
df2.columns = ['last','first','middle']
# join the data frames
df = pd.concat([df[['ID','Complete_Name','Type']], df2], axis=1)
# rearrange columns - not necessary though
df = df[['ID','Complete_Name','Type','first','middle','last']]
# remove none values
df = df.replace([None], '')
ID Complete_Name Type first middle last
0 1 JERRY, Ben I Ben JERRY
1 2 VON HELSINKI, Olga I Olga VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
Here's another answer that uses some simple lambda functionality.
import numpy as np
import pandas as pd
""" Create data and data frame """
info_dict = {
'ID': [1,2,3,4,5,],
'Complete_Name':[
'JERRY, Ben',
'VON HELSINKI, Olga',
'JENSEN, James Goodboy Dean',
'THE COMPANY',
'CRUZ, Juan S. de la',
],
'Type':['I','I','I','C','I',],
}
data = pd.DataFrame(info_dict, columns = info_dict.keys())
""" List of columns to add """
name_cols = [
'First Name',
'Middle Name',
'Last Name',
]
"""
Use partition() to separate first and middle names into Pandas series.
Note: data[data['Type'] == 'I']['Complete_Name'] will allow us to target only the
values that we want.
"""
NO_LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[2].strip())
LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[0].strip())
# We can use index positions to quickly add columns to the dataframe.
# The partition() function will keep the delimited value in the 1 index, so we'll use
# the 0 and 2 index positions for first and middle names.
data[name_cols[0]] = NO_LAST_NAMES.str.partition(' ')[0]
data[name_cols[1]] = NO_LAST_NAMES.str.partition(' ')[2]
# Finally, we'll add our Last Names column
data[name_cols[2]] = LAST_NAMES
# Optional: We can replace all blank values with numpy.nan values using regular expressions.
data = data.replace(r'^$', np.nan, regex=True)
Then you should end up with something like this:
ID Complete_Name Type First Name Middle Name Last Name
0 1 JERRY, Ben I Ben NaN JERRY
1 2 VON HELSINKI, Olga I Olga NaN VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C NaN NaN NaN
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
Or, replace NaN values with blank strings:
data = data.replace(np.nan, r'', regex=False)
Then you have:
ID Complete_Name Type First Name Middle Name Last Name
0 1 JERRY, Ben I Ben JERRY
1 2 VON HELSINKI, Olga I Olga VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
