Unable to parse names properly from some elements - python

I've written a script in Python to parse some names out of some elements. When I execute my script, it does parse the names, but the output is hard to read: the names run together so that they look like two big blobs of text. In the source the names are separated by <br> tags. How can I get each name individually?
Elements within which the names are:
html_content='''
<div class="second-child"><div class="richText"> <p></p>
<p><strong>D<br></strong>Daiwa House Industry<br>Danske Bank<br>DaVita HealthCare Partners<br>Delphi Automotive<br>Denso<br>Dentsply International<br>Deutsche Boerse<br>Deutsche Post<br>Deutsche Telekom<br>Diageo<br>Dialight<br>Digital Realty Trust<br>Donaldson Company<br>DSM<br>DS Smith </p>
<p><strong>E<br></strong>East Japan Railway Company<br>eBay<br>EDP Renováveis<br>Edwards Lifesciences<br>Elekta<br>EnerNOC<br>Enphase Energy<br>Essilor<br>Etsy<br>Eurazeo<br>European Investment Bank (EIB)<br>Evonik Industries<br>Express Scripts <br><br><strong>F<br></strong>Fielmann<br>First Solar<br>FMO<br>Ford Motor<br>Fresenius Medical Care<br><br></p></div></div>
'''
The script I've written to parse names:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, "lxml")
for items in soup.select(".second-child"):
    name = ' '.join([item.text for item in items.select("p")])
    print(name)
The output I'm getting (partial result):
DDaiwa House IndustryDanske BankDaVita HealthCare PartnersDelphi AutomotiveDensoDentsply InternationalDeutsche
The output I want to get:
DDaiwa House Industry
Danske Bank
DaVita HealthCare Partners
Delphi Automotive
Denso
Dentsply International
FYI, when I take a closer look at the result, I can see that the separate names are attached to each other with no gap in between.

Using item.text strips out all the tags, so you need to replace the <br> tags with '\n' first. Using the approach from Ian Mackinnon's answer to the question Convert </br> to end line,
your script becomes:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, "lxml")
for br in soup.find_all("br"):
    br.replace_with("\n")
for items in soup.select(".second-child"):
    name = ' '.join([item.text for item in items.select("p")])
    print(name)
and the output:
D
Daiwa House Industry
Danske Bank
DaVita HealthCare Partners
Delphi Automotive
Denso
Dentsply International
Deutsche Boerse
Deutsche Post
Deutsche Telekom
Diageo
Dialight
Digital Realty Trust
Donaldson Company
DSM
DS Smith E
East Japan Railway Company
eBay
EDP Renováveis
Edwards Lifesciences
Elekta
EnerNOC
Enphase Energy
Essilor
Etsy
Eurazeo
European Investment Bank (EIB)
Evonik Industries
Express Scripts 
F
Fielmann
First Solar
FMO
Ford Motor
Fresenius Medical Care
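One wrinkle remains: the "DS Smith E" line above is where the end of one <p> got joined to the letter heading of the next with a space. Joining the paragraphs with a newline instead keeps them apart, for example:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, "lxml")
for br in soup.find_all("br"):
    br.replace_with("\n")
for items in soup.select(".second-child"):
    # join the <p> blocks with a newline instead of a space so the
    # letter heading that starts the next block gets its own line
    name = '\n'.join(item.text for item in items.select("p") if item.text.strip())
    print(name)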

Check the solution below and let me know if any improvements are required:
for items in soup.select(".second-child"):
    for text_nodes in items.select("p"):
        name = " \n".join([item for item in text_nodes.strings if item])
        print(name)
Output
D
Daiwa House Industry
Danske Bank
DaVita HealthCare Partners
Delphi Automotive
Denso
Dentsply International
Deutsche Boerse
Deutsche Post
Deutsche Telekom
Diageo
Dialight
Digital Realty Trust
Donaldson Company
DSM
DS Smith
E
East Japan Railway Company
eBay
EDP Renováveis
Edwards Lifesciences
Elekta
EnerNOC
Enphase Energy
Essilor
Etsy
Eurazeo
European Investment Bank (EIB)
Evonik Industries
Express Scripts 
F
Fielmann
First Solar
FMO
Ford Motor
Fresenius Medical Care
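For what it's worth, BeautifulSoup can also do the joining for you: get_text() takes a separator argument, so the same result comes out of a couple of lines (a sketch against the same html_content):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, "lxml")
for p in soup.select(".second-child p"):
    # get_text inserts the separator between every text fragment,
    # so each <br>-separated name lands on its own line; strip=True
    # trims each fragment and skips whitespace-only ones
    print(p.get_text("\n", strip=True))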

Related

Web-scraping, Python and Beautifulsoup; extracting all <li> tags under a specific h2 tag

I am attempting to extract all the events from a wiki article for a date such as May 9, and put those events into a one-column dataframe while ignoring the <h3> sub-headings Pre-1600, 1601–1900 and 1901–present. All the events in those subsections should just be concatenated together into one column seamlessly.
I also want to ignore the other sections such as births, deaths, etc., which are denoted by <h2> tags as well, so that only the events section is extracted. The <h2> tag/section of interest is the second one in the list below.
import requests, itertools, re
from bs4 import BeautifulSoup as soup
d = soup(requests.get('https://en.wikipedia.org/wiki/May_9').text, 'html.parser')
h2 = d.find_all("h2")
h2
[<h2 id="mw-toc-heading">Contents</h2>,
<h2><span class="mw-headline" id="Events">Events</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span>edit<span class="mw-editsection-bracket">]</span></span></h2>,
<h2><span class="mw-headline" id="Births">Births</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span>edit<span class="mw-editsection-bracket">]</span></span></h2>,
<h2><span class="mw-headline" id="Deaths">Deaths</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span>edit<span class="mw-editsection-bracket">]</span></span></h2>,
<h2><span class="mw-headline" id="Holidays_and_observances">Holidays and observances</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span>edit<span class="mw-editsection-bracket">]</span></span></h2>,
<h2><span class="mw-headline" id="References">References</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span>edit<span class="mw-editsection-bracket">]</span></span></h2>,
<h2><span class="mw-headline" id="External_links">External links</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span>edit<span class="mw-editsection-bracket">]</span></span></h2>,
<h2>Navigation menu</h2>]
I'm struggling with constructing a function that selects the Events section and then the subsequent <li> tags but ignores the subheadings and the other sections.
I've attempted to separate out the <h2> sections with
data = [[i.name, i] for i in d.find_all(re.compile('h2|ul'))]
new_data = [[a, list(b)] for a, b in itertools.groupby(data, key=lambda x:x[0] == 'h2')]
But I'm stuck at this point. If there is a better approach, I'm happy to use it.
You can use .find_previous to check whether the previous <h2> is the Events heading:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/May_9"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for li in soup.select("h3 + ul > li"):
    if (h2 := li.find_previous("h2")) and h2.find(id="Events"):
        date, event = li.text.replace("–", "-").split(" - ", maxsplit=1)
        print("{:<10} {}".format(date, event))
Prints:
0328 Athanasius is elected Patriarch of Alexandria.[1]
1009 Lombard Revolt: Lombard forces led by Melus revolt in Bari against the Byzantine Catepanate of Italy.
1386 England and Portugal formally ratify their alliance with the signing of the Treaty of Windsor, making it the oldest diplomatic alliance in the world which is still in force.
1450 'Abd al-Latif (Timurid monarch) is assassinated.
1540 Hernando de Alarcón sets sail on an expedition to the Gulf of California.
1662 The figure who later became Mr. Punch makes his first recorded appearance in England.[2]
1671 Thomas Blood, disguised as a clergyman, attempts to steal England's Crown Jewels from the Tower of London.
1726 Five men arrested during a raid on Mother Clap's molly house in London are executed at Tyburn.
1864 Second Schleswig War: The Danish navy defeats the Austrian and Prussian fleets in the Battle of Heligoland.
1865 American Civil War: Nathan Bedford Forrest surrenders his forces at Gainesville, Alabama.
1865 American Civil War: President Andrew Johnson issues a proclamation ending belligerent rights of the rebels and enjoining foreign nations to intern or expel Confederate ships.
1873 Der Krach: Vienna stock market crash heralds the Long Depression.
1877 Mihail Kogălniceanu reads, in the Chamber of Deputies, the Declaration of Independence of Romania. This day became the Independence Day of Romania.
1901 Australia opens its first national parliament in Melbourne.
1911 The works of Gabriele D'Annunzio are placed in the Index of Forbidden Books by the Vatican.
1915 World War I: Second Battle of Artois between German and French forces.
1918 World War I: Germany repels Britain's second attempt to blockade the port of Ostend, Belgium.
1920 Polish-Soviet War: The Polish army under General Edward Rydz-Śmigły celebrates its capture of Kiev with a victory parade on Khreshchatyk.
1926 Admiral Richard E. Byrd and Floyd Bennett claim to have flown over the North Pole (later discovery of Byrd's diary appears to cast some doubt on the claim.)
1927 Old Parliament House, Canberra officially opens.[3]
1936 Italy formally annexes Ethiopia after taking the capital Addis Ababa on May 5.
1941 World War II: The German submarine U-110 is captured by the Royal Navy. On board is the latest Enigma machine which Allied cryptographers later use to break coded German messages.
1942 The Holocaust in Ukraine: The SS executes 588 Jewish residents of the Podolian town of Zinkiv (Khmelnytska oblast. The Zoludek Ghetto (in Belarus) is destroyed and all its inhabitants executed or deported.
1945 World War II: The final German Instrument of Surrender is signed at the Soviet headquarters in Berlin-Karlshorst.
1946 King Victor Emmanuel III of Italy abdicates and is succeeded by Umberto II.
1948 Czechoslovakia's Ninth-of-May Constitution comes into effect.
1950 Robert Schuman presents the "Schuman Declaration", is considered by some people to be the beginning of the creation of what is now the European Union.
1955 Cold War: West Germany joins NATO.
1960 The Food and Drug Administration announces it will approve birth control as an additional indication for Searle's Enovid, making Enovid the world's first approved oral contraceptive pill.
1969 Carlos Lamarca leads the first urban guerrilla action against the military dictatorship of Brazil in São Paulo, by robbing two banks.
1974 Watergate scandal: The United States House Committee on the Judiciary opens formal and public impeachment hearings against President Richard Nixon.
1979 Iranian Jewish businessman Habib Elghanian is executed by firing squad in Tehran, prompting the mass exodus of the once 100,000-strong Jewish community of Iran.
1980 In Florida, United States, Liberian freighter MV Summit Venture collides with the Sunshine Skyway Bridge over Tampa Bay, making a 1,400-ft. section of the southbound span collapse. Thirty-five people in six cars and a Greyhound bus fall 150 ft. into the water and die.
1980 In Norco, California, United States, five masked gunmen hold up a Security Pacific bank, leading to a violent shoot-out and one of the largest pursuits in California history. Two of the gunmen and one police officer are killed and thirty-three police and civilian vehicles are destroyed in the chase.
1987 LOT Flight 5055 Tadeusz Kościuszko crashes after takeoff in Warsaw, Poland, killing all 183 people on board.
1988 New Parliament House, Canberra officially opens.[3]
1992 Armenian forces capture Shusha, marking a major turning point in the First Nagorno-Karabakh War.
1992 Westray Mine disaster kills 26 workers in Nova Scotia, Canada.
2001 In Ghana, 129 football fans die in what became known as the Accra Sports Stadium disaster. The deaths are caused by a stampede (caused by the firing of tear gas by police personnel at the stadium) that followed a controversial decision by the referee.
2002 The 38-day stand-off in the Church of the Nativity in Bethlehem comes to an end when the Palestinians inside agree to have 13 suspected terrorists among them deported to several different countries.[4]
2017 US President Donald Trump fires FBI Director James Comey.[5]
2018 The historic defeat for Barisan Nasional, the governing coalition of Malaysia since the country's independence in 1957 in 2018 Malaysian general election.
2020 The COVID-19 recession causes the U.S. unemployment rate to hit 14.9 percent, its worst rate since the Great Depression.[6]
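Two side notes: the := walrus operator requires Python 3.8+, and if you want the one-column dataframe the question asked for, here is a minimal sketch without the walrus (assuming pandas is installed):
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/May_9"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

rows = []
for li in soup.select("h3 + ul > li"):
    h2 = li.find_previous("h2")          # nearest preceding <h2> heading
    if h2 and h2.find(id="Events"):      # keep only the Events section
        rows.append(li.get_text(" ", strip=True))

df = pd.DataFrame({"event": rows})       # the one-column dataframe
print(df.head())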

Create tuples of (lemma, NER type) in python , Nlp problem

I wrote the code below, and I made a dictionary for it, but I want to create tuples of (lemma, NER type) and collect counts over the tuples. I don't know how to do that; can you please help me? (NER type means named entity recognition type.)
text = """
Seville.
Summers in the flamboyant Andalucían capital often nudge 40C, but spring is a delight, with the parks in bloom and the scent of orange blossom and jasmine in the air. And in Semana Santa (Holy Week, 14-20 April) the streets come alive with floats and processions. There is also the raucous annual Feria de Abril – a week-long fiesta of parades, flamenco and partying long into the night (4-11 May; expect higher hotel prices if you visit then).
Seville is a romantic and energetic place, with sights aplenty, from the Unesco-listed cathedral – the largest Gothic cathedral in the world – to the beautiful Alcázar royal palace. But days here are best spent simply wandering the medieval streets of Santa Cruz and along the river to La Real Maestranza, Spain’s most spectacular bullring.
Seville is the birthplace of tapas and perfect for a foodie break – join a tapas tour (try devoursevillefoodtours.com), or stop at the countless bars for a glass of sherry with local jamón ibérico (check out Bar Las Teresas in Santa Cruz or historic Casa Morales in Constitución). Great food markets include the Feria, the oldest, and the wooden, futuristic-looking Metropol Parasol.
Nightlife is, unsurprisingly, late and lively. For flamenco, try one of the peñas, or flamenco social clubs – Torres Macarena on C/Torrijano, perhaps – with bars open across town until the early hours.
Book it: In an atmospheric 18th-century house, the Hospes Casa del Rey de Baeza is a lovely place to stay in lively Santa Cruz. Doubles from £133 room only, hospes.com
Trieste.
"""
doc = nlp(text).ents
en = [(entity.text, entity.label_) for entity in doc]

from pprint import pprint
pprint(en)

# group the entities by their NER type
from collections import defaultdict
type2entities = defaultdict(list)
for entity, entity_type in en:
    type2entities[entity_type].append(entity)
pprint(type2entities)
I hope the following code snippet solves your problem.
import spacy
# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_sm")
text = ("Seville.
Summers in the flamboyant Andalucían capital often nudge 40C, but spring is a delight, with the parks in bloom and the scent of orange blossom and jasmine in the air. And in Semana Santa (Holy Week, 14-20 April) the streets come alive with floats and processions. There is also the raucous annual Feria de Abril – a week-long fiesta of parades, flamenco and partying long into the night (4-11 May; expect higher hotel prices if you visit then).
Seville is a romantic and energetic place, with sights aplenty, from the Unesco-listed cathedral – the largest Gothic cathedral in the world – to the beautiful Alcázar royal palace. But days here are best spent simply wandering the medieval streets of Santa Cruz and along the river to La Real Maestranza, Spain’s most spectacular bullring.
Seville is the birthplace of tapas and perfect for a foodie break – join a tapas tour (try devoursevillefoodtours.com), or stop at the countless bars for a glass of sherry with local jamón ibérico (check out Bar Las Teresas in Santa Cruz or historic Casa Morales in Constitución). Great food markets include the Feria, the oldest, and the wooden, futuristic-looking Metropol Parasol.
Nightlife is, unsurprisingly, late and lively. For flamenco, try one of the peñas, or flamenco social clubs – Torres Macarena on C/Torrijano, perhaps – with bars open across town until the early hours.
Book it: In an atmospheric 18th-century house, the Hospes Casa del Rey de Baeza is a lovely place to stay in lively Santa Cruz. Doubles from £133 room only, hospes.com
Trieste.")
doc = nlp(text)

lemma_ner_list = []
for entity in doc.ents:
    lemma_ner_list.append((entity.lemma_, entity.label_))

# print the list of (lemma, NER type) tuples
print(lemma_ner_list)
# print the number of tuples
print(len(lemma_ner_list))
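If "collect counts over the tuples" means how often each (lemma, NER type) pair occurs, collections.Counter gives that directly:
from collections import Counter

# frequency of each (lemma, NER type) tuple
tuple_counts = Counter(lemma_ner_list)
print(tuple_counts.most_common(10))  # the ten most frequent pairs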

Removing \n \t from data scraped from website

I am trying to remove \n and \t that show up in data scraped from a webpage.
I have used the strip() function, however it doesn't seem to work for some reason.
My output still shows up with all the \ns and \ts.
Here's my code :
import urllib.request
from bs4 import BeautifulSoup
import sys
all_comments = []
max_comments = 10
base_url = 'https://www.mygov.in/'
next_page = base_url + '/group-issue/share-your-ideas-pm-narendra-modis-mann-ki-baat-26th-march-2017/'
while next_page and len(all_comments) < max_comments:
    response = urllib.request.urlopen(next_page)
    srcode = response.read()
    soup = BeautifulSoup(srcode, "html.parser")
    all_comments_div = soup.find_all('div', class_="comment_body")
    for div in all_comments_div:
        data = div.find('p').text
        data = data.strip(' \t\n')  # actual comment content
        data = ''.join([i for i in data if ord(i) < 128])
        all_comments.append(data)
    # getting the link of the stream for more comments
    next_page = soup.find('li', class_='pager-next first last')
    if next_page:
        next_page = base_url + next_page.find('a').get('href')

print('comments: {}'.format(len(all_comments)))
print(all_comments)
And here's the output I'm getting:
comments: 10
["Sir my humble submission is that please ask public not to man handle doctors because they work in a very delicate situation, to save a patient is not always in his hand. The incidents of manhandling doctors is increasing day by day and it's becoming very difficult to work in these situatons. Majority are not Opting for medical profession, it will create a crisis in medical field.In foreign no body can dare to manhandle a doctor, nurse, ambulance worker else he will be behind bars for 14 years.", 'Hello\n Sir.... Mera AK idea hai Jese bus ticket ki machine hai aur pata chalta hai ki din me kitni ticket nikali USSI TARH hum traffic police ko bhi aishi machine de to usee (1)JO MEMO DUPLICATE BANATE THE VO BHI NIKL JAYENGE MEANS A SAB LEGAL HO JAYEGA.... AUR HMARI SARKAR K TRAZERY ACCOUNT ME DIRECTLY CREDIT HO JANA CHI A TAKI SAB KO PATA CHALE KI HMARA JO TRAFIC POLICE NE FIND(DAND) LIYA HAI VO LIGALLY HAI... USEE\n1. SAB LOG TRAFIC STRIKLY FOLLOW KARENEGE...\n TAHNKYOU SIR..', 'Respect sir,\nI am Hindi teacher in one of the cbse school of Nagpur city.My question is that in 9th and10th STD. Why the subject HINDI is not compulsory. In the present pattern English language is Mandatory for students to learn but Our National Language HINDI is not .\nSir I request to update the pattern such that the Language hindi should be mandatory for the students of 9th and 10th.', 'Sir\nsuggestions AADHAR BASE SYSTEM\n1.Cash Less Education PAN India Centralised System\n2.Cash Less HEALTH POLICY for All & Centralised Rate MRP system\n3.All Private & Govt Hospitals must be CASH LESS\n4.All Toll Booth,Parking Etc CASHLESS Compulsory\n5.Compulsory HEALTH INsurance & AGRICULTURE Insurance for ALL\n6.All Bank, GOVT Sector, PVT Sector should produce Acknowledgements with TAT Mentioned\n7.Municipal Corporations/ZP must be CASH Less System\nAffordable Min Tax Housing\nCancel TDS', 'SIR KINDLY LOOK INTO MARITIME SECTOR SPECIALLY GOVERNMENT MARITIME TRAINING INSTITUTIONS REALLY CONDITIONS GOING WORST IT NEEDS IMMEDIATE CHANGES AND ATTENTION TO PROTECT OUR INDIAN REPUTATION IN MARITIME SECTOR.\nJAI HIND', ' ? ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 1.1 .2017 ?', ' 9 / ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 01.1 .2017 DOPPW/E/2017/03242, DOPPW/E/2017/04085, PMOPG/E/2017/0148952, PMOPG/E/2017/0115720 , PMOPG/E/2017/0123641 ', ' ,\n \n, ,\n\t Central Government and Central Autonomous Bodies pensioners/ family pensioners ? ']
strip() only removes characters from the ends of a string. To remove characters inside a string you need to use replace or re.sub instead.
So change:
data = data.strip(' \t\n')
To:
import re
data = re.sub(r'[\t\n ]+', ' ', data).strip()
To remove the \t and \n characters.
Use replace instead of strip:
div = "/n blablabla /t blablabla"
div = div.replace('/n', '')
div = div.replace('/t','')
print(div)
Result:
blablabla blablabla
Some explanation:
strip doesn't work in your case because it only removes the specified characters from the beginning or end of a string; it can't remove them from the middle.
Some examples:
div = "/nblablabla blablabla"
div = div.strip('/n')
#div = div.replace('/t','')
print(div)
Result:
blablabla blablabla
A character anywhere between the start and the end will not be removed:
div = "blablabla \n blablabla"
div = div.strip('\n')
print(repr(div))
result:
'blablabla \n blablabla'
You can split your text (split() discards all whitespace, including TABs) and join the fragments back again, using a single space as the "glue":
data = " ".join(data.split())
As others have mentioned, strip only removes characters from the start and end of a string, so it can't touch the \t and \n sitting inside your comments. With regex (re) it's easily possible: specify a pattern matching the characters you need to replace, and use the sub (substitute) method:
import re
data = re.sub(r'[\t\n ]+', ' ', data)
sub(<the characters to replace>, <to replace with>) - above we have set a pattern to get [\t\n ]+ , the + is for one or more, and [ ] is to specify the character class.
To handle sub and strip in single statement:
data = re.sub(r'[\t\n ]+', ' ', data).strip()
The data: with \t and \n
["Sir my humble submission is that please ask public not to man handle doctors because they work in a very delicate situation, to save a patient is not always in his hand. The incidents of manhandling doctors is increasing day by day and it's becoming very difficult to work in these situatons. Majority are not Opting for medical profession, it will create a crisis in medical field.In foreign no body can dare to manhandle a doctor, nurse, ambulance worker else he will be behind bars for 14 years.", 'Hello\n Sir.... Mera AK idea hai Jese bus ticket ki machine hai aur pata chalta hai ki din me kitni ticket nikali USSI TARH hum traffic police ko bhi aishi machine de to usee (1)JO MEMO DUPLICATE BANATE THE VO BHI NIKL JAYENGE MEANS A SAB LEGAL HO JAYEGA.... AUR HMARI SARKAR K TRAZERY ACCOUNT ME DIRECTLY CREDIT HO JANA CHI A TAKI SAB KO PATA CHALE KI HMARA JO TRAFIC POLICE NE FIND(DAND) LIYA HAI VO LIGALLY HAI... USEE\n1. SAB LOG TRAFIC STRIKLY FOLLOW KARENEGE...\n TAHNKYOU SIR..', 'Respect sir,\nI am Hindi teacher in one of the cbse school of Nagpur city.My question is that in 9th and10th STD. Why the subject HINDI is not compulsory. In the present pattern English language is Mandatory for students to learn but Our National Language HINDI is not .\nSir I request to update the pattern such that the Language hindi should be mandatory for the students of 9th and 10th.', 'Sir\nsuggestions AADHAR BASE SYSTEM\n1.Cash Less Education PAN India Centralised System\n2.Cash Less HEALTH POLICY for All & Centralised Rate MRP system\n3.All Private & Govt Hospitals must be CASH LESS\n4.All Toll Booth,Parking Etc CASHLESS Compulsory\n5.Compulsory HEALTH INsurance & AGRICULTURE Insurance for ALL\n6.All Bank, GOVT Sector, PVT Sector should produce Acknowledgements with TAT Mentioned\n7.Municipal Corporations/ZP must be CASH Less System\nAffordable Min Tax Housing\nCancel TDS', 'SIR KINDLY LOOK INTO MARITIME SECTOR SPECIALLY GOVERNMENT MARITIME TRAINING INSTITUTIONS REALLY CONDITIONS GOING WORST IT NEEDS IMMEDIATE CHANGES AND ATTENTION TO PROTECT OUR INDIAN REPUTATION IN MARITIME SECTOR.\nJAI HIND', ' ? ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 1.1 .2017 ?', ' 9 / ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 01.1 .2017 DOPPW/E/2017/03242, DOPPW/E/2017/04085, PMOPG/E/2017/0148952, PMOPG/E/2017/0115720 , PMOPG/E/2017/0123641 ', ' ,\n \n, ,\n\t Central Government and Central Autonomous Bodies pensioners/ family pensioners ? ']
Test Run:
import re
data = ["Sir my humble submission is that please ask public not to man handle doctors because they work in a very delicate situation, to save a patient is not always in his hand. The incidents of manhandling doctors is increasing day by day and it's becoming very difficult to work in these situatons. Majority are not Opting for medical profession, it will create a crisis in medical field.In foreign no body can dare to manhandle a doctor, nurse, ambulance worker else he will be behind bars for 14 years.", 'Hello\n Sir.... Mera AK idea hai Jese bus ticket ki machine hai aur pata chalta hai ki din me kitni ticket nikali USSI TARH hum traffic police ko bhi aishi machine de to usee (1)JO MEMO DUPLICATE BANATE THE VO BHI NIKL JAYENGE MEANS A SAB LEGAL HO JAYEGA.... AUR HMARI SARKAR K TRAZERY ACCOUNT ME DIRECTLY CREDIT HO JANA CHI A TAKI SAB KO PATA CHALE KI HMARA JO TRAFIC POLICE NE FIND(DAND) LIYA HAI VO LIGALLY HAI... USEE\n1. SAB LOG TRAFIC STRIKLY FOLLOW KARENEGE...\n TAHNKYOU SIR..', 'Respect sir,\nI am Hindi teacher in one of the cbse school of Nagpur city.My question is that in 9th and10th STD. Why the subject HINDI is not compulsory. In the present pattern English language is Mandatory for students to learn but Our National Language HINDI is not .\nSir I request to update the pattern such that the Language hindi should be mandatory for the students of 9th and 10th.', 'Sir\nsuggestions AADHAR BASE SYSTEM\n1.Cash Less Education PAN India Centralised System\n2.Cash Less HEALTH POLICY for All & Centralised Rate MRP system\n3.All Private & Govt Hospitals must be CASH LESS\n4.All Toll Booth,Parking Etc CASHLESS Compulsory\n5.Compulsory HEALTH INsurance & AGRICULTURE Insurance for ALL\n6.All Bank, GOVT Sector, PVT Sector should produce Acknowledgements with TAT Mentioned\n7.Municipal Corporations/ZP must be CASH Less System\nAffordable Min Tax Housing\nCancel TDS', 'SIR KINDLY LOOK INTO MARITIME SECTOR SPECIALLY GOVERNMENT MARITIME TRAINING INSTITUTIONS REALLY CONDITIONS GOING WORST IT NEEDS IMMEDIATE CHANGES AND ATTENTION TO PROTECT OUR INDIAN REPUTATION IN MARITIME SECTOR.\nJAI HIND', ' ? ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 1.1 .2017 ?', ' 9 / ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 01.1 .2017 DOPPW/E/2017/03242, DOPPW/E/2017/04085, PMOPG/E/2017/0148952, PMOPG/E/2017/0115720 , PMOPG/E/2017/0123641 ', ' ,\n \n, ,\n\t Central Government and Central Autonomous Bodies pensioners/ family pensioners ? ']
data_out = []
for s in data:
    data_out.append(re.sub(r'[\t\n ]+', ' ', s).strip())
The output:
["Sir my humble submission is that please ask public not to man handle
doctors because they work in a very delicate situation, to save a
patient is not always in his hand. The incidents of manhandling
doctors is increasing day by day and it's becoming very difficult to
work in these situatons. Majority are not Opting for medical
profession, it will create a crisis in medical field.In foreign no
body can dare to manhandle a doctor, nurse, ambulance worker else he
will be behind bars for 14 years.", 'Hello Sir.... Mera AK idea hai
Jese bus ticket ki machine hai aur pata chalta hai ki din me kitni
ticket nikali USSI TARH hum traffic police ko bhi aishi machine de to
usee (1)JO MEMO DUPLICATE BANATE THE VO BHI NIKL JAYENGE MEANS A SAB
LEGAL HO JAYEGA.... AUR HMARI SARKAR K TRAZERY ACCOUNT ME DIRECTLY
CREDIT HO JANA CHI A TAKI SAB KO PATA CHALE KI HMARA JO TRAFIC POLICE
NE FIND(DAND) LIYA HAI VO LIGALLY HAI... USEE 1. SAB LOG TRAFIC
STRIKLY FOLLOW KARENEGE... TAHNKYOU SIR..', 'Respect sir, I am Hindi
teacher in one of the cbse school of Nagpur city.My question is that
in 9th and10th STD. Why the subject HINDI is not compulsory. In the
present pattern English language is Mandatory for students to learn
but Our National Language HINDI is not . Sir I request to update the
pattern such that the Language hindi should be mandatory for the
students of 9th and 10th.', 'Sir suggestions AADHAR BASE SYSTEM 1.Cash
Less Education PAN India Centralised System 2.Cash Less HEALTH POLICY
for All & Centralised Rate MRP system 3.All Private & Govt Hospitals
must be CASH LESS 4.All Toll Booth,Parking Etc CASHLESS Compulsory
5.Compulsory HEALTH INsurance & AGRICULTURE Insurance for ALL 6.All Bank, GOVT Sector, PVT Sector should produce Acknowledgements with TAT
Mentioned 7.Municipal Corporations/ZP must be CASH Less System
Affordable Min Tax Housing Cancel TDS', 'SIR KINDLY LOOK INTO MARITIME
SECTOR SPECIALLY GOVERNMENT MARITIME TRAINING INSTITUTIONS REALLY
CONDITIONS GOING WORST IT NEEDS IMMEDIATE CHANGES AND ATTENTION TO
PROTECT OUR INDIAN REPUTATION IN MARITIME SECTOR. JAI HIND', '?', '9
Central Government and Central Autonomous Bodies pensioners/ family
pensioners 1 2016 , 1.1 .2017 ?', '9 /', '9 Central Government and
Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 01.1
.2017 DOPPW/E/2017/03242, DOPPW/E/2017/04085, PMOPG/E/2017/0148952,
PMOPG/E/2017/0115720 , PMOPG/E/2017/0123641', ', , , Central
Government and Central Autonomous Bodies pensioners/ family pensioners
?']
// using JavaScript
function removeLineBreaks(str = '') {
  return str.replace(/[\r\n\t ]+/gm, ' ').trim();
}

const data = removeLineBreaks('content \n some \t more \t content');
// data -> 'content some more content'

Text Information not scrape properly-Python

I need to scrape the text information from the following HTML. My code below does not work properly for cases where the tag and class names are the same: I need to get the text as a single list element, not as two different list elements. The code I have written handles the case where there is no split, as in the first sample below. I need to scrape both kinds of markup and append the text to a single list.
Sample HTML code(where list element is one)- working correctly:
<DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">The board of Hillshire Brands has withdrawn its recommendation to acquire frozen foods maker Pinnacle Foods, clearing the way for Tyson Foods' $8.55bn takeover bid.</SPAN><SPAN CLASS="c2"> </SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">Last Monday Tyson won the bidding war for Hillshire, maker of Ball Park hot dogs, with a $63-a-share offer, topping rival poultry processor Pilgrim's Pride's $7.7bn bid.</SPAN></P>
Sample HTML Code(where list element is two):
<DIV CLASS="c5"><BR><P CLASS="c6"><SPAN CLASS="c8">HIGHLIGHT:</SPAN><SPAN CLASS="c2"> News analysis<BR></SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">M&A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">Pickles may go with sandwiches, as Hillshire Brands chief executive Sean Connolly put it two weeks ago.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">But many were puzzled by the US food group's announcement that it would pay $6.6bn to acquire New Jersey-based rival Pinnacle Foods, maker of Vlasic pickles and Birds Eye frozen food.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">Without the sort of mooted cost savings necessary to justify the purchase price, many saw the move by Hillshire, known in the US for Ball Park hot dogs and Jimmy Dean sausages, as a way to head off a potential takeover.</SPAN><SPAN CLASS="c2"> </SPAN></P>
Python Code:
from itertools import groupby
from bs4 import BeautifulSoup
from lxml import html

soup = BeautifulSoup(response, 'html.parser')
tree = html.fromstring(response)
values = [[''.join(text for text in div.xpath('.//p[@class="c9"]//span[@class="c2"]//text()'))]
          for div in tree.xpath('//div[@class="c5"]') if div.getchildren()]
split_at = ','
textvalues = [list(g) for k, g in groupby(values, lambda x: x != split_at) if k]
list2 = [x for x in textvalues[0] if x]

def purify(list2):
    for (i, sl) in enumerate(list2):
        if type(sl) == list:
            list2[i] = purify(sl)
    return [i for i in list2 if i != [] and i != '']

list3 = purify(list2)
flattened = [val for sublist in list3 for val in sublist]
Current Output:
["M&A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi","--Remaining text--"]
Expected Sample Output:
["M&A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi --Remaining text--"]
Please help me to resolve the above issue.
Something like this?
from bs4 import BeautifulSoup
a="""
<DIV CLASS="c5"><BR><P CLASS="c6"><SPAN CLASS="c8">HIGHLIGHT:</SPAN><SPAN CLASS="c2"> News analysis<BR></SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">M&A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">Pickles may go with sandwiches, as Hillshire Brands chief executive Sean Connolly put it two weeks ago.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">But many were puzzled by the US food group's announcement that it would pay $6.6bn to acquire New Jersey-based rival Pinnacle Foods, maker of Vlasic pickles and Birds Eye frozen food.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">Without the sort of mooted cost savings necessary to justify the purchase price, many saw the move by Hillshire, known in the US for Ball Park hot dogs and Jimmy Dean sausages, as a way to head off a potential takeover.</SPAN><SPAN CLASS="c2"> </SPAN></P>
"""
l = BeautifulSoup(a, 'html.parser').text.split('\n')
b = [' '.join(l[1:])]
print(b)
Output:
["M&A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi Pickles may go with sandwiches, as Hillshire Brands chief executive Sean Connolly put it two weeks ago. But many were puzzled by the US food group's announcement that it would pay $6.6bn to acquire New Jersey-based rival Pinnacle Foods, maker of Vlasic pickles and Birds Eye frozen food. Without the sort of mooted cost savings necessary to justify the purchase price, many saw the move by Hillshire, known in the US for Ball Park hot dogs and Jimmy Dean sausages, as a way to head off a potential takeover.\xa0 "]
text = '''<DIV CLASS="c5"><BR><P CLASS="c6"><SPAN CLASS="c8">HIGHLIGHT:</SPAN><SPAN CLASS="c2"> News analysis<BR></SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">M&A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">Pickles may go with sandwiches, as Hillshire Brands chief executive Sean Connolly put it two weeks ago.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">But many were puzzled by the US food group's announcement that it would pay $6.6bn to acquire New Jersey-based rival Pinnacle Foods, maker of Vlasic pickles and Birds Eye frozen food.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">Without the sort of mooted cost savings necessary to justify the purchase price, many saw the move by Hillshire, known in the US for Ball Park hot dogs and Jimmy Dean sausages, as a way to head off a potential takeover.</SPAN><SPAN CLASS="c2"> </SPAN></P>'''
from lxml import etree

html = etree.HTML(text)
res = html.xpath('//span[@class="c2" and ../@class="c9"]/text()')
print([''.join(res)])
out:
["M&A simmers as producers swallow up brands to win shelf space, writes Neil MunhsiPickles may go with sandwiches, as Hillshire Brands chief executive Sean Connolly put it two weeks ago.But many were puzzled by the US food group's announcement that it would pay $6.6bn to acquire New Jersey-based rival Pinnacle Foods, maker of Vlasic pickles and Birds Eye frozen food.Without the sort of mooted cost savings necessary to justify the purchase price, many saw the move by Hillshire, known in the US for Ball Park hot dogs and Jimmy Dean sausages, as a way to head off a potential takeover.\xa0"]

Python text cleaning and matching

I am trying to get a cleaned address from a bunch of addresses.
These are different addresses I have for Harvard University. What I want is to convert all these addresses to "Harvard University".
1) Division of Renal Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Harvard Institutes of Medicine Suite 550, 4 Blackfan Circle, Boston, Massachusetts 02115, USA
2) FAS Center for Systems Biology, Harvard University, Cambridge, Massachusetts 02138
3) Department of Neurobiology, Howard Hughes Medical Institute, Harvard Medical School, Boston, Massachusetts, United States of America.
Just simple text matching doesn't work. So, I tried difflib.
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
a = "Division of Renal Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Harvard Institutes of Medicine Suite 550, 4 Blackfan Circle, Boston, Massachusetts 02115, USA"
b = "Harvard University"
print(similar(a, b)) # gives 0.11981566820276497
print(similar(a, "Toronto University")) # gives 0.04608294930875576
But I think this approach won't give correct results for my data set. How can I set a threshold for the similarity? Can anyone recommend a better approach?
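One possibility, staying with difflib: a long address dilutes the ratio, so compare the target against each comma-separated segment of the address and threshold the best segment score. A minimal sketch (best_segment_ratio is a hypothetical helper, and the 0.6 cutoff is an assumption you would tune on labelled examples):
from difflib import SequenceMatcher

def best_segment_ratio(address, target):
    # Score the target against each comma-separated segment so the
    # rest of a long address does not drag the ratio down.
    return max(SequenceMatcher(None, seg.strip().lower(), target.lower()).ratio()
               for seg in address.split(','))

a = ("Division of Renal Medicine, Department of Medicine, Brigham and "
     "Women's Hospital, Harvard Medical School, Harvard Institutes of "
     "Medicine Suite 550, 4 Blackfan Circle, Boston, Massachusetts 02115, USA")

print(best_segment_ratio(a, "Harvard University"))   # much higher than 0.12
print(best_segment_ratio(a, "Toronto University"))

THRESHOLD = 0.6  # assumed cutoff; tune it on addresses you have labelled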
