Removing \n and \t from data scraped from a website - Python

I am trying to remove the \n and \t characters that show up in data scraped from a webpage.
I have used the strip() function, but it doesn't seem to work for some reason.
My output still shows up with all the \ns and \ts.
Here's my code:
import urllib.request
from bs4 import BeautifulSoup
import sys

all_comments = []
max_comments = 10
base_url = 'https://www.mygov.in/'
next_page = base_url + '/group-issue/share-your-ideas-pm-narendra-modis-mann-ki-baat-26th-march-2017/'

while next_page and len(all_comments) < max_comments:
    response = urllib.request.urlopen(next_page)
    srcode = response.read()
    soup = BeautifulSoup(srcode, "html.parser")
    all_comments_div = soup.find_all('div', class_="comment_body")
    for div in all_comments_div:
        data = div.find('p').text
        data = data.strip(' \t\n')  # actual comment content
        data = ''.join([i for i in data if ord(i) < 128])
        all_comments.append(data)
    # getting the link of the stream for more comments
    next_page = soup.find('li', class_='pager-next first last')
    if next_page:
        next_page = base_url + next_page.find('a').get('href')

print('comments: {}'.format(len(all_comments)))
print(all_comments)
And here's the output I'm getting:
comments: 10
["Sir my humble submission is that please ask public not to man handle doctors because they work in a very delicate situation, to save a patient is not always in his hand. The incidents of manhandling doctors is increasing day by day and it's becoming very difficult to work in these situatons. Majority are not Opting for medical profession, it will create a crisis in medical field.In foreign no body can dare to manhandle a doctor, nurse, ambulance worker else he will be behind bars for 14 years.", 'Hello\n Sir.... Mera AK idea hai Jese bus ticket ki machine hai aur pata chalta hai ki din me kitni ticket nikali USSI TARH hum traffic police ko bhi aishi machine de to usee (1)JO MEMO DUPLICATE BANATE THE VO BHI NIKL JAYENGE MEANS A SAB LEGAL HO JAYEGA.... AUR HMARI SARKAR K TRAZERY ACCOUNT ME DIRECTLY CREDIT HO JANA CHI A TAKI SAB KO PATA CHALE KI HMARA JO TRAFIC POLICE NE FIND(DAND) LIYA HAI VO LIGALLY HAI... USEE\n1. SAB LOG TRAFIC STRIKLY FOLLOW KARENEGE...\n TAHNKYOU SIR..', 'Respect sir,\nI am Hindi teacher in one of the cbse school of Nagpur city.My question is that in 9th and10th STD. Why the subject HINDI is not compulsory. In the present pattern English language is Mandatory for students to learn but Our National Language HINDI is not .\nSir I request to update the pattern such that the Language hindi should be mandatory for the students of 9th and 10th.', 'Sir\nsuggestions AADHAR BASE SYSTEM\n1.Cash Less Education PAN India Centralised System\n2.Cash Less HEALTH POLICY for All & Centralised Rate MRP system\n3.All Private & Govt Hospitals must be CASH LESS\n4.All Toll Booth,Parking Etc CASHLESS Compulsory\n5.Compulsory HEALTH INsurance & AGRICULTURE Insurance for ALL\n6.All Bank, GOVT Sector, PVT Sector should produce Acknowledgements with TAT Mentioned\n7.Municipal Corporations/ZP must be CASH Less System\nAffordable Min Tax Housing\nCancel TDS', 'SIR KINDLY LOOK INTO MARITIME SECTOR SPECIALLY GOVERNMENT MARITIME TRAINING INSTITUTIONS REALLY CONDITIONS GOING WORST IT NEEDS IMMEDIATE CHANGES AND ATTENTION TO PROTECT OUR INDIAN REPUTATION IN MARITIME SECTOR.\nJAI HIND', ' ? ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 1.1 .2017 ?', ' 9 / ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 01.1 .2017 DOPPW/E/2017/03242, DOPPW/E/2017/04085, PMOPG/E/2017/0148952, PMOPG/E/2017/0115720 , PMOPG/E/2017/0123641 ', ' ,\n \n, ,\n\t Central Government and Central Autonomous Bodies pensioners/ family pensioners ? ']

strip() only removes characters from the ends of a string. To remove characters inside a string you need to use either replace or re.sub.
So change:
data = data.strip(' \t\n')
To:
import re
data = re.sub(r'[\t\n ]+', ' ', data).strip()
To remove the \t and \n characters.
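For example, a quick check (a minimal sketch with a shortened sample string, not your real data):
import re

sample = 'Hello\n Sir....\n1. SAB LOG TRAFIC STRIKLY FOLLOW KARENEGE...\n TAHNKYOU SIR..'
print(re.sub(r'[\t\n ]+', ' ', sample).strip())
# Hello Sir.... 1. SAB LOG TRAFIC STRIKLY FOLLOW KARENEGE... TAHNKYOU SIR..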

Use replace instead of strip:
div = "\nblablabla \tblablabla"
div = div.replace('\n', '')
div = div.replace('\t', '')
print(div)
Result:
blablabla blablabla
Some explanation:
strip() doesn't work in your case because it only removes the specified characters from the beginning or end of a string; it can't remove them from the middle.
Some examples:
div = "\nblablabla blablabla"
div = div.strip('\n')
print(div)
Result:
blablabla blablabla
A character anywhere between the start and the end will not be removed:
div = "blablabla \n blablabla"
div = div.strip('\n')
print(repr(div))  # repr() shows the \n that is still in the middle
Result:
'blablabla \n blablabla'

You can split your text (which will remove all whitespace, including tabs and newlines) and join the fragments back together, using a single space as the "glue":
data = " ".join(data.split())

As others have mentioned, strip only removes characters from the start and end of a string. To remove specific characters, i.e. \t and \n in your case, anywhere in the string, regex (re) makes it easy. Specify a pattern to match the characters you need to replace. The method you need is sub (substitute):
import re
data = re.sub(r'[\t\n ]+', ' ', data)
re.sub(<pattern>, <replacement>, <string>) - above, the pattern [\t\n ]+ matches runs of tabs, newlines and spaces; the + means one or more, and the [ ] specifies a character class.
To handle sub and strip in a single statement:
data = re.sub(r'[\t\n ]+', ' ', data).strip()
The data, with \t and \n:
["Sir my humble submission is that please ask public not to man handle doctors because they work in a very delicate situation, to save a patient is not always in his hand. The incidents of manhandling doctors is increasing day by day and it's becoming very difficult to work in these situatons. Majority are not Opting for medical profession, it will create a crisis in medical field.In foreign no body can dare to manhandle a doctor, nurse, ambulance worker else he will be behind bars for 14 years.", 'Hello\n Sir.... Mera AK idea hai Jese bus ticket ki machine hai aur pata chalta hai ki din me kitni ticket nikali USSI TARH hum traffic police ko bhi aishi machine de to usee (1)JO MEMO DUPLICATE BANATE THE VO BHI NIKL JAYENGE MEANS A SAB LEGAL HO JAYEGA.... AUR HMARI SARKAR K TRAZERY ACCOUNT ME DIRECTLY CREDIT HO JANA CHI A TAKI SAB KO PATA CHALE KI HMARA JO TRAFIC POLICE NE FIND(DAND) LIYA HAI VO LIGALLY HAI... USEE\n1. SAB LOG TRAFIC STRIKLY FOLLOW KARENEGE...\n TAHNKYOU SIR..', 'Respect sir,\nI am Hindi teacher in one of the cbse school of Nagpur city.My question is that in 9th and10th STD. Why the subject HINDI is not compulsory. In the present pattern English language is Mandatory for students to learn but Our National Language HINDI is not .\nSir I request to update the pattern such that the Language hindi should be mandatory for the students of 9th and 10th.', 'Sir\nsuggestions AADHAR BASE SYSTEM\n1.Cash Less Education PAN India Centralised System\n2.Cash Less HEALTH POLICY for All & Centralised Rate MRP system\n3.All Private & Govt Hospitals must be CASH LESS\n4.All Toll Booth,Parking Etc CASHLESS Compulsory\n5.Compulsory HEALTH INsurance & AGRICULTURE Insurance for ALL\n6.All Bank, GOVT Sector, PVT Sector should produce Acknowledgements with TAT Mentioned\n7.Municipal Corporations/ZP must be CASH Less System\nAffordable Min Tax Housing\nCancel TDS', 'SIR KINDLY LOOK INTO MARITIME SECTOR SPECIALLY GOVERNMENT MARITIME TRAINING INSTITUTIONS REALLY CONDITIONS GOING WORST IT NEEDS IMMEDIATE CHANGES AND ATTENTION TO PROTECT OUR INDIAN REPUTATION IN MARITIME SECTOR.\nJAI HIND', ' ? ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 1.1 .2017 ?', ' 9 / ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 01.1 .2017 DOPPW/E/2017/03242, DOPPW/E/2017/04085, PMOPG/E/2017/0148952, PMOPG/E/2017/0115720 , PMOPG/E/2017/0123641 ', ' ,\n \n, ,\n\t Central Government and Central Autonomous Bodies pensioners/ family pensioners ? ']
Test Run:
import re
data = ["Sir my humble submission is that please ask public not to man handle doctors because they work in a very delicate situation, to save a patient is not always in his hand. The incidents of manhandling doctors is increasing day by day and it's becoming very difficult to work in these situatons. Majority are not Opting for medical profession, it will create a crisis in medical field.In foreign no body can dare to manhandle a doctor, nurse, ambulance worker else he will be behind bars for 14 years.", 'Hello\n Sir.... Mera AK idea hai Jese bus ticket ki machine hai aur pata chalta hai ki din me kitni ticket nikali USSI TARH hum traffic police ko bhi aishi machine de to usee (1)JO MEMO DUPLICATE BANATE THE VO BHI NIKL JAYENGE MEANS A SAB LEGAL HO JAYEGA.... AUR HMARI SARKAR K TRAZERY ACCOUNT ME DIRECTLY CREDIT HO JANA CHI A TAKI SAB KO PATA CHALE KI HMARA JO TRAFIC POLICE NE FIND(DAND) LIYA HAI VO LIGALLY HAI... USEE\n1. SAB LOG TRAFIC STRIKLY FOLLOW KARENEGE...\n TAHNKYOU SIR..', 'Respect sir,\nI am Hindi teacher in one of the cbse school of Nagpur city.My question is that in 9th and10th STD. Why the subject HINDI is not compulsory. In the present pattern English language is Mandatory for students to learn but Our National Language HINDI is not .\nSir I request to update the pattern such that the Language hindi should be mandatory for the students of 9th and 10th.', 'Sir\nsuggestions AADHAR BASE SYSTEM\n1.Cash Less Education PAN India Centralised System\n2.Cash Less HEALTH POLICY for All & Centralised Rate MRP system\n3.All Private & Govt Hospitals must be CASH LESS\n4.All Toll Booth,Parking Etc CASHLESS Compulsory\n5.Compulsory HEALTH INsurance & AGRICULTURE Insurance for ALL\n6.All Bank, GOVT Sector, PVT Sector should produce Acknowledgements with TAT Mentioned\n7.Municipal Corporations/ZP must be CASH Less System\nAffordable Min Tax Housing\nCancel TDS', 'SIR KINDLY LOOK INTO MARITIME SECTOR SPECIALLY GOVERNMENT MARITIME TRAINING INSTITUTIONS REALLY CONDITIONS GOING WORST IT NEEDS IMMEDIATE CHANGES AND ATTENTION TO PROTECT OUR INDIAN REPUTATION IN MARITIME SECTOR.\nJAI HIND', ' ? ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 1.1 .2017 ?', ' 9 / ', ' 9 Central Government and Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 01.1 .2017 DOPPW/E/2017/03242, DOPPW/E/2017/04085, PMOPG/E/2017/0148952, PMOPG/E/2017/0115720 , PMOPG/E/2017/0123641 ', ' ,\n \n, ,\n\t Central Government and Central Autonomous Bodies pensioners/ family pensioners ? ']
data_out = []
for s in data:
    data_out.append(re.sub(r'[\t\n ]+', ' ', s).strip())
The output:
["Sir my humble submission is that please ask public not to man handle
doctors because they work in a very delicate situation, to save a
patient is not always in his hand. The incidents of manhandling
doctors is increasing day by day and it's becoming very difficult to
work in these situatons. Majority are not Opting for medical
profession, it will create a crisis in medical field.In foreign no
body can dare to manhandle a doctor, nurse, ambulance worker else he
will be behind bars for 14 years.", 'Hello Sir.... Mera AK idea hai
Jese bus ticket ki machine hai aur pata chalta hai ki din me kitni
ticket nikali USSI TARH hum traffic police ko bhi aishi machine de to
usee (1)JO MEMO DUPLICATE BANATE THE VO BHI NIKL JAYENGE MEANS A SAB
LEGAL HO JAYEGA.... AUR HMARI SARKAR K TRAZERY ACCOUNT ME DIRECTLY
CREDIT HO JANA CHI A TAKI SAB KO PATA CHALE KI HMARA JO TRAFIC POLICE
NE FIND(DAND) LIYA HAI VO LIGALLY HAI... USEE 1. SAB LOG TRAFIC
STRIKLY FOLLOW KARENEGE... TAHNKYOU SIR..', 'Respect sir, I am Hindi
teacher in one of the cbse school of Nagpur city.My question is that
in 9th and10th STD. Why the subject HINDI is not compulsory. In the
present pattern English language is Mandatory for students to learn
but Our National Language HINDI is not . Sir I request to update the
pattern such that the Language hindi should be mandatory for the
students of 9th and 10th.', 'Sir suggestions AADHAR BASE SYSTEM 1.Cash
Less Education PAN India Centralised System 2.Cash Less HEALTH POLICY
for All & Centralised Rate MRP system 3.All Private & Govt Hospitals
must be CASH LESS 4.All Toll Booth,Parking Etc CASHLESS Compulsory
5.Compulsory HEALTH INsurance & AGRICULTURE Insurance for ALL 6.All Bank, GOVT Sector, PVT Sector should produce Acknowledgements with TAT
Mentioned 7.Municipal Corporations/ZP must be CASH Less System
Affordable Min Tax Housing Cancel TDS', 'SIR KINDLY LOOK INTO MARITIME
SECTOR SPECIALLY GOVERNMENT MARITIME TRAINING INSTITUTIONS REALLY
CONDITIONS GOING WORST IT NEEDS IMMEDIATE CHANGES AND ATTENTION TO
PROTECT OUR INDIAN REPUTATION IN MARITIME SECTOR. JAI HIND', '?', '9
Central Government and Central Autonomous Bodies pensioners/ family
pensioners 1 2016 , 1.1 .2017 ?', '9 /', '9 Central Government and
Central Autonomous Bodies pensioners/ family pensioners 1 2016 , 01.1
.2017 DOPPW/E/2017/03242, DOPPW/E/2017/04085, PMOPG/E/2017/0148952,
PMOPG/E/2017/0115720 , PMOPG/E/2017/0123641', ', , , Central
Government and Central Autonomous Bodies pensioners/ family pensioners
?']

// using JavaScript
function removeLineBreaks(str = '') {
  return str.replace(/[\r\n\t ]+/gm, ' ').trim();
}
const data = removeLineBreaks('content \n some \t more \t content');
// data -> 'content some more content'

Related

How to highlight MULTIPLE matching sequences of words in two strings in Python?

I'm using the code below to highlight a single matching sequence. (Just copy-paste it into a new Colab notebook; it'll work perfectly.)
import textwrap
from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer
from difflib import SequenceMatcher
import nltk
nltk.download('punkt')
print('')
text1 = \
'''
commonly known as the United States (U.S. or US) or America, is a transcontinental country located primarily in North America.
'''
text2 = \
'''
The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a transcontinental country located primarily in North America. It consists of 50 states, a federal district, five major unincorporated territories, nine minor outlying islands,[j] and 326 Indian reservations. It is the third-largest country by both land and total area.[d] The United States shares land borders with Canada to its north and with Mexico to its south. It has maritime borders with the Bahamas, Cuba, Russia, and other nations.[k] With a population of over 331 million,[e] it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city and financial center is New York City.
Paleo-aboriginals migrated from Siberia to the North American mainland at least 12,000 years ago, and advanced cultures began to appear later on. These advanced cultures had almost completely declined by the time European colonists arrived during the 16th century. The United States emerged from the Thirteen British Colonies established along the East Coast when disputes with the British Crown over taxation and political representation led to the American Revolution (1765–1784), which established the nation's independence. In the late 18th century, the U.S. began expanding across North America, gradually obtaining new territories, sometimes through war, frequently displacing Native Americans, and admitting new states. By 1848, the United States spanned the continent from east to west. The controversy surrounding the practice of slavery culminated in the secession of the Confederate States of America, which fought the remaining states of the Union during the American Civil War (1861–1865). With the Union's victory and preservation, slavery was abolished by the Thirteenth Amendment.
'''
temp = SequenceMatcher(None, word_tokenize(text1), word_tokenize(text2))
print(temp.get_matching_blocks())
print('Similarity Score: ', temp.ratio())
print('')
search_length = len(text1)
total_length = len(text2)
matching_blocks = temp.get_matching_blocks()
beginning = matching_blocks[0][0]
start = matching_blocks[0][1]
stop = (matching_blocks[0][1] + matching_blocks[0][2])
end = matching_blocks[1][1]
tokenized = word_tokenize(text2)
before_match = TreebankWordDetokenizer().detokenize(tokenized[beginning:start])
match = TreebankWordDetokenizer().detokenize(tokenized[start:stop])
after_match = TreebankWordDetokenizer().detokenize(tokenized[stop:end])
print(textwrap.fill(before_match + '\x1b[0;30;42m' + match + '\x1b[0m' + after_match, 150))
print('')
print('Percentage Similarity: ' + str(round(((search_length/(total_length + search_length)) * 100), 2)) + '%')
Now when I try highlighting multiple sequences, the code breaks (it doesn't show the full text, and it doesn't highlight the second or later sequences).
import textwrap
from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer
from difflib import SequenceMatcher
import nltk
nltk.download('punkt')
print('')
text1 = \
'''
commonly known as the United States (U.S. or US) or America, is a transcontinental country located primarily in North America. North American mainland at least 12,000 years ago, and advanced cultures began to appear later on.
'''
text2 = \
'''
The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a transcontinental country located primarily in North America. It consists of 50 states, a federal district, five major unincorporated territories, nine minor outlying islands,[j] and 326 Indian reservations. It is the third-largest country by both land and total area.[d] The United States shares land borders with Canada to its north and with Mexico to its south. It has maritime borders with the Bahamas, Cuba, Russia, and other nations.[k] With a population of over 331 million,[e] it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city and financial center is New York City.
Paleo-aboriginals migrated from Siberia to the North American mainland at least 12,000 years ago, and advanced cultures began to appear later on. These advanced cultures had almost completely declined by the time European colonists arrived during the 16th century. The United States emerged from the Thirteen British Colonies established along the East Coast when disputes with the British Crown over taxation and political representation led to the American Revolution (1765–1784), which established the nation's independence. In the late 18th century, the U.S. began expanding across North America, gradually obtaining new territories, sometimes through war, frequently displacing Native Americans, and admitting new states. By 1848, the United States spanned the continent from east to west. The controversy surrounding the practice of slavery culminated in the secession of the Confederate States of America, which fought the remaining states of the Union during the American Civil War (1861–1865). With the Union's victory and preservation, slavery was abolished by the Thirteenth Amendment.
'''
temp = SequenceMatcher(None, word_tokenize(text1), word_tokenize(text2))
print(temp.get_matching_blocks())
print('Similarity Score: ', temp.ratio())
print('')
search_length = len(text1)
total_length = len(text2)
matching_blocks = temp.get_matching_blocks()
beginning = matching_blocks[0][0]
start = matching_blocks[0][1]
stop = (matching_blocks[0][1] + matching_blocks[0][2])
end = matching_blocks[1][1]
tokenized = word_tokenize(text2)
before_match = TreebankWordDetokenizer().detokenize(tokenized[beginning:start])
match = TreebankWordDetokenizer().detokenize(tokenized[start:stop])
after_match = TreebankWordDetokenizer().detokenize(tokenized[stop:end])
print(textwrap.fill(before_match + '\x1b[0;30;42m' + match + '\x1b[0m' + after_match, 150))
print('')
print('Percentage Similarity: ' + str(round(((search_length/(total_length + search_length)) * 100), 2)) + '%')
I need to highlight at least 2 sequences. I'm trying to put together some sort of if/else statement right now; maybe it'll work. Or is there a better library?
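For reference, here is a minimal sketch (an untested outline, not a verified fix) that iterates over every block returned by get_matching_blocks() instead of hard-coding the first two indices, which is what limits the code above to a single highlight:
from difflib import SequenceMatcher
from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

def highlight_all_matches(text1, text2):
    # tokenize both texts and find all matching blocks
    tokens1, tokens2 = word_tokenize(text1), word_tokenize(text2)
    matcher = SequenceMatcher(None, tokens1, tokens2)
    detok = TreebankWordDetokenizer()
    pieces = []
    prev_end = 0
    # get_matching_blocks() yields (a, b, size) triples and ends with a
    # zero-size sentinel; b indexes into tokens2
    for _, start, size in matcher.get_matching_blocks():
        if size == 0:
            continue
        # unhighlighted text before this match
        pieces.append(detok.detokenize(tokens2[prev_end:start]))
        # the match itself, wrapped in the same ANSI green-background codes
        pieces.append('\x1b[0;30;42m'
                      + detok.detokenize(tokens2[start:start + size])
                      + '\x1b[0m')
        prev_end = start + size
    # any remaining text after the last match
    pieces.append(detok.detokenize(tokens2[prev_end:]))
    return ' '.join(pieces)

# text1 and text2 as defined in the snippet above
print(highlight_all_matches(text1, text2))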

Tokenizing sentences a special way

from os import listdir
from os.path import isfile, join
from datasets import load_dataset
from transformers import BertTokenizer
test_files = [join('./test/', f) for f in listdir('./test') if isfile(join('./test', f))]
dataset = load_dataset('json', data_files={"test": test_files}, cache_dir="./.cache_dir")
After running the code, here's the output of dataset["test"]["abstract"]:
[['eleven politicians from 7 parties made comments in letter to a newspaper .',
"said dpp alison saunders had ` damaged public confidence ' in justice .",
'ms saunders ruled lord janner unfit to stand trial over child abuse claims .',
'the cps has pursued at least 19 suspected paedophiles with dementia .'],
['an increasing number of surveys claim to reveal what makes us happiest .',
'but are these generic lists really of any use to us ?',
'janet street-porter makes her own list - of things making her unhappy !'],
["author of ` into the wild ' spoke to five rape victims in missoula , montana .",
"` missoula : rape and the justice system in a college town ' was released april 21 .",
"three of five victims profiled in the book sat down with abc 's nightline wednesday night .",
'kelsey belnap , allison huguet and hillary mclaughlin said they had been raped by university of montana football '
'players .',
"huguet and mclaughlin 's attacker , beau donaldson , pleaded guilty to rape in 2012 and was sentenced to 10 years .",
'belnap claimed four players gang-raped her in 2010 , but prosecutors never charged them citing lack of probable '
'cause .',
'mr krakauer wrote book after realizing close friend was a rape victim .'],
['tesco announced a record annual loss of £ 6.38 billion yesterday .',
'drop in sales , one-off costs and pensions blamed for financial loss .',
'supermarket giant now under pressure to close 200 stores nationwide .',
'here , retail industry veterans , plus mail writers , identify what went wrong .'],
...,
['snp leader said alex salmond did not field questions over his family .',
"said she was not ` moaning ' but also attacked criticism of women 's looks .",
'she made the remarks in latest programme profiling the main party leaders .',
'ms sturgeon also revealed her tv habits and recent image makeover .',
'she said she relaxed by eating steak and chips on a saturday night .']]
I would like each sentence to be tokenized with this structure. How can I do that using huggingface? In fact, I think I have to flatten each list of the above list to get a list of strings, and then tokenize each string.
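A possible sketch along those lines (my reading of the intent; "bert-base-uncased" is just a placeholder checkpoint): flatten the nested lists into one list of sentences, then pass the whole list to the tokenizer, which accepts batches of strings:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

abstracts = dataset["test"]["abstract"]                      # list[list[str]]
sentences = [s for abstract in abstracts for s in abstract]  # flatten to list[str]
# batch-tokenize every sentence in one call
encodings = tokenizer(sentences, padding=True, truncation=True)
print(encodings["input_ids"][0])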

Create tuples of (lemma, NER type) in Python - NLP problem

I wrote the code below and made a dictionary for it, but I want to create tuples of (lemma, NER type) and collect counts over the tuples. I don't know how to do it; can you please help me? "NER type" means named entity recognition type.
text = """
Seville.
Summers in the flamboyant Andalucían capital often nudge 40C, but spring is a delight, with the parks in bloom and the scent of orange blossom and jasmine in the air. And in Semana Santa (Holy Week, 14-20 April) the streets come alive with floats and processions. There is also the raucous annual Feria de Abril – a week-long fiesta of parades, flamenco and partying long into the night (4-11 May; expect higher hotel prices if you visit then).
Seville is a romantic and energetic place, with sights aplenty, from the Unesco-listed cathedral – the largest Gothic cathedral in the world – to the beautiful Alcázar royal palace. But days here are best spent simply wandering the medieval streets of Santa Cruz and along the river to La Real Maestranza, Spain’s most spectacular bullring.
Seville is the birthplace of tapas and perfect for a foodie break – join a tapas tour (try devoursevillefoodtours.com), or stop at the countless bars for a glass of sherry with local jamón ibérico (check out Bar Las Teresas in Santa Cruz or historic Casa Morales in Constitución). Great food markets include the Feria, the oldest, and the wooden, futuristic-looking Metropol Parasol.
Nightlife is, unsurprisingly, late and lively. For flamenco, try one of the peñas, or flamenco social clubs – Torres Macarena on C/Torrijano, perhaps – with bars open across town until the early hours.
Book it: In an atmospheric 18th-century house, the Hospes Casa del Rey de Baeza is a lovely place to stay in lively Santa Cruz. Doubles from £133 room only, hospes.com
Trieste.
"""
doc = nlp(text).ents
en = [(entity.text, entity.label_) for entity in doc]
# the list stored in en has type list[tuple[str, str]]
from pprint import pprint
pprint(en)

from collections import defaultdict
type2entities = defaultdict(list)
for entity, entity_type in en:
    type2entities[entity_type].append(entity)
pprint(type2entities)
I hope the following code snippet solves your problem.
import spacy
# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_sm")
text = ("Seville.
Summers in the flamboyant Andalucían capital often nudge 40C, but spring is a delight, with the parks in bloom and the scent of orange blossom and jasmine in the air. And in Semana Santa (Holy Week, 14-20 April) the streets come alive with floats and processions. There is also the raucous annual Feria de Abril – a week-long fiesta of parades, flamenco and partying long into the night (4-11 May; expect higher hotel prices if you visit then).
Seville is a romantic and energetic place, with sights aplenty, from the Unesco-listed cathedral – the largest Gothic cathedral in the world – to the beautiful Alcázar royal palace. But days here are best spent simply wandering the medieval streets of Santa Cruz and along the river to La Real Maestranza, Spain’s most spectacular bullring.
Seville is the birthplace of tapas and perfect for a foodie break – join a tapas tour (try devoursevillefoodtours.com), or stop at the countless bars for a glass of sherry with local jamón ibérico (check out Bar Las Teresas in Santa Cruz or historic Casa Morales in Constitución). Great food markets include the Feria, the oldest, and the wooden, futuristic-looking Metropol Parasol.
Nightlife is, unsurprisingly, late and lively. For flamenco, try one of the peñas, or flamenco social clubs – Torres Macarena on C/Torrijano, perhaps – with bars open across town until the early hours.
Book it: In an atmospheric 18th-century house, the Hospes Casa del Rey de Baeza is a lovely place to stay in lively Santa Cruz. Doubles from £133 room only, hospes.com
Trieste.")
doc = nlp(text)
lemma_ner_list = []
for entity in doc.ents:
    lemma_ner_list.append((entity.lemma_, entity.label_))
# print list of lemma ner tuples
print(lemma_ner_list)
# print count of tuples
print(len(lemma_ner_list))
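If "collect counts over the tuples" means counting how often each (lemma, NER type) pair occurs rather than the total length, a Counter-based sketch:
from collections import Counter

# count how many times each (lemma, NER type) tuple occurs
tuple_counts = Counter(lemma_ner_list)
for (lemma, ner_type), count in tuple_counts.most_common():
    print(lemma, ner_type, count)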

Extract information from part of a URL in Python

I have a list of 200k urls, with the general format of:
http[s]://..../..../the-headline-of-the-article
OR
http[s]://..../..../the-headline-of-the-article/....
The number of / before and after the-headline-of-the-article varies.
Here is some sample data:
'http://catholicphilly.com/2019/03/news/national-news/call-to-end-affordable-care-act-is-immoral-says-cha-president/',
'https://www.houmatoday.com/news/20190327/new-website-puts-louisiana-art-on-businesses-walls',
'https://feltonbusinessnews.com/global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429/149601/',
'http://www.bristolpress.com/BP-General+News/347592/female-music-art-to-take-center-stage-at-swan-day-in-new-britain',
'https://www.sfgate.com/business/article/Trump-orders-Treasury-HUD-to-develop-new-plan-13721842.php',
'https://industrytoday.co.uk/it/research-delivers-insight-into-the-global-business-voip-services-market-during-the-period-2018-2025',
'https://news.yahoo.com/why-mirza-international-limited-nse-233259149.html',
'https://www.indianz.com/IndianGaming/2019/03/27/indian-gaming-industry-grows-in-revenues.asp',
'https://www.yahoo.com/entertainment/facebook-instagram-banning-pro-white-210002719.html',
'https://www.marketwatch.com/press-release/fluence-receives-another-aspiraltm-bulk-order-with-partner-itest-in-china-2019-03-27',
'https://www.valleymorningstar.com/news/elections/top-firms-decry-religious-exemption-bills-proposed-in-texas/article_68a5c4d6-2f72-5a6e-8abd-4f04a44ee74f.html',
'https://tucson.com/news/national/correction-trump-investigations-sater-lawsuit-story/article_ed20e441-de30-5b57-aafd-b1f7d7929f71.html',
'https://www.publicradiotulsa.org/post/weather-channel-sued-125-million-over-death-storm-chase-collision',
I want to extract the-headline-of-the-article only, i.e.
call-to-end-affordable-care-act-is-immoral-says-cha-president
global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429
correction-trump-investigations-sater-lawsuit-story
I am sure this is possible, but am relatively new with regex in python.
In pseudocode, I was thinking:
split everything by /
keep only the chunk that contains -
replace all - with \s
Is this possible in python (I am a python n00b)?
urls = [...]

for url in urls:
    bits = url.split('/')  # split each url at the '/'
    bits_with_hyphens = [bit.replace('-', ' ') for bit in bits if '-' in bit]  # [1]
    print(bits_with_hyphens)
[1] Note that your algorithm assumes that only one of the fragments after splitting the url will have a hyphen, which is not correct given your examples. So at [1], I'm keeping all the bits that do so.
Output:
['national news', 'call to end affordable care act is immoral says cha president']
['new website puts louisiana art on businesses walls']
['global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits 456 69429']
['BP General+News', 'female music art to take center stage at swan day in new britain']
['Trump orders Treasury HUD to develop new plan 13721842.php']
['research delivers insight into the global business voip services market during the period 2018 2025']
['why mirza international limited nse 233259149.html']
['indian gaming industry grows in revenues.asp']
['facebook instagram banning pro white 210002719.html']
['press release', 'fluence receives another aspiraltm bulk order with partner itest in china 2019 03 27']
['top firms decry religious exemption bills proposed in texas', 'article_68a5c4d6 2f72 5a6e 8abd 4f04a44ee74f.html']
['correction trump investigations sater lawsuit story', 'article_ed20e441 de30 5b57 aafd b1f7d7929f71.html']
['weather channel sued 125 million over death storm chase collision']
PS. I think your algorithm could do with a bit of thought. Problems that I see:
more than one bit might contain a hyphen, where:
both only contain dictionary words (see first and fourth output)
one of them is "clearly" not a headline (see second and third from bottom)
spurious string fragments at the end of the real headline: eg "13721842.php", "revenues.asp", "210002719.html"
Need to substitute in a space for characters other than '/', (see fourth, "General+News")
Here's a slightly different variation which seems to produce good results from the samples you provided.
Out of the parts with dashes, we trim off any trailing hex strings and file name extensions; then we extract the one with the largest number of dashes from each URL, and finally replace the remaining dashes with spaces.
import re

regex = re.compile(r'(-[0-9a-f]+)*(\.[a-z]+)?$', re.IGNORECASE)
for url in urls:
    parts = url.split('/')
    trimmed = [regex.sub('', x) for x in parts if '-' in x]
    longest = sorted(trimmed, key=lambda x: -len(x.split('-')))[0]
    print(longest.replace('-', ' '))
Output:
call to end affordable care act is immoral says cha president
new website puts louisiana art on businesses walls
global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits
female music art to take center stage at swan day in new britain
Trump orders Treasury HUD to develop new plan
research delivers insight into the global business voip services market during the period
why mirza international limited nse
indian gaming industry grows in revenues
facebook instagram banning pro white
fluence receives another aspiraltm bulk order with partner itest in china
top firms decry religious exemption bills proposed in texas
correction trump investigations sater lawsuit story
weather channel sued 125 million over death storm chase collision
My original attempt would clean out the numbers from the end of the URL only after extracting the longest, and it worked for your samples; but trimming off trailing numbers immediately when splitting is probably more robust against variations in these patterns.
Since the URLs are not in a consistent pattern (the first and the third URL follow a different pattern than the rest), using rsplit():
s = ['http://catholicphilly.com/2019/03/news/national-news/call-to-end-affordable-care-act-is-immoral-says-cha-president/',
'https://www.houmatoday.com/news/20190327/new-website-puts-louisiana-art-on-businesses-walls',
'https://feltonbusinessnews.com/global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429/149601/',
'http://www.bristolpress.com/BP-General+News/347592/female-music-art-to-take-center-stage-at-swan-day-in-new-britain',
'https://www.sfgate.com/business/article/Trump-orders-Treasury-HUD-to-develop-new-plan-13721842.php',
'https://industrytoday.co.uk/it/research-delivers-insight-into-the-global-business-voip-services-market-during-the-period-2018-2025',
'https://news.yahoo.com/why-mirza-international-limited-nse-233259149.html',
'https://www.indianz.com/IndianGaming/2019/03/27/indian-gaming-industry-grows-in-revenues.asp',
'https://www.yahoo.com/entertainment/facebook-instagram-banning-pro-white-210002719.html',
'https://www.marketwatch.com/press-release/fluence-receives-another-aspiraltm-bulk-order-with-partner-itest-in-china-2019-03-27',
'https://www.valleymorningstar.com/news/elections/top-firms-decry-religious-exemption-bills-proposed-in-texas/article_68a5c4d6-2f72-5a6e-8abd-4f04a44ee74f.html',
'https://tucson.com/news/national/correction-trump-investigations-sater-lawsuit-story/article_ed20e441-de30-5b57-aafd-b1f7d7929f71.html',
'https://www.publicradiotulsa.org/post/weather-channel-sued-125-million-over-death-storm-chase-collision']
for url in s:
    url = url.replace("-", " ")
    if url.rsplit('/', 1)[1] == '':  # for the 1st and 3rd urls, which end in '/'
        if url.rsplit('/', 2)[1].isdigit():  # for the 3rd url
            print(url.rsplit('/', 3)[1])
        else:
            print(url.rsplit('/', 2)[1])
    else:
        print(url.rsplit('/', 1)[1])  # all other urls
OUTPUT:
call to end affordable care act is immoral says cha president
new website puts louisiana art on businesses walls
global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits 456 69429
female music art to take center stage at swan day in new britain
Trump orders Treasury HUD to develop new plan 13721842.php
research delivers insight into the global business voip services market during the period 2018 2025
why mirza international limited nse 233259149.html
indian gaming industry grows in revenues.asp
facebook instagram banning pro white 210002719.html
fluence receives another aspiraltm bulk order with partner itest in china 2019 03 27
article_68a5c4d6 2f72 5a6e 8abd 4f04a44ee74f.html
article_ed20e441 de30 5b57 aafd b1f7d7929f71.html
weather channel sued 125 million over death storm chase collision

Unable to parse names properly from some elements

I've written a script in Python to parse some names out of some elements. When I execute my script, it does parse the names, but the output is weird to look at: the names come out looking like two big names. The names are separated by br tags. How can I get each name individually?
Elements within which the names are:
html_content='''
<div class="second-child"><div class="richText"> <p></p>
<p><strong>D<br></strong>Daiwa House Industry<br>Danske Bank<br>DaVita HealthCare Partners<br>Delphi Automotive<br>Denso<br>Dentsply International<br>Deutsche Boerse<br>Deutsche Post<br>Deutsche Telekom<br>Diageo<br>Dialight<br>Digital Realty Trust<br>Donaldson Company<br>DSM<br>DS Smith </p>
<p><strong>E<br></strong>East Japan Railway Company<br>eBay<br>EDP Renováveis<br>Edwards Lifesciences<br>Elekta<br>EnerNOC<br>Enphase Energy<br>Essilor<br>Etsy<br>Eurazeo<br>European Investment Bank (EIB)<br>Evonik Industries<br>Express Scripts <br><br><strong>F<br></strong>Fielmann<br>First Solar<br>FMO<br>Ford Motor<br>Fresenius Medical Care<br><br></p></div></div>
'''
The script I've written to parse names:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, "lxml")
for items in soup.select(".second-child"):
    name = ' '.join([item.text for item in items.select("p")])
    print(name)
Output I'm having (partial result):
DDaiwa House IndustryDanske BankDaVita HealthCare PartnersDelphi AutomotiveDensoDentsply InternationalDeutsche
Output I wanna get:
DDaiwa House Industry
Danske Bank
DaVita HealthCare Partners
Delphi Automotive
Denso
Dentsply International
FYI, when I take a closer look at the result, I can see that the separate names are attached to each other with no gap in between.
Using item.text removes all the tags; you need to replace the <br> tags with '\n' first. Using the answer provided by Ian Mackinnon for the question Convert </br> to end line, your script should be:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, "lxml")
for br in soup.find_all("br"):
    br.replace_with("\n")
for items in soup.select(".second-child"):
    name = ' '.join([item.text for item in items.select("p")])
    print(name)
and the output:
D
Daiwa House Industry
Danske Bank
DaVita HealthCare Partners
Delphi Automotive
Denso
Dentsply International
Deutsche Boerse
Deutsche Post
Deutsche Telekom
Diageo
Dialight
Digital Realty Trust
Donaldson Company
DSM
DS Smith E
East Japan Railway Company
eBay
EDP Renováveis
Edwards Lifesciences
Elekta
EnerNOC
Enphase Energy
Essilor
Etsy
Eurazeo
European Investment Bank (EIB)
Evonik Industries
Express Scripts 
F
Fielmann
First Solar
FMO
Ford Motor
Fresenius Medical Care
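An alternative sketch (my suggestion, not part of the answer above): BeautifulSoup's get_text() accepts a separator that is inserted between the text nodes of a tag, so the <br> tags don't need to be replaced at all:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, "lxml")
for items in soup.select(".second-child"):
    for p in items.select("p"):
        # a newline is inserted wherever tags (including <br>) split the text
        print(p.get_text(separator="\n"))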
Check the solution below and let me know if any improvements are required:
for items in soup.select(".second-child"):
    for text_nodes in items.select("p"):
        name = " \n".join([item for item in text_nodes.strings if item])
        print(name)
Output
D
Daiwa House Industry
Danske Bank
DaVita HealthCare Partners
Delphi Automotive
Denso
Dentsply International
Deutsche Boerse
Deutsche Post
Deutsche Telekom
Diageo
Dialight
Digital Realty Trust
Donaldson Company
DSM
DS Smith
E
East Japan Railway Company
eBay
EDP Renováveis
Edwards Lifesciences
Elekta
EnerNOC
Enphase Energy
Essilor
Etsy
Eurazeo
European Investment Bank (EIB)
Evonik Industries
Express Scripts 
F
Fielmann
First Solar
FMO
Ford Motor
Fresenius Medical Care
