Text information not scraped properly - Python

I need to scrape the text information from the HTML below. My code does not work properly for cases where the tag and class names are the same. Here I need to get the text as a single list element, not as two different list elements. The code I have written handles the case where there is no split, as in the first sample below. In my case I need to scrape both kinds of text and append them to a single list.
Sample HTML code (where the list element is one) - works correctly:
<DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">The board of Hillshire Brands has withdrawn its recommendation to acquire frozen foods maker Pinnacle Foods, clearing the way for Tyson Foods' $8.55bn takeover bid.</SPAN><SPAN CLASS="c2"> </SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">Last Monday Tyson won the bidding war for Hillshire, maker of Ball Park hot dogs, with a $63-a-share offer, topping rival poultry processor Pilgrim's Pride's $7.7bn bid.</SPAN></P>
Sample HTML code (where the list element is two):
<DIV CLASS="c5"><BR><P CLASS="c6"><SPAN CLASS="c8">HIGHLIGHT:</SPAN><SPAN CLASS="c2"> News analysis<BR></SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">M&A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">Pickles may go with sandwiches, as Hillshire Brands chief executive Sean Connolly put it two weeks ago.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">But many were puzzled by the US food group's announcement that it would pay $6.6bn to acquire New Jersey-based rival Pinnacle Foods, maker of Vlasic pickles and Birds Eye frozen food.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">Without the sort of mooted cost savings necessary to justify the purchase price, many saw the move by Hillshire, known in the US for Ball Park hot dogs and Jimmy Dean sausages, as a way to head off a potential takeover.</SPAN><SPAN CLASS="c2"> </SPAN></P>
Python Code:
from itertools import groupby
from bs4 import BeautifulSoup
from lxml import html

soup = BeautifulSoup(response, 'html.parser')
tree = html.fromstring(response)
# Note: XPath attribute tests need '@class', not '#class'
values = [[''.join(text for text in div.xpath('.//p[@class="c9"]//span[@class="c2"]//text()'))]
          for div in tree.xpath('//div[@class="c5"]') if div.getchildren()]
split_at = ','
textvalues = [list(g) for k, g in groupby(values, lambda x: x != split_at) if k]
list2 = [x for x in textvalues[0] if x]

def purify(list2):
    for (i, sl) in enumerate(list2):
        if type(sl) == list:
            list2[i] = purify(sl)
    return [i for i in list2 if i != [] and i != '']

list3 = purify(list2)
flattened = [val for sublist in list3 for val in sublist]
Current Output:
["M&A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi","--Remaining text--"]
Expected Sample Output:
["M&A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi --Remaining text--"]
Please help me to resolve the above issue.

Something like this?
from bs4 import BeautifulSoup
a="""
<DIV CLASS="c5"><BR><P CLASS="c6"><SPAN CLASS="c8">HIGHLIGHT:</SPAN><SPAN CLASS="c2"> News analysis<BR></SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">M&A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">Pickles may go with sandwiches, as Hillshire Brands chief executive Sean Connolly put it two weeks ago.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">But many were puzzled by the US food group's announcement that it would pay $6.6bn to acquire New Jersey-based rival Pinnacle Foods, maker of Vlasic pickles and Birds Eye frozen food.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">Without the sort of mooted cost savings necessary to justify the purchase price, many saw the move by Hillshire, known in the US for Ball Park hot dogs and Jimmy Dean sausages, as a way to head off a potential takeover.</SPAN><SPAN CLASS="c2"> </SPAN></P>
"""
l = BeautifulSoup(a).text.split('\n')
b = [' '.join(l[1:])]
print b
Output:
[u"M&A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi Pickles may go with sandwiches, as Hillshire Brands chief executive Sean Connolly put it two weeks ago. But many were puzzled by the US food group's announcement that it would pay $6.6bn to acquire New Jersey-based rival Pinnacle Foods, maker of Vlasic pickles and Birds Eye frozen food. Without the sort of mooted cost savings necessary to justify the purchase price, many saw the move by Hillshire, known in the US for Ball Park hot dogs and Jimmy Dean sausages, as a way to head off a potential takeover.\xa0 "]

text = '''<DIV CLASS="c5"><BR><P CLASS="c6"><SPAN CLASS="c8">HIGHLIGHT:</SPAN><SPAN CLASS="c2"> News analysis<BR></SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">M&A simmers as producers swallow up brands to win shelf space, writes Neil Munhsi</SPAN></P>
</DIV>
<BR><DIV CLASS="c5"><P CLASS="c9"><SPAN CLASS="c2">Pickles may go with sandwiches, as Hillshire Brands chief executive Sean Connolly put it two weeks ago.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">But many were puzzled by the US food group's announcement that it would pay $6.6bn to acquire New Jersey-based rival Pinnacle Foods, maker of Vlasic pickles and Birds Eye frozen food.</SPAN></P>
<P CLASS="c9"><SPAN CLASS="c2">Without the sort of mooted cost savings necessary to justify the purchase price, many saw the move by Hillshire, known in the US for Ball Park hot dogs and Jimmy Dean sausages, as a way to head off a potential takeover.</SPAN><SPAN CLASS="c2"> </SPAN></P>'''
from lxml import etree

html = etree.HTML(text)
res = html.xpath('//span[@class="c2" and ../@class="c9"]/text()')
print([''.join(res)])
Output:
["M&A simmers as producers swallow up brands to win shelf space, writes Neil MunhsiPickles may go with sandwiches, as Hillshire Brands chief executive Sean Connolly put it two weeks ago.But many were puzzled by the US food group's announcement that it would pay $6.6bn to acquire New Jersey-based rival Pinnacle Foods, maker of Vlasic pickles and Birds Eye frozen food.Without the sort of mooted cost savings necessary to justify the purchase price, many saw the move by Hillshire, known in the US for Ball Park hot dogs and Jimmy Dean sausages, as a way to head off a potential takeover.\xa0"]

Related

How do I turn this string into 2 arrays in Python?

I have this string and want to turn it into two arrays: one containing the film titles and the other the years. Their positions in the arrays need to correspond with each other. Is there a way to do this?
films = ("""Endless Love (1981), Top Gun (1986), The Color of Money (1986), Rain Man (1988),
Born on the Fourth of July (1989), Interview with the Vampire: The Vampire Chronicles (1994),
Mission: Impossible (1996), Jerry Maguire (1996), The Matrix (1999), Mission: Impossible II (2000),
Vanilla Sky (2001), Cocktail (1988), A Few Good Men (1992), The Firm (1993), Eyes Wide Shut (1999),
Magnolia (1999), Minority Report (2002), Austin Powers in Goldmember (2002), Days of Thunder (1990),
The Powers of Matthew Star (1982), Cold Mountain (2003), The Talented Mr. Ripley (1999),
War of the Worlds (2005), The Oprah Winfrey Show (1986), Far and Away (1992), Taps (1981),
The Last Samurai (2003), Valkyrie (2008), Jack Reacher (2012), Edge of Tomorrow (2014),
Enemy of the State (1998), Mission: Impossible III (2006), Crimson Tide (1995), Reign Over Me (2007),
Batman Forever (1995), Batman Begins (2005), The Simpsons (1989), The Simpsons: Brother from the Same Planet (1993),
The Simpsons: When You Dish Upon a Star (1998), End of Days (1999), House of D (2004), The Indian Runner (1991),
Harry & Son (1984), Mission: Impossible - Ghost Protocol (2011), Aladdin (1992), Pacific Rim (2013),
Oblivion (2013), Knight and Day (2010),
""")
First split the input string on commas to generate a list, then use comprehensions to get the titles and years as separate lists.
import re

films_list = [x for x in re.split(r',\s*', films) if x]  # drop the empty trailing chunk left by the final comma
titles = [re.split(r'\s*(?=\(\d+\))', x)[0] for x in films_list]
years = [re.split(r'\s*(?=\(\d+\))', x)[1] for x in films_list]
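As a quick sanity check that the positions correspond, zipping the two lists built above prints each title next to its year:
for title, year in zip(titles, years):
    print(title, year)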
Tim's answer is good enough. I will write an alternative for anyone who would like to solve the problem without using regex.
a = films.split(",")
years = []
for i in a:
    years.append(i[i.find("(")+1:i.find(")")])
Same approach can be applied for titles.
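For example, the titles could be sliced out the same way (a sketch reusing a from above):
titles = []
for i in a:
    if "(" in i:  # skip the empty trailing chunk left by the final comma
        titles.append(i[:i.find("(")].strip())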
You can do something like this (without any kind of import or extra module needed, or regex complexity):
delimiter = ", "
movies_with_year = films.split(delimiter)
movies = []
years = []
for movie_with_year in movies_with_year:
    movie = movie_with_year[:-6]
    year = movie_with_year[-6:].replace("(", "").replace(")", "")
    movies.append(movie)
    years.append(year)
This script will result in something like this:
movies : ['Endless Love ', ...]
years : ['1981', ...]
You should clear out all newline characters ("\n") and use try/except to get past the last-element issue.
films = ("""Endless Love (1981), Top Gun (1986), The Color of Money (1986), Rain Man (1988),
Born on the Fourth of July (1989), Interview with the Vampire: The Vampire Chronicles (1994),
Mission: Impossible (1996), Jerry Maguire (1996), The Matrix (1999), Mission: Impossible II (2000),
Vanilla Sky (2001), Cocktail (1988), A Few Good Men (1992), The Firm (1993), Eyes Wide Shut (1999),
Magnolia (1999), Minority Report (2002), Austin Powers in Goldmember (2002), Days of Thunder (1990),
The Powers of Matthew Star (1982), Cold Mountain (2003), The Talented Mr. Ripley (1999),
War of the Worlds (2005), The Oprah Winfrey Show (1986), Far and Away (1992), Taps (1981),
The Last Samurai (2003), Valkyrie (2008), Jack Reacher (2012), Edge of Tomorrow (2014),
Enemy of the State (1998), Mission: Impossible III (2006), Crimson Tide (1995), Reign Over Me (2007),
Batman Forever (1995), Batman Begins (2005), The Simpsons (1989), The Simpsons: Brother from the Same Planet (1993),
The Simpsons: When You Dish Upon a Star (1998), End of Days (1999), House of D (2004), The Indian Runner (1991),
Harry & Son (1984), Mission: Impossible - Ghost Protocol (2011), Aladdin (1992), Pacific Rim (2013),
Oblivion (2013), Knight and Day (2010),
""")
movies = []
years = []
for item in films.replace("\n", "").split("),"):
    try:
        movies.append(item.split(" (")[0])
        years.append(item.split(" (")[-1])
    except IndexError:
        pass

How to get only the text of tags that contain a certain string using BeautifulSoup?

Situation
Given an unordered list with some list elements that contain the string "is", I only want to get those texts:
<ul class="fun-facts">
<li>Owned my dream car in high school <sup>1</sup></li>
<li>Middle name is Ronald</li>
<li>Never had been on a plane until college</li>
<li>Dunkin Donuts coffee is better than Starbucks</li>
<li>A favorite book series of mine is <i>Ender's Game</i></li>
<li>Current video game of choice is <i>Rocket League</i></li>
<li>The band that I've seen the most times live is the <i>Zac Brown Band</i></li>
</ul>
My approach
facts = webpage.select('ul.fun-facts li')
facts_with_is = [fact.find(string=re.compile('is')) for fact in facts]
facts_with_is1 = [fact for fact in facts_with_is if fact]
facts_with_is2 = [fact.find_parent().get_text() for fact in facts_with_is if fact]
Results
facts:
[<li>Owned my dream car in high school <sup>1</sup></li>, <li>Middle name is Ronald</li>, <li>Never had been on a plane until college</li>, <li>Dunkin Donuts coffee is better than Starbucks</li>, <li>A favorite book series of mine is <i>Ender's Game</i></li>, <li>Current video game of choice is <i>Rocket League</i></li>, <li>The band that I've seen the most times live is the <i>Zac Brown Band</i></li>]
facts_with_is1 (after filtering out the None values of facts_with_is):
['Middle name is Ronald', 'Dunkin Donuts coffee is better than Starbucks', 'A favorite book series of mine is ', 'Current video game of choice is ', "The band that I've seen the most times live is the "]
facts_with_is2:
['Middle name is Ronald', 'Dunkin Donuts coffee is better than Starbucks', "A favorite book series of mine is Ender's Game", 'Current video game of choice is Rocket League', "The band that I've seen the most times live is the Zac Brown Band"]
How can I get the expected result (fact_with_is2) with a simpler approach?
Solution using bs4 only
Select all <li> and check in a loop whether " is " occurs in each element's text:
from bs4 import BeautifulSoup
html_text='''<ul class="fun-facts">
<li>Owned my dream car in high school <sup>1</sup></li>
<li>Middle name is Ronald</li>
<li>Never had been on a plane until college</li>
<li>Dunkin Donuts coffee is better than Starbucks</li>
<li>A favorite book series of mine is <i>Ender's Game</i></li>
<li>Current video game of choice is <i>Rocket League</i></li>
<li>The band that I've seen the most times live is the <i>Zac Brown Band</i></li>
</ul>'''
soup = BeautifulSoup(html_text, 'lxml')
[x.get_text() for x in soup.select('ul.fun-facts li') if ' is ' in x.get_text()]
Output
['Middle name is Ronald',
'Dunkin Donuts coffee is better than Starbucks',
"A favorite book series of mine is Ender's Game",
'Current video game of choice is Rocket League',
"The band that I've seen the most times live is the Zac Brown Band"]

Create tuples of (lemma, NER type) in Python - NLP problem

I wrote the code below and made a dictionary for it, but I want to create tuples of (lemma, NER type) and collect counts over the tuples. I don't know how to do it; can you please help me? NER type means named entity recognition type.
text = """
Seville.
Summers in the flamboyant Andalucían capital often nudge 40C, but spring is a delight, with the parks in bloom and the scent of orange blossom and jasmine in the air. And in Semana Santa (Holy Week, 14-20 April) the streets come alive with floats and processions. There is also the raucous annual Feria de Abril – a week-long fiesta of parades, flamenco and partying long into the night (4-11 May; expect higher hotel prices if you visit then).
Seville is a romantic and energetic place, with sights aplenty, from the Unesco-listed cathedral – the largest Gothic cathedral in the world – to the beautiful Alcázar royal palace. But days here are best spent simply wandering the medieval streets of Santa Cruz and along the river to La Real Maestranza, Spain’s most spectacular bullring.
Seville is the birthplace of tapas and perfect for a foodie break – join a tapas tour (try devoursevillefoodtours.com), or stop at the countless bars for a glass of sherry with local jamón ibérico (check out Bar Las Teresas in Santa Cruz or historic Casa Morales in Constitución). Great food markets include the Feria, the oldest, and the wooden, futuristic-looking Metropol Parasol.
Nightlife is, unsurprisingly, late and lively. For flamenco, try one of the peñas, or flamenco social clubs – Torres Macarena on C/Torrijano, perhaps – with bars open across town until the early hours.
Book it: In an atmospheric 18th-century house, the Hospes Casa del Rey de Baeza is a lovely place to stay in lively Santa Cruz. Doubles from £133 room only, hospes.com
Trieste.
"""
doc = nlp(text).ents
en = [(entity.text, entity.label_) for entity in doc]
# en is a flat list of (text, NER type) tuples
from pprint import pprint
pprint(en)
from collections import defaultdict
type2entities = defaultdict(list)
for entity, entity_type in en:
    type2entities[entity_type].append(entity)
pprint(type2entities)
I hope the following code snippets solve your problem.
import spacy
# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_sm")
text = ("Seville.
Summers in the flamboyant Andalucían capital often nudge 40C, but spring is a delight, with the parks in bloom and the scent of orange blossom and jasmine in the air. And in Semana Santa (Holy Week, 14-20 April) the streets come alive with floats and processions. There is also the raucous annual Feria de Abril – a week-long fiesta of parades, flamenco and partying long into the night (4-11 May; expect higher hotel prices if you visit then).
Seville is a romantic and energetic place, with sights aplenty, from the Unesco-listed cathedral – the largest Gothic cathedral in the world – to the beautiful Alcázar royal palace. But days here are best spent simply wandering the medieval streets of Santa Cruz and along the river to La Real Maestranza, Spain’s most spectacular bullring.
Seville is the birthplace of tapas and perfect for a foodie break – join a tapas tour (try devoursevillefoodtours.com), or stop at the countless bars for a glass of sherry with local jamón ibérico (check out Bar Las Teresas in Santa Cruz or historic Casa Morales in Constitución). Great food markets include the Feria, the oldest, and the wooden, futuristic-looking Metropol Parasol.
Nightlife is, unsurprisingly, late and lively. For flamenco, try one of the peñas, or flamenco social clubs – Torres Macarena on C/Torrijano, perhaps – with bars open across town until the early hours.
Book it: In an atmospheric 18th-century house, the Hospes Casa del Rey de Baeza is a lovely place to stay in lively Santa Cruz. Doubles from £133 room only, hospes.com
Trieste.")
doc = nlp(text)
lemma_ner_list = []
for entity in doc.ents:
    lemma_ner_list.append((entity.lemma_, entity.label_))
# print list of (lemma, NER type) tuples
print(lemma_ner_list)
# print count of tuples
print(len(lemma_ner_list))
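To collect counts over the tuples themselves (how often each (lemma, NER type) pair occurs) rather than just the total length, collections.Counter handles it directly:
from collections import Counter

tuple_counts = Counter(lemma_ner_list)
print(tuple_counts.most_common(10))  # the ten most frequent (lemma, NER type) pairs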

Extract information from part of a URL in Python

I have a list of 200k urls, with the general format of:
http[s]://..../..../the-headline-of-the-article
OR
http[s]://..../..../the-headline-of-the-article/....
The number of / before and after the-headline-of-the-article varies
Here is some sample data:
'http://catholicphilly.com/2019/03/news/national-news/call-to-end-affordable-care-act-is-immoral-says-cha-president/',
'https://www.houmatoday.com/news/20190327/new-website-puts-louisiana-art-on-businesses-walls',
'https://feltonbusinessnews.com/global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429/149601/',
'http://www.bristolpress.com/BP-General+News/347592/female-music-art-to-take-center-stage-at-swan-day-in-new-britain',
'https://www.sfgate.com/business/article/Trump-orders-Treasury-HUD-to-develop-new-plan-13721842.php',
'https://industrytoday.co.uk/it/research-delivers-insight-into-the-global-business-voip-services-market-during-the-period-2018-2025',
'https://news.yahoo.com/why-mirza-international-limited-nse-233259149.html',
'https://www.indianz.com/IndianGaming/2019/03/27/indian-gaming-industry-grows-in-revenues.asp',
'https://www.yahoo.com/entertainment/facebook-instagram-banning-pro-white-210002719.html',
'https://www.marketwatch.com/press-release/fluence-receives-another-aspiraltm-bulk-order-with-partner-itest-in-china-2019-03-27',
'https://www.valleymorningstar.com/news/elections/top-firms-decry-religious-exemption-bills-proposed-in-texas/article_68a5c4d6-2f72-5a6e-8abd-4f04a44ee74f.html',
'https://tucson.com/news/national/correction-trump-investigations-sater-lawsuit-story/article_ed20e441-de30-5b57-aafd-b1f7d7929f71.html',
'https://www.publicradiotulsa.org/post/weather-channel-sued-125-million-over-death-storm-chase-collision',
I want to extract the-headline-of-the-article only.
ie.
call-to-end-affordable-care-act-is-immoral-says-cha-president
global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429
correction-trump-investigations-sater-lawsuit-story
I am sure this is possible, but I am relatively new to regex in Python.
In pseudocode, I was thinking:
split everything by /
keep only the chunk that contains -
replace all - with \s
Is this possible in python (I am a python n00b)?
urls = [...]

for url in urls:
    bits = url.split('/')  # Split each url at the '/'
    bits_with_hyphens = [bit.replace('-', ' ') for bit in bits if '-' in bit]  # [1]
    print(bits_with_hyphens)
[1] Note that your algorithm assumes that only one of the fragments after splitting the url will have a hyphen, which is not correct given your examples. So at [1], I'm keeping all the bits that do so.
Output:
['national news', 'call to end affordable care act is immoral says cha president']
['new website puts louisiana art on businesses walls']
['global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits 456 69429']
['BP General+News', 'female music art to take center stage at swan day in new britain']
['Trump orders Treasury HUD to develop new plan 13721842.php']
['research delivers insight into the global business voip services market during the period 2018 2025']
['why mirza international limited nse 233259149.html']
['indian gaming industry grows in revenues.asp']
['facebook instagram banning pro white 210002719.html']
['press release', 'fluence receives another aspiraltm bulk order with partner itest in china 2019 03 27']
['top firms decry religious exemption bills proposed in texas', 'article_68a5c4d6 2f72 5a6e 8abd 4f04a44ee74f.html']
['correction trump investigations sater lawsuit story', 'article_ed20e441 de30 5b57 aafd b1f7d7929f71.html']
['weather channel sued 125 million over death storm chase collision']
PS. I think your algorithm could do with a bit of thought. Problems that I see:
more than one bit might contain a hyphen, where:
both only contain dictionary words (see first and fourth output)
one of them is "clearly" not a headline (see second and third from bottom)
spurious string fragments at the end of the real headline: eg "13721842.php", "revenues.asp", "210002719.html"
you need to substitute a space for separator characters other than '-' as well (see the fourth output, "General+News")
Here's a slightly different variation which seems to produce good results from the samples you provided.
Out of the parts with dashes, we trim off any trailing hex strings and file name extension; then, we extract the one with the largest number of dashes from each URL, and finally replace the remaining dashes with spaces.
import re

regex = re.compile(r'(-[0-9a-f]+)*(\.[a-z]+)?$', re.IGNORECASE)

for url in urls:
    parts = url.split('/')
    trimmed = [regex.sub('', x) for x in parts if '-' in x]
    longest = sorted(trimmed, key=lambda x: -len(x.split('-')))[0]
    print(longest.replace('-', ' '))
Output:
call to end affordable care act is immoral says cha president
new website puts louisiana art on businesses walls
global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits
female music art to take center stage at swan day in new britain
Trump orders Treasury HUD to develop new plan
research delivers insight into the global business voip services market during the period
why mirza international limited nse
indian gaming industry grows in revenues
facebook instagram banning pro white
fluence receives another aspiraltm bulk order with partner itest in china
top firms decry religious exemption bills proposed in texas
correction trump investigations sater lawsuit story
weather channel sued 125 million over death storm chase collision
My original attempt would clean out the numbers from the end of the URL only after extracting the longest, and it worked for your samples; but trimming off trailing numbers immediately when splitting is probably more robust against variations in these patterns.
Since the URLs are not in a consistent pattern (the first and third URLs follow a different pattern from the rest), use rsplit():
s = ['http://catholicphilly.com/2019/03/news/national-news/call-to-end-affordable-care-act-is-immoral-says-cha-president/',
'https://www.houmatoday.com/news/20190327/new-website-puts-louisiana-art-on-businesses-walls',
'https://feltonbusinessnews.com/global-clean-energy-inc-otcpkgcei-climbs-investors-radar-as-key-momentum-reading-hits-456-69429/149601/',
'http://www.bristolpress.com/BP-General+News/347592/female-music-art-to-take-center-stage-at-swan-day-in-new-britain',
'https://www.sfgate.com/business/article/Trump-orders-Treasury-HUD-to-develop-new-plan-13721842.php',
'https://industrytoday.co.uk/it/research-delivers-insight-into-the-global-business-voip-services-market-during-the-period-2018-2025',
'https://news.yahoo.com/why-mirza-international-limited-nse-233259149.html',
'https://www.indianz.com/IndianGaming/2019/03/27/indian-gaming-industry-grows-in-revenues.asp',
'https://www.yahoo.com/entertainment/facebook-instagram-banning-pro-white-210002719.html',
'https://www.marketwatch.com/press-release/fluence-receives-another-aspiraltm-bulk-order-with-partner-itest-in-china-2019-03-27',
'https://www.valleymorningstar.com/news/elections/top-firms-decry-religious-exemption-bills-proposed-in-texas/article_68a5c4d6-2f72-5a6e-8abd-4f04a44ee74f.html',
'https://tucson.com/news/national/correction-trump-investigations-sater-lawsuit-story/article_ed20e441-de30-5b57-aafd-b1f7d7929f71.html',
'https://www.publicradiotulsa.org/post/weather-channel-sued-125-million-over-death-storm-chase-collision']
for url in s:
    url = url.replace("-", " ")
    if url.rsplit('/', 1)[1] == '':  # For the 1st and 3rd urls
        if url.rsplit('/', 2)[1].isdigit():  # For the 3rd url
            print(url.rsplit('/', 3)[1])
        else:
            print(url.rsplit('/', 2)[1])
    else:
        print(url.rsplit('/', 1)[1])  # all except the 1st and 3rd urls
OUTPUT:
call to end affordable care act is immoral says cha president
new website puts louisiana art on businesses walls
global clean energy inc otcpkgcei climbs investors radar as key momentum reading hits 456 69429
female music art to take center stage at swan day in new britain
Trump orders Treasury HUD to develop new plan 13721842.php
research delivers insight into the global business voip services market during the period 2018 2025
why mirza international limited nse 233259149.html
indian gaming industry grows in revenues.asp
facebook instagram banning pro white 210002719.html
fluence receives another aspiraltm bulk order with partner itest in china 2019 03 27
article_68a5c4d6 2f72 5a6e 8abd 4f04a44ee74f.html
article_ed20e441 de30 5b57 aafd b1f7d7929f71.html
weather channel sued 125 million over death storm chase collision
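Another way to frame this, not shown in the answers above, is to isolate the URL path with urllib.parse first and then pick the hyphen-richest segment; a sketch under the same "most hyphens wins" assumption (the headline() helper is hypothetical), reusing the s list from the previous answer:
from urllib.parse import urlparse

def headline(url):
    # Keep only the path segments that contain a hyphen, then take the one with the most
    segments = [seg for seg in urlparse(url).path.split('/') if '-' in seg]
    if not segments:
        return None
    best = max(segments, key=lambda seg: seg.count('-'))
    return best.replace('-', ' ')

for url in s:
    print(headline(url))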

Unable to parse names properly from some elements

I've written a script in Python to parse some names out of some elements. When I execute my script, it does parse the names, but the output is hard to read: the names run together so it looks like a few big names. The names are separated by br tags. How can I get each name individually?
Elements within which the names are:
html_content='''
<div class="second-child"><div class="richText"> <p></p>
<p><strong>D<br></strong>Daiwa House Industry<br>Danske Bank<br>DaVita HealthCare Partners<br>Delphi Automotive<br>Denso<br>Dentsply International<br>Deutsche Boerse<br>Deutsche Post<br>Deutsche Telekom<br>Diageo<br>Dialight<br>Digital Realty Trust<br>Donaldson Company<br>DSM<br>DS Smith </p>
<p><strong>E<br></strong>East Japan Railway Company<br>eBay<br>EDP Renováveis<br>Edwards Lifesciences<br>Elekta<br>EnerNOC<br>Enphase Energy<br>Essilor<br>Etsy<br>Eurazeo<br>European Investment Bank (EIB)<br>Evonik Industries<br>Express Scripts <br><br><strong>F<br></strong>Fielmann<br>First Solar<br>FMO<br>Ford Motor<br>Fresenius Medical Care<br><br></p></div></div>
'''
The script I've written to parse names:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, "lxml")
for items in soup.select(".second-child"):
    name = ' '.join([item.text for item in items.select("p")])
    print(name)
Output I'm having (partial result):
DDaiwa House IndustryDanske BankDaVita HealthCare PartnersDelphi AutomotiveDensoDentsply InternationalDeutsche
Output I wanna get:
DDaiwa House Industry
Danske Bank
DaVita HealthCare Partners
Delphi Automotive
Denso
Dentsply International
FYI, when I take a closer look at the result, I can see that the separate names are attached to each other with no gap in between.
Using item.text removes all the tags; you need to replace the <br> tags with '\n' first. Using the answer provided by Ian Mackinnon for the question Convert </br> to end line, your script should be:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, "lxml")
for br in soup.find_all("br"):
    br.replace_with("\n")
for items in soup.select(".second-child"):
    name = ' '.join([item.text for item in items.select("p")])
    print(name)
and the output:
D
Daiwa House Industry
Danske Bank
DaVita HealthCare Partners
Delphi Automotive
Denso
Dentsply International
Deutsche Boerse
Deutsche Post
Deutsche Telekom
Diageo
Dialight
Digital Realty Trust
Donaldson Company
DSM
DS Smith E
East Japan Railway Company
eBay
EDP Renováveis
Edwards Lifesciences
Elekta
EnerNOC
Enphase Energy
Essilor
Etsy
Eurazeo
European Investment Bank (EIB)
Evonik Industries
Express Scripts 
F
Fielmann
First Solar
FMO
Ford Motor
Fresenius Medical Care
Check the solution below and let me know if any improvements are required:
for items in soup.select(".second-child"):
    for text_nodes in items.select("p"):
        name = " \n".join([item for item in text_nodes.strings if item])
        print(name)
Output
D
Daiwa House Industry
Danske Bank
DaVita HealthCare Partners
Delphi Automotive
Denso
Dentsply International
Deutsche Boerse
Deutsche Post
Deutsche Telekom
Diageo
Dialight
Digital Realty Trust
Donaldson Company
DSM
DS Smith
E
East Japan Railway Company
eBay
EDP Renováveis
Edwards Lifesciences
Elekta
EnerNOC
Enphase Energy
Essilor
Etsy
Eurazeo
European Investment Bank (EIB)
Evonik Industries
Express Scripts 
F
Fielmann
First Solar
FMO
Ford Motor
Fresenius Medical Care
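For what it's worth, BeautifulSoup's get_text() also accepts a separator, which avoids mutating the tree; a sketch reusing html_content from the question:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, "lxml")
for items in soup.select(".second-child"):
    for p in items.select("p"):
        # "\n" is inserted between text nodes, so <br>-separated names print on separate lines
        print(p.get_text("\n", strip=True))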
