Extract tag text from line BeautifulSoup - python

Recently I've been working on a scraping project. I'm kinda new to it and managed to do almost everything, but I'm having trouble with one small issue. I captured every line of a news article doing this:
lines=bs.find('div',{'class':'Text'}).find_all('div')
But for some reason, there's some lines that contain an h2 tag and a br tag, like this one:
<div><div><h2>Header2</h2></div><div><br/></div><div>Paragraph text
So if I run .text on that snippet I get "Header2Paragraph text". I've already got the "Header2" text stored in another line, so I want to delete this second occurrence.
I managed to isolate those lines doing this:
for n, t in enumerate(lines):
    if t.find('h2') is not None and t.find('br') is not None:
        print('\n', n, ':', t)
But I don't know how to erase the text associated to the h2 tag, so those lines become "Paragraph text" instead of "Header2Paragraph text". What can I do? Thanks

Use .get_text(separator=' ') instead of .text and you get the text joined with spaces: "Header2 Paragraph text".
You can also use a different separator character, e.g. "|": .get_text(separator='|') gives "Header2|Paragraph text".
Then you can use split("|") to get the list ["Header2", "Paragraph text"] and keep the last element.
Alternatively, find the h2 and call clear() or extract() on it; afterwards the text of the whole div no longer contains "Header2".
Documentation: get_text(), clear(), extract()
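Both suggestions can be sketched as follows. This is a minimal example, not the asker's actual page: the HTML string is the snippet from the question with its closing tags filled in as an assumption, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Assumed sample: the snippet from the question, closed so it parses cleanly.
html = "<div><div><h2>Header2</h2></div><div><br/></div><div>Paragraph text</div></div>"

soup = BeautifulSoup(html, "html.parser")
line = soup.find("div")

# Option 1: join the text fragments with a separator, then keep the last piece.
joined = line.get_text(separator="|", strip=True)
last_part = joined.split("|")[-1]

# Option 2: remove every <h2> from the tree, then read the remaining text.
for h2 in line.find_all("h2"):
    h2.extract()
remaining = line.get_text(strip=True)

print(last_part)
print(remaining)
```

Option 1 leaves the parsed tree untouched, while option 2 modifies it, so the second approach only fits if the h2 is not needed from that same element later.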

Related

How to get text from an element with selenium

I am trying to get the text of an element and then search for it in another place. The problem is that when I get the text using .text, it drops one of the spaces when there are two in a row, so the search on the next page can't find it. Is there a way to get the text exactly as it is, spaces included?
Code:
self.name = session.find_element('xpath', './/a[contains(@href, "name")]').text
self.dataBrowser.find_elements(By.XPATH, f'//tr[child::td[child::div[child::strong[text()="{self.name}"]]]]')
Not always the text property will actually hold the text. So, try this:
element = browser.find_element(By.CSS_SELECTOR, 'CSS_EXPRESSION')
text = element.get_attribute('text')
if text is None or text == '':
    text = element.get_attribute('value')
If text and value both return nothing, try reading the innerHTML attribute.
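The fallback chain described in this answer could be wrapped in a small helper. This is only a sketch: read_element_text and FakeElement are hypothetical names, and FakeElement is a stand-in so the snippet runs without a browser. With Selenium, a real WebElement would be passed instead, since it provides get_attribute().

```python
def read_element_text(element, names=("text", "value", "innerHTML")):
    """Return the first non-empty attribute value among `names`, else ''."""
    for name in names:
        value = element.get_attribute(name)
        if value:  # skips both None and the empty string
            return value
    return ""


class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement, for demonstration only."""

    def __init__(self, attributes):
        self._attributes = attributes

    def get_attribute(self, name):
        return self._attributes.get(name)


# 'text' is empty and 'value' is missing, so the helper falls through to innerHTML.
element = FakeElement({"text": "", "value": None, "innerHTML": "John  Smith"})
name = read_element_text(element)
print(repr(name))
```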

Scrapy: how can I extract text with hyperlink together

I'm crawling a professor's webpage.
Under her research description, there are two hyperlinks, which are " TEDx UCL" and "here".
I use an XPath like '//div[@class="group"]//p/text()'
to get the first 3 paragraphs.
And '//div[@class="group"]/text()' to get the last paragraph with some newlines. But these can be cleaned easily.
The problem is the last paragraph contains only text. The hyperlinks are lost. Though I can extract them separately, it is tedious to put them back to their corresponding position.
How can I get all the text and keep the hyperlinks?
You can use html2text.
import html2text
sample = response.xpath('//div[@class="group"]').get()  # HTML string of the whole block
converter = html2text.HTML2Text()
converter.ignore_links = False  # keep the hyperlinks in the output
text = converter.handle(sample)
Try this:
'//div[@class="group"]/p//text()[normalize-space(.)]'

Get text via XPath, ignoring markup

I have to retrieve text inside an HTML table, in the cells the text sometimes is inside a <div> and sometimes is not.
How can I make a div in a XPath optional?
My actual code:
stuff = tree.xpath("/html/body/table/tbody/tr/td[5]/div/text()")
Wanted pseudocode:
stuff = tree.xpath("/html/body/table/tbody/tr/td[5]/div or nothing/text()")
You want the string value of the td[5] element. Use string():
stuff = tree.xpath("string(/html/body/table/tbody/tr/td[5])")
This will return text without markup beneath td[5].
You can also indirectly obtain the string value of an element via normalize-space() as suggested by splash58 in the comments, if you also want whitespace to be trimmed on the ends and reduced interiorly.
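A minimal sketch of both functions with lxml, assuming a made-up two-cell table rather than the asker's page. Relative expressions (string(.) and normalize-space(.)) are used so each cell is evaluated on its own.

```python
from lxml import html

# Assumed sample: one cell wraps its text in a <div>, the other does not.
doc = html.fromstring(
    "<table><tr>"
    "<td><div>  in a div  </div></td>"
    "<td>plain   text</td>"
    "</tr></table>"
)

cells = doc.xpath("//td")

# string(.) returns the concatenated text of the cell, markup ignored.
raw = [cell.xpath("string(.)") for cell in cells]

# normalize-space(.) additionally trims the ends and collapses inner whitespace.
trimmed = [cell.xpath("normalize-space(.)") for cell in cells]

print(raw)
print(trimmed)
```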

Xpath: how to get the text of <a> tag inside a <p> tag

I have the following issue when trying to get information from some website using scrapy.
I'm trying to get all the text inside <p> tag, but my problem is that in some cases inside those tags there is not just text, but sometimes also an <a> tag, and my code stops collecting the text when it reaches that tag.
This is my Xpath expression, it's working properly when there aren't tags contained inside:
description = descriptionpath.xpath("span[@itemprop='description']/p/text()").extract()
Posting Pawel Miech's comment as an answer as it appears his comment has helped many of us thus far and contains the right answer:
Tack //text() on the end of the xpath to specify that text should be recursively extracted.
So your xpath would appear like this:
span[@itemprop='description']/p//text()
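The difference between /text() and //text() can be seen on a small example. This sketch uses lxml directly instead of a Scrapy selector, and the HTML fragment is invented for illustration.

```python
from lxml import html

# Assumed sample: a paragraph whose text is interrupted by a link.
span = html.fromstring(
    '<span itemprop="description">'
    '<p>Read the <a href="/docs">docs</a> first.</p>'
    "</span>"
)

# /text() only returns text nodes that are direct children of <p>.
direct = span.xpath("p/text()")

# //text() also descends into nested tags such as <a>.
recursive = span.xpath("p//text()")

description = "".join(recursive)
print(direct)
print(description)
```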

fetching and parsing text not enclosed within tags

I'm trying to work on a project about page ranking. I want to make an index (dictionary) which looks like this:
file1.html -> [[cat, ate, food, drank, milk], [file2.html, file3.html]]
file2.html -> [[dog, barked, ran, away], [file1.html, file4.html]]
Fetching links is easy - look for anchor tags. My question is - how do I fetch text? The text in the html files is not enclosed within any tags like <p>.
Here's an example of one of the input HTML files:
d_9.html
d_3.html
bedote charlatanism nondecision pudsey Antaean haec euphoniously Bixa bacteriologically hesitantly Hobbist petrosa emendable counterembattled noble hornlessness chemolyze spittoon flatiron formalith wreathingly hematospermatocele theosophically sarking nontruth possessionist gravimetry matico unlawly abator hyetological Microconodon supermuscan
Maybe, the text above is not HTML, but then how do I fetch and parse it? Any ideas?
One way to go about this is to simply ignore all the tags and what you've got left is assumed to be text. It will make the regex large though.
I wouldn't use regex, I would use something like lxml, that way you can get the tags, the text and also the structure of the document as needed.
You say the text is "not HTML," and "is not enclosed within any tags." So it's just plain text, there's nothing to parse. Fetch the url, and the contents returned to you are a string full of words. Split the words with .split(), and you have a list of words.
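A rough sketch of the split-based approach, building one index entry in the shape from the question. The rule that any token ending in ".html" is a link is an assumption based on the sample file, and index_page is a hypothetical helper name.

```python
def index_page(text):
    """Split plain text into [words, links]; tokens ending in '.html' count as links."""
    links, words = [], []
    for token in text.split():
        if token.endswith(".html"):  # assumption: links look like 'd_9.html'
            links.append(token)
        else:
            words.append(token)
    return [words, links]


# Shortened version of the sample file from the question.
page = "d_9.html\nd_3.html\nbedote charlatanism nondecision"
entry = index_page(page)
print(entry)
```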
I think what you want is to get data (links, keywords, ...) from an HTML file, but some part of the file has no tags to parse it by. Or is it the whole file that has no tags? If so, you can format the HTML file with tidy, which can help with parsing it.
If I were you, I would just use a regex to match the links, something like:
links = re.finditer(".*html", text) # by the way, the regex must be more complicated than that.
As for the keywords "[cat, ate, food, drank, milk]", I don't know exactly what you are looking for.
Hope this helps.
