Scrapy: how can I extract text with hyperlink together - python

I'm crawling a professor's webpage.
Under her research description, there are two hyperlinks, "TEDx UCL" and "here".
I use an XPath like '//div[@class="group"]//p/text()'
to get the first three paragraphs.
And '//div[@class="group"]/text()' to get the last paragraph with some newlines, which can be cleaned easily.
The problem is that the last paragraph then contains only the text: the hyperlinks are lost. I can extract them separately, but it is tedious to put them back in their corresponding positions.
How can I get all the text and keep the hyperlinks?

You can use html2text. Pass it the HTML of the div (not the text() nodes), and leave link handling on so the hyperlinks survive:

import html2text

sample = response.xpath('//div[@class="group"]').get()
converter = html2text.HTML2Text()
converter.ignore_links = False  # keep the hyperlinks in the output
print(converter.handle(sample))

Try this:
'//div[@class="group"]/p//text()[normalize-space(.)]'
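For a quick check of why `p//text()` keeps the link text while `p/text()` drops it, here is a stdlib sketch using ElementTree's itertext() as a stand-in for the descendant text() step. The HTML below is a made-up miniature of the page described in the question, not the real markup:

```python
import xml.etree.ElementTree as ET

# Simplified version of the page structure described in the question:
# a paragraph whose text is interrupted by <a> hyperlinks.
html = (
    '<div class="group">'
    '<p>See our talk at <a href="https://example.org/tedx">TEDx UCL</a> '
    'and the full paper <a href="https://example.org/paper">here</a>.</p>'
    '</div>'
)

root = ET.fromstring(html)
p = root.find('p')

# itertext() walks text nodes in document order, descending into the
# <a> children -- the same effect as the XPath p//text() above, whereas
# p/text() stops at the direct child text nodes and drops the link text.
full_text = ''.join(p.itertext())
print(full_text)
```

The joined result keeps the hyperlink text in its original position, which is exactly what the question asks for.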

Related

Find_all function returning multiple strings that are separated but not delimited by a character

Background: I am pretty new to Python and decided to practice by making a web scraper for https://www.marketwatch.com/tools/markets/stocks/a-z, which would allow me to pull the company name, ticker, origin, and sector. I could then use this in another scraper to combine it with more complex information. The page has two indexing methods: one to select the first letter of the company name (at the top of the page) and another for the number of pages within that letter index (at the bottom of the page). These two tags both have the class="pagination" identifier, but when I scrape based on that criteria I get two separate strings that are not a delimited list separated by a comma.
Does anyone know how to get the strings as a list? or individually? I really only care about the second.
from bs4 import BeautifulSoup
import requests

# open the source code of the website as text
source = 'https://www.marketwatch.com/tools/markets/stocks/a-z/x'
page = requests.get(source).text
soup = BeautifulSoup(page, 'lxml')

for tags in soup.find_all('ul', class_='pagination'):
    tags_text = tags.text
    print(tags_text)
Which returns:
0-9ABCDEFGHIJKLMNOPQRSTUVWX (current)YZOther
«123»
When I try to split on '/n':
    tags_text = tags.text.split('/n')
    print(tags_text)
The return is:
['\n0-9ABCDEFGHIJKLMNOPQRSTUVWX (current)YZOther']
['«123»']
Neither seem to form a list. I have found many ways to get the first string but I really only need the second.
Also, please note that I am using the X index as my current tab. If you build it from scratch, you might have more numbers in the second string and the word (current) might be in a different place in the first list.
THANK YOU!!!!
Edit:
Cleaned old, commented out code from the source and realized I did not show the results of trying to call the second element despite the lack of a comma in the split example:
    tags_text = tags.text.split('/n')[1]
    print(tags_text)
Returns:
  File "C:\.....", line 22, in <module>
    tags_text = tags.text.split('/n')[1]
IndexError: list index out of range
Never mind, I was using print() when I should have been using return, .append(), or something else to actually do something with the value...
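For what it's worth, there is a second bug in the snippet above: split('/n') splits on the literal two-character substring "/n", not on a newline, which is why the result stays a one-element list and [1] raises IndexError. A minimal sketch, using a hypothetical sample string shaped like the printed output shown earlier:

```python
# Hypothetical pagination text, shaped like the question's printed repr.
text = '\n«123»'

# '/n' is a literal slash-n substring that never occurs in the text,
# so nothing gets split and the list has a single element:
no_split = text.split('/n')
print(no_split)   # one element, so no_split[1] would raise IndexError

# '\n' is the newline escape, which actually splits the string:
parts = text.split('\n')
print(parts)
print(parts[1])
```

With the correct escape, the second element is the page-number string the question was after.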

How to access text element in selenium if it is splitted by body tags

I have a problem trying to access some values on a website while web scraping. The text I want to extract is in a class that contains several texts separated by <b> tags (these <b> tags also hold text that is important to me).
So first I tried to look for the <b> tag with the text I needed ('Category' in this case) and then extract the exact category from the text below that tag. I cannot use a precise XPath here, because the other pages I need to scrape contain a different number of rows in this sidebar, so the locations, and therefore the XPaths, are different.
The expected output is 'utility' - the category in the sidebar.
The website and the text I need to extract look like this (see the sidebar containing 'Category'):
The element looks like that:
And the code I tried:
driver = webdriver.Safari()
driver.get('https://www.statsforsharks.com/entry/MC_Squares')
element = driver.find_elements_by_xpath("//b[contains(text(), 'Category')]/following-sibling")
for value in element:
    print(value.text)
driver.close()
the link to the page with the data is https://www.statsforsharks.com/entry/MC_Squares.
Thank you!
You might be better off using regex here, as the whole text comes under the 'company-sidebar-body' class, where only some of the text is between b tags and some is not.
So, you can get the text of the class first:
sidebartext = driver.find_element_by_class_name("company-sidebar-body").text
That will give you the following:
"EOY Proj Sales: $1,000,000\r\nSales Prev Year: $200,000\r\nCategory: Utility\r\nAsking Deal\r\nEquity: 10%\r\nAmount: $300,000\r\nValue: $3,000,000\r\nEquity Deal\r\nSharks: Kevin O'Leary\r\nEquity: 25%\r\nAmount: $300,000\r\nValue: $1,200,000\r\nBite: -$1,800,000"
You can then use regex to target the category:
import re
c = re.search(r"Category:\s\w+", sidebartext).group()
print(c)
c will result in 'Category: Utility' which you can then work with. This will also work if the value of the category ('Utility') is different on other pages.
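If you only want the value without the "Category: " prefix, a capture group saves a second split. A short sketch, using a shortened copy of the sidebar string from the answer above:

```python
import re

# Shortened copy of the sidebar text quoted in the answer above.
sidebartext = ("EOY Proj Sales: $1,000,000\r\nSales Prev Year: $200,000\r\n"
               "Category: Utility\r\nAsking Deal\r\nEquity: 10%")

# The parentheses capture just the word after "Category: ",
# so group(1) returns the value alone.
match = re.search(r"Category:\s(\w+)", sidebartext)
print(match.group(1))
```

group() still returns the full "Category: Utility" match; group(1) returns only the captured value.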
There are easier ways when it's a MediaWiki website. You could, for instance, access the page data through the API with a JSON request and parse it with a much more limited DOM.
Any particular reason you want to scrape my website?

Is there any way to select element through HTML, by selenium, python

I am making a crawling app with Selenium and Python, and I am stuck.
As in my screenshot, I can select the text (underlined), but what I need is the numbers next to the text.
In Chrome DevTools (F12), the numbers (red circle) do have a class name, but those class names are all the same,
so there is no indicator (as far as I know) that I can use to select the numbers through Selenium.
So I tried to find a way to select the element through the HTML with Selenium, but I couldn't find one. Is there any way to do this?
If I am looking for something that does not exist, I am very sorry.
I only know Python and Selenium, so if I cannot handle this, please let me know.
---edit
I think I explained this badly.
What I need is to find the text first, then collect the numbers (both of them).
But there is a ton of text; I only screenshotted a little bit.
I can locate the texts by their specific ids (lots of them),
but how can I get the numbers that are next to the text?
That is my question; sorry for the bad explanation.
And if BeautifulSoup can handle this, please let me know. Thanks for your help.
Special thanks to Christine; her code solved my problem.
You can use an XPath index to select the first td element. Given the screenshot, you can select the first td containing 2,167 as such:
cell = driver.find_element_by_xpath("//tr[td/a[text()='TEXT']]/td[@class='txt-r'][1]")
print(cell.text)
You should replace TEXT with the characters you underlined in your screenshot -- I do not have this keyboard so I cannot type the text for you.
The above XPath will query on all table rows, pick the row with your desired text, then query on table cells with class txt-r within a row. Because the two td elements both have class txt-r, you only want to pick one of them, using an index indicated by [1]. The [1] will pick the first td, with text 2,167.
Full sample as requested by the user:
# first get all text on the page
all_text_elements = driver.find_elements_by_xpath("//a[contains(@class, 'link-resource')]")
# iterate text elements and print both numbers that are next to text
for text_element in all_text_elements:
    # get the text from the web element
    text = text_element.text
    # find the first number next to it (2,167 from sample HTML)
    first_number = driver.find_element_by_xpath("//tr[td/a[text()='" + text + "']]/td[@class='txt-r'][1]")
    print(first_number.text)
    # find 2nd number (0 from sample HTML)
    second_number = driver.find_element_by_xpath("//tr[td/a[text()='" + text + "']]/td[@class='txt-r'][2]")
    print(second_number.text)

Extract tag text from line BeautifulSoup

Recently I've been working on a scraping project. I'm kinda new to it, but could manage to do almost everything, but I'm having trouble with a little issue. I captured every line of a news article doing this:
lines=bs.find('div',{'class':'Text'}).find_all('div')
But for some reason, there's some lines that contain an h2 tag and a br tag, like this one:
<div><div><h2>Header2</h2></div><div><br/></div><div>Paragraph text
So if I run .text on that snippet I get "Header2Paragraph text". I've got the "Header2" text stored in another line, so I want to delete this second appearance.
I managed to isolate those lines doing this:
for n, t in enumerate(lines):
    if t.find('h2') is not None and t.find('br') is not None:
        print('\n', n, ':', t)
But I don't know how to erase the text associated to the h2 tag, so those lines become "Paragraph text" instead of "Header2Paragraph text". What can I do? Thanks
Use .get_text(separator=' ') instead of .text and you get the text with a space: "Header2 Paragraph text".
You can also use a different character, e.g. "|": .get_text(separator='|') gives "Header2|Paragraph text".
You can then use split("|") to get the list ["Header2", "Paragraph text"] and keep the last element.
You can also find the h2 and clear() or extract() that tag, and afterwards get the text from the whole div without "Header2".
Documentation: get_text(), clear(), extract()
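The extract() idea can also be sketched with only the stdlib; here is the same move with ElementTree on the div from the question (with the tags closed so it parses as XML), removing the h2 before collecting text:

```python
import xml.etree.ElementTree as ET

# The div from the question, which renders as "Header2Paragraph text".
line = ET.fromstring(
    '<div><div><h2>Header2</h2></div><div><br/></div><div>Paragraph text</div></div>'
)

# Drop every h2 before collecting text -- the same effect as calling
# extract() on the tag in BeautifulSoup: its text disappears from the output.
for parent in line.iter('div'):
    for h2 in parent.findall('h2'):
        parent.remove(h2)

cleaned = ''.join(line.itertext())
print(cleaned)
```

After the removal, joining the remaining text nodes yields only "Paragraph text", without the duplicated header.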

How to Collect the line with Selenium Python

I want to know how I can collect the line with the mailto link using Selenium and Python. The emails contain an @ sign on the contact page. I tried the following, but it works on some pages and not on others:
//*[contains(text(),"@")]
The email formats differ: sometimes it is <p>Email: name@domain.com</p> or <span>Email: name@domain.com</span>, or just name@domain.com.
Is there any way to collect them with one statement?
Thanks
Here is the XPath you are looking for, my friend:
//*[contains(text(),"@")]|//*[contains(@href,"@")]
You could create a collection of the link-text values on the page that contain @ and then iterate through it to format them. You are going to have to format the span that has Email: name@domain.com anyway.
Use find_elements_by_partial_link_text to make the collection.
I think you need two XPaths: a first XPath for finding elements that contain the text "Email:", and a second for elements whose href attribute contains "mailto:":
//*[contains(text(),"Email:")]|//*[contains(@href,"mailto:")]
