Passing href text and referencing web page in Scrapy - python

Is there any way to do this using CrawlSpider, without yielding requests manually? Just an example would suffice. I want to use the href text as the title of the web page and have a link to the URL that contained the link. I'm just using basic selectors to fill my item, but I'm not sure how to get this information.
Edit:
I looked into it, and I want to pass the href text and the referring URL as meta data, while still complying with the rules I've defined, rather than extracting all URLs and filtering them myself:
meta={"hrefText" : ..., "refURL": ...}

See the CrawlSpider source code:
    for link in links:
        r = Request(url=link.url, callback=self._response_downloaded)
        r.meta.update(rule=n, link_text=link.text)
        yield rule.process_request(r)
meaning you can get the href text from response.meta['link_text'].
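The mechanism can be sketched without Scrapy. Here Request and Link are minimal stand-ins for Scrapy's classes, and refURL is the hypothetical meta key from the question (in a real spider, the referring page is also available via response.request.headers.get('Referer') under the default referer middleware):

```python
# Minimal stand-ins for scrapy.Request and the Link objects a
# link extractor returns; not Scrapy's actual classes.
class Request:
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback
        self.meta = {}

class Link:
    def __init__(self, url, text):
        self.url = url
        self.text = text

def follow_links(links, referring_url):
    # Mirrors the CrawlSpider loop quoted above: each request carries
    # the anchor text (and here, the referring page) in its meta dict.
    for link in links:
        r = Request(url=link.url)
        r.meta.update(link_text=link.text, refURL=referring_url)
        yield r

requests = list(follow_links(
    [Link("https://example.com/a", "Page A")],
    referring_url="https://example.com/",
))
print(requests[0].meta["link_text"])  # anchor text, as in response.meta['link_text']
print(requests[0].meta["refURL"])    # the page that contained the link
```

In the callback you would then read these values out of response.meta to fill your item.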

Related

Python/Selenium web scraping: how to find a hidden src value from links?

Scraping links should be a simple feat, usually just a matter of grabbing the href value of the a tag.
I recently came across this website (https://sunteccity.com.sg/promotions) where the href value of the a tags of each item cannot be found, but the redirection still works. I'm trying to figure out a way to grab the items and their corresponding links. My typical Python Selenium code looks something like this:
    all_items = bot.find_elements_by_class_name('thumb-img')
    for promo in all_items:
        a = promo.find_elements_by_tag_name("a")
        print("a[0]: ", a[0].get_attribute("href"))
However, I can't seem to retrieve any href or onclick attributes, and I'm wondering if this is even possible. I also noticed that I couldn't right-click and open the link in a new tab.
Are there any ways around getting the links of all these items?
Edit: Are there any ways to retrieve all the links of the items on the pages?
i.e.
https://sunteccity.com.sg/promotions/724
https://sunteccity.com.sg/promotions/731
https://sunteccity.com.sg/promotions/751
https://sunteccity.com.sg/promotions/752
https://sunteccity.com.sg/promotions/754
https://sunteccity.com.sg/promotions/280
...
Edit:
Adding an image of one such anchor tag for better clarity:
By reverse-engineering the JavaScript that takes you to the promotions pages (seen in https://sunteccity.com.sg/_nuxt/d4b648f.js), you can get all the links, which are based on the HappeningID. You can verify this by running the following in the JS console, which gives you the first promotion:
window.__NUXT__.state.Promotion.promotions[0].HappeningID
Based on that, you can create a Python loop to get all the promotions:
    items = driver.execute_script("return window.__NUXT__.state.Promotion;")
    for item in items["promotions"]:
        base = "https://sunteccity.com.sg/promotions/"
        happening_id = str(item["HappeningID"])
        print(base + happening_id)
That generated the following output:
https://sunteccity.com.sg/promotions/724
https://sunteccity.com.sg/promotions/731
https://sunteccity.com.sg/promotions/751
https://sunteccity.com.sg/promotions/752
https://sunteccity.com.sg/promotions/754
https://sunteccity.com.sg/promotions/280
https://sunteccity.com.sg/promotions/764
https://sunteccity.com.sg/promotions/766
https://sunteccity.com.sg/promotions/762
https://sunteccity.com.sg/promotions/767
https://sunteccity.com.sg/promotions/732
https://sunteccity.com.sg/promotions/733
https://sunteccity.com.sg/promotions/735
https://sunteccity.com.sg/promotions/736
https://sunteccity.com.sg/promotions/737
https://sunteccity.com.sg/promotions/738
https://sunteccity.com.sg/promotions/739
https://sunteccity.com.sg/promotions/740
https://sunteccity.com.sg/promotions/741
https://sunteccity.com.sg/promotions/742
https://sunteccity.com.sg/promotions/743
https://sunteccity.com.sg/promotions/744
https://sunteccity.com.sg/promotions/745
https://sunteccity.com.sg/promotions/746
https://sunteccity.com.sg/promotions/747
https://sunteccity.com.sg/promotions/748
https://sunteccity.com.sg/promotions/749
https://sunteccity.com.sg/promotions/750
https://sunteccity.com.sg/promotions/753
https://sunteccity.com.sg/promotions/755
https://sunteccity.com.sg/promotions/756
https://sunteccity.com.sg/promotions/757
https://sunteccity.com.sg/promotions/758
https://sunteccity.com.sg/promotions/759
https://sunteccity.com.sg/promotions/760
https://sunteccity.com.sg/promotions/761
https://sunteccity.com.sg/promotions/763
https://sunteccity.com.sg/promotions/765
https://sunteccity.com.sg/promotions/730
https://sunteccity.com.sg/promotions/734
https://sunteccity.com.sg/promotions/623
You are using the wrong locator; it brings in a lot of irrelevant elements.
Instead of find_elements_by_class_name('thumb-img'), please try find_elements_by_css_selector('.collections-page .thumb-img'), so your code becomes:
    all_items = bot.find_elements_by_css_selector('.collections-page .thumb-img')
    for promo in all_items:
        a = promo.find_elements_by_tag_name("a")
        print("a[0]: ", a[0].get_attribute("href"))
You can also get the desired links directly with the .collections-page .thumb-img a locator, so your code could be:
    links = bot.find_elements_by_css_selector('.collections-page .thumb-img a')
    for link in links:
        print(link.get_attribute("href"))

How to get href link from in this a tag?

I successfully got the href links from the http://quotes.toscrape.com/ example by running:
    response.css('div.quote > span > a::attr(href)').extract()
and it gives all the partial links inside the href of each a tag:
['/author/Albert-Einstein', '/author/J-K-Rowling', '/author/Albert-Einstein', '/author/Jane-Austen', '/author/Marilyn-Monroe', '/author/Albert-Einstein', '/author/Andre-Gide', '/author/Thomas-A-Edison', '/author/Eleanor-Roosevelt', '/author/Steve-Martin']
By the way, in the above example each a tag has this format:
    <a href="/author/Albert-Einstein">(about)</a>
So I tried to make the same for this site: http://www.thegoodscentscompany.com/allproc-1.html
The problem here is that the style of the a tag is a bit different: the URL is not in the href attribute but inside an onclick handler, something like:
    <a onclick="...Window('http://www.thegoodscentscompany.com/data/rw1247381.html')">formaldehyde</a>
As you can see, I can't get the link from href using the method above. I want to get the link (http://www.thegoodscentscompany.com/data/rw1247381.html) from this a tag, but I could not make it work. How can I get this link?
Try this:
    response.css('a::attr(onclick)').re(r"Window\('(.*?)'\)")
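You can check the regex outside Scrapy with the standard re module. The sample onclick value below is an assumption about the shape of the site's handler (the exact JavaScript function name may differ; only the Window('...') part matters to the regex):

```python
import re

# Sample onclick value of the shape the answer's regex expects;
# the exact function name on the real site may differ.
onclick = "openMainWindow('http://www.thegoodscentscompany.com/data/rw1247381.html'); return false;"

# Same pattern as in the Scrapy .re() call above.
match = re.search(r"Window\('(.*?)'\)", onclick)
print(match.group(1))  # http://www.thegoodscentscompany.com/data/rw1247381.html
```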

Why does HTML source code change while extracting text from soup object?

I am trying to scrape news articles from the results of a search term using Selenium and BeautifulSoup on Python. I have arrived at the final page which contains the text using:
    import requests
    from bs4 import BeautifulSoup

    article_page = requests.get(articles.link_of_article[0])
    article_soup = BeautifulSoup(article_page.text, "html.parser")
    for content in article_soup.find_all('div', {"class": "name_of_class_with_contained_text"}):
        content.get_text()
I notice that "name_of_class_with_contained_text" is present when I visually inspect the source code in the browser but the class is not present in the soup object. Also, all the "p" tags are replaced with the following code "\\u003c/p\\u003e\\u003cp\\u003e \\u003c/p\\u003e\\u003cp\\u003e".
I am unable to find the class name or tags to get the text contained.
Any help or reasoning as to why this happens would be appreciated.
P.S: Relatively new to scraping and HTML
UPDATE: Adding the link of the final page here:
https://www.fundfire.com/c/2258443/277443?referrer_module=searchSubFromFF&highlight=eileen%20neill%20verus
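One likely explanation is that the article body is delivered inside a JavaScript string rather than as plain HTML, so the class is only present in the browser-rendered DOM, and sequences like \u003c are just escaped < and > characters. A minimal sketch of decoding them, using a made-up snippet rather than the actual page source:

```python
import codecs

# Stand-in for the escaped markup seen in the page source; the real
# page embeds the article body as an escaped JS/JSON string.
escaped = "\\u003cp\\u003eSome article text\\u003c/p\\u003e"

# unicode_escape turns \u003c / \u003e back into < and >.
decoded = codecs.decode(escaped, "unicode_escape")
print(decoded)  # <p>Some article text</p>
```

After decoding, the recovered markup can be fed to BeautifulSoup as usual.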

How to access text from a scope inside a href (selenium, python)

I am trying to scrape the text of an href inside a scope, which can't be accessed through XPath, as I want to iterate through a table and find the text inside each box.
Here is a screenshot of what I want to find
I think you can do:
    import urllib.request

    page_url = "https://example.com"  # insert your URL here
    with urllib.request.urlopen(page_url) as f:
        html = f.read().decode('utf-8')
    html.find("whatever")  # search the raw HTML for the text you need
which I think should return the entirety of the HTML for the page, which you can then scrape for whatever you need.
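If you'd rather not string-search the raw HTML, the standard library's html.parser can pull out anchor hrefs and text. A sketch with an inline HTML sample (the markup and href are invented for illustration, not taken from the real page):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._in_a = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_a = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_a:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_a:
            self._in_a = False
            self.links.append((self._href, "".join(self._text).strip()))

parser = LinkCollector()
# Hypothetical table cell like the one in the question.
parser.feed('<td><a href="/company/III">3i Group PLC</a></td>')
print(parser.links)  # [('/company/III', '3i Group PLC')]
```

Feeding it the full downloaded page would give you every link's text in one pass, without Selenium.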
According to the information you provided, you want to print the text, in this case "3i Group PLC".
I'll refer to the XPath for 3i Group PLC as link_abc, since the actual XPath was not provided.
Here is the code by using selenium library:
name = driver.find_element_by_xpath("link_abc")
print(name.text)
The output should be 3i Group PLC.

Creating multiple requests from same method in Scrapy

I am parsing webpages that have a similar structure to this page.
I have the following two functions:
    def parse_next(self, response):
        # implementation goes here
        # create Request(the_next_link, callback=parse_next)
        # for link in discovered_links:
        #     create Request(link, callback=parse_link)

    def parse_link(self, response):
        pass
I want parse_next() to create a request for the *Next link on the web page. At the same time, I want it to create requests for all the URLs discovered on the current page, using parse_link() as the callback. Note that I want parse_next to recursively use itself as a callback, because this seems to me to be the only possible way to generate requests for all the *Next links.
*Next: the link that appears beside all the numbers on that page.
How am I supposed to solve this problem?
Use a generator function, loop through your links, and yield a request for each link you want to fetch:
    for link in links:
        yield Request(link.url)
Since you are using scrapy, I'm assuming you have link extractors set up.
So, just declare your link extractor as a variable like this:
link_extractor = SgmlLinkExtractor(allow=('.+'))
Then, in the parse function, call the link extractor on the response to get the links:
links = self.link_extractor.extract_links(response)
Here you go:
http://www.jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained
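The generator mechanics the answer relies on can be shown without Scrapy. Request here is a trivial stand-in for scrapy.Request; the point is that a single method can yield both the next-page request (with itself as callback) and one request per discovered link:

```python
# Trivial stand-in for scrapy.Request, just to show the yield pattern.
class Request:
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback

def parse_link(response):
    pass

def parse_next(response, next_link, discovered_links):
    # One request to keep paginating, with parse_next as its own callback...
    yield Request(next_link, callback=parse_next)
    # ...plus one request per link found on the current page.
    for link in discovered_links:
        yield Request(link, callback=parse_link)

requests = list(parse_next(None, "/page/2", ["/item/1", "/item/2"]))
print([r.url for r in requests])  # ['/page/2', '/item/1', '/item/2']
```

In a real spider, Scrapy consumes everything the method yields, so both kinds of requests get scheduled from the same call.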
