I'm currently creating a custom web crawler with Scrapy and trying to index the fetched content with Elasticsearch.
It works fine so far, but I can only add content to the search index in the order in which the crawler filters HTML tags.
So for example with
sel.xpath("//div[@class='article']/h2//text()").extract()
I can get all the content from all h2 tags inside a div with the class "article", so far so good. The next elements that get inside the index are from all h3 tags, naturally:
sel.xpath("//div[@class='article']/h3//text()").extract()
But the problem here is that the entire order of the text on a site would get messed up like that, since all headlines would get indexed first and only then their child nodes get the chance, which is kind of fatal for a search index.
Does anyone have a tip on how to properly get all the content from a page in the right order? (It doesn't have to be XPath, just with Scrapy.)
I guess you could solve the issue with something like this:
# Select multiple target nodes at once
sel_raw = '|'.join([
    "//div[@class='article']/h2",
    "//div[@class='article']/h3",
    # Whatever else you want to select here
])
# Use a separate loop variable so the page-level selector is not overwritten
for node in sel.xpath(sel_raw):
    # Extract the texts for later use
    texts = node.xpath('self::*//text()').extract()
    if node.xpath('self::h2'):
        # An h2 element. Do something with texts
        pass
    elif node.xpath('self::h3'):
        # An h3 element. Do something with texts
        pass
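To see why a single pass keeps the headings in page order, here is a minimal, dependency-free sketch. It swaps Scrapy's selectors for the standard library's xml.etree.ElementTree, assumes well-formed markup, and uses a made-up SAMPLE document; the names headings_in_order and SAMPLE are only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (not taken from any real site).
SAMPLE = """<html><body><div class="article">
<h2>First headline</h2><h3>First <em>sub</em>headline</h3>
<h2>Second headline</h2><h3>Second subheadline</h3>
</div></body></html>"""


def headings_in_order(markup):
    """Return (tag, text) pairs for h2/h3 children of the article div, in page order."""
    root = ET.fromstring(markup)
    found = []
    for article in root.findall(".//div[@class='article']"):
        # Iterating over an element yields its children in document order.
        for child in article:
            if child.tag in ("h2", "h3"):
                text = "".join(child.itertext()).strip()
                found.append((child.tag, text))
    return found


ordered = headings_in_order(SAMPLE)
for tag, text in ordered:
    print(tag, text)
```

Because every heading is visited once, in the order it appears, each h3 stays next to the h2 it belongs to instead of all h2 texts being indexed first.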
Scraping links should be a simple feat, usually just grabbing the href value of the a tag.
I recently came across this website (https://sunteccity.com.sg/promotions) where the href value of the a tag of each item cannot be found, but the redirection still works. I'm trying to figure out a way to grab the items and their corresponding links. My typical Python Selenium code looks something like this:
all_items = bot.find_elements_by_class_name('thumb-img')
for promo in all_items:
    a = promo.find_elements_by_tag_name("a")
    print("a[0]: ", a[0].get_attribute("href"))
However, I can't seem to retrieve any href, onclick attributes, and I'm wondering if this is even possible. I noticed that I couldn't do a right-click, open link in new tab as well.
Are there any ways around getting the links of all these items?
Edit: Are there any ways to retrieve all the links of the items on the pages?
i.e.
https://sunteccity.com.sg/promotions/724
https://sunteccity.com.sg/promotions/731
https://sunteccity.com.sg/promotions/751
https://sunteccity.com.sg/promotions/752
https://sunteccity.com.sg/promotions/754
https://sunteccity.com.sg/promotions/280
...
Edit:
Adding an image of one such anchor tag for better clarity:
Reverse-engineering the JavaScript that takes you to the promotion pages (seen in https://sunteccity.com.sg/_nuxt/d4b648f.js) gives you a way to get all the links, which are based on the HappeningID. You can verify this by running the following in the JS console, which gives you the HappeningID of the first promotion:
window.__NUXT__.state.Promotion.promotions[0].HappeningID
Based on that, you can create a Python loop to get all the promotions:
items = driver.execute_script("return window.__NUXT__.state.Promotion;")
for item in items["promotions"]:
    base = "https://sunteccity.com.sg/promotions/"
    happening_id = str(item["HappeningID"])
    print(base + happening_id)
That generated the following output:
https://sunteccity.com.sg/promotions/724
https://sunteccity.com.sg/promotions/731
https://sunteccity.com.sg/promotions/751
https://sunteccity.com.sg/promotions/752
https://sunteccity.com.sg/promotions/754
https://sunteccity.com.sg/promotions/280
https://sunteccity.com.sg/promotions/764
https://sunteccity.com.sg/promotions/766
https://sunteccity.com.sg/promotions/762
https://sunteccity.com.sg/promotions/767
https://sunteccity.com.sg/promotions/732
https://sunteccity.com.sg/promotions/733
https://sunteccity.com.sg/promotions/735
https://sunteccity.com.sg/promotions/736
https://sunteccity.com.sg/promotions/737
https://sunteccity.com.sg/promotions/738
https://sunteccity.com.sg/promotions/739
https://sunteccity.com.sg/promotions/740
https://sunteccity.com.sg/promotions/741
https://sunteccity.com.sg/promotions/742
https://sunteccity.com.sg/promotions/743
https://sunteccity.com.sg/promotions/744
https://sunteccity.com.sg/promotions/745
https://sunteccity.com.sg/promotions/746
https://sunteccity.com.sg/promotions/747
https://sunteccity.com.sg/promotions/748
https://sunteccity.com.sg/promotions/749
https://sunteccity.com.sg/promotions/750
https://sunteccity.com.sg/promotions/753
https://sunteccity.com.sg/promotions/755
https://sunteccity.com.sg/promotions/756
https://sunteccity.com.sg/promotions/757
https://sunteccity.com.sg/promotions/758
https://sunteccity.com.sg/promotions/759
https://sunteccity.com.sg/promotions/760
https://sunteccity.com.sg/promotions/761
https://sunteccity.com.sg/promotions/763
https://sunteccity.com.sg/promotions/765
https://sunteccity.com.sg/promotions/730
https://sunteccity.com.sg/promotions/734
https://sunteccity.com.sg/promotions/623
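If you want the URL-building step separated from the browser so it can be checked on its own, a small helper along these lines might do. It is only a sketch: promotion_links and sample_state are hypothetical names, and sample_state merely stands in for whatever driver.execute_script(...) returns.

```python
BASE_URL = "https://sunteccity.com.sg/promotions/"


def promotion_links(promotion_state, base=BASE_URL):
    """Build one URL per promotion from a dict shaped like window.__NUXT__.state.Promotion."""
    links = []
    for item in promotion_state.get("promotions", []):
        happening_id = item.get("HappeningID")
        if happening_id is None:
            # Skip entries without an ID rather than building a broken URL.
            continue
        links.append(base + str(happening_id))
    return links


# Hypothetical sample data; the real dict comes from the page's Nuxt state.
sample_state = {
    "promotions": [
        {"HappeningID": 724},
        {"HappeningID": "731"},
        {"Title": "entry without an ID"},
    ]
}
links = promotion_links(sample_state)
for link in links:
    print(link)
```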
You are using a wrong locator. It brings you a lot of irrelevant elements.
Instead of find_elements_by_class_name('thumb-img') please try find_elements_by_css_selector('.collections-page .thumb-img') so your code will be
all_items = bot.find_elements_by_css_selector('.collections-page .thumb-img')
for promo in all_items:
    a = promo.find_elements_by_tag_name("a")
    print("a[0]: ", a[0].get_attribute("href"))
You can also get the desired links directly with the '.collections-page .thumb-img a' locator, so that your code could be:
links = bot.find_elements_by_css_selector('.collections-page .thumb-img a')
for link in links:
    print(link.get_attribute("href"))
I'm having some issues in crawling this website search:
https://www.simplyhired.com/search?q=data+engineer&l=United+States&pn=1&job=ZMzeXt6JW0jMuZc6H-3Af3sqOGzeQMLj7X5mnXXv9ZteeAoGm6oDdg
I'm trying to extract these elements from the SimplyHired job search for Data Engineer in the US:
But when I apply an XPath locator to any of them using the Selector module, I get different results in a different order.
Also, the outputs don't line up with each other (for example, the index of a job name in the job-name XPath result is not the index of its location in the location XPath result).
Here is my code:
from scrapy import Selector
import requests
response = requests.get('https://www.simplyhired.com/search?q=data+engineer&l=united+states&mi=exact&sb=dd&pn=1&job=X1yGOt2Y8QTJm0tYqyptbgV9Pu19ge0GkVZK7Im5WbXm-zUr-QMM-A').content
sel = Selector(text=response)
# job name
sel.xpath('//main[@id="job-list"]/div/article[contains(@class,"SerpJob")]/div/div[@class="jobposting-title-container"]/h2/a/text()').extract()
# company
sel.xpath('//main[@id="job-list"]/div/article/div/h3[@class="jobposting-subtitle"]/span[@class="JobPosting-labelWithIcon jobposting-company"]/text()').extract()
# location
sel.xpath('//main[@id="job-list"]//div/article/div/h3[@class="jobposting-subtitle"]/span[@class="JobPosting-labelWithIcon jobposting-location"]/span/span/text()').extract()
# salary estimates
sel.xpath('//main[@id="job-list"]//div/article/div/div[@class="SerpJob-metaInfo"]//div[@class="SerpJob-metaInfoLeft"]/span/text()[2]').extract()
I'm not quite sure whether you're trying to use Scrapy or requests. Looks like you're wanting to use requests but with xpath selectors.
For websites like this, it's best to look at each individual job advert as a 'card'. You want to loop over each card with the XPATH selectors that you need to get the data you want.
Code Example
card = sel.xpath('//div[@class="SerpJob-jobCard card"]')
for a in card:
    title = a.xpath('.//a[@class="card-link"]/text()').get()
    company = a.xpath('.//span[@class="JobPosting-labelWithIcon jobposting-company"]/text()').get()
    salary = a.xpath('.//span[@class="jobposting-salary"]/text()').get()
    location = a.xpath('.//span[@class="jobposting-location"]/text()').get()
Explanation
You want to search each card with relative XPATH selectors. The .// searches within the chunk of HTML downstream of the card variable.
Prefer get() and getall() over extract(). On a selector result, get() returns a single value: the first match as a string, or None if nothing matches, which is what we want when looping over each card. extract() is ambiguous: called on a list of matches it returns a list even when there is only one value, which is often not what you want. If you do want multiple values, use getall(); it is explicit and always returns a list.
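The reason per-card extraction fixes the index mismatch is easiest to see when one card is missing a field. The sketch below is an illustration only: it uses the standard library's xml.etree.ElementTree instead of Scrapy, a made-up SAMPLE with simplified class names, and hypothetical helpers text_or_none and parse_cards.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample; the second card has no salary.
SAMPLE = """<main>
  <div class="card"><a class="card-link">Data Engineer</a>
    <span class="company">Acme</span><span class="salary">$120,000</span></div>
  <div class="card"><a class="card-link">Senior Data Engineer</a>
    <span class="company">Globex</span></div>
</main>"""


def text_or_none(parent, path):
    """Return the stripped text of the first match below parent, or None."""
    node = parent.find(path)
    return node.text.strip() if node is not None and node.text else None


def parse_cards(markup):
    """Extract one dict per card, so fields of the same job stay together."""
    root = ET.fromstring(markup)
    rows = []
    for card in root.findall(".//div[@class='card']"):
        rows.append({
            "title": text_or_none(card, ".//a[@class='card-link']"),
            "company": text_or_none(card, ".//span[@class='company']"),
            "salary": text_or_none(card, ".//span[@class='salary']"),
        })
    return rows


rows = parse_cards(SAMPLE)
for row in rows:
    print(row)
```

With four page-wide queries, the salary list here would have one entry and the title list two, so the indexes would drift apart. Looking things up inside each card leaves an explicit None where a field is missing.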
Additional Information
If you find you're not getting the correct data in the right format, always check whether the content is being added to the page by JavaScript. Turn off JavaScript in your browser and refresh the page. On this particular site, none of the data you require is loaded by JavaScript, which makes it much easier to scrape.
This is a tricky problem for me: I got stuck on the web scraping part and was not able to proceed further.
https://i.stack.imgur.com/r4tN2.png
I need only grid-cell answers in a loop
I tried using
grid_cell=driver.find_element_by_css_selector('#tags-browser > div:nth-child(2) > div.mt-auto.grid.jc-space-between.fs-caption.fc-black-300 > div:nth-child(1)')
Now displaying the text of the tag will show 2061748 questions
grid_cell.text
but this is only for one element.
What if I wanted to have it in a loop where I need all the count for all the tags available in that page?
In this case, as per the image, I ran a for loop over '''javascript''' and '''java''', but find_element_by_css_selector gives a specific count for either java or javascript, not for both.
And also if I choose
tag_counts = body.find_all('div', class_='grid_cell')
then I would get other classes also that are below grid-cell in the picture attached which are to be excluded.
Please suggest some solution. Any help would be appreciated.
There are 2 ways of achieving this:
First option:
Remove the tags you don't want to scrape and then scrape the tags that you do want. For example:
tags = body.find_all('div', class_='grid_cell s-anchor')  # TODO: add full class name (to remove this tag)
for tag in tags:
    tag.extract()  # Remove tag from body
tags = body.find_all('div', class_='grid_cell')  # This will contain all the tags you want.
Second option:
Loop through the parent HTML tag and get the first tag using find(). For example:
containers = body.find_all('div', class_='mt-auto grid')  # Find parent tag
for container in containers:
    tag = container.find('div', class_='grid_cell')  # Get first tag in the container div
    print(tag.text.strip())
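As a rough, self-contained illustration of the second option, the sketch below takes only the first matching cell from each parent container. It uses the standard library's xml.etree.ElementTree rather than BeautifulSoup, and the SAMPLE markup, its class names, and every value except "2061748 questions" are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: each tag card has a "counts" container
# whose first grid_cell is the question count and whose second is unwanted.
SAMPLE = """<div id="tags-browser">
  <div class="tag"><a>javascript</a>
    <div class="counts">
      <div class="grid_cell">2061748 questions</div>
      <div class="grid_cell">538 asked today</div>
    </div></div>
  <div class="tag"><a>java</a>
    <div class="counts">
      <div class="grid_cell">1700000 questions</div>
      <div class="grid_cell">300 asked today</div>
    </div></div>
</div>"""

root = ET.fromstring(SAMPLE)
first_counts = []
for container in root.findall(".//div[@class='counts']"):
    # find() returns only the first matching child, so later cells are ignored.
    cell = container.find("div[@class='grid_cell']")
    if cell is not None and cell.text:
        first_counts.append(cell.text.strip())

print(first_counts)
```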
I have a problem while trying to access some values on the website during the process of web scraping the data. The problem is that the text I want to extract is in the class which contains several texts separated by tags (these body tags also have texts which are also important for me).
So firstly, I tried to look for the tag with the text I needed ('Category' in this case) and then extract the exact category from the text below this body tag assignment. I could use precise XPath but here it is not the case because other pages I need to web scrape contain a different amount of rows in this sidebar so the locations, as well as XPaths, are different.
The expected output is 'utility' - the category in the sidebar.
The website and the text I need to extract look like this (look at the sidebar on the right containing 'Category'):
The element looks like that:
And the code I tried:
driver = webdriver.Safari()
driver.get('https://www.statsforsharks.com/entry/MC_Squares')
element = driver.find_elements_by_xpath("//b[contains(text(), 'Category')]/following-sibling")
for value in element:
    print(value.text)
driver.close()
the link to the page with the data is https://www.statsforsharks.com/entry/MC_Squares.
Thank you!
You might be better off using regex here, as the whole text comes under the 'company-sidebar-body' class, where only some text is between b tags and some are not.
So, you can get the text of the class first:
sidebartext = driver.find_element_by_class_name("company-sidebar-body").text
That will give you the following:
"EOY Proj Sales: $1,000,000\r\nSales Prev Year: $200,000\r\nCategory: Utility\r\nAsking Deal\r\nEquity: 10%\r\nAmount: $300,000\r\nValue: $3,000,000\r\nEquity Deal\r\nSharks: Kevin O'Leary\r\nEquity: 25%\r\nAmount: $300,000\r\nValue: $1,200,000\r\nBite: -$1,800,000"
You can then use regex to target the category:
import re
c = re.search(r"Category:\s\w+", sidebartext).group()
print(c)
c will result in 'Category: Utility' which you can then work with. This will also work if the value of the category ('Utility') is different on other pages.
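If you only want the value itself rather than the whole 'Category: Utility' string, a capture group is one option. The sketch below is a variation on the idea above, not a drop-in replacement: extract_category is a hypothetical helper, and the sample string is a shortened copy of the sidebar text shown earlier.

```python
import re

# Shortened copy of the sidebar text shown above.
sidebartext = (
    "EOY Proj Sales: $1,000,000\r\nSales Prev Year: $200,000\r\n"
    "Category: Utility\r\nAsking Deal\r\nEquity: 10%"
)


def extract_category(text):
    """Return the text after 'Category:' up to the end of that line, or None."""
    # [^\r\n]+ stops at the line break, so multi-word categories are kept whole.
    match = re.search(r"Category:\s*([^\r\n]+)", text)
    return match.group(1).strip() if match else None


category = extract_category(sidebartext)
print(category)
```

Returning None when the label is absent avoids the AttributeError that calling .group() on a failed search would raise on pages without a category row. Call .lower() on the result if you need 'utility' in lower case.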
There are easier ways when it's a MediaWiki website. You could, for instance, access the page data through the API with a JSON request and parse it with a much more limited DOM.
Any particular reason you want to scrape my website?
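For reference, a request to the MediaWiki Action API for a page's wikitext could be built roughly like this. The action=parse, page, prop and format parameters are standard MediaWiki ones, but the api.php location used below is an assumption (it varies between installations), and mediawiki_parse_url is a hypothetical helper. The sketch only builds the URL and does not send a request.

```python
from urllib.parse import urlencode


def mediawiki_parse_url(api_endpoint, page_title):
    """Build an Action API URL asking for a page's wikitext as JSON."""
    params = {
        "action": "parse",
        "page": page_title,
        "prop": "wikitext",
        "format": "json",
    }
    return api_endpoint + "?" + urlencode(params)


# Assumed endpoint path; check where api.php actually lives on the wiki.
url = mediawiki_parse_url("https://www.statsforsharks.com/api.php", "MC_Squares")
print(url)
```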
I am working on a project where I am crawling thousands of websites to extract text data, the end use case is natural language processing.
EDIT * since I am crawling hundreds of thousands of websites, I cannot tailor scraping code to each one, which means I cannot search for specific element IDs; the solution I am looking for is a general one *
I am aware of solutions such as the .get_text() function from Beautiful Soup. The issue with this method is that it gets all the text from the page, much of it irrelevant to the main topic of that particular page. For the most part a web page is dedicated to a single main topic, but at the sides, top, and bottom there may be links or text about other subjects, promotions, or other content.
The .get_text() function returns all the text on the page in one go. The problem is that it combines everything (the relevant parts with the irrelevant ones). Is there another function similar to .get_text() that returns all the text as a list, where every list item is a specific section of the text, so that it can be known where new subjects start and end?
As a bonus, is there a way to identify the main body of text on a web page?
Below are some snippets you could use to query data in the desired way using BeautifulSoup4 and Python 3:
import requests
from bs4 import BeautifulSoup
response = requests.get('https://yoursite/page')
soup = BeautifulSoup(response.text, 'html.parser')
# Print the first child node of <body>
print(soup.body.contents[0])
# Print the first div found on the page
print(soup.find('div'))
# Print all divs on the page as a list
print(soup.find_all('div'))
# Print the element whose id is 'required_element_id'
print(soup.find(id='required_element_id'))
# Print all elements matching a CSS selector, as a list
# (required_css_selectors is a placeholder for your own selector string)
print(soup.select(required_css_selectors))
# Print the value of one attribute of an element
print(soup.find(id='someid').get("attribute-name"))
# You can also break one large query into several smaller ones
parent = soup.find(id='someid')
# getText() returns the text between the opening and closing tags
print(parent.select(".some-class")[0].getText())
For more advanced requirements, you can check out Scrapy as well. Let me know if you run into any problems implementing this, or if your requirement is something else.
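As for getting the text as a list of sections rather than one string, here is a rough, general-purpose sketch using only the standard library's html.parser. The tag sets, the BlockTextParser class, and the "longest block is the main body" rule are all assumptions made for illustration; real pages will need a better heuristic or a dedicated content-extraction library.

```python
from html.parser import HTMLParser

# Assumed tag sets: block-level tags start a new section, skipped tags are dropped.
BLOCK_TAGS = {"p", "div", "section", "article", "li", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class BlockTextParser(HTMLParser):
    """Collect page text as a list of blocks, ignoring navigation-like elements."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished text blocks, in document order
        self._buffer = []     # text fragments of the block being built
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def _flush(self):
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._flush()
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)


def text_blocks(html):
    parser = BlockTextParser()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep any trailing text after the last block tag
    return parser.blocks


# Hypothetical sample page.
SAMPLE = """<html><body>
<nav><a href="/">Home</a> <a href="/deals">Deals</a></nav>
<article><h1>Solar panels</h1>
<p>Solar panels convert sunlight into electricity using photovoltaic cells.</p>
<p>Prices have fallen.</p></article>
<footer>Copyright notice</footer>
</body></html>"""

blocks = text_blocks(SAMPLE)
main_block = max(blocks, key=len)  # crude guess at the main body text
print(blocks)
```

Each list item is one heading, paragraph, or list entry, so downstream NLP code can see where sections start and end, and boilerplate inside nav, header, footer, or aside elements never reaches the list.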