How to scrape Price from a site that has a changing structure?

How to scrape Price from a site that has a changing structure? - python

I want to scrape the pricing data from an eCommerce site called flipkart, I tried using Beautifulsoup with casperjs(nodejs utility) and similar libraries but none of them is good enough.
Here's the URL and the structure.
https://www.flipkart.com/redmi-note-4-gold-32-gb/p/itmer37fmekafqct?
the problem is the layout...What are some ways to get around this?
P.S : Is there anyway I could apply machine learning for getting the pricing data without knowing complex math? Like where do i even start?

You should probably construct your XPath in a way so it does not rely on the class, but rather on the content (node()) of the element you want to match. Alternatively you could match the data-reactid if that doesn't change?
For matching the div by data-reactid:
//div[#data-reactid=220]
Or for matching the div based on its location:
//span[child::img[#src="//img1a.flixcart.com/www/linchpin/fk-cp-zion/img/fa_8b4b59.png"]]/preceding-sibling::div
Assuming the img_path doesn't change you're on the safe side.

Since you can't use xpath due to dynamic changing you probably could try to use a regex for finding a price in the script tag on the page.
Something like this:
import requests
import re
url = "https://www.flipkart.com/redmi-note-4-gold-32-gb/p/itmer37fmekafqct"
r = requests.get(url)
pattern = re.compile('prexoAvailable\":[\w]+,\"price\":(\d+)')
result = pattern.search(r.text)
print(result.group(1))

from bs4 import BeatifulSoup
page = request.get(url, headers)
soup = BeautifulSoup(page.content, 'html.parser')
for a in soup.findAll('a', href=True, attrs={'class': '_31qSD5'}):
price = a.find('div', attrs={'class': '_1vC4OE _2rQ-NK'})
print(price.text)

E-commerce have does not allow anymore to scrape data like before, every entity of the product like product price, specification, reviews are now enclosed in a separate “Dynamic” class name.
And scraping certain data from the webpage you need to use specific class name which is dynamic. So using request.get() or soup() won't work.

Related

Unable to locate elements using requests and BeautifulSoup

I am writing a script in Python using the modules 'requests' and 'BeautifulSoup' to scrape results from football matches found in the links from the following page:
https://www.premierleague.com/results?co=1&se=363&cl=-1
The task consists of two steps (taking the first match, Arsenal against Brighton, as an example):
Extract and navigate to the href "https://www.premierleague.com/match/59266" found in the element:
div data-template-rendered data-href.
Navigate or to the "Stats"-tab and extracting the information found in the element:
tbody class = "matchCentreStatsContainer".
I have already tried things like
page = requests.get("https://www.premierleague.com/match/59266")
soup = BeautifulSoup(page.text, "html.parser")
soup.findAll("div", {"class" : "matchCentreStatsContainer"})
but I am not able to locate any of the elements in step 1) or 2) (empty list is returned).

Instead of this:
soup.findAll("div", {"class" : "matchCentreStatsContainer"})
Use this
soup.findAll({"class" : "matchCentreStatsContainer"})
It will work.

In this case the problem is simply that you are looking for the wrong thing. There is no <div class="matchCentreStatsContainer"> on that page, that's a <tbody> so it doesn't match. If you want the div, do:
divs = soup.find_all("div", class_="statsSection")
Otherwise search for the tbodys:
soup.find_all("tbody", class_="matchCentreStatsContainer")
Incidentally the Right Way (TM) to match classes is with class_, which takes either a list or a string (for a single class). This was added to bs4 a while back, but the old syntax is still floating around a lot.
Do note your first url as posted here is invalid: it needs a http: or https: before it.
Update
Please note I would not parse this particularly file like this. It has likely everything you already want as json. I would just do:
import json
data = json.loads(soup.find("div", class_="mcTabsContainer")["data-fixture"])
print(json.dumps(data, indent=2))
Note that data is just a dictionary: I'm only using json.dumps at the end to pretty print it.

Unable to extract text from website containing a filter

I'm trying to get all the locations out of the following website (www.mars.com/locations) using Python, with Requests and BeautifulSoup.
The website has a filter to select continent, country and region, so that it will display only the locations the company has in the selected area. They also include their headquarters at the bottom of the page, and this information is always there regardless of the filter applied.
I have no problem extracting the data for the headquarters using the code below:
import requests
from bs4 import BeautifulSoup
url = 'https://www.mars.com/locations'
page = requests.get(url)
soup = BeautifulSoup(page.text,'html.parser')
HQ = soup.find('div', class_='global-headquarter pr-5 pl-3').text.strip()
print(HQ)
The output of the code is:
Mars, Incorporated (Global Headquarters)
6885 Elm Street
McLean
Virginia
22101
+1(703) 821-4900
I want to do the same for all other locations, but I'm struggling to extract the data using the same approach (adjusting the path, of course). I've tried everything and I'm out of ideas. Would really appreciate someone giving me a hand or at least pointing me in the right direction.
Thanks a lot in advance!

All location data can be retrieved in text format. Decomposing this into a string is one way to do it. I'm not an expert in this field, so I can't help you any more.
content_json = soup.find('div', class_='location-container')
data = content_json['data-location']

i'm not an expert in BeautifulSoup, so i'll use parsel to get the data. all the locations are embedded in a location-container css class, with a data-location attribute.
import requests
from parsel import Selector
response = requests.get(url).text
selector = Selector(text=response)
data = selector.css(".location-container").xpath("./#data-location").getall()

Scrapy/Python web table missing closing TR / TD Tags

I'm redoing a data scraping project. There's a website with a table of data that is missing most or all of the closing TR and TD tags. When I first did the project with JS, I just copied the site and then split the data into arrays of rows when it encountered a new "" tag.
I want to try to rebuild this project using python/scrapy and just wondering if there was an easier way to access the data using selectors. Also I'm a little confused how to split the data when the response.data.split(') doesn't work.

I understand your problem . you can use beautyfulsoup's select method for successfully query. I make a demo code for you. hope this will help you.
import requests
from bs4 import BeautifulSoup
url = 'http://killedbypolice.net/';
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
soup.select('table tr')
print(soup.select('table')[0])

Advice about using a loop through parameters of a get request

I am trying to get each runners' information from this 2017 marathon. The problem is that to get the information I want, I would have to click on each runners' name to get his partial splits.
I know that I can use a get request to get each runners' information. For example, for the runner Josh Griffiths I can use request.get using the parameters in the url.
My problem is that I don't know how to figure out the idp term because this term changes with every runner.
My questions are the following:
Is it possible to use a loop to get all runners' information using a get request? How can I solve the issue with the `idp? I mean, the fact that I don't know how this term is determined and I don't know how to define a loop using it.
Is there a better method to get each runners' information? I thought about using Selenium-Webdriver, but this would be very slow.
Any advice would be appreciated!

You will need to use something like BeautifulSoup to parse the HTML for the links you need, that way there is no need to try and figure out how to construct the request.
import requests
from bs4 import BeautifulSoup
base_url = "http://results-2017.virginmoneylondonmarathon.com/2017/"
r = requests.get(base_url + "?pid=list")
soup = BeautifulSoup(r.content, "html.parser")
tbody = soup.find('tbody')
for tr in tbody.find_all('tr'):
for a in tr.find_all('a', href=True, class_=None):
print
print a.parent.get_text(strip=True)[1:]
r_runner = requests.get(base_url + a['href'])
soup_runner = BeautifulSoup(r_runner.content, "html.parser")
# Find the start of the splits
for h2 in soup_runner.find_all('h2'):
if "Splits" in h2:
splits_table = h2.find_next('table')
splits = []
for tr in splits_table.find_all('tr'):
splits.append([td.text for td in tr.find_all('td')])
for row in splits:
print ' {}'.format(', '.join(row))
break
For each link, you then need to follow it and parse splits from the returned HTML. The script will display starting as follows:
Boniface, Anna (GBR)
5K, 10:18:05, 00:17:55, 17:55, 03:35, 16.74, -
10K, 10:36:23, 00:36:13, 18:18, 03:40, 16.40, -
15K, 10:54:53, 00:54:44, 18:31, 03:43, 16.21, -
20K, 11:13:25, 01:13:15, 18:32, 03:43, 16.19, -
Half, 11:17:31, 01:17:21, 04:07, 03:45, 16.04, -
25K, 11:32:00, 01:31:50, 14:29, 03:43, 16.18, -
30K, 11:50:44, 01:50:34, 18:45, 03:45, 16.01, -
35K, 12:09:34, 02:09:24, 18:51, 03:47, 15.93, -
40K, 12:28:43, 02:28:33, 19:09, 03:50, 15.67, -
Finish, 12:37:17, 02:37:07, 08:35, 03:55, 15.37, 1
Griffiths, Josh (GBR)
5K, 10:15:52, 00:15:48, 15:48, 03:10, 18.99, -
10K, 10:31:42, 00:31:39, 15:51, 03:11, 18.94, -
....
To better how understand how this works, you first need to take a look at the HTML source for each of the pages. The idea being is to find something unique about what you are looking for in the structure of the page to allow you to extract it using a script.
Next I would recommend reading through the documentation page for BeautifulSoup. This assumes you understand the basic structure of an HTML document. This library gives you many tools to help you search and extract elements from the HTML. For example finding where the links are. Not all webpages can be parsed like this as the information is often created using Javascript. In these cases you would need to use something like selenium but in this case, requests and beautifulsoup do the job nicely.

Python web-scraping with changing href

I have been scraping some websites using Python 2.7
page = requests.get(URL)
tree = html.fromstring(page.content)
prices = tree.xpath('//span[#class="product-price"]/text()')
titles = tree.xpath('//span[#class="product-title"]/text()')
This works fine for websites that have these clear tags in them but a lot of the websites I encounter have the following HTML setup:
<strong>Populous</strong>
(I am tyring to extract the title: Populous)
Where an href changes for every title I am extracting, I have tried the following for the above example hoping it would see the class and that would be enough but that doesn't work
titles = tree.xpath('//a[#class="product-name"]/text()')
I was searching for a character that would work like *, as in 'I don't care what's in here, just take everything with a href=.. But couldn't find anything
titles = tree.xpath('//a[#href="*"]/text()')
Also, would I need to specify that there is also class= in the a tag like
titles = tree.xpath('//a[#href="*" #class="product-name"]/text()')
EDIT: I also found a fix if there are only changing tags in the a path using
titles = tree.xpath('//h3/a/#title')
example for this tag
<h3>4 in 1 fun pack</h3>

try this:
titles = tree.xpath('//a[#class="product-name"]//text()')
notice // after class selector.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to scrape Price from a site that has a changing structure? - python

from bs4 import BeatifulSoup page = request.get(url, headers) soup = BeautifulSoup(page.content, 'html.parser') for a in soup.findAll('a', href=True, attrs={'class': '_31qSD5'}): price = a.find('div', attrs={'class': '_1vC4OE _2rQ-NK'}) print(price.text)

Related

Unable to locate elements using requests and BeautifulSoup

Unable to extract text from website containing a filter

Scrapy/Python web table missing closing TR / TD Tags

Advice about using a loop through parameters of a get request

Python web-scraping with changing href

Categories

Resources