I have been scraping some websites using Python 2.7
page = requests.get(URL)
tree = html.fromstring(page.content)
prices = tree.xpath('//span[@class="product-price"]/text()')
titles = tree.xpath('//span[@class="product-title"]/text()')
This works fine for websites that have these clear tags in them but a lot of the websites I encounter have the following HTML setup:
<strong>Populous</strong>
(I am trying to extract the title: Populous)
The href changes for every title I am extracting. I tried the following for the example above, hoping that matching on the class would be enough, but it doesn't work:
titles = tree.xpath('//a[@class="product-name"]/text()')
I was searching for a character that would work like *, as in "I don't care what's in here, just take everything with an href=..", but couldn't find anything:
titles = tree.xpath('//a[@href="*"]/text()')
Also, would I need to specify that there is also a class= in the a tag, like this?
titles = tree.xpath('//a[@href="*" @class="product-name"]/text()')
EDIT: I also found a fix if there are only changing tags in the a path using
titles = tree.xpath('//h3/a/@title')
example for this tag
<h3>4 in 1 fun pack</h3>
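Try this:
titles = tree.xpath('//a[@class="product-name"]//text()')
Notice the // after the class selector: it selects text nodes at any depth below the a element (here, inside the nested strong), whereas a single / only selects text that is a direct child of a.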
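The "any href" part of the question can be handled with an attribute-existence test. In XPath, [@href] matches when the attribute is present, whatever its value, so with lxml the expressions would be //a[@href]//text() and //a[@href and @class="product-name"]//text(). The sketch below shows the same idea on made-up markup. It uses the standard library's ElementTree as a stand-in for lxml; ElementTree supports only a subset of XPath, so the conditions are chained as two predicates and the text is collected with itertext(). The sample HTML and the helper name text_of are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup modelled on the question: the href differs per product.
SAMPLE = """
<div>
  <a class="product-name" href="/games/populous"><strong>Populous</strong></a>
  <a class="product-name" href="/games/syndicate"><strong>Syndicate</strong></a>
  <a href="/about">About us</a>
  <a name="top">no href here</a>
</div>
"""

root = ET.fromstring(SAMPLE)


def text_of(element):
    """Join all text at any depth below an element, like //text() would."""
    return "".join(element.itertext()).strip()


# [@href] only tests that the attribute exists; its value is ignored.
with_href = [text_of(a) for a in root.findall(".//a[@href]")]

# Chained predicates: both conditions must hold.
titles = [text_of(a) for a in root.findall(".//a[@href][@class='product-name']")]

print(with_href)
print(titles)
```

The fourth anchor has no href and is skipped by both queries. The third has an href but no class, so it appears only in the first result.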
Scraping links should be a simple feat, usually just a matter of grabbing the href value of the a tag.
I recently came across this website (https://sunteccity.com.sg/promotions) where the href value of the a tag of each item cannot be found, but the redirection still works. I'm trying to figure out a way to grab the items and their corresponding links. My typical Python Selenium code looks something like this:
all_items = bot.find_elements_by_class_name('thumb-img')
for promo in all_items:
    a = promo.find_elements_by_tag_name("a")
    print("a[0]: ", a[0].get_attribute("href"))
However, I can't seem to retrieve any href, onclick attributes, and I'm wondering if this is even possible. I noticed that I couldn't do a right-click, open link in new tab as well.
Are there any ways around getting the links of all these items?
Edit: Are there any ways to retrieve all the links of the items on the pages?
i.e.
https://sunteccity.com.sg/promotions/724
https://sunteccity.com.sg/promotions/731
https://sunteccity.com.sg/promotions/751
https://sunteccity.com.sg/promotions/752
https://sunteccity.com.sg/promotions/754
https://sunteccity.com.sg/promotions/280
...
Edit:
Adding an image of one such anchor tag for better clarity:
Reverse-engineering the JavaScript that takes you to the promotions pages (seen in https://sunteccity.com.sg/_nuxt/d4b648f.js) gives you a way to get all the links, which are based on the HappeningID. You can verify this by running the following in the JS console, which gives you the HappeningID of the first promotion:
window.__NUXT__.state.Promotion.promotions[0].HappeningID
Based on that, you can create a Python loop to get all the promotions:
items = driver.execute_script("return window.__NUXT__.state.Promotion;")
for item in items["promotions"]:
    base = "https://sunteccity.com.sg/promotions/"
    happening_id = str(item["HappeningID"])
    print(base + happening_id)
That generated the following output:
https://sunteccity.com.sg/promotions/724
https://sunteccity.com.sg/promotions/731
https://sunteccity.com.sg/promotions/751
https://sunteccity.com.sg/promotions/752
https://sunteccity.com.sg/promotions/754
https://sunteccity.com.sg/promotions/280
https://sunteccity.com.sg/promotions/764
https://sunteccity.com.sg/promotions/766
https://sunteccity.com.sg/promotions/762
https://sunteccity.com.sg/promotions/767
https://sunteccity.com.sg/promotions/732
https://sunteccity.com.sg/promotions/733
https://sunteccity.com.sg/promotions/735
https://sunteccity.com.sg/promotions/736
https://sunteccity.com.sg/promotions/737
https://sunteccity.com.sg/promotions/738
https://sunteccity.com.sg/promotions/739
https://sunteccity.com.sg/promotions/740
https://sunteccity.com.sg/promotions/741
https://sunteccity.com.sg/promotions/742
https://sunteccity.com.sg/promotions/743
https://sunteccity.com.sg/promotions/744
https://sunteccity.com.sg/promotions/745
https://sunteccity.com.sg/promotions/746
https://sunteccity.com.sg/promotions/747
https://sunteccity.com.sg/promotions/748
https://sunteccity.com.sg/promotions/749
https://sunteccity.com.sg/promotions/750
https://sunteccity.com.sg/promotions/753
https://sunteccity.com.sg/promotions/755
https://sunteccity.com.sg/promotions/756
https://sunteccity.com.sg/promotions/757
https://sunteccity.com.sg/promotions/758
https://sunteccity.com.sg/promotions/759
https://sunteccity.com.sg/promotions/760
https://sunteccity.com.sg/promotions/761
https://sunteccity.com.sg/promotions/763
https://sunteccity.com.sg/promotions/765
https://sunteccity.com.sg/promotions/730
https://sunteccity.com.sg/promotions/734
https://sunteccity.com.sg/promotions/623
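The URL-building step of the loop above can be pulled into a small function, so it can be checked without a browser. This is only a sketch. The sample dictionary is a hypothetical stand-in for what execute_script returns, and the function name promotion_urls is made up for illustration.

```python
BASE = "https://sunteccity.com.sg/promotions/"


def promotion_urls(promotion_state):
    """Build one URL per promotion from the Promotion state dictionary."""
    return [BASE + str(item["HappeningID"]) for item in promotion_state["promotions"]]


# Hypothetical stand-in for the object returned by execute_script;
# the real promotion entries carry more keys than HappeningID.
sample_state = {"promotions": [{"HappeningID": 724}, {"HappeningID": 731}]}

urls = promotion_urls(sample_state)
print(urls)
```

With a live driver, the argument would be the value returned by driver.execute_script("return window.__NUXT__.state.Promotion;").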
You are using the wrong locator; it matches a lot of irrelevant elements.
Instead of find_elements_by_class_name('thumb-img'), try find_elements_by_css_selector('.collections-page .thumb-img'), so your code becomes:
all_items = bot.find_elements_by_css_selector('.collections-page .thumb-img')
for promo in all_items:
    a = promo.find_elements_by_tag_name("a")
    print("a[0]: ", a[0].get_attribute("href"))
You can also get the desired links directly with the locator .collections-page .thumb-img a, so that your code could be:
links = bot.find_elements_by_css_selector('.collections-page .thumb-img a')
for link in links:
    print(link.get_attribute("href"))
xpath_id = '/html/body'
conf_code = driver.find_element(By.XPATH, (xpath_id))
code_list = []
for c in range(len(conf_code)):
    code_list.append(conf_code[c].text)
As seen above, I chose the XPath locator, but I can't locate the text. That is because this particular web page is completely blank apart from some text in the <body>.
The HTML of the page is below:
<html><head></head><body>text that I want to read and save</body></html>
How can I read this text and store it in a variable?
Your question is not clear enough.
Anyway, in case there are multiple elements containing texts on that page you can use something like this:
xpath_id = '/html/body/*'
conf_code = driver.find_elements(By.XPATH, (xpath_id))
code_list = []
for c in conf_code:
    code_list.append(c.text)
Don't forget to add a delay so that the page is completely loaded before you collect these elements.
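A fixed sleep is the simplest form of delay, but polling until the elements actually appear is usually more robust. Selenium provides WebDriverWait for this. The sketch below shows only the underlying idea using the standard library, so it runs without a browser. The helper wait_until and the fake lookup function are hypothetical names, not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %s seconds" % timeout)
        time.sleep(interval)


# Hypothetical stand-in for a page whose elements show up on the third poll.
attempts = {"count": 0}


def fake_find_elements():
    attempts["count"] += 1
    return ["element"] if attempts["count"] >= 3 else []


found = wait_until(fake_find_elements, timeout=2.0, interval=0.01)
print(found, attempts["count"])
```

With a real driver, the condition could be something like lambda: driver.find_elements(By.XPATH, xpath_id), since an empty list is falsy and keeps the loop polling.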
If you're really just grabbing a website that is so simple, you don't need selenium. Grab the website with requests and split the result on the body tags to get the text. Much simpler code and avoids the overhead of the selenium driver.
import requests
url = "http://your-url-here.com"
content = requests.get(url).text
the_string_youre_looking_for = content.split('<body>')[1].split('</body>')[0]
Is this what you're looking for? If not, maybe try and reword your question, because it's a bit hard to understand what you want your code to do and in what context.
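Splitting on the literal string '<body>' stops working as soon as the tag carries an attribute or is written in upper case. A slightly more tolerant sketch, still without Selenium, uses the standard library's html.parser. The class name BodyText, the function body_text and the sample page are assumptions for illustration; the same function could also be fed the text from requests.get(url).text.

```python
from html.parser import HTMLParser


class BodyText(HTMLParser):
    """Collect the text that appears between <body> and </body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag names, so <BODY> is handled too.
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.chunks.append(data)


def body_text(markup):
    parser = BodyText()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks).strip()


# Hypothetical page; the attribute on <body> would defeat split('<body>').
sample = '<html><head></head><body class="plain">text that I want to read and save</body></html>'
print(body_text(sample))
```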
Resolved using
print(driver.page_source)
I got the full HTML content, and because it is so simple it was easy to extract the required content within the <body> tag.
I have a problem extracting text from a span element using Python and lxml. I have managed to get this to work for some sites, but not all.
I have a function that extracts the price from a site; it works with the URL and XPath in the following code snippet.
def get_price(last_date):
    page = requests.get('https://www.komplett.no/product/954922/gaming/gaming-utstyr/gamingskjermer/hp-omen-27-gamingskjerm-z4d33aa')
    tree = html.fromstring(page.content)
    prices = tree.xpath('//span[@class="product-price-now"]/text()')
    currentPrice = 0
    for string in prices:
        currentPrice = string.strip(",-")
    print(currentPrice)
    foo(currentPrice, last_date)
But when I tried the same method with a different URL and a different span element that has more than one attribute, it didn't work. Here is the span element whose text I can't get:
<span class="DFlfde SwHCTb" data-precision="2" data-value="77954.88534">77,954.89</span>
Then I tried to extract the text from this span element by doing the following:
prices = tree.xpath('//span[@class="DFlfde SwHCTb"]/text()')
But that didn't work. Any idea why?
I am trying to write a script which performs a Google search for the input keyword and returns only the content from the top 10 URLs.
Note: Content specifically refers to the content that is being requested by the searched term and is found in the body of the returned URLs.
I am done with the search and top 10 url retrieval part. Here is the script:
from google import search
top_10_links = search(keyword, tld='com.in', lang='en',stop=10)
However, I am unable to retrieve only the content from the links without knowing their structure. I can scrape content from a particular site by finding the class etc. of the tags using dev tools. But I am unable to figure out how to get content from the top 10 result URLs, since every search term returns different URLs (different sites have different CSS selectors) and it would be pretty hard to find the CSS class of the required content. Here is the sample code to extract content from a particular site.
content_dict = {}
i = 1
for page in links:
    print(i, ' # link: ', page)
    article_html = get_page(page)  # get_page() returns the page's HTML
    soup = BeautifulSoup(article_html, 'lxml')
    content = soup.find('div', {'class': 'entry-content'}).get_text()
    content_dict[page] = content
    i += 1
However, the CSS class changes for the different sites. Is there some way I can get this script working and get the desired content?
You can't do scraping without knowing the structure of what you're scraping. But there is a package that does something similar. Take a look at newspaper.
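When no site-specific class is known, one crude fallback is to keep only the text inside paragraph tags and ignore everything else. The sketch below is a simple heuristic written with the standard library. It is not how the newspaper package works, and it will miss content on pages that don't use p elements. The class name ParagraphCollector, the function main_text and the sample page are made up for illustration.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element, one string per paragraph."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.current = []
        self.paragraphs = []

    def _flush(self):
        # Normalise whitespace and store the paragraph if it is not empty.
        text = " ".join("".join(self.current).split())
        if text:
            self.paragraphs.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends where the next one starts
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.current.append(data)


def main_text(markup):
    collector = ParagraphCollector()
    collector.feed(markup)
    collector.close()
    return "\n".join(collector.paragraphs)


# Hypothetical page: navigation and footer text sit outside any <p>.
sample = (
    '<html><body><nav><a href="/">Home</a></nav>'
    '<div class="entry-content"><p>First <b>paragraph</b>.</p>'
    "<p>Second paragraph.</p></div>"
    "<footer>Copyright</footer></body></html>"
)
print(main_text(sample))
```

In the question's loop, main_text(article_html) could replace the soup.find(...) lookup that depends on the entry-content class, at the cost of less precise results.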
I want to scrape the pricing data from an eCommerce site called Flipkart. I tried using BeautifulSoup with CasperJS (a Node.js utility) and similar libraries, but none of them is good enough.
Here's the URL and the structure.
https://www.flipkart.com/redmi-note-4-gold-32-gb/p/itmer37fmekafqct?
The problem is the layout... What are some ways to get around this?
P.S.: Is there any way I could apply machine learning to get the pricing data without knowing complex math? Where do I even start?
You should probably construct your XPath in a way so it does not rely on the class, but rather on the content (node()) of the element you want to match. Alternatively you could match the data-reactid if that doesn't change?
For matching the div by data-reactid:
//div[@data-reactid=220]
Or for matching the div based on its location:
//span[child::img[@src="//img1a.flixcart.com/www/linchpin/fk-cp-zion/img/fa_8b4b59.png"]]/preceding-sibling::div
Assuming the img_path doesn't change you're on the safe side.
Since you can't use XPath because the markup changes dynamically, you could try using a regex to find the price in the script tag on the page.
Something like this:
import requests
import re
url = "https://www.flipkart.com/redmi-note-4-gold-32-gb/p/itmer37fmekafqct"
r = requests.get(url)
pattern = re.compile(r'prexoAvailable":\w+,"price":(\d+)')  # raw string avoids invalid-escape warnings
result = pattern.search(r.text)
print(result.group(1))
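The snippet above calls result.group(1) directly, which raises AttributeError when the pattern is not found, for example if the page layout changes. A small sketch that wraps the same regex in a function and returns None instead is shown below. The function name extract_price and the sample strings are hypothetical, and the fragment is only shaped like the JSON the pattern expects; it is not copied from the site.

```python
import re

PRICE_PATTERN = re.compile(r'prexoAvailable":\w+,"price":(\d+)')


def extract_price(page_text):
    """Return the first embedded price as an int, or None when nothing matches."""
    match = PRICE_PATTERN.search(page_text)
    return int(match.group(1)) if match else None


# Hypothetical fragment shaped like the embedded JSON the pattern targets.
sample = '<script>{"prexoAvailable":false,"price":9999,"currency":"INR"}</script>'

print(extract_price(sample))
print(extract_price("<html>no price here</html>"))
```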
import requests
from bs4 import BeautifulSoup
# url and headers are assumed to be defined by the caller
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
for a in soup.findAll('a', href=True, attrs={'class': '_31qSD5'}):
    price = a.find('div', attrs={'class': '_1vC4OE _2rQ-NK'})
    print(price.text)
E-commerce sites no longer allow data to be scraped the way they used to: every attribute of a product, such as price, specification, and reviews, is now enclosed in a separate "dynamic" class name.
To scrape a particular piece of data from the page, you need to use its specific class name, which is dynamic. So using requests.get() or soup() with a fixed class name won't work.