I have a problem with a simple XPath and I can't figure out why it's not working.
I copied the function from a working one, and I seriously don't have a clue why this one doesn't work.
I read several tutorials and have a working function in another script, but this function doesn't do what I want. It should get some strings from the webpage, but I just get empty variables.
def getWeather():
    try:
        page = requests.get('https://www.google.com/search?q=wetter&oq=wetter&ie=UTF-8')
    except:
        print('URL not reachable')
    tree = html.fromstring(page.content)
    #print( tree )
    weatherInfo = tree.xpath('//span[@id="wob_dc"]/text()')
    tempInfo = tree.xpath('//span[@id="wob_tm"]/text()')
    windInfo = tree.xpath('//span[@id="wob_ws"]/text()')
    print(weatherInfo)  # empty
    r = str(weatherInfo) + " " + str(tempInfo) + " " + str(windInfo)
    return r
Can you give any advice?
It's all about headers in your requests. This sample works for me:
from lxml import html
import requests

def getWeather():
    try:
        page = requests.get(
            'https://www.google.com/search?q=wetter&oq=wetter&ie=UTF-8',
            headers={
                'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
            }
        )
    except:
        print('URL not reachable')
    tree = html.fromstring(page.content)
    #print( tree )
    weatherInfo = tree.xpath('//span[@id="wob_dc"]/text()')
    tempInfo = tree.xpath('//span[@id="wob_tm"]/text()')
    windInfo = tree.xpath('//span[@id="wob_ws"]/text()')
    print(weatherInfo)  # empty
    r = str(weatherInfo) + " " + str(tempInfo) + " " + str(windInfo)
    return r

getWeather()
This is down to Google: their servers don't serve the same page to everyone, so this is less a Python question than a web-developer question.
My non-web-developer version: the server builds the weather page, sends it to you according to your location, and then throws it away. If you aren't in Germany, you get a different page.
The problem isn't in the XPath, but in the request.
P.S.: I checked this code myself with another link and it works.
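One way to make Google's response more predictable (a hedged sketch, not something either answer above tested) is to pin the language and country in the query string and send an Accept-Language header, so the markup does not change with the caller's location:

import requests
from lxml import html

# hl/gl pin Google's interface language and country; Accept-Language does the same at the HTTP level
params = {'q': 'wetter', 'ie': 'UTF-8', 'hl': 'de', 'gl': 'de'}
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
    'accept-language': 'de-DE,de;q=0.9',
}
page = requests.get('https://www.google.com/search', params=params, headers=headers)
tree = html.fromstring(page.content)
print(tree.xpath('//span[@id="wob_dc"]/text()'))  # still depends on Google serving the weather widget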
I am trying to extract the first 100 URLs that return from a location search in Google, however I am getting an empty list every time ("no results found").
import requests
from bs4 import BeautifulSoup

def get_location_info(location):
    query = location + " information"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'
    }
    url = "https://www.google.com/search?q=" + query
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    results = soup.find_all("div", class_="r")
    websites = []
    if results:
        counter = 0
        for result in results:
            websites.append(result.find("a")["href"])
            counter += 1
            if counter == 100:
                break
    else:
        print("No search results found.")
    return websites

location = "Athens"
print(get_location_info(location))
No search results found.
[]
I have also tried this approach:
import requests
from bs4 import BeautifulSoup

def get_location_info(location):
    query = location + " information"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'
    }
    url = "https://www.google.com/search?q=" + query
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    results = soup.find_all("div", class_="r")
    websites = [result.find("a")["href"] for result in results][:10]
    return websites

location = "sifnos"
print(get_location_info(location))
and I get an empty list. I think I am doing everything suggested in similar posts, but I still get nothing.
Always and first of all, take a look at your soup to see if all the expected ingredients are in place.
Select your elements more specifically, in this case for example with a CSS selector:
[a.get('href') for a in soup.select('a:has(>h3)')]
To avoid the consent banner, also send a cookie:
cookies={'CONSENT':'YES+'}
Example
import requests
from bs4 import BeautifulSoup

def get_location_info(location):
    query = location + " information"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'
    }
    url = "https://www.google.com/search?q=" + query
    response = requests.get(url, headers=headers, cookies={'CONSENT':'YES+'})
    soup = BeautifulSoup(response.text, 'html.parser')
    websites = [a.get('href') for a in soup.select('a:has(>h3)')]
    return websites

location = "sifnos"
print(get_location_info(location))
Output
['https://www.griechenland.de/sifnos/', 'http://de.sifnos-greece.com/plan-trip-to-sifnos/travel-information.php', 'https://www.sifnosisland.gr/', 'https://www.visitgreece.gr/islands/cyclades/sifnos/', 'http://www.griechenland-insel.de/Hauptseiten/sifnos.htm', 'https://worldonabudget.de/sifnos-griechenland/', 'https://goodmorningworld.de/sifnos-griechenland/', 'https://de.wikipedia.org/wiki/Sifnos', 'https://sifnos.gr/en/sifnos/', 'https://www.discovergreece.com/de/cyclades/sifnos']
My code is only making empty folders and not downloading images.
So I think it needs to be modified so that the images actually get downloaded.
I tried to fix it by myself, but I can't figure out how.
Can anyone please help me? Thank you!
import requests
import parsel
import os
import time

for page in range(1, 310):  # Total 309 pages
    print(f'======= Scraping data from page {page} =======')
    url = f'https://www.bikeexif.com/page/{page}'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
    response = requests.get(url, headers=headers)
    html_data = response.text
    selector = parsel.Selector(html_data)
    containers = selector.xpath('//div[@class="container"]/div/article[@class="smallhalf"]')
    for v in containers:
        old_title = v.xpath('.//div[2]/h2/a/text()').get()  #.replace(':', ' -')
        if old_title is not None:
            title = old_title.replace(':', ' -')
            title_url = v.xpath('.//div[2]/h2/a/@href').get()
            print(title, title_url)
            if not os.path.exists('img\\' + title):
                os.mkdir('img\\' + title)
            response_image = requests.get(url=title_url, headers=headers).text
            selector_image = parsel.Selector(response_image)
            # Full Size Images
            images_url = selector_image.xpath('//div[@class="image-context"]/a[@class="download"]/@href').getall()
            for title_url in images_url:
                image_data = requests.get(url=title_url, headers=headers).content
                file_name = title_url.split('/')[-1]
                time.sleep(1)
                with open(f'img\\{title}\\' + file_name, mode='wb') as f:
                    f.write(image_data)
                print('Download complete!!:', file_name)
This page uses JavaScript to create the "download" link, but requests/urllib/beautifulsoup/lxml/parsel/scrapy can't run JavaScript, and that is the problem.
However, it seems the page uses the same URLs to display the images on the page, so you can use //img/@src instead.
That creates another problem, because the page uses JavaScript to "lazy load" the images and only the first img has src. The other images keep their URL in data-src (normally JavaScript copies data-src to src as you scroll the page), so you have to read data-src to download the rest of the images.
You need something like this to get @src (for the first image) and @data-src (for the other images).
images_url = selector_image.xpath('//div[@id="content"]//img/@src').getall() + \
             selector_image.xpath('//div[@id="content"]//img/@data-src').getall()
Full working code (with other small changes)
Because I use Linux, the string img\\{title} creates a wrong path for me,
so I use os.path.join('img', title, filename) to build a correct path on Windows, Linux and Mac.
import requests
import parsel
import os
import time

# you can define it once
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}

for page in range(1, 310):  # Total 309 pages
    print(f'======= Scraping data from page {page} =======')
    url = f'https://www.bikeexif.com/page/{page}'
    response = requests.get(url, headers=headers)
    selector = parsel.Selector(response.text)
    containers = selector.xpath('//div[@class="container"]/div/article[@class="smallhalf"]')
    for v in containers:
        old_title = v.xpath('.//div[2]/h2/a/text()').get()  #.replace(':', ' -')
        if old_title is not None:
            title = old_title.replace(':', ' -')
            title_url = v.xpath('.//div[2]/h2/a/@href').get()
            print(title, title_url)
            os.makedirs(os.path.join('img', title), exist_ok=True)  # creates it only if it doesn't exist
            response_article = requests.get(url=title_url, headers=headers)
            selector_article = parsel.Selector(response_article.text)
            # Full Size Images
            images_url = selector_article.xpath('//div[@id="content"]//img/@src').getall() + \
                         selector_article.xpath('//div[@id="content"]//img/@data-src').getall()
            print('len(images_url):', len(images_url))
            for img_url in images_url:
                response_image = requests.get(url=img_url, headers=headers)
                filename = img_url.split('/')[-1]
                with open(os.path.join('img', title, filename), 'wb') as f:
                    f.write(response_image.content)
                print('Download complete!!:', filename)
Hi everybody, I'm trying to get the captcha image from a website in order to scrape it. My problem is that the URL for the captcha image contains a parameter whose origin I can't find, so I grabbed it with parser.xpath, but it doesn't work. This is my code:
import requests, io, re
from PIL import Image
from lxml import html

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebkit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36",
}

session = requests.Session()
login_url = 'https://www.sat.gob.pe/WebSiteV8/popupv2.aspx?t=6'
login_form_res = session.get(login_url, headers=headers)
myhtml = login_form_res.text

evalu = ''
for match in re.finditer(r'(mysession=)(.*?)(")', myhtml):
    evalu = myhtml[match.start():match.end()]
evalu = evalu.replace("mysession=", "")
evalu = evalu.replace('"', '')
print(evalu)

url_infractions = 'https://www.sat.gob.pe/VirtualSAT/modulos/RecordConductor.aspx?mysession=' + evalu
login_form_res = session.get(url_infractions, headers=headers)
myhtml = login_form_res.text
parser = html.fromstring(login_form_res.text)

idPic = parser.xpath('//img[@class="captcha_class"]/@src')
urlPic = "https://www.sat.gob.pe/VirtualSAT" + idPic[0].replace("..","")
print(urlPic)

image_content = session.get(urlPic, headers=headers)
image_file = io.BytesIO(image_content)
image = Image.open(image_file).convert('RGB').content
image.show()
As a result I get an exception that says TypeError: a bytes-like object is required, not 'Response'. I'm confused. I would really appreciate your help. Thanks in advance.
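For what it's worth, that TypeError points at the last few lines: io.BytesIO needs the raw bytes of the response, not the Response object itself, and a PIL Image has no .content attribute. A minimal sketch of how those lines could look, keeping the rest of the script as posted:

image_response = session.get(urlPic, headers=headers)
image_file = io.BytesIO(image_response.content)  # pass the body bytes, not the Response
image = Image.open(image_file).convert('RGB')    # convert() already returns an Image
image.show()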
I am new to the web scraping field, so hopefully this question is clear.
I found a tutorial on the internet to scrape Amazon data based on a given ASIN (unique Amazon number). See: https://www.scrapehero.com/tutorial-how-to-scrape-amazon-product-details-using-python/
When running this code (I adjusted it a bit), I ran into the issue that I received different results every time (even when running it 5 seconds later). In my example, one time the titles are found, but 5 seconds later the result is NULL.
I think the reason is that I looked up the XPath via Google Chrome, and at the beginning of the code there is this header:
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
My question: how can I scrape the content in a stable way (e.g. getting the real results of the pages by using the ASIN numbers)?
Below is the code to reproduce it. You can run the script via the command line:
python script_name.py
Thanks a lot for your help!
The script:
from lxml import html
import csv,os,json
import requests
#from exceptions import ValueError
from time import sleep

def AmzonParser(url):
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
    page = requests.get(url,headers=headers)
    while True:
        sleep(5)
        try:
            doc = html.fromstring(page.content)

            # Title
            XPATH_NAME = '//*[@id="productTitle"]/text()'
            XPATH_NAME1 = doc.xpath(XPATH_NAME)
            TITLE = ' '.join(''.join(XPATH_NAME1).split()) if XPATH_NAME1 else None

            #XPATH_SALE_PRICE = '//span[contains(@id,"ourprice") or contains(@id,"saleprice")]/text()'
            #XPATH_ORIGINAL_PRICE = '//td[contains(text(),"List Price") or contains(text(),"M.R.P") or contains(text(),"Price")]/following-sibling::td/text()'
            #XPATH_CATEGORY = '//a[@class="a-link-normal a-color-tertiary"]//text()'
            #XPATH_AVAILABILITY = '//div[@id="availability"]//text()'

            #RAW_NAME = doc.xpath(XPATH_NAME)
            #RAW_SALE_PRICE = doc.xpath(XPATH_SALE_PRICE)
            #RAW_CATEGORY = doc.xpath(XPATH_CATEGORY)
            #RAW_ORIGINAL_PRICE = doc.xpath(XPATH_ORIGINAL_PRICE)
            #RAw_AVAILABILITY = doc.xpath(XPATH_AVAILABILITY)

            #NAME = ' '.join(''.join(RAW_NAME).split()) if RAW_NAME else None
            #SALE_PRICE = ' '.join(''.join(RAW_SALE_PRICE).split()).strip() if RAW_SALE_PRICE else None
            #CATEGORY = ' > '.join([i.strip() for i in RAW_CATEGORY]) if RAW_CATEGORY else None
            #ORIGINAL_PRICE = ''.join(RAW_ORIGINAL_PRICE).strip() if RAW_ORIGINAL_PRICE else None
            #AVAILABILITY = ''.join(RAw_AVAILABILITY).strip() if RAw_AVAILABILITY else None

            #if not ORIGINAL_PRICE:
            #    ORIGINAL_PRICE = SALE_PRICE

            if page.status_code!=200:
                raise ValueError('captha')

            data = {
                'TITLE':TITLE
                #'SALE_PRICE':SALE_PRICE,
                #'CATEGORY':CATEGORY,
                #'ORIGINAL_PRICE':ORIGINAL_PRICE,
                #'AVAILABILITY':AVAILABILITY,
                #'URL':url,
            }
            return data
        except Exception as e:
            print(e)

def ReadAsin():
    # AsinList = csv.DictReader(open(os.path.join(os.path.dirname(__file__),"Asinfeed.csv")))
    AsinList = [
        'B00AEINQ9K',
        'B00JWP8F3I']
    extracted_data = []
    for i in AsinList:
        url = "http://www.amazon.com/dp/"+i
        print("Processing: "+url)
        extracted_data.append(AmzonParser(url))
        sleep(5)
    f=open('data_scraped_data.json','w')
    json.dump(extracted_data,f,indent=4)

if __name__ == "__main__":
    ReadAsin()
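One observation, offered as a hedged sketch rather than a definitive fix: the requests.get call sits outside the while True loop, so every retry re-parses the same response and an intermittent bot-check page never gets a second chance. Re-requesting inside a retry loop and backing off when the response looks like Amazon's block page (historically titled "Robot Check"; that marker is an assumption here) tends to make runs more stable:

import requests
from lxml import html
from time import sleep

def fetch_title(url, headers, attempts=5):
    for _ in range(attempts):
        page = requests.get(url, headers=headers)
        # back off and retry when Amazon returns an error or what looks like its captcha page
        if page.status_code != 200 or 'Robot Check' in page.text:
            sleep(5)
            continue
        doc = html.fromstring(page.content)
        parts = doc.xpath('//*[@id="productTitle"]/text()')
        return ' '.join(''.join(parts).split()) if parts else None
    return None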
I've been playing around with web scraping (for this practice exercise, using Python 3.6.2) and I feel like I'm losing it a bit. Given this example link, here's what I want to do:
First, as you can see, there are multiple categories on the page. Clicking each of the categories above will give me other categories, then others, and so on, until I reach the products page. So I have to go x levels deep. I thought recursion would help me achieve this, but somewhere I did something wrong.
Code:
Here, I'll explain the way I approached the problem. First, I created a session and a simple generic function which will return an lxml.html.HtmlElement object:
from lxml import html
from requests import Session

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/62.0.3202.94 Safari/537.36"
}

TEST_LINK = 'https://www.richelieu.com/us/en/category/custom-made-cabinet-doors-and-drawers/1000128'

session_ = Session()

def get_page(url):
    page = session_.get(url, headers=HEADERS).text
    return html.fromstring(page)
Then, I thought I'd need two other functions:
one to get the category links
and another one to get the product links
To distinguish between the two, I noticed that only category pages have a title that always contains CATEGORIES, so I used that:
def read_categories(page):
    categs = []
    try:
        if 'CATEGORIES' in page.xpath('//div[@class="boxData"][2]/h2')[0].text.strip():
            for a in page.xpath('//*[@id="carouselSegment2b"]//li//a'):
                categs.append(a.attrib["href"])
            return categs
        else:
            return None
    except Exception:
        return None

def read_products(page):
    return [
        a_tag.attrib["href"]
        for a_tag in page.xpath("//ul[@id='prodResult']/li//div[@class='imgWrapper']/a")
    ]
Now, the only thing left is the recursion part, where I'm sure I did something wrong:
def read_all_categories(page):
    cat = read_categories(page)
    if not cat:
        yield read_products(page)
    else:
        yield from read_all_categories(page)

def main():
    main_page = get_page(TEST_LINK)
    for links in read_all_categories(main_page):
        print(links)
Here's all the code put together:
from lxml import html
from requests import Session

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/62.0.3202.94 Safari/537.36"
}

TEST_LINK = 'https://www.richelieu.com/us/en/category/custom-made-cabinet-doors-and-drawers/1000128'

session_ = Session()

def get_page(url):
    page = session_.get(url, headers=HEADERS).text
    return html.fromstring(page)

def read_categories(page):
    categs = []
    try:
        if 'CATEGORIES' in page.xpath('//div[@class="boxData"][2]/h2')[0].text.strip():
            for a in page.xpath('//*[@id="carouselSegment2b"]//li//a'):
                categs.append(a.attrib["href"])
            return categs
        else:
            return None
    except Exception:
        return None

def read_products(page):
    return [
        a_tag.attrib["href"]
        for a_tag in page.xpath("//ul[@id='prodResult']/li//div[@class='imgWrapper']/a")
    ]

def read_all_categories(page):
    cat = read_categories(page)
    if not cat:
        yield read_products(page)
    else:
        yield from read_all_categories(page)

def main():
    main_page = get_page(TEST_LINK)
    for links in read_all_categories(main_page):
        print(links)

if __name__ == '__main__':
    main()
Could someone please point me in the right direction regarding the recursion function?
Here is how I would solve this:
from lxml import html as html_parser
from requests import Session

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
}

def dig_up_products(url, session=Session()):
    html = session.get(url, headers=HEADERS).text
    page = html_parser.fromstring(html)

    # if it appears to be a categories page, recurse
    for link in page.xpath('//h2[contains(., "CATEGORIES")]/'
                           'following-sibling::div[@id="carouselSegment1b"]//li//a'):
        yield from dig_up_products(link.attrib["href"], session)

    # if it appears to be a products page, return the links
    for link in page.xpath('//ul[@id="prodResult"]/li//div[@class="imgWrapper"]/a'):
        yield link.attrib["href"]

def main():
    start = 'https://www.richelieu.com/us/en/category/custom-made-cabinet-doors-and-drawers/1000128'
    for link in dig_up_products(start):
        print(link)

if __name__ == '__main__':
    main()
There is nothing wrong with iterating over an empty XPath expression result, so you can simply put both cases (categories page/products page) into the same function, as long as the XPath expressions are specific enough to identify each case.
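As a quick illustration of that point (a throwaway snippet, not part of the scraper): xpath() returns an empty list when nothing matches, so the loop for the "wrong" page type is simply a no-op.

from lxml import html

page = html.fromstring('<html><body><p>no products here</p></body></html>')
# nothing matches, xpath() returns [], and the loop body never runs
for link in page.xpath('//ul[@id="prodResult"]//a'):
    print(link.attrib["href"])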
You can do it like this as well to make your script slightly more concise. I used the lxml library along with CSS selectors to do the job. The script will parse all the links under a category and look for the dead end; when it reaches one, it parses the title from there, and it does the whole thing over and over again until all the links are exhausted.
from lxml.html import fromstring
import requests

def products_links(link):
    res = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
    page = fromstring(res.text)
    try:
        for item in page.cssselect(".contentHeading h1"):  # check for the match available in the target page
            print(item.text)
    except:
        pass
    for link in page.cssselect("h2:contains('CATEGORIES')+[id^='carouselSegment'] .touchcarousel-item a"):
        products_links(link.attrib["href"])

if __name__ == '__main__':
    main_page = 'https://www.richelieu.com/us/en/category/custom-made-cabinet-doors-and-drawers/1000128'
    products_links(main_page)
Partial result:
BRILLANTÉ DOORS
BRILLANTÉ DRAWER FRONTS
BRILLANTÉ CUT TO SIZE PANELS
BRILLANTÉ EDGEBANDING
LACQUERED ZENIT DOORS
ZENIT CUT-TO-SIZE PANELS
EDGEBANDING
ZENIT CUT-TO-SIZE PANELS