Extracting a link from a website using Python/Scrapy is not working - python

I want to scrape the website link from "https://www.theknot.com/marketplace/bayside-bowl-portland-me-1031451". I am using what I believe is the correct XPath, i.e. response.xpath("//a[@title='website']/@href").get(), but it returns no result when scraping.

Some websites (in fact, many) use JavaScript to generate content. That's why you always need to check the page's source HTML.
For this page you'll find that all the information you need is inside a script tag containing the text window.__INITIAL_STATE__ = . You need to extract that text and then use the json module to parse it:
import scrapy
import json


class TheknotSpider(scrapy.Spider):
    name = 'theknot'
    start_urls = ['https://www.theknot.com/marketplace/bayside-bowl-portland-me-1031451']

    def parse(self, response):
        initial_state_raw = response.xpath('//script[contains(., "window.__INITIAL_STATE__ = ")]/text()').re_first(r'window\.__INITIAL_STATE__ = (\{.+?)$')
        # with open('Samples/TheKnot.json', 'w', encoding='utf-8') as f:
        #     f.write(initial_state_raw)
        initial_state = json.loads(initial_state_raw)
        website = initial_state['vendor']['vendor']['displayWebsiteUrl']
        print(website)
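If it helps, a quick way to test this (assuming the spider is saved in a file such as theknot_spider.py, a name made up here for illustration) is:

scrapy runspider theknot_spider.py

and yielding {'website': website} instead of printing it lets you export the result with Scrapy's -o output.json option.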

Related

How to scrape a JavaScript web site?

Hello everyone. I'm a beginner at scraping, and I'm trying to scrape all iPhones on https://www.electroplanet.ma/
This is the script I wrote:
import re

import scrapy

from ..items import EpItem


class ep(scrapy.Spider):
    name = "ep"
    start_urls = [
        "https://www.electroplanet.ma/smartphone-tablette-gps/smartphone/iphone?p=1",
        "https://www.electroplanet.ma/smartphone-tablette-gps/smartphone/iphone?p=2",
    ]

    def parse(self, response):
        products = response.css("ol li")  # find all items on the page
        for product in products:
            try:
                lien = product.css("a.product-item-link::attr(href)").get()    # link of each item
                image = product.css("a.product-item-photo::attr(href)").get()  # image of each item
                # Follow each item's page and scrape it there.
                # The image is passed as an argument to parse_item because I couldn't
                # scrape the image from the item's page; I think it's hidden.
                yield response.follow(lien, callback=self.parse_item, cb_kwargs={"image": image})
            except:
                pass

    def parse_item(self, response, image):
        item = EpItem()
        item["Nom"] = response.css(".ref::text").get()
        pattern = re.compile(r"\s*(\S+(?:\s+\S+)*)\s*")
        item["Catégorie"] = pattern.search(response.xpath("//h1/a/text()").get()).group(1)
        item["Marque"] = pattern.search(response.xpath("//*[@data-th='Marque']/text()").get()).group(1)
        try:
            item["RAM"] = pattern.search(response.xpath("//*[@data-th='MÉMOIRE RAM']/text()").get()).group(1)
        except:
            pass
        item["ROM"] = pattern.search(response.xpath("//*[@data-th='MÉMOIRE DE STOCKAGE']/text()").get()).group(1)
        item["Couleur"] = pattern.search(response.xpath("//*[@data-th='COULEUR']/text()").get()).group(1)
        item["lien"] = response.request.url
        item["image"] = image
        item["état"] = "neuf"
        item["Market"] = "Electro Planet"
        yield item
I'm having trouble scraping all the pages because the site uses JavaScript for pagination, so I listed every page link in start_urls. I believe that's not best practice, so I'd appreciate any advice on how to improve my code.
You can use the scrapy-playwright plugin to scrape interactive websites. As for start_urls, just add the site's main index URL if there is only one site, and check this link in the Scrapy docs to make the spider follow page links automatically instead of writing them out manually; a sketch of that pattern follows below.
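A minimal sketch of that link-following pattern, assuming the listing pages expose a rel="next" pagination link (that selector and the spider name are assumptions for illustration, not taken from the site):

import scrapy


class EpPaginatedSpider(scrapy.Spider):
    # Hypothetical spider name, for illustration only
    name = "ep_paginated"
    # Only the first listing page; the spider discovers the following pages itself.
    start_urls = ["https://www.electroplanet.ma/smartphone-tablette-gps/smartphone/iphone?p=1"]

    def parse(self, response):
        # ... yield the per-product requests here, as in the original parse() ...

        # Assumed selector: follow a rel="next" pagination link if the page has one.
        next_page = response.css('link[rel="next"]::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)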

"list index out of range" - Beautiful Soup

NEW TO PYTHON*** Below is the code I am using to pull a zip file from a website, but I am getting the error "list index out of range". I was given this code by someone else who wrote it, but I had to change the URL and now I am getting the error. When I print(list_of_documents), it is blank.
Can someone help me with this? The URL requires access, so you won't be able to run this code directly. I am trying to understand how Beautiful Soup is used here and how I can get the list to populate correctly.
import datetime
import requests
import csv
from zipfile import ZipFile as zf
import os
import pandas as pd
import time
from bs4 import BeautifulSoup
import pyodbc
import re

# set download location
downloads_folder = r"C:\Scripts"

##### Creating outage dataframe
# Get list of download links
res = requests.get('https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD')
ercot_soup = BeautifulSoup(res.text, "lxml")

list_of_documents = ercot_soup.findAll('td', attrs={'class': 'labelOptional_ind'})
list_of_links = ercot_soup.select('a')

## create the url for the download
loc = str(list_of_links[0])[9:len(str(list_of_links[0])) - 9]
link = 'http://www.ercot.com' + loc
link = link.replace('amp;', '')

# Define file name and set download path
file_name = str(list_of_documents[0])[30:len(str(list_of_documents[0])) - 5]
file_path = downloads_folder + '/' + file_name
You can't expect code tailored to scrape one website to work for a different link! You should always inspect and explore your target site, especially the parts you need to scrape, so that you know the tag names [like td and a here] and the identifying attributes [like name, id, class, etc.] of the elements you need to extract data from.
With this site, the information you want from the reportTable is generated with JavaScript after the page loads, so it doesn't show up in the request response. You could either try something like Selenium, or you could retrieve the data from the source itself.
If you inspect the site and look at the Network tab, you'll find the request that actually retrieves the data for the table, and when you inspect the table's HTML you'll find, just above it, the scripts that generate the data.
In the suggested solution below, getReqUrl scrapes your link to get the URL for requesting the reports (and also the template of the URL for downloading the documents).
import time

import requests
from bs4 import BeautifulSoup


def getReqUrl(scrapeUrl):
    res = requests.get(scrapeUrl)
    ercot_soup = BeautifulSoup(res.text, "html.parser")

    # Find the script tag that defines reportTypeID/reportListUrl and collect
    # the lines that assign a single quoted value.
    script = [l.split('"') for l in [
        s for s in ercot_soup.select('script')
        if 'reportListUrl' in s.text
        and 'reportTypeID' in s.text
    ][0].text.split('\n') if l.count('"') == 2]

    rtID = [l[1] for l in script if 'reportTypeID' in l[0]][0]
    rlUrl = [l[1] for l in script if 'reportListUrl' in l[0]][0]
    rdUrl = [l[1] for l in script if 'reportDownloadUrl' in l[0]][0]

    return f'{rlUrl}{rtID}&_={int(time.time())}', rdUrl
(I couldn't figure out how to scrape the last query parameter [the &_=... part] from the site exactly, but {int(time.time())} seems to get close enough; the results are the same with it, and even when that last bit is omitted entirely, so it's completely optional.)
The url returned can be used to request the documents:
import json

url = 'https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD'
reqUrl, ddUrl = getReqUrl(url)
reqRes = requests.get(reqUrl).text
rsJson = json.loads(reqRes)

for doc in rsJson['ListDocsByRptTypeRes']['DocumentList']:
    d = doc['Document']
    downloadLink = ddUrl + d['DocID']
    # print(f"{d['FriendlyName']} {d['PublishDate']} {downloadLink}")
    print(f"Download '{d['ConstructedName']}' at\n\t {downloadLink}")

print(len(rsJson['ListDocsByRptTypeRes']['DocumentList']))
The print results will look like
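To actually save one of the listed files to disk (the original goal was pulling a zip file), a minimal follow-up might look like this; the file-writing part is an assumption for illustration, not taken from the answer above, and it continues from the block that builds rsJson and ddUrl:

# Hypothetical follow-up: download the first document in the list.
first = rsJson['ListDocsByRptTypeRes']['DocumentList'][0]['Document']
resp = requests.get(ddUrl + first['DocID'])
with open(first['ConstructedName'], 'wb') as f:
    f.write(resp.content)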

Web scraping a large number of links?

I am very new to web scraping. I have started using BeautifulSoup in Python. I wrote code that loops through a list of URLs and gets me the data I need. The code works fine for 10-12 links, but I am not sure the same code will be effective if the list has over 100 links. Is there an alternative approach, or another library, for getting the data from a large list of URLs without harming the website in any way? Here is my code so far.
from requests import get
from bs4 import BeautifulSoup

url_list = [url1, url2, url3, url4, url5]
mylist = []
for url in url_list:
    res = get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    data = soup.find('pre').text
    mylist.append(data)
Here's an example that might work for you.
from simplified_scrapy import Spider, SimplifiedDoc, SimplifiedMain, utils


class MySpider(Spider):
    name = 'my_spider'
    start_urls = ['url1']
    # refresh_urls = True  # To re-download already downloaded links, uncomment this line

    def __init__(self):
        # If your links are stored elsewhere, read them in here.
        self.start_urls = utils.getFileLines('your url file name.txt')
        Spider.__init__(self, self.name)  # Necessary

    def extract(self, url, html, models, modelNames):
        doc = SimplifiedDoc(html)
        data = doc.select('pre>text()')  # Extract the data you want.
        return {'Urls': None, 'Data': {'data': data}}  # Return the data to the framework, which will save it for you.


SimplifiedMain.startThread(MySpider())  # Start the download
You can see more examples, as well as the source code of the simplified_scrapy library, here: https://github.com/yiyedata/simplified-scrapy-demo
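If you'd rather stay with your requests/BeautifulSoup approach, the main courtesy when fetching a hundred-plus URLs is to reuse the connection and pause between requests. A minimal sketch along those lines (the 1-second delay is an arbitrary assumption, not a requirement of any particular site):

import time

import requests
from bs4 import BeautifulSoup


def fetch_all(url_list, delay=1.0):
    results = []
    with requests.Session() as session:  # reuse the underlying connection
        for url in url_list:
            res = session.get(url)
            soup = BeautifulSoup(res.text, 'html.parser')
            pre = soup.find('pre')
            if pre:
                results.append(pre.text)
            time.sleep(delay)  # be polite: pause between requests
    return results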

Scrapy request doesn't yield full HTML - but Requests library does

I am building a CrawlSpider to scrape statutory law data from the following website (https://www.azleg.gov/viewdocument/?docName=https://www.azleg.gov/ars/1/00101.htm). I am aiming to extract the statute text, which is located at the following XPath: //div[@class = 'first']/p/text().
All of my Scrapy requests yield incomplete HTML responses, so when I run the relevant XPath queries I get an empty list. However, when I use the requests library, the HTML downloads correctly.
Using an online XPath tester, I've verified that my XPath queries should produce the desired content. Using scrapy shell, I've viewed Scrapy's response object in my browser, and it looks just like it does when I'm browsing natively. I've tried enabling middleware for both BeautifulSoup and Selenium, but neither appeared to work.
Here's my crawl spider
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class AZspider(CrawlSpider):
    name = "arizona"
    start_urls = [
        "https://www.azleg.gov/viewdocument/?docName=https://www.azleg.gov/ars/1/00101.htm",
    ]
    rule = (Rule(LinkExtractor(restrict_xpaths="//div[@class = 'article']"), callback="parse_stats_az", follow=True),)

    def parse_stats_az(self, response):
        statutes = response.xpath("//div[@class = 'first']/p")
        yield {
            "statutes": statutes
        }
And here's the code that successfully generated the correct response object:
az_leg = requests.get("https://www.azleg.gov/viewdocument/?docName=https://www.azleg.gov/ars/1/00101.htm")
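One way to see where the two responses diverge is to compare them directly from scrapy shell; a quick diagnostic sketch (not a fix), run inside scrapy shell "https://www.azleg.gov/viewdocument/?docName=https://www.azleg.gov/ars/1/00101.htm":

import requests

plain = requests.get(response.url).text

print(len(response.text), len(plain))  # how much HTML each client received
print(response.xpath("//div[@class = 'first']/p/text()").getall()[:3])  # what Scrapy actually sees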

Python Web Scraping - Navigating to next page link and obtaining data

I am trying to navigate to links and extract data (the data is an href download link). This data should be added as a new field alongside the fields from the first page (where I got the links), but I am struggling with how to do that.
First, I created a parse method that extracts all the links on the first page and adds them to a field named "links". These links redirect to a page that contains a download button, and I need the real link behind that download button. So what I did was create a for loop over the previous links and call yield response.follow, but it didn't go well.
import scrapy


class thirdallo(scrapy.Spider):
    name = "thirdallo"
    start_urls = [
        'https://www.alloschool.com/course/alriadhiat-alaol-ibtdaii',
    ]

    def parse(self, response):
        yield {
            'path': response.css('ol.breadcrumb li a::text').extract(),
            'links': response.css('#top .default .er').xpath('@href').extract()
        }
        hrefs = response.css('#top .default .er').xpath('@href').extract()
        for i in hrefs:
            yield response.follow(i, callback=self.parse, meta={'finalLink': response.css('a.btn.btn-primary').xpath('@href').extract()})
In the @href values you are trying to scrape, it seems there are some .rar links that can't be parsed with the designated function.
Find my code below, using the requests and lxml libraries:
>>> import requests
>>> from lxml import html
>>> s = requests.Session()
>>> resp = s.get('https://www.alloschool.com/course/alriadhiat-alaol-ibtdaii')
>>> doc = html.fromstring(resp.text)
>>> doc.xpath("//*[@id='top']//*//*[@class='default']//*//*[@class='er']/@href")
['https://www.alloschool.com/assets/documents/course-342/jthathat-alftra-1-aldora-1.rar', 'https://www.alloschool.com/assets/documents/course-342/jthathat-alftra-2-aldora-1.rar', 'https://www.alloschool.com/assets/documents/course-342/jthathat-alftra-3-aldora-2.rar', 'https://www.alloschool.com/assets/documents/course-342/jdadat-alftra-4-aldora-2.rar', 'https://www.alloschool.com/element/44905', 'https://www.alloschool.com/element/43081', 'https://www.alloschool.com/element/43082', 'https://www.alloschool.com/element/43083', 'https://www.alloschool.com/element/43084', 'https://www.alloschool.com/element/43085', 'https://www.alloschool.com/element/43086', 'https://www.alloschool.com/element/43087', 'https://www.alloschool.com/element/43088', 'https://www.alloschool.com/element/43080', 'https://www.alloschool.com/element/43089', 'https://www.alloschool.com/element/43090', 'https://www.alloschool.com/element/43091', 'https://www.alloschool.com/element/43092', 'https://www.alloschool.com/element/43093', 'https://www.alloschool.com/element/43094', 'https://www.alloschool.com/element/43095', 'https://www.alloschool.com/element/43096', 'https://www.alloschool.com/element/43097', 'https://www.alloschool.com/element/43098', 'https://www.alloschool.com/element/43099', 'https://www.alloschool.com/element/43100', 'https://www.alloschool.com/element/43101', 'https://www.alloschool.com/element/43102', 'https://www.alloschool.com/element/43103', 'https://www.alloschool.com/element/43104', 'https://www.alloschool.com/element/43105', 'https://www.alloschool.com/element/43106', 'https://www.alloschool.com/element/43107', 'https://www.alloschool.com/element/43108', 'https://www.alloschool.com/element/43109', 'https://www.alloschool.com/element/43110', 'https://www.alloschool.com/element/43111', 'https://www.alloschool.com/element/43112', 'https://www.alloschool.com/element/43113']
In your code, try this:
for i in hrefs:
    if '.rar' not in i:
        yield response.follow(i, callback=self.parse, meta={'finalLink': response.css('a.btn.btn-primary').xpath('@href').extract()})
