Web scraping a large number of links? - Python

I am very new to web scraping. I have started using BeautifulSoup in Python. I wrote code that loops through a list of URLs and gets the data I need. The code works fine for 10-12 links, but I am not sure whether the same code will be effective if the list has over 100 links. Is there an alternative way, or any other library, to get the data from a large list of URLs without harming the website in any way? Here is my code so far.
from requests import get
from bs4 import BeautifulSoup

url_list = [url1, url2, url3, url4, url5]
mylist = []
for url in url_list:
    res = get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    data = soup.find('pre').text
    mylist.append(data)

Here's an example that may work for you.
from simplified_scrapy import Spider, SimplifiedDoc, SimplifiedMain, utils

class MySpider(Spider):
    name = 'my_spider'
    start_urls = ['url1']
    # refresh_urls = True  # If you want to re-download already downloaded links, remove the "#" at the front

    def __init__(self):
        # If your links are stored elsewhere, read them in here.
        self.start_urls = utils.getFileLines('your url file name.txt')
        Spider.__init__(self, self.name)  # Necessary

    def extract(self, url, html, models, modelNames):
        doc = SimplifiedDoc(html)
        data = doc.select('pre>text()')  # Extract the data you want.
        return {'Urls': None, 'Data': {'data': data}}  # Return the data to the framework, which will save it for you.

SimplifiedMain.startThread(MySpider())  # Start download
You can see more examples, as well as the source code of the simplified_scrapy library, here: https://github.com/yiyedata/simplified-scrapy-demo
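If you would rather stay with requests and BeautifulSoup, here is a minimal sketch along the lines of the original loop (not from the answer above; the one-second delay, the timeout, and the error handling are assumptions) that reuses a session, pauses between requests, and skips URLs that fail:

import time
import requests
from bs4 import BeautifulSoup

mylist = []
with requests.Session() as session:  # reuse one connection for all requests
    for url in url_list:
        try:
            res = session.get(url, timeout=10)
            res.raise_for_status()
        except requests.RequestException as e:
            print(f'Skipping {url}: {e}')
            continue
        soup = BeautifulSoup(res.text, 'html.parser')
        pre = soup.find('pre')
        if pre:
            mylist.append(pre.text)
        time.sleep(1)  # polite delay between requests; adjust to the site's tolerance

Even with a few hundred URLs this keeps the load on the server modest; the main cost is simply that it runs sequentially.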

Related

list index out of range - beautiful soup

NEW TO PYTHON. Below is the code I am using to pull a zip file from a website, but I am getting the error "list index out of range". I was given this code by someone else who wrote it, but I had to change the URL, and now I am getting the error. When I print(list_of_documents) it is blank.
Can someone help me with this? The URL requires access, so you won't be able to run this code directly. I am trying to understand how Beautiful Soup is used here and how I can get the list to populate correctly.
import datetime
import requests
import csv
from zipfile import ZipFile as zf
import os
import pandas as pd
import time
from bs4 import BeautifulSoup
import pyodbc
import re
#set download location
downloads_folder = r"C:\Scripts"  # a raw string cannot end with a backslash; the separator is added when building file_path below
##### Creating outage dataframe
#Get list of download links
res = requests.get('https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD')
ercot_soup = BeautifulSoup(res.text, "lxml")
list_of_documents = ercot_soup.findAll('td', attrs={'class': 'labelOptional_ind'})
list_of_links = ercot_soup.select('a')
##create the url for the download
loc = str(list_of_links[0])[9:len(str(list_of_links[0]))-9]
link = 'http://www.ercot.com' + loc
link = link.replace('amp;','')
# Define file name and set download path
file_name = str(list_of_documents[0])[30:len(str(list_of_documents[0]))-5]
file_path = downloads_folder + '/' + file_name
You can't expect code tailored to scrape one website to work for a different link! You should always inspect and explore your target site, especially the parts you need to scrape, so you know the tag names [like td and a here] and identifying attributes [like name, id, class, etc.] of the elements you need to extract data from.
With this site, if you want the info from the reportTable, it gets generated after the page loads via JavaScript, so it doesn't show up in the request response. You could either try something like Selenium, or you could try retrieving the data from the source itself.
If you inspect the site and look at the network tab, you'll find the request that actually retrieves the data for the table, and when you inspect the table's HTML, you'll find the scripts that generate the data just above it.
In the suggested solution below, getReqUrl scrapes your link to get the URL for requesting the reports (and also the template of the URL for downloading the documents).
def getReqUrl(scrapeUrl):
    res = requests.get(scrapeUrl)
    ercot_soup = BeautifulSoup(res.text, "html.parser")

    script = [l.split('"') for l in [
        s for s in ercot_soup.select('script')
        if 'reportListUrl' in s.text
        and 'reportTypeID' in s.text
    ][0].text.split('\n') if l.count('"') == 2]
    rtID = [l[1] for l in script if 'reportTypeID' in l[0]][0]
    rlUrl = [l[1] for l in script if 'reportListUrl' in l[0]][0]
    rdUrl = [l[1] for l in script if 'reportDownloadUrl' in l[0]][0]

    return f'{rlUrl}{rtID}&_={int(time.time())}', rdUrl
(I couldn't figure out how to scrape the last query parameter [the &_=... part] from the site exactly, but {int(time.time())} seems to get close enough - the results are the same even when that last bit is omitted entirely, so it's optional.)
The url returned can be used to request the documents:
import json

url = 'https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD'
reqUrl, ddUrl = getReqUrl(url)
reqRes = requests.get(reqUrl).text
rsJson = json.loads(reqRes)

for doc in rsJson['ListDocsByRptTypeRes']['DocumentList']:
    d = doc['Document']
    downloadLink = ddUrl + d['DocID']
    # print(f"{d['FriendlyName']} {d['PublishDate']} {downloadLink}")
    print(f"Download '{d['ConstructedName']}' at\n\t {downloadLink}")

print(len(rsJson['ListDocsByRptTypeRes']['DocumentList']))
The printed results list each document's constructed name with its download link, followed by the total number of documents.

How to Extract/Scrape a specific part of information from WikiData URLs

I have a list of web IDs that I want to scrape from the WikiData website. Here are two links as an example.
https://www.wikidata.org/wiki/Special:EntityData/Q317521.jsonld
https://www.wikidata.org/wiki/Special:EntityData/Q478214.jsonld
I only need the first set of "P31" values from each URL. For the first URL, the information that I need is "wd:Q5", and for the second URL it is ["wd:Q786820", "wd:Q167037", "wd:Q6881511", "wd:Q4830453", "wd:Q431289", "wd:Q43229", "wd:Q891723"], and I want to store these in a list.
When I use find and search for "P31", I only need the first result out of all the results.
The output will look like this.
info = ['wd:Q5',
        ["wd:Q786820", "wd:Q167037", "wd:Q6881511", "wd:Q4830453", "wd:Q431289", "wd:Q43229", "wd:Q891723"],
       ]
lst = ["Q317521","Q478214"]
for q in range(len(lst)):
link =f'https://www.wikidata.org/wiki/Special:EntityData/{q}.jsonld'
page = requests.get(link)
soup = BeautifulSoup(page.text, 'html.parser')
After that, I do not know how to extract the information from the first set of "P31". I am using the requests, BeautifulSoup, and Selenium libraries, but I am wondering whether there are better ways to scrape/extract that information from the URLs besides using XPath or a class?
Thank you so much!
You only need requests as you are getting a JSON response.
You can use a function that loops over the relevant nested JSON object and exits at the first occurrence of the target key, appending the associated value to your list.
The loop variable should be the id to add into the url for the request.
import requests

lst = ["Q317521", "Q478214"]
info = []

def get_first_p31(data):
    for i in data['@graph']:
        if 'P31' in i:
            info.append(i['P31'])
            break

with requests.Session() as s:
    s.headers = {"User-Agent": "Safari/537.36"}
    for q in lst:
        link = f'https://www.wikidata.org/wiki/Special:EntityData/{q}.jsonld'
        try:
            r = s.get(link).json()
            get_first_p31(r)
        except Exception:
            print('failed with link: ', link)

Extracting link from a website using python, scrapy is not working

I want to scrape the website link from "https://www.theknot.com/marketplace/bayside-bowl-portland-me-1031451". I have used what I believe is the proper XPath, i.e. response.xpath("//a[@title='website']/@href").get(), but it returns a null result when scraping.
Some websites (in fact, many) use JavaScript to generate content. That's why you always need to check the source HTML code.
For this page you'll find that all the information you need is inside a script tag containing the text window.__INITIAL_STATE__ = . You need to get this text, and then you can use the json module to parse it:
import scrapy
import json

class TheknotSpider(scrapy.Spider):
    name = 'theknot'
    start_urls = ['https://www.theknot.com/marketplace/bayside-bowl-portland-me-1031451']

    def parse(self, response):
        initial_state_raw = response.xpath('//script[contains(., "window.__INITIAL_STATE__ = ")]/text()').re_first(r'window\.__INITIAL_STATE__ = (\{.+?)$')
        # with open('Samples/TheKnot.json', 'w', encoding='utf-8') as f:
        #     f.write(initial_state_raw)
        initial_state = json.loads(initial_state_raw)
        website = initial_state['vendor']['vendor']['displayWebsiteUrl']
        print(website)
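Assuming the spider is saved as a standalone file, say theknot.py (the filename is just an example), it can be run without creating a full Scrapy project:

scrapy runspider theknot.py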

Python Web Scraping - Navigating to next page link and obtaining data

I am trying to navigate to links and extract data (the data is an href download link). This data should be added as a new field alongside the fields already scraped from the first page (where I got the links), but I am struggling with how to do that.
First, I created a parse method that extracts all the links of the first page and adds them to a field named "links". These links redirect to a page that contains a download button, and I need the real link behind that download button. So what I did was create a for loop over the previous links and call yield response.follow, but it didn't go well.
import scrapy

class thirdallo(scrapy.Spider):
    name = "thirdallo"
    start_urls = [
        'https://www.alloschool.com/course/alriadhiat-alaol-ibtdaii',
    ]

    def parse(self, response):
        yield {
            'path': response.css('ol.breadcrumb li a::text').extract(),
            'links': response.css('#top .default .er').xpath('@href').extract()
        }
        hrefs = response.css('#top .default .er').xpath('@href').extract()
        for i in hrefs:
            yield response.follow(i, callback=self.parse, meta={'finalLink': response.css('a.btn.btn-primary').xpath('@href').extract()})
Among the @href values you are trying to scrape, it seems there are some .rar links that can't be parsed with the designated function.
Find my code below, using the requests and lxml libraries:
>>> import requests
>>> from lxml import html
>>> s = requests.Session()
>>> resp = s.get('https://www.alloschool.com/course/alriadhiat-alaol-ibtdaii')
>>> doc = html.fromstring(resp.text)
>>> doc.xpath("//*[#id='top']//*//*[#class='default']//*//*[#class='er']/#href")
['https://www.alloschool.com/assets/documents/course-342/jthathat-alftra-1-aldora-1.rar', 'https://www.alloschool.com/assets/documents/course-342/jthathat-alftra-2-aldora-1.rar', 'https://www.alloschool.com/assets/documents/course-342/jthathat-alftra-3-aldora-2.rar', 'https://www.alloschool.com/assets/documents/course-342/jdadat-alftra-4-aldora-2.rar', 'https://www.alloschool.com/element/44905', 'https://www.alloschool.com/element/43081', 'https://www.alloschool.com/element/43082', 'https://www.alloschool.com/element/43083', 'https://www.alloschool.com/element/43084', 'https://www.alloschool.com/element/43085', 'https://www.alloschool.com/element/43086', 'https://www.alloschool.com/element/43087', 'https://www.alloschool.com/element/43088', 'https://www.alloschool.com/element/43080', 'https://www.alloschool.com/element/43089', 'https://www.alloschool.com/element/43090', 'https://www.alloschool.com/element/43091', 'https://www.alloschool.com/element/43092', 'https://www.alloschool.com/element/43093', 'https://www.alloschool.com/element/43094', 'https://www.alloschool.com/element/43095', 'https://www.alloschool.com/element/43096', 'https://www.alloschool.com/element/43097', 'https://www.alloschool.com/element/43098', 'https://www.alloschool.com/element/43099', 'https://www.alloschool.com/element/43100', 'https://www.alloschool.com/element/43101', 'https://www.alloschool.com/element/43102', 'https://www.alloschool.com/element/43103', 'https://www.alloschool.com/element/43104', 'https://www.alloschool.com/element/43105', 'https://www.alloschool.com/element/43106', 'https://www.alloschool.com/element/43107', 'https://www.alloschool.com/element/43108', 'https://www.alloschool.com/element/43109', 'https://www.alloschool.com/element/43110', 'https://www.alloschool.com/element/43111', 'https://www.alloschool.com/element/43112', 'https://www.alloschool.com/element/43113']
In your code, try this:
for i in hrefs:
    if '.rar' not in i:
        yield response.follow(i, callback=self.parse, meta={'finalLink': response.css('a.btn.btn-primary').xpath('@href').extract()})

Website Downloader using Python

I am trying to create a website downloader using python. I have the code for:
Finding all URLs from a page
Downloading a given URL
What I have to do is recursively download a page, and if there are any other links in that page, I need to download them too. I tried combining the two functions above, but the recursion doesn't work.
The codes are given below:
1)
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)

if __name__ == "__main__":
    import urllib
    wanted_url = raw_input("Enter the URL: ")
    usock = urllib.urlopen(wanted_url)
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    for url in parser.urls:
        download(url)
2) where the download(url) function is defined as follows:
def download(url):
    import urllib
    webFile = urllib.urlopen(url)
    localFile = open(url.split('/')[-1], 'w')
    localFile.write(webFile.read())
    webFile.close()
    localFile.close()

a = raw_input("Enter the URL")
download(a)
print "Done"
Kindly help me with how to combine these two pieces of code to "recursively" download the new links found on each webpage that is downloaded.
You may want to look into the Scrapy library.
It would make a task like this pretty trivial, and allow you to download multiple pages concurrently.
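As a rough illustration of that suggestion (not part of the answer above; the start URL, depth limit, and file-naming scheme are placeholders), a Scrapy spider that recursively saves pages and follows their links might look like this:

import scrapy

class SiteDownloader(scrapy.Spider):
    name = 'site_downloader'
    start_urls = ['http://example.com/']  # placeholder start page
    custom_settings = {'DEPTH_LIMIT': 3}  # stop following links after three levels

    def parse(self, response):
        # Save the current page under a name derived from its last path segment.
        filename = response.url.rstrip('/').split('/')[-1] or 'index'
        with open(filename + '.html', 'wb') as f:
            f.write(response.body)
        # Follow every link on the page; Scrapy de-duplicates repeated requests by default.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)

Scrapy handles request scheduling and concurrency, so the "recursion" becomes a simple matter of yielding follow-up requests.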
done_url = []

def download(url):
    if url in done_url:
        return
    ...download url code...
    done_url.append(url)
    urls = some_function_to_fetch_urls_from_this_page()
    for url in urls:
        download(url)
This is very sad/bad code. For example, you will need to check whether the URL is within the domain you want to crawl or not. However, you asked for a recursive approach.
Be mindful of the recursion depth.
There are just so many things wrong with my solution. :P
You should try a crawling library like Scrapy or something similar.
Generally, the idea is this:
def get_links_recursive(document, current_depth, max_depth):
    links = document.get_links()
    for link in links:
        downloaded = link.download()
        if current_depth < max_depth:
            get_links_recursive(downloaded, current_depth + 1, max_depth)
Call get_links_recursive(document, 0, 3) to get things started.
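A rough, concrete version of that idea using requests and BeautifulSoup (the visited set, the file naming, and the start URL are my own illustrative additions, not part of the answer) could look like:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

visited = set()

def get_links_recursive(url, current_depth, max_depth):
    # Skip pages already downloaded and respect the depth limit.
    if url in visited or current_depth > max_depth:
        return
    visited.add(url)
    response = requests.get(url)
    # Save the page under a name derived from its last path segment.
    filename = url.rstrip('/').split('/')[-1] or 'index'
    with open(filename + '.html', 'w', encoding='utf-8') as f:
        f.write(response.text)
    # Recurse into every link found on the page.
    soup = BeautifulSoup(response.text, 'html.parser')
    for a in soup.find_all('a', href=True):
        get_links_recursive(urljoin(url, a['href']), current_depth + 1, max_depth)

get_links_recursive('http://example.com/', 0, 3)  # placeholder start URL

In real use you would also want to restrict the recursion to the domain you are crawling, as the earlier answer points out.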
