How to skip a link that won't open while scraping? - python

I'm trying to write a .txt file for each article in the 'Capitalism' section on this page. But it stops after the 7th article, because the link to the 8th won't load. How do I skip it then?
import os
import requests
from bs4 import BeautifulSoup

session = requests.Session()  # session is assumed to be a requests.Session()
res = session.get('https://www.theschooloflife.com/thebookoflife/category/work/?index')
soup = BeautifulSoup(res.text, 'lxml')
sections = soup.select('section')
my_section = sections[7]
cat = my_section.select('.category_title')[0].text
titles = [title.text for title in my_section.select('.title')]
links = [link['href'] for link in my_section.select('ul.indexlist a')]
path = '{}'.format(cat)
os.mkdir(path)
for n, (title, link) in enumerate(zip(titles, links), start=1):
    # ...and then I make a numbered .txt file containing the text found in each link.

You haven't provided the most important parts: the exception you are getting and the code responsible for retrieving each URL. Without that, the only recommendation is to wrap the body of your for loop in a try/except and continue the loop whenever an error related to URL retrieval occurs. Assuming you are using the requests library (as suggested by session.get), you should end up with something like this:
for n, (title, link) in enumerate(zip(titles, links), start=1):
    try:
        # ...and then I make a numbered .txt file containing the text found in each link.
        ...
    except requests.exceptions.RequestException:
        continue
requests.exceptions.RequestException is the base exception of the requests module; you can find a more specific one for your case here: https://requests.readthedocs.io/en/latest/user/quickstart/#errors-and-exceptions
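For a fuller picture, a sketch of how the whole loop might look is below, continuing from the variables defined in the question (session, path, titles, links). Since the question does not show the body that fetches each article and writes the numbered .txt file, the text extraction and file naming here are illustrative assumptions:
import os
import requests
from bs4 import BeautifulSoup

for n, (title, link) in enumerate(zip(titles, links), start=1):
    try:
        # the request for the article page is the step that can fail
        article = session.get(link, timeout=10)
        article.raise_for_status()
    except requests.exceptions.RequestException:
        # skip articles whose link won't load and move on to the next one
        continue
    # assumed body: extract the article text and write it to a numbered .txt file
    # (file name built from index and title; real titles may need sanitizing)
    text = BeautifulSoup(article.text, 'lxml').get_text()
    with open(os.path.join(path, '{}. {}.txt'.format(n, title)), 'w', encoding='utf-8') as f:
        f.write(text)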

Related

For some reason .write() does not write string to .txt file anymore

So I started with Python a few days ago and have now tried to make a function that gives me all subpages of a website. I know it may not be the most elegant function, but I had been pretty proud to see it working. For some reason unknown to me, the function does not work anymore. I could've sworn I haven't changed it since it last worked, but after hours of attempted debugging I am slowly doubting myself. Can you maybe take a look at why my function no longer outputs to a .txt file? I just get handed an empty text file - though if I delete it, at least a new (empty) one gets created.
I tried moving the part that saves the strings out of the try block, which didn't work. I also tried all_urls.flush() to maybe force everything to be saved. I restarted the PC in the hope that something in the background was accessing the file and preventing me from writing to it. I also renamed the file it's supposed to save to, so as to generate something truly fresh. Still the same problem. I also checked that the link from the loop is passed as a string, so that shouldn't be a problem. I also tried:
print(link, file=all_urls, end='\n')
as a replacement for
all_urls.write(link)
all_urls.write('\n')
with no result.
My full function:
def get_subpages(url):
    # gets all subpage links from a website that start with the given url
    from urllib.request import urlopen, Request
    from bs4 import BeautifulSoup

    links = [url]
    tested_links = []
    to_test_links = links

    # open a .txt file to save results into
    all_urls = open('all_urls.txt', 'w')
    problematic_pages = open('problematic_pages.txt', 'w')

    while len(to_test_links) > 0:
        for link in to_test_links:
            print('the link we are testing right now:', link)
            # add the current link to the tested list
            tested_links.append(link)
            try:
                print(type(link))
                all_urls.write(link)
                all_urls.write('\n')
                # Save it to the .txt file and make an abstract
                # get the link ready to be accessed
                req = Request(link)
                html_page = urlopen(req)
                soup = BeautifulSoup(html_page, features="html.parser")
                # empty previous temporary links
                templinks = []
                # move the links on the subpage link to templinks
                for sublink in soup.findAll('a'):
                    templinks.append(sublink.get('href'))
                # clean off accidental 'None' values
                templinks = list(filter(lambda item: item is not None, templinks))
                for templink in templinks:
                    # make sure we still have the correct website and don't accidentally crawl instagram etc.
                    # also avoid duplicates
                    if templink.find(url) == 0 and templink not in links:
                        links.append(templink)
                # and lastly refresh the to_test_links list with the newly found links before going back into the loop
                to_test_links = list(set(links) ^ set(tested_links))
            except:
                # Save it to the ERROR .txt file and make an abstract
                problematic_pages.write(link)
                problematic_pages.write('\n')
                print('ERROR: All links on', link, 'not retrieved. If need be check for new subpages manually.')
    all_urls.close()
    problematic_pages.close()
    return links
I can't reproduce this, but I've had inexplicable [to me at least] errors with file handling that were resolved when I wrote from inside a with block.
[Just make sure to first remove the lines involving all_urls in your current code, just in case - or try this with a different filename while checking whether it works.]
Since you're appending all the urls to tested_links anyway, you could just write them all at once after the while loop:
with open('all_urls.txt', 'w') as f:
    f.write('\n'.join(tested_links) + '\n')
or, if you have to write link by link, you can append by opening with mode='a':
# before the while loop, if you're not sure the file exists
# [and/or to clear previous data from the file]
# with open('all_urls.txt', 'w') as f: f.write('')

# and inside the try block:
with open('all_urls.txt', 'a') as f:
    f.write(f'{link}\n')
Not a direct answer, but in my early days this happened to me as well. The requests module sends requests with headers that identify Python; websites can detect that quickly, your IP can get blocked, and you start getting unusual responses - which is why a previously working function suddenly stops working.
Solution:
Use natural-looking request headers - see the code below:
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
r = requests.get(URL, headers=headers)
Use a proxy in case your IP has been blocked - this is highly recommended.
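Since the function in the question uses urllib rather than requests, the same idea can be applied there too; a minimal sketch, assuming the default urllib User-Agent is what gets the requests flagged (link is the loop variable from the question's function):
from urllib.request import urlopen, Request

# browser-like headers; the exact User-Agent string is only an example
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/75.0.3770.142 Safari/537.36'}

req = Request(link, headers=headers)  # instead of Request(link) in the question's function
html_page = urlopen(req)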
Here is your slightly changed script with the changes marked with (*****************):
def get_subpages(url):
    # gets all subpage links from a website that start with the given url
    from urllib.request import urlopen, Request
    from bs4 import BeautifulSoup

    links = [url]
    # ******************* added sublinks_list variable *******************
    sublinks_list = []
    tested_links = []
    to_test_links = links

    # open a .txt file to save results into
    all_urls = open('all_urls.txt', 'w')
    problematic_pages = open('problematic_pages.txt', 'w')

    while len(to_test_links) > 0:
        for link in to_test_links:
            print('the link we are testing right now:', link)
            # add the current link to the tested list
            tested_links.append(link)
            try:
                all_urls.write(link)
                all_urls.write('\n')
                # Save it to the .txt file and make an abstract
                # get the link ready to be accessed
                req = Request(link)
                html_page = urlopen(req)
                soup = BeautifulSoup(html_page, features="html.parser")
                # empty previous temporary links
                templinks = []
                # move the links on the subpage link to templinks
                sublinks = soup.findAll('a')
                for sublink in sublinks:
                    # templinks.append(sublink.get('href'))  ***************** changed the line with next row *****************
                    templinks.append(sublink['href'])
                # clean off accidental 'None' values
                templinks = list(filter(lambda item: item is not None, templinks))
                for templink in templinks:
                    # make sure we still have the correct website and don't accidentally crawl instagram etc.
                    # also avoid duplicates
                    # if templink.find(url) == 0 and templink not in links:  ******************* changed the line with next row *****************
                    if templink not in sublinks_list:
                        # links.append(templink)  ******************* changed the line with next row *****************
                        sublinks_list.append(templink)
                        all_urls.write(templink + '\n')  # ******************* added this line *****************
                # and lastly refresh the to_test_links list with the newly found links before going back into the loop
                to_test_links = list(set(links) ^ set(tested_links))
            except:
                # Save it to the ERROR .txt file and make an abstract
                problematic_pages.write(link)
                problematic_pages.write('\n')
                print('ERROR: All links on', link, 'not retrieved. If need be check for new subpages manually.')
    all_urls.close()
    problematic_pages.close()
    return links

lnks = get_subpages('https://www.jhanley.com/blog/pyscript-creating-installable-offline-applications/')  # ******************* url used for testing *****************
It works and there are over 180 links in the file. Please test it yourself. There are still some misfits and questionable syntax, so you should test your code thoroughly again - but the part that writes links into a file works.
Regards...

list index out of range - beautiful soup

NEW TO PYTHON*** Below is the code I am using to pull a zip file from a website, but I am getting the error "list index out of range". I was given this code by someone else who wrote it, but I had to change the URL and now I am getting the error. When I print(list_of_documents) it is blank.
Can someone help me with this? The url requires access, so you won't be able to run this code directly. I am trying to understand how Beautiful Soup is used here and how I can get the list to populate correctly.
import datetime
import requests
import csv
from zipfile import ZipFile as zf
import os
import pandas as pd
import time
from bs4 import BeautifulSoup
import pyodbc
import re
#set download location
downloads_folder = r"C:\Scripts"
##### Creating outage dataframe
#Get list of download links
res = requests.get('https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD')
ercot_soup = BeautifulSoup(res.text, "lxml")
list_of_documents = ercot_soup.findAll('td', attrs={'class': 'labelOptional_ind'})
list_of_links = ercot_soup.select('a')
##create the url for the download
loc = str(list_of_links[0])[9:len(str(list_of_links[0]))-9]
link = 'http://www.ercot.com' + loc
link = link.replace('amp;','')
# Define file name and set download path
file_name = str(list_of_documents[0])[30:len(str(list_of_documents[0]))-5]
file_path = downloads_folder + '/' + file_name
You can't expect code tailored to scrape one website to work for a different link! You should always inspect and explore your target site, especially the parts you need to scrape, so you know the tag names [like td and a here] and identifying attributes [like name, id, class, etc.] of the elements you need to extract data from.
With this site, the info in the reportTable is generated by javascript after the page loads, so it doesn't show up in the request response. You could either try something like Selenium, or retrieve the data from its source directly.
If you inspect the site and look at the network tab, you'll find the request that actually retrieves the data for the table, and when you inspect the table's html you'll find just above it the scripts that generate the data.
In the suggested solution below, the getReqUrl scrapes your link to get the url for requesting the reports (and also the template of the url for downloading the documents).
def getReqUrl(scrapeUrl):
    res = requests.get(scrapeUrl)
    ercot_soup = BeautifulSoup(res.text, "html.parser")

    script = [l.split('"') for l in [
        s for s in ercot_soup.select('script')
        if 'reportListUrl' in s.text
        and 'reportTypeID' in s.text
    ][0].text.split('\n') if l.count('"') == 2]

    rtID = [l[1] for l in script if 'reportTypeID' in l[0]][0]
    rlUrl = [l[1] for l in script if 'reportListUrl' in l[0]][0]
    rdUrl = [l[1] for l in script if 'reportDownloadUrl' in l[0]][0]

    return f'{rlUrl}{rtID}&_={int(time.time())}', rdUrl
(I couldn't figure out how to scrape the last query parameter [the &_=... part] from the site exactly, but {int(time.time())} seems to get close enough - the results are the same even when that last bit is omitted entirely, so it's optional.)
The url returned can be used to request the documents:
import json

url = 'https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD'
reqUrl, ddUrl = getReqUrl(url)
reqRes = requests.get(reqUrl).text
rsJson = json.loads(reqRes)

for doc in rsJson['ListDocsByRptTypeRes']['DocumentList']:
    d = doc['Document']
    downloadLink = ddUrl + d['DocID']
    # print(f"{d['FriendlyName']} {d['PublishDate']} {downloadLink}")
    print(f"Download '{d['ConstructedName']}' at\n\t {downloadLink}")

print(len(rsJson['ListDocsByRptTypeRes']['DocumentList']))
The printed results list each document's name together with its download link.
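To actually save one of the zip files, a download step along these lines could follow; this is only a sketch - picking the first document, reusing the downloads_folder from the question, and naming the output after ConstructedName are assumptions for illustration:
import os
import requests

# download the first document in the list (an assumed choice, just for illustration)
first = rsJson['ListDocsByRptTypeRes']['DocumentList'][0]['Document']
dl = requests.get(ddUrl + first['DocID'])
dl.raise_for_status()

out_path = os.path.join(downloads_folder, first['ConstructedName'])
with open(out_path, 'wb') as f:
    f.write(dl.content)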

Recursive crawling with BeautifulSoup really slow

I'm building a crawler that downloads all .pdf files of a given website and its subpages. For this, I've built functionality around the simplified recursive function below, which retrieves all links of a given URL.
However, this becomes quite slow the longer it crawls a given website (it may take 2 minutes or longer per URL).
I can't quite figure out what's causing this and would really appreciate suggestions on what needs to be changed in order to increase the speed.
import re
import requests
from bs4 import BeautifulSoup

pages = set()

def get_links(page_url):
    global pages
    pattern = re.compile("^(/)")
    html = requests.get(f"https://www.srs-stahl.de/{page_url}").text
    soup = BeautifulSoup(html, "html.parser")

    for link in soup.find_all("a", href=pattern):
        if "href" in link.attrs:
            if link.attrs["href"] not in pages:
                new_page = link.attrs["href"]
                print(new_page)
                pages.add(new_page)
                get_links(new_page)

get_links("")
It is not that easy to figure out what is actively slowing down your crawling - it may be the way you crawl, the website's server, ...
In your code, you request a URL, grab its links and immediately call the function on the first new one, so you only ever keep track of urls that have already been requested.
You may want to work with "queues" to keep the process more transparent.
One advantage is that if the script aborts, this information is already stored and you can use it to resume from the urls you have collected but not yet visited. Quite the opposite of your recursion, which may have to start over at an earlier point to ensure it gets all urls.
Another point is that you request the PDF files without using the response in any way. Wouldn't it make more sense to either download and save them directly, or skip the request and keep the links in a separate "queue" for post-processing?
Collected information in comparison - based on the same number of iterations:
Code in question:
pages --> 24
Example code (without delay):
urlsVisited --> 24
urlsToVisit --> 87
urlsToDownload --> 67
Example
Just to demonstrate - feel free to create defs, classes and structure to your needs. Note that I added some delay, but you can skip it if you like. The "queues" used to demonstrate the process are lists, but they should really be files, a database, ... so that your data is stored safely.
import requests, time
from bs4 import BeautifulSoup

baseUrl = 'https://www.srs-stahl.de'

urlsToDownload = []
urlsToVisit = ["https://www.srs-stahl.de/"]
urlsVisited = []

def crawl(url):
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")

    for a in soup.select('a[href^="/"]'):
        url = f"{baseUrl}{a['href']}"
        if '.pdf' in url and url not in urlsToDownload:
            urlsToDownload.append(url)
        else:
            if url not in urlsToVisit and url not in urlsVisited:
                urlsToVisit.append(url)

while urlsToVisit:
    url = urlsToVisit.pop(0)
    try:
        crawl(url)
    except Exception as e:
        print(f'Failed to crawl: {url} -> error {e}')
    finally:
        urlsVisited.append(url)
        time.sleep(2)
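As a follow-up to the post-processing idea above, the collected urlsToDownload could then be fetched in a separate step once crawling is done; a minimal sketch (the output folder and the filename derived from the url are assumptions for illustration):
import os
import requests

os.makedirs('pdfs', exist_ok=True)

for url in urlsToDownload:
    try:
        r = requests.get(url)
        r.raise_for_status()
        # derive a local filename from the last path segment of the url (assumed naming scheme)
        with open(os.path.join('pdfs', url.rsplit('/', 1)[-1]), 'wb') as f:
            f.write(r.content)
    except Exception as e:
        print(f'Failed to download: {url} -> error {e}')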

Python Selenium to script Medium rss feed

I'm trying to scrape some blogs with Python and Selenium.
However, the page source is limited to a few articles, so I need to scroll down to trigger the AJAX loading.
Is there a way to get the full source in one call with selenium?
The code would be something like:
# url and page source generating
url = url_constructor_medium_news(blog_name)
content = social_data_tools.selenium_page_source_generator(driver, url)
try:
# construct soup
soup = BeautifulSoup(content, "html.parser").rss.channel
# break condition
divs = soup.find_all('item')
except AttributeError as e:
print(e.__cause__)
# friendly
time.sleep(3 + random.randint(1, 5))
I don't believe there is a way to populate the driver with unloaded data that would otherwise be obtained by scrolling.
An alternative solution for getting the data would be driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
I've previously used this as a reference.
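A common pattern built on that call is to keep scrolling until the page height stops changing and only then read page_source; a minimal sketch, assuming the driver is already on the feed page:
import time

def scroll_to_bottom(driver, pause=2, max_rounds=20):
    # keep scrolling until the document height stops growing (or the round limit is hit)
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the AJAX content time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    return driver.page_source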
I hope this helps!

LXML unable to retrieve webpage using link from file

Hi, this may look like a repost but it is not. I recently posted a similar question, but this is a separate issue that links to that problem. As seen from the previous question (LXML unable to retrieve webpage with error "failed to load HTTP resource"), I am now able to read and print the article if the link is the first line of the file. However, once I try to do it multiple times, it comes back with the error shown here: http://tinypic.com/r/2rr2mau/8
import lxml.html
from urlparse import urljoin

def fetch_article_content_cna(i):
    BASE_URL = "http://channelnewsasia.com"
    f = open('cnaurl2.txt')
    line = f.readlines()
    print line[i]
    url = urljoin(BASE_URL, line[i])
    t = lxml.html.parse(url)
    # print t.find(".//title").text
    content = '\n'.join(t.xpath('.//div[@class="news_detail"]/div/p/text()'))
    return content
cnaurl2.txt
/news/world/tripoli-fire-rages-as/1287826.html
/news/asiapacific/korea-ferry-survivors/1287508.html
Try:
url = urljoin(BASE_URL, line[i].strip())
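For context: readlines() keeps the trailing newline on every line it returns, so without .strip() the joined URL ends in a newline character and lxml fails to load it. A tiny sketch of the difference (written in Python 2 to match the question's code; the sample line is taken from cnaurl2.txt above):
from urlparse import urljoin

BASE_URL = "http://channelnewsasia.com"
line = ['/news/world/tripoli-fire-rages-as/1287826.html\n']   # what readlines() returns

print repr(urljoin(BASE_URL, line[0]))          # ends with '\n' -> fails to load
print repr(urljoin(BASE_URL, line[0].strip()))  # clean URL -> loads fine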
