LXML unable to retrieve webpage using link from file - python

This may look like a repost, but it is not. I recently posted a similar question, and this is a separate issue that follows on from it. As seen in the previous question (LXML unable to retrieve webpage with error "failed to load HTTP resource"), I am now able to read and print the article if the link is the first line of the file. However, once I try to do it multiple times, it comes back with an error (http://tinypic.com/r/2rr2mau/8).
import lxml.html
from urlparse import urljoin  # Python 2; in Python 3 use urllib.parse
def fetch_article_content_cna(i):
    BASE_URL = "http://channelnewsasia.com"
    f = open('cnaurl2.txt')
    line = f.readlines()
    print line[i]
    url = urljoin(BASE_URL, line[i])
    t = lxml.html.parse(url)
    #print t.find(".//title").text
    content = '\n'.join(t.xpath('.//div[@class="news_detail"]/div/p/text()'))
    return content
cnaurl2.txt
/news/world/tripoli-fire-rages-as/1287826.html
/news/asiapacific/korea-ferry-survivors/1287508.html

Try stripping the trailing newline that readlines() leaves on each line:
url = urljoin(BASE_URL, line[i].strip())
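To see why strip() matters, here is a minimal sketch that reads the two paths from an in-memory file instead of cnaurl2.txt (the StringIO stand-in is an assumption made so the example runs without the file). It is written for Python 3, where urljoin lives in urllib.parse. readlines() keeps the line break at the end of each line, and in the question's code that line break presumably ends up in the URL handed to lxml.

```python
import io
from urllib.parse import urljoin  # Python 3 location; Python 2 has it in urlparse

BASE_URL = "http://channelnewsasia.com"

# In-memory stand-in for cnaurl2.txt (an assumption, so the sketch needs no file)
f = io.StringIO(
    "/news/world/tripoli-fire-rages-as/1287826.html\n"
    "/news/asiapacific/korea-ferry-survivors/1287508.html\n"
)
lines = f.readlines()

# repr() makes the trailing newline visible
print(repr(lines[0]))

# strip() removes the newline before the path is joined to the base URL
url = urljoin(BASE_URL, lines[0].strip())
print(repr(url))
```

Printing with repr() rather than a plain print is a handy way to spot this kind of invisible whitespace.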

Related

list index out of range - beautiful soup

I am new to Python. Below is the code I am using to pull a zip file from a website, but I am getting the error "list index out of range". The code was written by someone else and given to me; I had to change the URL, and now I am getting the error. When I print(list_of_documents), it is empty.
Can someone help me with this? The URL requires access, so you won't be able to run this code directly. I am trying to understand how to use Beautiful Soup here and how I can get the list to populate correctly.
import datetime
import requests
import csv
from zipfile import ZipFile as zf
import os
import pandas as pd
import time
from bs4 import BeautifulSoup
import pyodbc
import re
#set download location
downloads_folder = r"C:\Scripts"  # a raw string cannot end with a backslash
##### Creating outage dataframe
#Get list of download links
res = requests.get('https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD')
ercot_soup = BeautifulSoup(res.text, "lxml")
list_of_documents = ercot_soup.findAll('td', attrs={'class': 'labelOptional_ind'})
list_of_links = ercot_soup.select('a')
##create the url for the download
loc = str(list_of_links[0])[9:len(str(list_of_links[0]))-9]
link = 'http://www.ercot.com' + loc
link = link.replace('amp;','')
# Define file name and set download path
file_name = str(list_of_documents[0])[30:len(str(list_of_documents[0]))-5]
file_path = downloads_folder + '/' + file_name
You can't expect code tailored to scrape one website to work for a different link! You should always inspect and explore your target site, especially the parts you need to scrape, so you know the tag names [like td and a here] and identifying attributes [like name, id, class, etc.] of the elements you need to extract data from.
With this site, the reportTable you want is generated by JavaScript after the page has loaded, so it does not show up in the response to a plain request. You could either try something like Selenium, or you could try retrieving the data from the source itself.
If you inspect the site and look at the network tab, you'll find the request that actually retrieves the data for the table, and when you inspect the table's HTML, you'll find the scripts that generate the data just above it.
In the suggested solution below, the getReqUrl function scrapes your link to get the URL for requesting the reports (and also the URL template for downloading the documents).
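import time
import requests
from bs4 import BeautifulSoup
def getReqUrl(scrapeUrl):
    res = requests.get(scrapeUrl)
    ercot_soup = BeautifulSoup(res.text, "html.parser")
    # Find the script that defines both variables, then split each of its
    # lines that contains exactly one quoted value on the double quotes
    script = [l.split('"') for l in [
        s for s in ercot_soup.select('script')
        if 'reportListUrl' in s.text
        and 'reportTypeID' in s.text
    ][0].text.split('\n') if l.count('"') == 2]
    rtID = [l[1] for l in script if 'reportTypeID' in l[0]][0]
    rlUrl = [l[1] for l in script if 'reportListUrl' in l[0]][0]
    rdUrl = [l[1] for l in script if 'reportDownloadUrl' in l[0]][0]
    # Returns (URL for requesting the report list, URL template for downloads)
    return f'{rlUrl}{rtID}&_={int(time.time())}', rdUrl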
(I couldn't figure out how to scrape the last query parameter [the &_=... part] from the site exactly, but {int(time.time())} seems to get close enough. The results are the same with it, and even when that last bit is omitted entirely, so it's totally optional.)
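To make the string handling inside getReqUrl easier to follow, here is a minimal sketch of the same idea on a made-up script body. The variable values and example.com URLs are hypothetical; only the three variable names come from the code above. The quoted_values helper is likewise a name invented for this illustration.

```python
# Hypothetical script text, made up to resemble the structure described above;
# the values and URLs on the real page will differ.
script_text = '''
var reportTypeID = "12345";
var reportListUrl = "https://example.com/list?reportTypeId=";
var reportDownloadUrl = "https://example.com/download?doclookupId=";
var unrelated = 1;
'''

def quoted_values(text):
    """For each line with exactly one pair of double quotes, map the text
    before the first quote to the value between the quotes."""
    values = {}
    for line in text.split('\n'):
        if line.count('"') == 2:
            before, value, _after = line.split('"')
            values[before.strip()] = value
    return values

values = quoted_values(script_text)

# Same lookups as in getReqUrl: pick the value whose line mentions the name
rtID = next(v for k, v in values.items() if 'reportTypeID' in k)
rlUrl = next(v for k, v in values.items() if 'reportListUrl' in k)
rdUrl = next(v for k, v in values.items() if 'reportDownloadUrl' in k)

print(rlUrl + rtID)
print(rdUrl)
```

Lines without exactly two double quotes (such as the unrelated assignment) are ignored, which is what the l.count('"') == 2 filter does in the original.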
The url returned can be used to request the documents:
import json
url = 'https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD'
reqUrl, ddUrl = getReqUrl(url)
# reqUrl is already the URL string (reqUrl[0] would be only its first character)
reqRes = requests.get(reqUrl).text
rsJson = json.loads(reqRes)
for doc in rsJson['ListDocsByRptTypeRes']['DocumentList']:
    d = doc['Document']
    downloadLink = ddUrl+d['DocID']
    #print(f"{d['FriendlyName']} {d['PublishDate']} {downloadLink}")
    print(f"Download '{d['ConstructedName']}' at\n\t {downloadLink}")
print(len(rsJson['ListDocsByRptTypeRes']['DocumentList']))
The print results will look like

How to skip a link that won't open while scraping?

I'm trying to write a .txt file for each article in the 'Capitalism' section on this page. But it stops after the 7th article, because the link to the 8th won't load. How do I skip it then?
res = session.get('https://www.theschooloflife.com/thebookoflife/category/work/?index')
soup = BeautifulSoup(res.text, 'lxml')
sections = soup.select('section ')
my_section = sections[7]
cat = my_section.select('.category_title')[0].text
titles = [title.text for title in my_section.select('.title')]
links = [link['href'] for link in my_section.select('ul.indexlist a')]
path = '{}'.format(cat)
os.mkdir(path)
for n,(title,link) in list(enumerate(zip(titles,links), start=1)):
    # ...and then I make a numbered .txt file containing the text found in each link.
You haven't provided the most important parts: the exception you are getting and the code responsible for retrieving the URL. Without those, the only recommendation is to wrap the body of your for loop in a try/except block and continue the loop if an error related to URL retrieval occurs. Assuming you are using the requests library (as suggested by session.get), you should end up with something like this:
for n,(title,link) in list(enumerate(zip(titles,links), start=1)):
    try:
        # ...and then I make a numbered .txt file containing the text found in each link.
        pass  # placeholder for the code that fetches the link and writes the file
    except requests.exceptions.RequestException:
        continue
requests.exceptions.RequestException is the general base exception of the requests module; you can find a more specific one for your case here: https://requests.readthedocs.io/en/latest/user/quickstart/#errors-and-exceptions
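For a fuller picture of the skip-and-continue pattern, here is a minimal sketch. It swaps requests for the standard library's urllib so that it has no third-party dependency, and the fetch and fetch_all names, the fake fetcher and the example URLs are all assumptions made for illustration. With requests you would catch requests.exceptions.RequestException instead of OSError.

```python
import urllib.error
import urllib.request

def fetch(url, timeout=10):
    """Return the body of url as bytes (standard-library stand-in for session.get)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()

def fetch_all(links, fetch_func=fetch):
    """Fetch every link, skipping the ones that fail.

    Returns (pages, skipped): a dict of link -> body, and a list of
    (link, exception) pairs for the links that could not be loaded.
    """
    pages, skipped = {}, []
    for link in links:
        try:
            pages[link] = fetch_func(link)
        except OSError as exc:
            # URLError, HTTPError and socket timeouts are all OSError subclasses
            skipped.append((link, exc))
            continue
    return pages, skipped

# Demonstration with a fake fetcher, so nothing is actually downloaded
def fake_fetch(url):
    if "broken" in url:
        raise urllib.error.URLError("simulated failure")
    return b"article text"

pages, skipped = fetch_all(
    ["https://example.com/1", "https://example.com/broken", "https://example.com/3"],
    fetch_func=fake_fetch,
)
print(sorted(pages))
print([link for link, _ in skipped])
```

Keeping a list of the skipped links, rather than silently continuing, makes it easier to see afterwards which articles are missing and why.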

Extract images from HTML file using python standard libraries

I'm trying to write a script that parses an HTML file, finds all the images, and saves them into another folder. How would one accomplish this using only the libraries that come with Python 3 when you install it? I currently have this script, which I would like to build on.
import datetime
import os
import requests
# zendesk, credentials and language are assumed to be defined earlier in the script
date = datetime.date.today()
backup_path = os.path.join(str(date), language)
if not os.path.exists(backup_path):
    os.makedirs(backup_path)
log = []
endpoint = zendesk + '/api/v2/help_center/en-us/articles.json'
while endpoint:
    response = requests.get(endpoint, auth=credentials)
    if response.status_code != 200:
        print('Failed to retrieve articles with error {}'.format(response.status_code))
        exit()
    data = response.json()
    for article in data['articles']:
        if article['body'] is None:
            continue
        title = '<h1>' + article['title'] + '</h1>'
        filename = '{id}.html'.format(id=article['id'])
        with open(os.path.join(backup_path, filename), mode='w', encoding='utf-8') as f:
            f.write(title + '\n' + article['body'])
        print('{id} copied!'.format(id=article['id']))
        log.append((filename, article['title'], article['author_id']))
    endpoint = data['next_page']
This is a script I found on a zendesk forum that basically backs up our articles on Zendesk.
Try using Beautiful Soup to retrieve all the img nodes, and then use urllib to fetch the picture for each node.
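import urllib.request
from bs4 import BeautifulSoup
# note: here using response.text to get the raw html
soup = BeautifulSoup(response.text, "html.parser")
# get the src of all images (img tags without a src attribute are skipped)
img_source = [x["src"] for x in soup.find_all("img") if x.get("src")]
# get the images (in Python 3, urlretrieve lives in urllib.request)
images = [urllib.request.urlretrieve(x) for x in img_source]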
And you probably need to add some error handling and change it a bit to fit your page, but the idea remains the same.
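Since the question asks for standard libraries only, here is a minimal sketch of the same idea without Beautiful Soup, using html.parser from the standard library. The ImageCollector, image_urls and download_images names, the sample HTML and the example.com URLs are all made up for illustration. In the backup script, the HTML to feed in would presumably be article['body'].

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlretrieve

class ImageCollector(HTMLParser):
    """Collects the src attribute of every <img> tag that has one."""
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

def image_urls(html, base_url=""):
    """Return absolute image URLs found in html, resolved against base_url."""
    parser = ImageCollector()
    parser.feed(html)
    return [urljoin(base_url, src) for src in parser.sources]

def download_images(html, base_url, folder):
    """Save every image referenced in html into folder (needs network access)."""
    os.makedirs(folder, exist_ok=True)
    for url in image_urls(html, base_url):
        filename = os.path.basename(urlparse(url).path) or "image"
        urlretrieve(url, os.path.join(folder, filename))

# Offline demonstration on a made-up snippet
sample = (
    '<h1>Title</h1><p><img src="/img/a.png"> text '
    '<img alt="no source"> <img src="https://cdn.example.com/b.jpg"/></p>'
)
found = image_urls(sample, "https://example.com/hc/")
print(found)
```

download_images is not called here because it needs real URLs. Images that share a file name would overwrite each other in this sketch, so a real script may want to make the names unique.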

How to Download PDFs from Scraped Links [Python]?

I'm working on making a PDF web scraper in Python. Essentially, I'm trying to scrape all of the lecture notes from one of my courses, which are in the form of PDFs. I want to enter a URL, and then get the PDFs and save them in a directory on my laptop. I've looked at several tutorials, but I'm not entirely sure how to go about doing this. None of the questions on StackOverflow seem to be helping me either.
Here is what I have so far:
import requests
from bs4 import BeautifulSoup
import shutil
bs = BeautifulSoup
url = input("Enter the URL you want to scrape from: ")
print("")
suffix = ".pdf"
link_list = []
def getPDFs():
    # Gets URL from user to scrape
    response = requests.get(url, stream=True)
    soup = bs(response.text)
    #for link in soup.find_all('a'): # Finds all links
    #    if suffix in str(link): # If the link ends in .pdf
    #        link_list.append(link.get('href'))
    #print(link_list)
    with open('CS112.Lecture.09.pdf', 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
    del response
    print("PDF Saved")
getPDFs()
Originally, I had gotten all of the links to the PDFs, but did not know how to download them; the code for that is now commented out.
Now I've gotten to the point where I'm trying to download just one PDF; and a PDF does get downloaded, but it's a 0KB file.
If it's of any use, I'm using Python 3.4.2
If this is something that does not require being logged in, you can use urlretrieve():
from urllib.request import urlretrieve
for link in link_list:
    urlretrieve(link)
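Two details are worth spelling out. urlretrieve() takes an optional second argument with the file name to save to (without it, the download goes to a temporary file), and href values scraped from a page are often relative, so they need to be joined to the page URL first. Below is a minimal sketch under those assumptions; the pdf_targets and download_pdfs names and the example.edu URLs are made up for illustration. As for the 0 KB file in the question, a likely cause is that reading response.text consumes the streamed body before response.raw is copied.

```python
import os
from urllib.parse import urljoin, urlparse
from urllib.request import urlretrieve

def pdf_targets(page_url, hrefs, suffix=".pdf"):
    """Return (absolute_url, file_name) pairs for the hrefs that point at PDFs."""
    targets = []
    for href in hrefs:
        absolute = urljoin(page_url, href)   # resolves relative links
        path = urlparse(absolute).path       # ignores any query string
        if path.lower().endswith(suffix):
            targets.append((absolute, os.path.basename(path)))
    return targets

def download_pdfs(page_url, hrefs, folder="."):
    """Download every PDF link into folder (needs network access)."""
    os.makedirs(folder, exist_ok=True)
    for absolute, file_name in pdf_targets(page_url, hrefs):
        urlretrieve(absolute, os.path.join(folder, file_name))

# Offline demonstration with made-up links
targets = pdf_targets(
    "https://example.edu/cs112/notes.html",
    ["CS112.Lecture.09.pdf", "/files/intro.PDF", "index.html"],
)
for absolute, file_name in targets:
    print(file_name, "<-", absolute)
```

In the question's script, hrefs would be the link_list built by the commented-out loop, and page_url the URL entered by the user.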

Performing a get request in Python

Please tell me why these two similar pieces of code give different results.
The first one (yandex.ru) returns the page I requested, while the other one returns the main page of the site (moyareklama.ru).
import urllib
base = "http://www.moyareklama.ru/single_ad_new.php?"
data = {"id":"201623465"}
url = base + urllib.urlencode(data)
print url
page = urllib.urlopen(url).read()
f = open ("1.html", "w")
f.write(page)
f.close()
print page
##base = "http://yandex.ru/yandsearch?"
##data = (("text","python"),("lr","192"))
##url = base + urllib.urlencode(data)
##print url
##page = urllib.urlopen(url).read()
##f = open ("1.html", "w")
##f.write(page)
##f.close()
##print page
Most likely the reason you get something different with urllib.urlopen and your browser is because your browser can be redirected with javascript and meta/refresh tags as well as standard HTTP 301/302 responses. I'm pretty sure the urllib module will only be redirected by HTTP 301/302 responses.
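One way to check this guess is to look at the HTML that urllib actually received and see whether it contains a meta refresh tag. Here is a minimal sketch, written for Python 3 and using only the standard library; the MetaRefreshFinder and find_meta_refresh names and the sample pages are made up for illustration, and it does not detect JavaScript redirects.

```python
from html.parser import HTMLParser

class MetaRefreshFinder(HTMLParser):
    """Remembers the target URL of the first <meta http-equiv="refresh"> tag."""
    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "5; url=http://example.com/"
        content = attrs.get("content") or ""
        _delay, _sep, rest = content.partition(";")
        key, _sep, value = rest.strip().partition("=")
        if key.strip().lower() == "url":
            self.target = value.strip().strip("'\"")

def find_meta_refresh(html):
    """Return the meta refresh target URL in html, or None if there is none."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return finder.target

# Made-up pages for demonstration
redirecting = (
    '<html><head><meta http-equiv="Refresh" '
    'content="0; URL=http://example.com/new?id=1"></head></html>'
)
plain = "<html><head><title>No redirect</title></head></html>"

print(find_meta_refresh(redirecting))
print(find_meta_refresh(plain))
```

If a target is found, requesting that URL in a second step would mimic what the browser does.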
