This is probably a very simple task, but I cannot find any help. I have a website that takes the form www.xyz.com/somestuff/ID. I have a list of the IDs I need information from. I was hoping to have a simple script that goes to the site and downloads the (complete) web page for each ID, saving it in a simple form like ID_whatever_the_default_save_name_is in a specific folder.
Can I run a simple Python script to do this for me? I could do it by hand, as it is only 75 different pages, but I was hoping to use this to learn how to do things like this in the future.
Mechanize is a great package for crawling the web with python. A simple example for your issue would be:
import mechanize

br = mechanize.Browser()
# mechanize needs the full URL, including the scheme
response = br.open("http://www.xyz.com/somestuff/ID")
print response.read()
This simply fetches your URL and prints the response from the server.
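Since you have a list of IDs, a minimal sketch of the full task could look like the following (the ID list and the output folder name are placeholders you would fill in yourself):
import os
import mechanize

ids = ['id1', 'id2', 'id3']  # your 75 IDs go here
folder = 'pages'             # output folder of your choice
if not os.path.exists(folder):
    os.makedirs(folder)

br = mechanize.Browser()
for page_id in ids:
    response = br.open("http://www.xyz.com/somestuff/%s" % page_id)
    # save the raw HTML under a file name that starts with the ID
    with open(os.path.join(folder, "%s_page.html" % page_id), "w") as f:
        f.write(response.read())
Note that this saves only the HTML itself, not the images or other assets a browser's "save complete page" would fetch.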
This can be done simply in Python using the urllib module. Here is a simple example in Python 3:
import urllib.request

# use the full URL, including the scheme
url = 'http://www.xyz.com/somestuff/ID'
req = urllib.request.Request(url)
page = urllib.request.urlopen(req)
src = page.read()
print(src)
For more info on the urllib module -> http://docs.python.org/3.3/library/urllib.html
Do you want just the HTML source of each page? If so, just create a url variable with the host site and append the page number as you go. I'll do this for an example with http://www.notalwaysright.com
import urllib.request

url = "http://www.notalwaysright.com/page/"
for x in range(1, 71):
    # the page number must be converted to a string before concatenating
    newurl = url + str(x)
    response = urllib.request.urlopen(newurl)
    # urlopen returns bytes, so open the file in binary mode
    with open("Page/" + str(x), "ab") as p:
        p.write(response.read())
As practice, I have written a simple script to find who has bought the same tracks as me on Bandcamp, ideally to find accounts with similar taste and thus more of the same kind of music.
The problem is that the fan list on an album/track page is lazy loaded. Using Python's requests and bs4 I am only getting 60 results out of a potential 700.
I am trying to figure out how to send the request that loads more of the list, e.g. here: https://pitp.bandcamp.com/album/fragments-distancing. After finding out what request is sent when I click "more" in the browser's network inspector, I tried to replicate it as a JSON request with requests, although without any result:
res = requests.get(track_link)
open_more = {"tralbum_type":"a","tralbum_id":3542956135,"token":"1:1609185066:1714678:0:1:0","count":100}
for i in range(0,3):
    requests.post(track_link, json=open_more)
Will appreciate any help!
I think that just using a ridiculously large number for count will do. I also did some automation on your script, in case you want to get data on other albums:
from urllib.parse import urlsplit
import json

import requests
from bs4 import BeautifulSoup

# build the post link
get_link = "https://pitp.bandcamp.com/album/fragments-distancing"
link = urlsplit(get_link)
base_link = f'{link.scheme}://{link.netloc}'
post_link = f"{base_link}/api/tralbumcollectors/2/thumbs"

with requests.Session() as s:
    res = s.get(get_link)
    soup = BeautifulSoup(res.text, 'lxml')

    # the data for tralbum_type and tralbum_id
    # are stored in a script attribute
    key = "data-band-follow-info"
    data = soup.select_one(f'script[{key}]')[key]
    data = json.loads(data)

    open_more = {
        "tralbum_type": data["tralbum_type"],
        "tralbum_id": data["tralbum_id"],
        "count": 1000}

    r = s.post(post_link, json=open_more).json()
    print(r['more_available'])  # if not false, put a bigger count
I am working on a project and one of the steps includes getting a random word which I will use later. When I try to grab the random word, it gives me '<span id="result"></span>' but as you can see, there is no word inside.
Code:
import urllib2
from bs4 import BeautifulSoup
quote_page = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find("span", {"id": "result"})
print name_box
name = name_box.text.strip()
print name
I am thinking that maybe it might need to wait for a word to appear, but I'm not sure how to do that.
This word is added to the page using JavaScript. We can verify this by looking at the actual HTML that is returned in the request and comparing it with what we see in the web browser DOM inspector. There are two options:
1. Use a library capable of executing JavaScript and giving you the resulting HTML
2. Try a different approach that doesn't require JavaScript support
For 1, we can use something like requests_html. This would look like:
from requests_html import HTMLSession
url = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
session = HTMLSession()
r = session.get(url)
# Some sleep required since the default of 0.2 isn't long enough.
r.html.render(sleep=0.5)
print(r.html.find('#result', first=True).text)
For 2, if we look at the network requests that the page is making, we can see that it retrieves random words by making a POST request to http://watchout4snakes.com/wo4snakes/Random/RandomWord. Making a direct request with a library like requests (recommended by the standard library documentation) looks like:
import requests
url = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
print(requests.post(url).text)
So the way the site works is that it sends you the page with no word in the span box and fills it in later through JavaScript; that's why you get a span box with nothing inside.
However, since you're only trying to get the word, I'd suggest a different method than scraping it off the page: simply send a POST request to http://watchout4snakes.com/wo4snakes/Random/RandomWord with no body, and you'll receive the word in the response.
You're using Python 2, but in Python 3 (for example, so I can show this works) you can do:
>>> import requests
>>> r = requests.post('http://watchout4snakes.com/wo4snakes/Random/RandomWord')
>>> print(r.text)
doom
You can do something similar using urllib in Python 2 as well.
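For instance, a minimal Python 2 sketch (urllib2 sends a POST whenever a data argument is supplied, even an empty one):
import urllib2

url = 'http://watchout4snakes.com/wo4snakes/Random/RandomWord'
# passing data (even an empty string) turns the request into a POST
response = urllib2.urlopen(url, data='')
print response.read()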
Background:
Typically, if I want to see what type of requests a website is getting, I would open up chrome developer tools (F12), go to the Network tab and filter the requests I want to see.
Once I have the request URL, I can simply parse the URL for the query string parameters I want.
This is a very manual task and I thought I could write a script that does this for any URL I provide. I thought Python would be great for this.
Task:
I have found a library called requests that I use to validate the URL before opening.
import requests
from urllib import urlopen  # Python 2

testPage = "http://www.google.com"
validatedRequest = str(requests.get(testPage, verify=False).url)
page = urlopen(validatedRequest)
However, I am unsure of how to get the requests that the URL I enter receives. Is this possible in Python? A point in the right direction would be great. Once I know how to access these request headers, I can easily parse through them.
Thank you.
You can use the urlparse function to fetch the query params.
Demo:
import requests
import urllib
from urlparse import urlparse
testPage = "http://www.google.com"
validatedRequest = str(requests.get(testPage, verify=False).url)
page = urllib.urlopen(validatedRequest)
print urlparse(page.url).query
Result:
gfe_rd=cr&dcr=0&ei=ISdiWuOLJ86dX8j3vPgI
Tested in Python 2.7.
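If you are on Python 3, the equivalent demo would be (urlparse moved to urllib.parse, and urlopen lives in urllib.request):
import requests
from urllib.request import urlopen
from urllib.parse import urlparse

testPage = "http://www.google.com"
validatedRequest = str(requests.get(testPage, verify=False).url)
page = urlopen(validatedRequest)
print(urlparse(page.geturl()).query)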
Let me start by saying that I know there are a few topics discussing problems similar to mine, but the suggested solutions do not seem to work for me for some reason.
Also, I am new to downloading files from the internet using scripts. Up until now I have mostly used Python as a MATLAB replacement (using numpy/scipy).
My goal:
I want to download a lot of .csv files from an internet database (http://dna.korea.ac.kr/vhot/) automatically using python. I want to do this because it is too cumbersome to download the 1000+ csv files I require by hand. The database can only be accessed using a UI, where you have to select several options from a drop down menu to finally end up with links to .csv files after some steps.
I have figured out that the url you get after filling out the drop down menus and pressing 'search' contains all the parameters of the drop-down menu. This means I can just change those instead of using the drop down menu, which helps a lot.
An example url from this website is (let's call it url1):
url1 = http://dna.korea.ac.kr/vhot/search.php?species=Human&selector=drop&mirname=&mirname_drop=hbv-miR-B2RC&pita=on&set=and&miranda_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=99999&gene=
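To illustrate, this is roughly what I do to generate such urls by just changing the parameters (a sketch: the parameter values are the ones from url1, and the list of miRNA names to sweep over is made up):
from urllib import urlencode

base = 'http://dna.korea.ac.kr/vhot/search.php'
params = {'species': 'Human', 'selector': 'drop', 'mirname': '',
          'mirname_drop': 'hbv-miR-B2RC', 'pita': 'on', 'set': 'and',
          'miranda_th': '-5', 'rh_th': '-10', 'ts_th': '0',
          'mt_th': '7.3', 'pt_th': '99999', 'gene': ''}

for mirname in ['hbv-miR-B2RC', 'hbv-miR-B2CE']:  # made-up list of names
    params['mirname_drop'] = mirname
    print base + '?' + urlencode(params)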
On the results page (url1) I can select 5 csv files; one example directs me to the following url:
url2 = http://dna.korea.ac.kr/vhot/download.php?mirname=hbv-miR-B2RC&species_filter=species_id+%3D+9606&set=and&gene_filter=&method=pita&m_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=99999&targetscan=&miranda=&rnahybrid=&mut=&pita=on
However, this doesn't contain the csv file directly, but appears to be a 'redirect' (a new term for me, which I found by googling, so correct me if I am wrong).
One strange thing: I appear to have to load url1 in my browser before I can access url2 (I do not know if it has to be the same day, or hour; url2 didn't work for me today although it did yesterday, and only after accessing url1 did it work again). If I do not access url1 before url2, I get "no results" instead of my csv file in my browser. Does anyone know what is going on here?
However, my main problem is that I cannot save the csv files from python.
I have tried using the packages urllib, urllib2 and requests, but I cannot get it to work.
From what I understand, the requests package should take care of redirects, but I haven't been able to make it work.
The solutions from the following web pages do not appear to work for me (or I am messing up):
stackoverflow.com/questions/7603044/how-to-download-a-file-returned-indirectly-from-html-form-submission-pyt
stackoverflow.com/questions/9419162/python-download-returned-zip-file-from-url
techniqal.com/blog/2008/07/31/python-file-read-write-with-urllib2/
Some of the things I have tried include:
import urllib2
import csv
import sys
url = 'http://dna.korea.ac.kr/vhot/download.php?mirname=hbv-miR-B2RC&species_filter=species_id+%3D+9606&set=or&gene_filter=&method=targetscan&m_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=-10&targetscan=on&miranda=&rnahybrid=&mut=&pita='
#1
u = urllib2.urlopen(url)
localFile = open('file.csv', 'w')
localFile.write(u.read())
localFile.close()
#2
req = urllib2.Request(url)
res = urllib2.urlopen(req)
finalurl = res.geturl()
pass
# finalurl = 'http://dna.korea.ac.kr/vhot/download.php?mirname=hbv-miR-B2RC&species_filter=species_id+%3D+9606&set=or&gene_filter=&method=targetscan&m_th=-5&rh_th=-10&ts_th=0&mt_th=7.3&pt_th=-10&targetscan=on&miranda=&rnahybrid=&mut=&pita='
#3
import requests
r = requests.get(url)
r.content
pass
#r.content = "<script>location.replace('download_send.php?name=qgN9Th&type=targetscan');</script>"
#4
import requests
r = requests.get(url,
                 allow_redirects=True,
                 data={'download_open': 'Download', 'format_open': '.csv'})
print r.content
# r.content = "
#5
import urllib
test1 = urllib.urlretrieve(url, "test.csv")
test2 = urllib.urlopen(url)
pass
For #2, #3 and #4 the outputs are displayed after the code.
For #1 and #5 I just get a .csv file containing nothing but </script>.
Option #3 just gives me a new redirect, I think; can this help me?
Can anybody help me with my problem?
The page does not send an HTTP redirect; instead, the redirect is done via JavaScript.
urllib and requests do not execute JavaScript, so they cannot follow it to the download url.
You have to extract the final download url yourself, and then open it using any of the methods.
You could extract the URL using the re module with a regex like r"location\.replace\('(.*?)'\)".
Based on the response from ch3ka, I think I got it to work. From the source code I get the JavaScript redirect, and from this redirect I can get the data.
import re
import requests

# Fetch the page source
redirect = requests.get(url).content
# Search for the JavaScript redirect in the source code
# --> based on answer by ch3ka
m = re.search(r"location\.replace\(\'(.*?)\'\)", redirect).group(1)
# The redirect target is relative, so prepend the site's base url
new_url = 'http://dna.korea.ac.kr/vhot/' + m
# Now use this url to get the data
data = requests.get(new_url).content
I'm new to web scraping with python, so I don't know if I'm doing this right.
I'm using a script that calls BeautifulSoup to parse the URLs from the first 10 pages of a Google search. Tested with stackoverflow.com, and it worked just fine out of the box. I then tested with another site a few times, trying to see if the script was really working with higher Google page requests, and it 503'd on me. I switched to another URL to test; it worked for a couple of low-page requests, then also 503'd. Now every URL I pass to it is 503'ing. Any suggestions?
import sys     # Used to add the BeautifulSoup folder to the import path
import urllib2 # Used to read the html document

if __name__ == "__main__":
    ### Import Beautiful Soup
    ### Here, I have the BeautifulSoup folder in the level of this Python script
    ### So I need to tell Python where to look.
    sys.path.append("./BeautifulSoup")
    from BeautifulSoup import BeautifulSoup

    ### Create opener with Google-friendly user agent
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]

    ### Open page & generate soup
    ### the "start" variable will be used to iterate through 10 pages.
    for start in range(0, 10):
        url = "http://www.google.com/search?q=site:stackoverflow.com&start=" + str(start * 10)
        page = opener.open(url)
        soup = BeautifulSoup(page)

        ### Parse and find
        ### Looks like google contains URLs in <cite> tags.
        ### So for each cite tag on each page (10), print its contents (url)
        for cite in soup.findAll('cite'):
            print cite.text
Automated querying is not permitted by the Google Terms of Service.
See this article for information:
Unusual traffic from your computer
and also the Google Terms of Service.
As Ettore said, scraping the search results is against our ToS. However, check out the Web Search API, specifically the bottom section of the documentation, which should give you a hint about how to access the API from non-JavaScript environments.
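As a rough sketch of what that could look like from plain Python at the time: the endpoint and response fields below are taken from the old AJAX Web Search API documentation (since deprecated), so treat them as assumptions rather than a stable interface:
import requests

# endpoint of the old Google AJAX Web Search API -- an assumption based
# on the documentation of that era, and long since deprecated
url = 'http://ajax.googleapis.com/ajax/services/search/web'
params = {'v': '1.0', 'q': 'site:stackoverflow.com'}
data = requests.get(url, params=params).json()
# in the old response format, results lived under responseData/results
for result in data['responseData']['results']:
    print result['url']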