Please tell me why this similar lists of code get different results.
First one (yandex.ru) get page of request, and another one get Main page of site (moyareklama.ru)
import urllib
base = "http://www.moyareklama.ru/single_ad_new.php?"
data = {"id":"201623465"}
url = base + urllib.urlencode(data)
print url
page = urllib.urlopen(url).read()
f = open ("1.html", "w")
f.write(page)
f.close()
print page
##base = "http://yandex.ru/yandsearch?"
##data = (("text","python"),("lr","192"))
##url = base + urllib.urlencode(data)
##print url
##page = urllib.urlopen(url).read()
##f = open ("1.html", "w")
##f.write(page)
##f.close()
##print page
Most likely the reason you get something different with urllib.urlopen and your browser is because your browser can be redirected with javascript and meta/refresh tags as well as standard HTTP 301/302 responses. I'm pretty sure the urllib module will only be redirected by HTTP 301/302 responses.
Related
I'm looking for some library or libraries in Python to:
a) log in a web site,
b) find all links to some media files (let us say having "download" in their URLs), and
c) download each file efficiently directly to the hard drive (without loading the whole media file into RAM).
Thanks
You can use the broadly used requests module (more than 35k stars on github), and BeautifulSoup. The former handles session cookies, redirections, encodings, compression and more transparently. The later finds parts in the HTML code and has an easy-to-remember syntax, e.g. [] for properties of HTML tags.
It follows a complete example in Python 3.5.2 for a web site that you can scrap without a JavaScript engine (otherwise you can use Selenium), and downloading sequentially some links with download in its URL.
import shutil
import sys
import requests
from bs4 import BeautifulSoup
""" Requirements: beautifulsoup4, requests """
SCHEMA_DOMAIN = 'https://exmaple.com'
URL = SCHEMA_DOMAIN + '/house.php/' # this is the log-in URL
# here are the name property of the input fields in the log-in form.
KEYS = ['login[_csrf_token]',
'login[login]',
'login[password]']
client = requests.session()
request = client.get(URL)
soup = BeautifulSoup(request.text, features="html.parser")
data = {KEYS[0]: soup.find('input', dict(name=KEYS[0]))['value'],
KEYS[1]: 'my_username',
KEYS[2]: 'my_password'}
# The first argument here is the URL of the action property of the log-in form
request = client.post(SCHEMA_DOMAIN + '/house.php/user/login',
data=data,
headers=dict(Referer=URL))
soup = BeautifulSoup(request.text, features="html.parser")
generator = ((tag['href'], tag.string)
for tag in soup.find_all('a')
if 'download' in tag['href'])
for url, name in generator:
with client.get(SCHEMA_DOMAIN + url, stream=True) as request:
if request.status_code == 200:
with open(name, 'wb') as output:
request.raw.decode_content = True
shutil.copyfileobj(request.raw, output)
else:
print('status code was {} for {}'.format(request.status_code,
name),
file=sys.stderr)
You can use the mechanize module to log into websites like so:
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)
br.open("http://www.example.com")
br.select_form(nr=0) #Pass parameters to uniquely identify login form if needed
br['username'] = '...'
br['password'] = '...'
result = br.submit().read()
Use bs4 to parse this response and find all the hyperlinks in the page like so:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(result, "lxml")
links = []
for link in soup.findAll('a'):
links.append(link.get('href'))
You can use re to further narrow down the links you need from all the links present in the response webpage, which are media links (.mp3, .mp4, .jpg, etc) in your case.
Finally, use requests module to stream the media files so that they don't take up too much memory like so:
response = requests.get(url, stream=True) #URL here is the media URL
handle = open(target_path, "wb")
for chunk in response.iter_content(chunk_size=512):
if chunk: # filter out keep-alive new chunks
handle.write(chunk)
handle.close()
when the stream attribute of get() is set to True, the content does not immediately start downloading to RAM, instead the response behaves like an iterable, which you can iterate over in chunks of size chunk_size in the loop right after the get() statement. Before moving on to the next chunk, you can write the previous chunk to memory hence ensuring that the data isn't stored in RAM.
You will have to put this last chunk of code in a loop if you want to download media of every link in the links list.
You will probably have to end up making some changes to this code to make it work as I haven't tested it for your use case myself, but hopefully this gives a blueprint to work off of.
I want to download all the pdf docs corresponding to a list of "API#" values from http://imaging.occeweb.com/imaging/UIC1012_1075.aspx
So far I have managed to post the "API#" request but not sure what to do next.
import requests
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'http://imaging.occeweb.com/imaging/UIC1012_1075.aspx'
API = '15335187'
payload = {'txtIndex7':'1','txtIndex2': API}
session = requests.Session()
res = session.post(url,headers=headers,data=payload)
It is a bit more complicated than that, there are some additional event validation hidden input fields that you need to take into account. For that you first need to get the page, collect all the hidden values, set the value for the API and then make a POST request with following HTML parsing of the HTML response.
Fortunately, there is a tool called MechanicalSoup that may help to auto-fill these hidden fields in your the form submission request. Here is a complete solution including sample code for parsing the resulting table:
import mechanicalsoup
url = 'http://imaging.occeweb.com/imaging/UIC1012_1075.aspx'
API = '15335187'
browser = mechanicalsoup.StatefulBrowser(
user_agent='Mozilla/5.0'
)
browser.open(url)
# Fill-in the search form
browser.select_form('form#Form1')
browser["txtIndex2"] = API
browser.submit_selected("Button1")
# Display the results
for tr in browser.get_current_page().select('table#DataGrid1 tr'):
print([td.get_text() for td in tr.find_all("td")])
import mechanicalsoup
import urllib
url = 'http://imaging.occeweb.com/imaging/UIC1012_1075.aspx'
Form = '1012'
API = '15335187'
browser = mechanicalsoup.StatefulBrowser(
user_agent='Mozilla/5.0'
)
browser.open(url)
# Fill-in the search form
browser.select_form('form#Form1')
browser["txtIndex7"] = Form
browser["txtIndex2"] = API
browser.submit_selected("Button1")
# Display the results
for tr in browser.get_current_page().select('table#DataGrid1 tr')[2:]:
try:
pdf_url = tr.select('td')[0].find('a').get('href')
except:
print('Pdf not found')
else:
pdf_id = tr.select('td')[0].text
response = urllib.urlopen(pdf_url) # for python 2.7, for python 3. urllib.request.urlopen()
pdf_str = "C:\\Data\\"+pdf_id+".pdf"
file = open(pdf_str, 'wb')
file.write(response.read())
file.close()
print('Pdf '+pdf_id+' saved')
I want to submit a multipart/form-data that sets the input for a simulation on TRILEGAL, and download the file available from a redirected page.
I studied documentation of requests, urllib, Grab, mechanize, etc. , and it seems that in mechanize my code would be :
from mechanize import Browser
browser = Browser()
browser.open("http://stev.oapd.inaf.it/cgi-bin/trilegal")
browser.select_form(nr=0)
browser['gal_coord'] = ["2"]
browser['eq_alpha'] = ["277.981111"]
browser['eq_delta'] = ["-19.0833"]
response = browser.submit()
content = response.read()
However, I could not test it because it is not available in python 3.
So I tried requests :
import requests
url = 'http://stev.oapd.inaf.it/cgi-bin/trilegal'
values = {'gal_coord':"2",
'eq_alpha':"277.981111",
'eq_delta':"-19.0833",
'field':" 0.047117",
}
r = requests.post(url, files = values)
but I can't figure out how to get to the results page - if I do
r.content
it displays the content of the form that I had just submitted, whereas if you open the actual website, and click 'submit', you see a new window (following the method="post" action="./trilegal_1.6" ).
How can I get to that new window with requests (i.e. follow to the page that opens up when I click the submit button) , and click the link on the results page to retrieve the results file ( "The results will be available after about 2 minutes at THIS LINK.") ?
If you can point me to any other tool that could do the job I would be really grateful - I spent hours looking through SO for something that could help solve this problem.
Thank you!
Chris
Here is working solution for python 2.7
from mechanize import Browser
from urllib import urlretrieve # for download purpose
from bs4 import BeautifulSoup
browser = Browser()
browser.open("http://stev.oapd.inaf.it/cgi-bin/trilegal")
browser.select_form(nr=0)
browser['gal_coord'] = ["2"]
browser['eq_alpha'] = ["277.981111"]
browser['eq_delta'] = ["-19.0833"]
response = browser.submit()
content = response.read()
soup = BeautifulSoup(content, 'html.parser')
base_url = 'http://stev.oapd.inaf.it'
# fetch the url from page source and it to base url
link = soup.findAll('a')[0]['href'].split('..')[1]
url = base_url + str(link)
filename = 'test.dat'
# now download the file
urlretrieve(url, filename)
Your file will be downloaded as test.dat. You can open it with respective program.
I post a separate answer because it would be too cluttered. Thanks to #ksai, this works in python 2.7 :
import re
import time
from mechanize import Browser
browser = Browser()
browser.open("http://stev.oapd.inaf.it/cgi-bin/trilegal")
browser.select_form(nr=0)
#set appropriate form contents
browser['gal_coord'] = ["2"]
browser['eq_alpha'] = "277.981111"
browser['eq_delta'] = "-19.0833"
browser['field'] = " 0.047117"
browser['photsys_file'] = ["tab_mag_odfnew/tab_mag_lsst.dat"]
browser["icm_lim"] = "3"
browser["mag_lim"] = "24.5"
response = browser.submit()
# wait 1 min while results are prepared
time.sleep(60)
# select the appropriate url
url = 'http://stev.oapd.inaf.it/' + str(browser.links()[0].url[3:])
# download the results file
browser.retrieve(url, 'test1.dat')
Thank you very much!
Chris
I am trying to use the requests function in python to post the text content of a text file to a website, submit the text for analysis on said website, and pull the results back in to python. I have read through a number of responses here and on other websites, but have not yet figured out how to correctly modify the code to a new website.
I'm familiar with beautiful soup so pulling in webpage content and removing HTML isn't an issue, its the submitting the data that I don't understand.
My code currently is:
import requests
fileName = "texttoAnalyze.txt"
fileHandle = open(fileName, 'rU');
url_text = fileHandle.read()
url = "http://www.webpagefx.com/tools/read-able/"
payload = {'value':url_text}
r = requests.post(url, payload)
print r.text
This code comes back with the html of the website, but hasn't recognized the fact that I'm trying to a submit a form.
Any help is appreciated. Thanks so much.
You need to send the same request the website is sending, usually you can get these with web debugging tools (like chrome/firefox developer tools).
In this case the url the request is being sent to is: http://www.webpagefx.com/tools/read-able/check.php
With the following params: tab=Test+by+Direct+Link&directInput=SOME_RANDOM_TEXT
So your code should look like this:
url = "http://www.webpagefx.com/tools/read-able/check.php"
payload = {'directInput':url_text, 'tab': 'Test by Direct Link'}
r = requests.post(url, data=payload)
print r.text
Good luck!
There are two post parameters, tab and directInput:
import requests
post = "http://www.webpagefx.com/tools/read-able/check.php"
with open("in.txt") as f:
data = {"tab":"Test by Direct Link",
"directInput":f.read()}
r = requests.post(post, data=data)
print(r.content)
How can I list files and folders if I only have an IP-address?
With urllib and others, I am only able to display the content of the index.html file. But what if I want to see which files are in the root as well?
I am looking for an example that shows how to implement username and password if needed. (Most of the time index.html is public, but sometimes the other files are not).
Use requests to get page content and BeautifulSoup to parse the result.
For example if we search for all iso files at http://cdimage.debian.org/debian-cd/8.2.0-live/i386/iso-hybrid/:
from bs4 import BeautifulSoup
import requests
url = 'http://cdimage.debian.org/debian-cd/8.2.0-live/i386/iso-hybrid/'
ext = 'iso'
def listFD(url, ext=''):
page = requests.get(url).text
print page
soup = BeautifulSoup(page, 'html.parser')
return [url + '/' + node.get('href') for node in soup.find_all('a') if node.get('href').endswith(ext)]
for file in listFD(url, ext):
print file
You cannot get the directory listing directly via HTTP, as another answer says. It's the HTTP server that "decides" what to give you. Some will give you an HTML page displaying links to all the files inside a "directory", some will give you some page (index.html), and some will not even interpret the "directory" as one.
For example, you might have a link to "http://localhost/user-login/": This does not mean that there is a directory called user-login in the document root of the server. The server interprets that as a "link" to some page.
Now, to achieve what you want, you either have to use something other than HTTP (an FTP server on the "ip address" you want to access would do the job), or set up an HTTP server on that machine that provides for each path (http://192.168.2.100/directory) a list of files in it (in whatever format) and parse that through Python.
If the server provides an "index of /bla/bla" kind of page (like Apache server do, directory listings), you could parse the HTML output to find out the names of files and directories. If not (e.g. a custom index.html, or whatever the server decides to give you), then you're out of luck :(, you can't do it.
Zety provides a nice compact solution. I would add to his example by making the requests component more robust and functional:
import requests
from bs4 import BeautifulSoup
def get_url_paths(url, ext='', params={}):
response = requests.get(url, params=params)
if response.ok:
response_text = response.text
else:
return response.raise_for_status()
soup = BeautifulSoup(response_text, 'html.parser')
parent = [url + node.get('href') for node in soup.find_all('a') if node.get('href').endswith(ext)]
return parent
url = 'http://cdimage.debian.org/debian-cd/8.2.0-live/i386/iso-hybrid'
ext = 'iso'
result = get_url_paths(url, ext)
print(result)
HTTP does not work with "files" and "directories". Pick a different protocol.
You can use the following script to get names of all files in sub-directories and directories in a HTTP Server. A file writer can be used to download them.
from urllib.request import Request, urlopen, urlretrieve
from bs4 import BeautifulSoup
def read_url(url):
url = url.replace(" ","%20")
req = Request(url)
a = urlopen(req).read()
soup = BeautifulSoup(a, 'html.parser')
x = (soup.find_all('a'))
for i in x:
file_name = i.extract().get_text()
url_new = url + file_name
url_new = url_new.replace(" ","%20")
if(file_name[-1]=='/' and file_name[0]!='.'):
read_url(url_new)
print(url_new)
read_url("www.example.com")