Prefix "http://" valid but actually ""https://" - python

I have a long list of incomplete website addresses, missing their prefix such as "http://www.", for example:
pewresearch.org
narod.ru
intel.com
xda-developers.com
oecd.org
I tried:
import requests
from lxml.html import fromstring

to_check = [
    "pewresearch.org",
    "narod.ru",
    "intel.com",
    "xda-developers.com",
    "oecd.org"]

for each in to_check:
    r = requests.get("http://www." + each)
    tree = fromstring(r.content)
    title = tree.findtext('.//title')
    print(title)
They returned:
Pew Research Center | Pew Research Center
Лучшие конструкторы сайтов | Народный рейтинг конструкторов для создания сайтов
Intel | Data Center Solutions, IoT, and PC Innovation
XDA Portal & Forums
Home page - OECD
It seems they all start with "http://www.", but that is not actually the case - for example, the correct address is "https://www.pewresearch.org/".
What's the quickest way, with an online tool or Python, to find out their complete and correct addresses instead of keying them in one by one in a web browser? (Some might be http, some https.)

Write a script / short program to send a HEAD request to each site. The server should respond with a redirect (e.g. to HTTPS). Follow each redirect until no further redirects are received.
The C# HttpClient can follow redirects automatically.
For Python, see jterrace's answer using the requests library, with the code snippet below:
>>> import requests
>>> r = requests.head('http://github.com', allow_redirects=True)
>>> r
<Response [200]>
>>> r.history
[<Response [301]>]
>>> r.url
u'https://github.com/'
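
Applied to the list in the question, a minimal sketch along the same lines (some servers reject HEAD requests, in which case a plain GET works too):
import requests

to_check = [
    "pewresearch.org",
    "narod.ru",
    "intel.com",
    "xda-developers.com",
    "oecd.org"]

for domain in to_check:
    # Start from plain http:// and let requests follow any redirects.
    r = requests.head("http://www." + domain, allow_redirects=True)
    # r.url is the final address after all redirects have been followed.
    print(domain, "->", r.url)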

Related

Python: use requests to get the result of auto-suggestion from a webpage

I am looking at this web page.
In that page, there is a small box that says GET QUOTE.
If I, for example, type AMD, the auto-suggestion opens up and shows a list of matches.
My question is how to use requests in Python 3 to get this list, meaning to get:
AMD Advanced Micro Devices
AMDA Amedia Corp
Thanks for the help.
You can use your browser's debugging facilities to see what request is made and what comes back. For example, in Chrome, the Network tab of the Developer Tools shows the requests and responses.
Use the json parameter to send an application/json request, and use Response.json() to decode the JSON response:
>>> import requests
>>> url = 'http://research.investors.com/services/AutoSuggest.asmx/GetQuoteResults'
>>> response = requests.post(url, json={'q':'AMD','limit':10})
>>> data = response.json()
>>> [row['Symbol'] for row in data['d']]
['AMD', 'AMDA', 'DOX']
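
If you need suggestions for more than one prefix, a small helper around the same endpoint could look like the sketch below (the service and its response format may of course change over time):
import requests

URL = 'http://research.investors.com/services/AutoSuggest.asmx/GetQuoteResults'

def get_suggestions(prefix, limit=10):
    # The endpoint expects a JSON body and wraps its result rows in a 'd' key.
    response = requests.post(URL, json={'q': prefix, 'limit': limit})
    response.raise_for_status()
    return [row['Symbol'] for row in response.json()['d']]

print(get_suggestions('AMD'))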

xpath on authenticated page in Python

I am extracting content from a page using the code below. But now I want to use this on a page that requires authentication. Is there any way I can do this within Python?
Below is the sample code I am using.
from lxml import html
import requests

page = requests.get('http://www.thesiteurl.com/')
tree = html.fromstring(page.text)
logo = tree.xpath('//*[@id="wraper"]/div[3]/header/div[1]/div[2]/div[1]/a/img//@src')
print(logo)
I assume you mean you want to get an authenticated page using requests (since you can do whatever you want after you fetch the html)?
If so, it depends on how the page authenticates. The requests documentation discusses the various ways of doing so. The simplest scheme (HTTP Basic Auth with a username and password) is supported with fairly painless syntax:
>>> requests.get('https://api.github.com/user', auth=('user', 'pass'))
<Response [200]>
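
If the site uses an ordinary login form instead, a requests.Session will keep the session cookie between requests. The sketch below is only an outline: the login URL and the form field names are hypothetical and need to be taken from the real site (the Network tab of your browser's developer tools shows them).
import requests
from lxml import html

# Hypothetical URLs and form field names; replace with the real ones.
LOGIN_URL = 'http://www.thesiteurl.com/login'
PAGE_URL = 'http://www.thesiteurl.com/'

with requests.Session() as session:
    # The session stores any cookies set by the login response
    # and sends them with the following requests.
    session.post(LOGIN_URL, data={'username': 'user', 'password': 'pass'})
    page = session.get(PAGE_URL)

tree = html.fromstring(page.text)
logo = tree.xpath('//*[@id="wraper"]/div[3]/header/div[1]/div[2]/div[1]/a/img//@src')
print(logo)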

Python to Save Web Pages

This is probably a very simple task, but I cannot find any help. I have a website that takes the form www.xyz.com/somestuff/ID. I have a list of the IDs I need information from. I was hoping to have a simple script to go to the site and download the (complete) web page for each ID, saved in a simple form like ID_whatever_the_default_save_name_is, in a specific folder.
Can I run a simple Python script to do this for me? I could do it by hand, it is only 75 different pages, but I was hoping to use this to learn how to do things like this in the future.
Mechanize is a great package for crawling the web with Python. A simple example for your issue would be:
import mechanize

br = mechanize.Browser()
# mechanize needs the full URL, including the scheme
response = br.open("http://www.xyz.com/somestuff/ID")
print(response)
This simply grabs your url and prints the response from the server.
This can be done simply in Python using the urllib module. Here is a simple example in Python 3:
import urllib.request

url = 'http://www.xyz.com/somestuff/ID'  # the URL needs its scheme
req = urllib.request.Request(url)
page = urllib.request.urlopen(req)
src = page.read()
print(src)
For more info on the urllib module -> http://docs.python.org/3.3/library/urllib.html
Do you want just the HTML code for the website? If so, just create a url variable with the host site and add the page number as you go. I'll do this as an example with http://www.notalwaysright.com
import urllib.request

url = "http://www.notalwaysright.com/page/"
for x in range(1, 71):
    newurl = url + str(x)                    # the page number must be converted to a string
    response = urllib.request.urlopen(newurl)
    with open("Page/" + str(x), "ab") as p:  # binary mode, since read() returns bytes
        p.write(response.read())
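
To cover the original request (one saved file per ID), here is a short sketch along the same lines; the base URL and the ID list are placeholders to be replaced with your own values:
import os
import urllib.request

base_url = "http://www.xyz.com/somestuff/"   # placeholder from the question
ids = ["101", "102", "103"]                  # your 75 IDs go here

os.makedirs("pages", exist_ok=True)

for page_id in ids:
    response = urllib.request.urlopen(base_url + page_id)
    # Save each page under its ID, e.g. pages/101.html
    with open(os.path.join("pages", page_id + ".html"), "wb") as f:
        f.write(response.read())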

urllib2 does not read entire page

A portion of my code that parses a web site does not work.
I can trace the problem to the .read() method of my urllib2.urlopen object.
page = urllib2.urlopen('http://magiccards.info/us/en.html')
data = page.read()
Until yesterday this worked fine, but now the length of the data is always 69496 instead of 122989. When I open smaller pages, my code works fine.
I have tested this on Ubuntu, Linux Mint and Windows 7; all show the same behaviour.
I'm assuming that something has changed on the web server, yet the page is complete when I view it in a web browser. I have also tried to diagnose the issue with Wireshark, but there the page is received as complete.
Does anybody know why this may be happening or what I could try to determine the issue?
The page seems to be misbehaving unless you request the content encoded as gzip. Give this a shot:
import urllib2
import zlib
request = urllib2.Request('http://magiccards.info/us/en.html')
request.add_header('Accept-Encoding', 'gzip')
response = urllib2.urlopen(request)
data = zlib.decompress(response.read(), 16 + zlib.MAX_WBITS)
As Nathan suggested, you could also use the great Requests library, which accepts gzip by default.
import requests
data = requests.get('http://magiccards.info/us/en.html').text
Yes, the server is closing the connection, and you need keep-alive to be sent. urllib2 does not have that facility. There used to be urlgrabber, which provided an HTTPHandler that works alongside the urllib2 opener, but unfortunately I don't find that working either. At the moment you could use other libraries, like requests (as demonstrated in the other answer) or httplib2:
import httplib2
h = httplib2.Http(".cache")
resp, content = h.request("http://magiccards.info/us/en.html", "GET")
print len(content)
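
If you go with requests, using a Session also gives you connection reuse (keep-alive) out of the box, which addresses the point above. A minimal sketch:
import requests

# A Session pools connections, so repeated requests to the same host
# reuse one keep-alive connection instead of opening a new one each time.
with requests.Session() as session:
    response = session.get('http://magiccards.info/us/en.html')
    print(len(response.text))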

Browser simulation - Python 3

I need to access a few HTML pages through a Python 3 script. The problem is that I need cookie functionality, so a simple urllib HTTP request won't work.
Any ideas?
Python 3's urllib has cookie support; look at urllib.request.HTTPCookieProcessor and http.cookiejar.
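A minimal sketch of that approach (the httpbin.org URLs are just a convenient test service):
import urllib.request
from http.cookiejar import CookieJar

# Build an opener that stores cookies and sends them back on later requests.
cookie_jar = CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cookie_jar))

# The first request sets a cookie (via a redirect); the second one sends it back.
opener.open('http://httpbin.org/cookies/set/sessionid/abc123')
response = opener.open('http://httpbin.org/cookies')
print(response.read().decode())

for cookie in cookie_jar:
    print(cookie.name, cookie.value)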
Use requests.
>>> import requests
>>> url = 'http://httpbin.org/cookies/set/requests-is/awesome'
>>> r = requests.get(url)
>>> print(r.cookies)
{'requests-is': 'awesome'}
Reference: http://docs.python-requests.org/en/latest/user/quickstart/#cookies
As of a few days ago, requests supports Python 3, though you might have to use one of the development branches; I'm not entirely sure about the status of upstream integration.
