Python 3: open and read a URL without a file name

I have gone through the relevant questions and did not find an answer to this one:
I want to open a URL and parse its contents.
When I do that on, say, google.com, there is no problem.
When I do it on a URL that does not end in a file name, I often find that I have read an empty string.
See the code below as an example:
import urllib.request

#urls = ["http://www.google.com", "http://www.whoscored.com", "http://www.whoscored.com/LiveScores"]
#urls = ["http://www.whoscored.com", "http://www.whoscored.com/LiveScores"]
urls = ["http://www.whoscored.com/LiveScores"]

print("Type of urls: {0}.".format(str(type(urls))))

for url in urls:
    print("\n\n\n\n---------------------------------------------\n\nUrl is: {0}.".format(url))
    sock = urllib.request.urlopen(url)
    print("I have this sock: {0}.".format(sock))
    htmlSource = sock.read()
    print("I read the source code...")
    htmlSourceLine = sock.readlines()
    sock.close()
    htmlSourceString = str(htmlSource)
    print("\n\nType of htmlSourceString: " + str(type(htmlSourceString)))
    htmlSourceString = htmlSourceString.replace(">", ">\n")
    htmlSourceString = htmlSourceString.replace("\\r\\n", "\n")
    print(htmlSourceString)
    print("\n\nI am done with this url: {0}.".format(url))
I do not know what causes me to sometimes get an empty string back for URLs that don't have a file name, such as "www.whoscored.com/LiveScores" in the example, whereas "google.com" or "www.whoscored.com" seem to work every time.
I hope that my formulation is understandable...

It looks like the site is coded to explicitly reject requests from non-browser clients. You'll have to spoof a browser session, ensuring that cookies are passed back and forth as required. The third-party requests library can help you with these tasks, but the bottom line is that you are going to have to find out more about how that site operates.
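As a rough illustration of that (not a guaranteed fix, since the site may check more than this), here is a minimal requests sketch that keeps cookies in a Session and sends a browser-like User-Agent; the warm-up request to the front page and the header value are assumptions you would need to adapt:

import requests

headers = {"User-Agent": "Mozilla/5.0"}          # assumed browser-like UA; adapt as needed

session = requests.Session()                      # keeps cookies between requests
session.get("http://www.whoscored.com", headers=headers)        # warm-up visit, lets the site set cookies
response = session.get("http://www.whoscored.com/LiveScores", headers=headers)
print(len(response.text))                         # should be non-empty if the block was cookie/UA based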

Your code worked only intermittently for me, but using requests and sending a User-Agent header worked perfectly:
import requests

headers = {
    'User-agent': 'Mozilla/5.0,(X11; U; Linux i686; en-GB; rv:1.9.0.1): Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1'}

urls = ["http://www.whoscored.com/LiveScores"]

print("Type of urls: {0}.".format(str(type(urls))))

for url in urls:
    print("\n\n\n\n---------------------------------------------\n\nUrl is: {0}.".format(url))
    sock = requests.get(url, headers=headers)
    print("I have this sock: {0}.".format(sock))
    htmlSource = sock.content
    print("I read the source code...")
    htmlSourceString = str(htmlSource)
    print("\n\nType of htmlSourceString: " + str(type(htmlSourceString)))
    htmlSourceString = htmlSourceString.replace(">", ">\n")
    htmlSourceString = htmlSourceString.replace("\\r\\n", "\n")
    print(htmlSourceString)
    print("\n\nI am done with this url: {0}.".format(url))

Related

Have made this scraper but the function returns no values, just empty cells

So I've made a web scraper and everything seems to run fine; however, no values are being returned. I assume there's something wrong with the URL, but I can't seem to spot anything.
import pandas as pd
import datetime
import requests
from requests.exceptions import ConnectionError
from bs4 import BeautifulSoup

def web_content_div(web_content, class_path):
    web_content_div = web_content.find_all('div', {'class': class_path})
    try:
        spans = web_content_div[0].find_all('span')
        texts = [span.get_text() for span in spans]
    except IndexError:
        texts = []
    return texts

def real_time_price(stock_code):
    url = 'https://uk.finance.yahoo.com/quote/' + stock_code + '?p=' + stock_code + '&.tsrc=fin-tre-srch'
    try:
        r = requests.get(url)
        web_content = BeautifulSoup(r.text, 'lxml')
        texts = web_content_div(web_content, 'My(6px) Pos(r) smartphone_Mt(6px) W(100%)')
        if texts != []:
            price, change = texts[0], texts[1]
        else:
            price, change = [], []
    except ConnectionError:
        price, change = [], []
    return price, change

Stock = ['BRK-B']
print(real_time_price('BRK-B'))
There's nothing wrong with the URL, which you can easily check by running something like this from the command line (get curl for your OS if you don't have it):
curl --output result.txt "https://uk.finance.yahoo.com/quote/BRK-B?p=BRK-B&.tsrc=fin-tre-srch"
That works, and saves the text you're after in result.txt.
So it's not the URL. The usual suspect, then, is the user agent, and lo and behold, spoofing a normal web browser User-Agent works just fine:
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"}
r = requests.get(url, headers=headers)
This is just some random user agent string; you could try to find something more generic, but the key point is that Yahoo doesn't want to serve your Python script, and you'll have to lie to Yahoo about what you're really doing to get what you want (which you do at your own risk; I'm not saying you should, I'm just saying how it's possible - don't).
Since you indicated the above "doesn't do it" - I can only assume you did try it and noticed that the content gets retrieved correctly, but the expression you pass to find_all gets you no results. That's because you cannot just pass all the classes in a single string separated by spaces: 'My(6px) Pos(r) smartphone_Mt(6px) W(100%)'.
However, if you just pass 'smartphone_Mt(6px)', you'll notice that it only finds a single result anyway. With a bit more work, you can make a more specific selection, if needed for other elements.
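As a rough sketch of that last point (the single class name and the span indexing come from the question's own selector and will break whenever Yahoo changes its markup):

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}   # browser-like UA, as discussed above
r = requests.get('https://uk.finance.yahoo.com/quote/BRK-B?p=BRK-B&.tsrc=fin-tre-srch', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

# pass a single class instead of the whole space-separated string
divs = soup.find_all('div', {'class': 'smartphone_Mt(6px)'})
if divs:
    spans = divs[0].find_all('span')
    texts = [span.get_text() for span in spans]
    print(texts[:2])   # price and change, if the layout still matches the question's assumption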
Of course, there may be several different reasons for your problem, so I can't jump to conclusions and suggest one generic solution that completely solves it.
When I first ran your code locally, I was getting a 404 from requests.get and thought the URL was malformed or wrong. Then I guessed that python requests, which I have seen behave oddly before, was causing some problem and not fetching what you want.
But then I guessed the problem may be due to the dynamic behaviour of the page: data is written into the page with JavaScript or XHR requests, or via document.write(...), which means the served HTML does not include the actual data.
To cope with the JavaScript problem, I recommend using Selenium or a similar library. Selenium may also help in cases where a pop-up appears when you load the page, e.g. a dialog asking you to consent to the rules or accept cookies; you can handle those conditions by clicking the right button.
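A minimal Selenium sketch of that idea, assuming a locally installed Chrome and chromedriver (the URL is the one from the question; the fixed sleep is a crude placeholder for a proper wait):

from selenium import webdriver
import time

driver = webdriver.Chrome()        # requires chromedriver on the PATH
driver.get('https://uk.finance.yahoo.com/quote/BRK-B?p=BRK-B&.tsrc=fin-tre-srch')
time.sleep(5)                      # crude wait for JavaScript to populate the page
html = driver.page_source          # the dynamically rendered HTML
driver.quit()

# html can then be parsed with BeautifulSoup exactly as in the question
print(len(html))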
Finally, you can try setting a user-agent in your headers, which may sometimes be the issue. I took a look at your site's robots.txt and found that it disallows some agents, so it is always a good idea to change this parameter (and some others checked by the server). Also try to separate your query params, which is much cleaner:
...
url = 'https://uk.finance.yahoo.com/quote/' + stock_code
params = {
    'p': stock_code,
    '.tsrc': 'fin-tre-srch',
}
headers = {'user-agent': 'my-app/0.0.1'}
# alternatively: headers = {'user-agent': 'PostmanRuntime/7.28.4'}
# or, hard-coded for a single symbol: url = 'https://uk.finance.yahoo.com/quote/BRK-B'
try:
    r = requests.get(url, params=params, headers=headers)
    ...

How to get contents of frames automatically if browser does not support frames + can't access frame directly

I am trying to automatically download PDFs from URLs like this to make a library of UN resolutions.
If I use Beautiful Soup or mechanize to open that URL, I get "Your browser does not support frames", and I get the same thing if I use the "copy as cURL" feature in Chrome dev tools.
The standard advice for the "Your browser does not support frames" message when using mechanize or Beautiful Soup is to follow the source of each individual frame and load that frame. But if I do so, I get an error message that the page is not authorized.
How can I proceed? I guess I could try this in zombie or phantom, but I would prefer not to use those tools as I am not that familiar with them.
Okay, this was an interesting task to do with requests and BeautifulSoup.
There is a set of underlying calls to un.org and daccess-ods.un.org that are important and set relevant cookies. This is why you need to maintain a requests.Session() and visit several URLs before getting access to the PDF.
Here's the complete code (Python 2):
import re
from urlparse import urljoin

from bs4 import BeautifulSoup
import requests

BASE_URL = 'http://www.un.org/en/ga/search/'
URL = "http://www.un.org/en/ga/search/view_doc.asp?symbol=A/RES/68/278"
BASE_ACCESS_URL = 'http://daccess-ods.un.org'

# start session
session = requests.Session()
response = session.get(URL, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'})

# get frame links
soup = BeautifulSoup(response.text)
frames = soup.find_all('frame')
header_link, document_link = [urljoin(BASE_URL, frame.get('src')) for frame in frames]

# get header
session.get(header_link, headers={'Referer': URL})

# get document html url
response = session.get(document_link, headers={'Referer': URL})
soup = BeautifulSoup(response.text)
content = soup.find('meta', content=re.compile('URL='))['content']
document_html_link = re.search('URL=(.*)', content).group(1)
document_html_link = urljoin(BASE_ACCESS_URL, document_html_link)

# follow html link and get the pdf link
response = session.get(document_html_link)
soup = BeautifulSoup(response.text)

# get the real document link
content = soup.find('meta', content=re.compile('URL='))['content']
document_link = re.search('URL=(.*)', content).group(1)
document_link = urljoin(BASE_ACCESS_URL, document_link)
print document_link

# follow the frame link with login and password first - would set the important cookie
auth_link = soup.find('frame', {'name': 'footer'})['src']
session.get(auth_link)

# download file
with open('document.pdf', 'wb') as handle:
    response = session.get(document_link, stream=True)
    for block in response.iter_content(1024):
        if not block:
            break
        handle.write(block)
You should probably extract separate blocks of code into functions to make it more readable and reusable.
FYI, all of this could be done more easily through a real browser with the help of selenium or Ghost.py.
Hope that helps.

How can I test this script that accesses urls through several different proxy servers?

Right now this is the script:
import json
import urllib2

with open('urls.txt') as f:
    urls = [line.rstrip() for line in f]

with open('proxies.txt') as proxies:
    for line in proxies:
        proxy = json.loads(line)
        proxy_handler = urllib2.ProxyHandler(proxy)
        opener = urllib2.build_opener(proxy_handler)
        urllib2.install_opener(opener)
        for url in urls:
            data = urllib2.urlopen(url).read()
            print data
This is the urls.txt file:
http://myipaddress.com
and the proxies.txt file:
{"https": "https://87.98.216.22:3128"}
{"https": "http://190.153.7.189:8080"}
{"https": "http://125.39.68.181:80"}
that I got at http://hidemyass.com
I have been trying to test it by going through the terminal output (a bunch of HTML), looking to see if it shows the IP address somewhere, and hoping that it is one of the proxy IPs. But this doesn't seem to work: depending on the IP recognition site, it either throws a connection error or tells me I have to enter validation letters (though the same site viewed through the browser works fine).
So am I going about this in the best way? Is there a simpler way to check which IP address the URL is seeing?
Edit: I heard elsewhere (on another forum) that one way to check whether the URL is being accessed from a different IP is to look for cross headers (i.e. an HTTP header indicating that the request was forwarded). But I can't find any more info.
You can use a simpler site that returns just your IP (the one used in urls.txt below). Example:
Code:
import json
import urllib2

with open('urls.txt') as f:
    urls = [line.rstrip() for line in f]

with open('proxies.txt') as proxies:
    for line in proxies:
        proxy = json.loads(line)
        proxy_handler = urllib2.ProxyHandler(proxy)
        opener = urllib2.build_opener(proxy_handler)
        urllib2.install_opener(opener)
        for url in urls:
            try:
                data = urllib2.urlopen(url).read()
                print proxy, "-", data
            except:
                print proxy, "- not working"
urls.txt:
http://api.exip.org/?call=ip
proxies.txt:
{"http": "http://218.108.114.140:8080"}
{"http": "http://59.47.43.93:8080"}
{"http": "http://218.108.170.172:80"}
Output:
{u'http': u'http://218.108.114.140:8080'} - 218.108.114.140
{u'http': u'http://59.47.43.93:8080'} - 118.207.240.161
{u'http': u'http://218.108.170.172:80'} - not working
[Finished in 25.4s]
Note: none of this is my real IP.
Or, if you want to use http://myipaddress.com, you can do that with BeautifulSoup by extracting the exact HTML element which contains your IP.
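A rough sketch of that idea (written in Python 2 to match the code above); since the exact element on myipaddress.com isn't shown here, this simply regex-matches the first IPv4-looking string in the page text rather than a specific tag:

import re
import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen("http://myipaddress.com").read()
soup = BeautifulSoup(html)
# naive pattern: the first thing that looks like an IPv4 address
match = re.search(r"\d{1,3}(?:\.\d{1,3}){3}", soup.get_text())
if match:
    print match.group(0)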

Trouble getting a clean text file from HTML

I have looked at these previous questions
I am trying to consolidate news and notes from websites.
Reputable news websites allow users to post comments and views.
I am trying to get only the news content, without the user comments. I tried working with BeautifulSoup and html2text, but the user comments end up included in the text file. I have even tried developing a custom program, with no more useful progress than the above two.
Can anybody provide a clue on how to proceed?
The code:
import urllib2
from bs4 import BeautifulSoup

URL = 'http://www.example.com'
print 'Following: ', URL
print "Loading..."

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
identify_as = {'User-Agent': user_agent}
print "Reading URL:" + str(URL)

def process(URL, identify_as):
    req = urllib2.Request(URL, data=None, headers=identify_as)
    response = urllib2.urlopen(req)
    _BSobj = BeautifulSoup(response).prettify(encoding='utf-8')
    return _BSobj  # return BeautifulSoup object

print 'Processing URL...'
new_string = process(URL, identify_as).split()

print 'Building requested text'
tagB = ['<title>', '<p>']
tagC = ['</title>', '</p>']
reqText = []
for num in xrange(len(new_string)):
    buffText = []  # initialize and reset
    if new_string[num] in tagB:
        tag = tagB.index(new_string[num])
        while new_string[num] != tagC[tag]:
            buffText.append(new_string[num])
            num += 1
        reqText.extend(buffText)
reqText = ''.join(reqText)

fileID = open('reqText.txt', 'w')
fileID.write(reqText)
fileID.close()
Here's a quick example I wrote using urllib which gets the contents of a page to a file:
import urllib.request

myurl = "http://www.mysite.com"
sock = urllib.request.urlopen(myurl)
pagedata = str(sock.read())
sock.close()

file = open("output.txt", "w")
file.write(pagedata)
file.close()
Then, with a fair amount of string processing, you should be able to extract the parts of the HTML you want. This gives you something to start from.
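If the goal is the title and paragraph text without the user comments, a BeautifulSoup sketch may be more robust than splitting strings by hand; note that the 'comments' class name below is an assumption and varies from site to site:

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.mysite.com"   # placeholder from the example above
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"})
html = urllib.request.urlopen(req).read()

soup = BeautifulSoup(html, "html.parser")

# drop a hypothetical comments container before extracting text
for comments in soup.find_all("div", {"class": "comments"}):
    comments.decompose()

title = soup.title.get_text() if soup.title else ""
paragraphs = [p.get_text() for p in soup.find_all("p")]

with open("reqText.txt", "w", encoding="utf-8") as f:
    f.write(title + "\n\n" + "\n".join(paragraphs))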

406 Error with Mechanize

I'm getting a 406 error with Mechanize when trying to open a URL:
for url in urls:
    if "http://" not in url:
        url = "http://" + url
    print url
    try:
        page = mech.open("%s" % url)
    except urllib2.HTTPError, e:
        print "there was an error opening the URL, logging it"
        print e.code
        logfile = open("log/urlopenlog.txt", "a")
        logfile.write(url + "," + "couldn't open this page" + "\n")
        continue
    else:
        print "opening this URL..."
        page = mech.open(url)
Any idea what would cause a 406 error to occur? If I go to the URL in question I can open it in the browser.
Try adding headers to your request based on what your browser sends; start by adding an Accept header (a 406 Not Acceptable normally means the server can't produce a response matching what your request says it will accept).
See "Adding headers" in the documentation:
req = mechanize.Request(url)
req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')
page = mechanize.urlopen(req)
The Accept header value there is based on the header sent by Chrome.
If you want to find out which headers your browser sends, this webpage shows them to you: https://www.whatismybrowser.com/detect/what-http-headers-is-my-browser-sending
The 'Accept' and 'User-Agent' headers should be enough. This is what I did to get rid of the error:
#establish counter
j = 0
#Create headers for webpage
headers = {'User-Agent': 'Mozilla/5.0', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}
#Create for loop to get through list of URLs
for url in URLs:
    #Verify scraper agent so that web security systems don't block webpage scraping upon URL opening, with j as a counter
    req = mechanize.Request(URLs[j], headers=headers)
    #Open the url
    page = mechanize.urlopen(req)
    #increase counter
    j += 1
You could also try using the "urllib2" or "urllib" libraries to open these URLs; the syntax for adding headers is very similar.
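For instance, a minimal urllib2 sketch of the same approach (Python 2, to match the mechanize code above; the header values are the ones suggested earlier, and the URL is a placeholder):

import urllib2

url = "http://www.example.com"   # placeholder; use the urls from your loop
headers = {'User-Agent': 'Mozilla/5.0',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}

req = urllib2.Request(url, headers=headers)
page = urllib2.urlopen(req).read()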
