Retrieve image from method called by URL in python - python

I'm trying to retrieve an image that is returned through a given URL using python, for example this one:
http://fundamentus.com.br/graficos3.php?codcvm=2453&tipo=108
I am trying to do this by using urllib retrieve method:
import urllib
urlStr = "http://fundamentus.com.br/graficos3.php?codcvm=2453&tipo=108"
filename = "image.png"
urllib.urlretrieve(urlStr,filename)
I already used this for other URLs, (such as http://chart.finance.yahoo.com/z?s=CMIG4.SA&t=9m), but for the first one it's not working.
Does anyone have an idea about how to make this for the given URL?
Note: I'm using Python 2.7

You need to use a session which you can do with requests:
import requests
with requests.Session() as s:
s.get("http://fundamentus.com.br/graficos.php?papel=CMIG4&tipo=2")
with open("out.png", "wb") as f:
f.write(s.get("http://fundamentus.com.br/graficos3.php?codcvm=2453&tipo=108").content)
It works in your browser as you had visited the initial page where the image was so any necessary cookies were set.

While more verbose than #PadraicCunningham response. This should also do the trick. I'd run into a similar problem (host would only support certain browsers), so i had to start using urllib2 instead of just urllib. pretty powerful and is a module which comes with python.
Basically, you capture all the information you need during your initial request, and add it to your next request and subsequent requests. The requests module seems to pretty much do all of this for you behind the scenes. If only I'd known about that all these years...
import urllib2
urlForCookie = 'http://fundamentus.com.br/graficos.php?papel=CMIG4&tipo=2'
urlForImage = 'http://fundamentus.com.br/graficos3.php?codcvm=2453&tipo=108'
initialRequest = urllib2.Request(urlForCookie)
siteCookie = urllib2.urlopen(req1).headers.get('Set-Cookie')
imageReq = urllib2.Request(urlForImage)
imageReq.add_header('cookie', siteCookie)
with open("image2.pny",'w') as f:
f.write(urllib2.urlopen(req2).read())
f.close()

Related

I am trying to read in a url in python but it is giving an incomplete read

I am trying to read in a url in python 3 however when I tried it did not completely red in the URL
Here is my code
my_url="https://www.newegg.ca/Video-Cards-Video-Devices/Category/ID-38?Tpk=graphics%20cards"
Uclient=uReq(my_url)
page_html=Uclient.read()
Have you tried importing it using requests? As you do not show you direct imports I am assuming you are using urllib.request. The code below should provide you the entire html text available before any javascript is loaded (if the case)
import requests
my_url="https://www.newegg.ca/Video-Cards-Video-Devices/Category/ID-38?Tpk=graphics%20cards"
r = requests.get(my_url)
print (r.text)

python3.6 aiohttp response times and sizes

With python3.6 aiohttp is there a way to get all the following stats which i used to get using pycurl when making a request.
name_lookup_time = curl_handle.getinfo(pycurl.NAMELOOKUP_TIME)
connect_time = curl_handle.getinfo(pycurl.CONNECT_TIME)
app_connect_time = curl_handle.getinfo(pycurl.APPCONNECT_TIME)
pre_transfer_time = curl_handle.getinfo(pycurl.PRETRANSFER_TIME)
start_transfer_time = curl_handle.getinfo(pycurl.STARTTRANSFER_TIME)
total_time = curl_handle.getinfo(pycurl.TOTAL_TIME)
redirect_time = curl_handle.getinfo(pycurl.REDIRECT_TIME)
redirect_cnt = curl_handle.getinfo(pycurl.REDIRECT_COUNT)
total_size = curl_handle.getinfo(pycurl.SIZE_DOWNLOAD)
Currently i'm using a pycurl multi implementation in my crawler to make the requests and then i have code which gathers detailed data for each request. To replace pycurl and learn something new. I am interested in replacing the pycurl multi code which makes the requests with a aiohttp client implementation.
Reading the aiohttp(https://aiohttp.readthedocs.io/en/stable/client_reference.html#aiohttp.ClientResponse) documentation, i found that one can get the redirect_cnt by looking and counting the history Sequence which is part of the ClientResponse object.
history
A Sequence of ClientResponse objects of preceding requests (earliest request first) if there were redirects, an empty sequence otherwise.
I could do without most of the detailed time data. However i would like to gather total_time and total_size for my minimum requirement.

HTML hidden elements

I'm actually trying to code a little "GPS" and actually I couldn't use Google API because of the daily restriction.
I decided to use a site "viamichelin" which provide me the distance between two adresses. I created a little code to fetch all the URL adresses I needed like this :
import pandas
import numpy as np
df = pandas.read_excel('C:\Users\Bibi\Downloads\memoire\memoire.xlsx', sheet_name='Clients')
df2= pandas.read_excel('C:\Users\Bibi\Downloads\memoire\memoire.xlsx', sheet_name='Agences')
matrix=df.as_matrix(columns=None)
clients = np.squeeze(np.asarray(matrix))
matrix2=df2.as_matrix(columns=None)
agences = np.squeeze(np.asarray(matrix2))
compteagences=0
comptetotal=0
for j in agences:
compteclients=0
for i in clients:
print agences[compteagences]
print clients[compteclients]
url ='https://fr.viamichelin.be/web/Itineraires?departure='+agences[compteagences]+'&arrival='+clients[compteclients]+'&arrivalId=34MTE1MnJ5ZmQwMDMzb3YxMDU1ZDFvbGNOVEF1TlRVNU5UUT1jTlM0M01qa3lOZz09Y05UQXVOVFl4TlE9PWNOUzQzTXpFNU5nPT1jTlRBdU5UVTVOVFE9Y05TNDNNamt5Tmc9PTBqUnVlIEZvbmQgZGVzIEhhbGxlcw==&index=0&vehicle=0&type=0&distance=km&currency=EUR&highway=false&toll=false&vignette=false&orc=false&crossing=true&caravan=false&shouldUseTraffic=false&withBreaks=false&break_frequency=7200&coffee_duration=1200&lunch_duration=3600&diner_duration=3600&night_duration=32400&car=hatchback&fuel=petrol&fuelCost=1.393&allowance=0&corridor=&departureDate=&arrivalDate=&fuelConsumption='
print url
compteclients+=1
comptetotal+=1
compteagences+=1
All my datas are on Excel that's why I used the pandas library. I have all the URL's needed for my project.
Although, I would like to extract the number of kilometers needed but there's a little problem. In the source code, I don't have the information I need, so I can't extract it with Python... The site is presented like this:
Michelin
When I click on "inspect" I can find the information needed (on the left) but I can't on the source code (on the right) ... Can someone provide me some help?
Itinerary
I have already tried this, without succeeding :
import os
import csv
import requests
from bs4 import BeautifulSoup
requete = requests.get("https://fr.viamichelin.be/web/Itineraires?departure=Rue%20Lebeau%2C%20Liege%2C%20Belgique&departureId=34MTE1Mmc2NzQwMDM0NHoxMDU1ZW44d2NOVEF1TmpNek5ERT1jTlM0MU5qazJPQT09Y05UQXVOak16TkRFPWNOUzQxTnpBM01nPT1jTlRBdU5qTXpOREU9Y05TNDFOekEzTWc9PTBhUnVlIExlYmVhdQ==&arrival=Rue%20Rys%20De%20Mosbeux%2C%20Trooz%2C%20Belgique&arrivalId=34MTE1MnJ5ZmQwMDMzb3YxMDU1ZDFvbGNOVEF1TlRVNU5UUT1jTlM0M01qa3lOZz09Y05UQXVOVFl4TlE9PWNOUzQzTXpFNU5nPT1jTlRBdU5UVTVOVFE9Y05TNDNNamt5Tmc9PTBqUnVlIEZvbmQgZGVzIEhhbGxlcw==&index=0&vehicle=0&type=0&distance=km&currency=EUR&highway=false&toll=false&vignette=false&orc=false&crossing=true&caravan=false&shouldUseTraffic=false&withBreaks=false&break_frequency=7200&coffee_duration=1200&lunch_duration=3600&diner_duration=3600&night_duration=32400&car=hatchback&fuel=petrol&fuelCost=1.393&allowance=0&corridor=&departureDate=&arrivalDate=&fuelConsumption=")
page = requete.content
soup = BeautifulSoup(page, "html.parser")
print soup
Looking at the inspector for the page, the actual routing is done via a JavaScript invocation to this rather long URL.
The data you need seems to be in that response, starting from _scriptLoaded(. (Since it's a JavaScript object literal, you can use Python's built-in JSON library to load the data into a dict.)

Custom CSS in Jupyter notebook that works in nbviewer

I was reading these two webs that show how to use custom CSS in IPython notebooks published in nbviewer:
http://www.aaronschlegel.com/display-custom-ipython-notebook-themes-in-nbviewer/
https://github.com/titipata/customize_ipython_notebook
Im trying to do the same in a Jupyter notebook. However the code below fails
from IPython.core.display import HTML
import urllib.request
def css():
style = urllib.request.urlopen('some url with css').read()
return HTML(style)
css()
the URL is this, Im trying to use the same CSS that is shown in one of the examples.
However trying to run the above code in a cell it throw the error "TypeError: HTML() expects text not b'\n\nhtml..." what is exactly the content of the link!
I did the same operation using the library requests instead of urllib.request with a similar code and I got the same typeError.
What Im doing wrong? How I can fix it? Thank you in advance.
The problem is that urlopen().read() method is returning a bytes type object rather than a str. You can add a .decode("utf-8") at the end to convert to str.
However, you mentioned attempting to use the requests library, and since that is so much nicer for these types of things I'll convert you code to use requests that also parses the response into a str. You probably tried to use the .read() method or something on the requests response, when requests has a built-in attribute to get the text from a response.
from IPython.core.display import HTML
import requests
def css():
style = requests.get('http://www.aaronschlegel.com/display-custom-ipython-notebook-themes-in-nbviewer/#viewSource').text
return HTML(style)
css()

Extracting data from Web

One really newbie question.
I'm working on a small python script for my home use, that will collect data of a specific air ticket.
I want to extract the data from skyscanner (using BeautifulSoap and urllib). Example:
http://www.skyscanner.net/flights/lond/rome/120922/120929/airfares-from-london-to-rome-in-september-2012.html
And I'm interested in all the data that are stored in this kind of element, specially the price: http://shrani.si/f/1w/An/1caIzEzT/capture.png
Because they are not located in the HTML, can I extract them?
I believe the problem is that these values are rendered through a javascript code which your browser runs and urllib doesn't - You should use a library that can execute javascript code.
I just googled crawler python javascript and I got the some stackoverflow questions and answers which recommends the use of selenium or webkit. You can use those libraries through scrapy. Here are two snippets:
Rendered/interactive javascript with gtk/webkit/jswebkit
Rendered Javascript Crawler With Scrapy and Selenium RC
I have been working on this same exact issue. I have been introduced to Beautifulsoup and later since learned about Scrapy. Beautifulsoup is very easy to use, especially if you're new at this. Scrapy apparently has more "features", but I believe you can accomplish your needs with Beautifulsoup.
I had the same issues about not being able to gain access to a website that loaded information through Javascript and thankfully Selenium was the savior.
A great introduction to Selenium can be found here.
Install: pip install selenium
Below is a simple class I put together. You can save it as a .py file and import it into your project. If you call the method retrieve_source_code(self, domain) and send the hyperlink that you are trying to parse it will return the source code of the fully loaded page when you can then put into Beautifulsoup and find the information you're looking for!
Ex:
airfare_url = 'http://www.skyscanner.net/flights/lond/rome/120922/120929/airfares-from-london-to-rome-in-september-2012.html'
soup = BeautifulSoup(SeleniumWebScraper.retrieve_source_code(airfare_url))
Now you can parse soup like you normally would with Beautifulsoup.
I hope that helps you!
from selenium import webdriver
import requests
class SeleniumWebScraper():
def __init__(self):
self.source_code = ''
self.is_page_loaded = 0
self.driver = webdriver.Firefox()
self.is_browser_closed = 0
# To ensure the page has fully loaded we will 'implicitly' wait
self.driver.implicitly_wait(10) # Seconds
def close(self):
self.driver.close()
self.clear_source_code()
self.is_page_loaded = 0
self.is_browser_closed = 1
def clear_source_code(self):
self.source_code = ''
self.is_page_loaded = 0
def retrieve_source_code(self, domain):
if self.is_browser_closed:
self.driver = webdriver.Firefox()
# The driver.get method will navigate to a page given by the URL.
# WebDriver will wait until the page has fully loaded (that is, the "onload" event has fired)
# before returning control to your test or script.
# It's worth nothing that if your page uses a lot of AJAX on load then
# WebDriver may not know when it has completely loaded.
self.driver.get(domain)
self.is_page_loaded = 1
self.source_code = self.driver.page_source
return self.source_code
You don't even need BeautifulSoup to extract data.
Just do this and your response is converted to a Dictionary which is very easy to handle.
text = json.loads("You text of the main response content")
You can now print any key value pair from the dictionary.
Give it a try. It is super easy.

Categories