Download SVG image from website using Python

I am trying to create a simple weather script. The idea is to look at this website and extract the current temperature and the icon to its left.
I know how to extract the HTML using a request and pull out the temperature, but I have no clue how to get the icon. I am not even sure how to find the icon/image path (or whether it is possible at all).
Here is my code:
import urllib.request
from bs4 import BeautifulSoup

url = "https://www.meteored.com.ar/tiempo-en_La+Plata-America+Sur-Argentina-Provincia+de+Buenos+Aires-SADL-1-16930.html"
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

req = urllib.request.Request(url, headers=hdr)
response = urllib.request.urlopen(req).read()
soup = BeautifulSoup(response, "lxml")

# the current temperature sits in an element with these classes
data = soup.find_all(class_="dato-temperatura changeUnitT")
temperature = data[0].text
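For the icon I imagine something along these lines, but the selector is a pure guess on my part (I don't know the actual tag or class of the icon element):
import urllib.parse

# hypothetical selector -- the real tag/class has to come from inspecting the page
icon_tag = soup.find("img", class_="icono-temperatura")
if icon_tag is not None:
    # resolve a possibly relative src against the page URL, then download the bytes
    icon_url = urllib.parse.urljoin(url, icon_tag["src"])
    icon_req = urllib.request.Request(icon_url, headers=hdr)
    with open("icon.svg", "wb") as f:
        f.write(urllib.request.urlopen(icon_req).read())
Is something like this the right direction?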

Related

BeautifulSoup find_all not finding all containers of a class

I'm trying to write a script to scrape some data off a Zillow page (https://www.zillow.com/homes/for_rent/38.358484,-77.27869,38.218627,-77.498417_rect/X1-SSc26hnbsm6l2u1000000000_8e08t_sse/). Obviously I'm just trying to gather data from every listing. However, I cannot grab the data from every listing: it only finds 9 instances of the class I'm searching for ('list-card-addr'), even though I've checked the HTML of listings it does not find and the class exists there. Does anyone have an idea why? Here's my simple code:
from bs4 import BeautifulSoup
import requests

req_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

url = "https://www.zillow.com/homes/for_rent/38.358484,-77.27869,38.218627,-77.498417_rect/X1-SSc26hnbsm6l2u1000000000_8e08t_sse/"
response = requests.get(url, headers=req_headers)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
address = soup.find_all(class_='list-card-addr')
print(len(address))
The data is stored inside an HTML comment. You can regex it out easily as a string defining a JavaScript object, which you can then handle with json:
import requests, re, json

r = requests.get('https://www.zillow.com/homes/for_rent/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22savedSearchEnrollmentId%22%3A%22X1-SSc26hnbsm6l2u1000000000_8e08t%22%2C%22mapBounds%22%3A%7B%22west%22%3A-77.65840518457031%2C%22east%22%3A-77.11870181542969%2C%22south%22%3A38.13250414385234%2C%22north%22%3A38.444339281260426%7D%2C%22isMapVisible%22%3Afalse%2C%22filterState%22%3A%7B%22sort%22%3A%7B%22value%22%3A%22mostrecentchange%22%7D%2C%22fsba%22%3A%7B%22value%22%3Afalse%7D%2C%22fsbo%22%3A%7B%22value%22%3Afalse%7D%2C%22nc%22%3A%7B%22value%22%3Afalse%7D%2C%22fore%22%3A%7B%22value%22%3Afalse%7D%2C%22cmsn%22%3A%7B%22value%22%3Afalse%7D%2C%22auc%22%3A%7B%22value%22%3Afalse%7D%2C%22fr%22%3A%7B%22value%22%3Atrue%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A11%7D',
                 headers={'User-Agent': 'Mozilla/5.0'})

# the listings JSON sits inside an HTML comment: <!--{"queryState": ...}-->
data = json.loads(re.search(r'!--(\{"queryState".*?)-->', r.text).group(1))
print(data['cat1'])
print(data['cat1']['searchList'].keys())
Within this are details on pagination and the next URL, if applicable, so you can get all the results; you have only asked for page 1 here.
For example, to print the addresses:
for i in data['cat1']['searchResults']['listResults']:
    print(i['address'])
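Each item in listResults is a plain dict, so before pulling more than the address you can check which other fields are available (a quick sketch):
# peek at the fields exposed by the first listing
first = data['cat1']['searchResults']['listResults'][0]
print(sorted(first.keys()))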

How to scrape a hidden table with BeautifulSoup

This is about scraping a hidden table with BeautifulSoup.
As you can see on this website, there is a button "choisissez votre séance", and when you click on it a table is shown.
When I inspect the table element I can see the tag that contains attributes like the price. However, when I view the website's source code, I can't find this information.
There is something in the table's markup, 'display: none', which I think is the cause, but I can't find a solution.
It would appear the page is using AJAX and loading the pricing data in the background. Using Chrome, I pressed F12 and had a look under the Network tab. When I clicked the "choisissez votre séance" button I noticed a POST to this address:
'https://www.ticketmaster.fr/planPlacement/FindPrices/connected/false/idseance/2870471'
This is great news for you, as you do not need to scrape the HTML data; you simply need to provide the ID (found in the page source) to the API.
In the code below I am:
1. Requesting the initial page
2. Collecting the cookie
3. Posting the ID (data) and the cookie we collected
4. Returning the JSON data you require for further processing (variable j)
Hope the below helps out!
Cheers,
Adam
import requests
from bs4 import BeautifulSoup

h = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}

s = requests.session()

# 1) request the initial page and 2) collect its cookie
initial_page_request = s.get('https://www.ticketmaster.fr/fr/manifestation/holiday-on-ice-billet/idmanif/446304', headers=h)
soup = BeautifulSoup(initial_page_request.text, 'html.parser')

# the id of the first session ("séance") offered in the dropdown
idseanc = soup.find("select", {"id": "sessionsSelect"})("option")[0]['value'].split("_")[1]
cookies = initial_page_request.cookies.get_dict()

headers = {
    'Origin': 'https://www.ticketmaster.fr',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    'Content-Type': 'application/json; charset=UTF-8',
    'Accept': '*/*',
    'Referer': 'https://www.ticketmaster.fr/fr/manifestation/holiday-on-ice-billet/idmanif/446304',
    'X-Requested-With': 'XMLHttpRequest',
    'Connection': 'keep-alive',
}

# 3) post the id (and the cookies we collected) to the pricing endpoint
data = {'idseanc': str(idseanc)}
response = s.post('https://www.ticketmaster.fr/planPlacement/FindPrices/connected/false/idseance/' + idseanc,
                  headers=headers, cookies=cookies, data=data)

# 4) the JSON data you can process further
j = response.json()
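To get a first look at what came back before digging out the prices, you could pretty-print the start of the structure (a quick sketch):
import json

# peek at the shape of the response
print(json.dumps(j, indent=2)[:1000])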

Scraping data from website graph with Python

As an exercise, I am trying to scrape data from a dynamic graph using Python. The graph can be found at this link (let's say I want the data from the first one).
Now, I was thinking of doing something like:
import json
import urllib.request

src = 'https://marketchameleon.com/Overview/WFT/IV/#_ABSTRACT_RENDERER_ID_11'

with urllib.request.urlopen(src) as url:
    data = url.read()
reply = json.loads(data)
However, I receive an error message on the last line of the code, saying:
JSONDecodeError: Expecting value
"data" is not empty, so I believe there is a problem with the format of the information within it. Does someone have an idea to solve this issue? Thanks!
I opened that link and saw that the site loads data from another URL - https://marketchameleon.com/charts/histStockChartData?p=747&m=12&_=1534060722519
You can use the json.loads() function twice and do some hacks with the headers (urllib2.Request is your friend in the case of Python 2), since the server returns HTTP 500 when you don't imitate a browser:
import json
import urllib.request

src = 'https://marketchameleon.com/charts/histStockChartData?p=747&m=12'

# without browser-like headers the server answers HTTP 500
user_agent = {
    'Host': 'marketchameleon.com',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7,kk;q=0.6'
}

request = urllib.request.Request(src, headers=user_agent)
data = urllib.request.urlopen(request).read()
print(data)

# the response is JSON, and its 'GTable' field is itself a JSON string,
# hence json.loads() twice
reply = json.loads(data)
table = json.loads(reply['GTable'])
print(table)

How to parse emails from mailto URLs in Python

I am trying to parse emails from a web page.
My code:
import urllib2
import re

site = "http://www.traidnt.net/vb/traidnt207743"
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

req = urllib2.Request(site, headers=hdr)
page = urllib2.urlopen(req)
content = page.read()

links = re.findall('mailto:.+?@.+.', content)
for link in links:
    print link[7:-1]
and the result comes out like:
email1@
email2@
email3@
...
but I need to get all the emails in complete form.
How can I get the complete form of all the emails?
Thank you!
I just added the following code to your code and it works perfectly:
regexp = re.compile(r"mailto:([a-z0-9!#$%&'*+\/=?^_`{|}~-]+@[a-z0-9]+\.[a-zA-Z0-9-.]+)")
links = re.findall(regexp, content)
print links
Output:
['njm-kwt@hotmail.com', 'fnan-ksa@hotmail.com', 'k-w-t7@hotmail.com', 'coool-uae@hotmail.com', 'qsd@hotmail.de', 'o1ooo@hotmail.de', 'm-p-3@hotmail.de', 'ya7oo@hotmail.de', 'g5x@hotmail.de', 'f7t@hotmail.de', 'm2y@hotmail.de', 's2udi@hotmail.de', 'q2tar@hotmail.de', 'kuw2it@hotmail.de', 's2udi@hotmail.fr', 'qxx@hotmail.de', 'y-e-s@hotmail.de', 'y-a@hotmail.de', 'qqj@hotmail.de', 'qjj@hotmail.de', 'admin_vb@hotmail.de', 'eng-vb@hotmail.com', 'a3lantk@hotmail.com', 'a3lnkm@hotmail.com', 't7t@hotmail.de', 'mohamed_fathy41@hotmail.com', 'ox-9@hotmail.com', 'ox-9@hotmail.com']
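As an alternative to hand-rolling the regex, you could let BeautifulSoup (used elsewhere on this page) find the mailto: links and strip the scheme off each href (a sketch, reusing the content variable from the question):
from bs4 import BeautifulSoup

soup = BeautifulSoup(content, 'html.parser')

# every anchor whose href starts with "mailto:"
for a in soup.find_all('a', href=lambda h: h and h.startswith('mailto:')):
    print(a['href'][len('mailto:'):])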
You should use a special library for this, like
https://pypi.python.org/pypi/urlinfo
and contribute and create issues to make Python better ;)

How can I scrape ajax site products with Python urllib & bs4?

I'm trying to use urllib + bs4 to scrape naturebox's website.
I want to create a voting webapp so my office can vote for new snacks from naturebox each month.
Ideally I'd like to pull the snacks from naturebox's website as they change frequently.
Here's my code:
import urllib2

site = 'http://www.naturebox.com/browse/'
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

req = urllib2.Request(site, headers=hdr)
try:
    page = urllib2.urlopen(req)
except urllib2.HTTPError, e:
    print e.fp.read()

content = page.read()
print content
but the output is just JavaScript and none of the data for the actual snacks (picture, name, description, etc.).
Obviously I'm a noob - how can I get at this data?
You can use Selenium WebDriver. This will allow you to load the site, including the JavaScript/AJAX content, and access the rendered HTML from there. Also, try using PhantomJS so that it can run invisibly in the background.
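A minimal sketch of that approach, assuming Selenium and PhantomJS are installed (webdriver.PhantomJS() matches Selenium versions of that era; current Selenium would use a headless Chrome or Firefox instead):
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.PhantomJS()      # headless: runs invisibly in the background
driver.get('http://www.naturebox.com/browse/')
html = driver.page_source           # HTML after the JavaScript has executed
driver.quit()

soup = BeautifulSoup(html, 'html.parser')
# the selectors for the snack name/picture/description have to be taken
# from the rendered markup; inspecting soup here is the starting point
print(soup.title)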
