I am using data from this website: https://ourworldindata.org/grapher/total-daily-covid-deaths?tab=map
I am trying to interact with the link 'total-daily-covid-deaths.csv', which has the href 'blob:https://ourworldindata.org/b1c6f69e-4df4-4458-8aa0-35173733b364'. Clicking the link takes me to a page with a lot of data, and I simply want to write a Python script that takes that data and puts it into a CSV file for me to use. While researching this I found an overwhelming amount of information and got confused very quickly. I have experience web scraping with Beautiful Soup and requests, but I haven't been able to get it working since the blob link isn't an actual website. I was hoping someone could shed some light on this and point me in the right direction.
This is the code I'm using:
import urllib.request as request
url = 'https://ourworldindata.org/grapher/total-daily-covid-deaths?tab=map'
# fake user agent of Safari
fake_useragent = 'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25'
r = request.Request(url, headers={'User-Agent': fake_useragent})
f = request.urlopen(r)
# print or write
print(f.read())
Blob URLs are already explained in the question below:
Convert blob URL to normal URL
Can you share your code snippet so we can get a better idea of your problem?
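In short, a blob: URL is created by JavaScript inside the browser and only exists in that browser session, so requests/urllib cannot fetch it. The usual workaround is to find the underlying data file the chart loads (via the browser's network tab) and request that directly. A minimal sketch, assuming a direct CSV URL such as the one OWID publishes in its owid/covid-19-data GitHub repository (the exact URL below is an assumption; verify it against what your chart actually loads):
import requests

# The blob: URL only exists inside the browser session; fetch the underlying CSV instead.
# This URL is an assumption -- check the network tab for the file your chart really loads.
csv_url = 'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv'
resp = requests.get(csv_url, timeout=30)
resp.raise_for_status()
with open('total-daily-covid-deaths.csv', 'wb') as f:
    f.write(resp.content)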
Related
I am trying to scrape and download files from the search results displayed when just clicking the search button on https://elibrary.ferc.gov/eLibrary/search. When the search results are displayed, the links look like https://elibrary.ferc.gov/eLibrary/filedownload?fileid=823F0FAB-E23A-CD5B-9FDD-7B3B7A700000, for example. Clicking such a link on the search results page forces a download (content-disposition: attachment). I am saving the search results as an HTML page and then scraping the links. I am trying to fetch the file associated with each link and store it locally; however, my code isn't working.
#!/usr/bin/env python3
import os
import sys
import psycopg2
from pathlib import Path
import urllib.request

import requests

session = requests.Session()
session.headers.update(
    {"User-Agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"}
)

# Establish a session on the search page, then try to fetch one of the download links.
r1 = session.get("https://elibrary.ferc.gov/eLibrary/search", verify=False)
dl_url = "https://elibrary.ferc.gov/eLibrary/filedownload?fileid=020E6084-66E2-5005-8110-C31FAFC91712"
req = session.get(dl_url, verify=False)
with open("dl_url", "wb") as outfile:
    outfile.write(req.content)
I am not able to download the file contents at all (pdf, docx, etc.). The code above is just an attempt to solve the local download issue. Thanks in advance for any help.
Solved by using a JSON POST request. The original URL won't work.
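For anyone else hitting this, here is a rough sketch of what that JSON POST approach can look like. The endpoint and payload fields below are placeholders (assumptions on my part); copy the real ones from the XHR the search page sends, visible in the browser's network tab:
import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)"})

# Hypothetical endpoint and JSON body: copy the real ones from the XHR the
# search page issues (visible in the browser's network tab).
api_url = "https://elibrary.ferc.gov/eLibrary/search/api"  # placeholder
payload = {"searchText": "", "resultsPerPage": 25}         # placeholder fields

resp = session.post(api_url, json=payload)
resp.raise_for_status()
results = resp.json()  # the file IDs to download should be somewhere in here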
So I'm trying to scrape a website for frames of videos and having no luck with one particularly hard-to-get site. I'm new to web scraping, so I could just be missing something important.
My process for other websites is to go through youtube-dl and ffmpeg. youtube-dl had support for this website, but it no longer works. I thought about writing a new extractor, as I did for other websites, but the issue with this site seems different: it's easy to get the .mp4 link for the video, but it's hard to get it to work, that is, to display the HTML video player rather than a 403: Forbidden or a 'wrong cookie' message.
I guess I have to mimic a browser request for the link to work, but I'm not sure what I'm missing.
This is what I tried so far to identify the problem:
Running youtube-dl using the current (not working) implementation for the website. It can successfully get the .mp4 link, but it's never able to access it.
Output:
59378: Downloading webpage
WARNING: unable to extract description; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; type youtube-dl -U to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
ERROR: unable to download video data: HTTP Error 403: Forbidden
Using the Python requests library. I used a session to try to carry the cookies from the regular video page over to the real URL of the video. It also successfully gets the .mp4 link, but it's never able to access it. Here is the code:
from lxml import html, etree
import requests
url = 'LINK GOES HERE'
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.12 Safari/537.36'}
s = requests.Session()
s.headers.update(header)
page = s.get(url)
extractedHtml = html.fromstring(page.content)
videoUrl = extractedHtml.xpath("//video/@src")
print("Video URL: ", *videoUrl)
videoPage = s.get(*videoUrl)
print(videoPage.content)
print("Done.")
And the output:
Video URL: REAL VIDEO (.MP4) URL HERE
b'Wrong Cookie'
Done.
Opening the regular video page, then the .mp4 page, in Selenium. If I make one GET request to the regular page, the .mp4 page can work, but not reliably, as I would occasionally get 403 or 'wrong cookie' messages. If I make two GET requests to the regular page, the .mp4 page works 100% of the time. So my code is:
from lxml import html, etree
from selenium import webdriver
url = 'LINK GOES HERE'
browser = webdriver.Chrome()
browser.get(url)
browser.get(url)
extractedHtml = html.fromstring(browser.page_source)
videoUrl = extractedHtml.xpath("//video/@src")
browser.get(*videoUrl)
print("Done.")
The output of this is Selenium opening the .mp4 video page successfully every time, but I don't know how I could use this to get frames of the video without downloading the whole thing.
Each website is different, so I'm leaving a reference video to make everything easier. However, the website has NSFW content, so I don't think I can just drop a random link in here. Here is a pastebin with a link to the most SFW video I could find. Discretion is advised.
https://pastebin.com/cBsWg1C7
If you have any thoughts about this please comment. I'm dreadfully stuck.
When you open that webpage it shows a disclaimer. Once you click Accept, a cookie is set by their JavaScript code, $.cookie("disclaimer", 1, { ... }), in jcore.v1.1.229.min.js (you can find this file included as a script in the page source).
Also, for a successful connection you need to send a Referer header set to the URL of the webpage.
Below is Python code which accepts the disclaimer and downloads the file as out.mp4:
import requests
from lxml import html
url = '<webpage-url>' # change this to the relevant URL
# Get the download link
link_response = requests.get(url)
extracted_html = html.fromstring(link_response.content)
video_link = extracted_html.xpath('//*[@id="videoContainer"]/@data-src')[0]
# Get the video
headers = {'referer': url, 'cookie': 'disclaimer=1'}
video_response = requests.get(video_link, headers=headers)
# Save the video
with open('out.mp4', 'wb') as f:
    f.write(video_response.content)
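Since the original question was about getting frames, one way to finish the job (not part of the answer above, and assuming ffmpeg is installed and on your PATH) is to hand the downloaded file to ffmpeg:
import os
import subprocess

# Extract one frame per second from the downloaded video into frames/frame_0001.png, ...
# Adjust fps (or use -ss/-vframes) depending on which frames you actually need.
os.makedirs('frames', exist_ok=True)
subprocess.run(
    ['ffmpeg', '-i', 'out.mp4', '-vf', 'fps=1', 'frames/frame_%04d.png'],
    check=True,
)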
I'm trying to scrape data from Mexico's central bank website but have hit a wall. In terms of actions, I first need to access a link within an initial URL. Once that link has been accessed, I need to select two dropdown values and then activate a submit button. If all goes well, I will be taken to a new URL where a set of links to PDFs is available.
The original url is:
"http://www.banxico.org.mx/mercados/valores-gubernamentales-secto.html"
The nested URL (the one with the dropdowns) is:
"http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces?BMXC_claseIns=GUB&BMXC_lang=es_MX"
The inputs (arbitrary) are, say: '07/03/2019' and '14/03/2019'.
Using BeautifulSoup and requests, I feel like I got as far as filling in the dropdown values, but I failed to click the button and reach the final URL with the list of links.
My code follows below:
from bs4 import BeautifulSoup
import requests
pagem = requests.get("http://www.banxico.org.mx/mercados/valores-gubernamentales-secto.html")
soupm = BeautifulSoup(pagem.content, "lxml")
lst = soupm.find_all('a', href=True)
url = lst[-1]['href']
page = requests.get(url)
soup = BeautifulSoup(page.content, "lxml")
xin = soup.find("select", {"id": "_id0:selectOneFechaIni"})
xfn = soup.find("select", {"id": "_id0:selectOneFechaFin"})
ino = list(xin.stripped_strings)
fino = list(xfn.stripped_strings)
headers = {'Referer': url}
data = {'_id0:selectOneFechaIni': '07/03/2019', '_id0:selectOneFechaFin': '14/03/2019', "_id0:accion": "_id0:accion"}
respo = requests.post(url, data, headers=headers)
print(respo.url)
In the code, respo.url ends up equal to url, i.e. the POST never advances to the results page. Can anybody please help me identify where the problem is? I'm a newbie to scraping, so it might be something obvious; apologies in advance for that. I'd appreciate any help. Thanks!
Last time I checked, you cannot submit a form via clicking buttons with BeautifulSoup and Python. There are typically two approaches I often see:
Reverse engineer the form
If the form makes AJAX calls (i.e. makes a request behind the scenes, common for SPAs written in React or Angular), then the best approach is to use the network requests tab in Chrome or another browser to understand what the endpoint and the payload are. Once you have those answers, you can make a POST request with the requests library to that endpoint with data=your_payload_dictionary (i.e. manually do what the form is doing behind the scenes). Read this post for a more elaborate tutorial.
Use a headless browser
If the website is written in something like ASP.NET or a similar MVC framework, then the best approach is to use a headless browser to fill out a form and click submit. A popular framework for this is Selenium. This simulates a normal browser. Read this post for a more elaborate tutorial.
Judging by a cursory look at the page you're working on, I recommend approach #2.
The page you have to scrape is:
http://www.banxico.org.mx/valores/PresentaDetalleSectorizacionGubHist.faces
Add the date to consult and the JSESSIONID from the cookies to the payload, and Referer, User-Agent, and all the good old stuff to the request headers.
Example:
import requests
import pandas as pd
cl = requests.session()
url = "http://www.banxico.org.mx/valores/PresentaDetalleSectorizacionGubHist.faces"
payload = {
    "JSESSIONID": "cWQD8qxoNJy_fecAaN2k8N0IQ6bkQ7f3AtzPx4bWL6wcAmO0T809!-1120047000",
    "fechaAConsultar": "21/03/2019"
}
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded",
    "Referer": "http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces;jsessionid=cWQD8qxoNJy_fecAaN2k8N0IQ6bkQ7f3AtzPx4bWL6wcAmO0T809!-1120047000"
}
response = cl.post(url, data=payload, headers=headers)
tables = pd.read_html(response.text)
When just clicking through the pages it looks like there's some sort of cookie/session stuff going on that might be difficult to take into account when using requests.
(Example: http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces;jsessionid=8AkD5D0IDxiiwQzX6KqkB2WIYRjIQb2TIERO1lbP35ClUgzmBNkc!-1120047000)
It might be easier to code this up using selenium, since that will automate the browser (and take care of all the headers and whatnot). You'll still have access to the HTML to scrape what you need, and you can probably reuse a lot of what you're doing in selenium as well.
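A minimal Selenium sketch for that flow might look like the following (untested; the select IDs come from your soup.find calls, while the submit button's ID "_id0:accion" is inferred from your POST data and may need adjusting after inspecting the page):
from selenium import webdriver
from selenium.webdriver.support.ui import Select

url = "http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces?BMXC_claseIns=GUB&BMXC_lang=es_MX"

browser = webdriver.Chrome()
browser.get(url)

# Pick the two dates from the dropdowns (IDs taken from the question's code).
# On Selenium 4.3+ use browser.find_element(By.ID, ...) instead.
Select(browser.find_element_by_id("_id0:selectOneFechaIni")).select_by_visible_text("07/03/2019")
Select(browser.find_element_by_id("_id0:selectOneFechaFin")).select_by_visible_text("14/03/2019")

# Submit the form; "_id0:accion" is an assumption based on the POST data above.
browser.find_element_by_id("_id0:accion").click()

# The resulting page (with the PDF links) is now in browser.page_source and can
# be parsed with BeautifulSoup exactly as before.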
I am trying to scrape the full text of articles from a New York Times archives search for an NLP task (search here: http://query.nytimes.com/search/sitesearch/). I have legal access to all of the articles and can view them if I search the archives manually.
However, when I use urllib2, mechanize, or requests to pull the HTML from the search results page, they do not pull the relevant part of the page (links to the articles, number of hits) that I need in order to scrape the full articles. I am not getting an error message; the relevant sections, which are clearly visible in Inspect Element, are simply missing from the HTML that is pulled.
Because some of the articles are accessible to subscribers only, it occurred to me that this might be the problem, so I supplied my user credentials through mechanize with the request; however, this makes no difference in the HTML pulled.
There is a NYT API, however it does not give access to the full text of the articles, so it is useless to me for my purposes.
I assume that NYT has intentionally made scraping the page difficult, but I have a legal right to view all of these articles and so would appreciate any help with strategies that may help me get around the hurdles they have put up. I am new to web-scraping and am not sure where to start in figuring out this problem.
I tried pulling the HTML with all of the following, and got the same incomplete results each time:
url = 'http://query.nytimes.com/search/sitesearch/#/India+%22united+states%22/from19810101to20150228/allresults/1/allauthors/relevance/Opinion/'
#trying urllib
import urllib
opener = urllib.FancyURLopener()
print opener.open(url).read()
#trying urllib2
import urllib2
request = urllib2.Request(url)
response = urllib2.urlopen(request)
print response.read()
#trying requests
import requests
print requests.get(url).text
#trying mechanize (impersonating browser)
import mechanize
import cookielib
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
r = br.open(url)
print r.read()
Why don't you use a framework like Scrapy? This will give you a lot of power out of the box. For example, you will be able to retrieve those parts of the page you are interested in and discard the rest. I wrote a little example for dealing with Scrapy and ajax pages here: http://www.6020peaks.com/2014/12/how-to-scrape-hidden-web-data-with-python/
Maybe it can help you to get an idea of how Scrapy works.
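To give a feel for it, here is a minimal spider sketch (not specific to the NYT page; the selectors are placeholders, and since the search results are loaded via AJAX you would point start_urls at the underlying endpoint found in the browser's network tab rather than at the search page itself):
import scrapy

class NytArchiveSpider(scrapy.Spider):
    name = "nyt_archive"
    # Placeholder: replace with the AJAX endpoint the search page actually calls.
    start_urls = ["http://query.nytimes.com/search/sitesearch/"]

    def parse(self, response):
        # Placeholder selector: follow each article link found in the response.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        # Placeholder selector: collect the article body text for the NLP task.
        yield {"url": response.url, "text": " ".join(response.css("p::text").getall())}

# Run with: scrapy runspider this_file.py -o articles.json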
You could try using a tool like kimonolabs.com to scrape the articles. If you're having trouble using authentication, kimono has a built-in framework that allows you to enter and store your credentials, so that might help where you're otherwise limited with the NYT API. I made this NYT API using kimono that you can clone and use if you make a kimono account: https://www.kimonolabs.com/apis/c8i366xe
Here's a help center article for how to make an API behind a login: https://help.kimonolabs.com/hc/en-us/articles/203275010-Fetch-data-from-behind-a-log-in-auth-API-
This article walks you through how to go through links to get the detailed page information: https://help.kimonolabs.com/hc/en-us/articles/203438300-Source-URLs-to-crawl-from-another-kimono-API
I want to take a screenshot of this page: http://books.google.de/books?id=gikDAAAAMBAJ&pg=PA1&img=1&w=2500 or save the image that it outputs.
But I can't find a way. With wget/curl I get an "unavailable" error, and the same with other tools like webkit2png/wkhtmltoimage/wkhtmltopng.
Is there a clean way to do it with Python or from the command line?
Best regards!
You can use ghost.py if you like.
https://github.com/jeanphix/Ghost.py
Here is an example of how to use it.
from ghost import Ghost
ghost = Ghost(wait_timeout=4)
ghost.open('http://www.google.com')
ghost.capture_to('screen_shot.png')
The last line saves the image in your current directory.
Hope this helps
I had difficulty getting Ghost to take a screenshot consistently on a headless CentOS VM. Selenium and PhantomJS worked for me:
from selenium import webdriver
br = webdriver.PhantomJS()
br.get('http://www.stackoverflow.com')
br.save_screenshot('screenshot.png')
br.quit()
Sometimes you need extra HTTP headers, such as User-Agent, to get downloads to work. In Python 2.7, you can:
import urllib2

request = urllib2.Request(
    r'http://books.google.de/books?id=gikDAAAAMBAJ&pg=PA1&img=1&w=2500',
    headers={'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 firefox/2.0.0.11'})
page = urllib2.urlopen(request)
with open('somefile.png', 'wb') as f:
    f.write(page.read())
Or you can look at the params for adding http headers in wget or curl.
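For Python 3, a roughly equivalent sketch uses urllib.request (same idea, just the renamed module):
import urllib.request

# Same request as above, with a browser-like User-Agent header.
request = urllib.request.Request(
    'http://books.google.de/books?id=gikDAAAAMBAJ&pg=PA1&img=1&w=2500',
    headers={'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 firefox/2.0.0.11'})
with urllib.request.urlopen(request) as page, open('somefile.png', 'wb') as f:
    f.write(page.read())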