Hi, I want to download delimited text that is hosted behind an HTML link. (The link is accessible on a private network only, so I can't share it here.)
In R, the following code solves the purpose (all other functions gave an "Unauthorized access" or "401" error):
url = 'https://dw-results.ansms.com/dw-platform/servlet/results? job_id=13802737&encoding=UTF8&mimeType=plain'
download.file(url, "~/insights_dashboard/testing_file.tsv")
a = read.csv("~/insights_dashboard/testing_file.tsv",header = T,stringsAsFactors = F,sep='\t')
I want to do the same thing in Python, for which I used:
(A) urllib and requests.get():
import urllib.request
import requests

url_get = requests.get(url, verify=False)
urllib.request.urlretrieve(url_get, 'C:\\Users\\cssaxena\\Desktop\\24.tsv')
(B) requests.get() and pandas.read_html():
url='https://dw-results.ansms.com/dw-platform/servlet/results? job_id=13802737&encoding=UTF8&mimeType=plain'
s = requests.get(url, verify=False)
a = pd.read_html(io.StringIO(s.decode('utf-8')))
(C) Using wget:
import wget
url = 'https://dw-results.ansms.com/dw-platform/servlet/results? job_id=13802737&encoding=UTF8&mimeType=plain'
wget.download(url,--auth-no-challenge, 'C:\\Users\\cssaxena\\Desktop\\24.tsv')
OR
wget --server-response -owget.log "https://dw-results.ansms.com/dw-platform/servlet/results? job_id=13802737&encoding=UTF8&mimeType=plain"
NOTE: The URL doesn't ask for any credentials; it is accessible in a browser, and I am able to download it in R with download.file. I am looking for a solution in Python.
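For what it's worth, here is a minimal sketch of the requests-based equivalent of R's download.file: stream the response straight to disk and then read it with pandas. The output path is the one from attempt (A), and verify=False mirrors the attempts above; both are assumptions about your setup, not a tested solution.
import requests
import pandas as pd

url = 'https://dw-results.ansms.com/dw-platform/servlet/results?job_id=13802737&encoding=UTF8&mimeType=plain'
out_path = r'C:\Users\cssaxena\Desktop\24.tsv'

# Stream the body to a file instead of passing a Response object to urlretrieve.
with requests.get(url, verify=False, stream=True) as r:
    r.raise_for_status()  # surfaces a 401/403 instead of silently saving an error page
    with open(out_path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)

a = pd.read_csv(out_path, sep='\t')  # same as read.csv(..., sep='\t') in R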
def geturls(path):
    yy = open(path, 'rb').read()
    yy = "".join(str(yy))
    yy = yy.split('<a')
    out = []
    for d in yy:
        z = d.find('href="')
        if z > -1:
            x = d[z + 6:len(d)]
            r = x.find('"')
            x = x[:r]
            x = x.strip(' ./')
            if (len(x) > 2) and (x.find(";") == -1):
                out.append(x.strip(" /"))
    out = set(out)
    return(out)

pg = "./test.html"  # your html
url = geturls(pg)
print(url)
So I'm trying to use Python to scrape data from the following website (with a sample query): https://par.nsf.gov/search/fulltext:NASA%20NOAA%20coral
However, instead of scraping the search results, I realized it would be easier to programmatically click the Save Results as "CSV" link and work with that CSV data instead, since that would free me from having to navigate all the pages of the search results.
I inspected the CSV link element and found that it calls an exportSearch('csv') function.
By typing the function's name into the console, I found that the CSV link just sets window.location.href to: https://par.nsf.gov/export/format:csv/fulltext:NASA%20NOAA%20coral
If I follow that link in the same browser, the save prompt opens with a CSV to save.
My issue starts when I try to replicate this process using Python. If I call the export link directly using the requests library, the response is empty.
url = "https://par.nsf.gov/export/format:csv/fulltext:" + urllib.parse.quote("NASA NOAA coral")
print("URL: ", url)
response = requests.get(url)
print("Response: ", len(response.content))
Can someone show me what I'm missing? I don't know how to first establish search results on the website's server that I can then access for export using Python.
I believe the link to download the CSV is here:
https://par.nsf.gov/export/format:csv//term:your_search_term
where your_search_term is URL-encoded.
in your case, the link is: https://par.nsf.gov/export/format:csv//filter-results:F/term:NASA%20NOAA%20coral
You can use the code below to download the file in Python using the urllib library.
# Include your search term
import urllib.parse
import urllib.request

url = "https://par.nsf.gov/export/format:csv//filter-results:F/term:" + urllib.parse.quote("NASA NOAA coral")
print("URL: ", url)
urllib.request.urlretrieve(url)
# You can also specify the path where you want to download the file, e.g.
urllib.request.urlretrieve(url, '/Users/Downloads/results.csv')  # path/filename is illustrative
# Run this command only if you face an SSL certificate error; it generally occurs for Mac users with Python 3.6.
# Run this in a Jupyter notebook:
!open /Applications/Python\ 3.6/Install\ Certificates.command
# On a terminal, just run:
open /Applications/Python\ 3.6/Install\ Certificates.command
Similarly, you can use the wget package to fetch the file:
import urllib.parse
import wget

url = "https://par.nsf.gov/export/format:csv//filter-results:F/term:" + urllib.parse.quote("NASA NOAA coral")
print("URL: ", url)
wget.download(url)
Turns out I was missing some cookies (e.g. WT_FPC) that don't get set when you do a simple requests GET.
To get around this, I used Selenium's webdriver to do an initial GET request and then used the cookies from that request in the POST request that downloads the CSV data.
import urllib.parse

import requests
from selenium import webdriver

chrome_path = "path to chrome driver"

with requests.Session() as session:
    url = "https://par.nsf.gov/search/fulltext:" + urllib.parse.quote("NASA NOAA coral")

    # GET fetches the website plus the needed cookies
    browser = webdriver.Chrome(executable_path=chrome_path)
    browser.get(url)

    # Session is set with webdriver's cookies
    request_cookies_browser = browser.get_cookies()
    [session.cookies.set(c['name'], c['value']) for c in request_cookies_browser]

    url = "https://par.nsf.gov/export/format:csv/fulltext:" + urllib.parse.quote("NASA NOAA coral")
    response = session.post(url)

    # No longer empty
    print(response.content.decode('utf-8'))
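If you want the CSV on disk rather than printed, here is a minimal follow-up; the output filename is only illustrative:
# Write the exported CSV straight to disk (filename is an example).
with open("nsf_results.csv", "wb") as f:
    f.write(response.content)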
I am trying to access an API, and it is mentioned that it returns HTML. I went through these answers (Get html using Python requests?) but I am not getting my results. I just wanted to make sure I am doing it correctly, because I am getting an error like this: '{"request":{"category_id":"717234","command":"category"},"data":{"error":"invalid or missing api_key'. Is this API not working? Is there any way to get the HTML data and convert it to CSV or Excel?
Here is the code which I am using.
import requests
URL = "https://api.eia.gov/category?api_key=YOUR_API_KEY_HERE&category_id=717234"
r = requests.get(url = URL)
r.text[:100]
You are using an invalid API key; the link to your HTML page is not working:
import requests
URL = "https://api.eia.gov/category?api_key=YOUR_API_KEY_HERE&category_id=717234"
headers = {'Accept-Encoding': 'identity'}
r = requests.get(URL, headers=headers)
print(r.text[:100])
output:
{"request":{"category_id":"717234","command":"category"},"data":{"error":"invalid or missing api_key
I tried changing the URL to the one given in the answer you linked, and I get a result:
import requests
URL = "http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F"
headers = {'Accept-Encoding': 'identity'}
r = requests.get(URL, headers=headers)
print(r.text[:100])
output:
<!DOCTYPE html>
<HTML>
<HEAD><TITLE>Average of Precipitation, Station id: 028815</TITLE></HEAD>
<BO
As a solution, you can use the developer mode of that API: https://www.eia.gov//developer// or check this link to get a key: https://www.eia.gov/opendata/
It's not an error; it seems to me like you're missing your API key. This is what is written at the link you posted:
{"request":{"category_id":"717234","command":"category"},"data":{"error":"invalid or missing api_key. For key registration, documentation, and examples see https:\/\/www.eia.gov\/developer\/"}}
I'm looking for some library or libraries in Python to:
a) log in to a web site,
b) find all links to some media files (let us say having "download" in their URLs), and
c) download each file efficiently directly to the hard drive (without loading the whole media file into RAM).
Thanks
You can use the widely used requests module (more than 35k stars on GitHub) and BeautifulSoup. The former handles session cookies, redirections, encodings, compression and more transparently. The latter finds parts of the HTML code and has an easy-to-remember syntax, e.g. [] for properties of HTML tags.
Below is a complete example, written for Python 3.5.2, for a web site that you can scrape without a JavaScript engine (otherwise you can use Selenium); it sequentially downloads the links that have download in their URLs.
import shutil
import sys
import requests
from bs4 import BeautifulSoup
""" Requirements: beautifulsoup4, requests """
SCHEMA_DOMAIN = 'https://example.com'
URL = SCHEMA_DOMAIN + '/house.php/' # this is the log-in URL
# here are the name property of the input fields in the log-in form.
KEYS = ['login[_csrf_token]',
'login[login]',
'login[password]']
client = requests.session()
request = client.get(URL)
soup = BeautifulSoup(request.text, features="html.parser")
data = {KEYS[0]: soup.find('input', dict(name=KEYS[0]))['value'],
KEYS[1]: 'my_username',
KEYS[2]: 'my_password'}
# The first argument here is the URL of the action property of the log-in form
request = client.post(SCHEMA_DOMAIN + '/house.php/user/login',
data=data,
headers=dict(Referer=URL))
soup = BeautifulSoup(request.text, features="html.parser")
generator = ((tag['href'], tag.string)
             for tag in soup.find_all('a')
             if 'download' in tag['href'])

for url, name in generator:
    with client.get(SCHEMA_DOMAIN + url, stream=True) as request:
        if request.status_code == 200:
            with open(name, 'wb') as output:
                request.raw.decode_content = True
                shutil.copyfileobj(request.raw, output)
        else:
            print('status code was {} for {}'.format(request.status_code, name),
                  file=sys.stderr)
You can use the mechanize module to log into websites like so:
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)
br.open("http://www.example.com")
br.select_form(nr=0) #Pass parameters to uniquely identify login form if needed
br['username'] = '...'
br['password'] = '...'
result = br.submit().read()
Use bs4 to parse this response and find all the hyperlinks in the page like so:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(result, "lxml")
links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
You can use re to further narrow down the links you need from all the links present in the response webpage, which are media links (.mp3, .mp4, .jpg, etc) in your case.
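For example, here is a minimal sketch that keeps only links ending in a few common media extensions (the extension list is just an assumption):
import re

# Keep only links that end in a media extension; extend the pattern as needed.
media_pattern = re.compile(r'\.(mp3|mp4|jpe?g|png)$', re.IGNORECASE)
media_links = [link for link in links if link and media_pattern.search(link)]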
Finally, use requests module to stream the media files so that they don't take up too much memory like so:
response = requests.get(url, stream=True)  # url here is the media URL
handle = open(target_path, "wb")
for chunk in response.iter_content(chunk_size=512):
    if chunk:  # filter out keep-alive new chunks
        handle.write(chunk)
handle.close()
When the stream attribute of get() is set to True, the content does not immediately start downloading to RAM; instead, the response behaves like an iterable, which you can iterate over in chunks of size chunk_size in the loop right after the get() statement. Before moving on to the next chunk, the previous chunk is written to disk, ensuring that the data isn't all held in RAM.
You will have to put this last chunk of code in a loop if you want to download the media from every link in the links list, for example as sketched below.
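Here is a minimal sketch of that loop, reusing media_links from the filtering step above and deriving each target filename from the URL (both assumptions about your setup):
import os
import requests

for media_url in media_links:
    # Assume the last path segment of the URL is a usable filename.
    target_path = os.path.basename(media_url)
    response = requests.get(media_url, stream=True)
    with open(target_path, "wb") as handle:
        for chunk in response.iter_content(chunk_size=512):
            if chunk:  # filter out keep-alive chunks
                handle.write(chunk)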
You will probably end up making some changes to this code to get it working, as I haven't tested it for your use case myself, but hopefully it gives you a blueprint to work from.
I am trying to use the requests library in Python to post the text content of a text file to a website, submit the text for analysis on said website, and pull the results back into Python. I have read through a number of responses here and on other websites, but have not yet figured out how to correctly modify the code for a new website.
I'm familiar with Beautiful Soup, so pulling in webpage content and removing HTML isn't an issue; it's the submitting of the data that I don't understand.
My code currently is:
import requests
fileName = "texttoAnalyze.txt"
fileHandle = open(fileName, 'rU');
url_text = fileHandle.read()
url = "http://www.webpagefx.com/tools/read-able/"
payload = {'value':url_text}
r = requests.post(url, payload)
print r.text
This code comes back with the HTML of the website, but doesn't recognize the fact that I'm trying to submit a form.
Any help is appreciated. Thanks so much.
You need to send the same request the website is sending; you can usually find it with web debugging tools (like the Chrome/Firefox developer tools).
In this case the url the request is being sent to is: http://www.webpagefx.com/tools/read-able/check.php
With the following params: tab=Test+by+Direct+Link&directInput=SOME_RANDOM_TEXT
So your code should look like this:
url = "http://www.webpagefx.com/tools/read-able/check.php"
payload = {'directInput':url_text, 'tab': 'Test by Direct Link'}
r = requests.post(url, data=payload)
print r.text
Good luck!
There are two post parameters, tab and directInput:
import requests
post = "http://www.webpagefx.com/tools/read-able/check.php"
with open("in.txt") as f:
data = {"tab":"Test by Direct Link",
"directInput":f.read()}
r = requests.post(post, data=data)
print(r.content)
This is probably a very simple task, but I cannot find any help. I have a website that takes the form www.xyz.com/somestuff/ID. I have a list of the IDs I need information from. I was hoping to have a simple script that goes to the site and downloads the (complete) web page for each ID, saved as ID_whatever_the_default_save_name_is in a specific folder.
Can I run a simple Python script to do this for me? I can do it by hand; it is only 75 different pages, but I was hoping to use this to learn how to do things like this in the future.
Mechanize is a great package for crawling the web with Python. A simple example for your issue would be:
import mechanize
br = mechanize.Browser()
response = br.open("www.xyz.com/somestuff/ID")
print response
This simply grabs your url and prints the response from the server.
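To cover the actual goal of one saved file per ID, here is a minimal sketch that loops over the IDs and writes each response to a folder; the ID list, URL scheme, and output folder are assumptions:
import mechanize

br = mechanize.Browser()
ids = ["101", "102", "103"]  # hypothetical list of your ~75 IDs

for page_id in ids:
    response = br.open("http://www.xyz.com/somestuff/" + page_id)
    # Save the complete page under the ID's name in a "pages" folder.
    with open("pages/" + page_id + ".html", "wb") as f:
        f.write(response.read())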
This can be done simply in Python using the urllib module. Here is a simple example in Python 3:
import urllib.request
url = 'www.xyz.com/somestuff/ID'
req = urllib.request.Request(url)
page = urllib.request.urlopen(req)
src = page.read()
print(src)
For more info on the urllib module -> http://docs.python.org/3.3/library/urllib.html
Do you want just the HTML code for the website? If so, just create a URL variable with the host site and append the page number as you go. I'll use http://www.notalwaysright.com as an example.
import urllib.request

url = "http://www.notalwaysright.com/page/"
for x in range(1, 71):
    newurl = url + str(x)  # the page number must be converted to a string
    response = urllib.request.urlopen(newurl)
    with open("Page/" + str(x), "wb") as p:  # write the raw bytes returned by read()
        p.write(response.read())