I stumbled across the following site and wanted to download the digital elevation model data for the waterways:
https://www.govdata.de/web/guest/daten/-/details/1c669080-c804-11e4-8731-1681e6b88ec1bkg
My problem is that I do not understand how to download the data.
Does anybody know how I could download it, e.g. using R or Python?
You will need to be on the webpage where the data is stored, not the webpage with the links to the data. Depending on the format of the data, you will need to change sep='\t' to fit your needs; for example, a CSV would be sep=','.
You will then need to fine-tune the formatting.
library(RCurl)

# Fetch the raw page content
urlcontent <- getURL('https://www.govdata.de/web/guest/daten/-/details/1c669080-c804-11e4-8731-1681e6b88ec1bkg')

# Parse the content as tab-separated values
DATA <- read.table(textConnection(urlcontent), header = TRUE, sep = '\t')
Note that read.table may only work on a TSV-style page; you will need to fine-tune the reading of the page based on its formatting.
EDIT:
Using the link address for the URL, I was able to fetch the URL successfully, but I then hit an access error: I do not have permission to download the data. This may be another error in the code, or an actual credential problem on the website's side.
library(RCurl)

# Fetch the dataset record from the CKAN API endpoint
urlcontent <- getURL('https://www.govdata.de/ckan/api/rest/dataset/1c669080-c804-11e4-8731-1681e6b88ec1bkg')

DATA <- read.table(textConnection(urlcontent), header = TRUE, sep = '\t')
Error: You don't have permission to access this server
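For what it's worth, the CKAN REST endpoint returns JSON metadata about the dataset rather than the data itself, so read.table is not the right parser for it; the actual download links sit in the dataset's resources list. A minimal Python sketch of that route (the User-Agent header is only a guess at the cause of the 403; the server may still refuse non-browser clients):
import requests

# The CKAN REST API returns JSON metadata, not the data file itself;
# the download links are listed under 'resources'.
url = ('https://www.govdata.de/ckan/api/rest/dataset/'
       '1c669080-c804-11e4-8731-1681e6b88ec1bkg')

# Assumption: the 403 above comes from the default client being rejected,
# so send a browser-like User-Agent.
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
resp.raise_for_status()

for resource in resp.json().get('resources', []):
    print(resource.get('format'), resource.get('url'))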
I had a number of amino acid sequence strings that I wanted to use as input to a tool that studies their interactions with certain components of the human immune system (http://www.cbs.dtu.dk/services/NetMHCcons/).
I wanted to ask what, if any, would be a way of accessing the site, inputting data, and getting the output via a script (R or Python preferably). My main issue was that I had a lot of sequences that needed to be queried separately, so I wanted to automate the whole thing. The website has one field labelled "Submission" which takes the string input, another field "select species/loci" with a drop-down menu from which an option needs to be selected, and lastly a "submit" button. The output simply loads on the page after hitting submit.
I've tentatively poked around with RSelenium and RCurl but wanted to ask if there is a more efficient method.
I took a look at what it would take to send a POST request to this service from Python, and it looks possible (a sketch follows below):
The form takes "multipart/form-data" (see: How to send a "multipart/form-data" with requests in python?), so you'll need to send your data in this format. You could inspect a request from the browser (using the dev tools) and copy the fields from there as a starting point.
Once the form is submitted, it doesn't give you the result right away. You need to extract your job ID from the response, and then poll the URL http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi?jobid={your_job_id}&wait=20 until it gives you the result.
The result then needs to be downloaded and parsed.
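A rough sketch of that flow under the assumptions above. The endpoint and the field names ('SEQPASTE', 'allele') are placeholders; copy the real ones from a request captured in your browser's dev tools:
import re
import time
import requests

SUBMIT_URL = 'http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi'  # assumed endpoint
POLL_URL = 'http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi?jobid={}&wait=20'

# Passing (None, value) tuples via `files` makes requests build a
# multipart/form-data body; the field names here are hypothetical.
form = {
    'SEQPASTE': (None, '>seq1\nMKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ'),
    'allele': (None, 'HLA-A02:01'),
}
resp = requests.post(SUBMIT_URL, files=form)

# Extract the job ID from the response page
job_id = re.search(r'jobid=([A-Za-z0-9]+)', resp.text).group(1)

# Poll until the job page no longer reports the job as queued or running
while True:
    result = requests.get(POLL_URL.format(job_id))
    if 'queued' not in result.text and 'running' not in result.text:
        break
    time.sleep(20)

print(result.text)  # raw result page, still needs parsing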
However, this tool is also available as a portable version for Linux/Mac: https://services.healthtech.dtu.dk/software.php
Perhaps downloading that version would make it easier?
Try this:
Submitting to a web form using python
That link is an answer explaining how to send web forms in Python using urllib. Check the page source, extract the necessary form fields with the re module from the source code of the link you put up, and send the request.
Save the HTML source code of http://www.cbs.dtu.dk/services/NetMHCcons/ in the Python file as
source_code = '''...'''
The HTML source can be viewed with Ctrl+U in Firefox.
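As an illustration, a minimal sketch of the extraction step (the regular expressions are illustrative; adjust them to the actual markup):
import re

# Page source saved via Ctrl+U and pasted in (elided here)
source_code = '''...'''

# List the names of the form's input and select fields as a starting point
input_names = re.findall(r'<input[^>]*name="([^"]+)"', source_code)
select_names = re.findall(r'<select[^>]*name="([^"]+)"', source_code)
print(input_names, select_names)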
I wrote a small Python program that works with data from a CSV file. I am tracking some numbers in a Google Sheet, and I created the CSV file by downloading the sheet. I am trying to find a way to have Python read the CSV directly from Google Sheets, so that I do not have to download a new CSV each time I update the spreadsheet.
I see that the requests library may be able to handle this, but I'm having a hard time figuring it out. I've chosen not to try the Google APIs because this way seems simpler, as long as I don't mind making the sheet public to those with the link, which is fine.
I've tried working with the requests documentation, but I'm a novice programmer and I can't get it to read in as a CSV.
This is how the data is currently read into Python:
import csv

# Current approach: read from a locally downloaded copy
file = open('data1.csv', newline='')
reader = csv.reader(file)
Ideally, I would like the file = open() call to be replaced by the requests library, pulling directly from the spreadsheet.
You need to find the correct URL request that downloads the file.
Sample URL:
csv_url='https://docs.google.com/spreadsheets/d/169AMdEzYzH7NDY20RCcyf-JpxPSUaO0nC5JRUb8wwvc/export?format=csv&id=169AMdEzYzH7NDY20RCcyf-JpxPSUaO0nC5JRUb8wwvc&gid=0'
The way to find it is to manually download your file while inspecting the request URLs in the Network tab of your browser's developer tools.
Then the following is enough:
import requests as rs

csv_url = 'YOUR_CSV_DOWNLOAD_URL'  # the export URL found in the Network tab
res = rs.get(url=csv_url)

# Write the downloaded bytes to a local file
open('google.csv', 'wb').write(res.content)
This saves the CSV file under the name 'google.csv' in the folder of your Python script.
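If you want to skip the intermediate file and feed the download straight into csv.reader (the sheet ID below is a placeholder), something like this should work:
import csv
import io
import requests

csv_url = ('https://docs.google.com/spreadsheets/d/YOUR_SHEET_ID'
           '/export?format=csv&gid=0')
resp = requests.get(csv_url)
resp.raise_for_status()

# Wrap the downloaded text in a file-like object for csv.reader
reader = csv.reader(io.StringIO(resp.text))
for row in reader:
    print(row)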
import pandas as pd
import requests

YOUR_SHEET_ID = ''  # fill in your sheet's ID

# The output=csv parameter makes the sheet come back as CSV
r = requests.get(f'https://docs.google.com/spreadsheet/ccc?key={YOUR_SHEET_ID}&output=csv')
open('dataset.csv', 'wb').write(r.content)

df = pd.read_csv('dataset.csv')
df.head()
I tried @Adirmola's solution but had to tweak it a little.
When he wrote "You need to find the correct URL request that downloads the file", he had a point. An easy solution is what I'm showing here: adding "&output=csv" after your Google Sheet ID.
Hope it helps!
I'm not exactly sure about your usage scenario, and Adirmola has already provided a very exact answer to your question, but my immediate question is why you want to download the CSV in the first place.
Google Sheets has a Python library, so you can get the data from the sheet directly.
You may also be interested in this answer, since you're interested in watching for changes in Google Sheets.
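As an illustration, a minimal sketch of that library route, assuming the third-party gspread package and a service-account JSON credential with access to the sheet:
import gspread

# Authenticate with a Google service-account key file
gc = gspread.service_account(filename='service_account.json')

sh = gc.open_by_key('YOUR_SHEET_ID')  # placeholder sheet ID
rows = sh.sheet1.get_all_values()     # list of rows, like csv.reader output
print(rows[:5])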
I would just like to say that using OAuth keys and the Google Python API is not always an option. I found the URL-based approach above quite useful for my current application.
My question is more on the "concept" side, as I don't have any code to show yet. I have access to an API Explorer for a website, but the information retrieved when I put a specific URL into the API Explorer is not the same as the HTML I get if I open a webpage with the same URL and "inspect" the elements. I'm honestly lost on how to retrieve the data I need, as it is only present in the API Explorer and doesn't seem accessible via web scraping.
Here is an example to show you what I mean:
API Explorer link: https://platform.worldcat.org/api-explorer/apis/worldcatidentities/identity/Read
The specific URL to request is: http://www.worldcat.org/identities/lccn-n80126307/
If I open that URL (http://www.worldcat.org/identities/lccn-n80126307/) myself and "inspect element", the page does not contain all the data shown in the API Explorer. For example, the language count, audLevel, oclcnum and many others are missing from the HTML version but present in the API Explorer, and for other authors the genre counts only exist in the API Explorer.
I realize that one is XML and the other HTML, so is that why the data differs between the two versions? And whatever the reason, what can I do to retrieve the data present only in the API Explorer (such as genre counts, audLevel, oclcnum, etc.)?
Any insight would be really helpful.
It's not unusual for sites not to show all the data that's in the underlying JSON/XML. Those payloads often hold interesting content that isn't displayed anywhere on the site.
In this case the server gives you what you ask for. If you're going for the data using Python, all you really have to do is specify in your header what you're after. If you don't do that on this site, you get the HTML version.
If you do it like this, you'll get the XML data you're interested in:
import requests
import xml.dom.minidom

url = 'https://www.worldcat.org/identities/lccn-n80126307/'
# Asking for a non-HTML representation makes the server return the raw data
r = requests.get(url, headers={'Accept': 'application/json'})

# A couple of lines for pretty-printing the XML
# (don't name the variable `xml`, or it shadows the module)
dom = xml.dom.minidom.parseString(r.text)
pretty_xml_as_string = dom.toprettyxml()
print(pretty_xml_as_string)
Then all you have to do is extract the content you're after; that can be done in many ways, and one generic starting point is sketched below. Let me know if this helps you.
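Without assuming anything about the document's schema, you can walk the tree and print every element that carries text, to see which fields (oclcnum, audLevel, ...) show up:
import xml.etree.ElementTree as ET

root = ET.fromstring(r.text)  # `r` from the request above

# Print each tag together with the first bit of its text content
for elem in root.iter():
    if elem.text and elem.text.strip():
        print(elem.tag, '->', elem.text.strip()[:60])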
I am trying to download a specific sheet from a spreadsheet (on Google Drive) but am unable to find a method to do so. I am using the Python client API library (v3) and passing file_id and mimeType to export_media() as shown below:
request = service.files().export_media(fileId=file_id, mimeType='text/csv')
media_request = http.MediaIoBaseDownload(local_fd, request)
This code always exports the sheet in first position. Can you please describe a method to download a specific sheet (or sheets) by providing its gid or some other parameter?
I don't think the Drive API has a feature to specify a sheet name.
Two workarounds spring to mind...
You could use the Sheets API (https://developers.google.com/sheets/api/reference/rest/) and write your own CSV formatter (see the sketch after these workarounds). It sounds more complex than it is; it's probably 10 lines of code, especially if you go for tab-separated instead of comma-separated output.
Use the Google Spreadsheet File/Publish to the Web feature to publish a CSV of any given sheet. Note that the content will be public, so anybody with the link (which is pretty obtuse) would be able to read the data.
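A minimal sketch of the first workaround, assuming you already have a credential object (creds) from your auth flow; the spreadsheet ID and tab name are placeholders:
import csv

from googleapiclient.discovery import build

sheets = build('sheets', 'v4', credentials=creds)

# Using the tab name as the range selects that specific sheet
result = sheets.spreadsheets().values().get(
    spreadsheetId='YOUR_SPREADSHEET_ID',
    range='MySheet',
).execute()

with open('mysheet.csv', 'w', newline='') as f:
    csv.writer(f).writerows(result.get('values', []))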
You can use an old visualization API URL (see other answer)
f'https://docs.google.com/spreadsheets/d/{doc_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}'
To make this request using the Google API Python library, you can use the credentials you already have and create an HTTP client instance yourself:
http_client = googleapiclient.discovery._auth.authorized_http(creds)
response, content = http_client.request(url)
Check response.status before you proceed.
Note that this API behaves a bit differently from the regular CSV export. Specifically, there are some things I saw it do with headers: it makes them disappear if they are not set to Plain Text on a numeric column (see here), and it merges multiple text rows appearing at the top of your sheet into a single header row.
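Putting the pieces together, a rough end-to-end sketch (doc_id and sheet_name are placeholders, and creds is whatever credential object your existing auth flow produced):
import googleapiclient.discovery

doc_id = 'YOUR_DOC_ID'
sheet_name = 'Sheet2'  # the tab you want, by name
url = (f'https://docs.google.com/spreadsheets/d/{doc_id}'
       f'/gviz/tq?tqx=out:csv&sheet={sheet_name}')

http_client = googleapiclient.discovery._auth.authorized_http(creds)
response, content = http_client.request(url)

if response.status == 200:
    with open(f'{sheet_name}.csv', 'wb') as f:
        f.write(content)  # `content` is the CSV payload as bytes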
I already asked this in a previous question, but that one was posted with VBA tags etc. So I'll try again with proper tags and title, since I've hopefully gained a bit of knowledge now.
The problem:
I need to find ~1000 dates in a database with plant variety data, which is probably behind a login, so here is a screenshot. Now I could of course fill out this form ~1000 times, but there must be a smarter way to do this. If it were a plain HTML site I would know what to do and have VBA pull in the results. I have been reading all morning about these JavaScript pages and AJAX libraries, but it is above my level, so hopefully someone can help me out a bit. I also used Firebug to see what is going on when I press search:
These parameters are the same as in the last picture posted, just easier to read. They are left here for copying.
f.cc.facet.limit: -1
f.cc.facet.mincount: 1
f.end_date.facet.date.end: 2030-01-01T00:00:00Z
f.end_date.facet.date.gap: +5YEARS
f.end_date.facet.date.other: all
f.end_date.facet.date.start: 1945-01-01T00:00:00Z
f.end_type.facet.limit: 20
f.end_type.facet.mincount: 1
f.grant_start_date.facet.date.end: NOW/YEAR
f.grant_start_date.facet.date.gap: +5YEARS
f.grant_start_date.facet.date.other: all
f.grant_start_date.facet.date.start: 1900-01-01T00:00:00Z
f.status.facet.limit: 20
f.status.facet.mincount: 1
f.type.facet.limit: 20
f.type.facet.mincount: 1
facet: true
facet.date: grant_start_date
facet.date: end_date
facet.field: cc
facet.field: type
facet.field: status
facet.field: end_type
fl: uc,cc,type,latin_name,common_name,common_name_en,common_name_others,app_num,app_date,grant_start_date,den_info,den_final,id
hl: true
hl.fl: cc,latin_name,den_info,den_final
hl.fragsize: 5000
hl.requireFieldMatch: false
json.nl: map
q: cc:IT AND latin_name:(Zea Mays) AND den_info:Antilles
qi: 3-9BgbCWwYBd7aIWPU1/onjQ==
rows: 25
sort: uc asc,score desc
start: 0
type: upov
wt: json
Source
fl=uc%2Ccc%2Ctype%2Clatin_name%2Ccommon_name%2Ccommon_name_en%2Ccommon_name_others%2Capp_num%2Capp_date%2Cgrant_start_date%2Cden_info%2Cden_final%2Cid&hl=true&hl.fragsize=5000&hl.requireFieldMatch=false&json.nl=map&wt=json&type=upov&sort=uc%20asc%2Cscore%20desc&rows=25&start=0&qi=3-9BgbCWwYBd7aIWPU1%2FonjQ%3D%3D&hl.fl=cc%2Clatin_name%2Cden_info%2Cden_final&q=cc%3AIT%20AND%20latin_name%3A(Zea%20Mays)%20AND%20den_info%3AAntilles&facet=true&f.cc.facet.limit=-1&f.cc.facet.mincount=1&f.type.facet.limit=20&f.type.facet.mincount=1&f.status.facet.limit=20&f.status.facet.mincount=1&f.end_type.facet.limit=20&f.end_type.facet.mincount=1&f.grant_start_date.facet.date.start=1900-01-01T00%3A00%3A00Z&f.grant_start_date.facet.date.end=NOW%2FYEAR&f.grant_start_date.facet.date.gap=%2B5YEARS&f.grant_start_date.facet.date.other=all&f.end_date.facet.date.start=1945-01-01T00%3A00%3A00Z&f.end_date.facet.date.end=2030-01-01T00%3A00%3A00Z&f.end_date.facet.date.gap=%2B5YEARS&f.end_date.facet.date.other=all&facet.field=cc&facet.field=type&facet.field=status&facet.field=end_type&facet.date=grant_start_date&facet.date=end_date
And this is what the response looks like, at least according to Firebug:
{"response":{"start":0,"docs":[{"id":"6751513","grant_start_date":"1999-02-04T22:59:59Z","den_final":"Antilles","app_num":"005642_A 005642","latin_name":"Zea mays L.","common_name_others":["MAIS"],"uc":"ZEAAA_MAY","type":"NLI","app_date":"1997-01-10T22:59:59Z","cc":"IT"}],"numFound":1},"qi":"3-9BgbCWwYBd7aIWPU1/onjQ==","facet_counts":{"facet_queries":{},"facet_ranges":{},"facet_dates":{"end_date":{"after":0,"start":"1945-01-01T00:00:00Z","before":0,"2010-01-01T00:00:00Z":1,"between":1,"end":"2030-01-01T00:00:00Z","gap":"+5YEARS"},"grant_start_date":{"after":0,"1995-01-01T00:00:00Z":1,"start":"1900-01-01T00:00:00Z","before":0,"between":1,"end":"2015-01-01T00:00:00Z","gap":"+5YEARS"}},"facet_intervals":{},"facet_fields":{"status":{"approved":1},"end_type":{"ter":1},"type":{"nli":1},"cc":{"it":1}}},"sv":"bswa1.wipo.int","lastUpdated":1435987857572,"highlighting":{"6751513":{"den_final":["Antilles<\/em>"],"latin_name":["Zea<\/em> mays<\/em> L."],"cc":["IT<\/em>"]}}}
Edit:
It uses the GET method and XMLHttpRequest, as can be seen from this screenshot:
I already found out how to run Python from Excel VBA here in this topic.
I also downloaded Beautiful Soup, but Python is not my kind of language, so any help would be greatly appreciated.
Image referred to in a comment on Will's answer
1) Use Excel to store your search parameters.
2) Run a few manual searches to find out which parameters you need to change on each request.
3) Invoke an HTTP GET request to the URL you found in Firebug/Fiddler (the URL the page calls when you click "search" manually). See urllib3: https://urllib3.readthedocs.org/en/latest/
4) Look at jsonpickle to help you deal with the JSON response, saving (serializing) it to a file.
5) Reading and writing data involves IO libraries. Google is your friend. (It is possibly easier to save your Excel file as a CSV and then just read the CSV file for your search parameters.)
6) Download PyCharm for your Python development - it's really good.
Hope this helps; a sketch of steps 3 and 4 follows below.
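A rough sketch of steps 3 and 4, using requests instead of urllib3 for brevity. The endpoint is the one the OP later found with Fiddler; the parameters are trimmed to the essentials from the capture above, and the session token (qi) or other parameters from the capture may also be required:
import json
import requests

SEARCH_URL = 'https://www3.wipo.int/pluto/user/jsp/select.jsp'

params = {
    'wt': 'json',
    'type': 'upov',
    'rows': 25,
    'start': 0,
    'sort': 'uc asc,score desc',
    'q': 'cc:IT AND latin_name:(Zea Mays) AND den_info:Antilles',
}

resp = requests.get(SEARCH_URL, params=params)
data = resp.json()

# Pull the dates out of each matching document
for doc in data['response']['docs']:
    print(doc.get('app_date'), doc.get('grant_start_date'))

# Serialize the raw response to a file for later inspection
with open('results.json', 'w') as f:
    json.dump(data, f, indent=2)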
I finally figured it out. I don't need to use Python; I can just use a URL and then import the content into Excel. I found out with Fiddler that the URL should be https://www3.wipo.int/pluto/user/jsp/select.jsp? with the query string from my question appended after it.
The rest of my solution can be found in another question of mine. It uses no Python, only VBA, which commands IE to open a website and copies its content.