So I had a number of amino acid sequence strings that I wanted to use as input into a tool that studies its interactions with certain components of the human immune system (http://www.cbs.dtu.dk/services/NetMHCcons/).
I wanted to ask what, if any, would be a way of accessing, inputting data and getting the output, via a script (R or python preferably). My main issue was I had a lot of sequences that need to be queried separately so wanted to automate the whole thing. The website has one field that reads "Submission" which takes in the string input. There is another field "select species/loci" which gives a drop down menu from which an option needs to be selected. Lastly there's a "submit" button. The output simply loads on the page after hitting submit.
I've tentatively poked around with RSelenium and Rcurl but wanted to ask if there was a more efficient method.
I took a look at what it'd take to send a POST request to this service from Python, and it looks possible:
this form takes in "multipart/form-data" (see: How to send a "multipart/form-data" with requests in python?), you'll need to send your data in this format. You could inspect a request from the browser (using the dev tools) and copy the fields from there as a starting point.
once the form is submitted, it doesn't give you the result right away. You'd need to get your job ID from the response, and then poll the URL: http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi?jobid={your_job_id}&wait=20 until it gives you the result
the result will then need to be downloaded and parsed
This tool is however available as a portable version for linux/mac: https://services.healthtech.dtu.dk/software.php
Perhaps downloading this version would make it easier?
Try this :
Submitting to a web form using python
This link is an answer to how to send web forms in python, using urllib. Check your source code and extract the necessary data using re module from the source code of the link you have put up, and send the request.
save the HTML source code of http://www.cbs.dtu.dk/services/NetMHCcons/ in the python file as
source_code = '''...'''
The HTML can be found by using CTRL+U in firefox.
Related
I work with python and data mine some content which I categorize into different categories.
Then I go to a specific webpage and submit manually the results.
Is there a way to automate the process? I guess this is a "form-submit" thread but I haven't seen any relevant module in Python. Can you suggest me something?
Selenium Webdriver is the most popular way to drive web pages, but Python also has beautifulsoup; Either library will work.
If you want make this automatic yo have to see which params are send in the form and make a request with this params to the endpoint but directly from your python app, or search a package that simulate a browser and fill the form, but I think that the correct way is making the request directly from your app
I'm a little new to web crawlers and such, though I've been programming for a year already. So please bear with me as I try to explain my problem here.
I'm parsing info from Yahoo! News, and I've managed to get most of what I want, but there's a little portion that has stumped me.
For example: http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html
I want to get the numbers beside the thumbs up and thumbs down icons in the comments. When I use "Inspect Element" in my Chrome browser, I can clearly see the things that I have to look for - namely, an em tag under the div class 'ugccmt-rate'. However, I'm not able to find this in my python program. In trying to track down the root of the problem, I clicked to view source of the page, and it seems that this tag is not there. Do you guys know how I should approach this problem? Does this have something to do with the javascript on the page that displays the info only after it runs? I'd appreciate some pointers in the right direction.
Thanks.
The page is being generated via JavaScript.
Check if there is a mobile version of the website first. If not, check for any APIs or RSS/Atom feeds. If there's nothing else, you'll either have to manually figure out what the JavaScript is loading and from where, or use Selenium to automate a browser that renders the JavaScript for you for parsing.
Using the Web Console in Firefox you can pretty easily see what requests the page is actually making as it runs its scripts, and figure out what URI returns the data you want. Then you can request that URI directly in your Python script and tease the data out of it. It is probably in a format that Python already has a library to parse, such as JSON.
Yahoo! may have some stuff on their server side to try to prevent you from accessing these data files in a script, such as checking the browser (user-agent header), cookies, or referrer. These can all be faked with enough perseverance, but you should take their existence as a sign that you should tread lightly. (They may also limit the number of requests you can make in a given time period, which is impossible to get around.)
I am trying to retrieve query results on sites based on ajax like www.snapbird.org using Python. Since it doesn't show in the page source, I am not sure how to proceed.
I am a Python newbie and hence it would be great if I could get a pointer in the right direction.
I am also open to some other approach to the task if that is easier
This is going to be complex but as a start, ppen firebug and find the URL that gets called when the AJAX request is handled. You can call that directly in your Python program and parse the output.
You could use Selenium's Python client driver to parse the page source. I usually use this in conjunction with PyQuery to make web scraping easier.
Here's the basic tutorial for Selenium's Python driver. Be sure to follow the instructions for Selenium version 2 instead of version 1 (unless you're using version 1 for some reason).
You could also configure chrome/firefox to an HTTP proxy and then log/extract the necessary content with the proxy. I've tinkered with python proxies to save/log the requests/content based on content-type or URI globs.
For other projects I've written site-specific javascript bookmarklets which poll for new data and then POST it to my server (by dynamically creating both a form and iframe, and setting myform.target=myiframe;
Other javascript scripts/bookmarklets simulate a user interacting with sites, so instead of polling every few seconds the javascript automates clicking buttons and form submissions, etc. These scripts are always very site-specific of course but they've been hugely useful for me, especially when iterating over all the paginated results for a given search.
Here is a stripped down version of walking over a list of "paginated" results and preparing to send the data off to my server (which then further parses it with BeautifulSoup). In particular this was designed for Youtube's Sent/Inbox messages.
var tables = [];
function process_and_repeat(){
if(!(inbox && inbox.message_pane_ && inbox.message_pane_.innerHTML)){
alert("We've got no data!");
}
if(inbox.message_pane_.innerHTML.indexOf('<table') === 0)
{
tables.push(inbox.message_pane_.innerHTML);
inbox.next_page();
setTimeout("process_and_repeat()",3000);
}
else{
alert("Fininshed, [" + tables.length + " processed]");
document.write('<form action=http://curl.sente.cc method=POST><textarea name=sent.html>'+escape(tables.join('\n'))+'</textarea><input type=submit></form>')
}
}
process_and_repeat(); // now we wait and watch as all the paginated pages are viewed :)
This is a stripped down example without any fancy iframes/non-essentials which just add complexity.
Adding to what Liam said, Selenium is a great tool, too, which has aided in my various scraping needs. I'd be more than happy to help you out with this if you'd like.
One easy solution might be using a browser like Mechanize. So you can browse site, follow links, make searches and nearly everything that you can do with a browser with user interface.
But for a very sepcific job, you may not even need a such library, you can use urllib and urllib2 python libraries to make a connection and read response... You can use Firebug to see data structure of a search and response body. Then use urllib to make a request with relevant parameters...
With an example...
I made a search with joyvalencia and check the request url with firebug to see:
http://api.twitter.com/1/statuses/user_timeline.json?screen_name=joyvalencia&count=100&page=2&include_rts=true&callback=twitterlib1321017083330
So calling this url with urllib2.urlopen() will be the same with making the query on Snapbird. Response body is:
twitterlib1321017083330([{"id_str":"131548107799396357","place":null,"geo":null,"in_reply_to_user_id_str":null,"coordinates":.......
When you use urlopen() and read the response, the upper string is what you get... Then you can use json library of python to read the data and parse it to a pythonic data structure...
I am programming in Python.
I would like to extract real time data from a webpage without refreshing it:
http://www.fxstreet.com/rates-charts/currency-rates/
I think the real time data webpage is written in AJAX but I am not quite sure..
I thought about opening an internet browser with the program but I do not really know/like this way... Is there an other way to do it?
I would like to fill a dictionnary in my program (or even a SQL database) with the latest numbers each second.
please help me in python, thanks!
To get the data, you'll need to look through the javascript and HTML source to find what URL it's hitting to get the data it's displaying. Then, you can call that URL with urllib or your favorite python library and parse it
Also, it may be easier if you use a plugin like Firebug that lets you watch the AJAX requests.
I'm trying to read in info that is constantly changing from a website.
For example, say I wanted to read in the artist name that is playing on an online radio site.
I can grab the current artist's name but when the song changes, the HTML updates itself and I've already opened the file via:
f = urllib.urlopen("SITE")
So I can't see the updated artist name for the new song.
Can I keep closing and opening the URL in a while(1) loop to get the updated HTML code or is there a better way to do this? Thanks!
You'll have to periodically re-download the website. Don't do it constantly because that will be too hard on the server.
This is because HTTP, by nature, is not a streaming protocol. Once you connect to the server, it expects you to throw an HTTP request at it, then it will throw an HTTP response back at you containing the page. If your initial request is keep-alive (default as of HTTP/1.1,) you can throw the same request again and get the page up to date.
What I'd recommend? Depending on your needs, get the page every n seconds, get the data you need. If the site provides an API, you can possibly capitalize on that. Also, if it's your own site, you might be able to implement comet-style Ajax over HTTP and get a true stream.
Also note if it's someone else's page, it's possible the site uses Ajax via Javascript to make it up to date; this means there's other requests causing the update and you may need to dissect the website to figure out what requests you need to make to get the data.
If you use urllib2 you can read the headers when you make the request. If the server sends back a "304 Not Modified" in the headers then the content hasn't changed.
Yes, this is correct approach. To get changes in web, you have to send new query each time. Live AJAX sites do exactly same internally.
Some sites provide additional API, including long polling. Look for documentation on the site or ask their developers whether there is some.