Scrape with Python , commanded by excel vba - python

I already had a previous question, but that was pasted in vba tags etc. So I'll try again with proper tags and title since I gained a bit of knowledge now, hopefully.
The problem:
I need to find ~1000 dates from a database with plant variety data which probably is behind a login so here is a screenshot . Now I could of course fill out this form ~1000 times but there must be a smarter way to do this. If it were an HTML site I would know what to do, and have vba just pull in the results. I have been reading all morning about these javascript pages and ajax libraries but it is above my level. So hopefully someone can help me out a bit. I also used firebug to see what is going on when I press search:
These codes are similar to the last picture posted, make it easier to read. Code left here for copying.
f.cc.facet.limit
-1
f.cc.facet.mincount
1
f.end_date.facet.date.end
2030-01-01T00:00:00Z
f.end_date.facet.date.gap
+5YEARS
f.end_date.facet.date.oth...
all
f.end_date.facet.date.sta...
1945-01-01T00:00:00Z
f.end_type.facet.limit
20
f.end_type.facet.mincount
1
f.grant_start_date.facet....
NOW/YEAR
f.grant_start_date.facet....
+5YEARS
f.grant_start_date.facet....
all
f.grant_start_date.facet....
1900-01-01T00:00:00Z
f.status.facet.limit
20
f.status.facet.mincount
1
f.type.facet.limit
20
f.type.facet.mincount
1
facet
true
facet.date
grant_start_date
facet.date
end_date
facet.field
cc
facet.field
type
facet.field
status
facet.field
end_type
fl
uc,cc,type,latin_name,common_name,common_name_en,common_name_others,app_num,app_date,grant_start_date
,den_info,den_final,id
hl
true
hl.fl
cc,latin_name,den_info,den_final
hl.fragsize
5000
hl.requireFieldMatch
false
json.nl
map
q
cc:IT AND latin_name:(Zea Mays) AND den_info:Antilles
qi
3-9BgbCWwYBd7aIWPU1/onjQ==
rows
25
sort
uc asc,score desc
start
0
type
upov
wt
json
Source
fl=uc%2Ccc%2Ctype%2Clatin_name%2Ccommon_name%2Ccommon_name_en%2Ccommon_name_others%2Capp_num%2Capp_date
%2Cgrant_start_date%2Cden_info%2Cden_final%2Cid&hl=true&hl.fragsize=5000&hl.requireFieldMatch=false&json
.nl=map&wt=json&type=upov&sort=uc%20asc%2Cscore%20desc&rows=25&start=0&qi=3-9BgbCWwYBd7aIWPU1%2FonjQ
%3D%3D&hl.fl=cc%2Clatin_name%2Cden_info%2Cden_final&q=cc%3AIT%20AND%20latin_name%3A(Zea%20Mays)%20AND
%20den_info%3AAntilles&facet=true&f.cc.facet.limit=-1&f.cc.facet.mincount=1&f.type.facet.limit=20&f.type
.facet.mincount=1&f.status.facet.limit=20&f.status.facet.mincount=1&f.end_type.facet.limit=20&f.end_type
.facet.mincount=1&f.grant_start_date.facet.date.start=1900-01-01T00%3A00%3A00Z&f.grant_start_date.facet
.date.end=NOW%2FYEAR&f.grant_start_date.facet.date.gap=%2B5YEARS&f.grant_start_date.facet.date.other
=all&f.end_date.facet.date.start=1945-01-01T00%3A00%3A00Z&f.end_date.facet.date.end=2030-01-01T00%3A00
%3A00Z&f.end_date.facet.date.gap=%2B5YEARS&f.end_date.facet.date.other=all&facet.field=cc&facet.field
=type&facet.field=status&facet.field=end_type&facet.date=grant_start_date&facet.date=end_date
And this is what it looks like in HTML, atleast according to firebug:
{"response":{"start":0,"docs":[{"id":"6751513","grant_start_date":"1999-02-04T22:59:59Z","den_final":"Antilles","app_num":"005642_A 005642","latin_name":"Zea mays L.","common_name_others":["MAIS"],"uc":"ZEAAA_MAY","type":"NLI","app_date":"1997-01-10T22:59:59Z","cc":"IT"}],"numFound":1},"qi":"3-9BgbCWwYBd7aIWPU1/onjQ==","facet_counts":{"facet_queries":{},"facet_ranges":{},"facet_dates":{"end_date":{"after":0,"start":"1945-01-01T00:00:00Z","before":0,"2010-01-01T00:00:00Z":1,"between":1,"end":"2030-01-01T00:00:00Z","gap":"+5YEARS"},"grant_start_date":{"after":0,"1995-01-01T00:00:00Z":1,"start":"1900-01-01T00:00:00Z","before":0,"between":1,"end":"2015-01-01T00:00:00Z","gap":"+5YEARS"}},"facet_intervals":{},"facet_fields":{"status":{"approved":1},"end_type":{"ter":1},"type":{"nli":1},"cc":{"it":1}}},"sv":"bswa1.wipo.int","lastUpdated":1435987857572,"highlighting":{"6751513":{"den_final":["Antilles<\/em>"],"latin_name":["Zea<\/em> mays<\/em> L."],"cc":["IT<\/em>"]}}}
Edit:
It uses the GET method and XMLHttpRequest, as can be seen from this screenshot:
I already found how to make python run from excel vba here in this topic
I also downloaded beautiful soup but python is not my kind of language, so any help would be greatly appreciated.
Image refered to in comment on answer of Will

1) Use Excel to store your search parameters.
2) Run a few manual searches to find out what parameters you need to change on each request.
3) Invoke an http get request to the url that you have found in firebug/Fiddler (the url that it calls when you click "search" manually). See Urllib3 https://urllib3.readthedocs.org/en/latest/
3) Look at Json pickle to help you deal with the json response, saving (serializing) it to a file.
4) Reading and writing data involves IO libraries. Google is your friend. (Possibly easier to save your excel file as a csv and then just read the csv file for your search parameters).
5) Download PyCharm for your python development - it's really good.
Hope this helps.

I finally figured it out. I don't need to use python, I can just use an url, and then import the content into excel. I found out with Fiddler that the URL should become https://www3.wipo.int/pluto/user/jsp/select.jsp? And then the piece of code from the OP goes behind that.
The rest of my solution can be found in another question I had. It uses no Python but only VBA, which commands IE to open a website and copies the content of it.

Related

How can I find the address of this web downloadable .csv file?

If I go to this website:
https://covid.cdc.gov/covid-data-tracker/#ed-visits
and click the "download" button (on the right), a .csv file is downloaded.
I can't find the address of that csv file, so that I could fetch it automatically with pd.read_csv(). I had a snoop around the web inspector thing, but I don't really know what I'm doing, and nothing jumped out at me as being the obvious answer. I've also looked around various other sites to try to find an API that gives me access to this data, bat there doesn't appear to be such thing.
Can anyone help me with that?
Thanks so much!
You might want to open your web inspector and go to the "Network"-Tab and then reload the page. You are going to see, that there's never a csv actually being loaded.
Also the export button doesn't link to any file. Rather it has some javascript binding, that exports the existing data in your client (the browser) as a csv to your filesystem. In other words: There isn't an address for that file. Its being created in your browser.
So even better, you can read the json directly. Just find the correct data in the Network-Tab, I think it might be this: https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=ed_trend_data
So instead you could directly read the json:
pd.read_json('https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=ed_trend_data') and then filter for the data that you need.

Accessing Hovertext with html

I am trying to access hover text found on graph points at this site (bottom):
http://matchhistory.na.leagueoflegends.com/en/#match-details/TRLH1/1002200043?gameHash=b98e62c1bcc887e4&tab=overview
I have the full site html but I am unable to find the values displayed in the hover text. All that can be seen when inspecting a point are x and y values that are transformed versions of these values. The mapping can be determined with manual input taken from the hovertext but this defeats the purpose of looking at the html. Additionally, the mapping changes with each match history so it is not feasible to do this for a large number of games.
Is there any way around this?
thank you
Explanation
Nearly everything on this webpage is loaded via JSON through JavaScript. We don't even have to request the original page. You will, however, have to repiece together the page via id's of items, mysteries and etc., which won't be too hard because you can request masteries similar to how we fetch items.
So, I went through the network tab in inspect and I noticed that it loaded the following JSON formatted URL:
https://acs.leagueoflegends.com/v1/stats/game/TRLH1/1002200043?gameHash=b98e62c1bcc887e4
If you notice, there is a gameHash and the id (similar to that of the link you just sent me). This page contains everything you need to rebuild it, given that you fetch all reliant JSON files.
Dealing with JSON
You can use json.loads in Python to load it, but a great tool I would recomend is:
https://jsonformatter.curiousconcept.com/
You copy and paste JSON in there and it will help you understand the data structure.
Fetching items
The webpage loads all this information via a JSON file:
https://ddragon.leagueoflegends.com/cdn/7.10.1/data/en_US/item.json
It contains all of the information and tool tips about each item in the game. You can access your desired item via: theirJson['data']['1001']. Each image on the page's file name is the id (or 1001) in this example.
For instance, for 'Boots of Speed':
import requests, json
itemJson = json.loads(requests.get('https://ddragon.leagueoflegends.com/cdn/7.10.1/data/en_US/item.json').text)
print(itemJson['data']['1001'])
An alternative: Selenium
Selenium could be used for this. You should look it up. It's been ported for several programming languages, one being Python. It may work as you want it to here, but I sincerely think that the JSON method (describe above), although a little more convoluted, will perform faster (since speed, based on your post, seems to be an important factor).

How to read a HTML page that takes some time to load? [duplicate]

I am trying to scrape a web site using python and beautiful soup. I encountered that in some sites, the image links although seen on the browser is cannot be seen in the source code. However on using Chrome Inspect or Fiddler, we can see the the corresponding codes.
What I see in the source code is:
<div id="cntnt"></div>
But on Chrome Inspect, I can see a whole bunch of HTML\CSS code generated within this div class. Is there a way to load the generated content also within python? I am using the regular urllib in python and I am able to get the source but without the generated part.
I am not a web developer hence I am not able to express the behaviour in better terms. Please feel free to clarify if my question seems vague !
You need JavaScript Engine to parse and run JavaScript code inside the page.
There are a bunch of headless browsers that can help you
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
The Content of the website may be generated after load via javascript, In order to obtain the generated script via python refer to this answer
A regular scraper gets just the HTML document. To get any content generated by JavaScript logic, you rather need a Headless browser that would also generate the DOM, load and run the scripts like a regular browser would. The Wikipedia article and some other pages on the Net have lists of those and their capabilities.
Keep in mind when choosing that some previously major products of those are abandoned now.
TRY THIS FIRST!
Perhaps the data technically could be in the javascript itself and all this javascript engine business is needed. (Some GREAT links here!)
But from experience, my first guess is that the JS is pulling the data in via an ajax request. If you can get your program simulate that, you'll probably get everything you need handed right to you without any tedious parsing/executing/scraping involved!
It will take a little detective work though. I suggest turning on your network traffic logger (such as "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention attention on any/all XmlHTTPRequests. The data you need should be found somewhere in one of these responses, probably in the middle of some JSON text.
Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)

Parsing from a website -- source code does not contain the info I need

I'm a little new to web crawlers and such, though I've been programming for a year already. So please bear with me as I try to explain my problem here.
I'm parsing info from Yahoo! News, and I've managed to get most of what I want, but there's a little portion that has stumped me.
For example: http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html
I want to get the numbers beside the thumbs up and thumbs down icons in the comments. When I use "Inspect Element" in my Chrome browser, I can clearly see the things that I have to look for - namely, an em tag under the div class 'ugccmt-rate'. However, I'm not able to find this in my python program. In trying to track down the root of the problem, I clicked to view source of the page, and it seems that this tag is not there. Do you guys know how I should approach this problem? Does this have something to do with the javascript on the page that displays the info only after it runs? I'd appreciate some pointers in the right direction.
Thanks.
The page is being generated via JavaScript.
Check if there is a mobile version of the website first. If not, check for any APIs or RSS/Atom feeds. If there's nothing else, you'll either have to manually figure out what the JavaScript is loading and from where, or use Selenium to automate a browser that renders the JavaScript for you for parsing.
Using the Web Console in Firefox you can pretty easily see what requests the page is actually making as it runs its scripts, and figure out what URI returns the data you want. Then you can request that URI directly in your Python script and tease the data out of it. It is probably in a format that Python already has a library to parse, such as JSON.
Yahoo! may have some stuff on their server side to try to prevent you from accessing these data files in a script, such as checking the browser (user-agent header), cookies, or referrer. These can all be faked with enough perseverance, but you should take their existence as a sign that you should tread lightly. (They may also limit the number of requests you can make in a given time period, which is impossible to get around.)

transferring real time data from a website in python

I am programming in Python.
I would like to extract real time data from a webpage without refreshing it:
http://www.fxstreet.com/rates-charts/currency-rates/
I think the real time data webpage is written in AJAX but I am not quite sure..
I thought about opening an internet browser with the program but I do not really know/like this way... Is there an other way to do it?
I would like to fill a dictionnary in my program (or even a SQL database) with the latest numbers each second.
please help me in python, thanks!
To get the data, you'll need to look through the javascript and HTML source to find what URL it's hitting to get the data it's displaying. Then, you can call that URL with urllib or your favorite python library and parse it
Also, it may be easier if you use a plugin like Firebug that lets you watch the AJAX requests.

Categories