Trying to get links of an interactive map (Web scraping .swf) - python

i need to create a web scraper for this website
However I need to get the links for the counties, stored in the interactive map
Unfortunately, for some reason, their search engine doesn't provide all the results as the interactive map does.
My question:
Could anyone tell me how to get all the links for all the counties, without manually accessing them?
Thanks

Technically you can use a decompiler to do this job.
There are free (e.g.: ActionScript Extractor) and paid (e.g.: Sothink
SWF Decompiler) tools out there.
you can reference this answer
Edit :
Most swf content gets external records from either a .xml or .json file.
Without decompiling and just using the browser's Developer Tools we can see that an xml file is indeed accessed (maybe it contains what you want) :
http://www.allpetservices.co.uk/uk_ir_locator.xml.
Put view-source: in front of the link to read it (if there's an error message).
In that xml you want to extract the contents (the xyz) of each & every <link> xyz </link> tag. This will give you the links of every entry on the map.

The short answer to your question: There's no way to get the links from the site.
The solution: The structure of the links you are trying to retrieve are very predictable. They follow the same structure:
http://www.allpetservices.co.uk/search_map.asp?ccounty={COUNTY_NAME}
So, if you can use another site or data source to get the names of each of the counties, you can formulate each of the links that you need.

Related

How to scrape a website and all its directories from the one link?

Sorry if this is not a valid question, i personally feel it kind of boarders on the edge.
Assuming the website involved has given full permission
How could I download the ENTIRE contents (html) of that website using a python data scraper. By entire contents I refer to not only the current page you are on, but any other directory that branches off of that main website. Eg.
Using the link:
https://www.dogs.com
could I pull info from:
https://www.dogs.com/about-us
and any other directory attached to the "https://www.dogs.com/"
(I have no idea is dogs.com is a real website or not, just an example)
I have already made a scraper that will pull info from a certain link (nothing further than that), but I want to further improve it so I dont have to have heaps of links. I understand I can use an API but if this is possible I would rather this. Cheers!
while there is scrapy to do it professionally, you can use requests to get the url data, and bs4 to parse the html and look into it. it's also easier to do for a beginner i guess.
anyhow you go, you need to have a starting point, then you just follow the link's in the page, and then link's within those pages.
you might need to check if the url is linking to another website or is still in the targeted website. find the pages one by one and scrape them.

Accessing Hovertext with html

I am trying to access hover text found on graph points at this site (bottom):
http://matchhistory.na.leagueoflegends.com/en/#match-details/TRLH1/1002200043?gameHash=b98e62c1bcc887e4&tab=overview
I have the full site html but I am unable to find the values displayed in the hover text. All that can be seen when inspecting a point are x and y values that are transformed versions of these values. The mapping can be determined with manual input taken from the hovertext but this defeats the purpose of looking at the html. Additionally, the mapping changes with each match history so it is not feasible to do this for a large number of games.
Is there any way around this?
thank you
Explanation
Nearly everything on this webpage is loaded via JSON through JavaScript. We don't even have to request the original page. You will, however, have to repiece together the page via id's of items, mysteries and etc., which won't be too hard because you can request masteries similar to how we fetch items.
So, I went through the network tab in inspect and I noticed that it loaded the following JSON formatted URL:
https://acs.leagueoflegends.com/v1/stats/game/TRLH1/1002200043?gameHash=b98e62c1bcc887e4
If you notice, there is a gameHash and the id (similar to that of the link you just sent me). This page contains everything you need to rebuild it, given that you fetch all reliant JSON files.
Dealing with JSON
You can use json.loads in Python to load it, but a great tool I would recomend is:
https://jsonformatter.curiousconcept.com/
You copy and paste JSON in there and it will help you understand the data structure.
Fetching items
The webpage loads all this information via a JSON file:
https://ddragon.leagueoflegends.com/cdn/7.10.1/data/en_US/item.json
It contains all of the information and tool tips about each item in the game. You can access your desired item via: theirJson['data']['1001']. Each image on the page's file name is the id (or 1001) in this example.
For instance, for 'Boots of Speed':
import requests, json
itemJson = json.loads(requests.get('https://ddragon.leagueoflegends.com/cdn/7.10.1/data/en_US/item.json').text)
print(itemJson['data']['1001'])
An alternative: Selenium
Selenium could be used for this. You should look it up. It's been ported for several programming languages, one being Python. It may work as you want it to here, but I sincerely think that the JSON method (describe above), although a little more convoluted, will perform faster (since speed, based on your post, seems to be an important factor).

How to read a HTML page that takes some time to load? [duplicate]

I am trying to scrape a web site using python and beautiful soup. I encountered that in some sites, the image links although seen on the browser is cannot be seen in the source code. However on using Chrome Inspect or Fiddler, we can see the the corresponding codes.
What I see in the source code is:
<div id="cntnt"></div>
But on Chrome Inspect, I can see a whole bunch of HTML\CSS code generated within this div class. Is there a way to load the generated content also within python? I am using the regular urllib in python and I am able to get the source but without the generated part.
I am not a web developer hence I am not able to express the behaviour in better terms. Please feel free to clarify if my question seems vague !
You need JavaScript Engine to parse and run JavaScript code inside the page.
There are a bunch of headless browsers that can help you
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
The Content of the website may be generated after load via javascript, In order to obtain the generated script via python refer to this answer
A regular scraper gets just the HTML document. To get any content generated by JavaScript logic, you rather need a Headless browser that would also generate the DOM, load and run the scripts like a regular browser would. The Wikipedia article and some other pages on the Net have lists of those and their capabilities.
Keep in mind when choosing that some previously major products of those are abandoned now.
TRY THIS FIRST!
Perhaps the data technically could be in the javascript itself and all this javascript engine business is needed. (Some GREAT links here!)
But from experience, my first guess is that the JS is pulling the data in via an ajax request. If you can get your program simulate that, you'll probably get everything you need handed right to you without any tedious parsing/executing/scraping involved!
It will take a little detective work though. I suggest turning on your network traffic logger (such as "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention attention on any/all XmlHTTPRequests. The data you need should be found somewhere in one of these responses, probably in the middle of some JSON text.
Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)

Automatically preventing wiki-rot in Trac?

Hi guys : Is there a way to improve trac wiki quality using a plugin that deals with artifacts like for obsolete pages, or pages that refer to code which doesn't exist anymore, pages that are unlinked, or pages which have a low update-rate ? I think there might be several heuristics which could be used to prevent wiki-rot :
Number of recent edits
Number of recent views
Wether or not a page links to a source file
Wether or not a wiki page's last update is < or > the source files it links to
Wether entire directories in the wiki have been used/edited/ignored over the last "n" days
etc. etc. etc.
If nothing else, just these metrics alone would be useful for each page and each directory from an administrative standpoint.
I don't know of an existing plugin that does this, but everything you mentioned certainly sounds do-able in one way or another.
You can use the trac-admin CLI command to get a list of wiki pages and to dump the contents of a particular wiki page (as plain text) to a file or stdout. Using this, you can write a script that reads in all of the wiki pages, parses the content for links, and generates a graph of which pages link to what. This should pinpoint "orphans" (pages that aren't linked to), pages that link to source files, and pages that link to external resources. Running external links through something like wget can help you identify broken links.
To access last-edited dates, you'll want to query Trac's database. The query you'll need will depend on the particular database type that you're using. For playing with the database in a (relatively) safe and easy manner, I find the WikiTableMacro and TracSql plugins quite useful.
The hardest feature in your question to implement would be the one regarding page views. I don't think that Trac keeps track of page views, you'll probably have to parse your web server's log for that sort of information.
How about these:
BadLinksPlugin: This plugin logs bad local links found in wiki content.
It's a quite new one, just deals with dangling links, but any bad links as I see from source code. This is at least one building block to your solution request.
VisitCounterMacro: Macro displays how many times was wiki page displayed.
This is a rather old one. You'll get just the statistic per page while an administrative view is missing, but this could be built rather easily, i.e. like a custom PageIndex.

Check if a MediaWiki page exists (Python)

I'm working on a Python script that transforms this:
foo
bar
Into this:
[[Component foo]]
[[bar]]
The script checks (per input line) if the page "Component foo" exists. If it exists then a link to that page is created, if it doesn't exist then a direct link is created.
The problem is that I need a quick & cheap way to check if a lot of wiki pages exist.I don't want to (try to) download all the 'Component' pages.
I already figured out a fast way to do this by hand: Edit a new wiki page. paste all the 'component' links into the page, press preview, and then save the resulting preview HTML page. The resulting HTML file contains a different link for existing pages than for non-existing pages.
So to rephrase my question: How can I save a mediawiki preview page in Python?
(I don't have local access to the database.)
You can definitely use the API to check if a page exists:
# assuming words is a list of words you wish to query for
import urllib
# replace en.wikipedia.org with the address of the wiki you want to access
query = "http://en.wikipedia.org/w/api.php?action=query&titles=%s&format=xml" % "|".join(words)
pages = urllib.urlopen(query)
Now pages you will contain xml like this:
<?xml version="1.0"?><api><query><pages>
<page ns="0" title="DOESNOTEXIST" missing="" />
<page pageid="600799" ns="0" title="FOO" />
<page pageid="11178" ns="0" title="Foobar" />
</pages></query></api>
Pages which don't exist will appear here but they have the missing="" attribute set, as can be seen above. You can also check for the invalid attribute to be on the save side.
Now you can use your favorite xml parser to check for these attributes and react accordingly.
See also: http://www.mediawiki.org/wiki/API:Query
Use Pywikibot to interact with the MediaWiki software. It's probably the most powerful bot framework available.
The Python Wikipediabot Framework (pywikipedia or PyWikipediaBot) is a
collection of tools that automate work on MediaWiki sites. Originally
designed for Wikipedia, it is now used throughout the Wikimedia
Foundation's projects and on many other MediaWiki wikis. It's written
in Python, which is a free, cross-platform programming language. This
page provides links to general information for people who want to use
the bot software.
If you have local access to the wiki database, it might be easiest to do a query against the database to see whether each page exists.
If you only have HTTP access, you might try the mechanize library which lets you programmatically automate tasks that would otherwise require a browser.
You should be able to use the MediaWiki API.
http://www.mediawiki.org/wiki/API (maybe under Queries or Creating/Editing)
I'm not too familiar with it, but for example, you could compare the output of an existing page with a nonexistent page.
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Bill_Gates&rvprop=timestamp
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=NONEXISTENT_PAGE&rvprop=timestamp

Categories