I would like to generate a graphical sitemap for my website. There are two stages, as far as I can tell:
crawl the website and analyse the link relationships to extract the tree structure
generate a visually pleasing render of the tree
Does anyone have advice or experience with achieving this, or know of existing work I can build on (ideally in Python)?
I came across some nice CSS for rendering the tree, but it only works for 3 levels.
Thanks
The only automatic way to create a sitemap is to know the structure of your site and write a program which builds on that knowledge. Just crawling the links won't usually work, because links can go between any pages, so you get a graph (i.e. connections between nodes). There is no way to convert a graph into a tree in the general case.
So you must identify the structure of your tree yourself and then crawl the relevant pages to get the titles of the pages.
As for "but it only works for 3 levels": Three levels is more than enough. If you try to create more levels, your sitemap will become unusable (too big, too wide). No one will want to download a 1MB sitemap and then scroll through 100'000 pages of links. If your site grows that big, then you must implement some kind of search.
Here is a Python web crawler, which should make a good starting point. Your general strategy is this:
you need to take care that outbound links are never followed, including links on the same domain but higher up than your starting point.
as you spider the site, collect a hash of page URLs mapped to a list of all the internal URLs included in each page.
take a pass over this list, assigning a token to each unique URL.
use your hash of {token => [tokens]} to generate a graphviz file that will lay out a graph for you
convert the graphviz output into an imagemap where each node links to its corresponding webpage
The reason you need to do all this is, as leonm noted, that websites are graphs, not trees, and laying out graphs is a harder problem than you can solve in a simple piece of JavaScript and CSS. Graphviz is good at what it does.
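To make the token and Graphviz steps concrete, here is a minimal sketch of turning a {url => [urls]} hash into a DOT file; the link_map contents are placeholders for whatever your crawler actually collects, and the shell commands at the end show one way to get both an image and a client-side image map out of Graphviz.

# Minimal sketch: turn a {page_url: [internal_urls]} hash into a Graphviz DOT file.
# The link_map below is a placeholder for what your crawler collects.
link_map = {
    "http://example.com/": ["http://example.com/about", "http://example.com/blog"],
    "http://example.com/blog": ["http://example.com/blog/post-1"],
}

# Assign a short token to every unique URL so node names stay valid in DOT.
all_urls = set(link_map) | {u for targets in link_map.values() for u in targets}
tokens = {url: "n%d" % i for i, url in enumerate(sorted(all_urls))}

with open("sitemap.dot", "w") as f:
    f.write("digraph sitemap {\n")
    for url, token in tokens.items():
        # The URL attribute makes Graphviz emit clickable areas in cmapx/svg output.
        f.write('  %s [label="%s", URL="%s", shape=box];\n' % (token, url, url))
    for src, targets in link_map.items():
        for dst in targets:
            f.write("  %s -> %s;\n" % (tokens[src], tokens[dst]))
    f.write("}\n")

# Then render the image plus an image map, e.g.:
#   dot -Tpng sitemap.dot -o sitemap.png
#   dot -Tcmapx sitemap.dot -o sitemap.map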
Please see http://aaron.oirt.rutgers.edu/myapp/docs/W1100_2200.TreeView on how to format tree views. You can also probably modify the example application at http://aaron.oirt.rutgers.edu/myapp/DirectoryTree/index to scrape your pages if they are organized as directories of HTML files.
I am working at a company and one of my tasks is to scan certain tender portals for relevant opportunities and share them with distribution lists I keep in Excel. It is not a difficult task, but it is an exhausting one, especially with the other 100 things they put on me. So I decided to apply Python to solve my pain and provide opportunities for gains. I started with simple scraping with BeautifulSoup, but I realized that I need something better, like a bot or smart Selenium-based code.
Problem: manual search and collection of info from websites (search, click, download files, send them)
Sub-problem for automated site scraping: credentials
Coding background: occasional learning from different platforms based on the problem at hand (mostly boring), mostly Python and data-science-related courses
Desired help: suggest an approach, framework, or examples for automated web browsing using Python so I can collect all the info in a matter of clicks (data collection in Excel is basic and I do not have access to databases; however, more sophisticated ideas are appreciated)
PS: I'm working two jobs and trying to support my family while searching for other career options, but my dedication to and care for the business eat up my time, as I do not want to be a troublemaker; so while I try to push management (which is old school) for support, time goes by.
Please and thank you in advance for your mega smart advice! Many thanks
BeautifulSoup is not going to be up to the job, simply because it is a parser, not a web browser.
MechanicalSoup might be an option for you if the sites are not too complex and do not require JavaScript execution to function.
Selenium is essentially a robotic version of your favourite web browser.
Whether I choose Selenium or MechanicalSoup depends on whether my target data requires JavaScript execution, either during login or to get the data itself.
Let's go over your requirements:
Search: Can the search be conducted through a GET request? I.e., is the search done based on variables in the URL? Google something and then look at the URL of that Google search. Is there something similar on your target websites? If yes, MechanicalSoup. If not, Selenium.
Click: As far as I know, MechanicalSoup cannot explicitly click. It can follow URLs if it is told what to look for (and usually this is good enough), but it cannot click a button. Selenium is needed for this.
Download: Either of them can do this as long as no button clicking is required. Again, can it just follow the path to where the button leads?
Send: Outside the scope of both. You need to look at something else for this, although plenty of mail libraries exist.
Credentials: Both can do this, so the key question is whether login is dependent on Javascript.
This really hinges on the specific details of what you seek to do.
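As a rough illustration of the MechanicalSoup side of that decision, here is a minimal sketch; the portal URL, the form selector, the field names, and the result-link selector are all placeholders you would replace with what you find in the target site's HTML. If the results only render through JavaScript, this stops working and Selenium becomes the better fit.

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()

# Hypothetical tender portal; the URL, form selector and field names are placeholders.
browser.open("https://example-tender-portal.com/login")
browser.select_form('form[action="/login"]')
browser["username"] = "me@example.com"
browser["password"] = "secret"
browser.submit_selected()

# If search works via URL variables (a GET request), you can often skip form handling:
browser.open("https://example-tender-portal.com/search?q=construction&region=eu")

# browser.page is a BeautifulSoup object, so normal parsing applies from here.
for link in browser.page.select("a.result-title"):
    print(link.get_text(strip=True), link.get("href"))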
EDIT: Here is an example of what I have done with MechanicalSoup:
https://github.com/MattGaiser/mindsumo-scraper
It is a program which logs into a website, is pointed to a specific page, scrapes that page as well as the other relevant pages to which it links, and from those scrapings generates a CSV of the challenges I have won, the score I earned, and the link to the image of the challenge (which often has insights).
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 2 years ago.
Improve this question
I'm currently in the process of trying to solve a reCaptcha. One of the suggestions received was a method called token farming.
For example, it's possible to farm reCaptcha tokens from another site and, within 2 minutes, apply one of the farmed tokens to the site I'm trying to solve by changing the site's code on the back end.
Unfortunately, I wasn't able to get any further explanation as to how to go about doing so, especially changing the site's code on the back end.
If anyone's able to elaborate or give insights on the process, I would really appreciate the expertise.
Token farming / token harvesting has been described here in detail: https://www.blackhat.com/docs/asia-16/materials/asia-16-Sivakorn-Im-Not-a-Human-Breaking-the-Google-reCAPTCHA-wp.pdf
The approach for "token farming" discussed in this paper is based on the following mechanism:
Each user that visits a site with recaptcha is assigned a recaptcha-token.
This token is used to identify the user over multiple site visits and to mark them as a legitimate (or illegitimate) user.
Depending on various factors, like the age of the recaptcha-token, user behavior, and browser configuration, the user on each visit is either presented with one of the various recaptcha versions or with no captcha at all.
(more details can be extracted from their code here: https://github.com/neuroradiology/InsideReCaptcha)
This means that if one can create a huge number of fresh and clean tokens for a target site and age them for 9 days (that's what the article found out), these tokens can be used for accessing a few recaptcha-protected sites before ever seeing a recaptcha.
To my understanding, such a fresh token has to be passed as a Cookie to the site in question.
However, I recall having read somewhere that Google closed this gap within a few days after this presentation.
Also, most probably there are other, similar approaches that have been labeled "token farming".
As far as I know, all these approaches exploited loopholes in the recaptcha system, and these loopholes were closed by Google really fast, often even before the paper or presentation went public, as responsible authors usually inform Google in advance.
So for you this is most probably only of academic value or for learning about proper protection of captcha systems and token based services in general.
update
A quick check on a few recaptcha protected sites showed that the current system now scrambles the cookies, but the recaptcha-token can be found in the recaptcha form as two hidden input elements with partially different values and the id="recaptcha-token".
When visiting such a page with a clean browser you will get a new recaptcha token which you can save away and insert into the same form later when needed. At least that's the theory, it is very likely that all the cookies and some long term persisted stuff in your browser will keep you from doing this.
My problem is that I want to create a database of all of the questions, answers, and most importantly, the tags, from a certain (somewhat small) Stack Exchange. The relationships among tags (e.g. tags more often used together have a strong relation) could reveal a lot about the structure of the community and the popularity of or interest in certain subfields.
So, what is the easiest way to go through a list of questions (that are positively ranked) and extract the tag information using Python?
The easiest way to get the shared-tag count for all questions is to use the Stack Exchange API.
import requests

r = requests.get(
    'http://api.stackexchange.com/2.2/tags/python/related?pagesize=3&site=stackoverflow')
for item in r.json()['items']:
    print("{name} shares {count} tags with Python".format(**item))
If this doesn't satisfy your need, there are many other API queries available.
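For instance, since you mentioned positively ranked questions, a query along these lines pulls questions together with their tags (with sort=votes the min parameter acts as a minimum score; swap site for the Stack Exchange you care about, and page through the results via the page parameter and the has_more flag for a full dump):

import requests

resp = requests.get(
    "http://api.stackexchange.com/2.2/questions",
    params={
        "site": "stackoverflow",  # replace with your target Stack Exchange
        "order": "desc",
        "sort": "votes",
        "min": 1,                 # with sort=votes this is a minimum score
        "pagesize": 10,
    },
)

for q in resp.json()["items"]:
    print(q["score"], q["tags"], q["title"])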
Visit the site to find the URL that shows the information you want, then look at the page source to see how it has been formatted.
In order to scrape the pages, use the urllib2 library.
Parse the text using the BeautifulSoup library.
Place the data into a database.
The difficult thing is going to be structuring your database and developing queries that reveal what you want.
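A rough sketch of those steps, using urllib.request (the Python 3 counterpart of urllib2), BeautifulSoup and SQLite; note that the CSS selectors below are guesses and have to be adjusted to whatever the page source of your target site actually uses:

import sqlite3
import urllib.request
from bs4 import BeautifulSoup

# Fetch a question listing page; in Python 2 this would be urllib2.urlopen.
url = "https://stackoverflow.com/questions?sort=votes"
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
html = urllib.request.urlopen(req).read()
soup = BeautifulSoup(html, "html.parser")

# Store question titles and tags; the selectors are guesses, check the real page source.
conn = sqlite3.connect("questions.db")
conn.execute("CREATE TABLE IF NOT EXISTS question_tags (title TEXT, tag TEXT)")

for summary in soup.select(".s-post-summary"):
    title = summary.select_one("h3 a")
    if title is None:
        continue
    for tag in summary.select(".post-tag"):
        conn.execute("INSERT INTO question_tags VALUES (?, ?)",
                     (title.get_text(strip=True), tag.get_text(strip=True)))

conn.commit()
conn.close()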
I have a question in regards to scraping content off websites. Let's imagine in this example we are talking about content on classified-style sites, for example Amazon or eBay.
Important notes about this content is that it can change and it can be removed.
The way I see it I have two options:
A full fresh scrape on a daily basis. I start the day with a blank database schema and fully rescrape each site every day and insert the content into the fresh database.
An incremental scrape, whereby I start with the content that was scraped yesterday, and when rescraping the site I do the following:
Check existing URL
Content is still online and is the same - Leave in DB
Content is not available - Delete from DB
Content is different - Rescrape content
My question is: is the added complexity of doing an incremental scrape actually worth it? Are there any benefits to it? I really like the simplicity of doing a fresh scrape each day, but this is my first scraping project and I would really like to know what scraping specialists do in scenarios like this.
I think the answer depends on how you are using the data you have scraped. Sometimes the added complexity is worth it, sometimes it is not. Ask yourself: what are the requirements for my scraper and what is the minimal amount of work that I need to do to fulfill these requirements?
For instance, if you are scraping for research purposes and it is easier for you to do a fresh scrape every day, then that might be the road you want to take.
Doing an incremental scrape is definitely more complex to implement, just as you said, because you need to make sure that changed content is handled correctly (unchanged, changed, removed). Just make sure you also have a method for handling new content.
That being said, there are reasons why incremental scraping may be justified or even necessary. For instance if you are building something on top of your scraped data and cannot afford downtime due to active scraping work, you may want to consider incremental scraping.
Note also that there is not just a single way of implementing incremental scrapes: many variations are possible. For instance, you may want to prioritize some content over the rest, say by updating popular content more often than unpopular content. The point is that there is no upper limit to how much sophistication you can add to your scrapers. In fact, one could view search engine crawlers as highly sophisticated incremental scrapers.
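To illustrate the bookkeeping an incremental scrape needs, here is a minimal sketch that keeps one content hash per URL; scraped_items stands in for whatever your scraper actually returns, and the three branches map directly to the leave/update/delete cases from the question:

import hashlib
import sqlite3

def sync(scraped_items, db_path="listings.db"):
    """scraped_items: dict mapping URL -> page content (placeholder for your scraper's output)."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS listings "
                 "(url TEXT PRIMARY KEY, content_hash TEXT, content TEXT)")

    stored = dict(conn.execute("SELECT url, content_hash FROM listings"))

    for url, content in scraped_items.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if url not in stored:
            # New content: insert.
            conn.execute("INSERT INTO listings VALUES (?, ?, ?)", (url, digest, content))
        elif stored[url] != digest:
            # Content is different: rescrape/update.
            conn.execute("UPDATE listings SET content_hash = ?, content = ? WHERE url = ?",
                         (digest, content, url))
        # else: content unchanged, leave it in the DB.

    # Content no longer available: delete from the DB.
    for url in set(stored) - set(scraped_items):
        conn.execute("DELETE FROM listings WHERE url = ?", (url,))

    conn.commit()
    conn.close()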
I implemented a cloud based app that allows you to automate your scraping.
It turns websites into JSON/CSV.
You can choose to download the updated full data set on a daily basis or just the incremental differences.
Here is an example of a daily recurring scrape job for movie showtimes in Singapore.
Hi guys: Is there a way to improve Trac wiki quality using a plugin that deals with artifacts like obsolete pages, pages that refer to code which doesn't exist anymore, pages that are unlinked, or pages which have a low update rate? I think there are several heuristics which could be used to prevent wiki rot:
Number of recent edits
Number of recent views
Whether or not a page links to a source file
Whether a wiki page's last update is older or newer than the source files it links to
Whether entire directories in the wiki have been used/edited/ignored over the last "n" days
etc. etc. etc.
If nothing else, just these metrics alone would be useful for each page and each directory from an administrative standpoint.
I don't know of an existing plugin that does this, but everything you mentioned certainly sounds do-able in one way or another.
You can use the trac-admin CLI command to get a list of wiki pages and to dump the contents of a particular wiki page (as plain text) to a file or stdout. Using this, you can write a script that reads in all of the wiki pages, parses the content for links, and generates a graph of which pages link to what. This should pinpoint "orphans" (pages that aren't linked to), pages that link to source files, and pages that link to external resources. Running external links through something like wget can help you identify broken links.
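A rough sketch of such a script, assuming trac-admin's wiki list and wiki export subcommands behave as they do on my installations; the output parsing and the link regex are deliberate simplifications you would need to adapt to your wiki's conventions:

import re
import subprocess

TRAC_ENV = "/path/to/trac/env"  # placeholder

def wiki_pages():
    # 'trac-admin <env> wiki list' prints a table; naively take the first column
    # and skip the header/separator lines.
    out = subprocess.check_output(["trac-admin", TRAC_ENV, "wiki", "list"], text=True)
    pages = []
    for line in out.splitlines():
        parts = line.split()
        if parts and not parts[0].startswith("-") and parts[0] != "Title":
            pages.append(parts[0])
    return pages

def page_text(name):
    # 'wiki export <page>' dumps the page source to stdout when no file is given.
    return subprocess.check_output(["trac-admin", TRAC_ENV, "wiki", "export", name], text=True)

# Very simplified link extraction: wiki:PageName links and [[...]] style links.
LINK_RE = re.compile(r"wiki:([A-Za-z0-9_/.\-]+)|\[\[([^|\]]+)")

pages = wiki_pages()
links = {page: {m.group(1) or m.group(2) for m in LINK_RE.finditer(page_text(page))}
         for page in pages}

linked_to = {target for targets in links.values() for target in targets}
print("Possible orphans:", sorted(set(pages) - linked_to))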
To access last-edited dates, you'll want to query Trac's database. The query you'll need will depend on the particular database type that you're using. For playing with the database in a (relatively) safe and easy manner, I find the WikiTableMacro and TracSql plugins quite useful.
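For a SQLite-backed environment, a query along these lines pulls the last edit time per page straight from Trac's wiki table (recent Trac versions store timestamps as microseconds since the epoch, so adjust the division, the path, and the driver to your own setup and database type):

import sqlite3
from datetime import datetime, timezone

# Path to Trac's SQLite database; use the matching driver instead for PostgreSQL/MySQL.
conn = sqlite3.connect("/path/to/trac/env/db/trac.db")

rows = conn.execute("""
    SELECT name, MAX(time) AS last_edit
    FROM wiki
    GROUP BY name
    ORDER BY last_edit ASC
""")

for name, last_edit in rows:
    # Recent Trac versions store microseconds since the epoch.
    when = datetime.fromtimestamp(last_edit / 1000000, tz=timezone.utc)
    print("%s  %s" % (when.strftime("%Y-%m-%d"), name))

conn.close()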
The hardest feature in your question to implement would be the one regarding page views. I don't think that Trac keeps track of page views; you'll probably have to parse your web server's logs for that sort of information.
How about these:
BadLinksPlugin: This plugin logs bad local links found in wiki content.
It's quite a new one; it deals not just with dangling links but with any bad links, as far as I can see from the source code. This is at least one building block for the solution you're after.
VisitCounterMacro: Macro displays how many times was wiki page displayed.
This is a rather old one. You get just the statistics per page, while an administrative view is missing, but that could be built rather easily, e.g. like a custom PageIndex.