Full or incremental scraping - What do people use? [closed] - python

I have a question regarding scraping content off websites. Let's imagine in this example we are talking about content on classified-style sites, for example Amazon or eBay.
The important thing about this content is that it can change and it can be removed.
The way I see it, I have two options:
1. A full fresh scrape on a daily basis. I start the day with a blank database schema, fully rescrape each site every day, and insert the content into the fresh database.
2. An incremental scrape, whereby I start with the content that was scraped yesterday, and when rescraping the site I do the following:
Check the existing URL
Content is still online and unchanged - leave it in the DB
Content is no longer available - delete it from the DB
Content is different - rescrape the content
My question is: is the added complexity of doing an incremental scrape actually worth it? Are there any benefits? I really like the simplicity of doing a fresh scrape each day, but this is my first scraping project and I would really like to know what scraping specialists do in scenarios like this.

I think the answer depends on how you are using the data you have scraped. Sometimes the added complexity is worth it, sometimes it is not. Ask yourself: what are the requirements for my scraper and what is the minimal amount of work that I need to do to fulfill these requirements?
For instance, if you are scraping for research purposes and it is easier for you to do a fresh scrape every day, then that might be the road you want to take.
Doing an incremental scrape is definitely more complex to implement, just as you said, because you need to make sure that changed content is handled correctly (unchanged, changed, removed). Make sure you also have a method for handling new content.
That being said, there are reasons why incremental scraping may be justified or even necessary. For instance, if you are building something on top of your scraped data and cannot afford downtime due to active scraping work, you may want to consider incremental scraping.
Note also that there is not just a single way of implementing incremental scrapes: many variations are possible. For instance, you may want to prioritize some content over others, say updating popular content more often than unpopular content. The point is that there is no upper limit to how much sophistication you can add to your scrapers. In fact, one could view search engine crawlers as highly sophisticated incremental scrapers.
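For illustration, here is a minimal sketch of the unchanged/changed/removed logic from the question, assuming pages are fetched over plain HTTP and stored in a hypothetical SQLite table listings(url, fingerprint, html); a content hash makes the "changed vs. unchanged" check cheap:

import hashlib
import sqlite3

import requests  # assumption: plain HTTP fetches are enough for the target site

def fingerprint(html):
    # Hash the page body so "changed vs. unchanged" is a cheap comparison.
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def incremental_update(conn, url):
    # Apply the unchanged / changed / removed logic from the question.
    row = conn.execute("SELECT fingerprint FROM listings WHERE url = ?", (url,)).fetchone()
    resp = requests.get(url, timeout=30)

    if resp.status_code == 404:
        # Content is no longer available: delete it from the DB.
        conn.execute("DELETE FROM listings WHERE url = ?", (url,))
    else:
        new_fp = fingerprint(resp.text)
        if row is None:
            # New content: insert it.
            conn.execute("INSERT INTO listings (url, fingerprint, html) VALUES (?, ?, ?)",
                         (url, new_fp, resp.text))
        elif row[0] != new_fp:
            # Content changed: update the stored copy (i.e. rescrape).
            conn.execute("UPDATE listings SET fingerprint = ?, html = ? WHERE url = ?",
                         (new_fp, resp.text, url))
        # else: unchanged, leave it in the DB.
    conn.commit()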

I implemented a cloud-based app that allows you to automate your scraping.
It turns websites into JSON/CSV.
You can choose to download the updated full data set on a daily basis or just the incremental differences.
Here is an example of a daily recurring scrape job for movie showtimes in Singapore.

Related

How to upload music to websites like Spotify, iTunes [closed]

I would like to write a Python application that automates the process of uploading music or podcasts to iTunes, Spotify, and other streaming platforms. It is supposed to take the music in my directory and then upload it to these platforms (and ultimately monetize these media).
I have checked the official APIs of iTunes and Spotify, but it seems that they don't have an upload feature. However, I have seen websites, like this one, which claim to upload (to multiple platforms) and monetize the music.
I would appreciate it if someone could help with this problem, or tell me how such websites accomplish this task.
Well, this problem could have multiple solutions. One of them would be to follow these steps:
Get all the data necessary for uploading to every music distributor:
- song name, artists, album, etc.
Store the data in Excel, CSV, JSON or whatever format you prefer.
Read the data using Python; you could use the pandas library for this.
Create a Selenium (a Python library for browser automation) bot that accesses every website, and program it to fill in all the fields for every website.
Finally, you would have a bot that reads the data you have written and automatically uploads music to all the websites (a rough sketch follows after these notes).
NOTE: Only follow these steps if the websites' APIs are not useful for this task.
P.S.: It is going to take a lot of time to build this functionality, because you have to program every music distributor website (7 to 15 days of hard work), but then you will be able to upload tons of music in just a few seconds on all the platforms.
Last note: Be aware of the web scraping policy of every website; they may not permit these types of operations and could ban your IP.
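A rough sketch of the pandas + Selenium part, assuming a hypothetical tracks.csv (columns song_name, artist, album, file_path) and a made-up distributor upload page; every real distributor will need its own URL and field selectors:

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

tracks = pd.read_csv("tracks.csv")  # hypothetical file from the "store the data" step

driver = webdriver.Chrome()
for _, track in tracks.iterrows():
    # Hypothetical upload page; inspect each real site to find the right selectors.
    driver.get("https://distributor.example.com/upload")
    driver.find_element(By.NAME, "title").send_keys(track["song_name"])
    driver.find_element(By.NAME, "artist").send_keys(track["artist"])
    driver.find_element(By.NAME, "album").send_keys(track["album"])
    # File inputs accept a local path passed via send_keys.
    driver.find_element(By.NAME, "audio_file").send_keys(track["file_path"])
    driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()
driver.quit()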

Need guidance to choose the best approach for dynamic web browsing with Python [closed]

I am working at a company, and one of my tasks is to scan certain tender portals for relevant opportunities and share them with distribution lists I keep in Excel. It is not a difficult task, but it is exhausting, especially with the other 100 things they put on me. So I decided to apply Python to solve my pain and provide opportunities for gains. I started with simple scraping with BeautifulSoup, but I realized that I need something better, like a bot or smart Selenium-based code.
Problem: manual search and collection of info from websites (search, click, download files, send them)
Sub-problem for automated site scraping: credentials
Code background: occasional learning from different platforms based on the problem at hand (mostly boring), mostly Python and data science related courses
Desired help: suggest a way, framework, or examples for automated web browsing using Python so I can collect all the info in a matter of clicks (data collection in Excel is fine, I do not have access to databases; however, more sophisticated ideas are appreciated)
PS. I am working two jobs and trying to support my family while searching for other career options, but my dedication and care for the business eat up my time, as I do not want to be a troublemaker; so while I try to push the old-school management for support, time goes by.
Please and thank you in advance for your mega smart advice! Many thanks
BeautifulSoup is not going to be up to the job, simply because it is a parser, not a web browser.
MechanicalSoup might be an option for you if the sites are not too complex and do not require JavaScript execution to function.
Selenium is essentially a robotic version of your favourite web browser.
Whether I choose Selenium or MechanicalSoup depends on whether my target data requires JavaScript execution, either during login or to get the data itself.
Let's go over your requirements:
Search: Can the search be conducted through a GET request? I.e. is the search done based on variables in the URL? Google something and then look at the URL of that Google search. Is there something similar on your target websites? If yes, MechanicalSoup. If not, Selenium.
Click: As far as I know, MechanicalSoup cannot explicitly click. It can follow URLs if it is told what to look for (and usually this is good enough), but it cannot click a button. Selenium is needed for this.
Download: Either of them can do this, as long as no button clicking is required. Again, can it just follow the path to where the button leads?
Send: Outside the scope of both. You need to look at something else for this, although plenty of mail libraries exist.
Credentials: Both can do this, so the key question is whether login is dependent on JavaScript.
This really hinges on the specific details of what you seek to do.
EDIT: Here is an example of what I have done with MechanicalSoup:
https://github.com/MattGaiser/mindsumo-scraper
It is a program which logs into a website, is pointed to a specific page, scrapes that page as well as the other relevant pages to which it links, and from those scrapings generates a CSV of the challenges I have won, the score I earned, and the link to the image of the challenge (which often has insights).
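To make the MechanicalSoup route concrete, here is a minimal sketch of the login + URL-driven search pattern described above; the portal URL, form selector, field names and CSS classes are all assumptions you would replace with the real site's:

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("https://tenders.example.com/login")  # hypothetical portal

# This works only if the login form does not depend on JavaScript.
browser.select_form('form[action="/login"]')
browser["username"] = "my_user"
browser["password"] = "my_password"
browser.submit_selected()

# If search is driven by URL parameters, a plain GET is enough (no clicking needed).
browser.open("https://tenders.example.com/search?keyword=construction")

page = browser.page  # a BeautifulSoup object
for link in page.select("a.tender-result"):  # hypothetical CSS class
    file_url = browser.absolute_url(link["href"])
    response = browser.session.get(file_url)
    with open(link.text.strip() + ".pdf", "wb") as f:
        f.write(response.content)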

How to solve a reCaptcha in advance using a web scraper? [closed]

I'm currently in the process of trying to solve a reCaptcha. One of the suggestions received was a method called token farming.
For example, it's possible to farm reCaptcha tokens from another site and, within 2 minutes, apply one of the farmed tokens to the site I'm trying to solve by changing the site's code on the back end.
Unfortunately, I wasn't able to get any further explanation as to how to go about doing so, especially changing the site's code on the back end.
If anyone's able to elaborate or give insights on the process, I would really appreciate the expertise.
Token farming / token harvesting has been described here in detail: https://www.blackhat.com/docs/asia-16/materials/asia-16-Sivakorn-Im-Not-a-Human-Breaking-the-Google-reCAPTCHA-wp.pdf
The approach for "token farming" discussed in this paper is based on the following mechanism:
Each user that visits a site with recaptcha is assigned a recaptcha-token.
This token is used to identify the user over multiple site visits and to mark them as a legitimate (or illegitimate) user.
Depending on various factors, like the age of the recaptcha-token, user behavior and browser configuration, the user is on each visit either presented with one of the various reCaptcha versions or even no captcha at all.
(more details can be extracted from their code here: https://github.com/neuroradiology/InsideReCaptcha)
This means that if one can create a huge number of fresh and clean tokens for a target site and age them for 9 days (that's what the paper found), these tokens can be used to access a few reCaptcha-protected sites before ever seeing a captcha.
To my understanding, such a fresh token has to be passed as a Cookie to the site in question.
However, I recall having read somewhere that Google closed this gap within a few days after this presentation.
Also, most probably there are other, similar approaches that have been labeled "token farming".
As far as I know, all these approaches exploited loopholes in the reCaptcha system, and these loopholes were closed by Google really fast - often even before the paper or presentation went public, as responsible authors usually inform Google in advance.
So for you this is most probably only of academic value, or for learning about proper protection of captcha systems and token-based services in general.
update
A quick check on a few reCaptcha-protected sites showed that the current system now scrambles the cookies, but the recaptcha-token can be found in the recaptcha form as two hidden input elements with partially different values and the id="recaptcha-token".
When visiting such a page with a clean browser you will get a new recaptcha token, which you can save away and insert into the same form later when needed. At least that's the theory; it is very likely that all the cookies and some long-term persisted state in your browser will keep you from doing this.

need design suggestions for an efficient webcrawler that is going to parse 8M pages - Python [closed]

I'm going to develop a little crawler that's going to fetch a lot of pages from the same website; all of the requests differ only in the ID number in the URL.
I need to save all the data I parse into a CSV (nothing fancy). At most, I will crawl about 6M-8M pages, and most of them don't contain the data I want. I know that there are about 400K pages which I need to parse; they are all similar in structure, but I can't avoid crawling all the URLs.
That's how the page looks when I get the data - http://pastebin.com/3DYPhPRg
That's how it looks when I don't get the data - http://pastebin.com/YwxXAmih
The data is saved in spans inside the td's -
I need the data between ">" and "</span>":
<span id="lblCompanyNumber">520000472</span></td>
<span id="lblCompanyNameHeb">חברת החשמל לישראל בעמ</span></td>
<span id="lblStatus">פעילה</span></td>
<span id="lblCorporationType">חברה ציבורית</span></td>
<span id="lblGovCompanyType">חברה ממשלתית</span></td>
<span id="lblLimitType">מוגבלת</span></td>
etc.
That's nothing too hard to parse from the document.
The problem is that it will take a few days to fetch the URLs and parse them, and it will consume a lot of memory. I think it's going to crash now and then, which is very dangerous for me; it must not crash unless it really can't run anymore.
I thought about:
- fetching a URL (urllib2)
- if there's an error, moving on to the next one (if it happens 5 times, I stop and save the errors to a log)
- parsing the HTML (still don't know what's best - BeautifulSoup / lxml / scrapy / HTMLParser etc.)
- if it's empty (lblCompanyNumber will be empty), saving the ID in emptyCsvFile.csv
- else: saving the data to goodResults.csv
The questions are:
which data types should I use in order to be more efficient and quick (for the data I parse and for the fetched content)?
which HTML parsing library should I use? Maybe regex? The span id is fixed and doesn't change when there's data (again: efficiency, speed, simplicity).
saving to a file, keeping a handle to the file open for so long, etc. - is there a way that will take fewer resources and be more efficient for saving the data? (at least 400K lines)
any other thing I haven't thought about and need to deal with, and maybe some optimization tips :)
Another solution I thought of is using wget, saving all the pages to disk, and then deleting all the files that have the same md5sum as an empty document; the only problem is that then I'm not saving the empty IDs.
By the way, I need to use py2exe and make an exe out of it, so things like scrapy can be hard to use here (it's known to cause issues with py2exe).
Thanks!
I used httplib2 for this kind of thing because there are supposed to be memory leaks in the Python standard library routines. Also, httplib2 can be configured to keep a cache which might be useful if you have to restart and redo some pages.
I only ran through 1.7 million pages plus about 200000 from another server, so I can't comment on the volume you expect.
But I drove it all using AMQP with a topic exchange and persistent message queues (delivery_mode=2). This fed my IDs into the worker that used httplib2 and made sure that every ID was retrieved. I tracked them using a memcache that was persisted using a Tokyo Tyrant hash table on disk. I was able to shut down and restart the workers and move them between machines without missing any IDs. I've had a worker running for up to three weeks at a time before I killed it to tinker with it.
Also, I used lxml for parsing responses because it is fast.
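As an illustration of the lxml approach, here is a small sketch that pulls the labelled spans from the question's example page (the span ids come from the question; how the result is structured is my own assumption):

from lxml import html

def parse_company_page(page_bytes):
    tree = html.fromstring(page_bytes)

    def span_text(span_id):
        nodes = tree.xpath('//span[@id="%s"]/text()' % span_id)
        return nodes[0].strip() if nodes else ""

    company_number = span_text("lblCompanyNumber")
    if not company_number:
        return None  # empty page: record the ID in emptyCsvFile.csv instead

    return {
        "number": company_number,
        "name": span_text("lblCompanyNameHeb"),
        "status": span_text("lblStatus"),
        "type": span_text("lblCorporationType"),
    }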
Oh, and after a page was retrieved and processed successfully, I posted the ID as a message to a completed queue. Then later I manually copied the messages off that queue and compared them to the input list to make sure that the whole process was reliable.
For AMQP I used amqplib with RabbitMQ as the broker. Nowadays I would recommend taking a look at haigha for AMQP. Although its documentation is sparse, its model closely follows the AMQP 0.9.1 spec documents, so you can use those to figure out options etc.
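The setup above used amqplib; as a rough modern equivalent, here is a minimal sketch of the same pattern (durable queue, persistent messages, ack only after success) using the pika client against a local RabbitMQ broker. The queue name and host are assumptions:

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="page_ids", durable=True)  # queue survives a broker restart

# Producer side: publish every page ID as a persistent message (delivery_mode=2).
for page_id in range(1, 101):
    channel.basic_publish(
        exchange="",
        routing_key="page_ids",
        body=str(page_id),
        properties=pika.BasicProperties(delivery_mode=2),
    )

# Worker side: take one message at a time and ack only after successful processing,
# so a crashed worker leaves its ID in the queue for another worker to pick up.
def handle(ch, method, properties, body):
    page_id = body.decode()
    # ... fetch and parse the page for page_id here ...
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_qos(prefetch_count=1)
channel.basic_consume(queue="page_ids", on_message_callback=handle)
channel.start_consuming()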
#YSY: I can't cut and paste the code because I did it at work; however, it was nothing special. Just a loop with try/except wrapped around the HTTP request. Something like this:
import logging
import time

import httplib2

log = logging.getLogger(__name__)
h = httplib2.Http(".cache")  # the cache directory lets you redo pages cheaply

retries = 5
while retries > 0:
    requestSucceeded = True  # assume the best
    try:
        resp, content = h.request("http://www.example.com/db/1234567")
        if resp is None:
            requestSucceeded = False
            log.warn("1234567: no http response")
        elif resp.status != 200:
            requestSucceeded = False
            log.warn("1234567: replied with {0:d}".format(resp.status))
    except Exception as e:
        requestSucceeded = False
        log.warn("1234567: exception - " + str(e))
    if not requestSucceeded:
        time.sleep(30)  # back off before retrying
        retries -= 1
    else:
        retries = 0     # success: leave the retry loop
if requestSucceeded:
    process_request()   # parse and save the page
    ack_message()       # acknowledge the AMQP message so it is not redelivered
The loop deals with two types of failures: one where the HTTP server talks to us but does not return a reply, and one where there is an exception, maybe a network error or anything else. You could be more sophisticated and handle different failure conditions in different ways, but this generally works. Tweak the sleep time and retries until you get over a 90% success rate, then handle the rest later. I believe I'm using half-hour sleeps and 3 retries right now, or maybe it is 15-minute sleeps. Not really important.
After a full run-through, I process the results (the log and the list of completed messages) to make sure that they agree, and any documents that failed to retrieve I try again on another day before giving up. Of course, I scan through the logs looking for similar problems and tweak my code to deal with them if I can think of a way.
Or you could google "scrapy". That might work for you. Personally, I like using AMQP to control the whole process.

How to generate graphical sitemap of large website [closed]

I would like to generate a graphical sitemap for my website. There are two stages, as far as I can tell:
crawl the website and analyse the link relationship to extract the tree structure
generate a visually pleasing render of the tree
Does anyone have advice or experience with achieving this, or know of existing work I can build on (ideally in Python)?
I came across some nice CSS for rendering the tree, but it only works for 3 levels.
Thanks
The only automatic way to create a sitemap is to know the structure of your site and write a program which builds on that knowledge. Just crawling the links won't usually work, because links can go between any pages, so you get a graph (i.e. connections between nodes). There is no way to convert a graph into a tree in the general case.
So you must identify the structure of your tree yourself and then crawl the relevant pages to get the titles of the pages.
As for "but it only works for 3 levels": three levels is more than enough. If you try to create more levels, your sitemap will become unusable (too big, too wide). No one will want to download a 1 MB sitemap and then scroll through 100,000 pages of links. If your site grows that big, then you must implement some kind of search.
Here is a Python web crawler, which should make a good starting point. Your general strategy is this:
take care that outbound links are never followed, including links on the same domain but higher up than your starting point.
as you spider the site, collect a hash of page URLs mapped to a list of all the internal URLs included in each page.
take a pass over this list, assigning a token to each unique URL.
use your hash of {token => [tokens]} to generate a Graphviz file that will lay out a graph for you (a rough sketch follows at the end of this answer).
convert the Graphviz output into an imagemap where each node links to its corresponding webpage.
The reason you need to do all this is, as leonm noted, that websites are graphs, not trees, and laying out graphs is a harder problem than you can solve in a simple piece of JavaScript and CSS. Graphviz is good at what it does.
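A minimal sketch of the token + Graphviz steps above, assuming you already have a dict mapping each page URL to the internal URLs it links to; the URLs, node names and file names are made up:

# Hypothetical crawl result: page URL -> internal URLs it links to.
site_graph = {
    "https://example.com/": ["https://example.com/about", "https://example.com/blog"],
    "https://example.com/blog": ["https://example.com/blog/post-1"],
}

# Assign a short token to each unique URL.
all_urls = set(site_graph) | {u for links in site_graph.values() for u in links}
tokens = {url: "n%d" % i for i, url in enumerate(sorted(all_urls))}

# Emit a Graphviz DOT file; the URL attribute lets -Tcmapx build an imagemap later.
with open("sitemap.dot", "w") as f:
    f.write("digraph sitemap {\n")
    for url, token in tokens.items():
        f.write('  %s [label="%s", URL="%s"];\n' % (token, url, url))
    for url, links in site_graph.items():
        for target in links:
            f.write("  %s -> %s;\n" % (tokens[url], tokens[target]))
    f.write("}\n")

# Render with Graphviz, e.g.:
#   dot -Tpng sitemap.dot -o sitemap.png
#   dot -Tcmapx sitemap.dot -o sitemap.map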
Please see http://aaron.oirt.rutgers.edu/myapp/docs/W1100_2200.TreeView for how to format tree views. You can also probably modify the example application at http://aaron.oirt.rutgers.edu/myapp/DirectoryTree/index to scrape your pages if they are organized as directories of HTML files.
