I'm currently trying to solve a reCaptcha. One of the suggestions I received was a method called token farming.
For example, it's possible to farm reCaptcha tokens from another site and, within 2 minutes, apply one of the farmed tokens to the site I'm trying to solve by changing the site's code on the back end.
Unfortunately, I wasn't able to get any further explanation of how to go about doing this, especially the part about changing the site's code on the back end.
If anyone is able to elaborate or give insights into the process, I would really appreciate the expertise.
Token farming / token harvesting has been described here in detail: https://www.blackhat.com/docs/asia-16/materials/asia-16-Sivakorn-Im-Not-a-Human-Breaking-the-Google-reCAPTCHA-wp.pdf
The approach for "token farming" discussed in this paper is based on the following mechanism:
Each user that visits a site with recaptcha is assigned a recaptcha-token.
This token is used to identify the user over multiple site visits and to mark them as a legitimate (or illegitimate) user.
Depending on various factors like the age of the recaptcha-token, user behavior, and browser configuration, the user on each visit is either presented with one of the various reCaptcha versions or even no captcha at all.
(more details can be extracted from their code here: https://github.com/neuroradiology/InsideReCaptcha)
This means that if one can create a huge number of fresh and clean tokens for a target site and age them for 9 days (that's what the article found out), these tokens can be used for accessing a few recaptcha-protected sites before ever seeing a captcha.
To my understanding, such a fresh token has to be passed as a Cookie to the site in question.
However, I recall having read somewhere that Google closed this gap within a few days of this presentation.
Also, there are most probably other, similar approaches that have been labeled "token farming".
As far as I know, all these approaches exploited loopholes in the recaptcha system, and these loopholes were closed by Google really fast, often even before the paper or presentation went public, as responsible authors usually inform Google in advance.
So for you this is most probably only of academic value, or for learning about proper protection of captcha systems and token-based services in general.
Update
A quick check on a few recaptcha protected sites showed that the current system now scrambles the cookies, but the recaptcha-token can be found in the recaptcha form as two hidden input elements with partially different values and the id="recaptcha-token".
When visiting such a page with a clean browser you will get a new recaptcha-token, which you can save away and insert into the same form later when needed. At least that's the theory; it is very likely that all the cookies and some long-term persisted state in your browser will keep you from doing this.
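For illustration only, here is a minimal sketch (an assumption, not a tested recipe) of how one might look for that hidden recaptcha-token value in a fetched page with requests and BeautifulSoup. The URL is a placeholder, and since the widget is normally injected by JavaScript, a real harvest would more likely need a browser-driven tool such as Selenium.

```python
# Hypothetical sketch: fetch a page and read the hidden recaptcha-token
# input described above. The URL is a placeholder, and because the widget
# is usually injected by JavaScript, a plain HTTP fetch may not contain it;
# a browser-driven tool such as Selenium would be closer to reality.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/page-with-recaptcha")
soup = BeautifulSoup(resp.text, "html.parser")

token_input = soup.find("input", id="recaptcha-token")
if token_input is not None:
    print("Harvested token:", token_input.get("value", "")[:40], "...")
else:
    print("No recaptcha-token input found in the static HTML")
```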
I'm pretty inexperienced with any form of http request more complicated than a basic GET query. I've tried to do research online, but I'm having trouble figuring out where to start because I don't know all the required terminology.
For several years I've worked a side job for a data entry company. Basically what I do is Google several things, find the results from a few specific webpages, and copy those URLs into the company's system. About two years ago I wrote a very basic Python program to do the Googling part for me, and now I want to rewrite it and expand it to do the rest of it as well.
The website uses a combination of POST and PATCH requests to update the information in the database, and because the information is attached to my account I assume there is some form of authentication involved. I don't have access to the system's backend, so the best I can do is head to the Network tab under Inspect Element. I can't find anything in the requests' headers that seems to tie them to my account.
What do I need to do to authenticate, and if it's not that simple, where's the best place to start learning?
Let me know if you need more information and I'll try to give you what you need; I don't know exactly what's required.
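In case it helps as a starting point, here is a purely hypothetical sketch of one common pattern: reusing the browser's session cookie with requests.Session to replay an authenticated PATCH. The cookie name, URL, and payload fields are placeholders and have to be read off the actual requests in the Network tab; authentication might equally be carried in a header such as Authorization rather than a cookie.

```python
# Hypothetical sketch: replay an authenticated PATCH request by reusing
# the session cookie copied from the browser's Network tab.
# Cookie name, URL, and payload fields are placeholders -- the real names
# must be read off the actual requests in the devtools.
import requests

session = requests.Session()
session.cookies.set("sessionid", "PASTE_COOKIE_VALUE_FROM_DEVTOOLS")

payload = {"url": "https://example.com/found-result"}
resp = session.patch("https://company-system.example.com/api/records/123",
                     json=payload)
print(resp.status_code, resp.text[:200])
```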
I would like to write a Python application that automates the process of uploading music or podcasts to iTunes, Spotify, and other streaming platforms. It is supposed to pick up the music in my directory and then upload it to these platforms (and ultimately monetize the media).
I have checked the official APIs of iTunes and Spotify, but it seems that they don't have an upload feature. However, I have seen websites, like this one, which claim to upload (to multiple platforms) and monetize the music.
I would appreciate it if someone could help with this problem, or tell me how such websites accomplish this task.
Well, this problem could have multiple solutions. One of them would be to follow these steps:
Get all the data necessary for uploading to every music distributor:
- Song name, artists, album, etc.
Store the data in an Excel, CSV, or JSON file, or whatever you prefer.
Read the data using Python; you could use the pandas library for this.
Create a Selenium bot (Selenium is a Python library for browser automation) that accesses every website, and program it to fill in all the fields for each one.
Finally, you end up with a bot that reads the data you have written and automatically uploads music to all the websites (see the sketch after the notes below).
NOTE: Only follow these steps if the websites' APIs are not useful for this task.
PS: It is going to take a lot of time to build this functionality, because you have to program against every music distributor's website (7 to 15 days of hard work), but then you will be able to upload tons of music in just a few seconds on all the platforms.
Last note: Be aware of each website's web-scraping policy; they may not permit these types of operations and could ban your IP.
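As a rough illustration of steps 3 to 5 (not a working uploader), here is a Selenium sketch with placeholder URLs and element IDs that reads track metadata from a CSV with pandas and fills in a hypothetical upload form; every real distributor site will need its own selectors.

```python
# Hypothetical sketch of steps 3-5: read track metadata from a CSV with
# pandas and fill in a distributor's upload form with Selenium.
# The URL and element IDs are placeholders -- every real distributor site
# will need its own selectors.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

tracks = pd.read_csv("tracks.csv")  # columns: title, artist, album, file_path

driver = webdriver.Chrome()
for _, track in tracks.iterrows():
    driver.get("https://distributor.example.com/upload")  # placeholder URL
    driver.find_element(By.ID, "song-title").send_keys(track["title"])
    driver.find_element(By.ID, "artist-name").send_keys(track["artist"])
    driver.find_element(By.ID, "album-name").send_keys(track["album"])
    # File inputs accept a local path sent to the <input type="file"> element
    driver.find_element(By.ID, "audio-file").send_keys(track["file_path"])
    driver.find_element(By.ID, "submit-button").click()

driver.quit()
```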
I'm still a novice with Python, and using multiprocessing is a big job for me.
So my question is, how do I speed up crawling the comments sections of YouTube videos using the YouTube API together with multiprocessing?
This project needs to crawl the comments of a few hundred thousand videos in a limited time. I understand that multiprocessing is used with normal scraping methods such as BeautifulSoup/Scrapy, but what about when I use the YouTube API?
If I use the YouTube API (which requires API keys) to crawl the data, will multiprocessing be able to do the job using multiple keys, or will it use the same one over and over again for different tasks?
To simplify: is it possible to use multiprocessing with code that uses API keys, instead of normal scraping methods that do not require API keys?
Anyone have any idea?
This won't directly answer your question, but I suggest having a look at the YouTube API quota:
https://developers.google.com/youtube/v3/getting-started#calculating-quota-usage
By default, your project will have a quota of just 10,000 units per day, and retrieving comments will cost between 1 and 5 units per comment (if you want the video data they're attached to, add another 21 units per video). Realistically, you'll only be able to retrieve about 2,000 comments per day via the API without putting in a quota increase request, which can take weeks.
Edit: Google will populate code for you in the language of your choice for a given request. I'd recommend populating the form here with your request, and using that as a starting point: https://developers.google.com/youtube/v3/docs/comments/list
(click "Populate APIs Explorer" -> "See Code Samples" -> enter more info on the left)
I am working at a company, and one of my tasks is to scan certain tender portals for relevant opportunities and share them with distribution lists I have in Excel. It is not a difficult task, but it is an exhausting one, especially with the other 100 things they put on me. So I decided to apply Python to solve my pain and provide opportunities for gains. I started with simple scraping using BeautifulSoup, but I realized that I need something better, like a bot or smart Selenium-based code.
Problem: manual search and collection of info from websites (search, click, download files, send them).
Sub-problem for automated site scraping: credentials.
Coding background: occasional learning from different platforms based on the problem at hand (mostly boring), mostly Python and data-science-related courses.
Desired help: suggest a way, framework, or examples for automated web browsing using Python so I can collect all the info in a matter of clicks (data collection using Excel is basic and I do not have access to databases; however, more sophisticated ideas are appreciated).
PS. I am working two jobs and trying to support my family while searching for other career options, but my dedication to and care for the business eat up my time, as I do not want to be a troublemaker; so while I try to push management (which is old school) for support, time goes by.
Thank you in advance for your advice!
BeautifulSoup is not going to be up to the job, simply because it is a parser, not a web browser.
MechanicalSoup might be an option for you if the sites are not too complex and do not require JavaScript execution to function.
Selenium is essentially a robotic version of your favourite web browser.
Whether I choose Selenium or MechanicalSoup depends on whether my target data requires JavaScript execution, either during login or to get the data itself.
Let's go over your requirements:
Search: Can the search be conducted through a get request? I.e. is the search done based on variables in the URL? Google something and then look at the URL of that Google Search. Is there something similar on your target websites? If yes, MechanicalSoup. If not, Selenium.
Click: As far as I know, MechanicalSoup cannot explicitly click. It can follow URLs if it is given what to look for (and usually this is good enough), but it cannot click a button. Selenium is needed for this.
Download: Either of them can do this as long as no button clicking is required. Again, can it just follow the path the button leads to?
Send: Outside the scope of both. You need to look at something else for this, although plenty of mail libraries exist.
Credentials: Both can do this, so the key question is whether login is dependent on JavaScript.
This really hinges on the specific details of what you seek to do.
EDIT: Here is an example of what I have done with MechanicalSoup:
https://github.com/MattGaiser/mindsumo-scraper
It is a program which logs into a website, is pointed to a specific page, scrapes that page as well as the other relevant pages to which it links, and from those scrapings generates a CSV of the challenges I have won, the score I earned, and the link to the image of the challenge (which often has insights).
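To give a flavour of the MechanicalSoup route, here is a minimal, hypothetical login-and-scrape sketch; the URLs, form selector, and field names are placeholders, and as discussed above it only works if neither the login nor the data requires JavaScript.

```python
# Hypothetical sketch of the MechanicalSoup approach: log in via a plain
# HTML form, then fetch a page and pull data out of it. The URLs, form
# selector, and field names are placeholders; this only works if the site
# does not need JavaScript for login or for rendering the data.
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("https://tenderportal.example.com/login")

browser.select_form('form[action="/login"]')  # placeholder selector
browser["username"] = "my_user"
browser["password"] = "my_password"
browser.submit_selected()

browser.open("https://tenderportal.example.com/opportunities")
page = browser.page  # BeautifulSoup object for the current page
for row in page.select("table.results tr"):   # placeholder selector
    print(row.get_text(" ", strip=True))
```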
I have a question in regard to scraping content off websites. Let's imagine in this example we are talking about content on classified-style sites, for example Amazon or eBay.
Important notes about this content is that it can change and it can be removed.
The way I see it I have two options:
A full fresh scrape on a daily basis. I start the day with a blank database schema, fully rescrape each site every day, and insert the content into the fresh database.
An incremental scrape, whereby I start with the content that was scraped yesterday, and when rescraping the site I do the following:
Check the existing URL
Content is still online and is the same - leave it in the DB
Content is not available - delete it from the DB
Content is different - rescrape the content
My question is: is the added complexity of doing an incremental scrape actually worth it? Are there any benefits to this? I really like the simplicity of doing a fresh scrape each day, but this is my first scraping project and I would really like to know what the scraping specialists do in scenarios like this.
I think the answer depends on how you are using the data you have scraped. Sometimes the added complexity is worth it, sometimes it is not. Ask yourself: what are the requirements for my scraper and what is the minimal amount of work that I need to do to fulfill these requirements?
For instance, if you are scraping for research purposes and it is easier for you to do a fresh scrape every day, then that might be the road you want to take.
Doing an incremental scrape is definitely more complex to implement, just as you said, because you need to make sure that changed content is handled correctly (unchanged, changed, removed). Just make sure you also have a method for handling new content as well.
That being said, there are reasons why incremental scraping may be justified or even necessary. For instance if you are building something on top of your scraped data and cannot afford downtime due to active scraping work, you may want to consider incremental scraping.
Note also that there is not just a single way of implementing incremental scrapes: many kinds of incremental scrapes can be implemented. For instance, you may want to prioritize some content over other content, say, updating popular content more often than unpopular content. The thing here is that there is no upper limit to how much sophistication you can add to your scrapers. In fact, one could view search engine crawlers as highly sophisticated incremental scrapers.
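As a rough sketch of what the unchanged / changed / removed decision logic from the question could look like with simple content hashing (fetch_page and the storage of yesterday's hashes are placeholders, not part of any specific library):

```python
# Hypothetical sketch of incremental scraping with hash-based change
# detection. fetch_page() and the storage layer are placeholders; the
# point is the unchanged / changed / removed / new decision logic.
import hashlib

def content_hash(html: str) -> str:
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def incremental_update(known_items: dict, live_urls: list, fetch_page) -> dict:
    """known_items maps url -> stored content hash from yesterday's run."""
    updated = {}
    for url in live_urls:
        html = fetch_page(url)              # placeholder fetch function
        digest = content_hash(html)
        if url not in known_items:
            print("new item:", url)         # scrape and insert
        elif known_items[url] != digest:
            print("changed item:", url)     # rescrape and update
        # else: unchanged, keep the existing DB row
        updated[url] = digest

    for url in set(known_items) - set(live_urls):
        print("removed item:", url)         # delete from DB
    return updated
```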
I implemented a cloud-based app that allows you to automate your scraping.
It turns websites into JSON/CSV.
You can choose to download the updated full data set on a daily basis or just the incremental differences.
Here is an example of a daily recurring scrape job for movie showtimes in Singapore.