Is there a way to start a program N times at once? - python

I am making a Python bot that creates Twitter accounts, just for educational purposes, with throw-away emails (catchalls) and Russian phone numbers. I managed to get through both email and phone verification and was wondering if I can create accounts at a large scale by starting N webdrivers at once.
Right now my code only loops the program N times. I removed the code, but it looked like this:
amount = input....
for i in range(amount):
    App.run()
This was my only hope in actually doing this. Does anyone know how I can do this and if a computer can actually handle 10 or 100 headless webdrivers from selenium at once?

Well, you need to create multiple threads instead of looping; then you can start each run in its own parallel thread. You are on the right track. You don't need a Selenium Grid to achieve this.
Look up multithreading. You can start with this answer (Threads in Java).
It is not true that you need Grid to run multiple browser sessions. You can start multiple browser sessions simply by creating multiple driver objects and managing them. Each session will be separate if you want it to be.
Grid is for scaling, since there is a limit on the number of browser instances you can run while keeping your machine's performance intact and your tests stable, for example more than 5 Chrome instances on a single machine. If you want to go beyond that, you have to use Selenium Grid.
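As a rough Python sketch of the "threads instead of a loop" idea: the function `run_once` below is a hypothetical stand-in for one `App.run()` call, and the default of 5 workers simply mirrors the rule of thumb mentioned above, not a measured limit.

```python
from concurrent.futures import ThreadPoolExecutor


def run_once(worker_id):
    """Hypothetical stand-in for one App.run() call."""
    # A real worker would create its own driver object here and quit it when done.
    return f"worker-{worker_id} finished"


def run_in_parallel(amount, max_workers=5):
    """Run `run_once` `amount` times, at most `max_workers` at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the inputs
        return list(pool.map(run_once, range(amount)))


if __name__ == "__main__":
    for line in run_in_parallel(10):
        print(line)
```

Capping `max_workers` rather than starting all N threads at once keeps the number of simultaneous browsers bounded; the remaining runs simply wait for a free slot.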

Related

Python+Selenium+Chrome Increasing Memory Usage Over Time [duplicate]

I am using selenium to run chrome headless with the following command:
system "LC_ALL=C google-chrome --headless --enable-logging --hide-scrollbars --remote-debugging-port=#{debug_port} --remote-debugging-address=0.0.0.0 --disable-gpu --no-sandbox --ignore-certificate-errors &"
However, it appears that Chrome headless is consuming too much memory and CPU. Does anyone know how we can limit the CPU/memory usage of Chrome headless, or whether there is some workaround?
Thanks in advance.
There has been a lot of discussion about the unpredictable CPU and memory consumption of Chrome Headless sessions.
As per the discussion Building headless for minimum cpu+mem usage the CPU + Memory usage can be optimized by:
Using either a custom proxy or C++ ProtocolHandlers you could return stub 1x1 pixel images or even block them entirely.
Chromium Team is working on adding a programmatic control over when frames are produced. Currently headless chrome is still trying to render at 60 fps which is rather wasteful. Many pages do need a few frames (maybe 10-20 fps) to render properly (due to usage of requestAnimationFrame and animation triggers) but we expect there are a lot of CPU savings to be had here.
MemoryInfra should help you determine which component is the biggest consumer of memory in your setup.
An example usage:
$ headless_shell --remote-debugging-port=9222 --trace-startup=*,disabled-by-default-memory-infra http://www.chromium.org
Chromium is always going to use as many resources as are available to it. If you want to effectively limit its utilization, you should look into using cgroups.
With that said, here are some common best practices to adopt when running headless browsers in a production environment:
Fig: Volatile resource usage of Headless Chrome
Don't run a headless browser:
By all accounts, if at all possible, just don't run a headless browser. Headless browsers are unpredictable and hungry. Almost everything you can do with a browser (save for interpolating and running JavaScript) can be done with simple Linux tools. There are libraries that offer elegant Node APIs for fetching data via HTTP requests and scraping, if that's your end goal.
Don't run a headless browser when you don't need to:
Some users attempt to keep the browser open even when it is not in use, so that it's always available for connections. While this might be a good strategy to help expedite session launch, it'll only end in misery after a few hours. This is largely because browsers like to cache stuff and slowly eat more memory. Any time you're not actively using the browser, close it!
Parallelize with browsers, not pages:
Given that you should only run a browser when absolutely necessary, the next best practice is to run only one session through each browser. While you might actually save some overhead by parallelizing work through pages, if one page crashes it can bring down the entire browser with it. On top of that, each page isn't guaranteed to be totally clean (cookies and storage might bleed through).
page.waitForNavigation:
One of the most common issues observed is an action that triggers a page load, followed by the sudden loss of your script's execution. This is because actions that trigger a page load can often cause subsequent work to get swallowed. To get around this issue, you will generally have to invoke the page-loading action and immediately wait for the next page load.
Use docker to contain it all:
Chrome needs a lot of dependencies to run properly. Even after all of that is in place, there are things like fonts and phantom processes to worry about, so it's ideal to use some sort of container to contain it. Docker is almost custom-built for this task, as you can limit the amount of resources available and sandbox the browser. Create your own Dockerfile.
To avoid running into zombie processes (which commonly happen with Chrome), you'll want to use something like dumb-init to start up properly.
Two different runtimes:
There can be two JavaScript runtimes going on (Node and the browser). This is great for the purposes of shareability, but it comes at the cost of confusion since some page methods will require you to explicitly pass in references (versus doing so with closures or hoisting).
As an example, while using page.evaluate deep down in the bowels of the protocol, this literally stringifies the function and passes it into Chrome, so things like closures and hoisting won't work at all. If you need to pass some references or values into an evaluate call, simply append them as arguments which get properly handled.
Reference: Observations running 2 million headless sessions
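The "close it when you're not using it" and "one session per browser" advice above could be sketched in Python roughly as follows. This is only an illustration: `make_driver` is a hypothetical factory you would supply yourself, and the sketch assumes the object it returns has `get()`, `title`, and `quit()`, as Selenium WebDriver objects do.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(make_driver):
    """Open one fresh browser for one job and always quit it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        # quit() ends the whole browser session, even if the job raised an error
        driver.quit()


def run_jobs(urls, make_driver):
    """Visit each URL in its own short-lived browser and collect page titles."""
    titles = []
    for url in urls:
        with browser_session(make_driver) as driver:
            driver.get(url)
            titles.append(driver.title)
    return titles
```

Because the `finally` clause runs even when a job fails, no browser is left open between jobs, at the cost of a fresh launch per job.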
Consider using Docker. It has well-documented features for limiting the usage of system resources like memory and CPU. The good news is that it's pretty easy to build a Docker image with headless Chrome (on top of X11) inside it.
There are lots of out-of-the-box solutions for that, for example: https://hub.docker.com/r/justinribeiro/chrome-headless/
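As a small illustration of Docker's resource limits, the sketch below builds a `docker run` command that uses the `--memory` and `--cpus` options. The limit values and the published port are assumptions chosen for the example, not recommendations, and the script only prints the command rather than running it.

```python
import shlex


def docker_run_command(image, memory="512m", cpus="1.0", port=9222):
    """Build a `docker run` argument list that caps memory and CPU for the container."""
    return [
        "docker", "run", "--rm",
        "--memory", memory,      # hard memory limit for the container
        "--cpus", cpus,          # number of CPUs the container may use
        "-p", f"{port}:{port}",  # publish the remote debugging port (assumed to be 9222)
        image,
    ]


if __name__ == "__main__":
    cmd = docker_run_command("justinribeiro/chrome-headless")
    # Print the command instead of running it; pass `cmd` to subprocess.run() to launch it.
    print(shlex.join(cmd))
```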

What is the best way to run python scripts in the background of Ubuntu?

I created a Python script that works as a bot for Instagram (using Selenium).
Currently I have 5 profiles running. For each of them I have all the files stored in a folder (named after the IG profile), and for each profile I have a screen session where I can see the "log" of each program.
But now 5 profiles are difficult to manage and sometimes also a little messy.
Is there a way to see the logs of all 5 scripts in a single window?
I'm also open to another way to run the scripts in the background, maybe not "screen" but something else.
Thank you
If you really want to go the clean way and if you think this will get bigger, you might want to have a look towards Django and Celery.
You can create a web interface, so that you can monitor any way you like.
And you can have cron jobs with Celery so that your bot is always on, or has recurring tasks, etc...
More info on their respective docs, as usual. http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html

Scraping Edgar with Python regular expressions

I am working on a personal project's initial stage of downloading 10-Q statements from EDGAR. Quick disclaimer, I am very new to programming and python so the code that I wrote is very basic, not even using custom functions and classes, just a very long script that I'm more comfortable editing. As a result, some solutions are quite rough (i.e. concatenating urls using CIKs and other search options instead of doing requests with "browser" headers)
I keep running into a problem that those who have scraped EDGAR might be familiar with. Every now and then my script just stops running. It doesn't raise any exceptions (I created some that append txt reports with links that can't be opened and so forth). I suspect that either SEC servers have a certain limit of requests from an IP per some unit of time (if I wait some time after CTRL-C'ing the script and run it again, it generates more output compared to rapid re-activation), alternatively it could be TWC that identifies me as a bot and limits such requests.
If it's SEC, what could potentially work? I tried learning how to work with TOR and potentially get a new IP every now and then but I can't really find some basic tutorial that would work for my level of expertise. Maybe someone can recommend something good on the topic?
Maybe the timers would work? Like force the script to sleep every hour or so (still trying to figure out how to make such timers and reset them if an event occurs). The main challenge with this particular problem is that I can't let it run at night.
Thank you in advance for any advice. I have been fighting with this for days, and at this stage it could take me more than a month to get what I want (before I even start tackling 10-Ks).
It seems like delays are pretty useful. I'm sitting at 3.5k downloads with no interruptions thanks to a simple:
import random
import time
time.sleep(random.randint(0, 1) + abs(random.normalvariate(0, 0.2)))
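A slightly fuller sketch of how that delay might sit inside a download loop, combined with the "sleep every so often" idea from the question. The `fetch` callable, the pause length, and the "every 500 requests" threshold are all hypothetical placeholders; nothing here reflects a documented SEC rate limit.

```python
import random
import time


def jittered_delay(rng=random):
    """Same formula as above: 0 or 1 second plus a small non-negative jitter."""
    return rng.randint(0, 1) + abs(rng.normalvariate(0, 0.2))


def download_all(urls, fetch, sleep=time.sleep, long_pause_every=500, long_pause=60):
    """Call `fetch(url)` for each URL, pausing after every request."""
    results = []
    for count, url in enumerate(urls, start=1):
        results.append(fetch(url))
        if count % long_pause_every == 0:
            sleep(long_pause)        # occasional longer rest
        else:
            sleep(jittered_delay())  # short random pause between requests
    return results
```

Passing `sleep` in as a parameter makes it easy to try the loop without actually waiting, for example by passing a function that only records the requested delays.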

How can I run a script constantly in background of App Engine website?

I'm trying to use Google App Engine (Python) to make a simple web app. I want to maintain one number x in the datastore that models a random walk. I need a script running 24 hours a day that, every second, randomly chooses to either increment or decrement x (saving the change to the datastore). Users should be able to go to a url to see the current value of x.
I've thought of two ways to accomplish the constant script issue:
1) I can have an admin-access page that runs a continuous loop in javascript which, each second, makes an AJAX request to the server to update x. If I leave this page open on my computer 24 hours a day, this should work. The problem with this approach is that if my computer crashes then the script dies with it.
2) I can use a CRON job. But the interval between jobs cannot be smaller than 1 minute, so this doesn't really work.
It seems like there should be a simple way to just run a script constantly (that exists only server side) with Google App Engine.
I appreciate any advice. Thanks for your time!
Start a backend instance using Modules (either programmatically or by hitting a special URL accessible to admins only). Run the script for as long as the instance lives.
Note that an instance can die, just like your computer can crash. For this reason, you are probably better off with a Google Compute Engine instance (choose the smallest) than with an App Engine instance. Note that the Compute Engine instance will be many times cheaper.
Compute Engine instances can also fail, though it is much less likely. There are ways to create a fail-over implementation (where one instance generates your random numbers while another instance, which can run on some other platform, waits for the first one to fail), but this will obviously cost more.
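Whichever kind of instance hosts it, the long-running script itself could look roughly like the sketch below. The `save` callback is a hypothetical stand-in for whatever writes x to the datastore, and `steps=None` means "run until the instance dies".

```python
import random
import time


def step(x, rng=random):
    """Move x up or down by 1 with equal probability."""
    return x + rng.choice((-1, 1))


def run_walk(x, save, steps=None, interval=1.0, sleep=time.sleep):
    """Update x every `interval` seconds, persisting each value via `save`."""
    done = 0
    while steps is None or done < steps:
        x = step(x)
        save(x)          # hypothetical stand-in for a datastore write
        sleep(interval)
        done += 1
    return x
```

Since every new value is passed to `save`, a replacement instance can read the last stored x and continue the walk from there after a failure.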
