Scraping Edgar with Python regular expressions


I am working on the initial stage of a personal project: downloading 10-Q statements from EDGAR. Quick disclaimer: I am very new to programming and Python, so the code I wrote is very basic. It doesn't even use custom functions or classes; it is just one very long script that I'm more comfortable editing. As a result, some solutions are quite rough (e.g. concatenating URLs from CIKs and other search options instead of making requests with "browser" headers).
I keep running into a problem that those who have scraped EDGAR might be familiar with: every now and then my script just stops running. It doesn't raise any exceptions (I added some handlers that append links that can't be opened, and so forth, to txt reports). I suspect that either the SEC servers limit the number of requests from one IP per unit of time (if I wait a while after CTRL-C'ing the script and run it again, it generates more output than when I restart it right away), or TWC identifies me as a bot and limits such requests.
If it's the SEC, what could potentially work? I tried learning how to work with Tor to get a new IP every now and then, but I can't find a basic tutorial that would work for my level of expertise. Maybe someone can recommend something good on the topic?
Maybe timers would work? Like forcing the script to sleep every hour or so (I'm still trying to figure out how to make such timers and reset them if an event occurs). The main challenge with this particular problem is that I can't let it run at night.
Thank you in advance for any advice. I have been fighting with this for days, and at this stage it could take me more than a month to get what I want (before I even start tackling 10-Ks).

It seems like delays are pretty useful - sitting at 3.5k downloads with no interruptions thanks to a simple:
import random
import time
time.sleep(random.randint(0, 1) + abs(random.normalvariate(0, 0.2)))
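For anyone who wants the same idea wrapped in a loop, here is a minimal sketch. It assumes Python 3; `polite_delay`, `download_all` and the `fetch` callable are hypothetical names, not part of my actual script. The delay formula is the one above. Failures are collected instead of stopping the run.

```python
import random
import time


def polite_delay(base_max=1, jitter_sd=0.2):
    """Return a randomised pause in seconds, using the same formula as above."""
    return random.randint(0, base_max) + abs(random.normalvariate(0, jitter_sd))


def download_all(urls, fetch, sleep=time.sleep):
    """Call fetch(url) for every URL, pausing after each request.

    `fetch` is a hypothetical callable supplied by the caller (for example a
    wrapper around urllib that passes a timeout). Errors are recorded so that
    one bad link does not stop the whole run.
    """
    results, failures = {}, []
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception as exc:  # keep going, but remember what failed
            failures.append((url, repr(exc)))
        sleep(polite_delay())
    return results, failures
```

Passing a `fetch` that sets an explicit timeout may also help with the "script just stops" symptom, since a request without a timeout can wait forever, but that is a guess about the cause rather than something I confirmed.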

Related

Is it possible to see what a Python process is doing?

I created a Python script that grabs some info from various websites. Is it possible to analyze how long it takes to download the data and how long it takes to write it to a file?
I am interested in knowing how much running it on a better PC could improve things (it is currently running on a crappy old laptop).

If you just want to know how long a process takes to run, the time command is pretty handy. Just run time <command> and it will report how long the command took, broken down into a few categories: wall clock time, system/kernel time and user space time. This won't tell you which parts of the program are taking up the time. You can always use a profiler if you want or need that type of information.
That said, as Barmar said, if you aren't doing much processing of the sites you are grabbing, the laptop is probably not going to be a limiting factor.

You can always store the system time in a variable before the block of code that you want to test, do it again afterwards, then compare the two.
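A minimal sketch of that approach, assuming Python 3 (`time.perf_counter` is a standard-library clock intended for measuring elapsed time). The helper name `timed` and the `download_data` / `write_to_file` functions in the usage comment are hypothetical stand-ins for the asker's own code.

```python
import time


def timed(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Usage with hypothetical functions from your own script:
# data, download_secs = timed(download_data, url)
# _, write_secs = timed(write_to_file, data, "out.txt")
# print("download: %.2fs, write: %.2fs" % (download_secs, write_secs))
```

If the download time dominates, a faster PC is unlikely to change much.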

How to run a Python script online every N minutes?

To work a bit on my Python, I decided to try to code a simple script for my private use which monitors sites with offers and sends an email whenever a new offer I'm interested in pops up. I guess I could handle the coding part (extracting the newest offer from the HTML and such), but I've never really run a script online that needs to be fired every N minutes or so. What kind of hosting/server do I need to make my script run independently of my computer and refresh every, say, 5 minutes, sending me an email when there's an update?

If you have shell access, you can use crontab to schedule a recurring job.
Otherwise you can use a service like SetCronJob or EasyCron or similar to invoke a script regularly.
Some hosters also provide similar functionalities in their administration interface...

Python watch-dog script : load url asynchronously

I have a simple Python script which checks a few URLs:
f = urllib2.urlopen(urllib2.Request(url))
As I have the socket timeout set to 5 seconds, it is sometimes bothersome to wait 5 sec * number of URLs for the results.
Is there an easy, standardized way to run those URL checks asynchronously without big overhead? The script must use standard Python components on a vanilla Ubuntu distribution (no additional installations).
Any ideas?

I wrote something called multibench a long time ago. I used it for almost the same thing you want to do here, which was to call multiple concurrent instances of wget and see how long it takes to complete. It is a crude load testing and performance monitoring tool. You will need to adapt this somewhat, because this runs the same command n times.
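This is not multibench itself, just a minimal standard-library sketch of the same concurrent idea. It assumes Python 3 (`concurrent.futures` and `urllib.request`); on the Python 2 that `urllib2` implies, the `threading` module would be the closest equivalent. The names `check_url` and `check_all` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def check_url(url, timeout=5):
    """Return the HTTP status code, or the exception if the check failed."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except Exception as exc:
        return exc


def check_all(urls, check=check_url, max_workers=10):
    """Run `check` on every URL in parallel threads and map URL -> outcome."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs
        return dict(zip(urls, pool.map(check, urls)))
```

With enough workers, the total wait should be close to the slowest single check rather than 5 sec * number of URLs.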

Install additional software. It's a waste of time to re-invent something just because of some packaging decisions made by someone else.

GAE Backend fails to respond to start request

This is probably a truly basic thing that I'm simply having an odd time figuring out in a Python 2.5 app.
I have a process that will take roughly an hour to complete, so I made a backend. To that end, I have a backend.yaml that has something like the following:
- name: mybackend
  options: dynamic
  start: /path/to/script.py
(The script is just raw computation. There's no notion of an active web session anywhere.)
On toy data, this works just fine.
This used to be public, so I would navigate to the page, the script would start, and it would time out after about a minute (HTTP + 30s shutdown grace period, I assume). I figured this was a browser issue, so I repeated the same thing with a cron job. No dice. I switched to using a push queue and adding a targeted task, since on paper it looks like it would wait for 10 minutes. Same thing.
All 3 time out after that minute, which means I'm not decoupling the request from the backend like I believe I am.
I'm assuming that I need to write a proper Handler for the backend to do work, but I don't exactly know how to write the Handler/webapp2 Route. Do I handle _ah/start/ or make a new endpoint for the backend? How do I handle the subdomain? It still seems like the wrong thing to do (I'm sticking a long process directly into a request of sorts), but I'm at a loss otherwise.

So the root cause ended up being doing the following in the script itself:
models = MyModel.all()
for model in models:
    # Magic happens
I was basically taking for granted that the query would automatically batch my Query.all() over many entities, but it was dying at the 1000th entry or so. I originally wrote that the script was pure computation because I completely ignored the fact that the reads can fail.
The actual solution for solving the problem we wanted ended up being "Use the map-reduce library", since we were trying to look at each model for analysis.

long time running python script

I have an application made up of the following parts:
client->nginx->uwsgi(python)
Some of the Python scripts can run for a long time (2-6 minutes). After a script finishes I should return content to the client, but the connection breaks with the error "gateway timeout 504". What can I use in my case to avoid this error?

So is your goal to reduce the run time of the scripts, or to not have them time out? Browsers are going to give up on a 6 minute request no matter what you try.
Perhaps try doing the work on the server, and then polling for progress with AJAX requests?
Or, if possible, try optimizing the scripts. For example, if you have some horribly slow SQL stuff going on, try cleaning that up.
Otherwise, without more information, a more specific answer is hard to give.
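To make the polling suggestion a bit more concrete, here is a rough sketch assuming Python 3 and a single server process. `start_job` and `job_status` are hypothetical helpers, not part of nginx or uwsgi. One request would call `start_job` and return the id immediately, and the AJAX poll would call `job_status`.

```python
import threading
import uuid

_jobs = {}                 # job id -> {"state": ..., "result": ...}
_lock = threading.Lock()   # protects _jobs across threads


def start_job(work):
    """Run work() in a background thread and return a job id right away."""
    job_id = uuid.uuid4().hex
    with _lock:
        _jobs[job_id] = {"state": "running", "result": None}

    def runner():
        try:
            outcome = {"state": "done", "result": work()}
        except Exception as exc:
            outcome = {"state": "failed", "result": repr(exc)}
        with _lock:
            _jobs[job_id] = outcome

    threading.Thread(target=runner, daemon=True).start()
    return job_id


def job_status(job_id):
    """Return a copy of the job record, or an 'unknown' record."""
    with _lock:
        return dict(_jobs.get(job_id, {"state": "unknown", "result": None}))
```

An in-memory dict like this only works if every request lands in the same process. With several uwsgi workers the job state would have to live somewhere shared, such as a database or a file.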

I once set up a system where the "main page" contained an iframe which showed the output of the long-running program as text/plain. I think the handler for the iframe content was a Python CGI script, running under an Apache server, which emitted all headers and then the program output line by line.
I don't know whether this would work under your configuration.

This heavily depends on your server setup (i.e. how easy it is to push data back to the client), but is it possible, while your lengthy application is running, to periodically send some “null” content (e.g. plain newlines, assuming your output is HTML) so that the browser thinks this is just a slow connection and not a stalled one?
