Run Python scraping script on a server - python

I apologize ahead of time, as this may seem like a ridiculous question. I have used Python for regex, automation, and leveraging pandas for work. I would now like to run a web scraping script (nothing special, < 50 lines) 24 hours a day for personal use. I think I will need server access to do that (I can't run it on my laptop if the laptop is shut while I am sleeping), but I have no idea how to go about this.
Would I need to rent a virtual machine or physical server space? My intent is to run the script and send an email with the results; it's all very low tech and low capacity, more of a notification. How does one go about modifying their code to do something like this, and what kind of server access would I need? Any suggestions would be helpful!

UNIX style approach to the problem:
Rent a server ( VPS ) - you can find something for $2.5-$5
Create and upload your scraping script to the server
Create a crontab entry (cron is the tool responsible for executing your script on a regular basis - say once per hour or whatever you want)
Use the https://docs.python.org/3/library/email.html package to compose emails and smtplib to send them from Python (a sketch of both steps follows below)
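A minimal sketch of those last two steps, assuming the scraper collects its findings into a string and that you have SMTP credentials to send through (the host, addresses, password and file paths below are placeholders): a crontab entry such as 0 * * * * /usr/bin/python3 /home/you/scraper.py runs the script at the top of every hour, and the script itself could finish with something like:

# Hypothetical helper appended to the end of the scraping script.
import smtplib
from email.message import EmailMessage

def send_report(body):
    msg = EmailMessage()
    msg["Subject"] = "Scraper results"
    msg["From"] = "me@example.com"              # placeholder sender
    msg["To"] = "me@example.com"                # placeholder recipient
    msg.set_content(body)

    # Placeholder SMTP host and credentials - substitute your provider's values.
    with smtplib.SMTP_SSL("smtp.example.com", 465) as smtp:
        smtp.login("me@example.com", "app-password")
        smtp.send_message(msg)

send_report("scraped results go here")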

Related

How to prevent Heroku sleeping mode with Python code

I built a chatbot on Dialogflow and connected it to Heroku with Python code hosted on GitHub. How can I prevent Heroku from sleeping using Python? Could you share the code with me?
If you are using the Free Plan you cannot prevent the Dyno from sleeping (it sleeps after 30 minutes of inactivity).
There is a workaround if your chatbot runs on a web site: in that case you could send a dummy request (just to start the Dyno) to your webhook hosted on Heroku when the user accesses the page, for example:
<body onload="wakeUpCall();">
  <script language="javascript">
    // Fire a throwaway GET at the Heroku webhook as soon as the page loads,
    // so the dyno is already awake by the time the chatbot needs it.
    function wakeUpCall() {
      var xhr2 = new XMLHttpRequest();
      xhr2.open("GET", "https://mywebhook.herokuapp.com/", true);
      xhr2.send(null);
    }
  </script>
It is obviously not a perfect approach (it works only if you control the client, and it relies on the Dyno starting before the chatbot sends data to the webhook), but it is an option if you want to keep working with the Free plan.
First some things to keep in mind before you try to use the free dyno for something it wasn't intended to be used for:
Heroku provides 1000 free hours a month. This is enough to run only a single Heroku dyno at the free tier level. If you need to avoid the startup delay for two apps, then you'll need to pay for at least one of them.
Heroku still only allows a single free dyno to run on your app. Thus you might lose traffic when you are pushing new code (since the free web dyno has to shut down so a new one can be built).
There are undoubtedly other issues as well, but those are the main ones I can think of offhand.
Now the solution:
Really, you just need something to ping your site at least once every 30 minutes. You could write a script for this (a minimal sketch is included after the link below), but there is an extremely useful kind of tool that already does this and provides more benefit to you.
That would be an Availability (or Uptime) Monitoring tool. This is a tool that ensures your site is "still up and running" by pinging a URL every X minutes and checking that the response is valid and expected (e.g. a 200 status code and/or certain text on the page). These tools often also provide the benefit of contacting you if they receive an unexpected response (almost certainly an error) for too long.
Here is an example of an availability monitor:
https://uptimerobot.com/
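If you would rather roll your own pinger than rely on a monitoring service, a minimal sketch using only the standard library (the URL and the 25-minute interval are assumptions) could look like:

import time
import urllib.request

URL = "https://mywebhook.herokuapp.com/"        # placeholder app URL

# Hit the app slightly more often than the 30-minute sleep threshold.
while True:
    try:
        with urllib.request.urlopen(URL, timeout=30) as resp:
            print("pinged, status", resp.status)
    except Exception as exc:                    # keep looping even if one ping fails
        print("ping failed:", exc)
    time.sleep(25 * 60)

Of course, this script itself needs to run somewhere that never sleeps, which is exactly why an external monitoring service is usually the easier answer.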

Need help troubleshooting Google App Engine job that worked in dev but not production

I have been working on a website for over a year now, using Django and Python 3 primarily. A few of my buddies and I built a front end where a user enters some parameters and submits; this goes to GAE to run the job and return the results.
In my local dev environment, everything works well. I have two separate dev environments. One builds the entire service in a Docker container; this produces the desired results in roughly 11 seconds. The other environment runs the source files locally on my computer and connects to the Postgres database hosted in Google Cloud, with the Python application running locally. It takes roughly 2 minutes to run locally, since there is a lot of latency between the cloud and the POSTs/GETs from my local machine.
Once I perform the gcloud app deploy and attempt to run in production, it never finishes. I have some print statements built into the code, so I know it gets to the part where the submitted parameters are handed to the Python code. I monitor via this command on my local computer: gcloud app logs read.
I suspect that since my local computer is a beast (i7-7770 processor with 64 GB of RAM), it runs the whole thing with no problem, but that GAE isn't providing machines powerful enough to do the job efficiently (not enough compute, not enough RAM). That's my guess.
So, I need help in how to troubleshoot this. I tried changing my app.yaml file so that resources would scale to 16 GB of memory, but it would never deploy. I received an error 13.
One other note, after it spins around trying to run the job for 60 minutes, the website crashes and displays this message:
502 Server Error
Error: Server Error
The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
OK, so just in case anybody in the future is having a similar problem: the constant crashing of my Google App Engine workers was because of using Pandas dataframes in the production environment. I don't know exactly what Pandas was doing, but I kept getting memory errors that would crash the site, and it didn't appear to be occurring in a single line of code; it randomly happened somewhere in a Pandas DataFrame operation.
I am still using a Pandas Dataframe simply to read in a csv file. I then use
import pandas as pd

df = pd.read_csv("input.csv")                    # hypothetical filename
data_I_care_about = dict(zip(df.col1, df.col2))
# or
other_data = df.col3.values.tolist()
and then go to town with processing. As a note, on my local machine (my development environment, basically) it took 6 seconds to run from start to finish. That's a long time for a web request, but I was in a hurry, which is why I used Pandas to begin with.
After refactoring, the same job completed in roughly 200ms using python lists and dicts (again, in my dev environment). The website is up and running very smoothly now. It takes a maximum of 7 seconds after pressing "Submit" for the back-end to return the data sets and render on the web page. Thanks for the help peeps!
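For anyone curious what the pandas-free read step can look like, here is a minimal sketch using only the standard library csv module; the filename and the column names col1-col3 mirror the snippet above and are assumptions:

import csv

data_I_care_about = {}
other_data = []
with open("input.csv", newline="") as f:         # hypothetical filename
    for row in csv.DictReader(f):
        data_I_care_about[row["col1"]] = row["col2"]
        other_data.append(row["col3"])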

GUI for Python app that uses the Interactive Brokers API and will eventually run on EC2

I have an Interactive Brokers [IB] account and am using the IB API to make an automated trading system in python. Version 1.0 is nearing the test stage.
I am thinking about creating a GUI for it so that I can watch various custom indicators in real time and adjust trading parameters. Everything (IB TWS/IB Gateway and my app) runs on my local Windows 10 PC (I could run it on Ubuntu if that made it easier), with startup config files presently being the only way to adjust parameters, after which I watch the results scroll by on the console window.
Eventually I would like to run both IB TWS/IB Gateway and the app on Amazon EC2/AWS and access it from anywhere. I only mention this because it may be a consideration in how to set up the GUI now, to avoid having to redo it later.
I am not going to write this myself and will contract someone else to do it. After spending 30+ hours researching this, I still really have no idea of the best way to implement it (browser based, standalone app, etc.) or what skills the programmer would need, so that I can describe the job.
An estimate on how long it would take to get a bare bones GUI real-time displaying data from my app and real-time sending inputs back to my app would be additionally helpful.
The simplest and quickest way will probably be to add a GUI directly to your Python app. If you don't need it to be pretty or to run on mobile, I'd say go with Tkinter for simplicity. Then, connect to wherever the app is located and control it remotely.
Adding another component that will communicate with your Python App introduces a higher level of complexity which I think is redundant in this case.
You didn't specify in detail what kind of data you will require the app to display. If this includes any form of charting, I'd use an existing charting package such as Ninjatrader / Multicharts / Sierracharts to run my indicators and see position status, and restrict the GUI of the Python app to adjusting the trading parameters and reporting numerical stats.
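To give a feel for how bare-bones such a Tkinter panel can be, here is a rough sketch with one indicator readout refreshed every second and one adjustable parameter. The getter and setter functions are stubs standing in for your trading app; nothing here is part of the IB API.

import tkinter as tk

def get_indicator_value():
    return 0.0                                  # stub: would query the trading app

def set_trading_param(value):
    print("new parameter:", value)              # stub: would update the trading app

def refresh():
    indicator_var.set("Indicator: %.2f" % get_indicator_value())
    root.after(1000, refresh)                   # poll again in one second

def apply_param():
    set_trading_param(float(param_entry.get()))

root = tk.Tk()
root.title("Trading monitor")

indicator_var = tk.StringVar(value="Indicator: -")
tk.Label(root, textvariable=indicator_var).pack(padx=10, pady=5)

param_entry = tk.Entry(root)
param_entry.pack(padx=10)
tk.Button(root, text="Apply", command=apply_param).pack(pady=5)

refresh()
root.mainloop()

A contractor could wire those two stubs to your app's state and grow the panel from there.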

Is there any way to run my script on the cloud at a regular interval?

I have written a Python script to capture real-time air quality information and weather conditions around the world. The data on the source website is updated once an hour.
So I need to re-run my script every hour and update my saved data files. I only have a laptop, and it is impossible to keep it running continuously for about a year without shutting it down, just for this job.
Given that, I want to ask: is there any website where I can upload and run my scripts, and have the results saved in the cloud?
Added:
My script is pretty simple, built with Python's BeautifulSoup.
My target website offers an API key and I have one.
From my testing, the script runs on both OS X and CentOS.
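For context, the fetch-and-parse core of such a script usually fits in a few lines; the URL and the tag looked up below are placeholders, not the real source site:

import urllib.request
from bs4 import BeautifulSoup                   # pip install beautifulsoup4

# Placeholder URL; the real data source and its markup are not shown here.
html = urllib.request.urlopen("https://example.com/air-quality").read()
soup = BeautifulSoup(html, "html.parser")
print(soup.find("title").get_text())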
I think scrapinghub meets your requirements perfectly, though there is a data retention limit, at least in the free and basic plans.

Port desktop to web application (bioinformatic)

I want to port a few bioinformatics programs which I wrote for Windows to web applications. I'm using a few bioinformatics packages like BLAST, Bowtie and Primer3. These external tools usually take a file which the user provides, process it, and create an output file which I parse and display. In addition, these tools use specific databases, which are created and reused by the user.
Up to now I have been saving the databases created by the tools (the input file is also provided by the user) and the output results on the PC where my software is installed. Now I do not know how to handle such a setup on a web server. I cannot save all the databases created by users from all over the world, but at the same time it is quite nasty to create a database again every time the user comes back (e.g. the human genome db is 2.7 GB and takes some time to build). I guess one user creates about 5-10 databases per tool; I have 3 tools, and the databases range from 1 MB to 50 GB.
How can this problem be solved with web apps?
Edit
To make things more clear: I actually only want to know whether there is a more sophisticated way to reuse the data which the user creates. I was thinking about storing those files temporarily for a session. There is no possibility of charging for this, because the tools are quite specific and I don't have many users; in addition, most users are close colleagues. After years of fighting with different OSes, debugging, and maintaining my programs, I have finally given up (I do this in my private time); it is simply too time consuming (in addition, I have some requests for Linux, Android and iOS).
Thanks
