I am building an app that uses the Oanda API to retrieve historical forex data. There are 28 forex pairs and 10 years' worth of historical data. Since none of this data changes, I was planning on saving it into my database rather than blowing up the API.
My goal is to initially populate the database for all pairs and then update the data once per minute from then on.
What I can't figure out is how to do this effectively.
Where should the logic for this live inside the Django app?
How can I execute the initial population of the data so that it will save?
It's the saving that is giving me the most problems. As far as I can tell Django only likes to save model instances from the shell.
Any help would be greatly appreciated.
You might want to take a look at this answer and at Django's custom django-admin commands.
I hope this helps! =)
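To make the populate-then-update idea concrete, here is a minimal sketch of the scheduling logic only, written in plain Python so it does not depend on any particular API client. The function names, the three-day window size, and the one-minute candle spacing are assumptions for illustration; a custom management command's handle() method could call these and save each window's results through your models.

```python
from datetime import datetime, timedelta


def backfill_windows(start, end, step=timedelta(days=3)):
    """Yield (window_start, window_end) pairs covering [start, end) in order.

    Splitting a long history into small windows keeps each API request
    bounded; the step size here is an arbitrary placeholder.
    """
    cursor = start
    while cursor < end:
        window_end = min(cursor + step, end)
        yield cursor, window_end
        cursor = window_end


def resume_point(last_saved, history_start):
    """Return where the next fetch should begin.

    last_saved is the timestamp of the newest row already stored for a
    pair, or None when nothing has been stored yet.
    """
    if last_saved is None:
        return history_start
    # Assumes one row per minute, so the next missing row is a minute later.
    return last_saved + timedelta(minutes=1)
```

The initial population and the once-per-minute update can then share one code path: look up the newest stored timestamp for a pair, call resume_point, and walk backfill_windows from there up to the current time.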
Generally you should do operations like this inside a proper view.
If you want to save some data once per minute, just create a method that implements it and trigger it periodically, for example with an Ajax call once per minute. You don't have to re-render the page from the beginning; everything can work in the background.
Remember that you will need the psycopg2 module to interact with Postgres.
Related
Please assume I have no experience in developing the structure of websites.
I am currently working on a website that displays data fetched from an API. (I use Django and python for the web app)
I want to display the current Covid-19 data and analyze the trend in graphs and charts.
I am able to grab the data from the API, but my question is: for a site like this, where the numbers need updating on a daily basis, what is the best practice for retrieving the data? Should I save the API data to my database for easier access, or should I just call the API every time the web page is visited (which is already a necessity due to the day-to-day update) and process the data then? I am asking in terms of both performance and ease of processing the data.
I would appreciate it if anyone would share their experience.
I am building a Django web app which will essentially serve static data to the users. By static, I mean that admins will be able to upload new datasets but no data entries will be made by users. Effectively, once the data is uploaded, it will be read-only on request by a user.
Given that these are quite large datasets (200k+ rows), I figured that SQL would be the best way to store the data - this avoids reading large datasets into memory (as you'd have to with a pickle or json?). This has the added bonus of using Django models to access the data.
However, I am not sure of the best way to do this, or if there is a better alternative to SQL. I currently have an admin page that allows you to upload .xlsx files which are then parsed and added as model entries row-by-row. It takes FOREVER (30+ minutes for 100K rows). Perhaps I should be creating a whole new db outside of Django and then importing that somehow, but I can't find much documentation on how this could/should be done. Any ideas would be greatly appreciated! Thanks in advance for any wisdom.
You can try using the .csv file format instead of .xlsx. Python has libraries that make it easy to load .csv (comma-separated values) data into an SQL database. This answer could be of further assistance. I hope you find what you're looking for and happy coding!
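As a rough illustration of why CSV plus batched inserts tends to be much faster than saving one row at a time, here is a sketch using only the standard library. The table name dataset_row and the name/value columns are made up for the example, and sqlite3 stands in for whatever database you actually use. Inside Django, the comparable step would be building model instances in a list and passing them to bulk_create.

```python
import csv
import sqlite3


def load_csv(conn, csv_file, batch_size=5000):
    """Insert CSV rows (columns: name, value) in batches; return rows inserted.

    csv_file can be an open file or any iterable of text lines.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dataset_row (name TEXT, value REAL)"
    )
    reader = csv.DictReader(csv_file)
    total = 0
    batch = []
    with conn:  # one transaction: commits on success, rolls back on error
        for row in reader:
            batch.append((row["name"], float(row["value"])))
            if len(batch) >= batch_size:
                conn.executemany(
                    "INSERT INTO dataset_row VALUES (?, ?)", batch
                )
                total += len(batch)
                batch = []
        if batch:  # flush the final partial batch
            conn.executemany("INSERT INTO dataset_row VALUES (?, ?)", batch)
            total += len(batch)
    return total
```

Reading the file as a stream and flushing every batch_size rows also means the whole dataset never has to sit in memory at once.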
I'm learning Django and to practice I'm currently developing a clone page of YTS, it's a movie torrents repository*.
As of right now, I have scraped all the movies on the website and have them in a single db table called Movie with all the basic information of each movie (I'm planning on adding one more for Genre).
Every few days YTS will post new movies and I want my clone-web to automatically add them to the database. I'm currently stuck on deciding how to do this:
I was planning on comparing the movie id of the last movie in my db against the last movie in the YTS db each time a user enters the website, but that would mean making a request to YTS every time my page loads, and it would also mean executing some very slow code inside my index() view.
Another strategy would be to query the last time my db was updated (new entries were introduced) and, if it was, let's say, more than a day ago, request new movies from YTS. The problem with this is that I can't find any method to query the time of the last db update. Does such a method even exist?
I could also set up a cron job to update the information, but I'm having problems making changes from a separate Python function (I import django.db and such, but the interpreter refuses to execute Django db instructions).
So, all in all, what's the best strategy to update my database from a third party service/website without bothering the user with loading times? How do you set such updates in non-intrusive way to the user? How do you generally do it?
* I know a torrents website borders on the illegal, and I do not intend, in any way, to make my project available to the public.
I think you should definitely choose the third alternative; a cron job to update the database regularly seems the best option.
You don't need to use a separate Python function; you can schedule a task with Celery, which can be easily integrated with Django using django-celery.
The simplest way would be to write a custom management command and run it periodically from a cron job.
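On the question of finding out when the data was last refreshed: one common approach is to record that time yourself rather than asking the database for it. The sketch below assumes a one-row table named sync_state and a caller-supplied fetch_new function, both hypothetical, and uses sqlite3 in place of the project's real database. A periodic job could call it so that page views never trigger a request to the third-party site.

```python
import sqlite3
from datetime import datetime, timedelta


def sync_if_stale(conn, now, fetch_new, max_age=timedelta(days=1)):
    """Call fetch_new() only when the last recorded sync is older than max_age.

    Returns True when a sync ran and False when the data was fresh enough.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sync_state ("
        " id INTEGER PRIMARY KEY CHECK (id = 1),"
        " last_sync TEXT)"
    )
    row = conn.execute(
        "SELECT last_sync FROM sync_state WHERE id = 1"
    ).fetchone()
    if row is not None and now - datetime.fromisoformat(row[0]) < max_age:
        return False
    fetch_new()  # placeholder for "request new movies and insert them"
    with conn:
        # Record the sync time only after fetch_new() finished without error.
        conn.execute(
            "INSERT OR REPLACE INTO sync_state (id, last_sync) VALUES (1, ?)",
            (now.isoformat(),),
        )
    return True
```

Passing now in as an argument, instead of reading the clock inside the function, keeps the staleness rule easy to test.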
I have a basic personal project website that I am looking to learn some web dev fundamentals with and database (SQL) fundamentals as well (If SQL is even the right technology to use??).
I have the basic skeleton up and running but as I am new to this, I want to make sure I am doing it in the most efficient and "correct" way possible.
Currently the site has a main index (landing) page and from there the user can select one of a few subpages. For the sake of understanding, each of these sub pages represents a different surf break and they each display relevant info about that particular break i.e. wave height, wind, tide.
As I have already been able to successfully scrape this data, my main questions revolve around how would I go about inserting this data into a database for future use (historical graphs, trends)? How would I ensure data is added to this database in a continuous manner (once/day)? How would I use data that was scraped from an earlier time, say at noon, to be displayed/used at 12:05 PM rather than scraping it again?
Any other tips, guidance, or resources you can point me to are much appreciated.
This kind of data is called time series. There are specialized database engines for time series, but with a not-extreme volume of observations - (timestamp, wave height, wind, tide, which break it is) tuples - a SQL database will be perfectly fine.
Try to model your data as a table in Postgres or MySQL. Start by making a table and manually inserting some fake data in a GUI client for your database. When it looks right, you have your schema. The corresponding CREATE TABLE statement is your DDL. You should be able to write SELECT queries against your table that yield the data you want to show on your webapp. If these queries are awkward, it's a sign that your schema needs revision. Save your DDL. It's (sort of) part of your source code. I imagine two tables: a listing of surf breaks, and a listing of observations. Each row in the listing of observations would reference the listing of surf breaks. If you're on a Mac, Sequel Pro is a decent tool for playing around with a MySQL database, and playing around is probably the best way to learn to use one.
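A possible starting point for the two tables described above, as a sketch rather than a recommended final schema. The column names and units are guesses, and sqlite3 is used only so the example runs without a database server; the DDL would need small adjustments for Postgres or MySQL.

```python
import sqlite3

# Hypothetical schema: one row per surf break, one row per observation.
SCHEMA = """
CREATE TABLE IF NOT EXISTS surf_breaks (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS observations (
    id INTEGER PRIMARY KEY,
    surf_break_id INTEGER NOT NULL REFERENCES surf_breaks(id),
    observed_at TEXT NOT NULL,          -- ISO 8601 timestamp
    wave_height_m REAL,
    wind_kph REAL,
    tide_m REAL,
    UNIQUE (surf_break_id, observed_at) -- one observation per break per time
);
"""


def create_schema(conn):
    """Create both tables if they do not exist yet."""
    conn.executescript(SCHEMA)
```

With this in place, the kind of query a page needs is a plain join, for example SELECT b.name, o.observed_at, o.wave_height_m FROM observations o JOIN surf_breaks b ON b.id = o.surf_break_id.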
Next, try to insert data to the table from a Python script. Starting with fake data is fine, but mold your Python script to read from your upstream source (the result of scraping) and insert into the table. What does your scraping code output? Is it a function you can call? A CSV you can read? That'll dictate how this script works.
It'll help if this import script is idempotent: you can run it multiple times and it won't make a mess by inserting duplicate rows. It'll also help if this is incremental: once your dataset grows large, it will be very expensive to recompute the whole thing. Try to deal with importing a specific interval at a time. A command-line tool is fine. You can specify the interval as a command-line argument, or figure it out from the current time.
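One way to get both properties, sketched with sqlite3: a uniqueness constraint on (break, timestamp) plus INSERT OR IGNORE makes re-runs harmless, and a small filter restricts each run to one interval. The table layout and function names are assumptions for the example. Postgres spells the same idea INSERT ... ON CONFLICT DO NOTHING, and MySQL uses INSERT IGNORE.

```python
import sqlite3


def rows_in_interval(rows, start, end):
    """Keep rows whose ISO 8601 timestamp (second field) is in [start, end)."""
    # ISO 8601 strings in the same format sort in chronological order.
    return [r for r in rows if start <= r[1] < end]


def import_observations(conn, rows):
    """Insert observation tuples, skipping duplicates; return new row count.

    Each row is (surf_break_id, observed_at, wave_height_m, wind_kph, tide_m).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observations ("
        " id INTEGER PRIMARY KEY,"
        " surf_break_id INTEGER NOT NULL,"
        " observed_at TEXT NOT NULL,"
        " wave_height_m REAL, wind_kph REAL, tide_m REAL,"
        " UNIQUE (surf_break_id, observed_at))"
    )
    count_sql = "SELECT COUNT(*) FROM observations"
    before = conn.execute(count_sql).fetchone()[0]
    with conn:  # single transaction for the whole batch
        conn.executemany(
            "INSERT OR IGNORE INTO observations"
            " (surf_break_id, observed_at, wave_height_m, wind_kph, tide_m)"
            " VALUES (?, ?, ?, ?, ?)",
            rows,
        )
    return conn.execute(count_sql).fetchone()[0] - before
```

Running the import twice with the same rows inserts them once; the second call reports zero new rows.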
The general problem here, loading data from one system into another on a regular schedule, is called ETL. You have a very simple case of it, and can use very simple tools, but if you want to read about it, that's what it's called. If instead you could get a continuous stream of observations - say, straight from the sensors - you would have a streaming ingestion problem.
You can use the Linux subsystem cron to make this script run on a schedule. You'll want to know whether it ran successfully - this opens a whole other can of worms about monitoring and alerting. There are various open-source systems that will let you emit metrics from your programs, basically a "hey, this happened" tick, see these metrics plotted on graphs, and ask to be emailed/texted/paged if something is happening too frequently or too infrequently. (These systems are, incidentally, one of the main applications of time-series databases). Don't get bogged down with this upfront, but keep it in mind. Statsd, Grafana, and Prometheus are some names to get you started Googling in this direction. You could also simply have your script send an email on success or failure, but people tend to start ignoring such emails.
You'll have written some functions to interact with your database engine. Extract these in a Python module. This forms the basis of your Data Access Layer. Reuse it in your Flask application. This will be easiest if you keep all this stuff in the same Git repository. You can use your chosen database engine's Python client directly, or you can use an abstraction layer like SQLAlchemy. This decision is controversial and people will have opinions, but just pick one. Whatever database API you pick, please learn what a SQL injection attack is and how to use user-supplied data in queries without opening yourself up to SQL injection. Your database API's documentation should cover the latter.
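To illustrate the SQL injection point, here is a small Data Access Layer function that passes user-supplied text as a bound parameter instead of pasting it into the SQL string. It assumes a surf_breaks table with id and name columns already exists, and it uses sqlite3's ? placeholder; other drivers use a different placeholder style (psycopg2 uses %s, for instance), so check your driver's documentation.

```python
import sqlite3


def find_break_by_name(conn, name):
    """Look up surf breaks by exact name using a bound parameter."""
    # The driver sends `name` as data, so quotes inside it cannot change
    # the structure of the query.
    cursor = conn.execute(
        "SELECT id, name FROM surf_breaks WHERE name = ?", (name,)
    )
    return cursor.fetchall()


def find_break_by_name_unsafe(conn, name):
    """Counter-example: string formatting lets input rewrite the query."""
    # Do NOT do this. Input such as  x' OR '1'='1  matches every row.
    return conn.execute(
        f"SELECT id, name FROM surf_breaks WHERE name = '{name}'"
    ).fetchall()
```

With a malicious input like x' OR '1'='1, the parameterized version returns no rows while the string-formatted version returns the whole table.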
The / page of your Flask application will be based on a SQL query like SELECT * FROM surf_breaks. Render a link to the break-specific page for each one.
You'll have another page like /breaks/n where n identifies a surf break (an integer that increments as you insert surf break rows is customary). This page will be based on a query like SELECT * FROM observations WHERE surf_break_id = n. In each case, you'll call functions in your Data Access Layer for a list of rows, and then in a template, iterate through those rows and render some HTML. There are various Javascript and Python graphing libraries you can feed this list of rows into and get graphs out of (client side or server side). If you're interested in something like a week-over-week change, you should be able to express that in one SQL query and get that dataset directly from the database engine.
For performance, try not to get in a situation where more than one SQL query happens during a page load. By default, you'll be doing some unnecessary work by going back to the database and recomputing the page every time someone requests it. If this becomes a problem, you can add a reverse proxy cache in front of your Flask app. In your case this is easy, since nothing users do to the app causes its content to change. Simply invalidate the cache when you import new data.
I mostly work on backend stuff, but in my current project I need to use Python to do some computing and visualize the results on Google Maps. Think of it as, for example, computing the geographical clusters of people tweeting in New York City.
The Python program runs for about 10 seconds and then outputs one iteration of data, which is a JSON object of coordinates. How should I connect this data to Google Maps?
My idea was to let Python write the data into a file and have JS check that file every few milliseconds. However, that sounds too hacky. Is there a better way to do it?
I'm really a newbie to JS, so please forgive my ignorance.
Thanks
The normal way an HTML page gets data from a backend service (like your coordinate generator every 10 seconds) is to poll a web service (usually, a JSON feed) for updates.
All of the dynamic Google Maps stuff happens within a browser, and that page polls a JSON endpoint, or uses something fancier like websockets to stream data into the browser window.
For the frontend, consider using jQuery, which makes polling JSON dead simple. Here are some examples.
Your "python program" should dump results into a simple database. While traditional relational databases like MySQL or PostgreSQL should suffice, I'd encourage you to use a NoSQL database such as MongoDB, which supports capped collections. This saves you from having to clean out old data on a cron schedule. It additionally allows storing data in ranged buckets for some cool playback-style histories.
You should then have a simple web server which can handle the JSON requests from the HTML frontend page, and simply pulls data from the MongoDB. This can be done quickly in any one of the python web frameworks like Flask, Bottle or Pyramid. You could also play with something a little sexier like node.js. The only requirement here is that a database driver exists for it.
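To show the shape of such a JSON endpoint without committing to a framework, here is a sketch using only Python's standard library. The /latest.json path, the LatestResult holder, and the in-memory storage are all assumptions for illustration; a Flask or Bottle route reading from the database would play the same role. fetch_latest mimics, from Python, the request the browser page would make on each poll.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class LatestResult:
    """Thread-safe holder for the most recent iteration's output."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = {"clusters": []}

    def set(self, value):
        with self._lock:
            self._value = value

    def get(self):
        with self._lock:
            return self._value


def make_handler(latest):
    """Build a request handler class that serves latest.get() as JSON."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/latest.json":
                self.send_error(404)
                return
            body = json.dumps(latest.get()).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # silence per-request logging in this sketch

    return Handler


def start_server(latest, port=0):
    """Serve in a background thread; port 0 lets the OS pick a free port."""
    server = HTTPServer(("127.0.0.1", port), make_handler(latest))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_latest(port):
    """Request /latest.json once, as the polling page would."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", "/latest.json")
        return json.loads(conn.getresponse().read())
    finally:
        conn.close()
```

The computing loop would call latest.set(...) after each iteration, and the page would request /latest.json on a timer and redraw the map markers from the response.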
Hope that gives a 10,000 foot view of what you need to do now.