Best practice in updating Django web app content when deployed on Heroku - python

I'm finally at the stage of deploying my Django web-app on Heroku. The web-app retrieves and performs financial analysis on a subset of public companies. Most of the content of the web-app is stored in .csv and .xlsx files. There's a master .xlsx file that I need to open and manually update on a daily basis. Then I run scripts which, based on the info in that .xlsx file, retrieve financial data, news, etc. and store the data in .csv files. My Django views and HTML templates then refer to those files (and not to a SQL/Postgres database). This is my setup in a nutshell.
Now I'm wondering what is the best way to make this run smoothly in production.
Shall I store these .xlsx and .csv files on AWS S3 and have Django access and update them from there? I assume that this way I can easily open and edit the master .xlsx file anytime. Is that a good idea, and would I run into any performance or security issues?
Is it better to convert all this data into a Postgres database on Heroku? Is there a good guideline on how I can technically do that? In that scenario, wouldn't it be much more challenging for me to edit the data in the master .xlsx file?
Are there any better ways you would suggest to handle this?
I'd highly appreciate any advice on this matter.

You need to make a trade-off between ease of use (accessing and updating the source XLSX file) and maintainability (storing the data safely and efficiently).
Option #1 is more convenient if you need to quickly open and change the file using your Excel/Numbers application. On the other hand your application needs to access physical files to perform the logic and render the views.
BTW, some time ago I created a repository, Heroku Files, to present some options for using external files.
Option #2 is typically better from a design point of view: the data is organised in the database and can be queried and manipulated more efficiently. The challenge in this case is that you need a way to view/edit the data, and this normally requires more development (creating new screens, etc.).
Involving a database is normally the preferred approach, as you can scale up to large datasets without problems (which is not the case with files).
On the other hand, if the XLSX file stays small and you only need simple (quick) updates, your current architecture can work.
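If you stay with Option #1, one way to keep the views independent of where the files live is to parse CSV content from raw bytes, whatever their origin. This is only a minimal sketch: it assumes the bytes have already been fetched (from local disk, or from S3 through a client library), and the function name `parse_csv_bytes` and the column names are hypothetical.

```python
import csv
import io
from typing import Dict, List


def parse_csv_bytes(raw: bytes, encoding: str = "utf-8") -> List[Dict[str, str]]:
    """Parse CSV content into a list of row dictionaries keyed by the header row."""
    text = raw.decode(encoding)
    # newline="" leaves line endings untouched so the csv module can handle them.
    reader = csv.DictReader(io.StringIO(text, newline=""))
    return list(reader)


# Sample content standing in for a file fetched from disk or object storage.
sample = b"ticker,price\nAAA,10.5\nBBB,20.0\n"
rows = parse_csv_bytes(sample)
```

A view could then pass `rows` to a template without caring whether the file came from the local filesystem or from a bucket. All values are returned as strings, so numeric columns still need converting.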

Related

Users change codes directly on the production server

I'm working on a project that has all its data written as Python files (code). Users with authentication can upload data through a web page, and this directly adds and changes code on the production server. This causes trouble every time I want to git pull changes from the git repo to the production server, since the code added by users directly on production is untracked.
I wonder if anyone has some other ideas. I know this is ill-designed, but this is what I got from the beginning, and implementing a database will require a lot of effort since all the current code is designed for Python files. I guess it's because the people who wrote this didn't know much about databases, and it works only because there is relatively little data.
The only two solutions I can think of are:
1. use a database instead of storing all the data as code
2. git add/commit/push all changes on the production server every time a user uploads data
Details added:
There is a 'data' folder of python files, with each file storing the information of a book. For instance, latin_stories.py has a dictionary variable text, an integer variable section, a string variable language, etc. When users upload a csv file according to a certain format, a program will automatically add and change the python files in the data folder, directly on the production server.

Django data storage - SQL or something else?

I am building a Django web app which will essentially serve static data to the users. By static, I mean that admins will be able to upload new datasets but no data entries will be made by users. Effectively, once the data is uploaded, it will be read-only on request by a user.
Given that these are quite large datasets (200k+ rows), I figured that SQL would be the best way to store the data - this avoids reading large datasets into memory (as you'd have to with a pickle or json?). This has the added bonus of using Django models to access the data.
However, I am not sure of the best way to do this, or if there is a better alternative to SQL. I currently have an admin page that allows you to upload .xlsx files which are then parsed and added as model entries row-by-row. It takes FOREVER (30+ minutes for 100K rows). Perhaps I should be creating a whole new db outside of Django and then importing that somehow, but I can't find much documentation on how this could/should be done. Any ideas would be greatly appreciated! Thanks in advance for any wisdom.
You can try using the .csv file format instead of .xlsx. Python has libraries that let you easily write .csv (comma-separated values) data to an SQL database. This answer could be of further assistance. I hope you find what you're looking for, and happy coding!
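Saving one row at a time is often the slow part of an import like this, so batching the inserts is worth trying. The following is a minimal sketch that uses the standard library's sqlite3 module as a stand-in for the real database; the table name, column names, and sample data are made up for illustration.

```python
import csv
import io
import sqlite3

# Sample CSV standing in for an uploaded dataset.
csv_text = "name,value\nalpha,1\nbeta,2\ngamma,3\n"

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute("CREATE TABLE dataset (name TEXT, value INTEGER)")

reader = csv.DictReader(io.StringIO(csv_text))
rows = [(r["name"], int(r["value"])) for r in reader]

# One transaction and one executemany call instead of a commit per row.
with conn:
    conn.executemany("INSERT INTO dataset (name, value) VALUES (?, ?)", rows)

row_count = conn.execute("SELECT COUNT(*) FROM dataset").fetchone()[0]
```

Within Django itself, the comparable tool is `Model.objects.bulk_create`, which takes a list of unsaved model instances and an optional `batch_size`, so the parsed rows can be written in a few queries rather than one per row.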

Django Server - How to prevent caching on csv files?

I have a server which generates some data on a daily basis. I am using D3 to visualise the data (d3.csv("path")).
The problem is I can only access the files if they are under my static_dir in the project.
However, if I put them there, they do eventually get cached and I stop seeing the updates, which is fine for css and js files but not for the underlying data.
Is there a way to put these files maybe in a different folder and prevent caching on them? Under what path will I be able to access them?
Or alternatively, how would it be advisable to structure my project differently in order to avoid this operation in the first place? At the moment, I have a separate process that generates the data and stores it in the given folder, which is independent from the server.
Many thanks,
Tony
When accessing the files you can always add ?t=RANDOM to the request in order to get a "new" data all the time.
Because each request uses a different URL, it counts as "new" and no cached copy is served, and on the client side the extra parameter doesn't really matter.
To get a new random you can use Date.now():
url = "myfile.csv?t="+Date.now()
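If the URL is built on the server instead (for example, passed into a template), the same idea can be sketched in Python. This is an illustrative helper only: the function name `cache_busted` and the example path are assumptions, not part of the original answer.

```python
import time
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def cache_busted(url: str) -> str:
    """Return url with a t=<milliseconds since epoch> query parameter appended."""
    parts = urlsplit(url)
    # Keep any query parameters that are already present.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("t", str(int(time.time() * 1000))))
    return urlunsplit(parts._replace(query=urlencode(query)))


busted = cache_busted("/static/data/myfile.csv")
```

The timestamp changes on every call, so each rendered page points at a URL the browser has not cached yet.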

What is the best strategy for web application to create temporary .csv files?

My application creates temporary .csv files to store some data rows. What is the best strategy for managing the creation of these temporary files and deleting them after the user logs out of the app?
I think creating temporary .csv files on the server isn't a good idea.
Is there any simple way to manage temporary file creation on the client machine (browser)?
These .csv files contain table records, which are later used as the source for d3.js visualization charts/elements.
Please share your experience with real-time applications for this scenario.
I'm using the Django framework (Python) for this.
Why create on-disk files at all? For smaller files, use an in-memory file object like StringIO.
If your CSV file sizes can potentially get large, use a tempfile.SpooledTemporaryFile() object; these dynamically swap out data to an on-disk file if you write enough data to them. Once closed, the file is cleared from disk automatically.
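Both approaches can be sketched with the standard library alone. The sample rows and the 1 MiB `max_size` threshold below are arbitrary choices for illustration.

```python
import csv
import io
import tempfile

rows = [["name", "value"], ["alpha", 1], ["beta", 2]]

# Small data: build the CSV entirely in memory.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()

# Potentially large data: kept in memory up to max_size bytes, then
# spilled to a temporary file that is removed when the object is closed.
with tempfile.SpooledTemporaryFile(
    max_size=1024 * 1024, mode="w+", newline="", encoding="utf-8"
) as spooled:
    csv.writer(spooled).writerows(rows)
    spooled.seek(0)  # rewind before reading the content back
    spooled_text = spooled.read()
```

Either string can then be returned in an HTTP response with a CSV content type, so nothing needs to be cleaned up when the user logs out.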
If your preference is to store the data client side, "HTML local storage" could be an option for you. This lets you store string data as key-value pairs in the user's browser, subject to a same-origin policy (so data for one origin (domain) is visible only to that origin). Browsers also typically limit the data size to around 5 MB per origin, and values must be strings, which is OK for CSV.
Provided that your visualisation pages/code is served from the same domain, the data in local storage will be accessible to it.

Is there a way to tell a browser to download a file as a different name than as it exists on disk?

I am trying to serve up some user-uploaded files with Flask, and have an odd problem, or at least one that I couldn't turn up any solutions for by searching. I need the files to retain their original filenames after being uploaded, so they will have the same name when the user downloads them. Originally I did not want to deal with databases at all, and solved the problem of filename conflicts by storing each file in a randomly named folder, and just pointing to that location for the download. However, stuff came up later that required me to use a database to store some info about the files, but I still kept my old method of handling filename conflicts. I have a model for my files now, and storing the name would be as simple as adding another field, so that shouldn't be a big problem.
I decided, pretty foolishly after I had written the implementation, on using Amazon S3 to store the files. Apparently S3 does not deal with folders in the way a traditional filesystem does, and I do not want to deal with the surely convoluted task of figuring out how to create folders programmatically on S3. In retrospect, this was a stupid way of dealing with this problem in the first place, when stuff like SQLAlchemy exists that makes databases easy as pie.
Anyway, I need a way to store multiple files with the same name on S3, without using folders. I thought of just renaming the files with a random UUID after they are uploaded, and then when they are downloaded (the user visits a page and presses a download button, so I need not have the filename in the URL), telling the browser to save the file under its original name retrieved from the database. Is there a way to implement this in Python with Flask? When it is deployed I am planning on having the web server handle the serving of files; will it be possible to do something like this with the server? Or is there a smarter solution?
I'm stupid. Right in the Flask API docs it says you can pass the parameter attachment_filename to send_from_directory if the download name should differ from the filename in the filesystem. (In Flask 2.0 and later this parameter is called download_name.)
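Under the hood, this works by setting the Content-Disposition response header, which is plain HTTP and not specific to Flask. The sketch below illustrates the UUID-plus-original-name idea from the question using only the standard library; the helper names `storage_key` and `content_disposition` are hypothetical, and the header handling is a simplified version of what a framework would do.

```python
import uuid
from urllib.parse import quote


def storage_key() -> str:
    """Random name to store the file under, so equal filenames never collide."""
    return uuid.uuid4().hex


def content_disposition(original_name: str) -> str:
    """Build a Content-Disposition value asking the browser to save the
    download under original_name."""
    # ASCII fallback for older clients: non-ASCII characters become "?"
    # and double quotes are replaced so the quoted string stays valid.
    fallback = original_name.encode("ascii", "replace").decode("ascii").replace('"', "_")
    # Extended form (filename*) that carries the full UTF-8 name, percent-encoded.
    encoded = quote(original_name, safe="")
    return f"attachment; filename=\"{fallback}\"; filename*=UTF-8''{encoded}"


key = storage_key()  # store the upload under this key, keep the real name in the database
header = content_disposition("report 2020.csv")
```

The file is saved under `key`, the original name goes into the database row, and the header value is attached to the download response, whether Flask, the web server, or the storage service ends up serving the file.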
