storing uploaded photos and documents - filesystem vs database blob - python

My specific situation
Property management web site where users can upload photos and lease documents. Each apartment unit might have around 4 photos, so there won't be an overwhelming number of photos in the system.
For photos, there will be thumbnails of each.
My question
My #1 priority is performance. For the end user, I want to load pages and show the image as fast as possible.
Should I store the images inside the database, or file system, or doesn't matter? Do I need to be caching anything?
Thanks in advance!

While there are exceptions to everything, the general case is that storing images in the file system is your best bet. You can easily provide caching services to the images, you don't need to worry about additional code to handle image processing, and you can easily do maintenance on the images if needed through standard image editing methods.
It sounds like your business model fits nicely into this scenario.

File system. No contest.
The data has to go through a lot more layers when you store it in the db.
Edit on caching:
If you want to cache the file while the user uploads it to ensure the operation finishes as soon as possible, dumping it straight to disk (i.e. file system) is about as quick as it gets. As long as the files aren't too big and you don't have too many concurrent users, you can 'cache' the file in memory, return to the user, then save to disk. To be honest, I wouldn't bother.
If you are making the files available on the web after they have been uploaded and want to cache them to improve performance, the file system is still the best option. You'll get caching for free (you may have to adjust a setting or two) from your web server. You won't get this if the files are in the database.
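As a rough sketch of the filesystem approach (the upload directory, field name, and route names here are invented for illustration, not taken from the question), a minimal Flask view could save the upload to disk, keep only the filename in the database, and serve it back with long-lived cache headers so the browser and any proxy can cache it:

import os
import uuid
from flask import Flask, request, send_from_directory

app = Flask(__name__)
UPLOAD_DIR = "/var/www/uploads"  # hypothetical location on the file system

@app.route("/photos", methods=["POST"])
def upload_photo():
    f = request.files["photo"]  # assumes the form field is named "photo"
    name = f"{uuid.uuid4().hex}.jpg"
    f.save(os.path.join(UPLOAD_DIR, name))
    # store only `name` (or the full path) in your database row
    return {"filename": name}, 201

@app.route("/photos/<name>")
def get_photo(name):
    # in production you'd normally let nginx/Apache serve this directory directly
    return send_from_directory(UPLOAD_DIR, name, max_age=86400)  # cache for a day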
After all that it sounds like you should never store files in the database. Not the case, you just need a good reason to do so.

Definitely store your images on the filesystem. One concern that folks don't consider enough when considering these types of things is bloat; cramming images as binary blobs into your database is a really quick way to bloat your DB way up. With a large database comes higher hardware requirements, more difficult replication and backup requirements, etc. Sticking your images on a filesystem means you can back them up / replicate them with many existing tools easily and simply. Storage space is far easier to increase on filesystem than in database, as well.

A comment on Sheepy's answer:
In general, storing files in SQL is better when the file size is less than 256 kilobytes and worse when it is greater than 1 megabyte, so between 256 and 1024 kilobytes it depends on several factors. Read this to learn more about the reasons to use SQL or the file system.

A DB might be faster than a filesystem on some operations, but loading a well-identified chunk of data a few hundred KB in size is not one of them.
Also, a good frontend webserver (like nginx) is way faster than any webapp layer you'd have to write to read the blob from the DB. In some tests nginx is roughly on par with memcached for raw data serving of medium-sized files (like big HTML pages or medium-sized images).
Go FS. No contest.

Maybe on a slight tangent, but in this video from the MySQL Conference, the presenter talks about how the website SmugMug uses MySQL and various other technologies for superior performance. I think the video builds upon some of the answers posted here, but also suggests ways of improving website performance outside the scope of the DB.

Related

Extracting only a few columns from a FITS file that is freely available online to download using python

I'm working on a model of the universe for which I'm using data available on the Sloan Digital Sky Survey site. The problem is that some files are more than 4 GB large (more than 50 GB in total), and while I know those files contain a lot of data columns, I only want data from a few of them. I have heard about web scraping, so I searched for how to do it, but that didn't help, since all the tutorials explained how to download the whole file using Python. Is there any way I can extract only a few columns from such a file, so that I get just the data I need and don't have to download the whole large file for a small fraction of its data?
Sorry, my question is just words and no code, because I'm not that experienced in Python. I just searched online and learned how to do basic web scraping, but it didn't solve my problem.
It would be even more helpful if you could suggest some more ways to reduce the amount of data I'll have to download.
Here is the URL to download FITS files: https://data.sdss.org/sas/dr12/boss/lss/
I only want to extract columns that have coordinates(ra, dec), distance, velocity and redshifts from the files.
Also, is there a way to do the same thing with CSV files or a general way to do it with any file?
I'm afraid what you're asking is generally not possible, at least not without significant effort and software support on both the client and server side.
First of all, FITS tables are stored in a row-oriented binary layout, meaning that if you wanted to stream a portion of a FITS table you could read it one row at a time. But to read individual columns you would need to make partial reads of each row, for every single row in the table. Some web servers support what are called "range requests", meaning you can request only a few ranges of bytes from a file instead of the whole file. The web server has to have this enabled, and not all servers do. If FITS tables were stored column-oriented this could be feasible, as you could download just the header of the file to determine the byte ranges of the columns, and then download just the ranges for those columns.
Unfortunately, since FITS tables are row-oriented, if you wanted to load, say, 3 columns from a table containing a million rows, that would involve 3 million range requests, which would likely involve enough overhead that you wouldn't gain anything from it (and I'm honestly not sure what limits web servers place on how many ranges you can request in a single request, but I suspect most won't allow something so extreme).
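For what it's worth, here is a minimal sketch of a single range request with the requests library; the URL is a placeholder, and this only works when the server actually supports byte ranges:

import requests

url = "https://example.org/some_table.fits"  # placeholder URL
# FITS files are organised in 2880-byte blocks; ask for just the first block.
resp = requests.get(url, headers={"Range": "bytes=0-2879"})

if resp.status_code == 206:       # 206 Partial Content: the server honoured the range
    first_block = resp.content    # the first 2880 bytes of the file
else:                             # a plain 200 means the Range header was ignored
    print("Server does not support range requests")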
There are other astronomy data formats (e.g. I think CASA Tables) that can store tables in a column-oriented format, and so are more feasible for this kind of use case.
Further, even if the HTTP limitations could be overcome, you would need software support for loading the file in this manner. This has been discussed to a limited extent here but for the reasons discussed above it would mostly be useful for a limited set of cases, such as loading one HDU at a time (not so helpful in your case if the entire table is in one HDU) or possibly some other specialized cases such as sections of tile-compressed images.
As mentioned elsewhere, Dask supports loading binary arrays from various cloud-based filesystems, but when it comes to streaming data from arbitrary HTTP servers it runs into similar limitations.
Worse still, I looked at the link you provided and all the files there are gzip-compressed, which makes this especially difficult to deal with, since you can't know which byte ranges to request without decompressing the files first.
As an aside, since you asked, you will have the same problem with CSV, only worse since CSV fields are not typically in fixed-width format, so there is no way to know how to extract individual columns without downloading the whole file.
For FITS, maybe it would be helpful to develop a web service capable of serving arbitrary extracts from larger FITS files. I don't know whether such a thing already exists, but I don't think it does in a very general sense. So this would a) have to be developed, and b) require that whoever hosts the files you want to access runs such a service.
Your best bet is to just download the whole file, extract the data you need from it, and delete the original file assuming you no longer need it. It's possible the information you need is also already accessible through some online database.
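As a sketch of that workflow, once a file has been downloaded you can pull out just the columns you care about with astropy and throw the rest away; the filename and column names below are placeholders, so check the actual table header for the real ones:

import numpy as np
from astropy.io import fits

# Placeholder filename and column names; inspect the header for the real ones.
with fits.open("galaxy_catalog.fits.gz") as hdul:   # astropy reads gzipped FITS directly
    table = hdul[1].data                            # the binary table is usually in HDU 1
    print(hdul[1].columns.names)                    # list the available columns
    ra, dec, z = table["RA"], table["DEC"], table["Z"]

# Keep only the columns you need, then delete the original download.
np.savez("subset.npz", ra=ra, dec=dec, z=z)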

Most efficient way to store financial data (Python)

The data consists of Date, Open, High, Low, Close and Volume, and it's currently stored in a .csv file that is updated every minute, so as time goes by the file keeps growing and growing. The problem is that when I need 500 observations from the data, I have to import the whole .csv file, and that really is a problem, especially when I need to access the data fast.
In Python I use the data mostly in a data frame or panel.
I would also suggest using a DB; it is much more convenient to update tables in a DB than in a csv file, and if you have a substantial number of observations you will be able to access and manipulate your data much faster.
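As a rough sketch of that approach (the table and column names are made up for illustration), SQLite plus pandas lets you append each new minute bar as it arrives and read back only the most recent 500 rows:

import sqlite3
import pandas as pd

conn = sqlite3.connect("prices.db")  # hypothetical database file
conn.execute("""CREATE TABLE IF NOT EXISTS bars
                (date TEXT PRIMARY KEY, open REAL, high REAL,
                 low REAL, close REAL, volume INTEGER)""")

# Append one new observation each minute.
conn.execute("INSERT OR REPLACE INTO bars VALUES (?, ?, ?, ?, ?, ?)",
             ("2024-01-01 09:31", 101.2, 101.9, 100.8, 101.5, 12000))
conn.commit()

# Pull only the latest 500 observations instead of importing the whole history.
df = pd.read_sql("SELECT * FROM bars ORDER BY date DESC LIMIT 500", conn)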
Another solution is to keep separate updates in separate .csv files.
You can still keep your major file (the one which is regularly updated), and at the same time create separate files for each update.
You may want to check out RethinkDB; it gives you speed, reliability and flexible searching ability, and it has a good Python driver. I also recommend using Docker, because then, regardless of which database you choose, you can easily store your database's data inside a folder, and you can change that folder at any time (when you have a 1 TB drive now and want to move to a 4 TB drive later). Maybe using Docker in your project is more important than the choice of DB.
This was also my problem in 2017 and I struggled with the performance of relational databases. Switching to Redis was 20x faster, although very difficult to implement. To make things easier for myself and others, I wrote an open-source ORM for Python applications on Redis.
from datetime import datetime
import popoto

# DEFINE YOUR MODEL
class AssetPrice(popoto.Model):
    asset = popoto.KeyField()
    timestamp = popoto.SortedField(type=datetime, sort_by=('asset',))
    close_price = popoto.FloatField()

# ADD DATA AS IT STREAMS IN
AssetPrice.create(asset="Bitcoin", timestamp=datetime(2022, 1, 1, 12, 0, 0), close_price=47686.81)

# QUERY BY ASSET AND TIMESTAMP RANGE
AssetPrice.query.filter(
    asset="Bitcoin",
    timestamp__gte=datetime(2022, 1, 1), timestamp__lt=datetime(2022, 1, 2),
    values=('close_price',)
)
>>> [{'close_price': 47686.81}, ..]
it's available as an easy pip install:
pip install popoto
Documentation here: https://popoto.readthedocs.io
Popoto is free and open source, so feel free to copy it and customize for your own needs

Something wrong with using CSV as database for a webapp?

I am using Flask to make a small webapp to manage a group project; on this website I need to manage attendance and also meeting reports. I don't have the time to get into SQLAlchemy, so I need to know what the downsides of using CSV as a database might be.
Just don't do it.
The problem with CSV is …
a) Concurrency is not possible: when two people access your app at the same time, there is no way to make sure that they don't interfere with each other, overwriting each other's changes. There is no way to solve this when using a CSV file as a backend.
b) Speed: whenever you make changes to a CSV file, you need to reload and rewrite more or less the whole file. Parsing the file eats up both memory and time.
Databases were made to solve these issues.
I agree, however, that you don't need to learn SQLAlchemy for a small app.
There are lightweight alternatives that you should consider.
What you are looking for is an ORM - an object-relational mapper - which translates Python code into SQL and manages the SQL database for you.
Two good options are Peewee ORM and Pony ORM. Both are easy to use and translate all SQL into Python and vice versa. Both are free for personal use, but Pony costs money if you use it for commercial purposes. I highly recommend Peewee. You can start with SQLite as a backend, and if your app grows larger you can plug in MySQL or PostgreSQL easily.
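To give a feel for it, here is a minimal Peewee sketch with SQLite as the backend; the model and field names are just illustrative:

import datetime
from peewee import SqliteDatabase, Model, CharField, DateField, BooleanField

db = SqliteDatabase("project.db")  # hypothetical filename

class Attendance(Model):
    student = CharField()
    date = DateField()
    present = BooleanField(default=False)

    class Meta:
        database = db

db.connect()
db.create_tables([Attendance])

# Record and query attendance without writing any SQL by hand.
Attendance.create(student="Alice", date=datetime.date(2024, 3, 1), present=True)
absentees = Attendance.select().where(Attendance.present == False)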
Don't do it, CSV that is.
There are many other possibilities, for instance the sqlite database, python shelve, etc. The available options from the standard library are summarised here.
Given that your application is a webapp, you will need to consider the effect of concurrency on your solution to ensure data integrity. You could also consider a more powerful database such as postgres for which there are a number of python libraries.
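For instance, sqlite3 from the standard library needs no extra dependencies at all; a minimal sketch (table and column names invented for illustration) could look like this:

import sqlite3

# Standard-library option: no ORM required, and SQLite does its own file locking,
# so concurrent requests won't corrupt the data the way interleaved CSV writes can.
conn = sqlite3.connect("project.db")  # hypothetical filename
conn.execute("CREATE TABLE IF NOT EXISTS meetings (held_on TEXT, report TEXT)")
conn.execute("INSERT INTO meetings VALUES (?, ?)", ("2024-03-01", "Kickoff notes"))
conn.commit()
rows = conn.execute("SELECT * FROM meetings").fetchall()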
I think there's nothing wrong with that as long as you abstract away from it, i.e. make sure you have a clean separation between what you write and how you implement it. That will bloat your code a bit, but it will make sure you can swap out your CSV storage in a matter of days.
That is, pretend that you can persist your data as if you're keeping it in memory. Don't write "openCSVFile" in your Flask app; use "initPersistence()". Don't write "csvFile.appendRecord()"; use "persister.saveNewReport()". When and if you actually find CSV to be a bottleneck, you can just write a new persister plugin.
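A minimal sketch of that kind of separation might look like the following; the class and method names echo the hypothetical ones above and are not from any particular library:

import csv

class CsvPersister:
    """Storage backend that happens to use CSV; callers never need to know that."""
    def __init__(self, path="reports.csv"):
        self.path = path

    def save_new_report(self, date, text):
        with open(self.path, "a", newline="") as f:
            csv.writer(f).writerow([date, text])

    def all_reports(self):
        with open(self.path, newline="") as f:
            return list(csv.reader(f))

def init_persistence():
    # Later this can return, say, a SqlitePersister instead, with no caller changes.
    return CsvPersister()

persister = init_persistence()
persister.save_new_report("2024-03-01", "Kickoff notes")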
There are added benefits like you don't have to use a mock library in tests to make them faster. You just provide another persister.
I am absolutely baffled by how many people discourage using CSV as an database storage back-end format.
Concurrency: There is NO reason why CSV can not be used with concurrency. Just like how a database thread can write to one area of a binary file at the same time that another thread writes to another area of the same binary file. Databases can do EXACTLY the same thing with CSV files. Just as a journal is used to maintain the atomic nature of individual transactions, the same exact thing can be done with CSV.
Speed: Why on earth would a database read and write a WHOLE file at a time, when it can do what it does for ALL other storage formats: look up the starting byte of a record in an index file, SEEK to it in constant time, overwrite the data, comment out anything left over, and record the free space for later use in a separate index file, just as a database could zero out the bytes of any unneeded areas of a binary "row" and record the free space in a separate index file? I just do not understand this hostility to non-binary formats, when everything that can be done with one format can be done with the other... everything, except perhaps raw binary data compression, depending on the particular CSV syntax in use (special binary comments, etc.).
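To make the seek-with-an-index idea concrete, here is a minimal read-only sketch (filename invented) that records the starting byte of each row once and then jumps straight to any row; in-place updates would additionally need the padding and free-space bookkeeping described above:

offsets = []
with open("records.csv", "rb") as f:      # hypothetical file
    while True:
        pos = f.tell()
        if not f.readline():              # EOF
            break
        offsets.append(pos)               # starting byte of each row

def read_row(n):
    with open("records.csv", "rb") as f:
        f.seek(offsets[n])                # constant-time jump to row n
        return f.readline().decode().rstrip("\r\n").split(",")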
Emergency access: The added benefit of CSV is that when the database dies, which inevitably happens, you are left with a CSV file that can still be accessed quickly in the case of an emergency... which is the primary reason I do not EVER use binary storage for essential data that should be quickly accessible even when the database breaks due to incompetent programming.
Yes, the CSV file would have to be re-indexed every time you made changes to it in a spreadsheet program, but that is no different from having to re-index a binary database after the index/table gets corrupted/deleted/out-of-sync/etc.

S3 and Filepicker.io - Multi-file, zipped download on the fly

I am using Ink (Filepicker.io) to perform multi-file uploads and it is working brilliantly.
However, a quick look around the internet shows multi-file downloads are more complicated. While I know it is possible to spin up an EC2 instance to zip on the fly, this would also entail some wait-time on the user's part, and the newly created file would also not be available on my Cloudfront immediately.
Has anybody done this before, and what are the practical UX implications - is the wait time significant enough to negatively affect the user experience?
The obvious solution would be to create the zipped files ahead of time, but this would result in some (unnecessary?) redundancy.
What is the best way to avoid redundant storage while reducing wait times for on-the-fly folder compression?
You can create the ZIP archive on the client side using JavaScript. Check out:
http://stuk.github.io/jszip/

python post large files to django

I am trying to find the best way (most efficient way) to post large files from a python application to a Django server.
If I rely on raw_post_data on the Django side, then all the content needs to be in RAM before I can read it, which doesn't seem efficient at all if the received file is hundreds of megabytes.
Is it better to use the file upload methods Django has? That means using a multipart/form-data post.
Or maybe there is something better?
Laurent
I think only files smaller than 2.5 MB are stored in memory; any file larger than 2.5 MB is streamed, i.e. written to a temporary file in the temp directory.
reference:
http://simonwillison.net/2008/Jul/1/uploads/ and here http://docs.djangoproject.com/en/dev/topics/http/file-uploads/
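As a sketch of the multipart/form-data route on the Django side (the URL wiring and field name are assumptions), the documented chunks() API lets you write the upload to disk without holding the whole thing in RAM:

# views.py -- minimal sketch; assumes the client posts the file under the name "file"
import os
from django.http import JsonResponse

def upload(request):
    uploaded = request.FILES["file"]                 # an UploadedFile instance
    dest_path = os.path.join("/tmp", uploaded.name)  # don't trust client names in production
    with open(dest_path, "wb") as dest:
        for chunk in uploaded.chunks():              # iterates without loading the whole file
            dest.write(chunk)
    return JsonResponse({"size": uploaded.size})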
If you really want to optimize it and don't want Django to suffer whilst the bytes are being streamed and thus occupying one of the Django threads you can use the nginx upload module
(see also this blog post)
