How to store large scientific (microscopy) files? [closed] - python

Here's the problem:
In a laboratory, very large microscopy files are created (from 1 GB to 200 GB per file).
We store the metadata as JSON documents in MongoDB, but we cannot find a suitable local / open-source platform to store the files themselves.
We have tried Hadoop, but it is a very complex framework and we do not need most of its features. We only need a blob / object storage, ideally with a Python API, to read and write data via a self-built GUI.
We have already evaluated Ceph, OpenStack Swift, ownCloud, Gluster, etc., but each of them failed for us because of its maximum file-size limit; many of them cap a single object at 5 GB.
What is the best way to store these files?
We need the following features:
Python (and REST) API
No maximum file-size limit
Open source / local software
Object / blob storage
If possible, replication of the data
Unfortunately, for compliance reasons, cloud solutions are not an option.

Have you had a look at OMERO? It sounds as if it covers most of your requirements, although I don't know how far you can go with its Python API.
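For what it's worth, OMERO does ship a Python client (omero-py). A minimal, hedged sketch of talking to it is below; the host, port and credentials are placeholders, and whether its import path is practical for 100+ GB files needs checking against the OMERO documentation:

from omero.gateway import BlitzGateway

# Placeholder credentials and host; 4064 is OMERO's usual port.
conn = BlitzGateway("username", "password", host="omero.example.org", port=4064)
if conn.connect():
    # List the projects the account can see, just to prove the round trip works.
    for project in conn.getObjects("Project"):
        print(project.getId(), project.getName())
    conn.close()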

For cases like these, sometimes the best thing to do is to use the built-in file system to store your files.

How many files do you need to keep? A plain file system with a file share works really well for storing large binary data. You can store the metadata in MongoDB along with the path to the file.
One thing you may need to worry about is how many files you end up storing. In my experience, once you are storing thousands of files you need to work out how to distribute them across folders. If you store a hash of each object, you can write a function that derives the directory from the hash; if you're familiar with git, this is exactly how it stores objects (see the sketch below).
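A minimal sketch of that git-style fan-out, assuming a shared filesystem; the /data/blobstore root and the chunk size are placeholders:

import hashlib
import shutil
from pathlib import Path

STORE = Path("/data/blobstore")              # placeholder root on the file share

def file_digest(path, chunk_size=1 << 20):
    """Hash the file in chunks so 200 GB files never have to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        while block := fh.read(chunk_size):
            h.update(block)
    return h.hexdigest()

def store_file(src):
    """Copy src into STORE/<first 2 hex chars>/<rest of hash> and return that path."""
    digest = file_digest(src)
    dest_dir = STORE / digest[:2]            # e.g. /data/blobstore/ab/
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / digest[2:]
    shutil.copy2(src, dest)
    return dest                              # keep this path in the MongoDB metadata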

vaex is a library for loading dataframes larger than system memory allows. If you store your metadata in MongoDB with a field for the file path, you keep your query abilities while the data itself stays on the filesystem in a usable form.
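A hedged sketch of that split, with the big file left on disk and only metadata plus its path kept in MongoDB (database, collection and field names are made up for illustration):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
scans = client["microscopy"]["scans"]          # placeholder database/collection

scans.insert_one({
    "sample": "mouse_brain_42",                # example metadata fields
    "acquired": "2021-03-15",
    "path": "/data/blobstore/ab/3f9c",         # path returned by the storage layer
    "size_bytes": 187_000_000_000,
})

# Query the metadata, then open the file from the path (e.g. with vaex or tifffile).
doc = scans.find_one({"sample": "mouse_brain_42"})
print(doc["path"])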


Run a C++ program multiple times that reads a CSV file [closed]

I have a C++ console app that needs to be run multiple times. Every run it reads a large CSV file that doesn't change, which is a slow process.
Is there any way to just "load" the file into memory once and not have to read it every time I run the app?
I was thinking of the way R and Python work: you load a CSV as a dataframe once and can use it in other scripts without reloading it every time.
Each time your C++ app exits, its memory is freed, which means your data won't be kept for the next run. Therefore, if you store your data on your app's heap, it has to be read from the file each time you run the app.
If you really want to avoid reading the data from the filesystem each time, the easiest path is to use a separate process, i.e. an in-memory database such as Redis or SQLite, so you can read your CSV once, store it in the database's memory, and then access the data from your C++ app.
List of in-memory databases
In your case, I would suggest choosing Redis (easier than SQLite, since you don't need to create tables).
If you're not familiar with it, it's quite simple to get started: Redis is a key-value store.
You just need to install a Redis server for your environment, and then you can use a C++ library to store and retrieve data immediately. All you have to do is use two types of commands: SET (when you read your CSV file for the first time) and GET (when you access the data for processing).
The simplest way to store your data is probably to store each line of your CSV under a key composed of the file name and the line number. For instance, if your file is named artists_rock.csv, you can store line 909 like this:
SET ART_ROCK_909 "Lennon;John;Liverpool"
and you can get your record like that:
GET ART_ROCK_909
The key format is up to you; this one makes it easy to iterate or to access a line directly, just as if you were reading the file.
And if you use a C++ library to parse your CSV records (meaning you never manipulate the raw strings), you can also store each record as a Redis hash and manipulate it with HSET and HGET. The previous example would look like this:
HSET ART_ROCK_909 name "Lennon" firstname "John" birthplace "Liverpool"
and you would access data with
HGET ART_ROCK_909 birthplace
All you need to do is choose a C++ library to talk to your Redis server. There are many wrappers for the hiredis C library, such as redis-plus-plus, which you can find on GitHub.
Here is a getting-started code sample.
To keep the same example as above, the corresponding code would look like this:
#include <sw/redis++/redis++.h>
#include <iostream>
using namespace sw::redis;

int main() {
    try {
        // Create a Redis object, which is movable but NOT copyable.
        auto redis = Redis("tcp://127.0.0.1:6379");
        auto line = my_csv_reading_function();  // placeholder: read one CSV line as a std::string
        redis.set("ART_ROCK_909", line);
        auto val = redis.get("ART_ROCK_909");   // val is of type OptionalString; see 'API Reference'
        if (val) std::cout << *val << std::endl;
    } catch (const Error &err) {
        std::cerr << "Redis error: " << err.what() << std::endl;
    }
}
Assuming most of the time is wasted on CSV parsing rather than on filesystem operations, you can store your parsed data in a new file using fwrite. On the second execution, read that file with fread instead of parsing the CSV file again.
Pseudo-code:
data = allocate()
open 'file.parsed'
if successful:
    fread('file.parsed', data)   # this is supposed to be fast
else:
    parse('file.csv', data)      # this is slow; done only once
    fwrite('file.parsed', data)

Processing data from a large data grab

I've downloaded a large (>75 GB) data grab from archive.org containing most or all of the tweets from June 2020. The archive consists of 31 .tar files, each containing nested folders whose lowest level holds several compressed .json files. I need a way to access the data stored in this archive from my Python application. I would like to use MongoDB, since its document-based database structure seems well suited to this type of data. What would be the best way of doing so?
Here is what the archive looks like (you can find it here):
Any help would be appreciated.
Edit - to be clear, I am not set on using MongoDB. I am open to other database solutions as well.
MongoDB is probably not a good idea here, because MongoDB wants to keep its working set in RAM. Unless you have a cluster or similar, you almost certainly don't have enough RAM to host this content.
So you may want to filter the data down first if you still want to use MongoDB for this; a minimal sketch of that approach follows.
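The sketch below streams the tar files, keeps only a few fields per tweet, and bulk-inserts the result into MongoDB. The exact archive layout (member names ending in .json.gz, one JSON object per line, gzip rather than bz2 compression) is an assumption, so adjust it to what the tar files actually contain:

import tarfile, gzip, json
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["tweets_2020_06"]["filtered"]       # placeholder names

KEEP = ("id", "created_at", "text", "lang")              # only the fields you need

def filtered_tweets(tar_path):
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not member.name.endswith(".json.gz"):     # assumed member naming
                continue
            with gzip.open(tar.extractfile(member), "rt", encoding="utf-8") as fh:
                for line in fh:                          # assumed: one JSON object per line
                    tweet = json.loads(line)
                    yield {k: tweet.get(k) for k in KEEP}

batch = []
for doc in filtered_tweets("twitter-stream-2020-06-01.tar"):   # placeholder file name
    batch.append(doc)
    if len(batch) >= 1000:
        collection.insert_many(batch)                    # insert in batches, not per tweet
        batch.clear()
if batch:
    collection.insert_many(batch)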

Extracting only a few columns from a FITS file that is freely available to download online, using Python

I'm working on a model of the universe, using data available on the Sloan Digital Sky Survey (SDSS) site. The problem is that some files are larger than 4 GB (more than 50 GB in total), and while I know those files contain many data columns, I only want the data from a few of them. I had heard about web scraping, so I searched for how to do it, but that didn't help: all the tutorials explain how to download the whole file with Python. Is there any way to extract only a few columns from such a file, so that I get just the data I need without downloading the whole large file for a small fraction of its contents?
Sorry that my question is only words and no code; I'm not that experienced in Python. I just searched online and learned some basic web scraping, but it didn't solve my problem.
It would be even more helpful if you could suggest other ways to reduce the amount of data I have to download.
Here is the URL to download the FITS files: https://data.sdss.org/sas/dr12/boss/lss/
I only want to extract the columns that contain coordinates (ra, dec), distance, velocity, and redshift.
Also, is there a way to do the same thing with CSV files, or a general way to do it with any file?
I'm afraid what you're asking is generally not possible, at least not without significant effort and software support on both the client and server side.
First of all, FITS tables are stored in binary in a row-oriented layout, meaning that if you want to stream a portion of a FITS table you can read it one row at a time. But to read individual columns you would need to make a partial read of every single row in the table. Some web servers support what are called "range requests", meaning you can request only a few ranges of bytes from a file instead of the whole file. The web server has to have this enabled, and not all servers do. If FITS tables were stored column-oriented this could be feasible: you could download just the header of the file to determine the byte ranges of the columns, and then download just those ranges.
Unfortunately, since FITS tables are row-oriented, loading, say, 3 columns from a table with a million rows would require 3 million range requests, which would likely involve enough overhead that you wouldn't gain anything from it (and I'm honestly not sure what limits web servers place on how many ranges you can request in a single request, but I suspect most won't allow something so extreme).
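For illustration, this is roughly what a single range request looks like from Python. The URL is a placeholder, the server has to answer 206 Partial Content for it to work, and the first 2880 bytes are only a readable FITS header if the file is uncompressed:

import requests

url = "https://example.org/path/to/table.fits"               # placeholder URL
resp = requests.get(url, headers={"Range": "bytes=0-2879"})  # one 2880-byte FITS block

if resp.status_code == 206:                                  # 206 = server honoured the range
    print(resp.content[:80].decode("ascii", errors="replace"))  # first 80-byte header card
else:
    print("Server ignored the Range header; status", resp.status_code)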
There are other astronomy data formats (e.g. I think CASA Tables) that can store tables in a column-oriented format, and so are more feasible for this kind of use case.
Further, even if the HTTP limitations could be overcome, you would need software support for loading the file in this manner. This has been discussed to a limited extent here, but for the reasons above it would mostly be useful in a limited set of cases, such as loading one HDU at a time (not so helpful in your case if the entire table is in one HDU), or possibly some other specialized cases such as sections of tile-compressed images.
As mentioned elsewhere, Dask supports loading binary arrays from various cloud-based filesystems, but when it comes to streaming data from arbitrary HTTP servers it runs into similar limitations.
Worse still, I looked at the link you provided, and all the files there are gzip-compressed, so this is especially difficult to deal with: you can't know which byte ranges to request without decompressing the files first.
As an aside, since you asked: you will have the same problem with CSV, only worse, since CSV fields are typically not fixed-width, so there is no way to extract individual columns without downloading the whole file.
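For CSV, the best you can usually do is stream the download and throw away the unwanted columns as they arrive; you still transfer every byte, but you don't have to hold or store more than you need. A small sketch with pandas (the URL and column names are placeholders):

import pandas as pd

url = "https://example.org/big_catalog.csv"      # placeholder URL
wanted = ["ra", "dec", "z"]                       # placeholder column names

pieces = []
for chunk in pd.read_csv(url, usecols=wanted, chunksize=100_000):
    pieces.append(chunk)                          # only the selected columns are kept

subset = pd.concat(pieces, ignore_index=True)
subset.to_csv("catalog_subset.csv", index=False)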
For FITS, maybe it would be helpful to develop a web service capable of serving arbitrary extracts from larger FITS files. I don't know whether such a thing already exists, but I don't think it does in any very general sense. So this would a) have to be developed, and b) require whoever hosts the files you want to access to run such a service.
Your best bet is to just download the whole file, extract the data you need from it, and delete the original file, assuming you no longer need it. It's also possible that the information you need is already accessible through some online database.
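A hedged sketch of that download-then-trim step with astropy; the file name and column names (RA, DEC, Z) are examples only, so check the actual table with hdul.info() and its column list first:

from astropy.io import fits
from astropy.table import Table

# astropy can open gzip-compressed FITS files directly.
with fits.open("galaxy_DR12v5_CMASS_North.fits.gz") as hdul:
    table = Table(hdul[1].data)                  # the binary table is usually HDU 1

wanted = ["RA", "DEC", "Z"]                      # coordinates and redshift (example names)
subset = table[wanted]
subset.write("cmass_subset.fits", overwrite=True)
print(f"kept {len(wanted)} of {len(table.columns)} columns, {len(subset)} rows")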

Python or PowerShell - Import folder name & text file content in folder into Excel [closed]

I have been looking at some Python modules and PowerShell capabilities to try to import some data that a database recently kicked out in the form of folders and text files.
File Structure:
Top Level Folder > FOLDER (device hostname) > text file (which also contains the device hostname, and whose data I need to end up in a single cell in Excel)
The end result I am trying to accomplish is to have the first column contain the FOLDER (device name) and the second column contain the text of the text file within that folder.
I found some Python modules, but they all focus on pulling directly from a text document. I want the script or PowerShell function to iterate through each folder and pull out both the folder name and the text.
This is definitely doable in PowerShell. If I understand your question correctly, you're going to want to use Get-ChildItem and Get-Content, with -Recurse if necessary. As for the export, you're going to want Out-File, which can be a hassle when exporting directly to .xlsx. If you had some code to work with I could help better, but until then this should get you started in the right direction. I would read up on the Get-* cmdlets, because PowerShell is very simple to write but powerful.
Well, since you're asking in a general sense: you can do this project simply in either scripting language. If it were me, and I had to do this job once, I'd probably just bang out a PowerShell script to do it. If the script had to be run repeatedly, and might end up with more complex functions, I'd probably switch to Python (but that's just personal preference, as PowerShell is pretty powerful in a Windows environment). Really, this seems like a 15-line recursive function in either language.
You can look up "recursive function in powershell" (and the same for Python) and get plenty of code examples, as file tree walks (FTW :) ) are one of the most-solved problems on sites like this. You'd just replace whatever the other person does in their walk with a read of the file and a write. You probably also want to output in CSV format, because it's easier and imports into Excel just fine; a Python sketch of that approach follows.
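Since the question allows Python, here is a minimal hedged sketch of that walk: pair each device folder's name with the text of the file(s) inside it and write a CSV that Excel opens directly. The top-level path and the *.txt pattern are assumptions:

import csv
from pathlib import Path

top = Path(r"C:\exports\TopLevelFolder")            # placeholder top-level folder

with open("devices.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["Device", "FileContent"])      # column 1: folder name, column 2: file text
    for folder in sorted(p for p in top.iterdir() if p.is_dir()):
        for txt in folder.glob("*.txt"):            # the hostname-named text file(s)
            writer.writerow([folder.name, txt.read_text(encoding="utf-8")])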

storing uploaded photos and documents - filesystem vs database blob

My specific situation
Property management web site where users can upload photos and lease documents. For every apartment unit there might be 4 photos, so there won't be an overwhelming number of photos in the system.
For photos, there will be thumbnails of each.
My question
My #1 priority is performance. For the end user, I want to load pages and show the image as fast as possible.
Should I store the images inside the database, or file system, or doesn't matter? Do I need to be caching anything?
Thanks in advance!
While there are exceptions to everything, the general case is that storing images in the file system is your best bet. You can easily provide caching services to the images, you don't need to worry about additional code to handle image processing, and you can easily do maintenance on the images if needed through standard image editing methods.
It sounds like your business model fits nicely into this scenario.
File system. No contest.
The data has to go through a lot more layers when you store it in the db.
Edit on caching:
If you want to cache the file while the user uploads it to ensure the operation finishes as soon as possible, dumping it straight to disk (i.e. file system) is about as quick as it gets. As long as the files aren't too big and you don't have too many concurrent users, you can 'cache' the file in memory, return to the user, then save to disk. To be honest, I wouldn't bother.
If you are making the files available on the web after they have been uploaded and want to cache them to improve performance, the file system is still the best option. You'll get caching for free (you may have to adjust a setting or two) from your web server. You won't get this if the files are in the database.
After all that, it sounds like you should never store files in the database. That's not the case; you just need a good reason to do so.
Definitely store your images on the filesystem. One concern that folks don't consider enough when weighing these options is bloat: cramming images as binary blobs into your database is a really quick way to bloat your DB way up. With a large database come higher hardware requirements, more difficult replication and backup requirements, etc. Keeping your images on a filesystem means you can back them up and replicate them easily and simply with many existing tools. Storage space is also far easier to increase on a filesystem than in a database.
Comment on Sheepy's answer:
In general, storing files in SQL is better when the file size is less than 256 KB and worse when it is greater than 1 MB; between 256 KB and 1 MB it depends on several factors. Read this to learn more about the reasons to use SQL or the file system.
A DB might be faster than a filesystem for some operations, but loading a well-identified chunk of data of a few hundred KB is not one of them.
Also, a good frontend web server (like nginx) is way faster than any webapp layer you'd have to write to read the blob from the DB. In some tests, nginx is roughly on par with memcached for raw data serving of medium-sized files (like big HTML pages or medium-sized images).
Go FS. No contest.
Maybe on a slight tangent, but in this video from the MySQL Conference, the presenter talks about how the website SmugMug uses MySQL and various other technologies for superior performance. I think the video builds on some of the answers posted here, but also suggests ways of improving website performance beyond the scope of the DB.
