Saving multiple pandas dataframes to one file and then reading them back - python

I am working on a Python app, and I am trying to plan how saving/loading files will work.
In this app, multiple data sheets will be loaded and all used in some capacity. The problem I am having is that I want users to be able to save the changes to the files they have imported.
I was thinking of a single save file that holds the content of all the files they have loaded into the app, but I have no idea how to structure this! I've done some research and heard that parquet/feather are good formats for saving files, but I don't know if they support saving multiple data frames to the same file.
The most important part is that the files need to be loadable by pandas (or another library), so that a user can save the changes they've made and then load them back up later if they were so inclined.
Any advice is appreciated!
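One simple option (a minimal sketch, assuming the frames fit in memory; the names `sheets` and `app_save.pkl` are made up for illustration) is to keep all loaded frames in a dict and pickle the whole dict as one save file:

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical example frames standing in for the user's imported sheets.
sheets = {
    "sales": pd.DataFrame({"item": ["a", "b"], "qty": [1, 2]}),
    "costs": pd.DataFrame({"item": ["a", "b"], "price": [3.0, 4.0]}),
}

save_path = os.path.join(tempfile.gettempdir(), "app_save.pkl")

# A dict of DataFrames pickles as a single object, so one file
# round-trips the whole collection.
with open(save_path, "wb") as fh:
    pickle.dump(sheets, fh)

# Later: load everything back in one call.
with open(save_path, "rb") as fh:
    restored = pickle.load(fh)
```

For the formats mentioned in the question: parquet and feather hold one frame per file, so they would need one file per sheet (or an archive of them), whereas HDF5 via `pd.HDFStore` can store several frames in one file under separate keys. Pickle is the least portable of these across Python/pandas versions, so it suits app-internal save files more than interchange.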

Related

How to store all files universally in memory (variable) Python

Essentially, I would want to be able to go through a folder with text files, jpg files, csv files, png files, any kind of file, and be able to load it into memory as some kind of object. When necessary, I would then like to be able to save it and create an instance on disk. This would need to work for any kind of file type.
I would create a class that would contain the file data itself as well as metadata, but that is not necessary for my question.
Is this possible, and if so, how can I do it?
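It is possible, because any file type can be held as raw bytes. A minimal sketch of the class the question describes (the name `StoredFile` is made up; metadata is kept deliberately small):

```python
import os

class StoredFile:
    """Hold a file's raw bytes plus minimal metadata in memory."""

    def __init__(self, path):
        # Reading in binary mode works for any file type: text, jpg, csv, png...
        with open(path, "rb") as fh:
            self.data = fh.read()
        self.name = os.path.basename(path)
        self.size = len(self.data)

    def save(self, target_dir):
        """Write the bytes back out, recreating the file on disk."""
        out_path = os.path.join(target_dir, self.name)
        with open(out_path, "wb") as fh:
            fh.write(self.data)
        return out_path
```

Walking a folder then becomes a comprehension over `os.listdir`, building one `StoredFile` per entry. Note this keeps every file fully in memory, so it only suits modest file sizes.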

Best way to link excel workbooks feeding into each other for memory efficiency. Python?

I am building a tool which displays and compares data for a non-specialist audience, and I have to automate the whole procedure as much as possible.
I am extracting select data from several large data sets, processing it into a format that is useful, and then displaying it in a variety of ways. The problem I foresee is in the updating of the model.
I don't really want the user to have to do anything more than download the relevant files from the relevant database, rename them, and save them to a location; the spreadsheet should do the rest. The user will then be able to look at the data in a variety of ways and perform a few different analytical functions, depending on what they are looking at, and output some graphs, etc.
Although some database-exported files won't be that large, other data will be pulled from very large XML or CSV files (500000x50 cells), and several arrays work on the pulled data once it has been chopped down to the minimum possible. So it will be necessary to open and update several files in order, so that the data in the user control panel stays up to date, and not all at once, so that the user's machine doesn't freeze.
At the moment I am building all of this using only Excel formulas.
My question is how best to do the updating and feeding part. Perhaps some kind of controller program built with Python? I don't know Python, but I have other reasons to learn it.
Any advice would be very welcome.
Thanks
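A Python controller for the "feeding" step could use pandas' chunked CSV reading, which avoids loading a 500000x50 file into memory at once. A hedged sketch (the function name and parameters are made up for illustration; real use would aggregate per chunk rather than concatenate everything):

```python
import pandas as pd

def refresh_extract(csv_path, usecols, out_path, chunksize=100_000):
    """Stream a large CSV in chunks so the full file never sits in memory."""
    pieces = []
    for chunk in pd.read_csv(csv_path, usecols=usecols, chunksize=chunksize):
        pieces.append(chunk)              # or reduce/aggregate each chunk here
    result = pd.concat(pieces, ignore_index=True)
    result.to_csv(out_path, index=False)  # slimmed file the spreadsheet can read
    return result
```

Dropping unneeded columns via `usecols` at read time is usually the biggest win; `to_excel` could replace `to_csv` if the output must be a workbook (that requires an Excel writer engine such as openpyxl).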

pyspark MLUtils saveAsLibSVMFile saving only under _temporary and not saving on master

I use pyspark,
and I use MLUtils.saveAsLibSVMFile to save an RDD of LabeledPoints.
It works, but it keeps the files on all the worker nodes under /_temporary/ as many part files.
No error is thrown. I would like to save the files in the proper folder, and preferably to save all the output to one libsvm file located on the nodes or on the master.
Is that possible?
Edit:
No matter what I do, I can't use MLUtils.loadLibSVMFile() to load the libsvm data from the same path I used to save it. Maybe something is wrong with writing the file?
This is normal behavior for Spark. All writing and reading is performed in parallel directly by the worker nodes; data is not passed to or from the driver node.
This is why reading and writing should be performed using storage that can be accessed from every machine, such as a distributed file system, an object store, or a database. Using Spark with a local file system has very limited applications.
For testing you can use a network file system (it is quite easy to deploy), but it won't work well in production.
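If the data is small enough to `collect()` onto the driver, one workaround (a plain-Python sketch, not the Spark-native path; the function name is made up) is to write a single libsvm file yourself. The libsvm line format is just a label followed by `index:value` pairs, with indices conventionally 1-based:

```python
def write_libsvm(labeled_points, path):
    """Write (label, {index: value}) pairs as one libsvm-format text file."""
    with open(path, "w") as fh:
        for label, features in labeled_points:
            # libsvm expects feature indices in ascending order.
            feats = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
            fh.write(f"{label} {feats}\n")
```

In Spark itself, the equivalent idea is `rdd.coalesce(1)` before saving (acceptable only for small outputs, since it funnels everything through one task) combined with a path on shared storage such as HDFS.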

Saving File to Dictionary - Python

Is there any way to save a file to a dictionary under python?
(Indeed I am not asking how to export dictionaries to files here.)
Maybe a file could be pickled or transformed into a Python object and then saved.
Is this generally advisable?
Or should I only save the file's path to the dictionary?
How would I retrieve the file later on?
The background of my question relates to my usage of dictionaries as databases. I use the handy little module sqliteshelf as a form of permanent dictionary: https://github.com/shish/sqliteshelf
Each dataset includes a unique config file (~500 kB) which is retrieved from an application. Upon opening the respective dataset, the config file is copied into, and back out of, the working directory of the application. I might use a folder instead, where I save the config files. Yet it strikes me as more elegant to save them together with the other data.
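Storing the file's raw bytes in the dictionary works, because `bytes` pickle cleanly. A sketch using the standard library's `shelve` (the function names and the `"{dataset_id}:config"` key scheme are made up; sqliteshelf exposes the same dict-style interface):

```python
import shelve

def stash_config(db_path, dataset_id, config_path):
    """Store a config file's raw bytes inside a shelve-backed dictionary."""
    with open(config_path, "rb") as fh:
        blob = fh.read()
    with shelve.open(db_path) as db:
        db[f"{dataset_id}:config"] = blob   # bytes pickle cleanly

def restore_config(db_path, dataset_id, out_path):
    """Copy the stored bytes back into the working directory."""
    with shelve.open(db_path) as db:
        blob = db[f"{dataset_id}:config"]
    with open(out_path, "wb") as fh:
        fh.write(blob)
```

At ~500 kB per config this is well within what a SQLite-backed shelf handles comfortably; storing only the path would be preferable mainly for much larger files or ones edited by external tools in place.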

Choice of technology for loading large CSV files to Oracle tables

I have come across a problem and am not sure which technology would be best suited to implement it. I would be obliged if you could suggest some options based on your experience.
I want to load data from 10-15 CSV files, each of them fairly large (5-10 GB). By "load data" I mean convert the CSV files to XML and then populate around 6-7 staging tables in Oracle using this XML.
The data needs to be populated such that the elements of the XML, and eventually the rows of the table, come from multiple CSV files. So, for example, an element A would have sub-elements with data coming from CSV file 1, file 2, file 3, etc.
I have a framework built on top of Apache Camel and JBoss on Linux. Oracle 10g is the database server.
Options I am considering:
1. Smooks - however, the problem is that Smooks serializes one CSV at a time, and I can't afford to hold on to the half-baked Java beans until the other CSV files are read, since I run the risk of running out of memory given the sheer number of beans I would need to create and hold before they are fully populated and written to disk as XML.
2. SQL*Loader - I could skip the XML creation altogether and load the CSVs directly into the staging tables using SQL*Loader. But I am not sure whether I can (a) load multiple CSV files with SQL*Loader into the same tables, updating the records after the first file, and (b) apply some translation rules while loading the staging tables.
3. A Python script to convert the CSV to XML.
4. SQL*Loader to load a different set of staging tables corresponding to the CSV data, and then a stored procedure to load the actual staging tables from this new set (a path I want to avoid, given the amount of change it would require to my existing framework).
Thanks in advance. If someone can point me in the right direction or share some insights from personal experience, it will help me make an informed decision.
regards,
-v-
PS: The CSV files are fairly simple, with around 40 columns each. The depth of objects/relationships between the files would be around 2 to 3.
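For reference, option 3 could be sketched as below (a hedged sketch; the function name and tag names are made up, and column names are assumed to be valid XML tag names). Note that this builds the whole tree in memory, so for 5-10 GB inputs you would stream-write each row element as you go instead:

```python
import csv
import xml.etree.ElementTree as ET

def csv_to_xml(csv_path, root_tag="rows", row_tag="row"):
    """Turn each CSV row into an XML element, one sub-element per column."""
    root = ET.Element(root_tag)
    with open(csv_path, newline="") as fh:
        for record in csv.DictReader(fh):
            row = ET.SubElement(root, row_tag)
            for col, value in record.items():
                ET.SubElement(row, col).text = value
    return ET.ElementTree(root)
```

Merging rows from several CSVs into one element (the sub-element case described above) would require a join key and either sorted inputs or an on-disk index, which is exactly the memory problem noted for Smooks.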
Unless you can use a full-blown ETL tool (e.g. Informatica PowerCenter, Pentaho Data Integration), I suggest the 4th solution - it is straightforward, and the performance should be good, since Oracle will handle the most complicated part of the task.
In Informatica PowerCenter you can import/export XMLs larger than 5 GB. As in Marek's response, give it a try; it works pretty fast. Here is a brief introduction if you are unfamiliar with the tool.
Create a process/script that calls a procedure to load the CSV files into an external Oracle table, and another script to load them into the destination table.
You can also add cron jobs calling these scripts to keep track of incoming CSV files in the directory, process them, and move each CSV file to an output/processed folder.
Exceptions can also be handled accordingly, by logging them or sending out an email. Good luck.
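The directory-watching part of that flow could be sketched in Python like this (the function name is made up; `loader` stands in for whatever kicks off SQL*Loader or a stored procedure):

```python
import os
import shutil

def process_incoming(incoming_dir, processed_dir, loader):
    """Run `loader` on each CSV in incoming_dir, then move it to processed_dir."""
    handled = []
    for name in sorted(os.listdir(incoming_dir)):
        if not name.endswith(".csv"):
            continue                      # ignore partial downloads, other files
        path = os.path.join(incoming_dir, name)
        loader(path)                      # e.g. invoke sqlldr or call a procedure
        shutil.move(path, os.path.join(processed_dir, name))
        handled.append(name)
    return handled
```

A cron entry would then just invoke this script periodically; wrapping the `loader(path)` call in try/except gives the natural place for the logging or email notification mentioned above.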
