I have been trying to wrap my head around pyarrow for a while, reading their documentation, but I still feel like I have not been able to grasp it in its entirety. I saw their deprecated method of serialization for arbitrary Python objects, but since it's deprecated, I was wondering what the correct way is to save, for example, a list of objects, or an arbitrary Python object in general?
When do you want to bother using pyarrow as well?
PyArrow is the Python binding for (Apache) Arrow. Arrow is a cross-language specification that describes how to store columnar data in memory. It serves as the internals of data-processing applications and libraries, allowing them to work efficiently with large tabular datasets.
When do you want to bother using pyarrow as well?
One simple use case for PyArrow is to convert between Pandas/NumPy/dict and the Parquet file format. So, for example, if you have columnar data (e.g. DataFrames) that you need to share between programs written in different languages, or even between programs using different versions of Python, a nice way to do this is to save your Pandas/NumPy/dict to a Parquet file (serialisation). This is a much more portable format than, for example, pickle. It also allows you to embed custom metadata in a portable fashion.
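A minimal sketch of that round trip, assuming pyarrow and pandas are installed (the file name and metadata key are made up for illustration):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Convert the DataFrame to an Arrow Table and attach custom metadata
table = pa.Table.from_pandas(df)
meta = dict(table.schema.metadata or {})
meta[b"created_by"] = b"my-pipeline"   # illustrative key/value
table = table.replace_schema_metadata(meta)

# Serialise to Parquet, then read it back (possibly from another language or Python version)
pq.write_table(table, "example.parquet")
restored = pq.read_table("example.parquet").to_pandas()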
I'm new to protobuf. I need to serialize a complex graph-like structure and share it between C++ and Python clients.
I'm trying to apply protobuf because:
It is language-agnostic and has generators for both C++ and Python
It is binary; I can't afford text formats because my data structures are quite large
But Protobuf user guide says:
Protocol Buffers are not designed to handle large messages. As a
general rule of thumb, if you are dealing in messages larger than a
megabyte each, it may be time to consider an alternate strategy.
https://developers.google.com/protocol-buffers/docs/techniques#large-data
I have graph-like structures that are sometimes up to 1 GB in size, way above 1 MB.
Why is protobuf bad for serializing large datasets? What should I use instead?
It is just general guidance, so it doesn't apply to every case. For example, the OpenStreetMap project uses a protocol buffers based file format for its maps, and the files are often 10-100 GB in size. Another example is Google's own TensorFlow, which uses protobuf and the graphs it stores are often up to 1 GB in size.
However, OpenStreetMap does not have the entire file as a single message. Instead, it consists of thousands of individual messages, each encoding a part of the map. You can apply a similar approach, so that each message encodes only, e.g., one node.
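A hedged sketch of that idea in Python, assuming a hypothetical node_pb2 module generated by protoc from a .proto file containing a Node message (protobuf itself defines no framing, so a simple 4-byte length prefix is used here):

import struct
import node_pb2  # hypothetical module generated by protoc from your .proto

def write_nodes(path, nodes):
    # Write each message with a 4-byte little-endian length prefix
    with open(path, "wb") as f:
        for node in nodes:
            payload = node.SerializeToString()
            f.write(struct.pack("<I", len(payload)))
            f.write(payload)

def read_nodes(path):
    # Stream the messages back one at a time instead of one giant message
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break
            (length,) = struct.unpack("<I", header)
            node = node_pb2.Node()
            node.ParseFromString(f.read(length))
            yield node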
The main problem with protobuf for large files is that it doesn't support random access. You'll have to read the whole file, even if you only want to access a specific item. If your application will be reading the whole file to memory anyway, this is not an issue. This is what TensorFlow does, and it appears to store everything in a single message.
If you need a random-access format that is compatible across many languages, I would suggest HDF5 or SQLite.
It should be fine to use protocol buffers that are much larger than 1MB. We do it all the time at Google, and I wasn't even aware of the recommendation you're quoting.
The main problem is that you'll need to deserialize the whole protocol buffer into memory at once, so it's worth thinking about whether your data is better off broken up into smaller items so that you only have to have part of the data in memory at once.
If you can't break it up, then no worries. Go ahead and use a massive protocol buffer.
I am currently working on aligning text data, mostly hidden in CSV or Excel files from multiple sources. I've done this easily enough with Python (even on a Raspberry Pi) and OpenOffice. The issues are:
transforming disparate names to unique names (easy)
storing the data in CSV or Excel files (because my collabs use Excel)
Eventually building a real DB (SQL based- MariaDB, Postgres) from the Excel files
Doing statistics on the data; mostly enumeration from different CSV files and comparison between samples - nice to generate graphs
for debugging purposes it would be nice to quickly generate bar charts and such of groups of the data
Nothing super fancy, except it gets slow in Python (no doubt generously helped by my "I am not a programmer" 'code'). The data sets will get 'large' (tens of thousands of lines of data times multiple dozens of data sets). I would like a programming tool which facilitates this.
I looked into Ch (and cling, cint) because I still remember a bit of C and they are interpreted; Ch in particular seems to offer a good set of libraries. Python is OK for much of it, but I dislike the syntax. I try to work on Linux as much as I can, but eventually I have to hand it off to Windows users in a country not known for having fast computers. I was looking at ceemple (ceemple.com) and was wondering whether anyone has used it for a project and what their experience has been. Does it help with cross-platform issues (e.g., line termination)? Should I just forget Linux (with that wonderful shell, easy Python, and text editors which can load large files without bogging down) and move to Windows? If so, then compiled is just about the only way to go for me, likely precluding Ch and probably Python.
Please keep in mind that this is my 'side job'; I'm not a professional programmer. A low learning curve and the least amount of tooling required are important.
I am working on some CFD simulations with C/CUDA and Python; at the moment the workflow goes like this:
Start a simulation written in pure c / cuda
Write output to a binary file
Reopen the files with Python, i.e. numpy.fromfile, and do some analysis.
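A minimal sketch of that last step, assuming the binary file holds contiguous float64 values with a known grid shape (both the file name and shape are made up):

import numpy as np

# dtype and shape must match what the C/CUDA code wrote
nx, ny = 256, 256
field = np.fromfile("output.bin", dtype=np.float64).reshape(nx, ny)
print(field.mean(), field.max())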
Since I have a lot of data and also some metadata, I thought it would be better to switch to the HDF5 file format. So my idea was something like:
Create some initial conditions data for my simulations using pytables.
Reopen and write to the datasets in c by using the standard hdf5 library.
Reopen files using pytables for analysis.
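For the first step, a small PyTables sketch (the file name, node name, and grid size are assumptions):

import numpy as np
import tables

# Write an initial-conditions array that the C/CUDA code can later open
# with the standard HDF5 library.
with tables.open_file("simulation.h5", mode="w") as h5:
    ic = np.zeros((256, 256), dtype=np.float64)
    h5.create_array(h5.root, "initial_conditions", ic, title="t=0 field")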
I really would like to do some live analysis of the data, i.e. write from the C program to HDF5 and directly read from Python using PyTables. This would be pretty useful, but I am really not sure how well this is supported by PyTables. Since I have never worked with PyTables or HDF5, it would be good to know whether this is a good approach or if there are maybe some pitfalls.
I think it is a reasonable approach, but there is a pitfall indeed. The HDF5 C-library is not thread-safe (there is a "parallel" version, more on this later). That means, your scenario does not work out of the box: one process writing data to a file while another process is reading (not necessarily the same dataset) will result in a corrupted file. To make it work, you must either:
implement file locking, making sure that no process is reading while the file is being written to (see the sketch after this list), or
serialize access to the file by delegating reads/writes to a distinguished process. You must then communicate with this process through some IPC technique (Unix domain sockets, ...). Of course, this might affect performance because data is being copied back and forth.
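A rough sketch of the file-locking option on POSIX systems, using advisory locks (this only helps if the writing process takes a matching exclusive flock() around its writes; the path and node name are illustrative):

import fcntl
import tables

def read_snapshot(path):
    # Hold a shared advisory lock while reading so the writer cannot
    # modify the file mid-read; the C side must honour the same lock.
    with open(path, "rb") as lock_handle:
        fcntl.flock(lock_handle, fcntl.LOCK_SH)
        try:
            with tables.open_file(path, mode="r") as h5:
                return h5.root.initial_conditions.read()
        finally:
            fcntl.flock(lock_handle, fcntl.LOCK_UN)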
Recently, the HDF group published an MPI-based parallel version of HDF5, which makes concurrent read/write access possible. Cf. http://www.hdfgroup.org/HDF5/PHDF5/. It was created for use cases like yours.
To my knowledge, pytables does not provide any bindings to parallel HDF5. You should use h5py instead, which provides very user-friendly bindings to parallel HDF5. See the examples on this website: http://docs.h5py.org/en/2.3/mpi.html
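Along the lines of the example on that page, a minimal parallel-write sketch (requires an MPI-enabled h5py build and mpi4py, and is typically launched with mpiexec):

from mpi4py import MPI
import h5py

rank = MPI.COMM_WORLD.rank

# Every rank opens the same file collectively via the MPIO driver
with h5py.File("parallel_test.hdf5", "w", driver="mpio", comm=MPI.COMM_WORLD) as f:
    dset = f.create_dataset("test", (MPI.COMM_WORLD.size,), dtype="i")
    dset[rank] = rank  # each rank writes its own element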
Unfortunately, parallel HDF5 has a major drawback: to date, it does not support writing compressed datasets (reading is possible, though). Cf. http://www.hdfgroup.org/hdf5-quest.html#p5comp
VBA is not cutting it for me anymore. I have lots of huge Excel files on which I need to do lots of calculations and break them down into other Excel/CSV files.
I need a language that I can pick up within the next couple of days to do what I need, because it is kind of an emergency. Python has been suggested to me, but I would like to check with you whether there is anything else that handles CSV files quickly and easily.
Python is an excellent choice. The csv module makes reading and writing CSV files easy (even Microsoft's, uh, "idiosyncratic" version) and Python syntax is a breeze to pick up.
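For instance, a minimal read-transform-write sketch with the standard library (Python 3 shown; the file names and column are made up):

import csv

# Copy one column from an input CSV into a new CSV file
with open("input.csv", newline="") as src, open("output.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.writer(dst)
    writer.writerow(["Email"])
    for row in reader:
        writer.writerow([row["Email"]])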
I'd actually recommend against Perl, if you're coming to it fresh. While Perl is certainly powerful and fast, it's often cryptic to the point of incomprehensible to the uninitiated.
What kind of calculations do you have to do? Maybe R would be an alternative?
EDIT: just to give a few basic examples
# Basic usage
data <- read.csv("myfile.csv")
# Pipe-separated values
data <- read.csv("myfile.csv", sep="|")
# File with header (columns will be named as header)
data <- read.csv("myfile.csv", header=TRUE)
# Skip the first 5 lines of the file
data <- read.csv("myfile.csv", skip=5)
# Read only 100 lines
data <- read.csv("myfile.csv", nrows=100)
There are many tools for the job, but yes, Python is perhaps the best these days. There is a special module for dealing with csv files. Check the official docs.
Python definitely has a small learning curve, and it works well with CSV files
You say you have "Excel files to which I need to make lots of calculations and break them down into other Excel/CSV files", but all the answers so far talk about CSV only ...
Python has a csv read/write module, as others have mentioned. There are also the 3rd-party modules xlrd (reads) and xlwt (writes) for XLS files. See the tutorial on this site.
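A rough sketch of that combination (sheet name and file names are illustrative):

import xlrd
import xlwt

# Read the first sheet of an existing .xls file
book = xlrd.open_workbook("input.xls")
sheet = book.sheet_by_index(0)
rows = [sheet.row_values(r) for r in range(sheet.nrows)]

# Write the same rows out to a new .xls file
out = xlwt.Workbook()
out_sheet = out.add_sheet("Sheet1")
for r, row in enumerate(rows):
    for c, value in enumerate(row):
        out_sheet.write(r, c, value)
out.save("output.xls")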
You know VBA? Why not Visual Basic 2008 / 2010, or perhaps C#? I'm sure languages like python and ruby would be relatively easier for the job, but you're already accustomed to the ".NET way" of doing things, so it makes sense to keep working with them instead of learning a whole new thing just for this job.
Using C#:
var csvlines = File.ReadAllLines("file.csv");
var query = from csvline in csvlines
            let data = csvline.Split(',')
            select new
            {
                ID = data[0],
                FirstName = data[1],
                LastName = data[2],
                Email = data[3]
            };
.NET: Linq to CSV library.
.NET: Read CSV with LINQ
Python: Read CSV file
Perl is surprisingly efficient for a scripting language for text. cpan.org has a tremendous number of modules for dealing with CSV data. I've also both read and written data in XLS format with another Perl module. If you were able to use VBA, you can certainly learn Perl (the basics of Perl are easy, though it's just as easy for you or others to write terse yet cryptic code).
That depends on what you want to do with the files.
Python's learning curve is less steep than R's. However, R has a bunch of built-in functions that make it very well suited for manipulating .csv files easily, particularly for statistical purposes.
Edit: I'd recommend R over Python for this purpose alone, if only because the basic operations (reading files, dropping rows, dropping columns, etc.) are slightly faster to write in R than in Python.
I'd give awk a try. If you're running windows, you can get awk via the cygwin utilities.
This may not be anybody's popular language du jour, but since CSV files are line-oriented and split into fields, dealing with them is just about the perfect application for awk. It was built for processing line-oriented text data that can be split into fields.
Most of the other languages folks are going to recommend will be much more general-purpose, so there's going to be a lot more in them that isn't necessarily applicable to processing line-oriented text data.
PowerShell has CSV import built in.
The syntax is ugly as death, but it's designed to be useful for administrators more than for programmers -- so who knows, you might like it.
It's supposed to be a quick get-up-and-go language, for better and worse.
I'm surprised nobody's suggested PowerQuery; it's perfect for consolidating and importing files into Excel, does column calculations nicely, and has a good graphical editor built in. It works for CSVs and Excel files, but also SQL databases and most other things you'd expect. I managed to get some basic cleaning and formatting up and running in a day, and it took maybe a few days to start writing my own functions (breaking free from the GUI).
And since it only really does database stuff, it's got barely any functions to learn (the actual language is called "M")
PHP has a couple of csv functions that are easy to use:
http://www.php.net/manual-lookup.php?pattern=csv&lang=en