I just began using Hadoop on a single-node cluster on my laptop, and I tried to do it in Python, which I know better than Java. Apparently streaming is the simplest way to do so without installing any other packages.
Well, my question is: when I do a little data analysis with streaming, I have to:
Transform my data (matrix, array, ...) into a text file that fits the default input format for streaming.
Re-construct my data in my mapper.py to make explicit (key, value) pairs and print them out.
Read the results in text format and transform them back into matrix data so that I can do other things with them.
When you do a word count with a text file as input, everything looks fine. But how do you handle data structures within streaming? The way I did it seems unacceptable...
For Python and Hadoop, look at the MRjob package: http://pythonhosted.org/mrjob/
You can write your own encoding/decoding protocol, streaming each matrix row as a rownum-values pair, or every element as a row:col-value pair, and so on.
Either way, Hadoop is not the best framework for matrix operations, since it is designed for large amounts of non-interrelated data, i.e. where the processing of one key-value pair does not depend on other values, or depends on them only in a very limited way.
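A minimal MRjob sketch of that idea, assuming each input line is a JSON record such as {"row": 1, "values": [1, 0, 0, 0]} (the job name and the pass-through reducer are just illustrative, not part of MRjob itself):

from mrjob.job import MRJob
import json

class MatrixRows(MRJob):
    def mapper(self, _, line):
        # each input line is assumed to be a JSON record with "row" and "values"
        record = json.loads(line)
        yield record["row"], record["values"]

    def reducer(self, row, values):
        # row numbers are unique here, so just re-emit the single value per key
        for v in values:
            yield row, v

if __name__ == "__main__":
    MatrixRows.run()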
Using json as a text format makes for very convenient encoding and decoding.
For example, a 4x4 identity matrix on HDFS could be stored as:
{"row":3, "values":[0,0,1,0]}
{"row":2, "values":[0,1,0,0]}
{"row":4, "values":[0,0,0,1]}
{"row":1, "values":[1,0,0,0]}
In the mapper, use json.loads() from the json library to parse each line into a Python dictionary, which is very easy to manipulate. Then emit a key followed by more JSON (use json.dumps() to encode a Python object as JSON):
1 {"values":[1,0,0,0]}
2 {"values":[0,1,0,0]}
3 {"values":[0,0,1,0]}
4 {"values":[0,0,0,1]}
In the reducer, use json.loads() on the values to create Python dictionaries. These can then easily be converted into a numpy array, for example.
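A minimal sketch of what this looks like as Hadoop streaming scripts (the file names mapper.py and reducer.py and the tab-separated key formatting are assumptions for illustration):

mapper.py:

#!/usr/bin/env python
# parse one JSON record per line and emit "key<TAB>json" pairs
import sys
import json

for line in sys.stdin:
    record = json.loads(line)
    key = record.pop("row")
    print("%s\t%s" % (key, json.dumps(record)))

reducer.py:

#!/usr/bin/env python
# rebuild a dictionary from each "key<TAB>json" line
import sys
import json

for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    row = json.loads(value)       # e.g. {"values": [1, 0, 0, 0]}
    # convert row["values"] to a numpy array here if needed
    print("%s\t%s" % (key, json.dumps(row)))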
I want to be able to do two things:
Store a hash of a dataset's contents (so I can decide whether it has updated). To date, I have done this via a second output dataset with a single row that stores the hash and row count. In my Transform I can read that output and compare it to the current build's hash and row count to decide whether the data has updated. This works fine, but I'd like to avoid having a second dataset if possible.
Pass through timestamps from upstream dependencies so that in downstream workflows I can answer "when did dependency X last update?"
It seems like both of these could be solved by some sort of key-value metadata store on the dataset.
You're correct that one of the most straightforward ways to do this is to decorate the rows with a timestamp value, and in fact with Foundry's Parquet storage system, this will be encoded using Dictionary Encoding, a highly efficient mechanism to store repeated values.
The problem with this approach is you'll have to stack a new column for each phase of updating you want to keep track of. This might prove annoying to maintain in practice.
However, if you don't want to add this data to your rows and instead simply want to store your metadata, you have two options, one of which you've already found:
Store metadata in a separate dataset
Write an 'unused' file (probably .csv or .txt) to your output keeping track of this information
Foundry won't consider the extra .csv or .txt file on the output if you're also writing a standard DataFrame to it, since by default the schema only reads Parquet files. This means you can store this little snippet of information without affecting your output. The platform documentation confirms that it's possible to write both a DataFrame and a file of your own to the same output.
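A rough sketch of option 2 with the Foundry transforms API (the dataset paths, the metadata file name, and the row-count stand-in for your hash are placeholders, not prescribed by the platform):

from transforms.api import transform, Input, Output
import json

@transform(
    out=Output("/Project/folder/my_output"),      # placeholder path
    source=Input("/Project/folder/my_input"),     # placeholder path
)
def compute(out, source):
    df = source.dataframe()
    out.write_dataframe(df)                       # the normal Parquet output

    # small side file holding the metadata; the dataset schema ignores it
    metadata = {"row_count": df.count()}          # compute your content hash here too
    with out.filesystem().open("metadata.json", "w") as fh:
        fh.write(json.dumps(metadata))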
However, it may be simpler to interact with a second output, since the mechanics of Incremental Transforms and schema handling will be taken care of for you, so I'd recommend proceeding with option 1 as you are right now.
I am exploring a comparison between Go and Python, particularly for mathematical computation. I noticed that Go has a matrix package mat64.
1) I wanted to ask someone who uses both Go and Python whether there are comparable functions / tools for Go's matrices that are equivalent to NumPy's savez_compressed, which stores data in the npz format (i.e. "compressed" binary, multiple matrices per file)?
2) Also, can Go's matrices handle string types like Numpy does?
1) .npz is a numpy-specific format. It is unlikely that Go itself would ever support this format in the standard library. I also don't know of any third-party library that exists today, and a (10-second) search didn't turn one up. If you need npz specifically, go with Python + numpy.
If you just want something similar from Go, you can use any format. Binary options include Go's encoding/binary and encoding/gob packages. Depending on what you're trying to do, you could even use a non-binary format like JSON and just compress it on your own.
2) Go doesn't have built-in matrices. That library you found is third party and it only handles float64s.
However, if you just need to store strings in a matrix (n-dimensional) format, you would use an n-dimensional slice. For two dimensions it looks like this: var myStringMatrix [][]string.
npz files are zip archives. Archiving and (optional) compression are handled by Python's zipfile module. The npz contains one npy file for each variable that you save. Any OS-level archiving tool can decompress and extract the component .npy files.
So the remaining question is: can you simulate the npy format? It isn't trivial, but it isn't terribly difficult either. It consists of a header block that contains shape, strides, dtype, and order information, followed by a data block, which is, effectively, a byte image of the data buffer of the array.
So the buffer information and data are closely tied to the numpy array's internals. And if the variable isn't a normal array, np.save falls back to the Python pickle mechanism.
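You can see for yourself that an .npz really is just a zip of .npy files with a small illustrative snippet (the file and variable names are arbitrary):

import numpy as np
import zipfile

np.savez_compressed("demo.npz", a=np.eye(4), b=np.arange(10))

# the archive members are just a.npy and b.npy
with zipfile.ZipFile("demo.npz") as zf:
    print(zf.namelist())          # ['a.npy', 'b.npy']

data = np.load("demo.npz")        # lazily reads the members back
print(data["a"])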
For a start I'd suggest using the csv format. It's not binary, and not fast, but everyone and his brother can generate and read it. We constantly get SO questions about reading such files using np.loadtxt or np.genfromtxt. Look at the code for np.savetxt to see how numpy produces such files. It's pretty simple.
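For example, the csv round-trip on the Python side looks like this (Go would just read and write the same plain-text layout):

import numpy as np

m = np.arange(12.0).reshape(3, 4)
np.savetxt("matrix.csv", m, delimiter=",")
back = np.loadtxt("matrix.csv", delimiter=",")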
Another general-purpose choice would be JSON, using the tolist() form of an array. That comes to mind because Go is Google's home-grown alternative to Python for web applications. JSON is a cross-language format based on a simplified JavaScript syntax.
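On the Python side that round-trip is just the following (Go could decode the same JSON into a [][]float64):

import json
import numpy as np

m = np.eye(3)
text = json.dumps(m.tolist())          # '[[1.0, 0.0, 0.0], ...]'
restored = np.array(json.loads(text))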
I am pretty new to MATLAB but relatively familiar with Python. I am now using some existing MATLAB code, but I want the program to generate output that Python can consume. The standard output format, as far as I know, is .mat, which is a binary format.
Another function I considered is the built-in csvwrite, but the problem is that the variable I want to output is more like a dictionary, and it can have several levels of subfields (e.g., feature.subfeature.subsubfeature = [1, 2, 3]). Another possibility is to output JSON, but it seems there is no built-in method for writing JSON. There are some toolboxes I could use, but I don't have sudo permission on the machine I am using.
Any suggestions on a better way to output a format that Python can consume? Thanks.
The solution with the least effort on the MATLAB side would be to use SciPy, which can read and write mat-files:
SciPy
The mat format is binary, but (surprisingly) open and documented by MathWorks.
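A small sketch of the Python side, assuming the MATLAB code saved a struct called feature into results.mat (the file and variable names just follow the example in the question):

from scipy.io import loadmat

# squeeze_me/struct_as_record make nested MATLAB structs easier to traverse
data = loadmat("results.mat", squeeze_me=True, struct_as_record=False)
feature = data["feature"]                      # a mat_struct object
values = feature.subfeature.subsubfeature      # -> array([1, 2, 3])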
BACKGROUND
The issue I'm working with is as follows:
Within the context of an experiment I am designing for my research, I produce a large number of large (length 4M) arrays which are somewhat sparse, and therefore could be stored as scipy.sparse.lil_matrix instances, or simply as scipy.array instances (the space gain/loss isn't the issue here).
Each of these arrays must be paired with a string (namely a word) for the data to make sense, as they are semantic vectors representing the meaning of that string. I need to preserve this pairing.
The vectors for each word in a list are built one-by-one, and stored to disk before moving on to the next word.
They must be stored to disk in a manner which could be then retrieved with dictionary-like syntax. For example if all the words are stored in a DB-like file, I need to be able to open this file and do things like vector = wordDB[word].
CURRENT APPROACH
What I'm currently doing:
Using shelve to open a shelf named wordDB
Each time the vector (currently using lil_matrix from scipy.sparse) for a word is built, storing the vector in the shelf: wordDB[word] = vector
When I need to use the vectors during the evaluation, I'll do the reverse: open the shelf, and then recall vectors by doing vector = wordDB[word] for each word, as they are needed, so that not all the vectors need be held in RAM (which would be impossible).
The above 'solution' fits my needs in terms of solving the problem as specified. The issue is simply that when I wish to use this method to build and store vectors for a large number of words, I run out of disk space.
This is, as far as I can tell, because shelve pickles the data being stored, which is not an efficient way of storing large arrays, thus rendering this storage problem intractable with shelve for the number of words I need to deal with.
PROBLEM
The question is thus: is there a way of serializing my set of arrays which will:
Save the arrays themselves in compressed binary format akin to the .npy files generated by scipy.save?
Meet my requirement that the data be readable from disk as a dictionary, maintaining the association between words and arrays?
As JoshAdel already suggested, I would go for HDF5; the simplest way is to use h5py:
http://h5py.alfven.org/
You can attach several attributes to an array with a dictionary-like syntax:
dset.attrs["Name"] = "My Dataset"
where dset is your dataset, which can be sliced exactly like a numpy array, but in the background it does not load the whole array into memory.
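A minimal sketch of the word-to-vector use case with h5py (file and dataset names are arbitrary, and it assumes dense numpy arrays rather than lil_matrix instances):

import h5py
import numpy as np

# build phase: one dataset per word, written to disk as it is computed
with h5py.File("wordDB.h5", "w") as f:
    f.create_dataset("dog", data=np.zeros(4000000, dtype=np.float32),
                     compression="gzip")

# evaluation phase: dictionary-like lookup, read lazily from disk
with h5py.File("wordDB.h5", "r") as f:
    vector = f["dog"][:]          # or just a slice, e.g. f["dog"][1000:2000]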
I would suggest using scipy.save and keeping a dictionary mapping each word to the name of its file.
Have you tried just using cPickle to pickle the dictionary directly using:
import cPickle

DD = dict()                 # word -> vector mapping
f = open('testfile.pkl', 'wb')
cPickle.dump(DD, f, -1)     # -1 selects the highest (binary) pickle protocol
f.close()
Alternatively, I would just save the vectors in a large multidimensional array using hdf5 or netcdf if necessary, since this allows you to open a large array without bringing it all into memory at once, and then get slices as needed. You can then associate the words as an additional group in the netCDF4/HDF5 file and use the common indices to quickly associate the appropriate slice from each group, or just name each group after the word and have the data be the vector. You'd have to play around with which is more efficient.
http://netcdf4-python.googlecode.com/svn/trunk/docs/netCDF4-module.html
Pytables also might be a useful storage layer on top of HDF5:
http://www.pytables.org
Avoid using shelve; it's bug-ridden and has cross-platform issues.
The space issue, however, has nothing to do with shelve. Numpy arrays provide an efficient implementation of the pickle protocol, and there is little overhead to cPickle.dumps(protocol=-1) compared to binary .npy (basically just the extra headers in pickle).
So if binary/pickle isn't enough, you'll have to go for compression. Have a look at pytables or h5py (difference between the two).
If specifying the binary protocol in pickle is enough, you can consider something more lightweight than hdf5: check out sqlitedict as a replacement for shelve. It has no additional dependencies.
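A sketch of the sqlitedict route (the file name is arbitrary; values are pickled under the hood, so numpy arrays can be stored directly):

import numpy as np
from sqlitedict import SqliteDict

vector = np.zeros(4000000, dtype=np.float32)   # stand-in for a real semantic vector

# same dictionary-like interface as shelve, backed by a single SQLite file
with SqliteDict("wordDB.sqlite", autocommit=True) as wordDB:
    wordDB["dog"] = vector

# later, during evaluation
with SqliteDict("wordDB.sqlite") as wordDB:
    vector = wordDB["dog"]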
How can I fetch genomic sequence efficiently using Python? For example, from a .fa file or some other easily obtained format? I basically want an interface fetch_seq(chrom, strand, start, end) which will return the sequence [start, end] on the given chromosome on the specified strand.
Analogously, is there a programmatic python interface for getting phastCons scores?
thanks.
Retrieving sequence data from large human chromosome files can be memory-inefficient, so if you're after computational efficiency you can format the sequence data as a packed binary string and look it up by byte location. I wrote routines to do this in Perl (available here), and Python has the same pack and unpack routines, so it can be done; but it's only worth it if you're running into trouble with large files on a limited machine. Otherwise, use Biopython's SeqIO.
See my answer to your question over at Biostar:
http://biostar.stackexchange.com/questions/1639/getting-genomic-sequences-and-phastcons-scores-using-python-from-ensembl-ucsc
Use SeqIO with Fasta files and you'll get back record objects for each item in the file. Then you can do:
region = rec.seq[start:end]
to pull out slices. The nice thing about using a standard library is you don't have to worry about the line breaks in the original fasta file.
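A sketch of the fetch_seq interface on top of Biopython (the file name is a placeholder; SeqIO.index gives lazy, dictionary-like access by record name, and the minus strand is handled with reverse_complement):

from Bio import SeqIO

# index the fasta once; records are fetched lazily by chromosome name
chroms = SeqIO.index("genome.fa", "fasta")

def fetch_seq(chrom, strand, start, end):
    region = chroms[chrom].seq[start:end]
    return str(region.reverse_complement() if strand == "-" else region)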
Take a look at biopython, which has support for several gene sequence formats. Specifically, it has support for FASTA and GenBank files, to name a couple.
pyfasta is the module you're looking for. From the description:
fast, memory-efficient, pythonic (and command-line) access to fasta sequence files
https://github.com/brentp/pyfasta
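A minimal sketch of the same fetch_seq interface with pyfasta (the file name is a placeholder, and the sequence() call with a strand key reflects my reading of pyfasta's docs rather than the original answer):

from pyfasta import Fasta

f = Fasta("genome.fa")        # builds a flattened index on first use

def fetch_seq(chrom, strand, start, end):
    return f.sequence({"chr": chrom, "start": start, "stop": end,
                       "strand": strand})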