How can I fetch genomic sequence efficiently using Python? For example, from a .fa file or some other easily obtained format? I basically want an interface fetch_seq(chrom, strand, start, end) which will return the sequence [start, end] on the given chromosome on the specified strand.
Analogously, is there a programmatic python interface for getting phastCons scores?
thanks.
Retrieving sequence data from large human chromosome files can be inefficient memory-wise, so if you're after computational efficiency you can convert the sequence data into a packed binary string and look regions up by byte offset. I wrote routines to do this in Perl (available here), and Python has the same pack and unpack routines, so it can be done, but it's only worth it if you're running into trouble with large files on a limited machine. Otherwise, use Biopython's SeqIO.
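As a rough illustration of the byte-offset idea (this is not the linked Perl code, just a sketch assuming each chromosome is stored as one long line of plain characters in a hypothetical "chr1.flat" file), you can seek straight to the region instead of loading the whole file:

def fetch_seq_flat(path, start, end):
    # Jump to the byte offset of `start` and read only the region we need.
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start).decode("ascii")

# e.g. fetch_seq_flat("chr1.flat", 1000, 1050)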
See my answer to your question over at Biostar:
http://biostar.stackexchange.com/questions/1639/getting-genomic-sequences-and-phastcons-scores-using-python-from-ensembl-ucsc
Use SeqIO with Fasta files and you'll get back record objects for each item in the file. Then you can do:
region = rec.seq[start:end]
to pull out slices. The nice thing about using a standard library is you don't have to worry about the line breaks in the original fasta file.
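For example, a minimal sketch of the fetch_seq interface from the question using Biopython's SeqIO.index (the file name "hg19.fa" and the chromosome names are placeholders; whether you treat the end coordinate as inclusive is up to you):

from Bio import SeqIO

records = SeqIO.index("hg19.fa", "fasta")   # lazy index, so the genome isn't loaded into memory

def fetch_seq(chrom, strand, start, end):
    region = records[chrom].seq[start:end]  # half-open slice, like Python strings
    return region.reverse_complement() if strand == "-" else region

print(fetch_seq("chr1", "+", 1000, 1050))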
Take a look at biopython, which has support for several gene sequence formats. Specifically, it has support for FASTA and GenBank files, to name a couple.
pyfasta is the module you're looking for. From the description
fast, memory-efficient, pythonic (and command-line) access to fasta sequence files
https://github.com/brentp/pyfasta
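A rough sketch of how that tends to look (the file and chromosome names are just examples; check the pyfasta docs for the exact options):

from pyfasta import Fasta

f = Fasta("genome.fa")                       # builds a flat-file index on first use

seq = f["chr1"][1000:1050]                   # plain 0-based slicing

# the sequence() helper also understands strand
seq_rc = f.sequence({"chr": "chr1", "start": 1000, "stop": 1050, "strand": "-"})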
I have a list of strings, and would like to pass this to an api that accepts only a file-like object, without having to concatenate/flatten the list to use the likes of StringIO.
The strings are utf-8, don't necessarily end in newlines, and if naively concatenated could be used directly in StringIO.
The preferred solution would be within the standard library (Python 3.8). Given that the shape of the data is naturally file-like (essentially what readlines() would return) and the memory access pattern would be efficient, I have a feeling I'm just failing to DuckDuckGo correctly. But if no such thing exists, any "streaming" solution (no data concatenation) would suffice.
[Update, based on #JonSG's links]
Both RawIOBase and TextIOBase appear to provide an API that decouples arbitrarily sized "chunks"/fragments (in my case: strings in a list) from a file-like read that can specify its own chunk size, while still streaming the data itself (the memory cost only grows by some window at any given time, depending of course on the behaviour of your source and sink).
RawIOBase.readinto looks especially promising because it hands you the buffer that is returned to client reads directly, allowing much simpler code - but this appears to come at the cost of one full copy (into that buffer).
TextIOBase.read() pays its own cost for solving the same subproblem, namely concatenating k chunks together (with k much smaller than N).
I'll investigate both of these.
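For reference, here's a minimal sketch of the readinto approach: a hypothetical IterStream class that exposes a list (or any iterable) of utf-8 strings as a readable binary stream without joining them up front:

import io

class IterStream(io.RawIOBase):
    def __init__(self, strings):
        self._iter = iter(strings)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, buffer):
        # Pull the next string from the iterator only when the previous one is exhausted.
        while not self._leftover:
            try:
                self._leftover = next(self._iter).encode("utf-8")
            except StopIteration:
                return 0                     # EOF
        n = min(len(buffer), len(self._leftover))
        buffer[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n

lines = ["first line\n", "second line\n", "no trailing newline"]
f = io.TextIOWrapper(io.BufferedReader(IterStream(lines)), encoding="utf-8")
print(f.read())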
I want to write output files containing tabular data (float values in rows and columns) from a C++ program.
I need to open those files later on with other languages/software (here Python and ParaView, but that might change).
What would be the most efficient output format for tabular files (efficient in terms of file size) that would be compatible with other languages?
E.g., txt files, CSV, XML, binary or not?
Thanks for any advice.
HDF5 might be a good option for you. It’s a standard format for storing large amounts of data, and there are Python and C++ libraries for reading and writing to it.
See here for an example
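A hedged sketch of the Python side using h5py (the file and dataset names are arbitrary examples); the official HDF5 C++ API can write the same dataset, so both languages share one portable file:

import numpy as np
import h5py

table = np.random.rand(1000, 4)              # rows x columns of floats

with h5py.File("results.h5", "w") as f:
    f.create_dataset("table", data=table, compression="gzip")

with h5py.File("results.h5", "r") as f:
    loaded = f["table"][:]                   # back as a numpy array

print(loaded.shape)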
1- Your output files contain tabular data (float values in rows and columns), in other words, a kind of matrix.
2- You need to open those files later on with other languages/software.
3- You want file-size efficiency.
That said, you should consider one of the two formats below:
CSV: if your data are very simple (a matrix of floats without any particular structure)
JSON: if you need a minimum of structure in your files
These two formats are standard, supported by almost all well-known languages and maintained software.
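For instance, reading a CSV back in Python is a one-liner ("table.csv" is a placeholder for whatever your C++ program writes):

import numpy as np

data = np.loadtxt("table.csv", delimiter=",")   # 2-D float array
print(data.shape, data.dtype)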
Finally, if your data have a very complex structure, consider a format like XML instead, but the price you pay is then in the size of your files!
Hope this helps!
First of all, the efficiency of I/O operations is limited by the buffer size, so if you want higher throughput you might have to play with the input/output buffers. As for how to write the data to the files, that depends on your data and on what delimiters you want to use to separate the values in the files.
I'm kind of new to Python and this is the first time I'm trying to write a script.
I need to go over a given directory and print out all the duplicated files (files with the same contents). If there are two sets of duplicated files then I need to print them on different lines.
Anyone have an idea?
Yes, creating a dictionary with the file size as the key and a list of all the files with that size as the value, and then only comparing files of the same size is a good strategy. However, there's another step you can take to improve the efficiency.
Once you've identified your lists of files of the same size, rather than laboriously comparing each pair of files in the list byte by byte (and returning as soon as you find a mis-match), you can compare their digital fingerprints.
A fingerprinting algorithm (also known as a message digest) takes a string of data and returns a short string of bytes that is (hopefully) unique to that input string. So to find duplicate files you just need to generate the fingerprint of each file and then see if any of the fingerprints are duplicates. This is generally a lot faster than actually comparing file contents, since if you have a list of fingerprints you can easily sort it, so all the files with the same fingerprint will be next to each other in the sorted list.
With the usual functions used for fingerprinting there is a tiny probability that two files with the same fingerprint aren't actually identical. So once you've identified files with matching fingerprints you still do need to compare their byte contents. However, in those very rare cases where two non-identical files have matching fingerprints the file contents will generally differ quite radically, so such false positives can be quickly eliminated.
The odds of two non-identical files having matching fingerprints are so tiny that unless you have many thousands of files to de-dupe you can probably skip the first step of grouping files by size and just go straight to fingerprinting.
The Wikipedia article on fingerprinting I linked to above mentions the Rabin fingerprint as a very fast algorithm. There is a 3rd-party Python module available for it on PyPI - PyRabin 0.5, but I've never used it.
However, the standard Python hashlib module provides various secure hash and message digest algorithms, e.g. MD5 and the SHA family, which are reasonably fast and quite familiar to most seasoned Python programmers. For this application, I'd suggest using an algorithm that returns a fairly short fingerprint, like sha1() or md5(), since they tend to be faster, although the shorter the fingerprint, the higher the rate of collision. Although MD5 isn't as secure as once thought, it's still OK for this application, unless you need to deal with files created by a malicious person who is deliberately crafting non-identical files with the same MD5 fingerprint.
Another option, if you expect to get lots of duplicate files, is to compute two different fingerprints (e.g. both SHA1 and MD5) - the odds of two non-identical files having matching SHA1 and MD5 fingerprints are microscopically tiny.
FWIW, here's a link to a simple Python program I wrote last year (in an answer on the Unix & Linux Stack Exchange site) that computes the MD5 and SHA256 hashes of a single file. It's quite efficient on large files. You may find it helpful as an example of using hashlib.
Simultaneously calculate multiple digests (md5, sha256)
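And here's a hedged sketch of the overall approach described above: group files by size, then by MD5 fingerprint, and report groups that share both (the function names are mine, and the final byte-by-byte confirmation is left out for brevity):

import hashlib
import os
from collections import defaultdict

def file_md5(path, chunk_size=1 << 20):
    # Hash the file in chunks so large files don't need to fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(directory):
    by_size = defaultdict(list)
    for root, _, names in os.walk(directory):
        for name in names:
            path = os.path.join(root, name)
            by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue                         # unique size, cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_md5(path)].append(path)
        duplicates.extend(group for group in by_hash.values() if len(group) > 1)
    return duplicates

for group in find_duplicates("."):
    print(" ".join(group))                   # one set of duplicates per line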
I just began using Hadoop on a single-node cluster on my laptop, and I tried to do it in Python, which I know better than Java. Apparently streaming is the simplest way to do so without installing any other packages.
Well, my question is: when I do a little data analysis with streaming, I had to:
Transform my data (matrix, array, ...) into a text file that fits the default input format for streaming.
Re-construct my data in my mapper.py to make explicit (key, value) pairs and print them out.
Read the results in text format and transform them back into matrix data so that I could do other things with them.
When you do a word count with a text file as input, everything looks fine. But how do you handle data structures within streaming then? The way I did it seems just unacceptable...
For Python and Hadoop, take a look at the mrjob package, http://pythonhosted.org/mrjob/
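A minimal mrjob sketch (the job name and the assumed input format, a tab-separated row id followed by comma-separated values, are mine), just to show that the mapper and reducer become plain Python methods instead of hand-rolled streaming scripts:

from mrjob.job import MRJob

class MRRowSum(MRJob):
    def mapper(self, _, line):
        # Assumed input line: "<row>\t<v1>,<v2>,..."
        row, values = line.split("\t", 1)
        yield row, sum(float(v) for v in values.split(","))

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == "__main__":
    MRRowSum.run()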
You can write your own encoding/decoding protocol, streaming each matrix row as a rownum-values pair, or every element as a row:col-value pair, and so on.
Either way, Hadoop is not the best framework for matrix operations, since it's designed for big amounts of non-interrelated data, i.e. when your key-value processing does not depend on other values, or depends on them only in a very limited way.
Using json as a text format makes for very convenient encoding and decoding.
For example a 4*4 identity matrix on hdfs could be stored as:
{"row":3, "values":[0,0,1,0]}
{"row":2, "values":[0,1,0,0]}
{"row":4, "values":[0,0,0,1]}
{"row":1, "values":[1,0,0,0]}
In the mapper use json.loads() from the json library to parse each line into a python dictionary which is very easy to manipulate. Then return a key followed by more json (use json.dumps() to encode a python object as json):
1 {"values":[1,0,0,0]}
2 {"values":[0,1,0,0]}
3 {"values":[0,0,1,0]}
4 {"values":[0,0,0,1]}
In the reducer use json.loads() on the values to create a python dictionary. These could then be easily converted into a numpy array for example.
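For instance, a sketch of that mapper written for Hadoop streaming (read stdin, write tab-separated key/value pairs to stdout):

import json
import sys

for line in sys.stdin:
    record = json.loads(line)                # e.g. {"row": 3, "values": [0, 0, 1, 0]}
    key = record["row"]
    value = json.dumps({"values": record["values"]})
    sys.stdout.write("{}\t{}\n".format(key, value))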
The corpus consists of strings (file names) and their checksums, so I expect its entropy to be higher than that of normal text. Also, the collection is too large to be analysed as a whole, so I'm going to sample it to create a global dictionary. Is there a fancy machine learning approach for my task?
Which algorithm or, better, library should I use?
I'm using python in case it matters.
I would suggest you use sparse coding. It allows you to use your data set to infer an overcomplete dictionary which is then used to encode your data. If your data is indeed of similar nature, this could work well for you.
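As an illustrative sketch only, scikit-learn's MiniBatchDictionaryLearning is one concrete sparse-coding implementation; turning the filename/checksum strings into fixed-length numeric vectors (byte histograms below) is my own assumption about the preprocessing, and the toy sample list is obviously far too small to learn anything real:

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

def byte_histogram(s):
    # Fixed-length numeric representation of an arbitrary string.
    counts = np.zeros(256)
    for b in s.encode("utf-8"):
        counts[b] += 1
    return counts

samples = ["report_2019.pdf 3a7bd3e2360a3d29eea436fcfb7e44c8",
           "report_2020.pdf 9e107d9d372bb6826bd81d3542a419d6"]
X = np.array([byte_histogram(s) for s in samples])

learner = MiniBatchDictionaryLearning(n_components=2, random_state=0)
codes = learner.fit_transform(X)             # sparse codes over the learned dictionary
print(codes.shape)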