Is there a parser for PDB files (Protein Data Bank) that can extract (most) information from the header/REMARK section, like refinement statistics, etc.?
It might be worthwhile to note that I am mainly interested in accessing data from files right after they have been produced, not from structures that have already been deposited in the Protein Data Bank. This means there is quite a variety of "proprietary" formats to deal with, depending on the refinement software used.
I've had a look at Biopython, but they explicitly state in the FAQ that "If you are interested in data mining the PDB header, you might want to look elsewhere because there is only limited support for this."
I am well aware that it would be a lot easier to extract this information from mmCIF files, but unfortunately many macromolecular crystallography programs still do not output these routinely.
The best way I have found so far is to convert the PDB file into mmCIF format using pdb_extract (http://pdb-extract.wwpdb.org/ , either online or as a standalone tool).
The mmCIF file can then be parsed using Biopython's Bio.PDB module.
Writing to the mmCIF file is a bit trickier; Python PDBx seems to work reasonably well.
This and other useful PDB/mmCIF tools can be found at http://mmcif.wwpdb.org/docs/software-resources.html
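For the parsing step, a minimal sketch of what that can look like with Biopython (the file name and the specific mmCIF items shown are just examples; which items are actually present depends on what pdb_extract wrote out):

```python
from Bio.PDB.MMCIF2Dict import MMCIF2Dict

# Parse the converted mmCIF file into a flat dict keyed by mmCIF item names
mmcif = MMCIF2Dict("converted_structure.cif")  # hypothetical file name

# Look up refinement statistics by their mmCIF item names
# (values may come back as strings or lists depending on the Biopython version)
r_work = mmcif.get("_refine.ls_R_factor_R_work")
r_free = mmcif.get("_refine.ls_R_factor_R_free")
res_high = mmcif.get("_refine.ls_d_res_high")

print("R-work:", r_work, "R-free:", r_free, "resolution:", res_high)
```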
Maybe you should try this library?
https://pypi.python.org/pypi/bioservices
I want to create a SQL autocompleter for use with rlwrap: https://github.com/hanslub42/rlwrap
This could then be used with sqlite3 & osqueri for example (I know they already have some autocompletion facility, but it's not good enough, especially under rlwrap).
In fact, more generally I would like to know the best approach for building autocompleters based on BNF grammar descriptions; I may want to produce autocompleters for other rlwrapped REPLs at some point in the future.
I have no experience with parsers, but I have read some stuff online about the different types of parsers and how they work, and this Pyleri tutorial: https://tomassetti.me/pyleri-tutorial/
Pyleri looks fairly straightforward, and has the expecting property, which makes it easy to create an auto-completer, but AFAIK it would involve translating the sqlite BNF (and any other BNFs that I might want to use in the future) into Python code, which is a drag.
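To illustrate what I mean by the expecting property, here is a toy sketch adapted from that tutorial (the grammar is deliberately trivial, nothing like the full sqlite grammar):

```python
from pyleri import Grammar, Keyword, Regex, Sequence

class ToyGrammar(Grammar):
    # Toy grammar: a statement like  SELECT "something"
    k_select = Keyword('SELECT')
    r_string = Regex(r'"[^"]*"')
    START = Sequence(k_select, r_string)

grammar = ToyGrammar()
result = grammar.parse('SELECT ')  # incomplete input
print(result.is_valid)             # False
print(result.expecting)            # grammar elements that could come next -> completion candidates
```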
ANTLR has lots of predefined grammar files for many different languages and can output Python code, but I'm not sure how easy it is to produce an autocompleter with it, and I don't want to read through all the documentation only to find out I've wasted my time.
So can anyone advise me? What's the best approach?
I'm writing a program in Mathematica that relies on pattern-matching to perform payroll and warrant of payment verification. The crux of the problem is to compare different data files (both CSV and XLS) to make sure they contain the exact same information, since pay is handled by two different third-parties.
My use of Mathematica makes development of the program quite streamlined and fun, but is prohibitive on a distribution level. CDF format is not an option, since the program requires the user to import data files, something which WRI does not permit in CDF.
An ideal programming language for this task would enable me to pack it up as standalone, for OS X, Linux or Windows, as well as being able to do the pattern-matching. Support for GUI (primitive or extensive) is also needed.
I thought of translating my program into Python, but I'm not sure if it's a good bet.
What suggestions do you have?
My only understanding of pattern-matching is that which the Mathematica documentation has taught me.
An example of a task that Mathematica handles perfectly is the following:
Import XLS file, sort data by dates and names, extract certain dates and names. Import CSV file, sort data by dates and names, extract certain dates and names.
Compare both, produce a nice formatted output containing desired (missing) information.
Navigating through the data in Mathematica is also pretty easy and intuitive.
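For what it's worth, here is roughly how I imagine those steps might look in Python with pandas (the file names, column names and dates here are made up), though I don't know whether this is idiomatic:

```python
import pandas as pd

# Made-up file names, column names, and dates of interest
dates_of_interest = ["2013-01-31", "2013-02-28"]

xls = pd.read_excel("provider_a.xls").sort_values(["date", "name"])
csv = pd.read_csv("provider_b.csv").sort_values(["date", "name"])

# Keep only the dates we care about
xls = xls[xls["date"].isin(dates_of_interest)]
csv = csv[csv["date"].isin(dates_of_interest)]

# An outer merge flags rows that exist in only one of the two files
merged = xls.merge(csv, on=["date", "name", "amount"], how="outer", indicator=True)
missing = merged[merged["_merge"] != "both"]
print(missing)
```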
Consider Haskell, which seems to have all the features you want and is cross-platform.
If you want to use a more standard language that has some capabilities for working with spreadsheets (unless I'm misunderstanding the question), I would suggest plain Java with the Apache POI library, which is made specifically for those horrible spreadsheet formats. It is also considerably faster to pick up than Haskell, though if you already know Mathematica it probably wouldn't be that bad to move over to another mathematically inclined language.
Prolog is a logic programming language in the sense that it actually performs a proof based on the facts that you give it. Thus, if you provide it with the appropriate facts for warranty or payroll information, it will be able to prove that the data is either of them by trying to reach a base case in which both sides of an equation cancel. There is more to this, but I'm on my phone at the moment.
For your situation, you would be able to read the data in an easier-to-program language and verify your parameters in Prolog; as long as your Prolog facts are correct, it will be able to quickly verify that your data is valid. It can be thought of as regular expressions on steroids with a lot more functionality.
http://www.amzi.com/articles/lsapi_design.htm
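If you would rather drive Prolog from Python than through a C API like the one above, here is a rough sketch of the idea using the pyswip bridge to SWI-Prolog (the facts and the rule are purely illustrative, not a real payroll model):

```python
from pyswip import Prolog

prolog = Prolog()

# Facts loaded from the parsed payroll data (illustrative values)
prolog.assertz("paid(alice, 1700, jan)")
prolog.assertz("owed(alice, 1700, jan)")

# A payment is verified when the paid amount matches the owed amount
prolog.assertz("(verified(Name, Month) :- paid(Name, Amount, Month), owed(Name, Amount, Month))")

# Query Prolog for every verified (name, month) pair
for result in prolog.query("verified(Name, Month)"):
    print(result["Name"], result["Month"])
```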
I have scripts which need different configuration data. Most of the time it is in a table format or a list of parameters. At the moment I read Excel tables for this.
However, not only is reading Excel slightly buggy (Excel is just not made for being a stable data provider), but I'd also like to include some data validation and a little help for the configurators so that input is at least partially validated. It doesn't have to be pretty, just functional. Pure text files would be too hard to read and check.
Can you suggest an easy-to-implement way to realize that? Of course one could program complicated web interfaces and forms, but maybe that's too much effort.
What is an easy-to-edit way to provide data tables and other configuration parameters?
The configuration info is just small tables with a list of parameters or a matrix with mathematical coefficients.
I like to use YAML. It's pretty flexible, and Python can read it as a dict using PyYAML.
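A minimal sketch of what that looks like (the file name and keys are just examples):

```python
import yaml

# config.yml might contain, for example:
#   coefficients:
#     - [1.0, 0.5, 0.25]
#     - [0.1, 0.2, 0.3]
#   max_iterations: 100
with open("config.yml") as f:
    config = yaml.safe_load(f)  # returns plain dicts/lists

print(config["max_iterations"])
print(config["coefficients"][0])
```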
It's a bit difficult to answer without knowing what the data you're working with looks like, but there are several ways you could do it. You could, for example, use something like CSV or SQLite, provided the data can be easily expressed in a tabular format, but I think you might find XML is best for your use case. It is very versatile and can be easy to work with if you find a good editor (e.g. serna or oxygenxml); however, it might still be in your interest to write your own editor for it (which will probably not be as complicated as you think!). XML is easy to work with in Python through the standard xml.etree module, and XML schemas can be used for validation.
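For example, reading a small parameter table with the standard library might look roughly like this (the element and attribute names are made up; note that for schema validation you would need lxml rather than plain xml.etree):

```python
import xml.etree.ElementTree as ET

# Hypothetical config layout:
# <config>
#   <param name="max_speed" value="3.2"/>
#   <param name="threshold" value="0.75"/>
# </config>
tree = ET.parse("config.xml")
params = {p.get("name"): float(p.get("value"))
          for p in tree.getroot().findall("param")}
print(params)
```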
I am interested in generating a list of suggested semantic tags (via links to Freebase, Wikipedia or another system) to a user who is posting a short text snippet. I'm not looking to "understand" what the text is really saying, or even to automatically tag it, I just want to suggest to the user the most likely semantic tags for his/her post. My main goal is to force users to tag semantically and therefore consistently and not to write in ambiguous text strings. If there were a reasonably functional and reasonably priced tool on the market, I would use it. I have not found such a tool so I am looking in to writing my own.
My question is first of all, if there is such a tool that I have not encountered. I've looked at Zemanta, AlchemyAPI and OpenCalais and none of them seemed to offer the service I need.
Assuming that I'm writing my own, I'd be doing it in Python (unless there was a really compelling reason to use something else). My first guess would be to search for n-grams that match "entities" in Freebase and suggest them as tags, perhaps searching in descriptions of entities as well to get a little "smarter." If that proved insufficient, I'd read up and dip my toes into the ontological water. Since this is a very hard problem and I don't think that my application requires its solution, I would like to refrain from real semantic analysis as much as possible.
Does anyone have experience working with a semantic database system and could give me some pointers regarding where to begin and what sort of pitfalls to expect?
Take a look at the NLTK Python library. It contains a vast number of tools, dictionaries and algorithms.
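For the n-gram matching idea specifically, a rough sketch with NLTK might look like this (the entity set here is a stand-in for a real lookup against Freebase/Wikipedia, which you would have to implement yourself):

```python
from nltk import word_tokenize
from nltk.util import ngrams

# Stand-in for a real entity lookup (Freebase/Wikipedia titles, etc.)
known_entities = {"new york", "machine learning", "python"}

text = "I am learning machine learning with Python in New York"
tokens = [t.lower() for t in word_tokenize(text)]  # needs nltk.download('punkt') once

# Collect every 1-, 2- and 3-gram that matches a known entity
suggestions = set()
for n in (1, 2, 3):
    for gram in ngrams(tokens, n):
        candidate = " ".join(gram)
        if candidate in known_entities:
            suggestions.add(candidate)

print(suggestions)  # e.g. {'python', 'machine learning', 'new york'}
```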
Maybe a silly question, but I usually learn a lot with those. :)
I'm working on a software that deals a lot with XML, both as input and as output, and in between a lot of processing occurs.
My first thought was to use a dict as the internal data structure and, from there, work out a process for reading and writing it.
What do you guys think? Any better approach, Python-wise?
An XML document in general is a tree with lots of bells and whistles (attributes vs. child nodes, mixing of text with child nodes, entities, XML declarations, comments, and many more). Handling that should be left to existing, mature libraries; for Python, it's commonly agreed that lxml is the most convenient choice, followed by the stdlib ElementTree modules (by which one lxml module, lxml.etree, is so heavily inspired that incompatibilities are the exception).
These handle all that complexity and expose it in reasonably manageable ways with many convenience methods (lxml's XPath support has saved me a lot of code). After parsing, programs can, of course, go on to convert the trees into simpler data structures that better fit the data actually being modeled. Exactly which data structures are possible and sensible depends on what you want to represent (if you misuse XML as a flat key-value store, for instance, you could indeed go on to convert the tree into a dictionary).
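For illustration, a small sketch of that last case (the XML layout and names are made up): parse with lxml, pull the entries out with XPath, then collapse the tree into a plain dict:

```python
from lxml import etree

xml_input = b"""
<settings>
  <entry key="host">example.org</entry>
  <entry key="port">8080</entry>
</settings>
"""

root = etree.fromstring(xml_input)

# XPath to grab all entries, then collapse the tree into a plain dict
config = {e.get("key"): e.text for e in root.xpath("//entry")}
print(config)  # {'host': 'example.org', 'port': '8080'}
```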
It depends completely on what type of data you have in XML, what kinds of processing you'll need to do with it, what sort of output you'll need to produce from it, etc.