Renumber residues in a protein structure file (PDB) - Python

Hi,
I am currently involved in building a website aimed at combining all papillomavirus information in a single place.
As part of the effort we are curating all known files on public servers (e.g. GenBank).
One of the issues I ran into is that many (~50%) of the solved structures are not numbered according to the full-length protein.
I.e. a subdomain was crystallized (amino acids 310-450), but the crystallographer deposited it as residues 1-140.
I was wondering whether anyone knows of a way to renumber the entire PDB file. I have found ways to renumber the sequence (identified by SEQRES), but this does not update the HELIX and SHEET records.
I would appreciate any suggestions.
Thanks

I'm the maintainer of pdb-tools, which may be able to assist you.
I recently modified the residue-renumber script within the application to provide more flexibility. It can now renumber HETATM records and specific chains, and either force the residue numbers to be continuous or simply add a user-specified offset to all residues.
Please let me know if this helps.
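For illustration only, here is a minimal sketch of what offset-based renumbering does at the file level. This is not pdb-tools itself, the file names and offset are placeholders, and it only touches ATOM/HETATM records (HELIX and SHEET records store residue numbers in their own columns and would need the same shift):
def renumber_atoms(in_path, out_path, offset):
    # The residue sequence number lives in columns 23-26 of ATOM/HETATM records.
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(("ATOM", "HETATM")):
                resseq = int(line[22:26]) + offset
                line = line[:22] + "%4d" % resseq + line[26:]
            dst.write(line)

renumber_atoms("input.pdb", "input_renumbered.pdb", offset=309)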

I frequently encounter this problem too. After abandoning an old Perl script I had for this, I've been experimenting with some Python instead. This solution assumes you have Biopython, ProDy (http://www.csb.pitt.edu/ProDy/#prody) and EMBOSS (http://emboss.sourceforge.net/) installed.
I used one of the papillomavirus PDB entries here.
from Bio import AlignIO,SeqIO,ExPASy,SwissProt
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
from Bio.Emboss.Applications import NeedleCommandline
from prody.proteins.pdbfile import parsePDB, writePDB
import os
oneletter = {
'ASP':'D','GLU':'E','ASN':'N','GLN':'Q',
'ARG':'R','LYS':'K','PRO':'P','GLY':'G',
'CYS':'C','THR':'T','SER':'S','MET':'M',
'TRP':'W','PHE':'F','TYR':'Y','HIS':'H',
'ALA':'A','VAL':'V','LEU':'L','ILE':'I',
}
# Retrieve pdb to extract sequence
# Can probably be done with Bio.PDB but being able to use the vmd-like selection algebra is nice
pdbname="2kpl"
selection="chain A"
structure=parsePDB(pdbname)
pdbseq_str=''.join([oneletter[i] for i in structure.select("protein and name CA and %s"%selection).getResnames()])
alnPDBseq=SeqRecord(Seq(pdbseq_str,IUPAC.protein),id=pdbname)
SeqIO.write(alnPDBseq,"%s.fasta"%pdbname,"fasta")
# Retrieve reference sequence
accession="Q96QZ7"
handle = ExPASy.get_sprot_raw(accession)
swissseq = SwissProt.read(handle)
refseq=SeqRecord(Seq(swissseq.sequence,IUPAC.protein),id=accession)
SeqIO.write(refseq, "%s.fasta"%accession,"fasta")
# Do global alignment with needle from EMBOSS, stores entire sequences which makes numbering easier
needle_cli = NeedleCommandline(asequence="%s.fasta"%pdbname,bsequence="%s.fasta"%accession,gapopen=10,gapextend=0.5,outfile="needle.out")
needle_cli()
aln = AlignIO.read("needle.out", "emboss")
os.remove("needle.out")
os.remove("%s.fasta"%pdbname)
os.remove("%s.fasta"%accession)
alnPDBseq = aln[0]
alnREFseq = aln[1]
# Initialize per-letter annotation for pdb sequence record
alnPDBseq.letter_annotations["resnum"]=[None]*len(alnPDBseq)
# Initialize annotation for reference sequence, assume first residue is #1
alnREFseq.letter_annotations["resnum"]=range(1,len(alnREFseq)+1)
# Set new residue numbers in alnPDBseq based on alignment
reslist = [[i,alnREFseq.letter_annotations["resnum"][i]] for i in range(len(alnREFseq)) if alnPDBseq[i] != '-']
for [i,r] in reslist:
    alnPDBseq.letter_annotations["resnum"][i]=r
# Set new residue numbers in the structure
newresnums=[i for i in alnPDBseq.letter_annotations["resnum"][:] if i is not None]
resindices=structure.select("protein and name CA and %s"%selection).getResindices()
resmatrix = [[newresnums[i],resindices[i]] for i in range(len(newresnums)) ]
for [newresnum,resindex] in resmatrix:
    structure.select("resindex %d"%resindex).setResnums(newresnum)
writePDB("%s.renumbered.pdb"%pdbname,structure)

pdb-tools
Phenix pdb-tools
BioPython or Bio3D
Check the first one - it should fit your needs

Related

Python comment-preserving parsing using only builtin libraries?

I wrote a library using just the ast and inspect libraries to parse and emit [it uses astor on Python < 3.9] internal Python constructs.
I just realised that I really need to preserve comments after all, preferably without resorting to RedBaron or LibCST, as I only need to emit the unaltered commentary. Is there a clean and concise way of comment-preserving parsing/emitting of Python source with just the stdlib?
What I ended up doing was writing a simple parser, without a meta-language in 339 source lines:
https://github.com/offscale/cdd-python/blob/master/cdd/cst_utils.py
Implementation of Concrete Syntax Tree [List!]
Reads source character by character;
Once the end of a statement† is detected, add the statement (tagged with its type) to a 1D list;
†end of line if line.lstrip().startswith("#") or line not endswith('\\') and balanced_parens(line) else continue munching until that condition is true… plus some edge-cases around multiline strings and the like;
Once finished there is a big (1D) list where each element is a namedtuple with a value property.
Integration with builtin Abstract Syntax Tree ast
Limit the ast nodes to modify (not remove) to: {ClassDef, AsyncFunctionDef, FunctionDef} docstrings (first body element Constant|Str), Assign, and AnnAssign;
cst_idx, cst_node = find_cst_at_ast(cst_list, _node);
if doc_str node then maybe_replace_doc_str_in_function_or_class(_node, cst_idx, cst_list)
…
Now the cst_list contains only changes to those aforementioned nodes, and only when that change is more than whitespace, and can be created into a string with "".join(map(attrgetter("value"), cst_list)) for outputting to eval or straight out to a source file (e.g., in-place overriding).
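As a toy illustration of that line-grouping loop (this is my own sketch, not the cdd implementation; balanced_parens here is a crude stand-in and multiline strings are not handled):
from collections import namedtuple

CstNode = namedtuple("CstNode", ("value",))

def balanced_parens(s):
    # Rough check: counts of (), [], {} all balance out.
    return all(s.count(o) == s.count(c) for o, c in ("()", "[]", "{}"))

def source_to_cst_list(source):
    """Group physical lines into statement-sized chunks, keeping every character."""
    cst_list, buf = [], ""
    for line in source.splitlines(keepends=True):
        buf += line
        stripped = line.rstrip("\n")
        complete = (
            stripped.lstrip().startswith("#")
            or (not stripped.endswith("\\") and balanced_parens(buf))
        )
        if complete:
            cst_list.append(CstNode(value=buf))
            buf = ""
    if buf:
        cst_list.append(CstNode(value=buf))
    return cst_list

src = "x = [1,\n     2]  # kept\n# a comment\ny = 3\n"
chunks = source_to_cst_list(src)
assert "".join(c.value for c in chunks) == src  # round-trips exactly, comments included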
Quality control
100% test coverage
100% doc coverage
Support for last 6 versions of Python (including latest alpha)
CI/CD
(Apache-2.0 OR MIT) licensed
Limitations
Lack of meta-language, specifically lack of using Python's provided grammar means new syntax elements won't automatically be supported (match/case is supported, but if there's new syntax introduced since, it isn't [yet?] supported… at least not automatically);
Not built into the stdlib, so future stdlib changes could break compatibility;
Deleting nodes is [probably] not supported;
Nodes can be incorrectly identified if there are shadow variables or similar issues that linters should point out.
Comments can be preserved by merging them back into the generated source code by capturing them with the tokenizer.
Given a toy program in a program variable, we can demonstrate how comments get lost in the AST:
import ast

program = """
# This comment lost
p1v = 4 + 4
p1l = ['a', # Implicit line joining comment for a lost
       'b'] # Ending comment for b lost
def p1f(x):
    "p1f docstring"
    # Comment in function p1f lost
    return x
print(p1f(p1l), p1f(p1v))
"""
tree = ast.parse(program)
print('== Full program code:')
print(ast.unparse(tree))
The output shows all comments gone:
== Full program code:
p1v = 4 + 4
p1l = ['a', 'b']
def p1f(x):
    """p1f docstring"""
    return x
print(p1f(p1l), p1f(p1v))
However, if we scan the comments with the tokenizer, we can use this to merge the comments back in:
from io import StringIO
import tokenize

def scan_comments(source):
    """ Scan source code file for relevant comments
    """
    # Find token for comments
    for k, v in tokenize.tok_name.items():
        if v == 'COMMENT':
            comment = k
            break
    comtokens = []
    with StringIO(source) as f:
        tokens = tokenize.generate_tokens(f.readline)
        for token in tokens:
            if token.type != comment:
                continue
            comtokens += [token]
    return comtokens
comtokens = scan_comments(program)
print('== Comment after p1l[0]\n\t', comtokens[1])
Output (edited to split long line):
== Comment after p1l[0]
TokenInfo(type=60 (COMMENT),
string='# Implicit line joining comment for a lost',
start=(4, 12), end=(4, 54),
line="p1l = ['a', # Implicit line joining comment for a lost\n")
Using a slightly modified version of ast.unparse(), replacing the methods maybe_newline() and traverse() with modified versions, you should be able to merge back in all comments at their approximate locations, using the location info from the comment scanner (the start variable) combined with the location info from the AST; most nodes have a lineno attribute.
Not exactly. See for example the list variable assignment: the source code is split over two lines, but ast.unparse() generates only one line (see the output in the second code segment). Also, you need to make sure to update the location info in the AST using ast.increment_lineno() after adding code. It seems some more calls to maybe_newline() might be needed in the library code (or its replacement).
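As a crude illustration of that splice-by-line-number idea, and subject to the caveats above (ast.unparse() reflows the code, so placement is only approximate and inline comments become standalone lines), reusing the program string from the earlier snippet:
import ast
import tokenize
from io import StringIO

def unparse_with_comments(source):
    """Re-emit source from the AST (Python 3.9+), then splice the comments back in
    as standalone lines near their original line numbers. Approximate by design."""
    regenerated = ast.unparse(ast.parse(source)).splitlines()
    comments = [tok for tok in tokenize.generate_tokens(StringIO(source).readline)
                if tok.type == tokenize.COMMENT]
    # Insert from the last comment backwards so earlier insertions
    # don't shift the indices of comments we have not placed yet.
    for tok in reversed(comments):
        pos = min(tok.start[0] - 1, len(regenerated))
        regenerated.insert(pos, tok.string)
    return "\n".join(regenerated)

print(unparse_with_comments(program))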

How to use refactoring method with functions on python code?

I am in the learning phase of writing Python code. I created the code below and got results successfully; however, I have been asked to refactor it and I am not sure how to proceed. I did refer to multiple posts related to refactoring but got more confused and was not clear on how it's done. Any assistance will be appreciated. Thanks.
import pandas as pd
import numpy as np
pd.set_option('display.max_columns',100)
data = pd.read_excel (r'S:\folder\file1.xlsx')
df_mail =pd.DataFrame(data,columns= ['CustomerName','CDAAccount','Transit'])
print(df_mail)
df_maillist =df_mail.rename(columns={'CDAAccount':'ACOUNT_NUM','Transit':'BRANCH_NUM'})
print(df_maillist)
## 1) Read SAS files
pathcifbas = r'S:\folder\custbas.sas7bdat'
pathcifadr = r'S:\folder\cusadr.sas7bdat'
pathcifacc = r'S:\folder\cusact.sas7bdat'
##custbas.sas7bdat
columns=['CIFNUM','CUSTOMR_LANGUG_C']
dfcifbas = pd.read_sas(pathcifbas)
print(dfcifbas.head())
df_langprf= dfcifbas[columns]
print(df_langprf.head())
df_lang =df_langprf.rename(columns={'CUSTOMR_LANGUG_C':'Language Preference'})
print(df_lang)
## cusadr.sas7bdat
dfcifadr = pd.read_sas(pathcifadr)
print(dfcifadr.head())
cols=['CIFNUM','ADRES_STREET_NUM','ADRES_STREET_NAME','ADRES_CITY','ADRES_STATE_PROV_C','FULL_POSTAL','ADRES_COUNTRY_C','ADRES_SPECL_ADRES']
df_adr= dfcifadr[cols]
print(df_adr.head())
### Renaming the columns
df_adrress =df_adr.rename(columns={'ADRES_CITY':'City','ADRES_STATE_PROV_C':'Province','FULL_POSTAL':'Postal Code','ADRES_COUNTRY_C':'Country','ADRES_SPECL_ADRES':'Special Address'})
print(df_adrress)
## cusact.sas7bdat
dfcifacc = pd.read_sas(pathcifacc)
print(dfcifacc.head())
colmns=['CIFNUM','ACOUNT_NUM','BRANCH_NUM','APLICTN_ID']
df_acc= dfcifacc[colmns]
print(df_acc)
## Filtering the tables with ['APLICTN_ID']== b'CDA'
df_cda= df_acc['APLICTN_ID']== b'CDA'
print(df_cda.head())
df_acccda = df_acc[df_cda]
print(df_acccda)
## Joining dataframes (df_lang), (df_adrress) and (df_acccda) on CIF_NUM
from functools import reduce
Combine_CIFNUM= [df_acccda,df_lang,df_adrress ]
df_cifnum = reduce(lambda left,right: pd.merge(left,right,on='CIFNUM'), Combine_CIFNUM)
print(df_cifnum)
#convert multiple columns object byte to string
df_cifnumstr= df_cifnum.select_dtypes([np.object])
df_cifnumstr=df_cifnumstr.stack().str.decode('latin1').unstack()
for col in df_cifnumstr:
    df_cifnum[col] = df_cifnumstr[col]
print(df_cifnum) ## Combined Data Frame
# Joining Mail list with df_cifnum(combined dataframe)
Join1_mailcifnum=pd.merge(df_maillist,df_cifnum, on=['ACOUNT_NUM','BRANCH_NUM'],how='left')
print(Join1_mailcifnum)
## dropping unwanted columns
Com_maillist= Join1_mailcifnum.drop(['CIFNUM','APLICTN_ID'], axis =1)
print(Com_maillist)
## concatenating Street Num + Street Name = Street Address
Com_maillist["Street Address"]=(Com_maillist['ADRES_STREET_NUM'].map(str)+ ' ' + Com_maillist['ADRES_STREET_NAME'].map(str))
print (Com_maillist.head())
## Rearranging columns
Final_maillist= Com_maillist[["CustomerName","ACOUNT_NUM","BRANCH_NUM","Street Address","City","Province","Postal Code","Country","Language Preference","Special Address"]]
print(Final_maillist)
## Export to excel
Final_maillist.to_excel(r'S:\Data Analysis\folder\Final_List.xlsx',index= False, sheet_name='Final_Maillist',header=True)
Good code refactoring can be composed of many different steps, and depending on what your educator/client/manager/etc. expects, could involve vastly different amounts of effort and time spent. It's a good idea to ask this person what expectations they have for this specific project and start there.
However, for someone relatively new to Python I'd recommend you start with readability and organization. Make sure all your variable names are explicit and readable (assuming you're not using a required pattern like Hungarian notation). As a starting point, the Python naming conventions tend to use lowercase letters and underscores, with exceptions for certain objects or class names. Python actually has a really in-depth style guide called PEP-8. You can find it here
https://www.python.org/dev/peps/pep-0008/
A personal favorite of mine are comments. Comments should always contain the "why" of something, not necessarily the "how" (your code should be readable enough to make this part relatively obvious). This is a bit harder for smaller scripts or assignments where you don't have a ton of individual choice, but it's good to keep in mind.
If you've learned about object oriented programming, you should definitely split up tasks into functions and classes. In your specific case, you could create individual functions for things like loading files, performing specific operations on the file contents, and exporting. If you notice a bunch of functions that tend to have similar themes, that may be a good time to look into creating a class for those functions!
Finally, and again this is a personal preference (for basic scripts anyways), but I like to see a main declaration for readability and organization.
# imports go here!
# specific functions
def some_function():
    return

if __name__ == "__main__":
    # the start of your program goes here!
    pass
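As a very rough illustration of the two suggestions above applied to the question's script (the function split is mine, only a few of the columns and joins are shown, and the paths are copied from the question, so treat this as a sketch rather than a drop-in replacement):
import pandas as pd

def load_mail_list(path):
    """Read the Excel mail list and normalize the join-key column names."""
    df = pd.read_excel(path, usecols=['CustomerName', 'CDAAccount', 'Transit'])
    return df.rename(columns={'CDAAccount': 'ACOUNT_NUM', 'Transit': 'BRANCH_NUM'})

def load_sas_columns(path, columns, renames=None):
    """Read a SAS dataset, keep selected columns, and optionally rename them."""
    df = pd.read_sas(path)[columns]
    return df.rename(columns=renames) if renames else df

def build_final_list(mail_path, account_path):
    """Join the mail list with the CDA accounts; other joins would follow the same pattern."""
    mail = load_mail_list(mail_path)
    accounts = load_sas_columns(account_path,
                                ['CIFNUM', 'ACOUNT_NUM', 'BRANCH_NUM', 'APLICTN_ID'])
    accounts = accounts[accounts['APLICTN_ID'] == b'CDA']
    return mail.merge(accounts, on=['ACOUNT_NUM', 'BRANCH_NUM'], how='left')

if __name__ == "__main__":
    final = build_final_list(r'S:\folder\file1.xlsx', r'S:\folder\cusact.sas7bdat')
    final.to_excel(r'S:\folder\Final_List.xlsx', index=False)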
This is all pretty heavily simplified for the purposes of just starting out. There are plenty of other resources that can go more in depth in organization, good practices, and optimization.
Best of luck!

python-docx - how to restart list lettering

I'm automating the process of creating a Word document with the python-docx module. In particular, I'm creating a multiple choice test where the questions are numbered 1., 2., 3., ... and under each question there are 4 answers that should be labeled as A., B., C., and D. I used a style to create the numbered list and the lettered list. However, I don't know how to restart the letters. For example, the answers for the 2nd question would range from E., F., G., H. Does anyone know how to restart the lettering back to A? I could manually specify the lettering in the answer string but I'm wondering how to do it with the style sheet. Thank you.
The short answer is that this is not supported yet in python-docx but we might be able to provide you with a workaround if you ask on this issue on the Github issue list for the project: https://github.com/python-openxml/python-docx/issues/25
This particular operation turns out to be way more difficult in Word than anyone I know would have imagined. I believe that has to do with the need to maintain backward compatibility across so many versions over the two or three decades of Word's existence.
Just to give you an idea, the style references a numbering definition which itself references an abstract numbering definition. Each paragraph with the style gets the next number/letter in the sequence. To restart the sequence, you have to create a NEW numbering definition that references the same abstract numbering sequence as the prior one. Then you reference the new numbering definition on the paragraph where the sequence should restart. Paragraphs with the style following that one get the next number/letter in the restarted sequence.
In order to accomplish that, you need to:
locate the numbering definition of the style
locate the abstract numbering definition it points to
create a new numbering definition with the restart bit set
tweak the paragraph element to refer to the new numbering definition
Anyway, now that I've vented about all that I can tell you we've actually gotten it to work before. We haven't added it to the API yet mostly I suppose because it's not entirely clear what the API should look like and it hasn't risen to the top of the backlog yet. But a couple workaround functions could probably get it done for you if you want it badly enough.
In your case I suppose it would be a toss-up. I would strongly consider placing those bits directly in the paragraph text in this case, but you'll be best able to decide.
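If you do decide to write the letters into the text yourself, a minimal sketch of that workaround might look like this (the question/answer data and file name are made up, and no numbering styles are involved, so the lettering trivially restarts for each question):
from docx import Document

doc = Document()
questions = [
    ("What is 2 + 2?", ["3", "4", "5", "22"]),
    ("Which of these is a Python web framework?", ["Django", "NumPy", "pandas", "pytest"]),
]
for qnum, (question, answers) in enumerate(questions, start=1):
    doc.add_paragraph("%d. %s" % (qnum, question))
    for i, answer in enumerate(answers):
        # Lettering is written into the text itself, so it restarts per question.
        doc.add_paragraph("    %s. %s" % (chr(ord('A') + i), answer))
doc.save("quiz.docx")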
I've created a pull request (#582) that addresses this situation at a low level. All I have done is define the XML types necessary to implement the numbering subsystem of WML. @scanny has created a submodule called xmlchemy that creates a semi-abstract representation of the XML, so that you can handle multilevel lists and other numbering tasks if you are familiar with the standard. So if you build my fork, the following code will work:
#!/usr/bin/python
from docx import Document
from docx import oxml
d = Document()
"""
1. Create an abstract numbering definition for a multi-level numbering style.
"""
numXML = d.part.numbering_part.numbering_definitions._numbering
nextAbstractId = max([ J.abstractNumId for J in numXML.abstractNum_lst ] ) + 1
l = numXML.add_abstractNum()
l.abstractNumId = nextAbstractId
m = l.add_multiLevelType()
m.val = 'multiLevel'
"""
2. Define numbering formats for each (zero-indexed)
level. N.B. The formatting text is one-indexed.
The user agent will accept up to nine levels.
"""
formats = {0: "decimal", 1: "upperLetter" }
textFmts = {0: '%1.', 1: '%2.' }
for i in range(2):
    lvl = l.add_lvl()
    lvl.ilvl = i
    n = lvl.add_numFmt()
    n.val = formats[i]
    lt = lvl.add_lvlText()
    lt.val = textFmts[i]
"""
3. Link the abstract numbering definition to a numbering definition.
"""
n = numXML.add_num(nextAbstractId)
"""
4. Define a function to set the (0-indexed) numbering level of a paragraph.
"""
def set_ilvl(p, ilvl):
    pr = p._element._add_pPr()
    np = pr.get_or_add_numPr()
    il = np.get_or_add_ilvl()
    il.val = ilvl
    ni = np.get_or_add_numId()
    ni.val = n.numId
    return p
"""
5. Create some content
"""
for x in [1, 2, 3]:
    p = d.add_paragraph()
    set_ilvl(p, 0)
    p.add_run("Question %i" % x)
    for y in [1, 2, 3, 4]:
        p2 = d.add_paragraph()
        set_ilvl(p2, 1)
        p2.add_run("Choice %i" % y)
d.save('test.docx')

Create an index of the content of each file in a folder

I'm making a search tool in Python.
Its objective is to be able to search files by their content (we're mostly talking about source files and text files, not images/binaries, even if searching their metadata would be a great improvement). For now I don't use regular expressions, just plain text.
This part of the algorithm works great!
The problem is that I realize I'm mostly searching in the same few folders, and I'd like to find a way to build an index of the content of each file in a folder, and to be able to know as fast as possible whether the sentence I'm searching for is in xxx.txt or cannot be there.
The idea for now is to maintain a checksum for each file that lets me know whether it contains a particular string.
Do you know of any algorithm close to this?
I don't need a 100% success rate; I prefer a small index to a big one with 100% success.
The idea is to provide a generic tool.
EDIT: To be clear, I want to search for a PART of the content of the file. So making an MD5 hash of all its content and comparing it with the hash of what I'm searching for isn't a good idea ;)
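(One way to read the "checksum" idea above: keep a per-file fingerprint of the words it contains, and only open files whose fingerprint could contain every word of the query. A minimal sketch with plain Python sets follows; a Bloom filter per file would shrink the index considerably at the cost of occasional false positives, which matches the "no 100% needed" requirement.)
import os
import re

WORD = re.compile(r"\w+")

def fingerprint(path):
    """Set of lowercased words appearing in the file."""
    with open(path, errors="ignore") as f:
        return {w.lower() for w in WORD.findall(f.read())}

def build_index(folder):
    index = {}
    for top, dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(top, name)
            index[path] = fingerprint(path)
    return index

def candidate_files(index, sentence):
    """Files that could contain the sentence; only these need a real scan."""
    words = {w.lower() for w in WORD.findall(sentence)}
    return [path for path, words_in_file in index.items() if words <= words_in_file]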
Here I am using the Whoosh library for searching/indexing. The upper part indexes the files and the lower part is a demo search:
#indexing part
from whoosh.index import create_in
from whoosh.fields import *
import os
import stat
import time
schema = Schema(FileName=TEXT(stored=True), FilePath=TEXT(stored=True), Size=TEXT(stored=True), LastModified=TEXT(stored=True),
LastAccessed=TEXT(stored=True), CreationTime=TEXT(stored=True), Mode=TEXT(stored=True))
ix = create_in("./my_whoosh_index_dir", schema)  # this directory must already exist
writer = ix.writer()
for top, dirs, files in os.walk('./my_test_dir'):
    for nm in files:
        fileStats = os.stat(os.path.join(top, nm))
        fileInfo = {
            'FileName': nm,
            'FilePath': os.path.join(top, nm),
            'Size': fileStats[stat.ST_SIZE],
            'LastModified': time.ctime(fileStats[stat.ST_MTIME]),
            'LastAccessed': time.ctime(fileStats[stat.ST_ATIME]),
            'CreationTime': time.ctime(fileStats[stat.ST_CTIME]),
            'Mode': fileStats[stat.ST_MODE]
        }
        writer.add_document(FileName=u'%s'%fileInfo['FileName'],FilePath=u'%s'%fileInfo['FilePath'],Size=u'%s'%fileInfo['Size'],LastModified=u'%s'%fileInfo['LastModified'],LastAccessed=u'%s'%fileInfo['LastAccessed'],CreationTime=u'%s'%fileInfo['CreationTime'],Mode=u'%s'%fileInfo['Mode'])
writer.commit()
## now the searching part
from whoosh.qparser import QueryParser
with ix.searcher() as searcher:
    query = QueryParser("FileName", ix.schema).parse(u"hsbc")  ## here 'hsbc' is the search term
    results = searcher.search(query)
    for x in results:
        print x['FileName']
It's not the most efficient, but just uses the stdlib and a little bit of work. sqlite3 (if it's enabled on compilation) supports full text indexing. See: http://www.sqlite.org/fts3.html
So you could create a table of [file_id, filename] and a table of [file_id, line_number, line_text], and use those to base your queries on, i.e.: which files contain this word, which lines contain this AND that but NOT the other, etc.
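A minimal sketch of that layout using SQLite's FTS5 module (assuming your sqlite3 build has FTS5 compiled in; the table layout and the './my_test_dir' path are just placeholders):
import os
import sqlite3

conn = sqlite3.connect("content_index.db")
conn.execute(
    "CREATE VIRTUAL TABLE IF NOT EXISTS lines "
    "USING fts5(filename UNINDEXED, line_no UNINDEXED, line_text)"
)

def index_folder(folder):
    rows = []
    for top, dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(top, name)
            try:
                with open(path, errors="ignore") as f:
                    rows.extend((path, i, line) for i, line in enumerate(f, 1))
            except OSError:
                continue
    conn.executemany("INSERT INTO lines VALUES (?, ?, ?)", rows)
    conn.commit()

def search(term):
    # FTS MATCH does tokenized full-text matching, not arbitrary substring matching.
    return conn.execute(
        "SELECT filename, line_no FROM lines WHERE lines MATCH ?", (term,)
    ).fetchall()

index_folder("./my_test_dir")
print(search("hsbc"))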
The only reason anyone would want a tool capable of searching 'certain parts' of a file is that they are trying to analyze data that has legal restrictions on which parts of it can be read.
For example, Apple has the capability of identifying the GPS location of your iPhone at any moment a text was sent or received. But, what they cannot legally do is associate that location data with anything that can be tied to you as an individual.
On a broad scale you can use obscure data like this to track and analyze patterns throughout large amounts of data. You could feasibly assign a unique 'Virtual ID' to every cell phone in the USA and log all location movement; afterward you implement a method for detecting patterns of travel. Outliers could be detected through deviations from their normal travel pattern. That 'metadata' could then be combined with data from outside sources such as the names and locations of retail locations. Think of all the situations you might be able to detect algorithmically, like the soccer dad who for 3 years has driven the same general route between work, home, restaurants, and a little league field. Only being able to search part of a file still offers enough data to detect that the soccer dad's phone's unique signature suddenly departed from the normal routine and entered a gun shop. The possibilities are limitless. That data could be shared with local law enforcement to increase street presence in public spaces nearby, all while maintaining the anonymity of the phone's owner.
Capabilities like the example above are not legally possible in today's environment without the method IggY is looking for.
On the other hand, it could just be that he is only looking for certain types of data in certain file types. If he knows where in the file he wants to search for the data he needs he can save major CPU time only reading the last half or first half of a file.
You can do a simple name-based cache as below. This is probably best (fastest) if the file contents are not expected to change. Otherwise, you can MD5 the file contents. I say MD5 because it's faster than SHA, and this application doesn't seem security sensitive.
from hashlib import md5
import os

info_cache = {}
for file in files_to_search:
    file_info = get_file_info(file)  # placeholder for whatever per-file data you cache
    file_hash = md5(os.path.abspath(file).encode()).hexdigest()  # md5 needs bytes on Python 3
    info_cache[file_hash] = file_info

Extracting move information from a pgn file on Python

How do I go about extracting move information from a pgn file on Python? I'm new to programming and any help would be appreciated.
Try pgnparser.
Example code:
import pgn
import sys
f = open(sys.argv[1])
pgn_text = f.read()
f.close()
games = pgn.loads(pgn_text)
for game in games:
    print game.moves
@Dennis Golomazov
I like what Dennis did above. To add to it: if you want to extract move information from more than one game in a PGN file (say, games in a chess-database PGN file), use chess.pgn.
import chess.pgn

pgn_file = open('sample.pgn')
current_game = chess.pgn.read_game(pgn_file)
pgn_text = str(current_game.mainline_moves())
read_game() acts like an iterator: calling it again will grab the next game in the PGN, and it returns None when there are no games left.
I can't give you any Python-specific directions, but I wrote a PGN converter recently in Java, so I'll try to offer some advice. The main disadvantage of Miku's link is that the site doesn't allow for variance in .pgn files, and every site seems to vary slightly in the exact format.
Some .pgn have the move number attached to the move itself (1.e4 instead of 1. e4) so if you tokenise the string, you could check the placement of the dot since it only occurs in move numbers.
Work out all the different move combinations you can have. If a move is 5 characters long it could be O-O-O (queenside castling), Nge2+ (knight from the g-file to e2, with check (+) or checkmate (#)), or Rexb5 (rook on the e-file captures on b5).
The longest string a move could be is 7 characters (for when you must specify origin rank AND file AND a capture AND with check). The shortest is 2 characters (a pawn advance).
Plan early for castling and en passant moves. You may realise too late that the way you have built your program doesn't easily adapt for them.
The details given at the start (Elo ratings, location, etc.) vary from file to file.
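To make the tokenising advice concrete, here is a rough sketch of a SAN move extractor. The regular expression is mine and deliberately simplistic: it ignores comments in braces, variations in parentheses, and NAG annotations.
import re

# Rough SAN token pattern: castling, or piece/pawn moves with optional
# disambiguation, capture, promotion, and a check/checkmate suffix.
SAN_MOVE = re.compile(
    r"(O-O(?:-O)?|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](?:=[QRBN])?)[+#]?"
)
MOVE_NUMBER = re.compile(r"\d+\.(\.\.)?")

def moves_from_movetext(movetext):
    moves = []
    for tok in movetext.replace("\n", " ").split():
        tok = MOVE_NUMBER.sub("", tok)   # strip "1." or "1..." prefixes
        if tok in ("1-0", "0-1", "1/2-1/2", "*"):
            break                        # game result terminates the movetext
        if SAN_MOVE.fullmatch(tok):
            moves.append(tok)
    return moves

print(moves_from_movetext("1.e4 e5 2.Nf3 Nc6 3.Bb5 a6 1-0"))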
I don't have a PGN parser for Python, but you can get the source code of a PGN parser for Xcode from this place; it may be of assistance.
