Extracting chains from an electron microscopy structure - python

I need to extract single chains from a structure file in cif format as available from the PDB. I've read several related questions, such as this and this. The proposed solution indeed works well if the chain ID is an integer or a single character. If applied to a structure such as 6KMW to extract chain aA it raises the error TypeError: %c requires int or char. Full code used to reproduce the error and output included below.
from Bio.PDB import PDBList, PDBIO, FastMMCIFParser, Select

class ChainSelect(Select):
    def __init__(self, chain):
        self.chain = chain

    def accept_chain(self, chain):
        if chain.get_id() == self.chain:
            return 1
        else:
            return 0

pdbl = PDBList()
io = PDBIO()
parser = FastMMCIFParser(QUIET=True)

pdbl.retrieve_pdb_file('6kmw', pdir='.', file_format='mmCif')
structure = parser.get_structure('6kmw', '6kmw.cif')
io.set_structure(structure)
io.save('6kmw_aA.pdb', ChainSelect('aA'))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-095b98a12800> in <module>
18 structure = parser.get_structure('6kmw', '6kmw.cif')
19 io.set_structure(structure)
---> 20 io.save('6kmw_aA.pdb', ChainSelect('aA'))
~/miniconda3/envs/lab2/lib/python3.8/site-packages/Bio/PDB/PDBIO.py in save(self, file, select, write_end, preserve_atom_numbering)
368 )
369
--> 370 s = get_atom_line(
371 atom,
372 hetfield,
~/miniconda3/envs/lab2/lib/python3.8/site-packages/Bio/PDB/PDBIO.py in _get_atom_line(self, atom, hetfield, segid, atom_number, resname, resseq, icode, chain_id, charge)
227 charge,
228 )
--> 229 return _ATOM_FORMAT_STRING % args
230
231 else:
TypeError: %c requires int or char
Is anyone aware of a Biopython functionality to achieve the result? Preferably one that doesn't rely on parsing the entire file by custom functions.

I think what you are trying to achieve is just impossible. Effectively you want to convert a CIF file to a PDB file. It does not matter that you want to reduce the protein structure to a single chain in the process.
The PDB format is a file format from the last century. (I know how widespread it still is today...) It is column-oriented and only allows one character for the chain ID. This is the reason you cannot download a PDB file for protein 6KMW. See the tooltip at https://www.rcsb.org/structure/6KMW for that: "PDB format files are not available for large structures". In your case, "large" means proteins with so many chains that they need two-character IDs.
You cannot store two characters as the chain name in a PDB file.
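You can reproduce the limitation without Biopython at all. A minimal sketch (the %c format code is the same one Biopython's fixed-width ATOM writer uses, as your traceback shows):

# The PDB ATOM record reserves a single column for the chain ID,
# and the writer formats it with "%c":
"%c" % "A"   # fine: a one-character chain ID fills the single column
"%c" % "aA"  # TypeError: %c requires int or char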
You have two options now:
Rename the chain "aA" and save the file in PDB format
Don't use the PDB format as your file format but stick to CIF
This snippet renames the chain and stores the structure as a pdb file:
[...]
io.set_structure(structure)
for model in structure:
    for chain in model:
        if chain.get_id() == "A":
            chain.id = "_"
            print("renamed chain A to _")
        if chain.get_id() == "aA":
            chain.id = "A"
            print("renamed chain aA to A")
io.save('6kmw_aA.pdb', ChainSelect('A'))
This snippet stores only chain 'aA' in mmCIF format:
from Bio.PDB.mmcifio import MMCIFIO
io = MMCIFIO()
io.set_structure(structure)
io.save("6kmw_aA.cif", ChainSelect('aA'))

TypeError: int() argument must be a string, a bytes-like object or a number, not '_io.TextIOWrapper'

I am trying to read a text file line by line as integers. I have tried every suggestion I saw here, but none works for me. Here is the code I'm using. It reads some seismic data from datadir and evaluates the SNR to decide whether to keep the data or remove it. To do so, I need to calculate the distance between the stations and the earthquake, the information for which comes from input files.
from obspy import UTCDateTime
import os

datadir = "/home/alireza/Desktop/Saman/Eqcomplete"
homedir = "/home/alireza/Desktop/Saman"

eventlist = os.path.join (homedir, 'events.dat')
stationlist = os.path.join (homedir, 'all_st')

e = open (eventlist, "r")
for event in e.readlines():
    year, mon, day, time, lat, lon = event.split (" ")
    h = str (time.split (":")[0]) # hour
    m = str (time.split (":")[1]) # minute
    s = str (time.split (":")[2]) # second
    s = open (stationlist, "r")
    for station in s.readlines():
        stname, stlo, stla = station.split (" ")
        OafterB = UTCDateTime (int(year), int(mon), int(day), int(h), int(m), int(s))
        print (OafterB) # just to have an output!
    s.close ()
e.close ()
Update:
There are two input files:
events.dat which is like:
2020 03 18 17:45:39 -11.0521 115.1378
all_st which is like:
AHWZ 48.644 31.430
AFRZ 59.015 33.525
NHDN 60.050 31.493
BDRS 48.881 34.054
BMDN 48.825 33.772
HAGD 49.139 34.922
Here is the output:
Traceback (most recent call last):
File "SNR.py", line 21, in <module>
OafterB = UTCDateTime (int(year), int(mon), int(day), int(h), int(m), int(s))
TypeError: int() argument must be a string, a bytes-like object or a number, not '_io.TextIOWrapper'
To test the code, you need to install the obspy package; pip install obspy may work.
You define s here:
s = str (time.split (":")[2]) # second
But then, immediately afterward, you redefine it:
s = open (stationlist, "r")
Now s points to a file object, so int(s) fails with the above error. Name your station list file object something different, and the problem goes away.
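Stripped down to its essence, the bug looks like this (a toy illustration, with made-up values):

s = "39"                 # seconds, parsed from the time string
s = open("all_st", "r")  # s now shadows the string with a file object
int(s)                   # TypeError: int() argument must be ... not '_io.TextIOWrapper'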
Other tips which you may find helpful:
split() will automatically split on whitespace unless you tell it otherwise, so there's no need to specify " ".
You can use multiple assignment to assign h, m, and s the same way you did with the previous line. Currently, you're performing the same split operation three different times.
It's recommended to open files using the with keyword, which will automatically handle closing the file, even if an exception occurs.
You can iterate over a file object directly, without creating a list with readlines().
Using pathlib can make it much simpler and cleaner to deal with filesystem paths and separators.
It's considered bad form to put spaces between the name of a function and the parentheses.
There's also a convention that variable names (other than class names) are usually all lowercase, with underscores between words as needed. (See PEP 8 for a helpful rundown of all such style conventions. They're not hard and fast rules, but they can help make code more consistent and readable.)
With those things in mind, here's a slightly spruced up version of your above code:
from pathlib import Path
from obspy import UTCDateTime

data_dir = Path('/home/alireza/Desktop/Saman/Eqcomplete')
home_dir = Path('/home/alireza/Desktop/Saman')

event_list = home_dir / 'events.dat'
station_list = home_dir / 'all_st'

with open(event_list) as e_file:
    for event in e_file:
        year, mon, day, time, lat, lon = event.split()
        h, m, s = time.split(':')
        with open(station_list) as s_file:
            for station in s_file:
                stname, stlo, stla = station.split()
                o_after_b = UTCDateTime(
                    int(year), int(mon), int(day), int(h), int(m), int(s)
                )
                print(o_after_b)

Get the sequenced reads present within 300 bases of restriction sites

I have many SAM files containing sequenced reads, and I want to get the reads which are present within 300 bases of a restriction site in either direction.
The 1st file contains the positions of restriction sites and has two columns:
chr01 4957
chr01 6605
chr02 19968
chr02 21055
chr02 208555
chr03 243398
The 2nd file has the reads in SAM format (almost 2.6M lines):
id1995 147 chr03 119509969 42 85M
id1999 83 chr10 131560619 26 81M
id1999 163 chr10 131560429 26 85M
id2099 73 chr10 60627850 42 81M
Now I want to compare column 3 of the SAM file with column 1 of the position file, and column 4 of the SAM file with column 2 of the position file.
I tried doing this in R, but because the data is large it is taking a lot of time.
Can you improve the R script to work faster by implementing a better algorithm?
R code:
pos = read.csv(file="sites.csv", header=F, sep="\t")
fastq = read.csv(file="reads.sam", header=F, sep="\t")
newFastq = data.frame(fastq)
newFastq = NULL
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
for(i in 1:nrow(fastq)){
  for(j in 1:nrow(pos)){
    if(as.character(fastq[i,3]) == trim(as.character(pos[j,1]))){
      if(fastq[i,4] - pos[j,2] < 300 && fastq[i,4] - pos[j,2] > -300){
        newFastq = rbind(newFastq, fastq[i,])
      }
    }
  }
}
# Write data into file
write.table(newFastq, file="sitesFound.csv", row.names=FALSE, na="", quote=FALSE, col.names=FALSE, sep="\t")
Or can you improve this code by rewriting it in Perl?
One overall strategy is to make indexed bam files using Bioconductor Rsamtools asBam() and indexBam(). Read your first file into a data.frame and construct a GenomicRanges GRanges() object. Finally, use GenomicAlignments readGAlignments() to read the bam file, using the GRanges() as the which= argument to ScanBamParam(). The Bioconductor support site https://support.bioconductor.org is more appropriate for Bioconductor questions, if you decide to go this route.
It looks like you want reads that are within +/- 300 base pairs of your GRanges object. Resize the GRanges
library(GenomicRanges)
## create gr = GRanges(...)
gr = resize(gr, width = 600, fix="center")
Use this as the which= in ScanBamParam(), and read your BAM file:
library(GenomicAlignments)
param = ScanBamParam(which = gr)
reads = readGAlignments("your.bam", param = param)
Use what= to control fields read from the BAM file, e.g.
param = ScanBamParam(which = gr, what = "seq")

ValueError in Python 3 code

I have this code that should let me count the number of missing rows of numbers within a CSV, in a script for Python 3.6. However, the program gives the following error:
Error:
Traceback (most recent call last):
File "C:\Users\GapReport.py", line 14, in <module>
EndDoc_Padded, EndDoc_Padded = (int(s.strip()[2:]) for s in line)
File "C:\Users\GapReport.py", line 14, in <genexpr>
EndDoc_Padded, EndDoc_Padded = (int(s.strip()[2:]) for s in line)
ValueError: invalid literal for int() with base 10: 'AC-SEC 000000001'
Code:
import csv

def out(*args):
    print('{},{}'.format(*(str(i).rjust(4, "0") for i in args)))

prev = 0
data = csv.reader(open('Padded Numbers_export.csv'))
print(*next(data), sep=', ')  # header

for line in data:
    EndDoc_Padded, EndDoc_Padded = (int(s.strip()[2:]) for s in line)
    if start != prev+1:
        out(prev+1, start-1)
    prev = end
out(start, end)
I'm stumped on how to fix these issues. Also, I think the CSV has many lines in it, so if there's a way to limit it to a few numbers, please feel free to let me know.
CSV Snippet (Sorry if I wasn't clear before!):
The values you have in your CSV file are not numeric.
For example, FMAC-SEC 000000001 is not a number. So when you run int(s.strip()[2:]), it is not able to convert it to an int.
Some more comments on the code:
What is the utility of doing EndDoc_Padded, EndDoc_Padded = (...)? Currently you are assigning values to two variables with the same name. Either name one of them something else, or just have one variable there.
Are you trying to get the two different values from each column? In that case, you need to split line into two first. Are the contents of your file comma separated? If yes, then do for s in line.split(','), otherwise use the appropriate separator value in split().
You are running this inside a loop, so each time the values of the two variables would get updated to the values from the last line. If you're trying to obtain 2 lists of all the values, then this won't work.
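Putting those together, here is a minimal sketch of a fix, assuming each row has two fields shaped like FMAC-SEC 000000001 and that the number you want is the last whitespace-separated token (the start/end names are guesses based on your gap-reporting logic):

import csv

with open('Padded Numbers_export.csv', newline='') as f:
    data = csv.reader(f)
    print(*next(data), sep=', ')  # header
    for line in data:
        # 'FMAC-SEC 000000001' -> take the last whitespace-separated
        # token of each field and parse it as an integer
        start, end = (int(field.split()[-1]) for field in line)
        print(start, end)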

Custom Floor/Ceiling with Significance on a Time Series data Python

I have been working on a project which uses time series for its calculations.
I want to have the floor and ceiling of the data (similar to Excel's FLOOR and CEILING functions) for an entire column.
I checked for custom numpy functions, but couldn't see anything which includes a significance level.
I defined custom functions:
import math

def ceil(x, s):
    return s * math.ceil(float(x)/s)

def floor(x, s):
    return s * math.floor(float(x)/s)
However, I cannot use them simultaneously on an entire column, because of which I need to iterate over each row individually:
for i in symbols:
    symbols[i]['PutStrike'] = 0
    symbols[i]['CallStrike'] = 0
    for counter in range(0, len(symbols[0])):
        symbols[i]['PutStrike'][counter] = floor(symbols[i]['FUT'][counter], Strike_Diff[i])
        symbols[i]['CallStrike'][counter] = ceil(symbols[i]['FUT'][counter], Strike_Diff[i])
return symbols
Which of course is not the correct approach, along with being time-consuming.
What I want is something like this:
def CalculateIV(symbols):
    for i in symbols:
        symbols[i]['PutStrike'] = 0
        symbols[i]['CallStrike'] = 0
        symbols[i]['PutStrike'] = floor(symbols[i]['FUT'], Strike_Diff[i])
        symbols[i]['CallStrike'] = ceil(symbols[i]['FUT'], Strike_Diff[i])
    return symbols
However, when I run it, I get:
CalculateIV(abc)
Traceback (most recent call last):
File "<ipython-input-456-599f9aa19e37>", line 1, in <module>
CalculateIV(abc)
File "<ipython-input-452-190c395d86ed>", line 9, in CalculateIV
symbols[i]['PutStrike']=floor(symbols[i]['FUT'],Strike_Diff[i])
File "<ipython-input-260-8a88fc57ddf5>", line 2, in floor
return s * math.floor(float(x)/s)
File "C:\Users\jay\Anaconda2\lib\site-packages\pandas\core\series.py", line 93, in wrapper
"{0}".format(str(converter)))
TypeError: cannot convert the series to <type 'float'>
Can someone please suggest an alternative/quicker approach, or any library which could ease this?
Thanks in advance.
Well, it was easier than I thought. I had to use the vectorize function in numpy (np.vectorize):
import math
import numpy as np

def ceil(x, s):
    return s * math.ceil(float(x)/s)

def floor(x, s):
    return s * math.floor(float(x)/s)

vfloor = np.vectorize(floor)
vceil = np.vectorize(ceil)
These functions are now vectorized, so I can use them straight away to process multiple dataframes within seconds:
def CalculateIV(symbols):
    for i in symbols:
        symbols[i]['PutStrike'] = 0
        symbols[i]['CallStrike'] = 0
        symbols[i]['PutStrike'] = vfloor(symbols[i]['FUT'], Strike_Diff[i])
        symbols[i]['CallStrike'] = vceil(symbols[i]['FUT'], Strike_Diff[i])
    return symbols
If pqr has multiple dataframes in it, I can just use the line below to gather the floor and ceiling values:
output = CalculateIV(pqr)
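One caveat worth adding: np.vectorize is essentially a Python-level loop in disguise, so for large columns a form built on numpy's own floor/ceil should be noticeably faster. A minimal sketch (the sample values are made up):

import numpy as np
import pandas as pd

def vec_floor(x, s):
    # round each element of x down to the nearest multiple of s
    return s * np.floor(x / s)

def vec_ceil(x, s):
    # round each element of x up to the nearest multiple of s
    return s * np.ceil(x / s)

fut = pd.Series([101.3, 149.9, 150.0, 203.7])
print(vec_floor(fut, 50).tolist())  # [100.0, 100.0, 150.0, 200.0]
print(vec_ceil(fut, 50).tolist())   # [150.0, 150.0, 150.0, 250.0]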

Generating LaTeX tables from R summary with RPy and xtable

I am running a few linear model fits in python (using R as a backend via RPy) and I would like to export some LaTeX tables with my R "summary" data.
This thread explains quite well how to do it in R (with the xtable function), but I cannot figure out how to implement this in RPy.
The only relevant thing that searches such as "Chunk RPy" or "xtable RPy" returned was this, which seems to load the package in Python but not to use it :-/
Here's an example of how I use RPy and what happens.
And this would be the error without bothering to load any data:
from rpy2.robjects.packages import importr
xtable = importr('xtable')
latex = xtable('')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-131-8b38f31b5bb9> in <module>()
----> 1 latex = xtable(res_sum)
2 print latex
TypeError: 'SignatureTranslatedPackage' object is not callable
I have tried using the stargazer package instead of xtable and I get the same error.
Ok, I solved it, and I'm a bit ashamed to say that it was a total no-brainer.
You just have to call the functions as xtable.xtable() or stargazer.stargazer().
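In other words (a short sketch; the small data.frame below stands in for the model summary from the question's session):

from rpy2.robjects import r
from rpy2.robjects.packages import importr

xtable = importr('xtable')
# a toy R data.frame standing in for the fitted-model summary
df = r('data.frame(x = 1:3, y = c(2.5, 3.1, 4.7))')
# call the R function through the package object, not the package object itself
latex = xtable.xtable(df)
r['print'](latex)  # R's print method for xtable emits the LaTeX source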
To easily generate TeX data from Python, I wrote the following function:
import re

def tformat(txt, v):
    """Replace the variables between [] in raw text with the contents
    of the named variables. Between the [] there should be a variable name,
    a colon and a formatting specification. E.g. [smin:.2f] would give the
    value of the smin variable printed as a float with two decimal digits.

    :txt: The text to search for replacements
    :v: Dictionary to use for variables.
    :returns: The txt string with variables substituted by their formatted
    values.
    """
    rx = re.compile(r'\[(\w+)(\[\d+\])?:([^\]]+)\]')
    matches = rx.finditer(txt)
    for m in matches:
        nm, idx, fmt = m.groups()
        try:
            if idx:
                idx = int(idx[1:-1])
                r = format(v[nm][idx], fmt)
            else:
                r = format(v[nm], fmt)
            txt = txt.replace(m.group(0), r)
        except KeyError:
            raise ValueError('Variable "{}" not found'.format(nm))
    return txt
You can use any variable name from the dictionary in the text that you pass to this function and have it replaced by the formatted value of that variable.
What I tend to do is to do my calculations in Python, and then pass the output of the globals() function as the second parameter of tformat:
smin = 235.0
smax = 580.0
lst = [0, 1, 2, 3, 4]
t = r'''The strength of the steel lies between \SI{[smin:.0f]}{MPa} and
\SI{[smax:.0f]}{MPa}. lst[2] = [lst[2]:d].'''
print tformat(t, globals())
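With the values above, the substitutions should print: The strength of the steel lies between \SI{235}{MPa} and \SI{580}{MPa}. lst[2] = 2.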
Feel free to use this. I put it in the public domain.
Edit: I'm not sure what you mean by "linear model fits", but might numpy.polyfit do what you want in Python?
To resolve your problem, please update stargazer to version 4.5.3, now available on CRAN. Your example should then work perfectly.
