I have a file that looks something like this:
ATOM 7748 CG2 ILE A 999 53.647 54.338 82.768 1.00 82.10 C
ATOM 7749 CD1 ILE A 999 51.224 54.016 84.367 1.00 83.16 C
ATOM 7750 N ASN A1000 55.338 57.542 83.643 1.00 80.67 N
ATOM 7751 CA ASN A1000 56.604 58.163 83.297 1.00 80.45 C
ATOM 7752 C ASN A1000 57.517 58.266 84.501 1.00 80.30 C
As you can see, the " " between column 4 and 5 (starting to count at 0) disappears once the residue number reaches 1000, so the code below fails. I'm new to Python (total time now a whole 3 days!) and was wondering what's the best way to handle this. As long as there is a space, line.split() works. Do I have to do a character count and then parse the string with an absolute reference?
import string

visited = {}
outputfile = open(file_output_location, "w")
for line in open(file_input_location, "r"):
    list = line.split()
    id = list[0]
    if id == "ATOM":
        type = list[2]
        if type == "CA":
            residue = list[3]
            if len(residue) == 4:
                residue = residue[1:]
            type_of_chain = list[4]
            atom_count = int(list[5])
            position = list[6:9]
            if atom_count >= 1:
                if atom_count not in visited and type_of_chain == chain_required:
                    visited[atom_count] = 1
                    result_line = " ".join([residue, str(atom_count), type_of_chain, " ".join(position)])
                    print result_line
                    print >>outputfile, result_line
outputfile.close()
PDB files appear to be fixed-column-width files, not space-delimited. So if you must parse them manually (rather than using an existing tool like pdb-tools), you'll need to chop the line up using something more along the lines of:
id = line[0:4]
type = line[4:9].strip()
# ad nauseam
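If you go the manual route, the standard ATOM record column layout from the wwPDB format description gives you slices like the sketch below. Treat it as a starting point and check it against your actual file, since the sample lines pasted above have had their whitespace collapsed:

for line in open(file_input_location, "r"):
    if line.startswith("ATOM"):
        atom_name = line[12:16].strip()  # columns 13-16: atom name, e.g. CA
        res_name = line[17:20].strip()   # columns 18-20: residue name, e.g. ASN
        chain_id = line[21]              # column 22: chain identifier, e.g. A
        res_seq = int(line[22:26])       # columns 23-26: residue sequence number
        x = float(line[30:38])           # columns 31-38: x coordinate
        y = float(line[38:46])           # columns 39-46: y coordinate
        z = float(line[46:54])           # columns 47-54: z coordinate

Because the chain ID and residue number live in separate fixed columns, the fused "A1000" field stops being a problem.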
Use string slicing:
print '0123456789'[3:6]
345
There's an asymmetry there: the first number is the 0-based index of the first character you need; the second number is the 0-based index of the first character you no longer need. So the slice length is simply the difference (here 6 - 3 = 3).
It may be worth installing Biopython, as it has a module for parsing PDB files.
I used the following code on your example data:
from Bio.PDB.PDBParser import PDBParser

pdb_reader = PDBParser(PERMISSIVE=1)
structure_id = "Test"
filename = "Test.pdb"  # Enter file name here or path to file.
structure = pdb_reader.get_structure(structure_id, filename)
model = structure[0]
for chain in model:  # This will loop over every chain in the model
    for residue in chain:
        for atom in residue:
            if atom.get_name() == 'CA':  # get_name strips spaces; use this over get_fullname() or get_id()
                # Prints atom name, residue name, residue number, chain name, atom coordinates
                print atom.get_id(), residue.get_resname(), residue.get_id()[1], chain.get_id(), atom.get_coord()
This prints out:
CA ASN 1000 A [ 56.60400009 58.1629982 83.29699707]
I then tried it on a larger protein which has 14 chains (1aon.pdb) and it worked fine.
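If you also need the chain filter and file output from your original script, here is a minimal sketch building on the loop above. The chain_required value and the output filename are assumptions; adjust them to your case:

chain_required = 'A'  # assumed value; set to the chain you need
out = open('output.txt', 'w')  # assumed output path
for chain in model:
    if chain.get_id() != chain_required:
        continue
    for residue in chain:
        for atom in residue:
            if atom.get_name() == 'CA':
                x, y, z = atom.get_coord()
                out.write('%s %d %s %.3f %.3f %.3f\n' % (
                    residue.get_resname(), residue.get_id()[1],
                    chain.get_id(), x, y, z))
out.close()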
I have a .txt file that I read in and wish to create formatted strings from its values. Columns 3 and 4 need thousands separators (commas), and the last column needs a percent sign and 2 decimal places. The formatted string will say something like "The overall attendance at Bulls was 894659, average attendance was 21,820 and the capacity was 104.30%".
the shortened .txt file has these lines:
1 Bulls 894659 21820 104.3
2 Cavaliers 843042 20562 100
3 Mavericks 825901 20143 104.9
4 Raptors 812863 19825 100.1
5 NY_Knicks 812292 19812 100
So far my code looks like this, and it's mostly working, minus the commas and decimal places.
file_1 = open('basketball.txt', 'r')
count = 0
list_1 = []
for line in file_1:
    count += 1
    textline = line.strip()
    items = textline.split()
    list_1.append(items)
print('Number of teams: ', count)
for line in list_1:
    print('Line: ', line)
file_1.close()

for line in list_1:  # iterate over the lines of the file and print the lines with formatted strings
    a, b, c, d, e = line
    print(f'The overall attendance at the {b} game was {c}, average attendance was {d}, and the capacity was {e}%.')
Any help with how to format the code to show the numbers with commas (21820 -> 21,820) and the last column with 2 decimals and a percent sign (104.3 -> 104.30%) is greatly appreciated.
You've got some options for how to tackle this.
Option 1: Using f-strings (Python 3.6+)
Since your provided code already uses f-strings, this solution should work for you. For others reading here, note that f-strings require Python 3.6 or later.
You can do string formatting within f-strings by putting a colon : after the variable name inside the curly brackets {}, after which you can use all of the usual Python string formatting options.
So you only need to change one line of your code to get this done. Your print line would look like:
print(f'The overall attendance at the {b} game was {int(c):,}, average attendance was {int(d):,}, and the capacity was {float(e):.2f}%.')
The variables are interpreted as follows:
The {b} just prints the string b.
The {int(c):,} and {int(d):,} print the integer versions of c and d, respectively, with commas (indicated by the :,).
The {float(e):.2f} prints the float version of e with two decimal places (indicated by the :.2f).
Option 2: Using str.format()
For others here who are looking for a Python 2 friendly solution, you can change the print line to the following:
print("The overall attendance at the {} game was {:,}, average attendance was {:,}, and the capacity was {:.2f}%.".format(b, int(c), int(d), float(e)))
Note that both options use the same formatting syntax; the f-string option just has the benefit of letting you write the variable name right where it will appear in the resulting printed string.
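A quick interactive check of both syntaxes, using values from the sample data:

>>> f"{21820:,} and {104.3:.2f}%"
'21,820 and 104.30%'
>>> "{:,} and {:.2f}%".format(21820, 104.3)
'21,820 and 104.30%'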
This is how I ended up doing it, very similar to the response from Bibit.
file_1 = open('something.txt', 'r')
count = 0
list_1 = []
for line in file_1:
    count += 1
    textline = line.strip()
    items = textline.split()
    items[2] = int(items[2])
    items[3] = int(items[3])
    items[4] = float(items[4])
    list_1.append(items)
print('Number of teams/rows: ', count)
for line in list_1:
    print('Line: ', line)
file_1.close()

for line in list_1:
    print('The overall attendance at the {:s} games was {:,}, average attendance was {:,}, and the capacity was {:.2f}%.'.format(line[1], line[2], line[3], line[4]))
I'm trying to convert a PDF bank extract to CSV. I'm fairly new to Python, but I managed to extract the text from the PDF. I ended up with something similar to this:
AMAZON 23/12/2019 15:40 -R$ 100,00 R$ 400,00 credit
Some Restaurant 23/12/2019 14:00 -R$ 10,00 R$ 500 credit
Received from John Doe 22/12/2019 15:00 R$ 510 R$ 500,00
03 Games 22/12/2019 15:00 R$ 10 R$ 10,00 debit
I want this output:
AMAZON;23/12/2019;-100,00
Some Restaurant;23/12/2019;-10,00
Received from John Doe;22/12/2019;510
03 Games;22/12/2019;10
The first field has a different size on each line; I don't need the time or the currency formatting, and I don't need the last two fields.
I have this code so far (just extracting text from PDF):
import pdfplumber
import sys

url = sys.argv[1]
pdf = pdfplumber.open(url)
pdf_pages = len(pdf.pages)
for i in range(pdf_pages):
    page = pdf.pages[i]
    text = page.extract_text()
    print(text)
pdf.close()
Can anyone give some directions?
Try using the split method: split the text into lines, then split each line into its parts and pick the parts you need.
The following link explains it very nicely.
https://www.w3schools.com/python/showpython.asp?filename=demo_ref_string_split
from typing import List

def get_date_index(entries_check: List[str]) -> int:
    # either use the shape check below, or check that the entry
    # only contains digits and "/"
    for index, entry in enumerate(entries_check):
        if len(entry) != 10:
            continue
        if entry[2] != "/" or entry[5] != "/":
            continue
        # here you could also check that the other parts of the date
        # are digits, or something similar
        return index
    raise ValueError("No Date found")

lines: List[str] = text.split("\n")
for line in lines:
    entries: List[str] = line.split()
    date_entry_index: int = get_date_index(entries)
    # everything before the date belongs to the name
    name = " ".join(entries[:date_entry_index])
    # the currency marker after the time ("-R$" or "R$") carries the sign;
    # the amount itself is the token right after it
    sign = "-" if entries[date_entry_index + 2].startswith("-") else ""
    print(f"{name};{entries[date_entry_index]};{sign}{entries[date_entry_index + 3]}")
That should print it.
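To run this over your whole PDF, you could feed each page's text into the loop above; a sketch based on your pdfplumber code:

import pdfplumber
import sys

with pdfplumber.open(sys.argv[1]) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        # run the line-splitting loop above on each page's text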
I was trying to filter some .txt files that are named after a date in YYYYMMDD format and contain data about active regions on the Sun. I wrote code that, given a date in YYYYMMDD format, lists the files within the time range in which I expect the active region I am looking for to appear, and parses the information based on that entry. An example of these files is shown below, and more information about them (if you feel curious) can be found on the SWPC website.
:Product: 0509SRS.txt
:Issued: 2012 May 09 0030 UTC
# Prepared jointly by the U.S. Dept. of Commerce, NOAA,
# Space Weather Prediction Center and the U.S. Air Force.
#
Joint USAF/NOAA Solar Region Summary
SRS Number 130 Issued at 0030Z on 09 May 2012
Report compiled from data received at SWO on 08 May
I. Regions with Sunspots. Locations Valid at 08/2400Z
Nmbr Location Lo Area Z LL NN Mag Type
1470 S19W68 284 0030 Cro 02 02 Beta
1471 S22W60 277 0120 Cso 05 03 Beta
1474 N14W13 229 0010 Axx 00 01 Alpha
1476 N11E35 181 0940 Fkc 17 33 Beta-Gamma-Delta
1477 S22E73 144 0060 Hsx 03 01 Alpha
IA. H-alpha Plages without Spots. Locations Valid at 08/2400Z May
Nmbr Location Lo
1472 S28W80 297
1475 N05W05 222
II. Regions Due to Return 09 May to 11 May
Nmbr Lat Lo
1460 N16 126
1459 S16 110
The code I am using to parse these txt files is:
import glob

def seeker(noaa_number, t_start, path=None):
    '''
    This function will open an SRS file
    and look for each line if the given AR
    (specified by its NOAA number) is there.
    If so, this function should grab the
    entries and return them.
    '''
    # defaulting path if none is given
    if path is None:
        path = 'defaultpath'
    # listing the items within the directory
    files = sorted(glob.glob(path + '*.txt'))
    # finding the index in the list of the starting time
    index = files.index(path + str(t_start) + 'SRS.txt')
    # looping over each file
    for file in files[index: index + 20]:
        # opening the file and reading the lines
        f = open(file, 'r')
        text = f.readlines()
        # looping over each line in the text
        for line in text:
            # checking if the noaa number is mentioned in the given line
            if noaa_number in line:
                # test print
                print('Original line: ', line)
                # slicing the text to get the column values
                nbr = line[:4]
                Location = line[5:11]
                Lo = line[14:18]
                Area = line[19:23]
                Z = line[24:28]
                LL = line[29:31]
                NN = line[34:36]
                MagType = line[37:]
                # test prints
                print('nbr: ', nbr)
                print('location: ', Location)
                print('Lo: ', Lo)
                print('Area: ', Area)
                print('Z: ', Z)
                print('LL: ', LL)
                print('NN: ', NN)
                print('MagType: ', MagType)
    return
I tested this and it is working, but I feel a bit dumb for two reasons:
Despite these files following a standard, one extra space is all it takes to break the code, given the way I am slicing the lines by absolute index. Is there a better option?
The information in tables IA and II is not relevant for me, so ideally I would like to prevent my code from scanning them. Since the number of lines in the first table varies, is it possible to tell the code when to stop reading a given document?
Thanks for your time!
Robustness:
Instead of slicing by absolute position you could split the lines up into a list using the .split() method. This will be robust against extra spaces.
So instead of
Location = line[5:11]
Lo = line[14:18]
Area = line[19:23]
Z = line[24:28]
LL = line[29:31]
NN = line[34:36]
You could use
Location = line.split()[1]
Lo = line.split()[2]
Area = line.split()[3]
Z = line.split()[4]
LL = line.split()[5]
NN = line.split()[6]
If you wanted it to be faster you could split the list once and then just pull the relevant data from the same list rather than splitting it every time:
data = line.split()
Location = data[1]
Lo = data[2]
Area = data[3]
Z = data[4]
LL = data[5]
NN = data[6]
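As a side note, since each data row in table I has exactly eight whitespace-separated fields (the "Mag Type" value is a single token like Beta-Gamma-Delta), you could also unpack the split in one step, assuming the line really is a table-I data row:

nbr, location, lo, area, z, ll, nn, mag_type = line.split()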
Stopping:
To stop it from continuing to read the file after it has passed the relevant data, you could exit the loop once it no longer finds the noaa_number in the line:
# In the seeker function, before looping through the lines:
started_reading = False  # set to False so we don't exit
                         # before reaching the relevant data

for line in text:
    if noaa_number in line:
        started_reading = True
        # parsing stuff
    elif started_reading is True:
        break  # exits the loop
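Alternatively, since tables IA and II always begin with the "IA." and "II." header lines (see the sample file above), you could break on the section header itself; a sketch keyed to the first section you want to skip:

for line in text:
    if line.startswith('IA.'):
        break  # stop before "IA. H-alpha Plages without Spots"
    if noaa_number in line:
        # parsing stuff
        pass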
I have a text file with the numbers of good and bad units of a product for each gender:
Male 100 120
Female 110 150
How can I calculate the total from this text file for both genders so that it prints out 480?
Here is my attempt at the code:
def total():
    myFile = open("product.txt", "r")
    for result in myFile:
        r = result.split()
        print(r[1] + r[2])

total()
It prints out what the columns contain, but it doesn't add them.
The result of split is a sequence of strings, not of integers.
"Adding" two strings with + concatenates the strings.
Example interaction with enough clues for you to solve it:
>>> s = "123 456"
>>> ss = s.split()
>>> ss
['123', '456']
>>> ss[0] + ss[1]
'123456'
>>> int(ss[0])
123
>>> int(ss[1])
456
>>> int(ss[0]) + int(ss[1])
579
When you get unexpected results, opening your interpreter and looking at things interactively usually provides plenty of clues.
You need to convert each of the split text entries into an integer, and keep a running total as follows:
def total():
    the_total = 0
    with open("product.txt", "r") as myFile:
        for result in myFile:
            r = result.split()
            the_total += int(r[1]) + int(r[2])
    return the_total

print(total())
This would display:
480
Using with will automatically close the file for you.
Yet another one
def total():
    with open('product.txt') as f:
        nums = (int(el) for r in f for el in r.split()[1:])
        return sum(nums)

print(total())
It works for any number of columns you may have in each row, e.g. with four columns:
Male 111 222 333 444
Female 666 777 888 999
produces
4440
As mentioned in the comments by jonrsharpe, you aren't adding the previous values.
Since you want to add everything, keep a running total and add each line's values (converted to integers). Change your code to:
def total():
    t = 0
    with open("product.txt", "r") as myFile:
        for result in myFile:
            r = result.split()
            t += int(r[1]) + int(r[2])
    return t

print(total())  # 480
Since this got chosen, I'm editing to include file closing.
Mentioned by Martin Evans:
Using with will automatically close the file for you.
>>> def total():
...     myfile = open("/home/prashant/Desktop/product.txt", "r")
...     for res in myfile:
...         r = res.split()
...         print(int(r[1]) + int(r[2]))
The strings aren't converted to int; that's your problem.
In Python, if I have numerous lines of data containing both strings and floats (sample below) which I have tokenized, how can I find the first float value in each line when its position is not constant? I eventually want to use this as a reference point for later tokens. Thanks in advance.
F + FR > FR* + F + E 11.60 0 2 FR > FR*
F + FR > FR*** + F 11.60 0 2382 FR > FR***
You can use the re module. Just import it and search for digits with a literal . between them (the dot must be escaped, otherwise it matches any character). For example,
import re

def findFloat(s):
    # \d+\.\d+ matches one or more digits, a literal dot, then more digits
    return float(re.search(r'\d+\.\d+', s).group())
This finds the first occurrence of a group of digits separated by a .
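Applied to the first sample line:

>>> findFloat('F + FR > FR* + F + E 11.60 0 2 FR > FR*')
11.6

And since you want a reference point for later tokens, here is a small variant (a hypothetical helper, not part of re) that returns the index of the first token that parses as a float:

def first_float_index(tokens):
    # return the position of the first token that float() accepts
    for i, tok in enumerate(tokens):
        try:
            float(tok)
            return i
        except ValueError:
            continue
    raise ValueError('no float token found')

tokens = 'F + FR > FR* + F + E 11.60 0 2 FR > FR*'.split()
print(first_float_index(tokens))  # 9, i.e. the token '11.60'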