How to make the file name a variable using np.savetxt in Python?

Is it possible to make the output file name a variable when using np.savetxt? I have multiple input files from which I read data, perform some calculations, and write the results to an output file. Right now I change the file name by hand for each output, but is there a way to do it automatically? The code I use is:
np.savetxt('ES-0.dat', np.c_[strain_percent, es_avg, es_std])
I would like to change the file name to ES-25.dat, ES-50.dat, ES-75.dat, etc. This also depends on the input file, which I read like this:
flistC11 = glob.glob('ES-0')
Is there also a way to automatically change the input file to ES-25, ES-50, ES-75, etc.?
I tried using loops, but both the input and output names have to be inside quotes, which does not let me make them variables. Any idea how I can solve this problem? It would make my work much easier.
Added information after Saullo Castro's answer:
The file that I'm reading (ES*) consists of two simple columns like this:
200 7.94
200 6.55
200 6.01
200 7.64
200 6.33
200 7.96
200 7.92
The whole script is as below:
import numpy as np
import glob
import sys

flistC11 = glob.glob('ES-s*')

#%strain
fdata4 = []
for fname in flistC11:
    load = np.loadtxt(fname)
    fdata4.append(load[:,0]) #change to 0=strain or 1=%ES
fdata_arry4 = np.array(fdata4)
print fdata_arry4
strain = np.mean(fdata_arry4[0,:])
strain_percent = strain/10
print strain_percent

#ES
fdata5 = []
for fname in flistC11:
    load = np.loadtxt(fname)
    fdata5.append(load[:,1]) #change to 0=strain or 1=%ES
fdata_arry5 = np.array(fdata5)
print fdata_arry5
es_avg = np.mean(fdata_arry5[0,:])
es_std = np.std(fdata_arry5[0,:])
print es_avg
print es_std

np.savetxt('{0}.dat'.format(fname), np.c_[strain_percent, es_avg, es_std])

You can do something like:
flistC11 = glob.glob('ES*')
for fname in flistC11:
    # ...something...
    np.savetxt('{0}.dat'.format(fname), np.c_[strain_percent, es_avg, es_std])
Note that using ES* will tell glob() to return the names of all files beginning with ES.
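For example, a tiny sketch with made-up file names:
import glob

# Hypothetical directory holding ES-0.dat, ES-25.dat and notes.txt
print(glob.glob('ES*'))       # e.g. ['ES-0.dat', 'ES-25.dat'] (order not guaranteed)
print(glob.glob('ES-*.dat'))  # narrower pattern: only the ES-<something>.dat files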
EDIT:
Based on your comments it seems you actually want something like this:
import glob
import numpy as np
flistC11 = glob.glob('ES-s*')
for fname in flistC11:
    strains, stresses = np.loadtxt(fname, unpack=True)
    strain = np.mean(strains)
    strain_percent = strain/10
    print fname, strain_percent
    es_avg = np.mean(stresses)
    es_std = np.std(stresses)
    print fname, es_avg, es_std
    np.savetxt('{0}.dat'.format(fname), np.c_[strain_percent, es_avg, es_std])

It's not entirely clear where your error is (where is line 15?), but let's assume it is in the load. You have:
fdata4 = []
for fname in flistC11:
    load = np.loadtxt(fname)
    fdata4.append(load[:,0]) #change to 0=strain or 1=%ES
I'd suggest changing this to:
fdata4 = []
for fname in flistC11:
    print fname          # check that the names make sense
    load = np.loadtxt(fname)
    print load.shape     # check that the shape is as expected
    # maybe print more of 'load' here
    # I assume you want to collect 'load' from all files, not just the last
    fdata4.append(load[:,0]) #change to 0=strain or 1=%ES
print fdata4
In an IPython shell, I had no problem producing:
In [90]: flistC11 = ['ES0','ES1','ES2']

In [91]: for fname in flistC11:
   ....:     np.savetxt('{}.dat'.format(fname), np.arange(10))
   ....:

In [92]: glob.glob('ES*')
Out[92]: ['ES2.dat', 'ES0.dat', 'ES1.dat']

Related

Save file to pickle with nested loop indices as file names

I'm converting my code from MATLAB to Python. In MATLAB, to save my variables I usually do this:
for i = 1:4
    for j = 1:3
        save(['data_', int2str(i), '_', int2str(j), '.mat'], 'var1', 'var2')
    end
end
so in the end I have files like data_1_1, data_1_2, etc.
How can I modify the code below to have a similar naming convention?
import pickle
import numpy as np

Tests = 5
data = {}
for i in range(Tests):
    for j in range(4):
        data['r'] = 5
        data['m'] = 500
        data['n'] = 500
        data['X'] = np.random.rand(data['m'], data['n'])
        data['Y'] = np.random.rand(data['m'], data['n'])
        with open('data{}.pickle'.format(i), 'wb') as f:
            pickle.dump(data, f)
I would like to save my pickle files as, say, data_1_2, etc. Help! I'm new to Python. Thanks!
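A minimal sketch of one way to get MATLAB-style names such as data_1_2.pickle, assuming the only change needed is to format both loop indices into the file name (and to build a fresh data dict on every iteration):
import pickle
import numpy as np

Tests = 5
for i in range(Tests):
    for j in range(4):
        data = {}
        data['r'] = 5
        data['m'] = 500
        data['n'] = 500
        data['X'] = np.random.rand(data['m'], data['n'])
        data['Y'] = np.random.rand(data['m'], data['n'])
        # Both indices go into the name: data_0_0.pickle, data_0_1.pickle, ...
        # (use i+1 and j+1 if you want MATLAB-style 1-based names)
        with open('data_{}_{}.pickle'.format(i, j), 'wb') as f:
            pickle.dump(data, f)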

How to read in multiple documents with the same code?

So I have a couple of documents, each of which has an x and y coordinate (among other things). I wrote some code that filters out those x and y coordinates and stores them in float variables.
Ideally I'd like to find a way to run the same code on all the documents I have (the number is not fixed, but let's say 3 for now), extract the x and y coordinates of each document, and calculate an average of these 3 x-values and 3 y-values.
How would I approach this? I've never done this before.
I successfully created the code to extract the relevant data from one file.
Also note: in reality each file has more than just one set of x and y coordinates, but this does not matter for the problem at hand.
I'm just mentioning it so that the code does not confuse you.
with open('TestData.txt', 'r') as f:
    full_array = f.readlines()

del full_array[1:31]
del full_array[len(full_array)-4:len(full_array)]
single_line = full_array[1].split(", ")
x_coord = float(single_line[0].replace("1 Location: ",""))
y_coord = float(single_line[1])
size = float(single_line[3].replace("Size: ",""))
#Remove unnecessary stuff
category = single_line[6].replace(" Type: Class: 1D Descr: None","")
In the end I'd like not to have to rewrite the same code for each file, especially since the number of files may vary. Right now I have 3 files, which equals 3 sets of coordinates, but on another day I might have 5, for example.
Use os.walk to find the files that you want, then do your calculation for each file.
https://docs.python.org/2/library/os.html#os.walk
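A minimal sketch of that approach; the directory name 'documents' and the parse_file helper are placeholders for your own path and extraction code:
import os

def parse_file(path):
    # Placeholder: put the existing x/y extraction code here.
    with open(path, 'r') as f:
        return f.readline().strip()

results = []
for root, dirs, files in os.walk('documents'):
    for name in files:
        if name.endswith('.txt'):
            results.append(parse_file(os.path.join(root, name)))

print(results)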
First of all, create a function that reads a file given its file name and does the parsing your way. Then iterate through the directory; I assume the files are all in the same directory.
Here is the basic code:
import os

def readFile(filename):
    try:
        with open(filename, 'r') as file:
            data = file.read()
        return data
    except:
        return ""

directory = 'C:\\Users\\UserName\\Documents'
for filename in os.listdir(directory):
    #print(filename)
    # os.listdir() returns bare names, so join them with the directory
    data = readFile(os.path.join(directory, filename))
    print(data)
    #parse here
    #do the calculation here
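To then get the averages asked about in the question, one option is a sketch that simply wraps the question's own parsing in a function, assuming every .txt file in the directory has the same layout as TestData.txt:
import os

def extract_xy(path):
    # Same parsing steps as in the question, for a single file.
    with open(path, 'r') as f:
        full_array = f.readlines()
    del full_array[1:31]
    del full_array[len(full_array) - 4:len(full_array)]
    single_line = full_array[1].split(", ")
    x = float(single_line[0].replace("1 Location: ", ""))
    y = float(single_line[1])
    return x, y

x_values, y_values = [], []
for filename in os.listdir('.'):
    if filename.endswith('.txt'):
        x, y = extract_xy(filename)
        x_values.append(x)
        y_values.append(y)

print(sum(x_values) / len(x_values))  # average x over all files
print(sum(y_values) / len(y_values))  # average y over all files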

How can I properly use random.sample in my code?

My directory has hundreds of images and text files (.png and .txt). What's special about them is that each image has its own matching txt file; for example, im1.png has im1.txt, news_im2.png has news_im2.txt, etc. What I want is some way to give it a parameter or percentage, say 40, where it randomly copies 40% of the images along with their corresponding texts to a new folder. The most important word here is randomly: if I run it again I shouldn't get the same results. Ideally it should take two kinds of parameters (as a reminder, the first would be the percentage of each sample), the second being the number of samples. For example, maybe I want my data split into 3 different samples randomly, not only 2; in that case it should accept destination directory paths equal to the number of samples I want and spread the files accordingly, so that, for example, img_1 never ends up in 2 different samples.
What I have done so far is simply set up my method to copy them:
import glob, os, shutil

source_dir = 'all_the_content/'
dest_dir = 'percentage_only/'

files = glob.iglob(os.path.join(source_dir, "*.png"))
for file in files:
    if os.path.isfile(file):
        shutil.copy2(file, dest_dir)
and the start of my code to set up the random selection:
import os, shutil, random

my_pic_dict = {}
source_dir = '/home/michel/ubuntu/EAST/data_0.8/'
for element in os.listdir(source_dir):
    if element.endswith('.png'):
        my_pic_dict[element] = element.replace('.png', '.txt')

print(my_pic_dict)
print(len(my_pic_dict))

imgs_list = my_pic_dict.keys()
print(imgs_list)
How can I finalize it? I couldn't make random.sample work.
try this:
import random
import numpy as np
n_elem = 1000
n_samples = 4
percentages = [50,10,10,30]
indices = list(range(n_elem))
random.shuffle(indices)
elem_per_samples = [int(p*n_elem/100) for p in percentages]
limits = np.cumsum([0]+elem_per_samples)
samples = [indices[limits[i]:limits[i+1]] for i in range(n_samples)]
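To tie this back to the pairs of files in the question, here is a rough sketch using random.sample directly; the directory names are the ones from the question, and it assumes every .png has a matching .txt next to it:
import os
import random
import shutil

source_dir = 'all_the_content/'
dest_dir = 'percentage_only/'
percentage = 40  # percent of image/text pairs to copy

# Collect the .png files that have a matching .txt file.
pairs = [name for name in os.listdir(source_dir)
         if name.endswith('.png')
         and os.path.isfile(os.path.join(source_dir, name.replace('.png', '.txt')))]

# random.sample draws that many distinct names, differently on every run.
n_pick = int(len(pairs) * percentage / 100.0)
for png_name in random.sample(pairs, n_pick):
    txt_name = png_name.replace('.png', '.txt')
    shutil.copy2(os.path.join(source_dir, png_name), dest_dir)
    shutil.copy2(os.path.join(source_dir, txt_name), dest_dir)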

How should I write multiple loops for a single operation in Python?

I'm new to learning Python and I really need to get this job done, and I don't know how to look for an answer in what seems to be a giant ocean of information.
I'm working with a PDB parser and I have the following code:
#Calling the module
from Bio.PDB.PDBParser import PDBParser
#Defining some variables
parser = PDBParser(PERMISSIVE=1)
structure_id= "3caq"
filename = "3caq.pdb"
#This is the actual task to be done
structure = parser.get_structure(structure_id, filename)
#What I'm interested in doing
compound = structure.header['compound']
print structure_id + " |", compound
with this result:
3caq | {'1': {'synonym': 'delta(4)-3-ketosteroid 5-beta-reductase, aldo-keto reductase family 1 member d1 ', 'chain': 'a, b', 'misc': '', 'molecule': '3-oxo-5-beta-steroid 4-dehydrogenase', 'ec_number': '1.3.1.3','ec': '1.3.1.3', 'engineered': 'yes'}}
The thing is that I'm not working with just a single file (defined under "filename"); I have hundreds of files from which I need to extract the header, keeping only the "compound" entry of said header.
I know I have to write loops for that, and I tried the following:
#Defining lists
nicknames = { "3caq", "2zb7" }
structures = { "3caq.pdb", "2bz7.pdb" }

structure_id = []
for structure in structures:
    structure_id.append(nickname)

filename = []
for structure in structures:
    filename.append(structure)
Then I feed the parser, but I get an error:
Traceback (most recent call last):
  File "/home/tomas/Escritorio/d.py", line 16, in <module>
    header = parser.get_structure(structure_id, filename)
  File "/usr/local/lib/python2.7/dist-packages/Bio/PDB/PDBParser.py", line 82, in get_structure
    self._parse(handle.readlines())
AttributeError: 'list' object has no attribute 'readlines'
I'm pretty sure that the loop is not correctly written.
So, I'll be really thankful if I can get some help with how to correctly write that loop, either with a resource I can check, or with the right commands.
Best regards.
You're getting an error because you're passing a list (["3caq.pdb"]) when get_structure() expects a string ("3caq.pdb").
Here is how you can support multiple files:
from Bio.PDB.PDBParser import PDBParser

files = {"3caq.pdb", "2bz7.pdb"}

for filename in files:
    # Defining some variables
    parser = PDBParser(PERMISSIVE=1)
    # Drop ".pdb" from filename to get structure_id
    structure_id = filename.split('.')[0]
    # This is the actual task to be done
    structure = parser.get_structure(structure_id, filename)
    # What I'm interested in doing
    compound = structure.header['compound']
    print("{} | {}".format(structure_id, compound))
To make your code even better, I would write it as a separate function:
from Bio.PDB.PDBParser import PDBParser

def get_compound_header(filename):
    # Defining some variables
    parser = PDBParser(PERMISSIVE=1)
    # Drop ".pdb" from filename to get structure_id
    structure_id = filename.split('.')[0]
    # This is the actual task to be done
    structure = parser.get_structure(structure_id, filename)
    # What I'm interested in doing
    compound = structure.header['compound']
    return "{} | {}".format(structure_id, compound)

# Main function
if __name__ == "__main__":
    files = {"3caq.pdb", "2bz7.pdb"}
    for filename in files:
        print(get_compound_header(filename))
A clear understanding will help you sort this out.
You need to iterate through many structures, so the for loop is a good choice. I would suggest you use lists instead of sets to store the structure IDs and file names.
filenames = ["3caq", "2zb7" ]
structures = ["3caq.pdb", "2bz7.pdb"]
Now you can iterate by the length of the structures.
for each in range(len(structures)):
structure = parser.get_structure(structures[each], filenames[each])
compound = structure.header['compound']
print structure_id + " |", compound
Let me know if this works for you.

Parsing SDF files, performance issue

I've written a script which reads different files and searches for molecule IDs in big SDF databases (about 4.0 GB each).
The idea of this script is to copy every molecule from a list of IDs (about 287212 molecules) from my original databases to a new one, in such a way that there is only one single copy of each molecule (in this case, the first copy encountered).
I've written this script:
import re
import sys
import os

def sdf_grep(molname, files):
    filin = open(files, 'r')
    filine = filin.readlines()
    for i in range(0, len(filine)):
        if filine[i][0:-1] == molname and filine[i][0:-1] not in past_mol:
            past_mol.append(filine[i][0:-1])
            iterate = 1
            while iterate == 1:
                if filine[i] == "$$$$\n":
                    filout.write(filine[i])
                    iterate = 0
                    break
                else:
                    filout.write(filine[i])
                    i = i + 1
        else:
            continue
    filin.close()

mol_dock = os.listdir("test")
listmol = []
past_mol = []
imp_listmol = open("consensus_sorted_surflex.txt", 'r')
filout = open('test_ini.sdf', 'wa')

for line in imp_listmol:
    listmol.append(line.split('\t')[0])
print 'list ready... reading files'
imp_listmol.close()

for f in mol_dock:
    print 'reading ' + f
    for molecule in listmol:
        if molecule not in past_mol:
            sdf_grep(molecule, 'test/' + f)

print len(past_mol)
filout.close()
It works perfectly, but it is very slow... too slow for what I need. Is there a way to rewrite this script so that it takes less computation time?
Thank you very much.
The main problem is that you have three nested loops: molecule documents, molecules, and file parsing in the inner loop. That smells like trouble - I mean, quadratic complexity. You should move the parsing of the huge files out of the inner loop, and use a set or dictionary for the molecules.
Something like this:
For each sdf file
    For each line, if it is molecule definition
        Check in dictionary of unfound molecules
        If present, process it and remove from dictionary of unfound molecules
This way, you will parse each sdf file exactly once, and with each found molecule, speed will further increase.
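For illustration, a rough sketch of that approach under the same assumptions as the original script (a molecule block starts with a line holding its name and ends with a "$$$$" line, and the ID list is tab-separated):
import os

# Load the IDs we still need to find into a set, once.
with open("consensus_sorted_surflex.txt", 'r') as imp_listmol:
    unfound = set(line.split('\t')[0] for line in imp_listmol)

with open('test_ini.sdf', 'w') as filout:
    for f in os.listdir("test"):
        copying = False
        with open(os.path.join("test", f), 'r') as filin:
            for line in filin:                           # each file is read exactly once
                if line.rstrip('\n') in unfound:
                    unfound.discard(line.rstrip('\n'))   # keep only the first copy found
                    copying = True
                if copying:
                    filout.write(line)
                if line == "$$$$\n":
                    copying = False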
Let past_mol be a set, rather than a list. That will speed up
filine[i][0:-1] not in past_mol
since checking membership in a set is O(1), while checking membership in a list is O(n).
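A tiny, self-contained illustration of the difference (with made-up names):
names_list = ['MOL%05d' % i for i in range(100000)]
names_set = set(names_list)

# Same test, very different cost: the list is scanned element by element,
# the set does a single hash lookup.
print('MOL99999' in names_list)
print('MOL99999' in names_set)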
Try not to write to a file one line at a time. Instead, save up lines in a list, join them into a single string, and then write it out with one call to filout.write.
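For example (a toy sketch with a made-up output file name):
# Collect the lines first...
out_lines = []
for i in range(5):
    out_lines.append('molecule %d\n' % i)

# ...then write them with a single call instead of five.
with open('buffered_output.txt', 'w') as filout:
    filout.write(''.join(out_lines))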
It is generally better not to allow functions to modify global variables. sdf_grep modifies the global variable past_mol.
By adding past_mol to the arguments of sdf_grep you make it explicit that sdf_grep depends on the existence of past_mol (otherwise, sdf_grep is not really a standalone function).
If you pass past_mol in as a third argument to sdf_grep, then Python will make a new local variable named past_mol which will point to the same object as the global variable past_mol. Since that object is a set and a set is a mutable object, past_mol.add(sline) will affect the global variable past_mol as well.
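A small demonstration of that point (hypothetical names):
def record(found, name):
    # 'found' refers to the same set object the caller passed in, so adding
    # to it here is visible to the caller too -- no global variable needed.
    found.add(name)

past_mol = set()
record(past_mol, 'MOL001')
print(past_mol)   # set(['MOL001']) in Python 2, {'MOL001'} in Python 3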
As an added bonus, Python looks up local variables faster than global variables:
def using_local():
    x = set()
    for i in range(10**6):
        x

y = set
def using_global():
    for i in range(10**6):
        y
In [5]: %timeit using_local()
10 loops, best of 3: 33.1 ms per loop
In [6]: %timeit using_global()
10 loops, best of 3: 41 ms per loop
sdf_grep can be simplified greatly if you use a variable (let's call it found) which keeps track of whether or not we are inside one of the chunks of lines we want to keep. (By "chunk of lines" I mean one that begins with molname and ends with "$$$$"):
import re
import sys
import os

def sdf_grep(molname, files, past_mol):
    chunk = []
    found = False
    with open(files, 'r') as filin:
        for line in filin:
            sline = line.rstrip()
            if sline == molname and sline not in past_mol:
                found = True
                past_mol.add(sline)
            elif sline == '$$$$':
                chunk.append(line)
                found = False
            if found:
                chunk.append(line)
    # the lines already end with '\n', so join them without adding more
    return ''.join(chunk)

def main():
    past_mol = set()
    with open("consensus_sorted_surflex.txt", 'r') as imp_listmol:
        listmol = [line.split('\t')[0] for line in imp_listmol]
        print 'list ready... reading files'
    with open('test_ini.sdf', 'wa') as filout:
        for f in os.listdir("test"):
            print 'reading ' + f
            for molecule in listmol:
                if molecule not in past_mol:
                    filout.write(sdf_grep(molecule, os.path.join('test/', f), past_mol))
    print len(past_mol)

if __name__ == '__main__':
    main()
