I have the following file I'm trying to manipulate.
1 2 -3 5 10 8.2
5 8 5 4 0 6
4 3 2 3 -2 15
-3 4 0 2 4 2.33
2 1 1 1 2.5 0
0 2 6 0 8 5
The file just contains numbers.
I'm trying to write a program that subtracts consecutive rows from each other and prints the results to a file. My program is below; dtest.txt is the name of the input file and make_distance.py is the name of the program.
from math import *
posnfile = open("dtest.txt","r")
posn = posnfile.readlines()
posnfile.close()
for i in range (len(posn)-1):
    for j in range (0,1):
        if (j == 0):
            Xp = float(posn[i].split()[0])
            Yp = float(posn[i].split()[1])
            Zp = float(posn[i].split()[2])
            Xc = float(posn[i+1].split()[0])
            Yc = float(posn[i+1].split()[1])
            Zc = float(posn[i+1].split()[2])
        else:
            Xp = float(posn[i].split()[3*j+1])
            Yp = float(posn[i].split()[3*j+2])
            Zp = float(posn[i].split()[3*j+3])
            Xc = float(posn[i+1].split()[3*j+1])
            Yc = float(posn[i+1].split()[3*j+2])
            Zc = float(posn[i+1].split()[3*j+3])
        Px = fabs(Xc-Xp)
        Py = fabs(Yc-Yp)
        Pz = fabs(Zc-Zp)
        print Px,Py,Pz
The program calculates the values correctly, but when I call it and redirect the output to a file,
mpipython make_distance.py > distance.dat
the output file (distance.dat) only contains 3 columns when it should contain 6. How do I tell the program which columns to print to for each step j = 0, 1, ...?
For j = 0, the program should output to the first 3 columns, for j = 1 the program should output to the second 3 columns (3,4,5), and so on.
Finally, the len function gives the number of rows in the input file, but what function gives the number of columns in the file?
Thanks.
Append a , to the end of your print statement and it will not print a newline, and then when you exit the for loop add an additional print to move to the next row:
for j in range (0,1):
    ...
    print Px,Py,Pz,
print
Assuming all rows have the same number of columns, you can get the number of columns by using len(row.split()).
Also, you can definitely shorten your code quite a bit. I'm not sure what the purpose of j is, but the following should be equivalent to what you're doing now:
for j in range (0,1):
    Xp, Yp, Zp = map(float, posn[i].split()[3*j:3*j+3])
    Xc, Yc, Zc = map(float, posn[i+1].split()[3*j:3*j+3])
    ...
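Putting the two ideas above together, here is a minimal sketch that takes the absolute difference of every column between consecutive rows and gets the column count from len(row.split()). It is written in Python 3 syntax, unlike the Python 2 snippets in this thread, and the function name row_differences and the in-memory sample lines are only illustrative assumptions standing in for the contents of dtest.txt.

```python
def row_differences(lines):
    """Return |next_row - row| for each pair of consecutive rows."""
    rows = [[float(v) for v in line.split()] for line in lines if line.strip()]
    # Number of columns, assuming every row has as many values as the first one.
    ncols = len(rows[0]) if rows else 0
    result = []
    for prev, curr in zip(rows, rows[1:]):
        result.append([abs(curr[k] - prev[k]) for k in range(ncols)])
    return result


# First three rows of the sample file from the question.
sample = ["1 2 -3 5 10 8.2", "5 8 5 4 0 6", "4 3 2 3 -2 15"]
for delta in row_differences(sample):
    # One output line per pair of rows, with all columns on that line.
    print(" ".join(str(d) for d in delta))
```

Because every column of a row pair is collected before printing, there is no need to keep track of j or of which columns a print statement writes to.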
You don't need to:
use numpy
read the whole file in at once
know how many columns
use an awkward comma at the end of the print statement
use list subscripting
use math.fabs()
explicitly close your file
Try this (untested):
with open("dtest.txt", "r") as posnfile:
    previous = None
    for line in posnfile:
        current = [float(x) for x in line.split()]
        if previous:
            delta = [abs(c - p) for c, p in zip(current, previous)]
            print ' '.join(str(d) for d in delta)
        previous = current
Here is a numpy version, just in case your dtest.txt grows larger and you would rather write to distance.dat directly than redirect your output. Thanks to John for pointing out my mistake in the old code ;-)
import numpy as np
pos = np.genfromtxt("dtest.txt")
dis = np.array([np.abs(pos[j+1] - pos[j]) for j in xrange(len(pos)-1)])
np.savetxt("distance.dat",dis)
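As a possible shortcut, np.diff can replace the list comprehension over row indices. The sketch below is in Python 3 and uses a small in-memory array as a stand-in for the result of np.genfromtxt("dtest.txt"), so it makes no assumption about the file being present.

```python
import numpy as np

# Stand-in for np.genfromtxt("dtest.txt"): the first three rows of the sample file.
pos = np.array([
    [1, 2, -3, 5, 10, 8.2],
    [5, 8, 5, 4, 0, 6],
    [4, 3, 2, 3, -2, 15],
])

# np.diff along axis 0 subtracts each row from the row below it.
dis = np.abs(np.diff(pos, axis=0))
print(dis.shape)

# np.savetxt("distance.dat", dis) would write the result as in the snippet above.
```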
I need to use a certain program to validate some of my results. I am relatively new to Python. The output is very different for each entry; see the snippet below:
SEQENCE ID TM SP PREDICTION
YOL154W_Q12512_Saccharomyces_cerevisiae 0 Y n8-15c20/21o
YDR481C_P11491_Saccharomyces_cerevisiae 1 0 i34-53o
YAL007C_P39704_Saccharomyces_cerevisiae 1 Y n5-20c25/26o181-207i
YAR028W_P39548_Saccharomyces_cerevisiae 2 0 i51-69o75-97i
YBL040C_P18414_Saccharomyces_cerevisiae 7 0 o6-26i38-56o62-80i101-119o125-143i155-174o186-206i
YBR106W_P38264_Saccharomyces_cerevisiae 1 0 o28-47i
YBR287W_P38355_Saccharomyces_cerevisiae 8 0 o12-32i44-63o69-90i258-275o295-315i327-351o363-385i397-421o
So, I need the last transmembrane region; in this case it's always the last pair of numbers between o and i or vice versa. If TM = 0, there is no transmembrane region, so I only want the numbers if TM > 0.
output I need:
34-53
181-207
75-97
186-206
28-47
397-421
preferably in separate values, like:
first_number = 34
second_number = 53
Because I will be using a loop, the values will be overwritten anyway. To summarize: I need the last region between the o and i or vice versa, with very variable strings (both in length and composition).
Trouble: If I just search (for example with regular expression) for the last region between o and i, I will sometimes pick the wrong region.
If the Phobius output is stored in a file, change 'Phobius_output' to the path, then the following code should give the expected result:
with open('Phobius_output') as file:
    for line in file.readlines()[1:]:
        if int(line.split()[1]) > 0:
            prediction = line.split()[3]
            i_idx, o_idx = prediction.rfind('i'), prediction.rfind('o')
            last_region = prediction[i_idx + 1:o_idx] if i_idx < o_idx else prediction[o_idx + 1:i_idx]
            first_number, second_number = map(int, last_region.split('-'))
            print(last_region)
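For comparison, here is a regular-expression sketch of the same idea. It assumes, based only on the sample lines in the question, that a transmembrane segment is always written as start-end with an i or an o directly on each side, which is what keeps the signal-peptide part (for example n5-20c25/26) from being picked up. The names TM_SEGMENT and last_tm_region are made up for this example.

```python
import re

# A transmembrane segment looks like "34-53" with an 'i' or 'o' on each side.
TM_SEGMENT = re.compile(r"(?<=[io])(\d+)-(\d+)(?=[io])")


def last_tm_region(line):
    """Return (first_number, second_number) of the last TM segment, or None."""
    fields = line.split()
    if len(fields) < 4 or not fields[1].isdigit() or int(fields[1]) == 0:
        return None  # header line, malformed line, or TM == 0
    segments = TM_SEGMENT.findall(fields[3])
    if not segments:
        return None
    first, second = segments[-1]
    return int(first), int(second)


line = "YAL007C_P39704_Saccharomyces_cerevisiae 1 Y n5-20c25/26o181-207i"
print(last_tm_region(line))
```

Returning a tuple gives the two numbers as separate integer values, and the header line is skipped because its second field is not a number.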
I've been trying to analyse data from a .dat file. The experiment in the file is repeated many times: there are r experiments, each with n data points. As an example, r = 4 experiments with n = 3 data points in each experiment:
1 4.8
2 3.4
3 2.3

1 6.5
2 5.3
3 4.2

1 9.8
2 8.4
3 7.6

1 13.8
2 12.4
3 11.6
I want to read the file and plot 4 graphs - the first 3, second 3 and third 3 and fourth 3 rows. My code so far is this:
for line in myfile:
    if not line.strip():#takes out empty rows
        continue
    else:
        data.append(line)
for line in data:
    x, y = line.split()
    timestep.append(float(x))
    value.append(float(y))
z = 0.0
j = 1
n = 3 #no. of data points in one experiment
r = 4 #no. of times experiment repeats
x = np.arange(1,n)
for k in range(1, r):
    for i in (value):
        j += 1
        if n%j != 0: #trying to break the loop after the first experiment of n data points
            z = i
            y_"str(j)" = [] #I want to call this array y_j, i.e. y_1 for the first loop or y_2 for the second, etc, wild index in python?! :(
            y_"str(j)".append(z)
        else:
            value = value[steps:] #trying to remove the first three points before starting to for loop again
    plt.figure()
    plt.plot(x, y_str(j),'r', label = "y_str(j)")
    plt.title('y ' +str(j) )
    plt.show()
I'll be analysing it more, but I'm just having difficulty in performing the same analysis (plotting, etc) every n times in the big array of data. It might not even be necessary to split my 2 column input data into separate x and y columns, but I was getting annoying int and float errors using data[i][2] in the for loop.
Thanks very much for any help!
Read in the data by experiment, using the empty line as separator:
data = []
exp = [[], []]
for line in myfile:
    if line.strip():
        for index, value in enumerate(line.split()):
            exp[index].append(float(value))
    else:
        data.append(exp)
        exp = [[], []]
# keep the last experiment if the file does not end with an empty line
if exp[0]:
    data.append(exp)
and plot them all in one plot:
for number, exp in enumerate(data, 1):
    plt.plot(*exp, label='experiment {}'.format(number))
plt.legend(loc='best')
plt.title('My Experiments')
Try something like this:
for line in myfile:
    if line.strip():#takes out empty rows
        x, y = line.split()
        timestep.append(float(x))
        value.append(float(y))
n = 3 #no. of data points in one experiment
r = 4 #no. of times experiment repeats
x = np.arange(1, n+1) # you don't need to use it. just plot(y)
colors = ['r', 'b', 'g', 'k']
for k in range(r):
    y = value[k*n:(k+1)*n]
    # plt.subplot(r+1, 1, k+1) #if you want place it on different plots
    plt.plot(x, y, colors[k], label="%d experiment" % (k+1))
plt.legend(loc=1) # or wherever you want
plt.title('Experiments')
plt.show()
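The slicing step value[k*n:(k+1)*n] can also be pulled out into a small helper, so that the same analysis can be repeated for every block of n points. This is only a sketch in Python 3; the helper name split_experiments is made up, and the values are the y column from the question.

```python
def split_experiments(values, n):
    """Split a flat list into consecutive chunks of n points each."""
    if n <= 0 or len(values) % n != 0:
        raise ValueError("expected a whole number of experiments")
    return [values[k:k + n] for k in range(0, len(values), n)]


value = [4.8, 3.4, 2.3, 6.5, 5.3, 4.2, 9.8, 8.4, 7.6, 13.8, 12.4, 11.6]
for number, y in enumerate(split_experiments(value, 3), 1):
    # Each y holds one experiment and could be handed to plt.plot.
    print(number, y)
```

With this shape there is no need for variables named y_1, y_2 and so on: the experiments live in one list and are addressed by index.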
Hi I'm very new to python and trying to create a program that takes a random sample from a CSV file and makes a new file with some conditions. What I have done so far is probably highly over-complicated and not efficient (though it doesn't need to be).
I have 4 CSV files that contain 264 rows in total, where each full row is unique, though they all share common values in some columns.
csv1 = 72 rows, csv2 = 72 rows, csv3 = 60 rows, csv4 = 60 rows. I need to take a random sample of 160 rows which will make 4 blocks of 40, where in each block 10 must come from each csv file. The tricky part is that no more than 2 or 3 rows from the same CSV file can appear in order in the final file.
So far I have managed to take a random sample of 40 from each CSV (just using random.sample) and output them to 4 new CSV files. Then I split each csv into 4 new files each containing 10 rows so that I have each in a separate folder(1-4). So I now have 4 folders each containing 4 csv files. Now I need to combine these so that rows that came from the original CSV file don't repeat more than 2 or 3 times and the row order will be as random as possible. This is where I'm completely lost, I'm presuming that I should combine the 4 files in each folder (which I can do) and then re-sample or shuffle in a loop until the conditions are met, or something to that effect but I'm not sure how to proceed or am I going about this in the completely wrong way. Any help anyone can give me would be greatly appreciated and I can provide any further details that are necessary.
var_start = 1
total_condition_amount_start = 1
while (var_start < 5):
    with open("condition"+`var_start`+".csv", "rb") as population1:
        conditions1 = [line for line in population1]
        random_selection1 = random.sample(conditions1, 40)
    with open("./temp/40cond"+`var_start`+".csv", "wb") as temp_output:
        temp_output.write("".join(random_selection1))
    var_start = var_start + 1
while (total_condition_amount_start < total_condition_amount):
    folder_no = 1
    splitter.split(open("./temp/40cond"+`total_condition_amount_start`+".csv", 'rb'));
    shutil.move("./temp/output_1.csv", "./temp/block"+`folder_no`+"/output_"+`total_condition_amount_start`+".csv")
    folder_no = folder_no + 1
    shutil.move("./temp/output_2.csv", "./temp/block"+`folder_no`+"/output_"+`total_condition_amount_start`+".csv")
    folder_no = folder_no + 1
    shutil.move("./temp/output_3.csv", "./temp/block"+`folder_no`+"/output_"+`total_condition_amount_start`+".csv")
    folder_no = folder_no + 1
    shutil.move("./temp/output_4.csv", "./temp/block"+`folder_no`+"/output_"+`total_condition_amount_start`+".csv")
    total_condition_amount_start = total_condition_amount_start + 1
You should probably try the built-in csv library: http://docs.python.org/3.3/library/csv.html
That way you can handle each file as a list of dictionaries, which will make your task a lot easier.
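As a rough illustration of what that looks like, the sketch below reads a CSV into a list of dictionaries with csv.DictReader and takes a random sample from it. It is Python 3, and the in-memory text with its column names id and stimulus is invented for the example; it stands in for one of your real files opened with open("condition1.csv", newline="").

```python
import csv
import io
import random

# Stand-in for a real file object; the column names are made up.
fake_file = io.StringIO("id,stimulus\n1,red\n2,green\n3,blue\n4,grey\n")

rows = list(csv.DictReader(fake_file))  # one dict per row, keyed by the header
picked = random.sample(rows, 2)         # random rows, without replacement

for row in picked:
    print(row["id"], row["stimulus"])
```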
from random import randint, sample, choice
def create_random_list(length):
    return [randint(0, 100) for i in range(length)]
# This should be your list of four initial csv files
# with the 264 rows in total, read with the csv lib
lists = [create_random_list(264) for i in range(4)]
# Take a randomized sample from the lists
# (list comprehensions instead of map, so choice() also works on Python 3)
lists = [sample(x, 40) for x in lists]
# Add some variables to the
lists = [{'data': x, 'full_count': 0} for x in lists]
final = [[] for i in range(4)]
for l in final:
    prev = None
    count = 0
    while len(l) < 40:
        current = choice(lists)
        if current['full_count'] == 10 or (current is prev and count == 3):
            continue
        # Take an item from the chosen list if it hasn't been used 3 times in a
        # row or is already used 10 times. Append that item to the final list
        total_left = 40 - len(l)
        maxx = 0
        for i in lists:
            if i is not current and 10 - i['full_count'] > maxx:
                maxx = 10 - i['full_count']
        current_left = 10 - current['full_count']
        max_left = maxx + maxx/3.0
        if maxx > 3 and total_left <= max_left:
            # Make sure that in the future it can still be split in to sets of
            # max 3
            continue
        l.append(current['data'].pop())
        count += 1
        current['full_count'] += 1
        if current is not prev:
            count = 0
        prev = current
    for li in lists:
        li['full_count'] = 0
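The "shuffle in a loop until the conditions are met" idea from the question can also be sketched directly. This is a different, simpler technique than the incremental selection above: it builds one block with 10 rows from each source, then reshuffles until no source appears more than 3 times in a row. It is Python 3, and the names max_run, make_block and the fake row labels are assumptions for illustration. Note that the loop never ends if the limit cannot be satisfied at all, so it is only suitable for loose constraints like this one.

```python
import random


def max_run(labels):
    """Length of the longest run of identical consecutive labels."""
    longest = current = 0
    previous = object()  # sentinel that compares unequal to every label
    for label in labels:
        current = current + 1 if label == previous else 1
        longest = max(longest, current)
        previous = label
    return longest


def make_block(samples, per_source=10, limit=3, rng=random):
    """Build one block with per_source rows per source and no run longer than limit."""
    block = [(source, row) for source, rows in samples.items()
             for row in rows[:per_source]]
    while True:
        rng.shuffle(block)
        if max_run([source for source, _ in block]) <= limit:
            return block


# Fake data: 10 labelled rows for each of the four source files.
samples = {name: ["%s-row%d" % (name, i) for i in range(10)]
           for name in ("csv1", "csv2", "csv3", "csv4")}
block = make_block(samples)
print(len(block), max_run([source for source, _ in block]))
```

Keeping the source name next to each row is what makes the run check possible after the rows have been mixed.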
So basically, this prog reads 5 numbers:
X, Y, startFrom, jump, until
with a space separating each number. An example:
3 4 1 1 14
X = 3
Y = 4
startFrom = 1
jump = 1
until = 14
In order to do that, I used:
#get X, Y, startFrom, jump, until
parameters = raw_input()
parametersList = parameters.split()
X = int(parametersList[0])
Y = int(parametersList[1])
#start from startFrom
startFrom = int(parametersList[2])
#jumps of <jump>
jump = int(parametersList[3])
#until (and including) <until>
until = int(parametersList[4])
The program outputs a chain (or however you would like to call it) of numbers, BOOZ and BANG. currentPos (our number) starts at startFrom and gets closer and closer to until as we add jump every time. currentPos is a BOOZ if the digit X appears in it (e.g. X is 2 and we are at 23; to check that I used map(int, str(currentPos))), or if X divides currentPos (currentPos % X == 0, e.g. X is 2 and we are at 34).
BANG is the same, but with Y. If currentPos is both BOOZ and BANG, the output is BOOZ-BANG.
The positions checked are:
startFrom, startFrom+jump, startFrom+2*jump, startFrom+3*jump, ..., until
We know the numbers read are of type int, but we need to make sure they are valid for the game.
X and Y must be between 1 and 9 inclusive. Otherwise, we print (after all 5 numbers have been read): X and Y must be between 1 and 9 and exit the program.
In addition, jump can't be 0. If it is, we print jump can't be 0 and exit the program. Otherwise, if we can't reach until using jumps of size jump (that is, if there is no whole number n with startFrom + n * jump == until), we print can't jump from <startFrom> to <until> and exit the program.
My algorithm got too messy there, with a lot of ifs and whatnot, so I'd like some help with that as well.
so for our first example (3 4 1 1 14) the output should be:
1,2,BOOZ,BANG,5,BOOZ,7,BANG,BOOZ,10,11,BOOZ-BANG,BOOZ,BANG
another example:
-4 -3 4 0 19
OUTPUT:
X and Y must be between 1 and 9
jump can't be 0
another:
5 3 670 7 691
OUTPUT:
BOOZ,677,BANG,691
another:
0 3 4 -5 24
OUTPUT:
X and Y must be between 1 and 9
can't jump from 4 to 24
another:
3 4 34 3 64
OUTPUT:
BOOZ-BANG,BOOZ,BANG,BOOZ-BANG,BANG,BANG,BANG,55,58,61,BANG
My program is too messy: I did a while loop with a lot of ifs, including if currentPos == until so that it won't print the comma (,) after the last item, etc. As I said, the if conditions came out so long and messy that I just removed it all and decided to ask here for a nicer solution.
Thanks guys
I hope it was clear enough
My version has no if :)
parameters = raw_input()
sx, sy, sstartfrom, sjump, suntil = parameters.split()
x = "0123456789".index(sx)
y = "0123456789".index(sy)
startfrom = int(sstartfrom)
jump = int(sjump)
until = int(suntil)
for i in range(startfrom, until+jump, jump):
    si = str(i)
    booz = sx in si or i%x == 0
    bang = sy in si or i%y == 0
    print [[si, 'BANG'],['BOOZ','BOOZ-BANG']][booz][bang]
Easiest way to get the commas is to move the loop into a generator
def generator():
    for i in range(startfrom, until+jump, jump):
        si = str(i)
        booz = sx in str(i) or i%x == 0
        bang = sy in str(i) or i%y == 0
        yield [[si, 'BANG'],['BOOZ','BOOZ-BANG']][booz][bang]
print ",".join(generator())
Sample output
$ echo 3 4 1 1 14 | python2 boozbang.py
1,2,BOOZ,BANG,5,BOOZ,7,BANG,BOOZ,10,11,BOOZ-BANG,BOOZ,BANG
$ echo 5 3 670 7 691 | python2 boozbang.py
BOOZ,677,BANG,691
$ echo 3 4 34 3 64 | python2 boozbang.py
BOOZ-BANG,BOOZ,BANG,BOOZ-BANG,BANG,BANG,BANG,55,58,61,BANG
def CheckCondition(number, xOrY):
    # True if the digit xOrY appears in number, or xOrY divides number
    return (str(xOrY) in str(number)) or not (number % xOrY)
def SomeMethod(X, Y, start, jump, end):
    # end + jump makes the range include end itself
    for i in range(start, end + jump, jump):
        isPassX = CheckCondition(i, X)
        isPassY = CheckCondition(i, Y)
        if isPassX and isPassY:
            print "BOOZ-BANG"
        elif isPassX:
            print "BOOZ"
        elif isPassY:
            print "BANG"
        else:
            print i
def YourMethod():
    (X, Y, start, jump, end) = (3, 4, 1, 1, 14)
    valid = True
    if (X not in range(1, 10) or Y not in range(1, 10)):
        print "X and Y must be between 1 and 9"
        valid = False
    if jump == 0:
        print "jump can't be 0"
        valid = False
    # only play the game if the input passed both checks
    if valid:
        SomeMethod(X, Y, start, jump, end)
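Neither answer covers the "can't jump" check from the question, so here is one way the whole thing might be put together. It is a sketch in Python 3 rather than the Python 2 used above; the function name booz_bang and the choice to return the output lines as a list are my own assumptions. The reachability test follows the question's examples: until must be a whole, non-negative number of jumps away from startFrom.

```python
def booz_bang(x, y, start, jump, until):
    """Return the list of output lines for one game."""
    errors = []
    if not (1 <= x <= 9 and 1 <= y <= 9):
        errors.append("X and Y must be between 1 and 9")
    if jump == 0:
        errors.append("jump can't be 0")
    elif (until - start) % jump != 0 or (until - start) // jump < 0:
        # until is reachable only after a whole, non-negative number of jumps
        errors.append("can't jump from %d to %d" % (start, until))
    if errors:
        return errors

    def hit(number, digit):
        # The digit appears in the number, or divides it.
        return str(digit) in str(number) or number % digit == 0

    words = []
    for pos in range(start, until + jump, jump):
        booz, bang = hit(pos, x), hit(pos, y)
        if booz and bang:
            words.append("BOOZ-BANG")
        elif booz:
            words.append("BOOZ")
        elif bang:
            words.append("BANG")
        else:
            words.append(str(pos))
    return [",".join(words)]


for line in booz_bang(*map(int, "3 4 1 1 14".split())):
    print(line)
```

Collecting the error messages first means both messages are reported when two things are wrong, and joining the words at the end avoids any special case for the last comma.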
I have been trying to smooth a plot which is noisy due to the sampling rate I'm using and what it's counting. I've been using the help on here, mainly "Plot smooth line with PyPlot" (although I couldn't find the "spline" function, so I am using UnivariateSpline instead).
However, whatever I do I keep getting errors: either the pyplot error that "x and y are not of the same length", or that scipy's UnivariateSpline has a value for w that is incorrect. I am not sure quite how to fix this (not really a Python person!). I've attached the code, although it's just the plotting bit at the end that is causing problems. Thanks
import os.path
import matplotlib.pyplot as plt
import scipy.interpolate as sci
import numpy as np
def main():
    jcc = "0050"
    dj = "005"
    l = "060"
    D = 20
    hT = 4 * D
    wT1 = 2 * D
    wT2 = 5 * D
    for jcm in ["025","030","035","040","045","050","055","060"]:
        characteristic = "LeadersOnly/Jcm" + jcm + "/Jcc" + jcc + "/dJ" + dj + "/lambda" + l + "/Seed000"
        fingertime1 = []
        fingertime2 = []
        stamp =[]
        finger=[]
        for x in range(0,2500,50):
            if x<10000:
                z=("00"+str(x))
            if x<1000:
                z=("000"+str(x))
            if x<100:
                z=("0000"+str(x))
            if x<10:
                z=("00000"+str(x))
            stamp.append(x)
            path = "LeadersOnly/Jcm" + jcm + "/Jcc" + jcc + "/dJ" + dj + "/lambda" + l + "/Seed000/profile_" + str(z) + ".txt"
            if os.path.exists(path):
                f = open(path, 'r')
                pr1,pr2=np.genfromtxt(path, delimiter='\t', unpack=True)
                p1=[]
                p2=[]
                h1=[]
                h2=[]
                a1=[]
                a2=[]
                finger1 = 0
                finger2 = 0
                for b in range(len(pr1)):
                    p1.append(pr1[b])
                    p2.append(pr2[b])
                for elem in range(len(pr1)-80):
                    h1.append((p1[elem + (2*D)]-0.5*(p1[elem]+p1[elem + (4*D)])))
                    h2.append((p2[elem + (2*D)]-0.5*(p2[elem]+p2[elem + (4*D)])))
                    if h1[elem] >= hT:
                        a1.append(1)
                    else:
                        a1.append(0)
                    if h2[elem]>=hT:
                        a2.append(1)
                    else:
                        a2.append(0)
                for elem in range(len(a1)-1):
                    if (a1[elem] - a1[elem + 1]) != 0:
                        finger1 = finger1 + 1
                finger1 = finger1 / 2
                for elem in range(len(a2)-1):
                    if (a2[elem] - a2[elem + 1]) != 0:
                        finger2 = finger2 + 1
                finger2 = finger2 / 2
                fingertime1.append(finger1)
                fingertime2.append(finger2)
                finger.append((finger1+finger2)/2)
        namegraph = jcm
        stampnew = np.linspace(stamp[0],stamp[-1],300)
        fingernew = sci.UnivariateSpline(stamp, finger, stampnew)
        plt.plot(stampnew,fingernew,label=namegraph)
    plt.show()
main()
For information, the data input files are simply a list of integers (two lists separated by tabs, as the code suggests).
Here is one of the error codes that I get:
0-th dimension must be fixed to 50 but got 300
error Traceback (most recent call last)
/group/data/Cara/JCMMOTFingers/fingercount_jcm_smooth.py in <module>()
116
117 if __name__ == '__main__':
--> 118 main()
119
120
/group/data/Cara/JCMMOTFingers/fingercount_jcm_smooth.py in main()
93 #print(len(stamp))
94 stampnew = np.linspace(stamp[0],stamp[-1],300)
---> 95 fingernew = sci.UnivariateSpline(stamp, finger, stampnew)
96 #print(len(stampnew))
97 #print(len(fingernew))
/usr/lib/python2.6/dist-packages/scipy/interpolate/fitpack2.pyc in __init__(self, x, y, w, bbox, k, s)
86 #_data == x,y,w,xb,xe,k,s,n,t,c,fp,fpint,nrdata,ier
87 data = dfitpack.fpcurf0(x,y,k,w=w,
---> 88 xb=bbox[0],xe=bbox[1],s=s)
89 if data[-1]==1:
90 # nest too small, setting to maximum bound
error: failed in converting 1st keyword `w' of dfitpack.fpcurf0 to C/Fortran array
Let's analyze your code a bit, starting from the for x in range(0, 2500, 50):
You define z as a string of 6 digits padded with 0s. You should really use some string formatting like z = "{0:06d}".format(x) or z = "%06d" % x instead of these multiple tests of yours.
At the end of your loop, stamp will have (2500//50)=50 elements.
You check for the existence of your file path, then open it and read it, but you never close it. A more Pythonic way is to do:
try:
    with open(path,"r") as f:
        do...
except IOError:
    do something else
With the with syntax, your file is automatically closed.
pr1 and pr2 are likely to be 1D arrays, right? You can really simplify the construction of your p1 and p2 lists as:
p1 = pr1.tolist()
p2 = pr2.tolist()
Your lists a1, a2 have the same size: you could combine your for elem in range(len(a..)-1) loops in a single one. You could also use the np.diff function.
At the end of the for x in range(...) loop, finger will have 50 elements minus the number of missing files. As you're not telling what to do in case of a missing file, your stamp and finger lists may not have the same number of elements, which will crash your scipy.UnivariateSpline. An easy fix would be to update your stamp list only if the path file exists (that way, it always has the same number of elements as finger).
Your stampnew array has 300 elements, when your stamp and finger can only have at most 50. That's a second problem, the size of the weight array (stampnew) must be the same as the size of the inputs.
You're eventually trying to plot fingernew vs stampnew. The problem is that fingernew is not an array, it's an instance of UnivariateSpline. You still need to calculate some actual points, for example with fingernew(stampnew), then use that in your plot function.
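To make the last points concrete, here is a small sketch in Python 3 of how the spline could be built and evaluated. The stamp and finger arrays are synthetic stand-ins for the lists built in the question's loop (the sine curve is invented data), and the plotting call is left as a comment so the example runs without a display.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Made-up stand-ins for the stamp and finger lists from the question.
stamp = np.arange(0, 2500, 50)            # 50 time stamps
finger = 5 + 3 * np.sin(stamp / 400.0)    # 50 synthetic counts

# Zero-padded file index, replacing the chain of if tests.
z = "{0:06d}".format(50)

# Only x and y are passed: a third positional argument would be read as w, the weights.
spline = UnivariateSpline(stamp, finger)

# Evaluate the spline on a finer grid to get values that can be plotted.
stampnew = np.linspace(stamp[0], stamp[-1], 300)
fingernew = spline(stampnew)

print(z, len(stampnew), len(fingernew))
# plt.plot(stampnew, fingernew) would now receive two arrays of equal length.
```

The key difference from the question's code is that stampnew is passed when calling the spline object, not when constructing it.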