I want to extract data from a txt file with python - python

I want to extract the initial residual values from the file attached:
http://www.filedropper.com/fixvel
But the initial residual must be referred to the "DICPCG" line and not to others (like "smoothsolver"). Then I want to store those values into a matrix that contains the values of inital residual at the same time step (on the same row) for every time-step.
Really thanks in advance

Please try this code:
import re
import sys
file = open("fixVel.txt")
textfile = file.readlines()
count = -1
matrix = []
for row in textfile:
if row.find("Time") == 0:
count = count + 1
matrix.append([])
if row.find("DICPCG") == 0:
index = row.find("Initial residual")
index1 = row[index:].find(",")
matrix[count].append(row[index+19:index+index1])
#print matrix
for i in matrix:
for j in i:
sys.stdout.write(j + " ")
print

Related

Calculate some averages in .txt python

I have a .txt-file called ecc.txt. It contains more than 8000 lines of numbers. I want to count the average of every 360 lines in that file.
Here is the code:
import math
f = open(r'ecc.txt').read()
data = []
for line in data:
sum = 0
for i in range (len(data)):
if i % 360 != 0:
sum = sum + ecc[i]
else:
average = sum / 360
print(average)
sum = 0
When I am running it, nothing happens. I didn't get any results. The code just running and end without any result.
Is there something wrong with this code?
Thank you.
avg_dict = {}
with open('ecc.txt') as f:
data = f.read().split(' ')
sum = 0
i = 0
for str_number in data:
sum += int(str_number)
i += 1
if i % 360 == 0:
avg_dict[i] = sum/360
sum = 0
I've assumed that your file text has an empty space as separator. Otherwise, you can change the sep value in the split method. If there is not separator change data as:
data = list(f.read())
You code would work with some changes:
import math
data=[]
with open(r'ecc.txt') as f:
for i in f:
data.append(int(i))
for line in data:
sum = 0
for i in range (len(data)):
if i%360 !=0:
sum = sum + ecc[i]
else:
average = sum/360
print(average)
sum=0
Be aware though, that this code doesn't include values for each 360th element (i guess it's not a problem for an average) and also you don't have average for last elements

Comparison of two elements in different array

My problem:
I am trying to compare two elements from two different arrays but the operator is not working.
Code Snippet in question:
for i in range(row_length):
print(f"ss_record: {ss_record[i]}")
print(f"row: {row[i + 1]}")
#THIS IF STATEMENT IS NOT WORKING
if ss_record[i] == row[i + 1]:
count += 1
#print()
#print(f"row length: {row_length}")
#print(f"count: {count}")
if count == row_length:
print(row[0])
exit(0)
What I have done: I tried to print the value of ss_record and row before it runs through the if statement but when it matches, count doesn't increase. I tried storing the value of row in a new array but it bugs out and only store the array length and first 2 value of row and repeats those values every next instance.
What I think the issue: I think the issue with my code is that row is being read from a CSV file and is not being converted into an integer as a result, it appears they are the same but one is an integer while the other is a string.
Entire Code:
import csv
import sys
import re
from cs50 import get_string
from sys import argv
def main():
line_count = 0
if len(argv) != 3:
print("missing command-line argument")
exit(1)
with open(sys.argv[1], 'r') as database:
sequence = open(sys.argv[2], 'r')
string = sequence.read()
reader = csv.reader(database, delimiter = ',')
for row in reader:
if line_count == 0:
row_length = len(row) - 1
ss_record = [row_length]
for i in range(row_length):
ss_record.append(ss_count(string, row[i + 1], len(row[i + 1])))
ss_record.pop(0)
line_count = 1
else:
count = 0
for i in range(row_length):
print(f"ss_record: {ss_record[i]}")
print(f"row: {row[i + 1]}")
#THIS IF STATEMENT IS NOT WORKING
if ss_record[i] == row[i + 1]:
count += 1
if count == row_length:
print(row[0])
exit(0)
#ss_count mean the # of times the substring appear in the string
def ss_count(string, substring, length):
count = 1
record = 0
pos_array = []
for m in re.finditer(substring, string):
pos_array.append(m.start())
for i in range(len(pos_array) - 1):
if pos_array[i + 1] - pos_array[i] == length:
count += 1
else:
if count > record:
record = count
count = 1
if count > record:
record = count
return record
main()
Values to use to reproduce issue:
sequence (this is a text file) = AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
substring (this is a csv file) =
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
Gist of the CSV file:
The numbers beside Alice means how many times a substring(STR/Short Tandem Repeat) appears in a row in the string(DNA sequence). In this string, AGATC appears 4 times in a row, AATG appears 1 time in a row, and TATC appears 5 times in a row. For this DNA sequence, it matches Bob and he outputted as the answer.
You were right, when you compare ss_record[i] == row[i + 1]: there is a type problem, the numbers of ss_record are integers while the numbers of the row are strings. You may acknowledge the issue by printing both ss_record and row:
print("ss_record: {}".format(ss_record)) -> ss_record: [4, 1, 5]
print("row: {}".format(row)) -> row: ['Alice', '2', '8', '3']
In order for the snippet to work you just need to change the comparison to
ss_record[i] == int(row[i + 1])
That said, I feel the code is quite complex for the task. The string class implements a count method that returns the number of non-overlapping occurrences of a given substring. Also, since the code it's working in an item basis and relies heavily in index manipulations the iteration logic is hard to follow (IMO). Here's my approach to the problem:
import csv
def match_user(dna_file, user_csv):
with open(dna_file, 'r') as r:
dna_seq = r.readline()
with open(user_csv, 'r') as r:
reader = csv.reader(r)
rows = list(reader)
target_substrings = rows[0][1:]
users = rows[1:]
num_matches = [dna_seq.count(target) for target in target_substrings]
for user in users:
user_matches = [int(x) for x in user[1:]]
if user_matches == num_matches:
return user[0]
return "Not found"
Happy Coding!

How to create a dataset using sequence file in python

I have a protein sequence file looks like this:
>102L:A MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL -------------------------------------------------------------------------------------------------------------------------------------------------------------------XX
The first one is the name of the sequence, the second one is the actual protein sequence, and the first one is the indicator that shows if there is any missing coordinates. In this case, notice that there is two "X" in the end. That means that the last two residue of the sequence witch are "NL" in this case are missing coordinates.
By coding in Python I would like to generate a table which should look like this:
name of the sequence
total number of missing coordinates (which is the number of X)
the range of these missing coordinates (which is the range of the position of those X)
4)the length of the sequence
5)the actual sequence
So the final results should looks like this:
>102L:A 2 163-164 164 MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAAKSELDKAIGRNTNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRAALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL
And my code looks like this so far:
total_seq = []
with open('sample.txt') as lines:
for l in lines:
split_list = l.split()
# Assign the list number
header = split_list[0] # 1
seq = split_list[1] # 5
disorder = split_list[2]
# count sequence length and total residue of missing coordinates
sequence_length = len(seq) # 4
for x in disorder:
counts = 0
if x == 'X':
counts = counts + 1
total_seq.append([header, seq, str(counts)]) # obviously I haven't finish coding 2 & 3
with open('new_sample.txt', 'a') as f:
for lol in total_seq:
f.write('\n'.join(lol))
I'm new in python, would anyone help please?
Here's your modified code. It now produces your desired output.
with open("sample.txt") as infile:
matrix = [line.split() for line in infile.readlines()]
header_list = [row[0] for row in matrix]
seq_list = [str(row[1]) for row in matrix]
disorder_list = [str(row[2]) for row in matrix]
f = open('new_sample.txt', 'a')
for i in range(len(header_list)):
header = header_list[i]
seq = seq_list[i]
disorder = disorder_list[i]
# count sequence length and total residue of missing coordinates
sequence_length = len(seq)
# get total number of missing coordinates
num_missing = disorder.count('X')
# get the range of these missing coordinates
first_X_pos = disorder.find('X')
last_X_pos = disorder.rfind('X')
range_missing = '-'.join([str(first_X_pos), str(last_X_pos)])
reformat_seq=" ".join([header, str(num_missing), range_missing, str(sequence_length), seq, '\n'])
f.write(reformat_seq)
f.close()
Some more tips:
Don't forget about python's string functions. They will solve a lot of your problems automatically. The documentation is very good.
If you searched for how to do just part 2 or just part 3 in your question, you would find the results elsewhere.

looping until the number of cells changed is neglible

This is probably a simple question, but it's driving me crazy! I have a python code that performs cellular automata on a land use grid. I've made a dictionary of cell id: land use code imported from a text file. I've also import of the adjacent neighbors of each cell from a text file. For each cell in the nested loop, I pick out the highest value, count the highest value of the neighboring cells. If this value is greater than the processing cell and occurred more than 4 times, then I update the dictionary for that cell id. The land use codes are ranked in priority. You will see < 6 in the code below...6 is water and wetlands which I do not want to be changed. The first time I run the code, 7509 cells changed land use based on adjacent neighbors land uses. I can comment out the reading the dictionary text file and run it again, then around 5,000 cells changed. Run it again, then even less and so on. What I would like to do is run this in a loop until only 0.0001 of the total cells change, after that break the loop.
I've tried several times using iterators like "for r in range(999)---something big; If End_Sim > count: break". But it breaks after the first one, because the count goes back to zero. I've tried putting the count = 0 inside the loop and it adds up...I want it to start back over every time so the number of cells gets less and less. I'm stump...hopefully this is trivial to somebody!
Here's my code (it's a clean slate...I've deleted my failed attempts to create the number of simulations loop):
import sys, string, csv
#Creating a dictionary of FID: LU_Codes from external txt file
text_file = open("H:\SWAT\NC\FID_Whole_Copy.txt", "rb")
#Lines = text_file.readlines()
FID_GC_dict = dict()
reader = csv.reader(text_file, delimiter='\t')
for line in reader:
FID_GC_dict[line[0]] = int(line[1])
text_file.close()
#Importing neighbor list file for each FID value
Neighbors_file = open("H:\SWAT\NC\Pro_NL_Copy.txt","rb")
Entries = Neighbors_file.readlines()
Neighbors_file.close()
Neighbors_List = map(string.split, Entries)
#print Neighbors_List
#creates a list of the current FID
FID = [x[0] for x in Neighbors_List]
#Calculate when to end of one sweep
Tot_Cells = len(FID)
End_Sim = int(0.0001*Tot_Cells)
gridList = []
for nlist in Neighbors_List:
row = []
for item in nlist:
row.append(FID_GC_dict[item])
gridList.append(row)
#print gridList
#Performs cellular automata rules on land use grid codes
i = iter(FID)
count = 0
for glist in gridList:
Cur_FID = i.next()
Cur_GC = glist[0]
glist.sort()
lr_Value = glist[-1]
if lr_Value < 6:
tie_LR = glist.count(lr_Value)
if tie_LR >= 4 and lr_Value > Cur_GC:
FID_GC_dict[Cur_FID] = lr_Value
#print "The updated gridcode for FID ", Cur_FID, "is ", FID_GC_dict[Cur_FID]
count += 1
print count
Thanks for any help!
use a while loop:
cnt_total = 1234 # init appropriately
cnt_changed = cnt_total
p = 0.001
while (cnt_changed > cnt_total*p):
# your code here
# remember to update the cnt_changed variable
Try with the while break statements
initialization stuff
while(1):
...
if x < 0.0001:
break
...
http://docs.python.org/tutorial/controlflow.html#break-and-continue-statements-and-else-clauses-on-loops
I fixed the code so the simulations stop once the number of cells change is less than 0.0001 of the total cells. I had the while loop in the wrong place. Here's the code if anyone is interested in cellular automata.
import sys, string, csv
#Creating a dictionary of FID: LU_Codes from external txt file
text_file = open("H:\SWAT\NC\FID_Whole_Copy.txt", "rb")
#Lines = text_file.readlines()
FID_GC_dict = dict()
reader = csv.reader(text_file, delimiter='\t')
for line in reader:
FID_GC_dict[line[0]] = int(line[1])
text_file.close()
#Importing neighbor list file for each FID value
Neighbors_file = open("H:\SWAT\NC\Pro_NL_Copy.txt","rb")
Entries = Neighbors_file.readlines()
Neighbors_file.close()
Neighbors_List = map(string.split, Entries)
#print Neighbors_List
#creates a list of the current FID
FID = [x[0] for x in Neighbors_List]
#print FID
#Calculate when to end the simulations (neglible change in land use)
tot_cells = len(FID)
end_sim = tot_cells
p = 0.0001
#Performs cellular automata rules on land use grid codes
while (end_sim > tot_cells*p):
gridList = []
for nlist in Neighbors_List:
row = []
for item in nlist:
row.append(FID_GC_dict[item])
gridList.append(row)
#print gridList
i = iter(FID)
count = 0
for glist in gridList:
Cur_FID = i.next()
Cur_GC = glist[0]
glist.sort()
lr_Value = glist[-1]
if lr_Value < 6:
tie_LR = glist.count(lr_Value)
if tie_LR >= 4 and lr_Value > Cur_GC:
FID_GC_dict[Cur_FID] = lr_Value
print "The updated gridcode for FID ", Cur_FID, "is ", FID_GC_dict[Cur_FID]
count += 1
end_sim = count
print count

looping until the number of cells changed is very small

This is a repost because I'm getting weird results. I'm trying to run a simulation loop for cells that change in a cellular automata code that changes land use codes based on their adjacent neighbors. I import text files that create a cell id key = land use code value. I also import a text file with each cell's adjacent neighbors. The first time I run the code, 7509 cells changed land use based on adjacent neighbors land uses. I can comment out the reading the dictionary text file and run it again, then around 5,000 cells changed. Run it again, then even less and so on. What I would like to do is run this in a loop until only 0.0001 of the total cells change, after that break the loop.
I've tried a while loop, but it's not giving me the results I'm looking for. After the first run, the count is correct at 7509. After that the count is 28,476 over and over again. I don't understand why this is happening because the count should go back to zero. Can anyone tell me what I'm doing wrong? Here's the code:
import sys, string, csv
#Creating a dictionary of FID: LU_Codes from external txt file
text_file = open("H:\SWAT\NC\FID_Whole_Copy.txt", "rb")
#Lines = text_file.readlines()
FID_GC_dict = dict()
reader = csv.reader(text_file, delimiter='\t')
for line in reader:
FID_GC_dict[line[0]] = int(line[1])
text_file.close()
#Importing neighbor list file for each FID value
Neighbors_file = open("H:\SWAT\NC\Pro_NL_Copy.txt","rb")
Entries = Neighbors_file.readlines()
Neighbors_file.close()
Neighbors_List = map(string.split, Entries)
#print Neighbors_List
#creates a list of the current FID
FID = [x[0] for x in Neighbors_List]
gridList = []
for nlist in Neighbors_List:
row = []
for item in nlist:
row.append(FID_GC_dict[item])
gridList.append(row)
#print gridList
#Calculate when to end of one sweep
tot_cells = len(FID)
end_sim = tot_cells
p = 0.0001
#Performs cellular automata rules on land use grid codes
while (end_sim > tot_cells*p):
i = iter(FID)
count = 0
for glist in gridList:
Cur_FID = i.next()
Cur_GC = glist[0]
glist.sort()
lr_Value = glist[-1]
if lr_Value < 6:
tie_LR = glist.count(lr_Value)
if tie_LR >= 4 and lr_Value > Cur_GC:
FID_GC_dict[Cur_FID] = lr_Value
#print "The updated gridcode for FID ", Cur_FID, "is ", FID_GC_dict[Cur_FID]
count += 1
end_sim = count
print end_sim
Thanks for any help....again! :(
I fixed the code so that the simulations stop after the number of cells changed is less than 0.0001 of total cells. I put the while loop in the wrong place. If anyone is interested, here's the revised code for land use cellular automata.
import sys, string, csv
#Creating a dictionary of FID: LU_Codes from external txt file
text_file = open("H:\SWAT\NC\FID_Whole_Copy.txt", "rb")
#Lines = text_file.readlines()
FID_GC_dict = dict()
reader = csv.reader(text_file, delimiter='\t')
for line in reader:
FID_GC_dict[line[0]] = int(line[1])
text_file.close()
#Importing neighbor list file for each FID value
Neighbors_file = open("H:\SWAT\NC\Pro_NL_Copy.txt","rb")
Entries = Neighbors_file.readlines()
Neighbors_file.close()
Neighbors_List = map(string.split, Entries)
#print Neighbors_List
#creates a list of the current FID
FID = [x[0] for x in Neighbors_List]
#print FID
#Calculate when to end the simulations (neglible change in land use)
tot_cells = len(FID)
end_sim = tot_cells
p = 0.0001
#Performs cellular automata rules on land use grid codes
while (end_sim > tot_cells*p):
gridList = []
for nlist in Neighbors_List:
row = []
for item in nlist:
row.append(FID_GC_dict[item])
gridList.append(row)
#print gridList
i = iter(FID)
count = 0
for glist in gridList:
Cur_FID = i.next()
Cur_GC = glist[0]
glist.sort()
lr_Value = glist[-1]
if lr_Value < 6:
tie_LR = glist.count(lr_Value)
if tie_LR >= 4 and lr_Value > Cur_GC:
FID_GC_dict[Cur_FID] = lr_Value
print "The updated gridcode for FID ", Cur_FID, "is ", FID_GC_dict[Cur_FID]
count += 1
end_sim = count
print count
I don't know the type of cellular automata that you are programming so mine it's just a guess but usually cellular automata works by updating a whole phase ignoring updated values until the phase is finished.
When I had unexpected results for simple cellular automata it was because I just forgot to apply the phase to a backup grid, but I was applying it directly to the grid I was working on.
What I mean is that you should have 2 grids, let's call them grid1 and grid2, and do something like
init grid1 with data
while number of generations < total generations needed
calculate grid2 as the next generation of grid1
grid1 = grid2 (you replace the real grid with the buffer)
Altering values of grid1 directly will lead to different results because you will mostly change neighbours of a cell that still has to be updated before having finished the current phase..

Categories