Path finding algorithm exercise and working with .txt files - python

This is a homework assignment I was given, and I have been struggling to write the solution.
Write a program that finds the longest adjacent sequence of colors in a matrix (2D grid). Colors are represented by the characters ‘R’, ‘G’ and ‘B’ (Red, Green and Blue respectively).
You will be provided with 4 individual test cases, which must also be included in your solution.
Your solution's root directory should look like this:
solutionRootDir
| - (my solution files and folders)
| - tests/
| - test_1
| - test_2
| - test_3
| - test_4
Individual test case input format:
First you should read two whitespace-separated 32-bit integers from the provided test case
that represent the size (rows and cols) of the matrix.
Next you should read rows newline-separated lines of 8-bit characters.
Your program should find the longest adjacent sequence (diagonals do not count as adjacent fields)
and print its length to the standard output.
NOTE: in case of several sequences with the same length – simply print their equal length.
test_1
Provided input:
3 3
R R B
G G R
R B G
Expected Output:
2
test_2
Provided input:
4 4
R R R G
G B R G
R G G G
G G B B
Expected Output:
7
test_3
Provided input:
6 6
R R B B B B
B R B B G B
B G G B R B
B B R B G B
R B R B R B
R B B B G B
Expected Output:
22
test_4
Provided input:
1000 1000
1000 rows of 1000 R’s
Expected Output:
1000000
Your program entry point should accept from one to four additional parameters.
Those parameters will indicate the names of the test cases that your program should run.
• Example 1: ./myprogram test_1 test_3
• Example 2: ./myprogram test_1 test_2 test_3 test_4
• you can assume that the input from the user will be correct (no validation is required)
import numpy as np

a = int(input("Enter rows: "))
b = int(input("Enter columns: "))
rgb = ["R", "G", "B"]
T = [[0 for col in range(b)] for row in range(a)]
for row in range(a):
    for col in range(b):
        T[row][col] = np.random.choice(rgb)
for r in T:
    for c in r:
        print(c, end=" ")
    print()

def solution(t):
    rows: int = len(t)
    cols: int = len(t[0])
    longest = np.empty((rows, cols))
    longest_sean = 1
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            target = t[i][j]
            current = 1
            for ii in range(i, rows):
                for jj in range(j, cols):
                    length = 1
                    if target == t[ii][jj]:
                        length += longest[ii][jj]
                    current = max(current, length)
            longest[i][j] = current
            longest_sean = max(current, longest_sean)
    return longest_sean

print(solution(T))

In order to get the parameters from the console execution you have to use sys.argv, so from sys import argv. Then convert your text files to Python lists like this:
def load(file):
    with open(file + ".txt") as f:
        data = f.readlines()
    res = []
    for row in data:
        res.append([])
        for element in row:
            if element != "\n" and element != " ":
                res[-1].append(element)
    return res
which will create a two-dimensional list containing "R", "B" and "G". Then you can simply look for the largest area of one value using this function:
def findLargest(data):
    visited = set()  # a set gives O(1) membership tests, unlike a list
    area = []
    length = 0
    movement = [(1, 0), (0, 1), (-1, 0), (0, -1)]
    def recScan(x, y, scanArea):
        visited.add((x, y))
        scanArea.append((x, y))
        for dx, dy in movement:
            newX, newY = x + dx, y + dy
            if newX >= 0 and newY >= 0 and newX < len(data) and newY < len(data[newX]):
                if data[x][y] == data[newX][newY] and (newX, newY) not in visited:
                    recScan(newX, newY, scanArea)
        return scanArea
    for x in range(len(data)):
        for y in range(len(data[x])):
            if (x, y) not in visited:
                newArea = recScan(x, y, [])
                if len(newArea) > length:
                    length = len(newArea)
                    area = newArea
    return length, area
whereby recScan will check all adjacent fields that haven't been visited yet. Then just call the functions like this:
if __name__ == "__main__":
    for file in argv[1:]:
        data = load(file)
        print(findLargest(data))
The argv[1:] is required because the first argument passed to Python is the file you want to execute. My directory structure is:
main.py
test_1.txt
test_2.txt
test_3.txt
test_4.txt
and test_1 through test_4 look like this, just with other values:
R R B B B B
B R B B G B
B G G B R B
B B R B G B
R B R B R B
R B B B G B
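One caveat with the recursive scan: on the 1000x1000 single-color test case it will exceed Python's default recursion limit. A sketch of the same flood fill with an explicit stack (the function name is mine; it expects the same grid-of-characters format that load returns) avoids that:

```python
def find_largest_iterative(data):
    # Flood fill with an explicit stack instead of recursion, so a
    # 1000x1000 grid of one color does not hit the recursion limit.
    visited = set()
    best = 0
    for sx in range(len(data)):
        for sy in range(len(data[sx])):
            if (sx, sy) in visited:
                continue
            color = data[sx][sy]
            stack = [(sx, sy)]
            visited.add((sx, sy))
            size = 0
            while stack:
                x, y = stack.pop()
                size += 1
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nx, ny = x + dx, y + dy
                    if (0 <= nx < len(data) and 0 <= ny < len(data[nx])
                            and (nx, ny) not in visited
                            and data[nx][ny] == color):
                        visited.add((nx, ny))
                        stack.append((nx, ny))
            best = max(best, size)
    return best

grid = [list("RRB"), list("GGR"), list("RBG")]  # test_1 from the problem statement
print(find_largest_iterative(grid))  # → 2
```

Each cell is pushed and popped exactly once, so this runs in O(rows * cols) regardless of region shape.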

Related

Minimum Euclidean Distance

I have two dataframes (attached image). For each of the given row in Table-1 -
Part 1 - I need to find the row in Table-2 which gives the minimum Euclidean distance. Output-1 is the expected answer.
Part 2 - The same, except that here a row from Table-2 cannot be selected twice. Output-2 is the expected answer.
I tried this code to get the distance but not sure on how to add other fields -
import numpy as np
from scipy.spatial import distance
s1 = np.array([(2,2), (3,0), (4,1)])
s2 = np.array([(1,3), (2,2),(3,0),(0,1)])
print(distance.cdist(s1,s2).min(axis=1))
Two dataframes and the expected output:
The code below gives the desired output, and there's a commented-out print statement for extra output.
It's also flexible to different list lengths.
Credit also to: How can the Euclidean distance be calculated with NumPy?
Hope it helps:
import numpy  # needed for numpy.array / numpy.around below
from numpy import linalg as LA

list1 = [(2, 2), (3, 0), (4, 1)]
list2 = [(1, 3), (2, 2), (3, 0), (0, 1)]
names = range(0, len(list1) + len(list2))
names = [chr(ord('`') + number + 1) for number in names]  # 'a', 'b', 'c', ...
i = -1
j = len(list1)  # start of Table-2 names
for tup1 in list1:
    collector = {}  # let's collect values for each minimum check
    j = len(list1)
    i += 1
    name1 = names[i]
    for tup2 in list2:
        name2 = names[j]
        a = numpy.array(tup1)
        b = numpy.array(tup2)
        # print("{} | {} -->".format(name1, name2), tup1, tup2, " ", numpy.around(LA.norm(a - b), 2))
        j += 1
        collector["{} | {}".format(name1, name2)] = numpy.around(LA.norm(a - b), 2)
        if j == len(names):
            min_key = min(collector, key=collector.get)
            print(min_key, "-->", collector[min_key])
Output:
a | e --> 0.0
b | f --> 0.0
c | f --> 1.41
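For Part 2, where each Table-2 row may be used at most once, picking greedy per-row minima is not guaranteed to be optimal; minimizing the total distance over a one-to-one matching is the classic assignment problem. A sketch using scipy.optimize.linear_sum_assignment on the same toy points (assuming "minimum total distance" is the intended criterion):

```python
import numpy as np
from scipy.spatial import distance
from scipy.optimize import linear_sum_assignment

s1 = np.array([(2, 2), (3, 0), (4, 1)])
s2 = np.array([(1, 3), (2, 2), (3, 0), (0, 1)])

cost = distance.cdist(s1, s2)                    # pairwise Euclidean distances
row_ind, col_ind = linear_sum_assignment(cost)   # one-to-one matching, min total cost
for r, c in zip(row_ind, col_ind):
    print(s1[r], '->', s2[c], round(cost[r, c], 2))
```

Here each Table-1 row is matched to a distinct Table-2 row; unmatched Table-2 rows (Table-2 is longer) are simply left out.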

Shouldn't this code generate a satisfiable value for all 9 variables of a 3*3 magic square?

It took 22 minutes on a Core 2 Duo with no result.
for a in range(10):
    for b in range(10):
        while b!=a:
            for c in range(10):
                while c!= a or b:
                    for d in range(10):
                        while d!=a or b or c:
                            for e in range(10):
                                while e!=a or b or c or d:
                                    for f in range(10):
                                        while f!=a or b or c or d or e:
                                            for g in range(10):
                                                while g!=a or b or c or d or e or f:
                                                    for h in range(10):
                                                        while h!= a or b or c or d or e or f or g:
                                                            for i in range(10):
                                                                while i!= a or b or c or d or e or f or g or h:
                                                                    if (a+b+c==15 and a+d+g==15 and d+e+f==15 and g+h+i==15 and b+e+h==15 and c+f+i==15 and a+e+i==15 and c+e+g==15):
                                                                        print(a,b,c,d,e,f,g,h,i)
This is ugly, but the error is that you cannot have a comparison like
while e!=a or b or c or d:
instead, you should write
while e!=a and e!=b and e!=c and e!=d:
Please learn how to use arrays/lists and re-think the problem.
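One way to re-think it along those lines: instead of hand-rolled nested loops with uniqueness checks, let itertools.permutations enforce "all different" for you. A sketch (it checks all 362,880 orderings of 1-9, which is fast):

```python
from itertools import permutations

SUM = 15
solutions = []
for a, b, c, d, e, f, g, h, i in permutations(range(1, 10)):
    if (a + b + c == SUM and d + e + f == SUM and g + h + i == SUM       # rows
            and a + d + g == SUM and b + e + h == SUM and c + f + i == SUM  # cols
            and a + e + i == SUM and c + e + g == SUM):                  # diagonals
        solutions.append((a, b, c, d, e, f, g, h, i))
print(len(solutions))  # → 8 (rotations and reflections of the one magic square)
```

Because permutations yields each value exactly once per tuple, all the broken "while x != a or b ..." uniqueness tests disappear.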
@Selcuk has identified your problem; here is an improved approach:
# !!! Comments are good !!!
# Generate a magic square like
#
#   a b c
#   d e f
#   g h i
#
# where each row, col, and diag adds to 15

values = list(range(10))  # problem domain
unused = set(values)      # compare against a set
                          # instead of chained "a != b and a != c"
SUM = 15
for a in values:
    unused.remove(a)
    for b in values:
        if b in unused:
            unused.remove(b)
            # value for c is forced by values of a and b
            c = SUM - a - b
            if c in unused:
                unused.remove(c)
                #
                # etc
                #
                unused.add(c)
            unused.add(b)
    unused.add(a)
This will work, but will still be ugly.
In the long run you would be better off using a constraint solver like python-constraint:
from constraint import Problem, AllDifferentConstraint
values = list(range(10))
SUM = 15
prob = Problem()
# set up variables
prob.addVariables("abcdefghi", values)
# only use each value once
prob.addConstraint(AllDifferentConstraint())
# row sums
prob.addConstraint(lambda a,b,c: a+b+c==SUM, "abc")
prob.addConstraint(lambda d,e,f: d+e+f==SUM, "def")
prob.addConstraint(lambda g,h,i: g+h+i==SUM, "ghi")
# col sums
prob.addConstraint(lambda a,d,g: a+d+g==SUM, "adg")
prob.addConstraint(lambda b,e,h: b+e+h==SUM, "beh")
prob.addConstraint(lambda c,f,i: c+f+i==SUM, "cfi")
# diag sums
prob.addConstraint(lambda a,e,i: a+e+i==SUM, "aei")
prob.addConstraint(lambda c,e,g: c+e+g==SUM, "ceg")
for sol in prob.getSolutionIter():
    print("{a} {b} {c}\n{d} {e} {f}\n{g} {h} {i}\n\n".format(**sol))
Note that this returns 8 solutions which are all rotated and mirrored versions of each other. If you only want unique solutions, you can add a constraint like
prob.addConstraint(lambda a,c,g: a<c<g, "acg")
which forces a unique ordering.
Also note that fixing the values of any three corners, or the center and any two non-opposed corners, forces all the remaining values. This leads to a simplified solution:
values = set(range(1, 10))
SUM = 15
for a in values:
    for c in values:
        if a < c:  # for unique ordering
            for g in values:
                if c < g:  # for unique ordering
                    b = SUM - a - c
                    d = SUM - a - g
                    e = SUM - c - g
                    f = SUM - d - e
                    h = SUM - b - e
                    i = SUM - a - e
                    if {a,b,c,d,e,f,g,h,i} == values:
                        print(
                            "{} {} {}\n{} {} {}\n{} {} {}\n\n"
                            .format(a,b,c,d,e,f,g,h,i)
                        )
On my machine this runs in 168 µs (approximately 1/6000th of a second).

How can I assign scores to a list of datapoints and then output values > 2 standard deviations from the mean in python?

I wrote a script that reads through a text file with rows of data. Example of a line of data:
10 1100 1101 G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G A/G G G G A/G G G A/G A/G G G A/G G A/G G G G G G . G A/G A/G G G A/G G G G A/G G A A/G A/G A/G G G A/G G G G A/G . G A A/G A . A A A G G G A G A/G A/G A/G A G A/G A A A A A A A/G A
My script calculates a frequency score for each row of data as a percentage based on the relative number of different letters. Currently my script outputs a subset of the row and the percentage score if the percentage score is >0.75.
However I would like to make the script do something more sophisticated but I do not know how.
1) For each row of data I would like the script to save the 1st, 2nd, 4th and 5th column of data in an array and also add the percentage score as an additional value.
2) Then once the script has read through all rows in the text file I would like it to output all the rows with percentage scores >2 standard deviations above the mean percentage score.
Below find my current script.
(One small but non-essential additional thing: currently I print each relevant row twice, because a percentage score that is >0.75 for one letter will always also be >0.75 for another letter. To get around this I just need the script to move on to the next row of data once it has printed, but I always get confused about whether I should use break, continue or something else to move to the next line without ending the entire script.)
inputfile = open('datafile.txt', 'r')
output = open('output.txt', 'w')
#windowstart = 0
for line in inputfile:
    line = line.rstrip()
    fields = line.split("\t")
    chrom = fields[0]
    pos = str(fields[1])
    allele_one = str(fields[3])
    allele_two = str(fields[4])
    #which columns belong to which population
    PopulationA = fields[3:26]
    PopulationB = fields[26:36]
    #sample size of each population
    PopulationA_popsize = 46
    PopulationB_popsize = 20
    #Now count the total number of alleles in each population (Homozygous alleles counted twice, heterozygotes just once)
    #count C allele
    C_count_PopulationA = (2*PopulationA.count("C")) + PopulationA.count("C/T") + PopulationA.count("A/C") + PopulationA.count("C/G")
    percentage_C_PopulationA = float(C_count_PopulationA)/46
    #count A allele
    A_count_PopulationA = (2*PopulationA.count("A")) + PopulationA.count("A/T") + PopulationA.count("A/C") + PopulationA.count("A/G")
    percentage_A_PopulationA = float(A_count_PopulationA)/46
    #count T allele
    T_count_PopulationA = (2*PopulationA.count("T")) + PopulationA.count("C/T") + PopulationA.count("A/T") + PopulationA.count("G/T")
    percentage_T_PopulationA = float(T_count_PopulationA)/46
    #count G allele
    G_count_PopulationA = (2*PopulationA.count("G")) + PopulationA.count("G/T") + PopulationA.count("A/G") + PopulationA.count("C/G")
    percentage_G_PopulationA = float(G_count_PopulationA)/46
    #count missing data
    null_count_PopulationA = (2*PopulationA.count("."))
    percentage_null_PopulationA = float(null_count_PopulationA)/46
    #repeat for population B
    C_count_PopulationB = (2*PopulationB.count("C")) + PopulationB.count("C/T") + PopulationB.count("A/C") + PopulationB.count("C/G")
    percentage_C_PopulationB = float(C_count_PopulationB)/20
    A_count_PopulationB = (2*PopulationB.count("A")) + PopulationB.count("A/T") + PopulationB.count("A/C") + PopulationB.count("A/G")
    percentage_A_PopulationB = float(A_count_PopulationB)/20
    T_count_PopulationB = (2*PopulationB.count("T")) + PopulationB.count("C/T") + PopulationB.count("A/T") + PopulationB.count("G/T")
    percentage_T_PopulationB = float(T_count_PopulationB)/20
    G_count_PopulationB = (2*PopulationB.count("G")) + PopulationB.count("G/T") + PopulationB.count("A/G") + PopulationB.count("C/G")
    percentage_G_PopulationB = float(G_count_PopulationB)/20
    null_count_PopulationB = (2*PopulationB.count("."))
    percentage_null_PopulationB = float(null_count_PopulationB)/20
    #If missing data less than 10% in both populations
    if percentage_null_PopulationA < 0.1:
        if percentage_null_PopulationB < 0.1:
            #calculate frequency difference between populations for each allele
            Frequency_diff_C_PopulationA_PopulationB = float(abs(percentage_C_PopulationA - percentage_C_PopulationB))
            Frequency_diff_A_PopulationA_PopulationB = float(abs(percentage_A_PopulationA - percentage_A_PopulationB))
            Frequency_diff_T_PopulationA_PopulationB = float(abs(percentage_T_PopulationA - percentage_T_PopulationB))
            Frequency_diff_G_PopulationA_PopulationB = float(abs(percentage_G_PopulationA - percentage_G_PopulationB))
            #if the frequency difference between alleles is greater than 0.75, print part of the row
            if Frequency_diff_C_PopulationA_PopulationB >= 0.75:
                print >> output, str(chrom) + "\t" + str(pos) + "\t" + str(allele_one) + "\t" + str(allele_two)
            if Frequency_diff_A_PopulationA_PopulationB >= 0.75:
                print >> output, str(chrom) + "\t" + str(pos) + "\t" + str(allele_one) + "\t" + str(allele_two)
            if Frequency_diff_T_PopulationA_PopulationB >= 0.75:
                print >> output, str(chrom) + "\t" + str(pos) + "\t" + str(allele_one) + "\t" + str(allele_two)
            if Frequency_diff_G_PopulationA_PopulationB >= 0.75:
                print >> output, str(chrom) + "\t" + str(pos) + "\t" + str(allele_one) + "\t" + str(allele_two)
I am looking to find the allele frequency difference between the populations for each row. So for example if we imagine there are 10 individuals in population A (first 10 columns of nucleotides) and 10 individuals in population B (final 10 columns of nucleotides then in the example row of data below we see that population A has 10 G nucleotides. Population B has 3 G nucleotides and 7 A nucleotides. So the frequency difference between the 2 populations is 70%.
10 20 21 G G G G G G G G G G G G G A A A A A A A
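As an aside on the break/continue confusion mentioned above: continue is the one that moves on to the next row without ending the loop, while break would stop reading the file entirely. A minimal sketch with made-up scores:

```python
# 'continue' skips the rest of the current iteration; 'break' exits the loop.
# Printing once per qualifying row, then moving on:
printed = []
for score in [0.8, 0.2, 0.9]:   # hypothetical per-row frequency differences
    if score >= 0.75:
        printed.append(score)
        continue  # go to the next row; skip any later >= 0.75 checks for this one
    # other per-row checks would go here
print(printed)  # → [0.8, 0.9]
```

In the script above, replacing the second through fourth >= 0.75 blocks with a single print followed by continue would give one output line per row.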
As soon as you talk of means and standard deviations for lots of data, you should start using any of the numerical libraries. Consider using numpy, or even pandas (for readability) here. I'll be using them in this example, together with the Counter object from the collections module. Read up on both to see how they work, but I'll explain a bit throughout the code as well.
import numpy as np
from collections import Counter

nucleotid_bases = ('C', 'A', 'T', 'G', '.')
results = []
checksum = []
with open('datafile.txt') as f:
    for line in f:
        fields = line.split()  # splits by consecutive whitespace, empty records will be purged
        chrom, pos = [int(fields[x]) for x in (0, 1)]
        results.append([chrom, pos])  # start by building the current record
        allele1, allele2 = [fields[i] for i in (3, 4)]
        checksum.append([allele1, allele2])  # you wanted to keep these, most likely for debugging purposes?
        popA = fields[3:26]   # population size: 2*23
        popB = fields[26:36]  # population size: 2*10
        for population in (popA, popB):
            summary = Counter(population)  # traverses the line only once - much more efficient!
            base_counts = [sum(summary[k] for k in summary.keys() if base in k)
                           for base in nucleotid_bases]
            for index, base_name in enumerate(nucleotid_bases):
                # Double the count when there is an exact match, e.g. "A/A" -> "A"
                # An 'in' match can match an item anywhere in the string: 'A' in 'A/C' evaluates to True
                base_counts[index] += summary[base_name]
            results[-1].extend(base_counts)  # append to the current record
results = np.array(results, dtype=float)  # shape is now (x, 12) with x the amount of lines read
results[:, 2:7] /= 46
results[:, 7:] /= 20
At this point, the layout of the results is two columns filled with the chrom (results[:,0]) and pos (results[:,1]) labels from the text file,
then 5 columns of population A, where the first of those 5 contains the relative frequency of the 'C' base, next
column of the 'A' base and so on (see nucleotid_bases for the order). Then, the last 5 columns are similar, but they are for population B:
chrom, pos, freqC_in_A,..., freqG_in_A, freq_dot_in_A freqC_in_B, ..., freqG_in_B, freq_dot_in_B
If you want to ignore records (rows) in this table where either of the unknowns-frequencies (columns 6 and 11) are above a threshold, you would do:
threshold = .1 # arbitrary: 10%
to_consider = np.logical_and(results[:,6] < threshold, results[:,11] < threshold)
table = results[to_consider][:, [0,1,2,3,4,5,7,8,9,10]]
Now you can compute the table of frequency differences with:
freq_diffs = np.abs(table[:,2:6] - table[:,-4:]) # 4 columns, n rows
mean_freq_diff = freq_diffs.mean(axis=0) # holds 4 numbers, these are the means over all the rows
std_freq_diff = freq_diffs.std(axis=0) # similar: std over all the rows
condition = freq_diffs > (mean_freq_diff + 2*std_freq_diff)
Now you'll want to check whether the condition held for any element of a row: e.g. if
the frequency difference for 'C' between popA and popB was .8 and the
(mean + 2*std) was .7, then it will return True. It will also return True
for the same row if the condition was fulfilled for any of the other
nucleotides. To check whether the condition was True for any of the nucleotide frequency differences, do this:
specials = np.any(condition, axis=1)
print(table[specials, :2])

How to cycle through the index of an array?

Line 14 is where my main problem is. I need to cycle through each item in the array and use its index to determine whether or not it is a multiple of four, so I can create proper spacing for binary numbers.
def decimalToBinary(hu):
    bits = []
    h = []
    while hu > 0:
        kla = hu%2
        bits.append(kla)
        hu = int(hu/2)
    for i in reversed(bits):
        h.append(i)
    if len(h) <= 4:
        print (''.join(map(str,h)))
    else:
        for j in range(len(h)):
            h.index(1) = h.index(1)+1
            if h.index % 4 != 0:
                print (''.join(map(str,h)))
            elif h.index % 4 == 0:
                print (' '.join(map(str,h)))

decimalToBinary( 23 )
If what you're looking for is the index of the list from range(len(h)) in the for loop, then you can change that line to for idx,j in enumerate(range(len(h))): where idx is the index of the range.
This line h.index(1) = h.index(1)+1 is incorrect. I modified your function so that at least it executes and generates an output, but whether it is correct, I don't know. Anyway, hope it helps:
def decimalToBinary(hu):
    bits = []
    h = []
    while hu > 0:
        kla = hu%2
        bits.append(kla)
        hu = int(hu/2)
    for i in reversed(bits):
        h.append(i)
    if len(h) <= 4:
        print (''.join(map(str,h)))
    else:
        for j in range(len(h)):
            h_index = h.index(1)+1  # use h_index variable instead of h.index(1)
            if h_index % 4 != 0:
                print (''.join(map(str,h)))
            elif h_index % 4 == 0:
                print (' '.join(map(str,h)))

decimalToBinary( 23 )
# get binary version to check your result against.
print(bin(23))
This results:
#outout from decimalToBinary
10111
10111
10111
10111
10111
#output from bin(23)
0b10111
You're trying to join the bits to string and separate them every 4 bits. You could modify your code with Marcin's correction (by replacing the syntax error line and do some other improvements), but I suggest doing it more "Pythonically".
Here's my version:
def decimalToBinary(hu):
    bits = []
    while hu > 0:
        kla = hu%2
        bits.append(kla)
        hu = int(hu/2)
    h = [''.join(map(str, bits[i:i+4])) for i in range(0, len(bits), 4)]
    bu = ' '.join(h)
    print bu[::-1]
Explanation for the h assignment line:
range(0, len(bits), 4): a list from 0 to the length of bits with step = 4, e.g. [0, 4, 8, ...]
[bits[i:i+4] for i in [0, 4, 8]]: a list of lists whose elements are groups of four elements from bits,
e.g. [ [1,0,1,0], [0,1,0,1], ...]
[''.join(map(str, bits[i:i+4])) for i in range(0, len(bits), 4)]: convert each inner list to a string
bu[::-1]: reverse the string
If you are learning Python, it's good to do it your way. As @roippi pointed out,
for index, value in enumerate(h):
will give you access to both index and value of member of h in each loop.
To group 4 digits, I would do like this:
def decimalToBinary(num):
    binary = str(bin(num))[2:][::-1]
    index = 0
    spaced = ''
    while index + 4 < len(binary):
        spaced += binary[index:index+4]+' '
        index += 4
    else:
        spaced += binary[index:]
    return spaced[::-1]

print decimalToBinary(23)
The result is:
1 0111
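As a further note (this one assumes Python 3.6+, unlike the print-statement versions above): the underscore option of the format spec already groups binary digits in fours from the right, so the spacing needs no index arithmetic at all. The function name here is mine:

```python
def decimal_to_binary_grouped(num):
    # '_b' formats num in binary with an underscore every 4 digits
    # (grouping from the right), then we swap underscores for spaces.
    return format(num, '_b').replace('_', ' ')

print(decimal_to_binary_grouped(23))   # → 1 0111
print(decimal_to_binary_grouped(255))  # → 1111 1111
```

The same '_' grouping works for hex and octal ('_x', '_o') as well.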

Printing in a loop

I have the following file I'm trying to manipulate.
1 2 -3 5 10 8.2
5 8 5 4 0 6
4 3 2 3 -2 15
-3 4 0 2 4 2.33
2 1 1 1 2.5 0
0 2 6 0 8 5
The file just contains numbers.
I'm trying to write a program to subtract the rows from each other and print the results to a file. My program is below; dtest.txt is the name of the input file, and the name of the program is make_distance.py.
from math import *

posnfile = open("dtest.txt","r")
posn = posnfile.readlines()
posnfile.close()

for i in range (len(posn)-1):
    for j in range (0,1):
        if (j == 0):
            Xp = float(posn[i].split()[0])
            Yp = float(posn[i].split()[1])
            Zp = float(posn[i].split()[2])
            Xc = float(posn[i+1].split()[0])
            Yc = float(posn[i+1].split()[1])
            Zc = float(posn[i+1].split()[2])
        else:
            Xp = float(posn[i].split()[3*j+1])
            Yp = float(posn[i].split()[3*j+2])
            Zp = float(posn[i].split()[3*j+3])
            Xc = float(posn[i+1].split()[3*j+1])
            Yc = float(posn[i+1].split()[3*j+2])
            Zc = float(posn[i+1].split()[3*j+3])
        Px = fabs(Xc-Xp)
        Py = fabs(Yc-Yp)
        Pz = fabs(Zc-Zp)
        print Px,Py,Pz
The program is calculating the values correctly but, when I try to call the program to write the output file,
mpipython make_distance.py > distance.dat
The output file (distance.dat) only contains 3 columns when it should contain 6. How do I tell the program to shift what columns to print to for each step j=0,1,....
For j = 0, the program should output to the first 3 columns, for j = 1 the program should output to the second 3 columns (3,4,5) and so on and so forth.
Finally the len function gives the number of rows in the input file but, what function gives the number of columns in the file?
Thanks.
Append a , to the end of your print statement and it will not print a newline; then, when you exit the inner for loop, add an additional print to move to the next row:
for j in range (0,1):
    ...
    print Px,Py,Pz,
print
Assuming all rows have the same number of columns, you can get the number of columns by using len(row.split()).
Also, you can definitely shorten your code quite a bit, I'm not sure what the purpose of j is, but the following should be equivalent to what you're doing now:
for j in range (0,1):
    Xp, Yp, Zp = map(float, posn[i].split()[3*j:3*j+3])
    Xc, Yc, Zc = map(float, posn[i+1].split()[3*j:3*j+3])
    ...
You don't need to:
use numpy
read the whole file in at once
know how many columns
use awkward comma at end of print statement
use list subscripting
use math.fabs()
explicitly close your file
Try this (untested):
with open("dtest.txt", "r") as posnfile:
    previous = None
    for line in posnfile:
        current = [float(x) for x in line.split()]
        if previous:
            delta = [abs(c - p) for c, p in zip(current, previous)]
            print ' '.join(str(d) for d in delta)
        previous = current
Just in case your dtest.txt grows larger and you don't want to redirect your output but rather write to distance.dat directly, especially if you want to use numpy. Thanks @John for pointing out my mistake in the old code ;-)
import numpy as np
pos = np.genfromtxt("dtest.txt")
dis = np.array([np.abs(pos[j+1] - pos[j]) for j in xrange(len(pos)-1)])
np.savetxt("distance.dat",dis)
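The list comprehension over row pairs can be collapsed further with np.diff, which takes consecutive row differences in one vectorized call. A sketch with toy data standing in for dtest.txt:

```python
import numpy as np

# toy data standing in for np.genfromtxt("dtest.txt")
pos = np.array([[1, 2, -3], [5, 8, 5], [4, 3, 2]], dtype=float)

dis = np.abs(np.diff(pos, axis=0))  # |row[j+1] - row[j]| for every consecutive row pair
print(dis)
```

np.savetxt("distance.dat", dis) then writes the result out exactly as in the snippet above.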
