My problem:
I am trying to compare two elements from two different arrays, but the == comparison never succeeds.
Code Snippet in question:
for i in range(row_length):
    print(f"ss_record: {ss_record[i]}")
    print(f"row: {row[i + 1]}")
    # THIS IF STATEMENT IS NOT WORKING
    if ss_record[i] == row[i + 1]:
        count += 1
    #print()
    #print(f"row length: {row_length}")
    #print(f"count: {count}")
if count == row_length:
    print(row[0])
    exit(0)
What I have done: I tried printing the values of ss_record and row before they run through the if statement, but even when they match, count doesn't increase. I also tried storing the value of row in a new array, but it bugs out: it only stores the array length and the first two values of row, and repeats those values on every following iteration.
What I think the issue is: row is being read from a CSV file and is never converted to an integer, so although the two values appear the same when printed, one is an integer while the other is a string.
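One quick way to check that hypothesis is to print the types alongside the values (a minimal diagnostic sketch, not part of the original code):
# diagnostic: show the two types being compared
print(type(ss_record[i]), type(row[i + 1]))  # e.g. <class 'int'> <class 'str'>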
Entire Code:
import csv
import sys
import re
from cs50 import get_string
from sys import argv
def main():
    line_count = 0
    if len(argv) != 3:
        print("missing command-line argument")
        exit(1)
    with open(sys.argv[1], 'r') as database:
        sequence = open(sys.argv[2], 'r')
        string = sequence.read()
        reader = csv.reader(database, delimiter=',')
        for row in reader:
            if line_count == 0:
                row_length = len(row) - 1
                ss_record = [row_length]
                for i in range(row_length):
                    ss_record.append(ss_count(string, row[i + 1], len(row[i + 1])))
                ss_record.pop(0)
                line_count = 1
            else:
                count = 0
                for i in range(row_length):
                    print(f"ss_record: {ss_record[i]}")
                    print(f"row: {row[i + 1]}")
                    # THIS IF STATEMENT IS NOT WORKING
                    if ss_record[i] == row[i + 1]:
                        count += 1
                if count == row_length:
                    print(row[0])
                    exit(0)
# ss_count returns the longest run of consecutive times the substring appears in the string
def ss_count(string, substring, length):
    count = 1
    record = 0
    pos_array = []
    for m in re.finditer(substring, string):
        pos_array.append(m.start())
    for i in range(len(pos_array) - 1):
        if pos_array[i + 1] - pos_array[i] == length:
            count += 1
        else:
            if count > record:
                record = count
            count = 1
    if count > record:
        record = count
    return record

main()
Values to use to reproduce issue:
sequence (this is a text file) = AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
database (this is a CSV file) =
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
Gist of the CSV file:
The numbers beside Alice mean how many times a substring (STR / Short Tandem Repeat) appears in a row in the string (DNA sequence). In this string, AGATC appears 4 times in a row, AATG appears 1 time in a row, and TATC appears 5 times in a row. This DNA sequence therefore matches Bob, and he is output as the answer.
You were right: when you compare ss_record[i] == row[i + 1] there is a type problem; the numbers in ss_record are integers while the numbers in row are strings. You can confirm the issue by printing both ss_record and row:
print("ss_record: {}".format(ss_record)) -> ss_record: [4, 1, 5]
print("row: {}".format(row)) -> row: ['Alice', '2', '8', '3']
In order for the snippet to work you just need to change the comparison to
ss_record[i] == int(row[i + 1])
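Alternatively, you could convert the whole row once and compare the lists directly (a sketch using a hypothetical int_row variable):
int_row = [int(x) for x in row[1:]]  # convert every STR count up front
if ss_record == int_row:
    print(row[0])
    exit(0)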
That said, I feel the code is quite complex for the task. The string class implements a count method that returns the number of non-overlapping occurrences of a given substring. Also, since the code works item by item and relies heavily on index manipulation, the iteration logic is hard to follow (IMO). Here's my approach to the problem:
import csv

def match_user(dna_file, user_csv):
    with open(dna_file, 'r') as r:
        dna_seq = r.readline()
    with open(user_csv, 'r') as r:
        reader = csv.reader(r)
        rows = list(reader)
    target_substrings = rows[0][1:]
    users = rows[1:]
    num_matches = [dna_seq.count(target) for target in target_substrings]
    for user in users:
        user_matches = [int(x) for x in user[1:]]
        if user_matches == num_matches:
            return user[0]
    return "Not found"
Happy Coding!
I've been trying to figure this out for about a year now and I'm really burnt out on it, so please excuse me if this explanation is a bit rough.
I cannot include job data, but it would be accurate to imagine 2 csv files, both with the first column populated with values (serial numbers/phone numbers/names, doesn't matter - just values). Between both csv files, some values would match while other values would only be contained in one or the other (Timmy is in both files and is a match; Robert is only in file 1 and does not match any name in file 2).
I can successfully output a csv value ONCE if it exists in both csv files (i.e. both files contain "Value78"; the output file will contain "Value78" only once).
When I try to tack an else statement onto my if condition to handle non-matching items, the program outputs 1 entry for every item it does not match with (which makes 100% sense: a match happens once, but every other comparison result besides the match is a non-match).
I cannot envision a structure or method to hold back the fields that don't match so that they can be output once and not overrun my terminal or output file.
My goal is to output two csv files, matches and non-matches, with the non-matches having only one entry per value.
Anyways, onto the code:
import csv

MYUNITS = 'MyUnits.csv'
VENDORUNITS = 'VendorUnits.csv'
MATCHES = 'Matches.csv'
NONMATCHES = 'NonMatches.csv'

with open(MYUNITS, mode='r') as MFile, \
     open(VENDORUNITS, mode='r') as VFile, \
     open(MATCHES, mode='w') as OFile, \
     open(NONMATCHES, mode='w') as NFile:
    MyReader = csv.reader(MFile, delimiter=',', quotechar='"')
    MyList = list(MyReader)
    VendorReader = csv.reader(VFile, delimiter=',', quotechar='"')
    VList = list(VendorReader)
    for x in range(len(MyList)):
        for y in range(len(VList)):
            if str(MyList[x][0]) == str(VList[y][0]):
                OFile.write(MyList[x][0] + '\n')
            else:
                pass
The "else: pass" is where the logic of filtering out non-matches is escaping me. Outputting from this else statement will write the non-matching value (len(VList) - 1) times for an iteration that DOES produce 1 match, the entire len(VList) for an iteration with no match. I've tried using a counter and only outputting if the counter equals the len(VList), (incrementing in the else statement, writing output under the scope of the second for loop), but received the same output as if I tried outputting non-matches.
Below is one way you might go about deduplicating and then writing to a file:
import csv

MYUNITS = 'MyUnits.csv'
VENDORUNITS = 'VendorUnits.csv'
MATCHES = 'Matches.csv'
NONMATCHES = 'NonMatches.csv'

list_of_non_matches = []

with open(MYUNITS, mode='r') as MFile, \
     open(VENDORUNITS, mode='r') as VFile, \
     open(MATCHES, mode='w') as OFile, \
     open(NONMATCHES, mode='w') as NFile:
    MyReader = csv.reader(MFile, delimiter=',', quotechar='"')
    MyList = list(MyReader)
    VendorReader = csv.reader(VFile, delimiter=',', quotechar='"')
    VList = list(VendorReader)
    for x in range(len(MyList)):
        for y in range(len(VList)):
            if str(MyList[x][0]) == str(VList[y][0]):
                OFile.write(MyList[x][0] + '\n')
            else:
                list_of_non_matches.append(MyList[x][0])
    # Remove duplicates from the non matches, preserving order
    new_list = []
    for x in list_of_non_matches:
        if x not in new_list:
            new_list.append(x)
    # Write the new list to a file
    for i in new_list:
        NFile.write(i + '\n')
Does this work?
import csv

MYUNITS = 'MyUnits.csv'
VENDORUNITS = 'VendorUnits.csv'
MATCHES = 'Matches.csv'
NONMATCHES = 'NonMatches.csv'

with open(MYUNITS, 'r') as MFile, \
     open(VENDORUNITS, 'r') as VFile, \
     open(MATCHES, 'w') as OFile, \
     open(NONMATCHES, 'w') as NFile:
    MyReader = csv.reader(MFile, delimiter=',', quotechar='"')
    MyList = list(MyReader)
    MyVals = [x[0] for x in MyList]
    VendorReader = csv.reader(VFile, delimiter=',', quotechar='"')
    VList = list(VendorReader)
    vVals = [x[0] for x in VList]
    for val in MyVals:
        if val in vVals:
            OFile.write(val + '\n')
        else:
            NFile.write(val + '\n')
    #for x in range(len(MyList)):
    #    for y in range(len(VList)):
    #        if str(MyList[x][0]) == str(VList[y][0]):
    #            OFile.write(MyList[x][0] + '\n')
    #        else:
    #            pass
Sorry, I had some issues with my PC. I was able to solve my own question the night I posted. The solution I used is so simple I'm kicking myself for not figuring it out way sooner:
import csv

MYUNITS = 'MyUnits.csv'
VENDORUNITS = 'VendorUnits.csv'
MATCHES = 'Matches.csv'
NONMATCHES = 'NonMatches.csv'

with open(MYUNITS, mode='r') as MFile, \
     open(VENDORUNITS, mode='r') as VFile, \
     open(MATCHES, mode='w') as OFile, \
     open(NONMATCHES, mode='w') as NFile:
    MyReader = csv.reader(MFile, delimiter=',', quotechar='"')
    MyList = list(MyReader)
    VendorReader = csv.reader(VFile, delimiter=',', quotechar='"')
    VList = list(VendorReader)
    for x in range(len(MyList)):
        tmp = ''
        for y in range(len(VList)):
            if str(MyList[x][0]) == str(VList[y][0]):
                tmp = ''  # Set to blank so the comparison below fails; works because of the break
                OFile.write(MyList[x][0] + '\n')
                break
            else:
                tmp = str(MyList[x][0])
        if tmp != '':
            NFile.write(tmp + '\n')
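As a footnote, the same match/non-match split can be written with sets, which deduplicate for free; a sketch assuming the same MyList/VList structure (note it sorts the output rather than preserving file order):
my_vals = {row[0] for row in MyList}
vendor_vals = {row[0] for row in VList}
for val in sorted(my_vals & vendor_vals):  # values present in both files
    OFile.write(val + '\n')
for val in sorted(my_vals - vendor_vals):  # non-matches, one entry per value
    NFile.write(val + '\n')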
I'm working on pset6, the DNA problem. This code works for small.csv, but when I try the large one it overestimates the STR counts. I guess the problem is in how it compares strings, but I still don't know how to fix it. I checked that the counting is correct for the "TTTTTTCT" sequence, but for the remaining STRs the count is in all cases larger than it should be.
import sys
import csv

def main():
    while len(sys.argv) != 3:
        print("ERROR. Usage: python dna.py data.csv sequence.txt")
        break
    list_str = {}
    # load the STRs to analyse
    with open(sys.argv[1]) as csvfile:
        readcsv = csv.reader(csvfile)
        ncol = len(next(readcsv))
        csvfile.seek(0)
        header = list()
        for line in readcsv:
            a = sum(1 for line in readcsv)
            for i in range(ncol):
                list_str[line[i]] = 0
                header.insert(i, line[i])
                print(f"{header[i]}")
    # open and work with the sequence file
    sequence = open(sys.argv[2], 'r')
    seq_r = sequence.read()
    for k in list_str.keys():
        #print(f"key {k}")
        p = 0
        seq = len(seq_r)
        while p < seq:
            if seq_r[p:(p + len(k))] == k:
                list_str[k] += 1
                p += len(k)
            else:
                p += 1
        #print(f"sequence found {list_str[k]} and {k}")
        print(f"number of {k}: {list_str[k]}")
    with open(sys.argv[1]) as csvfile:
        readcsv = csv.reader(csvfile)
        next(csvfile)
        find = False
        for row in readcsv:
            for j in range(1, ncol):
                #print(f"header: {header[j]}")
                if int(row[j]) == int(list_str[header[j]]):
                    print(f"row {row[j]} list {list_str[header[j]]}")
                    find = True
                else:
                    find = False
                    break
            if find == True:
                print(f"{row[0]}")

main()
The same thing happened to me, and then I saw the specification of the pset.
We need to find the "longest run of consecutive repeats of the STR", not the total count of STRs. It happens to work for small.csv, as in my case too, so try searching for the longest consecutive occurrence of each specific STR.
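A minimal sketch of that idea (longest_run is a hypothetical helper name): at each position, count how many back-to-back copies of the STR start there, and keep the maximum:
def longest_run(sequence, str_):
    # longest run of consecutive repeats of str_ in sequence
    longest = 0
    for i in range(len(sequence)):
        run = 0
        while sequence[i + run * len(str_):i + (run + 1) * len(str_)] == str_:
            run += 1
        longest = max(longest, run)
    return longest

print(longest_run("AGATCAGATCAGATCAGATCTT", "AGATC"))  # 4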
I have to read in a file line by line, where each line lists the indices at which a vector has 1's,
so for example:
1 3 9 10
means:
0,1,0,1,0,0,0,0,0,1,1
My goal is to write program that will take each line and print out the full vector with the 0's.
I am able to do this with my current program for a few lines:
#create a sparse vector
list_line_sparse = [0] * int(num_features)
#loop over all the lines
for item in lines:
    #split the line on spaces
    zz = item.split(' ')
    #get all ints on a line
    d = [int(x.strip()) for x in zz]
    #loop over all ints and change index to 1 in sparse vector
    for i in d:
        list_line_sparse[i] = 1
    out_file += (', '.join(str(item) for item in list_line_sparse))
    #change back to 0's
    for i in d:
        list_line_sparse[i] = 0
    out_file += '\n'
f = open('outfile', 'w')
f.write(out_file)
f.close()
The problem is that for a file with a lot of features and lines, my program is very, very inefficient - it basically never finishes. Is there anything that sticks out that I should change to make it more efficient (i.e. the 2 for loops)?
It would probably be more efficient to write each line of data to your output file as it is generated, rather than building up a huge string in memory.
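A minimal sketch of that idea, reusing the question's num_features and lines names:
# Write each formatted line as soon as it is built, instead of
# accumulating one huge out_file string in memory.
with open('outfile', 'w') as f:
    for item in lines:
        list_line_sparse = [0] * int(num_features)
        for i in (int(x) for x in item.split()):
            list_line_sparse[i] = 1
        f.write(', '.join(str(v) for v in list_line_sparse) + '\n')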
numpy is a popular Python module that's good for doing bulk operations on numbers. If you start with:
import numpy as np
list_line_sparse = np.zeros(num_features, dtype=np.uint8)
Then, given d as the list of numbers on the current line, you can simply do:
list_line_sparse[d] = 1
to set ALL of those indexes in the array at the same time, no loop required. (At the Python level at least, obviously there's still a loop involved, but it's down in the C implementation of numpy).
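Putting that together, an end-to-end sketch might look like this (the file names and num_features value are assumptions, not from the original post):
import numpy as np

num_features = 12  # assumed feature count

with open('input.txt') as f_in, open('output.txt', 'w') as f_out:
    for line in f_in:
        row = np.zeros(num_features, dtype=np.uint8)
        indexes = [int(x) for x in line.split()]
        row[indexes] = 1  # fancy indexing sets all listed positions at once
        f_out.write(','.join(map(str, row)) + '\n')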
It is slowing down because you are doing string concatenation; it is better to work with lists.
Also, you could use csv to read your space-separated lines in, and then write each row with the commas added automatically:
import csv

num_features = 20

with open('input.txt', 'r', newline='') as f_input, open('output.txt', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input, delimiter=' ')
    csv_output = csv.writer(f_output)
    for row in csv_input:
        list_line_sparse = [0] * int(num_features)
        for v in map(int, row):
            list_line_sparse[v] = 1
        csv_output.writerow(list_line_sparse)
So if input.txt contained the following:
1 3 9 10
1 3 9 11
2 7 3 5
Giving you an output.txt containing:
0,1,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0
0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0
0,0,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0
Too many loops: first the item.split(), then the for x in zz, then for i in d, then for item in list_line_sparse, and then for i in d again. String concatenation could be your most expensive part: the .join and the out_file +=. And all of this for every line.
You could try a "character by character" parse-and-write approach. Something like this:
#features per line
count = int(num_features)
f = open('outfile.txt', 'w')
#loop over all lines
for item in lines:
    #reset the feature
    i = 0
    #the characters buffer
    index = ""
    #parse character by character
    for character in item:
        #if a space or end of line is found,
        #and the characters buffer (index) is not empty
        if character in (" ", "\r", "\n"):
            if index:
                #parse the characters buffer
                index = int(index)
                #if it is not the first feature
                if i > 0:
                    #add the separator
                    f.write(", ")
                #add 0's until index
                while i < index:
                    f.write("0, ")
                    i += 1
                #and write 1
                f.write("1")
                i += 1
                #reset the characters buffer
                index = ""
        #if it is not a space or end of line
        else:
            #add the character to the buffer
            index += character
    #if the last line didn't end with a carriage return,
    #index could be waiting to be parsed
    if index:
        index = int(index)
        if i > 0:
            f.write(", ")
        while i < index:
            f.write("0, ")
            i += 1
        f.write("1")
        i += 1
        index = ""
    #fill with 0's
    while i < count:
        if i == 0:
            f.write("0")
        else:
            f.write(", 0")
        i += 1
    f.write("\n")
f.close()
Let's rework your code into a simpler package that takes better advantage of Python's features:
import sys

NUM_FEATURES = 12

with open(sys.argv[1]) as source, open(sys.argv[2], 'w') as sink:
    for line in source:
        list_line_sparse = [0] * NUM_FEATURES
        indices = map(int, line.rstrip().split())
        for index in indices:
            list_line_sparse[index] = 1
        print(*list_line_sparse, file=sink, sep=',')
I revisited this problem with your "more efficiently" in mind. Although the above is more memory efficient, it is a hair slower time-wise. I reconsidered your original and came up with a solution that is less memory efficient but about 2x faster than your code:
import sys

NUM_FEATURES = 12

data = ''

with open(sys.argv[1]) as source:
    for line in source:
        list_line_sparse = ["0"] * NUM_FEATURES
        indices = map(int, line.rstrip().split())
        for index in indices:
            list_line_sparse[index] = "1"
        data += ",".join(list_line_sparse) + '\n'

with open(sys.argv[2], 'w') as sink:
    sink.write(data)
Like your original solution, it stores all the data in memory and writes it out at the end which is both a disadvantage (memory-wise) and an advantage (time-wise.)
input.txt
1 3 9 10
1 3 9 11
2 7 3 5
USAGE
% python3 test.py input.txt output.txt
output.txt
0,1,0,1,0,0,0,0,0,1,1,0
0,1,0,1,0,0,0,0,0,1,0,1
0,0,1,1,0,1,0,1,0,0,0,0
I have a program that reads a .csv file and checks for any mismatch in column length (by comparing each row to the header fields), then returns everything it found as a list (and writes it to a file). What I want to do with this list is to report the results as follows:
row numbers where the same mismatch is found : the amount of columns in that row
e.g.
rows: n-m : y
where n and m are the numbers of the rows which share the same mismatching column count.
I have looked into these topics, and while the information is useful, they do not answer the question:
Find and list duplicates in a list?
Identify duplicate values in a list in Python
This is where I am right now:
r = csv.reader(data, delimiter='\t')
columns = []
for row in r:
    # add the column length to a list
    colm = len(row)
    columns.append(colm)
b = len(columns)
for a in range(b):
    # check whether the current member matches the header length
    if columns[a] != columns[0]:
        # if it doesn't, write the row and the amount of columns in that row to a file
        file.write("row " + str(a + 1) + ": " + str(columns[a]) + " \n")
the file output looks like this:
row 7220: 0
row 7221: 0
row 7222: 0
row 7223: 0
row 7224: 0
row 7225: 1
row 7226: 1
when the desired end result is
rows 7220 - 7224 : 0
rows 7225 - 7226 : 1
So what I essentially need, the way I see it, is a dictionary where the key identifies the run of rows sharing the same duplicate value and the value is the column count of that mismatch. What I think I need (in rough pseudocode that, rereading it years later, admittedly doesn't quite hold together) is something like this:
def pseudoList():
    i = 1
    ListOfLists = []
    while i < len(originalList):
        duplicateList = []
        if originalList[i] == originalList[i - 1]:
            duplicateList.append(originalList[i])
        i += 1
        ListOfLists.append(duplicateList)

def PseudocreateDict(ListOfLists):
    pseudoDict = {}
    for x in ListOfLists:
        a = ListOfLists[x][0]  # the first node in the uniqueList created
        i = len(ListOfLists) - 1
        b = ListOfLists[x][i]  # the last node of the uniqueList created
        pseudoDict['key'] = '{} - {}'.format(a, b)
This, however, seems like a very convoluted way of doing what I want, so I was wondering if there is a) a more efficient way or b) an easier way to do this?
You can use a list comprehension to return a list of the elements in the columns list that differ from their preceding element; these are the start points of your ranges. Then enumerate these ranges and print/write out those whose value differs from the first (header) element. An extra element is appended to the list of ranges to mark the end of the list and avoid out-of-range indexing.
columns = [2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 2, 1]
ranges = [[i + 1, v] for i, v in enumerate(columns[1:]) if columns[i] != columns[i + 1]]
ranges.append([len(columns), 0])  # special case for the last element
for i, v in enumerate(ranges[:-1]):
    if v[1] != columns[0]:
        print("rows", v[0] + 1, "-", ranges[i + 1][0], ":", v[1])
output:
rows 2 - 5 : 1
rows 6 - 9 : 0
rows 10 - 11 : 1
rows 13 - 13 : 1
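For what it's worth, itertools.groupby expresses the same grouping without the manual boundary bookkeeping (a sketch, not part of the original answer):
from itertools import groupby

columns = [2, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 2, 1]

row = 2  # row 1 is the header
for value, group in groupby(columns[1:]):
    n = len(list(group))
    if value != columns[0]:
        print("rows", row, "-", row + n - 1, ":", value)
    row += n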
You can also try the following code -
b = len(columns)
check = 0
for a in range(b):
    # check whether the current member matches the header length
    if check != 0 and columns[a] == check:
        continue
    elif check != 0 and columns[a] != check:
        check = 0
        if start != a:
            file.write("row " + str(start) + " - " + str(a) + ": " + str(columns[a]) + " \n")
        else:
            file.write("row " + str(start) + ": " + str(columns[a]) + " \n")
    if columns[a] != columns[0]:
        # if it doesn't match the header, remember where this mismatch run starts
        start = a + 1
        check = columns[a]
What you want to do is a map/reduce operation, but without the sorting that is normally done between the mapping and the reducing.
If you output
row 7220: 0
row 7221: 0
row 7222: 0
row 7223: 0
to stdout, you can pipe this data to another python program that generates the groups you want.
The second python program could look something like this:
import sys
import re

line = sys.stdin.readline()
last_rowid, last_diff = re.findall(r'(\d+)', line)
for line in sys.stdin:
    rowid, diff = re.findall(r'(\d+)', line)
    if diff != last_diff:
        print("rows", last_rowid, rowid, last_diff)
        last_diff = diff
        last_rowid = rowid
print("rows", last_rowid, rowid, last_diff)
You would execute them like this in a unix environment to get the output into a file:
python yourprogram.py | python myprogram.py > youroutputfile.dat
If you cannot run this in a unix environment, you can still use the algorithm I wrote inside your own program with a few modifications.
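Inlined, the same grouping algorithm might look like this (a sketch reusing the columns list and file handle from the question; it assumes at least one data row):
start, last_diff = 2, columns[1]  # data rows begin at row 2
for rownum, diff in enumerate(columns[2:], start=3):
    if diff != last_diff:
        if last_diff != columns[0]:
            file.write("rows {} - {} : {}\n".format(start, rownum - 1, last_diff))
        start, last_diff = rownum, diff
if last_diff != columns[0]:  # flush the final group
    file.write("rows {} - {} : {}\n".format(start, len(columns), last_diff))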
How can you get the nth line of a string in Python 3?
For example
getline("line1\nline2\nline3",3)
Is there any way to do this using stdlib/builtin functions?
I prefer a solution in Python 3, but Python 2 is also fine.
Try the following:
s = "line1\nline2\nline3"
print(s.splitlines()[2])
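If the requested line number can be out of range, a guarded variant might look like this (getline here is a hypothetical helper, 1-indexed like the question's example):
def getline(s, n):
    lines = s.splitlines()
    return lines[n - 1] if 1 <= n <= len(lines) else ''

print(getline("line1\nline2\nline3", 3))  # line3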
A functional approach:
>>> import StringIO
>>> from itertools import islice
>>> s = "line1\nline2\nline3"
>>> gen = StringIO.StringIO(s)
>>> print next(islice(gen, 2, 3))
line3
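Since the question prefers Python 3, the same approach ports with io.StringIO and the print() function (a sketch):
import io
from itertools import islice

s = "line1\nline2\nline3"
print(next(islice(io.StringIO(s), 2, 3)).rstrip('\n'))  # line3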
`my_string.strip().split("\n")[-1]`
Use a string buffer:
import io

def getLine(data, line_no):
    buffer = io.StringIO(data)
    for i in range(line_no - 1):
        try:
            next(buffer)
        except StopIteration:
            return ''  # Reached EOF
    try:
        return next(buffer)
    except StopIteration:
        return ''  # Reached EOF
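Usage might look like this (note each returned line keeps its trailing newline, except possibly the last):
data = "line1\nline2\nline3"
getLine(data, 2)   # 'line2\n'
getLine(data, 99)  # '' (past the end)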
A more efficient solution than splitting the string would be to iterate over its characters, finding the positions of the Nth and the (N - 1)th occurrences of '\n' (taking into account the edge case at the start of the string). The Nth line is the substring between those positions.
Here's a messy piece of code to demonstrate it (the line number is 1-indexed):
def getLine(data, line_no):
    n = 0
    lastPos = -1
    for i in range(0, len(data) - 1):
        if data[i] == "\n":
            n = n + 1
            if n == line_no:
                return data[lastPos + 1:i]
            else:
                lastPos = i
    if n == line_no - 1:
        return data[lastPos + 1:]
    return ""  # end of string
This is also more efficient than the solution which builds up the string one character at a time.
From the comments it seems as if this string is very large.
If there is too much data to comfortably fit into memory one approach is to process the data from the file line-by-line with this:
N = ...
with open('data.txt') as inf:
    for count, line in enumerate(inf, 1):
        if count == N:  # search for the N'th line
            print(line)
Using enumerate() gives you both the index and the value of the object you are iterating over, and you can specify a starting value, so I used 1 (instead of the default value of 0).
The advantage of using with is that it automatically closes the file for you when you are done or if you encounter an exception.
Since you brought up the point of memory efficiency, is this any better?
s = "line1\nline2\nline3"
# number of the line you want
line_number = 2
i = 0
line = ''
for c in s:
if i > line_number:
break
else:
if i == line_number-1 and c != '\n':
line += c
elif c == '\n':
i += 1
Written as two functions for readability:
string = "foo\nbar\nbaz\nfubar\nsnafu\n"
def iterlines(string):
word = ""
for letter in string:
if letter == '\n':
yield word
word = ""
continue
word += letter
def getline(string, line_number):
for index, word in enumerate(iterlines(string),1):
if index == line_number:
#print(word)
return word
print(getline(string, 4))
My solution (efficient and compact):
def getLine(data, line_no):
    index = -1
    for _ in range(line_no):
        index = data.index('\n', index + 1)
    return data[index + 1:data.index('\n', index + 1)]
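Note that this version is 0-indexed and assumes the requested line ends with '\n'; asking for the last line of a string without a trailing newline raises ValueError. A guarded variant (get_line_safe is a hypothetical name):
def get_line_safe(data, line_no):
    # 0-indexed; tolerate a missing trailing newline by appending one
    if not data.endswith('\n'):
        data += '\n'
    index = -1
    for _ in range(line_no):
        index = data.index('\n', index + 1)
    return data[index + 1:data.index('\n', index + 1)]

print(get_line_safe("line1\nline2\nline3", 2))  # line3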