CS50 DNA: Selecting the STRs with next() - python

I'm working on Problem Set 6: DNA. My approach is to save the different types of STR sequences as "all_sequences", then find the maximum number of repeats for each sequence in "all_sequences".
My question is: why does next() ensure that I only select the first row of the CSV? I understand that [1:] removes the name column, but how does next() guarantee that only the first row is selected?
f = sys.argv[1]  # name of CSV file
t = sys.argv[2]  # name of text file with dna sequence

# Open the CSV file and read its contents into memory.
with open(f, "r") as database:
    index = csv.reader(database)
    all_sequences = next(index)[1:]

# Open the DNA sequence and read its contents into memory.
with open(t, "r") as dnaseq:
    s = dnaseq.read()

actual = [maxrepeats(s, seq) for seq in all_sequences]
print_match(index, actual)

In your example, index is a csv.reader object, which is an iterator. next(index) yields the next element of the iterator (here, a list of the fields in one row). The list is then sliced to omit the first value.
It may look strange that next is used only once, because a single call simply yields the first row of the index iterator. It starts making sense when next is called more often: each call advances the iterator and returns the following row.
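For illustration, here is a small sketch (assuming a hypothetical example.csv with a header row followed by data rows) showing how repeated calls to next() walk through the rows:

import csv

with open("example.csv", "r") as f:   # hypothetical file
    reader = csv.reader(f)
    header = next(reader)    # first call: the header row, e.g. ['name', 'AGATC', 'AATG']
    first = next(reader)     # second call: the first data row
    # a later for-loop continues from wherever next() left off
    for row in reader:
        print(row)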

Related

python keeping track of set of values on disk, is there a better approach? (no pandas)

I am parsing thousands of logging documents and I need to keep track of the collection of unique user_ids that appear thousands of times in those files.
What I thought of is keeping a text file containing that list of user_ids.
The text is read into a list, the new values are merged in, the set is extracted, and the list is saved back to the file:
from pathlib import Path

def add_value_to_csv_text_set_file(file,
                                   values,  # list of strings
                                   raise_error=True,
                                   verbose=False,
                                   DEBUG=False):
    # check if file exists, otherwise create an empty file
    file = Path(file)
    if not file.is_file():
        file.write_text("")
    # read the contents
    with open(file, 'r') as f:
        registered_values = f.read().split(',')
    registered_values = [value for value in registered_values if value != '']
    if DEBUG: print('now: ', registered_values)
    set_of_values = set([str(value) for value in values])
    new_values = [value for value in set_of_values if value not in registered_values]
    if DEBUG: print("to_add", new_values)
    new_text = ','.join(sorted(registered_values + new_values))
    with open(file, 'w') as f:
        f.write(new_text)
Somehow this does not seem very efficient. For one, I read the whole text into memory; second, I use set(list), which I think is not very fast; third, I convert lists back and forth into text; and fourth, I check every single time whether the file exists and also filter out empty elements (because an empty element is created the first time, i.e. by file.write_text("")).
Would someone point to a better and more pythonic solution?
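One possible simplification, as a rough sketch rather than a definitive answer (it assumes the values can be stored one per line instead of comma-separated), is to let a set handle the de-duplication and membership checks directly:

from pathlib import Path

def add_values(file, values):
    # merge new values into a newline-separated file of unique values (assumed layout)
    path = Path(file)
    existing = set(path.read_text().split()) if path.is_file() else set()
    merged = existing | {str(v) for v in values}  # set union removes duplicates
    path.write_text('\n'.join(sorted(merged)))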

Reading an nth line of a textfile in python determined from a list

I have a function gen_rand_index that generates a random group of numbers in list format, such as [3,1] or [3,2,1].
I also have a text file that reads something like this:
red $1
green $5
blue $6
How do I write a function so that once Python generates this list of numbers, it automatically reads those line numbers in the text file? So if it generated [2,1], instead of printing [2,1] I would get "green $5, red $1", i.e. the second line of the text file followed by the first line.
I know you can do print(line[2]) and commands like that, but this won't work in my case because each time I get a different random line number to read; it is not a fixed line I want to read each time.
row = str(result[gen_rand_index])  # result[gen_rand_index] gives me the random list of numbers
file = open("Foodinventory.txt", 'r')
for line in file:
    print(line[row])
file.close()
I have this so far, but I am getting this error:
invalid literal for int() with base 10: '[4, 1]'
I have also gotten:
TypeError: string indices must be integers
but I have tried replacing str with int and many things like that, and I'm thinking the way I'm approaching this is just wrong. Can anyone help me? (I have only been coding for a couple of days now, so I apologize in advance if this question is really basic.)
Okay, let us first get some stuff out of the way.
Whenever you access something from a list, the thing you put inside the square brackets [] should be an integer, e.g. [5]. This tells Python that you want the 5th element. It cannot be ["5"], because "5" in that case is treated as a string.
Therefore the line row = str(result[gen_rand_index]) should actually just be row = ... without the call to str. This is why you got the TypeError complaining that indices must be integers.
Secondly, as per your description, gen_rand_index returns a list of numbers.
So going by that, why don't you try this:
indices_to_pull = gen_rand_index()
file_handle = open("Foodinventory.txt", 'r')
file_contents = file_handle.readlines()  # if the file is small and simple this works fine
answer = []
for index in indices_to_pull:
    answer.append(file_contents[index - 1])
Explanation
We get the indices of the file lines from gen_rand_index.
We read the entire file into memory using readlines().
Then we collect the lines we want. Remember to subtract 1, as the list is indexed from 0.
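To produce output in the form the question describes ("green $5, red $1"), the collected lines can then be joined; a small illustrative addition on top of the snippet above:

print(', '.join(line.strip() for line in answer))  # e.g. "green $5, red $1"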
The error you are getting is because you're trying to index a string variable (line) with a string index (row). Presumably row will contain something like '[2,3,1]'.
However, even if row was a numerical index, you're not indexing what you think you're indexing. The variable line is a string, and it contains (on any given iteration) one line of the file. Indexing this variable will give you a single character. For example, if line contains green $5, then line[2] will yield 'e'.
It looks like your intent is to index into a list of strings, which represent all the lines of the file.
If your file is not overly large, you can read the entire file into a list of lines, and then just index that array:
with open('file.txt') as fp:
    lines = fp.readlines()
print(lines[2])
In this case, lines[2] will yield the string 'blue $6\n'.
To discard the trailing newline, use lines[2].strip() instead.
I'll go line by line and raise some issues.
row = str(result[gen_rand_index]) #result[gen_rand_index] gives me the random list of numbers
Are you sure it is gen_rand_index and not gen_rand_index()? If gen_rand_index is a function, you should call it. In the code you have, you are not calling the function; instead you are using the function itself as an index.
file = open("Foodinventory.txt", 'r')
for line in file:
    print(line[row])
file.close()
The correct Python idiom for opening a file and reading it line by line is:
with open("Foodinventory.txt", "r") as f:
    for line in f:
        ...
This way you do not have to close the file; the with clause does this for you automatically.
Now, what you want to do is print the lines of the file that correspond to the elements in your variable row. So what you need is an if statement that checks whether the line number you just read from the file corresponds to a line number in your list row.
with open("Foodinventory.txt", "r") as f:
for i, line in enumerate(f):
if i == row[i]:
print(line)
But this is wrong: it would work only if your list's elements are ordered. That is not the case in your question. So let's think a little bit. You could iterate over your file multiple times, and each time you iterate over it, print out one line. But this will be inefficient: it will take time O(nm) where n==len(row) and m == number of lines in your file.
A better solution is to read all the lines of the file and save them to an array, then print the corresponding indices from this array:
arr = []
with open("Foodinventory.txt", "r") as f:
    arr = list(f)
for i in row:
    print(arr[i - 1])  # lists are zero-indexed

Truncate a column of a csv file?

I'm new to Python and I have the following csv file (let's call it out.csv):
DATE,TIME,PRICE1,PRICE2
2017-01-15,05:44:27.363000+00:00,0.9987,1.0113
2017-01-15,13:03:46.660000+00:00,0.9987,1.0113
2017-01-15,21:25:07.320000+00:00,0.9987,1.0113
2017-01-15,21:26:46.164000+00:00,0.9987,1.0113
2017-01-16,12:40:11.593000+00:00,,1.0154
2017-01-16,12:40:11.593000+00:00,1.0004,
2017-01-16,12:43:34.696000+00:00,,1.0095
and I want to truncate the second column so the csv looks like:
DATE,TIME,PRICE1,PRICE2
2017-01-15,05:44:27,0.9987,1.0113
2017-01-15,13:03:46,0.9987,1.0113
2017-01-15,21:25:07,0.9987,1.0113
2017-01-15,21:26:46,0.9987,1.0113
2017-01-16,12:40:11,,1.0154
2017-01-16,12:40:11,1.0004,
2017-01-16,12:43:34,,1.0095
This is what I have so far:
with open('out.csv', 'r+b') as nL, open('outy_3.csv', 'w+b') as nL3:
    new_csv = []
    reader = csv.reader(nL)
    for row in reader:
        time = row[1].split('.')
        new_row = []
        new_row.append(row[0])
        new_row.append(time[0])
        new_row.append(row[2])
        new_row.append(row[3])
        print new_row
        nL3.writelines(new_row)
I can't seem to get a newline added after writing each row to the new csv file.
This definitely doesn't look or feel pythonic.
Thanks
The missing newlines issue is because the file.writelines() method doesn't automatically add line separators to the elements of the argument it's passed, which it expects to be a sequence of strings. If these elements represent separate lines, then it's your responsibility to ensure each one ends in a newline.
However, your code tries to use it to output only a single line. To fix that you should use file.write() instead, because it expects its argument to be a single string, and if you want that string to be a separate line in the file, it must end with a newline or have one manually added to it.
Below is code that does what you want. It works by changing one of the elements of the list of strings that csv.reader returns in place, then writing the modified list to the output file as a single string by join()ing the elements back together, and finally manually adding a newline to the end of the result (stored in new_row).
import csv

with open('out.csv', 'rb') as nL, open('outy_3.csv', 'wt') as nL3:
    for row in csv.reader(nL):
        time_col = row[1]
        try:
            period_location = time_col.index('.')
            row[1] = time_col[:period_location]  # only keep characters in front of period
        except ValueError:  # no period character found
            pass  # leave row unchanged
        new_row = ','.join(row)
        print(new_row)
        nL3.write(new_row + '\n')
Printed (and file) output:
DATE,TIME,PRICE1,PRICE2
2017-01-15,05:44:27,0.9987,1.0113
2017-01-15,13:03:46,0.9987,1.0113
2017-01-15,21:25:07,0.9987,1.0113
2017-01-15,21:26:46,0.9987,1.0113
2017-01-16,12:40:11,,1.0154
2017-01-16,12:40:11,1.0004,
2017-01-16,12:43:34,,1.0095
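As an aside, the same result could also be produced by letting csv.writer handle the row formatting and line endings instead of joining by hand. A sketch under Python 3 (the newline='' arguments are the usual csv-module idiom; file names taken from the question):

import csv

with open('out.csv', newline='') as src, open('outy_3.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        row[1] = row[1].split('.')[0]  # drop the fractional seconds, if any
        writer.writerow(row)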

Sort by ignoring the first column and whitespace in csv file in Python

I have a csv file which I want to sort by taking each row at a time. While sorting the row, I want to ignore the whitespace (or empty cell). Also, I want to ignore the first row and first column while sorting.
This is what my code looks like:
import csv, sys, operator

fname = "Source.csv"
new_fname = "Dest.csv"
data = csv.reader(open(fname, "rb"), delimiter=',')
num = 1
sortedlist = []
ind = 0
for row in data:
    if num == 1:
        sortedlist.append(row)
        with open(new_fname, "wb") as f:
            filewriter = csv.writer(f, delimiter=",")
            filewriter.writerow(sortedlist[ind])
            ind += 1
    elif num > 1:
        sortedlist.append(sorted(row))
        with open(new_fname, "ab") as f:
            filewriter = csv.writer(f, delimiter=",")
            filewriter.writerow(sortedlist[ind])
            ind += 1
    num += 1
I was able to ignore the first row. But, I am not sure how to ignore the whitespace and the first column while sorting. Any suggestions are welcome.
I simplified your code significantly and here's what I got (although I didn't understand the part about empty columns; they are values as well. Did you mean that you wanted to keep empty columns in the same place instead of putting them at the start?).
import csv

if __name__ == '__main__':
    reader = csv.reader(open("Source.csv", "r"), delimiter=',')
    out_file = open("Dest.csv", "w")
    writer = csv.writer(out_file, delimiter=",")
    writer.writerow(reader.next())
    for row in reader:
        writer.writerow([row[0]] + sorted(row[1:]))
    out_file.close()
Always put executable code under if __name__ == '__main__':; this is done so that your code is not executed if your script is imported by another script rather than run directly.
We keep the out_file variable so we can out_file.close() it cleanly later; the code will work without it, but it's a clean way to write files.
Do not use "wb", "rb", "ab" for text files; the "b" stands for "binary" and should be used for binary files.
reader.next() gets the first line of the csv file (or crashes if the file is empty).
for row in reader: already starts from the second line (because we ran reader.next() earlier), so we don't need any line-number conditionals anymore.
row[0] gets the first element of the list, row[1:] gets all elements of the list, except the first one. For example, row[3:] would ignore first 3 elements and return the rest of the list. In this case, we only sort the row without its first element by doing sorted(row[1:])
EDIT: If you really want to remove empty columns from your csv, replace sorted(row[1:]) with sorted(filter(lambda x: x.strip()!='', row[1:])). This will remove empty columns from the list before sorting, but keep in mind that empty values in csv are still values.
EDIT2: As correctly pointed out by @user3468054, values will be sorted as strings; if you want them to be sorted as numbers, add the named parameter key=int to the sorted function, or key=float if your values are floats.
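As a purely illustrative example of the difference the key parameter makes:

row = ['label', '10', '2', '1.5']
print(sorted(row[1:]))             # ['1.5', '10', '2']  (lexicographic, as strings)
print(sorted(row[1:], key=float))  # ['1.5', '2', '10']  (numeric order)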

I cannot get split to work, what am I doing wrong?

Here is the code for the program that I have written so far. I am trying to calculate the efficiency of NBA players for a class project. When I run the program on a comma-delimited file that contains all the stats, instead of splitting on each comma it creates a single list entry for the entire line of the stat file. I then either get an index out of range error or it treats each character as an index point instead of the separate fields. I am new to this, but it seems it should be creating a list for each line in the file, separated into the elements of that list, so that I get a list of lists. I hope I have made myself understood.
Here is the code:
def get_data_list(file_name):
    data_file = open(file_name, "r")
    data_list = []
    for line_str in data_file:
        # strip end-of-line, split on commas, and append items to list
        line_str.strip()
        line_str.split(',')
        print(line_str)
        data_list.append(line_str)
    print(data_list)

file_name1 = input("File name: ")
result_list = get_data_list(file_name1)
print(result_list)
I do not see how to post the data file for you to look at and try it with, but any file of numbers that are comma-delimited should work.
If there is a way to post the data file or email to you for you to help me with it I would be happy to do so.
Boliver
Strings are immutable objects, which means you can't change them in place. Any operation on a string returns a new one. Now look at your code:
line_str.strip() # returns a string
line_str.split(',') # returns a list of strings
data_list.append(line_str) # appends original 'line_str' (i.e. the entire line)
You could solve this by:
stripped = line_str.strip()
data = stripped.split(',')
data_list.append(data)
Or by chaining the string operations:
data = line_str.strip().split(',')
data_list.append(data)
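Putting that fix back into the original function (a sketch; note it also adds a return statement, since the question's version never returns data_list, so result_list would otherwise be None):

def get_data_list(file_name):
    data_list = []
    with open(file_name, "r") as data_file:
        for line_str in data_file:
            # strip the end-of-line, split on commas, and keep the resulting list
            data_list.append(line_str.strip().split(','))
    return data_list

result_list = get_data_list(input("File name: "))
print(result_list)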
