Python CSV search [duplicate] - python

I am fairly new to Python and have run into an issue which I believe to be strange. I am searching through the first column of a CSV file.
I am using the csv module and have some code with nested for loops. My intention was for the inner loop to restart from the first row of the CSV file every time it finds a match. Instead, it always continues from the last row that was read.
My code below and the results will make my problem more apparent.
number = [1,3,5,6,7,8,1234,324,5,2,35]
import csv
with open('...../Documents/pycharm/testcsv.csv', 'rb') as csvimport:
    csvfile = csv.reader(csvimport)
    for num in number:
        print 'looking for ' + str(num)
        is_in_file = False
        print 'set to false'
        for row in csvfile:
            print 'looking at value ' + row[0]
            if row[0] == str(num):
                is_in_file = True
                print 'match, set to true'
                break
        print 'test1'
        if is_in_file == False:
            print str(num) + ' not found in file!'
Here is what gets printed in the IDE:
looking for 1
set to false
looking at value a
looking at value 1
match, set to true
test1
looking for 3
set to false
NOTE: here I wish to look at the first line of the csvfile (value a). Instead it looks at the third line of the csvfile (value '').
looking at value
looking at value 1234
looking at value 7
looking at value 1
looking at value 3
match, set to true
test1
From here on out it skips my inner for loop altogether once it has gone through the last row:
looking for 5
set to false
looking at value 5
match, set to true
test1
looking for 6
set to false
looking at value 6
match, set to true
test1
looking for 7
set to false
looking at value 77
looking at value 23
looking at value 87
test1
7 not found in file!
looking for 8
set to false
test1
8 not found in file!
looking for 1234
set to false
test1
1234 not found in file!
looking for 324
set to false
test1
324 not found in file!
looking for 5
set to false
test1
5 not found in file!
looking for 2
set to false
test1
2 not found in file!
looking for 35
set to false
test1
35 not found in file!
Here is the csvfile
a,c,b,d,e
1,3,4,5,6
,7,7,,87
1234,1,98,7,76
7,8,90,0,8
1,3,98,0,0
3,cat,food,20,39
5,%,3,6,90
6,2,2,2,3
77,3,4,3,5
23,3,4,3,6
87,5,5,5,

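csvfile is an iterator, not a list (strictly, csv.reader returns a reader object rather than a generator). Each row can be read only once: your inner loop resumes where the previous one stopped, and once the last row has been read, every later loop ends immediately.
Consider doing this: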
csvfile = list(csv.reader(csvimport))
Then you can scan csvfile as much as you want.
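To make the difference concrete, here is a minimal Python 3 sketch. The in-memory sample string is a hypothetical stand-in for the file in the question, so the behaviour can be seen without touching the disk.

```python
import csv
import io

# Hypothetical in-memory stand-in for the CSV file in the question.
sample = "a,c,b\n1,3,4\n3,cat,food\n"

# A csv.reader can only be walked once.
reader = csv.reader(io.StringIO(sample))
first_pass = [row[0] for row in reader]    # reads every row
second_pass = [row[0] for row in reader]   # reader is exhausted, nothing left

# Materialising the rows in a list allows repeated scans.
rows = list(csv.reader(io.StringIO(sample)))
list_pass_1 = [row[0] for row in rows]
list_pass_2 = [row[0] for row in rows]
```

Here second_pass comes back empty, while both passes over the list see all three rows.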
However, this is still not efficient, because every lookup is a linear scan over all the rows. Consider building a dictionary keyed on the first column instead. Here's how:
d = dict()
for r in csvfile:
    d[r[0]] = r[1:]
Then, replace your inner loop by:
if str(num) in d:  # keys read from the CSV are strings, so convert num first
    is_in_file = True
    print("%d is in file" % num)
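Put together, the dictionary approach might look like the following Python 3 sketch. The sample data is a shortened, hypothetical copy of the CSV from the question, and the names rows_by_key, found and missing are illustrative only.

```python
import csv
import io

# Shortened, hypothetical copy of the CSV from the question.
sample = "a,c,b,d,e\n1,3,4,5,6\n,7,7,,87\n1234,1,98,7,76\n5,%,3,6,90\n"
numbers = [1, 3, 5, 1234, 8]

# Map each first-column value to the rest of its row.
# If a key appears more than once, the later row overwrites the earlier one.
rows_by_key = {}
for row in csv.reader(io.StringIO(sample)):
    if row:  # skip completely empty lines
        rows_by_key[row[0]] = row[1:]

# csv yields strings, so each number is converted before the lookup.
found = [n for n in numbers if str(n) in rows_by_key]
missing = [n for n in numbers if str(n) not in rows_by_key]
```

Each membership test is a hash lookup rather than a scan of the whole file, which matters once the list of numbers or the file grows.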

Related

Regex in Python returns nothing (search parameters keywords for search for when using regex)

I'm not so sure how regex works, but I'm working on a project that analyzes a mark scheme PDF and then does something with the useful data. I haven't set the rest up yet; I'm working on the PDF indexing side of the code first, with a test PDF.
The issue is that when I enter the search pattern, the regex returns nothing from the PDF. I'm trying to go through each row that begins with 1 - 2 digits (the Question column) followed by A-D (the Answer column), using re.compile(r'\d{1} [A-D]') in the following code:
import re
import requests
import pdfplumber
import pandas as pd
def download_file(url):
    local_filename = url.split('/')[-1]
    with requests.get(url) as r:
        with open(local_filename, 'wb') as f:
            f.write(r.content)
    return local_filename
ap_url = 'https://papers.gceguide.com/A%20Levels/Biology%20(9700)/2019/9700_m19_ms_12.pdf'
ap = download_file(ap_url)
with pdfplumber.open(ap) as pdf:
    page = pdf.pages[1]
    text = page.extract_text()
    #print(text)
new_vend_re = re.compile(r'\d{1} [A-D]')
for line in text.split('\n'):
    if new_vend_re.match(line):
        print(line)
When I run the code, I do not get anything in return. Printing the text though will print the whole page.
Here is the PDF I'm trying to work with: https://papers.gceguide.com/A%20Levels/Biology%20(9700)/2019/9700_m19_ms_12.pdf
Your pattern matches exactly one space between the question number and the answer letter, but if you look at the output of text, there is more than one space between them.
'9700/12 Cambridge International AS/A Level – Mark Scheme March 2019\nPUBLISHED \n \nQuestion Answer Marks \n1 A 1\n2 C 1\n3 C 1\n4 A 1\n5 A 1\n6 C 1\n7 A 1\n8 D 1\n9 A 1\n10 C 1\n11 B 1\n12 D 1\n13 B 1\n...
Change your regex to the following so that it matches one or more whitespace characters instead of exactly one space:
new_vend_re = re.compile(r'\d{1}\s+[A-D]')
See the answer by alexpdev for the difference between new_vend_re.match() and new_vend_re.search(). If you run this within your code, you will get the following output:
1 A 1
2 C 1
3 C 1
4 A 1
5 A 1
6 C 1
7 A 1
8 D 1
9 A 1
(You can also see here, that there are always two spaces instead of one).
//Edit: Fixed typo in regex
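The effect of the pattern change can be checked without downloading the PDF. In the sketch below the text variable is a hypothetical extract that assumes two spaces between columns, as described above. The \d{1,2} variant is an assumption of mine, not part of the answer: because match() anchors at the start of the line, \d{1} only accepts single-digit question numbers, which is why the output above stops at 9.

```python
import re

# Hypothetical extract; assumes two spaces between the columns.
text = "Question  Answer  Marks\n1  A  1\n2  C  1\n10  C  1\n11  B  1"

single_space = re.compile(r'\d{1} [A-D]')    # original pattern
flexible = re.compile(r'\d{1,2}\s+[A-D]')    # 1-2 digits, any run of whitespace

lines = text.split('\n')
no_hits = [ln for ln in lines if single_space.match(ln)]
hits = [ln for ln in lines if flexible.match(ln)]
```

With this sample, the original pattern matches nothing, while the flexible pattern picks up the four answer rows and skips the header.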

How do I include the last instance of the condition met for the if statement?

I have a document that I am reading from and writing to. Each entry begins with a timestamp. I am trying to populate the dateNTimeArr array with a datetime object for every entry, but I notice that the last entry is never appended to the array, no matter how many entries I have. I'm not sure how to solve this.
This is an example of the text file I am reading from and writing to.
In this example, the array is populated with the datetime objects created from the first and second entries, but the last one is not appended.
2021-08-10 16:26:12
123
123
123
123
123
2021-08-10 16:26:28
123
123
123
123
123
2021-08-10 16:27:15
123
123
123
123
123
I tried removing the '\n' write from the start of the while loop, which seemed to work, but the next time I ran the code it messed up the format and more or less broke. Sorry in advance for the lack of structure in my code.
f = open("filename", "r")
dateNTimeArr = []
for line in f:
    if "2021" in line:
        datentime = line.split(" ")
        datePart = datentime[0]
        timePart = datentime[1]
        hours, mins, secs = timePart.split(":")
        year, month, day = datePart.split("-")
        date1 = date(int(year), int(month), int(day))
        time1 = time(int(hours), int(mins), int(secs))
        datetime1 = datetime.combine(date1, time1)
        dateNTimeArr.append(datetime1)
f.close()
f = open("filename", "a+")
submitBool = FALSE
while submitBool == FALSE:
    f.write('\n')
    f.write(now)
    f.write('\n')
    f.write(aQuantity.get())
    f.write('\n')
    f.write(bQuantity.get())
    f.write('\n')
    f.write(cQuantity.get())
    f.write('\n')
    f.write(dQuantity.get())
    f.write('\n')
    f.write(eQuantity.get())
    submitBool = TRUE
f.close()
As Bramar mentioned, the list does have 3 elements; you never printed it to check that.
Also, you have a few mistakes in the while loop. Python's Boolean values are True and False, not TRUE or FALSE.
And the loop will only ever run once, because you set submitBool to TRUE at the end of the first iteration.
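To check what actually ends up in the list, a small sketch like the one below can help. It swaps the manual splitting for datetime.strptime, which is my substitution rather than the asker's method, and it reads from a hypothetical in-memory sample instead of the real file. The function name read_timestamps is illustrative only.

```python
import io
from datetime import datetime

# Hypothetical sample in the same layout as the file in the question.
sample = (
    "2021-08-10 16:26:12\n123\n123\n"
    "2021-08-10 16:26:28\n123\n"
    "2021-08-10 16:27:15\n123\n123"
)

def read_timestamps(lines):
    """Return a datetime for every line that parses as a timestamp."""
    stamps = []
    for line in lines:
        try:
            stamps.append(datetime.strptime(line.strip(), "%Y-%m-%d %H:%M:%S"))
        except ValueError:
            continue  # quantity lines such as "123" are not timestamps
    return stamps

stamps = read_timestamps(io.StringIO(sample))
```

Printing len(stamps) and stamps[-1] right after the loop shows whether the last entry was parsed.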

Find next missing number in list

I am trying to make a very simple login script to learn about accessing files and lists but I'm a bit stuck.
newaccno = str(1)
with open("C:\\Python\\Test\\userpasstest.txt","r+") as loginfile:
    for line in loginfile.readlines():
        line = line.strip()
        logininfo = line.split(" ")
        print(newaccno in logininfo[0])
    while newaccno in logininfo[0]: #issue is here, also tried ==
        newaccno += 1
    print(newaccno)
    loginfile.write(newaccno)
My logic is that it will search logininfo[0] for newaccno and if it is true, increase newaccno by 1 and search again until it is false then write to file (so if the file has 1, 2 and 3 already then newaccno will end up as 4).
Edit: This is how the txt file looks, the first number represents newaccno before it gets split.
1 abc qwe
2 123 456
(adapted from comment)
Your while loop needs to be inside your for loop for it to work. If it is outside, logininfo[0] will always be the first field of the last line.
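A different way to get the same result, sketched below under my own assumptions, is to collect every account number first and only then look for the first gap. Note also that newaccno is a string in the question, so newaccno += 1 would raise a TypeError; the sketch keeps the counter as an int. The function name next_account_number and the sample lines are hypothetical.

```python
def next_account_number(lines):
    """Return the smallest positive integer not used as a first field."""
    used = set()
    for line in lines:
        fields = line.split()
        if fields and fields[0].isdigit():
            used.add(int(fields[0]))
    candidate = 1
    while candidate in used:
        candidate += 1
    return candidate

# Hypothetical file contents, one "number username password" entry per line.
sample_lines = ["1 abc qwe", "2 123 456"]
new_number = next_account_number(sample_lines)
gap_number = next_account_number(["1 abc qwe", "3 zzz yyy"])
```

The result would need converting with str() before being written back to the file.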

Error in string index out of range: maf file to fasta file

Context: I am trying to convert a maf file (multiple alignment file) to individual fasta files. I keep running into the same error that I'm not sure how to fix, and this may be because I'm relatively new to Python. My code is below:
open_file = open('C:/Users/Danielle/Desktop/Maf_file.maf','r')
for record in open_file:
    print(record[2:7])
    if (record[2] == 'Z'):
        new_file = open(record[2:7]+".fasta",'w')
        header = ">"+record[2:7]+"\n"
        sequence = record[46:len(record)]
        new_file.write(header)
        new_file.write(sequence)
        new_file.close()
    else:
        print("Not correct isolate.")
open_file.close()
The error I get is:
IndexError
Traceback (most recent call last)
2 for record in open_file:
3 print(record[2:7])
----> 4 if (record[2] == 'Z'):
5 new_file = open(record[2:7]+".fasta",'w')
6 header = ">"+record[2:7]+"\n"
IndexError: string index out of range
If I remove the if, else statement it works as I expect but I would like to filter for specific species that start with the character Z.
If anyone could help explain why I can't select for only strings that start with the character Z this way, that would be great! Thanks in advance.
It's giving an error when the length of record is less than 3, because index 2 does not exist then (a blank line, for example, is just '\n').
To fix this you can change your if statement to:
if (len(record) > 2 and record[2] == 'Z'):
Ideally, you should also handle such short records separately, before this check.
Here's an alternative answer.
The problem is that the record you're reading might not have at least 3 characters, so you have to check the length of the string before reading index 2. As you might have noticed, line 3 (the print of record[2:7]) doesn't crash.
In short, the slice operator returns whatever exists between index 2 and index 7, without raising an error when the string is shorter.
For example:
"abcd"[1:] == "bcd"
"abcd"[1:3] == "bc"
"abcd"[1:10] == "bcd"
"abcd"[4:] == ""
As you can see, a slice returns everything from the start index up to, but not including, the end index. When it runs out of characters, the slice simply stops and returns what it has, possibly an empty string.
So a different way to get the same result as checking the length would be to do this:
if (record[2:3] == 'Z'):
This way, you're slicing out the character at index 2: if that index exists you get a one-character string, otherwise an empty string. That said, I can't say whether the slice is faster than checking the length and then indexing; the slice most probably performs a similar bounds check internally.
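The contrast between indexing and slicing on a short line can be seen in this small sketch. The helper name starts_isolate_with_z and the sample records are hypothetical.

```python
def starts_isolate_with_z(record):
    """True when the character at index 2 exists and is 'Z'."""
    return record[2:3] == 'Z'

blank_line = "\n"  # what a blank line in the file looks like

# Indexing past the end raises IndexError...
try:
    blank_line[2]
    index_failed = False
except IndexError:
    index_failed = True

# ...while slicing past the end quietly yields an empty string.
slice_result = blank_line[2:3]
```

So starts_isolate_with_z("\n") is simply False, and a record such as "s Zabcd ..." gives True.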
A better answer
This way, we can fix the problem you had and also make the code a bit more efficient. You're slicing the record with [2:7] several times, so store the result in a variable. If record has no index 2, the slice record[2:7] is an empty string, which is falsy. Otherwise filename[0] is guaranteed to exist, and it is the same character as record[2].
A second improvement is to build the strings with % formatting instead of the + operator. With %s, whatever you pass is converted to a string, whereas concatenating a str with a non-str value such as False or None using + raises a TypeError.
open_file = open('C:/Users/Danielle/Desktop/Maf_file.maf','r')
for record in open_file:
    filename = record[2:7]
    print(filename)
    if (filename and filename[0] == 'Z'):
        with open("%s.fasta" % filename,'w') as new_file:
            header = ">%s\n" % filename
            sequence = record[46:len(record)]
            new_file.write(header)
            new_file.write(sequence)
    else:
        print("Not correct isolate.")
open_file.close()
A bit of reformat and we'd end up with this:
def write_fasta(record):
    filename = record[2:7]
    print(filename)
    if (filename and filename[0] == 'Z'):
        with open("%s.fasta" % filename,'w') as new_file:
            header = ">%s\n" % filename
            sequence = record[46:len(record)]
            new_file.write(header)
            new_file.write(sequence)
    else:
        print("Not correct isolate.")
maf_filename = 'C:/Users/Danielle/Desktop/Maf_file.maf'
with open(maf_filename, 'r') as maf:
    for record in maf:
        write_fasta(record)
Use a context manager (the with statement) whenever possible, as it closes the file for you. There is then no need to call the close() method explicitly.

Python exception "too many values to unpack" thrown when assigning a string of numbers to a dictionary

I have a function that reads a file which contains a name followed by a space, then multiple numbers, each separated by a space. I want to parse the name into one string, and all the numbers into another, then put them in a dictionary (with the name as the key). I have written the following code:
def read_users (user_file):
    try:
        file_in = open(user_file)
    except:
        return None
    user_scores = {}
    for line in file_in:
        temp_lst = line.strip().split(' ', 1)
        user_scores = [temp_lst[0]] = temp_lst[1]
    return user_scores
This seems to do everything I need, but when it puts it into a dictionary it throws the exception "Too many values to unpack". I'm confused as to why this is thrown because I think I should be passing the dictionary a string with the name as the key, and a string with a bunch of numbers as the value.
If it's important the lines in the input file are formatted as follows:
Ben 1 0 2 3 4 -2 5 5 6 6 1
I have tried printing the list before I pass it to the dictionary and it appears as follows:
['Ben', '1 0 2 3 4 -1 5 5 6 6 1']
Anyone have any ideas? Thanks!
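I think the way you assign into the dictionary is not quite right. The line user_scores = [temp_lst[0]] = temp_lst[1] is a chained assignment: it rebinds user_scores to the string and then tries to unpack that string into the one-element target list [temp_lst[0]], which raises "too many values to unpack". Try the line below to see if it works.
user_scores[temp_lst[0]] = temp_lst[1]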
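Both the error and the fix can be reproduced in a few lines. This is a sketch under my own assumptions: parse_users is a hypothetical variant of read_users that takes any iterable of lines, so an in-memory sample can stand in for the file.

```python
import io

# Unpacking a multi-character string into a single target reproduces the error.
try:
    [first] = "1 0 2"
    message = ""
except ValueError as exc:
    message = str(exc)

def parse_users(lines):
    """Map each name to the rest of its line (the scores, still as one string)."""
    user_scores = {}
    for line in lines:
        parts = line.strip().split(' ', 1)
        if len(parts) == 2:  # skip blank lines and names without scores
            name, scores = parts
            user_scores[name] = scores
    return user_scores

sample = io.StringIO("Ben 1 0 2 3 4 -2 5 5 6 6 1\n")
scores = parse_users(sample)
```

If the numbers are needed individually later, the stored string can be split again and converted with int().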
