Python: how to search a text file for a number

There's a text file that I'm reading line by line. It looks something like this:
3
3
67
46
67
3
46
Each time the program encounters a new number, it writes it to a text file. The way I'm thinking of doing this is writing the first number to the file, then looking at the second number and checking if it's already in the output file. If it isn't, it writes THAT number to the file. If it is, it skips that line to avoid repetitions and goes on to the next line. How do I do this?

Rather than searching your output file, keep a set of the numbers you've written, and only write numbers that are not in the set.
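A minimal sketch of that idea (the 'input.txt'/'output.txt' names here are just placeholders):
seen = set()                                   # numbers already written to the output
with open('input.txt') as src, open('output.txt', 'w') as dst:
    for line in src:
        n = int(line)                          # each line holds one integer
        if n not in seen:                      # write each number only once
            dst.write('%d\n' % n)
            seen.add(n)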

Instead of checking the output file to see whether a number was already written, it is better to keep that information in a variable (a set or a list). It will save you disk reads.
To search a file for numbers you need to loop through each line of that file; you can do that with a for line in open('input'): loop, where input is the name of your file. On each iteration line will contain one line of the input file, terminated by the end-of-line character '\n'.
In each iteration you should try to convert the value on that line to a number; the int() function can be used for this. You may want to protect yourself against empty lines or non-number values with a try statement.
In each iteration, once you have the number, you should check whether it was already written to the output file by consulting a set of already written numbers. If the value is not in the set yet, add it and write it to the output file.
#!/usr/bin/env python
numbers = set()                # create a set for storing numbers that were already written
out = open('output', 'w')      # open 'output' file for writing
for line in open('input'):     # loop through each line of 'input' file
    try:
        i = int(line)          # try to convert line to integer
    except ValueError:         # if conversion to integer fails display a warning
        print("Warning: cannot convert string '%s' to a number" % line.strip())
        continue               # skip to next line on error
    if i not in numbers:       # check if the number wasn't already added to the set
        out.write('%d\n' % i)  # write the number to the 'output' file followed by EOL
        numbers.add(i)         # add number to the set to mark it as already added
out.close()                    # flush and close the output file
This example assumes that your input file contains a single number on each line. In case of an empty or incorrect line a warning is printed to stdout.
You could also use a list in the above example, but it may be less efficient.
Instead of numbers = set() use numbers = [], and instead of numbers.add(i) use numbers.append(i). The if condition stays the same.
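For comparison, a rough sketch of that list-based variant (same 'input'/'output' file names as the example above):
numbers = []                      # a list instead of a set
out = open('output', 'w')
for line in open('input'):
    try:
        i = int(line)
    except ValueError:
        continue                  # skip empty or non-numeric lines
    if i not in numbers:          # membership test on a list is O(n), hence slower
        out.write('%d\n' % i)
        numbers.append(i)         # append() instead of add()
out.close()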

Don't do that. Use a set() to keep track of all the numbers you have seen. It will only have one of each.
numbers = set()
for line in open("numberfile"):
    numbers.add(int(line.strip()))
open("outputfile", "w").write("\n".join(str(n) for n in numbers))
Note this reads them all, then writes them all out at once. This will put them in a different order than in the original file: set iteration order is arbitrary (for small integers it often happens to come out in ascending numeric order, but you should not rely on that). If you don't want that, you can also write them as you read them, but only if they are not already in the set:
numbers = set()
with open("outfile", "w") as outfile:
    for line in open("numberfile"):
        number = int(line.strip())
        if number not in numbers:
            outfile.write(str(number) + "\n")
            numbers.add(number)

Are you working with exceptionally large files? You probably don't want to try to "search" the file you're writing to for a value you just wrote. You (probably) want something more like this:
encountered = set()
with open('file1') as fhi, open('file2', 'w') as fho:
    for line in fhi:
        if line not in encountered:
            encountered.add(line)
            fho.write(line)

If you want to scan through a file to see if it contains a number on any line, you could do something like this:
def file_contains(f, n):
    with f:
        for line in f:
            if int(line.strip()) == n:
                return True
    return False
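For example, it could be used like this ('numbers.txt' is just a placeholder file name):
print(file_contains(open("numbers.txt"), 46))   # True if any line equals 46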
However, as Ned points out in his answer, this isn't a very efficient solution; if you have to search through the file again for each line, the running time of your program will grow with the square of the number of numbers.
If the number of values is not incredibly large, it would be more efficient to use a set (documentation). Sets are designed to keep track of unordered values very efficiently. For example:
with open("input_file.txt", "rt") as in_file:
with open("output_file.txt", "wt") as out_file:
encountered_numbers = set()
for line in in_file:
n = int(line.strip())
if n not in encountered_numbers:
encountered_numbers.add(n)
out_file.write(line)

Related

How to find whether an integer is between the first two columns of a file without using any for loop

I have a file which has integers in its first two columns.
File Name : file.txt
col_a,col_b
1001021,1010045
2001021,2010045
3001021,3010045
4001021,4010045 and so on
Now, using Python, I get a variable var_a = 2002000.
How do I find the range in "file.txt" within which this var_a lies?
Expected output: 2001021,2010045
I have tried the code below,
With open("file.txt","r") as a:
a_line = a.readlines()
for line in a_line:
line_sp = line.split(',')
if var_a < line_sp[0] and var_a > line_sp[1]:
print ('%r, %r', %(line_sp[0], line_sp[1])
Since the file has more than a million records this makes it time consuming. Is there any better way to do the same without a for loop?
Since the file has more than a million records this makes it time consuming. Is there any better way to do the same without a for loop?
Unfortunately you have to iterate over all records in the file, and the only way you can achieve that is some kind of for loop. So the complexity of this task will always be at least O(n).
It is better to read your file line by line (not all into memory) and store its contents as ranges, so you can look them up for multiple numbers. Ranges are stored quite efficiently, and you only have to read your file once to check more than one number.
Since Python 3.7 dictionaries are insertion-ordered, so if your file is sorted you will only iterate your dictionary until the first range that contains the number; for numbers that are not in any range you iterate the whole dictionary.
Create file:
fn = "n.txt"
with open(fn, "w") as f:
f.write("""1001021,1010045
2001021,2010045
3001021,3010045
garbage
4001021,4010045""")
Process file:
fn = "n.txt"
# read in
data = {}
with open(fn) as f:
for nr,line in enumerate(f):
line = line.strip()
if line:
try:
start,stop = map(int, line.split(","))
data[nr] = range(start,stop+1)
except ValueError as e:
pass # print(f"Bad data ({e}) in line {nr}")
look_for_nums = [800, 1001021, 3001039, 4010043, 9999999]
for look_for in look_for_nums:
items_checked = 0
for nr,rng in data.items():
items_checked += 1
if look_for in rng:
print(f"Found {look_for} it in line {nr} in range: {rng.start},{rng.stop-1}", end=" ")
break
else:
print(f"{look_for} not found")
print(f"after {items_checked } checks")
Output:
800 not found after 4 checks
Found 1001021 in line 0 in range: 1001021,1010045 after 1 checks
Found 3001039 in line 2 in range: 3001021,3010045 after 3 checks
Found 4010043 in line 5 in range: 4001021,4010045 after 4 checks
9999999 not found after 4 checks
There are better ways to store such a ranges file, e.g. in a tree-like data structure; research k-d trees to get even faster results if you need them. They partition the ranges in a smarter way, so you do not need a linear search to find the right bucket.
This answer to "Data Structure to store Integer Range, Query the ranges and modify the ranges" provides more things to research.
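If the ranges are sorted and non-overlapping, a simpler alternative to a full tree is a binary search with the standard bisect module; a rough sketch using the sample values from the question:
from bisect import bisect_right

# sorted, non-overlapping (start, stop) pairs, e.g. parsed from the file as above
ranges = [(1001021, 1010045), (2001021, 2010045),
          (3001021, 3010045), (4001021, 4010045)]
starts = [start for start, stop in ranges]

def find_range(value):
    idx = bisect_right(starts, value) - 1      # rightmost range whose start is <= value
    if idx >= 0 and value <= ranges[idx][1]:
        return ranges[idx]
    return None

print(find_range(2002000))   # (2001021, 2010045)
print(find_range(800))       # None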
Assuming each line in the file has the correct format, you can do something like the following.
var_a = 2002000

with open("file.txt") as file:
    for l in file:
        a, b = map(int, l.split(',', 1))  # each line must have only two comma-separated numbers
        if a < var_a < b:
            print(l)  # use the line as you want
            break     # if you need only the first occurrence, break the loop now
Note that you'll have to do additional verifications/workarounds if the file format is not guaranteed.
Obviously you have to iterate through all the lines (in the worst case). But we don't load all the lines into memory at once, so as soon as the answer is found, the rest of the file is ignored without being read (assuming you are looking only for the first match).

Reading the nth line of a text file in Python, determined from a list

I have a function gen_rand_index that generates a random group of numbers in list format, such as [3,1] or [3,2,1].
I also have a text file that reads something like this:
red $1
green $5
blue $6
How do I write a function so that once Python generates this list of numbers, it automatically reads those line numbers from the text file? So if it generated [2,1], instead of printing [2,1] I would get "green $5, red $1", i.e. the second line in the text file and then the first line in the text file?
I know that you can do print(line[2]) and commands like that, but this won't work in my case because each time I am getting a different random number for the line that I want to read; it is not a set line I want to read each time.
row = str(result[gen_rand_index])  # result[gen_rand_index] gives me the random list of numbers
file = open("Foodinventory.txt", 'r')
for line in file:
    print(line[row])
file.close()
I have this so far, but I am getting this
error: invalid literal for int() with base 10: '[4, 1]'
I also have gotten
TypeError: string indices must be integers
but I have tried replacing str with int and many things like that, and I'm thinking the way I'm approaching this is just wrong. Can anyone help me? (I have only been coding for a couple of days now, so I apologize in advance if this question is really basic.)
Okay, let us first get some stuff out of the way
Whenever you access something from a list, the thing you put inside the square brackets [] should be an integer, e.g. [5]. This tells Python that you want the element at index 5. It cannot be ["5"], because "5" in that case would be treated as a string.
Therefore the line row = str(result[gen_rand_index]) should actually just be row = ... without the call to str. This is why you got the TypeError about string indices.
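A quick illustration of that point, with made-up values in the spirit of your file:
lines = ['red $1', 'green $5', 'blue $6']
print(lines[1])      # 'green $5' -- an integer index works
print(lines['1'])    # TypeError: list indices must be integers or slices, not str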
Secondly, as per your description, gen_rand_index returns a list of numbers.
So going by that, why don't you try this:
indices_to_pull = gen_rand_index()
file_handle = open("Foodinventory.txt", 'r')
file_contents = file_handle.readlines()  # if the file is small and simple this will work fine
answer = []
for index in indices_to_pull:
    answer.append(file_contents[index - 1])
Explanation
We get the indices of the file lines from gen_rand_index.
We read the entire file into memory using readlines().
Then we get the lines we want. Remember to subtract 1, as the list is indexed from 0.
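A possible follow-up, if you want the output format described in the question ("green $5, red $1"):
print(", ".join(line.strip() for line in answer))   # e.g. "green $5, red $1"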
The error you are getting is because you're trying to index a string variable (line) with a string index (row). Presumably row will contain something like '[2,3,1]'.
However, even if row was a numerical index, you're not indexing what you think you're indexing. The variable line is a string, and it contains (on any given iteration) one line of the file. Indexing this variable will give you a single character. For example, if line contains green $5, then line[2] will yield 'e'.
It looks like your intent is to index into a list of strings, which represent all the lines of the file.
If your file is not overly large, you can read the entire file into a list of lines, and then just index that array:
with open('file.txt') as fp:
    lines = fp.readlines()
print(lines[2])
In this case, lines[2] will yield the string 'blue $6\n'.
To discard the trailing newline, use lines[2].strip() instead.
I'll go line by line and raise some issues.
row = str(result[gen_rand_index]) #result[gen_rand_index] gives me the random list of numbers
Are you sure it is gen_rand_index and not gen_rand_index()? If gen_rand_index is a function, you should call it. In the code you have, you are not calling the function; instead, you are using the function object itself as an index.
file = open("Foodinventory.txt", 'r')
for line in file:
print(line[row])
file.close()
The correct Python idiom for opening a file and reading it line by line is:
with open("Foodinventory.txt.", "r") as f:
for line in f:
...
This way you do not have to close the file; the with clause does this for you automatically.
Now, what you want to do is to print the lines of the file that correspond to the elements in your variable row. So what you need is an if statement that checks if the line number you just read from the file corresponds to the line number in your array row.
with open("Foodinventory.txt", "r") as f:
for i, line in enumerate(f):
if i == row[i]:
print(line)
But this is wrong: it would only work if your list's elements were ordered, and that is not the case in your question. So let's think a little bit. You could iterate over your file multiple times, and each time you iterate over it, print out one line. But this will be inefficient: it will take O(nm) time, where n == len(row) and m is the number of lines in your file.
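For illustration only, that multi-pass approach would look roughly like this (re-opening the file once per requested line number):
for wanted in row:
    with open("Foodinventory.txt", "r") as f:
        for line_number, line in enumerate(f, start=1):
            if line_number == wanted:   # scan from the top every time: O(n*m) overall
                print(line)
                break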
A better solution is to read all the lines of the file and save them to an array, then print the corresponding indices from this array:
arr = []
with open("Foodinventory.txt", "r") as f:
    arr = list(f)
for i in row:
    print(arr[i - 1])  # arrays are zero-indexed

Issues reading two text files and calculating values

I have two text files with numbers that I want to do some very easy calculations on (for now). I thought I would go with Python. I have two file readers for the two text files:
with open('one.txt', 'r') as one:
    one_txt = one.readline()
    print(one_txt)

with open('two.txt', 'r') as two:
    two_txt = two.readline()
    print(two_txt)
Now to the fun (and for me hard) part. I would like to loop through all the numbers in the second text file and then subtract each one from the corresponding number in the first text file.
I have done this (extending the code above):
with open('two.txt') as two_txt:
    for line in two_txt:
        print(line)
I don't know how to proceed now, because I think that the second text file would need to be parsed as strings so I can extract the numbers I want. The text file (two.txt) looks like this:
Start,End
2432009028,2432009184,
2432065385,2432066027,
2432115011,2432115211,
2432165329,2432165433,
2432216134,2432216289,
2432266528,2432266667,
I want to loop through this, ignore the Start,End (first line) and then, as it loops, only pick the first value before each comma; the result would be:
2432009028
2432065385
2432115011
2432165329
2432216134
2432266528
Which I would then subtract from the corresponding value in one.txt (which contains numbers only and no strings whatsoever) and print the result.
There are many ways to do string operations and I feel lost; for instance, I don't know if the methods that read everything into memory are good or not.
Any examples on how to solve this problem would be very appreciated (I am open to different solutions)!
Edit: Forgot to point out, one.txt has values without any comma, like this:
102582
205335
350365
133565
Something like this
with open('one.txt', 'r') as one, open('two.txt', 'r') as two:
    next(two)  # skip the first line ("Start,End") in two.txt
    for line_one, line_two in zip(one, two):
        one_a = int(line_one.strip())         # one.txt holds one plain number per line
        two_b = int(line_two.split(",")[0])   # first value before the comma in two.txt
        print(one_a - two_b)
Try this:
onearray = []
file = open("one.txt", "r")
for line in file:
    onearray.append(int(line.replace("\n", "")))
file.close()

twoarray = []
file = open("two.txt", "r")
for line in file:
    if line != "Start,End\n":
        twoarray.append(int(line.split(",")[0]))
file.close()

for i in range(0, len(onearray)):
    print(twoarray[i] - onearray[i])
It should do the job!

Error with .readlines()[n]

I'm a beginner with Python.
I tried to solve the problem: "If we have a file containing <1000 lines, how do we print only the odd-numbered lines?". This is my code:
with open(r'C:\Users\Savina\Desktop\rosalind_ini5.txt') as f:
    n = 1
    num_lines = sum(1 for line in f)
    while n < num_lines:
        if n/2 != 0:
            a = f.readlines()[n]
            print(a)
            break
        n = n + 2
where n is a counter and num_lines calculates how many lines the file contains.
But when I try to execute the code, it says:
"a=f.readlines()[n]
IndexError: list index out of range"
Why doesn't it recognize n as a counter?
You have the call to readlines inside a loop, but this is not its intended use, because readlines ingests the whole file at once, returning a LIST of newline-terminated strings.
You may want to save such a list and operate on it:
list_of_lines = open(filename).readlines()  # no need for closing, Python will do it for you
odd = 1
for line in list_of_lines:
    if odd: print(line, end='')
    odd = 1 - odd
Two remarks:
odd alternates between 1 (hence true when used as the argument of an if) and 0 (hence false when used as the argument of an if),
the optional argument end='' to the print function is required because each line in list_of_lines is terminated by a newline character; if you omit the optional argument, the print function will output a SECOND newline character at the end of each line.
Coming back to your code, you can fix its behavior by using an f.seek(0) before the loop to rewind the file to its beginning position, and by using the f.readline() (look, it's NOT readlineS) method inside the loop, but rest assured that proceeding like this is, let's say, a bit unconventional...
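A minimal sketch of that (admittedly unconventional) fix, keeping your file path and printing the 1st, 3rd, 5th... lines:
with open(r'C:\Users\Savina\Desktop\rosalind_ini5.txt') as f:
    num_lines = sum(1 for line in f)   # first pass: count the lines
    f.seek(0)                          # rewind to the beginning of the file
    n = 0
    while n < num_lines:
        line = f.readline()            # read ONE line per iteration
        if n % 2 == 0:                 # index 0, 2, 4, ... -> 1st, 3rd, 5th line
            print(line, end='')
        n = n + 1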
Finally, it is possible to do everything you want with a one-liner
print(''.join(open(filename).readlines()[::2]))
that uses the slice notation for lists and the string method .join()
Well, I'd personally do it like this:
def print_odd_lines(some_file):
    with open(some_file) as my_file:
        for index, each_line in enumerate(my_file):  # keep track of the index of each line
            if index % 2 == 1:  # check if the index is odd
                print(each_line)  # if it is, print it

if __name__ == '__main__':
    print_odd_lines(r'C:\Users\Savina\Desktop\rosalind_ini5.txt')
Be aware that this will leave a blank line after each printed line. I'm sure you can figure out how to get rid of it.
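One way to do that (just a guess at what was intended) is to suppress print's own newline, since each_line already ends with one:
print(each_line, end='')   # the line already contains '\n', so don't add another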
This code will do exactly as you asked:
with open(r'C:\Users\Savina\Desktop\rosalind_ini5.txt') as f:
    for i, line in enumerate(f.readlines()):  # iterate over each line and add an index (i) to it
        if i % 2 == 0:  # i starts at 0 in Python, so if i is even, the line number is odd
            print(line)
To explain what happens in your code:
A file can only be read through once. After that it has to be closed and reopened again.
You first iterate over the entire file in num_lines = sum(1 for line in f). Now the object f is exhausted.
If n is odd, however, you call f.readlines(). This will go through all the lines again, but none are left in f. So every time n is odd, you go through the entire (already exhausted) file. It is faster to go through it once (as in the solutions offered to your question).
As a fix, you need to type
f.close()
f = open(r'C:\Users\Savina\Desktop\rosalind_ini5.txt')
every time after you read through the file, in order to get back to the start.
As a side note, you should look up the modulus operator % for finding odd numbers.

Adding non-duplicate strings from one txt to another in Python 3.3

I have 2 text files (new.txt and master.txt). Each has different data stored as such:
Cory 12 12:40:12.016221
Suzy 64 12:40:33.404614
Trent 145 12:40:56.640052
(categorised by the first set of numbers appearing on each line)
I have to scan each line of new.txt for the name (e.g. Suzy), check if there is a duplicate in master.txt and, if there isn't, add that line to master.txt, categorized by that line's number (e.g. 64 in Suzy 64 12:40:33.404614).
I have written the following script, but it falls into a loop of checking the 1st line of new.txt (I know why, I just don't know how to work around not closing fileinput.input('new.txt') so that I can then open fileinput.input('master.txt') further down the loop). I feel like I've highly overcomplicated things for myself, and any help is appreciated.
import fileinput
import re

end_of_file = False
while end_of_file == False:
    for line in fileinput.input('new.txt', inplace=1):
        end_of_file = fileinput.isstdin()  # ends while loop if on last line of new.txt
        user_f_line_list = line.split()
        master_f = open('master.txt', 'r')
        master_f_read = master_f.read()
        master_f.close()
        fileinput.close()
        if not re.findall(user_f_line_list[0], master_f_read):
            for line in fileinput.input('master.txt', inplace=1):
                master_line_list = line.split()
                if int(user_f_line_list[1]) <= int(master_line_list[1]):
                    written = False
                    while written == False:
                        written = True
                        print(' '.join(user_f_line_list))
                print(line, end='')
            fileinput.close()
And for reference, master.txt starts with startline 0 and ends with endline 1000000000000000 so that it is impossible for the categorizing to be out of range.
Some suggestions:
Open master.txt into a list with readlines().
Use an OrderedDict from the collections module - it is the same as a regular dict but preserves the order. Make each key the unique element - a tuple in this case (e.g. ("Cory", 12)). Make the value whatever comes after.
Now you can very rapidly check to see if the entry is present by if key in my_dict:.
If it isn't, you can insert it. If you need to insert in order, it'll take a bit more work, but not too much. I would insert at the end, convert to a list when all is done, and apply a sort to the list with a custom key function to specify how to sort.
Output it back to the file.
I won't say it's necessarily shorter than your solution, but it is a lot cleaner.
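A rough sketch of the approach outlined above (the parsing assumes the three-field "name number timestamp" lines shown in the question, and sorting is done by the numeric field; adapt as needed):
from collections import OrderedDict

def load(path, store):
    with open(path) as f:
        for line in f.readlines():
            parts = line.split(None, 2)        # name, number, rest of the line
            if len(parts) == 3:
                name, number, rest = parts
                key = (name, int(number))      # unique key, e.g. ("Cory", 12)
                if key not in store:           # rapid membership check
                    store[key] = rest.rstrip('\n')

entries = OrderedDict()
load('master.txt', entries)                    # existing entries first
load('new.txt', entries)                       # then non-duplicates from new.txt

# sort by the numeric part and write everything back out
with open('master.txt', 'w') as out:
    for (name, number), rest in sorted(entries.items(), key=lambda kv: kv[0][1]):
        out.write('%s %d %s\n' % (name, number, rest))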
