Simplify python code for txt searching - python

I am a beginner at Python and I need to check for the presence of a given set of strings in a huge txt file. I've written this code so far, and it runs with no problems on a light subsample of my database. The problem is that it takes more than 10 hours when searching through the whole database, and I'm looking for a way to speed up the process.
The code so far reads a list of strings from a txt file I've put together (list.txt) and searches for every item in every line of the database (hugedataset.txt). My final output should be a list of the items which are present in the database (or, alternatively, a list of the items which are NOT present). I bet there is a more efficient way to do things, though...
Thank you for your support!
import re
fobj_in = open('hugedataset.txt')
present = []
with open('list.txt', 'r') as f:
    list1 = [line.strip() for line in f]
print(list1)
for l in fobj_in:
    for title in list1:
        if title in l:
            print(title)
            present.append(title)
found = set(present)  # renamed so the built-in set is not shadowed
print(found)

Since you don't need any per-line information, you can search the whole thing in one go for each string:
data = open('hugedataset.txt').read() # Assuming it fits in memory
present = [] # As @svk points out, you could make this a set
with open('list.txt', 'r') as f:
    list1 = [line.strip() for line in f]
print(list1)
for title in list1:
    if title in data:
        print(title)
        present.append(title)
found = set(present)  # renamed so the built-in set is not shadowed
print(found)

You could use a regexp to check for all the substrings in a single pass. See, for example, this answer: Check to ensure a string does not contain multiple values
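A minimal sketch of the single-pass idea, assuming Python 3 and that the search strings should be matched literally (hence re.escape). The name find_present and the sample data are hypothetical. One caveat: regex matches do not overlap, so a title that only ever occurs inside, or overlapping, another matched title can be missed.

```python
import re

def find_present(titles, text):
    """Return the set of titles found in text using one compiled pattern."""
    # Drop blank entries: an empty alternative would match everywhere.
    titles = [t for t in titles if t]
    if not titles:
        return set()
    # Try longer titles first so a title is not shadowed by its own prefix.
    ordered = sorted(titles, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered))
    return {match.group(0) for match in pattern.finditer(text)}

# Made-up sample data, for illustration only.
titles = ["Moby Dick", "Dracula", "C++ Primer"]
text = "I read Dracula last year and C++ Primer this year."
print(sorted(find_present(titles, text)))
```

If the file does not fit in memory, the same compiled pattern could be applied line by line and the per-line results merged with set.update.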

Related

How do I compare the strings in a txt.file to my dictionary values (the values are integers) and calculate the sum

The title of my question might be a bit confusing - maybe that's because I am a little confused myself.
I have a task to create a small billing system for a restaurant (It is not a real restaurant).
I have 3 text files. The first one is menu.txt. In that file, each line consists of the name of a dish and its price, separated by a comma. Then I have 2 order files, order1.txt and order2.txt. Each line contains an item from the menu that has been ordered.
My own suggestion to this task was to put the menu file into a list and then make it a dictionary.
This is my solution:
def compute_total(mfile, ofile):
    dictionary = {}
    newlist = []
    with open('menu.txt') as f:
        for line in f:
            line = line.replace(',', '')
            newlist.append(line.split())
    for strings in newlist:
        dictionary[strings[0]] = strings[1]
I feel like I am on the right track with this, but I don't really know how to continue from here.
I know that I want to somehow check whether, for example, each item in order1 is in the dictionary (menu) and then add up the values (prices) of the dishes.
I was thinking maybe something like this below, but it does not work and I am stuck.
for k, v in dictionary.items():
    if int(k) in dictionary.items():
        Dvalues.append(v)
I hope you can give me some advice to get going. I am a novice so I really hope you would take some time to help me with small problem (for you) like this.
Best regards,
SEE
Python's csv module is great for dealing with comma-separated files. Instead of having to parse each line manually, the csv module converts each line into a list of the values that were separated by commas. Then, you can just use parallel assignment to get the item and price for each line, and store them in the dictionary:
with open('menu.txt') as f:
    # csv.reader takes the file object and yields one list per line
    for item, price in csv.reader(f, quotechar="'", delimiter=","):
        dictionary[item] = float(price)  # store the price as a number
Now that you've stored all the menu items in the dictionary, all you have to do is read each item line from the other text files, pass that item to your dictionary, get back the price, and add it to the sum:
order1_total = 0
with open(ofile) as f:  # the order file, not the menu file
    for line in f:
        item = line.strip()  # remove the trailing newline
        order1_total += float(dictionary[item])
Don't forget to add the line import csv to the top of your program. Hope it helps! Good luck!
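Putting the two steps together, here is a rough sketch of what a complete compute_total could look like in Python 3. It assumes each menu line is "name,price" and each order line holds exactly one item name; the parameter names menu_path and order_path and the sample menu are made up for illustration.

```python
import csv
import os
import tempfile

def compute_total(menu_path, order_path):
    """Sum the menu prices of every item listed in the order file."""
    prices = {}
    with open(menu_path, newline="") as menu:
        for row in csv.reader(menu):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            prices[row[0].strip()] = float(row[1])
    total = 0.0
    with open(order_path) as order:
        for line in order:
            item = line.strip()
            if item:
                total += prices[item]  # raises KeyError for unknown items
    return total

# Tiny demonstration with made-up data written to temporary files.
with tempfile.TemporaryDirectory() as tmp:
    menu_file = os.path.join(tmp, "menu.txt")
    order_file = os.path.join(tmp, "order1.txt")
    with open(menu_file, "w") as f:
        f.write("Pizza, 8.50\nSalad, 4.00\nSoup, 3.25\n")
    with open(order_file, "w") as f:
        f.write("Pizza\nSoup\nPizza\n")
    demo_total = compute_total(menu_file, order_file)
    print(demo_total)
```

Letting an unknown item raise KeyError is a deliberate choice in this sketch: a typo in an order file is reported instead of being silently billed as zero.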

How to import a special format as a dictionary in python?

I have text files in the format below, all on a single line:
username:password;username1:password1;username2:password2;
etc.
What I have tried so far is
with open('list.txt') as f:
    d = dict(x.rstrip().split(None, 1) for x in f)
but I get an error saying that the length is 1 and 2 is required, which indicates that the file is not being read as key:value pairs.
Is there any way to fix this or should I just reformat the file in another way?
thanks for your answers.
What I have so far is:
with open('tester.txt') as f:
    password_list = dict(x.strip(":").split(";", 1) for x in f)
for user, password in password_list.items():
    print(user + " - " + password)
The result comes out as username:password - username1:password1
What I need is to split username:password so that key = user and value = password
Since the variable f in this case is a file object and not a list, the first thing to do would be to get the lines from it. You could use the file.readlines method* (https://docs.python.org/2/library/stdtypes.html?highlight=readline#file.readlines) for this.
Furthermore, I think I would use split with the semicolon (";") as its argument. This will give you a list of "username:password" strings, provided your entire file looks like this.
I think you will figure out what to do after that.
EDIT
* I automatically assumed you were using Python 2.7 for some reason. In version 3.x, the file objects returned by open() provide readlines() as well; the "distutils.text_file" class (https://docs.python.org/3.7/distutils/apiref.html?highlight=readlines#distutils.text_file.TextFile.readlines) also has such a method, but distutils is deprecated.
Load the text of the file in Python with open() and read it as a string
Apply split(';') to that string to create a list like ['username:password', 'username1:password1', 'username2:password2']
Do a dict comprehension where you apply split(':') to each item of the above list to split those pairs.
with open('list.txt', 'rt') as f:
    raw_data = f.readline().strip()  # the single line, without the newline
list_data = raw_data.split(';')
# "if x" skips the empty string left after the trailing semicolon
user_dict = { x.split(':')[0]:x.split(':')[1] for x in list_data if x }
print(user_dict)
Dictionary comprehension is useful here.
One liner to pull all the info out of the text file. As requested. Hope your tutor is impressed. Ask him How it works and see what he says. Maybe update your question to include his response.
If you want me to explain, feel free to comment and I shall go into more detail.
The error you're probably getting:
ValueError: dictionary update sequence element #3 has length 1; 2 is required
is because the text line ends with a semicolon. Splitting it on semicolons then results in a list that contains some pairs, and an empty string:
>>> "username:password;username1:password1;username2:password2;".split(";")
['username:password', 'username1:password1', 'username2:password2', '']
Splitting the empty string on colons then results in a single empty string, rather than two strings.
To fix this, filter out the empty string. One example of doing this would be
[element for element in x.split(";") if element != ""]
In general, I recommend you do the work one step at a time and assign to intermediary variables.
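As an illustration of working one step at a time, here is a small sketch that names every intermediate result. The function name parse_credentials is hypothetical, and the use of maxsplit=1 (so that a password may itself contain a colon) is an assumption about the data, not something stated in the question.

```python
def parse_credentials(text):
    """Turn 'user:pass;user1:pass1;' into a dict, one step at a time."""
    # Step 1: split the line into entries on semicolons.
    entries = text.strip().split(";")
    # Step 2: drop the empty string left behind by the trailing semicolon.
    non_empty = [entry for entry in entries if entry]
    # Step 3: split each entry once; further colons stay in the password.
    pairs = [entry.split(":", 1) for entry in non_empty]
    # Step 4: build the dictionary from the (user, password) pairs.
    return {user: password for user, password in pairs}

line = "username:password;username1:password1;username2:password2;\n"
print(parse_credentials(line))
```

An entry with no colon at all would still raise a ValueError in step 4, which is probably what you want for malformed input.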
Here's a simple (but long) answer. You need to get the line from the file, and then split it and the items resulting from the split:
results = {}
with open('file.txt') as file:
    for line in file:
        #Only one line, but that's fine
        entries = line.strip().split(';')  # strip() removes the trailing newline
        for entry in entries:
            if entry != '':
                #The last item in entries will be blank, due to how split works in this example
                user, password = entry.split(':')
                results[user] = password
Try this.
f = open('test.txt').read().strip()  # strip() removes the trailing newline
data = f.split(";")
d = {}
for i in data:
    if i:  # skip the empty string after the last semicolon
        value = i.split(":")
        d.update({value[0]:value[1]})
print(d)

Issues reading two text files and calculating values

I have two text files with numbers that I want to do some very easy calculations on (for now). I thought I would go with Python. I have two file readers for the two text files:
with open('one.txt', 'r') as one:
    one_txt = one.readline()
    print(one_txt)
with open('two.txt', 'r') as two:
    two_txt = two.readline()
    print(two_txt)
Now to the fun (and for me hard) part. I would like to loop through all the numbers in the second text file and then subtract it with the second number in the first text file.
I have done this (extending the code above):
with open('two.txt') as two_txt:
    for line in two_txt:
        print(line)
I don't know how to proceed now, because I think that the second text file would need to be converted to a string in order to do some parsing so I get the numbers I want. The text file (two.txt) looks like this:
Start,End
2432009028,2432009184,
2432065385,2432066027,
2432115011,2432115211,
2432165329,2432165433,
2432216134,2432216289,
2432266528,2432266667,
I want to loop through this, ignore the Start,End header (first line), and on each iteration pick only the first value before the comma. The result would be:
2432009028
2432065385
2432115011
2432165329
2432216134
2432266528
Which I would then subtract with the second value in one.txt (which contains numbers only and no strings whatsoever) and print the result.
There are many ways to do string operations and I feel lost, for instance I don't know if the methods to read everything to memory are good or not.
Any examples on how to solve this problem would be very appreciated (I am open to different solutions)!
Edit: Forgot to point out, one.txt has values without any comma, like this:
102582
205335
350365
133565
Something like this
with open('one.txt', 'r') as one, open('two.txt', 'r') as two:
    next(two) # skip first line in two.txt
    for line_one, line_two in zip(one, two):
        one_a = int(line_one.strip())        # one.txt holds one number per line
        two_b = int(line_two.split(",")[0])  # first value before the comma
        print(one_a - two_b)
Try this:
onearray = []
file = open("one.txt", "r")
for line in file:
    onearray.append(int(line.replace("\n", "")))
file.close()
twoarray = []
file = open("two.txt", "r")
for line in file:
    if line != "Start,End\n":
        twoarray.append(int(line.split(",")[0]))
file.close()
for i in range(0, len(onearray)):
    print(twoarray[i] - onearray[i])
It should do the job!
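On the question of reading everything into memory: for files like these either approach works, but here is a sketch that streams both files in step, so nothing has to be loaded whole. It assumes Python 3, no blank lines in either file, and that the wanted result is the Start value minus the matching value from one.txt (the same direction as the answer just above). The name differences is hypothetical, and the demo writes the first rows of the sample data to temporary files.

```python
import csv
import os
import tempfile

def differences(one_path, two_path):
    """Yield Start minus the matching one.txt value, row by row."""
    with open(one_path) as one, open(two_path, newline="") as two:
        rows = csv.reader(two)
        next(rows)  # skip the "Start,End" header
        # zip stops at the end of the shorter file.
        for line_one, row in zip(one, rows):
            yield int(row[0]) - int(line_one)

# Demonstration using the first two rows from the question.
with tempfile.TemporaryDirectory() as tmp:
    one_file = os.path.join(tmp, "one.txt")
    two_file = os.path.join(tmp, "two.txt")
    with open(one_file, "w") as f:
        f.write("102582\n205335\n")
    with open(two_file, "w") as f:
        f.write("Start,End\n2432009028,2432009184,\n2432065385,2432066027,\n")
    result = list(differences(one_file, two_file))
    print(result)
```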

How to get text between 2 lines with Python

So I have a text file that is structured like this:
Product ID List:
ABB:
578SH8
EFC025
EFC967
CNC:
HDJ834
HSLA87
...
...
This file continues on with many companies' names and Id's below them. I need to then get the ID's of the chosen company and append them to a list, where they will be used to search a website. Here is the current line I have to get the data:
PID = open('PID.txt').read().split()
This works great if there are only Product ID's of only 1 company in there and no text. This does not work for what I plan on doing however... How can I have the reader read from (an example) after where it says ABB: to before the next company? I was thinking maybe add some kind of thing in the file like ABB END to know where to cut to, but I still don't know how to cut out between lines in the first place... If you could let me know, that would be great!
Two consecutive newlines act as a delimiter, so just split there and construct a dictionary of the data:
data = {i.split()[0]: i.split()[1:] for i in open('PID.txt').read().split('\n\n')}
Since the file is structured like that you could follow these steps:
Split based on the two newline characters \n\n into a list
Split each list on a single newline character \n
Drop the first element for a list containing the IDs for each company
Use the first element (mentioned above) as needed for the company name (make sure to remove the colon)
Also, take a look at regular expressions for parsing data like this.
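The steps above can be sketched roughly as follows. This assumes that a blank line separates each company block and that the "Product ID List:" header stands in a block of its own; the function name parse_product_ids is made up for this example.

```python
def parse_product_ids(text):
    """Map each company name to the list of IDs that follow it."""
    companies = {}
    # Step 1: split the text into blocks on blank lines.
    for block in text.split("\n\n"):
        # Step 2: split each block into its non-empty lines.
        lines = [line.strip() for line in block.split("\n") if line.strip()]
        # Skip empty blocks and headers that have no IDs under them.
        if len(lines) < 2 or not lines[0].endswith(":"):
            continue
        # Steps 3 and 4: first line is the name (minus colon), rest are IDs.
        name = lines[0].rstrip(":")
        companies[name] = lines[1:]
    return companies

sample = (
    "Product ID List:\n\n"
    "ABB:\n578SH8\nEFC025\nEFC967\n\n"
    "CNC:\nHDJ834\nHSLA87\n"
)
ids_by_company = parse_product_ids(sample)
print(ids_by_company["ABB"])
```

With the result in a dictionary, the IDs of the chosen company are a single lookup away, for example ids_by_company["CNC"].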
with open('file.txt', 'r') as f:                      # open the file
    next(f)                                           # skip the first line
    results = {}                                      # initialize a dictionary
    for line in f:                                    # iterate through the remainder of the file
        if ':' in line:                               # if the line contains a :
            current = line.strip()                    # strip the whitespace
            results[current] = []                     # and add it as a dictionary entry
        elif line.strip():                            # otherwise, and if content remains after stripping whitespace,
            results[current].append(line.strip())     # append this line to the relevant list
This should at least get you started, you will likely have better luck using dictionaries than lists, at least for the first part of your logic. By what method will you pass the codes along?
a = {}
f1 = open(r"C:\sotest.txt", 'r')  # raw string so the backslash is taken literally
current_key = ''
for row in f1:
    strrow = row.strip('\n')
    if strrow == "":
        pass
    elif ":" in strrow:
        current_key = strrow.strip(':')
        a[current_key] = []
    else:
        a[current_key].append(strrow)
for key in a:
    print(key)
    for item in a[key]:
        print(item)

Trouble sorting a list with python

I'm somewhat new to python. I'm trying to sort through a list of strings and integers. The lists contains some symbols that need to be filtered out (i.e. ro!ad should end up road). Also, they are all on one line separated by a space. So I need to use 2 arguments; one for the input file and then the output file. It should be sorted with numbers first and then the words without the special characters each on a different line. I've been looking at loads of list functions but am having some trouble putting this together as I've never had to do anything like this. Any takers?
So far I have the basic stuff
#!/usr/bin/python
import sys
try:
    infilename = sys.argv[1]
    #outfilename = sys.argv[2]
except IndexError:
    print("Usage: %s infile outfile" % sys.argv[0]); sys.exit(1)
ifile = open(infilename, 'r')
#ofile = open(outfilename, 'w')
data = ifile.readlines()
r = sorted(data, key=lambda item: (int(item.partition(' ')[0])
                                   if item[0].isdigit() else float('inf'), item))
ifile.close()
print('\n'.join(r))
#ofile.writelines(r)
#ofile.close()
The output shows exactly what was in the file but exactly as the file is written and not sorted at all. The goal is to take a file (arg1.txt) and sort it and make a new file (arg2.txt) which will be cmd line variables. I used print in this case to speed up the editing but need to have it write to a file. That's why the output file areas are commented but feel free to tell me I'm stupid if I screwed that up, too! Thanks for any help!
When you have an issue like this, it's usually a good idea to check your data at various points throughout the program to make sure it looks the way you want it to. The issue here seems to be in the way you're reading in the file.
data = ifile.readlines()
is going to read in the entire file as a list of lines. But since all the entries you want to sort are on one line, this list will only have one entry. When you try to sort the list, you're passing a list of length 1, which is going to just return the same list regardless of what your key function is. Try changing the line to
data = ifile.readlines()[0].split()
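You may not even need the key function any more, since strings that start with digits sort before strings that start with letters by default. Note that this default comparison is character by character, so "10" sorts before "9". I don't see anything in your code to remove special characters, though.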
Since they are all on the same line, you don't really need readlines
with open('some.txt') as f:
    data = f.read() #now data = "item 1 item2 etc..."
you can use re to filter out unwanted characters
import re
data = "ro!ad"
fixed_data = re.sub("[!?#$]","",data)
partition may be overkill
data = "hello 23frank sam wilbur"
my_list = data.split() # ["hello","23frank","sam","wilbur"]
print(sorted(my_list))
however, you will need to do more to force the numbers to sort numerically, maybe something like
numbers = [x for x in my_list if x[0].isdigit()]
strings = [x for x in my_list if not x[0].isdigit()]
sorted_list = sorted(numbers, key=lambda x: int(re.sub("[^0-9]", "", x))) + sorted(strings)
Also, they are all on one line separated by a space.
So your file contains a single line?
data = ifile.readlines()
This makes data into a list of the lines in your file. All 1 of them.
r = sorted(...)
This makes r the sorted version of that list.
To get the words from the line, you can .read() the entire file as a single string, and .split() it (by default, it splits on whitespace).
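Pulling the suggestions above together, here is a sketch in Python 3. It makes several assumptions that go beyond the question: "special characters" means anything that is not an ASCII letter or digit, a token counts as a number only if it consists entirely of digits after cleaning, and the helper name clean_and_sort is made up.

```python
import re

def clean_and_sort(line):
    """Strip symbols from each token, then order numbers before words."""
    # Remove everything except ASCII letters and digits from each token.
    tokens = [re.sub(r"[^A-Za-z0-9]", "", token) for token in line.split()]
    # Drop tokens that consisted only of symbols.
    tokens = [token for token in tokens if token]
    # Sort the numbers by value and the words alphabetically.
    numbers = sorted((t for t in tokens if t.isdigit()), key=int)
    words = sorted(t for t in tokens if not t.isdigit())
    return numbers + words

sample = "ro!ad 10 app#le 9 ba?nana 2"
sorted_tokens = clean_and_sort(sample)
print("\n".join(sorted_tokens))
```

To write the result to the output file instead of printing it, the same joined string (plus a final newline) could be passed to the write method of a file opened in 'w' mode.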
