Searching a string from a file - python

I have a human dictionary file, eng.dic, that looks like the list below (imagine that there are close to a billion words in it), and I have to run different word queries quite often.
apple
pear
foo
bar
foo bar
dictionary
sentence
I have a string, let's say "foo-bar". Is there a better (more efficient) way of searching through that file to see whether the string exists? If it exists, return that it exists; if it doesn't, append it to the dictionary file.
def check_and_add(query):
    with open('en_dic', 'r+', encoding='utf8') as dic_file:
        wordlist = [line.replace(" ", "-") for line in dic_file.readlines()]
        en_dic = [word.strip() for word in wordlist]
        if query in en_dic:
            return 1
        else:
            print(query, file=dic_file)   # append the new word

check_and_add("foo-bar")
Are there any built-in search functions in Python, or any libraries I can import to run such searches without much overhead?

As I already mentioned, going through the whole file when its size is significant is not a good idea. Instead, you should use an established solution and:
index the words in the document,
store the results of indexing in appropriate form (I suggest database),
check if the word exists in the file (by checking the database),
if it does not exist, add it to the file and the database.
Storing data in a database is a lot more efficient than trying to reinvent the wheel. If you use SQLite, the database will also be just a single file, so the setup is minimal.
So again, I propose storing the words in an SQLite database, querying it when you want to check whether a word exists in the file, and updating it when you add a new word (a minimal sketch follows below).
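A minimal sketch of that idea, assuming a single words table and the en_dic file name from the question (both names are illustrative, not a fixed API):

import sqlite3

conn = sqlite3.connect('en_dic.sqlite')   # the database itself is just a file
conn.execute('CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)')

def exists_or_add(query, dic_path='en_dic'):
    """Return True if the word is already known, otherwise record it in both places."""
    if conn.execute('SELECT 1 FROM words WHERE word = ?', (query,)).fetchone():
        return True
    conn.execute('INSERT INTO words (word) VALUES (?)', (query,))
    conn.commit()
    with open(dic_path, 'a', encoding='utf8') as f:   # keep the plain-text file in sync
        f.write(query + '\n')
    return False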
To read more on the solution see answers to this question:
The most efficient way to index words in a document

The most efficient way depends on the most frequent operation you will perform on this dictionary.
If you need to re-read the file on every query, read it line by line until you either find your word or reach the end of the file. This is necessary if you have several concurrent workers that can update the file at the same time.
If you don't need to re-read the file each time (e.g. you have only one process working with the dictionary), you can write a more efficient implementation: 1) read all lines into a set (instead of a list), 2) for each "new" word perform both actions: add it to the set and append it to the file (see the sketch below).
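A rough sketch of that single-process variant, assuming one word per line in the en_dic file named in the question:

with open('en_dic', encoding='utf8') as f:
    known_words = {line.strip() for line in f}   # a set gives O(1) membership tests

def add_if_missing(word, dic_path='en_dic'):
    """Return True if the word was already known, otherwise record it."""
    if word in known_words:
        return True
    known_words.add(word)
    with open(dic_path, 'a', encoding='utf8') as f:
        f.write(word + '\n')
    return False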

If it is "pretty large" file, then access the lines sequentially and don't read the whole file into memory:
with open('largeFile', 'r') as inF:
for line in inF:
if 'myString' in line:
# do_something

Related

Python: Removing duplicates from a huge csv file (memory issues)

I have a csv file that is very big, containing a load of different people. Some of these people come up twice. Something like this:
Name,Colour,Date
John,Red,2017
Dave,Blue,2017
Tom,Blue,2017
Amy,Green,2017
John,Red,2016
Dave,Green,2016
Tom,Blue,2016
John,Green,2015
Dave,Green,2015
Tom,Blue,2015
Rebecca,Blue,2015
I want a csv file that contains only the most recent colour for each person. For example, for John, Dave, Tom and Amy I am only interested in the row for 2017. For Rebecca I will need the value from 2015.
The csv file is huge, containing over 10 million records (all people have a unique ID so repeated names don't matter). I've tried something along the lines of the following:
Open csv file.
Read line 1.
If person is not in "seen" list, add the row to csv file 2.
Add person to "seen" list.
Read line 2...
The problem is the "seen" list gets massive and I run out of memory. The other issue is sometimes the dates are not in order so an old entry gets into the "seen" list and then the new entry won't overwrite it. This would be easy to solve if I could sort the data by descending date, but I'm struggling to sort it with the size of the file.
Any suggestions?
If the whole csv file can be stored in a list like:
csv_as_list = [
    (unique_id, color, year),
    …
]
then you can sort this list by:
import operator
# first sort by year descending
csv_as_list.sort(key=operator.itemgetter(2), reverse=True)
# then, since the Python sort is stable, by unique_id
csv_as_list.sort(key=operator.itemgetter(0))
and then you can:
from __future__ import print_function
import operator, itertools
for unique_id, group in itertools.groupby(csv_as_list, operator.itemgetter(0)):
    latest_color = next(group)[1]
    print(unique_id, latest_color)
(I just used print here, but you get the gist.)
If the csv file cannot be loaded in-memory as a list, you'll have to go through an intermediate step that uses disk (e.g. SQLite).
Open your csv file for reading.
Read it line by line, appending the user to final_list if their ID is not already there. If it is found, compare the year of the current row with the one stored in final_list; if the current row is more recent, replace the stored date along with the colour associated with it.
Only then, when your final_list is done, do you write the new csv file (a sketch follows below).
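A rough sketch of this idea, using a dict keyed by the unique ID rather than a list so lookups stay fast (column order taken from the sample data; file names are illustrative). Note that it still keeps one entry per person in memory, so it only helps if the number of distinct people is manageable:

import csv

latest = {}   # unique_id -> (year, colour), keeping only the newest row seen so far

with open('people.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)                        # skip "Name,Colour,Date"
    for person, colour, year in reader:
        if person not in latest or int(year) > latest[person][0]:
            latest[person] = (int(year), colour)

with open('people_latest.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    for person, (year, colour) in latest.items():
        writer.writerow([person, colour, year])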
If you want this task to be faster, you want to...
Optimize your loops.
Use standard Python functions and/or libraries implemented in C.
If this is still not optimized enough... learn C. Reading a csv file in C, parsing it with a separator, and iterating through an array is not hard, even in C.
I see two obvious ways to solve this that don't involve keeping huge amounts of data in memory:
Use a database instead of CSV files
Reorganise your CSV files to facilitate sorting.
Using a database is fairly straightforward. I expect you could even use the SQLite that comes with Python. This would be my preferred option, I think (see the sketch after this answer). To get the best performance, create an index on (person, date).
The second involves letting the first column of your CSV file be the person ID and the second column be the date. Then you could sort the CSV file from the commandline, i.e. sort myfile.csv. This will group all entries for a particular person together, and provided your date is in a proper format (e.g. YYYY-MM-DD), the entry of interest will be the last one. The Unix sort command is not known for its speed, but it's very robust.
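A minimal sketch of the SQLite route, assuming the Name,Colour,Date layout from the question (table and file names are illustrative, and this is written for Python 3):

import csv, sqlite3

conn = sqlite3.connect('people.sqlite')
conn.execute('CREATE TABLE IF NOT EXISTS people (person TEXT, colour TEXT, year INTEGER)')
conn.execute('CREATE INDEX IF NOT EXISTS idx_person_year ON people (person, year)')

with open('people.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)                                  # skip the header row
    conn.executemany('INSERT INTO people VALUES (?, ?, ?)', reader)
conn.commit()

# SQLite documents that the bare columns in a GROUP BY query using MAX()
# are taken from the row holding the maximum, i.e. the newest entry per person.
for person, colour, year in conn.execute(
        'SELECT person, colour, MAX(year) FROM people GROUP BY person'):
    print(person, colour, year)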

Finding and deleting duplicate lines in a file (the fastest and most efficient way)

As the title says, I want to find and delete duplicate lines in a file. That is pretty easy to do... the catch is that I want to know the fastest and most efficient way to do it (let's say you have gigabytes' worth of files and you want to do this as efficiently and as quickly as you can).
If you know some method, however complicated, that can do that, I would like to know it. I have heard of things like loop unrolling and started to second-guess whether the simplest things really are the fastest, so I am curious.
The best solution is to keep a set of the lines seen so far, and yield only the ones not in it. This is essentially the unique_everseen recipe from the itertools documentation:
def unique_lines(filename):
    seen = set()
    with open(filename) as f:
        for line in f:            # iterate lazily rather than calling readlines()
            if line not in seen:
                seen.add(line)
                yield line
and then
for unique_line in unique_lines(filename):
    pass  # do stuff with each unique line
Of course, if you don't care about the order, you can convert the whole text to a set directly, like
set(open(filename).readlines())
Use Python's hashlib to hash every line in the file to a fixed-size digest, and to check whether a line is a duplicate, look its digest up in a set (a rough sketch follows the link below).
Lines can be kept directly in a set; however, hashing reduces the space required.
https://docs.python.org/3/library/hashlib.html
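A rough sketch of that hashing variant, keeping only digests in memory (the choice of SHA-1 and the output path are assumptions):

import hashlib

def write_unique_lines(in_path, out_path):
    """Copy only the first occurrence of each line, storing digests rather than the lines themselves."""
    seen = set()
    with open(in_path, 'rb') as src, open(out_path, 'wb') as dst:
        for line in src:
            digest = hashlib.sha1(line).digest()   # 20 bytes per distinct line
            if digest not in seen:
                seen.add(digest)
                dst.write(line)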

Size issues with Python shelve module

I want to store a few dictionaries using the shelve module, however, I am running into a problem with the size. I use Python 3.5.2 and the latest shelve module.
I have a list of words and I want to create a map from the bigrams (character level) to the words. The structure will look something like this:
'aa': ['aardvark', 'and', ...],
'ab': ['absolute', 'dab', ...],
...
I read in a large file consisting of approximately 1.3 million words. So the dictionary gets pretty large. This is the code:
self.bicharacters = {}  # part of the class
def _create_bicharacters(self):
    '''
    Creates a bicharacter index for calculating Jaccard coefficient.
    '''
    with open('wordlist.txt', encoding='ISO-8859-1') as f:
        for line in f:
            word = line.split('\t')[2]
            for i in range(len(word) - 1):
                bicharacter = (word[i] + word[i+1])
                if bicharacter in self.bicharacters:
                    get = self.bicharacters[bicharacter]
                    get.append(word)
                    self.bicharacters[bicharacter] = get
                else:
                    self.bicharacters[bicharacter] = [word]
When I ran this code using a regular Python dictionary, I did not run into issues, but I can't spare those kinds of memory resources due to the rest of the program also having quite a large memory footprint.
So I tried using the shelve module. However, when I run the code above using shelve, the program stops after a while because it runs out of disk space; the shelve db that was created was around 120GB, and it had still not processed even half of the 1.3M word list from the file. What am I doing wrong here?
The problem here is not so much the number of keys, but that each key references a list of words.
While in memory as one (huge) dictionary, this isn't that big a problem as the words are simply shared between the lists; each list is simply a sequence of references to other objects and here many of those objects are the same, as only one string per word needs to be referenced.
In shelve, however, each value is pickled and stored separately, meaning that a concrete copy of the words in a list will have to be stored for each value. Since your setup ends up adding a given word to a large number of lists, this multiplies your data needs rather drastically.
I'd switch to using a SQL database here. Python comes bundled with sqlite3. If you create one table for the individual words, a second table for the possible bigrams, and a third that simply links the two (a many-to-many mapping, linking bigram row id to word row id), this can be done very efficiently. You can then do very fast lookups, as SQLite is quite adept at managing memory and indices for you (a possible schema is sketched below).
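One way such a three-table layout could look; all table and column names here are illustrative, not taken from the answer:

import sqlite3

conn = sqlite3.connect('bigrams.sqlite')
conn.executescript('''
    CREATE TABLE IF NOT EXISTS words   (id INTEGER PRIMARY KEY, word TEXT UNIQUE);
    CREATE TABLE IF NOT EXISTS bigrams (id INTEGER PRIMARY KEY, bigram TEXT UNIQUE);
    CREATE TABLE IF NOT EXISTS word_bigrams (            -- the many-to-many link
        bigram_id INTEGER REFERENCES bigrams(id),
        word_id   INTEGER REFERENCES words(id),
        PRIMARY KEY (bigram_id, word_id)
    );
''')

def add_word(word):
    """Insert a word and link it to every character bigram it contains."""
    conn.execute('INSERT OR IGNORE INTO words (word) VALUES (?)', (word,))
    word_id = conn.execute('SELECT id FROM words WHERE word = ?', (word,)).fetchone()[0]
    for i in range(len(word) - 1):
        bigram = word[i:i+2]
        conn.execute('INSERT OR IGNORE INTO bigrams (bigram) VALUES (?)', (bigram,))
        bigram_id = conn.execute('SELECT id FROM bigrams WHERE bigram = ?', (bigram,)).fetchone()[0]
        conn.execute('INSERT OR IGNORE INTO word_bigrams VALUES (?, ?)', (bigram_id, word_id))
    conn.commit()

Looking up all words that share a bigram then becomes a single JOIN, and each word string is stored exactly once.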

Choosing random number from list, and then removing it?

Let's say that I have a separate text file that contains a series of numbers:
1
2
3
And so on. Is it possible for a Python program to randomly choose one of the numbers in that text file, and then remove that number from the text file? I know it is possible to do the first part, but I am struggling with the second.
If it helps, the list is about 180000 numbers long. I am very new at this. The idea is to randomly assign a player a number, and then remove that number from the list so another player can't get it.
Do you actually have 180,000 players? If not, what about solving the problem the other way round:
Create a file listing the IDs already used
For each new user:
Create a fairly large random ID (like the ones in your current file)
Run through the 'used' IDs in your file and check your new ID doesn't collide with an existing one - if it does, generate new ones until there is no collision
Append the new ID to your file
This will be much faster than reading, checking and writing a large file each time. If your IDs are large, you won't get many collisions.
You could also optimise the process, for example using a two-part ID consisting of today's date and a random number. You would then keep a file for each day, and only need to check for collisions with the IDs issued today.
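A rough sketch of that two-part ID idea, assuming an ID made of today's date plus a random number and one file of used IDs per day (the file naming and ID format are illustrative):

import datetime
import os
import random

def new_player_id():
    today = datetime.date.today().isoformat()        # e.g. '2024-05-01'
    used_path = 'used_ids_%s.txt' % today            # one "used" file per day
    used = set()
    if os.path.exists(used_path):
        with open(used_path) as f:
            used = {line.strip() for line in f}
    while True:
        candidate = '%s-%06d' % (today, random.randrange(1000000))
        if candidate not in used:                    # only today's IDs need checking
            with open(used_path, 'a') as f:
                f.write(candidate + '\n')
            return candidate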
My suggestion would be to read the entire text file, make whatever changes you want to it, and then rewrite the original contents of the file; that is the best way as far as I know.
If the file is small, read the whole thing into a list, delete a value from the list, then write the new list to a temp file. Finally, rename the temp file to the original filename.
If the file is large, read the file one line at a time, writing the values (except one) to a temp file. Then rename the temp file to the original filename.
Like dstromberg said, if the file is small, check out the documentation on file IO and this answer's strategy for writing lists to a file. Note that writelines() "does not add line separators."
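A short sketch of the small-file approach described above, using a temporary file and an atomic rename (the file names are assumptions):

import os
import random

def draw_number(path='numbers.txt'):
    """Pick a random number from the file and rewrite the file without it."""
    with open(path) as f:
        numbers = [line.strip() for line in f if line.strip()]
    chosen = random.choice(numbers)
    numbers.remove(chosen)                        # drops the first matching entry
    tmp_path = path + '.tmp'
    with open(tmp_path, 'w') as f:
        f.write('\n'.join(numbers) + '\n')        # note: writelines() would not add separators
    os.replace(tmp_path, path)                    # rename the temp file over the original
    return chosen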

Searching for duplicate records within a text file where the duplicate is determined by only two fields

First, Python Newbie; be patient/kind.
Next, once a month I receive a large text file (think 7 million records) to test for duplicate values. This is catalog information. I get 7 fields, but the two I'm interested in are a supplier code and a full orderable part number. To determine whether a record is duplicated, I strip all special characters from the part number (except . and #) and create a compressed part number. The test for duplicates becomes the supplier code and compressed part number combination. This part is fairly straightforward. Currently, I am just copying the original file with 2 new columns (compressed part and duplicate indicator). If the part is a duplicate, I put a "YES" in the last field. Now that this is done, I want to be able to go back (or better yet, at the same time) to get the previous record where there was a supplier code/compressed part number match.
So far, my code looks like this:
# Compress Full Part to a Compressed Part
# and Check for Duplicates on Supplier Code
# and Compressed Part combination
import sys
import re
import time
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
start=time.time()
try:
    file1 = open("C:\Accounting\May Accounting\May.txt", "r")
except IOError:
    print >> sys.stderr, "Cannot Open Read File"
    sys.exit(1)
try:
    file2 = open(file1.name[0:len(file1.name)-4] + "_" + "COMPRESSPN.txt", "a")
except IOError:
    print >> sys.stderr, "Cannot Open Write File"
    sys.exit(1)
hdrList="CIGSUPPLIER|FULL_PART|PART_STATUS|ALIAS_FLAG|ACQUISITION_FLAG|COMPRESSED_PART|DUPLICATE_INDICATOR"
file2.write(hdrList+chr(10))
lines_seen=set()
affirm="YES"
records = file1.readlines()
for record in records:
    fields = record.split(chr(124))
    if fields[0]=="CIGSupplier":
        continue  #If incoming file has a header line, skip it
    file2.write(fields[0]+"|"), #Supplier Code
    file2.write(fields[1]+"|"), #Full_Part
    file2.write(fields[2]+"|"), #Part Status
    file2.write(fields[3]+"|"), #Alias Flag
    file2.write(re.sub("[$\r\n]", "", fields[4])+"|"), #Acquisition Flag
    file2.write(re.sub("[^0-9a-zA-Z.#]", "", fields[1])+"|"), #Compressed_Part
    dupechk=fields[0]+"|"+re.sub("[^0-9a-zA-Z.#]", "", fields[1])
    if dupechk not in lines_seen:
        file2.write(chr(10))
        lines_seen.add(dupechk)
    else:
        file2.write(affirm+chr(10))
print "it took", time.time() - start, "seconds."
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
file2.close()
file1.close()
It runs in less than 6 minutes, so I am happy with this part, even if it is not elegant. Right now, when I get my results, I import them into Access and do a self join to locate the duplicates. Loading/querying/exporting a file this size in Access takes around an hour, so I would like to be able to export the matched duplicates to another text file or an Excel file.
Confusing enough?
Thanks.
Maybe you could consider building a dictionary mapping (supplier_number, compressed_part_number) tuples to data structures (nested lists perhaps, or instances of a custom class for improved readability & maintainability) that hold the line numbers at which records matching the key tuple appear in your file, plus possibly the complete records themselves.
This would end up putting all the data from the file into a large in-memory dictionary, which might or might not be a problem depending on your requirements; if you skip the actual records and only hold line numbers, the dictionary will be much smaller.
You can then iterate over the entries in the dictionary spitting out the duplicates to a file as you go.
I think you should sort the entries in the input file first. Maybe it will consume too much memory, but you should first try to read all input in memory, sort this based upon the value of dupechk and then you can iterate over all entries and easily see if there are two or more identical records. Because identical records are grouped, it is easy to output just those records.
This might be more efficient/feasible for the large files you are dealing with:
Sort the file based on the supplier code and compressed part number - dump it to a temporary file. I don't think it is worth actually tacking on the compressed part number, just compute it from the full part number when needed. However, that is pure conjecture and definitely deserves some quick benchmarking.
Iterate through the temporary file (might want to take advantage of 'with'). Check if current line's supplier code and compressed part number is identical to previous one - if it is, you have identified a duplicate. Handle as you see fit. Since the file is sorted you reduce the memory requirement of needing to store all the lines in memory to a set of consecutive identical lines.
You are already reading the whole file into memory. You don't need to sort. Instead of a set, have a dict mapping (supplier, compressed_pn) to line_number_last_seen - 1. That way, when you discover a duplicate, you can output the two duplicate records immediately. This method requires only one pass over the file. You don't need to write a temporary file.
If you often have 3 or more records with the same key, you may wish to use an approach that maps the key to a list of line indices. At the end of reading the file, you iterate over the dictionary looking for lists with more than 1 entry.
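A compact sketch of that single-pass dict approach (field positions follow the pipe-delimited layout in the question; file names are illustrative, and it assumes duplicates mostly come in pairs):

import re

first_seen = {}   # (supplier, compressed_pn) -> (line number, original record)

with open("May.txt") as infile, open("duplicates.txt", "w") as dupes:
    for line_number, record in enumerate(infile, 1):
        fields = record.rstrip("\r\n").split("|")
        compressed = re.sub("[^0-9a-zA-Z.#]", "", fields[1])
        key = (fields[0], compressed)
        if key in first_seen:
            # output both records as soon as the duplicate is discovered
            prev_line, prev_record = first_seen[key]
            dupes.write(prev_record)
            dupes.write(record)
        else:
            first_seen[key] = (line_number, record)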
Couple of comments:
Using file.readlines on a large file is wasteful - it's reading the entire file into memory. You should, instead, take advantage that a file is iterable, reading a single line at a time by default.
Your file format is basically a CSV, with a pipe instead of a comma as the separator. So, use the csv module. It is written in C and avoids most of the interpreter overhead. It also provides a nice iterable interface that does not require reading the whole file into memory, either.
You should additionally use a DictReader from the csv module. If the header is in the file, great, the class will parse it and use as the keys further on. If not, specify the header in the code. Either way, fields[0] is uninformative and error prone. fields["CIGSUPPLIER"] is much more self-documenting.
Just as with reading, use the csv module for writing. Again, you can specify the delimiter.
Don't use file2.write(chr(10)). Use file2.write('\n'), and open your file appropriately. Alternatively, if you're using the csv.writer class, these become unnecessary.
Otherwise, your logic and flow look alright. Overall I'd advise against the chr(*) calls unless the character is truly unprintable; newlines and pipes are printable (or have supported escapes) and should be written as such. A rough sketch using the csv module follows below.
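What the csv-module version might look like, written for Python 3; the field names follow the output header in the question, while the input/output file names and the five-column input assumption are illustrative:

import csv
import re

fieldnames = ["CIGSUPPLIER", "FULL_PART", "PART_STATUS", "ALIAS_FLAG",
              "ACQUISITION_FLAG", "COMPRESSED_PART", "DUPLICATE_INDICATOR"]

seen = set()
with open("May.txt", newline="") as src, open("May_COMPRESSPN.txt", "w", newline="") as dst:
    reader = csv.DictReader(src, fieldnames=fieldnames[:5], delimiter="|", restkey="_EXTRA")
    writer = csv.DictWriter(dst, fieldnames=fieldnames, delimiter="|")
    writer.writeheader()
    for row in reader:
        if row["CIGSUPPLIER"] == "CIGSupplier":
            continue                                   # skip a header line if the file has one
        row.pop("_EXTRA", None)                        # drop any trailing input columns
        row["COMPRESSED_PART"] = re.sub("[^0-9a-zA-Z.#]", "", row["FULL_PART"])
        key = (row["CIGSUPPLIER"], row["COMPRESSED_PART"])
        row["DUPLICATE_INDICATOR"] = "YES" if key in seen else ""
        seen.add(key)
        writer.writerow(row)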
