Selective concatenation of two huge files - python

I have two really huge flat text files (> 10 GB each). Each file consists of many lines; each line is a string (about 80 bytes), then a separator, and then another, bigger string.
The first string acts like a unique key in the first file, but it can be repeated in the second file.
So I need to produce a result file, and each of its lines should contain the key (possibly duplicated, as in the second file), the separator, the second string from the first file, and then the second string from the second file.
I'm thinking of using a dict to store info from the first file: key = someHash(str1), value = position in the file, and then iterating over the second file and printing the result to a third file.
But I don't know which hash should be used, or whether one should be used at all.
And how do I resolve possible collisions?
And finally, how do I build an efficient (memory + time) solution to this problem?

The hashes that ship with Python (in the hashlib module) are designed to be cryptographically strong, which means, in simple terms, that they're processor intensive. See this question for other options if you do decide to go with the script solution.
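For reference, here is a minimal sketch of the offset-dictionary idea from the question, assuming a tab separator and placeholder file names. It stores the raw key rather than a hash, which sidesteps the collision question; if the keys themselves don't fit in memory, store someHash(key) instead and resolve collisions by re-reading the candidate line from the first file and comparing the actual keys.

SEP = b"\t"  # assumed separator

# Pass 1: remember where each key's line starts in file1.
offsets = {}
with open("file1.txt", "rb") as f1:
    pos = f1.tell()
    for line in iter(f1.readline, b""):
        key = line.split(SEP, 1)[0]
        offsets[key] = pos
        pos = f1.tell()

# Pass 2: stream file2, look each key up, and write key|value1|value2.
with open("file1.txt", "rb") as f1, \
     open("file2.txt", "rb") as f2, \
     open("result.txt", "wb") as out:
    for line in f2:
        key, _, value2 = line.rstrip(b"\r\n").partition(SEP)
        pos = offsets.get(key)
        if pos is None:
            continue  # key only present in file2
        f1.seek(pos)
        value1 = f1.readline().rstrip(b"\r\n").partition(SEP)[2]
        out.write(key + SEP + value1 + SEP + value2 + b"\n")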

subsetting very large files - python methods for optimal performance

I have one file (index1) with 17,270,877 IDs, and another file (read1) with a subset of these IDs (17,211,741). For both files, the IDs are on every 4th line.
I need a new (index2) file that contains only the IDs in read1. For each of those IDs I also need to grab the next 3 lines from index1. So I'll end up with index2 whose format exactly matches index1 except it only contains IDs from read1.
I am trying to implement the methods I've read here. But I'm stumbling on these two points: 1) I need to check IDs on every 4th line, but I need all of the data in index1 (in order) because I have to write the associated 3 lines following the ID. 2) unlike that post, which is about searching for one string in a large file, I'm searching for a huge number of strings in another huge file.
Can some folks point me in some direction? Maybe none of those 5 methods are ideal for this. I don't know any information theory; we have plenty of RAM so I think holding the data in RAM for searching is the most efficient? I'm really not sure.
Here is a sample of what index1 looks like (IDs start with #M00347):
#M00347:30:000000000-BCWL3:1:1101:15589:1332 1:N:0:0
CCTAAGGTTCGG
+
CDDDDFFFFFCB
#M00347:30:000000000-BCWL3:1:1101:15667:1332 1:N:0:0
CGCCATGCATCC
+
BBCCBBFFFFFF
#M00347:30:000000000-BCWL3:1:1101:15711:1332 1:N:0:0
TTTGGTTCCCGG
+
CDCDECCFFFCB
read1 looks very similar, but the lines before and after the '+' are different.
If the data of index1 can fit in memory, the best approach is to do a single scan of that file and store all the data in a dictionary like this:
{"#M00347:30:000000000-BCWL3:1:1101:15589:1332 1:N:0:0":["CCTAAGGTTCGG","+","CDDDDFFFFFCB"],
"#M00347:30:000000000-BCWL3:1:1101:15667:1332 1:N:0:0":["CGCCATGCATCC","+","BBCCBBFFFFFF"],
..... }
Values can be stored as a formatted string instead, if you prefer.
After this, you can do a single scan of read1, and when an ID is encountered, do a simple lookup in the dictionary to retrieve the needed data.
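A minimal sketch of that single-scan approach, assuming the first whitespace-separated token of each ID line is what the two files share, and using placeholder file names:

index = {}
with open("index1.fastq") as f:
    for id_line in f:
        record = [next(f), next(f), next(f)]   # the 3 lines following the ID
        key = id_line.split()[0]               # e.g. "#M00347:30:...:15589:1332"
        index[key] = [id_line] + record

with open("read1.fastq") as reads, open("index2.fastq", "w") as out:
    for i, line in enumerate(reads):
        if i % 4 == 0:                         # IDs sit on every 4th line
            key = line.split()[0]
            if key in index:
                out.writelines(index[key])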

Quickest and most efficient way to search large sorted text file

I have a large static text/csv file, which contains approx 100k rows (2MB). It's essentially a dictionary, and I need to perform regular lookups on this data in Python.
The format of the file is:
key value1 value2
alpha x1 x2
alpha beta y1 y2
gamma z1 z2
...
The keys can be multi-word strings.
The list is sorted in alphabetical order by the key
The values are strings
This is part of a web application where every user will be looking up 100-300 keys at a time, and will expect to get both value 1 and value 2 for each of those keys. There will be up to 100 users on the application each looking up those 100-300 keys over the same data.
I just need to return the first exact match. For example, if the user searched for the keys [alpha, gamma], I just need to return [('x1','x2'), ('z1','z2')], which represents the first exact match of 'alpha' and 'gamma'.
I've been reading about the options I have, and I'd really love your input on which of the following approaches is best for my use case.
Read the file once into an ordered set, and perform the 200 or so lookups. However, for every user using the application (~100), the file will be loaded into memory.
Read the file once into a list, and use binary search (e.g. bisect). Similar problem as 1.) the file will be loaded into memory for every user who needs to do a search.
Don't read the entire file into memory, and just read the file one line at a time. I can split the .csv into 26 files by each letter (a.csv, b.csv, ...) to speed this up a bit.
Whoosh is a search library that caught my eye since it created an index once. However, I'm not sure if it's applicable for my use case at all as it looks like a full text search and I can't limit to just looking up the first column. If this specific library is not an option, is there any other way I can create a reusable index in Python to support these kinds of lookups?
I'm really open to ideas and I'm in no way restricted to the four options above!
Thank you :)
How about something similar to approach #2? You could still read the file into memory, but instead of storing it in a list and using binary search to look up keys, you could store it in a hash map.
The benefit of doing this is to take advantage of a hash map's average lookup time of O(1) with a worst case of O(n). The time complexity benefit and justification can be found here and here. Since you're only looking up keys, having constant lookup time would be a great way to search through the file. This method would also be faster than binary search's average O(log n) search time.
You could store your file as
table = {
    key1: (value1, value2),
    key2: (value1, value2),
    key3: (value1, value2),
}
Note that this method is only viable if your keys are all distinct, i.e. the file contains no duplicate keys.
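A minimal sketch of that approach, assuming the last two whitespace-separated fields of each line are value1 and value2 (so multi-word keys are everything before them), with a placeholder file name. Loading the dict once at application start (e.g. at module import) avoids re-reading the file for every user:

table = {}
with open("lookup.txt") as f:
    for line in f:
        parts = line.rstrip("\n").rsplit(None, 2)   # split off the last two fields
        if len(parts) == 3:
            key, v1, v2 = parts
            table.setdefault(key, (v1, v2))         # keep only the first exact match

def lookup(keys):
    """Return (value1, value2) for each requested key that exists."""
    return [table[k] for k in keys if k in table]

# lookup(["alpha", "gamma"]) -> [('x1', 'x2'), ('z1', 'z2')]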

Choosing random number from list, and then removing it?

Let's say that I have a separate text file that contains a series of numbers:
1
2
3
And so on. Is it possible for a Python program to randomly choose one of the numbers in that text file, and then remove that number from the text file? I know it is possible to do the first part, but I am struggling with the second.
If it helps, the list is about 180000 numbers long. I am very new at this. The idea is to randomly assign a player a number, and then remove that number from the list so another player can't get it.
Do you actually have 180,000 players? If not, what about solving the problem the other way round:
Create a file listing the IDs already used
For each new user:
Create a fairly large random ID (like the ones in your current file)
Run through the 'used' IDs in your file and check your new ID doesn't collide with an existing one - if it does, generate new ones until there is no collision
Append the new ID to your file
This will be much faster than reading, checking and writing a large file each time. If your IDs are large, you won't get many collisions.
You could also optimise the process, for example using a two-part ID consisting of today's date and a random number. You would then keep a file for each day, and only need to check for collisions with the IDs issued today.
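A minimal sketch of that scheme (the file name and ID size are assumptions):

import random

def new_player_id(used_path="used_ids.txt"):
    try:
        with open(used_path) as f:
            used = set(line.strip() for line in f)
    except FileNotFoundError:
        used = set()
    while True:
        candidate = str(random.randrange(10**8, 10**9))   # fairly large random ID
        if candidate not in used:
            break
    with open(used_path, "a") as f:
        f.write(candidate + "\n")
    return candidate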
My suggestion would be to read the entire text file, make whatever changes you want to it, and then rewrite the original contents of the file; that is the best way as far as I know.
If the file is small, read the whole thing into a list, delete a value from the list, then write the new list to a temp file. Finally, rename the temp file to the original filename.
If the file is large, read the file one line at a time, writing the values (except one) to a temp file. Then rename the temp file to the original filename.
Like dstromberg said, if the file is small, check out the documentation on file IO and this answer's strategy for writing lists to a file. Note that writelines() "does not add line separators."
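Putting the small-file variant together, here is a minimal sketch (the file name is a placeholder): read every number, pick one at random, remove it, write a temp file, and rename the temp file over the original.

import os
import random
import tempfile

def draw_number(path="numbers.txt"):
    with open(path) as f:
        numbers = [line.strip() for line in f if line.strip()]
    chosen = random.choice(numbers)
    numbers.remove(chosen)                     # drop the first occurrence
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as tmp:
        tmp.write("\n".join(numbers) + "\n")
    os.replace(tmp_path, path)                 # rename the temp file over the original
    return chosen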

Taking a specific range of data in a CSV file (Python)

Basically, what I want to do here is read in a specific range of data (Say, 10,000 values) and see if it contains a match that I'm looking for. If it doesn't contain that match, then it throws out those values and takes the next 10,000.
For example, if I have the MD5 hash "fac2a47adace059aff113283a03f6760" (The value of which is stack), I will load 10,000 values from a CSV file and check to see if the MD5 hash in that line matches up with my given hash. If it does, then I print out the value after the comma on that line, and if it doesn't then throw those 10,000 values out of memory and take the 10,000 after that until I get a value.
Apologies if this is a bit unclear... I can't think of a crystal-clear way of explaining it. My current method is dumping a dictionary containing all the combinations of characters (up to 5) to a text file via JSON and loading it back into memory to be searched, which doesn't work with 5 characters (it throws a MemoryError).
Thanks in advance for any help, and let me know if you need clarification!
Assuming that the matching line looks like 'fac2a47adace059aff113283a03f6760,stack', you basically want to do this:
import csv

for row in csv.reader(csvfile):
    if row[0] == "fac2a47adace059aff113283a03f6760":
        print(row[1])
        break
If your hash isn't in the first column or your pre-hash value isn't in the second, adjust the [0] and [1] to the right indexes.

Searching for duplicate records within a text file where the duplicate is determined by only two fields

First, Python Newbie; be patient/kind.
Next, once a month I receive a large text file (think 7 million records) to test for duplicate values. This is catalog information. I get 7 fields, but the two I'm interested in are a supplier code and a full orderable part number. To determine whether a record is duplicated, I strip all special characters from the part number (except . and #) and create a compressed part number. The test for duplicates becomes the supplier code and compressed part number combination. This part is fairly straightforward. Currently, I am just copying the original file with 2 new columns (compressed part and duplicate indicator). If the part is a duplicate, I put a "YES" in the last field. Now that this is done, I want to be able to go back (or better yet, at the same time) and get the previous record where there was a supplier code/compressed part number match.
So far, my code looks like this:
# Compress Full Part to a Compressed Part
# and Check for Duplicates on Supplier Code
# and Compressed Part combination
import sys
import re
import time
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
start = time.time()

try:
    file1 = open("C:\Accounting\May Accounting\May.txt", "r")
except IOError:
    print >> sys.stderr, "Cannot Open Read File"
    sys.exit(1)

try:
    file2 = open(file1.name[0:len(file1.name)-4] + "_" + "COMPRESSPN.txt", "a")
except IOError:
    print >> sys.stderr, "Cannot Open Write File"
    sys.exit(1)

hdrList = "CIGSUPPLIER|FULL_PART|PART_STATUS|ALIAS_FLAG|ACQUISITION_FLAG|COMPRESSED_PART|DUPLICATE_INDICATOR"
file2.write(hdrList + chr(10))

lines_seen = set()
affirm = "YES"

records = file1.readlines()
for record in records:
    fields = record.split(chr(124))
    if fields[0] == "CIGSupplier":
        continue  # If incoming file has a header line, skip it
    file2.write(fields[0] + "|")  # Supplier Code
    file2.write(fields[1] + "|")  # Full_Part
    file2.write(fields[2] + "|")  # Part Status
    file2.write(fields[3] + "|")  # Alias Flag
    file2.write(re.sub("[$\r\n]", "", fields[4]) + "|")  # Acquisition Flag
    file2.write(re.sub("[^0-9a-zA-Z.#]", "", fields[1]) + "|")  # Compressed_Part
    dupechk = fields[0] + "|" + re.sub("[^0-9a-zA-Z.#]", "", fields[1])
    if dupechk not in lines_seen:
        file2.write(chr(10))
        lines_seen.add(dupechk)
    else:
        file2.write(affirm + chr(10))

print "it took", time.time() - start, "seconds."
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
file2.close()
file1.close()
It runs in less than 6 minutes, so I am happy with this part, even if it is not elegant. Right now, when I get my results, I import them into Access and do a self join to locate the duplicates. Loading/querying/exporting a file this size in Access takes around an hour, so I would like to be able to export the matched duplicates to another text file or an Excel file.
Confusing enough?
Thanks.
Maybe you could consider building a dictionary that maps (supplier_number, compressed_part_number) tuples to data structures (nested lists, perhaps, or instances of a custom class for improved readability and maintainability) holding the line numbers on which records matching the key tuple appear in your file, plus possibly the complete records themselves.
This would end up putting all the data from the file into a large in-memory dictionary, which might or might not be a problem depending on your requirements; if you skip the actual records and only hold line numbers, the dictionary will be much smaller.
You can then iterate over the entries in the dictionary spitting out the duplicates to a file as you go.
I think you should sort the entries in the input file first. Maybe it will consume too much memory, but you should first try to read all input into memory, sort it based on the value of dupechk, and then iterate over all entries; it is then easy to see whether there are two or more identical records. Because identical records are grouped together, it is easy to output just those records.
This might be more efficient/feasible for the large files you are dealing with:
Sort the file based on the supplier code and compressed part number, and dump it to a temporary file. I don't think it is worth actually tacking the compressed part number on; just compute it from the full part number when needed. However, that is pure conjecture and definitely deserves some quick benchmarking.
Iterate through the temporary file (you might want to take advantage of 'with'). Check whether the current line's supplier code and compressed part number are identical to the previous line's; if they are, you have found a duplicate. Handle it as you see fit. Since the file is sorted, the memory requirement drops from storing all the lines to storing just a run of consecutive identical lines.
You are already reading the whole file into memory. You don't need to sort. Instead of a set, have a dict mapping (supplier, compressed_pn) to line_number_last_seen - 1. That way, when you discover a duplicate, you can output the two duplicate records immediately. This method requires only one pass over the file. You don't need to write a temporary file.
If you often have 3 or more records with the same key, you may wish to use an approach that maps the key to a list of line indices. At the end of reading the file, you iterate over the dictionary looking for lists with more than 1 entry.
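A minimal sketch of that single-pass idea (file names are placeholders); it keeps the first record seen for each (supplier code, compressed part) key rather than its line number, which keeps the sketch self-contained, and writes both records out the moment a duplicate appears.

import re

first_seen = {}
with open("May.txt") as infile, open("May_duplicates.txt", "w") as dupes:
    for line in infile:
        fields = line.rstrip("\r\n").split("|")
        if fields[0] == "CIGSupplier":
            continue                              # skip the header line
        compressed = re.sub("[^0-9a-zA-Z.#]", "", fields[1])
        key = (fields[0], compressed)
        if key in first_seen:
            if first_seen[key] is not None:       # first duplicate for this key:
                dupes.write(first_seen[key])      # also emit the original record
                first_seen[key] = None
            dupes.write(line)
        else:
            first_seen[key] = line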
Couple of comments:
Using file.readlines on a large file is wasteful - it reads the entire file into memory. You should instead take advantage of the fact that a file object is iterable, yielding a single line at a time.
Your file format is basically a CSV, with a pipe instead of a comma as the separator. So, use the csv module. It is written in C and avoids most of the interpreter overhead. It also provides a nice iterable interface which does not require reading the whole file into memory either.
You should additionally use a DictReader from the csv module. If the header is in the file, great, the class will parse it and use the names as keys further on. If not, specify the header in the code. Either way, fields[0] is uninformative and error-prone; fields["CIGSUPPLIER"] is much more self-documenting.
Just as with reading, use the csv module for writing. Again, you can specify the delimiter.
Don't use file2.write(chr(10)). Use file2.write('\n'), and open your file appropriately. Alternatively, if you're using the csv.writer class, these become unnecessary.
Otherwise, your logic and flow look alright. Overall I'd advise against the chr(*) calls unless the character is truly unprintable; newlines and pipes are printable (or have supported escapes) and should be written as such.
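To make those comments concrete, here is a minimal sketch using csv.DictReader and csv.DictWriter with "|" as the delimiter. The input column names are an assumption based on the output header shown earlier (adjust them to the real file), and the file names are placeholders.

import csv
import re

in_fields = ["CIGSUPPLIER", "FULL_PART", "PART_STATUS", "ALIAS_FLAG", "ACQUISITION_FLAG"]
out_fields = in_fields + ["COMPRESSED_PART", "DUPLICATE_INDICATOR"]

seen = set()
with open("May.txt", newline="") as infile, \
     open("May_COMPRESSPN.txt", "w", newline="") as outfile:
    reader = csv.DictReader(infile, fieldnames=in_fields, delimiter="|")
    writer = csv.DictWriter(outfile, fieldnames=out_fields, delimiter="|",
                            extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        if row["CIGSUPPLIER"] == "CIGSupplier":
            continue                              # skip a header row, if present
        compressed = re.sub("[^0-9a-zA-Z.#]", "", row["FULL_PART"])
        key = (row["CIGSUPPLIER"], compressed)
        row["COMPRESSED_PART"] = compressed
        row["DUPLICATE_INDICATOR"] = "YES" if key in seen else ""
        seen.add(key)
        writer.writerow(row)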
