Python fast string parsing, manipulation

I am using Python to parse an incoming comma-separated string and then do some calculations on the data.
Each string is about 800 characters long, with 120 comma-separated fields.
There are about 1.2 million such strings to process.
for v in item.values():
    l.extend(get_fields(v.split(',')))
# process l
get_fields uses operator.itemgetter() to extract around 20 of the 120 fields.
This entire operation takes about 4-5 minutes, excluding the time to bring in the data.
In a later part of the program I insert these lines into an SQLite in-memory table for further use.
But 4-5 minutes just for parsing and building a list is too long for my project.
I run this processing in around 6-8 threads.
Would switching to C/C++ help?
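For context, a get_fields built on operator.itemgetter() is presumably something like this minimal sketch (the field indices here are made up; the real ones come from the record layout):

import operator

# hypothetical indices of the ~20 interesting fields out of 120
FIELD_INDICES = (0, 3, 7, 12, 15, 21, 34, 47, 58, 63,
                 70, 77, 81, 88, 92, 99, 104, 110, 115, 119)

get_fields = operator.itemgetter(*FIELD_INDICES)

# usage: returns a tuple of the selected fields from one split record
# selected = get_fields(line.split(','))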

Are you loading a dict with your file records? Probably better to process the data directly:
datafile = open("file_with_1point2million_records.dat")
# uncomment the next line to skip over a header record
# next(datafile)
l = sum((list(get_fields(v.split(','))) for v in datafile), [])
This avoids creating any overall data structure, and only accumulates the desired values as returned by get_fields.

Your program might be slowing down trying to allocate enough memory for 1.2M strings. In other words, the speed problem might not be due to the string parsing/manipulation, but rather to the l.extend. To test this hypothesis, you could put a print statement in the loop:
for v in item.values():
    print('got here')
    l.extend(get_fields(v.split(',')))
If the print statements get slower and slower, you can probably conclude l.extend is the culprit. In this case, you may see significant speed improvement if you can move the processing of each line into the loop.
PS: You probably should be using the csv module to take care of the parsing for you in a more high-level manner, but I don't think that will affect the speed very much.
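If you do try the csv module, the parsing loop would look roughly like this (a sketch; the filename is borrowed from the earlier answer, and get_fields is assumed to be the same itemgetter as before):

import csv
import operator

get_fields = operator.itemgetter(0, 3, 7)  # placeholder: your real ~20 indices

l = []
with open("file_with_1point2million_records.dat") as f:
    for row in csv.reader(f):
        l.extend(get_fields(row))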

Related

Python - scalability with respect to run time and memory usage is important

I have Python scripts to filter massive data in a csv file. The requirement asks for considering scalability with respect to run time and memory usage.
I wrote 2 scripts, and both of them work fine for filtering the data. Regarding scalability, I decided to use a Python generator, because it uses an iterator and doesn't keep much data in memory.
When I compared the running times of the 2 scripts, I found the following:
script 1 - uses a generator - takes more time - 0.0155925750732s
import re
import sympy

def each_sentence(text):
    match = re.match(r'[0-9]+', text)
    num = int(text[match.start():match.end()])
    if sympy.isprime(num) == False:
        yield text.strip()

with open("./file_testing.csv") as csvfile:
    for line in csvfile:
        for text in each_sentence(line):
            print(text)
script 2 - uses split and no generator - takes less time - 0.00619888305664s
import sympy

with open("./file_testing.csv") as csvfile:
    for line in csvfile:
        array = line.split(',')
        num = int(array[0])
        if sympy.isprime(num) == False:
            print(line.strip())
To meet the requirement, do I need to use a Python generator? Or do you have any suggestions or recommendations?
To meet the requirement, do I need to use python generator?
No, you don't. Script 1 doesn't make sense: the generator is executed once per line and yields at most one result on its first iteration, so it buys you nothing here.
Any suggestions or recommendations?
You need to learn about three things: complexity, parallelization and caching.
Complexity basically means "if I double the size of input data (csv file), do I need twice the time? Or four times? Or what"?
Parallelization means attacking a problem in a way that makes it easy to add more resources for solving it.
Caching is important. Things get much faster if you don't have to re-create everything all the time, but you can re-use stuff you have already generated.
The main loop for line in csvfile: already scales very well unless the csv file contains extremely long lines.
Script 2 contains a bug: if the first cell of a line is not an integer, then int(array[0]) will raise a ValueError.
The isprime function is probably the "hotspot" in your code, so you can try to parallelize it with multiple threads or sub-processes.
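A minimal sketch of that parallelization idea using multiprocessing (the regex and prime test come from the question; the chunk size is an arbitrary choice):

import re
from multiprocessing import Pool

import sympy

def composite_line_or_none(line):
    # return the stripped line if its leading number is composite, else None
    match = re.match(r'[0-9]+', line)
    if match is None:
        return None
    num = int(match.group())
    if not sympy.isprime(num):
        return line.strip()
    return None

if __name__ == "__main__":
    with open("./file_testing.csv") as csvfile, Pool() as pool:
        # imap streams lines to the worker processes in chunks
        for result in pool.imap(composite_line_or_none, csvfile, chunksize=1000):
            if result is not None:
                print(result)

Whether this helps depends on how expensive sympy.isprime is per line compared with the cost of shipping lines between processes.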
Split your analysis into two discrete regular expression results: A small result with 10 values, and a large result with 10,000,000 values. This question is about the average len() of match, as much as it is about the len() of csvfile.
With a small re result - 10 bytes
1st code block will have slower run time, and relatively low memory usage.
2nd code block will have faster run time, and also relatively low memory usage.
With a large re result - 10,000,000 bytes
1st code block will have slower run time, and very little memory usage.
2nd code block will have faster run time, and very large memory usage.
Bottom line:
If you are supposed to build a function with both run time and memory in mind, then the generator (yield) approach is definitely the way to go when the problem requires a scalable solution across different result sizes.
Another scalability question: what if the re result is None? I would slightly modify the code as below:
def each_sentence(text):
    match = re.match(r'[0-9]+', text)
    if match is not None:
        num = int(text[match.start():match.end()])
        if sympy.isprime(num) == False:
            yield text.strip()

with open("./file_testing.csv") as csvfile:
    for line in csvfile:
        for text in each_sentence(line):
            print(text)

Python: improve performance in log writing to file

I am writing a piece of code that generates new parameter values over a double for loop and stores these values in a file. The loop iteration count can go as high as 10,000 * 100,000. I have stored the variable values in a string, which gets appended with the newer values on every iteration. Finally, at the end of the loops, I write the complete string to a txt file.
op = open("output file path", "w+")
totresult = ""
for second in range(n):            # this user input parameter can be up to 100,000
    result = ""
    for car in cars_running:       # number of cars can be 10,000
        # Code to check if the given car is in range of another car
        # ...
        # if car is in range of another car
        if distance < 1000:
            result = getDetailsofOtherCar()
            totresult = totresult + carName + result
# end of loops
op.write(totresult)
op.close()
My question here is: is there a better, more Pythonic way to perform this kind of logging? I am guessing the string gets very bulky in the later iterations and may be causing the delay in execution. Is a string the best option for accumulating the values, or should I consider other Python data structures like a list or array? I came across the logging module but would like to get an opinion before switching to it.
I tried looking for similar issues but found nothing matching my current doubt.
Open to any suggestions.
Thank you
Edit: code added
You can write to the file as you go e.g.
with open("output.txt", "w") as log:
    for i in range(10):
        for j in range(10):
            log.write(str((i, j)))
Update: whether or not directly streaming the records is faster than concatenating them in a memory buffer depends crucially on how big the buffer becomes, which in turn depends on the number of records and the size of each record. On my machine, the crossover seems to kick in around 350 MB.
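If you want to see where that crossover lies on your own machine, here is a rough sketch of the measurement (the record payload and iteration count are made up):

import timeit

record = "car_A;car_B;distance=870\n"  # made-up payload
n = 1_000_000                          # made-up iteration count

def concat_then_write():
    totresult = ""
    for _ in range(n):
        totresult = totresult + record  # accumulate in memory, as in the question
    with open("concat.txt", "w") as f:
        f.write(totresult)

def write_as_you_go():
    with open("stream.txt", "w") as f:
        for _ in range(n):
            f.write(record)             # rely on the file object's buffering

print("concat:", timeit.timeit(concat_then_write, number=1))
print("stream:", timeit.timeit(write_as_you_go, number=1))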

Is it faster to do a bulk file write or write to it in smaller parts?

I have a python script that reads a flat file and writes the records to a JSON file. Would it be faster to do a write all at once:
dict_array = []
for record in records:
    dict_array.append(record)
# writes one big array to file
out_file.write(json.dumps(dict_array))
Or write to the file as the iterator yields each record?
for record in records:
    out_file.write(json.dumps(record) + '\n')
The number of records in records is around 81,000.
Also, the format of JSON can be one big array of objects (case 1) or line-separated objects (case 2).
Your two solutions aren't doing the same thing. The first one writes a valid JSON object. The second writes a probably-valid (but you have to be careful) JSONlines (and probably also NDJSON/LDJSON and NDJ) file. So, the way you process the data later is going to be very different. And that's the most important thing here—do you want a JSON file, or a JSONlines file?
But since you asked about performance: It depends.
Python files are buffered by default, so doing a whole bunch of small writes is only a tiny bit slower than doing one big write. But it is a tiny bit slower, not zero.
On the other hand, building a huge list in memory means allocation, and copies, that are otherwise unneeded. This is almost certainly going to be more significant, unless your values are really tiny and your list is also really short.
Without seeing your data, I'd give about 10:1 odds that the iterative solution will turn out faster, but why rely on that barely-educated guess? If it matters, measure with your actual data with timeit.
On the third hand, 81,000 typical JSON records is basically nothing, so unless you're doing this zillions of times, it's probably not even worth measuring. If you spend an hour figuring out how to measure it, running the tests, and interpreting the results (not to mention the time you spent on SO) to save 23 milliseconds per day for about a week and then nothing ever again… well, to a programmer, that's always attractive, but still it's not always a good idea.
import json
import time

out_file = open("out.json", "w")  # any writable file
dict_array = []
records = range(10**5)

start = time.time()
for record in records:
    dict_array.append(record)
out_file.write(json.dumps(dict_array))
end = time.time()
print(end - start)
# 0.07105851173400879

start = time.time()
for record in records:
    out_file.write(json.dumps(record) + '\n')
end = time.time()
print(end - start)
# 1.1138122081756592

start = time.time()
out_file.write(json.dumps([record for record in records]))
end = time.time()
print(end - start)
# 0.051038265228271484
I don't know what records is, but based on these tests, the list comprehension is fastest, followed by constructing a list and writing it all at once, followed by writing one record at a time. Depending on what records is, just doing out_file.write(json.dumps(records)) may be even faster.
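Since the choice also determines how the data is read back, here is a quick sketch of the downstream difference (filenames are placeholders):

import json

# Case 1: one big JSON array - the whole file is parsed in one shot
with open("records.json") as f:
    dict_array = json.load(f)

# Case 2: JSON lines - records can be streamed one at a time
with open("records.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # process each record here without holding the full list in memory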

Searching for duplicate records within a text file where the duplicate is determined by only two fields

First, Python Newbie; be patient/kind.
Next, once a month I receive a large text file (think 7 million records) to test for duplicate values. This is catalog information. I get 7 fields, but the two I'm interested in are a supplier code and a full orderable part number. To determine if the record is duplicated, I strip all special characters from the part number (except . and #) to create a compressed part number. The test for duplicates becomes the supplier code and compressed part number combination. This part is fairly straightforward. Currently, I am just copying the original file with 2 new columns (compressed part and duplicate indicator). If the part is a duplicate, I put a "YES" in the last field. Now that this is done, I want to be able to go back (or better yet, at the same time) to get the previous record where there was a supplier code/compressed part number match.
So far, my code looks like this:
# Compress Full Part to a Compressed Part
# and Check for Duplicates on Supplier Code
# and Compressed Part combination
import sys
import re
import time
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
start = time.time()

try:
    file1 = open("C:\Accounting\May Accounting\May.txt", "r")
except IOError:
    print >> sys.stderr, "Cannot Open Read File"
    sys.exit(1)

try:
    file2 = open(file1.name[0:len(file1.name)-4] + "_" + "COMPRESSPN.txt", "a")
except IOError:
    print >> sys.stderr, "Cannot Open Write File"
    sys.exit(1)

hdrList = "CIGSUPPLIER|FULL_PART|PART_STATUS|ALIAS_FLAG|ACQUISITION_FLAG|COMPRESSED_PART|DUPLICATE_INDICATOR"
file2.write(hdrList + chr(10))
lines_seen = set()
affirm = "YES"
records = file1.readlines()

for record in records:
    fields = record.split(chr(124))
    if fields[0] == "CIGSupplier":
        continue  # If incoming file has a header line, skip it
    file2.write(fields[0] + "|")  # Supplier Code
    file2.write(fields[1] + "|")  # Full_Part
    file2.write(fields[2] + "|")  # Part Status
    file2.write(fields[3] + "|")  # Alias Flag
    file2.write(re.sub("[$\r\n]", "", fields[4]) + "|")  # Acquisition Flag
    file2.write(re.sub("[^0-9a-zA-Z.#]", "", fields[1]) + "|")  # Compressed_Part
    dupechk = fields[0] + "|" + re.sub("[^0-9a-zA-Z.#]", "", fields[1])
    if dupechk not in lines_seen:
        file2.write(chr(10))
        lines_seen.add(dupechk)
    else:
        file2.write(affirm + chr(10))

print "it took", time.time() - start, "seconds."
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
file2.close()
file1.close()
It runs in less than 6 minutes, so I am happy with this part, even if it is not elegant. Right now, when I get my results, I import them into Access and do a self join to locate the duplicates. Loading, querying, and exporting a file this size in Access takes around an hour, so I would like to be able to export the matched duplicates to another text file or an Excel file.
Confusing enough?
Thanks.
Maybe you could consider building a dictionary that maps (supplier_number, compressed_part_number) tuples to data structures (nested lists, or instances of a custom class for improved readability and maintainability) holding the line numbers where records matching the key tuple appear in your file, plus possibly the complete records themselves.
This would end up putting all the data from the file into a large in-memory dictionary, which might or might not be a problem depending on your requirements; if you skip the actual records and only hold line numbers, the dictionary will be much smaller.
You can then iterate over the entries in the dictionary spitting out the duplicates to a file as you go.
I think you should sort the entries in the input file first. It may consume too much memory, but you should first try to read all the input into memory, sort it based on the value of dupechk, and then iterate over all entries to easily see whether there are two or more identical records. Because identical records are grouped together, it is easy to output just those records.
This might be more efficient/feasible for the large files you are dealing with:
Sort the file based on the supplier code and compressed part number - dump it to a temporary file. I don't think it is worth actually tacking on the compressed part number, just compute it from the full part number when needed. However, that is pure conjecture and definitely deserves some quick benchmarking.
Iterate through the temporary file (you might want to take advantage of 'with'). Check whether the current line's supplier code and compressed part number are identical to the previous line's - if they are, you have identified a duplicate. Handle it as you see fit. Since the file is sorted, you reduce the memory requirement from storing all the lines to storing only a run of consecutive identical lines.
You are already reading the whole file into memory, so you don't need to sort. Instead of a set, have a dict mapping (supplier, compressed_pn) to the line number where that key was last seen. That way, when you discover a duplicate, you can output the two duplicate records immediately. This method requires only one pass over the file, and you don't need to write a temporary file.
If you often have 3 or more records with the same key, you may wish to use an approach that maps the key to a list of line indices. At the end of reading the file, you iterate over the dictionary looking for lists with more than one entry.
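A minimal sketch of that single-pass approach (the delimiter, field positions, and compression regex are taken from the question's code; the file names are placeholders):

import re

seen = {}  # (supplier code, compressed part) -> (line number, record) last seen

with open("May.txt") as infile, open("duplicates.txt", "w") as dupes:
    for lineno, record in enumerate(infile, 1):
        fields = record.rstrip("\r\n").split("|")
        if fields[0] == "CIGSupplier":
            continue  # skip a header line, as in the original code
        key = (fields[0], re.sub("[^0-9a-zA-Z.#]", "", fields[1]))
        if key in seen:
            prev_lineno, prev_record = seen[key]
            dupes.write("%d: %s\n" % (prev_lineno, prev_record))
            dupes.write("%d: %s\n" % (lineno, record.rstrip("\r\n")))
        seen[key] = (lineno, record.rstrip("\r\n"))

Storing the record text alongside the line number is what lets both members of a duplicate pair be written out as soon as the second one is seen.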
Couple of comments:
Using file.readlines on a large file is wasteful - it reads the entire file into memory. You should instead take advantage of the fact that a file object is iterable, yielding one line at a time.
Your file format is basically CSV, with a pipe instead of a comma as the separator. So use the csv module. It is written in C and avoids most of the interpreter overhead. It also provides a nice iterable interface which does not require reading the whole file into memory, either.
You should additionally use a DictReader from the csv module. If the header is in the file, great, the class will parse it and use the names as keys from then on. If not, specify the header in the code. Either way, fields[0] is uninformative and error-prone; fields["CIGSUPPLIER"] is much more self-documenting.
Just as with reading, use the csv module for writing. Again, you can specify the delimiter (see the sketch after these comments).
Don't use file2.write(chr(10)). Use file2.write('\n'), and open your file appropriately. Alternatively, if you're using the csv.writer class, these become unnecessary.
Otherwise, your logic and flow look alright. Overall, I'd advise against the chr(*) calls unless the character is truly unprintable. Newlines and pipes are printable (or have supported escapes), and should be written as such.
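Putting those csv-module suggestions together, the skeleton might look roughly like this (a sketch, not a drop-in replacement: the input is assumed to carry only the first five columns of the output header, and the file names are placeholders):

import csv
import re

in_names = ["CIGSUPPLIER", "FULL_PART", "PART_STATUS", "ALIAS_FLAG", "ACQUISITION_FLAG"]
out_names = in_names + ["COMPRESSED_PART", "DUPLICATE_INDICATOR"]

seen = set()
with open("May.txt") as infile, open("May_COMPRESSPN.txt", "w", newline="") as outfile:
    reader = csv.DictReader(infile, fieldnames=in_names, delimiter="|")
    writer = csv.DictWriter(outfile, fieldnames=out_names, delimiter="|")
    writer.writeheader()
    for row in reader:
        if row["CIGSUPPLIER"] == "CIGSupplier":
            continue  # skip an input header line, as the original code does
        row["COMPRESSED_PART"] = re.sub("[^0-9a-zA-Z.#]", "", row["FULL_PART"])
        key = (row["CIGSUPPLIER"], row["COMPRESSED_PART"])
        row["DUPLICATE_INDICATOR"] = "YES" if key in seen else ""
        seen.add(key)
        writer.writerow(row)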

How to write a memory efficient Python program?

It's said that Python automatically manages memory. I'm confused, because I have a Python program that consistently uses more than 2GB of memory.
It's a simple multi-thread binary data downloader and unpacker.
def GetData(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    data = response.read()  # data size is about 15MB
    response.close()
    count = struct.unpack("!I", data[:4])[0]
    for i in range(0, count):
        # UNPACK FIXED LENGTH OF BINARY DATA HERE
        yield (field1, field2, field3)

class MyThread(threading.Thread):
    def __init__(self, total, daterange, tickers):
        threading.Thread.__init__(self)

    def stop(self):
        self._Thread__stop()

    def run(self):
        # GET URL FOR EACH REQUEST
        data = []
        items = GetData(url)
        for item in items:
            data.append(';'.join(item))
        f = open(filename, 'w')
        f.write(os.linesep.join(data))
        f.close()
There are 15 threads running. Each request gets about 15MB of data, unpacks it, and saves it to a local text file. How could this program consume more than 2GB of memory? Do I need to do any memory recycling in this case? How can I see how much memory each object or function uses?
I would appreciate any advice or tips on how to keep a Python program running in a memory-efficient mode.
Edit: Here is the output of "cat /proc/meminfo"
MemTotal: 7975216 kB
MemFree: 732368 kB
Buffers: 38032 kB
Cached: 4365664 kB
SwapCached: 14016 kB
Active: 2182264 kB
Inactive: 4836612 kB
Like others have said, you need at least the following two changes:
Do not create a huge list of integers with range
# use xrange
for i in xrange(0, count):
    # UNPACK FIXED LENGTH OF BINARY DATA HERE
    yield (field1, field2, field3)
do not create a huge string as the full file body to be written at once
# use writelines
f = open(filename, 'w')
f.writelines((datum + os.linesep) for datum in data)
f.close()
Even better, you could write the file as:
items = GetData(url)
f = open(filename, 'w')
for item in items:
    f.write(';'.join(item) + os.linesep)
f.close()
The major culprit here is, as mentioned above, the range() call. It will create a list with 15 million members, and that will eat up about 200 MB of your memory; with 15 threads, that's 3GB.
But also don't read the whole 15MB response into data; read bit by bit from the response. Sticking those 15MB into a variable will use up 15MB of memory more than reading bit by bit from the response.
You might want to consider simply extracting data until you run out of input data, and comparing the count of data you extracted with what the first bytes said it should be. Then you need neither range() nor xrange(). That seems more Pythonic to me. :)
Consider using xrange() instead of range(), I believe that xrange is a generator whereas range() expands the whole list.
I'd say either don't read the whole file into memory, or don't keep the whole unpacked structure in memory.
Currently you keep both in memory at the same time, and that is going to be quite big. So you have at least two copies of your data in memory, plus some metadata.
Also, the final line
f.write(os.linesep.join(data))
may actually mean you temporarily have a third copy in memory (a big string with the entire output file).
So I'd say you're doing it in quite an inefficient way, keeping the entire input file, entire output file and a fair amount of intermediate data in memory at once.
Using the generator to parse it is quite a nice idea. Consider writing each record out after you've generated it (it can then be discarded and the memory reused), or if that causes too many write requests, batch them into, say, 100 rows at once.
Likewise, reading the response could be done in chunks. As they're fixed records this should be reasonably easy.
The last line should surely be f.close()? Those trailing parens are kinda important.
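For the chunked-reading suggestion, here is a sketch of a GetData that pulls one fixed-length record at a time off the response instead of reading everything up front (the record size and layout are placeholders; this stays in Python 2 to match the question's code):

import struct
import urllib2  # Python 2, as in the question

RECORD_SIZE = 32  # placeholder: the real fixed record length

def GetData(url):
    response = urllib2.urlopen(urllib2.Request(url))
    count = struct.unpack("!I", response.read(4))[0]
    for _ in xrange(count):
        record = response.read(RECORD_SIZE)  # one record at a time, never the full 15MB
        # UNPACK FIXED LENGTH OF BINARY DATA HERE, e.g. with a placeholder layout:
        field1, field2, field3 = struct.unpack("!4s4s4s", record[:12])
        yield (field1, field2, field3)
    response.close()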
You can make this program more memory efficient by not reading all 15MB from the TCP connection, but instead processing each line as it is read. This will make the remote servers wait for you, of course, but that's okay.
Python is just not very memory efficient. It wasn't built for that.
You could do more of your work in compiled C code if you convert this to a list comprehension:
data = []
items = GetData(url)
for item in items:
    data.append(';'.join(item))
to:
data = [';'.join(items) for items in GetData(url)]
This is actually slightly different from your original code. In your version, GetData returns a 3-tuple, which comes back in items. You then iterate over this triplet, and append ';'.join(item) for each item in it. This means that you get 3 entries added to data for every triplet read from GetData, each one ';'.join'ed. If the items are just strings, then ';'.join will give you back a string with every other character a ';' - that is ';'.join("ABC") will give back "A;B;C". I think what you actually wanted was to have each triplet saved back to the data list as the 3 values of the triplet, separated by semicolons. That is what my version generates.
This may also help somewhat with your original memory problem, as you are no longer creating as many Python values. Remember that a variable in Python has much more overhead than one in a language like C. Since each value is itself an object, and you add the overhead of each name reference to that object, you can easily expand the theoretical storage requirement several-fold. In your case, reading 15MB x 15 = 225MB, plus the overhead of each item of each triple being stored as a string entry in your data list, could quickly grow to your observed 2GB size. At minimum, my version of your data list will have only 1/3 the entries in it, the separate item references are skipped, and the iteration is done in compiled code.
There are 2 obvious places where you keep large data objects in memory (the data variable in GetData() and data in MyThread.run() - these two will take about 500MB), and there are probably other places in the skipped code. Both are easy to make memory efficient: use response.read(4) instead of reading the whole response at once, and do it the same way in the code behind UNPACK FIXED LENGTH OF BINARY DATA HERE. Change data.append(...) in MyThread.run() to:
if not first:
    f.write(os.linesep)
f.write(';'.join(item))
These changes will save you a lot of memory.
Make sure you delete the threads after they are stopped. (using del)
