Python: improve performance in log writing to file - python

I am writing a piece of code that involves generation of new parameter values over a double FOR loop and store these values to a file. The loop iteration count can go as high as 10,000 * 100,000. I have stored the variable values in a string, which gets appended on every iteration with newer values. Finally, at the end of loop I write the complete string in a txt file.
op=open("output file path","w+")
totresult = ""
for n seconds: #this user input parameter can be upto 100,000
result = ""
for car in (cars running): #number of cars can be 10000
#Code to check if given car is in range to another car
.
.
#if car in range with another car
if distance < 1000:
result = getDetailsofOtherCar()
totresult = totalresult + carName + result
#end of loops
op.write(totresult)
op.close()
My question here is, is there a better pythonic way to perform this kind of logging. As I am guessing the string gets very bulky in the later iterations and may be causing delay in execution. Is the use of string the best possible option to store the values. Or should I consider other python data structures like list, array. I came across Logging python module but would like to get an opinion before switching to it.
I tried looking up for similar issues but found nothing similar to my current doubt.
Open to any suggestions
Thank you
Edit: code added

You can write to the file as you go e.g.
with open("output.txt", "w") as log:
for i in range(10):
for j in range(10):
log.write(str((i,j)))
Update: whether or not directly streaming the records is faster than concatenating them in a memory buffer depends crucially on how big the buffer becomes, which in turn depends on the number of records and the size of each record. On my machine this seems to kick in around 350MB.

Related

Python - scalability with respect to run time and memory usage is important

I have python scripts to filter a massive data in csv file. The requirement asks for considering scalability with respect to run time and memory usage.
I wrote 2 scripts, both of them are working fine of filtering data. Regarding to considering scalability, I decided to use python generator, because it uses iterator and don't save much data in memory.
When I compared the running time of 2 scripts, I found the following:
script 1 - use generator - take more time - 0.0155925750732s
def each_sentence(text):
match = re.match(r'[0-9]+', text)
num = int(text[match.start():match.end()])
if sympy.isprime(num) == False:
yield text.strip()
with open("./file_testing.csv") as csvfile:
for line in csvfile:
for text in each_sentence(line):
print(text)
script 2 - use split and without generator - take less time - 0.00619888305664
with open("./file_testing.csv") as csvfile:
for line in csvfile:
array = line.split(',')
num = int(array[0])
if sympy.isprime(num) == False:
print line.strip()
To meet the requirement, do I need to use python generator? or any suggestions or recommendations?
To meet the requirement, do I need to use python generator?
No, you don't. Script 1 doesn't make sense. The generator is always executed once and return one result in the first iteration.
Any suggestions or recommendations?
You need to learn about three things: complexity, parallelization and caching.
Complexity basically means "if I double the size of input data (csv file), do I need twice the time? Or four times? Or what"?
Parallelization means attacking a problem in a way that makes it easy to add more resources for solving it.
Caching is important. Things get much faster if you don't have to re-create everything all the time, but you can re-use stuff you have already generated.
The main loop for line in csvfile: already scales very well unless the csv file contains extremely long lines.
Script 2 contains a bug: If the first cell in a line is not integer, then int(array[0]) will raise a value error.
The isprime function is probably the "hotspot" in your code, so you can try to parallelize it with multiple threads or sub-processes.
Split your analysis into two discrete regular expression results: A small result with 10 values, and a large result with 10,000,000 values. This question is about the average len() of match, as much as it is about the len() of csvfile.
With a small re result - 10 bytes
1st code block will have slower run time, and relatively low memory usage.
2nd code block will have faster run time, and also relatively low memory usage.
With a large re result - 10,000,000 bytes
1st code block will have slower run time, and very little memory usage.
2nd code block will have faster run time, and very large memory usage.
Bottom line:
If you are supposed to build a function considering run time and memory, then the yield function is definitely the best way to go when the problem requires a scalable solution to different result sizes.
Another question on scalability: What if the re result equals None? I would slightly modify the code to below:
def each_sentence(text):
match = re.match(r'[0-9]+', text)
if match != None:
num = int(text[match.start():match.end()])
if sympy.isprime(num) == False:
yield text.strip()
with open("./file_testing.csv") as csvfile:
for line in csvfile:
for text in each_sentence(line):
print(text)

Is it faster to do a bulk file write or write to it in smaller parts?

I have a python script that reads a flat file and writes the records to a JSON file. Would it be faster to do a write all at once:
dict_array = []
for record in records:
dict_array.append(record)
# writes one big array to file
out_file.write(json.dumps(dict_array))
Or write to the file as the iterator yields each record?
for record in records:
out_file.write(json.dumps(record) + '\n')
The amount of records in records is around 81,000.
Also, the format of JSON can be one big array of objects (case 1) or line-separated objects (case 2).
Your two solutions aren't doing the same thing. The first one writes a valid JSON object. The second writes a probably-valid (but you have to be careful) JSONlines (and probably also NDJSON/LDJSON and NDJ) file. So, the way you process the data later is going to be very different. And that's the most important thing here—do you want a JSON file, or a JSONlines file?
But since you asked about performance: It depends.
Python files are buffered by default, so doing a whole bunch of small writes is only a tiny bit slower than doing one big write. But it is a tiny bit slower, not zero.
On the other hand, building a huge list in memory means allocation, and copies, that are otherwise unneeded. This is almost certainly going to be more significant, unless your values are really tiny and your list is also really short.
Without seeing your data, I'd give about 10:1 odds that the iterative solution will turn out faster, but why rely on that barely-educated guess? If it matters, measure with your actual data with timeit.
On the third hand, 81,000 typical JSON records is basically nothing, so unless you're doing this zillions of times, it's probably not even worth measuring. If you spend an hour figuring out how to measure it, running the tests, and interpreting the results (not to mention the time you spent on SO) to save 23 milliseconds per day for about a week and then nothing ever again… well, to a programmer, that's always attractive, but still it's not always a good idea.
import json
dict_array = []
records = range(10**5)
start = time.time()
for record in records:
dict_array.append(record)
out_file.write(json.dumps(dict_array))
end = time.time()
print(end-start)
#0.07105851173400879
start = time.time()
for record in records:
out_file.write(json.dumps(record) + '\n')
end = time.time()
print(end-start)
#1.1138122081756592
start = time.time()
out_file.write(json.dumps([record for record in records]))
end = time.time()
print(end-start)
#0.051038265228271484
I don't know what records is, but based on these tests, list comprehension is fastest, followed by constructing a list and writing it all at once, followed by writing one record at a time. Depending on what records is, just doing out_file.write(json.dumps(records))) may be even faster.

String manipulation appears to be inefficient

I think my code is too inefficient. I'm guessing it has something to do with using strings, though I'm unsure. Here is the code:
genome = FASTAdata[1]
genomeLength = len(genome);
# Hash table holding all the k-mers we will come across
kmers = dict()
# We go through all the possible k-mers by index
for outer in range (0, genomeLength-1):
for inner in range (outer+2, outer+22):
substring = genome[outer:inner]
if substring in kmers: # if we already have this substring on record, increase its value (count of num of appearances) by 1
kmers[substring] += 1
else:
kmers[substring] = 1 # otherwise record that it's here once
This is to search through all substrings of length at most 20. Now this code seems to take pretty forever and never terminate, so something has to be wrong here. Is using [:] on strings causing the huge overhead? And if so, what can I replace it with?
And for clarity the file in question is nearly 200mb, so pretty big.
I would recommend using a dynamic programming algorithm. The problem is that for all inner strings that are not found, you are re-searching those again with extra characters appended onto them, so of course those will also not be found. I do not have a specific algorithm in mind, but this is certainly a case for dynamic programming where you remember what you have already searched for. As a really crummy example, remember all substrings of length 1,2,3,... that are not found, and never extend those bases in the next iteration where the strings are only longer.
You should use memoryview to avoid creating sub-strings as [:] will then return a "view" instead of a copy, BUT you must use Python 3.3 or higher (before that, they are not hashable).
Also, a Counter will simplify your code.
from collections import Counter
genome = memoryview("abcdefghijkrhirejtvejtijvioecjtiovjitrejabababcd".encode('ascii'))
genomeLength = len(genome)
minlen, maxlen = 2, 22
def fragments():
for start in range (0, genomeLength-minlen):
for finish in range (start+minlen, start+maxlen):
if finish <= genomeLength:
yield genome[start:finish]
count = Counter(fragments())
for (mv, n) in count.most_common(3):
print(n, mv.tobytes())
produces:
4 b'ab'
3 b'jt'
3 b'ej'
A 1,000,000 byte random array takes 45s on my laptop, but 2,000,000 causes swapping (over 8GB memory use). However, since your fragment size is small, you can easily break the problem up into million-long sub-sequences and then combine results at the end (just be careful about overlaps). That would give a total running time for a 200MB array of ~3 hours, with luck.
PS To be clear, by "combine results at the end" I assume that you only need to save the most popular for each 1M sub-sequence, by, for example, writing them to a file. You cannot keep the counter in memory - that is what is using the 8GB. That's fine if you have fragments that occur many thousands of times, but obviously won't work for smaller numbers (you might see a fragment just once in each of the 200 1M sub-sequences, and so never save it, for example). In other words, results will be lower bounds that are incomplete, particularly at lower frequencies (values are complete only if a fragment is found and recorded in every sub-sequence). If you need an exact result, this is not suitable.

summing entries in a variable based on another variable(make unique) in python lists

i have a question as to how i can perform this task in python:-
i have an array of entries like:
[IPAddress, connections, policystatus, activity flag, longitude, latitude] (all as strings)
ex.
['172.1.21.26','54','1','2','31.15424','12.54464']
['172.1.21.27','12','2','4','31.15424','12.54464']
['172.1.27.34','40','1','1','-40.15474','-54.21454']
['172.1.2.45','32','1','1','-40.15474','-54.21454']
...
till about 110000 entries with about 4000 different combinations of longitude-latitude
i want to count the average connections, average policy status,average of activity flag for each location
something like this:
[longitude,latitude,avgConn,avgPoli,avgActi]
['31.15424','12.54464','33','2','3']
['-40.15474','-54.21454','31','1','1']
...
so on
and i have about 195 files with ~110,000 entries each (sort of a big data problem)
my files are in .csv but im using it as .txt to easily work with it in python(not sure if this is the best idea)
im still new to python so im not really sure whats the best approach to use but i sincerely appreciate any help or guidance for this problem
thanks in advance!
No, if you have the files as .csv, threating them as text does not make sense, since python ships with the excellent csv module.
You could read the csv rows into a dict to group them, but I'd suggest writing the data in a proper database, and use SQL's AVG() and GROUP BY. Python ships with bindings for most databaases. If you have none installed, consider using the sqlite module.
I'll only give you the algorithm, you would learn more by writing the actual code yourself.
Use a Dictionary, with the key as a pair of the form (longitude, latitude) and value as a list of the for [ConnectionSum,policystatusSum,ActivityFlagSum]
loop over the entries once (do count the total number of entries, N)
a. for each entry, if the location exists - add the conn, policystat and Activity value to the existing sum.
b. if the entry does not exist, then assign [0,0,0] as the value
Do 1 and 2 for all files.
After all the entries have been scanned. Loop over the dictionary and divide each element of the list [ConnectionSum,policystatusSum,ActivityFlagSum] by N to get the average values of each.
As long as your locations are restricted to being in the same files (or even close to each other in a file), all you need to do is the stream-processing paradigm. For example if you know that duplicate locations only appear in a file, read each file, calculate the averages, then close the file. As long as you let the old data float out of scope, the garbage collector will get rid of it for you. Basically do this:
def processFile(pathToFile):
...
totalResults = ...
for path in filePaths:
partialResults = processFile(path)
totalResults = combine...partialResults...with...totalResults
An even more elegant solution would be to use the O(1) method of calculating averages "on-line". If for example you are averaging 5,6,7, you would do 5/1=5.0, (5.0*1+6)/2=5.5, (5.5*2+7)/3=6. At each step, you only keep track of the current average and the number of elements. This solution will yield the minimal amount of memory used (no more than the size of your final result!), and doesn't care about which order you visit elements in. It would go something like this. See http://docs.python.org/library/csv.html for what functions you'll need in the CSV module.
import csv
def allTheRecords():
for path in filePaths:
for row in csv.somehow_get_rows(path):
yield SomeStructure(row)
averages = {} # dict: keys are tuples (lat,long), values are an arbitrary
# datastructure, e.g. dict, representing {avgConn,avgPoli,avgActi,num}
for record in allTheRecords():
position = (record.lat, record.long)
currentAverage = averages.get(position, default={'avgConn':0, 'avgPoli':0, 'avgActi':0, num:0})
newAverage = {apply the math I mentioned above}
averages[position] = newAverage
(Do note that the notion of an "average at a location" is not well-defined. Well, it is well-defined, but not very useful: If you knew the exactly location of every IP event to infinite precision, the average of everything would be itself. The only reason you can compress your dataset is because your latitude and longitude have finite precision. If you run into this issue if you acquire more precise data, you can choose to round to the appropriate precision. It may be reasonable to round to within 10 meters or something; see latitude and longitude. This requires just a little bit of math/geometry.)

Python fast string parsing, manipulation

I am using python to parse the incoming comma separated string. I want to do some calculation afterwards on the data.
The length of the string is: 800 characters with 120 comma separated fields.
There such 1.2 million strings to process.
for v in item.values():
l.extend(get_fields(v.split(',')))
#process l
get_fields uses operator.itemgetter() to extract around 20 fields out of 120.
This entire operation takes about 4-5 minutes excluding the time to bring in the data.
In the later part of the program I insert these lines into sqlite memory table for further use.
But overall 4-5 minutes time for just parsing and getting a list is not good for my project.
I run this processing in around 6-8 threads.
Does switching to C/C++ might help?
Are you loading a dict with your file records? Probably better to process the data directly:
datafile = file("file_with_1point2million_records.dat")
# uncomment next to skip over a header record
# file.next()
l = sum(get_fields(v.split(',')) for v in file, [])
This avoids creating any overall data structures, and only accumulated the desired values as returned by get_fields.
Your program might be slowing down trying to allocate enough memory for 1.2M strings. In other words, the speed problem might not be due to the string parsing/manipulation, but rather in the l.extend. To test this hypothsis, you could put a print statement in the loop:
for v in item.values():
print('got here')
l.extend(get_fields(v.split(',')))
If the print statements get slower and slower, you can probably conclude l.extend is the culprit. In this case, you may see significant speed improvement if you can move the processing of each line into the loop.
PS: You probably should be using the csv module to take care of the parsing for you in a more high-level manner, but I don't think that will affect the speed very much.

Categories