So I have created a script to upload a list of 1 million points in batches of 4,000 points per second, so it should take 250 batches to upload the data.
I have placed two timing functions in the script:
One around each write_api.write() call, to measure the time each batch takes
One around the outer loop, to measure the time it takes to upload the whole 1 million points
However, each individual timing says a batch takes about 1 second on average, so in my opinion the upload should take about 250 seconds. But the outer timing reports roughly 400 seconds, which is almost double the total I get even if I sum up the 250 individual write times.
import random
import copy
import datetime
import time

from influxdb_client import InfluxDBClient, Point, WriteOptions
from influxdb_client.client.exceptions import InfluxDBError


class BatchingCallback(object):

    def success(self, conf: (str, str, str), data: str):
        print(f"Written batch: {conf}, data: {data}")

    def error(self, conf: (str, str, str), data: str, exception: InfluxDBError):
        print(f"Cannot write batch: {conf}, data: {data} due: {exception}")

    def retry(self, conf: (str, str, str), data: str, exception: InfluxDBError):
        print(f"Retryable error occurs for batch: {conf}, data: {data} retry: {exception}")


def write_data(url, token, bucket, org):
    callback = BatchingCallback()
    data = generate_list_dictionary()  # Generates the data: a list of 1 million points with 100 fields each
    total_start = time.perf_counter()
    points_per_batch = 4000
    with InfluxDBClient(url=url, token=token, org=org) as client:
        with client.write_api(write_options=WriteOptions(batch_size=points_per_batch),
                              success_callback=callback.success,
                              error_callback=callback.error,
                              retry_callback=callback.retry) as write_api:
            time_start = datetime.datetime.now()
            for data_point in range(0, len(data), points_per_batch):
                upper_index = data_point + points_per_batch  # Allows us to slice the data in batches of 4000
                seconds = datetime.timedelta(seconds=int(upper_index / points_per_batch))  # Allows us to send the data every second
                while True:
                    if datetime.datetime.now() >= time_start + seconds:
                        # Write the batch and time the individual call
                        start = time.perf_counter()
                        write_api.write(bucket=bucket, org=org, record=data[data_point:upper_index])
                        end = time.perf_counter()
                        print(f"Individual time is: {end - start}")
                        break
    total_end = time.perf_counter()
    print(f"Total time is: {total_end - total_start}")


if __name__ == '__main__':
    url = ""
    token = ""
    bucket = ""
    org = ""
    write_data(url, token, bucket, org)
As you can see, the total time gets printed as about 450 seconds on average, while the individual time is about 1 second on average. By that logic the upload should take 250 seconds, but it's almost double.
So my question is: is the individual time that Python calculates wrong? If so, how do I measure the time taken by every single write() I do?
The client version I'm using is 1.31.0, my Python version is 3.7.7, and I'm on Windows.
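One way to check whether the per-write timings measure the actual upload (rather than just handing the batch to the batching writer, which sends it from a background thread) is to time a synchronous write of the same batches. A minimal sketch, assuming data is the same list of points that generate_list_dictionary() builds:

import time

from influxdb_client import InfluxDBClient
from influxdb_client.client.write_api import SYNCHRONOUS


def time_synchronous_batches(url, token, bucket, org, data, points_per_batch=4000):
    """Time each write with the synchronous write API, so write() returns
    only after the batch has actually been sent to the server."""
    with InfluxDBClient(url=url, token=token, org=org) as client:
        write_api = client.write_api(write_options=SYNCHRONOUS)
        total_start = time.perf_counter()
        for lower in range(0, len(data), points_per_batch):
            batch = data[lower:lower + points_per_batch]
            start = time.perf_counter()
            write_api.write(bucket=bucket, org=org, record=batch)
            print(f"Batch {lower // points_per_batch}: {time.perf_counter() - start:.3f}s")
        print(f"Total: {time.perf_counter() - total_start:.3f}s")

With the batching writer, write() mostly enqueues the records and the actual HTTP requests are sent from a background thread, so the inner and outer timers measure different things; the synchronous variant makes the two numbers directly comparable.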
I have a script that reads some data from a binary stream of "packets" containing "parameters". The parameters read are stored in a dictionary for each packet, which is appended to an array representing the packet stream.
At the end, this array of dicts is written to an output CSV file.
Among the read data is a CUC7 datetime, stored as coarse/fine integer parts of a GPS time, which I also want to convert to a UTC ISO string.
from astropy.time import Time

def cuc2gps_time(coarse, fine):
    return Time(coarse + fine / (2**24), format='gps')

def gps2utc_time(gps):
    return Time(gps, format='isot', scale='utc')
The issue is that these two time conversions make up 90% of my script's total run time, while most of the actual work (reading the binary file, decoding 15 other parameters, writing the CSV) is done in the remaining 10%.
I somehow improved the situation by making the conversions in batches on Numpy arrays, instead of on each packet. This reduces the total runtime by about half.
import numpy as np

while end_not_reached:
    # Read 1 packet
    # (...)
    nb_packets += 1
    end_not_reached = ...  # boolean

    # Process in batches for better performance
    if not nb_packets % 1000 or not end_not_reached:
        # Convert CUC7 time to GPS and UTC times
        all_coarse = np.array([packet['lobt_coarse'] for packet in packets])
        all_fine = np.array([packet['lobt_fine'] for packet in packets])
        all_gps = cuc2gps_time(all_coarse, all_fine)
        all_utc = gps2utc_time(all_gps)

        # Add times to each packet
        for packet, gps_time, utc_time in zip(packets, all_gps, all_utc):
            packet.update({'gps_time': gps_time, 'utc_time': utc_time})
But my script is still absurdly slow. Reading 60000 packets from a 1.2GB file and writing it as a CSV takes 12s, against only 2.5s if I remove the time conversion.
So:
Is it expected that Astropy time conversions are so slow? Am I using it wrong? Is there a better library?
Is there a way to improve my current implementation? I suspect that the remaining "for" loop in there is very costly, but could not find a good way to replace it.
I think the problem is that you're doing multiple loops over your sequence of packets. In Python I would recommend having arrays representing each parameter, instead of having a list of objects, each with a bunch of scalar parameters.
If you can read all the packets in at once, I would recommend something like:
num_bytes = ...
num_bytes_per_packet = ...
num_packets = num_bytes // num_bytes_per_packet

param_1 = np.empty(num_packets)
param_2 = np.empty(num_packets)
...
time_coarse = np.empty(num_packets)
time_fine = np.empty(num_packets)
...
param_N = np.empty(num_packets)

for i in range(num_packets):
    param_1[i], param_2[i], ..., time_coarse[i], time_fine[i], ..., param_N[i] = decode_packet(...)

time_gps = Time(time_coarse + time_fine / (2**24), format='gps')
time_utc = time_gps.utc.isot
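With the parameters held as whole columns, the output table can also be assembled column-wise, avoiding the per-packet dict update entirely. A minimal sketch with pandas, assuming the column names used above and an illustrative CSV layout:

import numpy as np
import pandas as pd
from astropy.time import Time


def write_csv(path, time_coarse, time_fine, **params):
    """Build the output table from whole columns instead of per-packet dicts."""
    time_gps = Time(time_coarse + time_fine / (2**24), format='gps')
    df = pd.DataFrame(params)           # param_1, param_2, ... as columns
    df['gps_time'] = time_gps.value     # GPS seconds as floats
    df['utc_time'] = time_gps.utc.isot  # ISO 8601 strings, converted in one vectorized call
    df.to_csv(path, index=False)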
I have a CSV file with online Teams meeting data. It has two columns: one is "names" and the other is each attendee's duration of attendance during the meeting. I want to turn this information into a graph to quickly identify who stayed in the meeting for a long time and who for less time. The duration column data is in this format: 3h 54m, i.e. it contains the characters h and m.
Now how can I convert this data into decimal values like 3.54 or 234?
3.54 would mean hours and 234 would mean total minutes. I am happy with either form: hours (e.g. 3.54) or total minutes (e.g. 234).
Numpy or Pandas are both welcome.
The conversion of the timestring can be achieved with the datetime module:
from datetime import datetime
time_string = '3h 54m'
datetime_obj = datetime.strptime(time_string, '%Hh %Mm')
total_mins = (datetime_obj.hour * 60) + datetime_obj.minute
time_in_hours = total_mins / 60
# Outputs: `3.9 234`
print(time_in_hours, total_mins)
Here '%H' means the number of hours and '%M' is the number of minutes (both zero-padded). If this is encapsulated in a function, it can be applied to the data from your spreadsheet that has been read in via numpy or pandas.
REFERENCE:
https://docs.python.org/3/library/datetime.html#datetime.datetime.strptime
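For instance, a minimal sketch of applying that conversion with pandas; the file name attendance.csv and the column name "Duration" are assumptions about your spreadsheet, and this form only handles strings like "3h 54m":

from datetime import datetime

import pandas as pd


def to_minutes(time_string):
    parsed = datetime.strptime(time_string, '%Hh %Mm')
    return parsed.hour * 60 + parsed.minute


df = pd.read_csv('attendance.csv')               # assumed columns: "names", "Duration"
df['minutes'] = df['Duration'].apply(to_minutes)
df['hours'] = df['minutes'] / 60
print(df.sort_values('minutes', ascending=False))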
3h 54m is not 3.54. It's 3 + 54/60 = 3.9. Also, since you have some items with seconds, it may be best to do all the conversions to seconds so you don't lose significant digits to rounding if you need to add any of the items. For example, 37 minutes is 0.61666667 hours, so using seconds gives more precise results if you have to combine things.
The function below can handle h, m and s and any combination of them. I provide examples at the bottom. I'm sure there are hundreds of ways to do this conversion. There are tons of examples on StackOverflow of reading and using CSV files, so I didn't cover that. I like playing with the math stuff. :)
def hms_to_seconds(time_string):
    # split the string; assumes blank space(s) between each time component
    timeparts = time_string.split()
    h = 0
    m = 0
    s = 0
    # loop through the components and get the values
    for part in timeparts:
        if part[-1] == "h":
            h = int(part.partition("h")[0])
        if part[-1] == "m":
            m = int(part.partition("m")[0])
        if part[-1] == "s":
            s = int(part.partition("s")[0])
    return h*3600 + m*60 + s

print(hms_to_seconds("3h 45m"))  # 13500 sec
print(hms_to_seconds("1m 1s"))   # 61 sec
print(hms_to_seconds("1h 1s"))   # 3601 sec
print(hms_to_seconds("2h"))      # 7200 sec
print(hms_to_seconds("10m"))     # 600 sec
print(hms_to_seconds("33s"))     # 33 sec
1. Read .csv
Reading the csv file into memory can be done using pd.read_csv()
2. Parse time string
Parser
Ideally you want to write a parser that can parse the time string for you. This can be done with ANTLR or by writing one yourself. See e.g. this blog post. This would be the most interesting solution and also the most challenging one. There might already be a parser out there that can handle this, as the format is quite common, so you might want to search for one.
RegEx
A quick and dirty solution can be implemented using RegEx. The idea is to use the RegEx [0-9]*(?=h) for hours, [0-9]*(?=m) for minutes and [0-9]*(?=s) for seconds to extract the values before those identifiers and then compute the total duration in seconds.
Please note: this is not an ideal solution; it only works for "h", "m" and "s", and there are edge cases this implementation does not handle, but it works in principle.
import re

duration_calcs = {
    "h": 60 * 60,  # convert hours to seconds => hours * 60 * 60
    "m": 60,       # convert minutes to seconds => minutes * 60
    "s": 1,        # convert seconds to seconds => seconds * 1
}

tests_cases = [
    "3h 54m",
    "4h 9m",
    "12s",
    "41m 1s"
]

tests_results = [
    14040,  # 3 * 60 * 60 + 54 * 60
    14940,  # 4 * 60 * 60 + 9 * 60
    12,     # 12 * 1
    2461    # 41 * 60 + 1 * 1
]


def get_duration_component(identifier, text):
    """
    Gets one component of the duration from the string and returns the duration in seconds.

    :param identifier: either "h", "m" or "s"
    :type identifier: str
    :param text: text to extract the information from
    :type text: str
    :return: duration in seconds
    """
    # RegEx using positive lookahead on "h", "m" or "s" to extract the number before that identifier
    regex = re.compile(f"[0-9]*(?={identifier})")
    match = regex.search(text)
    if not match:
        return 0
    return int(match.group()) * duration_calcs[identifier]


def get_duration(text):
    """
    Get duration from text which contains a duration like "4h 43m 12s".
    Only "h", "m" and "s" are supported.

    :param text: text which contains a duration
    :type text: str
    :return: duration in seconds
    """
    idents = ["h", "m", "s"]
    total_duration = 0
    for ident in idents:
        total_duration += get_duration_component(ident, text)
    return total_duration


def run_tests(tests, test_results):
    """
    Run tests to verify that this quick and dirty implementation works at least for the given test cases.

    :param tests: test cases to run
    :type tests: list[str]
    :param test_results: expected test results
    :type test_results: list[int]
    """
    for i, (test_inp, expected_output) in enumerate(zip(tests, test_results)):
        result = get_duration(test_inp)
        print(f"Test {i}: Input={test_inp}\tExpected Output={expected_output}\tResult: {result}", end="")
        if result == expected_output:
            print("\t[SUCCESS]")
        else:
            print("\t[FAILURE]")


run_tests(tests_cases, tests_results)
Expected result
Test 0: Input=3h 54m Expected Output=14040 Result: 14040 [SUCCESS]
Test 1: Input=4h 9m Expected Output=14940 Result: 14940 [SUCCESS]
Test 2: Input=12s Expected Output=12 Result: 12 [SUCCESS]
Test 3: Input=41m 1s Expected Output=2461 Result: 2461 [SUCCESS]
split()
Another (even simpler) solution could be to split() at spaces, use some if statements to determine whether a part ends in "h", "m" or "s", parse the preceding digits to int, and convert them as shown above. The idea is similar, so I did not write a full program for this.
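A minimal sketch of that split()-based idea, reusing the duration_calcs mapping from the RegEx section, might look like this:

def get_duration_split(text):
    """split()-based variant of get_duration(); assumes parts like "3h", "54m", "12s"."""
    total = 0
    for part in text.split():
        value, unit = part[:-1], part[-1]
        total += int(value) * duration_calcs.get(unit, 0)
    return total

print(get_duration_split("3h 54m"))  # 14040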
I posted a similar question a few days ago but without any code; now I have created test code in the hope of getting some help.
Code is at the bottom.
I have a dataset with a bunch of large files (~100), and I want to extract specific lines from those files very efficiently (both in memory and in speed).
My code gets a list of relevant files. It opens each file with [line 1], then maps the file to memory with [line 2]. For each file I also receive a list of indices, and going over the indices I retrieve the relevant information (10 bytes for this example) as in [lines 3-4]. Finally, I close the handles with [lines 5-6].
binaryFile = open(path, "r+b")
binaryFile_mm = mmap.mmap(binaryFile.fileno(), 0)
for INDEX in INDEXES:
    information = binaryFile_mm[INDEX:INDEX + 10].decode("utf-8")
binaryFile_mm.close()
binaryFile.close()
This code runs in parallel, with thousands of indices for each file, and it does that continuously several times a second for hours.
Now to the problem - the code runs well when I limit the indices to be small (meaning, when I ask the code to get information from the beginning of the file). But when I increase the range of the indices, everything slows down to (almost) a halt AND the buff/cache memory gets full (I'm not sure if the memory issue is related to the slowdown).
So my question is: why does it matter whether I retrieve information from the beginning or the end of the file, and how do I overcome this in order to get instant access to information from the end of the file, without slowing down and without filling up the buff/cache memory?
PS - some numbers and sizes: I have ~100 files, each about 1GB in size. When I limit the indices to the first 0%-10% of the file it runs fine, but when I allow the index to be anywhere in the file it stops working.
Code - tested on Linux and Windows with Python 3.5, requires 10 GB of storage (creates 3 files with random strings inside, 3GB each)
import os, errno, sys
import random, time
import mmap


def create_binary_test_file():
    print("Creating files with 3,000,000,000 characters, takes a few seconds...")
    test_binary_file1 = open("test_binary_file1.testbin", "wb")
    test_binary_file2 = open("test_binary_file2.testbin", "wb")
    test_binary_file3 = open("test_binary_file3.testbin", "wb")
    for i in range(1000):
        if i % 100 == 0:
            print("progress - ", i / 10, " % ")
        # efficiently create random strings and write to files
        tbl = bytes.maketrans(bytearray(range(256)),
                              bytearray([ord(b'a') + b % 26 for b in range(256)]))
        random_string = (os.urandom(3000000).translate(tbl))
        test_binary_file1.write(str(random_string).encode('utf-8'))
        test_binary_file2.write(str(random_string).encode('utf-8'))
        test_binary_file3.write(str(random_string).encode('utf-8'))
    test_binary_file1.close()
    test_binary_file2.close()
    test_binary_file3.close()
    print("Created binary file for testing. The file contains 3,000,000,000 characters")


# Opening binary test file
try:
    binary_file = open("test_binary_file1.testbin", "r+b")
except OSError as e:  # this would be "except OSError, e:" before Python 2.6
    if e.errno == errno.ENOENT:  # errno.ENOENT = no such file or directory
        create_binary_test_file()
        binary_file = open("test_binary_file1.testbin", "r+b")

## example of use - perform 100 times, in each iteration: open one of the binary files and retrieve 5,000 sample strings
## (if code runs fast and without a slowdown - increase the k or other numbers and it should reproduce the problem)

## Example 1 - getting information from start of file
print("Getting information from start of file")
etime = []
for i in range(100):
    start = time.time()
    binary_file_mm = mmap.mmap(binary_file.fileno(), 0)
    sample_index_list = random.sample(range(1, 100000 - 1000), k=50000)
    sampled_data = [[binary_file_mm[v:v + 1000].decode("utf-8")] for v in sample_index_list]
    binary_file_mm.close()
    binary_file.close()
    file_number = random.randint(1, 3)
    binary_file = open("test_binary_file" + str(file_number) + ".testbin", "r+b")
    etime.append((time.time() - start))
    if i % 10 == 9:
        print("Iter ", i, " \tAverage time - ", '%.5f' % (sum(etime[-9:]) / len(etime[-9:])))
binary_file.close()

## Example 2 - getting information from all of the file
print("Getting information from all of the file")
binary_file = open("test_binary_file1.testbin", "r+b")
etime = []
for i in range(100):
    start = time.time()
    binary_file_mm = mmap.mmap(binary_file.fileno(), 0)
    sample_index_list = random.sample(range(1, 3000000000 - 1000), k=50000)
    sampled_data = [[binary_file_mm[v:v + 1000].decode("utf-8")] for v in sample_index_list]
    binary_file_mm.close()
    binary_file.close()
    file_number = random.randint(1, 3)
    binary_file = open("test_binary_file" + str(file_number) + ".testbin", "r+b")
    etime.append((time.time() - start))
    if i % 10 == 9:
        print("Iter ", i, " \tAverage time - ", '%.5f' % (sum(etime[-9:]) / len(etime[-9:])))
binary_file.close()
My results (the average time for getting information from across the whole file is almost 4 times higher than for getting it from the beginning; with ~100 files and parallel computing this difference gets much bigger):
Getting information from start of file
Iter 9 Average time - 0.14790
Iter 19 Average time - 0.14590
Iter 29 Average time - 0.14456
Iter 39 Average time - 0.14279
Iter 49 Average time - 0.14256
Iter 59 Average time - 0.14312
Iter 69 Average time - 0.14145
Iter 79 Average time - 0.13867
Iter 89 Average time - 0.14079
Iter 99 Average time - 0.13979
Getting information from all of the file
Iter 9 Average time - 0.46114
Iter 19 Average time - 0.47547
Iter 29 Average time - 0.47936
Iter 39 Average time - 0.47469
Iter 49 Average time - 0.47158
Iter 59 Average time - 0.47114
Iter 69 Average time - 0.47247
Iter 79 Average time - 0.47881
Iter 89 Average time - 0.47792
Iter 99 Average time - 0.47681
The basic reason why you have this time difference is that you have to seek to where you need in the file. The further from position 0 you are, the longer it's going to take.
What might help is this: since you know the starting index you need, seek on the file descriptor to that point and then do the mmap. Or really, why bother with mmap in the first place - just read the number of bytes that you need from the seeked-to position, and put that into your result variable.
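A minimal sketch of that seek/read approach, with the 10-byte field width from the question; the path and index list are placeholders:

def read_samples(path, indexes, width=10):
    """Read `width` bytes at each offset with plain seek/read instead of mmap."""
    results = []
    # buffering=0 gives unbuffered binary I/O, so only the requested bytes are read
    with open(path, "rb", buffering=0) as f:
        for index in indexes:
            f.seek(index)
            results.append(f.read(width).decode("utf-8"))
    return results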
To determine if you're getting adequate performance, check the memory available for the buffer/page cache (free in Linux), I/O stats - the number of reads, their size and duration (iostat; compare with the specs of your hardware), and the CPU utilization of your process.
[edit] Assuming that you read from a locally attached SSD (without having the data you need in the cache):
When reading in a single thread, you should expect your batch of 50,000 reads to take more than 7 seconds (50000*0.000150). Probably longer because the 50k accesses of a mmap-ed file will trigger more or larger reads, as your accesses are not page-aligned - as I suggested in another Q&A I'd use simple seek/read instead (and open the file with buffering=0 to avoid unnecessary reads for Python buffered I/O).
With more threads/processes reading simultaneously, you can saturate your SSD throughput (how many 4KB reads/s it can do - anywhere from 5,000 to 1,000,000), and then the individual reads will become even slower.
[/edit]
The first example only accesses 3*100KB of the files' data, so as you have much more than that available for the cache, all of the 300KB quickly end up in the cache, so you'll see no I/O, and your python process will be CPU-bound.
I'm 99.99% sure that if you test reading from the last 100KB of each file, it will perform as well as the first example - it's not about the location of the data, but about the size of the data accessed.
The second example accesses random portions from 9GB, so you can hope to see similar performance only if you have enough free RAM to cache all of the 9GB, and only after you preload the files into the cache, so that the testcase runs with zero I/O.
In realistic scenarios, the files will not be fully in the cache - so you'll see many I/O requests and much lower CPU utilization for python. As I/O is much slower than cached access, you should expect this example to run slower.
I'm using a Raspberry Pi 3 and an ADS1115, and my project requires evenly spaced samples so I can plot and analyse the data. Other posts were about achieving 10k and 50k SPS, but I only require 500 SPS and even that isn't working. Is there a way to run my code for 120 seconds at 500 SPS and get 60,000 samples from both the A0 and A1 channels at the same time? I have attached the code for reference. Thanks in advance.
import Adafruit_ADS1x15
import time
import numpy as np

pga = 2/3        # Set full-scale range of the programmable gain
                 # amplifier (page 13 of the data sheet); change
                 # depending on the input voltage range
ADS1115 = 0x01   # Specify that the device being used is the
                 # ADS1115; for the ADS1015 use 0x00

adc = Adafruit_ADS1x15.ADS1015()  # Create instance of the class ADS1x15

# Function to print sampled values to the terminal
def logdata():

    print "sps value should be one of: 8, 16, 32, 64, 128, 250, 475, 860, otherwise the value will default to 250"

    frequency = input("Input sampling frequency (Hz): ")  # Get sampling frequency from the user
    sps = input("Input sps (Hz) : ")                      # Get ADS1115 sps value from the user
    time1 = input("Input sample time (seconds): ")        # Get how long to sample for from the user

    period = 1.0 / frequency             # Calculate sampling period
    datapoints = int(time1 * frequency)  # Total number of samples to take, which must be an integer

    startTime = time.time()  # Time of first sample
    t1 = startTime           # t1 is the last sample time
    t2 = t1                  # t2 is the current time

    for x in range(0, datapoints):   # Loop in which data is sampled
        while (t2 - t1 < period):    # If t2-t1 is less than the sample period,
            t2 = time.time()         # update t2 and check again
        t1 += period                 # Update last sample time by the sampling period
        print adc.read_adc(0, pga, data_rate=sps), "mV ", ("%.2f" % (t2 - startTime)), "s"  # Print sampled value and time

# Call the logdata function
logdata()
1) Are you using the ADS1115 ???
adc = Adafruit_ADS1x15.ADS1015() # should be adc = Adafruit_ADS1x15.ADS1115()
2) You can't read two or more single-ended channels at the same time. In differential mode you can compare two channels to yield one value.
To read a value from channel 1 in addition to channel 0, you would have to add another call in your loop:
print adc.read_adc(0, pga, data_rate=sps) ..... # original call for channel 0
print adc.read_adc(1, pga, data_rate=sps) ..... # new call for channel 1
3) before a value can be read, the ADS must be configured with several parameters, like channel, rate, gain etc. After some time, which is needed for doing an analog/digital conversion, values can be read, once, or over and over again, when in continuous mode.
In the original Adafruit library this time is calculated from the data rate (insecure), and in the recent port to CircuitPython this time will most often be around 0.01 s, because the conversion will most probably not have finished directly after configuration (check the _read method).
4) Reading at 500 SPS is about the fastest the ADS1115 can manage. The reference states that conversion at 860 SPS takes 1.2 ms. Adding the time for configuration and reading, you will not be able to read two or more values continuously every 0.002 s, even if you were receiving notifications for conversion, like I described on my homepage, instead of waiting for a fixed period of time.
5) I think the closest you can get with Python is to run 2 daisy-chained ADS1115 in continuous mode with GPIO notifications, but I have no experience with that.
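For reference, a minimal sketch of alternating reads of A0 and A1 with the Adafruit_ADS1x15 library used above; this is not simultaneous sampling, and for the reasons in point 4) it will not guarantee 500 SPS per channel. The gain and data-rate values are only examples:

import time
import Adafruit_ADS1x15

adc = Adafruit_ADS1x15.ADS1115()  # ADS1115, as noted in point 1)

GAIN = 1          # example gain setting
DATA_RATE = 860   # fastest rate, to leave headroom for the two reads

samples = []
start = time.time()
while time.time() - start < 120:  # run for 120 seconds
    a0 = adc.read_adc(0, gain=GAIN, data_rate=DATA_RATE)
    a1 = adc.read_adc(1, gain=GAIN, data_rate=DATA_RATE)
    samples.append((time.time() - start, a0, a1))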
Sample records in the data file (SAM file):
M01383 0 chr4 66439384 255 31M * 0 0 AAGAGGA GFAFHGD MD:Z:31 NM:i:0
M01382 0 chr1 241995435 255 31M * 0 0 ATCCAAG AFHTTAG MD:Z:31 NM:i:0
......
The data files are line-based.
The size of the data files varies from 1 GB to 5 GB.
I need to go through the records in the data file line by line, get a particular value (e.g. the 4th value, 66439384) from each line, and pass this value to another function for processing. Then some result counters will be updated.
The basic workflow is like this:
# global variables; counters will be updated in search_function according to the value passed
counter_a = 0
counter_b = 0
counter_c = 0

with open(textfile) as f:
    for line in f:
        value = line.split()[3]
        search_function(value)  # this function takes a bit of a long time to process

def search_function(value):
    # some condition checks:
    #     update counter_a, counter_b or counter_c
    ...
With single-process code and a roughly 1.5 GB data file, it took about 20 hours to run through all the records in one data file. I need much faster code because there are more than 30 data files of this kind.
I was thinking of processing the data file in N chunks in parallel, where each chunk performs the above workflow and updates the global variables (counter_a, counter_b, counter_c) simultaneously. But I don't know how to achieve this in code, or whether it will work.
I have access to a server machine with 24 processors and around 40 GB of RAM.
Could anyone help with this? Thanks very much.
The simplest approach would probably be to do all 30 files at once with your existing code -- it would still take all day, but you'd have all the files done at once. (i.e., "9 babies in 9 months" is easy, "1 baby in 1 month" is hard.)
If you really want to get a single file done faster, it will depend on how your counters actually update. If almost all the work is just in analysing the value, you can offload that using the multiprocessing module:
import time
import multiprocessing

def slowfunc(value):
    time.sleep(0.01)
    return value**2 + 0.3*value + 1

counter_a = counter_b = counter_c = 0
def add_to_counter(res):
    global counter_a, counter_b, counter_c
    counter_a += res
    counter_b -= (res - 10)**2
    counter_c += (int(res) % 2)

pool = multiprocessing.Pool(50)
results = []

for value in range(100000):
    r = pool.apply_async(slowfunc, [value])
    results.append(r)

    # don't let the queue grow too long
    if len(results) == 1000:
        results[0].wait()

    while results and results[0].ready():
        r = results.pop(0)
        add_to_counter(r.get())

for r in results:
    r.wait()
    add_to_counter(r.get())

print(counter_a, counter_b, counter_c)
That will allow 50 slowfuncs to run in parallel, so instead of taking 1000s (=100k*0.01s), it takes 20s (100k/50)*0.01s to complete. If you can restructure your function into "slowfunc" and "add_to_counter" like the above, you should be able to get a factor of 24 speedup.
Read one file at a time, use all CPUs to run search_function():
#!/usr/bin/env python
from multiprocessing import Array, Pool

def init(counters_):  # called for each child process
    global counters
    counters = counters_

def search_function(value):  # assume it is a CPU-intensive task
    # some condition checks:
    #     update counter 'a', 'b' or 'c', e.g.:
    counters[0] += 1  # counter 'a'
    counters[1] += 1  # counter 'b'
    return value, result, error

if __name__ == '__main__':
    counters = Array('i', [0]*3)

    pool = Pool(initializer=init, initargs=[counters])
    values = (line.split()[3] for line in textfile)
    for value, result, error in pool.imap_unordered(search_function, values,
                                                    chunksize=1000):
        if error is not None:
            print('value: {value}, error: {error}'.format(**vars()))
    pool.close()
    pool.join()
    print(list(counters))
Make sure (for example, by writing wrappers) that exceptions do not escape next(values) or search_function().
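For example, a minimal sketch of such a wrapper around search_function(), so a bad line turns into an error result instead of killing the worker (the tuple shape mirrors the snippet above):

def safe_search_function(value):
    """Wrapper that never lets an exception escape into the pool machinery."""
    try:
        return search_function(value)
    except Exception as exc:
        return value, None, exc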
This solution works on a set of files.
For each file, it divides it into a specified number of line-aligned chunks, solves each chunk in parallel, then combines the results.
It streams each chunk from disk; this is somewhat slower, but does not consume nearly so much memory. We depend on disk cache and buffered reads to prevent head thrashing.
Usage is like
python script.py -n 16 sam1.txt sam2.txt sam3.txt
and script.py is
import argparse
from io import SEEK_END
import multiprocessing as mp
#
# Worker process
#
def summarize(fname, start, stop):
"""
Process file[start:stop]
start and stop both point to first char of a line (or EOF)
"""
a = 0
b = 0
c = 0
with open(fname, newline='') as inf:
# jump to start position
pos = start
inf.seek(pos)
for line in inf:
value = int(line.split(4)[3])
# *** START EDIT HERE ***
#
# update a, b, c based on value
#
# *** END EDIT HERE ***
pos += len(line)
if pos >= stop:
break
return a, b, c
def main(num_workers, sam_files):
print("{} workers".format(num_workers))
pool = mp.Pool(processes=num_workers)
# for each input file
for fname in sam_files:
print("Dividing {}".format(fname))
# decide how to divide up the file
with open(fname) as inf:
# get file length
inf.seek(0, SEEK_END)
f_len = inf.tell()
# find break-points
starts = [0]
for n in range(1, num_workers):
# jump to approximate break-point
inf.seek(n * f_len // num_workers)
# find start of next full line
inf.readline()
# store offset
starts.append(inf.tell())
# do it!
stops = starts[1:] + [f_len]
start_stops = zip(starts, stops)
print("Solving {}".format(fname))
results = [pool.apply(summarize, args=(fname, start, stop)) for start,stop in start_stops]
# collect results
results = [sum(col) for col in zip(*results)]
print(results)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Parallel text processor')
parser.add_argument('--num_workers', '-n', default=8, type=int)
parser.add_argument('sam_files', nargs='+')
args = parser.parse_args()
main(args.num_workers, args.sam_files)
What you don't want to do is hand files to individual CPUs. If you do, the file opens/reads will likely cause the heads to bounce randomly all over the disk, because the files are likely to be scattered across the disk.
Instead, break each file into chunks and process the chunks.
Open the file with one CPU. Read the whole thing into an array Text. You want to do this as one massive read to prevent the heads from thrashing around the disk, under the assumption that your file(s) are placed on the disk in relatively large sequential chunks.
Divide its size in bytes by N, giving a (global) value K, the approximate number of bytes each CPU should process. Fork N threads, and hand each thread i its index i and a copied handle for each file.
Each thread i starts a thread-local scan pointer p into Text at offset i*K. It scans the text, incrementing p and ignoring the text until a newline is found. At this point, it starts processing lines (incrementing p as it scans the lines). It stops after processing a line once its index into Text is greater than (i+1)*K.
If the amount of work per line is about equal, your N cores will all finish about the same time.
(If you have more than one file, you can then start the next one).
If you know that the file sizes are smaller than memory, you might arrange the file reads to be pipelined, e.g., while the current file is being processed, a file-read thread is reading the next file.
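A minimal sketch of the chunked scan described above, over an in-memory bytes buffer; process_line is a placeholder for the per-line work. (In CPython, threads won't give CPU parallelism for the per-line work because of the GIL, so real speedups would need processes, but the chunking logic is the same.)

import threading

def scan_chunk(text, start, stop, process_line):
    """Process every line whose first character lies in text[start:stop).

    Skips the partial line at the start of the chunk (it belongs to the
    previous chunk) and finishes the line that crosses the stop offset."""
    p = start
    if start > 0 and text[start - 1:start] != b"\n":
        nl = text.find(b"\n", p)
        if nl == -1:
            return
        p = nl + 1
    while p < stop:
        nl = text.find(b"\n", p)
        line_end = len(text) if nl == -1 else nl
        process_line(text[p:line_end])
        p = line_end + 1

def scan_file(path, N, process_line):
    with open(path, "rb") as f:
        text = f.read()    # one big sequential read
    K = len(text) // N     # approximate bytes per worker
    bounds = [i * K for i in range(N)] + [len(text)]
    threads = [threading.Thread(target=scan_chunk,
                                args=(text, bounds[i], bounds[i + 1], process_line))
               for i in range(N)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()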