Time-series data analysis using scientific python: continuous analysis over multiple files - python

The Problem
I'm doing time-series analysis. Measured data comes from sampling the voltage output of a sensor at 50 kHz and then dumping that data to disk as separate files in one-hour chunks. Data is saved to an HDF5 file using PyTables as a CArray. This format was chosen to maintain interoperability with MATLAB.
The full data set is now multiple TB, far too large to load into memory.
Some of my analysis requires me to iterate over the full data set. For analysis that requires me to grab chunks of data, I can see a path forward through creating a generator method. I'm a bit uncertain how to proceed with analysis that requires a continuous time series.
Example
For example, let's say I'm looking to find and categorize transients using some moving window process (e.g. wavelet analysis) or apply a FIR filter. How do I handle the boundaries, either at the end or beginning of a file or at chunk boundaries? I would like the data to appear as one continuous data set.
Request
I would love to:
Keep the memory footprint low by loading data as necessary.
Keep a map of the entire data set in memory so that I can address the data set as I would a regular pandas Series object, e.g. data[time1:time2].
I'm using scientific python (Enthought distribution) with all the regular stuff: numpy, scipy, pandas, matplotlib, etc. I only recently started incorporating pandas into my work flow and I'm still unfamiliar with all of its capabilities.
I've looked over related stackexchange threads and didn't see anything that exactly addressed my issue.
EDIT: FINAL SOLUTION.
Based upon the helpful hints I built an iterator that steps over files and returns chunks of arbitrary size---a moving window that hopefully handles file boundaries with grace. I've added the option of padding the front and back of each of the windows with data (overlapping windows). I can then apply a succession of filters to the overlapping windows and then remove the overlaps at the end. This, I hope, gives me continuity.
I haven't yet implemented __getitem__ but it's on my list of things to do.
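To illustrate how the padded windows are meant to be consumed, here is a minimal sketch (not part of the class below) of the overlap-and-trim idea; it assumes an instance of the container class below, sample-based padding, and an arbitrary FIR filter design:

import numpy as np
from scipy.signal import firwin, lfilter

# 'startdir', 'tstart', and 'dt' are placeholders for real values
container = FolderContainer(startdir)
taps = firwin(numtaps=101, cutoff=0.1)   # arbitrary example FIR filter
pad = len(taps)                          # pad each window by the filter length

pieces = []
for window in container.chunk(tstart, dt, frontpad=pad, backpad=pad):
    filtered = lfilter(taps, 1.0, window)
    pieces.append(filtered[pad:-pad])    # trim the padded edges again
continuous = np.concatenate(pieces)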
Here's the final code. A few details are omitted for brevity.
class FolderContainer(readdata.DataContainer):

    def __init__(self, startdir):
        readdata.DataContainer.__init__(self, startdir)
        self.filelist = None
        self.fs = None
        self.nsamples_hour = None
        # Build the file list
        self._build_filelist(startdir)

    def _build_filelist(self, startdir):
        """
        Populate the filelist dictionary with active files and their associated
        file date (YYYY,MM,DD) and hour.

        Each entry in 'filelist' has the form (abs. path : datetime) where the
        datetime object contains the complete date and hour information.
        """
        print('Building file list....', end='')
        # Use the full file path instead of a relative path so that we don't
        # run into problems if we change the current working directory.
        filelist = {os.path.abspath(f): self._datetime_from_fname(f)
                    for f in os.listdir(startdir)
                    if fnmatch.fnmatch(f, 'NODE*.h5')}
        # If we haven't found any files, raise an error
        if not filelist:
            msg = "Input directory does not contain Illionix h5 files."
            raise IOError(msg)
        # Filelist is an ordered dictionary. Sort before saving.
        self.filelist = OrderedDict(sorted(filelist.items(),
                                           key=lambda t: t[0]))
        print('done')

    def _datetime_from_fname(self, fname):
        """
        Return the year, month, day, and hour from a filename as a datetime
        object.
        """
        # Filename has the prototype: NODE##-YY-MM-DD-HH.h5. Split this up and
        # take only the date parts. Convert the year from YY to YYYY.
        (year, month, day, hour) = [int(d) for d in re.split(r'-|\.', fname)[1:-1]]
        year += 2000
        return datetime.datetime(year, month, day, hour)

    def chunk(self, tstart, dt, **kwargs):
        """
        Generator for returning consecutive chunks of data, with overlaps,
        from the entire set of Illionix data files.

        Parameters
        ----------
        Arguments:
            tstart: UTC start time [provided as a datetime or date string]
            dt: Chunk size [integer number of samples]
        Keyword arguments:
            tend: UTC end time [provided as a datetime or date string].
            frontpad: Padding in front of sample [integer number of samples].
            backpad: Padding in back of sample [integer number of samples]

        Yields:
            chunk: one window of data, including any requested padding
        """
        # PARSE INPUT ARGUMENTS
        # Ensure 'tstart' is a datetime object.
        tstart = self._to_datetime(tstart)
        # Find the offset, in samples, of the starting position of the window
        # in the first data file
        tstart_samples = self._to_samples(tstart)
        # Convert dt to samples. Because dt is a timedelta object, we can't use
        # '_to_samples' for conversion.
        if isinstance(dt, int):
            dt_samples = dt
        elif isinstance(dt, datetime.timedelta):
            dt_samples = np.int64((dt.days*24*3600 + dt.seconds +
                                   dt.microseconds/1e6) * self.fs)
        else:
            # FIXME: Pandas 0.13 includes a 'to_timedelta' function. Change
            # below when EPD pushes the update.
            t = self._parse_date_str(dt)
            dt_samples = np.int64((t.minute*60 + t.second) * self.fs)
        # Read keyword arguments. 'tend' defaults to the end of the last file
        # if a time is not provided.
        default_tend = self.filelist.values()[-1] + datetime.timedelta(hours=1)
        tend = self._to_datetime(kwargs.get('tend', default_tend))
        tend_samples = self._to_samples(tend)
        frontpad = kwargs.get('frontpad', 0)
        backpad = kwargs.get('backpad', 0)

        # CREATE FILE LIST
        # Build the list of data files we will iterate over based upon
        # the start and stop times.
        print('Pruning file list...', end='')
        tstart_floor = datetime.datetime(tstart.year, tstart.month, tstart.day,
                                         tstart.hour)
        filelist_pruned = OrderedDict([(k, v) for k, v in self.filelist.items()
                                       if v >= tstart_floor and v <= tend])
        print('done.')
        # Check to ensure that we're not missing files by enforcing that there
        # is exactly an hour offset between all files.
        if not all([dt == datetime.timedelta(hours=1)
                    for dt in np.diff(np.array(filelist_pruned.values()))]):
            raise readdata.DataIntegrityError("Hour gap(s) detected in data")

        # MOVING WINDOW GENERATOR ALGORITHM
        # Keep two files open: the current file and the next in line (queue file)
        fname_generator = self._file_iterator(filelist_pruned)
        fname_current = fname_generator.next()
        fname_next = fname_generator.next()
        # Iterate over all the files. 'lastfile' indicates when we're
        # processing the last file in the queue.
        lastfile = False
        i = tstart_samples
        while True:
            with tables.openFile(fname_current) as fcurrent, \
                    tables.openFile(fname_next) as fnext:
                # Point to the data
                data_current = fcurrent.getNode('/data/voltage/raw')
                data_next = fnext.getNode('/data/voltage/raw')
                # Process all data windows associated with the current pair of
                # files. Avoid unnecessary file access operations as we move
                # the sliding window.
                while True:
                    # Conditionals that depend on whether our slice:
                    # (1) has moved completely into the next hour
                    # (2) partially spills into the next hour
                    # (3) is completely in the current hour.
                    if i - backpad >= self.nsamples_hour:
                        # If we're already on our last file in the processing
                        # queue, we can't continue to the next. Exit. Generator
                        # is finished.
                        if lastfile:
                            raise GeneratorExit
                        # Advance the active and queue file names.
                        fname_current = fname_next
                        try:
                            fname_next = fname_generator.next()
                        except StopIteration:
                            # We've reached the end of our file processing
                            # queue. Indicate this is the last file so that if
                            # we try to pull data across the next file
                            # boundary, we'll exit.
                            lastfile = True
                        # Our data slice has completely moved into the next
                        # hour.
                        i -= self.nsamples_hour
                        # Return the data
                        yield data_next[i-backpad:i+dt_samples+frontpad]
                        # Move window by amount dt
                        i += dt_samples
                        # We've completely moved on to the next pair of files.
                        # Move to the outer scope to grab the next set of
                        # files.
                        break
                    elif i + dt_samples + frontpad >= self.nsamples_hour:
                        if lastfile:
                            raise GeneratorExit
                        # Slice spills over into the next hour
                        yield np.r_[data_current[i-backpad:],
                                    data_next[:i+dt_samples+frontpad-self.nsamples_hour]]
                        i += dt_samples
                    else:
                        if lastfile:
                            # Exit once our slice crosses the boundary of the
                            # last file.
                            if i + dt_samples + frontpad > tend_samples:
                                raise GeneratorExit
                        # Slice is completely within the current hour
                        yield data_current[i-backpad:i+dt_samples+frontpad]
                        i += dt_samples

    def _to_samples(self, input_time):
        """Convert input time, if not in samples, to samples."""
        if isinstance(input_time, int):
            # Input time is already in samples
            return input_time
        elif isinstance(input_time, datetime.datetime):
            # Input time is a datetime object
            return self.fs * (input_time.minute * 60 + input_time.second)
        else:
            raise ValueError("Invalid input 'tstart' parameter")

    def _to_datetime(self, input_time):
        """Return the passed time as a datetime object."""
        if isinstance(input_time, datetime.datetime):
            converted_time = input_time
        elif isinstance(input_time, str):
            converted_time = self._parse_date_str(input_time)
        else:
            raise TypeError("A datetime object or string date/time were "
                            "expected")
        return converted_time

    def _file_iterator(self, filelist):
        """Generator for iterating over file names."""
        for fname in filelist:
            yield fname

Sean, here's my 2c.
Take a look at this issue here which I created a while back. This is essentially what you are trying to do. This is a bit non-trivial.
Without knowing more details, I would offer a couple of suggestions:
HDFStore CAN read in a standard CArray type of format, see here
You can easily create a 'Series'-like object that has the nice properties of a) knowing where each file is and its extents, and b) using __getitem__ to 'select' those files, e.g. s[time1:time2]. From a top-level view this might be a very nice abstraction, and you can then dispatch operations.
e.g.
class OutOfCoreSeries(object):

    def __init__(self, dir):
        .... load a list of the files in the dir where you have them ...

    def __getitem__(self, key):
        .... map the selection key (say it's a slice, which 'time1:time2' resolves to) ...
        .... to the files that make it up ...., then return a new Series that only
        .... holds those file pointers ....

    def apply(self, func, **kwargs):
        """ apply a function to the files """
        results = []
        for f in self.files:
            results.append(func(self.read_file(f)))
        return Results(results)
This can very easily get quite complicated. For instance, if you apply an operation that does a reduction that you can fit in memory, Results can simply be a pandas.Series (or Frame). However, you may be doing a transformation which necessitates writing out a new set of transformed data files. If so, then you have to handle this.
Several more suggestions:
You may want to hold onto your data in multiple ways, whichever may be useful. For instance, you say that you are saving multiple values in a 1-hour slice. It may be that you can split these 1-hour files into one file per variable that you are saving, but covering a much longer slice that then fits in memory.
You might want to resample the data to lower frequencies, and work on these, loading the data in a particular slice as needed for more detailed work.
You might want to create a dataset that is queryable across time, e.g. high-low peaks at varying frequencies, maybe using the Table format (see here).
Thus you may have multiple variations of the same data. Disk space is usually much cheaper/easier to manage than main memory. It makes a lot of sense to take advantage of that.
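As a rough illustration of the last two suggestions (a hedged sketch using the modern pandas API; 'raw_series' stands in for one hour of data as a Series with a DatetimeIndex, and the file/key names are illustrative):

import pandas as pd

# Downsample the 50 kHz signal to a 1 Hz summary that fits in memory
lowfreq = raw_series.resample('1s').mean()

with pd.HDFStore('summaries.h5') as store:
    # Table format is appendable and queryable by the index
    store.append('voltage_1s', lowfreq, format='table')
    # Later: pull out only the time range you need
    chunk = store.select('voltage_1s',
                         where='index >= "2013-05-01" & index < "2013-05-02"')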


Modifying a list from multiple Python pools

I have a large data set (~2Gb) to analyse and I'd like to multi process it to reduce the run time of the code. I've imported the dataset into a list which I will then want to run numerous passes over. On each pass I'll set up a pool for each available core and each pool will then only assess a certain block of the data set (note, the pool still needs access to the complete data set).
Each line of the input file takes the format "a,b,c,d,e,f,g,h" and all are numbers.
I'm struggling to separate out the parameters in the Calc1stPass pool; I'm getting a 'tuple index out of range' error. Can anyone help me out with this error please?
def Calc1stPass(DataSet, Params):
    print("DataSet =", DataSet)
    print("Params =", Params)
    Pass, (PoolNumber, ArrayCount, CoreCount) = Params
    StartRow = int((ArrayCount / CoreCount) * PoolNumber)
    EndRow = int(((ArrayCount / CoreCount) * (PoolNumber+1))-1)
    for Row in range(StartRow, EndRow):
        Rand = randrange(ArrayCount)
        Value1 = Decimal(DataSet[Row][0]) + Decimal(DataSet[Row][1])
        Value2 = Decimal(DataSet[Rand][0]) + Decimal(DataSet[Rand][1])
        Value3 = Value1 - Value2
        NewValue = Decimal(DataSet[Row][7]) + Value3
        DataSet[Row][7] = str(NewValue)

def main():
    # Importing the file
    print("Importing File ", FileToImport)
    OriginalDataSet = []
    f = open(FileToImport)
    for line in f:
        StrippedLine = line.rstrip()
        OriginalDataSet.append(StrippedLine.split(",",))
    ArrayCount = len(OriginalDataSet)
    # Running passes on dataset
    for Pass in range(NumberofPasses):
        print("Running Pass : ", Pass + 1, " of ", NumberofPasses)
        CoreCount = mp.cpu_count()
        WorkPool = mp.Pool(CoreCount)
        for PoolNumber in range(CoreCount):
            Params = [Pass, PoolNumber, ArrayCount, CoreCount]
            RevisedDataSet = WorkPool.starmap(Calc1stPass, product(OriginalDataSet, zip(range(1), Params)))
        print(RevisedDataSet)

if __name__ == "__main__":
    freeze_support()
    main()
Okay, here we go with what I came up with after some discussion plus trial and error. I hope I've kept it somewhat comprehensible. However, it seems you are very new to a lot of this, so you probably have a lot of reading to do regarding how certain libraries and data types work.
Analyzing the algorithm
Let's start with taking a closer look at your computation:
for Pass in range(Passes):
    for Row in range(StartRow, EndRow):
        Rand = randrange(ArrayCount)
        Value1 = Decimal(DataSet[Row][0]) + Decimal(DataSet[Row][1])
        Value2 = Decimal(DataSet[Rand][0]) + Decimal(DataSet[Rand][1])
        Value3 = Value1 - Value2
        NewValue = Decimal(DataSet[Row][7]) + Value3
        DataSet[Row][7] = str(NewValue)
So basically, we update a single row through a computation involving another random row.
Assumptions that I make:
the real algorithm does a bit more stuff, otherwise it is hard to see what you want to achieve
the access pattern of the real algorithm stays the same
Following our discussion, there are no functional reasons for the following aspects:
Computation in Decimal is unnecessary. float will do just fine
The values don't need to be stored as string. We can use an array of float
At this point it is clear that we can save tremendous amounts of runtime by using a numpy array instead of a list of string.
There is an additional hazard here for parallelization: We use random numbers. When we use multiple processes, the random number generators need to be set up for parallel generation. We'll cross that bridge when we get there.
Notably, the output column is no input for the next pass. The inputs per pass stay constant.
Input / Output
The input file format seems to be a simple CSV mostly filled with floating point numbers (using only one decimal place) and one column not being a floating point number. The text based format coupled with your information that there are gigabytes of data means that a significant amount of time will be spent just parsing the input file or formatting the output. I'll try to be efficient in both but keep things simple enough that extensions in both are possible.
Optimizing the sequential algorithm
It is always advisable to first optimize the sequential case before parallelizing. So we start here. We begin with parsing the input file into a numpy array.
import numpy as np

def ReadInputs(Filename):
    """Read a CSV file containing 10 columns

    The 7th column is skipped because it doesn't contain floating point values

    Return value:
    2D numpy array of floats
    """
    UsedColumns = (0, 1, 2, 3, 4, 5, 7, 8, 9)
    return np.loadtxt(Filename, delimiter=',', usecols=UsedColumns)
Since we are using numpy, we switch over to its random number generators. This is the setup routine. It allows us to force deterministic values for easier debugging.
def MakeRandomGenerator(Deterministic=False):
    """Initializes the random number generator avoiding birthday paradox

    Arguments:
    Deterministic -- if True, the same random numbers are being used

    Return value:
    numpy random number generator
    """
    SeedInt = 0 if Deterministic else None
    Seed = np.random.SeedSequence(SeedInt)
    return np.random.default_rng(Seed)
And now the main computation. Numpy makes this very straight-forward.
def ComputePass(DataSets, RandomGenerator):
    """The main computation

    Arguments:
    DataSets -- 2D numpy array. Changed in place
    RandomGenerator -- numpy random number generator
    """
    Count = len(DataSets)
    RandomIndices = RandomGenerator.integers(
        low=0, high=Count, size=Count)
    RandomRows = DataSets[RandomIndices]
    # All rows: first column + second column
    Value1 = DataSets[:, 0] + DataSets[:, 1]
    Value2 = RandomRows[:, 0] + RandomRows[:, 1]
    Value3 = Value1 - Value2
    # This change is in-place on the whole DataSets array
    DataSets[:, 7] += Value3
I've kept the structure the same. That means there are a few optimizations that we can still do:
We never use most columns. Columns that are unnecessary should be removed from the array (skipped in input parsing) to reduce memory consumption and improve locality of data. If necessary for output, it is better to merge in the output stage, maybe by re-reading the input file to gather the remaining columns
Since Value1 and Value2 never change, we could pre-compute Value3 for all rows and just use that. Again, if we don't need the first two columns in memory, better to remove them
If we transpose the array (or store in Fortran order), we improve vectorization. This will make the use of MPI harder, but not impossible
I've not done any of this because I do not want to stray too far from the original algorithm.
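To make the second of those points concrete, here is a hedged sketch of the pre-computation (the function names are mine, not part of the solution): the per-row sums never change across passes, so they can be cached once.

def PrecomputeSums(DataSets):
    # Columns 0 and 1 are constant across passes, so cache their sum once
    return DataSets[:, 0] + DataSets[:, 1]

def ComputePassCached(DataSets, Sums, RandomGenerator):
    Count = len(DataSets)
    RandomIndices = RandomGenerator.integers(low=0, high=Count, size=Count)
    # Value3 = (own row sum) - (random row sum), both taken from the cache
    DataSets[:, 7] += Sums - Sums[RandomIndices]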
The last step is the output. Here I go with a pure Python route to keep things simple and replicate the input file format:
def WriteOutputs(Filename, DataSets):
    LineFormat = "{:.1f}, " * 6 + "+" + ", {:.1f}" * 3 + "\n"
    with open(Filename, 'w') as OutFile:
        for Row in DataSets:
            OutFile.write(LineFormat.format(*Row))
Now the entire operation is rather simple:
def main():
    InFilename = "indata.csv"
    OutFilename = "outdata.csv"
    Passes = 20
    RandomGenerator = MakeRandomGenerator()
    DataSets = ReadInputs(InFilename)
    for _ in range(Passes):
        ComputePass(DataSets, RandomGenerator)
    WriteOutputs(OutFilename, DataSets)

if __name__ == '__main__':
    main()
Parallelization framework
There are two main concerns for parallelization:
For every row, we need access to the entire input data set to pick a random entry
The amount of calculation per row is very low
So we need to find a way that keeps overhead per row small and shares the input data set efficiently.
The first choice is multiprocessing since, you know, standard library and all that. However, I think that the normal usage patterns have too much overhead. It's certainly possible but I would like to use MPI for this to give us as much performance as possible. Also, your first attempt at parallelization used a pattern that matches MPI's preferred pattern. So it is a good fit.
A word about the concept of MPI: multiprocessing.Pool works with a main process that distributes work items among a set of worker processes. MPI starts N processes that all execute the same code. There is no main process. The only distinguishing feature is the process "rank", which is a number in [0, N). If you need a main process, the one with rank 0 is usually chosen. Other than that, the idea is that all processes execute the same code, only picking different indices or offsets based on their rank. If processes need to communicate, there are a couple of "collective" communication patterns such as broadcasting, scattering, and gathering.
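A tiny mpi4py sketch of that pattern (illustrative only, not part of the solution below): every process runs the same script, splits the work by its rank, and a collective call combines the results.

from mpi4py import MPI

Comm = MPI.COMM_WORLD
Rank = Comm.Get_rank()
WorldSize = Comm.Get_size()

# Round-robin split of some work items by rank
MyItems = [i for i in range(100) if i % WorldSize == Rank]
Partial = sum(MyItems)

# Collective communication: combine the partial results on rank 0
Total = Comm.reduce(Partial, op=MPI.SUM, root=0)
if Rank == 0:
    print("sum over all ranks:", Total)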
Option 1: Pure MPI
Let's rewrite the code. The main idea is this: We distribute rows in the data set among all processes. Then each process calculates all passes for its own set of rows. Input and output take considerable time, so we try to do as much as possible parallelized, too.
We start by defining a helper function that defines how we distribute rows among all processes. This is very similar to what you had in your original version.
from mpi4py import MPI

def MakeDistribution(NumberOfRows):
    """Computes how the data set should be distributed across processes

    Arguments:
    NumberOfRows -- size of the whole dataset

    Return value:
    (Offsets, Counts) numpy integer arrays. One entry per process
    """
    Comm = MPI.COMM_WORLD
    WorldSize = Comm.Get_size()
    SameSize, Tail = divmod(NumberOfRows, WorldSize)
    Counts = np.full(WorldSize, SameSize, dtype=int)
    Counts[:Tail] += 1
    # Start offset per process
    Offsets = np.cumsum(Counts) - Counts
    return Offsets, Counts
A second helper function is used to distribute the data sets among all processes. MPI's allgather function is used to collect results of a computation among all processes into one array. The normal function gather collects the whole array on one process. Allgather collects it in all processes. Since all processes need access to all data sets for their random access, we use allgather. Allgatherv is a generalized version that allows different number of entries per process. We need this because we cannot guarantee that all processes have the same number of rows in their local data set.
This function uses the "buffer" interface of mpi4py. This is the more efficient version but also very error-prone. If we mess up an index or the size of a data type, we risk data corruption.
def DistributeDataSets(DataSets, Offsets, Counts):
    """Shares the datasets with all other processes

    Arguments:
    DataSets -- numpy array of floats. Changed in place
    Offsets, Counts -- See MakeDistribution

    Return value:
    DataSets. Most likely a reference to the original.
    Might be an updated copy
    """
    # Sanitize the input. Better safe than sorry and shouldn't cost anything
    DataSets = np.ascontiguousarray(DataSets, dtype='f8')
    assert len(DataSets) == np.sum(Counts)
    # MPI works best if we pretend to have 1-dimensional data
    InnerSize = np.prod(DataSets.shape[1:], dtype=int)
    # I really wish mpi4py had a helper for this
    BufferDescr = (DataSets,
                   Counts * InnerSize,
                   Offsets * InnerSize,
                   MPI.DOUBLE)
    MPI.COMM_WORLD.Allgatherv(MPI.IN_PLACE, BufferDescr)
    return DataSets
We split reading the input data into two parts. First we read all lines in a single process. This is relatively cheap and we need to know the total number of rows before we can distribute the datasets. Then we scatter the lines among all processes and let each process parse its own set of rows. After that, we use the DistributeDataSets function to let each process know all the results.
Scattering the lines uses mpi4py's pickle interface that can transfer arbitrary objects among processes. It's slower but more convenient. For stuff like lists of strings it's very good.
def ParseLines(TotalLines, Offset, OwnLines):
    """Allocates a data set and parses the own segment of it

    Arguments:
    TotalLines -- number of rows in the whole data set across all processes
    Offset -- starting offset of the set of rows parsed by this process
    OwnLines -- list of lines to be parsed by the local process

    Return value:
    a 2D numpy array. The rows [Offset:Offset+len(OwnLines)] are initialized
    with the parsed values
    """
    UsedColumns = (0, 1, 2, 3, 4, 5, 7, 8, 9)
    DataSet = np.empty((TotalLines, len(UsedColumns)), dtype='f8')
    OwnEnd = Offset + len(OwnLines)
    for Row, Line in zip(DataSet[Offset:OwnEnd], OwnLines):
        Columns = Line.split(',')
        # overwrite in-place with new values
        Row[:] = [float(Columns[Column]) for Column in UsedColumns]
    return DataSet

def DistributeInputs(Filename):
    """Read input from the file and distribute it among processes

    Arguments:
    Filename -- path to the CSV file to parse

    Return value:
    (DataSets, Offsets, Counts) with
    DataSets -- 2D array containing all values in the CSV file
    Offsets -- Row indices (one per rank) where each process starts its own
               processing
    Counts -- number of rows per process
    """
    Comm = MPI.COMM_WORLD
    Rank = Comm.Get_rank()
    Lines = None
    LineCount = None
    if not Rank:
        # Read the data. We do as little work as possible here so that other
        # processes can help with the parsing
        with open(Filename) as InFile:
            Lines = InFile.readlines()
        LineCount = len(Lines)
    # broadcast so that all processes know the number of datasets
    LineCount = Comm.bcast(LineCount, root=0)
    Offsets, Counts = MakeDistribution(LineCount)
    # reshape into one list per process
    if not Rank:
        Lines = [Lines[Offset:Offset+Count]
                 for Offset, Count
                 in zip(Offsets, Counts)]
    # distribute strings for parsing
    Lines = Comm.scatter(Lines, root=0)
    # parse into a float array
    DataSets = ParseLines(LineCount, Offsets[Rank], Lines)
    del Lines  # release strings because this is a huge array
    # Share the parsed result
    DataSets = DistributeDataSets(DataSets, Offsets, Counts)
    return DataSets, Offsets, Counts
Now we need to update the way the random number generator is initialized. What we need to prevent is that each process has the same state and generates the same random numbers. Thankfully, numpy gives us a convenient way of doing this.
def MakeRandomGenerator(Deterministic=False):
    """Initializes the random number generator avoiding birthday paradox

    Arguments:
    Deterministic -- if True, the same number of processes should always result
                     in the same random numbers being used

    Return value:
    numpy random number generator
    """
    Comm = MPI.COMM_WORLD
    Rank = Comm.Get_rank()
    AllSeeds = None
    if not Rank:
        # the root process (rank=0) generates a seed sequence for everyone else
        WorldSize = Comm.Get_size()
        SeedInt = 0 if Deterministic else None
        OwnSeed = np.random.SeedSequence(SeedInt)
        AllSeeds = OwnSeed.spawn(WorldSize)
    # mpi4py can scatter Python objects. This is the simplest way
    OwnSeed = Comm.scatter(AllSeeds, root=0)
    return np.random.default_rng(OwnSeed)
The computation itself is almost unchanged. We just need to limit it to the rows for which the individual process is responsible.
def ComputePass(DataSets, Offset, Count, RandomGenerator):
    """The main computation

    Arguments:
    DataSets -- 2D numpy array. Changed in place
    Offset, Count -- rows that should be updated by this process
    RandomGenerator -- numpy random number generator
    """
    RandomIndices = RandomGenerator.integers(
        low=0, high=len(DataSets), size=Count)
    RandomRows = DataSets[RandomIndices]
    # Creates a "view" into the whole dataset for the given slice
    OwnDataSets = DataSets[Offset:Offset + Count]
    # All rows: first column + second column
    Value1 = OwnDataSets[:, 0] + OwnDataSets[:, 1]
    Value2 = RandomRows[:, 0] + RandomRows[:, 1]
    Value3 = Value1 - Value2
    # This change is in-place on the whole DataSets array
    OwnDataSets[:, 7] += Value3
Now we come to writing the output. The most expensive part is formatting the floating point numbers into strings. So we let each process format its own data. MPI has a file IO interface that allows all processes to write a single file together. Unfortunately, for text files, we need to calculate the offsets before writing the data. So we format all rows into one huge string per process, then write the file.
import io

def WriteOutputs(Filename, DataSets, Offset, Count):
    """Writes all DataSets to a CSV file

    We format all rows into one string (one per process), then write it
    collectively using MPI

    Arguments:
    Filename -- output path
    DataSets -- all values among all processes
    Offset, Count -- the rows for which the local process is responsible
    """
    StringBuf = io.StringIO()
    LineFormat = "{:.6f}, " * 6 + "+" + ", {:.6f}" * 3 + "\n"
    for Row in DataSets[Offset:Offset+Count]:
        StringBuf.write(LineFormat.format(*Row))
    StringBuf = StringBuf.getvalue()  # to string
    StringBuf = StringBuf.encode()  # to bytes
    Comm = MPI.COMM_WORLD
    BytesPerProcess = Comm.allgather(len(StringBuf))
    Rank = Comm.Get_rank()
    OwnOffset = sum(BytesPerProcess[:Rank])
    FileLength = sum(BytesPerProcess)
    AccessMode = MPI.MODE_WRONLY | MPI.MODE_CREATE
    OutFile = MPI.File.Open(Comm, Filename, AccessMode)
    OutFile.Set_size(FileLength)
    OutFile.Write_ordered(StringBuf)
    OutFile.Close()
The main process is almost unchanged.
def main():
    InFilename = "indata.csv"
    OutFilename = "outdata.csv"
    Passes = 20
    RandomGenerator = MakeRandomGenerator()
    DataSets, Offsets, Counts = DistributeInputs(InFilename)
    Rank = MPI.COMM_WORLD.Get_rank()
    Offset = Offsets[Rank]
    Count = Counts[Rank]
    for _ in range(Passes):
        ComputePass(DataSets, Offset, Count, RandomGenerator)
    WriteOutputs(OutFilename, DataSets, Offset, Count)

if __name__ == '__main__':
    main()
You need to call this script with mpirun or mpiexec. E.g. mpiexec python3 script_name.py
Using shared memory
The MPI pattern has one significant drawback: Each process needs its own copy of the whole data set. Given its size, this is very inconvenient. We might run out of memory before we run out of CPU cores for multithreading. As a different idea, we can use shared memory. Shared memory allows multiple processes to access the same physical memory without any extra cost. This has some drawbacks:
We need a very recent python version. 3.8 IIRC
Python's implementation may behave differently on various operating systems. I could only test it on Linux. There is a chance that it will not work on any different system
IMHO python's implementation is not great. You will notice that the final version will print some warnings which I think are harmless. Maybe I'm using it wrong but I don't see a more correct way of using it
It limits you to a single PC. MPI itself is perfectly capable (and indeed designed to) operate across multiple systems on a network. Shared memory works only locally.
The major benefit is that the memory consumption does not increase with the number of processes.
We start by allocating such a data set.
From here on, we put in "barriers" at various points where processes may have to wait for one another. For example because all processes need to access the same shared memory segment, they all have to open it before we can unlink it.
from multiprocessing import shared_memory

def AllocateSharedDataSets(NumberOfRows, NumberOfCols=9):
    """Creates a numpy array in shared memory

    Arguments:
    NumberOfRows, NumberOfCols -- basic shape

    Return value:
    (DataSets, Buf) with
    DataSets -- numpy array shaped (NumberOfRows, NumberOfCols).
                Datatype float
    Buf -- multiprocessing.shared_memory.SharedMemory that backs the array.
           Close it when no longer needed
    """
    length = NumberOfRows * NumberOfCols * np.float64().itemsize
    Comm = MPI.COMM_WORLD
    Rank = Comm.Get_rank()
    Buf = None
    BufName = None
    if not Rank:
        Buf = shared_memory.SharedMemory(create=True, size=length)
        BufName = Buf.name
    BufName = Comm.bcast(BufName)
    if Rank:
        Buf = shared_memory.SharedMemory(name=BufName, size=length)
    DataSets = np.ndarray((NumberOfRows, NumberOfCols), dtype='f8',
                          buffer=Buf.buf)
    Comm.barrier()
    if not Rank:
        Buf.unlink()  # this may differ among operating systems
    return DataSets, Buf
The input parsing also changes a little because we have to put the data into the previously allocated array.
def ParseLines(DataSets, Offset, OwnLines):
    """Reads lines into a preallocated array

    Arguments:
    DataSets -- [Rows, Cols] numpy array. Will be changed in-place
    Offset -- starting offset of the set of rows parsed by this process
    OwnLines -- list of lines to be parsed by the local process
    """
    UsedColumns = (0, 1, 2, 3, 4, 5, 7, 8, 9)
    OwnEnd = Offset + len(OwnLines)
    OwnDataSets = DataSets[Offset:OwnEnd]
    for Row, Line in zip(OwnDataSets, OwnLines):
        Columns = Line.split(',')
        Row[:] = [float(Columns[Column]) for Column in UsedColumns]

def DistributeInputs(Filename):
    """Read input from the file and stores it in shared memory

    Arguments:
    Filename -- path to the CSV file to parse

    Return value:
    (DataSets, Offsets, Counts, Buf) with
    DataSets -- [Rows, 9] array containing all values in the CSV file
    Offsets -- Row indices (one per rank) where each process starts its own
               processing
    Counts -- number of rows per process
    Buf -- multiprocessing.shared_memory.SharedMemory object backing the
           DataSets object
    """
    Comm = MPI.COMM_WORLD
    Rank = Comm.Get_rank()
    Lines = None
    LineCount = None
    if not Rank:
        # Read the data. We do as little work as possible here so that other
        # processes can help with the parsing
        with open(Filename) as InFile:
            Lines = InFile.readlines()
        LineCount = len(Lines)
    # broadcast so that all processes know the number of datasets
    LineCount = Comm.bcast(LineCount, root=0)
    Offsets, Counts = MakeDistribution(LineCount)
    # reshape into one list per process
    if not Rank:
        Lines = [Lines[Offset:Offset+Count]
                 for Offset, Count
                 in zip(Offsets, Counts)]
    # distribute strings for parsing
    Lines = Comm.scatter(Lines, root=0)
    # parse into a float array
    DataSets, Buf = AllocateSharedDataSets(LineCount)
    try:
        ParseLines(DataSets, Offsets[Rank], Lines)
        Comm.barrier()
        return DataSets, Offsets, Counts, Buf
    except:
        Buf.close()
        raise
Output writing is exactly the same. The main process changes slightly because now we have to manage the life time of the shared memory.
import contextlib

def main():
    InFilename = "indata.csv"
    OutFilename = "outdata.csv"
    Passes = 20
    RandomGenerator = MakeRandomGenerator()
    Comm = MPI.COMM_WORLD
    Rank = Comm.Get_rank()
    DataSets, Offsets, Counts, Buf = DistributeInputs(InFilename)
    with contextlib.closing(Buf):
        Offset = Offsets[Rank]
        Count = Counts[Rank]
        for _ in range(Passes):
            ComputePass(DataSets, Offset, Count, RandomGenerator)
        WriteOutputs(OutFilename, DataSets, Offset, Count)
Results
I've not benchmarked the original version. The sequential version requires 2 GiB memory and 3:20 minutes for 12500000 lines and 20 passes.
The pure MPI version requires 6 GiB and 42 seconds with 6 cores.
The shared memory version requires a bit over 2 GiB of memory and 38 seconds with 6 cores.

Sharing a large dataframe with multiprocessing.pool

I have a function which I want to compute in parallel using multiprocessing. The function takes an argument, but also loads subsets from two very large dataframes which have already been loaded into memory (one of which is about 1G and the other is just over 6G).
largeDF1 = pd.read_csv(directory + 'name1.csv')
largeDF2 = pd.read_csv(directory + 'name2.csv')

def f(x):
    load_content1 = largeDF1.loc[largeDF1['FirstRow'] == x]
    load_content2 = largeDF2.loc[largeDF2['FirstRow'] == x]
    # some computation happens here
    new_data.to_csv(directory + 'output.csv', index=False)

def main():
    multiprocessing.set_start_method('spawn', force=True)
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    input = input_data['col']
    pool.map_async(f, input)
    pool.close()
    pool.join()
The problem is that the files are too big and when I run them over multiple cores I get a memory issue. I want to know if there is a way where the loaded files can be shared across all processes.
I have tried manager() but could not get it to work. Any help is appreciated. Thanks.
If you were running this on a UNIX-like system (which uses the fork start method by default), the data would be shared out-of-the-box. Most operating systems use copy-on-write for memory pages. So even if you fork a process several times, they would share most of the memory pages that contain the dataframes, as long as you don't modify those dataframes.
But when using the spawn start method, each worker process has to load the dataframe. I'm not sure if the OS is smart enough in that case to share the memory pages, or indeed that these spawned processes would all have the same memory layout.
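For completeness, a minimal sketch of the fork-based variant (UNIX-only; it mirrors the variable names from the question, and the per-item computation is a placeholder):

import multiprocessing as mp
import pandas as pd

# Loaded once in the parent; 'fork' children inherit these via copy-on-write
largeDF1 = pd.read_csv('name1.csv')
largeDF2 = pd.read_csv('name2.csv')

def f(x):
    sub1 = largeDF1.loc[largeDF1['FirstRow'] == x]
    sub2 = largeDF2.loc[largeDF2['FirstRow'] == x]
    return len(sub1) + len(sub2)  # placeholder computation

if __name__ == '__main__':
    ctx = mp.get_context('fork')  # only available on UNIX-like systems
    with ctx.Pool(processes=4) as pool:
        results = pool.map(f, ['a', 'b', 'c'])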
The only portable solution I can think of would be to leave the data on disk and use mmap in the workers to map it into memory read-only. That way the OS would notice that multiple processes are mapping the same file, and it would only load one copy.
The downside is that the data would be in memory in on-disk CSV format, which makes reading data from it (without making a copy!) less convenient. So you might want to prepare the data beforehand into a form that is easier to use, e.g. convert the data from 'FirstRow' into a binary file of float or double that you can iterate over with struct.iter_unpack.
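A rough sketch of that preparation step and of reading the result back (the column indices and file names here are hypothetical):

import mmap
import struct

def prepare(csv_path, bin_path):
    # One-time conversion: pack two numeric columns as doubles
    with open(csv_path) as src, open(bin_path, 'wb') as dst:
        next(src)  # skip the header row, if the file has one
        for line in src:
            fields = line.rstrip('\n').split(',')
            dst.write(struct.pack('dd', float(fields[1]), float(fields[2])))

def scan(bin_path):
    # In each worker: map the file read-only and iterate without copying
    with open(bin_path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for a, b in struct.iter_unpack('dd', mm):
                pass  # do the real computation here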
The function below (from my statusline script) uses mmap to count the amount of messages in a mailbox file.
def mail(storage, mboxname):
    """
    Report unread mail.

    Arguments:
    storage: a dict with keys (unread, time, size) from the previous call or an empty dict.
             This dict will be *modified* by this function.
    mboxname (str): name of the mailbox to read.

    Returns: A string to display.
    """
    stats = os.stat(mboxname)
    if stats.st_size == 0:
        return 'Mail: 0'
    # When mutt modifies the mailbox, it seems to only change the
    # ctime, not the mtime! This is probably related to how mutt saves the
    # file. See also stat(2).
    newtime = stats.st_ctime
    newsize = stats.st_size
    if not storage or newtime > storage['time'] or newsize != storage['size']:
        with open(mboxname) as mbox:
            with mmap.mmap(mbox.fileno(), 0, prot=mmap.PROT_READ) as mm:
                start, total = 0, 1  # First mail is not found; it starts on first line...
                while True:
                    rv = mm.find(b'\n\nFrom ', start)
                    if rv == -1:
                        break
                    else:
                        total += 1
                        start = rv + 7
                start, read = 0, 0
                while True:
                    rv = mm.find(b'\nStatus: R', start)
                    if rv == -1:
                        break
                    else:
                        read += 1
                        start = rv + 10
        unread = total - read
        # Save values for the next run.
        storage['unread'], storage['time'], storage['size'] = unread, newtime, newsize
    else:
        unread = storage['unread']
    return f'Mail: {unread}'
In this case I used mmap because it was 4x faster than just reading the file. See normal reading versus using mmap.

How to Extract Individual Chords, Rests, and Notes from a midi file?

I am making a program that should be able to extract the notes, rests, and chords from a certain midi file and write the respective pitch (in midi tone numbers - they go from 0-127) of the notes and chords to a csv file for later use.
For this project, I am using the Python Library "Music21".
from music21 import *
import pandas as pd

# SETUP
path = r"Pirates_TheCarib_midi\1225766-Pirates_of_The_Caribbean_Medley.mid"

# create a function for parsing and extracting the notes
def extract_notes(path):
    stm = converter.parse(path)
    treble = stm[0]  # access the first part (if there is only one part)
    bass = stm[1]
    # note extraction
    notes_treble = []
    notes_bass = []
    for thisNote in treble.getElementsByClass("Note"):
        indiv_note = [thisNote.name, thisNote.pitch.midi, thisNote.offset]
        notes_treble.append(indiv_note)  # append the note and the note's offset
    for thisNote in bass.getElementsByClass("Note"):
        indiv_note = [thisNote.name, thisNote.pitch.midi, thisNote.offset]
        notes_bass.append(indiv_note)  # add the notes to the bass
    return notes_treble, notes_bass

# write to csv
def to_csv(notes_array):
    df = pd.DataFrame(notes_array, index=None, columns=None)
    df.to_csv("attempt1_v1.csv")

# using the functions
notes_array = extract_notes(path)
#to_csv(notes_array)

# DEBUGGING
stm = converter.parse(path)
print(stm.parts)
Here is the link to the score I am using as a test.
https://musescore.com/user/1699036/scores/1225766
When I run the extract_notes function, it returns two empty arrays, and the line
print(stm.parts)
returns:
<music21.stream.iterator.StreamIterator for Score:0x1b25dead550 #:0>
I am confused as to why it does this. The piece should have two parts, treble and bass. How can I get each note, chord and rest into an array so I can put it in a csv file?
Here is a small snippet of how I did it. I needed to get all notes, chords, and rests for a specific instrument, so I first iterated through the parts to find the specific instrument, and afterwards checked what type each element is and appended it.
you can call this method like
notes = get_notes_chords_rests(keyboard_instruments, "Pirates_of_The_Caribbean.mid")
where keyboard_instruments is a list of instruments.
keyboard_instruments = ["KeyboardInstrument", "Piano", "Harpsichord", "Clavichord", "Celesta"]

def get_notes_chords_rests(instrument_type, path):
    try:
        midi = converter.parse(path)
        parts = instrument.partitionByInstrument(midi)
        note_list = []
        for music_instrument in range(len(parts)):
            if parts.parts[music_instrument].id in instrument_type:
                for element_by_offset in stream.iterator.OffsetIterator(parts[music_instrument]):
                    for entry in element_by_offset:
                        if isinstance(entry, note.Note):
                            note_list.append(str(entry.pitch))
                        elif isinstance(entry, chord.Chord):
                            note_list.append('.'.join(str(n) for n in entry.normalOrder))
                        elif isinstance(entry, note.Rest):
                            note_list.append('Rest')
        return note_list
    except Exception as e:
        print("failed on ", path)
        pass
P.S. It is important to use a try block because a lot of MIDI files on the web are corrupted.

Preprocessing CSV data using tensorflow DataSet API

I'm playing around a bit with tensorflow, but am a bit confused about the input pipeline. The data I'm working on is in a large csv file, with 307 columns, of which the first is a string representing a date, and the rest are floats.
I'm running into some problems with preprocessing my data. I want to add a couple of features that replace the date string but are derived from it (specifically, a sine and a cosine representing the time). I also want to group the next 120 values in the CSV row together as one feature, the 96 values after that as another feature, and base my label on the remaining values in the CSV.
This is my code for generating the datasets for now:
import tensorflow as tf

defaults = []
defaults.append([""])
for i in range(0, 306):
    defaults.append([1.0])

def dataset(train_fraction=0.8):
    path = "training_examples_shuffled.csv"

    # Define how the lines of the file should be parsed
    def decode_line(line):
        items = tf.decode_csv(line, record_defaults=defaults)
        datetimeString = items[0]
        minuteFeatures = items[1:121]
        halfHourFeatures = items[121:217]
        labelFeatures = items[217:]
        ## Do something to convert datetimeString to timeSine and timeCosine
        features_dict = {
            'timeSine': timeSine,
            'timeCosine': timeCosine,
            'minuteFeatures': minuteFeatures,
            'halfHourFeatures': halfHourFeatures
        }
        # placeholder. I seem to need some python logic here, but I'm not sure
        # how to apply that to data in tensor format.
        label = [1]
        return features_dict, label

    def in_training_set(line):
        """Returns a boolean tensor, true if the line is in the training set."""
        num_buckets = 1000000
        bucket_id = tf.string_to_hash_bucket_fast(line, num_buckets)
        # Use the hash bucket id as a random number that's deterministic per example
        return bucket_id < int(train_fraction * num_buckets)

    def in_test_set(line):
        """Returns a boolean tensor, true if the line is in the test set."""
        return ~in_training_set(line)

    base_dataset = (tf.data
                    # Get the lines from the file.
                    .TextLineDataset(path))

    train = (base_dataset
             # Take only the training-set lines.
             .filter(in_training_set)
             # Decode each line into a (features_dict, label) pair.
             .map(decode_line))

    # Do the same for the test-set.
    test = (base_dataset.filter(in_test_set).map(decode_line))

    return train, test
My question now is: how can I access the string in the datetimeString Tensor to convert it to a datetime object? Or is this the wrong place to be doing this? I'd like to use the time and the day of the week as input features.
And secondly: Pretty much the same for the label based on the remaining values of the CSV. Can I just use standard python code for this in some way, or should I be using basic tensorflow ops to achieve what I want, if possible?
Finally, any comments on whether this is a decent way of handling my inputs? Tensorflow is a bit confusing, with old tutorials spread around the internet using deprecated ways of handling inputs.

Fast extraction of chunks of lines from large CSV file

I have a large CSV file full of stock-related data formatted as such:
Ticker Symbol, Date, [some variables...]
So each line starts off with the symbol (like "AMZN"), then has the date, then has 12 variables related to price or volume on the selected date. There are about 10,000 different securities represented in this file and I have a line for each day that the stock has been publicly traded for each of them. The file is ordered first alphabetically by ticker symbol and second chronologically by date. The entire file is about 3.3 GB.
The sort of task I want to solve is extracting the most recent n lines of data for a given ticker symbol relative to the current date. I have code that does this, but based on my observations it seems to take, on average, around 8-10 seconds per retrieval (all tests have been extracting 100 lines).
I have functions I'd like to run that require me to grab such chunks for hundreds or thousands of symbols, and I would really like to reduce the time. My code is inefficient, but I am not sure how to make it run faster.
First, I have a function called getData:
def getData(symbol, filename):
    out = ["Symbol","Date","Open","High","Low","Close","Volume","Dividend",
           "Split","Adj_Open","Adj_High","Adj_Low","Adj_Close","Adj_Volume"]
    l = len(symbol)
    beforeMatch = True
    with open(filename, 'r') as f:
        for line in f:
            match = checkMatch(symbol, l, line)
            if beforeMatch and match:
                beforeMatch = False
                out.append(formatLineData(line[:-1].split(",")))
            elif not beforeMatch and match:
                out.append(formatLineData(line[:-1].split(",")))
            elif not beforeMatch and not match:
                break
    return out
(This code has a couple of helper functions, checkMatch and formatLineData, which I will show below.) Then, there is another function called getDataColumn that gets the column I want with the correct number of days represented:
def getDataColumn(symbol, col=12, numDays=100, changeRateTransform=False):
    dataset = getData(symbol)
    if not changeRateTransform:
        column = [day[col] for day in dataset[-numDays:]]
    else:
        n = len(dataset)
        column = [(dataset[i][col] - dataset[i-1][col])/dataset[i-1][col] for i in range(n - numDays, n)]
    return column
(changeRateTransform converts raw numbers into daily change rate numbers if True.) The helper functions:
def checkMatch(symbol, symbolLength, line):
    out = False
    if line[:symbolLength+1] == symbol + ",":
        out = True
    return out

def formatLineData(lineData):
    out = [lineData[0]]
    out.append(datetime.strptime(lineData[1], '%Y-%m-%d').date())
    out += [float(d) for d in lineData[2:6]]
    out += [int(float(d)) for d in lineData[6:9]]
    out += [float(d) for d in lineData[9:13]]
    out.append(int(float(lineData[13])))
    return out
Does anyone have any insight on what parts of my code run slow and how I can make this perform better? I can't do the sort of analysis I want to do without speeding this up.
EDIT:
In response to the comments, I made some changes to the code in order to utilize the existing methods in the csv module:
def getData(symbol, database):
    out = ["Symbol","Date","Open","High","Low","Close","Volume","Dividend",
           "Split","Adj_Open","Adj_High","Adj_Low","Adj_Close","Adj_Volume"]
    l = len(symbol)
    beforeMatch = True
    with open(database, 'r') as f:
        databaseReader = csv.reader(f, delimiter=",")
        for row in databaseReader:
            match = (row[0] == symbol)
            if beforeMatch and match:
                beforeMatch = False
                out.append(formatLineData(row))
            elif not beforeMatch and match:
                out.append(formatLineData(row))
            elif not beforeMatch and not match:
                break
    return out

def getDataColumn(dataset, col=12, numDays=100, changeRateTransform=False):
    if not changeRateTransform:
        out = [day[col] for day in dataset[-numDays:]]
    else:
        n = len(dataset)
        out = [(dataset[i][col] - dataset[i-1][col])/dataset[i-1][col] for i in range(n - numDays, n)]
    return out
Performance was worse using the csv.reader class. I tested on two stocks, AMZN (near top of file) and ZNGA (near bottom of file). With the original method, the run times were 0.99 seconds and 18.37 seconds, respectively. With the new method leveraging the csv module, the run times were 3.04 seconds and 64.94 seconds, respectively. Both return the correct results.
My thought is that the time is being taken up more from finding the stock than from the parsing. If I try these methods on the first stock in the file, A, the methods both run in about 0.12 seconds.
When you're going to do lots of analysis on the same dataset, the pragmatic approach would be to read it all into a database. It is made for fast querying; CSV isn't. Use the sqlite command line tools, for example, which can directly import from CSV. Then add a single index on (Symbol, Date) and lookups will be practically instantaneous.
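For instance, a one-time import using Python's sqlite3 module could look roughly like this (a sketch only; the column choice and file names are illustrative, not taken from your data):

import csv
import sqlite3

con = sqlite3.connect('prices.db')
con.execute("CREATE TABLE IF NOT EXISTS prices (Symbol TEXT, Date TEXT, Close REAL)")
with open('prices.csv') as f:
    reader = csv.reader(f)
    # next(reader)  # uncomment if the file has a header row
    rows = ((r[0], r[1], float(r[5])) for r in reader)
    con.executemany("INSERT INTO prices VALUES (?, ?, ?)", rows)
con.execute("CREATE INDEX IF NOT EXISTS idx_symbol_date ON prices (Symbol, Date)")
con.commit()

# A lookup of the most recent 100 rows for one symbol is then nearly instant:
recent = con.execute("SELECT * FROM prices WHERE Symbol = ? ORDER BY Date DESC LIMIT 100",
                     ('AMZN',)).fetchall()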
If for some reason that is not feasible, for example because new files can come in at any moment and you cannot afford the preparation time before starting your analysis of them, you'll have to make the best of dealing with CSV directly, which is what the rest of my answer will focus on. Remember that it's a balancing act, though. Either you pay a lot upfront, or a bit extra for every lookup. Eventually, for some amount of lookups it would have been cheaper to pay upfront.
Optimization is about maximizing the amount of work not done. Using generators and the built-in csv module aren't going to help much with that in this case. You'd still be reading the whole file and parsing all of it, at least for line breaks. With that amount of data, it's a no-go.
Parsing requires reading, so you'll have to find a way around it first. Best practices of leaving all intricacies of the CSV format to the specialized module bear no meaning when they can't give you the performance you want. Some cheating must be done, but as little as possible. In this case, I suppose it is safe to assume that the start of a new line can be identified as b'\n"AMZN",' (sticking with your example). Yes, binary here, because remember: no parsing yet. You could scan the file as binary from the beginning until you find the first line. From there read the amount of lines you need, decode and parse them the proper way, etc. No need for optimization there, because 100 lines are nothing to worry about compared to the hundreds of thousands of irrelevant lines you're not doing that work for.
Dropping all that parsing buys you a lot, but the reading needs to be optimized as well. Don't load the whole file into memory first and skip as many layers of Python as you can. Using mmap lets the OS decide what to load into memory transparently and lets you work with the data directly.
Still you're potentially reading the whole file, if the symbol is near the end. It's a linear search, which means the time it takes is linearly proportional to the number of lines in the file. You can do better though. Because the file is sorted, you could improve the function to instead perform a kind of binary search. The number of steps that will take (where a step is reading a line) is close to the binary logarithm of the number of lines. In other words: the number of times you can divide your file into two (almost) equally sized parts. When there are one million lines, that's a difference of five orders of magnitude!
Here's what I came up with, based on Python's own bisect_left with some measures to account for the fact that your "values" span more than one index:
import csv
from itertools import islice
import mmap

def iter_symbol_lines(f, symbol):
    # How to recognize the start of a line of interest
    ident = b'"' + symbol.encode() + b'",'
    # The memory-mapped file
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Skip the header
    mm.readline()
    # The inclusive lower bound of the byte range we're still interested in
    lo = mm.tell()
    # The exclusive upper bound of the byte range we're still interested in
    hi = mm.size()
    # As long as the range isn't empty
    while lo < hi:
        # Find the position of the beginning of a line near the middle of the range
        mid = mm.rfind(b'\n', 0, (lo+hi)//2) + 1
        # Go to that position
        mm.seek(mid)
        # Is it a line that comes before lines we're interested in?
        if mm.readline() < ident:
            # If so, ignore everything up to right after this line
            lo = mm.tell()
        else:
            # Otherwise, ignore everything from right before this line
            hi = mid
    # We found where the first line of interest would be expected; go there
    mm.seek(lo)
    while True:
        line = mm.readline()
        if not line.startswith(ident):
            break
        yield line.decode()

with open(filename) as f:
    r = csv.reader(islice(iter_symbol_lines(f, 'AMZN'), 10))
    for line in r:
        print(line)
No guarantees about this code; I didn't pay much attention to edge cases, and I couldn't test with (any of) your file(s), so consider it a proof of concept. It is plenty fast, however – think tens of milliseconds on an SSD!
So I have an alternative solution which I ran and tested on my own as well, with a sample data set that I got on Quandl that appears to have all the same headers and similar data. (Assuming that I haven't misunderstood the end result that you're trying to achieve.)
I have this command line tool that one of our engineers built for us for parsing massive CSVs - since I deal with absurd amounts of data on a day-to-day basis - and it is open sourced. You can get it here: https://github.com/DataFoxCo/gocsv
I also already wrote the short bash script for it in case you don't want to pipeline the commands but it does also support pipelining.
The command to run the following short script follows a super simple convention:
bash tickers.sh wikiprices.csv 'AMZN' '2016-12-\d+|2016-11-\d+'
#!/bin/bash
dates="$3"
cat "$1" \
| gocsv filter --columns 'ticker' --regex "$2" \
| gocsv filter --columns 'date' --regex "$dates" > "$2"'-out.csv'
Both arguments, for the ticker and for the dates, are regexes.
You can add as many variations as you want into that one regex, separating them by |.
So if you wanted AMZN and MSFT then you would simply modify it to this: AMZN|MSFT
I did something very similar with the dates - but I only limited my sample run to any dates from this month or last month.
End Result
Starting data:
myusername$ gocsv dims wikiprices.csv
Dimensions:
Rows: 23946
Columns: 14
myusername$ bash tickers.sh wikiprices.csv 'AMZN|MSFT' '2016-12-\d+'
myusername$ gocsv dims AMZN|MSFT-out.csv
Dimensions:
Rows: 24
Columns: 14
Here is a sample where I limited to only those 2 tickers and then to december only:
Voila - in a matter of seconds you have a second file saved with just the data you care about.
The gocsv program has great documentation by the way - and a ton of other functions e.g. running a vlookup basically at any scale (which is what inspired the creator to make the tool)
In addition to using csv.reader, I think using itertools.groupby would speed up looking for the wanted sections, so the actual iteration could look something like this:
import csv
from itertools import groupby
from operator import itemgetter  # for the keyfunc for groupby

def getData(wanted_symbol, filename):
    with open(filename) as file:
        reader = csv.reader(file)
        # so each line in reader is basically line[:-1].split(",") from the plain file
        for symb, lines in groupby(reader, itemgetter(0)):
            # so here symb is the symbol at the start of each line of lines
            # and lines is the lines that all have that symbol in common
            if symb != wanted_symbol:
                continue  # skip this whole section if it has a different symbol
            for line in lines:
                # here we have each line as a list of fields
                # for only the lines that have `wanted_symbol` as the first element
                <DO STUFF HERE>
So in the space of <DO STUFF HERE> you could have out.append(formatLineData(line)) to do what your current code does. However, the code for that function has a lot of unnecessary slicing and += operators, which I think are pretty expensive for lists (might be wrong). Another way you could apply the conversions is to have a list of all the conversions:
def conv_date(date_str):
    return datetime.strptime(date_str, '%Y-%m-%d').date()

# the conversions applied to each element (taken from original formatLineData)
castings = [str, conv_date,              # 0, 1
            float, float, float, float,  # 2:6
            int, int, int,               # 6:9
            float, float, float, float,  # 9:13
            int]                         # 13
then use zip to apply these to each field in a line in a list comprehension:
[conv(val) for conv, val in zip(castings, line)]
so you would replace <DO STUFF HERE> with an out.append of that comprehension.
I'd also wonder if switching the order of groupby and reader would be better, since you don't need to parse most of the file as CSV, just the parts you are actually iterating over. You could use a keyfunc that separates just the first field of the string:
def getData(wanted_symbol, filename):
    out = []  # why are you starting this with strings in it?
    def checkMatch(line):  # define the function to only take the line
        # this would be the keyfunc for groupby in this example
        return line.split(",", 1)[0]  # only split once, return the first element
    with open(filename) as file:
        for symb, lines in groupby(file, checkMatch):
            # so here symb is the symbol at the start of each line of lines
            if symb != wanted_symbol:
                continue  # skip this whole section if it has a different symbol
            for line in csv.reader(lines):
                out.append([typ(val) for typ, val in zip(castings, line)])
    return out
