Splitting a long string at indices in parallel in Python

I have many files, each with several million rows; each row is a dumped data entry several hundred characters long. The rows come in groups, and the first two characters tell me the type of row, which I use to parse it. This structure prevents me from loading the rows into a dataframe, for example, or using anything else that does not go through the rows one at a time.
For each row, I currently create a dictionary vals = {}, and then sequentially run through about fifty keys along the lines of
vals['name'] = row[2:24]
vals['state'] = row[24:26]
Instead of doing fifty assignments sequentially, can I do this simultaneously or in parallel in some simple manner?
Is
vals['name'], vals['state'] = row[2:24], row[24:26]
faster if I do this simultaneous assignment for many entries? I could also reformulate this as a list comprehension. Would that be faster than running through sequentially?

To answer your question: no, multiple assignment will not speed up your program, because the multiple-assignment syntax is just a different way of writing the same assignments on separate lines.
For example
vals['name'], vals['state'] = row[2:24], row[24:26]
is equivalent to
vals['name'] = row[2:24]
vals['state'] = row[24:26]
If you want to optimize your code, you should start by profiling it to determine the parts that take the largest amount of time. I would also check that you are not doing multiple reads from the same file, as these are very slow compared to reading from memory. If possible, read the entire file into memory first, and then process it.
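As a minimal sketch of that advice, assuming hypothetical field names and offsets beyond the two shown in the question, you could keep the roughly fifty slices in one mapping and build vals with a single dict comprehension; this mainly helps readability and maintainability rather than guaranteeing a speedup:

FIELDS = {
    'name': slice(2, 24),    # offsets from the question; the rest are placeholders
    'state': slice(24, 26),
    # ... the remaining ~fifty fields ...
}

with open('dump.txt') as fh:   # 'dump.txt' is a placeholder path
    rows = fh.readlines()      # read the whole file into memory once

for row in rows:
    row_type = row[:2]         # first two characters select the record layout
    vals = {key: row[s] for key, s in FIELDS.items()}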

Related

Pyarrow Write/Append Columns Arrow File

I have a calculator that iterates over a couple of hundred objects and produces an Nx1 array for each of them, with N between 1 and 10 million depending on the configuration. Right now I am summing over these using a generator expression, so memory consumption is low. However, I would like to store the Nx1 arrays to a file so I can do other computations (compute quantiles, partial sums, etc., pandas-style). Preferably I would like to use pa.memory_map on a single file (so the dataframes are not loaded into memory), but I cannot see how to produce such a file without generating the entire result first (Monte Carlo results of 200-500 × 10m floats).
If I understand correctly, RecordBatchStreamWriter needs a part of the entire table, and I cannot produce only a part of it; the parts the calculator produces are the columns, not parts of all columns. Is there any way of writing "columns" one by one? Either by appending, or by creating an empty Arrow file which can be filled? (The schema is known.)
As I see it, my alternative is to write several files and use "dataset"/tabular data to "join" them together. My "other computations" would then have to filter or pull parts into memory, as I can't see in the docs that dataset() works with memory_map. The result set is too big to fit in memory (at least on the server it is running on).
I'm on day 2 of digging through the docs and trying to understand how it all works, so apologies if the "lingo" is not all correct.
On further inspection, it looks like all files used in dataset() must have the same schema, so I cannot split "columns" into separate files either, can I?
EDIT
After wrestling with this library, I now produce single-column files which I later combine into a single file. However, following the suggested solution, visible memory consumption (Task Manager) skyrockets in the step that combines the files. I would expect peaks for every "rowgroup" or combined record batch, but instead it steadily increases until all memory is used. A snippet of this step:
readers = [pa.ipc.open_stream(file) for file in self.tempfiles]
combined_schema = pa.unify_schemas([r.schema for r in readers])
with pa.ipc.new_stream(
    os.path.join(self.filepath, self.outfile_name + ".arrow"),
    schema=combined_schema,
) as writer:
    for group in zip(*readers):
        combined_batch = pa.RecordBatch.from_arrays(
            [g.column(0) for g in group], names=combined_schema.names
        )
        writer.write_batch(combined_batch)
From this link I would expect the running memory consumption to be roughly that of combined_batch, and then some.
You could do the write in two passes.
First, write each column to its own file. Make sure to set a row group size small enough that a table consisting of one row group from each file comfortably fits into memory.
Second, create a streaming reader for each file you created and one writer. Read a single row group from each one. Create a table by combining all of the partial columns and write the table to your writer. Repeat until you've exhausted all your readers.
I'm not sure that memory mapping is going to help you much here.
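A minimal sketch of that two-pass idea, assuming float64 columns, a hypothetical generate_chunks(name) generator that yields 1-D NumPy arrays for one column, and placeholder file/column names:

import pyarrow as pa

column_names = ["col_a", "col_b"]                     # placeholder names

# Pass 1: stream each column to its own file in small record batches.
for name in column_names:
    schema = pa.schema([(name, pa.float64())])
    with pa.ipc.new_stream(f"{name}.arrows", schema) as writer:
        for chunk in generate_chunks(name):           # hypothetical generator
            writer.write_batch(pa.record_batch([pa.array(chunk)], schema=schema))

# Pass 2: read one batch at a time from every column file and write the combined batch.
readers = [pa.ipc.open_stream(pa.OSFile(f"{name}.arrows")) for name in column_names]
combined_schema = pa.unify_schemas([r.schema for r in readers])
with pa.ipc.new_stream("combined.arrows", combined_schema) as writer:
    for batches in zip(*readers):                     # one small batch per column
        arrays = [b.column(0) for b in batches]
        writer.write_batch(pa.RecordBatch.from_arrays(arrays, names=combined_schema.names))

The key point is that each batch written in pass 1 should stay small enough that one batch from every column, held at the same time, still fits comfortably in memory.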

How do I perform deduplication with the Python Record Linkage Toolkit on large data sets?

I am currently using the Python Record Linkage Toolkit to perform deduplication on data sets at work. In an ideal world, I would just use blocking or sortedneighborhood indexing to trim down the size of the index of record pairs, but sometimes I need to do a full index on a data set with over 75k records, which results in a couple of billion record pairs.
The issue I'm running into is that the workstation I'm able to use is running out of memory, so it can't store the full 2.5-3 billion-pair MultiIndex. I know the documentation has ideas for doing record linkage between two large data sets using numpy split, which is simple enough for my usage, but doesn't provide anything for deduplication within a single dataframe. I actually incorporated this subset suggestion into a method for splitting the MultiIndex into subsets and running those, but it doesn't get around the issue of the .index() call seemingly loading the entire MultiIndex into memory and causing an out-of-memory error.
Is there a way to split a dataframe and compute the matched pairs iteratively so I don't have to load the whole kit and kaboodle into memory at once? I was looking at dask, but I'm still pretty green on the whole Python thing, so I don't know how to incorporate the dask dataframes into the record linkage toolkit.
While I was able to solve this, sort of, I am going to leave the question open because I suspect that, given my inexperience with Python, my process could be improved.
Basically, I had to ditch the index function from the record linkage toolkit. I pulled out the Index of the dataframe I was using, converted it to a list, and passed it through itertools.combinations.
from itertools import combinations, islice

candidates = fl                            # fl is the dataframe being deduplicated
candidates = candidates.index
candidates = candidates.tolist()
candidates = combinations(candidates, 2)   # lazy iterator over all record pairs
This gave me an iterator of tuples without having to load everything into memory. I then consumed it in chunks with islice inside a for loop:
for x in iter(lambda: list(islice(candidates, 1000000)), []):   # chunks of 1,000,000 pairs
Inside the loop I performed all of the necessary comparisons and added the resultant dataframe to a dictionary, which I then concatenate at the end into the full list of matches. Python's memory usage hasn't risen above 3 GB the entire time.
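Put together, the pattern looks roughly like the sketch below, where compare_chunk() is a hypothetical wrapper around the toolkit's comparison step for one block of pairs:

from itertools import combinations, islice
import pandas as pd

candidates = combinations(fl.index.tolist(), 2)        # lazy, never fully materialized
results = []

for chunk in iter(lambda: list(islice(candidates, 1000000)), []):
    pairs = pd.MultiIndex.from_tuples(chunk)           # the pair index for this block
    results.append(compare_chunk(pairs, fl))           # placeholder comparison step

matches = pd.concat(results)                           # combine the per-chunk results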
I would still love some information on how to incorporate dask into this, so I will accept any answer that can provide that (unless the mods think I should open a new question).

Running a filter over a large number of data points and a long time period?

I need to apply two running filters on a large amount of data. I have read that creating variables on the fly is not a good idea, but I wonder if it still might be the best solution for me.
My question:
Can I create arrays in a loop with the help of a counter (array1, array2, …) and then call them with the counter (something like 'array' + str(counter) or 'array' + str(counter - 1))?
Why I want to do it:
The data are 400x700 arrays for 15-min time steps over a year (so I have about 35,000 400x700 arrays). Each time step is read into Python individually. Now I need to apply one running filter that checks whether the last four time steps are equal (element-wise) and, if they are, sets all four values to zero. The next filter uses the data after the first filter has run and checks whether the sum of the last twelve time steps exceeds a certain value. When both filters are done I want to sum up the values, so that at the end of the year I have one 400x700 array with the filtered, accumulated values.
I do not have enough memory to read in all the data at once. So I thought I could create a loop where, for each time step, a new variable for the 400x700 array is created and the two filters run. The older arrays that have been filtered I could then add to the yearly sum and delete, so that I never have more than 16 (4+12) time steps (arrays) in memory at any time.
I don't know if it's correct of me to ask such a question without any code to show, but I would really appreciate the help.
If your question is about the best data structure to keep a certain number of arrays in memory, I would suggest a three-dimensional array. Its shape would be (400, 700, 12), since twelve is how many arrays you need to look back at. The advantage is that memory use stays constant as you load new arrays into the larger one; the disadvantage is that you need to shift all the arrays manually.
If you don't want to deal with the shifting yourself, I'd suggest using a deque with a maxlen of 12.
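A minimal sketch of the deque approach, assuming a hypothetical read_timesteps() generator that yields one 400x700 array per 15-minute step; the 12-step sum filter is only hinted at, since its exact rule isn't specified in the question:

import numpy as np
from collections import deque

window = deque(maxlen=12)               # keeps at most the last 12 time steps
yearly_sum = np.zeros((400, 700))

for step in read_timesteps():           # hypothetical generator, one array per step
    window.append(step)
    if len(window) >= 4:
        last4 = np.stack(list(window)[-4:])
        all_equal = np.all(last4 == last4[0], axis=0)   # element-wise: last four equal?
        for arr in list(window)[-4:]:
            arr[all_equal] = 0.0        # zero those positions in all four steps
    # ...apply the 12-step sum filter on np.stack(window) here...
    if len(window) == window.maxlen:
        yearly_sum += window[0]         # oldest step leaves the window on the next append
# (a final flush of the last 11 steps is omitted in this sketch)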
"Can I create arrays in a loop with the help of a counter (array1, array2…) and then call them with the counter (something like: ‘array’+str(counter) or ‘array’+str(counter-1)?"
This is a very common question that I think a lot of programmers will face eventually. Two examples for Python on Stack Overflow:
generating variable names on fly in python
How do you create different variable names while in a loop? (Python)
The lesson to learn from this is to not use dynamic variable names, but instead put the pieces of data you want to work with in an encompassing data structure.
The data structure could e.g. be a list, dict or NumPy array. The collections.deque proposed by @Midnighter also seems to be a good candidate for such a running filter.

Removing duplicates from BIG text file

I have a rather big text file, on average 30 GB. I want to remove duplicate lines from this file. What is a good, efficient algorithm to do this? For small files, I usually use dictionaries, e.g. Python dictionaries, to store unique keys. But this time the file is rather big. Any language suggestion is fine (I am thinking of using C, or is it rather not language-dependent but the algorithm that matters more?). Thanks.
If you can't just fire up an instance on Amazon with enough memory to hold everything in RAM, this is the strategy I would use:
Step 1 - go through and generate a checksum/hash value for each line. I'd probably use SipHash. Output these to a file.
Step 2 - sort the file of hash values and throw away any that have only one entry. Output the result as a set of hash values and the number of matches for each.
Step 3 - read through the file again, regenerating the hash value for each line. If it's a line that has a match, hold onto it in memory. If there's another line already in memory with the same hash value, compare them to see if the lines themselves match; output "match" if true. Once you've seen all N lines that share a hash value and they didn't match, go ahead and dispose of the record.
This strategy depends on the duplicates being only a small fraction of the total number of lines. If that's not the case, I would use some other strategy, like divide and conquer.
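As an illustration of step 1 only, here is a minimal sketch that uses hashlib.blake2b in place of SipHash (the Python standard library does not expose SipHash directly); the file names are placeholders:

import hashlib

with open("big.txt", "rb") as src, open("hashes.txt", "w") as out:
    for lineno, line in enumerate(src):
        digest = hashlib.blake2b(line, digest_size=8).hexdigest()   # 64-bit hash per line
        out.write(f"{digest}\t{lineno}\n")                          # hash + line number

# hashes.txt can then be sorted externally for step 2, e.g. with the Unix sort utility.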

Creating an iterator in Python from a dictionary in a memory-efficient way

I'm iterating through a very large tab-delimited file (containing millions of lines) and pairing different lines of it based on the value of some field in that file, e.g.
from collections import defaultdict

mydict = defaultdict(list)          # list factory is needed for .append to work
for line in myfile:
    # Group all lines that have the same field into a list
    mydict[line.field].append(line)
Since "mydict" gets very large, I'd like to make it into an iterator so I don't have to hold it all in memory. How can I make it so instead of populating a dictionary, I will create an iterator that I can loop through and get all these lists of lines that have the same field value?
Thanks.
It sounds like you might want a database. There are a variety of relational and non-relational databases to pick from (some more efficient than others, depending on what you are trying to achieve), but sqlite (built into Python) would be the easiest.
Or, if there are only a small number of line.fields to process, you could just read the files several times.
But there's no real magic bullet.
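A minimal sketch of the sqlite route, assuming the grouping field is the first tab-separated column (the file and table names are placeholders): stream the rows into a table once, then pull back one group at a time instead of holding mydict in memory.

import sqlite3

conn = sqlite3.connect("lines.db")
conn.execute("CREATE TABLE IF NOT EXISTS rows (field TEXT, line TEXT)")
with open("myfile.tsv") as fh:
    conn.executemany(
        "INSERT INTO rows VALUES (?, ?)",
        ((line.split("\t")[0], line) for line in fh),    # assumes the field is column 0
    )
conn.commit()
conn.execute("CREATE INDEX IF NOT EXISTS idx_field ON rows (field)")

for (field,) in conn.execute("SELECT DISTINCT field FROM rows"):
    group = [row for (row,) in conn.execute("SELECT line FROM rows WHERE field = ?", (field,))]
    # ...pair up the lines in `group` here...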
"millions of lines" is not very large unless the lines are long. If the lines are long you might save some memory by storing only positions in the file (.tell()/.seek()).
If the file is sorted by line.field; you could use itertools.groupby().
SQL’s GROUP BY might help for average-sized files (e.g., using sqlite as #wisty suggested).
For really large files you could use MapReduce.
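If the file is already sorted by the grouping field, the itertools.groupby() route could look roughly like this sketch, where parse() and the .field attribute stand in for the original parsing logic:

from itertools import groupby
from operator import attrgetter

with open("myfile.tsv") as fh:                      # placeholder file name
    records = (parse(line) for line in fh)          # hypothetical per-line parser
    for field_value, group in groupby(records, key=attrgetter("field")):
        lines = list(group)                         # only one group is in memory at a time
        # ...pair up `lines` here...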
