Suppose you have a data file which includes several data sets separated by the string "--" in the following format:
--
<x0_val> <y0_val>
<x1_val> <y1_val>
<x2_val> <y2_val>
--
<x0_val> <y0_val>
<x1_val> <y1_val>
<x2_val> <y2_val>
...
How can you read the whole file into an array of arrays so that you can afterwards plot all the data sets into the same figure with a for loop over the outer array?
genfromtxt('data.dat', delimiter=("--"))
gives lots of errors like
Line #1550 (got 1 columns instead of 2)
I will update ...
I would first split the file into multiple files, which can reside in memory as objects or on the filesystem as new files.
You can locate the separator string -- with the re module.
Then you can use the link I posted above.
If you're 100% certain that you have no negative values in your file, you can try a quick:
np.genfromtxt(your_file, comments="-")
The comments="-" will force genfromtxt to ignore everything after a -, which of course will give weird results if you have negative values. Moreover, the result will be just a lump of your whole dataset in a single array.
Otherwise, the safest route is to iterate over your file and store the lines that do not match -- in one list per block, something along these lines:
import numpy as np
blocks = []
current = []
for line in your_file:
    if line.startswith("--"):
        blocks.append(np.array(current, dtype=float))
        current = []
    elif line.strip():                             # skip blank lines
        current.append(line.split())
blocks.append(np.array(current, dtype=float))      # don't forget the last block
You may have to get rid of the first block if empty.
You could also check the mmap-based solution already posted.
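Once blocks is built, plotting every data set into the same figure is just a loop over the outer list (a minimal sketch, assuming each block ends up as a two-column array of x/y values):
import matplotlib.pyplot as plt
for block in blocks:
    if block.size:                # skip the empty first block, if any
        plt.plot(block[:, 0], block[:, 1])
plt.show()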
I have one file (index1) with 17,270,877 IDs, and another file (read1) with a subset of these IDs (17,211,741). For both files, the IDs are on every 4th line.
I need a new (index2) file that contains only the IDs in read1. For each of those IDs I also need to grab the next 3 lines from index1. So I'll end up with index2 whose format exactly matches index1 except it only contains IDs from read1.
I am trying to implement the methods I've read here. But I'm stumbling on these two points: 1) I need to check IDs on every 4th line, but I need all of the data in index1 (in order) because I have to write the associated 3 lines following the ID. 2) unlike that post, which is about searching for one string in a large file, I'm searching for a huge number of strings in another huge file.
Can some folks point me in the right direction? Maybe none of those 5 methods are ideal for this. I don't know any information theory; we have plenty of RAM, so I think holding the data in RAM for searching would be the most efficient approach, but I'm really not sure.
Here is a sample of what index1 looks like (IDs start with #M00347):
#M00347:30:000000000-BCWL3:1:1101:15589:1332 1:N:0:0
CCTAAGGTTCGG
+
CDDDDFFFFFCB
#M00347:30:000000000-BCWL3:1:1101:15667:1332 1:N:0:0
CGCCATGCATCC
+
BBCCBBFFFFFF
#M00347:30:000000000-BCWL3:1:1101:15711:1332 1:N:0:0
TTTGGTTCCCGG
+
CDCDECCFFFCB
read1 looks very similar, but the lines before and after the '+' are different.
If the data of index1 fits in memory, the best approach is to do a single scan of that file and store all of its data in a dictionary like this:
{"#M00347:30:000000000-BCWL3:1:1101:15589:1332 1:N:0:0":["CCTAAGGTTCGG","+","CDDDDFFFFFCB"],
"#M00347:30:000000000-BCWL3:1:1101:15667:1332 1:N:0:0":["CGCCATGCATCC","+","BBCCBBFFFFFF"],
..... }
Values can also be stored as formatted strings, if you prefer.
After this, you can do a single scan over read1, and whenever an ID is encountered you can do a simple lookup in the dictionary to retrieve the needed data.
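A minimal sketch of that two-pass approach (the file names are hypothetical, and it assumes the ID lines in read1 are identical to those in index1):
# Pass 1: map each ID line of index1 to the three lines that follow it.
index = {}
with open("index1.fastq") as f:               # hypothetical file name
    while True:
        header = f.readline()
        if not header:
            break
        # the three lines after each ID: sequence, "+", quality
        index[header.rstrip("\n")] = [f.readline().rstrip("\n") for _ in range(3)]
# Pass 2: IDs sit on every 4th line of read1; look them up and write index2.
with open("read1.fastq") as r1, open("index2.fastq", "w") as out:
    for i, line in enumerate(r1):
        if i % 4 == 0:
            key = line.rstrip("\n")
            if key in index:
                out.write(key + "\n")
                out.write("\n".join(index[key]) + "\n")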
I have a large flat file which I need to parse using a list that contains the variable name, the starting point, the length of the variable, and the type, e.g.:
columns = [['LOAD_CYCLE', 131, 6, 'int'],
['OPERATOR', 59, 8, 'Char (8)'],
['APP_DATE', 131, 8, 'Date'],
['UNIQUE_KEY', 245, 25, 'Char (25)']]
This list contains 1,600 items. The only really important columns are the starting point and the length of the variable. This is used to split each line in the flat file into a list of variables, which is used to create a new file to be inserted into a database. The data type is important, but I can always do that section later.
Currently, my method is to read the file in chunks (it is a very large file; over 6GB), and then process the chunk piece by piece:
line = data_file.read(chunk*1000)
for x in range(1000):
    offset = chunk*x
    for item in columns:
        piece = line[item[1]+offset:item[1]+item[2]+offset].replace('\n','')
        # Depending on the data type, a piece may undergo one or two checks before being
        # added to a list which is then written to an output file
The time-consuming part is iterating through the columns. Is this the only way to do this? Or is there perhaps a more efficient way to split the string? Something involving maps?
This seems like a great case for the struct module. Assuming you're using CPython, this effectively moves the loop over the columns into C.
First, you need to build up the format string.
Since your columns appear to be specified in arbitrary order, rather than ordered by starting point, and may have gaps between them, this isn't quite trivial… but it shouldn't be too hard. Something like this:
import operator
import struct
sorted_columns = sorted(columns, key=operator.itemgetter(1))
formats = []
offset = 0
for name, start, length, vtype in sorted_columns:
    # add padding bytes for any gap before this column
    # (this assumes the columns don't overlap)
    if start > offset:
        formats.append('{}x'.format(start - offset))
    formats.append('{}s'.format(length))
    offset = start + length
format = struct.Struct('=' + ''.join(formats))
Then:
offset = chunk*x
values = format.unpack_from(line, offset)
And now you have a tuple of 1600 items.
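If it helps to see it in context, here is a rough sketch of driving that over the file, assuming (as in the question) that each record is exactly chunk bytes long and that 1000 records are read at a time; the file name is a placeholder:
with open('flat_file.dat', 'rb') as data_file:   # hypothetical file name
    while True:
        buf = data_file.read(chunk * 1000)
        if not buf:
            break
        for x in range(len(buf) // chunk):
            values = format.unpack_from(buf, chunk * x)
            # values holds one bytes/str object per column,
            # in the same order as sorted_columns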
Of course to do anything with that tuple, you may have to iterate over it anyway. But maybe you can do that in C as well. For example, if you're just inserting the values into a SQL database, then creating a giant SQL statement with 1600 parameters (in the same order as in sorted_columns) and passing the tuple as the arguments may take care of that for you:
cursor.execute(giant_insert_sql, values)
But if you need to do something more complicated to each value, then you'll need to do something like one of the following:
Use NumPy and/or Pandas to vectorize the loop. (Note that they can also be used to just load the whole file into memory, vectorizing the outer loop as well as the inner one, if you've got the RAM… but that shouldn't be nearly as much of a performance gain.)
Run your existing code in PyPy instead of CPython.
Use Numba to JIT the code within CPython.
Write a C extension to replace your inner loop—which, if you're lucky, may be as simple as just moving your Python code to a function in a separate file and compiling it with Cython.
The following is code I have written that tries to open individual files, which are long strips of data, and read them into an array. Essentially I have files that run over 15 times (24 hours to 360 hours), and each file has an iteration of 50, hence the two loops. I then try to open the files into an array. When I try to print a specific element in the array, I get the error "'file' object has no attribute '__getitem__'". Any ideas what the problem is? Thanks.
#!/usr/bin/python
############################################
#
import csv
import sys
import numpy as np
import scipy as sp
#
#############################################
level = input("Enter a level: ");
LEVEL = str(level);
MODEL = raw_input("Enter a model: ");
NX = 360;
NY = 181;
date = 201409060000;
DATE = str(date);
#############################################
FileList = [];
data = [];
for j in range(1,51,1):
    J = str(j);
    for i in range(24,384,24):
        I = str(i);
        fileName = '/Users/alexg/ECMWF_DATA/DAT_FILES/'+MODEL+'_'+LEVEL+'_v_'+J+'_FT0'+I+'_'+DATE+'.dat';
        FileList.append(fileName);
        fo = open(fileName,"rb");
        data.append(fo);
        fo.close();
print data[1][1];
print FileList;
EDITED TO ADD:
Below, find the CORRECT array that the python script should be producing (sorry, it won't let me post this inline yet):
http://i.stack.imgur.com/ItSxd.png
The problem I now run into, is that the first three values in the first row of the output matrix are:
-7.090874
-7.004936
-6.920952
These values are actually the first three values of the 11th row in the array below, which is how it should look (produced in MATLAB). The next three values the python script outputs (as what it believes to be the second row) are:
-5.255577
-5.159874
-5.064171
These values should be found in the 22nd row. In other words, python is placing the 11th row of values in the first position, the 22nd in the second and so on. I don't have a clue as to why, or where in the code I'm specifying it do this.
You're appending the file objects themselves to data, not their contents:
fo = open(fileName,"rb");
data.append(fo);
So, when you try to print data[1][1], data[1] is a file object (a closed file object, to boot, but it would be just as broken if still open), so data[1][1] tries to treat that file object as if it were a sequence, and file objects aren't sequences.
It's not clear what format your data are in, or how you want to split it up.
If "long strips of data" just means "a bunch of lines", then you probably wanted this:
data.append(list(fo))
A file object is an iterable of lines, it's just not a sequence. You can copy any iterable into a sequence with the list function. So now, data[1][1] will be the second line in the second file.
(The difference between "iterable" and "sequence" probably isn't obvious to a newcomer to Python. The tutorial section on Iterators explains it briefly, the Glossary gives some more information, and the ABCs in the collections module define exactly what you can do with each kind of thing. But briefly: An iterable is anything you can loop over. Some iterables are sequences, like list, which means they're indexable collections that you can access like spam[0]. Others are not, like file, which just reads one line at a time into memory as you loop over it.)
If, on the other hand, you actually imported csv for a reason, you more likely wanted something like this:
reader = csv.reader(fo)
data.append(list(reader))
Now, data[1][1] will be a list of the columns from the second row of the second file.
Or maybe you just wanted to treat it as a sequence of characters:
data.append(fo.read())
Now, data[1][1] will be the second character of the second file.
There are plenty of other things you could just as easily mean, and easy ways to write each one of them… but until you know which one you want, you can't write it.
Consider a large list of named items (the first line) read from a large csv file (80 MB), with possibly interrupted spacing:
name_line = ['a',,'b',,'c' .... ,,'cb','cc']
I am reading in the remainder of the data line by line, and I only need to process data with a corresponding name. The data might look like
data_line = ['10',,'.5',,'10289' .... ,,'16.7','0']
I tried it two ways. One is popping the empty columns from each line as it is read:
blnk_cols = [1,3, ... ,97]
while data:
    ...
    for index in blnk_cols: data_line.pop(index)
The other is collecting only the items associated with a name from the name line (L1):
good_cols = [0,2,4, ... ,98,99]
while data:
    ...
    data_line = [data_line[index] for index in good_cols]
In the data I am using there will definitely be more good lines than bad lines, although it might be as high as half and half.
I used the cProfile and pstats packages to determine my weakest links in speed, which suggested that the pop was the slowest part. I switched to the list comp and the time almost doubled.
I imagine one fast way would be to slice the array retrieving only good data, but this would be complicated for files with alternating blank and good data.
what I really need is to be able to do
data_line = data_line[good_cols]
effectively passing a list of indices into a list to get back those items.
Right now my program is running in about 2.3 seconds for a 10 MB file and the pop accounts for about .3 seconds.
Is there a faster way to access certain locations in a list? In C it would just be dereferencing an array of pointers to the correct indices in the array.
Additions:
name_line in file before read
a,b,c,d,e,f,g,,,,,h,i,j,k,,,,l,m,n,
name_line after read and split(",")
['a','b','c','d','e','f','g','','','','','h','i','j','k','','','','l','m','n','\n']
Try a generator expression:
data_line = (data_line[i] for i in good_cols)
Also read here about
Generator Expressions vs. List Comprehension
as the top answer tells you: 'Basically, use a generator expression if all you're doing is iterating once'.
So you should benefit from this.
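For instance, inside the read loop it could look something like this (a sketch only; the file name, the example indices, and the process function are placeholders):
good_cols = [0, 2, 4, 98, 99]                    # example indices of named columns
with open("data.csv") as f:
    name_line = next(f).rstrip("\n").split(",")  # header line with possible blank names
    for raw in f:
        fields = raw.rstrip("\n").split(",")
        for value in (fields[i] for i in good_cols):  # generator: no intermediate list
            process(value)                       # hypothetical per-value processing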
First, Python Newbie; be patient/kind.
Next, once a month I receive a large text file (think 7 million records) to test for duplicate values. This is catalog information. I get 7 fields, but the two I'm interested in are a supplier code and a full orderable part number. To determine if the record is duplicated, I strip all special characters from the part number (except . and #) and create a compressed part number. The test for duplicates becomes the supplier code and compressed part number combination. This part is fairly straightforward. Currently, I am just copying the original file with 2 new columns (compressed part and duplicate indicator). If the part is a duplicate, I put a "YES" in the last field. Now that this is done, I want to be able to go back (or better yet, at the same time) to get the previous record where there was a supplier code/compressed part number match.
So far, my code looks like this:
# Compress Full Part to a Compressed Part
# and Check for Duplicates on Supplier Code
# and Compressed Part combination
import sys
import re
import time
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
start=time.time()
try:
    file1 = open("C:\Accounting\May Accounting\May.txt", "r")
except IOError:
    print >> sys.stderr, "Cannot Open Read File"
    sys.exit(1)
try:
    file2 = open(file1.name[0:len(file1.name)-4] + "_" + "COMPRESSPN.txt", "a")
except IOError:
    print >> sys.stderr, "Cannot Open Write File"
    sys.exit(1)
hdrList="CIGSUPPLIER|FULL_PART|PART_STATUS|ALIAS_FLAG|ACQUISITION_FLAG|COMPRESSED_PART|DUPLICATE_INDICATOR"
file2.write(hdrList+chr(10))
lines_seen=set()
affirm="YES"
records = file1.readlines()
for record in records:
    fields = record.split(chr(124))
    if fields[0]=="CIGSupplier":
        continue #If incoming file has a header line, skip it
    file2.write(fields[0]+"|"), #Supplier Code
    file2.write(fields[1]+"|"), #Full_Part
    file2.write(fields[2]+"|"), #Part Status
    file2.write(fields[3]+"|"), #Alias Flag
    file2.write(re.sub("[$\r\n]", "", fields[4])+"|"), #Acquisition Flag
    file2.write(re.sub("[^0-9a-zA-Z.#]", "", fields[1])+"|"), #Compressed_Part
    dupechk=fields[0]+"|"+re.sub("[^0-9a-zA-Z.#]", "", fields[1])
    if dupechk not in lines_seen:
        file2.write(chr(10))
        lines_seen.add(dupechk)
    else:
        file2.write(affirm+chr(10))
print "it took", time.time() - start, "seconds."
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
file2.close()
file1.close()
It runs in less than 6 minutes, so I am happy with this part, even if it is not elegant. Right now, when I get my results, I import them into Access and do a self join to locate the duplicates. Loading/querying/exporting a file this size in Access takes around an hour, so I would like to be able to export the matched duplicates to another text file or an Excel file instead.
Confusing enough?
Thanks.
Maybe you could consider building a dictionary that maps (supplier_number, compressed_part_number) tuples to data structures (nested lists, perhaps, or instances of a custom class for better readability and maintainability) holding the line numbers on which records matching the key tuple appear in your file, plus possibly the complete records themselves.
This would end up putting all the data from the file into a large in-memory dictionary, which might or might not be a problem depending on your requirements; if you skip the actual records and only hold line numbers, the dictionary will be much smaller.
You can then iterate over the entries in the dictionary spitting out the duplicates to a file as you go.
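A rough sketch of that idea, reusing the pipe-delimited format and the compression rule from the question (the file names are assumptions):
import re
from collections import defaultdict
occurrences = defaultdict(list)                    # (supplier, compressed_pn) -> line numbers
with open("May.txt") as f:                         # hypothetical input name
    for line_no, record in enumerate(f, 1):
        fields = record.split("|")
        key = (fields[0], re.sub("[^0-9a-zA-Z.#]", "", fields[1]))
        occurrences[key].append(line_no)
with open("duplicates.txt", "w") as out:           # hypothetical output name
    for (supplier, comp_pn), line_nos in occurrences.items():
        if len(line_nos) > 1:
            out.write("{}|{} appears on lines {}\n".format(supplier, comp_pn, line_nos))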
I think you should sort the entries in the input file first. Maybe it will consume too much memory, but you should first try to read all input in memory, sort this based upon the value of dupechk and then you can iterate over all entries and easily see if there are two or more identical records. Because identical records are grouped, it is easy to output just those records.
This might be more efficient/feasible for the large files you are dealing with:
Sort the file based on the supplier code and compressed part number - dump it to a temporary file. I don't think it is worth actually tacking on the compressed part number, just compute it from the full part number when needed. However, that is pure conjecture and definitely deserves some quick benchmarking.
Iterate through the temporary file (might want to take advantage of 'with'). Check if current line's supplier code and compressed part number is identical to previous one - if it is, you have identified a duplicate. Handle as you see fit. Since the file is sorted you reduce the memory requirement of needing to store all the lines in memory to a set of consecutive identical lines.
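A rough sketch of the second step, assuming the temporary file has already been sorted so that records sharing a supplier code and compressed part number end up adjacent (the file name and the duplicate handler are hypothetical):
import re
def dupe_key(record):
    fields = record.split("|")
    return fields[0], re.sub("[^0-9a-zA-Z.#]", "", fields[1])
prev_key, prev_record = None, None
with open("May_sorted.txt") as f:                  # hypothetical sorted temp file
    for record in f:
        key = dupe_key(record)
        if key == prev_key:
            handle_duplicate(prev_record, record)  # hypothetical: write both records out
        prev_key, prev_record = key, record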
You are already reading the whole file into memory. You don't need to sort. Instead of a set, have a dict mapping (supplier, compressed_pn) to line_number_last_seen - 1. That way, when you discover a duplicate, you can output the two duplicate records immediately. This method requires only one pass over the file. You don't need to write a temporary file.
If you often have 3 or more records with the same key, you may wish to use an approach that maps the key to a list of line indices. At the end of reading the file, you iterate over the dictionary looking for lists with more than 1 entry.
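A minimal sketch of that single-pass idea, reusing records and the dupechk key from the question's code (the extra output file is an assumption):
import re
# records comes from file1.readlines(), as in the question;
# dupes_file is a hypothetical extra output file opened for writing.
first_seen = {}            # dupechk key -> index of the first record with that key
for idx, record in enumerate(records):
    fields = record.split("|")
    dupechk = fields[0] + "|" + re.sub("[^0-9a-zA-Z.#]", "", fields[1])
    if dupechk in first_seen:
        dupes_file.write(records[first_seen[dupechk]])   # earlier record with the same key
        dupes_file.write(record)                         # the current duplicate
    else:
        first_seen[dupechk] = idx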
Couple of comments:
Using file.readlines on a large file is wasteful - it's reading the entire file into memory. You should, instead, take advantage that a file is iterable, reading a single line at a time by default.
Your file format is basically a CSV, with a pipe instead of a comma as the separator. So, use the csv module. It is written in C and avoids most of the interpreter overhead. It also provides a nice iterable interface which does not require reading the whole file into memory either.
You should additionally use a DictReader from the csv module. If the header is in the file, great, the class will parse it and use it as the keys further on. If not, specify the header in the code. Either way, fields[0] is uninformative and error prone. fields["CIGSUPPLIER"] is much more self-documenting.
Just as with reading, use the csv module for writing. Again, you can specify the delimiter.
Don't use file2.write(chr(10)). Use file2.write('\n'), and open your file appropriately. Alternatively, if you're using the csv.writer class, these become unnecessary.
Otherwise, your logic and flow look alright. Overall, I'd advise against the chr(*) calls unless the character is truly unprintable. Newlines and pipes are printable (or have standard escapes) and should be written as such.
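Putting those comments together, here is a sketch of what the read/write loop might look like with the csv module (Python 3 file handling is assumed; the field names follow the header string in the question, and the file names are placeholders, so treat this as an illustration rather than a drop-in replacement):
import csv
import re
in_names = ["CIGSUPPLIER", "FULL_PART", "PART_STATUS", "ALIAS_FLAG", "ACQUISITION_FLAG"]
out_names = in_names + ["COMPRESSED_PART", "DUPLICATE_INDICATOR"]
seen = set()
with open("May.txt", newline="") as infile, \
     open("May_COMPRESSPN.txt", "w", newline="") as outfile:
    reader = csv.DictReader(infile, fieldnames=in_names, restkey="EXTRA", delimiter="|")
    writer = csv.DictWriter(outfile, fieldnames=out_names, delimiter="|")
    writer.writeheader()
    for row in reader:
        if row["CIGSUPPLIER"] == "CIGSupplier":
            continue                                 # skip an embedded header line, if present
        row.pop("EXTRA", None)                       # drop any fields beyond the first five
        row["ACQUISITION_FLAG"] = re.sub("[$\r\n]", "", row["ACQUISITION_FLAG"])
        row["COMPRESSED_PART"] = re.sub("[^0-9a-zA-Z.#]", "", row["FULL_PART"])
        key = (row["CIGSUPPLIER"], row["COMPRESSED_PART"])
        row["DUPLICATE_INDICATOR"] = "YES" if key in seen else ""
        seen.add(key)
        writer.writerow(row)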