I have a text file containing a huge data set (around 9 GB). The file is arranged as 244 x 3089987, with the values delimited by tabs. I would like to load this text file into MATLAB as a matrix. Here is what I have tried, unsuccessfully (MATLAB hangs):
fread = fopen('merge.txt','r');
formatString = repmat('%f',244,3089987);
C = textscan(fread,formatString);
Am I doing something wrong, or is my approach wrong? If this is easy to do in Python instead, could someone please suggest how?
If you read the documentation for textscan you will see that you can define an input argument N so that:
textscan reads file data using the formatSpec N times, where N is a positive integer. To read additional data from the file after N cycles, call textscan again using the original fileID. If you resume a text scan of a file by calling textscan with the same file identifier (fileID), then textscan automatically resumes reading at the point where it terminated the last read.
You can also pass a blank formatSpec to textscan in order to read in an arbitrary number of columns. This is how dlmread, a wrapper for textscan, operates.
For example:
fID = fopen('test.txt');
chunksize = 10; % Number of lines to read for each iteration
while ~feof(fID) % Iterate until we reach the end of the file
    datachunk = textscan(fID, '', chunksize, 'Delimiter', '\t', 'CollectOutput', true);
    datachunk = datachunk{1}; % Pull data out of cell array. Can take time for large arrays
    % Do calculations
end
fclose(fID);
This will read in 10 line chunks until you reach the end of the file.
If you have enough RAM to store the data (a 244 x 3089987 array of double is just over 6 gigs) then you can do:
mydata = textscan(fID, '', 'Delimiter', '\t', 'CollectOutput', true);
mydata = mydata{1}; % Pull data out of cell array. Can take time for large arrays
try:
A = importdata('merge.txt', '\t');
http://es.mathworks.com/help/matlab/ref/importdata.html
and if the rows are not delimited by '\n':
[C, remaining] = vec2mat(A, 244)
http://es.mathworks.com/help/comm/ref/vec2mat.html
Another option in recent MATLAB releases is to use datastore. This has the advantage of being designed to let you page through the data rather than read the whole lot at once, and it can generally deduce the formatting for you.
http://www.mathworks.com/help/matlab/import_export/read-and-analyze-data-in-a-tabulartextdatastore.html
I'm surprised this is even trying to run, when I try something similar textscan throws an error.
If you really want to use textscan, you only need the format for a single row, so you can replace 244 in your code with 1 and it should work. Edit: having read your comment, the repeat count should be the number of columns, so you should do formatString = repmat('%f', 1, 244);. Also, you can apparently just leave the format empty ('') and it will work.
However, MATLAB has several text import functions, of which textscan is rarely the easiest to use.
In this case I would probably use dlmread, which handles any delimited numerical data. You want something like:
C=dlmread('merge.txt', '\t');
Also, since you are trying to load 9 GB of data, I assume you have enough memory; you'll probably get an out-of-memory error if you don't, but it is something to consider.
Related
I am working on a project where I need to process a large number of txt files (about 7000), each of which has 2 columns of floats with 12,500 rows.
I am using pandas, and it takes about 2 min 20 s, which is a bit long. With MATLAB this takes 1 min less. I would like to get as close to MATLAB as possible, or faster.
Is there any faster alternative that I can implement with python?
I tried Cython and the speed was the same as with pandas.
Here is the code I am using. It reads files composed of column 1 (time) and column 2 (amplitude). I calculate the envelope and build a list with the resulting envelopes for all files. I extract the time from the first file using the simple numpy.loadtxt(), which is slower than pandas but has no impact since it is just one file.
Any ideas?
For coding suggestions please try to use the same format as I use.
Cheers
Data example:
1.7949600e-05 -5.3232106e-03
1.7950000e-05 -5.6231098e-03
1.7950400e-05 -5.9230090e-03
1.7950800e-05 -6.3228746e-03
1.7951200e-05 -6.2978830e-03
1.7951600e-05 -6.6727570e-03
1.7952000e-05 -6.5727906e-03
1.7952400e-05 -6.9726562e-03
1.7952800e-05 -7.0726226e-03
1.7953200e-05 -7.2475638e-03
1.7953600e-05 -7.1725890e-03
1.7954000e-05 -6.9476646e-03
1.7954400e-05 -6.6227738e-03
1.7954800e-05 -6.4228410e-03
1.7955200e-05 -5.8480342e-03
1.7955600e-05 -6.1979166e-03
1.7956000e-05 -5.7980510e-03
1.7956400e-05 -5.6231098e-03
1.7956800e-05 -5.3482022e-03
1.7957200e-05 -5.1732611e-03
1.7957600e-05 -4.6484375e-03
20 files here:
https://1drv.ms/u/s!Ag-tHmG9aFpjcFZPqeTO12FWlMY?e=f6Zk38
import os
from os.path import join

import numpy as np
import pandas as pd
from scipy.signal import hilbert

folder_tarjet = r"D:\this"   # raw string so the backslash is not treated as an escape

if len(folder_tarjet) > 0:
    print("You chose %s" % folder_tarjet)

list_of_files = os.listdir(folder_tarjet)
list_of_files.sort(key=lambda f: os.path.getmtime(join(folder_tarjet, f)))
num_files = len(list_of_files)

envs_a = []
for elem in list_of_files:
    file_name = os.path.join(folder_tarjet, elem)
    amp = pd.read_csv(file_name, header=None, dtype={'amp': np.float64}, delim_whitespace=True)
    env_amplitudes = np.abs(hilbert(np.array(pd.DataFrame(amp[1]))))
    envs_a.append(env_amplitudes)

envelopes = np.array(envs_a).T

file_name = os.path.join(folder_tarjet, list_of_files[1])
Time = np.loadtxt(file_name, usecols=0)
I would suggest not using CSV to store and load big quantities of data. If you already have your data in CSV you can still convert all of it at once into a faster format, for instance pickle, HDF5, feather, or parquet; they are not human readable but have much better performance on just about every other metric.
After the conversion, you will be able to load the data in a few seconds (if not less than a second) instead of minutes, so in my opinion it is a much better solution than trying to squeeze out a marginal 50% optimization.
If the data is being generated, make sure to generate it in a compressed format from the start. If you don't know which format to use, parquet, for instance, is one of the fastest for reading and writing, and you can also load it from MATLAB.
Here you will find the options already supported by pandas.
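For example, a one-off conversion pass along these lines (just a sketch; the folder layout, the column names, and the use of pyarrow for parquet support are assumptions on my part) would let every later run skip the text parsing entirely:
import glob
import os

import pandas as pd

def convert_folder_to_parquet(folder):
    # One-off pass: parse each whitespace-delimited .txt once and write a .parquet next to it.
    for txt_file in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        df = pd.read_csv(txt_file, header=None, names=["time", "amp"], delim_whitespace=True)
        df.to_parquet(txt_file + ".parquet")  # needs pyarrow or fastparquet installed

def load_converted(folder):
    # Later runs read the binary files instead of re-parsing text.
    return [pd.read_parquet(f) for f in sorted(glob.glob(os.path.join(folder, "*.parquet")))]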
As discussed in #Ziur Olpa's answer and the comments, a binary format is bound to be faster than parsing text.
The quick way to get those gains is to use Numpy's own NPY format, and have your reader function cache those onto disk; that way, when you re-(re-re-)run your data analysis, it will use the pre-parsed NPY files instead of the "raw" TXT files. Should the TXT files change, you can just remove all NPY files and wait a while longer for parsing to happen (or maybe add logic to compare the modification times of the NPY files with their corresponding TXT files).
Something like this; I hope I got your hilbert/abs/transposition logic the way you wanted it.
import glob
import os

import numpy as np
import pandas as pd
from scipy import signal


def read_scans(folder):
    """
    Read scan files from a given folder, yield tuples of filename/matrix.
    This will also create "cached" npy files for each txt file, e.g. a.txt -> a.txt.npy.
    """
    for filename in sorted(glob.glob(os.path.join(folder, "*.txt")), key=os.path.getmtime):
        np_filename = filename + ".npy"
        if os.path.exists(np_filename):
            yield (filename, np.load(np_filename))
        else:
            df = pd.read_csv(filename, header=None, dtype=np.float64, delim_whitespace=True)
            mat = df.values
            np.save(np_filename, mat)
            yield (filename, mat)


def read_data_opt(folder):
    times = None
    envs = []
    for filename, mat in read_scans(folder):
        if times is None:
            # If we hadn't got the times yet, grab them from this
            # matrix (by slicing the first column).
            times = mat[:, 0]
        env_amplitudes = signal.hilbert(mat[:, 1])
        envs.append(env_amplitudes)
    envs = np.abs(np.array(envs)).T
    return (times, envs)
On my machine, the first run (for your 20-file dataset) takes
read_data_opt took 0.13302898406982422 seconds
and the subsequent runs are 4x faster:
read_data_opt took 0.031115055084228516 seconds
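If you want the modification-time check mentioned above, a small helper along these lines (npy_is_fresh is a hypothetical name, not part of the answer's code) could replace the plain os.path.exists(np_filename) test in read_scans:
import os

def npy_is_fresh(txt_filename):
    """Hypothetical helper: True if a cached .npy exists and is at least as new as its source .txt."""
    np_filename = txt_filename + ".npy"
    return (
        os.path.exists(np_filename)
        and os.path.getmtime(np_filename) >= os.path.getmtime(txt_filename)
    )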
I've looked at this response to try and get numpy to print the full array rather than a summarized view, but it doesn't seem to be working.
I have a CSV with named headers. Here are the first five rows
v0 v1 v2 v3 v4
1001 5529 24 56663 16445
1002 4809 30.125 49853 28069
1003 407 20 28462 8491
1005 605 19.55 75423 4798
1007 1607 20.26 79076 12962
I'd like to read in the data and be able to view it fully. I tried doing this:
import numpy as np
np.set_printoptions(threshold=np.inf)
main_df2=np.genfromtxt('file location', delimiter=",")
main_df2[0:3,:]
However, this still returns the truncated array, and the performance seems greatly slowed. What am I doing wrong?
OK, in a regular Python session (I usually use IPython instead), I set the print options and made a large array:
>>> np.set_printoptions(threshold=np.inf, suppress=True)
>>> x=np.random.rand(25000,5)
When I execute the next line, it spends about 21 seconds formatting the array, and then writes the resulting string to the screen (with more lines than fit the terminal's window buffer).
>>> x
This is the same as
>>> print(repr(x))
The internal storage for x is a buffer of floats (which you can 'see' with x.tostring()). To print x it has to format it, creating a multiline string that contains a print representation of each number, all 125000 of them. The result of repr(x) is a string 1,850,000 characters long, with 25,000 lines. This is what takes 21 seconds. Displaying that on the screen is just limited by the terminal scroll speed.
I haven't looked at the details, but I think the numpy formatting is mostly written in Python, not compiled. It's designed more for flexibility than speed. It's normal to want to see 10-100 lines of an array. 25000 lines is an unusual case.
Somewhat curiously, writing this array as a csv is fast, with a minimal delay:
>>> np.savetxt('test.txt', x, fmt='%10f', delimiter=',')
And I know what savetxt does - it iterates on rows, and does a file write
f.write(fmt % tuple(row))
Evidently all the bells-n-whistles of the regular repr are expensive. It can summarize, it can handle many dimensions, it can handle complicated dtypes, etc. Simply formatting each row with a known fixed format is not the time consuming step.
Actually that savetxt route might be more useful, as well as fast. You can control the display format, and you can view the resulting text file in an editor or terminal window at your leisure. You won't be limited by the scroll buffer of your terminal window. But how will this savetxt file be different from the original csv?
I'm surprised you get an array at all, as your example does not use ',' as the delimiter. But maybe you forgot to include commas in your example file.
I would use the DataFrame functionality of pandas if I work with CSV data. It uses numpy under the hood, so all numpy operations work on pandas DataFrames.
Pandas has many tricks for operating with table-like data.
import pandas as pd
df = pd.read_csv('nothing.txt')
#==============================================================================
# next line remove blanks from the column names
#==============================================================================
df.columns = [name.strip(' ') for name in df.columns]
pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
print(df)
When I copied and pasted the data here, it was open in Excel, but the file is a CSV.
I'm doing a class exercise and we have to use numpy. One thing I noticed was that the results were quite illegible thanks to the scientific notation, so I did the following and things are much smoother:
np.set_printoptions(threshold=100000, suppress=True)
The suppress statement saved me a lot of formatting. The performance does suffer a lot when I change the threshold to something like 'nan' or inf, and I'm not sure why.
I have a tab-separated .txt file that stores numbers as a matrix. The number of lines is 904,652 and the number of columns is 26,600 (tab separated), making the total file size around 48 GB. I need to load this file as a matrix and take the transpose of the matrix to extract training and testing data. I am using Python and the pandas and sklearn packages. I have a 500 GB memory server, but it is not enough to load the file with pandas. Could anyone help me with my problem?
The loading code part is below:
def open_with_pandas_read_csv(filename):
    df = pandas.read_csv(filename, sep=csv_delimiter, header=None)
    data = df.values
    return data
If your server has 500GB of RAM, you should have no problem using numpy's loadtxt method.
data = np.loadtxt("path_to_file").T
The fact that it's a text file makes it a little harder. As a first step, I would create a binary file out of it, where each number takes a constant number of bytes. It will probably also reduce the file size.
Then, I would make a number of passes, and in each pass I would write N rows in the output file.
Pseudo code:
for p in range(columns // N):
    transposed_rows = [[] for _ in range(N)]   # one buffer per output row for this pass
    for row in range(rows):
        # read N consecutive numbers from this row, starting at column p*N
        x = read_N_numbers_from_row_of_input_matrix(row, p * N)
        for i in range(N):
            transposed_rows[i].append(x[i])
    for i in range(N):
        append_to_output_file(transposed_rows[i])
Converting to a binary file makes it possible to read a sequence of numbers from the middle of a row.
N should be small enough for transposed_rows to fit in memory, i.e. N*rows should be reasonable.
N should be large enough so that we take advantage of caching. If N=1, this means we're wasting a lot of reads to generate a single row of output.
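A minimal sketch of that idea using numpy memory maps, assuming the text file has already been converted to a flat, row-major float32 binary file (matrix.bin and matrix_T.bin are placeholder names, and N is a tunable block size):
import numpy as np

rows, cols = 904652, 26600   # shape from the question
N = 1000                     # input columns handled per pass; keep rows * N floats comfortably in RAM

src = np.memmap("matrix.bin", dtype=np.float32, mode="r", shape=(rows, cols))
dst = np.memmap("matrix_T.bin", dtype=np.float32, mode="w+", shape=(cols, rows))

for start in range(0, cols, N):
    stop = min(start + N, cols)
    # Read a block of N input columns (all rows) and write it as N rows of the transposed output.
    dst[start:stop, :] = src[:, start:stop].T

dst.flush()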
It sounds to me like you are working on genetic data. If so, consider using --transpose with plink; it is very quick: http://pngu.mgh.harvard.edu/~purcell/plink/dataman.shtml#recode
I found a solution (I believe there are still more efficient and logical ones) on Stack Overflow. The np.fromfile() method loads huge files more efficiently than np.loadtxt() and np.genfromtxt(), and even pandas.read_csv(). It took just around 274 GB without any modification or compression. I thank everyone who tried to help me with this issue.
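For reference, a sketch of what that np.fromfile() route can look like (the filename is a placeholder; the shape comes from the question). With a non-empty sep, np.fromfile parses the file as text, and a space in sep matches any run of whitespace, so tabs and newlines are both handled:
import numpy as np

rows, cols = 904652, 26600
# sep=" " switches np.fromfile into text mode; a space matches any whitespace run.
data = np.fromfile("matrix.txt", dtype=np.float64, sep=" ").reshape(rows, cols)
X = data.T   # transpose to slice out training and testing data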
I am trying to write a 100x6 numpy array to a data file using
# print array to file
data = open("store.dat", "w")
for i in xrange(len(array)):
    data.write(str(array[i]) + "\n")
but when the file writes, it automatically splits the line after 68 characters i.e. 5 columns of a 6 column array. For example, I want to see
[ 35.47842918 21.82382715 3.18277209 0.38992263 1.17862342 0.46170848]
as one of the lines, but am getting
[ 35.47842918 21.82382715 3.18277209 0.38992263 1.17862342
0.46170848]
I've narrowed it down to str(array[i]) deciding to insert a line break on its own.
Secondly, is there a better way to go about this? I'm very new to Python and know little about coding in it properly. Ultimately, I'm writing out a simulation of stars to later be read by a module that will render the coordinates in VPython. I thought that, to avoid the problems of real-time rendering, I could just pass a file once the simulation is complete. Is this inefficient? Should I be passing an object instead?
Thank you
Perhaps it would be more convenient to write it using numpy.save instead?
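For example (a sketch; store.npy is an arbitrary filename and the array here is just a stand-in for your simulation output), the binary round trip preserves the full array with no line wrapping at all:
import numpy as np

array = np.random.rand(100, 6)   # stand-in for the 100x6 simulation array

np.save("store.npy", array)      # write in numpy's binary .npy format
loaded = np.load("store.npy")    # read it back later with shape and dtype intact
The VPython-side module can then call np.load instead of parsing text.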
Alternatively, you can also use:
numpy.array_str(array[i], max_line_width=1000000)
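Dropped into the loop from the question, that would look something like this (a sketch; the stand-in array and the Python 2 xrange mirror the original snippet):
import numpy

array = numpy.random.rand(100, 6)   # stand-in for the simulation output

data = open("store.dat", "w")
for i in xrange(len(array)):        # xrange, as in the original Python 2 code
    data.write(numpy.array_str(array[i], max_line_width=1000000) + "\n")
data.close()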
I want to output the numeric results from image processing to an .xls file, on one line, with the values placed in cells horizontally. Can anybody advise which Python module to use and what code to add? In other words, how do I take the digits from an array and put them into Excel cells horizontally?
Code fragment:
def fouriertransform(self):  # function for FT computation
    for filename in glob.iglob('*.tif'):
        imgfourier = mahotas.imread(filename)  # read the image
        arrayfourier = numpy.array([imgfourier])  # make an array
        # Take the fourier transform of the image.
        F1 = fftpack.fft2(imgfourier)
        # Now shift so that low spatial frequencies are in the center.
        F2 = fftpack.fftshift(F1)
        # the 2D power spectrum is:
        psd2D = np.abs(F2)**2
        print psd2D
        f.write(str(psd2D))  # write to file
This should be a comment, but I don't have enough rep.
What other people say
It looks like this is a duplicate of Python Writing a numpy array to a CSV File, which uses numpy.savetxt to generate a csv file (which excel can read)
Example
Numpy is probably the easiest way to go in my opinion. You could get fancy and use the ndarray.flatten() method. After that, it would be as simple as saving the array with a comma separating each element. Something like this (tested!) is even simpler:
import numpy as np
arr = np.ones((3,3))
np.savetxt("test.csv",arr,delimiter=',',newline=',')
Note this is a small hack, since newlines are treated as commas. This makes "test.csv" look like:
1,1,1,1,1,1,1,1,1, # there are 9 ones there, I promise! 3x3
Note there is a trailing comma. I was able to open this in excel no problem, voila!