I'd really like to create a numpy array from a csv file; however, I'm having issues when the file is ~50k lines long (like the MNIST training set). The file I'm trying to import looks something like this:
0.0,0.0,0.0,0.5,0.34,0.24,0.0,0.0,0.0
0.0,0.0,0.0,0.4,0.34,0.2,0.34,0.0,0.0
0.0,0.0,0.0,0.34,0.43,0.44,0.0,0.0,0.0
0.0,0.0,0.0,0.23,0.64,0.4,0.0,0.0,0.0
It works fine for something that's 10k lines long, like the validation set:
import numpy as np
csv = np.genfromtxt("MNIST_valid_set_data.csv",delimiter = ",")
If I do the same with the training data (larger file), I'll get a c-style segmentation fault. Does anyone know any better ways besides breaking the file up and then piecing it together?
The end result is that I'd like to pickle the arrays into a similar mnist.pkl.gz file but I can't do that if I can't read in the data.
Any help would be greatly appreciated.
I think you really want to track down the actual problem and solve it, rather than just work around it, because I'll bet you have other problems with your NumPy installation that you're going to have to deal with eventually.
But, since you asked for a workaround that's better than manually splitting the files, reading them, and merging them, here are two:
First, you can split the files programmatically and dynamically, instead of manually. This avoids wasting a lot of your own human effort, and also saves the disk space needed for those copies, even though it's conceptually the same thing you already know how to do.
As the genfromtxt docs make clear, the fname argument can be a pathname, or a file object (opened in 'rb' mode), or just a generator of lines (as bytes). Of course a file object is itself a generator of lines, but so is, say, an islice of a file object, or a group from a grouper. So:
import numpy as np
from more_itertools import grouper
def getfrombigtxt(fname, *args, **kwargs):
    with open(fname, 'rb') as f:
        return np.vstack([np.genfromtxt(group, *args, **kwargs)
                          for group in grouper(f, 5000, fillvalue=b'')])
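You'd call it the same way as the original genfromtxt call; for example (the training-set filename here is just a guess based on the validation filename in the question):
train = getfrombigtxt("MNIST_train_set_data.csv", delimiter=",")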
If you don't want to install more_itertools, you can also just copy the two-line grouper implementation from the Recipes section of the itertools docs, or even inline the zip_longest trick straight into your code.
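For reference, that recipe looks roughly like this:
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)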
Alternatively, you can parse the CSV file with the stdlib's csv module instead of with NumPy:
import csv
import numpy as np
def getfrombigtxt(fname, delimiter=','):
    with open(fname, 'r', newline='') as f:  # note text mode, not binary
        rows = [list(map(float, row)) for row in csv.reader(f, delimiter=delimiter)]
        return np.vstack(rows)
This is obviously going to be a lot slower… but if we're talking about turning 50ms of processing into 1000ms, and you only do it once, who cares?
I have been trying to read the header (first 100 lines) of a netCDF file in Python, but have been facing some issues. I am familiar with the read_nc function available in the synoptReg package for R and with the ncread function that comes with MATLAB, as well as the read_csv function available in the pandas library. To my knowledge, however, there isn't anything similar for netCDF (.nc) files.
Noting this, and using answers from this question, I've tried the following (with no success):
with open(filepath,'r') as f:
    for i in range(100):
        line = next(f).strip()
        print(line)
However, I receive this error, even though I've ensured that tabs have not been mixed with spaces and that the for statement is within the with block (as explained in the top answers to this question):
'utf-8' codec can't decode byte 0xbb in position 411: invalid start byte
I've also tried the following:
with open(filepath,'r') as f:
    for i in range(100):
        line = [next(f) for i in range(100)]
        print(line)
and
from itertools import islice
with open('/Users/toshiro/Desktop/Projects/CCAR/Data/EDGAR/v6.0_CO2_excl_short-cycle_org_C_2010_TOTALS.0.1x0.1.nc','r') as f:
    for i in range(100):
        line = list(islice(f, 100))
        print(line)
But receive the same error as above. Are there any workarounds for this?
You can't. netCDFs are binary files and can't be interpreted as text.
If the files are netCDF3 encoded, you can read them in with scipy.io.netcdf_file. But it's much more likely they are netCDF4, in which case you'll need the netCDF4 package.
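For example, with the netCDF4 package, simply printing the opened Dataset shows the dimensions, variables, and attributes without reading the data itself (a minimal sketch; the filename is illustrative):
import netCDF4

ds = netCDF4.Dataset("path/to/myfile.nc")  # opens the file; data stays on disk
print(ds)                                  # dimensions, variables, global attributes
ds.close()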
On top of this, I'd highly recommend the xarray package for reading and working with netCDF data. It supports a labeled N-dimensional array interface - think pandas indexes on each dimension of a numpy array.
Whether you go with netCDF4 or xarray, netCDF files are self-describing and support arbitrary reads, so you don't need to load the whole file to view the metadata. So, similar to viewing the head of a text file, you can simply do:
import xarray as xr
ds = xr.open_dataset("path/to/myfile.nc")
print(ds) # this will give you a preview of your data
Additionally, xarray has an xr.Dataset.head method which returns the first 5 (or N, if you provide an int) elements along each dimension:
ds.head() # display a 5x5x...x5 preview of your data
See the getting started guide and the User guide section on reading and writing netCDF files for more info.
My dataset looks at flight delays and cancellations from 2009 to 2018. Here are the important points to consider:
Each year is its own csv file, so '2009.csv', '2010.csv', all the way to '2018.csv'
Each file is roughly 700 MB
I used the following to combine the csv files:
import pandas as pd
import numpy as np
import os, sys
import glob
os.chdir('c:\\folder')
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
combined_airline_csv = pd.concat([pd.read_csv(f) for f in all_filenames])
combined_airline_csv.to_csv('combined_airline_csv.csv', index =False, encoding = 'utf-8-sig')
When I run this, I receive the following message:
MemoryError: Unable to allocate 43.3 MiB for an array with shape (5674621,) and data type float64.
I am presuming that my file is too large and that I will need to run this on a virtual machine (e.g. AWS).
Any thoughts?
Thank you!
This is a duplicate of how to merge 200 csv files in Python.
Since you just want to combine them into one file, there is no need to load all the data into a dataframe at the same time. Since they all have the same structure, I would advise creating one file writer, then opening each file with a file reader and writing (if we want to be fancy, let's call it streaming) the data line by line. Just be careful not to copy the headers each time, since you only want them once. Pandas is simply not the best tool for this task :)
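A minimal sketch of that streaming approach (the output filename matches the question; everything else is illustrative):
import glob

# collect the yearly files, excluding the combined output if it already exists
csv_files = sorted(f for f in glob.glob('*.csv') if f != 'combined_airline_csv.csv')

with open('combined_airline_csv.csv', 'w', encoding='utf-8-sig') as writer:
    for i, path in enumerate(csv_files):
        with open(path, 'r', encoding='utf-8') as reader:
            header = reader.readline()
            if i == 0:
                writer.write(header)  # copy the header row only once
            for line in reader:
                writer.write(line)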
In general, this is a typical task that can also be done easily, and often even faster, directly on the command line (the exact commands depend on the OS).
I have 1000s of CSV files that I would like to append and create one big numpy array. The problem is that the numpy array would be much bigger than my RAM. Is there a way of writing a bit at a time to disk without having the entire array in RAM?
Also is there a way of reading only a specific part of the array from disk at a time?
When working with numpy and large arrays, there are several approaches depending on what you need to do with that data.
The simplest answer is to use less data. If your data has lots of repeating elements (typically zeros), it is often possible to use a sparse array from scipy, because the two libraries are heavily integrated.
Another answer (IMO: the correct solution to your problem) is to use a memory mapped array. This will let numpy only load the necessary parts to ram when needed, and leave the rest on disk. The files containing the data can be simple binary files created using any number of methods, but the built-in python module that would handle this is struct. Appending more data would be as simple as opening the file in append mode, and writing more bytes of data. Make sure that any references to the memory mapped array are re-created any time more data is written to the file so the information is fresh.
Finally, there is compression. Numpy can compress arrays with savez_compressed, which can then be opened with numpy.load. Importantly, compressed numpy files cannot be memory-mapped and must be loaded into memory entirely. Loading one column at a time may be able to get you under the threshold, but this could similarly be applied to other methods to reduce memory usage. Numpy's built-in compression techniques will only save disk space, not memory. There may exist other libraries that perform some sort of streaming compression, but that is beyond the scope of my answer.
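For completeness, a minimal sketch of the compressed route (the filenames and array are illustrative):
import numpy as np

a = np.random.rand(1024, 16)
np.savez_compressed('columns.npz', a=a)  # writes a compressed .npz archive
with np.load('columns.npz') as archive:
    a_loaded = archive['a']              # decompressed fully into memory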
Here is an example of putting binary data into a file then opening it as a memory-mapped array:
import numpy as np

#open a file for data of a single column
with open('column_data.dat', 'wb') as f:
    #for 1024 "csv files"
    for _ in range(1024):
        csv_data = np.random.rand(1024).astype(np.float64) #represents one column of data
        f.write(csv_data.tobytes())

#open the array as a memory-mapped file
column_mmap = np.memmap('column_data.dat', dtype=np.float64)

#read some data
print(np.mean(column_mmap[0:1024]))

#write some data
column_mmap[0:512] = .5

#deletion closes the memory-mapped file and flushes changes to disk.
# del isn't specifically needed as python will garbage collect objects no
# longer accessible. If for example you intend to read the entire array,
# you will need to periodically make sure the array gets deleted and re-created
# or the entire thing will end up in memory again. This could be done with a
# function that loads and operates on part of the array, then when the function
# returns and the memory-mapped array local to the function goes out of scope,
# it will be garbage collected. Calling such a function would not cause a
# build-up of memory usage.
del column_mmap

#write some more data to the array (not while the mmap is open)
with open('column_data.dat', 'ab') as f:
    #for 1024 "csv files"
    for _ in range(1024):
        csv_data = np.random.rand(1024).astype(np.float64) #represents one column of data
        f.write(csv_data.tobytes())
So, I have this database with thousands of rows and columns. At the start of the program I load the data and assign a variable to it:
data=np.loadtxt('database1.txt',delimiter=',')
Since this database contains many elements, it takes minutes to start the program. Is there a way in Python (similar to .mat files in MATLAB) that lets me load the data only once, even when I stop the program and then run it again? Currently my time is wasted waiting for the program to load the data whenever I change a small thing for testing.
Firstly, the NumPy package isn't great at reading large text files; the pandas package is much stronger there.
So just stop using np.loadtxt and start using pd.read_csv instead.
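For example (a minimal sketch, assuming the file has no header row):
import pandas as pd

data = pd.read_csv('database1.txt', header=None).to_numpy()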
But, if you still want to use NumPy:
I think that the np.fromfile() function is more efficient and faster than np.loadtxt().
So, my advice is to try:
data = np.fromfile('database1.txt', sep=',')
instead of:
data = np.loadtxt('database1.txt',delimiter=',')
You could pickle to cache your data.
import pickle
import os
import numpy as np
if os.path.isfile("cache.p"):
    with open("cache.p","rb") as f:
        data=pickle.load(f)
else:
    data=np.loadtxt('database1.txt',delimiter=',')
    with open("cache.p","wb") as f:
        pickle.dump(data,f)
The first time it will be very slow, then in later executions it will be pretty fast.
I just tested this with a file containing 1 million rows and 20 columns of random floats; it took ~30s the first time, and ~0.4s on the following runs.
The CSV file may not be clean (some lines have an inconsistent number of elements); unclean lines need to be disregarded.
String manipulation is required during processing.
Example input:
20150701 20:00:15.173,0.5019,0.91665
Desired output: float32 (pseudo-date, seconds in the day, f3, f4)
0.150701 72015.173 0.5019 0.91665 (+ the trailing trash floats usually get)
The CSV file is also very big: the numpy array in memory would be expected to take 5-10 GB, and the CSV file is over 30 GB.
Looking for an efficient way to process the CSV file and end up with a numpy array.
Current solution: use the csv module, process line by line, and use a list() as a buffer that later gets turned into a numpy array with asarray(). The problem is that during the conversion memory consumption is doubled, and the copying adds execution overhead.
Numpy's genfromtxt and loadtxt don't appear to be able to process the data as desired.
If you know in advance how many rows are in the data, you could dispense with the intermediate list and write directly to the array.
import numpy as np
no_rows = 5
no_columns = 4
a = np.zeros((no_rows, no_columns), dtype=np.float64)

with open('myfile') as f:
    for i, line in enumerate(f):
        a[i, :] = cool_function_that_returns_formatted_data(line)
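The parsing function is whatever your format needs; a hypothetical sketch for the example line in the question might look like this:
def cool_function_that_returns_formatted_data(line):
    # '20150701 20:00:15.173,0.5019,0.91665' -> [0.150701, 72015.173, 0.5019, 0.91665]
    stamp, f3, f4 = line.strip().split(',')
    date_str, time_str = stamp.split(' ')
    pseudo_date = float(date_str[2:]) / 1e6            # 20150701 -> 0.150701
    h, m, s = time_str.split(':')
    seconds = int(h) * 3600 + int(m) * 60 + float(s)   # 20:00:15.173 -> 72015.173
    return [pseudo_date, seconds, float(f3), float(f4)]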
Did you think of using pandas read_csv (with engine='c')?
I find it one of the best and easiest solutions for handling csv. I worked with a 4 GB file and it worked for me.
import pandas as pd
df = pd.read_csv('abc.csv', engine='c')
print(df.head(10))
I think the i/o capability of pandas is the best way to get data into a numpy array. Specifically, the read_csv method will read into a pandas DataFrame. You can then access the underlying numpy array using the to_numpy method of the returned DataFrame (the older as_matrix method has been removed in recent pandas versions).
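A minimal sketch of that route (the filename is illustrative):
import pandas as pd

df = pd.read_csv('mydata.csv', engine='c')
arr = df.to_numpy()  # the DataFrame's data as a numpy array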