import numpy as np, itertools
x1 = np.linspace(0.1, 3.5, 3)
x2 = np.arange(5, 24, 3)
x3 = np.arange(50.9, 91.5, 3)
def calculate(x1, x2, x3):
    res = x1**5 + x2*x1 + x3
    return res
products = list(itertools.product(x1,x2,x3))
results = [calculate(a,b,c) for a,b,c in products]
I have to save the results as a lookup table for future use.
In my real case the file is going to be very large, around 1 GB, so I need a fast way of reading it back later.
What is the best way and file format to save it so that I can access it in the future?
outputs = np.column_stack((products,results))
np.savetxt('test.out',outputs, delimiter = ',')
My future use is as follows:
#given_x1,given_x2,given_x3 = 0.2, 8, 60
#open the look up table
#read the neighbouring two values for the given values
#linearly interpolate between two values for the results.
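(For the lookup-and-interpolation step outlined above, one possible approach is SciPy's RegularGridInterpolator; this is only a sketch, assuming SciPy is available, and it reshapes the flat results back onto the (x1, x2, x3) grid.)
from scipy.interpolate import RegularGridInterpolator
values = np.asarray(results).reshape(len(x1), len(x2), len(x3))  # itertools.product iterates in C order
interp = RegularGridInterpolator((x1, x2, x3), values)           # linear interpolation by default
print(interp([0.2, 8, 60]))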
I'd construct a 1-D array from the list comprehension and save this out:
a = np.array([calculate(a, b, c) for a, b, c in products])
np.savetxt(r'c:\data\lut.txt', a)
b = np.loadtxt(r'c:\data\lut.txt')
np.all(a == b)   # True
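Given the roughly 1 GB file size mentioned in the question, a binary format will usually round-trip much faster than a text file; a sketch using NumPy's .npy format (the file name is illustrative):
np.save(r'c:\data\lut.npy', a)                   # binary, exact values preserved
b = np.load(r'c:\data\lut.npy', mmap_mode='r')   # optionally memory-map instead of reading the whole file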
I have 3 different CSV files. Each has 70 rows and 430 columns. I want to create and save a boolean result file (with the same shape) that puts True where the condition is met.
One file contains temperature data, one wind data and one RH data. The condition is: (t >= 35) & (w >= 7) & (rh < 30)
I want the saved file to be a 0-and-1 file that shows in which cells the condition has been met (1) or not (0). The problem is that the results are not correct! I really appreciate your help.
import numpy as np
import pandas as pd
dft = pd.read_csv("D:/practicet.csv", header=None)
dfrh = pd.read_csv("D:/practicerh.csv", header=None)
dfw = pd.read_csv("D:/practicew.csv", header=None)
result_set = []
for i in range(0, dft.shape[1]):
    t = dft[i]
    w = dfw[i]
    rh = dfrh[i]
    result = np.empty(dft.shape, dtype=bool)
    result = result[(t >= 35) & (w >= 7) & (rh < 30)]
    result_set = np.append(result_set, result)
np.savetxt("D:/result.csv", result_set, delimiter=",")
You can generate boolean Series by testing each column of the frame, then simply concatenate the columns back into a DataFrame object.
import pandas as pd
data = pd.read_csv('data.csv')
bool_temp = data['temperature'] > 22
bool_week = data['week'] > 5
bool_humid = data['humidity'] > 50
data_tmp = [bool_humid, bool_temp, bool_week]
df = pd.concat(data_tmp, axis=1, keys=[s.name for s in data_tmp])
The dummy data:
temperature,week,humidity
25,3,80
29,4,60
22,4,20
20,5,30
2,7,80
30,9,80
are written to data.csv
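To reduce the three boolean columns to the single 0/1 result the question asks for, they can be combined and cast to integers; a small sketch, assuming the names above and that all three conditions must hold:
combined = (bool_temp & bool_week & bool_humid).astype(int)
combined.to_csv('result.csv', index=False, header=False)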
Give this a shot.
This is a proxy problem for yours, with random arrays from [0,100] in the same shape as your CSV.
import numpy as np
dft = np.random.rand(70,430)*100.
dfrh = np.random.rand(70,430)*100.
dfw = np.random.rand(70,430)*100.
result_set = []
for i in range(dft.shape[0]):
    result = ((dft[i] >= 35) & (dfw[i] >= 7) & (dfrh[i] < 30))
    result_set.append(result)
np.savetxt("result.csv", result_set, delimiter=",")
The critical problem with your code is:
result = np.empty(dft.shape, dtype=bool)
result = result[(t >= 35) & (w >= 7) & (rh < 30)]
This does not do what you think it's doing. You (i) initialize an empty array (which will have garbage values), and then you (ii) apply your boolean mask to it. So, now you have a garbage array masked into another garbage array according to your specified boolean rules.
As an example...
In [5]: a = np.array([1,2,3,4,5])
In [6]: mask = np.array([True,False,False,False,True])
In [7]: a[mask]
Out[7]: array([1, 5])
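Applied to the DataFrames in the question, the fix could look roughly like this (a sketch, assuming the three CSVs line up cell for cell):
# element-wise comparison across the whole DataFrames; no loop or np.empty needed
result = ((dft >= 35) & (dfw >= 7) & (dfrh < 30)).astype(int)
np.savetxt("D:/result.csv", result.values, delimiter=",", fmt="%d")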
I have a pandas DataFrame that I read from a CSV; it contains X and Y coordinates and a value that I need to put into a matrix and save to a text file. So I created a NumPy array with max(X) by max(Y) extent.
I have this file:
fid,x,y,agblongo_tch_alive
2368458,1,1,45.0126083457747
2368459,1,2,44.8996854102889
2368460,2,2,45.8565022933761
2358154,3,1,22.6352522929758
2358155,3,3,23.1935887499899
And I need this one:
45.01 44.89 -9999.00
-9999.00 45.85 -9999.00
22.63 -9999.00 23.19
To do that, I'm using a loop like this:
for row in data.iterrows():
    p[int(row[1][2]), int(row[1][1])] = row[1][3]
and then I save it to disk using np.array2string. It works.
As the original CSV has 68 M lines, it's taking a lot of time to process, so I wonder if there's a more Pythonic and faster way to do that.
Assuming the columns of your df are 'x', 'y', 'value', you can use advanced indexing:
>>> x, y, value = data['x'].values, data['y'].values, data['value'].values
>>> result = np.zeros((y.max()+1, x.max()+1), value.dtype)
>>> result[y, x] = value
This will, however, not work properly if coordinates are not unique.
In that case it is safer (but slower) to use add.at:
>>> result = np.zeros((y.max()+1, x.max()+1), value.dtype)
>>> np.add.at(result, (y, x), value)
Alternatively, you can create a sparse matrix since your data happen to be in sparse coo format. Using the '.A' property you can then convert that to a normal (dense) array as needed:
>>> from scipy import sparse
>>> spM = sparse.coo_matrix((value, (y, x)), (y.max()+1, x.max()+1))
>>> (spM.A == result).all()
True
Update: if the fill value is not zero, the above must be modified.
Method 1: replace the second line with the following (remember this should only be used if coordinates are unique):
>>> result = np.full((y.max()+1, x.max()+1), fillvalue, value.dtype)
Method 2: does not work
Method 3: after creating spM do
>>> spM.sum_duplicates()
>>> assert spM.has_canonical_format
>>> spM.data -= fillvalue
>>> result2 = spM.A + fillvalue
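As a small usage example of Method 1 with hypothetical coordinates and fillvalue = -9999.0 (only a sketch; the names are illustrative):
>>> x = np.array([0, 1, 2]); y = np.array([0, 0, 1])
>>> value = np.array([1.5, 2.5, 3.5])
>>> fillvalue = -9999.0
>>> result = np.full((y.max()+1, x.max()+1), fillvalue, value.dtype)
>>> result[y, x] = value
# result now holds 1.5, 2.5, 3.5 at (row, col) = (0, 0), (0, 1), (1, 2) and -9999.0 elsewhere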
I am trying to solve a minimum-value problem. I could obtain the minimum values from two loops, but what I really need is also the exact values that correspond to the output minimum.
from __future__ import division
from numpy import *

b1 = 0.9917949
b2 = 0.01911
b3 = 0.000840
b4 = 0.10175
b5 = 0.000763
mu = 1.66057*10**(-24)  # gram
c = 3.0*10**8

Mler = open("olasiM.txt", "w+")
data = zeros(0, 'float')
for A in range(1, 25):
    M2 = zeros(0, 'float')
    print 'A=', A
    for Z in range(1, A+1):
        SEMF = mu*c**2*(b1*A + b2*A**(2./3.) - b3*Z + b4*A*((1./2.) - (Z/A))**2 + (b5*Z**2)/(A**(1./3.)))
        SEMF = array(SEMF)
        M2 = hstack((M2, SEMF))
    minm2 = min(M2)
    data = hstack((data, minm2))
    data = hstack((data, A))
datalist = data.tolist()
for i in range(len(datalist)):
    Mler.write(str(datalist[i]) + '\n')
Mler.close()
Here, what I want is to see the minimum value of SEMF and the corresponding A and Z values. For example, it has to be A=1, Z=1 and SEMF = some number.
I also don't know how to write these A and Z values to the file.
The big advantage of numpy over plain Python lists is vectorized operations. Unfortunately your code fails to use them at all. For example, the whole inner loop that uses Z as an index can easily be vectorized. Instead you are computing single elements with Python floats and then stacking them one by one into the numpy array M2.
So I'd refactor that part of the code as:
import numpy as np
# ...
Zs = np.arange(1, A+1, dtype=float)
SEMF = mu*c**2 * (b1*A + b2*A**(2./3.) - b3*Zs + b4*A*((1./2.) - (Zs/A))**2 + (b5*Zs**2)/(A**(1./3.)))
Here the SEMF array should be exactly what you'd obtain as the final M2 array. Now you can find the minimum and stack that value into your data array:
min_val = SEMF.min()
data = np.hstack((data, min_val))
data = np.hstack((data, A))
If you also want to keep track of which value of Z gave the minimum, you can use the argmin method:
min_val, min_pos = SEMF.min(), SEMF.argmin()   # min_pos is the 0-based index; the corresponding Z is Zs[min_pos]
data = np.hstack((data, np.array([min_val, min_pos, A])))
The final code should look like:
from __future__ import division
import numpy as np
b1 = 0.9917949
b2 = 0.01911
b3 = 0.000840
b4 = 0.10175
b5 = 0.000763
mu = 1.66057*10**(-24) #gram
c = 3.0*10**8
data = np.zeros(0, 'float')
for A in range(1, 25):
    Zs = np.arange(1, A+1, dtype=float)
    SEMF = mu*c**2 * (b1*A + b2*A**(2./3.) - b3*Zs + b4*A*((1./2.) - (Zs/A))**2 + (b5*Zs**2)/(A**(1./3.)))
    min_val, min_pos = SEMF.min(), SEMF.argmin()
    data = np.hstack((data, np.array([min_val, min_pos, A])))

datalist = data.tolist()
with open("olasiM.txt", "w+") as mler:
    for i in range(len(datalist)):
        mler.write(str(datalist[i]) + '\n')
Note that numpy provides functions to save/load arrays to/from files, like savetxt, so I suggest using those instead of writing the values manually.
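For instance, the data array built above can be reshaped into rows of (min_val, min_pos, A) and written in a single call; a small sketch:
np.savetxt("olasiM.txt", data.reshape(-1, 3), header="min_val min_pos A")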
Probably some numpy expert could also vectorize the operations over the As. Unfortunately my numpy knowledge isn't that advanced and I don't know how to handle the fact that we'd have a variable number of dimensions due to the range(1, A+1) thing...
I saved a couple of numpy arrays with np.save(), and put together they're quite huge.
Is it possible to load them all as memory-mapped files, and then concatenate and slice through all of them without ever loading anything into memory?
Using numpy.concatenate apparently loads the arrays into memory. To avoid this you can easily create a third memmap array in a new file and read the values from the arrays you wish to concatenate. More efficiently, you can also append new arrays to an already existing file on disk.
In any case you must choose the right memory order for the array (row-major or column-major).
The following examples illustrate how to concatenate along axis 0 and axis 1.
1) concatenate along axis=0
a = np.memmap('a.array', dtype='float64', mode='w+', shape=( 5000,1000)) # 38.1MB
a[:,:] = 111
b = np.memmap('b.array', dtype='float64', mode='w+', shape=(15000,1000)) # 114 MB
b[:,:] = 222
You can define a third array that reads the same file as the first array to be concatenated (here a) in mode r+ (read and write), but with the shape of the final array you want to achieve after concatenation, like:
c = np.memmap('a.array', dtype='float64', mode='r+', shape=(20000,1000), order='C')
c[5000:,:] = b
Concatenating along axis=0 does not require passing order='C' because this is already the default order.
2) concatenate along axis=1
a = np.memmap('a.array', dtype='float64', mode='w+', shape=(5000,3000)) # 114 MB
a[:,:] = 111
b = np.memmap('b.array', dtype='float64', mode='w+', shape=(5000,1000)) # 38.1MB
b[:,:] = 222
The arrays saved on disk are actually flattened, so if you create c with mode=r+ and shape=(5000,4000) without changing the array order, the first 1000 elements of the second row of a will end up in the first row of c. But you can easily avoid this by passing order='F' (column-major) to memmap:
c = np.memmap('a.array', dtype='float64', mode='r+',shape=(5000,4000), order='F')
c[:, 3000:] = b
Now you have an updated file 'a.array' with the concatenation result. You may repeat this process to concatenate more arrays two at a time.
Related questions:
Working with big data in python and numpy, not enough ram, how to save partial results on disc?
Maybe an alternative solution: I also had a single multidimensional array spread over multiple files which I only wanted to read, and I solved this issue with dask concatenation.
import numpy as np
import dask.array as da
a = np.memmap('a.array', dtype='float64', mode='r', shape=( 5000,1000))
b = np.memmap('b.array', dtype='float64', mode='r', shape=(15000,1000))
c = da.concatenate([a, b], axis=0)
This way one avoids the hacky additional file handle. The dask array can then be sliced and worked with almost like any numpy array, and when it comes time to calculate a result one calls compute.
Note that there are two caveats:
it is not possible to do in-place re-assignment, e.g. c[::2] = 0, so creative solutions are necessary in those cases.
this also means the original files can no longer be updated. To save results out, the dask store methods should be used. This method can again accept a memmapped array.
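For that last point, a brief sketch of storing the concatenated dask array into a freshly created memmap (the file name is illustrative):
out = np.memmap('c.array', dtype='float64', mode='w+', shape=c.shape)
da.store(c, out)  # streams the dask array block-wise into the memmap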
If you use order='F', it leads to another problem: when you load the file next time it will be quite a mess, even if you pass order='F' again. So my solution is below; I have tested it a lot and it works fine.
fp = ...            # your old memmap
shape = fp.shape
data = ...          # your ndarray to append along the last axis
data_shape = data.shape
concat_shape = data_shape[:-1] + (data_shape[-1] + shape[-1],)
print('concat shape: {}'.format(concat_shape))
new_fp = np.memmap(new_file_name, dtype='float32', mode='r+', shape=concat_shape)  # use mode='w+' if the file does not exist yet
if len(concat_shape) == 1:
    new_fp[:shape[0]] = fp[:]
    new_fp[shape[0]:] = data[:]
elif len(concat_shape) == 2:
    new_fp[:, :shape[-1]] = fp[:]
    new_fp[:, shape[-1]:] = data[:]
elif len(concat_shape) == 3:
    new_fp[:, :, :shape[-1]] = fp[:]
    new_fp[:, :, shape[-1]:] = data[:]
fp = new_fp
fp.flush()
I want to know how I should index / access some data programmatically in python.
I have columnar data: depth, temperature, gradient, gamma, for a set of boreholes. There are n boreholes. I have a header, which lists the borehole name and numeric ID. Example:
Bore_name,Bore_ID,,,Bore_name,Bore_ID,,,, ...
<a row of headers>
depth,temp,gradient,gamma,depth,temp,gradient,gamma ...
I don't know how to index the data, apart from crude iteration:
import numpy
import matplotlib.pyplot as pl   # assuming pl refers to matplotlib's pyplot

with open(filename, 'rU') as f:
    bores = f.readline().rstrip().split(',')
    headers = f.readline().rstrip().split(',')

# load from CSV file, missing values are empty 'cells'
tdata = numpy.genfromtxt(filename, skip_header=2, delimiter=',',
                         missing_values='', filling_values=numpy.nan)

for column in range(0, numpy.shape(tdata)[1], 4):
    # plots temperature on x, depth on y
    pl.plot(tdata[:, column+1], tdata[:, column], label=bores[column])
    # get index at max depth
    depth = numpy.nanargmin(tdata[:, column])
    # plot text label at max depth (y) and temp at that depth (x)
    pl.text(tdata[depth, column+1], tdata[depth, column], bores[column])
It seems easy enough this way, but I've been using R recently and have got a bit used to their way of referencing data objects via classes and subclasses interpreted from headers.
Well, if you like R's data.table, there have been a few (at least) attempts to re-create that functionality in NumPy, through additional classes in NumPy core and through external Python libraries. The effort I find most promising is the datarray library by Fernando Perez. Here's how it works.
>>> # create a NumPy array for use as our data set
>>> import numpy as NP
>>> D = NP.random.randint(0, 10, 40).reshape(8, 5)
>>> # create some generic row and column names to pass to the constructor
>>> row_ids = [ "row{0}".format(c) for c in range(D.shape[0]) ]
>>> rows = 'rows', row_ids
>>> variables = [ "col{0}".format(c) for c in range(D.shape[1]) ]
>>> cols = 'cols', variables
Instantiate the DataArray by calling the constructor and passing in an ordinary NumPy array and a list of tuples, one tuple for each axis; since ndim = 2 here, there are two tuples in the list. Each tuple is composed of an axis name (str) and a sequence of labels for that axis (list).
>>> from datarray.datarray import DataArray as DA
>>> D1 = DA(D, [rows, cols])
>>> D1.axes
(Axis(name='rows', index=0, labels=['row0', 'row1', 'row2', 'row3',
'row4', 'row5', 'row6', 'row7']), Axis(name='cols', index=1,
labels=['col0', 'col1', 'col2', 'col3', 'col4']))
>>> # now you can use R-like syntax to reference a NumPy data array by column:
>>> D1[:,'col1']
DataArray([8, 5, 0, 7, 8, 9, 9, 4])
('rows',)
You could put your data into a dict for each borehole, keyed by the borehole id, and values as dicts with headers as keys. Roughly like this:
data = {boreid1:{"temp":temparray, ...}, boreid2:{"temp":temparray}}
Probably reading from files will be a little more cumbersome with this approach, but for plotting you could do something like
pl.plot(data[boreid]["temperature"], data[boreid]["depth"])
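A rough sketch of how such a dict could be filled from the arrays parsed in the question (assuming the bores list and tdata array defined there, and keying by the borehole name; the numeric ID at bores[col + 1] would work just as well):
data = {}
for col in range(0, tdata.shape[1], 4):
    data[bores[col]] = {
        "depth":       tdata[:, col],
        "temperature": tdata[:, col + 1],
        "gradient":    tdata[:, col + 2],
        "gamma":       tdata[:, col + 3],
    }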
Here are idioms for naming rows and columns:
row0, row1 = np.ones((2,5))
for col in range(0, tdata.shape[1], 4):
    depth, temp, gradient, gamma = tdata[:, col:col+4].T
    pl.plot(temp, depth)
See also namedtuple:
from collections import namedtuple
Rec = namedtuple( "Rec", "depth temp gradient gamma" )
r = Rec( *tdata[:, col:col+4].T )
print r.temp, r.depth
datarray (thanks Doug) is certainly more general.