Python3 pysal error message for large dataset - python

I am using the pysal package to run a Theil decomposition and get the within-group and between-group components.
With a small dataframe, the package works.
See the code below:
import numpy as np
import pandas as pd
import pysal

path = "/Users/username/Desktop/file1.csv"
df_table = pd.read_table(path, sep=",")
df2 = pd.DataFrame(df_table)
df = df2.sort_values(['exposure'], ascending=True)
rr = np.array(df['exposure'])
drop = pysal.inequality.theil.Theil(rr)
print('drop.T', drop.T)  # total Theil index
dropp = pysal.inequality.theil.TheilD(rr, df['race'])
print('WG', dropp.wg)  # within group
print("BG", dropp.bg)  # between group
When I run the same code on a much larger file, I get the following error message:
TypeError: unorderable types: float() < str()
How do I fix this error?
The data types appear to be the same in both files, so I am not sure why it works with the small file but not with the large one.
Below is the relevant source code from the pysal package:
def __init__(self, y, partition):
    groups = np.unique(partition)
    T = Theil(y).T
    ytot = y.sum(axis=0)
    # group totals
    gtot = np.array([y[partition == gid].sum(axis=0) for gid in groups])
    mm = np.dot
    if ytot.size == 1:  # y is 1-d
        sg = gtot / (ytot * 1.)
        sg.shape = (sg.size, 1)
    else:
        sg = mm(gtot, np.diag(1. / ytot))
    ng = np.array([sum(partition == gid) for gid in groups])
    ng.shape = (ng.size,)  # ensure ng is 1-d
    n = y.shape[0]
    # between group inequality
    sg = sg + (sg==0)  # handle case when a partition has 0 for sum
    bg = np.multiply(sg, np.log(mm(np.diag(n * 1. / ng), sg))).sum(axis=0)
    self.T = T
    self.bg = bg
    self.wg = T - bg
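One possible explanation, offered as a guess rather than a diagnosis: np.unique(partition) sorts the group labels, and in Python 3 sorting fails when the labels mix strings with floats. That can happen if the race column of the larger file has empty cells, which pandas reads as float NaN. The sketch below reproduces the situation with made-up labels; clean_partition and the "missing" placeholder are hypothetical names, not part of pysal.

```python
import numpy as np


def clean_partition(labels, missing="missing"):
    """Return labels as a string array, mapping None/NaN to a placeholder."""
    cleaned = []
    for value in labels:
        if value is None or (isinstance(value, float) and np.isnan(value)):
            cleaned.append(missing)
        else:
            cleaned.append(str(value))
    return np.array(cleaned)


# Made-up labels: strings plus one float NaN, as pandas yields for an empty cell.
mixed = np.array(["a", "b", np.nan, "a"], dtype=object)

try:
    np.unique(mixed)  # sorting has to compare str with float
    sort_failed = False
except TypeError:
    sort_failed = True

# After cleaning, every label is a string, so the labels can be sorted.
groups = np.unique(clean_partition(mixed))
print(sort_failed, groups)
```

With real data, a quick check along the lines of df['race'].map(type).value_counts() would show whether the column really holds more than one Python type before it is passed to TheilD.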

Related

python multiprocessing import Pool, cpu_count: causes forever loop

The code below, which uses multiprocessing, loops forever.
I'm building an iris recognition system, and this is the matching function. Everything works fine until the multiprocessing part.
The error output is shown below so that you get a better idea:
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
    if __name__ == '__main__':
        freeze_support()
        ...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
Code:
##-----------------------------------------------------------------------------
## Import
##-----------------------------------------------------------------------------
import numpy as np
from os import listdir
from fnmatch import filter
import scipy.io as sio
from multiprocessing import Pool, cpu_count
from itertools import repeat
import warnings
warnings.filterwarnings("ignore")
##-----------------------------------------------------------------------------
## Function
##-----------------------------------------------------------------------------
def matching(template_extr, mask_extr, temp_dir, threshold=0.38):
    """
    Description:
        Match the extracted template with database.
    Input:
        template_extr - Extracted template.
        mask_extr - Extracted mask.
        threshold - Threshold of distance.
        temp_dir - Directory contains templates.
    Output:
        List of strings of matched files, 0 if not, -1 if no registered sample.
    """
    # Get the number of accounts in the database
    n_files = len(filter(listdir(temp_dir), '*.mat'))
    if n_files == 0:
        return -1
    # Use all cores to calculate Hamming distances
    args = zip(
        sorted(listdir(temp_dir)),
        repeat(template_extr),
        repeat(mask_extr),
        repeat(temp_dir),
    )
    with Pool(processes=cpu_count()) as pools:
        result_list = pools.starmap(matchingPool, args)
    filenames = [result_list[i][0] for i in range(len(result_list))]
    hm_dists = np.array([result_list[i][1] for i in range(len(result_list))])
    # Remove NaN elements
    ind_valid = np.where(hm_dists>0)[0]
    hm_dists = hm_dists[ind_valid]
    filenames = [filenames[idx] for idx in ind_valid]
    # Threshold and give the result ID
    ind_thres = np.where(hm_dists<=threshold)[0]
    # Return
    if len(ind_thres)==0:
        return 0
    else:
        hm_dists = hm_dists[ind_thres]
        filenames = [filenames[idx] for idx in ind_thres]
        ind_sort = np.argsort(hm_dists)
        return [filenames[idx] for idx in ind_sort]
#------------------------------------------------------------------------------
def calHammingDist(template1, mask1, template2, mask2):
    """
    Description:
        Calculate the Hamming distance between two iris templates.
    Input:
        template1 - The first template.
        mask1 - The first noise mask.
        template2 - The second template.
        mask2 - The second noise mask.
    Output:
        hd - The Hamming distance as a ratio.
    """
    # Initialize
    hd = np.nan
    # Shift template left and right, use the lowest Hamming distance
    for shifts in range(-8,9):
        template1s = shiftbits(template1, shifts)
        mask1s = shiftbits(mask1, shifts)
        mask = np.logical_or(mask1s, mask2)
        nummaskbits = np.sum(mask==1)
        totalbits = template1s.size - nummaskbits
        C = np.logical_xor(template1s, template2)
        C = np.logical_and(C, np.logical_not(mask))
        bitsdiff = np.sum(C==1)
        if totalbits==0:
            hd = np.nan
        else:
            hd1 = bitsdiff / totalbits
            if hd1 < hd or np.isnan(hd):
                hd = hd1
    # Return
    return hd
#------------------------------------------------------------------------------
def shiftbits(template, noshifts):
    """
    Description:
        Shift the bit-wise iris patterns.
    Input:
        template - The template to be shifted.
        noshifts - The number of shift operators, positive for right
                   direction and negative for left direction.
    Output:
        templatenew - The shifted template.
    """
    # Initialize
    templatenew = np.zeros(template.shape)
    width = template.shape[1]
    s = 2 * np.abs(noshifts)
    p = width - s
    # Shift
    if noshifts == 0:
        templatenew = template
    elif noshifts < 0:
        x = np.arange(p)
        templatenew[:, x] = template[:, s + x]
        x = np.arange(p, width)
        templatenew[:, x] = template[:, x - p]
    else:
        x = np.arange(s, width)
        templatenew[:, x] = template[:, x - s]
        x = np.arange(s)
        templatenew[:, x] = template[:, p + x]
    # Return
    return templatenew
#------------------------------------------------------------------------------
def matchingPool(file_temp_name, template_extr, mask_extr, temp_dir):
    """
    Description:
        Perform matching session within a Pool of parallel computation
    Input:
        file_temp_name - File name of the examining template
        template_extr - Extracted template
        mask_extr - Extracted mask of noise
    Output:
        hm_dist - Hamming distance
    """
    # Load each account
    data_template = sio.loadmat('%s%s'% (temp_dir, file_temp_name))
    template = data_template['template']
    mask = data_template['mask']
    # Calculate the Hamming distance
    hm_dist = calHammingDist(template_extr, mask_extr, template, mask)
    return (file_temp_name, hm_dist)
How can I remove multiprocessing and still have the code work?
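Use Python's itertools.starmap() in place of Pool.starmap(); it applies the function to the same argument tuples, but sequentially in the current process.
Hope it helps.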
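As a minimal sketch of that suggestion, the snippet below builds the argument tuples the same way the question does (zip plus repeat) and runs them through itertools.starmap instead of a process pool. The score function and its arguments are hypothetical stand-ins for matchingPool, chosen only so that the example runs without any template files.

```python
from itertools import repeat, starmap


def score(name, query, scale):
    # Hypothetical stand-in for matchingPool: returns (file name, distance).
    return (name, abs(len(name) - query) * scale)


def match_sequential(names, query, scale):
    # Same argument layout as the Pool.starmap call, evaluated in one process.
    args = zip(sorted(names), repeat(query), repeat(scale))
    return list(starmap(score, args))


if __name__ == "__main__":
    print(match_sequential(["bb.mat", "a.mat"], 5, 2))
```

If the pool is kept instead, the error message itself names the usual remedy: the code that starts the matching should run under an if __name__ == '__main__': guard in the main module, so that child processes do not re-execute it on import.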

Duplicating rows in dataframe python

Good afternoon everyone,
I am currently writing a thesis on the KMV model in python. I took inspiration from the code here to solve the non-linear equations. Here is the link to the CSV file used to create the dataframe. And this is the code I have so far:
Importation of the required modules
from datetime import datetime
import pandas as pd
import numpy as np
import scipy.optimize as sco
from scipy.stats import norm
df = pd.DataFrame()
df = pd.read_csv("AREX.csv", sep=';', engine = "python", decimal=',')
Functions to prepare the file for the model to run
def clean():
    # df.rename(columns ={"Date": "Date"}, inplace = True)
    # df["Date"] = pd.to_datetime(df['Date'])
    df.set_index("Date", inplace = True)
    df['AREX.O']=df['AREX.O'].astype(float)
    df.drop(['Total Short Term debt'], axis =1, inplace = True)
    return df
def preparation():
    df['e']=df['AREX.O']*df['Share Outstanding']
    df['Short Term Debt']=df['Debt']-df['Total Long term Debt']
    df['f']=df['Short Term Debt']+df['Total Long term Debt']*0.5
    df['log_ret'] = np.log(df['AREX.O']) - np.log(df['AREX.O'].shift(1))
    # df['stdev']=df['log_ret'].rolling(252).std()*m.sqrt(252)
    return df
Algorithm used to solve for a and sigma_a.
I only tried to adapt the code to my dataframe here
def algo1():
    # formatting the values as required
    df["f"] = df["f"].astype(float)
    df["e"] = df["e"].astype(float)
    # computation of key input variable for the model
    df['a'] = df['f'].add(df["e"])
    # defining a function for the Black-Scholes equation
    def bseqn(a, debug=False):
        d1 = (np.log(a/f) + (r + 0.5*sigma_a**2)*T)/(sigma_a*np.sqrt(T))
        d2 = d1 - sigma_a*np.sqrt(T)
        y1 = e - (a*norm.cdf(d1) - np.exp(-r*T)*f*norm.cdf(d2))
        if debug:
            print("d1 = {:.6f}".format(d1))
            print("d2 = {:.6f}".format(d2))
            print("Error = {:.6f}".format(y1))
        return y1
    # Solving the model
    time_horizon=[1]
    timesteps = range(1, len(df))
    results = np.empty((df.shape[0],len(time_horizon)))
    # looping to solve for each row
    for i, years in enumerate(time_horizon):
        T = 1
        results[:,i] = df.loc[:,'a']
        for i_t, t in enumerate(timesteps):
            a = results[t-10:t,i]
            ra =np.log(a/np.roll(a,1))
            sigma_a = np.nanstd(ra) #gives initial value of sigma_a
            if i_t == 0:
                subset_timesteps = range(t-1, t+1)
                print(subset_timesteps)
            else:
                subset_timesteps = [t]
            n_its = 0
            while n_its < 10:
                n_its += 1
                for t_sub in subset_timesteps:
                    r = df.iloc[t_sub]['r']
                    f = df.iloc[t_sub]['f']
                    e = df.iloc[t_sub]['e']
                    sol = sco.fsolve(bseqn, results[t_sub,i]) #if I replace newton with fsolve the code works properly
                    results[t_sub,i] = sol # stores the new values of a
                # Update sigma_a based on new values of a
                last_sigma_a = sigma_a
                a = results[t-10:t,i]
                ra = np.log(a/np.roll(a,1))
                sigma_a = np.nanstd(ra) #new val of sigma
                diff = last_sigma_a - sigma_a
                if abs(diff) < 1e-3:
                    df.loc[t_sub,'sigma_a'] = sigma_a
                    break
                else:
                    pass
    return df
Run function
def run():
    clean()
    preparation()
    algo1()
    print(df)
    print(list(df))
    # main_df = df.to_csv("AREX_D.csv")
The code should write the sigma_a results into the newly created sigma_a column, but instead it adds rows: rather than 1500 rows I end up with 3000, most of them filled with NaN values. I do not understand which part of the code does that.
I suspect it comes from these lines:
diff = last_sigma_a - sigma_a
if abs(diff) < 1e-3:
    df.loc[t_sub,'sigma_a'] = sigma_a
    break
Does anyone have any insight into what is happening?
Thank you very much!
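A possible cause, assuming the index really is the Date column as set in clean(): df.loc selects by label, so df.loc[t_sub, 'sigma_a'] with an integer t_sub does not address an existing date row. pandas then adds a new row labelled with that integer, and every other column of that row is NaN. The sketch below shows the effect on made-up data and one way to write by position instead; the frame and its values are hypothetical.

```python
import pandas as pd

# Made-up data: the index holds date strings, as after set_index("Date").
df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0]},
    index=pd.Index(["2020-01-01", "2020-01-02", "2020-01-03"], name="Date"),
)
df["sigma_a"] = float("nan")

enlarged = df.copy()
enlarged.loc[1, "sigma_a"] = 0.5  # 1 is not an index label, so a row is added

fixed = df.copy()
fixed.loc[fixed.index[1], "sigma_a"] = 0.5  # turn the position into a label first

print(len(enlarged), len(fixed))
```

Translating the position with df.index[t_sub] keeps the row count unchanged, because the write then targets a label that already exists.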

For loop Python- from Matlab

I am starting to code in Python, coming from a Matlab background, and I have a problem with a for loop that I am trying to port.
So this is my for loop from Matlab,
ix = indoor(1);
idx = indoor(2)-indoor(1);
%Initialize X apply I.C
X = [ix;idx];
for k=(1:1:287)
    X(:,k+1) = Abest*X(:,k) + Bbest*outdoor(k+1) + B1best* (cbest4/cbest1);
end
In this code Abest is a 2x2 matrix, Bbest is a 2x1 matrix, outdoor is a 288x1 vector, and B1best is a 2x1 matrix. The matrices come from a function that uses the matrix exponential command. The constants c4 and c1 (cbest4 and cbest1 in the code) are defined earlier.
In Python I have been able to get the matrix exponential command to work in my function but I can't get that for loop to work.
Xo = np.array([[ix],[idx]])
num1 = range(0,276)
for k in num1:
    Xo[:,k+1] = Ae*Xo[:,k] + Be*outdoor[k+1] + Be1*(c4/c1)
Again, Ae, Be, and Be1 are matrices of the same sizes as the Matlab ones, and the same goes for the outdoor vector.
I have tried everything I can think of to make it work... The only thing that worked for me was,
Xo = np.zeros(())
#Initial COnditions
ix = np.array(indoor[0])
idx = np.array(indoor[1]-indoor[0])
Xo = np.array([[ix],[idx]])
#Range for the for loop
num1 = range(0,1)
for k in num1:
    Xo = Ae*Xo[k] + Be*outdoor[k+1] + Be1*(c4/c1)
This works, but it only gives me two points. If I change the range I get an error. I assume it works because my original Xo holds just two states, so k walks through those two states, but that is not what I want.
If anyone could help me out, that would be great! If I'm making a coding error, it is because I don't yet understand Python's for loop very well when it comes to data analysis, in particular looping through the rows while incrementing the columns. Thank you for your time.
Upon Request here is my full code:
import scipy.io as sc
import math as m
import numpy as np
import matplotlib.pyplot as plt
import sys
from scipy.linalg import expm, sinm, cosm
import pandas as pd
df = pd.read_excel('datatemp.xlsx')
outdoor = np.array(df[['Outdoor']])
indoor = np.array(df[['Indoor']])
###########################. FUNCTION DEFINE. #################################################
#Progress bar
def progress(count, total, status=''):
    percents = round(100.0 * count / float(total), 1)
    sys.stdout.write(' %s%s ...%s\r' % ( percents, '%', status))
    sys.stdout.flush()
#Define Matrix for Model
def Matrixbuild(c1,c2,c3):
    A = np.array([[0,1],[-c3/c1,-c2/c1]])
    B = np.array([[0],[1/c1]])
    B1 = np.array([[1],[0]])
    C = np.zeros((2,2))
    D = np.zeros((2,2))
    F = np.array([[0,1,0,1],[-c3/c1,-c2/c1,1/c1,0],[0,0,0,0],[0,0,0,0]])
    R = np.array(expm(F))
    Ae = np.array([[R.item(0),R.item(1)],[R.item(4),R.item(5)]])
    Be = np.array([[R.item(2)],[R.item(6)]])
    Be1 = np.array([[R.item(3)],[R.item(7)]])
    return Ae,Be,Be1
###########################. Data. #################################################
#USED FOR JUST TRYING WITHOUT ACTUAL DATA
# outdoor = np.array([5.8115,4.394,5.094,5.1123,5.1224])
# indoor = np.array([15.595,15.2429,15.0867,14.9982,14.8993])
###########################. Model Define. #################################################
Xo = np.zeros((2,288))
ix = np.array(indoor[0])
idx = np.array(indoor[1])
err_min = m.inf
c1spam = np.linspace(0.05,0.001,30)
c2spam = np.linspace(6.2,6.5,30)
c3spam = np.linspace(7.1,7.45,30)
totalspam = len(c1spam)*len(c2spam)*len(c3spam)
ind = 0
for c1 in c1spam:
    for c2 in c2spam:
        for c3 in c3spam:
            c4 = 1.1
            #MatrixBuild Function
            result = Matrixbuild(c1,c2,c3)
            Ae,Be,Be1 = result
            Xo = np.array([ix,idx])
            Datarange = range(0,len(outdoor)-1,1)
            for k in Datarange:
                Xo[:,k+1] = np.matmul(Ae,Xo[:,k]) + np.matmul(Be,outdoor[k+1]) + Be1*(c4/c1)
            ind = ind + 1
            print(Xo)
            err = np.linalg.norm(Xo[0,range(0,287)]-indoor.T)
            if err<err_min:
                err_min = err
                cbest = np.array([[c1],[c2],[c3],[c4]])
            progress(ind,totalspam,status='Done')
# print(X)
# print(err)
# print(cbest)
###########################. Model with Cbest Values. #################################################
c1 = cbest[0]
c2 = cbest[1]
c3 = cbest[2]
result2 = Matrixbuild(c1,c2,c3)
AeBest,BeBest,Be1Best = result2
Xo = np.array([ix,idx])
Datarange = np.arange(0,len(outdoor)-1)
for k in Datarange:
    Xo[:,k+1] = np.matmul(AeBest,Xo[:,k]) + np.matmul(BeBest,outdoor[k+1]) + Be1Best*(c4/c1)
err = np.linalg.norm(Xo[0,range(0,287)]-indoor.T)
print(cbest)
print(err)
###########################. Plots. #################################################
plt.figure(0)
time = np.linspace(1,2,2)
plt.scatter(time,X[0],s=15,c="blue")
plt.scatter(time,indoor[0:2],s=15,c="red")
plt.show()
And again my error occurs in the line with the for loop of
for k in Datarange:
    Xo[:,k+1] = np.matmul(Ae,Xo[k]) + np.matmul(Be,outdoor[k+1]) + Be1*(c4/c1)
I was trying to use np.matmul for matrix multiplication but even without it, it wasn't working.
If there are any other questions about my code please ask. Essentially I'm trying to find the best c1,c2,c3 coefficients that fit my data which is indoor temperature by using a basic second order constant coefficient model.
Have you tried with Xo[:,k+1] instead of Xo(:,k+1)? Python uses [] for slicing and indexing.
EDIT:
Xo = np.array([[ix],[idx]])
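This creates a 2x1 array that holds only the initial state (ix, idx), so there is no column k+1 to assign to. I think you're looking for something like Xo = np.zeros((2, n)), which gives you a 2xn array initialized to zeros (n being the number of time steps), and then setting the first column to the initial conditions. If you don't need the zeros you can use Xo = np.empty((2, n)).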
See the docs on array creation.
So by reading into how python works a little more and allocation for arrays/matrices, I was able to find out how to do it. I needed to first allocate my 'Xo' value and then input the initial conditions in order for the For loop to work.
Xo = np.zeros((2,num2))
Xo = np.asmatrix(Xo)
Xo[0,0] = ix
Xo[1,0] = idx
Also for the 'for loop', I called the range some value like this,
num1 = range(0,4)
num2 = len(num1) + 1
This gives the total number of columns of 'Xo' as 'num2'. It is defined that way because my for loop writes to index k+1, so 'Xo' needs one more column than the loop has iterations, e.g.:
for k in num1:
    Xo[:,k+1] = Ae*Xo[:,k] + Be*outdoor[k+1] + Be1*(c4/c1)
But there it is! I figured it out by comparing Matlab printouts to Python printouts and debugging one line at a time. Now both programs print exactly the same values, so it is time to start using the Python code!
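For reference, the same recursion can be sketched with plain NumPy arrays rather than np.asmatrix: preallocate every column, write the initial state into column 0, then fill column k+1 from column k. The function name simulate, the offset argument (standing in for c4/c1), and the toy inputs are assumptions made for illustration only.

```python
import numpy as np


def simulate(A, B, B1, u, x0, offset):
    """Iterate X[:, k+1] = A @ X[:, k] + B * u[k+1] + B1 * offset."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float).ravel()    # accept 2x1 or length-2 input
    B1 = np.asarray(B1, dtype=float).ravel()
    u = np.asarray(u, dtype=float).ravel()
    n = len(u)
    X = np.zeros((2, n))                      # preallocate all columns up front
    X[:, 0] = x0                              # initial conditions in column 0
    for k in range(n - 1):                    # writes columns 1 .. n-1
        X[:, k + 1] = A @ X[:, k] + B * u[k + 1] + B1 * offset
    return X


# Toy inputs, chosen so that the result is easy to check by hand.
X = simulate(np.eye(2), [[1.0], [0.0]], [[0.0], [1.0]],
             u=[0.0, 1.0, 2.0], x0=[10.0, 0.0], offset=0.5)
print(X)
```

Note that @ (or np.matmul) performs the matrix product on plain arrays, whereas * multiplies element by element unless the operands are np.matrix objects.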

Implementing Flajolet and Martin’s Algorithm in python

The following is the code I've written to implement Flajolet and Martin's algorithm. I've used the Jenkins hash function to generate a 32-bit hash value of the data. The program seems to follow the algorithm but is off the mark by about 20%: my data set consists of more than 200,000 unique records, whereas the program reports about 160,000. Please help me understand the mistake(s) I am making. The hash function is implemented as described on Bob Jenkins' website.
import numpy as np
from jenkinshash import jhash
class PCSA():
    def __init__(self, nmap, maxlength):
        self.nmap = nmap
        self.maxlength = maxlength
        self.bitmap = np.zeros((nmap, maxlength), dtype=np.int)
    def count(self, data):
        hashedValue = jhash(data)
        indexAlpha = hashedValue % self.nmap
        ix = hashedValue / self.nmap
        ix = bin(ix)[2:][::-1]
        indexBeta = ix.find("1") #find index of lsb
        if self.bitmap[indexAlpha, indexBeta] == 0:
            self.bitmap[indexAlpha, indexBeta] = 1
    def getCardinality(self):
        sumIx = 0
        for row in range(self.nmap):
            sumIx += np.where(self.bitmap[row, :] == 0)[0][0]
        A = sumIx / self.nmap
        cardinality = self.nmap * (2 ** A)/ MAGIC_CONST
        return cardinality
If you are running this in Python 2, the division used to calculate A is an integer division, so A is truncated to a whole number.
If this is the case, you could try changing:
A = sumIx / self.nmap
to
A = float(sumIx) / self.nmap
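To illustrate how much truncation can matter, here is a small sketch that evaluates the same formula with floor division and with true division. The function name pcsa_estimate and the input numbers are made up, and phi = 0.77351 (the correction constant from the Flajolet-Martin paper) is assumed to be what MAGIC_CONST stands for in the question.

```python
def pcsa_estimate(sum_ix, nmap, phi=0.77351, integer_division=False):
    """Evaluate nmap * 2**A / phi with A = sum_ix / nmap."""
    if integer_division:
        mean_ix = sum_ix // nmap   # Python 2 behaviour of int / int
    else:
        mean_ix = sum_ix / nmap    # true division keeps the fraction
    return nmap * (2 ** mean_ix) / phi


# Made-up numbers: 64 bitmaps whose first-zero positions sum to 70.
truncated = pcsa_estimate(70, 64, integer_division=True)
exact = pcsa_estimate(70, 64)
print(truncated, exact)
```

Because A sits in the exponent, dropping its fractional part can only lower the estimate, by a factor of up to 2. That matches the direction of the undercount described in the question, although it does not prove this is the only cause.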

Python NetCDF IOError: netcdf: NetCDF: Invalid dimension ID or name

I am writing a Python script to handle NetCDF files, but I am having trouble creating variables. Here is the relevant part of the code:
stepnumber_var = ofl.createVariable("step_number", "i",("step_number",))
stepnumber_var.standard_name = "step_number"
atomNumber_var = ofl.createVariable("atom_number", "i", ("atom_number",))
atomNumber_var.standard_name = "atom__number"
But it gives me this error:
Traceback (most recent call last):
File "sub_avg.py", line 141, in <module>
atomNumber_var = ofl.createVariable("atom_number", "i", ("atom_number",))
IOError: netcdf: NetCDF: Invalid dimension ID or name
My question is: why is the first variable created without any problem, while the second one fails?
Thanks
Here is the full code:
from array import array
import os
import sys
import math
import string as st
import numpy as N
from Scientific.IO.NetCDF import NetCDFFile as S
if len(sys.argv) < 2:
    sys.exit( "No input file found. \nPlease provide NetCDF trajectory input file" )
#######################
## Open NetCDF file ###
#######################
infl = S(sys.argv[1], 'r')
file = sys.argv[1]
title,ext = file.split(".")
#for v in infl.variables: # Lists the variables in file
# print(v)
#################################################################################
# Variable "configurations" has the structure [step_number, atom_number, x y z] #
#################################################################################
varShape = infl.variables['configuration'].shape # This gets the shape of the variable, i.e. the dimension in terms of elements
nSteps = varShape[0]
nAtoms = varShape[1]
coordX_atom = N.zeros((nSteps,nAtoms))
coordY_atom = N.zeros((nSteps,nAtoms))
coordZ_atom = N.zeros((nSteps,nAtoms))
sumX = [0] * nAtoms
sumY = [0] * nAtoms
sumZ = [0] * nAtoms
######################################################
# 1) Calculate the average structure fron trajectory #
######################################################
for i in range(0, 3):
    for j in range(0, 3):
        coordX_atom[i][j] = infl.variables["configuration"][i,j,0]
        coordY_atom[i][j] = infl.variables["configuration"][i,j,1]
        coordZ_atom[i][j] = infl.variables["configuration"][i,j,2]
        sumX[j] = sumX[j] + coordX_atom[i][j]
        sumY[j] = sumY[j] + coordY_atom[i][j]
        sumZ[j] = sumZ[j] + coordZ_atom[i][j]
avgX = [0] * nAtoms
avgY = [0] * nAtoms
avgZ = [0] * nAtoms
for j in range(0, 3):
    avgX[j] = sumX[j]/nSteps
    avgY[j] = sumY[j]/nSteps
    avgZ[j] = sumZ[j]/nSteps
##############################################################
# 2) Subtract average structure to each atom and for each frame #
##############################################################
for i in range(0, 3):
    for j in range(0, 3):
        coordX_atom[i][j] = infl.variables["configuration"][i,j,0] - avgX[j]
        coordY_atom[i][j] = infl.variables["configuration"][i,j,1] - avgY[j]
        coordZ_atom[i][j] = infl.variables["configuration"][i,j,2] - avgZ[j]
#######################################
# 3) Write new NetCDF trajectory file #
#######################################
ofl = S(title + "_subAVG.nc", "a")
############################################################
# Get information of variables contained in the NetCDF input file
#############################################################
i = 0
for v in infl.variables:
    varNames = [v for v in infl.variables]
    i += 1
#############################################
# Respectively get, elements names in variable, dimension of elements and lenght of the array variableNames
##############################################
for v in infl.variables["box_size"].dimensions:
    boxSizeNames = [v for v in infl.variables["box_size"].dimensions]
for v in infl.variables["box_size"].shape:
    boxSizeShape = [v for v in infl.variables["box_size"].shape]
boxSizeLenght = boxSizeNames.__len__()
print boxSizeLenght
for v in infl.variables["step"].dimensions:
    stepNames = [v for v in infl.variables["step"].dimensions]
for v in infl.variables["step"].shape:
    stepShape = [v for v in infl.variables["box_size"].shape]
stepLenght = stepNames.__len__()
print stepLenght
for v in infl.variables["configuration"].dimensions:
    configurationNames = [v for v in infl.variables["configuration"].dimensions]
for v in infl.variables["configuration"].shape:
    configurationShape = [v for v in infl.variables["configuration"].shape]
configurationLenght = configurationNames.__len__()
print configurationLenght
for v in infl.variables["description"].dimensions:
    descriptionNames = [v for v in infl.variables["description"].dimensions]
for v in infl.variables["description"].shape:
    descriptionShape = [v for v in infl.variables["description"].shape]
descriptionLenght = descriptionNames.__len__()
print descriptionLenght
for v in infl.variables["time"].dimensions:
    timeNames = [v for v in infl.variables["time"].dimensions]
for v in infl.variables["time"].shape:
    timeShape = [v for v in infl.variables["time"].shape]
timeLenght = timeNames.__len__()
print timeLenght
#Get Box size
xBox = infl.variables["box_size"][0,0]
yBox = infl.variables["box_size"][0,1]
zBox = infl.variables["box_size"][0,2]
# Get description lenght
description_lenghtLenght = infl.variables["description"][:]
############################################################
# Create Dimensions
############################################################
stepnumber_var = ofl.createVariable("step_number", "i",("step_number",))
stepnumber_var.standard_name = "step_number"
atomNumber_var = ofl.createVariable("atom_number", "i", ("atom_number",))
atomNumber_var.standard_name = "atom__number"
#
#xyz_var = ofl.createVariable("xyz", "f",("xyz",))
#xyz_var.units = "nanometers"
#xyz_var.standard_name = "xyz"
#
#configuration_var = ofl.createVariable("configuration", "f", ("step_number", "atom_number", "xyz"))
#configuration_var.units = "nanometers"
#configuration_var.standard_name = "configuration"
#
#print configuration_var.shape
#step_var = ofl.createVariable("box_size_lenght", 3)
#configuration_var = ofl.createVariable("atom_number", nAtoms)
#description_var = ofl.createVariable("xyz", 3)
#time_var = ofl.createVariable(description_lenght, description_lenghtLenght)
#
#a = infl.variables["step_number"].dimensions.keys()
#print a
Thanks!
This may be a case of a library trying to be "helpful" (see the end of my post for details, but I can't confirm it). To fix this, you should explicitly create dimensions for atom_number and step_number, by using the following before you create the variables (assuming I am understanding nSteps and nAtoms correctly):
ofl.createDimension("step_number", nSteps)
ofl.createDimension("atom_number", nAtoms)
If you are new to netCDF, I would suggest looking at either the netcdf4-python package,
http://unidata.github.io/netcdf4-python/
or the netCDF module found in scipy:
http://docs.scipy.org/doc/scipy/reference/io.html
What might be going on: it looks like the issue is that when you create the variable step_number, the library is trying to be helpful by creating a step_number dimension with unlimited length. However, you can only have one unlimited dimension in a netcdf-3 file, so the helpful "trick" does not work.
atomNumber_var.standard_name = "atom__number"
The atom__number has two "__" instead of one "_". I am not sure if this is your problem, but it may be something to look at.
I would also suggest making the steps that build your netCDF file clearer. I like to break them down into the 3 steps listed below, shown here with an example that uses ocean SST data. Note that your script has a section labelled "Create Dimensions" that never actually creates any; it is really a "create variables" section.
Create Dimensions
Create Variable
Fill the variable
from netCDF4 import Dataset
ncfile = Dataset('temp.nc','w')
# latdata, londata (1-D) and sst (2-D, latitude x longitude) are assumed to exist already
latsdim = len(latdata) #Set dimension lengths
lonsdim = len(londata)
###############
#Create Dimensions
###############
latdim = ncfile.createDimension('latitude', latsdim)
londim = ncfile.createDimension('longitude', lonsdim)
###############
#Create Variables
###############
# The variables use the dimensions created above
latitude = ncfile.createVariable('latitude','f8',('latitude',))
longitude = ncfile.createVariable('longitude','f8',('longitude',))
oceantemp = ncfile.createVariable('SST','f4',('latitude','longitude'),fill_value=-99999.0)
###############
#Fill Variables
###############
latitude[:] = latdata #lat data to fill in
longitude[:] = londata #lon data to fill in
oceantemp[:,:] = sst[:,:] #some variable previously calculated
ncfile.close()
I hope this is helpful.
