Saving data in Python

I am trying to save the data in a CSV format. The current and desired outputs are attached.
import numpy as np
import csv

r = np.linspace(0, 100e-6, 5)
A = np.array([[23.9496871440374 - 1336167292.56833*r**2],
              [21.986288555672 - 1373636804.80965*r**2]])

with open('Vel_Profiles.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerows((r, A))

Here is what worked for me to get your expected output:
import numpy as np
import csv
r = np.linspace(0, 100e-6, 5)
A = np.array([[23.9496871440374 - 1336167292.56833*r**2],
              [21.986288555672 - 1373636804.80965*r**2]])
out = np.vstack([r,A.squeeze()]).T
np.savetxt('Vel_Profiles.csv', out, delimiter=',', fmt=['%2.2E', '%.5f', '%.6f'])
output:
0.00E+00,23.94969,21.986289
2.50E-05,23.11458,21.127766
5.00E-05,20.60927,18.552197
7.50E-05,16.43375,14.259582
1.00E-04,10.58801,8.249921
UPDATE
Specifying the format of all columns in a more general way, as asked in the comments:
r = np.linspace(0, 100e-6, 5)
A = np.array([[23.9496871440374 - 1336167292.56833*r**2],
              [21.986288555672 - 1373636804.80965*r**2]])
out = np.vstack([r, A.squeeze()]).T
test = np.hstack([out, out, out])
print(test.shape)
# (5, 9)
# build a list of formats with the same length as test.shape[1]
# here we have the same three columns repeated three times next to
# each other, so just repeat the format list three times
my_format = ['%2.2E', '%.5f', '%.6f']
my_list_of_formats = my_format*3
# ['%2.2E', '%.5f', '%.6f', '%2.2E', '%.5f', '%.6f', '%2.2E', '%.5f', '%.6f']
# or like this:
my_list_of_formats = [my_format[i % 3] for i in range(test.shape[1])]
# ['%2.2E', '%.5f', '%.6f', '%2.2E', '%.5f', '%.6f', '%2.2E', '%.5f', '%.6f']
np.savetxt('Vel_Profiles.csv', test, delimiter=',', fmt=my_list_of_formats)
You can also pass just one format like '%2.2E' to fmt=; then every column gets formatted that way.
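For instance (a small illustration, not part of the original answer):
np.savetxt('Vel_Profiles.csv', test, delimiter=',', fmt='%2.2E')  # every column in scientific notation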

You don't need to use another library; you can use NumPy itself.
You can do this:
import numpy as np
np.savetxt('file_name.csv', your_array, delimiter=',')
If you need to stack your arrays first, you can do something like this:
array = np.vstack([r, A])
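Putting the two together for the arrays in the question (a sketch; it assumes A should be squeezed to drop its singleton dimension and transposed so each radius gets one row):
import numpy as np

r = np.linspace(0, 100e-6, 5)
A = np.array([[23.9496871440374 - 1336167292.56833*r**2],
              [21.986288555672 - 1373636804.80965*r**2]])

# stack r and the two velocity profiles as columns: shape (5, 3)
out = np.vstack([r, A.squeeze()]).T
np.savetxt('file_name.csv', out, delimiter=',')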
Check out the documentation here:
savetxt: https://numpy.org/doc/stable/reference/generated/numpy.savetxt.html
vstack: https://numpy.org/doc/stable/reference/generated/numpy.vstack.html

Related

How do I optimize a for loop for faster results in Python

I've written a piece of code to extract data from an HDF5 file and save it into a dataframe that I can export as .csv later. The final data frame effectively has 2.5 million rows and is taking a lot of time to execute.
Is there any way I can optimize this code so that it runs faster?
Current runtime is 7.98 minutes!
Ideally I would want to run this program for 48 files like these and expect a faster run time.
Link to source file: https://drive.google.com/file/d/1g2fpJHZmD5FflfB4s3BlAoiB5sGISKmg/view
import h5py
import numpy as np
import pandas as pd
#import geopandas as gpd
#%%
f = h5py.File('mer.h5', 'r')
for key in f.keys():
    #print(key) # names of the root-level objects in the HDF5 file - can be groups or datasets
    #print(type(f[key])) # get the object type: usually group or dataset
    ls = list(f.keys())
# Get the HDF5 group; key needs to be a group name from above
key = 'DHI'
#group = f['OBSERVATION_TIME']
#print("Group")
#print(group)
#for key in ls:
#    data = f.get(key)
#    dataset1 = np.array(data)
#    length = len(dataset1)
masterdf = pd.DataFrame()
data = f.get(key)
dataset1 = np.array(data)
#masterdf[key] = dataset1
X = f.get('X')
X_1 = pd.DataFrame(X)
Y = f.get('Y')
Y_1 = pd.DataFrame(Y)
#%%
data_df = pd.DataFrame(index=range(len(Y_1)), columns=range(len(X_1)))
for i in data_df.index:
    data_df.iloc[i] = dataset1[0][i]
#data_df.to_csv("test.csv")
#%%
final = pd.DataFrame(index=range(1616*1616), columns=['X', 'Y', 'GHI'])
k = 0
for y in range(len(Y_1)):
    for x in range(len(X_1[:-2])):  # X and Y ranges are not the same
        final.loc[k, 'X'] = X_1[0][x]
        final.loc[k, 'Y'] = Y_1[0][y]
        final.loc[k, 'GHI'] = data_df.iloc[y, x]
        k = k + 1
        # print(k)
We can optimize loops by vectorizing operations. Vectorized operations are one or two orders of magnitude faster than their pure-Python equivalents, especially in numerical computations. Vectorization is what NumPy provides: it is a library with efficient data structures designed to hold matrix data.
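As a toy illustration of that speed gap (not from the original answer; exact timings vary by machine):
import numpy as np
import timeit

a = np.arange(1_000_000)
# the same elementwise operation, as a Python loop and as a NumPy expression
loop_time = timeit.timeit(lambda: [x * 2 for x in a], number=10)
vec_time = timeit.timeit(lambda: a * 2, number=10)
print("loop: %.3fs  vectorized: %.3fs" % (loop_time, vec_time))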
Could you please try the following (file.h5 is your file):
import pandas as pd
import h5py

with h5py.File("file.h5", "r") as file:
    df_X = pd.DataFrame(file.get("X")[:-2], columns=["X"])
    df_Y = pd.DataFrame(file.get("Y"), columns=["Y"])
    DHI = file.get("DHI")[0][:, :-2].reshape(-1)

final = df_Y.merge(df_X, how="cross").assign(DHI=DHI)[["X", "Y", "DHI"]]
Some explanations:
First read the data with key X into a dataframe df_X with one column X, excluding the last 2 data points.
Then read the full data with key Y into a dataframe df_Y with one column Y.
Then get the data with key DHI and take the first element [0] (there are no more): the result is a NumPy array with 2 dimensions, a matrix. Now remove the last two columns ([:, :-2]) and reshape the matrix into a 1-dimensional array, in the order you are looking for (order="C" is the default). The result is the column DHI of your final dataframe.
Finally take the cross product of df_Y and df_X (y is your outer dimension in the loop) via .merge with how="cross", add the DHI column, and rearrange the columns in the order you want.
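To see what the cross merge does, here is a tiny self-contained illustration with made-up values (how="cross" needs pandas >= 1.2):
import pandas as pd

df_X = pd.DataFrame({"X": [1, 2]})
df_Y = pd.DataFrame({"Y": [10, 20, 30]})
# every Y paired with every X: 6 rows, with Y varying slowest (the outer loop)
print(df_Y.merge(df_X, how="cross"))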

If loop and saving the boolean results

I have 3 different CSV files. Each has 70 rows and 430 columns. I want to create and save a boolean result file (with the same shape) that puts True where the condition is met.
One file contains temperature data, one wind data, and one RH data. The condition is: [(t>=35) & (w>=7) & (rh<30)]
I want the saved file to be a 0-and-1 file that shows in which cell the condition has been met (1) or not (0). The problem is that the results are not correct! I really appreciate your help.
import numpy as np
import pandas as pd

dft = pd.read_csv("D:/practicet.csv", header=None)
dfrh = pd.read_csv("D:/practicerh.csv", header=None)
dfw = pd.read_csv("D:/practicew.csv", header=None)

result_set = []
for i in range(0, dft.shape[1]):
    t = dft[i]
    w = dfw[i]
    rh = dfrh[i]
    result = np.empty(dft.shape, dtype=bool)
    result = result[(t >= 35) & (w >= 7) & (rh < 30)]
    result_set = np.append(result_set, result)
np.savetxt("D:/result.csv", result_set, delimiter=",")
You can generate boolean Series by testing each column of the frame. You then simply concatenate the columns back into a DataFrame object.
import pandas as pd
data = pd.read_csv('data.csv')
bool_temp = data['temperature'] > 22
bool_week = data['week'] > 5
bool_humid = data['humidity'] > 50
data_tmp = [bool_humid, bool_temp, bool_week]
df = pd.concat(data_tmp, axis=1, keys=[s.name for s in data_tmp])
The dummy data:
temperature,week,humidity
25,3,80
29,4,60
22,4,20
20,5,30
2,7,80
30,9,80
are written to data.csv
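If you want the 0-and-1 file the question asks for, the boolean frame can be cast to integers and saved (one possible follow-up, not part of the original answer):
df.astype(int).to_csv('result.csv', index=False)  # True/False become 1/0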
Give this a shot.
This is a proxy problem for yours, with random arrays in [0, 100] of the same shape as your CSVs.
import numpy as np

dft = np.random.rand(70, 430)*100.
dfrh = np.random.rand(70, 430)*100.
dfw = np.random.rand(70, 430)*100.

result_set = []
for i in range(dft.shape[0]):
    result = ((dft[i] >= 35) & (dfw[i] >= 7) & (dfrh[i] < 30))
    result_set.append(result)
np.savetxt("result.csv", result_set, delimiter=",")
The critical problem with your code is:
result=np.empty(dft.shape,dtype=bool)
result=result[(t>=35) & (w>=7) & (rh<30)]
This does not do what you think it does. You (i) initialize an empty array (which holds garbage values), and then (ii) apply your boolean mask to it. So now you have one garbage array masked into another garbage array according to your specified boolean rules.
As an example...
In [5]: a = np.array([1,2,3,4,5])
In [6]: mask = np.array([True,False,False,False,True])
In [7]: a[mask]
Out[7]: array([1, 5])

Saving the results as LUT

import numpy as np
import itertools

x1 = np.linspace(0.1, 3.5, 3)
x2 = np.arange(5, 24, 3)
x3 = np.arange(50.9, 91.5, 3)

def calculate(x1, x2, x3):
    res = x1**5 + x2*x1 + x3
    return res

products = list(itertools.product(x1, x2, x3))
results = [calculate(a, b, c) for a, b, c in products]
I have to save the results as a lookup table for future use.
In my real case, the file is going to be very large, around 1 GB, so I need a faster way of reading that file later.
What is the best way and file format to save it for future access?
outputs = np.column_stack((products,results))
np.savetxt('test.out',outputs, delimiter = ',')
My future use is as follows:
#given_x1,given_x2,given_x3 = 0.2, 8, 60
#open the look up table
#read the neighbouring two values for the given values
#linearly interpolate between two values for the results.
I'd construct a 1-D array from the list comprehension and save this out:
In [37]:
a = np.array([calculate(a,b,c) for a,b,c in products])
np.savetxt(r'c:\data\lut.txt', a)
In [39]:
b = np.loadtxt(r'c:\data\lut.txt')
np.all(a==b)
Out[39]:
True
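As for the interpolation step sketched in the question, one possible approach (my assumption, not part of the original answer) is to reshape the flat LUT back onto the 3-D grid and use scipy's RegularGridInterpolator, which interpolates linearly by default:
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# the same grids the LUT was built from
x1 = np.linspace(0.1, 3.5, 3)
x2 = np.arange(5, 24, 3)
x3 = np.arange(50.9, 91.5, 3)

b = np.loadtxt(r'c:\data\lut.txt')
# itertools.product varies x3 fastest, which matches NumPy's C order
table = b.reshape(len(x1), len(x2), len(x3))

interp = RegularGridInterpolator((x1, x2, x3), table)
print(interp([[0.2, 8, 60]]))  # value at given_x1, given_x2, given_x3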

Numpy: with dtype=None I cannot slice columns, and with dtype=object I cannot set dtype.names

I am running Python 2.6. I have the following example where I am trying to concatenate the date and time string columns from a csv file. Based on the dtype I set (None vs object), I am seeing some differences in behavior that I cannot explain; see Questions 1 and 2 at the end of the post. The exception returned is not too descriptive, and the dtype documentation doesn't mention any specific behavior to expect when dtype is set to object.
Here is the snippet:
#! /usr/bin/python
import numpy as np
# simulate a csv file
from StringIO import StringIO
data = StringIO("""
Title
Date,Time,Speed
,,(m/s)
2012-04-01,00:10, 85
2012-04-02,00:20, 86
2012-04-03,00:30, 87
""".strip())
# (Fail) case 1: dtype=None, slicing a column fails
next(data) # eat away the title line
header = [item.strip() for item in next(data).split(',')] # get the headers
arr1 = np.genfromtxt(data, dtype=None, delimiter=',',skiprows=1)# skiprows=1 for the row with units
arr1.dtype.names = header # assign the header to names
# so we can do y=arr['Speed']
y1 = arr1['Speed']
# Q1 IndexError: invalid index
#a1 = arr1[:,0]
#print a1
# EDIT1:
print "arr1.shape "
print arr1.shape # (3,)
# Fails as expected TypeError: unsupported operand type(s) for +: 'numpy.ndarray' and 'numpy.ndarray'
# z1 = arr1['Date'] + arr1['Time']
# This can be workaround by specifying dtype=object, which leads to case 2
data.seek(0) # resets
# (Fail) case 2: dtype=object assign header fails
next(data) # eat away the title line
header = [item.strip() for item in next(data).split(',')] # get the headers
arr2 = np.genfromtxt(data, dtype=object, delimiter=',',skiprows=1) # skiprows=1 for the row with units
# Q2 ValueError: there are no fields defined
#arr2.dtype.names = header # assign the header to names. so we can use it to do indexing
# ie y=arr['Speed']
# y2 = arr['Date'] + arr['Time'] # column headings were assigned previously by arr.dtype.names = header
data.seek(0) # resets
# (Good) case 3: dtype=object but don't assign headers
next(data) # eat away the title line
header = [item.strip() for item in next(data).split(',')] # get the headers
arr3 = np.genfromtxt(data, dtype=object, delimiter=',',skiprows=1) # skiprows=1 for the row with units
y3 = arr3[:,0] + arr3[:,1] # slice the columns
print y3
# case 4: dtype=None, all data are ints, array dimension 2-D
# simulate a csv file
from StringIO import StringIO
data2 = StringIO("""
Title
Date,Time,Speed
,,(m/s)
45,46,85
12,13,86
50,46,87
""".strip())
next(data2) # eat away the title line
header = [item.strip() for item in next(data2).split(',')] # get the headers
arr4 = np.genfromtxt(data2, dtype=None, delimiter=',',skiprows=1)# skiprows=1 for the row with units
#arr4.dtype.names = header # Value error
print "arr4.shape "
print arr4.shape # (3,3)
data2.seek(0) # resets
Question 1: At comment Q1, why can I not slice a column, when dtype=None?
This could be avoided if
a) arr1 = np.genfromtxt(...) were initialized with dtype=object as in case 3, or
b) arr1.dtype.names = ... were commented out to avoid the ValueError as in case 2.
Question 2: At comment Q2, why can I not set the dtype.names when dtype=object?
EDIT1:
Added a case 4 that shows the array would be 2-D if the values in the simulated csv file were all ints instead. One can slice a column, but assigning dtype.names would still fail.
Also updated the term 'splice' to 'slice'.
Question 1
This is indexing, not 'splicing', and you can't index into the columns of data for exactly the same reason I explained to you before in my answer to Question 7 here. Look at arr1.shape - it is (3,), i.e. arr1 is 1D, not 2D. There are no columns for you to index into.
Now look at the shape of arr2 - you'll see that it's (3,3). Why is this? If you do specify dtype=desired_type, np.genfromtxt will treat every delimited part of your input string the same (i.e. as desired_type), and it will give you an ordinary, non-structured numpy array back.
I'm not quite sure what you wanted to do with this line:
z1 = arr1['Date'] + arr1['Time']
Did you mean to concatenate the date and time strings together like this: '2012-04-01 00:10'? You could do it like this:
z1 = [d + ' ' + t for d,t in zip(arr1['Date'],arr1['Time'])]
It depends what you want to do with the output (this will give you a list of strings, not a numpy array).
I should point out that, as of version 1.7, Numpy has core array types that support datetime functionality. This would allow you to do much more useful things like computing time deltas etc.
dts = np.array(z1,dtype=np.datetime64)
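For example, once you have a datetime64 array, time deltas come essentially for free (a small illustration, not from the original answer):
deltas = np.diff(dts)  # timedelta64 differences between consecutive timestamps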
Edit:
If you want to plot timeseries data, you can use matplotlib.dates.strpdate2num to convert your strings to matplotlib datenums, then use plot_date():
from matplotlib import dates
from matplotlib import pyplot as pp
# convert date and time strings to matplotlib datenums
dtconv = dates.strpdate2num('%Y-%m-%d%H:%M')
datenums = [dtconv(d+t) for d,t in zip(arr1['Date'],arr1['Time'])]
# use plot_date to plot timeseries
pp.plot_date(datenums,arr1['Speed'],'-ob')
You should also take a look at Pandas, which has some nice tools for visualising timeseries data.
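For example (a minimal sketch using the same values as the simulated csv; ordinary pandas API, not from the original answer):
import pandas as pd

s = pd.Series([85, 86, 87],
              index=pd.to_datetime(['2012-04-01 00:10',
                                    '2012-04-02 00:20',
                                    '2012-04-03 00:30']))
s.plot()  # pandas formats the datetime axis automatically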
Question 2
You can't set the names of arr2 because it is not a structured array (see above).
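As a small illustration with toy data: dtype.names only exists for structured dtypes, which is exactly what the "there are no fields defined" error is complaining about:
import numpy as np

plain = np.zeros((3, 3), dtype=object)
print(plain.dtype.names)  # None - no fields to name

structured = np.zeros(3, dtype=[('Date', 'O'), ('Time', 'O'), ('Speed', 'O')])
print(structured.dtype.names)  # ('Date', 'Time', 'Speed')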

Python iteration with array

I guess it is a simple question. I am doing a simple while iteration and want to save the data within a data array so I can simply plot it.
tr = 25     # sec
fr = 50     # Hz
dt = 0.002  # 2 ms
df = fr*(dt/tr)
i = 0
f = 0
data = 0
while f < 50:
    i = i + 1
    f = ramp(fr, f, df)
    data[i] = f
plot(data)
How do I correctly define the data array? How do I save the results in the array?
One possibility:
data = []
while f < 50:
    f = ramp(fr, f, df)
    data.append(f)
Here, i is no longer needed.
You could initialize a list like this:
data=[]
then you could add data like this:
data.append(f)
For plotting, matplotlib is a good choice; it is easy to install and use.
import pylab
pylab.plot(data)
pylab.show()
He needs "i" b/c it starts from 1 in the collection. For your code to work use:
data = {} # this is dictionary and not list
