I have hundreds of .h5 files with dates in their filename (e.g. ...20221017...). For each file, I have extracted some parameters into a numpy array of the format
[[param_1a, param_2a...param_5a],
...
[param_1x, param_2x,...param_5x]]
which represents data of interest. I want to group the data by month, so instead of having (e.g.) 30 arrays for one month, I have 1 array which represents the average of the 30 arrays. How can I do this?
This is the code I have so far; filename is a text file listing the file names.
def combine_months(filename):
    fin = open(filename, 'r')
    next_name = fin.readline()
    while (next_name != ""):
        year = next_name[6:10]
        month = next_name[11:13]
        date = month+'\\'+year
        #not sure where to go from here
        next_name = fin.readline()  # advance to the next name so the loop terminates
    fin.close()
An example of what I hope to achieve: say array_1, array_2, and array_3 are numpy arrays representing data from different h5 files whose filenames contain the same month.
array_1 = [[ 1 4 10]
[ 2 5 11]
[3 6 12]]
array_2 = [[ 1 2 5]
[ 2 2 3]
[ 3 6 12]]
array_3 = [[ 2 4 10]
[ 3 2 3]
[ 4 6 12]]
I want the result to look like:
2022_04_data = [[1, 3, 7.5]
                [2, 2, 6.5]
                [3, 4, 7.5]
                [4, 6, 12]]
Note that the first number of each row represents an ID, so I need to group those data together based on the first number as well.
OK, here is the beginning of an answer. (I suspect you may have more questions as you work through the details.)
There are several ways to get the filenames. You could put them in a file, but it's easier (and better, IMHO) to use the glob.iglob() function. The two examples below show how to: 1) open each file, 2) read the data from the 'data' dataset into an array, and 3) append the array to a list. The first example takes the file names from a list; the second uses the glob.iglob() function to get the filenames. (You could also use glob.glob() to create a list of names.)
Method 1: read filenames from list
import h5py

arr_list = []
for h5file in ['20221001.h5', '20221002.h5', '20221003.h5']:
    with h5py.File(h5file, 'r') as h5f:
        arr = h5f['data'][()]
        #print(arr)
        arr_list.append(arr)
Method 2: use glob.iglob() to get files using wildcard names
import h5py
from glob import iglob

arr_list = []
for h5file in iglob('202210*.h5'):
    with h5py.File(h5file, 'r') as h5f:
        print(h5f.keys())  # to get the dataset names from the keys
        arr = h5f['data'][()]
        #print(arr)
        arr_list.append(arr)
After you have read the datasets into arrays, you iterate over the list, do your calculations and create a new array from the results. Code below shows how to get the shape and dtype.
for arr in arr_list:
    # do something with the data based on column 0 value
    print(arr.shape, arr.dtype)
The code below shows one way to sum rows with matching column 0 values. Without more details it's hard to show exactly how to do this. It reads all column 0 values into a sorted list, then uses that list both to size the count and sum arrays and as an index to the proper row.
import numpy as np

# first create a list from column 0 values, then sort
row_value_list = []
for arr in arr_list:
    col_vals = arr[:, 0]
    for val in col_vals:
        if val not in row_value_list:
            row_value_list.append(val)

# Sort list of column IDs
row_value_list.sort()

# get length of the ID list to size the cnt and sum arrays
a0 = len(row_value_list)
# get shape and dtype from 1st array; assume constant for all
a1 = arr_list[0].shape[1]
dt = arr_list[0].dtype

arr_cnt = np.zeros(shape=(a0, a1), dtype=dt)
arr_cnt[:, 0] = row_value_list
arr_sum = np.zeros(shape=(a0, a1), dtype=dt)
arr_sum[:, 0] = row_value_list

for arr in arr_list:
    for row in arr:
        idx = row_value_list.index(row[0])
        arr_cnt[idx, 1:] += 1
        arr_sum[idx, 1:] += row[1:]

print('Count Array\n', arr_cnt)
print('Sum Array\n', arr_sum)

arr_ave = arr_sum/arr_cnt
arr_ave[:, 0] = row_value_list
print('Average Array\n', arr_ave)
Here is an alternate way to create row_value_list from a set. It's simpler because sets don't retain duplicate values, so you don't have to check for existing values when adding them to row_value_set.
# first create a set from column 0 values, then create a sorted list
row_value_set = set()
for arr in arr_list:
    col_vals = set(arr[:, 0])
    row_value_set = row_value_set.union(col_vals)
row_value_list = sorted(row_value_set)
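As an aside, np.unique() can build the same sorted list in one call. A minimal sketch, assuming arr_list is the list of arrays read above:

import numpy as np

# np.unique() returns the sorted unique values in one step;
# .tolist() keeps row_value_list a plain list so .index() still works
row_value_list = np.unique(np.concatenate([arr[:, 0] for arr in arr_list])).tolist()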
This is a new, updated answer that addresses the comment/request about calculating the median. (It also calculates the mean, and can be easily extended to calculate other statistics from the masked array.)
As noted in my comment on Nov 4 2022, "starting from my first answer quickly got complicated and hard to follow". This process is similar to, but different from, the first answer. It uses glob.glob() to get a list of filenames (instead of iglob()). Instead of loading the H5 datasets into a list of arrays, it loads all of the data into a single array (data is "stacked" on the 0-axis). I don't think this increases the memory footprint. However, memory could be a problem if you load a lot of very large datasets for analysis.
Summary of the procedure:
1. Use glob.glob() to load the filenames into a list based on a wildcard.
2. Allocate an array (arr_all) to hold all the data, based on the number of files and the size of 1 dataset.
3. Loop through all H5 files, loading the data into arr_all.
4. Create a sorted list of unique group IDs (column 0 values).
5. Allocate arrays to hold the mean/median (arr_mean and arr_median), based on the number of unique row IDs and the number of columns in arr_all.
6. Loop over the values in the ID list, then:
   a. Create a masked array (mask) where the column 0 value equals the loop value.
   b. Broadcast mask to match the arr_all shape, then apply it to create ma_arr_all.
   c. Loop over the columns of ma_arr_all, compress to get the unmasked values, then calculate the mean and median and save them.
Code below:
import h5py
from glob import glob
import numpy as np

# use glob.glob() to get a list of files using wildcard names
file_list = glob('202210*.h5')

with h5py.File(file_list[0], 'r') as h5f:
    a0, a1 = h5f['data'].shape
    # allocate an array to hold the values from all datasets
    arr_all = np.zeros(shape=(len(file_list)*a0, a1), dtype=h5f['data'].dtype)

start, stop = 0, a0
for h5file in file_list:
    with h5py.File(h5file, 'r') as h5f:
        arr_all[start:stop, :] = h5f['data'][()]
    start += a0
    stop += a0

# Create a set from the column 0 values, and use it to create a sorted list
row_value_list = sorted(set(arr_all[:, 0]))

arr_mean = np.zeros(shape=(len(row_value_list), arr_all.shape[1]))
arr_median = np.zeros(shape=(len(row_value_list), arr_all.shape[1]))

col_0 = arr_all[:, 0:1]
for i, row_val in enumerate(row_value_list):
    row_mask = np.where(col_0 == row_val, False, True)  # a True mask value ignores that data
    all_mask = np.broadcast_to(row_mask, arr_all.shape)
    ma_arr_all = np.ma.masked_array(arr_all, mask=all_mask)
    for j in range(ma_arr_all.shape[1]):
        masked_col = ma_arr_all[:, j:j+1].compressed()
        arr_mean[i, j] = np.mean(masked_col)
        arr_median[i, j] = np.median(masked_col)

print('Mean values:\n', arr_mean)
print('Median values:\n', arr_median)
Added Nov 22, 2022:
The method above uses np.broadcast_to(), introduced in NumPy 1.10. Here is an alternate method for prior versions. (It replaces the entire for i, row_val loop.) It should be more memory efficient; I haven't profiled to verify, but the arrays all_mask and ma_arr_all are not created.
for i, row_val in enumerate(row_value_list):
    row_mask = np.where(col_0 == row_val, False, True)  # a True mask value ignores that data
    for j in range(arr_all.shape[1]):
        masked_col = np.ma.masked_array(arr_all[:, j:j+1], mask=row_mask).compressed()
        arr_mean[i, j] = np.mean(masked_col)
        arr_median[i, j] = np.median(masked_col)
I ran this with the values provided by the OP. The output is below and is the same for both methods:
Mean values:
[[ 1. 3. 7.5 ]
[ 2. 3.66666667 8. ]
[ 3. 4.66666667 9. ]
[ 4. 6. 12. ]]
Median values:
[[ 1. 3. 7.5]
[ 2. 4. 10. ]
[ 3. 6. 12. ]
[ 4. 6. 12. ]]
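To get back to the original goal of one result per month, the same procedure can be wrapped in a loop over month wildcards. This is only a sketch, assuming the filenames begin with YYYYMM and reusing the stacking and statistics code above for each month's file_list:

monthly_means = {}
for month in range(1, 13):
    file_list = glob(f'2022{month:02d}*.h5')  # e.g. '202210*.h5'
    if not file_list:
        continue  # no files for this month
    # ...run the stacking and mean/median calculation above on file_list...
    monthly_means[f'2022_{month:02d}'] = arr_mean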
import pandas as pd
import numpy as np

cols = [2,4,6,8,10,12,14,16,18]  # selected the columns I want to work with
df = pd.read_csv('mywork.csv')
df1 = df.iloc[:, cols]
b = np.array(df1)
b
Outcome:
array([['WV5 6NY', 'RE4 9VU', 'BU4 N90', 'TU3 5RE', 'NE5 4F'],
['SA8 7TA', 'BA31 0PO', 'DE3 2FP', 'LR98 4TS', nan],
['MN0 4NU', 'RF5 5FG', 'WA3 0MN', 'EA15 8RE', 'BE1 4RE'],
['SB7 0ET', 'SA7 0SB', 'BT7 6NS', 'TA9 0LP', nan]], dtype=object)
a = np.concatenate(b) #concatenated to get a single array, this worked well
print(np.sort(a)) # to sort alphabetically
It gave me the error AxisError: axis -1 is out of bounds for array of dimension 0.
I also tried using a.sort(), and it gives me TypeError: '<' not supported between instances of 'float' and 'str'.
The CSV file contains lists of postcodes for different people; the work involves travelling from one postcode to another for different jobs, and a person can travel to 5 postcodes a day. Using a numpy array, I got a list of lists of postcodes.
I then concatenated the lists of postcodes to get one big list, which I want to sort in alphabetical order, but it keeps giving me errors.
Please, can someone help?
As mentioned in the comments, this error is caused by comparing nan (a float) to a string. To fix this, remove the nan values before sorting:
Flatten the array
Remove the nan values
Sort
import numpy as np

# Get the data (in your scenario, this would be achieved by reading from your file)
b = np.array([['WV5 6NY', 'RE4 9VU', 'BU4 N90', 'TU3 5RE', 'NE5 4F'],
              ['SA8 7TA', 'BA31 0PO', 'DE3 2FP', 'LR98 4TS', np.nan],
              ['MN0 4NU', 'RF5 5FG', 'WA3 0MN', 'EA15 8RE', 'BE1 4RE'],
              ['SB7 0ET', 'SA7 0SB', 'BT7 6NS', 'TA9 0LP', np.nan]], dtype=object)

# Flatten
a = np.concatenate(b)

# Remove nan values - they stay as floats in the object array,
# so keep only the string entries
a = np.array([x for x in a if isinstance(x, str)])

# Finally, sort
a.sort()
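Alternatively, since the data starts out in a pandas DataFrame (df1 in the question), you can mask the NaNs with pandas before sorting. A sketch, assuming df1 holds the selected postcode columns:

import numpy as np
import pandas as pd

flat = df1.to_numpy().ravel()      # flatten the selected columns to 1-D
a = np.sort(flat[pd.notna(flat)])  # keep only the non-NaN entries, then sort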
I have two arrays of wavelengths and fluxes:
print(wave)
array([3872.06965186, 3872.07965186, 3872.08965186, ..., 6942.82937577,
6942.83937577, 6942.84937577])
print(flux)
array([0.00278573, 0.00270251, 0.00324619, ..., 0.0014955 , 0.0015335 ,
0.00155908], dtype=float32)
import pandas

df1 = pandas.DataFrame({'wave': [wave]})
df2 = pandas.DataFrame({'flux': [flux]})
df = pandas.concat([df1, df2], axis=1)
Output:
df
wave flux
0 [3872.0696518611626, 3872.079651860265, 3872.0... [0.0027857346, 0.0027025137, 0.0032461907, 0.0...
However, the output I want is:
wave flux
0 3872.0696518611626 0.0027857346
3872.079651860265 0.0027025137
3872.0... 0.0032461907
You were passing a list of lists as the wave and flux columns of the dataframe. When you pass a list of lists, each inner list becomes the value of a single row.
Try passing wave and flux directly instead.
df1 = pd.DataFrame({'wave':wave})
df2 = pd.DataFrame({'flux':flux})
df = pd.concat([df1,df2],axis=1)
print(df)
wave flux
0 3872.069652 0.002786
1 3872.079652 0.002703
2 3872.089652 0.003246
3 6942.829376 0.001495
4 6942.839376 0.001534
5 6942.849376 0.001559
Here is the simplest solution, if flux and wave are the same size:
df = pd.DataFrame({'wave':wave, "flux":flux})
You just want:
pandas.DataFrame(dict(wave=wave, flux=flux))
You've basically done:
pandas.DataFrame(dict(wave=[wave], flux=[flux]))
Note, when you pass a dict into a dataframe constructor, it treats it as a mapping of columns. Since your lists have a single item, it is being interpreted as a column with a single row.
Also, pd.concat is an unnecessary intermediate step.
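A quick way to see the difference is with a tiny hypothetical array:

import numpy as np
import pandas as pd

wave = np.array([3872.07, 3872.08])

print(pd.DataFrame({'wave': wave}))    # the array itself: one row per element
print(pd.DataFrame({'wave': [wave]}))  # a one-item list: one row holding the whole array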
I am trying to convert an array into a data frame. Below is the array:
Array =
[[array([[1.28327719, 0.86585652, 0.66163084],
[1.80828697, 1.24887998, 0.70235812],
[2.66044828, 1.35045788, 0.68215603],
[1.33065745, 1.4577204 , 0.75933679]]),
array([[1.28560483, 0.98658628, 0.67595305],
[1.73489671, 1.482433 , 0.71539607],
[1.29564167, 1.44918617, 0.74288636],
[2.43989581, 1.19118473, 0.64724577]]),
array([[1.27456576, 1.57166264, 0.854981 ],
[1.87001532, 1.57796163, 0.66740871],
[2.74672303, 1.29211241, 0.63669436],
[1.35104199, 0.84856452, 0.69297247]]),
array([[1.38296077, 0.91410661, 0.68056606],
[1.68320947, 1.42367818, 0.6659204 ],
[1.26965674, 1.55126723, 0.73756696],
[2.28880844, 1.27031044, 0.66577891]])],
[array([[1.72877886, 1.47973077, 0.68263402],
[2.28954891, 1.47387583, 0.72014133],
[1.25488202, 1.52890787, 0.72603781],
[1.36624708, 1.02959695, 0.72986648]]),
array([[1.78269554, 1.45968652, 0.65845671],
[1.29550163, 1.56630194, 0.80255398],
[1.33910381, 1.06375653, 0.73887124],
[2.99602633, 1.32380946, 0.71921367]]),
array([[1.32761929, 0.86097994, 0.61124086],
[1.36946819, 1.64210996, 0.66995842],
[1.29004191, 1.69784434, 1.17951575],
[2.29966943, 1.71713578, 0.62684209]]),
array([[1.50548041, 1.56619072, 0.64304549],
[2.38288223, 1.6995361 , 0.62946513],
[1.28558107, 0.78421077, 0.60182813],
[1.22364377, 1.6643322 , 1.00434432]])]]
pd.DataFrame(Array)
0 1 2 3
0 [[1.283277189792161, 0.8658565155306925, 0.661... [[1.2856048285071469, 0.9865862768448912, 0.67... [[1.274565759781191, 1.5716626415220676, 0.854... [[1.3829607676718185, 0.9141066092756043, 0.68...
1 [[1.7287788611203834, 1.479730766338439, 0.682... [[1.7826955386102115, 1.4596865242143404, 0.65... [[1.3276192850743926, 0.8609799418002607, 0.61... [[1.5054804147099767, 1.566190719572681, 0.643...
If I just put it in pd.DataFrame, it shows like this, so I tried to set the column names with this code:
pd.DataFrame({'Summar':Array[:,0],'Autumn':Array[:,1],'Winter':Array[:,2],'Spring':Array[:,3]})
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-91-5cb5f6e37746> in <module>()
----> 1 pd.DataFrame({'Summar':Array[:,0],'Autumn': Array[:,1],'Winter': Array[:,2],'Spring': Array[:,3]})
TypeError: list indices must be integers or slices, not tuple
But it shows the error above.
Your Array is a list of lists containing arrays. Try converting it to a single array using
Array = np.array(Array)
Then check the shape of your Array
print (Array.shape)
# (2, 4, 4, 3)
You should now be able to use slicing, as shown in the sketch below.
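One caveat: a slice such as Array[:, 0] still has shape (2, 4, 3), which is too many dimensions for a DataFrame column. A sketch of one workaround, wrapping each slice in list() so every cell holds a single 4x3 array (assuming numpy and pandas are imported as np and pd):

df = pd.DataFrame({'Summar': list(Array[:, 0]),
                   'Autumn': list(Array[:, 1]),
                   'Winter': list(Array[:, 2]),
                   'Spring': list(Array[:, 3])})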
I am trying to read data from a text file into a 2D array and then access each element of the data. I have tried a number of different approaches, but I am unable to access each element.
Here is an extract of the data:
GRID 16 7.5 5.961539 0.
GRID 17 7.5 11.92308 0.
GRID 18 7.5 17.88461 0.
GRID 19 7.5 23.84615 0.
GRID 20 7.5 29.80769 0.
GRID 21 7.5 35.76923 0.
GRID 22 7.5 41.73077 0.
GRID 23 7.5 47.69231 0.
GRID 24 7.5 53.65384 0.
Using the example here, Import nastran nodes deck in Python using numpy
It imports OK, but as a 1D array, and if I try ary[1,1], for example, I get the following response:
x[1,1]
Traceback (most recent call last):
File "<ipython-input-85-3e593ebbc211>", line 1, in <module>
x[1,1]
IndexError: too many indices for array
What I am hoping for is,
17
I have also tried the following code, and again I cannot index the result as a 2D array:
ary = []
with open(os.path.join(dir, fn)) as fi:
    for line in fi:
        if line.startswith('GRID'):
            ary.append([line[i:i+8] for i in range(0, len(line), 8)])
and I get the following error,
ary[1,2]
Traceback (most recent call last):
File "<ipython-input-83-9ac21a0619e9>", line 1, in <module>
ary[1,2]
TypeError: list indices must be integers or slices, not tuple
I am new to Python but I do have experience with VBA where I have used arrays a lot, but I am struggling to understand how to load an array and how to access the specific data.
You can use the genfromtxt function.
import numpy as np
ary = np.genfromtxt(file_name, dtype=None)
This will automatically load your file and detect the field types. Now you can access ary by row or by column, for example:
In: ary['f1']
Out: array([16, 17, 18, 19, 20, 21, 22, 23, 24])
In: ary[2]
Out: (b'GRID', 18, 7.5, 17.88461, 0.)
or by single element:
In: ary[3]['f1']
Out: 19
In: ary['f1'][3]
Out: 19
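If you specifically want 2D ary[1,1]-style indexing, one option is to stack the numeric fields into an ordinary 2D array. A sketch, assuming the auto-generated field names f1 through f4 shown above:

# column-stack the numeric fields into a plain 2D float array
num = np.column_stack([ary['f1'], ary['f2'], ary['f3'], ary['f4']])
print(num[1, 0])  # 17.0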
You are importing it from a text file? Can you save the text file as a csv? If so, you can easily load the data using pandas.
import pandas as pd
data = pd.read_csv(path_to_file)
Also, it might be that you just need to reshape your numpy array using something like:
x = x.reshape(-1, 4)
EDIT:
Since your format is based on fixed widths, you would want to use read_fwf in pandas instead of read_csv. Note that widths takes a list of field widths, one per column; the example below assumes five fields of 8 characters each and no header row.
x = pd.read_fwf(path_to_file, widths=[8]*5, header=None)
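With that, two-dimensional element access works the way you were after, for example (under the same fixed-width assumptions):

print(x.iloc[1, 1])        # 17
print(x.to_numpy()[1, 1])  # or go through a NumPy array first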