Concatenating and sorting - python

import pandas as pd
import numpy as np

cols = [2, 4, 6, 8, 10, 12, 14, 16, 18]  # select the columns I want to work with
df = pd.read_csv('mywork.csv')
df1 = df.iloc[:, cols]
b = np.array(df1)
b
Output:
array([['WV5 6NY', 'RE4 9VU', 'BU4 N90', 'TU3 5RE', 'NE5 4F'],
['SA8 7TA', 'BA31 0PO', 'DE3 2FP', 'LR98 4TS', nan],
['MN0 4NU', 'RF5 5FG', 'WA3 0MN', 'EA15 8RE', 'BE1 4RE'],
['SB7 0ET', 'SA7 0SB', 'BT7 6NS', 'TA9 0LP', nan]], dtype=object)
a = np.concatenate(b) #concatenated to get a single array, this worked well
print(np.sort(a)) # to sort alphabetically
This gave me the error **AxisError: axis -1 is out of bounds for array of dimension 0**.
I also tried a.sort(), which gives **TypeError: '<' not supported between instances of 'float' and 'str'**.
The above is a CSV file containing lists of postcodes for different people; the work involves travelling from one postcode to another for different jobs, and a person could travel to 5 postcodes a day. Using a NumPy array, I got a list of lists of postcodes.
I then concatenated the lists of postcodes to get one big list, which I want to sort alphabetically, but it keeps giving me errors.
Please, can someone help?

As mentioned in the comments, this error is caused by comparing nan to a string. To fix it, remove the nan values before sorting:
Convert the array to a list
Remove the nan values
Sort
import numpy as np
from numpy import nan

# Get the data (in your scenario, this would be achieved by reading from your file)
b = np.array([['WV5 6NY', 'RE4 9VU', 'BU4 N90', 'TU3 5RE', 'NE5 4F'],
              ['SA8 7TA', 'BA31 0PO', 'DE3 2FP', 'LR98 4TS', nan],
              ['MN0 4NU', 'RF5 5FG', 'WA3 0MN', 'EA15 8RE', 'BE1 4RE'],
              ['SB7 0ET', 'SA7 0SB', 'BT7 6NS', 'TA9 0LP', nan]], dtype=object)
# Flatten
a = np.concatenate(b)
# Remove nan values - in an object array they remain floats, so keep only the strings
a = np.array([x for x in a if isinstance(x, str)])
# Finally, sort
a.sort()
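If pandas is already in play, here is a minimal alternative sketch, assuming df1 holds the selected postcode columns as in the question: stack() drops NaN entries by default, so no manual filtering is needed.
# stack() skips NaN, leaving only the real postcodes
postcodes = sorted(df1.stack().tolist())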

Related

How to combine h5 data numpy arrays based on date in filename?

I have hundreds of .h5 files with dates in their filenames (e.g. ...20221017...). For each file, I have extracted some parameters into a numpy array of the format
[[param_1a, param_2a...param_5a],
...
[param_1x, param_2x,...param_5x]]
which represents the data of interest. I want to group the data by month, so instead of having (e.g.) 30 arrays for one month, I have 1 array which represents the average of the 30 arrays. How can I do this?
This is the code I have so far; filename represents a txt file of file names.
def combine_months(filename):
    fin = open(filename, 'r')
    next_name = fin.readline()
    while (next_name != ""):
        year = next_name[6:10]
        month = next_name[11:13]
        date = month + '\\' + year
        # not sure where to go from here
        next_name = fin.readline()  # read the next name so the loop advances
    fin.close()
An example of what I hope to achieve: say array_1, array_2, array_3 are numpy arrays representing data from different h5 files with the same month in their filenames.
array_1 = [[ 1 4 10]
[ 2 5 11]
[3 6 12]]
array_2 = [[ 1 2 5]
[ 2 2 3]
[ 3 6 12]]
array_3 = [[ 2 4 10]
[ 3 2 3]
[ 4 6 12]]
I want the result to look like:
2022_04_data = [[1,3,7.5]
[2, 2, 6.5]
[3,4,7.5]
[4,6,12]]
Note that the first number of each row represents an ID, so I need to group those data together based on the first number as well.
Ok, here is the beginning of an answer. (I suspect you may have more questions as you work through the details.)
There are several ways to get the filenames. You could put them in a file, but it's easier (and better IMHO) to use the glob.iglob() function. The two examples below show how to: 1) open each file, 2) read the data from the data dataset into an array, and 3) append the array to a list. The first example has the filenames in a list. The second uses the glob.iglob() function to get the filenames. (You could also use glob.glob() to create a list of names.)
Method 1: read filenames from list
import h5py
arr_list = []
for h5file in ['20221001.h5', '20221002.h5', '20221003.h5']:
    with h5py.File(h5file, 'r') as h5f:
        arr = h5f['data'][()]
        #print(arr)
        arr_list.append(arr)
Method 2: use glob.iglob() to get files using wildcard names
import h5py
from glob import iglob
arr_list = []
for h5file in iglob('202210*.h5'):
    with h5py.File(h5file, 'r') as h5f:
        print(h5f.keys())  # to get the dataset names from the keys
        arr = h5f['data'][()]
        #print(arr)
        arr_list.append(arr)
After you have read the datasets into arrays, you iterate over the list, do your calculations and create a new array from the results. Code below shows how to get the shape and dtype.
for arr in arr_list:
    # do something with the data based on column 0 value
    print(arr.shape, arr.dtype)
Code below shows a way to sum rows with matching column 0 values. Without more details it's hard to show exactly how to do this. It reads all column 0 values into a sorted list, then uses that list to size the count and sum arrays, and as an index to the proper row.
import numpy as np
# first create a list from column 0 values, then sort
row_value_list = []
for arr in arr_list:
    col_vals = arr[:, 0]
    for val in col_vals:
        if val not in row_value_list:
            row_value_list.append(val)
# Sort list of column IDs
row_value_list.sort()
# get length of the ID list to size the cnt and sum arrays
a0 = len(row_value_list)
# get shape and dtype from 1st array, assume constant for all
a1 = arr_list[0].shape[1]
dt = arr_list[0].dtype
arr_cnt = np.zeros(shape=(a0, a1), dtype=dt)
arr_cnt[:, 0] = row_value_list
arr_sum = np.zeros(shape=(a0, a1), dtype=dt)
arr_sum[:, 0] = row_value_list
for arr in arr_list:
    for row in arr:
        idx = row_value_list.index(row[0])
        arr_cnt[idx, 1:] += 1
        arr_sum[idx, 1:] += row[1:]
print('Count Array\n', arr_cnt)
print('Sum Array\n', arr_sum)
arr_ave = arr_sum / arr_cnt
arr_ave[:, 0] = row_value_list
print('Average Array\n', arr_ave)
Here is an alternate way to create row_value_list from a set. It's simpler because sets don't retain duplicate values, so you don't have to check for existing values when adding them to row_value_set.
# first create a set from column 0 values, then create a sorted list
row_value_set = set()
for arr in arr_list:
    col_vals = set(arr[:, 0])
    row_value_set = row_value_set.union(col_vals)
row_value_list = sorted(row_value_set)
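For comparison, here is a minimal sketch of the same group-and-average using pandas (an extra dependency, and it assumes all arrays stack cleanly with the group ID in column 0):
import numpy as np
import pandas as pd

# Stack all per-file arrays into one table; column 0 is the group ID.
df = pd.DataFrame(np.vstack(arr_list))
# groupby(0) collects rows by ID; mean() averages the remaining columns.
arr_ave = df.groupby(0, as_index=False).mean().to_numpy()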
This is a new, updated answer that addresses the comment/request about calculating the median. (It also calculates the mean, and can be easily extended to calculate other statistics from the masked array.)
As noted in my comment on Nov 4 2022, "starting from my first answer quickly got complicated and hard to follow". This process is similar to but different from the first answer. It uses glob to get a list of filenames (instead of iglob). Instead of loading the H5 datasets into a list of arrays, it loads all of the data into a single array (data is "stacked" on the 0-axis). I don't think this increases the memory footprint. However, memory could be a problem if you load a lot of very large datasets for analysis.
Summary of the procedure:
Use glob.glob() to load filenames to a list based on a wildcard
Allocate an array to hold all the data (arr_all) based on the # of files and the size of 1 dataset
Loop through all H5 files, loading data to arr_all
Create a sorted list of unique group IDs (column 0 values)
Allocate arrays to hold the mean/median (arr_mean and arr_median) based on the # of unique row IDs and the # of columns in arr_all
Loop over the values in the ID list, then:
a. Create a masked array (mask) where the column 0 value equals the loop value
b. Broadcast mask to match arr_all's shape, then apply it to create ma_arr_all
c. Loop over the columns of ma_arr_all, compress to get the unmasked values, then calculate the mean and median and save them
Code below:
import h5py
from glob import glob
import numpy as np

# use glob.glob() to get list of files using wildcard names
file_list = glob('202210*.h5')
with h5py.File(file_list[0], 'r') as h5f:
    a0, a1 = h5f['data'].shape
    # allocate array to hold values from all datasets
    arr_all = np.zeros(shape=(len(file_list)*a0, a1), dtype=h5f['data'].dtype)
start, stop = 0, a0
for i, h5file in enumerate(file_list):
    with h5py.File(h5file, 'r') as h5f:
        arr_all[start:stop, :] = h5f['data'][()]
    start += a0
    stop += a0
# Create a set from column 0 values, and use it to create a sorted list
row_value_list = sorted(set(arr_all[:, 0]))
arr_mean = np.zeros(shape=(len(row_value_list), arr_all.shape[1]))
arr_median = np.zeros(shape=(len(row_value_list), arr_all.shape[1]))
col_0 = arr_all[:, 0:1]
for i, row_val in enumerate(row_value_list):
    row_mask = np.where(col_0 == row_val, False, True)  # True mask value ignores data
    all_mask = np.broadcast_to(row_mask, arr_all.shape)
    ma_arr_all = np.ma.masked_array(arr_all, mask=all_mask)
    for j in range(ma_arr_all.shape[1]):
        masked_col = ma_arr_all[:, j:j+1].compressed()
        arr_mean[i:i+1, j:j+1] = np.mean(masked_col)
        arr_median[i:i+1, j:j+1] = np.median(masked_col)
print('Mean values:\n', arr_mean)
print('Median values:\n', arr_median)
Added Nov 22, 2022:
The method above uses np.broadcast_to(), introduced in NumPy 1.10. Here is an alternate method for prior versions. (It replaces the entire for i, row_val loop.) It should be more memory efficient; I haven't profiled to verify, but the arrays all_mask and ma_arr_all are not created.
for i, row_val in enumerate(row_value_list):
    row_mask = np.where(col_0 == row_val, False, True)  # True mask value ignores data
    for j in range(arr_all.shape[1]):
        masked_col = np.ma.masked_array(arr_all[:, j:j+1], mask=row_mask).compressed()
        arr_mean[i:i+1, j:j+1] = np.mean(masked_col)
        arr_median[i:i+1, j:j+1] = np.median(masked_col)
I ran with the values provided by the OP. The output is below and is the same for both methods:
Mean values:
[[ 1. 3. 7.5 ]
[ 2. 3.66666667 8. ]
[ 3. 4.66666667 9. ]
[ 4. 6. 12. ]]
Median values:
[[ 1. 3. 7.5]
[ 2. 4. 10. ]
[ 3. 6. 12. ]
[ 4. 6. 12. ]]

Vector product of a list of array, array by array

I have a list of arrays, and each array contains 5 softmax arrays. As output, I want a final list with only one element per array, which is the product of all 5 arrays.
vals_pred = [res.iloc[i]['y_pred'][0:5] for i in range(len(res))
             if len(res.iloc[i]['y_pred']) > lookahead]
res is a dataframe. This is an example of what is found within the list.
array([[1.34866089e-01, 5.28018773e-02, 2.23537564e-01, 4.62821350e-02,
8.76934379e-02, 4.26145524e-01, 5.53494925e-03, 2.31384877e-02],
[1.10163569e-01, 7.80740231e-02, 5.52961051e-01, 4.57449956e-03,
4.64441329e-02, 2.07768157e-01, 1.45530776e-05, 4.66483916e-18],
[1.82191223e-01, 1.10042050e-01, 3.27700675e-01, 3.38860601e-03,
1.07456036e-01, 2.69037366e-01, 1.84074364e-04, 3.74613562e-13],
[1.80145595e-02, 7.61333853e-03, 8.86637151e-01, 1.02691650e-02,
2.46689599e-02, 5.27242124e-02, 7.26214203e-05, 2.64242381e-19],
[5.62842265e-02, 1.42695876e-02, 8.42117667e-01, 1.62272118e-02,
4.17451970e-02, 2.88727339e-02, 4.83323034e-04, 2.92075066e-13]],
dtype=float32), array([[5.4129714e-04, 3.6672730e-02, 4.8940146e-06, 8.3479950e-05,
6.5760143e-02, 1.6968355e-02, 8.7981069e-01, 1.5837120e-04],
[2.1086331e-07, 1.4350067e-03, 1.6849227e-16, 8.4671612e-08,
2.4794264e-07, 3.8307374e-03, 9.9473369e-01, 6.4219920e-24],
[1.7740911e-04, 2.1856705e-02, 1.5427456e-11, 1.2592359e-08,
2.0797092e-03, 3.5571402e-01, 6.2017220e-01, 3.4836909e-20],
[1.7687974e-07, 4.4844073e-05, 9.6608031e-14, 4.5008356e-08,
1.0638499e-03, 1.0105613e-04, 9.9878997e-01, 4.8578437e-24],
[6.6033276e-11, 5.1952496e-09, 5.8398927e-21, 4.5752773e-13,
5.2611317e-06, 2.4885111e-07, 9.9999452e-01, 2.5868782e-10]],
dtype=float32)
The desired output is a list similarly composed of 2 arrays (in this example case), but where each array is the element-by-element product of the 5 arrays contained in the list element. Next I need to get the argmax.
Any ideas on how I could do this? I can use any Python library
Assuming your list is named array_list, you can use a list comprehension and numpy.prod:
import numpy as np
[np.prod(a) for a in array_list]
output: [1.2981583896975622e-116, 2.2208998715549213e-267]
Or maybe you want the product only along the first axis of each array:
[np.prod(a, axis=0) for a in array_list]
output:
[array([2.74459685e-06, 4.92834563e-08, 3.02441299e-02, 1.19552067e-10,
4.50698513e-07, 3.62616474e-05, 5.20432043e-19, 3.12070056e-63]),
array([2.36512231e-31, 2.67974413e-19, 7.17724310e-66, 1.83289538e-39,
1.89791218e-19, 5.81467373e-16, 5.42100925e-01, 4.45251202e-80])]
And to get the argmax:
[np.argmax(np.prod(a, axis=0)) for a in array_list]
output: [2, 6]
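One caveat worth noting: multiplying many probabilities underflows quickly (note the 1e-116 and 1e-267 magnitudes above, with float32 inputs). If only the argmax is needed, summing log-probabilities is a numerically stabler equivalent; a minimal sketch, where the epsilon guarding log(0) is an illustrative choice:
import numpy as np

# log is monotonic, so the argmax of the summed logs equals the argmax
# of the product, but the sum does not underflow the way the product does.
eps = 1e-30  # hypothetical guard against log(0)
[np.argmax(np.sum(np.log(a + eps), axis=0)) for a in array_list]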

Reading a complex csv file into a numpy array

I have a csv file like this:
rgb-28.ppm
rgb-29.ppm (214.75142, 45.618622, 319.0, 152.53371, 0.91839749)
rgb-30.ppm (235.09999, 47.999729, 319.0, 147.49998, 0.88473213) (281.05219, 54.649971, 319.0, 108.78567, 0.61637461)
On each line, there is the name of a file, and there are one or more tuples belonging to that file.
I want to read this csv file as follows.
On each row, the first column will contain the name of the file. The next columns will contain the tuples. If there is no tuple, the column will be empty; if there is a tuple, the tuple will occupy the column.
And when I try to read this file with
from numpy import genfromtxt
contours = genfromtxt(path, delimiter=' ')
I get the following error:
Line #36098 (got 6 columns instead of 1)
How can I read this kind of file into an array?
Thanks,
Try this. The idea is: from the input file, find the line which has the maximum number of columns. Use this to construct a dynamic list of column names, and pass that list as the column names to Pandas. As mentioned in the comments, numpy is not efficient at handling missing values. Once the data is in a DataFrame, use the columns C1, C2, etc. to remove the unwanted characters, and then str.split to convert the numbers into a list of numbers.
import pandas as pd
l_max_col_nos = 0
l_f = open('data.csv', 'r')
for each_line in l_f:
    l_split = len(each_line.split('\t'))
    if l_split > l_max_col_nos:
        l_max_col_nos = l_split
l_f.close()
l_column_list = []
for each_i in xrange(l_max_col_nos):
    l_column_list.append('C' + str(each_i))
print l_column_list
l_df = pd.read_csv('data.csv', sep='\t', header=None, names=l_column_list)
print l_df
print l_df['C1'].str.replace(')', '').str.replace('(', '').str.replace('\s', '').str.split(',')
Output
['C0', 'C1', 'C2']
C0 C1 \
0 rgb-28.ppm NaN
1 rgb-29.ppm (214.75142, 45.618622, 319.0, 152.53371, 0.918...
2 rgb-30.ppm (235.09999, 47.999729, 319.0, 147.49998, 0.884...
C2
0 NaN
1 NaN
2 (281.05219, 54.649971, 319.0, 108.78567, 0.616...
0 NaN
1 [214.75142, 45.618622, 319.0, 152.53371, 0.918...
2 [235.09999, 47.999729, 319.0, 147.49998, 0.884...
dtype: object
When you use genfromtxt(path, delimiter=' '), it reads each line and splits it on the delimiter. Without further specifications it takes the number of split strings in the first line as the expected number for all lines.
The first line has just one string - so it expects one column all the way down.
The 2nd line has that string, but it also has those 5 number strings. Yes, they are wrapped in () and separated by ,; but they are also separated by the space. genfromtxt does not handle ().
And then the 3rd line has 2 of those () blocks.
The csv.reader can handle quoted strings, but I don't think it can treat () as "...".
Your parsing goal does not fit an array or table. It sounds like you expect a variable number of 'columns' per row, and that each such 'column' will contain this 5-number tuple. That does not compute. Yes, you could force that structure into an object dtype array, but the fit is bad.
However, if each tuple contains 5 numbers, I can see creating a dictionary with the filename as key, and each tuple of that line as a row in a 5-column 2d array. But regardless of the target structure, you need to figure out a way of parsing one line, such as the one with 2 tuples. How do you split it on the spaces without splitting on the ', '? Once you have the () groups, you can strip off the () and split on ', '. The re (regular expression) module might be the best tool for this (I'll try to develop that).
=======================
A possible parsing of your example
Start with a line parsing function:
import re
import numpy as np

def foo(aline):
    alist = re.split(' \(', aline)
    key = alist[0]
    rest = alist[1:]
    rest = [r.strip().strip(')') for r in rest]
    if len(rest) > 0:
        rest = np.array([[float(i) for i in r.split(',')] for r in rest])
    else:
        rest = None
    return [key, rest]
Your sample text - copy-n-paste and split into lines
In [310]: txt="""rgb-28.ppm
rgb-29.ppm (214.75142, 45.618622, 319.0, 152.53371, 0.91839749)
rgb-30.ppm (235.09999, 47.999729, 319.0, 147.49998, 0.88473213) (281.05219, 54.649971, 319.0, 108.78567, 0.61637461)"""
In [311]: txt=txt.splitlines()
In [312]: txt
Out[312]:
['rgb-28.ppm',
'rgb-29.ppm (214.75142, 45.618622, 319.0, 152.53371, 0.91839749)',
'rgb-30.ppm (235.09999, 47.999729, 319.0, 147.49998, 0.88473213) (281.05219, 54.649971, 319.0, 108.78567, 0.61637461)']
Now pass each line through the function:
In [313]: data = []
In [314]: for line in txt:
   .....:     data.append(foo(line))
In [315]: data
Out[315]:
[['rgb-28.ppm', None],
['rgb-29.ppm',
array([[ 214.75142 , 45.618622 , 319. , 152.53371 ,
0.91839749]])],
['rgb-30.ppm',
array([[ 235.09999 , 47.999729 , 319. , 147.49998 ,
0.88473213],
[ 281.05219 , 54.649971 , 319. , 108.78567 ,
0.61637461]])]]
In [316]: data[2][1].shape
Out[316]: (2, 5)
The last line contains the data in a 2x5 array. The first has None.
Splitting on ' (' seems to be enough to handle the larger groups. It leaves a trailing ')' on the groups, but that's easy to strip off. The rest is to split each group into substrings, and convert those to floats.
As written the function has no error checking or robustness, but it is a start. The data might not exactly in the form you want, but it can be reworked as needed.
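As an alternative sketch (not a drop-in replacement for foo() above), re.findall() can pull the parenthesized groups out directly; parse_line and its regex are illustrative assumptions:
import re
import numpy as np

def parse_line(aline):
    # Capture the contents of every "(...)" group on the line.
    groups = re.findall(r'\(([^)]*)\)', aline)
    # The filename is the first space-delimited token.
    key = aline.split(' ', 1)[0]
    if groups:
        rest = np.array([[float(x) for x in g.split(',')] for g in groups])
    else:
        rest = None
    return [key, rest]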

Numpy set dtype=None, cannot splice columns and set dtype=object cannot set dtype.names

I am running Python 2.6. I have the following example where I am trying to concatenate the date and time string columns from a csv file. Based on the dtype I set (None vs object), I am seeing some differences in behavior that I cannot explain; see Questions 1 and 2 at the end of the post. The exception returned is not too descriptive, and the dtype documentation doesn't mention any specific behavior to expect when dtype is set to object.
Here is the snippet:
#! /usr/bin/python
import numpy as np
# simulate a csv file
from StringIO import StringIO
data = StringIO("""
Title
Date,Time,Speed
,,(m/s)
2012-04-01,00:10, 85
2012-04-02,00:20, 86
2012-04-03,00:30, 87
""".strip())
# (Fail) case 1: dtype=None splicing a column fails
next(data) # eat away the title line
header = [item.strip() for item in next(data).split(',')] # get the headers
arr1 = np.genfromtxt(data, dtype=None, delimiter=',',skiprows=1)# skiprows=1 for the row with units
arr1.dtype.names = header # assign the header to names
# so we can do y=arr['Speed']
y1 = arr1['Speed']
# Q1 IndexError: invalid index
#a1 = arr1[:,0]
#print a1
# EDIT1:
print "arr1.shape "
print arr1.shape # (3,)
# Fails as expected TypeError: unsupported operand type(s) for +: 'numpy.ndarray' and 'numpy.ndarray'
# z1 = arr1['Date'] + arr1['Time']
# This can be workaround by specifying dtype=object, which leads to case 2
data.seek(0) # resets
# (Fail) case 2: dtype=object assign header fails
next(data) # eat away the title line
header = [item.strip() for item in next(data).split(',')] # get the headers
arr2 = np.genfromtxt(data, dtype=object, delimiter=',',skiprows=1) # skiprows=1 for the row with units
# Q2 ValueError: there are no fields define
#arr2.dtype.names = header # assign the header to names. so we can use it to do indexing
# ie y=arr['Speed']
# y2 = arr['Date'] + arr['Time'] # column headings were assigned previously by arr.dtype.names = header
data.seek(0) # resets
# (Good) case 3: dtype=object but don't assign headers
next(data) # eat away the title line
header = [item.strip() for item in next(data).split(',')] # get the headers
arr3 = np.genfromtxt(data, dtype=object, delimiter=',',skiprows=1) # skiprows=1 for the row with units
y3 = arr3[:,0] + arr3[:,1] # slice the columns
print y3
# case 4: dtype=None, all data are ints, array dimension 2-D
# simulate a csv file
from StringIO import StringIO
data2 = StringIO("""
Title
Date,Time,Speed
,,(m/s)
45,46,85
12,13,86
50,46,87
""".strip())
next(data2) # eat away the title line
header = [item.strip() for item in next(data2).split(',')] # get the headers
arr4 = np.genfromtxt(data2, dtype=None, delimiter=',',skiprows=1)# skiprows=1 for the row with units
#arr4.dtype.names = header # Value error
print "arr4.shape "
print arr4.shape # (3,3)
data2.seek(0) # resets
Question 1: At comment Q1, why can I not slice a column when dtype=None?
This could be avoided by:
a) initializing arr1 = np.genfromtxt(...) with dtype=object, as in case 3, or
b) commenting out arr1.dtype.names = ... to avoid the ValueError, as in case 2.
Question 2: At comment Q2, why can I not set the dtype.names when dtype=object?
EDIT1:
Added a case 4 that shows the array would be 2-D if the values in the simulated csv file are all ints instead. One can slice a column, but assigning dtype.names would still fail.
Updated the term 'splice' to 'slice'.
Question 1
This is indexing, not 'splicing', and you can't index into the columns of data for exactly the same reason I explained to you before in my answer to Question 7 here. Look at arr1.shape - it is (3,), i.e. arr1 is 1D, not 2D. There are no columns for you to index into.
Now look at the shape of arr2 - you'll see that it's (3,3). Why is this? If you do specify dtype=desired_type, np.genfromtxt will treat every delimited part of your input string the same (i.e. as desired_type), and it will give you an ordinary, non-structured numpy array back.
I'm not quite sure what you wanted to do with this line:
z1 = arr1['Date'] + arr1['Time']
Did you mean to concatenate the date and time strings together like this: '2012-04-01 00:10'? You could do it like this:
z1 = [d + ' ' + t for d,t in zip(arr1['Date'],arr1['Time'])]
It depends what you want to do with the output (this will give you a list of strings, not a numpy array).
I should point out that, as of version 1.7, Numpy has core array types that support datetime functionality. This would allow you to do much more useful things like computing time deltas etc.
dts = np.array(z1,dtype=np.datetime64)
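A minimal illustration of the delta arithmetic this enables (hypothetical values):
dts = np.array(['2012-04-01T00:10', '2012-04-02T00:20'], dtype='datetime64[m]')
delta = dts[1] - dts[0]  # numpy.timedelta64(1450, 'm'), i.e. 24h10m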
Edit:
If you want to plot timeseries data, you can use matplotlib.dates.strpdate2num to convert your strings to matplotlib datenums, then use plot_date():
from matplotlib import dates
from matplotlib import pyplot as pp
# convert date and time strings to matplotlib datenums
dtconv = dates.strpdate2num('%Y-%m-%d%H:%M')
datenums = [dtconv(d+t) for d,t in zip(arr1['Date'],arr1['Time'])]
# use plot_date to plot timeseries
pp.plot_date(datenums,arr1['Speed'],'-ob')
You should also take a look at Pandas, which has some nice tools for visualising timeseries data.
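For instance, a minimal pandas sketch under the same csv layout as above (the column-combining form of parse_dates shown here is the one available in pandas versions of this era):
import pandas as pd

# skiprows=[0, 2] drops the title line and the units row;
# parse_dates=[['Date', 'Time']] combines the two columns into a
# single datetime column named 'Date_Time'.
df = pd.read_csv('data.csv', skiprows=[0, 2], parse_dates=[['Date', 'Time']])
df = df.set_index('Date_Time')
df['Speed'].plot()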
Question 2
You can't set the names of arr2 because it is not a structured array (see above).
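A minimal illustration of the difference (made-up values; dtype.names exists only on structured arrays):
import numpy as np

structured = np.array([(1, 2.0)], dtype=[('a', int), ('b', float)])
print(structured.dtype.names)  # ('a', 'b') - a structured array has named fields

plain = np.array([[1, 2], [3, 4]], dtype=object)
print(plain.dtype.names)       # None - no fields, so assigning names raises ValueError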

Python - Subtract 2 lists from each other that are in a dictionary

So I have to take information from a large data file that has 14 properties (columns). Using this information I was able to combine the data into a list of floats. I have to analyse it and was required to normalise the values: (value - minvalue)/(maxvalue - minvalue). I then put the original value list into a dictionary with the normalised values so that they are still related.
I now need to take 2 different keys of this dictionary, which correspond to 2 different lists of normalised values, and subtract them from each other for further analysis.
Sample of my dictionary info:
(3.0, 13.73, 4.36, 2.26, 22.5, 88.0, 1.28, 0.47, 0.52, 1.15, 6.62, 0.78, 1.75, 520.0):
[0.7105263157894738, 0.7154150197628459, 0.4812834224598929, 0.6134020618556701, 0.1956521739130435, 0.10344827586206898, 0.02742616033755273, 0.7358490566037735, 0.2334384858044164, 0.4556313993174061, 0.2439024390243903, 0.1758241758241758, 0.17261055634807418]
there are over 100 similar entries
Using Python 3 and no libraries apart from math.
Any help is appreciated, but if you feel there is an easier way to do this please let me know.
Edit: I cannot use any imported libraries.
I'll add in some of my code, but I have had to snip a large portion out as it is much too large to include in this post.
for line in temp_file:
    line = line.strip()  # remove white space
    line_list = line.split(",")  # split the line into components separated by commas
    temp_list2 = []
    for item in line_list[0:]:
        value_float = float(item)  # make values currently of type string into type float
        temp_list2.append(value_float)
    tuple_list = tuple(temp_list2)  # make each line into a separate tuple, then collect
    data_list.append(tuple_list)    # these tuples in a master list data_list
prop_elts = [(x[1:]) for x in data_list]
------snip-------- (this is just where I defined each of the columns and then calculated the normalised values)
i = 0
while i < len(data_list):
    all_props_templist = [prop1_list[i], prop2_list[i], prop3_list[i], prop4_list[i], prop5_list[i], prop6_list[i], prop7_list[i], prop8_list[i], prop9_list[i], prop10_list[i], prop11_list[i], prop12_list[i], prop13_list[i]]
    all_properties.append(all_props_templist)
    i = i + 1
my_data_Dictionary = {el1: el2 for el1, el2 in zip(data_list, all_properties)}
If data is your dict,
[a-b for a, b in zip(data[key1], data[key2])]
is a list whose elements are the difference between the corresponding elements in data[key1] and data[key2].
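A quick usage sketch with made-up keys and values (your actual keys are the 14-value tuples):
data = {'k1': [0.5, 0.2, 0.9], 'k2': [0.1, 0.4, 0.3]}
diff = [a - b for a, b in zip(data['k1'], data['k2'])]
print(diff)  # [0.4, -0.2, 0.6000000000000001]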
PS. When you see numbered variable names:
all_props_templist = [prop1_list[i],prop2_list[i],prop3_list[i],prop4_list[i],prop5_list[i],prop6_list[i],prop7_list[i],prop8_list[i],prop9_list[i],prop10_list[i],prop11_list[i],prop12_list[i],prop13_list[i]]
know that the situation is crying out for a list with an index in place of the number:
all_props_templist = [prop_list[j][i] for j in range(13)]
