I have a matrix where ONE COLUMN is a CSV, like this:-
matrix = [
[1,"123,354,23"],
[2,"234,34,678"]
]
How do I normalize this, so I get one row for each value in the CSV column, i.e. so that it looks like this:-
[
[1, 123],
[1, 354],
[1, 23],
[2, 234],
[2, 34],
[2, 678]
]
I'm open to using numpy or pandas.
Note, in my specific case there are many other non-CSV columns too.
Thanks
In the example you gave, this will do it:
matrix = [
[1,"123,354,23"],
[2,"234,34,678"]
]
import ast
expanded = [
[ index, item ]
for index, rowString in matrix
for item in ast.literal_eval('[' + rowString + ']')
]
For your other "non-CSV" cases it depends on how they are formatted. Here, ast.literal_eval was a good tool for converting your apparent standard (comma-separated string) into a Python sequence that the variable item could iterate over. Other conversion approaches might be needed for other formats.
That produces a list of lists exactly as you specified. pandas is a good tool to use from there though. To then convert the list of lists into a pandas.DataFrame, you could say:
import pandas as pd
df = pd.DataFrame(expanded, columns=['index', 'item']).set_index(['index'])
print(df)
# prints:
#
#        item
# index
# 1       123
# 1       354
# 1        23
# 2       234
# 2        34
# 2       678
Or, if by "many other non-CSV columns" you just mean an arbitrary number of additional entries in each row of matrix, but that the last one is still always CSV text, then it could look like this:
matrix = [
[1, 3.1415927, 'Mary Poppins', "123,354,23"],
[2, 2.7182818, 'Genghis Khan', "234,34,678"]
]
import ast
expanded = [
row[:-1] + [item]
for row in matrix
for item in ast.literal_eval('[' + row[-1] + ']')
]
import pandas as pd
df = pd.DataFrame(expanded).set_index([0])
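Since the question is open to pandas, here is a minimal pandas-only sketch of the same expansion. It assumes the CSV text sits in a column I have called csv (all column names here are made up for illustration) and that every field is an integer. It splits the text into lists with str.split and then gives each list element its own row with DataFrame.explode, which needs pandas 0.25 or later.

```python
import pandas as pd

matrix = [
    [1, 3.1415927, "Mary Poppins", "123,354,23"],
    [2, 2.7182818, "Genghis Khan", "234,34,678"],
]

# Column names are hypothetical; only the position of the CSV column matters.
df = pd.DataFrame(matrix, columns=["id", "value", "name", "csv"])

expanded = (
    df.assign(csv=df["csv"].str.split(","))  # "123,354,23" -> ["123", "354", "23"]
      .explode("csv")                        # one row per list element
      .astype({"csv": int})                  # the split pieces are still strings
      .reset_index(drop=True)
)
print(expanded)
```

The other columns are repeated for every value automatically, so this should carry over to any number of non-CSV columns.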
If the matrix contains pairs of the form (first, text), you can write:
result = [
[first, int(rest)]
for first, text in matrix
for rest in text.split(",")]
Or, without a list comprehension:
result = []
for first, text in matrix:
    for rest in text.split(","):
        result.append([first, int(rest)])
I have hundreds of .h5 files with dates in their filename (e.g ...20221017...). For each file, I have extracted some parameters into a numpy array of the format
[[param_1a, param_2a...param_5a],
...
[param_1x, param_2x,...param_5x]]
which represents data of interest. I want to group the data by month, so instead of having (e.g) 30 arrays for one month, I have 1 array which represents the average of the 30 arrays. How can I do this?
This is the code I have so far, filename represents a txt file of file names.
def combine_months(filename):
    fin = open(filename, 'r')
    next_name = fin.readline()
    while (next_name != ""):
        year = next_name[6:10]
        month = next_name[11:13]
        date = month+'\\'+year
        #not sure where to go from here
    fin.close()
An example of what I hope to achieve is that say array_1, array_2, array_3 are numpy arrays representing data from different h5 files with the same month in the date of their filename.
array_1 = [[ 1 4 10]
[ 2 5 11]
[3 6 12]]
array_2 = [[ 1 2 5]
[ 2 2 3]
[ 3 6 12]]
array_3 = [[ 2 4 10]
[ 3 2 3]
[ 4 6 12]]
I want the result to look like:
2022_04_data = [[1,3,7.5]
[2, 2, 6.5]
[3,4,7.5]
[4,6,12]]
Note that the first number of each row represents an ID, so I need to group those data together based on the first number as well.
Ok, here is the beginning of an answer. (I suspect you may have more questions as you work thru the details.)
There are several ways to get the filenames. You could put them in a file, but it's easier (and better IMHO) to use the glob.iglob() function. There are 2 examples below that show how to: 1) open each file, 2) read the data from the data dataset into an array, and 3) append the array to a list. The first example has the file names in a list. The second uses the glob.iglob() function to get the filenames. (You could also use glob.glob() to create a list of names.)
Method 1: read filenames from list
import h5py
arr_list = []
for h5file in ['20221001.h5', '20221002.h5', '20221003.h5']:
    with h5py.File(h5file,'r') as h5f:
        arr = h5f['data'][()]
        #print(arr)
        arr_list.append(arr)
Method 2: use glob.iglob() to get files using wildcard names
import h5py
from glob import iglob
arr_list = []
for h5file in iglob('202210*.h5'):
    with h5py.File(h5file,'r') as h5f:
        print(h5f.keys()) # to get the dataset names from the keys
        arr = h5f['data'][()]
        #print(arr)
        arr_list.append(arr)
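To group by month, the filenames have to be bucketed by the date they contain. Here is a minimal sketch of one way to do that. It assumes the first run of 8 digits in each name is a YYYYMMDD date (as in ...20221017...); the helper name group_by_month and the sample filenames are made up for illustration.

```python
import re
from collections import defaultdict

def group_by_month(filenames):
    """Map 'YYYY_MM' keys to the filenames whose date falls in that month."""
    groups = defaultdict(list)
    for name in filenames:
        # Assumption: the first 8 consecutive digits are a YYYYMMDD date.
        match = re.search(r"(\d{4})(\d{2})\d{2}", name)
        if match:
            year, month = match.group(1), match.group(2)
            groups[f"{year}_{month}"].append(name)
    return dict(groups)

# Hypothetical filenames for illustration
names = ["run_20221017.h5", "run_20221101.h5", "run_20221018.h5"]
print(group_by_month(names))
```

Each list of names can then be fed to one of the reading loops above, one month at a time.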
After you have read the datasets into arrays, you iterate over the list, do your calculations and create a new array from the results. Code below shows how to get the shape and dtype.
for arr in arr_list:
    # do something with the data based on column 0 value
    print(arr.shape, arr.dtype)
Code below shows a way to sum and average rows with matching column 0 values. Without more details it's hard to show exactly how to do this. It collects all column 0 values into a sorted list, uses the length of that list to size the count and sum arrays, then uses the list to look up the proper row index for each ID.
import numpy as np
# first create a list from column 0 values, then sort
row_value_list = []
for arr in arr_list:
    col_vals = arr[:,0]
    for val in col_vals:
        if val not in row_value_list:
            row_value_list.append(val)
# Sort list of column IDs
row_value_list.sort()
# get length index list to create cnt and sum arrays
a0 = len(row_value_list)
# get shape and dtype from 1st array, assume constant for all
a1 = arr_list[0].shape[1]
dt = arr_list[0].dtype
arr_cnt = np.zeros(shape=(a0,a1),dtype=dt)
arr_cnt[:,0] = row_value_list
arr_sum = np.zeros(shape=(a0,a1),dtype=dt)
arr_sum[:,0] = row_value_list
for arr in arr_list:
    for row in arr:
        idx = row_value_list.index(row[0])
        arr_cnt[idx,1:] += 1
        arr_sum[idx,1:] += row[1:]
print('Count Array\n',arr_cnt)
print('Sum Array\n',arr_sum)
arr_ave = arr_sum/arr_cnt
# column 0 was divided by itself above, so restore the IDs
arr_ave[:,0] = row_value_list
print('Average Array\n',arr_ave)
Here is an alternate way to create row_value_list from a set. It's simpler because sets don't retain duplicate values, so you don't have to check for existing values when adding them to row_value_set.
# first create a set from column 0 values, then create a sorted list
row_value_set = set()
for arr in arr_list:
    col_vals = set(arr[:,0])
    row_value_set = row_value_set.union(col_vals)
row_value_list = sorted(row_value_set)
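A third option, sketched here as a suggestion, is to let NumPy do both the de-duplication and the sort with np.unique. The two small arrays below are stand-ins for the contents of arr_list.

```python
import numpy as np

# Stand-in data; in practice arr_list holds the arrays read from the H5 files.
arr_list = [
    np.array([[1, 4, 10], [2, 5, 11], [3, 6, 12]]),
    np.array([[2, 4, 10], [3, 2, 3], [4, 6, 12]]),
]

# Join the ID columns of all arrays; np.unique returns the sorted unique values.
all_ids = np.concatenate([arr[:, 0] for arr in arr_list])
row_value_list = np.unique(all_ids).tolist()
print(row_value_list)
```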
This is a new, updated answer that addresses the comment/request about calculating the median. (It also calculates the mean, and can be easily extended to calculate other statistics from the masked array.)
As noted in my comment on Nov 4 2022, "starting from my first answer quickly got complicated and hard to follow". This process is similar but different from the first answer. It uses glob to get a list of filenames (instead of iglob). Instead of loading the H5 datasets into a list of arrays, it loads all of the data into a single array (data is "stacked" on the 0-axis.). I don't think this increases the memory footprint. However, memory could be a problem if you load a lot of very large datasets for analysis.
Summary of the procedure:
1. Use glob.glob() to load filenames to a list based on a wildcard.
2. Allocate an array to hold all the data (arr_all) based on the # of files and size of 1 dataset.
3. Loop thru all H5 files, loading data to arr_all.
4. Create a sorted list of unique group IDs (column 0 values).
5. Allocate arrays to hold mean/median (arr_mean and arr_median) based on the # of unique row IDs and # of columns in arr_all.
6. Loop over values in ID list, then:
   a. Create masked array (mask) where column 0 value = loop value
   b. Broadcast mask to match arr_all shape then apply to create ma_arr_all
   c. Loop over columns of ma_arr_all, compress to get unmasked values, then calculate mean and median and save.
Code below:
import h5py
from glob import glob
import numpy as np
# use glob.glob() to get list of files using wildcard names
file_list = glob('202210*.h5')
with h5py.File(file_list[0],'r') as h5f:
    a0, a1 = h5f['data'].shape
    # allocate array to hold values from all datasets
    arr_all = np.zeros(shape=(len(file_list)*a0,a1), dtype=h5f['data'].dtype)
start, stop = 0, a0
for i, h5file in enumerate(file_list):
    with h5py.File(h5file,'r') as h5f:
        arr_all[start:stop,:] = h5f['data'][()]
    start += a0
    stop += a0
# Create a set from column 0 values, and use to create a sorted list
row_value_list = sorted(set(arr_all[:,0]))
arr_mean = np.zeros(shape=(len(row_value_list),arr_all.shape[1]))
arr_median = np.zeros(shape=(len(row_value_list),arr_all.shape[1]))
col_0 = arr_all[:,0:1]
for i, row_val in enumerate(row_value_list):
    row_mask = np.where(col_0==row_val, False, True ) # True mask value ignores data.
    all_mask= np.broadcast_to(row_mask, arr_all.shape)
    ma_arr_all = np.ma.masked_array(arr_all, mask=all_mask)
    for j in range(ma_arr_all.shape[1]):
        masked_col = ma_arr_all[:,j:j+1].compressed()
        arr_mean[i:i+1,j:j+1] = np.mean(masked_col)
        arr_median[i:i+1,j:j+1] = np.median(masked_col)
print('Mean values:\n',arr_mean)
print('Median values:\n',arr_median)
Added Nov 22, 2022:
Method above uses np.broadcast_to() introduced in NumPy 1.10. Here is an alternate method for prior versions. (Replaces the entire for i, row_val loop.) It should be more memory efficient. I haven't profiled to verify, but arrays all_mask and ma_arr_all are not created.
for i, row_val in enumerate(row_value_list):
    row_mask = np.where(col_0==row_val, False, True ) # True mask value ignores data.
    for j in range(arr_all.shape[1]):
        masked_col = np.ma.masked_array(arr_all[:,j:j+1], mask=row_mask).compressed()
        arr_mean[i:i+1,j:j+1] = np.mean(masked_col)
        arr_median[i:i+1,j:j+1] = np.median(masked_col)
I ran with values provided by OP. Output is provided below and is the same for both methods:
Mean values:
[[ 1. 3. 7.5 ]
[ 2. 3.66666667 8. ]
[ 3. 4.66666667 9. ]
[ 4. 6. 12. ]]
Median values:
[[ 1. 3. 7.5]
[ 2. 4. 10. ]
[ 3. 6. 12. ]
[ 4. 6. 12. ]]
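If pandas is available, the same per-ID statistics can also be had from a groupby. This is only a sketch, not a replacement for the approach above: it uses the three example arrays from the question, and the column names id, p1 and p2 are made up for illustration.

```python
import numpy as np
import pandas as pd

# The three example arrays from the question (same month, different files).
array_1 = np.array([[1, 4, 10], [2, 5, 11], [3, 6, 12]])
array_2 = np.array([[1, 2, 5], [2, 2, 3], [3, 6, 12]])
array_3 = np.array([[2, 4, 10], [3, 2, 3], [4, 6, 12]])

# Stack all rows, then group on the ID column (hypothetical column names).
df = pd.DataFrame(np.vstack([array_1, array_2, array_3]),
                  columns=["id", "p1", "p2"])
grouped = df.groupby("id")

# reset_index() puts the ID back as the first column before converting.
arr_mean = grouped.mean().reset_index().to_numpy()
arr_median = grouped.median().reset_index().to_numpy()
print("Mean values:\n", arr_mean)
print("Median values:\n", arr_median)
```

Other statistics should follow the same pattern, for example grouped.std() or grouped.max().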
I have a column in a Spark Dataframe that contains a list of strings. I am hoping to do the following and am not sure how:
Search the column for the presence of a substring, if this substring is present, replace that string with a word.
If one of the desired substrings is not present, then replace the string with 'other'
Sample SDF:
data = [
[1, ["EQUCAB", "EQUCAS", "CANMA"]],
[2, ["CANMA", "FELCAT", "SUSDO"]],
[3, ["BOSTAU", "EQUCAB"]],
]
df = pd.DataFrame(data, columns=["Item", "String"])
df["String"] = [",".join(map(str, l)) for l in df["String"]]
sdf = spark.createDataFrame(df)
Desired output:
data = [
[1, ["horse", "horse", "other"]],
[2, ["other", "cat", "other"]],
[3, ["cow", "horse"]],
]
df = pd.DataFrame(data, columns=["Item", "String"])
df["String"] = [",".join(map(str, l)) for l in df["String"]]
so basically, any element that contains EQU is assigned horse, any element that contains FEL is assigned cat, any element that contains BOS is assigned cow, and the rest are assigned other.
from pyspark.sql.functions import when, col, lit
sdf = sdf.withColumn("String",
    when(col('String').contains('EQU'), lit('horse'))
    .when(col('String').contains('FEL'), lit('cat'))
    .when(col('String').contains('BOS'), lit('cow'))
    .otherwise(lit('other')))
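One caveat: contains is applied to the whole comma-joined string, so each row ends up with a single label rather than one label per element. Here is a rough sketch of the per-element mapping in plain Python; the names RULES and label_codes are made up for illustration. A function like this could be wrapped in a Spark UDF, although I have not tested that here.

```python
# (substring, label) pairs, checked in order; first match wins.
RULES = [("EQU", "horse"), ("FEL", "cat"), ("BOS", "cow")]

def label_codes(joined, rules=RULES, default="other"):
    """Replace each comma-separated code with the label of the first matching rule."""
    labels = []
    for code in joined.split(","):
        label = next((word for sub, word in rules if sub in code), default)
        labels.append(label)
    return ",".join(labels)

print(label_codes("EQUCAB,EQUCAS,CANMA"))
print(label_codes("CANMA,FELCAT,SUSDO"))
print(label_codes("BOSTAU,EQUCAB"))
```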
I have data in this form:
49907 87063
42003 51519
21301 46100
97578 26010
52364 86618
25783 71775
1617 29096
2662 47428
74888 54550
17182 35976
86973 5323
......
In the end I need to traverse it line by line, as with for line in file.
I want to store the first column's values in one array and the second column's values in another, so that Array_one[0] and Array_two[0] give me the first row's values (49907 and 87063), and likewise for the other rows.
You can use whitespace as the separator.
Ex:
import pandas as pd
df = pd.read_csv(filename, sep=r"\s+", names = ["A", "B"])
print(df["A"][0])
print(df["B"][0])
Output:
49907
87063
for i in df.values:
    print(i)
Output:
[49907 87063]
[42003 51519]
[21301 46100]
[97578 26010]
[52364 86618]
[25783 71775]
[ 1617 29096]
[ 2662 47428]
[74888 54550]
[17182 35976]
[86973 5323]
You can use numpy.genfromtxt to extract directly into a numpy array:
import numpy as np
A = np.genfromtxt(file, dtype=int)
Whitespace is the default delimiter.
You can then use standard numpy indexing / slicing:
To extract the first row: A[0]; the second column: A[:, 1].
To extract the first element of the first row: A[0, 0].
To extract the first element of the second column: A[0, 1]
To iterate the entire array by row:
for i in range(A.shape[0]):
    print(A[i])
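If you really want two separate arrays, as the question asks, one option is np.loadtxt with unpack=True, which returns the columns instead of the rows. A minimal sketch follows. It reads from an in-memory string so that it runs on its own; with a real file you would pass the filename instead.

```python
import io
import numpy as np

# First three rows of the sample data, held in memory for the example.
sample = io.StringIO("49907 87063\n42003 51519\n21301 46100\n")

# unpack=True transposes the result, so each column can be assigned separately.
array_one, array_two = np.loadtxt(sample, dtype=int, unpack=True)

print(array_one[0], array_two[0])

# Walking both arrays together, one row at a time:
for first, second in zip(array_one, array_two):
    print(first, second)
```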
I have a CSV file where one of the columns looks like a numpy array. The first few lines look like the following
first,second,third
170.0,2,[19 234 376]
170.0,3,[19 23 23]
162.0,4,[1 2 3]
162.0,5,[1 3 4]
I load this CSV into a pandas DataFrame using the following code:
data = pd.read_csv('myfile.csv', converters = {'first': np.float64, 'second': np.int64, 'third': np.array})
Now I want to group by the 'first' column and take the union of the 'third' column. After doing this, my dataframe should look like
170.0, [19 23 234 376]
162.0, [1 2 3 4]
How do I achieve this? I tried multiple ways like the following and nothing seems to help achieve this goal.
group_data = data.groupby('first')
group_data['third'].apply(lambda x: np.unique(np.concatenate(x)))
With your current csv file the 'third' column comes in as a string, instead of a list.
There might be nicer ways to convert to a list, but here goes...
from ast import literal_eval
import numpy as np
import pandas as pd
data = pd.read_csv('test_groupby.csv')
# Convert to a string representation of a list...
data['third'] = data['third'].str.replace(' ', ',')
# Convert string to list...
data['third'] = data['third'].apply(literal_eval)
group_data=data.groupby('first')
# Two secrets here revealed
#   x.values instead of x since x is a Series
#   list(...) to return an aggregated value
#   (np.array should work here, but...?)
ans = group_data.aggregate(
    {'third': lambda x: list(np.unique(
        np.concatenate(x.values)))})
print(ans)
third
first
162 [1, 2, 3, 4]
170 [19, 23, 234, 376]
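Here is another possible route that avoids literal_eval, sketched under the assumption that every 'third' cell has the form [n n n] with integers separated by whitespace. It strips the brackets, splits on whitespace, explodes to one value per row, and then collects the sorted unique values per group. The CSV text is inlined so that the example runs on its own.

```python
import io
import pandas as pd

csv_text = """first,second,third
170.0,2,[19 234 376]
170.0,3,[19 23 23]
162.0,4,[1 2 3]
162.0,5,[1 3 4]
"""
data = pd.read_csv(io.StringIO(csv_text))

# "[19 234 376]" -> ["19", "234", "376"]
values = data["third"].str.strip("[]").str.split()

# One row per value, converted to int so that sorting is numeric.
exploded = data.assign(third=values).explode("third").astype({"third": int})

# Union of the values within each 'first' group, as a sorted list.
union = exploded.groupby("first")["third"].agg(lambda s: sorted(set(s)))
print(union)
```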
I have this code that looks for certain values in a huge CSV file. The values are 223.2516 for column 2 of the file, denoted "row[2]", and 58.053 for column 3, denoted "row[3]". I have the code set up so that I can find anything close to those values within an established limit. I know that the value 223.2516 doesn't exist in the file, so I'm looking for everything that is relatively close, as you can see in the code. The last two commands output all the matching values:
In [54]: [row[2] for row in data if abs(row[2]-223.25)<0.001]
Out[54]:
[223.24945646,
223.25013049,
223.25093125999999,
223.24943973000001,
223.24924296,
223.24958522]
and
In [55]: [row[3] for row in data if abs(row[3]-58.053)<0.001]
Out[55]:
[58.052124569999997,
58.052942659999999,
58.053108100000003,
58.053536250000001,
58.05346918,
58.053109259999999,
58.052188620000003,
58.052528559999999,
58.053201559999998,
58.052009560000002,
58.052036010000002,
58.053623790000003,
58.052450120000003,
58.052405720000003,
58.053431590000002,
58.053709660000003,
58.053117569999998,
58.052511709999997]
The problem that I have is that I need both values to be within the same row. I'm not looking for the values independent of each other. The 223 value and the 58.0 value both have to be in the same row; they're coordinates.
Is there a way to output only those values that are in the same row or at least, print the row number in which each value is in, along with the value?
Here's my code:
import numpy
from matplotlib import *
from pylab import *
data = np.genfromtxt('result.csv',delimiter=',',skip_header=1, dtype=float)
[row[2] for row in data if abs(row[2]-223.25)<0.001]
[row[3] for row in data if abs(row[3]-58.053)<0.001]
Question looks familiar. Use enumerate. As an example:
data = [[3, 222], [8, 223], [1,224], [5, 223]]
A = [ [ind,row[0],row[1]] for ind,row in enumerate(data) if abs(row[1]-223)<1 ]
print(A)
[[1, 8, 223], [3, 5, 223]]
This way, you get the index and you get the pair of values you want.
Take the idea and convert back to your example. So something like:
[ [ind, row] for ind,row in enumerate(data) if abs(row[2]-223.25)<0.001]
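Since the question needs both coordinates to match in the same row, the two conditions can be combined. Below is a small sketch using a NumPy boolean mask. The data array is made up for illustration; in the question it would come from np.genfromtxt.

```python
import numpy as np

# Made-up rows; columns 2 and 3 hold the two coordinates.
data = np.array([
    [0.0, 0.0, 223.2503, 58.0531],   # both close
    [0.0, 0.0, 223.2501, 57.9000],   # only column 2 close
    [0.0, 0.0, 223.2495, 58.0528],   # both close
    [0.0, 0.0, 100.0000, 58.0530],   # only column 3 close
])

# A row matches only if BOTH coordinates are within the tolerance.
mask = (np.abs(data[:, 2] - 223.25) < 0.001) & (np.abs(data[:, 3] - 58.053) < 0.001)

row_numbers = np.flatnonzero(mask)   # indices of the matching rows
matches = data[mask]                 # the matching rows themselves

for ind, row in zip(row_numbers, matches):
    print(ind, row[2], row[3])
```

The same and-condition also works inside the list comprehension with enumerate if you would rather stay with plain Python.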
@brechmos has it. More explicitly, enumerate lets you track the index of each item as you iterate over any iterable.
phrase = 'green eggs and spam'
for ii, word in enumerate(phrase.split()):
    print("Word %d is %s" % (ii, word))
Output…
Word 0 is green
Word 1 is eggs
Word 2 is and
Word 3 is spam