Output the row number along with the value in it - Python

I have this code that looks for certain values in a huge CSV file: 223.2516 in column 2 of the file, which is denoted "row[2]", and 58.053 in column 3, denoted "row[3]". I have the code set up so that I can find anything close to those values within an established limit. I know that the value 223.2516 doesn't exist in the file, so I'm looking for everything that is relatively close, as you can see in the code. The last two commands give an output of all the values:
In [54]: [row[2] for row in data if abs(row[2]-223.25)<0.001]
Out[54]:
[223.24945646,
223.25013049,
223.25093125999999,
223.24943973000001,
223.24924296,
223.24958522]
and
In [55]: [row[3] for row in data if abs(row[3]-58.053)<0.001]
Out[55]:
[58.052124569999997,
58.052942659999999,
58.053108100000003,
58.053536250000001,
58.05346918,
58.053109259999999,
58.052188620000003,
58.052528559999999,
58.053201559999998,
58.052009560000002,
58.052036010000002,
58.053623790000003,
58.052450120000003,
58.052405720000003,
58.053431590000002,
58.053709660000003,
58.053117569999998,
58.052511709999997]
The problem that I have is that I need both values to be within the same row. I'm not looking for the values independently of each other. The 223 value and the 58.0 value both have to be in the same row; they're coordinates.
Is there a way to output only those values that are in the same row or, at least, print the row number each value is in, along with the value?
Here's my code:
import numpy as np
from matplotlib import *
from pylab import *
data = np.genfromtxt('result.csv', delimiter=',', skip_header=1, dtype=float)
[row[2] for row in data if abs(row[2]-223.25)<0.001]
[row[3] for row in data if abs(row[3]-58.053)<0.001]

Question looks familiar. Use enumerate. As an example:
data = [[3, 222], [8, 223], [1,224], [5, 223]]
A = [ [ind,row[0],row[1]] for ind,row in enumerate(data) if abs(row[1]-223)<1 ]
print(A)
[[1, 8, 223], [3, 5, 223]]
This way, you get the index and you get the pair of values you want.
Take the idea and convert back to your example. So something like:
[ [ind, row] for ind,row in enumerate(data) if abs(row[2]-223.25)<0.001]
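To keep both conditions in the same row, here is a minimal sketch extending that comprehension (the column indices and tolerances are taken from the question; adapt as needed):
both = [
    [ind, row[2], row[3]]
    for ind, row in enumerate(data)
    if abs(row[2] - 223.25) < 0.001 and abs(row[3] - 58.053) < 0.001
]
print(both)  # each entry: [row index, column-2 value, column-3 value]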

@brechmos has it. More explicitly, enumerate allows you to track the implicit index of any iterable as you iterate through it.
phrase = 'green eggs and spam'
for ii, word in enumerate(phrase.split()):
    print("Word %d is %s" % (ii, word))
Output…
Word 0 is green
Word 1 is eggs
Word 2 is and
Word 3 is spam
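Since data comes from np.genfromtxt and is already a NumPy array, the same row-wise filter can also be written without an explicit loop. A sketch using a boolean mask (assuming the column indices and tolerances from the question):
import numpy as np

mask = (np.abs(data[:, 2] - 223.25) < 0.001) & (np.abs(data[:, 3] - 58.053) < 0.001)
row_numbers = np.where(mask)[0]   # row indices where both conditions hold
print(row_numbers)
print(data[mask][:, [2, 3]])      # the matching coordinate pairs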

Related

How to combine h5 data numpy arrays based on date in filename?

I have hundreds of .h5 files with dates in their filename (e.g. ...20221017...). For each file, I have extracted some parameters into a numpy array of the format
[[param_1a, param_2a...param_5a],
...
[param_1x, param_2x,...param_5x]]
which represents the data of interest. I want to group the data by month, so instead of having (e.g.) 30 arrays for one month, I have 1 array which represents the average of the 30 arrays. How can I do this?
This is the code I have so far; filename represents a .txt file of file names.
def combine_months(filename):
    fin = open(filename, 'r')
    next_name = fin.readline()
    while (next_name != ""):
        year = next_name[6:10]
        month = next_name[11:13]
        date = month+'\\'+year
        #not sure where to go from here
        next_name = fin.readline()  # read the next filename
    fin.close()
An example of what I hope to achieve is that say array_1, array_2, array_3 are numpy arrays representing data from different h5 files with the same month in the date of their filename.
array_1 = [[ 1 4 10]
[ 2 5 11]
[3 6 12]]
array_2 = [[ 1 2 5]
[ 2 2 3]
[ 3 6 12]]
array_3 = [[ 2 4 10]
[ 3 2 3]
[ 4 6 12]]
I want the result to look like:
2022_04_data = [[1,3,7.5]
[2, 2, 6.5]
[3,4,7.5]
[4,6,12]]
Note that the first number of each row represents an ID, so I need to group those data together based on the first number as well.
Ok, here is the beginning of an answer. (I suspect you may have more questions as you work thru the details.)
There are several ways to get the filenames. You could put them in a file, but it's easier (and better IMHO) to use the glob.iglob() function. There are 2 examples below that show how to: 1) open each file, 2) read the data from the 'data' dataset into an array, and 3) append the array to a list. The first example has the file names in a list. The second uses the glob.iglob() function to get the filenames. (You could also use glob.glob() to create a list of names.)
Method 1: read filenames from list
import h5py

arr_list = []
for h5file in ['20221001.h5', '20221002.h5', '20221003.h5']:
    with h5py.File(h5file,'r') as h5f:
        arr = h5f['data'][()]
        #print(arr)
        arr_list.append(arr)
Method 2: use glob.iglob() to get files using wildcard names
import h5py
from glob import iglob

arr_list = []
for h5file in iglob('202210*.h5'):
    with h5py.File(h5file,'r') as h5f:
        print(h5f.keys()) # to get the dataset names from the keys
        arr = h5f['data'][()]
        #print(arr)
        arr_list.append(arr)
After you have read the datasets into arrays, you iterate over the list, do your calculations and create a new array from the results. Code below shows how to get the shape and dtype.
for arr in arr_list:
    # do something with the data based on column 0 value
    print(arr.shape, arr.dtype)
Code below shows a way to sum rows with matching column 0 values. Without more details it's hard to show exactly how to do this. It reads all column 0 values into a sorted list, then uses that list to size the count and sum arrays, and to index into the proper row.
import numpy as np

# first create a list from column 0 values, then sort
row_value_list = []
for arr in arr_list:
    col_vals = arr[:,0]
    for val in col_vals:
        if val not in row_value_list:
            row_value_list.append(val)
# Sort list of column IDs
row_value_list.sort()
# get length of index list to size cnt and sum arrays
a0 = len(row_value_list)
# get shape and dtype from 1st array, assume constant for all
a1 = arr_list[0].shape[1]
dt = arr_list[0].dtype
arr_cnt = np.zeros(shape=(a0,a1), dtype=dt)
arr_cnt[:,0] = row_value_list
arr_sum = np.zeros(shape=(a0,a1), dtype=dt)
arr_sum[:,0] = row_value_list
for arr in arr_list:
    for row in arr:
        idx = row_value_list.index(row[0])
        arr_cnt[idx,1:] += 1
        arr_sum[idx,1:] += row[1:]
print('Count Array\n', arr_cnt)
print('Sum Array\n', arr_sum)
arr_ave = arr_sum/arr_cnt
arr_ave[:,0] = row_value_list
print('Average Array\n', arr_ave)
Here is an alternate way to create row_value_list from a set. It's simpler because sets don't retain duplicate values, so you don't have to check for existing values when adding them to row_value_set.
# first create a set from column 0 values, then create a sorted list
row_value_set = set()
for arr in arr_list:
    col_vals = set(arr[:,0])
    row_value_set = row_value_set.union(col_vals)
row_value_list = sorted(row_value_set)
This is a new, updated answer that addresses the comment/request about calculating the median. (It also calculates the mean, and can be easily extended to calculate other statistics from the masked array.)
As noted in my comment on Nov 4 2022, "starting from my first answer quickly got complicated and hard to follow". This process is similar to, but different from, the first answer. It uses glob to get a list of filenames (instead of iglob). Instead of loading the H5 datasets into a list of arrays, it loads all of the data into a single array (the data is "stacked" on the 0-axis). I don't think this increases the memory footprint. However, memory could be a problem if you load a lot of very large datasets for analysis.
Summary of the procedure:
1) Use glob.glob() to load filenames to a list based on a wildcard.
2) Allocate an array to hold all the data (arr_all) based on the # of files and the size of 1 dataset.
3) Loop thru all H5 files, loading data to arr_all.
4) Create a sorted list of unique group IDs (column 0 values).
5) Allocate arrays to hold the mean/median (arr_mean and arr_median) based on the # of unique row IDs and # of columns in arr_all.
6) Loop over values in the ID list, then:
   a. Create a masked array (mask) where column 0 value = loop value
   b. Broadcast mask to match arr_all shape, then apply it to create ma_arr_all
   c. Loop over columns of ma_arr_all, compress to get unmasked values, then calculate the mean and median and save them.
Code below:
import h5py
from glob import glob
import numpy as np

# use glob.glob() to get list of files using wildcard names
file_list = glob('202210*.h5')
with h5py.File(file_list[0],'r') as h5f:
    a0, a1 = h5f['data'].shape
    # allocate array to hold values from all datasets
    arr_all = np.zeros(shape=(len(file_list)*a0, a1), dtype=h5f['data'].dtype)

start, stop = 0, a0
for i, h5file in enumerate(file_list):
    with h5py.File(h5file,'r') as h5f:
        arr_all[start:stop,:] = h5f['data'][()]
    start += a0
    stop += a0

# Create a set from column 0 values, and use it to create a sorted list
row_value_list = sorted(set(arr_all[:,0]))
arr_mean = np.zeros(shape=(len(row_value_list), arr_all.shape[1]))
arr_median = np.zeros(shape=(len(row_value_list), arr_all.shape[1]))
col_0 = arr_all[:,0:1]
for i, row_val in enumerate(row_value_list):
    row_mask = np.where(col_0==row_val, False, True)  # True mask value ignores data
    all_mask = np.broadcast_to(row_mask, arr_all.shape)
    ma_arr_all = np.ma.masked_array(arr_all, mask=all_mask)
    for j in range(ma_arr_all.shape[1]):
        masked_col = ma_arr_all[:,j:j+1].compressed()
        arr_mean[i:i+1,j:j+1] = np.mean(masked_col)
        arr_median[i:i+1,j:j+1] = np.median(masked_col)
print('Mean values:\n', arr_mean)
print('Median values:\n', arr_median)
Added Nov 22, 2022:
The method above uses np.broadcast_to(), introduced in NumPy 1.10. Here is an alternate method for earlier versions (it replaces the entire for i, row_val loop). It should also be more memory efficient; I haven't profiled to verify, but the arrays all_mask and ma_arr_all are not created.
for i, row_val in enumerate(row_value_list):
    row_mask = np.where(col_0==row_val, False, True)  # True mask value ignores data
    for j in range(arr_all.shape[1]):
        masked_col = np.ma.masked_array(arr_all[:,j:j+1], mask=row_mask).compressed()
        arr_mean[i:i+1,j:j+1] = np.mean(masked_col)
        arr_median[i:i+1,j:j+1] = np.median(masked_col)
I ran this with the values provided by the OP. The output is below and is the same for both methods:
Mean values:
[[ 1. 3. 7.5 ]
[ 2. 3.66666667 8. ]
[ 3. 4.66666667 9. ]
[ 4. 6. 12. ]]
Median values:
[[ 1. 3. 7.5]
[ 2. 4. 10. ]
[ 3. 6. 12. ]
[ 4. 6. 12. ]]
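The answer notes that other statistics can be added easily. As an illustrative sketch (not part of the original answer), a standard-deviation array could be filled the same way, assuming the same naming and shapes as arr_mean above:
arr_std = np.zeros(shape=(len(row_value_list), arr_all.shape[1]))

for i, row_val in enumerate(row_value_list):
    row_mask = np.where(col_0 == row_val, False, True)
    for j in range(arr_all.shape[1]):
        masked_col = np.ma.masked_array(arr_all[:, j:j+1], mask=row_mask).compressed()
        arr_std[i, j] = np.std(masked_col)   # standard deviation per group and column
print('Std values:\n', arr_std)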

Python extract data from a semi-structured .xlsx file

I have a .xlsx file which looks like the attached file. What is the most common way to extract the different data parts from this Excel file in Python?
Ideally there would be a method defined as:
pd.read_part_csv(columns=['data1', 'data2','data3'], rows=['val1', 'val2', 'val3'])
and returns an iterator over pandas dataframes which hold the values in the given table.
Here is a solution with pylightxl that might be a good fit for your project if all you are doing is reading. I wrote the solution in terms of rows, but you could just as well do it in terms of columns. See the docs for more info on pylightxl: https://pylightxl.readthedocs.io/en/latest/quickstart.html
import pylightxl

db = pylightxl.readxl('Book1.xlsx')

# pull out all the rowIDs where data groups start
keyrows = [rowID for rowID, row in enumerate(db.ws('Sheet1').rows,1) if 'val1' in row]

# find the columnIDs where data groups start (like in your example, not all data groups start in col A)
keycols = []
for keyrow in keyrows:
    # add +1 since python index start from 0
    keycols.append(db.ws('Sheet1').row(keyrow).index('val1') + 1)

# define a dict to hold your data groups
datagroups = {}
# populate datatables
for tableIndex, keyrow in enumerate(keyrows,1):
    i = 0
    # data groups: keys are group IDs starting from 1, list: list of data rows (ie: val1, val2...)
    datagroups.update({tableIndex: []})
    while True:
        # pull out the current group row of data, and remove leading cells with keycols
        datarow = db.ws('Sheet1').row(keyrow + i)[keycols[tableIndex-1]:]
        # check if the current row is still part of the datagroup
        if datarow[0] == '':
            # current row is empty and is no longer part of the data group
            break
        datagroups[tableIndex].append(datarow)
        i += 1

print(datagroups[1])
print(datagroups[2])
[[1, 2, 3, ''], [4, 5, 6, ''], [7, 8, 9, '']]
[[9, 1, 4], [2, 4, 1], [3, 2, 1]]
Note that the output of table 1 has an extra '' in each row; that is because the sheet's data range is larger than your group size. You can easily remove these with list.remove('') if you like.
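For example, a minimal sketch of that cleanup step (assuming the datagroups dict built above) that strips the empty cells from every row of group 1:
# drop empty-string cells from each row of data group 1
cleaned = [[cell for cell in row if cell != ''] for row in datagroups[1]]
print(cleaned)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]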

How do I normalize a Python matrix with one CSV column?

I have a matrix where ONE COLUMN is a CSV, like this:-
matrix = [
[1,"123,354,23"],
[2,"234,34,678"]
]
How do I normalize this, so I get one row for each value in the CSV column, i.e. so that it looks like this:-
[
[1, 123],
[1, 354],
[1, 23],
[2, 234],
[2, 34],
[2, 678]
]
I'm open to using numpy or pandas.
Note, in my specific case there are many other non-CSV columns too.
Thanks
In the example you gave, this will do it:
matrix = [
[1,"123,354,23"],
[2,"234,34,678"]
]
import ast
expanded = [
    [index, item]
    for index, rowString in matrix
    for item in ast.literal_eval('[' + rowString + ']')
]
For your other "non-CSV" cases it depends on how they are formatted. Here, ast.literal_eval was a good tool for converting your apparent standard (a comma-separated string) into a Python sequence for the variable item to iterate over. Other conversion approaches might be needed for other formats.
That produces a list of lists exactly as you specified. pandas is a good tool to use from there though. To then convert the list of lists into a pandas.DataFrame, you could say:
import pandas as pd
df = pd.DataFrame(expanded, columns=['index', 'item']).set_index(['index'])
print(df)
# prints:
#
# item
# index
# 1 123
# 1 354
# 1 23
# 2 234
# 2 34
# 2 678
Or, if by "many other non-CSV columns" you just mean an arbitrary number of additional entries in each row of matrix, but that the last one is still always CSV text, then it could look like this:
matrix = [
[1, 3.1415927, 'Mary Poppins', "123,354,23"],
[2, 2.7182818, 'Genghis Khan', "234,34,678"]
]
import ast
expanded = [
    row[:-1] + [item]
    for row in matrix
    for item in ast.literal_eval('[' + row[-1] + ']')
]
import pandas as pd
df = pd.DataFrame(expanded).set_index([0])
If the matrix contains pairs of the form (first, text), you can write:
result = [
    [first, int(rest)]
    for first, text in matrix
    for rest in text.split(",")
]
Or, without a list comprehension:
result = []
for first, text in matrix:
    for rest in text.split(","):
        result.append([first, int(rest)])
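Since the question is open to pandas, here is a minimal sketch of the same normalization using Series.str.split plus DataFrame.explode (explode requires pandas 0.25+); the column names are illustrative:
import pandas as pd

matrix = [
    [1, "123,354,23"],
    [2, "234,34,678"],
]

df = pd.DataFrame(matrix, columns=['id', 'values'])  # illustrative column names
df['values'] = df['values'].str.split(',')           # turn the CSV text into lists
df = df.explode('values')                            # one row per list element
df['values'] = df['values'].astype(int)              # back to integers
print(df.values.tolist())
# [[1, 123], [1, 354], [1, 23], [2, 234], [2, 34], [2, 678]]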

Finding outliers in an excel row

As an example, say column C has 1000 cells; most are filled with '1', but there are a couple of '2's sprinkled in. I'm trying to find how many '2's there are and print the number.
import openpyxl

wb = openpyxl.load_workbook('TestBook')
ws = wb.get_sheet_by_name('Sheet1')
for cell in ws['C']:
    print(cell.value)
How can I iterate through the column and just pull how many twos there are?
As @K.Marker pointed out, you can get the count of a specific value in the column with
[c.value for c in ws['C']].count(2)
But what if you don't know the values and/or you'd like to see the distribution of the values of a particular row? You can use a Counter which has dict-like behaviour.
In [448]: from collections import Counter
In [449]: counter = Counter([c.value for c in ws[3]])
In [451]: counter
Out[451]: Counter({1: 17, 2: 5})
In [452]: for k, v in counter.items():
     ...:     print('{0} occurs {1} time(s)'.format(k, v))
     ...:
1 occurs 17 time(s)
2 occurs 5 time(s)
import openpyxl
wb = openpyxl.load_workbook('TestBook')
ws = wb.get_sheet_by_name('Sheet1')
num_of_twos = [c.value for c in ws["C"]].count(2)
The list comprehension creates a list of cell values throughout column C, and count(2) returns how many 2s are in it.
Are you looking for how many 2's there are?
count = 0
# load a row into a list
row = list(worksheet.rows)[wantedRowNumber]
# iterate over it and increase the count
for r in row:
    if r.value == 2:
        count += 1
Now, this only works with values of "2" and doesn't find other outliers. To find outliers in general you will have to determine a threshold first. In this example I'll use the average value, although you would need to determine the best test to get the threshold for outliers based on your data. Don't worry, statistics are fun!
count = 0
# load a row into a list
row = list(worksheet.rows)[wantedRowNumber]
# pull out the cell values
values = [r.value for r in row]

# calculate the average
# using numpy
import numpy as np
NPavg = np.mean(values)

# without numpy
# need to cast to float - otherwise it will round to int
avg = sum(values)/float(len(values))

# iterate over the values and increase the count
for v in values:
    # of course use your own threshold,
    # determined appropriately, instead of the average
    if v > NPavg:
        count += 1
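As one common choice of threshold (an illustrative sketch, not from the answer above, assuming the row of cell values loaded as in the previous snippet): flag values more than two standard deviations from the mean.
import numpy as np

values = [r.value for r in row]           # cell values from the row loaded above
mean, std = np.mean(values), np.std(values)
k = 2                                      # number of standard deviations to treat as outlying
outliers = [v for v in values if abs(v - mean) > k * std]
print(len(outliers), 'outlier(s):', outliers)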

how to search a list of values in another list

I'm new to Python.
Is there a way to search a list of values (words and phrases) in another list (a CSV table), and get only the matched rows?
Example:
LiastOfValues=['smoking','hard smoker','alcoholic']
ListfromCSV =
ID,TYPE,STRING1,NUMBER
1, a,'this is hard smoker man',4
2, b,'this one likes to drink',5
3, c,'dont like sigarets',6
4, e,'this one is smoking',7
I want to search for LiastOfValues in each row and return only the matched rows.
The Output:
Output=
ID,TYPE,STRING1,NUMBER
1, a,'this is hard smoker man',4
4, e,'this one is smoking',7
I have tried this:
import csv

ListfromCSV = "ListfromCSV.txt"
LiastOfValues = ['smoking','hard smoker','alcoholic','smoker']
with open(ListfromCSV, 'r') as f:
    LineReader = csv.reader(f, delimiter=',')
    for i in LineReader:
        if value in (i[2]):
            print(i)
Try this. It assumes your csv is not nested and contains strings. If it is nested, you can convert the lists to strings:
[row for row in csv if any(map(lambda x: x in row,LiastOfValues))]
This code should get a list with matched rows (does not include header unless you match it)
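For completeness, a minimal sketch that applies the same any() idea while reading the file with csv.reader and keeping the header (the filename and the STRING1 column index follow the question):
import csv

LiastOfValues = ['smoking', 'hard smoker', 'alcoholic']

with open('ListfromCSV.txt', 'r') as f:
    LineReader = csv.reader(f, delimiter=',')
    header = next(LineReader)                  # keep the header row
    matched = [row for row in LineReader
               if any(value in row[2] for value in LiastOfValues)]

print(header)
for row in matched:
    print(row)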