My goal is to insert a comma between every character of every column value, after the values have been hashed and zero-padded to a length of 19 digits.
The code below partially works, but the array values get mangled when I apply the f_comma function to a column. Thanks for any help!
I've taken some of the answers from other questions and have created the following code:
using this function -
def f_comma(p_string, n=1):
    p_string = str(p_string)
    return ','.join(p_string[i:i+n] for i in range(0, len(p_string), n))
And opening a tsv file
import pandas as pd
data = pd.read_csv('a1.tsv', sep='\t', dtype=object)
I have modified another answer to do the following -
h = 1
try:
    while data.columns[h]:
        a = data.columns[h]
        data[a] = f_comma((abs(data[a].apply(hash))).astype(str).str.zfill(19))
        h += 1
except IndexError:
    pass
which returns this array
array([[ '0, , , , ,4,1,7,5,7,0,1,4,5,4,6,1,6,5,3,1,4,6,1,\n,N,a,m,e,:, ,d,a,t,e,,, ,d,t,y,p,e,:, ,o,b,j,e,c,t',
'0, , , , ,6,2,9,1,6,7,0,8,4,2,8,2,9,1,0,9,5,9,4,\n,N,a,m,e,:, ,n,a,m,e,,, ,d,t,y,p,e,:, ,o,b,j,e,c,t']], dtype=object)
without the f_comma function the array looks like -
array([['3556968867719847281', '3691880917405293133']], dtype=object)
The goal is an array like this -
array([['3,5,5,6,9,6,8,8,6,7,7,1,9,8,4,7,2,8,1', '3,6,9,1,8,8,0,9,1,7,4,0,5,2,9,3,1,3,3']], dtype=object)
You should be able to use pandas string functions.
e.g. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.join.html
df["my_column"].str.join(',')
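The garbled output in the question comes from passing a whole Series to f_comma: str() of a Series is its printed representation (index, value, Name, dtype), and that text is what gets split by commas. A minimal sketch of the per-column Series.str.join approach follows. The DataFrame here is a hypothetical stand-in that reuses the two hashed values shown in the question, and the first column is skipped the same way the question's loop does by starting at h = 1.

```python
import pandas as pd

# Hypothetical stand-in for the hashed, zero-padded columns from the question.
data = pd.DataFrame(
    {
        "id": ["a"],
        "date": ["3556968867719847281"],
        "name": ["3691880917405293133"],
    },
    dtype=object,
)

# Join the characters of each string element with commas, one column at a time.
# The first column is skipped, as in the question's loop (h starts at 1).
for col in data.columns[1:]:
    data[col] = data[col].astype(str).str.join(",")

result = data[data.columns[1:]].to_numpy()
print(result)
```

Because str.join works element by element, each cell keeps its own value and nothing from the Series repr leaks into the result.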
I have hundreds of .h5 files with dates in their filename (e.g ...20221017...). For each file, I have extracted some parameters into a numpy array of the format
[[param_1a, param_2a...param_5a],
...
[param_1x, param_2x,...param_5x]]
which represents data of interest. I want to group the data by month, so instead of having (e.g) 30 arrays for one month, I have 1 array which represents the average of the 30 arrays. How can I do this?
This is the code I have so far, filename represents a txt file of file names.
def combine_months(filename):
    fin = open(filename, 'r')
    next_name = fin.readline()
    while (next_name != ""):
        year = next_name[6:10]
        month = next_name[11:13]
        date = month+'\\'+year
        #not sure where to go from here
    fin.close()
An example of what I hope to achieve is that say array_1, array_2, array_3 are numpy arrays representing data from different h5 files with the same month in the date of their filename.
array_1 = [[ 1 4 10]
[ 2 5 11]
[3 6 12]]
array_2 = [[ 1 2 5]
[ 2 2 3]
[ 3 6 12]]
array_3 = [[ 2 4 10]
[ 3 2 3]
[ 4 6 12]]
I want the result to look like:
2022_04_data = [[1,3,7.5]
[2, 2, 6.5]
[3,4,7.5]
[4,6,12]]
Note that the first number of each row represents an ID, so I need to group those data together based on the first number as well.
Ok, here is the beginning of an answer. (I suspect you may have more questions as you work thru the details.)
There are several ways to get the filenames. You could put them in a file, but it's easier (and better IMHO) to use the glob.iglob() function. There are 2 examples below that show how to: 1) open each file, 2) read the data from the data dataset into an array, and 3) append the array to a list. The first example has the file names in a list. The second uses the glob.iglob() function to get the filenames. (You could also use glob.glob() to create a list of names.)
Method 1: read filenames from list
import h5py
arr_list = []
for h5file in ['20221001.h5', '20221002.h5', '20221003.h5']:
    with h5py.File(h5file,'r') as h5f:
        arr = h5f['data'][()]
        #print(arr)
        arr_list.append(arr)
Method 2: use glob.iglob() to get files using wildcard names
import h5py
from glob import iglob
arr_list = []
for h5file in iglob('202210*.h5'):
    with h5py.File(h5file,'r') as h5f:
        print(h5f.keys()) # to get the dataset names from the keys
        arr = h5f['data'][()]
        #print(arr)
        arr_list.append(arr)
After you have read the datasets into arrays, you iterate over the list, do your calculations and create a new array from the results. Code below shows how to get the shape and dtype.
for arr in arr_list:
    # do something with the data based on column 0 value
    print(arr.shape, arr.dtype)
The code below shows one way to sum rows with matching column 0 values. Without more details it's hard to show exactly how to do this. It reads all column 0 values into a sorted list, uses the length of that list to size the count and sum arrays, and then uses each value's position in the list as the index of the proper row.
import numpy as np

# first create a list from column 0 values, then sort
row_value_list = []
for arr in arr_list:
    col_vals = arr[:,0]
    for val in col_vals:
        if val not in row_value_list:
            row_value_list.append(val)
# Sort list of column IDs
row_value_list.sort()
# get length of index list to create cnt and sum arrays
a0 = len(row_value_list)
# get shape and dtype from 1st array, assume constant for all
a1 = arr_list[0].shape[1]
dt = arr_list[0].dtype
arr_cnt = np.zeros(shape=(a0,a1),dtype=dt)
arr_cnt[:,0] = row_value_list
arr_sum = np.zeros(shape=(a0,a1),dtype=dt)
arr_sum[:,0] = row_value_list
# accumulate counts and sums in the row that matches each ID
for arr in arr_list:
    for row in arr:
        idx = row_value_list.index(row[0])
        arr_cnt[idx,1:] += 1
        arr_sum[idx,1:] += row[1:]
print('Count Array\n',arr_cnt)
print('Sum Array\n',arr_sum)
arr_ave = arr_sum/arr_cnt
# restore the IDs in column 0 after the division
arr_ave[:,0] = row_value_list
print('Average Array\n',arr_ave)
Here is an alternate way to create row_value_list from a set. It's simpler because sets don't retain duplicate values, so you don't have to check for existing values when adding them to row_value_set.
# first create a set from column 0 values, then create a sorted list
row_value_set = set()
for arr in arr_list:
    col_vals = set(arr[:,0])
    row_value_set = row_value_set.union(col_vals)
row_value_list = sorted(row_value_set)
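As a further variation (not part of the answer above), the same sorted list of unique IDs can be sketched with np.unique, which sorts and de-duplicates in one call. The two arrays below are hypothetical stand-ins for datasets read from the H5 files.

```python
import numpy as np

# Hypothetical stand-ins for the arrays read from the H5 files.
arr_list = [
    np.array([[1, 4, 10], [2, 5, 11], [3, 6, 12]]),
    np.array([[2, 4, 10], [3, 2, 3], [4, 6, 12]]),
]

# Concatenate every column 0, then let np.unique sort and de-duplicate.
all_ids = np.concatenate([arr[:, 0] for arr in arr_list])
row_value_list = np.unique(all_ids).tolist()
print(row_value_list)
```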
This is a new, updated answer that addresses the comment/request about calculating the median. (It also calculates the mean, and can be easily extended to calculate other statistics from the masked array.)
As noted in my comment on Nov 4 2022, "starting from my first answer quickly got complicated and hard to follow". This process is similar but different from the first answer. It uses glob to get a list of filenames (instead of iglob). Instead of loading the H5 datasets into a list of arrays, it loads all of the data into a single array (data is "stacked" on the 0-axis.). I don't think this increases the memory footprint. However, memory could be a problem if you load a lot of very large datasets for analysis.
Summary of the procedure:
1. Use glob.glob() to load filenames to a list based on a wildcard
2. Allocate an array to hold all the data (arr_all) based on the # of files and size of 1 dataset.
3. Loop thru all H5 files, loading data to arr_all
4. Create a sorted list of unique group IDs (column 0 values)
5. Allocate arrays to hold mean/median (arr_mean and arr_median) based on the # of unique row IDs and # of columns in arr_all.
6. Loop over values in ID list, then:
   a. Create masked array (mask) where column 0 value = loop value
   b. Broadcast mask to match arr_all shape then apply to create ma_arr_all
   c. Loop over columns of ma_arr_all, compress to get unmasked values, then calculate mean and median and save.
Code below:
import h5py
from glob import glob
import numpy as np
# use glob.glob() to get list of files using wildcard names
file_list = glob('202210*.h5')
with h5py.File(file_list[0],'r') as h5f:
    a0, a1 = h5f['data'].shape
    # allocate array to hold values from all datasets
    arr_all = np.zeros(shape=(len(file_list)*a0,a1), dtype=h5f['data'].dtype)
start, stop = 0, a0
for i, h5file in enumerate(file_list):
    with h5py.File(h5file,'r') as h5f:
        arr_all[start:stop,:] = h5f['data'][()]
    start += a0
    stop += a0
# Create a set from column 0 values, and use to create a sorted list
row_value_list = sorted(set(arr_all[:,0]))
arr_mean = np.zeros(shape=(len(row_value_list),arr_all.shape[1]))
arr_median = np.zeros(shape=(len(row_value_list),arr_all.shape[1]))
col_0 = arr_all[:,0:1]
for i, row_val in enumerate(row_value_list):
    row_mask = np.where(col_0==row_val, False, True ) # True mask value ignores data.
    all_mask= np.broadcast_to(row_mask, arr_all.shape)
    ma_arr_all = np.ma.masked_array(arr_all, mask=all_mask)
    for j in range(ma_arr_all.shape[1]):
        masked_col = ma_arr_all[:,j:j+1].compressed()
        arr_mean[i:i+1,j:j+1] = np.mean(masked_col)
        arr_median[i:i+1,j:j+1] = np.median(masked_col)
print('Mean values:\n',arr_mean)
print('Median values:\n',arr_median)
Added Nov 22, 2022:
Method above uses np.broadcast_to() introduced in NumPy 1.10. Here is an alternate method for prior versions. (Replaces the entire for i, row_val loop.) It should be more memory efficient. I haven't profiled to verify, but arrays all_mask and ma_arr_all are not created.
for i, row_val in enumerate(row_value_list):
    row_mask = np.where(col_0==row_val, False, True ) # True mask value ignores data.
    for j in range(arr_all.shape[1]):
        masked_col = np.ma.masked_array(arr_all[:,j:j+1], mask=row_mask).compressed()
        arr_mean[i:i+1,j:j+1] = np.mean(masked_col)
        arr_median[i:i+1,j:j+1] = np.median(masked_col)
I ran with values provided by OP. Output is provided below and is the same for both methods:
Mean values:
[[ 1. 3. 7.5 ]
[ 2. 3.66666667 8. ]
[ 3. 4.66666667 9. ]
[ 4. 6. 12. ]]
Median values:
[[ 1. 3. 7.5]
[ 2. 4. 10. ]
[ 3. 6. 12. ]
[ 4. 6. 12. ]]
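For readers who want to reproduce these numbers without any H5 files, here is a minimal sketch that swaps the masked-array loop for plain boolean indexing. It is a different technique from the one in the answer, shown only as a cross-check. The three arrays are the OP's example values, and stacking them with np.vstack stands in for loading the datasets into arr_all.

```python
import numpy as np

# The three example arrays from the question (same month, different files).
array_1 = np.array([[1, 4, 10], [2, 5, 11], [3, 6, 12]], dtype=float)
array_2 = np.array([[1, 2, 5], [2, 2, 3], [3, 6, 12]], dtype=float)
array_3 = np.array([[2, 4, 10], [3, 2, 3], [4, 6, 12]], dtype=float)

# Stack everything on axis 0, as arr_all is built in the answer.
arr_all = np.vstack([array_1, array_2, array_3])
ids = np.unique(arr_all[:, 0])

# For each ID, select its rows with a boolean mask and reduce over axis 0.
# Column 0 of the result is the ID itself, since every selected row shares it.
arr_mean = np.array([arr_all[arr_all[:, 0] == i].mean(axis=0) for i in ids])
arr_median = np.array([np.median(arr_all[arr_all[:, 0] == i], axis=0) for i in ids])

print('Mean values:\n', arr_mean)
print('Median values:\n', arr_median)
```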
I have a pandas df like this.
up_value valsup
0 59044.21272 59044.21272
1 59040.68568 59158.53136
2 59044.21272 59279.91816
3 59040.69570 59394.23280
4 59044.22274 59515.63370
... ... ...
6081 58917.07896 774036.35472
6082 58917.07896 774153.95368
6083 58917.08898 774271.68432
6084 58917.07896 774389.15160
6085 58917.08898 774506.88228
I'm trying to use numpy argwhere and create a new pandas column like this.
df["idx_up"] = np.argwhere(df["valsup"].values > df["up_value"].values)
But it returns the following error.
ValueError: Length of values (6085) does not match length of index (6086)
When I do, print(np.argwhere(df["valsup"].values > df["up_value"].values)), the output looks like this.
[[ 1]
[ 2]
[ 3]
...
[6083]
[6084]
[6085]]
So it seems like np.argwhere only returns 6085 values instead of 6086.
I want to assign the output to a pandas column. Can someone tell me how to fix the error?
Thanks
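In the code from one of the answers at that URL,
idx_up = idx_up[0][0] if len(idx_up) else -1
only the first match (idx_up at index 0) is checked.
You should add the column first, like
df['idx_up'] = -1
and then update the matching rows in a single indexing step (chained assignment through df['idx_up'].iloc[...] may not write back to df), like
df.iloc[[x[0] for x in idx_up], df.columns.get_loc('idx_up')] = [x[0] for x in idx_up]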
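The length mismatch arises because np.argwhere returns only the positions where the condition is True, so it has one entry per match rather than one per row. A minimal sketch of a full-length alternative uses np.where to keep the position where the condition holds and -1 elsewhere. The small frame below is hypothetical sample data with the question's column names.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same column names as the question.
df = pd.DataFrame({
    "up_value": [59044.21, 59040.69, 59044.21, 59040.70],
    "valsup": [59044.21, 59158.53, 59279.92, 59394.23],
})

cond = df["valsup"].to_numpy() > df["up_value"].to_numpy()

# One value per row: the positional index where the condition is True, else -1.
df["idx_up"] = np.where(cond, np.arange(len(df)), -1)
print(df)
```

The result always has the same length as the DataFrame, so the assignment cannot raise the ValueError shown in the question.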
I have two columns: one is a string, and the other is a numpy array of floats.
a = 'this is string'
b = np.array([-2.355, 1.957, 1.266, -6.913])
I would like to store them in a row as separate columns in a hdf5 file. For that I am using pandas
hdf_key = 'hdf_key'
store5 = pd.HDFStore('file.h5')
z = pd.DataFrame(
    {
        'string': [a],
        'array': [b]
    })
store5.append(hdf_key, z, index=False)
store5.close()
However, I get this error
TypeError: Cannot serialize the column [array] because
its data contents are [mixed] object dtype
Is there a way to store this to h5? If so, how? If not, what's the best way to store this sort of data?
I can't help you with pandas, but I can show you how to do this with PyTables.
Basically, you create a table referencing either a numpy recarray or a dtype that defines the mixed datatypes.
Below is a super simple example that shows how to create a table with 1 string and 4 floats. It then adds rows of data to the table.
It shows 2 different methods to add data:
1. A list of tuples (1 tuple for each row) - see append_list
2. A numpy recarray (with dtype matching the table definition) - see simple_recarr in the for loop
To get the rest of the arguments for create_table(), read the PyTables documentation. It's very helpful, and should answer additional questions. Link below:
PyTables User's Guide
import tables as tb
import numpy as np
with tb.open_file('SO_55943319.h5', 'w') as h5f:
    my_dtype = np.dtype([('A','S16'),('b',float),('c',float),('d',float),('e',float)])
    dset = h5f.create_table(h5f.root, 'table_data', description=my_dtype)
    # Append one row using a list:
    append_list = [('test string', -2.355, 1.957, 1.266, -6.913)]
    dset.append(append_list)
    simple_recarr = np.recarray((1,),dtype=my_dtype)
    for i in range(5):
        simple_recarr['A']='string_' + str(i)
        simple_recarr['b']=2.0*i
        simple_recarr['c']=3.0*i
        simple_recarr['d']=4.0*i
        simple_recarr['e']=5.0*i
        dset.append(simple_recarr)
print ('done')
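If you would rather keep the four floats together as one array-valued field, as in the question, a NumPy structured dtype can carry a fixed-size sub-array. The sketch below only builds such a record in memory; the field names 'string' and 'array' are illustrative choices taken from the question's DataFrame. Whether create_table() accepts this exact dtype as its description is an assumption that has not been tested here, so check the PyTables documentation before relying on it.

```python
import numpy as np

a = 'this is string'
b = np.array([-2.355, 1.957, 1.266, -6.913])

# Assumed layout: a 16-byte string field plus a float sub-array of length 4.
row_dtype = np.dtype([('string', 'S16'), ('array', np.float64, (4,))])

# Build a one-row structured array and fill both fields.
rows = np.zeros(1, dtype=row_dtype)
rows['string'][0] = a      # stored as fixed-width bytes
rows['array'][0] = b       # stored as a 4-element float sub-array

print(rows)
print(rows['array'].shape)
```

Every row of such a record has the same byte size, which is what a table-style layout needs, but it also means the array length (4 here) must be fixed in advance.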
I have a dataframe that consists of two decimal values and an Id:
When I apply the as_matrix function to the x and y values, it yields an array that looks like this:
coords = df.as_matrix(columns=['x', 'y'])
coords
yields:
array([[ 0.0703843 , 0.170845 ],
[ 0.07022078, 0.17150128],
[ 0.07208886, 0.17159163],
...,
[ 0.07162819, 0.17044404],
[ 0.06951432, 0.17096308],
[ 0.07104143, 0.17040137]])
This immediately seemed strange, since the number of decimal places was inconsistent, but I assumed pandas was just shortening the values for display purposes.
But then when I tried to retrieve the IDs, I got only one or zero matches when they should all match:
ids = []
for coord in coords:
    try:
        _id = df.loc[df['x'] == coord[0]]['id'][1]
        ids.append(_id)
    except:
        pass
len(ids)
1
What I am trying to understand is why the as_matrix function extracts values from the data frame that cannot be looked up again, and how to retrieve the ids from the data frame.
Any help here would be appreciated.
Thanks
Edit
Below is a subset of the data frame in CSV:
,id,x,y
0,07379a26-2447-4fce-83ac-4784abf07389,0.07038429591623253,0.17084500318384327
1,f5cc3adb-0588-4705-b1a3-fe1b669b776f,0.07022078416348305,0.17150127781674332
2,b5a57ffe-8565-4443-9685-11675ce25dc4,0.07208886125821728,0.17159163002146055
3,940efcaa-6d9d-4b10-a0fe-d8ec8c1d9c7e,0.07057468050347501,0.1700482708522834
4,616d7794-565a-4d2d-98cb-334beb5b91ef,0.07057895306948389,0.170054305037284
5,e2d1819d-1f58-407d-9950-be0a0c00374b,0.07161607658023798,0.17013089473907284
6,6a739687-f9ad-47bd-8a4b-c47bc4b2aec6,0.070163429153604,0.16889764101717875
7,dd2df646-9a66-4baa-8815-d24f1858eda7,0.07035099968831582,0.16995622800529742
8,6a224d76-efea-4313-803d-c25b619dae0a,0.07066777462044714,0.17021849979554743
9,321147fa-ee51-4bab-9634-199c92a42d2f,0.06984869509314469,0.17098101436534555
10,e52d6289-01ba-4e7d-8054-bb9a349c0505,0.07068704829137691,0.17029718331066224
11,517f256b-6171-4d93-9b4b-0f81aac828fb,0.0713283119291569,0.16983952831019206
12,e339c742-9784-49fc-a435-790db0364229,0.07131341496221469,0.1698513011377732
13,6f20ad5a-22fb-43a2-8885-838e5161df14,0.06942397329210678,0.1716572235671854
14,f6e1008f-2b22-4d88-8c84-c0dc4f2d822e,0.06942427697939664,0.17165098925109726
15,8a2d35e5-10a2-4188-b98f-54200d2db8da,0.07048162129308791,0.16896051533992895
16,adab8fd8-4348-412d-85d2-01491886967b,0.07076495746208027,0.16966622176968035
17,df79523b-848b-45a9-8dab-fe53c2a5b62d,0.06988926585338372,0.17028143287771583
18,db05d97c-3b16-4da8-9659-820fc7e3f858,0.0713167479593096,0.1685149810693375
19,d43963d1-b803-473c-85dc-2ed2e9f77f4e,0.07045583812582461,0.1706502407290604
20,9d99c9a6-2de3-4e6a-9bd7-9d7ece358a2f,0.07044174575566758,0.17066067488910522
21,3eec44be-b9e2-45a2-b919-05028f5a0ba9,0.07079585677115756,0.16920818686920963
22,9f836847-2b67-4b33-930a-1f84452628ba,0.07078522829778934,0.16919781903167638
23,fbaa8958-a5d5-4dfb-91f7-8c11afe226a8,0.07128542860765898,0.16834798505762455
24,a84b59c4-4145-472d-a26a-4c930648c16c,0.07196635776157265,0.17047633495883885
25,29cf8ad3-7068-4207-b0a2-4f0cff337c9f,0.0719701195278871,0.17051442269732875
26,d0f512c8-5c4f-427a-99e1-ebb4c5b363e5,0.0718787509597688,0.17054903897593635
27,74b1db2d-002b-4f89-8d02-ac084e9a3cd5,0.07089130417373782,0.16981103290127117
28,89210a0c-8144-491d-9e98-19e7f4c3085e,0.07076060461092577,0.1707011426749184
29,aebb377e-7c26-4bb5-8563-c3055a027844,0.07103977816965212,0.17113978347674103
30,00b527a0-d40a-44b4-90f9-750fd447d2d7,0.07097785505134419,0.16963542019904118
31,8c186559-f50d-40ca-a821-11596e1e5261,0.06992637446216321,0.17110063865050085
32,0e64cf14-6ccd-4ad0-9715-ab410f6baf6a,0.0718311255786932,0.1705675237580442
33,f5479823-1efe-47b8-9977-73dc41d1d69e,0.07016981880399553,0.1703708437681898
34,385cfa13-2476-4e3d-b755-3063a7f802b9,0.07016550435008462,0.17037054473511137
35,a40bf573-b701-46f0-9a06-5857cf3ab199,0.0701443567773146,0.17035314147536326
36,0c5a9751-2c1b-4003-834d-9584d2f907a2,0.07016050805421256,0.17038992836178396
37,65b09067-9cf0-492d-8a70-13d4f92f8a10,0.07137336818557355,0.1684713798357405
The issue is with the df.loc function on geo-dataframes.
Once I exported it to a CSV and then re-read the dataframe using plain pandas, it seemed to work just fine.
Just letting whoever finds this know.
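Two side notes for plain pandas users. The uneven number of decimals in the printed array is only NumPy's display precision (8 digits by default), and the stored values are unchanged. Also, as_matrix was removed in pandas 1.0, and to_numpy() is its replacement. The sketch below uses the first three rows of the CSV subset above. It shows that the ids can be taken by position, since the rows of coords are in the same order as the rows of df, and it adds a tolerance-based lookup for cases where exact float equality is too strict. The helper name ids_near and its tol parameter are hypothetical.

```python
import numpy as np
import pandas as pd

# First three rows of the CSV subset from the question.
df = pd.DataFrame({
    "id": [
        "07379a26-2447-4fce-83ac-4784abf07389",
        "f5cc3adb-0588-4705-b1a3-fe1b669b776f",
        "b5a57ffe-8565-4443-9685-11675ce25dc4",
    ],
    "x": [0.07038429591623253, 0.07022078416348305, 0.07208886125821728],
    "y": [0.17084500318384327, 0.17150127781674332, 0.17159163002146055],
})

# to_numpy() replaces the removed as_matrix().
coords = df[["x", "y"]].to_numpy()

# Row k of coords comes from row k of df, so the ids line up by position.
ids = df["id"].to_numpy()


def ids_near(frame, x, tol=1e-12):
    """Hypothetical helper: ids whose x lies within tol of the given value."""
    mask = np.isclose(frame["x"], x, rtol=0, atol=tol)
    return frame.loc[mask, "id"].tolist()


print(ids_near(df, coords[1, 0]))
```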
If I have data as:
Code, data_1, data_2, data_3, [....], data204700
a,1,1,0, ... , 1
b,1,0,0, ... , 1
a,1,1,0, ... , 1
c,0,1,0, ... , 1
b,1,0,0, ... , 1
and so on: the same code can appear with different values (0, 1, or ? for not known).
I need to build a big matrix that I can analyze.
How can I import the data into a dictionary?
I want to use a dictionary keyed by column (204,700 + 1 columns).
Is there a built-in function (or package) that returns the pattern?
(I expect a percentage pattern, e.g. 90% of 1 in column 1, 80% of 1 in column 2.)
Alright, I am going to assume you want this in a dictionary for storage purposes, and I will tell you that you don't want that with this kind of data. Use a pandas DataFrame.
This is how you get your data into a dataframe:
import pandas as pd
my_file = 'file_name'
df = pd.read_csv(my_file)
Now, you don't need a package to return the pattern you are looking for; just write a simple function that computes it!
def one_percentage(data):
    # get total number of rows for calculating percentages
    size = len(data)
    # get the dtype of the first data column so only matching columns are used
    x = data.columns[1]
    x = data[x].dtype
    # list of tuples to hold the column names and the number of 1s
    ones = [(i, sum(data[i])) for i in data if data[i].dtype == x]
    my_dict = {}
    # create dictionary with column names and percent (0-100)
    for x in ones:
        percent = 100.0 * x[1] / size
        my_dict[x[0]] = percent
    return my_dict
Now if you want to get the percent of ones in any column, this is what you do:
percentages = one_percentage(df)
column_name = 'any_column_name'
print(percentages[column_name])
Now if you want to do this for every single column, you can grab all of the column names and loop through them:
columns = [name for name in percentages]
for name in columns:
    print(str(percentages[name]) + "% of 1 in column " + name)
Let me know if you need anything else!
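As a more compact variation on the same idea, here is a sketch that computes the percentages with vectorized pandas operations. It assumes that unknown values appear in the file as a literal "?" (the question only says "not known") and that unknowns should be left out of the denominator. The inline CSV is made-up sample data in the question's layout.

```python
import io
import pandas as pd

# Made-up sample in the question's layout; "?" is assumed to mark unknown values.
csv_text = """Code,data_1,data_2,data_3
a,1,1,0
b,1,0,0
a,1,1,0
c,0,1,0
b,1,0,?
"""

# Treat "?" as missing so the data columns stay numeric.
df = pd.read_csv(io.StringIO(csv_text), na_values="?")
values = df.drop(columns="Code")

# Percent of 1s per column, counting only the known (non-missing) entries.
percent_ones = (values == 1).sum() * 100 / values.notna().sum()

# A plain dict keyed by column name, if a dictionary is really needed.
percentages = percent_ones.to_dict()
print(percentages)
```

Dividing by the number of known entries rather than by len(df) is a design choice; use len(values) instead if unknown values should count against the percentage.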