Main Problem:
numpy arrays of the same type and same size are not being column stacked together using np.hstack, np.column_stack, or np.concatenate(axis=1).
Explanation:
I don't understand what property of a numpy array can change such that numpy.hstack, numpy.column_stack and numpy.concatenate(axis=1) stop working properly. I am having a problem getting my real program to stack by column - it only appends to the rows. Is there some property of a numpy array that would cause this? It doesn't throw an error; it just doesn't produce the "right" or "normal" behavior.
I have tried a simple case which works as I would expect it to:
input:
a = np.array([['1', '2'], ['3', '4']], dtype=object)
b = np.array([['5', '6'], ['7', '8']], dtype=object)
np.hstack((a, b))
output:
np.array([['1', '2', '5', '6'], ['3', '4', '7', '8']], dtype=object)
That's perfectly fine by me, and what I want.
However, what I get from my program is this:
First array:
[['29.8989', '0'] ['29.8659', '-8.54805e-005'] ['29.902', '-0.00015875']
..., ['908.791', '-0.015765'] ['908.073', '-0.0154842'] []]
Second array (to be added on in columns):
[['29.8989', '26.8556'] ['29.8659', '26.7969'] ['29.902', '29.0183'] ...,
['908.791', '943.621'] ['908.073', '940.529'] []]
What should be the two arrays side by side or in columns:
[['29.8989', '0'] ['29.8659', '-8.54805e-005'] ['29.902', '-0.00015875']
..., ['908.791', '943.621'] ['908.073', '940.529'] []]
Clearly, this isn't the right answer.
The module creating this problem is rather long (I will give it at the bottom), but here is a simplified version which still works (performs the correct column stacking), like the first example:
import numpy as np

def contiguous_regions(condition):
    d = np.diff(condition)
    idx, = d.nonzero()
    idx += 1
    if condition[0]:
        idx = np.r_[0, idx]
    if condition[-1]:
        idx = np.r_[idx, condition.size]
    idx.shape = (-1, 2)
    return idx

def is_number(s):
    try:
        np.float64(s)
        return True
    except ValueError:
        return False

total_array = np.array([['1', '2'], ['3', '4'], ['strings', 'here'], ['5', '6'], ['7', '8']], dtype=object)
where_number = np.array(map(is_number, total_array))
contig_ixs = contiguous_regions(where_number)
print contig_ixs
t = tuple(total_array[s[0]:s[1]] for s in contig_ixs)
print t
print np.hstack(t)
It basically looks through an array of lists and finds the longest set of contiguous numbers. I would like to column stack those sets of data if they are of the same length.
Here is the real module providing the problem:
import numpy as np

def retrieve_XY(file_path):
    # XY data is read in from a file in text format
    file_data = open(file_path).readlines()

    # The list of strings (lines in the file) is made into a list of lists, splitting by whitespace and removing commas
    file_data = np.array(map(lambda line: line.rstrip('\n').replace(',', ' ').split(), file_data))

    # Remove empty lists, make into numpy array
    xy_array = np.array(filter(None, file_data))

    # Each line is searched to make sure that all items in the line are a number
    where_num = np.array(map(is_number, file_data))

    # The data is searched for the longest contiguous chain of numbers
    contig = contiguous_regions(where_num)
    try:
        # Data lengths (number of rows) for each set of data in the file
        data_lengths = contig[:,1] - contig[:,0]
        # Get the maximum length of data (max number of contiguous rows) in the file
        maxs = np.amax(data_lengths)
        # Find the indices of this longest chain of data (index within the indices array of the file)
        # If there are two equally long lists of data, get both indices
        longest_contig_idx = np.where(data_lengths == maxs)
    except ValueError:
        print 'Problem finding contiguous data'
        return np.array([])

    ###############################################################################################
    ###############################################################################################
    # PROBLEM ORIGINATES HERE
    # Starting and stopping indices of the contiguous data are stored
    ss = contig[longest_contig_idx]

    # The file data with this longest contiguous chain of numbers
    # If there are multiple sets of data of the same length, they are added in columns
    longest_data_chains = tuple([file_data[i[0]:i[1]] for i in ss])
    print "First array:"
    print longest_data_chains[0]
    print
    print "Second array (to be added on in columns):"
    print longest_data_chains[1]

    column_stacked_data_chain = np.concatenate(longest_data_chains, axis=1)
    print
    print "What should be the two arrays side by side or in columns:"
    print column_stacked_data_chain
    ###############################################################################################
    ###############################################################################################

    xy = np.array(zip(*xy_array), dtype=float)
    return xy
#http://stackoverflow.com/questions/4494404/find-large-number-of-consecutive-values-fulfilling-condition-in-a-numpy-array
def contiguous_regions(condition):
    """Finds contiguous True regions of the boolean array "condition". Returns
    a 2D array where the first column is the start index of the region and the
    second column is the end index."""
    # Find the indices of changes in "condition"
    d = np.diff(condition)
    idx, = d.nonzero()
    # We need to start things after the change in "condition". Therefore,
    # we'll shift the index by 1 to the right.
    idx += 1
    if condition[0]:
        # If the start of condition is True, prepend a 0
        idx = np.r_[0, idx]
    if condition[-1]:
        # If the end of condition is True, append the length of the array
        idx = np.r_[idx, condition.size]
    # Reshape the result into two columns
    idx.shape = (-1, 2)
    return idx

def is_number(s):
    try:
        np.float64(s)
        return True
    except ValueError:
        return False
UPDATE:
I got it to work with the help of @hpaulj. Apparently the fact that the data was structured like np.array([['1','2'],['3','4']]) in both cases was not sufficient, since in the real case the dtype was object and there were some strings in the lists. Therefore, numpy was seeing a 1d array instead of the 2d array that is required.
The fix was to call map(float, data) on every list produced from the readlines output.
Here is what I ended up with:
import numpy as np

def retrieve_XY(file_path):
    # XY data is read in from a file in text format
    file_data = open(file_path).readlines()

    # The list of strings (lines in the file) is made into a list of lists, splitting by whitespace and removing commas
    file_data = map(lambda line: line.rstrip('\n').replace(',', ' ').split(), file_data)

    # Remove empty lists, make into numpy array
    xy_array = np.array(filter(None, file_data))

    # Each line is searched to make sure that all items in the line are a number
    where_num = np.array(map(is_number, xy_array))

    # The data is searched for the longest contiguous chain of numbers
    contig = contiguous_regions(where_num)
    try:
        # Data lengths
        data_lengths = contig[:,1] - contig[:,0]
        # All maximums in contiguous data
        maxs = np.amax(data_lengths)
        longest_contig_idx = np.where(data_lengths == maxs)
    except ValueError:
        print 'Problem finding contiguous data'
        return np.array([])

    # Starting and stopping indices of the contiguous data are stored
    ss = contig[longest_contig_idx]
    print ss

    # The file data with this longest contiguous chain of numbers
    # float must be cast to each value in the lists of the contiguous data, and the result cast to a numpy array
    longest_data_chains = np.array([[map(float, n) for n in xy_array[i[0]:i[1]]] for i in ss])

    # If there are multiple sets of data of the same length, they are added in columns
    column_stacked_data_chain = np.hstack(longest_data_chains)

    xy = np.array(zip(*column_stacked_data_chain), dtype=float)
    return xy

#http://stackoverflow.com/questions/4494404/find-large-number-of-consecutive-values-fulfilling-condition-in-a-numpy-array
def contiguous_regions(condition):
    """Finds contiguous True regions of the boolean array "condition". Returns
    a 2D array where the first column is the start index of the region and the
    second column is the end index."""
    # Find the indices of changes in "condition"
    d = np.diff(condition)
    idx, = d.nonzero()
    # We need to start things after the change in "condition". Therefore,
    # we'll shift the index by 1 to the right.
    idx += 1
    if condition[0]:
        # If the start of condition is True, prepend a 0
        idx = np.r_[0, idx]
    if condition[-1]:
        # If the end of condition is True, append the length of the array
        idx = np.r_[idx, condition.size]
    # Reshape the result into two columns
    idx.shape = (-1, 2)
    return idx

def is_number(s):
    try:
        np.float64(s)
        return True
    except ValueError:
        return False
This function will now take in a file and output the longest contiguous chain of numeric data found within it. If multiple data sets of the same length are found, it column stacks them.
It's the empty list at the end of your arrays that's causing your problem:
>>> a = np.array([[1, 2], [3, 4]])
>>> b = np.array([[1, 2], [3, 4], []])
>>> a.shape
(2L, 2L)
>>> a.dtype
dtype('int32')
>>> b.shape
(3L,)
>>> b.dtype
dtype('O')
Because of that empty list at the end, instead of creating a 2D array numpy creates a 1D object array, with each item holding a two-element list object.
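As a minimal sketch of the fix (assuming you simply want to drop the trailing empty row before stacking; the variable names here are illustrative, not from the original program):
import numpy as np
a = np.array([['1', '2'], ['3', '4'], []], dtype=object)    # trailing empty row
print(a.shape)        # (3,) -- a 1-D object array of lists, not a 2-D array
# Rebuild without the empty row; numpy can now infer a 2-D shape
a_fixed = np.array([row for row in a if len(row)], dtype=object)
print(a_fixed.shape)  # (2, 2)
b = np.array([['5', '6'], ['7', '8']], dtype=object)
print(np.hstack((a_fixed, b)))
# [['1' '2' '5' '6']
#  ['3' '4' '7' '8']]
Once both operands are genuinely 2-D, np.hstack, np.column_stack and np.concatenate(axis=1) all stack by column as expected.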
Related
cols = [2,4,6,8,10,12,14,16,18] # select the columns I want to work with
df = pd.read_csv('mywork.csv')
df1 = df.iloc[:, cols]
b= np.array(df1)
b
outcome
b = [['WV5 6NY' 'RE4 9VU' 'BU4 N90' 'TU3 5RE' 'NE5 4F']
['SA8 7TA' 'BA31 0PO' 'DE3 2FP' 'LR98 4TS' 0]
['MN0 4NU' 'RF5 5FG' 'WA3 0MN' 'EA15 8RE' 'BE1 4RE']
['SB7 0ET' 'SA7 0SB' 'BT7 6NS' 'TA9 0LP' 'BA3 1OE']]
a = np.concatenate(b) #concatenated to get a single array, this worked well
a = np.array([x for x in a if x != 'nan'])  # removed the nan
a = a[np.where(a != '0')]  # removed the zeros
print(np.sort(a)) # to sort alphabetically
#Sorted array
['BA3 1OE' 'BA31 0PO' 'BE1 4RE' 'BT7 6NS' 'BU4 N90'
'DE3 2FP' 'EA15 8RE' 'LR98 4TS' 'MN0 4NU', 'NE5 4F' 'RE4 9VU'
'RF5 5FG' 'SA7 0SB' 'SA8 7TA' 'SB7 0ET' 'TA9 0LP' 'TU3 5RE'
'WA3 0MN' 'WV5 6NY']
#Find the index position of all elements of b in a(sorted array)
def findall_index(b, a):
    result = []
    for i in range(len(a)):
        for j in range(len(a[i])):
            if b[i][j] == a:
                result.append((i, j))
    return result
print(findall_index(0,result))
I am still very new to Python. I tried finding the index positions of all elements of b in a above, but the code block above doesn't seem to give me any result. Please can someone help me.
Thank you in advance.
One way you could approach this is by zipping (creating pairs of) the index of each element in b with the element itself, and then sorting this new array based on the elements only. Now you have a mapping from indices of the original array to the new sorted array. You can then just loop over the sorted pairs to map the current index to the original index.
I would highly suggest you code this yourself, since it will help you learn!
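For illustration only, here is a minimal sketch of that pairing idea (flattening b first is an assumption, since b in the question is 2-D; names like flat and pairs are made up for the example):
import numpy as np
b = np.array([['WV5 6NY', 'RE4 9VU'],
              ['SA8 7TA', 'BA31 0PO']])
flat = b.ravel()                          # work on a flat view of b
pairs = list(enumerate(flat))             # (original_flat_index, value) pairs
pairs.sort(key=lambda p: p[1])            # sort by the value only
# Each position in the sorted order now maps back to an original flat index,
# which np.unravel_index converts to a (row, col) position in b.
for sorted_pos, (orig_idx, value) in enumerate(pairs):
    print(sorted_pos, value, np.unravel_index(orig_idx, b.shape))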
How could one map a list of ndarrays containing string objects to specific floats? For instance, if the user decides to map orange to 1.0 and grapefruit to 2.0?
myList = [np.array([['orange'], ['orange'], ['grapefruit']], dtype=object), np.array([['orange'], ['grapefruit'], ['orange']], dtype=object)]
So one would have:
convList = [np.array([[1.0], [1.0], [2.0]], dtype=float), np.array([[1.0], [2.0], [1.0]], dtype=float)]
I tried to implement this function:
def map_str_to_float(iterator):
    d = {}
    for ndarr in iterator:
        for string_ in ndarr:
            d[string_] = float(input('Enter your map for {}: '.format(string_)))
    return d
test = map_str_to_float(myList)
print(test)
But I get the following error:
d[string_] = float(input('Enter your map for {}: '.format(string_)))
TypeError: unhashable type: 'numpy.ndarray'
I believe it's because the type of string_ is a numpy array instead of a string...
Regarding the error: on debugging, string_ is an array, ['orange'], which can't be used as a dictionary key.
As for how to convert a list of ndarrays of strings into floats:
We use indices: get the indices of the strings, then use those indices to produce the new values in the same order.
Basically, np.array([1, 2])[[0, 1, 0, 0]] gives a new array of size 4 with entries taken in the order of the indices. The same logic applies here, which skips the dictionary mapping in Python. The mapping operation happens through indexing in C, so it should be fast.
The comments explain what happens:
import numpy as np
dataSet = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21')
# Get all the unique strings, and their indices
# Values of indices are based on uniques ordering
uniques, indices = np.unique(dataSet, return_inverse=True)
# >>> uniques
# array(['george', 'greg', 'kevin'], dtype='<U21')
# >>> indices
# array([2, 1, 0, 2])
# Original array
# >>> uniques[indices]
# array(['kevin', 'greg', 'george', 'kevin'], dtype='<U21')
new_indices = np.array([float(input()) for e in uniques])
# Get new indices indexed using original positions of unique strings in numpy array
print(new_indices[indices])
# You can do the same for multi dimensional arrays
With that nested loop you will ask the user for input 6 times (but you have only 2 distinct values, grapefruit and orange). I would suggest getting the unique values first and asking only for those:
To do so:
unique_values = np.unique(np.array(myList))
Now ask the user for a number for each unique value:
d = {}
for unique_value in unique_values:
    d[unique_value] = float(input(f"give me a number for {unique_value} "))
Now you have your map in the variable d.
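To actually build convList from that map, one option (a sketch, not part of the original answer; it assumes every string in myList has an entry in d) is to vectorize d.get over each array:
import numpy as np
# d = {'orange': 1.0, 'grapefruit': 2.0}  # as collected from the user above
convList = [np.vectorize(d.get, otypes=[float])(arr) for arr in myList]
# roughly: [array([[1.], [1.], [2.]]), array([[1.], [2.], [1.]])]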
Update after a comment
Then you can write your own unique method.
Please note that the code below collects all unique values regardless of the arrays' lengths, as long as each entry holds a single value.
unique_values = []
for each_ndarray in myList:
    for value in each_ndarray:
        if value[0] not in unique_values:
            unique_values.append(value[0])
I am trying to get the elements in an ndarray that are strings. That is, exclude the elements that are integers and floats.
Let's say I have this array:
x = np.array([1,'hello',2,'world'])
I want it to return:
array(['hello', 'world'], dtype=object)
I've tried doing np.where(x == np.str_) to get the indices where that condition is true, but it's not working.
Any help is much appreciated.
You can make a function to do it, and loop over the array:
def getridofnumbers(num):
    try:
        x = int(num)
    except:
        return True
    return False
output = np.array([i for i in x if getridofnumbers(i)])
If we want to keep all the numpy goodness (broadcasting etc.), we can convert that into a ufunc using vectorize (or np.frompyfunc):
import numpy as np
# vectorize the function, with a boolean return type
getrid = np.vectorize(getridofnumbers, otypes=[bool])
x[getrid(x)]
array(['hello', 'world'], dtype='<U11')
#or ufunc, which will require casting:
getrid = np.frompyfunc(getridofnumbers, 1, 1)
x[getrid(x).astype(bool)]
When you run x = np.array([1,'hello',2,'world']), numpy converts everything to string type.
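You can verify this by checking the dtype (the exact string width may vary with platform and numpy version):
import numpy as np
x = np.array([1, 'hello', 2, 'world'])
print(x.dtype)   # e.g. <U21 -- everything was cast to a fixed-width Unicode string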
If it is a one-dimensional array, you can use:
y = np.array([i for i in x if not i.replace(".","",1).replace("e+","").replace("e-","").replace("-","").isnumeric()])
to get all non-numeric values.
It can identify floats with a negative sign and e+/e- notation.
For example, for the input x = np.array([1, 'hello', +2e-50, 'world', 2e+50, -2, 3/4, 6.5, "!"])
the output will be: array(['hello', 'world', '!'], dtype='<U5')
To keep it simple, I am listing the problem out in numbered form:
1) I have a list with filenames
2) I would like to extract the record (row) from a numpy array whose row-wise sum is the highest compared to the other records (rows)
Please find the screenshot below for reference
What I have done is create an array and compute the sums using the np.sum function. However, I am not able to find a method to extract the row based on this sum condition. I would like to have only the specific row and its sum value, which can be tagged to an element in the list. Is there an elegant Python function to do this?
t1 = ['abc_1.png','abc_2.png'] -- list with filenames as elements
arr_1 = np.random.rand(3,3) -- array 1
arr_2 = np.random.rand(3,3) -- array 2
arr1_sum = np.sum(arr_1,axis=1)
arr2_sum = np.sum(arr_2,axis=1) -- the last two statements return arrays. I would like to extract the corresponding row/record which contributes to that sum and tag it to the first and second elements of the list (abc_1.png)
The expected output can be in either list or dictionary form. Please find the sample screenshot below
You are looking for np.argmax:
max_row = arr_2[np.argmax(arr2_sum), :]
output = list(max_row)
output.append(np.max(arr2_sum))
output = {'abc_2.png' : output}
If you want to do this iteratively for a list of files and arrays, you could:
files = ['abc_1.png','abc_2.png']
arr_1 = np.random.rand(3,3)
arr_2 = np.random.rand(3,3)
arrays = [arr_1, arr_2]
sums = [np.sum(arr, axis=1) for arr in arrays]
output_dict = {}
for i in range(len(files)):
    max_index = int(np.where(sums[i] == max(sums[i]))[0])
    output_dict[files[i]] = arrays[i][max_index]
And as @warped said, you could use np.argmax() to forego the for loop:
files = ['abc_1.png','abc_2.png']
arr_1 = np.random.rand(3,3)
arr_2 = np.random.rand(3,3)
arrays = [arr_1, arr_2]
sums = [np.sum(arr, axis=1) for arr in arrays]
output_dict = {files[i]: list(arrays[i][np.argmax(sums[i]), :]) for i in range(len(files))}
I have a text file with letters (tab delimited), and a numpy array (obj) with a few letters (a single row). The text file has rows with different numbers of columns. Some rows in the text file may have multiple copies of the same letter (I would like to consider only a single copy of a letter in each row). Letters in the same row of the text file are assumed to be similar to each other. Also, each letter of the numpy array obj is present in one or more rows of the text file.
Below is an example of the text file:
b q a i m l r
j n o r o
e i k u i s
In the above example, the letter o appears twice in the second row, and the letter i appears twice in the third row. I would like to consider only single copies of letters in the rows of the text file.
This is an example of obj: obj = np.asarray(['a', 'e', 'i', 'o', 'u'])
I want to compare obj with rows of the text file and form clusters from elements in obj.
This is how I want to do it. Corresponding to each row of the text file, I want to have a list which denotes a cluster (in the above example we will have three clusters, since the text file has three rows). For every given element of obj, I want to find the rows of the text file where the element is present. Then, I would like to assign the index of that element of obj to the cluster which corresponds to the row with maximum length (row lengths are determined after reducing every row to single copies of its letters).
Below is the Python code that I have written for this task:
import pandas as pd
import numpy as np

data = pd.read_csv('file.txt', sep=r'\t+', header=None, engine='python').values[:,:].astype('<U1000')
obj = np.asarray(['a', 'e', 'i', 'o', 'u'])

for i in range(data.shape[0]):
    globals()['data_row' + str(i).zfill(3)] = []
    globals()['clust' + str(i).zfill(3)] = []
    for j in range(len(obj)):
        if obj[j] in set(data[i, :]): globals()['data_row' + str(i).zfill(3)] += [j]

for i in range(len(obj)):
    globals()['obj_lst' + str(i).zfill(3)] = [0]*data.shape[0]
    for j in range(data.shape[0]):
        if i in globals()['data_row' + str(j).zfill(3)]:
            globals()['obj_lst' + str(i).zfill(3)][j] = len(globals()['data_row' + str(j).zfill(3)])
    indx_max = globals()['obj_lst' + str(i).zfill(3)].index( max(globals()['obj_lst' + str(i).zfill(3)]) )
    globals()['clust' + str(indx_max).zfill(3)] += [i]

for i in range(data.shape[0]): print globals()['clust' + str(i).zfill(3)]
>> [0]
>> [3]
>> [1, 2, 4]
The above code gives me the right answer. But in my actual work, the text file has tens of thousands of rows and the numpy array has hundreds of thousands of elements, and the above code is not very fast. So I want to know whether there is a better (faster) way to implement the above functionality in Python.
You can do it using merge after a stack on data (in pandas), then some groupby with nunique and idxmax to get what you want:
#keep data in pandas
data = pd.read_csv('file.txt', sep=r'\t+', header=None, engine='python')
obj = np.asarray(['a', 'e', 'i', 'o', 'u'])
#merge to keep only the letters from obj
df = (data.stack().reset_index(0,name='l')
.merge(pd.DataFrame({'l':obj})).set_index('level_0'))
#get the len of unique element of obj in each row of data
# and use transform to keep this length along each row of df
df['len'] = df.groupby('level_0').transform('nunique')
#get the result you want in a series
res = (pd.DataFrame({'data_row':df.groupby('l')['len'].idxmax().values})
.groupby('data_row').apply(lambda x: list(x.index)))
print(res)
data_row
0 [0]
1 [3]
2 [1, 2, 4]
dtype: object
res contains the clusters, with the index being the row in the original data.