Find most common string in 2D numpy array - python

I'm making a 2D numpy array in Python which looks like this:
[['0.001251993149471442' 'herfst']
 ['0.002232327408019874' 'herfst']
 ['0.002232327408019874' 'herfst']
 ['0.002232327408019874' 'winter']
 ['0.002232327408019874' 'winter']]
I want to get the most common string from the entire array.
I did find some ways to do this already, but all of them have the same problem: they won't work because there are two data types in the array.
Is there an easier way to get the most common element from an entire column (not row) besides just running it through a for loop and counting?

You can get a count of all the values using numpy and collections. It's not clear from your question whether the numeric values in your 2D list are actually numbers or strings, but this works for both as long as the numeric values are first and the words are second:
import numpy
from collections import Counter

input1 = [['0.001251993149471442', 'herfst'], ['0.002232327408019874', 'herfst'], ['0.002232327408019874', 'herfst'], ['0.002232327408019874', 'winter'], ['0.002232327408019874', 'winter']]
input2 = [[0.001251993149471442, 'herfst'], [0.002232327408019874, 'herfst'], [0.002232327408019874, 'herfst'], [0.002232327408019874, 'winter'], [0.002232327408019874, 'winter']]

def count(input):
    oneDim = list(numpy.ndarray.flatten(numpy.array(input)))  # flatten the 2D list into one flat list
    del oneDim[0::2]  # remove the 'numbers' (i.e. elements at even indices)
    counts = Counter(oneDim)  # get a count of all unique elements
    maxString = counts.most_common(1)[0]  # find the most common one
    print(maxString)

count(input1)
count(input2)
If you want to also include the numbers in the count, simply skip the line del oneDim[0::2]
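If you prefer to stay entirely in numpy, here is a minimal sketch of the same idea using numpy.unique with return_counts=True, assuming arr is the 2D string array from the question and the words are in the second column:
import numpy as np

arr = np.array([['0.001251993149471442', 'herfst'],
                ['0.002232327408019874', 'herfst'],
                ['0.002232327408019874', 'herfst'],
                ['0.002232327408019874', 'winter'],
                ['0.002232327408019874', 'winter']])

words, counts = np.unique(arr[:, 1], return_counts=True)  # unique words and their frequencies
print(words[counts.argmax()])  # the most frequent word: 'herfst'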

Unfortunately, the mode() method exists only in Pandas, not in Numpy,
so the first step is to flatten your array (arr) and convert it to
a pandasonic Series:
s = pd.Series(arr.flatten())
Then if you want to find the most common string (and note that Numpy
arrays have all elements of the same type), the most intuitive solution
is to execute:
s.mode()[0]
(s.mode() alone returns a Series, so we just take the initial element
of it).
The result is:
'0.002232327408019874'
But if you want to leave out strings that are convertible to numbers,
you need a different approach.
Unfortunately, you cannot use s.str.isnumeric() because it finds
strings composed solely of digits, but your "numeric" strings contain
also dots.
So you have to narrow down your Series (s) using str.match and
then invoke mode:
s[~s.str.match(r'^[+-]?(?:\d+|\d+\.\d*|\d*\.\d+)$')].mode()[0]
This time the result is:
'herfst'
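Putting it together, a minimal end-to-end sketch (assuming arr is the 2D string array from the question):
import numpy as np
import pandas as pd

arr = np.array([['0.001251993149471442', 'herfst'],
                ['0.002232327408019874', 'herfst'],
                ['0.002232327408019874', 'herfst'],
                ['0.002232327408019874', 'winter'],
                ['0.002232327408019874', 'winter']])

s = pd.Series(arr.flatten())  # one flat Series of all cells
numeric = s.str.match(r'^[+-]?(?:\d+|\d+\.\d*|\d*\.\d+)$')  # True for number-like strings
print(s[~numeric].mode()[0])  # 'herfst'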

Is there a Numpy function for casting just one element to a different type than the rest?

My code is the following:
file_ = open('file.txt', 'r')
lines = file_.readlines()
data = []
for row in lines:
    temp = row.split()
    data.append(np.array(temp).astype(np.float64))
I want to cast every item in the array to float EXCEPT the final one, which I want to remain a string.
How can I do this?
No, there is no function to cast elements of the same array to different types. Unlike regular Python lists, numpy arrays are homogeneous and store elements with fixed physical record sizes, so each element of the array must always have the same type.
You could handle the strings separately and parse only the numeric part into a numpy array:
for row in lines:
    temp = row.split()
    numbers = temp[:-1]
    stringbit = temp[-1]
    data.append(np.array(numbers).astype(np.float64))
Alternatively, if your data is very consistent and each line always has the same type structure, you might be able to use a more complex numpy dtype and numpy.genfromtxt to make each line an element of a larger structured array (see the sketch below).
You might also find a pandas.DataFrame fits better for working with this kind of heterogeneous data.
A related question with useful details: NumPy array/matrix of mixed types
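As a rough illustration of that structured-dtype route (a sketch only; the number of float columns, the field names and the S10 string width are assumptions, not the asker's actual data):
import numpy as np

# suppose each line of file.txt looks like: "1.0 2.5 3.7 label"
dtype = np.dtype([('a', np.float64), ('b', np.float64), ('c', np.float64), ('label', 'S10')])
data = np.genfromtxt('file.txt', dtype=dtype)

print(data['a'])      # the first float column as an ndarray
print(data['label'])  # the trailing strings as bytes of length <= 10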
You can use recarrays.
If your rows are records with similar data, you can create a custom dtype that does what you want. The requirement for a homogeneous data type in this case is that the number of elements is constant and there is an upper bound on the number of characters in the final string.
Here is an example that assumes the string only holds ASCII characters:
import numpy as np

max_len = 10
dtype = np.dtype([('c1', np.float64), ('c2', np.float64), ('c3', np.float64), ('str', f'S{max_len}')])
row = [(10.0, 1.2, 4.5, b'abc')]
result = np.array(row, dtype)
If you don't want to name each float column separately, you can make that field a subarray:
dtype = np.dtype([('flt', np.float64, 3), ('str', f'S{max_len}')])
row = [([10.0, 1.2, 4.5], b'abc')]
If the strings are not of a known length, you can use the object dtype in that field and simply store a reference.
Even though it's possible, you may find it simpler to just load the floats into one array and the strings into another. I generally find it simpler to work with arrays of a homogeneous built-in dtype than with recarrays.
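For completeness, a minimal sketch of that simpler two-array approach applied to the asker's loop (the file name and line layout are taken from the question):
import numpy as np

floats = []
labels = []
with open('file.txt', 'r') as file_:
    for row in file_:
        temp = row.split()
        floats.append([float(x) for x in temp[:-1]])  # all tokens except the last, as floats
        labels.append(temp[-1])                       # keep the trailing string separately

floats = np.array(floats, dtype=np.float64)  # homogeneous float array
labels = np.array(labels)                    # homogeneous string array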

How to sort a 2 dimension array of integers? In Python 3 with numpy

How can I sort an array like this: arr=[[2,1,1,2,3,3],[1,1,2,3,2,2],[1,2,1,3,2,2]]
Into: sorted_arr=[[1,1,2,3,2,2],[1,2,1,3,2,2],[2,1,1,2,3,3]]
That's not part of my code, it's just an example of what I need. I have an array with a lot of arrays of integers in it, and the integers are 1, 2 and 3. I want to sort it; for example, if one array is 111111111 and sits in the middle of the main array, I want it at the beginning.
The logic is that in my real code I have two arrays and I compare them in a nested loop. To make it faster, having very similar elements at the beginning would speed up my code a lot, so that's why I want to sort it. The array contains many arrays of split-up integers, so I want to sort those arrays as if each one were a single integer.
sorted(arr)
works for me. Have you tried it?
According to your description, I guess you want to sort the rows according to the columns by interpreting the columns as the keys of primary-order, secondary-order, etc. If that is the case, numpy.lexsort can do a good job.
Try this code
import numpy as np
arr = np.array([[2,1,1,2,3,3],
                [1,1,2,3,2,2],
                [1,2,1,3,2,2]])
argsorted = np.lexsort(arr.transpose()[::-1])
print(arr[argsorted])
You can easily transform arr[argsorted] into a nested list with arr[argsorted].tolist().
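On the example array this prints the rows in exactly the order the question asks for:
[[1 1 2 3 2 2]
 [1 2 1 3 2 2]
 [2 1 1 2 3 3]]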

I want to iterate over this structure. It is treating everything inside as one element and not as separate numbers

loss=[[ -137.70171527 -81408.95809899 -94508.84395371 -311.81198933 -294.08711874]]
When I print loss it prints the sum of the numbers and not the individual numbers. I want to change this to a list so I can iterate over each individual number, but I don't know how, please help.
I have tried:
result = map(tuple,loss)
However, it still prints the sum of the contents. When I try to index it, it says there is only 1 element. It works if I put commas in between, but this is a matrix that is output from other code, so I can't change or add to it.
You have a list of a list of numbers: the outer list contains exactly one element (namely the inner list). The inner list is your list of numbers over which you want to iterate. Hence, to iterate over the inner list you first have to access it from the outer list using an index, for example like this:
for list_item in loss[0]:
    do_something_for_each_element(list_item)
Moreover, I think that you wanted to have separate elements in the inner list rather than computing one single number, didn't you? If that is the case, you have to separate the elements using commas.
E.g.:
loss=[[-137.70171527, -81408.95809899, -94508.84395371, -311.81198933, -294.08711874]]
EDIT:
As you clarified in the comments, you want to iterate over a numpy matrix. One way to do so is by converting the matrix into an n-dimensional array (ndarray) and then iterating over that structure. This could, for example, look like the code below; other options have also been presented in this answer (Iterate over a numpy Matrix rows):
import numpy as np
test_matrix=np.matrix([[1, 2], [3, 4]])
for row in test_matrix.A:
    print(row)
Note that the A attribute of a matrix object is its ndarray representation (https://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html).
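If you need every individual number rather than whole rows, a minimal sketch (assuming loss is the 1xN matrix from the question) could flatten it first:
import numpy as np

loss = np.matrix([[-137.70171527, -81408.95809899, -94508.84395371,
                   -311.81198933, -294.08711874]])

for value in loss.A1:  # A1 is the flattened ndarray view of a matrix
    print(value)       # each number on its own line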

Replace slice of a numpy array with values from another array

Say I've got two numpy arrays which were created this way:
zeros = np.zeros((270,270))
ones = np.ones((150,150))
How can I insert ones in zeros at position [60,60]?
I want an array that looks like a "square in the square".
I've tried the following two options:
np.put(empty, [60,60], ones)
np.put(empty, [3541], ones)
np.put[empty, [60:210,60:210], ones)
but the latter yields invalid syntax and the first two don't work either. Has anyone got an idea how this could work?
This is one way you can replace values in zeros with ones.
zeros[60:210,60:210] = ones
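A minimal self-contained sketch of that assignment (shapes taken from the question; the slice must have the same shape as ones, and 210 - 60 == 150):
import numpy as np

zeros = np.zeros((270, 270))
ones = np.ones((150, 150))

zeros[60:210, 60:210] = ones  # slice shape (150, 150) matches ones exactly

print(zeros.sum())  # 22500.0, i.e. 150 * 150 ones were inserted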

"list indices must be integers, not tuple" but it worked before?

I have a 2d list of the form:
d = [[0.87768026489137663, -0.42848220833223599],
     [0.87770426313019434, -0.428411425505765],
     [0.87796388044104012, -0.42873867479872063],
     [0.87801587662514491, -0.42860583582101786],
     [0.87794315468933382, -0.42847396647067809]]
I want to get a single column from it. I've done this before in a different program using d[:,0] or d[:,1] and it worked perfectly. But now when I try that I get the error: list indices must be integers, not tuple. I know this must be a really simple fix but I'm just not sure what's wrong. I'm using Python 3.4 if that matters.
You have a list of lists. What you want to do is iterate through the list of lists, and for every sub-list, pick out the first item if you want the first column, or the second item if you want the second column, etc. The following one-liner will do that:
column = [x[0] for x in d]
Note that x[0] selects the first item in the sub-list. If you want the second item, take x[1], etc. Generally, if you want the nth column in your 2d list (call it d), the code to grab that column is:
column = [x[n] for x in d]
In Python you can't get a column from a nested list using R-style notation; for that you need the numpy library. If you want to get column i using pure Python, just do:
columns = list(map(list, zip(*d)))  # transpose the list of lists
column_i = columns[i]  # i is the column that you want
Example
d = [[1, 2], [3, 4]]
new_d = list(zip(*d))
>> [(1, 3), (2, 4)]
list(map(list, new_d))
>> [[1, 3], [2, 4]]
It seems to me that you are going to use the data in the list for further calculations. My favourite for handling such lists is "numpy". If you import the numpy module, you can access the data like you proposed:
import numpy as np
d = np.array([[0.87768026489137663, -0.42848220833223599],
              [0.87770426313019434, -0.428411425505765],
              [0.87796388044104012, -0.42873867479872063],
              [0.87801587662514491, -0.42860583582101786],
              [0.87794315468933382, -0.42847396647067809]])
d[:,1]
output:
array([-0.42848221, -0.42841143, -0.42873867, -0.42860584, -0.42847397])
I find it much easier to use numpy for such data as it is more intuitive to use than list comprehensions.
Hope this helps.
