I have two matrices that I want to use for part-of-speech tagging. The first contains the POS tag probabilities and the second contains the word probabilities. I need to extract the numbers and sum the matrices. The problem is that when I access a cell, I get the string key along with the number, but I only need the number. How can I access just the numbers? (Is this a correct way of building the matrices? If not, how can I build them with the tags as row and column headers?)
import numpy as np
A = np.array([[{'ARTART':0}],[{'ARTN':1}],[{'ARTV':0}],[{'ARTP':0}],
[{'NART':0}],[{'NN':0.13}],[{'NV':0.43}],[{'NP':0.44}],
[{'VART':0.65}],[{'VN':0.35}],[{'VV':0}],[{'VP':0}],
[{'PART':0.74}],[{'PN':0.26}],[{'PV':0}],[{'PP':0}],
[{'NULLART':0.71}],[{'NULLN':0.29}],[{'NULLV':0}],[{'NULLP':0}]]).reshape(5,4)
#print (A)
B = np.array([[{'ARTflies':0}],[{'ARTlike':0}],[{'ARTa':0.36}],[{'ARTflower':0}],
[{'Nflies':0.025}],[{'Nlike':0.012}],[{'Na':0.001}],[{'Nflower':0.063}],
[{'Vflies':0.076}],[{'Vlike':0.1}],[{'Va':0}],[{'Vflower':0.05}],
[{'Pflies':0}],[{'Plike':0.068}],[{'Pa':0}],[{'Pflower':0}]]).reshape(4,4)
#print (B)
#print (A[4][0])
I think you could achieve this task by using just 2 dictionaries, one for each array which you are currently making:
A = {'ARTART':0, 'ARTN':1, 'ARTV': 0} # and so on
Then you can grab the values of each entry in the dictionary with:
A_val = A.values()
And finally you can sum the values with:
A_sum = sum(A_val)
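For the part of the question about row and column headers, here is a minimal sketch of one alternative, assuming the first matrix is meant to be indexed by a previous tag (rows) and a next tag (columns). It keeps a plain numeric NumPy array and stores the labels separately; the names row_tags, col_tags and prob are illustrative, not from the question.

```python
import numpy as np

# Labels kept separately from the numbers (names are illustrative).
row_tags = ["ART", "N", "V", "P", "NULL"]
col_tags = ["ART", "N", "V", "P"]

# Values copied from the question's first matrix, one row per row tag.
A = np.array([
    [0.00, 1.00, 0.00, 0.00],   # ART
    [0.00, 0.13, 0.43, 0.44],   # N
    [0.65, 0.35, 0.00, 0.00],   # V
    [0.74, 0.26, 0.00, 0.00],   # P
    [0.71, 0.29, 0.00, 0.00],   # NULL
])

# Map each label to its row or column position.
row_index = {tag: i for i, tag in enumerate(row_tags)}
col_index = {tag: j for j, tag in enumerate(col_tags)}

def prob(row_tag, col_tag):
    """Look up one cell by its labels instead of by a concatenated key."""
    return A[row_index[row_tag], col_index[col_tag]]

print(prob("NULL", "ART"))  # the question's 'NULLART' entry
print(A.sum())              # sum of all entries
```

Because the array holds only numbers, A.sum(), A[4, 0] and slicing work directly, and separate labels avoid ambiguous concatenated keys such as 'PART'.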
The problem: given a (large) Python list-of-lists, or, semi-equivalently, a numpy array, extract information from the array in a SQL-like manner, i.e., as if the array were a database.
For example: given a 4-column by (large) N-row array, extract the rows where the first column is equal to X. In SQL this would be:
SELECT * FROM array WHERE col_1_id = X
In Python, however... ¯\_(ツ)_/¯
An attempt to make the issue clearer:
The array in question holds in each sublist/row [M, a^2, b^2, c^2], where M is the sum of the squares. The list contains millions of entries, and M ranges from ~100 to ~10^6.
The desire is to extract from this data only the rows for which at least 8 different rows have the same sum. Naively we can do this with a loop:
Output = []
for i in [0..10^6]:
    newarray = []
    for row in array:
        if row[0] == i:
            newarray.append(row)
    if len(newarray) >= 8:
        Output.extend(newarray)
save(Output, 'outputfilename')
This output is a much shorter and more workable array. But my understanding is that this is incredibly inefficient: we're looping through a million-row array a million times, which is a trillion comparisons.
Were this data in a database, I could grab it with:
SELECT * FROM array WHERE col_1 = i AND COUNT(i) >= 8
(depending on which SQL this might take a different form).
So far as I can tell, neither Python nor numpy has built-in functions that act like this. I don't expect the language to parse a SQL query, but there must be some tool within the language that approximates this function.
NumPy has a select function that doesn't actually select rows in this way, and some other functions that sound like they might make these operations possible but seem to do nothing of the sort.
I have seen things somewhat like this done using collections.Counter(), but I'm not sure whether it can do what I want here, or how. The documentation is thin on examples.
I'm aware of the fact that this may be an XY question, and have hence attempted to leave out the X except as examples of what I've tried. I am, however, in need of tools using Python (via SageMath/Jupyter). If there's a way of directly storing numpy/Python data in a database-like format and hitting it with SQL-like queries, that would be great too.
This might not be exactly what you are looking for, but I hope it can be helpful either way. :) I wrote a loop implementation that should be more efficient than the one you provided since we only loop through the column twice. We use a dictionary to keep track of the number of times a specific value in the first column occurs.
countDict = {}
# Count the number of times each sum occurs in the first column of the array
for row in array:
    if row[0] in countDict:
        # If the row's sum is already in the dictionary, increment its count
        countDict[row[0]] += 1
    else:
        # Otherwise record the first occurrence
        countDict[row[0]] = 1
output = []  # Output to generate
# Loop through the array again
for row in array:
    # If the sum occurred at least 8 times, add the row to the output list
    if countDict[row[0]] >= 8:
        output.append(row)
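The same two-pass idea can be written more compactly with collections.Counter, which the question mentions. This is only a sketch on made-up sample rows; the threshold is lowered from 8 to 3 so that the toy data produces output.

```python
from collections import Counter

# Hypothetical small data set: each row is [M, a^2, b^2, c^2].
array = [
    [14, 1, 4, 9],
    [14, 4, 1, 9],
    [29, 4, 9, 16],
    [14, 9, 4, 1],
    [29, 16, 9, 4],
]
min_count = 3  # the question uses 8; lowered here for the toy data

# First pass: count how often each value of M appears in the first column.
counts = Counter(row[0] for row in array)

# Second pass: keep only the rows whose M occurs at least min_count times.
output = [row for row in array if counts[row[0]] >= min_count]
print(output)
```

For data already stored in a NumPy array, numpy.unique with return_inverse=True and return_counts=True can build the same per-row counts without a Python loop.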
I am a bit new to Python. I am enumerating through a large list of data, as shown below, and would like to find the mean of every line.
for index, line in enumerate(data):
    # calculate the mean
However, the lines of this particular set of data are as such:
[array([[2.3325655e-10, 2.4973504e-10],
[1.3025138e-10, 1.3025231e-10]], dtype=float32)].
I would like to find the mean of both 2x1s separately, then obtain a list of those two averages.
The question contains a list of lists which is inside a list.
So you just need to get rid of the outside list first (it has only one element). Then you can find the average (mean) of the other lists using the statistics module.
It will look like this:
import statistics
# given data
x = [[[2.3325655e-10, 2.4973504e-10],
[1.3025138e-10, 1.3025231e-10]]]
x = x[0] # remove outer list
# elements via list comprehension (other methods too)
s0 = [a[0] for a in x]
s1 = [a[1] for a in x]
# print results
print(statistics.mean(s0))
print(statistics.mean(s1))
The result looks like this:
1.81753965e-10
1.89993675e-10
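Since each line in the question is already a NumPy array wrapped in a one-element list, the column means can also be taken with NumPy directly. A minimal sketch, assuming every line has that same shape; data here is a one-line stand-in for the asker's larger list.

```python
import numpy as np

# Stand-in for the asker's data: each line is a list holding one 2x2 array.
data = [
    [np.array([[2.3325655e-10, 2.4973504e-10],
               [1.3025138e-10, 1.3025231e-10]], dtype=np.float32)],
]

means = []
for index, line in enumerate(data):
    block = line[0]                  # unwrap the outer one-element list
    column_means = block.mean(axis=0)  # average down each column
    means.append(column_means.tolist())

print(means)
```

Use axis=1 instead of axis=0 if the averages are wanted per row rather than per column. Because the data is float32, the printed values may differ from the float64 results in the last digits.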
I'm making a 2D numpy array in Python which looks like this:
['0.001251993149471442' 'herfst'] ['0.002232327408019874' 'herfst'] ['0.002232327408019874' 'herfst'] ['0.002232327408019874' 'winter'] ['0.002232327408019874' 'winter']
I want to get the most common string from the entire array.
I did find some ways to do this already, but all of them have the same problem: they won't work because there are two datatypes in the array.
Is there an easier way to get the most common element from an entire column (not row) besides just running it through a for loop and counting?
You can get a count of all the values using numpy and collections. It's not clear from your question whether the numeric values in your 2D list are actually numbers or strings, but this works for both as long as the numeric values are first and the words are second:
import numpy
from collections import Counter
input1 = [['0.001251993149471442', 'herfst'], ['0.002232327408019874', 'herfst'], ['0.002232327408019874', 'herfst'], ['0.002232327408019874', 'winter'], ['0.002232327408019874', 'winter']]
input2 = [[0.001251993149471442, 'herfst'], [0.002232327408019874, 'herfst'], [0.002232327408019874, 'herfst'], [0.002232327408019874, 'winter'], [0.002232327408019874, 'winter']]
def count(input):
    oneDim = list(numpy.ndarray.flatten(numpy.array(input)))  # flatten the list
    del oneDim[0::2]  # remove the 'numbers' (i.e. elements at even indices)
    counts = Counter(oneDim)  # get a count of all unique elements
    maxString = counts.most_common(1)[0]  # find the most common one
    print(maxString)
count(input1)
count(input2)
If you want to also include the numbers in the count, simply skip the line del oneDim[0::2]
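If only the second column matters, another option is to slice that column out and count it with numpy.unique, which avoids flattening and deleting elements. A minimal sketch, assuming the words are always in column 1:

```python
import numpy as np

arr = np.array([
    ["0.001251993149471442", "herfst"],
    ["0.002232327408019874", "herfst"],
    ["0.002232327408019874", "herfst"],
    ["0.002232327408019874", "winter"],
    ["0.002232327408019874", "winter"],
])

# Count the distinct values in the second column only.
labels, counts = np.unique(arr[:, 1], return_counts=True)

# Pick the label with the highest count.
most_common = labels[counts.argmax()]
print(most_common, counts.max())
```

When two labels are tied, argmax returns the first of them in the sorted order that numpy.unique produces.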
Unfortunately, a mode() method exists in pandas but not in NumPy,
so the first step is to flatten your array (arr) and convert it to
a pandas Series:
s = pd.Series(arr.flatten())
Then if you want to find the most common string (and note that Numpy
arrays have all elements of the same type), the most intuitive solution
is to execute:
s.mode()[0]
(s.mode() alone returns a Series, so we just take the initial element
of it).
The result is:
'0.002232327408019874'
But if you want to leave out strings that are convertible to numbers,
you need a different approach.
Unfortunately, you cannot use s.str.isnumeric() because it matches only
strings composed solely of digits, while your "numeric" strings also
contain dots.
So you have to narrow down your Series (s) using str.match and
then invoke mode:
s[~s.str.match(r'^[+-]?(?:\d+|\d+\.\d*|\d*\.\d+)$')].mode()[0]
This time the result is:
'herfst'
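If pandas is not available, the same filtering can be sketched with the standard library alone. This version treats a string as numeric when float() can parse it, which is an assumed definition and broader than the regular expression above (it also accepts forms such as '1e5' or 'nan'). The helper name looks_numeric is illustrative.

```python
from collections import Counter

# Flattened values in the same order as the question's array.
values = [
    "0.001251993149471442", "herfst",
    "0.002232327408019874", "herfst",
    "0.002232327408019874", "herfst",
    "0.002232327408019874", "winter",
    "0.002232327408019874", "winter",
]

def looks_numeric(text):
    """Return True if float() can parse the string (an assumed definition)."""
    try:
        float(text)
        return True
    except ValueError:
        return False

# Drop the numeric-looking strings, then count what is left.
words = [v for v in values if not looks_numeric(v)]
most_common_word = Counter(words).most_common(1)[0][0]
print(most_common_word)
```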
I have the following problem:
I have a dictionary where each key-value pair is as follows:
k = a string
v = nested lists, structured as follows:
v contains A lists; each of those contains B lists; and each B list contains C entries.
Eventually, I would like to calculate averages and standard deviations, so I would like to create a dictionary with the same keys whose values are as follows:
a matrix of A rows by B columns, where each entry in the matrix is one of the B lists. This would let me arrange the data so that I can pull specific values out of each column of the matrix for the calculations.
This was my reasoning, so I have tried the following:
#Initializing the new dictionary
matrix_dictionary = {}
for k,v in overall_dictionary.iteritems():
num_rows = len(v) #Number of rows in desired matrix
for i in v:
width = len(i) #Number of columns in desired matrix
#Initializing the matrix
data_matrix = [[] for i in xrange(0,width) for j in xrange(0,num_rows)]
for y in xrange(0,height)#For row in row
for element in v: #For each list A in v
counter = 0; #one of indices to add element to specific spot in matrix
for i in element:#for B list in A
data_matrix[y][counter] = i #Trying to add list B inside matrix
counter = counter + 1;
matrix_dictionary[k] = data_matrix #Adding key value pair to dic
Different attempt at explaining problem
For each k in the dictionary, I have a 3D v
v, for example, is made up of 100 lists (A)
Each list A has 50 lists (B) inside of it
Each list B holds a list C, where two indices of C are of interest
I want to create a giant table that is A rows by B columns, with all of the C lists inside
Example: first row, first column has C1, first row second column has C2, etc...
I want the matrix to then be the value in the dictionary
I saw the following errors
1) Index error for the matrix
2) The following works outside on its own
for k, v in overall_dictionary.iteritems():
    for A in v:
        print(A)  # Prints the list A containing a bunch of lists B
        for B in A:
            print(B)  # Prints each B list
However, the following does not work, I get index out of range for list, why?
z = v[0]
print(z)
Eventual Goal
For each k,v in new matrix dictionary
For each column in the matrix and within each cell for that column
get two indices X and Y
get the average of all of the X's, get the average of all the Y's
Make a new dictionary with k as string, and list of results as value pairs
Help needed on
I've been told that I have a 3D array inside v: A[B[C]]
I want to create an AxB matrix where different values of C are easily accessible in the matrix
The good news is: You already have your dictionary in the form you want! A list of lists is a 2D matrix.
The bad news: It's not really so clear what errors you're having and where they are coming up. More detail on that would be helpful for people to come up with solutions.
In the meantime, a few comments on your code (though I don't think it's necessary since you already have a matrix):
You use height in your code, but never initialize it. Is height = num_rows?
Instead of iterating through all of the rows, updating width each time, just do it once, like:
width = len(v[0]) #This assumes all rows have same number of cols
No need for the semi-colons. (Not that there's anything wrong with them!)
The indentation in your code isn't clear. Make sure everything is in the right for loop.
You should have 'for element' not 'for element a'.
I think the counter complicates things. Try to use 'enumerate' or something like that instead.
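To illustrate the point that the nested lists can already be used as a matrix, here is a minimal sketch of the "Eventual Goal" from the question in Python 3. The sample data and the positions of X and Y inside each cell (indices 0 and 1) are assumptions made for the example.

```python
from statistics import mean

# Hypothetical data: one key, 2 rows (A) by 3 columns (B), each cell a C list.
overall_dictionary = {
    "sample": [
        [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]],
        [[3.0, 30.0], [4.0, 40.0], [5.0, 50.0]],
    ],
}
X_INDEX, Y_INDEX = 0, 1  # assumed positions of the two values of interest

results = {}
for key, matrix in overall_dictionary.items():
    column_stats = []
    # zip(*matrix) transposes the rows, yielding one tuple of cells per column.
    for column in zip(*matrix):
        x_mean = mean(cell[X_INDEX] for cell in column)
        y_mean = mean(cell[Y_INDEX] for cell in column)
        column_stats.append((x_mean, y_mean))
    results[key] = column_stats

print(results)
```

Here zip(*matrix) replaces the manual counter, and a single cell is still reachable as matrix[row][col]. statistics.stdev can be used in the same way for the standard deviations.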
I have an array of identifiers that have been grouped into threes. For each group, I would like to randomly assign its members to one of three sets and to have those assignments stored in another array. So, for a given array of grouped identifiers (I presort them):
groupings = array([1,1,1,2,2,2,3,3,3])
A possible output would be
assignments = array([0,1,2,1,0,2,2,0,1])
Ultimately, I would like to be able to generate many of these assignment lists and to do so efficiently. My current method is just to create a zeros array and set each consecutive subarray of length 3 to a random permutation of 0, 1, 2.
assignment = numpy.zeros((12,10),dtype=int)
for i in range(0,12,3):
    for j in range(10):
        assignment[i:i+3,j] = numpy.random.permutation(3)
Is there a better/faster way?
Two things I can think of:
Instead of writing to the 2D array as 3 rows by 1 column in your inner loop, try to write to it as 1 row by 3 columns. With NumPy's default row-major layout, accessing a 2D array row by row is usually faster than column by column, since it gives better spatial locality, which is good for caching.
Instead of running numpy.random.permutation(3) each time, if 3 is fixed and small, generate all the permutations beforehand and save them in a constant array of arrays, like (array([0,1,2]), array([0,2,1]), array([1,0,2]), ...). Then you just need to randomly pick one of them each time.
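The precomputed-permutations idea can be sketched as follows, under the question's sizes (4 groups of 3, 10 assignment lists). The variable names and the fixed seed are illustrative; picking uniformly among the six permutations is equivalent to drawing a random permutation of three items.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# All 6 permutations of (0, 1, 2), computed once: shape (6, 3).
perms = np.array(list(itertools.permutations(range(3))))

n_groups, n_lists = 4, 10

# One random permutation index per (group, assignment list) pair.
choice = rng.integers(0, len(perms), size=(n_groups, n_lists))

# perms[choice] has shape (n_groups, n_lists, 3); move the permutation axis
# next to the group axis and flatten to the question's (12, 10) layout.
assignment = perms[choice].transpose(0, 2, 1).reshape(n_groups * 3, n_lists)

print(assignment)
```

This replaces both Python loops with one vectorized draw, so each column of assignment is one complete assignment list, as in the original code.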