I have a (normal, unordered) dictionary that is holding my data and I extract some of the data into a numpy array to do some linear algebra. Once that's done I want to put the resulting ordered numpy vector data back into the dictionary with all of data. What's the best, most Pythonic, way to do this?
Joe Kington suggests in his answer to "Writing to numpy array from dictionary" that two possible solutions are:
Using Ordered Dictionaries
Storing the sorting order in another data structure, such as a dictionary
Here are some (possibly useful) details:
My data is in nested dictionaries. The outer is for groups: {groupKey: groupDict}, and group keys start at 0 and count up in order to the total number of groups. groupDict contains information about items: {itemKey: itemDict}. itemDict has keys for the actual data, and these keys typically start at 0 but can skip numbers, as not all "item locations" are populated. itemDict keys include things like 'name', 'description', 'x', 'y', ...
Getting to the data is easy, dictionaries are great:
data[groupKey][itemKey]['x'] = 0.12
Then I put data such as x and y into numpy vectors and arrays, something like this:
xVector = numpy.empty( xLength )
vectorIndex = 0
for groupKey, groupDict in dataDict.items():
    for itemKey, itemDict in groupDict.items():
        xVector[vectorIndex] = itemDict['x']
        vectorIndex += 1
Then I go off and do my linear algebra and calculate a z vector that I want to add back into dataDict. The issue is that dataDict is unordered, so I don't have any way of getting the proper index.
The Ordered Dict method would allow me to know the order and then index through the dataDict structure and put the data back in.
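In that case the write-back would be something like this (a sketch, assuming dataDict is not modified between building xVector and writing zVector back, so both passes see the same iteration order):

vectorIndex = 0
for groupKey, groupDict in dataDict.items():
    for itemKey, itemDict in groupDict.items():
        itemDict['z'] = zVector[vectorIndex]
        vectorIndex += 1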
Alternatively, I could create another dictionary while inside the inner for loop above that stores the relationship between vectorIndex, groupKey and itemKey:
sortingDict[vectorIndex] = {'groupKey': groupKey, 'itemKey': itemKey}
Later, when it's time to put the data back, I could just loop through the vectors and add the data:
vectorIndex = 0
for z in numpy.nditer(zVector):
    dataDict[sortingDict[vectorIndex]['groupKey']][sortingDict[vectorIndex]['itemKey']]['z'] = z
    vectorIndex += 1
Both methods seem equally straightforward to me. I'm not sure whether changing dataDict to an ordered dictionary would have any other effects elsewhere in my code, but probably not. Adding the sorting dictionary also seems pretty easy, as it would get created at the same time as the numpy arrays and vectors. Left on my own, I think I would go with the sortingDict method.
Is one of these methods better than the others? Is there a better way I'm not thinking of? My data structure works well for me, but if there's a way to change that to improve everything else I'm open to it.
I ended up going with option #2 and it works quite well.
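For reference, a minimal sketch of the sortingDict round trip (simplified, using the same names as above):

# Build the vector and remember where each entry came from
sortingDict = {}
xVector = numpy.empty(xLength)
vectorIndex = 0
for groupKey, groupDict in dataDict.items():
    for itemKey, itemDict in groupDict.items():
        xVector[vectorIndex] = itemDict['x']
        sortingDict[vectorIndex] = {'groupKey': groupKey, 'itemKey': itemKey}
        vectorIndex += 1

# ... linear algebra produces zVector ...

# Write the results back using the remembered keys
for vectorIndex, keys in sortingDict.items():
    dataDict[keys['groupKey']][keys['itemKey']]['z'] = zVector[vectorIndex]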
The problem: given a (large) Python list-of-lists, or, semi-equivalently, a numpy array, extract information from the array in a SQL-like manner, i.e., as if the array were a database.
For example: given a 4-column by (large) N-row array, extract the rows where the first column is equal to X. In SQL this would be:
SELECT * FROM array WHERE col_1_id = X
In Python, however... ¯\_(ツ)_/¯
An attempt to make the issue clearer:
The array in question holds in each sublist/row [M, a^2, b^2, c^2], where M is the sum of the squares. The list contains millions of entries, and M ranges from ~100 to ~10^6.
The desire is to extract from this data only the rows for which at least 8 different rows have the same sum. Naively we can do this with a loop:
Output = []
for i in range(10**6 + 1):
    newarray = []
    for row in array:
        if row[0] == i:
            newarray.append(row)
    if len(newarray) >= 8:
        Output.extend(newarray)
save(Output, 'outputfilename')
This output is a much shorter and more workable array. But my understanding is that this is incredibly inefficient: we're looping through a million-row array a million times, which is on the order of a trillion comparisons, and that seems problematic.
Were this data in a database, I could grab it with:
SELECT * FROM array WHERE col_1 = i AND COUNT(i) >= 8
(depending on which SQL this might take a different form).
So far as I can tell, neither Python nor numpy has built-in functions that act like this. I don't expect the language to parse a SQL query, but there must be some tool within the language that approximates this function.
Numpy has a select method that doesn't actually select rows in this way, and some other methods that sound like they might make these operations possible but seem to do nothing of the sort. As mentioned below, the documentation is very thin on examples.
I have seen things somewhat like this done using collections.Counter(), but I'm not sure this specific desire can be done with it and am uncertain how to do it. The documentation is... thin on examples.
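My best guess at what that would look like is below (untested, and assuming array can be treated as an iterable of rows with the sum M in position 0):

from collections import Counter

# Count how many rows share each sum M
counts = Counter(row[0] for row in array)
# Keep only the rows whose sum occurs at least 8 times
output = [row for row in array if counts[row[0]] >= 8]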
I'm aware of the fact that this may be an XY question, and have hence attempted to leave out the X except as examples of what I've tried. I am, however, in need of tools using Python (via SageMath/Jupyter). If there's a way of directly storing numpy/Python data in a database-like format and hitting it with SQL-like queries, that would be great too.
This might not be exactly what you are looking for, but I hope it can be helpful either way. :) I wrote a loop implementation that should be more efficient than the one you provided since we only loop through the column twice. We use a dictionary to keep track of the number of times a specific value in the first column occurs.
countDict = {}
# Counting the number of times a sum occurs in the first column of the array
for row in array:
    if row[0] in countDict:
        # If the row sum exists in the dictionary we increment the count
        countDict[row[0]] += 1
    else:
        # Else we add the first count (1)
        countDict[row[0]] = 1

output = []  # Output to generate
# Loop through the first column of the array again
for row in array:
    # If the sum value occurred at least 8 times we add the row to the output list
    if countDict[row[0]] >= 8:
        output.append(row)
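If the data is already (or can be turned into) a numpy array, a vectorised sketch of the same idea using numpy.unique would be (assuming, as in the question, that the first column holds the sums):

import numpy as np

arr = np.asarray(array)  # shape (N, 4): rows are [M, a^2, b^2, c^2]
sums, inverse, counts = np.unique(arr[:, 0], return_inverse=True, return_counts=True)
# counts[inverse] gives, for each row, how many rows share its sum
output = arr[counts[inverse] >= 8]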
This is a quality-of-life question that I feel must have an answer, but I can't find it (maybe I'm using the wrong search terms).
Essentially, I have multiple sets of large data files that I would like to perform analysis on. This involves reading each of these datafiles and storing them as an array (of variable length).
So far I have been doing
import numpy as np
input1 = np.genfromtxt('data1.dat')
input2 = np.genfromtxt('data2.dat')
etc. I was wondering if there is a method of dynamically assigning an array to each of these datasets. Since you can read these dynamically with a for loop,
for i in xrange(2):
    input = np.genfromtxt('data%i.dat' % i)
I was hoping to combine the above to create a bunch of arrays (input1, input2, etc.) without typing out genfromtxt multiple times myself. Surely there is a method if I had 100 datasets (aptly named data0, data1, etc.) to import.
A solution I can think of is maybe creating a function,
import numpy as np
def input(a):
    return np.genfromtxt('data%i.dat' % a)
But obviously I would prefer to store these arrays in memory instead of constantly regenerating them, and would be extremely grateful to know if this is possible in Python.
You can choose to store your arrays in either a dict or a list:
Option 1
Using a dict.
data = {}
for i in xrange(2):
    data['input{}'.format(i)] = np.genfromtxt('data{}.dat'.format(i))
You can access each array by key.
Option 2
Using a list.
data = []
for i in xrange(2):
    data.append(np.genfromtxt('data{}.dat'.format(i)))
Alternatively, using a list comprehension:
data = [np.genfromtxt('data{}.dat'.format(i)) for i in xrange(2)]
You can also use map, which returns a list (in Python 2):
data = map(lambda x: np.genfromtxt('data{}.dat'.format(x)), xrange(2))
Now you can access each array by index.
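To make the access explicit for both options (names as above):

data['input0']  # dict version: access by the constructed key
data[0]         # list/map version: access by position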
I have two large arrays of type numpy.core.memmap.memmap, called data and new_data, with > 7 million float32 items.
I need to iterate over them both within the same loop which I'm currently doing like this.
for i in range(0, len(data)):
    if new_data[i] == 0:
        continue
    combo = (data[i], new_data[i])
    if combo not in new_values_map:
        new_values_map[combo] = available_values.pop()
    data[i] = new_values_map[combo]
However, this is unreasonably slow, so I gather that using numpy's vectorising functions is the way to go.
Is it possible to vectorise with the index, so that the vectorised array can compare its items to the corresponding items in the other array?
I thought of zipping the two arrays but I guess this would cause unreasonable overhead to prepare?
Is there some other way to optimise this operation?
For context: the goal is to effectively merge the two arrays such that each unique combination of corresponding values between the two arrays is represented by a different value in the resulting array, except zeros in the new_data array which are ignored. The arrays represent 3D bitmap images.
EDIT: available_values is a set of values that have not yet been used in data and persists across calls to this loop. new_values_map on the other hand is reset to an empty dictionary before each time this loop is used.
EDIT2: the data array only contains whole numbers, that is: it's initialised as zeros then with each usage of this loop with a different new_data it is populated with more values drawn from available_values which is initially a range of integers. new_data could theoretically be anything.
In answer to your question about vectorising: the answer is probably yes, though you need to clarify what available_values contains and how it is used, as that is the core of the vectorisation.
Your solution will probably look something like this...
indices = new_data != 0
data[indices] = available_values
In this case, if available_values can be treated as an ordered pool whose first value is allocated to the first position of data where new_data is not 0 (and so on), that should work, as long as available_values is a numpy array.
Let's say new_data and data take values 0-255; then you can construct an available_data array with a unique entry for every possible pair of values in new_data and data, like the following:

available_data = numpy.arange(256 * 256).reshape((256, 256))

indices = new_data != 0
# data is stored as float32, so cast the values to int before using them as indices
data[indices] = available_data[data[indices].astype(int), new_data[indices].astype(int)]
Obviously, available_data can be whatever mapping you want. The above should be very quick whatever is in available_data (especially if you only construct available_data once).
Python gives you powerful tools for handling large arrays of data: generators and iterators.
Basically, they allow you to access your data as if it were a regular list, without fetching it all into memory at once; instead it is accessed piece by piece.
In the case of accessing two large arrays at once, you can do:

from itertools import izip

for item_a, item_b in izip(data, new_data):
    # ... do your stuff here

izip creates an iterator that iterates over the elements of both arrays at once; it picks up pieces as you need them rather than all at once.
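If you also need the running index, for example to write results back into data, a sketch that combines enumerate with izip:

from itertools import izip

for i, (item_a, item_b) in enumerate(izip(data, new_data)):
    if item_b == 0:
        continue
    # ... work with i, item_a and item_b here ...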
It seems that replacing the first two lines of the loop to produce:

for i in numpy.where(new_data != 0)[0]:
    combo = (data[i], new_data[i])
    if combo not in new_values_map:
        new_values_map[combo] = available_values.pop()
    data[i] = new_values_map[combo]
has the desired effect.
So most of the time in the loop was spent skipping iterations upon encountering a zero in new_data. I don't really understand why that many null iterations were so expensive; maybe one day I will...
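For completeness, a fully vectorised sketch of the same mapping (untested; it assumes the unused values can be drawn from available_values in sorted order, one per unique combination, which is a slight change from popping them arbitrarily):

import numpy as np

mask = new_data != 0
pairs = np.stack((data[mask], new_data[mask]), axis=1)
# One index per unique (data, new_data) combination
_, inverse = np.unique(pairs, axis=0, return_inverse=True)
inverse = inverse.ravel()  # guard against shape differences between numpy versions
# Draw one unused value per unique combination
pool = np.array(sorted(available_values))[:inverse.max() + 1]
data[mask] = pool[inverse]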
I have a list of objects (clusters) and each object has an attribute vertices which is a list of numbers. I want to construct a dictionary (using a one liner) such that the key is a vertex number and the value is the index of the corresponding cluster in the actual list.
Ex:
clusters[0].vertices = [1,2]
clusters[1].vertices = [3,4]
Expected Output:
{1:0,2:0,3:1,4:1}
I came up with the following:
dict(reduce(lambda x, y: x.extend(y) or x,
            [dict(zip(vertices, [index] * len(vertices))).items()
             for index, vertices in enumerate([i.vertices for i in clusters])]))
It works... but is there a better way of doing this?
Also comment on the efficiency of the above piece of code.
PS: The vertex lists are disjoint.
This is a fairly simple solution, using a nested for:
dict((vert, i) for (i, cl) in enumerate(clusters) for vert in cl.vertices)
This is also more efficient than the version in the question, since it doesn't build lots of intermediate lists while collecting the data for the dict.
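On Python 2.7 and later, the same thing can be written as a dict comprehension:

{vert: i for (i, cl) in enumerate(clusters) for vert in cl.vertices}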
I am writing a program to simulate the actual polling data companies like Gallup or Rasmussen publish daily: www.gallup.com and www.rassmussenreports.com
I'm using a brute force method, where the computer generates some random daily polling data and then calculates three-day averages to see if the average of the random data matches the pollsters' numbers. (Most companies' poll numbers are three-day averages.)
Currently, it works well for one iteration, but my goal is to have it produce the most common simulation that matches the average polling data. I could then run the code for anywhere from 1 to 1000 iterations.
And this is my problem. At the end of the test I have an array in a single variable that looks something like this:
[40.1, 39.4, 56.7, 60.0, 20.0 ..... 19.0]
The program currently produces one array for each correct simulation. I can store each array in a single variable, but then I would need the program to generate 1 to 1000 variables depending on how many iterations I requested!?
How do I avoid this? I know there is an intelligent way of doing this that doesn't require the program to generate variables to store arrays depending on how many simulations I want.
Code testing for McCain:
import random

mctest = []
x = 0
while x < 5:
    test = round(100 * random.random())
    mctest.append(test)
    x = x + 1

mctestavg = (mctest[0] + mctest[1] + mctest[2]) / 3

# mcavg is real data
if mctestavg == mcavg[2]:
    mcwork = mctest
How do I repeat without creating multiple mcwork vars?
Would something like this work?
from random import randint
mcworks = []
for n in xrange(NUM_ITERATIONS):
    mctest = [randint(0, 100) for i in xrange(5)]
    if sum(mctest[:3]) / 3 == mcavg[2]:
        mcworks.append(mctest)  # mcavg is real data
In the end, you are left with a list of valid mctest lists.
What I changed:
Used a list comprehension to build the data instead of a for loop
Used random.randint to get random integers
Used slices and sum to calculate the average of the first three items
(To answer your actual question :-) ) Put the results in a list mcworks, instead of creating a new variable for every iteration
Are you talking about doing this?
>>> a = [ ['a', 'b'], ['c', 'd'] ]
>>> a[1]
['c', 'd']
>>> a[1][1]
'd'
Lists in Python can contain any type of object, so if I understand the question correctly, a list of lists will do the job. Something like this (assuming you have a function generate_poll_data() which creates your data):
data = []
for _ in xrange(num_iterations):
    data.append(generate_poll_data())
Then, data[n] will be the list of data from the nth run (counting from zero).
Since you are thinking in variables, you might prefer a dictionary over a list of lists:
data = {}
data['a'] = [generate_poll_data()]
data['b'] = [generate_poll_data()]
etc.
I would strongly consider using NumPy to do this. You get efficient N-dimensional arrays that you can quickly and easily process.
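A minimal sketch of what that could look like (illustrative names; num_iterations and mcavg are assumed to exist as in the earlier answers):

import numpy as np

sims = np.random.randint(0, 101, size=(num_iterations, 5))  # one row per simulated run
three_day_avg = sims[:, :3].mean(axis=1)                    # average of the first three values
mcworks = sims[three_day_avg == mcavg[2]]                   # keep runs that match the real average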
A neat way to do it is to use a list of lists in combination with Pandas. Then you are able to create a 3-day rolling average.
This makes it easy to search through the results: just add the real values as another column and use the loc function to find which rows match.
import pandas as pd
from random import randint

rand_vals = [randint(0, 100) for i in range(5)]
df = pd.DataFrame(data=rand_vals, columns=['generated data'])
df['3 day avg'] = df['generated data'].rolling(3).mean()
df['mcavg'] = mcavg  # the list of real data

# Extract the resulting list of values
res = df.loc[df['3 day avg'] == df['mcavg']]['3 day avg'].values
This is also neat if you intend to use the same random values for different polls/persons: just add another column with their real values and perform the same search for them.