Storing multiple arrays in Python

I am writing a program to simulate the actual polling data companies like Gallup or Rasmussen publish daily: www.gallup.com and www.rassmussenreports.com
I'm using a brute-force method: the computer generates random daily polling data and then calculates three-day averages to see if the average of the random data matches the pollsters' numbers. (Most companies' poll numbers are three-day averages.)
Currently it works well for one iteration, but my goal is to have it produce the most common simulation that matches the average polling data. I could then run the code for anywhere from 1 to 1000 iterations.
And this is my problem. At the end of the test I have an array in a single variable that looks something like this:
[40.1, 39.4, 56.7, 60.0, 20.0 ..... 19.0]
The program currently produces one array for each correct simulation. I can store each array in a single variable, but then the program would have to generate anywhere from 1 to 1000 variables, depending on how many iterations I requested!
How do I avoid this? I know there is an intelligent way of doing this that doesn't require the program to generate variables to store arrays depending on how many simulations I want.
Code testing for McCain:
import random

x = 0
mctest = []
while x < 5:
    test = round(100*random.random())
    mctest.append(test)
    x = x + 1
mctestavg = (mctest[0] + mctest[1] + mctest[2])/3
# mcavg is real data
if mctestavg == mcavg[2]:
    mcwork = mctest
How do I repeat without creating multiple mcwork vars?

Would something like this work?
from random import randint
mcworks = []
for n in xrange(NUM_ITERATIONS):
    mctest = [randint(0, 100) for i in xrange(5)]
    if sum(mctest[:3])/3 == mcavg[2]:  # mcavg is real data
        mcworks.append(mctest)
In the end, you are left with a list of valid mctest lists.
What I changed:
- Used a list comprehension to build the data instead of a for loop
- Used random.randint to get random integers
- Used slices and sum to calculate the average of the first three items
- (To answer your actual question :-)) Put the results in a list, mcworks, instead of creating a new variable for every iteration
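If you then want the most common matching simulation, one possible follow-up (a sketch assuming the mcworks list built above) is to count duplicates with collections.Counter:
from collections import Counter

# Lists aren't hashable, so convert each matching simulation to a tuple first.
counts = Counter(tuple(sim) for sim in mcworks)
most_common_sim, occurrences = counts.most_common(1)[0]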

Are you talking about doing this?
>>> a = [ ['a', 'b'], ['c', 'd'] ]
>>> a[1]
['c', 'd']
>>> a[1][1]
'd'

Lists in Python can contain any type of object -- if I understand the question correctly, will a list of lists do the job? Something like this (assuming you have a function generate_poll_data() which creates your data):
data = []
for _ in xrange(num_iterations):
    data.append(generate_poll_data())
Then data[n] will be the list of data from the n-th run (counting from zero, so data[0] holds the first run).

Since you are thinking in terms of variables, you might prefer a dictionary over a list of lists:
data = {}
data['a'] = [generate_poll_data()]
data['b'] = [generate_poll_data()]
etc.

I would strongly consider using NumPy to do this. You get efficient N-dimensional arrays that you can quickly and easily process.
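As a rough sketch of what that could look like here (assuming the same mcavg real data as in the question), the whole simulation becomes one 2-D array with no Python-level loop:
import numpy as np

NUM_ITERATIONS = 1000
sims = np.random.randint(0, 101, size=(NUM_ITERATIONS, 5))  # 1000 simulated 5-day runs
avgs = sims[:, :3].mean(axis=1)   # three-day average of each run
matches = sims[avgs == mcavg[2]]  # keep the rows that match the real data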

A neat way to do it is to use a list of lists in combination with Pandas. Then you are able to create a 3-day rolling average.
This makes it easy to search through the results: just add the real values as another column and use loc to find the rows that match.
import pandas as pd
from random import randint

rand_vals = [randint(0, 100) for i in range(5)]
df = pd.DataFrame(data=rand_vals, columns=['generated data'])
df['3 day avg'] = df['generated data'].rolling(3).mean()
df['mcavg'] = mcavg  # the list of real data
# Extract the resulting list of values
res = df.loc[df['3 day avg'] == df['mcavg']]['3 day avg'].values
This is also neat if you intend to use the same random values for different polls/persons: just add another column with their real values and perform the same search for them.

Related

List of lists (or numpy array): extracting data via SQL-like methods?

The problem: given a (large) Python list-of-lists, or, semi-equivalently, a numpy array, extract information from the array in a SQL-like manner, i.e., as if the array were a database.
For example: given a 4-column by (large) N-row array, extract the rows where the first column is equal to X. In SQL this would be:
SELECT * FROM array WHERE col_1_id = X
In Python, however... ¯\_(ツ)_/¯
An attempt to make the issue clearer:
The array in question holds in each sublist/row [M, a^2, b^2, c^2], where M is the sum of the squares. The list contains millions of entries, and M ranges from ~100 to ~10^6.
The desire is to extract from this data only the rows for which at least 8 different rows have the same sum. Naively we can do this with a loop:
output = []
for i in range(10**6):
    newarray = []
    for row in array:
        if row[0] == i:
            newarray.append(row)
    if len(newarray) >= 8:
        output.extend(newarray)
save(output, 'outputfilename')
This output is a much shorter and more workable array. But my understanding is that this is incredibly inefficient (we're looping through a million-row array a million times; that's on the order of a trillion operations, which seems problematic).
Were this data in a database, I could grab it with:
SELECT * FROM array
WHERE col_1 IN (SELECT col_1 FROM array GROUP BY col_1 HAVING COUNT(*) >= 8)
(depending on which SQL this might take a different form).
So far as I can tell, neither Python nor numpy has built-in functions that act like this. I don't expect the language to parse a SQL query, but there must be some tool within the language that approximates this function.
Numpy has a select method that doesn't actually select rows in this way, and some other methods that sound like they might make these operations possible but seem to do nothing of the sort. As mentioned below, the documentation is very thin on examples.
I have seen things somewhat like this done using collections.Counter(), but I'm not sure this specific desire can be done with it and am uncertain how to do it. The documentation is... thin on examples.
I'm aware of the fact that this may be an XY question, and have hence attempted to leave out the X except as examples of what I've tried. I am, however, in need of tools using Python (via SageMath/Jupyter). If there's a way of directly storing numpy/Python data in a database-like format and hitting it with SQL-like queries, that would be great too.
This might not be exactly what you are looking for, but I hope it can be helpful either way. :) I wrote a loop implementation that should be more efficient than the one you provided, since we only loop through the array twice. We use a dictionary to keep track of the number of times a specific value in the first column occurs.
countDict = {}
# Count the number of times each sum occurs in the first column of the array
for row in array:
    if row[0] in countDict:
        # If the row's sum exists in the dictionary, increment the count
        countDict[row[0]] += 1
    else:
        # Else we add the first count (1)
        countDict[row[0]] = 1

output = []  # Output to generate
# Loop through the first column of the array again
for row in array:
    # If the sum value occurred at least 8 times, add the row to the output list
    if countDict[row[0]] >= 8:
        output.append(row)
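The same two-pass idea can also be written more compactly with collections.Counter (which the question mentions), or pushed into NumPy for speed; a sketch assuming array is a list of rows and arr is the equivalent 2-D NumPy array:
from collections import Counter
import numpy as np

# Pure-Python version: one pass to count, one comprehension to filter.
counts = Counter(row[0] for row in array)
output = [row for row in array if counts[row[0]] >= 8]

# NumPy version: count the distinct sums, then keep rows whose sum is frequent.
vals, cnts = np.unique(arr[:, 0], return_counts=True)
output_np = arr[np.in1d(arr[:, 0], vals[cnts >= 8])]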

Finding the mean of two different 2x1 row-column groups of data

I am a bit new to Python. I am enumerating through a large list of data, as shown below, and would like to find the mean of every line.
for index, line in enumerate(data):
    # calculate the mean
However, the lines of this particular set of data are as such:
[array([[2.3325655e-10, 2.4973504e-10],
[1.3025138e-10, 1.3025231e-10]], dtype=float32)].
I would like to find the mean of both 2x1s separately, then obtain a list of those two averages.
The question contains a list of lists which is inside a list.
So you just need to get rid of the outside list first (it has only one element). Then you can find the average (mean) of the other lists using the statistics module.
It will look like this:
import statistics
# given data
x = [[[2.3325655e-10, 2.4973504e-10],
      [1.3025138e-10, 1.3025231e-10]]]
x = x[0]  # remove the outer list
# extract the columns via list comprehensions (other methods work too)
s0 = [a[0] for a in x]
s1 = [a[1] for a in x]
# print the results
print(statistics.mean(s0))
print(statistics.mean(s1))
The result looks like this:
1.81753965e-10
1.89993675e-10
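Since the data in the question is already a NumPy array, a column-wise mean may be simpler still; a sketch using the same values:
import numpy as np

arr = np.array([[2.3325655e-10, 2.4973504e-10],
                [1.3025138e-10, 1.3025231e-10]], dtype=np.float32)
# axis=0 averages down each column, producing both means at once.
print(arr.mean(axis=0))  # approximately [1.8175397e-10, 1.8999368e-10]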

numpy: Change values in an array by randomly selecting different indices

I am new to NumPy, and I am having trouble with simple management of NumPy arrays.
The task is to randomly select 12 different items in a NumPy array by index each day and change their values.
import numpy as np
import random

N = 20
s = np.zeros([N])
for t in range(12):
    randomindex = random.randint(0, len(s)-1)
    s[randomindex] = 10
Thanks for answering, and sorry for my description; I'm not good at describing Python problems in English. I will give more detailed information.
e.g. s = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
I randomly choose an item from the array by its index:
randomindex = random.randint(0, len(s)-1)
so randomindex will be 0-19, and s[randomindex] = 10. If randomindex is 2, s[2] becomes 10:
s = (1, 2, 10, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20).
If I want to choose 3 items, I run the loop 3 times; how can I make sure a different index is chosen each time?
'Daily' means that each day I sum the new s and store it in a new NumPy array R[t],
like:
import numpy as np
import random

N = 20
s = np.zeros([N])
T = 10  # number of days
R = np.zeros([T])
for t in range(T-1):
    R[t+1] = R[t] + R[t]*3
    for i in range(12):
        randomindex = random.randint(0, len(s)-1)
        s[randomindex] = 10
    R[t] = np.sum(s)
I'm having a little difficulty understanding what you're asking, but I think you want a way to select different values in a randomized order, and the problem with the above code is that you may get duplicates.
I have two solutions. One, you can use the Python random library. The random.shuffle function will randomly shuffle all the values in a mutable sequence (such as a list). You can then access them sequentially as you normally would. Here's an example of random.shuffle:
import random

list1 = ["Apple", "Grapes", "Bananas", "Grapes"]
random.shuffle(list1)
print(list1)
The second solution doesn't involve a library. Make a small addition to the code above: create a new empty list, and every time you retrieve a value, add it to this list; before accepting a value, check whether it's already in the list. For instance, if you retrieve the value 5, add it to the list. If you later draw 5 again, the check will find that 5 already exists in the list, and you can redo that iteration and draw again.
You can append to a list with the following code.
newlist = ["Python", "Java","HTML","CSS"]
newlist.append("Ruby")
print(newlist)
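For completeness, the standard library also has random.sample, which draws distinct indices in a single call and sidesteps the duplicate problem entirely; a sketch based on the question's setup:
import random
import numpy as np

N = 20
s = np.zeros(N)
# random.sample picks 12 *distinct* indices, so no index is chosen twice.
for idx in random.sample(range(N), 12):
    s[idx] = 10
# NumPy alternative: indices = np.random.choice(N, 12, replace=False)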

Choosing python data structures to speed up algorithm implementation

So I'm given a large collection (roughly 200k) of lists. Each contains a subset of the numbers 0 through 27. I want to return two of the lists where the product of their lengths is greater than the product of the lengths of any other pair of lists. There's another condition, namely that the lists have no numbers in common.
There's an algorithm I found for this (can't remember the source, apologies for non-specificity of props) which exploits the fact that there are fewer total subsets of the numbers 0 through 27 than there are words in the dictionary.
The first thing I've done is looped through all the lists, found the unique subset of integers that comprise it and indexed it as a number between 0 and 1<<28. As follows:
def index_lists(lists):
    index_hash = {}
    for raw_list in lists:
        length = len(raw_list)
        index = find_index(raw_list)
        if length > index_hash.get(index, {}).get("length", 0):
            index_hash[index] = {"list": raw_list, "length": length}
    return index_hash
This gives me the longest list, and its length, for each subset that's actually present in the supplied collection of lists. Naturally, not all subsets from 0 to (1<<28)-1 are necessarily included, since there's no guarantee the supplied collection has a list for each unique subset.
What I then want, for each subset 0 through 1<<28 (all of them this time), is the longest list that contains at most that subset. This is the part that is killing me. At a high level, it should, for each subset, first check whether that subset is contained in index_hash. It should then compare the length of that entry (if it exists) to the lengths stored previously in the new hash for the current subset minus one number (this is an inner loop 28 strong). The greatest of these is stored in the new hash for the current subset of the outer loop. The code right now looks like this:
def at_most_hash(index_hash):
    most_hash = {}
    for i in xrange(1<<28):  # pretty sure this is a bad idea
        max_entry = index_hash.get(i)
        if max_entry:
            max_length = max_entry["length"]
            max_list = max_entry["list"]
        else:
            max_length = 0
            max_list = []
        for j in xrange(28):  # again, probably not great
            subset_index = i & ~(1<<j)  # the subset with bit j cleared
            at_most_entry = most_hash.get(subset_index, {})
            at_most_length = at_most_entry.get("length", 0)
            if at_most_length > max_length:
                max_length = at_most_length
                max_list = at_most_entry["list"]
        most_hash[i] = {"length": max_length, "list": max_list}
    return most_hash
This loop obviously takes several forevers to complete. I feel that I'm new enough to python that my choice of how to iterate and what data structures to use may have been completely disastrous. Not to mention the prospective memory problems from attempting to fill the dictionary. Is there perhaps a better structure or package to use as data structures? Or a better way to set up the iteration? Or maybe I can do this more sparsely?
The next part of the algorithm just cycles through all the lists we were given and takes the product of the subset's max_length and complementary subset's max length by looking them up in at_most_hash, taking the max of those.
Any suggestions here? I appreciate the patience for wading through my long-winded question and less than decent attempt at coding this up.
In theory, this is still a better approach than working with the collection of lists alone, since that approach is roughly O(200k²) and this one is roughly O(28·2^28 + 200k), yet my implementation is holding me back.
Given that your indexes are just ints, you could save some time and space by using lists instead of dicts. I'd go further and bring in NumPy arrays. They offer compact storage representation and efficient operations that let you implicitly perform repetitive work in C, bypassing a ton of interpreter overhead.
Instead of index_hash, we start by building a NumPy array where index_array[i] is the length of the longest list whose set of elements is represented by i, or 0 if there is no such list:
import numpy
index_array = numpy.zeros(1<<28, dtype=int)  # We could probably get away with dtype=int8.
for raw_list in lists:
    i = find_index(raw_list)
    index_array[i] = max(index_array[i], len(raw_list))
We then use NumPy operations to bubble up the lengths in C instead of interpreted Python. Things might get confusing from here:
for bit_index in xrange(28):
    index_array = index_array.reshape([1<<(28-bit_index), 1<<bit_index])
    numpy.maximum(index_array[::2], index_array[1::2], out=index_array[1::2])
index_array = index_array.reshape([1<<28])
Each reshape call takes a new view of the array where data in even-numbered rows corresponds to sets with the bit at bit_index clear, and data in odd-numbered rows corresponds to sets with the bit at bit_index set. The numpy.maximum call then performs the bubble-up operation for that bit. At the end, each cell index_array[i] of index_array represents the length of the longest list whose elements are a subset of set i.
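To make the trick concrete, here is a minimal 3-bit demonstration (8 subsets instead of 2^28) that you can run and inspect:
import numpy as np

BITS = 3
a = np.zeros(1 << BITS, dtype=int)
a[0b011] = 5  # pretend the longest list whose element set is {0, 1} has length 5
for bit_index in range(BITS):
    a = a.reshape(1 << (BITS - bit_index), 1 << bit_index)
    np.maximum(a[::2], a[1::2], out=a[1::2])
a = a.reshape(1 << BITS)
print(a[0b111])  # 5 -- {0, 1} is a subset of {0, 1, 2}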
We then compute the products of lengths at complementary indices:
products = index_array * index_array[::-1] # We'd probably have to adjust this part
# if we picked dtype=int8 earlier.
find where the best product is:
best_product_index = products.argmax()
and the longest lists whose elements are subsets of the set represented by best_product_index and its complement are the lists we want.
This is a bit too long for a comment so I will post it as an answer. One more direct way to index your subsets as integers is to use "bitsets" with each bit in the binary representation corresponding to one of the numbers.
For example, the set {0,2,3} would be represented by 2^0 + 2^2 + 2^3 = 13, and {4,5} would be represented by 2^4 + 2^5 = 48.
This would allow you to use simple lists instead of dictionaries and Python's generic hashing function.
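A minimal sketch of what the find_index function assumed above could look like with this encoding:
def find_index(raw_list):
    # Encode a list of numbers 0-27 as a 28-bit integer: bit n is set
    # iff n appears in the list.
    index = 0
    for n in set(raw_list):
        index |= 1 << n
    return index

assert find_index([0, 2, 3]) == 13  # 2^0 + 2^2 + 2^3
assert find_index([4, 5]) == 48     # 2^4 + 2^5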

Efficient Array replacement in Python

I'm wondering what the most efficient way is to replace elements in an array with other random elements from the same array, given some criterion. More specifically, I need to replace each element that doesn't meet the criterion with another random value from its row. For example, I want to replace each out-of-range element in a row of data with a random cell from that row whose value is between -0.8 and 0.8. My inefficient solution looks something like this:
import numpy as np
import random as r

data = np.random.normal(0, 1, (10, 100))
for index, row in enumerate(data):
    row_copy = np.copy(row)
    outliers = np.logical_or(row > .8, row < -.8)
    for prob in np.where(outliers == 1)[0]:
        fixed = 0
        while fixed == 0:
            random_other_value = r.randint(0, 99)
            if random_other_value in np.where(outliers == 1)[0]:
                fixed = 0
            else:
                row_copy[prob] = row[random_other_value]
                fixed = 1
Obviously, this is not efficient.
I think it would be faster to pull out all the good values, then use random.choice() to pick one whenever you need it. Something like this:
import numpy as np
import random
from itertools import izip

data = np.random.normal(0, 1, (10, 100))
for row in data:
    good_ones = np.logical_and(row >= -0.8, row <= 0.8)
    good = row[good_ones]
    row_copy = np.array([x if f else random.choice(good) for f, x in izip(good_ones, row)])
High-level Python code that you write is slower than the C internals of Python. If you can push work down into the C internals it is usually faster. In other words, try to let Python do the heavy lifting for you rather than writing a lot of code. It's zen... write less code to get faster code.
I added a loop to run your code 1000 times, and to run my code 1000 times, and measured how long they took to execute. According to my test, my code is ten times faster.
Additional explanation of what this code is doing:
row_copy is being set by building a new list, and then calling np.array() on the new list to convert it to a NumPy array object. The new list is being built by a list comprehension.
The new list is made according to the rule: if the number is good, keep it; else, take a random choice from among the good values.
A list comprehension walks over a sequence of values, but to apply this rule we need two values: the number, and the flag saying whether that number is good or not. The easiest and fastest way to make a list comprehension walk along two sequences at once is to use izip() to "zip" the two sequences together. izip() will yield up tuples, one at a time, where the tuple is (f, x); f in this case is the flag saying good or not, and x is the number. (Python has a built-in feature called zip() which does pretty much the same thing, but actually builds a list of tuples; izip() just makes an iterator that yields up tuple values. But you can play with zip() at a Python prompt to learn more about how it works.)
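For example, at a Python 2 prompt:
>>> zip([True, False, True], [1.2, 0.3, 0.9])
[(True, 1.2), (False, 0.3), (True, 0.9)]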
In Python we can unpack a tuple into variable names like so:
a, b = (2, 3)
In this example, we set a to 2 and b to 3. In the list comprehension we unpack the tuples from izip() into variables f and x.
Then the heart of the list comprehension is a "ternary if" statement like so:
a if flag else b
The above will return the value a if the flag value is true, and otherwise return b. The one in this list comprehension is:
x if f else random.choice(good)
This implements our rule.
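If you want to push even more of the work into NumPy's C internals, here is a fully vectorized sketch of the same rule (assuming, as the original code does, that every row contains at least one in-range value):
import numpy as np

data = np.random.normal(0, 1, (10, 100))
for row in data:
    bad = np.abs(row) > 0.8
    good_values = row[~bad]
    # Draw one random good value (with replacement) for every outlier at once;
    # this modifies data in place rather than building a row_copy.
    row[bad] = np.random.choice(good_values, size=bad.sum())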
