I have a homework assignment about airport flights, where first I have to create the representation of a sparse matrix (i, j, and value triples) for a 1000x1000 array from 10000 random numbers, with the following criteria:
i and j must be between 0 and 999, since they are the row and column indices of the array
values must be between 1.0 and 5.0
i must not be equal to j
i and j must be present only once
The i is the departure airport, the j is the arrival airport and the values are the hours for the trip from i to j.
Then I have to find the roundtrips for an airport A with 2 to 8 stops at most, based on the criteria above. For example:
A, D, F, G, A is a legal roundtrip with 4 stops
A, D, F, D, A is not a legal roundtrip since the D is visited twice
NOTE: the problem must be solved purely with Python built-in libraries. External libraries like scipy and numpy are not accepted.
I have tried to run a loop over 10000 numbers and assign a random number to each row, column, and value based on the above criteria, but I guess this is not what the assignment asks me to do, since the loop doesn't stop.
I guess i and j are not the actual i and j index positions of the sparse matrix but rather the values stored at those positions? I don't know.
I currently don't have working code other than the example for the roundtrip implementation, although it will raise an error if the trip list is empty:
dNext = {
    0: [],
    1: [4, 2, 0],
    2: [1, 4],
    3: [0],
    4: [3, 1]
}

def findRoundTrips(trip, n, trips):
    if (trip[0] == trip[-1]) and (1 < len(trip) <= n + 1):
        trips.append(trip.copy())
        return
    for x in dNext[trip[-1]]:
        if ((x not in trip[1:]) and (len(trip) < n)) or (x == trip[0]):
            trip.append(x)
            findRoundTrips(trip, n, trips)
            trip.pop()
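For example, with the small dNext above I call it like this (the trip list starts with the home airport, and n is the maximum number of stops):

trips = []
findRoundTrips([1], 8, trips)   # roundtrips from airport 1 with at most 8 stops
print(trips)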
Here's how I would build a sparse matrix:
from collections import defaultdict
import random

max_location = 1000
min_value = 1.0
max_value = 5.0

sparse_matrix = defaultdict(list)
num_entries = 10000
for _ in range(num_entries):
    source = random.randint(0, max_location - 1)   # randint is inclusive on both ends
    dest = random.randint(0, max_location - 1)
    value = random.uniform(min_value, max_value)
    sparse_matrix[source].append((dest, value))
What this does is define a sparse matrix as a dictionary where the key is the starting point of a trip. The value for a key is a list of (destination, hours) tuples describing everywhere you can fly to from there and how long it takes.
Note, I didn't check that I'm using randint and uniform perfectly correctly; if you use this, you should look at the documentation of those functions to check for any off-by-one errors in this solution.
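If you also need to enforce the criteria from the question (i and j in 0-999, i != j, and, reading "i and j must be present only once" as each (i, j) pair appearing at most once), one way is to draw pairs into a set until you have 10000 of them. This is only a sketch, not checked against the exact assignment wording:

import random

num_airports = 1000          # airports 0..999
num_entries = 10000
min_value, max_value = 1.0, 5.0

pairs = set()
while len(pairs) < num_entries:
    i = random.randint(0, num_airports - 1)
    j = random.randint(0, num_airports - 1)
    if i != j:               # no flight from an airport to itself
        pairs.add((i, j))    # the set keeps each (i, j) pair unique

sparse_matrix = {(i, j): random.uniform(min_value, max_value) for (i, j) in pairs}

# Adjacency view for the roundtrip search (departure -> list of arrivals):
dNext = {a: [] for a in range(num_airports)}
for (i, j) in pairs:
    dNext[i].append(j)

The while loop terminates quickly because there are roughly 999000 possible (i, j) pairs and only 10000 are needed.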
Related
I am trying to solve this math problem in python, and I'm not sure what it is called:
The answer X is always 100
Given a list of 5 integers, their sum would equal X
Each integer has to be between 1 and 25
The integers can appear one or more times in the list
I want to find all the possible unique lists of 5 integers that match.
These would match:
20,20,20,20,20
25,25,25,20,5
10,25,19,21,25
along with many more.
I looked at itertools.permutations, but I don't think that handles duplicate integers in the list. I'm thinking there must be a standard math algorithm for this, but my search queries must be poor.
The only other thing to mention is whether it matters that the list size could change from 5 integers to some other length (6, 24, etc.).
This is a constraint satisfaction problem. These can often be solved by recursive backtracking: you fix one part of the solution and then solve the remaining subproblem. In Python, we can implement this approach with a recursive function:
def csp_solutions(target_sum, n, i_min=1, i_max=25):
    domain = range(i_min, i_max + 1)
    if n == 1:
        if target_sum in domain:
            return [[target_sum]]
        else:
            return []
    solutions = []
    for i in domain:
        # Check if a solution is still possible when i is picked:
        if (n - 1) * i_min <= target_sum - i <= (n - 1) * i_max:
            # Construct solutions recursively:
            solutions.extend([[i] + sol
                              for sol in csp_solutions(target_sum - i, n - 1)])
    return solutions

all_solutions = csp_solutions(100, 5)
This yields 23746 solutions, in agreement with the answer by Alex Reynolds.
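If you only want each multiset of values once, regardless of order, you can deduplicate the ordered solutions afterwards; this matches the 376 unique combinations reported in the answers below:

unique_multisets = {tuple(sorted(sol)) for sol in all_solutions}
print(len(unique_multisets))  # 376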
Another approach with Numpy:
#!/usr/bin/env python
import numpy as np
start = 1
end = 25
entries = 5
total = 100
a = np.arange(start, end + 1)
c = np.array(np.meshgrid(a, a, a, a, a)).T.reshape(-1, entries)
assert(len(c) == pow(end, entries))
s = c.sum(axis=1)
#
# filter all combinations for those that meet sum criterion
#
valid_combinations = c[np.where(s == total)]
print(len(valid_combinations)) # 23746
#
# filter those combinations for unique permutations
#
unique_permutations = set(tuple(sorted(x)) for x in valid_combinations)
print(len(unique_permutations)) # 376
You want combinations_with_replacement from the itertools library. Here is what the code would look like:
from itertools import combinations_with_replacement

values = [i for i in range(1, 26)]
candidates = []
for tuple5 in combinations_with_replacement(values, 5):
    if sum(tuple5) == 100:
        candidates.append(tuple5)
For me, on this problem I get 376 candidates. As mentioned in the comments above, if these are counted once for each arrangement of the 5-tuple, then you'd want to look at all permutations of the 5 candidates, which may not all be distinct. For example, (20,20,20,20,20) is the same regardless of how you arrange the indices. However, (21,20,20,20,19) is not; this one has some distinct arrangements.
I think this could be what you are searching for: given a target number SUM, a left threshold L, a right threshold R, and a size K, find all the possible lists of K elements between L and R whose sum gives SUM. There isn't a specific name for this problem though, as far as I was able to find.
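A minimal sketch of that general formulation, reusing combinations_with_replacement from the answer above (the function and parameter names here are mine):

from itertools import combinations_with_replacement

def bounded_sum_lists(target, k, lo, hi):
    # All multisets of k integers in [lo, hi] whose sum equals target
    return [c for c in combinations_with_replacement(range(lo, hi + 1), k)
            if sum(c) == target]

print(len(bounded_sum_lists(100, 5, 1, 25)))  # 376 for the original question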
I'm working on some Project Euler problems and have a solution I'd like to make more adaptable. The problem itself isn't important here but for those of you that are curious, it's problem 11.
Currently, I have a grid of 20 by 20 integer values and I'm finding the maximum product of 4 adjacent values. It all works fine and pretty quickly. What I currently have is the following:
import operator

maxi = 0
amount = 4
for i in range(0, len(grid) - amount):
    for j in range(0, len(grid) - amount):
        try:
            max_dic = {
                'right': grid[i][j]*grid[i][j+1]*grid[i][j+2]*grid[i][j+3],
                'down': grid[i][j]*grid[i+1][j]*grid[i+2][j]*grid[i+3][j],
                'down_right': grid[i][j]*grid[i+1][j+1]*grid[i+2][j+2]*grid[i+3][j+3],
                'down_left': grid[i][j]*grid[i+1][j-1]*grid[i+2][j-2]*grid[i+3][j-3]
            }
        except IndexError:
            pass
        max_key = str(max(max_dic.items(), key=operator.itemgetter(1))[0])
        if max_dic[max_key] > maxi:
            maxi = max_dic[max_key]
What I would like to do is replace the values in the dictionary with something I can vary (so something in terms of amount), and I thought of using a for loop ranging from 0 to amount-1, which would look like this:
'right': for a in range(amount): # Multiply the correct values
However, I'm not sure whether or not this is possible and if so, how to implement it.
Any advice on how I could do this?
In your current implementation, you won't get a max_dic result at all whenever any of the pieces exceed the grid boundaries. Often in problems like these, you do indeed want a partial result. If so, you probably want the ability to handle the IndexError in a more fine-grained way. For example, you could create a simple helper function that takes a grid and two indexes and returns either the value or some default (a 1 in the case of multiplication).
def get_val(grid, i, j, default=1):
    try:
        return grid[i][j]
    except IndexError:
        return default
Once you have that building block, it's just a matter of preparing some lists of indexes and then using a few functions from the standard library:
from operator import mul
from functools import reduce

# Inside your two loops over i and j ...
ms = list(range(i, i + amount))
ns = list(range(j, j + amount))
rns = list(range(j, j - amount, -1))

max_dic = {
    'right'      : reduce(mul, [get_val(grid, i, n) for n in ns]),
    'down'       : reduce(mul, [get_val(grid, m, j) for m in ms]),
    'down_right' : reduce(mul, [get_val(grid, m, n) for m, n in zip(ms, ns)]),
    'down_left'  : reduce(mul, [get_val(grid, m, n) for m, n in zip(ms, rns)]),
}
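Since get_val never raises, the try/except from your original loop is no longer needed; updating the running maximum then reduces to something like this sketch:

# Inside the i/j loops, after building max_dic:
maxi = max(maxi, max(max_dic.values()))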
You could write something like
import numpy as np
...
'right': np.prod([grid[i][k] for k in range(j, j + amount)])
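Spelled out for all four directions, still parameterized by amount, it might look roughly like this (a sketch, assuming i, j, amount, and grid are in scope and the indices stay inside the grid):

max_dic = {
    'right':      np.prod([grid[i][j + a] for a in range(amount)]),
    'down':       np.prod([grid[i + a][j] for a in range(amount)]),
    'down_right': np.prod([grid[i + a][j + a] for a in range(amount)]),
    'down_left':  np.prod([grid[i + a][j - a] for a in range(amount)]),
}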
Best explained by example. If a python list is -
[[0,1,2,0,4],
[0,1,2,0,2],
[1,0,0,0,1],
[1,0,0,1,0]]
I want to select two sub-lists which will yield the max sum of occurrences of zeros present - where sum is to be calculated as below
SUM = No. of zeros present in the first selected sub-list + No. of zeros present in the second selected sub-list which were not present in the first selected sub-list.
In this case, the answer is 5 (the first or second sub-list together with the last sub-list). Note that the third sub-list should not be selected: its zero at index 3 is also present in the first/second sub-list we would select, so that pairing only amounts to a sum of 4, which is not the maximum we get with the last sub-list.
What kind of algorithm is best suited if we were to apply this to a big input? Is there a way to do this in better than O(N^2) time?
Binary operations are fairly useful for this task:
Convert each sublist to a binary number, where a 0 is turned into a 1 bit, and other numbers are turned into a 0 bit.
For example, [0,1,2,0,4] would be turned into 10010, which is 18.
Eliminate duplicate numbers.
Take the remaining numbers pairwise and combine each pair with a binary OR.
Find the number with the most 1 bits.
The code:
lists = [[0,1,2,0,4],
         [0,1,2,0,2],
         [1,0,0,0,1],
         [1,0,0,1,0]]

import itertools

def to_binary(lst):
    num = ''.join('1' if n == 0 else '0' for n in lst)
    return int(num, 2)

def count_ones(num):
    return bin(num).count('1')

# Step 1 & 2: Convert to binary and remove duplicates
binary_numbers = {to_binary(lst) for lst in lists}

# Step 3: Create pairs
combinations = itertools.combinations(binary_numbers, 2)

# Step 4 & 5: Compute binary OR and count 1 digits
zeros = (count_ones(a | b) for a, b in combinations)

print(max(zeros))  # output: 5
The efficiency of the naive algorithm is O(n(n-1)*m) ~ O(n^2 m), where n is the number of lists and m is the length of each list. When n and m are comparable in magnitude, this equates to O(n^3).
It might be helpful to observe that naive matrix multiplication is also O(n^3). This might lead us to the following algorithm:
Write each list with only 1's and 0's, where a 1 indicates a non-zero entry.
Arrange these lists in a matrix A.
Compute the product M=AAT.
Find the minimum element in M; its row and column correspond to the lists which produce the maximum number of non-overlapping zeros.
Here, (3) is the limiting step of the algorithm. Asymptotically, depending on your matrix multiplication algorithm, you can achieve a complexity down to roughly O(n^2.4).
An example Python implementation would look like:
import numpy as np
lists = [[0,1,2,0,4],
[0,1,2,0,2],
[1,0,0,0,1],
[1,0,0,1,0]]
filtered = list(set(tuple(1 if e else 0 for e in sub) for sub in lists))
A = np.mat(filtered)
D = np.einsum('ik,jk->ij', A, A)
indices = np.unravel_index(np.argmin(D), D.shape)
print(f'{indices}: {len(lists[0]) - D[indices]}')  # e.g. (0, 2): 5 (indices refer to rows of `filtered`)
Note that this algorithm on its own has the fundamental inefficiency of calculating both the lower-triangular and upper-triangular halves of the dot-product matrix. However, the numpy speed-up will probably offset this relative to the combinations approach. See the timing results below:
import random
import itertools

def numpy_approach(lists):
    filtered = list(set(tuple(1 if e else 0 for e in sub) for sub in lists))
    A = np.mat(filtered, dtype=bool).astype(int)
    D = np.einsum('ik,jk->ij', A, A)
    return len(lists[0]) - D.min()

def itertools_approach(lists):
    binary_numbers = {int(''.join('1' if n == 0 else '0' for n in lst), 2)
                      for lst in lists}
    combinations = itertools.combinations(binary_numbers, 2)
    zeros = (bin(a | b).count('1') for a, b in combinations)
    return max(zeros)

from time import time

N = 1000
lists = [[random.randint(0, 5) for _ in range(10)] for _ in range(100)]

for name, function in {'numpy approach': numpy_approach,
                       'itertools approach': itertools_approach}.items():
    start = time()
    for _ in range(N):
        function(lists)
    print(f'{name}: {time() - start}')

# numpy approach: 0.2698099613189697
# itertools approach: 0.9693171977996826
The algorithm should look something like the following (with Haskell code as an example, so as not to make the process trivial for you in Python):
turn each sublist into "Is zero" or "Isn't zero"
map (map (\x -> if x==0 then 1 else 0)) bigList
Enumerate the list so you can keep indices
enumList = zip [0..] bigList
Compare each sublist with its successive sublists
myCompare = concat . go
  where
    go [] = []
    go ((ix, xs):xss) = [((ix, iy), zipWith (.|.) xs ys) | (iy, ys) <- xss] : go xss
Calculate your maxes
best = maximumBy (compare `on` (sum . snd)) $ myCompare enumList
Pull out the indices
result = fst best
I'm trying to use numpy/pandas to construct a sliding-window style comparator. I've got a list of lists, each of which is a different length. I want to compare each list to another list as depicted below:
lists = [[10,15,5],[5,10],[5]]
window_diff(lists[1], lists[0]) = 25
The window diff for lists[0] and lists[1] would give 25, using the following window-sliding technique: because lists[1] is the shorter list, we shift it once to the right, resulting in 2 windows of comparison. Summing the absolute differences over both windows gives the total difference between the two lists, in this case a total of 25.
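In case the image doesn't come through, here is a plain-Python version of what I mean by window_diff (roughly what my current slow version does):

def window_diff(a, b):
    # Slide the shorter list along the longer one and sum the absolute differences
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    total = 0
    for offset in range(len(long_) - len(short) + 1):
        total += sum(abs(long_[offset + k] - short[k]) for k in range(len(short)))
    return total

print(window_diff(lists[1], lists[0]))  # 25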
The function should aggregate the total window_diff between each list and the other lists, so in this case
tot = total_diffs(lists)
tot>>[40, 30, 20]
# where tot[0] represents the sum of lists[0] window_diff with all other lists.
I wanted to know if there was a quick route to doing this in pandas or numpy. Currently I am using a very long-winded process of looping through each of the lists and then comparing them element by element, shifting the shorter list along the longer list.
My approach works fine for short lists, but my dataset is 10,000 lists long and some of these lists contain 60 or so datapoints, so speed is a concern here. I was wondering if there is a numpy or pandas way to do this? Thanks
Sample problem data
import random
lists = [[random.randint(0, 1000) for r in range(random.randint(0, 60))] for x in range(100000)]
Steps:
For each pair of lists from the input list of lists, create sliding windows over the bigger array and then take the absolute difference against the smaller one in that pair. We can use NumPy strides to get those sliding windows.
Get the total sum and store it as the pair-wise difference.
Finally, sum along each row and column of the 2D array from the previous step; their sum is the final output.
Thus, the implementation would look something like this -
import itertools
import numpy as np

def strided_app(a, L, S=1):  # Window len = L, Stride len/stepsize = S
    a = np.asarray(a)
    nrows = ((a.size - L) // S) + 1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows, L), strides=(S*n, n))

N = len(lists)
pair_diff_sums = np.zeros((N, N), dtype=type(lists[0][0]))
for i, j in itertools.combinations(range(N), 2):
    A, B = lists[i], lists[j]
    if len(A) > len(B):
        pair_diff_sums[i, j] = np.abs(strided_app(A, L=len(B)) - B).sum()
    else:
        pair_diff_sums[i, j] = np.abs(strided_app(B, L=len(A)) - A).sum()

out = pair_diff_sums.sum(1) + pair_diff_sums.sum(0)
For really heavy datasets, here's one method using one more level of looping -
N = len(lists)
out = np.zeros((N), dtype=type(lists[0][0]))
for k, i in enumerate(lists):
    for j in lists:
        if len(i) > len(j):
            out[k] += np.abs(strided_app(i, L=len(j)) - j).sum()
        else:
            out[k] += np.abs(strided_app(j, L=len(i)) - i).sum()
strided_app is inspired from here.
Sample input, output -
In [77]: lists
Out[77]: [[10, 15, 5], [5, 10], [5]]
In [78]: pair_diff_sums
Out[78]:
array([[ 0, 25, 15],
[25, 0, 5],
[15, 5, 0]])
In [79]: out
Out[79]: array([40, 30, 20])
Just for completeness of @Divakar's great answer, and for its application to very large datasets:
import itertools

N = len(lists)
out = np.zeros(N, dtype=type(lists[0][0]))
for i, j in itertools.combinations(range(N), 2):
    A, B = lists[i], lists[j]
    if len(A) > len(B):
        diff = np.abs(strided_app(A, L=len(B)) - B).sum()
    else:
        diff = np.abs(strided_app(B, L=len(A)) - A).sum()
    out[i] += diff
    out[j] += diff
It does not create unnecessarily large intermediate arrays, and it updates a single vector while iterating only over the upper triangular part.
It will still take a while to compute, as there is a tradeoff between computational complexity and larger-than-RAM datasets. Solutions for larger-than-RAM datasets often rely on iteration, and Python is not great at it: iterating in Python over a large dataset is slow, very slow.
Translating the code above to Cython could speed things up a bit.
I am attempting to build a simple genetic algorithm that will optimize to an input string, but am having trouble building the [individual x genome] matrix (row n is individual n's genome.) I want to be able to change the population size, mutation rate, and other parameters to study how that affects convergence rate and program efficiency.
This is what I have so far:
import random
import itertools
import numpy as np

def evolve():
    goal = 'Hello, World!'  # string to optimize towards
    ideal = list(goal)
    # converting the string into a list of integers
    for i in range(0, len(ideal)):
        ideal[i] = ord(ideal[i])
    print(ideal)
    popSize = 10  # population size
    genome = len(ideal)  # determining the length of the genome to be the length of the target string
    mut = 0.03  # mutation rate
    S = 4  # tournament size
    best = float("inf")  # initial best is very large
    maxVal = max(ideal)
    minVal = min(ideal)
    print(maxVal)
    i = 0  # counting variables assigned to solve UnboundLocalError
    j = 0
    print(maxVal, minVal)
    # constructing initial population array (individual x genome)
    pop = np.empty([popSize, len(ideal)])
    for i, j in itertools.product(range(i), range(j)):
        pop[i, j] = [i, random.randint(minVal, maxVal)]
    print(pop)
This produces a matrix of the population size with the correct genome length, but the genomes are something like:
[ 6.91364167e-310 6.91364167e-310 1.80613009e-316 1.80613009e-316
5.07224590e-317 0.00000000e+000 6.04100487e+151 3.13149876e-120
1.11787892e+253 1.47872844e-028 7.34486815e+223 1.26594941e-118
7.63858409e+228]
I need them to be random integers corresponding to random ASCII characters.
What am I doing wrong with this method?
Is there a way to make this faster?
I found my current method here:
building an nxn matrix in python numpy, for any n
I found another method that I do not understand, but it seems faster and simpler; if I can use it here, I would like to.
Initialise numpy array of unknown length
Thank you for any assistance you can provide.
Your loop isn't executing because i and j are both 0, so range(i) and range(j) are empty. Also, you can't assign a list like [i, random.randint(...)] to a single array cell (np.empty defaults to np.float64). I've simply changed it to store only the random number, but if you really want to store a list, you can change the creation of pop to pop = np.empty([popSize, len(ideal)], dtype=list).
Otherwise use this for the last lines:
for i, j in itertools.product(range(popSize), range(len(ideal))):
    pop[i, j] = random.randint(minVal, maxVal)
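As a side note, since you are already using NumPy, the whole population can also be generated in one call instead of the loop; a sketch (np.random.randint's upper bound is exclusive, hence the + 1):

pop = np.random.randint(minVal, maxVal + 1, size=(popSize, len(ideal)))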