Alternating Directions in Python List - More Pythonic Solution

I feel this should be simple but I'm stuck on finding a neat solution. The code I have provided works, and gives the output I expect, but I don't feel it is Pythonic and it's getting on my nerves.
I have produced three sets of coordinates, X, Y & Z, using 'griddata' from a base data set. The coordinates are evenly spaced over an unknown total area / shape (not necessarily square / rectangular), producing NaN results at the boundaries of each list, which I want to ignore. The list should be traversed from the 'bottom left' (in a coordinate system) across the x axis, then up one space in the y direction and right to left, before continuing. There could be an odd or even number of rows.
The operation to be performed on each point is the same no matter the direction, and it is guaranteed that for every point which exists in X, a point exists in Y and Z, as can be seen in the code below.
Arrays (lists?) are of the format DataPoint[rows][columns].
k = 0
for i in range(len(x)):
    if k % 2 == 0:  # cut left to right, then right to left
        for j in range(len(x[i])):
            if not numpy.isnan(x[i][j]):
                file.write(f'X{x[i][j]} Y{y[i][j]} Z{z[i][j]}')
    else:
        for j in reversed(range(len(x[i]))):
            if not numpy.isnan(x[i][j]):
                file.write(f'X{x[i][j]} Y{y[i][j]} Z{z[i][j]}')
    k += 1
One solution I could think of would be to reverse every other row in each of the lists before running the loop. It would save me a few lines, but probably wouldn't make sense from a performance standpoint - anyone have any better suggestions?
Expected route through list:
End════<══════╗
╔══════>══════╝
╚══════<══════╗
Start══>══════╝

Here's a variant:
for i, (x_row, y_row, z_row) in enumerate(zip(x, y, z)):
    if i % 2:
        x_row = reversed(x_row)
        y_row = reversed(y_row)
        z_row = reversed(z_row)
    row_strs = []
    for x_elem, y_elem, z_elem in zip(x_row, y_row, z_row):
        if not numpy.isnan(x_elem):
            row_strs.append(f"X{x_elem} Y{y_elem} Z{z_elem}")
    file.write("".join(row_strs))
Considerations:
There is no recipe for an optimization that will always perform better than any other; it also depends on the data that the code handles. Here's a list of things I could think of, without knowing what the data looks like:
- for index in range(len(sequence)): is not a Pythonic way of iterating. Here, the foreach idiom is used instead. If the index is required, [Python 3.Docs]: Built-in Functions - enumerate(iterable, start=0) could be used
- This no longer applies because of the previous bullet, but reversed(range(n)) is the same as range(n - 1, -1, -1). I don't know whether the latter is faster, but it looks like it would be
- Iterate over multiple iterables at once, using [Python 3.Docs]: Built-in Functions - zip(*iterables)
- There's no need for k; i already serves that purpose
- In general, when working with files it's better to read / write bigger chunks of data fewer times than smaller chunks many times (files generally reside on disk, and disk operations are slow). Buffering happens by default (at the Python and OS levels), so this is less of an issue than it used to be, but as always it's a trade-off between resources (time, memory, ...). I chose to write to the file once per row (rather than once per element, as it was originally). Of course, there's a third possibility of writing everything at once, but I imagined that for larger data sets it wouldn't be the best solution
- Probably, some optimizations could also happen at the NumPy level, since it handles bulk data much faster than Python-level iteration does, but I'm not an expert in that area, nor do I know what the data looks like (see the sketch below)
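To make that last bullet concrete, here is a rough sketch of the NumPy-level idea: reverse every other row in bulk, then make a single pass (and a single write) over the flattened arrays. The array names, shapes, and values here are made up for illustration:

import numpy as np

# Toy stand-ins for the real x, y, z grids (names and shapes assumed).
x = np.arange(12, dtype=float).reshape(3, 4)
y = x + 100
z = x + 200
x[0, 0] = np.nan  # a boundary NaN that should be skipped

# Reverse every other row in bulk, so that a plain row-major pass
# follows the snake path from the question.
xs, ys, zs = (a.copy() for a in (x, y, z))
for a in (xs, ys, zs):
    a[1::2] = a[1::2, ::-1].copy()

keep = ~np.isnan(xs).ravel()
lines = [f"X{xv} Y{yv} Z{zv}"
         for xv, yv, zv in zip(xs.ravel()[keep], ys.ravel()[keep], zs.ravel()[keep])]
print("\n".join(lines))  # or: file.write("\n".join(lines))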

I agree with @Prune, your code looks readable and does what it should do. You could compress it a bit by precomputing the indices, like so (note that this starts from the top left):
import numpy as np

# generate some sample data
x = np.arange(100).reshape(10, 10)

# precompute both directions (materialise the reversed one,
# so it isn't exhausted after the first pass)
fancyranges = (
    list(range(len(x[0, :]))),
    list(reversed(range(len(x[0, :])))),
)

for a in range(x.shape[0]):
    # pick the appropriate direction for this row
    for b in fancyranges[a % 2]:
        # do things
        print(x[a, b])

You can move the repeated code into a sub-function, so further changes happen in one place:
def func():
    def sub_func():
        # repeatable code
        if not numpy.isnan(x[i][j]):
            print(f'X{x[i][j]}...')

    k = 0
    for i in range(len(x)):
        if k % 2 == 0:  # cut left to right, then right to left
            for j in range(len(x[i])):
                sub_func()
        else:
            for j in reversed(range(len(x[i]))):
                sub_func()
        k += 1

func()
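A variation on the same idea: pass the indices into the helper explicitly instead of relying on the closure. This is a sketch under the same assumptions (x, y, z and an open file object exist as in the question):

def write_point(i, j):
    # Skip boundary NaNs, write everything else.
    if not numpy.isnan(x[i][j]):
        file.write(f'X{x[i][j]} Y{y[i][j]} Z{z[i][j]}')

for i in range(len(x)):
    cols = range(len(x[i]))
    # Even rows go left to right, odd rows right to left.
    for j in (cols if i % 2 == 0 else reversed(cols)):
        write_point(i, j)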


How can I make this Python code quicker?

How can I make this code quicker?
def add_may_go(x, y):
    counter = 0
    for i in range(-2, 3):
        cur_y = y + i
        if cur_y < 0 or cur_y >= board_order:
            continue
        for j in range(-2, 3):
            cur_x = x + j
            if (i == 0 and j == 0) or cur_x < 0 or cur_x >= board_order or [cur_y, cur_x] in huge_may_go:
                continue
            if not public_grid[cur_y][cur_x]:
                huge_may_go.append([cur_y, cur_x])
                counter += 1
    return counter
INPUT:
something like: add_may_go(8,8), add_may_go(8,9) ...
huge_may_go is a huge list like:
[[7,8],[7,9], [8,8],[8,9],[8,10]....]
public_grid is also a huge list, of size board_order*board_order;
every element in it is either 0 or 1,
like:
[
[0,1,0,1,0,1,1,...(board_order times), 0, 1],
... board_order times
[1,0,1,1,0,0,1,...(board_order times), 0, 1],
]
board_order is a global variable, which is usually 19 (sometimes 15 or 20).
It runs far too slowly now, and this function is going to run hundreds of times. Any suggestions are welcome!
I have tried numpy, but numpy made it even slower! Please help.
It is difficult to provide a definitive improvement without sample data and a bit more context. Using numpy would be beneficial if you can manage to perform all the calls (i.e. all (x,y) coordinate values) in a single operation. There are also strategies based on sets that could work but you would need to maintain additional data structures in parallel with the public_grid.
Based only on that piece of code, and without changing the rest of the program, there are a couple of things you could do that will provide small performance improvements:
- loop only over eligible coordinates rather than skipping invalid ones (outside the board)
- compute the cur_x, cur_y values only once (track them in a dictionary for each (x, y) pair). This assumes that the same x, y coordinates are used in multiple calls to the function.
- use comprehensions when possible
- use set operations to avoid duplicate coordinates in huge_may_go
hugeCoord = dict()  # keep track of the offset coordinates

def add_may_go(x, y):
    # compute the coordinates only once (the first time)
    if (x, y) not in hugeCoord:
        hugeCoord[x, y] = [(cx, cy)
                           for cy in range(max(0, y - 2), min(board_order, y + 3))
                           for cx in range(max(0, x - 2), min(board_order, x + 3))
                           if cx != x or cy != y]
    # get the resulting coordinates using a set comprehension
    fit = {(cy, cx) for cx, cy in hugeCoord[x, y] if not public_grid[cy][cx]}
    fit.difference_update(map(tuple, huge_may_go))  # use a set to avoid duplicates
    huge_may_go.extend(list(c) for c in fit)        # keep storing [y, x] lists, as before
    return len(fit)
Note that if huge_may_go were a set instead of a list, adding to it without repetitions would be more efficient, because you could update it directly (and return the difference in size).
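For illustration, a minimal sketch of that set-based variant; it assumes board_order and public_grid as in the question, with huge_may_go now holding (y, x) tuples:

huge_may_go = set()  # (y, x) tuples instead of [y, x] lists

def add_may_go(x, y):
    # In-bounds neighbours within distance 2, excluding (x, y) itself and occupied cells.
    candidates = {(cy, cx)
                  for cy in range(max(0, y - 2), min(board_order, y + 3))
                  for cx in range(max(0, x - 2), min(board_order, x + 3))
                  if (cy, cx) != (y, x) and not public_grid[cy][cx]}
    before = len(huge_may_go)
    huge_may_go.update(candidates)    # duplicates are ignored automatically
    return len(huge_may_go) - before  # count of genuinely new cells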
if (i == 0 and j == 0)...: continue
Small improvement: reduce the number of iterations by not making them in the first place.
for i in (1, 2):
    do stuff with i and -i
for j in (1, 2):
    do stuff with j and -j
I want to highlight 2 places which need special attention.
if (...) [cur_y,cur_x] in huge_may_go:
Unlike the rest of the conditions, this is not an arithmetic comparison but a containment check; if huge_may_go is a list, it takes O(n) time, i.e. time proportional to the number of elements in the list.
huge_may_go.append([cur_y,cur_x])
The Python Wiki describes the .append method of list as O(1), but with the disclaimer that individual actions may take surprisingly long, depending on the history of the container. You might use collections.deque as a replacement for list; it was designed with the performance of inserts (at either end) in mind.
If huge_may_go must not contain duplicates and you do not care about order, then you might use a set rather than a list and keep tuples of (y, x) in it (a set cannot hold lists). When using the .add method of set you can skip the contains check, as adding an existing element has no effect; consider that
s = set()
s.add((1,2))
s.add((3,4))
s.add((1,2))
print(s)
gives output
{(1, 2), (3, 4)}
If you then need a contains check, a set membership test is O(1).

For-loop over Python float array

I am working with the Iris dataset. I have two sets of data: (1) a training set and (2) a test set. Now I want to calculate the Euclidean distance between every test set row and every training set row. However, I only want to include the first 4 values of each row.
A working example would be:
dist = np.linalg.norm(inner1test[0][0:4]-inner1train[0][0:4])
print(dist)
***output: 3.034243***
The problem is that I have 120 training set points and 30 test set points, so I would have to do 3600 operations manually; thus I thought about iterating through with a for-loop. Unfortunately, every one of my attempts fails.
This would be my best attempt, which raises the error message below:
for i in inner1test:
    for number in inner1train:
        dist = np.linalg.norm(inner1test[i][0:4]-inner1train[number][0:4])
        print(dist)
IndexError: arrays used as indices must be of integer (or boolean) type
What would be the best solution to iterate through this array?
PS: I will also provide a screenshot for better visualisation.
From what I see, inner1test is a tuple of lists, so the i value will not be an index but the actual list.
You should use enumerate, which returns two variables, the index and the actual data.
for i, value in enumerate(inner1test):
    for j, number in enumerate(inner1train):
        dist = np.linalg.norm(inner1test[i][0:4] - inner1train[j][0:4])
        print(dist)
Also, if your lists get bigger, consider using a generator, which will execute your calculations one iteration at a time, returning a single value on each step and avoiding building a big chunk of results that would occupy a lot of memory.
e.g.:
def my_calculation(inner1test, inner1train):
    for i, value in enumerate(inner1test):
        for j, number in enumerate(inner1train):
            dist = np.linalg.norm(inner1test[i][0:4] - inner1train[j][0:4])
            yield dist

for dist in my_calculation(inner1test, inner1train):
    print(dist)
You might also want to investigate Python list comprehensions, which are sometimes a more elegant way to handle for loops over lists.
[EDIT]
Here's a probably easier solution anyway, without the need for indexes, which won't fail when enumerating a NumPy object:
for testvalue in inner1test:
    for testtrain in inner1train:
        dist = np.linalg.norm(testvalue[0:4] - testtrain[0:4])
[/EDIT]
This was the final solution with the correct output for me:
distanceslist = list()
for testvalue in inner1test:
    for testtrain in inner1train:
        dist = np.linalg.norm(testvalue[0:4] - testtrain[0:4])
        distances = (dist, testtrain[0:4])
        distanceslist.append(distances)
distanceslist
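For reference, NumPy can also produce the whole 30 x 120 distance matrix in a single broadcast step, with no explicit loops. This is a sketch that assumes both inputs convert cleanly to arrays with at least four columns:

import numpy as np

test = np.asarray(inner1test)[:, :4]    # shape (30, 4)
train = np.asarray(inner1train)[:, :4]  # shape (120, 4)

# Broadcasting gives pairwise differences with shape (30, 120, 4).
dists = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=2)
print(dists.shape)  # (30, 120)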

Create multiple numpy arrays of same size at once [duplicate]

I was unable to find anything describing how to do this, which leads me to believe I'm not doing this in the proper idiomatic Python way. Advice on the 'proper' Python way to do this would also be appreciated.
I have a bunch of variables for a datalogger I'm writing (arbitrary logging length, with a known maximum length). In MATLAB, I would initialize them all as 1-D arrays of zeros of length n, n bigger than the number of entries I would ever see, assign each individual element variable(measurement_no) = data_point in the logging loop, and trim off the extraneous zeros when the measurement was over. The initialization would look like this:
[dData gData cTotalEnergy cResFinal etc] = deal(zeros(n,1));
Is there a way to do this in Python/NumPy so I don't have to either put each variable on its own line:
dData = np.zeros(n)
gData = np.zeros(n)
etc.
I would also prefer not to just make one big matrix, because keeping track of which column is which variable is unpleasant. Perhaps the solution is to make a (length x numvars) matrix and assign the column slices out to individual variables?
EDIT: Assume I'm going to have a lot of vectors of the same length by the time this is over; e.g., my post-processing takes each log file, calculates a bunch of separate metrics (>50), stores them, and repeats until the logs are all processed. Then I generate histograms, means/maxes/sigmas/etc. for all the various metrics I computed. Since initializing 50+ vectors is clearly not easy in Python, what's the best (cleanest code and decent performance) way of doing this?
If you're really motivated to do this in a one-liner you could create an (n_vars, ...) array of zeros, then unpack it along the first dimension:
a, b, c = np.zeros((3, 5))
print(a is b)
# False
Another option is to use a list comprehension or a generator expression:
a, b, c = [np.zeros(5) for _ in range(3)] # list comprehension
d, e, f = (np.zeros(5) for _ in range(3)) # generator expression
print(a is b, d is e)
# False False
Be careful, though! You might think that using the * operator on a list or tuple containing your call to np.zeros() would achieve the same thing, but it doesn't:
h, i, j = (np.zeros(5),) * 3
print(h is i)
# True
This is because the expression inside the tuple gets evaluated first. np.zeros(5) therefore only gets called once, and each element in the repeated tuple ends up being a reference to the same array. This is the same reason why you can't just use a = b = c = np.zeros(5).
Unless you really need to assign a large number of empty array variables and you really care deeply about making your code compact (!), I would recommend initialising them on separate lines for readability.
Nothing wrong or un-Pythonic with
dData = np.zeros(n)
gData = np.zeros(n)
etc.
You could put them on one line, but there's no particular reason to do so.
dData, gData = np.zeros(n), np.zeros(n)
Don't try dData = gData = np.zeros(n), because a change to dData changes gData (they point to the same object). For the same reason you usually don't want to use x = y = [].
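A quick demonstration of that aliasing:

import numpy as np

a = b = np.zeros(3)  # both names refer to the same array object
a[0] = 1.0
print(b)  # [1. 0. 0.] -- modifying a changed b as well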
The deal in MATLAB is a convenience, but isn't magical. Here's how Octave implements it:
function [varargout] = deal (varargin)
  if (nargin == 0)
    print_usage ();
  elseif (nargin == 1 || nargin == nargout)
    varargout(1:nargout) = varargin;
  else
    error ("deal: nargin > 1 and nargin != nargout");
  endif
endfunction
In contrast to Python, in Octave (and presumably MATLAB)
one=two=three=zeros(1,3)
assigns different objects to the 3 variables.
Notice also how MATLAB talks about deal as a way of assigning contents of cells and structure arrays. http://www.mathworks.com/company/newsletters/articles/whats-the-big-deal.html
If you put your data in a collections.defaultdict you won't need to do any explicit initialization. Everything will be initialized the first time it is used.
import numpy as np
import collections

n = 100
data = collections.defaultdict(lambda: np.zeros(n))
for i in range(1, n):
    data['g'][i] = data['d'][i - 1]
    # ...
How about using map:
import numpy as np
n = 10 # Number of data points per array
m = 3 # Number of arrays being initialised
gData, pData, qData = map(np.zeros, [n] * m)
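If the concern from the question is keeping track of which array is which, a dict comprehension keyed by name is another option; a sketch reusing the variable names from the question:

import numpy as np

n = 100
names = ('dData', 'gData', 'cTotalEnergy', 'cResFinal')
data = {name: np.zeros(n) for name in names}  # one fresh array per name
data['dData'][0] = 42.0  # access by name instead of by column index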

Cropping Python lists by value instead of by index

Good evening, StackOverflow.
Lately, I've been wrestling with a Python program which I'll try to outline as briefly as possible.
In essence, my program plots (and then fits a function to) graphs. Consider this graph.
The graph plots just fine, but I'd like it to do a little more than that: since the data is periodic over an interval OrbitalPeriod (1.76358757), I'd like it to start with the first x value, iteratively plot all of the points one OrbitalPeriod away from it, and then do exactly the same thing over the next region of length OrbitalPeriod.
I know that there is a way to slice lists in Python of the form
croppedList = List[a:b]
where a is the index of the first element you'd like to include in the new list and b is the index one past the last. However, I have no idea what the indices are going to be for each of the values, or how many values fall between each OrbitalPeriod-sized interval.
What I want to do in pseudo-code looks something like this.
croppedList = fullList on the domain [a + (N * OrbitalPeriod), a + ((N + 1) * OrbitalPeriod)]
where a is the x-value of the first meaningful data point.
If you have a workaround for this or a cropping method that would accept values instead of indices as arguments, please let me know. Thanks!
If you are working with NumPy, you can use boolean masks inside the brackets:
m = x
M = x + OrbitalPeriod
croppedList = List[m <= List]
croppedList = croppedList[croppedList < M]
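The two filters can also be combined into a single boolean mask. A sketch with made-up data, assuming List is already a NumPy array:

import numpy as np

List = np.array([0.0, 0.5, 1.0, 2.5, 3.0])  # hypothetical data
m = 0.5
M = m + 1.76358757  # one OrbitalPeriod, as in the question

# Element-wise AND of both conditions in one mask.
croppedList = List[(m <= List) & (List < M)]
print(croppedList)  # [0.5 1. ]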

Optimising model of social network evolution

I am writing a piece of code which models the evolution of a social network. The idea is that each person is assigned to a node and relationships between people (edges on the network) are given a weight of +1 or -1 depending on whether the relationship is friendly or unfriendly.
Using this simple model, you can say that a triad of three people is either "balanced" or "unbalanced" depending on whether the product of the edges of the triad is positive or negative.
So finally, what I am trying to do is implement an Ising-type model. That is, random edges are flipped, and the new relationship is kept if the new network has more balanced triangles (a lower energy) than the network before the flip; if that is not the case, the new relationship is only kept with a certain probability.
OK, so finally onto my question: I have written the following code; however, the dataset I have contains ~120k triads, and as a result it will take 4 days to run!
Could anyone offer any tips on how I might optimise the code?
Thanks.
# Import required libraries
try:
    import matplotlib.pyplot as plt
except:
    raise
import networkx as nx
import csv
import random
import math

def prod(iterable):
    p = 1
    for n in iterable:
        p *= n
    return p

def Sum(iterable):
    p = 0
    for n in iterable:
        p += n[3]
    return p

def CalcTriads(n):
    firstgen = G.neighbors(n)
    Edges = []
    Triads = []
    for i in firstgen:
        Edges.append(G.edges(i))
    for i in xrange(len(Edges)):
        for j in range(len(Edges[i])):  # For node n, go through the list of edges (j) for the neighboring nodes (i)
            if set([Edges[i][j][1]]).issubset(firstgen):  # If the second node on the edge is also a neighbor of n (it's in firstgen), keep the edge.
                t = [n, Edges[i][j][0], Edges[i][j][1]]
                t.sort()
                Triads.append(t)  # Add found nodes to Triads.
    new_Triads = []  # Delete duplicate triads.
    for elem in Triads:
        if elem not in new_Triads:
            new_Triads.append(elem)
    Triads = new_Triads
    for i in xrange(len(Triads)):  # For each triad, look up its three edge weights using G[node1][node2], multiply them, and append the product to the triad.
        a = G[Triads[i][0]][Triads[i][1]].values()
        b = G[Triads[i][1]][Triads[i][2]].values()
        c = G[Triads[i][2]][Triads[i][0]].values()
        Q = prod(a + b + c)
        Triads[i].append(Q)
    return Triads
###### Import sorted edge data ######
li = []
with open('Sorted Data.csv', 'rU') as f:
    reader = csv.reader(f)
    for row in reader:
        li.append([float(row[0]), float(row[1]), float(row[2])])

G = nx.Graph()
G.add_weighted_edges_from(li)

for i in xrange(800000):
    e = random.choice(li)  # Choose random edge
    TriNei = []
    a = CalcTriads(e[0])  # Find triads of first node in the chosen edge
    for i in xrange(0, len(a)):
        if set([e[1]]).issubset(a[i]):  # Keep triads which contain the whole edge (i.e. both nodes on the edge)
            TriNei.append(a[i])
    preH = -Sum(TriNei)  # Save the "energy" of all the triads of which the edge is a member

    e[2] = -1 * e[2]  # Flip the weight of the random edge and create a new graph with the flipped edge
    G.clear()
    G.add_weighted_edges_from(li)
    TriNei = []
    a = CalcTriads(e[0])
    for i in xrange(0, len(a)):
        if set([e[1]]).issubset(a[i]):
            TriNei.append(a[i])
    postH = -Sum(TriNei)  # Calculate the post flip "energy".

    if postH < preH:  # If the post flip energy is lower than the pre flip energy, keep the change
        continue
    elif random.random() < 0.92:  # If the post flip energy is higher, only keep the change with some small probability. (0.92 is an approximate placeholder for exp(-DeltaH)/exp(1) at the moment)
        e[2] = -1 * e[2]
The following suggestions won't boost your performance that much because they are not on the algorithmic level, i.e. not very specific to your problem. However, they are generic suggestions for slight performance improvements:
Unless you are using Python 3, change
for i in range(800000):
to
for i in xrange(800000):
The latter one just iterates numbers from 0 to 800000, the first one creates a huge list of numbers and then iterates that list. Do something similar for the other loops using range.
Also, change
j=random.choice(range(len(li)))
e=li[j] # Choose random edge
to
e = random.choice(li)
and use e instead of li[j] subsequently. If you really need an index number, use random.randint(0, len(li)-1).
There are syntactic changes you can make to speed things up, such as replacing your Sum and prod functions with the built-in equivalents sum(x[3] for x in iterable) and reduce(operator.mul, iterable) - it is generally faster to use built-in functions or generator expressions than explicit loops.
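For instance, a sketch of those replacements (on Python 3, reduce must be imported from functools; in the question's Python 2 code it is a builtin). The triads and weights here are hypothetical stand-ins for the real data:

import operator
from functools import reduce  # not needed on Python 2

triads = [[1, 2, 3, -1], [1, 2, 4, 1]]  # triads with the weight product in slot 3
weights = [1, -1, -1]

energy = sum(t[3] for t in triads)          # replaces Sum(triads)
product = reduce(operator.mul, weights, 1)  # replaces prod(weights)
print(energy, product)  # 0 1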
As far as I can tell the line:
if set([e[1]]).issubset(a[i]): # Keep triads which contain the whole edge (i.e. both nodes on the edge)
is testing if a float is in a list of floats. Replacing it with if e[1] in a[i]: will remove the overhead of creating two set objects for each comparison.
Incidentally, you do not need to loop through the index values of an array, if you are only going to use that index to access the elements. e.g. replace
for i in range(0, len(a)):
    if set([e[1]]).issubset(a[i]):  # Keep triads which contain the whole edge (i.e. both nodes on the edge)
        TriNei.append(a[i])
with
for x in a:
    if set([e[1]]).issubset(x):  # Keep triads which contain the whole edge (i.e. both nodes on the edge)
        TriNei.append(x)
However I suspect that changes like this will not make a big difference to the overall runtime. To do that you either need to use a different algorithm or switch to a faster language. You could try running it in pypy - for some cases it can be significantly faster than CPython. You could also try cython, which will compile your code to C and can sometimes give a big performance gain especially if you annotate your code with cython type information. I think the biggest improvement may come from changing the algorithm to one that does less work, but I don't have any suggestions for that.
BTW, why loop 800000 times? What is the significance of that number?
Also, please use meaningful names for your variables. Using single character names or shrtAbbrv does not speed the code up at all, and makes it very hard to follow what it is doing.
There are quite a few things you can improve here. Start by profiling your program with a tool like cProfile; this will tell you where most of the program's time is being spent, and thus where optimization is likely to be most helpful. As a hint, you don't need to generate all the triads at every iteration of the program.
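A minimal way to run that profiling, assuming the simulation loop is wrapped in a hypothetical main() function:

import cProfile
import pstats

cProfile.run('main()', 'profile.out')  # profile a full run and save the stats
stats = pstats.Stats('profile.out')
stats.sort_stats('cumulative').print_stats(10)  # ten most expensive call sites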
You also need to fix your indentation before you can expect a decent answer.
Regardless, this question might be better suited to Code Review.
I'm not sure I understand exactly what you are aiming for, but there are at least two changes that might help. You probably don't need to destroy and create the graph every time in the loop since all you are doing is flipping one edge weight sign. And the computation to find the triangles can be improved.
Here is some code that generates a complete graph with random weights, picks a random edge in a loop, finds the triads and flips the edge weight...
import random
import networkx as nx

# complete graph with random 1/-1 as weight
G = nx.complete_graph(5)
for u, v, d in G.edges(data=True):
    d['weight'] = random.randrange(-1, 2, 2)  # -1 or 1

edges = G.edges()
for i in range(10):
    u, v = random.choice(edges)  # random edge
    nbrs = set(G[u]) & set(G[v]) - set([u, v])  # third nodes of the triads
    triads = [(u, v, n) for n in nbrs]
    print "triads", triads
    for a, b, c in triads:  # avoid reusing u, v here, so the flip below hits the chosen edge
        print (a, b, G[a][b]['weight']), (a, c, G[a][c]['weight']), (b, c, G[b][c]['weight'])
    G[u][v]['weight'] *= -1
