I have a dataframe with two columns, call them A and B, and I want to find all symmetric pairs. For example, given:
pd.DataFrame({'A':[1, 2 , 3], 'B':[2, 1, 3]})
I want to return all such pairs. Here I would get the following (I don't actually need both; either (1,2) or (2,1) is enough):
(1,2) and (2,1)
I first tried an algorithm that works but is really slow: on my dataframe of length 26325 it still hadn't finished after 10 minutes.
listTuples = list()
for index, row in test.iterrows():
    listTuples.append((row["A"], row["B"]))  # convert to a list of tuples
answer = [(x, y) for (x, y) in listTuples if (y, x) in listTuples]
In general you pretty much never have to iterate over rows in a pandas dataframe. In this case, you could speed things up with
listTuples = list(zip(df.A, df.B))  # list() so it can be searched repeatedly in Python 3
If that's the part of your code that is running slowly, that should solve your problem.
Your list comprehension step is the likelier bottleneck: (y, x) in listTuples scans the whole list for every element, and duplicate entries get checked again and again. Start by keeping only the unique tuples (for example in a set, which also makes the membership test fast), and then run your list comprehension.
Also see this stack overflow question about picking unique tuples from a list of tuples in python.
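As a rough sketch of the "unique tuples first" idea, the membership test can run against a set instead of a list. The a < b filter is an assumption on my part: it keeps one representative per symmetric pair and skips self-pairs such as (3, 3), which may or may not be what you want.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 1, 3]})

# A set removes duplicate rows and gives constant-time membership tests.
pairs = set(zip(df['A'].tolist(), df['B'].tolist()))

# Keep one representative per symmetric pair (a < b); this also skips
# self-pairs such as (3, 3), which is an assumption about what is wanted.
symmetric = sorted((a, b) for (a, b) in pairs if a < b and (b, a) in pairs)
print(symmetric)
```

If both orientations are needed, dropping the a < b condition (and using a != b instead) would return (1, 2) and (2, 1).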
Suppose I have the following NumPy array:
array([['a', 0, 0, 0],
       [0, 'b', 'c', 0],
       ['e', 'd', 0, 0]], dtype=object)
Now I would like to define 'double connections' of elements as follows:
We consider each column in this array as a time instant, and all elements in that instant are considered to happen at the same time. 0 means nothing happens. For example, a and e happen at the first time instant, b and d happen at the second time instant, and c happens alone at the third time instant.
If exactly two elements happen at the same time instant, I say they have a 'double connection', and I would like to print the connections like this (if there is no such pair in a column, just move on to the next column until the end):
('a','e')
('e','a')
('b','d')
('d','b')
I tried to come up with solutions that iterate over all the columns, but they did not work. Can anyone share some tips on this?
You can recreate the original array by the following commands
array = np.array([['a',0,0,0],
                  [0,'b','c',0],
                  ['e','d',0,0]], dtype=object)
You could count how many non-zero elements each column has, select the columns with exactly two, repeat each resulting pair, and reverse every second row:
cols = array[:, (array != 0).sum(axis=0) == 2].T  # one row per column that holds exactly two events
pairs = np.repeat(cols[cols != 0].reshape(-1, 2), 2, axis=0)
pairs[1::2] = pairs[1::2, ::-1]
If you want to convert these to tuples like in your desired output you could just do a list comprehension:
output = [tuple(pair) for pair in pairs]
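If readability matters more than staying fully vectorized, the same idea can be sketched as a plain loop over the columns. This is only an illustration; the function name double_connections is made up here, and it assumes 0 is the only "nothing happens" marker.

```python
from itertools import permutations

import numpy as np

array = np.array([['a', 0, 0, 0],
                  [0, 'b', 'c', 0],
                  ['e', 'd', 0, 0]], dtype=object)


def double_connections(arr):
    """Return both orderings of every column that holds exactly two events."""
    pairs = []
    for column in arr.T:  # each column is one time instant
        events = [x for x in column if x != 0]
        if len(events) == 2:
            pairs.extend(permutations(events, 2))
    return pairs


for pair in double_connections(array):
    print(pair)
```

Looping over arr.T keeps the two events of a column together regardless of which rows they sit in.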
I'm currently trying to create a column in a pandas dataframe, that creates a counter that equals the number of rows in the dataframe, divided by 2. Here is my code so far:
# Fill the cycles column with however many rows exist / 2
for x in ((jac_output.index)/2):
    jac_output.loc[x, 'Cycles'] = x+1
However, I've noticed that it misses out values every so often.
[image of the output not available]
Why would my counter miss a value every so often as it gets higher? And is there another way of optimizing this, as it seems to be quite slow?
You may have removed some rows from the dataframe, so some indices are missing. You should either use reset_index to renumber them, or just use
for x in np.arange(0, len(jac_output.index), 1) / 2:
You can view jac_output.index as a list like [0, 1, 2, ...]. When you divide it by 2, it results in [0, 0.5, 1, ...]. 0.5 is surely not in your original index.
To take only the first half of the index, you can try:
jac_output.index[:len(jac_output.index)//2]
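For the speed part of the question, the loop can probably be replaced by one vectorized assignment. The sketch below assumes the intent is a counter that goes up by one every two rows, counted by position rather than by index label; the 'value' column and the gappy index are made-up sample data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index has gaps, as if rows had been removed.
jac_output = pd.DataFrame({'value': range(7)}, index=[0, 1, 2, 4, 5, 7, 8])

# Positional counter: rows 0-1 get 1, rows 2-3 get 2, and so on.
# It ignores the index labels, so missing labels cannot cause skipped values.
jac_output['Cycles'] = np.arange(len(jac_output)) // 2 + 1
print(jac_output)
```

If half steps (1, 1.5, 2, ...) are what you are after instead, replacing // with / gives that.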
I have a list of lists, named 'list', containing two columns, and I need to find the average of each by iterating over the rows. I would normally just use a tuple, then sum the column and divide it by the length.
Since I need to iterate, I am pretty lost as to what my approach should be. Can anyone point me in a direction?
Could you see if this is what is expected:
>>> data = [[1,2,3,1, 3], [4,5,6, 2, 1]]
>>> print([sum(x)/len(x) for x in data])
[2.0, 3.6]
I am using Python3 here.
Also, you should avoid using built-in names as variable names, e.g. list.
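If the data is laid out the other way round (each inner list is a row with two columns), a sketch along these lines may be closer to what was asked. The sample rows are made up, and both variants assume every row has exactly two entries.

```python
rows = [[1, 4], [2, 5], [3, 6]]  # hypothetical data: each inner list is one row

# Variant 1: iterate over the rows and keep running totals per column.
totals = [0.0, 0.0]
for first, second in rows:
    totals[0] += first
    totals[1] += second
means = [total / len(rows) for total in totals]

# Variant 2: zip(*rows) regroups the rows into columns.
column_means = [sum(col) / len(col) for col in zip(*rows)]

print(means, column_means)
```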
I have a numpy array of 4000*6 (6 columns), and a numpy array (1*6) of minimum values (made from another numpy array of 3000*6).
I want to find everything in the large array that is below those values, comparing each value to the minimum of its corresponding column.
I've tried the simple way, based on a one-column solution I already had:
largearray=[float('nan') if x<min_values else x for x in largearray]
but sadly it didn't work :(.
I can do a for loop over each column and each value, but I was wondering if there is a faster, more elegant solution.
Thanks
EDIT: I'll try to rephrase: I have 6 values and 6 columns.
I want to find the values in each column that are lower than the corresponding one of the 6 values.
By array I mean a 2D array. Sorry if it wasn't clear;
I'm still thinking in Matlab a bit.
This is my loop solution. It's on a df, not numpy. Still, is there a faster way?
a=0
for y in dfnames:
    df[y]=[float('nan') if x<minvalues[a] else x for x in df[y]]
    a=a+1
df is the large array or dataframe
dfnames are the column names i'm interested in.
minvalues are the minimum values for each column. I'm assuming that the order is the same. Bad assumption, but it works for now.
I will appreciate any help making it better.
I think you just need
result = largearray.copy()
result[result < min_values] = np.nan
That is, result is a copy of largearray, but any element less than the corresponding column of min_values is set to nan.
If you want to blank entire rows only when all entries in the row are less than the corresponding column of min_values, then you want:
result = largearray.copy()
result[np.all(result < min_values, axis=1)] = np.nan
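To see what the two variants do, here is a small made-up example (3 columns instead of 6). It assumes largearray has a float dtype, since nan cannot be stored in an integer array, and that min_values has one entry per column so that broadcasting lines it up with the columns.

```python
import numpy as np

# Hypothetical small stand-ins for the real arrays.
largearray = np.array([[1.0, 5.0, 9.0],
                       [4.0, 2.0, 7.0],
                       [0.5, 1.0, 2.0]])
min_values = np.array([1.0, 3.0, 8.0])

# Element-wise: min_values is broadcast across the rows, so each column
# is compared against its own minimum.
result = largearray.copy()
result[result < min_values] = np.nan

# Row-wise: blank a row only if every entry is below its column minimum.
rows_blanked = largearray.copy()
rows_blanked[np.all(largearray < min_values, axis=1)] = np.nan

print(result)
print(rows_blanked)
```

In this example only the last row is below the minimum in every column, so it is the only row blanked by the second variant.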
I don't use numpy much, so this may not be the commonly used solution, but this works:
import numpy
largearray = numpy.array([[1,2,3], [3,4,5]])
min_values = numpy.array([3,4,5])
largearray1 = [(float('nan') if not numpy.all(numpy.less(x, min_values)) else x) for x in largearray]
The result should be: [[1,2,3], nan]
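For the dataframe loop in the question, a possible pandas version is sketched below. The sample frame and the 'label' column are invented for illustration. Wrapping minvalues in a Series indexed by dfnames is my suggestion for removing the "same order" assumption, because the comparison is then aligned by column name.

```python
import pandas as pd

# Hypothetical data standing in for the real dataframe.
df = pd.DataFrame({'x': [1.0, 4.0, 0.5],
                   'y': [5.0, 2.0, 1.0],
                   'label': ['a', 'b', 'c']})
dfnames = ['x', 'y']
minvalues = [1.0, 3.0]

# Tie each minimum to its column name so the comparison aligns by label.
mins = pd.Series(minvalues, index=dfnames)

# True wherever a value is below the minimum of its own column.
below = df[dfnames].lt(mins, axis=1)

# mask() replaces the True positions with NaN; other columns are untouched.
df[dfnames] = df[dfnames].mask(below)
print(df)
```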
def maxvalues():
    for n in range(1,15):
        dummy=[]
        for k in range(len(MotionsAndMoorings)):
            dummy.append(MotionsAndMoorings[k][n])
        max(dummy)
        L = [x + [max(dummy)]] ## to be corrected (adding columns with value max(dummy))
        ## suggest code to add new row to L and for next function call, it should save values here.
I have an array of size (k x n) and I need to pick the max values of the first column in that array. Is there a simpler way than what I tried? My main aim is to append the values to L as columns rather than rows. If I just append, the values are added at the end. I would like this to be done in columns for row 0 of L, because I'll call this function again, add a new row to L, and do the same. Please suggest.
General suggestions for your code
First of all it's not very handy to access globals in a function. It works but it's not considered good style. So instead of using:
def maxvalues():
    do_something_with(MotionsAndMoorings)
you should do it with an argument:
def maxvalues(array):
    do_something_with(array)
MotionsAndMoorings = something
maxvalues(MotionsAndMoorings) # pass it to the function.
The next strange thing is that you seem to exclude the first column of your array (n is used as the second index, MotionsAndMoorings[k][n]):
for n in range(1,15):
I think that's unintended. The first element of a list has the index 0 and not 1. So I guess you wanted to write:
for n in range(0,15):
or even better for arbitrary lengths:
for n in range(len(array[0])): # the length of the first row is the number of columns
Alternatives to your iterations
But looping by hand is not necessary, because the built-in max function accepts a key argument, so you don't need to write the iteration over the whole array yourself:
import operator
column = 2
max(array, key=operator.itemgetter(column))[column]
This picks the row whose element in the chosen column is maximal (column is the index you are interested in). Since max returns the whole row, you then need to extract just that element with [column].
So to get a list of all your maximums for each column you could do:
[max(array, key=operator.itemgetter(column))[column] for column in range(len(array[0]))]
For your L, I'm not sure what it is, but you should probably also pass it as an argument to the function:
def maxvalues(array, L): # another argument here
but since I don't know what x and L are supposed to be, I won't go further into that. It does look like you want to turn the columns of MotionsAndMoorings into rows and the rows into columns. If so, you can do it with:
dummy = [[MotionsAndMoorings[j][i] for j in range(len(MotionsAndMoorings))] for i in range(len(MotionsAndMoorings[0]))]
that's a list comprehension that converts a list like:
[[1, 2, 3], [4, 5, 6], [0, 2, 10], [0, 2, 10]]
to a transposed list, where the columns have become rows:
[[1, 4, 0, 0], [2, 5, 2, 2], [3, 6, 10, 10]]
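The same transposition can also be sketched with zip, which tends to be shorter than the nested comprehension. The helper append_column_maxima and the name history are hypothetical; they only illustrate one way to add a new row of column maxima on every call, which is my reading of what L is for.

```python
rows = [[1, 2, 3], [4, 5, 6], [0, 2, 10], [0, 2, 10]]

# zip(*rows) yields one tuple per column of the nested list.
columns = [list(col) for col in zip(*rows)]


def append_column_maxima(table, history):
    """Append one new row holding the maximum of each column of table."""
    history.append([max(col) for col in zip(*table)])
    return history


history = []  # plays the role of L: one row of maxima per call
append_column_maxima(rows, history)
append_column_maxima([[7, 0, 1], [2, 8, 9]], history)
print(columns)
print(history)
```

Passing the list in as an argument, rather than reading a global, means each call simply grows it by one row.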
Alternative packages
But as roadrunner66 already said, sometimes it's easiest to use a library like numpy or pandas, which already has advanced and fast functions that do exactly what you want and are very easy to use.
For example, you convert a python list to a numpy array simply by:
import numpy as np
Motions_numpy = np.array(MotionsAndMoorings)
you get the maximum of the columns by using:
maximums_columns = np.max(Motions_numpy, axis=0)
You don't even need to convert it to an np.array first: np.max and np.transpose accept nested lists directly (transposing turns the rows into columns and the columns into rows):
transposed = np.transpose(MotionsAndMoorings)
I hope this answer is not too unstructured. Some parts are suggestions for your function and some are alternatives. You should pick the parts that you need, and if you have any trouble with it, just leave a comment or ask another question. :-)
An example with a random input array, showing that you can take the max in either axis easily with one command.
import numpy as np
aa = np.random.random([4, 3])
print(aa)
print()
print(np.max(aa, axis=0))  # maximum of each column
print()
print(np.max(aa, axis=1))  # maximum of each row
Output:
[[ 0.51972266 0.35930957 0.60381998]
[ 0.34577217 0.27908173 0.52146593]
[ 0.12101346 0.52268843 0.41704152]
[ 0.24181773 0.40747905 0.14980534]]
[ 0.51972266 0.52268843 0.60381998]
[ 0.60381998 0.52146593 0.52268843 0.40747905]