I'm currently trying to create a column in a pandas dataframe that holds a counter equal to the number of rows in the dataframe divided by 2. Here is my code so far:
# Fill the cycles column with however many rows exist / 2
for x in (jac_output.index / 2):
    jac_output.loc[x, 'Cycles'] = x + 1
However, I've noticed that it misses out values every so often. (Screenshot of the skipped counter values omitted.)
Why would my counter miss a value every so often as it gets higher? And is there another way of optimizing this, as it seems to be quite slow?
You may have removed some rows from the dataframe, so some indices are missing. You should use reset_index to renumber them, or you can simply use
for x in np.arange(0, len(jac_output.index), 1) / 2:
You can view jac_output.index as a list like [0, 1, 2, ...]. When you divide it by 2, it results in [0, 0.5, 1, ...]. 0.5 is surely not in your original index.
To slice the index into half, you can try:
jac_output.index[:len(jac_output.index)//2]
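If the goal is a "cycle" counter where every two rows share one cycle number (one reading of the original question), a vectorized sketch avoids the loop entirely; jac_output here is a small stand-in dataframe, not the asker's real data:

```python
import numpy as np
import pandas as pd

# stand-in for the original jac_output dataframe
jac_output = pd.DataFrame({'val': range(6)})

# integer-divide the positional row number by 2:
# rows 0,1 -> cycle 1; rows 2,3 -> cycle 2; rows 4,5 -> cycle 3
jac_output['Cycles'] = np.arange(len(jac_output)) // 2 + 1
print(jac_output['Cycles'].tolist())  # [1, 1, 2, 2, 3, 3]
```

Because this works on row positions rather than index labels, it is unaffected by gaps left in the index by deleted rows.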
I want to reindex a couple of pandas dataframes: I want the last index to move up to the first, and I aim to build a general function for it.
I can find the last index using len. But how do I then get the rest of the index, range(1, len(a)-1), printed out as 1, 2, 3, 4, etc.?
What would be a smart solution to this problem?
a = pd.DataFrame({'geo': ['Nation', 'geo1', 'geo2', 'geo3', 'geo4', 'geo5', 'geo6', 'county']})
print(a.geo)
# This is what I want to achieve in a general function
print(a.geo.reindex(index=[7, 0, 1, 2, 3, 4, 5, 6]))
# The last index can be located using len. But what about the rest?
print(a.geo.reindex(index=[len(a)-1, 0, 1, 2, 3, 4, 5, 6]))
b = pd.DataFrame({'geo': ['Nation', 'geo1', 'geo2', 'geo3', 'geo4', 'county']})
print(b.geo)
print(b.geo.reindex(index=[5, 0, 1, 2, 3, 4]))
print(pd.DataFrame({'geo': np.roll(a['geo'], shift=1)}))  # requires import numpy as np
An alternative way is to do:
print(a.iloc[np.arange(-1, len(a)-1)])
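Either idea can be wrapped into the general function the question asks for. A sketch (last_to_first is a hypothetical name; reset_index(drop=True) renumbers the moved rows):

```python
import numpy as np
import pandas as pd

def last_to_first(df):
    # take rows in the order -1, 0, 1, ..., len-2, then renumber the index
    return df.iloc[np.arange(-1, len(df) - 1)].reset_index(drop=True)

a = pd.DataFrame({'geo': ['Nation', 'geo1', 'geo2', 'county']})
print(last_to_first(a)['geo'].tolist())  # ['county', 'Nation', 'geo1', 'geo2']
```

This works for a dataframe of any length, which is what makes it general.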
I have a list of lists (named list, unfortunately) containing two columns, and I need to find the average of each row by iterating over them. I would normally just use a tuple, sum the column, and divide by the length.
Since I need to iterate, I'm pretty lost as to what my approach should be. Can anyone point me in a direction?
Could you check if this is what you expected:
>>> data = [[1, 2, 3, 1, 3], [4, 5, 6, 2, 1]]
>>> print([sum(x)/len(x) for x in data])
[2.0, 3.6]
I am using Python3 here.
Also, you should avoid using built-in names like list as variable names.
I have a dataframe with two columns, call them A and B, and I want to find all symmetric pairs. For example, given:
pd.DataFrame({'A': [1, 2, 3], 'B': [2, 1, 3]})
I want to return all such pairs; here I would get (1,2) and (2,1) (I actually don't need both: either (1,2) or (2,1) is enough).
I first tried an algorithm which works but is really slow; on my dataframe of 26325 rows it hadn't finished after 10 minutes:
listTuples = list()
for index, row in test.iterrows():
    listTuples.append((row["A"], row["B"]))  # convert to a list of tuples
answer = [(x, y) for (x, y) in listTuples if (y, x) in listTuples]
In general you pretty much never have to iterate over rows in a pandas dataframe. In this case, you could speed up the tuple-building step with
listTuples = list(zip(df.A, df.B))
(note the list() call: in Python 3, zip returns a one-shot iterator). If that's the part of your code that is running slowly, that should solve your problem.
Your list comprehension step is likely the real bottleneck: (y, x) in listTuples scans the whole list each time, which makes the step O(n²) overall. Start by converting the list to a set (this also removes duplicate entries), so each membership check is O(1), and then run your list comprehension.
Also see this stack overflow question about picking unique tuples from a list of tuples in python.
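Putting both suggestions together, a minimal sketch that uses a set for constant-time lookups; the x < y filter keeps only one orientation of each pair and also drops self-pairs such as (3, 3):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 1, 3]})

pairs = set(zip(df.A, df.B))  # set membership checks are O(1)
sym = [(x, y) for (x, y) in pairs if (y, x) in pairs and x < y]
print(sym)  # [(1, 2)]
```

Even at ~26k rows this should finish in well under a second, since the whole pass is linear in the number of unique pairs.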
I discovered a very subtle bug in my code. I frequently delete rows from a dataframe in my analysis. Because this leaves gaps in the index, I try to end every function by resetting the index with
df0 = df0.reset_index(drop=True)
Then I continue in the next function with
for row in xrange(df0.shape[0]):
    print df0.loc[row]
    print df0.iloc[row]
However, if I don't reset the index correctly, the first row might have an index label of 192. That label is not the same as row number 0, so df0.loc[row] accesses the row with label 0 while df0.iloc[row] accesses the row with label 192. This has caused a very strange bug: I try to update row 0, but the row with label 192 gets updated instead, or vice versa.
But in reality I don't use df0.loc or df0.iloc at all because they are too slow. My code is riddled with df0.get_value(...) and df0.set_value(...) calls because they are the fastest way to access values.
And it seems that some functions access values by index label, while others access them by row number? I am confused. Can someone explain the best practices? Have I misunderstood something? Should I call reset_index() as often as I can, or never do that?
EDIT: To recap: I manually merge some rows in my functions, so there will be gaps in the indices. In other functions I iterate over each row and do calculations. However, if I have reset the index I get different calculation results than if I don't. Why? That is my problem.
.loc[] looks at index labels, which may or may not be integer-valued.
If your index is [0, 1, 3] (a non-sequential integer index), .loc[2] won't find anything, because there is no index label 2.
Similarly, if your index is ['a', 'b', 'c'] (a non-integer index), .loc[2] will come up empty.
.iloc[] looks at index positions, which will always be integer-valued.
If your index is [0, 1, 3], .iloc[2] will return the row corresponding to 3.
If your index is ['a', 'b', 'c'], .iloc[2] will return the row corresponding to 'c'.
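A minimal demonstration of the difference, using a gappy integer index like the one left behind by deleted rows:

```python
import pandas as pd

# non-sequential index, as after dropping a row without reset_index
df = pd.DataFrame({'v': [10, 20, 30]}, index=[0, 1, 3])

print(df.loc[3, 'v'])    # 30 -- looks up the LABEL 3
print(df.iloc[2]['v'])   # 30 -- looks up POSITION 2
# df.loc[2, 'v'] would raise a KeyError: there is no label 2
```

After df.reset_index(drop=True), labels and positions coincide again and the two indexers agree.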
That's not a bug, that's just how those indexers are designed. Whether one fits your purpose depends on the structure of your data and what you're trying to accomplish. It's hard to make a recommendation without knowing more.
That said, it does sound like your code is getting kind of thorny. Having to perform reset_index() in a bunch of different places and keep constant track of which row you're trying to update suggests that you may not be taking advantage of Pandas' ability to perform vector-based calculations across many rows and columns at once. Maybe the task you want to accomplish makes this inevitable. But it's worth taking some time to consider whether you can't vectorize some of what you're doing, so that you can apply it to the whole dataframe or a subset of the dataframe, rather than operating on individual cells one at a time.
def maxvalues():
    for n in range(1, 15):
        dummy = []
        for k in range(len(MotionsAndMoorings)):
            dummy.append(MotionsAndMoorings[k][n])
        max(dummy)
        L = [x + [max(dummy)]]  # to be corrected (adding columns with value max(dummy))
        # suggest code to add new row to L; for the next function call, it should save values here
I have an array of size (k x n) and I need to pick the max values of the first column in that array. Please suggest if there is a simpler way than what I tried. My main aim is to append to L in columns rather than rows; if I just append, it adds values at the end. I would like this to be done in columns for row 0 in L, because I'll call this function again and add a new row to L and do the same. Please suggest.
General suggestions for your code
First of all it's not very handy to access globals in a function. It works but it's not considered good style. So instead of using:
def maxvalues():
    do_something_with(MotionsAndMoorings)
you should do it with an argument:
def maxvalues(array):
    do_something_with(array)
MotionsAndMoorings = something
maxvalues(MotionsAndMoorings)  # pass it to the function
The next strange thing is that you seem to exclude the first column of your array:
for n in range(1, 15):
I think that's unintended. The first element of a list has index 0, not 1. So I guess you wanted to write:
for n in range(0, 15):
or even better, for arbitrary lengths:
for n in range(len(array[0])):  # the length of the first row, i.e. the number of columns
Alternatives to your iterations
But this would not be very intuitive, because the max function already accepts a very nice keyword argument (key), so you don't need to iterate over the whole array:
import operator
column = 2
max(array, key=operator.itemgetter(column))[column]
this will return the row whose i-th element is maximal (you just pass your wanted column as that element). But max returns the whole row, so you need to extract just the i-th element.
So to get a list of all your maximums for each column you could do:
[max(array, key=operator.itemgetter(column))[column] for column in range(len(array[0]))]
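For instance, with a small sample array the per-column maxima are easy to check by hand:

```python
import operator

array = [[1, 2, 3], [4, 5, 6], [0, 2, 10], [0, 2, 10]]

# for each column c, find the row whose c-th element is largest, then extract it
maxima = [max(array, key=operator.itemgetter(c))[c] for c in range(len(array[0]))]
print(maxima)  # [4, 5, 10]
```

Column 0 holds 1, 4, 0, 0; column 1 holds 2, 5, 2, 2; column 2 holds 3, 6, 10, 10, so the result above is correct.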
As for your L, I'm not sure what it is, but you should probably also pass it as an argument to the function:
def maxvalues(array, L):  # another argument here
but since I don't know what x and L are supposed to be, I won't go further into that. It looks like you want to turn the columns of MotionsAndMoorings into rows and the rows into columns. If so, you can do it with:
dummy = [[MotionsAndMoorings[j][i] for j in range(len(MotionsAndMoorings))] for i in range(len(MotionsAndMoorings[0]))]
that's a list comprehension that converts a list like:
[[1, 2, 3], [4, 5, 6], [0, 2, 10], [0, 2, 10]]
to an "inverted" column/row list:
[[1, 4, 0, 0], [2, 5, 2, 2], [3, 6, 10, 10]]
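The same transposition is often written with the zip(*...) idiom, which gives identical output on the sample list above:

```python
data = [[1, 2, 3], [4, 5, 6], [0, 2, 10], [0, 2, 10]]

# zip(*data) pairs up the i-th elements of every row, i.e. yields the columns
transposed = [list(col) for col in zip(*data)]
print(transposed)  # [[1, 4, 0, 0], [2, 5, 2, 2], [3, 6, 10, 10]]
```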
Alternative packages
But as roadrunner66 already said, sometimes it's easiest to use a library like numpy or pandas that already has very advanced and fast functions that do exactly what you want and are very easy to use.
For example, you convert a python list to a numpy array simply with:
import numpy as np
Motions_numpy = np.array(MotionsAndMoorings)
you get the maximum of the columns by using:
maximums_columns = np.max(Motions_numpy, axis=0)
you don't even need to convert it to an np.array to use np.max, or to transpose it (turn rows into columns and columns into rows):
transposed = np.transpose(MotionsAndMoorings)
I hope this answer is not too unstructured. Some parts are suggestions for your function and some are alternatives. You should pick the parts that you need, and if you have any trouble with it, just leave a comment or ask another question. :-)
An example with a random input array, showing that you can take the max in either axis easily with one command.
import numpy as np
aa = np.random.random([4, 3])
print(aa)
print()
print(np.max(aa, axis=0))
print()
print(np.max(aa, axis=1))
Output:
[[ 0.51972266 0.35930957 0.60381998]
[ 0.34577217 0.27908173 0.52146593]
[ 0.12101346 0.52268843 0.41704152]
[ 0.24181773 0.40747905 0.14980534]]
[ 0.51972266 0.52268843 0.60381998]
[ 0.60381998 0.52146593 0.52268843 0.40747905]