I am looking to obtain unique percentiles even for same value in Python
For example, the following case is giving the output as expected.
Case 1
import pandas as pd
s1 = pd.Series([1,2,3,4])
s1.rank(pct=True)
Case 1 Output - [0.25, 0.5, 0.75, 1]
I expect the output to be same even when the input series is [2, 2, 2, 4]. However, here the output is [0.5, 0.5, 0.5, 1]. I don't mind either one of the outputs.
[0.25, 0.5, 0.75, 1]
[0.5, 0.25, 0.75, 1]
[0.25, 0.75, 0.5, 1]
Please let me know if there is a way to achieve that.
Rank has a parameter method which defaults to 'average' which gives you the results are you are seeing. Let's change that to 'first'.
s1 = pd.Series([2,2,2,4])
s1.rank(pct=True,method='first')
Output:
0 0.25
1 0.50
2 0.75
3 1.00
dtype: float64
There is no simple function to do this. Although I understand what you want to do, this is not a percentile score. In fact, what you've shown here is a percentage rank, which is not the same as percentile.
To get the functionality you want, I believe that you'll have to group and compute the values yourself.
Related
Problem description
Introductory remark: For the code have a look below
Let's say we have a pandas dataframe consisting of 3 columns and 2 rows.
I'd like to add a 4th column called 'Max_LF' that will consist of an array. The value of the cell is retrieved by having a look at the column 'Max_WD'. For the first row that would be 0.35 which will than be compared to the values in the column 'WD' where 0.35 can be found at the third position. Therefore, the third value of the column 'LF' should be written into the column 'Max_LF'. If the value of 'Max_WD' occures multiple times in 'WD', then all corresponding items of 'LF' should be written into 'Max_LF'.
Failed attempt
So far I had various attemps on first retrieving the index of the item in 'Max_WD' in 'WD'. After potentially retrieving the index the idea was to then get the items of 'LF' via their index:
df4['temp_indices'] = [i for i, x in enumerate(df4['WD']) if x == df4['Max_WD']]
However, a ValueError occured:
raise ValueError('Lengths must match to compare')
ValueError: Lengths must match to compare
This is what the example dateframe looks like
df = pd.DataFrame(data={'LF': [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]] , 'WD': [[0.28, 0.34, 0.35, 0.18], [0.42, 0.45, 0.45, 0.18], [0.31, 0.21, 0.41, 0.41]], 'Max_WD': [0.35, 0.45, 0.41]})
The expected outcome should look like
df=pd.DataFrame(data={'LF': [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]] , 'WD': [[0.28, 0.34, 0.35, 0.18], [0.42, 0.45, 0.45, 0.18], [0.31, 0.21, 0.41, 0.41]], 'Max_WD': [0.35, 0.45, 0.41], 'Max_LF': [[3] ,[2,3], [3,4]]})
You could get it by simply using lambda as follows
df['Max_LF'] = df.apply(lambda x : [i + 1 for i, e in enumerate(x['WD']) if e == x['Max_WD']], axis=1)
output is
LF Max_WD WD Max_LF
0 [1, 2, 3] 0.35 [0.28, 0.34, 0.35, 0.18] [3]
1 [1, 2, 3] 0.45 [0.42, 0.45, 0.45, 0.18] [2, 3]
2 [1, 2, 3] 0.41 [0.31, 0.21, 0.41, 0.41] [3, 4]
Thanks guys! With your help I was able to solve my problem.
Like Prince Francis suggested I first did
df['temp'] = df.apply(lambda x : [i for i, e in enumerate(x['WD']) if e == x['Max_WD']], axis=1)
to get the indicees of the 'WD'-values in 'LF'. In a second stept I then could add the actual column 'Max_LF' by doing
df['LF_Max'] = df.apply(lambda x: [x['LF'][e] for e in (x['temp'])],axis=1)
Thanks a lot guys!
You can achieve it by applying a function over axis 1.
For this, I recommend you to first convert the WD list into a pd.Series (or a numpy.ndarray) and then compare all the values at once.
Assuming that you want a list of all the values higher than the threshold, you could use this:
>>> def get_max_wd(x):
... wd = pd.Series(x.WD)
... return list(wd[wd >= x.Max_WD])
...
>>> df.apply(get_max_wd, axis=1)
0 [0.35]
1 [0.45, 0.45]
2 [0.41, 0.41]
dtype: object
The result of the apply can then be assigned as a new column into the dataframe:
df['Max_LF'] = df.apply(get_max_wd, axis=1)
If what you are after is only the maximum value (see my comment above), you can use the max() method within the function.
I found one thread of converting a matrix to das pandas DataFrame. However, I would like to do the opposite - I have a pandas DataFrame with time series data of this structure:
row time stamp, batch, value
1, 1, 0.1
2, 1, 0.2
3, 1, 0.3
4, 1, 0.3
5, 2, 0.25
6, 2, 0.32
7, 2, 0.2
8, 2, 0.1
...
What I would like to have is a matrix of values with one row belonging to one batch:
[[0.1, 0.2, 0.3, 0.3],
[0.25, 0.32, 0.2, 0.1],
...]
which I want to plot as heatmap using matplotlib or alike.
Any suggestion?
What you can try is to first group by the desired index:
g = df.groupby("batch")
And then convert this group to an array by aggregating using the list constructor.
The result can then be converted to an array using the .values property (or .as_matrix() function, but this is getting deprecated soon.)
mtr = g.aggregate(list).values
One downside of this method is that it will create arrays of lists instead of a nice array, even if the result would lead to a non-jagged array.
Alternatively, if you know that you get exactly 4 values for every unique value of batch you can just use the matrix directly.
df = df.sort_values("batch")
my_indices = [1, 2] # Or whatever indices you desire.
mtr = df.values[:, my_indices] # or df.as_matrix()
mtr = mtr.reshape(-1, 4) # Only works if you have exactly 4 values for each batch
Try and use crosstab from pandas, pd.crosstab(). You will have to confirm the aggfunction.
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.crosstab.html
and then .as_matrix()
Suppose I have a floating point dataset (x) which can assume any values between 0.0 and 1.0. I want to categorized the data into custom bins,eg,:
cat= 0 # the output category
if x > 0.8 and x<=0.9:
cat = 1
if x > 0.7 and x<=0.8:
cat=2
if x>0.6 and x<=0.7:
cat = 3
and so on... Is this the most efficient (in terms of how many lines i have to write) way to do this? I was thinking whether there is some way where i just specify the lower and upper range of the category and the category number and not have to write so many if statements.
I suggest you move the data into pandas dataframe
df['data'] = pd.DataFrame(x)
binInterval = [0, 0.6, 0.7, 0.8, 0.9]
binLabels = [0, 4, 3, 2, 1]
df['binned'] = pd.cut(df['data'], bins = binInterval, labels=binLabels)
refer documentaion here
simply:
categories = [0.6, 0.7, 0.8, 0.9]
cat = [categories[i]<x and categories[i+1]>=x for i in range(0, len(categories)-1)].index(True) + 1
I have Pandas (version 0.14.1) DataFrame object like this
import pandas as pd
df = pd.DataFrame(zip([1, 2, 3, 4, 5],
[0.1, 0.3, 0.1, 0.2, 0.4]),
columns=['y', 'dy'])
It returns
y dy
0 1 0.1
1 2 0.3
2 3 0.1
3 4 0.2
4 5 0.4
where the first column is value and the second is error.
First case: I want to make a plot for y-values
df['y'].plot(style="ro-")
Second case: I want to add a vertical errorbars dy for y-values
df['y'].plot(style="ro-", yerr=df['dy'])
So, If I add yerr or xerr parameter to plot method, It ignores style.
Is it Pandas feature or bug?
As TomAugspurger pointed out, it is a known issue. However, it has an easy workaround in most cases: use fmt keyword instead of style keyword to specify shortcut style options.
import pandas as pd
df = pd.DataFrame(zip([1, 2, 3, 4, 5],
[0.1, 0.3, 0.1, 0.2, 0.4]),
columns=['y', 'dy'])
df['y'].plot(fmt='ro-', yerr=df['dy'], grid='on')
I have a numpy array such as:
array = [0.2, 0.3, 0.4]
(this vector is actually size 300k dense, I'm just illustrating with simple examples)
and a sparse symmetric matrix created using Scipy such as follows:
M = [[0, 1, 2]
[1, 0, 1]
[2, 1, 0]]
(represented as dense just to illustrate; in my real problem it's a (300k x 300k) sparse matrix)
Is it possible to multiply all rows by the elements in array and then make the same operation regarding the columns?
This would result first in :
M = [[0 * 0.2, 1 * 0.2, 2 * 0.2]
[1 * 0.3, 0 * 0.3, 1 * 0.3]
[2 * 0.4, 1 * 0.4, 0 * 0.4]]
(rows are being multiplied by the elements in array)
M = [[0, 0.2, 0.4]
[0.3, 0, 0.3]
[0.8, 0.4, 0]]
And then the columns are multiplied:
M = [[0 * 0.2, 0.2 * 0.3, 0.4 * 0.4]
[0.3 * 0.2, 0 * 0.3, 0.3 * 0.4]
[0.8 * 0.2, 0.4 * 0.3, 0 * 0.4]]
Resulting finally in:
M = [[0, 0.06, 0.16]
[0.06, 0, 0.12]
[0.16, 0.12, 0]]
I've tried applying the solution I found in this thread, but it didn't work; I multiplied the data of the M by the elements in array as it was suggested, then transposed the matrix and applied the same operation but the result wasn't correct, still coudn't understand why!
Just to point this out, the matrix I'll be running this operations are somewhat big, it has 20 million non-zero elements so efficiency is very important!
I appreciate your help!
Edit:
Bitwise solution worked very well. Here it took 1.72 s to compute this operation but that's ok to our work. Tnx!
In general you want to avoid loops and use matrix operations for speed and efficiency. In this case the solution is simple linear algebra, or more specifically matrix multiplication.
To multiply the columns of M by the array A, multiply M*diag(A). To multiply the rows of M by A, multiply diag(A)*M. To do both: diag(A)*M*diag(A), which can be accomplished by:
numpy.dot(numpy.dot(a, m), a)
diag(A) here is a matrix that is all zeros except having A on its diagonal. You can have methods to create this matrix easily (e.g. numpy.diag() and scipy.sparse.diags()).
I expect this to run very fast.
The following should work:
[[x*array[i]*array[j] for j, x in enumerate(row)] for i, row in enumerate(M)]
Example:
>>> array = [0.2, 0.3, 0.4]
>>> M = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
>>> [[x*array[i]*array[j] for j, x in enumerate(row)] for i, row in enumerate(M)]
[[0.0, 0.059999999999999998, 0.16000000000000003], [0.059999999999999998, 0.0, 0.12], [0.16000000000000003, 0.12, 0.0]]
Values are slightly off due to limitations on floating point arithmetic. Use the decimal module if the rounding error is unacceptable.
I use this combination:
def multiply(matrix, vector, axis):
if axis == 1:
val = np.repeat(array, matrix.getnnz(axis=1))
matrix.data *= val
else:
matrix = matrix.multiply(vector)
return matrix
When the axis is 1 (multiply by rows), I replicate the second approach of this solution,
and when the axis is 0 (multiply by columns) I use multiply
The in-place result (axis=1) is more efficient.