unique percentile even for same value in python


I want to obtain unique percentile ranks in Python, even when values are tied.
For example, the following case gives the expected output.
Case 1
import pandas as pd
s1 = pd.Series([1,2,3,4])
s1.rank(pct=True)
Case 1 Output - [0.25, 0.5, 0.75, 1]
I expect the same kind of output when the input series is [2, 2, 2, 4]. However, the output here is [0.5, 0.5, 0.5, 1]. Any one of the following outputs would be fine:
[0.25, 0.5, 0.75, 1]
[0.5, 0.25, 0.75, 1]
[0.25, 0.75, 0.5, 1]
Please let me know if there is a way to achieve that.

rank has a method parameter that defaults to 'average', which assigns tied values the mean of their ranks and gives the results you are seeing. Change it to 'first' so that ties are ranked in the order they appear:
s1 = pd.Series([2,2,2,4])
s1.rank(pct=True,method='first')
Output:
0 0.25
1 0.50
2 0.75
3 1.00
dtype: float64
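To see how the tie-breaking rule changes the result, here is a minimal sketch comparing three values of the method parameter on the series from the question. The variable names are only illustrative.

```python
import pandas as pd

s = pd.Series([2, 2, 2, 4])

# 'average' (the default): tied values share the mean of their ranks
average_pct = s.rank(pct=True, method="average")
# 'min': tied values all get the lowest rank in the tied group
min_pct = s.rank(pct=True, method="min")
# 'first': tied values are ranked in the order they appear
first_pct = s.rank(pct=True, method="first")

print(pd.DataFrame({"average": average_pct, "min": min_pct, "first": first_pct}))
```

Only 'first' yields a distinct value for every element, which is what the question asks for.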

There is no simple function to do this. Although I understand what you want to do, this is not a percentile score. In fact, what you've shown here is a percentage rank, which is not the same as percentile.
To get the functionality you want, I believe that you'll have to group and compute the values yourself.

Related

Adding a column to a pandas dataframe based on other columns

Problem description
Introductory remark: the code is shown below.
Let's say we have a pandas dataframe consisting of 3 columns and 3 rows.
I'd like to add a 4th column called 'Max_LF' that will contain a list. The value of each cell is found by looking at the column 'Max_WD'. For the first row that is 0.35, which is then compared to the values in the column 'WD', where 0.35 is found at the third position. Therefore, the third value of the column 'LF' should be written into the column 'Max_LF'. If the value of 'Max_WD' occurs multiple times in 'WD', then all corresponding items of 'LF' should be written into 'Max_LF'.
Failed attempt
So far I have made various attempts at first retrieving the index of the 'Max_WD' item in 'WD'. After retrieving the index, the idea was to then get the items of 'LF' via their index:
df4['temp_indices'] = [i for i, x in enumerate(df4['WD']) if x == df4['Max_WD']]
However, a ValueError occurred:
raise ValueError('Lengths must match to compare')
ValueError: Lengths must match to compare
This is what the example dataframe looks like
df = pd.DataFrame(data={'LF': [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]] , 'WD': [[0.28, 0.34, 0.35, 0.18], [0.42, 0.45, 0.45, 0.18], [0.31, 0.21, 0.41, 0.41]], 'Max_WD': [0.35, 0.45, 0.41]})
The expected outcome should look like
df=pd.DataFrame(data={'LF': [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]] , 'WD': [[0.28, 0.34, 0.35, 0.18], [0.42, 0.45, 0.45, 0.18], [0.31, 0.21, 0.41, 0.41]], 'Max_WD': [0.35, 0.45, 0.41], 'Max_LF': [[3] ,[2,3], [3,4]]})

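You can get it with a lambda applied row by row. Note that this returns the 1-based positions of the matches, which coincide with the 'LF' values only because 'LF' is [1, 2, 3, 4] in every row: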
df['Max_LF'] = df.apply(lambda x : [i + 1 for i, e in enumerate(x['WD']) if e == x['Max_WD']], axis=1)
The output is:
          LF  Max_WD                        WD  Max_LF
0  [1, 2, 3]    0.35  [0.28, 0.34, 0.35, 0.18]     [3]
1  [1, 2, 3]    0.45  [0.42, 0.45, 0.45, 0.18]  [2, 3]
2  [1, 2, 3]    0.41  [0.31, 0.21, 0.41, 0.41]  [3, 4]

Thanks, guys! With your help I was able to solve my problem.
As Prince Francis suggested, I first did
df['temp'] = df.apply(lambda x : [i for i, e in enumerate(x['WD']) if e == x['Max_WD']], axis=1)
to get the indices at which 'Max_WD' occurs in 'WD'. In a second step I could then add the actual column 'Max_LF' by doing
df['Max_LF'] = df.apply(lambda x: [x['LF'][e] for e in x['temp']], axis=1)
Thanks a lot, guys!
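The two steps can also be combined into a single pass. The following is a minimal sketch, assuming the example dataframe from the question. The helper name matching_lf is hypothetical, and pairing 'LF' with 'WD' via zip is one possible approach rather than the method used in the answers.

```python
import pandas as pd

df = pd.DataFrame({
    "LF": [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]],
    "WD": [[0.28, 0.34, 0.35, 0.18], [0.42, 0.45, 0.45, 0.18], [0.31, 0.21, 0.41, 0.41]],
    "Max_WD": [0.35, 0.45, 0.41],
})

def matching_lf(row):
    # Walk LF and WD in parallel and keep the LF items whose WD equals Max_WD
    return [lf for lf, wd in zip(row["LF"], row["WD"]) if wd == row["Max_WD"]]

df["Max_LF"] = df.apply(matching_lf, axis=1)
print(df["Max_LF"].tolist())
```

Because the values are taken from 'LF' itself, this keeps working when 'LF' is not simply [1, 2, 3, 4].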

You can achieve it by applying a function over axis 1.
For this, I recommend first converting the WD list into a pd.Series (or a numpy.ndarray) and then comparing all the values at once.
Assuming that you want a list of all the values greater than or equal to the threshold, you could use this:
>>> def get_max_wd(x):
...     wd = pd.Series(x.WD)
...     return list(wd[wd >= x.Max_WD])
...
>>> df.apply(get_max_wd, axis=1)
0          [0.35]
1    [0.45, 0.45]
2    [0.41, 0.41]
dtype: object
The result of the apply can then be assigned as a new column into the dataframe:
df['Max_LF'] = df.apply(get_max_wd, axis=1)
If what you are after is only the maximum value (see my comment above), you can use the max() method within the function.

Best way converting data in PANDAS DataFrame to matrix in Python

I found one thread on converting a matrix to a pandas DataFrame. However, I would like to do the opposite: I have a pandas DataFrame with time series data of this structure:
row time stamp, batch, value
1, 1, 0.1
2, 1, 0.2
3, 1, 0.3
4, 1, 0.3
5, 2, 0.25
6, 2, 0.32
7, 2, 0.2
8, 2, 0.1
...
What I would like to have is a matrix of values with one row belonging to one batch:
[[0.1, 0.2, 0.3, 0.3],
[0.25, 0.32, 0.2, 0.1],
...]
which I want to plot as a heatmap using matplotlib or similar.
Any suggestions?

What you can try is to first group by the desired column:
g = df.groupby("batch")
Then aggregate each group into a list using the list constructor.
The result can then be converted to an array using the .values property (or .to_numpy(); the older .as_matrix() was deprecated in pandas 0.23 and removed in pandas 1.0).
mtr = g.aggregate(list).values
One downside of this method is that it creates an array of lists instead of a proper 2-D array, even when all the lists have the same length.
Alternatively, if you know that you get exactly 4 values for every unique value of batch, you can reshape the value column directly.
df = df.sort_values("batch", kind="mergesort")  # stable sort keeps the original order within each batch
mtr = df["value"].values  # take only the column you want in the matrix
mtr = mtr.reshape(-1, 4)  # only works if you have exactly 4 values for each batch

Try pd.crosstab() from pandas. You will have to choose the aggfunc argument.
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.crosstab.html
and then call .values on the result (.as_matrix() was removed in pandas 1.0).
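Another way to get one row per batch is to number the observations within each batch and pivot on that number. This is a minimal sketch, assuming the column names batch and value from the question. The helper column pos is hypothetical and only added for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "batch": [1, 1, 1, 1, 2, 2, 2, 2],
    "value": [0.1, 0.2, 0.3, 0.3, 0.25, 0.32, 0.2, 0.1],
})

# Position of each observation within its batch: 0, 1, 2, 3, 0, 1, 2, 3
df["pos"] = df.groupby("batch").cumcount()

# One row per batch, one column per position, then convert to a NumPy array
mtr = df.pivot(index="batch", columns="pos", values="value").to_numpy()
print(mtr)
```

Unlike the reshape approach, this does not require every batch to have the same length: shorter batches are padded with NaN.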

Efficient way to categorize data into bins in python

Suppose I have a floating point dataset (x) which can take any value between 0.0 and 1.0. I want to categorize the data into custom bins, e.g.:
cat = 0  # the output category
if x > 0.8 and x <= 0.9:
    cat = 1
if x > 0.7 and x <= 0.8:
    cat = 2
if x > 0.6 and x <= 0.7:
    cat = 3
and so on. Is this the most efficient way to do this (in terms of how many lines I have to write)? I was wondering whether there is some way where I just specify the lower and upper range of each category and the category number, without having to write so many if statements.

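I suggest you move the data into a pandas DataFrame and use pd.cut. Note that labels needs one entry fewer than bins, and values outside (0, 0.9] become NaN:
df = pd.DataFrame({'data': x})
binInterval = [0, 0.6, 0.7, 0.8, 0.9]
binLabels = [4, 3, 2, 1]
df['binned'] = pd.cut(df['data'], bins=binInterval, labels=binLabels)
See the pandas.cut documentation for details.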

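Simply (this numbers the bins in ascending order, so (0.6, 0.7] is 1, and raises ValueError if x falls outside every bin):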
categories = [0.6, 0.7, 0.8, 0.9]
cat = [categories[i]<x and categories[i+1]>=x for i in range(0, len(categories)-1)].index(True) + 1
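For a single value, the chain of if statements can also be replaced by a binary search over the bin edges. This is a minimal sketch using the standard library's bisect module. The function name categorize and the constant EDGES are hypothetical, and the numbering follows the question (1 for (0.8, 0.9], 2 for (0.7, 0.8], 3 for (0.6, 0.7], 0 otherwise).

```python
from bisect import bisect_left

# Bin edges in ascending order; the bins are (0.6, 0.7], (0.7, 0.8], (0.8, 0.9]
EDGES = [0.6, 0.7, 0.8, 0.9]

def categorize(x, edges=EDGES):
    # bisect_left returns i such that edges[i - 1] < x <= edges[i]
    idx = bisect_left(edges, x)
    if idx == 0 or idx == len(edges):
        return 0  # x is at or below the first edge, or above the last one
    # The highest bin gets category 1, the next lower one 2, and so on
    return len(edges) - idx

print([categorize(v) for v in (0.85, 0.75, 0.65, 0.95)])
```

Adding another bin only requires adding another edge to the list.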

Pandas plot with errorbar: style does not apply

I have a pandas (version 0.14.1) DataFrame object like this:
import pandas as pd
df = pd.DataFrame(list(zip([1, 2, 3, 4, 5],
                           [0.1, 0.3, 0.1, 0.2, 0.4])),
                  columns=['y', 'dy'])
It returns
   y   dy
0  1  0.1
1  2  0.3
2  3  0.1
3  4  0.2
4  5  0.4
where the first column is the value and the second is the error.
First case: I want to plot the y-values
df['y'].plot(style="ro-")
Second case: I want to add vertical error bars dy to the y-values
df['y'].plot(style="ro-", yerr=df['dy'])
So, if I add the yerr or xerr parameter to the plot method, it ignores style.
Is this a pandas feature or a bug?

As TomAugspurger pointed out, it is a known issue. However, it has an easy workaround in most cases: use the fmt keyword instead of the style keyword to specify shortcut style options.
import pandas as pd
df = pd.DataFrame(list(zip([1, 2, 3, 4, 5],
                           [0.1, 0.3, 0.1, 0.2, 0.4])),
                  columns=['y', 'dy'])
df['y'].plot(fmt='ro-', yerr=df['dy'], grid='on')
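If the pandas wrapper still ignores the style in your version, one fallback is to call matplotlib's errorbar directly, which takes the same fmt shortcut. This is a minimal sketch under that assumption, not the approach from the answer above. The Agg backend is selected only so that the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"y": [1, 2, 3, 4, 5], "dy": [0.1, 0.3, 0.1, 0.2, 0.4]})

fig, ax = plt.subplots()
# errorbar returns the data line, the cap lines and the error bar collections
data_line, cap_lines, bar_lines = ax.errorbar(
    df.index.to_numpy(), df["y"].to_numpy(), yerr=df["dy"].to_numpy(), fmt="ro-"
)
ax.grid(True)
```

Here fmt="ro-" requests red circles joined by a solid line, the same shortcut the question passes as style.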

Multiplying Rows and Columns of Python Sparse Matrix by elements in an Array

I have a numpy array such as:
array = [0.2, 0.3, 0.4]
(this vector is actually size 300k dense, I'm just illustrating with simple examples)
and a sparse symmetric matrix created using Scipy such as follows:
M = [[0, 1, 2],
     [1, 0, 1],
     [2, 1, 0]]
(represented as dense just to illustrate; in my real problem it's a (300k x 300k) sparse matrix)
Is it possible to multiply all rows by the elements in array and then make the same operation regarding the columns?
This would first result in:
M = [[0 * 0.2, 1 * 0.2, 2 * 0.2],
     [1 * 0.3, 0 * 0.3, 1 * 0.3],
     [2 * 0.4, 1 * 0.4, 0 * 0.4]]
(rows are being multiplied by the elements in array)
M = [[0, 0.2, 0.4],
     [0.3, 0, 0.3],
     [0.8, 0.4, 0]]
And then the columns are multiplied:
M = [[0 * 0.2, 0.2 * 0.3, 0.4 * 0.4],
     [0.3 * 0.2, 0 * 0.3, 0.3 * 0.4],
     [0.8 * 0.2, 0.4 * 0.3, 0 * 0.4]]
Resulting finally in:
M = [[0, 0.06, 0.16],
     [0.06, 0, 0.12],
     [0.16, 0.12, 0]]
I've tried applying the solution I found in this thread, but it didn't work: I multiplied the data of M by the elements in array as suggested, then transposed the matrix and applied the same operation, but the result wasn't correct, and I still couldn't understand why.
To be clear, the matrix I'll be running these operations on is fairly big: it has 20 million non-zero elements, so efficiency is very important!
I appreciate your help!
Edit:
Bitwise solution worked very well. Here it took 1.72 s to compute this operation, which is fine for our work. Thanks!

In general you want to avoid loops and use matrix operations for speed and efficiency. In this case the solution is simple linear algebra, or more specifically matrix multiplication.
To multiply the columns of M by the array A, compute M*diag(A). To multiply the rows of M by A, compute diag(A)*M. To do both, compute diag(A)*M*diag(A), which can be written as:
numpy.dot(numpy.dot(a, m), a)
Here a is diag(A), a matrix that is all zeros except for A on its diagonal, and m is M. You can create this matrix easily with numpy.diag() or, for sparse data, scipy.sparse.diags().
I expect this to run very fast.
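For the sparse case, the same product diag(A)*M*diag(A) can be sketched with scipy.sparse.diags and the @ operator. This is a minimal illustration on the small example from the question, assuming a CSR matrix with float data. The names D and result are only illustrative.

```python
import numpy as np
from scipy import sparse

array = np.array([0.2, 0.3, 0.4])
M = sparse.csr_matrix(np.array([[0, 1, 2],
                                [1, 0, 1],
                                [2, 1, 0]], dtype=float))

# Sparse diagonal matrix with the scaling factors on its main diagonal
D = sparse.diags(array)

# D @ M scales the rows, (...) @ D scales the columns
result = D @ M @ D
print(result.toarray())
```

Everything stays sparse, so only the stored non-zero entries are touched and no dense 300k x 300k matrix is ever built.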

The following should work:
[[x*array[i]*array[j] for j, x in enumerate(row)] for i, row in enumerate(M)]
Example:
>>> array = [0.2, 0.3, 0.4]
>>> M = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
>>> [[x*array[i]*array[j] for j, x in enumerate(row)] for i, row in enumerate(M)]
[[0.0, 0.059999999999999998, 0.16000000000000003], [0.059999999999999998, 0.0, 0.12], [0.16000000000000003, 0.12, 0.0]]
Values are slightly off due to limitations on floating point arithmetic. Use the decimal module if the rounding error is unacceptable.

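I use this combination:
import numpy as np

def multiply(matrix, vector, axis):
    # matrix is assumed to be a scipy.sparse CSR matrix with float data
    if axis == 1:
        # scale rows in place: repeat each factor once per stored entry in its row
        val = np.repeat(vector, matrix.getnnz(axis=1))
        matrix.data *= val
    else:
        # scale columns: broadcast the vector across the rows
        matrix = matrix.multiply(vector)
    return matrix
When the axis is 1 (multiply by rows), I replicate the second approach of this solution,
and when the axis is 0 (multiply by columns) I use multiply.
The in-place result (axis=1) is more efficient.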
