Suppose I have a floating point dataset (x) which can assume any value between 0.0 and 1.0. I want to categorize the data into custom bins, e.g.:
cat = 0  # the output category
if x > 0.8 and x <= 0.9:
    cat = 1
if x > 0.7 and x <= 0.8:
    cat = 2
if x > 0.6 and x <= 0.7:
    cat = 3
and so on... Is this the most efficient way to do this (in terms of how many lines I have to write)? I was thinking there might be some way where I just specify the lower and upper range of each category and the category number, and not have to write so many if statements.
I suggest you move the data into a pandas DataFrame:
import pandas as pd

df = pd.DataFrame({'data': x})
binInterval = [0, 0.6, 0.7, 0.8, 0.9]
binLabels = [0, 3, 2, 1]  # one label per bin, i.e. len(binInterval) - 1 labels
df['binned'] = pd.cut(df['data'], bins=binInterval, labels=binLabels)
Refer to the pd.cut documentation for details.
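For a quick check, here is the binning above applied to a few illustrative values (the sample data is made up):
import pandas as pd

x = [0.65, 0.85, 0.95, 0.30]  # illustrative data
df = pd.DataFrame({'data': x})
df['binned'] = pd.cut(df['data'], bins=[0, 0.6, 0.7, 0.8, 0.9], labels=[0, 3, 2, 1])
print(df)
#    data binned
# 0  0.65      3
# 1  0.85      1
# 2  0.95    NaN   <- above the last edge, so no category
# 3  0.30      0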
Simply:
categories = [0.6, 0.7, 0.8, 0.9]
# position of the interval (categories[i], categories[i+1]] that contains x
cat = [categories[i] < x <= categories[i + 1] for i in range(len(categories) - 1)].index(True) + 1
Note that .index(True) raises a ValueError if x falls outside every interval, and the numbering here runs upward from the lowest interval, so reverse the labels if you need the exact mapping from the question.
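If x is a whole array rather than a single scalar, numpy.digitize does the same interval lookup in a vectorised way; a minimal sketch (edges and labels chosen to mirror the question, data made up):
import numpy as np

x = np.array([0.65, 0.85, 0.95, 0.30])           # illustrative data
edges = np.array([0.6, 0.7, 0.8, 0.9])           # interval edges from the question
labels = np.array([0, 3, 2, 1, 0])               # category for each region delimited by the edges
cat = labels[np.digitize(x, edges, right=True)]  # right=True makes the intervals (low, high]
print(cat)  # [3 1 0 0]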
I have a list of probabilities p = [p1, p2, …, pn].
All I want is to simulate a list s = [0, 1, 0, 0, 1, …, 1], whose first element is 0 with probability p1 and 1 with probability 1 - p1, and so on for the remaining elements, always matching the corresponding probabilities in the p-list.
Currently my solution is to for-loop over p, and then append to s the output from np.random.choice() called on each individual pn.
import numpy as np

s = []
for item in p:
    s.append(np.random.choice([0, 1], p=[item, 1 - item]))
You just need to draw your random numbers and compare them element-wise with your p.
Just decide whether you want > or >= to produce a 1.
import numpy as np
p = np.array([0.2, 0.5, 0, 1.0, 0.9, 0.3, 0.1, 0.8])
x = np.random.random(size=p.shape)
ans = (x>p).astype('int')
print(p)
print(x)
print(ans)
[0.2 0.5 0. 1. 0.9 0.3 0.1 0.8]
[0.08990063 0.51804083 0.9049705 0.0885368 0.1273564 0.18583925
0.51488052 0.23258143]
[0 1 1 0 0 0 1 0]
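If you prefer to draw the Bernoulli samples directly, np.random.binomial also accepts an array of probabilities; since the question wants 1 with probability 1 - p, pass 1 - p as the success probability (a sketch):
import numpy as np

p = np.array([0.2, 0.5, 0, 1.0, 0.9, 0.3, 0.1, 0.8])
s = np.random.binomial(n=1, p=1 - p)  # each element is 1 with probability 1 - p[i], else 0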
In a Pandas DataFrame, how would I find the first occurrence of a large difference between two values at two adjacent indices?
As an example, if I have a DataFrame column A with data [1, 1.1, 1.2, 1.3, 1.4, 1.5, 7, 7.1, 7.2, 15, 15.1], I would want the index holding 1.5, which would be 5. My code below gives me the index holding 7.2 instead, because 15 - 7.2 > 7 - 1.5.
idx = df['A'].diff().idxmax() - 1
How should I fix this problem, so I get the index of the first 'large difference' occurrence?
The main issue is of course how you define a "large difference". Your solution is pretty good to get the largest difference, improved only by using .diff(-1) and using absolute values as shown by Jezrael:
differences = df['A'].diff(-1).abs()
Using absolute values matters if your values are not sorted, in which case you can get negative differences.
Then, you should probably do some clustering on these values and get the smallest index of the cluster with the largest values. Jezrael already showed a heuristic using the largest quartile; however, only slightly modifying your example makes it fail:
df = pd.DataFrame({'A': [1, 1.05, 1.2, 1.3, 1.4, 1.5, 7, 7.1, 7.2, 15, 15.1]})
differences = df['A'].diff(-1).abs()
idx = differences.index[differences >= differences.quantile(.75)][0]
print(idx, differences[idx])
This returns index 1 with a difference of 0.1499999999999999.
Here are 3 other heuristics that might work better for you:
If you have a value above which you consider a difference to be “large” (e.g. 1.5):
idx = differences.index[differences >= 1.5][0]
If you know how many large values there are, you can select those and get the smallest index (e.g. 2):
idx = differences.nlargest(2).index.min()
If you know all small values are grouped together (as are all the 0.1 in your example), you can filter what's larger than the mean (or the mean + 1 standard deviation if your “large” values are very close to the smaller ones).
idx = differences.index[differences >= differences.mean()][0]
This works because, contrary to the median, the mean is pulled up significantly by your few large differences.
If you really want to go for proper clustering, you can use the KMeans algorithm from scikit learn:
import numpy as np
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2).fit(differences.values[:-1].reshape(-1, 1))  # [:-1] drops the trailing NaN
clusters = pd.Series(kmeans.labels_, index=differences.index[:-1])
idx = clusters.index[clusters.eq(np.squeeze(kmeans.cluster_centers_).argmax())][0]
This classifies the data into 2 classes, and then gets the classification into a pandas Series. We then filter this series’ index by selecting only the cluster that has the highest values, and finally get the first element of this filtered index.
One idea is to filter by Series.quantile on the Series of differences, computed with diff(-1) and absolute values, and then take the first index:
df = pd.DataFrame({'A':[1, 1.1, 1.2, 1.3, 1.4, 1.5, 7, 7.1, 7.2, 15, 15.1]})
x = df['A'].diff(-1).abs()
print(x)
0      0.1
1      0.1
2      0.1
3      0.1
4      0.1
5      5.5
6      0.1
7      0.1
8      7.8
9      0.1
10     NaN
Name: A, dtype: float64
idx = x.index[x >= x.quantile(.75)]
print (idx)
Int64Index([5, 7, 8], dtype='int64')
print (idx[0])
5
If you have a NumPy array (which you can get from any DataFrame column), you can use numpy.diff together with numpy.argwhere.
import numpy as np
a = np.array([1, 1.1, 1.2, 1.3, 1.4, 1.5, 7, 7.1, 7.2, 15, 15.1])
diff = np.diff(a)
threshold = 2  # set your threshold
max_index = np.argwhere(diff > threshold)
# array([[5], [8]])
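Since the question asks for the first such occurrence, take the first row of that result (the variable name is just for illustration):
first_idx = max_index[0][0]  # -> 5, the index holding 1.5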
References:
https://numpy.org/doc/stable/reference/generated/numpy.diff.html
https://numpy.org/doc/stable/reference/generated/numpy.argwhere.html
More Info:
pandas Series.diff computes diff[i] = a[i] - a[i-1] and returns a series of the same length as the input, with NaN in the first position.
numpy.diff computes diff[i] = a[i+1] - a[i] and returns an array that is one element shorter than the input.
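A tiny comparison of the two on made-up data:
import numpy as np
import pandas as pd

a = [1, 1.5, 7]
print(pd.Series(a).diff().tolist())  # [nan, 0.5, 5.5] -> same length, NaN first
print(np.diff(a))                    # [0.5 5.5]       -> one element shorter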
def shift(a):
    a_r = np.roll(a, 1)   # right shift: a_r[i] = a[i-1]
    a_l = np.roll(a, -1)  # left shift:  a_l[i] = a[i+1]
    return np.stack([a_l, a_r], axis=1)

a = np.array([1, 1.1, 1.2, 1.3, 1.4, 1.5, 7, 7.1, 7.2, 15, 15.1])
diff = abs(shift(a) - a.reshape(-1, 1))  # column 0: forward differences, column 1: backward differences
diff = diff[1:-1]                        # drop the first and last rows, which wrap around
indices = diff.argmax(axis=0) - 2
a[indices]
array([1.5, 7. ])
I found a thread about converting a matrix to a pandas DataFrame. However, I would like to do the opposite - I have a pandas DataFrame with time series data of this structure:
time stamp, batch, value
1, 1, 0.1
2, 1, 0.2
3, 1, 0.3
4, 1, 0.3
5, 2, 0.25
6, 2, 0.32
7, 2, 0.2
8, 2, 0.1
...
What I would like to have is a matrix of values with one row belonging to one batch:
[[0.1, 0.2, 0.3, 0.3],
[0.25, 0.32, 0.2, 0.1],
...]
which I want to plot as heatmap using matplotlib or alike.
Any suggestion?
What you can try is to first group by the desired index:
g = df.groupby("batch")
And then aggregate each group into a list using the list constructor.
The result can then be converted to an array using the .values property (or the .as_matrix() function, but that is deprecated).
mtr = g.aggregate(list).values
One downside of this method is that it will create arrays of lists instead of a nice array, even if the result would lead to a non-jagged array.
Alternatively, if you know that you get exactly 4 values for every unique value of batch, you can reshape the values directly.
df = df.sort_values("batch")
my_indices = [1, 2] # Or whatever indices you desire.
mtr = df.values[:, my_indices] # or df.as_matrix()
mtr = mtr.reshape(-1, 4) # Only works if you have exactly 4 values for each batch
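The question also asks about plotting the result as a heatmap; once mtr is a plain 2-D float array of the values (e.g. from the reshape approach above, keeping only the value column), a minimal matplotlib sketch could look like this:
import matplotlib.pyplot as plt

plt.imshow(mtr, aspect='auto', cmap='viridis')
plt.colorbar(label='value')
plt.xlabel('time step within batch')
plt.ylabel('batch')
plt.show()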
Try crosstab from pandas, pd.crosstab(). You will have to pick a suitable aggfunc.
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.crosstab.html
and then call .values (or the older .as_matrix()) on the result.
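A sketch of what that could look like, assuming the columns are named 'batch' and 'value'; since the time stamps keep counting across batches, a within-batch counter is used for the columns:
import pandas as pd

step = df.groupby('batch').cumcount()  # 0, 1, 2, ... within each batch
mtr = pd.crosstab(index=df['batch'], columns=step,
                  values=df['value'], aggfunc='first').values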
I am looking to obtain unique percentiles in Python, even for repeated values.
For example, the following case is giving the output as expected.
Case 1
import pandas as pd
s1 = pd.Series([1,2,3,4])
s1.rank(pct=True)
Case 1 Output - [0.25, 0.5, 0.75, 1]
I expect the output to be the same even when the input series is [2, 2, 2, 4]. However, here the output is [0.5, 0.5, 0.5, 1]. Any one of the following outputs would be fine with me:
[0.25, 0.5, 0.75, 1]
[0.5, 0.25, 0.75, 1]
[0.25, 0.75, 0.5, 1]
Please let me know if there is a way to achieve that.
Rank has a parameter method which defaults to 'average', which gives you the results you are seeing. Let's change that to 'first'.
s1 = pd.Series([2,2,2,4])
s1.rank(pct=True,method='first')
Output:
0 0.25
1 0.50
2 0.75
3 1.00
dtype: float64
There is no simple function to do this. Although I understand what you want to do, this is not a percentile score. In fact, what you've shown here is a percentage rank, which is not the same as percentile.
To get the functionality you want, I believe that you'll have to group and compute the values yourself.
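For what it's worth, here is one hand-rolled sketch of computing such ranks yourself; it ends up mirroring rank(method='first', pct=True) by breaking ties in order of appearance:
import numpy as np
import pandas as pd

s1 = pd.Series([2, 2, 2, 4])
ranks = np.empty(len(s1), dtype=float)
ranks[np.argsort(s1.values, kind='stable')] = np.arange(1, len(s1) + 1)
print(ranks / len(s1))  # [0.25 0.5  0.75 1.  ]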
I have a Pandas (version 0.14.1) DataFrame object like this
import pandas as pd
df = pd.DataFrame(zip([1, 2, 3, 4, 5],
[0.1, 0.3, 0.1, 0.2, 0.4]),
columns=['y', 'dy'])
It returns
y dy
0 1 0.1
1 2 0.3
2 3 0.1
3 4 0.2
4 5 0.4
where the first column is value and the second is error.
First case: I want to make a plot for y-values
df['y'].plot(style="ro-")
Second case: I want to add a vertical errorbars dy for y-values
df['y'].plot(style="ro-", yerr=df['dy'])
So, if I add a yerr or xerr parameter to the plot method, it ignores style.
Is this a Pandas feature or a bug?
As TomAugspurger pointed out, it is a known issue. However, it has an easy workaround in most cases: use the fmt keyword instead of the style keyword to specify the shortcut style options.
import pandas as pd
df = pd.DataFrame(zip([1, 2, 3, 4, 5],
[0.1, 0.3, 0.1, 0.2, 0.4]),
columns=['y', 'dy'])
df['y'].plot(fmt='ro-', yerr=df['dy'], grid='on')
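If the pandas shortcut still gets in the way, you can always drop down to matplotlib's errorbar directly (a sketch, reusing the df from above):
import matplotlib.pyplot as plt

plt.errorbar(df.index, df['y'], yerr=df['dy'], fmt='ro-')
plt.grid(True)
plt.show()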