Grouping data in a dataframe in python

I have a dataframe column
df['probability'] = [0.5, 0.6, 0.7, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]
I need to group these values so that 0.5, 0.6, 0.7 and 1 form one category with output 1 in a new column, 1.5 and 2 form one category with output 2, and so on. Could anyone help me with this?
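A minimal sketch of one way to do this, assuming the categories are simply each value rounded up to the next whole number (the column name 'category' is my choice, not from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'probability': [0.5, 0.6, 0.7, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]})
# np.ceil maps 0.5-1.0 to 1, 1.5-2.0 to 2, and so on,
# which matches the grouping described above
df['category'] = np.ceil(df['probability']).astype(int)
print(df)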

Related

Pandas Group By column to generate quantiles (.25, 0.5, .75)

Let's say we have CityName, Min-Temperature, Max-Temperature, and Humidity for different cities.
We need an output dataframe grouped on CityName, with the 0.25, 0.5 and 0.75 quantiles of each column. New column names would be OldColumnName + 'Q1'/'Q2'/'Q3'.
Example INPUT
import pandas as pd

df = pd.DataFrame({
    'cityName': pd.Categorical(['a','a','a','a','b','b','b','b','a','a','a','a','b','b','b','b']),
    'MinTemp': [1.1, 2.1, 3.1, 1.1, 2, 2.1, 2.2, 2.4, 2.5, 1.11, 1.31, 2.1, 1, 2, 2.3, 2.1],
    'MaxTemp': [2.1, 4.2, 5.1, 2.13, 4, 3.1, 5.2, 3.4, 3.5, 2.11, 2.31, 3.1, 2, 4.3, 4.3, 3.1],
    'Humidity': [0.29, 0.19, .45, 0.1, 0.1, 0.1, 0.2, 0.5, 0.11, 0.31, 0.1, .1, .2, 0.3, 0.3, 0.1],
})
OUTPUT
First Approach
First, group your data on the column you want, which is 'cityName'. Then, because you want several different aggregations on each column, use the agg function. You cannot pass parameters to the functions given to agg, so define one function per quantile:
def quantile_50(x):
    return x.quantile(0.5)

def quantile_25(x):
    return x.quantile(0.25)

def quantile_75(x):
    return x.quantile(0.75)

quantile_df = df.groupby('cityName').agg([quantile_25, quantile_50, quantile_75])
quantile_df
Second Approach
You can use the describe method and select the statistics you need. With pd.IndexSlice you can pick which sub-columns to keep:
idx = pd.IndexSlice
df.groupby('cityName').describe().loc[:, idx[:, ['25%', '50%', '75%']]]
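A third option, as a sketch: GroupBy.quantile accepts a list of quantiles directly, so the helper functions can be skipped. The unstack call moves the quantile level from the row index into the columns:

quantile_df = (df.groupby('cityName')[['MinTemp', 'MaxTemp', 'Humidity']]
                 .quantile([0.25, 0.5, 0.75])
                 .unstack())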

How to print categories in pandas.cut?

Notice that when you apply pandas.cut to a dataframe column, the output includes the bin of each element plus Name:, Length:, dtype:, and Categories. I just want the Categories array printed so I can obtain the ranges of the bins I asked for. For example, with bins=4 applied to a dataframe of the numbers 1, 2, 3, 4, 5, I would want the output to be solely the ranges of the four bins, i.e. (1, 2], (2, 3], (3, 4], (4, 5].
Is there any way I can do this? It can be anything, even if it doesn't require printing "Categories".
I guess that you just want to get the bins from pd.cut().
If so, you can simply set retbins=True; see the docs for pd.cut.
For example:
In [1]:
data = pd.DataFrame({'a': [1, 2, 3, 4, 5]})
cats, bins = pd.cut(data.a, 4, retbins=True)

Out[1]:
cats:
0    (0.996, 2.0]
1    (0.996, 2.0]
2      (2.0, 3.0]
3      (3.0, 4.0]
4      (4.0, 5.0]
Name: a, dtype: category
Categories (4, interval[float64]): [(0.996, 2.0] < (2.0, 3.0] < (3.0, 4.0] < (4.0, 5.0]]

bins:
array([0.996, 2.   , 3.   , 4.   , 5.   ])
Then you can reuse the bins as you please, e.g.:
lst = [1, 2, 3]
category = pd.cut(lst, bins)
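If what you want printed is the Categories themselves rather than the bin edges, a small sketch reusing the cats Series from above (the .cat accessor is standard pandas for categorical Series):

# .cat.categories is an IntervalIndex holding just the bin ranges
print(cats.cat.categories)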
For anyone who has come here to see how to select a particular bin from the pd.cut result: you can compare against pd.Interval.
df['bin'] = pd.cut(df['y'], [0.1, .2,.3,.4,.5, .6,.7,.8 ,.9])
print(df["bin"].value_counts())
Output
(0.2, 0.3] 697
(0.4, 0.5] 156
(0.5, 0.6] 122
(0.3, 0.4] 12
(0.6, 0.7] 8
(0.7, 0.8] 4
(0.1, 0.2] 0
(0.8, 0.9] 0
print(df.loc[df['bin'] == pd.Interval(0.7, 0.8)])

How to print a value to a new array if it is within a bound of the previous value in that array in Python/Numpy

If I have an array:
StartArray=np.array([1, 2, 3, 1.4, 1.2, 0.6, 1.8, 1.5, 1.9, 2.2, 3, 4 ,2.3])
I would like to loop through this array starting with StartArray[0] and only keep values that are within +/- .5 of the last kept value to yield:
EndArray=[1, 1.4, 1.2, 1.5, 1.9, 2.2, 2.3]
This is what I have tried so far, and the results don't make sense:
StartArray = np.array([1, 2, 3, 1.4, 1.2, 0.6, 1.8, 1.5, 1.9, 2.2, 3, 4, 2.3])
EndArray = np.empty_like(StartArray)
EndArray[0] = StartArray[0]
for i in range(len(StartArray)-1):
    if EndArray[i]+.5 > StartArray[i+1] > EndArray[i]-.5:
        EndArray[i+1] = StartArray[i+1]
Out:
array([ 1. , 0.22559146, 0.13015365, 5.24910493, 0.63804761,
0.6 , 1.73143364, 1.5 , 1.9 , 2.2 ,
6.82525036, 0.61641556, 6.82325036])
A list is the right structure for this job:
import numpy as np

StartArray = np.array([1, 2, 3, 1.4, 1.2, 0.6, 1.8, 1.5, 1.9, 2.2, 3, 4, 2.3])
ref = StartArray[0]
End = []
for x in StartArray:
    if abs(x - ref) < .5:
        End.append(x)
        ref = x
print(np.array(End))
[ 1. 1.4 1.2 1.5 1.9 2.2 2.3]
There are multiple problems with your approach. First, you initialize EndArray to be the same size as StartArray, but your desired output is shorter than that. Instead, initialize EndArray as an empty list and append values as you loop through StartArray. Second, you want to keep values that are within 0.5 of the last kept value, so you need to track that value explicitly.
Adapting your code:
StartArray = np.array([1, 2, 3, 1.4, 1.2, 0.6, 1.8, 1.5, 1.9, 2.2, 3, 4, 2.3])
EndArray = []
last_kept = StartArray[0]
EndArray.append(last_kept)
for i in range(len(StartArray)-1):
    if np.abs(StartArray[i+1] - last_kept) < 0.5:
        last_kept = StartArray[i+1]
        EndArray.append(last_kept)
# convert back to numpy array
EndArray = np.array(EndArray)
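Running this on the sample input reproduces the desired result from the question:

print(EndArray)
# [1.  1.4 1.2 1.5 1.9 2.2 2.3]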

Clustering of sequential data

Given the following scenario, I have a really long street. Each house on the street has some number of children. If I were to sequentially append the number of children in each house along an array, I could get some array like:
x = [1,1,1,1,2,2,2,2,1,1,1,1,3,3,3,2,1,1,1,1,2,2,2,2]
I want to determine, by location, the areas where the households cluster, i.e. I want to group the 2's together, the 3's together, and the 2's at the end together. Normally on 1D data I would sort, take differences, and find the clusters of 1's, 2's, and 3's. But here I want to keep the index of these values as a factor, so I want to end up identifying clusters as:
index: value
0-4 : 1
5-8: 2
9-12: 1
13-16: 3
17-20: 1
21-24: 2
I have seen mean shift used for this kind of detection, and would like to implement this in python. I have also seen kernel density functions. Does anyone know how best to implement this in python?
Edit: To make something clear, I have simplified the problem. In the actual problem, each cluster of integers has a Gaussian distribution of values around that integer value. So I would have a list more like:
x = [0.8, 0.95, 1.2, 1.3, 2.2, 1.6, 1.9, 2.1, 1.1, .7, .9, .9, 3.4, 2.8, 2.9, 3.0, 1.1, 1.0, 0.9, 1.2, 2.2, 2.1, 1.7, 12.0]
A simple approach:
x = [0.8, 0.95, 1.2, 1.3, 2.2, 1.6, 1.9, 2.1, 1.1, .7, .9, .9, 3.4, 2.8, 2.9, 3.0, 1.1, 1.0, 0.9, 1.2, 2.2, 2.1, 1.7, 12.0]
cluster = []
for i, v in enumerate(x):
    v = round(v)
    # start a new cluster when the rounded value changes
    if not cluster or cluster[-1][2] != v:
        cluster.append([i, i, v])
    else:
        # otherwise extend the current cluster's end index
        cluster[-1][1] = i
This results in a list of [start, end, value] lists:
[[ 0, 3, 1],
[ 4, 7, 2],
[ 8, 11, 1],
[12, 14, 3],
[15, 15, 2],
[16, 19, 1],
[20, 23, 2]]
Your desired output wasn't zero-based, so the indices here look a bit different.
Edit: updated the algorithm for the updated version of the problem.
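As a small follow-up sketch, the cluster list can be printed in the question's "index: value" style:

for start, end, value in cluster:
    print(f"{start}-{end}: {value}")
# 0-3: 1
# 4-7: 2
# ...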

Mixed length object type in pandas dataframe

I want to use the pandas library to store mixed length objects.
Let's say for instance that I want to have a dataframe with two columns: the first one storing a float and the second one storing a list of float.
What is the best way to do this in pandas, bearing in mind that I want to be able to sort the data using the first column?
import pandas as pd

data = {
    'a': [.1, .2, .3],
    'b': [[.1, .2], [.3, .4, .5, .6, .7], [.8, .9, 1.]],
}
df = pd.DataFrame(data)
print(df)
result:
a b
0 0.1 [0.1, 0.2]
1 0.2 [0.3, 0.4, 0.5, 0.6, 0.7]
2 0.3 [0.8, 0.9, 1.0]
reversed (using sort_values, the modern replacement for the long-deprecated df.sort):
print(df.sort_values('a', ascending=False))
a b
2 0.3 [0.8, 0.9, 1.0]
1 0.2 [0.3, 0.4, 0.5, 0.6, 0.7]
0 0.1 [0.1, 0.2]
