How to do rank/group in Python for a column?

I was trying to group/rank in Python the way PROC RANK does in SAS. The code I tried is
Merge_Data['FrSeg'] = Merge_Data['Frequency'].rank(method='dense').astype(int)
It gives me an output with the same numbers. I would like to group into 3 ranks.
For example, Frequency from 1-10 in rank 1, 11-20 in rank 2, and 21 and above in rank 3. Frequency (the number of orders placed, if you want to know) has min=1 and max=68.
Thanks for your help in advance.

You might be interested in the numpy and pandas packages:
import pandas as pd
import numpy as np
# dataframe to hold the list of values
df = pd.DataFrame({'FrSeg': [33, 66, 26, 5, 16, 31, 34, 10, 17, 40]})
# set the rank ranges
ranges = [0, 10, 20, 68]
# pandas.cut assigns each value of 'FrSeg' to one of the
# half-open bins defined by `ranges`
print(df.groupby(pd.cut(df['FrSeg'], ranges)).count())
output:
          FrSeg
FrSeg
(0, 10]       2
(10, 20]      2
(20, 68]      6
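To get the three ranks the original question asks for as a new column, you can pass labels to pd.cut. A minimal sketch, assuming Merge_Data and the bin edges from the question:
import pandas as pd
# label the half-open bins (0, 10], (10, 20], (20, 68] as ranks 1, 2, 3
Merge_Data['FrSeg'] = pd.cut(Merge_Data['Frequency'],
                             bins=[0, 10, 20, 68],
                             labels=[1, 2, 3]).astype(int)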

Related

Determine if Age is between two values

I am trying to determine which ages in a dataframe fall between 0 and 10. I have written the following, but it only returns 'Yes', even though not all values fall between 0 and 10:
x = df['Age']
for i in x:
    if df['Age'].between(0, 10, inclusive=True).any():
        print('Yes')
    else:
        print('No')
I am doing this with the intention of creating a new column in the dataframe that will categorize people based on whether they fall into an age group, i.e., 0-10, 11-20, etc...
Thanks for any help!
If you want to create a new column, assign to the column:
df['Child'] = df['Age'].between(0, 10, inclusive=True)
with the intention of creating a new column in the dataframe that will
categorize people based on whether they fall into an age group, i.e.,
0-10, 11-20, etc...
Then pd.cut is what you are looking for:
pd.cut(df['Age'], list(range(0, df['Age'].max() + 10, 10)))
For example:
df['Age'] = pd.Series([10, 7, 15, 24, 66, 43])
then the above gives you:
0 (0, 10]
1 (0, 10]
2 (10, 20]
3 (20, 30]
4 (60, 70]
5 (40, 50]
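If you also want to keep those categories, a minimal sketch (the column name age_group is only illustrative):
df['age_group'] = pd.cut(df['Age'], list(range(0, df['Age'].max() + 10, 10)))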

How to use Python Pandas to calculate the mean for skipped backward rows?

Here is the data:
data = {'col1': [12, 13, 5, 2, 12, 12, 13, 23, 32, 65, 33, 52, 63, 12, 42, 65, 24, 53, 35]}
df = pd.DataFrame(data)
I want to create a new column skipped_mean. Only the last 3 rows have a valid value for this variable. What it does is look back 6 rows at a time, three times in total, and take the average of those three numbers.
How can it be done?
You could do it with a weighted rolling mean approach:
import numpy as np
# non-zero weights at window positions 0, 6 and 12, i.e. the row 12 back,
# the row 6 back, and the current row of each 13-row window
weights = np.array([1/3, 0, 0, 0, 0, 0, 1/3, 0, 0, 0, 0, 0, 1/3])
df['skipped_mean'] = df['col1'].rolling(13).apply(lambda x: np.sum(weights * x))
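An equivalent sketch without rolling, assuming the same spacing of 6 rows, uses shift to line up the three values directly:
# average the current row with the rows 6 and 12 positions above it
df['skipped_mean'] = (df['col1'] + df['col1'].shift(6) + df['col1'].shift(12)) / 3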

How can I find the final cumulative sum across numpy axis? [duplicate]

This question already has answers here:
How to calculate the sum of all columns of a 2D numpy array (efficiently)
(6 answers)
Closed 4 years ago.
I have a numpy array
np.array(data).shape
(50,50)
Now, I want to find the cumulative sums across axis=1. The problem is cumsum creates an array of cumulative sums, but I just care about the final value of every row.
This is incorrect of course:
np.cumsum(data, axis=1)[-1]
Is there a succinct way of doing this without looping through the array?
You are almost there, but as you have it now, you are selecting just the final row. What you need is to select all rows from the last column, so your indexing at the end should be: [:,-1].
Example:
>>> a
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])
>>> a.cumsum(axis=1)[:,-1]
array([ 10,  35,  60,  85, 110])
Note, I'm leaving this up as I think it explains what was going wrong with your attempt, but admittedly, there are more effective ways of doing this in the other answers!
The final cumulative sum of every row is in fact simply the sum of that row, i.e. the row-wise sum, so we can implement this as:
>>> a.sum(axis=1)
array([ 10,  35,  60,  85, 110])
So for every row we calculate the sum over all the columns. We thus do not need to materialize the intermediate cumulative sums: numpy may still accumulate internally, but they are never "emitted" as an array.
You can use numpy.ufunc.reduce if you don't need the intermediate accumulated results of the ufunc.
>>> a = np.arange(9).reshape(3,3)
>>> a
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> np.add.reduce(a, axis=1)
array([ 3, 12, 21])
However, in the case of sum, Willem's answer is clearly superior and to be preferred. Just keep in mind that in the general case, there's ufunc.reduce.
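For instance, the same pattern with a different ufunc gives a row-wise product (a quick sketch on the array above):
>>> np.multiply.reduce(a, axis=1)
array([  0,  60, 336])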

How to pandas groupby specific value in a column?

I have a dataframe with multiple columns, to which I added a new column for age intervals.
# Create Age Intervals
bins = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
df['age_intervals'] = pd.cut(df['age'],bins)
Now, I have another column named no_show that states whether a person shows up for the appointment or not, using values 0 or 1. With the code below, I'm able to group the data by age_intervals.
df[['no_show','age_intervals']].groupby('age_intervals').count()
Output:
               no_show
age_intervals
(0, 5]            8192
(5, 10]           7017
(10, 15]          5719
(15, 20]          7379
(20, 25]          6750
But how can I group the no_show data based on its values 0 and 1? For example, in the age interval (0, 5], out of 8192 rows, 3291 are 0 and 4901 are 1 for no_show, and so on.
An easy way would be to group on both columns and use size() which returns a Series:
df.groupby(['age_intervals', 'no_show']).size()
This returns a Series with the counts split by both the age_intervals column and the no_show column.
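If you would rather see the 0/1 counts side by side as columns, one option (a minimal sketch) is to unstack the result:
df.groupby(['age_intervals', 'no_show']).size().unstack(fill_value=0)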

Python slicing multi-dimensional array

Could anyone explain to me how the command marked (<---) below works in numpy?
r = np.arange(36)
r.resize(6,6)
r.reshape(36)[::7] # <---
You just have to run the commands one by one and analyse their output:
Create an array with the numbers 0 through 35.
>>> r = np.arange(36)
>>> r
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35])
Resize the array in place to a 6 x 6 array:
>>> r.resize(6,6) # equivalent to r = r.reshape(6,6)
>>> r
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35]])
Reshape r back to a 1-dimensional array:
>>> tmp = r.reshape(36)
tmp above is exactly the same as r in the first step
Take every 7th element:
>>> tmp[::7]
array([ 0, 7, 14, 21, 28, 35])
Slicing/indexing is written as i:j:k, where i = from, j = to and k = step. Thus, 5:10:2 means from the 5th element up to (but not including) the 10th, taking every 2nd element. If i is not present, it is assumed to be the beginning of the array. If j is not present, it is assumed to be the end of the array. If k is not present, it is assumed to be a step of 1 (all the elements in the range).
With all the above, you could rewrite your example in a single line as:
>>> np.arange(36)[::7]
Or if you already have r, which is N-Dimensional:
>>> r.ravel()[::7]
Here ravel returns a 1-dimensional view of r (preferred over reshape(36)).
If you want to know more about slicing, please refer to the numpy documentation.
First, you are using NumPy's ndarray.reshape, which rebuilds the given array with the specified shape. In your case, you are converting it to a 1-dimensional array with 36 elements.
Secondly, with the numbers between brackets you are indexing certain values in the array. The slice consists of 3 values per dimension, in the form [number1:number2:number3]. If you leave values blank (as in your case for numbers 1 and 2), they fall back to their defaults: number1 defaults to 0, number2 defaults to the length of the array (the slice runs to the end), and number3 defaults to 1:
The first number indicates the array index where you will begin taking values.
The second number indicates the array index where you will stop taking values.
Finally, the last number indicates the step, i.e. how many positions are advanced after each element is read. In your case, you are reading every 7th index.
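For example, a quick check of those defaults:
>>> a = np.arange(10)
>>> a[::3]   # start defaults to 0, stop to len(a); the step is 3
array([0, 3, 6, 9])
>>> a[2:8]   # step defaults to 1
array([2, 3, 4, 5, 6, 7])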
One point to add: both the reshape() and resize() methods have the SAME functionality; the ONLY difference between them is how they affect the calling array object r:
r.resize() returns nothing. It directly changes the shape of the calling array object r.
r.reshape() returns a new reshaped array object and leaves the original r unchanged.
>>> import numpy as np
>>> r = np.arange(36)
>>> r.shape
(36,)
>>> # 1. --- `reshape()` returns a new object and keeps `r` unchanged ---
>>> new = r.reshape(6,6)
>>> new.shape
(6, 6)
>>>
>>> # 2. --- resize changes `r` directly and returns `None` ---
>>> nothing = r.resize(6,6)
>>> type(nothing)
<class 'NoneType'>
>>> r.shape
(6, 6)
