I have a dataframe with multiple columns, to which I added a new column for age intervals.
# Create Age Intervals
bins = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
df['age_intervals'] = pd.cut(df['age'],bins)
Now, I've another column named no_show that states whether a person shows up for the appointment or not using values 0 or 1. By using the below code, I'm able to groupby the data based on age_intervals.
df[['no_show','age_intervals']].groupby('age_intervals').count()
Output:
age_intervals no_show
(0, 5] 8192
(5, 10] 7017
(10, 15] 5719
(15, 20] 7379
(20, 25] 6750
But how can I group the no_show data based on its values 0 and 1? For example, in the age interval (0, 5], out of 8192, 3291 are 0 and 4901 are 1 for no_show, and so on.
An easy way would be to group on both columns and use size() which returns a Series:
df.groupby(['age_intervals', 'no_show']).size()
This will return a Series with divided values depending on both the age_intervals column and the no_show column.
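To see the two no_show values side by side for each interval, the Series can be pivoted with unstack(). This is a minimal sketch on made-up data; the sample values and the name counts are illustrative and are not taken from the question's dataset.

```python
import pandas as pd

# Toy data (illustrative only, not the asker's dataset)
df = pd.DataFrame({
    "age": [3, 4, 2, 7, 8, 12, 14, 13],
    "no_show": [0, 1, 1, 0, 0, 1, 0, 1],
})
df["age_intervals"] = pd.cut(df["age"], [0, 5, 10, 15])

# size() counts rows per (interval, no_show) pair; unstack() turns the
# no_show level into columns, one per value, filling missing pairs with 0
counts = (
    df.groupby(["age_intervals", "no_show"], observed=False)
      .size()
      .unstack(fill_value=0)
)
print(counts)
```

Each row of counts is one age interval, and the columns 0 and 1 hold the number of rows with that no_show value.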
How can I replace pd intervals with integers
import pandas as pd
df = pd.DataFrame()
df['age'] = [43, 76, 27, 8, 57, 32, 12, 22]
age_band = [0,10,20,30,40,50,60,70,80,90]
df['age_bands']= pd.cut(df['age'], bins=age_band, ordered=True)
output:
age age_bands
0 43 (40, 50]
1 76 (70, 80]
2 27 (20, 30]
3 8 (0, 10]
4 57 (50, 60]
5 32 (30, 40]
6 12 (10, 20]
7 22 (20, 30]
Now I want to add another column that replaces each band with a single number (int), but I could not get it to work.
For example, this did not work:
df['age_code']= df['age_bands'].replace({'(40, 50]':4})
How can I get a column that looks like this?
age_bands age_code
0 (40, 50] 4
1 (70, 80] 7
2 (20, 30] 2
3 (0, 10] 0
4 (50, 60] 5
5 (30, 40] 3
6 (10, 20] 1
7 (20, 30] 2
Assuming you want the first digit of every interval, you can use Series.apply as follows. It reads a single character from the interval's string form, so it only works while every code is a single digit:
df["age_code"] = df["age_bands"].apply(lambda band: str(band)[1])
However, note that this may not be very efficient for a large dataframe.
To convert the column values to an int datatype, you can use pd.to_numeric:
df["age_code"] = pd.to_numeric(df['age_code'])
Since the column contains pd.Interval objects, you can use each interval's left property:
df['age_code'] = df['age_bands'].apply(lambda interval: interval.left // 10)
You can do that by simply adding a second pd.cut call and defining the labels argument.
import pandas as pd
df = pd.DataFrame()
df['age'] = [43, 76, 27, 8, 57, 32, 12, 22]
age_band = [0,10,20,30,40,50,60,70,80,90]
df['age_bands']= pd.cut(df['age'], bins=age_band, ordered=True)
#This is the part of code you need to add
age_labels = [0, 1, 2, 3, 4, 5, 6, 7, 8]
df['age_code']= pd.cut(df['age'], bins=age_band, labels=age_labels, ordered=True)
>>> print(df)
You can create a dictionary of bins and map it to the age_bands column. Build it from the column's categories rather than from the observed values, so that an empty bin such as (60, 70] does not shift the codes that follow it:
bins_dict = {key: idx for idx, key in enumerate(df['age_bands'].cat.categories)}
df['age_code'] = df['age_bands'].map(bins_dict).astype(int)
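Because pd.cut returns a categorical column, the position of each bin is also available directly through the .cat.codes accessor. A minimal sketch using the question's data; it assumes the desired code is simply the bin's position in the bins list.

```python
import pandas as pd

df = pd.DataFrame({"age": [43, 76, 27, 8, 57, 32, 12, 22]})
age_band = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
df["age_bands"] = pd.cut(df["age"], bins=age_band)

# .cat.codes is the position of each value's bin in the category order
# (ages that fall outside every bin would get -1)
df["age_code"] = df["age_bands"].cat.codes
print(df)
```

This avoids both string parsing and a Python-level apply, which may matter on a large dataframe.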
I am trying to determine which ages in a dataframe fall between 0 and 10. I have written the following, but it only returns 'Yes' even though not all values fall between 0 and 10:
x = df['Age']
for i in x:
    if df['Age'].between(0, 10, inclusive=True).any():
        print('Yes')
    else:
        print('No')
I am doing this with the intention of creating a new column in the dataframe that will categorize people based on whether they fall into an age group, i.e., 0-10, 11-20, etc...
Thanks for any help!
If you want to create a new column, assign to the column (recent pandas versions expect inclusive to be a string such as 'both' rather than True):
df['Child'] = df['Age'].between(0, 10, inclusive='both')
"with the intention of creating a new column in the dataframe that will categorize people based on whether they fall into an age group, i.e., 0-10, 11-20, etc..."
Then pd.cut is what you are looking for:
pd.cut(df['Age'], list(range(0, df['Age'].max() + 10, 10)))
For example:
df['Age'] = pd.Series([10, 7, 15, 24, 66, 43])
then the above gives you:
0 (0, 10]
1 (0, 10]
2 (10, 20]
3 (20, 30]
4 (60, 70]
5 (40, 50]
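To get group names like the 0-10, 11-20 wording in the question, pd.cut also accepts labels. The sketch below is one possible way to do it: the label text and the Age_group column name are my own choices, and include_lowest=True is added so that an age of exactly 0 lands in the first group instead of becoming NaN.

```python
import pandas as pd

df = pd.DataFrame({"Age": [10, 7, 15, 24, 66, 43]})

# Bin edges in steps of 10 that cover the largest age: [0, 10, ..., 70]
edges = list(range(0, int(df["Age"].max()) + 10, 10))

# Hypothetical label text: "0-10", "11-20", "21-30", ...
labels = [f"{lo + 1 if lo else 0}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]

df["Age_group"] = pd.cut(df["Age"], edges, labels=labels, include_lowest=True)
print(df)
```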
Here is the data:
data = {'col1': [12, 13, 5, 2, 12, 12, 13, 23, 32, 65, 33, 52, 63, 12, 42, 65, 24, 53, 35]}
df = pd.DataFrame(data)
I want to create a new column, skipped_mean. Only the last 3 rows have a valid value for this variable. For each row, it looks back 6 rows at a time, three times in succession, and takes the average of the three numbers.
How can this be done?
You could do it with a weighted rolling mean approach:
import numpy as np
weights = np.array([1/3,0,0,0,0,0,1/3,0,0,0,0,0,1/3])
df['skipped_mean'] = df['col1'].rolling(13).apply(lambda x: np.sum(weights*x))
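The weights above pick out the current value and the values 6 and 12 rows earlier. Assuming that is the intended reading of the question, the same result can be sketched with shift(), which avoids calling a Python lambda for every window:

```python
import numpy as np
import pandas as pd

data = {"col1": [12, 13, 5, 2, 12, 12, 13, 23, 32, 65, 33, 52, 63, 12, 42, 65, 24, 53, 35]}
df = pd.DataFrame(data)

s = df["col1"]
# Mean of the current row, the row 6 back and the row 12 back;
# the first 12 rows have no full history and come out as NaN
df["skipped_mean"] = (s + s.shift(6) + s.shift(12)) / 3
print(df.tail(3))
```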
I'm trying to create a np array with size (80,10) so each row has random values with range 0 to 99.
I've done that by
np.random.randint(99, size=(80, 10))
But I would like to always include both 0 and 99 as values in each row.
So two values in each row are already defined and the other 8 will be random.
How would I accomplish this? Is there a way to generate an array size (80,8) and just concatenate [0,99] to every row to make it (80,10) at the end?
As suggested in the comments by Tim, you can generate a matrix with random values not including 0 and 99. Then replace two random indices along the second axis with the values 0 and 99.
rand_arr = np.random.randint(low=1, high=99, size=(80, 10))  # high is exclusive, so values are 1..98
rand_indices = np.random.rand(80,10).argsort(axis=1)[:,:2]
np.put_along_axis(rand_arr, rand_indices, [0,99], axis=1)
The motivation for using argsort is that we want random indices along the second axis without replacement. Just generating a random integer matrix with values 0-9 and size=(80, 2) will not guarantee this.
In your scenario, you could use np.argpartition with kth=2 instead of np.argsort. This should be more efficient.
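A minimal sketch of the argpartition variant, written with the newer Generator API. The seed and the variable names are illustrative; only the first two positions of each partitioned row are used, so a full sort is not needed.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen only for reproducibility

# Filler values 1..98 (the upper bound of integers() is exclusive)
arr = rng.integers(1, 99, size=(80, 10))

# After partitioning around kth=2, the first two positions of each row hold
# the indices of the two smallest random keys: two distinct columns per row
idx = np.argpartition(rng.random((80, 10)), 2, axis=1)[:, :2]

# Write 0 and 99 into those two columns of every row
np.put_along_axis(arr, idx, np.array([0, 99]), axis=1)
```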
I've tried a few things and this is what I came up with
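def generate_matrix(low, high, shape):
    x, y = shape
    # randint's upper bound is exclusive, so this draws from low+1 .. high-1
    values = np.random.randint(low + 1, high, size=(x, y - 2))
    predefined = np.tile([low, high], (x, 1))
    values = np.hstack([values, predefined])
    # shuffle each row in place so low/high land in random columns
    for row in values:
        np.random.shuffle(row)
    return values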
Example usage
>>> generate_matrix(0, 99, (5, 10))
array([[94, 0, 45, 99, 18, 31, 78, 80, 32, 17],
[28, 99, 72, 3, 0, 14, 26, 37, 41, 80],
[18, 78, 71, 40, 99, 0, 85, 91, 8, 59],
[65, 99, 0, 45, 93, 94, 16, 33, 52, 53],
[22, 76, 99, 15, 27, 64, 91, 32, 0, 82]])
The way I approached it:
Generate an array of size (80, 8) in the range [1, 98] and then concatenate 0 and 99 for each row. But you probably need the 0/99 to occur at different indices for each row, so you have to shuffle them. Unfortunately, np.random.shuffle() only shuffles the rows among themselves. And if you use np.random.shuffle(arr.T).T, or random.Generator.permutation, you don't shuffle the columns independently. I haven't found a vectorised way to shuffle the rows independently other than using a Python loop.
Another way:
You can generate an array of size (80, 10) in the range [1, 98] and then substitute in random indices the values 0 and 99 for each row. Again, I couldn't find a way to generate unique indices per row (so that 0 doesn't overwrite 99 for example) without a Python loop. Since I couldn't find a way to avoid Python loops, I opted for the first way, which seemed more straightforward.
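For what it's worth, newer NumPy releases (1.20 and later, as far as I know) provide Generator.permuted, which shuffles the slices along an axis independently of each other. Assuming that method is available, the first approach can be sketched without a Python loop; the names body, fixed and matrix are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen only for reproducibility

body = rng.integers(1, 99, size=(80, 8))   # random filler values 1..98
fixed = np.tile([0, 99], (80, 1))          # the two required values per row

# permuted(..., axis=1) shuffles every row on its own and returns a new array
matrix = rng.permuted(np.hstack([body, fixed]), axis=1)
```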
If you don't care about duplicates, create an array of zeros, fill columns 2-9 with random numbers and set column 10 to 99 (column 1 stays 0).
final = np.zeros(shape=(80, 10), dtype=int)
final[:, 1:9] = np.random.randint(1, 99, size=(80, 8))  # values 1..98
final[:, 9] = 99
Creating a matrix 80x10 with random values from 0 to 99 with no duplicates in the same row with 0 and 99 included in every row
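import random

row99 = list(range(1, 99))   # candidate filler values 1..98
perm = list(range(0, 10))    # column positions 0..9
m = []
for i in range(80):
    random.shuffle(row99)
    random.shuffle(perm)
    r = row99[:10]           # 10 distinct filler values
    r[perm[0]] = 0           # overwrite two random, distinct positions
    r[perm[1]] = 99
    m.append(r)
print(m)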
Partial output:
[
... other elements here ...
[70, 58, 0, 25, 41, 10, 90, 5, 42, 18],
[0, 57, 90, 71, 39, 65, 52, 24, 28, 77],
[55, 42, 7, 9, 32, 69, 90, 0, 64, 2],
[0, 59, 17, 35, 56, 34, 33, 37, 90, 71]]
I was trying to group/rank in Python like we do in SAS with Proc Rank code. The code I tried is
Merge_Data['FrSeg'] = Merge_Data['Frequency'].rank(method='dense').astype(int)
It gives me an output with the same numbers. I would like to group the values into 3 ranks.
For example, Frequency from 1-10 in rank 1, 11-20 in rank 2, and 21 and above in rank 3. Frequency (the number of orders placed, if you want to know) has min=1 and max=68.
Thanks for your help in advance
You might be interested in numpy and pandas packages:
import pandas as pd
import numpy as np
# dataframe to hold the list of values
df = pd.DataFrame({'FrSeg': [33, 66, 26, 5, 16, 31, 34, 10, 17, 40]})
# set the rank ranges
ranges = [0, 10, 20, 68]
# pandas.cut:
# assigns each value of 'FrSeg' to a half-open bin (open on the left, closed on the right)
print(df.groupby(pd.cut(df['FrSeg'], ranges)).count())
output:
          FrSeg
FrSeg
(0, 10]       2
(10, 20]      2
(20, 68]      6
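If the goal is a rank column on each row rather than a count per bin, pd.cut can return the rank directly through its labels argument. A minimal sketch, assuming the raw values live in a column named Frequency (as in the question) and that every value falls inside the ranges:

```python
import pandas as pd

df = pd.DataFrame({"Frequency": [33, 66, 26, 5, 16, 31, 34, 10, 17, 40]})

ranges = [0, 10, 20, 68]  # (0, 10] -> 1, (10, 20] -> 2, (20, 68] -> 3

# labels replaces each interval with the given rank; astype(int) would fail
# if any value fell outside the ranges and became NaN
df["FrSeg"] = pd.cut(df["Frequency"], ranges, labels=[1, 2, 3]).astype(int)
print(df)
```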