How can I replace pd.cut intervals with integers?
import pandas as pd
df = pd.DataFrame()
df['age'] = [43, 76, 27, 8, 57, 32, 12, 22]
age_band = [0,10,20,30,40,50,60,70,80,90]
df['age_bands']= pd.cut(df['age'], bins=age_band, ordered=True)
output:
age age_bands
0 43 (40, 50]
1 76 (70, 80]
2 27 (20, 30]
3 8 (0, 10]
4 57 (50, 60]
5 32 (30, 40]
6 12 (10, 20]
7 22 (20, 30]
Now I want to add another column that replaces each band with a single number (int), but I could not get it to work.
For example, this did not work:
df['age_code']= df['age_bands'].replace({'(40, 50]':4})
How can I get a column that looks like this?
age_bands age_code
0 (40, 50] 4
1 (70, 80] 7
2 (20, 30] 2
3 (0, 10] 0
4 (50, 60] 5
5 (30, 40] 3
6 (10, 20] 1
7 (20, 30] 2
Assuming you want the first digit from every interval, you can use Series.apply to achieve this as follows:
df["age_code"] = df["age_bands"].apply(lambda band: str(band)[1])
However, note this may not be very efficient for a large dataframe.
To convert the column values to int dtype, you can use pd.to_numeric:
df["age_code"] = pd.to_numeric(df['age_code'])
As the column contains pd.Interval objects, you can use the interval's left property:
df['age_code'] = df['age_bands'].apply(lambda interval: interval.left // 10)
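Put together, the left-endpoint approach as a self-contained, runnable sketch:

```python
import pandas as pd

df = pd.DataFrame({'age': [43, 76, 27, 8, 57, 32, 12, 22]})
age_band = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
df['age_bands'] = pd.cut(df['age'], bins=age_band, ordered=True)

# Each entry is a pd.Interval; integer-dividing its left endpoint
# by the bin width (10 here) yields the band index.
df['age_code'] = df['age_bands'].apply(lambda iv: int(iv.left) // 10)
print(df[['age_bands', 'age_code']])
```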
You can do that by simply adding a second pd.cut and defining the labels argument.
import pandas as pd
df = pd.DataFrame()
df['age'] = [43, 76, 27, 8, 57, 32, 12, 22]
age_band = [0,10,20,30,40,50,60,70,80,90]
df['age_bands']= pd.cut(df['age'], bins=age_band, ordered=True)
#This is the part of code you need to add
age_labels = [0, 1, 2, 3, 4, 5, 6, 7, 8]
df['age_code']= pd.cut(df['age'], bins=age_band, labels=age_labels, ordered=True)
>>> print(df)
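As an aside (not part of the answer above, but standard pd.cut behavior): passing labels=False skips the intermediate interval column entirely and returns the zero-based bin index directly.

```python
import pandas as pd

df = pd.DataFrame({'age': [43, 76, 27, 8, 57, 32, 12, 22]})
age_band = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

# labels=False makes pd.cut return the integer bin index for each
# row instead of an Interval.
df['age_code'] = pd.cut(df['age'], bins=age_band, labels=False)
print(df)
```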
You can create a dictionary of bins and map it to the age_bands column:
bins_sorted = sorted(pd.cut(df['age'], bins=age_band, ordered=True).unique())
bins_dict = {key: idx for idx, key in enumerate(bins_sorted)}
df['age_code'] = df.age_bands.map(bins_dict).astype(int)
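A related shortcut (a sketch, not part of the answer above): since pd.cut returns an ordered categorical, its .cat.codes accessor already gives the position of each interval among the sorted categories, which is exactly the mapping the enumerate dict builds.

```python
import pandas as pd

df = pd.DataFrame({'age': [43, 76, 27, 8, 57, 32, 12, 22]})
age_band = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
df['age_bands'] = pd.cut(df['age'], bins=age_band, ordered=True)

# .cat.codes numbers the categories in sorted order, starting at 0.
df['age_code'] = df['age_bands'].cat.codes
print(df[['age_bands', 'age_code']])
```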
Related
How do I retrieve the Pandas dataframe rows whose column values are consecutively equal to the values of a list?
Example, given this:
import pandas as pd
df = pd.DataFrame({'col1': [10, 20, 30, 40, 50, 88, 99, 30, 40, 50]})
lst = [30, 40, 50]
I want to extract the dataframe rows from 30 to 50, but only the first sequence of consecutive values (just the rows at indices 2 to 4).
This should do the trick:
df = pd.DataFrame({'col1': [10, 20, 30, 40, 50, 88, 99, 30, 40, 50]})
lst = [30, 40, 50]
ans = []
for i, num in enumerate(df['col1']):
    if num in lst:
        lst.remove(num)
        ans.append(i)
print(ans)
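Note that the loop above matches values wherever they appear, not only within one consecutive run. A plain window scan that stops at the first exact consecutive match might look like this (a sketch, not taken from the answers):

```python
import pandas as pd

df = pd.DataFrame({'col1': [10, 20, 30, 40, 50, 88, 99, 30, 40, 50]})
lst = [30, 40, 50]

# Slide a window of len(lst) over the column and stop at the first
# exact match, so only the first consecutive run is returned.
vals = df['col1'].tolist()
out = None
for i in range(len(vals) - len(lst) + 1):
    if vals[i:i + len(lst)] == lst:
        out = df.iloc[i:i + len(lst)]
        break
print(out)
```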
You can use a rolling comparison:
s = df['col1'][::-1].rolling(len(lst)).apply(lambda x: x.eq(lst[::-1]).all())[::-1].eq(1)
if s.any():
    idx = s.idxmax()
    out = df.iloc[idx:idx + len(lst)]
    print(out)
else:
    print('Not found')
output:
col1
2 30
3 40
4 50
Try:
lst = [30, 40, 50]
if any(lst == (found := s).to_list() for s in df["col1"].rolling(len(lst))):
    print(df.loc[found.index])
Prints:
col1
2 30
3 40
4 50
I have a pandas data frame with both unordered and ordered categorical columns (as well as columns with other data types). I want to select only the ordered categorical columns.
Here's an example dataset:
import pandas as pd
import numpy.random as npr
n_obs = 20
eye_colors = ["blue", "brown"]
people = pd.DataFrame({
    "eye_color": npr.choice(eye_colors, size=n_obs),
    "age": npr.randint(20, 60, size=n_obs)
})
people["age_group"] = pd.cut(people["age"], [20, 30, 40, 50, 60], right=False)
people["eye_color"] = pd.Categorical(people["eye_color"], eye_colors)
Here, eye_color is an unordered categorical column, age_group is an ordered categorical column, and age is numeric. I want just the age_group column.
I can select all categorical columns with .select_dtypes().
categories = people.select_dtypes("category")
I could use a list comprehension with the .cat.ordered property to limit this to only the ordered categories.
categories[[col for col in categories.columns if categories[col].cat.ordered]]
This is dreadfully complicated code, so it feels like there must be a better way.
What's the idiomatic way of selecting only ordered columns from a dataframe?
You can iterate directly over the dtypes and build a boolean mask, which avoids unnecessarily copying the underlying data until you are ready to subset:
>>> categorical_ordered = [isinstance(d, pd.CategoricalDtype) and d.ordered for d in people.dtypes]
>>> people.loc[:, categorical_ordered].head()
age_group
0 [30, 40)
1 [20, 30)
2 [50, 60)
3 [30, 40)
4 [20, 30)
You can also use is_categorical_dtype as recommended by @richardec in the comments, or simply compare against the string representation of the dtype.
>>> from pandas.api.types import is_categorical_dtype
>>> [isinstance(d, pd.CategoricalDtype) and d.ordered for d in people.dtypes]
[False, False, True]
>>> [is_categorical_dtype(d) and d.ordered for d in people.dtypes]
[False, False, True]
>>> [d == 'category' and d.ordered for d in people.dtypes]
[False, False, True]
You can also abstract away the for-loop by using .apply:
>>> people.dtypes.apply(lambda d: d == 'category' and d.ordered)
eye_color False
age False
age_group True
dtype: bool
>>> people.loc[:, people.dtypes.apply(lambda d: d == 'category' and d.ordered)]
age_group
0 [20, 30)
1 [40, 50)
2 [20, 30)
3 [40, 50)
...
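For reference, here is the dtype-mask approach end to end, as a self-contained sketch combining the snippets above:

```python
import pandas as pd
import numpy.random as npr

n_obs = 20
eye_colors = ["blue", "brown"]
people = pd.DataFrame({
    "eye_color": npr.choice(eye_colors, size=n_obs),
    "age": npr.randint(20, 60, size=n_obs),
})
people["age_group"] = pd.cut(people["age"], [20, 30, 40, 50, 60], right=False)
people["eye_color"] = pd.Categorical(people["eye_color"], eye_colors)

# Boolean mask over the dtypes: True only for ordered categoricals.
mask = people.dtypes.apply(
    lambda d: isinstance(d, pd.CategoricalDtype) and d.ordered
)
ordered_only = people.loc[:, mask]
print(ordered_only.columns.tolist())
```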
One option is with getattr; I'd pick a list comprehension over this though:
people.loc[:, people.apply(getattr, args=('cat',None))
.apply(getattr, args=('ordered', False))]
age_group
0 [40, 50)
1 [50, 60)
2 [30, 40)
3 [40, 50)
4 [30, 40)
5 [40, 50)
6 [40, 50)
7 [20, 30)
8 [20, 30)
9 [20, 30)
10 [40, 50)
11 [20, 30)
12 [50, 60)
13 [40, 50)
14 [40, 50)
15 [20, 30)
16 [50, 60)
17 [30, 40)
18 [50, 60)
19 [40, 50)
I have the following question, which I want to solve with the numpy library.
Let's suppose we have this array a:
a = np.vstack(([10, 10, 20, 20, 30, 10, 40, 50, 20] ,[10, 20, 10, 20, 30, 10, 40, 50, 20]))
As output we have
[[10 10 20 20 30 10 40 50 20]
[10 20 10 20 30 10 40 50 20]]
with the shape (2, 9)
I want to delete the elements that are repeated vertically (duplicate columns) in the array, so that I get as a result:
[[10 10 20 20 30 40 50]
[10 20 10 20 30 40 50]]
In this example I want to delete the elements ((0, 5), (1, 5)) and ((0, 8), (1, 8)). Is there any numpy function that can do the job?
Thanks
This is easily done with:
np.unique(a, axis=1)
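A runnable check of this (one caveat worth knowing: np.unique(a, axis=1) also sorts the unique columns lexicographically, which happens to match the desired order in this example):

```python
import numpy as np

a = np.vstack(([10, 10, 20, 20, 30, 10, 40, 50, 20],
               [10, 20, 10, 20, 30, 10, 40, 50, 20]))

# axis=1 treats each column as one unit and drops duplicate columns;
# the result is sorted lexicographically by column.
result = np.unique(a, axis=1)
print(result)
```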
Following the idea of this answer, you could do the following.
np.hstack({tuple(row) for row in a.T}).T
I'm pretty new to programming and Python. I was asked to count pairs of socks from a given list of numbers.
The problem statement was: "There is a large pile of socks that must be paired by color. Given an array of integers representing the color of each sock, determine how many pairs of socks with matching colors there are."
Sample Input
STDIN Function
----- --------
9 n = 9
10 20 20 10 10 30 50 10 20 ar = [10, 20, 20, 10, 10, 30, 50, 10, 20]
Sample Output
3
So my logic was pretty simple: iterate through the list, take a number, and compare it with the others. If two of the same number are found, count them as a pair and remove them from the list. Then repeat until none are left.
# Complete the sockMerchant function below.
def sockMerchant(n, ar):
    print(ar)
    l = []
    result = 0
    for i in ar:
        a = i
        c = 0
        print("a", a)  # line for checking
        ar.remove(i)
        l = ar
        print("ar", ar)  # line for checking
        print("l", l)  # line for checking
        for j in l:
            f = l.index(j)
            print("index", f)  # line for checking
            print("j", j)  # line for checking
            if j == a:
                c = c + 1
                print("c", c)  # line for checking
                ar.remove(j)
                print("ar2", ar)  # line for checking
    result = c / 2
    print("c2", c)  # line for checking
    return result

n = 9
ar = [10, 20, 20, 10, 10, 30, 50, 10, 20]
sockMerchant(n, ar)
Please ignore the lines marked with a "line for checking" comment; they are just there to see the control flow. Here is my output:
[10, 20, 20, 10, 10, 30, 50, 10, 20]
a 10
ar [20, 20, 10, 10, 30, 50, 10, 20]
l [20, 20, 10, 10, 30, 50, 10, 20]
index 0
j 20
index 0
j 20
index 2
j 10
c 1
ar2 [20, 20, 10, 30, 50, 10, 20]
index 3
j 30
index 4
j 50
index 2
j 10
c 2
ar2 [20, 20, 30, 50, 10, 20]
a 20
ar [20, 30, 50, 10, 20]
l [20, 30, 50, 10, 20]
index 0
j 20
c 1
ar2 [30, 50, 10, 20]
index 1
j 50
index 2
j 10
index 3
j 20
c 2
ar2 [30, 50, 10]
a 10
ar [30, 50]
l [30, 50]
index 0
j 30
index 1
j 50
c2 0
Python is full of wonderful utilities that can be helpful. Counter from collections can be used to count how many socks of each color you have, and then you just divide by 2 to get the number of pairs.
from collections import Counter
from typing import List
def sock_merchant(socks: List[int]) -> int:
    counter = Counter(socks)
    return sum(count // 2 for count in counter.values())
Initializing counter with an array will give you something that looks like
Counter({10: 4, 20: 3, 30: 1, 50: 1})
which is the value from the array (i.e color of the sock) and the number of times it occurs in the array.
As with normal dicts, counters also have a .values() method that gives you only the counts; since we want the number of pairs, we sum the counts after integer-dividing each by 2.
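Put together as a self-contained sketch with the sample input:

```python
from collections import Counter

def sock_merchant(socks):
    # Count occurrences of each color, then floor-divide by 2
    # to get the number of complete pairs per color.
    counter = Counter(socks)
    return sum(count // 2 for count in counter.values())

ar = [10, 20, 20, 10, 10, 30, 50, 10, 20]
print(sock_merchant(ar))  # 3
```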
I have a dataframe with multiple columns, to which I added a new column for age intervals.
# Create Age Intervals
bins = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
df['age_intervals'] = pd.cut(df['age'],bins)
Now I have another column named no_show that states whether a person shows up for the appointment, using the values 0 or 1. With the code below, I'm able to group the data by age_intervals.
df[['no_show','age_intervals']].groupby('age_intervals').count()
Output:
age_intervals no_show
(0, 5] 8192
(5, 10] 7017
(10, 15] 5719
(15, 20] 7379
(20, 25] 6750
But how can I further group the no_show data by its values 0 and 1? For example, in the age interval (0, 5], out of 8192 rows, 3291 have no_show 0 and 4901 have no_show 1, and so on.
An easy way would be to group on both columns and use size() which returns a Series:
df.groupby(['age_intervals', 'no_show']).size()
This will return a Series whose counts are split by both the age_intervals column and the no_show column.
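If you also want the 0 and 1 counts side by side as columns, unstacking the grouped result works. Here is a sketch; the data is made-up stand-in data (the column names age and no_show mirror the question, but the counts are synthetic):

```python
import pandas as pd
import numpy as np

# Hypothetical example data standing in for the real dataframe.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'age': rng.integers(1, 100, size=1000),
    'no_show': rng.integers(0, 2, size=1000),
})
bins = list(range(0, 105, 5))
df['age_intervals'] = pd.cut(df['age'], bins)

# size() counts rows per (interval, no_show) pair; unstack() pivots
# the no_show level into separate columns 0 and 1.
counts = df.groupby(['age_intervals', 'no_show']).size().unstack(fill_value=0)
print(counts.head())
```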