How to select ordered categorical columns from a pandas dataframe? - python

I have a pandas data frame with both unordered and ordered categorical columns (as well as columns with other data types). I want to select only the ordered categorical columns.
Here's an example dataset:
import pandas as pd
import numpy.random as npr
n_obs = 20
eye_colors = ["blue", "brown"]
people = pd.DataFrame({
    "eye_color": npr.choice(eye_colors, size=n_obs),
    "age": npr.randint(20, 60, size=n_obs)
})
people["age_group"] = pd.cut(people["age"], [20, 30, 40, 50, 60], right=False)
people["eye_color"] = pd.Categorical(people["eye_color"], eye_colors)
Here, eye_color is an unordered categorical column, age_group is an ordered categorical column, and age is numeric. I want just the age_group column.
I can select all categorical columns with .select_dtypes().
categories = people.select_dtypes("category")
I could use a list comprehension with the .cat.ordered property to then limit this to only ordered categories.
categories[[col for col in categories.columns if categories[col].cat.ordered]]
This is dreadfully complicated code, so it feels like there must be a better way.
What's the idiomatic way of selecting only ordered columns from a dataframe?

You can iterate directly over the dtypes and return a boolean mask to avoid having to unnecessarily copy the underlying data until you are ready to subset:
>>> categorical_ordered = [isinstance(d, pd.CategoricalDtype) and d.ordered for d in people.dtypes]
>>> people.loc[:, categorical_ordered].head()
age_group
0 [30, 40)
1 [20, 30)
2 [50, 60)
3 [30, 40)
4 [20, 30)
You can also use is_categorical_dtype, as recommended by @richardec in the comments (note that newer pandas releases deprecate this function in favor of the isinstance check), or simply compare the dtype with the string 'category'.
>>> from pandas.api.types import is_categorical_dtype
>>> [isinstance(d, pd.CategoricalDtype) and d.ordered for d in people.dtypes]
[False, False, True]
>>> [is_categorical_dtype(d) and d.ordered for d in people.dtypes]
[False, False, True]
>>> [d == 'category' and d.ordered for d in people.dtypes]
[False, False, True]
You can also abstract away the for-loop by using .apply on the dtypes:
>>> people.dtypes.apply(lambda d: d == 'category' and d.ordered)
eye_color False
age False
age_group True
dtype: bool
>>> people.loc[:, people.dtypes.apply(lambda d: d == 'category' and d.ordered)]
age_group
0 [20, 30)
1 [40, 50)
2 [20, 30)
3 [40, 50)
...
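If this selection is needed in more than one place, the dtype check can be wrapped in a small helper. This is only a minimal sketch: the function name select_ordered_categoricals is hypothetical (it is not part of pandas), and the sample frame is a small fixed stand-in for the randomly generated one in the question.

```python
import pandas as pd


def select_ordered_categoricals(df: pd.DataFrame) -> pd.DataFrame:
    """Return only the columns whose dtype is an ordered categorical.

    Hypothetical helper name; it just wraps the dtype check shown above.
    """
    mask = [
        isinstance(dtype, pd.CategoricalDtype) and dtype.ordered
        for dtype in df.dtypes
    ]
    return df.loc[:, mask]


# Small fixed dataset (made up for illustration).
people = pd.DataFrame({"eye_color": ["blue", "brown", "brown"], "age": [25, 41, 58]})
people["eye_color"] = pd.Categorical(people["eye_color"], ["blue", "brown"])
people["age_group"] = pd.cut(people["age"], [20, 30, 40, 50, 60], right=False)

ordered_only = select_ordered_categoricals(people)
print(list(ordered_only.columns))
```

Because the mask is built from the dtypes alone, no column data is copied until the final .loc call.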

One option is with getattr; I'd pick a list comprehension over this though:
people.loc[:, people.apply(getattr, args=('cat', None))
                    .apply(getattr, args=('ordered', False))]
age_group
0 [40, 50)
1 [50, 60)
2 [30, 40)
3 [40, 50)
4 [30, 40)
5 [40, 50)
6 [40, 50)
7 [20, 30)
8 [20, 30)
9 [20, 30)
10 [40, 50)
11 [20, 30)
12 [50, 60)
13 [40, 50)
14 [40, 50)
15 [20, 30)
16 [50, 60)
17 [30, 40)
18 [50, 60)
19 [40, 50)

Related

for loop append in a list, but the input is a data frame

I have a bit of Python code below, just an example to show the problem. I would like to select some rows in a data frame based on some values. This needs to happen in a for loop, and I used .append() to add each selection of rows to a final result, but the output is not what I expected. I learned from reading quite a few posts that we should not append data frames in a loop, so I don't know how to do this now. Could somebody help, please? Thanks a lot!
import pandas as pd
df = pd.DataFrame({'a': [4, 5, 6, 7], 'b': [10, 20, 30, 40], 'c': [100, 50, -30, -50]})
df['diff'] = (df['b'] - df['c']).abs()
print(df)
df1 = df[df['diff'] == 90]
df2 = df[df['diff'] == 60]
list = [df1, df2]
def try_1(list):
    output = []
    for item in list:
        output.append(item)
    return output
print(try_1(list))
output from the code
a b c diff
0 4 10 100 90
1 5 20 50 30
2 6 30 -30 60
3 7 40 -50 90
[ a b c diff
0 4 10 100 90
3 7 40 -50 90, a b c diff
2 6 30 -30 60]
but the expected output of print(try_1(list))
a b c diff
4 10 100 90
7 40 -50 90
6 30 -30 60
Also, I need to write the final result to a file. I tried .write(), but it complained that the argument is not a string. How could I solve this, please? Thanks!
Your code just recreates the same list you had before. You can use pd.concat instead to combine the frames into one. To write the result to a file with .write(), you have to convert it to a str first:
import pandas as pd
df = pd.DataFrame({'a': [4, 5, 6, 7], 'b': [10, 20, 30, 40], 'c': [100, 50, -30, -50]})
df['diff'] = (df['b'] - df['c']).abs()
# print(df)
df1 = df[df['diff'] == 90]
df2 = df[df['diff'] == 60]
my_list = [df1, df2]
all_frames = pd.concat(my_list)
with open("file", "w") as f:
    f.write(str(all_frames))
If you need to append in a for loop and occasionally write, you could do it like this:
import pandas as pd
df = pd.DataFrame({'a': [4, 5, 6, 7], 'b': [10, 20, 30, 40], 'c': [100, 50, -30, -50]})
df['diff'] = (df['b'] - df['c']).abs()
# print(df)
df1 = df[df['diff'] == 90]
df2 = df[df['diff'] == 60]
my_list = [df1, df2]
for i in range(20):
    my_list.append(df2)
    if i % 5 == 0:  # whenever we want to write
        all_frames = pd.concat(my_list)
        my_list = [all_frames]
        with open("file", "w") as f:
            f.write(str(all_frames))
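If the file is meant to be read back by a program rather than by a person, DataFrame.to_csv may be a better fit than str(). The following is a minimal sketch under that assumption; the in-memory buffer only stands in for a real file, and the name selections is made up for this example.

```python
import io

import pandas as pd

df = pd.DataFrame({'a': [4, 5, 6, 7], 'b': [10, 20, 30, 40], 'c': [100, 50, -30, -50]})
df['diff'] = (df['b'] - df['c']).abs()

# Collect the selections in a plain list and concatenate once at the end.
selections = [df[df['diff'] == target] for target in (90, 60)]
all_frames = pd.concat(selections)

# An in-memory buffer stands in for a real file here; passing a path such as
# "output.csv" (hypothetical name) to to_csv would write to disk instead.
buffer = io.StringIO()
all_frames.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

With index=False the row labels are dropped, which matches the expected output in the question. For a human-readable table without the index, all_frames.to_string(index=False) is another option.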

How can I replace pd intervals with integers in python

How can I replace pd intervals with integers
import pandas as pd
df = pd.DataFrame()
df['age'] = [43, 76, 27, 8, 57, 32, 12, 22]
age_band = [0,10,20,30,40,50,60,70,80,90]
df['age_bands']= pd.cut(df['age'], bins=age_band, ordered=True)
output:
age age_bands
0 43 (40, 50]
1 76 (70, 80]
2 27 (20, 30]
3 8 (0, 10]
4 57 (50, 60]
5 32 (30, 40]
6 12 (10, 20]
7 22 (20, 30]
Now I want to add another column that replaces each band with a single number (int), but I could not get it to work.
For example, this did not work:
df['age_code']= df['age_bands'].replace({'(40, 50]':4})
How can I get a column that looks like this?
age_bands age_code
0 (40, 50] 4
1 (70, 80] 7
2 (20, 30] 2
3 (0, 10] 0
4 (50, 60] 5
5 (30, 40] 3
6 (10, 20] 1
7 (20, 30] 2
Assuming you want the first digit of every interval, you can use Series.apply as follows:
df["age_code"] = df["age_bands"].apply(lambda band: str(band)[1])
However, note that this may not be very efficient for a large dataframe.
To convert the column values to an int datatype, you can use pd.to_numeric:
df["age_code"] = pd.to_numeric(df['age_code'])
As the column contains pd.Interval objects, you can use their left property:
df['age_code'] = df['age_bands'].apply(lambda interval: interval.left // 10)
You can do that by simply adding a second pd.cut call and passing the labels argument.
import pandas as pd
df = pd.DataFrame()
df['age'] = [43, 76, 27, 8, 57, 32, 12, 22]
age_band = [0,10,20,30,40,50,60,70,80,90]
df['age_bands']= pd.cut(df['age'], bins=age_band, ordered=True)
#This is the part of code you need to add
age_labels = [0, 1, 2, 3, 4, 5, 6, 7, 8]
df['age_code']= pd.cut(df['age'], bins=age_band, labels=age_labels, ordered=True)
>>> print(df)
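You can create a dictionary of bins and map it to the age_bands column. Note that this numbers only the bins that actually occur in the data, so an empty bin shifts the codes of every later bin: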
bins_sorted = sorted(pd.cut(df['age'], bins=age_band, ordered=True).unique())
bins_dict = {key: idx for idx, key in enumerate(bins_sorted)}
df['age_code'] = df.age_bands.map(bins_dict).astype(int)
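Since pd.cut returns a categorical column, another option is to read the integer position of each interval directly from the categorical codes. This is a minimal sketch, assuming the desired code is simply the position of the bin in age_band (which matches the expected output in the question):

```python
import pandas as pd

df = pd.DataFrame({'age': [43, 76, 27, 8, 57, 32, 12, 22]})
age_band = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
df['age_bands'] = pd.cut(df['age'], bins=age_band, ordered=True)

# .cat.codes gives each row the position of its interval in the full,
# ordered list of categories (missing values would get -1).
df['age_code'] = df['age_bands'].cat.codes.astype(int)
print(df)
```

Because the codes refer to all categories, including bins with no observations, an empty bin does not shift the numbering.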

Retrieve Pandas dataframe rows that its column (one) values are consecutively equal to the values of a list

How do I retrieve Pandas dataframe rows that its column (one) values are consecutively equal to the values of a list?
Example, given this:
import pandas as pd
df = pd.DataFrame({'col1': [10, 20, 30, 40, 50, 88, 99, 30, 40, 50]})
lst = [30, 40, 50]
I want to extract the dataframe rows from 30 to 50, but just the first sequence of consecutive values (just the 2 to 4 index rows).
This should do the trick:
df = pd.DataFrame({'col1': [10, 20, 30, 40, 50, 88, 99, 30, 40, 50]})
lst = [30, 40, 50]
ans=[]
for i,num in enumerate(df['col1']):
    if num in lst:
        lst.remove(num)
        ans.append(i)
print(ans)
You can use a rolling comparison:
s = df['col1'][::-1].rolling(len(lst)).apply(lambda x: x.eq(lst[::-1]).all())[::-1].eq(1)
if s.any():
    idx = s.idxmax()
    out = df.iloc[idx:idx+len(lst)]
    print(out)
else:
    print('Not found')
output:
col1
2 30
3 40
4 50
Try:
lst = [30, 40, 50]
if any(lst == (found := s).to_list() for s in df["col1"].rolling(len(lst))):
    print(df.loc[found.index])
Prints:
col1
2 30
3 40
4 50
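The same window-by-window comparison can also be sketched with NumPy instead of rolling. This is an alternative technique, not one of the answers above; it assumes NumPy 1.20 or later (for sliding_window_view) and a column with at least len(lst) rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [10, 20, 30, 40, 50, 88, 99, 30, 40, 50]})
lst = [30, 40, 50]

# Each row of `windows` is one run of len(lst) consecutive values.
windows = np.lib.stride_tricks.sliding_window_view(df['col1'].to_numpy(), len(lst))
matches = (windows == np.array(lst)).all(axis=1)

if matches.any():
    start = int(matches.argmax())  # position of the first matching window
    out = df.iloc[start:start + len(lst)]
else:
    out = df.iloc[0:0]  # empty frame when the sequence never occurs
print(out)
```

argmax returns the first True in the mask, so only the first occurrence of the sequence is selected.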

check if the element of a row is in another row

I have a dataframe with 2 columns, one containing a number and the other containing a list. For each row, I want to check whether the number is in the list.
Testing data:
preds = [[40, 50, 21], [40, 50, 25], [40, 50, 21]]
target = [40, 50, 40]
df_testing = pd.DataFrame(list(zip(preds, target)),
                          columns =['preds', 'target'])
So, for example, in the first row I want to check if 40 is in [40, 50, 21], in the second row if 50 is in [40, 50, 25], etc.
Desired result: a Series of True/False values with the index of the rows.
Use the in operator:
preds = [[44, 55, 21], [40, 50, 25], [40, 50, 21]]
target = [40, 50, 40]
df_testing = pd.DataFrame(list(zip(preds, target)),
                          columns =['preds', 'target'])
s = df_testing.apply(lambda x: x['target'] in x['preds'], axis=1)
print (s)
0 False
1 True
2 True
dtype: bool
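For larger frames, a plain list comprehension over the two columns avoids the per-row overhead of apply with axis=1. A minimal sketch of that variant, using the same sample data as the answer above:

```python
import pandas as pd

preds = [[44, 55, 21], [40, 50, 25], [40, 50, 21]]
target = [40, 50, 40]
df_testing = pd.DataFrame(list(zip(preds, target)),
                          columns=['preds', 'target'])

# Pair each list with its target and test membership row by row;
# reusing the original index keeps the result aligned with the frame.
s = pd.Series(
    [t in p for p, t in zip(df_testing['preds'], df_testing['target'])],
    index=df_testing.index,
)
print(s)
```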

Python find pairs of socks from a list

I'm pretty new to programming and Python. I was asked to count the pairs of socks in a given list of numbers.
My question was - "There is a large pile of socks that must be paired by color. Given an array of integers representing the color of each sock, determine how many pairs of socks with matching colors there are."
Sample Input
STDIN Function
----- --------
9 n = 9
10 20 20 10 10 30 50 10 20 ar = [10, 20, 20, 10, 10, 30, 50, 10, 20]
Sample Output
3
So my logic was pretty simple: iterate through the list, take a number, and compare it with the others. If two equal numbers are found, count them as a pair and remove them from the list. Then do the same until none are left.
# Complete the sockMerchant function below.
def sockMerchant(n, ar):
    print(ar)
    l=[]
    result=0
    for i in ar:
        a=i
        c=0
        print("a",a)#line for checking
        ar.remove(i)
        l=ar
        print("ar",ar)#line for checking
        print("l", l)#line for checking
        for j in l:
            f=l.index(j)
            print("index", f)#line for checking
            print("j",j)#line for checking
            if j == a:
                c=c+1
                print("c",c)#line for checking
                ar.remove(j)
                print("ar2",ar)#line for checking
    result=c/2
    print("c2",c)#line for checking
    return result
n=9
ar=[10, 20, 20, 10, 10, 30, 50, 10, 20]
sockMerchant(n, ar)
Please ignore the lines of code marked with comments; they are just there to show the control flow. Here is my output:
[10, 20, 20, 10, 10, 30, 50, 10, 20]
a 10
ar [20, 20, 10, 10, 30, 50, 10, 20]
l [20, 20, 10, 10, 30, 50, 10, 20]
index 0
j 20
index 0
j 20
index 2
j 10
c 1
ar2 [20, 20, 10, 30, 50, 10, 20]
index 3
j 30
index 4
j 50
index 2
j 10
c 2
ar2 [20, 20, 30, 50, 10, 20]
a 20
ar [20, 30, 50, 10, 20]
l [20, 30, 50, 10, 20]
index 0
j 20
c 1
ar2 [30, 50, 10, 20]
index 1
j 50
index 2
j 10
index 3
j 20
c 2
ar2 [30, 50, 10]
a 10
ar [30, 50]
l [30, 50]
index 0
j 30
index 1
j 50
c2 0
Python is full of wonderful utilities that can be helpful. A Counter from collections can count how many socks of each color you have; you then divide each count by 2 (integer division) to get the number of pairs.
from collections import Counter
from typing import List
def sock_merchant(socks: List[int]) -> int:
    counter = Counter(socks)
    return sum(count // 2 for count in counter.values())
Initializing counter with an array will give you something that looks like
Counter({10: 4, 20: 3, 30: 1, 50: 1})
which is the value from the array (i.e color of the sock) and the number of times it occurs in the array.
Like normal dicts, counters also have a .values() method that gives you only the values. Since we want the number of pairs, we take the sum of the values after doing integer division by 2 on each of them.
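As a quick check, the approach can be run against the sample input from the question. This is a self-contained sketch of the Counter idea described above:

```python
from collections import Counter
from typing import List


def sock_merchant(socks: List[int]) -> int:
    """Count matching pairs: each color contributes count // 2 pairs."""
    counter = Counter(socks)
    return sum(count // 2 for count in counter.values())


ar = [10, 20, 20, 10, 10, 30, 50, 10, 20]
print(Counter(ar))
print(sock_merchant(ar))
```

Color 10 occurs four times (2 pairs) and color 20 three times (1 pair), which gives the expected 3 pairs. Unlike the loop in the question, this never modifies the list while iterating over it.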
