Select only columns that have at most N unique values - python

I want to count the number of unique values in each column and select only those columns which have less than 32 unique values.
I tried using
df.filter(nunique<32)
and
df[[ c for df.columns in df if c in c.nunique<32]]
but because nunique is a method and not a function, neither works. I thought len(set()) would work and tried
df.apply(lambda x: len(set(x)))
but that doesn't work either. Any ideas? Thanks in advance!

nunique can be called on the entire DataFrame (note that it is a method, so you have to call it). You can then filter out columns using loc:
df.loc[:, df.nunique() < 32]
Minimal Verifiable Example
df = pd.DataFrame({'A': list('abbcde'), 'B': list('ababab')})
df
A B
0 a a
1 b b
2 b a
3 c b
4 d a
5 e b
df.nunique()
A 5
B 2
dtype: int64
df.loc[:, df.nunique() < 3]
B
0 a
1 b
2 a
3 b
4 a
5 b

If anyone wants to do it in a method chaining fashion, you can:
df.loc[:, lambda x: x.nunique() < 3]
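The lambda is handy when the frame being filtered is itself an intermediate result of a chain, so there is no variable to refer to yet. A minimal sketch, assuming a hypothetical input file and preceding steps:
result = (
    pd.read_csv('data.csv')                   # hypothetical input
      .dropna(axis=1, how='all')              # some earlier step in the chain
      .loc[:, lambda x: x.nunique() < 32]     # x is the frame at this point in the chain
)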

How to replace data in one pandas df by the data of another one?

I want to replace some rows of some columns in a bigger pandas df with data from a smaller pandas df. The column names are the same in both.
I tried using combine_first but it only updates the null values.
For example, let's say df1.shape is (100, 25) and df2.shape is (10, 5):
df1
A B C D E F G ...X Y Z
1 abc 10.20 0 pd.NaT
df2
A B C D E
1 abc 15.20 1 10
Now, after replacing, df1 should look like:
A B C D E F G ...X Y Z
1 abc 15.20 1 10 ...
The condition for replacing values in df1 is that df1.A == df2.A and df1.B == df2.B.
How can it be achieved in the most Pythonic way? Any help will be appreciated.
I'm not sure I fully understood your question, but does this solve your problem?
df1 = pd.DataFrame(data={'A':[1],'B':[2],'C':[3],'D':[4]})
df2 = pd.DataFrame(data={'A':[1],'B':[2],'C':[5],'D':[6]})
new_df=pd.concat([df1,df2]).drop_duplicates(['A','B'],keep='last')
print(new_df)
output:
A B C D
0 1 2 5 6
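One caveat for the shapes in the question (df2 has fewer columns than df1): the surviving row comes entirely from df2, so columns that exist only in df1 end up as NaN after the concat. A rough sketch of filling those gaps back in from df1, assuming A and B uniquely identify a row:
df1 = pd.DataFrame({'A': [1], 'B': [2], 'C': [3], 'D': [4]})
df2 = pd.DataFrame({'A': [1], 'B': [2], 'C': [5]})            # no column D
new_df = pd.concat([df1, df2]).drop_duplicates(['A', 'B'], keep='last')
# D is NaN here because the kept row came from df2; pull it back from df1
new_df = (new_df.set_index(['A', 'B'])
                .combine_first(df1.set_index(['A', 'B']))
                .reset_index())
print(new_df)   # A=1, B=2, C=5, D=4 (the missing D comes back from df1)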
You could play with a MultiIndex.
First let us create the dataframes you are working with:
from string import ascii_uppercase
import numpy as np
import pandas as pd
cols = pd.Index(list(ascii_uppercase))
vals = np.arange(100*len(cols)).reshape(100, len(cols))
df = pd.DataFrame(vals, columns=cols)
df1 = pd.DataFrame(vals[:10, :5], columns=cols[:5])
Then turn A and B into indices:
df = df.set_index(["A","B"])
df1 = df1.set_index(["A","B"])*1.5 # multiply just to make the other values different
df.loc[df1.index, df1.columns] = df1
df = df.reset_index()
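A quick sanity check on the frames built above: the first ten rows of C, D and E should now hold the values from df1 (1.5x the originals), while everything else is untouched. For example:
print(df.loc[:1, list('ABCDE')])
# row 0: A=0,  B=1,  C=3.0,  D=4.5,  E=6.0    (2, 3, 4 scaled by 1.5)
# row 1: A=26, B=27, C=42.0, D=43.5, E=45.0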

Groupby and select the first, second, and fourth member of each group?

Related: pandas dataframe groupby and get nth row
I can use the groupby method and select the first N number of group members with:
df.groupby('columnA').head(N)
But what if I want the first, second, and fourth members of each group?
GroupBy.nth takes a list, so you could just do
df = pd.DataFrame({'A': list('aaaabbbb'), 'B': list('abcdefgh')})
df.groupby('A').nth([0, 1, 3])
B
A
a a
a b
a d
b e
b f
b h
# To get the grouper as a column, use as_index=False
df.groupby('A', as_index=False).nth([0, 1, 3])
A B
0 a a
1 a b
3 a d
4 b e
5 b f
7 b h
You can do
df.groupby('columnA').apply(lambda x: x.iloc[[0, 1, 3], :]).reset_index(level=0, drop=True)
Or take the first four rows of each group and then drop the third one from each group:
df1 = df.groupby('columnA').head(4)
df1.drop(df1[df1.groupby('columnA').cumcount() == 2].index)
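For reference, applying the first snippet to the small frame from the accepted answer gives the same rows as nth([0, 1, 3]). Note that iloc[[0, 1, 3]] raises an IndexError if a group has fewer than four rows, so this assumes every group is large enough:
df = pd.DataFrame({'A': list('aaaabbbb'), 'B': list('abcdefgh')})
df.groupby('A').apply(lambda x: x.iloc[[0, 1, 3], :]).reset_index(level=0, drop=True)
# keeps rows 0, 1, 3 of group 'a' and rows 4, 5, 7 of group 'b'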

Renaming columns on slice of dataframe not performing as expected

I was trying to clean up column names in a dataframe, but only for part of the columns.
Somehow it doesn't work when I try to replace the column names on a slice of the dataframe. Why is that?
Let's say we have the following dataframe (copy-pasteable code to reproduce the data is at the bottom):
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
I want to clean up the column names (expected output):
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Approach 1:
I can get the clean column names like this:
df.iloc[:, 1:].columns.str[:4]
Index(['ColA', 'ColB', 'ColC'], dtype='object')
Or
Approach 2:
s = df.iloc[:, 1:].columns
[col[:4] for col in s]
['ColA', 'ColB', 'ColC']
But when I try to overwrite the column names, nothing happens:
df.iloc[:, 1:].columns = df.iloc[:, 1:].columns.str[:4]
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Same for the second approach:
s = df.iloc[:, 1:].columns
cols = [col[:4] for col in s]
df.iloc[:, 1:].columns = cols
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
This does work, but you have to manually concat the name of the first column, which is not ideal:
df.columns = ['Value'] + df.iloc[:, 1:].columns.str[:4].tolist()
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Is there an easier way to achieve this? Am I missing something?
Dataframe for reproduction:
df = pd.DataFrame({'Value':[1,2,3,4],
'ColAfjkj':['a', 'b', 'c', 'd'],
'ColBhuqwa':['e', 'f', 'g', 'h'],
'ColCouiqw':['i', 'j', 'k', 'l']})
This is because pandas' index is immutable. If you check the documentation for class pandas.Index, you'll see that it is defined as:
Immutable ndarray implementing an ordered, sliceable set
So in order to modify it you'll have to create a new list of column names, for instance with:
df.columns = [df.columns[0]] + list(df.iloc[:, 1:].columns.str[:4])
Another option is to use rename with a dictionary containing the columns to replace:
df.rename(columns=dict(zip(df.columns[1:], df.columns[1:].str[:4])))
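Note that rename returns a new DataFrame rather than modifying df in place, so assign the result back (or pass inplace=True). A small usage sketch with the frame from the question:
df = df.rename(columns=dict(zip(df.columns[1:], df.columns[1:].str[:4])))
print(df.columns)
# Index(['Value', 'ColA', 'ColB', 'ColC'], dtype='object')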
To overwrite column names you can use the .rename() method:
So it will look like:
df.rename(columns={'ColAfjkj': 'ColA',
                   'ColBhuqwa': 'ColB',
                   'ColCouiqw': 'ColC'},
          inplace=True)
More info regarding rename here in docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html
I had this problem as well and came up with this solution:
First, create a mask of the columns you want to rename
mask = df.iloc[:,1:4].columns
Then, use a list comprehension with a conditional to rename just the columns you want:
df.columns = [x if x not in mask else x[:4] for x in df.columns]

set a multi index in the dataframe constructor using the data-list provided to the constructor

I know that by using set_index I can convert an existing column into a dataframe index, but is there a way to specify, directly in the DataFrame constructor, that one of the data columns should be used as an index (instead of it becoming a regular column)?
Right now I initialize a DataFrame from data records and then use set_index to turn the column into an index.
DataFrame([{'a':1,'b':1,"c":2,'d':1},{'a':1,'b':2,"c":2,'d':2}], index= ['a', 'b'], columns=('c', 'd'))
I want:
     c  d
a b
1 1  2  1
  2  2  2
Instead I get:
   c  d
a  2  1
b  2  2
You can use MultiIndex.from_tuples:
d = [{'a':1,'b':1,"c":2,'d':1},{'a':1,'b':2,"c":2,'d':2}]
print(pd.MultiIndex.from_tuples([(x['a'], x['b']) for x in d], names=('a','b')))
MultiIndex(levels=[[1], [1, 2]],
           labels=[[0, 0], [0, 1]],
           names=['a', 'b'])
df = pd.DataFrame(d,
                  index=pd.MultiIndex.from_tuples([(x['a'], x['b']) for x in d],
                                                  names=('a','b')),
                  columns=('c', 'd'))
print(df)
     c  d
a b
1 1  2  1
  2  2  2
You can just chain a set_index call onto the constructor, without specifying the index and columns params:
In [19]:
df=pd.DataFrame([{'a':1,'b':1,"c":2,'d':1},{'a':1,'b':2,"c":2,'d':2}]).set_index(['a','b'])
df
Out[19]:
c d
a b
1 1 2 1
2 2 2
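If you want the index built inside the constructor call itself, DataFrame.from_records accepts a list of field names for its index parameter and, as far as I know, turns those fields into a MultiIndex and drops them from the columns. A sketch assuming that behaviour:
d = [{'a': 1, 'b': 1, 'c': 2, 'd': 1}, {'a': 1, 'b': 2, 'c': 2, 'd': 2}]
df = pd.DataFrame.from_records(d, index=['a', 'b'])
# c and d remain as columns; a and b form the MultiIndex, as in the output above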

Count size of rolling intersection in pandas

I have a dataframe that consists of group labels ('B') and elements of each group ('A'). The group labels are ordered, and I want to know how many elements of group i show up in group i+1.
An example:
df= pd.DataFrame({ 'A': ['a','b','c','a','c','a','d'], 'B' : [1,1,1,2,2,3,3]})
A B
0 a 1
1 b 1
2 c 1
3 a 2
4 c 2
5 a 3
6 d 3
The desired output would be something like:
B
1 NaN
2 2
3 1
One way to go about this would be to compute the number of distinct elements in the union of group i and group i+1 and then subtract off the number of distinct elements in each group. I've tried:
pd.rolling_apply(grp['A'], lambda x: len(x.unique()),2)
but this produces an error:
AttributeError: 'Series' object has no attribute 'type'
How do I get this to work with rolling_apply or is there a better way to attack this problem?
An approach with using sets and shifting the result:
First grouping the dataframe and then converting column A of each group into a set:
In [86]: grp = df.groupby('B')
In [87]: s = grp.apply(lambda x : set(x['A']))
In [88]: s
Out[88]:
B
1 set([a, c, b])
2 set([a, c])
3 set([a, d])
dtype: object
To calculate the intersection between consecutive sets, make a shifted version (I replace the NaN to an empty set for the next step):
In [89]: s2 = s.shift(1).fillna(set([]))
In [90]: s2
Out[90]:
B
1 set([])
2 set([a, c, b])
3 set([a, c])
dtype: object
Combine both series and calculate the length of the intersection:
In [91]: s.combine(s2, lambda x, y: len(x.intersection(y)))
Out[91]:
B
1 0
2 2
3 1
dtype: object
Another way to do the last step (for sets, & means intersection):
df = pd.concat([s, s2], axis=1)
df.apply(lambda x: len(x[0] & x[1]), axis=1)
The reason the rolling apply does not work is that 1) you provided it a GroupBy object and not a Series, and 2) it only works with numerical values.
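If you want the first group to show NaN (as in the desired output) rather than 0, a small follow-up sketch using the s and s2 series from above:
import numpy as np
result = s.combine(s2, lambda x, y: len(x & y)).astype(float)
result.iloc[0] = np.nan   # the first group has no predecessor to intersect with
print(result)
# B
# 1    NaN
# 2    2.0
# 3    1.0
# dtype: float64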
