How does nunique work with the given table values? - python

yf = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
output:
A B
0 1 1
1 2 1
2 3 1
yf.nunique(axis=0)
output:
A 3
B 1
yf.nunique(axis=1)
output:
0 1
1 2
2 2
Could you please explain how axis=0 and axis=1 work? With axis=0, why are A=2 and B=1 ignored? I wonder whether nunique takes the index into account as well.

You can count the number of unique values per column (axis=0) or per row (axis=1) with DataFrame.nunique.
yf = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
print (yf)
A B
0 1 1
1 2 1
2 3 1
print (yf.nunique(axis=0))
A 3
B 1
dtype: int64
print (yf.nunique(axis=1))
0 1
1 2
2 2
dtype: int64
It means:
A is 3 because column A contains 3 unique values (1, 2, 3)
0 is 1 because row 0 contains 1 unique value (1)
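To make the two directions concrete, here is a small sketch that recomputes both results by hand. The names per_column, per_row, by_hand_columns and by_hand_rows are only illustrative. It suggests that only the cell values are counted, not the index labels.

```python
import pandas as pd

yf = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})

# axis=0: one result per column, counting distinct values down that column
per_column = yf.nunique(axis=0)
# The same count written out by hand for each column
by_hand_columns = pd.Series({col: len(set(yf[col])) for col in yf.columns})

# axis=1: one result per row, counting distinct values across that row
per_row = yf.nunique(axis=1)
# The same count written out by hand for each row
by_hand_rows = pd.Series({idx: len(set(row)) for idx, row in yf.iterrows()})

print(per_column.to_dict())  # {'A': 3, 'B': 1}
print(per_row.to_dict())     # {0: 1, 1: 2, 2: 2}
```

Row 0 holds the values (1, 1), so it has one unique value, while rows 1 and 2 hold (2, 1) and (3, 1), so they have two each.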

Related

Find difference in two different data-frames

I have two data frames: df1 has 26000 rows and df2 has 25000 rows.
I'm trying to find the data points that are in df1 but not in df2, and vice versa.
This is what I wrote (code below), but when I cross-check, the result still contains shared data points:
import pandas as pd
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')
df_join = pd.concat([df1,df2], axis = 1).drop_duplicates(keep = False)
only_df1 = df_join.loc[df_join[df2.columns.to_list()].isnull().all(axis = 1), df1.columns.to_list()]
Order doesn't matter; I just want to know whether a data point exists in one data frame or the other.
With two dfs:
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 1, 1, 1, 1]})
df2 = pd.DataFrame({'a': [2, 3, 4, 5, 6], 'b': [1, 1, 1, 1, 1]})
print(df1)
print(df2)
a b
0 1 1
1 2 1
2 3 1
3 4 1
4 5 1
a b
0 2 1
1 3 1
2 4 1
3 5 1
4 6 1
You could do:
df_differences = df1.merge(df2, how='outer', indicator=True)
print(df_differences)
Result:
a b _merge
0 1 1 left_only
1 2 1 both
2 3 1 both
3 4 1 both
4 5 1 both
5 6 1 right_only
And then:
only_df1 = df_differences[df_differences['_merge'].eq('left_only')].drop(columns=['_merge'])
only_df2 = df_differences[df_differences['_merge'].eq('right_only')].drop(columns=['_merge'])
print(only_df1)
print()
print(only_df2)
a b
0 1 1
a b
5 6 1
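The merge-with-indicator steps above can be wrapped in a small helper. This is only a sketch: split_by_membership is a hypothetical name, and it assumes both frames share the same columns and that exact duplicate rows can be dropped before comparing.

```python
import pandas as pd

def split_by_membership(left, right):
    """Return (rows only in left, rows only in right), compared on all shared columns."""
    # Assumption: duplicates carry no meaning here, so drop them to keep
    # repeated rows from multiplying in the merge
    merged = left.drop_duplicates().merge(
        right.drop_duplicates(), how='outer', indicator=True
    )
    only_left = merged.loc[merged['_merge'] == 'left_only'].drop(columns='_merge')
    only_right = merged.loc[merged['_merge'] == 'right_only'].drop(columns='_merge')
    return only_left.reset_index(drop=True), only_right.reset_index(drop=True)

df1 = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 1, 1, 1, 1]})
df2 = pd.DataFrame({'a': [2, 3, 4, 5, 6], 'b': [1, 1, 1, 1, 1]})

only_df1, only_df2 = split_by_membership(df1, df2)
print(only_df1)
print(only_df2)
```

With the sample frames, only_df1 holds the row with a=1 and only_df2 holds the row with a=6.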

Efficient way in Pandas to count occurrences of Series of values by row

I have a large dataframe for which I want to count, by row, the number of occurrences of a series of specific values (given by an external function). For reproducibility, let's assume the following simplified dataframe:
data = {'A': [3, 2, 1, 0], 'B': [4, 3, 2, 1], 'C': [1, 2, 3, 4], 'D': [1, 1, 2, 2], 'E': [4, 4, 4, 4]}
df = pd.DataFrame.from_dict(data)
df
A B C D E
0 3 4 1 1 4
1 2 3 2 1 4
2 1 2 3 2 4
3 0 1 4 2 4
How can I count the number of occurrences of specific values (given by a series with the same size) by row?
Again for simplicity, let's assume this value_series is given by the max of each row.
values_series = df.max(axis=1)
0 4
1 4
2 4
3 4
dtype: int64
The solution I got to seems not very pythonic (e.g. I'm using iterrows(), which is slow):
max_count = []
for index, row in df.iterrows():
    max_count.append(row.value_counts()[values_series.loc[index]])
df_counts = pd.Series(max_count)
Is there any more efficient way to do this?
We can compare the transposed df.T directly to the df.max series, thanks to broadcasting:
(df.T == df.max(axis=1)).sum()
# result
0 2
1 1
2 1
3 2
dtype: int64
(Transposing also has the added benefit that we can use sum without specifying the axis, i.e. with the default axis=0.)
You can try
df.eq(df.max(axis=1), axis=0).sum(axis=1)
Out[361]:
0 2
1 1
2 1
3 2
dtype: int64
The perfect job for numpy broadcasting:
a = df.to_numpy()
b = values_series.to_numpy()[:, None]
(a == b).sum(axis=1)
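As a quick sanity check, the three vectorized approaches above can be run side by side on the sample data. This is a sketch; the names via_transpose, via_eq and via_numpy are only illustrative.

```python
import pandas as pd

data = {'A': [3, 2, 1, 0], 'B': [4, 3, 2, 1], 'C': [1, 2, 3, 4],
        'D': [1, 1, 2, 2], 'E': [4, 4, 4, 4]}
df = pd.DataFrame(data)
values_series = df.max(axis=1)

# 1. Transpose so the row labels become columns, which the Series aligns on
via_transpose = (df.T == values_series).sum()

# 2. Compare along the index explicitly, then sum across each row
via_eq = df.eq(values_series, axis=0).sum(axis=1)

# 3. Plain NumPy: turn the values into a column vector and broadcast
via_numpy = (df.to_numpy() == values_series.to_numpy()[:, None]).sum(axis=1)

print(via_transpose.tolist())  # [2, 1, 1, 2]
print(via_eq.tolist())         # [2, 1, 1, 2]
print(via_numpy.tolist())      # [2, 1, 1, 2]
```

The first two return a pandas Series indexed like df, while the NumPy version returns a bare array, so wrap it in pd.Series(..., index=df.index) if the labels matter.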

Pandas Series with different lengths

Using pandas concat function it is possible to create a series like this:
In [230]: pd.concat({'One':pd.Series(range(3)), 'Two':pd.Series(range(4))})
Out[230]:
One 0 0
1 1
2 2
Two 0 0
1 1
2 2
3 3
dtype: int64
Is it possible to do the same without using the concat method?
My best approach was:
a = pd.Series(range(3),range(3))
b = pd.Series(range(4),range(4))
pd.Series([a,b],index=['One','Two'])
But it is not the same, it outputs:
One 0 0
1 1
2 2
dtype: int64
Two 0 0
1 1
2 2
3 3
dtype: int64
dtype: object
This should give you an idea of just how useful concat is.
a.index = pd.MultiIndex.from_tuples([('One', v) for v in a.index])
b.index = pd.MultiIndex.from_tuples([('Two', v) for v in b.index])
a.append(b)
One 0 0
1 1
2 2
Two 0 0
1 1
2 2
3 3
dtype: int64
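The same thing is achieved by pd.concat([a, b]). Note that Series.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions pd.concat is the way to join the two.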
This is what the keys argument is for, in case you want to get the same output using concat, i.e.:
pd.concat([a,b],keys=['One','Two'])
One 0 0
1 1
2 2
Two 0 0
1 1
2 2
3 3
dtype: int64
This works fine:
data = list(range(3)) + list(range(4))
# The labels= argument was renamed to codes= in pandas 0.24
index = pd.MultiIndex(levels=[['One', 'Two'], [0, 1, 2, 3]],
                      codes=[[0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 0, 1, 2, 3]])
pd.Series(data,index=index)
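If writing the levels and codes by hand feels error-prone, the index can also be generated from the inputs. The following is a sketch using pd.MultiIndex.from_arrays; the parts dictionary and the names outer, inner, built and expected are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical input: each key maps to the values of one sub-series
parts = {'One': range(3), 'Two': range(4)}

# Outer level repeats the key once per value; inner level restarts at 0 per key
outer = [key for key, values in parts.items() for _ in values]
inner = [i for values in parts.values() for i in range(len(values))]
data = [v for values in parts.values() for v in values]

index = pd.MultiIndex.from_arrays([outer, inner])
built = pd.Series(data, index=index)

# Reference result from concat, for comparison only
expected = pd.concat({key: pd.Series(values) for key, values in parts.items()})

print(built)
print(built.index.tolist() == expected.index.tolist())  # True
```

In practice concat remains shorter, which is the point the answers above make.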

Including missing combinations of values in a pandas groupby aggregation

Problem
Including all possible values or combinations of values in the output of a pandas groupby aggregation.
Example
Example pandas DataFrame has three columns, User, Code, and Subtotal:
import pandas as pd
example_df = pd.DataFrame([['a', 1, 1], ['a', 2, 1], ['b', 1, 1], ['b', 2, 1], ['c', 1, 1], ['c', 1, 1]], columns=['User', 'Code', 'Subtotal'])
I'd like to group on User and Code and get a subtotal for each combination of User and Code.
print(example_df.groupby(['User', 'Code']).Subtotal.sum().reset_index())
The output I get is:
User Code Subtotal
0 a 1 1
1 a 2 1
2 b 1 1
3 b 2 1
4 c 1 2
How can I include the missing combination User=='c' and Code==2 in the table, even though it doesn't exist in example_df?
Preferred output
Below is the preferred output, with a zero line for the User=='c' and Code==2 combination.
User Code Subtotal
0 a 1 1
1 a 2 1
2 b 1 1
3 b 2 1
4 c 1 2
5 c 2 0
You can use unstack with stack:
print(example_df.groupby(['User', 'Code']).Subtotal.sum()
                .unstack(fill_value=0)
                .stack()
                .reset_index(name='Subtotal'))
User Code Subtotal
0 a 1 1
1 a 2 1
2 b 1 1
3 b 2 1
4 c 1 2
5 c 2 0
Another solution is to reindex by a MultiIndex created with from_product:
df = example_df.groupby(['User', 'Code']).Subtotal.sum()
mux = pd.MultiIndex.from_product(df.index.levels, names=['User','Code'])
print (mux)
MultiIndex(levels=[['a', 'b', 'c'], [1, 2]],
labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
names=['User', 'Code'])
print (df.reindex(mux, fill_value=0).reset_index(name='Subtotal'))
User Code Subtotal
0 a 1 1
1 a 2 1
2 b 1 1
3 b 2 1
4 c 1 2
5 c 2 0
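Both approaches above only know about values that occur somewhere in the data. If a value can be missing entirely, one option is to build the product from explicit lists instead of df.index.levels. This is a sketch: all_users and all_codes are assumed to be known up front, and Code 3 is a made-up value that never appears in example_df.

```python
import pandas as pd

example_df = pd.DataFrame(
    [['a', 1, 1], ['a', 2, 1], ['b', 1, 1], ['b', 2, 1], ['c', 1, 1], ['c', 1, 1]],
    columns=['User', 'Code', 'Subtotal'])

# Assumption: the full set of keys is supplied by the caller
all_users = ['a', 'b', 'c']
all_codes = [1, 2, 3]  # 3 does not occur in example_df

totals = example_df.groupby(['User', 'Code'])['Subtotal'].sum()
full_index = pd.MultiIndex.from_product([all_users, all_codes],
                                        names=['User', 'Code'])

# Combinations absent from the data get a Subtotal of 0
result = totals.reindex(full_index, fill_value=0).reset_index()
print(result)
```

The result has nine rows, one per User and Code pair, with zeros for (a, 3), (b, 3), (c, 2) and (c, 3).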

Pandas index column by boolean

I want to keep columns that have 'n' or more values.
For example:
> df = pd.DataFrame({'a': [1,2,3], 'b': [1,None,4]})
a b
0 1 1
1 2 NaN
2 3 4
3 rows × 2 columns
> df[df.count()==3]
IndexingError: Unalignable boolean Series key provided
> df[:,df.count()==3]
TypeError: unhashable type: 'slice'
> df[[k for (k,v) in (df.count()==3).items() if v]]
a
0 1
1 2
2 3
Is that the best way to do this? It seems ridiculous.
You can use a conditional list comprehension to collect the columns whose non-null count exceeds your threshold (e.g. 3). Then just select those columns from the data frame:
# Create sample DataFrame
df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [1, None, 4, None, 2],
                   'c': [5, 4, 3, 2, None]})
>>> df[[col for col in df if df[col].count() > 3]]
Out[82]:
a c
0 1 5
1 2 4
2 3 3
3 4 2
4 5 NaN
Use count to produce a boolean index and use this as a mask for the columns:
In [10]:
df[df.columns[df.count() > 2]]
Out[10]:
a
0 1
1 2
2 3
If you want to keep columns with a minimum number of non-null values, you can count them per column. This example keeps the columns that have more than 4 non-null values:
df = pd.DataFrame({'a': [1,2,3,4,6], 'b': [1,None,4,5,7],'c': [1,2,3,5,8]})
print(df)
a b c
0 1 1 1
1 2 NaN 2
2 3 4 3
3 4 5 5
4 6 7 8
print(df.iloc[:, [i for i in range(len(df.columns)) if len(df.iloc[:, i]) - df.isnull().sum().iloc[i] > 4]])
a c
0 1 1
1 2 2
2 3 3
3 4 5
4 6 8
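For the "n or more values" wording of the question, two shorter spellings are worth comparing. This is a sketch on the question's original frame; n, kept_loc and kept_dropna are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [1, None, 4]})
n = 3  # keep columns with at least n non-null values

# Boolean mask over the columns, applied through .loc
kept_loc = df.loc[:, df.count() >= n]

# dropna with thresh keeps columns that have at least n non-NA values
kept_dropna = df.dropna(axis=1, thresh=n)

print(kept_loc.columns.tolist())     # ['a']
print(kept_dropna.columns.tolist())  # ['a']
```

The .loc form works because the mask is indexed by column labels, which is exactly what the failing df[df.count()==3] attempt lacked: plain df[...] with a boolean Series tries to align the mask with the rows.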
