Find the difference between two data frames - Python

I have two data frames: df1 has 26,000 rows and df2 has 25,000 rows.
I'm trying to find data points that are in df1 but not in df2, and vice versa.
This is what I wrote (below), but when I cross-check the result it still shows me data points shared by both frames:
import pandas as pd
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')
df_join = pd.concat([df1, df2], axis=1).drop_duplicates(keep=False)
only_df1 = df_join.loc[df_join[df2.columns.to_list()].isnull().all(axis = 1), df1.columns.to_list()]
Order doesn't matter; I just want to know whether a given data point exists in one data frame or the other.

With two dfs:
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 1, 1, 1, 1]})
df2 = pd.DataFrame({'a': [2, 3, 4, 5, 6], 'b': [1, 1, 1, 1, 1]})
print(df1)
print(df2)
   a  b
0  1  1
1  2  1
2  3  1
3  4  1
4  5  1
   a  b
0  2  1
1  3  1
2  4  1
3  5  1
4  6  1
You could do:
df_differences = df1.merge(df2, how='outer', indicator=True)
print(df_differences)
Result:
   a  b      _merge
0  1  1   left_only
1  2  1        both
2  3  1        both
3  4  1        both
4  5  1        both
5  6  1  right_only
And then:
only_df1 = df_differences[df_differences['_merge'].eq('left_only')].drop(columns=['_merge'])
only_df2 = df_differences[df_differences['_merge'].eq('right_only')].drop(columns=['_merge'])
print(only_df1)
print()
print(only_df2)
   a  b
0  1  1

   a  b
5  6  1
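Applied to the CSV files from the question, the same pattern would look roughly like the sketch below. It assumes both files share identical column names and contain no duplicate rows (duplicates change how the outer merge pairs rows up), so treat it as a starting point rather than a drop-in solution.
import pandas as pd

df1 = pd.read_csv('df1.csv')   # 26,000 rows
df2 = pd.read_csv('df2.csv')   # 25,000 rows

# Outer merge on all shared columns; the indicator column records whether
# each row came from the left frame, the right frame, or both.
df_differences = df1.merge(df2, how='outer', indicator=True)

only_df1 = df_differences[df_differences['_merge'].eq('left_only')].drop(columns=['_merge'])
only_df2 = df_differences[df_differences['_merge'].eq('right_only')].drop(columns=['_merge'])

print(len(only_df1), len(only_df2))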

Related

How does nunique work with the given table values?

yf = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
output:
   A  B
0  1  1
1  2  1
2  3  1
yf.nunique(axis=0)
output:
A    3
B    1
yf.nunique(axis=1)
output:
0    1
1    2
2    2
Could you please explain how axis=0 and axis=1 work here? For axis=0, why are the values A=2 and B=1 ignored? I also wonder whether nunique takes the index into account.
You can count the number of unique values per column or per row with DataFrame.nunique.
yf = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
print (yf)
   A  B
0  1  1
1  2  1
2  3  1
print (yf.nunique(axis=0))
A    3
B    1
dtype: int64
print (yf.nunique(axis=1))
0    1
1    2
2    2
dtype: int64
It means:
A is 3, because there are 3 unique values in column A.
0 is 1, because there is 1 unique value in row 0.
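As an illustrative cross-check (not part of the original answer), counting the distinct values explicitly with Python sets gives the same numbers, which makes the axis direction easier to see:
import pandas as pd

yf = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})

# axis=0: the function sees one column at a time and counts distinct values down the rows.
print(yf.apply(lambda col: len(set(col)), axis=0))   # A -> 3, B -> 1

# axis=1: the function sees one row at a time and counts distinct values across the columns.
print(yf.apply(lambda row: len(set(row)), axis=1))   # row 0 -> 1, rows 1 and 2 -> 2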

Pandas - merging two dataframes by summing the values of columns and index

I want to merge two datasets by their indexes and columns, summing the overlapping values, and I want to merge the entire dataset.
df1 = pd.DataFrame([[1, 0, 0], [0, 2, 0], [0, 0, 3]],columns=[1, 2, 3])
df1
   1  2  3
0  1  0  0
1  0  2  0
2  0  0  3
df2 = pd.DataFrame([[0, 0, 1], [0, 2, 0], [3, 0, 0]],columns=[1, 2, 3])
df2
   1  2  3
0  0  0  1
1  0  2  0
2  3  0  0
I tried the code below, but I got this error. I can't work out why it complains about the length of the axis.
df_sum = pd.concat([df1, df2])\
    .groupby(df2.index)[df2.columns]\
    .sum().reset_index()
ValueError: Grouper and axis must be same length
This is what I expected the output of df_sum to be:
df_sum
   1  2  3
0  1  0  1
1  0  4  0
2  3  0  3
You can use df1.add(df2, fill_value=0). It adds df2 to df1 element-wise, and fill_value=0 substitutes 0 for a missing value whenever the other frame has a value at that position.
>>> import numpy as np
>>> import pandas as pd
>>> df2 = pd.DataFrame([(10,9),(8,4),(7,np.nan)], columns=['a','b'])
>>> df1 = pd.DataFrame([(1,2),(3,4),(5,6)], columns=['a','b'])
>>> df1.add(df2, fill_value=0)
    a     b
0  11  11.0
1  11   8.0
2  12   6.0
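Applied to the DataFrames from the question, which share the same index and columns, the same call reproduces the expected df_sum (a small sketch reusing the question's data):
import pandas as pd

df1 = pd.DataFrame([[1, 0, 0], [0, 2, 0], [0, 0, 3]], columns=[1, 2, 3])
df2 = pd.DataFrame([[0, 0, 1], [0, 2, 0], [3, 0, 0]], columns=[1, 2, 3])

# Element-wise sum, aligned on both index and columns.
df_sum = df1.add(df2, fill_value=0)
print(df_sum)
#    1  2  3
# 0  1  0  1
# 1  0  4  0
# 2  3  0  3
As for the original error: after pd.concat([df1, df2]) the frame has 6 rows, but df2.index supplies only 3 grouping labels, hence "Grouper and axis must be same length". Grouping by level=0 (the repeated index labels) instead of df2.index would also have worked.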

Including missing combinations of values in a pandas groupby aggregation

Problem
Including all possible values or combinations of values in the output of a pandas groupby aggregation.
Example
Example pandas DataFrame has three columns, User, Code, and Subtotal:
import pandas as pd
example_df = pd.DataFrame([['a', 1, 1], ['a', 2, 1], ['b', 1, 1], ['b', 2, 1], ['c', 1, 1], ['c', 1, 1]], columns=['User', 'Code', 'Subtotal'])
I'd like to group on User and Code and get a subtotal for each combination of User and Code.
print(example_df.groupby(['User', 'Code']).Subtotal.sum().reset_index())
The output I get is:
  User  Code  Subtotal
0    a     1         1
1    a     2         1
2    b     1         1
3    b     2         1
4    c     1         2
How can I include the missing combination User=='c' and Code==2 in the table, even though it doesn't exist in example_df?
Preferred output
Below is the preferred output, with a zero line for the User=='c' and Code==2 combination.
  User  Code  Subtotal
0    a     1         1
1    a     2         1
2    b     1         1
3    b     2         1
4    c     1         2
5    c     2         0
You can use unstack with stack:
print(example_df.groupby(['User', 'Code']).Subtotal.sum()
      .unstack(fill_value=0)
      .stack()
      .reset_index(name='Subtotal'))
  User  Code  Subtotal
0    a     1         1
1    a     2         1
2    b     1         1
3    b     2         1
4    c     1         2
5    c     2         0
Another solution is to reindex with a MultiIndex created by from_product:
df = example_df.groupby(['User', 'Code']).Subtotal.sum()
mux = pd.MultiIndex.from_product(df.index.levels, names=['User','Code'])
print (mux)
MultiIndex(levels=[['a', 'b', 'c'], [1, 2]],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=['User', 'Code'])
print (df.reindex(mux, fill_value=0).reset_index(name='Subtotal'))
  User  Code  Subtotal
0    a     1         1
1    a     2         1
2    b     1         1
3    b     2         1
4    c     1         2
5    c     2         0
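On newer pandas versions, a third option (not from the original answers, so treat it as a sketch) is to cast the grouping columns to categorical dtype and group with observed=False; every combination of categories is then kept, and sum() over an empty group yields 0:
import pandas as pd

example_df = pd.DataFrame(
    [['a', 1, 1], ['a', 2, 1], ['b', 1, 1], ['b', 2, 1], ['c', 1, 1], ['c', 1, 1]],
    columns=['User', 'Code', 'Subtotal'])

# Categorical columns remember all their levels, so groupby(observed=False)
# emits the (c, 2) group even though no row has that combination.
out = (example_df
       .astype({'User': 'category', 'Code': 'category'})
       .groupby(['User', 'Code'], observed=False)
       .Subtotal.sum()
       .reset_index())
print(out)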

How to name a dataframe column filled by numpy array?

I am filling a DataFrame by transposing some numpy arrays:
for symbol in syms[:5]:
    price_p = Share(symbol)
    closes_p = [c['Close'] for c in price_p.get_historical(startdate_s, enddate_s)]
    dump = np.array(closes_p)
    na_price_ar.append(dump)
    print symbol
df = pd.DataFrame(na_price_ar).transpose()
df, the DataFrame, is filled correctly; however, the column names are 0, 1, 2, ... and I would like to rename them with the corresponding values of syms[:5]. I googled it and found this:
for symbol in syms[:5]:
    df.rename(columns={'' + str(i) + '': symbol}, inplace=True)
    i = i + 1
But if I check the variable df, I still have the same column names.
Any ideas?
Instead of using a list of arrays and transposing, you could build the DataFrame from a dict whose keys are symbols and whose values are arrays of column values:
import numpy as np
import pandas as pd

np.random.seed(2016)
syms = 'abcde'
na_price_ar = {}
for symbol in syms[:5]:
    # price_p = Share(symbol)
    # closes_p = [c['Close'] for c in price_p.get_historical(startdate_s, enddate_s)]
    # dump = np.array(closes_p)
    dump = np.random.randint(10, size=3)
    na_price_ar[symbol] = dump
    print(symbol)

df = pd.DataFrame(na_price_ar)
print(df)
yields
   a  b  c  d  e
0  3  3  8  2  4
1  7  8  7  6  1
2  2  4  9  3  9
You can use:
na_price_ar = [['A','B','C'],[0,2,3],[1,2,4],[5,2,3],[8,2,3]]
syms = ['q','w','e','r','t','y','u']
df = pd.DataFrame(na_price_ar, index=syms[:5]).transpose()
print (df)
   q  w  e  r  t
0  A  0  1  5  8
1  B  2  2  2  2
2  C  3  4  3  3
You may use df.columns[number] as the dictionary key in the .rename() method:
dic = {'a': [4, 1, 3, 1], 'b': [4, 2, 1, 4], 'c': [5, 7, 9, 1], 'd': [4, 1, 3, 1], 'e': [5, 2, 6, 0]}
df = pd.DataFrame(dic)
number = 0
for symbol in syms[:5]:
    df.rename(columns={df.columns[number]: symbol}, inplace=True)
    number = number + 1
and the result is
   i  f  g  h  i
0  4  4  5  4  5
1  1  2  7  1  2
2  3  1  9  3  6
3  1  4  1  1  0
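For completeness: the rename loop in the question most likely has no effect because the default column labels are the integers 0 through 4, not the strings '0' through '4', so the string keys in the rename mapping match nothing and are silently ignored. A minimal sketch of a direct fix, assuming df and syms are built as in the question:
# df built as in the question: one column per symbol, integer labels 0..4.
df = pd.DataFrame(na_price_ar).transpose()

# Option 1: assign the names directly.
df.columns = syms[:5]

# Option 2: rename using the existing integer labels as keys.
df = df.rename(columns=dict(enumerate(syms[:5])))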

Pandas index column by boolean

I want to keep columns that have 'n' or more values.
For example:
> df = pd.DataFrame({'a': [1,2,3], 'b': [1,None,4]})
   a    b
0  1    1
1  2  NaN
2  3    4
3 rows × 2 columns
> df[df.count()==3]
IndexingError: Unalignable boolean Series key provided
> df[:,df.count()==3]
TypeError: unhashable type: 'slice'
> df[[k for (k,v) in (df.count()==3).items() if v]]
   a
0  1
1  2
2  3
Is that the best way to do this? It seems ridiculous.
You can use a conditional list comprehension to collect the columns whose non-null count exceeds your threshold (e.g. 3), then select those columns from the data frame:
# Create sample DataFrame
df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [1, None, 4, None, 2],
                   'c': [5, 4, 3, 2, None]})

>>> df_new = df[[col for col in df if df[col].count() > 3]]
>>> df_new
   a    c
0  1    5
1  2    4
2  3    3
3  4    2
4  5  NaN
Use count to produce a boolean index and use this as a mask for the columns:
In [10]:
df[df.columns[df.count() > 2]]
Out[10]:
   a
0  1
1  2
2  3
If you want to keep columns that have 'n' or more non-null values: for this example the condition keeps columns with more than 4 non-null values.
df = pd.DataFrame({'a': [1, 2, 3, 4, 6], 'b': [1, None, 4, 5, 7], 'c': [1, 2, 3, 5, 8]})
print(df)
   a    b  c
0  1    1  1
1  2  NaN  2
2  3    4  3
3  4    5  5
4  6    7  8
print(df[[df.columns[i] for i in range(len(df.columns))
          if len(df.iloc[:, i]) - df.isnull().sum().iloc[i] > 4]])
   a  c
0  1  1
1  2  2
2  3  3
3  4  5
4  6  8
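A more compact alternative (not from the original answers) is DataFrame.dropna with axis=1 and a thresh argument, which keeps only the columns that have at least that many non-null values:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 6], 'b': [1, None, 4, 5, 7], 'c': [1, 2, 3, 5, 8]})

# Keep columns with at least 5 non-null values; 'b' has only 4 and is dropped,
# matching the "> 4" condition above.
print(df.dropna(axis=1, thresh=5))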
