Pandas - merging two DataFrames by summing the values by index and columns - python

I want to merge two datasets by their indexes and columns, summing the values of the overlapping cells across the entire DataFrames.
import pandas as pd
df1 = pd.DataFrame([[1, 0, 0], [0, 2, 0], [0, 0, 3]], columns=[1, 2, 3])
df1
1 2 3
0 1 0 0
1 0 2 0
2 0 0 3
df2 = pd.DataFrame([[0, 0, 1], [0, 2, 0], [3, 0, 0]], columns=[1, 2, 3])
df2
1 2 3
0 0 0 1
1 0 2 0
2 3 0 0
I have tried the code below, but I get this error, and I can't see why the length of the axis would be the problem.
df_sum = pd.concat([df1, df2])\
           .groupby(df2.index)[df2.columns]\
           .sum().reset_index()
ValueError: Grouper and axis must be same length
This is the output I expected for df_sum:
df_sum
1 2 3
0 1 0 1
1 0 4 0
2 3 0 3

You can use df1.add(df2, fill_value=0). It adds df2 to df1 element-wise, treating a value that is missing on one side as 0.
>>> import numpy as np
>>> import pandas as pd
>>> df2 = pd.DataFrame([(10,9),(8,4),(7,np.nan)], columns=['a','b'])
>>> df1 = pd.DataFrame([(1,2),(3,4),(5,6)], columns=['a','b'])
>>> df1.add(df2, fill_value=0)
a b
0 11 11.0
1 11 8.0
2 12 6.0
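Applied to the frames from the question, this reproduces the expected output (a quick check using the df1 and df2 defined there):
>>> df1.add(df2, fill_value=0)
   1  2  3
0  1  0  1
1  0  4  0
2  3  0  3
As for the original error: pd.concat([df1, df2]) stacks the two frames into 6 rows, while df2.index has only 3 labels, so the grouper and the axis differ in length. Grouping by the index level instead also works: pd.concat([df1, df2]).groupby(level=0).sum().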


Select rows by column value and include previous row by another column value

Here's an example DataFrame:
import numpy as np
import pandas as pd
df = pd.DataFrame([
    [0, "file_0", 5],
    [0, "file_1", 0],
    [1, "file_2", 0],
    [1, "file_3", 8],
    [2, "file_4", 0],
    [2, "file_5", 5],
    [2, "file_6", 100],
    [2, "file_7", 0],
    [2, "file_8", 50]
], columns=["case", "filename", "num"])
I want to select the num==0 rows together with each one's previous row within the same case, whatever that previous row's num value is.
The expected result is:
case filename num
0 file_0 5
0 file_1 0
1 file_2 0
2 file_4 0
2 file_6 100
2 file_7 0
I have figured out that I can select the previous rows with
df[(df['num']==0).shift(-1).fillna(False)]
However, this doesn't take the case value into account. One solution that came to mind is to group by case first and then filter, but I have no idea how to code it ...
I figured out the answer myself:
# create a boolean mask that is True when `num` is 0 and the previous row has the same `case`
mask = (df.case.eq(df.case.shift())) & (df['num'] == 0)
# concat the previous rows with the num==0 rows
df_res = pd.concat([df[mask.shift(-1).fillna(False)],
                    df[df['num'] == 0]]).sort_values(['case', 'filename'])
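Running this on the example DataFrame above reproduces the expected output:
print(df_res)
   case filename  num
0     0   file_0    5
1     0   file_1    0
2     1   file_2    0
4     2   file_4    0
6     2   file_6  100
7     2   file_7    0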
How about merging df with itself?
df = pd.DataFrame([
    [0, "file_0", 0],
    [0, "file_1", 0],
    [1, "file_2", 0],
    [2, "file_3", 0],
    [2, "file_4", 100],
    [2, "file_5", 0],
    [2, "file_6", 50],
    [2, "file_7", 0]
], columns=["case", "filename", "num"])
df = df.merge(df, left_on='filename', right_on='filename', how='inner')
res = df[(df['case_x'] == df['case_y']) & (df['num_x'] == 0)]
res
Out[219]:
   case_x filename  num_x  case_y  num_y
0       0   file_0      0       0      0
1       0   file_1      0       0      0
2       1   file_2      0       1      0
3       2   file_3      0       2      0
5       2   file_5      0       2      0
7       2   file_7      0       2      0
then you can rename the columns back:
res[['case_x', 'filename', 'num_x']].rename({'case_x': 'case', 'num_x': 'num'}, axis=1)
Out[223]:
   case filename  num
0     0   file_0    0
1     0   file_1    0
2     1   file_2    0
3     2   file_3    0
5     2   file_5    0
7     2   file_7    0
Do you mean:
df.join(df.groupby('case').shift(-1)
          .loc[df['num']==0]
          .dropna(how='all').add_suffix('_next'),
        how='inner')
Here groupby('case').shift(-1) pulls the following row within each case, .loc[df['num']==0] keeps only the num==0 positions, and dropna(how='all') drops the ones with no next row in the same case; the inner join then pairs each surviving row with its successor.
Output:
case filename num filename_next num_next
0 0 file_0 0 file_1 0.0
3 2 file_3 0 file_4 100.0
5 2 file_5 0 file_6 50.0

Find the differences between two DataFrames

I have two DataFrames: df1 has 26,000 rows and df2 has 25,000 rows.
I'm trying to find the data points that are in df1 but not in df2, and vice versa.
This is what I wrote (below), but when I cross-check the result it still shows me shared data points.
import pandas as pd
df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')
df_join = pd.concat([df1, df2], axis=1).drop_duplicates(keep=False)
only_df1 = df_join.loc[df_join[df2.columns.to_list()].isnull().all(axis=1), df1.columns.to_list()]
Order doesn't matter; I just want to know whether a given data point exists in one DataFrame or the other.
With two dfs:
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 1, 1, 1, 1]})
df2 = pd.DataFrame({'a': [2, 3, 4, 5, 6], 'b': [1, 1, 1, 1, 1]})
print(df1)
print(df2)
a b
0 1 1
1 2 1
2 3 1
3 4 1
4 5 1
a b
0 2 1
1 3 1
2 4 1
3 5 1
4 6 1
You could do:
df_differences = df1.merge(df2, how='outer', indicator=True)
print(df_differences)
Result:
a b _merge
0 1 1 left_only
1 2 1 both
2 3 1 both
3 4 1 both
4 5 1 both
5 6 1 right_only
And then:
only_df1 = df_differences[df_differences['_merge'].eq('left_only')].drop(columns=['_merge'])
only_df2 = df_differences[df_differences['_merge'].eq('right_only')].drop(columns=['_merge'])
print(only_df1)
print()
print(only_df2)
a b
0 1 1
a b
5 6 1
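If you just want everything that is not shared, the same indicator column gives both sides in one pass (a small variant of the merge above):
non_shared = df_differences[df_differences['_merge'] != 'both']
print(non_shared)
   a  b      _merge
0  1  1   left_only
5  6  1  right_only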

New DataFrame boolean column that checks whether or not any of certain columns equal 1

I have the following pd.DataFrame and list of columns:
col_list = ["med_a", "med_c"]
df = pd.DataFrame.from_dict({'med_a': [0, 0, 1, 0], 'med_b': [0, 0, 1, 1], 'med_c': [0, 1, 1, 0]})
print(df)
>>>
med_a med_b med_c
0 0 0 0
1 0 0 1
2 1 1 1
3 0 1 0
I want to make a new column (new_col) that holds True/False (or 0/1) depending on whether any of the columns in col_list equals 1, for each row. So the result should become:
med_a med_b med_c new_col
0 0 0 0 0
1 0 0 1 1
2 1 1 1 1
3 0 1 0 0
I know how to select only those rows where at least one of the columns equals 1, but that doesn't restrict the check to the columns in col_list, and it doesn't create a new column:
print(df[(df == 1).any(axis=1)])
>>>
   med_a  med_b  med_c
1      0      0      1
2      1      1      1
3      0      1      0
How would I achieve the desired result? Any help is appreciated.
You're so close! Just select the col_list columns of the df before calling any on axis=1, then cast with astype(int).
import numpy as np
import pandas as pd
col_list = ["med_a", "med_c"]
df = pd.DataFrame.from_dict({'med_a': [0, 0, 1, 0],
                             'med_b': [0, 0, 1, 1],
                             'med_c': [0, 1, 1, 0]})
df['new_col'] = df[col_list].any(axis=1).astype(int)
print(df)
Or via np.where:
df['new_col'] = np.where(df[col_list].any(axis=1), 1, 0)
med_a med_b med_c new_col
0 0 0 0 0
1 0 0 1 1
2 1 1 1 1
3 0 1 0 0
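One caveat worth noting: any(axis=1) treats every non-zero value as truthy. The columns here only hold 0 and 1, so that's fine, but if they could contain other values and you strictly want "equals 1", compare first (the same idea, just with an explicit eq):
df['new_col'] = df[col_list].eq(1).any(axis=1).astype(int)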
Timing information via perfplot:
np.where is faster than astype(int) up to about 100,000 rows, at which point they are about the same.
import numpy as np
import pandas as pd
import perfplot

np.random.seed(5)
col_list = ["med_a", "med_c"]

def gen_data(n):
    return pd.DataFrame.from_dict({'med_a': np.random.choice([0, 1], size=n),
                                   'med_b': np.random.choice([0, 1], size=n),
                                   'med_c': np.random.choice([0, 1], size=n)})

def np_where(df):
    df['new_col'] = np.where(df[col_list].any(axis=1), 1, 0)
    return df

def astype_int(df):
    df['new_col'] = df[col_list].any(axis=1).astype(int)
    return df

if __name__ == '__main__':
    out = perfplot.bench(
        setup=gen_data,
        kernels=[
            np_where,
            astype_int
        ],
        labels=[
            'np_where',
            'astype_int'
        ],
        n_range=[2 ** k for k in range(25)],
        equality_check=None
    )
    out.save('perfplot_results.png', transparent=False)

How does nunique work with the given table values?

yf = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
output:
A B
0 1 1
1 2 1
2 3 1
yf.nunique(axis=0)
output:
A 3
B 1
yf.nunique(axis=1)
output:
0 1
1 2
2 2
Could you please explain how axis=0 and axis=1 work? With axis=0, why are the repeated values (e.g. the 1s in column B) only counted once? And does nunique take the index into account as well?
You can count the number of unique values per column or per row with DataFrame.nunique.
yf = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
print (yf)
A B
0 1 1
1 2 1
2 3 1
print (yf.nunique(axis=0))
A 3
B 1
dtype: int64
print (yf.nunique(axis=1))
0 1
1 2
2 2
dtype: int64
It means:
A is 3 because there are 3 unique values in column A (1, 2, 3)
0 is 1 because there is only 1 unique value in row 0 (both cells hold 1)
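Note that nunique only counts the cell values along the chosen axis; the index labels themselves are never included. If you want the number of unique index labels, call nunique on the index directly:
print(yf.index.nunique())
3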

Including missing combinations of values in a pandas groupby aggregation

Problem
Including all possible values or combinations of values in the output of a pandas groupby aggregation.
Example
Example pandas DataFrame has three columns, User, Code, and Subtotal:
import pandas as pd
example_df = pd.DataFrame([['a', 1, 1], ['a', 2, 1], ['b', 1, 1], ['b', 2, 1], ['c', 1, 1], ['c', 1, 1]], columns=['User', 'Code', 'Subtotal'])
I'd like to group on User and Code and get a subtotal for each combination of User and Code.
print(example_df.groupby(['User', 'Code']).Subtotal.sum().reset_index())
The output I get is:
User Code Subtotal
0 a 1 1
1 a 2 1
2 b 1 1
3 b 2 1
4 c 1 2
How can I include the missing combination User=='c' and Code==2 in the table, even though it doesn't exist in example_df?
Preferred output
Below is the preferred output, with a zero line for the User=='c' and Code==2 combination.
User Code Subtotal
0 a 1 1
1 a 2 1
2 b 1 1
3 b 2 1
4 c 1 2
5 c 2 0
You can use unstack with stack. The fill_value=0 in unstack turns the missing combination into 0 rather than NaN, so the subsequent stack keeps it:
print(example_df.groupby(['User', 'Code']).Subtotal.sum()
                .unstack(fill_value=0)
                .stack()
                .reset_index(name='Subtotal'))
User Code Subtotal
0 a 1 1
1 a 2 1
2 b 1 1
3 b 2 1
4 c 1 2
5 c 2 0
Another solution is to reindex by a MultiIndex created with from_product:
df = example_df.groupby(['User', 'Code']).Subtotal.sum()
mux = pd.MultiIndex.from_product(df.index.levels, names=['User','Code'])
print (mux)
MultiIndex(levels=[['a', 'b', 'c'], [1, 2]],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=['User', 'Code'])
print (df.reindex(mux, fill_value=0).reset_index(name='Subtotal'))
User Code Subtotal
0 a 1 1
1 a 2 1
2 b 1 1
3 b 2 1
4 c 1 2
5 c 2 0
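A related approach, assuming a pandas version whose groupby accepts the observed keyword: cast the key columns to category and group with observed=False, so groupby emits every combination of levels and sum() returns 0 for the empty groups:
tmp = example_df.astype({'User': 'category', 'Code': 'category'})
print(tmp.groupby(['User', 'Code'], observed=False).Subtotal.sum().reset_index())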
