Pandas dataframe addition on selecting 2 or more columns - python

Given two dataframes with the same columns, how can I select particular columns and add the dataframes together?
The pandas dataframes are as follows:
a_val = {'col1': [1, 2], 'col2': [3, 4], 'col3': [7, 8]}
b_val = {'col1': [1, 5, 2], 'col2': [3, 2, 4], 'col3': [7, 17, 33]}
a = pd.DataFrame(a_val)
b = pd.DataFrame(b_val)
How can I produce the resulting dataframe C from the dataframes A and B above?

I think you need to merge and then sum the last column:
c = (pd.merge(a, b, on=['col1', 'col2'], suffixes=('', '_'))
       .assign(col3=lambda x: x.col3 + x.col3_)
       .drop(columns='col3_'))
Which is the same as:
c = pd.merge(a, b, on=['col1', 'col2'], suffixes=('', '_'))
c.col3 = c.col3.add(c.col3_)
c = c.drop(columns='col3_')
print (c)
   col1  col2  col3
0     1     3    14
1     2     4    41
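The merge-and-sum pattern above can be wrapped in a small helper when more than one value column needs to be added. This is only a sketch: the function name add_on_keys, its parameters, and the "_r" suffix are hypothetical, and it assumes an inner merge (rows whose keys appear in only one dataframe are dropped), as in the answer above.

```python
import pandas as pd


def add_on_keys(left, right, keys, value_cols):
    """Inner-merge two dataframes on `keys` and add the columns in `value_cols`.

    Hypothetical helper for illustration; not part of pandas.
    """
    merged = pd.merge(left, right, on=keys, suffixes=("", "_r"))
    for col in value_cols:
        # Add the right-hand value to the left-hand one.
        merged[col] = merged[col] + merged[col + "_r"]
    # Remove the temporary right-hand columns.
    return merged.drop(columns=[col + "_r" for col in value_cols])


a = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3': [7, 8]})
b = pd.DataFrame({'col1': [1, 5, 2], 'col2': [3, 2, 4], 'col3': [7, 17, 33]})

c = add_on_keys(a, b, keys=['col1', 'col2'], value_cols=['col3'])
print(c)
```

The row of b with col1 == 5 has no partner in a, so it does not appear in the result.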

Related

How to delete the consecutive duplicate elements from list also reflect that in another list in same data frame pandas?

I have a data frame like this:
A   B              C
32  [1,2,2,3,4]    [a,b,c,d,e]
35  [5,5,7,7,7,8]  [q,w,e,r,t,y]
The B column contains repeated elements. I need to remove the second, third, and later occurrences, and also delete the elements at the same positions in the C column.
The final result should be:
A   B          C
32  [1,2,3,4]  [a,b,d,e]
35  [5,7,8]    [q,e,y]
The data is in a pandas data frame.
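Use DataFrame.explode on both columns (exploding several columns at once requires pandas 1.3 or later), then convert df.index to a column named index with reset_index, remove duplicates per index and B value with DataFrame.drop_duplicates, and finally group by index and the A column and aggregate the remaining values back into lists: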
import pandas as pd
data = {'A': [32, 35],
        'B': [[1, 2, 2, 3, 4], [5, 5, 7, 7, 7, 8]],
        'C': [['a', 'b', 'c', 'd', 'e'], ['q', 'w', 'e', 'r', 't', 'y']]}
df = pd.DataFrame(data)
df1 = (df.explode(['B','C'])
         .reset_index()
         .drop_duplicates(['index','B'])
         .groupby(['index', 'A'])
         .agg(list)
         .reset_index(level=1)
         .rename_axis(None))
print (df1)
    A             B             C
0  32  [1, 2, 3, 4]  [a, b, d, e]
1  35     [5, 7, 8]     [q, e, y]
If the values in the A column are always unique:
df2 = (df.explode(['B','C'])
         .drop_duplicates(['A','B'])
         .groupby(['A'])
         .agg(list)
         .reset_index())
print (df2)
EDIT: Solution for older pandas versions:
df1 = (df[['A']].join(df[['B','C']].apply(lambda x: x.explode()))
         .reset_index()
         .drop_duplicates(['index','B'])
         .groupby(['index', 'A'])
         .agg(list)
         .reset_index(level=1)
         .rename_axis(None))
print (df1)
Not very idiomatic pandas, but it tries to reduce memory use:
import numpy as np
# For each row, compute the indices of the first occurrence of each value in column B
df['_indices_to_keep'] = df.B.map(
    lambda x: np.unique(x, return_index=True)[1] if x else None,
)
# Keep only values in columns B and C corresponding to the previously found positions
for c in ('B', 'C'):
    df[c] = df.apply(lambda r: [r[c][i] for i in r['_indices_to_keep']] if r[c] else None, axis=1)
# Drop the temporary column with the indices
df.drop(columns=['_indices_to_keep'], inplace=True)
The result is:
    A             B             C
0  32  [1, 2, 3, 4]  [a, b, d, e]
1  35     [5, 7, 8]     [q, e, y]
A short solution based on the np.unique and np.take routines:
import numpy as np

def drop_dups(r):
    vals, idx_start = np.unique(r['B'], return_index=True)
    r['B'], r['C'] = vals, np.take(r['C'], idx_start)
    return r
df = pd.DataFrame(data)
df[['B', 'C']] = df[['B', 'C']].apply(drop_dups, axis=1)
print(df)
The output:
    A             B             C
0  32  [1, 2, 3, 4]  [a, b, d, e]
1  35     [5, 7, 8]     [q, e, y]
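Note that np.unique returns its values in sorted order, so the two NumPy-based answers keep the original order only when each list in B is already sorted, as it is in the sample data. If the order of first occurrences matters, a plain-Python pass is one alternative. The following is a minimal sketch; the helper name dedupe_pair is hypothetical, and it assumes the lists in B and C have equal length in every row.

```python
import pandas as pd


def dedupe_pair(b_vals, c_vals):
    """Keep the first occurrence of each value in b_vals, plus the c_vals
    element at the same position. Original order is preserved."""
    seen = set()
    kept_b, kept_c = [], []
    for b, c in zip(b_vals, c_vals):
        if b not in seen:
            seen.add(b)
            kept_b.append(b)
            kept_c.append(c)
    return kept_b, kept_c


df = pd.DataFrame({
    'A': [32, 35],
    'B': [[1, 2, 2, 3, 4], [5, 5, 7, 7, 7, 8]],
    'C': [['a', 'b', 'c', 'd', 'e'], ['q', 'w', 'e', 'r', 't', 'y']],
})

# Process each row's pair of lists, then write the results back.
pairs = [dedupe_pair(b, c) for b, c in zip(df['B'], df['C'])]
df['B'] = pd.Series([p[0] for p in pairs], index=df.index)
df['C'] = pd.Series([p[1] for p in pairs], index=df.index)
print(df)
```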

Join two dataframes based on two columns [duplicate]

I want to join two data frames, df1 and df2, on two columns. For example, in the following dataframes I want to join on columns a, b of df1 and columns a1, b1 of df2 and build a third dataframe.
import pandas as pd
import numpy as np
df1 = pd.DataFrame()
df2 = pd.DataFrame()
df3 = pd.DataFrame()
df1['a'] = [ 1, 2, 3 ]
df1['b'] = [ 2, 4, 6]
df1['c'] = [ 3, 5, 9]
df2['a1'] = [ 1, 2 ]
df2['b1'] = [ 4, 4]
df2['c1'] = [ 7, 5]
The output:
You can use pd.merge() with multiple keys, passing a, b through left_on and a1, b1 through right_on, as follows:
import pandas as pd
import numpy as np
df1 = pd.DataFrame()
df2 = pd.DataFrame()
df3 = pd.DataFrame()
df1['a'] = [1, 2, 3]
df1['b'] = [2, 4, 6]
df1['c'] = [3, 5, 9]
df2['a1'] = [1, 2]
df2['b1'] = [4, 4]
df2['c1'] = [7, 5]
df3 = pd.merge(df1, df2, left_on=['a', 'b'], right_on=['a1', 'b1'], how='inner')
print(df3)  # df3 has all columns of df1 and df2
#    a  b  c  a1  b1  c1
# 0  2  4  5   2   4   5
df3 = df3.drop(df2.columns, axis=1)  # remove the columns of df2, as they are duplicated
df3.columns = ['a2', 'b2', 'c3']  # rename the columns as requested
print(df3)
#    a2  b2  c3
# 0   2   4   5
For more information about pd.merge(), please see: https://pandas.pydata.org/docs/reference/api/pandas.merge.html
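If the duplicated key columns a1 and b1 are not wanted in the result at all, another option is to rename them in df2 before merging, so that a plain on= can be used. This is a sketch of that variation, not the answer's method; unlike the code above it keeps c1 from df2.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 4, 6], 'c': [3, 5, 9]})
df2 = pd.DataFrame({'a1': [1, 2], 'b1': [4, 4], 'c1': [7, 5]})

# Give the key columns of df2 the same names as in df1, then merge on both keys.
df3 = df1.merge(df2.rename(columns={'a1': 'a', 'b1': 'b'}),
                on=['a', 'b'], how='inner')
print(df3)
```

Only the row with a == 2 and b == 4 exists in both dataframes, so the result has a single row.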

How to calculate number of rows between 2 indexes of pandas dataframe

I have the following Pandas dataframe in Python:
import pandas as pd
d = {'col1': [1, 2, 3, 4, 5], 'col2': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data=d)
df.index=['A', 'B', 'C', 'D', 'E']
df
which gives the following output:
   col1  col2
A     1     6
B     2     7
C     3     8
D     4     9
E     5    10
I need to write a function (say the name will be getNrRows(fromIndex) ) that will take an index value as input and will return the number of rows between that given index and the last index of the dataframe.
For instance:
nrRows = getNrRows("C")
print(nrRows)
> 2
Because it takes 2 steps (rows) from the index C to the index E.
How can I write such a function in the most elegant way?
The simplest way might be the following, where row_index is the index label (for example 'C'):
len(df.loc[row_index:]) - 1
For your information, there is also the built-in Index method get_indexer_for:
len(df)-df.index.get_indexer_for(['C'])-1
Out[179]: array([2], dtype=int64)
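Wrapped into the function the question asks for, the same idea could look like the sketch below. It uses Index.get_loc and assumes the index labels are unique (with duplicate labels get_loc does not return a single integer). The name get_nr_rows and the extra df parameter are choices made for this example.

```python
import pandas as pd


def get_nr_rows(df, from_index):
    """Number of rows between the label `from_index` and the last row.

    Assumes unique index labels, so get_loc returns one integer position.
    """
    position = df.index.get_loc(from_index)
    return len(df) - position - 1


d = {'col1': [1, 2, 3, 4, 5], 'col2': [6, 7, 8, 9, 10]}
df = pd.DataFrame(data=d)
df.index = ['A', 'B', 'C', 'D', 'E']

nr_rows = get_nr_rows(df, 'C')
print(nr_rows)
```

A label that is not in the index raises KeyError, which may or may not be the desired behaviour.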

Replace values of a dataframe with the value of another dataframe

I have two pandas dataframes
df1 = pd.DataFrame({'A': [1, 3, 5], 'B': [3, 4, 5]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [8, 9, 10, 11, 12], 'C': ['K', 'D', 'E', 'F', 'G']})
The index of both dataframes is 'A'.
How can I replace the values of df1's column 'B' with the values of df2's column 'B'?
Desired result for df1:
A  B
1  8
3  10
5  12
Maybe Series.isin() is what you're looking for:
df1['B'] = df2[df2['A'].isin(df1['A'])]['B'].values
print(df1)
Prints:
   A   B
0  1   8
1  3  10
2  5  12
One possible solution:
wrk = df1.set_index('A').B
wrk.update(df2.set_index('A').B)
df1 = wrk.reset_index()
The result is:
A B
0 1 8
1 3 10
2 5 12
Another solution, based on merge:
df1 = df1.merge(df2[['A', 'B']], how='left', on='A', suffixes=['_x', ''])\
         .drop(columns=['B_x'])
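A further variation, not taken from the answers above, is to build a lookup Series from df2 and map df1's 'A' values through it. The sketch below assumes the values of 'A' in df2 are unique; any value of 'A' in df1 that is missing from df2 would become NaN.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 3, 5], 'B': [3, 4, 5]})
df2 = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                    'B': [8, 9, 10, 11, 12],
                    'C': ['K', 'D', 'E', 'F', 'G']})

# Lookup table: index is df2['A'], values are df2['B'].
lookup = df2.set_index('A')['B']

# Replace df1['B'] by looking up each value of df1['A'].
df1['B'] = df1['A'].map(lookup)
print(df1)
```

Unlike the isin-based answer, this does not depend on the rows of the two dataframes being in the same order.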

Drop Columns that starts with any of a list of strings Pandas

I'm trying to drop all columns from a df that start with any of a list of strings. I needed to copy these columns to their own dfs, and now want to drop them from a copy of the main df to make it easier to analyze.
df.columns = ["AAA1234", "AAA5678", "BBB1234", "BBB5678", "CCC123", "DDD123"...]
I ran some code that gave me these dataframes with these columns:
aaa.columns = ["AAA1234", "AAA5678"]
bbb.columns = ["BBB1234", "BBB5678"]
I did get the final df that I wanted, but my code felt rather clunky:
droplist_cols = [aaa, bbb]
droplist = []
for x in droplist_cols:
    for col in x.columns:
        droplist.append(col)
df1 = df.drop(labels=droplist, axis=1)
Columns of final df:
df1.columns = ["CCC123", "DDD123"...]
Is there a better way to do this?
--Edit for sample data--
df = pd.DataFrame([[1, 2, 3, 4, 5], [1, 3, 4, 2, 1], [4, 6, 9, 8, 3], [1, 3, 4, 2, 1], [3, 2, 5, 7, 1]], columns=["AAA1234", "AAA5678", "BBB1234", "BBB5678", "CCC123"])
Desired result:
CCC123
0 5
1 1
2 3
3 1
4 1
IIUC, let's begin with a dataframe like this:
df=pd.DataFrame({"A":[0]})
Modify the dataframe to include your columns:
df2=df.reindex(columns=["AAA1234", "AAA5678", "BBB1234", "BBB5678", "CCC123", "DDD123"], fill_value=0)
Drop all columns starting with A:
df3=df2.loc[:,~df2.columns.str.startswith('A')]
If you need to drop columns starting with, say, A or B, I would use:
df3=df2.loc[:,~(df2.columns.str.startswith('A')|df2.columns.str.startswith('B'))]
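For a longer list of prefixes, chaining one startswith call per prefix gets unwieldy. Python's built-in str.startswith accepts a tuple of prefixes, so one possible sketch on the question's sample data is the following. The variable names prefixes and keep are illustrative, and it assumes all column labels are strings.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 5], [1, 3, 4, 2, 1], [4, 6, 9, 8, 3],
     [1, 3, 4, 2, 1], [3, 2, 5, 7, 1]],
    columns=["AAA1234", "AAA5678", "BBB1234", "BBB5678", "CCC123"],
)

# Prefixes of the columns to drop; str.startswith takes a tuple of options.
prefixes = ("AAA", "BBB")

# Keep every column whose name starts with none of the prefixes.
keep = [col for col in df.columns if not col.startswith(prefixes)]
df1 = df[keep]
print(df1)
```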
