I have two DataFrames that look like this (note: I am still a beginner and trying to learn joins better):
import numpy as np
import pandas as pd

xx = pd.DataFrame(np.array([[13, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])
yy = pd.DataFrame(np.array([[1, 2, 3, 5], [4, 5, 6, 5], [7, 8, 9, 5]]),
                  columns=['aa', 'bb', 'cc', 'dd'])
I want to perform a left join so that I have a final table that looks like this:
aa  bb  cc  dd
 4   5   6   5
 7   8   9   5
I have come up with this so far
zz = pd.merge(yy, xx, how='left', left_on=['aa', 'bb'], right_on=['a', 'b'])
But this gives me the incorrect output.
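With the xx and yy above, zz comes out like this (the unmatched first row of yy gets NaN in the a, b, c columns, and the extra a, b, c columns are carried along):
print(zz)
#    aa  bb  cc  dd    a    b    c
# 0   1   2   3   5  NaN  NaN  NaN
# 1   4   5   6   5  4.0  5.0  6.0
# 2   7   8   9   5  7.0  8.0  9.0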
Can you please help me with what correction I need to make in order to get the desired output?
Any help will be very much appreciated
Based on the expected output, you have to do an inner join, not a left join. Also, to merge with on= the two DataFrames must share column names, so I've set the columns of xx to those of yy:
>>> xx.columns = ['aa', 'bb', 'cc']
>>> pd.merge(yy, xx, how='inner', on=['aa', 'bb', 'cc'])
   aa  bb  cc  dd
0   4   5   6   5
1   7   8   9   5
And this would be the output of a left join of yy with xx:
>>> pd.merge(yy, xx, how='left', on=['aa', 'bb', 'cc'])
   aa  bb  cc  dd
0   1   2   3   5
1   4   5   6   5
2   7   8   9   5
You need the DataFrames to have matching column headers, so another way is to rename the columns of xx before merging:
zz = pd.merge(xx.rename(columns={'a': 'aa', 'b': 'bb', 'c': 'cc'}), yy)
zz
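Another option (a variation of my own, not from the answers above): keep the original column names, pass left_on/right_on together with how='inner', and then keep only the yy columns:
zz = pd.merge(yy, xx, how='inner',
              left_on=['aa', 'bb', 'cc'], right_on=['a', 'b', 'c'])[['aa', 'bb', 'cc', 'dd']]
print(zz)
#    aa  bb  cc  dd
# 0   4   5   6   5
# 1   7   8   9   5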
I'm struggling to figure out how to do a couple of transformations with pandas. I want a new DataFrame with the sum of the values from the columns in the original. I also want to be able to merge two of these 'summed' DataFrames.
Example #1: Summing the columns
Before:
A B C D
1 4 7 0
2 5 8 1
3 6 9 2
After:
A B C D
6 15 24 3
Right now I'm getting the sums of the columns I'm interested in, storing them in a dictionary, and creating a dataframe from the dictionary. I feel like there is a better way to do this with pandas that I'm not seeing.
Example #2: merging 'summed' dataframes
Before:
A B C D F
6 15 24 3 1
A B C D E
1 2 3 4 2
After:
A B C D E F
7 17 27 7 2 1
First question:
Summing the columns
Use sum(), then convert the resulting Series to a DataFrame and transpose:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                    'C': [7, 8, 9], 'D': [0, 1, 2]})
df1 = df1.sum().to_frame().T
print(df1)
# Output:
   A   B   C  D
0  6  15  24  3
Second question:
Merging 'summed' dataframes
Use combine
df2 = pd.DataFrame({'A': [1], 'B': [2], 'C': [3], 'D': [4], 'E': [2]})
out = df1.combine(df2, sum, fill_value=0)
print(out)
# Output:
   A   B   C  D  E
0  7  17  27  7  2
For the first part, use DataFrame.sum() to sum the columns, convert the resulting Series to a DataFrame with .to_frame(), and finally transpose:
df_sum = df.sum().to_frame().T
Result:
print(df_sum)
   A   B   C  D
0  6  15  24  3
For the second part, use DataFrame.add() with the fill_value parameter, as follows:
df_sum2 = df1.add(df2, fill_value=0)
Result:
print(df_sum2)
   A   B   C  D    E    F
0  7  17  27  7  2.0  1.0
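A small follow-up of my own: the alignment introduces NaN before fill_value is applied, so columns that exist in only one frame come back as float. If you want the integer result shown in the question, you can cast back:
import pandas as pd

df1 = pd.DataFrame({'A': [6], 'B': [15], 'C': [24], 'D': [3], 'F': [1]})
df2 = pd.DataFrame({'A': [1], 'B': [2], 'C': [3], 'D': [4], 'E': [2]})

out = df1.add(df2, fill_value=0).astype(int)
print(out)
#    A   B   C  D  E  F
# 0  7  17  27  7  2  1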
Hi, I would like to change the names of some of the columns in my DataFrame.
When I print just the part I want to change them to, palColAdj.iloc[:, 73:].columns.str[:-2], I see the outcome I would like, but when I try to apply the change to my original DataFrame, I don't see the change.
So if I write either
palColAdj.iloc[:, 73:].columns=palColAdj.iloc[:, 73:].columns.str[:-2]
or
prodColAdj.iloc[:, 39:].columns=prodColAdj.iloc[:, 39:].columns.str[:-2].to_list()
and afterwards I print
prodColAdj.head()
I still see the original column names. How can this be?
Here's a way to do it.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['aaaa', 'bbbb', 'ccccc'])
# aaaa bbbb ccccc
# 0 1 2 3
# 1 4 5 6
# 2 7 8 9
cols = df.columns.values
rename_map = {}
for col in cols:
    rename_map[col] = col[:-2]
df.rename(rename_map, axis=1, inplace=True)
# aa bb ccc
# 0 1 2 3
# 1 4 5 6
# 2 7 8 9
To rename only specific columns, build the dict from a subset of the columns instead:
cols = df.columns.values[0:2]
# array(['aaaa', 'bbbb'], dtype=object)
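If you only want to trim the names from a certain position onward, as in the question, note that df.iloc[:, 73:] returns a new object, so assigning to its .columns never touches the original DataFrame. A sketch of one way that does work (with a small made-up frame and position standing in for palColAdj and column 73) is to rebuild the whole header and assign it back:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  columns=['keep', 'also_keep', 'aaaa', 'bbbb'])
start = 2  # position of the first column to rename (73 in the question)

# rebuild the full column index: untouched names + trimmed names
df.columns = list(df.columns[:start]) + list(df.columns[start:].str[:-2])
print(df.columns.tolist())
# ['keep', 'also_keep', 'aa', 'bb']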
I have the following DataFrame:
In [1]:
df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'],
                   'd': [5, 9, 1]})
df
Out [1]:
   a  b   c  d
0  1  2  dd  5
1  2  3  ee  9
2  3  4  ff  1
I would like to add a column 'e' which is the sum of columns 'a', 'b' and 'd'.
Going across forums, I thought something like this would work:
df['e'] = df[['a', 'b', 'd']].map(sum)
But it didn't.
I would like to know the appropriate operation with the list of columns ['a', 'b', 'd'] and df as inputs.
You can just call sum and set axis=1 to sum across the rows; this will skip non-numeric columns (in recent pandas versions you may need to pass numeric_only=True for that behaviour):
In [91]:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c':['dd','ee','ff'], 'd':[5,9,1]})
df['e'] = df.sum(axis=1)
df
Out[91]:
   a  b   c  d   e
0  1  2  dd  5   8
1  2  3  ee  9  14
2  3  4  ff  1   8
If you want to sum just specific columns, you can create a list of the columns and remove the ones you are not interested in:
In [98]:
col_list= list(df)
col_list.remove('d')
col_list
Out[98]:
['a', 'b', 'c']
In [99]:
df['e'] = df[col_list].sum(axis=1)
df
Out[99]:
   a  b   c  d  e
0  1  2  dd  5  3
1  2  3  ee  9  5
2  3  4  ff  1  7
If you have just a few columns to sum, you can write:
df['e'] = df['a'] + df['b'] + df['d']
This creates new column e with the values:
   a  b   c  d   e
0  1  2  dd  5   8
1  2  3  ee  9  14
2  3  4  ff  1   8
For longer lists of columns, EdChum's answer is preferred.
Create a list of the column names you want to add up, then:
df['total'] = df.loc[:, list_name].sum(axis=1)
If you only want the sum for certain rows, replace ':' with the row slice you need.
This is a simpler way using iloc to select which columns to sum:
df['f'] = df.iloc[:, 0:2].sum(axis=1)
df['g'] = df.iloc[:, [0, 1]].sum(axis=1)
df['h'] = df.iloc[:, [0, 3]].sum(axis=1)
Produces:
   a  b   c  d   e  f  g   h
0  1  2  dd  5   8  3  3   6
1  2  3  ee  9  14  5  5  11
2  3  4  ff  1   8  7  7   4
I can't find a way to combine a range and specific columns that works, e.g. something like:
df['i']=df.iloc[:,[[0:2],3]].sum(axis=1)
df['i']=df.iloc[:,[0:2,3]].sum(axis=1)
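One way that does work for mixing a range with extra positions (my own addition, not part of the original answer) is NumPy's np.r_ index helper, which builds the positional array for iloc:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'], 'd': [5, 9, 1]})

# np.r_[0:2, 3] expands to [0, 1, 3], i.e. columns a, b and d
df['i'] = df.iloc[:, np.r_[0:2, 3]].sum(axis=1)
print(df)
#    a  b   c  d   i
# 0  1  2  dd  5   8
# 1  2  3  ee  9  14
# 2  3  4  ff  1   8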
You can simply pass your dataframe into the following function:
def sum_frame_by_column(frame, new_col_name, list_of_cols_to_sum):
    frame[new_col_name] = frame[list_of_cols_to_sum].astype(float).sum(axis=1)
    return frame
Example:
I have a DataFrame (awards_frame) with several award columns (award_1, award_2, award_3), and I want to create a new column that shows the sum of awards for each row.
Usage:
I simply pass my awards_frame into the function, also specifying the name of the new column, and a list of column names that are to be summed:
sum_frame_by_column(awards_frame, 'award_sum', ['award_1','award_2','award_3'])
Result: awards_frame now has an extra award_sum column holding the row-wise sum of the three award columns.
The following syntax helped me when my columns are in sequence:
awards_frame.values[:, 1:4].sum(axis=1)
You can use the function aggregate or agg:
df[['a','b','d']].agg('sum', axis=1)
The advantage of agg is that you can use multiple aggregation functions:
df[['a','b','d']].agg(['sum', 'prod', 'min', 'max'], axis=1)
Output:
   sum  prod  min  max
0    8    10    1    5
1   14    54    2    9
2    8    12    1    4
The shortest and simplest way here is to use eval:
df.eval('e = a + b + d')
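A quick usage sketch (by default eval returns a new DataFrame rather than modifying df, so assign the result back or pass inplace=True):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'], 'd': [5, 9, 1]})

df = df.eval('e = a + b + d')  # returns a new frame with the extra column
print(df)
#    a  b   c  d   e
# 0  1  2  dd  5   8
# 1  2  3  ee  9  14
# 2  3  4  ff  1   8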
I have a DataFrame in which I want to switch the order of the values between columns in only the third row while keeping the other rows the same.
Under some condition I have to switch the order for my project; here is an example that probably has no real meaning.
Suppose the dataset is
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})
df
out[1]:
   A  B  C
0  0  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e
I want to have the output:
   A  B  C
0  0  5  a
1  1  6  b
2  7  2  c   <- values of A and B swapped in this row
3  3  8  d
4  4  9  e
How do I do it?
I have tried:
new_order = [1, 0, 2] # specify new order of the third row
i = 2 # specify row number
df.iloc[i] = df[df.columns[new_order]].loc[i] # reorder the third row only and assign new values to df
I can see from the output of the right-hand side that the values are reordered the way I want:
df[df.columns[new_order]].loc[i]
Out[2]:
B 7
A 2
C c
Name: 2, dtype: object
But when it is assigned back to df, nothing changes. I guess it's because of the label matching.
Can someone help me? Thanks in advance!
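One way to make that assignment stick (a sketch based on the setup above, not the only option) is to strip the labels off the right-hand side with .to_numpy(), so pandas cannot realign the values back to their original columns:
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})

new_order = [1, 0, 2]  # new order of the values in the target row
i = 2                  # row to change

# .to_numpy() drops the index labels, so the assignment is purely positional
df.iloc[i] = df.iloc[i, new_order].to_numpy()
print(df)
#    A  B  C
# 0  0  5  a
# 1  1  6  b
# 2  7  2  c
# 3  3  8  d
# 4  4  9  e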
I have a pandas DataFrame that I have grouped by a combination of three columns A, B, C.
grouped = df.groupby(["A", "B", "C"])
Several additional columns D, E, F, G are (guaranteed) identical for all elements of each group, while other columns X, Y vary within each group. (I already know which columns are fixed, and which vary.)
I would like to construct a dataframe containing one row per group, and consisting of the values for the invariant columns A, B, C, D, E, F, G only. What is the most straightforward way to do this? Since there are lots of identical values, I would prefer to specify which columns to omit, rather than the other way around.
I've come up with "aggregating" by choosing one row from each group, and then deleting the unwanted columns in a separate step:
thinned = grouped.aggregate(lambda x: x.iloc[0])
del thinned["X"], thinned["Y"]
The purpose of this is to combine the invariant values with several new summary values that I calculate, in a dataframe that has one row per (current) group.
thinned["newAA"] = grouped.apply(some_function)
thinned["newBB"] = grouped.apply(other_function)
...
But I suspect there must be a less round-about way.
You could use GroupBy.first() to select just the first record of each group. For example, this
import pandas
df = pandas.DataFrame({
    'A': [1, 1, 2, 2, 3, 3],
    'B': [1, 1, 1, 2, 2, 2],
    'C': [2, 2, 3, 3, 1, 1],
})
print(df.groupby(['A', 'B'])['C'].first())
results in
A  B
1  1    2
2  1    3
   2    3
3  2    1
Name: C, dtype: int64
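Applied to the layout described in the question (a sketch with made-up data, assuming the column names A..G, X, Y), taking first() of just the invariant columns gives one row per group without X and Y:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': [1, 1, 2], 'C': [1, 1, 2],
                   'D': [7, 7, 4], 'E': [7, 7, 4], 'F': [7, 7, 4], 'G': [7, 7, 4],
                   'X': [1, 2, 8], 'Y': [5, 7, 0]})

thinned = df.groupby(['A', 'B', 'C'], as_index=False)[['D', 'E', 'F', 'G']].first()
print(thinned)
#    A  B  C  D  E  F  G
# 0  1  1  1  7  7  7  7
# 1  2  2  2  4  4  4  4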
I think you need drop_duplicates:
import pandas as pd

df = pd.DataFrame({'A': [7, 4, 4],
                   'B': [7, 4, 4],
                   'C': [7, 4, 4],
                   'D': [7, 4, 4],
                   'E': [7, 4, 4],
                   'F': [7, 4, 4],
                   'G': [7, 4, 4],
                   'X': [1, 2, 8],
                   'Y': [5, 7, 0]})
print(df)
   A  B  C  D  E  F  G  X  Y
0  7  7  7  7  7  7  7  1  5
1  4  4  4  4  4  4  4  2  7
2  4  4  4  4  4  4  4  8  0
# filter by subset
cols = ["A", "B", "C", "D", "E", "F", "G"]
df1 = df.drop_duplicates(subset=cols)[cols]
print(df1)
   A  B  C  D  E  F  G
0  7  7  7  7  7  7  7
1  4  4  4  4  4  4  4
# remove unnecessary columns
df2 = df.drop(['X', 'Y'], axis=1).drop_duplicates()
print(df2)
   A  B  C  D  E  F  G
0  7  7  7  7  7  7  7
1  4  4  4  4  4  4  4
You have many options here, some more elegant than others.
First of all, do you care about 'X' and 'Y'? If you don't, since you're deleting them at the end anyway, you can simply use drop_duplicates:
new_df = df[['A', 'B', 'C', 'D', 'E', 'F', 'G']].drop_duplicates()
# this will keep only the unique values of the above columns
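And to tie it back to the stated purpose of combining the invariant values with new per-group summaries, a sketch (the sum and mean aggregations are just placeholders for some_function and other_function):
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': [1, 1, 2], 'C': [1, 1, 2],
                   'D': [7, 7, 4], 'E': [7, 7, 4], 'F': [7, 7, 4], 'G': [7, 7, 4],
                   'X': [1, 2, 8], 'Y': [5, 7, 0]})

invariant = df.drop(columns=['X', 'Y']).drop_duplicates()

# placeholder summaries standing in for some_function / other_function
summaries = df.groupby(['A', 'B', 'C'], as_index=False).agg(newAA=('X', 'sum'),
                                                            newBB=('Y', 'mean'))

result = invariant.merge(summaries, on=['A', 'B', 'C'])
print(result)
#    A  B  C  D  E  F  G  newAA  newBB
# 0  1  1  1  7  7  7  7      3    6.0
# 1  2  2  2  4  4  4  4      8    0.0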