Pandas combining dataframes based on column value - python

I am trying to turn multiple dataframes into a single one based on the values in the first column, but not every dataframe has the same values in the first column. Take this example:
df1:
A 4
B 6
C 8
df2:
A 7
B 4
F 3
full_df:
A 4 7
B 6 4
C 8
F 3
How do I do this using python and pandas?

You can use pandas merge with an outer join (replace 'first_column' with the name of your key column):
df1.merge(df2, on='first_column', how='outer')
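As a minimal, runnable sketch of the outer merge on the question's data: the column names `key`, `val1` and `val2` are made up for illustration, since the question does not name its columns.

```python
import pandas as pd

# Hypothetical column names: "key" stands in for the first column.
df1 = pd.DataFrame({"key": ["A", "B", "C"], "val1": [4, 6, 8]})
df2 = pd.DataFrame({"key": ["A", "B", "F"], "val2": [7, 4, 3]})

# An outer join keeps every key found in either frame;
# values missing on one side are filled with NaN.
full_df = df1.merge(df2, on="key", how="outer")
print(full_df)
```

Because NaN is a float, the value columns are likely to come back as floats rather than integers.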

You can use pd.concat, remembering to align indices:
res = pd.concat([df1.set_index(0), df2.set_index(0)], axis=1)
print(res)
1 1
A 4.0 7.0
B 6.0 4.0
C 8.0 NaN
F NaN 3.0
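Since the question mentions multiple dataframes, the same idea might be extended as sketched below. The third frame `df3` and the labels `"df1"`, `"df2"`, `"df3"` are assumptions added for illustration. Passing a dict to `pd.concat` uses its keys as column names, which also avoids the duplicated `1 1` header seen above.

```python
import pandas as pd

df1 = pd.DataFrame([["A", 4], ["B", 6], ["C", 8]])
df2 = pd.DataFrame([["A", 7], ["B", 4], ["F", 3]])
df3 = pd.DataFrame([["B", 1], ["F", 2]])  # hypothetical third frame

frames = {"df1": df1, "df2": df2, "df3": df3}

# Index each frame by its first column (label 0), keep the value column
# (label 1) as a Series, and align all Series side by side.
res = pd.concat(
    {name: frame.set_index(0)[1] for name, frame in frames.items()},
    axis=1,
)
print(res)
```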

Related

Create a new column in Pandas Dataframe based on the 'NaN' values in another column

I have a dataframe:
A B
1 NaN
2 3
4 NaN
5 NaN
6 7
I want to create a new column C containing the values from B that aren't NaN, and otherwise the values from A. This would be a simple matter in Excel; is it easy in Pandas?
Yes, it's simple. Use pandas.Series.where:
df['C'] = df['A'].where(df['B'].isna(), df['B'])
Output:
>>> df
A B C
0 1 NaN 1
1 2 3.0 3
2 4 NaN 4
3 5 NaN 5
4 6 7.0 7
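A bit cleaner (the new column still has to be created with bracket assignment, because df.C = ... only sets an attribute on the object and does not add a column):
df['C'] = df.A.where(df.B.isna(), df.B)
Alternatively:
df['C'] = df.B.where(df.B.notna(), df.A)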
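Another option, not mentioned in the answer, is `Series.fillna`, which accepts a Series and fills each NaN with the value at the same index. A small sketch on the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 4, 5, 6],
                   "B": [np.nan, 3, np.nan, np.nan, 7]})

# Take B where it is present; fall back to A where B is NaN.
df["C"] = df["B"].fillna(df["A"])
print(df)
```

Here C ends up as a float column, because it is derived from B, which holds NaN.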

Combining partially overlapping data frames on both axes pandas [duplicate]

What is the best way to combine multiple data frames that partially overlap on both axes?
I have already come up with a workable solution, but I'm not sure it's the best way, or whether I should be doing this at all.
So I have the following frames:
df1 = pd.DataFrame([[6,6],[7,7]],columns=['a','b'])
print(df1)
a b
0 6 6
1 7 7
df2 = pd.DataFrame([[7,7],[8,8]],columns=['b','c'], index=[1,2])
print(df2)
b c
1 7 7
2 8 8
Basically, the only overlapping data point is b1 (column b, index 1),
and I'd like to obtain the following:
a b c
0 6.0 6 NaN
1 7.0 7 7.0
2 NaN 8 8.0
If I do a regular concat I end up either with a duplicate on the index or on the columns. Now, the workaround I found is the following:
dfc = pd.concat([df1,df2],axis=0)
dfc = dfc.groupby(dfc.index).mean()
print(dfc)
a b c
0 6.0 6 NaN
1 7.0 7 7.0
2 NaN 8 8.0
I wonder if there is a better way to do it and, more generally, whether this is best practice when handling data.
I should also add that in my datasets the overlapping data is always an exact duplicate and I "should" never have different values if the indexes are the same.
Thank you!
You want to use combine_first:
df1.combine_first(df2)
output:
a b c
0 6.0 6 NaN
1 7.0 7 7.0
2 NaN 8 8.0
NB. if b1 is different in the two dataframes, this will take the value from df1
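The precedence rule in the note above can be checked with a small sketch. The value 99 is a made-up conflicting value for b1; everything else follows the question's frames.

```python
import pandas as pd

df1 = pd.DataFrame([[6, 6], [7, 7]], columns=["a", "b"])
# Same layout as df2 in the question, but b at index 1 is changed
# to 99 (hypothetical) so that the two frames disagree on b1.
df2 = pd.DataFrame([[99, 7], [8, 8]], columns=["b", "c"], index=[1, 2])

out = df1.combine_first(df2)   # df1 wins where both frames have a value
out2 = df2.combine_first(df1)  # df2 wins instead

print(out.loc[1, "b"], out2.loc[1, "b"])
```

If the overlapping values are supposed to be identical, comparing the two results on the shared cells is one way to spot unexpected conflicts.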

Adding particular rows and changing their positions

I have an example (the left one in the image description).
The first column contains several repeating index labels. However, in the third repetition one of the labels, 'GG', is missing (and not only in the third one, because my data has more than a thousand repeating intervals).
Question: I want to add the missing rows (like 'GG') with the value 'NaN'.
I also want to display the values in different columns, one for each repeated section (from 'II' to '//\n').
Is there any way to do this?
Assuming your data is a dataframe with all data as columns (if not, you first need to reset_index):
(df
#.reset_index() # uncomment if first column is index
.assign(cols=df.groupby('col1').cumcount())
.pivot(index='col1', columns='cols', values='col2')
)
output:
cols 0 1 2
col1
A 0.0 4.0 7.0
B 1.0 5.0 8.0
C 2.0 9.0 NaN
D 3.0 6.0 10.0
input:
col1 col2
0 A 0
1 B 1
2 C 2
3 D 3
4 A 4
5 B 5
6 D 6
7 A 7
8 B 8
9 C 9
10 D 10
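Note that `cumcount` numbers the occurrences of each label, so when a label is missing from one repetition its later values shift left (C's 9 lands in column 1 above). If the NaN should sit in the repetition where the row is actually missing, one possible variant is to number the repetitions themselves. This is only a sketch, and it assumes every repetition starts with the same label (`'A'` here).

```python
import pandas as pd

df = pd.DataFrame({"col1": list("ABCDABDABCD"), "col2": range(11)})

# Assumption: each repetition begins with the label 'A', so a running
# count of 'A' rows identifies the repetition each row belongs to.
block = df["col1"].eq("A").cumsum() - 1

wide = df.assign(block=block).pivot(index="col1", columns="block", values="col2")
print(wide)
```

With this numbering, the missing C ends up as NaN in the middle column rather than the last one.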

How to pivot and get the mean values of each column to rows

I'm new to Python and I need help getting a result in which the column names become rows and each row shows the average of that column's values.
Here's an example:
columns
A B C
1 2 3
4 5 6
7 8 9
Expected result:
avg
A 4
B 5
C 6
I can do it easily in Excel by placing the columns in "Values" and then moving the values to rows to get the average, but I can't seem to do it in Python.
df=pd.DataFrame({'A':[1,4,7],'B':[2,5,8],'C':[3,6,9]})
df
A B C
0 1 2 3
1 4 5 6
2 7 8 9
ser=df.mean() #Result is a Series
df=pd.DataFrame({'avg':ser}) #Convert this Series into DataFrame
df
avg
A 4.0
B 5.0
C 6.0
Using to_frame:
df.mean().to_frame('avg')
Out[186]:
avg
A 4.0
B 5.0
C 6.0
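If more than one statistic per column is wanted, closer to what an Excel pivot table offers, `DataFrame.agg` with a list of function names followed by a transpose might be a convenient variant. The extra `min` and `max` columns are added here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 4, 7], "B": [2, 5, 8], "C": [3, 6, 9]})

# agg with a list returns one row per statistic; transposing puts
# the original column names on the rows.
stats = df.agg(["mean", "min", "max"]).T
print(stats)
```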

Merge dataframes without duplicating rows in python pandas [duplicate]

I'd like to combine two dataframes using their shared column 'A':
>>> df1
A B
0 I 1
1 I 2
2 II 3
>>> df2
A C
0 I 4
1 II 5
2 III 6
To do so I tried using:
merged = pd.merge(df1, df2, on='A', how='outer')
Which returned:
>>> merged
A B C
0 I 1.0 4
1 I 2.0 4
2 II 3.0 5
3 III NaN 6
However, since df2 only contained one value for A == 'I', I do not want this value to be duplicated in the merged dataframe. Instead I would like the following output:
>>> merged
A B C
0 I 1.0 4
1 I 2.0 NaN
2 II 3.0 5
3 III NaN 6
What is the best way to do this? I am new to python and still slightly confused with all the join/merge/concatenate/append operations.
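Create a helper key g with cumcount, which numbers the repeated values of A within each frame, then merge on the shared columns (A and g) and drop g afterwards:
df1['g'] = df1.groupby('A').cumcount()
df2['g'] = df2.groupby('A').cumcount()
df1.merge(df2, how='outer').drop(columns='g')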
Out[62]:
A B C
0 I 1.0 4.0
1 I 2.0 NaN
2 II 3.0 5.0
3 III NaN 6.0
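A self-contained version of the same approach is sketched below. It uses `assign` so the original frames are not modified; that is a design choice made here, not part of the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["I", "I", "II"], "B": [1, 2, 3]})
df2 = pd.DataFrame({"A": ["I", "II", "III"], "C": [4, 5, 6]})

# Number repeated keys within each frame: the first 'I' gets 0, the second 1.
left = df1.assign(g=df1.groupby("A").cumcount())
right = df2.assign(g=df2.groupby("A").cumcount())

# Merging on (A, g) pairs the n-th occurrence of a key on the left with
# the n-th occurrence on the right, so values of C are not duplicated.
merged = left.merge(right, on=["A", "g"], how="outer").drop(columns="g")
print(merged)
```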
