Concatenating Pandas DataFrames with overlapping indexes but never overlapping values - python

I have two DataFrames of arbitrary shape of the type:
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 NaN
3 A3 NaN NaN
4 A4 NaN NaN
and
A B C
2 NaN NaN C2
3 NaN B3 C3
4 NaN B4 C4
5 A5 B5 C5
6 A6 B6 C6
The two DataFrames have overlapping indexes. Where there is an overlap, for a given column, there is a non-NaN in one DataFrame, and a NaN in the other. How can I concatenate these such that I can achieve a DataFrame with all values and no NaNs:
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
3 A3 B3 C3
4 A4 B4 C4
5 A5 B5 C5
6 A6 B6 C6
My proposed solution is:
df3 = pd.concat([pd.concat([df1[col].dropna(), df2[col].dropna()]) for col in df1.columns], axis=1)
However, ideally I would not work column-by-column.

Use combine_first:
df = df1.combine_first(df2)
print(df)
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
3 A3 B3 C3
4 A4 B4 C4
5 A5 B5 C5
6 A6 B6 C6
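A self-contained sketch of the combine_first call above. The construction of df1 and df2 is an assumption rebuilt from the printed frames in the question, and the name combined is only illustrative:

```python
import numpy as np
import pandas as pd

# Frames rebuilt from the question's printout (assumed construction).
df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3", "A4"],
        "B": ["B0", "B1", "B2", np.nan, np.nan],
        "C": ["C0", "C1", np.nan, np.nan, np.nan],
    },
    index=[0, 1, 2, 3, 4],
)
df2 = pd.DataFrame(
    {
        "A": [np.nan, np.nan, np.nan, "A5", "A6"],
        "B": [np.nan, "B3", "B4", "B5", "B6"],
        "C": ["C2", "C3", "C4", "C5", "C6"],
    },
    index=[2, 3, 4, 5, 6],
)

# combine_first keeps df1's values, fills its NaNs from df2,
# and adds the rows that exist only in df2.
combined = df1.combine_first(df2)
print(combined)
```

Because no cell is non-NaN in both frames, the order of the two operands should not matter for this data.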

Using df.fillna() and pd.concat() with dropna() (DataFrame.append was removed in pandas 2.0):
pd.concat([df1.fillna(df2), df2]).dropna()
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
3 A3 B3 C3
4 A4 B4 C4
5 A5 B5 C5
6 A6 B6 C6

Related

Pandas combine two dataframes to update values of a particular column in 1st dataframe

I have two dataframes:
df1 :
A B C
0 a0 b0 c0
1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
4 a4 b4 c4
df2 :
A B C
0 a0 b0 c11
1 a1 b1 c5
2 a70 b2 c20
3 a3 b9 c9
In df1, for every row, whenever Column A and Column B values are equal to values in df2, column C should be updated with value from df2.
Output:
A B C
0 a0 b0 c11
1 a1 b1 c5
2 a2 b2 c2
3 a3 b3 c3
4 a4 b4 c4
I tried the following, but it did not work.
df1.set_index(['A', 'B'])
df2.set_index(['A', 'B'])
df1.update(df2)
df1.reset_index()
df2.reset_index()
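Use numpy.where on the rows both frames share, and assign through .loc instead of chained indexing (this assumes both frames use the default integer index):
df1.loc[:3, "C"] = np.where((df1.loc[:3, "A"] == df2["A"]) & (df1.loc[:3, "B"] == df2["B"]), df2["C"], df1.loc[:3, "C"])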
A B C
0 a0 b0 c11
1 a1 b1 c5
2 a2 b2 c2
3 a3 b3 c3
4 a4 b4 c4
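The set_index/update attempt in the question has no effect because set_index returns a new DataFrame and the result is never kept. A minimal sketch that keeps the indexed frames, assuming the (A, B) pairs are unique within each frame; the names keyed and result are illustrative:

```python
import pandas as pd

# Frames rebuilt from the question's printout (assumed construction).
df1 = pd.DataFrame(
    {
        "A": ["a0", "a1", "a2", "a3", "a4"],
        "B": ["b0", "b1", "b2", "b3", "b4"],
        "C": ["c0", "c1", "c2", "c3", "c4"],
    }
)
df2 = pd.DataFrame(
    {
        "A": ["a0", "a1", "a70", "a3"],
        "B": ["b0", "b1", "b2", "b9"],
        "C": ["c11", "c5", "c20", "c9"],
    }
)

# Keep the re-indexed frame: set_index does not modify df1 in place.
keyed = df1.set_index(["A", "B"])

# update aligns on the (A, B) index and overwrites C only where df2
# has a matching key; keys found only in df2 are ignored.
keyed.update(df2.set_index(["A", "B"]))

result = keyed.reset_index()
print(result)
```

Unlike the positional comparison, this version does not depend on the matching rows sitting at the same positions in both frames.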

Compare result to the previous result after grouping

I have a df like this:
A B C D E F
2 a1 a2 a3 a4 100
2 a1 b2 c3 a4 100 # note
2 b1 b2 b3 b4 100
2 c1 c2 c3 c4 100
1 a1 a2 a3 a4 120
2 a1 b2 c3 a4 150 # note
1 b1 b2 b3 b4 130
1 c1 c2 c3 c4 110
0 a1 a2 a3 a4 80
I want to compare the results of F column where the columns B-E match based on A column like so:
A B C D E F diff
2 a1 a2 a3 a4 100 120/100
2 a1 b2 c3 a4 100 # note 150/100
2 b1 b2 b3 b4 100 130/100
2 c1 c2 c3 c4 100 110/100
1 a1 a2 a3 a4 120 80/120
1 a1 b2 c3 a4 150 # note
1 b1 b2 b3 b4 130
1 c1 c2 c3 c4 110
0 a1 a2 a3 a4 80
Since the first row has the same B-E values as the first row where A is 1, I compute 120/100.
What I've tried:
df.groupby(['B', 'C', 'D', 'E']) - this groups the data, but I don't know how to apply the logic of looking up the previous value based on column A. Or maybe there is a simpler way of achieving it.
Use DataFrameGroupBy.shift with Series.div:
df['d'] = df.groupby(['B', 'C', 'D', "E"])['F'].shift(-1).div(df['F'])
print (df)
A B C D E F d
0 2 a1 a2 a3 a4 100 1.200000
1 2 a1 b2 c3 a4 100 1.500000
2 2 b1 b2 b3 b4 100 1.300000
3 2 c1 c2 c3 c4 100 1.100000
4 1 a1 a2 a3 a4 120 0.666667
5 2 a1 b2 c3 a4 150 NaN
6 1 b1 b2 b3 b4 130 NaN
7 1 c1 c2 c3 c4 110 NaN
8 0 a1 a2 a3 a4 80 NaN
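A runnable sketch of the shift-and-divide answer above. The frame is an assumption rebuilt from the answer's printed output, and next_f is an illustrative intermediate name:

```python
import pandas as pd

# Frame rebuilt from the printed output above (assumed construction).
df = pd.DataFrame(
    {
        "A": [2, 2, 2, 2, 1, 2, 1, 1, 0],
        "B": ["a1", "a1", "b1", "c1", "a1", "a1", "b1", "c1", "a1"],
        "C": ["a2", "b2", "b2", "c2", "a2", "b2", "b2", "c2", "a2"],
        "D": ["a3", "c3", "b3", "c3", "a3", "c3", "b3", "c3", "a3"],
        "E": ["a4", "a4", "b4", "c4", "a4", "a4", "b4", "c4", "a4"],
        "F": [100, 100, 100, 100, 120, 150, 130, 110, 80],
    }
)

# Within each (B, C, D, E) group, pull the F value of the next row up
# by one position; the last row of every group gets NaN.
next_f = df.groupby(["B", "C", "D", "E"])["F"].shift(-1)

# Ratio of the next F in the group to the current F.
df["d"] = next_f.div(df["F"])
print(df)
```

The result depends on row order within each group, so this sketch assumes the rows are already ordered as the comparison requires.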

Combine multiple columns and rows into one column

I am trying to combine multiple columns and rows into one column. The order goes from left to right in row 1, then moves to row 2 and follows the same sequence again. Here is an example:
Current df:
Shop A B C D E
1 A1 B1 C1 D1 E1
2 A2 B2 C2 D2 E2
3 A3 B3 C3 D3 E3
4 A4 B4 C4 D4 E4
5 A5 B5 C5 D5 E5
Desired output:
Combined
A1
B1
C1
D1
E1
A2
B2
C2
.
.
.
D5
E5
Any help would be appreciated!! Thanks in advance!!
If Shop is the index name rather than a column, use DataFrame.stack with Series.reset_index and Series.to_frame:
df = df.stack().reset_index(drop=True).to_frame('Combined')
print (df)
Combined
0 A1
1 B1
2 C1
3 D1
4 E1
5 A2
6 B2
7 C2
8 D2
9 E2
10 A3
11 B3
12 C3
13 D3
14 E3
15 A4
16 B4
17 C4
18 D4
19 E4
20 A5
21 B5
22 C5
23 D5
24 E5
Otherwise, if Shop is a column, call DataFrame.set_index as the first step:
df = df.set_index('Shop').stack().reset_index(drop=True).to_frame('Combined')
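A self-contained sketch of the set_index/stack chain above, assuming Shop is an ordinary column. The frame is rebuilt from the question's table, and the names letters and combined are illustrative:

```python
import pandas as pd

letters = ["A", "B", "C", "D", "E"]

# Frame rebuilt from the question: Shop 1..5, cells like "A1", "B1", ...
df = pd.DataFrame(
    {"Shop": list(range(1, 6)), **{c: [f"{c}{i}" for i in range(1, 6)] for c in letters}}
)

combined = (
    df.set_index("Shop")       # keep Shop out of the values being stacked
      .stack()                 # row by row, left to right, into one Series
      .reset_index(drop=True)  # discard the (Shop, column) labels
      .to_frame("Combined")
)
print(combined)
```

stack walks each row from left to right before moving to the next one, which is the order the question asks for.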
You can use transpose and then melt:
df = df.set_index("Shop") # should not be needed
df.transpose().melt()
Output:
Shop value
0 1 A1
1 1 B1
2 1 C1
3 1 D1
4 1 E1
5 2 A2
6 2 B2
7 2 C2
8 2 D2
9 2 E2
10 3 A3
...
And you can now of course drop 'Shop', if you'd like to:
df.transpose().melt().drop("Shop", axis=1)
Output:
value
0 A1
1 B1
2 C1
3 D1
4 E1
5 A2
6 B2
7 C2
8 D2
9 E2
10 A3
...
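The transpose-then-melt approach above can be sketched end to end as follows. The frame is an assumption rebuilt from the question, and the names wide, long_form and combined are illustrative:

```python
import pandas as pd

letters = ["A", "B", "C", "D", "E"]

# Frame rebuilt from the question, with Shop as an ordinary column.
df = pd.DataFrame(
    {"Shop": list(range(1, 6)), **{c: [f"{c}{i}" for i in range(1, 6)] for c in letters}}
)

wide = df.set_index("Shop")

# After transposing, each shop is a column; melt then lists the columns
# one after another, so the values come out shop by shop.
long_form = wide.transpose().melt()

# The variable column takes its name from the transposed columns ("Shop").
combined = long_form.drop(columns="Shop")
print(combined)
```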

how to make a sample for a grouped dataframe in pandas

My question is how to sample from a grouped DataFrame in pandas. I grouped a dataset with groupby, and the grouped dataset looks like this; each bikeid has several trips:
bikeid tripid A B C
0 1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
1 4 a4 b4 c4
5 a5 b5 c5
2 6 ..............
7 ..............
3 8 ..............
9 ..............
What I want is to build a sample by picking one bikeid out of every 3 bikeids. It should look like:
bikeid tripid A B C
0 1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
3 8 a8 b8 c8
9 a9 b9 c9
6 ..............
..............
9
...
However, when I tried grouped_new = grouped.sample(frac=0.3),
the sample was drawn per individual trip, not per bikeid.
Can anyone help me out with this? Thank you so much!!!
Assuming you have a MultiIndex then consider using df.reindex() as shown below:
import io
import numpy as np
import pandas as pd
# your original dataframe (recent pandas versions expect literal JSON to be wrapped in StringIO)
df = pd.read_json(io.StringIO('{"bikeid": {"0":0,"1":0,"2":0,"3":1,"4":1,"5":2,"6":2,"7":3,"8":3},"tripid": {"0":1,"1":2,"2":3,"3":4,"4":5,"5":6,"6":7,"7":8,"8":9},"A":{"0":"a1","1":"a2","2":"a3","3":"a4","4":"a5","5":"a6","6":"a7","7":"a8","8":"a9"},"B":{"0":"b1","1":"b2","2":"b3","3":"b4","4":"b5","5":"b6","6":"b7","7":"b8","8":"b9"},"C":{"0":"c1","1":"c2","2":"c3","3":"c4","4":"c5","5":"c6","6":"c7","7":"c8","8":"c9"}}'))
df.set_index(['bikeid', 'tripid'], inplace=True)
# df looks like the following
A B C
bikeid tripid
0 1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
1 4 a4 b4 c4
5 a5 b5 c5
2 6 a6 b6 c6
7 a7 b7 c7
3 8 a8 b8 c8
9 a9 b9 c9
# define the labels you want to get out of your indexing operation
index_labels = np.arange(0, 100, 3)
# do the indexing
df.reindex(index_labels, level='bikeid')
# returns the following
A B C
bikeid tripid
0 1 a1 b1 c1
2 a2 b2 c2
3 a3 b3 c3
3 8 a8 b8 c8
9 a9 b9 c9
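An alternative sketch that selects whole bikeids with a boolean mask instead of reindex. This is a swapped-in technique rather than the answer's own method; the frame construction and the names bike_ids, systematic and random_sample are assumptions for illustration:

```python
import pandas as pd

# Frame rebuilt from the answer's printout (assumed construction).
df = pd.DataFrame(
    {
        "bikeid": [0, 0, 0, 1, 1, 2, 2, 3, 3],
        "tripid": [1, 2, 3, 4, 5, 6, 7, 8, 9],
        "A": [f"a{i}" for i in range(1, 10)],
        "B": [f"b{i}" for i in range(1, 10)],
        "C": [f"c{i}" for i in range(1, 10)],
    }
).set_index(["bikeid", "tripid"])

# Distinct bike ids, in order of first appearance.
bike_ids = df.index.get_level_values("bikeid").unique()

# Systematic sample: every third bike id, keeping all of its trips.
every_third = bike_ids[::3]
systematic = df[df.index.get_level_values("bikeid").isin(every_third)]
print(systematic)

# Random sample: draw ids (not rows), then keep all trips of the drawn ids.
sampled_ids = pd.Series(bike_ids).sample(frac=0.5, random_state=0)
random_sample = df[df.index.get_level_values("bikeid").isin(sampled_ids)]
print(random_sample)
```

Sampling the ids first and filtering afterwards is what keeps every trip of a chosen bike together, which sampling the rows directly does not do.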

Pandas append dataframe without creating new columns

I have two dataframes that look like this:
df1=
A B
1 A1 B1
2 A2 B2
3 A3 B3
df2 =
A C
4 A4 C4
5 A5 C5
I would like to append df2 to df1, like so:
A B
1 A1 B1
2 A2 B2
3 A3 B3
4 A4 NaN
5 A5 NaN
(Note: I've edited the dataframes so that not all the columns in df1 are necessarily in df2)
Whether I use concat or append, the resulting dataframe has a column called "C" with the first three rows filled with NaN. I just want to keep the two original columns in df1, with the new values appended. Is there a way to concatenate the dataframes without having to drop the extra column afterwards?
You can first filter columns for appending by subset:
print (df2[['A']])
A
4 A4
5 A5
print (pd.concat([df1, df2[['A']]]))
A B
1 A1 B1
2 A2 B2
3 A3 B3
4 A4 NaN
5 A5 NaN
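On pandas versions before 2.0, DataFrame.append gives the same result (the method was removed in 2.0):
print (df1.append(df2[['A']]))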
A B
1 A1 B1
2 A2 B2
3 A3 B3
4 A4 NaN
5 A5 NaN
If df2 also has a column B, select both columns:
print (df2[['A','B']])
A B
4 A4 B4
5 A5 B5
print (pd.concat([df1, df2[['A','B']]]))
A B
1 A1 B1
2 A2 B2
3 A3 B3
4 A4 B4
5 A5 B5
Or, on pandas versions before 2.0:
print (df1.append(df2[['A','B']]))
A B
1 A1 B1
2 A2 B2
3 A3 B3
4 A4 B4
5 A5 B5
EDIT by comment:
If df1 and df2 have different columns, use Index.intersection:
print (df1)
A B D
1 A1 B1 R
2 A2 B2 T
3 A3 B3 E
print (df2)
A B C
4 A4 B4 C4
5 A5 B5 C5
print (df1.columns.intersection(df2.columns))
Index(['A', 'B'], dtype='object')
print (pd.concat([df1, df2[df1.columns.intersection(df2.columns)]]))
A B D
1 A1 B1 R
2 A2 B2 T
3 A3 B3 E
4 A4 B4 NaN
5 A5 B5 NaN
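Actually the solution is in an obscure corner of this page. On pandas versions before 1.0, the code to use is:
pd.concat([df1,df2],join_axes=[df1.columns])
The join_axes argument was removed in pandas 1.0. Reindexing df2 to df1's columns before concatenating gives the same result:
pd.concat([df1, df2.reindex(columns=df1.columns)])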
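A self-contained sketch of two ways to append df2 while keeping only df1's columns, using the frames from the question. The construction code and the names shared, via_intersection and via_reindex are assumptions for illustration:

```python
import pandas as pd

# Frames rebuilt from the question (df2 has column C instead of B).
df1 = pd.DataFrame({"A": ["A1", "A2", "A3"], "B": ["B1", "B2", "B3"]}, index=[1, 2, 3])
df2 = pd.DataFrame({"A": ["A4", "A5"], "C": ["C4", "C5"]}, index=[4, 5])

# Option 1: keep only the columns the two frames have in common.
shared = df1.columns.intersection(df2.columns)
via_intersection = pd.concat([df1, df2[shared]])

# Option 2: reshape df2 to df1's columns first; columns missing from
# df2 (here B) are created and filled with NaN.
via_reindex = pd.concat([df1, df2.reindex(columns=df1.columns)])

print(via_intersection)
print(via_reindex)
```

Both variants end with columns A and B only, so column C never has to be dropped afterwards.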
