Efficiently combine dataframes on 2nd level index - python

I have two dataframes that look like this:
import pandas as pd

df1 = pd.DataFrame([2.1, 4.2, 6.3, 8.4, 10.5], index=[2, 4, 6, 8, 10])
df1.index.name = 't'

df2 = pd.DataFrame(index=pd.MultiIndex.from_tuples(
    [('A', 'a', 1), ('A', 'a', 4),
     ('A', 'b', 5), ('A', 'b', 6),
     ('B', 'c', 7), ('B', 'c', 9),
     ('B', 'd', 10), ('B', 'd', 11)],
    names=('big', 'small', 't')))
I am searching for an efficient way to combine them such that I get:
                 0
big small t
A   a     1    NaN
          2    2.1
          4    4.2
    b     5    NaN
          6    6.3
B   c     7    NaN
          8    8.4
          9    NaN
    d     10  10.5
          11   NaN
That is, I want index levels 0 and 1 of df2 ('big' and 'small') to appear as index levels 0 and 1 on the rows coming from df1 as well. A loop over the dataframes would of course work, but that is not feasible for large dataframes.
EDIT:
It appears from the comments that I should clarify: the missing 'big' and 'small' index values for the rows coming from df1 should be inferred from the ordering of 't'.

Assuming that you want the unknown index levels to be inferred based on the ordering of 't', we can use an outer merge, sort by 't', and then re-create the MultiIndex with ffill logic (this requires converting each level to a Series, since Index has no ffill).
# outer-merge so the t values from df1 are added as new rows,
# then sort by t so that ffill can propagate 'big' and 'small'
res = (df2.reset_index()
          .merge(df1, on='t', how='outer')
          .set_index(df2.index.names)
          .sort_index(level='t'))

# rebuild the MultiIndex, forward-filling the missing level values
# (get_level_values returns an Index, hence the Series conversion)
res.index = pd.MultiIndex.from_arrays(
    [pd.Series(res.index.get_level_values(i)).ffill()
     for i in range(res.index.nlevels)],
    names=res.index.names)

print(res)
                 0
big small t
A   a     1    NaN
          2    2.1
          4    4.2
    b     5    NaN
          6    6.3
B   c     7    NaN
          8    8.4
          9    NaN
    d     10  10.5
          11   NaN

Try extracting the level values and reindexing (.values strips the index, so the assignment back to df2 does not try to align on it):
df2['0'] = df1.reindex(df2.index.get_level_values('t'))[0].values
Output:
                 0
big small t
A   a     1    NaN
          4    4.2
    b     5    NaN
          6    6.3
B   c     7    NaN
          9    NaN
    d     10  10.5
          11   NaN
For more columns in df1, we can just merge:
(df2.reset_index()
    .merge(df1, on='t', how='left')
    .set_index(df2.index.names))
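Note that this left merge keeps only the rows of df2; if the rows that exist only in df1 (t = 2 and 8 above) should appear as well, use how='outer' together with the ffill re-indexing step from the other answer.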

Related

Positional indexing with NA values

I need to index a dataframe positionally, but I have NA values from a previous operation that I want to preserve. How can I achieve this?
df1
NaN
1
NaN
NaN
NaN
6
df2
0 10
1 15
2 13
3 15
4 16
5 17
6 17
7 18
8 10
df3
0 15
1 17
The output I want
NaN
15
NaN
NaN
NaN
17
df2.iloc(df1)
IndexError: indices are out-of-bounds
The .iloc call above raises an IndexError, so I think .iloc is not usable here. df3 is another output, generated with .loc, but I don't know how to put the NaN values back between its rows. An answer that works from df1 and df3 would also be fine.
If df1 and df2 have the same index, you can replace the non-missing values of df1 with the corresponding values from df2 using DataFrame.mask with DataFrame.isna:
df1 = df2.mask(df1.isna())
print (df1)
col
0 NaN
1 15.0
2 NaN
3 NaN
4 NaN
5 17.0
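If the goal is genuinely positional lookup, i.e. using the non-missing values of df1 as integer positions into df2, here is a minimal sketch under that assumption (the frames are reconstructed from the question as single-column Series):
import numpy as np
import pandas as pd

df1 = pd.Series([np.nan, 1, np.nan, np.nan, np.nan, 6])
df2 = pd.Series([10, 15, 13, 15, 16, 17, 17, 18, 10])

out = df1.copy()
notna = df1.notna()
# use the non-NaN values of df1 as positional indices into df2
out[notna] = df2.to_numpy()[df1[notna].astype(int).to_numpy()]
print(out)
This yields NaN, 15, NaN, NaN, NaN, 17, matching the requested output.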

pandas concat two dataframes of different row size without nan values

I'm concatenating two pandas dataframes that have the exact same columns but a different number of rows. I'd like to stack the first dataframe on top of the second.
When I do the following, I get many NaN values in some of the columns. I've tried the fix from this post, using .reset_index, but I'm still getting NaN values. The first dataframe, rem_dup_pre, has shape (54178, 11) and the second, rem_dup_po, has shape (83502, 11).
I've tried this:
concat_mil = pd.concat([rem_dup_pre.reset_index(drop=True), rem_dup_po.reset_index(drop=True)], axis=0)
and I get NaN values, for example in 'Station Type', where previously there were no NaN values in either rem_dup_pre or rem_dup_po.
How can I simply concat them without NaN values?
Here's how I did it, and I don't get any additional NaNs:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                    'b': ['a', 'b', 'c', 'd', np.nan, np.nan],
                    'c': ['x', np.nan, np.nan, np.nan, 'y', 'z']})
df2 = pd.DataFrame(np.random.randint(0, 10, (3, 3)), columns=list('abc'))
print(df1)
print(df2)

df = pd.concat([df1, df2]).reset_index(drop=True)
print(df)
The output of this is:
DF1:
a b c
0 1 a x
1 2 b NaN
2 3 c NaN
3 4 d NaN
4 5 NaN y
5 6 NaN z
DF2:
a b c
0 4 8 4
1 8 4 4
2 2 8 1
DF: after concat
a b c
0 1 a x
1 2 b NaN
2 3 c NaN
3 4 d NaN
4 5 NaN y
5 6 NaN z
6 4 8 4
7 8 4 4
8 2 8 1
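Since concat itself doesn't create NaNs when the columns really match, a plausible culprit (an assumption, since the question doesn't show the column names) is that the names differ slightly between the two frames, e.g. by trailing whitespace, so concat aligns them as distinct columns. A quick check:
# compare the column names of the two frames from the question
print(rem_dup_pre.columns.equals(rem_dup_po.columns))
print(set(rem_dup_pre.columns) ^ set(rem_dup_po.columns))  # names present in only one frame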

Transform a big dataframe with many null values into a smaller one indicating the non-null columns

I have a big dataframe with 4 columns, where most rows have 3 null values; sometimes there are 2, 1, or even 0 nulls, but usually 3. I want to transform it into a two-column dataframe holding, in each row, a non-null value and the name of the column it was extracted from.
Example: How to transform this dataframe
df
Out[1]:
a b c d
0 1.0 NaN NaN NaN
1 NaN 2.0 NaN NaN
2 NaN NaN 3.0 2.0
3 NaN NaN 1.0 NaN
to this one:
resultDF
Out[2]:
value columnName
0 1 a
1 2 b
2 3 c
3 2 d
4 1 c
The goal is to do it without looping on rows. Is this possible?
You can use pd.melt to reshape the dataframe:
import pandas as pd

# read the example data
df = pd.read_csv('test.csv')

df = df.melt(value_vars=['a', 'b', 'c', 'd'], var_name='foo', value_name='foo_value')
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)
print(df)
Output :
foo foo_value
0 a 1.0
1 b 2.0
2 c 3.0
3 c 1.0
4 d 2.0
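An alternative sketch using stack on the example frame above; stack drops the NaNs during the reshape (its default behaviour in current pandas, though this changes with future_stack) and happens to produce exactly the row-wise order the question asked for:
res = (df.stack()                       # drops NaNs by default
         .rename_axis(['row', 'columnName'])
         .reset_index(name='value')
         [['value', 'columnName']])
print(res)
#    value columnName
# 0    1.0          a
# 1    2.0          b
# 2    3.0          c
# 3    2.0          d
# 4    1.0          c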

How to merge multiple dataframe columns within a common dataframe in pandas in the fastest way possible?

I need to perform the following operations on a pandas dataframe df inside a for loop with 50 or more iterations:
Column 'X' of df has to be merged with column 'X' of df1,
Column 'Y' of df has to be merged with column 'Y' of df2,
Column 'Z' of df has to be merged with column 'Z' of df3,
Column 'W' of df has to be merged with column 'W' of df4.
The columns common to all 5 dataframes (df, df1, df2, df3 and df4) are A, B, C and D.
EDIT
All dataframes differ in shape: df is the master dataframe with the maximum number of rows, and the other 4 dataframes each have fewer rows than df (and differ from each other). So while merging columns, the rows of the two dataframes need to be matched up first.
Input df
A B C D X Y Z W
1 2 3 4 nan nan nan nan
2 3 4 5 nan nan nan nan
5 9 7 8 nan nan nan nan
4 8 6 3 nan nan nan nan
df1
A B C D X Y Z W
2 3 4 5 100 nan nan nan
4 8 6 3 200 nan nan nan
df2
A B C D X Y Z W
1 2 3 4 nan 50 nan nan
df3
A B C D X Y Z W
1 2 3 4 nan nan 1000 nan
4 8 6 3 nan nan 2000 nan
df4
A B C D X Y Z W
2 3 4 5 nan nan nan 25
5 9 7 8 nan nan nan 35
4 8 6 3 nan nan nan 45
Output df
A B C D X Y Z W
1 2 3 4 nan 50 1000 nan
2 3 4 5 100 nan nan 25
5 9 7 8 nan nan nan 35
4 8 6 3 200 nan 2000 45
What is the most efficient and fastest way to achieve this? I tried 4 separate combine_first statements, but that doesn't seem to be the most efficient approach. Can it be done in just 1 line of code instead?
Any help will be appreciated. Many thanks in advance.
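For reference, the combine_first approach mentioned above can be collapsed into a single reduce over the frames aligned on the key columns; a sketch, assuming A, B, C, D together uniquely identify rows:
from functools import reduce

keys = ['A', 'B', 'C', 'D']
frames = [df, df1, df2, df3, df4]

# align everything on the keys, then let each frame fill the gaps
# left by the previous ones
out = reduce(lambda acc, d: acc.combine_first(d),
             (d.set_index(keys) for d in frames)).reset_index()
Note that combine_first returns the sorted union of the indexes, so df's original row order may need to be restored afterwards.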

Pandas combining dataframes based on column value

I am trying to turn multiple dataframes into a single one based on the values in the first column, but not every dataframe has the same values in the first column. Take this example:
df1:
A 4
B 6
C 8
df2:
A 7
B 4
F 3
full_df:
A 4 7
B 6 4
C 8
F 3
How do I do this using python and pandas?
You can use pandas merge with an outer join:
df1.merge(df2, on=['first_column'], how='outer')
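A fuller sketch of that merge, with made-up column names since the frames in the question have no headers:
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'val1': [4, 6, 8]})
df2 = pd.DataFrame({'key': ['A', 'B', 'F'], 'val2': [7, 4, 3]})

full_df = df1.merge(df2, on='key', how='outer')
print(full_df)
#   key  val1  val2
# 0   A   4.0   7.0
# 1   B   6.0   4.0
# 2   C   8.0   NaN
# 3   F   NaN   3.0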
You can use pd.concat, remembering to align indices:
res = pd.concat([df1.set_index(0), df2.set_index(0)], axis=1)
print(res)
     1    1
A  4.0  7.0
B  6.0  4.0
C  8.0  NaN
F  NaN  3.0
