I am trying to build one large dataframe in Python out of a large number of small dataframes with different row and column names, though there is some overlap between the row names and column names. My plan was to start with one of the small dataframes and then add the others one by one.
Each specific row-column combination is unique, and in the end there will probably be a lot of NAs.
I have tried doing this with pandas merge, but it produces a much larger dataframe than I need, with row and column names duplicated instead of merged. If I could get pandas to treat NaN as a non-value and overwrite it whenever a new small dataframe is added, I think I would get the result I want.
I am also willing to try something that is not using pandas.
For example:
DF1  A  B
Y    1  2
Z    0  1

DF2  C  D
X    1  2
Z    0  1

Merged:
   A   B   C   D
Y  1   2   NA  NA
Z  0   1   0   1
X  NA  NA  1   2
And then a new dataframe has to be added:
DF3  C  E
Y    0  1
W    1  1
The result should be:
   A   B   C   D   E
Y  1   2   0   NA  1
Z  0   1   0   1   NA
X  NA  NA  1   2   NA
W  NA  NA  1   NA  1
But what happens is:
   A   B   C_x  C_y  D   E
Y  1   2   NA   1    NA  1
Z  0   1   0    0    1   NA
X  NA  NA  1    1    2   NA
W  NA  NA  1    1    NA  1
You want to use DataFrame.combine_first, which will align the DataFrames based on index, and will prioritize values in the left DataFrame, while using values in the right DataFrame to fill missing values.
df1.combine_first(df2).combine_first(df3)
Sample data
import pandas as pd

df1 = pd.DataFrame({'A': [1, 0], 'B': [2, 1]}, index=['Y', 'Z'])
df2 = pd.DataFrame({'C': [1, 0], 'D': [2, 1]}, index=['X', 'Z'])
df3 = pd.DataFrame({'C': [0, 1], 'E': [1, 1]}, index=['Y', 'W'])
Code
df1.combine_first(df2).combine_first(df3)
Output:
     A    B    C    D    E
W  NaN  NaN  1.0  NaN  1.0
X  NaN  NaN  1.0  2.0  NaN
Y  1.0  2.0  0.0  NaN  1.0
Z  0.0  1.0  0.0  1.0  NaN
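Since you are adding many small dataframes one by one, you can also fold combine_first over a whole list; a minimal sketch, assuming frames holds your dataframes with the highest-priority one first:

from functools import reduce

frames = [df1, df2, df3]  # your small dataframes, highest priority first
merged = reduce(lambda left, right: left.combine_first(right), frames)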
I want to fill NaN values in column A like this: take the value of A from another row that has the same value in column B.
Example:
A    B
NaN  'ra'
9    'ra'
5    'pa'
So the NaN in column A should become 9, because those rows have the same value in column B.
Group on B and fill within each group; transform keeps the forward/backward fill from leaking across groups:
df['A'] = df.groupby('B')['A'].transform(lambda s: s.ffill().bfill())
Output:
>>> df
A B
0 9.0 ra
1 9.0 ra
2 5.0 pa
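If every B group carries at most one distinct A value, an equivalent sketch broadcasts each group's first non-null A with transform (GroupBy.first skips NaN):

df['A'] = df['A'].fillna(df.groupby('B')['A'].transform('first'))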
I would like to fill missing values in a pandas dataframe with the average of the cells directly before and after the missing value, taking into account that there are different IDs.
maskedid  test  value
1         A     4
1         B     NaN
1         C     5
2         A     5
2         B     NaN
2         B     2
Expected DF:
maskedid  test  value
1         A     4
1         B     4.5
1         C     5
2         A     5
2         B     3.5
2         B     2
Try to interpolate:
df['value'] = df['value'].interpolate()
And by group:
df['value'] = df.groupby('maskedid')['value'].transform(lambda s: s.interpolate())
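A self-contained sketch with the sample data above; linear interpolation averages the two neighbours when a single value is missing between them:

import pandas as pd
import numpy as np

df = pd.DataFrame({'maskedid': [1, 1, 1, 2, 2, 2],
                   'test': ['A', 'B', 'C', 'A', 'B', 'B'],
                   'value': [4, np.nan, 5, 5, np.nan, 2]})
df['value'] = df.groupby('maskedid')['value'].transform(lambda s: s.interpolate())
print(df)  # yields 4.5 and 3.5, matching the expected DF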
I have a big dataframe with 4 columns, often with 3 null values in every row. Sometimes there are 2, 1, or even 0 null values, but often 3.
I want to transform it into a two-column dataframe where each row holds the non-null value and the name of the column it was extracted from.
Example: How to transform this dataframe
df
Out[1]:
a b c d
0 1.0 NaN NaN NaN
1 NaN 2.0 NaN NaN
2 NaN NaN 3.0 2.0
3 NaN NaN 1.0 NaN
to this one:
resultDF
Out[2]:
value columnName
0 1 a
1 2 b
2 3 c
3 2 d
4 1 c
The goal is to do it without looping over the rows. Is this possible?
You can use pd.melt to reshape the dataframe:
import pandas as pd

# read the raw data
df = pd.read_csv('test.csv')

# one row per (column, value) pair; dropna then discards the null cells
df = df.melt(value_vars=['a', 'b', 'c', 'd'], var_name='foo', value_name='foo_value')
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)
print(df)
Output:
foo foo_value
0 a 1.0
1 b 2.0
2 c 3.0
3 c 1.0
4 d 2.0
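Alternatively, starting again from the original wide dataframe, stack gives the exact value/columnName layout asked for in the question; a sketch, relying on stack dropping NaN by default and walking the frame row by row, which here matches the requested order:

resultDF = (df.stack()
              .rename_axis(['row', 'columnName'])
              .reset_index(name='value')
              [['value', 'columnName']])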
My original dataframe looks like this (only the first rows):
categories id products
0 A 1 a
1 B 1 a
2 C 1 a
3 A 1 b
4 B 1 b
5 A 2 c
6 B 2 c
I aggregated it with the following code:
df2 = df.groupby('id').products.nunique().reset_index().merge(
    pd.crosstab(df.id, df.categories).reset_index())
The dataframe then looks like this; I also added an outlier row from my DF:
id products A B C
0 1 2 2 2 1
1 2 1 1 1 0
2 3 50 1 1 30
Now I am trying to remove the outliers in my new DF:
# remove outliers
del df2['id']
df2 = df2.loc[df2['products'] <= 20, [str(i) for i in df2.columns]]
What I then get is:
products A B C
0 2 NaN NaN NaN
1 1 NaN NaN NaN
It removes the outliers, but why do I now get only NaNs in the category columns?
Just filter the rows and keep all the columns:
df2 = df2.loc[df2['products'] <= 20]
Re-selecting the columns with [str(i) for i in df2.columns] asks .loc for stringified labels; where those do not match the actual column labels, older pandas silently returned NaN-filled columns instead of raising a KeyError.
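A quick sketch with the aggregated sample data from above, showing that the plain boolean row filter keeps the category columns intact:

import pandas as pd

df2 = pd.DataFrame({'id': [1, 2, 3], 'products': [2, 1, 50],
                    'A': [2, 1, 1], 'B': [2, 1, 1], 'C': [1, 0, 30]})
del df2['id']
df2 = df2.loc[df2['products'] <= 20]
print(df2)  # products, A, B and C keep their values; only the outlier row is gone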
I have a bunch of partially overlapping (in rows and columns) pandas DataFrames, exemplified like so:
import pandas

df1 = pandas.DataFrame({'a': ['1', '2', '3'], 'b': ['a', 'b', 'c']})
df2 = pandas.DataFrame({'c': ['q', 'w', 'e', 'r', 't', 'y'], 'b': ['a', 'b', 'c', 'd', 'e', 'f']})
df3 = pandas.DataFrame({'a': ['4', '5', '6'], 'c': ['r', 't', 'y']})
...etc.
I want to merge them all together with as few NaN holes as possible.
Consecutive blind outer merges invariably give some (unfortunately useless to me) hole-and-duplicate-filled variant of:
a b c
0 1 a q
1 2 b w
2 3 c e
3 NaN d r
4 NaN e t
5 NaN f y
6 4 NaN r
7 5 NaN t
8 6 NaN y
My desired output given a, b, and c above would be this (column order doesn't matter):
a b c
0 1 a q
1 2 b w
2 3 c e
3 4 d r
4 5 e t
5 6 f y
I want the NaNs to be treated as places to insert data from the next dataframe, not as obstructions.
I'm at a loss here. Is there any way to achieve this in a general way?
I can't guarantee the speed, but sorting each column with a key that pushes the NaNs to the end, then dropping the rows that are still incomplete, seems to work for your sample data:
df.apply(lambda x: sorted(x, key=pd.isnull)).dropna()
Out[47]:
a b c
0 1.0 a q
1 2.0 b w
2 3.0 c e
3 4.0 d r
4 5.0 e t
5 6.0 f y
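For completeness, an end-to-end sketch; the two outer merges are an assumption about how the hole-filled frame in the question was built, and the sort trick then compacts it:

import pandas as pd

df1 = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['a', 'b', 'c']})
df2 = pd.DataFrame({'c': ['q', 'w', 'e', 'r', 't', 'y'], 'b': ['a', 'b', 'c', 'd', 'e', 'f']})
df3 = pd.DataFrame({'a': ['4', '5', '6'], 'c': ['r', 't', 'y']})

df = df1.merge(df2, how='outer').merge(df3, how='outer')

# sort each column so non-null values float to the top (a stable sort on
# the isnull flag), then drop the rows that are still incomplete
result = df.apply(lambda col: sorted(col, key=pd.isnull)).dropna()
print(result)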