I am trying to build one large dataframe in Python out of a large number of small dataframes with different row and column names, though there is some overlap between the row names and column names. My plan was to start with one of the small dataframes and then add the others one by one.
Each specific row-column combination is unique, and in the end there will probably be a lot of NAs.
I have tried doing this with pandas merge, but it produces a much larger dataframe than I need, with row and column names duplicated instead of merged. If I could get pandas to treat NaN as a non-value and overwrite it whenever a new small dataframe is added, I think I would get the result I want.
I am also willing to try something that is not using pandas.
For example:
DF1  A  B
Y    1  2
Z    0  1

DF2  C  D
X    1  2
Z    0  1

Merged:
   A   B   C   D
Y  1   2   NA  NA
Z  0   1   0   1
X  NA  NA  1   2
And then a new dataframe has to be added:
DF3  C  E
Y    0  1
W    1  1
The result should be:
   A   B   C   D   E
Y  1   2   0   NA  1
Z  0   1   0   1   NA
X  NA  NA  1   2   NA
W  NA  NA  1   NA  1
But what happens is:
   A   B   C_x  C_y  D   E
Y  1   2   NA   1    NA  1
Z  0   1   0    0    1   NA
X  NA  NA  1    1    2   NA
W  NA  NA  1    1    NA  1
You want to use DataFrame.combine_first, which will align the DataFrames based on index, and will prioritize values in the left DataFrame, while using values in the right DataFrame to fill missing values.
df1.combine_first(df2).combine_first(df3)
Sample data
import pandas as pd

df1 = pd.DataFrame({'A': [1, 0], 'B': [2, 1]}, index=['Y', 'Z'])
df2 = pd.DataFrame({'C': [1, 0], 'D': [2, 1]}, index=['X', 'Z'])
df3 = pd.DataFrame({'C': [0, 1], 'E': [1, 1]}, index=['Y', 'W'])
Code
df1.combine_first(df2).combine_first(df3)
Output:
     A    B    C    D    E
W  NaN  NaN  1.0  NaN  1.0
X  NaN  NaN  1.0  2.0  NaN
Y  1.0  2.0  0.0  NaN  1.0
Z  0.0  1.0  0.0  1.0  NaN
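Since you are adding many small dataframes one by one, you can also fold combine_first over a whole list; a minimal sketch, assuming frames holds your dataframes with the highest-priority one first:

from functools import reduce

frames = [df1, df2, df3]  # your small dataframes, highest priority first
merged = reduce(lambda left, right: left.combine_first(right), frames)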
I want to fill NaN values in column A like this: take the value of A from another row that has the same value in column B.
Example:
A    B
NaN  'ra'
9    'ra'
5    'pa'
So the NaN in column A should become 9, because those rows have the same value in column B.
Group on B and fill within each group; transform keeps the forward/backward fill from leaking across groups:
df['A'] = df.groupby('B')['A'].transform(lambda s: s.ffill().bfill())
Output:
>>> df
A B
0 9.0 ra
1 9.0 ra
2 5.0 pa
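If every B group carries at most one distinct A value, an equivalent sketch broadcasts each group's first non-null A with transform (GroupBy.first skips NaN):

df['A'] = df['A'].fillna(df.groupby('B')['A'].transform('first'))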
I would like to fill missing values in a pandas dataframe with the average of the cells directly before and after the missing value, taking into account that there are different IDs.
maskedid  test  value
1         A     4
1         B     NaN
1         C     5
2         A     5
2         B     NaN
2         B     2
Expected DF:
maskedid  test  value
1         A     4
1         B     4.5
1         C     5
2         A     5
2         B     3.5
2         B     2
Try to interpolate:
df['value'] = df['value'].interpolate()
And by group:
df['value'] = df.groupby('maskedid')['value'].transform(lambda s: s.interpolate())
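A self-contained sketch with the sample data above; linear interpolation averages the two neighbours when a single value is missing between them:

import pandas as pd
import numpy as np

df = pd.DataFrame({'maskedid': [1, 1, 1, 2, 2, 2],
                   'test': ['A', 'B', 'C', 'A', 'B', 'B'],
                   'value': [4, np.nan, 5, 5, np.nan, 2]})
df['value'] = df.groupby('maskedid')['value'].transform(lambda s: s.interpolate())
print(df)  # yields 4.5 and 3.5, matching the expected DF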
I have a big dataframe with 4 columns, often with 3 null values in every row. Sometimes there are 2, 1, or even 0 null values, but often 3.
I want to transform it into a two-column dataframe where each row holds the non-null value and the name of the column it was extracted from.
Example: How to transform this dataframe
df
Out[1]:
a b c d
0 1.0 NaN NaN NaN
1 NaN 2.0 NaN NaN
2 NaN NaN 3.0 2.0
3 NaN NaN 1.0 NaN
to this one:
resultDF
Out[2]:
value columnName
0 1 a
1 2 b
2 3 c
3 2 d
4 1 c
The goal is to do it without looping over the rows. Is this possible?
You can use pd.melt to reshape the dataframe:
import pandas as pd

# read the raw data
df = pd.read_csv('test.csv')

# one row per (column, value) pair; dropna then discards the null cells
df = df.melt(value_vars=['a', 'b', 'c', 'd'], var_name='foo', value_name='foo_value')
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)
print(df)
Output:
foo foo_value
0 a 1.0
1 b 2.0
2 c 3.0
3 c 1.0
4 d 2.0
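Alternatively, starting again from the original wide dataframe, stack gives the exact value/columnName layout asked for in the question; a sketch, relying on stack dropping NaN by default and walking the frame row by row, which here matches the requested order:

resultDF = (df.stack()
              .rename_axis(['row', 'columnName'])
              .reset_index(name='value')
              [['value', 'columnName']])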
My original dataframe looks like this (only the first rows):
categories id products
0 A 1 a
1 B 1 a
2 C 1 a
3 A 1 b
4 B 1 b
5 A 2 c
6 B 2 c
I aggregated it with the following code:
df2 = df.groupby('id').products.nunique().reset_index().merge(
    pd.crosstab(df.id, df.categories).reset_index())
The dataframe then looks like this; I also added an outlier row from my DF:
id products A B C
0 1 2 2 2 1
1 2 1 1 1 0
2 3 50 1 1 30
Now I am trying to remove the outliers in my new DF:
# remove outliers
del df2['id']
df2 = df2.loc[df2['products'] <= 20, [str(i) for i in df2.columns]]
What I then get is:
products A B C
0 2 NaN NaN NaN
1 1 NaN NaN NaN
It removes the outliers, but why do I now get only NaNs in the category columns?
Just filter the rows and keep all the columns:
df2 = df2.loc[df2['products'] <= 20]
Re-selecting the columns with [str(i) for i in df2.columns] asks .loc for stringified labels; where those do not match the actual column labels, older pandas silently returned NaN-filled columns instead of raising a KeyError.
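A quick sketch with the aggregated sample data from above, showing that the plain boolean row filter keeps the category columns intact:

import pandas as pd

df2 = pd.DataFrame({'id': [1, 2, 3], 'products': [2, 1, 50],
                    'A': [2, 1, 1], 'B': [2, 1, 1], 'C': [1, 0, 30]})
del df2['id']
df2 = df2.loc[df2['products'] <= 20]
print(df2)  # products, A, B and C keep their values; only the outlier row is gone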
I have a bunch of partially overlapping (in rows and columns) pandas DataFrames, exemplified like so:
import pandas

df1 = pandas.DataFrame({'a': ['1', '2', '3'], 'b': ['a', 'b', 'c']})
df2 = pandas.DataFrame({'c': ['q', 'w', 'e', 'r', 't', 'y'], 'b': ['a', 'b', 'c', 'd', 'e', 'f']})
df3 = pandas.DataFrame({'a': ['4', '5', '6'], 'c': ['r', 't', 'y']})
...etc.
I want to merge them all together with as few NaN holes as possible.
Consecutive blind outer merges invariably give some (unfortunately useless to me) hole-and-duplicate-filled variant of:
a b c
0 1 a q
1 2 b w
2 3 c e
3 NaN d r
4 NaN e t
5 NaN f y
6 4 NaN r
7 5 NaN t
8 6 NaN y
My desired output given a, b, and c above would be this (column order doesn't matter):
a b c
0 1 a q
1 2 b w
2 3 c e
3 4 d r
4 5 e t
5 6 f y
I want the NaNs to be treated as places to insert data from the next dataframe, not as obstructions.
I'm at a loss here. Is there any way to achieve this in a general way?
I can't guarantee the speed, but sorting each column with a key that pushes the NaNs to the end, then dropping the rows that are still incomplete, seems to work for your sample data:
df.apply(lambda x: sorted(x, key=pd.isnull)).dropna()
Out[47]:
a b c
0 1.0 a q
1 2.0 b w
2 3.0 c e
3 4.0 d r
4 5.0 e t
5 6.0 f y
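For completeness, an end-to-end sketch; the two outer merges are an assumption about how the hole-filled frame in the question was built, and the sort trick then compacts it:

import pandas as pd

df1 = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['a', 'b', 'c']})
df2 = pd.DataFrame({'c': ['q', 'w', 'e', 'r', 't', 'y'], 'b': ['a', 'b', 'c', 'd', 'e', 'f']})
df3 = pd.DataFrame({'a': ['4', '5', '6'], 'c': ['r', 't', 'y']})

df = df1.merge(df2, how='outer').merge(df3, how='outer')

# sort each column so non-null values float to the top (a stable sort on
# the isnull flag), then drop the rows that are still incomplete
result = df.apply(lambda col: sorted(col, key=pd.isnull)).dropna()
print(result)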