I am trying to create a very large dataframe, made up of one column from many smaller dataframes (renamed to the dataframe name). I am using CONCAT() and looping through dictionary values which represent dataframes, and looping over index values, to create the large dataframe. The CONCAT() join_axes is the common index to all the dataframes. This works fine, however I then have duplicate column names.
I must be able to loop over the indexes at specific windows as part of my final dataframe creation, so removing this step isn't an option.
For example, this results in the following final dataframe with duplicate columns:
Is there any way I can use CONCAT() exactly as I am, but merge the columns to produce an output like this?
I think you need:
df = pd.concat([df1, df2])
Or, if the concatenated frame has duplicate column names, group the columns by name with groupby; where values overlap, they are summed:
print (df.groupby(level=0, axis=1).sum())
Sample:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'A':[5,8,7, np.nan],
                    'B':[1,np.nan,np.nan,9],
                    'C':[7,3,np.nan,0]})
df2 = pd.DataFrame({'A':[np.nan,np.nan,np.nan,2],
                    'B':[1,2,np.nan,np.nan],
                    'C':[np.nan,6,np.nan,3]})
print (df1)
A B C
0 5.0 1.0 7.0
1 8.0 NaN 3.0
2 7.0 NaN NaN
3 NaN 9.0 0.0
print (df2)
A B C
0 NaN 1.0 NaN
1 NaN 2.0 6.0
2 NaN NaN NaN
3 2.0 NaN 3.0
df = pd.concat([df1, df2],axis=1)
print (df)
A B C A B C
0 5.0 1.0 7.0 NaN 1.0 NaN
1 8.0 NaN 3.0 NaN 2.0 6.0
2 7.0 NaN NaN NaN NaN NaN
3 NaN 9.0 0.0 2.0 NaN 3.0
print (df.groupby(level=0, axis=1).sum())
A B C
0 5.0 2.0 7.0
1 8.0 2.0 9.0
2 7.0 NaN NaN
3 2.0 9.0 3.0
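A note for recent pandas versions: `groupby(..., axis=1)` is deprecated as of pandas 2.1, and a plain `.sum()` returns 0.0 for a group whose values are all NaN. The sketch below is one way to reproduce the output above under those versions, by grouping the transposed frame and passing `min_count=1`. The variable name `merged` is only illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [5, 8, 7, np.nan],
                    'B': [1, np.nan, np.nan, 9],
                    'C': [7, 3, np.nan, 0]})
df2 = pd.DataFrame({'A': [np.nan, np.nan, np.nan, 2],
                    'B': [1, 2, np.nan, np.nan],
                    'C': [np.nan, 6, np.nan, 3]})

df = pd.concat([df1, df2], axis=1)

# Transpose so the duplicate column names become index labels, group on them,
# then transpose back. min_count=1 keeps NaN where every value is missing.
merged = df.T.groupby(level=0).sum(min_count=1).T
print(merged)
```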
What you want is df1.combine_first(df2). Refer to pandas documentation.
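As a rough illustration of how this differs from summing, here is combine_first applied to the sample frames from the previous answer (reusing them here is an assumption made for the example). Where both frames hold a value, the value from df1 is kept rather than added to the one from df2.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [5, 8, 7, np.nan],
                    'B': [1, np.nan, np.nan, 9],
                    'C': [7, 3, np.nan, 0]})
df2 = pd.DataFrame({'A': [np.nan, np.nan, np.nan, 2],
                    'B': [1, 2, np.nan, np.nan],
                    'C': [np.nan, 6, np.nan, 3]})

# Values from df1 win; df2 only fills the positions where df1 is NaN.
combined = df1.combine_first(df2)
print(combined)
```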
For the following dataframe:
a b c
0 NaN 5.0 NaN
1 2.0 6.0 NaN
2 3.0 7.0 11.0
3 4.0 NaN 12.0
I want to remove all rows with at least one NaN from the first row until a 'full' row is found. For the example above, rows 0 & 1 contain NaN so they are dropped. Row 2 is a 'full' row so it is retained, along with all following rows.
i.e., I want to get:
a b c
2 3.0 7.0 11.0
3 4.0 NaN 12.0
How can I achieve this?
Test each row for non-missing values with DataFrame.notna and DataFrame.all, apply Series.cummax so the mask stays True from the first full row onward, and filter with boolean indexing:
df = df[df.notna().all(axis=1).cummax()]
print (df)
a b c
2 3.0 7.0 11.0
3 4.0 NaN 12.0
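To make the intermediate steps visible, here is the same one-liner split into parts, as a sketch on the example frame from the question. The names `full_row`, `keep` and `result` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 2, 3, 4],
                   'b': [5, 6, 7, np.nan],
                   'c': [np.nan, np.nan, 11, 12]})

# True for rows that contain no NaN at all.
full_row = df.notna().all(axis=1)

# Cumulative maximum: once a True is seen, every later row is True as well.
keep = full_row.cummax()

result = df[keep]
print(result)
```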
Say I have two data frames:
Original:
A B C
0 NaN 4.0 7.0
1 2.0 5.0 NaN
2 NaN NaN 9.0
Imputation:
A B C
0 1 4 7
1 2 5 8
2 3 6 9
(both are the same dataframes except imputation has the NaN's filled in).
I would like to reintroduce the NaN values into column A of the imputation df so it looks like this (columns B and C are filled in, but A keeps the NaN values):
# A B C
# 0 NaN 4.0 7.0
# 1 2.0 5.0 8.0
# 2 NaN 6.0 9.0
import pandas as pd
import numpy as np
dfImputation = pd.DataFrame({'A':[1,2,3],
                             'B':[4,5,6],
                             'C':[7,8,9]})
# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
dfOrginal = pd.DataFrame({'A':[np.nan,2,np.nan],
                          'B':[4,5,np.nan],
                          'C':[7,np.nan,9]})
print(dfOrginal.fillna(dfImputation))
This does not give the result I want because it fills in all the values. Is there a way to reintroduce NaN values, or to fill in NA for specific columns only? I'm not sure of the best approach to get the intended outcome.
You can fill in only specified columns by subsetting the frame you pass into the fillna operation:
>>> dfOrginal.fillna(dfImputation[["B", "C"]])
A B C
0 NaN 4.0 7.0
1 2.0 5.0 8.0
2 NaN 6.0 9.0
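Alternatively, use DataFrame.update, which modifies the frame in place (here df appears to be the original frame and im the imputation frame):
df.update(im[['B','C']])
df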
Out[7]:
A B C
0 NaN 4.0 7.0
1 2.0 5.0 8.0
2 NaN 6.0 9.0
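Both answers can be compared side by side on the frames from the question. This is a minimal sketch; the names `filled` and `updated` are illustrative, and the copy is there only so the original frame is left untouched by the in-place update.

```python
import numpy as np
import pandas as pd

dfImputation = pd.DataFrame({'A': [1, 2, 3],
                             'B': [4, 5, 6],
                             'C': [7, 8, 9]})
dfOrginal = pd.DataFrame({'A': [np.nan, 2, np.nan],
                          'B': [4, 5, np.nan],
                          'C': [7, np.nan, 9]})

# fillna returns a new frame; only columns present in the argument are filled.
filled = dfOrginal.fillna(dfImputation[['B', 'C']])

# update works in place, so operate on a copy to keep dfOrginal unchanged.
updated = dfOrginal.copy()
updated.update(dfImputation[['B', 'C']])

print(filled)
print(updated)
```

Note that update overwrites every position where the other frame has a non-NA value, not only the NaN positions. The two approaches agree here only because the imputed frame matches the original wherever the original is not NaN.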
I have a dataframe with energy use data. To post-process the data, I need to be sure I only go forward with reliable energy uses.
One of the steps is making sure the values in a dataframe row are not identical, because this indicates an error in the database (energy use for households is hardly ever identical over the years, except for zero energy use due to renewable energy installations).
The question, on a simple example df, is as follows:
The dataframe can contain empty cells (np.nan).
If 2 or more values in a row are identical, then keep one of the identical values and set the rest to np.nan, except if the identical values are zeros.
In the example below, the duplicates in rows 2 and 4 are replaced with np.nan, but the last row is not changed because the identical values are zeros.
Does anyone know how to go from the initial df to the desired df? My code works except for the zero condition: identical zeros should not be changed to np.nan (see the last row in df).
Initial df:
y_2010 y_2011 y_2012
4.0 6.0 3.0
2.0 7.0 7.0
9.0 NaN NaN
3.0 3.0 3.0
2.0 4.0 6.0
0.0 0.0 NaN
Desired df:
y_2010 y_2011 y_2012
4.0 6.0 3.0
2.0 7.0 NaN
9.0 NaN NaN
3.0 NaN NaN
2.0 4.0 6.0
0.0 0.0 NaN
Tried code:
import pandas as pd
import numpy as np
df = pd.DataFrame({"y_2010": [4,2,9,3,2,0],
"y_2011": [6,7,np.nan,3,4,0],
"y_2012": [3,7,np.nan,3,6,np.nan]})
print(df)
# Flag values that repeat an earlier value in the same row
mask = df.apply(pd.Series.duplicated, axis=1)
df = df.mask(mask, np.nan)
print(df)
y_2010 y_2011 y_2012
4.0 6.0 3.0
2.0 7.0 NaN
9.0 NaN NaN
3.0 NaN NaN
2.0 4.0 6.0
0.0 NaN NaN -> 0 changed to NaN and I don't want that
Add a check so that zeros are never masked:
df = df.mask(df.apply(pd.Series.duplicated, axis=1) & df.ne(0))
y_2010 y_2011 y_2012
0 4.0 6.0 3.0
1 2.0 7.0 NaN
2 9.0 NaN NaN
3 3.0 NaN NaN
4 2.0 4.0 6.0
5 0.0 0.0 NaN
You can try:
df.apply(lambda x: x.mask(x.duplicated()&x.ne(0)), axis=1)
Output:
y_2010 y_2011 y_2012
0 4.0 6.0 3.0
1 2.0 7.0 NaN
2 9.0 NaN NaN
3 3.0 NaN NaN
4 2.0 4.0 6.0
5 0.0 0.0 NaN
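For reference, a self-contained sketch of the zero-aware mask on the question's data. The names `dup`, `nonzero` and `out` are illustrative; the logic is the same as in the answers above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"y_2010": [4, 2, 9, 3, 2, 0],
                   "y_2011": [6, 7, np.nan, 3, 4, 0],
                   "y_2012": [3, 7, np.nan, 3, 6, np.nan]})

# True where a value repeats an earlier value in the same row.
dup = df.apply(pd.Series.duplicated, axis=1)

# True where the value is not zero, so repeated zeros are left alone.
nonzero = df.ne(0)

# Replace only non-zero repeats with NaN.
out = df.mask(dup & nonzero)
print(out)
```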
I'm currently having a problem with filling the missing values of my dataframe using a different dataframe.
Data samples:
df1
A B C
b 1.0 1.0
d NaN NaN
c 2.0 2.0
a NaN NaN
f NaN NaN
df2
A B C
c 1 5
b 2 6
a 3 7
d 4 8
I've tried to follow the solution in this question, but it appears to work only if the values you're looking up are present in both of the dataframes you're joining.
My attempt
mask = df1["B"].isnull()
df1.loc[mask, "B"] = df2[df1.loc[mask, "A"]].values
Error:
"None of [Index(['d', 'a', 'f'], dtype='object')] are in the [columns]"
Expected result:
A B C
b 1.0 1.0
d 4.0 8.0
c 2.0 2.0
a 3.0 7.0
f NaN NaN
Also, can it be used to fill two columns?
You can use combine_first here, which is exactly aimed at filling NaNs by matching with another dataframe's columns:
df1.set_index('A').combine_first(df2.set_index('A')).reset_index()
A B C
0 a 3.0 7.0
1 b 1.0 1.0
2 c 2.0 2.0
3 d 4.0 8.0
4 f NaN NaN
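Note that combine_first returns the rows sorted by the index, so the order differs from the expected result in the question. If the original row order matters, one possible variant (a sketch, not part of the answer above) is to pass the second frame to fillna, which aligns on the index and keeps the order of df1. The sample frames are rebuilt from the question's data, and `filled` is an illustrative name.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': ['b', 'd', 'c', 'a', 'f'],
                    'B': [1.0, np.nan, 2.0, np.nan, np.nan],
                    'C': [1.0, np.nan, 2.0, np.nan, np.nan]})
df2 = pd.DataFrame({'A': ['c', 'b', 'a', 'd'],
                    'B': [1, 2, 3, 4],
                    'C': [5, 6, 7, 8]})

# Index both frames by the key column, fill NaNs in df1 from df2 by label,
# then restore 'A' as a regular column. Keys missing from df2 stay NaN.
filled = df1.set_index('A').fillna(df2.set_index('A')).reset_index()
print(filled)
```

This fills both B and C in one call, which also covers the question about filling two columns.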
Columns: [TERMID, NAME, TYP, NAMECHANGE, ALIASES, SUCHREGELN, ZUW, SEOTEXT1, SEOTEXT2, SEOKommentar, DBIKommentar]
This is my empty dataframe.
one two three four
0 1.0 4.0 2.4 6.4
1 2.0 3.0 4.4 4.1
2 3.0 2.0 7.0 1.0
3 4.0 1.0 9.0 5.0
I need to fill these values into my empty dataframe.
So let's say 'TERMID' takes the values from 'one', 'TYP' the values from 'two', 'ZUW' the values from 'three', and 'SEOKommentar' the values from 'four'.
The empty dataframe needs to get filled row by row, and the ones which are not filled should say NaN.
How can I do this in an accurate way?
IIUC, you can rename the second dataframe and then reindex the columns to the original empty dataframe columns:
Creating the empty data frame:
s = 'TERMID,NAME,TYP,NAMECHANGE,ALIASES,SUCHREGELN,ZUW,SEOTEXT1,SEOTEXT2,SEOKommentar,DBIKommentar'
df = pd.DataFrame(columns=s.split(','))
Empty DataFrame
Columns: [TERMID, NAME, TYP, NAMECHANGE, ALIASES, SUCHREGELN, ZUW, SEOTEXT1, SEOTEXT2, SEOKommentar, DBIKommentar]
Index: []
Solution (df1 is the second dataframe in your example):
d = {'one': 'TERMID', 'two': 'TYP', 'three': 'ZUW', 'four': 'SEOKommentar'}
df = df1.rename(columns=d).reindex(columns=df.columns)
TERMID NAME TYP NAMECHANGE ALIASES SUCHREGELN ZUW SEOTEXT1 \
0 1.0 NaN 4.0 NaN NaN NaN 2.4 NaN
1 2.0 NaN 3.0 NaN NaN NaN 4.4 NaN
2 3.0 NaN 2.0 NaN NaN NaN 7.0 NaN
3 4.0 NaN 1.0 NaN NaN NaN 9.0 NaN
SEOTEXT2 SEOKommentar DBIKommentar
0 NaN 6.4 NaN
1 NaN 4.1 NaN
2 NaN 1.0 NaN
3 NaN 5.0 NaN
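Put together as one runnable sketch, with the second dataframe rebuilt from the values shown in the question. The names `empty` and `result` are illustrative and are used here to avoid overwriting df.

```python
import pandas as pd

s = 'TERMID,NAME,TYP,NAMECHANGE,ALIASES,SUCHREGELN,ZUW,SEOTEXT1,SEOTEXT2,SEOKommentar,DBIKommentar'
empty = pd.DataFrame(columns=s.split(','))

df1 = pd.DataFrame({'one': [1.0, 2.0, 3.0, 4.0],
                    'two': [4.0, 3.0, 2.0, 1.0],
                    'three': [2.4, 4.4, 7.0, 9.0],
                    'four': [6.4, 4.1, 1.0, 5.0]})

# Map source column names to target column names.
d = {'one': 'TERMID', 'two': 'TYP', 'three': 'ZUW', 'four': 'SEOKommentar'}

# rename moves the data under the target names; reindex adds the remaining
# target columns (filled with NaN) and puts everything in the target order.
result = df1.rename(columns=d).reindex(columns=empty.columns)
print(result)
```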