I'm learning pandas these days. I have a rudimentary question regarding the fill_value parameter when using add() on dataframes.
Imagine I have the following data:
dframe1:
A B
NYC 0 1
LA 2 3
dframe2:
A D C
NYC 0 1 2
SF 3 4 5
LA 6 7 8
Doing dframe1.add(dframe2,fill_value=0) yields:
A B C D
LA 8.0 3.0 8.0 7.0
NYC 0.0 1.0 2.0 1.0
SF 3.0 NaN 5.0 4.0
Why do I get NaN for column B, index SF?
I was expecting that fill_value would ensure no NaN appears in the result by, in this case, treating columns C and D and index SF as existing in dframe1 with value 0.
According to http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.add.html, this is exactly the documented behaviour of fill_value:
Fill missing (NaN) values with this value. If both DataFrame locations are missing, the result will be missing.
The location (SF, B) exists in neither dframe1 (which has no SF row) nor dframe2 (which has no B column), so there is nothing on either side for fill_value to substitute, and the result stays NaN.
You probably already know pandas' fillna; you can apply it to the result to replace the remaining NaN values:
df.fillna(0, inplace=True)
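Putting the two steps together for the example above (a minimal sketch; chaining .fillna(0) after the addition is just one way to handle the locations missing from both frames):

import pandas as pd

dframe1 = pd.DataFrame({'A': [0, 2], 'B': [1, 3]}, index=['NYC', 'LA'])
dframe2 = pd.DataFrame({'A': [0, 3, 6], 'D': [1, 4, 7], 'C': [2, 5, 8]},
                       index=['NYC', 'SF', 'LA'])

# fill_value=0 only helps where at least one frame has a value;
# (SF, B) is missing in both frames, so it stays NaN and needs the extra fillna
result = dframe1.add(dframe2, fill_value=0).fillna(0)
print(result)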
Say I have a huge DataFrame in which only a handful of cells match the filter I apply. How can I end up with just those values (together with their row and column labels) in a new DataFrame, without carrying along the rest of the DataFrame as NaN? Dropping NaNs with dropna removes whole rows or columns, and boolean filtering replaces the non-matches with NaN.
Here's my code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((1000, 1000)))
# this one is almost filled with Nans
df[df<0.01]
If you need the non-missing values in another format, you can use DataFrame.stack:
import numpy as np
import pandas as pd

np.random.seed(2020)
df = pd.DataFrame(np.random.randint(10, size=(5, 3)))
# values >= 7 become NaN
df1 = df[df<7]
print (df1)
0 1 2
0 0.0 NaN 3.0
1 6.0 3.0 3.0
2 NaN NaN 0.0
3 0.0 NaN NaN
4 3.0 NaN 2.0
df2 = df1.stack().rename_axis(('a','b')).reset_index(name='c')
print (df2)
a b c
0 0 0 0.0
1 0 2 3.0
2 1 0 6.0
3 1 1 3.0
4 1 2 3.0
5 2 2 0.0
6 3 0 0.0
7 4 0 3.0
8 4 2 2.0
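Applied to the 1000x1000 example from the question, the same stack pattern might look like this (a minimal sketch; the seed and the row/col/value names are only illustrative):

import numpy as np
import pandas as pd

np.random.seed(0)  # illustrative seed so the example is reproducible
df = pd.DataFrame(np.random.random((1000, 1000)))

# the boolean mask keeps the matches, stack() drops the resulting NaNs,
# and the index levels carry the original row/column labels
matches = df[df < 0.01].stack().rename_axis(('row', 'col')).reset_index(name='value')
print(matches.head())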
I have a big DataFrame with 4 columns in which most rows have 3 null values; sometimes a row has 2, 1, or even 0 nulls, but usually 3.
I want to transform it into a two-column DataFrame where each row holds a non-null value and the name of the column it was extracted from.
Example: How to transform this dataframe
df
Out[1]:
a b c d
0 1.0 NaN NaN NaN
1 NaN 2.0 NaN NaN
2 NaN NaN 3.0 2.0
3 NaN NaN 1.0 NaN
to this one:
resultDF
Out[2]:
value columnName
0 1 a
1 2 b
2 3 c
3 2 d
4 1 c
The goal is to do it without looping on rows. Is this possible?
You can use pd.melt to reshape the DataFrame:
import pandas as pd
# reading the csv
df = pd.read_csv('test.csv')
df = df.melt(value_vars=['a','b','c','d'], var_name='foo', value_name='foo_value')
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)
print(df)
Output :
foo foo_value
0 a 1.0
1 b 2.0
2 c 3.0
3 c 1.0
4 d 2.0
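If you want a self-contained version, here is a sketch that builds the question's example frame directly instead of reading a CSV, and uses column names matching the desired output (value/columnName):

import numpy as np
import pandas as pd

# the example frame from the question
df = pd.DataFrame({'a': [1.0, np.nan, np.nan, np.nan],
                   'b': [np.nan, 2.0, np.nan, np.nan],
                   'c': [np.nan, np.nan, 3.0, 1.0],
                   'd': [np.nan, np.nan, 2.0, np.nan]})

res = (df.melt(value_vars=['a', 'b', 'c', 'd'],
               var_name='columnName', value_name='value')
         .dropna()
         .reset_index(drop=True))
print(res)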
I have a DataFrame with some NaN values in all columns (3 columns in total). I want to fill each NaN cell with the latest valid value from other rows, using the fastest approach possible.
As an example, if column A is NaN and column B is '123', I want to find the latest value of column A where column B is '123' and fill the NaN with that value.
I know it's easy to do this with a loop, but I'm concerned about performance on a DataFrame with 25 million records.
Any thoughts would help.
This solution uses a for loop, but it only iterates over the rows where A is NaN.
A = The column containing NaNs
B = The column to be referenced
import pandas as pd
import numpy as np
#Consider this dataframe
df = pd.DataFrame({'A':[1,2,3,4,np.nan,6,7,8,np.nan,10],'B':['xxxx','b','xxxx','d','xxxx','f','yyyy','h','yyyy','j']})
A B
0 1.0 xxxx
1 2.0 b
2 3.0 xxxx
3 4.0 d
4 NaN xxxx
5 6.0 f
6 7.0 yyyy
7 8.0 h
8 NaN yyyy
9 10.0 j
for i in list(df.loc[np.isnan(df.A)].index):  # loop over the indexes where A is NaN
    # dict mapping B -> A, built from all non-NaN rows up to and including row i;
    # later rows overwrite earlier keys, so each B maps to its latest valid A
    dictionary = df.iloc[:i+1].dropna().set_index('B').to_dict()['A']
    df.iloc[i, 0] = dictionary[df.iloc[i, 1]]  # use the dict to fill A at row i
This is how the df looks after executing the code:
A B
0 1.0 xxxx
1 2.0 b
2 3.0 xxxx
3 4.0 d
4 3.0 xxxx
5 6.0 f
6 7.0 yyyy
7 8.0 h
8 7.0 yyyy
9 10.0 j
Notice how at index 4, A's value gets changed to 3.0 and not 1.0, because 3.0 is the latest valid value of A seen for B == 'xxxx' up to that row.
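If performance on 25 million rows is the main concern, a vectorized alternative is to forward-fill A within each B group (a sketch; it assumes, as in the example, that the latest valid value always appears in an earlier row):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, np.nan, 6, 7, 8, np.nan, 10],
                   'B': ['xxxx', 'b', 'xxxx', 'd', 'xxxx', 'f', 'yyyy', 'h', 'yyyy', 'j']})

# within each B group, replace NaN with the most recent earlier value of A
df['A'] = df.groupby('B')['A'].ffill()
print(df)  # index 4 becomes 3.0 and index 8 becomes 7.0, matching the loop above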
I am trying to turn multiple dataframes into a single one based on the values in the first column, but not every dataframe has the same values in the first column. Take this example:
df1:
A 4
B 6
C 8
df2:
A 7
B 4
F 3
full_df:
A 4 7
B 6 4
C 8
F 3
How do I do this using python and pandas?
You can use a pandas merge with an outer join:
df1.merge(df2, on=['first_column'], how='outer')
You can use pd.concat, remembering to align indices:
res = pd.concat([df1.set_index(0), df2.set_index(0)], axis=1)
print(res)
1 1
A 4.0 7.0
B 6.0 4.0
C 8.0 NaN
F NaN 3.0
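A runnable sketch of the concat approach, assuming df1 and df2 arrive with default integer column names (0 for the key column, 1 for the value column, e.g. after read_csv(..., header=None)):

import pandas as pd

df1 = pd.DataFrame({0: ['A', 'B', 'C'], 1: [4, 6, 8]})
df2 = pd.DataFrame({0: ['A', 'B', 'F'], 1: [7, 4, 3]})

full_df = pd.concat([df1.set_index(0), df2.set_index(0)], axis=1)
print(full_df)
# full_df.reset_index() turns the key back into an ordinary column if needed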
I'd like to combine two dataframes using their similar column 'A':
>>> df1
A B
0 I 1
1 I 2
2 II 3
>>> df2
A C
0 I 4
1 II 5
2 III 6
To do so I tried using:
merged = pd.merge(df1, df2, on='A', how='outer')
Which returned:
>>> merged
A B C
0 I 1.0 4
1 I 2.0 4
2 II 3.0 5
3 III NaN 6
However, since df2 only contained one value for A == 'I', I do not want this value to be duplicated in the merged dataframe. Instead I would like the following output:
>>> merged
A B C
0 I 1.0 4
1 I 2.0 NaN
2 II 3.0 5
3 III NaN 6
What is the best way to do this? I am new to Python and still slightly confused by all the join/merge/concatenate/append operations.
Let us create a new helper column g with cumcount; it numbers the repeated values of A within each group, so merging on both A and g pairs each occurrence with at most one row on the other side.
df1['g'] = df1.groupby('A').cumcount()
df2['g'] = df2.groupby('A').cumcount()
df1.merge(df2, how='outer').drop(columns='g')
Out[62]:
A B C
0 I 1.0 4.0
1 I 2.0 NaN
2 II 3.0 5.0
3 III NaN 6.0
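To see why this works, here is a small runnable sketch on the question's data showing the helper column before it is dropped (purely illustrative):

import pandas as pd

df1 = pd.DataFrame({'A': ['I', 'I', 'II'], 'B': [1, 2, 3]})
df2 = pd.DataFrame({'A': ['I', 'II', 'III'], 'C': [4, 5, 6]})

df1['g'] = df1.groupby('A').cumcount()  # I -> 0, I -> 1, II -> 0
df2['g'] = df2.groupby('A').cumcount()  # I -> 0, II -> 0, III -> 0

merged = df1.merge(df2, how='outer').drop(columns='g')
print(merged)  # the second 'I' row has no g=1 partner in df2, so C is NaN there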