Dataframe is not updated when columns are passed to function using apply - python

I have two dataframes like this:
df1:
   A   B
a  1  10
b  2  11
c  3  12
d  4  13
df2:
     A    B
a   11  NaN
b  NaN  NaN
c  NaN   20
d   16   30
They have identical column names and indices. My goal is to replace the NAs in df2 by the values of df1. Currently, I do this like this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'A': range(1, 5), 'B': range(10, 14)}, index=list('abcd'))
df2 = pd.DataFrame({'A': [11, np.nan, np.nan, 16], 'B': [np.nan, np.nan, 20, 30]}, index=list('abcd'))
def repl_na(s, d):
s[s.isnull().values] = d[s.isnull().values][s.name]
return s
df2.apply(repl_na, args=(df1, ))
which gives me the desired output:
A B
a 11 10
b 2 11
c 3 20
d 16 30
My question is how this could be accomplished if the indices of the dataframes are different (the column names are still the same, and the columns have the same length). So I would have a df2 like this (df1 is unchanged):
A B
0 11 NaN
1 NaN NaN
2 NaN 20
3 16 30
Then the above code does not work anymore since the indices of the dataframes are different. Could someone tell me how the line
s[s.isnull().values] = d[s.isnull().values][s.name]
has to be modified in order to get the same result as above?

You could temporarily change the index of df1 to be the same as that of df2 and then use combine_first on df2:
df2.combine_first(df1.set_index(df2.index))
      A     B
0  11.0  10.0
1   2.0  11.0
2   3.0  20.0
3  16.0  30.0
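A self-contained sketch of the approach above, assuming both frames have the same number of rows in the same order. The second variant, built on np.where, is not from the answer; it is an alternative that ignores labels entirely and fills by position.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': range(1, 5), 'B': range(10, 14)}, index=list('abcd'))
# df2 has a default integer index (0..3) instead of 'a'..'d'
df2 = pd.DataFrame({'A': [11, np.nan, np.nan, 16], 'B': [np.nan, np.nan, 20, 30]})

# Give df1 the index of df2 (positional relabelling), then fill df2's NaNs from it.
filled = df2.combine_first(df1.set_index(df2.index))

# Alternative: pick values position by position from the underlying arrays.
filled_pos = pd.DataFrame(
    np.where(df2.isna().to_numpy(), df1.to_numpy(), df2.to_numpy()),
    index=df2.index,
    columns=df2.columns,
)

print(filled)
```

Both variants rely on row order alone, so they only make sense when row i of df1 really corresponds to row i of df2.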

Related

Pandas new dataframe that has sum of columns from another

I'm struggling to figure out how to do a couple of transformations with pandas. I want a new dataframe with the sum of the values from the columns in the original. I also want to be able to merge two of these 'summed' dataframes.
Example #1: Summing the columns
Before:
A B C D
1 4 7 0
2 5 8 1
3 6 9 2
After:
A B C D
6 15 24 3
Right now I'm getting the sums of the columns I'm interested in, storing them in a dictionary, and creating a dataframe from the dictionary. I feel like there is a better way to do this with pandas that I'm not seeing.
Example #2: merging 'summed' dataframes
Before:
A B C D F
6 15 24 3 1
A B C D E
1 2 3 4 2
After:
A B C D E F
7 17 27 7 2 1
First question:
Summing the columns
Use sum then convert Series to DataFrame and transpose
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
'C': [7, 8, 9], 'D': [0, 1, 2]})
df1 = df1.sum().to_frame().T
print(df1)
# Output:
A B C D
0 6 15 24 3
Second question:
Merging 'summed' dataframes
Use combine
df2 = pd.DataFrame({'A': [1], 'B': [2], 'C': [3], 'D': [4], 'E': [2]})
out = df1.combine(df2, lambda s1, s2: s1 + s2, fill_value=0)
print(out)
# Output:
A B C D E
0 7 17 27 7 2
First part, use DataFrame.sum() to sum the columns then convert Series to dataframe by .to_frame() and finally transpose:
df_sum = df.sum().to_frame().T
Result:
print(df_sum)
A B C D
0 6 15 24 3
Second part, use DataFrame.add() with parameter fill_value, as follows:
df_sum2 = df1.add(df2, fill_value=0)
Result:
print(df_sum2)
A B C D E F
0 7 17 27 7 2.0 1.0
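Putting both parts together, a minimal sketch using the data from the question. The names summed, first, second and merged are chosen here for illustration only.

```python
import pandas as pd

# Example #1: sum every column and keep the result as a one-row DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                   'C': [7, 8, 9], 'D': [0, 1, 2]})
summed = df.sum().to_frame().T

# Example #2: add two one-row frames whose columns only partly overlap
first = pd.DataFrame({'A': [6], 'B': [15], 'C': [24], 'D': [3], 'F': [1]})
second = pd.DataFrame({'A': [1], 'B': [2], 'C': [3], 'D': [4], 'E': [2]})

# fill_value=0 treats a column missing from one frame as 0 instead of NaN
merged = first.add(second, fill_value=0)

print(summed)
print(merged)
```

Columns present in only one frame (E and F here) come back as floats, because they pass through NaN before being filled.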

Function in pandas to stack rows into columns by number of rows?

Suppose I have heterogeneous dataframe:
a b c d
1 1 2 3 4
2 5 6 7 8
3 9 10 11 12
4 13 14 15 16
And I want to stack the rows like so:
a b c d
1 1,5,9,13 2,6,10,14 3,7,11,15 4,8,12,16
Etc...
All the references for groupby etc. seem to require some feature to group on; I just want to put x rows into columns, regardless of their content. Each row has a timestamp, and I am looking to group values by sample count, so I want 1 row with all the values of x sample rows as columns.
I should end up with a dataframe that has x * (original number of columns) columns and (original number of rows) / x rows.
I'm sure there must be some simple method I'm missing here that avoids a series of loops etc.
If you need to join all values into strings, use:
df1 = df.astype(str).agg(','.join).to_frame().T
print (df1)
a b c d
0 1,5,9,13 2,6,10,14 3,7,11,15 4,8,12,16
Or, if you need to create lists, use:
df2 = pd.DataFrame([[list(df[x]) for x in df]], columns=df.columns)
print (df2)
a b c d
0 [1, 5, 9, 13] [2, 6, 10, 14] [3, 7, 11, 15] [4, 8, 12, 16]
If you need scalars with a MultiIndex (generated from the index and column labels), use:
df3 = df.unstack().to_frame().T
print (df3)
a b c d
1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
0 1 5 9 13 2 6 10 14 3 7 11 15 4 8 12 16
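The answers above collapse the whole frame into a single row. The question also mentions grouping every x rows; a possible sketch for that case follows. The chunk size x = 2 and the helper names groups, joined and wide are assumptions made for illustration, and the reshape variant assumes the number of rows is a multiple of x.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 5, 9, 13], 'b': [2, 6, 10, 14],
                   'c': [3, 7, 11, 15], 'd': [4, 8, 12, 16]},
                  index=[1, 2, 3, 4])

x = 2  # hypothetical number of rows to fold into one output row

# Integer division of the row position gives a chunk id: 0, 0, 1, 1, ...
groups = np.arange(len(df)) // x

# One row per chunk, with each column's values joined as a string
joined = df.astype(str).groupby(groups).agg(','.join)

# Wide variant: x * (number of columns) scalar columns per output row
wide = pd.DataFrame(df.to_numpy().reshape(-1, x * df.shape[1]))

print(joined)
print(wide)
```

Grouping on row position rather than on a column is what removes the need for a grouping feature in the data itself.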

Creating a new dataframe off of duplicate indexes

I'm working in pandas and I have a dataframe X
idx
0
1
2
3
4
I want to create a new dataframe with the following indexes from this list. There are duplicate indexes because I want some rows to repeat.
idx = [0,0,1,2,3,2,4]
My expected output is
idx
0
0
1
2
3
2
4
I can't use
X.iloc[idx]
because of the duplicated indexes.
Code I tried:
d = {'idx': [0,1,3,4]}
df = pd.DataFrame(data=d)
idx = [0,0,1,2,3,2,4]
df.iloc[idx] # errors here with IndexError: indices are out-of-bounds
What you want to do is weird, but here is one way to do it.
import pandas as pd
df = pd.DataFrame({'A': [11, 21, 31],
'B': [12, 22, 32],
'C': [13, 23, 33]},
index=['ONE', 'ONE', 'TWO'])
OUTPUT
A B C
ONE 11 12 13
ONE 21 22 23
TWO 31 32 33
Read: pandas: Rename columns / index names (labels) of DataFrame
Your current dataframe df:
idx
0 0
1 1
2 3
3 4
Now just use the reindex() method:
idx = [0,0,1,2,3,2,4]
df=df.reindex(idx)
Now if you print df you get:
idx
0 0.0
0 0.0
1 1.0
2 3.0
3 4.0
2 3.0
4 NaN
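For comparison, a small sketch of the two lookups. It assumes the frame X from the question really has five rows (0 to 4); the tried code builds only four rows, which would explain the out-of-bounds error on its own, independent of the duplicates. The names repeated, short and by_label are illustrative.

```python
import pandas as pd

idx = [0, 0, 1, 2, 3, 2, 4]

# Five-row frame as described in the question
X = pd.DataFrame({'idx': [0, 1, 2, 3, 4]})

# iloc selects by position and accepts repeated positions
repeated = X.iloc[idx]

# Four-row frame from the tried code: position 4 does not exist,
# so iloc would raise IndexError, while reindex (label-based)
# inserts a NaN row for the missing label 4
short = pd.DataFrame({'idx': [0, 1, 3, 4]})
by_label = short.reindex(idx)

print(repeated)
print(by_label)
```

Note that reindex matches on index labels, not positions, and the NaN it introduces turns the integer column into floats.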

How to delete rows in a pandas dataframe based on the row names of another dataframe?

I want to shorten my data; the whole data shape is 30000x480. I want to drop some rows based on the row names of another data frame.
Please help me find a solution for:
df1
Row a b
A 1 2
B 3 4
C 5 6
D 7 8
E 9 10
F 11 12
G 13 14
df2
Row a b
C 5 6
D 7 8
F 11 12
G 13 14
So I want to delete the rows in df1 that don't exist in df2. Deleting them manually is impractical because the data is very big.
For clarity, I'll use data like that given in the question and restate the problem:
Question: delete the rows in df1 that don't exist in df2.
Restated: keep the rows of df1 that are also present in df2, i.e. the rows common to both df1 and df2. Try this:
>>> import pandas as pd
>>> df2 = pd.DataFrame({'Row': ['C', 'D', 'F','G'], 'a': [5, 7, 11, 13], 'b' : [6, 8, 12, 14]})
>>> df1 = pd.DataFrame({'Row' : ['A', 'B', 'C', 'D'], 'a': [1,3,5,7], 'b': [2,4,6, 8]})
>>> df1
Row a b
0 A 1 2
1 B 3 4
2 C 5 6
3 D 7 8
>>> df2
Row a b
0 C 5 6
1 D 7 8
2 F 11 12
3 G 13 14
>>> pd.merge(df1, df2, how='inner')
Row a b
0 C 5 6
1 D 7 8
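The merge above matches on every shared column (Row, a and b). If only the row names should decide which rows survive, a filter with isin is a possible alternative; this is a sketch, not part of the answer, and the names kept and kept_by_index are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Row': list('ABCDEFG'),
                    'a': [1, 3, 5, 7, 9, 11, 13],
                    'b': [2, 4, 6, 8, 10, 12, 14]})
df2 = pd.DataFrame({'Row': ['C', 'D', 'F', 'G'],
                    'a': [5, 7, 11, 13],
                    'b': [6, 8, 12, 14]})

# Keep the rows of df1 whose name appears in df2's 'Row' column
kept = df1[df1['Row'].isin(df2['Row'])]

# Same idea when the row names are stored in the index instead of a column
df1_indexed = df1.set_index('Row')
df2_indexed = df2.set_index('Row')
kept_by_index = df1_indexed.loc[df1_indexed.index.isin(df2_indexed.index)]

print(kept)
```

Unlike the inner merge, this keeps a row of df1 even when its a or b values differ from those in df2.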

How can I extract a column from dataframe and attach it to rows while keeping other columns intact

How can I extract a column from a pandas dataframe and attach it to the rows while keeping the other columns the same?
This is my example dataset.
import pandas as pd
import numpy as np
df = pd.DataFrame({'ID': np.arange(0,5),
'sample_1' : [5,6,7,8,9],
'sample_2' : [10,11,12,13,14],
'group_id' : ["A","B","C","D","E"]})
The output I'm looking for is:
df2 = pd.DataFrame({'ID': [0, 1, 2, 3, 4, 0, 1, 2, 3, 4],
'sample_1' : [5,6,7,8,9,10,11,12,13,14],
'group_id' : ["A","B","C","D","E","A","B","C","D","E"]})
I have tried to slice the dataframe and concat using pd.concat but it was giving NaN values.
My original dataset is large.
You could do this using stack: Set the index to the columns you don't want to modify, call stack, sort by the "sample" column, then reset your index:
df.set_index(['ID','group_id']).stack().sort_values().reset_index([0,1]).reset_index(drop=True)
ID group_id 0
0 0 A 5
1 1 B 6
2 2 C 7
3 3 D 8
4 4 E 9
5 0 A 10
6 1 B 11
7 2 C 12
8 3 D 13
9 4 E 14
Using pd.wide_to_long:
res = pd.wide_to_long(df, stubnames='sample_', i='ID', j='group_id')
res.index = res.index.droplevel(1)
res = res.rename(columns={'sample_': 'sample_1'}).reset_index()
print(res)
ID group_id sample_1
0 0 A 5
1 1 B 6
2 2 C 7
3 3 D 8
4 4 E 9
5 0 A 10
6 1 B 11
7 2 C 12
8 3 D 13
9 4 E 14
The function you are looking for is called melt
For example:
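# value_name must not clash with an existing column, so melt into 'sample' and rename afterwards
df2 = pd.melt(df, id_vars=['ID', 'group_id'], value_vars=['sample_1', 'sample_2'], value_name='sample')
df2 = df2.drop('variable', axis=1).rename(columns={'sample': 'sample_1'})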
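The question mentions that pd.concat produced NaN values. A likely cause, assumed here since the failing code is not shown, is that the two slices had different column names (sample_1 and sample_2), so concat kept both columns and padded each with NaN. Renaming before concatenating avoids that; the names first, second and stacked are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': np.arange(0, 5),
                   'sample_1': [5, 6, 7, 8, 9],
                   'sample_2': [10, 11, 12, 13, 14],
                   'group_id': ["A", "B", "C", "D", "E"]})

# Two slices that share the ID and group_id columns
first = df[['ID', 'sample_1', 'group_id']]
# Rename so both slices have identical column names before stacking
second = df[['ID', 'sample_2', 'group_id']].rename(columns={'sample_2': 'sample_1'})

# ignore_index=True gives the result a fresh 0..9 index
stacked = pd.concat([first, second], ignore_index=True)

print(stacked)
```

This keeps the column order requested in the question (ID, sample_1, group_id) without a reordering step.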
