How can I compare a DataFrame with another DataFrame's columns? - python

I have two different DataFrames. df is the complete one and sample is the one to compare against. Here is the data I have:
sample.tail()
T1 C C1 C2 C3
0 1 5 0.0 7.0 5.0
df.tail()
T1 T2 C C1 C2 C3
0 1 0 5 4.0 6.0 6.0
1 0 0 5 5.0 4.0 6.0
2 0 1 7 5.0 5.0 4.0
3 1 1 0 7.0 5.0 5.0
4 1 1 5 0.0 7.0 5.0
I have selected some columns from the sample df and am trying to find the rows in df whose values match the sample.
Here is what I did so far, but with no luck:
cols = sample.columns
df = df[df[cols] == sample[cols]]
and I am getting the following error:
ValueError: Can only compare identically-labeled DataFrame objects
Can you kindly help me find a solution for this?
EDIT: Expected output
df.tail()
T1 T2 C C1 C2 C3
0 1 0 5 0.0 7.0 5.0
21 1 1 5 0.0 7.0 5.0
27 1 0 5 0.0 7.0 5.0
34 1 1 5 0.0 7.0 5.0
42 1 1 5 0.0 7.0 5.0
47 1 0 5 0.0 7.0 5.0
51 1 1 5 0.0 7.0 5.0
You can see that all the data matches the sample dataframe except T2. This is the expected output for me.
Thanks

Using pd.Index.intersection, you can use
cols = sample.columns.intersection(df.columns)
df[df[cols].apply(tuple, axis=1).isin(sample[cols].apply(tuple, axis=1))]
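The apply(tuple, axis=1) step collapses each row of the shared columns into a single hashable value, so isin can match whole rows at once. A minimal runnable sketch, reconstructing small frames from the data shown in the question:
import pandas as pd

# reconstructed from the question's sample.tail() / df.tail() output
sample = pd.DataFrame({'T1': [1], 'C': [5], 'C1': [0.0], 'C2': [7.0], 'C3': [5.0]})
df = pd.DataFrame({'T1': [1, 0, 0, 1, 1],
                   'T2': [0, 0, 1, 1, 1],
                   'C':  [5, 5, 7, 0, 5],
                   'C1': [4.0, 5.0, 5.0, 7.0, 0.0],
                   'C2': [6.0, 4.0, 5.0, 5.0, 7.0],
                   'C3': [6.0, 6.0, 4.0, 5.0, 5.0]})

# compare only on the columns both frames share (here T1, C, C1, C2, C3)
cols = sample.columns.intersection(df.columns)
matches = df[cols].apply(tuple, axis=1).isin(sample[cols].apply(tuple, axis=1))
print(df[matches])  # keeps row 4, where T1/C/C1/C2/C3 match the sample row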

Related

Pandas dataframe insert missing row and fill with previous row

I have a dataframe as below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': [0, 1, 2, 4, 5],
                   'A': [0, 1, 0, 1, 0],
                   'B': [None, None, 1, None, None]})
id A B
0 0 0 NaN
1 1 1 NaN
2 2 0 1.0
3 4 1 NaN
4 5 0 NaN
Notice that the vast majority of values in the B column are NaN.
The id column increments by 1, so the row between id 2 and 4 is missing.
The missing row that needs to be inserted is the same as the previous row, except for the id column.
So, for example, the result is
id A B
0 0 0.0 NaN
1 1 1.0 NaN
2 2 0.0 1.0
3 3 0.0 1.0 <-add row here
4 4 1.0 NaN
5 5 0.0 NaN
I can do this for the A column, but I don't know how to deal with the B column, as ffill would fill 1.0 at rows 4 and 5, which is incorrect:
step=1
idx=np.arange(df['id'].min(), df['id'].max() + step, step)
df=df.set_index('id').reindex(idx).reset_index()
df['A']=df["A"].ffill()
EDIT:
Sorry, I forgot one situation: the B column can have different values.
When the DataFrame is as below:
id A B
0 0 0 NaN
1 1 1 NaN
2 2 0 1.0
3 4 1 NaN
4 5 0 NaN
5 6 1 2.0
6 9 0 NaN
7 10 1 NaN
the result would be:
id A B
0 0 0 NaN
1 1 1 NaN
2 2 0 1.0
3 3 0 1.0
4 4 1 NaN
5 5 0 NaN
6 6 1 2.0
7 7 1 2.0
8 8 1 2.0
9 9 0 NaN
10 10 1 NaN
Make two changes: keep a copy of the original ids, and fill B with update plus an isin mask:
s = df.id.copy()  # change 1: remember the original ids
step = 1
idx = np.arange(df['id'].min(), df['id'].max() + step, step)
df = df.set_index('id').reindex(idx).reset_index()
df['A'] = df['A'].ffill()
df.B.update(df.B.ffill().mask(df.id.isin(s)))  # change 2: only fill B on the inserted rows
df
id A B
0 0 0.0 NaN
1 1 1.0 NaN
2 2 0.0 1.0
3 3 0.0 1.0
4 4 1.0 NaN
5 5 0.0 NaN
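For the edited example, the same recipe works end to end. A self-contained sketch (input reconstructed from the EDIT above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [0, 1, 2, 4, 5, 6, 9, 10],
                   'A': [0, 1, 0, 1, 0, 1, 0, 1],
                   'B': [None, None, 1, None, None, 2, None, None]})

s = df.id.copy()  # original ids, kept before reindexing
idx = np.arange(df['id'].min(), df['id'].max() + 1)
df = df.set_index('id').reindex(idx).reset_index()
df['A'] = df['A'].ffill()
# ffill B, then blank out the originally present rows, so update
# only fills B on the newly inserted rows (3, 7 and 8 here)
df.B.update(df.B.ffill().mask(df.id.isin(s)))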
If I understand correctly, here is some sample code.
new_df = pd.DataFrame({
    'new_id': list(range(df['id'].max() + 1)),
})
df = df.merge(new_df, how='outer', left_on='id', right_on='new_id')
df = df.sort_values('new_id')
df = df.ffill()
df = df.drop(columns='id')
df
A B new_id
0 0.0 NaN 0
1 1.0 NaN 1
2 0.0 1.0 2
5 0.0 1.0 3
3 1.0 1.0 4
4 0.0 1.0 5
Try this
df = pd.DataFrame({'id': [0, 1, 2, 4, 5],
                   'A': [0, 1, 0, 1, 0],
                   'B': [None, None, 1, None, None]})
# sort the missing ids so each gap can be filled from an already-present row
missingid = sorted(set(range(df.id.min(), df.id.max())) - set(df.id.tolist()))
for i in missingid:
    # append a copy of the previous row, with the new id
    df.loc[len(df)] = np.concatenate((np.array([i]), df[df.id == i - 1][["A", "B"]].values[0]))
df = df.sort_values("id").reset_index(drop=True)
output
id A B
0 0.0 0.0 NaN
1 1.0 1.0 NaN
2 2.0 0.0 1.0
3 3.0 0.0 1.0
4 4.0 1.0 NaN
5 5.0 0.0 NaN

find duplicate subset of columns with nan values in dataframe

I have a dataframe with 4 columns that can have np.nan
df =
i_example i_frame OId HId
0 0 20 3.0 0.0
1 3 13 NaN 8.0
2 3 13 NaN 10.0
3 0 21 3.0 NaN
4 0 21 3.0 0.0
5 1 22 0.0 4.0
6 1 22 NaN 4.0
7 2 20 0.0 4.0
8 2 20 1.0 4.0
I am looking for invalid rows.
Invalid rows are:
[1] rows that are duplicated on the columns [i_example, i_frame, OId], or
[2] rows that are duplicated on the columns [i_example, i_frame, HId].
So in the example above, all the rows are invalid besides the first three rows.
valid_df =
i_example i_frame OId HId
0 0 20 3.0 0.0
1 3 13 NaN 8.0
2 3 13 NaN 10.0
and
invalid_df =
i_example i_frame OId HId
3 0 21 3.0 NaN
4 0 21 3.0 0.0
5 1 22 0.0 4.0
6 1 22 NaN 4.0
7 2 20 0.0 4.0
8 2 20 1.0 4.0
1 0 21 3.0 NaN
2 0 21 3.0 0.0
These two rows are invalid because of the condition [1].
and
3 1 22 0.0 4.0
4 1 22 NaN 4.0
are invalid because of the condition [2]
and
5 2 20 0.0 4.0
6 2 20 1.0 4.0
are invalid for the same reason
I tried duplicated() but it does not work with NaN values.
I am not sure whether df.duplicated() offers a way to ignore NaNs, but you can add a condition to check whether the value is NaN and then find the duplicates.
df[df.duplicated(['i_example', 'i_frame', 'OId'], keep=False) & df['OId'].notna()]
Result:
i_example i_frame OId HId
3 0 21 3.0 NaN
4 0 21 3.0 0.0
So, for your question, I would check that the value is not NaN, find the duplicates using df.duplicated(), and build a boolean mask. With that, filter the df into valid and invalid parts.
dupes = ((df['OId'].notna() & df.duplicated(['i_example', 'i_frame', 'OId'], keep=False)) |
         (df['HId'].notna() & df.duplicated(['i_example', 'i_frame', 'HId'], keep=False)))
invalid_df = df[dupes]
valid_df = df[~dupes]
Result:
valid_df =
i_example i_frame OId HId
0 0 20 3.0 0.0
1 3 13 NaN 8.0
2 3 13 NaN 10.0
invalid_df =
i_example i_frame OId HId
3 0 21 3.0 NaN
4 0 21 3.0 0.0
5 1 22 0.0 4.0
6 1 22 NaN 4.0
7 2 20 0.0 4.0
8 2 20 1.0 4.0
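One note on why the notna() guard is needed: duplicated() treats NaN values as equal to each other, so rows 1 and 2 (both NaN in OId) would otherwise be flagged as duplicates on [i_example, i_frame, OId]. A quick illustration:
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 1.0])
print(s.duplicated(keep=False).tolist())
# [True, True, False] -> the two NaNs count as duplicates of each other,
# which is why the masks above also require notna()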

Replacing values in dataframe by values from other rows by "target row"

I am trying to do the following: when the value of 'content' is NaN, replace it with the 'content' value of the row indicated by 'target'. Below is my code, which does that by iterating over all rows, but it is ugly and slow. I suspect there should be a more elegant/fast way to do this with mask, so I figured someone may inspire me on this:
Inputs:
import pandas as pd
d = {'content': [1, 3, None, 6, 1, 59, None], 'target': [0,1,0,2,4,5,4]}
df = pd.DataFrame(data=d)
print(df)
for index, row in df.iterrows():
    if df.loc[index, 'content'] != df.loc[index, 'content']:  # NaN != NaN, so this detects NaN
        df.loc[index, 'content'] = df.loc[df.loc[index, 'target'], 'content']
print(df)
outputs:
content target
0 1.0 0
1 3.0 1
2 NaN 0
3 6.0 2
4 1.0 4
5 59.0 5
6 NaN 4
content target
0 1.0 0
1 3.0 1
2 1.0 0
3 6.0 2
4 1.0 4
5 59.0 5
6 1.0 4
Thanks in advance!
Note: the content should be changed to that of the target row only when it is NaN.
Additional Question: Can I do the same thing, whenever the content is 59 or 6? Thanks a lot!
By using fillna
df.content=df.content.fillna(df.target)
df
Out[268]:
content target
0 1.0 0
1 3.0 1
2 6.0 0
3 2.0 2
4 1.0 4
5 59.0 5
6 5.0 5
EDIT
df.ffill()
Out[487]:
content target
0 1.0 0
1 3.0 1
2 6.0 0
3 6.0 2
4 1.0 4
5 59.0 5
6 59.0 5
I guess you need this
df.content.reindex(df.target)
Out[492]:
target
0 1.0
1 3.0
0 1.0
2 6.0
4 1.0
5 59.0
5 59.0
Name: content, dtype: float64
Then assign it back:
df.content=df.content.reindex(df.target).values
df
Out[494]:
content target
0 1.0 0
1 3.0 1
2 1.0 0
3 6.0 2
4 1.0 4
5 59.0 5
6 59.0 5
Let me edit again
df.content.fillna(df.content.reindex(df.target).reset_index(drop=True))
Out[508]:
0 1.0
1 3.0
2 1.0
3 6.0
4 1.0
5 59.0
6 1.0
Name: content, dtype: float64
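For the additional question (replacing whenever content is 59 or 6 instead of NaN), a hedged sketch along the same reindex lines; the mask/isin combination is my addition, not part of the answer above:
import pandas as pd

d = {'content': [1, 3, None, 6, 1, 59, None], 'target': [0, 1, 0, 2, 4, 5, 4]}
df = pd.DataFrame(data=d)

# look up each row's content at its target row, re-aligned positionally
lookup = df['content'].reindex(df['target']).reset_index(drop=True)
# replace content with the target row's content wherever it is 59 or 6
# (note: if the target row's content is itself NaN, NaN is written)
df['content'] = df['content'].mask(df['content'].isin([59, 6]), lookup)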

Forward fill Pandas df only if an entire line is made of Nan

I would like to forward fill a pandas df with the previous line, but only when the current line is entirely composed of NaN.
This means that fillna(method='ffill', limit=1) does not work in my case, because it works element-wise, while I need a fillna that works line-wise.
Is there a more elegant way to achieve this task than the following instructions?
s = df.count(axis=1)
for d in df.index[1:]:
    if s.loc[d] == 0:
        i = s.index.get_loc(d)
        df.iloc[i] = df.iloc[i - 1]
Input
v1 v2
1 1 2
2 nan 3
3 2 4
4 nan nan
Output
v1 v2
1 1 2
2 nan 3
3 2 4
4 2 4
You can use a condition to filter the rows on which ffill is applied:
m = df.isnull().all(axis=1) | df.notnull().all(axis=1)
print (m)
1 True
2 False
3 True
4 True
dtype: bool
print (df[m])
v1 v2
1 1.0 2.0
3 2.0 4.0
4 NaN NaN
df[m] = df[m].ffill()
print (df)
v1 v2
1 1.0 2.0
2 NaN 3.0
3 2.0 4.0
4 2.0 4.0
EDIT:
print (df)
v1 v2
1 1.0 2.0
2 NaN 7.0
3 4.0 8.0
4 NaN NaN
5 2.0 4.0
6 NaN 3.0
7 NaN NaN
m = df.isnull().all(axis=1) | df.notnull().all(axis=1)
print (m)
1 True
2 False
3 True
4 True
5 True
6 False
7 True
dtype: bool
long_str = 'some long helper str'  # sentinel that cannot occur in the data
df[~m] = df[~m].fillna(long_str)   # protect NaNs in the partially filled rows
df = df.ffill().replace(long_str, np.nan)
print (df)
v1 v2
1 1.0 2.0
2 NaN 7.0
3 4.0 8.0
4 4.0 8.0
5 2.0 4.0
6 NaN 3.0
7 NaN 3.0
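An alternative sketch for this case: since a fully-NaN row should become a copy of the row directly above it, shift() can supply that row. This is a variant of mine, not from the answer, and it assumes no two consecutive all-NaN rows (the second one would copy NaN):
import numpy as np
import pandas as pd

df = pd.DataFrame({'v1': [1, np.nan, 2, np.nan],
                   'v2': [2, 3, 4, np.nan]}, index=[1, 2, 3, 4])

all_nan = df.isnull().all(axis=1)
df.loc[all_nan] = df.shift().loc[all_nan]  # copy the previous row into all-NaN rows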

Transposing a subset of columns in a Pandas DataFrame while using others as grouping variable?

Let's say I have a Pandas dataframe (that is already in the dataframe format):
x = [[1,2,8,7,9],[1,3,5.6,4.5,4],[2,3,4.5,5,5]]
df = pd.DataFrame(x, columns=['id1','id2','val1','val2','val3'])
id1 id2 val1 val2 val3
1 2 8.0 7.0 9
1 3 5.6 4.5 4
2 3 4.5 5.0 5
I want val1, val2, and val3 in one column, with id1 and id2 as grouping variables. I can use this extremely convoluted code:
dfT = df.iloc[:, 2:].T.reset_index(drop=True)
n_points = dfT.shape[0]
final = pd.DataFrame()
for i in range(0, df.shape[0]):
    data = np.asarray([[df.loc[i, 'id1']] * n_points,
                       [df.loc[i, 'id2']] * n_points,
                       dfT.iloc[:, i].values]).T
    temp = pd.DataFrame(data, columns=['id1', 'id2', 'val'])
    final = pd.concat([final, temp], axis=0)
to get my dataframe into the correct format:
id1 id2 val
0 1.0 2.0 8.0
1 1.0 2.0 7.0
2 1.0 2.0 9.0
0 1.0 3.0 5.6
1 1.0 3.0 4.5
2 1.0 3.0 4.0
0 2.0 3.0 4.5
1 2.0 3.0 5.0
2 2.0 3.0 5.0
but there must be a more efficient way of doing this, since on a large dataframe this takes way too long.
Suggestions?
You can use melt and drop the variable column:
print (pd.melt(df, id_vars=['id1','id2'], value_name='val')
.drop('variable', axis=1))
id1 id2 val
0 1 2 8.0
1 1 3 5.6
2 2 3 4.5
3 1 2 7.0
4 1 3 4.5
5 2 3 5.0
6 1 2 9.0
7 1 3 4.0
8 2 3 5.0
Another solution with set_index and stack:
print (df.set_index(['id1','id2'])
.stack()
.reset_index(level=2, drop=True)
.reset_index(name='val'))
id1 id2 val
0 1 2 8.0
1 1 2 7.0
2 1 2 9.0
3 1 3 5.6
4 1 3 4.5
5 1 3 4.0
6 2 3 4.5
7 2 3 5.0
8 2 3 5.0
There's an even simpler one, which can be done using lreshape (not yet documented, though):
pd.lreshape(df, {'val': ['val1', 'val2', 'val3']}).sort_values(['id1', 'id2'])
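Note that the melt output above is ordered column by column (all val1 rows, then val2, then val3) rather than grouped by (id1, id2) as in the desired result; a stable sort restores the grouping. A small sketch, where kind='stable' is my addition:
out = (pd.melt(df, id_vars=['id1', 'id2'], value_name='val')
         .drop('variable', axis=1)
         .sort_values(['id1', 'id2'], kind='stable')
         .reset_index(drop=True))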
