I am new to pandas and Python.
I want to use a dictionary to filter a DataFrame.
import pandas as pd
from pandas import DataFrame
df = DataFrame({'A': [1, 2, 3, 3, 3, 3], 'B': ['a', 'b', 'f', 'c', 'e', 'c'], 'D':[0,0,0,0,0,0]})
my_filter = {'A':[3], 'B':['c']}
When I call
df[df.isin(my_filter)]
I get
A B D
0 NaN NaN NaN
1 NaN NaN NaN
2 3.0 NaN NaN
3 3.0 c NaN
4 3.0 NaN NaN
5 3.0 c NaN
What I want is
A B D
3 3.0 c 0
5 3.0 c 0
I don't want to add "D" to the dictionary; I want to get the rows that have the proper values in columns A and B.
You can sum the True values across columns and then compare the result with 2:
print (df.isin(my_filter).sum(1) == 2)
0 False
1 False
2 False
3 True
4 False
5 True
dtype: bool
print (df[df.isin(my_filter).sum(1) == 2])
A B D
3 3 c 0
5 3 c 0
Another solution: first filter to only the columns A and B, then use all to check that both values per row are True:
print (df[df[['A','B']].isin(my_filter).all(1)])
A B D
3 3 c 0
5 3 c 0
Thanks to MaxU for a more flexible solution:
print (df[df.isin(my_filter).sum(1) == len(my_filter.keys())])
A B D
3 3 c 0
5 3 c 0
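If you prefer to build the mask only from the dictionary keys (a sketch, assuming every key is an actual column of df), you can combine one isin mask per key:
import numpy as np

# AND together one isin mask per dictionary key; columns that are not
# keys (like D) never enter the mask at all.
mask = np.logical_and.reduce([df[c].isin(v) for c, v in my_filter.items()])
print(df[mask])
   A  B  D
3  3  c  0
5  3  c  0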
I have a DataFrame where I am looking to fill in values in a column based on their grouping. I only want to fill in the values (by propagating non-NaN values using ffill and bfill) if there is only one unique value in the column to be filled; otherwise, it should be left as is. My code below has a sample dataset where I try to do this, but I get an error.
Code:
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6],
                   "B": ['a', 'a', np.nan, 'b', 'b', 'c', np.nan, 'd', np.nan, 'e', 'e', np.nan, 'h', 'h'],
                   "C": [5.0, np.nan, 4.0, 4.0, np.nan, 9.0, np.nan, np.nan, 9.0, 8.0, np.nan, 2.0, np.nan, np.nan]})
col_to_groupby = "A"
col_to_modify = "B"
group = df.groupby(col_to_groupby)
modified = group[group[col_to_modify].nunique() == 1].transform(lambda x: x.ffill().bfill())
df.update(modified)
Error:
KeyError: 'Columns not found: False, True'
Original dataset:
A B C
0 1 a 5.0
1 1 a NaN
2 2 NaN 4.0
3 2 b 4.0
4 2 b NaN
5 3 c 9.0
6 3 NaN NaN
7 3 d NaN
8 3 NaN 9.0
9 4 e 8.0
10 4 e NaN
11 5 NaN 2.0
12 6 h NaN
13 6 NaN NaN
Desired result:
A B C
0 1 a 5.0
1 1 a NaN
2 2 b 4.0
3 2 b 4.0
4 2 b NaN
5 3 c 9.0
6 3 NaN NaN
7 3 d NaN
8 3 NaN 9.0
9 4 e 8.0
10 4 e NaN
11 5 NaN 2.0
12 6 h NaN
13 6 h NaN
The above is the desired result because
row index 2 is in group 2, which only has 1 unique value in column B ("b"), so it is changed.
row indices 6 and 8 are in group 3, but there are 2 unique values in column B ("c" and "d"), so they are unaltered.
row index 11 is in group 5, but that group has no non-NaN data in column B to propagate.
row index 13 is in group 6, which only has 1 unique value in column B ("h"), so it is changed.
One option is to move the condition into groupby.apply (indexing the GroupBy object with a boolean Series, as the question does, is interpreted as selecting columns named False and True, which is exactly what the KeyError complains about):
df[col_to_modify] = df.groupby(col_to_groupby)[col_to_modify].apply(lambda x: x.ffill().bfill() if x.nunique()==1 else x)
Another could be to use groupby + transform('nunique') + eq to build a boolean mask for the groups with a single unique value; then fill those rows with groupby + transform('first') (first skips NaN), keeping the original values elsewhere via where:
g = df.groupby(col_to_groupby)[col_to_modify]
df[col_to_modify] = g.transform('first').where(g.transform('nunique').eq(1), df[col_to_modify])
Output:
A B C
0 1 a 5.0
1 1 a NaN
2 2 b 4.0
3 2 b 4.0
4 2 b NaN
5 3 c 9.0
6 3 NaN NaN
7 3 d NaN
8 3 NaN 9.0
9 4 e 8.0
10 4 e NaN
11 5 NaN 2.0
12 6 h NaN
13 6 h NaN
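To see why the second solution works, here is a sketch of its intermediate pieces, using the df, col_to_groupby and col_to_modify from the question:
g = df.groupby(col_to_groupby)[col_to_modify]
g.transform('nunique')  # 1 for groups 1, 2, 4 and 6; 2 for group 3; 0 for the all-NaN group 5
g.transform('first')    # the first non-NaN value of B per group, broadcast to every row of that group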
I was doing some coding and realized something: I think there is an easier way of doing this.
So I have a DataFrame like this:
>>> df = pd.DataFrame({'a': [1, 'A', 2, 'A'], 'b': ['A', 3, 'A', 4]})
a b
0 1 A
1 A 3
2 2 A
3 A 4
And I want to remove all of the As from the data, but I also want to squeeze the DataFrame. What I mean by squeezing the DataFrame is to get this result:
a b
0 1 3
1 2 4
I have a solution as follows:
a = df['a'][df['a'] != 'A']
b = df['b'][df['b'] != 'A']
df2 = pd.DataFrame({'a': a.tolist(), 'b': b.tolist()})
print(df2)
This works, but I keep thinking there is an easier way; I've stopped coding for a while, so I'm not so sharp anymore...
Note:
All columns have the same number of As, so there is no problem there.
You can try boolean indexing with loc to remove the A values:
pd.DataFrame({c: df.loc[df[c] != 'A', c].tolist() for c in df})
Result:
a b
0 1 3
1 2 4
This would do:
In [1513]: df.replace('A', np.nan).apply(lambda x: pd.Series(x.dropna().to_numpy()))
Out[1513]:
a b
0 1.0 3.0
1 2.0 4.0
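The result comes back as floats because of the intermediate NaNs; if you want the integers back, a small follow-up sketch (assuming all surviving values are numeric):
import numpy as np

out = df.replace('A', np.nan).apply(lambda x: pd.Series(x.dropna().to_numpy()))
print(out.astype(int))
   a  b
0  1  3
1  2  4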
We can use df.melt, then filter out the 'A' values, then use df.pivot:
out = df.melt().query("value!='A'")
out.index = out.groupby('variable')['variable'].cumcount()
out.pivot(columns='variable', values='value').rename_axis(columns=None)
a b
0 1 3
1 2 4
Details
out = df.melt().query("value!='A'")
variable value
0 a 1
2 a 2
5 b 3
7 b 4
# We set this as index so it helps in `df.pivot`
out.groupby('variable')['variable'].cumcount()
0 0
2 1
5 0
7 1
dtype: int64
out.pivot(columns='variable', values='value').rename_axis(columns=None)
a b
0 1 3
1 2 4
Another alternative
df = df.mask(df.eq('A'))
out = df.stack()
pd.DataFrame(out.groupby(level=1).agg(list).to_dict())
a b
0 1 3
1 2 4
Details
df = df.mask(df.eq('A'))
a b
0 1 NaN
1 NaN 3
2 2 NaN
3 NaN 4
out = df.stack()
0 a 1
1 b 3
2 a 2
3 b 4
dtype: object
pd.DataFrame(out.groupby(level=1).agg(list).to_dict())
a b
0 1 3
1 2 4
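For completeness, a compact variation along the same lines (a sketch; like the other answers it relies on the question's guarantee that every column contains the same number of As):
print(df.apply(lambda s: s[s != 'A'].reset_index(drop=True)))
   a  b
0  1  3
1  2  4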
How can I replace specific row-wise duplicate cells in selected columns without dropping rows (preferably without looping through the rows)?
Basically, I want to keep the first value and replace the remaining duplicates in a row with NAN.
For example:
df_example = pd.DataFrame({'A':['a' , 'b', 'c'], 'B':['a', 'f', 'c'],'C':[1,2,3]})
df_example.head()
Original:
A B C
0 a a 1
1 b f 2
2 c c 3
Expected output:
A B C
0 a nan 1
1 b f 2
2 c nan 3
A bit more complicated example is as follows:
Original:
A B C D
0 a 1 a 1
1 b 2 f 5
2 c 3 c 3
Expected output:
A B C D
0 a 1 nan nan
1 b 2 f 5
2 c 3 nan nan
Use DataFrame.mask with Series.duplicated applied per row via DataFrame.apply:
df_example = df_example.mask(df_example.apply(lambda x: x.duplicated(), axis=1))
print (df_example)
A B C
0 a NaN 1
1 b f 2
2 c NaN 3
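To see what the mask looks like, here is a sketch of it computed on the original first frame, before it is overwritten; True marks a cell that repeats an earlier value in its row:
print(df_example.apply(lambda x: x.duplicated(), axis=1))
       A      B      C
0  False   True  False
1  False  False  False
2  False   True  False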
With new data:
df_example = df_example.mask(df_example.apply(lambda x: x.duplicated(), axis=1))
print (df_example)
A B C D
0 a 1 NaN NaN
1 b 2 f 5.0
2 c 3 NaN NaN
I work in Python and pandas.
Let's suppose that I have the following two dataframes df_1 and df_2 (INPUT):
# df_1
A B C
0 2 8 6
1 5 2 5
2 3 4 9
3 5 1 1
# df_2
A B C
0 2 7 NaN
1 5 1 NaN
2 3 3 NaN
3 5 0 NaN
I want to process them, joining/merging to get a new dataframe which looks like this (EXPECTED OUTPUT):
A B C
0 2 7 NaN
1 5 1 1
2 3 3 NaN
3 5 0 NaN
So basically it is a right merge/join, but preserving the row order of the original right dataframe.
However, if I do this:
df_2 = df_1.merge(df_2[['A', 'B']], on=['A', 'B'], how='right')
then I get this:
A B C
0 5 1 1.0
1 2 7 NaN
2 3 3 NaN
3 5 0 NaN
So I get the right rows joined/merged but the output dataframe does not have the same row-order as the original right dataframe.
How can I do the join/merge and preserve the row-order too?
The code to create the original dataframes is the following:
import pandas as pd
import numpy as np
columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)
columns = ['A', 'B', 'C']
data_2 = [[2, 5, 3, 5], [7, 1, 3, 0], [np.nan, np.nan, np.nan, np.nan]]
data_2 = np.array(data_2).T
df_2 = pd.DataFrame(data=data_2, columns=columns)
I think that by using either .join() or .update() I could get what I want, but to start with, I am quite surprised that .merge() does not do this very simple thing too.
I think it is a bug.
A possible solution is a left join instead:
df_2 = df_2.merge(df_1, on=['A', 'B'], how='left', suffixes=('_','')).drop('C_', axis=1)
print (df_2)
A B C
0 2.0 7.0 NaN
1 5.0 1.0 1.0
2 3.0 3.0 NaN
3 5.0 0.0 NaN
You can play with the indexes of the two dataframes (note that this relies on the values in column B being unique):
print(df)
# A B C
# 0 5 1 1.0
# 1 2 7 NaN
# 2 3 3 NaN
# 3 5 0 NaN
df = df.set_index('B')
df = df.reindex(index=df_2['B'])
df = df.reset_index()
df = df[['A', 'B', 'C']]
print(df)
# A B C
# 0 2 7.0 NaN
# 1 5 1.0 1.0
# 2 3 3.0 NaN
# 3 5 0.0 NaN
One quick way is:
df_2=df_2.set_index(['A','B'])
temp = df_1.set_index(['A','B'])
df_2.update(temp)
df_2.reset_index(inplace=True)
As I discussed with @jezrael above, and if I am not missing something: if you do not need both C columns from the original dataframes but only the column C with the matching values, then .update() is the quickest way, since you do not have to drop the columns you do not need.
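A minimal sketch of why .update() works here: it aligns on the index, so once ['A', 'B'] is the index, a row present in both frames gets its NaN in C overwritten, and everything else is left alone (hypothetical two-row frames, names my own):
left = pd.DataFrame({'A': [2, 5], 'B': [7, 1], 'C': [np.nan, np.nan]}).set_index(['A', 'B'])
right = pd.DataFrame({'A': [5, 3], 'B': [1, 4], 'C': [1.0, 9.0]}).set_index(['A', 'B'])
left.update(right)  # only (5, 1) exists in both indexes; update never adds rows
print(left.reset_index())
   A  B    C
0  2  7  NaN
1  5  1  1.0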
I'm working with a huge dataframe in Python, and sometimes I need to add an empty row, or several rows, at a definite position in the dataframe. For this question I create a small dataframe df in order to show what I want to achieve.
df = pd.DataFrame(np.random.randint(10, size=(3, 3)), columns=['A', 'B', 'C'])

   A  B  C
0  4  5  2
1  6  7  0
2  8  1  9
Let's say I need to add an empty row if there is a zero value in the column 'C'. Here the empty row should be added after the second row. So in the end I want to have a new dataframe like:
new_df

     A    B    C
0    4    5    2
1    6    7    0
2  nan  nan  nan
3    8    1    9
I tried with concat and append, but I didn't get what I wanted. Could you help me, please?
You can try it this way:
l = df[df['C'] == 0].index.tolist()
for c, i in enumerate(l):
    # c counts the rows already inserted, so the split point stays correct
    dfs = np.split(df, [i + 1 + c])
    df = pd.concat([dfs[0], pd.DataFrame([[np.nan, np.nan, np.nan]], columns=df.columns), dfs[1]], ignore_index=True)
print(df)
Input:
A B C
0 4 3 0
1 4 0 4
2 4 4 2
3 3 2 1
4 3 1 2
5 4 1 4
6 1 0 4
7 0 2 0
8 2 0 3
9 4 1 3
Output:
A B C
0 4.0 3.0 0.0
1 NaN NaN NaN
2 4.0 0.0 4.0
3 4.0 4.0 2.0
4 3.0 2.0 1.0
5 3.0 1.0 2.0
6 4.0 1.0 4.0
7 1.0 0.0 4.0
8 0.0 2.0 0.0
9 NaN NaN NaN
10 2.0 0.0 3.0
11 4.0 1.0 3.0
Last thing: it can happen that the last row has 0 in 'C', so you can add:
if df["C"].iloc[-1] == 0 :
df.loc[len(df)] = [np.NaN, np.NaN, np.NaN]
Try using slicing.
First, you need to find the rows where C == 0. So let's create a boolean Series for this. I'll just name it 'a':
a = (df['C'] == 0)
So, whenever C == 0, a == True.
Now we need to find the index of each row where C == 0, create an empty row and add it to the df:
df2 = df.copy()  # make a copy because we want to be safe here
for n, i in enumerate(df.loc[a].index):
    i = i + n  # shift by the number of rows already inserted in earlier iterations
    empty_row = pd.DataFrame([], index=[i])  # an empty frame with one index entry -> one all-NaN row
    j = i + 1  # just to get things easier to read
    df2 = pd.concat([df2.loc[:i], empty_row, df2.loc[j:]])  # slicing the df with label-based .loc
    df2 = df2.reset_index(drop=True)  # reset the index so the labels stay positional for the next iteration
I must say... I don't know the size of your df or whether this is fast enough, but give it a try.
In case you know the index where you want to insert a new row, concat can be a solution.
Example dataframe:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
# A B C
# 0 1 4 7
# 1 2 5 8
# 2 3 6 9
Your new row as a dataframe with index 1:
new_row = pd.DataFrame({'A': np.nan, 'B': np.nan,'C': np.nan}, index=[1])
Inserting your new row after the second row:
new_df = pd.concat([df.loc[:1], new_row, df.loc[2:]]).reset_index(drop=True)
# A B C
# 0 1.0 4.0 7.0
# 1 2.0 5.0 8.0
# 2 NaN NaN NaN
# 3 3.0 6.0 9.0
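If you need this in more than one place, a small hypothetical helper (the name insert_empty_row is my own) wraps the same concat pattern:
def insert_empty_row(df, pos):
    # Return a copy of df with one all-NaN row inserted at integer position pos.
    empty = pd.DataFrame({c: [np.nan] for c in df.columns})
    return pd.concat([df.iloc[:pos], empty, df.iloc[pos:]]).reset_index(drop=True)

print(insert_empty_row(df, 2))
#      A    B    C
# 0  1.0  4.0  7.0
# 1  2.0  5.0  8.0
# 2  NaN  NaN  NaN
# 3  3.0  6.0  9.0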
Something like this should work for you: give each new row a fractional index label so that it sorts in right after the matching row, then sort and reset the index:
for key in df.index[df['C'] == 0]:
    df.loc[key + 0.5] = np.nan  # a fractional label slots between the existing integer labels
df = df.sort_index().reset_index(drop=True)
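Applied to the question's frame, this gives (a sketch; the columns become float once the NaN row is added):
     A    B    C
0  4.0  5.0  2.0
1  6.0  7.0  0.0
2  NaN  NaN  NaN
3  8.0  1.0  9.0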