I have this df:
import pandas as pd
import numpy as np
d = {'name': ['bob', 'jake','Sem'], 'F1': [3, 4, np.nan], 'F2': [14, 40, 7], 'F3':
[np.nan, 1, 55]}
df = pd.DataFrame(data=d)
print (df)
out>>>
name F1 F2 F3
0 bob 3.0 14 NaN
1 jake 4.0 40 1.0
2 Sem NaN 7 NaN
I would like to delete all the rows where at least 2 of the columns (among F1, F2 and F3) are NaN.
Like:
name F1 F2 F3
0 bob 3.0 14 NaN
1 jake 4.0 40 1.0
This is just an example; I may have many more columns (up to F100) and I may want to use a different threshold than 2 out of 3 columns.
What is the best way to do this?
You can use the subset and thresh parameters of dropna:
df.dropna(subset=['F1', 'F2', 'F3'], thresh=2)
Example:
import pandas as pd
import numpy as np
d = {'name': ['bob', 'jake', 'Sem', 'Mario'],
'F1': [3, 4, np.nan, np.nan],
'F2': [14, 40, 7, 42],
'F3': [np.nan, 1, 55, np.nan]}
df = pd.DataFrame(data=d)
print(df)
name F1 F2 F3
0 bob 3.0 14 NaN
1 jake 4.0 40 1.0
2 Sem NaN 7 55.0
3 Mario NaN 42 NaN
df2 = df.dropna(subset=['F1', 'F2', 'F3'], thresh=2)
print(df2)
name F1 F2 F3
0 bob 3.0 14 NaN
1 jake 4.0 40 1.0
2 Sem NaN 7 55.0
Selecting the columns automatically:
cols = list(df.filter(regex=r'F\d+'))
df.dropna(subset=cols, thresh=2)
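To generalize to the case from the question (many F columns, possibly F1 to F100, and an arbitrary threshold), note that thresh is the minimum number of non-NaN values a row must keep. A small sketch of the conversion, assuming you want to drop rows with at least k NaN values among the F columns:
cols = list(df.filter(regex=r'F\d+'))          # e.g. F1 ... F100
k = 2                                          # drop rows with at least k NaN among these columns
df2 = df.dropna(subset=cols, thresh=len(cols) - k + 1)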
Alternative without dropna
Using boolean indexing:
m = df.filter(regex=r'F\d+').isna().sum(1)
df2 = df[m.lt(2)]
Values above a threshold
Drop rows where at least 2 values are greater than 4:
m = df.filter(regex=r'F\d+').gt(4).sum(1)
df2 = df[m.lt(2)]
I have a dataframe (df3) with 51 columns and managed to show the most common values in each feature with a for loop.
for col in df3.columns:
    print('-' * 40 + col + '-' * 40, end=' - ')
    display(df3[col].value_counts().head(10))
Now I'd like to create a new dataframe called df4 with the results from the loop. That is the 10 most frequent values from all columns of df3. How can I do that?
I get the values using
df4 = df3.apply(lambda col: col.value_counts().head(10).index)
Instead of a for-loop I use apply.
Because .value_counts() creates a Series whose index holds the original values, I take .index.
Minimal working example - because I have fewer values here, I use head(2):
import pandas as pd
data = {
'A': [1,2,3,3,4,5,6,6,6],
'B': [4,5,6,4,2,3,4,8,8],
'C': [7,8,9,7,1,1,1,2,2]
} # columns
df = pd.DataFrame(data)
df2 = df.apply(lambda col: col.value_counts().head(2).index)
print(df2)
Result
A B C
0 6 4 1
1 3 8 7
EDIT:
If you have fewer than 10 results in a column, you can convert to a list, extend it with a list of 10 NaN, and then crop it back to the first 10 items with [:10]:
(col.value_counts().head(10).index.tolist() + [np.nan] * 10)[:10]
Minimal working example
import pandas as pd
import numpy as np
data = {
'A': [1,2,3,3,4,5,6,6,6],
'B': [4,5,6,4,2,3,4,8,8],
'C': [7,8,9,7,1,1,1,2,2]
} # columns
df = pd.DataFrame(data)
NAN10 = [np.nan] * 10
df2 = df.apply(lambda col: (col.value_counts().head(10).index.tolist() + NAN10)[:10])
print(df2)
Result
A B C
0 6.0 4.0 1.0
1 3.0 8.0 7.0
2 5.0 6.0 2.0
3 4.0 5.0 9.0
4 2.0 3.0 8.0
5 1.0 2.0 NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
You can also try converting to a Series; it adds NaN in the missing places, but it will skip rows which would contain only NaN:
import pandas as pd
import numpy as np
data = {
'A': [1,2,3,3,4,5,6,6,6],
'B': [4,5,6,4,2,3,4,8,8],
'C': [7,8,9,7,1,1,1,2,2]
} # columns
df = pd.DataFrame(data)
df3 = df.apply(lambda col: pd.Series(col.value_counts().head(10).index))
print(df3)
Result
A B C
0 6 4 1.0
1 3 8 7.0
2 5 6 2.0
3 4 5 9.0
4 2 3 8.0
5 1 2 NaN
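If you want every column padded to exactly 10 rows without writing the NaN list by hand, a possible variant (my sketch, using the same example df as above) is to reindex each per-column Series onto range(10):
df4 = df.apply(lambda col: pd.Series(col.value_counts().head(10).index).reindex(range(10)))
print(df4)
This gives the same values as above, with NaN padding the remaining positions up to row 9.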
Suppose I have a dataframe with rows containing missing data, but a set of columns acting as a key:
import pandas as pd
import numpy as np
data = {"id": [1, 1, 2, 2, 3, 3, 4 ,4], "name": ["John", "John", "Paul", "Paul", "Ringo", "Ringo", "George", "George"], "height": [178, np.nan, 182, np.nan, 175, np.nan, 188, np.nan], "weight": [np.nan, np.NaN, np.nan, 72, np.nan, 68, np.nan, 70]}
df = pd.DataFrame.from_dict(data)
print(df)
id name height weight
0 1 John 178.0 NaN
1 1 John NaN NaN
2 2 Paul 182.0 NaN
3 2 Paul NaN 72.0
4 3 Ringo 175.0 NaN
5 3 Ringo NaN 68.0
6 4 George 188.0 NaN
7 4 George NaN 70.0
How would I go about "squashing" these rows with duplicate keys down to pick the non-NaN value (if it exists)?
desired output:
id name height weight
0 1 John 178.0 NaN
2 2 Paul 182.0 72.0
4 3 Ringo 175.0 68.0
6 4 George 188.0 70.0
The index doesn't matter, and within each group there is always at most one row with a non-NaN value in each column. I think I need to use groupby(['id', 'name']), but I'm not sure where to go from there.
If there is always at most one non-NaN value per column in each group, it is possible to aggregate in many ways:
df = df.groupby(['id', 'name'], as_index=False).first()
Or:
df = df.groupby(['id', 'name'], as_index=False).last()
Or:
df = df.groupby(['id', 'name'], as_index=False).mean()
Or:
df = df.groupby(['id', 'name'], as_index=False).sum(min_count=1)
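For reference, my check of the first variant on the example data from the question gives the desired output (values based on the data above; exact print spacing may differ):
out = df.groupby(['id', 'name'], as_index=False).first()
print(out)
   id    name  height  weight
0   1    John   178.0     NaN
1   2    Paul   182.0    72.0
2   3   Ringo   175.0    68.0
3   4  George   188.0    70.0
Here first / last / mean / sum(min_count=1) all agree, because each group has at most one non-NaN value per column.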
I work in Python and pandas.
Let's suppose that I have the following two dataframes df_1 and df_2 (INPUT):
# df1
A B C
0 2 8 6
1 5 2 5
2 3 4 9
3 5 1 1
# df2
A B C
0 2 7 NaN
1 5 1 NaN
2 3 3 NaN
3 5 0 NaN
I want to process it to join/merge them to get a new dataframe which looks like that (EXPECTED OUTPUT):
A B C
0 2 7 NaN
1 5 1 1
2 3 3 NaN
3 5 0 NaN
So basically it is a right merge/join, but preserving the row order of the original right dataframe.
However, if I do this:
df_2 = df_1.merge(df_2[['A', 'B']], on=['A', 'B'], how='right')
then I get this:
A B C
0 5 1 1.0
1 2 7 NaN
2 3 3 NaN
3 5 0 NaN
So I get the right rows joined/merged but the output dataframe does not have the same row-order as the original right dataframe.
How can I do the join/merge and preserve the row-order too?
The code to create the original dataframes is the following:
import pandas as pd
import numpy as np
columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)
columns = ['A', 'B', 'C']
data_2 = [[2, 5, 3, 5], [7, 1, 3, 0], [np.nan, np.nan, np.nan, np.nan]]
data_2 = np.array(data_2).T
df_2 = pd.DataFrame(data=data_2, columns=columns)
I think that by using either .join() or .update() I could get what I want, but to start with I am quite surprised that .merge() does not do this very simple thing.
I think it is a bug.
Possible solution with left join:
df_2 = df_2.merge(df_1, on=['A', 'B'], how='left', suffixes=('_','')).drop('C_', axis=1)
print (df_2)
A B C
0 2.0 7.0 NaN
1 5.0 1.0 1.0
2 3.0 3.0 NaN
3 5.0 0.0 NaN
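If you prefer to keep the right merge, a possible sketch (my addition, not part of the answer above) is to carry df_2's original row positions through the merge in a helper column and sort on it afterwards:
tmp = df_2.reset_index()                                   # 'index' column holds df_2's row order
df_2 = df_1.merge(tmp[['index', 'A', 'B']], on=['A', 'B'], how='right')
df_2 = df_2.sort_values('index').drop(columns='index').reset_index(drop=True)
print(df_2)
This should give back the expected output above, with C filled only for the A=5, B=1 row.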
You can play with the index of both dataframes:
print(df)
# A B C
# 0 5 1 1.0
# 1 2 7 NaN
# 2 3 3 NaN
# 3 5 0 NaN
df = df.set_index('B')
df = df.reindex(index=df_2['B'])
df = df.reset_index()
df = df[['A', 'B', 'C']]
print(df)
# A B C
# 0 2 7.0 NaN
# 1 5 1.0 1.0
# 2 3 3.0 NaN
# 3 5 0.0 NaN
One quick way is:
df_2=df_2.set_index(['A','B'])
temp = df_1.set_index(['A','B'])
df_2.update(temp)
df_2.reset_index(inplace=True)
As I discussed with @jezrael above, and if I am not missing something: if you do not need both C columns from the original dataframes but only the column C with the matching values, then .update() is the quickest way, since you do not have to drop the columns you do not need.
I find that after using pd.concat() to concatenate two dataframes with the same column name, df.fillna() does not work correctly with the dict parameter that specifies which value to use for each column.
I don't know why. Is something wrong with my understanding?
import pandas as pd
import numpy as np
a1 = pd.DataFrame({'a': [1, 2, 3]})
a2 = pd.DataFrame({'a': [1, 2, 3]})
b = pd.DataFrame({'b': [np.nan, 20, 30]})
c = pd.DataFrame({'c': [40, np.nan, 60]})
x = pd.concat([a1,a2, b, c], axis=1)
print(x)
x = x.fillna({'b':10, 'c': 50})
print(x)
Initial dataframe:
a a b c
0 1 1 NaN 40.0
1 2 2 20.0 NaN
2 3 3 30.0 60.0
Data is unchanged after df.fillna():
a a b c
0 1 1 NaN 40.0
1 2 2 20.0 NaN
2 3 3 30.0 60.0
As mentioned in the comments, there's a problem assigning values to a dataframe in the presence of duplicate column names.
However, you can use this workaround:
for col, val in {'b': 10, 'c': 50}.items():
    new_col = x[col].fillna(val)
    idx = int(x.columns.get_loc(col))
    x = x.drop(col, axis=1)
    x.insert(loc=idx, column=col, value=new_col)
print(x)
Result:
a a b c
0 1 1 10.0 40.0
1 2 2 20.0 50.0
2 3 3 30.0 60.0
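Another workaround sketch (my addition, assuming the a1/a2/b/c frames from the question): if you do not need to keep the duplicate 'a' labels, renaming the columns to unique names first lets the dict form of fillna work as expected.
x = pd.concat([a1, a2, b, c], axis=1)
x.columns = ['a1', 'a2', 'b', 'c']   # make every column label unique
x = x.fillna({'b': 10, 'c': 50})     # now fills NaN in b and c as intended
print(x)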
I have an (example) dataframe with 4 columns:
import pandas as pd
import numpy as np
data = {'A': ['a', 'b', 'c', 'd', 'e', 'f'],
'B': [42, 52, np.nan, np.nan, np.nan, np.nan],
'C': [np.nan, np.nan, 31, 2, np.nan, np.nan],
'D': [np.nan, np.nan, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(data, columns = ['A', 'B', 'C', 'D'])
A B C D
0 a 42.0 NaN NaN
1 b 52.0 NaN NaN
2 c NaN 31.0 NaN
3 d NaN 2.0 NaN
4 e NaN NaN 62.0
5 f NaN NaN 70.0
I would now like to merge/combine columns B, C, and D into a new column E, like in this example:
data2 = {'A': ['a', 'b', 'c', 'd', 'e', 'f'],
'E': [42, 52, 31, 2, 62, 70]}
df2 = pd.DataFrame(data2, columns = ['A', 'E'])
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
I found quite a similar question here, but this adds the merged columns B, C, and D at the end of column A:
0 a
1 b
2 c
3 d
4 e
5 f
6 42
7 52
8 31
9 2
10 62
11 70
dtype: object
Thanks for the help.
Option 1
Using assign and drop
In [644]: cols = ['B', 'C', 'D']
In [645]: df.assign(E=df[cols].sum(axis=1)).drop(cols, axis=1)
Out[645]:
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
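Note that sum() adds the values up if a row happens to have more than one non-NaN among B, C and D. If you want to guard against that before collapsing, a quick check (my addition, using the cols list from above) is:
assert df[cols].notna().sum(axis=1).le(1).all(), 'some rows have more than one value'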
Option 2
Using assignment and drop
In [648]: df['E'] = df[cols].sum(axis=1)
In [649]: df = df.drop(cols, axis=1)
In [650]: df
Out[650]:
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
Option 3
Lately, I like this 3rd option.
Using groupby
In [660]: df.groupby(np.where(df.columns == 'A', 'A', 'E'), axis=1).first() #or sum max min
Out[660]:
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
In [661]: df.columns == 'A'
Out[661]: array([ True, False, False, False], dtype=bool)
In [662]: np.where(df.columns == 'A', 'A', 'E')
Out[662]:
array(['A', 'E', 'E', 'E'],
dtype='|S1')
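Note that axis=1 in groupby is deprecated in recent pandas (2.x); an equivalent sketch (my adaptation, assuming the original four-column df and the same np.where key) works on the transpose instead:
out = df.T.groupby(np.where(df.columns == 'A', 'A', 'E')).first().T
print(out)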
The question as written asks for merge/combine as opposed to sum, so I'm posting this to help folks who find this answer while looking for help on coalescing with combine_first, which can be a bit tricky.
df2 = pd.concat([df["A"],
df["B"].combine_first(df["C"]).combine_first(df["D"])],
axis=1)
df2.rename(columns={"B":"E"}, inplace=True)
A E
0 a 42.0
1 b 52.0
2 c 31.0
3 d 2.0
4 e 62.0
5 f 70.0
What's so tricky about that? In this case there's no problem, but let's say you were pulling the B, C and D values from different dataframes, in which the a, b, c, d, e, f labels were present but not necessarily in the same order. combine_first() aligns on the index, so you'd need to tack a set_index() onto each of your df references.
df2 = pd.concat([df.set_index("A", drop=False)["A"],
df.set_index("A")["B"]\
.combine_first(df.set_index("A")["C"])\
.combine_first(df.set_index("A")["D"]).astype(int)],
axis=1).reset_index(drop=True)
df2.rename(columns={"B":"E"}, inplace=True)
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
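If there are more than a handful of columns to coalesce, a generalization of the same combine_first idea (my sketch, assuming the df from the question) chains them with functools.reduce:
from functools import reduce
cols = ['B', 'C', 'D']                     # could be any number of columns
df2 = df[['A']].copy()
df2['E'] = reduce(lambda s1, s2: s1.combine_first(s2), (df[c] for c in cols))
print(df2)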
Use difference to get the column names without A, and then take the sum or max:
cols = df.columns.difference(['A'])
df['E'] = df[cols].sum(axis=1).astype(int)
# df['E'] = df[cols].max(axis=1).astype(int)
df = df.drop(cols, axis=1)
print (df)
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
If there are multiple values per row:
data = {'A': ['a', 'b', 'c', 'd', 'e', 'f'],
'B': [42, 52, np.nan, np.nan, np.nan, np.nan],
'C': [np.nan, np.nan, 31, 2, np.nan, np.nan],
'D': [10, np.nan, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(data, columns = ['A', 'B', 'C', 'D'])
print (df)
A B C D
0 a 42.0 NaN 10.0
1 b 52.0 NaN NaN
2 c NaN 31.0 NaN
3 d NaN 2.0 NaN
4 e NaN NaN 62.0
5 f NaN NaN 70.0
cols = df.columns.difference(['A'])
df['E'] = df[cols].apply(lambda x: ', '.join(x.dropna().astype(int).astype(str)), 1)
df = df.drop(cols, axis=1)
print (df)
A E
0 a 42, 10
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
You can also use ffill with iloc:
df['E'] = df.iloc[:, 1:].ffill(axis=1).iloc[:, -1].astype(int)
df = df.iloc[:, [0, -1]]
print(df)
A E
0 a 42
1 b 52
2 c 31
3 d 2
4 e 62
5 f 70
Zero's third option using groupby requires a numpy import and only handles one column outside the set of columns to collapse, while jpp's answer using ffill requires that you know how the columns are ordered. Here's a solution that has no extra dependencies, takes an arbitrary input dataframe, and only collapses the columns if all rows in those columns are single-valued:
import pandas as pd
data = [{'A':'a', 'B':42, 'messy':'z'},
{'A':'b', 'B':52, 'messy':'y'},
{'A':'c', 'C':31},
{'A':'d', 'C':2, 'messy':'w'},
{'A':'e', 'D':62, 'messy':'v'},
{'A':'f', 'D':70, 'messy':['z']}]
df = pd.DataFrame(data)
cols = ['B', 'C', 'D']
new_col = 'E'
if df[cols].apply(lambda x: x.notna().sum() == 1, axis=1).all():
    df2 = df.drop(columns=cols)
    df2[new_col] = df[cols].ffill(axis=1).dropna(axis=1)
print(df, '\n\n', df2)
Output:
A B messy C D
0 a 42.0 z NaN NaN
1 b 52.0 y NaN NaN
2 c NaN NaN 31.0 NaN
3 d NaN w 2.0 NaN
4 e NaN v NaN 62.0
5 f NaN [z] NaN 70.0
A messy E
0 a z 42.0
1 b y 52.0
2 c NaN 31.0
3 d w 2.0
4 e v 62.0
5 f [z] 70.0