How to drop string values from a pandas dataframe column? - python

I have a df:
col1  col2
A     1
B     2
1     string
2     3
C     more string
How can I drop all the rows where col2 contains a string?

You can do:
df[pd.to_numeric(df['col2'], errors='coerce').notnull()]
Output:
  col1 col2
0    A    1
1    B    2
3    2    3

Try
df = df[df['col2'].apply(lambda x: type(x) != str)]
The apply call outputs True for every value in col2 that is not a string and False for strings; selecting with that boolean mask keeps only the non-string rows. Note this works when the column genuinely mixes Python ints and strs; if every value is stored as a string (e.g. '1', 'string'), the type check drops everything, and the pd.to_numeric approach above is the one to use.
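Both answers can be checked on a small reconstruction of the question's frame (the exact values are assumed from the example shown):

```python
import pandas as pd

# Reconstruction of the question's frame; col2 mixes real numbers and strings
df = pd.DataFrame({
    'col1': ['A', 'B', 1, 2, 'C'],
    'col2': [1, 2, 'string', 3, 'more string'],
})

# Rows whose col2 survives numeric coercion are kept;
# 'string' and 'more string' coerce to NaN and are dropped
cleaned = df[pd.to_numeric(df['col2'], errors='coerce').notnull()]
print(cleaned)
```

The type-based lambda gives the same result here, since the numeric values are stored as actual ints rather than numeric-looking strings.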

Related

filter rows failing dtype conversion

I have a pandas dataframe with a lot of columns. The dtype of all columns is object because some columns contain strings. Is there a way to filter out rows where any column value is a string into a separate dataframe, and then convert the cleaned dataframe to integer dtype?
I figured out the second part but not the first - filtering out rows if a value contains string characters like 'a', 'b', etc. For example, if df is:
df = pd.DataFrame({
    'col1': [1, 2, 'a', 0, 3],
    'col2': [1, 2, 3, 4, 5],
    'col3': [1, 2, 3, '45a5', 4]
})
This should become 2 dataframes:
df = pd.DataFrame({
    'col1': [1, 2, 3],
    'col2': [1, 2, 5],
    'col3': [1, 2, 4]
})
dfError = pd.DataFrame({
    'col1': ['a', 0],
    'col2': [3, 4],
    'col3': [3, '45a5']
})
I believe this to be an efficient way to do it.
import pandas as pd

df = pd.DataFrame({  # main dataframe
    'col1': [1, 2, 'a', 0, 3],
    'col2': [1, 2, 3, 4, 5],
    'col3': [1, 2, 3, '45a5', 4]
})

mask = df.apply(pd.to_numeric, errors='coerce').isna()  # True where a value couldn't be parsed as numeric
mask = mask.any(axis=1)  # True for rows containing any non-numeric value
df1 = df[~mask]  # fully numeric rows
df2 = df[mask]   # rows with at least one non-numeric value
Breaking it down:
df.apply(pd.to_numeric) # converts the dataframe into numeric, but this would give us an error for the string elements (like 'a')
df.apply(pd.to_numeric, errors='coerce') # 'coerce' sets any non-valid element to NaN (converts the string elements to NaN).
>>>
   col1  col2  col3
0   1.0     1   1.0
1   2.0     2   2.0
2   NaN     3   3.0
3   0.0     4   NaN
4   3.0     5   4.0
df.apply(pd.to_numeric, errors='coerce').isna() # Detect the coerced NaNs (missing values).
>>>
    col1   col2   col3
0  False  False  False
1  False  False  False
2   True  False  False
3  False  False   True
4  False  False  False
mask.any(axis=1) # Returns whether any element is True along each row
>>>
0 False
1 False
2 True
3 True
4 False
I don't know if there's a performant way to check this, but a dirty (and possibly slow) way could be:
str_cond = df.applymap(lambda x: isinstance(x, str)).any(axis=1)
df[~str_cond]
col1 col2 col3
0 1 1 1
1 2 2 2
4 3 5 4
df[str_cond]
col1 col2 col3
2 a 3 3
3 0 4 45a5
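The asker's "second part" - converting the cleaned dataframe to integer dtype - follows directly from the mask; a minimal sketch, reusing the frame from the question:

```python
import pandas as pd

# Same frame as the question
df = pd.DataFrame({
    'col1': [1, 2, 'a', 0, 3],
    'col2': [1, 2, 3, 4, 5],
    'col3': [1, 2, 3, '45a5', 4],
})

mask = df.apply(pd.to_numeric, errors='coerce').isna().any(axis=1)
clean = df[~mask].astype(int)   # all-numeric rows, cast to integer dtype
errors = df[mask]               # rows with at least one string value
print(clean.dtypes)
```

Since the surviving rows contain only integers, astype(int) succeeds without the float detour that to_numeric's NaN coercion would introduce.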

Pandas - DF with lists - find all rows that match a string in any of the columns

I have the following dataframe:
ID  col1       col2       col3
0   ['a','b']  ['d','c']  ['e','d']
1   ['s','f']  ['f','a']  ['d','aaa']
Give an input string = 'a'
I want to receive a dataframe like this:
ID col1 col2 col3
0 1 0 0
1 0 1 0
I see how to do it with a for loop, but that takes forever, and there must be a method I'm missing.
Processing lists in pandas is not vectorized, so performance is worse than with scalar values.
A first idea is to reshape the list columns into a Series with DataFrame.stack, turn the lists into scalars with Series.explode, compare against 'a', test for any match per original row and column with a groupby over the first two index levels, and finally reshape back, converting the boolean mask to integers:
df1 = (df.set_index('ID').stack().explode().eq('a')
         .groupby(level=[0, 1]).any().unstack().astype(int))
print (df1)
col1 col2 col3
ID
0 1 0 0
1 0 1 0
Alternatively, use DataFrame.applymap for an element-wise membership test with a lambda using `in`:
df1 = df.set_index('ID').applymap(lambda x: 'a' in x).astype(int)
Or build a DataFrame from each list column, so the comparison against 'a' can be tested per row with DataFrame.any:
f = lambda x: pd.DataFrame(x.tolist(), index=x.index).eq('a').any(axis=1)
df1 = df.set_index('ID').apply(f).astype(int)
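The applymap variant is the most direct; a self-contained sketch on a reconstruction of the question's frame (values assumed from the example):

```python
import pandas as pd

# Reconstruction of the frame of lists from the question
df = pd.DataFrame({
    'ID': [0, 1],
    'col1': [['a', 'b'], ['s', 'f']],
    'col2': [['d', 'c'], ['f', 'a']],
    'col3': [['e', 'd'], ['d', 'aaa']],
})

# Element-wise membership test; note `in` on a list checks equality,
# so 'aaa' does not count as a match for 'a'
df1 = df.set_index('ID').applymap(lambda x: 'a' in x).astype(int)
print(df1)
```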

Pandas: slice Dataframe according to values of a column

I have to slice my Dataframe according to values (imported from a txt file) that occur in one of my Dataframe's columns. This is what I have:
>df
col1 col2
a 1
b 2
c 3
d 4
>'mytxt.txt'
2
3
This is what I need: drop rows whenever the value in col2 is not among the values in mytxt.txt.
Expected result must be:
>df
col1 col2
b 2
c 3
I tried:
values = pd.read_csv('mytxt.txt', header=None)
df = df.col2.isin(values)
But it doesn't work. Help would be very appreciated, thanks!
When you read values, I would do it as a Series, and then convert it to a set, which will be more efficient for lookups (the squeeze=True parameter was removed in pandas 2.0; the .squeeze method does the same job):
values = pd.read_csv('mytxt.txt', header=None).squeeze('columns')
values = set(values.tolist())
Then slicing will work:
>>> df[df.col2.isin(values)]
col1 col2
1 b 2
2 c 3
What was happening is you were reading values in as a DataFrame rather than a Series, so the .isin method was not behaving as you expected.
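A runnable sketch of the whole flow, using an in-memory buffer to stand in for mytxt.txt (the buffer is an assumption for the example; in practice the filename goes straight to read_csv):

```python
import io
import pandas as pd

df = pd.DataFrame({'col1': list('abcd'), 'col2': [1, 2, 3, 4]})

# Simulating 'mytxt.txt' with an in-memory buffer
txt = io.StringIO("2\n3\n")
values = set(pd.read_csv(txt, header=None)[0])

# Boolean mask selects only rows whose col2 appears in the file
result = df[df['col2'].isin(values)]
print(result)
```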

Drop Repeated Values in Column, Retaining the Row

I have a problem similar to dropping duplicates, but I need to retain the row that has the repeated value. So essentially, I need to retain the first value and then replace every repetition of it with ''.
Col1 Col2
a 1
b 1
c 1
d 2
What I need is:
Col1 Col2
a 1
b
c
d 2
Thanks.
Use duplicated and replace the repeated values with an empty string - but that produces a mixed column (numbers and strings), so some functions may fail on it. Replacing with NaN is better, although integers are then converted to floats.
df.loc[df['Col2'].duplicated(), 'Col2'] = ''
#if want numeric column
#df.loc[df['Col2'].duplicated(), 'Col2'] = np.nan
Faster alternative:
df['Col2'] = np.where(df['Col2'].duplicated(), '', df['Col2'])
print (df)
Col1 Col2
0 a 1
1 b
2 c
3 d 2
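Both variants in one runnable sketch; Series.mask is used here for the NaN variant instead of the .loc assignment above, to avoid the dtype-change warning newer pandas emits when writing NaN into an integer column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': list('abcd'), 'Col2': [1, 1, 1, 2]})

# Empty-string variant: note np.where coerces the whole result to strings
blanked = np.where(df['Col2'].duplicated(), '', df['Col2'])

# NaN variant keeps the column numeric (ints become floats)
df['Col2'] = df['Col2'].mask(df['Col2'].duplicated())
print(df)
```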

Pandas - select rows with best values

I have this dataframe
col1 col2 col3
0 2 A 1
1 1 A 100
2 3 B 12
3 4 B 2
I want to select the highest col1 value from all with A, then the one from all with B, etc, i.e. this is the desired output
col1 col2 col3
0 2 A 1
3 4 B 2
I know I need some kind of groupby('col2'), but I don't know what to use after that.
Is that what you want?
In [16]: df.groupby('col2').max().reset_index()
Out[16]:
col2 col1
0 A 2
1 B 4
Use groupby('col2'), then idxmax to get the index of the max value within each group. Finally, use those index values to slice the original dataframe.
df.loc[df.groupby('col2').col1.idxmax()]
Notice that the index values of the original dataframe are preserved.
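A self-contained sketch of the idxmax approach on the question's frame, showing that the matching col3 values come along for free:

```python
import pandas as pd

df = pd.DataFrame({'col1': [2, 1, 3, 4],
                   'col2': ['A', 'A', 'B', 'B'],
                   'col3': [1, 100, 12, 2]})

# idxmax returns the original index of the max col1 within each col2 group;
# .loc then pulls those full rows, col3 included
best = df.loc[df.groupby('col2')['col1'].idxmax()]
print(best)
```

This is why idxmax beats a plain groupby().max(): the max aggregation computes column maxima independently, whereas idxmax keeps whole rows intact.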
