I have a problem similar to dropping duplicates, but I need to retain the row that has the repeated value. So essentially, I need to retain the first value and then replace every repetition of it with ''.
Col1 Col2
a 1
b 1
c 1
d 2
What I need is:
Col1 Col2
a 1
b
c
d 2
Thanks.
Use duplicated and replace the repeated values with an empty string. That leaves mixed values in the column (numbers and strings), so some functions may fail on it. Replacing with NaN is better in that regard, although integers are then converted to floats.
df.loc[df['Col2'].duplicated(), 'Col2'] = ''
# if you want to keep the column numeric, use NaN instead:
# df.loc[df['Col2'].duplicated(), 'Col2'] = np.nan
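A minimal sketch of the NaN variant, reconstructing the sample frame from the question, to show the dtype change mentioned above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': list('abcd'), 'Col2': [1, 1, 1, 2]})
print(df['Col2'].dtype)  # int64

# NaN has no integer representation, so the column is upcast to float
df.loc[df['Col2'].duplicated(), 'Col2'] = np.nan
print(df['Col2'].dtype)  # float64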
Faster alternative:
import numpy as np

df['Col2'] = np.where(df['Col2'].duplicated(), '', df['Col2'])
print(df)
Col1 Col2
0 a 1
1 b
2 c
3 d 2
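One detail worth knowing about the np.where variant: NumPy promotes the arguments to a common type, so every value in the column ends up as a string, not a mix of numbers and strings. A minimal sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': list('abcd'), 'Col2': [1, 1, 1, 2]})
df['Col2'] = np.where(df['Col2'].duplicated(), '', df['Col2'])
print(df['Col2'].tolist())  # ['1', '', '', '2'] -- all strings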
I have the following dataframe:
ID col1 col2 col3
0 ['a','b'] ['d','c'] ['e','d']
1 ['s','f'] ['f','a'] ['d','aaa']
Given an input string = 'a'
I want to receive a dataframe like this:
ID col1 col2 col3
0 1 0 0
1 0 1 0
I see how to do it with a for loop, but that takes forever, and there must be a method I'm missing.
Processing list columns in pandas is not vectorized, so performance is worse than with scalar values.
The first idea is to reshape the list columns into a Series with DataFrame.stack, expand the lists to scalars with Series.explode, compare against 'a', test for a match per the first two index levels with Series.any, and finally reshape back, converting the boolean mask to integers:
# in pandas >= 2.0, any(level=...) was removed; use .groupby(level=[0,1]).any() instead
df1 = df.set_index('ID').stack().explode().eq('a').any(level=[0,1]).unstack().astype(int)
print(df1)
col1 col2 col3
ID
0 1 0 0
1 0 1 0
Alternatively, use DataFrame.applymap for elementwise testing with a lambda function and the in operator:
df1 = df.set_index('ID').applymap(lambda x: 'a' in x).astype(int)
Or build a DataFrame out of each list column, so the test for 'a' can be done with DataFrame.any:
f = lambda x: pd.DataFrame(x.tolist(), index=x.index).eq('a').any(axis=1)
df1 = df.set_index('ID').apply(f).astype(int)
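A minimal end-to-end sketch of the applymap approach, reconstructing the sample frame from the question (the column values are assumed to be real Python lists, not strings):

import pandas as pd

df = pd.DataFrame({'ID': [0, 1],
                   'col1': [['a', 'b'], ['s', 'f']],
                   'col2': [['d', 'c'], ['f', 'a']],
                   'col3': [['e', 'd'], ['d', 'aaa']]})

# elementwise membership test, then cast the boolean mask to integers
# (in pandas >= 2.1, applymap is renamed DataFrame.map)
df1 = df.set_index('ID').applymap(lambda x: 'a' in x).astype(int)
print(df1)
#     col1  col2  col3
# ID
# 0      1     0     0
# 1      0     1     0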
I have to slice my DataFrame according to values (imported from a txt file) that occur in one of my DataFrame's columns. This is what I have:
>df
col1 col2
a 1
b 2
c 3
d 4
>'mytxt.txt'
2
3
This is what I need: drop rows whenever the value in col2 is not among the values in mytxt.txt.
Expected result must be:
>df
col1 col2
b 2
c 3
I tried:
values = pd.read_csv('mytxt.txt', header=None)
df = df.col2.isin(values)
But it doesn't work. Help would be much appreciated, thanks!
When you read values, I would do it as a Series, and then convert it to a set, which will be more efficient for lookups:
values = pd.read_csv('mytxt.txt', header=None, squeeze=True)
# note: squeeze= was removed in pandas 2.0; see the sketch below for a modern form
values = set(values.tolist())
Then slicing will work:
>>> df[df.col2.isin(values)]
col1 col2
1 b 2
2 c 3
What was happening is you were reading values in as a DataFrame rather than a Series, so the .isin method was not behaving as you expected.
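A minimal sketch of the fix, assuming the frame and file from the question; selecting column 0 from the read-back gives a Series directly, which also works on pandas >= 2.0 where squeeze= no longer exists:

import pandas as pd

df = pd.DataFrame({'col1': list('abcd'), 'col2': [1, 2, 3, 4]})

# column 0 of the read-back is a Series, like header=None, squeeze=True gave
values = set(pd.read_csv('mytxt.txt', header=None)[0].tolist())

print(df[df['col2'].isin(values)])
#   col1  col2
# 1    b     2
# 2    c     3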
It is not difficult to remove rows from a dataframe when a specific column is non-numerical.
But in my case I have the opposite problem: I must remove the rows where a specific column has a numerical entry.
Convert the column to numeric with errors='coerce', which turns non-numeric entries into NaN, then test for missing values to drop the numeric ones:
df = pd.DataFrame({'tested col': [1, 'b', 2, '3'],
                   'A': [1, 2, 3, 4]})
print(df)
tested col A
0 1 1
1 b 2
2 2 3
3 3 4
# the string '3' parses as a number, so it is removed as well
df1 = df[pd.to_numeric(df['tested col'], errors='coerce').isna()]
print(df1)
tested col A
1 b 2
I have a df:
col1 col2
A 1
B 2
1 string
2 3
C more string
How can I drop all the rows where col2 contains a string?
You can do:
df[pd.to_numeric(df['col2'], errors='coerce').notnull()]
Output:
col1 col2
0 A 1
1 B 2
3 2 3
Try
df = df[df['col2'].apply(lambda x: not isinstance(x, str))]
The apply call outputs True for every row whose value is not a string and False for strings; the rows with True are then selected from the dataframe.
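Note that the two answers are not equivalent when a number is stored as a string: to_numeric parses '3' as numeric, while the type check treats it as a string. A minimal sketch, assuming a frame with one such value:

import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'C'], 'col2': [1, '3', 'text']})

# to_numeric parses the string '3', so only 'text' is dropped
print(df[pd.to_numeric(df['col2'], errors='coerce').notnull()])
#   col1 col2
# 0    A    1
# 1    B    3

# the type check drops every str, including '3'
print(df[df['col2'].apply(lambda x: not isinstance(x, str))])
#   col1 col2
# 0    A    1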
I have this dataframe
col1 col2 col3
0 2 A 1
1 1 A 100
2 3 B 12
3 4 B 2
I want to select the highest col1 value from all with A, then the one from all with B, etc, i.e. this is the desired output
col1 col2 col3
0 2 A 1
3 4 B 2
I know I need some kind of groupby('col2'), but I don't know what to use after that.
Is that what you want?
In [16]: df.groupby('col2').max().reset_index()
Out[16]:
col2 col1
0 A 2
1 B 4
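One caveat: max() aggregates every column independently, so with col3 included the result can mix values from different rows. A minimal sketch on the question's data:

import pandas as pd

df = pd.DataFrame({'col1': [2, 1, 3, 4],
                   'col2': ['A', 'A', 'B', 'B'],
                   'col3': [1, 100, 12, 2]})

# col3 is maxed independently of col1, so group A pairs col1=2 with col3=100,
# which is not a row of the original frame
print(df.groupby('col2').max().reset_index())
#   col2  col1  col3
# 0    A     2   100
# 1    B     4    12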
Use groupby('col2'), then idxmax to get the index of the max value within each group. Finally, use these index values to slice the original dataframe.
df.loc[df.groupby('col2').col1.idxmax()]
Notice that the index values of the original dataframe are preserved.
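A minimal end-to-end sketch of the idxmax approach on the same data as above:

import pandas as pd

df = pd.DataFrame({'col1': [2, 1, 3, 4],
                   'col2': ['A', 'A', 'B', 'B'],
                   'col3': [1, 100, 12, 2]})

# idxmax returns the row label of the max col1 within each group,
# and .loc pulls those whole rows from the original frame
print(df.loc[df.groupby('col2').col1.idxmax()])
#    col1 col2  col3
# 0     2    A     1
# 3     4    B     2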