Pandas: slice DataFrame according to values of a column - python

I have to slice my DataFrame according to values (imported from a txt file) that occur in one of my DataFrame's columns. This is what I have:
>df
col1 col2
a 1
b 2
c 3
d 4
>'mytxt.txt'
2
3
This is what I need: drop rows whenever the value in col2 is not among the values in mytxt.txt.
Expected result must be:
>df
col1 col2
b 2
c 3
I tried:
values = pd.read_csv('mytxt.txt', header=None)
df = df.col2.isin(values)
But it doesn't work. Help would be much appreciated, thanks!

When you read values, I would do it as a Series, and then convert it to a set, which will be more efficient for lookups:
values = pd.read_csv('mytxt.txt', header=None, squeeze=True)
values = set(values.tolist())
Then slicing will work:
>>> df[df.col2.isin(values)]
col1 col2
1 b 2
2 c 3
What was happening is you were reading values in as a DataFrame rather than a Series, so the .isin method was not behaving as you expected.
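Note: the squeeze argument to read_csv was deprecated in pandas 1.4 and removed in 2.0. On newer versions, an equivalent sketch is to squeeze the one-column result yourself:
values = pd.read_csv('mytxt.txt', header=None).squeeze('columns')
values = set(values.tolist())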

Related

How to merge rows by same value in different columns using Python (Pandas)

I have a data frame, something like this:
Id Col1 Col2 Paired_Id
1  a    NaN  A
2  c    NaN  B
A  NaN  b    1
B  NaN  d    2
I would like to merge the rows to get output like this, deleting the paired row after merging:
Id Col1 Col2 Paired_Id
1  a    b    A
2  c    d    B
Any hint?
So: merging each row (Id) with its Paired_Id entry.
Is this possible with Pandas?
Assuming NaNs in the empty cells, I would use a groupby.first with a frozenset of the two IDs as grouper:
group = df[['Id', 'Paired_Id']].apply(frozenset, axis=1)
out = df.groupby(group, as_index=False).first()
Output:
Id Col1 Col2 Paired_Id
0 1 a b A
1 2 c d B
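For reference, a self-contained sketch of this approach, with the sample frame reconstructed under the NaN assumption above; frozenset works as the grouper because it is hashable and order-insensitive, so {'1', 'A'} and {'A', '1'} land in the same group:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Id':        ['1', '2', 'A', 'B'],
    'Col1':      ['a', 'c', np.nan, np.nan],
    'Col2':      [np.nan, np.nan, 'b', 'd'],
    'Paired_Id': ['A', 'B', '1', '2'],
})

# build one frozenset per row from the Id/Paired_Id pair
group = df[['Id', 'Paired_Id']].apply(frozenset, axis=1)
# groupby.first takes the first non-NaN value per column within each pair
out = df.groupby(group, as_index=False).first()
print(out)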
Don't have a lot of information about the structure of your dataframe, so I will just assume a few things - please correct me if I'm wrong:
A line with an entry in Col1 will never have an entry in Col2.
Corresponding lines appear in the same sequence (lines 1,2,3... then the corresponding lines 1,2,3...).
Every line has a corresponding second line later on in the dataframe.
If all those assumptions are correct, you could split your data into two dataframes, df_upperhalf containing Col1 and df_lowerhalf containing Col2.
df_upperhalf = df.iloc[:len(df.index)//2]
df_lowerhalf = df.iloc[-(len(df.index)//2):]
Then you can easily combine those values:
df_combined = df_upperhalf.copy()
df_combined['Col2'] = df_lowerhalf['Col2'].values  # .values avoids index alignment between the two halves
If some of my assumptions are incorrect, this will of course not produce the results you want.
There are also quite a few ways to do it in fewer lines of code, but I think this way you end up with nicer dataframes and the code should be easily readable.
Edit:
I think this would be quite a bit faster:
df_upperhalf = df.head(len(df.index)//2)
df_lowerhalf = df.tail(len(df.index)//2)

Error when selecting rows in pandas dataframe based on column value

I have a dataframe df which looks like this:
col1 col2 col3
A 45 4
A 3 5
B 2 5
I want to make a separate dataframe, df2, which only has the rows where col1 in df equals A. Hence it should look like:
col1 col2 col3
A 45 4
A 3 5
I just use df2 = df1.loc[df1['col1'] == 'A']. However, this returns the error ValueError: Cannot index with multidimensional key. Any idea what is going wrong here?
What you tried works for me, but you can also try this:
df2 = df[df.col1 == 'A']
Output
col1 col2 col3
0 A 45 4
1 A 3 5
Edit
Tested on pandas version
pd.__version__
'1.2.4'
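For what it's worth, Cannot index with multidimensional key usually means the boolean key handed to .loc is two-dimensional, which happens when df1['col1'] returns a DataFrame instead of a Series - typically because the column label is duplicated. A minimal sketch reproducing that situation (hypothetical data):
import pandas as pd

# duplicated 'col1' label: df1['col1'] now returns a DataFrame, not a Series
df1 = pd.DataFrame([['A', 45, 4], ['A', 3, 5], ['B', 2, 5]],
                   columns=['col1', 'col1', 'col3'])

mask = df1['col1'] == 'A'  # 2-D boolean DataFrame
# df1.loc[mask]            # raises ValueError: Cannot index with multidimensional key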
Just try:
df.query("col1=='A'")

Pandas - DF with lists - find all rows that match a string in any of the columns

I have the following dataframe:
ID col1 col2 col3
0 ['a','b'] ['d','c'] ['e','d']
1 ['s','f'] ['f','a'] ['d','aaa']
Given an input string = 'a'
I want to receive a dataframe like this:
ID col1 col2 col3
0 1 0 0
1 0 1 0
I see how to do it with a for loop, but that takes forever, and there must be a method I'm missing
Processing lists in pandas is not vectorized, so performance is worse than with scalar values.
A first idea is to reshape the list columns into a single Series with DataFrame.stack, expand the lists to scalars with Series.explode, compare against 'a', test for any match per (ID, column) pair, and finally reshape back, converting the boolean mask to integers:
df1 = (df.set_index('ID')
         .stack()
         .explode()
         .eq('a')
         .groupby(level=[0, 1]).any()  # Series.any(level=...) was removed in pandas 2.0
         .unstack()
         .astype(int))
print (df1)
col1 col2 col3
ID
0 1 0 0
1 0 1 0
Alternatively, DataFrame.applymap (renamed to DataFrame.map in pandas 2.1) can run the elementwise test with a lambda function and the in operator:
df1 = df.set_index('ID').applymap(lambda x: 'a' in x).astype(int)
Or build a DataFrame from each list column, so the test against 'a' can use DataFrame.any:
f = lambda x: pd.DataFrame(x.tolist(), index=x.index).eq('a').any(axis=1)
df1 = df.set_index('ID').apply(f).astype(int)
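A minimal end-to-end sketch of the applymap variant, assuming the cells hold real Python lists (not their string representations):
import pandas as pd

df = pd.DataFrame({
    'ID':   [0, 1],
    'col1': [['a', 'b'], ['s', 'f']],
    'col2': [['d', 'c'], ['f', 'a']],
    'col3': [['e', 'd'], ['d', 'aaa']],
})

# elementwise membership test, then boolean -> int
df1 = df.set_index('ID').applymap(lambda x: 'a' in x).astype(int)
print(df1)
#     col1  col2  col3
# ID
# 0      1     0     0
# 1      0     1     0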

How to drop string values from pandas dataframe column?

I have a df:
col1 col2
A 1
B 2
1 string
2 3
C more string
How can I drop all the rows where col2 contains a string?
You can do:
df[pd.to_numeric(df['col2'], errors='coerce').notnull()]
Output:
col1 col2
0 A 1
1 B 2
3 2 3
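If you also want col2 to end up numeric (the selection above keeps the original mixed object dtype), coerce first and assign back; a small sketch:
df['col2'] = pd.to_numeric(df['col2'], errors='coerce')
df = df.dropna(subset=['col2'])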
Try
df = df[df['col2'].apply(lambda x: not isinstance(x, str))]
The apply call outputs True for every value that is not a string; for strings it yields False. Then all rows with True are selected from the data frame.
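Note that the type check only helps if the column actually holds a mix of numbers and str objects; if the data came from a CSV, every value is a string and the to_numeric approach above is the one to use. A quick way to check, as a sketch:
print(df['col2'].map(type).value_counts())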

Drop Repeated Values in Column, Retaining the Row

I have a problem similar to dropping duplicates, but I need to retain the row that has the repeated value. So essentially, I need to retain the first value and then replace every repetition of it with ''.
Col1 Col2
a 1
b 1
c 1
d 2
What I need is:
Col1 Col2
a 1
b
c
d 2
Thanks.
Use duplicated and replace the duplicated values with an empty string. But that produces a column of mixed values - numbers and strings - so some functions may fail on it. Replacing with NaN is better, although integers are then converted to floats.
import numpy as np

df.loc[df['Col2'].duplicated(), 'Col2'] = ''
#if you want a numeric column instead
#df.loc[df['Col2'].duplicated(), 'Col2'] = np.nan
Faster alternative:
df['Col2'] = np.where(df['Col2'].duplicated(), '', df['Col2'])
print (df)
Col1 Col2
0 a 1
1 b
2 c
3 d 2
