Python - Remove tuple from df column if present in another df column - python

I have a function which removes a string from a df column, if the string is present in the column of another df:
df1['col'] = df1['col'][~df1['col'].isin(df2['col'])]
The problem is that I now have to use this function on a column of tuples, which it does not work with. Is there an easy way to adapt the function above to handle tuples? Data:
df1:
index col1
0 ('carol.clair', 'mark.taylor')
1 ('james.ray', 'tom.kopeland')
2 ('andrew.french', 'jack.martin')
3 ('john.grisham', 'huratio.kane')
4 ('ellis.taylor', 'sam.johnson')
df2:
index col
0 ('james.ray', 'tom.kopeland')
1 ('john.grisham', 'huratio.kane')
Desired output:
df1
index col1
0 ('carol.clair', 'mark.taylor')
1 ('andrew.french', 'jack.martin')
2 ('ellis.taylor', 'sam.johnson')
The function does work if the column is first converted to string, however this raises an error later on in my code (I've tried using the .astype(tuple) command to solve this after removing the tuples, however the same error arose):
ValueError: too many values to unpack (expected 2)

This will give you the desired output:
df1.loc[~df1['col1'].isin(df2['col'])].reset_index(drop=True)
# col1
#0 (carol.clair, mark.taylor)
#1 (andrew.french, jack.martin)
#2 (ellis.taylor, sam.johnson)
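If the filter appears not to match anything, one possible cause (an assumption, since the question does not show how the frames were built) is that the column holds string representations of tuples, for example after a round trip through CSV. The sketch below normalises both columns to real tuples before filtering with plain set membership. The helper name to_tuple is hypothetical.

```python
import ast

import pandas as pd

df1 = pd.DataFrame({"col1": [
    ("carol.clair", "mark.taylor"),
    ("james.ray", "tom.kopeland"),
    ("andrew.french", "jack.martin"),
    ("john.grisham", "huratio.kane"),
    ("ellis.taylor", "sam.johnson"),
]})
# Assumption for illustration: df2 stores its tuples as strings
df2 = pd.DataFrame({"col": [
    "('james.ray', 'tom.kopeland')",
    "('john.grisham', 'huratio.kane')",
]})


def to_tuple(value):
    """Hypothetical helper: parse "('a', 'b')" strings, pass real tuples through."""
    return ast.literal_eval(value) if isinstance(value, str) else tuple(value)


# Tuples are hashable, so a set gives fast membership tests
exclude = set(df2["col"].map(to_tuple))
keep = df1["col1"].map(to_tuple).map(lambda pair: pair not in exclude)
result = df1.loc[keep].reset_index(drop=True)
print(result)
```

Keeping the values as real tuples, rather than converting the column to strings, also avoids unpacking a string by mistake later on, which is one way to end up with "too many values to unpack".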

Related

Appending New Row to Existing Dataframe using Pandas

I'm trying to use pandas to append a blank row based on the values in the first column. When the first six characters in the first column don't match, I want an empty row between them (effectively creating groups). Here is an example of what the output could look like:
002446
002447-01
002447-02
002448
This is what I was able to put together thus far.
readie=pd.read_csv('title.csv')
i=0
for row in readie:
    readie.append(row)
    i+=1
    if readie['column title'][i][0:5]!=readie['column title'][i+1][0:5]:
        readie.append([])
When running this code, I get the following error message:
TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid
I believe there are other ways to do this, but I would like to use pandas if at all possible.
I'm using the approach from this answer.
Assuming that strings like '123456' and '123' are considered as not matching:
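import pandas as pd
df_orig = pd.DataFrame(
    {'col': ['002446', '002447-01', '002447-02', '002448', '00244', '002448']}
)
df = df_orig.reset_index(drop=True)  # make sure the index runs 0..n-1
first_6 = df['col'].str.slice(stop=6)  # first six characters of each value
# True wherever the prefix differs from the previous row's prefix
mask = first_6 != first_6.shift(fill_value=first_6[0])
# shift each row down by the number of group changes seen so far
df.index = df.index + mask.cumsum()
# fill the resulting gaps in the index with empty (NaN) rows
df = df.reindex(range(df.index[-1] + 1))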
print(df)
col
0 002446
1 NaN
2 002447-01
3 002447-02
4 NaN
5 002448
6 NaN
7 00244
8 NaN
9 002448
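As for the TypeError in the question: iterating directly over a DataFrame yields its column labels, so each `row` in that loop is a string, and `append` cannot concatenate a string. The sketch below illustrates this with a made-up one-column frame standing in for title.csv (the data is an assumption). It also uses pd.concat, because DataFrame.append returned a new object instead of modifying the frame in place and was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical stand-in for the contents of title.csv
readie = pd.DataFrame({"column title": ["002446", "002447-01", "002447-02"]})

# Iterating over a DataFrame yields column labels, not rows
labels = [item for item in readie]
print(labels)

# Rows are reached with iterrows(); each row is a Series
rows = [row["column title"] for _, row in readie.iterrows()]
print(rows)

# pd.concat returns a new DataFrame, so the result must be assigned
blank = pd.DataFrame({"column title": [""]})
extended = pd.concat([readie, blank], ignore_index=True)
print(len(extended))
```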

Can I mix Python dataframe positional index and column name?

Can I get an entry in a DataFrame using a negative/positional index together with a column name? For example,
import pandas as pd
df = pd.DataFrame({'a':[0,1,2], 'b':[3,4,5]})
# df is:
# a b
# 0 0 3
# 1 1 4
# 2 2 5
df.iloc[-1]['a'] # got 2
df['a'].iloc[-1] # got 2
df.loc[1,'a'] # got 1
df.loc[-1,'a'] # got KeyError: -1
No. Use df.columns.get_loc to convert the column label to its integer position:
df.iloc[-1, df.columns.get_loc('a')]
Output:
2
No, you can't mix both within the same iloc/loc call (this used to be possible with the ix indexer, which was deprecated and has since been removed).
One correct approach is to index in two steps:
df.iloc[-1]['a']
Note that the "issue" is not specific to negative positional indices. When you do df.loc[1,'a'] you are not slicing per position, but per name. This would also fail with a KeyError if you had no 1 in the index.
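A minimal sketch putting the options side by side, using the frame from the question. The third variant, which turns the position into a label via df.index, is not from the answers above and is offered only as one more possibility.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5]})

# Purely positional: convert the column label to its integer position
by_position = df.iloc[-1, df.columns.get_loc('a')]

# Two-step indexing: positional row first, then the column label
two_step = df.iloc[-1]['a']

# Purely label-based: convert the row position to its index label
by_label = df.loc[df.index[-1], 'a']

print(by_position, two_step, by_label)
```

The single-call forms are generally preferable when assigning a value, since two-step indexing may operate on a temporary copy.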

Return value from row based on column value pandas

I have a data frame with a column of lists of strings. I want to find the value of one column in the row where another column has a given value,
e.g.
samples subject trial_num
0 ['aa','bb'] 1 1
1 ['bb','cc'] 1 2
I have ['bb','cc'] and I want to get the value from the trial_num column where this list equals the samples column, in this case 2.
Because the search column (samples) contains lists, things are slightly more complicated.
In this case, the apply() function can be used to test the values, and return a boolean mask, which can be applied to the DataFrame to obtain the required value.
Example code:
df.loc[df['samples'].apply(lambda x: x == ['bb', 'cc']), 'trial_num']
Output:
1 2
Name: trial_num, dtype: int64
To only return the required value (2), simply append .iloc[0] to the end of the statement, as:
df.loc[df['samples'].apply(lambda x: x == ['bb', 'cc']), 'trial_num'].iloc[0]
>>> 2
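For reference, a self-contained version of the above, rebuilding the frame from the question. The names target, mask, matches and value are illustrative, and the guard for "no matching row" is an addition that the original answer does not include.

```python
import pandas as pd

df = pd.DataFrame({
    "samples": [["aa", "bb"], ["bb", "cc"]],
    "subject": [1, 1],
    "trial_num": [1, 2],
})

target = ["bb", "cc"]

# Compare each list element-wise in Python; this gives a boolean mask
mask = df["samples"].apply(lambda x: x == target)
matches = df.loc[mask, "trial_num"]

# .iloc[0] raises IndexError on an empty result, so check first
value = matches.iloc[0] if not matches.empty else None
print(value)
```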

Dropping row in pandas by index - have to add to index?

I have a pandas dataframe of 182 rows that comes from read_csv. The first column, sys_code, contains various alphanumeric codes. I want to drop ones that start with 'FB' (there are 14 of these). I loop through the dataframe, adding what I assume would be the index to a list, then try to drop by index using the list. But this doesn't work unless I add 18 to each index number.
Without adding 18, I get a list containing numbers from 84 - 97. When I try to drop the rows using this list for indexes, I get KeyError: '[84] not found in axis'. But when I add 18 to each number, it works fine, at least for this particular dataset. But why is this? Shouldn't i be the same as the index number?
fb = []
i = 0
df.reset_index(drop=True)
for x in df['sys_code']:
    if x[:2] == 'FB':
        fb.append(i+18) #works
        fb.append(i) # doesn't work
    i += 1
df.drop(fb, axis=0, inplace=True)
You could use Series.str.startswith. Here's an example:
df = pd.DataFrame({'col1':['some string', 'FBsomething', 'FB', 'etc']})
print(df)
col1
0 some string
1 FBsomething
2 FB
3 etc
You could remove the rows whose strings start with FB using:
df[~df.col1.str.startswith('FB')]
col1
0 some string
3 etc
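As to why the offset of 18 was needed: one likely explanation, assuming the frame's index labels did not start at 0, is that df.reset_index(drop=True) returns a new DataFrame and leaves df untouched unless the result is assigned back, while drop works on index labels rather than positions. A small sketch with made-up data (the labels 18 to 21 are hypothetical):

```python
import pandas as pd

df = pd.DataFrame(
    {"sys_code": ["AB1", "FB1", "CD2", "FB2"]},
    index=[18, 19, 20, 21],  # hypothetical labels that do not start at 0
)

# Without assignment this has no effect on df
df.reset_index(drop=True)
labels_unchanged = list(df.index)

# Assigning the result makes labels and positions coincide
df = df.reset_index(drop=True)
fb = [i for i, code in enumerate(df["sys_code"]) if code.startswith("FB")]
cleaned = df.drop(fb, axis=0)
print(cleaned)
```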

Creating series from a series of list using positional list member

I have a data frame with a string column. I need to create a new column containing the 3rd element of col1.split(' '). I tried
df['col1'].str.split(' ')[0]
but all I get is an error.
Actually, I need to turn col1 into multiple columns after splitting by " ".
What is the correct way to do this?
Consider this df
df = pd.DataFrame({'col': ['Lets say 2000 is greater than 5']})
col
0 Lets say 2000 is greater than 5
You can split and then use the str accessor to get the elements at different positions:
df['third'] = df.col.str.split(' ').str[2]
df['fifth'] = df.col.str.split(' ').str[4]
df['last'] = df.col.str.split(' ').str[-1]
col third fifth last
0 Lets say 2000 is greater than 5 2000 greater 5
Another way is:
df["third"] = df['col'].apply(lambda x: x.split()[2])
