How to strip a value from a delimited string - Python

I have a list which I have joined using the following code:
patternCore = '|'.join(list(Broker['prime_broker_id']))
patternCore
'CITI|CS|DB|JPM|ML'
I'm not sure why I did it that way, but I used patternCore to filter multiple strings at the same time. Please note that Broker is a DataFrame:
Broker['prime_broker_id']
29 CITI
30 CS
31 DB
32 JPM
33 ML
Name: prime_broker_id, dtype: object
Now I am looking to strip one string. Say I would like to strip 'DB'. How can I do that please?
I tried this
patternCore.strip('DB')
'CITI|CS|DB|JPM|ML'
but nothing is stripped
Thank you

Since Broker is a Pandas dataframe, you can use loc with Boolean indexing, then use pd.Series.tolist:
mask = Broker['prime_broker_id'] != 'DB'
patternCore = '|'.join(Broker.loc[mask, 'prime_broker_id'].tolist())
A more generic solution, which works with objects other than Pandas dataframes, is to use a list comprehension with an if condition:
patternCore = '|'.join([x for x in Broker['prime_broker_id'] if x != 'DB'])
Without returning to your input series, using the same idea you can split and re-join:
patternCore = 'CITI|CS|DB|JPM|ML'
patternCore = '|'.join([x for x in patternCore.split('|') if x != 'DB'])
Expect the last option to be comparatively expensive, since it requires reading every character of the input string rather than filtering the original series.
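For a self-contained check of the first approach (the Broker frame below is rebuilt from the output shown in the question):
import pandas as pd

Broker = pd.DataFrame({'prime_broker_id': ['CITI', 'CS', 'DB', 'JPM', 'ML']})
mask = Broker['prime_broker_id'] != 'DB'
print('|'.join(Broker.loc[mask, 'prime_broker_id'].tolist()))  # CITI|CS|JPM|ML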

I would like to mention some points which have not been touched upon till now.
I tried this
patternCore.strip('DB')
'CITI|CS|DB|JPM|ML'
but nothing is stripped
The reason it didn't work is that strip() returns a copy of the string with only the leading and trailing characters removed.
NOTE:
Not characters occurring somewhere in the middle. As the documentation puts it:
The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped
Here you have specified the chars argument as 'DB'. So had your string been something like 'CITI|CS|JPM|ML|DB', your code would have worked partially (the pipe at the end would remain).
But even then this is not good practice, because it would also strip something like
'DCITI|CS|JPM|MLB' to 'CITI|CS|JPM|ML', or 'CITI|CS|JPM|ML|BD' to 'CITI|CS|JPM|ML|'.
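A quick demonstration of this behaviour:
print('CITI|CS|DB|JPM|ML'.strip('DB'))  # CITI|CS|DB|JPM|ML -- unchanged, 'DB' is in the middle
print('DCITI|CS|JPM|MLB'.strip('DB'))   # CITI|CS|JPM|ML -- leading D and trailing B removed
print('CITI|CS|JPM|ML|BD'.strip('DB'))  # CITI|CS|JPM|ML| -- trailing B and D removed, pipe remains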
I would like to strip 'DB'.
For this part, @jpp has already given a fine answer.

Related

Regex needed to replace "0" either preceded, followed or surrounded by "1" in string of zeros and ones

Given the string "001000100", I want to replace only the zeros surrounded by "1" by "1". The result in this case would be "001111100" In this case there's guaranteed to be only one sequence of zeros surrounded by ones.
Given the string "100" or "001" or "110" or "011", I want the original string returned.
Performance is not an issue, as the string (which is currently "101") is only expected to grow slowly over time as electricity and/or tax rates change.
I think this should be trivial but my limited regex experience and web searches have failed to come up with an answer. Any help coming up with the relevant regex pattern will be appreciated.
EDIT: since posting this question, I've received quite a bit of useful feedback. To ensure any answers address my requirements I've rethought the requirements and I think (since I'm still not 100% certain) that they can be summarized as follows:
‘string’ shall always contain at least one 1
‘string’ shall have zero or one sequence of one or more 0 surrounded by a 1
a sequence of one or more 0 surrounded by a 1 shall be replaced by the same number of 1
‘string’ that does not have at least one 0 surrounded by 1 shall be returned as-is
Another useful piece of information is that the original input is not a string but a Python list of Booleans. Therefore any solution that uses regex will have to convert the list of Booleans to a string and vice versa.
I solved my problem thanks to the essential contributions of Kelly Bundy and bobble bubble. The following Python function meets the requirements but improvements are of course welcome:
import re

def make_contiguous(booleans):  # replaces runs of '0' surrounded by '1' with '1'
    # convert the list of Booleans to a string so the regex can be applied
    string = "".join(str(int(b)) for b in booleans)
    # replace each run of zeros flanked by ones with the same number of ones
    string = re.sub('10*1', lambda m: '1' * len(m[0]), string)
    # convert the string back to a list of Booleans
    return [c == '1' for c in string]
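Example usage:
print(make_contiguous([True, False, False, True]))  # [True, True, True, True]
print(make_contiguous([False, True, True]))         # [False, True, True] -- returned as-is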

Get dummy variables from a string column full of mess

I'm a less-than-a-week beginner in Python and data science, so please forgive me if these questions seem obvious.
I've scraped data on a website, but the result is unfortunately not very well formatted and I can't use it without transformation.
My Data
I have a string column which contains a lot of features that I would like to convert into dummy variables.
Example of string : "8 équipements & optionsextérieur et châssisjantes aluintérieurBluetoothfermeture électrique5 placessécuritékit téléphone main libre bluetoothABSautreAPPUI TETE ARclimatisation"
What I would like to do
I would like to create a dummy column "Bluetooth" which would be equal to one if the pattern "bluetooth" is contained in the string, and zero if not.
I would like to create another dummy column "Climatisation" which would be equal to one if the pattern "climatisation" is contained in the string, and zero if not.
...etc
And do it for 5 or 6 patterns which interest me.
What I have tried
I wanted to use a match test with regular expressions and combine it with the pd.get_dummies method.
import re
import pandas as pd

def match(My_pattern, My_strng):
    m = re.search(My_pattern, My_strng)
    if m:
        return True
    else:
        return False

pd.get_dummies(df["My messy strings colum"], ...)
I haven't succeeded in finding how to set pd.get_dummies's arguments to specify the test I would like to apply to the column.
I was even wondering if this is the best strategy, or if it wouldn't be easier to create other parallel columns and apply a match.group() on my messy strings to populate them.
I'm not sure I would know how to program that anyway.
Thanks for your help
I think one way to do this would be:
df.loc[df['My messy strings colum'].str.contains("bluetooth", na=False),'Bluetooth'] = 1
df.loc[~(df['My messy strings colum'].str.contains("bluetooth", na=False)),'Bluetooth'] = 0
df.loc[df['My messy strings colum'].str.contains("climatisation", na=False),'Climatisation'] = 1
df.loc[~(df['My messy strings colum'].str.contains("climatisation", na=False)),'Climatisation'] = 0
The tilde (~) represents not, so the condition is reversed in this case to "string does not contain".
na=False means that if your messy column contains any null values, these will not cause an error; they will simply be assumed not to meet the condition.
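Since you want to do this for 5 or 6 patterns, a more compact variant is to loop over the patterns and convert each Boolean mask directly to 0/1. This is a sketch; the pattern list and column name are assumptions based on the question:
patterns = ["bluetooth", "climatisation"]  # extend with the other patterns of interest
for p in patterns:
    # case=False makes the match case-insensitive; astype(int) turns True/False into 1/0
    df[p.capitalize()] = df["My messy strings colum"].str.contains(p, case=False, na=False).astype(int)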

Problems with function isin: problem with numbers, with words working normally

I have a problem with this function:
print(df.loc[df['Kriterij1'] == '63'])
and I also tried (the same):
df[df.Kriterij1.isin(['aaa','63'])]
When I try to filter by numbers, the output is only the header (empty cells); it works only for the word 'aaa'.
Or maybe I can use another function?
I think you need to change '63' (string) to 63 (number) if numeric values are mixed with string values:
print(df.loc[df['Kriterij1'] == 63])
print(df[df.Kriterij1.isin(['aaa',63])])
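A minimal reproduction (the data here is an assumption) showing why the string comparison comes back empty:
import pandas as pd

df = pd.DataFrame({'Kriterij1': ['aaa', 63]})
print(df.loc[df['Kriterij1'] == '63'])  # empty: the cell holds the number 63, not the string '63'
print(df.loc[df['Kriterij1'] == 63])    # matches the numeric row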

Pandas Key Error When Searching For Keyword In "Cell"

I am iterating over some data in a pandas dataframe searching for specific keywords, however the resulting regex search results in a KeyError: 19.
I've tried to pull out the data in the specific cell, place it in a string object and search through that, but every time I attempt to point anything to look at the data in that column, I get a KeyError: 19.
To preface my code example, I have pulled out specific chunks of the dataframe and placed them in a list of lists. (Of these chunks, I have kept all of the columns that were in the original dataframe)
Here is an example of the iteration I am attempting:
for eachGroup in mainList:
    for lineItem in eachGroup:
        if re.search(r'( keyword )', lineItem[19], re.I):
            dostuff
As you might have guessed, the data I am searching for keywords in is column 19 which has data formatted like this:
3/23/2019 11:32:0 3/23/2019 11:32:0 3/23/2019 14:3:0 CSG CHG H6 27 1464D Random Random Random 81
Every other attempt at searching for keywords in different columns executes fine without any errors. Why would this case alone return a KeyError?
To add some more clarity, even the following code produces the same KeyError:
for eachGroup in mainList:
    for lineItem in eachGroup:
        text = lineItem[19]
Here's a WTF moment...
Instead of using Python's smart for looping, I decided to be more granular and loop through with a while loop. Needless to say, it worked.
The below code implementation fixes the issue, though why it does I have no clue:
bigCount = len(mainList)
count = 0
while count < bigCount:
    smallCount = 0
    while smallCount < len(mainList[count]):
        if re.search(r'( keyword )', mainList[count][smallCount][19], re.I):
            dostuff
        smallCount += 1
    count += 1
Try changing re.search(r'( keyword )', lineItem[19], re.I): to re.match('(.*)keyword(.*)', lineItem[19]):. Note that both re.search and re.match return a match object on success and None otherwise, and None is falsy, so either works in an if statement; the difference is that re.match anchors at the start of the string, which is why the (.*) prefix and suffix are needed to allow any other characters to the left or right of the keyword. Hope it helps.
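As an aside on the original KeyError: 19 (a hedged guess, since the construction of mainList isn't shown): if the inner items are pandas Series with integer labels, for example rows of a sliced DataFrame, then [19] is a label lookup, and KeyError: 19 simply means no label 19 exists; positional access sidesteps this:
import pandas as pd

row = pd.Series(['a', 'b'], index=[3, 7])  # assumed illustration: integer labels without 19
# row[19]           # raises KeyError: 19 -- label-based lookup
text = row.iloc[1]  # positional access works regardless of labels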

Remove Rows that Contain a specific Value anywhere in the row (Pandas, Python 3)

I am trying to remove all rows in a Pandas dataset that contain the symbol "+" anywhere in the row. So ideally this:
Keyword
+John
Mary+Jim
David
would become
Keyword
David
I've tried doing something like this in my code but it doesn't seem to be working.
excluded = ('+')
removal2 = removal[~removal['Keyword'].isin(excluded)]
The problem is that sometimes the + is contained within a word, at the beginning of a word, or at the end. Any ideas? Do I need to use an index function? Thank you!
Use the vectorised str method contains, pass an escaped '+' (it is a regex metacharacter), and negate the boolean condition using ~:
In [29]:
df[~df.Keyword.str.contains(r'\+')]
Out[29]:
Keyword
2 David
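Alternatively, the escape can be avoided by telling contains to treat the pattern as a literal string rather than a regex:
df[~df.Keyword.str.contains('+', regex=False)]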
