str.find() unable to search for '\n' - Python

I am trying to extract text between two keywords using str.find(). But it fails to find the occurrence of '\n'
text = 'Cardiff, the 6th November 2007\n company \n'
String_to_extract = '6th November 2007'
keywords = {'date': ['Cardiff, the ' , '\n']}
Code:
text2=text[text.find(keywords['date']0])+len(keywords[0]):text.find(keywords['date'][1])]
print(text2)
str.find() seems unable to find '\n', so I get no output.
P.S. I want to use the str.find() method only.

There are several problems here:
In the keywords dictionary you use a date variable where it should be the string 'date'.
In the keywords dictionary you double-escaped the newline as \\n, while you don't do this in the text variable.
In the index calculations you use a key variable that is not defined anywhere; this should be the 'date' key defined in the keywords dictionary.
And finally, you slice from the starting position of the first keyword, while it should be its ending position.
Try this:
# String to be extracted = '6th November 2007'
text = 'Cardiff, the 6th November 2007\n\n \n\n'
keywords = {'date' : ['Cardiff, the ' , '\n\n']}
a = text.find(keywords['date'][0]) + len(keywords['date'][0])
b = text.find(keywords['date'][1])
text2 = text[a:b]
print(text2)
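To make the escaping point concrete, here is a small sketch (the variable names are illustrative, not from the question) showing that '\n' is a single newline character while '\\n' is a backslash followed by the letter n, so only the former occurs in the text:

```python
text = 'Cardiff, the 6th November 2007\n\n \n\n'

newline = '\n'    # one character: a real line break
escaped = '\\n'   # two characters: a backslash and the letter n

# str.find() returns the index of the first match, or -1 if there is none
pos_newline = text.find(newline)
pos_escaped = text.find(escaped)

print(len(newline), len(escaped))   # lengths differ: 1 and 2
print(pos_newline, pos_escaped)     # the escaped form is not found (-1)
```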

You've calculated the first index incorrectly. Try this:
text = 'Cardiff, the 6th November 2007\n\n company \n\n'
keywords = ['Cardiff, the ', '\n']
result = text[text.find(keywords[0])+len(keywords[0]):text.find(keywords[1])]
Output:
6th November 2007

To generalize the answer, use this code:
text2 = text[text.find(keywords[key][0])+len(keywords[key][0]):text.find(keywords[key][1])] # key can be any key present in the keywords dictionary

This is an interesting question, and it shows how something trivial can become hard to spot when calls are chained. Let's see what is happening in your code. You say that your code can't find the first occurrence; in fact, it definitely does. In the text 'Cardiff, the 6th November 2007\n\n \n\n' you are looking for the first occurrence of 'Cardiff, the '. That string starts at index 0, i.e. text[0], so text[text.find(keywords[key][0]):text.find(keywords[key][1])] essentially becomes text[0:text.find(keywords[key][1])]. In Python slicing the start index is inclusive, so the output is Cardiff, the 6th November 2007, which makes it look as if the first keyword was not found. To fix it, start the slice just after 'Cardiff, the '. You can do this by changing the text2 assignment as follows:
text2 = text[text.find(keywords[key][0])+len(keywords[key][0]):text.find(keywords[key][1])]
There are other ways to achieve what you want, but this is what you were originally trying to do.
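The slicing logic from the answers above can be wrapped in a small helper. This is only a sketch: the function name extract_between and the choice to return None when a marker is missing are assumptions, not part of the original answers. It still uses only str.find(), and it searches for the end marker starting after the start marker so that an earlier occurrence cannot produce an empty or reversed slice.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None
    # Move past the start marker so it is not included in the result
    start += len(start_marker)
    # Look for the end marker only after the start marker
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end]


text = 'Cardiff, the 6th November 2007\n company \n'
keywords = {'date': ['Cardiff, the ', '\n']}
print(extract_between(text, *keywords['date']))
```

Checking for -1 matters because a slice such as text[a:-1] would otherwise silently return almost the whole string.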

Related

Python: Replace string in one column from list in other column

I need some help please.
I have a dataframe with multiple columns where 2 are:
Content_Clean = Column filled with Content - String
Removals: list of strings to be removed from Content_Clean Column
Problem: I am trying to replace words in Content_Clean with spaces if in Removals Column:
Example:
Content Clean: 'Johnny and Mary went to the store'
Removals: ['Johnny','Mary']
Output: 'and went to the store'
Example Code:
for i in data_eng['Removals']:
    for u in i:
        data_eng['Content_Clean_II'] = data_eng['Content_Clean'].str.replace(u,' ')
This does not work because the Removals column contains a list per row.
Another Example:
data_eng['Content_Clean_II'] = data_eng['Content_Clean'].apply(lambda x: re.sub(data_eng.loc[data_eng['Content_Clean'] == x, 'Removals'].values[0], '', x))
Does not work as this code is only looking for one string.
The problem is that each row of the Removals column holds a list of strings that I want to remove from (or replace with spaces in) the Content_Clean column, on a per-row basis.
Here you go. This worked on my test data. Let me know if it works for you
def repl(row):
    # Remove every word listed in this row's Removals from its Content_Clean
    for word in row['Removals']:
        row['Content_Clean'] = row['Content_Clean'].replace(word, '')
    return row
data_eng = data_eng.apply(repl, axis=1)
You can call the str.replace(old, new) method to remove unwanted words from a string.
Here is one small example I have done.
a_string = "I do not like to eat apples and watermelons"
stripped_string = a_string.replace(" do not", "")
print(stripped_string)
This will remove "do not" from the sentence
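Plain str.replace() also matches inside longer words and leaves doubled spaces behind. A possible refinement, sketched here with a hypothetical helper named remove_words (not from the answers above), removes whole words only and then collapses the leftover whitespace. It works on one string and one list, so it could be applied per row.

```python
import re


def remove_words(content, removals):
    """Remove whole-word occurrences of each item in removals from content."""
    if not removals:
        return content
    # re.escape protects any regex metacharacters in the words;
    # \b restricts the match to whole words
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in removals) + r")\b"
    stripped = re.sub(pattern, " ", content)
    # Collapse runs of whitespace and trim the ends
    return " ".join(stripped.split())


print(remove_words('Johnny and Mary went to the store', ['Johnny', 'Mary']))
```

Assuming the column names from the question, a row-wise call might look like data_eng.apply(lambda row: remove_words(row['Content_Clean'], row['Removals']), axis=1).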

Want to extract the alphanumeric text with certain special characters using python regex

I have the following text, which I want to convert to a desired format using a Python regex:
text = "' PowerPoint PresentationOctober 11th, 2011(Visit) to Lap Chec1Edit or delete me in ‘view’ then ’slide master’.'"
I used the following code:
reg = re.compile("[^\w']")
text = reg.sub(' ', text)
However it gives output as text = "'PowerPoint PresentationOctober 11th 2011 Visit to Lap Chec1Edit or delete me in â viewâ then â slide masterâ'" which is not a desired output.
My desired output should be text = '"PowerPoint PresentationOctober 11th, 2011(Visit) to Lap Chec1Edit or delete me in view then slide master.'"
I want to remove all special characters except the following: []()-,.
Rather than removing the chars, you may fix them using the right encoding:
text = text.encode('windows-1252').decode('utf-8')
# => ' PowerPoint PresentationOctober 11th, 2011Visit to Lap Chec1Edit or delete me in ‘view’ then ’slide master’.'
If you want to remove them later, it will become much easier, like text.replace('‘', '').replace('’', ''), or re.sub(r'[’‘]+', '', text).
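To illustrate why the encode/decode round trip works, the sketch below first simulates the damage (UTF-8 bytes misread as Windows-1252) and then reverses it. The sample string is made up for the demonstration. It assumes the garbling really came from that particular mix-up, which may not hold for every input.

```python
original = "‘view’ then ’slide master’"

# Simulate the damage: the UTF-8 bytes are wrongly decoded as Windows-1252
garbled = original.encode("utf-8").decode("windows-1252")

# Reverse it: recover the original bytes, then decode them correctly
repaired = garbled.encode("windows-1252").decode("utf-8")

print(repaired == original)
```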
I found the answer, and it was simple. Thanks for the replies.
import re
# Keep word characters, apostrophes and , . ( ) [ ] - ; replace everything else with a space
reg = re.compile(r"[^\w',.()\[\]-]")
text = reg.sub(' ', text)
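A variation on the same whitelist idea, as a sketch only: the name strip_special and the sample string are made up here. Unlike the code above, it also keeps whitespace, drops apostrophes, and deletes unwanted characters instead of replacing them with a space, which avoids a stray space before the final period.

```python
import re

# Match anything that is NOT a word character, whitespace, or one of [ ] ( ) - , .
UNWANTED = re.compile(r"[^\w\s\[\]()\-,.]")


def strip_special(text):
    """Delete every character outside the whitelist."""
    return UNWANTED.sub("", text)


sample = "October 11th, 2011(Visit) in ‘view’ then ’slide master’."
print(strip_special(sample))
```

Deleting rather than substituting a space can glue neighbouring words together (for example around an @ sign), so which behaviour is preferable depends on the data.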

Need to search a string for a "two word" pattern in python

I’m trying to search a long string of characters for a country name. The country name is sometimes more than one word, such as Costa Rica.
Here is my code:
eol = len(CountryList)
for c in range(0, eol):
    country = str(CountryList[c])
    countrymatch = re.search(country, fullsampledata)
    if countrymatch:
        ...
fullsampledata is a long string with all the data in one line. I’m trying to parse out the country by cycling through a list of valid country names. If the country is only one word, such as ‘Holland’, it finds it. However, if the country is two or more words, such as ‘Costa Rica’, it doesn’t find it. Why?
You can search for a substring in a string using the str.find() method, which returns -1 when the substring is absent, or str.index(), which raises ValueError instead:
fullsampledata = "hwfekfwekjfnkwfehCosta Ricakwjfkwfekfekfw"
fullsampledata.find("Morocco")
-1
fullsampledata.index("Costa Rica")
17
So you can write your if statement as follows:
fullsampledata = "hwfekfwekjfnkwfehCosta Ricakwjfkwfekfekfw"
country = "Costa Rica"
# Use find() here: it returns -1 when absent, whereas index() would raise ValueError
if fullsampledata.find(country) != -1:
    # Found
    pass
else:
    # Not found
    pass
In [1]: long_string = 'asdfsadfCosta Ricaasdkj asdfsd asdjas USA alsj'
In [2]: 'Costa Rica' in long_string
Out[2]: True
You don't have your code properly shown and I'm a little too lazy to parse it. Hope this helps.
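The question does not show enough to tell why the two-word name was missed. One plausible cause, and it is only an assumption, is that the words are separated by something other than a single space (a newline, a tab, or several spaces). The sketch below, with a hypothetical helper find_countries, tolerates any run of whitespace between the words and escapes regex metacharacters in the names.

```python
import re


def find_countries(text, countries):
    """Return the countries from the list that occur in text."""
    found = []
    for country in countries:
        # Escape each word, then allow any whitespace run between words
        pattern = r"\s+".join(re.escape(part) for part in country.split())
        if re.search(pattern, text):
            found.append(country)
    return found


data = "xxCosta\nRica yy Holland zz"
print(find_countries(data, ["Costa Rica", "Holland", "Morocco"]))
```

If the separator is always exactly one space, the simpler 'Costa Rica' in long_string test shown above is enough.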

How to replace a word found in a string with what is in the list in Python

My string is below:
word = "Continue: Lifetime Benefits in Running, Volume 1, Issue 1, February 2018"
My list is:
italic_list = ['Continue', ': Lifetime Benefits in Running', ' February 2018']
I want to wrap the word(s) from the list that are found in the string in an additional <italic> tag.
The output should be like this:
<p>
<italic>Continue: Lifetime Benefits in Running</italic>, Volume 1, Issue 1, <italic>February 2018</italic>
</p>
Here is my code:
word = "Continue: Lifetime Benefits in Running, Volume 1, Issue 1, February 2018"
italic_list = ['Continue', ': Lifetime Benefits in Running', ' February 2018']
ital = ''.join(italic_list)
if ital in word:
    word = word.replace(ital, "<italic>" + ital + "</italic>")
The code works when all items in the list appear consecutively in the string. It fails when an item does not directly follow the previous item(s).
I hope there is a better way to solve this.
Thanks so much!
Don't join.
It merges the list elements into a single string. You had 3 phrases that could be checked independently; now you have a single string.
Instead, check one by one whether each phrase from italic_list is present in the input string.
In Python you can do that with a loop. Just loop over italic_list, check whether each element is present in the input string, and if so, replace that part with a tagged version of it.
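The loop described above might look like the following sketch. The helper name italicize is made up, and two choices are assumptions rather than part of the answer: surrounding whitespace is stripped from each phrase so the space stays outside the tag, and directly adjacent italic spans are merged to match the desired output.

```python
def italicize(text, phrases):
    """Wrap each phrase found in text in <italic> tags."""
    for phrase in phrases:
        core = phrase.strip()  # keep surrounding spaces outside the tag
        if core and core in text:
            text = text.replace(core, "<italic>" + core + "</italic>")
    # Merge spans that ended up directly next to each other
    return text.replace("</italic><italic>", "")


word = "Continue: Lifetime Benefits in Running, Volume 1, Issue 1, February 2018"
italic_list = ['Continue', ': Lifetime Benefits in Running', ' February 2018']
print("<p>\n" + italicize(word, italic_list) + "\n</p>")
```

A caveat: because replacements are applied to text that already contains tags, a phrase such as "italic" would also match inside the tags themselves, so this is only suitable when the phrases cannot collide with the markup.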

Replace word between two substrings (keeping other words)

I'm trying to replace a word (e.g. on) if it falls between two substrings (e.g. <temp> & </temp>) however other words are present which need to be kept.
string = "<temp>The sale happened on February 22nd</temp>"
The desired string after the replace would be:
Result = <temp>The sale happened {replace} February 22nd</temp>
I've tried using regex, I've only been able to figure out how to replace everything lying between the two <temp> tags. (Because of the .*?)
result = re.sub('<temp>.*?</temp>', '{replace}', string, flags=re.DOTALL)
However, the word on may also appear later in the string, outside <temp></temp>, and I don't want to replace it there.
re.sub('(<temp>.*?) on (.*?</temp>)', lambda x: x.group(1)+" <replace> "+x.group(2), string, flags=re.DOTALL)
Output:
<temp>The sale happened <replace> February 22nd</temp>
Edit:
Changed the regex based on suggestions by Wiktor and HolyDanna.
P.S: Wiktor's comment on the question provides a better solution.
Try lxml:
from lxml import etree
root = etree.fromstring("<temp>The sale happened on February 22nd</temp>")
root.text = root.text.replace(" on ", " {replace} ")
print(etree.tostring(root, pretty_print=True).decode())  # tostring() returns bytes in Python 3
Output:
<temp>The sale happened {replace} February 22nd</temp>
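Another way to restrict the replacement to the tagged region, sketched with the standard re module only: match each <temp>...</temp> block and run a second, whole-word substitution on its contents through a replacement function. The helper name replace_in_temp is hypothetical, and the approach assumes the tags are not nested.

```python
import re


def replace_in_temp(text, word, replacement):
    """Replace whole-word occurrences of word, but only inside <temp> blocks."""
    word_re = re.compile(r"\b" + re.escape(word) + r"\b")

    def fix_block(match):
        # group(2) is the content between the opening and closing tags
        inner = word_re.sub(lambda m: replacement, match.group(2))
        return match.group(1) + inner + match.group(3)

    return re.sub(r"(<temp>)(.*?)(</temp>)", fix_block, text, flags=re.DOTALL)


string = "<temp>The sale happened on February 22nd</temp> and later on Monday"
print(replace_in_temp(string, "on", "{replace}"))
```

Unlike a single pattern with one " on " in the middle, this handles several occurrences inside the same block and leaves occurrences outside the tags untouched.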
