How to decode strings saved in utf-8 format - python

I'm trying to decode the strings in the list below. They were all encoded in utf-8 format.
_str = ['."\n\nThe vicar\'', ':--\n\nIn the', 'cathedral']
Expected output:
['.The vicar', ':--In the', 'cathedral']
My attempts
>>> for x in _str:
...     x.decode('string_escape')
...     print x
'."\n\nThe vicar\''
."
The vicar'
':--\n\nIn the'
:--
In the
'cathedral'
cathedral
>>> print [x.decode('string_escape') for x in _str]
['."\n\nThe vicar\'', ':--\n\nIn the', 'cathedral']
Both attempts failed.
Any ideas?

Your strings contain real newline characters (the \n you see is just how their repr is displayed), so there is nothing for string_escape to decode. What you want is simply to remove some characters from each string, and that can be done with a simple regex like the following:
import re
print [re.sub(r'[."\'\n]', '', x) for x in _str]
This regex removes all of the characters ., ", ' and \n, and the result will be:
['The vicar', ':--In the', 'cathedral']
hope this helps.
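If you would rather keep the leading period from the expected output, a slight variation of that character class should do it; here is a sketch that only strips the quotes and newlines:
import re

_str = ['."\n\nThe vicar\'', ':--\n\nIn the', 'cathedral']

# Remove only double quotes, single quotes and newlines; the period is kept.
print [re.sub(r'["\'\n]', '', x) for x in _str]
# ['.The vicar', ':--In the', 'cathedral']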

Related

Python - .split() with two arguments

I need to make a modification to some Python code.
This code scrapes information from a .csv file and writes it into a new .csv file with a different structure.
In one of the columns of the source files, I have a string value which, 99% of the time, is formed this way: 'block1 block2 block3'.
Block2 ends with the letter 'm' 99% of the time.
Example: 'R2 180m RFT'.
By browsing the source dataset, I realized that in 1% of the cases, block2 can end with 'M'.
As I need everything after the 'm' or 'M', I'm a bit stuck.
I used the .split() function like this:
'Newcolumn': getattr(row_unique_ids, 'COLUMNINTHEDATASET').split('m')[1],
By doing so, my script raises an error, because it hits a value of this style: 'R2 180M AST'.
So I would like to know how to pass an additional argument, so that the split works whether the script encounters 'm' or 'M'.
Thank you for your help.
One solution is to lowercase the whole string and then split:
s = getattr(row_unique_ids, 'COLUMNINTHEDATASET')
s = s.lower()
s.split('m')[1]
But that will mess up your casing. If you want to preserve casing,
another solution is to do:
x = ''
s = getattr(row_unique_ids, 'COLUMNINTHEDATASET')
for c in s:
    if c == 'M':
        x += 'm'
    else:
        x += c
x.split('m')[1]
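The loop above just maps 'M' to 'm' before splitting, so the same thing can be written in one line with str.replace; here is a sketch using a sample value in the style of the question:
s = 'R2 180M AST'  # e.g. getattr(row_unique_ids, 'COLUMNINTHEDATASET')
# Normalize only the 'M' to 'm', then split once and take what follows.
print(s.replace('M', 'm').split('m', 1)[1])  # ' AST'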
In general, one way to split on multiple delimiters is:
import re
string = "this is 3an infamous String4that I need to s?plit in an infamou.s way"
#Preserve the original char
print (re.sub(r"([0-9]|[?.]|[A-Z])",r'\1'+"DELIMITER",string).split('DELIMITER'))
#Discard the original char
print (re.sub(r"([0-9]|[?.]|[A-Z])","DELIMITER",string).split('DELIMITER'))
Output:
['this is 3', 'an infamous S', 'tring4', 'that I', ' need to s?', 'plit in an infamou.', 's way']
['this is ', 'an infamous ', 'tring', 'that ', ' need to s', 'plit in an infamou', 's way']
In your context:
import re
string = "R2 180m RFT R2 180M RFT"
print (re.sub(r"\b([0-9]+)[mM]\b",r'\1'+"M",string).split('M'))
#print (re.sub(r"\b([0-9]+)[mM]\b",r'\1'+"M",getattr(row_unique_ids, 'COLUMNINTHEDATASET')).split('M'))
Output:
['R2 180', ' RFT R2 180', ' RFT']
It will split on m and M if those are preceded by a number.
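If all you need is the part after the first lower- or upper-case 'm', a shorter alternative is re.split with a character class; this is a sketch on a sample value from the question:
import re

s = 'R2 180M AST'  # sample value in the style described in the question
# Split once on either 'm' or 'M' and keep everything after it.
print(re.split(r'[mM]', s, maxsplit=1)[1])  # ' AST'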

Stripping out \\n plus whitespace using .strip() and regex is not working

I've been attempting to strip out the \n plus the whitespace before and after the words from a string, but it is not working for some reason.
This is what I tried:
.strip(my_string)
and
re.sub('\n', '', my_string)
I have tried using .strip and re in order to get it working, but it simply returns the same string.
Example input:
\\n The people who steal our cards already know all of this...\\n
\\n , \\n I\'m sure every fraud minded person in America is taking notes.\\n
\\n
Expected output would be:
The people who steal our cards already know all of this..., I\'m sure every fraud minded person in America is taking notes.
You're probably looking for something like this:
re.sub(r'\s+', r' ', x)
A usage example follows:
In [10]: x
Out[10]: 'hello \n world \n blue'
In [11]: re.sub(r'\s+', r' ', x)
Out[11]: 'hello world blue'
If you'd also like to collapse the literal two-character sequence r'\n' (a backslash followed by an n), then let's grab those as well:
re.sub(r'(\s|\\n)+', r' ', x)
And the output:
In [14]: x
Out[14]: 'hello \\n world \n \\n blue'
In [15]: re.sub(r'(\s|\\n)+', r' ', x)
Out[15]: 'hello world blue'
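Applied to the example input from the question, a sketch looks like this; note that re.sub returns a new string (it does not modify my_string in place), and the stray comma from the input survives, so the result is close to but not identical to the expected output:
import re

my_string = ("\\n  The people who steal our cards already know all of this...\\n "
             "\\n , \\n  I'm sure every fraud minded person in America is taking notes.\\n \\n")

# Collapse runs of whitespace and literal "\n" sequences into single spaces,
# then trim the ends; assign the result to a new name.
cleaned = re.sub(r'(\s|\\n)+', ' ', my_string).strip()
print(cleaned)
# The people who steal our cards already know all of this... , I'm sure every fraud minded person in America is taking notes.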

How do I replace a word in a string in python?

Let us say I have a string
c = "a string is like this and roberta a a thanks"
I want the output to be:
' string is like this and roberta thanks'
This is what I am trying
c.replace('a', ' ')
' string is like this nd robert thnks'
But this replaces each 'a' in the string
So I tried this
c.replace(' a ', ' ')
'a string is like this and roberta thanks'
But this leaves the 'a' at the start of the string in place.
How do I do this?
This looks like a job for re:
import re
while re.subn(r'(\s+a\s+|^a\s+)', ' ', txt)[1] != 0:
    txt = re.subn(r'(\s+a\s+|^a\s+)', ' ', txt)[0]
I myself figured it out.
c = "a string is like this and roberta a a thanks"
import re
re.sub('\\ba\\b', ' ', c)
' string is like this and roberta thanks'
Here you go myself! Enjoy!
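A variation on that self-answer (just a sketch) also consumes the space after each whole-word 'a', so no doubled spaces are left behind; it drops the leading space as well:
import re

c = "a string is like this and roberta a a thanks"

# Remove each whole-word 'a' together with the whitespace that follows it.
print(re.sub(r'\ba\b\s*', '', c))
# string is like this and roberta thanks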

Python - split sentence after words but with maximum of n characters in result

I want to display some text on a scrolling display with a width of 16 characters.
To improve readability I want to flip through the text, but not by simply splitting at every 16th character; rather, I want to split at the end of a word or punctuation mark, before the 16-character limit is exceeded.
Example:
text = 'Hello, this is an example of text shown in the scrolling display. Bla, bla, bla!'
This text should be converted into a list of strings with a maximum of 16 characters each:
result = ['Hello, this is ', 'an example of ', 'text shown in ', 'the scrolling ', 'display. Bla, ', 'bla, bla!']
I started with the regex re.split('(\W+)', text) to get a list of every element (word, punctuation), but I am failing at combining them.
Can you help me, or at least give me some hints?
Thank you!
I'd look at the textwrap module:
>>> text = 'Hello, this is an example of text shown in the scrolling display. Bla, bla, bla!'
>>> from textwrap import wrap
>>> wrap(text, 16)
['Hello, this is', 'an example of', 'text shown in', 'the scrolling', 'display. Bla,', 'bla, bla!']
There are lots of options you can play with in the TextWrapper, for example:
>>> from textwrap import TextWrapper
>>> w = TextWrapper(16, break_long_words=True)
>>> w.wrap("this_is_a_really_long_word")
['this_is_a_really', '_long_word']
>>> w = TextWrapper(16, break_long_words=False)
>>> w.wrap("this_is_a_really_long_word")
['this_is_a_really_long_word']
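Since the target is a fixed-width scrolling display, the wrapped lines can also be padded back out to exactly 16 characters; this is just a sketch using str.ljust:
from textwrap import wrap

text = 'Hello, this is an example of text shown in the scrolling display. Bla, bla, bla!'

# Pad each wrapped line with spaces so every frame is exactly 16 characters wide.
frames = [line.ljust(16) for line in wrap(text, 16)]
print(frames)
# ['Hello, this is  ', 'an example of   ', 'text shown in   ', 'the scrolling   ', 'display. Bla,   ', 'bla, bla!       ']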
As DSM suggested, look at textwrap. If you prefer to stick with regular expressions, the following will get you part of the way there:
In [10]: re.findall(r'.{,16}\b', text)
Out[10]:
['Hello, this is ',
'an example of ',
'text shown in ',
'the scrolling ',
'display. Bla, ',
'bla, bla',
'']
(Note the missing exclamation mark and the empty string at the end though.)
Using regex:
>>> text = 'Hello, this is an example of text shown in the scrolling display. Bla, bla, bla!'
>>> pprint(re.findall(r'.{1,16}(?:\s+|$)', text))
['Hello, this is ',
'an example of ',
'text shown in ',
'the scrolling ',
'display. Bla, ',
'bla, bla!']

python regex finding all groups of words

Here is what I have so far
text = "Hello world. It is a nice day today. Don't you think so?"
re.findall('\w{3,}\s{1,}\w{3,}',text)
#['Hello world', 'nice day', 'you think']
The desired output would be ['Hello world', 'nice day', 'day today', 'today Don't', 'Don't you', 'you think']
Can this be done with a simple regex pattern?
import itertools as it
import re
three_pat=re.compile(r'\w{3}')
text = "Hello world. It is a nice day today. Don't you think so?"
for key, group in it.groupby(text.split(), lambda x: bool(three_pat.match(x))):
    if key:
        group = list(group)
        for i in range(0, len(group) - 1):
            print(' '.join(group[i:i+2]))
# Hello world.
# nice day
# day today.
# today. Don't
# Don't you
# you think
It's not clear to me what you want done with the punctuation. On the one hand, it looks like you want periods to be removed but single quotation marks to be kept. It would be easy to implement the removal of periods, but before I do, would you clarify what you want to happen to all punctuation?
map(lambda x: x[0] + x[1], re.findall('(\w{3,}(?=(\s{1,}\w{3,})))',text))
Maybe you can rewrite the lambda to be shorter (like just using '+').
And BTW ' is not part of \w or \s
Something like this with additional checks for list boundaries should do:
>>> text = "Hello world. It is a nice day today. Don't you think so?"
>>> k = text.split()
>>> k
['Hello', 'world.', 'It', 'is', 'a', 'nice', 'day', 'today.', "Don't", 'you', 'think', 'so?']
>>> z = [x for x in k if len(x) > 2]
>>> z
['Hello', 'world.', 'nice', 'day', 'today.', "Don't", 'you', 'think', 'so?']
>>> [z[n]+ " " + z[n+1] for n in range(0, len(z)-1, 2)]
['Hello world.', 'nice day', "today. Don't", 'you think']
>>>
There are two problems with your approach:
Neither \w nor \s matches punctuation.
When you match a string with a regular expression using findall, that part of the string is consumed. Searching for the next match commences immediately after the end of the previous match. Because of this a word can't be included in two separate matches.
To solve the first issue you need to decide what you mean by a word. Regular expressions aren't good for this sort of parsing. You might want to look at a natural language parsing library instead.
But assuming that you can come up with a regular expression that works for your needs, to fix the second problem you can use a lookahead assertion to check the second word. This won't return the entire match as you want but you can at least find the first word in each word pair using this method.
re.findall('\w{3,}(?=\s{1,}\w{3,})',text)
                  ^^^            ^
                  lookahead assertion
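To get the full pairs back from that approach, the second word can also be captured inside the lookahead itself, so it is matched but not consumed; a sketch (it still skips pairs broken by punctuation, as explained above):
import re

text = "Hello world. It is a nice day today. Don't you think so?"

# Group 1 is consumed; group 2 sits inside the lookahead, so the second word
# of one pair can still start the next pair.
pairs = re.findall(r"(\w{3,})\s+(?=(\w{3,}))", text)
print([' '.join(pair) for pair in pairs])
# ['Hello world', 'nice day', 'day today', 'you think']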
