I have strings such as:
text1 = ('SOME STRING,99,1234 FIRST STREET,9998887777,ABC')
text2 = ('SOME OTHER STRING,56789 SECOND STREET,6665554444,DEF')
text3 = ('ANOTHER STRING,#88,4321 THIRD STREET,3332221111,GHI')
Desired output:
SOME STRING 99,1234 FIRST STREET,9998887777,ABC
SOME OTHER STRING,56789 SECOND STREET,6665554444,DEF
ANOTHER STRING #88,4321 THIRD STREET,3332221111,GHI
My idea: Use regex to find occurrences of 1-5 digits, possibly preceded by a symbol, that are between two commas and not followed by a space and letters, then replace by this match without the preceding comma.
Something like:
text.replace(r'(,\d{0,5},)','.........')
If you would use regex module instead of re then possibly:
import regex
str = "ANOTHER STRING,#88,4321 THIRD STREET,3332221111,GHI"
print(regex.sub(r'(?<!^.*,.*),(?=#?\d+,\d+)', ' ', str))
You might be able to use re if you sure there are no other substring following the pattern in the lookahead.
import re
str = "ANOTHER STRING,#88,4321 THIRD STREET,3332221111,GHI"
print(re.sub(r',(?=#?\d+,\d+)', ' ', str))
Easier to read alternative if SOME STRING, SOME OTHER STRING, and ANOTHER STRING never contain commas:
text1.replace(",", " ", 1)
which just replaces the first comma with a space
Simple, yet effective:
my_pattern = r"(,)(\W?\d{0,5},)"
p = re.compile(my_pattern)
p.sub(r" \2", text1) # 'SOME STRING 99,1234 FIRST STREET,9998887777,ABC'
p.sub(r" \2", text2) # 'SOME OTHER STRING,56789 SECOND STREET,6665554444,DEF'
p.sub(r" \2", text3) # 'ANOTHER STRING #88,4321 THIRD STREET,3332221111,GHI'
Secondary pattern with non-capturing group and verbose compilation:
my_pattern = r"""
(?:,) # Non-capturing group for single comma.
(\W?\d{0,5},) # Capture zero or one non-ascii characters, zero to five numbers, and a comma
"""
# re.X compiles multiline regex patterns
p = re.compile(my_pattern, flags = re.X)
# This time we will use \1 to implement the first captured group
p.sub(r" \1", text1)
p.sub(r" \1", text2)
p.sub(r" \1", text3)
Related
I want to be able to remove all punctuation and single quotes ' from a string, unless the single quote ' is in the middle of a word.
At this point I have the following code:
with open('test.txt','r') as f:
for line in f:
line = line.lower()
line = re.sub('[^a-z\ \']+', " ", line)
print line
if there happens to be a line in test.txt like:
Here is some stuff. 'Now there are quotes.' Now there's not.
The result I want is:
here is some stuff now there are quotes now there's not
But the result I get is:
here is some stuff 'now there are quotes' now there's not
How can I remove the single quotes ' from a string if they're at the beginning or end of the word but not in the middle? Thanks for the help!
Split the string, use strip() on each word to remove leading and trailing characters on it, then join it all back together.
>>> s = "'here is some stuff 'now there are quotes' now there's not'"
>>> print(' '.join(w.strip("'") for w in s.split()).lower())
here is some stuff now there are quotes now there's not
Using regular expressions, you could first remove 's that don't follow a letter, then remove 's that don't precede a letter (thus only keeping ones that both follow and precede a letter):
line = "Here is some stuff. 'Now there are quotes.' Now there's not."
print re.sub(r"'([^A-Za-z])", r"\1", re.sub(r"([^A-Za-z])'", r"\1", line))
# Here is some stuff. Now there are quotes. Now there's not.
Probably more efficient to do it #TigerhawkT3's way. Though they produce different results if you have something like 'this'. If you want to remove that second ' too, then the regular expressions method is probably the simplest you can do.
Here's another solution using regular expressions with lookarounds.
This method will preserve any whitespace your string may have.
import re
rgx = re.compile(r"(?<!\w)\'|\'(?!\w)")
# Regex explanation:
# (?<!\w)\' match any quote not preceded by a word
# | or
# \'(?!\w) match any quote not followed by a word
s = "'here is some stuff 'now there are quotes' now there's not'"
print rgx.sub('', s) # here is some stuff now there are quotes now there's not
If a word is a sequence of 1+ letters, digits and underscores that can be matched with \w+ you may use
re.sub(r"(?!\b'\b)'", "", text)
See the regex demo. Here, ' is matched when it is not preceded nor followed with letters/digits/_.
Or, if words are strictly linguistic words that only consist of letters, use
re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) # ASCII only
re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) # any Unicode letter support
See Demo #2 (ASCII only letters) and Demo #3 (see last line in the demo text). Here, ' is only matched if it is not preceded nor followed with a letter (ASCII or any).
Python demo:
import re
text = "'text... 'some quotes', there's none'. three 'four' can't, '2'4', '_'_', 'l'école'"
print( re.sub(r"(?!\b'\b)'", "", text) )
# => text... some quotes, there's none. three four can't, 2'4, _'_, l'école
print( re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, lécole
print( re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, l'école
Here is complete solution to remove whatever you don't want in a string:
def istext (text):
ok = 0
for x in text: ok += x.isalnum()
return ok>0
def stripit (text, ofwhat):
for x in ofwhat: text = text.strip(x)
return text
def purge (text, notwanted="'\"!#$%&/()=?*+-.,;:_<>|\\[]{}"):
text = text.splitlines()
text = [" ".join([stripit(word, notwanted) for word in line.split() if istext(word)]) for line in text]
return "\n".join(text)
>>> print purge("'Nice, .to, see! you. Isn't it?'")
Nice to see you Isn't it
Note: this will kill all whitespaces too and transform them to space or remove them completely.
Example String:
str = "test sdf sfwe \n \na dssdf
I want to replace the:
\na
with
a
Where 'a' could be any character.
I tried:
str = "test \n \na"
res = re.sub('[\n.]','a',str)
But how can I store the character behind the \n and use it as replacement?
You may use this regex with a capture group:
>>> s = "test sdf sfwe \n \na dssdf"
>>> >>> print re.sub(r'\n(.)', r'\1', s)
test sdf sfwe a dssdf
Search regex r'\n(.)' will match \n followed by any character and capture following character in group #1
Replacement r'\1' is back-reference to capture group #1 which is placed back in original string.
Better to avoid str as variable name since it is a reserve keyword (function) in python.
If by any character you meant any non-space character then use this regex with use of \S (non-whitespace character) instead of .:
>>> print re.sub(r'\n(\S)', r'\1', s)
test sdf sfwe
a dssdf
Also this lookahead based approach will also work that doesn't need any capture group:
>>> print re.sub(r'\n(?=\S)', '', s)
test sdf sfwe
a dssdf
Note that [\n.] will match any one of \n or literal dot only not \n followed by any character,
Find all the matches:
matches = re.findall( r'\n\w', str )
Replace all of them:
for m in matches :
str = str.replace( m, m[1] )
That's all, folks! =)
I think that the best way for you so you don't have more spaces in your text is the following:
string = "test sdf sfwe \n \na dssdf"
import re
' '.join(re.findall('\w+',string))
'test sdf sfwe a dssdf'
My string is like below:
Str=S1('amm','string'),S2('amm_sec','string'),S3('amm_','string')
How can I Split the string so that my str_list item becomes:
Str_List[0]=S1('amm','string')
Str_List[1]=S2('amm_sec','string')
...
If I use Str.split(',') then the output is:
Str_List[0]=S1('amm'
...
you can use regex with re in python
import re
Str = "S1('amm','string'),S2('amm_sec','string'),S3('amm_','string')"
lst = re.findall("S\d\(.*?\)", Str)
this will give you:
["S1('amm','string')", "S2('amm_sec','string')", "S3('amm_','string')"]
to explain the regex a little more:
S first you match 'S'
\d next look for a digit
\( then the '(' character
.*? with any number of characters in the middle (but match as few as you can)
\) followed by the last ')' character
you can play with the regex a little more here
My first thought would be to replace ',S' with ' S' using regex and split on spaces.
import re
Str = re.sub(',S',' S',Str)
Str_list = Str.split()
I have a string:
feature.append(freq_and_feature(text, freq))
I want a list containing each word of the string, like [feature, append, freq, and, feature, text, freq], where each word is a string, of course.
These string are contained in a file called helper.txt, so I'm doing the following, as suggested by multiple SO posts, like the accepted answer for this one(Python: Split string with multiple delimiters):
import re
with open("helper.txt", "r") as helper:
for row in helper:
print re.split('\' .,()_', row)
However, I get the following, which is not what I want.
[' feature.append(freq_pain_feature(text, freq))\n']
re.split('\' .,()_', row)
This looks for the string ' .,()_ to split on. You probably meant
re.split('[\' .,()_]', row)
re.split takes a regular expression as the first argument. To say "this OR that" in regular expressions, you can write a|b and it will match either a or b. If you wrote ab, it would only match a followed by b. Luckily, so we don't have to write '| |.|,|(|..., there's a nice form where you can use []s to state that everything inside should be treated as "match one of these".
It seems you want to split a string with non-word or underscore characters. Use
import re
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[\W_]+', s) if x])
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
See the IDEONE demo
The [\W_]+ regex matches 1+ characters that are not word (\W = [^a-zA-Z0-9_]) or underscores.
You can get rid of the if x if you remove initial and trailing non-word characters from the input string, e.g. re.sub(r'^[\W_]+|[\W_]+$', '', s).
You can try this
str = re.split('[.(_,)]+', row, flags=re.IGNORECASE)
str.pop()
print str
This will result:
['feature', 'append', 'freq', 'and', 'feature', 'text', ' freq']
I think you are trying to split on the basis of non-word characters. It should be
re.split(r'[^A-Za-z0-9]+', s)
[^A-Za-z0-9] can be translated to --> [\W_]
Python Code
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[^A-Za-z0-9]+', s) if x])
This will also work, indeed
p = re.compile(r'[^\W_]+')
test_str = "feature.append(freq_and_feature(text, freq))"
print(re.findall(p, test_str))
Ideone Demo
For example, I have strings like this:
string s = "chapter1 in chapters"
How can I replace it with regex to this:
s = "chapter 1 in chapters"
e.g. I need only to insert whitespace between "chapter" and it's number if it exists. re.sub(r'chapter\d+', r'chapter \d+ , s) doesn't work.
You can use lookarounds:
>>> s = "chapter1 in chapters"
>>> print re.sub(r"(?<=\bchapter)(?=\d)", ' ', s)
chapter 1 in chapters
RegEx Breakup:
(?<=\bchapter) # asserts a position where preceding text is chapter
(?=d) # asserts a position where next char is a digit
You can use capture groups, Something like this -
>>> s = "chapter1 in chapters"
>>> re.sub(r'chapter(\d+)',r'chapter \1',s)
'chapter 1 in chapters'