In test.txt:
quiet confidence^_^
want:P
(:let's start
Codes:
import re
file = open('test.txt').read()
for line in file.split('\n'):
    line = re.findall(r"[^\w\s$]+|[a-zA-z]+|[^\w\s$]+", line)
    print(" ".join(line))
Results showed:
quiet confidence^_^
want : P
(: let ' s start
I tried to separate groups of special characters from the words in each string, but the output is still incorrect.
Any suggestions?
Expected results:
quiet confidence ^_^
want :P
(: let's start
As @interjay said, you must define what you consider a word and what counts as "special characters". Still, I would use two separate regexes: one to find what is a word and one to find what is not.
word = re.compile("[a-zA-Z\']+")
not_word = re.compile("[^a-zA-Z\']+")
for line in file.split('\n'):
    matched_words = re.findall(word, line)
    non_matching_words = re.findall(not_word, line)
    print(" ".join(matched_words))
    print(" ".join(non_matching_words))
Keep in mind that whitespace (\s+) will be grouped with the non-words.
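If you want the tokens back in their original order rather than in two separate lists, the two patterns can be combined into a single alternation. This is only a sketch under my own assumptions: a word is a run of letters and apostrophes, a special token is any run of characters that is neither of those nor whitespace, and the helper name separate_tokens is made up for illustration. It does not recognise emoticons that contain letters, so :P is still split apart.

```python
import re

# Assumption: a "word" is a run of letters/apostrophes; a "special" token is
# any run of characters that is neither a word character nor whitespace.
TOKEN = re.compile(r"[A-Za-z']+|[^A-Za-z'\s]+")

def separate_tokens(line):
    """Return the line with word and non-word tokens separated by one space."""
    return " ".join(TOKEN.findall(line))

print(separate_tokens("quiet confidence^_^"))
print(separate_tokens("(:let's start"))
```

Handling something like :P would need an explicit list of emoticons placed before the other alternatives, since no general rule tells a letter in an emoticon apart from a letter in a word.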
I am trying to make a simple Discord bot to respond to some user input and having difficulty trying to parse the response for the info I need. I am trying to get their "gamertag"/username but the format is a little different sometimes.
So, my idea was to make a list of delimiter words I am looking for (different versions of the word gamertag such as Gamertag:, Gamertag -, username, etc.)
Then, look line by line for one that contains any of those delimiters.
Split the string on first matching delim, strip non alphanumeric characters
I had it kind of working for a single line, then realized some people don't put it on the first line, so I added a line-by-line check and broke it (on line 19, I just realized). I also thought there must be a better way than this. Please advise; some partly working code is at this link and copied below:
testString = """Application
Gamertag : testGamertag
Discord - testDiscord
Age - 25"""
applicationString = testString
gamertagSplitList = [ "gamertag", "Gamertag","Gamertag:", "gamertag:"]
#splWord = 'Gamertag'
lineNum = 0
for line in applicationString.partition('\n'):
    print(line)
    if line in gamertagSplitList:
        applicationString = line
        break
#get first line
#applicationString = applicationString.partition('\n')[0]
res = ""
#split on word, want to split on first occurrence of list of words
for splitWord in gamertagSplitList:
    if splitWord in applicationString:
        res = applicationString.split(splitWord)
        break
splitString = res[1]
#res = test_string.split(spl_word, 1)
#splitString = res[1]
#get rid of non alphaNum characters
finalString = "" #define string for output
for character in splitString:
    if(character.isalnum()):
        # if character is alphanumeric concat to finalString
        finalString = finalString + character
print(finalString)
I don't know if this will work with all your different inputs, but you can tweak it to get what you want:
import re
gamertagSplitList = ["gamertag", "Gamertag", "Gamertag:", "gamertag:"]
applicationString = """Application
Gamertag : testGamertag
Discord - testDiscord
Age - 25"""
for line in applicationString.split('\n'):
    line = line.replace(' ', '')
    for tag in gamertagSplitList:
        if tag in line:
            gamer_tag = line.replace(tag, '', 1)
            break
print(re.sub(r'\W+', '', gamer_tag))
Output :
testGamertag
You can do it without any loops with a single regex:
import re
gamertagSplitList = ["gamertag", "Gamertag"]
applicationString = """Application
Gamertag : testGamertag
Discord - testDiscord
Age - 25"""
print(re.search(r'(' + '|'.join(gamertagSplitList) + r')\s*[:-]?\s*(\w+)\s*', applicationString)[2])
If all values in gamertagSplitList differ just by casing, you can simplify that even further:
print(re.search(r'gamertag\s*[:-]?\s*(\w+)\s*', applicationString, re.IGNORECASE)[1])
Let's take a closer look at this regex:
gamertag will match a string 'gamertag'
\s* will match any (including none) whitespace characters (space, newline, tab, etc.)
[:-]? will match either none or a single character which is either : or -
(\w+) will match 1 or more word characters (letters, digits, or underscores). The parentheses here denote a group: a specific substring that we can extract later from the match.
By using re.IGNORECASE we make matching case insensitive, so that separator GaMeRtAg will also be recognised by this pattern.
The indexing part [1] means that we're interested in the first group in our pattern (remember the parentheses). The group with index 0 is always the full match, and groups from index 1 upwards represent substrings that match the subexpressions in parentheses (ordered by the position of their opening ( in the regex).
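To make the group indexing concrete, here is a small sketch that wraps the same pattern in a helper. The function name extract_gamertag and the sample text are my own, added for illustration; the None check is there because re.search returns None when nothing matches, and indexing None would raise an error.

```python
import re

# Same pattern as above; group 1 is the (\w+) part.
GAMERTAG = re.compile(r'gamertag\s*[:-]?\s*(\w+)\s*', re.IGNORECASE)

def extract_gamertag(text):
    """Return the captured gamertag, or None if no gamertag line is found."""
    match = GAMERTAG.search(text)
    if match is None:
        return None
    # match.group(0) would be the whole matched text, label included;
    # match.group(1) is only what the parentheses captured.
    return match.group(1)

application = """Application
GaMeRtAg - testGamertag
Discord - testDiscord"""

print(extract_gamertag(application))
```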
I want to replace every line in a text file that starts with "meshname = " and ends with any combination of letters, numbers, and underscores with " ". I used regexes in CS, but I never really understood the different notations in Python. Can you help me with that?
Is this the right regex for my problem, and how would I transform it into a Python regex?
m.e.s.h.n.a.m.e.' '.=.' '.{{_}*,{0,...,9}*,{a,...,z}*,{A,...,Z}*}*
x.y = Concatenation of x and y
' ' = whitespace
{x} = set containing x
x* = x.x.x. ... .x or empty word
What would the script look like in order to replace every string/line in a file containing meshname = ... with the Python regex? Something like this?
fin = open("test.txt", 'r')
data = fin.read()
data = data.replace("^meshname = [[a-z]*[A-Z]*[0-9]*[_]*]+", "")
fin.close()
fin = open("test.txt", 'w')
fin.write(data)
fin.close()
or is this completely wrong? I've tried to get it working with this approach, but somehow it never matched the right string: How to input a regex in string.replace?
Following the current code logic, you can use
data = re.sub(r'^meshname = .*\w$', ' ', data, flags=re.M)
The re.sub will replace with a space any line that matches
^ - line start (note the flags=re.M argument that makes sure the multiline mode is on)
meshname - a meshname word
= - a = string
.* - any zero or more chars other than line break chars as many as possible
\w - a letter/digit/_
$ - line end.
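Put together with the read and write steps from the question, the whole script could look roughly like this. It is a sketch, not a drop-in replacement: the helper name blank_meshname_lines and the sample content are invented for the demonstration, and it writes to a temporary directory rather than to your real test.txt.

```python
import os
import re
import tempfile

def blank_meshname_lines(text):
    """Replace every 'meshname = ...' line that ends in a word character with a space."""
    return re.sub(r'^meshname = .*\w$', ' ', text, flags=re.M)

# Demo on a throwaway file (hypothetical content).
sample = "header\nmeshname = Mesh_01\nmeshname = \nfooter\n"
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "w") as fout:
    fout.write(sample)

# Read, substitute, write back: the same three steps as in the question.
with open(path) as fin:
    data = fin.read()
data = blank_meshname_lines(data)
with open(path, "w") as fout:
    fout.write(data)
```

Note that the line "meshname = " with nothing after it is left alone, because the pattern requires at least one word character before the line end.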
I have a pattern compiled as
pattern_strings = ['\xc2d', '\xa0', '\xe7', '\xc3\ufffdd', '\xc2\xa0', '\xc3\xa7', '\xa0\xa0', '\xc2', '\xe9']
join_pattern = '|'.join(pattern_strings)
pattern = re.compile(join_pattern)
and then I find pattern in file as
def find_pattern(path):
    with open(path, 'r') as f:
        for line in f:
            print(line)
            found = pattern.search(line)
            if found:
                print(dir(found))
                logging.info('found - ' + found)
and the input file at path contains
\xc2d
d\xa0
\xe7
\xc3\ufffdd
\xc3\ufffdd
\xc2\xa0
\xc3\xa7
\xa0\xa0
'619d813\xa03697'
When I run this program, nothing happens.
It is not able to catch these patterns. What am I doing wrong here?
Desired output
- each line because each line has one or the other matching pattern
Update
After changing the regex to
pattern_strings = ['\\xc2d', '\\xa0', '\\xe7', '\\xc3\\ufffdd', '\\xc2\\xa0', '\\xc3\\xa7', '\\xa0\\xa0', '\\xc2', '\\xe9']
It is still the same, no output
UPDATE
after making regex to
pattern_strings = ['\\xc2d', '\\xa0', '\\xe7', '\\xc3\\ufffdd', '\\xc2\\xa0', '\\xc3\\xa7', '\\xa0\\xa0', '\\xc2', '\\xe9']
join_pattern = '[' + '|'.join(pattern_strings) + ']'
pattern = re.compile(join_pattern)
Things started to work, but only partially. The patterns still not caught are on the lines
\xc2\xa0
\xc3\xa7
\xa0\xa0
for which my pattern string is ['\\xc2\\xa0', '\\xc3\\xa7', '\\xa0\\xa0']
Escape the \ in the search patterns,
either with r"\xa0" or as "\\xa0".
Do this:
['\\xc2d', '\\xa0', '\\xe7', '\\xc3\\ufffdd', '\\xc2\\xa0', '\\xc3\\xa7', '\\xa0\\xa0', '\\xc2', '\\xe9']
like everyone's been saying to do, except the one guy you listened to...
Does your file actually contain \xc2d --- that is, five characters: a backslash followed by c, then 2, then d? If so, your regex won't match it. Each of your regexes will match one or two characters with certain character codes. If you want to match the string \xc2d your regex needs to be \\xc2d.
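Assuming the file really does contain the literal text (a backslash followed by x, c, 2 and so on), one way to avoid the escaping trouble is to write the sequences as raw strings and let re.escape build the regex. This is a sketch under that assumption; the names literal_sequences and find_sequences are made up here. It also shows why wrapping the alternation in [ ] only half worked: a character class matches a single character from the set, never a whole sequence.

```python
import re

# Assumption: the file holds these as literal text, not as the bytes they name.
# Raw strings keep every backslash as an ordinary character.
literal_sequences = [r'\xc2d', r'\xa0', r'\xe7', r'\xc3\ufffdd', r'\xc2\xa0',
                     r'\xc3\xa7', r'\xa0\xa0', r'\xc2', r'\xe9']

# Try longer alternatives first so that \xc2\xa0 is preferred over \xc2.
# re.escape makes each backslash match itself instead of starting a regex escape.
alternatives = sorted(literal_sequences, key=len, reverse=True)
pattern = re.compile('|'.join(re.escape(s) for s in alternatives))

def find_sequences(line):
    """Return every literal sequence found in the line, left to right."""
    return pattern.findall(line)

print(find_sequences(r"'619d813\xa03697'"))
print(find_sequences(r'\xc2\xa0'))
```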
I need help with a regex problem involving Chinese characters in Python.
"拉柏多公园" is the correct form of the word, but in a text I found "拉柏 多公 园". What regex should I use to replace the characters?
import re
name = "拉柏多公园"
line = "whatever whatever it is then there comes a 拉柏 多公 园 sort of thing"
line2 = "whatever whatever it is then there comes another拉柏 多公 园 sort of thing"
line3 = "whatever whatever it is then there comes yet another 拉柏 多公 园sort of thing"
line4 = "whatever whatever it is then there comes a拉柏 多公 园sort of thing"
firstchar = "拉"
lastchar = "园"
I need to replace the strings in the lines so that the output lines will look like this:
line = "whatever whatever it is then there comes a 拉柏多公园 sort of thing"
line2 = "whatever whatever it is then there comes another 拉柏多公园 sort of thing"
line3 = "whatever whatever it is then there comes yet another 拉柏多公园 sort of thing"
line4 = "whatever whatever it is then there comes a 拉柏多公园 sort of thing"
I tried these too, but the regex is badly structured:
reline = line.replace (r"firstchar*lastchar", name) #
reline2 = reline.replace (" ", " ")
print reline2
Can someone help to correct my regex?
Thanks
(I assume you're using Python 3, since you're using Unicode characters in regular strings. For Python 2, add u before each string literal.)
Python 3
import re
name = "拉柏多公园"
# the string of Chinese characters, with any number of spaces interspersed.
# The regex will match any surrounding spaces.
regex = r"\s*拉\s*柏\s*多\s*公\s*园\s*"
So you can replace each string with
reline = re.sub(regex, ' ' + name + ' ', line)
Python 2
# -*- coding: utf-8 -*-
import re
name = u"拉柏多公园"
# the string of Chinese characters, with any number of spaces interspersed.
# The regex will match any surrounding spaces.
regex = ur"\s*拉\s*柏\s*多\s*公\s*园\s*"
So you can replace each string with
reline = re.sub(regex, u' ' + name + u' ', line)
Discussion
The result will be surrounded by spaces. More generally, if you want it to work at the start or end of the line, or before commas or periods, you'll have to replace ' ' + name + ' ' with something more sophisticated.
Edit: fixed. Of course, you have to use the re library function.
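As a rough idea of what "more sophisticated" could mean for the start and end of a line, here is a Python 3 sketch. It builds the pattern from name instead of typing it out, pads the replacement with spaces as above, and then trims whatever padding lands at the edges. The helper name normalise is hypothetical, and punctuation right after the name is still not handled (a comma would end up preceded by a space).

```python
import re

name = "拉柏多公园"

# Allow optional whitespace between consecutive characters of the name,
# and swallow the whitespace on either side so it can be normalised.
inner = r"\s*".join(re.escape(ch) for ch in name)
pattern = re.compile(r"\s*" + inner + r"\s*")

def normalise(line):
    """Rejoin the split name, with single spaces around it except at line edges."""
    padded = pattern.sub(" " + name + " ", line)
    return padded.strip()

line4 = "whatever whatever it is then there comes a拉柏 多公 园sort of thing"
print(normalise(line4))
```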
I need a way to remove all whitespace from a string, except when that whitespace is between quotes.
result = re.sub('".*?"', "", content)
This will match anything between quotes, but now it needs to ignore that match and add matches for whitespace.
I don't think you're going to be able to do that with a single regex. One way to do it is to split the string on quotes, apply the whitespace-stripping regex to every other item of the resulting list, and then re-join the list.
import re
def stripwhite(text):
    lst = text.split('"')
    for i, item in enumerate(lst):
        if not i % 2:
            lst[i] = re.sub(r"\s+", "", item)
    return '"'.join(lst)
print(stripwhite('This is a string with some "text in quotes."'))
Here is a one-liner version, based on @kindall's idea, yet it does not use regex at all! First split on ", then split() every other item and re-join them; that takes care of the whitespace:
stripWS = lambda txt: '"'.join(it if i % 2 else ''.join(it.split())
                               for i, it in enumerate(txt.split('"')))
Usage example:
>>> stripWS('This is a string with some "text in quotes."')
'Thisisastringwithsome"text in quotes."'
You can use shlex.split for a quotation-aware split, and join the result using " ".join. E.g.
import shlex
print(" ".join(shlex.split('Hello "world this is" a test')))
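Note that in its default POSIX mode shlex.split removes the quote characters, and joining with a space keeps the whitespace outside the quotes. One variation, sketched here under the assumption that the quoted sections are well formed and separated from neighbouring words by whitespace: passing posix=False keeps the quotes on each quoted token, and joining with an empty string drops the whitespace outside them. The helper name strip_outside_quotes is made up for illustration.

```python
import shlex

def strip_outside_quotes(text):
    """Remove whitespace outside double-quoted sections, keeping the quotes."""
    # In non-POSIX mode a quoted section comes back as one token, quotes included.
    tokens = shlex.split(text, posix=False)
    return "".join(tokens)

print(strip_outside_quotes('Hello "world this is" a test'))
```

A caveat: shlex also treats the apostrophe as a quote character, so text containing something like let's will raise a ValueError for the unclosed quotation.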
Oli, resurrecting this question because it had a simple regex solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
Here's the small regex:
"[^"]*"|(\s+)
The left side of the alternation matches complete "quoted strings". We will ignore these matches. The right side matches and captures spaces to Group 1, and we know they are the right spaces because they were not matched by the expression on the left.
Here is working code (and an online demo):
import re
subject = 'Remove Spaces Here "But Not Here" Thank You'
regex = re.compile(r'"[^"]*"|(\s+)')
def myreplacement(m):
    if m.group(1):
        return ""
    else:
        return m.group(0)
replaced = regex.sub(myreplacement, subject)
print(replaced)
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
Here is a slightly longer version that checks for a quote without its pair. It only deals with one style of start and end string (adaptable, for example, to start, end = '()').
start, end = '"', '"'
for test in ('Hello "world this is" atest',
             'This is a string with some " text inside in quotes."',
             'This is without quote.',
             'This is sentence with bad "quote'):
    result = ''
    while start in test:
        clean, _, test = test.partition(start)
        clean = clean.replace(' ', '') + start
        inside, tag, test = test.partition(end)
        if not tag:
            raise SyntaxError('Missing end quote %s' % end)
        else:
            clean += inside + tag  # whitespace inside the quotes is kept
        result += clean
    result += test.replace(' ', '')
    print(result)