I am trying to remove emoticons from a piece of text. I tried this regex from another question, but it doesn't remove any emoticons. Can you tell me what I am doing wrong, or whether there are better regexes for removing emojis from a string?
import re
myre = re.compile(u'('
    u'\ud83c[\udf00-\udfff]|'
    u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
    u'[\u2600-\u26FF\u2700-\u27BF])+',
    re.UNICODE)
def clean(inputFile, outputFile):
    with open(inputFile, 'r') as original, open(outputFile, 'w+') as out:
        for line in original:
            line = myre.sub('', line)
Something like this?
import re
myre = re.compile('('
    '\ud83c[\udf00-\udfff]|'
    '\ud83d[\udc00-\ude4f\ude80-\udeff]|'
    '[\u2600-\u26FF\u2700-\u27BF])+'.decode('unicode_escape'),
    re.UNICODE)
def clean(inputFile, outputFile):
    with open(inputFile, 'r') as original, open(outputFile, 'w+') as out:
        for line in original:
            line = myre.sub('', line.decode('utf-8'))
            print(line)
I have a simulation output file with many lines; parts of it look like this:
</GraphicData>
</Connection>
<Connection>
<Name>ES1</Name>
<Type>Port</Type>
<From>Windfarm.Out</From>
<To>BR1.In</To>
<GraphicData>
<Icon>
<Points>
</GraphicData>
</Connection>
<Connection>
<Name>S2</Name>
<Type>Port</Type>
<From>BR1.Out</From>
<To>C1.In</To>
<GraphicData>
<Icon>
<Points>
The text between <Name> and </Name> varies from output to output. These names (here: ES1 and S2) are stored in a text file (keywords.txt).
What I need is a regex that takes the keywords from the list (keywords.txt), searches Simulationoutput.txt for matches, captures everything from each match up to </To>, and writes these matches into another text file (finaloutput.txt).
Here is what I've done so far:
import ast
import re
with open("keywords.txt", 'r') as f:
    keywords = ast.literal_eval(f.read())
pattern = '|'.join(keywords)
results = []
with open('Simulationoutput.txt', 'r') as f:
    for line in f:
        matches = re.findall(pattern, line)
        if matches:
            results.append((line, len(matches)))
results = sorted(results, key=lambda x: x[1], reverse=True)
with open('finaloutput.txt', 'w') as f:
    for line, num_matches in results:
        f.write('{} {}\n'.format(num_matches, line))
The finaloutput.txt looks like this now:
<Name>ES1</Name>
<Name>S2</Name>
So the code almost does what I want, but the output should look like this:
<Name>ES1</Name>
<Type>Port</Type>
<From>Hydro.Out</From>
<To>BR1.In</To>
<Name>S2</Name>
<Type>Port</Type>
<From>BR1.Out</From>
<To>C1.In</To>
Thanks in advance.
Although I strongly advise you to use xml.etree.ElementTree to do this, here's how you could do it using regex:
import re
keywords = ["ES1", "S2"]
pattern = "|".join([re.escape(key) for key in keywords])
pattern = fr"<Name>(?:{pattern}).*?<\/To>"
with open("Simulationoutput.txt", "r") as f:
    matches = re.findall(pattern, f.read(), flags=re.DOTALL)
with open("finaloutput.txt", "w") as f:
    f.write("\n\n".join(matches).replace("\n ", "\n"))
The regex used is the following:
<Name>(?:ES1|S2).*?<\/To>
<Name>: Matches <Name>.
(?:...): Non-capturing group.
ES1|S2: Matches either ES1 or S2.
.*?: Matches any character, between zero and unlimited times, as few as possible (lazy). Note that . does not match newlines by default; it does here only because the re.DOTALL flag is set.
<\/To>: Matches </To>.
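Since the answer recommends xml.etree.ElementTree over a regex, here is a minimal sketch of that route. It assumes the simulation output is a well-formed XML document in which each <Connection> holds <Name>, <Type>, <From> and <To> children; the SAMPLE text, the <Model> root element and the extract_connections helper are made up for illustration and are not part of the original file.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for Simulationoutput.txt.
SAMPLE = """<Model>
  <Connection>
    <Name>ES1</Name>
    <Type>Port</Type>
    <From>Windfarm.Out</From>
    <To>BR1.In</To>
    <GraphicData><Icon/></GraphicData>
  </Connection>
  <Connection>
    <Name>X9</Name>
    <Type>Port</Type>
    <From>A.Out</From>
    <To>B.In</To>
  </Connection>
  <Connection>
    <Name>S2</Name>
    <Type>Port</Type>
    <From>BR1.Out</From>
    <To>C1.In</To>
  </Connection>
</Model>"""

# Child elements to copy for every matching connection, in output order.
FIELDS = ("Name", "Type", "From", "To")

def extract_connections(xml_text, keywords):
    """Return the Name..To lines of every Connection whose Name is a keyword."""
    root = ET.fromstring(xml_text)
    blocks = []
    for conn in root.iter("Connection"):
        if conn.findtext("Name") in keywords:
            lines = []
            for tag in FIELDS:
                value = conn.findtext(tag, default="")
                lines.append("<{0}>{1}</{0}>".format(tag, value))
            blocks.append("\n".join(lines))
    return "\n\n".join(blocks)

result = extract_connections(SAMPLE, {"ES1", "S2"})
print(result)
```

For a real file, ET.parse("Simulationoutput.txt").getroot() would take the place of ET.fromstring, provided the file parses as a single XML document.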
I need to apply the following patterns:
regex pattern 1 =>
search1: ^this, replace1: these
regex pattern 2 =>
search2: tests$, replace2: \t tests
regex pattern 3
.....
The following code only executes one search-and-replace operation.
How can I combine multiple search operations? I might need to apply perhaps 10-20 patterns.
Thank you.
import re
fin = open("data.txt", "r")
fout = open("data2.txt", "w")
for line in fin:
    pattern1 = re.sub('test\.', 'tests', line)
    fout.write(pattern2)
I'm not sure what you meant by pattern1 and pattern2, but here's my solution based on what I understood: simply put them in a list and iterate through it.
import re
# (search, replacement) pairs taken from the question, applied in order.
patterns = [(r'^this', 'these'), (r'tests$', r'\t tests')]
with open("data.txt", "r") as fin, open("data2.txt", "w") as fout:
    for line in fin:
        for regex, replace in patterns:
            line = re.sub(regex, replace, line)
        fout.write(line)
You can put your regex patterns and replacements in a dictionary and loop over it.
See the example below:
import re
# Maps each regex pattern to its replacement text.
patterns = {"pattern1": "replacetxt"}
with open("data.txt", "r") as fin, open("data2.txt", "w") as fout:
    for line in fin:
        for patt, replace in patterns.items():
            line = re.sub(patt, replace, line)
        fout.write(line)
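With 10-20 patterns, it may be worth compiling them once and wrapping the loop in a small function that can be tried on a single string before touching any files. The sketch below is one way to do that; the RULES list and the apply_rules name are illustrative, and the third rule is borrowed from the question's own code.

```python
import re

# Hypothetical rule list: (pattern, replacement) pairs, applied in this order.
RULES = [
    (r'^this', 'these'),
    (r'tests$', r'\t tests'),   # \t in the replacement becomes a tab
    (r'test\.', 'tests'),
]

# Compile once so each pattern is not re-parsed for every line.
COMPILED = [(re.compile(pattern), repl) for pattern, repl in RULES]

def apply_rules(line, rules=COMPILED):
    """Run every substitution over the line, feeding each result to the next rule."""
    for rx, repl in rules:
        line = rx.sub(repl, line)
    return line

print(repr(apply_rules("this is one of the tests\n")))
```

Because each rule sees the output of the previous one, the order of the list matters: a later pattern can match text that an earlier replacement produced.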
I'm trying to find any occurrence of "fiction" preceded or followed by anything, except when it is preceded by "non-".
I tried:
.*[^(n-)]fiction.*
but it's not working as I want it to.
Can anyone help me out?
Check if this works for you:
.*(?<!non\-)fiction.*
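To see why the bracket attempt behaves differently from the lookbehind, the sketch below runs both on a few made-up strings. One assumption to note: the class is written as [^()n-], with the hyphen moved to the end so that it is a literal character (in the question's form, n-) would likely be read as a character range). A negated class still consumes exactly one character that is not (, ), n or -, whereas the lookbehind only checks that the four characters before "fiction" are not "non-".

```python
import re

# Character-class attempt, with the hyphen moved last so the class compiles.
char_class = re.compile(r'[^()n-]fiction')
# Negative lookbehind from the answer above, without the surrounding .*
lookbehind = re.compile(r'(?<!non-)fiction')

# Made-up sample strings for illustration.
samples = ["fiction", "pulp fiction", "science-fiction", "non-fiction"]

class_hits = [s for s in samples if char_class.search(s)]
look_hits = [s for s in samples if lookbehind.search(s)]

print(class_hits)
print(look_hits)
```

The class misses "fiction" at the very start of a string (there is no character to consume) and "science-fiction" (the hyphen is excluded), while the lookbehind rejects only "non-fiction".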
You should avoid patterns starting with .*: they cause too many backtracking steps and slow down the code execution.
In Python, you can always get the lines either by reading a file line by line or by splitting a string with splitlines(), and then pick the lines you need by testing them against a pattern without any .*.
Reading a file line by line:
final_output = []
with open(filepath, 'r', newline="\n", encoding="utf8") as f:
    for line in f:
        if "fiction" in line and "non-fiction" not in line:
            final_output.append(line.strip())
Or, to also get lines that contain non-fiction as long as they contain a fiction that is not preceded by non-, use a slightly modified version of @jlesuffleur's regex:
import re
final_output = []
rx = re.compile(r'\b(?<!non-)fiction\b')
with open(filepath, 'r', newline="\n", encoding="utf8") as f:
    for line in f:
        if rx.search(line):
            final_output.append(line.strip())
Getting lines from a multiline string (with both approaches mentioned above):
import re
text = "Your input string line 1\nLine 2 with fiction\nLine 3 with non-fiction\nLine 4 with fiction and non-fiction"
rx = re.compile(r'\b(?<!non-)fiction\b')
# Regex approach: returns any line containing fiction that is not prefixed with non-:
final_output = [line.strip() for line in text.splitlines() if rx.search(line)]
# => ['Line 2 with fiction', 'Line 4 with fiction and non-fiction']
# Non-regex approach: skips every line that contains non-fiction (even if it also contains a plain fiction):
final_output = [line.strip() for line in text.splitlines() if "fiction" in line and "non-fiction" not in line]
# => ['Line 2 with fiction']
What about a negative lookbehind?
import re
s = 'fiction non-fiction'
res = re.findall("(?<!non-)fiction", s)
print(res)  # ['fiction']
This regular expression is supposed to remove emoticons, but when I try it on my sample text it does not work. It was working previously, so I'm not sure what I am missing. Thanks.
Here is a sample text: pastebin.com/uYUNk9R1
Paste it into a Notepad document to test (Python 2.7).
import re
myre = re.compile('('
    '\ud83c[\udf00-\udfff]|'
    '\ud83d[\udc00-\ude4f\ude80-\udeff]|'
    '[\u2600-\u26FF\u2700-\u27BF])+'.decode('unicode_escape'),
    re.UNICODE)
def clean(inputFile, outputFile):
    with open(inputFile, 'r') as original, open(outputFile, 'w+') as out:
        for line in original:
            line = myre.sub('', line)
            out.write(line)
You need to convert your input data to Unicode before applying the regex:
line = myre.sub('', line.decode('utf-8'))
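The snippets above target Python 2.7, where the pattern is written with UTF-16 surrogate pairs. As a hedged sketch for Python 3 (an assumption; the thread itself is about 2.7), strings are sequences of code points, so the same ranges can be written directly as code points and no .decode() call is needed. The names EMOJI_RE and remove_emoji are illustrative, and the ranges are only the ones from the original pattern, not a complete emoji list.

```python
import re

# Same ranges as the original pattern, written as code points (Python 3).
EMOJI_RE = re.compile(
    '['
    '\U0001F300-\U0001F3FF'   # was \ud83c[\udf00-\udfff]
    '\U0001F400-\U0001F64F'   # was \ud83d[\udc00-\ude4f]
    '\U0001F680-\U0001F6FF'   # was \ud83d[\ude80-\udeff]
    '\u2600-\u26FF'
    '\u2700-\u27BF'
    ']+'
)

def remove_emoji(text):
    """Strip every character that falls in the ranges above."""
    return EMOJI_RE.sub('', text)

def clean(input_file, output_file):
    # Assumes the input file is UTF-8 encoded.
    with open(input_file, 'r', encoding='utf-8') as original, \
         open(output_file, 'w', encoding='utf-8') as out:
        for line in original:
            out.write(remove_emoji(line))

cleaned = remove_emoji('sunny \u2600 day \U0001F600!')
print(repr(cleaned))
```

Newer emoji blocks and modifier sequences fall outside these ranges, so the class would need extending for broader coverage.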
I have an input file containing JavaScript code with many five-digit IDs. I want to get these IDs in a list like:
53231,53891,72829 etc
This is my current Python file:
import re
fobj = open("input.txt", "r")
text = fobj.read()
output = re.findall(r'[0-9][0-9][0-9][0-9][0-9]' ,text)
outp = open("output.txt", "w")
How can I get these IDs into the output file the way I want?
Thanks
import re
# Use "with" so the file will automatically be closed
with open("input.txt", "r") as fobj:
    text = fobj.read()
# Use word boundary anchors (\b) so only five-digit numbers are matched.
# Otherwise, 123456 would also be matched (and the match result would be 12345)!
output = re.findall(r'\b\d{5}\b', text)
# Join the matches together
out_str = ",".join(output)
# Write them to a file, again using "with" so the file will be closed.
with open("output.txt", "w") as outp:
    outp.write(out_str)
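The effect of the word boundaries can be checked on a small made-up string. The sketch below also shows a possible variant using digit lookarounds, which is not part of the answer above: \b treats letters and underscores as word characters, so an ID glued to an identifier (such as id53891) is skipped by \b but found by the lookaround form. Which behaviour is wanted depends on the JavaScript in question.

```python
import re

# Made-up sample line for illustration.
text = "var a = 53231; var big = 123456; var b = id53891;"

# No anchors: also grabs the first five digits of the six-digit number.
plain = re.findall(r'\d{5}', text)

# Word boundaries, as in the answer above.
bounded = re.findall(r'\b\d{5}\b', text)

# Variant: only require that no digit sits directly before or after.
digit_guarded = re.findall(r'(?<!\d)\d{5}(?!\d)', text)

print(plain)
print(bounded)
print(digit_guarded)
```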