Python Regular Expression to Remove Emoticons Not working - python

This regular expression is supposed to remove emoticons, but when I try it on my sample text it does not work. It was working previously, so I am not sure what I am missing. Thanks.
Here is a sample text: pastebin.com/uYUNk9R1
Paste it into a Notepad document to test. I am using Python 2.7.
import re
myre = re.compile('('
                  '\ud83c[\udf00-\udfff]|'
                  '\ud83d[\udc00-\ude4f\ude80-\udeff]|'
                  '[\u2600-\u26FF\u2700-\u27BF])+'.decode('unicode_escape'),
                  re.UNICODE)
def clean(inputFile, outputFile):
    with open(inputFile, 'r') as original, open(outputFile, 'w+') as out:
        for line in original:
            line = myre.sub('', line)
            out.write(line)

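You need to decode your input data to Unicode before applying the pattern. The compiled pattern is a unicode string, while each line read from the file is a byte string:
line = myre.sub('', line.decode('utf-8'))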
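For readers on Python 3, the decode step can be sketched as follows. This is a minimal sketch, not the original poster's code: it assumes Python 3, where strings are sequences of code points, so the surrogate-pair ranges from the question are rewritten as the equivalent code point ranges. The names EMOJI_RE and strip_emoji are hypothetical.

```python
import re

# Same ranges as the pattern in the question, written as code points
# instead of UTF-16 surrogate pairs (an assumption that Python 3 is used).
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001F3FF"       # was \ud83c[\udf00-\udfff]
    "\U0001F400-\U0001F64F"       # was \ud83d[\udc00-\ude4f]
    "\U0001F680-\U0001F6FF"       # was \ud83d[\ude80-\udeff]
    "\u2600-\u26FF\u2700-\u27BF"  # BMP symbol ranges, unchanged
    "]+"
)

def strip_emoji(raw_bytes):
    # Decode first: the pattern is text, so it cannot be applied to raw bytes.
    return EMOJI_RE.sub("", raw_bytes.decode("utf-8"))

cleaned = strip_emoji("good morning \U0001F600\u2600".encode("utf-8"))
```

Here cleaned holds the input text with the two symbols removed.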

Related

Can't extract substring from the string using regex in Python

I want to extract the substring "login attempt [b'admin'/b'admin']" from the string:
2021-05-06T00:00:15.921179Z [HoneyPotSSHTransport,1127,5.188.87.53] login attempt [b'admin'/b'admin'] succeeded.
But Python returns the whole string. My code is:
import re
hand = open('cowrie.log')
outF = open("Usernames.txt", "w")
for line in hand:
    if re.findall(r'login\sattempt\s\[[a-zA-z0-9]\'[a-zA-z0-9]+\'/[a-zA-z0-9]+\'[a-zA-z0-9]+\'\]', line):
        print(line)
        outF.write(line)
        outF.write("\n")
outF.close()
Thanks in advance. This is the LINK which contains the data from which I want to extract.
Your code says: if re.findall returns something, print the whole line. Instead, you should print what re.findall returns and write that to the file as a string.
Or use re.search if you expect a single match per line.
Note that [A-z] matches more than [A-Za-z]: it also covers the ASCII characters that sit between Z and a, namely [, \, ], ^, _ and `.
import re
hand = open('cowrie.log')
outF = open("Usernames.txt", "w")
for line in hand:
    res = re.search(r"login\sattempt\s\[[a-zA-Z0-9]'[a-zA-Z0-9]+'/[a-zA-Z0-9]+'[a-zA-Z0-9]+']", line)
    if res:
        outF.write(res.group())
        outF.write("\n")
outF.close()
Usernames.txt now contains:
login attempt [b'admin'/b'admin']
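Both points from this answer can be checked without the log file. The sketch below is only an illustration: it uses the single log line quoted in the question, and the variable names extracted and extra are hypothetical.

```python
import re

line = ("2021-05-06T00:00:15.921179Z [HoneyPotSSHTransport,1127,5.188.87.53] "
        "login attempt [b'admin'/b'admin'] succeeded.")
pattern = r"login\sattempt\s\[[a-zA-Z0-9]'[a-zA-Z0-9]+'/[a-zA-Z0-9]+'[a-zA-Z0-9]+']"

# re.search returns a match object; group() is the matched substring,
# not the whole line.
match = re.search(pattern, line)
extracted = match.group() if match else None

# [A-z] spans ASCII 65 to 122, so it also accepts the six characters
# between 'Z' and 'a'.
extra = [c for c in "[\\]^_`" if re.fullmatch(r"[A-z]", c)]
```

After this runs, extracted is the substring only, and extra lists every punctuation character that [A-z] accepts in addition to letters.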

Extract chunks of text from document and write them to new text file

I have a large text file from which I want to read several lines and write them out as one line to another text file. For instance, I want to start reading lines at a certain start word and stop at a lone parenthesis. So if my start word is 'CAR', I would want to keep reading until a closing parenthesis followed by a line break is read. The start and end markers are to be kept as well.
What is the best way to achieve this? I have tried pattern matching without regex, but I don't think that is possible.
Code:
array = []
f = open('text.txt','r') as infile
w = open(r'temp2.txt', 'w') as outfile
for line in f:
    data = f.read()
    x = re.findall(r'CAR(.*?)\)(?:\\n|$)',data,re.DOTALL)
    array.append(x)
    outfile.write(x)
return array
What the text may look like
( CAR: *random info*
*random info* - could be many lines of this
)
Using regular expressions is totally fine for this type of problem. You cannot use them when your pattern involves recursion, such as getting the content of nested parentheses: ((text1)(text2)).
You can use the following regular expression: (CAR[\s\S]*?(?=\)))
We can match the text you're interested in using the regex pattern (CAR.*)\) with the flags gms (global, multiline, dot matches newline). In Python, re.findall with re.DOTALL plays that role.
Then we just have to remove the newline characters from the resulting matches and write them to a file.
import re
with open("text.txt", 'r') as f:
    matches = re.findall(r"(CAR.*)\)", f.read(), re.DOTALL)
with open("output.txt", 'w') as f:
    for match in matches:
        f.write(" ".join(match.split('\n')))
        f.write('\n')
The output file looks like this:
CAR: *random info* *random info* - could be many lines of this
EDIT:
updated code to put newline between matches in output file
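If the file contains several CAR blocks, a greedy .* under re.DOTALL runs from the first CAR to the last ) in the file. The following is a minimal sketch of one way around that, assuming each block ends with a ) alone on its own line, as in the question's sample. The sample text, BLOCK_RE and extract_blocks are hypothetical names and data, not part of either answer.

```python
import re

# Hypothetical sample with two blocks; the real file layout is an assumption.
sample = """( CAR: first
  more (detail) here
)
( CAR: second
)
"""

# Lazy match from CAR up to a ')' that sits alone on its line.
# re.MULTILINE makes ^ and $ match at line boundaries;
# re.DOTALL lets '.' cross newlines.
BLOCK_RE = re.compile(r"CAR.*?^[ \t]*\)[ \t]*$", re.DOTALL | re.MULTILINE)

def extract_blocks(text):
    # Collapse each block onto one line (all whitespace runs become single
    # spaces), keeping both the start word and the closing parenthesis.
    return [" ".join(block.split()) for block in BLOCK_RE.findall(text)]

blocks = extract_blocks(sample)
```

Because the closing parenthesis must be the only non-blank character on its line, a parenthesis inside the block text, such as (detail), does not end the match early.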

Python 2.7 Search Line if match pattern and replace string

How can I read the file and find all lines that match a pattern starting with \d+\s, and then replace the whitespace with a comma? Some of the lines contain English characters, but some are Chinese. I guess the whitespace in the Chinese encoding is different from the English one?
Example (text.txt)
asdfasdf
1 abcd
2 asdfajklsd
3 asdfasdf
4 ...
asdfasdf
66 ...
aasdfasdf
99 ...
100 中文
101 中文
102 asdfga
103 中文
My Test Code:
with open('text.txt', 'r') as t:
    with open('newtext.txt', 'w') as nt:
        content = t.readlines()
        for line in content:
            okline = re.compile('^[\d+]\s')
            if okline:
                ntext = re.sub('\s', ',', okline)
                nt.write(ntext)
With a single re.subn() call:
import re
with open('text.txt', 'r') as text, open('newtext.txt', 'w') as new_text:
    lines = text.read().splitlines()
    for l in lines:
        rpl = re.subn(r'^(\d+)\s+', '\\1,', l)
        if rpl[1]:
            new_text.write(rpl[0] + '\n')
The main advantage of this approach is that re.subn returns a tuple (new_string, number_of_subs_made). The number_of_subs_made value is the crucial one: a nonzero count shows that a substitution was made, which means the line matched the pattern.
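The tuple returned by re.subn can be inspected on a few sample lines. This is a sketch under the assumption of Python 3, where \s in a str pattern also matches Unicode whitespace such as U+3000 (the ideographic space). The last sample line and the helper name number_prefix_to_comma are hypothetical.

```python
import re

# Sample lines; the last one uses U+3000 (ideographic space) as a
# hypothetical stand-in for "Chinese whitespace".
lines = ["asdfasdf", "1 abcd", "100 中文", "101\u3000中文"]

def number_prefix_to_comma(line):
    # Returns (new_line, count); count is 0 when the line did not match.
    return re.subn(r"^(\d+)\s+", r"\1,", line)

# Keep only the lines where a substitution actually happened.
converted = [new for new, count in map(number_prefix_to_comma, lines) if count]
```

In Python 2.7 the same behaviour requires unicode strings and the re.UNICODE flag; with byte strings, \s only covers ASCII whitespace.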
You could do this:
import re
# Read the lines from the input file
with open('text.txt', 'r') as t:
    content = t.readlines()
# Open the output file for writing
with open('newtext.txt', 'w') as nt:
    # For each line
    for line in content:
        # Search for the regular expression
        if re.search(r'^\d+\s', line):
            # Only if the pattern is found in the line do we continue:
            # replace the first whitespace (the one after the number) with a comma
            # and write the result; count=1 leaves the trailing newline intact
            ntext = re.sub(r'\s', ',', line, count=1)
            nt.write(ntext)
There were multiple problems with your code. For starters, \d is a character class, basically the same as [0-9], so you don't need to put it inside square brackets; [\d+] matches a single character that is either a digit or a literal +. Also, you were checking whether the compiled object is true. Since the compile operation succeeds, the compiled pattern object is always truthy, whether or not the line matches.
Also, you should avoid nested with statements. A more Pythonic way is to open the input file using with, read it, and let the with block close it before opening the output file.
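The two pattern problems described above can be reproduced in a few lines. This is only an illustrative sketch; the sample strings and variable names are hypothetical.

```python
import re

compiled = re.compile(r"^[\d+]\s")

# A compiled pattern object is truthy regardless of any input line.
always_true = bool(compiled)

# [\d+] is ONE character (a digit or '+'), so a two-digit number followed
# by a space does not match, while a lone '+' does.
two_digits = compiled.match("12 abc")
plus_sign = compiled.match("+ abc")

# Outside the brackets, \d+ means "one or more digits".
fixed = re.match(r"^\d+\s", "12 abc")
```

Testing the result of match or search, rather than the compiled pattern itself, is what tells you whether a given line matched.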
Compact code
import re
with open('esempio.txt', 'r') as original, open('newtext2.txt', 'w') as newtext:
    for l in original.read().split('\n'):
        if re.search(r"^\d+\s", l):
            newtext.write(re.sub(r'\s', ',', l) + '\n')

Regular Expression to remove emoticons in Python

I am trying to remove emoticons from a piece of text. I looked at this regex from another question, and it doesn't remove any emoticons. Can you let me know what I am doing wrong, or whether there are better regexes for removing emojis from a string?
import re
myre = re.compile(u'('
                  u'\ud83c[\udf00-\udfff]|'
                  u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
                  u'[\u2600-\u26FF\u2700-\u27BF])+',
                  re.UNICODE)
def clean(inputFile, outputFile):
    with open(inputFile, 'r') as original, open(outputFile, 'w+') as out:
        for line in original:
            line = myre.sub('', line)
Something like this?
import re
myre = re.compile('('
                  '\ud83c[\udf00-\udfff]|'
                  '\ud83d[\udc00-\ude4f\ude80-\udeff]|'
                  '[\u2600-\u26FF\u2700-\u27BF])+'.decode('unicode_escape'),
                  re.UNICODE)
def clean(inputFile, outputFile):
    with open(inputFile, 'r') as original, open(outputFile, 'w+') as out:
        for line in original:
            line = myre.sub('', line.decode('utf-8'))
            print(line)
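In Python 3 the decoding can be left to open() itself. The following is a rough sketch under that assumption, not the answer's method: SYMBOL_RE is a hypothetical, deliberately broad range chosen for illustration, and clean here takes the pattern as a parameter so a narrower one can be passed in.

```python
import re

# Hypothetical, deliberately broad range for illustration only.
SYMBOL_RE = re.compile("[\u2600-\u27BF\U0001F300-\U0001F6FF]+")

def clean(input_path, output_path, pattern=SYMBOL_RE):
    # encoding="utf-8" makes open() decode on read and encode on write,
    # so no manual .decode() or .encode() calls are needed.
    with open(input_path, "r", encoding="utf-8") as original, \
         open(output_path, "w", encoding="utf-8") as out:
        for line in original:
            out.write(pattern.sub("", line))
```

Unlike the snippet above, this version writes the cleaned lines to the output file instead of printing them.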

Replace part of a matched string in python

I have the following matched strings:
punctacros="Tasla"_TONTA
punctacros="Tasla"_SONTA
punctacros="Tasla"_JONTA
punctacros="Tasla"_BONTA
I want to replace only a part (before the underscore) of the matched strings, and the rest of it should remain the same in each original string.
The result should look like this:
TROGA_TONTA
TROGA_SONTA
TROGA_JONTA
TROGA_BONTA
This should work:
from re import sub
with open("/path/to/file") as myfile:
    lines = []
    for line in myfile:
        line = sub('punctacros="Tasla"(_.*)', r'TROGA\1', line)
        lines.append(line)
with open("/path/to/file", "w") as myfile:
    myfile.writelines(lines)
Result:
TROGA_TONTA
TROGA_SONTA
TROGA_JONTA
TROGA_BONTA
Note however, if your file is exactly like the sample given, you can replace the re.sub line with this:
line = "TROGA_"+line.split("_", 1)[1]
eliminating the need of Regex altogether. I didn't do this though because you seem to want a Regex solution.
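mystring.replace('punctacros="Tasla"', 'TROGA')
where mystring is a string containing those four lines. It returns a new string with the values replaced; the underscore is left in place, so each line becomes TROGA_ followed by its original suffix.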
If you want to replace everything before the first underscore, try this:
#! /usr/bin/python3
data = ['punctacros="Tasla"_TONTA',
        'punctacros="Tasla"_SONTA',
        'punctacros="Tasla"_JONTA',
        'punctacros="Tasla"_BONTA',
        'somethingelse!="Tucku"_CONTA']
for s in data:
    print('TROGA' + s[s.find('_'):])
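The approaches in these answers differ in what they do to a line that does not start with punctacros="Tasla". A small side-by-side sketch, using two of the sample strings above (the variable names are hypothetical):

```python
import re

data = ['punctacros="Tasla"_TONTA', 'somethingelse!="Tucku"_CONTA']

# Regex with a backreference: only rewrites lines with the exact prefix.
via_regex = [re.sub(r'punctacros="Tasla"(_.*)', r'TROGA\1', s) for s in data]

# str.split: replaces whatever comes before the first underscore.
via_split = ["TROGA_" + s.split("_", 1)[1] for s in data]

# str.find with slicing: same effect as the split version.
via_find = ["TROGA" + s[s.find("_"):] for s in data]
```

The regex version leaves the second string untouched, while the split and find versions rewrite both, so the choice depends on whether other prefixes should be replaced as well.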
