Find the ™ (trademark) superscript in Python 2 using regex

My text file includes "SSS™" as one of its words and I am trying to find it using a regular expression. My problem is finding the ™ superscript. My code is:
import re
path='G:\python_code\A.txt'
f_general=open(path, 'r')
special=re.findall(r'\U2122',f_general.read())
print(special)
but it doesn't print anything. How can I fix it?

It may have to do with the encoding of your file. Try this:
import re

path = r"g:\python_code\A.txt"  # raw string so the backslashes are kept literally
f_general = open(path, "r", encoding="UTF-16")
data = f_general.read()
special = re.findall(chr(8482), data)  # chr(8482) is the ™ character
print(special)
print(chr(8482))
Note I'm using the decimal code point for the trademark sign. This is the site I use:
https://www.ascii.cl/htmlcodes.htm
So, open your file in Notepad, do a Save As, and choose the Unicode encoding; then this should all work. Working with characters outside plain ASCII can be a hassle. I am using Python 3.6, but I think this should still work in 2.x.
Note that when it prints chr(8482) on the command line it will probably show up as just a T; at least that is what I get on Windows.
Update
Try this for Python 2; it should also capture the word before the trademark:
import re

with open(r"g:\python_code\A.txt", "rb") as f:
    data = f.read().decode("UTF-16")

# In Python 2, chr() stops at 255, so use unichr() for the ™ code point.
regex = re.compile(u"\\S+" + unichr(8482))
match = re.search(regex, data)
if match:
    print(match.group(0))
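For completeness, in Python 3 the whole thing is simpler, since the regex can carry the character as a \u escape directly. A minimal sketch of the same idea (the sample text is made up):

```python
import re

# Match any non-whitespace run that ends in the trademark sign (U+2122).
pattern = re.compile(r'\S+\u2122')

text = 'the product SSS\u2122 ships today'
print(pattern.findall(text))  # ['SSS™']
```

The original r'\U2122' fails because \U expects an eight-digit escape; \u2122 is the four-digit form the re module understands.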

Related

Python read text file and split on control character

I'm working with output text files from Hadoop and Hive where the files have fields delimited by control-A. I'm then using Python to read the file line-by-line, but the string split() function is not splitting correctly even when I specify the delimiter.
Here is some sample data that is typical of what I get from Hadoop. Note that ^A is actually a control character.
field1^Afield2^Afield3^Afield4
field5^Afield6^Afield7^Afield8
You can see that the Linux command-line tool cut using the control code as a delimiter actually works. It is outputting the third field:
bash> cat test.txt | cut -d $'\001' -f 3
field3
field7
I then wrote a Python function that reads the file line-by-line using the standard Python idiom:
import re

def read_file(filename):
    ''' Read file line-by-line and split. '''
    with open(filename, "r") as myfile:
        for line in myfile:
            tokens = line.split('\u0001')
            #tokens = line.split('\^A')
            #tokens = re.split('\^A', line)
            print 'len(tokens): %d, tokens[0]: %s\n' % (len(tokens), tokens[0])
However, when I run the function, the string is not split correctly. There should be four tokens in each line.
>>> read_file('test2.txt')
len(tokens): 1, tokens[0]: field1field2field3field4
len(tokens): 1, tokens[0]: field5field6field7field8
As you can see in my Python function, I tried three different approaches to splitting the string. None of them worked.
tokens = line.split('\u0001')
tokens = line.split('\^A')
tokens = re.split('\^A', line)
Thanks for any help.
Related questions (none had a working solution for me):
delimiting carat A in python
re.split not working on ^A
Assuming that control-A is character "\x01" (ASCII code 1):
>>> line="field1\x01field2\x01field3\x01field4"
>>> line.split("\x01")
['field1', 'field2', 'field3', 'field4']
If you want to use the "\u0001" notation, you need the 'u' prefix (Python 2):
>>> line.split(u"\u0001")
[u'field1', u'field2', u'field3', u'field4']
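Putting that together for a whole file, the same "\x01" split works line by line. A sketch (the function name is my own, not from the question):

```python
def read_fields(filename):
    """Yield the fields of each control-A-delimited line in the file."""
    with open(filename, "r") as f:
        for line in f:
            # Strip the trailing newline first, then split on control-A.
            yield line.rstrip("\n").split("\x01")

# The split itself, shown on an in-memory line:
line = "field1\x01field2\x01field3\x01field4"
print(line.split("\x01"))  # ['field1', 'field2', 'field3', 'field4']
```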

"ValueError: embedded null character" when using open()

I am taking Python at my college and I am stuck on my current assignment. We are supposed to take two files and compare them. I am simply trying to open the files so I can use them, but I keep getting the error "ValueError: embedded null character".
file1 = input("Enter the name of the first file: ")
file1_open = open(file1)
file1_content = file1_open.read()
What does this error mean?
It seems you have a problem with the "\" and "/" characters. If you use one of them in the input, try switching it for the other.
Default encoding of files for Python 3.5 is 'utf-8'.
Default encoding of files for Windows tends to be something else.
If you intend to open two text files, you may try this:
import locale
locale.getdefaultlocale()
file1 = input("Enter the name of the first file: ")
file1_open = open(file1, encoding=locale.getdefaultlocale()[1])
file1_content = file1_open.read()
There should be some automatic detection in the standard library.
Otherwise you may create your own:
def guess_encoding(csv_file):
    """guess the encoding of the given file"""
    import io
    import locale
    with io.open(csv_file, "rb") as f:
        data = f.read(5)
    if data.startswith(b"\xEF\xBB\xBF"):  # UTF-8 with a "BOM"
        return "utf-8-sig"
    elif data.startswith(b"\xFF\xFE") or data.startswith(b"\xFE\xFF"):
        return "utf-16"
    else:  # in Windows, guessing utf-8 doesn't work, so we have to try
        try:
            with io.open(csv_file, encoding="utf-8") as f:
                preview = f.read(222222)
            return "utf-8"
        except UnicodeDecodeError:
            return locale.getdefaultlocale()[1]
and then
file1 = input("Enter the name of the first file: ")
file1_open = open(file1, encoding=guess_encoding(file1))
file1_content = file1_open.read()
Try putting r (raw string) in front:
r'D:\python_projects\templates\0.html'
On Windows, when specifying the full path of a file, use double backslashes as the separator, not single ones.
For instance, C:\\FileName.txt instead of C:\FileName.txt
I got this error when copying a file into a folder whose name starts with a number. A backslash followed by a digit is read as an escape (for example "\0" is the null character, which is exactly what the error complains about); writing a double backslash before the number solves the problem.
You need a raw string (r prefix):
FileHandle = open(r'..', encoding='utf8')
Forward slashes also work:
FilePath = 'C:/FileName.txt'
or, with a raw string:
FilePath = r'C:\FileName.txt'
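The suggestions above boil down to three safe spellings of the same Windows path (the file name here is just an example):

```python
# Three equivalent, safe ways to spell a Windows path:
p1 = r'C:\FileName.txt'   # raw string: backslashes kept literally
p2 = 'C:\\FileName.txt'   # escaped backslashes
p3 = 'C:/FileName.txt'    # forward slashes, which Windows APIs accept

print(p1 == p2)  # True
```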
The problem is due to bytes data that needs to be decoded.
When you enter a variable at the interpreter prompt, it displays its repr, whereas print() uses its str (the same in this scenario) and ignores unprintable characters such as \x00 and \x01, replacing them with something else.
A solution is to "decode" file1_content, dropping the unprintable bytes:
file1_content = ''.join(x for x in file1_content if x.isprintable())
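For example, on a made-up string with embedded control bytes:

```python
# A hypothetical string containing NUL and control-A bytes:
s = "abc\x00def\x01ghi"
cleaned = ''.join(ch for ch in s if ch.isprintable())
print(cleaned)  # abcdefghi
```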
I was also getting the same error, with the following code:
with zipfile.ZipFile(r"C:\local_files\REPORT.zip", mode='w') as z:
    z.writestr(data)
It was happening because I was passing the bytestring data to writestr() without specifying the name of the file inside the archive it should be saved as.
So I changed my code and it worked:
with zipfile.ZipFile(r"C:\local_files\REPORT.zip", mode='w') as z:
    z.writestr('Report.zip', data)
If you are trying to open a file, you should build the path with os.path, like so:
import os
path = os.path.join("path", "to", "the", "file")

search for string and print line containing that string in a very large file

Trying to search for a string (an email address) and print the line it is found on, within a 1.66 GB .dump file (Ashley Madison). If I change print(line) to print('true'), I get true back, so I know it is reading the file; but when I try to print the line, Python crashes with no error. Please help. Python 3.4 on Windows Vista. (Rather than using a database and importing, I'm using this as a learning exercise for Python.)
with open('aminno_member_email.dump', 'r', errors='ignore') as searchfile:
    for line in searchfile:
        if 'email#address.com' in line:
            #print('true')
            print(line)
As I suspected, each line of that file is very long (to the tune of nearly a million characters, as you found). Most consoles are not set up to handle that sort of thing, so writing that line to a text file is your best bet. You can then open the file in a text editor or word processor and use its search function to locate areas of interest.
To display your search string with some characters of surrounding text, you can use a regular expression.
import re
...
# replace this:
'''
if 'email#address.com' in line:
    #print ('true')
    print (line)
'''
# with this:
print(*re.findall(r'(.{0,10}email#address\.com.{0,10})', line), sep='\n')
That will print each match with up to 10 characters before and after the search string, separated by a newline.
Example:
>>> print(*re.findall(r'(.{0,10}str.{0,10})', 'hello this is a string with text and it is very strong stuff'), sep='\n')
this is a string with t
t is very strong stuff
Open the file as a stream and read from the stream instead of loading the entire file into RAM. Use io from the Python standard library:
import io

with io.open('aminno_member_email.dump', 'r') as file:
    ...
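Combining the two answers above: stream the file line by line, and write the trimmed matches to a smaller output file instead of the console. A sketch (the function name and file paths are my own, not from the question):

```python
import re

def extract_matches(in_path, out_path, needle, context=10):
    """Write each occurrence of needle, with surrounding context, to out_path."""
    pattern = re.compile('(.{0,%d}%s.{0,%d})' % (context, re.escape(needle), context))
    with open(in_path, 'r', errors='ignore') as src, \
         open(out_path, 'w') as dst:
        for line in src:
            for match in pattern.findall(line):
                dst.write(match + '\n')
```

The output file can then be opened in a text editor and searched normally, avoiding the million-character console lines.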

Is there any function like iconv in Python?

I have some CSV files that need converting from Shift-JIS to UTF-8.
Here is my code in PHP, which successfully transcodes them to readable text:
$str = utf8_decode($str);
$str = iconv('shift-jis', 'utf-8'. '//TRANSLIT', $str);
echo $str;
My problem is how to do the same thing in Python.
I don't know PHP, but does this work?
mystring.decode('shift-jis').encode('utf-8')
Also, I assume the CSV content comes from a file. There are a few options for opening a file in Python:
with open(myfile, 'rb') as fin:
    ...
would be the first, and you would get the data as raw bytes;
with open(myfile, 'r') as fin:
    ...
would be the default text-mode opening.
Also, I tried on my computer with a Shift-JIS text file, and the following code worked:
with open("shift.txt", "rb") as fin:
    text = fin.read()
print(text.decode('shift-jis').encode('utf-8'))
The result was the following in UTF-8 (without any errors):
' \xe3\x81\xa6 \xe3\x81\xa7 \xe3\x81\xa8'
OK, I validate my solution :)
The first character is indeed the correct one: "\xe3\x81\xa6" is the UTF-8 byte sequence E3 81 A6, i.e. て.
It gives the correct result.
You can try yourself at this URL
For when Python's built-in encodings are insufficient, there's an iconv package on PyPI:
pip install iconv
Unfortunately, the documentation is nonexistent.
There's also iconv_codecs:
pip install iconv_codecs
eg:
>>> import iconv_codecs
>>> iconv_codecs.register('ansi_x3.110-1983')
>>> "foo".encode('ansi_x3.110-1983')
It would be helpful if you could post the string that you are trying to convert, since this suggests some problem with the input data; older versions of PHP failed silently on broken input strings, which makes this hard to diagnose.
According to the documentation, this might also be due to differences between Shift-JIS dialects; try using 'shift_jisx0213' or 'shift_jis_2004' instead.
If using another dialect does not work, you might get away with asking Python to fail silently by using .decode('shift-jis', 'ignore') or .decode('shift-jis', 'replace').
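In Python 3, the idiomatic route is to let open() do the transcoding. A rough equivalent of the PHP snippet might look like this (the function and file names are assumptions; errors='replace' loosely mirrors the tolerance of //TRANSLIT):

```python
def sjis_to_utf8(in_path, out_path):
    """Transcode a Shift-JIS text file to UTF-8, replacing unmappable characters."""
    with open(in_path, 'r', encoding='shift-jis', errors='replace') as src, \
         open(out_path, 'w', encoding='utf-8') as dst:
        for line in src:
            dst.write(line)
```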

Replace part of string using python regular expression

I have the following lines (many, many):
...
gfnfgnfgnf: 5656756734
arvervfdsa: 1343453563
particular: 4685685685
erveveersd: 3453454545
verveversf: 7896789567
..
What I'd like to do is find the 'particular' line (whatever number comes after the ':')
and replace that number with '111222333'. How can I do that using Python regular expressions?
for line in input:
    key, val = line.split(':')
    if key == 'particular':
        val = '111222333'
I'm not sure a regex would be of any value in this specific case. My guess is it would be slower. That said, it can be done. Here's one way:
for line in input:
    line = re.sub('^particular: .*', 'particular: 111222333', line)
There are subtleties involved in this, and this is almost certainly not what you'd want in production code. You need to check all of the re module constants to make sure the regex is acting the way you expect, etc. You might be surprised at the flexibility you find in dealing with problems like this in Python if you try not to use re (of course, this isn't to say re isn't useful) ;-)
Are you sure you need a regular expression?
other_number = '111222333'
some_text, some_number = line.split(': ')
new_line = ': '.join([some_text, other_number])
#!/usr/bin/env python
import re
text = '''gfnfgnfgnf: 5656756734
arvervfdsa: 1343453563
particular: 4685685685
erveveersd: 3453454545
verveversf: 7896789567'''
print(re.sub('[0-9]+', '111222333', text))
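Note that [0-9]+ with no anchor replaces every number in the text, not just the one after 'particular'. A variant restricted to that one line, using a capture group and re.MULTILINE (a sketch on the same data):

```python
import re

text = '''gfnfgnfgnf: 5656756734
particular: 4685685685
erveveersd: 3453454545'''

# ^ anchors at each line start under MULTILINE, so only the
# 'particular' line is touched; \1 re-inserts the captured key.
print(re.sub(r'^(particular:) \d+', r'\1 111222333',
             text, flags=re.MULTILINE))
```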
input = """gfnfgnfgnf: 5656756734
arvervfdsa: 1343453563
particular: 4685685685
erveveersd: 3453454545
verveversf: 7896789567"""
entries = re.split("\n+", input)
for entry in entries:
    if entry.startswith("particular"):
        entry = re.sub(r'[0-9]+', r'111222333', entry)
or with sed:
sed -e 's/^particular: [0-9].*$/particular: 111222333/g' file
An important point here is that if you have a lot of lines, you want to process them one by one. That is, instead of reading all the lines in replacing them, and writing them out again, you should read in a line at a time and write out a line at a time. (This would be inefficient if you were actually reading a line at a time from the disk; however, Python's IO is competent and will buffer the file for you.)
with open(...) as infile, open(...) as outfile:
    for line in infile:
        if line.startswith("particular"):
            outfile.write("particular: 111222333\n")
        else:
            outfile.write(line)
This will be speed- and memory-efficient.
Your sed example forces me to say neat!
python -c "import re, sys; print ''.join(re.sub(r'^(particular:) \d+', r'\1 111222333', l) for l in open(sys.argv[1]))" file
