python regex unicode - extracting data from an utf-8 file

python regex unicode - extracting data from an utf-8 file - python

I am facing difficulties for extracting data from an UTF-8 file that contains chinese characters.
The file is actually the CEDICT (chinese-english dictionary) and looks like this :
賓 宾 [bin1] /visitor/guest/object (in grammar)/
賓主 宾主 [bin1 zhu3] /host and guest/
賓利 宾利 [Bin1 li4] /Bentley/
賓士 宾士 [Bin1 shi4] /Taiwan equivalent of 奔馳|奔驰[Ben1 chi2]/
賓夕法尼亞 宾夕法尼亚 [Bin1 xi1 fa3 ni2 ya4] /Pennsylvania/
賓夕法尼亞大學 宾夕法尼亚大学 [Bin1 xi1 fa3 ni2 ya4 Da4 xue2] /University of Pennsylvania/
賓夕法尼亞州 宾夕法尼亚州 [Bin1 xi1 fa3 ni2 ya4 zhou1] /Pennsylvania/
Until now, I manage to get the first two fields using split() but I can't find out how I should proceed to extract the two other fields (let's say for the second line "bin1 zhu3" and "host and guest". I have been trying to use regex but it doesn't work for a reason I ignore.
#!/bin/python
#coding=utf-8
import re
class REMatcher(object):
def __init__(self, matchstring):
self.matchstring = matchstring
def match(self,regexp):
self.rematch = re.match(regexp, self.matchstring)
return bool(self.rematch)
def group(self,i):
return self.rematch.group(i)
def look(character):
myFile = open("/home/quentin/cedict_ts.u8","r")
for line in myFile:
line = line.rstrip()
elements = line.split(" ")
try:
if line != "" and elements[1] == character:
myFile.close()
return line
except:
myFile.close()
break
myFile.close()
return "Aucun résultat :("
translation = look("賓主") # translation contains one line of the file
elements = translation.split()
traditionnal = elements[0]
simplified = elements[1]
print "Traditionnal:" + traditionnal
print "Simplified:" + simplified
m = REMatcher(translation)
tr = ""
if m.match(r"\[(\w+)\]"):
tr = m.group(1)
print "Pronouciation:" + tr
Any help appreciated.

This builds a dictionary to look up translations by either simplified or traditional characters and works in both Python 2.7 and 3.3:
# coding: utf8
import re
import codecs
# Process the whole file decoding from UTF-8 to Unicode
with codecs.open('cedict_ts.u8',encoding='utf8') as datafile:
D = {}
for line in datafile:
# Skip comment lines
if line.startswith('#'):
continue
trad,simp,pinyin,trans = re.match(r'(.*?) (.*?) \[(.*?)\] /(.*)/',line).groups()
D[trad] = (simp,pinyin,trans)
D[simp] = (trad,pinyin,trans)
Output (Python 3.3):
>>> D['马克']
('馬克', 'Ma3 ke4', 'Mark (name)')
>>> D['一路顺风']
('一路順風', 'yi1 lu4 shun4 feng1', 'to have a pleasant journey (idiom)')
>>> D['馬克']
('马克', 'Ma3 ke4', 'Mark (name)')
Output (Python 2.7, you have to print strings to see non-ASCII characters):
>>> D[u'马克']
(u'\u99ac\u514b', u'Ma3 ke4', u'Mark (name)')
>>> print D[u'马克'][0]
馬克

I would continue to use splits instead of regular expressions, with the maximum split number given. It depends on how consistent the format of the input file is.
elements = translation.split(' ',2)
traditionnal = elements[0]
simplified = elements[1]
rest = elements[2]
print "Traditionnal:" + traditionnal
print "Simplified:" + simplified
elems = rest.split(']')
tr = elems[0].strip('[')
print "Pronouciation:" + tr
Output:
Traditionnal:賓主
Simplified:宾主
Pronouciation:bin1 zhu3
EDIT: To split the last field into a list, split on the /:
translations = elems[1].strip().strip('/').split('/')
#strip the spaces, then the first and last slash,
#then split on the slashes
Output (for the first line of input):
['visitor', 'guest', 'object (in grammar)']

Heh, I've done this exact same thing before. Basically you just need to use regex with groupings. Unfortunately, I don't know python regex super well (I did the same thing using C#), but you should really do something like this:
matcher = "(\b\w+\b) (\b\w+\b) \[(\.*?)\] /(.*?)/"
basically you match the entire line using one expression, but then you use ( ) to separate each item into a regex-group. Then you just need to read the groups and voila!

Related

Finding an exact word in list

I am learning Python and am struggling with fining an exact word in each string in a list of strings.
Apologies if this is an already asked question for this situation.
This is what my code looks like so far:
with open('text.txt') as f:
lines = f.readlines()
lines = [line.rstrip('\n') for line in open('text.txt')]
keyword = input("Enter a keyword: ")
matching = [x for x in lines if keyword.lower() in x.lower()]
match_count = len(matching)
print('\nNumber of matches: ', match_count, '\n')
print(*matching, sep='\n')
Right now, matching will return all strings containing the word, not strings contating the exact word. For example, if I enter in 'local' as the keyword, strings with 'locally' and 'localized' in addition to 'local' will be returned when I only want just instances of 'local' returned.
I have tried:
match_test = re.compile(r"\b" + keyword+ r"\b")
match_test = ('\b' + keyword + '\b')
match_test = re.compile('?:^|\s|$){0}'.format(keyword))
matching = [x for x in lines if keyword.lower() == x.lower()]
matching = [x for x in lines if keyword.lower() == x.lower().strip()]
And none of them shave worked, so I'm a bit stuck.
How do I take the keyword entered from the user, and then return all strings in a list that contain that exact keyword?
Thanks

in means contained in, 'abc' in 'abcd' is True. For exact match use ==
matching = [x for x in lines if keyword.lower() == x.lower()]
You might need to remove spaces\new lines as well
matching = [x for x in lines if keyword.lower().strip() == x.lower().strip()]
Edit:
To find a line containing the keyword you can use loops
matches = []
for line in lines:
for string in line.split(' '):
if string.lower().strip() == keyword.lower().strip():
matches.append(line)

This method avoids having to read the whole file into memory. It also deals with cases like "LocaL" or "LOCAL" assuming you want to capture all such variants. There is a bit of performance overhead on making the temp string each time the line is read, however:
import re
reader(filename, target):
#this regexp matches a word at the front, end or in the middle of a line stripped
#of all punctuation and other non-alpha, non-whitespace characters:
regexp = re.compile(r'(^| )' + target.lower() + r'($| )')
with open(filename) as fin:
matching = []
#read lines one at at time:
for line in fin:
line = line.rstrip('\n')
#generates a line of lowercase and whitespace to test against
temp = ''.join([x.lower() for x in line if x.isalpha() or x == ' '])
print(temp)
if regexp.search(temp):
matching.append(line) #store unaltered line
return matching
Given the following tests:
locally local! localized
locally locale nonlocal localized
the magic word is Local.
Localized or nonlocal or LOCAL
This is returned:
['locally local! localized',
'the magic word is Local.',
'Localized or nonlocal or LOCAL']

Please find my solution which should match only local among following mentioned text in text file . I used search regular expression to find the instance which has only 'local' in string and other strings containing local will not be searched for .
Variables which were provided in text file :
local
localized
locally
local
local diwakar
local
local##!
Code to find only instances of 'local' in text file :
import os
import sys
import time
import re
with open('C:/path_to_file.txt') as f:
for line in f:
a = re.search(r'local\W$', line)
if a:
print(line)
Output
local
local
local
Let me know if this is what you were looking for

Your first test seems to be on the right track
Using input:
import re
lines = [
'local student',
'i live locally',
'keyboard localization',
'what if local was in middle',
'end with local',
]
keyword = 'local'
Try this:
pattern = re.compile(r'.*\b{}\b'.format(keyword.lower()))
matching = [x for x in lines if pattern.match(x.lower())]
print(matching)
Output:
['local student', 'what if local was in middle', 'end with local']
pattern.match will return the first instance of the regex matching or None. Using this as your if condition will filter for strings that match the whole keyword in some place. This works because \b matches the begining/ending of words. The .* works to capture any characters that may occur at the start of the line before your keyword shows up.
For more info about using Python's re, checkout the docs here: https://docs.python.org/3.8/library/re.html

You can try
pattern = re.compile(r"\b{}\b".format(keyword))
match_test = pattern.search(line)
like shown in
Python - Concat two raw strings with an user name

Parse ~4k files for a string (sophisticated conditions)

Problem description
There is a set of ~4000 python files with the following struture:
#ScriptInfo(number=3254,
attibute=some_value,
title="crawler for my website",
some_other_key=some_value)
scenario_name = entity.get_script_by_title(title)
The goal
The goal is to get the value of the title from the ScriptInfo decorator (in this case it is "crawler for my website"), but there are a couple of problems:
1) There is no rule for naming a variable that contains the title. That's why it can be title_name, my_title, etc. See example:
#ScriptInfo(number=3254,
attibute=some_value,
my_title="crawler for my website",
some_other_key=some_value)
scenario_name = entity.get_script_by_title(my_title)
2) The #ScriptInfo decorator may have more than two arguments so getting its contents from between the parentheses in order to get the second parameter's value is not an option
My (very naive) solution
But the piece of code that stays unchanged is the scenario_name = entity.get_script_by_title(my_title) line. Taking this into account, I've come up with the solution:
import re
title_variable_re = r"scenario_name\s?=\s?entity\.get_script_by_title\((.*)\)"
with open("python_file.py") as file:
for line in file:
if re.match(regexp, line):
title_variable = re.match(title_variable_re, line).group(1)
title_re = title_variable + r"\s?=\s\"(.*)\"?"
with open("python_file.py") as file:
for line in file:
if re.match(title_re, line):
title_value = re.match(regexp, line).group(1)
print title_value
This snippet of code does the following:
1) Traverses (see the first with open) the script file and gets the variable with title value because it is up to a programmer to choose its name
2) Traverses the script file again (see the second with open) and gets the title's value
The question for the stackoverflow family
Is there a better and more efficient way to get the title's (my_title's, title_name's, etc) value than traversing the script file two times?

If you open the file only once and save all lines into fileContent, add break where appropriate, and reuse the matches to access the captured groups, you obtain something like this (with parentheses after print for 3.x, without for 2.7):
import re
title_value = None
title_variable_re = r"scenario_name\s?=\s?entity\.get_script_by_title\((.*)\)"
with open("scenarioName.txt") as file:
fileContent = list(file.read().split('\n'))
title_variable = None
for line in fileContent:
m1 = re.match(title_variable_re, line)
if m1:
title_variable = m1.group(1)
break
title_re = r'\s*' + title_variable + r'\s*=\s*"([^"]*)"[,)]?\s*'
for line in fileContent:
m2 = re.match(title_re, line)
if m2:
title_value = m2.group(1)
break
print(title_value)
Here an unsorted list of changes in the regular expressions:
Allow space before the title_variable, that's what the r'\s*' + is for
Allow space around =
Allow comma or closing round paren in the end of the line in title_re, that's what the [,)]? is for
Allow some space in the end of the line
When tested on the following file as input:
#ScriptInfo(number=3254,
attibute=some_value,
my_title="crawler for my website",
some_other_key=some_value)
scenario_name = entity.get_script_by_title(my_title)
it produces the following output:
crawler for my website

Python 3 - How to remove line/paragraph breaks

from docx import Document
alphaDic = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','!','?','.','~',',','(',')','$','-',':',';',"'",'/']
while docIndex < len(doc.paragraphs):
firstSen = doc.paragraphs[docIndex].text
rep_dic = {ord(k):None for k in alphaDic + [x.upper() for x in alphaDic]}
translation = (firstSen.translate(rep_dic))
removeSpaces = " ".join(translation.split())
removeLineBreaks = removeSpaces.replace('\n','')
doc.paragraphs[docIndex].text = removeLineBreaks
docIndex +=1
I am attempting to remove line breaks from the document, but it doesn't work.
I am still getting
Hello
There
Rather than
Hello
There

I think what you want to do is get rid of an empty paragraph. The following function could help, it deletes a certain paragraph that you don't want:
def delete_paragraph(paragraph):
p = paragraph._element
p.getparent().remove(p)
p._p = p._element = None
Code by: Scanny*
In your code, you could check if translation is equal to '' or not, and if it is then call the delete_paragraph function, so your code would be like:
while docIndex < len(doc.paragraphs):
firstSen = doc.paragraphs[docIndex].text
rep_dic = {ord(k):None for k in alphaDic + [x.upper() for x in alphaDic]}
translation = (firstSen.translate(rep_dic))
if translation != '':
doc.paragraphs[docIndex].text = translation
else:
delete_paragraph(doc.paragraphs[docIndex])
docIndex -=1 # go one step back in the loop because of the deleted index
docIndex +=1
*Reference- feature: Paragraph.delete()

The package comes with an example program that extracts the text.
That said, I think your problem springs from the fact that you are trying to operate on paragraphs. But the separation between paragraphs is where the newlines are happening. So even if you replace a program with the empty string (''), there will still be a newline added to the end of it.
You should either take the approach of the example program, and do your own formatting, or you should make sure that you delete any spurious "empty" paragraphs that might be between the "full" ones you have ("Hello", "", "There") -> ("Hello", "There").

Since readlines could read any type of text files, you can open the file rewrite the lines you want and ignore the lines you dont want to use.
"""example"""
file = open("file name", "w")
for line in file.readlines():
if (line != ''):
file.write(line)

Python to print string from substring from list

I am a newbie to python.Consider I have a list ['python','java','ruby']
I have a textfile as:
jrubyk
knwdjavawe
weqkpythonqwe
1ruby.e
Expected output:
ruby
java
python
ruby
I need to print the strings in list hidden inside as substring.
Is there a way to obtain that?

I tend to use regular expressions when I want to strip certain substrings from larger strings. Here is an inelegant but readable way to do this.
import re
python_matcher = re.compile('python')
java_matcher = re.compile('java')
ruby_matcher = re.compile('ruby')
hidden_text_list = open('hidden.txt', 'r').readlines()
for line in hidden_text_list:
python_matched = python_matcher.search(line)
java_matched = java_matcher.search(line)
ruby_matched = ruby_matcher.search(line)
if python_matched:
print python_matched.group()
elif java_matched:
print java_matched.group()
elif ruby_matched:
print ruby_matched.group()

The brute force approach is:
hidden_strings = ['python','java','ruby']
with open('path/to/textfile/as/in/example.txt') as infile:
for line in infile:
for hidden_string in hidden_strings:
if hidden_string in line:
print(hidden_string)

Faster operation reading file

I have to process a 15MB txt file (nucleic acid sequence) and find all the different substrings (size 5). For instance:
ABCDEF
would return 2, as we have both ABCDE and BCDEF, but
AAAAAA
would return 1. My code:
control_var = 0
f=open("input.txt","r")
list_of_substrings=[]
while(f.read(5)!=""):
f.seek(control_var)
aux = f.read(5)
if(aux not in list_of_substrings):
list_of_substrings.append(aux)
control_var += 1
f.close()
print len(list_of_substrings)
Would another approach be faster (instead of comparing the strings direct from the file)?

Depending on what your definition of a legal substring is, here is a possible solution:
import re
regex = re.compile(r'(?=(\w{5}))')
with open('input.txt', 'r') as fh:
input = fh.read()
print len(set(re.findall(regex, input)))
Of course, you may replace \w with whatever you see fit to qualify as a legal character in your substring. [A-Za-z0-9], for example will match all alphanumeric characters.
Here is an execution example:
>>> import re
>>> input = "ABCDEF GABCDEF"
>>> set(re.findall(regex, input))
set(['GABCD', 'ABCDE', 'BCDEF'])
EDIT: Following your comment above, that all character in the file are valid, excluding the last one (which is \n), it seems that there is no real need for regular expressions here and the iteration approach is much faster. You can benchmark it yourself with this code (note that I slightly modified the functions to reflect your update regarding the definition of a valid substring):
import timeit
import re
FILE_NAME = r'input.txt'
def re_approach():
return len(set(re.findall(r'(?=(.{5}))', input[:-1])))
def iter_approach():
return len(set([input[i:i+5] for i in xrange(len(input[:-6]))]))
with open(FILE_NAME, 'r') as fh:
input = fh.read()
# verify that the output of both approaches is identicle
assert set(re.findall(r'(?=(.{5}))', input[:-1])) == set([input[i:i+5] for i in xrange(len(input[:-6]))])
print timeit.repeat(stmt = re_approach, number = 500)
print timeit.repeat(stmt = iter_approach, number = 500)

15MB doesn't sound like a lot. Something like this probably would work fine:
import Counter, re
contents = open('input.txt', 'r').read()
counter = Counter.Counter(re.findall('.{5}', contents))
print len(counter)
Update
I think user590028 gave a great solution, but here is another option:
contents = open('input.txt', 'r').read()
print set(contents[start:start+5] for start in range(0, len(contents) - 4))
# Or using a dictionary
# dict([(contents[start:start+5],True) for start in range(0, len(contents) - 4)]).keys()

You could use a dictionary, where each key is a substring. It will take care of duplicates, and you can just count the keys at the end.
So: read through the file once, storing each substring in the dictionary, which will handle finding duplicate substrings & counting the distinct ones.

Reading all at once is more i/o efficient, and using a dict() is going to be faster than testing for existence in a list. Something like:
fives = {}
buf = open('input.txt').read()
for x in xrange(len(buf) - 4):
key = buf[x:x+5]
fives[key] = 1
for keys in fives.keys():
print keys

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

python regex unicode - extracting data from an utf-8 file - python

Related

Finding an exact word in list

Parse ~4k files for a string (sophisticated conditions)

Python 3 - How to remove line/paragraph breaks

Python to print string from substring from list

Faster operation reading file

Categories

Resources