Seeking advice on how to mine items from multiple text files to build a dictionary.
This text file: https://pastebin.com/Npcp3HCM
Was manually transformed into this required data structure: https://drive.google.com/file/d/0B2AJ7rliSQubV0J2Z0d0eXF3bW8/view
There are thousands of such text files and they may have different section headings as shown in these examples:
https://pastebin.com/wWSPGaLX
https://pastebin.com/9Up4RWHu
I started off by reading the files:
from glob import glob
txtPth = '../tr-txt/*.txt'
txtFiles = glob(txtPth)
with open(txtFiles[0],'r') as tf:
    allLines = [line.rstrip() for line in tf]
sectionHeading = ['Corporate Participants',
                  'Conference Call Participants',
                  'Presentation',
                  'Questions and Answers']
for lineNum, line in enumerate(allLines):
    if line in sectionHeading:
        print(lineNum,allLines[lineNum])
My idea was to find the line numbers of the section headings, extract the content between those line numbers, and then strip out separators such as dashes. That didn't work, and I got stuck trying to create a dictionary of the following kind, so that I can later run various natural language processing algorithms on the extracted items.
{file-name1:{
    {date-time:[string]},
    {corporate-name:[string]},
    {corporate-participants:[name1,name2,name3]},
    {call-participants:[name4,name5]},
    {section-headings:{
        {heading1:[
            {name1:[speechOrderNum, text-content]},
            {name2:[speechOrderNum, text-content]},
            {name3:[speechOrderNum, text-content]}]},
        {heading2:[
            {name1:[speechOrderNum, text-content]},
            {name2:[speechOrderNum, text-content]},
            {name3:[speechOrderNum, text-content]},
            {name2:[speechOrderNum, text-content]},
            {name1:[speechOrderNum, text-content]},
            {name4:[speechOrderNum, text-content]}]},
        {heading3:[text-content]},
        {heading4:[text-content]}
    }}
}}
The challenge is that different files may have different headings and a different number of headings. But there will always be a section called "Presentation", and very likely a "Question and Answer" section. The section headings are always separated by a line of equals signs, and the content of different speakers is always separated by a line of dashes. The "speech order" in the Q&A section is indicated by a number in square brackets. The participants are always listed at the beginning of the document with an asterisk before their name, and their title is always on the next line.
Any suggestion on how to parse the text files is appreciated. The ideal help would be to provide guidance on how to produce such a dictionary (or other suitable data structure) for each file that can then be written to a database.
Thanks
--EDIT--
One of the files looks like this: https://pastebin.com/MSvmHb2e
In which the "Question & Answer" section is mislabeled as "Presentation" and there is no other "Question & Answer" section.
And final sample text: https://pastebin.com/jr9WfpV8
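The layout described in the question (sections delimited by lines of equals signs, speaker blocks delimited by lines of dashes, speech order in square brackets) can be sketched roughly as follows. This is only an illustration: the SAMPLE text, the speaker names, and the parse_section helper are made up, and the real files may well need extra handling.

```python
import re

# Hypothetical transcript fragment; the real files may differ in detail.
SAMPLE = """\
================
Presentation
----------------
Jane Doe, Example Corp - CEO [1]
----------------
Welcome to the call.
----------------
John Roe, Example Corp - CFO [2]
----------------
Thanks, Jane.
================
"""

def parse_section(text):
    """Return (heading, [(order, speaker, speech), ...]) for one section."""
    # Split the section on lines that consist only of dashes.
    parts = [p.strip() for p in re.split(r"^-+\s*$", text, flags=re.MULTILINE)]
    heading, rest = parts[0], parts[1:]
    turns = []
    # After the heading, the parts alternate: speaker line, speech, speaker line, ...
    for speaker_line, speech in zip(rest[0::2], rest[1::2]):
        m = re.search(r"\[(\d+)\]\s*$", speaker_line)
        order = int(m.group(1)) if m else None
        speaker = speaker_line[:m.start()].strip() if m else speaker_line
        turns.append((order, speaker, speech))
    return heading, turns

# Split the document on lines that consist only of equals signs.
sections = [s.strip()
            for s in re.split(r"^=+\s*$", SAMPLE, flags=re.MULTILINE)
            if s.strip()]
parsed = dict(parse_section(s) for s in sections)
print(parsed)
```

Keeping each speaker turn as a tuple in a list, rather than as a dictionary keyed by speaker name, preserves repeated turns by the same speaker.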
The comments in the code should explain everything. Let me know if anything is underspecified and needs more comments.
In short, I use a regex to find the '=' delimiter lines and subdivide the entire text into sections, then handle each type of section separately for clarity's sake (so you can tell how I am handling each case).
Side note: I am using the words 'attendee' and 'author' interchangeably.
EDIT: Updated the code to sort based on the '[x]' pattern found right next to the attendee/author in the presentation/QA section. Also changed the pretty-print part, since pprint does not handle OrderedDict very well.
To strip leading and trailing whitespace, including \n, from a string, simply call str.strip(). If you specifically need to strip only \n, use str.strip('\n').
I have modified the code to strip the whitespace around the talks.
import json
import re
from collections import OrderedDict
from pprint import pprint
# Subdivides a block of text based on the delimiting regular expression.
# >>> example_string = '''=============================
# asdfasdfasdf
# sdfasdfdfsdfsdf
# =============================
# asdfsdfasdfasd
# ============================='''
# >>> subdivide(example_string, "^=+")
# ['', 'asdfasdfasdf\nsdfasdfdfsdfsdf', 'asdfsdfasdfasd', '']
def subdivide(lines, regex):
    equ_pattern = re.compile(regex, re.MULTILINE)
    sections = equ_pattern.split(lines)
    sections = [section.strip('\n') for section in sections]
    return sections
# For processing sections with dashes in them. Returns the heading of the section along with
# a dictionary where each key is the subsection's header, and each value is the text in the subsection.
def process_dashed_sections(section):
    subsections = subdivide(section, "^-+")
    heading = subsections[0] # header of the section.
    d = {key: value for key, value in zip(subsections[1::2], subsections[2::2])}
    index_pattern = re.compile(r"\[(\d+)\]", re.MULTILINE)
    # Sort the dictionary by capturing the pattern '[x]' in each key and extracting the number 'x'.
    # This is passed as the key function to 'sorted' to sort based on 'x'.
    def cmp(d):
        mat = index_pattern.findall(d[0])
        if mat:
            return int(mat[0])
        # There are issues when dealing with subsections containing '-'s but not containing '[x]' pattern.
        # This is just to deal with that small issue.
        else:
            return 0
    o_d = OrderedDict(sorted(d.items(), key=cmp))
    return heading, o_d
# This renames the keys of the 'd' dictionary to the proper names present in the attendees.
# It searches for the best match for the key in the 'attendees' list, and replaces the corresponding key.
# >>> d = {'mr. man ceo of company [1]' : ' This is talk a' ,
# ... 'ms. woman ceo of company [2]' : ' This is talk b'}
# >>> l = ['mr. man', 'ms. woman']
# >>> new_d = assign_attendee(d, l)
# new_d = {'mr. man': 'This is talk a', 'ms. woman': 'This is talk b'}
def assign_attendee(d, attendees):
    new_d = OrderedDict()
    for key, value in d.items():
        a = [a for a in attendees if a in key]
        if len(a) == 1:
            # strip leading and trailing whitespace, including '\n', from the text.
            new_d[a[0]] = value.strip()
        elif len(a) == 0:
            # no matching attendee: keep the original key and strip the text the same way.
            new_d[key] = value.strip()
    return new_d
if __name__ == '__main__':
    with open('input.txt', 'r') as input_file:
        lines = input_file.read()
    # regex pattern for matching headers of each section
    header_pattern = re.compile("^.*[^\n]", re.MULTILINE)
    # regex pattern for matching the sections that contain
    # the list of attendees (those that start with asterisks)
    ppl_pattern = re.compile(r"^(\s+\*)(.+)(\s.*)", re.MULTILINE)
    # regex pattern for matching sections with subsections in them.
    dash_pattern = re.compile("^-+", re.MULTILINE)
    ppl_d = dict()
    talks_d = dict()
    # Step 1. Divide the entire document into sections using the '=' divider
    sections = subdivide(lines, "^=+")
    header = []
    print(sections)
    # Step 2. Handle each section like a switch case
    for section in sections:
        # Handle headers
        if len(section.split('\n')) == 1: # a single line is likely to be only a header
            if header_pattern.match(section): # skips empty sections
                header = section
        # Handle attendees/authors
        elif ppl_pattern.match(section):
            ppls = ppl_pattern.findall(section)
            d = {key.strip(): value.strip() for (_, key, value) in ppls}
            ppl_d.update(d)
            # assuming that if the previous section was detected as a header, then this section will relate
            # to that header
            if header:
                talks_d.update({header: ppl_d})
        # Handle subsections
        elif dash_pattern.findall(section):
            heading, d = process_dashed_sections(section)
            talks_d.update({heading: d})
        # Else it's just some random text.
        else:
            # assuming that if the previous section was detected as a header, then this section will relate
            # to that header
            if header:
                talks_d.update({header: section})
    #pprint(talks_d)
    # To assign the talks material to the appropriate attendee/author. Still works if no match found.
    # Plain-text sections are left as they are.
    for key, value in talks_d.items():
        if isinstance(value, dict):
            talks_d[key] = assign_attendee(value, ppl_d.keys())
    # ordered dict does not pretty print using 'pprint'. So a small hack to make use of json output to pretty print.
    print(json.dumps(talks_d, indent=4))
Could you please confirm whether you only require the "Presentation" and "Question and Answer" sections?
Also, regarding the output, is it OK to dump CSV in a format similar to what you have "manually transformed"?
Updated the solution to work for every sample file you provided.
The output corresponds to cells "D:H" of the shared "Parsed-transcript" file.
#state = ["other", "head", "present", "qa", "speaker", "data"]
# codes : 0, 1, 2, 3, 4, 5
def writecell(out, data):
    out.write(data)
    out.write(",")
def readfile(fname, outname):
    initstate = 0
    f = open(fname, "r")
    out = open(outname, "w")
    head = ""
    head_written = 0
    quotes = 0
    had_speaker = 0
    for line in f:
        line = line.strip()
        if not line:
            continue
        # a line made up only of '=' starts a new heading
        if initstate in [0,5] and not any([s for s in line if "=" != s]):
            if initstate == 5:
                # close the quoted speech text of the previous row
                out.write('"')
                quotes = 0
                out.write("\n")
            initstate = 1
        # a line made up only of '-' starts a new speaker
        elif initstate in [0,5] and not any([s for s in line if "-" != s]):
            if initstate == 5:
                out.write('"')
                quotes = 0
                out.write("\n")
            initstate = 4
        elif initstate == 1 and line == "Presentation":
            initstate = 2
            head = "Presentation"
            head_written = 0
        elif initstate == 1 and line == "Questions and Answers":
            initstate = 3
            head = "Questions and Answers"
            head_written = 0
        elif initstate == 1 and not any([s for s in line if "=" != s]):
            initstate = 0
        elif initstate in [2, 3] and not any([s for s in line if ("=" != s and "-" != s)]):
            initstate = 4
        # speaker line, e.g. "name, firm [order]"
        elif initstate == 4 and '[' in line and ']' in line:
            comma = line.find(',')
            speech_st = line.find('[')
            speech_end = line.find(']')
            if speech_st == -1:
                initstate = 0
                continue
            if comma == -1:
                firm = ""
                speaker = line[:speech_st].strip()
            else:
                speaker = line[:comma].strip()
                firm = line[comma+1:speech_st].strip()
            head_written = 1
            if head_written:
                writecell(out, head)
                head_written = 0
            order = line[speech_st+1:speech_end]
            writecell(out, speaker)
            writecell(out, firm)
            writecell(out, order)
            had_speaker = 1
        # separator after the speaker line: the speech text follows
        elif initstate == 4 and not any([s for s in line if ("=" != s and "-" != s)]):
            if had_speaker:
                initstate = 5
                out.write('"')
                quotes = 1
            had_speaker = 0
        elif initstate == 5:
            line = line.replace('"', '""')
            out.write(line)
        elif initstate == 0:
            continue
        else:
            continue
    f.close()
    if quotes:
        out.write('"')
    out.close()
readfile("Sample1.txt", "out1.csv")
readfile("Sample2.txt", "out2.csv")
readfile("Sample3.txt", "out3.csv")
Details
This solution uses a state machine, which works as follows:
1. Detects whether a heading is present; if yes, writes it.
2. Detects speakers after the heading is written.
3. Writes the notes for that speaker.
4. Switches to the next speaker, and so on.
You can later process the CSV files as you want.
You can also populate the data in any format you want once the basic processing is done.
Edit:
Please replace the function "writecell" with the following:
def writecell(out, data):
    data = data.replace('"', '""')
    out.write('"')
    out.write(data)
    out.write('"')
    out.write(",")
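As an alternative to quoting cells by hand, the standard library's csv module applies the same rules (wrap the cell in double quotes, double any embedded quote). The following is a minimal sketch, not part of the answer above; the row contents are invented for illustration.

```python
import csv
import io

# Hypothetical row in the column order heading, speaker, firm, order, text.
rows = [
    ["Presentation", "Jane Doe", "Example Corp", "1",
     'She said "hello", then left.'],
]

buffer = io.StringIO()  # stands in for a file opened with newline=""
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original cells unchanged.
round_trip = list(csv.reader(io.StringIO(csv_text)))
```

With csv.QUOTE_ALL every cell is quoted, so commas and quotes inside the speech text cannot break the column layout.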
I have a number series contained in a string, and I want to remove everything but the number series. But the double quotes are giving me errors. Here are examples of the strings and a sample command that I have used. All I want is 127.60-02-15, 127.60-02-16, etc.
<span id="lblTaxMapNum">127.60-02-15</span>
<span id="lblTaxMapNum">127.60-02-16</span>
I have tried all sorts of methods (e.g., triple double quotes, single quotes, quotes with backslashes, etc.). Here is one inelegant way that still isn't working because it's still leaving ">:
text = text.replace("<span id=", "")
text = text.replace("\"lblTaxMapNum\"", "")
text = text.replace("</span>", "")
Here is what I am working with (more specific code). I'm retrieving the data from a CSV and just trying to clean it up.
text = open("outputA.csv", "r")
text = ''.join([i for i in text])
text = text.replace("<span id=", "")
text = text.replace("\"lblTaxMapNum\"", "")
text = text.replace("</span>", "")
outputB = open("outputB.csv", "w")
outputB.writelines(text)
outputB.close()
If you add a > in the second replace it is still not elegant but it works:
text = text.replace("<span id=", "")
text = text.replace("\"lblTaxMapNum\">", "")
text = text.replace("</span>", "")
Alternatively, you could use a regex:
import re
text = "<span id=\"lblTaxMapNum\">127.60-02-16</span>"
pattern = r".*>(\d+\.\d+-\d+-\d+)\D*" # the pattern in the parentheses matches the number
match = re.search(pattern, text) # this searches for the pattern in the text
print(match.group(1)) # this prints out only the number
You can use BeautifulSoup.
from bs4 import BeautifulSoup
strings = ['<span id="lblTaxMapNum">127.60-02-15</span>', '<span id="lblTaxMapNum">127.60-02-16</span>']
# Use BeautifulSoup to extract the text from the <span> tags
for string in strings:
    soup = BeautifulSoup(string, 'html.parser')
    number_series = soup.span.text
    print(number_series)
Output:
127.60-02-15
127.60-02-16
It's a little long; I hope my comments are readable.
path = r'c:\users\GH\desktop\test.csv'
with open(path, 'r') as f:
    text = f.read().strip()
# We remove the unwanted text by index range: everything from a '<'
# up to and including the next '>' is deleted.
# Cast the data to a list so that it can be modified by index.
text = list(text)
start = None # index of the most recent unmatched '<'
i = 0
# The list shrinks while we iterate over it, so we use an explicit
# index and re-check the length on every pass.
while i < len(text):
    if text[i] == '<':
        start = i
    elif text[i] == '>' and start is not None:
        del text[start : i + 1] # delete the tag, including both angle brackets
        i = start # continue with the first character after the deleted tag
        start = None
        continue
    i += 1
result = ''.join(text)
with open(path, 'w') as f:
    f.write(result)
with open(path, 'r') as f:
    print('the result ==>')
    print(f.read())
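For comparison, the same "delete everything between < and >" idea can be sketched with a single regular-expression substitution. This is an illustration only: strip_tags is a made-up helper name, and a regex like this is not a general HTML parser.

```python
import re

def strip_tags(text):
    """Remove every '<...>' run, keeping only the text between tags."""
    return re.sub(r"<[^>]*>", "", text)

samples = [
    '<span id="lblTaxMapNum">127.60-02-15</span>',
    '<span id="lblTaxMapNum">127.60-02-16</span>',
]
cleaned = [strip_tags(s) for s in samples]
print(cleaned)
```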
I’m stuck with the following problem. I’m trying to split sentences of multiple text files while keeping track of which paragraph it’s from and its position in the paragraph. What I mean with position is if it’s for example the first, second, or last sentence. In addition, I want to omit duplicates while keeping track of how many they are in the same paragraph and on the same position. All text files use one format and have the same number of paragraphs. What I’m aiming for is this:
SENTENCE    PARAGRAPH  POSITION  OCCURENCES
Sentence_A  1          1         1
Sentence_B  1          2         4
Sentence_C  2          1         1
I managed to count all occurrences using the following code.
from pathlib import Path
import pandas as pd
import re
my_dir_path = "data/text_files/"
sentences = []
for file in Path(my_dir_path).iterdir():
    with open(file, 'r', encoding='windows-1252') as file_open:
        string = file_open.read()
        string = re.sub(r"\n+", '\n', string)
        string = re.sub(r'(\d+)\.(\d+)', r"\1:\2", string)
        string = string.replace('\n', ' ')
        string = string.replace('. ', '.')
        string = string.split('.')
        for elem in string:
            if 'Verwachting vandaag en morgen:' in elem: # omit specific sentence
                continue
            else:
                sentences.append(elem)
df = pd.DataFrame(sentences, columns=['SENTENCES'])
df1 = df['SENTENCES'].value_counts().rename('SENTENCES').reset_index(name='OCCURENCES')
Adding the conditional part about the paragraph and its position is where I got stuck. I’m a novice at Python, any pointers are appreciated :]
Example text: https://pastebin.com/y30hzWCB. It always has 2 newlines between paragraphs. Some sentences are cut off in-between with a newline.
Thanks in advance!
from pathlib import Path
import pandas as pd
import re
from collections import defaultdict
my_dir_path = "data/text_files/"
# (sentence, paragraph number, position in paragraph) -> number of occurrences
occ = defaultdict(int)
for file in Path(my_dir_path).iterdir():
    with open(file, 'r', encoding='windows-1252') as file_open:
        string = file_open.read()
    string = re.sub(r'(\d+)\.(\d+)', r"\1:\2", string)
    # paragraphs are separated by blank lines
    for parano, para in enumerate(re.split(r"\n{2,}", string.strip()), start=1):
        para = para.replace('\n', ' ') # rejoin sentences cut off by a newline
        pos = 0
        for elem in para.split('.'):
            elem = elem.strip()
            if not elem:
                continue
            pos += 1
            if 'Verwachting vandaag en morgen:' in elem: # omit specific sentence
                continue
            occ[(elem, parano, pos)] += 1
data = {"SENTENCE": [], "PARAGRAPH": [], "POSITION": [], "OCCURENCES": []}
for (elem, parano, pos), count in occ.items():
    data["SENTENCE"].append(elem)
    data["PARAGRAPH"].append(parano)
    data["POSITION"].append(pos)
    data["OCCURENCES"].append(count)
df = pd.DataFrame(data)
What I did:
created a dictionary data, which is used directly to create the required dataframe
kept count of the paragraph number parano by splitting each file on blank lines
kept count of the position by increasing pos each time a valid sentence occurred
counted the occurrences of each (sentence, paragraph, position) combination in occ
I have a string that holds data. And I want everything in between ({ and })
"({Simple Data})"
Should return "Simple Data"
Or regex:
import re
s = '({Simple Data})'
print(re.search(r'\({([^})]+)', s).group(1))
Output:
Simple Data
You could try the following:
^\({(.*)}\)$
Group 1 will contain Simple Data.
See an example on regexr.
If the brackets are always positioned at the beginning and the end of the string, then you can do this:
l = "({Simple Data})"
print(l[2:-2])
Which results in:
Simple Data
In Python you can slice a string with the [] operator. Here the slice starts at the third character (index 2) and stops before the second-to-last character (index -2), which is not included in the result.
You could try this regex (?s)\(\{(.*?)\}\)
which simply captures the contents between the delimiters.
Beware though, this doesn't account for nesting.
If nesting is a concern, the best you can do with the standard Python re engine
is to get the innermost nest only, using this regex:
\(\{((?:(?!\(\{|\}\)).)*)\}\)
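As a rough illustration of how the two patterns behave in Python, here is a small sketch. The innermost-only pattern is written with its negative lookahead closed before the dot, and the test strings are invented.

```python
import re

# Lazy match between the delimiters; (?s) lets '.' match newlines too.
flat = re.compile(r"(?s)\(\{(.*?)\}\)")
# Innermost nest only: each character may not start a '({' or a '})'.
innermost = re.compile(r"\(\{((?:(?!\(\{|\}\)).)*)\}\)")

simple = "({Simple Data})"
nested = "({Parent ({Nested Data}) Suffix})"

flat_simple = flat.findall(simple)
flat_nested = flat.findall(nested)       # stops at the first '})'
inner_nested = innermost.findall(nested)

print(flat_simple)
print(flat_nested)
print(inner_nested)
```

On the nested string the lazy pattern returns a fragment that still contains an opening '({', which is the nesting problem mentioned above; the innermost pattern returns only the inner payload.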
Here is a tokenizer I designed that is aimed at nested data. The OP may want to check it out.
import collections
import re
Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])
def tokenize(code):
    token_specification = [
        ('DATA', r'[ \t]*[\w]+[\w \t]*'),
        ('SKIP', r'[ \t\f\v]+'),
        ('NEWLINE', r'\n|\r\n'),
        ('BOUND_L', r'\(\{'),
        ('BOUND_R', r'\}\)'),
        ('MISMATCH', r'.'),
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    line_num = 1
    line_start = 0
    lines = code.splitlines()
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group(kind)
        if kind == 'NEWLINE':
            line_start = mo.end()
            line_num += 1
        elif kind == 'SKIP':
            pass
        else:
            column = mo.start() - line_start
            yield Token(kind, value, line_num, column)
statements = '''
({Simple Data})
({
Parent Data Prefix
({Nested Data (Yes)})
Parent Data Suffix
})
'''
queue = collections.deque()
for token in tokenize(statements):
    if token.typ == 'DATA' or token.typ == 'MISMATCH':
        queue.append(token.value)
    elif token.typ == 'BOUND_L' or token.typ == 'BOUND_R':
        print(''.join(queue))
        queue.clear()
Output of this code should be:
Simple Data
Parent Data Prefix
Nested Data (Yes)
Parent Data Suffix
I have a large list of sub-strings that I want to search through and find if two particular sub-strings can be found in a row. The logic is intended to look for the first sequence, then if it is found, look at the second sub-string and return all the matches (based off the first 15 characters of the 16 character sequence). If the first sequence can not be found, it just looks for the second sequence only, and finally, if that can not be found, defaults to zero. The matches are then appended to a list, which is processed further. The current code used is as follows:
dataA = ['0100101010001000',
'1001010100010001',
'0010101000100010',
'0101010001000110',
'1010100010001110',
'0101000100011100',
'1010001000111010',
'0100010001110100',
'1000100011101000',
'0001000111010000']
A_vein_1 = [0,1,0,0,1,0,1,0,1,0,0,0,1,0,0,0]
joined_A_Search_1 = ''.join(map(str,A_vein_1))
print 'search 1', joined_A_Search_1
A_vein_2 = [1,0,0,1,0,1,0,1,0,0,0,1,0,0,0]
joined_A_Search_2 = ''.join(map(str,A_vein_2))
print 'search 2', joined_A_Search_2
match_A = [] #empty list to append closest match to
#Match search algorithm
for i,text in enumerate(dataA):
    if joined_A_Search_1 == text:
        if joined_A_Search_2 == dataA[i+1][:-1]:
            print 'logic stream 1'
            match_A.append(dataA[i+1][-1])
    if joined_A_Search_1 != text:
        if joined_A_Search_2 == text[:-1]:
            print 'logic stream 2'
            #print 'match', text[:-1]
            match_A.append(text[-1])
print ' A matches', match_A
try:
    filter_A = max(set(match_A), key=match_A.count)
except:
    filter_A = 0
    print 'no match A'
filter_A = int(filter_A)
print 'Output', filter_A
The issue is that I get a return of both logic stream 1 and logic stream 2, when I actually want it to be a strict one or the other, in this case only logic stream 1. An example of the output looks like this:
search 1 0100101010001000
search 2 100101010001000
logic stream 1
logic stream 2
logic stream 1
logic stream 2
logic stream 2
(Note: The list has been shortened, and the data inputs have been substituted in directly, as well as the print outs for the purposes of this post and error tracking)
Input :
dataA = ['0100101010001000',
'1001010100010001',
'0010101000100010',
'0101010001000110',
'1010100010001110',
'0101000100011100',
'1010001000111010',
'0100010001110100',
'1000100011101000',
'0001000111010000']
A_vein_1 = [0,1,0,0,1,0,1,0,1,0,0,0,1,0,0,0]
A_vein_2 = [1,0,0,1,0,1,0,1,0,0,0,1,0,0,0]
code :
av1_str = "".join(map(str,A_vein_1))
av2_str = "".join(map(str,A_vein_2))
dataB = [av1_str, av2_str]
print [(y, dataA.index(x)) for x in dataA for y in dataB if y in x]
Output :
[('0100101010001000', 0), ('100101010001000', 0), ('100101010001000', 1)]
Your code confuses me. But I think I understand your issue:
#!/usr/bin/env python
dataA = ['0100101010001000',
'1001010100010001',
'0010101000100010',
'0101010001000110',
'1010100010001110',
'0101000100011100',
'1010001000111010',
'0100010001110100',
'1000100011101000',
'0001000111010000']
A_vein_1 = [0,1,0,0,1,0,1,0,1,0,0,0,1,0,0,0]
A_vein_2 = [1,0,0,1,0,1,0,1,0,0,0,1,0,0,0]
av1_str = "".join(map(str,A_vein_1))
av2_str = "".join(map(str,A_vein_2))
for i, d in enumerate(dataA):
    if av1_str in d:
        print av1_str, 'found in line', i
    elif av2_str in d:
        print av2_str, 'found in line', i
This gives me :
jcg#jcg:~/code/python/stack_overflow$ python find_str.py
0100101010001000 found in line 0
100101010001000 found in line 0
100101010001000 found in line 1
After edit to elif:
jcg#jcg:~/code/python/stack_overflow$ python find_str.py
0100101010001000 found in line 0
100101010001000 found in line 1
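The strict "one or the other" behaviour asked for in the question can be sketched in Python 3 by collecting the stream 1 matches first and falling back to stream 2 only when there are none. The function name closest_match is made up for this illustration, and the data list is a shortened copy of the one in the question.

```python
from collections import Counter

def closest_match(data, search_1, search_2):
    """Return the most common final digit, preferring logic stream 1."""
    # Stream 1: a row equal to search_1, followed by a row whose first
    # 15 characters equal search_2.
    matches = [data[i + 1][-1]
               for i in range(len(data) - 1)
               if data[i] == search_1 and data[i + 1][:-1] == search_2]
    if not matches:
        # Stream 2: used only when stream 1 found nothing.
        matches = [row[-1] for row in data if row[:-1] == search_2]
    if not matches:
        return 0  # default when neither stream matches
    return int(Counter(matches).most_common(1)[0][0])

dataA = ['0100101010001000',
         '1001010100010001',
         '0010101000100010']
result = closest_match(dataA, '0100101010001000', '100101010001000')
print(result)
```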
I am having difficulty extracting data from a UTF-8 file that contains Chinese characters.
The file is actually the CEDICT (Chinese-English dictionary) and looks like this:
賓 宾 [bin1] /visitor/guest/object (in grammar)/
賓主 宾主 [bin1 zhu3] /host and guest/
賓利 宾利 [Bin1 li4] /Bentley/
賓士 宾士 [Bin1 shi4] /Taiwan equivalent of 奔馳|奔驰[Ben1 chi2]/
賓夕法尼亞 宾夕法尼亚 [Bin1 xi1 fa3 ni2 ya4] /Pennsylvania/
賓夕法尼亞大學 宾夕法尼亚大学 [Bin1 xi1 fa3 ni2 ya4 Da4 xue2] /University of Pennsylvania/
賓夕法尼亞州 宾夕法尼亚州 [Bin1 xi1 fa3 ni2 ya4 zhou1] /Pennsylvania/
So far, I have managed to get the first two fields using split(), but I can't work out how to extract the other two fields (for the second line, say, "bin1 zhu3" and "host and guest"). I have been trying to use a regex, but it doesn't work, for a reason I don't understand.
#!/bin/python
#coding=utf-8
import re
class REMatcher(object):
    def __init__(self, matchstring):
        self.matchstring = matchstring
    def match(self,regexp):
        self.rematch = re.match(regexp, self.matchstring)
        return bool(self.rematch)
    def group(self,i):
        return self.rematch.group(i)
def look(character):
    myFile = open("/home/quentin/cedict_ts.u8","r")
    for line in myFile:
        line = line.rstrip()
        elements = line.split(" ")
        try:
            if line != "" and elements[1] == character:
                myFile.close()
                return line
        except:
            myFile.close()
            break
    myFile.close()
    return "Aucun résultat :("
translation = look("賓主") # translation contains one line of the file
elements = translation.split()
traditionnal = elements[0]
simplified = elements[1]
print "Traditionnal:" + traditionnal
print "Simplified:" + simplified
m = REMatcher(translation)
tr = ""
if m.match(r"\[(\w+)\]"):
    tr = m.group(1)
print "Pronouciation:" + tr
Any help appreciated.
This builds a dictionary to look up translations by either simplified or traditional characters and works in both Python 2.7 and 3.3:
# coding: utf8
import re
import codecs
# Process the whole file decoding from UTF-8 to Unicode
with codecs.open('cedict_ts.u8',encoding='utf8') as datafile:
    D = {}
    for line in datafile:
        # Skip comment lines
        if line.startswith('#'):
            continue
        trad,simp,pinyin,trans = re.match(r'(.*?) (.*?) \[(.*?)\] /(.*)/',line).groups()
        D[trad] = (simp,pinyin,trans)
        D[simp] = (trad,pinyin,trans)
Output (Python 3.3):
>>> D['马克']
('馬克', 'Ma3 ke4', 'Mark (name)')
>>> D['一路顺风']
('一路順風', 'yi1 lu4 shun4 feng1', 'to have a pleasant journey (idiom)')
>>> D['馬克']
('马克', 'Ma3 ke4', 'Mark (name)')
Output (Python 2.7, you have to print strings to see non-ASCII characters):
>>> D[u'马克']
(u'\u99ac\u514b', u'Ma3 ke4', u'Mark (name)')
>>> print D[u'马克'][0]
馬克
I would continue to use splits instead of regular expressions, with the maximum split number given. It depends on how consistent the format of the input file is.
elements = translation.split(' ',2)
traditionnal = elements[0]
simplified = elements[1]
rest = elements[2]
print "Traditionnal:" + traditionnal
print "Simplified:" + simplified
elems = rest.split(']')
tr = elems[0].strip('[')
print "Pronouciation:" + tr
Output:
Traditionnal:賓主
Simplified:宾主
Pronouciation:bin1 zhu3
EDIT: To split the last field into a list, split on the /:
translations = elems[1].strip().strip('/').split('/')
#strip the spaces, then the first and last slash,
#then split on the slashes
Output (for the first line of input):
['visitor', 'guest', 'object (in grammar)']
Heh, I've done this exact same thing before. Basically you just need to use a regex with groupings. Unfortunately, I don't know Python regex super well (I did the same thing using C#), but you should really do something like this:
matcher = r"(\b\w+\b) (\b\w+\b) \[(.*?)\] /(.*?)/"
Basically, you match the entire line using one expression, but then you use ( ) to separate each item into a regex group. Then you just need to read the groups and voila!
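The grouping idea can be sketched in Python 3 as follows. This is an assumption-laden illustration rather than a definitive parser: parse_cedict_line and the dictionary keys are made-up names, and the pattern assumes every entry has the form "traditional simplified [pinyin] /definition/.../".

```python
import re

# One group per field; the last group spans all of the definitions.
CEDICT_LINE = re.compile(r"(\S+) (\S+) \[(.*?)\] /(.*)/\s*$")

def parse_cedict_line(line):
    """Return the fields of one entry as a dict, or None if it does not match."""
    m = CEDICT_LINE.match(line)
    if m is None:
        return None
    traditional, simplified, pinyin, definitions = m.groups()
    return {
        "traditional": traditional,
        "simplified": simplified,
        "pinyin": pinyin,
        "definitions": definitions.split("/"),
    }

entry = parse_cedict_line("賓 宾 [bin1] /visitor/guest/object (in grammar)/")
print(entry)
```

Lines that do not fit the pattern, such as comment lines starting with '#', give None and can simply be skipped.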