I would like to display a list containing information from some text files using Python.
What I want to display :
[host_name, hardware_ethernet_value, fixed_address_value]
An example of a file (random examples):
host name-random-text.etc {
hardware ethernet 00:02:99:aa:xx:yc;
fixed-address 1.3.0.155;
}
host name-another-again.etc {
hardware ethernet 00:02:99:aa:xx:yc;
fixed-address 3.5.0.115;
}
Someone helped me to write a code for that but it doesn't work anymore, though I know where the problem is coming from.
So the code is as follows :
#!/usr/bin/env python
#import modules
import pprint
import re
#open a file
filedesc = open("DATA/fixed.10.3", "r")
#using regex expressions to get the different informations
SCAN = {
'host' : r"^host (\S+) {",
'hardware' : r"hardware ethernet (\S+);",
'fixed-adress' : r"fixed adress (\S+);"
}
item = []
for key in SCAN:
#print(key)
regex = re.compile(SCAN[key])
#print(regex)
for line in filedesc:
#print(line)
match = regex.search(line)
#print(match)
#match = re.search(regex, line)
#if match is not None:
#print(match.group(1))
if match is not None:
#print(match.group(1))
if match.group(1) == key:
print(line)
item += [match.group(2)]
break
#print the final dictionnaries
pp=print(item)
#make sure to close the file after using it with file.close()
What should be expected :
match.group(1) = host
match.group(2) = name-random-text.etc
But what I have is match.group(1) = name-random-text.etc so match.group(2) = nothing here. This is why the condition match.group(1) == key never works, because match.group(1) never takes the values ['host', 'hardware ethernet', 'fixed-address'].
Your reg exp matches only 1 group.
If you want match 2 groups and group 1 should be equal SCAN's key, you need change SCAN like this:
SCAN = {
'host' : r"^(host) (\S+) {",
'hardware' : r"(hardware) ethernet (\S+);",
'fixed address' : r"(fixed address) (\S+);"
}
Very simple, but working decision:
#!/usr/bin/env python
#import modules
import pprint
import re
#open a file
file_lines = """
host name-random-text.etc {
hardware ethernet 00:02:99:aa:xx:yc;
fixed-address 1.3.0.155;
}
host name-another-again.etc {
hardware ethernet 00:02:99:aa:xx:yc;
fixed-address 3.5.0.115;
}
""".split('\n')
SCAN = {
'host': r"^host (\S+) {",
'hardware': r"hardware ethernet (\S+);",
'fixed_address': r"fixed-address (\S+);",
'end_item': r"^\s*}$\s*",
}
# Compile only once, if not want repeat re.compile above
for key, expr in SCAN.iteritems():
SCAN[key] = re.compile(expr)
items = []
item = {}
for line in file_lines:
for key, expr in SCAN.iteritems():
m = expr.search(line)
if not m:
continue
if key == 'end_item':
items.append(item)
print "Current item", [item.get(x) for x in ['host', 'hardware', 'fixed_address']]
item = {}
else:
item[key] = m.group(1)
print "Full list of items"
pprint.pprint(items)
Output:
Current item ['name-random-text.etc', '00:02:99:aa:xx:yc', '1.3.0.155']
Current item ['name-another-again.etc', '00:02:99:aa:xx:yc', '3.5.0.115']
Full list of items
[{'fixed_address': '1.3.0.155',
'hardware': '00:02:99:aa:xx:yc',
'host': 'name-random-text.etc'},
{'fixed_address': '3.5.0.115',
'hardware': '00:02:99:aa:xx:yc',
'host': 'name-another-again.etc'}]
Related
I am trying to scrape data from a word document available at:-
https://dl.dropbox.com/s/pj82qrctzkw9137/HE%20Distributors.docx
I need to scrape the Name, Address, City, State, and Email ID. I am able to scrape the E-mail using the below code.
import docx
content = docx.Document('HE Distributors.docx')
location = []
for i in range(len(content.paragraphs)):
stat = content.paragraphs[i].text
if 'Email' in stat:
location.append(i)
for i in location:
print(content.paragraphs[i].text)
I tried to use the steps mentioned:
How to read data from .docx file in python pandas?
I need to convert this into a data frame with all the columns mentioned above.
Still facing issues with the same.
There are some inconsistencies in the document - phone numbers starting with Tel: sometimes, and Tel.: other times, and even Te: once, and I noticed one of the emails is just in the last line for that distributor without the Email: prefix, and the State isn't always in the last line.... Still, for the most part, most of the data can be extracted with regex and/or splits.
The distributors are separated by empty lines, and the names are in a different color - so I defined this function to get the font color of any paragraph from its xml:
# from bs4 import BeautifulSoup
def getParaColor(para):
try:
return BeautifulSoup(
para.paragraph_format.element.xml, 'xml'
).find('color').get('w:val')
except:
return ''
The try...except hasn't been necessary yet, but just in case...
(The xml is actually also helpful for double-checking that .text hasn't missed anything - in my case, I noticed that the email for Shri Adhya Educational Books wasn't getting extracted.)
Then, you can process the paragraphs from docx.Document with a function like:
# import re
def splitParas(paras):
ptc = [(
p.text, getParaColor(p), p.paragraph_format.element.xml
) for p in paras]
curSectn = 'UNKNOWN'
splitBlox = [{}]
for pt, pc, px in ptc:
# double-check for missing text
xmlText = BeautifulSoup(px, 'xml').text
xmlText = ' '.join([s for s in xmlText.split() if s != ''])
if len(xmlText) > len(pt): pt = xmlText
# initiate
if not pt:
if splitBlox[-1] != {}:
splitBlox.append({})
continue
if pc == '20752E':
curSectn = pt.strip()
continue
if splitBlox[-1] == {}:
splitBlox[-1]['section'] = curSectn
splitBlox[-1]['raw'] = []
splitBlox[-1]['Name'] = []
splitBlox[-1]['address_raw'] = []
# collect
splitBlox[-1]['raw'].append(pt)
if pc == 'D12229':
splitBlox[-1]['Name'].append(pt)
elif re.search("^Te.*:.*", pt):
splitBlox[-1]['tel_raw'] = re.sub("^Te.*:", '', pt).strip()
elif re.search("^Mob.*:.*", pt):
splitBlox[-1]['mobile_raw'] = re.sub("^Mob.*:", '', pt).strip()
elif pt.startswith('Email:') or re.search(".*[#].*[.].*", pt):
splitBlox[-1]['Email'] = pt.replace('Email:', '').strip()
else:
splitBlox[-1]['address_raw'].append(pt)
# some cleanup
if splitBlox[-1] == {}: splitBlox = splitBlox[:-1]
for i in range(len(splitBlox)):
addrsParas = splitBlox[i]['address_raw'] # for later
# join lists into strings
splitBlox[i]['Name'] = ' '.join(splitBlox[i]['Name'])
for k in ['raw', 'address_raw']:
splitBlox[i][k] = '\n'.join(splitBlox[i][k])
# search address for City, State and PostCode
apLast = addrsParas[-1].split(',')[-1]
maybeCity = [ap for ap in addrsParas if '–' in ap]
if '–' not in apLast:
splitBlox[i]['State'] = apLast.strip()
if maybeCity:
maybePIN = maybeCity[-1].split('–')[-1].split(',')[0]
maybeCity = maybeCity[-1].split('–')[0].split(',')[-1]
splitBlox[i]['City'] = maybeCity.strip()
splitBlox[i]['PostCode'] = maybePIN.strip()
# add mobile to tel
if 'mobile_raw' in splitBlox[i]:
if 'tel_raw' not in splitBlox[i]:
splitBlox[i]['tel_raw'] = splitBlox[i]['mobile_raw']
else:
splitBlox[i]['tel_raw'] += (', ' + splitBlox[i]['mobile_raw'])
del splitBlox[i]['mobile_raw']
# split tel [as needed]
if 'tel_raw' in splitBlox[i]:
tel_i = [t.strip() for t in splitBlox[i]['tel_raw'].split(',')]
telNum = []
for t in range(len(tel_i)):
if '/' in tel_i[t]:
tns = [t.strip() for t in tel_i[t].split('/')]
tel1 = tns[0]
telNum.append(tel1)
for tn in tns[1:]:
telNum.append(tel1[:-1*len(tn)]+tn)
else:
telNum.append(tel_i[t])
splitBlox[i]['Tel_1'] = telNum[0]
splitBlox[i]['Tel'] = telNum[0] if len(telNum) == 1 else telNum
return splitBlox
(Since I was getting font color anyway, I decided to add another
column called "section" to put East/West/etc in. And I added "PostCode" too, since it seems to be on the other side of "City"...)
Since "raw" is saved, any other value can be double checked manually at least.
The function combines "Mobile" into "Tel" even though they're extracted with separate regex.
I'd say "Tel_1" is fairly reliable, but some of the inconsistent patterns mean that other numbers in "Tel" might come out incorrect if they were separated with '/'.
Also, "Tel" is either a string or a list of strings depending on how many numbers there were in "tel_raw".
After this, you can just view as DataFrame with:
#import docx
#import pandas
content = docx.Document('HE Distributors.docx')
# pandas.DataFrame(splitParas(content.paragraphs)) # <--all Columns
pandas.DataFrame(splitParas(content.paragraphs))[[
'section', 'Name', 'address_raw', 'City',
'PostCode', 'State', 'Email', 'Tel_1', 'tel_raw'
]]
I have a set of data that I would like to extract from a txt file and stored in a specific format. The data is is currently in a txt file like so:
set firewall family inet filter INBOUND term TEST from source-address 1.1.1.1/32
set firewall family inet filter INBOUND term TEST from destination-prefix-list test-list
set firewall family inet filter INBOUND term TEST from protocol udp
set firewall family inet filter INBOUND term TEST from destination-port 53
set firewall family inet filter INBOUND term TEST then accept
set firewall family inet filter PROD term LAN from source-address 4.4.4.4/32
set firewall family inet filter PROD term LAN from source-address 5.5.5.5/32
set firewall family inet filter PROD term LAN from protocol tcp
set firewall family inet filter PROD term LAN from destination-port 443
set firewall family inet filter PROD term LAN then deny
I would like the data to be structured to where each rule has their respective options placed into dictionary and appended to a list. For example:
Expected Output
[{'Filter': 'INBOUND', 'Term': 'TEST', 'SourceIP': '1.1.1.1/32', 'DestinationList': 'test-list', 'Protocol': 'udp', 'DestinationPort': '53', 'Action': 'accept},
{'Filter': 'PROD', 'Term': 'LAN', 'SourceIP': ['4.4.4.4/32','5.5.5.5/32'], 'Protocol': 'tcp', 'DestinationPort': '443', 'Action': 'deny'}]
As you can see there may be instances where a certain trait does not exist for a rule. I would also have to add multiple IP addresses as a value. I am currently using Regex to match the items in the txt file. My thought was to iterate through each line in the file, find any matches and add them as a key-value pair to a dictionary.
Once I get an "accept" or "deny", that should signal the end of the rule and I will append the dictionary to the list, clear the dictionary and start the process with the next rule. However this does not seem to be working as intended. My Regex seems fine but I cant seem to figure out the logic when processing each line, adding multiple values to a value list, and adding values to the dictionary. Here is my code below
import re
data_file = "sample_data.txt"
##### REGEX PATTERNS #####
filter_re = r'(?<=filter\s)(.*)(?=\sterm.)'
term_re = r'(?<=term\s)(.*)(?=\sfrom|\sthen)'
protocol_re = r'(?<=protocol\s)(.*)'
dest_port_re = r'(?<=destination-port\s)(.*)'
source_port_re = r'(?<=from\ssource-port\s)(.*)'
prefix_source_re = r'(?<=from\ssource-prefix-list\s)(.*)'
prefix_dest_re = r'(?<=from\sdestination-prefix-list\s)(.*)'
source_addr_re = r'(?<=source-address\s)(.*)'
dest_addr_re = r'(?<=destination-address\s)(.*)'
action_re = r'(?<=then\s)(deny|accept)'
pattern_list = [filter_re, term_re, source_addr_re, prefix_source_re, source_port_re, dest_addr_re, prefix_dest_re, dest_port_re, protocol_re, action_re]
pattern_headers = ["Filter", "Term", "Source_Address", "Source_Prefix_List", "Source_Port", "Destination_Address," "Destination_Prefix_List", "Destination_Port", "Protocol", "Action"]
final_list = []
def open_file(file):
rule_dict = {}
with open(file, 'r') as f:
line = f.readline()
while line:
line = f.readline().strip()
for header, pattern in zip(pattern_headers,pattern_list):
match = re.findall(pattern, line)
if len(match) != 0:
if header != 'accept' or header != 'deny':
rule_dict[header] = match[0]
else:
rule_dict[header] = match[0]
final.append(rule_dict)
rule_dict = {}
print(rule_dict)
print(final_list)
The final list is empty and the rule_dict only contains the final rule from the text file not the both of the rulesets. Any guidance would be greatly appreciated.
There are few little mistakes in your code:
in your while loop f.readline() needs to be at the end, otherwise you already begin in line 2 (readline called twice before doing anything)
final_list has to be defined in your function and also used
correctly then (instead of only "final"
if header != 'accept' or header != 'deny':: here needs to be an and. One of them is always True, so the else part never gets executed.
you need to check the match for accept|deny, not the header
for example in Source_IP you want to have a list with all IP's you find. The way you do it, the value would always be updated and only the last found IP will be in your final_list
def open_file(file):
final_list = []
rule_dict = {}
with open(file) as f:
line = f.readline()
while line:
line = line.strip()
for header, pattern in zip(pattern_headers, pattern_list):
match = re.findall(pattern, line)
if len(match) != 0:
if (match[0] != "accept") and (match[0] != "deny"):
rule_dict.setdefault(header, set()).add(match[0])
else:
rule_dict.setdefault(header, set()).add(match[0])
#adjust values of dict to list (if multiple values) or just a value (instead of set) before appending to list
final_list.append({k:(list(v) if len(v)>1 else v.pop()) for k,v in rule_dict.items()})
rule_dict = {}
line = f.readline()
print(f"{rule_dict=}")
print(f"{final_list=}")
open_file(data_file)
Output:
rule_dict={}
final_list=[
{
'Filter': 'INBOUND',
'Term': 'TEST',
'Source_Address': '1.1.1.1/32',
'Destination_Prefix_List': 'test-list',
'Protocol': 'udp', 'Destination_Port': '53',
'Action': 'accept'
},
{
'Filter': 'PROD',
'Term': 'LAN',
'Source_Address': ['5.5.5.5/32', '4.4.4.4/32'],
'Protocol': 'tcp',
'Destination_Port': '443',
'Action': 'deny'
}
]
There are few things that i have change in your code:
When "accept" and "deny" found in action then append final_dict in final_list and empty final_dict
allow to add more than one SourceIP- for that create list in value of SourceIP when more than SourceIP get
import re
data_file = "/home/hiraltalsaniya/Documents/Hiral/test"
filter_re = r'(?<=filter\s)(.*)(?=\sterm.)'
term_re = r'(?<=term\s)(.*)(?=\sfrom|\sthen)'
protocol_re = r'(?<=protocol\s)(.*)'
dest_port_re = r'(?<=destination-port\s)(.*)'
source_port_re = r'(?<=from\ssource-port\s)(.*)'
prefix_source_re = r'(?<=from\ssource-prefix-list\s)(.*)'
prefix_dest_re = r'(?<=from\sdestination-prefix-list\s)(.*)'
source_addr_re = r'(?<=source-address\s)(.*)'
dest_addr_re = r'(?<=destination-address\s)(.*)'
action_re = r'(?<=then\s)(deny|accept)'
pattern_list = [filter_re, term_re, source_addr_re, prefix_source_re, source_port_re, dest_addr_re, prefix_dest_re,
dest_port_re, protocol_re, action_re]
pattern_headers = ["Filter", "Term", "SourceIP", "Source_Prefix_List", "Source_Port", "Destination_Address",
"DestinationList", "Destination_Port", "Protocol", "Action"]
def open_file(file):
final_dict: dict = dict()
final_list: list = list()
with open(file) as f:
for line in f:
for header, pattern in zip(pattern_headers, pattern_list):
match = re.search(pattern, line)
if match:
# check if accept or deny it means the end of the rule then empty dictionary
if str(match.group()) == "accept" or match.group() == "deny":
final_list.append(final_dict)
final_dict: dict = dict()
# if more than one SourceIP then create list of SourceIP
elif header == "SourceIP" and header in final_dict.keys():
final_dict[header] = [final_dict[header]]
final_dict.setdefault(header, final_dict[header]).append(match.group())
else:
final_dict[header] = match.group()
print("final_list=", final_list)
open_file(data_file)
Output:
final_list= [{'Filter': 'INBOUND',
'Term': 'TEST',
'SourceIP': '1.1.1.1/32',
'DestinationList': 'test-list',
'Protocol': 'udp',
'Destination_Port': '53'
},
{'Filter': 'PROD',
'Term': 'LAN',
'SourceIP': ['4.4.4.4/32', '5.5.5.5/32'],
'Protocol': 'tcp',
'Destination_Port': '443'
}]
I have a file containing a text like this:
loadbalancer {
upstream application1 {
server 127.0.0.1:8082;
server 127.0.0.1:8083;
server 127.0.0.1:8084;
}
upstream application2 {
server 127.0.0.1:8092;
server 127.0.0.1:8093;
server 127.0.0.1:8094;
}
}
Does anyone know, how could I extract variables like below:
appList=["application1","application2"]
ServerOfapp1=["127.0.0.1:8082","127.0.0.1:8083","127.0.0.1:8084"]
ServerOfapp2=["127.0.0.1:8092","127.0.0.1:8093","127.0.0.1:8094"]
.
.
.
and so on
If the lines you want always start with upstream and server this should work:
app_dic = {}
with open('file.txt','r') as f:
for line in f:
if line.startswith('upstream'):
app_i = line.split()[1]
server_of_app_i = []
for line in f:
if not line.startswith('server'):
break
server_of_app_i.append(line.split()[1][:-1])
app_dic[app_i] = server_of_app_i
app_dic should then be a dictionary of lists:
{'application1': ['127.0.0.1:8082', '127.0.0.1:8083', '127.0.0.1:8084'],
'application2': ['127.0.0.1:8092', '127.0.0.1:8093', '127.0.0.1:8094']}
EDIT
If the input file does not contain any newline character, as long as the file is not too large you could write it to a list and iterate over it:
app_dic = {}
with open('file.txt','r') as f:
txt_iter = iter(f.read().split()) #iterator of list
for word in txt_iter:
if word == 'upstream':
app_i = next(txt_iter)
server_of_app_i=[]
for word in txt_iter:
if word == 'server':
server_of_app_i.append(next(txt_iter)[:-1])
elif word == '}':
break
app_dic[app_i] = server_of_app_i
This is more ugly as one has to search for the closing curly bracket to break. If it gets any more complicated, regex should be used.
If you are able to use the newer regex module by Matthew Barnett, you can use the following solution, see an additional demo on regex101.com:
import regex as re
rx = re.compile(r"""
(?:(?P<application>application\d)\s{\n| # "application" + digit + { + newline
(?!\A)\G\n) # assert that the next match starts here
server\s # match "server"
(?P<server>[\d.:]+); # followed by digits, . and :
""", re.VERBOSE)
string = """
loadbalancer {
upstream application1 {
server 127.0.0.1:8082;
server 127.0.0.1:8083;
server 127.0.0.1:8084;
}
upstream application2 {
server 127.0.0.1:8092;
server 127.0.0.1:8093;
server 127.0.0.1:8094;
}
}
"""
result = {}
for match in rx.finditer(string):
if match.group('application'):
current = match.group('application')
result[current] = list()
if current:
result[current].append(match.group('server'))
print result
# {'application2': ['127.0.0.1:8092', '127.0.0.1:8093', '127.0.0.1:8094'], 'application1': ['127.0.0.1:8082', '127.0.0.1:8083', '127.0.0.1:8084']}
This makes use of the \G modifier, named capture groups and some programming logic.
This is the basic method:
# each of your objects here
objText = "xyz xcyz 244.233.233.2:123"
listOfAll = re.findall(r"/\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?):[0-9]{1,5}/g", objText)
for eachMatch in listOfAll:
print "Here's one!" % eachMatch
Obviously that's a bit rough around the edges, but it will perform a full-scale regex search of whatever string it's given. Probably a better solution would be to pass it the objects themselves, but for now I'm not sure what you would have as raw input. I'll try to improve on the regex, though.
I believe this as well can be solved with re:
>>> import re
>>> from collections import defaultdict
>>>
>>> APP = r'\b(?P<APP>application\d+)\b'
>>> IP = r'server\s+(?P<IP>[\d\.:]+);'
>>>
>>> pat = re.compile('|'.join([APP, IP]))
>>>
>>>
>>> scan = pat.scanner(s)
>>> d = defaultdict(list)
>>>
>>> for m in iter(scan.search, None):
group = m.lastgroup
if group == 'APP':
keygroup = m.group(group)
continue
else:
d[keygroup].append(m.group(group))
>>> d
defaultdict(<class 'list'>, {'application1': ['127.0.0.1:8082', '127.0.0.1:8083', '127.0.0.1:8084'], 'application2': ['127.0.0.1:8092', '127.0.0.1:8093', '127.0.0.1:8094']})
Or similarly with re.finditer method and without pat.scanner:
>>> for m in re.finditer(pat, s):
group = m.lastgroup
if group == 'APP':
keygroup = m.group(group)
continue
else:
d[keygroup].append(m.group(group))
>>> d
defaultdict(<class 'list'>, {'application1': ['127.0.0.1:8082', '127.0.0.1:8083', '127.0.0.1:8084'], 'application2': ['127.0.0.1:8092', '127.0.0.1:8093', '127.0.0.1:8094']})
There are two named groups in my pattern: myFlag and id, I want to add one more myFlag immediately before group id.
Here is my current code:
# i'm using Python 3.4.2
import re
import os
contents = b'''
xdlg::xdlg(x_app* pApp, CWnd* pParent)
: customized_dlg((UINT)0, pParent, pApp)
, m_pReaderApp(pApp)
, m_info(pApp)
{
}
'''
pattern = rb'(?P<myFlag>[a-zA-Z0-9_]+)::(?P=myFlag).+:.+(?P<id>\(UINT\)0 *,)'
res = re.search(pattern, contents, re.DOTALL)
if None != res:
print(res.groups()) # the output is (b'xdlg', b'(UINT)0,')
# 'replPattern' becomes b'(?P<myFlag>[a-zA-Z0-9_]+)::(?P=myFlag).+:.+((?P=myFlag)\\(UINT\\)0 *,)'
replPattern = pattern.replace(b'?P<id>', b'(?P=myFlag)', re.DOTALL)
print(replPattern)
contents = re.sub(pattern, replPattern, contents)
print(contents)
The expected results should be:
xdlg::xdlg(x_app* pApp, CWnd* pParent)
: customized_dlg(xdlg(UINT)0, pParent, pApp)
, m_pReaderApp(pApp)
, m_info(pApp)
{
}
but now the result this the same with the original:
xdlg::xdlg(x_app* pApp, CWnd* pParent)
: customized_dlg((UINT)0, pParent, pApp)
, m_pReaderApp(pApp)
, m_info(pApp)
{
}
The issue appears to be the pattern syntax — particularly the end:
0 *,)
That makes no sense really... fixing it seems to solve most of the issues, although I would recommend ditching DOTALL and going with MULTILINE instead:
p = re.compile(ur'([a-zA-Z0-9_]+)::\1(.*\n\W+:.*)(\(UINT\)0,.*)', re.MULTILINE)
sub = u"\\1::\\1\\2\\1\\3"
result = re.sub(p, sub, s)
print(result)
Result:
xdlg::xdlg(x_app* pApp, CWnd* pParent)
: customized_dlg(xdlg(UINT)0, pParent, pApp)
, m_pReaderApp(pApp)
, m_info(pApp)
{
}
https://regex101.com/r/hG3lV7/1
I have modified a python babelizer to help me to translate english to chinese.
## {{{ http://code.activestate.com/recipes/64937/ (r4)
# babelizer.py - API for simple access to babelfish.altavista.com.
# Requires python 2.0 or better.
#
# See it in use at http://babel.MrFeinberg.com/
"""API for simple access to babelfish.altavista.com.
Summary:
import babelizer
print ' '.join(babelizer.available_languages)
print babelizer.translate( 'How much is that doggie in the window?',
'English', 'French' )
def babel_callback(phrase):
print phrase
sys.stdout.flush()
babelizer.babelize( 'I love a reigning knight.',
'English', 'German',
callback = babel_callback )
available_languages
A list of languages available for use with babelfish.
translate( phrase, from_lang, to_lang )
Uses babelfish to translate phrase from from_lang to to_lang.
babelize(phrase, from_lang, through_lang, limit = 12, callback = None)
Uses babelfish to translate back and forth between from_lang and
through_lang until either no more changes occur in translation or
limit iterations have been reached, whichever comes first. Takes
an optional callback function which should receive a single
parameter, being the next translation. Without the callback
returns a list of successive translations.
It's only guaranteed to work if 'english' is one of the two languages
given to either of the translation methods.
Both translation methods throw exceptions which are all subclasses of
BabelizerError. They include
LanguageNotAvailableError
Thrown on an attempt to use an unknown language.
BabelfishChangedError
Thrown when babelfish.altavista.com changes some detail of their
layout, and babelizer can no longer parse the results or submit
the correct form (a not infrequent occurance).
BabelizerIOError
Thrown for various networking and IO errors.
Version: $Id: babelizer.py,v 1.4 2001/06/04 21:25:09 Administrator Exp $
Author: Jonathan Feinberg <jdf#pobox.com>
"""
import re, string, urllib
import httplib, urllib
import sys
"""
Various patterns I have encountered in looking for the babelfish result.
We try each of them in turn, based on the relative number of times I've
seen each of these patterns. $1.00 to anyone who can provide a heuristic
for knowing which one to use. This includes AltaVista employees.
"""
__where = [ re.compile(r'name=\"q\">([^<]*)'),
re.compile(r'td bgcolor=white>([^<]*)'),
re.compile(r'<\/strong><br>([^<]*)')
]
# <div id="result"><div style="padding:0.6em;">??</div></div>
__where = [ re.compile(r'<div id=\"result\"><div style=\"padding\:0\.6em\;\">(.*)<\/div><\/div>', re.U) ]
__languages = { 'english' : 'en',
'french' : 'fr',
'spanish' : 'es',
'german' : 'de',
'italian' : 'it',
'portugese' : 'pt',
'chinese' : 'zh'
}
"""
All of the available language names.
"""
available_languages = [ x.title() for x in __languages.keys() ]
"""
Calling translate() or babelize() can raise a BabelizerError
"""
class BabelizerError(Exception):
pass
class LanguageNotAvailableError(BabelizerError):
pass
class BabelfishChangedError(BabelizerError):
pass
class BabelizerIOError(BabelizerError):
pass
def saveHTML(txt):
f = open('page.html', 'wb')
f.write(txt)
f.close()
def clean(text):
return ' '.join(string.replace(text.strip(), "\n", ' ').split())
def translate(phrase, from_lang, to_lang):
phrase = clean(phrase)
try:
from_code = __languages[from_lang.lower()]
to_code = __languages[to_lang.lower()]
except KeyError, lang:
raise LanguageNotAvailableError(lang)
html = ""
try:
params = urllib.urlencode({'ei':'UTF-8', 'doit':'done', 'fr':'bf-res', 'intl':'1' , 'tt':'urltext', 'trtext':phrase, 'lp' : from_code + '_' + to_code , 'btnTrTxt':'Translate'})
headers = {"Content-type": "application/x-www-form-urlencoded","Accept": "text/plain"}
conn = httplib.HTTPConnection("babelfish.yahoo.com")
conn.request("POST", "http://babelfish.yahoo.com/translate_txt", params, headers)
response = conn.getresponse()
html = response.read()
saveHTML(html)
conn.close()
#response = urllib.urlopen('http://babelfish.yahoo.com/translate_txt', params)
except IOError, what:
raise BabelizerIOError("Couldn't talk to server: %s" % what)
#print html
for regex in __where:
match = regex.search(html)
if match:
break
if not match:
raise BabelfishChangedError("Can't recognize translated string.")
return match.group(1)
#return clean(match.group(1))
def babelize(phrase, from_language, through_language, limit = 12, callback = None):
phrase = clean(phrase)
seen = { phrase: 1 }
if callback:
callback(phrase)
else:
results = [ phrase ]
flip = { from_language: through_language, through_language: from_language }
next = from_language
for i in range(limit):
phrase = translate(phrase, next, flip[next])
if seen.has_key(phrase): break
seen[phrase] = 1
if callback:
callback(phrase)
else:
results.append(phrase)
next = flip[next]
if not callback: return results
if __name__ == '__main__':
import sys
def printer(x):
print x
sys.stdout.flush();
babelize("I won't take that sort of treatment from you, or from your doggie!",
'english', 'french', callback = printer)
## end of http://code.activestate.com/recipes/64937/ }}}
and the test code is
import babelizer
print ' '.join(babelizer.available_languages)
result = babelizer.translate( 'How much is that dog in the window?', 'English', 'chinese' )
f = open('result.txt', 'wb')
f.write(result)
f.close()
print result
The result is to be expected inside a div block . I modded the script to save the html response . What I found is that all utf8 characters are turned to nul . Do I need take special care in treating the utf8 response ?
I think you need to use:
import codecs
codecs.open
instead of plain open, in your:
saveHTML
method, to handle utf-8 docs. See the Python Unicode Howto for a complete explanation.