Trying to automate translation on Babelfish with Python

I have modified a Python babelizer to help me translate English to Chinese.
## {{{ http://code.activestate.com/recipes/64937/ (r4)
# babelizer.py - API for simple access to babelfish.altavista.com.
# Requires python 2.0 or better.
#
# See it in use at http://babel.MrFeinberg.com/
"""API for simple access to babelfish.altavista.com.
Summary:
import babelizer
print ' '.join(babelizer.available_languages)
print babelizer.translate( 'How much is that doggie in the window?',
'English', 'French' )
def babel_callback(phrase):
print phrase
sys.stdout.flush()
babelizer.babelize( 'I love a reigning knight.',
'English', 'German',
callback = babel_callback )
available_languages
A list of languages available for use with babelfish.
translate( phrase, from_lang, to_lang )
Uses babelfish to translate phrase from from_lang to to_lang.
babelize(phrase, from_lang, through_lang, limit = 12, callback = None)
Uses babelfish to translate back and forth between from_lang and
through_lang until either no more changes occur in translation or
limit iterations have been reached, whichever comes first. Takes
an optional callback function which should receive a single
parameter, being the next translation. Without the callback
returns a list of successive translations.
It's only guaranteed to work if 'english' is one of the two languages
given to either of the translation methods.
Both translation methods throw exceptions which are all subclasses of
BabelizerError. They include
LanguageNotAvailableError
Thrown on an attempt to use an unknown language.
BabelfishChangedError
Thrown when babelfish.altavista.com changes some detail of their
layout, and babelizer can no longer parse the results or submit
the correct form (a not infrequent occurance).
BabelizerIOError
Thrown for various networking and IO errors.
Version: $Id: babelizer.py,v 1.4 2001/06/04 21:25:09 Administrator Exp $
Author: Jonathan Feinberg <jdf#pobox.com>
"""
import re, string, urllib
import httplib, urllib
import sys
"""
Various patterns I have encountered in looking for the babelfish result.
We try each of them in turn, based on the relative number of times I've
seen each of these patterns. $1.00 to anyone who can provide a heuristic
for knowing which one to use. This includes AltaVista employees.
"""
__where = [ re.compile(r'name=\"q\">([^<]*)'),
            re.compile(r'td bgcolor=white>([^<]*)'),
            re.compile(r'<\/strong><br>([^<]*)')
          ]

# <div id="result"><div style="padding:0.6em;">??</div></div>
__where = [ re.compile(r'<div id=\"result\"><div style=\"padding\:0\.6em\;\">(.*)<\/div><\/div>', re.U) ]

__languages = { 'english'   : 'en',
                'french'    : 'fr',
                'spanish'   : 'es',
                'german'    : 'de',
                'italian'   : 'it',
                'portugese' : 'pt',
                'chinese'   : 'zh'
              }

"""
All of the available language names.
"""
available_languages = [ x.title() for x in __languages.keys() ]
"""
Calling translate() or babelize() can raise a BabelizerError
"""
class BabelizerError(Exception):
    pass

class LanguageNotAvailableError(BabelizerError):
    pass

class BabelfishChangedError(BabelizerError):
    pass

class BabelizerIOError(BabelizerError):
    pass

def saveHTML(txt):
    f = open('page.html', 'wb')
    f.write(txt)
    f.close()

def clean(text):
    return ' '.join(string.replace(text.strip(), "\n", ' ').split())
def translate(phrase, from_lang, to_lang):
    phrase = clean(phrase)
    try:
        from_code = __languages[from_lang.lower()]
        to_code = __languages[to_lang.lower()]
    except KeyError, lang:
        raise LanguageNotAvailableError(lang)

    html = ""
    try:
        params = urllib.urlencode({'ei': 'UTF-8', 'doit': 'done', 'fr': 'bf-res', 'intl': '1',
                                   'tt': 'urltext', 'trtext': phrase,
                                   'lp': from_code + '_' + to_code, 'btnTrTxt': 'Translate'})
        headers = {"Content-type": "application/x-www-form-urlencoded", "Accept": "text/plain"}
        conn = httplib.HTTPConnection("babelfish.yahoo.com")
        conn.request("POST", "http://babelfish.yahoo.com/translate_txt", params, headers)
        response = conn.getresponse()
        html = response.read()
        saveHTML(html)
        conn.close()
        #response = urllib.urlopen('http://babelfish.yahoo.com/translate_txt', params)
    except IOError, what:
        raise BabelizerIOError("Couldn't talk to server: %s" % what)

    #print html
    for regex in __where:
        match = regex.search(html)
        if match:
            break
    if not match:
        raise BabelfishChangedError("Can't recognize translated string.")
    return match.group(1)
    #return clean(match.group(1))
def babelize(phrase, from_language, through_language, limit = 12, callback = None):
    phrase = clean(phrase)
    seen = { phrase: 1 }
    if callback:
        callback(phrase)
    else:
        results = [ phrase ]
    flip = { from_language: through_language, through_language: from_language }
    next = from_language
    for i in range(limit):
        phrase = translate(phrase, next, flip[next])
        if seen.has_key(phrase): break
        seen[phrase] = 1
        if callback:
            callback(phrase)
        else:
            results.append(phrase)
        next = flip[next]
    if not callback: return results
if __name__ == '__main__':
    import sys

    def printer(x):
        print x
        sys.stdout.flush()

    babelize("I won't take that sort of treatment from you, or from your doggie!",
             'english', 'french', callback = printer)
## end of http://code.activestate.com/recipes/64937/ }}}
and the test code is:
import babelizer
print ' '.join(babelizer.available_languages)
result = babelizer.translate( 'How much is that dog in the window?', 'English', 'chinese' )
f = open('result.txt', 'wb')
f.write(result)
f.close()
print result
The result is expected inside a div block. I modified the script to save the HTML response. What I found is that all UTF-8 characters are turned into NUL. Do I need to take special care in handling the UTF-8 response?

I think you need to use codecs.open (from the codecs module) instead of plain open in your saveHTML method to handle UTF-8 documents. See the Python Unicode HOWTO for a complete explanation.
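For example, something like this (a minimal sketch, assuming the response body really is UTF-8 encoded):

import codecs

def saveHTML(txt):
    # Decode the raw response bytes as UTF-8, then write them out through
    # a codecs file object so multi-byte characters are preserved.
    f = codecs.open('page.html', 'w', encoding='utf-8')
    f.write(txt.decode('utf-8'))
    f.close()

If the characters still come out wrong, check the charset in the response's Content-Type header first, since the decode step above assumes UTF-8.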

Related

Translation of a PHP Script to Python3 (Django)

I am attempting to convert the following PHP script from scratch into Python for my Django project.
Note that it is my understanding that this script should handle values sent from a form, sign the data with the Secret_Key (an HMAC-SHA256 hash), and encode the result in Base64:
<?php
define ('HMAC_SHA256', 'sha256');
define ('SECRET_KEY', '<REPLACE WITH SECRET KEY>');

function sign ($params) {
    return signData(buildDataToSign($params), SECRET_KEY);
}

function signData($data, $secretKey) {
    return base64_encode(hash_hmac('sha256', $data, $secretKey, true));
}

function buildDataToSign($params) {
    $signedFieldNames = explode(",",$params["signed_field_names"]);
    foreach ($signedFieldNames as $field) {
        $dataToSign[] = $field . "=" . $params[$field];
    }
    return commaSeparate($dataToSign);
}

function commaSeparate ($dataToSign) {
    return implode(",",$dataToSign);
}
?>
Here is what I have done so far:
def sawb_confirmation(request):
    if request.method == "POST":
        form = SecureAcceptance(request.POST)
        if form.is_valid():
            access_key = 'afc10315b6aaxxxxxcfc912xx812b94c'
            profile_id = 'E25C4XXX-4622-47E9-9941-1003B7910B3B'
            transaction_uuid = str(uuid.uuid4())
            signed_field_names = 'access_key,profile_id,transaction_uuid,signed_field_names,unsigned_field_names,signed_date_time,locale,transaction_type,reference_number,amount,currency'
            signed_date_time = datetime.datetime.now()
            signed_date_time = str(signed_date_time.strftime("20%y-%m-%dT%H:%M:%SZ"))
            locale = 'en'
            transaction_type = str(form.cleaned_data["transaction_type"])
            reference_number = str(form.cleaned_data["reference_number"])
            amount = str(form.cleaned_data["amount"])
            currency = str(form.cleaned_data["currency"])

            # Transform the String into a List
            signed_field_names = [x.strip() for x in signed_field_names.split(',')]
            # Get Values for each of the fields in the form
            values = [access_key, profile_id, transaction_uuid, signed_field_names, '', signed_date_time, locale, transaction_type, reference_number, amount, currency]
            # Insert the signedfieldnames in their place in the list (MUST BE KEPT)
            values[3] = 'access_key,profile_id,transaction_uuid,signed_field_names,unsigned_field_names,signed_date_time,locale,transaction_type,reference_number,amount,currency'
            # Merge the two lists as one
            DataToSign = list(map('='.join, zip(signed_field_names, values)))

            # Hash Sha-256
            API_SECRET = 'bb588d4f96ac491ebd43cceb18xx149b79291f874f1a41fcbf5bc078bb6c8793af2df5ad4b174f80bd5f24a4e4eec6fdabdxxxxxc6c1410db40252deea613e0b976748539294438694ba08xx4ba831d3d850349cacfa445f9706aa57be7f8e61aab0be2288054dbe88ec6200ccd7c72888bcc0aa373f42059ec248d3c86b0f45'
            message = '{} {}'.format(DataToSign, API_SECRET)
            signature = hmac.new(bytes(API_SECRET, 'latin-1'), msg=bytes(message, 'latin-1'), digestmod=hashlib.sha256).hexdigest().upper()
            base64string = base64.b64encode(bytes(signature, "utf-8"))
When printing the variables as they come, I obtain the following:
VALUES : ['afc10315b6aa3b2a8cfc91253812b94c', 'E25C4FE4-4622-47E9-9941-1003B7910B3B', '0b59b0ae-bd25-4421-a231-bb83dcfc91fa', 'access_key,profile_id,transaction_uuid,signed_field_names,unsigned_field_names,signed_date_time,locale,transaction_type,reference_number,amount,currency', '', '2021-03-06T22:07:30Z', 'en', 'authorization', '1615068450109', '100', 'USD']
DATATOSIGN : ['access_key=afc10315b6aa3b2a8cfc91253812b94c', 'profile_id=E25C4FE4-4622-47E9-9941-1003B7910B3B', 'transaction_uuid=0b59b0ae-bd25-4421-a231-bb83dcfc91fa', 'signed_field_names=access_key,profile_id,transaction_uuid,signed_field_names,unsigned_field_names,signed_date_time,locale,transaction_type,reference_number,amount,currency', 'unsigned_field_names=', 'signed_date_time=2021-03-06T22:07:30Z', 'locale=en', 'transaction_type=authorization', 'reference_number=1615068450109', 'amount=100', 'currency=USD']
SIGNATURE : 953C786EB9884CEC13C24118B00125BDCFE23AFF8AB02E7BEF29A83156C55C16
BASE64STRING : b'OTUzQzc4NkVCOTg4NENFQzEzQzI0MTE4QjAwMTI1QkRDRkUyM0FGRjhBQjAyRTdCRUYyOUE4MzE1NkM1NUMxNg=='
I think I am getting pretty close from the final result I would like to achieve since I would then simply have to post the Base64String to a specific URL.
However, I am unsure of a couple of things which may seem a bit off:
Is my "translation" of the PHP code into Python correct? Am I meant to merge my lists into a result like "DATATOSIGN"? I am not proficient in PHP, so I might have misunderstood how to present the data.
The signature in Base64 should be 44 chars AT ALL TIMES, like "WrXOhTzhBjYMZROwiCug2My3jiZHOqATimcz5EBA07M=" when using the PHP sample code, but mine far exceeds that length.
If you need any additional information, please do not hesitate to ask.
Hope you can give me pointers!
To approach this problem, it might be good to get an idea of what your final PHP result would be for given parameters.
Here are the parameters I'm using for this with your given PHP code:
$params = [
    'access_key' => 'afc10315b6aaxxxxxcfc912xx812b94c',
    'profile_id' => 'E25C4XXX-4622-47E9-9941-1003B7910B3B',
    'transaction_uuid' => '12345',
    'signed_field_names' => 'access_key,profile_id,transaction_uuid,signed_field_names,unsigned_field_names,signed_date_time,locale,transaction_type,reference_number,amount,currency',
    'unsigned_field_names' => '',
    'signed_date_time' => '2021-03-06 16:14:00',
    'locale' => 'en',
    'transaction_type' => 'credit',
    'reference_number' => '12345',
    'amount' => '50',
    'currency' => 'usd'
];
When I run the original PHP code with these parameters, and these lines for outputting the code:
<?php
echo "build data to sign:\n";
print_r(buildDataToSign($params));
echo "\n";
echo "sign data:\n";
echo signData(buildDataToSign($params), 'secret');
?>
I get the following output:
build data to sign:
access_key=afc10315b6aaxxxxxcfc912xx812b94c,profile_id=E25C4XXX-4622-47E9-9941-1003B7910B3B,transaction_uuid=12345,signed_field_names=access_key,profile_id,transaction_uuid,signed_field_names,unsigned_field_names,signed_date_time,locale,transaction_type,reference_number,amount,currency,unsigned_field_names=,signed_date_time=2021-03-06 16:14:00,locale=en,transaction_type=credit,reference_number=12345,amount=50,currency=usd
sign data:
6V0iIqu3smGmadPK4KvRuHm1nNkuIVLBPbLg7VkA7M8=
So with your new Python version of this PHP code, you'll probably want to have a similar sign data value of 6V0iIqu3smGmadPK4KvRuHm1nNkuIVLBPbLg7VkA7M8= at the end with these parameters!
Because your Python example does not seem to get the same result as-is, after adding a return base64string to the end of your Python def, I get the following output:
sign_data:
b'NThFMjU4QTQyRjU2MkVDRDgzM0RCOEIwM0VDODczQTExNjc3MUNDMEM2OURGMDFGMjdFQkU3MEMzMDAyNjA3RQ=='
In order to match the PHP version of your code, I wanted to try to find out what was going on between the PHP and Python approaches in regard to the hmac and base64 parts.
When I broke down your PHP code example into steps relating to the hmac value and then later the base64 value, here is what I found (using a data message of 'hello' and a key of 'secret' to keep it simple):
Example PHP Code:
<?php
$hash_value = hash_hmac('sha256', 'hello', 'secret', true);
$base64_value = base64_encode($hash_value);
echo "hash value:\n";
echo $hash_value;
echo "\n";
echo "base64 value:\n";
echo $base64_value;
echo "\n";
?>
Example PHP Code Output:
hash value:
;▒▒▒▒▒C▒|
base64 value:
iKqz7ejTrflNJquQ07r9SiCDBww7zOnAFO4EpEOEfAs=
That looks like some crazy binary-type data! Then, I wanted to try to make sure that the base64 value could be reproduced in Python. To do that, I used a simple approach in Python using the same values as earlier.
Example Python Code:
import base64
import hashlib
import hmac
# Based on your Python code example
hash_value = hmac.new(bytes('secret' , 'latin-1'), msg = bytes('hello', 'latin-1'), digestmod = hashlib.sha256).hexdigest().upper()
base64_value = base64.b64encode(bytes(hash_value, 'utf-8'))
print("hash value:")
print(hash_value)
print("base64 value:")
print(base64_value)
Example Python Code Output:
hash value:
88AAB3EDE8D3ADF94D26AB90D3BAFD4A2083070C3BCCE9C014EE04A443847C0B
base64 value:
b'ODhBQUIzRURFOEQzQURGOTREMjZBQjkwRDNCQUZENEEyMDgzMDcwQzNCQ0NFOUMwMTRFRTA0QTQ0Mzg0N0MwQg=='
So, like you found out earlier, something is causing this base64 value result on the Python side to be longer than the PHP version.
After looking into things more (especially seeing the strange data result in the PHP test code above), I found out that the hash_hmac() function in PHP has the option to return its result in binary form (that is what the true value at the end of the hash_hmac() call in your PHP code example does). On the Python side, it looks like you decided to use hmac.hexdigest(), which I have used in the past when I wanted a string-like value. For this case, however, I think you want the value back as a binary value. To do this, it looks like you'll want to use hmac.digest() instead.
Modified Example Python Code:
import base64
import hashlib
import hmac
# Based on your Python code example
hash_value = hmac.new(bytes('secret' , 'latin-1'), msg = bytes('hello', 'latin-1'), digestmod = hashlib.sha256).digest()
base64_value = base64.b64encode(bytes(hash_value))
print("hash value:")
print(hash_value)
print("base64 value:")
print(base64_value)
Modified Example Python Code Output:
hash value:
b'\x88\xaa\xb3\xed\xe8\xd3\xad\xf9M&\xab\x90\xd3\xba\xfdJ \x83\x07\x0c;\xcc\xe9\xc0\x14\xee\x04\xa4C\x84|\x0b'
base64 value:
b'iKqz7ejTrflNJquQ07r9SiCDBww7zOnAFO4EpEOEfAs='
Now, the final base64 results appear to match between the example PHP and Python code.
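The length difference you noticed follows directly from that: SHA-256 produces a 32-byte digest, and Base64 turns every 3 bytes into 4 characters, so the 32-byte digest encodes to 44 characters while the 64-character hex string from hexdigest() encodes to 88. A quick check using only the standard library:

import base64
import hashlib
import hmac

raw = hmac.new(b'secret', b'hello', hashlib.sha256).digest()       # 32 bytes
hexed = hmac.new(b'secret', b'hello', hashlib.sha256).hexdigest()  # 64 hex characters

print(len(base64.b64encode(raw)))             # 44, the expected signature length
print(len(base64.b64encode(hexed.encode())))  # 88, the overlong result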
In order for me to better understand what was different between the PHP and Python code, I ended up writing a simple translation of your PHP code into Python (and partly based on your Python code as well).
Here is what the related Python code looks like on my side (with example params):
import base64
import hmac

params = {
    'access_key': 'afc10315b6aaxxxxxcfc912xx812b94c',
    'profile_id': 'E25C4XXX-4622-47E9-9941-1003B7910B3B',
    'transaction_uuid': "12345",
    'signed_field_names': 'access_key,profile_id,transaction_uuid,signed_field_names,unsigned_field_names,signed_date_time,locale,transaction_type,reference_number,amount,currency',
    'unsigned_field_names': '',
    'signed_date_time': "2021-03-06 16:14:00",
    'locale': 'en',
    'transaction_type': "credit",
    'reference_number': "12345",
    'amount': "50",
    'currency': "usd"
}

SECRET_KEY = 'secret'

def sign(params):
    return sign_data(build_data_to_sign(params), SECRET_KEY)

def sign_data(data, secret_key):
    return base64.b64encode(bytes(hmac.new(bytes(secret_key, 'latin-1'), msg=bytes(data, 'latin-1'), digestmod='sha256').digest()))

def build_data_to_sign(params):
    data_to_sign = []
    signed_field_names = params['signed_field_names'].split(',')
    for field in signed_field_names:
        data_to_sign.append(field + "=" + params[field])
    return comma_separate(data_to_sign)

def comma_separate(data_to_sign):
    return ','.join(data_to_sign)
When I used my code translation to check your Python code, I looked at the values of the variables signed_field_names and DataToSign in your code and got the following results:
signed_field_names:
['access_key', 'profile_id', 'transaction_uuid', 'signed_field_names', 'unsigned_field_names', 'signed_date_time', 'locale', 'transaction_type', 'reference_number', 'amount', 'currency']
DataToSign:
['access_key=afc10315b6aaxxxxxcfc912xx812b94c', 'profile_id=E25C4XXX-4622-47E9-9941-1003B7910B3B', 'transaction_uuid=12345', 'signed_field_names=access_key,profile_id,transaction_uuid,signed_field_names,unsigned_field_names,signed_date_time,locale,transaction_type,reference_number,amount,currency', 'unsigned_field_names=', 'signed_date_time=2021-03-06 16:14:00', 'locale=en', 'transaction_type=credit', 'reference_number=12345', 'amount=50', 'currency=usd']
When I check the values with my code translation attempt, I get these values:
signed_field_names:
['access_key', 'profile_id', 'transaction_uuid', 'signed_field_names', 'unsigned_field_names', 'signed_date_time', 'locale', 'transaction_type', 'reference_number', 'amount', 'currency']
DataToSign:
access_key=afc10315b6aaxxxxxcfc912xx812b94c,profile_id=E25C4XXX-4622-47E9-9941-1003B7910B3B,transaction_uuid=12345,signed_field_names=access_key,profile_id,transaction_uuid,signed_field_names,unsigned_field_names,signed_date_time,locale,transaction_type,reference_number,amount,currency,unsigned_field_names=,signed_date_time=2021-03-06 16:14:00,locale=en,transaction_type=credit,reference_number=12345,amount=50,currency=usd
So it looks like your DataToSign = list(map('='.join, zip(signed_field_names, values))) line is specifying a list whereas my code attempt is specifying a string based on your original PHP example.
Because of this, I think you'll want to turn the result back into a string like this (though the variable name could also be written differently if you so choose):
DataToSignString = ','.join(DataToSign)
To save time in this long post: I also found that your message variable was different from the one in my translation of your PHP code. To work around this, I set the message variable in your Python code to the previously mentioned DataToSignString:
# Commenting out previous message line for now
# message = '{} {}'.format(DataToSignString, API_SECRET)
message = DataToSignString
Also, the following changes seem to be needed for your Python example:
signature = hmac.new(bytes(API_SECRET , 'latin-1'), msg = bytes(message , 'latin-1'), digestmod = hashlib.sha256).digest()
base64string = base64.b64encode(bytes(signature))
This way, you have a binary version of the hmac object. Also, it looks like the utf-8 part might not be needed for now in the base64encode part.
Finally, I added a return to return the calculated base64string while also converting it to a string before base64string is returned:
return str(base64string, 'utf-8')
When put together, here is what the modified code from your Python example looks like:
import base64
import datetime
import hashlib
import hmac
import pprint
import uuid

def sign():
    access_key = 'afc10315b6aaxxxxxcfc912xx812b94c'
    profile_id = 'E25C4XXX-4622-47E9-9941-1003B7910B3B'
    transaction_uuid = "12345"
    signed_field_names = 'access_key,profile_id,transaction_uuid,signed_field_names,unsigned_field_names,signed_date_time,locale,transaction_type,reference_number,amount,currency'
    signed_date_time = "2021-03-06 16:14:00"
    locale = 'en'
    transaction_type = "credit"
    reference_number = "12345"
    amount = "50"
    currency = "usd"

    # Transform the String into a List
    signed_field_names = [x.strip() for x in signed_field_names.split(',')]
    # Get Values for each of the fields in the form
    values = [access_key, profile_id, transaction_uuid, signed_field_names, '', signed_date_time, locale, transaction_type, reference_number, amount, currency]
    # Insert the signedfieldnames in their place in the list (MUST BE KEPT)
    values[3] = 'access_key,profile_id,transaction_uuid,signed_field_names,unsigned_field_names,signed_date_time,locale,transaction_type,reference_number,amount,currency'
    # Merge the two lists as one
    DataToSign = list(map('='.join, zip(signed_field_names, values)))
    DataToSignString = ','.join(DataToSign)

    # Hash Sha-256
    API_SECRET = 'secret'
    message = DataToSignString
    signature = hmac.new(bytes(API_SECRET, 'latin-1'), msg=bytes(message, 'latin-1'), digestmod=hashlib.sha256).digest()
    base64string = base64.b64encode(bytes(signature))

    return str(base64string, 'utf-8')

result = sign()
print("sign_data:")
print(result)
The output for this code (with the given parameters) is:
sign_data:
6V0iIqu3smGmadPK4KvRuHm1nNkuIVLBPbLg7VkA7M8=
The value part of this output should be the same as the PHP output from earlier in this post. The earlier value was 6V0iIqu3smGmadPK4KvRuHm1nNkuIVLBPbLg7VkA7M8= and the latest output showed a result of 6V0iIqu3smGmadPK4KvRuHm1nNkuIVLBPbLg7VkA7M8=.
@summea You are a godsend! Thanks a ton! I cannot believe how much of an effort you made, I am baffled!
If anyone is attempting to implement Secure Acceptance / CyberSource and sees this message, note that you will not be able to pass value="{{ signed_field_names }}" in your front end as-is, since the data looks like ['access_key', 'profile_id'].
You would need to either tweak it before sending it (which somehow gives a Signature Mismatch on the CYBS end) or simply hardcode the input field in payment_confirmation like so: value="access_key,profile_id,transaction_uuid,signed_field_names,unsigned_field_names,signed_date_time,locale,transaction_type,reference_number,amount,currency"/>

How to get readable unicode string from single bibtex entry field in python script

Suppose you have a .bib file containing bibtex-formatted entries. I want to extract the "title" field from an entry, and then format it to a readable unicode string.
For example, if the entry was:
@article{mypaper,
    author = {myself},
    title = {A very nice {title} with annoying {symbols} like {\^{a}}}
}
what I want to extract is the string:
A very nice title with annoying symbols like â
I am currently trying to use the pybtex package, but I cannot figure out how to do it. The command-line utility pybtex-format does a good job in converting full .bib files, but I need to do this inside a script and for single title entries.
Figured it out:
def load_bib(filename):
    from pybtex.database.input.bibtex import Parser
    parser = Parser()
    DB = parser.parse_file(filename)
    return DB

def get_title(entry):
    from pybtex.plugin import find_plugin
    style = find_plugin('pybtex.style.formatting', 'plain')()
    backend = find_plugin('pybtex.backends', 'plaintext')()
    sentence = style.format_title(entry, 'title')
    data = {'entry': entry,
            'style': style,
            'bib_data': None}
    T = sentence.f(sentence.children, data)
    title = T.render(backend)
    return title

DB = load_bib("bibliography.bib")
print ( get_title(DB.entries["entry_label"]) )
where entry_label must match the label you use in latex to cite the bibliography entry.
Building upon the answer by Daniele, I wrote this function that lets one render fields without having to use a file.
from io import StringIO
from pybtex.database.input.bibtex import Parser
from pybtex.plugin import find_plugin

def render_fields(author="", title=""):
    """The arguments are in bibtex format. For example, they may contain
    things like \'{i}. The output is a dictionary with these fields
    rendered in plain text.

    If you run tests by defining a string in Python, use r'''string''' to
    avoid issues with escape characters.
    """
    parser = Parser()
    istr = r'''
        @article{foo,
            Author = {''' + author + r'''},
            Title = {''' + title + '''},
        }
    '''
    bib_data = parser.parse_stream(StringIO(istr))
    style = find_plugin('pybtex.style.formatting', 'plain')()
    backend = find_plugin('pybtex.backends', 'plaintext')()
    entry = bib_data.entries["foo"]
    data = {'entry': entry, 'style': style, 'bib_data': None}

    sentence = style.format_author_or_editor(entry)
    T = sentence.f(sentence.children, data)
    rendered_author = T.render(backend)[0:-1]  # exclude period

    sentence = style.format_title(entry, 'title')
    T = sentence.f(sentence.children, data)
    rendered_title = T.render(backend)[0:-1]  # exclude period

    return {'title': rendered_title, 'author': rendered_author}

Printing dictionary from inside a list puts one character on each line

Yes, yet another. I can't figure out what the issue is. I'm trying to iterate over a list that is a subsection of JSON output from an API call.
This is the section of JSON that I'm working with:
[
    {
        "created_at": "2017-02-22 17:20:29 UTC",
        "description": "",
        "id": 1,
        "label": "FOO",
        "name": "FOO",
        "title": "FOO",
        "updated_at": "2018-12-04 16:37:09 UTC"
    }
]
The code that I'm running that retrieves this and displays it:
#!/usr/bin/python

import json
import sys

try:
    import requests
except ImportError:
    print "Please install the python-requests module."
    sys.exit(-1)

SAT_API = 'https://satellite6.example.com/api/v2/'
USERNAME = "admin"
PASSWORD = "password"
SSL_VERIFY = False  # Ignore SSL for now

def get_json(url):
    # Performs a GET using the passed URL location
    r = requests.get(url, auth=(USERNAME, PASSWORD), verify=SSL_VERIFY)
    return r.json()

def get_results(url):
    jsn = get_json(url)
    if jsn.get('error'):
        print "Error: " + jsn['error']['message']
    else:
        if jsn.get('results'):
            return jsn['results']
        elif 'results' not in jsn:
            return jsn
        else:
            print "No results found"
    return None

def display_all_results(url):
    results = get_results(url)
    if results:
        return json.dumps(results, indent=4, sort_keys=True)

def main():
    orgs = display_all_results(KATELLO_API + "organizations/")
    for org in orgs:
        print org

if __name__ == "__main__":
    main()
I appear to be missing a concept, because when I print org I get one character per line, such as
[
{
"
c
r
e
a
t
e
d
_
a
t
"
It does this through to the final ]
I've also tried printing org['name'], which throws the Python error TypeError: list indices must be integers, not str. This makes me think that org is being seen as a list rather than the dictionary I thought it would be, given the [{...}] format.
What concept am I missing?
EDIT: An explanation for why I'm not getting this: I'm working from a script in the Red Hat Satellite API Guide, which I'm using as the basis for another script. I'm basically learning as I go.
display_all_results is returning a string because of the json.dumps(results, indent=4, sort_keys=True) call, which converts the dictionary to a string (you get that dictionary from r.json() in the get_json function).
You then end up iterating over the characters of that string in main, and you see one character per line
Instead just return results from display_all_results and the code will work as intended
def display_all_results(url):
    # results is already a dictionary, just return it
    results = get_results(url)
    if results:
        return results
orgs is the result of json.dumps, which produces a string. So instead of this code:

for org in orgs:
    print(org)

replace it with simply:

# for org in orgs:
print(orgs)

Python to ruby conversion

Hi guys, I have a Python script that posts some data to Google and gets back a response. The script is below:
net, cid, lac = 24005, 40242, 62211

import urllib

a = '000E00000000000000000000000000001B0000000000000000000000030000'
b = hex(cid)[2:].zfill(8) + hex(lac)[2:].zfill(8)
c = hex(divmod(net,100)[1])[2:].zfill(8) + hex(divmod(net,100)[0])[2:].zfill(8)
string = (a + b + c + 'FFFFFFFF00000000').decode('hex')

try:
    data = urllib.urlopen('http://www.google.com/glm/mmap', string)
    r = data.read().encode('hex')
    print r
except:
    print 'connect error'
I want to get the same response with a Ruby script. I am not able to form the request properly, and I always get a BadImplementation / HTTP 501 error. Could you tell me where the mistake is? (The Ruby script is attached below.)
require 'net/http'

def fact(mnc, mcc, cid, lac)
  a = '000E00000000000000000000000000001B0000000000000000000000030000'
  b = cid.to_s(16).rjust(8,'0') + lac.to_s(16).rjust(8,'0')
  c = mnc.to_s(16).rjust(8,'0') + mcc.to_s(16).rjust(8,'0')
  string = [a + b + c + 'FFFFFFFF00000000'].pack('H*')
  url = URI.parse('http://www.google.com/glm/mmap')
  resp = Net::HTTP.post_form(url, string)
  print resp
end

puts fact(5, 240, 40242, 62211)
From the documentation:
Posts HTML form data to the specified URI object. The form data must be provided as a Hash mapping from String to String.
You have to pass the parameters, if I understood that correctly, in the form:
{"param1" => "value1", "param2" => "value2"}
I just didn't understand what the names of the parameters you are passing in your request are.
Here are some usage examples for the method Net::HTTP::post_form, also from the official doc:
Ex 1:
uri = URI('http://www.example.com/search.cgi')
res = Net::HTTP.post_form(uri, 'q' => 'ruby', 'max' => '50')
puts res.body
Ex2:
uri = URI('http://www.example.com/search.cgi')
res = Net::HTTP.post_form(uri, 'q' => ['ruby', 'perl'], 'max' => '50')
puts res.body
Link to the examples
Hope it helps
Edit: for a method that accepts a String as the body of the POST request, see Net::HTTP::request_post.

Creating a Blog Summary in Python?

Is there any good library (or regex magic) which can convert a blog entry into a blog summary? I'd like the summary to display the first four sentences, first paragraph, or first X number of characters... not really sure what would be the best. Ideally, I would like it to keep html formatting tags such as <a>, <b>, <u> and <i>, but it could remove all other html tags, javascript and css.
More specifically, as input I'd give an html string representing an entire blog post. As output, I'd like an html string which contains the first few sentences, paragraph, or X number of characters. With all potentially unsafe html tags removed. In Python please.
If you're looking at the HTML you'll need to parse it. In addition to the aforementioned BeautifulSoup, lxml.html has some nice HTML handling tools.
However, if it's a blog you may find it even easier to work with RSS/Atom feeds. Feedparser is fantastic and would make it easy. You'd gain compatibility and durability (because RSS is more strictly defined, things will change less), but if the feed doesn't include what you need it won't help you.
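For instance, if the blog publishes an RSS or Atom feed, a rough sketch with feedparser might look like this (the feed URL here is a placeholder, and whether a summary field exists depends on the feed itself):

import feedparser

# Placeholder feed URL; substitute the blog's real RSS/Atom address.
feed = feedparser.parse('http://example.com/blog/feed')

for entry in feed.entries:
    # Many feeds already include a short HTML summary of each post;
    # fall back to an empty string if this one does not.
    summary = entry.get('summary', '')
    print(entry.title)
    print(summary[:500])  # crude cut at roughly X characters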
I ended up using the gdata library and rolling my own blog summarizer, which uses the gdata library to fetch a Blogspot blog on Google App Engine (wouldn't be hard to port it to other platforms). The code is below. To use it, first set the constant blog_id_constant and then call get_blog_info to return a dictionary with the blog summaries.
I would not trust the code to create summaries of any random blog out there on the internet because it may not remove all unsafe html from the blog feed. However, for a simple blog that you write yourself, the code below should work.
Please feel free to copy but if you see any bugs or would like to make improvements, add them in the comments. (Sorry for the semicolons).
import sys
import os
import logging
import time
import urllib
from HTMLParser import HTMLParser

from django.core.cache import cache

# Import the Blogger API
sys.path.insert(0, 'gdata.zip')
from gdata import service

Months = ["Jan.", "Feb.", "Mar.", "Apr.", "May", "June", "July", "Aug.", "Sept.", "Oct.", "Nov.", "Dec."];

blog_id_constant = -1 # YOUR BLOG ID HERE

blog_pages_at_once = 5

# -----------------------------------------------------------------------------
# Blogger

class BlogHTMLSummarizer(HTMLParser):
    '''
    An HTML parser which only grabs X number of words and removes
    all tags except for certain safe ones.
    '''

    def __init__(self, max_words = 80):
        self.max_words = max_words
        self.allowed_tags = ["a", "b", "u", "i", "br", "div", "p", "img", "li", "ul", "ol"]
        if self.max_words < 80:
            # If it's really short, don't include layout tags
            self.allowed_tags = ["a", "b", "u", "i"]
        self.reset()
        self.out_html = ""
        self.num_words = 0
        self.no_more_data = False
        self.no_more_tags = False
        self.tag_stack = []

    def handle_starttag(self, tag, attrs):
        if not self.no_more_data and tag in self.allowed_tags:
            val = "<%s %s>"%(tag,
                             " ".join("%s='%s'"%(a,b) for (a,b) in attrs))
            self.tag_stack.append(tag)
            self.out_html += val

    def handle_data(self, data):
        if self.no_more_data:
            return
        data = data.split(" ")
        if self.num_words + len(data) >= self.max_words:
            data = data[:self.max_words-self.num_words]
            data.append("...")
            self.no_more_data = True
        self.out_html += " ".join(data)
        self.num_words += len(data)

    def handle_endtag(self, tag):
        if self.no_more_data and not self.tag_stack:
            self.no_more_tags = True
        if not self.no_more_tags and self.tag_stack and tag == self.tag_stack[-1]:
            if not self.tag_stack:
                logging.warning("mixed up blogger tags")
            else:
                self.out_html += "</%s>"%tag
                self.tag_stack.pop()

def get_blog_info(short_summary = False, page = 1, year = "", month = "", day = "", post = None):
    '''
    Returns summaries of several recent blog posts to be displayed on the front page
    page: which page of blog posts to get. Starts at 1.
    '''
    blogger_service = service.GDataService()
    blogger_service.source = 'exampleCo-exampleApp-1.0'
    blogger_service.service = 'blogger'
    blogger_service.account_type = 'GOOGLE'
    blogger_service.server = 'www.blogger.com'

    blog_dict = {}

    # Do the common stuff first
    query = service.Query()
    query.feed = '/feeds/' + blog_id_constant + '/posts/default'
    query.order_by = "published"
    blog_dict['entries'] = []

    def get_common_entry_data(entry, summarize_len = None):
        '''
        Convert an entry to a dictionary object.
        '''
        content = entry.content.text
        if summarize_len != None:
            parser = BlogHTMLSummarizer(summarize_len)
            parser.feed(entry.content.text)
            content = parser.out_html
        pubstr = time.strptime(entry.published.text[:-10], '%Y-%m-%dT%H:%M:%S')
        safe_title = entry.title.text.replace(" ","_")
        for c in ":,.<>!@#$%^&*()+-=?/'[]{}\\\"":
            # remove nasty characters
            safe_title = safe_title.replace(c, "")
        link = "%d/%d/%d/%s/"%(pubstr.tm_year, pubstr.tm_mon, pubstr.tm_mday,
                               urllib.quote_plus(safe_title))
        return {
            'title':entry.title.text,
            'alllinks':[x.href for x in entry.link] + [link], #including blogger links
            'link':link,
            'content':content,
            'day':pubstr.tm_mday,
            'month':Months[pubstr.tm_mon-1],
            'summary': True if summarize_len != None else False,
        }

    def get_blogger_feed(query):
        feed = cache.get(query.ToUri())
        if not feed:
            logging.info("GET Blogger Page: " + query.ToUri())
            try:
                feed = blogger_service.Get(query.ToUri())
            except DownloadError:
                logging.error("Cant download blog, rate limited? %s"%str(query.ToUri()))
                return None
            except Exception, e:
                web_exception('get_blogger_feed', e)
                return None
            cache.set(query.ToUri(), feed, 3600)
        return feed

    def _in_one(a, allBs):
        # Return true if a is in one of allBs
        for b in allBs:
            if a in b:
                return True
        return False

    def _get_int(i):
        try:
            return int(i)
        except ValueError:
            return None

    (year, month, day) = (_get_int(year), _get_int(month), _get_int(day))

    if not short_summary and year and month and day:
        # Get one more than we need so we can see if we have more
        query.published_min = "%d-%02d-%02dT00:00:00-08:00"%(year, month, day)
        query.published_max = "%d-%02d-%02dT23:59:59-08:00"%(year, month, day)
        feed = get_blogger_feed(query)
        if not feed:
            return {}
        blog_dict['detail_view'] = True
        blog_dict['entries'] = map(lambda e: get_common_entry_data(e, None), feed.entry)
    elif not short_summary and year and month and not day:
        # Get one more than we need so we can see if we have more
        query.published_min = "%d-%02d-%02dT00:00:00-08:00"%(year, month, 1)
        query.published_max = "%d-%02d-%02dT23:59:59-08:00"%(year, month, 31)
        feed = get_blogger_feed(query)
        if not feed:
            return {}
        blog_dict['detail_view'] = True
        blog_dict['entries'] = map(lambda e: get_common_entry_data(e, None), feed.entry)
        if post:
            blog_dict['entries'] = filter(lambda f: _in_one(post, f['alllinks']), blog_dict['entries'])
    elif short_summary:
        # Get a summary of all posts
        query.max_results = str(3)
        query.start_index = str(1)
        feed = get_blogger_feed(query)
        if not feed:
            return {}
        feed.entry = feed.entry[:3]
        blog_dict['entries'] = map(lambda e: get_common_entry_data(e, 18), feed.entry)
    else:
        # Get a summary of all posts
        try:
            page = int(page)
        except ValueError:
            page = 1
        # Get one more than we need so we can see if we have more
        query.max_results = str(blog_pages_at_once + 1)
        query.start_index = str((page - 1)* blog_pages_at_once + 1)
        logging.info("GET Blogger Page: " + query.ToUri())
        feed = blogger_service.Get(query.ToUri())
        has_older = len(feed.entry) > blog_pages_at_once
        feed.entry = feed.entry[:blog_pages_at_once]
        if page > 1:
            blog_dict['newer_page'] = str(page-1)
        if has_older:
            blog_dict['older_page'] = str(page+1)
        blog_dict['entries'] = map(lambda e: get_common_entry_data(e, 80), feed.entry)

    return blog_dict
You will have to parse the HTML. A nice lib for doing that is BeautifulSoup. It will allow you to remove specific tags and extract values (the text between tags). The text can then be cut down to four sentences relatively easily, though I'd go for a fixed number of characters, as sentence length can vary a lot.
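As a rough illustration of that approach (assuming BeautifulSoup 4; the tag whitelist and character limit are just examples):

from bs4 import BeautifulSoup

ALLOWED_TAGS = {'a', 'b', 'u', 'i'}

def summarize(html, max_chars=500):
    soup = BeautifulSoup(html, 'html.parser')
    # Drop script and style blocks entirely, including their contents.
    for tag in soup(['script', 'style']):
        tag.decompose()
    # Unwrap every tag that is not whitelisted, keeping only its text.
    for tag in soup.find_all(True):
        if tag.name not in ALLOWED_TAGS:
            tag.unwrap()
    text = str(soup)
    return text[:max_chars] + ('...' if len(text) > max_chars else '')

Cutting at a fixed character count can split an open tag in half, so in practice you may want to break on whitespace or run the truncated text through the parser once more.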
