How to read multiline .properties file in python - python

I'm trying to read a java multiline i18n properties file. Having lines like:
messages.welcome=Hello\
World!
messages.bye=bye
Using this code:
import configobj
properties = configobj.ConfigObj(propertyFileName)
But with multilines values it fails.
Any suggestions?

According to the ConfigObj documentation, configobj requires you to surround multiline values in triple quotes:
Values that contain line breaks
(multi-line values) can be surrounded
by triple quotes. These can also be
used if a value contains both types of
quotes. List members cannot be
surrounded by triple quotes:
If modifying the properties file is out of the question, I suggest using configparser:
In config parsers, values can span
multiple lines as long as they are
indented more than the key that holds
them. By default parsers also let
empty lines to be parts of values.
Here's a quick proof of concept:
#!/usr/bin/env python
# coding: utf-8
from __future__ import print_function
try:
    import ConfigParser as configparser  # Python 2
except ImportError:
    import configparser  # Python 3
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3
test_ini = """
[some_section]
messages.welcome=Hello\
 World
messages.bye=bye
"""
config = configparser.ConfigParser()
# readfp() was removed in Python 3.12; use read_file() or read_string() there
config.readfp(StringIO(test_ini))
print(config.items('some_section'))
Output:
[('messages.welcome', 'Hello World'),
('messages.bye', 'bye')]
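For Python 3 only, here is a minimal sketch of the indented-continuation rule quoted above, using read_string(). The section name and sample text are made up for illustration. Note that configparser joins continuation lines with a newline, which is not what a Java backslash continuation does:

```python
import configparser

# Hypothetical sample: the continuation line is indented deeper than its key
text = """\
[some_section]
messages.welcome = Hello
    World
messages.bye = bye
"""

config = configparser.ConfigParser()
config.read_string(text)

welcome = config.get("some_section", "messages.welcome")
bye = config.get("some_section", "messages.bye")
print(repr(welcome))
print(repr(bye))
```

If you need the Java behaviour (lines joined with no separator), you would have to post-process the value yourself, for example by removing the embedded newline.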

Thanks for the answers, this is what I finally did:
Add the section to the first line of the properties file
Remove empty lines
Parse with configparser
Remove first line (section added in first step)
This is an extract of the code:
#!/usr/bin/python
...
# Add the section
subprocess.Popen(['/bin/bash','-c','sed -i \'1i [default]\' '+srcDirectory+"/*.properties"], stdout=subprocess.PIPE)
# Replace empty lines with a comment marker
subprocess.Popen(['/bin/bash','-c','sed -i \'s/^$/#/g\' '+srcDirectory+"/*.properties"], stdout=subprocess.PIPE)
# Get all i18n files
files=glob.glob(srcDirectory+"/"+baseFileName+"_*.properties")
config = ConfigParser.ConfigParser()
for propFile in files:
    ...
    config.read(propertyFileName)
    value=config.get('default',"someproperty")
    ...
# Remove section
subprocess.Popen(['/bin/bash','-c','sed -i \'1d\' '+srcDirectory+"/*.properties"], stdout=subprocess.PIPE)
I still have trouble with multiline values whose continuation lines don't start with a space. I just fixed them manually, but a sed command could do the trick.
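If editing the files with sed is undesirable, another option is to join the backslash continuations in Python and split on "=" yourself. The sketch below is only that: load_properties is a hypothetical helper, it covers a small subset of the Java .properties format (no ":" separators, no \uXXXX escapes, no escaped trailing backslashes), and it assumes Java's rule that leading whitespace on a continuation line is dropped:

```python
def load_properties(text):
    """Parse a small subset of the Java .properties format into a dict."""
    props = {}
    pending = ""  # logical line built up across backslash continuations

    def store(logical):
        key, _, value = logical.partition("=")
        props[key.strip()] = value.lstrip()

    for raw in text.splitlines():
        line = raw.lstrip()  # leading whitespace is dropped, also on continuation lines
        if not pending and (not line or line[0] in "#!"):
            continue  # skip blank lines and comments
        if line.endswith("\\"):
            pending += line[:-1]  # drop the backslash and keep reading
            continue
        store(pending + line)
        pending = ""
    if pending:
        store(pending)  # input ended in the middle of a continuation
    return props


sample = "messages.welcome=Hello\\\nWorld!\nmessages.bye=bye\n"
props = load_properties(sample)
print(props)
```

This leaves the .properties files untouched, so there is no section line to add and remove again afterwards.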

Format your properties file like this:
messages.welcome="""Hello
World!"""
messages.bye=bye

Give a try to ConfigParser

I don't understand anything in the Java broth, but a regex would help you, I hope:
import re
ch = '''messages.welcome=Hello
World!
messages.bye=bye'''
regx = re.compile(r'^(messages\.[^= \t]+)[ \t]*=[ \t]*(.+?)(?=^messages\.|\Z)',re.MULTILINE|re.DOTALL)
print(regx.findall(ch))
result
[('messages.welcome', 'Hello\n World! \n'), ('messages.bye', 'bye')]

How am I supposed to handle the BOM while text processing using sys.stdin in Python 3?

Note: The possible duplicate concerns an older version of Python and this question has already generated unique answers.
I have been working on a script to process Project Gutenberg texts into an internal file format for an application I am developing. In the script I process chapter headings with the re module. This works very well except in one case: the first line. If my regex includes the ^ caret to require the match to be at the beginning of the line, it always fails on a chapter marker on the first line, because the BOM is consumed as the first character. (Example regex: ^Chapter).
What I've discovered is that if I do not include the caret, it won't fail on the first line, and then <feff> is included in the heading after I've processed it. An example:
<h1><feff>Chapter I</h1>
The advice according to this SO question (from which I learned of the BOM) is to fix your script to not consume/corrupt the BOM. Other SO questions talk about decoding the file with a codec but discuss errors I never encounter and do not discuss the syntax for opening a file with the template decoder.
To be clear:
I generally use pipelines of the following format:
cat -s <filename> | <other scripts> | python <scriptname> [options] > <outfile>
And I am opening the file with the following syntax:
import sys
fin = sys.stdin
if '-i' in sys.argv: # For command line option "-i <infile>"
    fin = open(sys.argv[sys.argv.index('-i') + 1], 'rt')
for line in fin:
    ...Processing here...
My question is what is the proper way to handle this? Do I remove the BOM before processing the text? If so, how? Or do I use a decoder on the file before processing it (I am reading from stdin, so how would I accomplish this?)
The files are stored in UTF-8 encoding with DOS endings (\r\n). I convert them in vim to UNIX file format before processing using set ff=unix (I have to do several manual pre-processing tasks before running the script).
As a complement to the existing answer, it is possible to filter the UTF-8 BOM from stdin with the codecs module. You simply use sys.stdin.buffer to access the underlying byte stream and decode it with a StreamReader:
import sys
import codecs
# trick to process sys.stdin with a custom encoding
fin = codecs.getreader('utf_8_sig')(sys.stdin.buffer, errors='replace')
if '-i' in sys.argv: # For command line option "-i <infile>"
    fin = open(sys.argv[sys.argv.index('-i') + 1], 'rt',
               encoding='utf_8_sig', errors='replace')
for line in fin:
    ...Processing here...
In Python 3, stdin should be auto-decoded properly, but if it's not working for you (and for Python 2) you need to set the PYTHONIOENCODING environment variable before invoking your script, like this:
PYTHONIOENCODING="UTF-8-SIG" python <scriptname> [options] > <outfile>
Note that this setting also makes stdout use UTF-8-SIG, so your <outfile> will keep the original encoding.
For your -i parameter, just do open(path, 'rt', encoding="UTF-8-SIG")
You really don't need to import codecs or anything to deal with this. As lenz suggested in comments just check for the BOM and throw it out.
for line in input:
    if line[0] == "\ufeff":
        line = line[1:] # trim the BOM away
    # the rest of your code goes here as usual
In Python 3.9 default encoding for standard input seems to be utf-8, at least on Linux:
In [2]: import sys
In [3]: sys.stdin
Out[3]: <_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>
sys.stdin has the method reconfigure():
sys.stdin.reconfigure(encoding="utf-8-sig")
which should be called before any attempt of reading the standard input. This will decode the BOM, which will no longer appear when reading sys.stdin.
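To see the difference between the two codecs without a real pipeline, here is a small sketch that simulates the byte stream with io.BytesIO. The sample bytes are made up; only the encoding name changes between the two readers:

```python
import io
import re

# Hypothetical input: UTF-8 BOM followed by two lines with DOS endings
raw = b"\xef\xbb\xbfChapter I\r\nSome text\r\n"

# Plain utf-8 keeps the BOM as U+FEFF at the start of the first line
plain = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")
first_plain = plain.readline()

# utf-8-sig drops a leading BOM if one is present
sig = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig")
first_sig = sig.readline()

print(repr(first_plain))
print(repr(first_sig))
print(bool(re.match(r"^Chapter", first_plain)))
print(bool(re.match(r"^Chapter", first_sig)))
```

The same wrapper works on a real stream: io.TextIOWrapper(sys.stdin.buffer, encoding="utf-8-sig") behaves like the reconfigure() call, and utf-8-sig also reads files that have no BOM at all.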

ConfigParser breaking when section header itself has an ]

I am using the ConfigParser module like this:
from ConfigParser import ConfigParser
c = ConfigParser()
c.read("pymzq.ini")
However, the section names get mangled like this:
>>> c.sections()
['pyzmq:platform.architecture()[0']
for the pymzq.ini file, which has a ] in the section title that is meant to be part of the name:
[pyzmq:platform.architecture()[0] == '64bit']
url = ${pkgserver:fullurl}/pyzmq/pyzmq-2.2.0-py2.7-linux-x86_64.egg
Looks like ConfigParser uses a regex that only parses section lines up to the first closing bracket, so this is as expected.
You should be able to subclass ConfigParser/RawConfigParser and change that regexp to something that better suits your case, such as ^\[(?P<header>.+)\]$, maybe.
Thanks for the pointer @AKX, I went with:
import re
class MyConfigParser(ConfigParser):
    _SECT_TMPL = r"""
        \[                  # [
        (?P<header>[^$]+)   # one or more characters other than "$" (greedy)
        \]                  # the last ]
        """
    SECTCRE = re.compile(_SECT_TMPL, re.VERBOSE)
Please let me know if you have any better versions. Source code of the original ConfigParser.
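For Python 3, a minimal sketch of the same subclassing idea with the greedy pattern suggested above. BracketTolerantParser is a hypothetical name, and interpolation is switched off here only so that the ${...} value is returned untouched:

```python
import configparser
import re


class BracketTolerantParser(configparser.ConfigParser):
    # Greedy ".+" runs to the last "]" on the line instead of the first
    SECTCRE = re.compile(r"\[(?P<header>.+)\]")


text = """\
[pyzmq:platform.architecture()[0] == '64bit']
url = ${pkgserver:fullurl}/pyzmq/pyzmq-2.2.0-py2.7-linux-x86_64.egg
"""

parser = BracketTolerantParser(interpolation=None)
parser.read_string(text)

sections = parser.sections()
url = parser.get(sections[0], "url")
print(sections)
print(url)
```

SECTCRE is a documented attribute in the Python 3 configparser module, so overriding it in a subclass is a supported customisation rather than a hack.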

search string in text file and capture following n characters

I'm using subprocess.Popen and stdout to write the output of a curl from Backendless (BaaS).
What's written to the output file is a long single line of data, separated by commas. Here's a small portion of it.
{"APIEndpoint":"asdfaasdfa","created":1429550024000,"updated":null,"objectId":"EE51537D-A9AC-721C-FF33-F4B258931E00"...}
The value I need from this output file is the 37-character string following "objectId":". I've read many similar questions but haven't been able to find a solution to this specific one. I've tried something like:
objectid = 37
searchfile = open('backendless.txt', 'r')
for line in searchfile:
    if "\"objectId\":\"" in line:
        print(right[:objectidd])
which returns nothing. Please correct me if I'm using line incorrectly. I'm very new to this. Also, is there a way to achieve the same result without saving it to the text file first and instead performing the curl with PIPE and communicate?
I'm using Python 3.4. Thank you.
EDIT/SOLUTION:
import json
from subprocess import Popen, PIPE
baascurl = Popen(['curl', '-H', appid, '-H', secretkey, '-H', apptype, '-H', contenttype, '-X', 'GET', '-v', 'https://api.backendless.com/v1/data/Latency/last'], stdout=PIPE).communicate()[0]
objidbytes = baascurl.decode(encoding='utf-8')
objid = json.loads(objidbytes)["objectId"]
Your data looks pretty much like JSON, so maybe you can use Python's json module:
import json
objid = json.loads(datastring)["objectId"]
If you want to stay on the text level, the right tool for this job is a regular expression. Look into Python's re module.
import re
m = re.search(r'"objectId":"([^"]+)"', datastring, re.IGNORECASE)
if m:
    objid = m.group(1)
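As a quick check of the JSON approach, here is a sketch run against a made-up response body that contains only the fields visible in the question (the real output is longer):

```python
import json

# Hypothetical response body, limited to the fields shown in the question
datastring = (
    '{"APIEndpoint":"asdfaasdfa","created":1429550024000,'
    '"updated":null,"objectId":"EE51537D-A9AC-721C-FF33-F4B258931E00"}'
)

record = json.loads(datastring)
objid = record["objectId"]
print(objid, len(objid))
```

Looking the value up by key also means you do not have to count characters: the id in this sample is 36 characters long, so a fixed slice of 37 would pick up the closing quote as well.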

Replace all regex matches in a file

Consider a basic regex like a(.+?)a. How can one replace all occurrences of that regex in a file with the content of the first group?
You can use the re module for regular expressions in Python and the fileinput module to replace text in files in-place.
Example:
import fileinput
import re
fn = "test.txt" # your filename
r = re.compile('a(.+?)a')
for line in fileinput.input(fn, inplace=True):
    match = r.match(line)
    print match.group() if match else line.replace('\n', '')
Before:
hello this
aShouldBeAMatch!!!!! and this should be gone
you know
After:
hello this
aShouldBeAMa
you know
Note: this works because the argument inplace=True causes input file to be moved to a backup file and standard output is directed to the input file, as documented under Optional in-place filtering.
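If the goal is literally to replace every match with its first group, anywhere on a line, re.sub with a \1 backreference is a more direct fit. This is a sketch on an in-memory string, with replace_with_group as a hypothetical helper name:

```python
import re

pattern = re.compile(r"a(.+?)a")


def replace_with_group(text):
    # Replace every match with the contents of its first group
    return pattern.sub(r"\1", text)


sample = "hello this\naShouldBeAMatch!!!!! and this should be gone\nyou know\n"
result = replace_with_group(sample)
print(result)
```

To apply it to a file, read the whole file into a string, pass it through replace_with_group, and write the result back. Since . does not match a newline by default, no match spans more than one line.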
You can use Notepad++ with Version >= 6.0. Since then it does support PCRE Regex.
You can then use your regex a(.+?)a and replace with $1
sed
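Are you limited to using Python tools? Because sed works very well. Note that sed has no non-greedy quantifier, so a negated character class stands in for .+? here:
$ sed -E -i 's/a([^a]+)a/\1/g' <filename>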
Vim
In a Vim window, give the following search-and-replace ex command:
:%s/\va(.{-1,})a/\1/g
Note that many regex characters must be escaped in Vim; \v sets "very magic" mode, which removes the need for escaping. Vim also writes the non-greedy "one or more" as {-1,} rather than +?. The same command with "magic" (the default) is :%s/a\(.\{-1,}\)a/\1/g
Python
If you're looking to do this in Python, BigYellowCactus' answer is excellent (use the re module for regex, and fileinput to modify the file).

Python - ConfigParser throwing comments

Using the ConfigParser module, how can I filter out and discard all comments from an ini file?
import ConfigParser
config = ConfigParser.ConfigParser()
config.read("sample.cfg")
for section in config.sections():
    print section
    for option in config.options(section):
        print option, "=", config.get(section, option)
E.g. for the ini file below, the basic script above also prints the comment lines as part of the value:
something = 128 ; comment line1
; further comments
; one more line comment
What I need is only the section names and the pure key-value pairs inside them, without any comments. Does ConfigParser handle this somehow, or should I use a regexp? Cheers
According to the docs, lines starting with ; or # will be ignored. It doesn't seem like your format satisfies that requirement. Can you by any chance change the format of your input file?
Edit: since you cannot modify your input files, I'd suggest pre-parsing them with something along these lines:
tmp_fname = 'config.tmp'
with open(config_file) as old_file:
    with open(tmp_fname, 'w') as tmp_file:
        tmp_file.writelines(i.replace(';', '\n;') for i in old_file.readlines())
# then use tmp_fname with ConfigParser
Obviously, if a semicolon is present in the option values you'll have to be more creative.
Best way is to write a commentless file subclass:
# Python 2 only: the built-in file type does not exist in Python 3
class CommentlessFile(file):
    def readline(self):
        line = super(CommentlessFile, self).readline()
        if line:
            line = line.split(';', 1)[0].strip()
            return line + '\n'
        else:
            return ''
You could use it then with configparser (your code):
import ConfigParser
config = ConfigParser.ConfigParser()
config.readfp(CommentlessFile("sample.cfg"))
for section in config.sections():
    print section
    for option in config.options(section):
        print option, "=", config.get(section, option)
It seems your comments are not on lines that start with the comment leader. It should work if the comment leader is the first character on the line.
As the doc said: "(For backwards compatibility, only ; starts an inline comment, while # does not.)" So use ";" and not "#" for inline comments. It is working well for me.
Python 3 comes with a built-in solution: the class configparser.RawConfigParser has the constructor argument inline_comment_prefixes. Example:
class MyConfigParser(configparser.RawConfigParser):
    def __init__(self):
        configparser.RawConfigParser.__init__(self, inline_comment_prefixes=('#', ';'))
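A short sketch of that argument in use, passed directly to the constructor instead of through a subclass. The section name [main] and the option named other are made up, since the question does not show them:

```python
import configparser

# Hypothetical ini content modelled on the lines in the question
text = """\
[main]
something = 128 ; comment line1
; further comments
other = abc # trailing note
"""

config = configparser.RawConfigParser(inline_comment_prefixes=(";", "#"))
config.read_string(text)

items = dict(config.items("main"))
print(items)
```

An inline prefix only counts as a comment when it is preceded by whitespace, so a value such as a;b is left alone.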
