I have some large content in a text file like this:
1. name="user1” age="21”
2. name="user2” age="25”
....
Notice the special closing quote ( ” ) at the end of each value.
I just want to replace that quote ( ” ) with a normal quote (").
Code:
import codecs
f = codecs.open('myfile.txt',encoding='utf-8')
for line in f:
    print "str text : ",line
    a = repr(line)
    print "repr text : ",a
    x = a.replace(u'\u201d', '"')
    print "new text : ",x
Output:
str text : 1. name="user1” age="21”
repr text : u'1. name="user1\u201d age="21\u201d\n'
new text : u'1. name="user1\u201d age="21\u201d\n'
but it's not working. What am I missing here?
Update :
I just tried this:
import codecs
f = codecs.open('one.txt')
for line in f:
    print "str text : ",line
    y= line.replace("\xe2\x80\x9d", '"')
    print "ynew text : ",y
and it is working now.
Still, I want to know what was wrong with x = a.replace(u'\u201d', '"')
a is the repr of the line, which does not contain the character ” itself; it contains the six-character escape sequence \u201d (the characters \, u, 2, 0, 1, d).
So changing a = repr(line) to a = line will fix the problem.
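To make the difference concrete, here is a minimal Python 3 sketch (the question itself uses Python 2). It assumes ascii() as the Python 3 stand-in for what Python 2's repr() did to a unicode string; the sample line is copied from the question and the variable names are only illustrative.

```python
# Sample line from the question, containing U+201D (right double quotation mark).
line = '1. name="user1\u201d age="21\u201d\n'

# ascii() escapes non-ASCII characters, much like Python 2's repr() did.
escaped = ascii(line)

# The escaped text holds a backslash followed by u, 2, 0, 1, d, not the quote
# character itself, so replacing the real character on it changes nothing.
unchanged = escaped.replace('\u201d', '"')

# Replacing on the original string works as intended.
fixed = line.replace('\u201d', '"')
print(fixed)
```

Running replace on the decoded line rather than on its escaped representation is the whole fix.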
I am using the following code to find the start index of some strings, as well as a temperature, all of which are read from a text file.
The array searchString contains what I'm looking for. The code does locate the index of the first character of each string. The issue is that unless I put a backslash in front of the string (\+25°C), finditer gives an error.
(Alternatively, if I remove the + sign, it works, but I need to look for the specific +25.) My question is whether I am correctly escaping the + sign, since the line: print('Looking for: ' + headerName + ' in the file: ' + filename )
displays: Looking for: \+25°C in the file: 123.txt (with the backslash showing in front of the +)
Am I just 'getting away with this', or is this escaping as it should?
thanks
import re
path = 'C:\mypath\\'
searchString =["Power","Cal", "test", "Frequency", "Max", "\+25°C"]
filename = '123.txt' # file name to check for text
def search_str(file_path):
    with open(file_path, 'r') as file:
        content = file.read()
        for headerName in searchString:
            print('Looking for: ' + headerName + ' in the file: ' + filename )
            match =re.finditer(headerName, content)
            sub_indices=[]
            for temp in match:
                index = temp.start()
                sub_indices.append(index)
            print(sub_indices ,'\n')
You should use the re.escape() function to escape your string pattern. It will escape all the special characters in given string, for example:
>>> print(re.escape('+25°C'))
\+25°C
>>> print(re.escape('my_pattern with specials+&$#('))
my_pattern\ with\ specials\+\&\$\#\(
So replace your searchString with literal strings and try it with:
def search_str(file_path):
    with open(file_path, 'r') as file:
        content = file.read()
        for headerName in searchString:
            print('Looking for: ' + headerName + ' in the file: ' + filename )
            match =re.finditer(re.escape(headerName), content)
            sub_indices=[]
            for temp in match:
                index = temp.start()
                sub_indices.append(index)
            print(sub_indices ,'\n')
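As a small self-contained check of the same idea, the sketch below keeps the search strings as plain literals and escapes them only at the point of use. The helper name find_indices and the sample content are made up for illustration; they are not part of the original script.

```python
import re

def find_indices(needle, haystack):
    """Return the start index of every literal occurrence of needle."""
    # re.escape() neutralises regex metacharacters such as '+', so the
    # needle is matched character for character.
    return [m.start() for m in re.finditer(re.escape(needle), haystack)]

# Hypothetical file content, used only for this demonstration.
content = "Temp +25°C ok, Max +25°C"

plus_hits = find_indices("+25°C", content)
max_hits = find_indices("Max", content)
print(plus_hits, max_hits)
```

Because the literal is escaped only when the pattern is built, the "Looking for" message can print the string exactly as written, with no backslash.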
I have a text file to convert to YAML format. Here are some notes to describe the problem a little better:
The sections within the file have a different number of subheadings to each other.
The values of the subheadings can be any data type (e.g. string, bool, int, double, datetime).
The file is approximately 2,000 lines long.
An example of the format is below:
file_content = '''
Section section_1
section_1_subheading1 = text
section_1_subheading2 = bool
end
Section section_2
section_2_subheading3 = int
section_2_subheading4 = double
section_2_subheading5 = bool
section_2_subheading6 = text
section_2_subheading7 = datetime
end
Section section_3
section_3_subheading8 = numeric
section_3_subheading9 = int
end
'''
I have tried to convert the text to YAML format by:
1. Replacing the equal signs with colons using regex.
2. Replacing Section section_name with section_name :.
3. Removing end between each section.
However, I am having difficulty with #2 and #3. This is the text-to-YAML function I have created so far:
import yaml
import re
def convert_txt_to_yaml(file_content):
    """Converts a text file to a YAML file"""
    # Replace "=" with ":"
    file_content2 = file_content.replace("=", ":")
    # Split the lines
    lines = file_content2.splitlines()
    # Define section headings to find and replace
    section_names = "Section "
    section_headings = r"(?<=Section )(.*)$"
    section_colons = r"\1 : "
    end_names = "end"
    # Convert to YAML format, line-by-line
    for line in lines:
        add_colon = re.sub(section_headings, section_colons, line) # Add colon to end of section name
        remove_section_word = re.sub(section_names, "", add_colon) # Remove "Section " in section header
        line = re.sub(end_names, "", remove_section_word) # Remove "end" between sections
    # Join lines back together
    converted_file = "\n".join(lines)
    return converted_file
I believe the problem is within the for loop - I can't manage to figure out why the section headers and endings aren't changing. It prints perfectly if I test it, but the lines themselves aren't saving.
The output format I am looking for is the following:
file_content = '''
section_1 :
section_1_subheading1 : text
section_1_subheading2 : bool
section_2 :
section_2_subheading3 : int
section_2_subheading4 : double
section_2_subheading5 : bool
section_2_subheading6 : text
section_2_subheading7 : datetime
section_3 :
section_3_subheading8 : numeric
section_3_subheading9 : int
'''
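On the loop in the function above: assigning to line inside the for loop only rebinds the loop variable, so the strings stored in lines are never replaced and the final join sees the original text. A minimal sketch of the usual remedy, collecting the converted lines in a new list, is below. The function name convert_lines and the two-space indent for subheadings are assumptions made for illustration, not part of the original code.

```python
import re

def convert_lines(file_content):
    """Return the converted text; results are collected in a new list."""
    converted = []
    for line in file_content.splitlines():
        stripped = line.strip()
        if not stripped or stripped == "end":
            continue  # drop blank lines and the "end" markers
        header = re.match(r"Section\s+(\S+)$", stripped)
        if header:
            # "Section section_1" becomes "section_1 :"
            converted.append(header.group(1) + " :")
        else:
            key, _, value = stripped.partition("=")
            # Assumed two-space indent so the key sits under its section.
            converted.append("  " + key.strip() + " : " + value.strip())
    return "\n".join(converted)

sample = (
    "Section section_1\n"
    "section_1_subheading1 = text\n"
    "section_1_subheading2 = bool\n"
    "end\n"
)
result = convert_lines(sample)
print(result)
```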
I would rather convert it to a dict and then format it as YAML using the yaml package in Python, as below:
import re
import yaml
def convert_txt_to_yaml(file_content):
    """Converts a text file to a YAML file"""
    config_dict = {}
    # Split the lines
    lines = file_content.splitlines()
    section_title = None
    for line in lines:
        if not line.strip():
            # Skip blank lines (splitlines() has already removed the newlines)
            continue
        elif re.match(r'.*end$', line):
            # End of section
            section_title = None
        elif re.match(r'.*Section\s+.*', line):
            # Start of section
            match_obj = re.match(r".*Section\s+(.*)", line)
            section_title = match_obj.groups()[0]
            config_dict[section_title] = {}
        elif section_title and re.match(r".*{}_.*\s+=.*".format(section_title), line):
            # Key is the part after "<section>_", value is everything after "="
            match_obj = re.match(r".*{}_(.*)\s+=(.*)".format(section_title), line)
            config_dict[section_title][match_obj.groups()[0]] = match_obj.groups()[1].strip()
    return yaml.dump(config_dict)
I have the below code in one of my configuration files:
appPackage_name = sqlncli
appPackage_version = 11.3.6538.0
The left side is the key and the right side is value.
Now I want to be able to replace the value part with something else, given a key, in Python.
import re
Filepath = r"C:\Users\bhatsubh\Desktop\Everything\Codes\Python\OO_CONF.conf"
key = "appPackage_name"
value = "Subhayan"
searchstr = re.escape(key) + " = [\da-zA-Z]+"
replacestr = re.escape(key) + " = " + re.escape(value)
filedata = ""
with open(Filepath,'r') as File:
    filedata = File.read()
File.close()
print ("Before change:",filedata)
re.sub(searchstr,replacestr,filedata)
print ("After change:",filedata)
I assume there is something wrong with the regex I am using, but I am not able to figure out what. Can someone please help me?
Use the following fix:
import re
#Filepath = r"C:\Users\bhatsubh\Desktop\Everything\Codes\Python\OO_CONF.conf"
key = "appPackage_name"
value = "Subhayan"
#searchstr = re.escape(key) + " = [\da-zA-Z]+"
#replacestr = re.escape(key) + " = " + re.escape(value)
searchstr = r"({} *= *)[\da-zA-Z.]+".format(re.escape(key))
replacestr = r"\1{}".format(value)
filedata = "appPackage_name = sqlncli"
#with open(Filepath,'r') as File:
# filedata = File.read()
#File.close()
print ("Before change:",filedata)
filedata = re.sub(searchstr,replacestr,filedata)
print ("After change:",filedata)
See the Python demo
There are several issues. You should not escape the replacement pattern, only the literal user-defined values in the regex pattern. You can use a capturing group (a pair of unescaped (...)) and a backreference (here, \1, since it is the only group in the pattern) to restore the part of the matched string you need to keep, rather than building that replacement string dynamically. As the version value contains dots, you should add a . to the character class, [\da-zA-Z.]. You also need to assign the result of re.sub back to the variable, because re.sub returns a new string and does not modify its argument.
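One caveat with a replacement built as r"\1{}".format(value): if value begins with a digit, as a version number would, the digits run into the backreference and it is no longer read as a reference to group 1. The sketch below sidesteps this by passing a function as the replacement, so the new value is never parsed as a template. The helper name set_value is hypothetical, and the sample data is the two-line config from the question.

```python
import re

def set_value(filedata, key, value):
    """Replace the value of `key` in 'key = value' style text."""
    # Only the user-supplied key is escaped; group 1 keeps "key = " intact.
    pattern = r"({} *= *)[\da-zA-Z.]+".format(re.escape(key))
    # A callable replacement is inserted literally, so a value such as
    # "12.0" cannot be mistaken for part of a backreference.
    return re.sub(pattern, lambda m: m.group(1) + value, filedata)

config = "appPackage_name = sqlncli\nappPackage_version = 11.3.6538.0"

renamed = set_value(config, "appPackage_name", "Subhayan")
bumped = set_value(config, "appPackage_version", "12.0")
print(renamed)
print(bumped)
```

Writing the backreference as \g<1> in the template string is another way to avoid the ambiguity.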
I have some input data from ASCII files which uses double quote to encapsulate string as well as still use double quote inside those strings, for example:
"Reliable" "Africa" 567.87 "Bob" "" "" "" "S 05`56'21.844"" "No Shift"
Notice the double quote used in the coordinate.
So I have been using:
valList = shlex.split(line)
But shlex gets confused by the second double quote in the coordinate.
I've been doing a find and replace on '\"\"' to '\\\"\"'. This of course turns empty strings into \"" as well, so I then do a find and replace on (this time with spaces) ' \\\"\" ' to ' \"\"" '. Not exactly the most efficient way of doing it!
Any suggestions on handling this double quote in the coordinate?
I would do it this way:
I would treat this line of text as a csv file. Then according to RFC 4180 :
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
Then all you would need to do is add another " to your coordinates, so that the field looks like this: "S 05`56'21.844""" (note the extra quote). Then you can use the standard csv module to break the line apart and extract the necessary information.
>>> from StringIO import StringIO
>>> import csv
>>>
>>> test = '''"Reliable" "Africa" 567.87 "Bob" "" "" "" "S 05`56'21.844""" "No Shift"'''
>>> test_obj = StringIO(test)
>>> reader = csv.reader(test_obj, delimiter=' ', quotechar='"', quoting=csv.QUOTE_ALL)
>>> for i in reader:
...     print i
...
The output would be :
['Reliable', 'Africa', '567.87', 'Bob', '', '', '', 'S 05`56\'21.844"', 'No Shift']
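For reference, here is a Python 3 sketch of the same approach (io.StringIO replaces the Python 2 StringIO module). It also adds the extra quote programmatically with a regex; that regex is a heuristic based only on the sample line in the question, so treat it as an assumption about the data rather than a general rule.

```python
import csv
import io
import re

# The raw line from the question, with the unescaped quote in the coordinate.
raw = '''"Reliable" "Africa" 567.87 "Bob" "" "" "" "S 05`56'21.844"" "No Shift"'''

# Heuristic (assumption): a "" that directly follows a non-space, non-quote
# character and ends a field is an embedded quote plus the closing quote,
# so the embedded one is doubled as RFC 4180 requires.
fixed = re.sub(r'(?<=[^\s"])""(?=\s|$)', '"""', raw)

reader = csv.reader(io.StringIO(fixed), delimiter=' ', quotechar='"')
fields = next(reader)
print(fields)
```

Empty fields are left alone because their "" is preceded by a space, which the lookbehind rejects.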
I'm not good with regexes, but this non-regex suggestion might help ...
INPUT = ('"Reliable" "Africa" 567.87 "Bob" "" "" "" "S 05`56'
         "'"
         '21.844"" "No Shift"')
def main(input):
    output = input
    # Temporary marker for the quotes that enclose a field
    surrounding_quote_symbol = '<!>'
    if input.startswith('"'):
        output = '%s%s' % (surrounding_quote_symbol, output[1:])
    if input.endswith('"'):
        output = '%s%s' % (output[:-1], surrounding_quote_symbol)
    output = output.replace('" ', '%s ' % surrounding_quote_symbol)
    output = output.replace(' "', ' %s' % surrounding_quote_symbol)
    print "Stage 1:", output
    # Any quote still left is inside a field: escape it with a backslash
    # ('\"' is just '"', so the backslash itself has to be escaped)
    output = output.replace('"', '\\"')
    # Restore the enclosing quotes
    output = output.replace(surrounding_quote_symbol, '"')
    return output
if __name__ == "__main__":
    output = main(INPUT)
    print "End results:", output
I've got a script to do search and replace. It's based on a script here.
It was modified to accept a file as input, but it does not seem to handle regexes well.
The script:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys, os
import re
import glob
_replacements = {
    '[B]': '**',
    '[/B]': '**',
    '[I]': '//',
    '[/I]': '//',
}
def _do_replace(match):
    return _replacements.get(match.group(0))
def replace_tags(text, _re=re.compile('|'.join((r) for r in _replacements))):
    return _re.sub(_do_replace, text)
def getfilecont(FN):
    if not glob.glob(FN): return -1 # No such file
    text = open(FN, 'rt').read()
    text = replace_tags(text, re.compile('|'.join(re.escape(r) for r in _replacements)))
    return replace_tags(text)
scriptName = os.path.basename(sys.argv[0])
if sys.argv[1:]:
    srcfile = glob.glob(sys.argv[1])[0]
else:
    print """%s: Error you must specify file, to convert forum tages to wiki tags!
Type %s FILENAME """ % (scriptName, scriptName)
    exit(1)
dstfile = os.path.join('.' , os.path.basename(srcfile)+'_wiki.txt')
converted = getfilecont(srcfile)
try:
    open(dstfile, 'wt+').write(converted)
    print 'Done.'
except:
    print 'Error saving file %s' % dstfile
    print converted
#print replace_tags("This is an [[example]] sentence. It is [[{{awesome}}]].")
What I want is to replace
'[B]': '**',
'[/B]': '**',
with only one line like this as in regex
\[B\](.*?)\[\/B\] : **\1**
That would be very helpful with BBCode tags like this:
[FONT=Arial]Hello, how are you?[/FONT]
Then I can use something like this
\[FONT=(.*?)\](.*?)\[\/FONT\] : ''\2''
But I can't seem to do that with this script. There are other ways to do regex search and replace in the original source of this script, but they work on one tag at a time using re.sub. The other advantage of this script is that I can add as many lines as I want, so I can update it later.
For starters, you're escaping the patterns on this line:
text = replace_tags(text, re.compile('|'.join(re.escape(r) for r in _replacements)))
re.escape takes a string and escapes it in such a way that if the new string were used as a regex it would match exactly the input string.
Removing the re.escape won't entirely solve your problem, however, as you find the replacement by just doing a lookup of the matched text in your dict on this line:
return _replacements.get(match.group(0))
To fix this you could make each pattern into its own capture group:
text = replace_tags(text, re.compile('|'.join('(%s)' % r for r in _replacements)))
You'll also need to know which pattern goes with which substitution. Something like this might work:
_replacements_dict = {
    '[B]': '**',
    '[/B]': '**',
    '[I]': '//',
    '[/I]': '//',
}
_replacements, _subs = zip(*_replacements_dict.items())
def _do_replace(match):
    # match.groups() has one entry per pattern; only the one that matched is non-empty
    for i, group in enumerate(match.groups()):
        if group:
            return _subs[i]
Note that this changes _replacements into a list of the patterns, and creates a parallel array _subs for the actual replacements. (I would have named them regexes and replacements, but didn't want to have to re-edit every occurrence of "_replacements").
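A compact Python 3 sketch of this parallel-list idea follows. It assumes the tags are plain literals, so each one is passed through re.escape before being wrapped in its own group, and it uses match.lastindex to find which group matched instead of looping over the groups. The names patterns, subs and combined are illustrative only.

```python
import re

replacements = {
    '[B]': '**',
    '[/B]': '**',
    '[I]': '//',
    '[/I]': '//',
}

# Parallel sequences: patterns[i] is replaced by subs[i].
patterns, subs = zip(*replacements.items())

# One capture group per literal tag, joined into a single alternation.
combined = re.compile('|'.join('(%s)' % re.escape(p) for p in patterns))

def replace_tags(text):
    # lastindex is the 1-based number of the group that matched.
    return combined.sub(lambda m: subs[m.lastindex - 1], text)

converted = replace_tags("[B]bold[/B] and [I]italic[/I]")
print(converted)
```

This only works while each pattern contributes exactly one group; patterns that carry their own capture groups would shift the numbering.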
Someone has done it here.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys, os
import re
import glob
_replacements_dict = {
    '\[B\]': '**',
    '\[\/B\]': '**',
    '\[I\]': '//',
    '\[\/I\]': '//',
    '\[IMG\]' : '{{',
    '\[\/IMG\]' : '}}',
    '\[URL=(.*?)\]\s*(.*?)\s*\[\/URL\]' : r'[[\1|\2]]',
    '\[URL\]\s*(.*?)\s*\[\/URL\]' : r'[[\1]]',
    '\[FONT=(.*?)\]' : '',
    '\[color=(.*?)\]' : '',
    '\[SIZE=(.*?)\]' : '',
    '\[CENTER]' : '',
    '\[\/CENTER]' : '',
    '\[\/FONT\]' : '',
    '\[\/color\]' : '',
    '\[\/size\]' : '',
}
_replacements, _subs = zip(*_replacements_dict.items())
def replace_tags(text):
    for i, _s in enumerate(_replacements):
        tag_re = re.compile(r''+_s, re.I)
        text, n = tag_re.subn(r''+_subs[i], text)
    return text
def getfilecont(FN):
    if not glob.glob(FN): return -1 # No such file
    text = open(FN, 'rt').read()
    return replace_tags(text)
scriptName = os.path.basename(sys.argv[0])
if sys.argv[1:]:
    srcfile = glob.glob(sys.argv[1])[0]
else:
    print """%s: Error you must specify file, to convert forum tages to wiki tags!
Type %s FILENAME """ % (scriptName, scriptName)
    exit(1)
dstfile = os.path.join('.' , os.path.basename(srcfile)+'_wiki.txt')
converted = getfilecont(srcfile)
try:
    open(dstfile, 'wt+').write(converted)
    print 'Done.'
except:
    print 'Error saving file %s' % dstfile
    #print converted
#print replace_tags("This is an [[example]] sentence. It is [[{{awesome}}]].")
http://pastie.org/1447448
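For readers on Python 3, a condensed sketch of the same rule-table idea is below. It uses an ordered list of (pattern, replacement) pairs so that paired tags can be handled with backreferences, as the question asked. The rule set is a small illustrative subset built from the tags mentioned in this thread, and the names RULES, COMPILED and convert_tags are assumptions, not part of the original script.

```python
import re

# Ordered (pattern, replacement) pairs; raw strings keep the backslashes intact.
RULES = [
    (r'\[B\](.*?)\[/B\]', r'**\1**'),
    (r'\[I\](.*?)\[/I\]', r'//\1//'),
    (r'\[URL=(.*?)\]\s*(.*?)\s*\[/URL\]', r'[[\1|\2]]'),
    (r'\[FONT=(.*?)\](.*?)\[/FONT\]', r"''\2''"),
]

# re.I matches tags in any case; re.S lets a tag pair span several lines.
COMPILED = [(re.compile(pattern, re.I | re.S), sub) for pattern, sub in RULES]

def convert_tags(text):
    """Apply every rule in order and return the converted text."""
    for regex, sub in COMPILED:
        text = regex.sub(sub, text)
    return text

sample = "[B]hi[/B] [font=Arial]Hello[/font] [URL=http://a.b] site [/URL]"
wiki = convert_tags(sample)
print(wiki)
```

New tags can be supported by appending another pair to the list, which keeps the "add as many lines as I want" property the question asked for.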