Replace decimals within a special character in file - python

I am trying to read in a file and strip the decimal part from every number that sits between thorn characters (þ), so that:
þ219.91þ
þ122.1919þ
þ467.426þ
þ104.351þ
þ104.0443þ
will become
þ219þ
þ122þ
þ467þ
þ104þ
þ104þ
The regex replacement below works in Notepad++, and I am trying to replicate it in Python, but my Python code (also below) is not working. Any suggestions?
In Notepad++:
Find: (\xFE\d+)\.\d+(\xFE)
Replace: $1$2
Python:
for line in file:
    line = re.sub("(\xFE\d+)\.\d+(\xFE)", "\xFE\d+\xFE", line)

You shouldn't need \xFE; matching the þ character directly, with backreferences in the replacement, should simply work:
import re
regex = r"(þ\d+)\.\d+(þ)"
test_str = ("þ219.91þ\n"
            "þ122.1919þ\n"
            "þ467.426þ\n"
            "þ104.351þ\n"
            "þ104.0443þ")
subst = r"\1\2"
# You can limit the number of replacements with the count argument (0 replaces all)
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
if result:
    print(result)
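Applied to a whole file, the same substitution might look like the sketch below. The helper names, the file paths passed to convert_file, and the latin-1 encoding are assumptions for illustration (in Latin-1, þ is the single byte 0xFE, which is what \xFE in the Notepad++ pattern suggests); adjust them to the real file.

```python
import re

# þ is U+00FE; in a Latin-1 encoded file it is stored as the single byte 0xFE.
THORN_DECIMAL = re.compile(r"(þ\d+)\.\d+(þ)")


def truncate_thorn_decimals(text):
    """Drop the fractional part of every number enclosed in þ...þ."""
    return THORN_DECIMAL.sub(r"\1\2", text)


def convert_file(src, dst, encoding="latin-1"):
    """Copy src to dst line by line, truncating þ-delimited decimals.

    src, dst and the default encoding are hypothetical; adjust as needed.
    """
    with open(src, encoding=encoding) as fin, \
            open(dst, "w", encoding=encoding) as fout:
        for line in fin:
            fout.write(truncate_thorn_decimals(line))


print(truncate_thorn_decimals("þ219.91þ,þabcþ,þ104.0443þ"))
```

Fields that are not numbers, such as þabcþ, are left untouched because the pattern requires digits on both sides of the dot.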

You're not replacing decimals: you're truncating the values. Would a numeric approach work for you? This assumes that every line has exactly the format you show.
for line in file:
    _, val, _ = line.split('þ')  # empty string, value, rest of the line
    # int() cannot parse '219.91' directly, so go through float() first
    line = 'þ' + str(int(float(val))) + 'þ'
Note that you could reduce this to a single line in the loop:
    line = 'þ' + str(int(float(line.split('þ')[1]))) + 'þ'

You could use a one-liner such as:
f = ["þ219.91þ", "þ122.1919þ", "þ467.426þ", "þ104.351þ", "þ104.0443þ"]
print(["þ{}þ".format(int(float(l.strip("þ")))) for l in f])
Result:
['þ219þ', 'þ122þ', 'þ467þ', 'þ104þ', 'þ104þ']

Related

Find, decode and replace all base64 values in text file

I have a SQL dump file that contains text with html links like:
<a href="http://blahblah.org/kb/getattachment.php?data=NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY=">attached file</a>
I'd like to find, decode and replace the base64 part of the text in each of these links.
I've been trying to use Python w/ regular expressions and base64 to do the job. However, my regex skills are not up to the task.
I need to select any string that starts with
'getattachment.php?data='
and ends with
'"'
I then need to decode the part between 'data=' and '&quot' using base64.b64decode()
results should look something like:
<a href="http://blahblah.org/kb/4/Topcon_data-download_howto.pdf">attached file</a>
I think the solution will look something like:
import re
import base64
with open('phpkb_articles.sql') as f:
    for line in f:
        re.sub(some_regex_expression_here, some_function_here_to_decode_base64)
Any ideas?
EDIT: Answer for anyone who's interested.
import re
import base64
import sys
def decode_base64(s):
    """
    Decode the base64 payload of a regex match into a file path
    """
    # fix escaped equal signs in some base64 strings
    base64_string = s.group(1).replace('%3D', '=')
    # b64decode returns bytes on Python 3, so decode to text
    decodedString = base64.b64decode(base64_string).decode('ascii')
    # substitute '/' for '|'
    decodedString = decodedString.replace('|', '/')
    # escape the spaces in file names
    decodedString = decodedString.replace(' ', '%20')
    # print('assets/' + decodedString + '&quot')  # Print for debug
    return 'assets/' + decodedString + '&quot'
count = 0
pattern = r'getattachment\.php\?data=([^&]+?)&quot'
# Open the file and read line by line
with open('phpkb_articles.sql') as f:
    for line in f:
        try:
            # globally substitute in new file path
            edited_line = re.sub(pattern, decode_base64, line)
            # output the edited line to standard out
            sys.stdout.write(edited_line)
        except (TypeError, ValueError):
            # output unedited line if decoding fails to prevent corruption
            sys.stdout.write(line)
        # print(line)
        count += 1
You already have most of it; you just need the small pieces.
The pattern r'data=([^&]+?)&quot' will match anything after data= and before &quot (this assumes the quotes in your dump are escaped as &quot;, as your description implies):
>>> import re
>>> pat = r'data=([^&]+?)&quot'
>>> line = '<a href=&quot;http://blahblah.org/kb/getattachment.php?data=NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY=&quot;>attached file</a>'
>>> decodeString = re.search(pat, line).group(1)  # the b64 string is captured by the group, so we only want group(1)
>>> decodeString
'NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY='
You can then use the str.replace() method together with base64.b64decode() to finish the rest. I don't want to just write your code for you, but this should give you a good idea of where to go.
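For reference, the remaining steps could be sketched roughly as follows. This is only an illustration: decode_link and PATTERN are made-up names, and the sample line assumes the quotes in the dump are escaped as &quot; as in the pattern above.

```python
import base64
import re

# Assumption: quotes in the dump are escaped as &quot;
PATTERN = re.compile(r'getattachment\.php\?data=([^&]+?)&quot')


def decode_link(match):
    """Replace the base64 payload with the decoded path, e.g. '4/name.pdf'."""
    payload = match.group(1).replace('%3D', '=')   # undo URL-escaped padding
    decoded = base64.b64decode(payload).decode('ascii')
    return decoded.replace('|', '/') + '&quot'     # put back the consumed &quot


line = ('<a href=&quot;http://blahblah.org/kb/getattachment.php'
        '?data=NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY=&quot;>attached file</a>')
result = PATTERN.sub(decode_link, line)
print(result)
```

Passing a function as the replacement lets re.sub call it once per match, so every link on a line is decoded independently.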

Python - How to make sure that a line being read from a file contain only a given string and nothing else

In order to make sure I start and stop reading a text file exactly where I want to, I am providing 'start1'<->'end1', 'start2'<->'end2' as tags in between the text file and providing that to my python script. In my script I read it as:
start_end = ['start1','end1']
line_num = []
with open(file_path) as fp1:
    for num, line in enumerate(fp1, 1):
        for i in start_end:
            if i in line:
                line_num.append(num)
fp1.close()
print('\nLine number: ', line_num)
fp2 = open(file_path)
for k, line2 in enumerate(fp2):
    for x in range(line_num[0], line_num[1] - 1):
        if k == x:
            header.append(line2)
fp2.close()
This works well until I reach start10 <-> end10 and beyond. For example, when it checks whether "start2" is in the line, it also matches lines containing "start21", and likewise for the end tag. So providing "start1, end1" as input also reads "start10, end10". If I replace the line:
if i in line:
with
if i == line:
it throws an error.
How can I make sure that the script reads the line that contains ONLY "start1" and not "start10"?
import re
prog = re.compile('start1$')
if prog.match(line):
    print(line)
prog.match(line) returns None if there is no match and a match object if the line matches the compiled regex. match() anchors the pattern at the start of the line, and the '$' anchors it at the end (just before a trailing newline), so 'start1' matches but 'start10' doesn't.
Or, wrapped in a function:
def test(line):
    import re
    prog = re.compile('start1$')
    return prog.match(line) is not None
>>> test('start1')
True
>>> test('start10')
False
Since your markers are always at the end of the line, change:
start_end = ['start1','end1']
to:
start_end = ['start1\n','end1\n']
You probably want to look into regular expressions. The Python re library has some good regex tools. It would let you define a string to compare your line to and it has the ability to check for start and end of lines.
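As a rough sketch of that idea, the helper below (find_tag_lines is a made-up name, and the sample lines are invented) reports only the lines that consist of exactly one of the given tags, using re.fullmatch so that "start1" cannot match "start10":

```python
import re


def find_tag_lines(lines, tags):
    """Return 1-based numbers of lines that consist solely of one of tags."""
    # re.escape guards against tags that contain regex metacharacters.
    pattern = re.compile("|".join(re.escape(tag) for tag in tags))
    return [num for num, line in enumerate(lines, 1)
            if pattern.fullmatch(line.strip())]


# Hypothetical file contents for illustration.
sample = ["intro\n", "start1\n", "body\n", "end1\n",
          "start10\n", "more\n", "end10\n"]
print(find_tag_lines(sample, ["start1", "end1"]))
```

Stripping each line first means trailing newlines and stray spaces around a tag do not prevent a match.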
If you can control the input file, consider adding an underscore (or any non-number character) to the end of each tag.
'start1_'<->'end1_'
'start10_'<->'end10_'
The regular expression solution presented in other answers is more elegant, but requires using regular expressions.
You can do this with find():
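for num, line in enumerate(fp1, 1):
    for i in start_end:
        if i in line:
            # make sure the character after the tag isn't another digit
            # (slicing avoids an IndexError when the tag ends the line)
            end = line.find(i) + len(i)
            if not line[end:end + 1].isdigit():
                line_num.append(num)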

How to remove newlines and indents from a string in Python?

In my Python script, I have a SQL statement that goes on forever like so:
query = """
    SELECT * FROM
        many_many
        tables
    WHERE
        this = that,
        a_bunch_of = other_conditions
"""
What's the best way to get this to read like a single line? I tried this:
def formattedQuery(query):
    lines = query.split('\n')
    for line in lines:
        line = line.lstrip()
        line = line.rstrip()
    return ' '.join(lines)
and it did remove newlines but not spaces from the indents. Please help!
You could do this:
query = " ".join(query.split())
but it will not work very well if your SQL queries contain string literals with runs of spaces or tabs (for example select * from users where name = 'Jura X' with several spaces inside the quotes), because the whitespace inside the literal is collapsed too. Other solutions that use string.replace or regular expressions have the same problem. So your approach is not too bad, but your code needs to be fixed.
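To see the caveat about string literals concretely, here is a small sketch comparing the two approaches. The query and the function names collapse_whitespace and strip_lines are invented for illustration.

```python
def collapse_whitespace(sql):
    """Collapse every run of whitespace, including runs inside literals."""
    return " ".join(sql.split())


def strip_lines(sql):
    """Strip each line and join the non-empty ones; literals are untouched."""
    return " ".join(line.strip() for line in sql.splitlines() if line.strip())


# Hypothetical query with four spaces inside the string literal.
query = """
    SELECT * FROM users
    WHERE name = 'Jura    X'
"""
print(collapse_whitespace(query))  # literal is changed to 'Jura X'
print(strip_lines(query))          # literal keeps its four spaces
```

The per-line version is only safe as long as no string literal spans several lines.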
What is actually wrong with your function: reassigning line inside the loop does not change the list, so the return values of lstrip and rstrip are discarded and you join the original lines. You could fix it like this:
def formattedQuery(query):
    lines = query.split('\n')
    r = []
    for line in lines:
        line = line.lstrip()
        line = line.rstrip()
        r.append(line)
    return ' '.join(r)
Another way of doing it:
def formattedQuery(q): return " ".join([s.strip() for s in q.splitlines()])
Another one-liner:
>>> import re
>>> re.sub(r'\s+', ' ', query).strip()
'SELECT * FROM many_many tables WHERE this = that, a_bunch_of = other_conditions'
This replaces each run of whitespace characters in the string query with a single ' ', then strips the leading and trailing space.
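str.translate can remove characters. On Python 3, build the table with str.maketrans, whose third argument lists the characters to delete (on Python 2 the equivalent was string.translate(query, None, "\n\t")). Note that the newlines and tabs are deleted rather than replaced by spaces, so words on adjacent lines can run together:
query.translate(str.maketrans('', '', "\n\t"))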

delete only lines after match1 up to match2

I have checked and played with various examples and it appears that my problem is a bit more complex than what I have been able to find. What I need to do is search for a particular string and then delete the following line and keep deleting lines until another string is found. So an example would be the following:
a
b
color [
0 0 0,
1 1 1,
3 3 3,
] #color
y
z
Here, "color [" is match1, and "] #color" is match2. So then what is desired is the following:
a
b
color [
] #color
y
z
This "simple to follow" code example will get you started .. you can tweak it as needed. Note that it processes the file line-by-line, so this will work with any size file.
start_marker = 'startdel'
end_marker = 'enddel'
with open('data.txt') as inf:
    ignoreLines = False
    for line in inf:
        if start_marker in line:
            print(line, end='')
            ignoreLines = True
        if end_marker in line:
            ignoreLines = False
        if not ignoreLines:
            print(line, end='')
It uses startdel and enddel as "markers" for starting and ending the ignoring of data.
Update:
Modified code based on a request in the comments, this will now include/print the lines that contain the "markers".
Given this input data (borrowed from @drewk):
Beginning of the file...
stuff
startdel
delete this line
delete this line also
enddel
stuff as well
the rest of the file...
it yields:
Beginning of the file...
stuff
startdel
enddel
stuff as well
the rest of the file...
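The same line-by-line idea, applied to the "color [" example from the question, might be sketched as a generator. The name drop_between is made up, and the markers are matched with a plain substring test, which is an assumption about how strict the matching needs to be.

```python
def drop_between(lines, start_marker, end_marker):
    """Yield lines, skipping those strictly between the two marker lines."""
    skipping = False
    for line in lines:
        if skipping and end_marker in line:
            skipping = False          # the end marker line itself is kept
        if not skipping:
            yield line
        if start_marker in line:
            skipping = True           # start skipping after the start marker


text = """a
b
color [
0 0 0,
1 1 1,
3 3 3,
] #color
y
z"""
result = "\n".join(drop_between(text.split("\n"), "color [", "] #color"))
print(result)
```

Because it only keeps one flag of state and handles one line at a time, it can be fed a file object directly for files of any size.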
You can do this with a single regex by using the non-greedy *?. E.g., assuming you want to keep both the "look for this line" and the "until this line is found" lines, and discard only the lines in between, you could do:
>>> my_regex = re.compile("(look for this line)"+
...                       ".*?"+  # match as few chars as possible
...                       "(until this line is found)",
...                       re.DOTALL)
>>> new_str = my_regex.sub(r"\1\n\2", old_str)
A few notes:
The re.DOTALL flag tells Python that "." can match newlines -- by default it matches any character except a newline
The parentheses define "numbered match groups", which are then used later when I say r"\1\n\2" to make sure that we don't discard the first and last line. The replacement must be a raw string (otherwise "\1" is the control character chr(1)), and the \n restores the line break that .*? consumed. If you did want to discard either or both of those lines, then just get rid of the \1 and/or the \2. E.g., to keep the first but not the last use my_regex.sub(r"\1", old_str); or to get rid of both use my_regex.sub("", old_str)
For further explanation, see: http://docs.python.org/library/re.html or search for "non-greedy regular expression" in your favorite search engine.
This works:
s="""Beginning of the file...
stuff
look for this line
delete this line
delete this line also
until this line is found
stuff as well
the rest of the file... """
import re
print(re.sub(r'(^look for this line$).*?(^until this line is found$)',
             r'\1\n\2',s,count=1,flags=re.DOTALL | re.MULTILINE))
prints:
Beginning of the file...
stuff
look for this line
until this line is found
stuff as well
the rest of the file...
You can also use list slices to do this:
mStart='look for this line'
mStop='until this line is found'
li=s.split('\n')
print('\n'.join(li[0:li.index(mStart)+1]+li[li.index(mStop):]))
Same output.
I like re for this (being a Perl guy at heart...)

convert string into int()

I have a dataset that looks like this:
0 _ _ 23.0186E-03
10 _ _51.283E-03
20 _ _125.573E-03
where the numbers are lined up line by line (the underscores represent spaces).
The numbers in the right hand column are currently part of the line's string. I am trying to convert the numbers on the right into numerical values (0.0230186 etc). I can convert them with int() once they are in a simple numerical form, but I need to change the "E"s to get there. If you know how to change it for any value of E such as E-01, E-22 it would be very helpful.
Currently my code looks like so:
fin = open( 'stringtest1.txt', "r" )
fout = open("stringtest2.txt", "w")
while 1:
    x=fin.readline()
    a=x[5:-1]
    ##conversion code should go here
    if not x:
        break
fin.close()
fout.close()
I would suggest the following for the conversion:
float(x.split()[-1])
str.split() will split on white space when no arguments are provided, and float() will convert the string into a number, for example:
>>> '20 125.573E-03'.split()
['20', '125.573E-03']
>>> float('20 125.573E-03'.split()[-1])
0.125573
You should use context managers, and file objects are iterable:
with open('test1.txt') as fhi, open('test2.txt', 'w') as fho:
    for line in fhi:
        f = float(line.split()[-1])
        fho.write(str(f) + '\n')  # one value per line
If I understand correctly what you want to do, there's no need to do anything with the E's: in Python, float('23.0186E-03') returns 0.0230186, which I think is what you want.
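A minimal sketch of that conversion on the sample rows, assuming the underscores in the question really stand for spaces (parse_row is a made-up helper name):

```python
def parse_row(line):
    """Split an 'index  value' row on whitespace and convert both fields."""
    index, value = line.split()
    # float() understands exponent notation such as 'E-03' directly.
    return int(index), float(value)


rows = ["0    23.0186E-03", "10   51.283E-03", "20   125.573E-03"]
for index, value in map(parse_row, rows):
    # %.7f forces plain decimal notation with seven digits after the point
    print(index, "%.7f" % value)
```

Unpacking into exactly two names makes a malformed row fail loudly with a ValueError instead of silently producing a wrong number.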
All you need is:
fout = open("stringtest2.txt", "w")
for line in open('stringtest1.txt', "r"):
    x = float(line.strip().split()[1])
    fout.write("%f\n"%x)
fout.close()
Using %f in the output string will make sure the output will be in decimal notation (no E's). If you just use str(x), you may get E's in the output depending on the original value, so the correct conversion method depends on which output you want:
>>> str(float('23.0186E-06'))
'2.30186e-05'
>>> "%f"%float('23.0186E-06')
'0.000023'
>>> "%.10f"%float('23.0186E-06')
'0.0000230186'
You can put a precision such as .10 between the % and the f to control the number of digits after the decimal point. For more about string formatting with %, see http://rgruet.free.fr/PQR26/PQR2.6.html#stringMethods (scroll down to the "String formatting with the % operator" section).
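If the underscores are literally present in the file rather than standing in for spaces, strip them before converting:
float("20 _ _125.573E-03".split()[-1].strip("_"))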
