Modifying file contents using regular expressions in Python

Modifying file contents using regular expressions in Python - python

I've been trying to remove the numberings from the following lines using a Python script.
jokes.txt:
It’s hard to explain puns to kleptomaniacs because they always take things literally.
I used to think the brain was the most important organ. Then I thought, look
what’s telling me that.
When I run this Python script:
import re
with open('jokes.txt', 'r+') as original_file:
modfile = original_file.read()
modfile = re.sub("\d+\. ", "", modfile)
original_file.write(modfile)
The numbers are still there and it gets appended like this:
It’s hard to explain puns to kleptomaniacs because they always take things literally.
I used to think the brain was the most important organ. Then I thought, look what’s telling me that.1. It’s hard to explain puns to
kleptomaniacs because they always take things literally.਍ഀ਍ഀ2. I used to think the brain was the most important organ. Then I thought, look what’s telling me that.
I guess the regular expression re.sub("\d+\. ", "", modfile)finds all the digits from 0-9 and replaces it with an empty string.
As a novice, I'm not sure where I messed up. I'd like to know why this happens and how to fix it.

You've opened the file for reading and writing, but after you've read the file in you just start writing without specifying where to write to. That causes it to start writing where you left off reading - at the end of the file.
Other than closing the file and re-opening it just for writing, here's a way to write to the file:
import re
with open('jokes.txt', 'r+') as original_file:
modfile = original_file.read()
modfile = re.sub("\d+\. ", "", modfile)
original_file.seek(0) # Return to start of file
original_file.truncate() # Clear out the old contents
original_file.write(modfile)
I don't know why the numbers were still there in the part that you appended, as this worked just fine for me. You might want to add a caret (^) to the start of your regex (resulting in "^\d+\. "). Carets match the start of a line, making it so that if one of your jokes happens to use something like 1. in the joke itself the number at the beginning will be removed but not the number inside the joke.

Related

Python: how to get rid of non-ascii characters being read from a file

I am processing, with python, a long list of data that looks like this
The digraphs are probably due to encoding problems. (I am not sure whether these characters will be preserved in this site)
29/07/2016 04:00:12 0.125143
Now, when I read such file into a script using something like open and readlines, there is an error, reading
SyntaxError: EOL while scanning string literal
I know (or may look up usage of) replace and regex functions, but I cannot do them in my script. The biggest problem is that anywhere I include or read such strange character, error occurs, pointing on the very line it is read. So I cannot do anything to them.

Are you reading a file? If so, try to extract values using regexps, not to remove extra characters:
re.search(r'^([\d/: ]{19})', line).group(1)
re.search(r'([\d.]{7})', line).group(1)

I find that the re.findall works. (I am sorry I do not have time to test all other methods, since the significance of this job has vanished, and I even forget this question itself.)
def extract_numbers(str_i):
pat="(\d+)/(\d+)/(\d+)\D*(\d+):(\d+):(\d+)\D*(\d+)\.(\d+)"
match_h = re.findall(pat, str_i)
return match_h[0]
# ....
# `f` is the handle of the file in question
lines =f.readlines()
for l in lines:
ls_f =extract_numbers(l)
# process them....

Print on a single line not working Python 3

I'm very new to programming and am working on some code to extract data from a bunch of text files. I've been able to do this however the data is not useful to me in Excel. Therefore, I would like to print it all on a single line and separate it by a special character, which I can then delimit in Excel.
Here is my code:
import os
data=['Find me','find you', 'find us']
with open('C:\\Users\\Documents\\File.txt', 'r') as inF:
for line in inF:
for a in data:
string=a
if string in line:
print (line,end='*') #print on same line
inF.close()
So basically what I'm doing is finding if a keyword is on that line and then printing that line if it is.
Even though I have print(,end='*'), I don't get the print on a single line. It outputs:
Find me
*find you
*find us
Where is the problem? (I'm using Python 3.5.1)

Your immediate problem is that you're not removing the newline characters from your lines before printing them. The usual way to do this is with strip(), eg:
print(line.strip(), end='*')
You'll also print multiple copies of the line if more than one of your special phrases appear in the line. To avoid that, add a break statement after your print, or (better, but a more advanced construct that might not make sense until you're used to generator expressions) use if any(keyword in line for keyword in data):
You also don't need to explicitly close the input file - the point of the with open(...) as ...: context manager is that it closes the file when exiting it.
And I would avoid using string as a variable name - it doesn't tell anyone anything about what the variable is used for, and it can cause confusion if you end up using the built-in string module for anything. It's not as bad as shadowing a built-in constructor like list, but it's worth avoiding. Especially since it does nothing for you here, you can just use if a in line: here if you don't want to use the any() version above.
In addition to all that, if your data is not extremely large (and I hope it's not if you're trying to fit it all on one line) you'll get tidier code and avoid the trailing delimiter by using the .join() method on strings, eg something like:
import os
data=['Find me','find you', 'find us']
with open('C:\\Users\\Documents\\File.txt', 'r') as inF:
print "*".join(line.strip() for line in inF if any(keyword in line for keyword in data))

Parsing a file in python

Caveat emptor: I can spell p-y-t-h-o-n and that's pretty much all there is to my knowledge. I tried to take some online classes but after about 20 lectures learning not much, I gave up long time ago. So, what I am going to ask is very simple but I need help:
I have a file with the following structure:
object_name_here:
object_owner:
- me#my.email.com
- user#another.email.com
object_id: some_string_here
identification: some_other_string_here
And this block repeats itself hundreds of times in the same file.
Other than object_name_here being unique and required, all other lines may or may not be present, email addresses can be from none to 10+ different email addresses.
what I want to do is to export this information into a flat file, likes of /etc/passwd, with a twist
for instance, I want the block above to yield a line like this:
object_name_here:object_owner=me#my_email.com,user#another.email.com:objectid=some_string_here:identification=some_other_string_here
again, the number of fields or length of the content fields are not fixed by any means. I am sure this is pretty easy task to accomplish with python but how, I don't know. I don't even know where to start from.
Final Edit: Okay, I am able to write a shell script (bash, ksh etc.) to parse the information, but, when I asked this question originally, I was under the impression that, python had a simpler way of handling uniform or semi-uniform data structures as this one. My understanding was proven to be not very accurate. Sorry for wasting your time.

As jaypb points out, regular expressions are a good idea here. If you're interested in some python 101, I'll give you some simple code to get you started on your own solution.
The following code is a quick and dirty way to lump every six lines of a file into one line of a new file:
# open some files to read and write
oldfile = open("oldfilename","r")
newfile = open("newfilename","w")
# initiate variables and iterate over the input file
count = 0
outputLine = ""
for line in oldfile:
# we're going to append lines in the file to the variable outputLine
# file.readline() will return one line of a file as a string
# str.strip() will remove whitespace at the beginning and end of a string
outputLine = outputLine + oldfile.readline().strip()
# you know your interesting stuff is six lines long, so
# reset the output string and write it to file every six lines
if count%6 == 0:
newfile.write(outputLine + "\n")
outputLine = ""
# increment the counter
count = count + 1
# clean up
oldfile.close()
newfile.close()
This isn't exactly what you want to do but it gets you close. For instance, if you want to get rid of " - " from the beginning of the email addresses and replace it with "=", instead of just appending to outputLine you'd do something like
if some condition:
outputLine = outputLine + '=' + oldfile.readline()[3:]
that last bit is a python slice, [3:] means "give me everything after the third element," and it works for things like strings or lists.
That'll get you started. Use google and the python docs (for instance, googling "python strip" takes you to the built-in types page for python 2.7.10) to understand every line above, then change things around to get what you need.

Since you are replacing text substrings with different text substrings, this is a pretty natural place to use regular expressions.
Python, fortunately, has an excellent regular expressions library called re.
You will probably want to heavily utilize
re.sub(pattern, repl, string)
Look at the documentation here:
https://docs.python.org/3/library/re.html
Update: Here's an example of how to use the regular expression library:
#!/usr/bin/env python
import re
body = None
with open("sample.txt") as f:
body = f.read()
# Replace emails followed by other emails
body = re.sub(" * - ([a-zA-Z.#]*)\n * -", r"\1,", body)
# Replace declarations of object properties
body = re.sub(" +([a-zA-Z_]*): *[\n]*", r"\1=", body)
# Strip newlines
body = re.sub(":?\n", ":", body)
print (body)
Example output:
$ python example.py
object_name_here:object_owner=me#my.email.com, user#another.email.com:object_id=some_string_here:identification=some_other_string_here

Python search and replace in a file

I was trying to make a script to allow me to automate clean ups in the linux kernel a little bit. The first thing on my agenda was to remove braces({}) on if statements(c-styled) that wasnt necessary for single statement blocks. Now the code I tried with my little knowledge of regex in python I got to a working state, such as:
if (!buf || !buf_len) {
TRACE_RET(chip, STATUS_FAIL);
}
and the script turn it into:
if (!buf || !buf_len)
TRACE_RET(chip, STATUS_FAIL);
Thats what I want but when I try it on real source files it seems like it randomly selects a if statement and take its deleted it beginning brace and it has multiple statement blocks and it remove the ending brace far down the program usually on a else satement or a long if statement.
So can someone please help me with make the script only touch an if statement if it has a single block statement and correctly delete it corresponding beginning and ending brace.
The correct script looks like:
from sys import argv
import os
import sys
import re
get_filename = argv[1]
target = open(get_filename)
rename = get_filename + '.tmp'
temp = open(rename, 'w')
def if_statement():
look=target.read()
pattern=r'''if (\([^.)]*\)) (\{)(\n)([^>]+)(\})'''
replacement=r'''if \1 \3\4'''
pattern_obj = re.compile(pattern, re.MULTILINE)
outtext = re.sub(pattern_obj, replacement, look)
temp.write(outtext)
temp.close()
target.close()
if_statement()
Thanks in advance

In theory, this would mostly work:
re.sub(r'(if\s*\([^{]+\)\s*){([^;]*;)\s*}', r'\1\2', yourstring)
Note that this will fail on nested single-statement blocks and on semicolons inside string or character literals.
In general, trying to parse C code with regex is a bad idea, and you really shouldn't get rid of those braces anyway. It's good practice to have them and they're not hurting anything.

Python Printing from python32

I can't get Python to print a word doc. What I am trying to do is to open the Word document, print it and close it. I can open Word and the Word document:
import win32com.client
msword = win32com.client.Dispatch("Word.Application")
msword.Documents.Open("X:\Backoffice\Adam\checklist.docx")
msword.visible= True
I have tried next to print
msword.activedocument.printout("X:\Backoffice\Adam\checklist.docx")
I get the error of "print out not valid".
Could someone shed some light on this how I can print this file from Python. I think it might be as simple as changing the word "printout". Thanks, I'm new to Python.

msword.ActiveDocument gives you the current active document. The PrintOut method prints that document: it doesn't take a document filename as a parameter.
From http://msdn.microsoft.com/en-us/library/aa220363(v=office.11).aspx:
expression.PrintOut(Background, Append, Range, OutputFileName, From, To, Item,
Copies, Pages, PageType, PrintToFile, Collate, FileName, ActivePrinterMacGX,
ManualDuplexPrint, PrintZoomColumn, PrintZoomRow, PrintZoomPaperWidth,
PrintZoomPaperHeight)
Specifically Word is trying to use your filename as a boolean Background which may be set True to print in the background.
Edit:
Case matters and the error is a bit bizarre. msword.ActiveDocument.Printout() should print it. msword.ActiveDocument.printout() throws an error complaining that 'PrintOut' is not a property.
I think what happens internally is that Python tries to compensate when you don't match the case on properties but it doesn't get it quite right for methods. Or something like that anyway. ActiveDocument and activedocument are interchangeable but PrintOut and printout aren't.

You probably have to escape the backslash character \ with \\:
msword.Documents.Open("X:\\Backoffice\\Adam\\checklist.docx")
EDIT: Explanation
The backslash is usually used to declare special characters. For example \n is the special character for a new-line. If you want a literal \ you have to escape it.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.