Python search and replace in a file - python

I am trying to write a script to automate some cleanups in the Linux kernel. The first item on my agenda is removing braces ({}) from C-style if statements where they are unnecessary, i.e. around single-statement blocks. With my limited knowledge of regex in Python I got the script to a working state. Given input such as:
if (!buf || !buf_len) {
    TRACE_RET(chip, STATUS_FAIL);
}
the script turns it into:
if (!buf || !buf_len)
    TRACE_RET(chip, STATUS_FAIL);
That's what I want, but when I run it on real source files it seems to pick an if statement at random, delete its opening brace even though the block contains multiple statements, and then remove a closing brace much further down the program, usually one that belongs to an else statement or a long if statement.
So can someone please help me make the script touch an if statement only when its block holds a single statement, and correctly delete the corresponding opening and closing braces?
The current script looks like:
from sys import argv
import os
import sys
import re
get_filename = argv[1]
target = open(get_filename)
rename = get_filename + '.tmp'
temp = open(rename, 'w')
def if_statement():
    look=target.read()
    pattern=r'''if (\([^.)]*\)) (\{)(\n)([^>]+)(\})'''
    replacement=r'''if \1 \3\4'''
    pattern_obj = re.compile(pattern, re.MULTILINE)
    outtext = re.sub(pattern_obj, replacement, look)
    temp.write(outtext)
    temp.close()
    target.close()
if_statement()
Thanks in advance

In theory, this would mostly work:
re.sub(r'(if\s*\([^{]+\)\s*){([^;]*;)\s*}', r'\1\2', yourstring)
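Your pattern misbehaves because ([^>]+) is greedy and happily matches across other braces, so the closing brace it removes can be far below the if. The version above only accepts a body that ends at its first semicolon. Note that it will still fail on nested single-statement blocks and on semicolons inside string or character literals.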
In general, trying to parse C code with regex is a bad idea, and you really shouldn't get rid of those braces anyway. It's good practice to have them and they're not hurting anything.
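As a hedged illustration of the regex above, here is a minimal sketch applied to the snippet from the question. The function name strip_single_statement_braces and the two sample strings are made up for this example, and the braces are written as \{ and \} purely for readability.

```python
import re

# Same pattern as in the answer above, with the literal braces escaped.
SINGLE_STATEMENT_IF = re.compile(r'(if\s*\([^{]+\)\s*)\{([^;]*;)\s*\}')

def strip_single_statement_braces(source):
    """Drop the braces around if-bodies that end at their first semicolon."""
    return SINGLE_STATEMENT_IF.sub(r'\1\2', source)

single = "if (!buf || !buf_len) {\n\tTRACE_RET(chip, STATUS_FAIL);\n}"
multi = "if (a) {\n\tfoo();\n\tbar();\n}"

print(strip_single_statement_braces(single))
# A block with two statements does not match and is returned unchanged.
print(strip_single_statement_braces(multi) == multi)
```

The space that used to precede the opening brace stays behind at the end of the if line, so a real cleanup would probably want to trim trailing whitespace as well.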


Python 3.6 Identifying a string and if X in Y

Newb programmer here working on my first project. I've searched this site and the python documentation, and either I'm not seeing the answer, or I'm not using the right terminology. I've read the regex and if sections, specifically, and followed links around to other parts that seemed relevant.
import re
keyphrase = '##' + '' + '##'
print(keyphrase) #output is ####
j = input('> ') # ##whatever##
if keyphrase in j:
    print('yay')
else:
    print('you still haven\'t figured it out...')
k = j.replace('#', '')
print(k) #whatever
This is for a little reddit bot project. I want the bot to be called like ##whatever## and then be able to do things with the word(s) in between the ##'s. I've set up the above code to test if Python was reading it but I keep getting my "you still haven't figured it out..." quip.
I tried adding the regex \W in the middle of keyphrase, to no avail. I also tried weird combinations of \$\$ and quotes.
So my question is: how do I put a placeholder in keyphrase for user input?
For instance, if a ##comment## does something like ##this## ##I can grab## everything between the # symbols as separate inputs/calls.
You could use the following regex r'##(.*?)##' to capture everything inside of the key phrase you've chosen.
Sample Output:
>>> import re
>>> f = lambda s: re.match(r'##(.*?)##', s).group(1)
>>> f("##whatever##")
'whatever'
>>> f = lambda s: re.findall(r'##(.*?)##', s)
>>> f("a ##comment## does something like ##this## ##I can grab## everything between the # symbols as separate inputs/calls.")
['comment', 'this', 'I can grab']
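How does it work? (1) We state the constant head and tail of the string, ##, on either side of capture group 1, which is delimited by the parentheses (). Great, almost there! (2) Inside the group we match any character with .*?, where the trailing ? makes the quantifier lazy (non-greedy), so each capture stops at the nearest closing ## instead of running on to the last one.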
Suggested Readings:
Introduction to Regex in Python - Jee Gikera
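To make the difference between the greedy .* and the lazy .*? concrete, here is a small sketch. The sample sentence is shortened from the question; the variable names are only for illustration.

```python
import re

text = "a ##comment## does something like ##this##"

greedy = re.findall(r'##(.*)##', text)   # .* grabs as much as it can
lazy = re.findall(r'##(.*?)##', text)    # .*? stops at the nearest closing ##

print(greedy)  # ['comment## does something like ##this']
print(lazy)    # ['comment', 'this']
```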
Something like this should work:
import re
keyphrase_regex = re.compile(r'##(.*)##')
user_input = input('> ')
keyphrase_match = keyphrase_regex.search(user_input)
# `search` returns `None` if regex didn't match anywhere in the string
keyphrase_content = keyphrase_match.group(1) if keyphrase_match else None
if keyphrase_content:
    print(f'yay! You submitted "{keyphrase_content}" to the bot!')
else:
    # Bonus tip: Use double quotes to make a string containing apostrophe
    # without using a backslash escape
    print("you still haven't figured it out...")
# Use `keyphrase_content` for whatever down here
Regular expressions are kind of hard to wrap your head around, because they work differently than most programming constructs. It's a language to describe patterns.
Regex One is a fantastic beginners guide.
Regex101 is an online sandbox that allows you to type a regular expression and some sample strings, then see what matches (and why) in real time
The regex ##(.*)## basically means: "Search through the string until you find two '#' signs. Right after those, start capturing zero-or-more of any character, grabbing as many as possible. Then back off until the captured text is followed by two more '#' signs. When that happens, stop looking at the string, return successfully, and hold onto the entire match (from the first '#' to the last '#'). Also, hold onto the captured characters in case the programmer asks for just them." Because .* is greedy, a line with several ##tags## yields one long capture that runs from the first ## to the last ##.
EDIT: Props to @ospahiu for bringing up the ? lazy quantifier. A final solution, combining our approaches, would look like this:
# whatever_bot.py
import re
# Technically, Python >2.5 will compile and cache regexes automatically.
# For tiny projects, it shouldn't make a difference. I think it's better style, though.
# "Explicit is better than implicit"
keyphrase_regex = re.compile(r'##(.*?)##')
def parse_keyphrases(text):
    return keyphrase_regex.findall(text)
Lambdas are cool. I prefer them for one-off things, but the code above is something I'd rather put in a module. Personal preference.
You could even make the regex substitutable, using the '##' one by default:
# whatever_bot.py
import re
keyphrase_double_hash = re.compile(r'##(.*?)##')
def parse_keyphrases(text, keyphrase_regex=keyphrase_double_hash):
    return keyphrase_regex.findall(text)
You could even go bonkers and write a function that generates a keyphrase regex from an arbitrary "tag" pattern! I'll leave that as an exercise for the reader ;) Just remember: Several characters have special regex meanings, like '*' and '?', so if you want to match that literal character, you'd need to escape them (e.g. '\?').
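One possible take on that exercise, as a rough sketch only: re.escape takes care of characters with special regex meaning, so the tag can be any literal string. The name make_keyphrase_regex and the '**' tag are hypothetical, chosen just for this example.

```python
import re

def make_keyphrase_regex(tag):
    """Build a lazy 'tag ... tag' pattern; re.escape neutralises special characters."""
    escaped = re.escape(tag)
    return re.compile(escaped + r'(.*?)' + escaped)

# '*' is a regex quantifier, so this tag only works because it is escaped.
star_regex = make_keyphrase_regex('**')
found = star_regex.findall('call **foo** and **bar baz**')
print(found)  # ['foo', 'bar baz']
```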
If you want to grab the content between the "#", then try this:
j = input("> ")
"".join(j.split("#"))
You're not getting any of the info between the #'s in your example because you're effectively looking for '####' in whatever input you give it. Unless you happen to put 4 #'s in a row, that RE will never match.
What you want to do instead is something like
re.match(r'##\w+##', j)
which will look for 2 leading #s, then one or more word characters (\w+, i.e. letters, digits and underscores), then 2 trailing #s. From there, your replace code looks fine and you should be able to grab it.
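A quick check of how \w and \W behave here, and of match versus search, may help. This is only a sketch with made-up sample strings; note that \W is the negation of \w, and that re.match only looks at the very start of the string.

```python
import re

non_word = re.search(r'##\W+##', '##whatever##')       # \W means NON-word characters
word = re.search(r'##(\w+)##', '##whatever##')         # \w matches letters, digits, underscore
at_start = re.match(r'##(\w+)##', 'hey ##whatever##')  # match is anchored at position 0
anywhere = re.search(r'##(\w+)##', 'hey ##whatever##') # search scans the whole string
spaces = re.search(r'##(\w+)##', '##I can grab##')     # \w does not match spaces

print(non_word)           # None
print(word.group(1))      # whatever
print(at_start)           # None
print(anywhere.group(1))  # whatever
print(spaces)             # None
```

For phrases that contain spaces, the lazy ##(.*?)## pattern from the other answers is the more forgiving choice.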

How to remove whitespaces from multiline string pass it to another program and then add back whitespaces?

Summary
I need to detect the indentation level of the first line of a multiline string passed to a script, store it, and remove that indent from every line. Then I pass the dedented string to another program (which I've figured out how to do), add the indent back to every line of the result, and print it to stdout (which I also know how to do).
To be specific, I have a problem with vim and the Python formatter YAPF.
YAPF only works on valid Python: if the input does not parse, formatting fails with an error.
So imagine this
def f():
    # imagine some very very long lines here that we want to reformat
If I select these imagined lines in vim and press gq (I've set formatprg=yapf), vim replaces them with a yapf traceback, which is no good of course. But if I select the whole function, it does the job perfectly.
You can test this with
echo ' fooo = 1' | yapf
This would result in IndentationError
While echo 'fooo = 1' | yapf would work
So what I think would be a nice workaround is to store the indent level of the first line, remove the indentation, pass the dedented string to yapf somehow, and then add the indent back to the result. The catch is that I'd like this to be a one-liner, or close to it, so that it can be stored directly in my vimrc. Python isn't a good match for that, because I would need at least to import the re package, etc.
So I thought about perl.
The only problem is that I don't know perl much.
So for now my experiment looks like this
$a = " foo = 1\n bar = '1'";
my ($indent, $text) = $a =~ m/^(\s+)(.*)$/m;
$command = "echo " . $text;
$out = `$command`;
print "$out\n";
print "$text\n";
I will be glad for any help. Maybe there is more easy way to do this, I don't know.
Since you seem to be familiar with Python already I would recommend using its textwrap module, which contains dedent and (in version 3.3 and later) indent functions that can do most of the job for you:
import re
from textwrap import dedent, indent
whitespace = re.compile(r'\s+')
test_string = '''    while True:
        pass'''
leading_whitespace = whitespace.match(test_string)
dedented_text = dedent(test_string)
# Do whatever you want with dedented_text
indented_text = indent(dedented_text, leading_whitespace.group(0))
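The same idea can be wrapped in a small helper, sketched below under a few assumptions: the names with_indent_removed and requote are invented here, and requote is only a stand-in for whatever formatter the text would really be passed to. It needs Python 3.3 or later for textwrap.indent.

```python
import re
from textwrap import dedent, indent

def with_indent_removed(text, transform):
    """Dedent text, run transform on it, then restore the first line's indent."""
    leading = re.match(r'[ \t]*', text).group(0)  # indent of the first line
    return indent(transform(dedent(text)), leading)

def requote(code):
    # Stand-in for the real formatter: just swap the quote style.
    return code.replace("'", '"')

selection = "    foo = 1\n    bar = '1'\n"
result = with_indent_removed(selection, requote)
print(result)
```

textwrap.indent skips lines that consist only of whitespace, so blank lines in the formatted output stay blank.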

Modifying file contents using regular expressions in Python

I've been trying to remove the numberings from the following lines using a Python script.
jokes.txt:
1. It’s hard to explain puns to kleptomaniacs because they always take things literally.
2. I used to think the brain was the most important organ. Then I thought, look
what’s telling me that.
When I run this Python script:
import re
with open('jokes.txt', 'r+') as original_file:
    modfile = original_file.read()
    modfile = re.sub("\d+\. ", "", modfile)
    original_file.write(modfile)
The numbers are still there and it gets appended like this:
1. It’s hard to explain puns to kleptomaniacs because they always take things literally.
2. I used to think the brain was the most important organ. Then I thought, look what’s telling me that.1. It’s hard to explain puns to
kleptomaniacs because they always take things literally.਍ഀ਍ഀ2. I used to think the brain was the most important organ. Then I thought, look what’s telling me that.
I guess the regular expression re.sub("\d+\. ", "", modfile) finds every run of digits that is followed by a period and a space, and replaces it with an empty string.
As a novice, I'm not sure where I messed up. I'd like to know why this happens and how to fix it.
You've opened the file for reading and writing, but after you've read the file in you just start writing without specifying where to write to. That causes it to start writing where you left off reading - at the end of the file.
Other than closing the file and re-opening it just for writing, here's a way to write to the file:
import re
with open('jokes.txt', 'r+') as original_file:
    modfile = original_file.read()
    modfile = re.sub(r"\d+\. ", "", modfile)
    original_file.seek(0) # Return to start of file
    original_file.truncate() # Clear out the old contents
    original_file.write(modfile)
I don't know why the numbers were still there in the part that you appended, as this worked just fine for me. You might want to add a caret (^) to the start of your regex (resulting in r"^\d+\. ") and pass flags=re.MULTILINE to re.sub. With that flag a caret matches at the start of every line (without it, only at the start of the whole string), so if one of your jokes happens to use something like 1. in the joke itself, the number at the beginning will be removed but not the number inside the joke.
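Here is a minimal sketch of the anchored version, together with the "close and reopen" alternative mentioned earlier. The helper names and the sample text are made up, and the utf-8 default is an assumption about the file, not something stated in the question.

```python
import re

# Leading "N. " at the start of each line; re.MULTILINE lets ^ match after every newline.
NUMBERING = re.compile(r'^\d+\. ', flags=re.MULTILINE)

def strip_numbering(text):
    return NUMBERING.sub('', text)

def strip_numbering_in_file(path, encoding='utf-8'):
    """Read the whole file, then reopen it in 'w' mode, which truncates it first."""
    with open(path, encoding=encoding) as f:
        text = f.read()
    with open(path, 'w', encoding=encoding) as f:
        f.write(strip_numbering(text))

sample = "1. I told joke no. 2. twice.\n2. Second joke.\n"
cleaned = strip_numbering(sample)
print(cleaned)
```

Only the numbers at the start of a line are removed; the "2. " in the middle of the first joke survives.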

Reading and correctly understanding/interpreting control characters from a file (python)

I'm a Python beginner and just ran into a simple problem: I have a list of names (designators) and some very simple code that reads lines from a csv file and prints the csv lines that have a name in the first column (row[0]) in common with my "designator list". So:
import csv
DesignatorList = ["AAX-435", "AAX-961", "HHX-9387", "HHX-58", "K-58", "K-14", "K-78524"]
with open('DesignatorFile.csv','rb') as FileReader:
    for row in csv.reader(FileReader, delimiter=';'):
        if row[0] in DesignatorList:
            print row
My csv file is only a list of names, like this:
AAX-435
AAX-961
HHX-58
HHX-9387
I would like to be able to use wildcards like * and ., example: let's say that I put this on my csv file:
AAX*
H.X-9387
*58
I need my code to be able to interpret those wild cards/control characters, printing the following:
every line that starts with "AAX";
every line that starts with "H", then any following character, then finally ends with "X-9387";
every line that ends with "58".
Thank you!
EDIT: For future reference (in case somebody runs into the same problem), this is how I solved my problem following Roman's advice:
import csv
import re
DesignatorList = ["AAX-435", "AAX-961", "HHX-9387", "HHX-58", "K-58", "K-14", "K-78524"]
with open('DesignatorFile.txt','rb') as FileReader:
    for row in csv.reader(FileReader, delimiter=';'):
        designator_col0 = row[0]
        designator_col0_re = re.compile("^" + ".*".join(re.escape(i) for i in designator_col0.split("*")) + "$")
        for d in DesignatorList:
            if designator_col0_re.match(d):
                print d
Try the re module.
You may need to prepare a regular expression (regex) by replacing '*' with '.*' and adding ^ (beginning of string) and $ (end of string) to the beginning and the end of the regular expression. In addition, you may need to escape everything else with the re.escape function (that is, the function escape from the module re).
If you do not have any other "control characters" (as you call them), it is enough to split the string by "*", apply escape to each piece, and join the pieces with ".*". Note that re.escape also escapes '.', so a rule such as H.X-9387 would then match only a literal dot.
For example,
import re
def make_rule(rule): # where rule for example "H*X-9387"
    return re.compile("^" + ".*".join(re.escape(i) for i in rule.split("*")) + "$")
Then you can match (I guess your rule is row[0]):
...
rule_re = make_rule(row[0])
for d in DesignatorList:
    if rule_re.match(d):
        print row # or maybe print d
(As I understand it, the rules come from the CSV file while the designators come from a list. It's easy to do it the other way around.)
The snippets above are only examples; you still need to adapt them to your program.
Python's string object does have a startswith and an endswith method, which you could use here if you only had a small number of rules. The most general way to go with this, since you seem to have fairly simple patterns, is regular expressions. That way you can encode those rules as patterns.
import re
rules = ['^AAX.*$', # starts with AAX
         '^H.*X-9387$', # starts with H, ends with X-9387
         '^.*58$'] # ends with 58
for line in reader:
    if any(re.match(rule, line) for rule in rules):
        print line
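Putting the pieces together for the exact wildcards in the question ('*' for any run of characters, '.' for any single character), a Python 3 sketch could look like the following. The function name wildcard_to_regex is made up, and the rules are hard-coded here instead of being read from the csv file.

```python
import re

def wildcard_to_regex(rule):
    """Translate '*' to '.*', keep '.' as 'any one character', escape the rest."""
    parts = []
    for ch in rule:
        if ch == '*':
            parts.append('.*')
        elif ch == '.':
            parts.append('.')
        else:
            parts.append(re.escape(ch))
    return re.compile(''.join(parts))

designators = ["AAX-435", "AAX-961", "HHX-9387", "HHX-58", "K-58", "K-14", "K-78524"]
rules = ["AAX*", "H.X-9387", "*58"]

# fullmatch requires the whole designator to match, so no ^ or $ is needed.
matches = {
    rule: [d for d in designators if wildcard_to_regex(rule).fullmatch(d)]
    for rule in rules
}
print(matches)
```

Treating '.' as a wildcard means a literal dot in a designator can no longer be matched exactly, which is a trade-off of this notation rather than of the code.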

A Python script I've written for correcting table names of the SQL dumps from Windows. Any comments?

As a newbie in Python, I've thought about writing a quick and dirty script for correcting the table name capitalization in a MySQL dump file (created by phpMyAdmin).
The idea is that since the correct capitalization of the table names is in the comments, I'm going to use it.
e.g.:
-- --------------------------------------------------------
--
-- Table structure for table `Address`
--
The reason I'm asking here is that I don't have a mentor on Python programming and I was hoping you guys could steer me to the right direction. It feels like there's a lot of stuff I'm doing wrong (maybe it's not pythonic) I'd really appreciate your help, thanks in advance!
Here's what I've written (and it works):
#!/usr/bin/env python
import re
filename = 'dump.sql'
def get_text_blocks(filename):
    text_blocks = []
    text_block = ''
    separator = '-- -+'
    for line in open(filename, 'r'):
        text_block += line
        if re.match(separator, line):
            if text_block:
                text_blocks.append(text_block)
                text_block = ''
    return text_blocks
def fix_text_blocks(text_blocks):
    f = open(filename + '-fixed', 'w')
    for block in text_blocks:
        table_pattern = re.compile(r'Table structure for table `(.+)`')
        correct_table_name = table_pattern.search(block)
        if correct_table_name:
            replacement = 'CREATE TABLE IF NOT EXISTS `' + correct_table_name.groups(0)[0] + '`'
            block = re.sub(r'CREATE TABLE IF NOT EXISTS `(.+)`', replacement, block)
        f.write(block)
if __name__ == '__main__':
    fix_text_blocks(get_text_blocks(filename))
Looks fairly good, so the following are relatively minor:
get_text_blocks basically splits the entire text by the separator, correct? If so, I think this can be done with a single regex and the re.DOTALL flag (so that . also matches newlines). Something like r'(.*?)\n-- -+' (warning: untested).
If you don't want to use a single regex but prefer to parse the file in a loop, you can ditch the regex for str.startswith. You should also not concatenate strings the way you do with text_block, since every concatenation creates a new string. You can use either the StringIO class, or keep a list of lines and join them at the end with ''.join (lines read from a file already end in a newline).
The nested 'if' can be dropped: use the 'and' operator instead.
In any case, working with files (and other objects which have a 'finally' logic) is now done with the 'with [object] as [name]:' clause. Look it up, it's nifty.
If you don't do that - always close your files when you finish working with them, preferably in a 'finally' clause.
I prefer opening files with the 'b' flag as well. Prevents '\r\n' magic in Windows.
In fix_text_blocks, the pattern should be compiled outside the for loop.
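For what it's worth, here is a rough sketch of how the script might look with these suggestions applied: patterns compiled once, str.startswith instead of a regex for the separator, a list of lines joined at the end, and with blocks for the files. The function names are invented for this example, and two details differ from the original script on purpose: text after the last separator is kept, and [^`]+ replaces .+ so a match cannot run past the closing backtick.

```python
import re

# Compiled once, outside any loop.
TABLE_COMMENT = re.compile(r'Table structure for table `([^`]+)`')
CREATE_TABLE = re.compile(r'CREATE TABLE IF NOT EXISTS `[^`]+`')

def split_blocks(lines):
    """Group lines into blocks, each ending at a '-- ---' separator line."""
    blocks, current = [], []
    for line in lines:
        current.append(line)
        if line.startswith('-- -'):
            blocks.append(''.join(current))
            current = []
    if current:  # keep whatever follows the last separator
        blocks.append(''.join(current))
    return blocks

def fix_block(block):
    match = TABLE_COMMENT.search(block)
    if not match:
        return block
    fixed = 'CREATE TABLE IF NOT EXISTS `' + match.group(1) + '`'
    # A function replacement is inserted literally (no backslash processing).
    return CREATE_TABLE.sub(lambda m: fixed, block)

def fix_dump(in_path, out_path):
    # Both files are closed automatically, even if an error occurs.
    with open(in_path) as src, open(out_path, 'w') as dst:
        for block in split_blocks(src):
            dst.write(fix_block(block))

dump = [
    "-- --------------------------------------------------------\n",
    "--\n",
    "-- Table structure for table `Address`\n",
    "--\n",
    "CREATE TABLE IF NOT EXISTS `address` (\n",
    "  `id` int(11) NOT NULL\n",
    ");\n",
]
fixed_text = ''.join(fix_block(b) for b in split_blocks(dump))
print(fixed_text)
```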
