Sed REMOVE / REPLACE double parentheses - python

I have a Python file with lots of double parentheses like the ones below, which I would like to replace with single parentheses.
Sometimes the print goes on for 2 lines or more.
print(('>> # some text some text some text and '
+ 'some more text'))
print(('>> # some text some text some text and '
+ 'some more text'))
print(('>> # some text some text some text and '
+ 'some more text'))
print(('>> # some text some text some text and '
+ 'some more text'))
print((something))
print((something))
print((something))
print((something))
print((something))
print((something))
I have tried a lot of different ways to approach this. I think the easiest would be with sed. I have something like this:
grep -rl 'print((' test.txt | xargs sed -i "N;s/print((\(.*\)))/print(\1)/g"
The output looks like this:
print('>> # some text some text some text and '
+ 'some more text')
print('>> # some text some text some text and '
+ 'some more text')
print(('>> # some text some text some text and '
+ 'some more text'))
print(('>> # some text some text some text and '
+ 'some more text'))
print(something)
print(something)
print(something)
print(something)
print(something)
print(something)
It works for some lines but not for others. I think that is because of the N;, but I need it in case a print spans multiple lines.
What could I do to improve this pattern?

To avoid issues caused by unusual input file names, use grep -rlZ 'regex' | xargs -0 <command ...>
If the content within the parentheses doesn't itself contain )), you can use this:
grep -rlZ 'print((' | xargs -0 perl -i -0777 -pe 's/print\(\((.*?)\)\)/print($1)/sg'
-0777 slurps the entire file content as a single string, so this solution is not suitable for files too large to fit in memory
.*? is non-greedy matching
the s modifier allows . to match \n as well
When using the -i option, you can specify a backup suffix (e.g. -i.bkp), a prefix (e.g. -i'bkp.*'), or even a backup directory (e.g. -i'bkp_dir/*'); these preserve the original copy for further use
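For comparison, the same substitution can be sketched in Python itself. This is only a minimal illustration, not part of the answer above: the names DOUBLE_PAREN and strip_double_parens are made up for the example, and it carries the same assumption as the Perl version, namely that the content between the parentheses does not itself contain )).

```python
import re

# Same idea as the Perl one-liner: a non-greedy match, with DOTALL so that
# "." also matches newlines (the counterpart of Perl's /s modifier).
DOUBLE_PAREN = re.compile(r"print\(\((.*?)\)\)", re.DOTALL)


def strip_double_parens(source):
    """Rewrite print((...)) as print(...), including multi-line calls."""
    return DOUBLE_PAREN.sub(r"print(\1)", source)


sample = (
    "print(('>> # some text some text some text and '\n"
    "+ 'some more text'))\n"
    "print((something))\n"
)
result = strip_double_parens(sample)
print(result)
```

One caveat with any such rewrite: print((a, b)) prints a tuple, so removing the inner parentheses there changes what the program prints.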

Related

Printing Single Quote inside the string

I want to output
XYZ's "ABC"
I tried the following 3 statements in Python IDLE.
1st and 2nd statement output a \ before '.
3rd statement with print function doesn't output \ before '.
Being new to Python, I wanted to understand why \ is output before ' in the 1st and 2nd statements.
>>> "XYZ\'s \"ABC\""
'XYZ\'s "ABC"'
>>> "XYZ's \"ABC\""
'XYZ\'s "ABC"'
>>> print("XYZ\'s \"ABC\"")
XYZ's "ABC"
Here are my observations when you call repr() on a string: (It's the same in IDLE, REPL, etc)
If the string has neither a single nor a double quote, repr() surrounds it with single quotes. (Note: when you hit Enter in the REPL, repr() is what gets called, not __str__, which is what the print function uses.)
If the string has either ' or " but not both: there is no backslash in the output. The output is surrounded by " if the string has ' and by ' if the string has ".
If the string has both ' and ": the output is surrounded by single quotes. The ' is escaped with a backslash but the " is not.
Examples:
def print_it(s):
    print(repr(s))
    print("-----------------------------------")
print_it('Soroush')
print_it("Soroush")
print_it('Soroush"s book')
print_it("Soroush's book")
print_it('Soroush"s book and Soroush\' pen')
print_it("Soroush's book and Soroush\" pen")
output:
'Soroush'
-----------------------------------
'Soroush'
-----------------------------------
'Soroush"s book'
-----------------------------------
"Soroush's book"
-----------------------------------
'Soroush"s book and Soroush\' pen'
-----------------------------------
'Soroush\'s book and Soroush" pen'
-----------------------------------
So with that being said, the only way to get your desired output is to have str() applied to the string, which is what print() does.
I know Soroush"s book is grammatically incorrect in English. I just want to put it inside an expression.
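Applied to the string from the question, the difference between the two forms can be sketched like this (the variable names echoed and printed are just for illustration):

```python
s = "XYZ's \"ABC\""

echoed = repr(s)   # what the interactive prompt shows for a bare expression
printed = str(s)   # what print() writes

print(echoed)
print(printed)
```

The backslash appears only in the repr() form, because that form has to be a valid string literal delimited by single quotes.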
Not sure what you want it to print.
Do you want it to output XYZ\'s \"ABC\" or XYZ's "ABC"?
The \ escapes the next special character, such as a quote, so if you want to print a \ the code needs to have two: \\.
string = "Im \\"
print(string)
Output: Im \
If you want to print double quotes without escaping them, delimit the string with single quotes:
string = 'theres a "lot of "" in" my "" script'
print(string)
Output: theres a "lot of "" in" my "" script
Single quotes let you have unescaped double quotes inside the string.

Regex replace string which is before or after two different string

I have this string (html):
html = 'x<sub>i</sub> - y<sub>i)<sub>2</sub>'
I would like to convert this html string to latex in a robust way. Let me explain:
<sub>SOMETHING</sub> -> converted to _{SOMETHING}
I already know how to do that:
latex = re.sub(r'<sub>(.*?)</sub>',r'_{\1} ', html)
Sometimes the first part <sub> or its closing tag is missing, like in the example string. In that case, the output should still be correct.
So my idea was: after running the substitution above, treat anything after an unmatched <sub>, and anything before an unmatched </sub>, as SOMETHING and replace it with _{SOMETHING}:
text = re.sub(r'<sub>(.*?)</sub>',r'_{\1} ', html)
print(text)
# if missing part:
text = re.sub(r'<sub>(.*?)',r'_{\1} ', text)
print(text)
latex = re.sub(r'(.*?)</sub>',r'_{\1} ', text)
… but I get:
x_{i} - y_{i)<sub>2}
x_{i} - y_{i)_{} 2}
x_{i} - y_{i)_{} 2}
What I would like to get:
x_{i} - y_{i})_{2}
Assuming you have texts that are segmented into different parts, the corresponding <sub> / </sub> tags may reside in the adjoining segments, so it should suffice to replace them one by one, separately, without any guesswork.
Just use
text = text.replace('<sub>', '_{').replace('</sub>', '}')
to replace each <sub> with _{ and </sub> with } in any context.
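Run on the example string from the question, this approach can be sketched as follows. The brace count at the end is an addition for illustration only; it is not part of the answer.

```python
html = 'x<sub>i</sub> - y<sub>i)<sub>2</sub>'

# Replace each tag independently, without trying to pair them up.
latex = html.replace('<sub>', '_{').replace('</sub>', '}')
print(latex)

# A missing closing tag shows up as an unclosed brace.
unclosed = latex.count('{') - latex.count('}')
print(unclosed)
```

Because the example is missing one </sub>, the result still has one unclosed brace, so it does not by itself produce the x_{i} - y_{i})_{2} asked for; the tag-by-tag replacement is only correct when the matching tag exists in an adjoining segment.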
You need to use greedy regexes (i.e. without ?) for the unmatched tags, otherwise you'll always get zero-width matches.
>>> text = '1<sub>2'
>>> re.sub(r'<sub>(.*)', r'_{\1} ', text)
'1_{2} '
BTW while figuring this out, I noticed you can put the second two regexes together like this:
re.sub(r'<sub>(.*)|(.*)</sub>', r'_{\1\2} ', text)
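A quick sketch of how the combined pattern behaves on one input of each kind. It assumes Python 3.5 or later, where a group that did not take part in the match is replaced by an empty string in the replacement template; the helper name fix_unmatched is made up for the example.

```python
import re

pattern = re.compile(r'<sub>(.*)|(.*)</sub>')


def fix_unmatched(text):
    # Only one of the two groups matches; the other expands to ''.
    return pattern.sub(r'_{\1\2} ', text)


opening_only = fix_unmatched('1<sub>2')
closing_only = fix_unmatched('2</sub>3')
print(repr(opening_only))
print(repr(closing_only))
```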

How to delete alphanumeric words out of a Unicode file

I need to use a dictionary database, but most of it is useless alphanumeric stuff, and the interesting fields are either non-alphanumeric (such as Chinese characters) or inside some brackets. I searched a lot and learned about a lot of tools like sed, awk, grep, etc. I even thought about writing a Python script to sort it out, but I never managed to find a solution.
A line of the database looks like this:
助 L1782 DN1921 K407 O431 DO346 MN2313 MP2.0376 E314 IN623 DA633 DS248 DF367 DH330 DT284 DC248 DJ826 DG211 DM1800 P1-5-2 I2g5.1 Q7412.7 DR3945 Yzhu4 Wjo ジョ たす.ける たす.かる す.ける すけ {help} {rescue} {assist}
I need it to be like this :
助 ジョ たす.ける たす.かる す.ける すけ {help} {rescue} {assist}
How can I do this using any of the tools mentioned above?
Here is a Python solution if you would still like one:
import re
alpha_brack = re.compile(r"([a-zA-Z0-9.\-]+)|({.*?})")
my_string = """
助 L1782 DN1921 K407 O431 DO346 MN2313 MP2.0376 E314 IN623 DA633 DS248 DF367
DH330 DT284 DC248 DJ826 DG211 DM1800 P1-5-2 I2g5.1 Q7412.7 DR3945 Yzhu4
Wjo ジョ たす.ける たす.かる す.ける すけ {help} {rescue} {assist}"""
match = alpha_brack.findall(my_string)
new_string = my_string
for g0, _ in match: # only care about first group!
    new_string = new_string.replace(g0,'',1) # replace only first occurrence!
final = re.sub(r'\s{2,}',' ', new_string) # finally, clean up whitespace
print(final)
My results:
'助ジョ たすける たすかる すける すけ {help} {rescue} {assist}'
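A token-based variant is another option. The sketch below is an alternative, not the code above: it assumes every field to discard is pure ASCII, that the bracketed fields always come last on the line, and that Python 3.7+ is available for str.isascii(). The function name clean_line is made up for the example.

```python
def clean_line(line):
    """Keep non-ASCII tokens and everything from the first '{' onward."""
    head, brace, tail = line.partition('{')
    # Tokens made only of ASCII characters are the codes to discard.
    kept = [token for token in head.split() if not token.isascii()]
    if brace:
        kept.append(brace + tail.strip())
    return ' '.join(kept)


line = ("助 L1782 DN1921 K407 O431 DO346 MN2313 MP2.0376 E314 IN623 DA633 "
        "DS248 DF367 DH330 DT284 DC248 DJ826 DG211 DM1800 P1-5-2 I2g5.1 "
        "Q7412.7 DR3945 Yzhu4 Wjo ジョ たす.ける たす.かる す.ける すけ "
        "{help} {rescue} {assist}")
cleaned = clean_line(line)
print(cleaned)
```

Since whole tokens are kept or dropped, the dots inside the kana readings survive, and spaces inside the brackets are left untouched.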
Personally, given your example line, I'd sed out all alphanumeric characters that start and end with a space:
sed -E -i 's/ [a-zA-Z0-9 .-]+ / /g' should be close to what you need (the -E is there so that + is treated as a quantifier rather than a literal character). You may have to add more special characters if the text you're wiping out contains other things. This is an in-place substitution with a single space (essentially deleting).
No linux box handy to verify this one... it may require a little massaging.
Also worth mentioning, this will not work if the brackets can contain two spaces: {test results found} as it'll blow away the results
Using perl:
perl -ne '
m/(.*?)({.*)/; # Split at the first {
my $a=$1; my $b=$2;
$a =~ s/[[:alnum:].-]//g; # Remove letters, digits, dots and hyphens (add more characters as you need)
$a =~ s/ +/ /g; # Compress spaces.
print "$a $b\n"; # Print the 2 parts and a newline
' dbfile.txt
Explanation in the inline comments.
Similar logic with sed:
sed '
h; #Save line in hold space.
s/{.*//; # Remove 2nd part
s/[a-zA-Z0-9.-]//g; # Remove all alphabets, numbers, . & -
s/  */ /g; # Compress spaces (one space followed by zero or more spaces)
x; #Save updated 1st part in hold space, take back the complete line in pattern space
s/[^{]*{/{/; #Remove first part
x; #Swap hold & pattern space again.
G; # Append 2nd part to first part separated by newline
s/\n//; # Remove newline.
' dbfile.txt
Using shell script (Bash):
#!/bin/bash
string="助 L1782 DN1921 K407 O431 DO346 MN2313 MP2.0376 E314 IN623 DA633 DS248 DF367 DH330 DT284 DC248 DJ826 DG211 DM1800 P1-5-2 I2g5.1 Q7412.7 DR3945 Yzhu4 Wjo ジョ たす.ける たす.かる す.ける すけ {help} {rescue} {assist}"
echo "" > tmpfield
for field in $string
do
    if [ "${field:0:1}" != "{" ]; then
        echo $field|sed "s/[a-zA-Z0-9 .-]/ /g" >> tmpfield
    else
        echo $field >> tmpfield
    fi
done
#convert rows to one column
cat tmpfield | awk 'NF'|awk 'BEGIN { ORS = " " } { print }'
My output:
nampt#nampt-desktop:/mnt$ bash 1.bash
助 ジョ たす ける たす かる す ける すけ {help} {rescue} {assist}

pyparsing: How to handle "{ foo bar \n }" formatted stream?

I'm hoping someone can point out a method to get pyparsing to handle the following stream of data:
"text { \n line1 line1\n line2 line2\n \n }"
where the information between the braces is just a blob of strings for further parsing later. The best I've been able to accomplish is to use skipTo with a failOn attribute.
line = SkipTo(LineEnd(), failOn=(LineStart()+LineEnd())|'}') + LineEnd().suppress()
nxos_clause = "with" + output_file + "{" + OneOrMore(line.setDebug()) + "}"
Debug shows
Match {SkipTo:(LineEnd) Suppress:(LineEnd)} at loc 76(4,1)
Exception raised:Found expression {{LineStart LineEnd} | "}"} (at char 94), (line:4, col:19)
(1, 'failed parse:', 'Expected "}" (at char 77), (line:4, col:2)')
The output I am looking for would be
"{", "line1 line1", "line2 line2", "}"
I know this is dead simple to do manually. I am looking to build a more complex grammar once I get the simple stuff working...
If newlines are significant, you'll need to remove them from the pyparsing set of default whitespace characters.
from pyparsing import *
ParserElement.setDefaultWhitespaceChars(' ')
To suppress empty lines, define an expression that matches empty lines, and match and suppress them, before testing for the expression that matches lines that may have content:
test = "text { \n line1 line1\n line2 line2\n \n }"
NL = LineEnd().suppress()
LBRACE,RBRACE = map(Literal, "{}")
emptyLine = Suppress(Empty() + NL)
line = SkipTo(NL) + NL
nxos_clause = "text" + LBRACE + OneOrMore(~RBRACE + (emptyLine | line)) + RBRACE
Also, note that we had to lookahead inside the OneOrMore so as not to process a closing right brace as a valid non-empty line.
Now parse the whole input line:
print(nxos_clause.parseString(test))
Gives:
['text', '{', 'line1 line1', 'line2 line2', '}']

Multi-Line Statements problems

I've read various guides on multi-line statements but cannot find one that covers comments, variables, and text that needs splitting over multiple lines.
I'm struggling to split the below code:
ex = 25
cmd = 'raspistill -o ' + filename + ' -t 1000 -ex ' + ex
onto a multi line with comments, like this:
cmd = 'raspistill -o ' + filename + \ # explain line 1
' -t 1000' \ # explain line 2
'-ex ' + ex # explain line 3
Is this the best way to split code over multiple lines?
You can use parentheses instead of backslashes to do line continuations:
a = ( "aaa" + # foo
      "bbb" + # bar
      "ccc"   # baz
    )
Basically when you have an expression in any kind of brackets, python will not end statements at the end of line, but will first wait until it finds the corresponding closing bracket.
I find it more readable and idiomatic than the backslashes.
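Applied to the command from the question, the parenthesized form might look like the sketch below. The file name is a placeholder that is not in the question, and str(ex) is used because ex is an integer there and cannot be concatenated to a string directly.

```python
filename = 'image.jpg'  # hypothetical file name, not from the question
ex = 25

cmd = ('raspistill -o ' + filename +  # explain line 1
       ' -t 1000' +                   # explain line 2
       ' -ex ' + str(ex))             # explain line 3
print(cmd)
```

Inside the parentheses each line can carry its own trailing comment, which is exactly what the backslash form does not allow.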
I'm not sure what language you're using, but this statement is probably not being parsed the way you think:
cmd = 'raspistill -o ' + filename + \ # explain line 1
' -t 1000' \ # explain line 2
'-ex ' + ex # explain line 3
In Python, you'd get the error:
SyntaxError: unexpected character after line continuation character
The problem is that the line continuation character (backslash \) isn't escaping the newline, because the newline doesn't directly follow the backslash. It doesn't come until much later, after the space and your comment. A backslash continues the line only when it is the very last character on that line, so Python rejects these lines instead of joining them.
Get rid of the extra comments and put them in the front, for example:
# explain lines 1, 2, and 3
#
cmd = 'raspistill -o ' + filename + \
      ' -t 1000' \
      ' -ex ' + ex
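The error can be reproduced without creating a file by handing the source text to the built-in compile(). This is only a sketch; the source strings are shortened stand-ins for the lines in the question.

```python
# Backslash followed by a space and a comment: not a valid continuation.
broken = "cmd = 'a' + \\ # explain line 1\n    'b'\n"
# Backslash as the very last character on the line: valid.
fixed = "cmd = 'a' + \\\n    'b'\n"

try:
    compile(broken, '<demo>', 'exec')
    message = None
except SyntaxError as error:
    message = error.msg
print(message)

compile(fixed, '<demo>', 'exec')  # compiles without raising
```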
