Regex and grep exception matching

Regex and grep exception matching - python

I tested my regex for matching exceptions in a log file :
http://gskinner.com/RegExr/
Regex is :
.+Exception[^\n]+(\s+at.++)+
And it works for couple of cases I pasted here, but not when I'm using it with grep :
grep '.+Exception[^\n]+(\s+at.++)+' server.log
Does grep needs some extra flags to make it work wit regex ?
Update:
It doesn't have to be regex, I'm looking for anything that will print exceptions.

Not all versions of grep understand the same syntax.
Your pattern contains a + for 1 or more repeats, which means it is in egrep territory.
But it also has \s for white space, which most versions of grep are ignorant of.
Finally, you have ++ to mean a possessive match of the preceding atom, which only fairly sophisticated regex engines understand. You might try a non-possessive match.
However, you don’t need a leading .+, so you can jump right to the string you want. Also, I don’t see why you would use [^\n] since that’s what . normally means, and because you’re operating in line mode already anyways.
If you have grep -P, you might try that. I’m using a simpler but equivalent version of your pattern; you aren’t using an option to grep that gives only the exact match, so I assume you want the whole record:
$ grep -P 'Exception.+\sat' server.log
But if that doesn’t work, you can always bring out the big guns:
$ perl -ne 'print if /Exception.+\sat/' server.log
And if you want just the exact match, you could use
$ perl -nle 'print $& if /Exception.*\bat\b.*/' server.log
That should give you enough variations to play with.
I don’t understand why people use web-based “regex” builders when they can just do the same on the command line with existing tools, since that way they can be absolutely certain the patterns they devise will work with those tools.

You need to pass it the -e <regex> option and if you want to use the extended regex -E -e <regex> . Take a look at the man: man grep

It looks like you're trying to find lines that look something like:
... Exception foobar at line 7 ...
So first, to use regular expressions, you have to use -e with grep, or you can just run egrep.
Next, you don't really have to specify the .+ at the start of the expression. It's usually best to minimize what you're searching for. If it's imperative that there is at least one character before "Exception", then just use ..
Also, \s is a perl-ish way of asking for a space. grep uses POSIX regex, so the equivalent is [[:space:]].
So, I would use:
grep -e 'Exception.*[[:space:]]at'
This would get what you want with the least amount of muss and fuss.

Related

Sed command in python

My input is as
Type combinational function (A B)
Want output to be
Type combinational
function (A B)
I used code and its working
sed 's/\([^ ]* [^ ]*\) \(function.*\)/\1\n\2/' Input_file
When I use this code inside python script using os.system and subprocess its giving me error.
How can I execute this sed inside python script. Or how can I write python code for above sed code.
Python code used
cmd='''
sed 's/\([^ ]* [^ ]*\) \(function.*\)/\1\n\2/' Input_file
'''
subprocess.check_output(cmd, shell=True)
Error is
sed: -e expression #1, char 34: unterminated `s' command

The \n in the string is being substituted by Python into a literal newline. As suggested by #bereal in a comment, you can avoid that by using r'''...''' instead of '''...''' around the script; but a much better solution is to avoid doing in sed what Python already does very well all by itself.
with open('Input_file') as inputfile:
lines = inputfile.read()
lines = lines.replace(' function', '\nfunction')
This is slightly less strict than your current sed script, in that it doesn't require exactly two space-separated tokens before the function marker. If you want to be strict, try re.sub() instead.
import re
# ...
lines = re.sub(r'^(\S+\s+\S+)\s+(function)', r'\1\n\2', lines, re.M)
(Tangentially, you also want to avoid the unnecessary shell=True; perhaps see Actual meaning of 'shell=True' in subprocess)

Although the solutions 1 and 2 are the shortest valid way to get your code running (on Unix), i'd like to add some remarks:
a. os.system() has some issues related to it, and should be replaced by subprocess.call("your command line", shell=False). Regardless of using os.system or subprocess.call, shell=True implies a security risk.
b. Since sed (and awk) are tools that rely heavily on regular expressions it is recommended, when building python for maintainability, to use native python code. In this case use the re, regular expression module, which has a regexp optimized implementation.

Why is the syntax invalid in this Python one-liner using a for-loop?

I'm working on a Bash script. In it, there are a couple occasions where I need to parse some JSON. My usual approach for that is as follows:
MY_JSON=$(some command that prints JSON to stdout)
RESULT=$(python -c "import json,sys;data=json.load(sys.stdin); python code here that prints out the value(s) I need")
This tends to work well. However, yesterday I ran into a problem. I had the following code:
MY_JSON=$(command that returns JSON containing an array of IDs)
IDS=$(echo "${MY_JSON}" | python -c "import json,sys;data=json.load(sys.stdin); for a in data['array']: print(a['id'])")
When I ran that code, I got "Syntax Error" with the caret pointing at the f in for.
In my googling, everything I found indicated that when you get a syntax error on the very first character of a statement, it usually means that you screwed something up in the previous statement. However, if I remove the for loop entirely, I get no syntax error. So, obviously, the problem is with the loop.
What did I do wrong? How can the syntax error be the first character of valid keyword?
I ended up finding the answer, which I'll post below to help others who are trying to build Python one-liners involving a for loop -- but I'm hoping someone can chime in with a better answer, perhaps using comprehensions (which I don't fully understand) or something else instead of the for loop so that I can actually accomplish this in a single line. Using a language other than Python would also be acceptable, so long as it's something typically available on a Linux host.
To be clear, I'd be looking for solutions using true JSON parsing, not some approximation using your favorite string manipulation tool (sed, awk, etc) that would be fragile with respect to things like whether the JSON is pretty-printed.

Statements in Python's grammar are divided into two groups, simple statements and compound statements:
stmt: simple_stmt | compound_stmt
Only a simple statement can contain ;, and a simple statement is limited to so-called small statements:
simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE
Small statements do not include for loops.
small_stmt: (expr_stmt | del_stmt | pass_stmt | flow_stmt |
import_stmt | global_stmt | nonlocal_stmt | assert_stmt)
A for loop is, rather, a compound statement:
compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcdef | classdef | decorated | async_stmt

This turns out to have been caused by Python's use of semantic whitespace. Python veterans probably knew that immediately, but I only dabble, so this was confusing. I'll explain.
After extensive searching, I landed on the idea that perhaps indentation was the problem. Other people were getting syntax errors that were caused by indentation issues.
Python uses indentation instead of visible characters to determine the boundaries of a semantic block of code (like the body of a loop). In a C-like language (which is where I'm most comfortable -- Java, C#, etc), we use curly braces for this:
for (var i in myArray) {
i.doSomething();
printSomething(i);
}
None of the whitespace is important, so it's easy to turn it into a one-liner (although that's not a common practice in C-style languages:
for (var i in myArray) { i.doSomething(); printSomething(i); }
So, the next thing I tried was taking more care with the spaces after my semicolons. My for loop was "indented" one space, whereas the import and json.load lines had no leading spaces. So I took out that space (I'm leaving off some of the surrounding code for brevity):
python -c "import json,sys;data=json.load(sys.stdin);for a in data['array']: print(a['id'])"
This didn't help. My next thought was that perhaps the semicolons are insufficient and that I needed actual line breaks. I iterated a bit and landed on this, which works:
MY_JSON=$(command that returns JSON containing an array of IDs)
IDS=$(echo "${MY_JSON}" | python -c "import json,sys;data=json.load(sys.stdin);
for a in data['array']:
print(a['server_id'])")
Both the for statement and the loop body need to be on actual separate lines, with the body indented more deeply than for. Note that this chunk of bash script falls within the body of an if, so it's all indented. In my experimentation, I was unable to find an arrangement that worked where the for loop itself began anywhere but the very beginning of the line.
The end result isn't pretty, but it works.

Using a language other than Python would also be acceptable, so long as it's something typically available on a Linux host.
Give JQ a try. It's a full fledged query language with compact syntax, perfect for command-line JSON parsing and manipulation.
IDS=$(jq -r '.array[].id' <<< "$MY_JSON")
Or, safer if the IDs could contain whitespace or other special characters:
readarray -t IDS < <(jq -r '.array[].id' <<< "$MY_JSON")
If they could contain newlines, this super paranoid version delimits items with \0:
readarray -d $'\0' -t IDS < <(jq -r0 '.array[].id' <<< "$MY_JSON")
Depending on what you're doing with IDS you may be able to do that in JQ as well.

Fastest way to replace space with an unused character and add space in between all characters

What is a fast way to:
Replace space with an unused unicode character.
Add spaces in between all characters
I've tried:
$ python3 -c "print (open('test.txt').read().replace(' ', u'\uE000').replace('', ' '))" > test.spaced.txt
But when I tried it on a 6GB textfile with 90 Million lines, it's really slow.
Simply reading the file after opening it takes really long:
$ time python3 -c "print (open('test.txt').read())"
Assume that my machine has more than enough RAM to handle the inflated file,
Is there a way to do it with sed / awk / bash tools?
Is there a faster way to do the replacement and addition faster in Python?

I believe, using the tools specially designed for text processing is faster than invoking a script written in a general-purpose interpreted language such as Python.
SED doesn't support Unicode escape sequences, but it is possible to pass the actual characters using command substitution:
sed -i -e "s/ /$(printf '\uE000')/g; s/\(.\)/ \1 /g" file
Perl is my favorite, because it is very flexible. It is also much better for text processing than Python:
The Perl languages borrow features from other programming languages
including C, shell script (sh), AWK, and sed... They provide
powerful text processing facilities without the arbitrary data-length
limits of many contemporary Unix commandline tools,... facilitating
easy manipulation of text files.
(from Wikipedia)
Example:
perl -CSDL -p -i -e 's/ /\x{E000}/g ; s/(.)/ \1 /g' file
Note, the -CSDL option enables UTF-8 for the output.
There is also an AWKward way of doing this using GNU AWK version 4.1.0 or newer:
gawk -i inplace '{
a = gsub(/ /, "\xee\x80\x80");
a = gensub(/(.)/, " \\1 ", "g");
print a; }' file
But I wouldn't recommend for obvious reasons.
I doubt that anyone would claim that a specific tool, or algorithm is the fastest one, as there are plenty of factors that may affect the performance, - hardware, the way the tools are compiled, tool versions, the kernel version, etc. Perhaps, the best way to find the right tool, or algorithm is to benchmark. I don't think it necessary to mention the time command.

Parsing date from file name in regexp

I've got a file that can be named something like wh-201310301615.tar.gz but it will always have the -201310301615.tar.gz part. I want to find if that string is in the file name and get only the numbers (thus - and .tar.gz mus be present). Currently I use next pattern to find it:
-\d+\.tar\.gz
but I'm pretty sure there's a better way to do it and to get only the numbers (currently I have to trim the string). Any suggestions?
EDIT:I'm using python thus it's engine.

Try this pattern.
(?<=-)(\d+)(?=\.tar\.gz)
see DEMO

I'm not entirely sure what regex engine you are using, but assuming I've understood your question, this should work in any that support lookarounds.
(?![^-]+-)\d+(?=\.tar\.gz)

You can do it with a find and a small script.
unix> ls
wh-201310301615.tar.gz
wh-201310301616.tar.gz
unix> find . -name "wh-*.tar.gz" -exec find_it {} \;
201310301615
201310301616
unix> cat find_it
#!/bin/sh
echo $1 | cut -c 6-17

Sed script to edit csv file Or Python

In our project we need to import the csv file to postgres.
There are multiple types of files meaning the length of the file changes as some files are with fewer columns and some with all of them.
We need a fast way to import this file to postgres. I want to use COPY FROM of the postgres since the speed requirement of the processing are very high(almost 150 files per minute with 20K file size each).
Since the file columns numbers are not fixed, I need to pre-process the file before I pass it to the postgres procedure. The pre-processing is simply to add extra commas in the csv for columns, which are not there in the file.
There are two options for me to preprocess the file - use python or use Sed.
My first question is, what would be the fastest way of pre-process the file?
Second question is, If I use sed how would I insert a comma after say 4th, 5th comma fields?
e.g. if file has entries like
1,23,56,we,89,2009-12-06
and I need to edit the file with final output like:
1,23,56,we,,89,,2009-12-06

Are you aware of the fact that COPY FROM lets you specify which columns (as well as in which order they) are to be imported?
COPY tablename ( column1, column2, ... ) FROM ...
Specifying directly, at the Postgres level, which columns to import and in what order, will typically be the fastest and most efficient import method.
This having been said, there is a much simpler (and portable) way of using sed (than what has been presented in other posts) to replace an n th occurrence, e.g. replace the 4th and 5th occurrences of a comma with double commas:
echo '1,23,56,we,89,2009-12-06' | sed -e 's/,/,,/5;s/,/,,/4'
produces:
1,23,56,we,,89,,2009-12-06
Notice that I replaced the rightmost fields (#5) first.
I see that you have also tagged your question as perl-related, although you make no explicit reference to perl in the body of the question; here would be one possible implementation which gives you the flexibility of also reordering or otherwise processing fields:
echo '1,23,56,we,89,2009-12-06' |
perl -F/,/ -nae 'print "$F[0],$F[1],$F[2],$F[3],,$F[4],,$F[5]"'
also produces:
1,23,56,we,,89,,2009-12-06
Very similarly with awk, for the record:
echo '1,23,56,we,89,2009-12-06' |
awk -F, '{print $1","$2","$3","$4",,"$5",,"$6}'
I will leave Python to someone else. :)
Small note on the Perl example: I am using the -a and -F options to autosplit so I have a shorter command string; however, this leaves the newline embedded in the last field ($F[5]) which is fine as long as that field doesn't have to be reordered somewhere else. Should that situation arise, slightly more typing would be needed in order to zap the newline via chomp, then split by hand and finally print our own newline character \n (the awk example above does not have this problem):
perl -ne 'chomp;#F=split/,/;print "$F[0],$F[1],$F[2],$F[3],,$F[4],,$F[5]\n"'
EDIT (an idea inspired by Vivin):
COMMAS_TO_DOUBLE="1 4 5"
echo '1,23,56,we,89,2009-12-06' |
sed -e `for f in $COMMAS_TO_DOUBLE ; do echo "s/,/,,/$f" ; done |
sort -t/ -k4,4nr | paste -s -d ';'`
1,,23,56,we,,89,,2009-12-06
Sorry, couldn't resist it. :)

To answer your first question, sed would have less overhead, but might be painful. awk would be a little better (it's more powerful). Perl or Python have more overhead, but would be easier to work with (regarding Perl, that's maybe a little subjective ;). Personally, I'd use Perl).
As far as the second question, I think the problem might be a little more complex. For example, don't you need to examine the string to figure out what fields are actually missing? Or is it guaranteed that it will always be the 4th and 5th? If it's the first case case, it would be way easier to do this in Python or Perl rather than in sed. Otherwise:
echo "1,23,56,we,89,2009-12-06" | sed -e 's/\([^,]\+\),\([^,]\+\),\([^,]\+\),\([^,]\+\),\([^,]\+\),/\1,\2,\3,\4,,\5,,/'
or (easier on the eyes):
echo "1,23,56,we,89,2009-12-06" | sed -e 's/\(\([^,]\+,\)\{3\}\)\([^,]\+\),\([^,]\+\),/\1,\3,,\4,,/'
This will add a comma after the 5th and 4th columns assuming there are no other commas in the text.
Or you can use two seds for something that's a little less ugly (only slightly, though):
echo "1,23,56,we,89,2009-12-06" | sed -e 's/\(\([^,]*,\)\{4\}\)/\1,/' | sed -e 's/\(\([^,]*,\)\{6\}\)/\1,/'

#OP, you are processing a csv file, which have distinct fields and delimiters. Use a tool that can split on delimiters and give you fields to work with easily. sed is not one of them, although it can be done, as some of the answers suggested, but you will get sed regex that is hard to read when it gets complicated. Use tools like awk/Python/Perl where they work with fields and delimiters easily, best of all, modules that specifically tailored to processing csv is available. For your example, a simple Python approach (without the use of csv module which ideally you should try to use it)
for line in open("file"):
line=line.rstrip() #strip new lines
sline=line.split(",")
if len(sline) < 8: # you want exact 8 fields
sline.insert(4,"")
sline.insert(6,"")
line=','.join(sline)
print line
output
$ more file
1,23,56,we,89,2009-12-06
$ ./python.py
1,23,56,we,,89,,2009-12-06

sed 's/^([^,]*,){4}/&,/' <original.csv >output.csv
Will add a comma after the 4th comma separated field (by matching 4 repetitions of <anything>, and then adding a comma after that). Note that there is a catch; make sure none of these values are quoted strings with commas in them.
You could chain multiple replacements via pipes if necessary, or modify the regex to add in any needed commas at the same time (though that gets more complex; you'd need to use subgroup captures in your replacement text).

Don't know regarding speed, but here is sed expr that should do the job:
sed -i 's/\(\([^,]*,\)\{4\}\)/\1,/' file_name
Just replace 4 by desured number of columns

Depending on your requirements, consider using ETL software for this and future tasks. Tools like Pentaho and Talend offer you a great deal of flexibility and you don't have to write a single line of code.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Regex and grep exception matching - python

You need to pass it the -e <regex> option and if you want to use the extended regex -E -e <regex> . Take a look at the man: man grep

Related

Sed command in python

Why is the syntax invalid in this Python one-liner using a for-loop?

Fastest way to replace space with an unused character and add space in between all characters

Parsing date from file name in regexp

Sed script to edit csv file Or Python

Categories

Resources