Parsing date from file name in regexp - python

I've got a file that can be named something like wh-201310301615.tar.gz, but it will always have the -201310301615.tar.gz part. I want to find whether that string is in the file name and get only the numbers (thus - and .tar.gz must be present). Currently I use the following pattern to find it:
-\d+\.tar\.gz
but I'm pretty sure there's a better way to do it and to get only the numbers (currently I have to trim the string). Any suggestions?
EDIT: I'm using Python, thus its regex engine.

Try this pattern.
(?<=-)(\d+)(?=\.tar\.gz)
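Since you're in Python, here is a minimal sketch of that pattern with the re module (the file name is just an example):

import re

name = 'wh-201310301615.tar.gz'
m = re.search(r'(?<=-)\d+(?=\.tar\.gz)', name)
if m:
    print(m.group())   # 201310301615 -- only the digits, no trimming needed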

I'm not entirely sure which regex engine you are using, but assuming I've understood your question, this should work in any engine that supports lookarounds.
(?![^-]+-)\d+(?=\.tar\.gz)

You can do it with find and a small script.
unix> ls
wh-201310301615.tar.gz
wh-201310301616.tar.gz
unix> find . -name "wh-*.tar.gz" -exec find_it {} \;
201310301615
201310301616
unix> cat find_it
#!/bin/sh
# find passes ./wh-<digits>.tar.gz as $1, so the 12 digits sit at characters 6-17
echo $1 | cut -c 6-17


Python, os.system and subprocess.call both don't work for variable string command

I'm trying to issue a command to another application using Python, but it seems to acknowledge only part of the command. Here are the lines of code in question:
command = 'potreeconverter {} -q NICE -p {} –o {}\{}\{}\{}'.format(path,folder,pathup[0],cid,wpid,folder)
print (command)
os.system(command)
I'm fairly new to Python so forgive me if that's a weird way to construct a string for a directory name containing many variables. However, the print function always returns the exact command that I intended, and it will run as intended if I simply copy and paste it into the command prompt manually.
potreeconverter C:\Users\thomas\source\test.las -q NICE -p test –o C:\Users\thomas\source\55555\55555\test
The command is accepted by the application, but it ignores the -o parameter, which specifies an output directory for the application. It does the same thing if I use subprocess.call. No other part of the command is ever ignored.
I read that this issue can be solved by having Python write the command to a batch file and then sending the batch file through. I would really rather not do that, because it would be pretty inefficient. Is there another way that anyone knows of to avoid this?
Also, I'm unsure what this means but I thought it was odd and perhaps significant. When this problem occurs, and only when this problem occurs, the default output directory that the program chooses instead of the one I specified will use forward slashes instead of backslashes.
You need to escape your backslashes.
A backslash is a special character; that is why you are able to encode special characters like tabs (\t), newlines (\n) and a bunch more.
So you could just replace this:
command = 'potreeconverter {} -q NICE -p {} –o {}\{}\{}\{}'.format(path,folder,pathup[0],cid,wpid,folder)
with:
command = 'potreeconverter {} -q NICE -p {} –o {}\\{}\\{}\\{}'.format(path,folder,pathup[0],cid,wpid,folder)
You could also use Python's raw string notation, which I personally consider nicer and easier to maintain:
command = r'potreeconverter {} -q NICE -p {} –o {}\{}\{}\{}'.format(path,folder,pathup[0],cid,wpid,folder)
What this does is tell Python not to treat backslashes as escape characters (the {} placeholders are still processed by .format(), but that's a separate mechanism).
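A quick way to see the difference (the path here is hypothetical, purely for illustration):

print('C:\new\test')    # \n and \t are interpreted: prints across two lines with a tab
print(r'C:\new\test')   # raw string: prints C:\new\test unchanged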
Now, as to what you said about it being a strange way to make paths: there is a better way, which is os.path.join. It takes any number of arguments and joins them as a path. For example:
>>> os.path.join('C:\\tuna', 'fish', 'directory')
'C:\\tuna\\fish\\directory'
>>>
There are 3 major advantages here: it will choose between / and \ depending on the system (\ on Windows/DOS, / on Unix/Linux, etc.), it can take any number of arguments, and it is more readable. In your case, you could do:
import os.path
base = 'potreeconverter {} -q NICE -p {} –o'.format(path, folder)
out_path = os.path.join(pathup[0], cid, wpid, folder)
command = ' '.join((base, out_path))
Your code would work too, but this is the recommended way of working with paths.
If you have any questions, please ask. Have a good day!
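Since the question also tried subprocess.call, here is a minimal sketch (assuming the question's variables path, folder, pathup, cid and wpid are defined) that passes the arguments as a list, so no shell string is parsed and backslashes need no escaping at all:

import os.path
import subprocess

# Each argument is its own list element, so the shell never re-interprets the string.
subprocess.call([
    'potreeconverter', path,
    '-q', 'NICE',
    '-p', folder,
    '-o', os.path.join(pathup[0], cid, wpid, folder),
])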
Just to make sure I don't leave everyone hanging, I figured out the solution. It's very strange. All I did was call the parameters in a different order, and now it works every time. They work in any order if I input the command manually, but if I issue the command from Python it seems it will only work if I write the output parameter first. Not sure if this is a problem with Python or with the application I'm writing the command to.

git diff --color=never escape sequences in output [duplicate]

I'm using Python's sh module to script git commands. For example, I do things like
import sh
git = sh.git.bake(_cwd='/some/dir/')
project_hash = git('rev-parse', 'HEAD').stdout.strip()
project_branch = git('rev-parse', '--abbrev-ref', 'HEAD').stdout.strip()
project_date = git('log', '-1', '--pretty=format:%ci').stdout.strip()
and then I write the project_hash, project_branch and project_date into a database, etc.
The trouble is git sometimes adds shell escape sequences to its output. For example,
print(repr(project_hash))
print(repr(project_branch))
print(repr(project_date))
leads to
'e55595222076bd90b29e184b6ff6ad66ec8c3a03'
'master'
'\x1b[?1h\x1b=\r2012-03-26 01:07:40 -0500\x1b[m\r\n\r\x1b[K\x1b[?1l\x1b>'
The first two strings are not a problem, but the last one, the date, has escape sequences.
Is there any way I can get rid of these, e.g. asking git not to output any escape sequences?
I have tried the "--no-color" option with the git log command. That did not help.
I would also be happy to strip them out in python itself, but I don't know how. I tried s.encode('ascii') where s is the date string. That did not make a difference.
Print stdout in Python without shell escape sequences addresses the same issue. The recommendation there is to use python's subprocess rather than sh. E.g., I could do
project_date = subprocess.check_output(["git", "log", "-1", "--pretty=format:%ci"], cwd='/some/dir/')
and
print(repr(project_date))
gives
'2012-03-26 01:07:40 -0500'
That is what I want, of course. However, if it is possible I would prefer to stick with sh, and so would like to know if I can avoid the escape sequences using sh.
Any suggestions?
Those are not color sequences, those look like terminal initialization sequences. Specifically:
ESC [ ? 1 h ESC =
is the sequence to turn on function-key-mode and
ESC [ ? 1 l ESC >
is the sequence to turn it off again. This suggests that git log is running things through your pager. I'm not quite sure why; normally git suppresses use of the pager when the output is a pipe (as it is with subprocess.Popen() at least, and I would think with sh, although I have not used the sh module).
(Pause to consult documentation...)
Aha! Per sh module docs, by default, the output of an sh-module-run command goes through a pseudo-tty. This is fooling git into running your pager.
As a slightly dirty work-around, you can run git --no-pager log ... to suppress the use of the pager, even when running with sh. Or, you can try the _tty_out=False argument (again, I have not used the sh module, you will have to experiment a bit). Amusingly, one of the examples at the bottom of the sh module documentation is git!
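For instance, a minimal sketch of that _tty_out idea (whether it fits your setup is an assumption on my part; _tty_out=False makes sh attach a plain pipe instead of a pseudo-TTY):

import sh

# With a pipe instead of a pseudo-TTY, git sees non-interactive output
# and should not invoke the pager, so no escape sequences appear.
git = sh.git.bake(_cwd='/some/dir/', _tty_out=False)
project_date = git('log', '-1', '--pretty=format:%ci').stdout.strip()
print(repr(project_date))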
It seems like sh does the right thing. In Python 2.7, this:
import sh
git = sh.git.bake(_cwd='/tmp/gittest/')
project_hash = git('rev-parse', 'HEAD')
project_branch = git('rev-parse', '--abbrev-ref', 'HEAD')
project_date = git('log', '-1', '--pretty=format:%ci')
print(repr(project_hash).strip())
print(repr(project_branch).strip())
print(repr(project_date).strip())
gives me:
500ddad67203badced9a67170b42228ffa269f53
master
2013-11-22 00:05:59 +1100
If you really want to strip out escapes, use the decoder tools provided by python (Process escape sequences in a string in Python)
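For example, a rough hand-rolled filter (a sketch, assuming the sequences look like the ones in the question: CSI sequences such as ESC [ ? 1 h, plus the two-character ESC = and ESC > pairs):

import re

# Matches ESC [ ... <letter> (CSI sequences) and the two-character ESC= / ESC> pairs.
ANSI_RE = re.compile(r'\x1b\[[0-9;?]*[A-Za-z]|\x1b[=>]')

def strip_escapes(s):
    return ANSI_RE.sub('', s).replace('\r', '').strip()

raw = '\x1b[?1h\x1b=\r2012-03-26 01:07:40 -0500\x1b[m\r\n\r\x1b[K\x1b[?1l\x1b>'
print(repr(strip_escapes(raw)))   # '2012-03-26 01:07:40 -0500'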

Embedding executable python script in python string

If you are a bash/POSIX sh wizard you will know the $(command substitution) feature in bash, which can even be inserted in a string. For example,
$ echo "I can count: $(seq 1 10 | tr -d '\n')"
I can count: 12345678910
You can imagine all the wild things to do with this, especially to form a string dynamically. No need for an if..else block outside the string; just embed the code inside! I am so spoiled by this feature. So here's the question: in Python, can we do something similar? Has anyone already devised a module to accomplish this?
(Just a side comment: admittedly, having this kind of feature is powerful but also opens you up to a security risk, since the program can become vulnerable to code injection. So think thoroughly before doing this, especially with a foreign string coming from outside the code.)
You can use eval(), with all of its potential risks...
See the built-in eval() function.
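A minimal sketch of what eval() does with an expression held in a string (bearing in mind the injection caveat from the question):

width = 3
expr = "width * 2 + 1"   # a Python expression stored in a string
print(eval(expr))        # 7 -- evaluated with access to surrounding names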
Are you looking for an f-string?
Instead of starting the string with ', we start it with f'.
Whenever we want to embed an expression, we just put it inside braces: {}
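For example, a rough f-string equivalent of the bash snippet above (Python 3.6+):

print(f"I can count: {''.join(str(n) for n in range(1, 11))}")
# I can count: 12345678910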

Regex and grep exception matching

I tested my regex for matching exceptions in a log file at:
http://gskinner.com/RegExr/
The regex is:
.+Exception[^\n]+(\s+at.++)+
It works for a couple of cases I pasted there, but not when I'm using it with grep:
grep '.+Exception[^\n]+(\s+at.++)+' server.log
Does grep need some extra flags to make it work with this regex?
Update:
It doesn't have to be regex, I'm looking for anything that will print exceptions.
Not all versions of grep understand the same syntax.
Your pattern contains a + for 1 or more repeats, which means it is in egrep territory.
But it also has \s for white space, which most versions of grep are ignorant of.
Finally, you have ++ to mean a possessive match of the preceding atom, which only fairly sophisticated regex engines understand. You might try a non-possessive match.
However, you don’t need a leading .+, so you can jump right to the string you want. Also, I don’t see why you would use [^\n], since that’s what . normally means, and you’re operating in line mode already anyway.
If you have grep -P, you might try that. I’m using a simpler but equivalent version of your pattern; you aren’t using an option to grep that gives only the exact match, so I assume you want the whole record:
$ grep -P 'Exception.+\sat' server.log
But if that doesn’t work, you can always bring out the big guns:
$ perl -ne 'print if /Exception.+\sat/' server.log
And if you want just the exact match, you could use
$ perl -nle 'print $& if /Exception.*\bat\b.*/' server.log
That should give you enough variations to play with.
I don’t understand why people use web-based “regex” builders when they can just do the same on the command line with existing tools, since that way they can be absolutely certain the patterns they devise will work with those tools.
You need to pass it the -e <regex> option, and if you want to use extended regexes, -E -e <regex>. Take a look at the man page: man grep
It looks like you're trying to find lines that look something like:
... Exception foobar at line 7 ...
So first, to use regular expressions, you have to use -e with grep, or you can just run egrep.
Next, you don't really have to specify the .+ at the start of the expression. It's usually best to minimize what you're searching for. If it's imperative that there is at least one character before "Exception", then just use a single .
Also, \s is a perl-ish way of asking for a space. grep uses POSIX regex, so the equivalent is [[:space:]].
So, I would use:
grep -e 'Exception.*[[:space:]]at'
This would get what you want with the least amount of muss and fuss.
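If you'd rather do the scan from Python, here is a minimal multi-line sketch along the same lines (assuming Java-style traces where the "at ..." lines are indented, and a file named server.log as in the question):

import re

# An "...Exception..." line followed by one or more indented "at ..." lines.
pattern = re.compile(r'^.*Exception.*(?:\n\s+at .*)+', re.MULTILINE)

with open('server.log') as f:
    for block in pattern.findall(f.read()):
        print(block)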

Sed script to edit csv file Or Python

In our project we need to import csv files into Postgres.
There are multiple types of files, meaning the layout varies: some files come with fewer columns and some with all of them.
We need a fast way to import these files into Postgres. I want to use COPY FROM of Postgres, since the speed requirements of the processing are very high (almost 150 files per minute, with a 20K file size each).
Since the number of columns in a file is not fixed, I need to pre-process the file before I pass it to the Postgres procedure. The pre-processing is simply to add extra commas in the csv for the columns which are not there in the file.
There are two options for me to pre-process the file - use Python or use sed.
My first question is: what would be the fastest way to pre-process the file?
The second question is: if I use sed, how would I insert a comma after, say, the 4th and 5th comma-separated fields?
e.g. if file has entries like
1,23,56,we,89,2009-12-06
and I need to edit the file with final output like:
1,23,56,we,,89,,2009-12-06
Are you aware of the fact that COPY FROM lets you specify which columns (as well as in which order they) are to be imported?
COPY tablename ( column1, column2, ... ) FROM ...
Specifying directly, at the Postgres level, which columns to import and in what order, will typically be the fastest and most efficient import method.
This having been said, there is a much simpler (and portable) way of using sed (than what has been presented in other posts) to replace the n-th occurrence, e.g. to replace the 4th and 5th occurrences of a comma with double commas:
echo '1,23,56,we,89,2009-12-06' | sed -e 's/,/,,/5;s/,/,,/4'
produces:
1,23,56,we,,89,,2009-12-06
Notice that I replaced the rightmost comma (#5) first.
I see that you have also tagged your question as perl-related, although you make no explicit reference to perl in the body of the question; here would be one possible implementation which gives you the flexibility of also reordering or otherwise processing fields:
echo '1,23,56,we,89,2009-12-06' |
perl -F/,/ -nae 'print "$F[0],$F[1],$F[2],$F[3],,$F[4],,$F[5]"'
also produces:
1,23,56,we,,89,,2009-12-06
Very similarly with awk, for the record:
echo '1,23,56,we,89,2009-12-06' |
awk -F, '{print $1","$2","$3","$4",,"$5",,"$6}'
I will leave Python to someone else. :)
Small note on the Perl example: I am using the -a and -F options to autosplit so I have a shorter command string; however, this leaves the newline embedded in the last field ($F[5]) which is fine as long as that field doesn't have to be reordered somewhere else. Should that situation arise, slightly more typing would be needed in order to zap the newline via chomp, then split by hand and finally print our own newline character \n (the awk example above does not have this problem):
perl -ne 'chomp;#F=split/,/;print "$F[0],$F[1],$F[2],$F[3],,$F[4],,$F[5]\n"'
EDIT (an idea inspired by Vivin):
COMMAS_TO_DOUBLE="1 4 5"
echo '1,23,56,we,89,2009-12-06' |
sed -e `for f in $COMMAS_TO_DOUBLE ; do echo "s/,/,,/$f" ; done |
sort -t/ -k4,4nr | paste -s -d ';'`
1,,23,56,we,,89,,2009-12-06
Sorry, couldn't resist it. :)
To answer your first question: sed would have the least overhead, but might be painful. awk would be a little better (it's more powerful). Perl or Python have more overhead, but would be easier to work with (regarding Perl, that's maybe a little subjective; personally, I'd use Perl).
As far as the second question goes, I think the problem might be a little more complex. For example, don't you need to examine the string to figure out which fields are actually missing? Or is it guaranteed that it will always be the 4th and 5th? If it's the former, it would be way easier to do this in Python or Perl rather than in sed. Otherwise:
echo "1,23,56,we,89,2009-12-06" | sed -e 's/\([^,]\+\),\([^,]\+\),\([^,]\+\),\([^,]\+\),\([^,]\+\),/\1,\2,\3,\4,,\5,,/'
or (easier on the eyes):
echo "1,23,56,we,89,2009-12-06" | sed -e 's/\(\([^,]\+,\)\{3\}\)\([^,]\+\),\([^,]\+\),/\1,\3,,\4,,/'
This will add a comma after the 5th and 4th columns assuming there are no other commas in the text.
Or you can use two seds for something that's a little less ugly (only slightly, though):
echo "1,23,56,we,89,2009-12-06" | sed -e 's/\(\([^,]*,\)\{4\}\)/\1,/' | sed -e 's/\(\([^,]*,\)\{6\}\)/\1,/'
@OP, you are processing a csv file, which has distinct fields and delimiters. Use a tool that can split on delimiters and give you fields to work with easily. sed is not one of them: it can be done, as some of the answers suggested, but you will get sed regexes that are hard to read as they get complicated. Use tools like awk/Python/Perl that work with fields and delimiters easily; best of all, modules specifically tailored to processing csv are available. For your example, here is a simple Python approach (without the csv module, which ideally you should use):
for line in open("file"):
    line = line.rstrip()         # strip the newline
    sline = line.split(",")
    if len(sline) < 8:           # we want exactly 8 fields
        sline.insert(4, "")
        sline.insert(6, "")
    line = ','.join(sline)
    print(line)
output
$ more file
1,23,56,we,89,2009-12-06
$ ./python.py
1,23,56,we,,89,,2009-12-06
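For completeness, a minimal sketch of the same padding using the csv module the answer mentions; unlike a plain split(","), it copes with quoted fields that contain commas (file name as in the example above):

import csv
import sys

with open("file") as f:
    writer = csv.writer(sys.stdout, lineterminator='\n')
    for row in csv.reader(f):
        if len(row) < 8:         # pad short rows out to the full 8-column layout
            row.insert(4, "")
            row.insert(6, "")
        writer.writerow(row)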
sed -E 's/^([^,]*,){4}/&,/' <original.csv >output.csv
will add a comma after the 4th comma-separated field (by matching 4 repetitions of <anything>, and then adding a comma after the match; note the -E flag, which enables the extended regex syntax the unescaped parentheses and braces require). There is a catch: make sure none of these values are quoted strings with commas in them.
You could chain multiple replacements via pipes if necessary, or modify the regex to add in any needed commas at the same time (though that gets more complex; you'd need to use subgroup captures in your replacement text).
Don't know about speed, but here is a sed expression that should do the job:
sed -i 's/\(\([^,]*,\)\{4\}\)/\1,/' file_name
Just replace 4 by the desired number of columns.
Depending on your requirements, consider using ETL software for this and future tasks. Tools like Pentaho and Talend offer you a great deal of flexibility and you don't have to write a single line of code.
