Regex Expression to get everything between double quotes - python

I'm trying to get a regex to work for a string of multiline text. Need this to work for python.
Example text:
description : "4.10 TCP Wrappers - not installed"
info : "If some of the services running in /etc/inetd.conf are
required, then it is recommended that TCP Wrappers are installed and configured to limit access to any active TCP and UDP services.
TCP Wrappers allow the administrator to control who has access to various inetd network services via source IP address controls. TCP Wrappers also provide logging information via syslog about both successful and unsuccessful connections.
TCP Wrappers are generally triggered via /etc/inetd.conf, but other options exist for \"wrappering\" non-inetd based software.
The configuration of TCP Wrappers to suit a particular environment is outside the scope of this benchmark; however the following links will provide the necessary documentation to plan an appropriate implementation:
ftp://ftp.porcupine.org/pub/security/index.html
The website contains source code for both IPv4 and IPv6 versions."
expect : "^[\\s]*[A-Za-z0-9]+:[\\s]+[^A][^L][^L]"
required : YES
I have come up with this,
[(a-zA-Z_ \t#)]*[:][ ]*\"[^\"]*.*\"
The problem is that it stops at the second \", so the rest of the quoted value is not selected.
My objective is to get the entire string starting from info till the end of the double quotes, relating to the info line.
This same regex should also work for the 'expect' line, starting from expect ending at the double quotes relating to the expect string.
Once I get the entire string I will split it on the first ":" because I want to store these strings into a DB with the "description", "info", "expect" as columns then the strings as values in those columns.
Appreciate the help!

One alternative is to use the lexer provided in the shlex module:
>>> import shlex
>>> s = """tester : "this is a long string
that
is multiline, contains \\" double qoutes \\" and .
this line is finished\""""
>>> shlex.split(s[s.find('"'):])[0]
'this is a long string\nthat\nis multiline, contains " double qoutes " and .\nthis line is finished'
It will also remove the backslashes from the escaped double quotes inside the string.
The code finds the first double quote in the string and only looks at the string starting from there. It then uses shlex.split() to tokenize the remainder of the string, and takes the first token from the returned list.

Update 1: I got this to work:
[(a-zA-Z_ \t#)]*[:][ ]*\"([^\"]|(?<=\\\\)[\"])*\"
Update 2: If you cannot modify the file to add escaped quotes where necessary for the expression above, then as long as the lines such as
group : "#GROUP#" || "test"
exist only as single lines, then I think this will grab those along with the longer quoted values:
[(a-zA-Z_ \t#)]*[:][ ]*(?:\"([^\"]|(?<=\\\\)[\"])*\"|.*)(?=(?:\r\n|$))
Try that, and if it works, I'll update again to explain it.
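For comparison, here is a minimal sketch of a different technique from the lookbehind used above: the common "escaped character or non-quote" alternation, applied with Python's re module. The sample text is a shortened, hypothetical version of the input in the question, and the names PAIR_RE and fields are made up for illustration.

```python
import re

# Shortened, hypothetical sample modeled on the question's input.
text = r'''description : "4.10 TCP Wrappers - not installed"
info : "If some of the services are
required, other options exist for \"wrappering\" non-inetd software.
The website contains source code."
expect : "^[\\s]*[A-Za-z0-9]+:[\\s]+[^A][^L][^L]"
required : YES
'''

# key, colon, then a quoted value in which a backslash escapes the next character.
# [^"\\] also matches newlines, so multiline values are covered.
PAIR_RE = re.compile(r'^\s*(\w+)\s*:\s*"((?:[^"\\]|\\.)*)"', re.MULTILINE | re.DOTALL)

# Map each key to its raw (still escaped) value.
fields = {m.group(1): m.group(2) for m in PAIR_RE.finditer(text)}

for key, value in fields.items():
    print(key, "=>", repr(value))
```

Because the key and value come back as separate groups, there is no need to split on the first ":" afterwards. Unquoted lines such as "required : YES" are simply skipped by this pattern.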

Related

How do I use regex to correctly identify a command from a client to server in an internet relay chat program in python

I had some trouble wording this question, but essentially I am working on a client application that sends commands to an IRC server. It was suggested we use regex to parse these commands. The first command that must be sent once the server accepts a client is the USER command, which in general looks something like this:
"USER guest 0 * :Ronnie Reagan"
The parts are: USER; the username, which is one word and, I believe, can contain digits; the mode, a single digit from 0-9 that indicates your current mode in the chat; an asterisk, which is unused but has to be there; and finally a colon followed immediately (no space) by the real name. Note that the manual doesn't say the real name has to be two separate names, just that it can contain spaces, so it can be any combination of letters and spaces, even though that's kind of weird.
This is what I came up with based on what I read about regex but have had some issues testing it.
"USER\s[a-zA-Z0-9]\s\d\s*\s:[a-zA-z\s]"
Here is the simple program I was using to test it based on some light tutorials I looked through
import re
userPattern = re.compile("USER\s[a-zA-Z0-9]\s\d\s*\s:[a-zA-z\s]")
while True:
    regexTest = input()
    isMatch = userPattern.match(regexTest)
    if bool(isMatch) == True:
        print("valid request")
    else:
        print("invalid request")
No matter the input, I always get an invalid request, and I've tried it in a few other ways too. I can't tell if it's because something is wrong with my regex or with my method of testing it.
There are some issues in your regex:
[a-zA-Z0-9] matches a single character; add a plus sign so that it matches one or more characters: [a-zA-Z0-9]+. The same applies to [a-zA-z\s], which should also read [a-zA-Z\s]: the range A-z accidentally includes the punctuation characters that sit between Z and a in ASCII.
* is a special symbol in regex (a quantifier); escape it to match a literal asterisk: \*
So this is the fixed version of your regex that should work:
USER\s[a-zA-Z0-9]+\s\d\s\*\s:[a-zA-Z\s]+
But I think it could be simplified:
If you don't care about what goes after a colon then you can just use .+ there
Instead of [a-zA-Z0-9] you can just use \w (word characters; note that it also matches the underscore)
So I think this would work as well:
USER\s\w+\s\d\s\*\s:.+
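A minimal sketch of how the simplified pattern could be tested without an input loop. The function name is_valid_user_command is hypothetical, and using fullmatch() with a raw string is my choice here rather than something from the question.

```python
import re

# Raw string so that \s, \w, \d and \* reach the regex engine unchanged.
USER_RE = re.compile(r"USER\s\w+\s\d\s\*\s:.+")

def is_valid_user_command(line):
    # fullmatch() requires the whole line to match, not just a prefix.
    return USER_RE.fullmatch(line) is not None

for candidate in ["USER guest 0 * :Ronnie Reagan",
                  "USER guest 0 x :Ronnie Reagan",
                  "USER guest 0 * :"]:
    print(candidate, "->", "valid request" if is_valid_user_command(candidate) else "invalid request")
```

Only the first candidate is accepted: the second has no asterisk and the third has an empty real name.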

Escape a console string containing a path with "\r" (python)

I need to send the following commands to a busybox device via a serial port:
SBC1000 > setenv serverip '192.168.128.100'
SBC1000 > setenv fsfile '1k\root.jffs2-128k'
SBC1000 > saveenv
I can escape the single quotes of the first line without a problem using a backslash:
cmd = 'setenv serverip \'192.168.128.100\''
I've tried various combinations of backslashes for the second line, but couldn't get the 1k\root part to escape properly. I believe it is being interpreted as a return. I tried double and triple escape with no success.
I finally stumbled upon using
cmd = 'setenv fsfile \'1k\\\u0072oot.jffs2-128k\''
to include the \r ( not a return ) for my string.
Is there a more readable way to include this \r ( not a return ) pattern in my string?
The solution was to use double-quotes " " as suggested by John Szakmeister.
I discovered that the command string was being passed to a function inside a private class based on pexpect-serial.
My guess is that my string was being evaluated by pexpect in a greedy way. By using a distinct delimiter, the problem was overcome.
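As a small sketch of the more readable spellings: in plain Python the literals below all produce the same string, with a literal backslash followed by "r" and no carriage return. The raw-string form is my addition and was not part of the accepted fix. If a downstream layer re-interprets backslashes, as the pexpect-based wrapper apparently did, that has to be handled separately.

```python
# Single-quoted literal: both the quotes and the backslash must be escaped.
single = 'setenv fsfile \'1k\\root.jffs2-128k\''

# Double-quoted literal: the inner single quotes no longer need escaping.
double = "setenv fsfile '1k\\root.jffs2-128k'"

# Raw string: the backslash is taken literally, so \r is not a carriage return.
raw = r"setenv fsfile '1k\root.jffs2-128k'"

# The workaround from the question, using a unicode escape for "r".
workaround = 'setenv fsfile \'1k\\\u0072oot.jffs2-128k\''

print(raw)
print(single == double == raw == workaround)
```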

How to select an entire entity around a regex without splitting the string first?

My project (unrelated to this question, just context) is a ML classifier, I'm trying to improve it and have found that when I stripped URLS from the text given to it, some of the URLS have been broken by spaces. For example:
https:// twitter.com/username/sta tus/ID
After I remove links that are not broken, I am left with things like www website com. I removed those with the following regular expression in Python:
tweet = re.sub('(www|http).*?(org |net |edu |com |be |tt |me |ms )','',tweet);
I've put a space after every one of them because this happens after the regular strip and text processing (so only working with parts of a URL separated by spaces) and theoretically we should only pick up the remainders of a broken link... not something like
http website strangeTLD .... communication
It's not perfect but it works, however I just thought that I might try to preemptively remove URLS from twitter only, since I know that the spaces that break the regular URL strip will always be in the same places, hoping this improves my classifier accuracy? This will get rid of the string of characters that occurs after a link... specifically pictures, which is a lot of my data.
Specifically, is there a way to select the entity surrounding/after:
pic.twitter.com/
or, in reference to the example I gave earlier, select the entity after the username broken by the space in status (I'm just guessing at this regex)...
http.*?twitter.com/*?/sta tus/
Thank you in advance! And for the record, I was given this dataset to work with; I am not sure why the URLs are almost all broken by spaces.
Yes, what you are talking about is called Positive Lookbehind and works using (?<=...), where the ellipsis should be replaced by what you want to skip.
E.g. if you want to select whatever comes after username in https://twitter.com/username/status/ID, just use
(?<=https:\/\/twitter\.com\/username\/).*
and you will get status/ID, like you can see with this live demo.
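In this case I escaped the slashes / with backslashes. That is only required in regex flavors that use / as the pattern delimiter (such as JavaScript literals); in Python it is harmless but unnecessary. I also used the Kleene star operator, i.e. the asterisk, to match any number of occurrences of . (any character), just like you did.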
What a positive lookbehind combination does is specifying some mandatory text before the current position of your cursor; in other words, it puts the cursor after the expression you feed it (if the said text exists).
Of course this is not enough in your case, since username won't be a fixed string but a variable one. That is a problem, because lookbehinds in Python's re module must have a fixed width.
So you can just skip https://twitter.com/
(?<=https:\/\/twitter\.com\/).*
And then, via Python, create a substring
currentText = "username/status/ID"
result = currentText.split("/",1)[1] # returns status/ID
Test it in this demo (click "Execute"); a simple explanation of how this works is in the answer to this question (in short, you just split the string at the first slash character).
As a sidenote, blanks/spaces aren't allowed in URLs and if necessary are usually encoded as %20 or + (see e.g. this answer). In other words, every URL you got can be safely stripped of spaces before processing, so... why didn't they do it?
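The two steps above (fixed-width lookbehind, then a split on the first slash) can be sketched in Python as follows. The capture-group variant at the end is an alternative I am adding for comparison, not part of the approach described above; it avoids the lookbehind entirely and so has no fixed-width restriction.

```python
import re

url = "https://twitter.com/username/status/ID"

# Step 1: fixed-width lookbehind skips the scheme and host.
match = re.search(r"(?<=https://twitter\.com/).*", url)
after_host = match.group(0)

# Step 2: drop the variable-length username by splitting at the first slash.
tail = after_host.split("/", 1)[1]

# Alternative: consume the username with [^/]+ and capture what follows.
alt = re.search(r"https://twitter\.com/[^/]+/(.*)", url).group(1)

print(after_host)
print(tail)
print(alt)
```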

Escaping values for vim.command, vim.eval in Python vim plugins

I'm writing a python plugin for vim and it's looking like the only way to call a specific command is with the vim.command function. However, just substituting values into the function seems like a bad idea. How would I escape values so that I can pass untrusted data as an argument into a vim function? As a simple example, let's say I want to echo out untrusted input (I know I could just use print, but this is just an example). I would do something like:
value = get_data_from_untrusted_source()
vim.command("echo %s" % value)
However, if that untrusted data has a | in it, the command is ended and a new one is executed, which is bad. Even if I use quotes, we end up with SQL-injection-like attacks where an attacker can just put an apostrophe in their response to end the string. If we then double the quotes, it might still be possible to put a backslash somewhere to end the quote. For example, if we just double the quotes, \' becomes \'', where the backslash escapes the first quote.
Basically what I'm asking is if there's a safe way to call vim functions from a python plugin and would appreciate any help.
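One possible direction, offered as an unverified sketch rather than a known-safe answer: as far as I know, a Vim single-quoted (literal) string treats only the apostrophe specially, written as two apostrophes, and takes backslashes literally (see :help literal-string). Under that assumption, doubling apostrophes would be enough for such strings. The helper name vim_single_quoted is hypothetical, the snippet only builds the command text without calling vim.command, and it does not address values containing newlines.

```python
def vim_single_quoted(value):
    # Assumption: inside a Vim single-quoted string only ' is special,
    # and it is written as ''. Backslashes and | are taken literally.
    return "'" + value.replace("'", "''") + "'"

# Untrusted input containing an apostrophe, a bar and a trailing backslash.
untrusted = "it's | quit \\"
command = "echo " + vim_single_quoted(untrusted)
print(command)
```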

python pexpect sendcontrol key characters

I am working with Python's pexpect module to automate tasks, and I need help figuring out which key characters to use with sendcontrol. How can I send the ENTER key? And for future reference, how can we find the key characters?
here is the code i am working on.
#!/usr/bin/env python
import pexpect
id = pexpect.spawn ('ftp 192.168.3.140')
id.expect_exact('Name')
id.sendline ('anonymous')
id.expect_exact ('Password')
# Not sure how to send the enter control key
id.sendcontrol ('???')
id.expect_exact ('ftp')
id.sendline ('dir')
id.expect_exact ('ftp')
lines = id.before.split ('\n')
for line in lines :
    print line
pexpect has no sendcontrol() method. In your example you appear to be trying to send an empty line. To do that, use:
id.sendline('')
If you need to send real control characters then you can send() a string that contains the appropriate character value. For instance, to send a control-C you would use:
id.send('\003')
or:
id.send(chr(3))
Responses to comment #2:
Sorry, I typo'ed the module name -- now fixed. More importantly, I was looking at old documentation on noah.org instead of the latest documentation at SourceForge. The newer documentation does show a sendcontrol() method. It takes an argument that is either a letter (for instance, sendcontrol('c') sends a control-C) or one of a variety of punctuation characters representing the control characters that don't correspond to letters. But really sendcontrol() is just a convenient wrapper around the send() method, which is what sendcontrol() calls after it has calculated the actual value that you want to send. You can read the source for yourself at line 973 of this file.
I don't understand why id.sendline('') does not work, especially given that it apparently works for sending the user name to the spawned ftp program. If you want to try using sendcontrol() instead then that would be either:
id.sendcontrol('j')
to send a Linefeed character (which is control-j, or decimal 10) or:
id.sendcontrol('m')
to send a Carriage Return (which is control-m, or decimal 13).
If those don't work then please explain exactly what does happen, and how that differs from what you wanted or expected to happen.
If you're just looking to "press enter", you can send a newline:
id.send("\n")
As for other characters that you might want to use sendcontrol() with, I found this useful: https://condor.depaul.edu/sjost/lsp121/documents/ascii-npr.htm
For instance, I was interested in Ctrl+v. Looking it up in the table shows this line:
control character | python & java | decimal | description
^v                | \x16          | 22      | synchronous idle
So if I want to send that character, I can do any of these:
id.send('\x16')
id.send(chr(22))
id.sendcontrol('v')
sendcontrol() just looks up the correct character to send and then sends it like any other text.
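The letter-to-control-character relationship in that table can be sketched in plain Python, without pexpect. This covers letters only; the helper name control_char is hypothetical and this is not pexpect's actual implementation.

```python
def control_char(letter):
    # Ctrl+A is 1, Ctrl+B is 2, ... Ctrl+Z is 26.
    return chr(ord(letter.lower()) - ord("a") + 1)

# Ctrl+V from the table, plus the letters mentioned earlier.
for letter in "vcjm":
    print("ctrl-" + letter, "->", repr(control_char(letter)), ord(control_char(letter)))
```

So control_char('v') gives the same character as '\x16' or chr(22), and the 'j' and 'm' cases correspond to Linefeed and Carriage Return.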
For keys not listed in that table, you can run this script: https://github.com/pexpect/pexpect/blob/master/tests/getch.py (ctrl space to exit)
For instance, I ran that script and pressed F4, and it printed:
27<STOP>
79<STOP>
83<STOP>
So then to press F4 via pexpect:
id.send(chr(27) + chr(79) + chr(83))
