Regular Expression Question

Regular Expression Question - python

I'm trying to use a regular expression to extract the comments at the top of a file.
For example, the source code may look like:
//This is an example file.
//Please help me.
#include "test.h"
int main() //main function
{
...
}
What I want to extract from the code are the first two lines, i.e.
//This is an example file.
//Please help me.
Any idea?

Why use regex?
>>> f = open('/tmp/source')
>>> for line in f:
...     if not line.startswith('//'):
...         break
...     print(line, end='')
...

>>> code="""//This is an example file.
... //Please help me.
...
... #include "test.h"
... int main() //main function
... {
... ...
... }
... """
>>>
>>> import re
>>> re.findall(r"^\s*//.*",code,re.MULTILINE)
['//This is an example file.', '//Please help me.']
>>>
If you only need to match contiguous comment lines at the top, you could use the following.
>>> re.search(r"^((?:\s*//.*\n)+)",code).group().strip().split("\n")
['//This is an example file.', '//Please help me.']
>>>
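For comparison, here is a minimal Python 3 sketch of the line-based approach from the first answer. The helper name leading_comments is made up for illustration, and it assumes the header comments start at column 0 on the very first line.

```python
from itertools import takewhile

def leading_comments(lines):
    """Return the run of lines at the very top that start with '//'."""
    top = takewhile(lambda line: line.startswith("//"), lines)
    return [line.rstrip("\n") for line in top]

source = '''//This is an example file.
//Please help me.
#include "test.h"
int main() //main function
{
}
'''
print(leading_comments(source.splitlines(keepends=True)))
# ['//This is an example file.', '//Please help me.']
```

Because takewhile stops at the first non-comment line, the trailing //main function comment is never looked at.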

This doesn't just get the first two comment lines; it also picks up multiline comments and the // comments at the end of code lines. It's not exactly what you asked for, though.
data=open("file").read()
for c in data.split("*/"):
    # multiline
    if "/*" in c:
        print(''.join(c.split("/*")[1:]))
    if "//" in c:
        for item in c.split("\n"):
            if "//" in item:
                print(''.join(item.split("//")[1:]))

To extend this to the following cases:
spaces in front of //...
empty lines between the //... lines
import re
code = """//This is an example file.
a
// Please help me.
// ha
#include "test.h"
int main() //main function
{
...
}"""
for s in re.finditer(r"^(\s*)(//.*)",code,re.MULTILINE):
    print(s.group(2))
Output:
//This is an example file.
// Please help me.
// ha

Related

Python regex pattern in order to find if a code line is finishing with a space or tab character

Sorry for asking such a basic question, but I really tried to find the answer before coming here.
I have a script that reads .py files line by line. Its goal is to find lines that end with a space or a tab, as in the example below:
i = 5
z = 25
After the i assignment there should be a trailing space (\s) and after the z assignment a trailing tab (\t). (I hope the code formatting does not erase them.)
def custom_checks(file, rule):
    """
    @param file: file: file in-which you search for a specific character
    @param rule: the specific character you search for
    @return: dict obj with the form { line number : character }
    """
    rule=re.escape(rule)
    logging.info(f" File {os.path.abspath(file)} checked for {repr(rule)} inside it ")
    result_dict = {}
    file = fileinput.input([file])
    for idx, line in enumerate(file):
        if re.search(rule, line):
            result_dict[idx + 1] = str(rule)
    file.close()
    if not len(result_dict):
        logging.info("Zero non-compliance found based on the rule:2 consecutive empty rows")
    else:
        logging.warning(f'Found the next errors:{result_dict}')
After that, when I check the logging output, I see this:
checked for '\+s\\s\$' inside it
I don't know why the backslashes are doubled.
I read all the regexes from a config.json, which is this one:
{
"ends with tab":"+\\t$",
"ends with space":"+s\\s$"
}
Could someone help me with this? I know I could do it in other ways, such as reversing the line with [::-1], taking the first character and checking whether it is whitespace, but I really want to do it with a regex.
Thanks!

Try:
rules = {
'ends with tab': re.compile(r'\t$'),
'ends with space': re.compile(r' $'),
}
Note: iterating over the file leaves the newline ('\n') at the end of each line, but $ in a regex matches both at the very end of the string and just before a newline that ends the string. Thus, if using regex, you don't need to explicitly strip newlines.
if rule.search(line):
    ...
Personally, however, I would use line.rstrip() != line.rstrip('\n') to flag trailing spaces of any kind in one shot.
If you want to directly check for specific characters at the end of the line, you then need to strip any newline, and you need to check if the line isn't empty. For example:
char = '\t'
s = line.strip('\n')
if s and s[-1] == char:
    ...
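As a rough sketch of how the compiled rules could be applied to a whole file, the snippet below returns a dict of offending line numbers. The function name check_lines and the sample lines are made up for illustration; the two patterns are the ones suggested above.

```python
import re

# Patterns as suggested above; the rule names are just labels.
RULES = {
    "ends with tab": re.compile(r"\t$"),
    "ends with space": re.compile(r" $"),
}

def check_lines(lines, rules=RULES):
    """Return {line_number: [names of the rules that matched]}."""
    findings = {}
    for number, line in enumerate(lines, start=1):
        matched = [name for name, rule in rules.items() if rule.search(line)]
        if matched:
            findings[number] = matched
    return findings

# Lines keep their trailing newline, as they would when iterating over a file.
sample = ["i = 5 \n", "z = 25\t\n", "ok = 1\n"]
print(check_lines(sample))
# {1: ['ends with space'], 2: ['ends with tab']}
```

With a real file, passing the open file object as lines should behave the same way, since iterating over it yields lines with their newline attached.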
Addendum 1: read rules from JSON config
import json
import re
# here from a string, but could be in a file, of course
json_config = """
{
"ends with tab": "\\t$",
"ends with space": " $"
}
"""
rules = {k: re.compile(v) for k, v in json.loads(json_config).items()}
Addendum 2: comments
The following shows how to comment out a rule, as well as a rule to detect comments in the file to process. Since JSON doesn't support comments, we can consider yaml instead:
yaml_config = """
ends with space: ' $'
ends with tab: \\t$
is comment: ^\\s*#
# ignore: 'foo'
"""
import yaml
rules = {k: re.compile(v) for k, v in yaml.safe_load(yaml_config).items()}
Note: 'is comment' is easy. A hypothetical 'has comment' is much harder to define -- why? I'll leave that as an exercise for the reader ;-)
Note 2: in a file, the yaml config would be without double backslash, e.g.:
cat > config.yml << EOF
ends with space: ' $'
ends with tab: \t$
is comment: ^\s*#
# ignore: 'foo'
EOF
Additional thought
You may want to give autopep8 a try.
Example:
cat > foo.py << EOF
# this is a comment
text = """
# xyz
bar
"""
def foo():
    # to be continued
    pass
def bar():
    pass
EOF
Note: to reveal the extra spaces:
cat foo.py | perl -pe 's/$/|/'
# this is a comment |
|
text = """|
# xyz |
bar |
"""|
def foo(): |
    # to be continued |
    pass |
|
def bar():|
    pass |
|
|
|
There are several PEP8 issues with the above (extra spaces at end of lines, only 1 line between the functions, etc.). Autopep8 fixes them all (but correctly leaves the text variable unchanged):
autopep8 foo.py | perl -pe 's/$/|/'
# this is a comment|
|
text = """|
# xyz |
bar |
"""|
|
|
def foo():|
    # to be continued|
    pass|
|
|
def bar():|
    pass|

How to read csv with multiple quoted delimiters in single field?

I'd like to be able to split a string which contains the delimiter quoted multiple times. Is there an argument to handle this type of string with the csv module? Or is there another way to process it?
import csv
from io import StringIO
text = '"a,b"-"c,d","a,b"-"c,d"'
next(csv.reader(StringIO(text), delimiter=",", quotechar='"', quoting=csv.QUOTE_NONE))
Expected output: ['"a,b"-"c,d"', '"a,b"-"c,d"']
Actual output: ['"a', 'b"-"c', 'd"', '"a', 'b"-"c', 'd"']
EDIT:
The example above is simplified, but apparently too simplified as some comments provided solutions for the simplified version but not for the full version. Below is the actual data I want to process.
import csv
from io import StringIO
text = '"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0,"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0'
next(csv.reader(StringIO(text), delimiter=",", quotechar='"', quoting=csv.QUOTE_NONE))
Expected output
[
'"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0',
'"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0'
]
Actual output
[
'"3-Amino-1',
'2',
'4-triazole"-text-0-"3-Amino-1',
'2',
'4-triazole"-CD-0',
'"3-Amino-1',
'2',
'4-triazole"-text-0-"3-Amino-1',
'2',
'4-triazole"-LS-0'
]

I'll only answer the first part of your question: there is no way to do this with the built-in csv module.
Looking at the CPython source code, quotechar option is only processed at the start of a field:
case START_FIELD:
    /* expecting field */
    ...
    else if (c == dialect->quotechar &&
             dialect->quoting != QUOTE_NONE) {
        /* start quoted field */
        self->state = IN_QUOTED_FIELD;
    }
    ...
    break;
Inside a field, there is no such check:
case IN_FIELD:
    /* in unquoted field */
    if (c == '\n' || c == '\r' || c == '\0') {
        /* end of line - return [fields] */
        if (parse_save_field(self) < 0)
            return -1;
        self->state = (c == '\0' ? START_RECORD : EAT_CRNL);
    }
    else if (c == dialect->escapechar) {
        /* possible escaped character */
        self->state = ESCAPED_CHAR;
    }
    else if (c == dialect->delimiter) {
        /* save field - wait for new field */
        if (parse_save_field(self) < 0)
            return -1;
        self->state = START_FIELD;
    }
    else {
        /* normal character - save in field */
        if (parse_add_char(self, module_state, c) < 0)
            return -1;
    }
    break;
There is a check for quotechar while the parser is in the IN_QUOTED_FIELD state; however, upon encountering a quote, it goes back to the IN_FIELD state indicating we're inside an unquoted field. So this is possible:
>>> import csv
>>> import io
>>> print(next(csv.reader(io.StringIO('"a,b"cd,e'))))
['a,bcd', 'e']
But once the parser has reached the end of the initial quoted section, it will consider any subsequent quotes as part of the data. I don't know if this behaviour is to conform with any (written or unwritten) CSV specification, or if it's just a bug.
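Since the csv module only honours quotechar at the start of a field, one possible alternative is a small hand-written splitter that toggles an "inside quotes" flag on every quote character. The sketch below is only an illustration: the function name split_outside_quotes is made up, and it assumes quotes are always balanced and never escaped or doubled.

```python
def split_outside_quotes(text, delimiter=",", quotechar='"'):
    """Split text on delimiter, ignoring delimiters that sit between quotes.

    Quote characters are kept in the output; escaped or doubled quotes
    are not handled.
    """
    fields = []
    current = []
    in_quotes = False
    for ch in text:
        if ch == quotechar:
            in_quotes = not in_quotes  # toggle on every quote, wherever it appears
            current.append(ch)
        elif ch == delimiter and not in_quotes:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields

print(split_outside_quotes('"a,b"-"c,d","a,b"-"c,d"'))
# ['"a,b"-"c,d"', '"a,b"-"c,d"']
```

Unlike csv.reader, this keeps the quote characters in each field, which matches the expected output in the question.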

The data is in a non-standard format and so any solution would need to be tested on the full dataset. A possible workaround could be to first replace ," characters with ;" and then simply split it on the ;. This could be done without using CSV or RE:
tests = [
    '"a,b"-"c,d","a,b"-"c,d"',
    '"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0,"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0',
]
for test in tests:
    row = test.replace(',"' , ';"').split(';')
    print(len(row), row)
Giving:
2 ['"a,b"-"c,d"', '"a,b"-"c,d"']
2 ['"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0', '"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0']

If the structure is always the same with the comma sandwiched between an integer and the '"', you can use a regular expression:
import re
re.split('(?<=[0-9]),(?=")', text)

Splitting C code to statements using python

Is there any way to split a string (a complete C file) to C statements using python?
#include <stdio.h>
#include <math.h>
int main (void)
{
    if(final==(final_t))
    {
        foo(final);
        /*comment*/
        printf("equal\n");
    }
    return(0);
}
If this is read to a string is there any way to split it into a list of strings like this:
list=['#include <stdio.h>', '#include <math.h>', 'int main(void){','if(final==(final_t)){', 'foo(final);', '/*comment*/', 'printf("equal\n");', '}', 'return(0);', '}']

Without being extremely complex, a C program is composed of lexical tokens that form declarations and statements according to a syntax. Your splitting needs some more explanation: according to the C language standard, if (cond) statement1 [else statement2] is a single statement. Both statement1 and statement2 can be blocks, so statements can be nested. In your requirements, you seem to concatenate the opening brace of a block to the conditional and leave the closing brace on its own. And you say nothing about declarations or the preprocessor language.
So IMHO, your specifications are still incomplete...
Anyway, it is already far too complex for a simple lexical analyzer. So you should first write the complete grammar that you want to process, ideally in Backus-Naur Form, and declare the terminal tokens. Once you have that, it is easy to use lex + yacc (PLY, Python Lex-Yacc) to build a parser from that grammar.
It is probably not the answer you expected, but C language parsers are far from trivial, unless you only want to accept a small subset of the language.

You should perform the following steps to reach the result:
Get your code as separate lines.
Cut leading and trailing spaces.
Skip empty lines.
If your code is given as a string you can use:
lines = content.split('\n')
If as a file:
with open('file.c') as f:
    lines = f.readlines()
To cut extra spaces:
lines = list(map(str.strip, lines))
To skip empty lines:
lines = list(filter(lambda x: x, lines))
So the full code may look like this:
# raw string, so that the \n inside printf stays a literal backslash-n
content = r"""
#include <stdio.h>
#include <math.h>
int main (void)
{
    if(final==(final_t))
    {
        foo(final);
        printf("equal\n");
    }
    return(0);
}
"""
lines = content.split('\n')
lines = list(map(str.strip, lines))
lines = list(filter(lambda x: x, lines))
print(lines)

code_list = []
with open("<your-code-file>", 'r') as code_file:
    for line in code_file:
        # attach a line containing "{" to the previous entry, if there is one
        if "{" in line and code_list:
            code_list[-1] = code_list[-1] + line.strip()
        else:
            code_list.append(line.strip())
print(code_list)
output:
['#include <stdio.h>', '#include <math.h>', '', 'int main (void){\n', 'if(final==(final_t)) {\n', 'foo(final);', 'printf("equal\\n");', '}', 'return(0);', '}']
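The two line-based answers above can be combined roughly as follows. This is only a sketch under strong assumptions: the function name split_c_lines is made up, every statement is assumed to sit on its own line, and an opening brace is only merged when it stands alone on a line. It is not a C parser.

```python
def split_c_lines(source):
    """Split C source into stripped, non-empty lines.

    A line consisting only of '{' is glued onto the previous entry.
    """
    result = []
    for raw in source.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if line == "{" and result:
            result[-1] += line  # attach the lone opening brace
        else:
            result.append(line)
    return result

sample = r"""
#include <stdio.h>
int main (void)
{
    if(final==(final_t))
    {
        foo(final);
    }
    return(0);
}
"""
print(split_c_lines(sample))
# ['#include <stdio.h>', 'int main (void){', 'if(final==(final_t)){', 'foo(final);', '}', 'return(0);', '}']
```

Anything beyond this layout (several statements on one line, braces inside string literals, multi-line comments) would need a real tokenizer or parser, as the first answer points out.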

How to replace letters with numbers and re-convert at anytime (Caesar cipher)?

I've been coding this for almost 2 days now but can't get it. I've written two different bits of code trying to solve it.
Code #1
This one lists the letters but won't change them to numbers (a->1, b->2, etc.).
import re
text = input('Write Something- ')
word = '{}'.format(text)
for letter in word:
    print(letter)
# prints each letter on its own line
Outcome-
Write something- test
t
e
s
t
Then I have this code that changes the letters into numbers, but I haven't been able to convert it back into letters.
Code #2
u = input('Write Something')
a = ord(u[-1])
print(a)
#converts to number and prints ^^
print('')
print(????)
#need to convert from numbers back to letters.
Outcome:
Write Something- test
116
How can I send a text through (test) and make it convert it to either set numbers (a->1, b->2) or random numbers, save it to a .txt file and be able to go back and read it at any time?

What you're trying to achieve here is called "Caesar encryption".
For example, say you normally have: A=1, a=2, B=3, b=4, etc.
Then you have a "key" which "shifts" the letters. Let's say the key is 3: you shift all letters up by 3 and end up with A=4, a=5, B=6, b=7, etc.
This is of course only ONE way of doing a Caesar encryption, and it is the most basic example. You could also say your key is "G", which would give you:
A=G, a=g, B=H, b=h, etc., or
A=G, a=H, B=I, b=J, etc.
I hope you understand what I'm talking about. Again, this is only one very simple example.
Now, for your program/script you need to define this key. If the key should be variable, you need to save it somewhere (write it down). Put your words in a string, then check and convert each letter and write it into a new string.
You then could say (pseudo code!):
var key = READKEYFROMFILE;
string old = READKEYFROMFILE_OR_JUST_A_NORMAL_STRING_:)
string new = "";
for (int i=0; i<old.length; i++){
    get the character at i;
    compare with your "key";
    shift it;
    write it in new;
}
Hope this helps.
edit:
You could also use a dictionary (like the other answer says); that is a very static (but easy) way.
Also, maybe watch some guides/tutorials on programming, since you don't seem to be that experienced yet. And google "Caesar encryption" to understand this topic better (it's very interesting).
edit2:
OK, so basically:
You have a variable called "key", in which you store your key (see the explanation of the key above).
You then have a string variable called "old", and another one called "new".
In old, you put the string that you want to convert.
New will be empty for now.
You then write a for loop that runs for the ".length" of your "old" string. (That means if your sentence has 15 letters, the loop runs 15 times, counting the "i" variable up by one each time.)
Inside the loop, you get the letter at position i from "old" (and save it in another variable, for example char temp = "").
After this, you look at the current letter and decide how to shift it.
Once that's done, just append the converted letter to the "new" string.
Here is some more precise pseudocode (it's not Python code, I don't know Python well); char stands for "character" (letter):
var key = g;
string old = "teststring";
string new = "";
char oldchar = "";
char newchar = "";
for (int i=0; i<old.length; i++){
    oldchar = old.charAt(i);
    newchar = oldchar //shift here!!!
    new.addChar(newchar);
}
Hope i could help you ;)
edit3:
maybe also take a look at this:
https://inventwithpython.com/chapter14.html
Caesar Cipher Function in Python
https://www.youtube.com/watch?v=WXIHuQU6Vrs
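The pseudocode in this answer could be translated to Python roughly as follows. This is one possible sketch, not the only way to do it: the function name caesar_shift is made up, the key is assumed to be an integer, and only ASCII letters are shifted while everything else is passed through unchanged.

```python
import string

def caesar_shift(text, key):
    """Shift each ASCII letter in text by key positions, wrapping around the alphabet."""
    shifted = []
    for ch in text:
        if ch in string.ascii_lowercase:
            base = ord("a")
        elif ch in string.ascii_uppercase:
            base = ord("A")
        else:
            shifted.append(ch)  # digits, spaces and punctuation stay as they are
            continue
        shifted.append(chr((ord(ch) - base + key) % 26 + base))
    return "".join(shifted)

secret = caesar_shift("teststring", 3)
print(secret)                    # whvwvwulqj
print(caesar_shift(secret, -3))  # teststring
```

Decoding is the same operation with the negated key, so only the key needs to be stored alongside the text.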

Just use a dictionary:
letters = {'a': 1, 'b': 2, ... }
And in the loop:
for letter in word:
    print(letters[letter])
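Rather than typing the dictionary out by hand, it can be generated. The sketch below assumes lowercase ASCII input only; the reverse dictionary numbers is an addition for converting back and is not part of the answer above.

```python
import string

# Map 'a'..'z' to 1..26, and build the reverse mapping for decoding.
letters = {ch: i for i, ch in enumerate(string.ascii_lowercase, start=1)}
numbers = {i: ch for ch, i in letters.items()}

word = "test"
encoded = [letters[ch] for ch in word]
decoded = "".join(numbers[n] for n in encoded)
print(encoded)  # [20, 5, 19, 20]
print(decoded)  # test
```

Keeping the encoded values as a list (rather than one concatenated string) avoids ambiguity between, say, 1 followed by 2 and 12.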

To convert to symbol codes and back to characters:
text = input('Write Something')
for t in text:
    d = ord(t)
    n = chr(d)
    print(t,d,n)
To write into file:
f = open("a.txt", "w")
f.write("someline\n")
f.close()
To read lines from file:
f = open("a.txt", "r")
lines = f.readlines()
for line in lines:
    print(line, end='') # all lines have newline character at the end
f.close()
Please see documentation for Python 3: https://docs.python.org/3/

Here are a couple of examples. My method involves mapping the character to the string representation of an integer padded with zeros so it's 3 characters long using str.zfill.
Eg 0 -> '000', 42 -> '042', 125 -> '125'
This makes it much easier to convert a string of numbers back to characters since it will be in lots of 3
Examples
from string import printable
#'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
from random import sample
# Option 1
char_to_num_dict = {key : str(val).zfill(3) for key, val in zip(printable, sample(range(1000), len(printable))) }
# Option 2
char_to_num_dict = {key : str(val).zfill(3) for key, val in zip(printable, range(len(printable))) }
# Reverse mapping - applies to both options
num_to_char_dict = {char_to_num_dict[key] : key for key in char_to_num_dict }
Here are two sets of dictionaries to map a character to a number. The first option uses random numbers, e.g. 'a' = '042', 'b' = '756', 'c' = '000'. The problem with this is that the mapping is regenerated on every run, so after you close the program the next mapping will almost certainly not match. If you want to use random values, you will need to save the dictionary to a file so you can load it again to get the key.
The second option creates a dictionary mapping a character to a number and maintains order. So it will follow the same sequence every time, e.g. 'a' = '010', 'b' = '011', 'c' = '012'.
Now that I've explained the mapping choices, here are the functions to convert between the two:
def text_to_num(s):
    return ''.join( char_to_num_dict.get(char, '') for char in s )
def num_to_text(s):
    slices = [ s[ i : i + 3 ] for i in range(0, len(s), 3) ]
    return ''.join( num_to_char_dict.get(char, '') for char in slices )
Example of use ( with option 2 dictionary )
>>> text_to_num('Hello World!')
'043014021021024094058024027021013062'
>>> num_to_text('043014021021024094058024027021013062')
'Hello World!'
And finally if you don't want to use a dictionary then you can use ord and chr still keeping with padding out the number with zeros method
def text_to_num2(s):
    return ''.join( str(ord(char)).zfill(3) for char in s )
def num_to_text2(s):
    slices = [ s[ i : i + 3] for i in range(0, len(s), 3) ]
    return ''.join( chr(int(val)) for val in slices )
Example of use
>>> text_to_num2('Hello World!')
'072101108108111032087111114108100033'
>>> num_to_text2('072101108108111032087111114108100033')
'Hello World!'
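To cover the "save it to a .txt file and read it back" part of the question, the ord/chr variant can be combined with plain file I/O. The sketch below redefines the two helpers so that it is self-contained; the file name secret.txt and the temporary directory are placeholders, and the 3-digit padding assumes every character has a code point below 1000 (true for ASCII).

```python
import os
import tempfile

def text_to_num(s):
    """Encode each character as its zero-padded, 3-digit code point."""
    return "".join(str(ord(ch)).zfill(3) for ch in s)

def num_to_text(s):
    """Decode a string produced by text_to_num back into text."""
    return "".join(chr(int(s[i:i + 3])) for i in range(0, len(s), 3))

# Hypothetical location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "secret.txt")

with open(path, "w") as f:
    f.write(text_to_num("test"))

with open(path) as f:
    stored = f.read()

print(stored)               # 116101115116
print(num_to_text(stored))  # test
```

Because the mapping is fixed by ord and chr, nothing besides the encoded text needs to be saved; with the random dictionary from option 1, the dictionary itself would have to be written to a file as well.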

Bash script to select a single Python function from a file

For a git alias problem, I'd like to be able to select a single Python function from a file, by name. eg:
...
def notyet():
    wait for it
def ok_start(x):
    stuff
    stuff
    def dontgettrickednow():
        keep going
    #stuff
    more stuff
def ok_stop_now():
In algorithmic terms, the following would be close enough:
Start filtering when you find a line that matches /^(\s*)def $1[^a-zA-Z0-9]/
Keep matching until you find a line that matches neither /^\s*#/ nor /^\1\s/ (that is, a line that is neither a possibly-indented comment nor indented further than the original def)
(I don't really care if decorators before the following function are picked up. The result is for human reading.)
I was trying to do this with Awk (which I barely know) but it's a bit harder than I thought. For starters, I'd need a way of storing the length of the indent before the original def.

One way using awk. Code is well commented, so I hope it's easy to understand.
Content of infile:
...
def notyet():
    wait for it
def ok_start(x):
    stuff
    stuff
    def dontgettrickednow():
        keep going
    #stuff
    more stuff
def ok_stop_now():
Content of script.awk:
BEGIN {
    ## 'f' variable is the function to search, set a regexp with it.
    f_regex = "^" f "[^a-zA-Z0-9]"
    ## When set, print line. Otherwise omit line.
    ## It is set when the searched function is found.
    ## It is unset when a line that is not a '#' comment is found with
    ## the same or less indentation.
    in_func = 0
}
## Found function.
$1 == "def" && $2 ~ f_regex {
    ## Get position of first 'd' in the line.
    i = index( $0, "d" )
    ## Sanity check. Should never succeed because the condition was
    ## checked before.
    if ( i == 0 ) {
        next
    }
    ## Get the characters before the matched index, check that all of
    ## them are spaces, and get their length (substr() is 1-indexed).
    indent = substr( $0, 1, i - 1 )
    if ( indent ~ /^[[:space:]]*$/ ) {
        num_spaces = length( indent )
    }
    ## Set variable, print line and read next one.
    in_func = 1
    print
    next
}
## When we are inside the function, line doesn't begin with '#' and
## it's not a blank line (only spaces).
in_func == 1 && $1 ~ /^[^#]/ && $0 ~ /[^[:space:]]/ {
    ## Get how many characters there are until first non-space. The result
    ## is the position of first non-blank, so subtract one to get the number
    ## of spaces.
    spaces = match( $0, /[^[:space:]]/ )
    spaces -= 1
    ## If current indent is less than or equal to the indent of the function
    ## definition, then end of function found, so end processing.
    if ( spaces <= num_spaces ) {
        in_func = 0
    }
}
## Self-explanatory.
in_func == 1 {
    print
}
Run it like:
awk -f script.awk -v f="ok_start" infile
With following output:
def ok_start(x):
    stuff
    stuff
    def dontgettrickednow():
        keep going
    #stuff
    more stuff

Why not just let Python do it? I think the inspect module can print out the source of a function, so you could just import the module, select the function and inspect it. Hang on. Banging away at a solution for you...
OK. It turns out the inspect.getsource function doesn't work for stuff defined interactively:
>>> def test(f):
...     print 'arg:', f
...
>>> test(1)
arg: 1
>>> inspect.getsource(test)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\inspect.py", line 699, in getsource
    lines, lnum = getsourcelines(object)
  File "C:\Python27\lib\inspect.py", line 688, in getsourcelines
    lines, lnum = findsource(object)
  File "C:\Python27\lib\inspect.py", line 529, in findsource
    raise IOError('source code not available')
IOError: source code not available
>>>
But it will work for your use case, i.e. for modules that are saved to disk. Take for instance my test.py file:
def test(f):
    print 'arg:', f
def other(f):
    print 'other:', f
And compare with this interactive session:
>>> import inspect
>>> import test
>>> inspect.getsource(test.test)
"def test(f):\n    print 'arg:', f\n"
>>> inspect.getsource(test.other)
"def other(f):\n    print 'other:', f\n"
>>>
So... You need to write a simple python script that accepts the name of a python source file and a function/object name as arguments. It should then import the module and inspect the function and print that to STDOUT.
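A minimal Python 3 sketch of such a script might look like the following. The function name get_function_source and the module label inspected_module are made up for illustration. It assumes the target is a .py file that is safe to import, since importing runs the file's top-level code. The demo writes a throwaway file so the example is self-contained.

```python
import importlib.util
import inspect
import os
import tempfile

def get_function_source(path, name):
    """Import the .py file at path and return the source of the object called name."""
    spec = importlib.util.spec_from_file_location("inspected_module", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # note: this executes the file's top-level code
    return inspect.getsource(getattr(module, name))

# Demo with a temporary file standing in for the real source file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.py")
    with open(path, "w") as f:
        f.write("def test(f):\n    print('arg:', f)\n\n"
                "def other(f):\n    print('other:', f)\n")
    print(get_function_source(path, "other"), end="")
```

For the git alias, the path and function name would come from sys.argv instead of the temporary file, and the printed result can be piped like any other command output.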
