Splitting C code into statements using Python

Is there any way to split a string (a complete C file) into C statements using Python?
#include <stdio.h>
#include <math.h>
int main (void)
{
    if(final==(final_t))
    {
        foo(final);
        /*comment*/
        printf("equal\n");
    }
    return(0);
}
If this is read to a string is there any way to split it into a list of strings like this:
list = ['#include <stdio.h>', '#include <math.h>', 'int main(void){', 'if(final==(final_t)){', 'foo(final);', '/*comment*/', 'printf("equal\n");', '}', 'return(0);', '}']

Without being extremely complex, a C program is composed of lexical tokens that form declarations and statements according to a grammar. And your splitting needs some more explanation: according to the C language standard, if (cond) statement1 [else statement2] is a statement. Both statement1 and statement2 can themselves be blocks, so statements can be nested. In your requirements, you seem to concatenate the opening brace of an eventual block onto the conditional and leave the closing brace alone. And you say nothing about declarations or the preprocessor language.
So IMHO, your specifications are still incomplete...
Anyway, it is already far too complex for a simple lexical analyzer. So you should first write the complete grammar that you want to process, ideally in Backus-Naur Form, and declare the terminating tokens. Once you have that, it is easy to use a lex + yacc port such as PLY to build a parser from that grammar.
It is probably not the expected answer, but C language parsers are far from trivial, unless you only want to accept a small subset of the language.
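If you do want a real parse rather than a hand-rolled grammar, a third-party C parser such as pycparser can do the heavy lifting. Here is a rough sketch of my own, not a drop-in solution: pycparser only accepts preprocessed source, so the #include lines and the /*comment*/ would have to be stripped or run through cpp first.
from pycparser import c_parser, c_generator

# pycparser only accepts preprocessed C, so the #include lines and the
# /*comment*/ from the question are left out of this snippet.
source = r"""
int main(void)
{
    if (final == (final_t))
    {
        foo(final);
        printf("equal\n");
    }
    return (0);
}
"""

ast = c_parser.CParser().parse(source)
gen = c_generator.CGenerator()

func = ast.ext[0]                      # the FuncDef node for main
for stmt in func.body.block_items:     # top-level statements of main's body
    print(gen.visit(stmt))             # the whole if-block, then the return
From the If node you could recurse further to pull out the nested foo(...) and printf(...) calls individually.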

You should perform the following steps to reach the result:
Get your code as separate lines.
Cut leading and trailing spaces.
Skip empty lines.
If your code is given as a string you can use:
lines = content.split('\n')
If as a file:
with open('file.c') as f:
    lines = f.readlines()
To cut extra spaces:
lines = list(map(str.strip, lines))
To skip empty lines:
lines = list(filter(lambda x: x, lines))
So the full code may look like this:
content = """
#include <stdio.h>
#include <math.h>
int main (void)
{
if(final==(final_t))
{
foo(final);
printf("equal\n");
}
return(0);
}
"""
lines = content.split('\n')
lines = list(map(str.strip, lines))
lines = list(filter(lambda x: x, lines))
print(lines)

code_list = []
with open("<your-code-file>", 'r') as code_file:
    for line in code_file:
        if "{" in line:
            code_list[-1] = code_list[-1] + line.strip()
        else:
            code_list.append(line.strip())
print(code_list)
output:
['#include <stdio.h>', '#include <math.h>', '', 'int main (void){\n', 'if(final==(final_t)) {\n', 'foo(final);', 'printf("equal\\n");', '}', 'return(0);', '}']

Related

How to read csv with multiple quoted delimiters in single field?

I'd like to be able to split a string which contains the delimiter quoted multiple times. Is there an argument to handle this type of string with the csv module? Or is there another way to process it?
text = '"a,b"-"c,d","a,b"-"c,d"'
next(csv.reader(StringIO(text), delimiter=",", quotechar='"', quoting=csv.QUOTE_NONE))
Expected output: ['"a,b"-"c,d"', '"a,b"-"c,d"']
Actual output: ['"a', 'b"-"c', 'd"', '"a', 'b"-"c', 'd"']
EDIT:
The example above is simplified, but apparently too simplified as some comments provided solutions for the simplified version but not for the full version. Below is the actual data I want to process.
import csv
from io import StringIO
text = '"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0,"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0'
next(csv.reader(StringIO(text), delimiter=",", quotechar='"', quoting=csv.QUOTE_NONE))
Expected output
[
'"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0',
'"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0'
]
Actual output
[
'"3-Amino-1',
'2',
'4-triazole"-text-0-"3-Amino-1',
'2',
'4-triazole"-CD-0',
'"3-Amino-1',
'2',
'4-triazole"-text-0-"3-Amino-1',
'2',
'4-triazole"-LS-0'
]
I'll only answer the first part of your question: there is no way to do this with the built-in csv module.
Looking at the CPython source code, the quotechar option is only processed at the start of a field:
case START_FIELD:
    /* expecting field */
    ...
    else if (c == dialect->quotechar &&
             dialect->quoting != QUOTE_NONE) {
        /* start quoted field */
        self->state = IN_QUOTED_FIELD;
    }
    ...
    break;
Inside a field, there is no such check:
case IN_FIELD:
    /* in unquoted field */
    if (c == '\n' || c == '\r' || c == '\0') {
        /* end of line - return [fields] */
        if (parse_save_field(self) < 0)
            return -1;
        self->state = (c == '\0' ? START_RECORD : EAT_CRNL);
    }
    else if (c == dialect->escapechar) {
        /* possible escaped character */
        self->state = ESCAPED_CHAR;
    }
    else if (c == dialect->delimiter) {
        /* save field - wait for new field */
        if (parse_save_field(self) < 0)
            return -1;
        self->state = START_FIELD;
    }
    else {
        /* normal character - save in field */
        if (parse_add_char(self, module_state, c) < 0)
            return -1;
    }
    break;
There is a check for quotechar while the parser is in the IN_QUOTED_FIELD state; however, upon encountering a quote, it goes back to the IN_FIELD state indicating we're inside an unquoted field. So this is possible:
>>> import csv
>>> import io
>>> print(next(csv.reader(io.StringIO('"a,b"cd,e'))))
['a,bcd', 'e']
But once the parser has reached the end of the initial quoted section, it will consider any subsequent quotes as part of the data. I don't know if this behaviour is to conform with any (written or unwritten) CSV specification, or if it's just a bug.
The data is in a non-standard format and so any solution would need to be tested on the full dataset. A possible workaround could be to first replace each ," sequence with ;" and then simply split on the ;. This can be done without using the csv module or regular expressions:
tests = [
'"a,b"-"c,d","a,b"-"c,d"',
'"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0,"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0',
]
for test in tests:
    row = test.replace(',"' , ';"').split(';')
    print(len(row), row)
Giving:
2 ['"a,b"-"c,d"', '"a,b"-"c,d"']
2 ['"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0', '"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0'
If the structure is always the same with the comma sandwiched between an integer and the '"', you can use a regular expression:
import re
re.split('(?<=[0-9]),(?=")', text)
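For example, applied to the data from the question, this should give the two expected fields:
import re

text = ('"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0,'
        '"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0')

# split only on a comma that sits between a digit and a double quote
print(re.split(r'(?<=[0-9]),(?=")', text))
# ['"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0',
#  '"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0']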

How to replace letters with numbers and re-convert at anytime (Caesar cipher)?

I've been coding this for almost 2 days now but can't get it. I've coded two different bits trying to figure it out.
Code #1
So this one will list the letters but won't change them to numbers (a->1, b->2, etc.)
import re
text = input('Write Something- ')
word = '{}'.format(text)
for letter in word:
    print(letter)
    #lists down
Outcome-
Write something- test
t
e
s
t
Then I have this code that changes the letters into numbers, but I haven't been able to convert it back into letters.
Code #2
u = input('Write Something')
a = ord(u[-1])
print(a)
#converts to number and prints ^^
print('')
print(????)
#need to convert from numbers back to letters.
Outcome:
Write Something- test
116
How can I send a text through (test) and make it convert it to either set numbers (a->1, b->2) or random numbers, save it to a .txt file and be able to go back and read it at any time?
What you're trying to achieve here is called a "Caesar cipher".
For example, say normally you would have: A=1, a=2, B=3, b=4, etc...
Then you would have a "key" which "shifts" the letters. Let's say the key is "3", so you would shift all letters 3 numbers up and you would end up with: A=4, a=5, B=6, b=7, etc...
This is of course only ONE way of doing a Caesar cipher. This is the most basic example. You could also say your key is "G", which would give you:
A=G, a=g, B=H, b=h, etc... or
A=G, a=H, B=I, b=J, etc...
Hope you understand what I'm talking about. Again, this is only one very simple example.
Now, for your program/script you need to define this key. And if the key should be variable, you need to save it somewhere (write it down). Put your words in a string, and check and convert each letter and write it into a new string.
You then could say (pseudo code!):
var key = READKEYFROMFILE;
string old = READKEYFROMFILE_OR_JUST_A_NORMAL_STRING_:)
string new = "";
for (int i=0; i<old.length; i++){
    get the letter at i;
    compare it with your "key";
    shift it;
    write it into new;
}
Hope I could help you.
edit:
You could also use a dictionary (like the other answer says), but this is a very static (though easy) way.
Also, maybe watch some guides/tutorials on programming. You don't seem to be that experienced. And also, google "Caesar cipher" to understand this topic better (it's very interesting).
edit2:
Ok, so basically:
You have a variable called "key"; in this variable, you store your key (you understood what I wrote above about the key?).
You then have a string variable, called "old". And another one called "new".
In old, you write the string that you want to convert.
New will be empty for now.
You then do a "for loop", which runs for the ".length" of your "old" string (that means if your sentence has 15 letters, the loop runs 15 times, counting the little "i" variable from the for loop up each time).
You then need to get the letter from "old" (and save it for short in another variable, for example char temp = "").
After this, you need to compare your current letter and decide how to shift it.
If that's done, just add your converted letter to the "new" string.
Here is some more precise pseudo code (it's not Python code, I don't know Python well); btw char stands for "character" (letter):
var key = g;
string old = "teststring";
string new = "";
char oldchar = "";
char newchar = "";
for (int i=0; i<old.length; i++){
    oldchar = old.charAt[i];
    newchar = oldchar //shift here!!!
    new.addChar(newchar);
}
Hope I could help you ;)
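For reference, here is roughly what that pseudo code could look like in Python (a sketch of my own, using a numeric key; caesar_shift is just an illustrative name):
def caesar_shift(text, key):
    # shift each letter by `key` places, wrapping around the alphabet;
    # anything that is not a letter is passed through unchanged
    result = ""
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            result += chr((ord(ch) - base + key) % 26 + base)
        else:
            result += ch
    return result

print(caesar_shift("teststring", 3))    # whvwvwulqj
print(caesar_shift("whvwvwulqj", -3))   # teststring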
edit3:
maybe also take a look at this:
https://inventwithpython.com/chapter14.html
Caesar Cipher Function in Python
https://www.youtube.com/watch?v=WXIHuQU6Vrs
Just use a dictionary:
letters = {'a': 1, 'b': 2, ... }
And in the loop:
for letter in word:
    print(letters[letter])
To convert to symbol codes and back to characters:
text = input('Write Something')
for t in text:
    d = ord(t)
    n = chr(d)
    print(t, d, n)
To write into file:
f = open("a.txt", "w")
f.write("someline\n")
f.close()
To read lines from file:
f = open("a.txt", "r")
lines = f.readlines()
for line in lines:
    print(line, end='') # all lines have a newline character at the end
f.close()
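Putting those pieces together, a small round trip could look like this (a sketch of my own; it stores plain ord() codes separated by spaces, and numbers.txt is just a placeholder name):
text = input('Write Something- ')

# encode: write the character codes to a file, separated by spaces
# ("numbers.txt" is just a placeholder name)
with open("numbers.txt", "w") as f:
    f.write(' '.join(str(ord(ch)) for ch in text))

# decode: read the codes back at any time and rebuild the text
with open("numbers.txt") as f:
    codes = f.read().split()
print(''.join(chr(int(code)) for code in codes))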
Please see documentation for Python 3: https://docs.python.org/3/
Here are a couple of examples. My method involves mapping the character to the string representation of an integer padded with zeros so it's 3 characters long using str.zfill.
Eg 0 -> '000', 42 -> '042', 125 -> '125'
This makes it much easier to convert a string of numbers back to characters since it will be in lots of 3
Examples
from string import printable
#'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
from random import sample
# Option 1
char_to_num_dict = {key : str(val).zfill(3) for key, val in zip(printable, sample(range(1000), len(printable))) }
# Option 2
char_to_num_dict = {key : str(val).zfill(3) for key, val in zip(printable, range(len(printable))) }
# Reverse mapping - applies to both options
num_to_char_dict = {char_to_num_dict[key] : key for key in char_to_num_dict }
Here are two dictionaries to map a character to a number. The first option uses random numbers, e.g. 'a' = '042', 'b' = '756', 'c' = '000'. The problem with this is that you can use it once, close the program, and the next time the mapping will almost certainly not match. If you want to use random values then you will need to save the dictionary to a file so you can open it later to recover the key.
The second option creates a dictionary mapping a character to a number and maintains order. So it will follow the sequence, e.g. 'a' = '010', 'b' = '011', 'c' = '012', every time.
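If you do go with the random option, the mapping could be persisted with, say, the standard json module so the same key is reused between runs (a sketch of my own; mapping.json is just a placeholder file name):
import json
import os

MAPPING_FILE = 'mapping.json'   # placeholder name

if os.path.exists(MAPPING_FILE):
    # reuse the mapping generated on a previous run
    with open(MAPPING_FILE) as f:
        char_to_num_dict = json.load(f)
else:
    # first run: save the freshly generated random mapping for later
    with open(MAPPING_FILE, 'w') as f:
        json.dump(char_to_num_dict, f)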
Now that I've explained the mapping choices, here are the functions to convert between them:
def text_to_num(s):
    return ''.join( char_to_num_dict.get(char, '') for char in s )

def num_to_text(s):
    slices = [ s[ i : i + 3 ] for i in range(0, len(s), 3) ]
    return ''.join( num_to_char_dict.get(char, '') for char in slices )
Example of use ( with option 2 dictionary )
>>> text_to_num('Hello World!')
'043014021021024094058024027021013062'
>>> num_to_text('043014021021024094058024027021013062')
'Hello World!'
And finally, if you don't want to use a dictionary, you can use ord and chr, still keeping the pad-with-zeros method:
def text_to_num2(s):
    return ''.join( str(ord(char)).zfill(3) for char in s )

def num_to_text2(s):
    slices = [ s[ i : i + 3] for i in range(0, len(s), 3) ]
    return ''.join( chr(int(val)) for val in slices )
Example of use
>>> text_to_num2('Hello World!')
'072101108108111032087111114108100033'
>>> num_to_text2('072101108108111032087111114108100033')
'Hello World!'

Convert csv file to txt file

I'm using perl to convert a comma separated file to a tab separated file with this command:
perl -e ' $sep=","; while(<>) { s/\Q$sep\E/\t/g; print $_; } warn "Changed $sep to tab on $. lines\n" ' csvfile.csv > tabfile.tab
However, my file has additional commas that I do not want to be separated in specific columns. Here's an example of my file:
ADNP, "descript1, descript2", 1
PTB, "descriptA, descriptB", 5
I only want to convert the comma's outside of the quotations to tabs as so:
ADNP descript1, descript2 1
PTB descriptA, descriptB 5
Is there any way to go about doing this with either perl, python, or bash?
Trivial in Perl, using Text::CSV:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
#configure our read format using the default separator of ","
my $input_csv = Text::CSV->new( { binary => 1 } );
#configure our output format with a tab as separator.
my $output_csv = Text::CSV->new( { binary => 1, sep_char => "\t", eol => "\n" } );
#open input file
open my $input_fh, '<', "sample.csv" or die $!;
#iterate input file - reading in 'comma separated'
#printing out (to stdout -can use filehandle) tab separated.
while ( my $row = $input_csv->getline($input_fh) ) {
    $output_csv->print( \*STDOUT, $row );
}
In python
import csv
with open('input', 'rb') as inf:
    reader = csv.reader(inf)
    with open('output', 'wb') as out:
        writer = csv.writer(out, delimiter='\t')
        writer.writerows(reader)
You need regular expressions to help you. In python it would simply be:
>>> import re
>>> re.split(r'(?!\B"[^"]*),(?![^"]*"\B)', 'ADNP, "descript1, descript2", 1')
['ADNP', ' "descript1, descript2"', ' 1']
Building off rll's regex answer, you can turn it into a perl oneliner like you're currently doing:
perl -ne 'BEGIN{$,="\t";}@a=split(/(?!\B"[^"]*),(?![^"]*"\B)/);print @a' csvfile.csv > tabfile.tab
This'll work:
perl -e '$sep=","; while(<STDIN>) { #data = split(/(\Q$sep\E?\s*"[^"]+"\s*\Q$sep\E?)/); foreach(#data){if(/"/){s/^\Q$sep\E\s*"//;s/"\s*\Q$sep\E$//;}else{s/\Q$sep\E/\t/g;}}print(join("\t",#data));} warn "Changed $sep to tab on $. lines\n"' < csvfile.csv > tabfile.tab
Putting parens in the pattern given to split returns the captured separators along with the split elements, and effectively separates the strings containing quotes into their own list elements that can be treated differently when quotes are detected. You just strip off the commas and quotes for the quoted strings and substitute tabs in the other elements, then join the elements with tabs (so that the quoted strings get joined with tabs to the other already-tabbed strings).
The Text::CSV module is what you're looking for. There are a lot of considerations when parsing CSV files, and you really don't want to handle all of them yourself.

delete only lines after match1 up to match2

I have checked and played with various examples and it appears that my problem is a bit more complex than what I have been able to find. What I need to do is search for a particular string and then delete the following line and keep deleting lines until another string is found. So an example would be the following:
a
b
color [
0 0 0,
1 1 1,
3 3 3,
] #color
y
z
Here, "color [" is match1, and "] #color" is match2. So then what is desired is the following:
a
b
color [
] #color
y
z
This "simple to follow" code example will get you started .. you can tweak it as needed. Note that it processes the file line-by-line, so this will work with any size file.
start_marker = 'startdel'
end_marker = 'enddel'
with open('data.txt') as inf:
    ignoreLines = False
    for line in inf:
        if start_marker in line:
            print line,
            ignoreLines = True
        if end_marker in line:
            ignoreLines = False
        if not ignoreLines:
            print line,
It uses startdel and enddel as "markers" for starting and ending the ignoring of data.
Update:
Modified code based on a request in the comments, this will now include/print the lines that contain the "markers".
Given this input data (borrowed from @drewk):
Beginning of the file...
stuff
startdel
delete this line
delete this line also
enddel
stuff as well
the rest of the file...
it yields:
Beginning of the file...
stuff
startdel
enddel
stuff as well
the rest of the file...
You can do this with a single regex by using a non-greedy .*?. E.g., assuming you want to keep both the "look for this line" and the "until this line is found" lines, and discard only the lines in between, you could do:
>>> import re
>>> my_regex = re.compile("(look for this line)" +
...                       ".*?" +  # match as few chars as possible
...                       "(until this line is found)",
...                       re.DOTALL)
>>> new_str = my_regex.sub(r"\1\2", old_str)
A few notes:
The re.DOTALL flag tells Python that "." can match newlines -- by default it matches any character except a newline
The parentheses define "numbered match groups", which are then used later when I say r"\1\2" to make sure that we don't discard the first and last line. If you did want to discard either or both of those, then just get rid of the \1 and/or the \2. E.g., to keep the first but not the last use my_regex.sub(r"\1", old_str); or to get rid of both use my_regex.sub("", old_str)
For further explanation, see: http://docs.python.org/library/re.html or search for "non-greedy regular expression" in your favorite search engine.
This works:
s="""Beginning of the file...
stuff
look for this line
delete this line
delete this line also
until this line is found
stuff as well
the rest of the file... """
import re
print re.sub(r'(^look for this line$).*?(^until this line is found$)',
r'\1\n\2',s,count=1,flags=re.DOTALL | re.MULTILINE)
prints:
Beginning of the file...
stuff
look for this line
until this line is found
stuff as well
the rest of the file...
You can also use list slices to do this:
mStart='look for this line'
mStop='until this line is found'
li=s.split('\n')
print '\n'.join(li[0:li.index(mStart)+1]+li[li.index(mStop):])
Same output.
I like re for this (being a Perl guy at heart...)

Regular Expression Question

I'm trying to use a regular expression to extract the comments at the head of a file.
For example, the source code may look like:
//This is an example file.
//Please help me.
#include "test.h"
int main() //main function
{
...
}
What I want to extract from the code are the first two lines, i.e.
//This is an example file.
//Please help me.
Any idea?
Why use regex?
>>> f = file('/tmp/source')
>>> for line in f.readlines():
...     if not line.startswith('//'):
...         break
...     print line
...
>>> code="""//This is an example file.
... //Please help me.
...
... #include "test.h"
... int main() //main function
... {
... ...
... }
... """
>>>
>>> import re
>>> re.findall("^\s*//.*",code,re.MULTILINE)
['//This is an example file.', '//Please help me.']
>>>
If you only need to match continuous comment lines at the top, you could use following.
>>> re.search("^((?:\s*//.*\n)+)",code).group().strip().split("\n")
['//This is an example file.', '//Please help me.']
>>>
This doesn't just get the first 2 comment lines, but multiline comments and // comments further down as well. It's not what you required though.
data=open("file").read()
for c in data.split("*/"):
# multiline
if "/*" in c:
print ''.join(c.split("/*")[1:])
if "//" in c:
for item in c.split("\n"):
if "//" in c:
print ''.join(item.split("//")[1:])
To extend the context to handle the considerations below:
spaces in front of //...
empty lines between each //... line
import re
code = """//This is an example file.
a
// Please help me.
// ha
#include "test.h"
int main() //main function
{
...
}"""
for s in re.finditer(r"^(\s*)(//.*)",code,re.MULTILINE):
    print(s.group(2))
>>>
//This is an example file.
// Please help me.
// ha
