I am receiving ordinary comma-delimited CSV files in which the data itself can contain newline characters.
Input data
I want to convert the input data to:
Pipe (|) delimited
Without any quotes to escape (" or ')
Pipe (|) within data escaped with a caret (^) character
My file may also contain multi-line data, i.e. a single row whose field values span several lines.
Expected output data
Output file I was able to generate.
As you can see in the image, the caret (^) correctly escapes every pipe (|) in the data, but it also escapes the newline characters in the 5th and 6th lines, which I don't want.
NOTE: All carriage return (\r, CR) and newline (\n, LF) characters should remain exactly as they are, just as shown in the images.
import csv
import sys

inputPath = sys.argv[1]
outputPath = sys.argv[2]

with open(inputPath, encoding="utf-8") as inputFile:
    with open(outputPath, 'w', newline='', encoding="utf-8") as outputFile:
        reader = csv.DictReader(inputFile, delimiter=',')
        writer = csv.DictWriter(
            outputFile, reader.fieldnames, delimiter='|',
            quoting=csv.QUOTE_NONE, escapechar='^',
            doublequote=False, quotechar="")
        writer.writeheader()
        writer.writerows(reader)
print("Formatting complete.")
The above code is written in Python; it would be great if I could get help in Python.
Answers in other programming languages are also accepted.
There are more than 8 million records.
Please find below some sample data:
"VENDOR ID","VENDOR NAME","ORGANIZATION NUMBER","ADDRESS 1","CITY","COUNTRY","ZIP","PRIMARY PHONE","FAX","EMAIL","LMS RECORD CREATED DATE","LMS RECORD MODIFY DATE","DELETE FLAG","LMS RECORD ID"
"a0E6D000001Fag8UAC","Test 'Vendor' 1","","This Vendor contains a single (') quote.","","","","","","test#test.com","2020-4-1 06:32:29","2020-4-1 06:34:43","false",""
"a0E6D000001FagDUAS","Test ""Vendor"" 2","","This Vendor contains a double("") quote.","","","","","","test#test.com","2020-4-1 06:33:38","2020-4-1 06:35:18","false",""
"a0E6D000001FagIUAS","Test Vendor | 3","","This Vendor contains a Pipe (|).","","","","","","test#test.com","2020-4-1 06:38:45","2020-4-1 06:38:45","false",""
"a0E6D000001FagNUAS","Test Vendor 4","","This Vendor contains a
carriage return, i.e
data in new line.","","","","","","test#test.com","2020-4-1 06:43:08","2020-4-1 06:43:08","false",""
NOTE: If you copy the above data, please make sure that the 5th and 6th lines end with LF only (i.e. newline, \n), just as shown in the images; otherwise, please try to replicate those two lines, because not escaping those two lines specifically is what this question is all about, as highlighted in the image below.
The above code is the final outcome of all my findings on the internet. I've even tried the pandas library, and its final output is the same.
The code below is just an alternate way to get my expected output, but the issue still exists: this script takes forever (more than 12 hours) and still does not finish, so ultimately I have to kill the process, when run on 9 million records.
Batch wrapper for the JScript code:
0</* :
#echo off
cscript /nologo /E:jscript "%~f0" %*
exit /b %errorlevel% */0;
var ARGS = WScript.Arguments;
if (ARGS.Length < 3 ) {
WScript.Echo("Wrong arguments");
WScript.Echo(WScript.ScriptName + " path_to_file search replace [search replace[search replace [...]]]");
WScript.Echo(WScript.ScriptName + " e?path_to_file search replace [search replace[search replace [...]]]");
WScript.Echo("if filename starts with \"e?\" search and replace string will be evaluated for special characters ")
WScript.Quit(1);
}
if (ARGS.Item(0).toLowerCase() == "-help" || ARGS.Item(0).toLowerCase() == "-h") {
WScript.Echo(WScript.ScriptName + " path_to_file search replace [search replace[search replace [...]]]");
WScript.Echo(WScript.ScriptName + " e?path_to_file search replace [search replace[search replace [...]]]");
WScript.Echo("if filename starts with \"e?\" search and replace string will be evaluated for special characters ")
WScript.Quit(0);
}
if (ARGS.Length % 2 !== 1 ) {
WScript.Echo("Wrong arguments");
WScript.Quit(2);
}
var jsEscapes = {
'n': '\n',
'r': '\r',
't': '\t',
'f': '\f',
'v': '\v',
'b': '\b'
};
//string evaluation
//http://stackoverflow.com/questions/24294265/how-to-re-enable-special-character-sequneces-in-javascript
function decodeJsEscape(_, hex0, hex1, octal, other) {
var hex = hex0 || hex1;
if (hex) { return String.fromCharCode(parseInt(hex, 16)); }
if (octal) { return String.fromCharCode(parseInt(octal, 8)); }
return jsEscapes[other] || other;
}
function decodeJsString(s) {
return s.replace(
// Matches an escape sequence with UTF-16 in group 1, single byte hex in group 2,
// octal in group 3, and arbitrary other single-character escapes in group 4.
/\\(?:u([0-9A-Fa-f]{4})|x([0-9A-Fa-f]{2})|([0-3][0-7]{0,2}|[4-7][0-7]?)|(.))/g,
decodeJsEscape);
}
function convertToPipe(find, replace, str) {
return str.replace(new RegExp('\\|','g'),"^|");
}
function removeStartingQuote(find, replace, str) {
return str.replace(new RegExp('^"', 'g'), '');
}
function removeEndQuote(find, replace, str) {
return str.replace(new RegExp('"\r\n$', 'g'), '\r\n');
}
function removeLeadingAndTrailingQuotes(find, replace, str) {
return str.replace(new RegExp('"\r\n"', 'g'), '\r\n');
}
function replaceDelimiter(find, replace, str) {
return str.replace(new RegExp('","', 'g'), '|');
}
function convertSFDCDoubleQuotes(find, replace, str) {
return str.replace(new RegExp('""', 'g'), '"');
}
function getContent(file) {
// :: http://www.dostips.com/forum/viewtopic.php?f=3&t=3855&start=15&p=28898 ::
var ado = WScript.CreateObject("ADODB.Stream");
ado.Type = 2; // adTypeText = 2
ado.CharSet = "iso-8859-1"; // code page with minimum adjustments for input
ado.Open();
ado.LoadFromFile(file);
var adjustment = "\u20AC\u0081\u201A\u0192\u201E\u2026\u2020\u2021" +
"\u02C6\u2030\u0160\u2039\u0152\u008D\u017D\u008F" +
"\u0090\u2018\u2019\u201C\u201D\u2022\u2013\u2014" +
"\u02DC\u2122\u0161\u203A\u0153\u009D\u017E\u0178" ;
var fs = new ActiveXObject("Scripting.FileSystemObject");
var size = (fs.getFile(file)).size;
var lnkBytes = ado.ReadText(size);
ado.Close();
var chars=lnkBytes.split('');
for (var indx=0;indx<size;indx++) {
if ( chars[indx].charCodeAt(0) > 255 ) {
chars[indx] = String.fromCharCode(128 + adjustment.indexOf(chars[indx]));
}
}
return chars.join("");
}
function writeContent(file,content) {
var ado = WScript.CreateObject("ADODB.Stream");
ado.Type = 2; // adTypeText = 2
ado.CharSet = "iso-8859-1"; // right code page for output (no adjustments)
//ado.Mode=2;
ado.Open();
ado.WriteText(content);
ado.SaveToFile(file, 2);
ado.Close();
}
if (typeof String.prototype.startsWith != 'function') {
// see below for better implementation!
String.prototype.startsWith = function (str){
return this.indexOf(str) === 0;
};
}
var evaluate=false;
var filename=ARGS.Item(0);
if(filename.toLowerCase().startsWith("e?")) {
filename=filename.substring(2,filename.length);
evaluate=true;
}
var content=getContent(filename);
var newContent=content;
var find="";
var replace="";
for (var i=1;i<ARGS.Length-1;i=i+2){
find=ARGS.Item(i);
replace=ARGS.Item(i+1);
if(evaluate){
find=decodeJsString(find);
replace=decodeJsString(replace);
}
newContent=convertToPipe(find,replace,newContent);
newContent=removeStartingQuote(find,replace,newContent);
newContent=removeEndQuote(find,replace,newContent);
newContent=removeLeadingAndTrailingQuotes(find,replace,newContent);
newContent=replaceDelimiter(find,replace,newContent);
newContent=convertSFDCDoubleQuotes(find,replace,newContent);
}
writeContent(filename,newContent);
Execution Steps:
> replace.bat <file_name or full_path_to_file> "." "."
This batch file is meant as a general-purpose tool for manipulating any file according to our requirements.
I've compiled this from a lot of Google searches. It's still a work in progress, as I've hardcoded my regular expressions in the file. You can adapt the functions I've made to your needs, or even write your own functions by copying the existing ones and calling them at the end.
Another alternative to achieve what I want is the following Windows PowerShell script.
((Get-Content -path $args[0] -Raw) -replace '\|', '^|') | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '^"', '') | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace "`"\r\n$", "") | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '"\r\n"', "`r`n") | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '","', '|') | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '""', '"' ) | Set-Content -Path $args[0]
Execution Ways:
Using Powershell
replace.ps1 '< path_to_file >'
Using a Batch Script
C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -ExecutionPolicy ByPass -command "& '< path_to_ps_script >\replace.ps1' '< path_to_csv_file >.csv'"
NOTE: Powershell V5.0 or greater required
This can process 1 million records in a minute or so.
What I've figured out is that we have to split bulky CSV files into multiple files with 1 million records each and then process them all separately.
Please correct me if I'm wrong, or if there's any other alternative (for example, something along the lines of the per-field sketch below).
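For reference, here is a minimal Python sketch of the kind of per-field handling I have in mind (the function name convert is just a placeholder): csv.reader still parses the quoted, multi-line fields, while the output row is joined by hand, so only the pipes inside the data get the caret escape and the embedded newlines pass through untouched.

import csv
import sys

# A minimal sketch, not a drop-in replacement for the script above:
# csv.reader handles the quoted fields (including fields spanning several
# lines), and the output row is assembled manually so that csv's escapechar
# never gets a chance to escape the embedded CR/LF characters.
def convert(input_path, output_path):
    with open(input_path, newline='', encoding="utf-8") as src, \
         open(output_path, 'w', newline='', encoding="utf-8") as dst:
        for row in csv.reader(src):
            # escape only the pipes that occur inside the data itself
            fields = [field.replace('|', '^|') for field in row]
            # the record terminator is written as '\n' here; adjust this
            # if the original CRLF row endings must be preserved exactly
            dst.write('|'.join(fields) + '\n')

if __name__ == "__main__":
    convert(sys.argv[1], sys.argv[2])

Since it streams row by row, memory use should stay flat even for 8-9 million records.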
Related
For some reason, I have to run a PHP function in Python.
However, I realized that it's beyond my abilities.
So I'm asking for help here.
Below is the code:
function munja_send($mtype, $name, $phone, $msg, $callback, $contents) {
    $host = "www.sendgo.co.kr";
    $id = "";   // id
    $pass = ""; // password
    $param = "remote_id=".$id;
    $param .= "&remote_pass=".$pass;
    $param .= "&remote_name=".$name;
    $param .= "&remote_phone=".$phone;       // cellphone number
    $param .= "&remote_callback=".$callback; // my cellphone number
    $param .= "&remote_msg=".$msg;           // message
    $param .= "&remote_contents=".$contents; // image
    if ($mtype == "lms") {
        $path = "/Remote/RemoteMms.html";
    } else {
        $path = "/Remote/RemoteSms.html";
    }
    $fp = @fsockopen($host, 80, $errno, $errstr, 30);
    $return = "";
    if (!$fp) {
        echo $errstr."(".$errno.")";
    } else {
        fputs($fp, "POST ".$path." HTTP/1.1\r\n");
fputs($fp, "Host: ".$host."\r\n");
fputs($fp, "Content-type: application/x-www-form-urlencoded\r\n");
fputs($fp, "Content-length: ".strlen($param)."\r\n");
fputs($fp, "Connection: close\r\n\r\n");
fputs($fp, $param."\r\n\r\n");
while(!feof($fp)) $return .= fgets($fp,4096);
}
fclose ($fp);
$_temp_array = explode("\r\n\r\n", $return);
$_temp_array2 = explode("\r\n", $_temp_array[1]);
if (sizeof($_temp_array2) > 1) {
$return_string = $_temp_array2[1];
} else {
$return_string = $_temp_array2[0];
}
return $return_string;
}
I would be glad if anyone could show me a way.
Thank you.
I don't know PHP, but based on my understanding, here is a rough line-for-line translation of the code you provided, from PHP to Python. I've preserved your existing comments, and added new ones for clarification in places where I was unsure or where you might want to make changes.
It should be pretty straightforward to follow - the difference is mostly in syntax (e.g. + for concatenation instead of .), and in converting str to bytes and vice versa.
import socket

def munja_send(mtype, name, phone, msg, callback, contents):
    host = "www.sendgo.co.kr"
    remote_id = ""  # id (changed the variable name, since `id` is also a builtin function)
    password = ""   # password (`pass` is a reserved keyword in python)
    param = "remote_id=" + remote_id
    param += "&remote_pass=" + password
    param += "&remote_name=" + name
    param += "&remote_phone=" + phone        # cellphone number
    param += "&remote_callback=" + callback  # my cellphone number
    param += "&remote_msg=" + msg            # message
    param += "&remote_contents=" + contents  # image
    if mtype == "lms":
        path = "/Remote/RemoteMms.html"
    else:
        path = "/Remote/RemoteSms.html"
    # change these parameters as necessary for your desired outcome
    fp = socket.socket(family=socket.AF_INET, type=socket.SOCK_STREAM)
    fp.settimeout(30)
    returnstr = b""
    errno = fp.connect_ex((host, 80))
    if errno != 0:
        # I'm not sure where errstr comes from in php or how to get it in python
        # errno should be the same, though, as it refers to the same system call error code
        print("Error(" + str(errno) + ")")
    else:
        fp.send(("POST " + path + " HTTP/1.1\r\n").encode('utf-8'))
        fp.send(("Host: " + host + "\r\n").encode('utf-8'))
        fp.send(b"Content-type: application/x-www-form-urlencoded\r\n")
        # for accuracy, we convert param to bytes using utf-8 encoding
        # before checking its length. Change the encoding as necessary for accuracy
        fp.send(("Content-length: " + str(len(bytes(param, 'utf-8'))) + "\r\n").encode('utf-8'))
        fp.send(b"Connection: close\r\n\r\n")
        fp.send((param + "\r\n\r\n").encode('utf-8'))
        while (data := fp.recv(4096)):
            # fp.recv() should return an empty bytestring once eof has been hit
            returnstr += data
    fp.close()
    _temp_array = returnstr.split(b'\r\n\r\n')
    _temp_array2 = _temp_array[1].split(b'\r\n')
    if len(_temp_array2) > 1:
        return_string = _temp_array2[1]
    else:
        return_string = _temp_array2[0]
    # here I'm converting the raw bytes to a python string, using encoding
    # utf-8 by default. Replace with your desired encoding if necessary
    # or just remove the `.decode()` call if you're fine with returning a
    # bytestring instead of a regular string
    return return_string.decode('utf-8')
If possible, you should probably use subprocess to execute your php code directly, as other answers suggest, as straight-up translating code is often error-prone and has slightly different behavior (case in point, the lack of errmsg and probably different error handling in general, and maybe encoding issues in the above snippet). But if that's not possible, then hopefully this will help.
According to the internet, you can use subprocess to execute the PHP script:
import subprocess
# if the script doesn't need output
subprocess.call("php /path/to/your/script.php", shell=True)
# if you want output
proc = subprocess.Popen("php /path/to/your/script.php", shell=True, stdout=subprocess.PIPE)
script_response = proc.stdout.read()
PHP code can be executed in Python using the subprocess or php.py libraries, depending on the situation.
Please refer to this answer for further details.
I have the following example where I want to output a variable in the same JSON format with os.system (the idea is to produce the output with a system command), but the double quotes are dropped in the output.
import json
import os
import requests
PAYLOAD_CONF = {
"cluster": {
"ldap": "string",
"processes": 22
}
}
paystr = (str(PAYLOAD_CONF))
paydic = (json.dumps(PAYLOAD_CONF))
os.system("echo "+paystr+"")
os.system("echo "+paydic+"")
Output:
{cluster: {processes: 22, ldap: string}}
{cluster: {processes: 22, ldap: string}}
Can you help me with a workaround so I can output this with double quotes? It's very important to produce the output with a system command.
For cases like this, where you embed variables into a shell command, you should use shlex.quote. Using this (and with minor cleanup), the code can be written as:
import json
import os
import shlex
PAYLOAD_CONF = {
"cluster": {
"ldap": "string",
"processes": 22
}
}
paydic = json.dumps(PAYLOAD_CONF)
os.system("echo " + shlex.quote(paydic))
Output:
{"cluster": {"ldap": "string", "processes": 22}}
Using subprocess
The subprocess module contains a lot of helper functions for calling external applications. These functions are generally preferable to os.system for various security reasons.
If there is no hard dependency on os.system, you can also use one of the following, depending on your needs:
subprocess.call -- This will return even if the subprocess fails. If this returns a non-zero value, the process exited abnormally.
subprocess.check_call -- This will raise an exception if the process exits abnormally.
subprocess.check_output -- This will return stdout from the subprocess and raise an exception if it exits abnormally.
... The module contains many other helpful functions for interacting with subprocess which you should check out if the above don't suit your needs.
Using check_call, the code becomes:
from subprocess import check_call
import json
import os
import shlex
PAYLOAD_CONF = {
"cluster": {
"ldap": "string",
"processes": 22
}
}
paydic = json.dumps(PAYLOAD_CONF)
check_call(["echo", paydic])
You're not adding quotes; you're adding an empty string to the end.
Also, echo is going to interpret the first set of double quotes as a wrapper around the argument – not as part of the string itself. In the echo command itself, you need to escape double quotes with a backslash, e.g. echo \"hello\" will output "hello", whereas echo "hello" will output hello.
In a Python string, you're going to have to escape the literal backslash in the echo command with another backslash, e.g. os.system('echo \\"hello\\"') for output "hello".
Applying this to your case and using format to make it easy:
import json
import os
import requests
PAYLOAD_CONF = {
"cluster": {
"ldap": "string",
"processes": 22
}
}
paystr = (str(PAYLOAD_CONF))
paydic = (json.dumps(PAYLOAD_CONF))
os.system('echo \\"{}\\"'.format(paystr))
os.system('echo \\"{}\\"'.format(paydic))
Output:
"{cluster: {ldap: string, processes: 22}}"
"{cluster: {ldap: string, processes: 22}}"
Your paystr variable is also unnecessary, since all objects are automatically converted to strings by print and format via their inherited or overridden __str__ methods.
EDIT:
To output the variable as it appears in Python you just need to iterate through the payload dict and render each key-value pair in a type-sensitive way.
import os
import requests
def make_payload_str(payload, nested=1):
    payload_str = "{\n" + "\t" * nested
    for i, k in enumerate(payload.keys()):
        v = payload[k]
        if type(k) is str:
            payload_str += '\\"{}\\"'.format(k)
        else:
            payload_str += str(k)
        payload_str += ": "
        if type(v) is str:
            payload_str += '\\"{}\\"'.format(v)
        elif type(v) is dict:
            payload_str += make_payload_str(v, nested=nested + 1)
        else:
            payload_str += str(v)
        # Only add comma if not last element
        if i < len(payload) - 1:
            payload_str += ",\n" + "\t" * nested
        else:
            payload_str += "\n"
    return payload_str + "\n" + "\t" * (nested - 1) + "}"
PAYLOAD_CONF = {
"cluster": {
"ldap": "string",
"processes": 22
}
}
paystr = make_payload_str(PAYLOAD_CONF)
os.system('echo "{}"'.format(paystr))
Output:
{
"cluster": {
"ldap": "string",
"processes": 22
}
}
If the payload contains a dictionary, as it does in the example you provided, the function calls itself to produce the string for that dictionary, indented the right number of tabs using the nested parameter.
If the payload is also allowed to have lists and other more complex types, you'll have to include cases that account for those, but it's just more of the same.
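For instance, one way to sketch that (a hypothetical render_value helper, not part of the code above) is to pull the type dispatch out of the loop and add a list branch:

def render_value(v, nested):
    # hypothetical helper: same dispatch as the loop body above, plus lists
    if type(v) is str:
        return '\\"{}\\"'.format(v)
    elif type(v) is dict:
        return make_payload_str(v, nested=nested + 1)  # defined above
    elif type(v) is list:
        return "[" + ", ".join(render_value(item, nested) for item in v) + "]"
    else:
        return str(v)

Inside make_payload_str, the if/elif chain over v would then collapse to payload_str += render_value(v, nested).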
I basically want to convert the tab-delimited text file http://www.linux-usb.org/usb.ids into a CSV file.
I tried importing it with Excel, but the result is not optimal; it turns out like:
8087 Intel Corp.
0020 Integrated Rate Matching Hub
0024 Integrated Rate Matching Hub
How I want it, for easy searching, is:
8087 Intel Corp. 0020 Integrated Rate Matching Hub
8087 Intel Corp. 0024 Integrated Rate Matching Hub
Is there any way I can do this in Python?
$ListDirectory = "C:\USB_List.csv"
Invoke-WebRequest 'http://www.linux-usb.org/usb.ids' -OutFile $ListDirectory
$pageContents = Get-Content $ListDirectory | Select-Object -Skip 22
"vendor`tvendor_name`tproduct`tproduct_name`r" > $ListDirectory
#Variables and Flags
$currentVid
$currentVName
$currentPid
$currentPName
$vendorDone = $TRUE
$interfaceFlag = $FALSE
$nextline
$tab = "`t"
foreach ($line in $pageContents) {
    if ($line.StartsWith("`#")) {
        continue
    }
    elseif ($line.length -eq 0) {
        exit
    }
    if (!($line.StartsWith($tab)) -and ($vendorDone -eq $TRUE)) {
        $vendorDone = $FALSE
    }
    if (!($line.StartsWith($tab)) -and ($vendorDone -eq $FALSE)) {
        $pos = $line.IndexOf(" ")
        $currentVid = $line.Substring(0, $pos)
        $currentVName = $line.Substring($pos+2)
        "$currentVid`t$currentVName`t`t`r" >> $ListDirectory
        $vendorDone = $TRUE
    }
    elseif ($line.StartsWith($tab)) {
        if ($interfaceFlag -eq $TRUE) {
            $interfaceFlag = $FALSE
        }
        $nextline = $line.TrimStart()
        if ($nextline.StartsWith($tab)) {
            $interfaceFlag = $TRUE
        }
        if ($interfaceFlag -eq $FALSE) {
            $pos = $nextline.IndexOf(" ")
            $currentPid = $nextline.Substring(0, $pos)
            $currentPName = $nextline.Substring($pos+2)
            "$currentVid`t$currentVName`t$currentPid`t$currentPName`r" >> $ListDirectory
            Write-Host "$currentVid`t$currentVName`t$currentPid`t$currentPName`r"
            $interfaceFlag = $FALSE
        }
    }
}
I know the ask is for Python, but I built this PowerShell script to do the job. It takes no parameters. Just run it as admin from the directory where you want to store the script. The script collects everything from the http://www.linux-usb.org/usb.ids page, parses the data and writes it to a tab-delimited file. You can then open the file in Excel as a tab-delimited file. Ensure the columns are read as "text" and not "general" and you're good to go. :)
Parsing this page is tricky because the script has to be contextually aware of every VID-Vendor line preceding a series of PID-Product lines. I also forced the script to ignore the commented description section, the interface-interface_name lines, the random comments inserted throughout the USB list (sigh), and everything after and including "#List of known device classes, subclasses and protocols", which is out of scope for this request.
I hope this helps!
You just need to write a little program that scans in the data a line at a time. Then it should check to see if the first character is a tab ('\t'). If not then that value should be stored. If it does start with tab then print out the value that was previously stored followed by the current line. The result will be the list in the format you want.
Something like this would work:
import csv

lines = []
with open("usb.ids.txt") as f:
    reader = csv.reader(f, delimiter="\t")
    device = ""
    for line in reader:
        # Ignore empty lines and comments
        if len(line) == 0 or (len(line[0]) > 0 and line[0][0] == "#"):
            continue
        if line[0] != "":
            device = line[0]
        elif line[1] != "":
            lines.append((device, line[1]))
print(lines)
You basically need to loop through each line, and if it's a device line, remember it for the following lines. This only handles two columns, and you would then need to write the pairs out to a CSV file, but that's easy enough (see the sketch below).
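A minimal sketch of that last step, assuming the lines list of (device, product) tuples built above and a hypothetical output name usb.csv:

import csv

# write the collected (device, product) pairs out as a two-column CSV
with open("usb.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["vendor", "product"])  # header row; rename as needed
    writer.writerows(lines)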
I'm extremely new to Python. I started off writing a PHP script to find all occurrences of two strings in a txt file, but it was using too much memory, so I read that Python would be better.
Basically what I need to do is:
- import a txt file
- go through it and return all the data between the tags below
- remove any duplicates
- output the results
The bits i'm looking for look like this:
------DATA--------------------------------------------------
DATA TO SHOW
------------------------------------------------------------
Naturally, the important bit to output is the DATA TO SHOW part.
Any help would be appreciated :)
Thanks
UPDATE -----------------------
import re
inputFile = open("small.txt", "r")
output = open("result.txt", "w")
searchStart = "----- ASSERT --------------------------------------------------------------------------------"
searchEnd = "---------------------------------------------------------------------------------------------"
match = re.findall('^----- ASSERT --------------------------------------------------------------------------------\n(.*?)---------------------------------------------------------------------------------------------', inputFile.read(), re.MULTILINE)
print match,
Any ideas how to get it to show all the lines until it hits the searchEnd tag? Example data:
----- ASSERT --------------------------------------------------------------------------------
MORE
INFO
THAT
I
NEED
TO
GET
FROM
THE
FILE
---------------------------------------------------------------------------------------------
An example with PHP (not tested, but the idea is here):
$handle = fopen("inputfile.txt", "r");
if ($handle) {
    $record = false;
    while (($line = fgets($handle)) !== false) {
        if ($line == '------DATA--------------------------------------------------') {
            $record = true;
            $temp = '';
        } elseif ($record) {
            if ($line == '------------------------------------------------------------') {
                $record = false;
                $results[] = $temp;
                $temp = '';
            } else $temp .= $line;
        }
    }
} else {
    echo 'Gargoyl, the file can\'t be opened!';
}
fclose($handle);
print_r($results);
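For the Python side of the update, the same line-by-line idea would look roughly like this (a sketch keyed to the ASSERT markers; passing re.DOTALL instead of re.MULTILINE to the original findall is another option, since . does not match newlines by default):

results = []
record = False
current = []

with open("small.txt") as f:
    for line in f:
        stripped = line.rstrip("\n")
        if stripped.startswith("----- ASSERT "):
            # start marker: begin collecting lines
            record = True
            current = []
        elif record and stripped.startswith("-----"):
            # end marker: store the collected block
            record = False
            results.append("\n".join(current))
        elif record:
            current.append(stripped)

# remove duplicates while keeping the original order
unique = list(dict.fromkeys(results))
print(unique)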
I need to analyse some C files and print out all the #define found.
It's not that hard with a regexp, for example:

import re

# assuming macro_regexp is something along these lines
macro_regexp = re.compile(r'\s*#\s*define\s+(\w+)\s*(.*)')

def with_regexp(fname):
    print("{0}:".format(fname))
    for line in open(fname):
        match = macro_regexp.match(line)
        if match is not None:
            print(match.groups())
But, for example, it doesn't handle multi-line defines.
There is a nice way to do it with gcc, for example:
gcc -E -dM file.c
The problem is that it returns all the #defines, not just the ones from the given file, and I can't find any option to restrict it to only the given file.
Any hint?
Thanks
EDIT:
This is a first solution to filter out the unwanted defines, simply checking that the name of each define actually appears in the original file; not perfect, but it seems to work nicely.
import re
from subprocess import Popen, PIPE

def with_gcc(fname):
    cmd = "gcc -dM -E {0}".format(fname)
    # universal_newlines=True so stdout comes back as text rather than bytes
    proc = Popen(cmd, shell=True, stdout=PIPE, universal_newlines=True)
    out, err = proc.communicate()
    source = open(fname).read()
    res = set()
    for define in out.splitlines():
        name = define.split(' ')[1]
        if re.search(name, source):
            res.add(define)
    return res
Sounds like a job for a shell one-liner!
What I want to do is remove all the #includes from the C file (so we don't get junk from other files), pass that off to gcc -E -dM, then remove all the built-in #defines - those start with _, and apparently also linux and unix.
If you have #defines that start with an underscore this won't work exactly as promised.
It goes like this:
sed -e '/#include/d' foo.c | gcc -E -dM - | sed -e '/#define \(linux\|unix\|_\)/d'
You could probably do it in a few lines of Python too.
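A rough Python equivalent of that pipeline might look like this (a sketch assuming gcc is on PATH; the two filters mirror the two sed expressions):

import re
import subprocess

def defines_from(fname):
    # drop the #include lines so gcc doesn't pull in defines from headers
    source = "".join(line for line in open(fname)
                     if not line.lstrip().startswith("#include"))
    # preprocess the stripped source from stdin and dump the macro table
    out = subprocess.run(["gcc", "-E", "-dM", "-x", "c", "-"],
                         input=source, capture_output=True, text=True).stdout
    # filter out the built-in macros (underscore, linux, unix), like the second sed
    return [d for d in out.splitlines()
            if not re.match(r"#define (_|linux|unix)", d)]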
In PowerShell you could do something like the following:
function Get-Defines {
    param([string] $Path)
    "$Path`:"
    switch -regex -file $Path {
        '\\$' {
            if ($multiline) { $_ }
        }
        '^\s*#define(.*)$' {
            $multiline = $_.EndsWith('\');
            $_
        }
        default {
            if ($multiline) { $_ }
            $multiline = $false
        }
    }
}
Using the following sample file
#define foo "bar"
blah
#define FOO \
do { \
do_stuff_here \
do_more_stuff \
} while (0)
blah
blah
#define X
it prints
\x.c:
#define foo "bar"
#define FOO \
do { \
do_stuff_here \
do_more_stuff \
} while (0)
#define X
Not ideal, at least compared to how idiomatic PowerShell functions should behave, but it should work well enough for your needs.
Doing this in pure python I'd use a small state machine:
def getdefines(fname):
    """ return a list of all define statements in the file """
    lines = open(fname).read().split("\n")  # read in the file as a list of lines
    result = []  # the result list
    current = []  # a temp list that holds all lines belonging to a define
    lineContinuation = False  # was the last line break escaped with a '\'?
    for line in lines:
        # is the current line the start or continuation of a define statement?
        isdefine = line.startswith("#define") or lineContinuation
        if isdefine:
            current.append(line)  # append to current result
            lineContinuation = line.endswith("\\")  # is the line break escaped?
            if not lineContinuation:
                # we reached the define statements end - append it to result list
                result.append('\n'.join(current))
                current = []  # empty the temp list
    return result
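Example usage, assuming a source file named foo.c exists in the working directory:

# print each complete #define block that was found
for define in getdefines("foo.c"):
    print(define)
    print("-" * 20)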