I need to analyse some C files and print out all the #defines found.
It's not that hard with a regexp, for example:
def with_regexp(fname):
    print("{0}:".format(fname))
    for line in open(fname):
        match = macro_regexp.match(line)
        if match is not None:
            print(match.groups())
But it doesn't handle multiline defines, for example.
There is a nice way to do it with the C preprocessor:
gcc -E -dM file.c
The problem is that it returns all the #defines, not just the ones from the given file, and I can't find any option that restricts it to the given file.
Any hint?
Thanks
EDIT:
This is a first solution that filters out the unwanted defines by simply checking that the name of each define actually appears in the original file. Not perfect, but it seems to work nicely.
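import re
from subprocess import Popen, PIPE
def with_gcc(fname):
    cmd = "gcc -dM -E {0}".format(fname)
    # universal_newlines=True makes the output text rather than bytes
    proc = Popen(cmd, shell=True, stdout=PIPE, universal_newlines=True)
    out, err = proc.communicate()
    source = open(fname).read()
    res = set()
    for define in out.splitlines():
        # keep only the macro name, dropping any parameter list
        name = define.split(' ')[1].split('(')[0]
        # escape the name so it is matched literally
        if re.search(re.escape(name), source):
            res.add(define)
    return res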
Sounds like a job for a shell one-liner!
What I want to do is remove all the #includes from the C file (so we don't get junk from other files), pass the result to gcc -E -dM, then remove all the built-in #defines: those start with _, plus, apparently, linux and unix.
If you have #defines of your own that start with an underscore, this won't work exactly as promised.
It goes like this:
sed -e '/#include/d' foo.c | gcc -E -dM - | sed -e '/#define \(linux\|unix\|_\)/d'
You could probably do it in a few lines of Python too.
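A minimal Python sketch of the same pipeline, assuming gcc is on the PATH and reads source from standard input when given `-`. The function names (`strip_includes`, `filter_builtin`, `file_defines`) are made up for illustration, and the filter carries the same underscore caveat as the sed version.

```python
import re
import subprocess

# Lines such as '#include <x.h>' or '  # include "x.h"'
INCLUDE = re.compile(r"\s*#\s*include\b")
# Same pattern as the second sed expression in the one-liner
BUILTIN = re.compile(r"#define (linux|unix|_)")


def strip_includes(source):
    """Drop #include lines so no other file's macros get pulled in."""
    kept = [line for line in source.splitlines(keepends=True)
            if not INCLUDE.match(line)]
    return "".join(kept)


def filter_builtin(defines):
    """Remove compiler built-ins (and anything else the pattern hits)."""
    return [d for d in defines if not BUILTIN.match(d)]


def file_defines(path, compiler="gcc"):
    """Return the #define lines that originate from `path` itself."""
    with open(path) as fh:
        source = strip_includes(fh.read())
    # "-" makes the compiler read the source from stdin
    proc = subprocess.run([compiler, "-E", "-dM", "-"], input=source,
                          stdout=subprocess.PIPE, universal_newlines=True,
                          check=True)
    return filter_builtin(proc.stdout.splitlines())
```

Calling `file_defines("foo.c")` should then return a list of `#define ...` lines. Unlike the sed expression, which deletes any line containing `#include`, the sketch only drops lines that start with the directive.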
In PowerShell you could do something like the following:
function Get-Defines {
    param([string] $Path)
    "$Path`:"
    switch -regex -file $Path {
        '\\$' {
            if ($multiline) { $_ }
        }
        '^\s*#define(.*)$' {
            $multiline = $_.EndsWith('\');
            $_
        }
        default {
            if ($multiline) { $_ }
            $multiline = $false
        }
    }
}
Using the following sample file
#define foo "bar"
blah
#define FOO \
do { \
do_stuff_here \
do_more_stuff \
} while (0)
blah
blah
#define X
it prints
\x.c:
#define foo "bar"
#define FOO \
do { \
do_stuff_here \
do_more_stuff \
} while (0)
#define X
It's not ideal, at least compared with how idiomatic PowerShell functions should work, but it should work well enough for your needs.
Doing this in pure python I'd use a small state machine:
def getdefines(fname):
    """ return a list of all define statements in the file """
    lines = open(fname).read().split("\n") #read in the file as a list of lines
    result = [] #the result list
    current = []#a temp list that holds all lines belonging to a define
    lineContinuation = False #was the last line break escaped with a '\'?
    for line in lines:
        #is the current line the start or continuation of a define statement?
        isdefine = line.startswith("#define") or lineContinuation
        if isdefine:
            current.append(line) #append to current result
            lineContinuation = line.endswith("\\") #is the line break escaped?
            if not lineContinuation:
                #we reached the define statements end - append it to result list
                result.append('\n'.join(current))
                current = [] #empty the temp list
    return result
I'm working on a bazel rule (using version 5.2.0) that uses SWIG (version 4.0.1) to make a python library from C++ code, adapted from a rule in the tensorflow library. The problem I've run into is that, depending on the contents of ctx.file.source.path, the swig invocation might produce a necessary .h file. If it does, the rule below works great. If it doesn't, I get:
ERROR: BUILD:31:11: output 'foo_swig_h.h' was not created
ERROR: BUILD:31:11: SWIGing foo.i. failed: not all outputs were created or valid
If the h_out stuff is removed from _py_swig_gen_impl, the rule below works great when swig doesn't produce the .h file. But, if swig does produce one, bazel seems to ignore it and it isn't available for native.cc_binary to compile, resulting in gcc failing with a 'no such file or directory' error on an #include <foo_swig_cc.h> line in foo_swig_cc.cc.
(The presence or absence of the .h file in the output is determined by whether the .i file at ctx.file.source.path uses SWIG's "directors" feature.)
def _include_dirs(deps):
    return depset(transitive = [dep[CcInfo].compilation_context.includes for dep in deps]).to_list()
def _headers(deps):
    return depset(transitive = [dep[CcInfo].compilation_context.headers for dep in deps]).to_list()
# Bazel rules for building swig files.
def _py_swig_gen_impl(ctx):
    module_name = ctx.attr.module_name
    cc_out = ctx.actions.declare_file(module_name + "_swig_cc.cc")
    h_out = ctx.actions.declare_file(module_name + "_swig_h.h")
    py_out = ctx.actions.declare_file(module_name + ".py")
    args = ["-c++", "-python", "-py3"]
    args += ["-module", module_name]
    args += ["-I" + x for x in _include_dirs(ctx.attr.deps)]
    args += ["-I" + x.dirname for x in ctx.files.swig_includes]
    args += ["-o", cc_out.path]
    args += ["-outdir", py_out.dirname]
    args += ["-oh", h_out.path]
    args.append(ctx.file.source.path)
    outputs = [cc_out, h_out, py_out]
    ctx.actions.run(
        executable = "swig",
        arguments = args,
        mnemonic = "Swig",
        inputs = [ctx.file.source] + _headers(ctx.attr.deps) + ctx.files.swig_includes,
        outputs = outputs,
        progress_message = "SWIGing %{input}.",
    )
    return [DefaultInfo(files = depset(direct = [cc_out, py_out]))]
_py_swig_gen = rule(
    attrs = {
        "source": attr.label(
            mandatory = True,
            allow_single_file = True,
        ),
        "swig_includes": attr.label_list(
            allow_files = [".i"],
        ),
        "deps": attr.label_list(
            allow_files = True,
            providers = [CcInfo],
        ),
        "module_name": attr.string(mandatory = True),
    },
    implementation = _py_swig_gen_impl,
)
def py_wrap_cc(name, source, module_name = None, deps = [], copts = [], **kwargs):
    if module_name == None:
        module_name = name
    python_deps = [
        "@local_config_python//:python_headers",
        "@local_config_python//:python_lib",
    ]
    # First, invoke the _py_swig_gen rule, which runs swig. This outputs:
    # `module_name.cc`, `module_name.py`, and, sometimes, `module_name.h` files.
    swig_rule_name = "swig_gen_" + name
    _py_swig_gen(
        name = swig_rule_name,
        source = source,
        swig_includes = ["//third_party/swig_rules:swig_includes"],
        deps = deps + python_deps,
        module_name = module_name,
    )
    # Next, we need to compile the `module_name.cc` and `module_name.h` files
    # from the previous rule. The `module_name.py` file already generated
    # expects there to be a `_module_name.so` file, so we name the cc_binary
    # rule this way to make sure that's the resulting file name.
    cc_lib_name = "_" + module_name + ".so"
    native.cc_binary(
        name = cc_lib_name,
        srcs = [":" + swig_rule_name],
        linkopts = ["-dynamic", "-L/usr/local/lib/"],
        linkshared = True,
        deps = deps + python_deps,
    )
    # Finally, package everything up as a python library that can be depended
    # on. Note that this rule uses the user-given `name`.
    native.py_library(
        name = name,
        srcs = [":" + swig_rule_name],
        srcs_version = "PY3",
        data = [":" + cc_lib_name],
        imports = ["./"],
    )
My question, broadly, is how I might best handle this with a single rule. I've tried adding a ctx.actions.write before the ctx.actions.run, thinking that I could generate a dummy '.h' file that would be overwritten if needed. That gives me:
ERROR: BUILD:41:11: for foo_swig_h.h, previous action: action 'Writing file foo_swig_h.h', attempted action: action 'SWIGing foo.i.'
My next idea is to remove the h_out stuff and then try to capture the h file for the cc_binary rule with some kind of glob invocation.
I've seen two approaches: add an attribute to indicate whether it applies, or write a wrapper script to generate it unconditionally.
Adding an attribute means something like "has_h": attr.bool(), and then use that in _py_swig_gen_impl to make the ctx.actions.declare_file(module_name + "_swig_h.h") conditional.
The wrapper script option means using something like this for the executable:
#!/bin/bash
set -e
touch the_path_of_the_header
exec swig "$@"
That will unconditionally create the output, and then swig will overwrite it if applicable. If it's not applicable, then passing around an empty header file in the Bazel rules should be harmless.
For posterity, this is what my _py_swig_gen_impl looks like after implementing @Brian's suggestion above:
def _py_swig_gen_impl(ctx):
    module_name = ctx.attr.module_name
    cc_out = ctx.actions.declare_file(module_name + "_swig_cc.cc")
    h_out = ctx.actions.declare_file(module_name + "_swig_h.h")
    py_out = ctx.actions.declare_file(module_name + ".py")
    include_dirs = _include_dirs(ctx.attr.deps)
    headers = _headers(ctx.attr.deps)
    args = ["-c++", "-python", "-py3"]
    args += ["-module", module_name]
    args += ["-I" + x for x in include_dirs]
    args += ["-I" + x.dirname for x in ctx.files.swig_includes]
    args += ["-o", cc_out.path]
    args += ["-outdir", py_out.dirname]
    args += ["-oh", h_out.path]
    args.append(ctx.file.source.path)
    outputs = [cc_out, h_out, py_out]
    # Depending on the contents of `ctx.file.source`, swig may or may not
    # output a .h file needed by subsequent rules. Bazel doesn't like optional
    # outputs, so instead of invoking swig directly we're going to make a
    # lightweight executable script that first `touch`es the .h file that may
    # get generated, and then execute that. This means we may be propagating
    # an empty .h file around as a "dependency" sometimes, but that's okay.
    swig_script_file = ctx.actions.declare_file("swig_exec.sh")
    ctx.actions.write(
        output = swig_script_file,
        is_executable = True,
        content = "#!/bin/bash\n\nset -e\ntouch " + h_out.path + "\nexec swig \"$@\"",
    )
    ctx.actions.run(
        executable = swig_script_file,
        arguments = args,
        mnemonic = "Swig",
        inputs = [ctx.file.source] + headers + ctx.files.swig_includes,
        outputs = outputs,
        progress_message = "SWIGing %{input}.",
    )
    return [
        DefaultInfo(files = depset(direct = outputs)),
    ]
The ctx.actions.write generates the suggested bash script:
#!/bin/bash
set -e
touch %{h_out.path}
exec swig "$@"
Which guarantees that the expected h_out will always be output by ctx.actions.run, whether or not swig generates it.
I am receiving ordinary comma-delimited CSV files whose data contains newline characters.
Input data
I want to convert the input data to:
Pipe (|) delimited
Without any quotes to escape (" or ')
Pipe (|) within data escaped with a caret (^) character
My file may also contain data that spans multiple lines (that is, a newline inside a single row).
Expected output data
Output file I was able to generate.
As you can see in the image, the caret (^) correctly escapes all pipes (|) in the data, but it also escapes the newline characters in the 5th and 6th lines, which I don't want.
NOTE: All carriage return (\r, CR) and newline (\n, LF) characters should be kept as they are, just as shown in the images.
import csv
import sys
inputPath = sys.argv[1]
outputPath = sys.argv[2]
# newline='' lets the csv module see the original line endings
with open(inputPath, newline='', encoding="utf-8") as inputFile:
    with open(outputPath, 'w', newline='', encoding="utf-8") as outputFile:
        reader = csv.DictReader(inputFile, delimiter=',')
        writer = csv.DictWriter(
            outputFile, reader.fieldnames, delimiter='|', quoting=csv.QUOTE_NONE, escapechar='^', doublequote=False, quotechar="")
        writer.writeheader()
        writer.writerows(reader)
print("Formatting complete.")
The code above is written in Python, so help in Python would be great.
Answers in other programming languages are also accepted.
There are more than 8 million records.
Please find below some sample data:
"VENDOR ID","VENDOR NAME","ORGANIZATION NUMBER","ADDRESS 1","CITY","COUNTRY","ZIP","PRIMARY PHONE","FAX","EMAIL","LMS RECORD CREATED DATE","LMS RECORD MODIFY DATE","DELETE FLAG","LMS RECORD ID"
"a0E6D000001Fag8UAC","Test 'Vendor' 1","","This Vendor contains a single (') quote.","","","","","","test@test.com","2020-4-1 06:32:29","2020-4-1 06:34:43","false",""
"a0E6D000001FagDUAS","Test ""Vendor"" 2","","This Vendor contains a double("") quote.","","","","","","test@test.com","2020-4-1 06:33:38","2020-4-1 06:35:18","false",""
"a0E6D000001FagIUAS","Test Vendor | 3","","This Vendor contains a Pipe (|).","","","","","","test@test.com","2020-4-1 06:38:45","2020-4-1 06:38:45","false",""
"a0E6D000001FagNUAS","Test Vendor 4","","This Vendor contains a
carriage return, i.e
data in new line.","","","","","","test@test.com","2020-4-1 06:43:08","2020-4-1 06:43:08","false",""
NOTE: If you copy the data above, please make sure that the 5th and 6th lines end with only LF (i.e. newline, \n), just as shown in the images, or else try to replicate those 2 lines. Not escaping those 2 lines specifically is what this question is all about, as highlighted in the image below.
The code above is the final outcome of everything I found on the internet. I've even tried the pandas library, and its final output is the same.
The code below is just an alternative way to get my expected output, but it has its own issue: the script takes forever (more than 12 hours, and it still doesn't finish, so ultimately I have to kill the process) when run on 9 million records.
Batch wrapper for JScript code:
0</* :
@echo off
cscript /nologo /E:jscript "%~f0" %*
exit /b %errorlevel% */0;
var ARGS = WScript.Arguments;
if (ARGS.Length < 3 ) {
WScript.Echo("Wrong arguments");
WScript.Echo(WScript.ScriptName + " path_to_file search replace [search replace[search replace [...]]]");
WScript.Echo(WScript.ScriptName + " e?path_to_file search replace [search replace[search replace [...]]]");
WScript.Echo("if filename starts with \"e?\" search and replace string will be evaluated for special characters ")
WScript.Quit(1);
}
if (ARGS.Item(0).toLowerCase() == "-help" || ARGS.Item(0).toLowerCase() == "-h") {
WScript.Echo(WScript.ScriptName + " path_to_file search replace [search replace[search replace [...]]]");
WScript.Echo(WScript.ScriptName + " e?path_to_file search replace [search replace[search replace [...]]]");
WScript.Echo("if filename starts with \"e?\" search and replace string will be evaluated for special characters ")
WScript.Quit(0);
}
if (ARGS.Length % 2 !== 1 ) {
WScript.Echo("Wrong arguments");
WScript.Quit(2);
}
var jsEscapes = {
'n': '\n',
'r': '\r',
't': '\t',
'f': '\f',
'v': '\v',
'b': '\b'
};
//string evaluation
//http://stackoverflow.com/questions/24294265/how-to-re-enable-special-character-sequneces-in-javascript
function decodeJsEscape(_, hex0, hex1, octal, other) {
var hex = hex0 || hex1;
if (hex) { return String.fromCharCode(parseInt(hex, 16)); }
if (octal) { return String.fromCharCode(parseInt(octal, 8)); }
return jsEscapes[other] || other;
}
function decodeJsString(s) {
return s.replace(
// Matches an escape sequence with UTF-16 in group 1, single byte hex in group 2,
// octal in group 3, and arbitrary other single-character escapes in group 4.
/\\(?:u([0-9A-Fa-f]{4})|x([0-9A-Fa-f]{2})|([0-3][0-7]{0,2}|[4-7][0-7]?)|(.))/g,
decodeJsEscape);
}
function convertToPipe(find, replace, str) {
return str.replace(new RegExp('\\|','g'),"^|");
}
function removeStartingQuote(find, replace, str) {
return str.replace(new RegExp('^"', 'g'), '');
}
function removeEndQuote(find, replace, str) {
return str.replace(new RegExp('"\r\n$', 'g'), '\r\n');
}
function removeLeadingAndTrailingQuotes(find, replace, str) {
return str.replace(new RegExp('"\r\n"', 'g'), '\r\n');
}
function replaceDelimiter(find, replace, str) {
return str.replace(new RegExp('","', 'g'), '|');
}
function convertSFDCDoubleQuotes(find, replace, str) {
return str.replace(new RegExp('""', 'g'), '"');
}
function getContent(file) {
// :: http://www.dostips.com/forum/viewtopic.php?f=3&t=3855&start=15&p=28898 ::
var ado = WScript.CreateObject("ADODB.Stream");
ado.Type = 2; // adTypeText = 2
ado.CharSet = "iso-8859-1"; // code page with minimum adjustments for input
ado.Open();
ado.LoadFromFile(file);
var adjustment = "\u20AC\u0081\u201A\u0192\u201E\u2026\u2020\u2021" +
"\u02C6\u2030\u0160\u2039\u0152\u008D\u017D\u008F" +
"\u0090\u2018\u2019\u201C\u201D\u2022\u2013\u2014" +
"\u02DC\u2122\u0161\u203A\u0153\u009D\u017E\u0178" ;
var fs = new ActiveXObject("Scripting.FileSystemObject");
var size = (fs.getFile(file)).size;
var lnkBytes = ado.ReadText(size);
ado.Close();
var chars=lnkBytes.split('');
for (var indx=0;indx<size;indx++) {
if ( chars[indx].charCodeAt(0) > 255 ) {
chars[indx] = String.fromCharCode(128 + adjustment.indexOf(chars[indx]));
}
}
return chars.join("");
}
function writeContent(file,content) {
var ado = WScript.CreateObject("ADODB.Stream");
ado.Type = 2; // adTypeText = 2
ado.CharSet = "iso-8859-1"; // right code page for output (no adjustments)
//ado.Mode=2;
ado.Open();
ado.WriteText(content);
ado.SaveToFile(file, 2);
ado.Close();
}
if (typeof String.prototype.startsWith != 'function') {
// see below for better implementation!
String.prototype.startsWith = function (str){
return this.indexOf(str) === 0;
};
}
var evaluate=false;
var filename=ARGS.Item(0);
if(filename.toLowerCase().startsWith("e?")) {
filename=filename.substring(2,filename.length);
evaluate=true;
}
var content=getContent(filename);
var newContent=content;
var find="";
var replace="";
for (var i=1;i<ARGS.Length-1;i=i+2){
find=ARGS.Item(i);
replace=ARGS.Item(i+1);
if(evaluate){
find=decodeJsString(find);
replace=decodeJsString(replace);
}
newContent=convertToPipe(find,replace,newContent);
newContent=removeStartingQuote(find,replace,newContent);
newContent=removeEndQuote(find,replace,newContent);
newContent=removeLeadingAndTrailingQuotes(find,replace,newContent);
newContent=replaceDelimiter(find,replace,newContent);
newContent=convertSFDCDoubleQuotes(find,replace,newContent);
}
writeContent(filename,newContent);
Execution Steps:
> replace.bat <file_name or full_path_to_file> "." "."
This batch file is meant for manipulating any file according to our requirements.
I put it together from a lot of Google searches. It's still a work in progress, as I've hardcoded my regular expressions in the file. You can change the functions I've written to suit your needs, or write your own by copying one of them and calling it at the end.
As another alternative, I've achieved what I want with a Windows PowerShell script:
((Get-Content -path $args[0] -Raw) -replace '\|', '^|') | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '^"', '') | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace "`"\r\n$", "") | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '"\r\n"', "`r`n") | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '","', '|') | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '""', '"' ) | Set-Content -Path $args[0]
Execution Ways:
Using Powershell
replace.ps1 '< path_to_file >'
Using a Batch Script
C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -ExecutionPolicy ByPass -command "& '< path_to_ps_script >\replace.ps1' '< path_to_csv_file >.csv'"
NOTE: PowerShell 5.0 or later is required.
This can process 1 million records in a minute or so.
What I've figured out is that we have to split bulky CSV files into multiple files of 1 million records each and then process them all separately.
Please correct me if I'm wrong, or if there is any alternative.
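One possible alternative to splitting the file, sketched in Python under a few assumptions: rows are streamed one at a time, only the pipe is escaped with a caret, embedded newlines are written back untouched, and each output row ends with CRLF. The function name `csv_to_pipe` and its parameters are illustrative, and the caret itself is not escaped here.

```python
import csv
import io


def csv_to_pipe(src, dst, delimiter="|", escape="^", line_end="\r\n"):
    """Stream rows from `src` to `dst`, escaping only the delimiter.

    csv.reader strips the surrounding quotes and turns "" into ",
    while newlines inside a quoted field stay part of the field.
    """
    count = 0
    for row in csv.reader(src):
        fields = [field.replace(delimiter, escape + delimiter) for field in row]
        dst.write(delimiter.join(fields) + line_end)
        count += 1
    return count


if __name__ == "__main__":
    # In-memory demo; real files should be opened with newline=""
    sample = '"id","name"\r\n"1","Test | 3"\r\n"2","two\nlines"\r\n'
    out = io.StringIO()
    csv_to_pipe(io.StringIO(sample, newline=""), out)
    print(repr(out.getvalue()))
```

For real files, both would be opened with `newline=""` and `encoding="utf-8"` (the output in `"w"` mode) so that Python does not translate any line endings. Because nothing is held in memory beyond the current row, the record count should not matter much.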
The question: I want to use a shell script to remove:
"""
111
222
"""
or
'''
111
222
'''
but do not remove
s = """
111
222
"""
I tested the following:
find -name *.py | xargs -i sed -i "/^\s*\"\"\".*\"\"\"$/d" {}
find -name *.py | xargs -i sed -i '/"""/,/"""/d' {}
but I have no idea how to keep
s = """
111
222
"""
Please help.
The test code looks like this, thanks:
"""
template
class Foo {
public:
virtual int Bar();
};
"""
source = """
class Foo {
public:
virtual int Bar();
};
"""
"""
template<class T>
class Foo {
public:
virtual int Bar();
};
"""
source = """
class Foo {
public:
virtual void Bar(bool flag) const;
};
"""
source = """
class Foo {
public:
virtual int Bar(void);
};
"""
Up to now, the
"""
aaa
"""
and the
="""
bbb
"""
cases have been tested OK.
function delete_multiline_comments
function delete_multiline_comments
grep -n '\"\"\"' gmock_class_test.py | sed '/=/,+1d' >a.txt
line=$(wc -l a.txt| awk '{print $1}')
for (( i=0;i<$line/2;i++ ))
do
second=`tail -1 a.txt | tr -cd "[0-9]"`
sed -i '$d' a.txt
first=`tail -1 a.txt | tr -cd "[0-9]"`
sed -i '$d' a.txt
sed -i "${first},${second}d" gmock_class_test.py
done
end
The following lines are used to remove single-line """XXXX""" and '''XXX''' strings:
sed -ie "/'''.*'''/d"
sed -i "/^\s*\"\"\".*\"\"\"$/d"
Not sed, but you can try vims.
For your problem, this is a quick way to do it (it swaps the """ with a key: eh3UgT):
cat myfile.py | \
vims -e '=\s*"""' 'f\"xxxaeh3UgT\<esc>/\"\"\"\<enter>xxxaeh3UgT' \
'"""' 'V/\"\"\"\<enter>d' -t '%s/eh3UgT/"""/g'
It works for me on your test file!
Explanation:
1. '=\s*"""' - Match all lines with = and then """ somewhere after
2. 'f\"xxxaeh3UgT\<esc>/\"\"\"\<enter>xxxaeh3UgT' - On that match, move to the first " with f\" (a vim command), then delete the quotes xxx, then enter insert mode a, then type eh3UgT, then move to the next triple quotes \<esc>/\"\"\"\<enter>, then delete them xxx, then type the key again a then eh3UgT.
3. '"""' - Match all remaining triple quotes
4. 'V/\"\"\"\<enter>d' - Start highlighting V, move to next triple quotes /\"\"\"\<enter>, then delete d.
5. -t '%s/eh3UgT/"""/g' - Turn on normal vim command-line mode, replace all keys with triple quotes
Here, 1. and 2. act to "save" the string variables by replacing their quotes with a key, then 3. and 4. delete everything contained within triple quotes, then 5. replaces the key back with the triple quotes.
If you are worried about the key eh3UgT being matched elsewhere by accident, just make it longer. If you are worried that this is insecure (say this is a recurring script), then randomly generate the key each time.
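Not a shell script either, but the same "keep assigned strings, drop bare ones" rule can be sketched in Python with the standard `ast` module. This is only a sketch under some assumptions: Python 3.8 or later, a file that parses as valid Python, and a made-up function name `strip_bare_strings`. It deletes whole lines, so a bare string sharing a line with other code takes that code with it, and removing a docstring that is the only statement in a function body would leave the function empty.

```python
import ast


def strip_bare_strings(source):
    """Delete statements that consist solely of a string literal."""
    doomed = set()
    for node in ast.walk(ast.parse(source)):
        # A bare string is an expression statement wrapping a str constant;
        # `s = """..."""` is an Assign node and is therefore left alone.
        if (isinstance(node, ast.Expr)
                and isinstance(node.value, ast.Constant)
                and isinstance(node.value.value, str)):
            doomed.update(range(node.lineno, node.end_lineno + 1))
    lines = source.splitlines(keepends=True)
    return "".join(line for number, line in enumerate(lines, start=1)
                   if number not in doomed)
```

Applied to each file found by `find`, this handles both `"""` and `'''` without a placeholder key, because the parser already knows which strings are assigned.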
I basically want to convert the tab-delimited text file http://www.linux-usb.org/usb.ids into a CSV file.
I tried importing it with Excel, but the result is not optimal; it turns out like:
8087 Intel Corp.
0020 Integrated Rate Matching Hub
0024 Integrated Rate Matching Hub
What I want, for easy searching, is:
8087 Intel Corp. 0020 Integrated Rate Matching Hub
8087 Intel Corp. 0024 Integrated Rate Matching Hub
Is there any way I can do this in Python?
$ListDirectory = "C:\USB_List.csv"
Invoke-WebRequest 'http://www.linux-usb.org/usb.ids' -OutFile $ListDirectory
$pageContents = Get-Content $ListDirectory | Select-Object -Skip 22
"vendor`tvendor_name`tproduct`tproduct_name`r" > $ListDirectory
#Variables and Flags
$currentVid
$currentVName
$currentPid
$currentPName
$vendorDone = $TRUE
$interfaceFlag = $FALSE
$nextline
$tab = "`t"
foreach($line in $pageContents){
if($line.StartsWith("`#")){
continue
}
elseif($line.length -eq 0){
exit
}
if(!($line.StartsWith($tab)) -and ($vendorDone -eq $TRUE)){
$vendorDone = $FALSE
}
if(!($line.StartsWith($tab)) -and ($vendorDone -eq $FALSE)){
$pos = $line.IndexOf(" ")
$currentVid = $line.Substring(0, $pos)
$currentVName = $line.Substring($pos+2)
"$currentVid`t$currentVName`t`t`r" >> $ListDirectory
$vendorDone = $TRUE
}
elseif ($line.StartsWith($tab)){
if ($interfaceFlag -eq $TRUE){
$interfaceFlag = $FALSE
}
$nextline = $line.TrimStart()
if ($nextline.StartsWith($tab)){
$interfaceFlag = $TRUE
}
if ($interfaceFlag -eq $FALSE){
$pos = $nextline.IndexOf(" ")
$currentPid = $nextline.Substring(0, $pos)
$currentPName = $nextline.Substring($pos+2)
"$currentVid`t$currentVName`t$currentPid`t$currentPName`r" >> $ListDirectory
Write-Host "$currentVid`t$currentVName`t$currentPid`t$currentPName`r"
$interfaceFlag = $FALSE
}
}
}
I know the question asks for Python, but I built this PowerShell script to do the job. It takes no parameters. Just run it as admin from the directory where you want to store the script. The script collects everything from the http://www.linux-usb.org/usb.ids page, parses the data and writes it to a tab-delimited file. You can then open the file in Excel as a tab-delimited file. Ensure the columns are read as "text" and not "general" and you're good to go. :)
Parsing this page is tricky because the script has to be contextually aware of every VID-Vendor line preceding a series of PID-Product lines. I also made the script ignore the commented description section, the interface/interface_name lines, the random comments inserted throughout the USB list (sigh), and everything from "#List of known device classes, subclasses and protocols" onward, which is out of scope for this request.
I hope this helps!
You just need to write a little program that scans in the data a line at a time. Then it should check to see if the first character is a tab ('\t'). If not then that value should be stored. If it does start with tab then print out the value that was previously stored followed by the current line. The result will be the list in the format you want.
Something like this would work:
import csv
lines = []
with open("usb.ids.txt") as f:
    reader = csv.reader(f, delimiter="\t")
    device = ""
    for line in reader:
        # Ignore empty lines and comments
        if len(line) == 0 or (len(line[0]) > 0 and line[0][0] == "#"):
            continue
        if line[0] != "":
            device = line[0]
        elif line[1] != "":
            lines.append((device, line[1]))
print(lines)
You basically need to loop through each line, and if it's a device line, remember that for the following lines. This will only work for two columns, and you would then need to write them all to a csv file but that's easy enough
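A slightly fuller sketch of the same loop, splitting each pair into its own column and producing CSV text. It assumes vendor lines start with a four-digit hex id and product lines start with exactly one tab; any other top-level line (such as the device class sections) ends the current vendor. The names `usb_ids_rows` and `to_csv_text` are illustrative only.

```python
import csv
import io
import re

# Top-level vendor line: four hex digits, whitespace, then the name
VENDOR = re.compile(r"^([0-9a-fA-F]{4})\s+(.*)$")


def usb_ids_rows(lines):
    """Yield (vendor_id, vendor_name, product_id, product_name) tuples."""
    vendor = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # blank line or comment
        if line.startswith("\t\t"):
            continue  # interface line, ignored in this sketch
        if line.startswith("\t"):
            if vendor is not None:
                parts = line.split(None, 1)
                name = parts[1] if len(parts) > 1 else ""
                yield vendor + (parts[0], name)
            continue
        match = VENDOR.match(line)
        # A non-vendor top-level line clears the vendor context
        vendor = match.groups() if match else None


def to_csv_text(rows):
    """Render the rows as CSV text with a header line."""
    out = io.StringIO(newline="")
    writer = csv.writer(out)
    writer.writerow(["vendor", "vendor_name", "product", "product_name"])
    writer.writerows(rows)
    return out.getvalue()
```

With the file opened as `f`, `to_csv_text(usb_ids_rows(f))` gives the text to write out; for the real list it may be better to pass an open output file to `csv.writer` directly rather than build one large string.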
I have a C header file which contains a series of classes, and I'm trying to write a function which will take those classes, and convert them to a python dict. A sample of the file is down the bottom.
Format would be something like
class CFGFunctions {
class ABC {
class AA {
file = "abc/aa/functions"
class myFuncName{ recompile = 1; };
};
class BB
{
file = "abc/bb/functions"
class funcName{
recompile=1;
}
}
};
};
I'm hoping to turn it into something like
{CFGFunctions:{ABC:{AA:"myFuncName"}, BB:...}}
# Or
{CFGFunctions:{ABC:{AA:{myFuncName:"string or list or something"}, BB:...}}}
In the end, I'm aiming to get the filepath string (which is actually a path to a folder... but anyway), and the class names in the same class as the file/folder path.
I've had a look on SO, Google and so on, but most things I've found have been about splitting lines into dicts, rather than n-deep 'blocks'.
I know I'll have to loop through the file, however, I'm not sure the most efficient way to convert it to the dict.
I'm thinking I'd need to grab the outside class and its relevant brackets, then do the same for the text remaining inside.
If none of that makes sense, it's cause I haven't quite made sense of the process myself haha
If any more info is needed, I'm happy to provide.
The following code is a quick mockup of what I'm sorta thinking...
It is most likely BROKEN and probably does NOT WORK, but it's sort of the process that I'm thinking of.
def get_data():
    fh = open('CFGFunctions.h', 'r')
    data = {} # will contain final data model
    # would probably refactor some of this into a function to allow better looping
    start = "" # starting class name
    brackets = 0 # number of brackets
    text= "" # temp storage for lines inside block while looping
    for line in fh:
        # find the class (start
        mt = re.match(r'Class ([\w_]+) {', line)
        if mt:
            if start == "":
                start = mt.group(1)
            else:
                # once we have the first class, find all other open brackets
                mt = re.match(r'{', line)
                if mt:
                    # and inc our counter
                    brackets += 1
                mt2 = re.match(r'}', line)
                if mt2:
                    # find the close, and decrement
                    brackets -= 1
                    # if we are back to the initial block, break out of the loop
                    if brackets == 0:
                        break
        text += line
    data[start] = {'tempText': text}
====
Sample file
class CfgFunctions {
class ABC {
class Control {
file = "abc\abc_sys_1\Modules\functions";
class assignTracker {
description = "";
recompile = 1;
};
class modulePlaceMarker {
description = "";
recompile = 1;
};
};
class Devices
{
file = "abc\abc_sys_1\devices\functions";
class registerDevice { recompile = 1; };
class getDeviceSettings { recompile = 1; };
class openDevice { recompile = 1; };
};
};
};
EDIT:
If possible, if I have to use a package, I'd like to have it in the programs directory, not the general python libs directory.
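For a dependency-free starting point, here is a minimal sketch that uses only the standard library: it tokenizes the text and builds nested dicts recursively, one level per `class` block. It assumes every `class Name` is followed by `{`, that values are either quoted strings or bare words, and that semicolons may be missing. The name `parse_classes` is made up for illustration, and there is no error handling for malformed input.

```python
import re

# Quoted strings, single punctuation characters, or bare words/numbers
TOKEN = re.compile(r'"[^"]*"|[{}=;]|[^\s{}=;"]+')


def parse_classes(text):
    """Turn nested `class Name { key = value; ... };` blocks into dicts."""
    tokens = TOKEN.findall(text)
    pos = 0

    def parse_body():
        nonlocal pos
        body = {}
        while pos < len(tokens) and tokens[pos] != "}":
            tok = tokens[pos]
            if tok == "class":
                name = tokens[pos + 1]
                pos += 3  # skip 'class', the name and '{'
                body[name] = parse_body()
                pos += 1  # skip the closing '}'
            elif tok == ";":
                pos += 1
            else:
                # key = value, with surrounding quotes removed
                body[tok] = tokens[pos + 2].strip('"')
                pos += 3
        return body

    return parse_body()
```

The result keeps `file` as an ordinary key next to the nested class names, so the folder path and the function names of one block can be read from the same dict. All values come back as strings.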
As you detected, parsing is necessary to do the conversion. Have a look at the package PyParsing, which is a fairly easy-to-use library to implement parsing in your Python program.
Edit: This is a minimal sketch of what it takes to recognize a very minimalistic grammar, somewhat like the example at the top of the question. It only checks the structure, but it might put you in the right direction:
from pyparsing import (Forward, Keyword, Literal, Optional, ParseException,
                       Word, ZeroOrMore, alphanums, alphas, quotedString)
test_code = """
class CFGFunctions {
    class ABC {
        class AA {
            file = "abc/aa/functions"
            class myFuncName{ recompile = 1; };
        };
        class BB
        {
            file = "abc/bb/functions"
            class funcName{
                recompile=1;
            }
        }
    };
};
"""
class_tkn = Keyword('class')
lbrace_tkn = Literal('{')
rbrace_tkn = Literal('}')
semicolon_tkn = Literal(';')
assign_tkn = Literal('=')
identifier = Word(alphas + '_', alphanums + '_')
value = quotedString | Word(alphanums + '_.')
# the sample omits some semicolons, so they are optional here
assignment = identifier + assign_tkn + value + Optional(semicolon_tkn)
# Forward lets class_block refer to itself for nested classes
class_block = Forward()
class_block <<= (class_tkn + identifier + lbrace_tkn +
                 ZeroOrMore(class_block | assignment) +
                 rbrace_tkn + Optional(semicolon_tkn))
def test_parser(test):
    try:
        results = class_block.parseString(test)
        print(test, ' -> ', results)
    except ParseException as s:
        print("Syntax error:", s)
def main():
    test_parser(test_code)
    return 0
if __name__ == '__main__':
    main()
Also, this code is only the parser - it does not generate any output. As you can see in the PyParsing docs, you can later add the actions you want. But the first step would be to recognize what you want to translate.
And a last note: Do not underestimate the complexities of parsing code... Even with a library like PyParsing, which takes care of much of the work, there are many ways to get mired in infinite loops and other amenities of parsing. Implement things step-by-step!
EDIT: A few sources for information on PyParsing are:
http://werc.engr.uaf.edu/~ken/doc/python-pyparsing/HowToUsePyparsing.html
http://pyparsing.wikispaces.com/
(Particularly interesting is http://pyparsing.wikispaces.com/Publications, with a long list of articles - several of them introductory - on PyParsing)
http://pypi.python.org/pypi/pyparsing_helper is a GUI for debugging parsers
There is also a 'pyparsing' tag here on Stack Overflow, where Paul McGuire (the PyParsing author) seems to be a frequent guest.
* NOTE: *
From PaulMcG in the comments below: Pyparsing is no longer hosted on wikispaces.com. Go to github.com/pyparsing/pyparsing