I am receiving standard comma-delimited CSV files in which some field values contain newline characters.
Input data
I want to convert the input data to:
Pipe (|) delimited
Without any quotes to escape (" or ')
Pipe (|) within data escaped with a caret (^) character
My file may also contain values that span multiple lines (that is, a newline inside a single field of a single row).
Expected output data
Output file I was able to generate.
As you can see in the image, the caret (^) correctly escapes all pipes (|) in the data, but it also escapes the newline characters in the 5th and 6th lines, which I don't want.
NOTE: All carriage return (\r, CR) and newline (\n, LF) characters should be left as they are, just as shown in the images.
import csv
import sys

inputPath = sys.argv[1]
outputPath = sys.argv[2]

with open(inputPath, encoding="utf-8") as inputFile:
    with open(outputPath, 'w', newline='', encoding="utf-8") as outputFile:
        reader = csv.DictReader(inputFile, delimiter=',')
        writer = csv.DictWriter(
            outputFile, reader.fieldnames, delimiter='|', quoting=csv.QUOTE_NONE, escapechar='^', doublequote=False, quotechar="")
        writer.writeheader()
        writer.writerows(reader)

print("Formatting complete.")
The above code is written in Python, so it would be great to get help in Python.
Answers in other programming languages are also accepted.
There are more than 8 million records.
Please find some sample data below:
"VENDOR ID","VENDOR NAME","ORGANIZATION NUMBER","ADDRESS 1","CITY","COUNTRY","ZIP","PRIMARY PHONE","FAX","EMAIL","LMS RECORD CREATED DATE","LMS RECORD MODIFY DATE","DELETE FLAG","LMS RECORD ID"
"a0E6D000001Fag8UAC","Test 'Vendor' 1","","This Vendor contains a single (') quote.","","","","","","test#test.com","2020-4-1 06:32:29","2020-4-1 06:34:43","false",""
"a0E6D000001FagDUAS","Test ""Vendor"" 2","","This Vendor contains a double("") quote.","","","","","","test#test.com","2020-4-1 06:33:38","2020-4-1 06:35:18","false",""
"a0E6D000001FagIUAS","Test Vendor | 3","","This Vendor contains a Pipe (|).","","","","","","test#test.com","2020-4-1 06:38:45","2020-4-1 06:38:45","false",""
"a0E6D000001FagNUAS","Test Vendor 4","","This Vendor contains a
carriage return, i.e
data in new line.","","","","","","test#test.com","2020-4-1 06:43:08","2020-4-1 06:43:08","false",""
NOTE: If you copy the above data, please make sure that the 5th and 6th lines end with only LF (i.e. newline, \n), just as shown in the images, or otherwise try to replicate those two lines, because this question is specifically about not escaping those two lines, as highlighted in the image below.
The above code is the final outcome of everything I found on the internet. I've even tried the pandas library, and its final output is the same.
The code below is just an alternative way to get my expected output, but the issue remains: this script takes forever (more than 12 hours, without finishing, so ultimately I have to kill the process) when run on 9 million records.
Batch wrapper for the JScript code:
0</* :
@echo off
cscript /nologo /E:jscript "%~f0" %*
exit /b %errorlevel% */0;

var ARGS = WScript.Arguments;
if (ARGS.Length < 3 ) {
    WScript.Echo("Wrong arguments");
    WScript.Echo(WScript.ScriptName + " path_to_file search replace [search replace[search replace [...]]]");
    WScript.Echo(WScript.ScriptName + " e?path_to_file search replace [search replace[search replace [...]]]");
    WScript.Echo("if filename starts with \"e?\" search and replace string will be evaluated for special characters ")
    WScript.Quit(1);
}
if (ARGS.Item(0).toLowerCase() == "-help" || ARGS.Item(0).toLowerCase() == "-h") {
    WScript.Echo(WScript.ScriptName + " path_to_file search replace [search replace[search replace [...]]]");
    WScript.Echo(WScript.ScriptName + " e?path_to_file search replace [search replace[search replace [...]]]");
    WScript.Echo("if filename starts with \"e?\" search and replace string will be evaluated for special characters ")
    WScript.Quit(0);
}
if (ARGS.Length % 2 !== 1 ) {
    WScript.Echo("Wrong arguments");
    WScript.Quit(2);
}
var jsEscapes = {
    'n': '\n',
    'r': '\r',
    't': '\t',
    'f': '\f',
    'v': '\v',
    'b': '\b'
};

//string evaluation
//http://stackoverflow.com/questions/24294265/how-to-re-enable-special-character-sequneces-in-javascript
function decodeJsEscape(_, hex0, hex1, octal, other) {
    var hex = hex0 || hex1;
    if (hex) { return String.fromCharCode(parseInt(hex, 16)); }
    if (octal) { return String.fromCharCode(parseInt(octal, 8)); }
    return jsEscapes[other] || other;
}

function decodeJsString(s) {
    return s.replace(
        // Matches an escape sequence with UTF-16 in group 1, single byte hex in group 2,
        // octal in group 3, and arbitrary other single-character escapes in group 4.
        /\\(?:u([0-9A-Fa-f]{4})|x([0-9A-Fa-f]{2})|([0-3][0-7]{0,2}|[4-7][0-7]?)|(.))/g,
        decodeJsEscape);
}

function convertToPipe(find, replace, str) {
    return str.replace(new RegExp('\\|','g'),"^|");
}

function removeStartingQuote(find, replace, str) {
    return str.replace(new RegExp('^"', 'g'), '');
}

function removeEndQuote(find, replace, str) {
    return str.replace(new RegExp('"\r\n$', 'g'), '\r\n');
}

function removeLeadingAndTrailingQuotes(find, replace, str) {
    return str.replace(new RegExp('"\r\n"', 'g'), '\r\n');
}

function replaceDelimiter(find, replace, str) {
    return str.replace(new RegExp('","', 'g'), '|');
}

function convertSFDCDoubleQuotes(find, replace, str) {
    return str.replace(new RegExp('""', 'g'), '"');
}
function getContent(file) {
    // :: http://www.dostips.com/forum/viewtopic.php?f=3&t=3855&start=15&p=28898 ::
    var ado = WScript.CreateObject("ADODB.Stream");
    ado.Type = 2; // adTypeText = 2
    ado.CharSet = "iso-8859-1"; // code page with minimum adjustments for input
    ado.Open();
    ado.LoadFromFile(file);
    var adjustment = "\u20AC\u0081\u201A\u0192\u201E\u2026\u2020\u2021" +
                     "\u02C6\u2030\u0160\u2039\u0152\u008D\u017D\u008F" +
                     "\u0090\u2018\u2019\u201C\u201D\u2022\u2013\u2014" +
                     "\u02DC\u2122\u0161\u203A\u0153\u009D\u017E\u0178" ;
    var fs = new ActiveXObject("Scripting.FileSystemObject");
    var size = (fs.getFile(file)).size;
    var lnkBytes = ado.ReadText(size);
    ado.Close();
    var chars=lnkBytes.split('');
    for (var indx=0;indx<size;indx++) {
        if ( chars[indx].charCodeAt(0) > 255 ) {
            chars[indx] = String.fromCharCode(128 + adjustment.indexOf(chars[indx]));
        }
    }
    return chars.join("");
}

function writeContent(file,content) {
    var ado = WScript.CreateObject("ADODB.Stream");
    ado.Type = 2; // adTypeText = 2
    ado.CharSet = "iso-8859-1"; // right code page for output (no adjustments)
    //ado.Mode=2;
    ado.Open();
    ado.WriteText(content);
    ado.SaveToFile(file, 2);
    ado.Close();
}
if (typeof String.prototype.startsWith != 'function') {
    // see below for better implementation!
    String.prototype.startsWith = function (str){
        return this.indexOf(str) === 0;
    };
}

var evaluate=false;
var filename=ARGS.Item(0);
if(filename.toLowerCase().startsWith("e?")) {
    filename=filename.substring(2,filename.length);
    evaluate=true;
}
var content=getContent(filename);
var newContent=content;
var find="";
var replace="";
for (var i=1;i<ARGS.Length-1;i=i+2){
    find=ARGS.Item(i);
    replace=ARGS.Item(i+1);
    if(evaluate){
        find=decodeJsString(find);
        replace=decodeJsString(replace);
    }
    newContent=convertToPipe(find,replace,newContent);
    newContent=removeStartingQuote(find,replace,newContent);
    newContent=removeEndQuote(find,replace,newContent);
    newContent=removeLeadingAndTrailingQuotes(find,replace,newContent);
    newContent=replaceDelimiter(find,replace,newContent);
    newContent=convertSFDCDoubleQuotes(find,replace,newContent);
}
writeContent(filename,newContent);
Execution Steps:
> replace.bat <file_name or full_path_to_file> "." "."
This batch file is meant for manipulating any file according to our requirements.
I've compiled it from a lot of Google searches. It's still a work in progress, as I've hardcoded my regular expressions in the file. You can change the functions I've written according to your needs, or even write your own functions by copying the existing ones and calling them at the end.
Another alternative for what I want to achieve is a Windows PowerShell script.
((Get-Content -path $args[0] -Raw) -replace '\|', '^|') | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '^"', '') | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace "`"\r\n$", "") | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '"\r\n"', "`r`n") | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '","', '|') | Set-Content -NoNewline -Force -Path $args[0]
((Get-Content -path $args[0] -Raw) -replace '""', '"' ) | Set-Content -Path $args[0]
Ways to run it:
Using PowerShell
replace.ps1 '< path_to_file >'
Using a Batch Script
C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe -ExecutionPolicy ByPass -command "& '< path_to_ps_script >\replace.ps1' '< path_to_csv_file >.csv'"
NOTE: PowerShell V5.0 or greater is required.
This can process 1 million records in a minute or so.
What I've figured out is that we have to split bulky CSV files into multiple files of 1 million records each and then process them all separately.
Please correct me if I'm wrong, or if there is any other alternative.
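One way to avoid both the unwanted newline escaping and the whole-file regex passes is to let the csv module do only the parsing and to write each row by hand, escaping nothing but the pipe. The following is a minimal sketch under a few assumptions: the function name convert_to_pipe and its parameters are made up for illustration, rows are ended with CRLF (change row_end if your target expects something else), and a literal caret in the data is not escaped because the question does not ask for that. It streams row by row, so the file does not have to be split first.

```python
import csv
import io


def convert_to_pipe(src, dst, delimiter="|", escape="^", row_end="\r\n"):
    """Copy CSV rows from src to dst as unquoted, pipe-delimited text.

    Only the delimiter is escaped; CR/LF inside a field is written unchanged.
    The name and parameters are illustrative, not from the original post.
    """
    for row in csv.reader(src):
        fields = (field.replace(delimiter, escape + delimiter) for field in row)
        dst.write(delimiter.join(fields) + row_end)


# With real files, newline="" keeps embedded CR/LF untouched on both sides:
# with open(in_path, newline="", encoding="utf-8") as src, \
#         open(out_path, "w", newline="", encoding="utf-8") as dst:
#     convert_to_pipe(src, dst)

# Small in-memory demonstration with an embedded LF and an embedded pipe.
sample = io.StringIO(
    '"ID","NAME","ADDRESS"\r\n"1","Vendor | 3","line one\nline two"\r\n',
    newline="",
)
result = io.StringIO(newline="")
convert_to_pipe(sample, result)
print(repr(result.getvalue()))
```

Because the writer is bypassed, there is no lineterminator for it to escape, which is what caused the carets before the line breaks in the generated file.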
I am new to Python and trying to figure out how to get the port number from /etc/services, given the port name.
/etc/services contains the following values:
DB2_test 60000/tcp
DB2_test_1 60001/tcp
DB2_test_2 60002/tcp
DB2_test_3 60003/tcp
DB2_test_4 60004/tcp
DB2_test_END 60005/tcp
The command
db2port=os.popen("db2 get dbm cfg | grep -i Service | awk '{{print $6}}'").read()
print(db2port)
returns DB2_test
The below command does not work. I want to just see the value of DB2_test, which is 60000:
getnum = "cat /etc/services | sed -n '/\{db2port}\s/p' | awk '{print $2}' | sed 's/\/tcp$//'"
print(getnum)
No need to invoke awk, sed etc. A pure Python solution would be:
with open("/etc/services") as services:
    for line in services:
        parts = line.split()
        if parts and parts[0] == 'DB2_test':
            port, protocol = parts[1].split('/')
            print(port)
Assuming the variable services contains the text of your /etc/services:
port_map = {
    name: int(value.split('/')[0])
    for name, value in (
        line.split() for line in services.splitlines()
    )
}
Now you have a map from the service's name to its port, so that port_map["DB2_test"] == 60000, for example.
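A real /etc/services usually also has blank lines, comments and alias columns, which would break a strict two-field split. Here is a slightly more defensive sketch of the same idea; the function name service_ports and the sample text are assumptions for illustration. Note also that the value read from os.popen(...).read() ends with a newline, so it needs .strip() before being used as a key. The standard library's socket.getservbyname(name, "tcp") is another option if the entry is visible to the system's services database.

```python
def service_ports(text, protocol="tcp"):
    """Map service name -> port for one protocol, skipping comments and blanks.

    Hypothetical helper; text is the content of a services-style file.
    """
    ports = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0]          # drop trailing comments
        parts = line.split()
        if len(parts) < 2 or "/" not in parts[1]:
            continue                          # blank or malformed line
        port, proto = parts[1].split("/", 1)
        if proto == protocol and port.isdigit():
            ports[parts[0]] = int(port)
    return ports


sample = """\
# local additions

DB2_test        60000/tcp
DB2_test_1      60001/tcp   # first extra port
DB2_test_END    60005/tcp
"""

db2port = "DB2_test\n"                        # as returned by os.popen(...).read()
print(service_ports(sample)[db2port.strip()])
```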
I have a file containing similar data
[xxx]
name = xxx
address = bangalore
[yyy]
name = yyy
address = sjc
Please help me with a regex that can fetch the address or name value for a given section (the inputs are the section, xxx or yyy, and the key, address or name).
You can do something like this with awk if your file is just like that (i.e., the name is the same as the section and it is before the address):
$ awk -v nm='yyy' -F ' *= *' '$1=="name" && $2==nm{infi=1; next}
$1=="address" && infi {print $2; infi=0}' file
sjc
Or, better still you can get the section and then fetch the key, value as they occur and print them and then exit:
$ awk -v sec='yyy' -v key='address' '
BEGIN{
FS=" *= *"
pat=sprintf("^\\[%s\\]", sec)}
$0 ~ pat {secin=$1; next}
NF==2 && $1==key && secin ~ pat {print $2; exit}' file
sjc
If you want to gather all sections with their key/value pairs, you can do (with gawk):
$ gawk 'BEGIN{FS=" *= *"}
/^\[[^\]]+\]/ && NF==1 {sec=$1; next}
NF==2 {d[sec][$1]=$2}
END{ for (k in d){
printf "%s: ",k
for (v in d[k])
printf "\t%s = %s\n", v, d[k][v]
}
}' file
[xxx]: address = bangalore
name = xxx
[yyy]: address = sjc
name = yyy
Config or .ini files can have quoting like csv, so it is best to use a full config file parser. You can use Perl or Python that have robust libraries for parsing .ini or config type files.
Python example:
#!/usr/bin/python3
import configparser  # this module is named ConfigParser in Python 2
config = configparser.ConfigParser()
config.read("/tmp/file")
Then you can grab the sections, the items in each section, or a specific item in a specific section:
>>> config.sections()
['xxx', 'yyy']
>>> config.items("yyy")
[('name', 'yyy'), ('address', 'sjc')]
>>> config.get("xxx", "address")
'bangalore'
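Since the question takes a section and a key as input, the lookup can be wrapped in a small function. This is a minimal sketch assuming Python 3; the name lookup is made up for illustration, and read_string is used so the example runs without a file on disk.

```python
import configparser


def lookup(text, section, key):
    """Return the value of key inside [section] of INI-style text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser.get(section, key)


sample = """\
[xxx]
name = xxx
address = bangalore
[yyy]
name = yyy
address = sjc
"""

print(lookup(sample, "yyy", "address"))
print(lookup(sample, "xxx", "name"))
```

For a file on disk, parser.read(path) can replace read_string, as in the snippet above.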
Regex to the rescue!
This approach splits the entries into single elements and parses the key-value pairs afterwards. In the end, you can simply ask the resulting dictionary for, e.g., values['xxx'].
See a demo on ideone.com.
import re
string = """
[xxx]
name = xxx
address = bangalore
[yyy]
name = yyy
address = sjc
"""
rx_item = re.compile(r'''
    ^\[(?P<name>[^][]*)\]
    .*?
    (?=^\[[^][]*\]$|\Z)
    ''', re.X | re.M | re.DOTALL)

rx_value = re.compile(r'^(?P<key>\w+)\s*=\s*(?P<value>.+)$', re.MULTILINE)

values = {item.group('name'): {
            m.group('key'): m.group('value')
            for m in rx_value.finditer(item.group(0))}
          for item in rx_item.finditer(string)
}
print(values)
# {'xxx': {'name': 'xxx', 'address': 'bangalore'}, 'yyy': {'name': 'yyy', 'address': 'sjc'}}
It's not clear if you're trying to search for the value inside the square brackets or the value of the "name" tag but here's a solution to one possible interpretation of your question:
$ cat tst.awk
BEGIN { FS=" *= *" }
!NF { next }
NF<2 { prt(); k=$0 }
{ map[$1] = $2 }
END { prt() }
function prt() { if (k=="["key"]") print map[tag]; delete map }
$ awk -v key='yyy' -v tag='address' -f tst.awk file
sjc
$ awk -v key='xxx' -v tag='address' -f tst.awk file
bangalore
$ awk -v key='xxx' -v tag='name' -f tst.awk file
xxx
Following this: Find out git branch creator
I am making a Python script that gives me a sorted set of emails from the result of
git for-each-ref --format='%(authoremail)%09%(refname)' | sort -k5n -k2M -k3n -k4n | grep remotes | awk -F "\t" '{ printf "%-32s %-27s %s\n", $1, $2, $3 }'
so that I can email the authors to tell them that these are their branches on the remote and ask them to delete them.
But when I try to put it together in Python, I get an error:
import re
import subprocess

intitial = "git for-each-ref --format='%(authoremail)%09%(refname)' | sort -k5n -k2M -k3n -k4n | grep remotes | awk -F "
addTab = "\t"
printf = '{ printf "%-32s %-27s %s\n", $1, $2, $3 }'
gitCommnad = "%s%s %s " % (intitial, addTab, printf)

def _exec_git_command(command, verbose=False):
    """ Function used to get data out of git commands
        and errors in case of failure.
    Args:
        command(string): string of a git command
        verbose(bool): whether to display every command
            and its resulting data.
    Returns:
        (tuple): string of data and error if present
    """
    # converts multiple spaces to single space
    command = re.sub(' +',' ',command)
    # universal_newlines=True makes stdout/stderr text rather than bytes
    pr = subprocess.Popen(command, shell=True, universal_newlines=True,
                          stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    msg = pr.stdout.read()
    err = pr.stderr.read()
    if err:
        print(err)
        if 'Could not resolve host' in err:
            return
    if verbose and msg:
        print("Executing '%s' %s" % (command, msg))
    return msg, err

print(_exec_git_command(gitCommnad))
The issue is that you are not putting \t or { printf "%-32s %-27s %s\n", $1, $2, $3 } inside quotes, hence awk reports a syntax error. Note also that in a normal Python string \n becomes a real line break inside the awk string, so a raw string is used for the awk program here. You should use:
intitial = "git for-each-ref --format='%(authoremail)%09%(refname)' | sort -k5n -k2M -k3n -k4n | grep remotes | awk -F"
addTab = "\t"
printf = r'{ printf "%-32s %-27s %s\n", $1, $2, $3 }'
gitCommnad = "%s \"%s\" '%s' " % (intitial, addTab, printf)
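If the goal is just the set of author emails with their remote branches, the sort, grep and awk stages can be replaced by a few lines of Python, which sidesteps the shell quoting problem entirely. This is a sketch rather than a drop-in replacement: the function names are made up, it assumes Python 3.7+ for capture_output, and it relies on git printing %(authoremail) wrapped in angle brackets.

```python
import subprocess
from collections import defaultdict


def branches_by_author(for_each_ref_output):
    """Group remote refs by author email from 'email<TAB>refname' lines."""
    grouped = defaultdict(list)
    for line in for_each_ref_output.splitlines():
        if "\t" not in line:
            continue
        email, ref = line.split("\t", 1)
        if ref.startswith("refs/remotes/"):
            grouped[email.strip("<>")].append(ref)
    return dict(grouped)


def remote_branches_by_author():
    """Run git without a shell, so no quoting of the format string is needed."""
    out = subprocess.run(
        ["git", "for-each-ref", "--format=%(authoremail)%09%(refname)", "refs/remotes"],
        capture_output=True, text=True, check=True,
    ).stdout
    return branches_by_author(out)


# Demonstration on canned output, so it runs outside a repository.
sample = (
    "<alice@example.com>\trefs/remotes/origin/feature-1\n"
    "<bob@example.com>\trefs/heads/main\n"
    "<alice@example.com>\trefs/remotes/origin/feature-2\n"
)
for email, refs in sorted(branches_by_author(sample).items()):
    print(email, refs)
```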
I need to analyse some C files and print out all the #defines found.
It's not that hard with a regexp, for example:
def with_regexp(fname):
    # macro_regexp is a compiled pattern defined elsewhere (not shown here)
    print("{0}:".format(fname))
    with open(fname) as source:
        for line in source:
            match = macro_regexp.match(line)
            if match is not None:
                print(match.groups())
But it doesn't handle multiline defines, for example.
There is a nice way to do it with gcc:
gcc -E -dM file.c
The problem is that it returns all the #defines, not just the ones from the given file, and I can't find any option to restrict it to the given file.
Any hints?
Thanks
EDIT:
This is a first solution to filter out the unwanted defines, simply by checking that the name of the define actually appears in the original file. It's not perfect, but it seems to work nicely.
import re
from subprocess import Popen, PIPE

def with_gcc(fname):
    cmd = "gcc -dM -E {0}".format(fname)
    # universal_newlines=True so that out is text rather than bytes
    proc = Popen(cmd, shell=True, stdout=PIPE, universal_newlines=True)
    out, err = proc.communicate()
    source = open(fname).read()
    res = set()
    for define in out.splitlines():
        name = define.split(' ')[1]
        # escape the name so function-like macros such as FOO(x) match literally
        if re.search(re.escape(name), source):
            res.add(define)
    return res
Sounds like a job for a shell one-liner!
What I want to do is remove all the #includes from the C file (so we don't get junk from other files), pass the result to gcc -E -dM, then remove all the built-in #defines: those start with _, and apparently with linux and unix.
If you have #defines that start with an underscore this won't work exactly as promised.
It goes like this:
sed -e '/#include/d' foo.c | gcc -E -dM - | sed -e '/#define \(linux\|unix\|_\)/d'
You could probably do it in a few lines of Python too.
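A rough Python equivalent of that pipeline might look like the sketch below. The function names are made up for illustration, the filter mirrors the sed patterns (so user macros starting with an underscore, linux or unix are dropped too), and file_defines assumes gcc is on the PATH and Python 3.7+.

```python
import re
import subprocess

# Same pattern the second sed uses to drop compiler built-ins.
BUILTIN = re.compile(r"#define (linux|unix|_)")


def strip_includes(source):
    """Remove every line that contains #include, like sed '/#include/d'."""
    return "".join(
        line for line in source.splitlines(keepends=True) if "#include" not in line
    )


def drop_builtin_defines(gcc_output):
    """Keep only the #define lines that do not look like built-ins."""
    return [line for line in gcc_output.splitlines() if not BUILTIN.search(line)]


def file_defines(path, compiler="gcc"):
    """Feed the include-free source to 'gcc -E -dM -' and filter the result."""
    with open(path) as handle:
        source = strip_includes(handle.read())
    out = subprocess.run(
        [compiler, "-E", "-dM", "-"],
        input=source, capture_output=True, text=True, check=True,
    ).stdout
    return drop_builtin_defines(out)


# The two pure helpers can be tried without a compiler:
print(strip_includes('#include <stdio.h>\n#define A 1\nint x;\n'))
print(drop_builtin_defines('#define __GNUC__ 9\n#define linux 1\n#define A 1\n'))
```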
In PowerShell you could do something like the following:
function Get-Defines {
    param([string] $Path)

    "$Path`:"
    switch -regex -file $Path {
        '\\$' {
            if ($multiline) { $_ }
        }
        '^\s*#define(.*)$' {
            $multiline = $_.EndsWith('\');
            $_
        }
        default {
            if ($multiline) { $_ }
            $multiline = $false
        }
    }
}
Using the following sample file
#define foo "bar"
blah
#define FOO \
do { \
do_stuff_here \
do_more_stuff \
} while (0)
blah
blah
#define X
it prints
\x.c:
#define foo "bar"
#define FOO \
do { \
do_stuff_here \
do_more_stuff \
} while (0)
#define X
It's not ideal, at least in terms of how idiomatic PowerShell functions should work, but it should work well enough for your needs.
Doing this in pure Python, I'd use a small state machine:
def getdefines(fname):
    """ return a list of all define statements in the file """
    lines = open(fname).read().split("\n")  # read in the file as a list of lines
    result = []   # the result list
    current = []  # a temp list that holds all lines belonging to a define
    lineContinuation = False  # was the last line break escaped with a '\'?

    for line in lines:
        # is the current line the start or continuation of a define statement?
        isdefine = line.startswith("#define") or lineContinuation
        if isdefine:
            current.append(line)  # append to current result
            lineContinuation = line.endswith("\\")  # is the line break escaped?
            if not lineContinuation:
                # we reached the define statement's end - append it to result list
                result.append('\n'.join(current))
                current = []  # empty the temp list

    return result