Processing CSV file without double quotes - python

In other words, I am looking for a way to ignore ", " in one of the fields.
The field should be treated as one single field even though it contains a comma.
Example:
Round,Winner,place,prize
1,xyz,1,$4,500
If I read this with DictReader, $4,500 is printed as $4, because 500 is considered to be another field. This makes sense, since I am reading the file as comma-delimited, so I can't really complain, but I am trying to figure out a workaround.
reader = csv.reader(f, delimiter=',', quotechar='"')
My source is not wrapped in double quotes so I can't ignore by including a quote string.
Is there any other way to handle this scenario? Perhaps something like defining these dollar fields and making the reader ignore commas inside them? Or trying to insert quotes around such fields?
If not Python, could shell script or Perl be used to do it?

Perhaps pre-process the data to wrap all money in quotes, then process normally
$line =~ s/( \$\d+ (?:,\d{3})* (?:\.\d{2})? )/"$1"/gx;
The pattern matches digits following a $, optionally followed by any number of ,nnn groups and/or by one .nn. It also wraps $4.22 as well as $100, which I consider good for consistency. Restrict what gets matched if needed, for example to (\$\d{1,3},\d{3}). For fractional cents, remove the {2}. This doesn't cover all possible edge/broken cases.
The /g modifier makes it replace all such occurrences in the line, and /x allows spaces inside the pattern for readability.
You can do it as a one-liner
perl -pe 's/(\$\d+(?:,\d{3})*(?:\.\d{2})?)/"$1"/g' input.csv > changed.csv
Add -i switch to overwrite input ("in-place"), or -i.bak to also keep backup.
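For example, a quick check on the sample row from the question:
$ printf '1,xyz,1,$4,500\n' | perl -pe 's/(\$\d+(?:,\d{3})*(?:\.\d{2})?)/"$1"/g'
1,xyz,1,"$4,500"
With the field quoted, csv.reader (with its default quotechar='"') will then parse it as a single field.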
If you anticipate further need for tweaks, or to document this better, put it in a script
use warnings;
use strict;
my $file = '...';
my $fout = '...';
open my $fh, '<', $file or die "Can't open $file: $!";
open my $fh_out, '>', $fout or die "Can't open $fout for writing: $!";
while (my $line = <$fh>) {
    $line =~ s/( \$\d+ (?:,\d{3})* (?:\.\d{2})? )/"$1"/gx;
    print $fh_out $line;
}
close $fh;
close $fh_out;

If the extra , is always going to be a part of the last field when it exists, you could use a Bash read loop for it:
#!/bin/bash
while IFS=, read -r f1 f2 f3 f4; do
    # f4 => has everything after f3, including extra commas as in $4,500
    # do your processing
    printf 'f1=[%s] f2=[%s] f3=[%s] f4=[%s]\n' "$f1" "$f2" "$f3" "$f4"
done < input.txt
Input:
1,xyz,1,$4,500
2,abc,3,$400
Output:
f1=[1] f2=[xyz] f3=[1] f4=[$4,500]
f1=[2] f2=[abc] f3=[3] f4=[$400]

How to apply string formatting to a bash command (incorporated into Python script via subprocess)?

I would like to add a bash command to my Python script, which linearises a FASTA sequence file while leaving sequence separation intact (hence the specific choice of command). Below is the command, with the example input file of "inputfile.txt":
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < inputfile.txt
The aim is to allow the user to specify the file which is to be modified in the command line, for example:
$ python3 program.py inputfile.txt
I have tried to use string formatting (i.e. %s) in conjunction with sys.argv in order to achieve this. However, I have tried many different locations of " and ', and still cannot get this to work and accept a user input from the command line here.
(The command contains escapes such as \n and so I have tried to counteract this by adding additional backslashes, as well as additional % for the existing %s in the command.)
import sys
import subprocess
path = sys.argv[1]
holder = subprocess.Popen("""awk '/^>/ {printf("\\n%%s\\n",$0);next; } { printf("%%s",$0);} END {printf("\\n");}' < %s""" % path , shell=True, stdout=subprocess.PIPE).stdout.read()
print(holder)
I would very much appreciate any help with identifying the syntax error here, or suggestions for how I could add this user input.
TL;DR: Don't shell out to awk! Just use Python. But let's go step by step...
Your instinct to use triple quotes here is good: that way you at least don't need to escape the single and double quotes that appear in your shell string.
The next useful device you can use is raw strings, using r'...' or r"..." or r"""...""". Raw strings don't expand backslash escapes, so in that case you can leave the \ns intact.
Last is the %s, which you need to escape if you use the % operator, but here I'm going to suggest that instead of using the shell to redirect input, just use Python's subprocess to send stdin from the file! Much simpler and you end up with no substitution.
I'll also recommend that you use subprocess.check_output() instead of Popen(). It's much simpler to use and it's a lot more robust, since it will check that the command exited successfully (with a zero exit status.)
Putting it all together (so far), you get:
with open(path) as inputfile:
    holder = subprocess.check_output(
        r"""awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}'""",
        shell=True,
        stdin=inputfile)
But here you can go one step further: you don't really need a shell anymore, since it's only being used to split the command line into two arguments. Just do this split in Python (it's almost always possible and easy to do, and it's a lot more robust, since you don't have to deal with the shell's word splitting!)
with open(path) as inputfile:
    holder = subprocess.check_output(
        ['awk', r'/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}'],
        stdin=inputfile)
The second string in the list is still a raw string, since you want to preserve the backslash escapes.
I could go into how you can do this without using printf() in awk, using print instead, which should get rid of both \ns and %s, but instead I'll tell you that it's much easier to do what you're doing in Python directly!
In fact, everything that awk (or sed, tr, cut, etc.) can do, Python can do better (or, at least, in a more readable and maintainable way.)
In the case of your particular code:
with open(path) as inputfile:
    for line in inputfile:
        line = line.rstrip('\n')
        if line.startswith('>'):
            # Insert a newline before this header (it also
            # terminates the previous sequence line).
            print()
        print(line, end='')
        if line.startswith('>'):
            # Also insert a newline after the header.
            print()
    # And a final newline at the end.
    print()
Isn't this better?
And you can put this into a function, into a module, and reuse it anywhere you'd like. It's easy to store the result in a string, save it into a variable if you like, much more flexible...
Anyways, if you still want to stick to shelling out, see my previous code, I think that's the best you can do while still shelling out, without significantly changing the external command.

sed find and replace in linux [duplicate]

I would like to update a large number of C++ source files with an extra include directive before any existing #includes. For this sort of task, I normally use a small bash script with sed to re-write the file.
How do I get sed to replace just the first occurrence of a string in a file rather than replacing every occurrence?
If I use
sed s/#include/#include "newfile.h"\n#include/
it replaces all #includes.
Alternative suggestions to achieve the same thing are also welcome.
A sed script that will only replace the first occurrence of "Apple" by "Banana"
Example
Input:    Output:
Apple     Banana
Apple     Apple
Orange    Orange
Apple     Apple
This is the simple script (Editor's note: works with GNU sed only):
sed '0,/Apple/{s/Apple/Banana/}' input_filename
The first two parameters, 0 and /Apple/, are the range specifier. The s/Apple/Banana/ is what is executed within that range. So in this case: "within the range of the beginning (0) up to the first instance of Apple, replace Apple with Banana." Only the first Apple will be replaced.
Background: In traditional sed the range specifier is also "begin here" and "end here" (inclusive). However, the lowest "begin" is the first line (line 1), and if the "end here" is a regex, it is only attempted against lines after the "begin" line, so the earliest possible end is line 2. Since the range is inclusive, the smallest possible range is two lines, and the smallest starting range covers both lines 1 and 2 (i.e. if there's an occurrence on line 1, occurrences on line 2 will also be changed, which is not desired in this case). GNU sed adds its own extension of allowing the "pseudo" line 0 as the start, so that the end of the range can be line 1, giving a range of "only the first line" if the regex matches the first line.
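To make the difference concrete, here is a quick illustration with GNU sed and a two-line input:
$ printf 'Apple\nApple\n' | sed '1,/Apple/s/Apple/Banana/'
Banana
Banana
$ printf 'Apple\nApple\n' | sed '0,/Apple/s/Apple/Banana/'
Banana
Apple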
Or a simplified version (an empty RE like // means to re-use the one specified before it, so this is equivalent):
sed '0,/Apple/{s//Banana/}' input_filename
And the curly braces are optional for the s command, so this is also equivalent:
sed '0,/Apple/s//Banana/' input_filename
All of these work on GNU sed only.
You can also install GNU sed on OS X using Homebrew: brew install gnu-sed.
# sed script to change "foo" to "bar" only on the first occurrence
1{x;s/^/first/;x;}
1,/foo/{x;/first/s///;x;s/foo/bar/;}
#---end of script---
Or, if you prefer (Editor's note: works with GNU sed only):
sed '0,/foo/s//bar/' file
Source
An overview of the many helpful existing answers, complemented with explanations:
The examples here use a simplified use case: replace the word 'foo' with 'bar' in the first matching line only.
Due to use of ANSI C-quoted strings ($'...') to provide the sample input lines, bash, ksh, or zsh is assumed as the shell.
GNU sed only:
Ben Hoffstein's answer shows us that GNU sed provides an extension to the POSIX specification for sed that allows the following 2-address form: 0,/re/ (re represents an arbitrary regular expression here).
0,/re/ allows the regex to match on the very first line also. In other words: such an address will create a range from the 1st line up to and including the line that matches re - whether re occurs on the 1st line or on any subsequent line.
Contrast this with the POSIX-compliant form 1,/re/, which creates a range from the 1st line up to and including the line that matches re on a subsequent line; in other words: this will not detect the first occurrence of an re match if it happens to occur on the 1st line, and it also prevents the use of shorthand // for reuse of the most recently used regex (see next point).[1]
If you combine a 0,/re/ address with an s/.../.../ (substitution) call that uses the same regular expression, your command will effectively only perform the substitution on the first line that matches re.
sed provides a convenient shortcut for reusing the most recently applied regular expression: an empty delimiter pair, //.
$ sed '0,/foo/ s//bar/' <<<$'1st foo\nUnrelated\n2nd foo\n3rd foo'
1st bar # only 1st match of 'foo' replaced
Unrelated
2nd foo
3rd foo
A POSIX-features-only sed such as BSD (macOS) sed (will also work with GNU sed):
Since 0,/re/ cannot be used and the form 1,/re/ will not detect re if it happens to occur on the very first line (see above), special handling for the 1st line is required.
MikhailVS's answer mentions the technique, put into a concrete example here:
$ sed -e '1 s/foo/bar/; t' -e '1,// s//bar/' <<<$'1st foo\nUnrelated\n2nd foo\n3rd foo'
1st bar # only 1st match of 'foo' replaced
Unrelated
2nd foo
3rd foo
Note:
The empty regex // shortcut is employed twice here: once for the endpoint of the range, and once in the s call; in both cases, regex foo is implicitly reused, allowing us not to have to duplicate it, which makes both for shorter and more maintainable code.
POSIX sed needs actual newlines after certain functions, such as after the name of a label or even its omission, as is the case with t here; strategically splitting the script into multiple -e options is an alternative to using actual newlines: end each -e script chunk where a newline would normally need to go.
1 s/foo/bar/ replaces foo on the 1st line only, if found there.
If so, t branches to the end of the script (skips remaining commands on the line). (The t function branches to a label only if the most recent s call performed an actual substitution; in the absence of a label, as is the case here, the end of the script is branched to).
When that happens, range address 1,//, which normally finds the first occurrence starting from line 2, will not match, and the range will not be processed, because by the time the address is evaluated the current line is already line 2.
Conversely, if there's no match on the 1st line, 1,// will be entered, and will find the true first match.
The net effect is the same as with GNU sed's 0,/re/: only the first occurrence is replaced, whether it occurs on the 1st line or any other.
NON-range approaches
potong's answer demonstrates loop techniques that bypass the need for a range; since he uses GNU sed syntax, here are the POSIX-compliant equivalents:
Loop technique 1: On first match, perform the substitution, then enter a loop that simply prints the remaining lines as-is:
$ sed -e '/foo/ {s//bar/; ' -e ':a' -e '$!{n;ba' -e '};}' <<<$'1st foo\nUnrelated\n2nd foo\n3rd foo'
1st bar
Unrelated
2nd foo
3rd foo
Loop technique 2, for smallish files only: read the entire input into memory, then perform a single substitution on it.
$ sed -e ':a' -e '$!{N;ba' -e '}; s/foo/bar/' <<<$'1st foo\nUnrelated\n2nd foo\n3rd foo'
1st bar
Unrelated
2nd foo
3rd foo
[1] 1.61803 provides examples of what happens with 1,/re/, with and without a subsequent s//:
sed '1,/foo/ s/foo/bar/' <<<$'1foo\n2foo' yields $'1bar\n2bar'; i.e., both lines were updated, because line number 1 matches the 1st line, and regex /foo/ - the end of the range - is then only looked for starting on the next line. Therefore, both lines are selected in this case, and the s/foo/bar/ substitution is performed on both of them.
sed '1,/foo/ s//bar/' <<<$'1foo\n2foo\n3foo' fails: with sed: first RE may not be empty (BSD/macOS) and sed: -e expression #1, char 0: no previous regular expression (GNU), because, at the time the 1st line is being processed (due to line number 1 starting the range), no regex has been applied yet, so // doesn't refer to anything.
With the exception of GNU sed's special 0,/re/ syntax, any range that starts with a line number effectively precludes use of //.
sed '0,/pattern/s/pattern/replacement/' filename
This worked for me.
example
sed '0,/<Menu>/s/<Menu>/<Menu><Menu>Sub menu<\/Menu>/' try.txt > abc.txt
Editor's note: both work with GNU sed only.
You could use awk to do something similar:
awk '/#include/ && !done { print "#include \"newfile.h\""; done=1;}; 1;' file.c
Explanation:
/#include/ && !done
Runs the action statement between {} when the line matches "#include" and we haven't already processed it.
{print "#include \"newfile.h\""; done=1;}
This prints #include "newfile.h" (we need to escape the quotes). Then we set the done variable to 1, so we don't add more includes.
1;
This means "print out the line" - an empty action defaults to print $0, which prints out the whole line. A one liner and easier to understand than sed IMO :-)
There is quite a comprehensive collection of answers in the linuxtopia sed FAQ. It also highlights that some answers people provided won't work with non-GNU versions of sed, e.g.
sed '0,/RE/s//to_that/' file
in non-GNU version will have to be
sed -e '1s/RE/to_that/;t' -e '1,/RE/s//to_that/'
However, this version won't work with GNU sed.
Here's a version that works with both:
-e '/RE/{s//to_that/;:a' -e '$!N;$!ba' -e '}'
ex:
sed -e '/Apple/{s//Banana/;:a' -e '$!N;$!ba' -e '}' filename
With GNU sed's -z option you could process the whole file as if it was only one line. That way a s/…/…/ would only replace the first match in the whole file. Remember: s/…/…/ only replaces the first match in each line, but with the -z option sed treats the whole file as a single line.
sed -z 's/#include/#include "newfile.h"\n#include/'
In the general case you have to rewrite your sed expression since the pattern space now holds the whole file instead of just one line. Some examples:
s/text.*// can be rewritten as s/text[^\n]*//. [^\n] matches everything except the newline character. [^\n]* will match all symbols after text until a newline is reached.
s/^text// can be rewritten as s/(^|\n)text//.
s/text$// can be rewritten as s/text(\n|$)//.
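A quick sanity check of the -z behavior (GNU sed; note the output record is NUL-terminated rather than newline-terminated):
$ printf '#include <a.h>\n#include <b.h>\n' | sed -z 's/#include/#include "newfile.h"\n#include/'
#include "newfile.h"
#include <a.h>
#include <b.h>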
#!/bin/sed -f
1,/^#include/ {
/^#include/i\
#include "newfile.h"
}
How this script works: For lines between 1 and the first #include (after line 1), if the line starts with #include, then prepend the specified line.
However, if the first #include is in line 1, then both line 1 and the next subsequent #include will have the line prepended. If you are using GNU sed, it has an extension where 0,/^#include/ (instead of 1,) will do the right thing.
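With that extension, the script would become (a sketch of the GNU-only variant):
#!/bin/sed -f
0,/^#include/ {
/^#include/i\
#include "newfile.h"
}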
Just add the number of the occurrence at the end:
sed 's/#include/#include "newfile.h"\n#include/1'
Editor's note: the numeric flag counts occurrences within each line, so this replaces the first occurrence on every matching line, not just the first one in the file (and 1 is the default behavior anyway).
A possible solution:
/#include/!{p;d;}
i\
#include "newfile.h"
:a
n
ba
Explanation:
read lines until we find the #include, print these lines then start new cycle
insert the new include line
enter a loop that just reads lines (by default sed will also print these lines), we won't get back to the first part of the script from here
I know this is an old post but I had a solution that I used to use:
grep -E -m 1 -n 'old' file | sed 's/:.*$//' - | sed 's/$/s\/old\/new\//' - | sed -f - file
Basically, use grep to find the first occurrence and stop there, additionally printing the line number, i.e. 5:line. Pipe that into sed and remove the : and anything after it, so you are just left with a line number. Pipe that into another sed, which appends s/.*/replace/ to the number, resulting in a one-line script that is piped into the last sed to run as a script on the file.
So if regex = #include and replace = blah, and the first occurrence grep finds is on line 5, then the data piped to the last sed would be 5s/.*/blah/.
Works even if first occurrence is on the first line.
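A concrete run (file name and contents assumed), replacing the first Apple with Banana:
$ cat fruits.txt
Orange
Apple
Apple
$ grep -E -m 1 -n 'Apple' fruits.txt | sed 's/:.*$//' - | sed 's/$/s\/Apple\/Banana\//' - | sed -f - fruits.txt
Orange
Banana
Apple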
I would do this with an awk script:
BEGIN {i=0}
(i==0) && /#include/ {print "#include \"newfile.h\""; i=1}
{print $0}
END {}
then run it with awk:
awk -f awkscript headerfile.h > headerfilenew.h
Might be sloppy; I'm new to this.
As an alternative suggestion you may want to look at the ed command.
man 1 ed
teststr='
#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
'
# for in-place file editing use "ed -s file" and replace ",p" with "w"
# cf. http://wiki.bash-hackers.org/howto/edit-ed
cat <<-'EOF' | sed -e 's/^ *//' -e 's/ *$//' | ed -s <(echo "$teststr")
H
/# *include/i
#include "newfile.h"
.
,p
q
EOF
I finally got this to work in a Bash script used to insert a unique timestamp in each item in an RSS feed:
sed "1,/====RSSpermalink====/s/====RSSpermalink====/${nowms}/" \
production-feed2.xml.tmp2 > production-feed2.xml.tmp.$counter
It changes the first occurrence only.
${nowms} is the time in milliseconds set by a Perl script, $counter is a counter used for loop control within the script, \ allows the command to be continued on the next line.
The file is read in and stdout is redirected to a work file.
The way I understand it, 1,/====RSSpermalink====/ tells sed when to stop by setting a range limitation, and then s/====RSSpermalink====/${nowms}/ is the familiar sed command to replace the first string with the second.
In my case I put the command in double quotation marks because I am using it in a Bash script with variables.
Using FreeBSD ed, and avoiding ed's "no match" error in case there is no include statement in a file to be processed:
teststr='
#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
'
# using FreeBSD ed
# to avoid ed's "no match" error, see
# http://codesnippets.joyent.com/posts/show/11917
cat <<-'EOF' | sed -e 's/^ *//' -e 's/ *$//' | ed -s <(echo "$teststr")
H
,g/# *include/u\
u\
i\
#include "newfile.h"\
.
,p
q
EOF
This might work for you (GNU sed):
sed -si '/#include/{s//& "newfile.h"\n&/;:a;$!{n;ba}}' file1 file2 file....
or if memory is not a problem:
sed -si ':a;$!{N;ba};s/#include/& "newfile.h"\n&/' file1 file2 file...
If anyone came here to replace a character only at its first occurrence on every line (like myself), use this:
sed '/old/s/old/new/1' file
-bash-4.2$ cat file
123a456a789a
12a34a56
a12
-bash-4.2$ sed '/a/s/a/b/1' file
123b456a789a
12b34a56
b12
By changing 1 to 2 for example, you can replace all the second a's only instead.
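For example, with the same file:
$ sed '/a/s/a/b/2' file
123a456b789a
12a34b56
a12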
The use case can perhaps be that your occurrences are spread throughout your file, but you know your only concern is in the first 10, 20, or 100 lines.
Then simply addressing those lines fixes the issue - even though the wording of the OP regards the first occurrence only.
sed '1,10s/#include/#include "newfile.h"\n#include/'
The following command removes the first occurrence of a string within a file, and it removes the resulting empty line too. It is presented on an XML file, but it would work with any file.
Useful if you work with XML files and you want to remove a tag. In this example it removes the first occurrence of the "isTag" tag.
Command:
sed -e '0,/<isTag>false<\/isTag>/{s/<isTag>false<\/isTag>//}' -e 's/ *$//' -e '/^$/d' source.txt > output.txt
Source file (source.txt)
<xml>
<testdata>
<canUseUpdate>true</canUseUpdate>
<isTag>false</isTag>
<moduleLocations>
<module>esa_jee6</module>
<isTag>false</isTag>
</moduleLocations>
<node>
<isTag>false</isTag>
</node>
</testdata>
</xml>
Result file (output.txt)
<xml>
<testdata>
<canUseUpdate>true</canUseUpdate>
<moduleLocations>
<module>esa_jee6</module>
<isTag>false</isTag>
</moduleLocations>
<node>
<isTag>false</isTag>
</node>
</testdata>
</xml>
PS: it didn't work for me on Solaris SunOS 5.10 (quite old), but it works on Linux 2.6 with sed version 4.1.5.
Nothing new but perhaps a little more concrete answer: sed -rn '0,/foo(bar).*/ s%%\1%p'
Example: xwininfo -name unity-launcher produces output like:
xwininfo: Window id: 0x2200003 "unity-launcher"
Absolute upper-left X: -2980
Absolute upper-left Y: -198
Relative upper-left X: 0
Relative upper-left Y: 0
Width: 2880
Height: 98
Depth: 24
Visual: 0x21
Visual Class: TrueColor
Border width: 0
Class: InputOutput
Colormap: 0x20 (installed)
Bit Gravity State: ForgetGravity
Window Gravity State: NorthWestGravity
Backing Store State: NotUseful
Save Under State: no
Map State: IsViewable
Override Redirect State: no
Corners: +-2980+-198 -2980+-198 -2980-1900 +-2980-1900
-geometry 2880x98+-2980+-198
Extracting window ID with xwininfo -name unity-launcher|sed -rn '0,/^xwininfo: Window id: (0x[0-9a-fA-F]+).*/ s%%\1%p' produces:
0x2200003
POSIXly (valid in any sed), using only one regex, and needing memory for only one line (as usual):
sed '/\(#include\).*/!b;//{h;s//\1 "newfile.h"/;G};:1;n;b1'
Explained:
sed '
/\(#include\).*/!b    # Only one regex used. On lines not matching
                      # the text `#include` **yet**, branch to end,
                      # causing the default print. Re-start.
//{                   # On the first line matching the previous regex:
h                     # hold a copy of the line,
s//\1 "newfile.h"/    # replace the line with `#include "newfile.h"`,
G                     # append a newline plus the held original line.
}                     # end of replacement.
:1                    # Once **one** replacement got done (the first match),
n                     # loop continually, reading a line each time
b1                    # and printing it by default.
'                     # end of sed script.
A possible solution here might be to tell the compiler to include the header without it being mentioned in the source files. In GCC there are these options:
-include file
Process file as if "#include "file"" appeared as the first line of
the primary source file. However, the first directory searched for
file is the preprocessor's working directory instead of the
directory containing the main source file. If not found there, it
is searched for in the remainder of the "#include "..."" search
chain as normal.
If multiple -include options are given, the files are included in
the order they appear on the command line.
-imacros file
Exactly like -include, except that any output produced by scanning
file is thrown away. Macros it defines remain defined. This
allows you to acquire all the macros from a header without also
processing its declarations.
All files specified by -imacros are processed before all files
specified by -include.
Microsoft's compiler has the /FI (forced include) option.
This feature can be handy for some common header, like platform configuration. The Linux kernel's Makefile uses -include for this.
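Usage would then look something like this (file names are illustrative):
$ gcc -include newfile.h -c main.c        # GCC: forced include
$ cl /FInewfile.h /c main.cpp             # MSVC: forced include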
I needed a solution that would work both on GNU and BSD, and I also knew that the first line would never be the one I'd need to update:
sed -e "1,/pattern/s/pattern/replacement/"
Trying the // feature to avoid repeating the pattern did not work for me, hence the need to repeat it.
I will make a suggestion that is not exactly what the original question asks for, but is for those who want to replace, say, specifically the second occurrence of a match, or any other specifically enumerated regular expression match. Use a Python script and a for loop, and call it from a bash script if needed. Here's what it looked like for me, where I was replacing specific lines containing the string --project:
import re

def replace_models(file_path, pixel_model, obj_model):
    # find your file's --project matches
    pattern = re.compile(r'--project.*')
    new_file = ""
    with open(file_path, 'r') as f:
        match = 1
        for line in f:
            # Remove the line ending before we do the replacement
            line = line.strip()
            if match == 1:
                # replace the first --project line match with pixel
                result = re.sub(pattern, "--project='" + pixel_model + "'", line)
            elif match == 2:
                # replace the second --project line match with object
                result = re.sub(pattern, "--project='" + obj_model + "'", line)
            else:
                result = line
            # Check that a substitution was actually made
            if result != line:
                # Add a backslash to the replaced line
                result += " \\"
                print("\nReplaced ", line, " with ", result)
                # Increment the number of matches found
                match += 1
            # Add the potentially modified line to our new file
            new_file = new_file + result + "\n"
    # save the output back to the same file
    with open(file_path, "w") as fout:
        fout.write(new_file)
sed -e 's/pattern/REPLACEMENT/1' <INPUTFILE
Editor's note: here too the numeric flag is per-line, so this limits the replacement to the first occurrence on each line, not in the whole file.

SED or AWK script to replace multiple text

I am trying to do the following with a sed script, but it's taking too much time. It looks like I'm doing something wrong.
Scenario:
I have student records (> 1 million) in students.txt.
In this file, the first 10 characters of each line are the student ID, the next 10 characters are the contact number, and so on.
students.txt
10000000019234567890XXX...
10000000029325788532YYY...
.
.
.
10010000008766443367ZZZZ...
I have another file (encrypted_contact_numbers.txt) which has all the phone numbers and their corresponding encrypted phone numbers, as below:
encrypted_contact_numbers.txt
Phone_Number, Encrypted_Phone_Number
9234567890, 1122334455
9325788532, 4466742178
.
.
.
8766443367, 2964267747
I wanted to replace all the contact numbers (11th–20th position) in students.txt with the corresponding encrypted phone number from encrypted_contact_numbers.txt.
Expected Output:
10000000011122334455XXX...
10000000024466742178YYY...
.
.
.
10010000002964267747ZZZZ...
I am using the below sed script to do this operation. It is working fine but too slowly.
Approach 1:
while read -r pattern replacement; do
sed -i "s/$pattern/$replacement/" students.txt
done < encrypted_contact_numbers.txt
Approach 2:
sed 's| *\([^ ]*\) *\([^ ]*\).*|s/\1/\2/g|' <encrypted_contact_numbers.txt |
sed -f- students.txt > outfile.txt
Is there any way to process this huge file quickly?
Update: 9-Feb-2018
The solutions given in AWK and Perl work fine if the phone number is in the specified position (columns 11-20). If I try to do a global replacement, it takes too much time to process. Is there a better way to achieve this?
students.txt: updated version
10000000019234567890XXX...9234567890
10000000029325788532YYY...
.
.
.
10010000008766443367ZZZZ9234567890...
awk to the rescue!
if you have enough memory to keep the phone_map file in memory
awk -F', *' 'NR==FNR {a[$1]=$2; next}
             {key = substr($0,11,10)}
             key in a {$0 = substr($0,1,10) a[key] substr($0,21)}
             1' phone_map data_file
Not tested, since the actual data file is missing; it should speed things up, since both files will be scanned only once.
The following awk may help you with the same:
awk '
FNR==NR{
  sub(/ +$/,"");
  a[$1]=$2;
  next
}
(substr($0,11,10) in a){
  print substr($0,1,10) a[substr($0,11,10)] substr($0,21)
}
' FS=", " encrypted_contact_numbers.txt students.txt
Output will be as follows:
10000000011122334455XXX...
10000000024466742178YYY...
What question would be complete without a Perl answer? :) Adapted from various answers in the Perl Monks' discussion of this topic.
Edited source
Edited per @Borodin's comment, with some inline comments for explanation, in hopes that they are helpful.
#!/usr/bin/env perl
use strict; # keep out of trouble
use warnings; # ditto
my %numbers; # map from real phone number to encrypted phone number
open(my $enc, '<', 'encrypted_contact_numbers.txt') or die("Can't open map file");
while (<$enc>) {
    s{\s+}{}g;    # remove all whitespace
    my ($regular, $encrypted) = split ',';
    $numbers{$regular} = $encrypted;
}
# Make a regex that will match any of the numbers of interest
my $number_pattern = join '|', map quotemeta, keys %numbers;
$number_pattern = qr{$number_pattern}o;
# Compile the regex - we no longer need the string representation
while (<>) {    # process each line of the input
    next unless length > 1;    # skip empty lines (drop this if your input has none)
    substr($_, 10, 10) =~ s{($number_pattern)}{$numbers{$1}}e;
        # substr: replace only in columns 11-20
        # s{}{}e: the 'e' means the replacement text is Perl code
    print;    # output the modified line
}
Test
Tested on Perl v5.22.4.
encrypted_contact_numbers.txt:
9234567890, 1122334455
9325788532, 4466742178
students.txt:
aaaaaaaaaa9234567890XXX...
bbbbbbbbbb9325788532YYY...
cccccccccc8766443367ZZZZ...
dddddddddd5432112345Nonexistent phone number
(modified for ease of reading)
Output of ./process.pl students.txt:
aaaaaaaaaa1122334455XXX...
bbbbbbbbbb4466742178YYY...
cccccccccc8766443367ZZZZ...
dddddddddd5432112345Nonexistent phone number
The change has been made on the first two lines, but not the last two, which is correct for this input.

Merge multiple lines

I have a file which contains multiple lines like this:
s10123-yyy.bkp.abc01.zone,Windows File =
System,N/A,defaultBackupSet,default,272188(* )(S =
),Completed,INCR,Mixed,02/28/2015 19:00:27,02/28/2015 =
19:03:06,02/28/2015 20:32:11,02/28/2015 =
20:32:09,12.08,53.93%,0.18,98.52%,0%,0.12,1:28:23,N/A,8.203,N/A,67303,0,8=
3,"Disk_Library2, Disk_Library6,",N/A,N/A,=0A=
Which I need to make into one line like this:
s10123-yyy.bkp.abc01.zone,Windows File System,N/A,defaultBackupSet,default,272188(* )(S ),Completed,INCR,Mixed,02/28/2015 19:00:27,02/28/2015 19:03:06,02/28/2015 20:32:11,02/28/2015 20:32:09,12.08,53.93%,0.18,98.52%,0%,0.12,1:28:23,N/A,8.203,N/A,67303,0,83,"Disk_Library2, Disk_Library6,",N/A,N/A
If I do it manually, I highlight each "=" and press the delete key twice to join the lines and get the desired result.
The last 5 characters, ",=0A=", need to be deleted too.
Awk, Sed, Bash, Perl or Python script would be preferred.
Appreciate your help.
Thanks!
This is most simple with awk[1]:
awk -v RS=',=0A=\n' -F '=\n' -v OFS= '{ $1 = $1 } 1' filename
The trick is to
use ,=0A=\n as record separator RS
use =\n as the field separator,
have an empty output field separator OFS, so that the fields are printed directly one after the other, and
force the rebuilding of the output record with $1 = $1 before printing it.
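A reduced example of the same idea, with GNU awk or mawk and a made-up two-line record:
$ printf 'a,b =\nc,=0A=\n' | awk -v RS=',=0A=\n' -F '=\n' -v OFS= '{ $1 = $1 } 1'
a,b c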
Addendum: Obligatory crazy sed solution:
sed -n '/,=0A=$/ { s///; H; s/.*//; x; s/\n//g; p; d; }; /=$/ { s///; H; }' filename
I don't recommend that you use that; I just like writing things in sed that shouldn't be written in sed. It's fun!
[1] Tested with GNU awk and mawk, which are the most common ones. Multi-character RS is not strictly required by POSIX, though, so more esoteric awks may reject this. Thanks to @TomFenech for pointing this out.
Through Perl.
perl -0777pe 's/=\n|,=[^,]*$//sg' file
In Python, create a list and then use the extend method to add the lines to the list, versus append.
This is a Perl solution:
perl -l -0777 -pwe"s/,?=(?:0A=)?\n//g" file
-0777 disables input record separator, making the file into one single line.
-p reads input from file and prints it back to standard output.
-l (before -0) adds a newline to your print statements.
The regex s/,?=(?:0A=)?\n//g finds an optional comma, followed by =, followed by optional 0A= string, and ending with newline.
I don't know if all your files are just one of these long lines. If it is multiple such lines, you should set the input record separator to =0A=\n, most likely, chomp the lines and delete =\n.
sed
sed '
:a
/,=0A=$/ {s///; s/\n//g}   # end of record: remove the marker and the newlines
/=$/     {s///; N; ba}     # line continuation: remove the "=", append
                           # the next line, goto a
' file

Sed script to edit csv file Or Python

In our project we need to import the csv file to postgres.
There are multiple types of files, meaning the number of columns changes: some files come with fewer columns and some with all of them.
We need a fast way to import these files into Postgres. I want to use COPY FROM of Postgres, since the speed requirements of the processing are very high (almost 150 files per minute, with a 20K file size each).
Since the number of columns in a file is not fixed, I need to pre-process the file before I pass it to the Postgres procedure. The pre-processing simply adds extra commas in the CSV for the columns that are not present in the file.
There are two options for me to pre-process the file - use Python or use sed.
My first question is, what would be the fastest way to pre-process the file?
My second question is, if I use sed, how would I insert a comma after, say, the 4th and 5th fields?
e.g. if file has entries like
1,23,56,we,89,2009-12-06
and I need to edit the file with final output like:
1,23,56,we,,89,,2009-12-06
Are you aware of the fact that COPY FROM lets you specify which columns (and in which order) are to be imported?
COPY tablename ( column1, column2, ... ) FROM ...
Specifying directly, at the Postgres level, which columns to import and in what order, will typically be the fastest and most efficient import method.
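For instance, if a given file contains only six of a table's eight columns, you can name just those six and skip the pre-processing entirely (table, column, and file names here are hypothetical):
$ psql mydb -c "\copy mytable (c1, c2, c3, c4, c6, c8) from 'short.csv' csv"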
This having been said, there is a much simpler (and portable) way of using sed than what has been presented in other posts to replace the nth occurrence, e.g. replacing the 4th and 5th occurrences of a comma with double commas:
echo '1,23,56,we,89,2009-12-06' | sed -e 's/,/,,/5;s/,/,,/4'
produces:
1,23,56,we,,89,,2009-12-06
Notice that I replaced the rightmost fields (#5) first.
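Applying them in ascending order instead shows why: once the 4th comma is doubled, the original 5th comma shifts, and the wrong occurrence gets doubled:
$ echo '1,23,56,we,89,2009-12-06' | sed -e 's/,/,,/4;s/,/,,/5'
1,23,56,we,,,89,2009-12-06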
I see that you have also tagged your question as perl-related, although you make no explicit reference to perl in the body of the question; here would be one possible implementation which gives you the flexibility of also reordering or otherwise processing fields:
echo '1,23,56,we,89,2009-12-06' |
perl -F/,/ -nae 'print "$F[0],$F[1],$F[2],$F[3],,$F[4],,$F[5]"'
also produces:
1,23,56,we,,89,,2009-12-06
Very similarly with awk, for the record:
echo '1,23,56,we,89,2009-12-06' |
awk -F, '{print $1","$2","$3","$4",,"$5",,"$6}'
I will leave Python to someone else. :)
Small note on the Perl example: I am using the -a and -F options to autosplit so I have a shorter command string; however, this leaves the newline embedded in the last field ($F[5]) which is fine as long as that field doesn't have to be reordered somewhere else. Should that situation arise, slightly more typing would be needed in order to zap the newline via chomp, then split by hand and finally print our own newline character \n (the awk example above does not have this problem):
perl -ne 'chomp;@F=split/,/;print "$F[0],$F[1],$F[2],$F[3],,$F[4],,$F[5]\n"'
EDIT (an idea inspired by Vivin):
COMMAS_TO_DOUBLE="1 4 5"
echo '1,23,56,we,89,2009-12-06' |
sed -e `for f in $COMMAS_TO_DOUBLE ; do echo "s/,/,,/$f" ; done |
sort -t/ -k4,4nr | paste -s -d ';'`
1,,23,56,we,,89,,2009-12-06
Sorry, couldn't resist it. :)
To answer your first question: sed has the least overhead, but might be painful. awk is a little better (it's more powerful). Perl or Python have more overhead, but are easier to work with (regarding Perl, that's maybe a little subjective; personally, I'd use Perl).
As far as the second question, I think the problem might be a little more complex. For example, don't you need to examine the string to figure out which fields are actually missing? Or is it guaranteed that it will always be the 4th and 5th? If it's the first case, it would be way easier to do this in Python or Perl rather than in sed. Otherwise:
echo "1,23,56,we,89,2009-12-06" | sed -e 's/\([^,]\+\),\([^,]\+\),\([^,]\+\),\([^,]\+\),\([^,]\+\),/\1,\2,\3,\4,,\5,,/'
or (easier on the eyes):
echo "1,23,56,we,89,2009-12-06" | sed -e 's/\(\([^,]\+,\)\{3\}\)\([^,]\+\),\([^,]\+\),/\1,\3,,\4,,/'
This will add a comma after the 5th and 4th columns assuming there are no other commas in the text.
Or you can use two seds for something that's a little less ugly (only slightly, though):
echo "1,23,56,we,89,2009-12-06" | sed -e 's/\(\([^,]*,\)\{4\}\)/\1,/' | sed -e 's/\(\([^,]*,\)\{6\}\)/\1,/'
@OP, you are processing a CSV file, which has distinct fields and delimiters. Use a tool that can split on delimiters and give you the fields to work with easily. sed is not one of them; although it can be done, as some of the answers suggest, you will get sed regexes that are hard to read when it gets complicated. Use tools like awk/Python/Perl, where you can work with fields and delimiters easily; best of all, modules specifically tailored to processing CSV are available. For your example, a simple Python approach (without the csv module, which ideally you should try to use):
for line in open("file"):
    line = line.rstrip()       # strip newlines
    sline = line.split(",")
    if len(sline) < 8:         # you want exactly 8 fields
        sline.insert(4, "")
        sline.insert(6, "")
    line = ','.join(sline)
    print(line)
output
$ more file
1,23,56,we,89,2009-12-06
$ ./python.py
1,23,56,we,,89,,2009-12-06
sed -E 's/^([^,]*,){4}/&,/' <original.csv >output.csv
will add a comma after the 4th comma-separated field (by matching 4 repetitions of <anything>, from the start of the line, and then adding a comma after that; -E enables extended regular expressions, without which the grouping and repetition operators would need backslashes). Note that there is a catch; make sure none of these values are quoted strings with commas in them.
You could chain multiple replacements via pipes if necessary, or modify the regex to add in any needed commas at the same time (though that gets more complex; you'd need to use subgroup captures in your replacement text).
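For example, chaining two passes through a pipe:
$ echo '1,23,56,we,89,2009-12-06' | sed -E 's/^([^,]*,){4}/&,/' | sed -E 's/^([^,]*,){6}/&,/'
1,23,56,we,,89,,2009-12-06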
Don't know regarding speed, but here is a sed expression that should do the job:
sed -i 's/\(\([^,]*,\)\{4\}\)/\1,/' file_name
Just replace 4 with the desired number of columns.
Depending on your requirements, consider using ETL software for this and future tasks. Tools like Pentaho and Talend offer you a great deal of flexibility and you don't have to write a single line of code.
