In our project we need to import CSV files into Postgres.
There are multiple types of files, meaning the number of columns varies: some files have only a few of the columns and some have all of them.
We need a fast way to import these files into Postgres. I want to use Postgres's COPY FROM, since the speed requirements of the processing are very high (almost 150 files per minute, each about 20K in size).
Since the number of columns in a file is not fixed, I need to pre-process the file before I pass it to the Postgres procedure. The pre-processing simply adds extra commas to the CSV for the columns that are not present in the file.
There are two options for me to pre-process the file: use Python or use sed.
My first question is: what would be the fastest way to pre-process the file?
My second question is: if I use sed, how would I insert a comma after, say, the 4th and 5th comma-separated fields?
E.g., if the file has entries like
1,23,56,we,89,2009-12-06
and I need to edit the file so that the final output looks like:
1,23,56,we,,89,,2009-12-06
Are you aware that COPY FROM lets you specify which columns are to be imported, as well as in which order?
COPY tablename ( column1, column2, ... ) FROM ...
Specifying directly, at the Postgres level, which columns to import and in what order will typically be the fastest and most efficient import method.
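To make that concrete, here is a minimal sketch of driving COPY from Python; the table name, the column names, and the psycopg2 dependency are assumptions rather than anything from your setup:

import psycopg2  # assumed driver; any client that can stream COPY works

conn = psycopg2.connect("dbname=mydb")
with conn, conn.cursor() as cur, open("data.csv") as f:
    # Name only the columns actually present in this file variant;
    # the remaining table columns get their defaults/NULL.
    cur.copy_expert(
        "COPY mytable (col1, col2, col3, col4, col6, col8) "
        "FROM STDIN WITH (FORMAT csv)",
        f,
    )

If you know which columns each file variant contains, naming exactly those columns in the COPY statement can remove the need for any comma-padding in the first place.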
This having been said, there is a much simpler (and portable) way of using sed than what has been presented in other posts to replace an nth occurrence, e.g. to replace the 4th and 5th occurrences of a comma with double commas:
echo '1,23,56,we,89,2009-12-06' | sed -e 's/,/,,/5;s/,/,,/4'
produces:
1,23,56,we,,89,,2009-12-06
Notice that I replaced the rightmost occurrence (#5) first, so that the position of the 4th comma is not shifted by the earlier substitution.
I see that you have also tagged your question as perl-related, although you make no explicit reference to Perl in the body of the question; here is one possible implementation, which also gives you the flexibility of reordering or otherwise processing fields:
echo '1,23,56,we,89,2009-12-06' |
perl -F/,/ -nae 'print "$F[0],$F[1],$F[2],$F[3],,$F[4],,$F[5]"'
also produces:
1,23,56,we,,89,,2009-12-06
Very similarly with awk, for the record:
echo '1,23,56,we,89,2009-12-06' |
awk -F, '{print $1","$2","$3","$4",,"$5",,"$6}'
I will leave Python to someone else. :)
Small note on the Perl example: I am using the -a and -F options to autosplit so I have a shorter command string; however, this leaves the newline embedded in the last field ($F[5]) which is fine as long as that field doesn't have to be reordered somewhere else. Should that situation arise, slightly more typing would be needed in order to zap the newline via chomp, then split by hand and finally print our own newline character \n (the awk example above does not have this problem):
perl -ne 'chomp;@F=split/,/;print "$F[0],$F[1],$F[2],$F[3],,$F[4],,$F[5]\n"'
EDIT (an idea inspired by Vivin):
COMMAS_TO_DOUBLE="1 4 5"
echo '1,23,56,we,89,2009-12-06' |
sed -e `for f in $COMMAS_TO_DOUBLE ; do echo "s/,/,,/$f" ; done |
sort -t/ -k4,4nr | paste -s -d ';'`
1,,23,56,we,,89,,2009-12-06
Sorry, couldn't resist it. :)
To answer your first question: sed would have less overhead, but might be painful. awk would be a little better (it's more powerful). Perl or Python have more overhead, but would be easier to work with (regarding Perl, that's maybe a little subjective). Personally, I'd use Perl.
As for the second question, I think the problem might be a little more complex. For example, don't you need to examine the string to figure out which fields are actually missing? Or is it guaranteed that it will always be the 4th and 5th? If it's the first case, it would be much easier to do this in Python or Perl than in sed. Otherwise:
echo "1,23,56,we,89,2009-12-06" | sed -e 's/\([^,]\+\),\([^,]\+\),\([^,]\+\),\([^,]\+\),\([^,]\+\),/\1,\2,\3,\4,,\5,,/'
or (easier on the eyes):
echo "1,23,56,we,89,2009-12-06" | sed -e 's/\(\([^,]\+,\)\{3\}\)\([^,]\+\),\([^,]\+\),/\1,\3,,\4,,/'
This will add a comma after the 5th and 4th columns assuming there are no other commas in the text.
Or you can use two seds for something that's a little less ugly (only slightly, though):
echo "1,23,56,we,89,2009-12-06" | sed -e 's/\(\([^,]*,\)\{4\}\)/\1,/' | sed -e 's/\(\([^,]*,\)\{6\}\)/\1,/'
@OP, you are processing a CSV file, which has distinct fields and delimiters. Use a tool that can split on delimiters and give you fields to work with easily. sed is not one of them; it can be done, as some of the answers suggest, but you will end up with sed regexes that are hard to read once things get complicated. Use tools like awk/Python/Perl, which work with fields and delimiters easily; best of all, modules specifically tailored to processing CSV are available. For your example, here is a simple Python approach (without the csv module, which ideally you should use):
for line in open("file"):
    line = line.rstrip()        # strip the newline
    sline = line.split(",")
    if len(sline) < 8:          # you want exactly 8 fields
        sline.insert(4, "")
        sline.insert(6, "")
        line = ','.join(sline)
    print(line)
output
$ more file
1,23,56,we,89,2009-12-06
$ ./python.py
1,23,56,we,,89,,2009-12-06
sed -E 's/^([^,]*,){4}/&,/' <original.csv >output.csv
will add a comma after the 4th comma-separated field (by matching 4 repetitions of "anything followed by a comma", and then adding a comma after that; the -E flag is needed because the grouping/repetition syntax here is extended-regex syntax). Note that there is a catch: make sure none of these values are quoted strings with commas in them.
You could chain multiple replacements via pipes if necessary, or modify the regex to add in any needed commas at the same time (though that gets more complex; you'd need to use subgroup captures in your replacement text).
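If quoted fields with embedded commas are a realistic possibility, this is where Python's csv module (mentioned in another answer) earns its keep, since it handles the quoting for you. A minimal sketch; the file names, the target width of 8 fields, and the insertion positions are assumptions based on the example in the question:

import csv

with open("file1.csv", newline="") as src, open("padded.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        if len(row) < 8:        # assumed full width of the table
            row.insert(4, "")   # pad the two missing columns
            row.insert(6, "")
        writer.writerow(row)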
I don't know about speed, but here is a sed expression that should do the job:
sed -i 's/\(\([^,]*,\)\{4\}\)/\1,/' file_name
Just replace 4 with the desired number of columns.
Depending on your requirements, consider using ETL software for this and future tasks. Tools like Pentaho and Talend offer you a great deal of flexibility and you don't have to write a single line of code.
Related
I have a multifasta file that looks like this:
(all sequences are >100 bp, span more than one line, and have the same length)
>Lineage1_samplenameA
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
>Lineage2_samplenameB
AAATTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAG
>Lineage3_samplenameC
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
>Lineage3_samplenameD
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
I need to remove the duplicates BUT keep at least one sequence per lineage. So in the simple example above (notice that samplenameA, C, and D are identical), I would want to remove only samplenameD or samplenameC, but not both of them. In the end I want to keep the same header information as in the original file.
Example output:
>Lineage1_samplenameA
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
>Lineage2_samplenameB
AAATTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAG
>Lineage3_samplenameC
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
I found a way that works to remove just the duplicates, thanks to Pierre Lindenbaum:
sed -e '/^>/s/$/@/' -e 's/^>/#/' file.fasta |\
tr -d '\n' | tr "#" "\n" | tr "@" "\t" |\
sort -u -t $'\t' -f -k 2,2 |\
sed -e 's/^/>/' -e 's/\t/\n/'
Running this on my example above would result in:
>Lineage1_samplenameA
CGCTTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAA
>Lineage2_samplenameB
AAATTCAACGGAATGGATCTACGTTACAGCCTGCATAAAGAAAACGGAGTTGCCGAGGACGAAAGCGACTTTAGGTTCTGTCCGTTGTCTTTGGCGGAAG
-> so losing the Lineage3 sequence
Now I’m just looking for a quick solution to remove duplicates but keep at least one sequence per lineage based on the fasta header.
I’m new to scripting... any ideas in bash/python/R are welcome.
Thanks!!!
In this case I can see two relatively good alternatives: (A) look into existing tools, such as the Biopython library or the FASTX-Toolkit; I think both of them have good commands to do most of the work here, so it may be worthwhile to learn them. Or (B) write your own, in which case you may want to try the following (I'll stick to Python).
Loop over the file, line by line, and add the lineage/sequence data to a dictionary. I suggest using the sequence as the key; this way, you can easily know whether you have already encountered this key.
myfasta = {}
if sequence in myfasta:
    myfasta[sequence].append(lineage_id)
else:
    myfasta[sequence] = [lineage_id]
This way your key (sequence) will hold the list of lineage_ids that have the same sequence. Note that the annoying bits of this solution will be to loop over the file, separate lineage-id from sequence, account for sequences that may extend to multiple lines, etc.
After that, you can loop over the dictionary and write the sequences to a file, keeping one lineage_id per distinct lineage in each list so that no lineage is lost.
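As a rough illustration of that approach, here is a minimal sketch; the file names are placeholders, and it assumes headers of the form Lineage_samplename so the lineage can be taken as the part before the first underscore:

myfasta = {}          # sequence -> list of headers (lineage ids)
header, seq_parts = None, []

def add_record(header, seq_parts):
    seq = "".join(seq_parts)
    myfasta.setdefault(seq, []).append(header)

with open("file.fasta") as fasta:
    for line in fasta:
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:
                add_record(header, seq_parts)
            header, seq_parts = line[1:], []
        else:
            seq_parts.append(line)
    if header is not None:
        add_record(header, seq_parts)

with open("dedup.fasta", "w") as out:
    for seq, headers in myfasta.items():
        kept_lineages = set()
        for h in headers:
            lineage = h.split("_")[0]         # assumed header convention
            if lineage not in kept_lineages:  # keep one entry per lineage
                kept_lineages.add(lineage)
                out.write(">%s\n%s\n" % (h, seq))

On the example above this keeps samplenameA, samplenameB, and samplenameC, and drops only samplenameD.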
I have a gigantic JSON file that was accidentally output without newline characters between the JSON entries, so it is being treated as one giant single line. What I did was try a find-and-replace with sed to insert a newline.
sed 's/{"seq_id"/\n{"seq_id"/g' my_giant_json.json
It doesn't output anything.
However, I know my sed expression works if I operate on just a small part of the file:
head -c 1000000 my_giant_json.json | sed 's/{"seq_id"/\n{"seq_id"/g'
I have also tried using Python with this gnarly one-liner:
'\n{"seq_id'.join(open(json_file,'r').readlines()[0].split('{"seq_id')).lstrip()
But this loads the whole file into memory because of the readlines() call, and I don't know how to iterate through a giant single line of characters in chunks and do a find-and-replace.
Any thoughts?
Perl will let you change the input separator ($/) from newline to another character. You could take advantage of this to get some convenient chunking.
perl -pe'BEGIN{$/="}"}s/^({"seq_id")/\n$1/' my_giant_json.json
That sets the input separator to be "}". Then it looks for chunks that start with {"seq_id" and prefixes them with a newline.
Note that it puts an unnecessary empty line at the beginning. You could complicate the program to eliminate that or just delete it manually after.
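If you would rather stay in Python, a chunked read-and-replace along the lines the question hints at is also workable. Here is a minimal sketch (file names and buffer size are placeholders); it carries a possible partial marker across chunk boundaries so an occurrence split between two reads is not missed:

MARKER = '{"seq_id"'
CHUNK_SIZE = 1 << 20   # 1 MiB per read; tune as needed

def partial_suffix(text, marker):
    # length of the longest suffix of text that is a proper prefix of marker
    for k in range(len(marker) - 1, 0, -1):
        if text.endswith(marker[:k]):
            return k
    return 0

with open("my_giant_json.json") as src, open("fixed.json", "w") as dst:
    tail = ""
    while True:
        chunk = src.read(CHUNK_SIZE)
        if not chunk:
            dst.write(tail)            # flush whatever is left at EOF
            break
        buf = tail + chunk
        keep = partial_suffix(buf, MARKER)   # hold back a possible partial marker
        body, tail = (buf[:-keep], buf[-keep:]) if keep else (buf, "")
        dst.write(body.replace(MARKER, "\n" + MARKER))

Like the Perl one-liner, this also puts a newline before the very first record, so you may want to strip a leading blank line afterwards.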
I have a space-separated file (file1.csv) on which I manually perform 3 UNIX operations, namely:
step1. removing all double quotes (") from each line.
sed 's/"//g' file1.csv > file_tmp1.csv
step2. removing all whitespace at the beginning of any line.
sed 's/^ *//' file_tmp1.csv > file_tmp2.csv
step3. removing all additional whitespace between the text fields of each line.
cat file_tmp2.csv | tr -s " " > file1_processed.csv
So, I wanted to know if there's a better approach to this, ideally a Pythonic one, without much computation time. These 3 steps take about ~5 min (max) when done using the UNIX commands.
Please note that file1.csv is a space-separated file and I want it to stay space-separated.
Also, if your solution involves loading the entire file1.csv into memory, I would ask you to suggest a way to do this in chunks, because the file is way too big (~20 GB or so) to load into memory every time.
Thanks in advance.
An obvious improvement would be to convert the tr step to sed and combine all parts into one job. First, the test data:
$ cat file
"this" "that"
The job:
$ sed 's/"//g;s/^ *//;s/ \+/ /g' file
this that
Here's all of those steps in one awk:
$ awk '{gsub(/\"|^ +/,""); gsub(/ +/," ")}1' file
this that
If you test it, let me know how long it took.
Here's a process which reads one line at a time and performs the substitutions you specified in Python.
with open('file1.csv') as source:
    for line in source:
        print(' '.join(line.replace('"', '').split()))
The default behavior of split() includes trimming any leading (and trailing) whitespace, so we don't specify that explicitly. If you need to keep trailing whitespace, perhaps you need to update your requirements.
Your shell-script attempt, with multiple temporary files and multiple invocations of sed, is not a good example of how to do this in the shell, either.
What is a fast way to:
Replace space with an unused unicode character.
Add spaces in between all characters
I've tried:
$ python3 -c "print (open('test.txt').read().replace(' ', u'\uE000').replace('', ' '))" > test.spaced.txt
But when I tried it on a 6 GB text file with 90 million lines, it was really slow.
Simply reading the file after opening it takes a really long time:
$ time python3 -c "print (open('test.txt').read())"
Assume that my machine has more than enough RAM to handle the inflated file.
Is there a way to do it with sed / awk / bash tools?
Is there a faster way to do the replacement and addition in Python?
I believe that using tools specially designed for text processing is faster than invoking a script written in a general-purpose interpreted language such as Python.
sed doesn't support Unicode escape sequences, but it is possible to pass the actual character using command substitution:
sed -i -e "s/ /$(printf '\uE000')/g; s/\(.\)/ \1 /g" file
Perl is my favorite, because it is very flexible. It is also much better for text processing than Python:
The Perl languages borrow features from other programming languages
including C, shell script (sh), AWK, and sed... They provide
powerful text processing facilities without the arbitrary data-length
limits of many contemporary Unix commandline tools,... facilitating
easy manipulation of text files.
(from Wikipedia)
Example:
perl -CSDL -p -i -e 's/ /\x{E000}/g ; s/(.)/ \1 /g' file
Note, the -CSDL option enables UTF-8 for the output.
There is also an AWKward way of doing this using GNU AWK version 4.1.0 or newer:
gawk -i inplace '{
    gsub(/ /, "\xee\x80\x80");
    a = gensub(/(.)/, " \\1 ", "g");
    print a; }' file
But I wouldn't recommend it, for obvious reasons.
I doubt that anyone would claim that a specific tool or algorithm is the fastest one, as there are plenty of factors that may affect performance: hardware, the way the tools are compiled, tool versions, the kernel version, etc. Perhaps the best way to find the right tool or algorithm is to benchmark. I don't think it's necessary to mention the time command.
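For completeness, the Python route the question asks about can at least avoid read()-ing the whole 6 GB file at once by working line by line; whether it beats sed or Perl is exactly the kind of thing the benchmarking above would settle. A rough sketch (file names are placeholders; note that ' '.join inserts spaces between characters rather than around them):

# Replace spaces with U+E000 and put a space between every pair of characters,
# one line at a time instead of reading the whole file into memory.
with open("test.txt", encoding="utf-8") as src, \
     open("test.spaced.txt", "w", encoding="utf-8") as dst:
    for line in src:
        line = line.rstrip("\n").replace(" ", "\ue000")
        dst.write(" ".join(line) + "\n")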
I tested my regex for matching exceptions in a log file at:
http://gskinner.com/RegExr/
The regex is:
.+Exception[^\n]+(\s+at.++)+
and it works for the couple of cases I pasted there, but not when I use it with grep:
grep '.+Exception[^\n]+(\s+at.++)+' server.log
Does grep need some extra flags to make it work with this regex?
Update:
It doesn't have to be this regex; I'm looking for anything that will print the exceptions.
Not all versions of grep understand the same syntax.
Your pattern contains a + for 1 or more repeats, which means it is in egrep territory.
But it also has \s for white space, which most versions of grep are ignorant of.
Finally, you have ++ to mean a possessive match of the preceding atom, which only fairly sophisticated regex engines understand. You might try a non-possessive match.
However, you don’t need a leading .+, so you can jump right to the string you want. Also, I don’t see why you would use [^\n] since that’s what . normally means, and because you’re operating in line mode already anyways.
If you have grep -P, you might try that. I’m using a simpler but equivalent version of your pattern; you aren’t using an option to grep that gives only the exact match, so I assume you want the whole record:
$ grep -P 'Exception.+\sat' server.log
But if that doesn’t work, you can always bring out the big guns:
$ perl -ne 'print if /Exception.+\sat/' server.log
And if you want just the exact match, you could use
$ perl -nle 'print $& if /Exception.*\bat\b.*/' server.log
That should give you enough variations to play with.
I don’t understand why people use web-based “regex” builders when they can just do the same on the command line with existing tools, since that way they can be absolutely certain the patterns they devise will work with those tools.
You can pass the pattern with the -e <regex> option, and if you want to use extended regex syntax, add -E. Take a look at the man page: man grep.
It looks like you're trying to find lines that look something like:
... Exception foobar at line 7 ...
So first, note that grep always treats the pattern as a (basic) regular expression; for extended regex features you would need -E, or you can just run egrep.
Next, you don't really have to specify the .+ at the start of the expression. It's usually best to minimize what you're searching for. If it's imperative that there is at least one character before "Exception", then just use a single . there.
Also, \s is a Perl-ish way of asking for whitespace. grep uses POSIX regexes, so the equivalent is [[:space:]].
So, I would use:
grep -e 'Exception.*[[:space:]]at'
This would get what you want with the least amount of muss and fuss.