sed not working on large file [Looking for other options]


I have a gigantic JSON file that was accidentally written without a newline character between the JSON entries, so it is being treated as one giant single line. I tried a find and replace with sed to insert a newline before each entry.
sed 's/{"seq_id"/\n{"seq_id"/g' my_giant_json.json
It doesn't output anything.
However, I know my sed expression is correct, because it works fine when I run it on just a small part of the file.
head -c 1000000 my_giant_json.json | sed 's/{"seq_id"/\n{"seq_id"/g'
I have also tried using python with this gnarly one liner
'\n{"seq_id'.join(open(json_file,'r').readlines()[0].split('{"seq_id')).lstrip()
But this loads the whole file into memory because of the readlines() method, and I don't know how to iterate through a giant single line of characters in chunks while doing a find and replace.
Any thoughts?

Perl will let you change the input separator ($/) from newline to another character. You could take advantage of this to get some convenient chunking.
perl -pe'BEGIN{$/="}"}s/^({"seq_id")/\n$1/' my_giant_json.json
That sets the input separator to be "}". Then it looks for chunks that start with {"seq_id" and prefixes them with a newline.
Note that it puts an unnecessary empty line at the beginning. You could complicate the program to eliminate that or just delete it manually after.
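For the Python side of the question, one way to avoid loading the whole line is to read fixed-size chunks and hold back the last few characters of each chunk in case the marker straddles a chunk boundary. The following is only a minimal sketch under that idea: the function name split_records, the chunk_size default and the in-memory demo data are assumptions made for illustration. It also skips the newline before the very first record.

```python
import io


def split_records(src, dst, marker='{"seq_id"', chunk_size=1 << 20):
    """Copy src to dst, writing a newline before every marker except a leading one."""
    keep = len(marker) - 1  # longest marker prefix that can straddle a chunk boundary
    tail = ""               # characters held back from the previous chunk
    wrote = False           # becomes True once anything has been written
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        pieces = (tail + chunk).split(marker)
        for i, piece in enumerate(pieces):
            if i > 0:  # a complete marker preceded this piece
                dst.write(("\n" if wrote else "") + marker)
                wrote = True
            if i == len(pieces) - 1:  # the last piece may end with a partial marker
                cut = max(len(piece) - keep, 0)
                piece, tail = piece[:cut], piece[cut:]
            if piece:
                dst.write(piece)
                wrote = True
    dst.write(tail)  # flush whatever was held back at end of input


# Small in-memory demonstration with a deliberately tiny chunk size.
demo_in = io.StringIO('{"seq_id":1}{"seq_id":2}')
demo_out = io.StringIO()
split_records(demo_in, demo_out, chunk_size=4)
print(demo_out.getvalue())
```

For a real file you would pass two file objects opened in text mode (one for reading, one for writing) instead of the StringIO objects.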

Related

doing basic UNIX operations Pythonic way

I have a space separated file (file1.csv) on which I perform 3 UNIX operations manually, namely:
step1. removing all double quotes(") from each line.
sed 's/"//g' file1.csv > file_tmp1.csv
step2. removing all white spaces at the beginning of any line.
sed 's/^ *//' file_tmp1.csv > file_tmp2.csv
step3. removing all additional white spaces in between texts of each line.
cat file_tmp2.csv | tr -s " " > file1_processed.csv
So, I wanted to know if there's a better approach to this, ideally a Pythonic one that doesn't cost much computation time. These 3 steps take about 5 minutes at most when done using UNIX commands.
Please note the file file1.csv is a space-separated file and I want it to stay space-separated.
Also if your solution suggests loading entire file1.csv into memory then I would request you to suggest a way where this is done in chunks because the file is way too big (~20 GB or so) to load into memory every time.
thanks in advance.

An obvious improvement would be to convert the tr step to sed and combine all parts to one job. First the test data:
$ cat file
"this" "that"
The job:
$ sed 's/"//g;s/^ *//;s/ \+/ /g' file
this that
Here's all of those steps in one awk:
$ awk '{gsub(/\"|^ +/,""); gsub(/ +/," ")}1' file
this that
If you test it, let me know how long it took.

Here's a process which reads one line at a time and performs the substitutions you specified in Python.
with open('file1.csv') as source:
    for line in source:
        # drop the quotes, then collapse runs of whitespace into single spaces
        print(' '.join(line.replace('"', '').split()))
The default behavior of split() includes trimming any leading (and trailing) whitespace, so we don't specify that explicitly. If you need to keep trailing whitespace, perhaps you need to update your requirements.
Your shell script attempt, with multiple temporary files and multiple invocations of sed, is not a good example of how to do this in the shell, either.

Remove non-ASCII characters interpreted as EOF from text file

Following up on my earlier question here: Row limit in read.table.ffdf?
I have a text file with >285 million records, but about two-thirds of the way through there are several non-ASCII characters that are being interpreted by AWK as well as several R packages (ff, data.table) as EOF bytes. It appears that the characters were originally entered as degree signs, but appear in text editors as boxes (see example here). When I try to read in the text file using these methods it just stops when it encounters the first character, with no error messages as if it's complete.
For now I was able to open the file in a text editor to remove these characters. But this is not a long-term solution for this dataset given its size; I need to be able to remove or bypass them without having to open the whole file. I've tried using the quote option in R, and tried replacing all non-ASCII and 'CTRL-M' characters specifically during an awk import, but the read process always stops at the first character. Any solutions? I'm using R and awk now, but am open to other options (python?). Thanks!

gawk -v BINMODE=3 '{gsub(/[[:cntrl:]]/,"")}1'
will remove them.
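Since the question is also open to Python, a similar filter can be sketched there by reading the file in binary mode, so that no byte is treated as an end-of-file marker, and deleting control bytes chunk by chunk. This is a rough sketch, not a drop-in replacement for the gawk command: the function name strip_control_bytes and the choice to keep tabs and newlines are assumptions. Bytes above 0x7F (for example a Latin-1 degree sign) are left untouched.

```python
import io

# Assumption: delete every C0 control byte and DEL, but keep tab and newline.
DELETE = bytes(b for b in list(range(0x20)) + [0x7F] if b not in (0x09, 0x0A))


def strip_control_bytes(src, dst, chunk_size=1 << 20):
    """Copy binary stream src to dst without the bytes listed in DELETE."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        # bytes.translate with a None table only removes the listed bytes
        dst.write(chunk.translate(None, DELETE))


# In-memory demonstration; 0x1A is the byte often treated as EOF.
demo_in = io.BytesIO(b"a\x1ab\r\n\tc\x00\xb0\x7f")
demo_out = io.BytesIO()
strip_control_bytes(demo_in, demo_out, chunk_size=3)
print(demo_out.getvalue())
```

With real files, both would be opened in binary mode ("rb" and "wb").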

replace a string with regular expression in python

I have been learning regular expression for a while but still find it confusing sometimes
I am trying to replace all the
self.assertRaisesRegexp(SomeError,'somestring'):
to
self.assertRaisesRegexp(SomeError,somemethod('somestring'))
How can I do it? I am assuming the first step is fetch 'somestring' and modify it to somemethod('somestring') then replace the original 'somestring'

here is your regular expression
#f is going to be your file in string form
re.sub(r'(?m)self\.assertRaisesRegexp\((.+?),((?P<quote>[\'"]).*?(?P=quote))\)',r'self.assertRaisesRegexp(\1,somemethod(\2))',f)
this will grab something that matches and replace it accordingly. It will also make sure that the quotation marks line up correctly by setting a reference in quote
there is no need to iterate over the file here either, the first statement "(?m)" puts it in multiline mode so it maps the regular expression over each line in the file. I have tested this expression and it works as expected!
test
>>> print f
this is some
multi line example that self.assertRaisesRegexp(SomeError,'somestring'):
and so on. there self.assertRaisesRegexp(SomeError,'somestring'): will be many
of these in the file and I am just ranting for example
here is the last one self.assertRaisesRegexp(SomeError,'somestring'): okay
im done now
>>> print re.sub(r'(?m)self\.assertRaisesRegexp\((.+?),((?P<quote>[\'"]).*?(?P=quote))\)',r'self.assertRaisesRegexp(\1,somemethod(\2))',f)
this is some
multi line example that self.assertRaisesRegexp(SomeError,somemethod('somestring')):
and so on. there self.assertRaisesRegexp(SomeError,somemethod('somestring')): will be many
of these in the file and I am just ranting for example
here is the last one self.assertRaisesRegexp(SomeError,somemethod('somestring')): okay
im done now

A better tool for this particular task is sed:
$ sed -i 's/\(self.assertRaisesRegexp\)(\(.*\),\(.*\))/\1(\2,somemethod(\3))/' *.py
sed will take care of the file I/O, renaming files, etc.
If you already know how to do the file manipulation, and iterating over lines in each file, then the python re.sub line will look like:
new_line = re.sub(r"(self\.assertRaisesRegexp)\((.*),(.*)\)",
                  r"\1(\2,somemethod(\3))",  # two closing parens: somemethod(...) and the outer call
                  old_line)
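To make the line-by-line idea concrete, here is a small self-contained sketch. The helper names wrap_second_argument and rewrite_lines are made up for illustration, and the pattern is the same greedy one as above, so it assumes at most one such call per line and no extra commas inside the arguments.

```python
import re

# Same shape as the pattern above, with the dot escaped.
PATTERN = re.compile(r"(self\.assertRaisesRegexp)\((.*),(.*)\)")


def wrap_second_argument(line):
    """Wrap the second argument of assertRaisesRegexp in somemethod(...)."""
    return PATTERN.sub(r"\1(\2,somemethod(\3))", line)


def rewrite_lines(lines):
    """Apply the substitution to every line of an iterable such as an open file."""
    return [wrap_second_argument(line) for line in lines]


sample = [
    "with self.assertRaisesRegexp(SomeError,'somestring'):\n",
    "    unrelated_call()\n",
]
print("".join(rewrite_lines(sample)), end="")
```

Lines that do not match the pattern pass through unchanged.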

Regex out leading and trailing quotes if not contains comma

I'm at a total loss of how to do this.
My Question: I want to take this:
"A, two words with comma","B","C word without comma","D"
"E, two words with comma","F","G more stuff","H no commas here!"
... (continue)
To this:
"A, two words with comma",B,C word without comma,D
"E, two words with comma",F,G more stuff,H no commas here!
... (continue)
I used software that created 1,900 records in a text file and I think it was supposed to be a CSV but whoever wrote the software doesn't know how CSV files work because it only needs quotes if the cell contains a comma (right?). At least I know that in Excel it puts everything in the first cell...
I would prefer this to be solvable using some sort of command line tool like perl or python (I'm on a Mac). I don't want to make a whole project in Java or anything to take care of this.
Any help is greatly appreciated!

Shot in the dark here, but I think that Excel is putting everything in the first column because it doesn't know it's being given comma-separated data.
Excel has a "text-to-columns" feature, where you'll be able to split a column by a delimiter (make sure you choose the comma).
There's more info here:
http://support.microsoft.com/kb/214261
edit
You might also try renaming the file from *.txt to *.csv. That will change the way Excel reads the file, so it better understands how to parse whatever it finds inside.

If just bashing is an option, you can try this one-liner in a terminal:
cat file.csv | sed 's/"\([^,]*\)"/\1/g' >> new-file.csv

That technically should be fine. It is text delimited with the " and separated via the ,
I don't see anything wrong with the first at all, any field may be quoted, only some require it. More than likely the writer of the code didn't want to over complicate the logic and quoted everything.

One way to clean it up is to feed the data to csv and dump it back.
import csv
from io import StringIO  # Python 3; the original used cStringIO from Python 2
bad_data = """\
"A, two words with comma","B","C word without comma","D"
"E, two words with comma","F","G more stuff","H no commas here!"
"""
buffer = StringIO()
writer = csv.writer(buffer)
# splitlines() avoids the empty trailing row that split('\n') would produce
writer.writerows(csv.reader(bad_data.splitlines()))
buffer.seek(0)
print(buffer.read())
Python's csv.writer defaults to the "excel" dialect, which quotes minimally, so it will not write the quotes when they are not necessary.

Sed script to edit csv file Or Python

In our project we need to import the csv file to postgres.
There are multiple types of files meaning the length of the file changes as some files are with fewer columns and some with all of them.
We need a fast way to import this file to postgres. I want to use COPY FROM of the postgres since the speed requirement of the processing are very high(almost 150 files per minute with 20K file size each).
Since the file columns numbers are not fixed, I need to pre-process the file before I pass it to the postgres procedure. The pre-processing is simply to add extra commas in the csv for columns, which are not there in the file.
There are two options for me to preprocess the file - use python or use Sed.
My first question is, what would be the fastest way to pre-process the file?
My second question is, if I use sed, how would I insert a comma after, say, the 4th and 5th comma-separated fields?
e.g. if file has entries like
1,23,56,we,89,2009-12-06
and I need to edit the file with final output like:
1,23,56,we,,89,,2009-12-06

Are you aware of the fact that COPY FROM lets you specify which columns (as well as in which order they) are to be imported?
COPY tablename ( column1, column2, ... ) FROM ...
Specifying directly, at the Postgres level, which columns to import and in what order, will typically be the fastest and most efficient import method.
That said, there is a much simpler (and portable) way of using sed (than what has been presented in other posts) to replace an nth occurrence, e.g. replace the 4th and 5th occurrences of a comma with double commas:
echo '1,23,56,we,89,2009-12-06' | sed -e 's/,/,,/5;s/,/,,/4'
produces:
1,23,56,we,,89,,2009-12-06
Notice that I replaced the rightmost fields (#5) first.
I see that you have also tagged your question as perl-related, although you make no explicit reference to perl in the body of the question; here would be one possible implementation which gives you the flexibility of also reordering or otherwise processing fields:
echo '1,23,56,we,89,2009-12-06' |
perl -F/,/ -nae 'print "$F[0],$F[1],$F[2],$F[3],,$F[4],,$F[5]"'
also produces:
1,23,56,we,,89,,2009-12-06
Very similarly with awk, for the record:
echo '1,23,56,we,89,2009-12-06' |
awk -F, '{print $1","$2","$3","$4",,"$5",,"$6}'
I will leave Python to someone else. :)
Small note on the Perl example: I am using the -a and -F options to autosplit so I have a shorter command string; however, this leaves the newline embedded in the last field ($F[5]) which is fine as long as that field doesn't have to be reordered somewhere else. Should that situation arise, slightly more typing would be needed in order to zap the newline via chomp, then split by hand and finally print our own newline character \n (the awk example above does not have this problem):
perl -ne 'chomp;@F=split/,/;print "$F[0],$F[1],$F[2],$F[3],,$F[4],,$F[5]\n"'
EDIT (an idea inspired by Vivin):
COMMAS_TO_DOUBLE="1 4 5"
echo '1,23,56,we,89,2009-12-06' |
sed -e `for f in $COMMAS_TO_DOUBLE ; do echo "s/,/,,/$f" ; done |
sort -t/ -k4,4nr | paste -s -d ';'`
1,,23,56,we,,89,,2009-12-06
Sorry, couldn't resist it. :)

To answer your first question, sed would have less overhead, but might be painful. awk would be a little better (it's more powerful). Perl or Python have more overhead, but would be easier to work with (regarding Perl, that's maybe a little subjective ;). Personally, I'd use Perl.
As far as the second question, I think the problem might be a little more complex. For example, don't you need to examine the string to figure out what fields are actually missing? Or is it guaranteed that it will always be the 4th and 5th? If it's the first case, it would be way easier to do this in Python or Perl rather than in sed. Otherwise:
echo "1,23,56,we,89,2009-12-06" | sed -e 's/\([^,]\+\),\([^,]\+\),\([^,]\+\),\([^,]\+\),\([^,]\+\),/\1,\2,\3,\4,,\5,,/'
or (easier on the eyes):
echo "1,23,56,we,89,2009-12-06" | sed -e 's/\(\([^,]\+,\)\{3\}\)\([^,]\+\),\([^,]\+\),/\1\3,,\4,,/'
This will add a comma after the 5th and 4th columns assuming there are no other commas in the text.
Or you can use two seds for something that's a little less ugly (only slightly, though):
echo "1,23,56,we,89,2009-12-06" | sed -e 's/\(\([^,]*,\)\{4\}\)/\1,/' | sed -e 's/\(\([^,]*,\)\{6\}\)/\1,/'

@OP, you are processing a csv file, which has distinct fields and delimiters. Use a tool that can split on delimiters and give you fields to work with easily. sed is not one of them: it can be done, as some of the answers suggest, but the sed regex becomes hard to read once it gets complicated. Use tools like awk/Python/Perl, which work with fields and delimiters easily; best of all, modules specifically tailored to processing csv are available. For your example, here is a simple Python approach (without the csv module, which ideally you should try to use):
for line in open("file"):
    line = line.rstrip()  # strip the trailing newline
    sline = line.split(",")
    if len(sline) < 8:  # you want exactly 8 fields
        sline.insert(4, "")
        sline.insert(6, "")
    line = ','.join(sline)
    print(line)
output
$ more file
1,23,56,we,89,2009-12-06
$ ./python.py
1,23,56,we,,89,,2009-12-06
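For comparison, a version built on the csv module might look like the sketch below. The function name pad_rows and its insert_at and width parameters are illustrative assumptions, not part of the answer above; they encode the same "insert empty fields at positions 4 and 6 when a row has fewer than 8 fields" rule.

```python
import csv
import io


def pad_rows(src, dst, insert_at=(4, 6), width=8):
    """Copy CSV rows from src to dst, inserting empty fields into short rows."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        if len(row) < width:
            # positions are given in ascending order and refer to the padded row
            for index in insert_at:
                row.insert(index, "")
        writer.writerow(row)


demo_in = io.StringIO("1,23,56,we,89,2009-12-06\n")
demo_out = io.StringIO()
pad_rows(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Unlike a plain split(","), the csv reader keeps quoted fields that contain commas intact. With real files, open both with newline="" as the csv documentation recommends.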

sed -E 's/^([^,]*,){4}/&,/' <original.csv >output.csv
Will add a comma after the 4th comma separated field (by matching 4 repetitions of <anything>, and then adding a comma after that). Note that there is a catch; make sure none of these values are quoted strings with commas in them.
You could chain multiple replacements via pipes if necessary, or modify the regex to add in any needed commas at the same time (though that gets more complex; you'd need to use subgroup captures in your replacement text).

Don't know regarding speed, but here is sed expr that should do the job:
sed -i 's/\(\([^,]*,\)\{4\}\)/\1,/' file_name
Just replace 4 with the desired number of columns.

Depending on your requirements, consider using ETL software for this and future tasks. Tools like Pentaho and Talend offer you a great deal of flexibility and you don't have to write a single line of code.
