How to read csv with multiple quoted delimiters in single field? - python

I'd like to be able to split a string which contains the delimiter quoted multiple times. Is there an argument to handle this type of string with the csv module? Or is there another way to process it?
text = '"a,b"-"c,d","a,b"-"c,d"'
next(csv.reader(StringIO(text), delimiter=",", quotechar='"', quoting=csv.QUOTE_NONE))
Expected output: ['"a,b"-"c,d"', '"a,b"-"c,d"']
Actual output: ['"a', 'b"-"c', 'd"', '"a', 'b"-"c', 'd"']
EDIT:
The example above is simplified, but apparently too simplified as some comments provided solutions for the simplified version but not for the full version. Below is the actual data I want to process.
import csv
from io import StringIO

text = '"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0,"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0'
next(csv.reader(StringIO(text), delimiter=",", quotechar='"', quoting=csv.QUOTE_NONE))
Expected output
[
'"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0',
'"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0'
]
Actual output
[
'"3-Amino-1',
'2',
'4-triazole"-text-0-"3-Amino-1',
'2',
'4-triazole"-CD-0',
'"3-Amino-1',
'2',
'4-triazole"-text-0-"3-Amino-1',
'2',
'4-triazole"-LS-0'
]

I'll only answer the first part of your question: there is no way to do this with the built-in csv module.
Looking at the CPython source code, the quotechar option is only processed at the start of a field:
case START_FIELD:
    /* expecting field */
    ...
    else if (c == dialect->quotechar &&
             dialect->quoting != QUOTE_NONE) {
        /* start quoted field */
        self->state = IN_QUOTED_FIELD;
    }
    ...
    break;
Inside a field, there is no such check:
case IN_FIELD:
    /* in unquoted field */
    if (c == '\n' || c == '\r' || c == '\0') {
        /* end of line - return [fields] */
        if (parse_save_field(self) < 0)
            return -1;
        self->state = (c == '\0' ? START_RECORD : EAT_CRNL);
    }
    else if (c == dialect->escapechar) {
        /* possible escaped character */
        self->state = ESCAPED_CHAR;
    }
    else if (c == dialect->delimiter) {
        /* save field - wait for new field */
        if (parse_save_field(self) < 0)
            return -1;
        self->state = START_FIELD;
    }
    else {
        /* normal character - save in field */
        if (parse_add_char(self, module_state, c) < 0)
            return -1;
    }
    break;
There is a check for quotechar while the parser is in the IN_QUOTED_FIELD state; however, once the closing quote is followed by an ordinary character, the parser drops back to the IN_FIELD state, indicating we're inside an unquoted field. So this is possible:
>>> import csv
>>> import io
>>> print(next(csv.reader(io.StringIO('"a,b"cd,e'))))
['a,bcd', 'e']
But once the parser has reached the end of the initial quoted section, it will consider any subsequent quotes as part of the data. I don't know if this behaviour is to conform with any (written or unwritten) CSV specification, or if it's just a bug.
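For example (a quick demonstration of my own, run against the stock csv module): once the initial quoted section of a field is closed, a later quote is stored as literal data rather than reopening quoting:
>>> import csv, io
>>> print(next(csv.reader(io.StringIO('"a,b"c"d,e'))))
['a,bc"d', 'e']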

The data is in a non-standard format, so any solution needs to be tested on the full dataset. A possible workaround is to first replace every ," sequence with ;" and then simply split on the ;. This can be done without using csv or re:
tests = [
    '"a,b"-"c,d","a,b"-"c,d"',
    '"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0,"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0',
]

for test in tests:
    row = test.replace(',"', ';"').split(';')
    print(len(row), row)
Giving:
2 ['"a,b"-"c,d"', '"a,b"-"c,d"']
2 ['"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0', '"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0'

If the structure is always the same, with the field-separating comma sandwiched between a digit and a double quote, you can use a regular expression:
import re
re.split('(?<=[0-9]),(?=")', text)
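Applied to the string from the question, the commas inside the quoted chemical names are left alone because they are followed by a digit rather than a quote:
import re

text = '"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0,"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0'
print(re.split('(?<=[0-9]),(?=")', text))
# ['"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-CD-0',
#  '"3-Amino-1,2,4-triazole"-text-0-"3-Amino-1,2,4-triazole"-LS-0']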

Related

Splitting C code to statements using python

Is there any way to split a string (a complete C file) into C statements using Python?
#include <stdio.h>
#include <math.h>
int main (void)
{
    if(final==(final_t))
    {
        foo(final);
        /*comment*/
        printf("equal\n");
    }
    return(0);
}
If this is read to a string is there any way to split it into a list of strings like this:
list=['#include <stdio.h>', '#include<math.h>', 'int main(void){', 'if(final==(final_t)){', 'foo(final);', '/*comment*/', 'printf("equal\\n");', '}', 'return(0);', '}']
Without being extremely complex, a C program is composed of lexical tokens that form declarations and statements according to a grammar, and your splitting rules need more explanation: according to the C language standard, if (cond) statement1 [else statement2] is itself a statement, and since statement1 and statement2 can both be blocks, statements can be nested. In your requirements you seem to concatenate the opening brace of an eventual block onto the conditional and leave the closing brace alone, and you say nothing about declarations or the preprocessor language.
So IMHO, your specifications are still incomplete...
Anyway, it is already far too complex for a simple lexical analyzer. So you should first write the complete grammar that you want to process, ideally in Backus-Naur Form, and declare the terminating tokens. Once you have that, it is easy to use lex + yacc (for example via the PLY package) to build a parser from that grammar.
It is probably not the expected answer, but C language parsers are far from trivial, unless you only want to accept a small subset of the language.
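If you want a real parser without writing the grammar yourself, the third-party pycparser package is one option (my suggestion, not part of the answer above; note that it expects preprocessed C, so no #include lines):
from pycparser import c_parser

# Parse a preprocessed C snippet and dump its syntax tree.
src = 'int main(void) { if (final == final_t) { foo(final); } return 0; }'
ast = c_parser.CParser().parse(src)
ast.show()  # walk the tree to extract individual statements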
You should perform the following steps to reach the result:
Get your code as separate lines.
Cut leading and trailing spaces.
Skip empty lines.
If your code is given as a string, you can use:
lines = content.split('\n')
If as a file:
with open('file.c') as f:
    lines = f.readlines()
To cut extra spaces:
lines = list(map(str.strip, lines))
To skip empty lines:
lines = list(filter(lambda x: x, lines))
So the full code may look like this:
content = """
#include <stdio.h>
#include <math.h>
int main (void)
{
if(final==(final_t))
{
foo(final);
printf("equal\n");
}
return(0);
}
"""
lines = content.split('\n')
lines = list(map(str.strip, lines))
lines = list(filter(lambda x: x, lines))
print(lines)
code_list = []
with open("<your-code-file>", 'r') as code_file:
    for line in code_file:
        if "{" in line:
            code_list[-1] = code_list[-1] + line.strip()
        else:
            code_list.append(line.strip())
print(code_list)
output:
['#include <stdio.h>', '#include <math.h>', '', 'int main (void){', 'if(final==(final_t)){', 'foo(final);', 'printf("equal\\n");', '}', 'return(0);', '}']

Sort file by key with awk or perl like a join without presorting

I want to join two tab separated files, but they are in a different order. I know that it is doable with awk, but I don't know how. Here is the equivalent toy python code (python is too memory inefficient for this task without crazy workarounds):
import pandas as pd
from random import shuffle
a = ['bar','qux','baz','foo','spam']
df = pd.DataFrame({'nam':a,'asc':[1,2,3,4,5],'desc':[5,4,3,2,1]})
shuffle(a)
print(a)
dex = pd.DataFrame({'dex' : a})
df_b = pd.DataFrame({'VAL1' :[0,1,2,3,4,5,6]})
pd.merge(dex, df,left_on='dex',right_on='nam')[['asc','desc','nam']]
I have two files:
For file one, column 2 holds the identifier for each row, there are 5 columns I don't need, and then there are about 3 million columns of data.
For file two, there are 12 columns, with the second column containing the same identifiers in a different order, along with additional ids.
I want to sort file one to have the same identifiers and order as file two, with the other columns appropriately rearranged.
File one is potentially multiple gigabytes.
Is this easier with awk and/or other GNU tools, or should I use perl?
If the size of file1 is in the order of GB, and you have 3 million columns of data, you have a tiny number of lines (no more than 200). While you can't load all of the lines themselves into memory, you could easily load all of their locations.
use feature qw( say );
use Fcntl qw( SEEK_SET );

open(my $fh1, '<', $qfn1) or die("Can't open \"$qfn1\": $!\n");
open(my $fh2, '<', $qfn2) or die("Can't open \"$qfn2\": $!\n");

my %offsets;
while (1) {
    my $offset = tell($fh1);
    my $row1 = <$fh1>;
    last if !defined($row1);

    chomp($row1);
    my @fields1 = split(/\t/, $row1);
    my $key = $fields1[1];
    $offsets{$key} = $offset;
}

while (my $row2 = <$fh2>) {
    chomp($row2);
    my @fields2 = split(/\t/, $row2);
    my $key = $fields2[1];

    my $offset = $offsets{$key};
    if (!defined($offset)) {
        warn("Key $key not found.\n");
        next;
    }

    seek($fh1, $offset, SEEK_SET);
    my $row1 = <$fh1>;
    chomp($row1);
    my @fields1 = split(/\t/, $row1);

    say join "\t", @fields2, @fields1[6..$#fields1];
}
This approach can be taken in Python as well.
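The same idea in Python might look like this (a sketch of my own; it assumes tab-separated files named file1.tsv and file2.tsv with the key in the second column, and opens them in binary so that tell()/seek() offsets are exact):
import sys

offsets = {}
with open('file1.tsv', 'rb') as f1:
    # Pass 1: remember the byte offset of every row of file1, keyed on column 2.
    while True:
        pos = f1.tell()
        line = f1.readline()
        if not line:
            break
        offsets[line.split(b'\t', 2)[1]] = pos

    # Pass 2: walk file2 in order, seeking into file1 for each key.
    with open('file2.tsv', 'rb') as f2:
        for line2 in f2:
            fields2 = line2.rstrip(b'\n').split(b'\t')
            key = fields2[1]
            if key not in offsets:
                print('Key not found:', key.decode(), file=sys.stderr)
                continue
            f1.seek(offsets[key])
            fields1 = f1.readline().rstrip(b'\n').split(b'\t')
            sys.stdout.buffer.write(b'\t'.join(fields2 + fields1[6:]) + b'\n')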
Note: There exists a much simpler solution if the order is more flexible (i.e. if you're OK with the output being ordered as the records are ordered in file1). This assumes file2 fits easily in memory.
3 million columns of data, eh? It sounds like you're doing some NLP work.
Assuming this is true, and your matrix is sparse, python can handle it just fine (just not with pandas). Look at scipy.sparse. Example:
from scipy.sparse import dok_matrix

A = dok_matrix((10, 10))
A[1, 1] = 1
B = dok_matrix((10, 10))
B[2, 2] = 2
print(A + B)
DOK stands for "dictionary of keys", which is typically used to build the sparse matrix, then it's usually converted to CSR, etc. depending on use. See available sparse matrix types.
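A typical workflow (a small sketch of my own) is to build the matrix in DOK form and convert it to CSR before doing arithmetic:
from scipy.sparse import dok_matrix

M = dok_matrix((1000, 1000))
M[0, 42] = 3.14
M_csr = M.tocsr()  # compressed sparse row: efficient arithmetic and row slicing
print(M_csr.nnz)   # -> 1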
The important thing is not to split any more than necessary. If you have enough memory, putting the smaller file in a hash, and then reading through the second file ought to work.
Consider the following example (note the run time of this script includes the time it takes to create sample data):
#!/usr/bin/env perl

use strict;
use warnings;

# This is a string containing 10 lines corresponding to your "file one".
# Second column has the record ID.
# Normally, you'd be reading this from a file.
my $big_file = join "\n",
    map join("\t", 'x', $_, ('x') x 3_000_000),
    1 .. 10;

# This is a string containing 10 lines corresponding to your "file two".
# Second column has the record ID.
my $small_file = join "\n",
    map join("\t", 'y', $_, ('y') x 10),
    1 .. 10;

# You would normally pass file names as arguments.
join_with_big_file(
    \$small_file,
    \$big_file,
);

sub join_with_big_file {
    my $small_records = load_small_file(shift);
    my $big_file = shift;

    open my $fh, '<', $big_file
        or die "Cannot open '$big_file': $!";

    while (my $line = <$fh>) {
        chomp $line;
        my ($first, $id, $rest) = split /\t/, $line, 3;
        print join("\t", $first, $id, $rest, $small_records->{$id}), "\n";
    }

    return;
}

sub load_small_file {
    my $file = shift;
    my %records;

    open my $fh, '<', $file
        or die "Cannot open '$file' for reading: $!";

    while (my $line = <$fh>) {
        # limit the split
        my ($first, $id, $rest) = split /\t/, $line, 3;

        # I drop the id field here so it is not duplicated in the joined
        # file. If that is not a problem, $records{$id} = $line
        # would be better.
        $records{$id} = join("\t", $first, $rest);
    }

    return \%records;
}
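For comparison, the same hash-join pattern in Python (a sketch under the same assumptions: tab-separated files, record ID in the second column; the file names small.tsv and big.tsv are placeholders of mine):
def load_small_file(path):
    # Key each record on its second column, dropping the id from the stored
    # value so it is not duplicated in the joined output.
    records = {}
    with open(path) as f:
        for line in f:
            first, rec_id, rest = line.rstrip('\n').split('\t', 2)
            records[rec_id] = first + '\t' + rest
    return records

small_records = load_small_file('small.tsv')
with open('big.tsv') as f:
    for line in f:
        first, rec_id, rest = line.rstrip('\n').split('\t', 2)
        print(first, rec_id, rest, small_records[rec_id], sep='\t')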

Convert csv file to txt file

I'm using perl to convert a comma separated file to a tab separated file with this command:
perl -e ' $sep=","; while(<>) { s/\Q$sep\E/\t/g; print $_; } warn "Changed $sep to tab on $. lines\n" ' csvfile.csv > tabfile.tab
However, my file has additional commas that I do not want to be separated in specific columns. Here's an example of my file:
ADNP, "descript1, descript2", 1
PTB, "descriptA, descriptB", 5
I only want to convert the comma's outside of the quotations to tabs as so:
ADNP descript1, descript2 1
PTB descriptA, descriptB 5
Is there anyway to go about doing this with either perl, python, or bash?
Trivial in Perl, using Text::CSV:
#!/usr/bin/env perl

use strict;
use warnings;
use Text::CSV;

# Configure our read format using the default separator of ",".
my $input_csv = Text::CSV->new( { binary => 1 } );

# Configure our output format with a tab as separator.
my $output_csv = Text::CSV->new( { binary => 1, sep_char => "\t", eol => "\n" } );

# Open input file.
open my $input_fh, '<', "sample.csv" or die $!;

# Iterate input file - reading in 'comma separated',
# printing out (to STDOUT - can use a filehandle) tab separated.
while ( my $row = $input_csv->getline($input_fh) ) {
    $output_csv->print( \*STDOUT, $row );
}
In Python:
import csv

with open('input', newline='') as inf, open('output', 'w', newline='') as out:
    reader = csv.reader(inf)
    writer = csv.writer(out, delimiter='\t')
    writer.writerows(reader)
You need regular expressions to help you. In python it would simply be:
>>> import re
>>> re.split(r'(?!\B"[^"]*),(?![^"]*"\B)', 'ADNP, "descript1, descript2", 1')
['ADNP', ' "descript1, descript2"', ' 1']
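To turn that into the full comma-to-tab conversion, one way (a sketch; the file names are placeholders, and it assumes quoted fields never contain escaped quotes) is to apply the split per line and strip the surrounding quotes:
import re

# Matches commas that sit outside of double-quoted sections.
OUTSIDE_QUOTES = re.compile(r'(?!\B"[^"]*),(?![^"]*"\B)')

with open('csvfile.csv') as inf, open('tabfile.tab', 'w') as out:
    for line in inf:
        fields = [f.strip().strip('"') for f in OUTSIDE_QUOTES.split(line.rstrip('\n'))]
        out.write('\t'.join(fields) + '\n')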
Building off rll's regex answer, you can turn it into a Perl one-liner like you're currently doing:
perl -ne 'BEGIN{$,="\t";}@a=split(/(?!\B"[^"]*),(?![^"]*"\B)/);print @a' csvfile.csv > tabfile.tab
This'll work:
perl -e '$sep=","; while(<STDIN>) { @data = split(/(\Q$sep\E?\s*"[^"]+"\s*\Q$sep\E?)/); foreach(@data){if(/"/){s/^\Q$sep\E\s*"//;s/"\s*\Q$sep\E$//;}else{s/\Q$sep\E/\t/g;}}print(join("\t",@data));} warn "Changed $sep to tab on $. lines\n"' < csvfile.csv > tabfile.tab
Putting parens in the pattern passed to split returns the captured separators along with the split elements, which effectively puts the quote-containing strings into their own list elements so they can be treated differently when quotes are detected. You just strip the commas and quotes off the quoted strings and substitute tabs in the other elements, then join all the elements with tabs (so the quoted strings end up tab-joined to the other, already-tabbed strings).
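The capture-the-separator trick works the same way in Python's re.split, if that helps to visualize it (my own illustration of the same pattern):
>>> import re
>>> re.split(r'(,?\s*"[^"]+"\s*,?)', 'ADNP, "descript1, descript2", 1')
['ADNP', ', "descript1, descript2",', ' 1']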
The Text::CSV module is what you're looking for. There are a lot of considerations when parsing CSV files, and you really don't want to handle all of them yourself.

python parsing json data with double quotes

How do you parse json data with double quotes within:
json.loads('
{
"time":"1410661614",
"text":"This is great",
"from":
{
"username":"mrb",
"id":"5071",
"full_name":"Free "Mrb"" #here is the problem
},
"id":"8090107"
}
')
python returns:
ValueError: Expecting ',' delimiter: line 1 column 107 (char 106)
You can easily fix this issue by escaping the double quote (\")
import json
json.loads("""
{
"time":"1410661614",
"text":"This is great",
"from":
{
"username":"mrb",
"id":"5071",
"full_name":"Free \\"Mrb\\""
},
"id":"8090107"
}
""")
As said in the comments, it is better to fix the JSON generator to escape the double quotes properly; it is hard to parse and repair a malformed JSON file after the fact.
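For reference, if the producing side builds the record with json.dumps, the inner quotes come out escaped automatically:
import json

record = {"full_name": 'Free "Mrb"'}
print(json.dumps(record))  # {"full_name": "Free \"Mrb\""}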
Whoever wrote the program that emits those unescaped quotes inside strings needs a serious talking to...
As Martijn said, parsing arbitrary crazy quotes is not easy.
OTOH, if the JSON is otherwise well-formed, and the offending strings don't cross line boundaries, then it's not so bad. Eg,
#! /usr/bin/env python

''' Escape quotes in malformed JSON value strings

    Written by PM 2Ring 2014.09.19
'''

import re

data = [
    ''' "evil_name":"Free "Mrb"",''',
    ''' "good_name":"Alan Turing",''',
]

for line in data:
    pre, val = line.split(':')
    parts = re.split('(")', val)
    n = parts.count('"')
    if n > 2:
        # Escape every quote except the first and the last.
        i = 1
        a = []
        for c in parts:
            if c == '"':
                if 1 < i < n:
                    c = '\\"'
                i += 1
            a.append(c)
        line = pre + ':' + ''.join(a)
    print(line)
Output
"evil_name":"Free \"Mrb\"",
"good_name":"Alan Turing",

xor each byte with 0x71

I needed to read a byte from the file, xor it with 0x71 and write it back to another file. However, when I use the following, it just reads the byte as a string, so xoring creates problems.
f = open('a.out', 'r')
f.read(1)
So I ended up doing the same in C.
#include <stdio.h>

int main() {
    char buffer[1] = {0};
    FILE *fp = fopen("blah", "rb");
    FILE *gp = fopen("a.out", "wb");
    if (fp == NULL) printf("ERROR OPENING FILE\n");
    int rc;
    while ((rc = fgetc(fp)) != EOF) {
        printf("%x", rc ^ 0x71);
        fputc(rc ^ 0x71, gp);
    }
    return 0;
}
Could someone tell me how I could convert the string I get on using f.read() over to a hex value so that I could xor it with 0x71 and subsequently write it over to a file?
If you want to treat something as an array of bytes, then usually you want a bytearray as it behaves as a mutable array of bytes:
b = bytearray(open('a.out', 'rb').read())
for i in range(len(b)):
    b[i] ^= 0x71
open('b.out', 'wb').write(b)
Indexing a bytearray returns an integer between 0x00 and 0xff, and modifying it in place avoids the need to create a list and join everything up again. Note also that the file was opened as binary ('rb') - in your example you use 'r', which isn't a good idea.
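An equivalent one-pass variant (my own sketch, same file names) uses a 256-entry translation table, which is convenient for large files:
# Map every byte value b to b ^ 0x71, then translate the whole buffer at once.
table = bytes(b ^ 0x71 for b in range(256))
with open('a.out', 'rb') as src, open('b.out', 'wb') as dst:
    dst.write(src.read().translate(table))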
Try this, with the file opened in binary mode ('rb'):
my_num = ord(f.read(1))
And then xor the number stored in my_num with 0x71.
