How to read a CSV line with "? - python

A trivial CSV line could be spitted using string split function. But some lines could have ", e.g.
"good,morning", 100, 300, "1998,5,3"
thus directly using string split would not solve the problem.
My solution is to first split out the line using , and then combining the strings with " at then begin or end of the string.
What's the best practice for this problem?
I am interested if there's a Python or F# code snippet for this.
EDIT: I am more interested in the implementation detail, rather than using a library.

There's a csv module in Python, which handles this.
Edit: This task falls into "build a lexer" category. The standard way to do such tasks is to build a state machine (or use a lexer library/framework that will do it for you.)
The state machine for this task would probably only need two states:
Initial one, where it reads every character except comma and newline as part of field (exception: leading and trailing spaces) , comma as the field separator, newline as record separator. When it encounters an opening quote it goes into
read-quoted-field state, where every character (including comma & newline) excluding quote is treated as part of field, a quote not followed by a quote means end of read-quoted-field (back to initial state), a quote followed by a quote is treated as a single quote (escaped quote).
By the way, your concatenating solution will break on "Field1","Field2" or "Field1"",""Field2".

From python's CSV module:
reading a normal CSV file:
import csv
reader = csv.reader(open("some.csv", "rb"))
for row in reader:
print row
Reading a file with an alternate format:
import csv
reader = csv.reader(open("passwd", "rb"), delimiter=':', quoting=csv.QUOTE_NONE)
for row in reader:
print row
There are some nice usage examples in LinuxJournal.com.
If you're interested with the details, read "split string at commas respecting quotes when string not in csv format" showing some nice regexen related to this problem, or simply read the csv module source.

Chapter 4 of The Practice of Programming gave both C and C++ implementations of the CSV parser.

The generic implementation detail would be something like this (untested)
def csvline2fields(line):
fields = []
quote = None
while line.strip():
line = line.strip()
if line[0] in ("'", '"'):
# Find the next quote:
end = line.find(line[0])
fields.append(line[1:end])
# Find the beginning of the next field
next = line.find(SEPARATOR)
if next == -1:
break
line = line[next+1:]
continue
# find the next separator:
next = line.find(SEPARATOR)
fields.append(line[0:next])
line = line[next+1:]

Related

Read CSV with field having multiple quotes and commas

I'm aware this is a much discussed topic and even though there are similar questions I haven't found one that covers my particular case.
I have a csv file that is as follows:
alarm_id,alarm_incident_id,alarm_sitename,alarm_additionalinfo,alarm_summary
"XXXXXXX","XXXXXXXXX","XXXXX|4G_Availability_Issues","TTN-XXXX","XXXXXXX;[{"severity":"CRITICAL","formula":"${XXXXX} < 85"}];[{"name":"XXXXX","value":"0","updateTimestamp":"Oct 27, 2021, 2:00:00 PM"}];[{"coName":{"XXXX/XXX":"MRBTS-XXXX","LNCEL":"XXXXXX","LNBTS":"XXXXXXX"}}]||"
It has more lines but this is the trouble line. If you notice, the fifth field has within it several quotes and commas, which is also the separator. The quotes are also single instead of double quotes which are normally used to signal a quote character that should be kept in the field. What this is doing is splitting this last field into several when reading with pandas.read_csv() method, which throws an error of extra fields. I've tried several configurations and parameters regarding quoting in pandas.read_csv() but none works...
The csv is badly formatted, I just wanted to know if there is a way to still read it, even if using a roundabout way or it really is just hopeless.
Edit: This can happen to more than one column and I never know in which column(s) this may happen
Thank you for your help.
I think I've got what you're looking for, at least I hope.
You can read the file as regular, creating a list of the lines in the csv file.
Then iterate through the lines variable and split it into 4 parts, since you have 4 columns in the csv.
with open("test.csv", "r") as f:
lines = f.readlines()
for item in lines:
new_ls = item.strip().split(",", 4)
for new_item in new_ls:
print(new_item)
Now you can iterate through each lines' column item and do whatever you have/want to do.
If all your lines fields are consistently enclosed in quotes, you can try to split the line on ",", and to remove the initial and terminating quote. The current line is correctly separated with:
row = line.strip('"').split('","', 4)
But because of the incorrect formatting of your initial file, you will have to manually control it matches all the lines...
Can't post a comment so just making a post:
One option is to escape the internal quotes / commas, or use a regex.
Also, pandas.read_csv has a quoting parameter where you can adjust how it reacts to quotes, which might be useful.

Distinguish between "" and empty value when reading csv file using python

CSV file contains values such as "","ab,abc",,"abc". Note, I am referring to empty value ,, as in unknown value. This is different from "", where a value has not been set yet. I am treating these two values differently.
I need a way to read "" and empty value ,, and distinguish between the two. I am mapping data to numbers such that "" is mapped to 0 and ,, is mapped to NaN.
Note, I am not having a parsing issue and field such as "ab,abc" is being parsed just fine with comma as the delimiter. The issue is python reads "" and empty value,, as empty string such as ' '. And these two values are not same and should not be grouped into empty string.
Not only this, but I also need to write csv file such that "" is written as "" and not ,, and NaN should be written as ,, (empty value).
I have looked into csv Dialects such as doublequote, escapechar, quotechar, quoting. This is NOT what I want. These are all cases where delimiter appears within data ie "ab,abc" and as I mentioned, parsing with special characters is not an issue.
I don't want to use Pandas. The only thing I can think of is regex? But that's an overhead if I have millions of lines to process.
The behaviour I want is this:
a = "\"\"" (or it could be a="" or a="ab,abc")
if (a=="\"\""):
map[0]=0
elif(a==""):
map[0]=np.nan
else:
map[0] = a
My csv reader is as follows:
import csv
f = open(filepath, 'r')
csvreader = csv.reader(f)
for row in csvreader:
print(row)
I want above behaviour when reading csv files though. currently only two values are read: ' ' (empty string) or 'ab,abc'.
I want 3 different values to be read. ' ' empty string, '""' string with double quotes, and actual string 'ab,abc'
looking through the csv module in CPython source (search for IN_QUOTED_FIELD), it doesn't have any internal state that would let you do this. for example, parsing:
"a"b"c"d
is parsed as: 'ab"c"d', which might not be what you expect. e.g:
import csv
from io import StringIO
[row] = csv.reader(StringIO(
'"a"b"c"d'))
print(row)
specifically, quotes are only handled specially at the beginning of fields, and all characters are just added to the field as they are encountered, rather than any allowing any special behaviour to be triggered when "un-quote"ing fields
The solution I figured is this:
If I change the input file such that quoted strings have escapechar '\' ,
below is input file:
col1,col2,col3
"",a,b
\cde \,f,g
,h,i
\j,kl\,mno,p
Then double-quoted empty field and unquoted empty field are separable
csvreader = csv.reader(f, quotechar='\\')
for row in csvreader:
print(row)
That's my best solution so far...

Reading regexes from file, in Python

I am trying to read a bunch of regexes from a file, using python.
The regexes come in a file regexes.csv, a pair in each line, and the pair is separated by commas. e.g.
<\? xml([^>]*?)>,<\? XML$1>
peter,Peter
I am doing
detergent = []
infile = open('regexes.csv', 'r')
for line in infile:
line = line.strip()
[search_term, replace_term] = line.split(',', 1)
detergent += [[search_term,replace_term]]
This is not producing the right input. If I print the detergent I get
['<\\?xml([^>]*?)>', '<\\?HEYXML$1>'],['peter','Peter']]
It seems to be that it is escaping the backslashes.
Moreover, in a file containing, say
<? xml ........>
a command re.sub(search_term,replace_term,file_content) written further below in the content is replacing it to be
<\? XML$1>
So, the $1 is not recovering the first capture group in the first regex of the pair.
What is the proper way to input regexes from a file to be later used in re.sub?
When I've had the regexes inside the script I would write them inside the r'...', but I am not sure what are the issues at hand when reading form a file.
There are no issues or special requirements for reading regex's from a file. The escaping of backslashes is simply how python represents a string containing them. For example, suppose you had defined a regex as rgx = r"\?" directly in your code. Try printing it, you'll see it is displayed the same way ...
>>> r"\?"
>>> '\\?'
The reason you $1 is not being replaced is because this is not the syntax for group references. The correct syntax is \1.

Reading and correctly understanding/interpreting control characters from a file (python)

I'm a python beginner and just ran into a simple problem: I have a list of names (designators) and then a very simple code that reads lines in a csv file and prints the csv lines that has a name in the first column (row[0]) in common with my "designator list". So:
import csv
DesignatorList = ["AAX-435", "AAX-961", "HHX-9387", "HHX-58", "K-58", "K-14", "K-78524"]
with open('DesignatorFile.csv','rb') as FileReader:
for row in csv.reader(FileReader, delimiter=';'):
if row[0] in DesignatorList:
print row
My csv files is only a list of names, like this:
AAX-435
AAX-961
HHX-58
HHX-9387
I would like to be able to use wildcards like * and ., example: let's say that I put this on my csv file:
AAX*
H.X-9387
*58
I need my code to be able to interpret those wild cards/control characters, printing the following:
every line that starts with "AAX";
every line that starts with "H", then any following character, then finally ends with "X-9387";
every line that ends with "58".
Thank you!
EDIT: For future reference (in case somebody runs into the same problem), this is how I solved my problem following Roman advice:
import csv
import re
DesignatorList = ["AAX-435", "AAX-961", "HHX-9387", "HHX-58", "K-58", "K-14", "K-78524"]
with open('DesignatorFile.txt','rb') as FileReader:
for row in csv.reader(FileReader, delimiter=';'):
designator_col0 = row[0]
designator_col0_re = re.compile("^" + ".*".join(re.escape(i) for i in designator_col0.split("*")) + "$")
for d in DesignatorList:
if designator_col0_re.match(d):
print d
Try the re module.
You may need to prepare regular expression (regex) for use by replacing '*' with '.*' and adding ^ (beginning of a string) and $ (end of string) to the beginning and the end of the regular expression. In addition, you may need to escape everything else by re.escape function (that is, function escape from module re).
In case you do not have any other "control characters" (as you call them), splitting the string by "*" and joining by ".*" after applying escape.
For example,
import re
def make_rule(rule): # where rule for example "H*X-9387"
return re.compile("^" + ".*".join(re.escape(i) for i in rule.split("*")) + "$")
Then you can match (I guess, your rule is row):
...
rule_re = make_rule(row)
for d in DesignatorList:
if rule_re.match(d):
print row # or maybe print d
(I have understood, that rules are coming from CSV file while designators are from a list. It's easy to do it the other way around).
The examples above are examples. You still need to adapt them into your program.
Python's string object does have a startswith and an endswith method, which you could use here if you only had a small number of rules. The most general way to go with this, since you seem to have fairly simple patterns, is regular expressions. That way you can encode those rules as patterns.
import re
rules = ['^AAX.*$', # starts with AAX
'^H.*X-9387$', # starts with H, ends with X-9387
'^.*58$'] # ends with 58
for line in reader:
if any(re.match(rule, line) for rule in rules):
print line

Python: How to ignore #comment lines when reading in a file

In Python, I have just read a line form a text file and I'd like to know how to code to ignore comments with a hash # at the beginning of the line.
I think it should be something like this:
for
if line !contain #
then ...process line
else end for loop
But I'm new to Python and I don't know the syntax
you can use startswith()
eg
for line in open("file"):
li=line.strip()
if not li.startswith("#"):
print line.rstrip()
I recommend you don't ignore the whole line when you see a # character; just ignore the rest of the line. You can do that easily with a string method function called partition:
with open("filename") as f:
for line in f:
line = line.partition('#')[0]
line = line.rstrip()
# ... do something with line ...
partition returns a tuple: everything before the partition string, the partition string, and everything after the partition string. So, by indexing with [0] we take just the part before the partition string.
EDIT:
If you are using a version of Python that doesn't have partition(), here is code you could use:
with open("filename") as f:
for line in f:
line = line.split('#', 1)[0]
line = line.rstrip()
# ... do something with line ...
This splits the string on a '#' character, then keeps everything before the split. The 1 argument makes the .split() method stop after a one split; since we are just grabbing the 0th substring (by indexing with [0]) you would get the same answer without the 1 argument, but this might be a little bit faster. (Simplified from my original code thanks to a comment from #gnr. My original code was messier for no good reason; thanks, #gnr.)
You could also just write your own version of partition(). Here is one called part():
def part(s, s_part):
i0 = s.find(s_part)
i1 = i0 + len(s_part)
return (s[:i0], s[i0:i1], s[i1:])
#dalle noted that '#' can appear inside a string. It's not that easy to handle this case correctly, so I just ignored it, but I should have said something.
If your input file has simple enough rules for quoted strings, this isn't hard. It would be hard if you accepted any legal Python quoted string, because there are single-quoted, double-quoted, multiline quotes with a backslash escaping the end-of-line, triple quoted strings (using either single or double quotes), and even raw strings! The only possible way to correctly handle all that would be a complicated state machine.
But if we limit ourselves to just a simple quoted string, we can handle it with a simple state machine. We can even allow a backslash-quoted double quote inside the string.
c_backslash = '\\'
c_dquote = '"'
c_comment = '#'
def chop_comment(line):
# a little state machine with two state varaibles:
in_quote = False # whether we are in a quoted string right now
backslash_escape = False # true if we just saw a backslash
for i, ch in enumerate(line):
if not in_quote and ch == c_comment:
# not in a quote, saw a '#', it's a comment. Chop it and return!
return line[:i]
elif backslash_escape:
# we must have just seen a backslash; reset that flag and continue
backslash_escape = False
elif in_quote and ch == c_backslash:
# we are in a quote and we see a backslash; escape next char
backslash_escape = True
elif ch == c_dquote:
in_quote = not in_quote
return line
I didn't really want to get this complicated in a question tagged "beginner" but this state machine is reasonably simple, and I hope it will be interesting.
I'm coming at this late, but the problem of handling shell style (or python style) # comments is a very common one.
I've been using some code almost everytime I read a text file.
Problem is that it doesn't handle quoted or escaped comments properly. But it works for simple cases and is easy.
for line in whatever:
line = line.split('#',1)[0].strip()
if not line:
continue
# process line
A more robust solution is to use shlex:
import shlex
for line in instream:
lex = shlex.shlex(line)
lex.whitespace = '' # if you want to strip newlines, use '\n'
line = ''.join(list(lex))
if not line:
continue
# process decommented line
This shlex approach not only handles quotes and escapes properly, it adds a lot of cool functionality (like the ability to have files source other files if you want). I haven't tested it for speed on large files, but it is zippy enough of small stuff.
The common case when you're also splitting each input line into fields (on whitespace) is even simpler:
import shlex
for line in instream:
fields = shlex.split(line, comments=True)
if not fields:
continue
# process list of fields
This is the shortest possible form:
for line in open(filename):
if line.startswith('#'):
continue
# PROCESS LINE HERE
The startswith() method on a string returns True if the string you call it on starts with the string you passed in.
While this is okay in some circumstances like shell scripts, it has two problems. First, it doesn't specify how to open the file. The default mode for opening a file is 'r', which means 'read the file in binary mode'. Since you're expecting a text file it is better to open it with 'rt'. Although this distinction is irrelevant on UNIX-like operating systems, it's important on Windows (and on pre-OS X Macs).
The second problem is the open file handle. The open() function returns a file object, and it's considered good practice to close files when you're done with them. To do that, call the close() method on the object. Now, Python will probably do this for you, eventually; in Python objects are reference-counted, and when an object's reference count goes to zero it gets freed, and at some point after an object is freed Python will call its destructor (a special method called __del__). Note that I said probably: Python has a bad habit of not actually calling the destructor on objects whose reference count drops to zero shortly before the program finishes. I guess it's in a hurry!
For short-lived programs like shell scripts, and particularly for file objects, this doesn't matter. Your operating system will automatically clean up any file handles left open when the program finishes. But if you opened the file, read the contents, then started a long computation without explicitly closing the file handle first, Python is likely to leave the file handle open during your computation. And that's bad practice.
This version will work in any 2.x version of Python, and fixes both the problems I discussed above:
f = open(file, 'rt')
for line in f:
if line.startswith('#'):
continue
# PROCESS LINE HERE
f.close()
This is the best general form for older versions of Python.
As suggested by steveha, using the "with" statement is now considered best practice. If you're using 2.6 or above you should write it this way:
with open(filename, 'rt') as f:
for line in f:
if line.startswith('#'):
continue
# PROCESS LINE HERE
The "with" statement will clean up the file handle for you.
In your question you said "lines that start with #", so that's what I've shown you here. If you want to filter out lines that start with optional whitespace and then a '#', you should strip the whitespace before looking for the '#'. In that case, you should change this:
if line.startswith('#'):
to this:
if line.lstrip().startswith('#'):
In Python, strings are immutable, so this doesn't change the value of line. The lstrip() method returns a copy of the string with all its leading whitespace removed.
I've found recently that a generator function does a great job of this. I've used similar functions to skip comment lines, blank lines, etc.
I define my function as
def skip_comments(file):
for line in file:
if not line.strip().startswith('#'):
yield line
That way, I can just do
f = open('testfile')
for line in skip_comments(f):
print line
This is reusable across all my code, and I can add any additional handling/logging/etc. that I need.
I know that this is an old thread, but this is a generator function that I
use for my own purposes. It strips comments no matter where they
appear in the line, as well as stripping leading/trailing whitespace and
blank lines. The following source text:
# Comment line 1
# Comment line 2
# host01 # This host commented out.
host02 # This host not commented out.
host03
host04 # Oops! Included leading whitespace in error!
will yield:
host02
host03
host04
Here is documented code, which includes a demo:
def strip_comments(item, *, token='#'):
"""Generator. Strips comments and whitespace from input lines.
This generator strips comments, leading/trailing whitespace, and
blank lines from its input.
Arguments:
item (obj): Object to strip comments from.
token (str, optional): Comment delimiter. Defaults to ``#``.
Yields:
str: Next uncommented non-blank line from ``item`` with
comments and leading/trailing whitespace stripped.
"""
for line in item:
s = line.split(token, 1)[0].strip()
if s:
yield s
if __name__ == '__main__':
HOSTS = """# Comment line 1
# Comment line 2
# host01 # This host commented out.
host02 # This host not commented out.
host03
host04 # Oops! Included leading whitespace in error!""".split('\n')
hosts = strip_comments(HOSTS)
print('\n'.join(h for h in hosts))
The normal use case will be to strip the comments from a file (i.e., a hosts file, as in my example above). If this is the case, then the tail end of the above code would be modified to:
if __name__ == '__main__':
with open('aa.txt', 'r') as f:
hosts = strip_comments(f)
for host in hosts:
print('\'%s\'' % host)
A more compact version of a filtering expression can also look like this:
for line in (l for l in open(filename) if not l.startswith('#')):
# do something with line
(l for ... ) is called "generator expression" which acts here as a wrapping iterator that will filter out all unneeded lines from file while iterating over it. Don't confuse it with the same thing in square brakets [l for ... ] which is a "list comprehension" that will first read all the lines from the file into memory and only then will start iterating over it.
Sometimes you might want to have it less one-liney and more readable:
lines = open(filename)
lines = (l for l in lines if ... )
# more filters and mappings you might want
for line in lines:
# do something with line
All the filters will be executed on the fly in one iteration.
Use regex re.compile("^(?:\s+)*#|(?:\s+)") to skip the new lines and comments.
I tend to use
for line in lines:
if '#' not in line:
#do something
This will ignore the whole line, though the answer which includes rpartition has my upvote as it can include any information from before the #
a good thing to get rid of coments that works for both inline and on a line
def clear_coments(f):
new_text = ''
for line in f.readlines():
if "#" in line: line = line.split("#")[0]
new_text += line
return new_text

Categories