Python: working with a CSV file that has 2 delimiters

I have a program which outputs data into a CSV file. These files use 2 delimiters: , between fields and " around text, and the text itself also contains commas.
How can I work with these 2 delimiters?
My current code gives me a list index out of range error. If the CSV file is needed, I can provide it.
Current code:
def readcsv():
    with open('pythontest.csv') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=',"')
        csvfile.seek(0)
        reader = csv.reader(csvfile, dialect)
        for row in reader:
            asset_ip_addresses.append(row[0])
            service_protocollen.append(row[1])
            service_porten.append(row[2])
            vurn_cvssen.append(row[3])
            vurn_risk_scores.append(row[4])
            vurn_descriptions.append(row[5])
            vurn_cve_urls.append(row[6])
            vurn_solutions.append(row[7])
The CSV file I'm working with: http://www.pastebin.com/bUbDC419
It seems to have problems handling the second line. If I append the rows to a list, the first row looks fine, but the second row comes through as one whole string and is no longer separated at the commas.
I guess it has something to do with the line breaks ("enters").

I don't think you should need to define a custom dialect, unless I'm missing something.
The official documentation shows you can pass quotechar as a keyword argument to csv.reader(). The example from the documentation, adapted to your code:
import csv

with open('pythontest.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        print(row)  # do something with the row here
Each row is a list of strings, one for each item in the row, with the " quotes removed.
The index out of range error suggests that one of the row[x] indexes cannot be accessed, i.e. that some row has fewer fields than expected.
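If the quoted text can also contain line breaks (which would explain the trouble with the second line), the csv documentation recommends opening the file with newline=''. A minimal sketch combining that with a guard against short rows (the 8-column layout is taken from your code, and the guard is only a suggestion):

import csv

def readcsv():
    rows = []
    # newline='' lets the csv module handle newlines embedded inside quoted fields
    with open('pythontest.csv', newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        for row in reader:
            if len(row) < 8:  # skip blank or malformed rows instead of raising IndexError
                continue
            rows.append(row)
    return rows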

OK, I think I understand what kind of file you are reading... let's say the content of your CSV file looks like this
192.168.12.255,"Great site, a lot of good, recommended",0,"Last, first, middle"
192.168.0.255,"About cats, dogs, must visit!",1,"One, two, three"
Here is code that will allow you to read it line by line; text in quotes is kept as a single list element and is not split at its internal commas. The parameter you need is quoting=csv.QUOTE_ALL:
import csv

with open('students.csv', newline='') as f:
    reader = csv.reader(f, delimiter=',', quoting=csv.QUOTE_ALL)
    for row in reader:
        print(row[0])
        print(row[1])
        print(row[2])
        print(row[3])
The printed output will look like this
192.168.12.255
Great site, a lot of good, recommended
0
Last, first, middle
192.168.0.255
About cats, dogs, must visit!
1
One, two, three
P.S. This solution is based on the official documentation, see here: https://docs.python.org/3/library/csv.html
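For what it's worth, here is a minimal sketch of the writer side of the same option (the filename and rows are just the ones from this example); with QUOTE_ALL every field, not only the text ones, gets wrapped in double quotes, so embedded commas stay inside one field:

import csv

rows = [
    ['192.168.12.255', 'Great site, a lot of good, recommended', '0', 'Last, first, middle'],
    ['192.168.0.255', 'About cats, dogs, must visit!', '1', 'One, two, three'],
]

with open('students.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',', quoting=csv.QUOTE_ALL)
    writer.writerows(rows)  # every field is written as "..."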

How about a quick solution like this: a quick fix that would split a CSV row like a,"b,c",d into the strings a, b, c, d.
def readcsv():
    with open('pythontest.csv') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=',"')
        csvfile.seek(0)
        reader = csv.reader(csvfile, dialect)
        for rowx in reader:
            row = [e.split(',') if isinstance(e, str) else e for e in rowx]
            # do your stuff on row
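Note that e.split(',') returns a list, so row ends up as a list of lists; if a flat list of strings is what you are after, a flattening pass along these lines (an untested sketch) would do it:

# rowx might be ['a', 'b,c', 'd']; this flattens it to ['a', 'b', 'c', 'd']
flat_row = [part for e in rowx for part in (e.split(',') if isinstance(e, str) else [e])]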

Related

Python reading in integers from a csv file into a list

I am having some trouble trying to read a particular column in a csv file into a list in Python. Below is an example of my csv file:
Col 1 Col 2
1,000,000 1
500,000 2
250,000 3
Basically I want to add column 1 to a list as integer values, and I am having a lot of trouble doing so. I have tried:
for row in csv.reader(csvfile):
    list = [int(row.split(',')[0]) for row in csvfile]
However, I get a ValueError that says "invalid literal for int() with base 10: '"1'
I then tried:
for row in csv.reader(csvfile):
    list = [(row.split(',')[0]) for row in csvfile]
This time I don't get an error; however, I get the list:
['"1', '"500', '"250']
I have also tried changing the delimiter:
for row in csv.reader(csvfile):
    list = [(row.split(' ')[0]) for row in csvfile]
This almost gives me the desired list; however, the list includes the second column as well as a "\n" after each value:
['"1,000,000", 1\n', etc...]
If anyone could help me fix this it would be greatly appreciated!
Cheers
You should choose your delimiter wisely: if your numbers use . as the decimal mark, use , as the delimiter; if you use , inside your numbers, use ; as the delimiter.
Moreover, as described in the documentation for csv.reader, you can use the delimiter= argument to define your delimiter, like so:
with open('myfile.csv', 'r') as csvfile:
    mylist = []
    for row in csv.reader(csvfile, delimiter=';'):
        mylist.append(row[0])  # careful here with [0]
or short version:
with open('myfile.csv', 'r') as csvfile:
    mylist = [row[0] for row in csv.reader(csvfile, delimiter=';')]
To parse your number to a float, you will have to do
float(row[0].replace(',', ''))
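Putting the two pieces together, a minimal sketch (assuming a semicolon-delimited file named myfile.csv whose first column holds numbers formatted like 1,000,000):

import csv

with open('myfile.csv', 'r') as csvfile:
    # strip the thousands separators before converting to a number
    numbers = [float(row[0].replace(',', '')) for row in csv.reader(csvfile, delimiter=';')]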
You can open the file and split at the space using regular expressions:
import re

file_data = [re.split(r'\s+', i.strip('\n')) for i in open('filename.csv')]
final_data = [int(i[0]) for i in file_data[1:]]
First of all, you must parse your data correctly. It is not, in fact, CSV (Comma-Separated Values) but rather TSV (Tab-Separated), and you should tell the csv reader about that (I'm assuming it's tab, but you can theoretically use any whitespace with a few tweaks):
for row in csv.reader(csvfile, delimiter="\t"):
Second of all, you should strip your integer values of any commas as they don't add new information. After that, they can be easily parsed with int():
int(row[0].replace(',', ''))
Third of all, you really should not iterate over the same data twice, with both an outer for loop and a list comprehension using the same variable. Use either a list comprehension or a normal for loop, not both at once. For example, with a list comprehension:
import csv
from io import StringIO

csvfile = StringIO("Col 1\tCol 2\n1,000,000\t1\n500,000\t2\n250,000\t3\n")
reader = csv.reader(csvfile, delimiter="\t")
next(reader, None)  # skip the header
lst = [int(row[0].replace(',', '')) for row in reader]
Or with normal iteration:
csvfile = StringIO("Col 1\tCol 2\n1,000,000\t1\n500,000\t2\n250,000\t3\n")
reader = csv.reader(csvfile, delimiter="\t")
lst = []
for i, row in enumerate(reader):
    if i == 0:
        continue  # your custom header-handling code here
    lst.append(int(row[0].replace(',', '')))
In both cases, lst is set to [1000000, 500000, 250000] as it should. Enjoy.
By the way, using the built-in name list as a variable is an extremely bad idea.
UPDATE. There's one more option that I find interesting. Instead of setting the delimiter explicitly you can use csv.Sniffer to detect it e.g.:
csvdata = "Col 1\tCol 2\n1,000,000\t1\n500,000\t2\n250,000\t3\n"
csvfile = StringIO(csvdata)
dialect = csv.Sniffer().sniff(csvdata)
reader = csv.reader(csvfile, dialect=dialect)
and then proceed just like in the snippets above. This will keep working even if you replace the tabs with semicolons or commas (the latter would require quotes around your comma-formatted integers), or possibly something else.
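A self-contained version of that Sniffer variant might look like this (StringIO is used only so the snippet runs on its own; with a real file you would sniff a sample of it and then seek back to the start):

import csv
from io import StringIO

csvdata = "Col 1\tCol 2\n1,000,000\t1\n500,000\t2\n250,000\t3\n"
csvfile = StringIO(csvdata)

dialect = csv.Sniffer().sniff(csvdata)   # let csv guess the delimiter
reader = csv.reader(csvfile, dialect=dialect)
next(reader, None)                       # skip the header
lst = [int(row[0].replace(',', '')) for row in reader]
print(lst)                               # [1000000, 500000, 250000]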

Row in Excel to array?

I have lots of data in an Excel spreadsheet that I need to import using Python. I need each row to be read as an array so I can call on the first data point in a specified row, the second, the third, and so on.
This is my code so far:
from array import *
import csv

with open('vals.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ', quotechar='|')

    reader_x = []
    reader_y = []
    reader_z = []

    row = next(reader)
    reader_x.append(row)
    row = next(reader)
    reader_y.append(row)
    row = next(reader)
    reader_z.append(row)

    print reader_x
    print reader_y
    print reader_z
    print reader_x[0]
I think it is definitely storing it as an array, but it seems to be storing the entire Excel row as one string instead of each cell being a separate data point, because when I tell Python to print an entire array it looks something like this (a shortened version, because there are like a thousand values in each row):
[['13,14,12']]
And when I tell it to print reader_x[0] (or any of the other two for that matter) it looks like this:
['13,14,12']
But when I tell it to print anything beyond the 0th thing in the array, it just gives me an error because it's out of range.
How can I fix this? How can I make it [13,14,12] instead of ['13,14,12'] so I can actually use these numbers in calculation? (I want to avoid downloading any more libraries if I can because this is for a school thing and I need to avoid that if possible.)
I have been stuck on this for several days and nothing I can find has worked for me and half of it I didn't even understand. Please try to explain simply if you can, as if you're talking to someone who doesn't even know how to print "Hello World".
You can use split to do this and use , as a separator.
For example:
row = '11,12,13'
row = row.split(',')
It is a CSV (comma-separated values) file, so try setting the delimiter to ','.
You don't need from array import * ... What the rest of the world calls an array is called a list in Python. The Python array is rather specialised and you are not actually using it so just delete that line of code.
As others have pointed out, you need incoming lines to be split. The csv default delimiter is a comma. Just let csv.reader do the job, something like this:
reader = csv.reader(csvfile)
data = [map(int, row) for row in reader]
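One caveat: map returns a list only on Python 2; on Python 3, data would end up as a list of map objects. A rough Python 3 equivalent, assuming every field in vals.csv is an integer:

import csv

with open('vals.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)  # default delimiter is ','
    data = [[int(value) for value in row] for row in reader]

print(data[0])  # first row as a list of ints, e.g. [13, 14, 12]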

How to search CSV line for string in certain column, print entire line to file if found

Sorry, very much a beginner with Python and could really use some help.
I have a large CSV file, items separated by commas, that I'm trying to go through with Python. Here is an example of a line in the CSV.
123123,JOHN SMITH,SMITH FARMS,A,N,N,12345 123 AVE,CITY,NE,68355,US,12345 123 AVE,CITY,NE,68355,US,(123) 555-5555,(321) 555-5555,JSMITH#HOTMAIL.COM,15-JUL-16,11111,2013,22-DEC-93,NE,2,1\par
I'd like my code to scan each line and look at only the 9th item (the state). For every line that matches my query, I'd like that entire line to be written to a new CSV.
The problem I have is that my code finds every occurrence of my query anywhere in the line, instead of just in the 9th item. For example, if I scan for "NE", it will write the above line to my CSV, but also any line that contains the string "NEARY ROAD."
Sorry if my terminology is off, again, I'm a beginner. Any help would be greatly appreciated.
I've listed my coding below:
import csv

with open('Sample.csv', 'rb') as f, open('NE_Sample.csv', 'wb') as outf:
    reader = csv.reader(f, delimiter=',')
    writer = csv.writer(outf)
    for line in f:
        if "NE" in line:
            print ('Found: []'.format(line))
            writer.writerow([line])
You're not actually using your reader to read the input CSV, you're just reading the raw lines from the file itself.
A fixed version looks like the following (untested):
import csv

with open('Sample.csv', 'rb') as f, open('NE_Sample.csv', 'wb') as outf:
    reader = csv.reader(f, delimiter=',')
    writer = csv.writer(outf)
    for row in reader:
        if row[8] == 'NE':
            print ('Found: {}'.format(row))
            writer.writerow(row)
The changes are as follows:
Instead of iterating over the input file's lines, we iterate over the rows parsed by the reader (each of which is a list of each of the values in the row).
We check to see if the 9th item in the row (i.e. row[8]) is equal to "NE".
If so, we output that row to the output file by passing it in, as-is, to the writer's writerow method.
I also fixed a typo in your print statement - the format method uses braces (not square brackets) to mark replacement locations.
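A side note: 'rb'/'wb' are Python 2 file modes; on Python 3 the csv module expects text-mode files opened with newline=''. A rough Python 3 version of the same idea (with a length guard added so short or blank rows do not raise an IndexError):

import csv

with open('Sample.csv', newline='') as f, open('NE_Sample.csv', 'w', newline='') as outf:
    reader = csv.reader(f)
    writer = csv.writer(outf)
    for row in reader:
        if len(row) > 8 and row[8] == 'NE':  # the 9th column is the state
            writer.writerow(row)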
This snippet should solve your problem:
import csv

with open('Sample.csv', 'rb') as f, open('NE_Sample.csv', 'wb') as outf:
    reader = csv.reader(f, delimiter=',')
    writer = csv.writer(outf)
    for row in reader:
        if "NE" in row:
            print ('Found: {}'.format(row))
            writer.writerow(row)
if "NE" in line in your code is trying to find out whether "NE" is a substring of string line, which works not as intended. The lines are raw lines of your input file.
If you use if "NE" in row: where row is parsed line of your input file, you are doing exact element matching.
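A tiny illustration of the difference, with made-up values:

line = '123123,JOHN SMITH,NEARY ROAD,KS'
row = ['123123', 'JOHN SMITH', 'NEARY ROAD', 'KS']

print('NE' in line)  # True  - substring match: "NEARY ROAD" contains "NE"
print('NE' in row)   # False - no field is exactly equal to 'NE'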

Python csv reader: null/empty value at end of line not being parsed

I have a tab delimited file with lines of data as such:
8600tab8661tab000000000003148415tab10037-434tabXEOL
8600tab8662tab000000000003076447tab6134505tabEOL
8600tab8661tab000000000003426726tab470005-063tabXEOL
There should be 5 fields with the possibility of the last field having a value 'X' or being empty as shown above.
I am trying to parse this file in Python (2.7) using the csv reader module as such:
file = open(fname)
reader = csv.reader(file, delimiter='\t', quoting=csv.QUOTE_NONE)
for row in reader:
    for i in range(5):  # there are 5 fields
        print row[i]    # this fails if there is no 'X' in the last column
                        # (index out of bounds error)
If the last column is empty the row structure will end up looking like:
list: ['8600', '8662', '000000000003076447', '6134505']
So when row[4] is called, the error follows..
I was hoping for something like this:
list: ['8600', '8662', '000000000003076447', '6134505', '']
This problem only seems to occur if the very last column is empty. I have been looking through the reader arguments and dialect options to see if there is a simple option I can pass to csv.reader to fix the way it handles an empty field at the end of the line. So far no luck.
Any help will be much appreciated!
The easiest option would be to check the length of the row beforehand. If the length is 4, append an empty string to your list.
for row in reader:
    if len(row) == 4:
        row.append('')
    for i in range(5):
        print row[i]
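An alternative sketch that avoids the manual length check is csv.DictReader, whose restval parameter fills in missing trailing fields (the field names below are made up):

import csv

fieldnames = ['a', 'b', 'c', 'd', 'flag']  # hypothetical names for the 5 columns
with open(fname) as f:
    reader = csv.DictReader(f, fieldnames=fieldnames, delimiter='\t',
                            quoting=csv.QUOTE_NONE, restval='')
    for row in reader:
        print(row['flag'])  # '' when the last column is empty, 'X' otherwise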
There was a minor PEBCAK on my part. I was going back and forth between editing the file in Notepad++ and Gvim. At some point I lost my last tab on the end. I fixed the file and it parsed as expected.

Python: Read fields of CSV File with a list of list

I am just wondering how I can read a specific field from a CSV file with the following structure:
40.0070222,116.2968604,2008-10-28,[["route"], ["sublocality","political"]]
39.9759505,116.3272935,2008-10-29,[["route"], ["establishment"], ["sublocality", "political"]]
The way I usually read CSV files is:
with open('routes/stayedStoppoints', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',', quotechar='"')
The problem is that the first 3 fields are no problem; inside
for row in spamreader:
I can access row[0], row[1], and row[2] without issue. But for the last field, I guess csv.reader(csvfile, delimiter=',', quotechar='"') also splits each sub-list, so when I try to access it, it just shows me:
[["route"]
Does anyone have a solution for handling the last field as a full list (a list of lists, in fact),
[["route"], ["sublocality","political"]]
so that I can access each category?
Thanks
Your format is close to json. You only need to wrap each line in brackets, and to quote the dates.
For each line l just do:
lst = json.loads(re.sub(r'([0-9]+-[0-9]+-[0-9]+)', r'"\1"', '[%s]' % l))
results in lst being
[40.0070222, 116.2968604, u'2008-10-28', [[u'route'], [u'sublocality', u'political']]]
You need to import the json parser and regular expressions
import json
import re
Edit: you asked how to access the element containing 'route'. The answer is
lst[3][0][0]
'political' is at
lst[3][1][1]
If the strings ('political' and the others) may themselves contain strings that look like dates, you should go with the solution by @unutbu.
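Put together as a runnable sketch (same regex and filename as above):

import json
import re

results = []
with open('routes/stayedStoppoints') as infile:
    for l in infile:
        # quote the bare date so the wrapped line becomes valid JSON
        results.append(json.loads(re.sub(r'([0-9]+-[0-9]+-[0-9]+)', r'"\1"', '[%s]' % l)))

print(results[0][3][0][0])  # 'route'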
Use line.split(',', 3) to split on just the first 3 commas:
import json

with open(filename, 'rb') as csvfile:
    for line in csvfile:
        row = line.split(',', 3)
        row[3] = json.loads(row[3])
        print(row)
yields
['40.0070222', '116.2968604', '2008-10-28', [[u'route'], [u'sublocality', u'political']]]
['39.9759505', '116.3272935', '2008-10-29', [[u'route'], [u'establishment'], [u'sublocality', u'political']]]
That is not a valid CSV file. The csv module won't be able to read this.
If the line structure is always like this (two numbers, a date, and a nested list), you can do this:
import ast

result = []
with open('routes/stayedStoppoints') as infile:
    for line in infile:
        coord_x, coord_y, datestr, objstr = line.split(",", 3)
        result.append([float(coord_x), float(coord_y),
                       datestr, ast.literal_eval(objstr)])
Result:
>>> result
[[40.0070222, 116.2968604, '2008-10-28', [['route'], ['sublocality', 'political']]],
[39.9759505, 116.3272935, '2008-10-29', [['route'], ['establishment'], ['sublocality', 'political']]]]
