Reading CSV files in numpy where delimiter is "," - python

I've got a CSV file with a format that looks like this:
"FieldName1", "FieldName2", "FieldName3", "FieldName4"
"04/13/2010 14:45:07.008", "7.59484916392", "10", "6.552373"
"04/13/2010 14:45:22.010", "6.55478493312", "9", "3.5378543"
...
Note that there are double quote characters at the start and end of each line in the CSV file, and the "," string is used to delimit fields within each line. The number of fields in the CSV file can vary from file to file.
When I try to read this into numpy via:
import numpy as np
data = np.genfromtxt(csvfile, dtype=None, delimiter=',', names=True)
all the data gets read in as string values, surrounded by double-quote characters. Not unreasonable, but not much use to me, as I then have to go back and convert every column to its correct type.
When I use delimiter='","' instead, everything works as I'd like, except for the 1st and last fields. As the start-of-line and end-of-line characters are a single double-quote character, this isn't seen as a valid delimiter for the 1st and last fields, so they get read in as e.g. "04/13/2010 14:45:07.008 and 6.552373" - note the leading and trailing double-quote characters respectively. Because of these redundant characters, numpy assumes the 1st and last fields are both String types; I don't want that to be the case.
Is there a way of instructing numpy to read in files formatted in this fashion as I'd like, without having to go back and "fix" the structure of the numpy array after the initial read?

The basic problem is that NumPy doesn't understand the concept of stripping quotes (whereas the csv module does). When you say delimiter='","', you're telling NumPy that the column delimiter is literally a quoted comma, i.e. the quotes are around the comma, not the value, so the extra quotes you get on the first and last columns are expected.
Looking at the function docs, I think you'll need to set the converters parameter to strip quotes for you (the default does not):
import re
import numpy as np

fieldFilter = re.compile(r'^"?([^"]*)"?$')

def filterTheField(s):
    m = fieldFilter.match(s.strip())
    if m:
        return float(m.group(1))
    else:
        return 0.0  # or whatever default

# ...

# Yes, sorry, you have to know the number of columns, since the NumPy docs
# don't say you can specify a default converter for all columns.
convs = dict((col, filterTheField) for col in range(numColumns))

data = np.genfromtxt(csvfile, dtype=None, delimiter=',', names=True,
                     converters=convs)
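For the sample data above, column 0 is a timestamp rather than a float, so it needs its own converter. A sketch (the date format is inferred from the question's sample rows; on Python 3, pass an encoding to genfromtxt or decode the bytes the converter receives):
from datetime import datetime

def parseStamp(s):
    # genfromtxt hands converters the raw field, quotes included
    return datetime.strptime(s.strip().strip('"'), '%m/%d/%Y %H:%M:%S.%f')

convs = {0: parseStamp}
convs.update((col, filterTheField) for col in range(1, numColumns))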
Or abandon np.genfromtxt() and let csv.reader give you the file's contents a row at a time, as lists of strings; then you just iterate through the elements and build the matrix:
import csv

reader = csv.reader(csvfile)
headings = next(reader)  # grab the column headings; csv.reader has no fieldnames attribute
result = np.array([[float(col) for col in row] for row in reader])
EDIT: Okay, so it looks like your file isn't all floats. In that case, you can set convs as needed in the genfromtxt case, or create a vector of conversion functions in the csv.reader case (reusing parseStamp from above for the timestamp column):
reader = csv.reader(csvfile)
headings = next(reader)  # skip the header row before converting
converters = [parseStamp, float, int, float]
result = np.array([[conv(col) for col, conv in zip(row, converters)]
                   for row in reader])
EDIT 2: Okay, variable column count... Your data source just wants to make life difficult. Luckily, we can just use magic...
reader = csv.reader(csvfile)
headings = next(reader)  # skip the header row
result = np.array([[magic(col) for col in row] for row in reader])
... where magic() is just a name I got off the top of my head for a function. (Psyche!)
At worst, it could be something like:
def magic(s):
    if '/' in s:
        return parseStamp(s)  # the timestamp converter from above
    elif '.' in s:
        return float(s)
    else:
        return int(s)
Maybe NumPy has a function that takes a string and returns a single element with the right type. numpy.fromstring() looks close, but it might interpret the space in your timestamps as a column separator.
P.S. One downside I see with csv.reader is that it doesn't discard comments; but then, real csv files don't have comments.
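If pulling in another library is an option (this is an aside, not part of the original answer), pandas handles both the quote stripping and the type inference for you; a minimal sketch:
import pandas as pd

# read_csv strips the quotes and infers column types; skipinitialspace
# handles the space after each comma in the sample data, and parse_dates
# turns the first column into real timestamps
df = pd.read_csv(csvfile, skipinitialspace=True, parse_dates=[0])
data = df.to_records(index=False)  # back to a NumPy structured array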

Related

parsing using multiple delimiters in python

I have a data file in which the data is stored with comma, tab, and newline delimiters, like this:
[32135, 311351, 88686
123152, 3153131, 131513
....]
I want to extract an n x 3 array out of it. How can I do that?
I have tried using split() and splitlines(), but that only parsed the file partially.
import numpy as np
filename="Elem_Output.inp"
f = open(filename,"r")
pmax=f.read()
p1=pmax.split()
I expect to extract an array with every line as a row and each number in the corresponding column of the array.
After pmax=f.read(), you may want to write:
# Replace tab and newline with a comma separator
pmax = pmax.replace("\n", ",").replace("\t", ",")
# Collapse repeated delimiters into a single instance
pmax = pmax.replace(",,,", ",").replace(",,", ",")
Needless to say, this can be coded much better using regex (import re).
Secondly, if your file starts and ends with square brackets, you may want to additionally add:
pmax = pmax.replace("[","").replace("]","")
Now, if you want this output as an array instead of a list, try this:
from array import array

# "l" (signed long) suits these values; each token is parsed to int first
array_pmax = array("l", (int(x) for x in pmax.split(",") if x.strip()))
The first argument in the array() function indicates the typecode ("B", for instance, would be unsigned bytes, far too small for these numbers). To know more, just use help(array).
Hope that helps!!
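As noted above, the whole replace chain collapses into one regex split; a minimal sketch (assuming the bracketed, mixed-delimiter format from the question and the filename used there):
import re
import numpy as np

with open("Elem_Output.inp") as f:
    # split on any run of commas, tabs, newlines, or spaces,
    # after stripping the surrounding brackets
    tokens = re.split(r"[,\s]+", f.read().strip().strip("[]").strip())

data = np.array(tokens, dtype=int).reshape(-1, 3)  # the n x 3 array asked for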

Truncate a column of a csv file?

I'm new to Python and I have the following csv file (let's call it out.csv):
DATE,TIME,PRICE1,PRICE2
2017-01-15,05:44:27.363000+00:00,0.9987,1.0113
2017-01-15,13:03:46.660000+00:00,0.9987,1.0113
2017-01-15,21:25:07.320000+00:00,0.9987,1.0113
2017-01-15,21:26:46.164000+00:00,0.9987,1.0113
2017-01-16,12:40:11.593000+00:00,,1.0154
2017-01-16,12:40:11.593000+00:00,1.0004,
2017-01-16,12:43:34.696000+00:00,,1.0095
and I want to truncate the second column so the csv looks like:
DATE,TIME,PRICE1,PRICE2
2017-01-15,05:44:27,0.9987,1.0113
2017-01-15,13:03:46,0.9987,1.0113
2017-01-15,21:25:07,0.9987,1.0113
2017-01-15,21:26:46,0.9987,1.0113
2017-01-16,12:40:11,,1.0154
2017-01-16,12:40:11,1.0004,
2017-01-16,12:43:34,,1.0095
This is what I have so far..
import csv

with open('out.csv','r+b') as nL, open('outy_3.csv','w+b') as nL3:
    new_csv = []
    reader = csv.reader(nL)
    for row in reader:
        time = row[1].split('.')
        new_row = []
        new_row.append(row[0])
        new_row.append(time[0])
        new_row.append(row[2])
        new_row.append(row[3])
        print new_row
        nL3.writelines(new_row)
I can't seem to get a new line in after writing each line to the new csv file.
This definitely doesn't look or feel pythonic.
Thanks
The missing newlines issue arises because the file.writelines() method doesn't automatically add line separators to the elements of the argument it's passed, which it expects to be a sequence of strings. If those elements represent separate lines, then it's your responsibility to ensure each one ends in a newline.
However, your code tries to use it to output only a single line at a time. To fix that, use file.write() instead, because it expects its argument to be a single string; if you want that string to be a separate line in the file, it must end with a newline (or have one manually added to it).
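A minimal illustration of the difference:
with open('demo.txt', 'w') as f:
    f.writelines(['a', 'b'])  # writes 'ab' -- no newlines are added
    f.write('c\n')            # writes exactly the string you pass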
Below is code that does what you want. It works by modifying one of the elements of the list of strings that csv.reader returns in place, join()ing the list back into a single string, and then manually adding a newline to the end of the result (stored in new_row) before writing it out.
import csv

with open('out.csv','rb') as nL, open('outy_3.csv','wt') as nL3:
    for row in csv.reader(nL):
        time_col = row[1]
        try:
            period_location = time_col.index('.')
            row[1] = time_col[:period_location]  # only keep characters in front of the period
        except ValueError:  # no period character found
            pass  # leave row unchanged
        new_row = ','.join(row)
        print(new_row)
        nL3.write(new_row + '\n')
Printed (and file) output:
DATE,TIME,PRICE1,PRICE2
2017-01-15,05:44:27,0.9987,1.0113
2017-01-15,13:03:46,0.9987,1.0113
2017-01-15,21:25:07,0.9987,1.0113
2017-01-15,21:26:46,0.9987,1.0113
2017-01-16,12:40:11,,1.0154
2017-01-16,12:40:11,1.0004,
2017-01-16,12:43:34,,1.0095
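As an aside, csv.writer could handle the line endings (and any needed quoting) for you; a sketch in the same Python 2 style:
import csv

with open('out.csv', 'rb') as nL, open('outy_3.csv', 'wb') as nL3:
    writer = csv.writer(nL3)
    for row in csv.reader(nL):
        row[1] = row[1].split('.')[0]  # drop the fractional seconds
        writer.writerow(row)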

Converting a modified JSON to CSV using python

I know this question has been asked before, but never with the following caveats:
I'm a complete python n00b. Also a JSON noob.
The JSON file / string is not the same as those seen in json2csv examples.
The CSV file output is supposed to have standard columns.
Due to point number 1, I'm not aware of most terminologies and technologies used for this. So please bear with me.
Point number 2: Here's a single line of the supposed JSON file:
"id":"123456","about":"YESH","can_post":true,"category":"Community","checkins":0,"description":"OLE!","has_added_app":false,"is_community_page":false,"is_published":true,"likes":48,"link":"www.fake.com","name":"Test Name","parking":{"lot":0,"street":0,"valet":0},"talking_about_count":0,"website":"www.fake.com/blog","were_here_count":0^
Weird, I know - it lacks braces and brackets and stuff. Which is why I'm convinced posted solutions won't work.
I'm not sure what the 0^ at the end of the line is, but I see it at the end of every line. I'm assuming the 0 is the value for "were_here_count" while the ^ is a... line terminator? EDIT: Apparently, I can just disregard it.
Of note is that the value of "parking" appears to be yet another hash nested inside - I'm fine with just displaying it as is (minus the double quotes).
Point number 3: Here's the columns of the supposed CSV file output. This is the complete column set - the JSON file won't always have them all.
ID STRING,
ABOUT STRING,
ATTIRE STRING,
BAND_MEMBERS STRING,
BEST_PAGE STRING,
BIRTHDAY STRING,
BOOKING_AGENT STRING,
CAN_POST STRING,
CATEGORY STRING,
CATEGORY_LIST STRING,
CHECKINS STRING,
COMPANY_OVERVIEW STRING,
COVER STRING,
CONTEXT STRING,
CURRENT_LOCATION STRING,
DESCRIPTION STRING,
DIRECTED_BY STRING,
FOUNDED STRING,
GENERAL_INFO STRING,
GENERAL_MANAGER STRING,
GLOBAL_BRAND_PARENT_PAGE STRING,
HOMETOWN STRING,
HOURS STRING,
IS_PERMANENTLY_CLOSED STRING,
IS_PUBLISHED STRING,
IS_UNCLAIMED STRING,
LIKES STRING,
LINK STRING,
LOCATION STRING,
MISSION STRING,
NAME STRING,
PARKING STRING,
PHONE STRING,
PRESS_CONTACT STRING,
PRICE_RANGE STRING,
PRODUCTS STRING,
RESTAURANT_SERVICES STRING,
RESTAURANT_SPECIALTIES STRING,
TALKING_ABOUT_COUNT STRING,
USERNAME STRING,
WEBSITE STRING,
WERE_HERE_COUNT STRING
Here's my code so far:
import os

num = 1
inPath = "./fb-data_input/"
outPath = "./fb-data_output/"

# Get list of files, put them in fileNameList array
fileNameList = os.listdir(inPath)

# Process per file
for item in fileNameList:
    print("Processing: " + item)
    fb_inputFile = open(inPath + item, "rb").read().split("\n")
    fb_outputFile = open(outPath + "fbdata-IAB-output" + str(num), "wb")
    num += 1
    jsonString = fb_inputFile.split("\",\"")
    jsonField = jsonString[0]
    jsonValue = jsonString[1]
    jsonHash[?] = [?,?]
    # Do Code stuff here
Up until the for loop, it just loads the json file names into an array, and then processes them one by one.
Here's my logic for the rest of the code:
Split the json string by something. Perhaps the "," so that other commas won't get split.
Store it into a hashmap / 2D array (dynamic?)
Trim away the JSON fields and the first and/or last double quotes.
Add the resulting output to another hashmap, with those set columns, putting in null in a column that the JSON file does not have.
And then I output the result to a CSV.
It sounds logical in my head, but I'm pretty sure there's something I missed. And of course, I have a hard time putting it in code.
Can I have some help on this? Thanks.
P.S.
Additional information:
OS: Mac OSX
Target platform OS: Ubuntu of some sort
Here is a full solution (Python 2), based on your original code:
import os
import json
from csv import DictWriter
import codecs

def get_columns():
    columns = []
    with open("columns.txt") as f:
        columns = [line.split()[0] for line in f if line.strip()]
    return columns

if __name__ == "__main__":
    in_path = "./fb-data_input/"
    out_path = "./fb-data_output/"
    columns = get_columns()
    bad_keys = ("has_added_app", "is_community_page")
    for filename in os.listdir(in_path):
        json_filename = os.path.join(in_path, filename)
        csv_filename = os.path.join(out_path, "%s.csv" % (os.path.basename(filename)))
        with open(json_filename) as f, open(csv_filename, "wb") as csv_file:
            csv_file.write(codecs.BOM_UTF8)
            csv = DictWriter(csv_file, columns)
            csv.writeheader()
            for line_number, line in enumerate(f, start=1):
                try:
                    data = json.loads("{%s}" % (line.strip().strip('^')))
                    # fix parking column
                    if "parking" in data:
                        data['parking'] = ", ".join("%s: %s" % (k, str(v)) for k, v in data['parking'].items())
                    data = {k.upper(): unicode(v).encode('utf8') for k, v in data.items() if k not in bad_keys}
                except Exception, e:
                    import traceback
                    traceback.print_exc()
                    data = {columns[0]: "Error on line %s of %s: %s" % (line_number, json_filename, e)}
                csv.writerow(data)
Edited: Full unicode support plus extended error information.
So, first off, your string is valid json if you just add curly braces around it. You can then deserialize it with Python's json library. Set up your csv columns as a dictionary with each of them pointing to whatever you want as a default value (None? ""? your choice). Once you've deserialized the json to a dict, just loop through each key there and fill in the csv_cols dict as appropriate. Then just use Python's csv module to write it out:
import json
import csv

string = '"id":"123456","about":"YESH","can_post":true,"category":"Community","checkins":0,"description":"OLE!","has_added_app":false,"is_community_page":false,"is_published":true,"likes":48,"link":"www.fake.com","name":"Test Name","parking":{"lot":0,"street":0,"valet":0},"talking_about_count":0,"website":"www.fake.com/blog","were_here_count":0^'
string = '{%s}' % string[:-1]  # drop the trailing '^' and wrap in braces
json_dict = json.loads(string)

# make 'parking' a string. I'm assuming that's your only hash.
json_dict['parking'] = json.dumps(json_dict['parking'])

csv_cols_list = ['a', 'b', 'c']  # put your actual csv columns here
csv_cols = {col: '' for col in csv_cols_list}

for k, v in json_dict.iteritems():  # use .items() on Python 3
    if k in csv_cols:
        csv_cols[k] = v
# now just write to csv using Python's csv library
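The elided write step might look like this (the output filename is hypothetical):
with open('output.csv', 'wb') as f:  # 'w' with newline='' on Python 3
    writer = csv.writer(f)
    writer.writerow(csv_cols_list)                             # header row
    writer.writerow([csv_cols[col] for col in csv_cols_list])  # one data row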
Note: this is a general answer that assumes that your "json" will be valid key/value pairs. Your "parking" key is a special case you'll need to deal with somehow. I left it as is because I don't know what you want with it. I'm also assuming the '^' at the end of your string was a typo.
[EDIT] Changed to account for parking and the '^' at the end. [/EDIT]
Either way, the general idea here is what you want.
The first thing is that your input is not JSON. It's just a delimited string in which each column name and value is quoted.
Here is a solution that would work:
import csv

columns = ['ID', 'ABOUT', ... ]

with open('input_file.txt', 'r') as f, open('output_file.txt', 'w') as o:
    reader = csv.reader(f, delimiter=',')
    writer = csv.writer(o, delimiter=',')
    writer.writerow(columns)
    for row in reader:
        # each field looks like 'key:"value"'; split on the first colon only
        data = {k.upper(): v.strip('"') for k, v in (col.split(':', 1) for col in row)}
        row = [data.get(v, '') for v in columns]
        writer.writerow(row)
In this loop, for each line we read from the input file, a dictionary is created: the key is the part before the colon in each 'key:"value"' pair, converted to upper case, and the value is the rest, with the surrounding quotes stripped.
Next, for each column, we try to fetch a value from this dictionary in the order that the columns are written out. If a value for the column doesn't exist, a blank '' is returned. These values are collected in a list row. This makes sure no matter how many columns are missing, we write an equal number of columns to the output.

loading strings with spaces as numpy array

I would like to load a csv file as a numpy array. Each row contains string fields with spaces.
I tried both the loadtxt() and genfromtxt() methods available in numpy. By default, both methods consider space a delimiter and separate each word of the string into a separate column. Is there any way to load this sort of data using loadtxt() or genfromtxt(), or will I have to write my own code for it?
Sample row from my file:
826##25733##Emanuele Buratti## ##Mammalian cell expression
Here ## is the delimiter and space denotes missing values.
I think your problem is that the default comments character # is conflicting with your delimiter. I was able to load your data like this:
>>> import numpy as np
>>> np.loadtxt('/tmp/sample.txt', dtype=str, delimiter='##', comments=None)
array(['826', '25733', 'Emanuele Buratti', ' ', 'Mammalian cell expression'],
dtype='|S25')
You can see that the dtype has been automatically set to whatever the maximum string length was. You can use dtype=object if that is troublesome. (Note that newer NumPy releases, 1.23 and later, restrict loadtxt's delimiter to a single character, so the multi-character '##' only works on older versions.) As an aside, since your data is not numeric, I would probably recommend using the csv module rather than numpy for this job.
Here is the csv equivalent, as wim suggested:
import csv

with open('somefile.txt') as f:
    reader = csv.reader(f, delimiter='##')
    rows = list(reader)
As #wim pointed out in the comments, this doesn't actually work, since the delimiter must be a single character. If you change the above to delimiter='#', you get this as the result:
[['826', '', '25733', '', 'Emanuele Buratti', '', ' ', '', 'Mammalian cell expression']]
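Since csv can't take the two-character delimiter, a plain str.split sketch covers this file (the filename is hypothetical; a lone space is treated as a missing value, as in the question):
with open('somefile.txt') as f:
    rows = [[field if field.strip() else None
             for field in line.rstrip('\n').split('##')]
            for line in f]
# rows[0] == ['826', '25733', 'Emanuele Buratti', None, 'Mammalian cell expression']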

How to use python csv module for splitting double pipe delimited data

I have got data which looks like:
"1234"||"abcd"||"a1s1"
I am trying to read and write using Python's csv reader and writer.
As the csv module's delimiter is limited to a single character, is there any way to retrieve the data cleanly? I cannot afford to remove the empty columns, as it is a massively huge data set that must be processed in a time-bound manner. Any thoughts will be helpful.
The docs and experimentation prove that only single-character delimiters are allowed.
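For instance, on Python 3 (the exact message varies by version):
>>> import csv
>>> csv.reader([], delimiter='||')
Traceback (most recent call last):
  ...
TypeError: "delimiter" must be a 1-character string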
Since csv.reader accepts any object that supports the iterator protocol, you can use a generator expression to replace ||-s with |-s, and then feed this generator to the reader:
import csv

def read_this_funky_csv(source):
    # be sure to pass a source object that supports
    # iteration (e.g. a file object, or a list of csv text lines)
    return csv.reader((line.replace('||', '|') for line in source), delimiter='|')
This code is pretty effective since it operates on one CSV line at a time, provided your CSV source yields lines that do not exceed your available RAM :)
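A usage sketch (the filename is hypothetical):
with open('data.txt') as f:
    for row in read_this_funky_csv(f):
        print(row)  # e.g. ['1234', 'abcd', 'a1s1']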
>>> import csv
>>> reader = csv.reader(['"1234"||"abcd"||"a1s1"'], delimiter='|')
>>> for row in reader:
... assert not ''.join(row[1::2])
... row = row[0::2]
... print row
...
['1234', 'abcd', 'a1s1']
>>>
Unfortunately, delimiter is represented by a character in C. This means that it is impossible to have it be anything other than a single character in Python. The good news is that it is possible to ignore the values which are null:
import csv

reader = csv.reader(['"1234"||"abcd"||"a1s1"'], delimiter='|')
# iterate through the reader
for x in reader:
    # use a numeric range here to ensure that you eliminate the
    # right things: odd indexes hold the empty strings produced by
    # the doubled delimiter, even indexes hold the values you want
    values = [x[i] for i in range(len(x)) if i % 2 == 0]
There are other ways to accomplish this (a function could be written, for one), but this gives you the logic which is needed.
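The function alluded to could be as simple as this sketch, using a slice instead of the index test:
def drop_empty_columns(row):
    # even-indexed fields are real; odd-indexed ones are the empty
    # strings the doubled '||' delimiter produces
    return row[0::2]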
If your data literally looks like the example (the fields never contain '||' and are always quoted), and you can tolerate the quote marks, or are willing to slice them off later, just use .split
>>> '"1234"||"abcd"||"a1s1"'.split('||')
['"1234"', '"abcd"', '"a1s1"']
>>> list(s[1:-1] for s in '"1234"||"abcd"||"a1s1"'.split('||'))
['1234', 'abcd', 'a1s1']
csv is only needed if the delimiter is found within the fields, or to delete optional quotes around fields.
