parsing using multiple delimiters in python - python

I have a data file in which data is stored with comma and tab and newline delimiter like this
[32135, 311351, 88686
123152, 3153131, 131513
....]
i want to extract a nx3 array out of it
how could i do that ?
have tried using split in splitlines but it just parsed the file partially
import numpy as np
filename="Elem_Output.inp"
f = open(filename,"r")
pmax=f.read()
p1=pmax.split()
i expect to extract an array with every line a row and the numbers in each column in the arrays' column

After pmax=f.read(), you may want to write:
#Replace tab and newline as comma separater
pmax = pmax.replace("\n",",").replace("\t", ",")
#Replace repeated delimiter by a single instance
pmax = pmax.replace(",,,",",").replace(",,",",")
Needless to say, this can be coded much better using regex (import re).
Secondly, if your file starts and ends with square brackets, you may want to additionally add:
pmax = pmax.replace("[","").replace("]","")
Now, if you want this output as an array instead of list, try this:
from array import array
array_pmax = array("B", pmax)
The first argument in the array() function indicates the typecode. To know more, just use help(array)
Hope that helps!!

Related

Python CSV output - additional formatting

To start...Python noob...
My first goal is to read the first row of a CSV and output. The following code does that nicely.
import csv
csvfile = open('some.csv','rb')
csvFileArray = []
for row in csv.reader(csvfile, delimiter = ','):
csvFileArray.append(row)
print(csvFileArray[0])
Output looks like...
['Date', 'Time', 'CPU001 User%', 'CPU001 Sys%',......
My second and third tasks deal with formatting.
Thus, if I want the print(csvFileArray[0]) output to contain 'double quotes' for the delimiter how best can I handle that?
I'd like to see...
["Date","Time", "CPU001 User%", "CPU001 Sys%",......
I have played with formatting the csvFileArray field and all I can get it to do is to prefix or append data.
I have also looked into the 'dialect', 'quoting', etc., but am just all over the place.
My last task is to add text into each value (into the array).
Example:
["Test Date","New Time", "Red CPU001 User%", "Blue CPU001 Sys%",......
I've researched a number of methods to do this but am awash in the multiple ways.
Should I ditch the Array as this is too constraining?
Looking for direction not necessarily someone to write it for me.
Thanks.
OK.....refined the code a bit and am looking for direction, not direct solution (need to learn).
import csv
with open('ba200952fd69 - Copy.csv', newline='') as csvfile:
reader = csv.reader(csvfile)
for row in reader:
print (row)
break
The code nicely reads the first line of the CSV and outputs the first row as follows:
['Date', 'Time', 'CPU001 User%', 'CPU001 Sys%',....
If I want to add formatting to each/any item within that row, would I be performing those actions within the quotes of the print command? Example: If I wanted each item to have double-quotes, or have a prefix of 'XXXX', etc.
I have read through examples of .join type commands, etc., and am sure that there are much easier ways to format print output than I'm aware of.
Again, looking for direction, not immediate solutions.
For your first task, I'd recommend using the next function to grab the first row rather than iterating through the whole csv. Also, it might be useful to take a look at with blocks as they are the standard way of dealing with opening and closing files.
For your second question, it looks like you want to change the format of the print statement. Note that it is printing strings, which is indicated by the single quotes around each element in the array. This has nothing to do with the csv module, but simply because you are print an array of strings. To print with double quotes, you would have to reformat the print statement. You could take a look at this for some ways on doing that.
For your last question, I'd recommend looking at list comprehensions. E.g.,
["Test " + word for word in words].
If words = ["word1", "word2"], then this would return ["Test word1", "Test word2"].
Edit: If you want to add a different value to each value in the array, you could do something similar. Let prefixes be an array of prefixes you want to add to the word in words at the same index location. You could then use the list comprehension:
[prefix + " " + word for prefix, word in zip(prefixes, words)]

Truncate a column of a csv file?

I'm new to Python and I have the following csv file (let's call it out.csv):
DATE,TIME,PRICE1,PRICE2
2017-01-15,05:44:27.363000+00:00,0.9987,1.0113
2017-01-15,13:03:46.660000+00:00,0.9987,1.0113
2017-01-15,21:25:07.320000+00:00,0.9987,1.0113
2017-01-15,21:26:46.164000+00:00,0.9987,1.0113
2017-01-16,12:40:11.593000+00:00,,1.0154
2017-01-16,12:40:11.593000+00:00,1.0004,
2017-01-16,12:43:34.696000+00:00,,1.0095
and I want to truncate the second column so the csv looks like:
DATE,TIME,PRICE1,PRICE2
2017-01-15,05:44:27,0.9987,1.0113
2017-01-15,13:03:46,0.9987,1.0113
2017-01-15,21:25:07,0.9987,1.0113
2017-01-15,21:26:46,0.9987,1.0113
2017-01-16,12:40:11,,1.0154
2017-01-16,12:40:11,1.0004,
2017-01-16,12:43:34,,1.0095
This is what I have so far..
with open('out.csv','r+b') as nL, open('outy_3.csv','w+b') as nL3:
new_csv = []
reader = csv.reader(nL)
for row in reader:
time = row[1].split('.')
new_row = []
new_row.append(row[0])
new_row.append(time[0])
new_row.append(row[2])
new_row.append(row[3])
print new_row
nL3.writelines(new_row)
I can't seem to get a new line in after writing each line to the new csv file.
This definitely doesnt look or feel pythonic
Thanks
The missing newlines issue is because the file.writelines() method doesn't automatically add line separators to the elements of the argument it's passed, which it expects to be an sequence of strings. If these elements represent separate lines, then it's your responsibility to ensure each one ends in a newline.
However, your code is tries to use it to only output a single line of output. To fix that you should use file.write() instead because it expects its argument to be a single string—and if you want that string to be a separate line in the file, it must end with a newline or have one manually added to it.
Below is code that does what you want. It works by changing one of the elements of the list of strings that the csv.reader returns in-place, and then writes the modified list to the output file as single string by join()ing them all back together, and then manually adds a newline the end of the result (stored in new_row).
import csv
with open('out.csv','rb') as nL, open('outy_3.csv','wt') as nL3:
for row in csv.reader(nL):
time_col = row[1]
try:
period_location = time_col.index('.')
row[1] = time_col[:period_location] # only keep characters in front of period
except ValueError: # no period character found
pass # leave row unchanged
new_row = ','.join(row)
print(new_row)
nL3.write(new_row + '\n')
Printed (and file) output:
DATE,TIME,PRICE1,PRICE2
2017-01-15,05:44:27,0.9987,1.0113
2017-01-15,13:03:46,0.9987,1.0113
2017-01-15,21:25:07,0.9987,1.0113
2017-01-15,21:26:46,0.9987,1.0113
2017-01-16,12:40:11,,1.0154
2017-01-16,12:40:11,1.0004,
2017-01-16,12:43:34,,1.0095

loading strings with spaces as numpy array

I would like to load a csv file as a numpy array. Each row contains string fields with spaces.
I tried with both loadtxt() and genfromtxt() methods available in numpy. By default both methods consider space as a delimiter and separates each word in the string as a separate column. Is there anyway to load this sort of data using loadtxt() or genfromtxt() or will I have to write my own code for it?
Sample row from my file:
826##25733##Emanuele Buratti## ##Mammalian cell expression
Here ## is the delimiter and space denotes missing values.
I think your problem is that the default comments character # is conflicting with your delimiter. I was able to load your data like this:
>>> import numpy as np
>>> np.loadtxt('/tmp/sample.txt', dtype=str, delimiter='##', comments=None)
array(['826', '25733', 'Emanuele Buratti', ' ', 'Mammalian cell expression'],
dtype='|S25')
You can see that the dtype has been automatically set to whatever the maximum length string was. You can use dtype=object if that is troublesome. As an aside, since your data is not numeric, I would probably recommend using csv module rather than numpy for this job.
Here is the csv equivalent, as wim suggested:
import csv
with open('somefile.txt') as f:
reader = csv.reader(f, delimiter='##')
rows = list(reader)
As #wim pointed out the comments, this doesn't really work since the delimiter must be one character. So if you change the above so that delimiter='#', you get this as the result:
[['826', '', '25733', '', 'Emanuele Buratti', '', ' ', '', 'Mammalian cell expression']]

Slice specific characters in CSV using python

I have data in tab delimited format that looks like:
0/0:23:-1.03,-7.94,-83.75:69.15 0/1:34:-1.01,-11.24,-127.51:99.00 0/0:74:-1.02,-23.28,-301.81:99.00
I am only interested in the first 3 characters of each entry (ie 0/0 and 0/1). I figured the best way to do this would be to use match and the genfromtxt in numpy. This example is as far as I have gotten:
import re
csvfile = 'home/python/batch1.hg19.table'
from numpy import genfromtxt
data = genfromtxt(csvfile, delimiter="\t", dtype=None)
for i in data[1]:
m = re.match('[0-9]/[0-9]', i)
if m:
print m.group(0),
else:
print "NA",
This works for the first row of the data which but I am having a hard time figuring out how to expand it for every row of the input file.
Should I make it a function and apply it to each row seperately or is there a more pythonic way to do this?
Unless you really want to use NumPy, try this:
file = open('home/python/batch1.hg19.table')
for line in file:
for cell in line.split('\t'):
print(cell[:3])
Which just iterates through each line of the file, tokenizes the line using the tab character as the delimiter, then prints the slice of the text you are looking for.
Numpy is great when you want to load in an array of numbers.
The format you have here is too complicated for numpy to recognize, so you just get an array of strings. That's not really playing to numpy's strength.
Here's a simple way to do it without numpy:
result=[]
with open(csvfile,'r') as f:
for line in f:
row=[]
for text in line.split('\t'):
match=re.search('([0-9]/[0-9])',text)
if match:
row.append(match.group(1))
else:
row.append("NA")
result.append(row)
print(result)
yields
# [['0/0', '0/1', '0/0'], ['NA', '0/1', '0/0']]
on this data:
0/0:23:-1.03,-7.94,-83.75:69.15 0/1:34:-1.01,-11.24,-127.51:99.00 0/0:74:-1.02,-23.28,-301.81:99.00
---:23:-1.03,-7.94,-83.75:69.15 0/1:34:-1.01,-11.24,-127.51:99.00 0/0:74:-1.02,-23.28,-301.81:99.00
Its pretty easy to parse the whole file without regular expressions:
for line in open('yourfile').read().split('\n'):
for token in line.split('\t'):
print token[:3] if token else 'N\A'
I haven't written python in a while. But I would probably write it as such.
file = open("home/python/batch1.hg19.table")
for line in file:
columns = line.split("\t")
for column in columns:
print column[:3]
file.close()
Of course if you need to validate the first three characters, you'll still need the regex.

Reading CSV files in numpy where delimiter is ","

I've got a CSV file with a format that looks like this:
"FieldName1", "FieldName2", "FieldName3", "FieldName4"
"04/13/2010 14:45:07.008", "7.59484916392", "10", "6.552373"
"04/13/2010 14:45:22.010", "6.55478493312", "9", "3.5378543"
...
Note that there are double quote characters at the start and end of each line in the CSV file, and the "," string is used to delimit fields within each line. The number of fields in the CSV file can vary from file to file.
When I try to read this into numpy via:
import numpy as np
data = np.genfromtxt(csvfile, dtype=None, delimiter=',', names=True)
all the data gets read in as string values, surrounded by double-quote characters. Not unreasonable, but not much use to me as I then have to go back and convert every column to its correct type
When I use delimiter='","' instead, everything works as I'd like, except for the 1st and last fields. As the start of line and end of line characters are a single double-quote character, this isn't seen as a valid delimiter for the 1st and last fields, so they get read in as e.g. "04/13/2010 14:45:07.008 and 6.552373" - note the leading and trailing double-quote characters respectively. Because of these redundant characters, numpy assumes the 1st and last fields are both String types; I don't want that to be the case
Is there a way of instructing numpy to read in files formatted in this fashion as I'd like, without having to go back and "fix" the structure of the numpy array after the initial read?
The basic problem is that NumPy doesn't understand the concept of stripping quotes (whereas the csv module does). When you say delimiter='","', you're telling NumPy that the column delimiter is literally a quoted comma, i.e. the quotes are around the comma, not the value, so the extra quotes you get on he first and last columns are expected.
Looking at the function docs, I think you'll need to set the converters parameter to strip quotes for you (the default does not):
import re
import numpy as np
fieldFilter = re.compile(r'^"?([^"]*)"?$')
def filterTheField(s):
m = fieldFilter.match(s.strip())
if m:
return float(m.group(1))
else:
return 0.0 # or whatever default
#...
# Yes, sorry, you have to know the number of columns, since the NumPy docs
# don't say you can specify a default converter for all columns.
convs = dict((col, filterTheField) for col in range(numColumns))
data = np.genfromtxt(csvfile, dtype=None, delimiter=',', names=True,
converters=convs)
Or abandon np.genfromtxt() and let csv.csvreader give you the file's contents a row at a time, as lists of strings, then you just iterate through the elements and build the matrix:
reader = csv.csvreader(csvfile)
result = np.array([[float(col) for col in row] for row in reader])
# BTW, column headings are in reader.fieldnames at this point.
EDIT: Okay, so it looks like your file isn't all floats. In that case, you can set convs as needed in the genfromtxt case, or create a vector of conversion functions in the csv.csvreader case:
reader = csv.csvreader(csvfile)
converters = [datetime, float, int, float]
result = np.array([[conv(col) for col, conv in zip(row, converters)]
for row in reader])
# BTW, column headings are in reader.fieldnames at this point.
EDIT 2: Okay, variable column count... Your data source just wants to make life difficult. Luckily, we can just use magic...
reader = csv.csvreader(csvfile)
result = np.array([[magic(col) for col in row] for row in reader])
... where magic() is just a name I got off the top of my head for a function. (Psyche!)
At worst, it could be something like:
def magic(s):
if '/' in s:
return datetime(s)
elif '.' in s:
return float(s)
else:
return int(s)
Maybe NumPy has a function that takes a string and returns a single element with the right type. numpy.fromstring() looks close, but it might interpret the space in your timestamps as a column separator.
P.S. One downside with csvreader I see is that it doesn't discard comments; real csv files don't have comments.

Categories