Errors when copy-pasting a column from Excel - python

I want to write a function to process some data I bring from Excel. The data is essentially in an Excel column (transaction IDs). For reasons of my own convenience, I thought I'd use raw_input with copy-pasting the column from Excel, store it and run the function on that.
However, whatever I do I get errors (I actually got stuck at this very first stage of bringing in the data), and I'm pretty sure the reason is that each item is on a new line (when I use Excel's option to transpose the column to a row, I get no errors).
So, for instance, if I wanted to try and set a sample string to work with, e.g.:
some_string = "014300071432Gre
014300054037Col
014300065692ASC"
(this is the formatting you get when pasting from a column in Excel),
and just call some_string, I'd get:
File "<stdin>", line 1
al = "014300071432Gre
^
SyntaxError: EOL while scanning string literal
I tried removing the line breaks with .split(), but that didn't work.
I also tried the triple quotes I saw suggested in several threads, but that didn't work either. It only confused me more, because I thought triple quotes are used when you don't want Python to evaluate something.
I've placed some Sample Data in a Google doc.
Would really appreciate any help.
Thanks!

You're right that the difficulty in using raw_input with a copied column of Excel data is the newlines. The issue is that raw_input specifically reads one line. From the official docs:
raw_input([prompt])
If the prompt argument is present, it is written to standard output without a trailing newline. The function then reads a line from input, converts it to a string (stripping a trailing newline), and returns that.
By definition, a newline character marks the end of a line. So there really isn't a simple way to paste a column of Excel data into raw_input.
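If you do want to stay at the interactive prompt, one workaround is to call raw_input repeatedly until the user enters a blank line. This is only a sketch; whether every pasted line actually reaches your program depends on how your terminal handles a multi-line paste.
ids = []
while True:
    line = raw_input()          # each call reads one pasted line
    if not line.strip():        # a blank line (just pressing Enter) ends the input
        break
    ids.append(line.strip())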
In most cases, the best way to read Excel data from Python is simply to read the Excel file directly. The best package for this is xlrd. Assuming your workbook is named myData.xls and you want to read A2:A5 from the first sheet, you would do something like
import xlrd
wb = xlrd.open_workbook('myData.xls')   # open the workbook
ws = wb.sheet_by_index(0)               # first worksheet
result = ws.col_values(0, 1, 5)         # column 0 (A), row indexes 1 through 4, i.e. A2:A5
At that point, result would be a 4-element list of cell values (A2, A3, A4, and A5).
If you really need the user interface to be "copy a range of cells in Excel; paste into my app" then you probably have to look into building a GUI which has a multiline text input box. Here you have lots of choices, from Python's included Tkinter, to third-party libraries for Python, to non-Python GUIs (as long as they can read the input and then pass it to your Python program).
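As a very rough sketch of that Tkinter option (the names and layout here are just illustrative, not a recommended design), a window with a multiline Text box could collect the pasted column like this:
from Tkinter import Tk, Text, Button   # the module is "tkinter" (lowercase) on Python 3

def grab_ids():
    # one transaction ID per pasted line; split() discards blank lines
    ids = text_box.get('1.0', 'end').split()
    print ids
    root.destroy()

root = Tk()
text_box = Text(root, width=40, height=15)
text_box.pack()
Button(root, text='Done', command=grab_ids).pack()
root.mainloop()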
Edit: You can read the clipboard directly (so don't do the paste step at all). See these questions for more information. The simplest solution taken from those questions relies on Tkinter:
from Tkinter import Tk
r = Tk()
result = r.selection_get(selection='CLIPBOARD')
r.destroy()
The above assumes that the clipboard is already populated. In other words, the flow would be something like:
1. Your program prompts the user to copy a selection in Excel.
2. The user copies a selection in Excel.
3. The user responds to your program's prompt (to let your program know the clipboard is ready).
4. Your program issues the above snippet to grab the clipboard contents into result.
5. Your program processes result as desired.
No doubt there are more sophisticated ways, but that should be enough to get you going.
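Putting that flow together in code, a sketch only (it assumes one transaction ID per line in the copied column and that the copy succeeded):
from Tkinter import Tk

raw_input("Copy the column in Excel, then press Enter here...")
r = Tk()
r.withdraw()                                    # no need to show a window
result = r.selection_get(selection='CLIPBOARD')
r.destroy()
ids = result.split()                            # one ID per line of the pasted column
print ids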

some_string = '''014300071432Gre
014300054037Col
014300065692ASC'''

some_string = """014300071432Gre
014300054037Col
014300065692ASC"""
Triple quotes create a multiline string. Equivalently, you could write this as:
some_string = "014300071432Gre\n014300054037Col\n014300065692ASC"

Related

Assigning multi-line raw string to variable for use in read_csv

I am trying to assign a raw file path to a variable for use in read_csv in python. The ultimate intent is to take the file path as an input in a GUI, and use this to run read_csv. The string is very long, and, for the time being, I am just trying to get the string-to-variable assignment working.
I followed another thread which suggested using r'''drive:\yada\yada...''', however this adds an additional "\" to each step in the file path. Any suggestions for how to prevent this? Also, any suggestions on the best approach to take a file path as input to a GUI and use it with read_csv would be greatly appreciated.
Example of problem below...
In[219]: pathProject = r'''C:\Users\Account\OneDrive\
\Documents\Projects\2016\Shared\
\Project-1\Administrative\Phase-1\
\Final'''
In[220]: pathProject
Out[220]: 'C:\\Users\\Account\\OneDrive\\\n\\Documents\\Projects\\2016\\Shared\\\n\\Project-1\\Administrative\\Phase-1\\\n\\Final'
If you want to enter a long string by splitting it across many lines, you can take advantage of Python's implicit string literal concatenation. Because the string spans several lines, the parts have to be enclosed in parentheses, for example:
pathProject = (r"C:\Users\Account\OneDrive"
r"\Documents\Projects\2016\Shared"
r"\Project-1\Administrative\Phase-1"
r"\Final")
print(pathProject)
# C:\Users\Account\OneDrive\Documents\Projects\2016\Shared\Project-1\Administrative\Phase-1\Final
Note the opening and closing parentheses, and that each part of the string has to be declared as a raw string.
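If you'd rather not repeat the r prefix on every line, an alternative sketch (a different technique from the one above) is to build the path with os.path.join, so only the drive prefix needs to be a raw string:
import os

pathProject = os.path.join(r"C:\Users\Account\OneDrive", "Documents", "Projects",
                           "2016", "Shared", "Project-1", "Administrative",
                           "Phase-1", "Final")
print(pathProject)
# on Windows: C:\Users\Account\OneDrive\Documents\Projects\2016\Shared\Project-1\Administrative\Phase-1\Final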

How to save data to a file on separate items instead of one long string?

I am having trouble simply saving items into a file for later reading. When I save the file, instead of listing the items as separate items, it appends the data together as one long string. According to my Google searches, this should not be happening.
What am I doing wrong?
Code:
with open('Ped.dta','w+') as p:
    p.write(str(recnum)) # Add record number to top of file
    for x in range(recnum):
        p.write(dte[x]) # Write date
        p.write(str(stp[x])) # Write Steps number
Since you do not show your data or your output, I cannot be sure, but it seems you are trying to use the write method like the print function; however, there are important differences.
Most important, write does not follow its written characters with any separator (like space by default for print) or end (like \n by default for print).
Therefore there is no space between your date and steps number, and no newline between the lines, because you did not write them and Python did not add them.
So add those. Try the lines
p.write(dte[x]) # Write date
p.write(' ') # space separator
p.write(str(stp[x])) # Write Steps number
p.write('\n') # line terminator
Note that I do not know the format of your "date" that is written, so you may need to convert that to text before writing it.
Now that I have the time, I'll implement @abarnert's suggestion (in a comment) and show you how to get the advantages of the print function and still write to a file. Just use the file= parameter in Python 3, or in Python 2 after executing the statement
from __future__ import print_function
Using print you can do my four lines above in one line, since print automatically adds the space separator and newline end:
print(dte[x], str(stp[x]), file=p)
This does assume that your date datum dte[x] is to be printed as text.
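A slightly fuller sketch of that approach (assuming, as in the question, that dte and stp are lists with recnum entries):
from __future__ import print_function   # only needed on Python 2

with open('Ped.dta', 'w+') as p:
    print(recnum, file=p)               # record count on the first line
    for x in range(recnum):
        print(dte[x], stp[x], file=p)   # date and step count, space-separated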
Try adding a newline ('\n') character at the end of your lines, as you see in the docs. This should solve the problem of 'listing the items as single items', but the file you create may not be greatly structured nonetheless.
For your further Google searches you may want to check serialization, as well as the json and csv formats, covered in the Python standard library.
Your question would have benefited from a very small example of the recnum variable. Also, the original f.close() is not necessary since you have a with statement; see here at SO.
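As a concrete example of the csv suggestion above, a minimal sketch (the .csv filename and the two-column layout are just assumptions; dte, stp and recnum are as in the question):
import csv

with open('Ped.csv', 'wb') as f:           # on Python 3: open('Ped.csv', 'w', newline='')
    writer = csv.writer(f)
    for x in range(recnum):
        writer.writerow([dte[x], stp[x]])  # one row per record: date, steps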

Python 3: how to parse a csv file where the text fields can contain embedded new line characters

When exporting Excel/LibreOffice sheets as CSV where the cells can contain newlines, the resulting file will have those newlines preserved as literal newline characters, not as something like the two-character string "\n".
The standard csv module in Python 3 apparently does not handle this as would be necessary. The documentation says "Note The reader is hard-coded to recognise either '\r' or '\n' as end-of-line, and ignores lineterminator. This behavior may change in the future." . Well, duh.
Is there some other way to read in such CSV files properly? What csv really should do is ignore any newlines within quoted text fields and only recognise newline characters outside a field, but since it does not, is there a different way to solve this short of implementing my own CSV parser?
Try using pandas with something like df = pandas.read_csv('my_data.csv'). You'll have more granular control over how the data is read in. If you're worried about formatting, you can also set the delimiter for the CSV export from LibreOffice to something that doesn't occur in nature, like ;;.
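For example, a minimal sketch (my_data.csv is a placeholder name; the arguments shown are just the pandas defaults made explicit):
import pandas as pd

# quoted fields with embedded newlines are kept together as single cell values
df = pd.read_csv('my_data.csv', sep=',', quotechar='"')
print(df.head())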

Regex out leading and trailing quotes if not contains comma

I'm at a total loss of how to do this.
My Question: I want to take this:
"A, two words with comma","B","C word without comma","D"
"E, two words with comma","F","G more stuff","H no commas here!"
... (continue)
To this:
"A, two words with comma",B,C word without comma,D
"E, two words with comma",F,G more stuff,H no commas here!
... (continue)
I used software that created 1,900 records in a text file and I think it was supposed to be a CSV but whoever wrote the software doesn't know how CSV files work because it only needs quotes if the cell contains a comma (right?). At least I know that in Excel it puts everything in the first cell...
I would prefer this to be solvable using some sort of command line tool like perl or python (I'm on a Mac). I don't want to make a whole project in Java or anything to take care of this.
Any help is greatly appreciated!
Shot in the dark here, but I think that Excel is putting everything in the first column because it doesn't know it's being given comma-separated data.
Excel has a "text-to-columns" feature, where you'll be able to split a column by a delimiter (make sure you choose the comma).
There's more info here:
http://support.microsoft.com/kb/214261
Edit:
You might also try renaming the file from *.txt to *.csv. That will change the way Excel reads the file, so it better understands how to parse whatever it finds inside.
If just bashing is an option, you can try this one-liner in a terminal:
cat file.csv | sed 's/"\([^,]*\)"/\1/g' >> new-file.csv
That technically should be fine: it is text delimited with " and separated by ,.
I don't see anything wrong with the first at all, any field may be quoted, only some require it. More than likely the writer of the code didn't want to over complicate the logic and quoted everything.
One way to clean it up is to feed the data to csv and dump it back.
import csv
from cStringIO import StringIO
bad_data = """\
"A, two words with comma","B","C word without comma","D"
"E, two words with comma","F","G more stuff","H no commas here!"
"""
buffer = StringIO()
writer = csv.writer(buffer)
# parse the over-quoted rows, then let the writer re-quote only where needed;
# splitlines() avoids an empty trailing row from the final newline
writer.writerows(csv.reader(bad_data.splitlines()))
buffer.seek(0)
print buffer.read()
Python's csv.writer will default to the "excel" dialect, so it will not write the quotes when they are not necessary.
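On Python 3 the same round-trip idea would look roughly like this (a sketch using io.StringIO and print as a function, with bad_data as defined above):
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(csv.reader(bad_data.splitlines()))
print(buffer.getvalue())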

Python CSV module - quotes go missing

I have a CSV file that has data like this
15,"I",2,41301888,"BYRNESS RAW","","BYRNESS VILLAGE","NORTHUMBERLAND","ENG"
11,"I",3,41350101,2,2935,2,2008-01-09,1,8,0,2003-02-01,,2009-12-22,2003-02-11,377016.00,601912.00,377105.00,602354.00,10
I am reading this and then writing different rows to different CSV files.
However, in the original data there are quotes around the non-numeric fields, as some of them contain commas within the field.
I am not able to keep the quotes.
I have researched a lot and discovered quoting=csv.QUOTE_NONNUMERIC, however this now results in quote marks around every field, and I don't know why.
If I try one of the other quoting options, like MINIMAL, I end up with an error message regarding the date value, 2008-01-09, not being a float.
I have tried to create a dialect, add the quoting on the csv reader and writer but nothing I have tried results in the getting an exact match to the original data.
Has anyone had this same problem and found a solution?
When writing, quoting=csv.QUOTE_NONNUMERIC keeps values unquoted as long as they're numbers, i.e. if their type is int or float (for example), which means it will write what you expect.
Your problem could be that, when reading, a csv.reader will turn every row it reads into a list of strings (if you read the documentation carefully enough, you'll see that a reader does not perform automatic data type conversion!).
If you don't perform any kind of conversion after reading, then when you write you'll end up with everything in quotes... because everything you write is a string.
Edit: of course, date fields will be quoted, because they are not numbers, meaning you cannot get the exact expected behaviour using the standard csv.writer.
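To make the type-conversion point concrete, a rough sketch (the filenames and the maybe_number helper are placeholders, and as just noted, dates and trailing zeros still will not round-trip exactly): convert numeric-looking strings back to numbers after reading, and QUOTE_NONNUMERIC will then leave them unquoted when writing.
import csv

def maybe_number(field):
    # hypothetical helper: turn numeric-looking strings back into numbers
    for convert in (int, float):
        try:
            return convert(field)
        except ValueError:
            pass
    return field

# on Python 3 use mode 'r'/'w' with newline='' instead of 'rb'/'wb'
with open('input.csv', 'rb') as src, open('output.csv', 'wb') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst, quoting=csv.QUOTE_NONNUMERIC)
    for row in reader:
        writer.writerow([maybe_number(f) for f in row])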
Are you sure you have a problem? The behavior you're describing is correct: The csv module will enclose strings in quotes only if it's necessary for parsing them correctly. So you should expect to see quotes only around strings containing a comma, newlines, etc. Unless you're getting errors reading your output back in, there is no problem.
Trying to get an "exact match" of the original data is a difficult and potentially fruitless endeavor. quoting=csv.QUOTE_NONNUMERIC put quotes around everything because every field was a string when you read it in.
Your concern that some of the "quoted" input fields could have commas is usually not that big a deal. If you added a comma to one of your quoted fields and used the default writer, the field with the comma would be automatically quoted in the output.
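A quick way to convince yourself of that last point (just a throwaway check, not part of any real pipeline):
import csv
import sys

writer = csv.writer(sys.stdout)                 # default dialect: QUOTE_MINIMAL
writer.writerow(['BYRNESS, RAW', 'ENG', 15])    # prints: "BYRNESS, RAW",ENG,15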
