How to paste a Numpy array to Excel - python

I have multiple files which I process using Numpy and SciPy, but I am required to deliver an Excel file. How can I efficiently copy/paste a huge numpy array to Excel?
I have tried converting to Pandas' DataFrame object, which has the very useful function to_clipboard(excel=True), but I spend most of my time converting the array into a DataFrame.
I cannot simply write the array to a CSV file and then open it in Excel, because I have to add the array to an existing file; something very hard to achieve with xlrd/xlwt and other Excel tools.

My best solution here would be to turn the array into a string, then use win32clipboard to send it to the clipboard. This is not a cross-platform solution, but then again, Excel is not available on every platform anyway.
Excel uses tabs (\t) to mark a column change and \r\n to indicate a row change.
The relevant code would be:
import win32clipboard as clipboard

def toClipboardForExcel(array):
    """
    Copies an array into a string format acceptable by Excel.
    Columns separated by \t, rows separated by \r\n.
    """
    # Build one tab-delimited string per row, then join rows with \r\n
    line_strings = []
    for line in array:
        line_strings.append("\t".join(line.astype(str)).replace("\n", ""))
    array_string = "\r\n".join(line_strings)

    # Put string into clipboard (open, clear, set, close)
    clipboard.OpenClipboard()
    clipboard.EmptyClipboard()
    clipboard.SetClipboardText(array_string)
    clipboard.CloseClipboard()
I have tested this code with random arrays of shape (1000,10000) and the biggest bottleneck seems to be passing the data to the function. (When I add a print statement at the beginning of the function, I still have to wait a bit before it prints anything.)
EDIT: The previous paragraph describes my experience in Python Tools for Visual Studio. In that environment, the print statement seems to be delayed. From a plain command-line interface, the bottleneck is in the loop, as expected.
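For reference, a minimal usage sketch (assuming a 2-D numeric array and numpy imported as np):
import numpy as np

arr = np.random.rand(1000, 100)   # any 2-D numeric array
toClipboardForExcel(arr)          # then paste into Excel with Ctrl+V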

import pandas as pd
pd.DataFrame(arr).to_clipboard()
I think this is one of the easiest ways to do it with the pandas package.
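If you don't want the row index and header pasted along with the values, to_clipboard() forwards keyword arguments to to_csv(), so a call like this should work:
pd.DataFrame(arr).to_clipboard(excel=True, index=False, header=False)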

If I needed to process multiple files loaded into Python and then write them out to Excel, I would probably build some tooling with xlwt.
That said, may I offer my recipe, Pasting python data into a spread sheet, open to any edits, complaints or feedback. It uses no third-party libraries and should be cross-platform.

As of today, you can also use xlwings. It's open source, and fully compatible with Numpy arrays and Pandas DataFrames.
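For instance, a minimal xlwings sketch for writing an array into an existing workbook might look like this (the file and sheet names are placeholders):
import numpy as np
import xlwings as xw

arr = np.random.rand(100, 10)
wb = xw.Book("existing_report.xlsx")           # opens the workbook in Excel
wb.sheets["Sheet1"].range("A1").value = arr    # writes the array starting at A1
wb.save()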

I extended PhilMacKay's answer to:
- include 1-dimensional arrays, and
- allow commas as the decimal separator (decimal=","):
import numpy as np
import win32clipboard as clipboard

def to_clipboard(array, decimal=","):
    """
    Copies an array into a string format acceptable by Excel.
    Columns separated by \t, rows separated by \r\n.
    """
    # Create string from array
    try:
        n, m = np.shape(array)
    except ValueError:
        # 1-dimensional array: shape is a single-element tuple
        n, m = 1, 0
    line_strings = []
    if m > 0:
        for line in array:
            if decimal == ",":
                line_strings.append("\t".join(line.astype(str)).replace(
                    "\n", "").replace(".", ","))
            else:
                line_strings.append("\t".join(line.astype(str)).replace(
                    "\n", ""))
        array_string = "\r\n".join(line_strings)
    else:
        if decimal == ",":
            array_string = "\r\n".join(array.astype(str)).replace(".", ",")
        else:
            array_string = "\r\n".join(array.astype(str))
    # Put string into clipboard (open, clear, set, close)
    clipboard.OpenClipboard()
    clipboard.EmptyClipboard()
    clipboard.SetClipboardText(array_string)
    clipboard.CloseClipboard()
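A possible call, assuming a numeric 1-D or 2-D array:
arr = np.random.rand(5)           # 1-D arrays work too
to_clipboard(arr, decimal=",")    # then paste into Excel with Ctrl+V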

You could also look into the pyxll project.


When reading excel files with pandas, what determines the datatype of the cells being read?

I am reading an excel sheet and plucking data from rows containing the given PO.
import pandas as pd

xlsx = pd.ExcelFile('Book2.xlsx')
df = pd.read_excel(xlsx)
PO_arr = ['121121', '212121']
for i in PO_arr:
    PO = i
    PO_DATA = df.loc[df['PONUM'] == PO]
    for i in range(1, max(PO_DATA['POLINENUM'].values) + 1):
When I take this Excel sheet straight from its source, my code works fine. But when I cut out only the rows I want and paste them to a new spreadsheet with the exact same formatting and read this new spreadsheet, I have to change PO_DATA to look for an integer instead of a string as such:
PO_DATA = df.loc[df['PONUM'] == int(PO)]
If not, I get an error, and calling PO_DATA returns an empty dataframe.
C:\...\pandas\core\ops\array_ops.py:253: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
res_values = method(rvalues)
I checked the cell formatting in Excel and in both cases, they are formatted as 'General' cells.
What is going on that makes it so when I chop up my spreadsheet, I have to look for an integer and not a string? What do I have to do to make it work for sheets I've created and pasted relevant data into instead of only sheets from the source?
Excel can do some funky formatting when plain copy and paste (Ctrl+C / Ctrl+V) is used.
I am sure you tried these but...
A) Try copying with Ctrl+C, then pasting values only with Ctrl+Alt+V, "V", Enter on the new sheet/file.
B) Try using the format painter in Excel (the paintbrush on the Home tab): select the properly formatted cells first, double-click the format painter, move to your new file/sheet, and select the cells you want the format to conform to.
C) Select the new file/table you pasted into, then use the eraser icon in the Excel ribbon and choose to clear all formats.
Update: I found an old related thread that didn't necessarily answer the question but solved the problem.
You can force pandas to import values as a certain datatype when reading from Excel using the converters argument of read_excel:
df = pd.read_excel(xlsx, converters={'POLINENUM':int,'PONUM':int})
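Alternatively, if you would rather keep comparing against strings, you could force the column to be read as text with the dtype argument; a possible sketch:
df = pd.read_excel(xlsx, dtype={'PONUM': str})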

Writing results of several cells in jupyter to one file (no overwriting)

I have a jupyter notebook where I run the same simulation using many different combinations of parameters (essentially, to simulate different versions of environment and their effect on the results). Let's say that the result of each run is an image and a 2d array of all relevant metrics for my system. I want to be able to keep the images in notebook, but save the arrays all in one place, so that I can work with them later on if needed.
Ideally I would save them into an external file with the following format:
'Experiment environment version i' (or some other description)
2d array
and every time I would run a new simulation (a new cell) the results would be added into this file until I close it.
Any ideas how to end up with such an external summary file?
If you have Excel available to you, then you could use pandas to write the results to a spreadsheet (or you could use pandas to write to a CSV). See the documentation here, but essentially you would do the following when appending and/or using a new sheet:
import pandas as pd

for i in results:
    with pd.ExcelWriter('results.xlsx', mode='a') as writer:
        df.to_excel(writer, sheet_name='Result' + str(i))
You will need to have your array in dataframe 'df', there are lots of tutorials on how to put an array into pandas.
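A slightly fuller sketch of that idea, assuming each result is a 2-D numpy array (note that appending with mode='a' needs the openpyxl engine and an existing file; the names here are just examples):
import numpy as np
import pandas as pd

results = {'env_v1': np.random.rand(4, 3), 'env_v2': np.random.rand(4, 3)}  # example data

pd.DataFrame().to_excel('results.xlsx', sheet_name='README')  # create the file once
for name, arr in results.items():
    with pd.ExcelWriter('results.xlsx', mode='a', engine='openpyxl') as writer:
        pd.DataFrame(arr).to_excel(writer, sheet_name=name)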
After a bit of trial and error, here is a general answer on how to write to a txt file (without pandas; otherwise see jaybeesea's answer):
import numpy as np

with open("filename.txt", "a+") as f:
    f.write("Comment 1 \n")
    f.write("%s \n" % np.array2string(array, separator=' , '))
Every time you run it, it appends to the file.

Matrix from Excel to Python

I'm writing a Python program that will import a square matrix from an Excel sheet and do some NumPy work with it. So far it looks like OpenPyXl is the best way to transfer the data from an XLSX file to the Python environment, but it's not clear what the best way is to turn that data from a tuple of tuples* of cell references into an array of the actual values that are in the Excel sheet.
*created by calling sheet_ranges = wb['Sheet1'] and then mat = sheet_ranges['A1:IQ251']
Of course I could check the size of the tuple, write a nested for loop, check every element of each tuple within the tuple, and fill up an array.
But is there really no better way?
As commented above, the ideal solution is to use a pandas dataframe. For example:
import pandas as pd
dataframe = pd.read_excel("name_of_my_excel_file.xlsx")
print(dataframe)
Just pip install pandas and then run the code above, replacing name_of_my_excel_file with the full path to your Excel file. Then you can go on to use pandas functions to analyse your data in depth. See the docs here!
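Since the question asks for a NumPy array to work with, you can then get one straight from the DataFrame; a minimal sketch (to_numpy() is available in recent pandas versions):
matrix = dataframe.to_numpy()   # 2-D numpy array of the sheet's values
print(matrix.shape)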

numpy won't print full (unsummarized) array

I've looked at this response to try and get numpy to print the full array rather than a summarized view, but it doesn't seem to be working.
I have a CSV with named headers. Here are the first five rows
v0 v1 v2 v3 v4
1001 5529 24 56663 16445
1002 4809 30.125 49853 28069
1003 407 20 28462 8491
1005 605 19.55 75423 4798
1007 1607 20.26 79076 12962
I'd like to read in the data and be able to view it fully. I tried doing this:
import numpy as np
np.set_printoptions(threshold=np.inf)
main_df2=np.genfromtxt('file location', delimiter=",")
main_df2[0:3,:]
However this still returns the truncated array, and the performance seems greatly slowed. What am I doing wrong?
OK, in a regular Python session (I usually use IPython instead), I set the print options and made a large array:
>>> np.set_printoptions(threshold=np.inf, suppress=True)
>>> x=np.random.rand(25000,5)
When I execute the next line, it spends about 21 seconds formatting the array, and then writes the resulting string to the screen (with more lines than fit the terminal's window buffer).
>>> x
This is the same as
>>> print(repr(x))
The internal storage for x is a buffer of floats (which you can 'see' with x.tostring()). To print x it has to format it, creating a multiline string that contains a print representation of each number, all 125000 of them. The result of repr(x) is a string 1850000 characters long, 25000 lines. This is what takes 21 seconds. Displaying that on the screen is just limited by the terminal scroll speed.
I haven't looked at the details, but I think the numpy formatting is mostly written in Python, not compiled. It's designed more for flexibility than speed. It's normal to want to see 10-100 lines of an array. 25000 lines is an unusual case.
Somewhat curiously, writing this array as a csv is fast, with a minimal delay:
>>> np.savetxt('test.txt', x, fmt='%10f', delimiter=',')
And I know what savetxt does - it iterates on rows, and does a file write
f.write(fmt % tuple(row))
Evidently all the bells-n-whistles of the regular repr are expensive. It can summarize, it can handle many dimensions, it can handle complicated dtypes, etc. Simply formatting each row with a known fixed format is not the time consuming step.
Actually that savetxt route might be more useful, as well as fast. You can control the display format, and you can view the resulting text file in an editor or terminal window at your leisure. You won't be limited by the scroll buffer of your terminal window. But how will this savetxt file be different from the original csv?
I'm surprised you get an array at all, as your example does not use ',' as the delimiter. But maybe you just forgot to include commas in your example file.
I would use the DataFrame functionality of pandas if I work with csv data. It uses numpy under the hood, so all numpy operation work on pandas DataFrames.
Pandas has many tricks for operating with table like data.
import pandas as pd
df = pd.read_csv('nothing.txt')
#==============================================================================
# next line remove blanks from the column names
#==============================================================================
df.columns = [name.strip(' ') for name in df.columns]
pd.set_option('display.height', 1000)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
print(df)
When I copied and pasted the data here, it was open in Excel, but the file is a CSV.
I'm doing a class exercise and we have to use numpy. One thing I noticed was that the results were quite illegible thanks to the scientific notation, so I did the following and things are much smoother:
np.set_printoptions(threshold=100000, suppress=True)
The suppress statement saved me a lot of formatting. The performance does suffer a lot when I change the threshold to something like 'nan' or inf, and I'm not sure why.

How to copy/paste a dataframe from iPython into Google Sheets or Excel?

I've been using iPython (aka Jupyter) quite a bit lately for data analysis and some machine learning. But one big headache is copying results from the notebook app (browser) into either Excel or Google Sheets so I can manipulate results or share them with people who don't use iPython.
I know how to convert results to csv and save. But then I have to dig through my computer, open the results and paste them into Excel or Google Sheets. That takes too much time.
And just highlighting a resulting dataframe and copy/pasting usually completely messes up the formatting, with columns overflowing. (Not to mention the issue of long resulting dataframes being truncated when printed in iPython.)
How can I easily copy/paste an iPython result into a spreadsheet?
Try using the to_clipboard() method. E.g., for a dataframe, df: df.to_clipboard() will copy said dataframe to your clipboard. You can then paste it into Excel or Google Docs.
If df.to_clipboard() doesn't work, this will:
import io

with io.StringIO() as buffer:
    df.to_csv(buffer, sep=' ', index=False)
    print(buffer.getvalue())
Then, you can copy the printed dataframe and paste it in Excel or Google Sheets.
Paste the output into an IDE like Atom and then paste it into Google Sheets/Excel.
I use display() instead of print() and it works fine for me. Example:
from IPython.display import display
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'English': [73, 55, 90],
        'Math': [78, 100, 33],
        'Geography': [92, 87, 72]}
df = pd.DataFrame(data)
display(df)
The result can easily be copied and pasted into Excel and formatting won't be messed up. This method also works with Colab.
If you are able to make the CSV or HTML file available at a URL, you can use this in Google Sheets:
=IMPORTDATA("url to the csv/html file")
In my experience, spreadsheet software uses tabs (\t) to separate cells and newlines (\n) to separate rows.
Assuming this, I wrote a simple function to convert the clipboard data:
def from_excel_to_list(copy_text):
    """Use it to copy and paste data from spreadsheet software
    (MS Excel, LibreOffice) and convert it to a list.
    """
    if isinstance(copy_text, str):
        array = []
        rows = copy_text.split("\n")  # split into rows
        for row in rows:
            if len(row):  # skip empty lines
                array.append(row.split("\t"))
        return array
    else:
        raise TypeError("text must be string")
You can define the function inside Jupyter and use it in this way:
Copy with Ctrl+C in the spreadsheet, then call the function from_excel_to_list, pasting the data with Ctrl+V inside the triple quotes:
my_excel_converted = from_excel_to_list("""Paste here with ctrl-v the text""")
Example
Data from ctrl-c:
N U tot
1 18,236 18,236
17 20,37 346,29
5 6,318 31,59
Call The function:
from_excel_to_list("""N U tot
1 18,236 18,236
17 20,37 346,29
5 6,318 31,59
""")
Result in Jupyter:
[['N', 'U', 'tot'],
['1', '18,236', '18,236'],
['17', '20,37', '346,29'],
['5', '6,318', '31,59']]
This is a base for further elaboration.
The same approach can be used to obtain a dictionary, a namedtuple and so on.
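For instance, a possible sketch for turning the parsed rows into a pandas DataFrame (assuming the first row holds the headers):
import pandas as pd

rows = from_excel_to_list("N\tU\ttot\n1\t18,236\t18,236\n17\t20,37\t346,29\n")
df = pd.DataFrame(rows[1:], columns=rows[0])   # first row becomes the header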
For a small table, you can print the dataframe, select it with the mouse, copy it with Ctrl/Cmd+C, go to the spreadsheet and paste it in. Then click on the first cell and insert a cell to fix the header. Done.
PS: for a bigger table, some rows/columns will show as '...'; refer to How do I expand the output display to see more columns of a Pandas DataFrame? to show all rows and columns. For an even bigger table (one that is difficult to select with the mouse), this method is not so convenient.
