I'm writing a Python program that imports a square matrix from an Excel sheet and does some NumPy work with it. So far openpyxl looks like the best way to get the data from an XLSX file into Python, but it's not clear what the best way is to turn that data from a tuple of tuples* of cell objects into an array of the actual values in the sheet.
*Created by calling sheet_ranges = wb['Sheet1'] and then mat = sheet_ranges['A1:IQ251'].
Of course I could check the size of the tuple, write a nested for loop, check every element of each tuple within the tuple, and fill up an array.
But is there really no better way?
As commented above, the ideal solution is to use a pandas dataframe. For example:
import pandas as pd
dataframe = pd.read_excel("name_of_my_excel_file.xlsx")
print(dataframe)
Just pip install pandas and run the code above, replacing name_of_my_excel_file.xlsx with the full path to your Excel file (reading .xlsx files also needs openpyxl installed). You can then use pandas functions to analyse your data in depth. See the pandas documentation for read_excel for details.
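If you would rather stay with openpyxl, the nested loop from the question collapses into a single comprehension. The following is a minimal sketch, not the only way to do it: cells_to_array is a hypothetical helper name, and the small stand-in range exists only so the snippet runs without a workbook.

```python
from types import SimpleNamespace

import numpy as np


def cells_to_array(cell_range):
    """Convert a tuple of tuples of cell objects into a 2-D NumPy array.

    Only the ``.value`` attribute of each cell is read, which is the
    attribute openpyxl cells expose for their contents.
    """
    return np.array([[cell.value for cell in row] for row in cell_range])


# Stand-in for sheet_ranges['A1:B2'] so the sketch runs on its own;
# with openpyxl you would pass the real range (e.g. mat) instead.
fake_range = (
    (SimpleNamespace(value=1.0), SimpleNamespace(value=2.0)),
    (SimpleNamespace(value=3.0), SimpleNamespace(value=4.0)),
)

matrix = cells_to_array(fake_range)
print(matrix.shape)
```

With a real worksheet, openpyxl's iter_rows(values_only=True) yields plain values directly, and on the pandas side pd.read_excel(path, header=None).to_numpy() should give a comparable array, assuming the sheet holds nothing but the matrix.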
I exported the following dataframe in Google Colab. Whichever method I use, when I import it later the column shows up as pandas.core.series.Series, not as an array.
from google.colab import drive
drive.mount('/content/drive')
path = '/content/drive/My Drive/output.csv'
with open(path, 'w', encoding='utf-8-sig') as f:
    df_protein_final.to_csv(f)
After importing, the type of the column is reported as:
pandas.core.series.Series
Note: the first and second images may show the numbers in a different order (they may look like different datasets). Please don't get hung up on this; the images are just an example.
Why does a column that was an array before exporting turn into a series after exporting?
The code below gives the same result; I can't export the original structure.
from google.colab import files
df.to_csv('filename.csv')
files.download('filename.csv')
Edit: I am looking for a way to keep the original structure (e.g. array) when exporting.
Actually, that is how pandas works. When you insert a list or a NumPy array into a pandas dataframe, it is always converted to a series. If you want to turn the series back into a list or array, use Series.values, Series.array or Series.to_numpy().
EDIT:
I got an idea from your comments. You are asking how to save the dataframe to a file while preserving all of its properties. In other words, you are (intentionally or not) asking how to serialize the dataframe. You can use pickle for this.
Note: pandas has built-in pickle support, so you can export the dataframe directly to a pickle file, as in this example:
df.to_pickle(file_name)
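To make the difference concrete, here is a small sketch that sends the same frame through CSV and through pickle. The frame, its column names and the temporary file name are made up for illustration; the point is only what type comes back in each case.

```python
import io
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical frame with one NumPy array per cell.
df = pd.DataFrame({
    "name": ["a", "b"],
    "vector": [np.array([1, 2, 3]), np.array([4, 5, 6])],
})

# CSV round trip: each array is written out as its text representation.
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False)
csv_buffer.seek(0)
from_csv = pd.read_csv(csv_buffer)
csv_cell = from_csv.loc[0, "vector"]

# Pickle round trip: the Python objects themselves are serialized.
with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "frame.pkl")
    df.to_pickle(pickle_path)
    from_pickle = pd.read_pickle(pickle_path)
pickle_cell = from_pickle.loc[0, "vector"]

print(type(csv_cell).__name__)
print(type(pickle_cell).__name__)
```

The CSV cell comes back as a string, while the pickled cell is still an ndarray. Keep in mind that pickle files should only be loaded from sources you trust.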
I have a CSV file, diseases_matrix_KNN.csv, which contains an Excel table.
Now I would like to store all the numbers from a row, like:
Hypothermia = [0,-1,0,0,0,0,0,0,0,0,0,0,0,0]
For some reason I have been unable to find a solution to this, even though I have looked. Please let me know how I can read this type of data in the chosen form using Python.
The most common way to work with Excel files is to use pandas.
Here is an example:
import pandas as pd
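df = pd.read_excel(filename, index_col=0)  # first column holds the row labels; use pd.read_csv for a .csv file
print(df.loc['Hypothermia'].tolist())  # label-based lookup needs .loc, not .iloc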
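Since the file in the question is a CSV, the same idea can be sketched with read_csv. The inline data below is an invented excerpt (the real file's column names and row labels are not shown in the question), and it assumes the disease names sit in the first column.

```python
import io

import pandas as pd

# Hypothetical excerpt standing in for diseases_matrix_KNN.csv.
csv_text = """disease,s1,s2,s3,s4
Hypothermia,0,-1,0,0
Frostbite,1,0,0,-1
"""

# index_col=0 turns the first column into row labels.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# One row as a plain Python list.
hypothermia = df.loc["Hypothermia"].tolist()

# Every row at once, keyed by its label.
all_rows = {name: row.tolist() for name, row in df.iterrows()}

print(hypothermia)
```

For the real file, replace io.StringIO(csv_text) with the path to the CSV.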
I'm having trouble writing something that I believe should be relatively easy.
I have a template Excel file that has some visualizations on it, spread over a few sheets. I want to write a script that loads the template, inserts rows of an existing dataframe into specific cells on each sheet, and saves the result as a new Excel file.
The template already has all the cells formatted and the visualizations in place, so I want to insert the data only, without changing the design.
I tried several packages and none of them seemed to work for me.
Thanks for your help! :-)
I have written a package for inserting pandas DataFrames into Excel sheets (at specific rows/cells/columns); it's called pyxcelframe:
https://pypi.org/project/pyxcelframe/
It has very simple and short documentation, and the function you need is insert_frame.
So, let's say we have a pandas DataFrame called df that we want to insert into the sheet named "MySheet" of the Excel file "MyWorkbook.xlsx", starting at cell B5. We can use the insert_frame function as follows:
from pyxcelframe import insert_frame
from openpyxl import load_workbook
workbook = load_workbook("MyWorkbook.xlsx")
worksheet = workbook["MySheet"]
insert_frame(worksheet=worksheet,
             dataframe=df,
             row_range=(5, 0),
             col_range=(2, 0))
A 0 as the second element of row_range or col_range means that no ending row or column is specified; if you need a specific ending row or column, replace the 0 with it.
Sounds like a job for xlwings. You didn't post any test data, but modifying the code below to suit your needs should be quite straightforward.
import xlwings as xw
wb = xw.Book('your_excel_template.xlsx')
wb.sheets['Sheet1'].range('A1').value = df[your_selected_rows]
wb.save('new_file.xlsx')
wb.close()
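If neither package is an option, writing the values cell by cell with plain openpyxl is another possibility. This is only a sketch under assumptions: write_frame is a hypothetical helper, the frame is invented, and a blank in-memory workbook stands in for the template. Also note that openpyxl may drop charts and images when it loads and re-saves an existing workbook, so check the result against your template before relying on it.

```python
import pandas as pd
from openpyxl import Workbook


def write_frame(worksheet, frame, first_row, first_col):
    """Write the values of `frame` (without header) into `worksheet`,
    starting at the 1-based position (first_row, first_col)."""
    for r_offset, record in enumerate(frame.itertuples(index=False)):
        for c_offset, value in enumerate(record):
            worksheet.cell(row=first_row + r_offset,
                           column=first_col + c_offset,
                           value=value)


# A blank workbook stands in for the template here; with a real template
# you would use openpyxl.load_workbook("template.xlsx") instead.
workbook = Workbook()
worksheet = workbook.active

df = pd.DataFrame({"item": ["x", "y"], "amount": [1.5, 2.5]})
write_frame(worksheet, df, first_row=5, first_col=2)  # top-left cell is B5

print(worksheet["B5"].value, worksheet["C6"].value)
```

Only the cell values are assigned, so existing cell formatting in the target range is left alone; calling workbook.save("new_file.xlsx") afterwards would write the result under a new name.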
So I have a data file from which I must extract specific data, using:
import numpy
x = 15  # need a way for the code to work out how many lines to skip in the given data
maxcol = 2000  # need a way to find the final row of the data
data = numpy.genfromtxt('data.dat.csv', skip_header=x, delimiter=',')
column_one = data[0:maxcol, 0]
column_two = data[0:maxcol, 1]
This gives me an array for the specific case where there are (x=)15 lines of metadata above the required data and where the number of rows of data is (maxcol=)2000. How do I change the code so that it works for any value of x and maxcol?
Use pandas. Its read_csv function does all that you want (I don't include its equivalent of delimiter, sep=',', because comma-delimited is the default):
import pandas as pd
data = pd.read_csv('data.dat.csv', skiprows=x, nrows=maxcol, header=None)  # header=None: the first row after the skipped lines is data, as with genfromtxt
If you really want that as a numpy array, you can do this:
data = data.values
But you can probably just leave it as a pandas DataFrame.
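Neither snippet says how to find x and maxcol automatically. One possible approach, sketched below, is to scan for the first line whose fields all parse as numbers and then let slicing take every remaining row. count_header_lines is a hypothetical helper and the inline text is invented; the heuristic would misfire if a metadata line happened to be purely numeric.

```python
import io

import numpy as np


def count_header_lines(lines, delimiter=","):
    """Return the index of the first line whose fields all parse as floats."""
    for index, line in enumerate(lines):
        fields = line.strip().split(delimiter)
        try:
            [float(field) for field in fields]
        except ValueError:
            continue  # not a numeric row, so it still counts as metadata
        return index
    raise ValueError("no numeric row found")


# Invented sample: two metadata lines followed by three data rows.
text = "instrument: demo\nunits: s,V\n0.0,1.5\n1.0,2.5\n2.0,3.5\n"

x = count_header_lines(text.splitlines())
data = np.genfromtxt(io.StringIO(text), skip_header=x, delimiter=",")

# No maxcol needed: a bare ':' selects every row that was read.
column_one = data[:, 0]
column_two = data[:, 1]

print(x, data.shape)
```

For a real file, read its lines with open(...) to compute x and pass the file name to genfromtxt instead of the StringIO object.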
I am creating a dataframe with a bunch of calculations and adding new columns using these formulas (calculations). Then I am saving the dataframe to an Excel file.
I lose the formula after I save the file and open the file again.
For example, I am using something like:
total = 16
for s in range(total):
    df_summary['Slopes(avg)' + str(s)]= df_summary[['Slope_S' + str(s)]].mean(axis=1)*df_summary['Correction1']/df_summary['Correction2'].mean(axis=1)
How can I make sure this formula appears in the Excel file I write to, just like a formula in an Excel worksheet?
You can write formulas to an Excel file using the XlsxWriter module. Use .write_formula(): https://xlsxwriter.readthedocs.org/worksheet.html#worksheet-write-formula. If you're not attached to using an Excel file to store your dataframe, you might want to look into the pickle module.
import pickle
# to save
with open('saved_df.p', 'wb') as f:
    pickle.dump(df, f)
# to load
with open('saved_df.p', 'rb') as f:
    df = pickle.load(f)
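The write_formula() suggestion above can be sketched as follows. The numbers, the column layout and the formula are invented for illustration and are not the asker's actual calculation; the file goes to a temporary directory so the snippet is self-contained.

```python
import os
import tempfile
import zipfile

import xlsxwriter

# Hypothetical input values: (slope, correction) pairs.
rows = [(2.0, 4.0), (3.0, 5.0)]

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "with_formulas.xlsx")

workbook = xlsxwriter.Workbook(path)
worksheet = workbook.add_worksheet()
for row_index, (slope, correction) in enumerate(rows):
    # Columns A and B hold the raw numbers (row/column indices are 0-based).
    worksheet.write_number(row_index, 0, slope)
    worksheet.write_number(row_index, 1, correction)
    # Column C holds a live Excel formula; A1 notation is 1-based.
    excel_row = row_index + 1
    worksheet.write_formula(row_index, 2, f"=A{excel_row}*B{excel_row}")
workbook.close()

# An .xlsx file is a zip archive, so this is a cheap sanity check.
is_xlsx = zipfile.is_zipfile(path)
print(is_xlsx)
```

XlsxWriter stores the formula text rather than calculating it, so the result is worked out by the spreadsheet application when the file is opened.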
I think my answer to a related question may be relevant here. In short, you need to use openpyxl (or possibly xlrd, if support has been added) to extract the formula, and then XlsxWriter to write the formula back in. It can definitely be done.
This assumes, of course, as @jay s pointed out, that you first write Excel formulas into the DataFrame. (This solution is an alternative to pickling.)