I've been working on a function in Python, using the openpyxl library, that loads columns from a specified sheet in a workbook and does some data conditioning before returning the columns as lists or NumPy arrays.
To load the columns, I load the workbook, get the target sheet, store the columns, then simply iterate through each column and append the cell contents to lists:
# open the excel file
wb = openpyxl.load_workbook(fname, read_only=True)
print('\nWorkbook "%s" open...' % (fname))

# get the target sheet
sh = wb.get_sheet_by_name(sheet)
print('Sheet "%s" acquired...' % (sheet))

# store only the desired columns of the sheet
# (cols holds the 1-based column numbers; L = len(cols))
sheetcols = list(sh.columns)
columns = [[] for i in range(L)]
for i in range(L):
    columns[i] = sheetcols[cols[i] - 1]

# read the selected columns into a list of lists
print('Parsing desired columns of data...')
data = [[] for i in range(L)]
# iterate over the columns
for i in range(L):
    print(len(columns[i]))
    # iterate over a specific column
    for j in range(len(columns[i])):
        # store cell contents as a string (for now)
        data[i].append(columns[i][j].value)
Some columns load with several None elements at the beginning of their respective lists that do not correspond to the data in the Excel file. For example, a column with two empty cells at the top (left empty for header space or whatever) is expected to load with two None elements at the beginning of its list, but it might load with five or six None elements instead of just two.
It's consistent every time I run the function: the same columns have this problem every run, which makes me think there is hidden data of some kind in the Excel sheet. I've tried clearing the contents of the cells that are supposed to be empty, but no luck.
Does anybody more familiar with the openpyxl module, or just with Excel, have thoughts about why these mysterious extra None elements get into the imported data?
The code is incomplete, but it's probably worth noting that the behaviour for worksheets with missing cells is necessarily somewhat unpredictable. For example, if a worksheet only has values in the cells D3:G8, what should its columns be? openpyxl will create cells on demand for any given range, and I suspect that is what you are seeing.
ws.rows and ws.columns are provided for convenience, but you are almost always better off working with an explicit range such as ws.get_squared_range(…), which should give you fewer surprises.
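(get_squared_range has since been removed from openpyxl; iter_rows with explicit bounds is the modern equivalent. A minimal sketch, using a tiny in-memory workbook for illustration, of how an explicit range gives predictable bounds:)

```python
from openpyxl import Workbook

# Build a tiny in-memory workbook whose only values live in B2:C4,
# mimicking a sheet with leading empty rows and columns.
wb = Workbook()
ws = wb.active
ws["B2"] = "header"
ws["B3"] = 1
ws["B4"] = 2
ws["C3"] = 10

# Asking for an explicit rectangle yields exactly that rectangle:
# no cells are created on demand outside the range you request.
rows = list(ws.iter_rows(min_row=2, max_row=4, min_col=2, max_col=3,
                         values_only=True))
print(rows)  # [('header', None), (1, 10), (2, None)]
```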
I am trying to detect all the merged cells in an openpyxl.worksheet.worksheet.Worksheet object, and it seems that merged_cells.ranges does not find all of the merged cells, only those in some columns. My goal is to detect the merged cells, unmerge them, and then re-merge certain cells based on column values. During unmerging, I fill the unmerged cells with the top-left cell value of the merged range.
I have worked around this problem by filling the cells that were supposed to be recognized as merged with the previous value in the column, since all my merged ranges sit within a single column, for example A18:A19, B18:B19. But things became trickier after I updated my xlsx file: openpyxl didn't find merged cells in columns A, C and E in my previous xlsx, and now it has trouble finding merged cells in columns B, D and F. The two xlsx files have the same format but different data.
Here is an example of what my xlsx looks like:
[screenshot: xlsx sample]
My code to read the xlsx then detect & unmerge the merged cells:
from openpyxl import load_workbook
from openpyxl.utils import range_boundaries

client_info_wb = load_workbook(path_client_info)
sheet_name = client_info_wb.sheetnames[0]
client_info_ws = client_info_wb[sheet_name]

for cell_group in client_info_ws.merged_cells.ranges:
    print(cell_group)
    min_col, min_row, max_col, max_row = range_boundaries(str(cell_group))
    top_left_cell_value = client_info_ws.cell(row=min_row, column=min_col).value
    print(top_left_cell_value)
    client_info_ws.unmerge_cells(str(cell_group))
    for row in client_info_ws.iter_rows(min_col=min_col, min_row=min_row,
                                        max_col=max_col, max_row=max_row):
        for cell in row:
            cell.value = top_left_cell_value
Output for print(cell_group) and print(top_left_cell_value):
A48:A49
2021-01-05
C48:C49
XX5614
E48:E49
ID
A46:A47
2021-01-05
C46:C47
XX2134
E46:E47
ID
A44:A45
2021-01-05
C44:C45
XX1234
E44:E45
ID
The information in the columns where merged_cells.ranges fails to detect merged cells is necessary for the subsequent operations in my code. So can anyone help me with this? Has anyone had the same issue? I have spent a long time trying to find a pattern in my xlsx that could be causing the trouble, with no luck.
while sheet.merged_cells:  # <- Here's the change to make.
    for cell_group in sheet.merged_cells:
        val = str(cell_group.start_cell.value).strip()
        sheet.unmerge_cells(str(cell_group))
        for merged_cell in cell_group.cells:
            sheet.cell(row=merged_cell[0], column=merged_cell[1]).value = val
It seems like the set of merged_cells changes as it is iterated over, so repeating the loop until merged_cells is empty does the trick.
There's also something weird going on with the in-memory buffer, so I save the file to disk and reload it with pandas, rather than loading the dataframe from the sheet in memory. (This could easily be optimized with a BytesIO object.)
For me, this guarantees that all merged cells are unmerged and replaced with the start cell's value.
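(A variant of the same idea that avoids mutating merged_cells mid-iteration, and therefore needs no outer while loop, is to iterate over a copy of the ranges. A sketch with a two-range demo workbook, not the answerer's exact code:)

```python
from openpyxl import Workbook

# Demo sheet with two merged ranges, each holding a value in its top-left cell.
wb = Workbook()
ws = wb.active
ws["A1"] = "x"
ws.merge_cells("A1:A2")
ws["B1"] = "y"
ws.merge_cells("B1:B2")

# Iterate over a *copy* of the ranges, so unmerging while looping
# cannot skip entries.
for rng in list(ws.merged_cells.ranges):
    value = ws.cell(row=rng.min_row, column=rng.min_col).value
    ws.unmerge_cells(str(rng))
    for row in ws.iter_rows(min_row=rng.min_row, max_row=rng.max_row,
                            min_col=rng.min_col, max_col=rng.max_col):
        for cell in row:
            cell.value = value

print(ws["A2"].value, ws["B2"].value)  # x y
```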
I am working with Excel using Python and have a couple of questions:
Loading Excel Sheet into 2d Array.
In VBA I'd simply do:
Dim arrData As Variant
arrData = shtData.Range("A1:E2500")
I would get an array(1 To 2500, 1 To 5), which I can easily access, for example arrData(1, 5) -> row 1, column 5.
In Python, what I managed to do is:
# declare list
excel_data = []

# loop to load the spreadsheet data into a 2D structure:
# iterate through every row and append it to the list
for row in shtData.iter_rows(min_row=5, max_row=50, values_only=True):
    excel_data.append(row)
Is there a way to assign the rows to the list starting from index 1 instead of 0?
In VBA there is the Option Base 1 statement:
https://learn.microsoft.com/en-us/office/vba/language/reference/user-interface-help/option-base-statement
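(Python sequences are always 0-based and there is no Option Base equivalent. One common workaround, sketched here with made-up data, is to pad with a dummy first row and first column so the indices line up with VBA's:)

```python
# Rows as returned by iter_rows(..., values_only=True) (illustrative data).
rows = [("a", "b"),
        ("c", "d")]

# Pad with a dummy first row and a dummy first column so that
# excel_data[r][c] matches VBA's arrData(r, c), both starting at 1.
excel_data = [None] + [(None,) + row for row in rows]

print(excel_data[1][1], excel_data[2][2])  # a d
```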
Is this the fastest way to operate on an Excel dataset?
I am planning to loop through, say, 2500 rows and 5 columns -> 12,500 cells.
With VBA this was very efficient, to be honest (operating on an array in memory).
As I understand it, the functions of openpyxl behave as follows:
wkb = load_workbook(filename)
# does this only create a REFERENCE to the Excel workbook and not open it?
# or is it loaded into memory while the file on disk stays intact?
shtData = wkb.worksheets[0]
# again, only a reference?
shtReport = wkb.create_sheet(title="ReportTable")
# this adds a sheet, but only to the workbook loaded in memory; only after
# saving is the Excel file on disk actually overwritten?
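(To the questions in the comments above: load_workbook parses the whole file into memory, and the file on disk is untouched until wb.save() is called. A quick experiment, using a throwaway temporary file, illustrates this:)

```python
import os
import tempfile
from openpyxl import Workbook, load_workbook

path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
Workbook().save(path)  # create a minimal file on disk

wb = load_workbook(path)             # whole workbook parsed into memory
wb.create_sheet(title="ReportTable")

# The new sheet exists in memory but not yet in the file on disk...
print("ReportTable" in wb.sheetnames)                    # True
print("ReportTable" in load_workbook(path).sheetnames)   # False

wb.save(path)  # ...until save() rewrites the file
print("ReportTable" in load_workbook(path).sheetnames)   # True
```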
You can use pandas and create a DataFrame (2D table) from the Excel spreadsheet.
import pandas as pd
df = pd.read_excel("data.xls")
print(df)
print("____________")
print(f'Average sales are: {df["Gross"].values.mean()}')
print(f'Net income for April: {df.at[3, "Net"]}')
print("____________")
df_no_header = pd.read_excel("data.xls", skiprows=1, header=None)
print(df_no_header)
print("____________")
print(f'Net income for April: {df_no_header.at[3, 2]}')
Output: [screenshot of the printed dataframes]
The Pandas dataframe has many methods that will allow you to access rows and columns and do much more. Setting skiprows=1, header=None will skip the header row. See here.
I have two Excel worksheets I am reading in Python. The first worksheet has a list of company names. The second worksheet repeats many of those same company names, with data to the right corresponding to each row.
[![Worksheet 1][1]][1]
[![Worksheet 2][2]][2]
I want to make some kind of condition: if a name in column A of WS2 matches a name in WS1, print the data (columns A:F of WS2) only for the rows corresponding to that name.
I am pretty new to coding, so I've been playing with it a lot without much luck. Right now I don't have much code because I restarted. I've been trying to use just pandas to read, and sometimes openpyxl.
import pandas as pd
import xlsxwriter as xlw
import openpyxl as xl

TickList = pd.read_excel("C:\\Users\\Ashley\\Worksheet1.xlsx",
                         sheet_name='Tickers', header=None)
stocks = TickList.values.ravel()

Data = pd.read_excel("C:\\Users\\Ashley\\Worksheet2.xlsx",
                     sheet_name='Pipeline', header=None, usecols="A:F")
data = Data.values.ravel()

for i in stocks:
    for t in data:
        if i == t:
            print(data)
[1]: https://i.stack.imgur.com/f6mXI.png
[2]: https://i.stack.imgur.com/4vKGR.png
I would imagine that the first thing you are doing wrong is not stipulating the key value on which the "i" in stocks is meant to match the values in "t". Remember, "t" is the values, all of them. You have to specify that you wish to match the value of "i" to (probably) the first column of "t". What you appear to be doing here is akin to a VLOOKUP without the target range properly specified.
While I do not know exactly how the ravel() function stores the data, I have to believe something like this would be more likely to work:
for i in stocks:
    for t in data:
        if i == t[0]:
            print(t)
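(Since both sheets are already read with pandas, the nested loops can also be replaced with a vectorized isin filter. A sketch with made-up stand-in frames; the real code would use the frames read from the two worksheets:)

```python
import pandas as pd

# Stand-ins for the two worksheets: a one-column ticker list, and a data
# sheet whose first column holds the ticker (header=None, so columns are
# numbered the way pandas would number them).
tick_list = pd.DataFrame({0: ["AAPL", "MSFT"]})
pipeline = pd.DataFrame({0: ["AAPL", "GOOG", "MSFT"],
                         1: [150, 2800, 300]})

# Keep only the pipeline rows whose first column appears in the ticker list.
matched = pipeline[pipeline[0].isin(tick_list[0])]
print(matched[1].tolist())  # [150, 300]
```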
I have to import tables from multiple Excel files into Python, specifically as pandas DataFrames.
The problem is that the Excel files do not all have the same structure;
in particular, in some of them the table starts at cell A1 while in others it starts at A2, or even B1 or B2.
The only thing that stays constant through all the Excel files are the headers in the first row of the table.
So, for instance, somewhere in the first row of the table "setting" is always written, but "setting" is sometimes at position A1 and sometimes at position B2.
Currently I just modify the header and skiprows parameters of pandas.read_excel manually for each file, but since there are quite a lot of files, doing this by hand every time is quite a waste of time.
CurrExcelFile = pd.read_excel(Files[i], header=numHeader, skiprows=skip)
Does a package exist, or could one easily be written, that takes a string identifying the first element of a table?
Then I could just pass "setting" and the script would automatically find the index of that cell and start fetching data from there.
UPDATE:
So currently I manage it by first importing the whole sheet, finding the row and column of the "setting" value, dropping the unnecessary columns, renaming the DataFrame columns, and finally dropping the unnecessary rows.
import numpy as np
import pandas as pd

Test = pd.read_excel('excelfile', sheet_name='sheetname')

# Find the row index and column position of the anchor cell
for column in Test.columns:
    tmp = Test[column] == "setting"
    if len(Test.loc[tmp].index) == 1:
        RowInd = Test.loc[tmp].index[0]
        ColPos = Test.columns.get_loc(column)

# Drop the columns to the left of the anchor
ColumnsToDrop = Test.columns[np.arange(0, ColPos)]
Test.drop(ColumnsToDrop, inplace=True, axis=1)

# Use the anchor row as the header
Test.columns = Test.iloc[RowInd]

# Drop the rows above and including the header row
Test.drop(np.arange(0, RowInd + 1), inplace=True, axis=0)
This is rather a workaround, and I wish there were an easier solution.
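(The same idea can be wrapped in a small helper: read the sheet with header=None, locate the anchor cell, and slice from there. A sketch; the function name and demo file are made up, and it assumes the anchor occurs exactly once, in the header row:)

```python
import os
import tempfile
import pandas as pd

def read_table(path, anchor="setting", sheet_name=0):
    """Load a sheet, find the cell equal to `anchor`, and return the
    table starting there (the anchor row becomes the header)."""
    raw = pd.read_excel(path, sheet_name=sheet_name, header=None)
    hits = raw == anchor
    row = hits.any(axis=1).idxmax()   # first row containing the anchor
    col = hits.any(axis=0).idxmax()   # first column containing the anchor
    table = raw.iloc[row + 1:, col:]
    table.columns = raw.iloc[row, col:]
    return table.reset_index(drop=True)

# Quick demo: a sheet whose table starts at B2 rather than A1.
path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
pd.DataFrame([
    [None, None,      None],
    [None, "setting", "value"],
    [None, "a",       1],
    [None, "b",       2],
]).to_excel(path, index=False, header=False)

table = read_table(path)
print(list(table.columns))  # ['setting', 'value']
```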
I used xlrd to pull an Excel file with prescription numbers and drug names. I then made a list of tuples, each containing a prescription number and a drug name. The list looks like this:
[(123, 'enalapril'),
 (456, 'atenolol'),
 (789, 'lovastatin')]
I would like to create a new Excel file that lists each prescription number in column A with the corresponding drug in column B. I plan to use xlsxwriter. Is there a way to do this with tuples?
A workaround I tried involved creating two separate lists, one of prescription numbers and one of drugs. It worked in this small example, but I would like it to work reliably at a larger scale. I am concerned that, by using two separate lists, the prescription numbers might somehow be matched to the wrong drug in the new Excel file. Thank you.
xlsxwriter makes it pretty straightforward to create a basic spreadsheet:
Code:
import xlsxwriter

# Create a workbook and add a worksheet.
workbook = xlsxwriter.Workbook('my_excel.xlsx')
worksheet = workbook.add_worksheet()

# Some data we want to write to the worksheet.
data = [(123, 'enalapril'), (456, 'atenolol'), (789, 'lovastatin')]

# Iterate over the data and write it out row by row.
for row, line in enumerate(data):
    for col, cell in enumerate(line):
        worksheet.write(row, col, cell)

workbook.close()
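(As a side note, xlsxwriter's write_row can replace the inner loop, writing a whole tuple at once. A sketch, using a temporary path rather than the answer's filename:)

```python
import os
import tempfile
import xlsxwriter

path = os.path.join(tempfile.mkdtemp(), "prescriptions.xlsx")
workbook = xlsxwriter.Workbook(path)
worksheet = workbook.add_worksheet()

# write_row writes an entire sequence starting at (row, col).
for row, line in enumerate([(123, 'enalapril'),
                            (456, 'atenolol'),
                            (789, 'lovastatin')]):
    worksheet.write_row(row, 0, line)

workbook.close()
```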
Results: [screenshot of the resulting spreadsheet]