Python OpenPyxl trouble detecting all merged cells

I am trying to detect all the merged cells in an openpyxl.worksheet.worksheet.Worksheet object, and it seems that merged_cells.ranges does not return all the merged cells, only the merged cells in some columns. My goal is to detect the merged cells, unmerge them, and then re-merge certain cells based on column values. During unmerging, I fill the unmerged cells with the top-left cell value of the merged range.
I have worked around this problem by filling the NaN in cells that should have been recognized as merged with the previous value in the column, since all my merged ranges stay within a single column, for example A18:A19, B18:B19. But things became trickier after I updated my xlsx file. OpenPyxl didn't find merged cells in columns A, C and E of my previous xlsx. Now it has trouble finding merged cells in columns B, D and F. The two xlsx files have the same format but different data.
Here is an example of what my xlsx looks like:
[image: xlsx sample]
My code to read the xlsx then detect & unmerge the merged cells:
from openpyxl import load_workbook
from openpyxl.utils import range_boundaries
client_info_wb = load_workbook(path_client_info)
sheet_name = client_info_wb.sheetnames[0]
client_info_ws = client_info_wb[sheet_name]
for cell_group in client_info_ws.merged_cells.ranges:
    print(cell_group)
    min_col, min_row, max_col, max_row = range_boundaries(str(cell_group))
    top_left_cell_value = client_info_ws.cell(row=min_row, column=min_col).value
    print(top_left_cell_value)
    client_info_ws.unmerge_cells(str(cell_group))
    for row in client_info_ws.iter_rows(min_col=min_col, min_row=min_row, max_col=max_col, max_row=max_row):
        for cell in row:
            cell.value = top_left_cell_value
Output for print(cell_group):
A48:A49
2021-01-05
C48:C49
XX5614
E48:E49
ID
A46:A47
2021-01-05
C46:C47
XX2134
E46:E47
ID
A44:A45
2021-01-05
C44:C45
XX1234
E44:E45
ID
The information in the columns where merged_cells.ranges fails to detect merged cells is needed for later operations in my code. Can anyone help me with this? Has anyone had the same issue? I have spent a long time looking for a pattern in my xlsx that could explain the problem, with no luck.

while sheet.merged_cells:  # <- Here's the change to make.
    for cell_group in sheet.merged_cells:
        val = str(cell_group.start_cell.value).strip()
        sheet.unmerge_cells(str(cell_group))
        for merged_cell in cell_group.cells:
            sheet.cell(row=merged_cell[0], column=merged_cell[1]).value = val
It seems like the set of merged_cells changes as it is iterated over, so repeating the loop until merged_cells is empty does the trick.
There's also something weird going on with the in-memory buffer, so I save the file to disk and reload it with pandas, rather than loading the dataframe from the sheet in memory. (This could easily be optimized with a BytesIO object.)
For me, this guarantees that all merged cells are unmerged and replaced with the start cell's value.
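An alternative to repeating the loop is to iterate over a snapshot of the ranges, so that unmerging cannot disturb the iteration. The following is a minimal sketch, not the code from the question or the answer above; the function name unmerge_and_fill and the in-memory demo workbook are made up for illustration.

```python
from openpyxl import Workbook


def unmerge_and_fill(ws):
    """Unmerge every merged range in ws and copy its top-left value into each cell."""
    # list(...) takes a snapshot: unmerge_cells() removes entries from
    # ws.merged_cells, which would otherwise change while we iterate over it.
    for cell_range in list(ws.merged_cells.ranges):
        top_left_value = ws.cell(row=cell_range.min_row, column=cell_range.min_col).value
        ws.unmerge_cells(cell_range.coord)
        for row in ws.iter_rows(min_row=cell_range.min_row, max_row=cell_range.max_row,
                                min_col=cell_range.min_col, max_col=cell_range.max_col):
            for cell in row:
                cell.value = top_left_value


# Hypothetical demo data: three two-row vertical merges in adjacent columns.
wb = Workbook()
ws = wb.active
for coord, value in (("A1", "2021-01-05"), ("B1", "ABC"), ("C1", "XX1234")):
    ws[coord] = value
    ws.merge_cells(coord + ":" + coord[0] + "2")

unmerge_and_fill(ws)
remaining = len(ws.merged_cells.ranges)
```

Because the loop runs over a copy, a single pass is enough and no outer while loop is needed.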

Related

Start reading from a column where data exists is in excel using pandas [duplicate]

How could I read an Excel file from column AF and onwards? I don't know the last column letter name and the file is too large to constantly keep checking.
df = pd.read_excel(r"Documents\file.xlsx", usecols="AF:")
read_excel does not accept an open-ended range such as "AF:", so you have to look at other options.
For the moment you could write 'AF:XFD', because 'XFD' is the last column in Excel, but pandas warns that this will be deprecated and will start raising a ParseError, so it is not recommended.
You can use other libraries to find the last column, but that is not fast: the Excel file is read once to find the last column, and then read again to create the dataframe.
If I had this problem I would do everything in pandas by adding .iloc at the end. 'AF' is the 32nd column in Excel, which is position 31 in zero-based indexing, so:
df = pd.read_excel(r"Documents\file.xlsx").iloc[:, 31:]
This returns all columns from 'AF' to the end without a noticeable impact on performance.
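The difference between Excel's 1-based column numbers and the zero-based positions used by .iloc is easy to get wrong, so it can help to compute the position instead of counting by hand. A small sketch; column_letter_to_index is a hypothetical helper written for this example, not a pandas function.

```python
def column_letter_to_index(letters: str) -> int:
    """Convert an Excel column label such as 'AF' to a zero-based position."""
    number = 0
    for ch in letters.upper():
        # Column labels are a base-26 number with digits A=1 .. Z=26.
        number = number * 26 + (ord(ch) - ord("A") + 1)
    return number - 1  # Excel counts from 1, .iloc counts from 0


af_position = column_letter_to_index("AF")
```

The result can then be used as the start of the slice, for example .iloc[:, af_position:].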
You can use the xlwings library to determine the last column that contains data, and then use it in your read_excel call.
import pandas as pd
import xlwings as xw
app = xw.App(visible=False)
book = app.books.open(r"Documents\file.xlsx")
sheet = book.sheets[0]
# Read the last used column (1-based) before closing Excel
colNum = sheet.range('AF1').current_region.last_cell.column
app.quit()
# usecols takes zero-based positions: AF is 31 and the last column is colNum - 1
df = pd.read_excel(r"Documents\file.xlsx", usecols=list(range(31, colNum)))
where 31 is the zero-based position of column AF.

Python/Openpyxl copy and paste columns with formula

I have a workbook with a worksheet, named sheet #1. I want to copy some columns from sheet #1 and change column orders a little bit.
First, I create a new worksheet, named sheet #2. I can copy and paste from sheet #1 to sheet #2, but I find openpyxl copies formulas exactly as is, so I have a problem. For example, column O in sheet #1 has some formula like:
O3=(M3*1)+(N3*1)
I move column M in sheet #1 to column H in sheet #2 and move column N in sheet #1 to column I in sheet #2. When I move column O in sheet #1 to column M in sheet #2, I have problems. Firstly, column M in sheet #2's formula is still:
M3=(M3*1)+(N3*1)
I have a circular reference issue since I try to use myself to calculate myself. Secondly, if I move column O in sheet #1 to column J in sheet #2, I don't have this circular reference problem, but my formula is still messed up.
I use the following way to copy and paste:
# openpyxl rows and columns are 1-based: O is column 15 and M is column 13
for i in range(1, 1001):
    sheet_2.cell(row=i, column=13).value = sheet_1.cell(row=i, column=15).value
I have tried data_only with true and false when I call load_workbook as follows.
my_workbook = openpyxl.load_workbook(args.input_file, data_only=False)
Neither works for me. True gets me all zeros in both sheet #1 and sheet #2. False gets me the circular reference problem as described above.
Is there a way to use openpyxl package to solve my problem? I think as long as when copying and pasting, if worksheet name can be added to specify the cells in the formula, my problem is solved, something like this:
M3=("Sheet #1"M3*1)+("Sheet #1"N3*1)
If openpyxl doesn't do the job, is there a better package to solve this problem? pandas?
I will start off by saying I am no expert, but I'll give it a go.
From your question it seems you may not be familiar with pandas. I would tackle this with pandas. It is also worth doing some additional reading on pandas: it is very powerful, especially for Excel automation.
import pandas as pd
# Read the Excel sheets into pandas DataFrames
DataFrame1 = pd.read_excel("FileName.xlsx", sheet_name='sheet_number_1')
DataFrame2 = pd.read_excel("FileName.xlsx", sheet_name='sheet_number_2')
You should read your sheet #2 DataFrame and first bring over the columns from your sheet #1 DataFrame that DO NOT have formulas.
# You can set columns equal to each other like this.
DataFrame2['sheet_2_column_name'] = DataFrame1['sheet_1_column_name']
This will bring over all data from whatever sheet 1 column you choose into whatever sheet 2 column you choose.
Now for columns with formulas... You mentioned the formula (M3*1)+(N3*1) should become (H3*1)+(I3*1) in your sheet #2. I wouldn't bring these columns over using the method above; instead I would do something like this:
# Build the formula string for each row of the column
DataFrame2['column_name_to_insert_formula_to'] = DataFrame2.apply(lambda row: '=(H{0}*1)+(I{0}*1)'.format(row.name + 2), axis=1)
Both {0} placeholders are filled with the same argument, row.name + 2. With the default index, row.name is the zero-based position of the row, and adding 2 converts it to the Excel row number when the header occupies row 1. The leading = is there so that the string is written as a formula rather than as plain text. We use axis=1 because the function has to be applied to each row.
More info on Pandas .apply function from the Pandas docs
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
More info on Apply and Lambda usage in Pandas
https://towardsdatascience.com/apply-and-lambda-usage-in-pandas-b13a1ea037f7
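If the formulas are copied as text, the column letters inside them have to follow the same reordering as the columns themselves, so that (M3*1)+(N3*1) becomes (H3*1)+(I3*1). One way to do that is a regular-expression rewrite of the cell references. This is only a sketch under simplifying assumptions: remap_columns and CELL_REF are names invented for this example, and the pattern handles plain A1-style references only, not sheet-qualified references, named ranges, or text inside string literals.

```python
import re

# Matches A1-style references such as M3 or $N$12.
CELL_REF = re.compile(r"(\$?)([A-Z]{1,3})(\$?)(\d+)")


def remap_columns(formula: str, column_map: dict) -> str:
    """Rewrite column letters in a formula according to column_map."""
    def _swap(match):
        col_abs, col, row_abs, row = match.groups()
        # Columns that are not in the map are left unchanged.
        return f"{col_abs}{column_map.get(col, col)}{row_abs}{row}"

    # A single pass, so a letter that was already remapped is not remapped again.
    return CELL_REF.sub(_swap, formula)


moved = remap_columns("=(M3*1)+(N3*1)", {"M": "H", "N": "I"})
```

The rewritten string can then be assigned to the target cell's value in the new sheet in place of the original formula.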

How to add to data frame based on text color (Styleframe)

I have a large Excel file with 90k rows, and I want to add only the rows that have red text to a dataframe (using StyleFrame). The code below works with a small Excel file of 5 rows, but when I use it with a larger file the dataframe is always empty.
Even if I remove the dropna, I get a StyleFrame with all NaNs and no reds.
import numpy as np
from styleframe import StyleFrame, utils
sf = StyleFrame.read_excel('myFile.xlsx', read_style=True, use_openpyxl_styles=False, usecols=['COLUMN_1'], header=2)
def only_cells_with_red_text(cell):
    return cell if cell.style.font_color in {utils.colors.red, 'FFFF0000'} else np.nan
sf_2 = StyleFrame(sf.applymap(only_cells_with_red_text).dropna(axis=(0, 1), how='all'))
I expected only cells with red text to be added to dataframe
The output is Empty DataFrame
Columns: []
Index: []
It's a bug in StyleFrame. The usecols and header kwargs change the shape of the dataframe (since they cause pd.read_excel to return a subset of the dataframe/sheet). When read_excel then applies the styles it applies the styles to the wrong cells (simply put, it applies the styles based on the location of the cells in the original, entire sheet).
For now, the "workaround" is to remove usecols = ['COLUMN_1'], header=2 (much less efficient, of course) and do the filtering later, i.e.
sf = sf[['COLUMN_1']]
until I (I'm one of the authors of StyleFrame) find a way to overcome this.

Importing multiple slightly different excel files

I have to import tables from multiple Excel files into Python, specifically as pandas DataFrames.
The problem is that the Excel files do not all have the same structure:
in some of them the table starts at cell A1, while in others it starts at cell A2, or even B1 or B2.
The only thing that stays constant across all the Excel files is the headers in the first row of the table.
So, for instance, the first cell of the header row always contains "setting". But "setting" is sometimes at position A1 and sometimes at position B2.
Currently I manually adjust the numHeader and skip values passed to pandas.read_excel for each Excel file, but since there are quite a lot of files, doing this by hand every time wastes a lot of time.
CurrExcelFile = pd.read_excel(Files[i],
                              header=numHeader, skiprows=skip)
Does a package exist, or can one easily be written, that takes as a parameter a string identifying the first element of a table?
Then I could just pass "setting" and the script would automatically find the index of that cell and start fetching data from there.
UPDATE:
I currently manage it by first importing the whole sheet, finding where the "setting" value is, dropping the unnecessary columns, renaming the dataframe columns, and finally dropping the unnecessary rows.
import numpy as np
import pandas as pd
Test = pd.read_excel('excelfile', sheet_name='sheetname')
# Find the row index and column position of the first cell
for column in Test.columns:
    tmp = Test[column] == "setting"
    if len(Test.loc[tmp].index) == 1:
        RowInd = Test.loc[tmp].index[0]
        ColPos = Test.columns.get_loc(column)
# Drop columns
ColumnsToDrop = Test.columns[np.arange(0, ColPos)]
Test.drop(ColumnsToDrop, inplace=True, axis=1)
# Rename axis
Test.columns = Test.iloc[RowInd]
# Drop rows
Test.drop(np.arange(0, RowInd + 1), inplace=True, axis=0)
This is rather a workaround, and I wish there were an easier solution.
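The same idea can be wrapped in a reusable function that takes the anchor string as a parameter. This is a sketch, not an existing package: extract_table is a hypothetical name, and it assumes the sheet was loaded with header=None so that the header row is still part of the data and the anchor text occurs exactly once.

```python
import pandas as pd


def extract_table(raw: pd.DataFrame, anchor: str) -> pd.DataFrame:
    """Slice out the table whose top-left header cell equals `anchor`."""
    # Boolean mask of cells equal to the anchor, then their (row, col) positions.
    rows, cols = (raw == anchor).to_numpy().nonzero()
    if len(rows) != 1:
        raise ValueError(f"expected exactly one {anchor!r} cell, found {len(rows)}")
    row, col = int(rows[0]), int(cols[0])
    # Keep everything below the header row, from the anchor column rightwards.
    table = raw.iloc[row + 1:, col:].reset_index(drop=True)
    table.columns = raw.iloc[row, col:].tolist()
    return table


# Hypothetical sheet contents, shaped like what read_excel(..., header=None)
# might return when the table starts at B2 instead of A1.
raw = pd.DataFrame([
    [None, None, None],
    [None, "setting", "value"],
    [None, "a", 1],
    [None, "b", 2],
])
table = extract_table(raw, "setting")
```

For real files, raw would come from pd.read_excel(path, header=None), and the function can be called in a loop over all files with the same anchor.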

Spurious 'None' cells loaded at the beginning of columns by openpyxl

I've been working on a function in python, using the openpyxl library, that will load columns from a specified sheet in a workbook and do some data conditioning before returning the columns in lists or numpy arrays.
To load the columns, I'm loading the workbook, getting the target sheet, storing the columns, then simply iterating through each column and appending the cell contents to lists:
#open the excel file
wb = openpyxl.load_workbook(fname, read_only = True)
print('\nWorkbook "%s" open...' % (fname))
#get the target sheet
sh = wb.get_sheet_by_name(sheet)
print('Sheet "%s" aquired...' % (sheet))
#store only the desired columns of the sheet
sheetcols = sh.columns
columns = [[] for i in range(L)]
for i in range(L):
    columns[i] = sheetcols[cols[i] - 1]
#read selected columns into a list of lists
print('Parsing desired columns of data...')
data = [[] for i in range(L)]
#iterate over the columns
for i in range(L):
    #iterate over a specific column
    print(len(columns[i]))
    for j in range(len(columns[i])):
        #store cell contents as a string (for now)
        data[i].append(columns[i][j].value)
Some columns will load with several None elements at the beginning of their respective list that do not correspond to the data in the excel file. For example, a column with two empty cells at the beginning (left empty because of header space or whatever) is expected to load with two None elements at the beginning of its list but it might load with five or six None elements instead of just two...
It's consistent every time I run the function. The same columns will have this problem every time, which makes me think there is hidden data of some kind in the excel sheet. I've tried clearing the contents of the cells that are supposed to be empty but no luck.
Does anybody more familiar with the openpyxl module or maybe just excel have thoughts about why these mysterious extra None elements get into the imported data?
The code is incomplete, but it's probably worth noting that the behaviour for worksheets with missing cells is necessarily somewhat unpredictable. For example, if a worksheet only has values in the cells D3:G8, what should its columns be? openpyxl will create cells on demand for any given range, and I suspect that is what you are seeing.
ws.rows and ws.columns are provided for convenience, but you are almost always better off working with ws.get_squared_range(…), which should give you few surprises.
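Reading with explicit bounds can be sketched as follows. Note that this example uses ws.iter_cols() with min/max arguments rather than the get_squared_range call named above, and that the workbook, its values, and the D3:G8 bounds are made up to match the example in the answer.

```python
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
# Hypothetical data occupying only D3:G8.
for r in range(3, 9):
    for c in range(4, 8):
        ws.cell(row=r, column=c, value=r * 10 + c)

# Explicit bounds: exactly rows 3..8 of columns D..G, so no padding cells
# are generated above or to the left of the data.
columns = [
    list(col)
    for col in ws.iter_cols(min_col=4, max_col=7, min_row=3, max_row=8, values_only=True)
]
```

As far as I know, iter_cols() is not available on worksheets opened with read_only=True; there, ws.iter_rows() with the same bounds is the closest equivalent.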
