I'm working with openpyxl on a .xlsx file that has around 10K products, of which some are "regular items" and some are products that need to be ordered when required. For the project I'm doing I would like to delete all of the rows containing the items that need to be ordered.
I tested this with a small sample of the actual workbook and had the code working the way I wanted. However, when I tried it on the actual workbook with 10K rows, it seems to be taking forever to delete those rows (it has been running for nearly an hour now).
Here's the code that I used:
import openpyxl

wb = openpyxl.load_workbook('prod.xlsx')
sheet = wb.get_sheet_by_name('Sheet1')

def clean_workbook():
    for row in sheet:
        for cell in row:
            if cell.value == 'ordered':
                sheet.delete_rows(cell.row)
I would like to know: is there a faster way of doing this with some tweaks to my code? Or is there a better way to read just the regular stock from the workbook without deleting the unwanted items?
Deleting rows in loops can be slow because openpyxl has to update all the cells below the row being deleted. Therefore, you should do this as little as possible. One way is to collect a list of row numbers, check for contiguous groups and then delete using this list from the bottom.
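A rough sketch of that idea (deleting single rows from the bottom up; grouping contiguous runs into one delete_rows(start, amount) call would cut the number of calls further):

rows_to_delete = []
for row in sheet.iter_rows():
    for cell in row:
        if cell.value == 'ordered':
            rows_to_delete.append(cell.row)
            break

# delete from the bottom so earlier deletions don't shift the remaining row numbers
for row_num in sorted(set(rows_to_delete), reverse=True):
    sheet.delete_rows(row_num)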
A better approach might be to loop through ws.values and write to a new worksheet filtering out the relevant rows. Copy any other relevant data such as formatting, etc. Then you can delete the original worksheet and rename the new one.
ws1 = wb['My Sheet']
ws2 = wb.create_sheet('My Sheet New')

for row in ws1.values:
    if row[x] == "ordered":  # we can assume this is always the same column
        continue
    ws2.append(row)

del wb["My Sheet"]
ws2.title = "My Sheet"
For more sophisticated filtering you will probably want to load the values into a Pandas dataframe, make the changes and then write to a new sheet.
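For example (a hedged sketch; the 'Status' column name and the output filename are assumptions, not taken from the question):

import pandas as pd

df = pd.read_excel('prod.xlsx', sheet_name='Sheet1')
regular = df[df['Status'] != 'ordered']  # keep only the regular stock
regular.to_excel('prod_regular.xlsx', sheet_name='Sheet1', index=False)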
You can open the workbook in read-only mode and import all the content into a list; modifying the list is always a lot faster than working in the worksheet directly. After you modify the list, create a new worksheet and write your list back to Excel. I did it this way with my 100k-item workbook.
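A minimal sketch of that approach, assuming the file and sheet names from the question and that a row should be dropped whenever any of its cells equals 'ordered':

from openpyxl import load_workbook, Workbook

wb = load_workbook('prod.xlsx', read_only=True)
ws = wb['Sheet1']

# pull everything into memory once; filtering a plain list is much faster than
# editing the worksheet in place
kept_rows = [row for row in ws.values if 'ordered' not in row]
wb.close()

new_wb = Workbook()
new_ws = new_wb.active
for row in kept_rows:
    new_ws.append(row)
new_wb.save('prod_clean.xlsx')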
I am looking for a way to append data from a Python program to an Excel sheet. For this, I chose the openpyxl library to save the data.
My problem is how to put new data into the Excel file, in the last row of the sheet, without losing the current data. I looked into the documentation but did not find an answer.
I do not know whether this library has a method for adding new data or whether I need to write the logic for this task myself.
The last row of the sheet can be found using max_row:
from openpyxl import load_workbook
myFileName=r'C:\DemoFile.xlsx'
#load the workbook, and put the sheet into a variable
wb = load_workbook(filename=myFileName)
ws = wb['Sheet1']
#max_row is a worksheet property that gives the last row in a sheet.
newRowLocation = ws.max_row +1
#write to the cell you want, specifying row and column, and value :-)
ws.cell(column=1,row=newRowLocation, value="aha! a new entry at the end")
wb.save(filename=myFileName)
wb.close()
What you're looking for is the Worksheet.append method:
Appends a group of values at the bottom of the current sheet.
If it’s a list: all values are added in order, starting from the first column
If it’s a dict: values are assigned to the columns indicated by the keys (numbers or letters)
So no need to check for the last row. Just use this method to always add the data at the end.
ws.append(["some", "test", "data"])
If I call a sheet by name, the get_all_values function always gives me an empty list for a sheet that is definitely not empty.
import gspread
sheet = workbook.worksheet(sheet_name)
all_rows_list = sheet.get_all_values()
The only time get_all_values seems to return like it should is if I do the following:
all_rows_list = workbook.sheet1.get_all_values()
But the above works just for the first sheet and for no other, which is kind of useless for a workbook with more sheets.
What always works is reading row by row, like:
one_row_list = sheet.row_values(1) # first row
But the problem is that I'm trying to read a relatively big workbook with lots of sheets to figure out where I'm supposed to start writing, and reading row by row triggers a "RESOURCES EXHAUSTED" error very quickly.
So, am I doing something wrong or is get_all_values broken in gspread?
EDIT:
Added a screenshot.
gspread doesn't work well with sheets with names that could be confused as a cell reference in the A1 notation (like X101 and AT8 in your case).
https://github.com/burnash/gspread/issues/554 is an older issue that describes the underlying problem (the symptoms in that issue are different, but I'm pretty sure the root problem is the same).
I'll copy the workaround you've discovered yourself, which is to provide an explicit range:
ws.range("A1:C" + str(end_row))
That end_row is usually the row_count of the sheet.
I created a small program that writes to an Excel file. I have another program that needs to read the last entry (in column A) every day. Since new data is imported into the Excel file every day, the cell that I need to capture is different each time.
I'm looking to see if there is a way for me to grab the last cell in Column A using openpyxl in python?
I don't have much experience with this, so I wasn't sure where to start.
import openpyxl
wb = openpyxl.load_workbook('text.xlsx')
sheet = wb.get_sheet_by_name('Sheet1')
from https://openpyxl.readthedocs.io/en/stable/tutorial.html
Try this; it gets the entire A column and takes the last entry:
sheet['A'][-1]
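Put together with the snippet from the question (a minimal sketch; note that if other columns extend further down than column A, this last cell may be empty):

import openpyxl

wb = openpyxl.load_workbook('text.xlsx')
sheet = wb['Sheet1']        # wb['Sheet1'] avoids the deprecated get_sheet_by_name
last_cell = sheet['A'][-1]  # last cell in column A, up to the sheet's max_row
print(last_cell.row, last_cell.value)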
I'm trying to create a new workbook with data from an already existing workbook. The existing workbook is extremely large so I have it loaded as a read-only workbook.
Because of this, I need to iterate through the rows but I can't seem to figure out how to do this AND get data into the new workbook.
Along with this, the data is from column A and is only put into the new workbook if the cell in column B says "IL".
for row in existing_sheet.iter_rows(min_col=2, max_col=2):
    for cell in row:
        print("CHECKING IF IT IS IN IL")
        if "IL" in str(cell.value):
            currSheet.cell(row=counter, column=1).value = existing_sheet.cell(row=counter, column=41).value
I keep getting deprecation warnings and the program is going much slower than I think it should be.
When I simply do a print statement to see the cell value, it goes through all 40,000 rows in just a few minutes.
My current code takes hours, if not longer.
existing_sheet.cell(row=counter, column=41).value
This is what is slowing everything down: in read-only mode every call to iter_rows() or cell() forces openpyxl to parse the worksheet again, so do this as rarely as possible. Instead, request a wider row so the 41st column is already in the tuple (row[39] when you start at min_col=2).
for row in ws1.iter_rows(min_col=2, max_col=41):
    if "IL" in str(row[0].value):  # row[0] is column B here
        ws2.cell(row=row[0].row, column=1).value = row[39].value  # row[39] is column 41
I have this:
dic_sheets = {}
for y in xl_files:
    dic_sheets.update({y: []})
I want to populate the lists in the dictionary (dic_sheets) for each key (y) with the individual sheets inside the Excel document.
I do not know how many sheets are inside the Excel document, so I don't have an index number to stop a range(x, y, z) loop.
Another way to put it: I want to dump any number of Excel files into the active directory and have each file's sheets populate a dictionary when I run the .py in CMD.
Can anyone help me achieve this goal?
xl_files contains pandas "ExcelFile" objects, e.g. <pandas.io.excel.ExcelFile object at 0x0FF6B0D0>.
Edit: y represents individual excel files
Edit2: I need only the sheet names (or their unique index numbers) to populate (e.g. 'sheet1', 'pivot2'). I'm not yet concerned with the cells in the sheets.
Edit3: I already have the list 'xl_files' generated to contain every Excel file in the cwd.
I figured it out!
I used a for loop over the files, took the sheet names each one returned, and appended them to a new list that I combined with the dictionary.
I'll try to word my questions better in the future, as I did not get a bite this round.
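For anyone landing here later, a minimal sketch of that idea, assuming pandas is available and the Excel files sit in the current working directory:

import glob
import pandas as pd

dic_sheets = {}
for path in glob.glob('*.xlsx'):
    xl = pd.ExcelFile(path)
    dic_sheets[path] = xl.sheet_names  # e.g. ['sheet1', 'pivot2']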