How could I retrieve
- the column names (values of the cells in the first row) in an openpyxl read-only worksheet?
  City, Population, Country in the example worksheet below
- all column names in an openpyxl read-only workbook?
  City, Population, Country, frames from worksheet 1 and the other column names from all other worksheets
Example Excel worksheet:
| City | Population | Country |
| -----------|------------ | ------------ |
| Madison | 252,551 | USA |
| Bengaluru | 10,178,000 | India |
| ... | ... | ... |
Example code:
from openpyxl import load_workbook
wb = load_workbook(filename='large_file.xlsx', read_only=True)
sheet = wb.worksheets[0]
... (not sure where to go from here)
Notes:
I have to use readonly because the Excel file has over 1 million rows (don't ask)
I'd like the column names so I can eventually infer the column types and import the excel data into a PostgreSQL database
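This will collect every value from row 1 into a list:
list_with_values = []
for cell in ws[1]:
    list_with_values.append(cell.value)
If for some reason you want a list of the columns that are filled in, you can use the line below. Note that in recent openpyxl versions cell.column is an integer; use cell.column_letter if you want the letters.
column_list = [cell.column for cell in ws[1]]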
For your second question, assuming you have stored the header values in a list called list_with_values:
from openpyxl import Workbook
wb = Workbook()
# 'Sheet' is the default sheet name; you can rename it or create additional ones with wb.create_sheet()
ws = wb['Sheet']
ws.append(list_with_values)
wb.save('OutPut.xlsx')
Read-only mode provides fast access to any row or set of rows in a worksheet. Use the method iter_rows() to restrict the selection. So to get the first row of the worksheet:
rows = ws.iter_rows(min_row=1, max_row=1) # returns a generator of rows
first_row = next(rows) # get the first row
headings = [c.value for c in first_row] # extract the values from the cells
Charlie Clark's answer compacted down to a one-liner with a list comprehension:
headers = [c.value for c in next(wb['sheet_name'].iter_rows(min_row=1, max_row=1))]
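The second part of the question (column names from every worksheet in a read-only workbook) can be handled by repeating the same first-row lookup per sheet. The following is only a sketch: the function name headers_by_sheet and the in-memory demo workbook are made up for illustration, and it assumes an openpyxl version whose iter_rows() accepts values_only.

```python
from io import BytesIO

from openpyxl import Workbook, load_workbook


def headers_by_sheet(source):
    """Return {sheet title: [values of the first row]} for every worksheet."""
    wb = load_workbook(source, read_only=True)
    try:
        result = {}
        for ws in wb.worksheets:
            rows = ws.iter_rows(min_row=1, max_row=1, values_only=True)
            # An empty sheet yields no rows, so fall back to an empty tuple.
            result[ws.title] = list(next(rows, ()))
        return result
    finally:
        # Read-only workbooks keep the file open until closed explicitly.
        wb.close()


# Demo data: a small two-sheet workbook built in memory (hypothetical content).
demo = Workbook()
cities = demo.active
cities.title = "Cities"
cities.append(["City", "Population", "Country"])
cities.append(["Madison", 252551, "USA"])
other = demo.create_sheet("Other")
other.append(["Code", "Label"])

buffer = BytesIO()
demo.save(buffer)
buffer.seek(0)

all_headers = headers_by_sheet(buffer)
print(all_headers)
```

With a real file you would pass the path (for example 'large_file.xlsx') instead of the buffer.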
This is how I handled this
from openpyxl.utils import get_column_letter

def get_columns_from_worksheet(ws):
    return {
        cell.value: {
            'letter': get_column_letter(cell.column),
            'number': cell.column - 1
        } for cell in ws[1] if cell.value
    }
An example of this being used would be:
from openpyxl import load_workbook
wb = load_workbook(filename='my_file.xlsx')
ws = wb['MySheet']
COLUMNS = get_columns_from_worksheet(ws)
for cell in ws[COLUMNS['MY Named Column']['letter']]:
    print(cell.value)
The main reason for capturing both the letter and the number is that different functions and patterns within openpyxl use either one or the other, so having a reference to both is invaluable.
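To illustrate why both forms are handy, here is a small sketch on an in-memory workbook (the sheet content is made up). It assumes a recent openpyxl in which cell.column is a 1-based integer; get_column_letter and column_index_from_string convert between the two forms.

```python
from openpyxl import Workbook
from openpyxl.utils import column_index_from_string, get_column_letter

wb = Workbook()
ws = wb.active
ws.append(["City", "Population", "Country"])
ws.append(["Madison", 252551, "USA"])

# Map each header to its 1-based column number.
columns = {cell.value: cell.column for cell in ws[1] if cell.value}

pop_number = columns["Population"]
pop_letter = get_column_letter(pop_number)

# The letter is what worksheet indexing expects ...
by_letter = [cell.value for cell in ws[pop_letter]]
# ... while ws.cell() expects the number.
by_number = ws.cell(row=2, column=pop_number).value

print(pop_letter, pop_number, by_letter, by_number)
```

Note that the dictionary in the answer above stores cell.column - 1, a zero-based number, which suits libraries that index from 0 but not ws.cell().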
Related
My question is simple and I'm sorry to ask it here. But I tried several ways to iterate through my excel file and I'm having trouble finding the solution.
from openpyxl import workbook, load_workbook
wb = load_workbook("italian_team.xlsx")
ws = wb.active
rows = ws["A"]
equipe = ["Juventus", "Ac Milan", "Torino", "Pescara", "As Roma", "Genoa", "Napoli"]
for cell in rows:
    x = equipe[cell]
wb.save("italian_team.xlsx")
Do you mean you just want to insert your list as a row in the workbook?
If so, there are a few options. You could append the list as is to the sheet, in which case it will be entered after the last used row.
Or you could specify the row (and column) to write to.
Both options are shown in the code below:
from openpyxl import workbook, load_workbook
wb = load_workbook("italian_team.xlsx")
ws = wb.active
# rows = ws["A"]
equipe = ["Juventus", "Ac Milan", "Torino", "Pescara", "As Roma", "Genoa", "Napoli"]
# for cell in rows:
#     x = equipe[cell]
# This will append the list after the last used row
ws.append(equipe)
# This will enter the list at row 1, from column 1 to the length of the list
# Use min_row= and min_col= as well if the list is to be on another row or start at another column
for row in ws.iter_rows(max_row=1, max_col=len(equipe)):
    for enum, cell in enumerate(row):
        cell.value = equipe[enum]
wb.save("italian_team.xlsx")
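Since the question iterates over column A, the intent may instead have been one team per row in that column. A minimal sketch of that variant, using a fresh in-memory workbook rather than italian_team.xlsx (so nothing is loaded or saved here):

```python
from openpyxl import Workbook

equipe = ["Juventus", "Ac Milan", "Torino", "Pescara", "As Roma", "Genoa", "Napoli"]

wb = Workbook()
ws = wb.active

# enumerate(..., start=1) gives the 1-based row number that openpyxl expects.
for row_number, team in enumerate(equipe, start=1):
    ws.cell(row=row_number, column=1, value=team)

# Read column A back to check the result.
column_a = [cell.value for cell in ws["A"]]
print(column_a)
```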
Is it possible to get the SKUs — and other information — from one Excel file and reference it in another below an identifier: SKU, Name, etc.?
I have one Excel file with various sheets which acts like a template. In each sheet the same, or very similar, identifiers are found e.g. Product SKU.
My current idea is that once I have the cell.column and cell.row for Product SKU, I can store them in variables and increment cell.row by 1, so that I can load the information from another sheet below Product SKU. The end result would look like this in Excel. I would apply the same logic to all the other information I want to load.
+-------------+----------------+
| Product Sku | Product Name |
+-------------+----------------+
| 12345678 | Some Product |
+-------------+----------------+
The idea I am trying to produce. If there is a better way, please do let me know.
from openpyxl import load_workbook
wb = load_workbook(filename = "product_import_tmp.xlsx")
inventory_1_sheet = wb["Inventory 1"]
for row in inventory_1_sheet.iter_rows(values_only=False):
    for cell in row:
        if cell.value == "Product SKU":
            sku_y = cell.column
            sku_x = cell.row + 1  # The Product SKU information would go here
It would help if you could show what your Excel files look like.
I would suggest pandas instead of openpyxl if the Excel data needs complex manipulation.
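If staying with openpyxl, the idea from the question (locate the identifier cell, then write below it) could look roughly like this. It is a sketch under assumptions: find_header is a hypothetical helper, and the template layout and product data are invented so the example runs without product_import_tmp.xlsx.

```python
from openpyxl import Workbook


def find_header(ws, label):
    """Return (row, column) of the first cell whose value equals label, else None."""
    for row in ws.iter_rows():
        for cell in row:
            if cell.value == label:
                return cell.row, cell.column
    return None


# Stand-in for the template sheet: identifiers somewhere in the sheet.
wb = Workbook()
template = wb.active
template["B3"] = "Product SKU"
template["C3"] = "Product Name"

# Stand-in for the data read from the other file.
products = [(12345678, "Some Product"), (87654321, "Other Product")]

sku_row, sku_col = find_header(template, "Product SKU")
name_row, name_col = find_header(template, "Product Name")

# Write each product one row further below its identifier.
for offset, (sku, name) in enumerate(products, start=1):
    template.cell(row=sku_row + offset, column=sku_col, value=sku)
    template.cell(row=name_row + offset, column=name_col, value=name)

print(template["B4"].value, template["C4"].value)
```

Locating the headers before writing avoids modifying the sheet while it is being iterated.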
I'm trying to find a creative way to get the dataframe of several sheets within a spreadsheet that's quite irregular but I can't find the way to do it.
If I try this:
file= 'filename.xlsx'
df = xlrd.open_workbook(file)
print(df)
This is my current output:
A | B | C
1 Random text | Empty cell|Empty cell
------------------------------------
2 Empty cell | |
------------------------------------
3 Empty cell | |
------------------------------------
4 CODE |HEADER 2 | HEADER 3
------------------------------------
5 INFORMATION |INFORMATION|INFORMATION
I want to start my dataframe in the CODE row and column, but pandas just gets the "Random text" cell as the first cell
This is my desired output:
4 CODE |HEADER 2 | HEADER 3
------------------------------------
5 INFORMATION |INFORMATION|INFORMATION
How would you make Pandas ignore the first rows? It has to be value-based because in the next sheets CODE starts in row 8, and in the next one in row 3
Not sure about XLRD, but Pandas has an easy way in the excel reading method that allows you to specify which row is your headers. That would be an easy fix unless you're intent on using XLRD.
You can try:
import pandas as pd
file= 'filename.xlsx'
df = pd.read_excel(open(file, 'rb'),sheet_name='sheetname', skiprows=[0,1,2])
Alternatively you can use header argument as mentioned earlier.
In my previous answer I pointed to a static solution; in this one I have added a helper function for dynamic parsing. The get_header_index helper returns the index of the row that contains the header keyword in the first column. You may change the col_index argument if the header keyword is in another column, though. Likewise, you can change the keyword argument as you like. The output dfs is a dictionary of dataframes whose keys are the sheet names of the given workbook.
import pandas as pd

def get_header_index(sheet, col_index=0, keyword='code'):
    # Column to search, selected by position
    arr = sheet[sheet.columns[int(col_index)]]
    # case=False so that 'code' also matches 'CODE'; na=False skips empty cells
    matches = arr[arr.str.contains(str(keyword), case=False, na=False)]
    return matches.index[0]

file = 'filename.xlsx'
# sheet_name=None loads every sheet into a dict of dataframes
sheets_dict = pd.read_excel(file, sheet_name=None)
dfs = {}
for name, sheet in sheets_dict.items():
    # +1 because the first read already consumed the first row as the header
    header = get_header_index(sheet, col_index=0, keyword='code') + 1
    df = pd.read_excel(file, sheet_name=name, header=header)
    dfs[name] = df
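A variant that avoids reading each sheet twice is to load it once without a header and promote the matching row afterwards. This is only a sketch: frame_from_keyword is a hypothetical name, and raw below is a hand-built stand-in for what pd.read_excel(file, sheet_name=name, header=None) would return for the sheet in the question.

```python
import pandas as pd


def frame_from_keyword(raw, keyword="CODE", col_index=0):
    """Use the first row whose cell in col_index contains keyword as the header."""
    column = raw.iloc[:, col_index]
    # Only string cells can match; the comparison ignores case.
    mask = column.map(lambda v: isinstance(v, str) and keyword.lower() in v.lower())
    if not mask.any():
        raise ValueError(f"keyword {keyword!r} not found in column {col_index}")
    header_pos = int(mask.to_numpy().argmax())  # position of the first match
    body = raw.iloc[header_pos + 1:].reset_index(drop=True)
    body.columns = raw.iloc[header_pos].tolist()
    return body


# Stand-in for a sheet read with header=None (layout copied from the question).
raw = pd.DataFrame(
    [
        ["Random text", None, None],
        [None, None, None],
        [None, None, None],
        ["CODE", "HEADER 2", "HEADER 3"],
        ["INFORMATION", "INFORMATION", "INFORMATION"],
    ]
)

df = frame_from_keyword(raw)
print(df)
```

Because the header row is found by value, the same call works when CODE sits in row 8 or row 3 of another sheet.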
This is a form of what I did in mine, adjusted for your use (based on my previous comment):
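def find_code_cell(file_names):  # wrapper added so that the return statement is valid
    for file in file_names:  # Iterate through all of the individual report files
        book = xlrd.open_workbook(file)
        sheetname = get_sheetname(book)  # helper defined elsewhere, not shown
        if sheetname is not None:  # Check that sheet name is valid
            sheet = book.sheet_by_name(sheetname)
            nrows = sheet.nrows
            ncols = sheet.ncols
            for i in range(nrows):
                for j in range(ncols):
                    check = sheet.cell_value(i, j)
                    # str has no .contains(); use `in` and skip non-string cells
                    if isinstance(check, str) and "CODE" in check:
                        return (i, j)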
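The search itself does not depend on xlrd, so it can be sketched on a plain list of rows. Here find_keyword is a hypothetical helper and grid is invented sample data; with xlrd the rows could be built as [sheet.row_values(i) for i in range(sheet.nrows)].

```python
def find_keyword(rows, keyword="CODE"):
    """Return (row_index, col_index) of the first cell containing keyword, or None."""
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            # Numeric cells (floats in xlrd) and empty cells cannot contain
            # the keyword, so only strings are tested.
            if isinstance(value, str) and keyword in value:
                return (i, j)
    return None


grid = [
    ["Random text", "", ""],
    ["", "", ""],
    ["", "", ""],
    ["CODE", "HEADER 2", "HEADER 3"],
    ["INFORMATION", 42.0, "INFORMATION"],
]

position = find_keyword(grid)
print(position)
```

The returned indices are zero-based, matching xlrd's own row and column numbering.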
I want to read the data in one column in excel, here is my code:
import xlrd
file_location = "location/file_name.xlsx"
workbook = xlrd.open_workbook(file_location)
sheet = workbook.sheet_by_name('sheet')
x = []
for cell in sheet.col[9]:
    if isinstance(cell, float):
        x.append(cell)
print(x)
It is wrong because there is no method in sheet called col[col.num], but I just want to extract the data from column 8 (column H), what can I do?
If you're not locked with xlrd I would probably have used pandas instead which is pretty good when working with data from anywhere:
import pandas as pd
df = pd.ExcelFile('location/test.xlsx').parse('Sheet1') #you could add index_col=0 if there's an index
x=[]
x.append(df['name_of_col'])
You could then just write the new extracted columns to a new excel file with pandas df.to_excel()
You can get the values of the 8th column like this (cell_value() returns the value itself, whereas cell() returns a Cell object):
for rownum in range(sheet.nrows):
    x.append(sheet.cell_value(rownum, 7))
By far the easiest way to get all the values in a column using xlrd is the col_values() worksheet method:
x = []
for value in sheet.col_values(8):
    if isinstance(value, float):
        x.append(value)
(Note that if you want column H, you should use 7, because the indices start at 0.)
Incidentally, you can use col() to get the cell objects in a column:
x = []
for cell in sheet.col(8):
    if isinstance(cell.value, float):
        x.append(cell.value)
The best place to find this stuff is the official tutorial (which serves as a decent reference for xlrd, xlwt, and xlutils). You could of course also check out the documentation and the source code.
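The off-by-one between "column H" and index 7 comes up repeatedly in this thread, so a small conversion helper can remove the guesswork. This is a plain-Python sketch; column_letter_to_index is a hypothetical name, not part of xlrd.

```python
def column_letter_to_index(letter):
    """Convert an Excel column letter ('A', 'H', 'AA', ...) to a zero-based index."""
    if not letter:
        raise ValueError("empty column letter")
    index = 0
    for char in letter.upper():
        if not "A" <= char <= "Z":
            raise ValueError(f"invalid column letter: {letter!r}")
        # Column letters form a base-26 number without a zero digit.
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index - 1  # shift from 1-based to the zero-based index xlrd uses


h_index = column_letter_to_index("H")
print(h_index)
```

The result can be passed directly to sheet.col_values() or sheet.col().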
I would recommend doing it like this:
import openpyxl
fname = 'file.xlsx'
wb = openpyxl.load_workbook(fname)
sheet = wb['sheet-name']  # get_sheet_by_name() is deprecated
for rowOfCellObjects in sheet['C5':'C7']:
    for cellObj in rowOfCellObjects:
        print(cellObj.coordinate, cellObj.value)
Result:
C5 70.82
C6 84.82
C7 96.82
Note: fname refers to the Excel file, wb['sheet-name'] selects the desired sheet, and sheet['C5':'C7'] gives the range of cells to read from the column.
Check out the link for more detail. Code segment taken from here too.
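When only the values are needed, iter_rows() with numeric bounds is an alternative to slicing by coordinates. A minimal sketch on an in-memory sheet (the three numbers are taken from the result above; the rest of the sheet is empty), assuming an openpyxl version that supports values_only:

```python
from openpyxl import Workbook

wb = Workbook()
sheet = wb.active
sheet["C5"] = 70.82
sheet["C6"] = 84.82
sheet["C7"] = 96.82

# min_col == max_col == 3 restricts the selection to column C.
values = [
    row[0]
    for row in sheet.iter_rows(min_row=5, max_row=7, min_col=3, max_col=3, values_only=True)
]
print(values)
```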
XLRD is good, but for this case you might find Pandas good because it has routines to select columns by using an operator '[ ]'
Complete Working code for your context would be
import pandas as pd
file_location = "file_name.xlsx"
sheet = pd.read_excel(file_location)
print(sheet['Sl'])
Output 1 - For column 'Sl'
0 1
1 2
2 3
Name: Sl, dtype: int64
Output 2 - For column 'Name'
print(sheet['Name'])
0 John
1 Mark
2 Albert
Name: Name, dtype: object
Reference: file_name.xlsx data
Sl Name
1 John
2 Mark
3 Albert
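If only one column is wanted, pandas can also skip the others at read time through the usecols parameter, which accepts Excel column letters. The sketch below rebuilds the reference data in memory (so no file_name.xlsx is needed) and assumes that openpyxl is installed as the engine pandas uses for .xlsx files.

```python
from io import BytesIO

import pandas as pd
from openpyxl import Workbook

# Recreate the reference sheet from above in memory.
wb = Workbook()
ws = wb.active
for row in [["Sl", "Name"], [1, "John"], [2, "Mark"], [3, "Albert"]]:
    ws.append(row)

buffer = BytesIO()
wb.save(buffer)
buffer.seek(0)

# usecols="B" keeps only Excel column B, i.e. the 'Name' column.
names = pd.read_excel(buffer, usecols="B")
name_list = names["Name"].tolist()
print(name_list)
```

For the original question, usecols="H" would select column H in the same way.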
I have what feels like a problem with a relatively simple solution, but to this point it escapes my research. I'm attempting to write items from a tuple to four consecutive rows using a for loop, but I can't seem to figure it out. I suspect that it can be done with the iter_rows module in the openpyxl package, but I haven't been able to properly apply it within the loop. The following piece of code results in the generation of an .xlsx file with the last item from the tuple assigned to cell 'A2':
from openpyxl import Workbook
nfc_east = ('DAL', 'WAS', 'PHI', 'NYG')
wb = Workbook()
ws = wb.active
row_cell = 2
for i in nfc_east:
column_cell = 'A'
ws.cell(row = row_cell, column = column_cell).value = str(i)
row_cell = row_cell + 1
wb.save("row_creation_loop.xlsx")
All suggestions and (constructive) criticism welcome. Thank you!
If all you want to do is write the items from the tuple into cells, you can do that directly with syntax like ws['A1'] = <something>, which writes the value into cell A1.
Example -
from openpyxl import Workbook
nfc_east = ('DAL', 'WAS', 'PHI', 'NYG')
wb = Workbook()
ws = wb.active
for row, i in enumerate(nfc_east):
    column_cell = 'A'
    ws[column_cell+str(row+2)] = str(i)
wb.save("row_creation_loop.xlsx")
When you use the syntax ws.cell(row = row_cell, column = column_cell).value, column_cell has to be an integer, not a string. So for column A you have to pass 1 as the column argument, for column B 2, and so on.
Your code doesn't run for me (Invalid column index A). Which version of openpyxl are you using? AFAIK openpyxl uses integer indexes. The following code produces the output you're after (I think).
from openpyxl import Workbook
nfc_east = ('DAL', 'WAS', 'PHI', 'NYG')
wb = Workbook()
ws = wb.active
start_row = 2
start_column = 1
for team in nfc_east:
    ws.cell(row=start_row, column=start_column).value = team
    start_row += 1
wb.save("row_creation_loop.xlsx")
# Resulting sheet:
#
# | A |
# 1 | |
# 2 | DAL |
# 3 | WAS |
# 4 | PHI |
# 5 | NYG |
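Since the question mentions iter_rows, here is how the same result could be reached with it. This is a sketch only: it fills A2:A5 by pairing each one-cell row with a team, and it skips the save step so that nothing is written to disk.

```python
from openpyxl import Workbook

nfc_east = ('DAL', 'WAS', 'PHI', 'NYG')

wb = Workbook()
ws = wb.active

first_row = 2
last_row = first_row + len(nfc_east) - 1

# Each row yielded by iter_rows is a 1-tuple here because min_col == max_col.
rows = ws.iter_rows(min_row=first_row, max_row=last_row, min_col=1, max_col=1)
for (cell,), team in zip(rows, nfc_east):
    cell.value = team

column_a = [cell.value for cell in ws["A"]]
print(column_a)
```

Row 1 stays empty, which is why the first entry read back from column A is None.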