How to iterate over a particular column in Excel using openpyxl (Python)

I am new to Python and need your help. I am trying to write code that iterates through a particular column in Excel using openpyxl.
from io import StringIO
import pandas as pd
import pyodbc
from openpyxl import load_workbook
d=pd.read_excel('workbook.xlsx',header=None)
wb = load_workbook('workbook.xlsx')
So in the above example I have to go to column J and display all the values in that column.
Please help me solve this.
Also, the same column name is repeated in my Excel sheet. For example, the column name "Sample" appears in B2 and also in J2, but I want to get all the data from the column under J2.
Please let me know how to solve this.
Thank you.

Since you're new to python, you should learn to read the documentation. There are tons of modules available and it will be quicker for you and easier for the rest of us if you make the effort first.
import openpyxl
from openpyxl.utils import cell as cellutils
## My example book simply has "=Address(Row(),Column())" in A1:J20
## Because my example uses formulae, I am loading my workbook with
## "data_only = True" in order to get the values; if your cells do not
## contain formulae, you can omit data_only
workbook = openpyxl.load_workbook("workbook.xlsx", data_only = True)
worksheet = workbook.active
## Alternatively: worksheet = workbook["sheetname"]
## A container for gathering the cell values
output = []
## Current Row = 2 assumes that Cell 1 (in this case, J1) contains your column header
## Adjust as necessary
column = cellutils.column_index_from_string("J")
currentrow = 2
## Get the first cell
cell = worksheet.cell(column = column, row = currentrow)
## The purpose of "While cell.value" is that I'm assuming the column
## is complete when the cell does not contain a value
## If you know the exact range you need, you can either use a for-loop,
## or look at openpyxl.utils.cell.rows_from_range
while cell.value:
    ## Add Cell value to our list of values for this column
    output.append(cell.value)
    ## Move to the next row
    currentrow += 1
    ## Get that cell
    cell = worksheet.cell(column = column, row = currentrow)
print(output)
""" output: ['$J$2', '$J$3', '$J$4', '$J$5', '$J$6', '$J$7',
'$J$8', '$J$9', '$J$10', '$J$11', '$J$12', '$J$13', '$J$14',
'$J$15', '$J$16', '$J$17', '$J$18', '$J$19', '$J$20']
"""
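A shorter variant, as a sketch only: openpyxl can return a single column through ws.iter_rows with min_col and max_col set to the same index. The example below builds a small in-memory workbook instead of loading 'workbook.xlsx'. The header row position (row 2), the sample values, and the helper name column_values_by_header are assumptions made for illustration. It picks the last column whose header is "Sample", which covers the duplicate-header case from the question.

```python
from openpyxl import Workbook


def column_values_by_header(ws, header, header_row=2, occurrence=-1):
    """Return the values below a header cell (hypothetical helper).

    occurrence=-1 picks the last matching column, 0 the first.
    """
    # Column index of every cell in the header row whose value matches.
    matches = [cell.column for cell in ws[header_row] if cell.value == header]
    if not matches:
        raise ValueError(f"no column with header {header!r} in row {header_row}")
    col = matches[occurrence]
    # values_only=True yields plain values instead of Cell objects.
    rows = ws.iter_rows(min_row=header_row + 1, min_col=col, max_col=col,
                        values_only=True)
    return [row[0] for row in rows]


# Assumed sample sheet: "Sample" is the header in both B2 and J2.
wb = Workbook()
ws = wb.active
ws["B2"] = "Sample"
ws["J2"] = "Sample"
for row_number, (b, j) in enumerate([("b1", "j1"), ("b2", "j2"), ("b3", "j3")], start=3):
    ws.cell(row=row_number, column=2, value=b)
    ws.cell(row=row_number, column=10, value=j)

values = column_values_by_header(ws, "Sample")
print(values)  # ['j1', 'j2', 'j3']
```

With a real file you would replace the in-memory workbook with load_workbook('workbook.xlsx') and adjust header_row to wherever the headers sit.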

Related

Python/Openpyxl: Merge empty row cells delimited by string

I am trying to create a script using python and openpyxl to open up a given excel sheet and merge all cells in a given row together until the script finds a cell containing a string. The row placement is always the same, but the number of columns and the column placement of the strings is not so it needs to be dynamic. Once a new string is found, I want to continue to merge cells until the column that is right before the grand total. There are also cases where the cell doesn't need to be merged, because there is no empty cell in the data set to merge it with.
I found this answer here, which is doing a similar procedure except it is merging rows instead of columns. I was able to refactor part of this to create a list of the cells that have strings in my workbook, but am struggling on next steps. Any thoughts?
import openpyxl
from openpyxl.utils import get_column_letter
from openpyxl import Workbook
wb1 = openpyxl.load_workbook('stackoverflow question.xlsx')
ws1 = wb1['ws1']
columns_with_strings = []
merge_row = '3' #the data to merge will always be in this row
for col in range (2, ws1.max_column-1):
    for row in merge_row:
        if ws1[get_column_letter(col) + merge_row].value != None:
            columns_with_strings.append(str(get_column_letter(col) + merge_row))
The above code yields this list which includes the correct cells that contain strings and need to be checked for merging:
['C3', 'F3', 'J3']
This is how the workbook looks now:
This is how I am trying to get it to look in the end:
To complete your code, you can use worksheet.merge_cells together with the alignment attribute of worksheet.cell:
from openpyxl import load_workbook
from openpyxl.styles import Alignment
wb = load_workbook("tmp/stackoverflow question.xlsx")
ws = wb["Sheet1"]
merge_row = 3
#here, we get the columns idx for every non null cell in row 3
#and after that, we make a text alignment (center) in the last cell
idx_col_strings = [cell.column for cell in ws[merge_row] if cell.value]
ws.cell(3, idx_col_strings[-1]).alignment = Alignment(horizontal="center")
#here, we loop through each range until the last non null cell in row 3
#then, we make a merge as much as the number of transitions (non null => null)
#and finally, we make a text alignment (center) for each cell/merge
for i in range(len(idx_col_strings)-1):
    start_col, end_col = idx_col_strings[i], idx_col_strings[i+1]-1
    ws.merge_cells(start_row=merge_row, start_column=start_col,
                   end_row=merge_row, end_column=end_col)
    ws.cell(merge_row, start_col).alignment = Alignment(horizontal="center")
wb.save("tmp/stackoverflow answer.xlsx")
To start, if you aren't familiar with openpyxl's merge and unmerge functions, I recommend you read about them in the documentation (https://openpyxl.readthedocs.io/en/stable/usage.html#merge-unmerge-cells) to get a sense of how this works.
Here is base code that should provide the functionality you want, but some values may need to be tweaked for your machine or spreadsheet.
import openpyxl # Necessary imports.
from openpyxl.utils import get_column_letter
from openpyxl.utils.cell import coordinate_from_string
from openpyxl.utils.cell import column_index_from_string
from openpyxl import Workbook
wb1 = openpyxl.load_workbook('stackoverflow question.xlsx') # Start of your code.
ws1 = wb1.worksheets[0]
columns_with_strings = []
merge_row = '3' #the data to merge will always be in this row
for col in range (2, ws1.max_column):
    for row in merge_row:
        if ws1[get_column_letter(col) + merge_row].value != None:
            columns_with_strings.append(str(get_column_letter(col) + merge_row)) # End of your code.
prior_string = columns_with_strings[0] # Set the "prior_string" to be the first detected string.
for cell in columns_with_strings:
    coords = coordinate_from_string(cell) # Split the current cell into its letter and number components.
    if column_index_from_string(coords[0]) >1:
        prior = str(get_column_letter(column_index_from_string(coords[0])-1)) + str(coords[1]) # Get the cell that is left of the current cell.
        # Note: this compares coordinate strings, so it only orders columns correctly while they are single letters (A-Z).
        if prior > prior_string:
            ws1.merge_cells(f'{prior_string}:{prior}') # Merge the cells.
    prior_string=cell # Set the current string to be the prior string.
ws1.merge_cells(f'{cell}:{get_column_letter(ws1.max_column)+str(coords[1])}') # Merge the last string to the end (the last column).
wb1.save("stackoverflow question.xlsx") # Save the file changes.
I hope this helps to point you in the right direction!
Based on @timeless's answer, I've cleaned the code up a bit to make better use of Python's tools and the openpyxl API:
from openpyxl import Workbook
wb = Workbook()
ws = wb.active
ws.append([])
ws.append([])
ws.append([None, None, "Group A", None, None, "Group B", None, None, None, "Group C"])
# get column indices for header cells
headings = [cell.column for cell in next(ws.iter_rows(min_row=3, max_row=3)) if cell.value]
from openpyxl.styles import Alignment, PatternFill, NamedStyle
fill = PatternFill(patternType="solid", fgColor="DDDDDD")
alignment = Alignment(horizontal="center")
header_style = NamedStyle(alignment=alignment, fill=fill, name="Header")
wb.named_styles.append(header_style)
from itertools import zip_longest
# create ranges for merged cells from the list of header cells: the boundary of the first range, is the index of the start of the next minus 1. Use zip_longest for the final value
for start_column, end_column in zip_longest(headings, headings[1:], fillvalue=headings[-1]+1):
    ws.cell(3, start_column).style = header_style
    ws.merge_cells(start_row=3, end_row=3, start_column=start_column, end_column=end_column-1)
wb.save("merged.xlsx")
Using the API wherever possible generally leads to more manageable and generic code.

Write an Excel formula to a whole column with Python

I have an existing Excel document and want to update column M according to column A. I want to start from the second row so that the first row (the header) is preserved.
Here is my code:
import openpyxl
wb = openpyxl.load_workbook(r'D:\Documents\Desktop\deneme/formula.xlsx')
ws=wb['Sheet1']
for i, cellObj in enumerate(ws['M'], 1):
    cellObj.value = '=_xlfn.ISOWEEKNUM(A2)'.format(i)
wb.save(r'D:\Documents\Desktop\deneme/formula.xlsx')
When I run that code:
- the first row (the header) changes;
- every cell in the column gets "ISOWEEKNUM(A2)", but I want the reference to follow the row number (A3, A4, A5, ... giving ISOWEEKNUM(A3), ISOWEEKNUM(A4), ISOWEEKNUM(A5), ...).
Edit:
I have worked around the ISOWEEKNUM issue for now with the code below, by changing A2 to A2:A5.
import openpyxl
wb = openpyxl.load_workbook(r'D:\Documents\Desktop\deneme/formula.xlsx')
ws=wb['Sheet1']
for i, cellObj in enumerate(ws['M'], 1):
    cellObj.value = '=_xlfn.ISOWEEKNUM(A2:A5)'.format(i)
wb.save(r'D:\Documents\Desktop\deneme/formula.xlsx')
But it still starts from the first row.
Here is an answer using pandas.
Let us consider the following spreadsheet:
First import pandas:
import pandas as pd
Then load the third sheet of your excel workbook into a dataframe called df:
df=pd.read_excel(r'D:\Documents\Desktop\deneme/formula.xlsx', sheet_name='Sheet3')
Update column 'column_to_update' using column 'deneme'. The line below converts the dates in the 'deneme' column from strings to datetime objects and then returns the ISO week of the year associated with each of those dates. (Series.dt.week was removed in pandas 2.0; dt.isocalendar().week is its replacement.)
df['column_to_update'] = pd.to_datetime(df['deneme']).dt.isocalendar().week
You can then save your dataframe to a new excel document:
df.to_excel('./newspreadsheet.xlsx', index=False)
Here is the result:
You can see that the values in 'column_to_update' got updated from 1, 2 and 3 to 12, 12 and 18.
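If the goal is to keep the workbook and write the formula itself rather than computed values, one possible openpyxl sketch is to loop over row numbers starting at 2 and build the cell reference for each row. The in-memory workbook and its sample dates are assumptions for illustration; with the real file you would use load_workbook and wb.save as in the question.

```python
from datetime import date

from openpyxl import Workbook

# Assumed sample sheet: a header row, then dates in column A.
wb = Workbook()
ws = wb.active
ws["A1"] = "date"
ws["M1"] = "week"  # header cell that must stay untouched
sample_dates = [date(2020, 3, 16), date(2020, 3, 20), date(2020, 4, 27)]
for row, d in enumerate(sample_dates, start=2):
    ws.cell(row=row, column=1, value=d)

# Start at row 2 so the header in M1 is preserved, and put the row number
# into the reference so each formula points at its own row in column A.
for row in range(2, ws.max_row + 1):
    ws.cell(row=row, column=13).value = f"=_xlfn.ISOWEEKNUM(A{row})"  # column 13 is M

formulas = [ws.cell(row=r, column=13).value for r in range(1, ws.max_row + 1)]
print(formulas)
```

The original loop had two separate problems: enumerate(ws['M'], 1) visits row 1 as well, and the format string contained no {} placeholder, so .format(i) never changed the reference.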

Data append to list using XLRD

I am able to import the rows of a particular column of a given sheet into a Python list. However, the list looks like a key:value formatted list, which is not what I need.
Here is my code:
import xlrd
excelList = []
def xcel(path):
    book = xlrd.open_workbook(path)
    impacted_files = book.sheet_by_index(2)
    for row_index in range(2, impacted_files.nrows):
        #if impacted_files.row_values(row_index) == 'LCR':
        excelList.append(impacted_files.cell(row_index, 1))
    print(excelList)
if __name__ == "__main__":
    xcel(path)
The output is like below:
[text:'LCR_ContractualOutflowsMaster.aspx', text:'LCR_CountryMaster.aspx', text:'LCR_CountryMasterChecker.aspx', text:'LCR_EntityMaster.aspx', text:'LCR_EntityMasterChecker.aspx', text:'LCR_EscalationMatrixMaster.aspx',....]
I want the list to have just values. Like this...
['LCR_ContractualOutflowsMaster.aspx', 'LCR_CountryMaster.aspx', 'LCR_CountryMasterChecker.aspx', 'LCR_EntityMaster.aspx', 'LCR_EntityMasterChecker.aspx', 'LCR_EscalationMatrixMaster.aspx',...]
I've tried pandas too (df.value.tolist() method). Yet the output is not what I visualize.
Please suggest a way.
Regards
You are accumulating a list of cells, and what you are seeing is the repr of each cell in your list. Cell objects have three attributes: ctype, an int that identifies the type of the cell's value; value, a Python object holding the cell's value; and xf_index. If you want only the values, then try:
excelList.append(impacted_files.cell(row_index, 1).value)
You can read more about cells in the documentation.
If you are willing to try one more library, this is how it can be done with openpyxl:
from openpyxl import load_workbook
book = load_workbook(path)
sh = book.worksheets[0]
# flatten every row of the sheet into one list of cell values
print([cell.value for row in sh.iter_rows() for cell in row])

Import Excel Tables into pandas dataframe

I would like to import excel tables (made by using the Excel 2007 and above tabulating feature) in a workbook into separate dataframes. Apologies if this has been asked before but from my searches I couldn't find what I wanted. I know you can easily do this using the read_excel function however this requires the specification of a Sheetname or returns a dict of dataframes for each sheet.
Instead of specifying sheetname, I was wondering whether there was a way of specifying tablename or better yet return a dict of dataframes for each table in the workbook.
I know this can be done by combining xlwings with pandas but was wondering whether this was built-into any of the pandas functions already (maybe ExcelFile).
Something like this:-
import pandas as pd
xls = pd.ExcelFile('excel_file_path.xls')
# to read all tables to a map
tables_to_df_map = {}
# table_names is the wished-for attribute; pandas' ExcelFile only has sheet_names
for table_name in xls.table_names:
    tables_to_df_map[table_name] = xls.parse(table_name)
Although not exactly what I was after, I have found a way to get table names with the caveat that it's restricted to sheet name.
Here's an excerpt from the code that I'm currently using:
import pandas as pd
import openpyxl as op
wb=op.load_workbook(file_location)
# Connecting to the specified worksheet
ws = wb[sheetname]
# Initialising an empty list where the excel tables will be imported
# into
var_tables = []
# Importing table details from excel: Table_Name and Sheet_Range
# (recent openpyxl exposes tables as the dict-like ws.tables; older versions kept a list in ws._tables)
for table in ws.tables.values():
    sht_range = ws[table.ref]
    data_rows = []
    i = 0
    j = 0
    for row in sht_range:
        j += 1
        data_cols = []
        for cell in row:
            i += 1
            data_cols.append(cell.value)
            if (i == len(row)) & (j == 1):
                data_cols.append('Table_Name')
            elif i == len(row):
                data_cols.append(table.name)
        data_rows.append(data_cols)
        i = 0
    var_tables.append(data_rows)
# Creating an empty list where all the dfs will be appended
# into
var_df = []
# Appending each table extracted from excel into the list
for tb in var_tables:
    df = pd.DataFrame(tb[1:], columns=tb[0])
    var_df.append(df)
# Merging all in one big df
df = pd.concat(var_df,axis=1) # This merges on columns
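To get closer to the "dict of dataframes for each table" asked for in the question, here is a minimal sketch that walks every sheet and keys each DataFrame by table name. It assumes a recent openpyxl where ws.tables is dict-like, and that every table has a header row. The function name tables_to_dataframes and the sample "Prices" table are made up for illustration, and the workbook is built in memory rather than loaded from a file.

```python
import pandas as pd
from openpyxl import Workbook
from openpyxl.worksheet.table import Table


def tables_to_dataframes(wb):
    """Map each Excel table name to a DataFrame built from its cell range."""
    frames = {}
    for ws in wb.worksheets:
        for table in ws.tables.values():
            # table.ref is a range string such as "A1:B3".
            rows = [[cell.value for cell in row] for row in ws[table.ref]]
            # Assumption: the first row of the range is the table's header row.
            frames[table.name] = pd.DataFrame(rows[1:], columns=rows[0])
    return frames


# Assumed sample workbook with one table named "Prices" on the active sheet.
wb = Workbook()
ws = wb.active
for row in (["item", "price"], ["apple", 1.5], ["pear", 2.0]):
    ws.append(row)
ws.add_table(Table(displayName="Prices", ref="A1:B3"))

frames = tables_to_dataframes(wb)
print(frames["Prices"])
```

For a real file, pass openpyxl.load_workbook(file_location) to the function. Keeping the tables in a dict avoids the extra 'Table_Name' column and the column-wise concat used above.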

xlwings function to find the last row with data

I am trying to find the last row in a column with data, to replace the VBA expression: LastRow = sht.Cells(sht.Rows.Count, "A").End(xlUp).Row
I am trying this, but it pulls in all rows in Excel. How can I get just the last row?
from xlwings import Workbook, Range
wb = Workbook()
print(len(Range('A:A')))
Consolidating the answers above, you can do it in one line:
wb.sheet.range(column + last cell value).Get End of section going up[non blank assuming the last cell is blank].row
Example code:
import xlwings as xw
from xlwings import Range, constants
wb = xw.Book(r'path.xlsx')
wb.sheets[0].range('A' + str(wb.sheets[0].cells.last_cell.row)).end('up').row
We can use Range object to find the last row and/or the last column:
import xlwings as xw
# open raw data file
filename_read = 'data_raw.csv'
wb = xw.Book(filename_read)
sht = wb.sheets[0]
# find the numbers of columns and rows in the sheet
num_col = sht.range('A1').end('right').column
num_row = sht.range('A1').end('down').row
# collect data
content_list = sht.range((1,1),(num_row,num_col)).value
print(content_list)
This is very much the same as crazymachu's answer, just wrapped up in a function. Since version 0.9.0 of xlwings you can do this:
import xlwings as xw
def lastRow(idx, workbook, col=1):
    """ Find the last row in the worksheet that contains data.
    idx: Specifies the worksheet to select. Starts counting from zero.
    workbook: Specifies the workbook
    col: The column in which to look for the last cell containing data.
    """
    ws = workbook.sheets[idx]
    lwr_r_cell = ws.cells.last_cell # lower right cell
    lwr_row = lwr_r_cell.row # row of the lower right cell
    lwr_cell = ws.range((lwr_row, col)) # change to your specified column
    if lwr_cell.value is None:
        lwr_cell = lwr_cell.end('up') # go up until you hit a non-empty cell
    return lwr_cell.row
Intuitively, the function starts off by finding the most extreme lower-right cell in the workbook. It then moves across to your selected column and then up until it hits the first non-empty cell.
You could try using Direction by starting at the very bottom and then moving up:
import xlwings
from xlwings.constants import Direction
wb = xlwings.Workbook(r'data.xlsx')
print(wb.active_sheet.xl_sheet.Cells(65536, 1).End(Direction.xlUp).Row)
Try this:
import xlwings as xw
cellsDown = xw.Range('A1').vertical.value
cellsRight = xw.Range('A1').horizontal.value
print(len(cellsDown))
print(len(cellsRight))
One could use the VBA Find function that is exposed through api property (use it to find anything with a star, and begin your search from the first cell).
Example:
row_cell = s.api.Cells.Find(What="*",
                            After=s.api.Cells(1, 1),
                            LookAt=xlwings.constants.LookAt.xlPart,
                            LookIn=xlwings.constants.FindLookIn.xlFormulas,
                            SearchOrder=xlwings.constants.SearchOrder.xlByRows,
                            SearchDirection=xlwings.constants.SearchDirection.xlPrevious,
                            MatchCase=False)
column_cell = s.api.Cells.Find(What="*",
                               After=s.api.Cells(1, 1),
                               LookAt=xlwings.constants.LookAt.xlPart,
                               LookIn=xlwings.constants.FindLookIn.xlFormulas,
                               SearchOrder=xlwings.constants.SearchOrder.xlByColumns,
                               SearchDirection=xlwings.constants.SearchDirection.xlPrevious,
                               MatchCase=False)
print((row_cell.Row, column_cell.Column))
The other methods outlined here seem to require that there are no empty rows or columns within the data.
source: https://gist.github.com/Elijas/2430813d3ad71aebcc0c83dd1f130e33
python 3.6, xlwings 0.11
Solution 1
To find last row with data, you should do some work both horizontally and vertically. You have to go through every column to determine which row is the last row.
import xlwings
workbook_all = xlwings.Book(r'path.xlsx')
objectiveSheet = workbook_all.sheets['some_sheet']
# lastCellContainData(), inspired by Stefan's answer.
def lastCellContainData(objectiveSheet,lastRow=None,lastColumn=None):
    lastRow = objectiveSheet.cells.last_cell.row if lastRow is None else lastRow
    lastColumn = objectiveSheet.cells.last_cell.column if lastColumn is None else lastColumn
    lastRows,lastColumns = [],[]
    for col in range(1,lastColumn):
        lastRows.append(objectiveSheet.range((lastRow, col)).end('up').row)
    # extract last row of every column, then max(). Or you can compare the next
    # column's last row number to the last column's last row number. Here you get
    # the last row with data, you can also go further get the last column with data:
    for row in range(1,lastRow):
        lastColumns.append(objectiveSheet.range((row, lastColumn)).end('left').column)
    return max(lastRows),max(lastColumns)
lastCellContainData(objectiveSheet,lastRow=5000,lastColumn=300)
I added the lastRow and lastColumn parameters. To make the program more efficient, you can set them according to the approximate shape of the data you're dealing with.
Solution 2
xlwings is known as a wrapper around pywin32. I don't know whether your situation allows for keyboard or mouse input. If so, first press ctrl+tab to switch to the workbook, then ctrl+a to select the region containing data, then call workbook_all.selection.rows.count.
Another way:
When you know roughly where the bottom-right cell of your data is, say AAA10000, just call objectiveSheet.range('A1:'+'AAA10000').current_region.rows.count
Update:
After a while none of the solutions were really intuitive to me, so I decided to compile the following:
Code:
import xlwings as Objxlwings
import xlwings.constants
def Return_RangeLastCell(ObjWS):
    return ObjWS.api.Cells.SpecialCells(xlwings.constants.CellType.xlCellTypeLastCell)
I tried to keep consistency with the way to call it from Excel to keep it simple
Then on my main code, I just call it like so:
ObjWS=Objxlwings.Book('Book1.xlsx').sheets["Sheet1"]
print(Return_RangeLastCell(ObjWS).Column)
Interesting solutions. But maybe like this:
print(sheet.used_range.last_cell.row)
@Cody's answer will help under normal circumstances, but if your sheet has hidden rows at the bottom, it will give the wrong row number.
Let's say your data has 10 rows and the rows after row 5 are hidden, so the actual last row is 10.
[code a] below will give you 5; [code b] below will give you 10.
code a:
ws = wb.sheets[your_sheet_name]
last_row = ws.range('A' + str(ws.cells.last_cell.row)).end('up').row # return 5
code b:
ws = wb.sheets[your_sheet_name]
last_row_1 = ws.used_range.last_cell.row # return 10
