I am trying to read all values within the first sheet of an Excel file via xlrd, but I need it to start reading values from row 3 of the sheet and continue until the end of the values in that column.
The current version reads all information within the columns, including the headers, which is not desired.
code:
for col in range(sheet.nrows):
    names = sheet.cell(col, 0)
    nums = sheet.cell(col, 1)
    if names.value != xlrd.empty_cell.value:
        if nums.value != xlrd.empty_cell.value:
            f.write('\t\t\t\t\t\t\t\t\t' + '<li><strong>' + names.value + '</strong> ' + repr(nums.value) + '</li>' + "\n")
Change the start index in your code: for col in range(2, sheet.nrows): should give the desired behaviour.
On a side note, you should really rename your variables; you're using col as the loop variable over the number of rows in the sheet, which causes all kinds of confusion.
EDIT to point out that xlrd is 0-indexed, so Excel row 3 corresponds to index 2.
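Putting both points together, a minimal corrected sketch (assuming sheet and the output file f are set up as in the question) could look like this:
import xlrd

# Excel row 3 is index 2, because xlrd rows are 0-indexed
for row in range(2, sheet.nrows):
    names = sheet.cell(row, 0)
    nums = sheet.cell(row, 1)
    if names.value != xlrd.empty_cell.value and nums.value != xlrd.empty_cell.value:
        f.write('\t\t\t\t\t\t\t\t\t' + '<li><strong>' + names.value
                + '</strong> ' + repr(nums.value) + '</li>' + "\n")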
I am working on a problem where I have two .tsv files and one has been arranged wrongly with respect to the other.
When I scanned the files I noticed a pattern which I am unable to express in code. The pattern I observed was:
For every increase of one row in the metadata file, 8 rows of increment are needed in the flipped_metadata.tsv file to match the same values in the metadata file.
For every increase of one row in the flipped_metadata file, 12 rows of increment are needed in the metadata.tsv file to match the same values in the flipped_metadata file.
For more clarity I have attached the two .tsv files:
Metadata.tsv file and Flipped_metadata.tsv file
The openpyxl library has good functions for dealing with Excel cell locations. These can be used to convert A1 to a proper row and column.
Read each row in and convert the cell reference to a simple numeric row and column value. Use a dictionary to store each cell found with the two values for that cell. e.g. cells[(1,1)] = "123 456"
Whilst reading in, keep track of the largest row and column seen.
Create an empty array (list of lists) to allow each cell to be assigned into.
Iterate over all of the dictionary items and assign each value into the array.
Finally save the array to a new CSV file.
For example:
from openpyxl.utils.cell import coordinate_from_string, column_index_from_string
import csv

def flip(input_filename, output_filename):
    cells = {}
    max_row = 0
    max_col = 0
    with open(input_filename) as f_input:
        for cell, v1, v2 in csv.reader(f_input, delimiter='\t'):
            col_letter, row_number = coordinate_from_string(cell)
            col_number = column_index_from_string(col_letter)
            cells[(row_number, col_number)] = f"{v1} {v2}"
            if row_number > max_row:
                max_row = row_number
            if col_number > max_col:
                max_col = col_number
    output = [[''] * max_col for _ in range(max_row)]
    for (row_number, col_number), values in cells.items():
        output[row_number - 1][col_number - 1] = values
    with open(output_filename, 'w', newline='') as f_output:
        csv.writer(f_output).writerows(output)

flip('metadata.tsv', 'output_metadata.csv')
flip('flipped_metadata.tsv', 'output_flipped_metadata.csv')
Running this would give you the two re-aligned CSV files.
Note: this approach correctly handles all cell references, e.g. FK42. It would also handle holes in the data: if A2 were deleted, the rest would still align correctly. (It is not 100% clear from the question whether data in cells can be missing.)
I am trying to delete all rows from an Excel sheet that satisfy this condition: every cell in the row must contain only the value "-" or "".
I am using Python and openpyxl.
But my code doesn't work well:
import openpyxl

wb1 = openpyxl.load_workbook(filename="testat_openpyxl.xlsx")
instr = ""
for i in range(5, wb1["Centralizator"].max_row):
    for j in range(7, 49):
        celval = wb1["Centralizator"].cell(row=i, column=j).value
        ins = '{}=="-" or {}=="" and '.format(celval)
        instr = instr + ins
    if instr[:-3]:
        wb1["Centralizator"].delete_rows(i, 1)
wb1.save('testat_openpyxl.xlsx')
My idea is to create a big "if" statement to check all the cells in a row:
if wb1["Centralizator"].cell(row=5,column=7).value=="-" or
wb1["Centralizator"].cell(row=5,column=7).value=="" and
wb1["Centralizator"].cell(row=5,column=8).value=="-" or
wb1["Centralizator"].cell(row=5,column=8).value==""
wb1["Centralizator"].cell(row=5,column=9).value=="-" or
wb1["Centralizator"].cell(row=5,column=9).value==""
.........loop until column 48............
wb1["Centralizator"].cell(row=5,column=48).value=="-" or
wb1["Centralizator"].cell(row=5,column=48).value==""
now jump to next row
if wb1["Centralizator"].cell(row=6,column=7).value=="-" or
wb1["Centralizator"].cell(row=6,column=7).value=="" and
wb1["Centralizator"].cell(row=6,column=8).value=="-" or
wb1["Centralizator"].cell(row=6,column=8).value==""
.........loop until column 48...........
wb1["Centralizator"].cell(row=6,column=48).value=="-" or
wb1["Centralizator"].cell(row=6,column=48).value==""
next row.......until max_row....
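A minimal sketch of that check without building a string, looping backwards so deleted rows don't shift the ones still to be checked (note that openpyxl returns None for empty cells, so None is treated like "" here; sheet name and column range are taken from the question):
import openpyxl

wb1 = openpyxl.load_workbook(filename="testat_openpyxl.xlsx")
ws = wb1["Centralizator"]

# bottom-up, so delete_rows() doesn't renumber rows not yet checked
for i in range(ws.max_row, 4, -1):
    values = [ws.cell(row=i, column=j).value for j in range(7, 49)]
    if all(v in ("-", "", None) for v in values):
        ws.delete_rows(i, 1)

wb1.save('testat_openpyxl.xlsx')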
I have a medium-sized Excel file with about 25,000 rows.
In the Excel file I check whether a specific column value is in a list, and if it is in the list, I delete the row.
I'm using openpyxl.
The code:
count = 1
while count <= ws.max_row:
    if ws.cell(row=count, column=2).value in remove_list:
        ws.delete_rows(count, 1)
    else:
        count += 1
wb.save(src)
The code works, but it is very slow (it takes hours) to finish.
I know there are read-only and write-only modes, but in my case I need both: first to check, then to delete.
I see you are using a list of rows which you need to delete. Instead, you can create "sequences" of rows to delete, turning a delete list like [2,3,4,5,6,7,8,45,46,47,48] into one like [[2, 7], [45, 4]],
i.e. delete 7 rows starting at row 2, then delete 4 rows starting at row 45.
Deleting in bulk is faster than deleting rows one by one; I deleted 6k rows in around 10 seconds.
The following code will convert a list to a list of lists/sequences:
def get_sequences(list_of_ints):
    sequence_count = 1
    sequences = []
    for row in list_of_ints:
        next_item = None
        if list_of_ints.index(row) < (len(list_of_ints) - 1):
            next_item = list_of_ints[list_of_ints.index(row) + 1]
        if (row + 1) == next_item:
            sequence_count += 1
        else:
            first_in_sequence = list_of_ints[list_of_ints.index(row) - sequence_count + 1]
            sequences.append([first_in_sequence, sequence_count])
            sequence_count = 1
    return sequences
Then run another loop to delete
for sequence in sequences:
    sheet.delete_rows(sequence[0], sequence[1])
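For illustration, the helper turns the example delete list from above into the grouped form:
rows_to_delete = [2, 3, 4, 5, 6, 7, 8, 45, 46, 47, 48]
print(get_sequences(rows_to_delete))  # -> [[2, 7], [45, 4]]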
Personally, I would do two things:
first transform the list into a set so the lookup of the item takes less time
remove_set = set(remove_list)
...
if ws.cell(row=count, column=2).value in remove_set:
then I would avoid removing the rows in place, as it takes a lot of time to reorganise the data structures representing the sheet.
I would create a new blank worksheet and add to it only the rows which must be kept.
Then save the new worksheet, overwriting the original if you wish.
If it still takes too long, consider using CSV format so you can treat the input data as plain text and output it the same way, re-importing the data into the spreadsheet program (e.g. MS Excel) later.
Have a look at the official docs and at this tutorial to find out how to use the CSV library
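As a rough sketch of that CSV route (the file names here are just placeholders, and the values being matched are assumed to be plain strings once read from CSV):
import csv

remove_set = set(remove_list)  # remove_list as in the question

# filter the exported rows as plain text: keep a row only when the
# value in the second column is not in the removal set
with open('data.csv', newline='') as f_in, \
     open('filtered.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out)
    for row in csv.reader(f_in):
        if row[1] not in remove_set:
            writer.writerow(row)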
Further note: as spotted by @Charlie Clark, the calculation of
ws.max_row
may take some time as well, and there is no need to repeat it.
To avoid recomputing it, the easiest solution is to work backwards from the last row down to the first, so that the deleted rows do not affect the position of the rows that still have to be checked.
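A minimal sketch combining those points (set lookup, max_row computed once, deleting from the bottom up), based on the question's variables:
remove_set = set(remove_list)
last_row = ws.max_row  # computed once, not on every pass

# walk from the last row up to row 1 so deletions don't shift
# the rows that still have to be checked
for row in range(last_row, 0, -1):
    if ws.cell(row=row, column=2).value in remove_set:
        ws.delete_rows(row, 1)

wb.save(src)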
When a number of rows have to be deleted from a sheet, I create a list of these row numbers, e.g. remove_list and then I rewrite the sheet to a temporary sheet, excluding these rows. I delete the original sheet and rename the temporary sheet to the original sheet. See my function for doing this below:
def delete_excel_rows_with_openpyxl(workbook, sheet, remove_list):
    """ Delete rows with row numbers in remove_list from sheet contained in workbook """
    temp_sheet = workbook.create_sheet('TempSheet')
    destination_row_counter = 1
    for source_row_counter, source_row in enumerate(sheet.iter_rows(min_row=1, max_row=sheet.max_row)):
        try:
            i = remove_list.index(source_row_counter + 1)  # enumerate counts from 0 and sheet from 1
            # do not copy row
            del remove_list[i]
        except ValueError:
            # copy row
            column_count = 1
            for cell in source_row:
                temp_sheet.cell(row=destination_row_counter, column=column_count).value = cell.value
                column_count = column_count + 1
            destination_row_counter = destination_row_counter + 1
    sheet_title = sheet.title
    workbook.remove_sheet(sheet)
    temp_sheet.title = sheet_title
    return workbook, temp_sheet
Adding on to ketdaddy's response: I tested it and noticed that when you use this sequence in a for loop as suggested, you need to update the row number in every iteration to account for the rows already deleted.
For example, when you get to the second step in the loop, the start row is no longer the original start row; it is the original start row minus the number of rows deleted so far.
This code will update ketdaddy's sequence to generate one which takes this into account:
original_sequence = get_sequences(deleterows)

updated_sequence = []
cumdelete = 0
for start, delete in original_sequence:
    new_start = start - cumdelete
    cumdelete = cumdelete + delete
    updated_sequence.append([new_start, delete])

updated_sequence  # adjusted [start, count] pairs
I have a somewhat large .xlsx file - 19 columns, 5185 rows. I want to open the file, read all the values in one column, do some stuff to those values, and then create a new column in the same workbook and write out the modified values. Thus, I need to be able to both read and write in the same file.
My original code did this:
def readExcel(doc):
    wb = load_workbook(generalpath + exppath + doc)
    ws = wb["Sheet1"]

    # iterate through the columns to find the correct one
    for col in ws.iter_cols(min_row=1, max_row=1):
        for mycell in col:
            if mycell.value == "PerceivedSound.RESP":
                origCol = mycell.column

    # get the column letter for the first empty column to output the new values
    newCol = utils.get_column_letter(ws.max_column + 1)

    # iterate through the rows to get the value from the original column,
    # do something to that value, and output it in the new column
    for myrow in range(2, ws.max_row + 1):
        myrow = str(myrow)
        # do some stuff to make the new value
        cleanedResp = doStuff(ws[origCol + myrow].value)
        ws[newCol + myrow] = cleanedResp

    wb.save(doc)
However, Python threw a memory error after row 3853 because the workbook was too big. The openpyxl docs say to use read-only mode (https://openpyxl.readthedocs.io/en/latest/optimized.html) to handle big workbooks. I'm now trying to use that; however, there seems to be no way to iterate through the columns when I add the read_only=True parameter:
def readExcel(doc):
    wb = load_workbook(generalpath + exppath + doc, read_only=True)
    ws = wb["Sheet1"]
    for col in ws.iter_cols(min_row=1, max_row=1):
        # etc.
python throws this error:
AttributeError: 'ReadOnlyWorksheet' object has no attribute 'iter_cols'
If I change the final line in the above snippet to:
for col in ws.columns:
python throws the same error:
AttributeError: 'ReadOnlyWorksheet' object has no attribute 'columns'
Iterating over rows is fine (and is included in the documentation I linked above):
for col in ws.rows:
(no error)
This question asks about the AttributeError, but the solution there is to remove read-only mode, which doesn't work for me because openpyxl won't read my entire workbook when it is not in read-only mode.
So: how do I iterate through columns in a large workbook?
And I haven't yet encountered this, but I will once I can iterate through the columns: how do I both read and write the same workbook, if said workbook is large?
Thanks!
If the worksheet has only around 100,000 cells then you shouldn't have any memory problems. You should probably investigate this further.
iter_cols() is not available in read-only mode because it would require constant and very inefficient reparsing of the underlying XML file. It is, however, relatively easy to convert rows into columns from iter_rows() using zip.
def _iter_cols(self, min_col=None, max_col=None, min_row=None,
               max_row=None, values_only=False):
    yield from zip(*self.iter_rows(
        min_row=min_row, max_row=max_row,
        min_col=min_col, max_col=max_col, values_only=values_only))

import types
for sheet in workbook:
    sheet.iter_cols = types.MethodType(_iter_cols, sheet)
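Once the sheets are patched like that, columns can be iterated in the usual way; for example (sheet name and header value taken from the question):
ws = workbook["Sheet1"]

# walk the header row column by column via the patched method
for col_idx, col in enumerate(ws.iter_cols(min_row=1, max_row=1,
                                           values_only=True), start=1):
    if col[0] == "PerceivedSound.RESP":
        print("header found in column", col_idx)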
According to the documentation, ReadOnly mode only supports row-based reads (column reads are not implemented). But that's not hard to solve:
wb2 = Workbook(write_only=True)
ws2 = wb2.create_sheet()

# find what column I need
colcounter = 0
for row in ws.rows:
    for cell in row:
        if cell.value == "PerceivedSound.RESP":
            break
        colcounter += 1

    # cells are apparently linked to the parent workbook meta
    # this will retain only values; you'll need a custom
    # row constructor if you want to retain more
    row2 = [cell.value for cell in row]
    ws2.append(row2)  # preserve the first row in the new file
    break  # stop after the first row

for row in ws.rows:
    row2 = [cell.value for cell in row]
    row2.append(doStuff(row2[colcounter]))
    ws2.append(row2)  # write a new row to the new wb

wb2.save('newfile.xlsx')
wb.close()
wb2.close()

# copy `newfile.xlsx` to `generalpath + exppath + doc`
# either using os.system, subprocess.popen, or shutil.copy2()
You will not be able to write to the same workbook, but as shown above you can open a new workbook (in write-only mode), write to it, and overwrite the old file using an OS copy.
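That final copy step could, for example, be done with shutil (paths as used in the question):
import shutil

# replace the original workbook with the freshly written one
shutil.copy2('newfile.xlsx', generalpath + exppath + doc)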
I'm dealing with badly laid-out Excel sheets which I'm trying to parse and write into a database.
Every sheet can have multiple tables. Although the headers of these possible tables are known, which tables will be on any given sheet is not, and neither is their exact location on the sheet (the tables don't align in a consistent way). I've attached pictures of two possible sheet layouts to illustrate this: one layout has two tables, while the other has all the tables of the first, but not in the same locations, plus an extra table.
What I do know:
All the possible table headers, so each individual table can be identified by its headers
Tables are separated by blank cells. They do not touch each other.
My question: Is there a clean way to deal with this using some Python module such as pandas?
My current approach:
I'm currently converting to .csv and parsing each row. I split each row around blank cells, and process the first part of the row (should belong to the leftmost table). The remainder of the row is queued and later processed in the same manner. I then read this first_part and check whether or not it's a header row. If it is, I use it to identify which table I'm dealing with (this is stored in a global current_df). Subsequent rows which are not header rows are fed into this table (here I'm using pandas.DataFrame for my tables).
Code so far is below (mostly incomplete and untested, but it should convey the approach above):
class DFManager(object):  # keeps track of current table and its headers
    current_df = None
    current_headers = []

    def set_current_df(self, df, headers):
        self.current_headers = headers
        self.current_df = df


def split_row(row, separator):
    while row and row[0] == separator:
        row.pop(0)
    while row and row[-1] == separator:
        row.pop()
    if separator in row:
        split_index = row.index(separator)
        return row[:split_index], row[split_index:]
    else:
        return row, []


def process_df_row(row, dfmgr):
    df = df_with_header(row)  # returns the dataframe with these headers
    if df is None:  # is not a header row, add it to current df
        df = dfmgr.current_df
        add_row_to_df(row, df)
    else:
        dfmgr.set_current_df(df, row)


# this is passed the Excel sheet
def populate_dataframes(xl_sheet):
    dfmgr = DFManager()
    row_queue = Queue()
    for row in xl_sheet:
        row_queue.put(row)
    for row in iter(row_queue.get, None):
        if not row:
            continue
        first_part, remainder = split_row(row)
        row_queue.put(remainder)
        process_df_row(first_part, dfmgr)
This is such a specific situation that there is likely no "clean" way to do this with a ready-made module.
One way to do this might be to use the header information you already have to find the starting indices of each table, something like this solution (Python Pandas - Read csv file containing multiple tables), but with an offset in the column direction as well.
Once you have the starting position of each table, you'll want to determine the widths (either known a priori or discovered by reading until the next blank column) and read those columns into a dataframe until the end of the table.
The benefit of an index-based method rather than a queue based method is that you do not need to re-discover where the separator is in each row or keep track of which row fragments belong to which table. It is also agnostic to the presence of >2 tables per row.
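A rough sketch of that index-based approach, assuming the whole sheet is read header-less into a DataFrame; TABLE_FIRST_HEADERS, the file name, and the blank-cell conventions below are placeholders, not something given in the question:
import pandas as pd

# placeholder: the first (leftmost) header of each known table type
TABLE_FIRST_HEADERS = {"Name", "ProductID"}

raw = pd.read_excel("layout.xlsx", header=None)  # whole sheet, no header row

tables = []
for r in range(raw.shape[0]):
    for c in range(raw.shape[1]):
        if raw.iat[r, c] in TABLE_FIRST_HEADERS:
            # width: read right along the header row until the first blank cell
            width = 0
            while c + width < raw.shape[1] and pd.notna(raw.iat[r, c + width]):
                width += 1
            # height: read down until the first fully blank row of the block
            height = 1
            while (r + height < raw.shape[0]
                   and raw.iloc[r + height, c:c + width].notna().any()):
                height += 1
            block = raw.iloc[r:r + height, c:c + width]
            # first row of the block becomes the column headers
            tables.append(pd.DataFrame(block.values[1:], columns=block.iloc[0]))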
I wrote code to merge multiple tables that are separated vertically and share common headers in each table. I am assuming that unique header names do not end with a dot followed by an integer.
import sys
import pandas as pd


def clean(input_file, output_file):
    try:
        df = pd.read_csv(input_file, skiprows=[1, 1])
        df = df.drop(df.columns[df.columns.str.contains('unnamed', case=False)], axis=1)
        df = rename_duplicate_columns(df)
    except:
        df = []
        print("Error: File Not found\t", sys.exc_info()[0])
        exit
    udf = df.loc[:, ~df.columns.str.match(".*\.\d")]
    udf = udf.dropna(how='all')
    try:
        table_num = int(df.columns.values[-1].split('.')[-1])
        fdf = udf
        for i in range(1, table_num + 1):
            udfi = pd.DataFrame()
            udfi = df.loc[:, df.columns.str.endswith(f'.{i}')]
            udfi.rename(columns=lambda x: '.'.join(x.split('.')[:-1]), inplace=True)
            udfi = udfi.dropna(how='all')
            fdf = fdf.append(udfi, ignore_index=True)
        fdf.to_csv(output_file)
    except ValueError:
        print("File Contains only single Table")
        exit


def rename_duplicate_columns(df):
    cols = pd.Series(df.columns)
    for dup in df.columns.get_duplicates():
        cols[df.columns.get_loc(dup)] = [dup + '.' + str(d_idx) if d_idx != 0 else dup
                                         for d_idx in range(df.columns.get_loc(dup).sum())]
    df.columns = cols
    print(df.columns)
    return df


clean(input_file, output_file)