Unmerged Cell For Loop skipping cells - python

I am using openpyxl and found some code through googling about unmerging cells in an xlsx workbook.
I got the code to work, but found that it was not removing all the merged cells in a single pass through. I set it up to run using a while loop and solved the issue, but was wondering what I am doing wrong to cause the skipping in the first place. Any insight would be helpful.
Code:
import openpyxl
wb = openpyxl.load_workbook('./filename.xlsx')
ws = wb[sheetname]
def remove_merged(sheet_object):
    merged = sheet_object.merged_cell_ranges
    # the while loop is the workaround: a single for pass skips ranges
    while len(merged)>0:
        for mergedRNG in merged:
            sheet_object.unmerge_cells(range_string = mergedRNG)
        merged = sheet_object.merged_cell_ranges
    return len(merged)
remove_merged(ws)

ws.merged_cell_ranges is mutable, so you need to be careful not to use it directly in a for loop that also unmerges cells: the loop's implicit counter won't take into account that the list has changed underneath it. This is a common gotcha in Python, illustrated by:
l = list(range(10))
for i in l:
    print(i)
    l.pop(0) # anything that affects the structure of the list
The following is how to avoid this:
for rng in ws.merged_cell_ranges[:]: # create a copy of the list
    ws.unmerge_cells(rng) # remove range from original
PS. Just copying stuff from an internet search isn't really advisable. There are several sites with outdated or unnecessarily complex code. It is best to refer to the documentation or ask on the mailing list.
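The skipping can be reproduced without a workbook at all. The following is a minimal sketch using a plain list of range strings as a stand-in for ws.merged_cell_ranges; the function names and the sample ranges are made up for illustration.

```python
def remove_all_in_place(items):
    """Remove while iterating over the same list: some entries get skipped."""
    for item in items:
        items.remove(item)  # shifts the remaining entries left by one


def remove_all_via_copy(items):
    """Iterate over a shallow copy, so removals cannot disturb the loop."""
    for item in items[:]:
        items.remove(item)


# Hypothetical merged ranges, standing in for ws.merged_cell_ranges
skipped = ["A1:B2", "C1:D2", "E1:F2", "G1:H2"]
remove_all_in_place(skipped)

cleared = ["A1:B2", "C1:D2", "E1:F2", "G1:H2"]
remove_all_via_copy(cleared)

print(skipped)  # ['C1:D2', 'G1:H2']
print(cleared)  # []
```

Every second entry survives the in-place version, which matches the behaviour described in the question: one pass removes only part of the merged ranges, and repeating the pass in a while loop eventually clears the rest.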

Related

OpenPyXL - DataValidation adding list to the cell (issue)

This is my code:
dv = DataValidation(type="list", formula1='"11111,22222,33333,44444,55555,66666,77777,88888,99999,111110,122221,133332,144443,155554,166665,177776,188887,199998,211109,222220,233331,244442,255553,266664,277775,288886,299997,311108,322219,333330,344441,355552,366663,377774,388885,399996,411107,422218,433329,444440,455551,466662,477773,488884,499995,511106,522217,533328,544439,555550,566661,577772,588883,599994,611105,622216,633327,644438,655549,666660,677771,688882,699993,711104,722215,733326,744437,755548,766659,777770,788881,799992,811103,822214,833325,844436,855547,866658,877769,888880,899991,911102,922213,933324,944435,955546,966657,977768,988879,999990,1011101,1022212,1033323,1044434,1055545,1066656,1077767,1088878,1099989,1111100,1122211"', allow_blank=False)
sheet.add_data_validation(dv)
dv.add('K5')
But then I have this issue:
But if the formula1 list is small, everything works fine.
What is the way to add a big list of options that will not cause issues (as you can see above)?
Excel may impose additional limits on what it accepts. See https://learn.microsoft.com/en-us/openspecs/office_standards/ms-oi29500/8ebf82e4-4fa4-43a6-9ecd-d2d793a6f4bf. In the implementers notes there is additional information but I cannot find the passage referred to.
Basically, I think it's generally easier to refer to values on a separate sheet.
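As a rough sketch of the separate-sheet approach: write the options into a column on a helper sheet and point formula1 at that range instead of an inline list. The helper sheet name "Options", the output filename and both function names are assumptions made for this example, not anything from the question.

```python
def list_range_formula(sheet_title, n_values, column="A", first_row=1):
    """Build a formula1 string that points at n_values cells in one column."""
    last_row = first_row + n_values - 1
    quoted = sheet_title.replace("'", "''")  # quotes inside sheet names are doubled
    return f"'{quoted}'!${column}${first_row}:${column}${last_row}"


def build_workbook(options, path="validated.xlsx"):
    """Write the options to a helper sheet and validate K5 against them."""
    # Imported here so the helper above can be tried without openpyxl installed.
    from openpyxl import Workbook
    from openpyxl.worksheet.datavalidation import DataValidation

    wb = Workbook()
    sheet = wb.active
    helper = wb.create_sheet("Options")  # hypothetical helper sheet
    for row, value in enumerate(options, start=1):
        helper.cell(row=row, column=1, value=value)

    dv = DataValidation(
        type="list",
        formula1=list_range_formula(helper.title, len(options)),
        allow_blank=False,
    )
    sheet.add_data_validation(dv)
    dv.add("K5")
    wb.save(path)


formula = list_range_formula("Options", 101)
print(formula)  # 'Options'!$A$1:$A$101
```

Calling build_workbook(list_of_values) would then produce a file whose K5 dropdown reads from the helper sheet, which sidesteps whatever length limit Excel applies to an inline list.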

when converting XML to SEVERAL dataframes, how to name these dfs in a dynamic way?

My code is at the bottom.
The "parse_XML" function converts an XML file to a df; for example, "df=parse_XML("example.xml", lst_level2_tags)" works.
But I want to save the results to several dfs, so I want names like df_first_level_tag, etc.
When I run the code at the bottom, I get this error:
f'df_{first_level_tag}'=parse_XML("example.xml", lst_level2_tags)
^
SyntaxError: can't assign to literal
I also tried the .format method instead of an f-string, but that didn't work either.
There are at least 30 dfs to save and I don't want to do it one by one. I have always succeeded with f-strings in Python outside pandas, though.
Is the problem here the f-string/format method, or does my code have some other logic problem?
if necessary for you, the parse_xml function is directly from this link
the function definition
for first_level_tag in first_level_tags:
    lst_level2_tags = []
    for subchild in root[0]:
        lst_level2_tags.append(subchild.tag)
    f'df_{first_level_tag}'=parse_XML("example.xml", lst_level2_tags)
This seems like a situation where you'd be best served by putting them into a dictionary:
dfs = {}
for first_level_tag in first_level_tags:
    lst_level2_tags = []
    for subchild in root[0]:
        lst_level2_tags.append(subchild.tag)
    dfs[first_level_tag] = parse_XML("example.xml", lst_level2_tags)
There's nothing wrong with the f-string itself, but it evaluates to a string value, and a value cannot be the target of an assignment, hence the SyntaxError. You generally can't get dynamic variable names in Python without doing ugly things, so storing the values in a dictionary ends up being a much cleaner solution when you want something like that.
One advantage of working with them this way is that you can then just iterate over the dictionary later on if you want to do something to each of them. For example, if you wanted to write each of them to disk as a CSV with a name matching the tag, you could do something like:
for key, df in dfs.items():
    df.to_csv(f'{key}.csv')
You can also just refer to them individually (so if there was a tag named a, you could refer to dfs['a'] to access it in your code later).
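Here is a self-contained sketch of the same pattern that runs without pandas or the original parse_XML function. The XML snippet, the parse_section helper and the tag names are all invented for illustration; each "table" is a plain list of dicts standing in for a DataFrame.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: two first-level tags, each holding a few records
XML = """
<root>
  <books>
    <item><title>A</title><year>2001</year></item>
    <item><title>B</title><year>2005</year></item>
  </books>
  <authors>
    <item><name>Ann</name></item>
  </authors>
</root>
"""


def parse_section(section):
    """Turn one first-level element into a list of {tag: text} records."""
    return [{field.tag: field.text for field in record} for record in section]


root = ET.fromstring(XML)

# One dictionary entry per first-level tag, instead of one variable per tag
tables = {section.tag: parse_section(section) for section in root}

for name, rows in tables.items():
    print(name, len(rows))
```

The keys play the role the dynamic variable names were meant to play: tables['books'] is available by name, and new first-level tags need no extra code.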

Efficiently removing duplicates from a list [duplicate]

Good evening. I have an excel file with zip codes and associated information. Those zip codes have a lot of duplicates. I'd like to figure out which zip codes I have by putting them all in a list without duplicates. This code works, but runs very slowly (took over 100 seconds), and was wondering what I could do to improve the efficiency of it.
I know that having to check the whole list for duplicates each time is contributing a lot to the inefficiency, but I'm not sure how to fix that. I also know that going through every row is probably not the best answer, but again I am pretty new and am now stuck.
Thanks in advance.
import sys
import xlrd
loc = ("locationOfFile")
wb = xlrd.open_workbook(loc)
sheet = wb.sheet_by_index(0)
def findUniqueZips():
    zipsInSheet = []
    for i in range(sheet.nrows):
        if str(sheet.cell(i,0).value) in zipsInSheet:
            pass
        else:
            zipsInSheet.append(str(sheet.cell(i,0).value))
    print(zipsInSheet)
findUniqueZips()
If you're looking to avoid duplicates, then you should definitely consider using sets in Python. See here
What I would do is create a set and simply add all your elements to it; note that a set is an unordered collection of unique items. Once all the data has been added, you can then just copy the elements of the set into your list. This avoids redundant data.
import xlrd
loc = ("locationOfFile")
wb = xlrd.open_workbook(loc)
sheet = wb.sheet_by_index(0)
def findUniqueZips():
    data = set()
    for i in range(sheet.nrows):
        # adding a value that is already present leaves the set unchanged
        data.add(str(sheet.cell(i,0).value))
    # now copy the unique elements of the set into a list (order is not preserved)
    zipsInSheet = list(data)
    print(zipsInSheet)
findUniqueZips()
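I usually just convert it to a set. Sets are your friend. Checking whether a value is already in a set is much faster than checking a list, because a list has to be scanned element by element. Unless you intentionally need or want duplicates, use sets.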
https://docs.python.org/3.7/tutorial/datastructures.html?highlight=intersection#sets
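If the first-seen order of the zip codes matters, a plain set loses it. A small sketch of both options, using made-up zip codes; dict.fromkeys keeps insertion order on Python 3.7 and later.

```python
# Hypothetical column values with repeats
zips = ["10001", "94105", "10001", "60601", "94105", "10001"]

# A set drops duplicates but makes no promise about order
unique_any_order = set(zips)

# dict keys are unique and remember insertion order
unique_in_order = list(dict.fromkeys(zips))

print(unique_in_order)  # ['10001', '94105', '60601']
```

Both approaches touch each value once, so they avoid rescanning the whole list for every row.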

Copying data to clipboard manager Ditto

I held off asking a question here, but I just can't find a solution.
I use Ditto as my favourite clipboard manager; when I copy data there, I can access it via assigned keys on my keyboard, which is very handy. I need to copy values from cells in Excel. So far I've tried many approaches (tkinter, pyperclip, pandas, os, pynput), but each has the same outcome: only the last copied variable (or string) ends up in the first position in Ditto. If I copy value 'a' then 'b', I get only 'b', or else I get the whole copied content as one clip with no separation. The closest I've come is the code below, but it still puts the whole content in one clip under one key.
from openpyxl import load_workbook
from pyperclip import *
wb = load_workbook(filename='C:/Users/Robert/Desktop/dane.xlsx')
ws = wb['Sheet']
column = ws['B']
list = ''
for x in range(len(column)) :
    a = ''
    if column[x].value is None:
        column[x].value = a
    list = list + str(column[x].value) + '\n'
copy(list)
I need every single string (cell.value) under different slot in Ditto. This gives me all values in one (first) slot.
Thanks in advance, it is fourth day in a row and I am close to jump from my balcony...
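I recently came across a solution.
Ditto requires a delay of at least 500 ms to register items as separate clips.
import time
import pyperclip
for i in arr:  # arr: the list of strings to copy
    pyperclip.copy(i)
    time.sleep(.6)  # give Ditto time to pick up each clip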
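Putting the delay together with the column loop from the question might look like the sketch below. The copy_each name and its parameters are invented here, the 0.6 s default comes from the answer above, and skipping empty cells is my own choice rather than something the question asked for. The copy function is passed in so the logic can be tried without touching the real clipboard.

```python
import time


def copy_each(values, copy, delay=0.6, sleep=time.sleep):
    """Send every non-empty value to the clipboard as its own clip."""
    sent = []
    for value in values:
        if value is None:
            continue  # skip empty cells instead of copying an empty clip
        text = str(value)
        copy(text)
        sleep(delay)  # pause so the clipboard manager can register the clip
        sent.append(text)
    return sent


# Dry run: a list stands in for the clipboard and the pause is disabled
clipboard_log = []
sent = copy_each(["a", None, 42], copy=clipboard_log.append, sleep=lambda seconds: None)
print(sent)  # ['a', '42']
```

With the workbook from the question, the call would be something like copy_each((cell.value for cell in ws['B']), copy=pyperclip.copy), leaving the default delay in place.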

Excel worksheet to Numpy array

I'm trying to do an unbelievably simple thing: load parts of an Excel worksheet into a Numpy array. I've found a kludge that works, but it is embarrassingly unpythonic:
say my worksheet was loaded as "ws", the code:
A = np.zeros((37,3))
for i in range(2,39):
    for j in range(1,4):
        A[i-2,j-1]= ws.cell(row = i, column = j).value
loads the contents of "ws" into array A.
There MUST be a more elegant way to do this. For instance, csvread allows to do this much more naturally, and while I could well convert the .xlsx file into a csv one, the whole purpose of working with openpyxl was to avoid that conversion. So there we are, Collective Wisdom of the Mighty Intertubes: what's a more pythonic way to perform this conceptually trivial operation?
Thank you in advance for your answers.
PS: I operate Python 2.7.5 on a Mac via Spyder, and yes, I did read the openpyxl tutorial, which is the only reason I got this far.
You could do
A = np.array([[i.value for i in j] for j in ws['C1':'E38']])
EDIT - further explanation.
(Firstly, thanks for introducing me to openpyxl; I suspect I will use it quite a bit from time to time.)
The method of getting multiple cells from the worksheet object produces a generator. This is probably much more efficient if you want to work your way through a large sheet, as you can start straight away without waiting for it all to load into a list.
To force a generator to make a list, you can either use list(ws['C1':'E38']) or a list comprehension as above.
Each row is a tuple (even if only one column wide) of Cell objects. These carry a lot more than just a number, but if you want the number for your array you can use the .value attribute. This is really the crux of your question: csv files don't contain the structured info of an Excel spreadsheet.
There isn't (as far as I can tell) a built-in method for extracting values from a range of cells, so you will have to do something effectively like what you have sketched out.
The advantages of doing it my way are: no need to work out the dimensions of the array and make an empty one to start with, no need to work out the corrected index numbers of the np array, and list comprehensions are faster. The disadvantage is that it needs the "corners" defined in "A1" format. If the range isn't known, then you would have to use iter_rows, rows or columns:
A = np.array([[i.value for i in j[2:5]] for j in ws.rows])
If you don't know how many columns there are, then you will have to loop and check values, more like your original idea.
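For a block given by row and column numbers rather than "A1" corners, a hedged sketch follows. It assumes an openpyxl version whose iter_rows accepts values_only=True; the block_values name is made up, and FakeSheet is only a tiny stand-in so the helper can be run without a workbook file.

```python
def block_values(ws, min_row, max_row, min_col, max_col):
    """Collect the plain values of a rectangular block as a list of rows."""
    rows = ws.iter_rows(min_row=min_row, max_row=max_row,
                        min_col=min_col, max_col=max_col,
                        values_only=True)
    return [list(row) for row in rows]


class FakeSheet:
    """Stand-in for a worksheet: mimics only the iter_rows call used above."""

    def __init__(self, grid):
        self.grid = grid

    def iter_rows(self, min_row, max_row, min_col, max_col, values_only=False):
        # Rows and columns are 1-based, as in a real worksheet
        for r in range(min_row - 1, max_row):
            yield tuple(self.grid[r][min_col - 1:max_col])


grid = [["x", "y", "z"],
        [1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]
data = block_values(FakeSheet(grid), min_row=2, max_row=3, min_col=1, max_col=3)
print(data)  # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
```

With a real worksheet, the range from the question would be np.array(block_values(ws, 2, 38, 1, 3), dtype=float), which gives the same 37 by 3 array as the nested loops without preallocating it.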
If you don't need to load data from multiple files in an automated manner, the package tableconvert I recently wrote may help. Just copy and paste the relevant cells from the excel file into a multiline string and use the convert() function.
import numpy as np
from tableconvert.converter import convert
array = convert("""
123 456 3.14159
SOMETEXT 2,71828 0
""")
print(type(array))
print(array)
Output:
<class 'numpy.ndarray'>
[[ 123. 456. 3.14159]
[ nan 2.71828 0. ]]
