Python - Formatting excel sheet too slow (row by row) - python

I am trying to format an Excel sheet from Python using a function like this:
def highlight_myrow_cells(sheetnumber, Sheetname, dataFrame):
    Pre_Out_df_ncol = dataFrame.shape[1]
    RequiredCol_let = colnum_num_string(Pre_Out_df_ncol)
    # identifying the rows that need to be highlighted
    arr = (dataFrame.select_dtypes(include=[bool])).eq(False).any(axis=1).values
    ReqRows = np.arange(1, len(dataFrame) + 1)[arr].tolist()
    # ReqRows is a list of values, something like [1,2,3,5,6,8,10]
    print("Highlighting the Sheet " + Sheetname + " in the output workbook")
    # Program is too slow over here---
    for i in range(len(ReqRows)):
        j = ReqRows[i] + 1
        xlwb1.sheets(sheetnumber).Range('A' + str(j) + ":" + RequiredCol_let + str(j)).Interior.ColorIndex = 6
    xlwb1.sheets(sheetnumber).Columns.AutoFit()
    for i in range(1, Emergency_df.shape[1]):
        j = i - 1
        RequiredCol_let = colnum_num_string(i)
        Required_Column_Name = (Emergency_df.columns[j])
        DateChecker1 = contains_word(Required_Column_Name, "Date", "of Death", "Day of Work")
        ResultChecker = Required_Column_Name.startswith("Result")
        if ResultChecker == False:
            if (DateChecker1 == True):
                xlwb1.sheets(sheetnumber).Range(Required_Column_Name + ":" + Required_Column_Name).NumberFormat = "m/d/yyyy"
The program is too slow while highlighting the rows based on this logic.
From what I understand of Excel, the speed is quite good if you highlight a range of rows at once rather than one row after another.
I am not looking to do this with an external library like stylewriter etc.

Since you can't use threading, I would just cut down the time needed to execute each loop iteration. The methods I know of would look something like:
ReqRows = [r + 1 for r in ReqRows]  # shift past the header row (ReqRows is a list, so += 1 would fail)
sheet = xlwb1.sheets(sheetnumber)  # look the sheet up once instead of on every iteration
for j in ReqRows:
    sheet.Range('A{0}:{1}{0}'.format(j, RequiredCol_let)).Interior.ColorIndex = 6
sheet.Columns.AutoFit()
This should speed up your loop (albeit probably not nearly as much as threading). Hope this helps solve your problem!
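The question notes that Excel is faster when a whole range of rows is formatted at once. A minimal sketch of that idea, assuming the row numbers are sorted: group consecutive rows into runs and build one range address per run, so each run needs only one call. The helper names contiguous_runs and run_addresses are hypothetical, and the column letter "F" is just an example value.

```python
def contiguous_runs(rows):
    """Group sorted row numbers into (first, last) runs of consecutive values."""
    runs = []
    for row in rows:
        if runs and row == runs[-1][1] + 1:
            # Extend the current run by one row.
            runs[-1] = (runs[-1][0], row)
        else:
            # Start a new run.
            runs.append((row, row))
    return runs


def run_addresses(rows, last_col_letter, offset=1):
    """Build one 'A<first>:<col><last>' address per run; offset skips the header row."""
    return [
        "A{0}:{1}{2}".format(first + offset, last_col_letter, last + offset)
        for first, last in contiguous_runs(rows)
    ]


addresses = run_addresses([1, 2, 3, 5, 6, 8, 10], "F")
print(addresses)

# With the workbook object from the question this could be applied as
# (assumption, not tested here):
# for address in addresses:
#     xlwb1.sheets(sheetnumber).Range(address).Interior.ColorIndex = 6
```

For the example list this turns seven single-row calls into four range calls; the saving grows when the flagged rows form long consecutive blocks.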

Related

Removing doubles while iterating a dataframe

I'm trying to remove duplicates from a dataframe.
The dataframe contains two (or more) occurrences of the same document.
The duplicates can be found by comparing the documents' descriptions.
My approach was to find the duplicates, copy their data into one row, and drop the others from both the dataframe and the dataframe being iterated over.
But duplicates still remain. I think the drop is the cause, but I don't know how to fix it.
The part in green is the description; I need to drop one of the two rows and merge everything shown in black.
For example:
URL1 + URL2|Explorimmo + Bien_ici|Apartment|Description
Unfortunately, I can't link the dataset.
file = pd.ExcelFile(mc.file_path)
df = pd.read_excel(file)
description_duplicate = df.loc[df.duplicated(['DESCRIPTION']) == True]
for idx1, clean in description_duplicate.iterrows():
    for idx2, dirty in description_duplicate.iterrows():
        if idx1 != idx2:
            if clean['DESCRIPTION'] == dirty['DESCRIPTION']:
                clean['CRAWL_SOURCE'] = clean['CRAWL_SOURCE'] + " / " + dirty['CRAWL_SOURCE']
                clean['URL'] = clean['URL'] + " / " + dirty['URL']
                description_duplicate = description_duplicate.drop(idx2)
                df = df.drop(idx2)
    df[idx1] = clean
You only need to remove duplicates with the pandas.DataFrame.drop_duplicates() function:
df.drop_duplicates(subset='DESCRIPTION', inplace=True)
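Note that drop_duplicates keeps the first row of each group and discards the rest, so the URL and CRAWL_SOURCE values of the dropped rows are lost. If those should be fused as the question describes, one possible sketch is to group by the description and join the values. The sample data below is invented, and the TYPE column name is an assumption; only DESCRIPTION, URL and CRAWL_SOURCE appear in the question.

```python
import pandas as pd

# Invented sample data; TYPE is a hypothetical extra column.
df = pd.DataFrame({
    "URL": ["URL1", "URL2", "URL3"],
    "CRAWL_SOURCE": ["Explorimmo", "Bien_ici", "Explorimmo"],
    "TYPE": ["Apartment", "Apartment", "House"],
    "DESCRIPTION": ["Nice flat", "Nice flat", "Big house"],
})

# One output row per description: join the URLs and sources of all
# duplicates, and keep the first value of every other column.
merged = (
    df.groupby("DESCRIPTION", as_index=False, sort=False)
      .agg({
          "URL": " / ".join,
          "CRAWL_SOURCE": " / ".join,
          "TYPE": "first",
      })
)
print(merged)
```

This avoids dropping rows while iterating, which is what makes the nested iterrows loop in the question hard to get right.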

Avoid having to repeat the same dataframe column names when modifying them

I have a dataframe with over 30 columns. I am making various modifications to specific columns and would like to avoid having to list those columns every time. Is there a shortcut?
For example:
matrix_bus_filled.loc[matrix_bus_filled['FNR'] == 'AB1122', ["Ice", "Tartlet", "Pain","Fruit","Club","Focaccia","SW of Month","Salad + Dressing","Planchette + bread","Muffin"]] = matrix_bus_filled[matrix_bus_filled['FNR'] == 'AB1120'][["Ice", "Tartlet", "Pain","Fruit","Club","Focaccia","SW of Month","Salad + Dressing","Planchette + bread","Muffin"]].values
Could I simply once define the term "SpecificColumns" and then paste it here?
matrix_bus_filled.loc[matrix_bus_filled['FNR'] == 'AB1122', ["SpecificColumns"]] = matrix_bus_filled[matrix_bus_filled['FNR'] == 'AB1120'][["SpecificColumns"]].values
And here
matrix_bus_filled [["SpecificColumns"]] = matrix_bus_filled [["SpecificColumns"]].apply(scale, axis=1)
Just define a list and use that to call the columns.
specific_columns = ["Ice", "Tartlet", "Pain","Fruit","Club","Focaccia","SW of Month","Salad + Dressing","Planchette + bread","Muffin"]
matrix_bus_filled[specific_columns] = matrix_bus_filled[specific_columns].apply(scale, axis=1)
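The same list also works in the .loc assignment from the question. A small sketch with made-up data and only two of the columns (the values and the "Other" column are assumptions for illustration):

```python
import pandas as pd

specific_columns = ["Ice", "Tartlet"]

# Made-up data: flight AB1120 has values, AB1122 should receive a copy.
matrix_bus_filled = pd.DataFrame({
    "FNR": ["AB1120", "AB1122"],
    "Ice": [1, 0],
    "Tartlet": [2, 0],
    "Other": [9, 9],
})

# Select the source row once, then assign its raw values to the target row.
source = matrix_bus_filled.loc[matrix_bus_filled["FNR"] == "AB1120", specific_columns]
matrix_bus_filled.loc[matrix_bus_filled["FNR"] == "AB1122", specific_columns] = source.values
print(matrix_bus_filled)
```

Passing the list directly (specific_columns, not ["specific_columns"]) is what makes this work; the quoted form would look for a single column literally named "specific_columns".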

How to access a cell in a new dataframe?

I created a sub dataframe (drama_df) based on a criterion in the original dataframe (df). However, I can't access a cell using the typical drama_df['summary'][0]. Instead I get a KeyError: 0. I'm confused since type(drama_df) is a DataFrame. What do I do? Note that df['summary'][0] does indeed return a string.
drama_df = df[df['drama'] > 0]
# Now we generate a lump of text from the summaries
drama_txt = ""
i = 0
while (i < len(drama_df)):
    drama_txt = drama_txt + " " + drama_df['summary'][i]
    i += 1
EDIT
Here is an example of df:
Here is an example of drama_df:
This will solve it for you:
drama_df['summary'].iloc[0]
When you created the subDataFrame you probably left the index 0 behind. So you need to use iloc to get the element by position and not by index name (0).
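A minimal sketch of what happens, using an invented three-row frame: filtering keeps the original index labels, so label 0 may no longer exist in the subset, while position 0 always does.

```python
import pandas as pd

# Invented data: the first row does not match the filter.
df = pd.DataFrame({"drama": [0, 1, 2], "summary": ["a", "b", "c"]})
drama_df = df[df["drama"] > 0]

# The subset keeps the original labels 1 and 2; label 0 is gone.
labels = list(drama_df.index)

# Positional access works regardless of the labels.
first = drama_df["summary"].iloc[0]

# Alternatively, renumber the rows so that label-based [0] works again.
renumbered = drama_df.reset_index(drop=True)
first_again = renumbered["summary"][0]

print(labels, first, first_again)
```

Resetting the index is convenient when the rest of the code relies on 0-based labels, as the while loop in the question does.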
You can also use .iterrows() or .itertuples() for this routine. itertuples is a lot faster, but it is a bit more work to handle if you have a lot of columns.
for index, row in drama_df.iterrows():
    drama_txt = drama_txt + " " + row['summary']
To go faster:
for index, summary in drama_df[['summary']].itertuples():
    drama_txt = drama_txt + " " + summary
Wait a moment here. You are looking for the str.join() operation.
Simply do this:
drama_txt = ' '.join(drama_df['summary'])
Or:
drama_txt = drama_df['summary'].str.cat(sep=' ')

how can I make a loop for importing data and saving in sequence

I have an xls file whose first column consists of many rows, for example:
MN
TN
RMON
BNE
RMGS
HUDGD
YINT
Then I want to pass each cell (the value of it) to a function
mystruc1 = make_structure("MN")
mystruc2 = make_structure("TN")
mystruc3 = make_structure("RMON")
mystruc4 = make_structure("BNE")
mystruc5 = make_structure("RMGS")
mystruc6 = make_structure("HUDGD")
mystruc7 = make_structure("YINT")
So each time the value of one cell will go to the function
Then I want to pass the output of it to another function
out = Bio.PDB.PDBIO()
out.set_structure(mystruc1)
out.save( "MN001.pdb" )
out.set_structure(mystruc2)
out.save( "MN002.pdb" )
out.set_structure(mystruc3)
out.save( "MN003.pdb" )
out.set_structure(mystruc4)
out.save( "MN004.pdb" )
out.set_structure(mystruc5)
out.save( "MN005.pdb" )
out.set_structure(mystruc6)
out.save( "MN006.pdb" )
out.set_structure(mystruc7)
out.save( "MN007.pdb" )
This is how I would do it manually. I want to avoid doing it manually.
You can construct the filename using str.format (see Format String Syntax in the Python documentation).
>>> filename = '{}{:04}.pdb'
>>> filename.format('MN', 1)
'MN0001.pdb'
>>> filename.format('MN', 352)
'MN0352.pdb'
You can use enumerate while iterating over the sheet's rows to help construct the filename.
import xlrd
filename = '{}{:04}.pdb'
workbook = xlrd.open_workbook('test.xls')
for sheet in workbook.sheets():
    for n, row in enumerate(sheet.get_rows()):
        col_0 = row[0].value
        print(filename.format(col_0, n))
If you only want to iterate over the first column:
for sheet in workbook.sheets():
    for n, value in enumerate(sheet.col_values(0, start_rowx=0, end_rowx=None)):
        print(filename.format(value, n))
Or you can access the cell values directly:
for sheet in workbook.sheets():
    for i in range(sheet.nrows):
        rowi_col0 = sheet.cell_value(i, 0)
        print(filename.format(rowi_col0, i))
Once you have extracted a cell's value you can pass it to any function or method, just as it is passed to str.format above.
mystruc = make_structure(value)
To automate processing the cell values, add your process to the loop.
import Bio.PDB
for sheet in workbook.sheets():
    for i in range(sheet.nrows):
        rowi_col0 = sheet.cell_value(i, 0)
        # print(filename.format(rowi_col0, i))
        my_structure = make_structure(rowi_col0)
        out = Bio.PDB.PDBIO()
        out.set_structure(my_structure)
        out.save(filename.format(rowi_col0, i))
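The manual version in the question uses a fixed 'MN' prefix with 1-based, three-digit numbers (MN001.pdb, MN002.pdb, ...). A minimal sketch of that naming scheme, kept free of xlrd and Bio.PDB so it can be run on its own; the helper name numbered_filenames is hypothetical, and the list of values stands in for the first column of the sheet.

```python
def numbered_filenames(values, prefix="MN", width=3, ext=".pdb"):
    """Pair each cell value with a 1-based, zero-padded output filename."""
    return [
        (value, "{}{:0{w}d}{}".format(prefix, n, ext, w=width))
        for n, value in enumerate(values, start=1)
    ]


cell_values = ["MN", "TN", "RMON", "BNE", "RMGS", "HUDGD", "YINT"]
pairs = numbered_filenames(cell_values)
for value, name in pairs:
    print(value, "->", name)

# With the functions from the question this could become
# (assumption, not tested here):
# for value, name in pairs:
#     out = Bio.PDB.PDBIO()
#     out.set_structure(make_structure(value))
#     out.save(name)
```

Starting enumerate at 1 gives MN001.pdb for the first row rather than MN000.pdb.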
I don't have comment privileges to ask for clarification, so I'm going to answer this as best I can, and hopefully you can clarify if I'm going in the wrong direction.
From what you wrote, I'm assuming that you have some column, 'MN', and you want to name a bunch of files starting from 'MN001.pdb' all the way to 'MN0xx.pdb' (where xx is the last row you're working with).
One way you can achieve this is with a counter that is incremented on each iteration of your inner for loop.
colname = "MN"
for sheet in workbook.sheets():
    counter = 0
    for row in range(sheet.nrows):
        # pass your code here
        counter += 1
        # zero-pad the counter to three digits: 1 -> '001', 12 -> '012'
        s = colname + str(counter).zfill(3)
        out.save(s + '.pdb')

python openpyxl time insert without date

I have a value that I want to insert into Excel as a time formatted like HH:MM.
If I use
cellFillTime = Style(fill = PatternFill(start_color=shiftColor,end_color=shiftColor,fill_type='solid'),
    border = Border(left=Side(style='thin'),right=Side(style='thin'),top=Side(style='thin'),bottom=Side(style='thin')),
    alignment=Alignment(wrap_text=True),
    number_format='HH:MM')
valM = 8
cellData = ws5.cell(column= a + 3, row= i+2, value=valM)
_cellStlyed = cellData.style = cellFillTime
I always get 1.1.1900 8:00:00 in the Excel worksheet.
This is a problem because the SUM functions I use later do not work on these cells.
How can I remove the date from the cell so that it holds only a time value?
thank you
best regards
This worked for inserting only the hour.
UPDATED CODE
cellFillTime = Style(fill = PatternFill(start_color=shiftColor,end_color=shiftColor,fill_type='solid'),
    border = Border(left=Side(style='thin'),right=Side(style='thin'),top=Side(style='thin'),bottom=Side(style='thin')),
    alignment=Alignment(wrap_text=True)
    )
if rrr["rw_seqrec"] == 1 or rrr["rw_seqrec"] == 1001:
    val_ = abs((rrr['rw_end'] - rrr['rw_start'])) / 60
    #print "val_ ", val_
    valM = datetime.datetime.strptime(str(val_), '%H').time()
    cellData = ws5.cell(column= a + 3, row= i+2, value=valM)
    cellData.style = cellFillTime
    cellData.number_format='HH:MM'
The problem I have now is that Excel still does not want to sum the time fields. It probably has something to do with the field format.
Any suggestions?
So the final catch was to also set the right time format, both on the cells with hours and on the cell that contains the SUM formula:
_cell.number_format='[h]:mm;#'
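A rough sketch of why the bracketed format matters for the SUM cell, assuming Excel's convention of storing a time as a fraction of a 24-hour day. A plain h:mm format shows only the time-of-day part, so a total above 24 hours appears to wrap around, while [h]:mm shows the elapsed hours in full. The helper names below are hypothetical and only mimic the display; they are not part of openpyxl, and the shift lengths are made up.

```python
def hours_to_day_fraction(hours):
    """Convert a duration in hours to a fraction of a 24-hour day."""
    return hours / 24.0


def show_h_mm(day_fraction):
    """Mimic 'h:mm': only the time-of-day part is shown, so hours wrap at 24."""
    total_minutes = int(round(day_fraction * 24 * 60))
    hours, minutes = divmod(total_minutes % (24 * 60), 60)
    return "{}:{:02d}".format(hours, minutes)


def show_elapsed_h_mm(day_fraction):
    """Mimic '[h]:mm': elapsed hours are shown in full, even past 24."""
    total_minutes = int(round(day_fraction * 24 * 60))
    hours, minutes = divmod(total_minutes, 60)
    return "{}:{:02d}".format(hours, minutes)


shifts = [8, 9.5, 10]  # made-up shift lengths in hours
total = sum(hours_to_day_fraction(h) for h in shifts)

wrapped = show_h_mm(total)
elapsed = show_elapsed_h_mm(total)
print(wrapped, elapsed)
```

The same stored total is displayed differently by the two formats, which is why the format on the SUM cell needs the brackets as well.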
