Append data from list to excel using openpyxl - python

I am using openpyxl to work with Excel files in Python.
I have a list and want to write each of its values to an Excel file. My current code:
for y in myzoo:
    loo1 = str(y)
    c5a = my_sheet.cell(row=21, column=3)
    c5a.value = loo1
myzoo is the list (originally a pyodbc.Row). I convert each entry to a string and then write it to the Excel file. The problem is that only the last value is kept, because each write overwrites the previous one. I want one of two things: write each value to the next empty cell in the row, or (less preferable) put all of the exported data into the one cell without deleting the earlier values. Thanks.

I think you can just do something like this:
column = 3 # start column
while myzoo:
    c5a = my_sheet.cell(row=21, column=column)
    if not c5a.value:
        c5a.value = str(myzoo.pop(0))
    column += 1
If you need to preserve myzoo, run the loop on a copy instead (temp = list(myzoo)). list() also gives you a real list with pop() when myzoo is a tuple-like pyodbc.Row.
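The same idea can be sketched without consuming the list at all. This is only a minimal sketch: the helper name write_row, the in-memory workbook and the sample values are made up for illustration, and it treats a cell as empty only when its value is None.

```python
from openpyxl import Workbook


def write_row(sheet, values, row, start_column):
    """Write each value into the next empty cell of `row`, left to right."""
    column = start_column
    for value in values:
        # Skip cells that already hold something so nothing is overwritten.
        while sheet.cell(row=row, column=column).value is not None:
            column += 1
        sheet.cell(row=row, column=column).value = str(value)
        column += 1


wb = Workbook()                      # in-memory workbook, stands in for the real file
my_sheet = wb.active
my_sheet.cell(row=21, column=4).value = "taken"   # pretend this cell is already used

myzoo = ("lion", 3, 4.5)             # made-up stand-in for the pyodbc.Row
write_row(my_sheet, myzoo, row=21, start_column=3)

row_values = [my_sheet.cell(row=21, column=c).value for c in range(3, 8)]
print(row_values)
```

Because the loop only iterates over myzoo, the original sequence is left untouched and no copy is needed.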

Related

Append std,mean columns to a DataFrame with a for-loop

I want to put the std and mean of a specific column of a dataframe, computed for different hours, into a new dataframe. (The data comes from analyses conducted on big data in multiple Excel files.)
I use a for-loop and append(), but it returns only the last values, not all of them.
Here is my code:
hh = ['01:00','02:00','03:00','04:00','05:00']
for j in hh:
    month = 1
    hour = j
    data = get_data(month, hour) ## it works correctly, reads individual Excel spreadsheet
    data = pd.DataFrame(data,columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
    s_td = data.iloc[:,4].std()
    meean = data.iloc[:,4].mean()
    final = pd.DataFrame(columns=['Month','Hour','standard deviation','average'])
    final.append({'Month':j ,'Hour':j,'standard deviation':s_td,'average':meean},ignore_index=True)
I am not sure, but I believe you should assign the result of final.append(...) back to a variable, because append returns a new dataframe instead of modifying final in place:
final = final.append({'Month':j ,'Hour':j,'standard deviation':s_td,'average':meean},ignore_index=True)
Update
If time efficiency matters to you, it is suggested to collect your desired values ({'Month':j ,'Hour':j,'standard deviation':s_td,'average':meean}) in a list and build the dataframe from that list. This is said to perform better. (Thanks to #stefan_aus_hannover)
This is what I am referring to in the comments on Amirhossein's answer:
hh=['01:00','02:00','03:00','04:00','05:00']
lister = []
final = pd.DataFrame(columns=['Month','Hour','standard deviation','average'])
for j in hh:
    month=1
    hour = j
    data = get_data(month, hour) ## it works correctly
    data=pd.DataFrame(data,columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
    s_td=data.iloc[:,4].std()
    meean=data.iloc[:,4].mean()
    lister.append({'Month':j ,'Hour':j,'standard deviation':s_td,'average':meean})
final = final.append(pd.DataFrame(lister),ignore_index=True)
Conceptually you're just doing aggregate by hour, with the two functions std, mean; then appending that to your result dataframe. Something like the following; I'll revise it if you give us reproducible input data. Note the .agg/.aggregate() function accepts a dict of {'result_col': aggregating_function} and allows you to pass multiple aggregating functions, and directly name their result column, so no need to declare temporaries. If you only care about aggregating column 4 ('Total Load (MWh)'), then no need to read in columns 0..3.
final = pd.DataFrame(columns=['Month','Hour','standard deviation','average'])
for hour in hh:
    # Read in columns-of-interest from individual Excel sheet for this month and hour...
    data = get_data(1, hour)
    data = pd.DataFrame(data,columns=['Flowday','Interval','Demand','Losses (MWh)','Total Load (MWh)'])
    # Compute corresponding row of the aggregate...
    dat_hh_aggregate = data['Total Load (MWh)'].agg({'standard deviation':'std', 'average':'mean'})
    dat_hh_aggregate['Month'] = 1
    dat_hh_aggregate['Hour'] = hour
    final = final.append(dat_hh_aggregate, ignore_index=True)
Notes:
pd.read_excel(..., usecols=['Flowday','Interval',...]) lets you avoid reading in columns that you aren't interested in in the first place. You haven't supplied reproducible code for get_data(), but you should parameterize it so you can pass the list of columns-of-interest. But you seem to only want to aggregate column 4 ('Total Load (MWh)') anyway.
There's no need to store separate local variables s_td, meean, just directly use .aggregate()
There's no need to have both lister and final. Just have one results dataframe final, and append to it, ignoring the index. (If you get issues with that, post updated code here, make sure it's reproducible)
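Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the answers above need a different spelling. A minimal sketch of the list-of-rows approach follows; get_data here is a hypothetical stand-in that returns made-up numbers, since the asker's real function was not posted.

```python
import pandas as pd


def get_data(month, hour):
    """Hypothetical stand-in for the asker's Excel reader; returns made-up rows."""
    base = int(hour[:2])
    return [[1, 1, 0.0, 0.0, 10.0 * base],
            [2, 1, 0.0, 0.0, 10.0 * base + 2.0]]


columns = ['Flowday', 'Interval', 'Demand', 'Losses (MWh)', 'Total Load (MWh)']
hh = ['01:00', '02:00', '03:00', '04:00', '05:00']

rows = []
for hour in hh:
    data = pd.DataFrame(get_data(1, hour), columns=columns)
    # One call computes both statistics for the column of interest.
    stats = data['Total Load (MWh)'].agg(['std', 'mean'])
    rows.append({'Month': 1, 'Hour': hour,
                 'standard deviation': stats['std'], 'average': stats['mean']})

# Build the result once, after the loop, instead of appending row by row.
final = pd.DataFrame(rows, columns=['Month', 'Hour', 'standard deviation', 'average'])
print(final)
```

Building the dataframe once from a list of dicts avoids copying the growing dataframe on every iteration.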

Merging Specific Cells in an Excel Sheet with Python

I've been trying to merge cells that meet specific criteria with the cell next to it via a loop, but I'm not quite sure how to go about it.
For example, starting at row 7, if the cell has the word "Sample" in it, I want it to merge with the cell in the column next to it and I want to keep doing that until I get to the end of that row.
I'm currently using openpyxl for this.
Here is what I've tried (it does not work):
wb = load_workbook('Test.xlsx')
ws = wb.active
worksheet = wb['Example']
q_cells = []
for row_cells in worksheet.iter_rows(min_row = 7):
    for cell in row_cells:
        if cell.value == 'Sample':
            q_cells.append(cell.coordinate)
for item in q_cells:
    worksheet.merge_cells(item:item+1)
wb.save('merging.xlsx')
I'm not quite sure how best to proceed with this code. Any help would be appreciated!
merge_cells takes a range string (e.g. "A2:A8") or the four boundary indices as keyword arguments. From the docs:
>>> ws.merge_cells('A2:D2')
>>> ws.unmerge_cells('A2:D2')
>>>
>>> # or equivalently
>>> ws.merge_cells(start_row=2, start_column=1, end_row=4, end_column=4)
>>> ws.unmerge_cells(start_row=2, start_column=1, end_row=4, end_column=4)
Source: https://openpyxl.readthedocs.io/en/stable/usage.html
It sounds like you will want to find your first cell and your last cell, and merge as such (here I'm using f-strings):
ws.merge_cells(f'{first_cell.coordinate}:{last_cell.coordinate}')
In recent openpyxl versions, merging leaves the top-left cell as a normal Cell, turns the other cells of the range into MergedCell objects, and records the range in ws.merged_cells (a MultiCellRange). Openpyxl will let you overlap merge ranges without throwing an error, but Excel won't let you open the resulting file without a warning (and will probably remove the later merges). If you want to merge, you have to specify the whole range.
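Applied to the question, one way to merge every "Sample" cell with its right-hand neighbour is to collect the matching cells first and merge afterwards, using the keyword form of merge_cells. This is a minimal sketch on an in-memory workbook with made-up cell positions, not a drop-in replacement for the asker's file handling.

```python
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws["A7"] = "other"
ws["B7"] = "Sample"
ws["E8"] = "Sample"

# Collect the matches first, so the sheet is not modified while it is scanned.
targets = [cell
           for row in ws.iter_rows(min_row=7)
           for cell in row
           if cell.value == "Sample"]

for cell in targets:
    # Merge the matching cell with the cell one column to its right.
    ws.merge_cells(start_row=cell.row, start_column=cell.col_idx,
                   end_row=cell.row, end_column=cell.col_idx + 1)

merged = sorted(rng.coord for rng in ws.merged_cells.ranges)
print(merged)
```

If two "Sample" cells sit next to each other, the resulting ranges would overlap, so that case needs its own handling (for example one wider range).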

Using Python to find values from one excel sheet in another and printing result

I have 2 spreadsheets, "Old_Data" and "New_Data". Both contain a column called "ID", they can have 10K+ entries, and they are not ordered in any way, i.e. ID "1001" may be on row 2 in "Old_Data" but on row 4500 in "New_Data".
There are also entries within "New_Data" that are not in "Old_Data" and vice versa. For now I'm trying to figure out how I can use Python to take every entry from "Old_Data", try to locate it within "New_Data", and then add a field/column called "Found" to "Old_Data" with true or false depending on whether it was located.
Any ideas on how I would go about starting on this? I've attached a couple of examples of both the "New_Data" and "Old_Data" Excel sheets.
You could read both columns into lists.
new_ids = list()
old_ids = list()
You have to fill those lists, either by converting your Excel file to a CSV file and reading it from there (for example via input() / sys.stdin), or by using openpyxl or a similar module.
Afterwards, assuming all those IDs are unique elements:
old_d = dict()
for id in old_ids:
    # list.index() raises ValueError for a missing value, so check membership first
    pos = new_ids.index(id) if id in new_ids else -1
    old_d[id] = pos # position in the new list, -1 if not found
    #or for simple existence:
    #old_d[id] = (id in new_ids) # just True / False
    #print(f'id {id} from old_ids is in new_ids at position {pos}')
When looking up the position, an ID that was not found is stored as -1 in your dictionary.
A position might be more useful than a simple existence check.
It's basically the same as Excel's VLOOKUP.
Check this code and the corresponding result in old_data:
import pandas as pd
import numpy as np
new_data = pd.read_excel('new_data.xlsx')
old_data = pd.read_excel('old_data.xlsx')
# Label each row of old_data by whether its ID also appears in new_data
old_data['exists'] = np.where(old_data['ID'].isin(new_data['ID']), 'Exists', 'non-existent')
print(old_data)
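Since the question asks for a true/false "Found" column, the isin() result can also be stored directly. A minimal sketch with small made-up frames in place of the two spreadsheets (the extra "Name" column and the output file name are assumptions for illustration):

```python
import pandas as pd

# Made-up stand-ins for pd.read_excel('old_data.xlsx') / pd.read_excel('new_data.xlsx')
old_data = pd.DataFrame({"ID": [1001, 1002, 1003], "Name": ["a", "b", "c"]})
new_data = pd.DataFrame({"ID": [1003, 2000, 1001]})

# isin() checks membership row by row, independent of row order in either sheet.
old_data["Found"] = old_data["ID"].isin(new_data["ID"])
print(old_data)

# Hypothetical output file name; uncomment to write the result back to Excel.
# old_data.to_excel("old_data_checked.xlsx", index=False)
```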

How can I extract a substring between two characters for every row of a column in a CSV file and copy those values into a new column in Python?

I have a column with unique ID numbers, called "UnitID", that is organised in a way such as this:
ABC2_DEFGH12-01_X1_Y1
The segment of DEFGH12-01 hypothetically refers to the ID of the specific batch of units. I need to make a new column that specifies this batch, and therefore, want to extract the "DEFGH12-01" values (like extracting the value between the first and second "_", but I haven't been able to figure out how), into a new column, called "BatchID".
I would want to just leave "UnitID" as is, and simply add the new "BatchID" column before it.
I've tried everything but I haven't really managed to do this.
Use str.split("_").str[1], which splits each value on "_" and takes the second piece.
Example:
df = pd.DataFrame({"UnitID": ["ABC2_DEFGH12-01_X1_Y1"]})
df["BatchID"] = df["UnitID"].str.split("_").str[1]
print(df)
Output:
                  UnitID     BatchID
0  ABC2_DEFGH12-01_X1_Y1  DEFGH12-01
If you need a regex, use str.extract(r"(?<=_)(.*?)(?=_)"), which captures the text between the first two underscores.
df["BatchID"] = df["UnitID"].str.extract(r"(?<=_)(.*?)(?=_)")
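The question also asks for "BatchID" to sit before "UnitID". Assigning with df["BatchID"] = ... adds the column at the end, so one option is DataFrame.insert. A minimal sketch; the second UnitID value is made up, and with a real CSV the frame would come from pd.read_csv instead.

```python
import pandas as pd

df = pd.DataFrame({"UnitID": ["ABC2_DEFGH12-01_X1_Y1",
                              "XYZ9_QRSTU34-07_X2_Y5"]})  # second row is made-up sample data

# Text between the first and second underscore.
batch = df["UnitID"].str.split("_").str[1]

# Insert the new column at the position currently held by "UnitID",
# which pushes "UnitID" one place to the right.
df.insert(df.columns.get_loc("UnitID"), "BatchID", batch)
print(df)
```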

Imitate the copy function of Excel or LibreOffice Calc with openpyxl and python3 (copy with properties to new position)

Is there a way to imitate the copy function of "Excel" or "LibreOffice Calc" using openpyxl and python3?
I would like to specify a certain range (e.g. "A1:E5") and copy the "border", "alignment", "number_format", "value", "merged_cells", ... properties of each cell to another position (and probably to another worksheet), whereby the formulas used should be updated automatically to the new position. The new formulas are intended to refer to cells within the target worksheet and not to the old cells in the original worksheet.
My project:
I generate a workbook for every month. This workbook contains yield monitoring tables that list all working days.
Although the tables differ from month to month, all have the same structure within a workbook, so I would like to create a template and paste it into the individual worksheets.
Copying the entire worksheet is not really a solution because I would also like to specify the position of the table individually for every worksheet. So the position in the target sheet could differ from the position in the template.
My current code (where the formulas are not automatically updated):
import copy
# The tuple "topLeftCell" represents the assumed new position in the target worksheet. It is zero-based. (e.g. (0,0) or (7,3))
# "templateSheet" is the template from which to copy.
# "worksheet" is the target worksheet
# Create the same "merged_cells" as in the template at the new position
for cell_range in templateSheet.merged_cells.ranges:
    startCol, startRow, endCol, endRow = cell_range.bounds
    worksheet.merge_cells(start_column=topLeftCell[0] + startCol,
                          start_row=topLeftCell[1] + startRow,
                          end_column=topLeftCell[0] + endCol,
                          end_row=topLeftCell[1] + endRow)
colNumber = topLeftCell[0] + 1 # 1 is added because topLeftCell is zero-based.
rowNumber = topLeftCell[1] + 1 # 1 is added because topLeftCell is zero-based.
# Copy the properties of the old cells into the target worksheet
for row in templateSheet[templateSheet.dimensions]:
    tmpCol = colNumber # sets the column back to the original value
    for cell in row:
        ws_cell = worksheet.cell(column=tmpCol, row=rowNumber)
        ws_cell.alignment = copy.copy(cell.alignment)
        ws_cell.border = copy.copy(cell.border)
        ws_cell.font = copy.copy(cell.font)
        ws_cell.number_format = copy.copy(cell.number_format)
        ws_cell.value = cell.value
        tmpCol += 1 # move one column further
    rowNumber += 1 # moves to the next line
Since copying ranges is actually a common task, I assumed that openpyxl provides a function or method for doing so. Unfortunately, I could not find one so far.
I'm using openpyxl version 2.5.1 (and Python 3.5.2).
Best regards
AFoeee
openpyxl will let you copy entire worksheets within a workbook (Workbook.copy_worksheet). That should be sufficient for your work, but if you need anything more you will need to write your own code.
Since it seems that openpyxl does not provide a general solution for my problem, I proceeded as follows:
I created a template with the properties of the cells set (borders, alignment, number format, etc.). Although the formulas are entered in the respective cells, columns and rows are replaced by placeholders. These placeholders indicate the offset to the "zero point".
The template area is copied as described above, but when copying "cell.value", the placeholder is used to calculate the actual position in the target worksheet.
Best regards
AFoeee
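As an alternative to placeholders, openpyxl ships a formula translator (openpyxl.formula.translate.Translator) that shifts relative references as if a cell had been moved. The following is only a sketch under assumptions: it requires an openpyxl version that includes Translator, the helper name copy_range and the sample cells are made up, and merged ranges are not handled.

```python
import copy

from openpyxl import Workbook
from openpyxl.formula.translate import Translator


def copy_range(src_ws, src_range, dst_ws, dst_top_left):
    """Copy values, formulas and basic styles of src_range to dst_ws at dst_top_left."""
    rows = src_ws[src_range]
    anchor = dst_ws[dst_top_left]
    row_off = anchor.row - rows[0][0].row
    col_off = anchor.col_idx - rows[0][0].col_idx
    for row in rows:
        for cell in row:
            target = dst_ws.cell(row=cell.row + row_off,
                                 column=cell.col_idx + col_off)
            value = cell.value
            if cell.data_type == "f":
                # Shift relative references by the same offset as the cell itself.
                value = Translator(value, origin=cell.coordinate).translate_formula(
                    target.coordinate)
            target.value = value
            if cell.has_style:
                target.font = copy.copy(cell.font)
                target.border = copy.copy(cell.border)
                target.alignment = copy.copy(cell.alignment)
                target.number_format = cell.number_format


wb = Workbook()
template = wb.active
template["A1"] = 1
template["B1"] = 2
template["C1"] = "=A1+B1"

target_ws = wb.create_sheet("Target")
copy_range(template, "A1:C1", target_ws, "C5")
print(target_ws["E5"].value)
```

Because the translated formula carries no sheet name, it refers to cells in whichever worksheet it is written to, which matches the requirement that the new formulas point at the target sheet.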
