Python script to manipulate Excel sheets by auto-filling blank cells in columns

Hi, I have an Excel file with a structure similar to the one below:
Location   column2        column3
1          South Africa
2
3
4
5          England
6
7
8
9          U.S
10
11
12
I am trying to write a Python script that can fill the blank cells between each location with the name of the preceding location (i.e. fill rows 2 to 4 with South Africa, rows 6 to 8 with England, and so on).
I would be grateful if someone could point me in the right direction. Thanks.

Ok dude, I think the answer is this dumb wrapper I made for xlrd (or one you write yourself!). The key is that the function reads one row at a time into a list, and Python lists preserve the order in which they were populated. The wrapper produces a dictionary that maps Excel sheet names to a list of the rows on that sheet (we're assuming one table per sheet here; you'll have to generalize otherwise). Each row is a dictionary whose keys are the column names.
For you, I'd read in your data and then do something like this (not tested):
import see_below as sb

workbook = sb.WorkbookToDict(your_file)
output = []
this_location = None
for row in workbook[relevant_sheet_name]:
    output_row = row
    if row['Location']:  # empty cells come back as '' from the reader below
        this_location = row['Location']
    else:
        output_row['Location'] = this_location
    output.append(output_row)
There might be something cute you can do with list comprehension, but I've had too much wine to fool with that tonight :)
Here's the wrapper for the reader:
import xlrd

def _isEmpty(_):
    return ''

def _isString(element):
    return element.value.encode('ascii', 'ignore')

def _isFloat(element):
    return float(element.value)

def _isDate(element):
    import datetime
    rawDate = float(element.value)
    return (datetime.datetime(1899, 12, 30) +
            datetime.timedelta(days=rawDate))

def _isBool(element):
    return element.value == 1

def _isExcelGarbage(element):
    return int(element.value)

# xlrd cell type codes mapped to the converters above
_options = {0: _isEmpty,         # XL_CELL_EMPTY
            1: _isString,        # XL_CELL_TEXT
            2: _isFloat,         # XL_CELL_NUMBER
            3: _isDate,          # XL_CELL_DATE
            4: _isBool,          # XL_CELL_BOOLEAN
            5: _isExcelGarbage,  # XL_CELL_ERROR
            6: _isEmpty}         # XL_CELL_BLANK
def WorkbookToDict(filename):
    '''
    Reads an Excel file into a dictionary.

    The keys of the dictionary correspond to sheet names in the Excel workbook.
    The first row of each sheet is taken to be the column names, and each row
    of the worksheet is read into a separate dictionary whose keys correspond
    to the column names. The collection of row dictionaries (as a list) forms
    the value, so the output maps sheet names (keys) to lists of row
    dictionaries (values).
    '''
    book = xlrd.open_workbook(filename)
    allSheets = {}
    for s in book.sheets():
        thisSheet = []
        headings = [_options[x.ctype](x) for x in s.row(0)]
        for i in range(s.nrows):
            if i == 0:
                continue
            thisRow = s.row(i)
            if len(thisRow) != len(headings):
                raise Exception("Mismatch between headings and row length in ExcelReader")
            rowDict = {}
            for h, r in zip(headings, thisRow):
                rowDict[h] = _options[r.ctype](r)
            thisSheet.append(rowDict)
        allSheets[str(s.name)] = thisSheet
    return allSheets
The writer is here:
import xlwt

def write(workbookDict, colMap, filename):
    '''
    workbookDict should be a map of sheet names to a list of dictionaries.
    Each member of the list should be a mapping of column names to contents;
    missing keys are handled with the nullEntry field. colMap should be a
    dictionary whose keys are identical to the sheet names in workbookDict.
    Each value is a list of column names that are assumed to be in order.
    If a key exists in workbookDict that does not exist in colMap, the
    entry in workbookDict will not be written.
    '''
    workbook = xlwt.Workbook()
    for sheet in workbookDict.keys():
        worksheet = workbook.add_sheet(sheet)
        cols = colMap[sheet]
        i = 0
        writeCols = True
        while i <= len(workbookDict[sheet]):
            if writeCols:
                # row 0: the column headings
                for j in range(len(cols)):
                    worksheet.write(i, j, cols[j])
                writeCols = False
            else:
                # rows 1..n: the data, offset by one for the heading row
                for j in range(len(cols)):
                    worksheet.write(i, j, workbookDict[sheet][(i - 1)][cols[j]])
            i += 1
    workbook.save(filename)
Anyway, I really hope this works for you!
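Putting the pieces together, a rough usage sketch might look like this (untested; the file name, sheet name, column list, and module names are placeholders for wherever you save the reader and writer above):
import excel_reader   # the reader module shown above (placeholder name)
import excel_writer   # the writer module shown above (placeholder name)

data = excel_reader.WorkbookToDict('your_file.xls')

filled = []
this_location = None
for row in data['Sheet1']:
    # empty cells come back as '' from the reader, so test truthiness
    if row['Location']:
        this_location = row['Location']
    else:
        row['Location'] = this_location
    filled.append(row)

# column order for the writer; adjust to your real headings
col_map = {'Sheet1': ['Location', 'column2', 'column3']}
excel_writer.write({'Sheet1': filled}, col_map, 'filled.xls')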

import openpyxl

wb = openpyxl.load_workbook('enter your workbook name')
sheet = wb.get_sheet_by_name('enter your sheet name')
max_row = sheet.max_row
# start at the first data row (row 2, below the header) and copy each
# value down into the blank cell beneath it
for row in range(2, max_row):
    if sheet.cell(row=row, column=1).value is not None and sheet.cell(row=row + 1, column=1).value is None:
        sheet.cell(row=row + 1, column=1).value = sheet.cell(row=row, column=1).value
wb.save('enter your workbook name')
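For reference, pandas can do the same forward fill in a couple of lines; a minimal sketch, assuming the blanks are genuinely empty cells (read in as NaN) and the country names sit in a column headed column2 as in the sample layout (swap in your real header and file names):
import pandas as pd

df = pd.read_excel('your_file.xlsx')

# propagate the last non-empty value downwards over the blank rows
df['column2'] = df['column2'].ffill()

df.to_excel('your_file_filled.xlsx', index=False)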

Related

How do you save dictionary data to the matching excel row but different column using openpyxl

I am using openpyxl to read a column (A) from an Excel spreadsheet. I then iterate through a dictionary to find the matching information, and I want to write this data back to column (C) of the same spreadsheet.
I have tried to figure out how to append the data back to the corresponding row, without luck.
CODE
from openpyxl import load_workbook

my_dict = {
    'Agriculture': 'ET_SS_Agriculture',
    'Dance': 'ET_FA_Dance',
    'Music': 'ET_FA_Music'
}

wb = load_workbook("/Users/administrator/Downloads/Book2.xlsx")  # Work Book
ws = wb['Sheet1']  # Work Sheet
column = ws['A']  # Column
write_column = ws['C']
column_list = [column[x].value for x in range(len(column))]

for k, v in my_dict.items():
    for l in column_list:
        if k in l:
            print(f'The dict for {l} is {v}')
            # append v to row of cell index of column_list
So, if my excel spreadsheet looks like this:
I would like Column C to look like this after I have matched the data dictionary.
In order to do this with your method you need the index (i.e. the row) to assign the values to column C; you can get this with enumerate when running over your column_list:
for i, l in enumerate(column_list):
    if k in l:
        print(f'The dict for {l} is {v}')
        # append v to row of cell index of column_list
        write_column[i].value = v
After writing all the values you will need to run
wb.save("/Users/administrator/Downloads/Book2.xlsx")
to save your changes.
That said, you do a lot of unnecessary iterations over the data in the spreadsheet, and you also make things a little difficult for yourself by dealing with the data in columns rather than rows. You already have your dict keyed by the values that appear in column A, so you can just do direct lookups using split.
You are adding to each row, so it makes sense to loop over rows instead, in my opinion.
my_dict = {
    'Agriculture': 'ET_SS_Agriculture',
    'Dance': 'ET_FA_Dance',
    'Music': 'ET_FA_Music'
}

wb = load_workbook("/Users/administrator/Downloads/Book2.xlsx")  # Work Book
ws = wb['Sheet1']  # Work Sheet

for row in ws:
    try:
        # row[0] = column A
        v = my_dict[row[0].value.split("-")[0]]  # get the value from the dict using column A
    except KeyError:
        # leave rows which aren't in my_dict alone
        continue
    # row[2] = column C
    row[2].value = v

wb.save("/Users/administrator/Downloads/Book2.xlsx")

Creating a column called ID that auto-increments according to the rows in a DataFrame

I am trying to write code that looks into a certain directory and selects files with the .xlsx extension. For each file it appends the sheets together into a single Excel sheet, drops a row if FAMILYNAME and FIRSTNAME are null, and creates a CODE column for every file it processes (CODE = 1 for the first booklet it finishes, 2 for the second, and so on). It should then create a column called ID with values from 1 to the total number of rows in the data frame, and this last part is where I am stuck. Can anyone help? Thanks!
os.chdir(r"C:\Users\Kimanya\jupyter lessons\REMOTE WORK\New folder\example")
def read_excel_sheets(xls_path):
"""Read all sheets of an Excel workbook and return a single DataFrame"""
print(f'Loading {xls_path} into pandas')
xl = pd.ExcelFile(xls_path)
df = pd.DataFrame()
columns = None
for idx, name in enumerate(xl.sheet_names):
print(f'Reading sheet #{idx}: {name}')
#sheet = xl.parse(name)
sheet = xl.parse(name,skiprows=1)
#if idx == 0:
# Save column names from the first sheet to match for append
# *****************THE FOLLOWINF TWO LINES ARE SO IMPORTANT IF THE EXCEL WORK SHEETS ARE HAVING THE SAME NUMBER OF COLUMNS AND IN THE SAME ORDER***********
# columns = sheet.columns
# sheet.columns = columns
#****************************************************************************************#
# Assume index of existing data frame when appended
df = df.append(sheet, ignore_index=False)
return df
n=0
for files in os.listdir():
if files.endswith(".xlsx"):
kim = pd.read_excel(files, sheet_name=None,header=0)
kim.keys()
kim = pd.concat(kim, ignore_index=True)
kim = pd.concat(pd.read_excel(files, sheet_name=None), ignore_index=True)
kim =kim[kim['FAMILYNAME'].notna() & kim['FIRSTNAME'].notna()]
kim =kim[kim.FAMILYNAME !='FAMILYNAME']
row,col = kim.shape
#############Alternatively you can use the code above in just a single line as below###########
#kim= pd.concat(pd.read_excel(files, sheet_name=None), ignore_index=True)
print(kim.shape)
if files.endswith(".xlsx"):
#HOW TO AUYOMATICALLY IMPORT THE FILE NAMES AND APPEND THEM
stops = read_excel_sheets(files)
n = n+1
stops['ID'] = pd.Series(ID)
stops['CODE']= n
ID = list( range(1, row+1))
stops['ID'] = pd.Series(ID)
stops['GENDER'] = np.where((stops.GENDER == 'M'),'MALE',stops.GENDER)
stops['GENDER'] = np.where((stops.GENDER == 'F'),'FEMALE',stops.GENDER)
stops =stops[stops['FAMILYNAME'].notna() & stops['FIRSTNAME'].notna()]
os.chdir(r"C:\Users\Kimanya\jupyter lessons\REMOTE WORK\New folder\TRIAL")
stops.to_excel(files,index=False)
os.chdir(r"C:\Users\Kimanya\jupyter lessons\REMOTE WORK\New folder\example")
This is what the code is doing after execution. I would love something like this, but I want the ID to increment automatically across all the rows in the sheet.
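For the ID column specifically, once the combined and filtered DataFrame exists, a 1-to-N counter is just a range the same length as the frame. A minimal sketch of the whole flow, untested, reusing the folder paths and column names from the question:
import os

import numpy as np
import pandas as pd

src = r"C:\Users\Kimanya\jupyter lessons\REMOTE WORK\New folder\example"
dst = r"C:\Users\Kimanya\jupyter lessons\REMOTE WORK\New folder\TRIAL"

code = 0
for name in os.listdir(src):
    if not name.endswith(".xlsx"):
        continue
    code += 1  # one CODE per booklet/file, in the order the files are seen

    # stack every sheet of the workbook into a single frame
    frame = pd.concat(pd.read_excel(os.path.join(src, name), sheet_name=None),
                      ignore_index=True)

    # drop rows missing the names, plus any repeated header rows
    frame = frame[frame['FAMILYNAME'].notna() & frame['FIRSTNAME'].notna()]
    frame = frame[frame.FAMILYNAME != 'FAMILYNAME']

    frame['GENDER'] = np.where(frame.GENDER == 'M', 'MALE', frame.GENDER)
    frame['GENDER'] = np.where(frame.GENDER == 'F', 'FEMALE', frame.GENDER)

    frame['CODE'] = code
    frame['ID'] = range(1, len(frame) + 1)  # 1 .. total number of remaining rows

    frame.to_excel(os.path.join(dst, name), index=False)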

How can I concatenate multiple rows of excel data into one?

I'm currently facing an issue where I need to bring all of the data shown in the images below onto one line only.
So, using Python and openpyxl, I tried to write a parsing script that reads the lines and copies values into a new workbook only when they are non-null and non-identical.
I get out-of-range errors, and the code does not keep just the data I want. I've spent multiple hours on it, so I thought I would ask here to see if I can get unstuck.
I've read some documentation on openpyxl and about making lists in Python, and tried a couple of videos on YouTube, but none of them did exactly what I was trying to achieve.
import openpyxl
from openpyxl import Workbook

path = "sample.xlsx"
wb = openpyxl.load_workbook(path)
ws = wb.active

path2 = "output.xlsx"
wb2 = Workbook()
ws2 = wb2.active

listab = []
rows = ws.max_row
columns = ws.max_column

for i in range(1, rows + 1):
    listab.append([])

cellValue = " "
prevCell = " "

for c in range(1, rows + 1):
    for r in range(1, columns + 1):
        cellValue = ws.cell(row=r, column=c).value
        if cellValue == prevCell:
            listab[r - 1].append(prevCell)
        elif cellValue == "NULL":
            listab[r - 1].append(prevCell)
        elif cellValue != prevCell:
            listab[r - 1].append(cellValue)
        prevCell = cellValue

for r in range(1, rows + 1):
    for c in range(1, columns + 1):
        j = ws2.cell(row=r, column=c)
        j.value = listab[r - 1][c - 1]

print(listab)
wb2.save("output.xlsx")
There should be one line with the below information:
ods_service_id | service_name | service_plan_name | CPU | RAM | NIC | DRIVE
Personally I would go with pandas.
import pandas as pd

# Loading into pandas
df_data = pd.read_excel('sample.xlsx')
df_data.fillna("NO DATA", inplace=True)  # Replace nan values with "NO DATA"
unique_ids = df_data.ods_service_ids.unique()

# Storing pd into a list
records_list = df_data.to_dict('records')
keys_to_check = ['service_name', 'service_plan_name', 'CPU', 'RAM', 'NIC', 'DRIVE']

processed = {}
# Go through unique ids
for key in unique_ids:
    processed[key] = {}
    # Get related records
    matching_records = [y for y in records_list if y['ods_service_ids'] == key]
    # Loop through records
    for record in matching_records:
        # For each key to check, save in dict if non null
        processed[key]['ods_service_ids'] = key
        for detail_key in keys_to_check:
            if record[detail_key] != "NO DATA":
                processed[key][detail_key] = record[detail_key]
        # Note: doesn't handle duplicate values for different keys so far

# Records are put back in a list
output_data = [processed[x] for x in processed.keys()]

# -> to Pandas
df = pd.DataFrame(output_data)[['ods_service_ids', 'service_name', 'service_plan_name', 'CPU', 'RAM', 'NIC', 'DRIVE']]

# Export to Excel
df.to_excel("output.xlsx", sheet_name='Sheet_name_1', index=False)
The above should work, but I wasn't really sure how you wanted to save duplicated records for the same id. Do you want to store them as DRIVE_0, DRIVE_1, DRIVE_2?
EDIT:
df could be exported in a different way. Replace the # Export to Excel line above with the following:
df.to_excel("output.xlsx", sheet_name='Sheet_name_1')
EDIT 2:
With no input data it was hard to see any flaws. Corrected the code above with fake data.
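As a side note, pandas can collapse the duplicated rows more directly with a groupby: first() returns the first non-null value per column within each group, which is essentially the "keep the populated cell" rule above. A minimal sketch, assuming the id column is named ods_service_id as in the question and that literal "NULL" strings mark the missing values:
import pandas as pd

df = pd.read_excel("sample.xlsx")

# treat literal "NULL" strings as missing too, then keep the first
# non-null value per column within each id group
collapsed = (df.replace("NULL", float("nan"))
               .groupby("ods_service_id", as_index=False)
               .first())

collapsed.to_excel("output.xlsx", index=False)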
To be honest, I think you've managed to get confused by data structures and come up with something far more complicated than you need.
One approach that would suit would be to use Python dictionaries for each service, updating them row by row.
wb = load_workbook("sample.xlsx")
ws = wb.active
objs = {}
headers = next(ws.iter_rows(min_row=1, max_row=1, values_only=True))
for row in ws.iter_rows(min_row=2, values_only=True):
if row[0] not in objs:
obj = {key:value for key, value in zip(headers, row)}
objs[obj['ods_service_id']] = obj
else:# update dict with non-None values
extra = {key:value for key, value in zip(headers[3:], row[3:]) if value != "NULL"}
obj.update(extra)
# write to new workbook
wb2 = Workbook()
ws2 = wb2.active
ws2.append(headers)
for row in objs.values(): # do they need sorting?
ws2.append([obj[key] for key in headers])
Note how you can do everything without using counters.

openpyxl: Iterate through all the rows and get the row data in a tuple

How do I iterate through all the rows in an Excel sheet and get each row's data as a tuple? At the end of the iteration I should have a list of tuples, with each element of the list being a tuple of one row's data.
For instance: This is the content of my spreadsheet:
testcase_ID   input_request    request_change
test_1A       test/request_1   YES
test_2A       test/request_2   NO
test_3A       test/request_3   YES
test_4A       test/request_4   YES
my final list should be:
[(test_1A, test/request_1, YES),
(test_2A, test/request_2, NO),
(test_3A, test/request_3, YES),
(test_4A, test/request_4, YES)]
How can I do this in openpyxl?
I think this task would be easier with xlrd. However, if you want to use openpyxl, then assuming that testcase_ID is in column A, input_request in column B, and request_change in column C, something like this might be what you are looking for:
import openpyxl as xl

# Opening xl file
wb = xl.load_workbook('PATH/TO/FILE.xlsx')

# Select your sheet (for this example I chose the active sheet)
ws = wb.active

# Start row, where data begins
row = 2
testcase = ''  # this is just so that you can enter the while loop

# Initializing list
final_list = []

# With each iteration we get the value of testcase; if the cell is empty
# testcase will be None, and when that happens the while loop will stop
while testcase is not None:
    # Getting the cell values from columns A, B and C,
    # iterating through rows 2, 3, 4 ...
    testcase = ws['A' + str(row)].value
    in_request = ws['B' + str(row)].value
    req_change = ws['C' + str(row)].value
    # Making tuple
    row_tuple = (testcase, in_request, req_change)
    # Adding tuple to list
    final_list.append(row_tuple)
    # Going to next row
    row += 1

# This is what you return; you don't want the last element
# because it is a tuple of Nones
print(final_list[:-1])
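For reference, newer openpyxl versions (2.6+, if I recall correctly) can hand each row back as a tuple directly via values_only, which avoids the manual cell addressing; a minimal sketch, assuming a single header row to skip:
import openpyxl

wb = openpyxl.load_workbook('PATH/TO/FILE.xlsx')
ws = wb.active

# min_row=2 skips the header; values_only=True yields plain tuples of cell values
final_list = list(ws.iter_rows(min_row=2, values_only=True))
print(final_list)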
If you want to do it with xlrd this is how I would do it:
import xlrd

# Opening xl file
wb = xlrd.open_workbook('PATH/TO/FILE.xlsx')

# Select your sheet (for this example I chose the first sheet;
# you can also choose by name or something else)
ws = wb.sheet_by_index(0)

# Getting number of rows and columns
num_row = ws.nrows
num_col = ws.ncols

# Initializing list
final_list = []

# Iterating over the rows (skipping the header in row 0)
for i in range(1, num_row):
    # list of row values
    row_values = []
    # Iterating over the columns
    for j in range(num_col):
        row_values.append(ws.cell_value(i, j))
    # Making tuple with row values
    row_tuple = tuple(row_values)
    # Adding tuple to list
    final_list.append(row_tuple)

print(final_list)
Adding a few notes on xlrd indexing at the end for easy reading:
The if statement was deleted; when num_row is 1 the for loop simply never runs.
xlrd indexes rows beginning at 0, so for row 2 we want index 1.
Columns are also zero-indexed (A=0, B=1, C=2, ...).

Reading Excel File using Python, how do I get the values of a specific column with indicated column name?

I've an Excel file:
Arm_id  DSPName  DSPCode  HubCode  PinCode  PPTL
1       JaVAS    01       AGR      282001   1,2
2       JaVAS    01       AGR      282002   3,4
3       JaVAS    01       AGR      282003   5,6
I want to save a string in the form Arm_id,DSPCode,Pincode. This format is configurable, i.e. it might change to DSPCode,Arm_id,Pincode. I save it in a list like:
FORMAT = ['Arm_id', 'DSPName', 'Pincode']
How do I read the content of a specific column with provided name, given that the FORMAT is configurable?
This is what I tried; currently I'm able to read all the content in the file:
from xlrd import open_workbook

wb = open_workbook('sample.xls')
for s in wb.sheets():
    #print 'Sheet:',s.name
    values = []
    for row in range(s.nrows):
        col_value = []
        for col in range(s.ncols):
            value = (s.cell(row, col).value)
            try:
                value = str(int(value))
            except:
                pass
            col_value.append(value)
        values.append(col_value)
    print values
My output is :
[
[u'Arm_id', u'DSPName', u'DSPCode', u'HubCode', u'PinCode', u'PPTL'],
['1', u'JaVAS', '1', u'AGR', '282001', u'1,2'],
['2', u'JaVAS', '1', u'AGR', '282002', u'3,4'],
['3', u'JaVAS', '1', u'AGR', '282003', u'5,6']
]
Then I loop over values[0], trying to find the FORMAT entries in it, and get the indexes of Arm_id, DSPName and Pincode within values[0]. From the next loop onwards I know the index of each FORMAT field, and therefore which value I need to pick up.
But this is such a poor solution.
How do I get the values of a specific column by name from an Excel file?
A somewhat late answer, but with pandas it is possible to get a column of an Excel file directly:
import pandas
df = pandas.read_excel('sample.xls')
#print the column names
print df.columns
#get the values for a given column
values = df['Arm_id'].values
#get a data frame with selected columns
FORMAT = ['Arm_id', 'DSPName', 'Pincode']
df_selected = df[FORMAT]
Make sure you have installed xlrd and pandas:
pip install pandas xlrd
This is one approach:
from xlrd import open_workbook

class Arm(object):
    def __init__(self, id, dsp_name, dsp_code, hub_code, pin_code, pptl):
        self.id = id
        self.dsp_name = dsp_name
        self.dsp_code = dsp_code
        self.hub_code = hub_code
        self.pin_code = pin_code
        self.pptl = pptl

    def __str__(self):
        return("Arm object:\n"
               "  Arm_id = {0}\n"
               "  DSPName = {1}\n"
               "  DSPCode = {2}\n"
               "  HubCode = {3}\n"
               "  PinCode = {4} \n"
               "  PPTL = {5}"
               .format(self.id, self.dsp_name, self.dsp_code,
                       self.hub_code, self.pin_code, self.pptl))

wb = open_workbook('sample.xls')
for sheet in wb.sheets():
    number_of_rows = sheet.nrows
    number_of_columns = sheet.ncols

    items = []
    rows = []
    for row in range(1, number_of_rows):
        values = []
        for col in range(number_of_columns):
            value = (sheet.cell(row, col).value)
            try:
                value = str(int(value))
            except ValueError:
                pass
            finally:
                values.append(value)
        item = Arm(*values)
        items.append(item)

for item in items:
    print item
    print("Accessing one single value (eg. DSPName): {0}".format(item.dsp_name))
    print
You don't have to use a custom class, you can simply take a dict(). If you use a class however, you can access all values via dot-notation, as you see above.
Here is the output of the script above:
Arm object:
Arm_id = 1
DSPName = JaVAS
DSPCode = 1
HubCode = AGR
PinCode = 282001
PPTL = 1
Accessing one single value (eg. DSPName): JaVAS
Arm object:
Arm_id = 2
DSPName = JaVAS
DSPCode = 1
HubCode = AGR
PinCode = 282002
PPTL = 3
Accessing one single value (eg. DSPName): JaVAS
Arm object:
Arm_id = 3
DSPName = JaVAS
DSPCode = 1
HubCode = AGR
PinCode = 282003
PPTL = 5
Accessing one single value (eg. DSPName): JaVAS
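As the answer notes, a plain dict works just as well as the custom class; the same reading loop collecting dictionaries keyed by the header names might look like this (a sketch under the same assumptions as above):
from xlrd import open_workbook

wb = open_workbook('sample.xls')
sheet = wb.sheet_by_index(0)
headers = [cell.value for cell in sheet.row(0)]

rows = []
for r in range(1, sheet.nrows):
    values = []
    for c in range(sheet.ncols):
        value = sheet.cell(r, c).value
        try:
            value = str(int(value))
        except ValueError:
            pass
        values.append(value)
    rows.append(dict(zip(headers, values)))

print(rows[0]['DSPName'])  # access by column name instead of dot-notation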
So the key parts are to grab the header (col_names = s.row(0)) and, when iterating through the rows, to skip the first row, which isn't needed: for row in range(1, s.nrows), i.e. using range from 1 onwards rather than the implicit 0. You then use zip to step through the cells of each row, holding 'name' as the header of the column.
from xlrd import open_workbook

wb = open_workbook('Book2.xls')
values = []
for s in wb.sheets():
    #print 'Sheet:',s.name
    for row in range(1, s.nrows):
        col_names = s.row(0)
        col_value = []
        for name, col in zip(col_names, range(s.ncols)):
            value = (s.cell(row, col).value)
            try:
                value = str(int(value))
            except:
                pass
            col_value.append((name.value, value))
        values.append(col_value)
print values
By using pandas we can read Excel files easily.
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile

DataF = pd.read_excel("Test.xlsx", sheet_name='Sheet1')
print("Column headings:")
print(DataF.columns)
Test it at https://repl.it
Reference: https://pythonspot.com/read-excel-with-pandas/
Here is the code to read an Excel file and print all the cells present in column 1 (except the first cell, i.e. the header):
import xlrd

file_location = r"C:\pythonprog\xxx.xlsx"
workbook = xlrd.open_workbook(file_location)
sheet = workbook.sheet_by_index(0)
print(sheet.cell_value(0, 0))
for row in range(1, sheet.nrows):
    print(sheet.cell_value(row, 0))
The approach I took reads the header information from the first row to determine the indexes of the columns of interest.
You mentioned in the question that you also want the values output to a string. I dynamically build a format string for the output from the FORMAT column list. Rows are appended to the values string separated by a new line char.
The output column order is determined by the order of the column names in the FORMAT list.
In my code below, the case of the column names in the FORMAT list is important. In the question above you've got 'Pincode' in your FORMAT list but 'PinCode' in your Excel file. This wouldn't work below; it would need to be 'PinCode'.
from xlrd import open_workbook

wb = open_workbook('sample.xls')

FORMAT = ['Arm_id', 'DSPName', 'PinCode']
values = ""

for s in wb.sheets():
    headerRow = s.row(0)
    columnIndex = [x for y in FORMAT for x in range(len(headerRow)) if y == headerRow[x].value]
    formatString = ("%s," * len(columnIndex))[0:-1] + "\n"

    for row in range(1, s.nrows):
        currentRow = s.row(row)
        currentRowValues = [currentRow[x].value for x in columnIndex]
        values += formatString % tuple(currentRowValues)

print values
For the sample input you gave above this code outputs:
>>> 1.0,JaVAS,282001.0
2.0,JaVAS,282002.0
3.0,JaVAS,282003.0
And because I'm a Python noob, props to the other answers and questions this borrows from.
I have read it using the openpyxl library:
import openpyxl
from pathlib import Path

xlsx_file = Path('C:\\Users\\Amit\\Desktop\\ReadExcel', 'ReadData.xlsx')
wb_obj = openpyxl.load_workbook(xlsx_file)

# Read the active sheet:
sheet = wb_obj.active

for i in range(sheet.max_column):
    print(f'i = {i}')
    for row in sheet.iter_rows():
        print(row[i].value)
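To pull just the columns the question asks for by their header names with openpyxl, one option is to map the first row's values to column positions and read only those; a rough sketch, reusing the FORMAT list from the question (the file name is a placeholder):
import openpyxl

wb = openpyxl.load_workbook('sample.xlsx')
sheet = wb.active

FORMAT = ['Arm_id', 'DSPName', 'PinCode']

# map header name -> zero-based position, taken from the first row
header = next(sheet.iter_rows(min_row=1, max_row=1, values_only=True))
position = {name: i for i, name in enumerate(header)}

rows = []
for row in sheet.iter_rows(min_row=2, values_only=True):
    rows.append([row[position[name]] for name in FORMAT])

print(rows)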
Although I almost always just use pandas for this, my current little tool is being packaged into an executable and including pandas is overkill. So I created a version of poida's solution that resulted in a list of named tuples. His code with this change would look like this:
from xlrd import open_workbook
from collections import namedtuple
from pprint import pprint

wb = open_workbook('sample.xls')

FORMAT = ['Arm_id', 'DSPName', 'PinCode']
OneRow = namedtuple('OneRow', ' '.join(FORMAT))
all_rows = []

for s in wb.sheets():
    headerRow = s.row(0)
    columnIndex = [x for y in FORMAT for x in range(len(headerRow)) if y == headerRow[x].value]

    for row in range(1, s.nrows):
        currentRow = s.row(row)
        currentRowValues = [currentRow[x].value for x in columnIndex]
        all_rows.append(OneRow(*currentRowValues))

pprint(all_rows)
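A quick usage note: each entry then supports dot access by the names in the FORMAT list, for example:
for rec in all_rows:
    print(rec.Arm_id)  # fields follow the FORMAT order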
