When I put the code into Excel, every character is spaced out, so Tuesday ends up looking like T,u,e,s,d,a,y. The goal is for each cell in Excel to hold its own word, not a single character. There are many for loops and I'm struggling to find an answer to this ongoing problem. Any ideas?
import requests
from pprint import pprint
from xml.dom.minidom import parseString
from openpyxl import Workbook
NMNorth2=[("Farmington"),("Gallup"),("Grants"),("Las_Vegas"),("Raton"),("Santa_Fe"), ("Taos"),("Tijeras"),("Tucumcari")]
NMNorth=[("NM", "Farmington"),("NM", "Gallup"),("NM", "Grants"),("NM", "Las_Vegas"),("NM", "Raton"),("NM", "Santa_Fe"), ("NM", "Taos"),("NM", "Tijeras"),("NM", "Tucumcari")]
wb = Workbook()
dest_filename = 'weather.xlsx'
ws1 = wb.active
ws1.title = "Weather"
for state, city in NMNorth:
    r = requests.get("http://api.wunderground.com/api/id/forecast/q/"+state+"/"+city+".json")
    data = r.json()
    forecast = data['forecast']['txt_forecast']['forecastday']
    for n in forecast:
        day = n['title']
        forecaststm = (n['fcttext'])
        columnVariable = 2
        for x in day:
            ws1.cell(row = 1, column = columnVariable).value = x
            columnVariable +=1
        for y in forecaststm:
            ws1.cell(row = 2, column = columnVariable).value = y
            columnVariable +=1
rowVariable = 2
ws1.cell(row = 1, column = 1).value = "City"
for state in NMNorth2:
    ws1.cell(row = rowVariable, column = 1).value = state
    rowVariable +=1
wb.save(filename = dest_filename)
The issue here is that python treats strings as iterables. In other words, this bites you if you think you're iterating through a list of strings (or similar) and go one level too deep in nested for loops; the easiest way to identify this is to print what you're working with on each loop.
In your case, the below loop is taking each letter (x) in the day of the week (day), writing it to a column and then incrementing the column you're writing to (columnVariable):
for x in day:
    ws1.cell(row = 1, column = columnVariable).value = x
    columnVariable +=1
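The fix is to write the whole string into a single cell and advance the column once per forecast entry, not once per character. A minimal sketch reusing the variable names from the question (treat it as a starting point rather than a drop-in replacement):
columnVariable = 2
for n in forecast:
    # Write the whole word/sentence, e.g. "Tuesday", into one cell.
    ws1.cell(row = 1, column = columnVariable).value = n['title']
    ws1.cell(row = 2, column = columnVariable).value = n['fcttext']
    columnVariable += 1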
As an aside, camelCase isn't standard Python; it's more common to use underscores, e.g. column_variable. See PEP 8.
Related
I need to change the date format in an Excel column.
I can get it working for a single cell, but when it comes to updating the whole column with "proper_date" I am stuck.
wb = load_workbook(...)
ws = wb['Lista']
daty_wystawienia = ws['G']
# This solution works, but it assigns the values to the first column, under the chart
for daty in daty_wystawienia:
    date_string = daty.value
    if re.search('[0-9-]', str(date_string)):
        proper_date = datetime.datetime.strptime(date_string, '%d-%m-%Y').strftime('%y.%m.%d')
        for row in range(1):
            ws.append([proper_date])
# Tried to make the last line daty_wystawienia.append([proper_date]) but got:
# AttributeError: 'tuple' object has no attribute 'append'
wb.save(...)
# Also tried this, and only this seems to work, i.e. replacing values with correctly formatted ones, but I need this applied to the whole column at once:
wb = load_workbook(...)
ws = wb['Lista polis']
daty_wystawienia = ws['G']
ws['G6'] = "19.05.06"
ws['G7'] = "19.05.06"
ws['G8'] = "19.05.06"
ws['G10'] = "19.05.07"
ws['G11'] = "19.05.07"
# or replace
for i in ws['G']:
    ws['G9'] = ws['G9'].value.replace('06-05-2019', '10000000000')
wb.save(...)
Is there any way to replace, append, or override existing values in Excel using openpyxl? I am stuck on this.
Thanks in advance.
If you just want Excel to change the format of the Cell in order to display the date as you like, this is how I did it for a column:
from openpyxl import load_workbook
book = load_workbook('Example.xlsx')
ws = book['Sheet1']
for x in range(1, 500):
    _cell = ws.cell(x, 1)
    _cell.number_format = '[$-en-GB]dd-mmm-yyyy'
book.save("Dates.xlsx")
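One caveat: number_format only changes how a value is displayed, so if the cells still contain text such as "06-05-2019" rather than real dates, nothing will appear to change. A rough sketch of converting the values first, assuming the '%d-%m-%Y' input format from the question and that the dates sit in column G starting at row 2 (both of these are assumptions):
import datetime
from openpyxl import load_workbook
book = load_workbook('Example.xlsx')
ws = book['Sheet1']
for (cell,) in ws.iter_rows(min_col=7, max_col=7, min_row=2):  # column G; adjust to your layout
    if isinstance(cell.value, str):
        # Parse the text into a real date so the number format can take effect
        # (assumes every string in the column really is a d-m-Y date).
        cell.value = datetime.datetime.strptime(cell.value, '%d-%m-%Y').date()
        cell.number_format = 'yy.mm.dd'
book.save('Dates.xlsx')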
Thanks for your effort. It looks beautiful, but for some reason it does not work for me.
I went through it like this:
def date_of_issuance():
    for i in ws.iter_rows():
        for cell in i:
            d_w = 'Date of issuance'
            if cell.value == d_w:
                c = cell.column
                col = column_index_from_string(c)
                r = cell.row
                for daty in ws[c]:
                    date_string = daty.value
                    if re.search('[0-9]', str(date_string)):
                        proper_date = datetime.datetime.strptime(date_string, '%d-%m-%Y').strftime('%y-%m-%d')
                        date = datetime.datetime.strptime(proper_date, '%y-%m-%d').date()
                        for j in range(1):
                            ws.cell(row=r+1, column=col, value=date)
                        r += 1
I'm currently facing an issue where I need to bring all of the data shown in the images below into one line only.
So, using Python and openpyxl, I tried to write a parsing script that reads each line and copies values into a new workbook only when they are non-null or non-identical.
I get out of range errors, and the code does not keep just the data I want. I've spent multiple hours on it, so I thought I would ask here to see if I can get unstuck.
I've read some documentation on openpyxl and about making lists in Python, and tried a couple of videos on YouTube, but none of them did exactly what I was trying to achieve.
import openpyxl
from openpyxl import Workbook
path = "sample.xlsx"
wb = openpyxl.load_workbook(path)
ws = wb.active
path2 = "output.xlsx"
wb2 = Workbook()
ws2 = wb2.active
listab = []
rows = ws.max_row
columns = ws.max_column
for i in range (1, rows+1):
    listab.append([])
cellValue = " "
prevCell = " "
for c in range (1, rows+1):
    for r in range(1, columns+1):
        cellValue = ws.cell(row=r, column=c).value
        if cellValue == prevCell:
            listab[r-1].append(prevCell)
        elif cellValue == "NULL":
            listab[r-1].append(prevCell)
        elif cellValue != prevCell:
            listab[r-1].append(cellValue)
        prevCell = cellValue
for r in range(1, rows+1):
    for c in range (1, columns+1):
        j = ws2.cell(row = r, column=c)
        j.value = listab[r-1][c-1]
print(listab)
wb2.save("output.xlsx")
There should be one line with the below information:
ods_service_id | service_name | service_plan_name | CPU | RAM | NIC | DRIVE
Personally I would go with pandas.
import pandas as pd
#Loading into pandas
df_data = pd.read_excel('sample.xlsx')
df_data.fillna("NO DATA",inplace=True) ## Replaced nan values with "NO DATA"
unique_ids = df_data.ods_service_ids.unique()
#Storing pd into a list
records_list = df_data.to_dict('records')
keys_to_check = ['service_name', 'service_plan_name', 'CPU','RAM','NIC','DRIVE']
processed = {}
#Go through unique ids
for key in unique_ids:
    processed[key] = {}
    #Get related records
    matching_records = [y for y in records_list if y['ods_service_ids'] == key]
    #Loop through records
    for record in matching_records:
        #For each key to check, save in dict if non null
        processed[key]['ods_service_ids'] = key
        for detail_key in keys_to_check:
            if record[detail_key] != "NO DATA":
                processed[key][detail_key] = record[detail_key]
##Note : doesn't handle duplicate values for different keys so far
#Records are put back in list
output_data = [processed[x] for x in processed.keys()]
# -> to Pandas
df = pd.DataFrame(output_data)[['ods_service_ids','service_name', 'service_plan_name', 'CPU','RAM','NIC','DRIVE']]
#Export to Excel
df.to_excel("output.xlsx",sheet_name='Sheet_name_1', index=False)
The above should work, but I wasn't really sure how you wanted to save duplicated records for the same id. Do you want to store them as DRIVE_0, DRIVE_1, DRIVE_2?
EDIT:
df could be exported in a different way. Replace the #Export to Excel line above with the following:
df.to_excel("output.xlsx",sheet_name='Sheet_name_1')
EDIT 2:
With no input data it was hard to see any flaws. Corrected the code above with fake data.
To be honest, I think you've managed to get confused by data structures and come up with something far more complicated than you need.
One approach that would suit would be to use Python dictionaries for each service, updating them row by row.
from openpyxl import load_workbook, Workbook
wb = load_workbook("sample.xlsx")
ws = wb.active
objs = {}
headers = next(ws.iter_rows(min_row=1, max_row=1, values_only=True))
for row in ws.iter_rows(min_row=2, values_only=True):
    if row[0] not in objs:
        obj = {key: value for key, value in zip(headers, row)}
        objs[obj['ods_service_id']] = obj
    else:  # update the existing dict with non-null values
        extra = {key: value for key, value in zip(headers[3:], row[3:]) if value != "NULL"}
        objs[row[0]].update(extra)
# write to new workbook
wb2 = Workbook()
ws2 = wb2.active
ws2.append(headers)
for obj in objs.values():  # do they need sorting?
    ws2.append([obj[key] for key in headers])
Note how you can do everything without using counters.
So I was making a quick script to loop through a bunch of sheets in an Excel file (22 to be exact). What I wanted to do was the following:
Open the workbook, open the sheet named "All" (which contains a list of names), and then for each name:
1. Loop through all the other 22 sheets in the same workbook and look through each one for the name, which I knew was in the 'B' column.
2. If the name is found, take all the columns in that row containing the data for that name (columns A-H).
3. Copy and paste them next to the original name (same row) in the 'All' sheet, leaving a bit of a space between the original name and the others.
I wanted to do this for all 22 sheets and for the 200+ names listed in the 'All' sheet. My code is as follows:
import openpyxl, pprint
columns = ['A','B','C','D','E','F','G','H']
k = 10
x = 0
def colnum_string(n):
    string = ""
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        string = chr(65 + remainder) + string
    return string
print("Opening Workbook...")
wb = openpyxl.load_workbook('FileName.xlsx')
sheet_complete = wb.get_sheet_by_name("All")
row_count_all = sheet_complete.max_row
for row in range(4, row_count_all+1):
    k = 10
    cell = 'B' + str(row)
    print(cell)
    name = sheet_complete[cell].value
    for i in range(2, 23):
        sheet = wb.get_sheet_by_name(str(1995 + i))
        row_count = sheet.max_row
        for row2 in range(2, row_count+1):
            cell2 = 'B' + str(row2)
            name2 = sheet[cell].value
            if name == name2:
                x = x + 1
                for z in range(0, len(columns)):
                    k = k + 1
                    cell_data = sheet[columns[z] + str(row2)].value
                    cell_target = colnum_string(k) + str(row)
                    sheet_complete[cell_target] = cell_data
                wb.save('Scimago Country Ranking.xlsx')
                print("Completed " + str(x) + " Task(s)")
                break
The problem is that it keeps working with the first name only: it goes through all the names, but when it comes to copying and pasting the data it just redoes the first name, so in the end I end up with all the names in the 'All' sheet and, next to each one, the data for the first name repeated over and over. I can't see what's wrong with my code, but forgive me if it's a silly mistake, as I'm kind of a beginner at these Excel parsing scripts. The print statements are there for testing reasons.
P.S. I know I'm using a deprecated function and I will change that; I was just too lazy to do it since it seems to still work fine. If that's the problem, please let me know.
I'm trying to determine how much data is missing from a large excel sheet. The following code takes a prohibitive amount of time to complete. I've seen similar questions, but I'm not sure how to translate the answer to this case. Any help would be appreciated!
import openpyxl
wb = openpyxl.load_workbook('C://Users/Alec/Documents/Vertnet master list.xlsx', read_only = True)
sheet = wb.active
lat = 0
loc = 0
ele = 0
a = openpyxl.utils.cell.column_index_from_string('CF')
b = openpyxl.utils.cell.column_index_from_string('BU')
c = openpyxl.utils.cell.column_index_from_string('BX')
print('Workbook loaded')
for x in range(2, sheet.max_row):
    if sheet.cell(row = x, column = a).value:
        lat += 1
    if sheet.cell(row = x, column = b).value:
        loc += 1
    if sheet.cell(row = x, column = c).value:
        ele += 1
    print((x/sheet.max_row) * 100, '%')
print('Latitude: ', lat/sheet.max_row)
print('Location', loc/sheet.max_row)
print('Elevation', ele/sheet.max_row)
If you are simply trying to do the calculation on a table on the sheet rather than on the entire sheet, you could make one adjustment to speed things up.
row = 1
Do Until IsEmpty(range("A1").offset(row,1).value)
    if range("B"&row).value: lat += 1
    if range("C"&row).value: loc += 1
    if range("D"&row).value: ele += 1
    row = row + 1
Loop
This would take you to the end of your defined table rather than the end of the whole sheet, which is 90% of the reason it's taking so long.
Hope this helps
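In openpyxl terms, a rough sketch of the same idea, assuming column A is populated for every row of the table (reusing lat/loc/ele and the a/b/c column indices from the question):
lat = loc = ele = 0
row = 2
# Stop at the first empty cell in column A instead of walking to sheet.max_row.
while sheet.cell(row = row, column = 1).value is not None:
    if sheet.cell(row = row, column = a).value:
        lat += 1
    if sheet.cell(row = row, column = b).value:
        loc += 1
    if sheet.cell(row = row, column = c).value:
        ele += 1
    row += 1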
Your problem is that, despite advice in the documentation to the contrary, you're using your own counters to access cells. In read-only mode each use of ws.cell() will force the worksheet to reparse the XML source for the worksheet. Simply use ws.iter_rows(min_col=a, max_col=c) to get the cells in the columns you're interested in.
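A minimal sketch of that, reusing the question's column indices (b, i.e. BU, is the lowest of the three and a, i.e. CF, the highest, so one slice covers them all):
lat = loc = ele = 0
for row in sheet.iter_rows(min_row=2, min_col=b, max_col=a):
    # row is a tuple of cells spanning columns BU..CF
    if row[a - b].value:   # column CF
        lat += 1
    if row[0].value:       # column BU
        loc += 1
    if row[c - b].value:   # column BX
        ele += 1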
I'm working on an application that processes huge Excel 2007 files, and I'm using OpenPyXL to do it. OpenPyXL has two different methods of reading an Excel file - one "normal" method where the entire document is loaded into memory at once, and one method where iterators are used to read row-by-row.
The problem is that when I'm using the iterator method, I don't get any document metadata like column widths and row/column counts, and I really need this data. I assume this data is stored close to the top of the Excel document, so it shouldn't be necessary to load the whole 10 MB file into memory to get access to it.
So, is there a way to get ahold of the row/column count and column widths without loading the entire document into memory first?
Adding on to what Hubro said, apparently get_highest_row() has been deprecated. Using the max_row and max_column properties returns the row and column count. For example:
wb = load_workbook(path, use_iterators=True)
sheet = wb.worksheets[0]
row_count = sheet.max_row
column_count = sheet.max_column
The solution suggested in this answer has been deprecated, and might no longer work.
Taking a look at the source code of OpenPyXL (IterableWorksheet), I've figured out how to get the column and row count from an iterator worksheet:
wb = load_workbook(path, use_iterators=True)
sheet = wb.worksheets[0]
row_count = sheet.get_highest_row() - 1
column_count = letter_to_index(sheet.get_highest_column()) + 1
IterableWorksheet.get_highest_column returns a string with the column letter that you can see in Excel, e.g. "A", "B", "C" etc. Therefore I've also written a function to translate the column letter to a zero based index:
def letter_to_index(letter):
    """Converts a column letter, e.g. "A", "B", "AA", "BC" etc. to a zero based
    column index.
    A becomes 0, B becomes 1, Z becomes 25, AA becomes 26 etc.
    Args:
        letter (str): The column index letter.
    Returns:
        The column index as an integer.
    """
    letter = letter.upper()
    result = 0
    for index, char in enumerate(reversed(letter)):
        # Get the ASCII number of the letter and subtract 64 so that A
        # corresponds to 1.
        num = ord(char) - 64
        # Multiply the number with 26 to the power of `index` to get the correct
        # value of the letter based on its index in the string.
        final_num = (26 ** index) * num
        result += final_num
    # Subtract 1 from the result to make it zero-based before returning.
    return result - 1
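For example, a quick sanity check of the helper:
letter_to_index("A")   # -> 0
letter_to_index("Z")   # -> 25
letter_to_index("AA")  # -> 26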
I still haven't figured out how to get the column sizes though, so I've decided to use a fixed-width font and automatically scaled columns in my application.
Python 3
import openpyxl as xl
wb = xl.load_workbook("Sample.xlsx", read_only=True)
# The two lines below do the same thing.
sheet = wb["sheet"]
sheet = wb.worksheets[0]
row_count = sheet.max_row
column_count = sheet.max_column
# This works for me.
This might be extremely convoluted and I might be missing the obvious, but without OpenPyXL filling in the column_dimensions in Iterable Worksheets (see my comment above), the only way I can see of finding the column size without loading everything is to parse the xml directly:
from xml.etree.ElementTree import iterparse
from openpyxl import load_workbook
wb = load_workbook("/path/to/workbook.xlsx", use_iterators=True)
ws = wb.worksheets[0]
xml = ws._xml_source
xml.seek(0)
for _, x in iterparse(xml):
    name = x.tag.split("}")[-1]
    if name == "col":
        print("Column %(max)s: Width: %(width)s" % x.attrib)  # width = x.attrib["width"]
    if name == "cols":
        print("break before reading the rest of the file")
        break
https://pythonhosted.org/pyexcel/iapi/pyexcel.sheets.Sheet.html
See row_range(), a utility function to get the row range.
If you use pyexcel, you can call row_range() to get the maximum number of rows.
Tested with Python 3.4.
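A rough sketch (the file name is a placeholder, and the pyexcel-xlsx plugin is assumed to be installed for .xlsx support):
import pyexcel
sheet = pyexcel.get_sheet(file_name="file.xlsx")
print(len(sheet.row_range()))                              # number of rows
print(sheet.number_of_rows(), sheet.number_of_columns())   # rows and columns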
Options using pandas.
Get all sheet names with the count of rows and columns for each:
import pandas as pd
xl = pd.ExcelFile('file.xlsx')
sheetnames = xl.sheet_names
for sheet in sheetnames:
    df = xl.parse(sheet)
    dimensions = df.shape
    print(sheet, ' --> ', dimensions)
Single sheet count of rows and columns.
import pandas as pd
xl = pd.ExcelFile('file.xlsx')
sheetnames = xl.sheet_names
df = xl.parse(sheetnames[0]) # [0] get first tab/sheet.
dimensions = df.shape
print(f'sheetname: "{sheetnames[0]}" --> {dimensions}')
Output: sheetname: "Sheet1" --> (row count, column count)