Empty DataFrame when using read_excel() with skiprows

I have an Excel file with an explanation written in the first row.
I want to use the second row as the header and load the data below it into a DataFrame.
So I wrote the following code, but the result is an empty DataFrame:
df = pd.read_excel(filename, skiprows=1)
print(df)
Empty DataFrame
Columns: []
Index: []
If I open the file in Excel, delete the first row, and do not use skiprows, the correct DataFrame appears. What should I fix?
Since there is only one sheet, I did not specify the sheet.
When I open the file as it was first saved, the cursor is positioned at row 1, column A. If I change the cursor position and save the file, the DataFrame comes out fine. How do I move the cursor and save it?

skiprows=1 works fine for me:
pd.read_excel('book1.xlsx', skiprows=1)
Here is an alternative using openpyxl that you can try:
import pandas as pd
from openpyxl import load_workbook
wb = load_workbook('book1.xlsx')
ws = wb['Sheet1']  # get_sheet_by_name() is deprecated
# ws.values yields one tuple of cell values per row
data = [list(row) for row in ws.values]
# Row index 1 (the second row) is the header; the data starts on the row after it
df = pd.DataFrame(data[2:], columns=data[1])

Have you tried skiprows=[0]?
This is documented here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
The first remaining row is then used as the header by default.

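The skiprows argument counts rows from 0 when it is given a list, so skiprows=[0] skips the first row. Given an integer, it is the number of rows to skip from the top, so skiprows=1 skips the first row and skiprows=0 skips nothing.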
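The difference between the integer and list forms of skiprows discussed in these answers can be modeled in plain Python. This is only a sketch of the documented behaviour, not how pandas implements it; apply_skiprows and the sample rows are hypothetical names and data made up for illustration.

```python
def apply_skiprows(rows, skiprows):
    """Return the rows left after skipping, modeling the documented semantics."""
    if isinstance(skiprows, int):
        # Integer form: number of rows to drop from the top
        return rows[skiprows:]
    # List-like form: 0-based indices of the rows to drop
    skip = set(skiprows)
    return [row for i, row in enumerate(rows) if i not in skip]


# Hypothetical sheet contents: a note in the first row, the header in the second
rows = [
    ("explanatory note", None),  # row index 0
    ("name", "score"),           # row index 1: intended header
    ("alice", 90),
    ("bob", 85),
]

remaining = apply_skiprows(rows, 1)
header, data = remaining[0], remaining[1:]
```

Under this model, skiprows=1 and skiprows=[0] leave the same rows, while skiprows=0 leaves the note row in place as the header.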

Related

Python openpyxl writing into an Excel file at random locations

I'm working on an API that I was given and am having trouble writing to the correct row in an Excel file. The code is supposed to find the first row in the table that has any empty cells and then write to it. Sometimes the first available row is reported as a seemingly random row that is already full, so the data there gets overwritten; other times rows with only 11 of 30 cells filled are skipped entirely. I have two programs doing this, and both have the same issue.
Here is the row selection and writing portion of the code.
wb = openpyxl.load_workbook(path)
ws = wb.active
# This loop goes over the Excel rows to find the first empty row. It increments
# the variable "firstEmptyRow" until it finds the first empty row.
firstEmptyRow = 0
for row in ws:
    if not any([cell.value == None for cell in row]):
        firstEmptyRow += 1
print(firstEmptyRow)
# The following lines write the DataFrames to the Excel file at path.
with pd.ExcelWriter(path, mode="a", engine="openpyxl", if_sheet_exists="overlay") as writer:
    allDf.to_excel(writer, sheet_name="Main Table", index=False, header=False, startrow=firstEmptyRow, startcol=0)
with pd.ExcelWriter(path, mode="a", engine="openpyxl", if_sheet_exists="overlay") as writer:
    combinedArray.to_excel(writer, sheet_name="Main Table", index=False, header=False, startrow=firstEmptyRow, startcol=18)
Thank you for any help you have. :) Let me know if you have any questions.
To try to fix this, I've checked whether there is data in each row, and there is. Beyond this, I have deleted any data that may have been in cells outside of the table. Neither helped.
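One possible reading of the symptom: the loop in the question counts every fully populated row in the sheet instead of stopping at the first row with a gap, so the count can point at a row that is already full. A minimal sketch of the stopping version follows, written against plain tuples so it runs without a workbook. The function name, the width parameter, and the sample rows are assumptions for illustration; with openpyxl the rows could come from ws.iter_rows(values_only=True).

```python
def first_row_with_gap(rows, width):
    """Return the 0-based index of the first row with an empty cell in its first
    `width` columns, or the number of rows if every row is full."""
    rows = list(rows)
    for index, row in enumerate(rows):
        # Pad short rows so missing trailing cells count as empty
        cells = list(row[:width]) + [None] * (width - len(row))
        if any(value is None for value in cells):
            return index
    return len(rows)


# Hypothetical table: the third row (index 2) has a gap, the fourth is full
rows = [(1, 2, 3), (4, 5, 6), (7, None, 9), (10, 11, 12)]

target = first_row_with_gap(rows, width=3)

# What the loop in the question computes: the number of full rows anywhere
full_rows = sum(1 for row in rows if not any(v is None for v in row))
```

Here target is the row with the gap, while full_rows points at the last row, which is already full. Because the index is 0-based, it lines up with the 0-based startrow argument of to_excel().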

Formatting of Excel sheets in Python

In my project I am opening an Excel file with multiple sheets. I want to manipulate "sheet2" in Python (which works fine) and then overwrite the old "sheet2" with the new one but keep the formatting, so something like this:
import pandas as pd
update_sheet2 = pd.read_excel(newest_isaac_file, sheet_name='sheet2')
# do stuff with the sheet
# Note: KEEP_FORMATTING is not a real ExcelWriter argument; it only shows what I am looking for
with pd.ExcelWriter(filepath, engine='openpyxl', if_sheet_exists='replace', mode='a',
                    KEEP_FORMATTING=True) as writer:
    df.to_excel(writer, sheet_name=sheetname, index=index)
In other words: Is there a way to get the formatting from an existing Excel sheet?
I could not find anything about that. I know I can set the formatting manually in Python, but the formatting of the existing sheet is really complicated and has to stay the same.
Thanks for your help!

As per your comment, try this code. It opens a file (Sample.xlsx), goes to a sheet (Sheet1), inserts a new row at row 15, and copies the values and formatting from row 22 into the empty row 15.
import openpyxl
from copy import copy
wb = openpyxl.load_workbook('Sample.xlsx')  # Load workbook
ws = wb['Sheet1']  # Open sheet
ws.insert_rows(15, 1)  # Insert one row at 15 and move everything one row downwards
for row in ws.iter_rows(min_row=22, max_row=22, min_col=1, max_col=ws.max_column):  # Read values from row 22
    for cell in row:
        ws.cell(row=15, column=cell.column).value = cell.value  # Copy the value from row 22 into new row 15
        ws.cell(row=15, column=cell.column)._style = copy(cell._style)  # Copy formatting
wb.save('Sample.xlsx')

Copy column of cell values from one workbook to another with openpyxl

I am extracting data from one workbook's column and need to copy the data to another existing workbook.
This is how I extract the data (works fine):
from openpyxl import load_workbook
wb2 = load_workbook('C:\\folder\\AllSitesOpen2.xlsx')
ws2 = wb2['report1570826222449']
# Extract column A from Open Sites
DateColumnA = []
for row in ws2.iter_rows(min_row=16, max_row=None, min_col=1, max_col=1):
    for cell in row:
        DateColumnA.append(cell.value)
DateColumnA
The above code successfully collects the cell values from each row of the first column into DateColumnA.
I'd like to paste the values stored in DateColumnA into this existing destination workbook:
# File to be pasted into
wb3 = load_workbook('C:\\folder\\output.xlsx')
ws3 = wb3['Sheet1']
But I am missing a piece conceptually here and can't connect the dots. Can someone advise how I can get this data from my source workbook into the destination workbook?

Let's say you want to copy the column into 'Sheet1' of wb3, starting at cell 'A1':
wb3 = load_workbook('C:\\folder\\output.xlsx')
ws3 = wb3['Sheet1']
for counter in range(len(DateColumnA)):
    cell_id = 'A' + str(counter + 1)  # 'A1', 'A2', ...
    ws3[cell_id] = DateColumnA[counter]
wb3.save('C:\\folder\\output.xlsx')

I ended up getting this to write the list to another pre-existing spreadsheet:
for x, rows in enumerate(DateColumnA):
    ws3.cell(row=x+1, column=1).value = rows
    #print(rows)
wb3.save('C:\\folder\\output.xlsx')
Works great, but now I need to determine how to write the data to output.xlsx starting at row 16 instead of row 1 so I don't overwrite the first 16 existing header rows in output.xlsx. Any ideas appreciated.
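One way to shift the starting row is to let enumerate() begin counting at the target row. The sketch below only computes the (row, column, value) triples so it runs without a workbook; plan_writes, its defaults, and the sample dates are hypothetical names and data for illustration.

```python
def plan_writes(values, start_row=16, column=1):
    """Pair each value with the 1-based (row, column) it should be written to."""
    return [(row, column, value) for row, value in enumerate(values, start=start_row)]


# Stand-in for the list extracted from the source workbook
DateColumnA = ["2019-10-01", "2019-10-02", "2019-10-03"]

writes = plan_writes(DateColumnA, start_row=16)

# With openpyxl, each triple would then be applied as:
#     ws3.cell(row=row, column=column, value=value)
```

The same idea works directly in the original loop by writing enumerate(DateColumnA, start=16) and using the counter as the row number.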

I figured out a more concise way to write the source data to a different starting row on the destination sheet in a different workbook. I do not need to dump the values into a list as I did above. iter_rows does all the work, and openpyxl nicely passes the values to a different workbook and worksheet:
row_offset = 5
for rows in ws2.iter_rows(min_row=2, max_row=None, min_col=1, max_col=1):
    for cell in rows:
        ws3.cell(row=cell.row + row_offset, column=1, value=cell.value)
wb3.save('C:\\folder\\DestFile.xlsx')

Convert an Excel file with many sheets (with spaces in the sheet names) into a pandas DataFrame

I would like to convert an Excel file to a pandas DataFrame. All the sheet names contain spaces, for instance 'part 1 of 22', 'part 2 of 22', and so on. In addition, the first column is the same in all the sheets.
I would like to convert this Excel file into a single DataFrame. However, I don't know what happens to the names in Python. I was able to import the sheets, but I do not know the names of the resulting DataFrames.
After this I would like to use another for loop with pd.merge() in order to create a single DataFrame.
for sheet_name in Matrix.sheet_names:
    sheet_name = pd.read_excel(Matrix, sheet_name)
    print(sheet_name.info())

In the code snippet you have shown, each sheet (each DataFrame) is assigned to the variable sheet_name. This variable is overwritten on each iteration, so at the end only the last sheet remains assigned to it.
To achieve what you want, you have to store each sheet, loaded as a DataFrame, somewhere, in a list for example. You can then merge or concatenate them, depending on your needs.
Try this:
all_my_sheets = []
for sheet_name in Matrix.sheet_names:
    df = pd.read_excel(Matrix, sheet_name)
    all_my_sheets.append(df)
Or, even better, using a list comprehension:
all_my_sheets = [pd.read_excel(Matrix, sheet_name) for sheet_name in Matrix.sheet_names]
You can then concatenate them into one DataFrame like this:
final_df = pd.concat(all_my_sheets, sort=False)

You might consider using the openpyxl package:
from openpyxl import load_workbook
import pandas as pd
wb = load_workbook(filename=file_path, read_only=True)
records = []
# Assuming every sheet has the same header in its first row
for i, ws in enumerate(wb.worksheets):
    # Keep the header row from the first sheet only, so it is not duplicated
    first_row = 1 if i == 0 else 2
    for row in ws.iter_rows(min_row=first_row, values_only=True):
        records.append(list(row))
# ------------------------------
# Set the column names
header = records.pop(0)
# Create your df
df = pd.DataFrame(records, columns=header)
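The header handling in the openpyxl approach above (keep the header row once, then append only data rows from each sheet) can be sketched on plain Python data. stack_sheets and the sample sheets are hypothetical, and the check that every sheet repeats the same header is an added assumption, not something the answer states.

```python
def stack_sheets(sheets):
    """Combine {sheet_name: rows} into (header, records), keeping one header."""
    header = None
    records = []
    for name, rows in sheets.items():
        rows = list(rows)
        if not rows:
            continue  # skip empty sheets
        if header is None:
            header = list(rows[0])  # first non-empty sheet supplies the header
        elif list(rows[0]) != header:
            raise ValueError(f"sheet {name!r} has a different header")
        # Append only the data rows, never the repeated header
        records.extend(list(row) for row in rows[1:])
    return header, records


# Hypothetical workbook contents keyed by sheet name (names may contain spaces)
sheets = {
    "part 1 of 2": [("id", "value"), (1, "a"), (2, "b")],
    "part 2 of 2": [("id", "value"), (3, "c")],
}

header, records = stack_sheets(sheets)
```

The resulting header and records have the shape expected by pd.DataFrame(records, columns=header).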

It may be easiest to call read_excel() once and keep the results together.
So, the first step would look like this:
dfs = pd.read_excel(Matrix, sheet_name=["Sheet 1", "Sheet 2", "Sheet 3"])
Note that the sheet names you use in the list must match those in the Excel file exactly, including any spaces. With a list of sheet names, read_excel() returns a dict that maps each sheet name to its DataFrame. Then, if you wanted to vertically concatenate these sheets, you would just call:
final_df = pd.concat(list(dfs.values()), axis=0)
Note that this solution would result in a final_df that includes the column headers from all three sheets, so ideally they would be the same. It sounds like you want to merge the information, which would be done differently; we can't help you with the merge without more information.
I hope this helps!

openpyxl row iterator ignores row_offset argument?

I am trying to sift through tons of worthless data, and I want to give the user the opportunity to set the offset herself. The code ultimately ends up looking like:
master_rows = self.worksheet.iter_rows(row_offset=10000)
However, upon calling next(master_rows)[0], the output turns out to be:
RawCell(row=1, column='A', coordinate='A1', [...] )
Basically, the offset seems to be completely ignored; iteration always starts from the first row. Am I doing something wrong?

According to the source code, if you don't pass range_string, then iter_rows will read all columns and rows in the sheet. In other words, row_offset only takes effect if range_string is provided.
For example:
from openpyxl import load_workbook
wb = load_workbook('test.xlsx', use_iterators=True)
ws = wb.get_sheet_by_name('Sheet1')
# printing coordinates of all rows and cols (row_offset is ignored here)
for row in ws.iter_rows(row_offset=2):
    for cell in row:
        print(cell.coordinate)
# printing coordinates from "A3:E5" range
for row in ws.iter_rows(range_string="A1:E3", row_offset=2):
    for cell in row:
        print(cell.coordinate)
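If the offset is not honored by iter_rows in a given setup, one version-independent sketch is to skip rows on the iterator itself with itertools.islice. The helper name rows_from and the sample rows are hypothetical. Note that this still reads the skipped rows rather than seeking past them; other snippets on this page pass min_row to iter_rows instead, which may be preferable where it is available.

```python
from itertools import islice


def rows_from(rows, offset):
    """Return an iterator over `rows` that skips the first `offset` rows."""
    return islice(rows, offset, None)


# Stand-in for a worksheet row iterator such as worksheet.iter_rows()
all_rows = [("A1",), ("A2",), ("A3",), ("A4",)]

first = next(rows_from(iter(all_rows), 2))
```

With an offset of 2, the first row returned is the third row of the sheet; an offset past the end simply yields nothing.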
