I would like to import Excel tables (made with the Excel 2007-and-above table feature) from a workbook into separate dataframes. Apologies if this has been asked before, but my searches didn't turn up what I wanted. I know you can easily do this using the read_excel function; however, that requires specifying a sheet name, or it returns a dict of dataframes, one per sheet.
Instead of specifying a sheet name, I was wondering whether there is a way of specifying a table name, or better yet, returning a dict of dataframes, one for each table in the workbook.
I know this can be done by combining xlwings with pandas, but I was wondering whether it is built into any of the pandas functions already (maybe ExcelFile).
Something like this:
import pandas as pd

xls = pd.ExcelFile('excel_file_path.xls')

# to read all tables into a dict (hypothetical API)
table_to_df_map = {}
for table_name in xls.table_names:
    table_to_df_map[table_name] = xls.parse(table_name)
Although not exactly what I was after, I have found a way to get table names, with the caveat that it is restricted to a single, named sheet.
Here's an excerpt from the code that I'm currently using:
import pandas as pd
import openpyxl as op

wb = op.load_workbook(file_location)

# Connect to the specified worksheet
ws = wb[sheetname]

# Initialise an empty list that the Excel tables will be imported into
var_tables = []

# Import table details from Excel: table name and sheet range
for table in ws._tables:
    sht_range = ws[table.ref]
    data_rows = []
    i = 0
    j = 0
    for row in sht_range:
        j += 1
        data_cols = []
        for cell in row:
            i += 1
            data_cols.append(cell.value)
            # Append an extra column holding the table name
            if (i == len(row)) and (j == 1):
                data_cols.append('Table_Name')
            elif i == len(row):
                data_cols.append(table.name)
        data_rows.append(data_cols)
        i = 0
    var_tables.append(data_rows)

# Create an empty list that all the dfs will be appended into
var_df = []

# Append each table extracted from Excel into the list
for tb in var_tables:
    df = pd.DataFrame(tb[1:], columns=tb[0])
    var_df.append(df)

# Merge all into one big df
df = pd.concat(var_df, axis=1)  # This merges on columns
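The same per-table extraction can be generalised to the dict-of-dataframes the question asks for. This is only a sketch, assuming a reasonably recent openpyxl (where `ws.tables.items()` yields `(name, range)` pairs); it builds a small demo workbook in memory so it is self-contained:

```python
import io

import pandas as pd
from openpyxl import Workbook, load_workbook
from openpyxl.worksheet.table import Table

def tables_to_dataframes(wb):
    """Return a dict mapping every Excel table name in the workbook to a DataFrame."""
    dfs = {}
    for ws in wb.worksheets:
        for name, ref in ws.tables.items():   # e.g. ("Fruit", "A1:B3")
            rows = [[cell.value for cell in row] for row in ws[ref]]
            dfs[name] = pd.DataFrame(rows[1:], columns=rows[0])
    return dfs

# Build a small demo workbook in memory so the sketch is runnable as-is
wb = Workbook()
ws = wb.active
for row in [["name", "qty"], ["apples", 3], ["pears", 5]]:
    ws.append(row)
ws.add_table(Table(displayName="Fruit", ref="A1:B3"))

buf = io.BytesIO()
wb.save(buf)
buf.seek(0)

dfs = tables_to_dataframes(load_workbook(buf))
```

Unlike the sheet-bound excerpt above, this walks every worksheet, so the dict covers the whole workbook.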
I have an Excel sheet with multiple tables in it. I'm using openpyxl's .tables attribute to read the tables, but it returns an empty list even though there are two tables. Is there a way to achieve this in Python? I need to further process the data from these tables after extracting them into a dataframe, but the tables themselves are not being detected. Any pointers on this would be helpful.
from openpyxl import load_workbook
import pandas as pd
#read file
wb = load_workbook('29.xlsx')
#access specific sheet
ws = wb["Sheet1"]
print(ws.tables.items())
Below is the structure of the Excel sheet.
Parsing the dataframe that was read, I am able to get only the tables as output, as given below.
Is there a better way to handle this, so that it works for other Excel files of a similar kind with multiple tables in them?
import pandas

df = pandas.read_excel('29.xlsx', engine='openpyxl', index_col=None)
n_columns = df.shape[1]
a_list = []
for i in df.itertuples():
    # Count the NaN cells in this row
    j = 0
    for x in i:
        if pandas.isna(x):
            j = j + 1
        if j == (n_columns - 1):
            break
    # Keep the row only if it is not entirely blank
    if j < (n_columns - 1):
        print(i)
        a_list.append(i)
df1 = pandas.DataFrame(a_list)
del df1[df1.columns[0]]
print(df1.head(1))
df1.to_excel("output.xlsx", header=None, index=False)
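A more generic way to handle sheets of this kind, assuming the sub-tables are separated by fully blank rows (an assumption about the layout, since the sheet itself isn't shown here), is to split the raw frame wherever an all-NaN row occurs:

```python
import numpy as np
import pandas as pd

def split_on_blank_rows(df):
    """Split a raw sheet DataFrame into sub-tables wherever an all-NaN row appears."""
    blank = df.isna().all(axis=1)
    group_ids = blank.cumsum()          # rows between blanks share a group id
    tables = []
    for _, chunk in df[~blank].groupby(group_ids[~blank]):
        chunk = chunk.reset_index(drop=True)
        chunk.columns = chunk.iloc[0]   # promote the first row to the header
        tables.append(chunk.iloc[1:].reset_index(drop=True))
    return tables

# Two stacked tables separated by a blank row, as read with header=None
raw = pd.DataFrame([["a", "b"],
                    [1, 2],
                    [np.nan, np.nan],
                    ["x", "y"],
                    [3, 4]])
t1, t2 = split_on_blank_rows(raw)
```

Each returned frame keeps its own header row, so no hard-coded column count is needed.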
The code below only exports the last table on the page to Excel, but when I run the print function it prints all of them. Is there an issue with my code causing it not to export all the data to Excel?
I've also tried exporting as .csv file with no luck.
import pandas as pd

url = 'https://www.vegasinsider.com/college-football/matchups/'
dfs = pd.read_html(url)
for df in dfs:
    if len(df.columns) > 1:
        df.to_excel(r'VegasInsiderCFB.xlsx', index=False)
        #print(df)
Your problem is that each time df.to_excel is called, you overwrite the file, so only the last df is left. What you need to do is use a writer and specify a sheet name for each separate df, e.g.:
url = 'https://www.vegasinsider.com/college-football/matchups/'
writer = pd.ExcelWriter('VegasInsiderCFB.xlsx', engine='xlsxwriter')
dfs = pd.read_html(url)
counter = 0
for df in dfs:
    if len(df.columns) > 4:
        counter += 1
        df.to_excel(writer, sheet_name=f"sheet_{counter}", index=False)
writer.save()
You might need to pip install xlsxwriter to make it work (xlwt is only needed for writing legacy .xls files).
Exporting to a csv will never work, since a csv is a single data table (like a single sheet in excel), so in that case you would need to use a new csv for each df.
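The one-CSV-per-df route looks like this; a short sketch (the filenames and sample frames here are made up for illustration):

```python
import os
import tempfile

import pandas as pd

# stand-ins for the frames scraped with pd.read_html
dfs = [pd.DataFrame({"team": ["A", "B"], "spread": [-3.5, 3.5]}),
       pd.DataFrame({"team": ["C", "D"], "spread": [-7.0, 7.0]})]

outdir = tempfile.mkdtemp()
paths = []
for i, df in enumerate(dfs):
    # a fresh file per frame, so nothing gets overwritten
    path = os.path.join(outdir, f"matchup_{i}.csv")
    df.to_csv(path, index=False)
    paths.append(path)
```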
As pointed out in the comments, it would be possible to write the data onto a single sheet without changing the dfs, but it is likely much better to merge them:
import pandas as pd
import numpy as np

url = 'https://www.vegasinsider.com/college-football/matchups/'
dfs = pd.read_html(url)
dfs = [df for df in dfs if len(df.columns) > 4]
columns = ["gameid", "game time", "team"] + list(dfs[0].iloc[1])[1:]
N = len(dfs)
values = np.empty((2 * N, len(columns)), dtype=object)  # np.object is deprecated
for i, df in enumerate(dfs):
    time = df.iloc[0, 0].replace(" Game Time", "")
    values[2*i:2*i+2, 2:] = df.iloc[2:, :]
    values[2*i:2*i+2, :2] = np.array([[i, time], [i, time]])
newdf = pd.DataFrame(values, columns=columns)
newdf.to_excel("output.xlsx", index=False)
I used a numpy array of object dtype to be able to copy a submatrix from the original dataframes easily into its intended place. I also created a gameid that connects the games across rows. It should now be trivial to rewrite this so you loop through a list of urls and write each result to a separate sheet.
I don't understand how to import a Smartsheet sheet and convert it to a pandas dataframe. I want to manipulate the data from Smartsheet. Currently I go to Smartsheet, export to CSV, and import the CSV in Python, but I want to eliminate this step so that the process can run on a schedule.
import smartsheet
import pandas as pd

access_token = '#################'
smartsheet_client = smartsheet.Smartsheet(access_token)
sheet = smartsheet_client.Sheets.get_sheet('Sheet 1')
pd.DataFrame(sheet)
Here is a simple method to convert a sheet to a dataframe:
def simple_sheet_to_dataframe(sheet):
    col_names = [col.title for col in sheet.columns]
    rows = []
    for row in sheet.rows:
        cells = []
        for cell in row.cells:
            cells.append(cell.value)
        rows.append(cells)
    data_frame = pd.DataFrame(rows, columns=col_names)
    return data_frame
The only issue with creating a dataframe from smartsheets is that for certain column types cell.value and cell.display_value are different. For example, contact columns will either display the name or the email address depending on which is used.
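One way around that, sketched here with simple stand-in objects since a live Smartsheet connection isn't available, is to prefer `display_value` and fall back to `value`:

```python
from types import SimpleNamespace

def cell_text(cell):
    """Prefer the human-readable display_value; fall back to the raw value."""
    return cell.display_value if cell.display_value is not None else cell.value

# stand-ins for smartsheet Cell objects, which expose the same two attributes
contact = SimpleNamespace(value="jane@example.com", display_value="Jane Doe")
number = SimpleNamespace(value=42.0, display_value=None)

texts = [cell_text(contact), cell_text(number)]
```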
Here is a snippet of what I use when needing to pull in data from Smartsheet into Pandas. Note, I've included garbage collection as I regularly work with dozens of sheets at or near the 200,000 cell limit.
import smartsheet
import pandas as pd
import gc
configs = {'api_key': 0000000,
           'value_cols': ['Assigned User']}

class SmartsheetConnector:
    def __init__(self, configs):
        self._cfg = configs
        self.ss = smartsheet.Smartsheet(self._cfg['api_key'])
        self.ss.errors_as_exceptions(True)

    def get_sheet_as_dataframe(self, sheet_id):
        sheet = self.ss.Sheets.get_sheet(sheet_id)
        col_map = {col.id: col.title for col in sheet.columns}
        # rows = sheet id, row id, cell values or display values
        data_frame = pd.DataFrame([[sheet.id, row.id] +
                                   [cell.value if col_map[cell.column_id] in self._cfg['value_cols']
                                    else cell.display_value for cell in row.cells]
                                   for row in sheet.rows],
                                  columns=['Sheet ID', 'Row ID'] +
                                          [col.title for col in sheet.columns])
        del sheet, col_map
        gc.collect()  # force garbage collection
        return data_frame

    def get_report_as_dataframe(self, report_id):
        rprt = self.ss.Reports.get_report(report_id, page_size=0)
        page_count = int(rprt.total_row_count / 10000) + 1
        col_map = {col.virtual_id: col.title for col in rprt.columns}
        data = []
        for page in range(1, page_count + 1):
            rprt = self.ss.Reports.get_report(report_id, page_size=10000, page=page)
            data += [[row.sheet_id, row.id] +
                     [cell.value if col_map[cell.virtual_column_id] in self._cfg['value_cols']
                      else cell.display_value for cell in row.cells] for row in rprt.rows]
        del rprt
        data_frame = pd.DataFrame(data, columns=['Sheet ID', 'Row ID'] + list(col_map.values()))
        del col_map, page_count, data
        gc.collect()
        return data_frame
This adds additional columns for sheet and row IDs so that I can write back to Smartsheet later if needed.
Sheets cannot be retrieved by name, as your example code attempts. It is entirely possible to have multiple sheets with the same name, so you must retrieve a sheet by its sheetId number.
For example:
sheet = smartsheet_client.Sheets.get_sheet(4583173393803140) # sheet_id
http://smartsheet-platform.github.io/api-docs/#get-sheet
Smartsheet sheets have a lot of properties associated with them. You'll need to go through the rows and columns of your sheet to retrieve the information you're looking for, and construct it in a format your other system can recognize.
The API docs contain a listing of properties and examples. As a minimal example:
for row in sheet.rows:
    for cell in row.cells:
        # Do something with cell.object_value here
Get the sheet as a csv:
(https://smartsheet-platform.github.io/api-docs/?python#get-sheet-as-excel-pdf-csv)
smartsheet_client.Sheets.get_sheet_as_csv(
    1531988831168388,  # sheet_id
    download_directory_path)
Read the csv into a DataFrame:
(https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
pandas.read_csv
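The two steps can be glued together in a small helper; this is a sketch that assumes the SDK's get_sheet_as_csv returns a DownloadedFile object with a filename attribute, and it is exercised below against a stand-in client, since there is no live API here:

```python
import os
import tempfile
from types import SimpleNamespace

import pandas as pd

def sheet_to_dataframe_via_csv(client, sheet_id, download_dir):
    """Download a sheet as CSV via the SDK, then load it with pandas."""
    result = client.Sheets.get_sheet_as_csv(sheet_id, download_dir)
    return pd.read_csv(os.path.join(download_dir, result.filename))

# Stand-in for smartsheet_client that just writes a CSV where asked
class _FakeSheets:
    def get_sheet_as_csv(self, sheet_id, download_dir):
        with open(os.path.join(download_dir, "demo.csv"), "w") as f:
            f.write("col_a,col_b\n1,2\n3,4\n")
        return SimpleNamespace(filename="demo.csv")

fake_client = SimpleNamespace(Sheets=_FakeSheets())
df = sheet_to_dataframe_via_csv(fake_client, 1531988831168388, tempfile.mkdtemp())
```

With a real `smartsheet_client` in place of the fake, the same helper gives a scheduled job a dataframe directly, with no manual export step.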
You can use the smartsheet-dataframe library.
It is very easy to use and allows Sheets or Reports to be delivered as a DataFrame.
pip install smartsheet-dataframe
Get a report as df
from smartsheet_dataframe import get_report_as_df

df = get_report_as_df(token='smartsheet_auth_token',
                      report_id=report_id_int)
Get a sheet as df
from smartsheet_dataframe import get_sheet_as_df

df = get_sheet_as_df(token='smartsheet_auth_token',
                     sheet_id=sheet_id_int)
Replace 'smartsheet_auth_token' with your token (numbers and letters) and sheet_id_int/report_id_int with your sheet/report id (numbers only).
I am trying to make a CSV and an Excel file. I followed an online aid; however, it appears not to work and raises KeyError: 'Teflon'. Any thoughts why?
Here is the aid I was following: Aid
import pandas as pd
import os

def sort_data_frame_by_Teflon_column(dataframe):
    dataframe = dataframe.sort_values(by=['Teflon'])

def sort_data_frame_by_LacticAcid_column(dataframe):
    dataframe = dataframe.sort_values(by=['Lactic Acid'])

def sort_data_frame_by_ExperimentalWeight_column(dataframe):
    dataframe = dataframe.sort_values(by=['Experimental Weight'])

def sort_data_frame_by_Ratio_column(dataframe):
    dataframe = dataframe.sort_values(by=['Ratio of Lactic Acid to Experimental Weight'])

def get_data_in_Teflon(dataframe):
    dataframe = dataframe.loc[dataframe['Teflon']]
    dataframe = dataframe.sort_values(by=['Teflon'])

def get_data_in_LacticAcid(dataframe):
    dataframe = dataframe.loc[dataframe['Lactic Acid']]
    dataframe = dataframe.sort_values(by=['Lactic Acid'])

def get_data_in_ExperimentalWeight(dataframe):
    dataframe = dataframe.loc[dataframe['Experimental Weight']]
    dataframe = dataframe.sort_values(by=['Experimental Weight'])

def get_data_in_Ratio(dataframe):
    dataframe = dataframe.loc[dataframe['Ratio of Lactic Acid to Experimental Weight']]
    dataframe = dataframe.sort_values(by=['Ratio of Lactic Acid to Experimental Weight'])

path = 'C:\\Users\\Light_Wisdom\\Documents\\Spyder\\Mass-TeflonLacticAcidRatio.csv'
#output_file = open(path,'x')
#text = input("Input Data: ")
#text.replace('\\n', '\n')
#output_file.write(text.replace('\\', ''))
#output_file.close()

csv_file = 'C:\\Users\\Light_Wisdom\\Documents\\Spyder\\Mass-TeflonLacticAcidRatio.csv'
dataframe = pd.read_csv(csv_file)
dataframe = dataframe.set_index('Teflon')

sort_data_frame_by_Teflon_column(dataframe)
sort_data_frame_by_LacticAcid_column(dataframe)
sort_data_frame_by_ExperimentalWeight_column(dataframe)
sort_data_frame_by_Ratio_column(dataframe)
get_data_in_Teflon(dataframe)
get_data_in_LacticAcid(dataframe)
get_data_in_ExperimentalWeight(dataframe)
get_data_in_Ratio(dataframe)

write_to_csv_file_by_pandas("C:\\images\\Trial1.csv", dataframe)
write_to_excel_file_by_pandas("C:\\images\\Trial1.xlsx", dataframe)
#data_frame.to_csv(csv_file_path)
#excel_writer = pd.ExcelWriter(excel_file_path, engine='xlsxwriter')
#excel_writer.save()
Here is the CSV:
Teflon,Lactic Acid,Experimental Weight,Ratio of Lactic Acid to Experimental Weight
1.973,.2201,1.56,.14
2.05,.15,.93,.16
1.76,.44,1.56,.28
Edit New Question 7/24/19
I am trying to automate an answer with functions, and I got this error in the attempt.
def get_Data():
    check = 'No'
    while check == 'Yes':
        row_name = input("What is the row number? ")
        row_name = []
        data = float(input("Teflon, Lactic_Acid, Expt_Wt, LacticAcid_to_Expt1_Wt: "))
        dataframe = []
        check = input("Add another row? ")
    return row_name, data, dataframe

def row_inputter(row_name, data, dataframe):
    row_name.append(data)
    dataframe.append(row_name)
    return row_name, dataframe

# Define your data
#row1 = [1.973, .2201, 1.56, .14]
#row2 = [2.05, .15, .93, .16]
#row3 = [1.76, .44, 1.56, .28]

row_name, data, dataframe = get_Data()
row, df = row_inputter()
I can tell that you are a Pandas beginner. No worries... Here's how you do the first few operations.
The aid that you reference does things the old-fashioned way and doesn't leverage the many fine tools already created for working with CSV and XLSX data in and out of Pandas and Python.
XlsxWriter is a fabulous library that reads and writes Pandas data easily:
https://xlsxwriter.readthedocs.io/working_with_pandas.html
# Do necessary imports
import pandas as pd
import os
import xlsxwriter
# Define your data
expt_data = ["Teflon", "Lactic_Acid", "Expt_Wt", "LacticAcid_to_Exptl_Wt"]
row1 = [ 1.973, .2201, 1.56, .14]
row2 = [2.05, .15, .93, .16]
row3 = [1.76, .44, 1.56, .28]
# Create dataframe using constructor method
df1 = pd.DataFrame([row1, row2, row3], columns=expt_data)
# Output dataframe
df1
# Sort dataframe by Teflon column values and output it
Teflon_Sorted = df1.sort_values(by=["Teflon"])
Teflon_Sorted
# Sort dataframe by Lactic_Acid column values and output it
Lactic_Acid_Sorted = df1.sort_values(by=["Lactic_Acid"])
Lactic_Acid_Sorted
# Sort dataframe by Expt_Wt column values and output it
Expt_Wt_sorted = df1.sort_values(by=["Expt_Wt"])
Expt_Wt_sorted
# Sort dataframe by Expt_Wt column values and output it
LacticAcid_to_Exptl_Wt_sorted = df1.sort_values(by=["LacticAcid_to_Exptl_Wt"])
LacticAcid_to_Exptl_Wt_sorted
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter("Trial1.xlsx", engine='xlsxwriter')
# Convert all dataframes to XlsxWriter Excel objects and then write each to a different worksheet in the workbook created above named "Trial1.xlsx".
Teflon_Sorted.to_excel(writer, sheet_name='Teflon_Sorted')
Lactic_Acid_Sorted.to_excel(writer, sheet_name='Lactic_Acid_Sorted')
Expt_Wt_sorted.to_excel(writer, sheet_name='Expt_Wt_sorted')
LacticAcid_to_Exptl_Wt_sorted.to_excel(writer, sheet_name='LacticAcid_to_Exptl_Wt_sorted')
# Close the Pandas Excel writer and output the Excel file.
writer.save()
# now go to your current directory in your file system where the Jupyter Notebook or Python file is executing and find your file.
# Type !dir in Jupyter cell to list current directory on MS-Windows
!dir
Sorry this is not a complete application that does everything you want, but I have limited time. I showed you how to write out your final results. I left it as a learning exercise for you to learn how to read in your data file, rather than creating it "on the fly" inline in your Python program.
My recommendation is to use XlsxWriter for everything related to Excel or Pandas. Follow the fabulous tutorial on the XlsxWriter website. XlsxWriter is probably the best and easiest Python-Pandas-Excel toolkit right now. It does everything programmatically that someone would normally have to do manually ("interactively").
You already set Teflon as the index with
dataframe = dataframe.set_index('Teflon')
so your dataframe no longer contains that column. Your function
sort_data_frame_by_Teflon_column()
will therefore fail and throw that error.
Also, the other functions like:
def get_data_in_LacticAcid(dataframe):
    dataframe = dataframe.loc[dataframe['Lactic Acid']]
    dataframe = dataframe.sort_values(by=['Lactic Acid'])
will likely fail or turn your dataframe into an empty one because of the first line. What exactly are you trying to achieve with those functions?
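A quick illustration of why the KeyError appears, using the same column names as the question:

```python
import pandas as pd

df = pd.DataFrame({"Teflon": [1.973, 2.05, 1.76],
                   "Lactic Acid": [0.2201, 0.15, 0.44]})

# set_index moves 'Teflon' out of the columns by default,
# so dataframe['Teflon'] afterwards raises KeyError
indexed = df.set_index("Teflon")

# either sort before indexing...
sorted_df = df.sort_values(by=["Teflon"])
# ...or sort the index itself after indexing
by_index = indexed.sort_index()
```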
I am new to Python and need your help. I am trying to write code that iterates through a particular column in Excel using openpyxl.
import pandas as pd
from openpyxl import load_workbook

d = pd.read_excel('workbook.xlsx', header=None)
wb = load_workbook('workbook.xlsx')
So in the above example, I have to go to column J and display all the values in that column.
Also, I have the same column name repeated in my Excel sheet. For example, the column name "Sample" appears in both B2 and J2, but I want all the column information from J2.
Please let me know how to solve this. Thank you.
Since you're new to python, you should learn to read the documentation. There are tons of modules available and it will be quicker for you and easier for the rest of us if you make the effort first.
import openpyxl
from openpyxl.utils import cell as cellutils
## My example book simply has "=Address(Row(),Column())" in A1:J20
## Because my example uses formulae, I am loading my workbook with
## "data_only = True" in order to get the values; if your cells do not
## contain formulae, you can omit data_only
workbook = openpyxl.load_workbook("workbook.xlsx", data_only = True)
worksheet = workbook.active
## Alternatively: worksheet = workbook["sheetname"]
## A container for gathering the cell values
output = []
## Current Row = 2 assumes that Cell 1 (in this case, J1) contains your column header
## Adjust as necessary
column = cellutils.column_index_from_string("J")
currentrow = 2
## Get the first cell
cell = worksheet.cell(column = column, row = currentrow)
## The purpose of "While cell.value" is that I'm assuming the column
## is complete when the cell does not contain a value
## If you know the exact range you need, you can either use a for-loop,
## or look at openpyxl.utils.cell.rows_from_range
while cell.value:
    ## Add the cell value to our list of values for this column
    output.append(cell.value)
    ## Move to the next row
    currentrow += 1
    ## Get that cell
    cell = worksheet.cell(column = column, row = currentrow)
print(output)
""" output: ['$J$2', '$J$3', '$J$4', '$J$5', '$J$6', '$J$7',
'$J$8', '$J$9', '$J$10', '$J$11', '$J$12', '$J$13', '$J$14',
'$J$15', '$J$16', '$J$17', '$J$18', '$J$19', '$J$20']
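If the range is known in advance, openpyxl's iter_rows avoids the manual while-loop; here is a sketch using an in-memory workbook (column J is column index 10; values_only needs a reasonably recent openpyxl):

```python
from openpyxl import Workbook

# Build a tiny demo workbook with a header and four values in column J
wb = Workbook()
ws = wb.active
ws["J1"] = "Sample"
for r in range(2, 6):
    ws.cell(row=r, column=10, value=f"value{r}")

# values_only=True yields plain values instead of Cell objects
column_j = [row[0]
            for row in ws.iter_rows(min_col=10, max_col=10, min_row=2, values_only=True)
            if row[0] is not None]
```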