Extract excel ranges into a dataframe in python - python

I have a working excel sheet that does not contain any tables. Instead it has multiple sections of data. I want to extract certain ranges of cells from this sheet and create a new data source that can be used to create Power BI reports.
Examples of the ranges are:
range1 = ws['A5':'N7']
range2 = ws['A12':'N13']
range3 = ws['A17':'N20']
range4 = ws['A33':'N35']
range5 = ws['A41':'N42']
When I print the values of these ranges using Python and openpyxl I get a long list of values which I would like to transform into a new dataframe with custom column headers.
How do I transform that list into a table that I can then either export to an excel or into a sql database?
Thank you

I'd use pandas.read_excel (see the pandas documentation for it). Note the usecols, skiprows, and nrows parameters, which select the ranges. Simply call it multiple times to access different ranges.
Subsequently, I'd use DataFrame.to_excel (and the other to_... methods, such as to_sql) to export the dataframe to the appropriate format.
Personally, I only use openpyxl directly when I need to optimize performance, or when I can't install pandas.
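To make the suggestion above concrete, here is a minimal sketch, not a definitive implementation. The row ranges come from the question; the column names (col_1 ... col_14), the read_ranges helper and the in-memory demo workbook are assumptions made up for illustration. It opens the workbook once with pd.ExcelFile and calls read_excel once per range.

```python
import io

import pandas as pd
from openpyxl import Workbook

# (first_row, last_row) pairs, 1-based as shown in Excel; taken from the question
ROW_RANGES = [(5, 7), (12, 13), (17, 20), (33, 35), (41, 42)]
# Hypothetical custom headers for columns A..N
COLUMN_NAMES = [f"col_{i}" for i in range(1, 15)]


def read_ranges(source, row_ranges=ROW_RANGES, usecols="A:N", names=COLUMN_NAMES):
    """Read each row range from the first sheet and stack them into one DataFrame."""
    frames = []
    with pd.ExcelFile(source, engine="openpyxl") as workbook:
        for first, last in row_ranges:
            frame = pd.read_excel(
                workbook,
                sheet_name=0,
                header=None,              # the ranges carry no header row
                usecols=usecols,          # restrict to columns A..N
                skiprows=first - 1,       # skip everything above the range
                nrows=last - first + 1,   # read only the rows in the range
                names=names,              # apply the custom column headers
            )
            frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Demo workbook (made-up data): each cell holds row * 100 + column
demo = Workbook()
sheet = demo.active
for r in range(1, 46):
    for c in range(1, 15):
        sheet.cell(row=r, column=c, value=r * 100 + c)
buffer = io.BytesIO()
demo.save(buffer)
buffer.seek(0)

table = read_ranges(buffer)
print(table.head())
```

From there, table.to_excel("ranges.xlsx", index=False) or table.to_sql(...) would write the result out; the file name here is only a placeholder.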

Related

Pandas read_excel: How to preserve cell format information for currency and percent

[ 10-07-2022 - For anyone stopping by with the same issue. After much searching, I have yet to find a way, that isn't convoluted and complicated, to accurately pull mixed type data from excel using Pandas/Python. My solution is to convert the files using unoconv on the command line, which preserves the formatting, then read into pandas from there. ]
I have to concatenate 1000s of individual excel workbooks, each with a single sheet, into one master sheet. I use a for loop to read them into a data frame, then concatenate the data frame to a master data frame. There is one column in each that could represent currency, percentages, or just contain notes. Sometimes it has been filled out with explicit indicators in the cell, e.g. '$' - other times, someone has used cell formatting to indicate currency while leaving just a decimal in the cell. I've been using a formatting routine to catch some of this but have run into some edge cases.
Consider a case like the following:
In the actual spreadsheet, you see: $0.96
When read_excel siphons this in, it will be represented as 0.96. Because of the mixed-type nature of the column, there is no sure way to know whether this is 96% or $0.96
Is there a way to read excel files into a data frame for analysis and record what is visually represented in the cell, regardless of whether cell formatting was used or not?
I've tried using dtype="str", dtype="object" and have tried using both the default and openpyxl engines.
UPDATE
Taking the comments below into consideration, I'm rewriting with openpyxl.
from pathlib import Path
import openpyxl
import pandas as pd
def excel_concat(df_source):
    df_master = pd.DataFrame()
    for index, row in df_source.iterrows():
        # Build the full path to each workbook from the source table
        excel_file = Path(row['Test Path']) / Path(row['Original Filename'])
        wb = openpyxl.load_workbook(filename=excel_file)
        ws = wb.active
        # ws.values yields the stored cell values only, without number formats
        df_data = pd.DataFrame(ws.values)
        df_master = pd.concat([df_master, df_data], ignore_index=True)
    return df_master
df_master1 = excel_concat(df_excel_files)
This appears to be nothing more than a "longcut" to just calling the openpyxl engine with pandas. What am I missing in order to capture the visible values in the excel files?
Looking at https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html , I noticed the following:
dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32} Use object to preserve data as stored in Excel and not interpret dtype. **If converters are specified, they will be applied INSTEAD of dtype conversion.**
converters : dict, default None
Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the Excel cell content, and return the transformed content.
Do you think that might work for you?
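Another angle, not taken from the posts above and offered only as a rough sketch: openpyxl exposes each cell's number_format next to its value, so the loop from the update could inspect it before the data reaches pandas. The displayed_text helper, its "%" and "$" heuristics, and the demo workbook are assumptions for illustration; real workbooks may use other format codes (accounting formats, other currencies) that this does not handle.

```python
import io

from openpyxl import Workbook, load_workbook


def displayed_text(cell):
    """Approximate what Excel shows for a cell, using a crude number_format check."""
    value = cell.value
    number_format = cell.number_format or ""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        if "%" in number_format:
            return f"{value:.0%}"      # percent-formatted cell
        if "$" in number_format:
            return f"${value:,.2f}"    # dollar-formatted cell
    return value                       # text, or a format we do not recognise


# Demo workbook (made-up data): the same stored number with two different formats
workbook = Workbook()
sheet = workbook.active
sheet["A1"] = 0.96
sheet["A1"].number_format = '"$"#,##0.00'
sheet["A2"] = 0.96
sheet["A2"].number_format = "0%"
sheet["A3"] = "note"

buffer = io.BytesIO()
workbook.save(buffer)
buffer.seek(0)

reloaded = load_workbook(buffer).active
shown = [displayed_text(cell) for cell in reloaded["A"]]
print(shown)
```

Building the DataFrame from values like these, instead of from ws.values, would keep the currency/percent distinction as text, at the cost of the column no longer being numeric.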

How to extract multiple pandas dataframes from tables in PDF file and store them as CSVs in Python?

I have a cookbook PDF file consisting of various tables that describe the variables used in one of the datasets I am working with. Since the actual data consists of the values that I need to look up, I need to create multiple CSV output files from all the tables in this cookbook.
For instance, on page 15 of this PDF file, we have a table as below from which I need to extract pandas dataframe so that I can save it as a CSV file for later use. I do not care about the "Totals" in these tables since I only need the value and the label field.
I tried to solve this problem by using camelot library in Python -
import camelot
# try extracting table from 1 of the pages
tables = camelot.read_pdf('/Users/Downloads/TEDS-A-2018-DS0001-info-codebook_v1.pdf', pages = '12')
# check data
>>> type(tables)
<class 'camelot.core.TableList'>
>>> len(tables)
0
I am not sure why I do not get any tables in the output. Any help is highly appreciated.
Update - I have also tried out the tabula library however I only get odd rows and not even rows from a table. Here is my code for this trial -
import tabula as tb
pdf_loc = 'csvs/TEDS-A-2018-DS0001-info-codebook_v1.pdf'
list_of_dataframs = tb.read_pdf(input_path=pdf_loc, pages='all')
number_of_dfs = len(list_of_dataframs)
print('first df in list')
list_of_dataframs[0]
Here is the output -
The PDF cookbook can be found here
One can use Tabula and try a few of its parameters.
In your case, the structure of the table is similar throughout the PDF, so we can use Tabula's columns parameter to define our own column structure. If we don't set this parameter, Tabula tries to guess the column structure on its own, and it sometimes fails to identify the right table structure.
tables = tabula.read_pdf(filename, area = (0,0,800,800), pages=15, columns = (95, 410, 490), pandas_options={'header': None})
After using that parameter I am getting below output for page-15 of the PDF:
We can use this for all the pages, and of course we can also do some further processing to remove unnecessary rows, so that you get clean tabular data. I would be glad to help further, and I hope this works for you.
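As a rough sketch of the row clean-up mentioned above (not part of the original answer): the function below assumes each extracted page arrives as a headerless DataFrame whose first two columns are the value code and its label, and that real data rows have a purely numeric code. The sample rows are made up for illustration and are not taken from the PDF.

```python
import pandas as pd


def clean_codebook_table(raw):
    """Keep value/label columns and drop header, total and blank rows."""
    table = raw.iloc[:, :2].copy()
    table.columns = ["value", "label"]
    table = table.dropna(subset=["value", "label"])
    # Assumption: genuine data rows have an integer code in the first column
    is_code = table["value"].astype(str).str.strip().str.fullmatch(r"-?\d+")
    return table[is_code].reset_index(drop=True)


# Made-up stand-in for one DataFrame returned by tabula.read_pdf
raw = pd.DataFrame(
    [
        ["Value", "Label", "Frequency", "%"],
        ["1", "Example label A", "10", "50%"],
        ["2", "Example label B", "10", "50%"],
        [None, "Total", "20", "100%"],
    ]
)

cleaned = clean_codebook_table(raw)
print(cleaned)
```

Each cleaned table could then be saved with cleaned.to_csv("some_variable.csv", index=False), where the file name is a placeholder.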

Using Styleframe to pull styles of individual cells from Excel

I'm trying to write a script that merges two excel files together. One has been hand processed and has a bunch of custom formatting done to it, and the other is an auto-generated file. Doing the merge in pandas is simple enough, but preserving the formatting is proving troublesome. I found the styleframe library, which seems like it should simplify what I'm trying to do, as it can import style info in addition to the raw data. However, I'm having problems actually implementing the code.
My question is this: how can I pull style information from each individual cell in the excel file and then apply that to my merged dataframe? Note that the data is not formatted consistently across columns or rows, so I don't think I can apply styles in this manner. Here's the relevant portion of my code:
#iterate through all cells of merged dataframe
for rownum, row in output_df.iterrows():
    for column, value in row.items():  # Series.iteritems() was removed in pandas 2.0
        filename = row['File Name']
        cur_style = orig_excel.loc[orig_excel['File Name'] == filename, column][0].style #pulls the style of relevant cell in the original excel document
        target_style = output_df.loc[output_df['File Name'] == filename, column][0].style #style of the cell in the merged dataframe
        target_style = cur_style #set style in current output_df cell to match original excel file style
This code runs (slowly) but it doesn't seem to actually apply any styling to the output styleframe
Looking through the documentation, I don't really see a method for applying styles at an individual styleframe container level--everything is geared towards doing it as a row or column. It also seems like you need to use a styler object to set the style.
Figured it out. I rejiggered my dataframe so that I could just use a .at lookup instead of a .loc lookup. This, coupled with the apply_style_by_indexes method, got me where I needed to be:
for index, row in orig_excel.iterrows():
    for column, value in row.items():  # Series.iteritems() was removed in pandas 2.0
        index_num = output_df.index.get_loc(index)
        #Pull style to copy to new df
        cur_style = orig_excel.at[index, column].style
        #Apply original style to new df
        output_df.apply_style_by_indexes(output_df.index[index_num],
                                         cur_style,
                                         cols_to_style = column)

How to write dataframe to csv with a single row header(5k columns)?

I am trying to export a pandas dataframe with to_csv so it can be processed by another tool before using it again with python. It is a token dataset with 5k columns. When exported the header is split in two rows. This might not be an issue for pandas but in this case I need to export it on a single row csv. Is this a pandas limitation or a csv format one?
So far, searching has returned no workable results. The only solution I came up with is writing the column names and the values separately, e.g. writing a list of column names as strings first and then a numpy array to the csv. Can this be implemented, and if so, how?
For me this problem was caused by having multiple index levels on the columns (a MultiIndex). The easiest way to resolve this issue is to specify your own headers. I found reference to an option called tupleize_cols but it doesn't exist in current (1.2.2) pandas.
I was using the following aggregation:
df.groupby(["device"]).agg({
    "outage_length":["count","sum"],
}).to_csv("example.csv")
This resulted in the following csv output:
,outage_length,outage_length
,count,sum
device,,
device0001,3,679.0
device0002,1,113.0
device0003,2,400.0
device0004,1,112.0
I specified my own headers in the call to to_csv, excluding the groupby column, as follows:
}).to_csv("example.csv",header=("flaps","downtime"))
And got the following csv output, which was much more pleasing to spreadsheet software:
device,flaps,downtime
device0001,3,679.0
device0002,1,113.0
device0003,2,400.0
device0004,1,112.0
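An alternative to passing header=, shown here only as a sketch under assumptions: flatten the two column levels into single names before calling to_csv. The sample data and the "_" joiner are made up for illustration, and the CSV is written to an in-memory buffer rather than to example.csv.

```python
import io

import pandas as pd

# Made-up outage records, just enough to produce two header levels
outages = pd.DataFrame(
    {
        "device": ["device0001", "device0001", "device0002"],
        "outage_length": [100.0, 50.0, 113.0],
    }
)

summary = outages.groupby(["device"]).agg({"outage_length": ["count", "sum"]})

# Columns are now a MultiIndex: ("outage_length", "count"), ("outage_length", "sum").
# Joining the levels gives one plain name per column, hence a single header row.
summary.columns = ["_".join(levels) for levels in summary.columns]

buffer = io.StringIO()
summary.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

This keeps the header derived from the data, which may be handier than typing names by hand when there are thousands of columns.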

python equivalent to listObjects in VBA for Excel (tables)

I have implemented a program in VBA for excel to generate automatic communications based on user inputs (selections of cells).
Such Macro written in VBA uses extensively the listObject function of VBA
i.e.
defining a table (list object)
Dim ClsSht As Worksheet
Set ClsSht = ThisWorkbook.Sheets("paragraph texts")
Dim ClsTbl As ListObject
Set ClsTbl = ClsSht.ListObjects(1)
accessing the table in the code in a very logical manner:
ClsTbl being now the table where I want to pick up data.
myvariable= ClsTbl.listcolumns("D1").databodyrange.item(34).value
Which means myvariable is the item (row) 34 of the data of the column D1 of the table clstbl
I decided to learn Python to "translate" all that code into Python and make a Django-based program accessible to anyone.
I am a beginner in Python and I am wondering what would be the equivalent in python to listobject of VBA. This decision will shape my whole program in python from the beginning, and I am hesitating a lot to decide what is the python equivalent to listobject in VBA.
The main idea here getting a way where I can access tables-data in a readable way,
i.e. give me the value of column "text" where column "chapter" is 3 and column paragraph is "2". The values are unique, meaning there is only one value in "text" column where that occurs.
Some observations:
I know everything can be done with lists in python, lists can contain lists that can contain lists..., but this is terrible for readability. mylist1[2][3] (assuming for instance that every row could be a list of values, and the whole table a list of lists of rows).
I don't consider building a database an option. There are multiple relatively small tables (from 10 to 500 rows and from 3 to 15 columns) that are related, but not in a database manner. That would force me to learn yet another language, SQL or so, and I have more than enough with Python and Django.
The user modifies the structure of many tables (chapters being merged or split).
The data is 100% strings. The only integers are numbers used to sort the text. I don't perform any mathematical operations on the values; I simply join pieces of text together and make replacements in texts.
The tables will be loaded into Python as CSV text files.
Please tell me if anything in the question is not clear enough and I will complete it.
Would it be necessary to work with numpy? pandas?
i.e give me the value of cell
A pandas DataFrame should provide everything you need, i.e. conversion to strings, manipulation, import and export. As a start, try
import pandas as pd
df = pd.read_csv('your_file.csv')
print(df)
print(df['text'])
The entries of the first row will be converted to labels of the DataFrame columns.
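For the kind of lookup described in the question ("the value of column text where chapter is 3 and paragraph is 2"), a minimal sketch could look like the following. The CSV content and the lookup_text helper are made up for illustration; dtype=str is used because the question says the data is all strings.

```python
import io

import pandas as pd

# Made-up CSV standing in for one of the small tables from the question
csv_text = """chapter,paragraph,text
1,1,Opening paragraph
3,2,Target paragraph
3,3,Another paragraph
"""

# Read every column as text so codes like "3" are compared as strings
df = pd.read_csv(io.StringIO(csv_text), dtype=str)


def lookup_text(table, chapter, paragraph):
    """Return the single 'text' value matching the given chapter and paragraph."""
    mask = (table["chapter"] == chapter) & (table["paragraph"] == paragraph)
    matches = table.loc[mask, "text"]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one match, found {len(matches)}")
    return matches.iloc[0]


value = lookup_text(df, "3", "2")
print(value)
```

The column names in the condition play roughly the role that ListColumns("D1") plays in the VBA code, so the lookup stays readable without nested lists.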
