I'm trying to write a script that merges two Excel files. One has been hand-processed and has a lot of custom formatting applied, and the other is auto-generated. Doing the merge in pandas is simple enough, but preserving the formatting is proving troublesome. I found the StyleFrame library, which seems like it should simplify this, as it can import style information in addition to the raw data. However, I'm having problems actually implementing the code.
My question is this: how can I pull style information from each individual cell in the Excel file and then apply it to my merged dataframe? Note that the data is not formatted consistently across columns or rows, so I don't think I can apply styles a whole row or column at a time. Here's the relevant portion of my code:
# iterate through all cells of merged dataframe
for rownum, row in output_df.iterrows():
    for column, value in row.iteritems():
        filename = row['File Name']
        cur_style = orig_excel.loc[orig_excel['File Name'] == filename, column][0].style  # pulls the style of relevant cell in the original excel document
        target_style = output_df.loc[output_df['File Name'] == filename, column][0].style  # style of the cell in the merged dataframe
        target_style = cur_style  # set style in current output_df cell to match original excel file style
This code runs (slowly), but it doesn't seem to actually apply any styling to the output StyleFrame.
Looking through the documentation, I don't really see a method for applying styles at an individual styleframe container level--everything is geared towards doing it as a row or column. It also seems like you need to use a styler object to set the style.
Figured it out. I rejiggered my dataframe so that I could use an .at lookup instead of a .loc lookup. This, coupled with the apply_style_by_indexes method, got me where I needed to be:
for index, row in orig_excel.iterrows():
    # .items() replaces .iteritems(), which was removed in pandas 2.0
    for column, value in row.items():
        index_num = output_df.index.get_loc(index)
        # Pull style to copy to new df
        cur_style = orig_excel.at[index, column].style
        # Apply original style to new df
        output_df.apply_style_by_indexes(output_df.index[index_num],
                                         cur_style,
                                         cols_to_style=column)
[ 10-07-2022 - For anyone stopping by with the same issue. After much searching, I have yet to find a way, that isn't convoluted and complicated, to accurately pull mixed type data from excel using Pandas/Python. My solution is to convert the files using unoconv on the command line, which preserves the formatting, then read into pandas from there. ]
I have to concatenate thousands of individual Excel workbooks, each with a single sheet, into one master sheet. I use a for loop to read them into a data frame, then concatenate each data frame to a master data frame. There is one column in each that could represent currency, percentages, or just contain notes. Sometimes it has been filled out with explicit indicators in the cell, e.g. '$'; other times, someone has used cell formatting to indicate currency while leaving just a decimal in the cell. I've been using a formatting routine to catch some of this but have run into some edge cases.
Consider a case like the following:
In the actual spreadsheet, you see: $0.96
When read_excel reads this in, it is represented as 0.96. Because of the mixed-type nature of the column, there is no sure way to know whether this is 96% or $0.96.
Is there a way to read excel files into a data frame for analysis and record what is visually represented in the cell, regardless of whether cell formatting was used or not?
I've tried using dtype="str", dtype="object" and have tried using both the default and openpyxl engines.
UPDATE
Taking the comments below into consideration, I'm rewriting with openpyxl.
from pathlib import Path

import openpyxl
import pandas as pd

def excel_concat(df_source):
    df_master = pd.DataFrame()
    for index, row in df_source.iterrows():
        excel_file = Path(row['Test Path']) / Path(row['Original Filename'])
        wb = openpyxl.load_workbook(filename=excel_file)
        ws = wb.active
        # ws.values yields the stored cell values, not the formatted display text
        df_data = pd.DataFrame(ws.values)
        df_master = pd.concat([df_master, df_data], ignore_index=True)
    return df_master

df_master1 = excel_concat(df_excel_files)
This appears to be nothing more than a "longcut" to just calling the openpyxl engine with pandas. What am I missing in order to capture the visible values in the excel files?
Looking at https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html, I noticed the following:
dtype: Type name or dict of column -> type, default None
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}. Use object to preserve data as stored in Excel and not interpret dtype. **If converters are specified, they will be applied INSTEAD of dtype conversion.**
converters: dict, default None
Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the Excel cell content, and return the transformed content.
Do you think that might work for you?
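One thing to keep in mind: both dtype and converters operate on the value stored in the cell (0.96), not on how Excel displays it. A possible way to recover the display intent is to read each cell's number_format with openpyxl. The following is only a sketch under stated assumptions: the workbook is built in memory for illustration, describe is a hypothetical helper, and checking for '$' or '%' in the format string is a heuristic that will not cover every currency or accounting format.

```python
from openpyxl import Workbook

# Build a tiny workbook in memory so the sketch is self-contained.
wb = Workbook()
ws = wb.active
ws["A1"] = 0.96
ws["A1"].number_format = '"$"#,##0.00'   # Excel would display this as $0.96
ws["A2"] = 0.96
ws["A2"].number_format = "0%"            # Excel would display this as 96%
ws["A3"] = "see notes"                   # plain text, default "General" format


def describe(cell):
    """Hypothetical helper: return (value, kind) guessed from the number format."""
    fmt = cell.number_format or "General"
    if "$" in fmt:
        return cell.value, "currency"
    if "%" in fmt:
        return cell.value, "percent"
    return cell.value, "plain"


# Walk column A and record both the stored value and the guessed kind.
records = [describe(row[0]) for row in ws.iter_rows(min_col=1, max_col=1)]
print(records)
```

With real files, the same helper could be applied to cells from openpyxl.load_workbook(...), and the resulting pairs stored in two DataFrame columns so the stored value and its guessed meaning stay together.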
I am trying to read an excel file and write every fourth row into a new Excel file. I'm using Pandas to read and write, and if int(num%4) == 0 to determine which rows to select, but the iteration and subsequent writing continue to escape me. I've tried my best to look up answers, but I'm a new programmer and struggling :/
If you're using Pandas I'm assuming you've loaded the data into a dataframe?
If so then consider this:
import pandas as pd

df = pd.read_csv('YourFile.csv')
# iloc returns a new DataFrame, so assign the result back to df
df = df.iloc[::4]
# once you're done with the data you can save it to another csv file
df.to_csv('OutputFile.csv')
This leaves your dataframe df with every fourth row of the original file, starting from the first (positions 0, 4, 8, ...). Use df.iloc[3::4] instead if you want the 4th, 8th, 12th, etc. rows. You can then read from or write to each row left in df. To compare before and after, call df.head() before and after the df.iloc[::4] line.
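A small in-memory check of what each slice selects may help here. This is only a sketch: the 12-row frame and its value column are made up for illustration.

```python
import pandas as pd

# Hypothetical 12-row frame; values 1..12 stand in for spreadsheet rows 1..12.
df = pd.DataFrame({"value": range(1, 13)})

# Start at position 0 and take every fourth row: the 1st, 5th and 9th rows.
from_first = df.iloc[::4]

# Start at position 3 instead to get the 4th, 8th and 12th rows.
every_fourth = df.iloc[3::4]

print(from_first["value"].tolist())
print(every_fourth["value"].tolist())
```

Neither slice modifies df itself, which is why the result has to be assigned to a name before it is written out.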
I don't fully understand what the specific problem is, but you should try pandas' iloc indexer (or loc, depending on your df). More info here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html
I'm fairly new to pandas and stuck on how to refer to a column when the same column name appears under several different merged header cells. Here is an example of the problem: I want to get the Worker data for company C. But if I define this Excel file as df and write
dfcompanyAworker=df[Worker]
it won't work.
Is there a specific way to select a column when the column names are identical like this?
Here's the table:
https://i.stack.imgur.com/8Y6gp.png
Thanks!
First read the dataset that will be used and describe its layout through the read_excel arguments. For example, for the Excel format:
dfcompanyAworker = pd.read_excel('Worker', skiprows=1, header=[1,2], index_col=0, skipfooter=7)
dfcompanyAworker
where:
skiprows=1 to ignore the title row in the data
header=[1, 2] is a list because we have multilevel columns, namely Category (Company) and other data
index_col=0 to make the Date column an index for easier processing and analysis
skipfooter=7 to ignore the footer at the end of the data line
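Once two header rows are read as multilevel columns, a column is addressed by a (top level, sub-level) tuple. Here is a minimal sketch with made-up company names, column names, and values, since the original table is only available as an image:

```python
import pandas as pd

# Made-up data: two header levels (company, field), mimicking merged header cells.
columns = pd.MultiIndex.from_product(
    [["Company A", "Company C"], ["Worker", "Salary"]]
)
df = pd.DataFrame(
    [["Ann", 100, "Cid", 300], ["Bob", 200, "Dee", 400]],
    columns=columns,
)

# A tuple selects one sub-column under one top-level column.
workers_c = df[("Company C", "Worker")]

# Selecting only the top level returns every sub-column of that company.
company_c = df["Company C"]

# xs picks the same sub-column across all companies.
all_workers = df.xs("Worker", axis=1, level=1)

print(workers_c.tolist())
```

The tuple form is usually the most direct answer to "the Worker column of company C"; xs is handy when the same field is needed for every company at once.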
You can follow these steps or adapt them to your own file.
I'm new to Python and just trying to redo my first project from MATLAB. I've written some code in VS Code to import an Excel file using pandas:
filename=r'C:\Users\user\Desktop\data.xlsx'
sheet=['data']
with pd.ExcelFile(filename) as xls:
    Dateee=pd.read_excel(xls, sheet,index_col=0)
Then I want to access data in a row and column.
I tried to print data using code below:
for key in dateee.keys():
    print(dateee.keys())
but this returns nothing.
Is there any way to access the data (as a list)?
You can iterate on each column, making the contents of each a list:
for c in df:
    print(df[c].to_list())
df is what the dataframe was assigned as. (OP had inconsistent syntax & so I didn't use that.)
Look into df.iterrows() or df.itertuples() if you want to iterate by row. Example:
for row in df.itertuples():
    print(row)
Look into df.iloc and df.loc for row and column selection of individual values, see Pandas iloc and loc – quickly select rows and columns in DataFrames.
Or df.iat or df.at for getting or setting single values, see here, here, and here.
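To make the difference between these accessors concrete, here is a small sketch. The frame, its column names, and its index labels are invented for illustration; the index stands in for the column read with index_col=0.

```python
import pandas as pd

# Made-up frame standing in for the sheet; the index plays the role of index_col=0.
df = pd.DataFrame(
    {"temp": [20.5, 21.0, 19.8], "rain": [0.0, 1.2, 0.4]},
    index=["mon", "tue", "wed"],
)

row_by_label = df.loc["tue"]          # one row as a Series, by index label
row_by_position = df.iloc[1]          # the same row, by integer position
cell_by_label = df.at["tue", "rain"]  # single value, by labels
cell_by_position = df.iat[1, 1]       # single value, by positions

column_as_list = df["temp"].to_list()  # one column as a plain Python list
rows_as_lists = df.values.tolist()     # the whole frame as a list of row lists
```

loc and at work with labels, iloc and iat with integer positions; at and iat are meant for single values only.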
I have an existing Excel file that looks like this:
and I have another Excel file with around 40,000 rows and around 300 columns. A shortened version looks like this:
I would like to append values from the second file to my existing one, but only for rows whose col4 value matches a col4 value in my existing file. So I would get something like this:
Hope you guys get the picture of what I am trying to do.
Yes, that is possible in pandas, and it is much faster than doing it in Excel:
df_result = pd.merge(FirstTable, SecondTable, how='left', on='col4')
This joins the two tables on the column "col4", so the column needs to have that name in both tables.
Also be aware that if the second table has multiple rows for a single value in the first table, the result will contain one row for each of those matches.
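The row-multiplication effect is easy to see on a toy example. This is a sketch with invented data; only the column name col4 comes from the question.

```python
import pandas as pd

# Hypothetical stand-ins for the two workbooks; only 'col4' comes from the post.
first_table = pd.DataFrame({"col4": ["a", "b", "c"], "existing": [1, 2, 3]})
second_table = pd.DataFrame({"col4": ["a", "a", "d"], "extra": [10, 20, 40]})

# Left merge: keep every row of first_table, attach matches from second_table.
df_result = pd.merge(first_table, second_table, how="left", on="col4")

# 'a' matches twice, so it appears twice; 'b' and 'c' have no match and get NaN.
print(df_result)

# Optional: fail early if the lookup table has duplicate keys.
# pd.merge(first_table, second_table, how="left", on="col4", validate="many_to_one")
```

If duplicates in the second table are unexpected, the validate argument shown in the last comment makes merge raise an error instead of silently adding rows.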
To read the Excel file you can use:
import pandas as pd
xl=pd.ExcelFile('MyFile.xlsx')
FirstTable = pd.read_excel(xl, 'sheet_name_FIRST_TABLE')
For a more detailed description, see the documentation.