Importing multiple slightly different excel files - python

I have to import tables from multiple Excel files into Python, specifically into pandas DataFrames.
The problem is that the Excel files do not all have the same structure:
in some of them the table starts at cell A1, while in others it starts at A2, or even B1 or B2.
The only thing that stays constant across all the files is the set of headers in the first row of the table.
So for instance the top-left header cell of the table always contains "setting", but "setting" is sometimes at position A1 and sometimes at position B2.
Currently I just adjust the header and skiprows parameters of pandas.read_excel manually for each file, but since there are quite a lot of files, doing this by hand every time is a waste of time.
CurrExcelFile = pd.read_excel(Files[i], header=numHeader, skiprows=skip)
Does a package exist (or could one easily be written) that takes a string identifying the first element of the table?
Then I could just pass "setting" and the script would automatically find the index of that cell and start fetching data from there.
UPDATE:
So far I have managed it by first importing the whole sheet, finding where the "setting" value is, dropping the unnecessary columns, renaming the columns, and finally dropping the unnecessary rows.
import numpy as np
import pandas as pd

Test = pd.read_excel('excelfile', sheet_name='sheetname')
# Find the row index and the column of the anchor cell
for column in Test.columns:
    tmp = Test[column] == "setting"
    if len(Test.loc[tmp].index) == 1:
        RowInd = Test.loc[tmp].index[0]
        ColPos = Test.columns.get_loc(column)
# Drop the columns to the left of the table
ColumnsToDrop = Test.columns[np.arange(0, ColPos)]
Test.drop(ColumnsToDrop, inplace=True, axis=1)
# Use the header row as the column names
Test.columns = Test.iloc[RowInd]
# Drop the rows above the table, including the header row
Test.drop(np.arange(0, RowInd + 1), inplace=True, axis=0)
This is rather a workaround and I wish there were an easier solution.
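For what it's worth, the whole workaround can be wrapped in a small helper: read the sheet with header=None, locate the anchor text, and slice everything below and to the right of it. This is only a sketch; find_table is a made-up name, the anchor word "setting" comes from the question, and raw here is an in-memory stand-in for pd.read_excel(path, header=None).

```python
import pandas as pd

def find_table(raw, anchor):
    # raw: a sheet read with header=None; anchor: the text of the
    # top-left header cell of the table
    mask = raw.eq(anchor)
    rows, cols = mask.to_numpy().nonzero()
    if len(rows) == 0:
        raise ValueError(f"{anchor!r} not found in sheet")
    r, c = rows[0], cols[0]
    # everything below the header row, from the anchor column rightwards
    table = raw.iloc[r + 1:, c:].reset_index(drop=True)
    table.columns = raw.iloc[r, c:]
    return table

# stand-in for pd.read_excel(path, header=None): the table starts at B2
raw = pd.DataFrame([[None, None, None],
                    [None, "setting", "value"],
                    [None, "alpha", 1],
                    [None, "beta", 2]])
df = find_table(raw, "setting")
```

The same helper works whether the table starts at A1, A2, B1 or B2, since the anchor search scans the whole sheet.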

Related

Start reading from the column where data exists in Excel using pandas [duplicate]

How could I read an Excel file from column AF onwards? I don't know the last column's letter and the file is too large to keep checking manually.
df = pd.read_excel(r"Documents\file.xlsx", usecols="AF:")
You can't write it directly in the read_excel function, so we have to look at other options.
For the moment you could write 'AF:XFD', because 'XFD' is the last column in Excel, but this emits a warning that it will be deprecated and start raising a ParseError, so it isn't recommended.
You can use other libraries to find the last column, but that isn't fast: the Excel file is read once to find the last column, and only then is the dataframe created.
If I had this problem I would do everything in pandas by adding .iloc at the end. 'AF' is the 32nd column in Excel, i.e. zero-based index 31, so:
df = pd.read_excel(r"Documents\file.xlsx").iloc[:, 31:]
It returns all columns from 'AF' to the end without a noticeable impact on performance.
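If you'd rather not count the column number by hand, openpyxl's utility converts the letter for you (this is just the conversion step; subtract one because iloc and integer usecols are zero-based):

```python
from openpyxl.utils import column_index_from_string

# 'AF' is the 32nd column in Excel (1-based), so the zero-based index is 31
start = column_index_from_string("AF") - 1
# df = pd.read_excel(r"Documents\file.xlsx").iloc[:, start:]
```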
You can use the xlwings library to determine the last column that contains data and then substitute it into your code line.
import xlwings as xw

app = xw.App(visible=False)
book = app.books.open(r"Documents\file.xlsx")
sheet = book.sheets[0]
colNum = sheet.range('AF1').current_region.last_cell.column
book.close()
app.quit()

df = pd.read_excel(r"Documents\file.xlsx", usecols=list(range(31, colNum)))
where 31 is the zero-based index of column AF (integer usecols are zero-based) and colNum is the 1-based number of the last column, so range(31, colNum) covers AF through the last column.

Changing data in CSV cells with Python

I'm looking to insert a few characters into the beginning of a cell on a CSV using Python. Python needs to do this with the same cell on each row.
As an example, see:
Inserting values into specific cells in csv with python
So:
row 1 - cell 3 'Qwerty' - add 3 characters (HAL) to beginning of the cell. So cell now reads 'HALQwerty'
row 2 - cell 3 'Qwerty' - add 3 characters (HAL) to beginning of the cell. So cell now reads 'HALQwerty'
row 3 - cell 3 'Qwerty' - add 3 characters (HAL) to beginning of the cell. So cell now reads 'HALQwerty'
Does anyone know how to do this?
I found this link:
https://www.protechtraining.com/blog/post/python-for-beginners-reading-manipulating-csv-files-737
But it doesn't go into enough detail.
The simplest way is probably to use pandas. First run pip install pandas, then:
import pandas as pd
# read the CSV file and store into dataframe
df = pd.read_csv("test.csv")
# change value of a single cell directly
# this is selecting index 4 (row index) and then the column name
df.at[4,'column-name'] = 'HALQwerty'
# change multiple values simultaneously
# here we have a range of rows (0:4) and a couple column values
df.loc[0:4 ,['Num','NAME']] = [100, 'HALQwerty']
# write out the CSV file
df.to_csv("output.csv")
pandas allows a lot of control over your CSV files, and it's well documented:
https://pandas.pydata.org/docs/index.html
Edit: To allow conditional appending of text:
df = pd.DataFrame({'col1':['a', 'QWERTY', "QWERTY", 'b'], 'col2':['c', 'tortilla', 'giraffe', 'monkey'] })
mask = (df['col1'] == 'QWERTY')
df.loc[mask, 'col1'] = 'HAL' + df['col1'].astype(str)
The mask is the subset of rows that match a condition (where cell value equals "QWERTY"). The ".loc" function identifies where in the dataframe that subset is, and helps to apply whatever change you want.
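For the original ask (prefix the same column on every row, unconditionally), a minimal sketch; the frame and column names below are made-up stand-ins for pd.read_csv("test.csv"):

```python
import pandas as pd

# toy frame standing in for the CSV; 'cell3' is the column to modify
df = pd.DataFrame({'cell1': [1, 2, 3],
                   'cell2': ['a', 'b', 'c'],
                   'cell3': ['Qwerty', 'Qwerty', 'Qwerty']})

# string concatenation is vectorized, so this prefixes every row at once
df['cell3'] = 'HAL' + df['cell3'].astype(str)
```

Afterwards df.to_csv("output.csv", index=False) writes the result back out.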

Problem with combining multiple excel files in python pandas

I am quite new to Python programming. I need to combine 1000+ files into one file. Each file has 3 sheets and I need data only from sheet 2 to build a final Excel file. I am facing a problem picking a value from a specific cell on sheet 2 of each Excel file to create a column: Python picks the value from the first file and builds the column from that.
df = pd.DataFrame()
for file in files:
    if file.endswith('.xlsm'):
        df = pd.read_excel(file, sheet_name=1, header=None)
        df['REPORT_NO'] = df.iloc[1][4]    # Report number
        df['SUPPLIER'] = df.iloc[2][4]     # Supplier
        df['REPORT_DATE'] = df.iloc[0][4]  # Report date
        df2 = df2.dropna(thresh=15)
        df2 = df.append(df, ignore_index=True)
        df = df.reset_index()
        del df['index']
df2.to_excel('FINAL_FILES.xlsx')
How can I solve this so Python takes the values from each Excel file and puts the information in the right rows?
I. df.iloc[2][4] refers to the 3rd row and 5th column (iloc is zero-based) of the sheet you loaded. You imported with sheet_name=1, which is the second sheet, and never switched to a different one, even though you mention each .xlsm has 3 sheets.
II. Your scoping could be wrong. Why define df outside of the loop? It changes per file, so there is no need for an external one. All information from the loop should be put into df2 before the next iteration.
III. Have you checked whether append is adding a row or a column?
Even though
df['REPORT_NO'] = df.iloc[1][4]    # Report number
df['SUPPLIER'] = df.iloc[2][4]     # Supplier
df['REPORT_DATE'] = df.iloc[0][4]  # Report date
are written as columns, they repeat the report number/supplier/report date for every row of that column.
And when you use df2 = df.append(df, ignore_index=True), check the output: it appends df to itself, which is probably not what you intend.
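The three points above suggest restructuring the loop: build one frame per file and concatenate once at the end (pd.concat replaces the deprecated append). A sketch with in-memory stand-ins for the sheets; raw1/raw2 imitate what pd.read_excel(file, sheet_name=1, header=None) would return, and the cell positions come from the question:

```python
import pandas as pd

def extract_report(sheet_df):
    # copy the three header cells into whole columns
    out = sheet_df.copy()
    out['REPORT_NO'] = sheet_df.iloc[1, 4]
    out['SUPPLIER'] = sheet_df.iloc[2, 4]
    out['REPORT_DATE'] = sheet_df.iloc[0, 4]
    return out

# stand-ins for two workbooks' sheet 2 (values are made up)
raw1 = pd.DataFrame([[None] * 4 + ['2021-01-05'],
                     [None] * 4 + ['R-001'],
                     [None] * 4 + ['Acme']])
raw2 = pd.DataFrame([[None] * 4 + ['2021-01-06'],
                     [None] * 4 + ['R-002'],
                     [None] * 4 + ['Globex']])

# one frame per file, concatenated once at the end -- no append inside the loop
frames = [extract_report(raw) for raw in (raw1, raw2)]
combined = pd.concat(frames, ignore_index=True)
# combined.to_excel('FINAL_FILES.xlsx', index=False)
```

With real files, frames would be built from pd.read_excel(file, sheet_name=1, header=None) for each .xlsm in the folder.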

how to delete columns by column name from an excel using python openpyxl

I have an Excel worksheet from which I want to delete certain columns based on their column names using openpyxl, as the column positions aren't fixed.
Their indices can change every time I get a new report, but the names of the columns to delete stay the same.
In the example below I want to delete every column whose name is in ["To be determined", "No Value", "Total"].
I have tried fetching the column index numbers so I can delete by index, but it isn't working as expected.
(max_file is the Excel file path and sh2 is Sheet2, which contains the data.)
Have you tried using
sh2.delete_cols(2, 4)
where 2 is the starting column and 4 is the number of columns to delete?
Function to delete multiple columns by cell coordinate:
import openpyxl

# Works with full coordinates like 'B1'; if you only have letters like 'B',
# skip the coordinate_from_string step.
cells_to_delete = ['B1', 'C2', 'E7', 'A5']

def delete_columns_by_name(cell_names):
    # Collect all column indices first and delete from right to left,
    # so earlier deletions don't shift the indices of later ones.
    indices = []
    for name in cell_names:
        # 'B1' is split into ('B', 1); keep the letter part
        letter = openpyxl.utils.cell.coordinate_from_string(name)[0]
        # 'B' gives column index 2
        indices.append(openpyxl.utils.cell.column_index_from_string(letter))
    for index_value in sorted(set(indices), reverse=True):
        # ws is the worksheet, wb the workbook
        # (see https://openpyxl.readthedocs.io/en/stable/tutorial.html)
        ws.delete_cols(index_value, 1)

delete_columns_by_name(cells_to_delete)
# Don't forget to save the file at the end
wb.save(name_of_file_here)
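Since the question actually asks for deletion by header name rather than by cell coordinate, here is a sketch; the header row is assumed to be row 1, the helper name is made up, and the target names come from the question:

```python
from openpyxl import Workbook

targets = {"To be determined", "No Value", "Total"}

def delete_cols_by_header(ws, names, header_row=1):
    # scan right to left so a deletion never shifts a column we have yet to visit
    for col in range(ws.max_column, 0, -1):
        if ws.cell(row=header_row, column=col).value in names:
            ws.delete_cols(col, 1)

# small in-memory demo
wb = Workbook()
ws = wb.active
ws.append(["Keep", "Total", "Also keep", "No Value"])
ws.append([1, 2, 3, 4])
delete_cols_by_header(ws, targets)
```

For a real report, load the workbook with load_workbook, pass the data sheet to the helper, then save.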

Python OpenPyxl trouble detecting all merged cells

I am trying to detect all the merged cells in an openpyxl.worksheet.worksheet.Worksheet object, but merged_cells.ranges does not find all of them, only the merged cells in some columns. My goal is to detect the merged cells, unmerge them, and then re-merge certain cells based on column values. During unmerging, I fill the unmerged cells with the top-left cell value of the merged range.
I worked around the problem by filling NaN cells that should have been part of a merged cell with the previous value in the column, since all my merged ranges stay within one column (for example A18:A19, B18:B19). But things became trickier after I updated my xlsx file: openpyxl previously failed to find merged cells in columns A, C and E; now it fails on columns B, D and F. The two files have the same format but different data.
Here is an example of what my xlsx looks like:
xlsx sample
My code to read the xlsx then detect & unmerge the merged cells:
client_info_wb = load_workbook(path_client_info)
sheet_name = client_info_wb.sheetnames[0]
client_info_ws = client_info_wb[sheet_name]

for cell_group in client_info_ws.merged_cells.ranges:
    print(cell_group)
    min_col, min_row, max_col, max_row = range_boundaries(str(cell_group))
    top_left_cell_value = client_info_ws.cell(row=min_row, column=min_col).value
    print(top_left_cell_value)
    client_info_ws.unmerge_cells(str(cell_group))
    for row in client_info_ws.iter_rows(min_col=min_col, min_row=min_row, max_col=max_col, max_row=max_row):
        for cell in row:
            cell.value = top_left_cell_value
Output (each merged range printed, followed by its top-left value):
A48:A49
2021-01-05
C48:C49
XX5614
E48:E49
ID
A46:A47
2021-01-05
C46:C47
XX2134
E46:E47
ID
A44:A45
2021-01-05
C44:C45
XX1234
E44:E45
ID
The information in the columns where merged_cells.ranges fails to detect merged cells is needed by later operations in my code. Can anyone help? Has anyone had the same issue? I have spent a long time looking for a pattern in my xlsx that could be causing this, with no luck.
while sheet.merged_cells:  # <- here's the change to make
    for cell_group in sheet.merged_cells:
        val = str(cell_group.start_cell.value).strip()
        sheet.unmerge_cells(str(cell_group))
        for merged_cell in cell_group.cells:
            sheet.cell(row=merged_cell[0], column=merged_cell[1]).value = val
It seems the set of merged cells changes as it is iterated over, so repeating the loop until merged_cells is empty does the trick.
There's also something weird going on with the in-memory buffer, so I save the file to disk and reload it with pandas, rather than loading the dataframe from the sheet in memory. (This could easily be optimized with a BytesIO object.)
For me, this guarantees that all merged cells are unmerged and replaced with the start cell's value.
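A self-contained demo of the while-loop approach on an in-memory workbook (the ranges and values are made up; a list copy of the ranges is iterated because unmerging mutates the set):

```python
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws["A1"] = "date"
ws.merge_cells("A1:A2")
ws["C1"] = "id"
ws.merge_cells("C1:C2")

# repeat until no merged ranges remain
while ws.merged_cells.ranges:
    for rng in list(ws.merged_cells.ranges):
        value = rng.start_cell.value        # top-left cell of the range
        ws.unmerge_cells(str(rng))
        for row, col in rng.cells:          # fill every formerly merged cell
            ws.cell(row=row, column=col).value = value
```

After this, every cell that was part of a merged range holds the range's top-left value, and merged_cells is empty.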
