Read Excel with multiple headers and unnamed column

I receive some Excel files like this:
        USA           UK
        plane  cars   plane  cars
2016    2      7      1      3      # a comment after the last country
2017    3      1      8      4
There is an unknown number of countries, and there can be a comment after the last column.
When I read the Excel file like this...
df = pd.read_excel(
    sourceFilePath,
    sheet_name = 'Sheet1',
    index_col = [0],
    header = [0, 1]
)
... I get a ValueError:
ValueError: Length of new names must be 1, got 2
The problem is that I cannot use the usecols parameter, because I don't know how many countries there are before reading the file.
How can I read such a file?

It's possible Pandas won't be able to fix your special use case, but you can write a program that fixes the spreadsheet using openpyxl. It has really clear documentation, but here's an overview of how to use it:
import openpyxl as xl
wb = xl.load_workbook("ExampleSheet.xlsx")
for sheet in wb.worksheets:
    print("Sheet Title => {}".format(sheet.title))
    print("Dimensions => {}".format(sheet.dimensions)) # just returns a string
    print("Columns: {} <-> {}".format(sheet.min_column, sheet.max_column))
    print("Rows: {} <-> {}".format(sheet.min_row, sheet.max_row))
    for r in range(sheet.min_row, sheet.max_row + 1):
        for c in range(sheet.min_column, sheet.max_column + 1):
            if sheet.cell(r, c).value is not None: # skip empty cells
                print("Cell {}:{} has value {}".format(r, c, sheet.cell(r, c).value))

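What about just using pd.read_csv? (This assumes the sheet can be exported to CSV first, since pd.read_csv does not read .xlsx files.)
Once loaded, you can then determine how many columns you have with df.columns.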
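Another option that stays within pandas is to read the sheet without any header and rebuild the two header levels afterwards. The following is only a sketch under assumptions: the sheet is read with pd.read_excel(sourceFilePath, header=None), the layout matches the example in the question, and the function name build_multiheader_frame is made up for illustration. The in-memory frame below stands in for the result of that read.

```python
import pandas as pd


def build_multiheader_frame(raw: pd.DataFrame) -> pd.DataFrame:
    """Rebuild a two-level header from a sheet that was read with header=None."""
    top = raw.iloc[0, 1:].ffill()   # repeat each country over the columns it spans
    sub = raw.iloc[1, 1:]           # second header row (plane, cars, ...)
    keep = sub.notna().to_numpy()   # the comment column has no second-level label
    # Keep only the real data columns and convert their cells to numbers.
    body = raw.iloc[2:, 1:].loc[:, keep].apply(pd.to_numeric)
    # Assumption: the first column holds the years and has no gaps.
    body.index = raw.iloc[2:, 0].astype(int).tolist()
    body.columns = pd.MultiIndex.from_arrays([top[keep].tolist(), sub[keep].tolist()])
    return body


# Stand-in for pd.read_excel(sourceFilePath, sheet_name="Sheet1", header=None)
raw = pd.DataFrame(
    [
        [None, "USA", None, "UK", None, None],
        [None, "plane", "cars", "plane", "cars", None],
        [2016, 2, 7, 1, 3, "# a comment after the last country"],
        [2017, 3, 1, 8, 4, None],
    ],
    dtype=object,
)
df = build_multiheader_frame(raw)
print(df)
```

Because the comment column is detected by its missing second-level header, the number of countries does not need to be known in advance.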

Related

Pandas: Can't Insert Pivot Table information into a different Sheet

I have code that converts a txt file to xlsx, adds a column with formulas, and should then create a pivot table from that data in a different sheet. The code runs without errors, but it creates an empty sheet instead of a sheet with the data.
So the code looks like this:
import numpy as np
import openpyxl
import pandas as pd
#Transforming our txt to xlsx
path = r"C:\Users\roslber\Desktop\Codes\Python\Projects\Automated routes.xlsx"
rssdata= pd.read_csv("dwp.txt", sep="\t")
rssdata.to_excel(path, index= None , header= True)
#Writing the formula column
wb = openpyxl.load_workbook(filename=path)
ws1 = wb["Sheet1"]
ws1["AC1"] = "CF Weight"
row_count= ws1.max_row
actual_row= 2
while actual_row <= row_count: #writing the formula in every row
    r= str(actual_row)
    ws1["AC"+r] = "=(O"+r + "*P"+r +"*Q"+r +")/28316.8"
    actual_row= actual_row + 1
#Creating a new sheet with the pivot tables
df = pd.read_excel(path, 0, header= 0) #defining pivot table dataframe
wb.create_sheet("Sheet2")
pv_pack = pd.pivot_table(df, values=["actual_service_time"],\
    index=["delivery_station_code"], columns=["cluster_prefix"], aggfunc=np.sum) #constructing the pivot table
print(pv_pack)
with pd.ExcelWriter(path, mode="a") as writer:
    pv_pack.to_excel(writer, sheet_name="Sheet2")
    writer.save() #inserting pivot table in sheet2
wb.save(path)
For data protection reasons I can't show you the information inside the pivot table, but when I print it I can see exactly what I want. The problem is that, although Sheet2 is created correctly, the information that I see printed doesn't appear in Sheet2. Why is this happening?
I have checked these questions:
Trouble writing pivot table to excel file
How to save a new sheet in an existing excel file, using Pandas?
Regarding the first one, apparently openpyxl can't create a pivot table, but I don't actually need the pivot table format; I just need the pv_pack information in Sheet2 as it's shown when I print it.
I tried to change my code to imitate what they did in the second question but it didn't work.
Thank you in advance
Edit answering to RJ Adriaansen:
The information in Sheet1 would look like this:
id  order   mtd    delivery_station_code  cluster_prefix  actual_service_time
xh  aabb1   one    1                      One_            231
xr  aabb2   two    2                      Two_            135
xd  aabb3   three  3                      One_            80
xh  aabb8   two    1                      Two_            205
xp  aabb9   three  2                      One_            1
xl  aabb10  one    3                      Two_            115
And the output printed in my editor looks like this:
delivery_station_code  One_  Two_
1                      231   205
2                      1     135
3                      80    115

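The with statement automatically saves and closes the file, so there is no need to save it manually. It is also not necessary to create the second sheet before writing to it. Removing writer.save() and wb.create_sheet("Sheet2"), and moving wb.save(path) up so that the workbook with the formula column is saved before pandas reopens the file, will make the code work.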
#Writing the formula column
wb = openpyxl.load_workbook(filename=path)
ws1 = wb["Sheet1"]
ws1["AC1"] = "CF Weight"
row_count= ws1.max_row
actual_row= 2
while actual_row <= row_count: #writing the formula in every row
    r= str(actual_row)
    ws1["AC"+r] = "=(O"+r + "*P"+r +"*Q"+r +")/28316.8"
    actual_row= actual_row + 1
wb.save(path)
#Creating a new sheet with the pivot tables
df = pd.read_excel(path, 0, header= 0) #defining pivot table dataframe
pv_pack = pd.pivot_table(df, values=["actual_service_time"],\
    index=["delivery_station_code"], columns=["cluster_prefix"], aggfunc=np.sum) #constructing the pivot table
with pd.ExcelWriter(path, mode="a") as writer:
    pv_pack.to_excel(writer, sheet_name="Sheet2")
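To check the pivot step on its own, here is a small sketch that rebuilds the sample rows from the question in memory and pivots them. It passes aggfunc="sum" as a string, which recent pandas versions prefer over np.sum. The helper append_sheet is a hypothetical name; it assumes the openpyxl engine is installed and a pandas version that supports the if_sheet_exists argument, and it is only defined here, not called.

```python
import pandas as pd

# Sample rows copied from the question's Sheet1 (only the columns the pivot needs).
sample = pd.DataFrame(
    {
        "delivery_station_code": [1, 2, 3, 1, 2, 3],
        "cluster_prefix": ["One_", "Two_", "One_", "Two_", "One_", "Two_"],
        "actual_service_time": [231, 135, 80, 205, 1, 115],
    }
)

# One row per station, one column per cluster prefix, summed service time.
pv_pack = pd.pivot_table(
    sample,
    values="actual_service_time",
    index="delivery_station_code",
    columns="cluster_prefix",
    aggfunc="sum",
)
print(pv_pack)


def append_sheet(path, frame, sheet_name="Sheet2"):
    """Append `frame` to an existing workbook, replacing the sheet if it is already there."""
    with pd.ExcelWriter(path, mode="a", engine="openpyxl", if_sheet_exists="replace") as writer:
        frame.to_excel(writer, sheet_name=sheet_name)
```

Using if_sheet_exists="replace" is a design choice for reruns: without it, appending to a workbook that already contains Sheet2 raises an error.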

How to extract a particular set of values from excel file using a numerical range in python?

What I intend to do :
I have an Excel file with voltage and current data, which I would like to extract from a specific sheet, say 'IV_RAW'. The values start at the 4th row and are in columns D and E.
Let's say the values look like this:
V(voltage)  I(Current)
47          1
46          2
45          3
0           4
-0.1        5
-10         5
Now, I want to take out only the values starting at a voltage (V) of 45, and negative voltages should be excluded. The corresponding current (I) values are needed as well. This has to be done for multiple Excel files, so starting from a fixed row number will not work; the voltage values have to be the criterion.
What I know:
I only know how to take out the entire set of values using openpyxl:
from openpyxl import load_workbook
loc = ("path")
wb = load_workbook("Data") #thefilename
ws = wb["IV_raw"] #theactiveworksheet
#to extract the voltage and current data:
for row in ws.iter_rows(min_row=1, max_col=3, max_row=2, values_only=True):
    print(row)
I am a beginner and new to Python, so any help would be much appreciated. A simplified version using pandas would be great.
Thank you in advance

The following uses pandas, which you should definitely take a look at. sheet_name selects the sheet, header is the row index of the header (starting at 0, so row 4 -> 3), and usecols defines the columns using A1 notation.
The last line filters the dataframe. If I understand correctly, you want the voltages from 45 down to 0 and no negative values; that is what the example does, and df is the resulting dataframe.
import pandas as pd
file_loc = "path.xlsx"
df = pd.read_excel(file_loc,
                   sheet_name = 'IV_raw',
                   header = 3,
                   usecols = "D:E")
df = df[(df['V(voltage)'] >= 0) & (df['V(voltage)'] <= 45)]
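As a quick check of the filtering step, the sketch below applies a voltage window to the sample values from the question, held in memory instead of read from a file. It assumes that "starting at 45, no negative voltages" means 0 <= V <= 45 with both bounds included; select_iv is a hypothetical helper name.

```python
import pandas as pd


def select_iv(df, v_col="V(voltage)", v_min=0.0, v_max=45.0):
    """Return the rows whose voltage lies in [v_min, v_max], both bounds included."""
    return df[df[v_col].between(v_min, v_max)]


# Sample values from the question; a real run would get this frame from pd.read_excel.
iv = pd.DataFrame(
    {
        "V(voltage)": [47, 46, 45, 0, -0.1, -10],
        "I(Current)": [1, 2, 3, 4, 5, 5],
    }
)
selected = select_iv(iv)
print(selected)
```

Since the selection depends only on the values, the same function can be called in a loop over several workbooks without knowing the row where 45 V appears in each one.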

Building on your example, you can use the following to get what you need:
from openpyxl import load_workbook
wb = load_workbook(filepath, data_only=True) #load the file using its full path
ws = wb["IV_raw"] #the worksheet holding the data
#to extract the voltage and current data (columns D and E):
data = ws.iter_rows(min_col=4, max_col=5, min_row=2, max_row=ws.max_row, values_only=True)
#keep numeric rows with a voltage between 0 and 45; header and empty cells are skipped
output = [row for row in data if isinstance(row[0], (int, float)) and 0 <= row[0] <= 45]

You can try this (it reads the single pair of cells D4 and E4):
import openpyxl
tWorkbook = openpyxl.load_workbook("YOUR_FILEPATH")
tDataBase = tWorkbook.active
voltageVal= "D4"
currentVal= "E4"
V = tDataBase[voltageVal].value
I = tDataBase[currentVal].value

How to ignore a custom header in Pandas

I'm trying to find a creative way to get the dataframe of several sheets within a spreadsheet that's quite irregular but I can't find the way to do it.
If I try this:
file= 'filename.xlsx'
df = xlrd.open_workbook(file)
print(df)
This is my current output:
A | B | C
1 Random text | Empty cell|Empty cell
------------------------------------
2 Empty cell | |
------------------------------------
3 Empty cell | |
------------------------------------
4 CODE |HEADER 2 | HEADER 3
------------------------------------
5 INFORMATION |INFORMATION|INFORMATION
I want to start my dataframe in the CODE row and column, but pandas just gets the "Random text" cell as the first cell
This is my desired output:
4 CODE |HEADER 2 | HEADER 3
------------------------------------
5 INFORMATION |INFORMATION|INFORMATION
How would you make Pandas ignore the first rows? It has to be value-based because in the next sheets CODE starts in row 8, and in the next one in row 3

Not sure about XLRD, but Pandas has an easy way in the excel reading method that allows you to specify which row is your headers. That would be an easy fix unless you're intent on using XLRD.

You can try:
import pandas as pd
file= 'filename.xlsx'
df = pd.read_excel(file, sheet_name='sheetname', skiprows=[0,1,2])
Alternatively, you can use the header argument as mentioned earlier.

In my previous answer I pointed to a static solution; in this one I have added a helper function for dynamic parsing. The get_header_index helper function dynamically gets the index of the row containing the header keyword in the first column (the match is case-insensitive, so 'code' also finds 'CODE'). You may change the col_index argument if you believe the header keyword is in another column, though. Likewise, you can change the keyword argument as you like. The output dfs is a dictionary of dataframes whose keys are the sheet names of the given workbook.
import pandas as pd
def get_header_index(sheet, col_index=0, keyword='code'):
    arr = sheet[sheet.columns[int(col_index)]]
    # first row whose cell contains the keyword, ignoring case and empty cells
    header_index = arr[arr.str.contains(str(keyword), case=False, na=False)].index[0]
    return header_index
file = 'filename.xlsx'
sheets_dict = pd.read_excel(file, sheet_name=None)
dfs = {}
for name, sheet in sheets_dict.items():
    # +1 because the first read already used the file's first row as column names
    header = get_header_index(sheet, col_index=0, keyword='code') + 1
    df = pd.read_excel(file, sheet_name=name, header=header)
    dfs[name] = df
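A variation on the same idea that reads each sheet only once: load the sheet without a header, find the row containing the keyword, and promote that row to column names. This is a sketch, assuming the sheet was read with header=None (for example pd.read_excel(file, sheet_name=None, header=None) for all sheets); promote_header_row is a hypothetical name, and the in-memory frame mimics the layout from the question.

```python
import pandas as pd


def promote_header_row(raw: pd.DataFrame, keyword: str = "CODE", col_index: int = 0) -> pd.DataFrame:
    """Use the first row whose cell in `col_index` contains `keyword` as the header."""
    # Case-insensitive substring test on every cell of the chosen column.
    matches = [keyword.lower() in str(value).lower() for value in raw.iloc[:, col_index]]
    if True not in matches:
        raise ValueError(f"no row contains {keyword!r} in column {col_index}")
    pos = matches.index(True)                          # position of the header row
    table = raw.iloc[pos + 1:].reset_index(drop=True)  # everything below the header
    table.columns = raw.iloc[pos].tolist()
    return table


# Stand-in for one sheet read with header=None
raw = pd.DataFrame(
    [
        ["Random text", None, None],
        [None, None, None],
        [None, None, None],
        ["CODE", "HEADER 2", "HEADER 3"],
        ["INFORMATION", "INFORMATION", "INFORMATION"],
    ]
)
table = promote_header_row(raw)
print(table)
```

Because the header row is located by value, it does not matter whether CODE sits in row 4, row 8, or row 3 of a given sheet.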

This is a form of what I did in mine, adjusted for your use (based on my previous comment):
import xlrd
def find_code_cell(file_names):
    for file in file_names: # Iterate through all of the individual report files
        book = xlrd.open_workbook(file)
        sheetname = get_sheetname(book) # get_sheetname is my own helper (not shown here)
        if sheetname is not None: # Check that sheet name is valid
            sheet = book.sheet_by_name(sheetname)
            nrows = sheet.nrows
            ncols = sheet.ncols
            for i in range(nrows):
                for j in range(ncols):
                    check = sheet.cell_value(i, j)
                    if "CODE" in str(check): # cells may hold numbers, so convert first
                        return (i, j)

How to use Python to read one column from Excel file?

I want to read the data in one column in excel, here is my code:
import xlrd
file_location = "location/file_name.xlsx"
workbook = xlrd.open_workbook(file_location)
sheet = workbook.sheet_by_name('sheet')
x = []
for cell in sheet.col[9]:
    if isinstance(cell, float):
        x.append(cell)
print(x)
It is wrong because there is no method in sheet called col[col.num], but I just want to extract the data from column 8 (column H), what can I do?

If you're not tied to xlrd, I would probably use pandas instead, which is pretty good when working with data from anywhere:
import pandas as pd
df = pd.ExcelFile('location/test.xlsx').parse('Sheet1') #you could add index_col=0 if there's an index
x = df['name_of_col'].tolist() #the column's values as a plain list
You could then just write the new extracted columns to a new excel file with pandas df.to_excel()

You can get the values of the 8th column like this:
for rownum in range(sheet.nrows):
    x.append(sheet.cell_value(rownum, 7)) #column index 7 is column H

By far the easiest way to get all the values in a column using xlrd is the col_values() worksheet method:
x = []
for value in sheet.col_values(8):
    if isinstance(value, float):
        x.append(value)
(Note that if you want column H, you should use 7, because the indices start at 0.)
Incidentally, you can use col() to get the cell objects in a column:
x = []
for cell in sheet.col(8):
    if isinstance(cell.value, float):
        x.append(cell.value)
The best place to find this stuff is the official tutorial (which serves as a decent reference for xlrd, xlwt, and xlutils). You could of course also check out the documentation and the source code.
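Since the off-by-one between column letters and zero-based indices is easy to get wrong, a small conversion helper can make the intent explicit. This is a plain-Python sketch; column_letter_to_index is a hypothetical name and not part of xlrd.

```python
def column_letter_to_index(letters: str) -> int:
    """Convert a spreadsheet column label ('A', 'H', 'AC', ...) to a zero-based index."""
    letters = letters.strip().upper()
    if not letters:
        raise ValueError("empty column label")
    index = 0
    for ch in letters:
        if not "A" <= ch <= "Z":
            raise ValueError(f"invalid column label: {letters!r}")
        # Column labels are a base-26 number with digits A=1 ... Z=26.
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1  # shift from one-based to zero-based


print(column_letter_to_index("H"))
```

With this, the call from the answer above could be written as sheet.col_values(column_letter_to_index("H")).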

I would recommend doing it like this:
import openpyxl
fname = 'file.xlsx'
wb = openpyxl.load_workbook(fname)
sheet = wb['sheet-name'] # get_sheet_by_name() is deprecated in current openpyxl
for rowOfCellObjects in sheet['C5':'C7']:
    for cellObj in rowOfCellObjects:
        print(cellObj.coordinate, cellObj.value)
Result:
C5 70.82
C6 84.82
C7 96.82
Note: fname refers to the Excel file, wb['sheet-name'] selects the desired sheet, and sheet['C5':'C7'] gives the cell range to read from the column.
Check out the link for more detail. Code segment taken from here too.

XLRD is good, but for this case you might find pandas more convenient, because it lets you select columns with the '[ ]' operator.
Complete working code for your case would be:
import pandas as pd
file_location = "file_name.xlsx"
sheet = pd.read_excel(file_location)
print(sheet['Sl'])
Output 1 - For column 'Sl'
0    1
1    2
2    3
Name: Sl, dtype: int64
Output 2 - For column 'Name'
print(sheet['Name'])
0      John
1      Mark
2    Albert
Name: Name, dtype: object
Reference: file_name.xlsx data
Sl  Name
1   John
2   Mark
3   Albert

data validation range Django and xlsxwriter

I have been using Django and xlsxwriter on a project that I am working on. I want to use data_validation in Sheet1 to pull in the lists that I have printed in Sheet2. I get the lists to print, but am not seeing the data_validation in Sheet1 when I open the file. Any insight on what I am doing incorrectly is much appreciated!
wb = xlsxwriter.Workbook(TestCass)
sh_1 = wb.add_worksheet()
sh_2 = wb.add_worksheet()
col = 15
head_col = 0
for header in headers:
    sh_1.write(0,head_col,header)
    sh_2.write(0,head_col,header)
    list_row = 1
    list = listFunction(headerToModelDic[header])
    for entry in list:
        sh_2.write(list_row,col,entry)
        list_row += 1
    sh_1.data_validation(1,col,50,col,{'validate':'list','source':'=Sheet2!$A2:$A9'})
    col += 1
wb.close()
Note: The reason I am not pulling the list directly from the site is because it is too long (longer than 256 characters). Secondly, I ultimately would like the source range in the data validation to take in variables from sheet2, however I cannot get sheet 1 to have any sort of data validation as is so I figured I would start with the absolute values.

It looks like the data ranges are wrong in the example. You are writing the list data to column P of Sheet2 (col = 15), but the data validation refers to the range $A2:$A9 in column A.
Maybe in your full example there is data in that range, but in the example above there isn't.
I've modified your example slightly to a non-Django example with some sample data. I've also changed the data validation range to match the written data range:
import xlsxwriter
wb = xlsxwriter.Workbook('test.xlsx')
sh_1 = wb.add_worksheet()
sh_2 = wb.add_worksheet()
col = 15
head_col = 0
headers = ['Header 1']
for header in headers:
    sh_1.write(0,head_col,header)
    sh_2.write(0,head_col,header)
    list_row = 1
    list = [1, 2, 3, 4, 5]
    for entry in list:
        sh_2.write(list_row,col,entry)
        list_row += 1
    sh_1.data_validation(1,col,50,col,
                         {'validate':'list','source':'=Sheet2!$P2:$P6'})
    col += 1
wb.close()
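The question also mentions wanting the source range to follow variables rather than being hard-coded. One way to do that is to build the range string from the same zero-based row and column numbers used when writing the list. The sketch below is plain Python with hypothetical helper names (col_name, list_source); XlsxWriter's own xlsxwriter.utility module has helpers for this kind of conversion (for example xl_range_abs), which may be preferable in real code.

```python
def col_name(col: int) -> str:
    """Convert a zero-based column number to its letter label (0 -> 'A', 15 -> 'P')."""
    name = ""
    col += 1  # work in one-based numbering
    while col:
        col, rem = divmod(col - 1, 26)
        name = chr(ord("A") + rem) + name
    return name


def list_source(sheet: str, col: int, first_row: int, last_row: int) -> str:
    """Build an absolute single-column range such as '=Sheet2!$P$2:$P$6'.

    All row and column arguments are zero-based, matching worksheet.write().
    Assumption: the sheet name needs no quoting (no spaces or special characters).
    """
    letter = col_name(col)
    return f"={sheet}!${letter}${first_row + 1}:${letter}${last_row + 1}"


# Five entries written to column 15 starting at row 1, as in the example above.
source = list_source("Sheet2", col=15, first_row=1, last_row=5)
print(source)
```

The resulting string can be passed as the 'source' value of the data_validation options, with last_row taken from list_row - 1 after the inner loop has finished.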
