How to use Python to read one column from an Excel file? - python

I want to read the data in one column of an Excel file. Here is my code:
import xlrd
file_location = "location/file_name.xlsx"
workbook = xlrd.open_workbook(file_location)
sheet = workbook.sheet_by_name('sheet')
x = []
for cell in sheet.col[9]:
    if isinstance(cell, float):
        x.append(cell)
print(x)
This is wrong because sheet has no method that can be used as col[col_num], but I just want to extract the data from column 8 (column H). What can I do?

If you're not locked into xlrd, I would probably use pandas instead, which is pretty good when working with data from anywhere:
import pandas as pd
df = pd.read_excel('location/test.xlsx', sheet_name='Sheet1')  # you could add index_col=0 if there's an index
x = df['name_of_col'].tolist()  # the column's values as a plain Python list
You could then write the extracted columns to a new Excel file with pandas' df.to_excel().
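If only one column is needed, read_excel can also be asked to load just that column. The sketch below is an illustration rather than part of the answer above: it builds a tiny workbook in memory (the Sl/Name data is made up for the demo, and openpyxl is assumed to be installed as the .xlsx engine), then reads column B by its Excel letter through usecols.

```python
import io

import pandas as pd

# Build a small .xlsx file in memory so the example does not depend on a real file.
buffer = io.BytesIO()
demo = pd.DataFrame({"Sl": [1, 2, 3], "Name": ["John", "Mark", "Albert"]})
demo.to_excel(buffer, index=False)
buffer.seek(0)

# usecols accepts Excel column letters; "B" loads only the second column.
one_column = pd.read_excel(buffer, usecols="B")
values = one_column.iloc[:, 0].tolist()
print(values)
```

For column H of a real file, the same call would use usecols="H" with the file path in place of the buffer.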

You can get the values of the 8th column like this:
for rownum in range(sheet.nrows):
    x.append(sheet.cell_value(rownum, 7))  # index 7 is column H; cell() would return a Cell object instead

By far the easiest way to get all the values in a column using xlrd is the col_values() worksheet method:
x = []
for value in sheet.col_values(8):
    if isinstance(value, float):
        x.append(value)
(Note that if you want column H, you should use 7, because the indices start at 0.)
Incidentally, you can use col() to get the cell objects in a column:
x = []
for cell in sheet.col(8):
    if isinstance(cell.value, float):
        x.append(cell.value)
The best place to find this stuff is the official tutorial (which serves as a decent reference for xlrd, xlwt, and xlutils). You could of course also check out the documentation and the source code.
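The off-by-one between a column letter and xlrd's zero-based index is easy to get wrong. As a small sketch, the helper below (column_index is a hypothetical name, not part of xlrd) converts a column label into the index that col_values() and col() expect:

```python
def column_index(letters: str) -> int:
    """Convert an Excel column label such as "H" or "AA" to a zero-based index."""
    label = letters.strip().upper()
    if not label:
        raise ValueError("empty column label")
    index = 0
    for ch in label:
        if not "A" <= ch <= "Z":
            raise ValueError(f"invalid column label: {letters!r}")
        # Column labels behave like a base-26 number with digits A=1 ... Z=26.
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


print(column_index("H"))  # 7
```

With this, sheet.col_values(column_index("H")) would read column H.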

I would recommend doing it like this:
import openpyxl
fname = 'file.xlsx'
wb = openpyxl.load_workbook(fname)
sheet = wb['sheet-name']  # get_sheet_by_name() is deprecated in current openpyxl
for rowOfCellObjects in sheet['C5':'C7']:
    for cellObj in rowOfCellObjects:
        print(cellObj.coordinate, cellObj.value)
Result:
C5 70.82
C6 84.82
C7 96.82
Note: fname is the Excel file, wb['sheet-name'] selects the desired sheet, and sheet['C5':'C7'] gives the range of cells to read from the column.
Check out the link for more detail. Code segment taken from here too.
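When the whole column is wanted rather than a fixed range, openpyxl can also be indexed by the column letter alone. A minimal sketch, using an in-memory workbook with made-up rows instead of a real file:

```python
from openpyxl import Workbook

# Build a throwaway workbook in memory; the rows are made-up demo data.
wb = Workbook()
ws = wb.active
for row in [("a", 1, 70.82), ("b", 2, 84.82), ("c", 3, 96.82)]:
    ws.append(row)

# ws["C"] returns the cells of column C for every row that holds data.
column_c = [cell.value for cell in ws["C"]]

# iter_rows with values_only=True yields plain values instead of cell objects.
same_values = [row[0] for row in ws.iter_rows(min_col=3, max_col=3, values_only=True)]
print(column_c)
```

Both forms give the same list here; iter_rows is the one to reach for when a starting row (min_row) is needed as well.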

xlrd is good, but for this case you might find pandas a better fit, because it lets you select columns with the [] operator.
Complete working code for your context would be:
import pandas as pd
file_location = "file_name.xlsx"
sheet = pd.read_excel(file_location)
print(sheet['Sl'])
Output 1 - For column 'Sl'
0 1
1 2
2 3
Name: Sl, dtype: int64
Output 2 - For column 'Name'
print(sheet['Name'])
0 John
1 Mark
2 Albert
Name: Name, dtype: object
Reference: file_name.xlsx data
Sl Name
1 John
2 Mark
3 Albert

Related

How to read a specific cell with the pandas library?

I want to read a specific cell, H6, from an Excel sheet. So I tried it like this:
import pandas as pd
excel_file = './docs/fruit.xlsx'
df = pd.read_excel(excel_file,'Overzicht')
sheet = df.active
x1 = sheet['H6'].value
print(x1)
But then I get this error:
File "C:\Python310\lib\site-packages\pandas\core\generic.py", line 5575, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'active'
So my question is: how do I read a specific cell from a sheet in an Excel file?
Thank you
OK, I tried with openpyxl:
import openpyxl
path = "./docs/fruit.xlsx"
wb_obj = openpyxl.load_workbook(path)
sheet_obj = wb_obj.active
cell_obj = sheet_obj.cell(row = 6, column = 9)
print(cell_obj.value)
But then the formula is printed. Like this:
=(H6*1000)/F6/G6
and not the value: 93
You can do this using openpyxl directly or pandas (which uses openpyxl behind the scenes)...
Using Openpyxl
You will need to pass data_only=True when you open the file, so that openpyxl returns the value Excel last stored for a formula cell instead of the formula itself. Also, make sure you know the row and column numbers: to read the data in H6, the row is 6 and the column is 8 (H).
import openpyxl
path = "./docs/Schoolfruit.xlsx"
wb_obj = openpyxl.load_workbook(path, data_only=True)
sheet_obj = wb_obj.active ## Or use sheet_obj = wb_obj['Sheet1'] if you know sheet name
val = sheet_obj.cell(row = 6, column = 8).value
print(val)
Using Pandas
The other option is to use pandas read_excel(), which reads the whole sheet into a dataframe. You can then use iloc[] or at[] to read the specific cell. Note that this is probably the less efficient solution if you need to read just one cell...
Another point to note here is that, once you have read the data into a dataframe, row 1 is treated as the header and the first data row (Excel row 2) gets index 0. So the row number is 4 instead of 6. Similarly, the first column is now 0 and not 1, which changes the position to [4,7].
import pandas as pd
path = "./docs/Schoolfruit.xlsx"
df = pd.read_excel(path, 'Sheet1')
print(df.iloc[4,7])
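The row and column arithmetic above can be wrapped in a small helper so the cell reference stays readable. This is only a sketch: iloc_position is a hypothetical name, and it assumes the dataframe was read with the default single header row (header_rows=1). The two conversion functions come from openpyxl.utils.cell.

```python
from openpyxl.utils.cell import column_index_from_string, coordinate_from_string


def iloc_position(ref: str, header_rows: int = 1) -> tuple:
    """Map an A1-style reference to (row, column) positions for DataFrame.iloc."""
    column_letters, excel_row = coordinate_from_string(ref)
    row_pos = excel_row - 1 - header_rows  # Excel rows are 1-based
    col_pos = column_index_from_string(column_letters) - 1  # Excel columns are 1-based too
    if row_pos < 0:
        raise ValueError(f"{ref} lies inside the header rows")
    return row_pos, col_pos


print(iloc_position("H6"))
```

With it, df.iloc[iloc_position("H6")] reads the same cell as df.iloc[4,7].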
I found a solution and hope it works for you.
import pandas as pd
excel_file = './docs/Schoolfruit.xlsx'
df = pd.read_excel(excel_file, sheet_name='active', header=None, skiprows=1)  # sheet_name must match a real sheet name in the workbook
print(df[7][4])
7: column H (columns are numbered from 0)
4: row 6 (the first row is skipped and the index starts at 0)

How to extract a particular set of values from excel file using a numerical range in python?

What I intend to do :
I have an excel file with Voltage and Current data which I would like to extract from a specific sheet say 'IV_RAW'. The values are only from 4th row and are in columns D and E.
Let's say the values look like this:
V(voltage) I(Current)
47 1
46 2
45 3
0 4
-0.1 5
-10 5
Now I want to extract only the values starting from a voltage (V) of 45, excluding negative voltages. The corresponding current (I) values are needed as well. This has to be done for multiple Excel files, so I cannot start from a fixed row number; the voltage values have to be the criterion.
What I know:
I only know how to extract the entire set of values using openpyxl:
from openpyxl import load_workbook
loc = ("path")
wb = load_workbook("Data") #thefilename
ws = wb["IV_raw"] #theactiveworksheet
#to extract the voltage and current data:
for row in ws.iter_rows(min_row=1, max_col=3, max_row=2, values_only=True):
    print(row)
I am a novice coder and new to Python, so it would be really helpful if you could help. A simplified version with pandas would be great.
Thank you in advance
The following uses pandas, which you should definitely take a look at. sheet_name sets the sheet to read, header is the row index of the header (starting at 0, so row 4 -> 3), and usecols defines the columns using A1 notation.
The last line filters the dataframe. If I understand correctly, you want voltages between 0 and 45; that's what the example does, and df is your resulting dataframe.
import pandas as pd
file_loc = "path.xlsx"
df = pd.read_excel(file_loc,
                   sheet_name='IV_raw',
                   header=3,
                   usecols="D:E")
# keep 0 <= V <= 45 so that the 45 V and 0 V rows from the question are included
df = df[(df['V(voltage)'] >= 0) & (df['V(voltage)'] <= 45)]
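To check the filter without an Excel file, the sample values from the question can be put into a dataframe by hand. This is a minimal sketch; it uses Series.between, which includes both bounds by default, as a swapped-in alternative to the two comparisons:

```python
import pandas as pd

# Sample values copied from the question.
df = pd.DataFrame({
    "V(voltage)": [47, 46, 45, 0, -0.1, -10],
    "I(Current)": [1, 2, 3, 4, 5, 5],
})

# between() is inclusive on both ends by default, so 45 and 0 are kept.
filtered = df[df["V(voltage)"].between(0, 45)]
print(filtered)
```

Only the 45 V and 0 V rows survive, together with their current values.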
Building on your example, you can use the following to get what you need:
from openpyxl import load_workbook
wb = load_workbook(filepath, data_only=True)  # filepath is the full path to your file
ws = wb["Sheet1"]  # the worksheet holding the data
# extract the voltage and current data from columns D and E
data = ws.iter_rows(min_col=4, max_col=5, min_row=2, max_row=ws.max_row, values_only=True)
# keep rows whose voltage is numeric and lies between 0 and 45
output = [row for row in data if isinstance(row[0], (int, float)) and 0 <= row[0] <= 45]
You can try this:
import openpyxl
tWorkbook = openpyxl.load_workbook("YOUR_FILEPATH")
tDataBase = tWorkbook.active
voltageVal= "D4"
currentVal= "E4"
V = tDataBase[voltageVal].value
I = tDataBase[currentVal].value

Write an Excel formula to a whole column with Python

I have an existing Excel document and want to update column M based on column A. I want to start from the second row so that the first row stays as the header.
Here is my code:
import openpyxl
wb = openpyxl.load_workbook('D:\Documents\Desktop\deneme/formula.xlsx')
ws=wb['Sheet1']
for i, cellObj in enumerate(ws['M'], 1):
    cellObj.value = '=_xlfn.ISOWEEKNUM(A2)'.format(i)
wb.save('D:\Documents\Desktop\deneme/formula.xlsx')
When I run that code:
- the first row (the header) changes.
- every cell in the column gets "ISOWEEKNUM(A2)", but I want the reference to follow the row number (A3, A4, A5, ... giving ISOWEEKNUM(A3), ISOWEEKNUM(A4), ISOWEEKNUM(A5), ...).
Edit:
I have now worked around the ISOWEEKNUM issue with the code below, by changing A2 to A2:A5.
import openpyxl
wb = openpyxl.load_workbook('D:\Documents\Desktop\deneme/formula.xlsx')
ws=wb['Sheet1']
for i, cellObj in enumerate(ws['M'], 1):
    cellObj.value = '=_xlfn.ISOWEEKNUM(A2:A5)'.format(i)
wb.save('D:\Documents\Desktop\deneme/formula.xlsx')
But it still starts from the first row.
Here is an answer using pandas.
Let us consider the following spreadsheet:
First import pandas:
import pandas as pd
Then load the third sheet of your Excel workbook into a dataframe called df:
df = pd.read_excel(r'D:\Documents\Desktop\deneme/formula.xlsx', sheet_name='Sheet3')
Update column 'column_to_update' using column 'deneme'. The line below converts the dates in the 'deneme' column from strings to datetime objects and then returns the ISO week of the year for each of those dates.
df['column_to_update'] = pd.to_datetime(df['deneme']).dt.isocalendar().week  # .dt.week was removed in pandas 2.0
You can then save your dataframe to a new excel document:
df.to_excel('./newspreadsheet.xlsx', index=False)
Here is the result:
You can see that the values in 'column_to_update' got updated from 1, 2 and 3 to 12, 12 and 18.
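For the openpyxl approach from the question, both symptoms come from the loop itself: ws['M'] starts at row 1, and the formula string has no {} placeholder for .format(i) to fill, so every cell gets A2. A minimal sketch of one way to write a per-row formula, using an in-memory workbook with made-up dates (column M and the ISOWEEKNUM formula are taken from the question):

```python
import datetime

from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws["A1"] = "date"  # header row, as in the question
ws["M1"] = "week"
for day in [datetime.date(2021, 3, 22), datetime.date(2021, 3, 23), datetime.date(2021, 5, 3)]:
    ws.append([day])  # made-up dates in column A, rows 2 to 4

# Start at row 2 so the header in row 1 is left alone,
# and build the cell reference from the row number.
for row in range(2, ws.max_row + 1):
    ws[f"M{row}"] = f"=_xlfn.ISOWEEKNUM(A{row})"

print(ws["M2"].value, ws["M4"].value)
```

With a real file, the Workbook() call would be replaced by openpyxl.load_workbook(...) and followed by wb.save(...), as in the question.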

How to ignore a custom header in Pandas

I'm trying to find a creative way to get the dataframe of several sheets within a spreadsheet that's quite irregular but I can't find the way to do it.
If I try this:
file= 'filename.xlsx'
df = xlrd.open_workbook(file)
print(df)
This is my current output:
A | B | C
1 Random text | Empty cell|Empty cell
------------------------------------
2 Empty cell | |
------------------------------------
3 Empty cell | |
------------------------------------
4 CODE |HEADER 2 | HEADER 3
------------------------------------
5 INFORMATION |INFORMATION|INFORMATION
I want to start my dataframe in the CODE row and column, but pandas just gets the "Random text" cell as the first cell
This is my desired output:
4 CODE |HEADER 2 | HEADER 3
------------------------------------
5 INFORMATION |INFORMATION|INFORMATION
How would you make Pandas ignore the first rows? It has to be value-based because in the next sheets CODE starts in row 8, and in the next one in row 3
Not sure about XLRD, but Pandas has an easy way in the excel reading method that allows you to specify which row is your headers. That would be an easy fix unless you're intent on using XLRD.
You can try:
import pandas as pd
file = 'filename.xlsx'
df = pd.read_excel(file, sheet_name='sheetname', skiprows=[0,1,2])
Alternatively, you can use the header argument as mentioned earlier.
In my previous answer I gave a static solution; in this one I have added a helper function for dynamic parsing. The get_header_index helper function finds the index of the row containing the header keyword in the first column. You may change the col_index argument if you believe the header keyword is in another column, though. Likewise, you can change the keyword argument as you like. The output dfs is a dictionary of dataframes whose keys are the sheet names of the workbook.
import pandas as pd
def get_header_index(sheet, col_index=0, keyword='code'):
    # cast to str so .str works on any dtype; case=False lets 'code' match 'CODE'
    arr = sheet[sheet.columns[int(col_index)]].astype(str)
    header_index = arr[arr.str.contains(str(keyword), case=False, na=False)].index[0]
    return header_index
file = 'filename.xlsx'
sheets_dict = pd.read_excel(file, sheet_name=None)
dfs = {}
for name, sheet in sheets_dict.items():
    # +1 because the first read already used the sheet's first row as the header
    header = get_header_index(sheet, col_index=0, keyword='code') + 1
    df = pd.read_excel(file, sheet_name=name, header=header)
    dfs[name] = df
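The same idea can be done in a single pass if the sheet is read once with header=None and the header row is promoted afterwards. The sketch below is a hedged variant, not the answer's method: promote_header is a hypothetical helper, and the small raw dataframe stands in for what pd.read_excel(file, sheet_name=name, header=None) would return for the layout in the question.

```python
import pandas as pd


def promote_header(raw: pd.DataFrame, keyword: str = "CODE", col_index: int = 0) -> pd.DataFrame:
    """Use the first row whose col_index cell contains keyword as the header row."""
    column = raw.iloc[:, col_index].astype(str)
    mask = column.str.contains(keyword, case=False, regex=False, na=False)
    hits = mask.to_numpy().nonzero()[0]
    if len(hits) == 0:
        raise ValueError(f"no row contains {keyword!r} in column {col_index}")
    header_pos = int(hits[0])
    # Everything below the header row becomes the data; the header row names the columns.
    body = raw.iloc[header_pos + 1:].reset_index(drop=True)
    body.columns = raw.iloc[header_pos].tolist()
    return body


# Stand-in for a sheet read with header=None (layout copied from the question).
raw = pd.DataFrame([
    ["Random text", None, None],
    [None, None, None],
    [None, None, None],
    ["CODE", "HEADER 2", "HEADER 3"],
    ["INFORMATION", "INFORMATION", "INFORMATION"],
])
df = promote_header(raw)
print(df)
```

Because the match is value-based, it does not matter whether CODE sits in row 3, 4 or 8 of a given sheet.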
This is a form of what I did in mine, adjusted for your use (based on my previous comment):
def find_code_cell(sheet):  # wrapper name is illustrative
    # Return (row, column) of the first cell containing "CODE", or None if there is none
    for i in range(sheet.nrows):
        for j in range(sheet.ncols):
            check = sheet.cell_value(i, j)
            if "CODE" in str(check):  # str has no .contains(); use the in operator
                return (i, j)
    return None
for file in file_names: # Iterate through all of the individual report files
    book = xlrd.open_workbook(file)
    sheetname = get_sheetname(book)
    if sheetname is not None: # Check that sheet name is valid
        sheet = book.sheet_by_name(sheetname)
        code_position = find_code_cell(sheet)

Read Excel with multiple headers and unnamed column

I receive some Excel files like this:
USA UK
plane cars plane cars
2016 2 7 1 3 # a comment after the last country
2017 3 1 8 4
There is an unknown amount of countries and there can be a comment after the last column.
When I read the Excel file like that...
df = pd.read_excel(
    sourceFilePath,
    sheet_name = 'Sheet1',
    index_col = [0],
    header = [0, 1]
)
... I get a ValueError:
ValueError: Length of new names must be 1, got 2
The problem is that I cannot use the usecols parameter because I don't know how many countries there are before reading my file.
How can I read such a file?
It's possible Pandas won't be able to fix your special use case, but you can write a program that fixes the spreadsheet using openpyxl. It has really clear documentation, but here's an overview of how to use it:
import openpyxl as xl
wb = xl.load_workbook("ExampleSheet.xlsx")
for sheet in wb.worksheets:
    print("Sheet Title => {}".format(sheet.title))
    print("Dimensions => {}".format(sheet.dimensions)) # just returns a string
    print("Columns: {} <-> {}".format(sheet.min_column, sheet.max_column))
    print("Rows: {} <-> {}".format(sheet.min_row, sheet.max_row))
    for r in range(sheet.min_row, sheet.max_row + 1):
        for c in range(sheet.min_column, sheet.max_column + 1):
            if sheet.cell(r, c).value is not None:
                print("Cell {}:{} has value {}".format(r, c, sheet.cell(r, c).value))
What about just using pd.read_csv?
Once loaded, you can determine how many columns you have with df.columns.
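Another workaround for the layout in the question is to read the sheet without any header and build the two-level columns by hand. This is only a sketch under assumptions: the raw dataframe below stands in for pd.read_excel(sourceFilePath, sheet_name='Sheet1', header=None), merged country cells are assumed to come back as one value followed by empty cells, and the level names "country" and "vehicle" are made up.

```python
import pandas as pd

# Stand-in for the sheet read with header=None; the comment sits in an extra column.
raw = pd.DataFrame([
    [None, "USA", None, "UK", None, None],
    [None, "plane", "cars", "plane", "cars", None],
    [2016, 2, 7, 1, 3, "# a comment after the last country"],
    [2017, 3, 1, 8, 4, None],
])

countries = raw.iloc[0].ffill()      # spread each country over its columns
vehicles = raw.iloc[1]
keep = vehicles.notna().to_numpy()   # only columns that have a second-level header

data = raw.iloc[2:, keep].copy()
data.columns = pd.MultiIndex.from_arrays(
    [countries[keep].tolist(), vehicles[keep].tolist()],
    names=["country", "vehicle"],
)
data.index = raw.iloc[2:, 0].astype(int).tolist()  # years from the first column
print(data)
```

The number of countries never has to be known in advance: the year column and the trailing comment column drop out because they have no second-level header.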
