How to ignore a custom header in Pandas - python

I'm trying to find a creative way to get the dataframe of several sheets within a spreadsheet that's quite irregular but I can't find the way to do it.
If I try this:
file= 'filename.xlsx'
df = xlrd.open_workbook(file)
print(df)
This is my current output:
A | B | C
1 Random text | Empty cell|Empty cell
------------------------------------
2 Empty cell | |
------------------------------------
3 Empty cell | |
------------------------------------
4 CODE |HEADER 2 | HEADER 3
------------------------------------
5 INFORMATION |INFORMATION|INFORMATION
I want to start my dataframe at the CODE row and column, but pandas just takes the "Random text" cell as the first cell.
This is my desired output:
4 CODE |HEADER 2 | HEADER 3
------------------------------------
5 INFORMATION |INFORMATION|INFORMATION
How would you make Pandas ignore the first rows? It has to be value-based because in the next sheets CODE starts in row 8, and in the next one in row 3

I'm not sure about xlrd, but pandas has an easy option in its Excel-reading function (read_excel) that lets you specify which row holds your headers. That would be an easy fix unless you're intent on using xlrd.

You can try:
import pandas as pd
file = 'filename.xlsx'
# Skip the first three rows so that the fourth row becomes the header
df = pd.read_excel(file, sheet_name='sheetname', skiprows=[0, 1, 2])
Alternatively, you can use the header argument, as mentioned earlier.

My previous answer gave a static solution; this one adds a helper function for dynamic parsing. The get_header_index helper returns the index of the row that contains the header keyword in the first column. You can change the col_index argument if the header keyword is in another column, and likewise change the keyword argument as you like. The output dfs is a dictionary of dataframes whose keys are the sheet names of the given workbook.
import pandas as pd

def get_header_index(sheet, col_index=0, keyword='code'):
    # Return the index label of the first row whose cell in the chosen column contains the keyword
    arr = sheet[sheet.columns[int(col_index)]]
    # case=False so that keyword='code' also matches 'CODE'
    header_index = arr[arr.str.contains(str(keyword), case=False, na=False)].index[0]
    return header_index

file = 'filename.xlsx'
sheets_dict = pd.read_excel(file, sheet_name=None)  # dict of {sheet name: DataFrame}
dfs = {}
for name, sheet in sheets_dict.items():
    # +1 because the first pass used sheet row 0 as the header, so index i is sheet row i + 1
    header = get_header_index(sheet, col_index=0, keyword='code') + 1
    df = pd.read_excel(file, sheet_name=name, header=header)
    dfs[name] = df
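The same idea can be sketched without reading the file twice: read each sheet once with header=None, find the keyword row, and promote it to the header. The helper name promote_header and the sample frame below are assumptions made for illustration (the frame mimics the sheet in the question); with a real workbook, raw would come from pd.read_excel(file, sheet_name=name, header=None).

```python
import pandas as pd

def promote_header(raw, keyword="CODE", col_index=0):
    """Return the rows below the first row whose cell in col_index contains keyword,
    using that row as the column names."""
    first_col = raw.iloc[:, col_index].astype(str)
    mask = first_col.str.contains(keyword, case=False, regex=False, na=False)
    if not mask.any():
        raise ValueError(f"keyword {keyword!r} not found in column {col_index}")
    start = int(mask.to_numpy().argmax())  # position of the first matching row
    body = raw.iloc[start + 1:].reset_index(drop=True)
    body.columns = raw.iloc[start].tolist()
    return body

# Stand-in for a sheet read with header=None (hypothetical data)
raw = pd.DataFrame([
    ["Random text", None, None],
    [None, None, None],
    [None, None, None],
    ["CODE", "HEADER 2", "HEADER 3"],
    ["INFORMATION", "INFORMATION", "INFORMATION"],
])

df = promote_header(raw)
print(df)
```

Because the match is value-based, the same call works whether CODE sits in row 3, 4, or 8 of a given sheet.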

This is a form of what I did in mine, adjusted for your use (based on my previous comment):
import xlrd

def find_keyword_cell(sheet, keyword="CODE"):
    # Return (row, col) of the first cell containing the keyword, or None if absent
    for i in range(sheet.nrows):
        for j in range(sheet.ncols):
            check = sheet.cell_value(i, j)
            if keyword in str(check):
                return (i, j)
    return None

for file in file_names:  # Iterate through all of the individual report files
    book = xlrd.open_workbook(file)
    sheetname = get_sheetname(book)  # helper from my own code, not shown here
    if sheetname is not None:  # Check that sheet name is valid
        sheet = book.sheet_by_name(sheetname)
        position = find_keyword_cell(sheet)
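Note that xlrd 2.0 and later only read .xls files, so for .xlsx workbooks the same cell search could be sketched with openpyxl instead. The function name find_keyword_cell and the in-memory workbook are assumptions made for illustration; with a real file you would call load_workbook on its path.

```python
import io
from openpyxl import Workbook, load_workbook

def find_keyword_cell(sheet, keyword="CODE"):
    """Return (row, column), both 1-based, of the first cell whose text contains keyword."""
    for row in sheet.iter_rows():
        for cell in row:
            if isinstance(cell.value, str) and keyword in cell.value:
                return (cell.row, cell.column)
    return None

# Build a small stand-in workbook in memory that mimics the sheet in the question
wb_out = Workbook()
ws_out = wb_out.active
ws_out["A1"] = "Random text"
ws_out["A4"] = "CODE"
ws_out["B4"] = "HEADER 2"
ws_out["C4"] = "HEADER 3"
ws_out["A5"] = "INFORMATION"
buffer = io.BytesIO()
wb_out.save(buffer)
buffer.seek(0)

wb = load_workbook(buffer)
position = find_keyword_cell(wb.active)
print(position)
```

openpyxl counts rows and columns from 1, so subtract 1 from the row before passing it as the header argument of pandas.read_excel.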

Related

How to extract a particular set of values from excel file using a numerical range in python?

What I intend to do :
I have an Excel file with voltage and current data which I would like to extract from a specific sheet, say 'IV_RAW'. The values are only from the 4th row onward and are in columns D and E.
Let's say the values look like this:
| V(voltage) | I(Current) |
| ---------- | ---------- |
| 47 | 1 |
| 46 | 2 |
| 45 | 3 |
| 0 | 4 |
| -0.1 | 5 |
| -10 | 5 |
Now, I want to take only the values starting from a voltage (V) of 45, and negative voltages shouldn't be included. The corresponding current (I) values need to be taken as well. This has to be done for multiple Excel files, so starting from a particular row number won't work; the voltage values should be the criterion instead.
What I know:
I only know how to take out the entire set of values using openpyxl:
from openpyxl import load_workbook
loc = ("path")
wb = load_workbook("Data") #thefilename
ws = wb["IV_raw"] #theactiveworksheet
#to extract the voltage and current data:
for row in ws.iter_rows(min_row=1, max_col=3, max_row=2, values_only=True):
    print(row)
I am a novice coder and new to Python, so it would be really helpful if you could help. A simplified version with pandas would be great.
Thank you in advance
The following uses pandas, which you should definitely take a look at. sheet_name sets the sheet to read, header is the row index of the header (starting at 0, so row 4 -> 3), and usecols defines the columns using A1 notation.
The last line filters the dataframe. If I understand correctly, you want voltages between 0 and 45; that's what the example does, and df is your resulting dataframe.
import pandas as pd
file_loc = "path.xlsx"
df = pd.read_excel(file_loc,
                   sheet_name='IV_raw',
                   header=3,
                   usecols="D:E")
# Inclusive bounds, so that the 45 V and 0 V rows are kept
df = df[(df['V(voltage)'] >= 0) & (df['V(voltage)'] <= 45)]
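To check the voltage filter without an Excel file, here is a small sketch that rebuilds the sample values from the question in memory. It assumes the 45 V and 0 V rows should both be kept, which is why it uses Series.between (inclusive on both ends by default); the variable name selected is only illustrative.

```python
import pandas as pd

# Sample values copied from the question
df = pd.DataFrame({
    "V(voltage)": [47, 46, 45, 0, -0.1, -10],
    "I(Current)": [1, 2, 3, 4, 5, 5],
})

# Keep rows whose voltage lies in the closed interval [0, 45]
selected = df[df["V(voltage)"].between(0, 45)]
print(selected)
```

Filtering on the values rather than on row positions means the same line can be applied to every file, wherever the 45 V row happens to sit.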
Building on your example, you can use the following to get what you need:
from openpyxl import load_workbook
filepath = "path.xlsx"  # placeholder: full path to your file
wb = load_workbook(filepath, data_only=True)
ws = wb["IV_raw"]  # the sheet that holds the data
# Extract the voltage (column D) and current (column E) values
data = ws.iter_rows(min_col=4, max_col=5, min_row=2, max_row=ws.max_row, values_only=True)
# Keep only numeric rows with a voltage between 0 and 45; this also skips header and empty cells
output = [row for row in data if isinstance(row[0], (int, float)) and 0 <= row[0] <= 45]
You can try this to read individual cells:
import openpyxl
tWorkbook = openpyxl.load_workbook("YOUR_FILEPATH")
tDataBase = tWorkbook.active
voltageVal= "D4"
currentVal= "E4"
V = tDataBase[voltageVal].value
I = tDataBase[currentVal].value

Python how to add a column to a text file

I'm new to python and I have a challenge. I need to add a column in a text file delimited by ";". So far so good ... except that the value of this column depends on the value of another column. I will leave an example in case I was not clear
My file looks like this:
Account;points
1;500
2;600
3;1500
If the value of the points column is greater than 1000, enter 2, if less, enter 1.
In this case the file would look like this:
Account;points;column_created
1;500;1
2;600;1
3;1500;2
Here is an approach without pandas. This code assumes the first line is a header and that the points column is always in the second position.
with open('stats.txt', 'r+') as file:
    lines = file.readlines()
    file.seek(0, 0)
    # The first line is the header, so extend it with the new column name
    file.write(lines[0].strip() + ";column_created\n")
    for line in lines[1:]:
        columns = line.strip().split(";")
        if int(columns[1]) > 1000:
            file.write(";".join(columns) + ";2\n")
        else:
            file.write(";".join(columns) + ";1\n")
A file on disk doesn't let you insert new items between existing ones. You have to read all the data into memory, add the new column, and write everything back to the file.
You could use pandas to easily add a new column based on the values of another column.
In the example I use io.StringIO() only to create minimal working code that everyone can copy and test. Use read_csv('input.csv', sep=';') with your file.
import pandas as pd
import io
text = '''Account;points
1;500
2;600
3;1500'''
#df = pd.read_csv('input.csv', sep=';')
df = pd.read_csv(io.StringIO(text), sep=';')
print('--- before ---')
print(df)
df['column_created'] = df['points'].apply(lambda x: 2 if x > 1000 else 1)
print('--- after ---')
print(df)
df.to_csv('output.csv', sep=';', index=False)
Result
--- before ---
   Account  points
0        1     500
1        2     600
2        3    1500
--- after ---
   Account  points  column_created
0        1     500               1
1        2     600               1
2        3    1500               2
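As a possible alternative to apply with a lambda, the same column can be computed in one vectorized step with numpy.where. This is only a sketch on the sample data from the question; it assumes that a value of exactly 1000 should get 1, which the question does not specify.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"Account": [1, 2, 3], "points": [500, 600, 1500]})

# 2 where points exceed 1000, otherwise 1
df["column_created"] = np.where(df["points"] > 1000, 2, 1)
print(df)
```

On large files this usually avoids the per-row Python call that apply makes.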
You can also use Python's built-in csv library to read and write such files. Here is the link to the documentation.
https://docs.python.org/3/library/csv.html
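A minimal sketch of what that could look like with the csv module, using io.StringIO in place of real files so it can be run as-is. The variable names are illustrative; with real files you would pass file objects opened with newline='' to csv.reader and csv.writer.

```python
import csv
import io

text = "Account;points\n1;500\n2;600\n3;1500\n"

reader = csv.reader(io.StringIO(text), delimiter=";")
out = io.StringIO()
writer = csv.writer(out, delimiter=";", lineterminator="\n")

# Copy the header and add the name of the new column
header = next(reader)
writer.writerow(header + ["column_created"])

# Append 2 when points exceed 1000, otherwise 1
for row in reader:
    writer.writerow(row + [2 if int(row[1]) > 1000 else 1])

result = out.getvalue()
print(result)
```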

Get column names of Excel worksheet with OpenPyXL in readonly mode

How could I retrieve
1. the column names (values of the cells in the first row) in an openpyxl read-only worksheet?
   City, Population, Country in the example worksheet below
2. all column names in an openpyxl read-only workbook?
   City, Population, Country, frames from worksheet 1 and the other column names from all other worksheets
Example Excel worksheet:
| City | Population | Country |
| -----------|------------ | ------------ |
| Madison | 252,551 | USA |
| Bengaluru | 10,178,000 | India |
| ... | ... | ... |
Example code:
from openpyxl import load_workbook
wb = load_workbook(filename='large_file.xlsx', read_only=True)
sheet = wb.worksheets[0]
... (not sure where to go from here)
Notes:
I have to use readonly because the Excel file has over 1 million rows (don't ask)
I'd like the column names so I can eventually infer the column types and import the excel data into a PostgreSQL database
This collects everything from row 1:
list_with_values = []
for cell in ws[1]:
    list_with_values.append(cell.value)
If for some reason you want a list of the column letters that are filled in, you can use:
# In current openpyxl versions cell.column is the column number; column_letter gives the letter
column_list = [cell.column_letter for cell in ws[1]]
For your 2nd question;
Assuming you have stored the header values in a list called : "list_with_values"
from openpyxl import Workbook
wb = Workbook()
ws = wb['Sheet']
#Sheet is the default sheet name, you can rename it or create additional ones with wb.create_sheet()
ws.append(list_with_values)
wb.save('OutPut.xlsx')
Read-only mode provides fast access to any row or set of rows in a worksheet. Use the method iter_rows() to restrict the selection. So to get the first row of the worksheet:
rows = ws.iter_rows(min_row=1, max_row=1) # returns a generator of rows
first_row = next(rows) # get the first row
headings = [c.value for c in first_row] # extract the values from the cells
Charlie Clark's answer compacted into a one-liner with a list comprehension:
headers = [c.value for c in next(wb['sheet_name'].iter_rows(min_row=1, max_row=1))]
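For the second part of the question (column names from every worksheet), the same iter_rows call can be repeated per sheet. The sketch below builds a tiny workbook in memory as a stand-in for the real large file; the sheet names and the second sheet's columns are made up for illustration.

```python
import io
from openpyxl import Workbook, load_workbook

# Build a small stand-in workbook in memory (hypothetical sheets and data)
wb_out = Workbook()
ws1 = wb_out.active
ws1.title = "Cities"
ws1.append(["City", "Population", "Country"])
ws1.append(["Madison", 252551, "USA"])
ws2 = wb_out.create_sheet("Frames")
ws2.append(["frames", "width"])
buffer = io.BytesIO()
wb_out.save(buffer)
buffer.seek(0)

# Read only the first row of each worksheet in read-only mode
wb = load_workbook(buffer, read_only=True)
headers = {}
for ws in wb.worksheets:
    first_row = next(ws.iter_rows(min_row=1, max_row=1, values_only=True), ())
    headers[ws.title] = list(first_row)
wb.close()

print(headers)
```

Only the first row of each sheet is consumed, so the remaining rows are never iterated. The default of () passed to next guards against an empty worksheet.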
This is how I handled this:
from openpyxl.utils import get_column_letter

def get_columns_from_worksheet(ws):
    return {
        cell.value: {
            'letter': get_column_letter(cell.column),
            'number': cell.column - 1
        } for cell in ws[1] if cell.value
    }
An example of this being used would be:
from openpyxl import load_workbook
wb = load_workbook(filename='my_file.xlsx')
ws = wb['MySheet']
COLUMNS = get_columns_from_worksheet(ws)
for cell in ws[COLUMNS['MY Named Column']['letter']]:
    print(cell.value)
The main reason for capturing both the letter and the number is that different functions and patterns within openpyxl use either one, so having a reference to both is invaluable.

How to use Python to read one column from Excel file?

I want to read the data in one column in excel, here is my code:
import xlrd
file_location = "location/file_name.xlsx"
workbook = xlrd.open_workbook(file_location)
sheet = workbook.sheet_by_name('sheet')
x = []
for cell in sheet.col[9]:
    if isinstance(cell, float):
        x.append(cell)
print(x)
It is wrong because there is no method in sheet called col[col.num], but I just want to extract the data from column 8 (column H), what can I do?
If you're not locked into xlrd, I would probably use pandas instead, which is pretty good when working with data from anywhere:
import pandas as pd
df = pd.ExcelFile('location/test.xlsx').parse('Sheet1') #you could add index_col=0 if there's an index
x=[]
x.append(df['name_of_col'])
You could then just write the new extracted columns to a new excel file with pandas df.to_excel()
You can get the values of the 8th column like this:
for rownum in range(sheet.nrows):
    # cell_value() returns the value itself; cell() would return a Cell object
    x.append(sheet.cell_value(rownum, 7))
By far the easiest way to get all the values in a column using xlrd is the col_values() worksheet method:
x = []
for value in sheet.col_values(8):
    if isinstance(value, float):
        x.append(value)
(Note that if you want column H, you should use 7, because the indices start at 0.)
Incidentally, you can use col() to get the cell objects in a column:
x = []
for cell in sheet.col(8):
    if isinstance(cell.value, float):
        x.append(cell.value)
The best place to find this stuff is the official tutorial (which serves as a decent reference for xlrd, xlwt, and xlutils). You could of course also check out the documentation and the source code.
I would recommend doing it like this:
import openpyxl
fname = 'file.xlsx'
wb = openpyxl.load_workbook(fname)
sheet = wb['sheet-name']  # get_sheet_by_name() is deprecated in current openpyxl
for rowOfCellObjects in sheet['C5':'C7']:
    for cellObj in rowOfCellObjects:
        print(cellObj.coordinate, cellObj.value)
Result:
C5 70.82
C6 84.82
C7 96.82
Note: fname refers to the Excel file, wb['sheet-name'] selects the desired sheet, and sheet['C5':'C7'] gives the cell range to read.
Check out the link for more detail. Code segment taken from here too.
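To read a whole column rather than a fixed range, iter_rows can be limited to a single column. The following sketch reads column H (column number 8 in openpyxl, which counts from 1) and keeps only the float values, as the question's loop intended. The in-memory workbook and its values are stand-ins for illustration.

```python
import io
from openpyxl import Workbook, load_workbook

# Build a small stand-in workbook in memory with mixed values in column H
wb_out = Workbook()
ws_out = wb_out.active
ws_out["H1"] = "value"
ws_out["H2"] = 70.82
ws_out["H3"] = "n/a"
ws_out["H4"] = 84.82
buffer = io.BytesIO()
wb_out.save(buffer)
buffer.seek(0)

wb = load_workbook(buffer, data_only=True)
sheet = wb.active

# Each row is a 1-tuple because min_col == max_col == 8 (column H)
x = [
    row[0]
    for row in sheet.iter_rows(min_col=8, max_col=8, values_only=True)
    if isinstance(row[0], float)
]
print(x)
```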
xlrd is good, but for this case you might prefer pandas, because it lets you select columns with the '[ ]' operator.
Complete working code for your context would be:
import pandas as pd
file_location = "file_name.xlsx"
sheet = pd.read_excel(file_location)
print(sheet['Sl'])
Output 1 - For column 'Sl'
0 1
1 2
2 3
Name: Sl, dtype: int64
Output 2 - For column 'Name'
print(sheet['Name'])
0 John
1 Mark
2 Albert
Name: Name, dtype: object
Reference: file_name.xlsx data
Sl Name
1 John
2 Mark
3 Albert

data validation range Django and xlsxwriter

I have been using Django and xlsxwriter on a project that I am working on. I want to use data_validation in Sheet1 to pull in the lists that I have printed in Sheet2. I get the lists to print, but am not seeing the data_validation in Sheet1 when I open the file. Any insight on what I am doing incorrectly is much appreciated!
wb = xlsxwriter.Workbook(TestCass)
sh_1 = wb.add_worksheet()
sh_2 = wb.add_worksheet()
col = 15
head_col = 0
for header in headers:
    sh_1.write(0,head_col,header)
    sh_2.write(0,head_col,header)
    list_row = 1
    list = listFunction(headerToModelDic[header])
    for entry in list:
        sh_2.write(list_row,col,entry)
        list_row += 1
    sh_1.data_validation(1,col,50,col,{'validate':'list','source':'=Sheet2!$A2:$A9'})
    col += 1
wb.close()
Note: The reason I am not pulling the list directly from the site is because it is too long (longer than 256 characters). Secondly, I ultimately would like the source range in the data validation to take in variables from sheet2, however I cannot get sheet 1 to have any sort of data validation as is so I figured I would start with the absolute values.
It looks like the data ranges are wrong in the example. You are writing the list data to column P of Sheet2 (col = 15), but the data validation source refers to column A.
Maybe in your full example there is data in that range, but in the example above there isn't.
I've modified your example slightly to a non-Django example with some sample data. I've also changed the data validation range to match the written data range:
import xlsxwriter

wb = xlsxwriter.Workbook('test.xlsx')
sh_1 = wb.add_worksheet()
sh_2 = wb.add_worksheet()
col = 15
head_col = 0
headers = ['Header 1']
for header in headers:
    sh_1.write(0,head_col,header)
    sh_2.write(0,head_col,header)
    list_row = 1
    list = [1, 2, 3, 4, 5]
    for entry in list:
        sh_2.write(list_row,col,entry)
        list_row += 1
    sh_1.data_validation(1,col,50,col,
                         {'validate':'list','source':'=Sheet2!$P2:$P6'})
    col += 1
wb.close()
And here is the output:
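The question also mentions eventually wanting the source range to come from variables rather than a hard-coded string. One way to sketch that is with the xl_range_abs helper from xlsxwriter.utility, which turns zero-based row and column numbers into an absolute A1-style range. The variable names below are illustrative, and the sheet name Sheet2 assumes the default worksheet naming used in the example.

```python
from xlsxwriter.utility import xl_range_abs

entries = [1, 2, 3, 4, 5]  # sample list data, as in the example above
col = 15                   # zero-based column index (column P)
first_row = 1              # list data starts just below the header row
last_row = first_row + len(entries) - 1

# Build the data validation source from the same numbers used when writing the list
source = f"=Sheet2!{xl_range_abs(first_row, col, last_row, col)}"
print(source)
```

The resulting string can then be passed as the 'source' value in the data_validation() options, so the range always matches the rows and column that were actually written.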
