Excel sheet names in a pandas DataFrame - python

I have an Excel workbook that I have already loaded, and I have combined all of its sheets into one DataFrame. Now I would like to add a column holding the name of the sheet each row came from. I don't know whether I have to do this before combining everything, or how to do it. I am using pandas. This is my code so far; I want the sheet name or number to be in the "Week" column.
xlsx = pd.ExcelFile('archivo.xlsx')
hojas = []
for hojaslibro in xlsx.sheet_names:
    hojas.append(xlsx.parse(hojaslibro))
estado = pd.concat(hojas, ignore_index=True)
estado['Week'] = 0

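This should work. Tag each sheet with its name before combining. DataFrame.append was removed in pandas 2.0, so the frames are collected in a list and passed to pd.concat:
xl = pd.ExcelFile('archivo.xlsx')
frames = []
for sheet_name in xl.sheet_names:
    df = xl.parse(sheet_name)
    df['Week'] = sheet_name  # this adds `sheet_name` into the column `Week`
    frames.append(df)
df_combined = pd.concat(frames, ignore_index=True)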
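The same idea can be sketched without a loop body, assuming the sheets are already available as a dict of DataFrames keyed by sheet name (which is what pd.read_excel returns with sheet_name=None). The helper name combine_sheets and the sample data are hypothetical and only stand in for the real workbook:

```python
import pandas as pd


def combine_sheets(sheets):
    """Stack a {sheet_name: DataFrame} mapping and record each row's origin in 'Week'."""
    # assign() returns a copy, so the original per-sheet frames are left untouched
    tagged = [df.assign(Week=name) for name, df in sheets.items()]
    return pd.concat(tagged, ignore_index=True)


# Hypothetical stand-in for pd.read_excel('archivo.xlsx', sheet_name=None)
sheets = {
    "Week 1": pd.DataFrame({"item": ["a", "b"], "qty": [1, 2]}),
    "Week 2": pd.DataFrame({"item": ["c"], "qty": [3]}),
}

estado = combine_sheets(sheets)
print(estado)
```

Tagging before concatenating is what answers the "before or after" question: once the frames are stacked, the information about which sheet a row came from is gone unless it was stored in a column or in the index.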

Related

Multiple sheets of an Excel workbook into different dataframes using Pandas

I have an Excel workbook with 5 sheets containing data.
I want each sheet to be a different dataframe.
I tried the code below for one sheet of my workbook:
df = pd.read_excel("path", sheet_name=['Product Capacity'])
df
But this returns a dictionary containing the sheet, not a dataframe.
I need a dataframe.
Please suggest code that will return a dataframe.
If you want separate dataframes without a dictionary, you have to read the sheets individually, passing each sheet name as a string rather than a list (a list always returns a dict):
with pd.ExcelFile('data.xlsx') as xlsx:
    prod_cap = pd.read_excel(xlsx, sheet_name='Product Capacity')
    load_cap = pd.read_excel(xlsx, sheet_name='Load Capacity')
    # and so on
But you can also load all sheets and use a dict:
dfs = pd.read_excel('data.xlsx', sheet_name=None)
# dfs['Product Capacity']
# dfs['Load Capacity']

pd.ExcelFile does not get the real sheet_names

I'm trying to read in an Excel file with multiple sheets such that all columns are strings. The code below works for that, but it doesn't get the correct sheet names. My dic_excel, a dictionary mapping each sheet name to its data, has the keys 'Sheet1', 'Sheet2', 'Sheet3', etc., but the actual names of the sheets are different. How do I get the actual names of the sheets?
dic_excel = {}
excel = pd.ExcelFile(excel_path)
for sheet in excel.sheet_names:
    print(sheet)
    columns = excel.parse(sheet).columns
    converters = {col: str for col in columns}
    dic_excel[sheet] = excel.parse(sheet, converters=converters)
Here are two ways to get the real names of your Excel sheets:
By using dict.keys with pandas (read_excel with sheet_name=None returns a dict of DataFrames keyed by sheet name)
import pandas as pd
excel = pd.read_excel(excel_path, sheet_name=None)
dic_excel = excel.keys()
This will return a view of the dictionary's keys, i.e. the sheet names
By using Workbook.sheetnames with openpyxl
import openpyxl
wb = openpyxl.load_workbook(excel_path)
list_excel = wb.sheetnames
This will return a list of the sheet names

Pandas concat dataframe per excel file

I have code that reads multiple files in a directory, and every Excel file has more than 10 sheets. For each file I need to exclude some sheets and extract the others.
I get all the data I need, but every sheet creates a new DataFrame even though I use concat, so when I save to JSON only the last DataFrame per file is saved instead of the whole data.
Here's my code:
excluded_sheet = ['Sheet 2', 'Sheet 6']
for index, xls_path in enumerate(file_paths):
    data_file = pd.ExcelFile(xls_path)
    sheets = [sheet for sheet in data_file.sheet_names if sheet not in excluded_sheet]
    for sheet_name in sheets:
        file = xls_path.rfind(".")
        head, tail = os.path.split(xls_path[1:file])
        df = pd.concat([pd.read_excel(xls_path, sheet_name=sheet_name, header=None)], ignore_index=True)
        df.insert(loc=0, column='sheet name', value=sheet_name)
        pd.DataFrame(df.to_json(f"{json_folder_path}{tail}.json", orient='records', indent=4))
I didn't use sheet_name=None because I need to read the sheet name and add it as a column value.
Current state of my DataFrames:
I get many DataFrames because every sheet creates a new one, instead of only 2, since I have 2 files in the directory. Thanks for your help.
You can use a list comprehension to read all the sheets, then join them into one DataFrame, with the sheet names passed as keys:
...
...
    sheets = [sheet for sheet in data_file.sheet_names if sheet not in excluded_sheet]
    file = xls_path.rfind(".")
    head, tail = os.path.split(xls_path[1:file])
    dfs = [pd.read_excel(xls_path, sheet_name=sheet_name, header=None) for sheet_name in sheets]
    df = pd.concat(dfs, keys=sheets)
    # drop the per-sheet row numbers (level 1) and turn the sheet-name level into a column
    df = df.reset_index(level=1, drop=True).rename_axis('sheet name').reset_index()
    df.to_json(f"{json_folder_path}{tail}.json", orient='records', indent=4)
Or create a helper list dfs, append one DataFrame per iteration, and call concat once outside the inner loop:
...
...
    sheets = [sheet for sheet in data_file.sheet_names if sheet not in excluded_sheet]
    dfs = []
    for sheet_name in sheets:
        file = xls_path.rfind(".")
        head, tail = os.path.split(xls_path[1:file])
        df = pd.read_excel(xls_path, sheet_name=sheet_name, header=None)
        df.insert(loc=0, column='sheet name', value=sheet_name)
        dfs.append(df)
    # one concat and one JSON file per workbook, after all of its sheets are read
    df1 = pd.concat(dfs, ignore_index=True)
    df1.to_json(f"{json_folder_path}{tail}.json", orient='records', indent=4)
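To see what the keys-based approach does to the index, here is a small self-contained sketch. It assumes the sheets of one workbook are already in a dict (hypothetical sample data below, with made-up column names id and label), and it uses the names argument of pd.concat to label the two index levels; the level name "row" is an arbitrary choice:

```python
import json

import pandas as pd

# Hypothetical stand-in for the sheets read from one workbook
all_sheets = {
    "Sheet 1": pd.DataFrame({"id": [1, 2], "label": ["a", "b"]}),
    "Sheet 2": pd.DataFrame({"id": [9], "label": ["skip"]}),
    "Sheet 3": pd.DataFrame({"id": [3], "label": ["c"]}),
}
excluded_sheet = ["Sheet 2", "Sheet 6"]

# Keep only the sheets that are not excluded
kept = {name: df for name, df in all_sheets.items() if name not in excluded_sheet}

combined = (
    # dict keys become the outer index level, the per-sheet row numbers the inner one
    pd.concat(kept, names=["sheet name", "row"])
    .reset_index(level="row", drop=True)  # discard the per-sheet row numbers
    .reset_index()                        # move 'sheet name' into a regular column
)

# Without a path, to_json returns the JSON text instead of writing a file
records_json = combined.to_json(orient="records", indent=4)
records = json.loads(records_json)
print(records_json)
```

Because the concatenation happens once per workbook, there is exactly one combined DataFrame (and one JSON document) per file, regardless of how many sheets it has.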

How to remove numbering from output after extracting an xls file with pandas [Python]

I have a Python script that extracts a specific column from an Excel .xls file, but the output has numbering next to the extracted information, so I would like to know how to format the output so that the numbers don't appear.
My actual code is this:
for i in sys.argv:
    file_name = sys.argv[1]
workbook = pd.read_excel(file_name)
df = pd.DataFrame(workbook, columns=['NOM_LOGR_COMPLETO'])
df = df.drop_duplicates()
df = df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
print(df)
My current output:
1 Street Alpha
2 Street Bravo
But the result I need is:
Street Alpha
Street Bravo
without the numbering, just the name of the streets.
Thanks!
I believe you want to display the dataframe without the index. Note that a DataFrame always has an index; it can be left out of the output, but not removed from the DataFrame itself. So for your case, you can use:
print(df.values)
to see the values without the index column. To save the output without the index, use:
with pd.ExcelWriter("dataframe.xlsx", engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False)
where "dataframe.xlsx" is the output file name and sheet_name must be a string (here 'Sheet1', the pandas default). The with block saves and closes the file; writer.save() was removed in pandas 2.0.
Further references can be found at:
How to print pandas DataFrame without index
Printing a pandas dataframe without row number/index
disable index pandas data frame
Python to_excel without row names (index)?
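For the printing part of the question, a minimal sketch of two options: DataFrame.to_string with index=False (and header=False to also drop the column name), or simply iterating over the column. The sample data is hypothetical and stands in for the column read from the .xls file:

```python
import pandas as pd

# Hypothetical stand-in for the column extracted from the workbook
df = pd.DataFrame(
    {"NOM_LOGR_COMPLETO": ["Street Alpha", "Street Bravo", "Street Alpha", None]}
)
df = df.drop_duplicates().dropna()

# Option 1: render the frame as text without the index and without the header row
text = df.to_string(index=False, header=False)
print(text)

# Option 2: print the values one per line
streets = df["NOM_LOGR_COMPLETO"].tolist()
for street in streets:
    print(street)
```

Option 2 is often the simpler choice when only one column is needed, since it sidesteps DataFrame formatting altogether.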

Adding sheet name to the concatenated final merged sheet in Excel

I want to merge multiple Excel sheets into one and add a new column with the name of the original sheet.
I'm using the following code:
list_of_sheets = list(df.keys())
cdf = pd.concat(df[sheet] for sheet in list_of_sheets)
# tried
cdf = pd.concat(df[sheet]["Brand"] for sheet in list_of_sheets)
# and
list_of_sheets = list(df.keys())
for sheet in list_of_sheets:
    df[sheet]["Brand"] = sheet
cdf = pd.concat(df[sheet])
but none of them works.
Does this accomplish what you want?
import pandas as pd
pd.concat(pd.read_excel("my_excel_file.xlsx", sheet_name=None))
The sheet names will be the first level of the resulting DataFrame's index.
First read the file:
xl = pd.ExcelFile(file)
which should display something like:
<pandas.io.excel.ExcelFile at 0x12cad0860>
Then iterate over the sheets, add the sheet name as a separate column, and store all dfs in a list:
dfs = []
for sheet in xl.sheet_names:
    df = xl.parse(sheet)
    df['sheet_name'] = sheet
    dfs.append(df)
Finally, concatenate them:
pd.concat(dfs)
