I'm trying to read an excel file that has some merged column headers using pandas. The file looks as below:
I want the output to be as below:
After loading it to pandas, the output comes as below:
Does anyone know how I can handle this in Pandas?
Thanks
This is usually called a multi-level (multi-row) header, and both pd.read_excel and pd.read_csv support it.
Simply pass a list of row numbers to the header parameter.
Afterwards you can flatten the header into a single row, as in the example below:
import pandas as pd
df = pd.read_excel('test.xlsx', header=[0, 1])  # use the first and second rows as headers (pandas counts rows from 0)
df.columns = ['.'.join(col).strip() for col in df.columns.values]  # flatten the headers into a single row, joining the levels with "."
You can also experiment with the to_flat_index() method, which turns each column label into a tuple of its levels, but I have not found it as convenient.
df.columns = df.columns.to_flat_index()
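To compare the two flattening options without needing an Excel file, here is a minimal sketch. The in-memory frame and its labels ("Sales", "Costs", "Q1", "Q2") are made up; it only stands in for what pd.read_excel(..., header=[0, 1]) would return for a sheet with merged headers.

```python
import pandas as pd

# Hypothetical stand-in for the result of pd.read_excel(..., header=[0, 1]):
# a frame whose columns form a two-level MultiIndex.
columns = pd.MultiIndex.from_tuples(
    [("Sales", "Q1"), ("Sales", "Q2"), ("Costs", "Q1")]
)
df = pd.DataFrame([[10, 20, 5], [30, 40, 15]], columns=columns)

# Option 1: join the levels into one string per column.
joined = [".".join(col).strip() for col in df.columns.values]

# Option 2: to_flat_index() keeps each column label as a tuple of its levels.
flat = list(df.columns.to_flat_index())

# Apply option 1 so the frame ends up with plain string column names.
df.columns = joined
```

With the joined names, a column can then be selected as df["Sales.Q1"]; with the flat index it would be selected by the tuple ("Sales", "Q1").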
Related
I need to pull data from a column based on the column header. My only problem is the input files aren't consistent and have the column in different locations and the data doesn't start on row one.
Above is an example Excel file. I want to pull the data for Market. I've got this to work using pandas if the data starts at A1, but I can't get it to pull the data if it doesn't start in the first position.
How about using this right after your pd.read_excel() statement?
df=df.dropna(how='all',axis='columns').dropna(how='all',axis='rows')
You can then set the first row as header:
df.columns = df.iloc[0]
df = df[1:]
df
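Here is a small sketch of those steps on an in-memory frame. It assumes the sheet was read with header=None, so that the real header row is still part of the data; the values ("Market", "Price", "NYSE", "LSE") are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sheet as read with header=None: the table starts at row 2, column 1.
raw = pd.DataFrame(
    [
        [np.nan, np.nan, np.nan],
        [np.nan, np.nan, np.nan],
        [np.nan, "Market", "Price"],
        [np.nan, "NYSE", 101.5],
        [np.nan, "LSE", 99.0],
    ]
)

# Drop the columns and rows that are entirely empty.
df = raw.dropna(how="all", axis="columns").dropna(how="all", axis="rows")

# Promote the first remaining row to the header, then remove it from the data.
df.columns = df.iloc[0]
df = df[1:].reset_index(drop=True)
df.columns.name = None  # discard the leftover row label used as the columns' name

# The column can now be selected by its header, wherever it sat in the sheet.
market = df["Market"].tolist()
```

Because the lookup is by header name rather than by position, the same code should work for files where the table is offset differently, as long as the padding rows and columns are completely empty.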
I would like the top row of the excel file to be the headers of the dataframe. (header=0 does this)
When the dataframe is saved as a .csv, I would like the headers to be on row 1 of the .csv, just as they were in the original file (this is what I am having trouble achieving).
I have tried setting header= in .to_csv to both None and 0, but neither causes the headers to appear on row 1 of the output file.
I am now trying to extract row 0 as df1 and concatenate it with df, but I am getting the error 'first argument must be an iterable of pandas objects, you passed an object of type "Series"'.
Can anyone offer any insight about how to approach this, or if there is an easier way?
import pandas as pd
data = pd.read_excel (r'C:\Users\dusti\Desktop\bulk export.xlsx',
sheet_name=0,
header=0)
df = pd.DataFrame(data)
df1 = df. loc[0, :]
df = pd.concat(df1, df)
df.to_excel(r'C:\Users\dusti\Desktop\bulk export1.xlsx',
header=None,
index=False)
(Please show us your dataframe header row and index, e.g. post df.head(4) as text. We need to see your index).
Possible issues:
The header argument of pandas .to_excel() and .to_csv() expects a bool (True/False) or else a list of string column names.
This is different from read_csv(header=...), which can also take an int (the row number(s) to use as the column names).
But you're trying to pass the int header=0 into to_excel()/to_csv().
If your dataframe has a MultiIndex (does it?) and you use to_excel(..., index=False), pandas has a known open issue:
"Export to excel for multiindex columns" (#11292). Solutions: a) use index=True, b) don't create a MultiIndex on your dataframe, or c) unstack() your MultiIndex.
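To see what the header argument does on the writing side, here is a minimal sketch that writes to in-memory buffers instead of files. The frame and the replacement names ("item", "count") are made up for the example.

```python
import io

import pandas as pd

df = pd.DataFrame({"sku": ["A1", "B2"], "qty": [3, 7]})

# header=True (the default) writes the column names as the first line.
with_header = io.StringIO()
df.to_csv(with_header, header=True, index=False)

# header=False leaves the column names out entirely.
without_header = io.StringIO()
df.to_csv(without_header, header=False, index=False)

# A list of strings writes those names in place of the real column names.
renamed = io.StringIO()
df.to_csv(renamed, header=["item", "count"], index=False)

# Collect the first line of each output for comparison.
first_lines = [
    buf.getvalue().splitlines()[0]
    for buf in (with_header, without_header, renamed)
]
```

So if the goal is simply to have the column names on row 1 of the output, leaving header at its default should already do that.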
I have an Excel sheet where I skipped multiple rows and finally arrived at a dataframe with some structure. The dataframe looks like this; the headers are in bold.
There are some columns on top which I hid in this screenshot as well. When I read the sheet into a dataframe while skipping rows, I end up with multi-level indexing.
I want the numbers in the header to become a row. Please advise how to achieve this.
Thank you in advance
You can skip the header with header=None if you use .read_csv:
df = pd.read_csv(file_path, header=None, usecols=[3,6])
The following will add your current column labels as the last row of the dataframe. You could then move this row to position 0, or rename the columns, if necessary.
row = pd.Series(df.columns, index=df.columns)
df = pd.concat([df, row.to_frame().T], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
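If the labels should end up as the first row rather than the last, one possible variant is to build a one-row frame from them and concatenate it in front. This is only a sketch: the frame, the year-like labels, and the replacement names col_a and col_b are all hypothetical.

```python
import pandas as pd

# Hypothetical frame whose column labels are really data values.
df = pd.DataFrame([[1, 2], [3, 4]], columns=[2019, 2020])

# Turn the column labels into a one-row frame with the same columns.
header_row = pd.DataFrame([df.columns.tolist()], columns=df.columns)

# Put that row first, then replace the labels with placeholder names.
out = pd.concat([header_row, df], ignore_index=True)
out.columns = ["col_a", "col_b"]  # hypothetical names
```

Concatenating in this order avoids having to reorder rows afterwards.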
I have a dataframe that I've imported as follows.
df = pd.read_excel("./Data.xlsx", sheet_name="Customer Care", header=None)
I would like to set the first three rows as column headers but can't figure out how to do this. I gave the following a try:
df.columns = df.iloc[0:3,:]
but that doesn't seem to work.
I saw something similar in this answer. But it only applies if all sub columns are going to be named the same way, which is not necessarily the case.
Any recommendations would be appreciated.
df = pd.read_excel(
"./Data.xlsx",
sheet_name="Customer Care",
header=[0,1,2]
)
This tells pandas to read the first three rows of the Excel file as MultiIndex column labels.
If you want to modify the rows after loading them (with header=None, as in the question) and only then set them as the columns:
# set the first three rows as the columns
df.columns = pd.MultiIndex.from_arrays(df.iloc[0:3].values)
# delete the first three rows (because they are now also the column labels)
df = df.iloc[3:]
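The second approach can be tried on a small in-memory frame. The sketch below assumes the data was loaded with header=None; the header values ("Region", "North", "Calls", and so on) are invented and are not from the asker's file.

```python
import pandas as pd

# Hypothetical sheet as read with header=None: three header rows, then data.
raw = pd.DataFrame(
    [
        ["Region", "Region", "Totals"],
        ["North", "South", "All"],
        ["Calls", "Calls", "Calls"],
        [12, 7, 19],
        [15, 9, 24],
    ]
)

# Split the header rows from the data rows.
header_rows = raw.iloc[0:3]
body = raw.iloc[3:].reset_index(drop=True)

# Each header row becomes one level of the column MultiIndex.
body.columns = pd.MultiIndex.from_arrays(header_rows.values)

# A column is then addressed by a tuple with one entry per level.
north_calls = body[("Region", "North", "Calls")].tolist()
```

Because the header rows are still ordinary data before from_arrays is called, this is also the point where sub-column names could be cleaned up or filled in if they differ between groups.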
I have a word2vec dataframe like this, which was saved to a txt file using Gensim's save_word2vec_format. After reading the file with pandas, it looks like the picture below. How can I delete the first row and use it as the index?
My txt file: https://drive.google.com/file/d/1O206N93hPSmvMjwc0W5ATyqQMdMwhRlF/view?usp=sharing
Try this.
To use the column headers as the index:
_X_T.index=_X_T.columns
To use the first row as the index:
_X_T.index=_X_T.iloc[0]
Save the row:
new_index = df.iloc[0]
Drop it to avoid a length mismatch:
df.drop(df.index[0], inplace=True)
And set it as the index:
df.set_index(new_index, inplace=True)
You may get a SettingWithCopyWarning, but that's the most elegant solution I could come up with.
If you want to use the headers (and not the first row) as the index, do:
df.index = df.columns
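For comparison, here is a sketch of the same save, drop, and set sequence without inplace=True, working on an explicit copy. It assumes, as the steps above do, that the first row holds exactly one label per remaining row; the frame and its values are made up.

```python
import pandas as pd

# Hypothetical 4x3 frame: row 0 holds one label for each of the remaining rows.
df = pd.DataFrame(
    [
        ["cat", "dog", "fish"],
        [0.1, 0.2, 0.3],
        [0.4, 0.5, 0.6],
        [0.7, 0.8, 0.9],
    ],
    columns=["x", "y", "z"],
)

labels = df.iloc[0].tolist()  # save the first row
body = df.iloc[1:].copy()     # drop it; working on a copy avoids the warning
body.index = labels           # use the saved values as the row labels
```

If the number of saved labels does not match the number of remaining rows, the assignment to body.index raises a ValueError, which is the length mismatch mentioned above.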