Exclude column from being read using pd.ExcelFile().parse() - python

I would like to exclude certain columns from being read when using pd.ExcelFile('my.xls').parse()
Excel file I am trying to parse has too many columns to list them all in usecols argument since I only need to get rid off a single column that is causing trouble.
Is there like a simple way to ~ invert list (I know you can't do that) passed to usecols or something?

We can usually do
head = list(pd.read_csv('your.xls', nrows = 1))
df = pd.read_excel('your.xls', usecols = [col for col in head if col != 'the one drop']))
However , why not read whole file then drop it
df = pd.read_excel('your.xls').drop('the col drop', axis = 1)

Related

Summary Row for a pd.DataFrame with multiindex

I have a multiIndex dataframe created with pandas similar to this one:
nest = {'A1': dfx[['aa','bb','cc']],
'B1':dfx[['dd']],
'C1':dfx[['ee', 'ff']]}
reform = {(outerKey, innerKey): values for outerKey, innerDict in nest.items() for innerKey, values in innerDict.items()}
dfzx = pd.DataFrame(reform)
What I am trying to achieve is to add a new row at the end of the dataframe that contains a summary of the total for the three categories represented by the new index (A1, B1, C1).
I have tried with df.loc (what I would normally use in this case) but I get error. Similarly for iloc.
a1sum = dfzx['A1'].sum().to_list()
a1sum = sum(a1sum)
b1sum = dfzx['B1'].sum().to_list()
b1sum = sum(b1sum)
c1sum = dfzx['C1'].sum().to_list()
c1sum = sum(c1sum)
totalcat = a1sum, b1sum, c1sum
newrow = ['Total', totalcat]
newrow
dfzx.loc[len(dfzx)] = newrow
ValueError: cannot set a row with mismatched columns
#Alternatively
newrow2 = ['Total', a1sum, b1sum, c1sum]
newrow2
dfzx.loc[len(dfzx)] = newrow2
ValueError: cannot set a row with mismatched columns
How can I fix the mistake? Or else is there any other function that would allow me to proceed?
Note: the DF is destined to be moved on an Excel file (I use ExcelWriter).
The type of results I want to achieve in the end is this one (gray row "SUM"
I came up with a sort of solution on my own.
I created a separate DataFrame in Pandas that contains the summary.
I used ExcelWriter to have both dataframes on the same excel worksheet.
Technically It would be then possible to style and format data in Excel (xlsxwriter or framestyle seem to be popular modules to do so). Alternatively one should be doing that manually.

How to extract inside of column to several columns

I have excel file and import to dataframe. I want to extract inside of column to several columns.
Here is original
After importing to pandas in python, I get this data with '\n'
So, I want to extract inside of column. Could you all share idea or code?
My expected columns are....
Don't worry no one is born knowing everything about SO. Considering the data you gave, specially that 'Vector:...' is not separated by '\n', the following works:
import pandas as pd
import numpy as np
data = pd.read_excel("the_data.xlsx")
ok = []
l = len(data['Details'])
for n in range(l):
x = data['Details'][n].split()
x[2] = x[2].lstrip('Vector:')
x = [v for v in x if v not in ['Type:', 'Mission:']]
ok += x
values = np.array(ok).reshape(l, 3)
df = pd.DataFrame(values, columns=['Type', 'Vector', 'Mission'])
data.drop('Details', axis=1, inplace=True)
final = pd.concat([data, df], axis=1)
The process goes like this:
First you split all elements of the Details columns as a list of strings. Second you deal with the 'Vector:....' special case and filter column names. Third you store all the values in a list which will inturn be converted to a numpy array with shape (length, 3). Finally you drop the old 'Details' column and perform a concatenation with the df created from splited strings.
You may want to try a more efficient way to transform your data when reading by trying to use this ideas inside the pd.read_excel method using converters

Missing values in pandas column multiindex

I am reading with pandas excel sheets like this one:
using
df = pd.read_excel('./question.xlsx', sheet_name = None, header = [0,1])
which results in multiindex dataframe with multiindex.
What poses a problem here is that the empty fields are filled by default with 'Title', whereas I would prefer to use a distinct label. I cannot skip the first row since I am dealing with bigger data frames where the first and the second rows contain repeating labels (hence the use of the multiindex).
Your help will be much appreciated.
Assuming that you want to have empty strings instead of repeating the first label, you can read the 2 lines and build the MultiIndex directly:
df1 = pd.read_excel('./question.xlsx', header = None, nrows=2).fillna('')
index = pd.MultiIndex.from_arrays(df1.values)
it gives:
MultiIndex([('Title', '#'),
( '', 'Price'),
( '', 'Quantity')],
)
By the way, if you wanted a different label for empty fields, you can just use it as the parameter for fillna.
Then, you just read the remaining data, and set the index by hand:
df1 = pd.read_excel('./question.xlsx', header = None, skiprows=2)
df1.columns = index

Selecting Columns by a string - Pytables

I know we can access columns using table.cols.somecolumn, but I need to apply the same operation on 10-15 columns of my table. So I'd like an iterative solution. I have the names of the columns as strings in a list : ['col1','col2','col3'].
So I'm looking for something along the lines of:
for col in columnlist:
thiscol = table.cols[col]
#apply whatever operation
Try this:
columnlist = ['col1','col2','col3']
for col in columnlist:
thiscol = getattr(table.cols, col)

Removing index column in pandas when reading a csv

I have the following code which imports a CSV file. There are 3 columns and I want to set the first two of them to variables. When I set the second column to the variable "efficiency" the index column is also tacked on. How can I get rid of the index column?
df = pd.DataFrame.from_csv('Efficiency_Data.csv', header=0, parse_dates=False)
energy = df.index
efficiency = df.Efficiency
print efficiency
I tried using
del df['index']
after I set
energy = df.index
which I found in another post but that results in "KeyError: 'index' "
When writing to and reading from a CSV file include the argument index=False and index_col=False, respectively. Follows an example:
To write:
df.to_csv(filename, index=False)
and to read from the csv
df.read_csv(filename, index_col=False)
This should prevent the issue so you don't need to fix it later.
df.reset_index(drop=True, inplace=True)
DataFrames and Series always have an index. Although it displays alongside the column(s), it is not a column, which is why del df['index'] did not work.
If you want to replace the index with simple sequential numbers, use df.reset_index().
To get a sense for why the index is there and how it is used, see e.g. 10 minutes to Pandas.
You can set one of the columns as an index in case it is an "id" for example.
In this case the index column will be replaced by one of the columns you have chosen.
df.set_index('id', inplace=True)
If your problem is same as mine where you just want to reset the column headers from 0 to column size. Do
df = pd.DataFrame(df.values);
EDIT:
Not a good idea if you have heterogenous data types. Better just use
df.columns = range(len(df.columns))
you can specify which column is an index in your csv file by using index_col parameter of from_csv function
if this doesn't solve you problem please provide example of your data
One thing that i do is df=df.reset_index()
then df=df.drop(['index'],axis=1)
To remove or not to create the default index column, you can set the index_col to False and keep the header as Zero. Here is an example of how you can do it.
recording = pd.read_excel("file.xls",
sheet_name= "sheet1",
header= 0,
index_col= False)
The header = 0 will make your attributes to headers and you can use it later for calling the column.
It works for me this way:
Df = data.set_index("name of the column header to start as index column" )

Categories