pandas.read_excel with identical column names in excel - python

When I import an Excel table with pandas.read_excel there is a problem (or a feature :-) ) with identical column names. For example, if the Excel file has two columns named "dummy", after the import into a dataframe the second column is named "dummy.1".
Is there a way to import without the renaming?

I don't really see why you would want this, but since I could think of a workaround I might as well post it.
import pandas as pd
cols = pd.read_excel('text.xlsx', header=None, nrows=1).values[0]  # read the first row only
df = pd.read_excel('text.xlsx', header=None, skiprows=1)  # skip that row when reading the data
df.columns = cols
print(df)
Returns:
   col1  col1
0     1     1
1     2     2
2     3     3
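If this comes up often, the two-step read can be wrapped in a small helper. A minimal sketch, assuming a single header row (the function name read_excel_keep_dupes is mine, not from the answer):
import pandas as pd

def read_excel_keep_dupes(path):
    # Grab the header row exactly as written, duplicates included.
    cols = pd.read_excel(path, header=None, nrows=1).values[0]
    # Read the data separately, skipping the header row.
    df = pd.read_excel(path, header=None, skiprows=1)
    df.columns = cols
    return df

print(read_excel_keep_dupes('text.xlsx'))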


Skip initial empty rows and columns while reading in pandas

I have an Excel file like the one below (shown as a screenshot in the original post; the data block occupies B3:D6).
I have to read the Excel file and do some operations. The problem is that I have to skip the empty rows and columns: it should read only B3:D6, but with the code below it picks up all the empty rows and columns as well.
The code I'm using:
import pandas as pd
user_input = input("Enter the path of your file: ")
user_input_sheet_master = input("Enter the Sheet name : ")
master = pd.read_excel(user_input, sheet_name=user_input_sheet_master)
print(master.head(5))
How can I ignore the empty rows and columns to get the output below?
  ColA  ColB  ColC
0   10    20    30
1   23   NaN    45
2  NaN    30    50
Based on some research I have tried df.dropna(how='all'), but it also deleted ColA and ColB. I cannot hardcode values for skiprows or skip columns because the format may not be the same every time; the number of rows and columns to be skipped may vary. Sometimes there may not be any empty rows or columns at all, in which case nothing needs to be deleted.
You do need dropna here:
df = df.dropna(how='all').dropna(axis=1, how='all')
EDIT:
Suppose the input file has its data block offset by empty leading rows and columns (the original answer showed the files as screenshots). Then this code:
df = pd.read_excel('tst1.xlsx', header=None)
df = df.dropna(how='all').dropna(how='all', axis=1)
headers = df.iloc[0]
new_df = pd.DataFrame(df.values[1:], columns=headers)
strips the empty border rows and columns and promotes the first remaining row to the header. The original answer demonstrated this on three differently offset files; all three produce the same new_df.
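If you need this regularly, the steps can be bundled into one reusable helper. A minimal sketch, assuming the header row is always the first non-empty row (the function name read_excel_trimmed is mine, not from the answer):
import pandas as pd

def read_excel_trimmed(path, **kwargs):
    # Read without headers so empty leading rows/columns survive as NaN.
    df = pd.read_excel(path, header=None, **kwargs)
    # Drop rows and columns that are entirely empty.
    df = df.dropna(how='all').dropna(how='all', axis=1)
    # Promote the first remaining row to the header.
    headers = df.iloc[0]
    return pd.DataFrame(df.values[1:], columns=headers)

master = read_excel_trimmed('tst1.xlsx')
print(master.head(5))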

How to drop rows in a data frame having any entry with the value zero? ('collections.OrderedDict' object has no attribute 'dropna')

I am trying to drop all rows from dataframe where any entry in any column of the row has the value zero.
I am placing a Minimal Working Example below
import pandas as pd
df = pd.read_excel('trial.xlsx',sheet_name=None)
df
I am getting the dataframe as follows
OrderedDict([('Sheet1',   type  query  answers
0  abc    100       90
1  def      0        0
2  ghi      0        0
3  jkl      5        1
4  mno      1        1)])
I am trying to remove the rows using dropna() with the following code.
df = df.dropna()
df
I am getting an error saying 'collections.OrderedDict' object has no attribute 'dropna'. I tried going through the various answers provided here and here, but the error remains.
Any help would be greatly appreciated!
The reason you are getting an OrderedDict is that you are passing sheet_name=None to read_excel, which loads all the sheets into an ordered dictionary of DataFrames.
If you only need one sheet, specify it via the sheet_name parameter; otherwise remove the parameter to read the first sheet.
import pandas as pd
df = pd.read_excel('trial.xlsx')  # without sheet_name, reads the first sheet
print(type(df))
df = df.dropna()
or
import pandas as pd
df = pd.read_excel('trial.xlsx', sheet_name='Sheet1')  # reads the specified sheet
print(type(df))
df = df.dropna()
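Note that dropna() removes missing values (NaN), not zeros, so it does not by itself answer the title question. To drop every row that has a zero in any column, a boolean mask works; a minimal sketch on the data above:
import pandas as pd

df = pd.read_excel('trial.xlsx', sheet_name='Sheet1')
# Keep only the rows whose numeric entries are all non-zero.
numeric = df.select_dtypes(include=['number'])
df = df[(numeric != 0).all(axis=1)]
print(df)  # abc, jkl and mno remain; def and ghi are dropped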

Pandas drop first columns after csv read

Is there a way to reference an object within the line of its instantiation?
See the following example:
I wanted to drop the first column (by index) of a CSV file just after reading it (df.to_csv usually writes the index as the first column):
df = pd.read_csv(csvfile).drop(self.columns[[0]], axis=1)
I understand self belongs in a class context; here it just stands for the object being created, which is what I intend to do.
(Of course, doing this operation in two separate lines works perfectly.)
One way is to use pd.DataFrame.iloc:
import pandas as pd
from io import StringIO
mystr = StringIO("""col1,col2,col3
a,b,c
d,e,f
g,h,i
""")
df = pd.read_csv(mystr).iloc[:, 1:]
#   col2 col3
# 0    b    c
# 1    e    f
# 2    h    i
Assuming you know the total number of columns in the dataset and the indexes you want to remove:
a = list(range(3))  # all column indices; range objects have no .remove in Python 3
a.remove(1)         # drop the 2nd column (index 1)
df = pd.read_csv('test.csv', usecols=a)
Here 3 is the total number of columns and I wanted to remove the 2nd column. You can also write the indices of the columns to keep directly.
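If the unwanted first column is simply the index that a previous to_csv call wrote out, it may be cleaner to read it back as the index rather than dropping it; a minimal sketch:
import pandas as pd

# Treat the file's first column as the index instead of data.
df = pd.read_csv('test.csv', index_col=0)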

pandas multiple separator not working

I'm having an issue importing a dataset with multiple separators. The files are mostly tab separated, but there is a single column that has around 700 values that are all semi-colon delimited.
I saw a previous, similar question where the solution was simply to specify multiple separators via the sep argument:
dforigin = pd.read_csv(filename, header=0, skiprows=6,
                       skipfooter=1, sep='\t|;', engine='python')
This does not work for some reason; the result just looks like a mess. Up to this point my workaround has been to import the file as tab-separated, cut out the offending column ('emg data', which in the question's screenshot sits just to the right of the last visible column), save it as a temporary .csv, re-import it, and then append it to the initial dataframe.
My workaround feels a bit sloppy and I'm wondering if anybody can help make it a cleaner process.
IIUC, you want the semicolon-delimited values from that one column to each occupy a column in your data frame, alongside the other initial columns from your file. In that case, I'd suggest you read in the file with sep='\t' and then split out the semicolon column afterwards.
With sample data:
data = {'foo':[1,2,3], 'bar':['a;b;c', 'i;j;k', 'x;y;z']}
df = pd.DataFrame(data)
df
     bar  foo
0  a;b;c    1
1  i;j;k    2
2  x;y;z    3
Concat df with a new data frame built from the split semicolon column:
pd.concat([df.drop('bar', axis=1),
           df.bar.str.split(";", expand=True)], axis=1)
   foo  0  1  2
0    1  a  b  c
1    2  i  j  k
2    3  x  y  z
Note: If your actual data don't include a column name for the semicolon-separated column, but if it's definitely the last column in the table, then per unutbu's suggestion, replace df.bar with df.iloc[:, -1].
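Putting it together for the file described in the question, something like the sketch below; the literal column name 'emg data', the file name, and the generated emg_0, emg_1, ... names are assumptions based on the question:
import pandas as pd

# Read the file as plain tab-separated; the 'emg data' column stays one string.
df = pd.read_csv('data.txt', header=0, skiprows=6, skipfooter=1,
                 sep='\t', engine='python')

# Split the semicolon-packed column into real columns and name them.
emg = df['emg data'].str.split(';', expand=True)
emg.columns = ['emg_{}'.format(i) for i in emg.columns]

df = pd.concat([df.drop('emg data', axis=1), emg], axis=1)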

How to remove index from a created Dataframe in Python?

I have created a Dataframe df by merging 2 lists using the following command:
import pandas as pd
df = pd.DataFrame({'Name': list1, 'Probability': list2})
But I'd like to remove the first column (the index column) and make the column called Name the first one. I tried del df['index'] and index_col=0, but they didn't work. I also checked reset_index(), and that is not what I need. I would like to completely remove the index column from a dataframe created like this. Someone please help!
You can use set_index (see the docs):
import pandas as pd
list1 = [1, 2]
list2 = [2, 5]
df = pd.DataFrame({'Name': list1, 'Probability': list2})
print(df)
   Name  Probability
0     1            2
1     2            5
df.set_index('Name', inplace=True)
print(df)
      Probability
Name
1               2
2               5
If you also need to remove the index name:
df.set_index('Name', inplace=True)
# pandas 0.18.0 and higher
df = df.rename_axis(None)
# pandas below 0.18.0
# df.index.name = None
print(df)
   Probability
1            2
2            5
If you want to save your dataframe to a spreadsheet for a report, you can format it to omit the index column using xlsxwriter.
writer = pd.ExcelWriter("Probability.xlsx", engine='xlsxwriter')
df.to_excel(writer, sheet_name='Probability', startrow=3, startcol=0, index=False)
writer.save()
index=False will then save your dataframe without the index column.
I use this all the time when building reports from my dataframes.
I think the best way is to hide the index using the hide_index method:
df = df.style.hide_index()
This hides the index when the dataframe is rendered (e.g. in a notebook); note that df is now a Styler object, not a DataFrame.
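If you only need a printed or text version of the frame without the index, to_string also accepts index=False; a minimal sketch:
import pandas as pd

df = pd.DataFrame({'Name': [1, 2], 'Probability': [2, 5]})
# Render as text without the index column.
print(df.to_string(index=False))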
