Skip initial empty rows and columns while reading in pandas - python

I have an Excel sheet like the one below.
I have to read the Excel file and do some operations. The problem is that I have to skip the empty rows and columns. In the example above it should read only the range B3:D6, but with the code below it reads all the empty rows as well.
Code I'm using:
import pandas as pd
user_input = input("Enter the path of your file: ")
user_input_sheet_master = input("Enter the Sheet name : ")
master = pd.read_excel(user_input, sheet_name=user_input_sheet_master)
print(master.head(5))
How do I ignore the empty rows and columns to get the output below?
ColA ColB ColC
0 10 20 30
1 23 NaN 45
2 NaN 30 50
Based on some research I have tried df.dropna(how='all'), but it also deleted ColA and ColB. I cannot hardcode a value for skiprows or skip columns because the file may not have the same layout every time; the number of rows and columns to be skipped may vary. Sometimes there may not be any empty rows or columns at all, in which case nothing should be deleted.

You need dropna, applied once to rows and once to columns:
df = df.dropna(how='all').dropna(axis=1, how='all')
EDIT:
If the header row itself is preceded by empty rows and columns, read the file without a header, drop the empty rows and columns, and then promote the first remaining row to the header:
df = pd.read_excel('tst1.xlsx', header=None)
df = df.dropna(how='all').dropna(how='all', axis=1)
headers = df.iloc[0]
new_df = pd.DataFrame(df.values[1:], columns=headers)
Starting from files with different numbers of leading empty rows and columns, exactly the same code produces the same new_df in each case.
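The whole pipeline can be checked without an Excel file by building the frame in memory (the ColA/ColB/ColC layout below mirrors the question's example; the file-based version differs only in the read_excel call):

```python
import numpy as np
import pandas as pd

# Simulate a sheet whose data starts at B3: two empty leading rows
# and one empty leading column, as in the question.
raw = pd.DataFrame([
    [np.nan, np.nan,  np.nan, np.nan],
    [np.nan, np.nan,  np.nan, np.nan],
    [np.nan, 'ColA',  'ColB', 'ColC'],
    [np.nan, 10,      20,     30],
    [np.nan, 23,      np.nan, 45],
    [np.nan, np.nan,  30,     50],
])

# Drop fully-empty rows, then fully-empty columns
df = raw.dropna(how='all').dropna(how='all', axis=1)

# Promote the first remaining row to the header
headers = df.iloc[0]
new_df = pd.DataFrame(df.values[1:], columns=headers)
print(new_df)
```

Rows that are only partially empty (like the 23/NaN/45 row) survive, because how='all' drops a row or column only when every cell in it is missing.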


Creating a dataframe from several .txt files - each file being a row with 25 values

So, I have 7200 txt files, each with 25 lines. I would like to create a dataframe from them, with 7200 rows and 25 columns: each line of a .txt file would become a value in a column.
For that, I first created a list column_names with length 25, and tested importing one single .txt file.
However, when I try this:
pd.read_csv('Data/fake-meta-information/1-meta.txt', delim_whitespace=True, names=column_names)
I get a 25x25 dataframe, with values only in the first column. How do I read this so that the txt lines end up as values across the columns, instead of everything landing in the first column and creating 25 rows?
My next step would be creating a for loop to append each text file as a new row.
Probably something like this:
dir1 = *folder_path*
list = os.listdir(dir1)
number_files = len(list)
for i in range(number_files):
    title = list[i]
    df_temp = pd.read_csv(dir1 + title, delim_whitespace=True, names=column_names)
    df = df.append(df_temp, ignore_index=True)
I hope I have been clear. Thank you all in advance!
read_csv generates a row per line in the source file but you want them to be columns. You could read the rows and pivot to columns, but since these files have a single value per line, you can just read them in numpy and use each resulting array as a row in a dataframe.
import numpy as np
import pandas as pd
from pathlib import Path
dir1 = Path(".")
df = pd.DataFrame([np.loadtxt(filename) for filename in dir1.glob("*.txt")])
print(df)
tdelaney's answer is probably "better" than mine, but if you want to keep your code more stylistically closer to what you are currently doing the following is another option.
You are getting your current output (25x25 with data in the first column only) because your read data is 25x1 but you are forcing the dataframe to have 25 columns with your names=column_names parameter.
To solve, just wait until the end to apply the column names:
Get a 25x1 df (drop the names param and add header=None so the first value is not consumed as a header):
df_temp = pd.read_csv(dir1 + title, delim_whitespace=True, header=None)
Collect each 25x1 df in a list and concatenate them column-wise into a 25x7200 df: df = pd.concat(df_list, axis=1)
Transpose the df, forming the final 7200x25 df: df = df.T
Add column names: df.columns = column_names
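A runnable sketch of those steps, substituting a temporary directory of three 4-line fake files for the real 7200x25 folder (the file names and sizes here are made up for the demo):

```python
import tempfile
from pathlib import Path
import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    dir1 = Path(tmp)
    # Create 3 fake files, each holding 4 values, one per line
    for i in range(3):
        (dir1 / f"{i}-meta.txt").write_text(
            "\n".join(str(i * 10 + j) for j in range(4)))

    # Step 1: read each file as a 4x1 frame (header=None keeps line 1 as data)
    df_list = [pd.read_csv(f, header=None) for f in sorted(dir1.glob("*.txt"))]

    # Steps 2-3: concatenate column-wise, then transpose to one row per file
    df = pd.concat(df_list, axis=1, ignore_index=True).T.reset_index(drop=True)

    # Step 4: apply the column names last
    df.columns = [f"col{j}" for j in range(df.shape[1])]
    print(df)
```

Applying the names only at the end avoids the 25x25 problem entirely, because no read ever has more columns than data.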

How to drop rows in a data frame having any entry with the value zero? ('collections.OrderedDict' object has no attribute 'dropna')

I am trying to drop all rows from dataframe where any entry in any column of the row has the value zero.
I am placing a Minimal Working Example below
import pandas as pd
df = pd.read_excel('trial.xlsx',sheet_name=None)
df
I am getting the dataframe as follows
OrderedDict([('Sheet1', type query answers
0 abc 100 90
1 def 0 0
2 ghi 0 0
3 jkl 5 1
4 mno 1 1)])
I am trying to remove the rows using the dropna() using the following code.
df = df.dropna()
df
I am getting an error saying 'collections.OrderedDict' object has no attribute 'dropna'. I tried going through the various answers provided here and here, but the error remains.
Any help would be greatly appreciated!
The reason why you are getting an OrderedDict object is that you are passing sheet_name=None to the read_excel method. This loads all the sheets into a dictionary of DataFrames, keyed by sheet name.
If you only need the one sheet, specify it in the sheet_name parameter, otherwise remove it to read the first sheet.
import pandas as pd
df = pd.read_excel('trial.xlsx') #without sheet_name will read first sheet
print(type(df))
df = df.dropna()
or
import pandas as pd
df = pd.read_excel('trial.xlsx', sheet_name='Sheet1') #reads specific sheet
print(type(df))
df = df.dropna()
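Separately, note that dropna only removes missing (NaN) values; it will not touch the zero rows the question title asks about. To drop every row that contains a zero in any column, a boolean mask works. A sketch on data shaped like the question's sheet:

```python
import pandas as pd

df = pd.DataFrame({
    'type':    ['abc', 'def', 'ghi', 'jkl', 'mno'],
    'query':   [100, 0, 0, 5, 1],
    'answers': [90, 0, 0, 1, 1],
})

# Keep only the rows where every cell is non-zero
df = df[(df != 0).all(axis=1)]
print(df)
```

(df != 0) compares every cell to 0 elementwise; string cells like 'abc' simply compare unequal, so text columns are unaffected by the filter.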

Pandas - Write multiple dataframes to single excel sheet

I have a dataframe with 45 columns and 1000 rows. My requirement is to create a single excel sheet with the top 2 values of each column and their percentages (suppose col 1 has the value 'python' present 500 times in it, the percentage should be 50)
I used:
writer = pd.ExcelWriter('abc.xlsx')
df = pd.read_sql('select * from table limit 1000', <db connection sring>)
column_list = df.columns.tolist()
df.fillna("NULL", inplace = True)
for obj in column_list:
    df1 = pd.DataFrame(df[obj].value_counts().nlargest(2))
    df1.to_excel(writer, sheet_name=obj)
writer.save()
This writes the output in separate excel tabs of the same document. I need them in a single sheet in the below format:
Column Name Value Percentage
col1 abc 50
col1 def 30
col2 123 40
col2 456 30
....
Let me know any other functions as well to get to this output.
The first thing that jumps out to me is that you are changing the sheet name each time with sheet_name=obj. If you get rid of that, that alone might fix your problem.
If not, I would suggest concatenating the results into one large DataFrame and then writing that DataFrame to Excel.
df_master = None
for obj in column_list:
    df1 = pd.DataFrame(df[obj].value_counts().nlargest(2))
    if df_master is None:
        df_master = df1
    else:
        df_master = pd.concat([df_master, df1])
df_master.to_excel("abc.xlsx")
Here's more information on stacking/concatenating dataframes in Pandas
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
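To get the Column Name / Value / Percentage layout from the question directly, value_counts(normalize=True) already yields the shares. A sketch on made-up data (the column names and the 50/30 split are illustrative, not from the real table):

```python
import pandas as pd

# Toy stand-in for the 1000-row query result
df = pd.DataFrame({
    'col1': ['python'] * 5 + ['java'] * 3 + ['go'] * 2,
    'col2': ['abc'] * 4 + ['def'] * 3 + ['xyz'] * 3,
})

rows = []
for col in df.columns:
    # Top-2 values of this column, as fractions of the row count
    top2 = df[col].value_counts(normalize=True).nlargest(2)
    for value, share in top2.items():
        rows.append({'Column Name': col,
                     'Value': value,
                     'Percentage': round(share * 100)})

summary = pd.DataFrame(rows)
print(summary)
# summary.to_excel('abc.xlsx', index=False) would then write a single sheet
```

Building a list of plain dicts and creating the frame once at the end sidesteps the repeated-concat pattern entirely.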

How to preserve the format when writing to csv using pandas?

I have a text file like this:
id,name,sex,
1,Sam,M,
2,Ann,F,
3,Peter,
4,Ben,M,
Then, I read the file:
df = pd.read_csv('data.csv')
After that, I write it to another file:
df.to_csv('new_data.csv', index = False)
Then, I get
id,name,sex,Unnamed: 3
1,Sam,M,
2,Ann,F,
3,Peter,,
4,Ben,M,
You see that there are two commas instead of one in the fourth line.
How to preserve the format when using pd.to_csv?
pandas is preserving the format - the 3rd row has no sex, and as such the csv should have an empty column - that is why you get two commas, since you are separating an empty column.
Your original text file was not a valid csv file.
What you want to do is something else, which is not writing a valid csv file - you will have to do this yourself; I do not know of any existing method that creates your format.
The problem in your code is that you have a comma after the sex column in your file, so read_csv thinks it is an extra column, one with no name and no data.
df= pd.read_csv('data.csv')
df
id name sex Unnamed: 3
0 1 Sam M NaN
1 2 Ann F NaN
2 3 Peter NaN NaN
3 4 Ben M NaN
Hence you have an extra Unnamed column. So when you write with to_csv, it adds two empty values in the 3rd row, and that is why you see two commas.
Try:
df = pd.read_csv('data.csv', usecols=['id', 'name', 'sex'])
df.to_csv('new_data.csv', index = False)
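A quick in-memory check of that fix, with io.StringIO standing in for data.csv:

```python
import io
import pandas as pd

data = "id,name,sex,\n1,Sam,M,\n2,Ann,F,\n3,Peter,\n4,Ben,M,\n"

# usecols drops the phantom "Unnamed: 3" column created by the trailing commas
df = pd.read_csv(io.StringIO(data), usecols=['id', 'name', 'sex'])
out = df.to_csv(index=False)
print(out)
```

The rewritten file keeps a trailing comma only on Peter's row, where the sex value is genuinely missing, which is exactly what the empty field encodes.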

How to remove index from a created Dataframe in Python?

I have created a Dataframe df by merging 2 lists using the following command:
import pandas as pd
df=pd.DataFrame({'Name' : list1,'Probability' : list2})
But I'd like to remove the first column (the index column) and make the column called Name the first column. I tried using del df['index'] and index_col=0, but they didn't work. I also checked reset_index(), and that is not what I need. I would like to completely remove the whole index column from a DataFrame that has been created like this (as mentioned above). Someone please help!
You can use set_index:
import pandas as pd
list1 = [1,2]
list2 = [2,5]
df=pd.DataFrame({'Name' : list1,'Probability' : list2})
print (df)
Name Probability
0 1 2
1 2 5
df.set_index('Name', inplace=True)
print (df)
Probability
Name
1 2
2 5
If you need also remove index name:
df.set_index('Name', inplace=True)
#pandas 0.18.0 and higher
df = df.rename_axis(None)
#pandas below 0.18.0
#df.index.name = None
print (df)
Probability
1 2
2 5
If you want to save your dataframe to a spreadsheet for a report, it is possible to format the dataframe to eliminate the index column using xlsxwriter.
writer = pd.ExcelWriter("Probability" + ".xlsx", engine='xlsxwriter')
df.to_excel(writer, sheet_name='Probability', startrow=3, startcol=0, index=False)
writer.save()
index=False will then save your dataframe without the index column.
I use this all the time when building reports from my dataframes.
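The same index=False switch works for the plain-text writers too; a quick check with to_csv:

```python
import pandas as pd

df = pd.DataFrame({'Name': [1, 2], 'Probability': [2, 5]})

# index=False drops the 0, 1, ... index column from the output
print(df.to_csv(index=False))
```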
I think the best way is to hide the index using the hide_index method:
df = df.style.hide_index()
This hides the index when the dataframe is displayed or exported. Note that the result is a Styler object, not a DataFrame, so use it only for presentation.
