Fetch non-empty cells from .xls file using pandas - python

I am new to Python. I want to fetch the values from the cells, and empty cells should be discarded.
I want to loop through the rows and columns and assign the values to a list.
import pandas as pd
from pandas import ExcelFile
from pandas import ExcelWriter
df=pd.read_excel('16Junedata_03062020_80163767_action_03062020_80163767_2624_01.xls', sheet_name='Sheet4')
#newdf = df.fillna({'business_day':0,'zone_id':0,'site_id':0,'device_id':0})
#newdf = df.fillna(method="ffill")
z_id= df['zone_id']
d_id= df['device_id']
s_id= df['site_id']
vst= df['visit_start_time']
# print(z_id)
# print(d_id)
# print(s_id)
for a, zone_id in z_id.items():  # Series.iteritems() was removed in pandas 2.0; items() replaces it
    for b, site_id in s_id.items():
        print(site_id)

You can get the list of non-NaN values with the code below; it handles a single column:
zone_id_updated = []
for index, value in df.zone_id.items():
    # keep only the cells that actually hold a value
    if not pd.isna(value):
        zone_id_updated.append(value)
The same can be done for the other columns.
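As an alternative to the explicit loop, pandas can drop the missing cells itself. The following is a minimal sketch, assuming a small in-memory DataFrame that stands in for the parsed sheet (the sample values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by pd.read_excel
df = pd.DataFrame({
    "zone_id": [1.0, np.nan, 3.0],
    "site_id": [np.nan, 20.0, 30.0],
})

# One list of non-empty values per column
non_empty = {col: df[col].dropna().tolist() for col in df.columns}
print(non_empty)
```

Each column is handled independently, so the lists can end up with different lengths.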

Related

How to skip rows while importing csv?

How can I skip rows based on a certain value in the first column of the dataset? For example, the first column has some unwanted content in the first few rows, and I want to skip those rows up to a trigger value. Please help me with importing the CSV in Python.
You can achieve this with the skiprows argument of read_csv.
Here is some sample code to start with:
import pandas as pd
df = pd.read_csv('users.csv', skiprows=<the row you want to skip>)
For a series of CSV files in a folder, you could use a for loop: read each CSV file, remove the rows containing the string from the df, and lastly concatenate it to df_overall.
Example:
import glob
import os
from pandas import DataFrame, concat, read_csv
df_overall = DataFrame()
dir_path = 'Insert your directory path'
for file_name in glob.glob(os.path.join(dir_path, '*.csv')):
    df = read_csv(file_name, header=None)
    # with header=None the columns are labelled 0, 1, 2, ...
    df = df[~df[<column_name>].str.contains("<your_string>")]
    df_overall = concat([df_overall, df])
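For the case in the question, where everything before a trigger value in the first column should be skipped, one possible sketch is to read the file without a header, locate the trigger, and slice from there. The helper name read_csv_from_trigger, the trigger "START", and the sample data are all assumptions made for illustration:

```python
import io
import pandas as pd

def read_csv_from_trigger(source, trigger):
    """Return the rows starting at the first row whose first column equals trigger."""
    raw = pd.read_csv(source, header=None, dtype=str)
    matches = raw.index[raw[0] == trigger]
    if len(matches) == 0:
        raise ValueError(f"trigger {trigger!r} not found in the first column")
    # Keep the trigger row and everything after it
    return raw.iloc[matches[0]:].reset_index(drop=True)

# Hypothetical file content: two junk rows, then the real data
sample = io.StringIO("junk,x\nmore junk,y\nSTART,1\na,2\nb,3\n")
trimmed = read_csv_from_trigger(sample, "START")
print(trimmed)
```

This reads the whole file first, which is simple but may be wasteful for very large files.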

Using pandas, how do I turn one csv file column into list and then filter a different csv with the created list?

Basically, I have one CSV file called 'Leads.csv' that contains all the sales leads we already have. I want to turn its 'Leads' column into a list, check a 'Report' CSV to see whether any of those leads are already in there, and filter them out.
Here's what I have tried:
import pandas as pd
df_leads = pd.read_csv('Leads.csv')
leads_list = df_leads['Leads'].values.tolist()
df = pd.read_csv('Report.csv')
df = df.loc[(~df['Leads'].isin(leads_list))]
df.to_csv('Filtered Report.csv', index=False)
Any help is much appreciated!
You can try:
import pandas as pd
df_leads = pd.read_csv('Leads.csv')
df = pd.read_csv('Report.csv')
set_filtered = set(df['Leads'])-(set(df_leads['Leads']))
df_filtered = df[df['Leads'].isin(set_filtered)]
Note: sets are significantly faster than lists for this kind of membership check.
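The filtering step can be tried without any files. This is a minimal sketch, assuming two small in-memory DataFrames in place of Leads.csv and Report.csv (the lead names and the Value column are invented for the example):

```python
import pandas as pd

# Hypothetical stand-ins for Leads.csv and Report.csv
df_leads = pd.DataFrame({"Leads": ["acme", "globex"]})
df_report = pd.DataFrame({
    "Leads": ["acme", "initech", "globex", "umbrella"],
    "Value": [10, 20, 30, 40],
})

# Keep only the report rows whose lead is not already known
known_leads = set(df_leads["Leads"])
df_filtered = df_report[~df_report["Leads"].isin(known_leads)]
print(df_filtered)
```

Negating isin on the known leads gives the same rows as building the set difference first.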

Add columns to existing CSV file

I am trying to add new columns to an existing csv that already has rows and columns that looks like this:
I would like it to append all the new column names to the columns after column 4.
The code I currently have is adding all the new columns to the bottom of the csv:
import csv
def extract_data_from_report3():
    # mode 'a' appends, so the new names end up as an extra row at the bottom
    with open('OMtest.csv', 'a', newline='') as f_out:
        writer = csv.writer(f_out)
        writer.writerow(
            ['OMGroup:OMRegister', 'OMGroup', 'OMRegister', 'RegisterType', 'Measures', 'Description', 'GeneratedOn'])
Is there any way to do this effectively?
You can use the pandas library, without iterating through the values yourself. Here is an example:
new_header = ['OMGroup:OMRegister', 'OMGroup', 'OMRegister', 'RegisterType', 'Measures', 'Description', 'GeneratedOn']
# Import pandas package
import pandas as pd
my_df = pd.read_csv(path_to_csv)
for column_name in new_header:
    new_column = [ ... your values ...]  # should be a list with one value per row of the dataframe
    my_df[column_name] = new_column
Keep in mind that each new column must have exactly as many values as the table has rows, otherwise the assignment fails.
If you only need to add the new columns without values, you can do this:
for column_name in new_header:
    new_column = ["" for i in range(len(my_df.index))]  # list of empty strings, one per row
    my_df[column_name] = new_column
Then you can write the CSV back like this:
my_df.to_csv(path_to_csv, index=False)  # index=False avoids writing the row index as an extra column
See the pandas documentation for details on the read_csv and to_csv methods.
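The approach can be checked end to end without touching a real file. The sketch below assumes a tiny four-column CSV held in memory and only two of the new column names; both are illustrative choices, not the asker's actual data:

```python
import io
import pandas as pd

new_header = ["OMGroup", "OMRegister"]

# Hypothetical existing CSV with four columns
csv_text = "a,b,c,d\n1,2,3,4\n5,6,7,8\n"
my_df = pd.read_csv(io.StringIO(csv_text))

# Assigning a scalar fills every row, so no per-row list is needed
for column_name in new_header:
    my_df[column_name] = ""

# Write to an in-memory buffer instead of a file for demonstration
out = io.StringIO()
my_df.to_csv(out, index=False)
result = out.getvalue()
print(result)
```

The new names appear in the header row after the existing columns rather than as a new row at the bottom.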

Comparing two Microsoft Excel files in Python

I have two Microsoft Excel files fileA.xlsx and fileB.xlsx
fileA.xlsx looks like this:
fileB.xlsx looks like this:
The Message section of a row can contain any type of character. For example: smileys, Arabic, Chinese, etc.
I would like to find and remove all rows from fileB which are already present in fileA. How can I do this in Python?
You can use pandas' merge to first get the rows that the two files have in common,
then use them as a filter.
import pandas as pd
df_A = pd.read_excel("fileA.xlsx", dtype=str)
df_B = pd.read_excel("fileB.xlsx", dtype=str)
df_new = df_A.merge(df_B, on = 'ID',how='outer',indicator=True)
df_common = df_new[df_new['_merge'] == 'both']
df_A = df_A[(~df_A.ID.isin(df_common.ID))]
df_B = df_B[(~df_B.ID.isin(df_common.ID))]
df_A and df_B now contain the rows from fileA and fileB respectively, without the rows common to both.
Hope this helps.
Here I am using pandas; you also need an Excel engine installed to open .xlsx files (openpyxl with current pandas; xlrd 2.0 and later reads only .xls).
The code takes the values from the second file that are not in the first file, then writes them to an Excel file with the second file's name, which overwrites the second file:
import pandas as pd
a = pd.read_excel('a.xlsx')
b = pd.read_excel('b.xlsx')
# Element-wise comparison: a and b must have identical row and column labels
diff = b[b!=a].dropna()
diff.to_excel("b.xlsx",sheet_name='Sheet1',index=False)
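When the goal is only to remove from fileB the rows that already exist in fileA, a left merge with an indicator column is one way to express it. This is a sketch under assumptions: the two DataFrames below are invented stand-ins for the Excel files, with an ID and a Message column, and a row counts as a duplicate only if all shared columns match:

```python
import pandas as pd

# Hypothetical stand-ins for fileA.xlsx and fileB.xlsx, read as strings
df_a = pd.DataFrame({"ID": ["1", "2"], "Message": ["hi", "مرحبا"]})
df_b = pd.DataFrame({"ID": ["2", "3"], "Message": ["مرحبا", "你好"]})

# Left merge on all shared columns; _merge marks where each row was found
merged = df_b.merge(df_a.drop_duplicates(), how="left", indicator=True)

# Rows marked left_only exist in B but not in A
only_in_b = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
print(only_in_b)
```

Because the comparison is on string values, non-Latin text in the Message column is handled like any other content.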

Read a specific column of a certain cell range and store the values using Pandas

I am trying to figure out a way to read data from a specific column from a certain cell range and store it into a array using pandas.
For example my Excel sheet consists of :
test | p
Food | Price
Chicken | 8.54
Beef | 6.73
Vegetables | 3.2
Total Price | 18.47
Note: there is an empty space on the first row for a reason.
Note: | indicates cell separation.
I am trying to get the price values, which run from B3 to B5, and store them in an array as [8.54,6.73,3.2].
So far the code I have is:
import pandas as pd
xl_workbook = pd.ExcelFile("readme.xlsx") # Load the excel workbook
df = xl_workbook.parse("Sheet1") # Parse the sheet into a dataframe
x1_list = df['p'].tolist() # Cast the desired column into a python list
print(x1_list)
Which then results in [nan, u'price',8.54,6.73,3.2]
If I just wanted to read the values 8.54, 6.73, and 3.2, to result in [8.54,6.73,3.2] how would I do this?
Is there a way to grab a certain column of a certain cell range?
As written, you could use read_excel in Pandas. This assumes you have consistent formatting.
import pandas as pd
# define the file name and "sheet name"
fn = 'Book1.xlsx'
sn = 'Sheet1'
# sheet_name and skipfooter are the current parameter names (formerly sheetname and skip_footer)
data = pd.read_excel(fn, sheet_name=sn, index_col=0, skiprows=1, header=0, skipfooter=1)
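If the sheet has already been parsed into a DataFrame, the wanted cells can also be picked out by position. The sketch below assumes an in-memory DataFrame that imitates the parsed sheet, with a blank row and a text row before the numbers; the exact row positions are an assumption and would need adjusting to the real file:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the parsed sheet
df = pd.DataFrame({
    "test": [np.nan, "Food", "Chicken", "Beef", "Vegetables", "Total Price"],
    "p": [np.nan, "Price", 8.54, 6.73, 3.2, 18.47],
})

# Positions 2 to 4 hold the three prices; the slice end is exclusive
prices = df["p"].iloc[2:5].astype(float).tolist()
print(prices)
```

read_excel also accepts usecols, skiprows and nrows, which can restrict the read itself to a column and row range instead of slicing afterwards.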
