Comparing two Microsoft Excel files in Python

I have two Microsoft Excel files fileA.xlsx and fileB.xlsx
fileA.xlsx looks like this: (screenshot omitted)
fileB.xlsx looks like this: (screenshot omitted)
The Message section of a row can contain any type of character. For example: smileys, Arabic, Chinese, etc.
I would like to find and remove all rows from fileB which are already present in fileA. How can I do this in Python?

You can use pandas' merge to first get the rows that appear in both files, then use them as a filter.
import pandas as pd

df_A = pd.read_excel("fileA.xlsx", dtype=str)
df_B = pd.read_excel("fileB.xlsx", dtype=str)

# an outer merge with indicator=True marks each row as 'left_only', 'right_only', or 'both'
df_new = df_A.merge(df_B, on='ID', how='outer', indicator=True)
df_common = df_new[df_new['_merge'] == 'both']

# drop the common rows from both frames
df_A = df_A[~df_A.ID.isin(df_common.ID)]
df_B = df_B[~df_B.ID.isin(df_common.ID)]
df_A and df_B now contain the rows from fileA and fileB, respectively, without the rows common to both.
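To actually remove those rows from fileB on disk, you can then write df_B back out; a minimal sketch (overwriting fileB.xlsx is just one option):
# index=False keeps pandas from writing an extra index column
df_B.to_excel("fileB.xlsx", index=False)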
Hope this helps.

Here I'm trying with pandas, and you also have to install xlrd (or openpyxl on recent pandas versions) for opening .xlsx files. It takes the values from the second file that are not in the first file, then writes the result to an Excel file with the second file's name, overwriting it:
import pandas as pd

a = pd.read_excel('a.xlsx')
b = pd.read_excel('b.xlsx')

# element-wise comparison: this assumes both files have the same shape,
# column names, and row order
diff = b[b != a].dropna()
diff.to_excel("b.xlsx", sheet_name='Sheet1', index=False)
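Note that b[b != a] compares element-wise, so it only behaves as expected when both files have the same shape, columns, and row order. A more robust sketch (assuming only that both files share the same columns) keeps the rows of b that appear nowhere in a:
import pandas as pd

a = pd.read_excel('a.xlsx')
b = pd.read_excel('b.xlsx')

# a left merge with indicator=True marks rows found only in b as 'left_only'
merged = b.merge(a.drop_duplicates(), how='left', indicator=True)
diff = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
diff.to_excel("b.xlsx", sheet_name='Sheet1', index=False)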

Related

How can I filter a csv file based on its columns in python?

I have a CSV file with over 5,000,000 rows of data that looks like this (except that it is in Farsi):
Contract Code,Contract Type,State,City,Property Type,Region,Usage Type,Area,Percentage,Price,Price per m2,Age,Frame Type,Contract Date,Postal Code
765720,Mobayee,East Azar,Kish,Apartment,,Residential,96,100,570000,5937.5,36,Metal,13890107,5169614658
766134,Mobayee,East Azar,Qeshm,Apartment,,Residential,144.5,100,1070000,7404.84,5,Concrete,13890108,5166884645
766140,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,100,1050000,7266.44,5,Concrete,13890108,5166884645
766146,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,100,700000,4844.29,5,Concrete,13890108,5166884645
766147,Mobayee,East Azar,Kish,Apartment,,Residential,144.5,100,1625000,11245.67,5,Concrete,13890108,5166884645
770822,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,50,500000,1730.1,5,Concrete,13890114,5166884645
I would like to write code to treat the first row as the header, then extract the data for two specific cities (Kish and Qeshm) and save it into a new CSV file. Something like this one:
Contract Code,Contract Type,State,City,Property Type,Region,Usage Type,Area,Percentage,Price,Price per m2,Age,Frame Type,Contract Date,Postal Code
765720,Mobayee,East Azar,Kish,Apartment,,Residential,96,100,570000,5937.5,36,Metal,13890107,5169614658
766134,Mobayee,East Azar,Qeshm,Apartment,,Residential,144.5,100,1070000,7404.84,5,Concrete,13890108,5166884645
766147,Mobayee,East Azar,Kish,Apartment,,Residential,144.5,100,1625000,11245.67,5,Concrete,13890108,5166884645
It's worth mentioning that I'm very new to python.
I've written the following block to define the headers, but this is the furthest I've gotten so far.
import pandas as pd

path = '/Users/Desktop/sample.csv'
df = pd.read_csv(path, header=[0])
df.head()
You don't need to use header=... because the default is to treat the first row as the header, so:
df = pd.read_csv(path)
Then, to keep only the rows matching a condition:
df2 = df[df['City'].isin(['Kish', 'Qeshm'])]
And you can save it with:
df2.to_csv(another_path)
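Since the file has over 5,000,000 rows, reading it in chunks can keep memory usage down; a minimal sketch along the same lines (the path and chunk size are just examples):
import pandas as pd

path = '/Users/Desktop/sample.csv'
filtered_chunks = []

# process 100,000 rows at a time instead of loading the whole file at once
for chunk in pd.read_csv(path, chunksize=100_000):
    filtered_chunks.append(chunk[chunk['City'].isin(['Kish', 'Qeshm'])])

df2 = pd.concat(filtered_chunks, ignore_index=True)
df2.to_csv('filtered.csv', index=False)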

How to skip rows while importing csv?

How do I skip rows based on a certain value in the first column of the dataset? For example: if the first column has some unwanted content in the first few rows, I want to skip those rows up to a trigger value. Please help me with importing the CSV in Python.
You can achieve this by using the argument skiprows.
Here is some sample code to start with:
import pandas as pd

# skiprows takes an integer count of leading rows to skip, or a list of row indices
df = pd.read_csv('users.csv', skiprows=<the rows you want to skip>)
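If the number of junk rows isn't fixed, one approach (a sketch, assuming the real data begins at the first line starting with a known marker; 'TRIGGER_VALUE' here is hypothetical) is to scan for the trigger first and then re-read:
import pandas as pd

# find the index of the first line that starts with the marker
with open('users.csv') as f:
    for skip, line in enumerate(f):
        if line.startswith('TRIGGER_VALUE'):  # hypothetical marker value
            break

# skip everything before that line; it becomes the header row
df = pd.read_csv('users.csv', skiprows=skip)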
For a series of CSV files in a folder, you can use a for loop: read each CSV file, remove the rows containing the string from the df, and lastly concatenate it to df_overall.
Example:
import glob
from pandas import DataFrame, concat, read_csv

df_overall = DataFrame()
dir_path = 'Insert your directory path'

for file_name in glob.glob(dir_path + '*.csv'):
    df = read_csv(file_name, header=None)
    # keep only the rows whose column does not contain the string
    df = df[~df[<column_name>].str.contains("<your_string>")]
    df_overall = concat([df_overall, df])

How to optimize python script to pyspark def function

I am writing a pyspark program that takes a txt file and then adds a few columns to the left (beginning) of the columns in the file.
My text file looks like this:
ID,Name,Age
1233,James,15
After I run the program I want it to add two columns named creation_DT and created_By to the left of the table. I am trying to get it to look like this:
Creation_DT,Created_By,ID,Name,Age
"current timestamp", Sean,1233,James,15
The code below gets my required output, but I was wondering if there is an easier way to optimize my script using pyspark.
import pandas as pd

df = pd.read_csv("/home/path/Sample Text Files/sample5.txt", delimiter=",")
df.insert(loc=0, column='Creation_DT', value=pd.to_datetime('today'))
df.insert(loc=1, column='Create_BY', value="Sean")
df.to_csv("/home/path/new/new_file.txt", index=False)
Any ideas or suggestions?
Yes, it is relatively easy to convert this to pyspark code:
from pyspark.sql import functions as sf
import datetime

# read in using the dataframe reader
# use file:/// here if you store your csv locally,
# or hdfs:/// if you store your csv in a cluster/HDFS
spdf = (spark.read.format("csv")
        .option("header", "true")
        .load("file:///home/path/Sample Text Files/sample5.txt"))

# withColumn appends at the end, so select the new columns into first position
spdf2 = (
    spdf
    .withColumn("Creation_DT", sf.lit(datetime.date.today().strftime("%Y-%m-%d")))
    .withColumn("Create_BY", sf.lit("Sean"))
    .select("Creation_DT", "Create_BY", *spdf.columns)
)
spdf2.write.csv("file:///home/path/new/new_file.txt")
This code assumes you are filling creation_dt and create_by with the same value for every row.
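By default write.csv produces a directory of part files without a header row; if you want the header (and, for small outputs, a single file), a hedged variant:
# header writes the column names; coalesce(1) forces one part file,
# which is fine for small data but a bottleneck for large data
(spdf2.coalesce(1)
      .write.option("header", "true")
      .csv("file:///home/path/new/new_file.txt"))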
I don't see you using any pyspark in your code, so I'll just use pandas this way:
cols = df.columns
df['Creation_DT'] = pd.to_datetime('today')
df['Create_BY'] = "Sean"
cols = cols.insert(0, 'Create_BY')
cols = cols.insert(0, 'Creation_DT')
# reindex by the new column order so the two new columns come first
df = df[cols]
df.to_csv("/home/path/new/new_file.txt", index=False)

How can I compare two Excel files to check the format in Python?

I have one Excel sheet with the right format (a certain number of headers with specific names). I have another Excel sheet, and I have to check whether it is in the right format (it has to have the same number of headers and the same header names; it is no issue if the values below the headers change). How can I solve this? Is NLP or some other method suitable?
If you have to compare two Excel files you could try something like this (I also add some example Excel files):
def areHeaderExcelEqual(excel1, excel2):
    if len(excel1.columns) != len(excel2.columns):
        return False
    for i in range(len(excel1.columns)):
        if excel1.columns[i] != excel2.columns[i]:
            return False
    return True
And that's an application:
import pandas as pd

# create first example Excel
df_out = pd.DataFrame([('string1', 1), ('string2', 2), ('string3', 3)], columns=['Name', 'Value'])
df_out.to_excel('tmp1.xlsx', index=False)

# create second example Excel
df_out = pd.DataFrame([('string5', 1), ('string2', 5), ('string2', 3)], columns=['Name', 'Value'])
df_out.to_excel('tmp2.xlsx', index=False)

# create third example Excel
df_out = pd.DataFrame([('string1', 1), ('string4', 2), ('string3', 3)], columns=['MyName', 'MyValue'])
df_out.to_excel('tmp3.xlsx', index=False)

excel1 = pd.read_excel('tmp1.xlsx')
excel2 = pd.read_excel('tmp2.xlsx')
excel3 = pd.read_excel('tmp3.xlsx')
print(areHeaderExcelEqual(excel1, excel2))
print(areHeaderExcelEqual(excel1, excel3))
Note: the example Excel files are provided just to show the different outputs: the first comparison prints True, the second prints False.
For example, excel1 looks like this: (screenshot omitted)
The idea is the same for the other files. For more insight, see How to create dataframes.
Here's your code:
f1 = pd.read_excel('file1.xlsx')
f2 = pd.read_excel('file2.xlsx')
print(areHeaderExcelEqual(f1, f2))
You can use pandas for that comparison.
import pandas as pd
f1 = pd.read_excel('sheet1.xlsx')
f2 = pd.read_excel('sheet2.xlsx')
header_threshold = 5 # any number of headers
print(len(f1.columns) == header_threshold)
print(f1.columns) # get the column names as values
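If you also need the header names to match (not just their count), you can compare the column indexes directly; a minimal sketch:
import pandas as pd

f1 = pd.read_excel('sheet1.xlsx')
f2 = pd.read_excel('sheet2.xlsx')

# True only if both sheets have the same header names in the same order
print(list(f1.columns) == list(f2.columns))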

Join two csv files with pandas/python without duplicates

I would like to concatenate 2 csv files. Each CSV file has the following structure:
File 1
id,name,category-id,lat,lng
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208431
4ede330477,Punto Snai,4bf58dd8d,45.44833354,9.144086353
51efd91d49,Gelateria Cecilia,4bf58dd8d,45.44848931,9.144008735
File 2
id,name,category-id,lat,lng
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208432
4ede330477,Punto Snai,4bf58dd8d,45.44833354,9.144086353
51efd91d49,Gelateria Cecilia,4bf58dd8d,45.44848931,9.144008735
5748729449,Duomo Di Milano,52e81612bc,45.463898,9.192034
I want to get a final CSV that looks like this:
Final file
id,name,category-id,lat,lng
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208431
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208432
4ede330477,Punto Snai,4bf58dd8d,45.44833354,9.144086353
51efd91d49,Gelateria Cecilia,4bf58dd8d,45.44848931,9.144008735
5748729449,Duomo Di Milano,52e81612bc,45.463898,9.192034
So I have done this:
import pandas as pd
df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")
full_df = pd.concat([df1, df2])
full_df = full_df.groupby(['id', 'category_id', 'lat', 'lng']).count()
full_df2 = full_df[['id', 'category_id']].groupby('id').agg('count')
full_df2.to_csv("final.csv", index=False)
I tried to group by id, category_id, lat and lng; the name could change.
After the first groupby I want to group again, now by id and category_id, because as shown in my example the first row changed in lng, probably because file2 is an update of file1.
I don't really understand groupby, because when I tried to print the result I got just the count values.
One way to solve this problem is to just use df.drop_duplicates() after you have concatenated the two DataFrames. Additionally, drop_duplicates has an argument "keep", which allows you to specify that you want to keep the last occurrence of the duplicates.
full_df = pd.concat([df1,df2])
unique_df = full_df.drop_duplicates(keep='last')
Check the documentation for drop_duplicates if you need further help.
I could resolve this problem with the following code:
import pandas as pd

df1 = pd.read_csv("file1.csv")
df2 = pd.read_csv("file2.csv")

# first drop rows that are exact duplicates across all the key columns
df_final = pd.concat([df1, df2]).drop_duplicates(subset=['id', 'category_id', 'lat', 'lng']).reset_index(drop=True)
print(df_final.shape)

# then keep one row per id/category_id pair, since the coordinates may have been updated
df_final2 = df_final.drop_duplicates(subset=['id', 'category_id']).reset_index(drop=True)
df_final2.to_csv('final', index=False)
