Join two CSV files with pandas/Python without duplicates

I would like to concatenate 2 csv files. Each CSV file has the following structure:
File 1
id,name,category-id,lat,lng
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208431
4ede330477,Punto Snai,4bf58dd8d,45.44833354,9.144086353
51efd91d49,Gelateria Cecilia,4bf58dd8d,45.44848931,9.144008735
File 2
id,name,category-id,lat,lng
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208432
4ede330477,Punto Snai,4bf58dd8d,45.44833354,9.144086353
51efd91d49,Gelateria Cecilia,4bf58dd8d,45.44848931,9.144008735
5748729449,Duomo Di Milano,52e81612bc,45.463898,9.192034
I want to get a final CSV that looks like this:
Final file
id,name,category-id,lat,lng
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208431
4c29e1c197,Area51,4bf58dd8d,45.44826958,9.144208432
4ede330477,Punto Snai,4bf58dd8d,45.44833354,9.144086353
51efd91d49,Gelateria Cecilia,4bf58dd8d,45.44848931,9.144008735
5748729449,Duomo Di Milano,52e81612bc,45.463898,9.192034
So I have done this:
import pandas as pd
df1=pd.read_csv("file1.csv")
df2=pd.read_csv("file2.csv")
full_df = pd.concat([df1, df2])
full_df = full_df.groupby(['id', 'category-id', 'lat', 'lng']).count()
full_df2 = full_df[['id', 'category-id']].groupby('id').agg('count')
full_df2.to_csv("final.csv",index=False)
I tried to group by id, category-id, lat and lng, since the name could change.
After the first groupby I want to group again, now by id and category-id, because as shown in my example the first row changed in lng; that is probably because file2 is an update of file1.
I don't understand groupby well, because when I tried to print the result I only got the count values.

One way to solve this problem is to just use df.drop_duplicates() after you have concatenated the two DataFrames. Additionally, drop_duplicates has an argument "keep", which allows you to specify that you want to keep the last occurrence of the duplicates.
full_df = pd.concat([df1,df2])
unique_df = full_df.drop_duplicates(keep='last')
Check the documentation for drop_duplicates if you need further help.
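To make the effect of keep='last' concrete, here is a minimal, self-contained sketch using inline data instead of the actual files. Note it deduplicates on id alone, which treats file 2 as an update of file 1 (so the row with the changed lng survives); that subset choice is an assumption based on the question.

```python
import io
import pandas as pd

# Two small CSVs; the id 4c29e1c197 appears in both, with a slightly different lng
csv1 = """id,name,lat,lng
4c29e1c197,Area51,45.44826958,9.144208431
4ede330477,Punto Snai,45.44833354,9.144086353"""
csv2 = """id,name,lat,lng
4c29e1c197,Area51,45.44826958,9.144208432
5748729449,Duomo Di Milano,45.463898,9.192034"""

df1 = pd.read_csv(io.StringIO(csv1))
df2 = pd.read_csv(io.StringIO(csv2))

full_df = pd.concat([df1, df2])
# Deduplicate on id only, keeping the row from the later file (the update)
unique_df = full_df.drop_duplicates(subset=['id'], keep='last').reset_index(drop=True)
print(len(unique_df))  # 3 — each id once, Area51 with the updated lng
```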

I was able to solve this problem with the following code:
import pandas as pd
df1=pd.read_csv("file1.csv")
df2=pd.read_csv("file2.csv")
df_final = pd.concat([df1, df2]).drop_duplicates(subset=['id', 'category-id', 'lat', 'lng']).reset_index(drop=True)
print(df_final.shape)
df_final2 = df_final.drop_duplicates(subset=['id', 'category-id']).reset_index(drop=True)
df_final2.to_csv('final.csv', index=False)

Related

df = pd.read_csv('file.csv') adds random numbers and commas to file. Pandas in python

I am trying to read a csv file using pandas as so:
df = pd.read_csv('file.csv')
Here is the file before:
,schoolId,Name,Meetings Present
0,991,Jimmy Nuetron,2
1,992,Jimmy Fuetron,6
2,993,Cam Nuetron,4
Here is the file after:
,Unnamed: 0,schoolId,Name,Meetings Present
0,0.0,991.0,Jimmy Nuetron,2.0
1,1.0,992.0,Jimmy Fuetron,6.0
2,2.0,993.0,Cam Nuetron,4.0
0,,,,3
Why is it adding the numbers and columns when I run the read_csv method?
How can I prevent this without adding a separator?
pandas.read_csv is actually not adding the column Unnamed: 0; it already exists in your .csv (which was apparently/probably generated by pandas.DataFrame.to_csv).
You can get rid of this extra column by treating it as the index:
df = pd.read_csv('file.csv', index_col=0)
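A short round-trip sketch of what is going on: to_csv writes the index as an unlabeled first column by default, which a naive read_csv turns into "Unnamed: 0", while index_col=0 restores the original shape.

```python
import io
import pandas as pd

df = pd.DataFrame({'schoolId': [991, 992], 'Name': ['Jimmy Nuetron', 'Jimmy Fuetron']})

# to_csv writes the index as an unlabeled first column by default
buf = io.StringIO()
df.to_csv(buf)

# Reading it back naively yields an extra "Unnamed: 0" column...
buf.seek(0)
naive = pd.read_csv(buf)
print(naive.columns.tolist())  # ['Unnamed: 0', 'schoolId', 'Name']

# ...while index_col=0 reads that column back as the index
buf.seek(0)
fixed = pd.read_csv(buf, index_col=0)
print(fixed.columns.tolist())  # ['schoolId', 'Name']
```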

How can I filter a csv file based on its columns in python?

I have a CSV file with over 5,000,000 rows of data that looks like this (except that it is in Farsi):
Contract Code,Contract Type,State,City,Property Type,Region,Usage Type,Area,Percentage,Price,Price per m2,Age,Frame Type,Contract Date,Postal Code
765720,Mobayee,East Azar,Kish,Apartment,,Residential,96,100,570000,5937.5,36,Metal,13890107,5169614658
766134,Mobayee,East Azar,Qeshm,Apartment,,Residential,144.5,100,1070000,7404.84,5,Concrete,13890108,5166884645
766140,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,100,1050000,7266.44,5,Concrete,13890108,5166884645
766146,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,100,700000,4844.29,5,Concrete,13890108,5166884645
766147,Mobayee,East Azar,Kish,Apartment,,Residential,144.5,100,1625000,11245.67,5,Concrete,13890108,5166884645
770822,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,50,500000,1730.1,5,Concrete,13890114,5166884645
I would like to write code that treats the first row as the header, then extracts the data for two specific cities (Kish and Qeshm) and saves it into a new CSV file. Something like this:
Contract Code,Contract Type,State,City,Property Type,Region,Usage Type,Area,Percentage,Price,Price per m2,Age,Frame Type,Contract Date,Postal Code
765720,Mobayee,East Azar,Kish,Apartment,,Residential,96,100,570000,5937.5,36,Metal,13890107,5169614658
766134,Mobayee,East Azar,Qeshm,Apartment,,Residential,144.5,100,1070000,7404.84,5,Concrete,13890108,5166884645
766147,Mobayee,East Azar,Kish,Apartment,,Residential,144.5,100,1625000,11245.67,5,Concrete,13890108,5166884645
It's worth mentioning that I'm very new to python.
I've written the following block to define the headers, but this is the furthest I've gotten so far.
import pandas as pd
path = '/Users/Desktop/sample.csv'
df = pd.read_csv(path, header=0)
df.head()
You don't need to use header=... because the default is to treat the first row as the header, so
df = pd.read_csv(path)
Then, to keep rows on conditions:
df2 = df[df['City'].isin(['Kish', 'Qeshm'])]
And you can save it with
df2.to_csv(another_path)
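Since the file has over 5,000,000 rows, it may not fit comfortably in memory; read_csv's chunksize argument lets you stream and filter it piece by piece. A sketch using an in-memory stand-in for the big file (the column names follow the question; the chunk size is illustrative):

```python
import io
import pandas as pd

# In-memory stand-in for the large CSV (only the relevant columns shown)
raw = """Contract Code,City,Price
765720,Kish,570000
766134,Qeshm,1070000
766140,Tabriz,1050000
766147,Kish,1625000"""

# Stream the file in chunks so all 5M rows never sit in memory at once
parts = []
for chunk in pd.read_csv(io.StringIO(raw), chunksize=2):
    parts.append(chunk[chunk['City'].isin(['Kish', 'Qeshm'])])

df2 = pd.concat(parts, ignore_index=True)
print(len(df2))  # 3 matching rows
# df2.to_csv('filtered.csv', index=False)  # then save as usual
```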

How to skip rows while importing csv?

How do I skip rows based on a certain value in the first column of the dataset? For example, the first column has some unwanted content in the first few rows, and I want to skip those rows up to a trigger value. Please help me with importing the CSV in Python.
You can achieve this by using the argument skiprows.
Here is sample code below to start with:
import pandas as pd
df = pd.read_csv('users.csv', skiprows=<the row you want to skip>)
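For the "skip up to a trigger value" part, one option is a two-pass approach: scan the raw lines to find the trigger, then pass the computed offset to skiprows. A sketch with made-up data and a hypothetical trigger value:

```python
import io
import pandas as pd

# Stand-in for a file whose first rows are junk until a trigger value appears
raw = """junk line one
junk line two
TRIGGER
id,name
1,alpha
2,beta"""

# First pass: find the line index where the first column equals the trigger
lines = raw.splitlines()
start = next(i for i, line in enumerate(lines) if line.split(',')[0] == 'TRIGGER')

# Second pass: skip everything up to and including the trigger row;
# the next line then becomes the header
df = pd.read_csv(io.StringIO(raw), skiprows=start + 1)
print(df.shape)  # (2, 2)
```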
For a series of CSV files in a folder, you can loop over them, read each CSV file, remove the rows containing the unwanted string, and concatenate the result to df_overall.
Example:
import glob
import pandas as pd

df_overall = pd.DataFrame()
dir_path = 'Insert your directory path'
for file_name in glob.glob(dir_path + '*.csv'):
    df = pd.read_csv(file_name, header=None)
    df = df[~df[<column_name>].str.contains("<your_string>")]
    df_overall = pd.concat([df_overall, df])

Comparing two Microsoft Excel files in Python

I have two Microsoft Excel files fileA.xlsx and fileB.xlsx
fileA.xlsx looks like this:
fileB.xlsx looks like this:
The Message section of a row can contain any type of character. For example: smileys, Arabic, Chinese, etc.
I would like to find and remove all rows from fileB which are already present in fileA. How can I do this in Python?
You can use pandas' merge to first get the rows which are common to both files,
then use them as a filter.
import pandas as pd
df_A = pd.read_excel("fileA.xlsx", dtype=str)
df_B = pd.read_excel("fileB.xlsx", dtype=str)
df_new = df_A.merge(df_B, on = 'ID',how='outer',indicator=True)
df_common = df_new[df_new['_merge'] == 'both']
df_A = df_A[(~df_A.ID.isin(df_common.ID))]
df_B = df_B[(~df_B.ID.isin(df_common.ID))]
df_A and df_B now contain the rows from fileA and fileB respectively, without the rows common to both.
Hope this helps.
Here I am using pandas; you also have to install xlrd (or, for newer pandas versions, openpyxl) to open .xlsx files.
This takes the values from the second file that are not in the first file, then writes them back out under the second file's name, overwriting it. Note that b[b != a] compares the two frames position by position, so it assumes both files have the same shape and row order:
import pandas as pd
a = pd.read_excel('a.xlsx')
b = pd.read_excel('b.xlsx')
diff = b[b!=a].dropna()
diff.to_excel("b.xlsx",sheet_name='Sheet1',index=False)
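If the files do not have the same shape and row order, a content-based filter may be more robust than the position-wise comparison above. A sketch using a merge with indicator=True on small in-memory stand-ins (the column names are illustrative, not from the actual files):

```python
import pandas as pd

# Small stand-ins for fileA and fileB
df_a = pd.DataFrame({'ID': [1, 2], 'Message': ['hi', 'سلام']})
df_b = pd.DataFrame({'ID': [2, 3], 'Message': ['سلام', '你好']})

# Keep only fileB rows whose full content does not appear in fileA;
# with no `on` given, merge matches on all shared columns
merged = df_b.merge(df_a, how='left', indicator=True)
df_b_only = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(df_b_only['ID'].tolist())  # [3]
```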

Merging csv columns while checking ID of the first column

I have 4 CSV files exported from an e-shop database. I need to merge them by columns, which I could maybe manage on my own, but the problem is matching the right columns.
First file:
"ep_ID","ep_titleCS","ep_titlePL".....
"601","Kancelářská židle šedá",NULL.....
...
Second file:
"pe_photoID","pe_productID","pe_sort"
"459","603","1"
...
Third file:
"epc_productID","epc_categoryID","epc_root"
"2155","72","1"
...
Fourth file:
"ph_ID","ph_titleCS"...
"379","5391132275.jpg"
...
I need to match the rows so that rows with the same "ep_ID" and "epc_productID" are merged together, and rows with the same "ph_ID" and "pe_photoID" too. I don't really know where to start; hopefully I wrote it understandably.
Update:
I am using :
files = ['produkty.csv', 'prirazenifotek.csv', 'pprirazenikategorii.csv', 'adresyfotek.csv']
dfs = []
for f in files:
    df = pd.read_csv(f, low_memory=False)
    dfs.append(df)
first_and_third =pd.merge(dfs[0],dfs[1],left_on = "ep_ID",right_on="pe_photoID")
first_and_third.to_csv('new_filepath.csv', index=False)
Ok, this code works, but it does two things differently than I need:
When there is a row in file one with ID = 1, for example, and in file two there are 5 rows with bID = 1, it creates 5 rows in the final file. I would like to have one row that holds the values from every row with bID = 1 in file number two. Is that possible?
And it seems to be deleting some rows... not sure until I get rid of the "duplicates"...
You can use pandas's merge method to merge the csvs together. In your question you only provide keys between the 1st and 3rd files, and the 2nd and 4th files. Not sure if you want one giant table that has them all together -- if so you will need to find another intermediary key, maybe one you haven't listed(?).
import pandas as pd
files = ['path_to_first_file.csv', 'second_file.csv', 'third_file.csv', 'fourth_file.csv']
dfs = []
for f in files:
    df = pd.read_csv(f)
    dfs.append(df)
first_and_third = dfs[0].merge(dfs[2], left_on='ep_ID', right_on='epc_productID', how='left')
second_and_fourth = dfs[1].merge(dfs[3], left_on='pe_photoID', right_on='ph_ID', how='left')
If you want to save the dataframe back down to a file you can do so:
first_and_third.to_csv('new_filepath.csv', index=False)
index=False assumes you have no index set on the dataframe and don't want the dataframe's row numbers to be included in the final csv.
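On the follow-up about collapsing the one-to-many result into a single row per ID: after the merge, groupby plus agg(list) can gather the matching values into one cell. A sketch with made-up data shaped like the first and second files (one product, several photos pointing at it):

```python
import pandas as pd

# Illustrative stand-ins for the product and photo files
products = pd.DataFrame({'ep_ID': ['601'], 'ep_titleCS': ['Kancelářská židle šedá']})
photos = pd.DataFrame({'pe_photoID': ['459', '460'], 'pe_productID': ['601', '601']})

# The merge produces one row per matching photo (the 5-rows-per-ID effect)
merged = products.merge(photos, left_on='ep_ID', right_on='pe_productID', how='left')

# Collapse back to one row per product, gathering the photo ids into a list
collapsed = (merged.groupby(['ep_ID', 'ep_titleCS'], as_index=False)
                   .agg({'pe_photoID': list}))
print(collapsed['pe_photoID'].iloc[0])  # ['459', '460']
```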