Remove duplicates from multiple csv files and save in another directory - python

I'm new to python / pandas. I've got multiple csv files in a directory. I want to remove duplicates in all the files and save new files to another directory.
Below is what I've tried:
import pandas as pd
import glob
list_files = (glob.glob("directory path/*.csv"))
for file in list_files:
    df = pd.read_csv(file)
    df_new = df.drop_duplicates()
    df_new.to_csv(file)
This code runs but doesn't yield the expected results. There are a couple of issues:
Files are overwritten in the existing directory.
An additional index column is added, which is not required.
What changes are needed in the code to write the same set of files, with the same file names but without duplicate rows, to another directory?

Add the index=False parameter to the to_csv call to prevent the new index column.
Change the path passed to to_csv to prevent overwriting:
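import pandas as pd
import glob
list_files = (glob.glob("directory path/*.csv"))
for file in list_files:
    df = pd.read_csv(file)
    df_new = df.drop_duplicates()
    # note: `file` still includes the source directory, so this nests that
    # path under new_directory; os.path.basename(file) gives the bare name
    new_filename = f'new_directory/{file}'
    df_new.to_csv(new_filename, index=False)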

Final code below:
import pandas as pd
import glob
import os
list_files = (glob.glob("directory path/*.csv"))
for file in list_files:
    df = pd.read_csv(file)
    filename = os.path.basename(file)
    df_new = df.drop_duplicates()
    new_filename = f'new_directory/{filename}'
    df_new.to_csv(new_filename, index=False)
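
The same idea can be sketched with pathlib, which also creates the target folder if it does not exist yet (to_csv does not create missing directories). The function name dedupe_csv_folder and its two parameters are hypothetical, chosen only for illustration:

```python
from pathlib import Path

import pandas as pd


def dedupe_csv_folder(src_dir, dst_dir):
    """Write a duplicate-free copy of every CSV in src_dir into dst_dir."""
    src = Path(src_dir)
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)  # create the target folder if missing
    written = []
    for csv_path in sorted(src.glob("*.csv")):
        df = pd.read_csv(csv_path)
        out_path = dst / csv_path.name  # same file name, different folder
        df.drop_duplicates().to_csv(out_path, index=False)  # no extra index column
        written.append(out_path)
    return written
```

Calling dedupe_csv_folder("directory path", "new_directory") would mirror the final code above; the returned list of paths is only a convenience for checking what was written.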

Related

How to read multiple csv files with specific name from a folder and merge them?

I am trying to read multiple files with specific names (1.car.csv, 2.car.csv and so on) from a folder, add a new label column at the right of the dataset on each iteration, and merge all the csv files into one csv file. As ".car.csv" is constant, I think I can use a for loop with .format(index) to run over the csv files. All of the csv files have the same attributes.
Kindly help me!
glob is used to get all files in the folder that match the pattern *.csv.
pd.read_csv is used to read each file as a DataFrame.
index_col=None tells pandas not to use any of the columns as the index and to create a default index for the DataFrame instead.
header=0 tells pandas to use the first row of the CSV file as the header row.
pd.concat is used to merge all the DataFrames into a single DataFrame, merged_df.
axis=0 means that the concatenation happens along the rows (vertically).
ignore_index=True discards the original indices of the individual DataFrames and creates a new default index for the resulting DataFrame.
import glob
import pandas as pd
path = r'<path to folder containing csv files>'
all_files = glob.glob(path + "/*.csv")
lst = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    lst.append(df)
merged_df = pd.concat(lst, axis=0, ignore_index=True)
This can be easily done with a CSV tool like miller:
mlr --csv cat --filename bla1.csv *.car.csv
This will concatenate the files (without repeating the header) and prepend the filename as the first column.
You can use the pandas library this way:
import pandas as pd
import os
# path to folder where the csv files are stored
path = '/path/to/folder'
result = pd.DataFrame()
for i in range(1, n+1):
    filename = "{}.car.csv".format(i)
    file_path = os.path.join(path, filename)
    df = pd.read_csv(file_path)
    df['new_label'] = i
    result = pd.concat([result, df], ignore_index=True)
result.to_csv('final_result.csv', index=False)
The n in the code above should be replaced with the number of csv files you have in the folder.
If you need any explanation of the code (in case you're new to python or dataframes) just comment below.
Using pathlib and pandas, you can use .assign() to add the new column and then pd.concat() to concatenate all the files into one.
from pathlib import Path
import pandas as pd
input_path = Path("path/to/car/files/").glob("*car.csv")
output_path = "path/to/output"
pd.concat(
    (pd.read_csv(x).assign(new_label="new data") for x in input_path),
    ignore_index=True,
).to_csv(f"{output_path}/final.csv", index=False)
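
If the label should be the number at the front of each file name rather than a fixed string, one possible sketch is below. It assumes the files really are named like 1.car.csv, 2.car.csv as in the question; the function name merge_car_files and the label_column parameter are made up for this example.

```python
from pathlib import Path

import pandas as pd


def merge_car_files(folder, label_column="new_label"):
    """Concatenate every '<number>.car.csv' in folder, tagging rows with <number>."""
    frames = []
    for csv_path in Path(folder).glob("*.car.csv"):
        prefix = csv_path.name.split(".", 1)[0]  # "12.car.csv" -> "12"
        if not prefix.isdigit():
            continue  # skip files that do not follow the numbered pattern
        frames.append(pd.read_csv(csv_path).assign(**{label_column: int(prefix)}))
    if not frames:
        return pd.DataFrame()  # nothing matched
    merged = pd.concat(frames, ignore_index=True)
    # sort numerically so that file 10 comes after file 2, not before it
    return merged.sort_values(label_column, kind="stable", ignore_index=True)
```

This avoids having to know n in advance, at the cost of relying on the naming convention; merge_car_files(path).to_csv("final_result.csv", index=False) would then write the combined file.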

Pandas - Trying to store multiple .txt files in a .csv

I have a folder with about 500 .txt files. I would like to store the content in a csv file, with 2 columns, column 1 being the name of the file and column 2 being the file content in string. So I'd end up with a CSV file with 501 rows.
I've snooped around SO and tried to find similar questions, and came up with the following code:
import pandas as pd
from pandas.io.common import EmptyDataError
import os
def Aggregate_txt_csv(path):
    for files in os.listdir(path):
        with open(files, 'r') as file:
            try:
                df = pd.read_csv(file, header=None, delim_whitespace=True)
            except EmptyDataError:
                df = pd.DataFrame()
    return df.to_csv('file.csv', index=False)
However it returns an empty .csv file. Am I doing something wrong?
There are several problems in your code. One of them is that pd.read_csv cannot open the file because you're not passing the full path to it. Try starting from this code:
import os
import pandas as pd
from pandas.errors import EmptyDataError  # public location in current pandas
def Aggregate_txt_csv(path):
    files = os.listdir(path)
    df = []
    for file in files:
        try:
            # on pandas >= 2.2, sep=r"\s+" replaces the deprecated delim_whitespace
            d = pd.read_csv(os.path.join(path, file), header=None, delim_whitespace=True)
            d["file"] = file
        except EmptyDataError:
            d = pd.DataFrame({"file":[file]})
        df.append(d)
    df = pd.concat(df, ignore_index=True)
    df.to_csv('file.csv', index=False)
Use pathlib:
Path.glob() finds all the files.
On path objects, file.stem returns the file name without its extension.
pandas.concat combines the dataframes in df_list.
from pathlib import Path
import pandas as pd
p = Path('e:/PythonProjects/stack_overflow') # path to files
files = p.glob('*.txt') # get all txt files
df_list = list() # create an empty list for the dataframes
for file in files: # iterate through each file
    with file.open('r') as f:
        text = '\n'.join([line.strip() for line in f.readlines()]) # join all rows in list as a single string separated with \n
    df_list.append(pd.DataFrame({'filename': [file.stem], 'contents': [text]})) # create and append a dataframe
df_all = pd.concat(df_list) # concat all the dataframes
df_all.to_csv('files.csv', index=False) # save to csv
I noticed there's already an answer, but I've gotten it to work with a relatively simple piece of code. I've only edited the file read-in a little bit, and the dataframe is outputting successfully.
import pandas as pd
from pandas.errors import EmptyDataError  # public location in current pandas
import os
def Aggregate_txt_csv(path):
    result = []
    print(os.listdir(path))
    for files in os.listdir(path):
        fullpath = os.path.join(path, files)
        if not os.path.isfile(fullpath):
            continue
        with open(fullpath, 'r', errors='replace') as file:
            try:
                # readlines() keeps each line's trailing newline, so this join
                # doubles the line breaks; file.read() would keep the text as-is
                content = '\n'.join(file.readlines())
                result.append({'title': files, 'body': content})
            except EmptyDataError:
                result.append({'title': files, 'body': None})
    df = pd.DataFrame(result)
    return df
df = Aggregate_txt_csv('files')
print(df)
df.to_csv('result.csv')
Most importantly, I append to a list rather than calling pandas' concatenate function repeatedly, which would be bad for performance. Reading the file also should not need read_csv, since the files have no set format; '\n'.join(file.readlines()) reads the file plainly and puts all of its lines into a single string.
At the end, I convert the list of dictionaries into a DataFrame and return it.
EDIT: for paths that aren't the current directory, I updated the code to join the folder path to each file name so that the files can be found. Apologies for the confusion.
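
As a compact variant of the approaches above, here is a minimal sketch that reads each file's text unchanged with Path.read_text and builds the two-column DataFrame in one go. The function name text_files_to_frame and the column names filename and content are assumptions for illustration, not taken from the question.

```python
from pathlib import Path

import pandas as pd


def text_files_to_frame(folder, pattern="*.txt"):
    """Return one row per text file: its name and its full content as a string."""
    rows = []
    for path in sorted(Path(folder).glob(pattern)):
        if path.is_file():
            # read_text returns the whole file; errors="replace" avoids decode failures
            rows.append({"filename": path.name, "content": path.read_text(errors="replace")})
    return pd.DataFrame(rows, columns=["filename", "content"])
```

Writing the result with text_files_to_frame("files").to_csv("result.csv", index=False) gives one header row plus one row per file. An empty file yields an empty string here, so it still gets its own row.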

How to merge more csv files in Python?

I am trying to merge all found csv files in a given directory. The problem is that all csv files have almost the same header, only one column differs. I want to add that column from all csv files to the merged csv file(and also 4 common columns for all csv).
So far, I have this:
import pandas as pd
from glob import glob
interesting_files = glob(
    "C:/Users/iulyd/Downloads/*.csv")
df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))
full_df = pd.concat(df_list, sort=False)
full_df.to_csv("C:/Users/iulyd/Downloads/merged_pands.csv", index=False)
With this code I managed to merge all csv files, but the problem is that some columns are empty in the first "n" rows, and only after some rows they get their proper values(from the respective csv). How can I make the values begin normally, after the column header?
You probably just need to specify the column names to read:
import pandas as pd
from glob import glob
interesting_files = glob(
    "D:/PYTHON/csv/*.csv")
df_list = []
for filename in sorted(interesting_files):
    print(filename)
    #time,latitude,longitude
    df_list.append(pd.read_csv(filename,usecols=["time", "latitude", "longitude","altitude"]))
full_df = pd.concat(df_list, sort=False)
print(full_df.head(10))
full_df.to_csv("D:/PYTHON/csv/mege.csv", index=False)
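
If the goal is to keep each file's differing column and line it up next to the shared ones, stacking rows with pd.concat leaves gaps, because every file only fills its own column. A different technique, joining on the shared columns, may fit better. This is only a sketch under the assumption that the shared columns together identify a row; merge_on_shared_columns is a hypothetical helper name.

```python
from functools import reduce

import pandas as pd


def merge_on_shared_columns(frames, keys):
    """Outer-join DataFrames on the key columns so extra columns sit side by side."""
    return reduce(
        lambda left, right: pd.merge(left, right, on=keys, how="outer"),
        frames,
    )
```

With the column names from the answer above, usage could look like merge_on_shared_columns(df_list, ["time", "latitude", "longitude", "altitude"]). Rows whose key values appear in only some of the files still get NaN in the other files' columns, and non-key columns that share a name across files receive pandas' _x/_y suffixes.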

Not full Import multiple csv files into pandas and concatenate into one DataFrame

Please help me find a solution to a problem with importing data from multiple csv files into one DataFrame in Python.
Code is:
import pandas as pd
import os
import glob
path = r'my_full_path'
os.chdir(path)
results = pd.DataFrame()
for counter, current_file in enumerate(glob.glob("*.csv")):
    namedf = pd.read_csv(current_file, header=None, sep=",", delim_whitespace=True)
    results = pd.concat([results, namedf], join='outer')
results.to_csv('Result.csv', index=None, header=None, sep=",")
The problem is that some part of data are moving to the rows instead of new columns as required.
What is wrong in my code?
P.S.: I found questions about importing multiple csv-files to DataFrame, for example here: Import multiple csv files into pandas and concatenate into one DataFrame, but solution doesn't solve my issue:-(
It was solved by joining the files column-wise in a helper: read_csv() -> append to a list of DataFrames -> concat with axis=1:
import os
import pandas as pd
def get_merged_files(files_list, **kwargs):
    dataframes = []
    for file in files_list:
        df = pd.read_csv(os.path.join(file), **kwargs)
        dataframes.append(df)
    # axis=1 places each file's columns side by side instead of stacking rows
    return pd.concat(dataframes, axis=1)
You can try using this:
import pandas as pd
import os
files = [file for file in os.listdir('./Your_Folder')] # Here is where all the files are located.
all_csv_files = pd.DataFrame()
for file in files:
    df = pd.read_csv("./Your_Folder/"+file)
    all_csv_files = pd.concat([all_csv_files, df])
all_csv_files.to_csv("All_CSV_Files_Concat.csv", index=False)
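
The difference between the two answers comes down to the axis argument of pd.concat. A toy illustration with two made-up single-column frames (the names first and second and their values are invented for this example):

```python
import pandas as pd

first = pd.DataFrame({"a": [1, 2]})
second = pd.DataFrame({"b": [3, 4]})

# axis=0 (the default) stacks rows; columns missing from a frame become NaN
stacked = pd.concat([first, second], axis=0, ignore_index=True)

# axis=1 aligns on the index and places the columns next to each other
side_by_side = pd.concat([first, second], axis=1)

print(stacked.shape)       # (4, 2)
print(side_by_side.shape)  # (2, 2)
```

Data "moving to the rows instead of new columns" matches the axis=0 behaviour; axis=1 only lines up correctly when the frames share the same index.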

Adding file name in a Column while merging multiple csv files to pandas- Python

I have multiple csv files in the same folder with all the same data columns,
20100104 080100;5369;5378.5;5365;5378;2368
20100104 080200;5378;5385;5377;5384.5;652
20100104 080300;5384.5;5391.5;5383;5390;457
20100104 080400;5390.5;5391;5387;5389.5;392
I want to merge the csv files into pandas and add a column with the file name to each line so I can track where it came from later. There seems to be similar threads but I haven't been able to adapt any of the solutions. This is what I have so far. The merge data into one data frame works but I'm stuck on the adding file name column,
import os
import glob
import pandas as pd
path = r'/filepath/'
all_files = glob.glob(os.path.join(path, "*.csv"))
names = [os.path.basename(x) for x in glob.glob(path+'\*.csv')]
list_ = []
for file_ in all_files:
    list_.append(pd.read_csv(file_,sep=';', parse_dates=[0], infer_datetime_format=True,header=None ))
df = pd.concat(list_)
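Add the file name as a column on each file's DataFrame before combining them. Older pandas could do this with DataFrame.append inside the loop; that method was removed in pandas 2.0, so collect the frames and call pd.concat once:
frames = []
for file_ in all_files:
    file_df = pd.read_csv(file_,sep=';', parse_dates=[0], infer_datetime_format=True,header=None )
    file_df['file_name'] = file_  # full path; os.path.basename(file_) gives the bare name
    frames.append(file_df)
df = pd.concat(frames)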
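
A self-contained sketch of the same idea is below, assuming semicolon-separated files without a header row as in the sample data. The function name read_with_file_name is hypothetical, and the first column is left as plain text rather than parsed as a date to keep the example independent of the date format.

```python
from pathlib import Path

import pandas as pd


def read_with_file_name(folder, pattern="*.csv"):
    """Read every matching file and record its name in a 'file_name' column."""
    frames = []
    for path in sorted(Path(folder).glob(pattern)):
        frame = pd.read_csv(path, sep=";", header=None)
        frame["file_name"] = path.name  # bare file name, without the folder
        frames.append(frame)
    if not frames:
        return pd.DataFrame()  # pd.concat raises on an empty list
    return pd.concat(frames, ignore_index=True)
```

Because the files have no header, the data columns are numbered 0 to 5 and file_name is appended as the last column.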
