How to merge multiple CSV files in Python?

I am trying to merge all the CSV files found in a given directory. The problem is that the files have almost the same header; only one column differs. I want the merged CSV to contain that differing column from each file, plus the 4 columns common to all the CSVs.
So far, I have this:
import pandas as pd
from glob import glob

interesting_files = glob("C:/Users/iulyd/Downloads/*.csv")
df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))
full_df = pd.concat(df_list, sort=False)
full_df.to_csv("C:/Users/iulyd/Downloads/merged_pands.csv", index=False)
With this code I managed to merge all the CSV files, but the problem is that some columns are empty in the first "n" rows and only get their proper values (from the respective CSV) further down. How can I make the values start right after the column header?

You probably just need to name the columns you want to read:
import pandas as pd
from glob import glob

interesting_files = glob("D:/PYTHON/csv/*.csv")
df_list = []
for filename in sorted(interesting_files):
    print(filename)
    # time,latitude,longitude
    df_list.append(pd.read_csv(filename, usecols=["time", "latitude", "longitude", "altitude"]))
full_df = pd.concat(df_list, sort=False)
print(full_df.head(10))
full_df.to_csv("D:/PYTHON/csv/mege.csv", index=False)
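If the goal is one output row per key, with each file's unique column filled in alongside the shared columns, merging on the common columns (rather than concatenating vertically) avoids the NaN blocks entirely. A minimal sketch, using made-up frames and a single hypothetical key column "id" standing in for the 4 common columns:

```python
from functools import reduce
import pandas as pd

# Hypothetical frames; in practice each would come from pd.read_csv.
# "id" stands in for the columns shared by every file.
df_a = pd.DataFrame({"id": [1, 2], "col_a": [10, 20]})
df_b = pd.DataFrame({"id": [1, 2], "col_b": [30, 40]})
df_c = pd.DataFrame({"id": [1, 2], "col_c": [50, 60]})

# Merge every frame on the shared key(s); how="outer" keeps rows
# that appear in only some of the files.
merged = reduce(
    lambda left, right: pd.merge(left, right, on="id", how="outer"),
    [df_a, df_b, df_c],
)
print(merged)
```

Each row now carries every file's unique column, with no leading block of empty cells.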

Related

How to read multiple csv files with specific name from a folder and merge them?

I am trying to read multiple files with a specific name pattern (1.car.csv, 2.car.csv, and so on) from a folder, add a new label column at the rightmost position on each iteration, and merge all the CSV files into one. Since ".car.csv" is constant, I think I can loop over the files with the .format(index) function. All of the CSV files have the same attributes.
Kindly help me!
glob is used to get all files in the folder that match the pattern *.csv
pd.read_csv is used to read each file as a DataFrame
index_col=None tells Pandas not to use any of the columns as the index, and instead to create a default index for the DataFrame.
header=0 tells Pandas to use the first row of the CSV file as the header row.
pd.concat is used to merge all the DataFrames into a single DataFrame merged_df
axis=0 means that the concatenation happens along the rows (vertically)
ignore_index=True discards the original indices of the individual DataFrames and creates a new default index for the resulting DataFrame.
import glob
import pandas as pd

path = r'<path to folder containing csv files>'
all_files = glob.glob(path + "/*.csv")
lst = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    lst.append(df)
merged_df = pd.concat(lst, axis=0, ignore_index=True)
This can be easily done with a CSV tool like miller:
mlr --csv cat --filename bla1.csv *.car.csv
This will concatenate the files (without repeating the header) and prepend the filename as the first column.
You can use the pandas library this way:
import pandas as pd
import os

# path to folder where the csv files are stored
path = '/path/to/folder'
result = pd.DataFrame()
for i in range(1, n + 1):
    filename = "{}.car.csv".format(i)
    file_path = os.path.join(path, filename)
    df = pd.read_csv(file_path)
    df['new_label'] = i
    result = pd.concat([result, df], ignore_index=True)
result.to_csv('final_result.csv', index=False)
The n in the code above should be replaced with the number of CSV files in the folder.
If you need any explanation of the code (in case you're new to python or dataframes) just comment below.
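If you'd rather not hard-code n, it can be derived from the folder itself by counting the matching files. A small sketch; it builds a throwaway folder with three dummy files so it runs as-is, but in practice you would point `path` at your real folder:

```python
import glob
import os
import tempfile

# Throwaway folder with three sample files, purely for illustration;
# replace with your real folder path in practice.
path = tempfile.mkdtemp()
for i in range(1, 4):
    open(os.path.join(path, "{}.car.csv".format(i)), "w").close()

# Count the files matching the naming pattern instead of hard-coding n.
n = len(glob.glob(os.path.join(path, "*.car.csv")))
print(n)  # → 3
```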
Using pathlib and pandas, you can use .assign() to add the new column and finally concat() to concatenate all the files into one.
from pathlib import Path
import pandas as pd

input_path = Path("path/to/car/files/").glob("*car.csv")
output_path = "path/to/output"
pd.concat(
    (pd.read_csv(x).assign(new_label="new data") for x in input_path),
    ignore_index=True,
).to_csv(f"{output_path}/final.csv", index=False)

Merging multiple csv files (unnamed columns) from a folder in python

import pandas as pd
import os
import glob

path = r'C:\Users\avira\Desktop\CC\SAIL\Merging\CISF'
files = glob.glob(os.path.join(path, '*.csv'))
combined_data = pd.DataFrame()
for file in files:
    data = pd.read_csv(file)
    print(data)
    combined_data = pd.concat([combined_data, data], axis=0, ignore_index=True)
combined_data.to_csv(r'C:\Users\avira\Desktop\CC\SAIL\Merging\CISF\data2.csv')
The files are merging diagonally, i.e. the second file begins next to the last cell of the first file. Also, the first row of each file is being taken as the column names.
All of my files are without column names. How do I merge my files vertically, and provide column names for the merged CSV?
For the header problem while reading the CSV, you can do this:
pd.read_csv(file, header=None)
While writing the result, you can pass a list containing the header names:
df.to_csv(file_name, header=['col1', 'col2'])
You need to read the csv with no headers and concat:
data = pd.read_csv(file, header=None)
combined_data = pd.concat([combined_data, data], ignore_index=True)
If you want to give the columns meaningful names:
combined_data.columns = ['name1', 'name2', 'name3']
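Alternatively, the names can be assigned at read time via read_csv's names parameter, so every headerless file is parsed with the same schema from the start. A sketch with made-up column names and in-memory "files" (real code would pass the paths from glob instead of StringIO):

```python
import io
import pandas as pd

# Two headerless "files" simulated with StringIO for illustration.
f1 = io.StringIO("1,a\n2,b\n")
f2 = io.StringIO("3,c\n4,d\n")

cols = ["id", "value"]  # hypothetical column names
# header=None: the first row is data, not a header.
# names=cols: assign our own column names to every file.
frames = [pd.read_csv(f, header=None, names=cols) for f in (f1, f2)]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

This avoids having to rename columns after the concatenation.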

Open a csv file, delete columns containing a word in their name, save as a new csv (pandas, Python)

I have a CSV file with a lot of columns named like:
value_number, species_name, color_name, unnamed:1, etc.
How can I write a Python script using pandas that opens a CSV file, finds the columns whose name contains the word "unnamed", deletes them, and saves the cleaned CSV under an updated name?
Here is the script:
import os, glob
import pandas as pd
path= "/home/CaptainSnake/Desktop/Test_class/class_data"
all_files = glob.glob(os.path.join(path, "Classification_testData_*.csv"))
df_from_each_file = (pd.read_csv(f, sep=',') for f in all_files)
df_merged = pd.concat(df_from_each_file, ignore_index=False, sort=False)
df_merged.columns
df_merged.columns.str.match('Unnamed')
df_merged.loc[:, ~df_merged.columns.str.match('Unnamed')]
df_merged.to_csv( "Classification_testData_cleaned.csv")
This script merges all the CSVs matching a particular name pattern; it should then clear the unnamed columns from the new CSV.
This will remove all the columns whose name starts with 'Unnamed':
filtered_cols = [i for i in df.columns if not i.startswith("Unnamed")]
df[filtered_cols].to_csv('filename.csv')
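Note the question mentions lowercase "unnamed:1" while pandas itself emits headers like "Unnamed: 1", so a case-insensitive check may be safer. A small sketch with a made-up frame:

```python
import pandas as pd

# Hypothetical frame mimicking the question's columns.
df = pd.DataFrame(
    {"value_number": [1], "species_name": ["x"], "Unnamed: 1": [None]}
)

# Keep only columns whose name does not contain "unnamed",
# ignoring case, so both "unnamed:1" and "Unnamed: 1" are dropped.
keep = [c for c in df.columns if "unnamed" not in c.lower()]
cleaned = df[keep]
print(list(cleaned.columns))  # → ['value_number', 'species_name']
```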

Read selected data from multiple files

I have 200 .txt files and need to extract one row of data from each file to build a single dataframe.
For example, given a set of files (abc1.txt, abc2.txt, etc.), I need to extract the 5th-row data from each file and create a dataframe. When reading the files, columns need to be separated by the '\t' (tab) character,
like this
data = pd.read_csv('abc1.txt', sep="\t", header=None)
I can not figure out how to do all this with a loop. Can you help?
Here is my answer:
import pandas as pd
from pathlib import Path

path = Path('path/to/dir')
files = path.glob('*.txt')
to_concat = []
for f in files:
    df = pd.read_csv(f, sep="\t", header=None, nrows=5).loc[4:4]
    to_concat.append(df)
result = pd.concat(to_concat)
I have used nrows to read only the first 5 rows, and then .loc[4:4] to get a dataframe rather than a series (which is what .loc[4] would return).
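An equivalent read skips the first four rows and takes exactly one, which avoids the .loc step entirely. A sketch using an in-memory tab-separated "file" for illustration (real code would pass a path):

```python
import io
import pandas as pd

# Simulated 6-line tab-separated file.
txt = io.StringIO("a\t1\nb\t2\nc\t3\nd\t4\ne\t5\nf\t6\n")

# skiprows=4 discards rows 0-3; nrows=1 then reads only the 5th row,
# already as a one-row DataFrame.
row5 = pd.read_csv(txt, sep="\t", header=None, skiprows=4, nrows=1)
print(row5)
```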
Here you go:
import os
import pandas as pd

directory = 'C:\\Users\\PC\\Desktop\\datafiles\\'
rows = []
for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        data = pd.read_csv(directory + filename, sep="\t", header=None)
        # Take the 5th row and keep it as a one-row DataFrame.
        rows.append(pd.DataFrame(data.iloc[4]).transpose())
aggregate = pd.concat(rows, ignore_index=True)
(DataFrame.append was removed in pandas 2.0, so the rows are collected in a list and concatenated at the end.)

Adding the file name in a column while merging multiple csv files to pandas - Python

I have multiple CSV files in the same folder, all with the same data columns:
20100104 080100;5369;5378.5;5365;5378;2368
20100104 080200;5378;5385;5377;5384.5;652
20100104 080300;5384.5;5391.5;5383;5390;457
20100104 080400;5390.5;5391;5387;5389.5;392
I want to merge the CSV files into pandas and add a column with the file name to each row, so I can track where it came from later. There seem to be similar threads, but I haven't been able to adapt any of the solutions. This is what I have so far. Merging the data into one data frame works, but I'm stuck on adding the file-name column:
import os
import glob
import pandas as pd

path = r'/filepath/'
all_files = glob.glob(os.path.join(path, "*.csv"))
names = [os.path.basename(x) for x in all_files]
list_ = []
for file_ in all_files:
    list_.append(pd.read_csv(file_, sep=';', parse_dates=[0], infer_datetime_format=True, header=None))
df = pd.concat(list_)
Assign the file name to a column while reading each file, then concatenate (DataFrame.append, suggested in older answers, was removed in pandas 2.0):
list_ = []
for file_ in all_files:
    file_df = pd.read_csv(file_, sep=';', parse_dates=[0], infer_datetime_format=True, header=None)
    file_df['file_name'] = file_
    list_.append(file_df)
df = pd.concat(list_, ignore_index=True)
