How to combine a large number of dataframes? - python

I have many .txt files in a folder.
For example, each .txt file is like below.
FileA = pd.DataFrame({'Id':["a","b","c"],'Id2':["a","b","z"],'Amount':[10, 30,50]})
FileB= pd.DataFrame({'Id':["d","e","f","z"],'Id2':["g","h","i","j"],'Amount':[10, 30,50,100]})
FileC= pd.DataFrame({'Id':["r","e"],'Id2':["o","i"],'Amount':[6,33]})
FileD...
I want to extract the first row of each dataframe in the folder, and then combine all of them.
So what I did is below.
To make a list of the txt files, I did the following.
txtfiles = []
for file in glob.glob("*.txt"):
txtfiles.append(file)
To extract first row and combine all of them, I did below.
pd.read_table(txtfiles[0])[:1].append([pd.read_table(txtfiles[1])[:1],pd.read_table(txtfiles[2])[:1]],pd.read_table.......)
If the number of txt. files is small, I can do in this way, but in case there are many .txt files, I need an automation method.
Does anyone know how to automate this?
Thanks for your help!

Based on Sid's answer to this post:
input_path = r"insert/your/path" # use the patk where you stored the txt files
all_files = glob.glob(os.path.join(input_path, "*.txt"))
df_from_each_file = (pd.read_csv(f, nrows=1) for f in all_files)
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
Update Using pd.read_csv was not properly ingesting the file. Replacing read_csv with read_table should give the expected results
input_path = r"insert/your/path" # use the patk where you stored the txt files
all_files = glob.glob(os.path.join(input_path, "*.txt"))
df_from_each_file = (pd.read_table(f, nrows=1) for f in all_files)
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)

Related

How to create a data frame from different Text files

I was working on a project where I have to scrape the some text files from a source. I completed this task and I have 140 text file.
This is one of the text file I have scraped.
I am trying to create a dataframe where I should have one row for each text file. So I wrote the below code:-
import pandas as pd
import os
txtfolder = r'/home/spx072/Black_coffer_assignment/' #Change to your folder path
#Find the textfiles
textfiles = []
for root, folder, files in os.walk(txtfolder):
for file in files:
if file.endswith('.txt'):
fullname = os.path.join(root, file)
textfiles.append(fullname)
# textfiles.sort() #Sort the filesnames
#Read each of them to a dataframe
for filenum, file in enumerate(textfiles, 1):
if filenum==1:
df = pd.read_csv(file, names=['data'], sep='delimiter', header=None)
df['Samplename']=os.path.basename(file)
else:
tempdf = pd.read_csv(file, names=['data'], sep='delimiter', header=None)
tempdf['Samplename']=os.path.basename(file)
df = pd.concat([df, tempdf], ignore_index=True)
df = df[['Samplename','data']] #
The code runs fine, but the dataframe I am getting is some thing like this :-
I want that each text file should be inside a single row like:-
1.txt should be in df['data'][0],
2.txt should be in df'data' and so on.
I tried different codes and also check several questions but still unable to get the desired result. Can anyone help.
I'm not shure why you need pd.read_csv() for this. Try it with pure python:
result = pd.DataFrame(columns=['Samplename', 'data'])
for file in textfiles:
with open(file) as f:
data = f.read()
result = pd.concat([result, pd.DataFrame({'Samplename' : file, 'data': data}, index=[0])], axis=0, ignore_index=True)

Data from one dataframe going to another dataframe Pandas

I am running a loop to open and modify a set of files in a directory using pandas. I am testing on a subset of 10 files and one of them is somehow transposing onto the other and I have no idea why. I have a column for filename and it is the correct file, but using data from the other. It's only this file and I can't figure out why. In the end I get a concatinated dataset where a subset are identical minus the "filename". It seems to be happening before line 8 because that output file has the incorrect info as well. The source files are indeed different and the names of the files are not the same.
Thank you for any help!
for filename in os.listdir(directory):
if filename.endswith(".xlsx"):
df = pd.read_excel(filename, header = None)
for i, row in df.iterrows():
if row.notnull().all():
df2 = df.iloc[(i+1):].reset_index(drop=True)
df2.columns = list(df.iloc[i])
df2.to_excel(filename+"test.xlsx", index=filename)
all_filenames = glob.glob(os.path.join(directory,'*test2.xlsx'))
CAT = pd.concat([pd.read_excel(f) for f in all_filenames ], ignore_index=True, sort=False)
CAT.pop("Unnamed: 0")
CAT.to_excel("All_DF.xlsx", index=filename)
CAT.to_csv("All_DF.csv", index=filename)

Python - for all csv files in folder make new csv file with header names and the file they came from

I have a code where I am writing to five csv files, and after all of the CSV files are created, I would like to run a function to put all of the headers into a csv or xlsx file where each row represents a header in a file.
So in a folder called "Example" there are 5 csv files, called "1.csv", "2.csv"... "5.csv"; for the code I would like to have, a new file would be created called "Headers of files in Example", where the first column is the name of the csv file the header came from, and the second column contains the headers. Ultimately looking like this:contents of Headers of files in example, where the headers of 1.csv are a,b,c and so on.
My python coding is fairly basic at this point, but I definitely think what I would like to do is possible. Any suggestions to help would be greatly appreciated!
After some more digging I was able to find some code that did what I wanted it to, after some slight modifications:
import csv
import glob
import pandas as pd
def headers():
path = r'path to folder containing csv files/'
all_files = glob.glob(path + "*.csv")
files = all_files
myheaders = ['filename', 'header']
with open("Headers of foldername.csv", "w", newline='') as fw:
cw = csv.writer(fw, delimiter=",")
for filename in files:
with open(filename, 'r') as f:
cr = csv.reader(f)
# get title
for column_name in (x.strip() for x in next(cr)):
cw.writerow([filename, column_name])
file = pd.read_csv("Headers of foldername.csv")
file.to_csv("Headers of foldername.csv", header=myheaders, index=False)
Given you have the DataFrames in the memory, you just need to create a new DataFrame, I like to use dictionaries of lists to create it, then for each file/dataframe you extract the columns and upload it to the mock DataFrame.
Later you can save the new DataFrame to a file.
summary_df = {
'file_name': list(),
'headers': list()}
for file, filename in zip(list_of_files, list_of_names):
aux_headers = file.columns.to_list()
summary_df['headers'] += aux_headers
summary_df['file_name'] += [filename] * len(aux_headers)
summary_df = pd.DataFrame(summary_df)
I hope this piece of code helps. Essentially what it does is to iterate over all files you want, their names in file_names then read them using pandas. Once the csv is loaded you extract the headers with df.columns and store it in a list which is then saves as a new csv by pandas.
import pandas as pd
header_names = []
file_names = ['1.csv', '2.csv']
for file_name in file_names:
df = pd.read_csv(file_name)
header_names.extend(list(df.columns))
new_df = pd.DataFrame(l)
new_df.to_csv("headers.csv")

Creating one unique file from many others saved in a folder

I've a list of csv files (approx. 100) that I'd like to include in one single csv file.
The list is found using
PATH_DATA_FOLDER = 'mypath/'
list_files = os.listdir(PATH_DATA_FOLDER)
for f in list_files:
list_columns = list(pd.read_csv(os.path.join(PATH_DATA_FOLDER, f)).columns)
df = pd.DataFrame(columns=list_columns)
print(df)
Which returns the files (it is just a sample, since I have 100 and more files):
['file1.csv', 'name2.csv', 'example.csv', '.DS_Store']
This, unfortunately, includes also hidden files, that I'd like to exclude.
Each file has the same columns:
Columns: [Name, Surname, Country]
I'd like to find a way to create one unique file with all these fields, plus information of the original file (e.g., adding a new column with the file name).
I've tried with
df1 = pd.read_csv(os.path.join(PATH_DATA_FOLDER, f))
df1['File'] = f # file name
df = df.append(df1)
df = df.reset_index(drop=True).drop_duplicates() # I'd like to drop duplicates in both Name and Surname
but it returns a dataframe with the last entry, so I guess the problem is in the for loop.
I hope you can provide some help.
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
#drop duplicates and reset index
combined_csv.drop_duplicates().reset_index(drop=True)
#Save the combined file
combined_csv.to_csv( "combined_csv.csv", index=False, encoding='utf-8-sig')
Have you tried using glob?
filenames = glob.glob("mypath/*.csv") #list of all you csv files.
df = pd.DataFrame(columns=["Name", "Surname", "Country"])
for filename in filenames:
df = df.append(pd.read_csv(filename))
df = df.drop_duplicates().reset_index(drop=True)
Another way would be concatenating the csv files using the cat command after removing the headers and then read the concatenated csv file using pd.read_csv.

Trying to merge different files csv and to label the columns

I'm trying to get a single dataset by merging several cvs files within one folder. So I would like to merge the different file, which each have 4 columns. I would also like to label the four columns using names=[] in pd.concatenate.
I'm using this code:
path = r'C:\Users\chiar\Desktop\folder' # defining the path
all_files = glob.glob(path + "/*.csv")
df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True, names=['quat_1', 'quat_2', 'quat_3', 'quat_4'])
The problem is that instead of getting 4 columns I get 25, and I don't get labeling.
Could someone tell me what I'm doing wrong? Thank you very much!
Use parameter names in read_csv if no header in files:
name = ['quat_1', 'quat_2', 'quat_3', 'quat_4']
df = pd.concat((pd.read_csv(f, names=names) for f in all_files), ignore_index=True)

Categories