Extracting data from multiple files with Python

I'm trying to extract data from a directory with 12 .txt files. Each file contains 3 columns of data (X, Y, Z) that I want to extract. I want to collect all the data in one DataFrame (InfoDF), but so far I have only succeeded in creating a DataFrame with all of the X, Y and Z data in the same column. This is my code:
import pandas as pd
import numpy as np
import os
import fnmatch

path = os.getcwd()
file_list = os.listdir(path)
InfoDF = pd.DataFrame()
for file in file_list:
    try:
        if fnmatch.fnmatch(file, '*.txt'):
            filedata = open(file, 'r')
            df = pd.read_table(filedata, delim_whitespace=True, names={'X','Y','Z'})
    except Exception as e:
        print(e)
What am I doing wrong?

df = pd.read_table(filedata, delim_whitespace=True, names={'X','Y','Z'})
This line replaces df at each iteration of the loop; that's why you only have the last file's data at the end of your program.
What you can do is save all your dataframes in a list and concatenate them at the end:
df_list = []
for file in file_list:
    try:
        if fnmatch.fnmatch(file, '*.txt'):
            filedata = open(file, 'r')
            # names should be a list, not a set, so the column order is deterministic
            df_list.append(pd.read_table(filedata, delim_whitespace=True, names=['X', 'Y', 'Z']))
    except Exception as e:
        print(e)

df = pd.concat(df_list)
Alternatively, you can write it as:
df = pd.concat([pd.read_table(open(file, 'r'), delim_whitespace=True, names=['X', 'Y', 'Z']) for file in file_list if fnmatch.fnmatch(file, '*.txt')])
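A minimal equivalent using pathlib instead of fnmatch (a sketch, assuming the .txt files live in the current working directory):

from pathlib import Path
import pandas as pd

# collect every .txt file in the current directory and stack the frames
df = pd.concat(
    (pd.read_table(p, delim_whitespace=True, names=['X', 'Y', 'Z']) for p in Path('.').glob('*.txt')),
    ignore_index=True,
)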

I think you need glob to select all the files, create a list of DataFrames dfs in a list comprehension, and then use concat:
import glob
import pandas as pd

files = glob.glob('*.txt')
dfs = [pd.read_csv(fp, delim_whitespace=True, names=['X', 'Y', 'Z']) for fp in files]
df = pd.concat(dfs, ignore_index=True)
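If the files are not in the working directory, a hedged variant joins an explicit folder path first (the path value here is a placeholder, not from the original post):

import os

path = 'path/to/txt/files'  # placeholder directory
files = glob.glob(os.path.join(path, '*.txt'))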

As camilleri mentions above, you are overwriting df in your loop.
There is also no point in catching a general exception.
Solution: create an empty DataFrame InfoDF before the loop and then use append or concat to populate it with the smaller dfs:
import pandas as pd
import numpy as np
import os
import fnmatch

path = os.getcwd()
file_list = os.listdir(path)
InfoDF = pd.DataFrame(columns=['X', 'Y', 'Z'])  # create empty dataframe
for file in file_list:
    if fnmatch.fnmatch(file, '*.txt'):
        filedata = open(file, 'r')
        df = pd.read_table(filedata, delim_whitespace=True, names=['X', 'Y', 'Z'])
        InfoDF = InfoDF.append(df, ignore_index=True)  # append returns a new frame, so reassign
print(InfoDF)
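If error handling around the read is still wanted, a sketch (inside the loop above) catching specific exceptions rather than a bare Exception:

try:
    df = pd.read_table(filedata, delim_whitespace=True, names=['X', 'Y', 'Z'])
except (FileNotFoundError, pd.errors.ParserError) as err:
    # pd.errors.ParserError covers malformed rows; anything else still surfaces
    print(f'skipping {file}: {err}')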

Related

How to concatenate a list of csv dataframes with a for loop

I have multiple csv files, and I'm trying to concatenate the desired columns for all csv files in the folder.
Here's my code:
import pandas as pd
import numpy as np
import os

path_dataset = r"C:\Users\KL"

def get_file(path_dataset):
    files = os.listdir(path_dataset)
    files.sort()
    file_list = []
    for file in files:
        path = path_dataset + "\\" + file
        if (file.startswith("OS")) and (file.endswith(".csv")):
            file_list.append(path)
    return file_list

read_columns = ["LX", "LY", "LZ", "LA"]
read_files = get_file(path_dataset)

for file in read_files:
    df = pd.read_csv(file, usecols=read_columns)
    all_df = [df]

Concat_table = pd.concat(all_df, axis=0)
Concat_table = Concat_table.sort_values(["LX", "LY", "LZ", "LA"])
Concat_table.to_csv(os.path.join(path_dataset, "Concate_all.csv"), index=False)
I was only able to read one file, not all of the csv files. How can I solve this? Thank you.
You should initialise the all_df list before the loop and append each DataFrame to it as you read the files, then concat the list afterwards. This is the same pattern you already use in your get_file function.
all_df = []
for file in read_files:
    df = pd.read_csv(file, usecols=read_columns)
    all_df.append(df)

Concat_table = pd.concat(all_df)

Importing all csv files under a path in separate pandas dataframes

I have a lot of csv files in a folder, say file1.csv to file9.csv. I want to import each of these files into a separate dataframe. Importing them into one dataframe is not an option since the files all have different columns. The code below loops through all the csv files in this filepath and is supposed to import them into different dataframes, but I end up with only one dataframe, called df. I thought df.name would do the trick of creating the separate dataframes, but it doesn't. What should I change to make this work?
import pandas as pd
import os
import glob

filepath = r'C:/Source data'
all_files = glob.glob(filepath + "/*.csv")
for filename in all_files:
    name_df = os.path.basename(filename)
    name_df = name_df.replace('.csv', '')
    df = pd.read_csv(filename)
    df.name = name_df
You only see one dataframe called df because every iteration of the loop overwrites the previous one. What you can do is keep a list of dataframes or a dictionary of dataframes.
Dictionary Approach
This is useful if you want to access dataframes by name.
import os
import pandas as pd
import glob

filepath = r'C:/Source data'
all_files = glob.glob(filepath + "/*.csv")
df_dict = dict()
for filename in all_files:
    name_df = os.path.basename(filename)
    name_df = name_df.replace('.csv', '')
    df_dict[name_df] = pd.read_csv(filename)
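Each frame can then be looked up by its original file name, for example (file1 here refers to the hypothetical file1.csv):

print(df_dict['file1'].head())  # key is the file name without the .csv extension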
List Approach
This is useful if you want to access dataframes by index.
import pandas as pd
import glob

filepath = r'C:/Source data'
all_files = glob.glob(filepath + "/*.csv")
df_list = []
for filename in all_files:
    df = pd.read_csv(filename)
    df_list.append(df)
Append them to a list of data frames and access them by list index, e.g. df_list[0]:
import pandas as pd
import glob

filepath = r'C:/Source data'
all_files = glob.glob(filepath + "/*.csv")
df_list = []
for filename in all_files:
    df = pd.read_csv(filename)
    df_list.append(df)
You are overwriting the df object each time you loop. I would suggest using a dict of dataframes in this case.
import os
import pandas as pd
import glob

filepath = r'C:/Source data'
all_files = glob.glob(filepath + "/*.csv")

# create the empty dict to be filled in the loop
dfs = {}
for filename in all_files:
    name_df = os.path.basename(filename)
    name_df = name_df.replace('.csv', '')
    # add the df to the dict with the filename as its key
    dfs[name_df] = pd.read_csv(filename)

# then use it like this
print(dfs['file9'])
I tried the examples with the dictionary and with the list as well. Both work. Thanks a lot all for your help.

Merge all the csv files in one folder and add a new column with the partial file name to the pandas dataframe in Python

I have a folder containing over 100 CSV files. They all have the same prefix in the name, e.g.:
School.Math001.csv
School.Math002.csv
School.Physics001.csv
etc. They all contain the same number of columns.
How can I merge all the CSV files into one data frame in Python and add a new column with each file's name, with the prefix "School." removed?
I found some example code online, but it did not solve my problem:
import os
import glob
import pandas as pd

path = r'C:\Users\me\data'
all_files = glob.glob(os.path.join(path, "*"))
df_from_each_file = (pd.read_csv(f, sep='\t') for f in all_files)
concatdf = pd.concat(df_from_each_file, ignore_index=True)
Try this, haven't tested:

import os
import pandas as pd

path = '<folder path to CSVs>'
dfs = []
for filename in os.listdir(path):
    # os.listdir returns bare file names, so join them back onto the folder path
    sample_df = pd.read_csv(os.path.join(path, filename))
    sample_df['filename'] = filename[7:]  # drop the 7-character 'School.' prefix
    dfs.append(sample_df)

df = pd.concat(dfs, axis=0, ignore_index=True)
Add DataFrame.assign in the generator comprehension to add the new column:

path = r'C:\Users\me\data'
all_files = glob.glob(os.path.join(path, "*"))
# basename(f)[7:] drops the 'School.' prefix; split('.')[0] drops the extension
df_from_each_file = (pd.read_csv(f, sep='\t').assign(New=os.path.basename(f)[7:].split('.')[0]) for f in all_files)
concatdf = pd.concat(df_from_each_file, ignore_index=True)

How to read multiple csv files and store them in different dataframes?

Say I have 200 csv files. I want to read these csv files in one go and store each one in a different data frame, like df1 for the first file and so on up to df200. Doing it manually, like df1 = pd.read_csv(...), takes a lot of time for 200 files. How do I do this using pandas?
I have tried using a for loop but could not work out the approach; I'm stuck.
import pandas as pd
import glob

all_files = glob.glob("file_path" + "/*.csv")
dfs_dict = {}
for idx, filename in enumerate(all_files):
    df = pd.read_csv(filename, index_col=None, header=0)
    dfs_dict["df" + str(idx)] = df
Try using this:

import pandas as pd
import glob

path = r'path of the folder where all csvs exist'
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
li will have all the csvs... you can further preprocess them to separate them into different files,
or, if all the csvs have the same columns and you want to concatenate them into a single dataframe, you can use the concat function in pandas over li to return the single dataframe.
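For that second case, a minimal sketch of the concatenation described above:

df = pd.concat(li, ignore_index=True)  # stacks all frames; assumes matching columns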
import pandas as pd
import os

dfs = []  # empty list of dataframes
dirname = 'path/to/your/files'  # placeholder: where your files are
for root, folders, files in os.walk(dirname):
    for file in files:
        fp = os.path.join(root, file)
        df = pd.read_csv(fp)
        dfs.append(df)

df = pd.concat(dfs)

Adding the file name in a column while merging multiple csv files with pandas - Python

I have multiple csv files in the same folder, all with the same data columns:
20100104 080100;5369;5378.5;5365;5378;2368
20100104 080200;5378;5385;5377;5384.5;652
20100104 080300;5384.5;5391.5;5383;5390;457
20100104 080400;5390.5;5391;5387;5389.5;392
I want to merge the csv files into pandas and add a column with the file name to each line so I can track where it came from later. There seem to be similar threads, but I haven't been able to adapt any of the solutions. This is what I have so far. Merging the data into one data frame works, but I'm stuck on adding the file name column:
import os
import glob
import pandas as pd

path = r'/filepath/'
all_files = glob.glob(os.path.join(path, "*.csv"))
names = [os.path.basename(x) for x in glob.glob(path + '\*.csv')]
list_ = []
for file_ in all_files:
    list_.append(pd.read_csv(file_, sep=';', parse_dates=[0], infer_datetime_format=True, header=None))

df = pd.concat(list_)
Instead of using a list, just use DataFrame's append:

df = pd.DataFrame()
for file_ in all_files:
    file_df = pd.read_csv(file_, sep=';', parse_dates=[0], infer_datetime_format=True, header=None)
    file_df['file_name'] = file_
    df = df.append(file_df)
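Note that DataFrame.append was removed in pandas 2.0, so on a current pandas the same result would come from collecting the frames and calling pd.concat once, e.g.:

# modern pandas (>= 2.0): DataFrame.append is gone, so collect and concat
frames = []
for file_ in all_files:
    file_df = pd.read_csv(file_, sep=';', parse_dates=[0], header=None)
    file_df['file_name'] = file_
    frames.append(file_df)

df = pd.concat(frames, ignore_index=True)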
