Alternate Between Relative and Absolute Path in Same Loop - python

I am trying to:
1. Loop through a directory of CSV files
2. Append the file name as a new column to each file
3. Concatenate every file into a single master file

But I get stuck at step #3: when converting the absolute path back into a relative path, my output looks like ../../../../Desktop/2018.12.31.csv when I just want it to be 2018.12.31.
For example, say the directory contains two files, 2018.12.31.csv and 2018.11.30.csv:

2018.12.31.csv
A B
1 2

2018.11.30.csv
A B
3 4
After running my program:
import os
import pandas as pd

folder = '/Users/user/Desktop/copy'
files = os.listdir(folder)
file_list = list()
for file in files:
    file = os.path.join(folder, file)
    if file.endswith('.csv'):
        df = pd.read_csv(file, sep=";")
        df['filename'] = os.path.relpath(file)
        file_list.append(df)
all_days = pd.concat(file_list, axis=0, ignore_index=True, sort=False)
all_days.to_csv("/Users/user/Desktop/copy/all.csv")
I want the output to be:
A B filename
1 2 2018.12.31
3 4 2018.11.30
But instead it's:
A B filename
1 2 ../../../../Desktop/copy/2018.12.31.csv
3 4 ../../../../Desktop/copy/2018.11.30.csv

os.path.relpath returns the file location relative to your current working directory, which is why you see all those ../ segments. You can get the original filename using os.path.basename(path), or just keep the bare filename in a separate variable (say, file_orig) before you join it with the folder, and set df['filename'] = file_orig.
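A quick sketch of the difference, using a made-up path in the same shape as the question's:

```python
import os

path = "/Users/user/Desktop/copy/2018.12.31.csv"

# basename strips the directories but keeps the extension
name = os.path.basename(path)        # "2018.12.31.csv"

# splitext then separates the extension from the rest
stem, ext = os.path.splitext(name)   # ("2018.12.31", ".csv")

print(stem)
```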

If you already have the full filepath to the .csv files, you can use the os.path module to get just the filename:
df['filename'] = os.path.splitext(os.path.split(file)[1])[0]

os.path.splitext() splits the path string into a tuple with the extension as the second element.
os.path.split() splits the path string into a tuple with the filename (including extension) as the second element.

If you are only ever using .csv files you could simplify to:

df['filename'] = os.path.split(file)[1][:-4]

Generate Pandas DataFrames from CSV file list

To frame the question: I am searching a directory for all csv files and saving the path of each csv file, along with its delineation, into a DataFrame. I now want to iterate over the DataFrame and read each specific csv file into a dataframe with a name generated from the original filename. I cannot figure out how to dynamically generate these dataframes. I started coding a few days ago, so apologies if the syntax is poor.
import os
from glob import glob

# Looks in a given directory and all subsequent subdirectories for the extension ".csv"
# Reads the path to every csv file found and creates a list
PATH = r"Z:\Adam"
EXT = "*.csv"
all_csv_files = [file
                 for path, subdir, files in os.walk(PATH)
                 for file in glob(os.path.join(path, EXT))]
# The list of csv file paths is read into a DataFrame,
# which is then split into columns based on the \\ found in each path
df_csv_path = pd.DataFrame(all_csv_files, columns=['Path'])
df_split_path = df_csv_path['Path'].str.split('\\', n=-1, expand=True)
df_split_path = df_split_path.rename(columns={0: 'Drive', 1: 'Main', 2: 'Project', 3: 'Imaging Folder', 4: 'Experimental Group', 5: 'Experimental Rep', 6: 'File Name'})
df_csv_info = df_split_path.join(df_csv_path['Path'])
# Generates a DataFrame for each of the csv files found in the directory
# DataFrame has a name based on the csv filename
for index in df_csv_info.index:
    filepath = ""
    filename = df_csv_info['File Name'].values[index]
    filepath = str(df_csv_info['Path'].values[index])
    filename = pd.read_csv(filepath)
The best way is to create a dictionary whose keys are the filenames and whose values are the corresponding DataFrames. Instead of using os.path and glob, the modern approach is to use pathlib from the standard library.
Assuming that you don't actually need the DataFrame containing the filenames and just want the DataFrames for each csv file, you can simply do
from pathlib import Path
import pandas as pd

PATH = Path(r"Z:\Adam")
EXT = "*.csv"
# dictionary holding all the files' DataFrames, in the format {"filename": file_DataFrame}
files_dfs = {}
# recursive search for csv files in PATH folder and subfolders
for csv_file in PATH.rglob(EXT):
    filename = csv_file.name        # get the filename
    df = pd.read_csv(csv_file)      # read the csv file as a DataFrame
    files_dfs[filename] = df        # add the DataFrame to the dictionary
Then, to access the DataFrame of a specific file you can do
filename_df = files_dfs["<filename>"]
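As a sketch of how the dictionary can be used afterwards (the filenames and data here are hypothetical stand-ins for what the rglob loop would have collected):

```python
import pandas as pd

# hypothetical stand-ins for the DataFrames the loop above would build
files_dfs = {
    "a.csv": pd.DataFrame({"x": [1, 2]}),
    "b.csv": pd.DataFrame({"x": [3, 4]}),
}

# look up one file's DataFrame by name
df_a = files_dfs["a.csv"]

# or stack them all at once, with the filename as the outer index level
combined = pd.concat(files_dfs)
```

Passing a dict to pd.concat uses its keys as the outer level of the resulting index, so the origin of every row stays recoverable.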

Python: [Errno 2] No such file or directory, whilst it IS in the directory

I have the following code:
import os
import glob
import numpy as np
import pandas as pd

file_name = os.listdir("/Users/sophieramaekers/Downloads/CSV files")
# remove the .csv part
file_name_noext = [f.replace('.csv', '') for f in file_name]
n_subjects = np.arange(0, len(file_name_noext))
# set path
path = r'/Users/sophieramaekers/Downloads/CSV files'
all_files = glob.glob(os.path.join(path, "*.csv"))
# read all files
df_from_each_file = []
for f in n_subjects:
    df_sub = pd.read_csv(file_name[f])
    df_from_each_file.append(df_sub)  # append each df_sub to the df_from_each_file list
But I keep getting the error that the files are not in the directory, whilst I'm 100% sure they are in the Downloads/CSV files directory. Can anybody help me out?
I also tried it this way:
df_from_each_file = [pd.read_csv(f) for f in all_files]  # read all files

But then it only reads the first two files and repeats those two for the rest of the length of n_subjects. (So, for example, if the files are named a, ab, abc, abcd, abcde, then the df_from_each_file list looks like [a, ab, a, ab, a] instead of [a, ab, abc, abcd, abcde].)
file_name is the list of file names (duh) without the full path, so it looks for them in the working directory which (I assume) isn't "/Users/sophieramaekers/Downloads/CSV files".
So try something like:
df_from_each_file = []
for f in n_subjects:
    df_sub = pd.read_csv(os.path.join(path, file_name[f]))  # adding the full path to the file
    df_from_each_file.append(df_sub)
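A pathlib variant of the same fix, sketched as a small helper (the function name is mine): every path yielded by Path.glob already includes the folder, so nothing needs joining:

```python
from pathlib import Path
import pandas as pd

def read_all_csvs(folder):
    """Read every .csv in `folder` into a list of DataFrames.

    Paths from Path.glob() already include the folder, so no join is needed.
    """
    return [pd.read_csv(f) for f in sorted(Path(folder).glob("*.csv"))]
```

Then df_from_each_file = read_all_csvs("/Users/sophieramaekers/Downloads/CSV files") replaces the whole loop.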

How to match the .mp4 files present in a folder with the names in a .csv file and sort according to some column value, in python?

I have a folder containing about 500 .mp4 files :
abc.mp4
lmn.mp4
ijk.mp4
Also I have a .csv file containing the file names (>500) and some values associated with them:
file name value
abc.mp4 5
xyz.mp4 3
lmn.mp4 5
rgb.mp4 4
I want to match the file names of .csv and folder and then place the mp4 files in separate folders depending on the value.
**folder 5:**
abc.mp4
lmn.mp4
**folder 3:**
xyz.mp4
and so on
I tried link
import os
import csv

names = []
names1 = []
for dirname, dirnames, filenames in os.walk('./videos_test'):
    for filename in filenames:
        if filename.endswith('.mp4'):
            names.append(filename)

file = open('names.csv', encoding='utf-8-sig')
lns = csv.reader(file)
for line in lns:
    nam = line[0]
    sc = line[1]
    names1.append(nam)
    if nam in names:
        print(nam, line[1])
        if line[1] == 5:
            print('5')
            print(nam)  # just prints the name of the file, does not save it
        elif line[1] == 3:
            print('3')
            print(nam)
does not give any result.
I'd recommend using pandas if you're going to handle csv files.
Here's some code that will automatically create the folders and put the files in the right place for you, using shutil and pandas. I have assumed that your csv's columns are "filename" and "value"; change them if there's a mismatch.
import pandas as pd
import shutil
import os

path_to_csv_file = "file.csv"
df = pd.read_csv(path_to_csv_file)
mp4_root = "mp4_root"
destination_path = "destination_path"

# Remove the folder if previously created. You can delete this if you don't like it.
if os.path.isdir(destination_path):
    shutil.rmtree(destination_path)
os.mkdir(destination_path)

unique_values = pd.unique(df['value'])
for u in unique_values:
    os.mkdir(os.path.join(destination_path, str(u)))

# Iterate over the rows of the csv file, concatenating the value and the filename
# onto destination_path to build the new folder structure
for index, row in df.iterrows():
    cur_path = os.path.join(destination_path, str(row['value']), str(row['filename']))
    source_path = os.path.join(mp4_root, str(row['filename']))
    shutil.copyfile(source_path, cur_path)
EDIT: If there's a file that is in the csv but not present in the source folder, you can check for it beforehand (more pythonic) or handle it via a try/except check (not recommended).
Check the code below.
source_files = os.listdir(mp4_root)
for index, row in df.iterrows():
    if str(row['filename']) not in source_files:
        continue
    cur_path = os.path.join(destination_path, str(row['value']), str(row['filename']))
    source_path = os.path.join(mp4_root, str(row['filename']))
    shutil.copyfile(source_path, cur_path)

How to merge 2000 CSV files saved in different subfolders within the same main folder

Hey people, I would like to merge 2000 CSV files, one from each of 2000 sub-folders, into a single file. Each sub-folder contains three CSV files with different names, so I need to select only one CSV from each folder.
I know the code for how to merge a bunch of CSV files when they are in the same folder.
import pandas as pd
import glob

path = r'Total_csvs'
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
frame.to_csv('Total.csv', index=False)
But my problem with 2000 CSV files looks totally different.
The folder structure: a main folder containing 2000 subfolders; within each subfolder there are multiple CSV files, and I need to select only one of them, finally concatenating all 2000 CSV files.
As for naming conventions: all the subfolders have different names, but each subfolder's name matches the name of the CSV file within it.
Any suggestions or sample code (how to read 2000 CSVs from sub-folders) would be helpful.
Thanks in advance.
You can loop through all the subfolders using os.listdir.
Since the CSV filename is the same as the subfolder name, simply use the subfolder name to construct the full path name.
import os
import pandas as pd

root = "Total_csvs"
folders = os.listdir(root)
li = []
for folder in folders:
    # Since the csv has the same name as its subfolder
    selected_csv = folder
    filename = os.path.join(root, folder, selected_csv + ".csv")
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
frame.to_csv('Total.csv', index=False)
We can iterate on every subfolder, determine expected_csv_path, check if it exists. If it exists, we add them to our all_files list.
Try following:
import pandas as pd
import os

path = r'Total_csvs'
li = []
for f in os.listdir(path):
    expected_csv_path = os.path.join(path, f, f + '.csv')
    csv_exists = os.path.isfile(expected_csv_path)
    if csv_exists:
        df = pd.read_csv(expected_csv_path, index_col=None, header=0)
        li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True, sort=False)
frame.to_csv('Total.csv', index=False)
If you are using Python 3.5 or newer, you can use glob.glob in a recursive manner the following way:

import glob

path = r'Total_csvs'
all_csv = glob.glob(path + "/**/*.csv", recursive=True)
Now all_csv is a list of relative paths to every *.csv inside Total_csvs, its subdirectories, the subdirectories of those subdirectories, and so on.
For example's sake, let's assume that all_csv is now:

all_csv = ['Total_csvs/abc/abc.csv', 'Total_csvs/abc/another.csv']
So we need to keep only the files whose names correspond to the directory they reside in, which can be done the following way:

import os

def check(x):
    directory, filename = x.split(os.path.sep)[-2:]
    return directory + '.csv' == filename

all_csv = [i for i in all_csv if check(i)]
print(all_csv)  # prints ['Total_csvs/abc/abc.csv']
Now all_csv is a list of paths to exactly the .csv files you are seeking, and you can use it the same way you used all_files in the "flat" (non-recursive) case.
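Putting the pieces together, a sketch of the whole recursive-glob approach as one helper (the function name and arguments are mine, not from the answer):

```python
import glob
import os
import pandas as pd

def merge_matching_csvs(root, out_csv):
    """Concatenate every <sub>/<sub>.csv found anywhere under `root`."""
    all_csv = glob.glob(os.path.join(root, "**", "*.csv"), recursive=True)
    # keep only files whose name matches their parent directory's name
    matching = [f for f in all_csv
                if os.path.basename(os.path.dirname(f)) + ".csv" == os.path.basename(f)]
    li = [pd.read_csv(f, index_col=None, header=0) for f in matching]
    frame = pd.concat(li, axis=0, ignore_index=True, sort=False)
    frame.to_csv(out_csv, index=False)
    return frame
```

Using os.path.basename/dirname instead of split(os.path.sep) also sidesteps the mixed-separator issue glob can produce on Windows.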
You can do it without joining paths:
import pathlib
import pandas as pd

lastparent = None
for ff in pathlib.Path("Total_csvs").rglob("*.csv"):  # recursive glob
    print(ff)
    if ff.parent != lastparent:  # process only the 1st file in each dir
        lastparent = ff.parent
        df = pd.read_csv(str(ff), ...)
        # ...etc.
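Filled in, that sketch might look like the following (the helper name is mine; note this takes the first csv seen in each subfolder, so the paths are sorted to keep each folder's files adjacent):

```python
import pathlib
import pandas as pd

def merge_first_csv_per_folder(root, out_csv):
    """Read the first .csv found in each subfolder of `root` and concatenate them."""
    li = []
    lastparent = None
    for ff in sorted(pathlib.Path(root).rglob("*.csv")):  # recursive glob
        if ff.parent != lastparent:  # only the 1st file in each dir
            lastparent = ff.parent
            li.append(pd.read_csv(ff))
    frame = pd.concat(li, axis=0, ignore_index=True)
    frame.to_csv(out_csv, index=False)
    return frame
```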

import all csv files in directory as pandas dfs and name them as csv filenames

I'm trying to write a script that will import all .csv files in a directory to my workspace as dataframes. Each dataframe should be named after the csv file (minus the .csv extension).
This is what I have so far, but I'm struggling to understand how to assign the correct name to the dataframe in the loop. I've seen posts that suggest using exec(), but that does not seem like a great solution.
import glob
import os
import pandas as pd

path = "../3_Data/Benefits"  # dir path
all_files = glob.glob(os.path.join(path, "*.csv"))  # make list of paths
for file in all_files:
    dfn = file.split('\\')[-1].split('.')[0]  # create string for df name
    dfn = pd.read_csv(file, skiprows=5)  # This line should assign to the value stored in dfn
Any help appreciated, thanks.
DataFrames have no name, but their index can have one. This is how to set it:
import glob
import os
import pandas as pd

path = "./data/"
all_files = glob.glob(os.path.join(path, "*.csv"))  # make list of paths
for file in all_files:
    # Getting the file name without extension
    file_name = os.path.splitext(os.path.basename(file))[0]
    # Reading the file content to create a DataFrame
    dfn = pd.read_csv(file)
    # Setting the file name (without extension) as the index name
    dfn.index.name = file_name

# Example showing the name in the print output:
#       FirstYear  LastYear
# Name
# 0          1990      2007
# 1          2001      2001
# 2          2001      2008
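If the goal really is one accessible object per file, the usual alternative to exec() is a dict keyed by filename. A sketch (the helper name is mine; pass whatever read options your files need):

```python
import glob
import os
import pandas as pd

def load_csvs_by_name(folder, **read_kwargs):
    """Return {filename_without_extension: DataFrame} for every .csv in `folder`."""
    dfs = {}
    for file in glob.glob(os.path.join(folder, "*.csv")):
        name = os.path.splitext(os.path.basename(file))[0]
        dfs[name] = pd.read_csv(file, **read_kwargs)
    return dfs
```

For the question's setup that would be dfs = load_csvs_by_name("../3_Data/Benefits", skiprows=5), after which dfs["some_file"] replaces the dynamically named variable.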
