Reading Specific Files from a Folder in Python

I have a folder with 12,000 csv sample files and I need to read only certain files of interest from it. I have a list of the filenames that I want to read from that folder. Here's my code so far:
import glob
import pandas as pd

Filenames  # list that contains the names of the files I want to read

# Import data
data_path = "/MyDataPath"
data = []
i = 0
# Import csv files
# I feel I am making a mistake here with looping Filenames[i]
for file in glob.glob(f"{data_path}/{Filenames[i]}.csv", recursive=False):
    df = pd.read_csv(file, header=None)
    # Append dataframe
    data.append(df)
    i = i + 1
This code reads only the first file and ignores all the others.

The problem is that you are not iterating over Filenames: the glob pattern is built from Filenames[i] once, while i is still 0, so only the first name is ever matched.
Try the following (the manual counter i is no longer needed):
# Import csv files
for f in Filenames:
    # glob.glob returns a list of matching paths, so iterate over it
    for file in glob.glob(f"{data_path}/{f}.csv", recursive=False):
        df = pd.read_csv(file, header=None)
        # Append dataframe
        data.append(df)
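Alternatively, since the exact filenames are already known, glob isn't strictly necessary. Here is a minimal sketch (assuming, as in the question, that the entries in Filenames carry no .csv extension) that builds each path directly and skips missing files:
import os

import pandas as pd

data = []
for f in Filenames:
    path = os.path.join(data_path, f"{f}.csv")
    if os.path.isfile(path):  # skip names with no matching file
        data.append(pd.read_csv(path, header=None))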

Related

Python - for all csv files in folder make new csv file with header names and the file they came from

I have code where I am writing to five csv files, and after all of the csv files are created, I would like to run a function to put all of the headers into a csv or xlsx file where each row represents a header in a file.
So in a folder called "Example" there are 5 csv files, called "1.csv", "2.csv", ... "5.csv". For the code I would like to have, a new file would be created called "Headers of files in Example", where the first column is the name of the csv file the header came from, and the second column contains the headers. If the headers of 1.csv are a, b, and c, the output would ultimately contain rows like (1.csv, a), (1.csv, b), (1.csv, c), and so on.
My python coding is fairly basic at this point, but I definitely think what I would like to do is possible. Any suggestions to help would be greatly appreciated!
After some more digging I was able to find some code that did what I wanted it to, after some slight modifications:
import csv
import glob
import pandas as pd

def headers():
    path = r'path to folder containing csv files/'
    all_files = glob.glob(path + "*.csv")
    files = all_files
    myheaders = ['filename', 'header']
    with open("Headers of foldername.csv", "w", newline='') as fw:
        cw = csv.writer(fw, delimiter=",")
        for filename in files:
            with open(filename, 'r') as f:
                cr = csv.reader(f)
                # get title
                for column_name in (x.strip() for x in next(cr)):
                    cw.writerow([filename, column_name])
    # re-read without treating the first row as a header, then write back with column names
    file = pd.read_csv("Headers of foldername.csv", header=None)
    file.to_csv("Headers of foldername.csv", header=myheaders, index=False)
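A simpler variant of the same idea (just a sketch) writes the header row up front with the csv writer, which makes the pandas round-trip at the end unnecessary:
import csv
import glob

path = r'path to folder containing csv files/'
with open("Headers of foldername.csv", "w", newline='') as fw:
    cw = csv.writer(fw, delimiter=",")
    cw.writerow(['filename', 'header'])  # header row first, so no rewrite is needed later
    for filename in glob.glob(path + "*.csv"):
        with open(filename, 'r') as f:
            for column_name in (x.strip() for x in next(csv.reader(f))):
                cw.writerow([filename, column_name])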
Given that you already have the DataFrames in memory, you just need to build a new DataFrame from them. I like to use a dictionary of lists for this: for each file/DataFrame, extract its columns and append them to the dictionary, then turn the dictionary into the summary DataFrame.
Later you can save the new DataFrame to a file.
summary_df = {
    'file_name': list(),
    'headers': list()}
for file, filename in zip(list_of_files, list_of_names):
    aux_headers = file.columns.to_list()
    summary_df['headers'] += aux_headers
    summary_df['file_name'] += [filename] * len(aux_headers)
summary_df = pd.DataFrame(summary_df)
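Writing it out afterwards is then a one-liner, for example:
summary_df.to_csv("Headers of files in Example.csv", index=False)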
I hope this piece of code helps. Essentially it iterates over all the files you want (their names are in file_names) and reads them using pandas. Once each csv is loaded, it extracts the headers with df.columns and stores them in a list, which is then saved as a new csv by pandas.
import pandas as pd

header_names = []
file_names = ['1.csv', '2.csv']
for file_name in file_names:
    df = pd.read_csv(file_name)
    header_names.extend(list(df.columns))
new_df = pd.DataFrame(header_names)
new_df.to_csv("headers.csv")
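Note that this version saves only the headers themselves. A sketch of a two-column variant that also records which file each header came from, closer to the layout asked for (the column names here are my own choice):
import pandas as pd

rows = []
file_names = ['1.csv', '2.csv']
for file_name in file_names:
    df = pd.read_csv(file_name)
    for col in df.columns:
        rows.append({'filename': file_name, 'header': col})
pd.DataFrame(rows).to_csv("headers.csv", index=False)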

Finding the number of rows for all files within a folder

Hello, I am trying to find the number of rows for all files within a folder. I am trying to do this for a folder that contains only ".txt" files and for a folder that contains ".csv" files.
I know that the way to get the number of rows for a SINGLE ".txt" file is something like this:
file = open("sample.txt","r")
Counter = 0
Content = file.read()
CoList = Content.split("\n")
for i in CoList:
if i:
Counter += 1
print("This is the number of lines in the file")
print(Counter)
Whereas for a SINGLE ".csv" file is something like this:
file = open("sample.csv")
reader = csv.reader(file)
lines= len(list(reader))
print(lines)
But how can I do this for ALL files within a folder? That is, how can I loop each of these procedures across all files within a folder and, ideally, export the output into an excel sheet with columns akin to these:
Filename Number of Rows
1.txt 900
2.txt 653
and so on and so on.
Thank you so much for your help.
You can use glob to detect the files and then just iterate over them.
Other methods: How do I list all files of a directory?
import glob

# 1. list all text files in the directory
rel_filepaths = glob.glob("*.txt")

# 2. (optional) create a function to read the number of rows in a file
def count_rows(filepath):
    f = open(filepath, 'r')
    res = len(f.readlines())
    f.close()
    return res

# 3. iterate over your files and use the count_rows function
counts = [count_rows(filepath) for filepath in rel_filepaths]
print(counts)
Then, if you want to export this result to a .csv or .xlsx file, I recommend using pandas.
import pandas as pd

# 1. create a new table and add your two columns filled with the previous values
df = pd.DataFrame()
df["Filename"] = rel_filepaths
df["Number of rows"] = counts

# 2. export this dataframe to `.csv`, without the index column
df.to_csv("results.csv", index=False)
You can also use pandas.ExcelWriter() if you want the .xlsx format. Link to documentation & examples: Pandas - ExcelWriter doc
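For instance, a minimal .xlsx export might look like this (a sketch, assuming an engine such as openpyxl is installed):
with pd.ExcelWriter("results.xlsx") as writer:
    df.to_excel(writer, index=False)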

Iterating through txt files in directory, saving filenames

I'm iterating through files in a directory and would like to save the filename and some stuff I extract from the files in the same pandas dataframe. How can I save the names of the txt files in a list (which I would then insert into the pandas dataframe as a separate column) while going through all the files in a directory?
Here's part of my code:
columns_df = ['file', 'stuff']
df_stuff = pd.DataFrame(columns=columns_df)
filenamelist = []
stufflist = []
os.chdir(r'path\to\directory')
for file in glob.glob('*.txt'):
    # Extract some stuff from file and append to stufflist (DONE)
    # Save filename in the filenamelist (THE PROBLEM)
df_stuff['stuff'] = stufflist
df_stuff['file'] = filenamelist
Do you need this functionality?
for file in glob.glob('*.txt'):
    filenamelist.append(file)
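Put together with the rest of your code, the loop would look something like this (a sketch; extract_stuff is a hypothetical stand-in for whatever extraction you already do):
for file in glob.glob('*.txt'):
    stufflist.append(extract_stuff(file))  # your existing extraction (hypothetical helper)
    filenamelist.append(file)              # save the filename alongside it
df_stuff['stuff'] = stufflist
df_stuff['file'] = filenamelist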

Loop through excel files in subfolders

I am trying to loop through my files in different folders.
The first part of the code is working:
from os import walk
import pandas as pd

path = r'C:\Users\Sarah\Desktop\test2'
my_files = []
for (dirpath, dirnames, filenames) in walk(path):
    my_files.extend(filenames)
print(my_files)
The code successfully prints all the files in my subfolders.
However, the problem comes in this part, when I try to extract Excel columns from the different files and save them in a dictionary:
all_dicts_list = []
for file_name in my_files:
    # Display sheet names using pandas
    pd.set_option('display.width', 300)
    mosul_file = file_name
    xl = pd.ExcelFile(mosul_file)
    mosul_df = xl.parse(0, header=[1], index_col=[0,1,2])
    # Read Excel and select columns
    mosul_file = pd.read_excel(file_name, sheet_name=0,
                               index_clo=None, na_values=['NA'], usecols="C , F ,G")
    # Remove NaN values
    data_mosul_df = mosul_file.apply(pd.to_numeric, errors='coerce')
    data_mosul_df = mosul_file.dropna()
    # Save to dictionary
    datamosulx = data_mosul_df.to_dict()
    all_dicts_list.append(datamosulx)
All the dictionaries will be in all_dicts_list.
I get an error, FileNotFoundError: [Errno 2] No such file or directory, and I don't understand the problem or how to fix it.
Thank you
It's hard to tell, because you might have lost some of the formatting from copying and pasting, but make sure that after the
for file_name in my_files:
line, anything that you want in the for loop is indented with tabs or spaces to the same level.
Print out mosul_file after assigning it to see whether this could be the case, and then indent appropriately.
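Another possible cause of the FileNotFoundError, judging only from the code shown: os.walk yields bare filenames without their directory, so opening them later only works if they happen to sit in the current working directory. A sketch that stores full paths instead:
import os
from os import walk

my_files = []
for (dirpath, dirnames, filenames) in walk(path):
    for filename in filenames:
        my_files.append(os.path.join(dirpath, filename))  # full path, not just the name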

How to match the .mp4 files present in a folder with the names in a .csv file and sort according to some column value, in Python?

I have a folder containing about 500 .mp4 files :
abc.mp4
lmn.mp4
ijk.mp4
Also I have a .csv file containing the file names (>500) and some values associated with them:
file name value
abc.mp4 5
xyz.mp4 3
lmn.mp4 5
rgb.mp4 4
I want to match the file names of .csv and folder and then place the mp4 files in separate folders depending on the value.
folder 5:
abc.mp4
lmn.mp4
folder 3:
xyz.mp4
and so on
I tried link
import csv
import os

names = []
names1 = []
for dirname, dirnames, filenames in os.walk('./videos_test'):
    for filename in filenames:
        if filename.endswith('.mp4'):
            names.append(filename)

file = open('names.csv', encoding='utf-8-sig')
lns = csv.reader(file)
for line in lns:
    nam = line[0]
    sc = line[1]
    names1.append(nam)
    if nam in names:
        print(nam, line[1])
        if line[1] == 5:
            print('5')
            print(nam)  # just prints the name of the file, does not save it
        elif line[1] == 3:
            print('3')
            print(nam)
does not give any result.
I'd recommend using pandas if you're going to handle csv files.
Here's some code that will automatically create the folders and put the files in the right place for you, using shutil and pandas. I have assumed that your csv's columns are "filename" and "value"; change them if there's a mismatch.
import os
import shutil

import pandas as pd

path_to_csv_file = "file.csv"
df = pd.read_csv(path_to_csv_file)

mp4_root = "mp4_root"
destination_path = "destination_path"

# In order to remove the folder if previously created. You can delete this if you don't like it.
if os.path.isdir(destination_path):
    shutil.rmtree(destination_path)
os.mkdir(destination_path)

# Create one folder per distinct value
unique_values = pd.unique(df['value'])
for u in unique_values:
    os.mkdir(os.path.join(destination_path, str(u)))

# Here we iterate over the rows of your csv file, and concatenate the value and the filename
# to the destination_path with our new folder structure.
for index, row in df.iterrows():
    cur_path = os.path.join(destination_path, str(row['value']), str(row['filename']))
    source_path = os.path.join(mp4_root, str(row['filename']))
    shutil.copyfile(source_path, cur_path)
EDIT: If there's a file that is in the csv but not present in the source folder, you could check for it beforehand (more pythonic), or you could handle it via a try/except check (not recommended).
Check the code below.
source_files = os.listdir(mp4_root)
for index, row in df.iterrows():
    if str(row['filename']) not in source_files:
        continue  # skip csv entries with no matching file
    cur_path = os.path.join(destination_path, str(row['value']), str(row['filename']))
    source_path = os.path.join(mp4_root, str(row['filename']))
    shutil.copyfile(source_path, cur_path)
