Finding the number of rows for all files within a folder - python

Hello, I am trying to find the number of rows for all files within a folder. I am trying to do this for a folder that contains only ".txt" files and for a folder that contains ".csv" files.
I know that the way to get the number of rows for a SINGLE ".txt" file is something like this:
file = open("sample.txt", "r")
Counter = 0
Content = file.read()
CoList = Content.split("\n")
for i in CoList:
    if i:
        Counter += 1
print("This is the number of lines in the file")
print(Counter)
Whereas for a SINGLE ".csv" file it is something like this:
import csv

file = open("sample.csv")
reader = csv.reader(file)
lines = len(list(reader))
print(lines)
But how can I do this for ALL files within a folder? That is, how can I loop each of these procedures across all files within a folder and, ideally, export the output into an Excel sheet with columns akin to these:
Filename    Number of Rows
1.txt       900
2.txt       653
and so on and so on.
Thank you so much for your help.

You can use glob to detect the files and then just iterate over them.
Other methods: How do I list all files of a directory?
import glob

# 1. list all text files in the directory
rel_filepaths = glob.glob("*.txt")

# 2. (optional) create a function to read the number of rows in a file
def count_rows(filepath):
    with open(filepath, 'r') as f:
        return len(f.readlines())

# 3. iterate over your files and use the count_rows function
counts = [count_rows(filepath) for filepath in rel_filepaths]
print(counts)
Then, if you want to export this result to a .csv or .xlsx file, I recommend using pandas.
import pandas as pd
# 1. create a new table and add your two columns filled with the previous values
df = pd.DataFrame()
df["Filename"] = rel_filepaths
df["Number of rows"] = counts
# 2. export this dataframe to `.csv`
df.to_csv("results.csv")
You can also use pandas.ExcelWriter() if you want to use the .xlsx format. Link to documentation & examples: Pandas - ExcelWriter doc
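For example, a minimal sketch of the Excel export, assuming an engine such as openpyxl is installed (the sheet name "row_counts" is just an illustration):
import pandas as pd

# write the same dataframe to an .xlsx workbook instead of a .csv
with pd.ExcelWriter("results.xlsx") as writer:
    df.to_excel(writer, sheet_name="row_counts", index=False)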

Related

Reading Specific Files from folder in Python

I have a folder with 12000 csv sample files and I need to read only certain files of interest from it. I have a list with the filenames that I want to read from that folder. Here's my code so far:
Filenames  # list that contains the filenames that I want to read

# Import data
data_path = "/MyDataPath"
data = []
i = 0
# Import csv files
# I feel I am doing a mistake here with looping Filenames[i]
for file in glob.glob(f"{data_path}/{Filenames[i]}.csv", recursive=False):
    df = pd.read_csv(file, header=None)
    # Append dataframe
    data.append(df)
    i = i + 1
This code only reads the first file and ignores all the others.
The problem is you are not iterating over the Filenames. Note also that glob.glob returns a list of matches, so iterate over that list rather than passing it straight to read_csv.
Try the following:
for f in Filenames:
    for file in glob.glob(f"{data_path}/{f}.csv", recursive=False):
        df = pd.read_csv(file, header=None)
        # Append dataframe
        data.append(df)
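If you then want everything in a single table, the collected dataframes can be combined afterwards; a minimal sketch using the data list from above:
# combine the per-file dataframes into one table
all_data = pd.concat(data, ignore_index=True)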

How to add the folder name in a pandas column, where each name is the folder the concatenated csv comes from?

I am reading one csv file in several folders; each time it is the same csv name.
I have this function:
def get_all_projects(path, var):
    folder = []
    for i in os.listdir(path):
        folder.append(i)
    df = []
    for x in range(len(folder)):
        try:
            df.append(pd.read_csv(path + '\\' + folder[x] + '\\' + var + '.csv', sep=';', header=0))
        except:
            pass
    table = pd.concat(df)
    table = table.reset_index(drop=True)
    return table
The result of this function is a dataframe with all csvs concatenated.
I would like to add a column to this dataframe: the name of the folder each csv file comes from.
How can I do this in the function?
You can do something along these lines:
Update the try block with the below:
try:
    file_path = path + '\\' + folder[x] + '\\' + var + '.csv'
    tmp = pd.read_csv(file_path, sep=';', header=0)
    tmp['folder_filename'] = file_path  # tweak this for the exact value
    df.append(tmp)
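Put together, a sketch of the whole function with that change applied (assuming the os and pandas imports from your code; the bare except is narrowed to FileNotFoundError here, and the column is filled with the folder name rather than the full path):
def get_all_projects(path, var):
    df = []
    for folder_name in os.listdir(path):
        try:
            file_path = path + '\\' + folder_name + '\\' + var + '.csv'
            tmp = pd.read_csv(file_path, sep=';', header=0)
            tmp['folder_filename'] = folder_name  # or file_path for the full path
            df.append(tmp)
        except FileNotFoundError:
            pass
    table = pd.concat(df)
    return table.reset_index(drop=True)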

Extracting all specific rows (separately) from multiple csv files and combining them to save as a new file

I have a number of csv files. I need to extract all the respective rows from each file and save them as a new file, i.e. the first output file must contain the first rows of all input files, and so on.
I have done the following.
import pandas as pd
import os
import numpy as np

data = pd.DataFrame('', columns=['ObjectID', 'SPI'], index=np.arange(1, 100))
path = r'C:\Users\bikra\Desktop\Pandas'
i = 1
for files in os.listdir(path):
    if files[-4:] == '.csv':
        for j in range(0, 10, 1):
            #print(files)
            dataset = pd.read_csv(r'C:\Users\bikra\Desktop\Pandas' + '\\' + files)
            spi1 = dataset.loc[j, 'SPI']
            data.loc[i]['ObjectID'] = files[:]
            data.loc[i]['SPI'] = spi1
            data.to_csv(r'C:\Users\bikra\Desktop\Pandas\output' + '\\' + str(j) + '.csv')
            i + 1
It works well when the index (i.e. 'j') is specified. But when I tried to loop, the output csv file contains only the first row. Where am I wrong?
Better to use append:
data = data.append(spi1)
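Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the equivalent is pd.concat; a minimal sketch using the variables from the question:
# pandas >= 2.0: build a one-row frame and concatenate it
row = pd.DataFrame([{'ObjectID': files, 'SPI': spi1}])
data = pd.concat([data, row], ignore_index=True)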

How To Print File Names Conditionally Based On Multiple Imported csv Files

I was wondering if there is a way to print out the file names conditionally, based on multiple imported csv files. My procedure is:
Set my path.
Grab all the csv files in this path.
Import all these csv files grabbing only the numbers of each file names and store this in 'new_column'.
Check the number of columns of each file, wanting to exclude the files that do not have 10 columns (achieved using shape[1]).
Now, I want to print out the actual file names that don't have 10 columns -> I am stuck here.
I have no problems up to number 4. However, I am stuck on 5. How do I achieve 5?
import glob
import os
import re
import pandas as pd

# setting my path
path = r'my\path'
# grab all csv files in my path
all_files = glob.glob(path + "/*.csv")

# grab the numeric part of each file name
def get_numbers_from_filename(filename):
    return re.search(r'\d+', filename).group(0)

# import all the actual csv files and add a 'new_column' column based on the "get_numbers_from_filename" function
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    df['new_column'] = get_numbers_from_filename(filename)
    li.append(df)

# check the frequency of column numbers for each file using a frequency table
result = []
for lis in li:
    result.append(lis.shape[1])
# make this a dataframe
result = pd.DataFrame(result, columns=['shape'])
# actual checking step
result['shape'].value_counts()

# grab only shape == 10 files to correctly concatenate
result = []
for lis in li:
    if lis.shape[1] == 10:
        result.append(lis)

## my solution for part 5:
# print and save all the paths of my directory
path = os.listdir(path)
# grab file names if the column numbers are not 10
result3 = []
for paths in path:
    for list in li:
        if lis.shape[1] != 10:
            result3.append(paths)
My solution gives an empty list [].
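Since li is built by iterating all_files in order, one way to get at part 5 is to zip the two together; a minimal sketch building on the variables above:
# pair each filename with its dataframe and report files lacking 10 columns
result3 = [filename for filename, frame in zip(all_files, li)
           if frame.shape[1] != 10]
print(result3)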

Copy a column, add some text and write it to a new csv file

I want to make a script that copies the 2nd column from multiple csv files in a folder and adds some text before saving it all to a single csv file.
Here is what I want to do:
1.) Grab data in the 2nd column from all csv files
2.) Append text "hello" & "welcome" to each row at start and end
3.) Write the data into a single file
I tried creating it using pandas:
import os
import pandas as pd
dataframes = [pd.read_csv(p, index_col=2, header=None) for p in ('1.csv','2.csv','3.csv')]
merged_dataframe = pd.concat(dataframes, axis=0)
merged_dataframe.to_csv("all.csv", index=False)
The problem is:
In the above code I am forced to mention the file names manually, which is very difficult; as a solution I need to include all csv files (*.csv).
I need to use something like writer.writerow(("Hello" + r[1] + "welcome")).
As there are multiple csv files with many rows (around 100k) in each file, I need to speed this up as well.
Here is a sample of the csv files:
"1.csv" "2.csv" "3.csv"
a,Jac b,William c,James
And here is how I would like the output (all.csv) to look:
Hello Jac welcome
Hello William welcome
Hello James welcome
Any solution using .merge(), .append() or .concat()?
How can I achieve this using python?
You don't need pandas for this. Here's a really simple way of doing it with the csv module.
import csv
import glob

with open("path/to/output", 'w') as outfile:
    for fpath in glob.glob('path/to/directory/*.csv'):
        with open(fpath) as infile:
            for row in csv.reader(infile):
                outfile.write("Hello {} welcome\n".format(row[1]))
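One small caveat: the csv module's documentation recommends opening csv files with newline='' to avoid newline translation issues, which matters especially when writing csv output.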
1) If you would like to import all .csv files in a folder, you can just use
for i in [a for a in os.listdir() if a[-4:] == '.csv']:
    # code to read in the .csv file and concatenate it to an existing dataframe
2) To append the text and write to a file, you can map a function to each element of the dataframe's column 2 to add the text.
# existing dataframe called df
df[df.columns[1]].map(lambda x: "Hello {} welcome".format(x)).to_csv(<targetpath>)
# replace <targetpath> with your target path
See http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.Series.to_csv.html for all the various parameters you can pass in to to_csv.
Here is a non-pandas solution using the built-in csv module. Not sure about speed.
import os
import csv

path_to_files = "path to files"
all_csv = os.path.join(path_to_files, "all.csv")
file_list = os.listdir(path_to_files)
names = []
for file in file_list:
    if file.endswith(".csv"):
        path_to_current_file = os.path.join(path_to_files, file)
        with open(path_to_current_file, "r") as current_csv:
            reader = csv.reader(current_csv, delimiter=',')
            for row in reader:
                names.append(row[1])
# open the output file and write one greeting per collected name
with open(all_csv, "w") as out_csv:
    writer = csv.writer(out_csv, delimiter=',')
    for name in names:
        writer.writerow(["Hello {} welcome".format(name)])
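One design note: this version collects every name in memory before writing, whereas the csv-only answer above streams each row straight to the output file; with around 100k rows per file the streaming approach should use noticeably less memory.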
