Select files based on content of a variable in Python

I am trying to select from a list of CSV files, all the files that contain a string. The string is included in a variable. So far this is what I got, but it's not working:
import os, re
import glob

for x in range(1, 3, 1):
    id = ['sbj' + str(x)]
    id = str(id)
    csvfile = []
    for file in glob.glob("*.csv"):
        if id in file:
            print(file)
Does anyone know how to do it?
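The match fails because `id = str(['sbj' + str(x)])` turns the value into the literal text `"['sbj1']"`, brackets and quotes included, which never appears in any filename. Keeping the id as a plain string makes the `in` test work. A minimal sketch, using a hardcoded sample list in place of the real `glob.glob("*.csv")` result:

```python
# sample filenames standing in for glob.glob("*.csv")
csv_files = ["sbj1_run1.csv", "sbj2_run1.csv", "other.csv"]

for x in range(1, 3):
    subject_id = 'sbj' + str(x)      # plain string: 'sbj1', then 'sbj2'
    for file in csv_files:           # with real files: glob.glob("*.csv")
        if subject_id in file:       # substring test now matches real names
            print(file)
```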

Related

Multiple download - CSV file

I have a script, below, that downloads the files listed in a single CSV file. It works well, and all files are downloaded into my 'Python Project' root folder.
But I would like to add two features. First, process not just one but multiple (20 or more) CSV files, so I don't have to change the name in open('name1.csv') manually every time the script finishes a job. Second, the downloads need to be placed in a folder with the same name as the CSV file they come from. Hopefully I'm clear enough :)
Then I could have:
name1.csv -> name1 folder -> download from name1 csv
name2.csv -> name2 folder -> download from name2 csv
name3.csv -> name3 folder -> download from name3 csv
...
Any help or suggestions will be more than appreciated :) Many thanks!
from collections import Counter
import urllib.request
import csv
import os

with open('name1.csv') as csvfile:  # need to add multiple .csv files here
    reader = csv.DictReader(csvfile)
    title_counts = Counter()
    for row in reader:
        name, ext = os.path.splitext(row['link'])
        title = row['title']
        title_counts[title] += 1
        # need to create a folder for each CSV file with the download inside
        title_filename = f"{title}_{title_counts[title]}{ext}".replace('/', '-')
        urllib.request.urlretrieve(row['link'], title_filename)
You need to add an outer loop that iterates over the files in a specific folder. You can use either os.listdir(), which returns all entries, or glob.iglob() with a *.csv pattern to get only files with the .csv extension.
There are also some minor improvements you can make. You're using Counter in a way that could be replaced with defaultdict or even a plain dict. Also, urllib.request.urlretrieve() is part of the legacy interface and might be deprecated in the future, so you can replace it with a combination of urllib.request.urlopen() and shutil.copyfileobj().
Finally, to create a folder you can use os.mkdir(), but you first need to check whether the folder already exists using os.path.isdir(); this is required to prevent a FileExistsError exception.
Full code:
from os import mkdir
from os.path import join, splitext, isdir
from glob import iglob
from csv import DictReader
from collections import defaultdict
from urllib.request import urlopen
from shutil import copyfileobj

csv_folder = r"/some/path"
glob_pattern = "*.csv"
for file in iglob(join(csv_folder, glob_pattern)):
    with open(file) as csv_file:
        reader = DictReader(csv_file)
        save_folder, _ = splitext(file)
        if not isdir(save_folder):
            mkdir(save_folder)
        title_counter = defaultdict(int)
        for row in reader:
            url = row["link"]
            title = row["title"]
            title_counter[title] += 1
            _, ext = splitext(url)
            save_filename = join(save_folder, f"{title}_{title_counter[title]}{ext}")
            with urlopen(url) as req, open(save_filename, "wb") as save_file:
                copyfileobj(req, save_file)
For 1: just loop over a list containing the names of your desired files.
The list can be retrieved using os.listdir(path), which returns a list of the entries contained inside your path (a folder containing the CSV files in your case).
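A minimal sketch of that idea; the folder path is a placeholder for the real CSV folder:

```python
import os

folder = "."  # placeholder: the folder holding the CSV files

# keep only entries that end in .csv, then build full paths to them
csv_names = [name for name in os.listdir(folder) if name.endswith(".csv")]
for name in csv_names:
    print(os.path.join(folder, name))
```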

Search for multiple files in folder with particular file name, and select the most recent file, or filename with the largest numeric value

Total beginner here. I'm writing a program that searches for a particular string in file names in a folder; the folder contains only PDFs.
For each file name I search for, typically it returns multiple files, like below:
200031-2018-252-20190828102708.pdf
200031-2018-252-20190828102735.pdf
but I'm only interested in opening the most recently created/modified file. In this case, it would be '200031-2018-252-20190828102735.pdf'.
So there are two ways I could solve this:
1. select the most recent file, or
2. select the file with the largest numeric value.
I've written code that returns the list of files with the same file name, but how do I select and open the most recent one?
Below is the code I've written so far:
import openpyxl
import pyperclip
import glob
import PyPDF2
import os
from pathlib import Path
import fitz

# define year
year = '-2018'
# change directory to the folder where the documents are
os.chdir('G:\\Current Users\\Research analyst project management\\Tim\\PCC KPIs\\automate\\New folder')
# open excel file
wb = openpyxl.load_workbook('Grad_Rates_Audit_2017_New_vs_Old.xlsx')
# select sheet
sheet = wb["Campus"]
# assign variable to cell value
cell_value = str(sheet.cell(8, 1).value)
# define search value
search_value = cell_value + year
# search for file name in folder
dir_path = Path('G:/Current Users/Research analyst project management/Tim/PCC KPIs/automate/New folder')
pdf_files = dir_path.glob('*.pdf')
for pdf_file in pdf_files:
    if search_value in pdf_file.name:
        print(pdf_file.name)
The 'print (pdf_file.name)' returns the following results:
200031-2018-252-20190828102708.pdf
200031-2018-252-20190828102735.pdf
You can use the max function and pass to the key argument a slice of the filename containing only the timestamp. To achieve this you can use the .stem property, which takes the path returned by glob and yields the final path component without its suffix; after that, slice the remaining string to get only the timestamp portion.
...
#search for file name in folder
dir_path = Path('G:/Current Users/Research analyst project management/Tim/PCC KPIs/automate/New folder')
list_of_files = dir_path.glob(f'*{search_value}*.pdf')
mostRecent = max(list_of_files, key=lambda fl: fl.stem[-13:])
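If option 1 is preferred (the most recently modified file on disk rather than the timestamp embedded in the name), the same max call can compare modification times instead. A sketch under the assumption that the folder path is a placeholder; default=None guards against an empty folder:

```python
from pathlib import Path

dir_path = Path(".")  # placeholder: the folder holding the PDFs
pdf_files = list(dir_path.glob("*.pdf"))

# st_mtime is the file's last-modification time in seconds
most_recent = max(pdf_files, key=lambda fl: fl.stat().st_mtime, default=None)
if most_recent is not None:
    print(most_recent.name)
```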

Get all folder names except for one in Python

I need to get all folder names EXCEPT for "Archives", using Path() ONLY, since I need to use glob later in the for loop. I'm on Kali Linux and the file structure is ./sheets/ containing the folders Archives and Test (both empty), plus the files creds.json and sheets.py.
# Imports
from pathlib import Path
import pandas as pd
import pygsheets
import glob
import os

# Setup
gc = pygsheets.authorize(service_file='./creds.json')
email = str(input("Enter email to share sheet: "))
folderName = Path("./")  # <<<<<<<<<<<<< HERE IS PROBLEM
for file in folderName.glob("*.txt"):
    if not Path("./Archives").exists():
        os.mkdir("./Archives")
    df = pd.DataFrame()
    df['name'] = ['John', 'Steve', 'Sarah', 'YESSSS']
    gc.create(folderName)
    sh = gc.open(file)
    sh.share(email, role='writer', type='user')
    wks = sh[0]
    wks.set_dataframe(df, (1, 1))
I expect the variable folderName to be any folder name except Archives, as a string.
My goal is a script that, when run, uses the folder name in ./sheets/ (Test in this case) as the newly created spreadsheet's name, uses the file names as headers, and puts the file contents (separated by newlines) underneath each header, then shares the sheet with me at my email. Using pygsheets, by the way.
from pathlib import Path

p = Path('./sheets')

# create Archives under p if needed
archives = p / 'Archives'
if not archives.exists():
    archives.mkdir()

# find all directories under p whose path doesn't include 'Archives'
folder_dirs = [x for x in p.glob('**/*') if x.is_dir() and 'Archives' not in x.parts]

# find all *.txt files under p whose path doesn't include 'Archives'
txt_file_dirs = [x for x in p.glob('**/*.txt') if x.is_file() and 'Archives' not in x.parts]

for file in txt_file_dirs:
    file_name = file.stem
When using pathlib you work with Path objects, not strings, and the library provides many methods for working on those objects.
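A few of those object attributes in action; the path value here is just an illustrative placeholder:

```python
from pathlib import Path

p = Path("./sheets/Test/creds.json")  # placeholder path

print(p.name)    # 'creds.json' -> final path component
print(p.stem)    # 'creds'      -> final component without its suffix
print(p.suffix)  # '.json'      -> the file extension
print(p.parts)   # ('sheets', 'Test', 'creds.json')
print(str(p))    # convert back to a plain string when an API needs one
```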

How to read particular text files out of multiple files in subdirectories in Python

I have one folder containing 5 sub-folders.
Each sub-folder contains some 'x.txt', 'y.txt' and 'z.txt' files, and this repeats in every sub-folder.
Now I need to read and print only the 'y.txt' file from all sub-folders.
My problem is that I'm unable to read and print the 'y.txt' files. Can you tell me how to solve this problem?
Below is the code I have written for reading the y.txt file:
import os, sys
import pandas as pd

file_path = '/Users/Naga/Desktop/Python/Data'
for root, dirs, files in os.walk(file_path):
    for name in files:
        print(os.path.join(root, name))
        pd.read_csv('TextInformation.txt', delimiter=";", names=['Name', 'Value'])
error: File TextInformation.txt does not exist: 'TextInformation.txt'
You could also try the following approach to fetch all y.txt files from your subdirectories:
import glob
import pandas as pd

# get all y.txt files from all subdirectories
all_files = glob.glob('/Users/Naga/Desktop/Python/Data/*/y.txt')
for file in all_files:
    data_from_this_file = pd.read_csv(file, sep=" ", names=['Name', 'Value'])
    # do something with the data
Subsequently, you can apply your code to all the files in the list all_files. The great thing about glob is that you can use wildcards (*); using them, you don't need the names of the subdirectories (you can even use one within the filename, e.g. *y.txt). Also see the documentation on glob.
Your issue is that you forgot to add the parent path to the 'y.txt' file. I suggest this code for you, hope it helps.
import os

pth = '/Users/Naga/Desktop/Python/Data'
list_sub = os.listdir(pth)
filename = 'TextInformation.txt'
for sub in list_sub:
    TextInfo = open('{}/{}/{}'.format(pth, sub, filename), 'r').read()
    print(TextInfo)
Here is a little code snippet; you can personalize it any way you like, but it works for your case.
import os

for dirPath, foldersInDir, fileNames in os.walk(path_to_main_folder):
    if fileNames:  # 'is not []' always evaluates True; an empty list is falsy, so test it directly
        for file in fileNames:
            if file.endswith('y.txt'):
                loc = os.sep.join([dirPath, file])
                y_txt = open(loc)
                y = y_txt.read()
                print(y)
But keep in mind that path_to_main_folder is the path to the folder that holds the subfolders.

Locating files by name for copying elsewhere

New to Python...
I'm trying to have Python take a text file of file names (a new name on each row) and store them as strings...
i.e
import os, shutil

files_to_find = []
with open('C:\\pathtofile\\lostfiles.txt') as fh:
    for row in fh:
        files_to_find.append(row.strip)
...in order to search for these files in directories and then copy any found files somewhere else...
for root, dirs, files in os.walk('D:\\'):
    for _file in files:
        if _file in files_to_find:
            print("Found file in: " + str(root))
            shutil.copy(os.path.abspath(root + '/' + _file), 'C:\\destination')
print("process completed")
Despite knowing these files exist, the script runs without any errors but without finding any files.
I added...
print (files_to_find)
...after the first block of code to see if it was finding anything, and saw screeds of "<built-in method strip of str object at 0x00000000037FC730>".
Does this tell me it's not successfully creating strings to compare file names against? Where am I going wrong?
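That printout is the giveaway: `row.strip` without parentheses appends the bound method object itself instead of calling it. Adding the parentheses stores the stripped string. A self-contained sketch, using a temporary demo file standing in for C:\pathtofile\lostfiles.txt:

```python
import os
import tempfile

# demo input standing in for C:\pathtofile\lostfiles.txt
demo_path = os.path.join(tempfile.gettempdir(), "lostfiles.txt")
with open(demo_path, "w") as fh:
    fh.write("a.txt\nb.txt\n")

files_to_find = []
with open(demo_path) as fh:
    for row in fh:
        files_to_find.append(row.strip())  # () calls strip; without it the method object is stored

print(files_to_find)  # ['a.txt', 'b.txt']
```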
Use a list to collect the file names.
import os
import sys
import glob
import shutil

def file_names(filepattern, dir):
    os.chdir(dir)
    file_list = []
    for line in sorted(glob.glob(filepattern)):
        line = line.split("/")
        line = line[-1]
        file_list.append(line)
    return file_list
Then loop over the returned list to compare.
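That comparison loop might look like this; the two name collections are stand-ins for the wanted list and the output of file_names() above. Using sets makes the membership check cheap:

```python
# stand-ins for the wanted names and for file_names(...) output
files_to_find = {"b.txt", "missing.txt"}
found_files = {"a.txt", "b.txt", "c.txt"}

# set intersection checks every wanted name in one pass
matches = found_files & files_to_find
print(sorted(matches))  # ['b.txt']
```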
