I have a script, below, that downloads the files linked in one column of a single CSV file. It works well, and all files are downloaded into the root of my 'Python Project' folder.
But I would like to add two features. First, process not just 1 but multiple (20 or more) CSV files, so that I don't have to change the name manually in open('name1.csv') every time the script finishes. Second, the downloads need to be placed in a folder with the same name as the CSV file they come from. Hopefully I'm clear enough :)
Then I could have:
name1.csv -> name1 folder -> download from name1 csv
name2.csv -> name2 folder -> download from name2 csv
name3.csv -> name3 folder -> download from name3 csv
...
Any help or suggestions will be more than appreciated :) Many thanks!
from collections import Counter
import urllib.request
import csv
import os

with open('name1.csv') as csvfile:  # need to add multiple .csv files here.
    reader = csv.DictReader(csvfile)
    title_counts = Counter()
    for row in reader:
        name, ext = os.path.splitext(row['link'])
        title = row['title']
        title_counts[title] += 1
        title_filename = f"{title}_{title_counts[title]}{ext}".replace('/', '-')  # need to create a folder for each CSV file with the download inside.
        urllib.request.urlretrieve(row['link'], title_filename)
You need to add an outer loop that iterates over the files in a specific folder. You can use either os.listdir(), which returns a list of all entries, or glob.iglob() with a *.csv pattern to get only the files with a .csv extension.
There are also some minor improvements you can make to your code. You're using Counter in a way that could be replaced with defaultdict or even a plain dict. Also, urllib.request.urlretrieve() is part of a legacy interface that might get deprecated, so you can replace it with a combination of urllib.request.urlopen() and shutil.copyfileobj().
Finally, to create a folder you can use os.mkdir(), but first check whether the folder already exists using os.path.isdir(); this is required to prevent a FileExistsError exception.
Full code:
from os import mkdir
from os.path import join, splitext, isdir
from glob import iglob
from csv import DictReader
from collections import defaultdict
from urllib.request import urlopen
from shutil import copyfileobj

csv_folder = r"/some/path"
glob_pattern = "*.csv"

for file in iglob(join(csv_folder, glob_pattern)):
    with open(file) as csv_file:
        reader = DictReader(csv_file)
        # The download folder is the CSV path without its extension.
        save_folder, _ = splitext(file)
        if not isdir(save_folder):
            mkdir(save_folder)
        title_counter = defaultdict(int)
        for row in reader:
            url = row["link"]
            title = row["title"]
            title_counter[title] += 1
            _, ext = splitext(url)
            # Replace "/" in the title so it is not treated as a path separator.
            filename = f"{title}_{title_counter[title]}{ext}".replace("/", "-")
            save_filename = join(save_folder, filename)
            with urlopen(url) as req, open(save_filename, "wb") as save_file:
                copyfileobj(req, save_file)
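The folder and file naming described above can be separated from the download itself, which makes it easy to check without network access. This is only a minimal sketch: plan_downloads is a hypothetical helper name, it assumes the same 'link' and 'title' columns as the question, and it swaps the isdir()/mkdir() pair for os.makedirs(..., exist_ok=True).

```python
import csv
import os
from collections import defaultdict


def plan_downloads(csv_path):
    """Return (url, save_path) pairs for one CSV file.

    Hypothetical helper: assumes the CSV has 'link' and 'title' columns
    and creates a folder named after the CSV file (minus its extension)
    next to it.
    """
    save_folder, _ = os.path.splitext(csv_path)
    # exist_ok=True makes this a no-op when the folder already exists.
    os.makedirs(save_folder, exist_ok=True)

    title_counter = defaultdict(int)
    plan = []
    with open(csv_path, newline="") as csv_file:
        for row in csv.DictReader(csv_file):
            title = row["title"]
            title_counter[title] += 1
            _, ext = os.path.splitext(row["link"])
            # Replace "/" so a title cannot be mistaken for a sub-path.
            filename = f"{title}_{title_counter[title]}{ext}".replace("/", "-")
            plan.append((row["link"], os.path.join(save_folder, filename)))
    return plan
```

Each (url, save_path) pair can then be handed to the urlopen() and copyfileobj() step shown in the answer.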
For 1: just loop over a list containing the names of the files you want.
The list can be retrieved using os.listdir(path), which returns a list of the entries inside path (in your case, the folder containing the CSV files).
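As a rough illustration of that loop, the sketch below filters the os.listdir() result down to CSV files and joins each name with the folder. list_csv_files is a hypothetical name, and treating the extension case-insensitively is an assumption rather than something the question asked for.

```python
import os


def list_csv_files(folder):
    """Return the full paths of the .csv files directly inside folder."""
    paths = []
    for name in sorted(os.listdir(folder)):
        full_path = os.path.join(folder, name)
        # os.listdir() returns bare names, sub-folders included, so join
        # with the folder and keep only regular files ending in .csv.
        if os.path.isfile(full_path) and name.lower().endswith(".csv"):
            paths.append(full_path)
    return paths
```

The returned paths can be passed straight to open(), one per iteration of the outer loop.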
Related
I'm looking to retrieve a list of CSV files, and use these names as variables to open and retrieve their content. Something like this:
import csv
import os

files = os.listdir('C:/csvs')
with open(files[0], 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    for line in csv_reader:
        if line[1]=="**STAFF**":
            pass
        else:
            print(line)
If I print files[0], I do get the correct content, but when I try the above code it does not work.
os.listdir(directory_path) returns only the names of the entries inside the folder. To actually open a file you need its full path (absolute or relative). This is easily done by joining each file's name to directory_path, like this:
import os
files = os.listdir(directory_path)
full_file_path = os.path.join(directory_path, files[0])
You can also use glob to save the trouble of joining the paths.
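A small sketch of the glob approach, for comparison: the pattern already contains the directory, so every result is a path that open() can use directly. csv_paths is a hypothetical name; glob.escape() is used on the assumption that the folder name itself might contain characters such as [ or *.

```python
import glob
import os


def csv_paths(directory_path):
    """Return the paths of the *.csv files in directory_path, already joined."""
    # Escape the folder part so only the "*.csv" suffix acts as a pattern.
    pattern = os.path.join(glob.escape(directory_path), "*.csv")
    return sorted(glob.glob(pattern))
```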
I am trying to select from a list of CSV files, all the files that contain a string. The string is included in a variable. So far this is what I got, but it's not working:
import os, re
import glob

for x in range(1,3,1):
    id = ['sbj'+ str(x)]
    id = str(id)
    csvfile=[]
    for file in glob.glob("*.csv"):
        if id in file:
            print(file)
Does anyone know how to do this?
I'm able to get the zip from an HTTPS response and extract it into a specific folder using the code snippet below:
z = zipfile.ZipFile(io.BytesIO(statement_resp.content))
z.extractall("/pathtostore")
However, in /pathtostore the zip gets extracted under some random name. Is there a way to control that name during the extraction itself?
Currently, after zip extraction, below is the directory structure:
/pathtostore/ZaXyzzz
--> ZaXyzzz is the zip name.
I'm looking for something as below:
/pathtostore/1234_2020_03_02
--> 1234_2020_03_02 (cid_curdate) is the zip name which I want.
PS: I cannot extract the zip and rename it afterwards, as there could be multiple zips present inside /pathtostore
You can get the names with z.namelist(), read every file separately with z.read(), and write it under a new name using the standard open(), write(), close().
Minimal example.
It may need more code if the zip file has folders.
import zipfile
import datetime
import os

z = zipfile.ZipFile('input.zip')
folder = '/pathtostore'
os.makedirs(folder, exist_ok=True)
today = datetime.date.today().strftime('%Y_%m_%d')
cid = 0
for old_name in z.namelist():
    cid += 1
    new_name = os.path.join(folder, '{:04}_{}'.format(cid, today))
    print(old_name, '->', new_name)
    data = z.read(old_name)
    with open(new_name, 'wb') as fh:
        fh.write(data)
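The case where the zip file has folders could be handled roughly as follows: skip the directory entries and create parent folders on demand before writing each file. This is a sketch under assumptions, not a drop-in replacement: extract_renamed and prefix are hypothetical names, and instead of renaming each member it places the whole archive layout under a folder called prefix.

```python
import os
import zipfile


def extract_renamed(zip_source, folder, prefix):
    """Write every file in the archive to folder/prefix/<original sub-path>.

    zip_source may be a path or a file-like object such as io.BytesIO.
    """
    written = []
    with zipfile.ZipFile(zip_source) as z:
        for info in z.infolist():
            if info.is_dir():
                continue  # folders are created below when a file needs them
            # Drop empty, "." and ".." components so a member name cannot
            # point outside the target folder.
            parts = [p for p in info.filename.split("/") if p not in ("", ".", "..")]
            target = os.path.join(folder, prefix, *parts)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with open(target, "wb") as fh:
                fh.write(z.read(info))
            written.append(target)
    return written
```

With the response from the question, a call might look like extract_renamed(io.BytesIO(statement_resp.content), "/pathtostore", "1234_2020_03_02").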
You can read the zip file's ZipInfo structures and modify each one's filename attribute before extracting:
import io
import zipfile
from pathlib import Path

z = zipfile.ZipFile(io.BytesIO(statement_resp.content))
for info in z.infolist():
    # Directory entries are skipped; extract() creates parent folders itself.
    if info.is_dir():
        continue
    # implement your extraction policy here. Remove root
    # path name and add our own
    zip_path = Path(info.filename)
    info.filename = str(Path("1234_2020_03_02").joinpath(*zip_path.parts[1:]))
    z.extract(info)
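To try the ZipInfo idea without a real HTTP response, the same steps can be wrapped in a function that takes the raw zip bytes. This is a sketch under assumptions: extract_with_new_root, new_root and dest are hypothetical names, and members that sit directly at the top level of the archive are simply placed under new_root.

```python
import io
import zipfile
from pathlib import PurePosixPath


def extract_with_new_root(zip_bytes, new_root, dest):
    """Extract an in-memory zip into dest, replacing its top-level folder name."""
    extracted = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as z:
        for info in z.infolist():
            if info.is_dir():
                continue  # extract() creates the parent folders it needs
            # Member names inside a zip always use "/" as the separator.
            parts = PurePosixPath(info.filename).parts
            rest = parts[1:] if len(parts) > 1 else parts
            # Changing filename only affects where the member is written.
            info.filename = str(PurePosixPath(new_root).joinpath(*rest))
            extracted.append(z.extract(info, dest))
    return extracted
```

For the question's case this would be called as extract_with_new_root(statement_resp.content, "1234_2020_03_02", "/pathtostore").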
I have a csv file that looks like this:
dc_identifier,aubrey_identifier
AR0776-280206-LT513-01,metadc1084267
AR0776-280206-LT513-02,metadc1083385
AR0776-280206-LT513-03,metadc1084185
AR0776-280206-LT513-04,metadc1083449
AR0776-280206-LT513-05,metadc1084294
AR0776-280206-LT513-06,metadc1083393
AR0776-280206-LT513-07,metadc1083604
AR0776-280206-LT513-08,metadc1083956
AR0776-280206-LT513-09,metadc1083223
AR0776-280206-LT513-10,metadc1084224
I need to create folders with the "metadc#######" names within the directory that the script will live in.
Here's what I have so far:
import os
import fileinput

path = 'C:\Users\gpp0020\Desktop\TestDir'
textFile = 'C:\Users\gpp0020\Desktop\TestDir\kxas_ids.csv'
myList = open(textFile, 'rb+')
for line in myList:
    for item in line.strip().split(','):
        os.makedirs(os.path.join(path, item))
        print 'created', item
However! I also need the program to grab the files that are named with the identifiers (AR0776-280206-LT513-01, etc.) and put them in the corresponding metadc folder, according to the csv. Each file comes as a pair (one .mkv file and one .mkv.md5 checksum file), and both need to go into the folder.
What's the best way to go about this?
Use the csv library to help with reading the file in:
import csv
import os
import shutil

path = r'C:\Users\gpp0020\Desktop\TestDir'

with open('kxas_ids.csv', 'r', newline='') as f_input:
    csv_input = csv.reader(f_input)
    header = next(csv_input)
    for dv, aubrey in csv_input:
        os.makedirs(os.path.join(path, aubrey), exist_ok=True)
        mkv = '{}.mkv'.format(dv)
        shutil.copy2(os.path.join(path, mkv), os.path.join(path, aubrey, mkv))
        mkv_md5 = '{}.mkv.md5'.format(dv)
        shutil.copy2(os.path.join(path, mkv_md5), os.path.join(path, aubrey, mkv_md5))
For example, for the first row this would:
Create a folder called C:\Users\gpp0020\Desktop\TestDir\metadc1084267
Copy a file called AR0776-280206-LT513-01.mkv into it.
Copy a file called AR0776-280206-LT513-01.mkv.md5 into it.
It assumes that all the files are found in path.
I have a folder (not zipped) containing multiple zip files (no other file type within folder). Each zip has the same type of text files containing different data saved within.
I know how to read in each separately, but I am looking to loop the process without having to type in each zip name. The zipfile archive does not seem to allow wild cards, so I cannot loop using this method. Is it possible to loop the process using glob?
The goal is to get the agency names without extracting all the zipfiles.
Single file read
import os
os.listdir('C:\\NTM\\Test\\')
['00003_32_332.zip', '00011_273_569.zip', '00012_258_276.zip']
import glob
glob.glob('C:\\NTM\\Test\\*.zip')
['C:\\NTM\\Test\\00003_32_332.zip', 'C:\\NTM\\Test\\00011_273_569.zip', 'C:\\NTM\\Test\\00012_258_276.zip']
import zipfile
archive=zipfile.ZipFile('C:\\NTM\\Test\\00011_273_569.zip')
testagency=archive.open('agency.txt')
testagency.read()
'agency_id,agency_name,nVRT,ValleyRide'
Update:
Now that I can loop through the zip files and get to the text file, I still cannot print the agency_name from all of the zip files in the folder. My current code only prints the name of the last agency from the text file of the last zip file in the folder. Am I missing some compound statement structure?
import csv

def csv_dict_reader(file_obj):
    reader = csv.DictReader(file_obj, delimiter=',')
    for row in reader:
        print(row['agency_name'])

if __name__ == '__main__':
    with archive.open('agency.txt') as f_obj:
        csv_dict_reader(f_obj)
Whatcom Transportation Authority
Sample Code
import glob
import zipfile

dirName = '/backup/'
zipList = glob.glob(dirName + '*.zip')
for zipname in zipList:
    archive = zipfile.ZipFile(zipname)
    fileList = archive.namelist()
    for fileName in fileList:
        if fileName.endswith('.txt'):
            archive.extract(fileName)
    archive.close()
Thanks Jean-Francois!
for archive_name in glob.glob('C:\\NTM\\Test\\*.zip'):
    archive=zipfile.ZipFile(archive_name)
    testagency=archive.open('agency.txt')
    testagency.read()
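Building on that loop, one way to get every agency name without extracting anything is to do the CSV parsing inside the loop and collect the results as you go, so that each archive contributes its rows. This is only a sketch: agency_names is a hypothetical name, and the UTF-8 encoding (with an optional BOM) is an assumption about the files.

```python
import csv
import glob
import io
import os
import zipfile


def agency_names(folder, member="agency.txt"):
    """Collect the agency_name column of member from every zip in folder."""
    names = []
    pattern = os.path.join(glob.escape(folder), "*.zip")
    for archive_name in sorted(glob.glob(pattern)):
        with zipfile.ZipFile(archive_name) as archive:
            with archive.open(member) as raw:
                # archive.open() yields bytes; wrap it so csv receives text.
                text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
                for row in csv.DictReader(text):
                    names.append(row["agency_name"])
    return names
```

Calling agency_names('C:\\NTM\\Test\\') would then return one list covering all of the archives, which can be printed or processed further.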
I could not comment on Fuji Komalan's answer, so here is the fixed code.
import glob
import zipfile

dirName = 'C:/test/'
zipList = glob.glob(dirName + '*.zip')
print(zipList)
for zipname in zipList:
    archive = zipfile.ZipFile(zipname)
    fileList = archive.namelist()
    for fileName in fileList:
        if fileName.endswith('.txt'):
            archive.extract(fileName)
            print(fileName)
    archive.close()