Extracting Zip From HTTP Response in Python

I can get a zip from an HTTPS response and extract it into a specific folder using the snippet below:
z = zipfile.ZipFile(io.BytesIO(statement_resp.content))
z.extractall("/pathtostore")
However, inside /pathtostore the zip gets extracted under a seemingly random name. Is there a way to control the name that is created during the extraction itself?
Currently, after extraction, the directory structure is:
/pathtostore/ZaXyzzz
--> ZaXyzzz is the zip name.
I'm looking for something like this:
/pathtostore/1234_2020_03_02
--> 1234_2020_03_02 (cid_curdate) is the zip name which I want.
PS: I cannot extract the zip and rename the result afterwards, as there could be multiple extracted zips inside /pathtostore.

You can get the member names with z.namelist(), read every file separately with z.read(), and write it under a new name using the standard open(), write(), close().
Minimal example.
It may need more code if the zipfile has folders.
import zipfile
import datetime
import os
z = zipfile.ZipFile('input.zip')
folder = '/pathtostore'
os.makedirs(folder, exist_ok=True)
today = datetime.date.today().strftime('%Y_%m_%d')
cid = 0
for old_name in z.namelist():
    cid += 1
    new_name = os.path.join(folder, '{:04}_{}'.format(cid, today))
    print(old_name, '->', new_name)
    data = z.read(old_name)
    with open(new_name, 'wb') as fh:
        fh.write(data)

You can read the zipfile's ZipInfo structures and modify each one's filename attribute before extracting it:
import io
import zipfile
from pathlib import Path
z = zipfile.ZipFile(io.BytesIO(statement_resp.content))
for info in z.infolist():
    if info.is_dir():
        continue  # parent folders are created when the files are extracted
    # implement your extraction policy here. Remove root
    # path name and add our own
    zip_path = Path(info.filename)
    info.filename = str(Path("1234_2020_03_02").joinpath(*zip_path.parts[1:]))
    z.extract(info, "/pathtostore")
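Putting the two answers together, here is a minimal sketch of a helper that extracts an in-memory zip under a folder name of your choice. The function name extract_with_new_root and its parameters are made up for illustration, and it assumes every member of the archive sits under a single top-level folder (as in the ZaXyzzz example); members that sit directly at the top level are skipped.

```python
import io
import zipfile
from pathlib import Path, PurePosixPath


def extract_with_new_root(zip_bytes, dest_dir, new_root):
    """Extract an in-memory zip, replacing its top-level folder with new_root.

    Hypothetical helper: assumes all members live under one top-level folder.
    """
    dest = Path(dest_dir) / new_root
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            # Member names inside a zip always use forward slashes.
            parts = PurePosixPath(info.filename).parts[1:]  # drop the old root
            if not parts or ".." in parts:
                continue  # skip the root entry itself and unsafe paths
            target = dest.joinpath(*parts)
            if info.is_dir():
                target.mkdir(parents=True, exist_ok=True)
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(archive.read(info))
    return dest
```

With the question's variables this would be called as extract_with_new_root(statement_resp.content, "/pathtostore", "1234_2020_03_02"). Writing the bytes directly avoids mutating the ZipInfo objects, at the cost of not preserving file permissions.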

Related

Multiple download - CSV file

I have a script, below, that downloads the files linked from a single CSV file. It works well, and all files are downloaded into the root of my 'Python Project' folder.
But I would like to add two things. First, process not just one but multiple (20 or more) CSV files, so that I don't have to change the name manually in open('name1.csv') every time the script has done its job. Second, place the downloads in a folder with the same name as the CSV file they come from. Hopefully I'm clear enough :)
Then I could have:
name1.csv -> name1 folder -> download from name1 csv
name2.csv -> name2 folder -> download from name2 csv
name3.csv -> name3 folder -> download from name3 csv
...
Any help or suggestions will be more than appreciated :) Many thanks!
from collections import Counter
import urllib.request
import csv
import os
with open('name1.csv') as csvfile:  # need to add multiple .csv files here.
    reader = csv.DictReader(csvfile)
    title_counts = Counter()
    for row in reader:
        name, ext = os.path.splitext(row['link'])
        title = row['title']
        title_counts[title] += 1
        title_filename = f"{title}_{title_counts[title]}{ext}".replace('/', '-')  # need to create a folder for each CSV file with the download inside.
        urllib.request.urlretrieve(row['link'], title_filename)

You need to add an outer loop that iterates over the files in a specific folder. You can use either os.listdir(), which returns a list of all entries, or glob.iglob() with a *.csv pattern to get only files with the .csv extension.
There are also some minor improvements you can make to your code. You're using Counter in a way that can be replaced with defaultdict or even a plain dict. Also, urllib.request.urlretrieve() is part of a legacy interface which might get deprecated, so you can replace it with a combination of urllib.request.urlopen() and shutil.copyfileobj().
Finally, to create a folder you can use os.mkdir(), but first check whether the folder already exists using os.path.isdir(); this prevents a FileExistsError exception.
Full code:
from os import mkdir
from os.path import join, splitext, isdir
from glob import iglob
from csv import DictReader
from collections import defaultdict
from urllib.request import urlopen
from shutil import copyfileobj
csv_folder = r"/some/path"
glob_pattern = "*.csv"
for file in iglob(join(csv_folder, glob_pattern)):
    with open(file) as csv_file:
        reader = DictReader(csv_file)
        save_folder, _ = splitext(file)
        if not isdir(save_folder):
            mkdir(save_folder)
        title_counter = defaultdict(int)
        for row in reader:
            url = row["link"]
            title = row["title"]
            title_counter[title] += 1
            _, ext = splitext(url)
            save_filename = join(save_folder, f"{title}_{title_counter[title]}{ext}")
            with urlopen(url) as req, open(save_filename, "wb") as save_file:
                copyfileobj(req, save_file)

For 1: Just loop over a list containing the names of your desired files.
The list can be retrieved using "os.listdir(path)" which returns a list of the files contained inside your "path" (a folder containing the csv files in your case).
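To check the folder and file naming before downloading anything, the naming logic can be separated from the network calls. The sketch below is only an illustration: plan_downloads is a made-up name, the CSV is assumed to have the title and link columns from the question, and the extension is taken from the URL path so that a query string does not end up in the file name.

```python
import csv
from collections import defaultdict
from pathlib import Path
from urllib.parse import urlsplit


def plan_downloads(csv_path):
    """Return (url, target_path) pairs for one CSV, without downloading anything."""
    csv_path = Path(csv_path)
    save_folder = csv_path.with_suffix("")  # name1.csv -> name1
    counts = defaultdict(int)
    plan = []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            title = row["title"].replace("/", "-")  # keep titles path-safe
            counts[title] += 1
            ext = Path(urlsplit(row["link"]).path).suffix
            target = save_folder / f"{title}_{counts[title]}{ext}"
            plan.append((row["link"], target))
    return plan
```

Each returned pair can then be passed to urlopen() and copyfileobj() as in the full code above, after creating the target's parent folder.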

How to open and read text files in a folder python

I have a folder which has text files in it. I want to be able to put in a path to this folder and have Python go through the folder, open each file and append its content to a list.
import os
folderpath = "/Users/myname/Downloads/files/"
inputlst = [os.listdir(folderpath)]
filenamelist = []
for filename in os.listdir(folderpath):
    if filename.endswith(".txt"):
        filenamelist.append(filename)
print(filenamelist)
So far this outputs:
['test1.txt', 'test2.txt', 'test3.txt', 'test4.txt', 'test5.txt', 'test6.txt', 'test7.txt', 'test8.txt', 'test9.txt', 'test10.txt']
I want the code to take each of these files, open it and put all of its content into a single huge list, not just print the file names. Is there any way to do this?

You should use the built-in open() for this.
Its documentation describes the more advanced options.
Anyway, here is one way you can do it:
import os
folderpath = r"yourfolderpath"
filenamecontent = []
for filename in os.listdir(folderpath):
    if filename.endswith(".txt"):
        # the with block closes each file after reading it
        with open(os.path.join(folderpath, filename), 'r') as f:
            filenamecontent.append(f.read())
print(filenamecontent)

If you are using Python 3, you can use:
for filename in filename_list:
    with open(filename, "r") as file_handler:
        data = file_handler.read()
Please keep in mind that filename needs to hold the full (either relative or absolute) path to your file.
This way, your file handler will be closed automatically when you leave the with scope.
More information here: https://docs.python.org/fr/3/library/functions.html#open
On a side note, in order to list files, you might want to have a look at glob and use:
filename_list = glob.glob("/path/to/files/*.txt")

You can use fileinput
Code:
import fileinput
import os
folderpath = "your_path_to_directory_where_files_are_stored"
# Full paths of all the files in .txt format
file_list = [os.path.join(folderpath, a) for a in os.listdir(folderpath) if a.endswith(".txt")]
get_all_files = fileinput.input(file_list)
with open("alldata.txt", 'a') as writefile:
    for line in get_all_files:
        # each line keeps its own line ending
        writefile.write(line)
The above code reads all the data from the .txt files in the specified directory (folderpath) and stores it in alldata.txt. So the long list you wanted is now stored in a text file; if you don't need the file, you can remove the write step.
Links:
https://docs.python.org/3/library/fileinput.html
https://docs.python.org/3/library/functions.html#open
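For comparison, the same idea can be written with pathlib, which yields full paths directly. This is only a sketch: read_text_files is an illustrative name, and the files are assumed to be UTF-8 encoded.

```python
from pathlib import Path


def read_text_files(folder, pattern="*.txt"):
    """Return the contents of every matching file in folder, sorted by name."""
    contents = []
    for path in sorted(Path(folder).glob(pattern)):
        # read_text() opens, reads and closes the file in one call
        contents.append(path.read_text(encoding="utf-8"))
    return contents
```

Note that sorting is by file name as a string, so test10.txt sorts before test2.txt.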

Move pairs of files (.txt & .xml) into their corresponding folder using Python

I have been working on this challenge for about a day or so. I've looked at multiple questions and answers asked on SO and tried to 'MacGyver' the code used for my purpose, but I'm still having issues.
I have a directory (let's call it "src\") with hundreds of files (.txt and .xml). Each .txt file has an associated .xml file (let's call it a pair). Example:
src\text-001.txt
src\text-001.xml
src\text-002.txt
src\text-002.xml
src\text-003.txt
src\text-003.xml
Here's an example of how I would like it to turn out, so that each pair of files is placed into a single unique folder:
src\text-001\text-001.txt
src\text-001\text-001.xml
src\text-002\text-002.txt
src\text-002\text-002.xml
src\text-003\text-003.txt
src\text-003\text-003.xml
What I'd like to do is create an associated folder for each pair and then move each pair of files into its respective folder using Python. I've already tried working from code I found (thanks to a post from Nov '12 by Sethdd), but am having trouble figuring out how to use the move function to grab pairs of files. Here's where I'm at:
import os
import shutil
srcpath = "PATH_TO_SOURCE"
srcfiles = os.listdir(srcpath)
destpath = "PATH_TO_DEST"
# grabs the name of the file before extension and uses as the dest folder name
destdirs = list(set([filename[0:9] for filename in srcfiles]))
def create(dirname, destpath):
    full_path = os.path.join(destpath, dirname)
    os.mkdir(full_path)
    return full_path
def move(filename, dirpath):
    shutil.move(os.path.join(srcpath, filename), dirpath)
# create destination directories and store their names along with full paths
targets = [
    (folder, create(folder, destpath)) for folder in destdirs
]
for dirname, full_path in targets:
    for filename in srcfiles:
        if dirname == filename[0:9]:
            move(filename, full_path)
I feel like it should be easy, but Python isn't something I work with every day and it's been a while since my scripting days... Any help would be greatly appreciated!
Thanks,
WK2EcoD

Use the glob module to iterate over all of the 'txt' files. From that you can parse the names, create the folders and move the files.

The process should be as simple as it appears to you as a human.
for file_name in os.listdir(srcpath):
    dir = file_name[:9]
    # if dir doesn't exist, create it
    # move file_name to dir
You're doing a lot of intermediate work that seems to be confusing you.
Also, insert some simple print statements to track data flow and execution flow. It appears that you have no tracing output so far.

You can do it with the os module. For every file in the directory, check whether the associated folder exists, create it if needed and then move the file. See the code below:
import os
SRC = 'path-to-src'
for fname in os.listdir(SRC):
    filename, file_extension = os.path.splitext(fname)
    # splitext() keeps the leading dot in the extension
    if file_extension not in ['.xml', '.txt']:
        continue
    folder_path = os.path.join(SRC, filename)
    if not os.path.exists(folder_path):
        os.mkdir(folder_path)
    os.rename(
        os.path.join(SRC, fname),
        os.path.join(folder_path, fname)
    )

My approach would be:
Find the pairs that I want to move (do nothing with files without a pair)
Create a directory for every pair
Move the pair to the directory
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import os, shutil
import re
def getPairs(files):
    pairs = []
    file_re = re.compile(r'^(.*)\.(.*)$')
    for f in files:
        match = file_re.match(f)
        if match:
            (name, ext) = match.groups()
            if ext == 'txt' and name + '.xml' in files:
                pairs.append(name)
    return pairs
def movePairsToDir(pairs):
    for name in pairs:
        os.mkdir(name)
        shutil.move(name+'.txt', name)
        shutil.move(name+'.xml', name)
files = os.listdir()
pairs = getPairs(files)
movePairsToDir(pairs)
NOTE: This script works when called inside the directory with the pairs.
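The same three steps can be sketched with pathlib so that the script works from any working directory. The function name move_pairs is made up for illustration; it assumes each pair shares the same stem (text-001.txt and text-001.xml) and leaves unpaired files untouched.

```python
import shutil
from pathlib import Path


def move_pairs(src_dir):
    """Move each name.txt / name.xml pair into src_dir/name/; return moved names."""
    src = Path(src_dir)
    moved = []
    # sorted() takes a snapshot of the listing before any file is moved
    for txt_path in sorted(src.glob("*.txt")):
        xml_path = txt_path.with_suffix(".xml")
        if not xml_path.is_file():
            continue  # leave unpaired files where they are
        target = src / txt_path.stem
        target.mkdir(exist_ok=True)
        for path in (txt_path, xml_path):
            shutil.move(str(path), str(target / path.name))
        moved.append(txt_path.stem)
    return moved
```

Using the stem instead of a fixed slice such as filename[0:9] means the folder name does not depend on the length of the file names.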

Read .txt from multiple .zip in folder

I have a folder (not zipped) containing multiple zip files (no other file type within folder). Each zip has the same type of text files containing different data saved within.
I know how to read in each separately, but I am looking to loop the process without having to type in each zip name. The zipfile archive does not seem to allow wild cards, so I cannot loop using this method. Is it possible to loop the process using glob?
The goal is to get the agency names without extracting all the zipfiles.
Single file read
import os
os.listdir('C:\\NTM\\Test\\')
['00003_32_332.zip', '00011_273_569.zip', '00012_258_276.zip']
import glob
glob.glob('C:\\NTM\\Test\\*.zip')
['C:\\NTM\\Test\\00003_32_332.zip', 'C:\\NTM\\Test\\00011_273_569.zip', 'C:\\NTM\\Test\\00012_258_276.zip']
import zipfile
archive=zipfile.ZipFile('C:\\NTM\\Test\\00011_273_569.zip')
testagency=archive.open('agency.txt')
testagency.read()
'agency_id,agency_name,nVRT,ValleyRide'
Update:
Now that I can loop through the zip files and get the text file from each, I still cannot print the agency_name from all of the zip files in the folder. My current code only prints the name of the last agency from the text file of the last zip file in the folder. Am I missing some compound statement structure?
def csv_dict_reader(file_obj):
    reader = csv.DictReader(file_obj, delimiter=',')
    for row in reader:
        print(row['agency_name'])
if __name__ == '__main__':
    with archive.open('agency.txt') as f_obj:
        csv_dict_reader(f_obj)
Whatcom Transportation Authority

Sample Code
import glob
import zipfile
dirName = '/backup/'
zipList = glob.glob(dirName+'*.zip')
for zipname in zipList:
    archive = zipfile.ZipFile(zipname)
    fileList = archive.namelist()
    for fileName in fileList:
        if fileName.endswith('.txt'):
            archive.extract(fileName)
    archive.close()

Thanks Jean-Francois!
for archive_name in glob.glob('C:\\NTM\\Test\\*.zip'):
    archive=zipfile.ZipFile(archive_name)
    testagency=archive.open('agency.txt')
    testagency.read()

I could not comment on Fuji Komalan's answer, so I am posting this separately.
Here is a working version of that code, with print calls added to show which archives and files are processed.
import glob
import zipfile
dirName = 'C:/test/'
zipList = glob.glob(dirName + '*.zip')
print(zipList)
for zipname in zipList:
    archive = zipfile.ZipFile(zipname)
    fileList = archive.namelist()
    for fileName in fileList:
        if fileName.endswith('.txt'):
            archive.extract(fileName)
            print(fileName)
    archive.close()
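The question's goal was to get the agency names without extracting anything, and the update's symptom (only the last agency printed) is what you would see if the reading code runs once after the loop instead of inside it. A minimal Python 3 sketch that reads inside the loop is shown below; agency_names is a made-up name, and the files are assumed to be UTF-8 CSV with an agency_name column, as in the question.

```python
import csv
import glob
import io
import os
import zipfile


def agency_names(folder, member="agency.txt"):
    """Collect agency_name values from `member` in every zip in folder."""
    names = []
    for zip_path in sorted(glob.glob(os.path.join(folder, "*.zip"))):
        with zipfile.ZipFile(zip_path) as archive:
            if member not in archive.namelist():
                continue  # skip archives without the expected file
            with archive.open(member) as raw:
                # archive.open() yields bytes; wrap it so csv gets text
                text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
                for row in csv.DictReader(text):
                    names.append(row["agency_name"])
    return names
```

Nothing is written to disk: each agency.txt is read straight from its archive.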

Unzip nested zip files in python

I am looking for a way to unzip nested zip files in python. For example, consider the following structure (hypothetical names for ease):
Folder
    ZipfileA.zip
        ZipfileA1.zip
        ZipfileA2.zip
    ZipfileB.zip
        ZipfileB1.zip
        ZipfileB2.zip
...etc. I am trying to access text files that are within the second zip. I certainly don't want to extract everything, as the sheer numbers would crash the computer (there are several hundred zips in the first layer, and almost 10,000 in the second layer (per zip)).
I have been playing around with the 'zipfile' module - I am able to open the 1st level of zipfiles. E.g.:
zipfile_obj = zipfile.ZipFile("/Folder/ZipfileA.zip")
next_layer_zip = zipfile_obj.open("ZipfileA1.zip")
However, this returns a "ZipExtFile" instance (not a file or zipfile instance), and I can't then go on and open this particular data type. That is, I can't do this:
data = next_layer_zip.open("data.txt")
I can however "read" this zip file with:
next_layer_zip.read()
But this is entirely useless! (i.e. can only read compressed data/gobbledygook).
Does anyone have any ideas on how I might go about this (without using ZipFile.extract)??
I came across this, http://pypi.python.org/pypi/zip_open/ - which looks to do exactly what I want, but it doesn't seem to work for me. (keep getting "[Errno 2] No such file or directory:" for the files I am trying to process, using that module).
Any ideas would be much appreciated!! Thanks in advance

ZipFile needs a file-like object, so you can use StringIO to turn the data you read from the nested zip into such an object (the code below is Python 2; on Python 3 use io.BytesIO instead). The caveat is that you'll be loading the full (still compressed) inner zip into memory.
import cStringIO
import zipfile
with zipfile.ZipFile('foo.zip') as z:
    with z.open('nested.zip') as z2:
        z2_filedata = cStringIO.StringIO(z2.read())
        with zipfile.ZipFile(z2_filedata) as nested_zip:
            print nested_zip.open('data.txt').read()

Unfortunately decompressing zip files requires random access to the archive, and the ZipFile methods (not to mention the DEFLATE algorithm itself) only provide streams. It is therefore impossible to decompress nested zip files without extracting them.

Here's a function I came up with (shown here in Python 3 syntax, using io.BytesIO for the in-memory copy).
import io
import os
import zipfile
def extract_nested_zipfile(path, parent_zip=None):
    """Returns a ZipFile specified by path, even if the path contains
    intermediary ZipFiles. For example, /root/gparent.zip/parent.zip/child.zip
    will return a ZipFile that represents child.zip
    """
    def extract_inner_zipfile(parent_zip, child_zip_path):
        """Returns a ZipFile specified by child_zip_path that exists inside
        parent_zip.
        """
        memory_zip = io.BytesIO()
        memory_zip.write(parent_zip.open(child_zip_path).read())
        return zipfile.ZipFile(memory_zip)
    if ('.zip' + os.sep) in path:
        (parent_zip_path, child_zip_path) = os.path.relpath(path).split(
            '.zip' + os.sep, 1)
        parent_zip_path += '.zip'
        if not parent_zip:
            # This is the top-level, so read from disk
            parent_zip = zipfile.ZipFile(parent_zip_path)
        else:
            # We're already in a zip, so pull it out and recurse
            parent_zip = extract_inner_zipfile(parent_zip, parent_zip_path)
        return extract_nested_zipfile(child_zip_path, parent_zip)
    else:
        if parent_zip:
            return extract_inner_zipfile(parent_zip, path)
        else:
            # If there is no nesting, it's easy!
            return zipfile.ZipFile(path)
Here's how I tested it:
echo hello world > hi.txt
zip wrap1.zip hi.txt
zip wrap2.zip wrap1.zip
zip wrap3.zip wrap2.zip
print(extract_nested_zipfile('/Users/mattfaus/dev/dev-git/wrap1.zip').open('hi.txt').read())
print(extract_nested_zipfile('/Users/mattfaus/dev/dev-git/wrap2.zip/wrap1.zip').open('hi.txt').read())
print(extract_nested_zipfile('/Users/mattfaus/dev/dev-git/wrap3.zip/wrap2.zip/wrap1.zip').open('hi.txt').read())

For those looking for a function that extracts a nested zip file (any level of nesting) and cleans up the original zip files:
import zipfile, re, os
def extract_nested_zip(zippedFile, toFolder):
    """ Unzip a zip file and its contents, including nested zip files
        Delete the zip file(s) after extraction
    """
    with zipfile.ZipFile(zippedFile, 'r') as zfile:
        zfile.extractall(path=toFolder)
    os.remove(zippedFile)
    for root, dirs, files in os.walk(toFolder):
        for filename in files:
            if re.search(r'\.zip$', filename):
                fileSpec = os.path.join(root, filename)
                extract_nested_zip(fileSpec, root)

I use python 3.7.3
import zipfile
import io
with zipfile.ZipFile('all.zip') as z:
    with z.open('nested.zip') as z2:
        z2_filedata = io.BytesIO(z2.read())
        with zipfile.ZipFile(z2_filedata) as nested_zip:
            print(nested_zip.open('readme.md').read())

This works for me. Just place this script in the same directory as the nested zip. It also counts the total number of files within the nested zip.
import os
from zipfile import ZipFile
def unzip(path, total_count):
    for root, dirs, files in os.walk(path):
        for file in files:
            file_name = os.path.join(root, file)
            if not file_name.endswith('.zip'):
                total_count += 1
            else:
                currentdir = file_name[:-4]
                if not os.path.exists(currentdir):
                    os.makedirs(currentdir)
                with ZipFile(file_name) as zipObj:
                    zipObj.extractall(currentdir)
                os.remove(file_name)
                total_count = unzip(currentdir, total_count)
    return total_count
total_count = unzip('.', 0)
print(total_count)

My approach to such a problem is the following; it assigns each spreadsheet to a variable named after its file:
import os
import re
import zipfile
import pandas as pd
# import numpy as np
path = r'G:\Important\Data\EKATTE'
# DESCRIBE
archives = os.listdir(path)
archives = [ar for ar in archives if ar.endswith(".zip")]
contents = pd.DataFrame({'elec_date':[],'files':[]})
for a in archives:
    archive = zipfile.ZipFile( path+'\\'+a )
    filelist = archive.namelist()
    # archive.infolist()
    for i in archive.namelist():
        if re.match('.*zip', i):
            # open the nested zip directly from the outer archive
            sub_arch = zipfile.ZipFile(archive.open(i))
            sub_names = [x for x in sub_arch.namelist()]
            for s in sub_names:
                exec(f"{s.split('.')[0]} = pd.read_excel(sub_arch.open(s), squeeze=True)")
The archive can be found on Bulgaria's National Statistics Institute page (direct link):
https://www.nsi.bg/sites/default/files/files/EKATTE/Ekatte.zip
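The in-memory approach from the answers above generalises to any nesting depth with a small recursive generator. The sketch below is one possible way to do it, not a definitive implementation: iter_nested_members is a made-up name, inner archives are recognised purely by their .zip extension, and each inner archive is held in memory in full while it is being read.

```python
import io
import zipfile


def iter_nested_members(zip_source, prefix=""):
    """Yield (path, bytes) for every non-zip member, descending into nested zips.

    zip_source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            path = prefix + info.filename
            data = archive.read(info)
            if info.filename.lower().endswith(".zip"):
                # Recurse into the inner archive held in memory.
                yield from iter_nested_members(io.BytesIO(data), path + "/")
            else:
                yield path, data
```

For the layout in the question, iterating over iter_nested_members("/Folder/ZipfileA.zip") would yield paths such as ZipfileA1.zip/data.txt together with the file contents, without writing anything to disk. Because it is a generator, only one inner archive per nesting level is in memory at a time.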
