What is the glob2 module? - python

import glob2
from datetime import datetime

filenames = glob2.glob("*.txt")
with open(datetime.now().strftime("%Y-%m-%d-%H-%M-%S-%f") + ".txt", 'w') as file:
    for filename in filenames:
        with open(filename, "r") as f:
            file.write(f.read() + "\n")
I was working in Python and came across the name glob. I googled it but couldn't find a clear answer: what does glob do, and what is it used for?

From the glob docs:
"The glob module finds all the pathnames matching a specified pattern(...)"
I'll skip the imports (import glob2 and from datetime import datetime) and go line by line.
Get all the filenames in the current directory that have a .txt extension:
filenames = glob2.glob("*.txt")
Open a new file, named with the current datetime in the format given to strftime, for writing as the variable file:
with open(datetime.now().strftime("%Y-%m-%d-%H-%M-%S-%f")+".txt", 'w') as file:
For each filename among the found files, whose names/paths are stored in the filenames variable:
    for filename in filenames:
Open that file for reading as f:
        with open(filename, "r") as f:
Write all the content from f into file, appending \n (a newline) at the end:
            file.write(f.read() + "\n")

I also saw the glob2 module used in a Kaggle notebook and researched the difference to glob myself.
All features of glob2 are included in Python's current built-in glob implementation.
So there is no reason to use glob2 anymore.
As for what glob does in general, BlueTomato already provided a nice link and description.
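For reference, glob2's headline feature, recursive "**" matching, has been part of the standard glob module since Python 3.5 via the recursive=True flag. A minimal sketch, using a throwaway temp directory so it is self-contained:

```python
import glob
import os
import tempfile

# Set up a small tree: tmp/a.txt and tmp/sub/b.txt
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, "sub"))
for name in ("a.txt", os.path.join("sub", "b.txt")):
    open(os.path.join(tmp, name), "w").close()

# "**" matches files at any depth when recursive=True,
# which is what glob2 originally provided.
matches = glob.glob(os.path.join(tmp, "**", "*.txt"), recursive=True)
print(sorted(os.path.basename(m) for m in matches))  # ['a.txt', 'b.txt']
```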

Related

write() with Python and give the files a specific name

The following Python code extracts images from a PDF file and saves them as jp2 files. The files are named im1.jp2, im2.jp2, and so on, and get overwritten on the next run by images from the next PDF in the path.
How can I give the jp2 files a specific name within the Write() method? E.g. pathname_im1.jp2? Or is it possible to rename it directly?
from PyPDF2 import PdfReader
from pathlib import Path

pdfdirpath = Path('C:/Users/...')
pathlist = pdfdirpath.glob('*.pdf')
for path in pathlist:
    reader = PdfReader(path)
    for page in reader.pages:
        for image in page.images:
            with open(image.name, "wb") as fp:
                fp.write(image.data)
Well, there are actually a lot of good ways to do this. As pointed out in the comments, you could just use enumerate:
from PyPDF2 import PdfReader
from pathlib import Path

pdfdirpath = Path('C:/Users/...')
pathlist = pdfdirpath.glob('*.pdf')
for path in pathlist:
    reader = PdfReader(path)
    for page in reader.pages:
        for index, image in enumerate(page.images):
            filename = f'{index}{image.name}'
            with open(filename, "wb") as fp:
                fp.write(image.data)
Or you could prefix the name with the datetime (which I think is better and more reliable across different runs, if you don't have anything better to use).
from datetime import datetime
from PyPDF2 import PdfReader
from pathlib import Path

pdfdirpath = Path('C:/Users/...')
pathlist = pdfdirpath.glob('*.pdf')
# Note this will use the same datetime for all images
date = datetime.now().strftime("%Y%m%d_%H%M%S")
for path in pathlist:
    reader = PdfReader(path)
    for page in reader.pages:
        for image in page.images:
            filename = f'{date}_{image.name}'
            with open(filename, "wb") as fp:
                fp.write(image.data)
You can then adapt this depending on exactly what naming scheme you want.
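If you want the source PDF's name in the output (e.g. pathname_im1.jp2, as the question asks), one option is to build the filename from path.stem. A sketch of just the naming logic, with hypothetical stand-ins for the loop variables path and image.name from the code above:

```python
from pathlib import Path

# Hypothetical inputs standing in for the loop variables:
# `path` would be the current PDF, `image_name` the extracted image name.
path = Path('C:/Users/docs/report.pdf')
image_name = 'im1.jp2'

# Path.stem is the filename without its extension ("report"),
# so each PDF's images get a distinct prefix.
filename = f'{path.stem}_{image_name}'
print(filename)  # report_im1.jp2
```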

How to iterate through and unzip ".gz" files in python?

I have several sub-folders, each containing gzipped Twitter files. I want Python to iterate through these sub-folders and turn the files into regular JSON files.
I have more than 300 sub-folders, each containing about 1000 or more of these zipped files.
A sample of these files is named:
00_activities.json.gz%3FAWSAccessKeyId=AKIAJADH5KHBJMUZOPEA&Expires=1404665927&Signature=%2BdCn%252Ffn%2BFfRQhknWWcH%2BtnwlSfk%3D"
Thanks in advance
I have tried the code below, just to see if I can extract one of those files, but none of it worked.
import zipfile
zip_ref = zipfile.ZipFile('E:/echoverse/Subdivided Tweets/Subdivided Tweets/Tweets-0', 'r')
zip_ref.extractall('E:/echoverse/Subdivided Tweets/Subdivided Tweets/Tweets-0/00_activities.json.gz%3FAWSAccessKeyId=AKIAJADH5KHBJMUZOPEA&Expires=1404665927&Signature=%2BdCn%252Ffn%2BFfRQhknWWcH%2BtnwlSfk%3D')
zip_ref.close()
I have also tried:
import tarfile
tar = tarfile.open('E:/echoverse/Subdivided Tweets/Subdivided Tweets/Tweets-0/00_activities.json.gz%3FAWSAccessKeyId=AKIAJADH5KHBJMUZOPEA&Expires=1404665927&Signature=%2BdCn%252Ffn%2BFfRQhknWWcH%2BtnwlSfk%3D')
tar.extractall()
tar.close()
here is my third try (and no luck):
import gzip
import json
with gzip.open('E:/echoverse/Subdivided Tweets/Subdivided Tweets/Tweets-0/00_activities.json.gz%3FAWSAccessKeyId=AKIAJADH5KHBJMUZOPEA&Expires=1404665927&Signature=%2BdCn%252Ffn%2BFfRQhknWWcH%2BtnwlSfk%3D', 'rb') as f:
    d = json.loads(f.read().decode("utf-8"))
There is another very similar thread on Stack Overflow, but my question is different in that my zipped file is originally JSON, and when I use this last method I get this error:
Exception has occurred: json.decoder.JSONDecodeError
Expecting value: line 1 column 1 (char 0)
A simple script that answers the question: it walks the tree, checks whether each file is a gzip (via the magic number, because I'm cynical), and unzips it.
import json
import gzip
import binascii
import os

def is_gz_file(filepath):
    with open(filepath, 'rb') as test_f:
        return binascii.hexlify(test_f.read(2)) == b'1f8b'

rootDir = '.'
for dirName, subdirList, fileList in os.walk(rootDir):
    for fname in fileList:
        filepath = os.path.join(dirName, fname)
        if is_gz_file(filepath):
            f = gzip.open(filepath, 'rb')
            json_content = json.loads(f.read())
            print(json_content)
Tested and it works.
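Since the goal was to turn these into regular JSON files rather than just print them, you could write each decompressed payload back out. A self-contained sketch, assuming the output should simply be the original name plus a .json suffix (that naming is an assumption, not from the question):

```python
import gzip
import json
import os
import tempfile

# Hypothetical setup: one gzipped JSON file in a temp directory.
root = tempfile.mkdtemp()
src = os.path.join(root, "sample.gz")
with gzip.open(src, "wt", encoding="utf-8") as f:
    json.dump({"id": 1}, f)

# Decompress, parse, and re-save as a plain .json file next to the original.
with gzip.open(src, "rt", encoding="utf-8") as f:
    data = json.load(f)
dst = src + ".json"  # assumed naming scheme
with open(dst, "w", encoding="utf-8") as out:
    json.dump(data, out)

print(data)  # {'id': 1}
```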

Extracting zip with password to another dir without foldername

I have this password protected zip folder:
folder_1\1.zip
When I extract this it gives me
1\image.png
How can I extract this to another folder without its folder name? Just the contents of it: image.png
So far I have tried all the Stack Overflow solutions, and it took me 11 hours straight just to get this far.
import zipfile

zip = zipfile.ZipFile('C:\\Users\\Desktop\\folder_1\\1.zip', 'r')
zip.setpassword(b"virus")
zip.extractall('C:\\Users\\Desktop')  # <-- target dir to extract all contents
zip.close()
EDIT:
This code worked for me. (Now I want to extract from many paths at once; any ideas?)
import os
import shutil
import zipfile

my_dir = r"C:\Users\Desktop"
my_zip = r"C:\Users\Desktop\test\folder_1\1.zip"

with zipfile.ZipFile(my_zip) as zip_file:
    zip_file.setpassword(b"virus")
    for member in zip_file.namelist():
        filename = os.path.basename(member)
        # skip directories
        if not filename:
            continue
        # copy file (taken from zipfile's extract)
        source = zip_file.open(member)
        target = open(os.path.join(my_dir, filename), "wb")
        with source, target:
            shutil.copyfileobj(source, target)
You can use the ZipFile.read() method to read a specific file in the archive. Open your target file for writing, joining the target directory with the base name of the source file, then write what you read into it:
import zipfile
import os

zip = zipfile.ZipFile('C:\\Users\\Desktop\\folder_1\\1.zip', 'r')
zip.setpassword(b"virus")
for name in zip.namelist():
    if not name.endswith(('/', '\\')):
        with open(os.path.join('C:\\Users\\Desktop', os.path.basename(name)), 'wb') as f:
            f.write(zip.read(name))
zip.close()
And if you have several paths containing 1.zip for extraction:
import zipfile
import os

for path in ('C:\\Users\\Desktop\\folder_1', 'C:\\Users\\Desktop\\folder_2', 'C:\\Users\\Desktop\\folder_3'):
    zip = zipfile.ZipFile(os.path.join(path, '1.zip'), 'r')
    zip.setpassword(b"virus")
    for name in zip.namelist():
        if not name.endswith(('/', '\\')):
            with open(os.path.join('C:\\Users\\Desktop', os.path.basename(name)), 'wb') as f:
                f.write(zip.read(name))
    zip.close()
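Instead of hard-coding folder_1 through folder_3, you could discover every matching archive with a glob pattern. A sketch using a temp directory as a stand-in for the Desktop path in the question:

```python
import glob
import os
import tempfile

# Hypothetical stand-in for C:\Users\Desktop: a temp directory with
# folder_1 .. folder_3, each containing a 1.zip (empty placeholder files).
desktop = tempfile.mkdtemp()
for i in range(1, 4):
    d = os.path.join(desktop, f'folder_{i}')
    os.makedirs(d)
    open(os.path.join(d, '1.zip'), 'wb').close()

# One pattern finds every archive; feed each match to the extraction loop above.
archives = sorted(glob.glob(os.path.join(desktop, 'folder_*', '1.zip')))
print(len(archives))  # 3
```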

How to sequentially read all the files in a directory and export the contents in Python?

I have a directory /directory/some_directory/ and in that directory I have a set of files. Those files are named in the following format: <letter>-<number>_<date>-<time>_<dataidentifier>.log, for example:
ABC1-123_20162005-171738_somestring.log
DE-456_20162005-171738_somestring.log
ABC1-123_20162005-153416_somestring.log
FG-1098_20162005-171738_somestring.log
ABC1-123_20162005-031738_somestring.log
DE-456_20162005-171738_somestring.log
I would like to read a subset of those files (for example, only the files named ABC1-123*.log) and export all their contents to a single CSV file (for example, output.csv), that is, a CSV file that will have the data from all the individual files collectively.
The code that I have written so far:
#!/usr/bin/env python
import os

file_directory = os.getcwd()
m_class = "ABC1"
m_id = "123"
device = m_class + "-" + m_id
for data_file in sorted(os.listdir(file_directory)):
    if str(device) + "*" in os.listdir(file_directory):
        print(data_file)
I don't know how to read only a subset of filtered files, nor how to export them to a common CSV file.
How can I achieve this?
Just use the re library to match the file name pattern, and the csv library to export.
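That one-line answer can be sketched concretely. Here re.fullmatch filters by a pattern built from the device prefix (the sample file names are taken from the question):

```python
import re

filenames = [
    "ABC1-123_20162005-171738_somestring.log",
    "DE-456_20162005-171738_somestring.log",
    "ABC1-123_20162005-153416_somestring.log",
]

# re.escape protects the "-" and any other special characters in the
# device prefix; "_.*\.log" matches the rest of the name.
device = "ABC1-123"
pattern = re.compile(re.escape(device) + r"_.*\.log")
matched = [f for f in filenames if pattern.fullmatch(f)]
print(matched)  # the two ABC1-123 files
```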
Only a few adjustments are needed; you were close:
import os

# device as defined in your question
filesFromDir = os.listdir(os.getcwd())
fileList = [file for file in filesFromDir if file.startswith(device)]
f = open("LogOutput.csv", "a")
for file in fileList:
    # print("Processing", file)
    with open(file, "r") as log_file:
        txt = log_file.read()
        f.write(txt)
        f.write("\n")
f.close()
Your question could be better stated. Based on your current code snippet, I'll assume that you want to:
Filter files in a directory based on glob pattern.
Concatenate their contents to a file named output.csv.
In Python you can achieve (1.) by using glob to list filenames:
import glob

for filename in glob.glob('foo*bar'):
    print(filename)
That would print all files starting with foo and ending with bar in
the current directory.
For (2.) you just read each file and write its content to your desired output, using Python's open() builtin function:
open('filename', 'r')
(Using 'r' as the mode you are asking python to open the file for
"reading", using 'w' you are asking python to open the file for
"writing".)
The final code would look like the following:
import glob

device = 'ABC1-123'
with open('output.csv', 'w') as output:
    for filename in glob.glob(device + '*'):
        with open(filename, 'r') as input:
            output.write(input.read())
You can use the os module to list the files.
import os

files = os.listdir(os.getcwd())
m_class = "ABC1"
m_id = "123"
device = m_class + "-" + m_id
file_extension = ".log"
# filter the files by their extension and the starting name
files = [x for x in files if x.startswith(device) and x.endswith(file_extension)]
f = open("output.csv", "a")
for file in files:
    with open(file, "r") as data_file:
        f.write(data_file.read())
        f.write(",\n")
f.close()

Unzipping multiple zip files in a directory?

I need to unzip multiple files within a directory and convert them to a csv file.
The files are numbered in order within the folder: 1.gz, 2.gz, 3.gz, etc.
Can this be done within a single script or do I have to do it manually?
Edit: my current code is:
#! /usr/local/bin/python
import gzip
import csv
import os

f = gzip.open('1.gz', 'rb')
file_content = f.read()

filename = '1.txt'
target = open('1.txt', 'w')
target.write(file_content)
target.close()

txt_file = '1.txt'
csv_file = '1.csv'
in_txt = csv.reader(open(txt_file, "rb"), delimiter='\t')
out_csv = csv.writer(open(csv_file, 'wb'))
out_csv.writerows(in_txt)

dirname = '/home/user/Desktop'
filename = "1.txt"
pathname = os.path.abspath(os.path.join(dirname, filename))
if pathname.startswith(dirname):
    os.remove(pathname)
f.close()
Current plan is to do a count for the total number of .gz files per directory and use a loop for each file to unzip and print the txt/csv out.
Is this feasible, or is there a better way to do it?
Also, is Python similar to Perl, where double quotes interpolate the string?
You hardly need Python for this :)
But you can do this in a single Python script. You'll need to use:
os
os.path (possibly)
gzip
glob (will get you a nice glob listing of files, e.g.: glob("*.gz"))
Have a read up on these modules over at https://docs.python.org/ and have a go! :)
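To make "have a go" concrete, here is a self-contained sketch of the whole loop: glob every .gz, decompress it, and rewrite it as a same-numbered .csv. It assumes (as the question's code does) that the decompressed payload is tab-delimited text; the temp-directory setup stands in for your real directory:

```python
import csv
import glob
import gzip
import os
import tempfile

# Hypothetical setup: a directory with gzipped, tab-separated files
# named 1.gz, 2.gz, ... as in the question.
workdir = tempfile.mkdtemp()
for i in (1, 2):
    with gzip.open(os.path.join(workdir, f"{i}.gz"), "wt") as f:
        f.write("a\tb\nc\td\n")

# Loop over every .gz, decompress, and rewrite as <n>.csv.
for gz_path in sorted(glob.glob(os.path.join(workdir, "*.gz"))):
    csv_path = os.path.splitext(gz_path)[0] + ".csv"
    with gzip.open(gz_path, "rt") as src, open(csv_path, "w", newline="") as dst:
        csv.writer(dst).writerows(csv.reader(src, delimiter="\t"))

csvs = sorted(os.path.basename(p) for p in glob.glob(os.path.join(workdir, "*.csv")))
print(csvs)  # ['1.csv', '2.csv']
```

No manual per-file steps are needed; the loop scales to however many numbered files each directory holds.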
