Looping through Base64 txt files for bulk conversion to images? - python

I have a large number of txt files that contain the base64 encoding for image files. Each txt file has a single encoding line starting with "data:image/jpeg;base64,/9j/.........". I got the following to work as far as saving the image:
import base64
import os
import fnmatch
os.chdir(r'D:\Users\dubs\slidesets')
with open('data.image.jpeg.0bac61939da0c.txt', 'r') as file:
    str = file.read().replace('data:image/jpeg;base64,', '')
print str
picname = open("data.image.jpeg.0bac61939da0c.jpg", "wb")
picname.write(str.decode('base64'))
picname.close()
My end goal would be to look in a directory for any txt file with "jpeg" in the name, get and edit the string from it, change to image, and save the image in the same directory with the same filename ('data.image.jpeg.0bff54917a8c7.txt' to 'data.image.jpeg.0bff54917a8c7.jpg').
import fnmatch
import os
import base64
os.chdir(r'D:\Users\dubs\slidesets')
for file in os.listdir(r'D:\Users\dubs\slidesets'):
    if fnmatch.fnmatch(file, "*jpeg*.txt"):
        newname = os.path.basename(file).replace(".txt", ".jpg")
        with open(file, 'r') as file:
            str = file.read().replace('data:image/jpeg;base64,', '')
        picname = open("newname", "wb")
        picname.write(str.decode('base64'))
        picname.close()
The error that I am getting:
Traceback (most recent call last):
File "<stdin>", line 7, in <module>
AttributeError: 'str' object has no attribute 'decode'
I tried "newname" 'newname' and newname because I was unsure how that works with a variable instead, but that didn't help. Not sure why it works for one file in my top code but not in the loop?
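The traceback suggests the loop ran under Python 3, where str no longer has a decode method; base64.b64decode does the same job on both versions. Note also that open("newname", "wb") with quotes writes a file literally called newname, so the variable has to be passed without quotes. A minimal sketch under those assumptions follows; the function name convert_base64_txt_files and the PREFIX constant are made up for illustration:

```python
import base64
import fnmatch
import os

# Assumed prefix, taken from the question's description of the txt files.
PREFIX = "data:image/jpeg;base64,"


def convert_base64_txt_files(directory):
    """Decode every *jpeg*.txt file in `directory` into a .jpg beside it."""
    written = []
    for name in sorted(os.listdir(directory)):
        if not fnmatch.fnmatch(name, "*jpeg*.txt"):
            continue
        source = os.path.join(directory, name)
        # Same base name, .jpg extension; note the variable is NOT quoted.
        target = os.path.join(directory, os.path.splitext(name)[0] + ".jpg")
        with open(source, "r") as txt_file:
            encoded = txt_file.read().strip().replace(PREFIX, "", 1)
        with open(target, "wb") as image_file:
            # b64decode replaces the Python 2 only str.decode('base64').
            image_file.write(base64.b64decode(encoded))
        written.append(target)
    return written
```

Calling convert_base64_txt_files(r'D:\Users\dubs\slidesets') would then process the whole folder without needing os.chdir.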

Related

Cannot open file in subdirectory with ZipFile

For whatever reason I cannot open or access the file in this subdirectory. I need to be able to open and read files within subdirectories of a zipped folder. Here is my code.
import zipfile
import os
for root, dirs, files in os.walk('Z:\\STAR'):
    for name in files:
        if '.zip' in name:
            try:
                zipt=zipfile.ZipFile(os.path.join(root,name),'r')
                dirlist=zipfile.ZipFile.namelist(zipt)
                for item in dirlist:
                    if 'Usb' in item:
                        input(item)
                        with zipt.open(item,'r') as f:
                            a=f.readlines()
                        input(a[0])
                    else:pass
            except Exception as e:
                print('passed trc file {}{} because of {}'.format(root,name,e))
        else:pass
This code currently gives me the error:
File "StarMe_tracer2.py", line 133, in tracer
if 'er99' in line:
TypeError: a bytes-like object is required, not 'str'
The content read from the file object opened with ZipFile.open is bytes rather than a string, so testing if a string 'er99' is in a line of bytes would fail with a TypeError.
Instead, you can either decode the line before you test:
if 'er99' in line.decode():
or convert the bytes stream to a text stream with io.TextIOWrapper:
import io
...
with io.TextIOWrapper(zipt.open(item,'r'), encoding='utf-8') as f:
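To make the TextIOWrapper approach concrete, here is a small self-contained sketch. It builds a throwaway zip archive in memory, so the member name logs/Usb_trace.txt, its contents, and the helper name lines_containing are all invented for the example:

```python
import io
import zipfile


def lines_containing(zip_file, member, needle, encoding="utf-8"):
    """Return the text lines of `member` that contain `needle`."""
    with zip_file.open(member, "r") as raw:
        # Wrap the bytes stream so iteration yields str, not bytes.
        with io.TextIOWrapper(raw, encoding=encoding) as text:
            return [line.rstrip("\n") for line in text if needle in line]


# Build a small archive in memory with one file inside a subdirectory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("logs/Usb_trace.txt", "ok line\nfailed with er99\n")

with zipfile.ZipFile(buffer, "r") as zf:
    matches = lines_containing(zf, "logs/Usb_trace.txt", "er99")

print(matches)
```

Members inside subdirectories need no special handling: namelist() returns their full path within the archive, and that same string is what open() expects.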

Can't write details extracted from DICOM file to csv file

I am not able to write the details extracted from a DICOM file to a CSV file. Here's the code which I have used -
import pydicom
import os
import pandas as pd
import csv
import glob
data_dir= 'C:\\Users\\dmgop\\Personal\\TE Project - Pneumonia\\stage_1_test_images_dicom'
patients= os.listdir(data_dir)
myFile= open('patientdata.csv','w')
for image in patients:
    lung = pydicom.dcmread(os.path.join(data_dir, image))
    print (lung)
    writer = csv.writer(myFile)
    writer.writerows(lung)
    break
The error which is coming up is as follows -
Traceback (most recent call last):
  File "C:\Users\dmgop\AppData\Local\Programs\Python\Python36\lib\site-packages\pydicom-1.2.0rc1-py3.6.egg\pydicom\dataelem.py", line 344, in __getitem__
    return self.value[key]
TypeError: 'PersonName3' object does not support indexing
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:\Users\dmgop\Personal\TE Project - Pneumonia\detail_extraction.py", line 14, in <module>
    writer.writerows(lung)
  File "C:\Users\dmgop\AppData\Local\Programs\Python\Python36\lib\site-packages\pydicom-1.2.0rc1-py3.6.egg\pydicom\dataelem.py", line 346, in __getitem__
    raise TypeError("DataElement value is unscriptable "
TypeError: DataElement value is unscriptable (not a Sequence)
Assuming the "break" statement in your for loop means you only want the info of the first image, try:
import pydicom
import os
import csv
data_dir = 'C:\\Users\\dmgop\\Personal\\TE Project - Pneumonia\\stage_1_test_images_dicom'
patients = os.listdir(data_dir)
# newline='' stops the csv module from writing blank rows on Windows
with open('file.csv', 'w', newline='') as myfile:
    writer = csv.writer(myfile)
    # patients[0] is the first filename, so no for loop is needed
    lung = pydicom.dcmread(os.path.join(data_dir, patients[0]))
    # formatted_lines is a method: call it to get one string per data element
    # wrap each line in a list so it becomes one cell, not one cell per character
    writer.writerows([line] for line in lung.formatted_lines())
Have a look at the Pydicom docs for FileDataset which is the return type for the dcmread method.
Should you want to write the data for all files in the directory, try the following:
import pydicom
import os
import csv
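data_dir = 'C:\\Users\\dmgop\\Personal\\TE Project - Pneumonia\\stage_1_test_images_dicom'
patients = os.listdir(data_dir)
# newline='' stops the csv module from writing blank rows on Windows
with open('file.csv', 'w', newline='') as myfile:
    writer = csv.writer(myfile)
    for patient in patients:
        if patient.lower().endswith('.dcm'):
            lung = pydicom.dcmread(os.path.join(data_dir, patient))
            # wrap each line in a list so it becomes one cell, not one cell per character
            writer.writerows([line] for line in lung.formatted_lines())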
Also note the use of 'with open() as', which closes the file automatically when the block exits.
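One detail worth checking when writing text lines with the csv module: csv.writer.writerows expects an iterable of rows, and each row is itself a sequence of fields. A bare string counts as a sequence of characters, so every character would land in its own column. The sketch below uses only the standard library; the two sample lines are made up to resemble the kind of text formatted_lines() yields (they are not real pydicom output), and write_lines_as_rows is a hypothetical helper name:

```python
import csv
import io


def write_lines_as_rows(lines, handle):
    """Write each text line as a single-cell CSV row."""
    writer = csv.writer(handle)
    # Wrapping each line in a list makes it one field instead of many.
    writer.writerows([line] for line in lines)


# Invented sample lines standing in for per-element DICOM text.
sample_lines = [
    "(0010, 0010) Patient's Name  PN: 'Doe^Jane'",
    "(0010, 0020) Patient ID  LO: '12345'",
]

buffer = io.StringIO()
write_lines_as_rows(sample_lines, buffer)

# Reading the CSV back gives one cell per row.
rows = list(csv.reader(io.StringIO(buffer.getvalue(), newline="")))
```

The csv module quotes any field that contains a comma, so the commas inside the tag text survive the round trip.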

word count PDF files when walking directory

Hello Stackoverflow community!
I'm trying to build a Python program that will walk a directory (and all sub-directories) and keep an accumulated word count across all .html, .txt, and .pdf files. Reading a .pdf file requires a little something extra (PdfFileReader) to parse the file. When parsing a .pdf file I get the following error and the program stops:
AttributeError: 'PdfFileReader' object has no attribute 'startswith'
When not parsing .pdf files the program completes successfully.
CODE
#!/usr/bin/python
import re
import os
import sys
import os.path
import fnmatch
import collections
from PyPDF2 import PdfFileReader
ignore = [<lots of words>]
def extract(file_path, counter):
    words = re.findall('\w+', open(file_path).read().lower())
    counter.update([x for x in words if x not in ignore and len(x) > 2])
def search(path):
    print path
    counter = collections.Counter()
    if os.path.isdir(path):
        for root, dirs, files in os.walk(path):
            for file in files:
                if file.lower().endswith(('.html', '.txt')):
                    print file
                    extract(os.path.join(root, file), counter)
                if file.lower().endswith(('.pdf')):
                    file_path = os.path.abspath(os.path.join(root, file))
                    print file_path
                    with open(file_path, 'rb') as f:
                        reader = PdfFileReader(f)
                        extract(os.path.join(root, reader), counter)
                        contents = reader.getPage(0).extractText().split('\n')
                        extract(os.path.join(root, contents), counter)
                        pass
    else:
        extract(path, counter)
    print(counter.most_common(50))
search(sys.argv[1])
The full error
Traceback (most recent call last):
  File line 50, in <module>
    search(sys.argv[1])
  File line 36, in search
    extract(os.path.join(root, reader), counter)
  File line 68, in join
    if b.startswith('/'):
AttributeError: 'PdfFileReader' object has no attribute 'startswith'
It appears there is a failure when calling the extract function with the .pdf file. Any help/guidance would be greatly appreciated!
Expected Results (works w/out .pdf files)
[('cyber', 5101), ('2016', 5095), ('date', 4912), ('threat', 4343)]
The problem is that this line
reader = PdfFileReader(f)
returns an object of type PdfFileReader. You're then passing this object to os.path.join() and on to extract(), which expects a file path, not a PdfFileReader object.
My suggestion is to move the PDF-related processing that you currently have in search() into extract() instead. Then, in extract(), check whether the file is a PDF and act accordingly. Something like this:
def extract(file_path, counter):
    if file_path.lower().endswith('.pdf'):
        # PdfFileReader accepts a file path as well as an open stream
        reader = PdfFileReader(file_path)
        # as in the question, only the first page is read
        text = reader.getPage(0).extractText()
        words = re.findall(r'\w+', text.lower())
        counter.update([x for x in words if x not in ignore and len(x) > 2])
    elif file_path.lower().endswith(('.html', '.txt')):
        words = re.findall(r'\w+', open(file_path).read().lower())
        counter.update([x for x in words if x not in ignore and len(x) > 2])
    else:
        # some other file type: skip it
        pass
Haven't tested the code snippet above but hopefully you should get the idea.
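The same idea, choosing how to read a file by its extension and keeping the counting separate, can be sketched with the standard library alone. Everything named here (IGNORE, count_words, read_plain, walk_and_count, READERS) is illustrative rather than taken from the question, and PDF support is left as a slot to fill in rather than implemented:

```python
import collections
import os
import re

# Stand-in for the question's (elided) ignore list.
IGNORE = {"the", "and"}


def count_words(text, counter, ignore=IGNORE):
    """Add the words of `text` longer than two characters to `counter`."""
    words = re.findall(r"\w+", text.lower())
    counter.update(w for w in words if w not in ignore and len(w) > 2)


def read_plain(path):
    """Return the contents of a text-like file as a string."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read()


def walk_and_count(top, readers):
    """Walk `top` and count words in every file whose extension has a reader."""
    counter = collections.Counter()
    for root, _dirs, files in os.walk(top):
        for name in files:
            extension = os.path.splitext(name)[1].lower()
            reader = readers.get(extension)
            if reader is not None:
                # Each reader takes a path and returns plain text.
                count_words(reader(os.path.join(root, name)), counter)
    return counter


# A PDF reader that returns extracted text could be registered the same way.
READERS = {".txt": read_plain, ".html": read_plain}
```

With this layout the walking code only ever passes file paths around, which avoids the situation in the traceback where a reader object ended up inside os.path.join().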

How to read Arabic text from PDF using Python script

I have code written in Python that reads PDF files and converts them to text files.
The problem occurs when I try to read Arabic text from PDF files. I know that the error is in the encoding and decoding process, but I don't know how to fix it.
The system converts Arabic PDF files, but the resulting text file is empty,
and it displays this error:
Traceback (most recent call last):
  File "C:\Users\test\Downloads\pdf-txt\text maker.py", line 68, in <module>
    f.write(content)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 50: ordinal not in range(128)
Code:
import os
from os import chdir, getcwd, listdir, path
import codecs
import pyPdf
from time import strftime
def check_path(prompt):
    ''' (str) -> str
    Verifies if the provided absolute path does exist.
    '''
    abs_path = raw_input(prompt)
    while path.exists(abs_path) != True:
        print "\nThe specified path does not exist.\n"
        abs_path = raw_input(prompt)
    return abs_path
print "\n"
folder = check_path("Provide absolute path for the folder: ")
list=[]
directory=folder
for root,dirs,files in os.walk(directory):
    for filename in files:
        if filename.endswith('.pdf'):
            t=os.path.join(directory,filename)
            list.append(t)
m=len(list)
print (m)
i=0
while i<=m-1:
    path=list[i]
    print(path)
    head,tail=os.path.split(path)
    var="\\"
    tail=tail.replace(".pdf",".txt")
    name=head+var+tail
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for j in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(j).extractText() + "\n"
    print strftime("%H:%M:%S"), " pdf -> txt "
    f=open(name,'w')
    content.encode('utf-8')
    f.write(content)
    f.close
    i=i+1
You have a couple of problems:
content.encode('utf-8') doesn't do anything. The return value is the encoded content, but you have to assign it to a variable. Better yet, open the file with an encoding, and write Unicode strings to that file. content appears to be Unicode data.
Example (works for both Python 2 and 3):
import io
f = io.open(name,'w',encoding='utf8')
f.write(content)
If you don't close the file properly, you may see no content because the file is not flushed to disk. You have f.close not f.close(). It's better to use with, which ensures the file is closed when the block exits.
Example:
import io
with io.open(name,'w',encoding='utf8') as f:
    f.write(content)
In Python 3, you don't need to import and use io.open but it still works. open is equivalent. Python 2 needs the io.open form.
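Both points can be seen in a few lines. This is only a sketch: the helper name save_text and the sample string are invented for the example, and the sample simply mixes a short Arabic word with the \xa9 character from the traceback:

```python
import io


def save_text(path, content):
    """Write a Unicode string to `path` as UTF-8; the file closes on exit."""
    with io.open(path, "w", encoding="utf8") as handle:
        handle.write(content)


# Arabic sample text followed by the copyright sign from the traceback.
text = u"\u0645\u0631\u062d\u0628\u0627 \u00a9"

# encode() returns a new bytes object and leaves `text` untouched,
# which is why calling it without using the result has no effect.
encoded = text.encode("utf-8")
```

A call such as save_text(name, content) would then replace the open, encode, write, and close lines in the question's loop.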
You can use another library, pdfplumber, instead of pyPdf or PyPDF2:
import arabic_reshaper
import pdfplumber
from bidi.algorithm import get_display
with pdfplumber.open(r'example.pdf') as pdf:
    my_page = pdf.pages[10]
    thepages = my_page.extract_text()
    reshaped_text = arabic_reshaper.reshape(thepages)
    bidi_text = get_display(reshaped_text)
    print(bidi_text)

Send file contents over ftp python

I have this Python Script
import os
import random
import ftplib
from tkinter import Tk
# now, we will grab all Windows clipboard data, and put to var
clipboard = Tk().clipboard_get()
# print(clipboard)
# this feature will only work if a string is in the clipboard. not files.
# so if "hello, world" is copied to the clipboard, then it would work. however, if the target has copied a file or something
# then it would come back an error, and the rest of the script would come back false (therefore shutdown)
random_num = random.randrange(100, 1000, 2)
random_num_2 = random.randrange(1, 9999, 5)
filename = "capture_clip" + str(random_num) + str(random_num_2) + ".txt"
file = open(filename, 'w') # clears file, or create if not exist
file.write(clipboard) # write all contents of var "foo" to file
file.close() # close file after printing
# let's send this file over ftp
session = ftplib.FTP('ftp.example.com','ftp_user','ftp_password')
session.cwd('//logs//') # move to correct directory
f = open(filename, 'r')
session.storbinary('STOR ' + filename, f)
f.close()
session.quit()
The script should send the file it creates (named by the variable "filename", e.g. "capture_clip5704061.txt") to my FTP server, but the contents of the local file do not match the file on the FTP server. As you can see, I use the ftplib module. Here is my error:
Traceback (most recent call last):
  File "script.py", line 33, in <module>
    session.storbinary('STOR ' + filename, f)
  File "C:\Users\willi\AppData\Local\Programs\Python\Python36\lib\ftplib.py", line 507, in storbinary
    conn.sendall(buf)
TypeError: a bytes-like object is required, not 'str'
ftplib's storbinary expects the file to be opened in binary mode. Try the following:
f = open(filename, 'rb')
This ensures that the data read from the file is a bytes object rather than str (for text).
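As a rough sketch of how the upload step might look with that change, the helper below opens the file in binary mode inside a with block so it is closed even if the transfer fails. The name upload_file is hypothetical, and session is assumed to be an already connected ftplib.FTP object (or anything with a compatible storbinary method):

```python
def upload_file(session, local_path, remote_name):
    """Send `local_path` through an ftplib.FTP-like `session`."""
    # 'rb' makes read() return bytes, which is what the socket needs.
    with open(local_path, "rb") as handle:
        session.storbinary("STOR " + remote_name, handle)
```

In the question's script this would be called as upload_file(session, filename, filename) after session.cwd(...).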
