Can't write details extracted from DICOM file to csv file - python

I am not able to write the details extracted from a DICOM file to a CSV file. Here's the code which I have used -
import pydicom
import os
import pandas as pd
import csv
import glob
data_dir= 'C:\\Users\\dmgop\\Personal\\TE Project - Pneumonia\\stage_1_test_images_dicom'
patients= os.listdir(data_dir)
myFile= open('patientdata.csv','w')
for image in patients:
    lung = pydicom.dcmread(os.path.join(data_dir, image))
    print (lung)
    writer = csv.writer(myFile)
    writer.writerows(lung)
    break
The error which is coming up is as follows -
Traceback (most recent call last):
  File "C:\Users\dmgop\AppData\Local\Programs\Python\Python36\lib\site-packages\pydicom-1.2.0rc1-py3.6.egg\pydicom\dataelem.py", line 344, in __getitem__
    return self.value[key]
TypeError: 'PersonName3' object does not support indexing
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:\Users\dmgop\Personal\TE Project - Pneumonia\detail_extraction.py", line 14, in <module>
    writer.writerows(lung)
  File "C:\Users\dmgop\AppData\Local\Programs\Python\Python36\lib\site-packages\pydicom-1.2.0rc1-py3.6.egg\pydicom\dataelem.py", line 346, in __getitem__
    raise TypeError("DataElement value is unscriptable "
TypeError: DataElement value is unscriptable (not a Sequence)

Assuming the "break" statement in your for loop means you only want the info of the first image, try:
import pydicom
import os
import csv
data_dir = 'C:\\Users\\dmgop\\Personal\\TE Project - Pneumonia\\stage_1_test_images_dicom'
patients = os.listdir(data_dir)
# newline='' stops the csv module from writing blank lines between rows on Windows
with open('file.csv', 'w', newline='') as myfile:
    writer = csv.writer(myfile)
    # patients[0] is the first filename, so no for loop is needed
    lung = pydicom.dcmread(os.path.join(data_dir, patients[0]))
    print(lung)
    # pay attention to the function call --> formatted_lines() yields one string per data element;
    # each string is wrapped in a list so it is written as a single CSV field
    writer.writerows([line] for line in lung.formatted_lines())
Have a look at the Pydicom docs for FileDataset which is the return type for the dcmread method.
Should you want to write the data for all files in the directory, try the following:
import pydicom
import os
import csv
data_dir = 'C:\\Users\\dmgop\\Personal\\TE Project - Pneumonia\\stage_1_test_images_dicom'
patients = os.listdir(data_dir)
with open('file.csv', 'w', newline='') as myfile:
    writer = csv.writer(myfile)
    for patient in patients:
        if patient.lower().endswith('.dcm'):
            # pydicom.dcmread, not pd.dcmread: pd is not defined in this script
            lung = pydicom.dcmread(os.path.join(data_dir, patient))
            # one CSV row per formatted line of the dataset
            writer.writerows([line] for line in lung.formatted_lines())
Also note the use of 'with open() as': the file is closed automatically when the block ends, even if an exception is raised inside it.
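One detail worth checking when writing formatted_lines() output to CSV: csv.writer treats every row as a sequence of fields, so a bare string ends up with one field per character. Below is a minimal sketch of the difference. The sample_lines values are made up for illustration and are not real pydicom output.

```python
import csv
import io

# Illustrative stand-ins for the strings that formatted_lines() yields.
sample_lines = ["PatientID: 0001", "Modality: CR"]

# Passing bare strings: every character becomes its own field.
split_buf = io.StringIO()
csv.writer(split_buf).writerows(sample_lines)

# Wrapping each string in a list: one field per line.
whole_buf = io.StringIO()
csv.writer(whole_buf).writerows([line] for line in sample_lines)

print(split_buf.getvalue())
print(whole_buf.getvalue())
```

If separate columns are wanted (for example tag, name and value), each row would have to be built as a list with one entry per column instead.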


I have a folder named sampledata with 4 Excel files. I want to read the content of each Excel file

I have a folder named sampledata with 4 Excel files (i.e., 'b.xlsx', 'call.xlsx', 'Daily.xlsx', 'Whatsapp metadata.xlsx'). I want to read the content of each Excel file in Python.
In the code below, fi contains the path of each file.
import os
import xlrd
import pandas as pd
path='/Users/user78/Downloads/references/forensic/sampledata/'
root,dir,files=next(os.walk(path),[])
print(files)
excel_count=0
text_count=0
excel_files=[]
text_files=[]
for file in files:
    fi=os.path.join(root,file)
    print(type(fi))
    print(fi)
    with open(fi,'r') as f:
        workbook=xlrd.open_workbook(f)
        sheet=workbook.sheet_by_index(0)
        for i in range(sheet.ncols):
            print(sheet.cell_value(0,i))
Above is my code and the resulting error is given below.
['b.xlsx', 'call.xlsx', 'Daily.xlsx', 'Whatsapp metadata.xlsx', '~$call metadata.xlsx', '~$Whatsapp metadata.xlsx']
<class 'str'>
/Users/user78/Downloads/references/forensic/sampledata/b.xlsx
Traceback (most recent call last):
  File "C:\Users\user78\Downloads\references\forensic\sample.py", line 28, in <module>
    workbook=xlrd.open_workbook(f)
  File "C:\Users\user78\AppData\Local\Programs\Python\Python36\lib\site-packages\xlrd\__init__.py", line 110, in open_workbook
    filename = os.path.expanduser(filename)
  File "C:\Users\user78\AppData\Local\Programs\Python\Python36\lib\ntpath.py", line 312, in expanduser
    path = os.fspath(path)
TypeError: expected str, bytes or os.PathLike object, not _io.TextIOWrapper
Can anyone help me?
xlrd.open_workbook() expects a filename as the first parameter, not an already opened file object as in your case. Try xlrd.open_workbook(fi) and remove the with block.
The xlrd page on PyPI contains a simple code example for reference.
However, as already mentioned by Shijith and as clearly stated in the library's description, xlrd should only be used for the "old" Excel files (.xls).
You can do that by using pandas.read_excel().
Code Structure
According to the pandas.read_excel() documentation, the signature is as follows (the exact parameters vary between pandas versions):
pandas.read_excel(io, sheet_name=0, header=0,
                  names=None, index_col=None, usecols=None,
                  squeeze=False, dtype=None, engine=None,
                  converters=None, true_values=None, false_values=None,
                  skiprows=None, nrows=None, na_values=None,
                  keep_default_na=True, na_filter=True, verbose=False,
                  parse_dates=False, date_parser=None, thousands=None,
                  comment=None, skipfooter=0, convert_float=True,
                  mangle_dupe_cols=True, storage_options=None)
Code Syntax
import pandas as pd
b_df = pd.read_excel('sampledata/b.xlsx')
# DataFrame has no to_list(); .values.tolist() gives a list of rows if you want a plain list
b_df_list = b_df.values.tolist()
call_df = pd.read_excel('sampledata/call.xlsx')
call_df_list = call_df.values.tolist()
....
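The directory listing in the question also contains '~$...' entries. Those are lock files that Excel creates while a workbook is open, and they are not readable workbooks, so a loop over all files will fail on them. Below is a small sketch of filtering the listing first. The helper name workbook_names is an assumption for illustration, and the read_excel call in the comment assumes pandas with an .xlsx engine such as openpyxl is installed.

```python
# File names as printed in the question's output.
files = ['b.xlsx', 'call.xlsx', 'Daily.xlsx', 'Whatsapp metadata.xlsx',
         '~$call metadata.xlsx', '~$Whatsapp metadata.xlsx']

def workbook_names(names):
    """Keep .xlsx names and drop Excel's '~$' lock files."""
    return [name for name in names
            if name.lower().endswith('.xlsx') and not name.startswith('~$')]

print(workbook_names(files))

# Hypothetical usage with pandas (not executed here):
# for name in workbook_names(files):
#     df = pd.read_excel(os.path.join(path, name))
#     print(name, df.shape)
```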

Getting Assertion error while reading the PDF file python - pypdf2

I am getting the below error when I try to read a PDF file.
Code:
from PyPDF2 import PdfFileReader
import os
os.chdir("Path to dir")
pdf_document = 'sample.pdf'
pdf = PdfFileReader(pdf_document,'rb') #Error here
Error:
Traceback (most recent call last):
  File "/home/krishna/PycharmProjects/sample/sample.py", line 9, in <module>
    pdf = PdfFileReader(filehandle)
  File "/home/krishna/PycharmProjects/AI_DRC/venv/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1084, in __init__
    self.read(stream)
  File "/home/krishna/PycharmProjects/AI_DRC/venv/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1838, in read
    assert start >= last_end
AssertionError
NOTE: File is 18 MB in size
I wrote the following and it works for me. The PDF is in the same folder as the script; you can also build the path as a string with os.
import PyPDF2
pdf_file = PyPDF2.PdfFileReader("Sample.pdf")  # open the file by path; a path built with os works as well
page_content = pdf_file.getPage(0).extractText()  # get page one (index zero) and extract its text
print(page_content)  # you can then do whatever you want with it
I think the problem with your program is the "rb" argument. That is a mode for open() in normal file handling. PdfFileReader takes a path or an already opened file object, and its second positional parameter is strict, so 'rb' is not used as a file mode there. PyPDF2 provides the classes PdfFileReader, PdfFileWriter and PdfFileMerger for working with PDFs.
Hope it helps.
If you encounter any problem, just mention it and I will try to get back to you.

Looping through Base64 txt files for bulk conversion to images?

I have a large number of txt files that contain the base64 encoding for image files. Each txt file has a single encoding line starting with ".........". I got the following to work as far as saving the image:
import base64
import os
import fnmatch
os.chdir(r'D:\Users\dubs\slidesets')
with open('data.image.jpeg.0bac61939da0c.txt', 'r') as file:
    str = file.read().replace('data:image/jpeg;base64,', '')
print str
picname = open("data.image.jpeg.0bac61939da0c.jpg", "wb")
picname.write(str.decode('base64'))
picname.close()
My end goal would be to look in a directory for any txt file with "jpeg" in the name, get and edit the string from it, change to image, and save the image in the same directory with the same filename ('data.image.jpeg.0bff54917a8c7.txt' to 'data.image.jpeg.0bff54917a8c7.jpg').
import fnmatch
import os
import base64
os.chdir(r'D:\Users\dubs\slidesets')
for file in os.listdir(r'D:\Users\dubs\slidesets'):
    if fnmatch.fnmatch(file, "*jpeg*.txt"):
        newname = os.path.basename(file).replace(".txt", ".jpg")
        with open(file, 'r') as file:
            str = file.read().replace('data:image/jpeg;base64,', '')
        picname = open("newname", "wb")
        picname.write(str.decode('base64'))
        picname.close()
The error that I am getting:
Traceback (most recent call last):
File "<stdin>", line 7, in <module>
AttributeError: 'str' object has no attribute 'decode'
I tried "newname" 'newname' and newname because I was unsure how that works with a variable instead, but that didn't help. Not sure why it works for one file in my top code but not in the loop?
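The traceback suggests the loop was run under Python 3, where str objects have no decode method; base64.b64decode is the usual replacement for str.decode('base64'). Note also that open("newname", "wb") writes to a file literally called newname, so the variable has to be passed without quotes. Below is a hedged sketch under those assumptions. The function names decode_image_text and convert_folder are illustrative, and the folder path is the one from the question.

```python
import base64
import fnmatch
import os

PREFIX = "data:image/jpeg;base64,"

def decode_image_text(text):
    """Strip the data-URI prefix, if present, and return the decoded bytes."""
    text = text.strip()
    if text.startswith(PREFIX):
        text = text[len(PREFIX):]
    return base64.b64decode(text)

def convert_folder(folder):
    """Write a .jpg next to every '*jpeg*.txt' file in folder."""
    written = []
    for name in os.listdir(folder):
        if not fnmatch.fnmatch(name, "*jpeg*.txt"):
            continue
        src = os.path.join(folder, name)
        # Same base name, with the .txt suffix swapped for .jpg.
        dst = os.path.join(folder, name[:-len(".txt")] + ".jpg")
        with open(src, "r") as txt_file:
            data = decode_image_text(txt_file.read())
        with open(dst, "wb") as jpg_file:
            jpg_file.write(data)
        written.append(dst)
    return written

# Hypothetical call, using the directory from the question:
# convert_folder(r'D:\Users\dubs\slidesets')
```

Keeping the decoding in its own function makes it easy to check on a short string before running it over the whole directory.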

Unable to Zip a File in Buffer

I want to zip a CSV file in memory (in a buffer) using zipfile in Python.
Below is the code I have tried, with the error log attached.
I don't want to use the compression option of df.to_csv because of a version issue.
import pandas as pd
import numpy as np
import io
import zipfile
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
s_buf = io.StringIO()
df.to_csv(s_buf,index=False)
s_buf.seek(0)
s_buf.name = 'my_filename.csv'
localfile= io.BytesIO()
localzip = io.BytesIO()
zf = zipfile.ZipFile(localzip, mode="w",compression=zipfile.ZIP_DEFLATED)
zf.writestr(localfile, s_buf.read())
zf.close()
with open("D:/my_zip.zip", "wb") as f:
    f.write(zf.getvalue())
Error I am Getting
Traceback (most recent call last):
  File "C:/Users/Window/PycharmProjects/dfZip/dfZiptest.py", line 25, in <module>
    zf.writestr(localfile, s_buf.read())
  File "C:\Python\Python37\lib\zipfile.py", line 1758, in writestr
    date_time=time.localtime(time.time())[:6])
  File "C:\Python\Python37\lib\zipfile.py", line 345, in __init__
    null_byte = filename.find(chr(0))
AttributeError: '_io.BytesIO' object has no attribute 'find'
zf = zipfile.ZipFile("localzip.zip", mode="w", compression=zipfile.ZIP_DEFLATED)
# writestr(name_in_archive, data); s_buf.name is 'my_filename.csv' in your code
zf.writestr(s_buf.name, s_buf.read())
zf.close()
What you are doing here is:
1 - You initialize the ZipFile with the path of the archive.
2 - You pass the name the file should have inside the archive, followed by the data to be written. In your case you were passing an io.BytesIO() object as the name, which made no sense to Python, hence the error.
I would strongly advise you to resolve any version issues first, because while a 'clever' solution may seem like a quick way out, it tends to rack up terrible technical debt later, which can and will be a nightmare to deal with.
You are passing an io.BytesIO() object as the first argument to ZipFile.writestr(), where it expects either an archive name or a ZipInfo object:
zf.writestr(localfile, s_buf.read())
From the documentation of writestr: zinfo_or_arcname is either the file name it will be given in the archive, or a ZipInfo instance.
source: Docs
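Putting this together, here is a minimal in-memory sketch using only the standard library. The csv_text string is a stand-in for what df.to_csv(index=False) would produce, and the archive member name is the one from the question. Note that the bytes are read from the BytesIO buffer, since ZipFile itself has no getvalue() method.

```python
import io
import zipfile

# Stand-in for the CSV text produced by df.to_csv(index=False).
csv_text = "A,B\n1,2\n3,4\n"

zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
    # First argument: the name inside the archive. Second: the data.
    zf.writestr("my_filename.csv", csv_text)

# The archive is complete once the with block has closed the ZipFile.
zip_bytes = zip_buf.getvalue()

# Writing the archive to disk, if needed:
# with open("D:/my_zip.zip", "wb") as f:
#     f.write(zip_bytes)
```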

word count PDF files when walking directory

Hello Stackoverflow community!
I'm trying to build a Python program that will walk a directory (and all sub-directories) and produce an accumulated word count over all .html, .txt, and .pdf files. Reading a .pdf file requires a little something extra (PdfFileReader) to parse the file. When parsing a .pdf file I get the following error and the program stops:
AttributeError: 'PdfFileReader' object has no attribute 'startswith'
When not parsing .pdf files the program completes successfully.
CODE
#!/usr/bin/python
import re
import os
import sys
import os.path
import fnmatch
import collections
from PyPDF2 import PdfFileReader
ignore = [<lots of words>]
def extract(file_path, counter):
    words = re.findall('\w+', open(file_path).read().lower())
    counter.update([x for x in words if x not in ignore and len(x) > 2])
def search(path):
    print path
    counter = collections.Counter()
    if os.path.isdir(path):
        for root, dirs, files in os.walk(path):
            for file in files:
                if file.lower().endswith(('.html', '.txt')):
                    print file
                    extract(os.path.join(root, file), counter)
                if file.lower().endswith(('.pdf')):
                    file_path = os.path.abspath(os.path.join(root, file))
                    print file_path
                    with open(file_path, 'rb') as f:
                        reader = PdfFileReader(f)
                        extract(os.path.join(root, reader), counter)
                        contents = reader.getPage(0).extractText().split('\n')
                        extract(os.path.join(root, contents), counter)
                        pass
    else:
        extract(path, counter)
    print(counter.most_common(50))
search(sys.argv[1])
The full error
Traceback (most recent call last):
  File line 50, in <module>
    search(sys.argv[1])
  File line 36, in search
    extract(os.path.join(root, reader), counter)
  File line 68, in join
    if b.startswith('/'):
AttributeError: 'PdfFileReader' object has no attribute 'startswith'
It appears there is a failure when calling the extract function with the .pdf file. Any help/guidance would be greatly appreciated!
Expected Results (works w/out .pdf files)
[('cyber', 5101), ('2016', 5095), ('date', 4912), ('threat', 4343)]
The problem is that this line
reader = PdfFileReader(f)
returns an object of type PdfFileReader. You're then passing this object to os.path.join() and on to the extract() function, both of which expect a file path and not a PdfFileReader object.
My suggestion would be to move the PDF-related processing that you currently have in the search() function into the extract() function instead. Then, in extract(), you check whether the file is a PDF and act accordingly. So, something like this:
def extract(file_path, counter):
    if file_path.lower().endswith('.pdf'):
        # PdfFileReader also accepts a path string
        reader = PdfFileReader(file_path)
        # only the first page is read here; split its text into words like the other branch
        contents = re.findall(r'\w+', reader.getPage(0).extractText().lower())
        counter.update([x for x in contents if x not in ignore and len(x) > 2])
    elif file_path.lower().endswith(('.html', '.txt')):
        words = re.findall(r'\w+', open(file_path).read().lower())
        counter.update([x for x in words if x not in ignore and len(x) > 2])
    else:
        # some other file type...
        pass
Haven't tested the code snippet above but hopefully you should get the idea.
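One way to keep the two branches from drifting apart is to separate "get the text" from "count the words", so that every file type ends up in the same counting function. Below is a minimal sketch of the counting half, using only the standard library. The name count_words and the tiny ignore set are assumptions for illustration; the PDF branch would pass in whatever string extractText() returns.

```python
import collections
import re

# Placeholder for the question's much longer ignore list.
ignore = {"the", "and"}

def count_words(text, counter):
    """Add the words in text to counter, skipping ignored and short words."""
    words = re.findall(r"\w+", text.lower())
    counter.update(w for w in words if w not in ignore and len(w) > 2)

counter = collections.Counter()
count_words("Cyber threat report: the cyber date", counter)
print(counter.most_common(2))
```

With this split, extract() only has to decide how to turn a path into a string and can then call count_words() for every file type.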
