Extract the text from url which is Pdf

I want to extract the text from this pdf. I cannot do this with pypdf as the document was scanned.

The PDF you want to extract text from is actually just a series of scanned images. PdfFileReader and other PDF readers extract text that is already embedded in the document, so they return nothing here. If text isn't already embedded in the PDF, you need OCR to extract it.
You can use Tesseract for that. Tesseract doesn't OCR PDFs directly, so first transform the .pdf to .tiff with something like ImageMagick's convert:
convert -density 300 /path/to/my/document.pdf -depth 8 -strip -background white -alpha off file.tiff
Then run Tesseract on that file. The second argument is the output base name, and Tesseract appends .txt to it:
tesseract file.tiff output
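Since the question is tagged python, here is a minimal sketch that chains the download and the two commands above from a script. It assumes ImageMagick's convert and tesseract are both installed and on the PATH; the function names and default file names are made up for illustration.

```python
import subprocess
import urllib.request


def build_ocr_commands(pdf_path, tiff_path, output_base):
    """Return the two command lines from the answer as argument lists."""
    convert_cmd = [
        "convert", "-density", "300", pdf_path,
        "-depth", "8", "-strip", "-background", "white",
        "-alpha", "off", tiff_path,
    ]
    # Tesseract appends ".txt" to the output base name itself.
    tesseract_cmd = ["tesseract", tiff_path, output_base]
    return convert_cmd, tesseract_cmd


def ocr_pdf_from_url(url, pdf_path="document.pdf",
                     tiff_path="file.tiff", output_base="output"):
    """Download a scanned PDF, rasterize it, OCR it, and return the text."""
    urllib.request.urlretrieve(url, pdf_path)
    for cmd in build_ocr_commands(pdf_path, tiff_path, output_base):
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    with open(output_base + ".txt", encoding="utf-8") as f:
        return f.read()
```

Calling ocr_pdf_from_url with the PDF's address should then return the recognized text, provided both tools run successfully.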

import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
public class PDFreader {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http:/....view.php?fil_Id=5515");
        byte[] response = null;
        // Download the whole PDF into memory
        try (InputStream in = new BufferedInputStream(url.openStream());
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[1024];
            int n = 0;
            while (-1 != (n = in.read(buf))) {
                out.write(buf, 0, n);
            }
            response = out.toByteArray();
        }
        // Save the downloaded bytes to disk; the stream is closed even on error
        try (OutputStream os = new FileOutputStream("abc.pdf")) {
            os.write(response);
        }
        // Extract the embedded text (PDFBox 2.x API; 3.x uses Loader.loadPDF instead)
        File file = new File("abc.pdf");
        try (PDDocument document = PDDocument.load(file)) {
            PDFTextStripper pdfStripper = new PDFTextStripper();
            String text = pdfStripper.getText(document);
            System.out.println(text);
        }
    }
}

Related

Extract zip file inline in Oracle OCI - Object Storage without downloading to save time

Is it possible to extract a zip file 'inline' when it is stored in the cloud, say in Oracle Cloud Object Storage? That is, without downloading it, extracting it on the local OS and re-uploading it to Object Storage, because the file is big and we need to save the upload/download time. Is there any sample code, with Oracle Functions, Python, Java, etc.? I tried S3 browser/explorer and other similar tools, but in the background they basically download the file and extract it on the local computer.

If I understand the question correctly, you have a compressed value on the server and want to extract it on the server and keep it there.
This is possible, and how you do it mostly depends on how the value has been compressed.
If it was compressed with the Lempel-Ziv algorithm used by the UTL_COMPRESS package, you can extract it directly in PL/SQL.
For other formats like zip, you will need some custom Java code, like the following example:
CREATE OR REPLACE
JAVA SOURCE NAMED ZIP_Java
AS
import java.io.*;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;
import java.util.zip.ZipInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.BufferedOutputStream;
import java.sql.Blob;
public class Java_Class {
public static int ZipBlob(Blob inLob, Blob[] outLob, String filename) {
try {
// create the zipoutputstream from the end of the outLob
Blob zipLob = outLob[0];
OutputStream os = zipLob.setBinaryStream(1);
ZipOutputStream zos = new ZipOutputStream(os);
// add one zip entry
ZipEntry entry = new ZipEntry(filename);
zos.putNextEntry(entry);
// write data to the zip lob
long len = inLob.length();
long offset = 0;
byte[] buffer;
int chunksize = 32768;
while (offset < len) {
buffer = inLob.getBytes(offset + 1, chunksize);
if (buffer == null)
break;
zos.write(buffer, 0, buffer.length);
offset += buffer.length;
}
zos.closeEntry();
zos.close();
outLob[0] = zipLob;
} catch (Exception e) {
System.out.println("Exception: " + e.toString());
e.printStackTrace(System.out);
return 0;
}
return 1;
}
public static int UnzipBlob(Blob inLob, Blob[] outLob, String filename) {
try {
final int kBUFFER = 2048;
InputStream inps = inLob.getBinaryStream();
ZipInputStream zis = new ZipInputStream(inps);
ZipEntry entry;
Blob fileLob = outLob[0];
OutputStream os = fileLob.setBinaryStream(1);
while((entry = zis.getNextEntry()) != null) {
if (entry.getName().equalsIgnoreCase(filename)) {
byte data[] = new byte[kBUFFER];
BufferedOutputStream dest = new BufferedOutputStream(os, kBUFFER);
int count;
while ((count = zis.read(data, 0, kBUFFER)) != -1) {
dest.write(data, 0, count);
}
dest.flush();
dest.close();
}
}
zis.close();
return 1;
} catch (Exception e) {
System.out.println("Exception: " + e.toString());
e.printStackTrace();
return 0;
}
}
}
/
CREATE OR REPLACE
FUNCTION ZipBlobJava(theSource IN BLOB, theDestination IN OUT NOCOPY BLOB, theFilename IN VARCHAR2) RETURN NUMBER
AS LANGUAGE JAVA NAME 'Java_Class.ZipBlob(java.sql.Blob, java.sql.Blob[], java.lang.String) return int';
/
CREATE OR REPLACE
FUNCTION UnzipBlobJava(theSource IN BLOB, theDestination IN OUT NOCOPY BLOB, theFilename IN VARCHAR2) RETURN NUMBER
AS LANGUAGE JAVA NAME 'Java_Class.UnzipBlob(java.sql.Blob, java.sql.Blob[], java.lang.String) return int';
/
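Since the question also mentions Python, the same idea (unpack an archive that is held in memory, without writing it to a local disk) can be sketched with the standard zipfile module. This is only an illustration of the in-memory technique, not Oracle-specific code; the function names are made up, and reading the bytes from and writing them back to Object Storage is left out.

```python
import io
import zipfile


def unzip_entry(zip_bytes, filename):
    """Return the bytes of one entry from a zip archive held in memory.

    The name comparison is case-insensitive, like the Java UnzipBlob above.
    Returns None when no entry matches.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.filename.lower() == filename.lower():
                return archive.read(info)
    return None


def zip_entry(data, filename):
    """Return a zip archive (as bytes) containing `data` as a single entry."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(filename, data)
    return buffer.getvalue()
```

Both functions take and return plain bytes, so they can run wherever the data already lives, for example inside a serverless function, as long as the archive fits in memory.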

camera image incorrectly formatted in ctypes pointer (python)

I am using a DLL to call functions that operate a camera from Python. I'm able to retrieve the image using ctypes, but it's formatted incorrectly: the returned image is duplicated and half of it is blank. What do I need to do to fix this?
I have a LabVIEW program that correctly takes images from the camera, so that is how they are supposed to look.
Correct image retrieved using LabVIEW:
Image retrieved using Python:
The image is duplicated and also sideways in Python.
Python code:
from ctypes import *
import numpy as np
import matplotlib.pyplot as plt
mydll = windll.LoadLibrary('StTrgApi.dll')
hCamera = mydll.StTrg_Open()
print(hCamera)
im_height = 1200
im_width = 1600
dwBufferSize = im_height * im_width
pbyteraw = np.zeros((im_height, im_width), dtype=np.uint16)
dwNumberOfByteTrans = 0
dwNumberOfByteTrans = (c_ubyte * dwNumberOfByteTrans)()
dwFrameNo = 0
dwFrameNo = (c_ubyte * dwFrameNo)()
dwMilliseconds = 3000
mydll.StTrg_TakeRawSnapShot(hCamera,
pbyteraw.ctypes.data_as(POINTER(c_int16)), dwBufferSize*2,
dwNumberOfByteTrans, dwFrameNo, dwMilliseconds)
print(pbyteraw)
plt.matshow(pbyteraw)
plt.show()
C++ code for taking the image:
DWORD dwBufferSize = 0;
if(!StTrg_GetRawDataSize(hCamera, &dwBufferSize))
{
_tprintf(TEXT("Get Raw Data Size Failed.\n"));
return(-1);
}
PBYTE pbyteRaw = new BYTE[dwBufferSize];
if(NULL != pbyteRaw)
{
DWORD dwNumberOfByteTrans = 0;
DWORD dwFrameNo = 0;
DWORD dwMilliseconds = 3000;
for(DWORD dwPos = 0; dwPos < 10; dwPos++)
{
if(StTrg_TakeRawSnapShot(hCamera, pbyteRaw, dwBufferSize,
&dwNumberOfByteTrans, &dwFrameNo, dwMilliseconds))
{
TCHAR szFileName[MAX_PATH];
if(is2BytesMode)
{
_stprintf_s(szFileName, _countof(szFileName), TEXT("%s\\%u.tif"), szBitmapFilePath, dwFrameNo);
StTrg_SaveImage(dwWidth, dwHeight, STCAM_PIXEL_FORMAT_16_MONO_OR_RAW, pbyteRaw, szFileName, 0);
}
else
{
_stprintf_s(szFileName, _countof(szFileName), TEXT("%s\\%u.bmp"), szBitmapFilePath, dwFrameNo);
StTrg_SaveImage(dwWidth, dwHeight, STCAM_PIXEL_FORMAT_08_MONO_OR_RAW, pbyteRaw, szFileName, 0);
}
_tprintf(TEXT("Save Image:%s\n"), szFileName);
}
else
{
_tprintf(TEXT("Fail:StTrg_TakeRawSnapShot\n"));
break;
}
}
delete[] pbyteRaw;
}

Based on your C++ code, something like this should work, but it is untested since I don't have your camera library. If you are using 32-bit Python, use WinDLL only if the library functions are __stdcall; otherwise use CDLL. On 64-bit Python it doesn't matter. Defining the argument types and return type helps catch errors. For output parameters, create instances of the correct ctypes type, then pass them with byref(). The way you were handling the output parameters was likely the cause of your crash. Setting argtypes would have detected that the values weren't pointers to DWORDs.
from ctypes import *
from ctypes import wintypes as w
mydll = WinDLL('StTrgApi')
mydll.StTrg_Open.argtypes = None
mydll.StTrg_Open.restype = w.HANDLE
mydll.StTrg_GetRawDataSize.argtypes = w.HANDLE,w.PDWORD
mydll.StTrg_GetRawDataSize.restype = w.BOOL  # assumed: the C++ code tests the result with if(!...)
mydll.StTrg_TakeRawSnapShot.argtypes = w.HANDLE,w.PBYTE,w.DWORD,w.PDWORD,w.PDWORD,w.DWORD
mydll.StTrg_TakeRawSnapShot.restype = w.BOOL  # assumed, as above
hCamera = mydll.StTrg_Open()
print(hCamera)
dwBufferSize = w.DWORD()
mydll.StTrg_GetRawDataSize(hCamera,byref(dwBufferSize))
pbyteraw = (w.BYTE * dwBufferSize.value)()  # .value gives the Python int needed for the array length
dwNumberOfByteTrans = w.DWORD() # output parameter. Pass byref()
dwFrameNo = w.DWORD() # output parameter. Pass byref()
dwMilliseconds = 3000
mydll.StTrg_TakeRawSnapShot(hCamera,
                            pbyteraw,
                            dwBufferSize,
                            byref(dwNumberOfByteTrans),
                            byref(dwFrameNo),
                            dwMilliseconds)
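Once the call succeeds, the raw bytes still have to be interpreted as pixels. The sketch below shows one way to do that with the standard library only. It assumes the camera is in the 2-byte mode from the C++ sample (16-bit mono) and that the pixels are little-endian and stored row by row; neither is confirmed by the library's documentation here, and buffer_to_rows is a made-up helper name. A small hand-filled ctypes array stands in for the camera buffer.

```python
import ctypes
import struct


def buffer_to_rows(raw, width, height):
    """Interpret a ctypes byte buffer as `height` rows of `width` 16-bit pixels."""
    data = bytes(raw)
    expected = width * height * 2  # two bytes per pixel (assumed 16-bit mono)
    if len(data) != expected:
        raise ValueError("expected %d bytes, got %d" % (expected, len(data)))
    # "<" = little-endian (an assumption), "H" = unsigned 16-bit integer
    pixels = struct.unpack("<%dH" % (width * height), data)
    return [list(pixels[r * width:(r + 1) * width]) for r in range(height)]


# Stand-in for the camera buffer: a 3-pixel-wide, 2-pixel-high image.
raw = (ctypes.c_ubyte * 12)(1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 0, 1)
rows = buffer_to_rows(raw, width=3, height=2)
print(rows)
```

The size check is the useful part: if the byte count reported by StTrg_GetRawDataSize does not equal width * height * 2, the pixel format or the image dimensions are not what the script assumes, which would be worth ruling out for a duplicated or half-blank image.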

parsing of javascript objects using python

Friends!
I'm starting to learn Python. I have a problem obtaining a required value from JavaScript text. Here is the code, which I managed to download from a website:
[<script src="//maps.google.com/maps?file=api&v=2&sensor=false&key=ABQIAAAAOjFUxXImJbfYejRUbw0-uBSoJppdodHXaiZe2O5Byw3T7kzYihSys_Exmi235-oDCy6xEhVelBMhBQ" type="text/javascript"></script>, <script type="text/javascript">
var map_shop = null;
var marker_shop = null;
function google_maps_shop_initialize()
{
if (GBrowserIsCompatible())
{
map_shop = new GMap2(document.getElementById("map_canvas_shop"));
point_center = new GLatLng(51.6663267, 39.1898874);
marker_shop = new GMarker(point_center);
map_shop.addOverlay(marker_shop);
map_shop.setCenter(point_center, 13);
//Create new Tile Layer
var gTileUrlTemplate = '//mt1.google.com/vt/lyrs=m#121,transit|vm:1&hl=ru&opts=r&x={X}&y={Y}&z={Z}';
var tileLayerOverlay = new GTileLayerOverlay(
new GTileLayer(null, null, null, {
tileUrlTemplate: gTileUrlTemplate,
isPng:true,
opacity:1
})
);
map_shop.addOverlay(tileLayerOverlay);
}
}
google_maps_shop_initialize();
</script>]
I want to print only the one line of the text that contains the coordinates: point_center = new GLatLng(51.6663267, 39.1898874);
I'm trying to solve it using the re module, but the problem is that the line number may vary, and I get empty output with this code:
if re.match("point_center = new GLatLng", line):
    print (line)
The desired output looks like this:
51.6663267, 39.1898874

If the JavaScript is saved in a text file, you can simply do this:
from ast import literal_eval as make_tuple
marker = "point_center = new GLatLng"
with open("filename.txt") as f:
    for line in f:
        if marker in line:
            # Keep only "(lat, lng)": drop everything up to the marker and the trailing ";"
            linestring = line.split(marker, 1)[1].strip().rstrip(";")
            coords = make_tuple(linestring)
            print(coords)
Your output should be a tuple.
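A regular expression also works and does not depend on where the call sits in the line. Note that re.match only matches at the start of the string, which is why the attempt in the question prints nothing for indented lines; re.search scans the whole string. The sketch below is one possible pattern, with find_coordinates as an illustrative helper name; it takes the script text as a string rather than reading a file.

```python
import re

# Matches "GLatLng(<number>, <number>)" and captures both numbers.
GLATLNG = re.compile(r"GLatLng\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")


def find_coordinates(text):
    """Return (lat, lng) as floats from the first GLatLng(...) call, or None."""
    match = GLATLNG.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


script = """
        map_shop = new GMap2(document.getElementById("map_canvas_shop"));
        point_center = new GLatLng(51.6663267, 39.1898874);
"""
print(find_coordinates(script))
```

For the sample above this yields the pair 51.6663267, 39.1898874, and None when the text contains no GLatLng call.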

How to read .dcm in Xcode using python?

I'm trying to create an app for viewing and analyzing DICOM slices. I have built this app in MATLAB, but MATLAB does not have enough tools to build a really nice GUI, and its 3D rendering is poor. So I tried for a long time to use ITK and VTK to build an app in Xcode, but without any success. One day I found the Xcode project PythonDicomDocument. This project (written in Python) can read and show a DICOM image! I have read a tutorial about Python and Cocoa, but I still can't understand how this project works. It has the file PythonDicomDocumentDocument.py:
from Foundation import *
from AppKit import *
from iiDicom import *
import objc
import dicom
import numpy
import Image
class PythonDicomDocumentDocument(NSDocument):
    imageView = objc.IBOutlet('imageView')
    def init(self):
        self = super(PythonDicomDocumentDocument, self).init()
        self.image = None
        return self
    def windowNibName(self):
        return u"PythonDicomDocumentDocument"
    def windowControllerDidLoadNib_(self, aController):
        super(PythonDicomDocumentDocument, self).windowControllerDidLoadNib_(aController)
        if self.image:
            self.imageView.setImageScaling_(NSScaleToFit)
            self.imageView.setImage_(self.image)
    def dataOfType_error_(self, typeName, outError):
        return None
    def readFromData_ofType_error_(self, data, typeName, outError):
        return NO
    def readFromURL_ofType_error_(self, absoluteURL, typeName, outError):
        if absoluteURL.isFileURL():
            slice = iiDcmSlice.alloc().initWithDicomFileSlice_(absoluteURL.path())
            dicomImage = slice.sliceAsNSImage_context_(True, None)
            if dicomImage:
                self.image = dicomImage
                #self.image = dicomImage
                return True, None
        return False, None
and the file main.m:
#import <Python/Python.h>
#import <Cocoa/Cocoa.h>
int main(int argc, char *argv[])
{
NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
NSBundle *mainBundle = [NSBundle mainBundle];
NSString *resourcePath = [mainBundle resourcePath];
NSArray *pythonPathArray = [NSArray arrayWithObjects: resourcePath, [resourcePath stringByAppendingPathComponent:@"PyObjC"], @"/System/Library/Frameworks/Python.framework/Versions/Current/Extras/lib/python/", nil];
setenv("PYTHONPATH", [[pythonPathArray componentsJoinedByString:@":"] UTF8String], 1);
NSArray *possibleMainExtensions = [NSArray arrayWithObjects: @"py", @"pyc", @"pyo", nil];
NSString *mainFilePath = nil;
for (NSString *possibleMainExtension in possibleMainExtensions) {
mainFilePath = [mainBundle pathForResource: @"main" ofType: possibleMainExtension];
if ( mainFilePath != nil ) break;
}
if ( !mainFilePath ) {
[NSException raise: NSInternalInconsistencyException format: @"%s:%d main() Failed to find the Main.{py,pyc,pyo} file in the application wrapper's Resources directory.", __FILE__, __LINE__];
}
Py_SetProgramName("/usr/bin/python");
Py_Initialize();
PySys_SetArgv(argc, (char **)argv);
const char *mainFilePathPtr = [mainFilePath UTF8String];
FILE *mainFile = fopen(mainFilePathPtr, "r");
int result = PyRun_SimpleFile(mainFile, (char *)[[mainFilePath lastPathComponent] UTF8String]);
if ( result != 0 )
[NSException raise: NSInternalInconsistencyException
format: @"%s:%d main() PyRun_SimpleFile failed with file '%@'. See console for errors.", __FILE__, __LINE__, mainFilePath];
[pool drain];
return result;
}
So I want to "translate" MATLAB code for reading .dcm:
directory = uigetdir; % a Finder window appears and the user chooses a folder with .dcm files
fileFolder = directory; % the path to the folder is saved to the variable fileFolder
dirOutput = dir(fullfile(fileFolder,'*.dcm')); % select the .dcm files in the specified folder and save their names
fileNames = {dirOutput.name}';
Names = char(fileNames);
numFrames = numel(fileNames); % count the number of files in the folder
for i = 1:numFrames
    Volume(:,:,i) = dicomread(fullfile(fileFolder,Names(i,:))); % create a 3D array of DICOM pixel data
end;
Could anyone please tell me how to run the same code for reading .dcm files in Xcode using Python?
I've heard that Python and MATLAB are similar.

Congratulations on choosing Python for working with DICOM; the SciPy/numpy/matplotlib clan is much better at dealing with huge amounts of volume data than MATLAB (or at least GNU Octave) in my experience.
Trivial load and display code using GDCM's Python bindings, a ConvertNumpy.py from GDCM's examples and matplotlib:
#!/usr/bin/env python
import gdcm
import ConvertNumpy
import numpy as np
import matplotlib.pyplot as plt
def loadDicomImage(filename):
    reader=gdcm.ImageReader()
    reader.SetFileName(filename)
    reader.Read()
    gdcmimage=reader.GetImage()
    return ConvertNumpy.gdcm_to_numpy(gdcmimage)
image=loadDicomImage('mydicomfile.dcm')
plt.gray()
plt.imshow(image)
plt.show()
Note that if your DICOM data contains "padding" values significantly outside your image's air-bone range, it might confuse imshow's auto-scaling. Use the vmin and vmax parameters of that call to specify the range you actually want to see, or implement your own window-levelling code (which is trivial in numpy).
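To show what window-levelling in numpy can look like, here is a minimal sketch. The function name and the center/width parameterization are my own choice for illustration and are not part of GDCM or matplotlib; values below the window map to 0 and values above it map to 1.

```python
import numpy as np


def window_level(image, center, width):
    """Map [center - width/2, center + width/2] linearly onto [0, 1]."""
    low = center - width / 2.0
    high = center + width / 2.0
    # Clip first so padding values far outside the window cannot dominate.
    clipped = np.clip(image.astype(np.float64), low, high)
    return (clipped - low) / (high - low)


# A tiny made-up image with one padding-like value (-2000).
pixels = np.array([[-2000, 0], [500, 3000]])
print(window_level(pixels, center=500, width=1000))
```

The result can be passed straight to plt.imshow, since the padding values are already clamped to the ends of the displayed range.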

Problem in downloading the file using sharepoint copy.asmx

I am trying to download a file from a document library using the SharePoint web service copy.asmx. The file is only 100 KB.
But it's not downloading the file.
The web service itself returns an empty stream (out byte[] Stream) in its response. Is that a memory issue?
It also returns a message like "download_document()out of memory".
Note: I am using an MFP printer to view this application.

Please try the function below. You need to pass FileURL (the full web URL of the document) and Title (the name you want to give the downloaded file).
public string DownLoadfiletolocal(string FileURL, string Title)
{
    // Copy.Copy is the web service proxy object that I consumed.
    Copy.Copy CopyObj = new Copy.Copy();
    CopyObj.Url = SiteURL + "/_vti_bin/copy.asmx"; // Dynamically passing SiteURL
    NetworkCredential nc2 = new NetworkCredential();
    nc2.Domain = string.Empty;
    nc2.UserName = _UserName;
    nc2.Password = _Password;
    CopyObj.Credentials = nc2; // without this, the credentials above are never used
    string copySource = FileURL; // Pass the full URL of the document.
    Copy.FieldInformation myFieldInfo = new Copy.FieldInformation();
    Copy.FieldInformation[] myFieldInfoArray = { myFieldInfo };
    byte[] myByteArray;
    // Call the web service
    uint myGetUint = CopyObj.GetItem(copySource, out myFieldInfoArray, out myByteArray);
    // Write the returned bytes to a file in the temp folder
    // (a Base64 round trip is not needed; the bytes are already binary)
    string tempFileName = Path.Combine(Path.GetTempPath(), Title);
    File.WriteAllBytes(tempFileName, myByteArray);
    return tempFileName;
}
