Extraction of data from pdf - python

I am developing a project as part of an internship for a research institute.
For that I have to feed my SQL database with data from PDF files, like the one below:
https://www.mairie-orly.fr/content/download/11449/88884/file/Menu+du+mois+de+janvier+2021.pdf
I currently convert my PDF files to JSON via the pdftotext API; this is the best JSON output I have found so far. Here is the result:
{
  "document": {
    "page": {
      "#index": "0",
      "row": [
        {
          "column": [
            { "text": "" },
            { "text": "" },
            {
              "text": {
                "#fontName": "Arial",
                "#fontSize": "10.0",
                "#fontStyle": "Bold",
                "#x": "271.10",
                "#y": "101.78",
                "#width": "29.40",
                "#height": "9.72",
                "#text": "LUNDI"
              }
            },
            {
              "text": {
                "#fontName": "Arial",
                "#fontSize": "10.0",
                "#fontStyle": "Bold",
                "#x": "476.09",
                "#y": "101.78",
                "#width": "31.26",
                "#height": "9.72",
                "#text": "MARDI"
              }
            },
            etc... etc... etc...
I need to feed my database with a menu and a corresponding date.
Maybe my conversion is not well suited to the problem?
How can I retrieve a menu and the corresponding date? Which path should I follow?
Thanks

If you want to extract the data from the PDF, you can make a request with urllib and retrieve the text using Apache Tika. With urllib you can save the file locally and then get the text from the file like this:
import urllib.request
from tika import parser

url = 'https://www.mairie-orly.fr/content/download/11449/88884/file/Menu+du+mois+de+janvier+2021.pdf'

# download the PDF to a local file
urllib.request.urlretrieve(url, './a.pdf')

# extract the text content with Apache Tika
file_data = parser.from_file('./a.pdf')
text = file_data['content']

# print for debugging
print(text)
For Tika you need to install Java (a recent version) and pip install tika.
Hope it helps!
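Alternatively, the positional JSON you already get from pdftotext contains an x coordinate for every text cell, so you can group the cells under the weekday header they sit below. This is only a rough sketch: it assumes the structure shown in the question (document > page > row > column > text), that the weekday headers appear on each page, and menu.json is a hypothetical file name.
import json

WEEKDAYS = {"LUNDI", "MARDI", "MERCREDI", "JEUDI", "VENDREDI"}

def as_list(node):
    # pdftotext emits a dict for a single element and a list otherwise
    return node if isinstance(node, list) else [node]

def cells(page):
    # yield (x, text) for every non-empty text cell on the page
    for row in as_list(page.get("row", [])):
        for column in as_list(row.get("column", [])):
            cell = column.get("text")
            if isinstance(cell, dict) and cell.get("#text"):
                yield float(cell["#x"]), cell["#text"].strip()

with open("menu.json", encoding="utf-8") as f:  # hypothetical file name
    doc = json.load(f)

for page in as_list(doc["document"]["page"]):
    items = list(cells(page))
    # x position of each weekday header, e.g. {'LUNDI': 271.10, 'MARDI': 476.09}
    headers = {text: x for x, text in items if text in WEEKDAYS}
    if not headers:
        continue
    # attach every remaining cell to the closest weekday column
    by_day = {day: [] for day in headers}
    for x, text in items:
        if text not in WEEKDAYS:
            day = min(headers, key=lambda d: abs(headers[d] - x))
            by_day[day].append(text)
    print(by_day)
From there you still have to separate the dates from the dish names within each column, but at least the cells are grouped per day and ready to be inserted into your SQL database.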

Related

Preventing libtorrent creating torrents with the attr attribute

I'm trying to make a Python app that creates torrent files from other files. I need this app to always create the same infohash for any given file (including files with the same content but different names), so I create a symlink named "blob" before creating the torrent, and it works. The only problem I'm facing is that if the file has executable permissions, libtorrent writes the "attr" key in the bencoded torrent, changing its infohash.
I found this discussion by looking for this attribute: https://libtorrent-discuss.narkive.com/bXlfCX2K/libtorrent-extra-elements-in-the-torrent-files-array
The code I'm using is this:
import libtorrent as lt
import os

# "file" holds the path to the source file
os.symlink(file, "blob")
fs = lt.file_storage()
lt.add_files(fs, "blob")
t = lt.create_torrent(fs)
lt.set_piece_hashes(t, ".")
os.remove("blob")
t.add_tracker("udp://tracker.openbittorrent.com:6969/announce", 0)
t.set_creator('libtorrent')
torrent = lt.bencode(t.generate())
with open("file.torrent", "wb") as f:
    f.write(torrent)
The bencoded torrent in json format:
{
  "announce": "udp://tracker.openbittorrent.com:6969/announce",
  "announce-list": [
    ["udp://tracker.openbittorrent.com:6969/announce"]
  ],
  "created by": "libtorrent",
  "creation date": 1650389319,
  "info": {
    "attr": "x",
    "file tree": {
      "blob": {
        "": {
          "attr": "x",
          "length": 3317054,
          "pieces root": "<hex>...</hex>"
        }
      }
    },
    "length": 3317054,
    "meta version": 2,
    "name": "blob",
    "piece length": 32768,
    "pieces": "<hex>...</hex>"
  },
  "piece layers": {
    "<hex>...</hex>": "<hex>...</hex>"
  }
}
Is there any way to tell libtorrent to ignore this key attribute when creating the torrent to ensure that it always makes the same infohash?
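One possible workaround, a sketch rather than a documented libtorrent switch: since the "attr": "x" key only appears because the hashed file carries the executable bit, you can hash a temporary copy that never gets that bit. shutil.copyfile copies the data but not the permissions, and an explicit os.chmod keeps the mode independent of your umask. The names mirror the snippet above; the extra disk space for the copy is the trade-off.
import libtorrent as lt
import os
import shutil

# copy instead of symlinking: the copy does not inherit the executable bit,
# so libtorrent has no reason to emit "attr": "x" for the hashed file
shutil.copyfile(file, "blob")   # "file" holds the source path, as above
os.chmod("blob", 0o644)         # pin the mode so the result is umask-independent

fs = lt.file_storage()
lt.add_files(fs, "blob")
t = lt.create_torrent(fs)
lt.set_piece_hashes(t, ".")
os.remove("blob")

t.add_tracker("udp://tracker.openbittorrent.com:6969/announce", 0)
t.set_creator('libtorrent')
with open("file.torrent", "wb") as f:
    f.write(lt.bencode(t.generate()))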

Elasticsearch prevent indexing of Markdown hyperlinks

I am building a Markdown file content search using Elasticsearch. Currently the whole content of the MD file is indexed in Elasticsearch, but the problem is that results like [Mylink](https://link-url-here.org) and [Mylink2](another_page.md) show up in the search results.
I would like to prevent indexing of hyperlinks and references to other pages. When someone searches for "Mylink" it should only return the text without the URL. It would be great if someone could help me with the right solution for this.
You need to render the Markdown in your indexing application, then remove the HTML tags and save the result alongside the Markdown source.
I think you have two main options for this problem.
First: clean the data in your source code before indexing it into Elasticsearch (a sketch of this is shown below).
Second: let Elasticsearch clean the data for you.
The first option is the easier one, but if you need to do this inside Elasticsearch you have to create an ingest pipeline.
You can then use the Script processor with a Painless script that matches your regex and removes the links.
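For the first option, a minimal sketch of stripping Markdown links before indexing, assuming you do the indexing from Python; the regex is a simplified link pattern and strip_markdown_links is just an illustrative name:
import re

# matches [link text](target) and keeps only the link text
MD_LINK = re.compile(r'\[([^\]\[]+)\]\([^)]+\)')

def strip_markdown_links(markdown: str) -> str:
    return MD_LINK.sub(r'\1', markdown)

print(strip_markdown_links("[Mylink](https://link-url-here.org) and [Mylink2](another_page.md)"))
# -> Mylink and Mylink2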
You could use an ingest pipeline with a script processor to extract the link text:
1. Set up the pipeline
PUT _ingest/pipeline/clean_links
{
  "description": "...",
  "processors": [
    {
      "script": {
        "source": """
          if (ctx["content"] == null) {
            // nothing to do here
            return
          }
          def content = ctx["content"];
          Pattern pattern = /\[([^\]\[]+)\](\(((?:[^\()]+)+)\))/;
          Matcher matcher = pattern.matcher(content);
          def purged_content = matcher.replaceAll("$1");
          ctx["purged_content"] = purged_content;
        """
      }
    }
  ]
}
The regex can be tested with any online regex tester.
2. Include the pipeline when ingesting the docs
POST my-index/_doc?pipeline=clean_links
{
"content": "[Mylink](https://link-url-here.org) [anotherLink](http://dot.com)"
}
POST my-index/_doc?pipeline=clean_links
{
"content": "[Mylink2](another_page.md)"
}
If you index from Python, the official Elasticsearch client lets you pass the same pipeline parameter per request.
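A minimal sketch of that, assuming the 8.x elasticsearch Python client, a local unauthenticated cluster, and the index name my-index used above:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# route the document through the clean_links pipeline defined above
es.index(
    index="my-index",
    pipeline="clean_links",
    document={"content": "[Mylink](https://link-url-here.org) [anotherLink](http://dot.com)"},
)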
3. Verify
GET my-index/_search?filter_path=hits.hits._source
should yield
{
  "hits" : {
    "hits" : [
      {
        "_source" : {
          "purged_content" : "Mylink anotherLink",
          "content" : "[Mylink](https://link-url-here.org) [anotherLink](http://dot.com)"
        }
      },
      {
        "_source" : {
          "purged_content" : "Mylink2",
          "content" : "[Mylink2](another_page.md)"
        }
      }
    ]
  }
}
You could instead overwrite the original content field if you want to fully discard the raw links from your _source.
Or you could go a step further in the other direction and store the text + link pairs in a nested field of the form:
{
  "content": "...",
  "links": [
    {
      "text": "Mylink",
      "href": "https://link-url-here.org"
    },
    ...
  ]
}
so that when you later decide to make them searchable, you'll be able to do so with precision.
Shameless plug: you can find other hands-on ingestion guides in my Elasticsearch Handbook.

singer tap-zendesk - how to extract catalog.json with selected:True from discovery mode

I am using singer's tap-zendesk library and want to extract data from specific schemas.
I am running the following command in sync mode:
tap-zendesk --config config.json --catalog catalog.json.
Currently my config.json file has the following parameters:
{
  "email": "<email>",
  "api_token": "<token>",
  "subdomain": "<domain>",
  "start_date": "<start_date>"
}
I've managed to extract data by putting 'selected':true under schema, properties and metadata in the catalog.json file. But I was wondering if there was an easier way to do this? There are around 15 streams I need to go through.
I manage to get the catalog.json file through the discovery mode command:
tap-zendesk --config config.json --discover > catalog.json
The output looks something like the following, but that means that I have to go and add selected:True under every field.
{
  "streams": [
    {
      "stream": "tickets",
      "tap_stream_id": "tickets",
      "schema": {
        **"selected": "true"**,
        "properties": {
          "organization_id": {
            **"selected": "true"**,},
          "metadata": [
            {
              "breadcrumb": [],
              "metadata": {
                **"selected": "true"**
              }
The "selected": true needs to be applied only once per stream: add it to the entry in the stream's metadata section whose "breadcrumb" is []. This is very poorly documented.
Please see this blog post for some helpful details: https://medium.com/getting-started-guides/extracting-ticket-data-from-zendesk-using-singer-io-tap-zendesk-57a8da8c3477
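Since there are around 15 streams, it may be easier to patch the discovered catalog with a small script instead of editing it by hand. This is only a sketch built on the rule above (mark the stream-level metadata entry, i.e. the one with an empty breadcrumb); catalog_selected.json is just a hypothetical output name:
import json

with open("catalog.json") as f:
    catalog = json.load(f)

for stream in catalog["streams"]:
    for entry in stream.get("metadata", []):
        # the stream-level entry is the one with an empty breadcrumb
        if entry.get("breadcrumb") == []:
            entry["metadata"]["selected"] = True

with open("catalog_selected.json", "w") as f:
    json.dump(catalog, f, indent=2)
You can then run sync mode with tap-zendesk --config config.json --catalog catalog_selected.json.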

How to insert variable value into JSON string for use with PDAL

I'm trying to use the Python extension for PDAL to read in a laz file.
To do so, I'm using the simple pipeline structure as exampled here: https://gis.stackexchange.com/questions/303334/accessing-raw-data-from-laz-file-in-python-with-open-source-software. It would be useful for me, however, to insert the value contained in a variable for the "filename:" field. To do so, I've tried the following, where fullFileName is a str variable containing the name (full path) of the file, but I am getting an error that no such file exists. I am assuming my JSON syntax is slightly off or something; can anyone help?
pipeline="""{
"pipeline": [
{
"type": "readers.las",
"filename": "{fullFileName}"
}
]
}"""
You can follow this code:
import json
import pdal
file = "D:/Lidar data/input.laz"
pipeline = {
    "pipeline": [
        {
            "type": "readers.las",
            "filename": file
        },
        {
            "type": "filters.sort",
            "dimension": "Z"
        }
    ]
}
r = pdal.Pipeline(json.dumps(pipeline))
r.validate()
points = r.execute()
print(points)
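For completeness: the snippet in the question fails because the triple-quoted string is not an f-string, so {fullFileName} is never substituted and PDAL looks for a file literally named {fullFileName}. If you prefer to keep the JSON-as-a-string style, an f-string also works as long as the literal braces are doubled; a small sketch (the path is just an example):
fullFileName = "D:/Lidar data/input.laz"  # example path
pipeline = f"""{{
    "pipeline": [
        {{
            "type": "readers.las",
            "filename": "{fullFileName}"
        }}
    ]
}}"""
Building the pipeline as a Python dict and passing it through json.dumps, as above, avoids the brace escaping and any quoting issues in the path.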

Python: Access a dictionary that is located inside of a text file

I am working on a quick and dirty script to get Chromium's bookmarks and turn them into a pipe menu for Openbox. Chromium stores its bookmarks in a file called Bookmarks that holds information in dictionary form like this:
{
  "checksum": "99999999999999999999",
  "roots": {
    "bookmark_bar": {
      "children": [ {
        "date_added": "9999999999999999999",
        "id": "9",
        "name": "Facebook",
        "type": "url",
        "url": "http://www.facebook.com/"
      }, {
        "date_added": "999999999999",
        "id": "9",
        "name": "Twitter",
        "type": "url",
        "url": "http://twitter.com/"
How would I open the dictionary in this file in Python and assign it to a variable? I know you open a file with open(), but I don't really know where to go from there. In the end, I want to be able to access the info in the dictionary via something like bookmarks['bookmark_bar']['children'][0]['name'] and have it return 'Facebook'.
Do you know if this is a JSON file? If so, Python provides a json library.
JSON is a data serialization/interchange format. It's nice because it's cross-platform. Importing it like you ask is fairly easy; here's an example from the docs:
>>> import json
>>> json.loads('["foo", {"bar":["baz", null, 1.0, 2]}]')
[u'foo', {u'bar': [u'baz', None, 1.0, 2]}]
So in your case it would look something like:
import json

with open('Bookmarks') as f:  # Chromium's bookmarks file is plain JSON
    bookmarks = json.loads(f.read())

print(bookmarks['roots']['bookmark_bar']['children'][0]['name'])  # -> 'Facebook'
JSON is definitely the "right" way to do this, but for a quick-and-dirty script eval() might suffice:
with open('file.txt') as f:
    bookmarks = eval(f.read())
