Convert BibTeX file to database entries using Python

Given a BibTeX file, I need to add the respective fields (author, title, journal, etc.) to a table in a MySQL database (with a custom schema).
After doing some initial research, I found that there exists Bibutils, which I could use to convert a .bib file to XML. My initial idea was to convert it to XML and then parse the XML in Python to populate a dictionary.
My main questions are:
Is there a better way I could do this conversion?
Is there a library which directly parses BibTeX and gives me the fields in Python?
(I did find bibliography.parsing, which uses Bibutils internally, but there is not much documentation on it and I am finding it tough to get it to work.)

Old question, but I am doing the same thing at the moment using the Pybtex library, which has an inbuilt parser:
from pybtex.database.input import bibtex

# open a bibtex file
parser = bibtex.Parser()
bibdata = parser.parse_file("myrefs.bib")

# loop through the individual references
for bib_id in bibdata.entries:
    b = bibdata.entries[bib_id].fields
    try:
        # change these lines to create a SQL insert
        print(b["title"])
        print(b["journal"])
        print(b["year"])
        # deal with multiple authors
        for author in bibdata.entries[bib_id].persons["author"]:
            print(author.first(), author.last())
    # field may not exist for a reference
    except KeyError:
        continue
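To turn the commented lines into an actual insert, a parameterized query is the safe route. A minimal sketch, assuming the mysql-connector-python driver and a hypothetical refs table (connection details are placeholders):

import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(user="me", password="secret", database="bibdb")
cur = conn.cursor()
for bib_id in bibdata.entries:
    b = bibdata.entries[bib_id].fields
    # .get() turns missing fields into NULLs instead of raising KeyError
    cur.execute(
        "INSERT INTO refs (bib_key, title, journal, year) VALUES (%s, %s, %s, %s)",
        (bib_id, b.get("title"), b.get("journal"), b.get("year")))
conn.commit()
conn.close()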

My workaround is to use bibtexparser to export the relevant fields to a CSV file:
import bibtexparser
import pandas as pd

with open("../../bib/small.bib") as bibtex_file:
    bib_database = bibtexparser.load(bibtex_file)

df = pd.DataFrame(bib_database.entries)
selection = df[['doi', 'number']]
selection.to_csv('temp.csv', index=False)
And then write the CSV to a table in the database, and delete temp.csv.
This avoids some complications I ran into with pybtex.
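Since the entries are already in a DataFrame, the temp.csv step can even be skipped by writing straight to the database. A minimal sketch, assuming SQLAlchemy and the mysql-connector driver are installed (credentials and table name are placeholders):

from sqlalchemy import create_engine

# placeholder connection string and table name
engine = create_engine("mysql+mysqlconnector://user:password@localhost/bibdb")
selection.to_sql("refs", engine, if_exists="append", index=False)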

You can also use Python BibtexParser: https://github.com/sciunto/python-bibtexparser
Documentation: https://bibtexparser.readthedocs.org
It's very straightforward (I use it in production).
For the record, I am not the developer of this library.

Converting to XML is a fine idea.
XML exists as an application-independent data format, so that you can parse it with readily-available libraries; using it as an intermediary has no particular drawbacks. In fact, you can usually import XML into a database without even going through a programming language such as Python (although the amount of Python you'd have to write for a task like this is trivial).
So far as I know, there is no direct, mature BibTeX reader for Python.
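That said, the XML route is straightforward to sketch. Assuming myrefs.xml was produced by Bibutils (bib2xml myrefs.bib > myrefs.xml), which emits MODS XML, the standard library can walk it; the element names below follow the MODS v3 schema:

import xml.etree.ElementTree as ET

NS = {"m": "http://www.loc.gov/mods/v3"}
tree = ET.parse("myrefs.xml")  # output of: bib2xml myrefs.bib > myrefs.xml

# one <mods> element per reference in the collection
for mods in tree.getroot().findall("m:mods", NS):
    title = mods.findtext("m:titleInfo/m:title", default="", namespaces=NS)
    authors = [n.findtext("m:namePart", default="", namespaces=NS)
               for n in mods.findall("m:name[@type='personal']", NS)]
    print(title, authors)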

You could use the Perl package Bib2ML (a.k.a. Bib2HTML). It contains a bib2sql tool that generates an SQL database from a BibTeX database.
An alternative tool: bibsql and bibtosql.
Then you can feed it to your schema by writing some SQL conversion queries.

Related

Modifying XML file from string source - how to do it?

I have a problem where I want to change some lines in my XML, but this XML is not in a file; it is in a string. I am using Python 3.x and the lib xml.etree.ElementTree for this purpose.
I have this piece of code, which I know works for files, but as I said, I want no files, only operations on string sources.
source_tree = ET.ElementTree(ET.fromstring(source_config))
source_tree_root = ET.fromstring(source_config)
for item in source_tree_root.iter('generation'):
    item.text = item.text.replace(self.firstarg, self.secondarg)
This works, but I don't know how to save it. I tried
source_tree.write(source_config, encoding='latin-1')
but this doesn't work (it treats the whole XML string as a file name).
I don't think you need both source_tree and source_tree_root. By having both, you're creating two separate things. When you write using source_tree, you don't get the changes made to source_tree_root.
Try creating an ElementTree from source_tree_root (which is just an Element), like this (untested, since you didn't supply an MCVE)...
source_tree_root = ET.fromstring(source_config)
for item in source_tree_root.iter('generation'):
    item.text = item.text.replace(self.firstarg, self.secondarg)
ET.ElementTree(source_tree_root).write("output.xml")
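If you want the result back as a string rather than a file (which is what the question seems to ask for), ET.tostring will serialize the modified tree:

# back to a string instead of a file
xml_text = ET.tostring(source_tree_root, encoding='unicode')   # str
xml_bytes = ET.tostring(source_tree_root, encoding='latin-1')  # bytes, with XML declaration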
OK, I thought that you were using lxml. My bad. Here is how you would do it with lxml; I think lxml is superior.
Here is the basic way of parsing a string into an XML document.
from lxml import etree

doc = etree.fromstring(yourstring)
for e in doc.xpath('//sometagname'):
    e.set('foo', 'bar')
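And to get the modified document back out of lxml as a string:

result = etree.tostring(doc, encoding='unicode')  # the modified XML as a str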

LibreOffice/other method of filling template .txt file for import into Zim Wiki

I am using the application Zim Wiki (cross-platform, FOSS) to keep a personal wiki with lots of data coming from tables, copy and pasting, my own writing, and downloaded .png and .html files attached for viewing and offline access.
The data that is not written or pasted can be stored in tables in the form of names, URL addresses, and the names and locations of images and other attachments.
To insert into Zim, I can use the WYSIWYG front end, or, to make the skeleton of each entry, I could modify a template text entry. If I do this, nothing matters except the location and identity of each character in each line.
By supplying the text in this image: DandelionDemo source text,
I can make this entry for Dandelion: DandelionDemo Wiki.
So, I can generate and name the Wiki entry in Zim, which creates the .txt file for me and inserts the time stamp and title. The template for this type of entry, without the pasted fields, would be:
**Full Scientific Name: **[[|]]**[syn]**
**Common Name(s): **
===== =====
**USDA PLANTS entry for Code:** [[https://plants.usda.gov/core/profile?symbol=|]] **- CalPhotos available images for:** [[https://calphotos.berkeley.edu/cgi/img_query?query_src=photos_index&where-taxon=|]]
**---**
**From - Wikipedia **[[wp?]] **- **[[/Docs/Plants/]]
{{/Docs/Plants/?height=364}}{{/Docs/Plants/?height=364}}
**()** //,// [[|(source)]]
**()** //// [[|(source)]]
**Wikipedia Intro: **////
---
So, on the first line with content, after the 31st character (which is a tab), you paste "http... {etc}". Then the procedure would insert "Taraxacum officinale... {etc}" after the "|" (what was the 32nd character), and so on. This data could come from "table1" and "table2", or from combining the tables into an un-normalized "table1table2", where each row could be converted to text or a .csv, or... I don't know, what do you think?
Is there a way in LibreOffice to do this? I have used LibreOffice Base to generate a "book" form that populated fields, but that was much less complex data, without wiki linking and drag-and-drop pasting of images and attachments. So maybe the answer is to go simpler? The tables are not currently part of a registered database, but I could do that once I have decided on the method.
I am ultimately looking for a "way", hopefully an "easy" way; however, that may not be in LibreOffice. If not, I know that I could do this in Python, but I haven't learned much about Python yet. If it involves a language, that is the first and only one I don't know that I will invest in learning for this project. If you know a "way" to do this in Python, let me know, and my first project and way of framing my study process will be learning the methods that you share.
If you know of some other Linux GUI, I am definitely interested, but only in active free and open source builds that involve minimal or no compiling. I know the basics of SQL and DBMSs. In the past, I have gotten Microsoft SQL server lite to work, but not DBeaver, yet. If you know of a CLI way, also let me know, but I am a self-taught, outdoors-loving Linux newb who mostly knows how to tweak little settings in programs and use moderately easy programs like ImageMagick, and I have built a few LAMP stacks for Drupal and WordPress (no Bash etc...).
Thank you very much!
OK, since you want to learn some Python, let me propose a way to do this. First you need a template engine, like jinja2 (there are many others); a data source, in our example a .csv file (it could be something else, like a database); and finally some code that reads the CSV line by line and mixes the content with the template.
Sample CSV file:
1;sample
2;dandelion
3;just for fun
Sample template:
**Full Scientific Name: **[[|]]**[syn]**
**Common Name(s): *{{name}}*
===== =====
USDA PLANTS entry for Code: *{{symbol}}*
---
Sample code:
#!/usr/bin/env python
#
# Using the file system loader.
#
# We assume we have a template in the same dir as this script called
# test_template.zim
#
from jinja2 import Environment, FileSystemLoader
import os
import csv

# Capture our current directory
THIS_DIR = os.path.dirname(os.path.abspath(__file__))

def print_zim_doc():
    # Create the jinja2 environment.
    # Notice the use of trim_blocks, which greatly helps control whitespace.
    j2_env = Environment(loader=FileSystemLoader(THIS_DIR),
                         trim_blocks=True)
    template = j2_env.get_template('test_template.zim')
    with open('example.csv') as csv_file:
        reader = csv.reader(csv_file, delimiter=';')
        for row in reader:
            result = template.render(symbol=row[0], name=row[1])
            # save the rendered result, named after the first column
            with open(row[0] + ".txt", "wt") as fh:
                fh.write(result)

if __name__ == '__main__':
    print_zim_doc()
The code is pretty simple: it reads the template located in the same folder as the Python code, opens the CSV file (also located in the same place), iterates over each line of the CSV, renders the template using the values of the CSV columns to fill the {{var_name}} placeholders in the template, and finally saves the rendered result in a new file named after one of the CSV column values. This sample will generate three files (1.txt, 2.txt, 3.txt). From here you can extend and improve the code to get your desired results.

Pulling data out of MS Word with pywin32

I am running Python 3.3 in Windows and I need to pull strings out of Word documents. I have been searching far and wide for about a week on the best method to do this. Originally I tried to save the .docx files as .txt and parse through using REs, but I had some formatting problems with hidden characters; I was using a script to open a .docx and save as .txt. I am wondering, if I did a proper File > Save As > .txt, would it strip out the odd formatting so that I could properly parse through? I don't know, but I gave up on this method.
I tried to use the docx module, but I've been told it is not compatible with Python 3.3. So I am left with using pywin32 and the COM. I have used this successfully with Excel to get the data I need, but I am having trouble with Word because there is FAR less documentation and reading through the object model on Microsoft's website is over my head.
Here is what I have so far to open the document(s):
import win32com.client as win32
import glob, os

word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = True

for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
    print(infile)
    doc = word.Documents.Open(infile)
So at this point I can do something like
print(doc.Content.Text)
And see the contents of the files, but it still looks like there is some odd formatting in there, and I have no idea how to actually parse through to grab the data I need. I can create REs that will successfully find the strings I'm looking for; I just don't know how to implement them in the program using the COM.
The code I have so far was mostly found through Google. I don't even think this is that hard; it's just that reading through the object model on Microsoft's website is like reading a foreign language. Any help is MUCH appreciated. Thank you.
Edit: here is the code I was using to save the files from .docx to .txt:
for path, dirs, files in os.walk(r'mypath'):
    for doc in [os.path.abspath(os.path.join(path, filename))
                for filename in files
                if fnmatch.fnmatch(filename, '*.docx')]:
        print("processing %s" % doc)
        wordapp.Documents.Open(doc)
        docastxt = doc.rstrip('docx') + 'txt'
        wordapp.ActiveDocument.SaveAs(docastxt,
                                      FileFormat=win32com.client.constants.wdFormatText)
        wordapp.ActiveDocument.Close()
If you don't want to learn the complicated way Word models documents, and then how that's exposed through the Office object model, a much simpler solution is to have Word save a plain-text copy of the file.
There are a lot of options here. Use tempfile to create temporary text files and then delete them, or store permanent ones alongside the doc files for later re-use? Use Unicode (which, in Microsoft speak, means UTF-16-LE with a BOM) or encoded text? And so on. So, I'll just pick something reasonable, and you can look at the Document.SaveAs, WdSaveFormat, etc. docs to modify it.
wdFormatUnicodeText = 7

for infile in glob.glob(os.path.join(r'mypath', '*.docx')):
    print(infile)
    doc = word.Documents.Open(infile)
    txtpath = os.path.splitext(infile)[0] + '.txt'
    doc.SaveAs(txtpath, wdFormatUnicodeText)
    doc.Close()
    with open(txtpath, encoding='utf-16') as f:
        process_the_file(f)
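Here process_the_file is just a stand-in for whatever you do with the text; for instance, this is where your existing regexes would run (the pattern below is purely a placeholder):

import re

def process_the_file(f):
    text = f.read()
    # placeholder pattern: grab anything that looks like "Name: value"
    for match in re.finditer(r'^Name:\s*(.+)$', text, re.MULTILINE):
        print(match.group(1))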
As noted in your comments, what this does to complex things like tables, multi-column text, etc. may not be exactly what you want. In that case, you might want to consider saving as, e.g., wdFormatFilteredHTML, which Python has nice parsers for. (It's a lot easier to BeautifulSoup a table than to win32com-Word it.)
oodocx is my fork of the python-docx module that is fully compatible with Python 3.3. You can use the replace method to do regular expression searches. Your code would look something like:
from oodocx import oodocx

d = oodocx.Docx('myfile.docx')
d.replace('searchstring', 'replacestring')
d.save('mynewfile.docx')
If you just want to remove strings, you can pass an empty string to the "replace" parameter.

cPickle.load( ) error

I am working with cPickle to convert structured data into a datastream format and pass it to the library. What I have to do is read the file contents from a manually written file named "targetstrings.txt" and convert them into the format the NetCDF library needs, in the following manner.
Note: targetstrings.txt contains Latin characters
op = open("targetstrings.txt", 'rb')
targetStrings = cPickle.load(op)
The NetCDF library takes the contents as strings.
While loading the file, it fails with the following error:
cPickle.UnpicklingError: invalid load key, 'A'.
Please tell me how I can rectify this error; I have googled around but did not find an appropriate solution.
Any suggestions?
pickle is not for reading/writing generic text files, but for serializing/deserializing Python objects to file. If you want to read text data, you should use Python's usual I/O functions.
with open('targetstrings.txt', 'r') as f:
    fileContent = f.read()
If, as it seems, the library just wants to have a list of strings, taking each line as a list element, you just have to do:
with open('targetstrings.txt', 'r') as f:
    lines = [l for l in f]
    # now in lines you have the lines read from the file
As stated, Pickle is not meant to be used in this way.
If you need to manually edit complex Python objects that are to be read and passed as Python objects to another function, there are plenty of other formats to use, for example XML, JSON, or Python files themselves. Pickle uses a Python-specific protocol that, while not binary (in version 0 of the protocol) and not changing across Python versions, is not meant for this; it is not even the recommended method to record Python objects for persistence or communication (although it can be used for those purposes).
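To make the contrast concrete, here is the round trip pickle is designed for (pickle in Python 3, cPickle in Python 2): the file it loads must have been written by pickle itself, not by hand:

import pickle

with open('data.pkl', 'wb') as f:
    pickle.dump(['some', 'target', 'strings'], f)  # pickle writes the file...

with open('data.pkl', 'rb') as f:
    print(pickle.load(f))  # ...and only then can pickle load it back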

Python - Convert CSV to DBF

I would like to convert a CSV file to DBF using Python (for use in geocoding, which is why I need the DBF file). I can easily do this in Stat/Transfer or other similar programs, but I would like to do it as part of my script rather than having to go to an outside program. There appear to be a lot of questions and answers about converting DBF to CSV, but I am not having any luck the other way around.
An answer using dbfpy is fine; I just haven't had luck figuring out exactly how to do it.
As an example of what I am looking for, here is some code I found online for converting dbf to csv:
import csv, arcgisscripting
from dbfpy import dbf

gp = arcgisscripting.create()
try:
    inFile = gp.GetParameterAsText(0)   # Input
    outFile = gp.GetParameterAsText(1)  # Output
    dbfFile = dbf.Dbf(open(inFile, 'r'))
    csvFile = csv.writer(open(outFile, 'wb'))
    headers = range(len(dbfFile.fieldNames))
    allRows = []
    for row in dbfFile:
        rows = []
        for num in headers:
            rows.append(row[num])
        allRows.append(rows)
    csvFile.writerow(dbfFile.fieldNames)
    for row in allRows:
        print row
        csvFile.writerow(row)
except:
    print gp.getmessage()
It would be great to get something similar for going the other way around.
Thank you!
Duplicate question at: Convert .csv file into .dbf using Python?
A promising answer there (among others) is:
Use the csv library to read your data from the csv file. The third-party dbf library can write a dbf file for you.
For example, you could try:
http://packages.python.org/dbf/
http://code.activestate.com/recipes/362715-dbf-reader-and-writer/
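For the direction the question actually asks about, here is a minimal sketch using the third-party dbf package (the first link above). The file names and field specs are made up, and the open() call has varied slightly between versions of the library:

import csv
import dbf  # the third-party "dbf" package (pip install dbf)

# field specs are illustrative; match them to your CSV columns
table = dbf.Table('addresses.dbf', 'name C(50); street C(80); zip C(10)')
table.open(mode=dbf.READ_WRITE)
with open('addresses.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        table.append(tuple(row))
table.close()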
You could also just open the CSV file in OpenOffice or Excel and save it in dBase format.
I assume you want to create attribute files for the Esri Shapefile format or something like that. Keep in mind that DBF files usually use ancient character encodings like CP 850. This may be a problem if your geo data contains names in foreign languages. However, Esri may have specified a different encoding.
EDIT: just noted that you do not want to use external tools.
