I have created a script which scrapes many PDFs for abstracts and keywords. I also have a collection of BibTeX files in which I want to place the texts I've extracted. What I'm looking for is a way of adding elements to the BibTeX files.
I have written a short parser:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
from pybtex.database.input import bibtex

dir_path = "nime_archive/nime/bibtex/"
num_texts = 0

class Bibfile:
    def __init__(self, bibs):
        global num_texts
        self.bibs = bibs
        for a in self.bibs.entries.keys():
            num_texts += 1
            print bibs.entries[a].fields['title']
            # Need to implement a way of getting just the nime identifier
            try:
                print bibs.entries[a].fields['url']
            except KeyError:
                print "couldn't find URL for text: %s " % a
                print "creating new bibfile"

bibfiles = []
parser = bibtex.Parser()
for infile in os.listdir(dir_path):
    if infile.endswith(".bib"):
        print infile
        bibfiles.append(Bibfile(parser.parse_file(dir_path + infile)))
My question is whether it is possible to use Pybtex to add entries to the existing BibTeX files (or to a copy), so I can merge my extractions with what is already available. If this is not possible in Pybtex, what other BibTeX parser can I use?
I've never used pybtex, but from a quick glance, you can add entries. Since self.bibs.entries appears to be a dict, you can come up with a unique key, and add more entries to it. Example:
key = "some_unique_string"
new_entry = Entry('article',
fields={
'language': u'english',
'title': u'Predicting the Diffusion Coefficient in Supercritical Fluids',
'journal': u'Ind. Eng. Chem. Res.',
'volume': u'36',
'year': u'1997',
'pages': u'888-895',
},
persons={'author': [Person(u'Liu, Hongquin'), Person(u'Ruckenstein, Eli')]},
)
self.bibs.entries[key] = new_entry
(caveat: untested)
If you wonder where I got this example from: have a look in the tests/ subdirectory of the pybtex source. I got the above code example mainly from tests/database_test/data.py. Tests can be a good source of documentation if the actual documentation is lacking.
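If you also want to write the merged data back to disk (the second half of the question), pybtex ships a BibTeX writer as well. A minimal sketch, equally untested, might look like this; merged.bib is just a placeholder file name:

from pybtex.database.output.bibtex import Writer

# Write the in-memory database (including any entries added above)
# to a new file rather than overwriting the original .bib.
Writer().write_file(self.bibs, 'merged.bib')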
.data.add_entry(key, entry) works for me. Here I used a manually created entry (taken from Evert's example), but you can also copy an existing entry from another bib that you're parsing.
from pybtex.database.input.bibtex import Parser
from pybtex.core import Entry, Person

key = "some_unique_string"
new_entry = Entry('article',
    fields={
        'language': u'english',
        'title': u'Predicting the Diffusion Coefficient in Supercritical Fluids',
        'journal': u'Ind. Eng. Chem. Res.',
        'volume': u'36',
        'year': u'1997',
        'pages': u'888-895',
    },
    persons={'author': [Person(u'Liu, Hongquin'), Person(u'Ruckenstein, Eli')]},
)

newbib_parser = Parser()
newbib_parser.data.add_entry(key, new_entry)
print newbib_parser.data
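For the "copy from another bib" case, a minimal sketch might look like the following; source.bib and some_key are placeholder names:

from pybtex.database.input.bibtex import Parser

# Parse the bib file that already contains the entry you want to reuse.
source = Parser().parse_file("source.bib")

# Add that entry, under the same key, to the database being built up.
dest_parser = Parser()
dest_parser.data.add_entry("some_key", source.entries["some_key"])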
Suppose I have a .bib file containing BibTeX-formatted entries. I want to extract the "title" field from an entry, and then format it into a readable Unicode string.
For example, if the entry was:
@article{mypaper,
    author = {myself},
    title = {A very nice {title} with annoying {symbols} like {\^{a}}}
}
what I want to extract is the string:
A very nice title with annoying symbols like â
I am currently trying to use the pybtex package, but I cannot figure out how to do it. The command-line utility pybtex-format does a good job of converting full .bib files, but I need to do this inside a script and for single title entries.
Figured it out:
def load_bib(filename):
    from pybtex.database.input.bibtex import Parser
    parser = Parser()
    DB = parser.parse_file(filename)
    return DB

def get_title(entry):
    from pybtex.plugin import find_plugin
    style = find_plugin('pybtex.style.formatting', 'plain')()
    backend = find_plugin('pybtex.backends', 'plaintext')()
    sentence = style.format_title(entry, 'title')
    data = {'entry': entry,
            'style': style,
            'bib_data': None}
    T = sentence.f(sentence.children, data)
    title = T.render(backend)
    return title

DB = load_bib("bibliography.bib")
print(get_title(DB.entries["entry_label"]))
where entry_label must match the label you use in LaTeX to cite the bibliography entry.
Building upon the answer by Daniele, I wrote this function that lets one render fields without having to use a file.
from io import StringIO
from pybtex.database.input.bibtex import Parser
from pybtex.plugin import find_plugin
def render_fields(author="", title=""):
    """The arguments are in bibtex format. For example, they may contain
    things like \'{i}. The output is a dictionary with these fields
    rendered in plain text.
    If you run tests by defining a string in Python, use r'''string''' to
    avoid issues with escape characters.
    """
    parser = Parser()
    istr = r'''
        @article{foo,
            Author = {''' + author + r'''},
            Title = {''' + title + '''},
        }
    '''
    bib_data = parser.parse_stream(StringIO(istr))
    style = find_plugin('pybtex.style.formatting', 'plain')()
    backend = find_plugin('pybtex.backends', 'plaintext')()
    entry = bib_data.entries["foo"]
    data = {'entry': entry, 'style': style, 'bib_data': None}
    sentence = style.format_author_or_editor(entry)
    T = sentence.f(sentence.children, data)
    rendered_author = T.render(backend)[0:-1]  # exclude period
    sentence = style.format_title(entry, 'title')
    T = sentence.f(sentence.children, data)
    rendered_title = T.render(backend)[0:-1]  # exclude period
    return {'title': rendered_title, 'author': rendered_author}
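For example, a call like the following (using the title from the question above; the exact strings are only for illustration) should return the plain-text renderings:

fields = render_fields(
    author=r"Liu, Hongquin and Ruckenstein, Eli",
    title=r"A very nice {title} with annoying {symbols} like {\^{a}}",
)
print(fields['title'])   # should print something like: A very nice title with annoying symbols like â
print(fields['author'])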
I have a list in Python with Twitter user information and exported it with pandas to an Excel file.
One row is one Twitter user with nearly all of the user's information (name, #-tag, location, etc.).
Here is my code to create the list and fill it with the user data:
def get_usernames(userids, api):
    fullusers = []
    u_count = len(userids)
    try:
        for i in range(int(u_count/100) + 1):
            end_loc = min((i + 1) * 100, u_count)
            fullusers.extend(
                api.lookup_users(user_ids=userids[i * 100:end_loc])
            )
        print('\n' + 'Done! We found ' + str(len(fullusers)) + ' followers in total for this account.' + '\n')
        return fullusers
    except:
        import traceback
        traceback.print_exc()
        print('Something went wrong, quitting...')
The only problem is that every row is a JSON object and therefore one long comma-separated string. I would like to create headers (no problem with pandas) and write only parts of the string (e.g. ID or name) to columns.
Here is an example of a row from my output.xlsx:
User(_api=<tweepy.api.API object at 0x16898928>, _json={'id': 12345, 'id_str': '12345', 'name': 'Jane Doe', 'screen_name': 'jdoe', 'location': 'Nirvana, NI', 'description': 'Just some random description'})
I have two ideas, but I don't know how to realize them due to my lack of skills and experience with Python.
Create a loop which saves certain parts ('id', 'name', etc.) from the JSON string in columns.
Cut off the User(_api=<tweepy.api.API object at 0x16898928>, _json={ at the beginning and ) at the end, so that I can export the file as CSV.
Could anyone help me out with one of my two solutions or suggest a "simple" way to do this?
fyi: I want to do this to gather data for my thesis.
Try the Python json library:
import json

# Note: json.loads needs valid JSON, i.e. double-quoted keys and strings.
jsonstring = '{"id": 12345, "id_str": "12345", "name": "Jane Doe", "screen_name": "jdoe", "location": "Nirvana, NI", "description": "Just some random description"}'
jsondict = json.loads(jsonstring)
# jsondict is now a regular Python dict
Now you can just extract the data you want from it:
user_id = jsondict["id"]
name = jsondict["name"]
newdict = {"id": user_id, "name": name}
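Since the rows come from tweepy User objects, which carry the raw dictionary in their _json attribute (visible in the question's output), a rough sketch like the one below could skip the string handling entirely and build a table with only the wanted columns. It assumes the fullusers list from the question is still in memory:

import pandas as pd

# Keep only the fields you care about from each user's raw dict.
wanted = ['id', 'name', 'screen_name', 'location']
rows = [{key: user._json.get(key) for key in wanted} for user in fullusers]

df = pd.DataFrame(rows)
df.to_excel('output.xlsx', index=False)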
I am currently learning to code in Python, but working with XML files is giving me some trouble. I tried to write an XML file using some data that I filtered from a JSON file.
The XML-file I want to write should look like this:
<?xml version='1.0' encoding='UTF-8'?>
<collection>
    <work>
        <title>Title</title>
        <dimensions>
            <width>Width (cm)</width>
            <height>Height (cm)</height>
        </dimensions>
        <acquisition>
            <number>AccessionNumber</number>
            <year>year of DateAcquired</year>
        </acquisition>
    </work>
    [...]
</collection>
It can be written into one line in the XML, since it doesn't need to be pretty.
My Python code at the moment looks like this:
import xml.etree.ElementTree as ET

root = ET.Element('collection')
tree = ET.ElementTree(root)

for artwork in artworks_filtered_list:
    work = ET.SubElement(root, 'work')
    title = ET.SubElement(work, 'title')
    title.text = artwork['Title']
    dimensions = ET.SubElement(work, 'dimensions')
    if 'Width (cm)' in artwork:
        width = ET.SubElement(dimensions, 'width')
        width.text = str(artwork['Width (cm)'])
    height = ET.SubElement(dimensions, 'height')
    height.text = str(artwork['Height (cm)'])
    acquisition = ET.SubElement(work, 'acquisition')
    number = ET.SubElement(acquisition, 'number')
    number.text = str(artwork['AccessionNumber'])
    year = ET.SubElement(acquisition, 'year')
    year.text = str(artwork['DateAcquired'][:4])

tree.write('example.xml', encoding='UTF-8', xml_declaration=True)
Since width is missing in some artwork data, I needed to check if it exists for each entry. Otherwise I get an error message.
artworks_filtered_list is a list of dictionaries that contains entries for different artworks and looks like this:
artworks_filtered_list = [
    {
        "Title": "Interval",
        "Artist": ["David Hartt"],
        "ConstituentID": [47183],
        "ArtistBio": ["Canadian, born 1967"],
        "Nationality": ["Canadian"],
        "BeginDate": [1967],
        "EndDate": [0],
        "Gender": ["Male"],
        "Date": "2016",
        "Medium": "Aluminum and tempered glass",
        "Dimensions": 'Wall: 102 × 218 × 4" (259.1 × 553.7 × 10.2 cm)',
        "CreditLine": "Fund for the Twenty First Century",
        "AccessionNumber": "1772.2015.5",
        "Classification": "Installation",
        "Department": "Media and Performance Art",
        "DateAcquired": "2015-12-11",
        "Cataloged": "Y",
        "ObjectID": 205745,
        "URL": "http://www.moma.org/collection/works/205745",
        "ThumbnailURL": None,
        "Depth (cm)": 10.16002032,
        "Height (cm)": 259.080518161,
        "Width (cm)": 553.7211074422,
    },
    ...,
]
This is my code right now. It works and creates the XML file as intended, but I feel like there might be more code than needed. Is there a way to get the same result with less repetitive / prettier code? (It should still use ElementTree.)
I would probably use some sort of mapping to reduce the amount of redundant code. Something like this:
#!/usr/bin/python

import datetime
import xml.etree.ElementTree as ET

artworks_filtered_list = [...as in your example...]

# This by itself reduces the amount of code by about 50% :)
def add_text_element(root, tag, text):
    new = ET.SubElement(root, tag)
    new.text = text

# According to your question, acquisition.year should be just the
# year from DateAcquired, so we need a method to extract and return
# the year.
def extract_year(val):
    dt = datetime.datetime.strptime(val, '%Y-%m-%d')
    return dt.year

# This is called once for every work of art in artworks_filtered_list
def append_work(root, work):
    # create the <work> container element
    new = ET.SubElement(root, 'work')

    # create empty dimension and acquisition elements. This adds them
    # to the new <work> element.
    dimensions = ET.SubElement(new, 'dimensions')
    acquisition = ET.SubElement(new, 'acquisition')

    # Add our title
    add_text_element(new, 'title', work['Title'])

    # Now build a map that will link keys from a work of art in
    # artworks_filtered_list to XML elements. Each item in this list is
    # a 4-tuple, where the items are:
    #
    # 1. The parent element to which we will be adding a new element
    # 2. The name of the new element
    # 3. The dictionary key from which we will get the text value
    # 4. A function to transform the value, if necessary (or None)
    #
    attrmap = [
        (dimensions, 'width', 'Width (cm)', None),
        (dimensions, 'height', 'Height (cm)', None),
        (acquisition, 'number', 'AccessionNumber', None),
        (acquisition, 'year', 'DateAcquired', extract_year),
    ]

    # And now use the above map to transform the artwork dictionary
    # into XML elements.
    for parent, tag, key, xform in attrmap:
        if key in work:
            add_text_element(parent, tag,
                             str(xform(work[key]) if xform else work[key]))

def main():
    root = ET.Element('collection')

    for work in artworks_filtered_list:
        append_work(root, work)

    print(ET.tostring(root).decode('utf-8'))

if __name__ == '__main__':
    main()
Given your sample input, the above code produces:
<collection>
    <work>
        <dimensions>
            <width>553.7211074422</width>
            <height>259.080518161</height>
        </dimensions>
        <acquisition>
            <number>1772.2015.5</number>
            <year>2015</year>
        </acquisition>
        <title>Interval</title>
    </work>
</collection>
...although it doesn't actually pretty-print it. If you were to use lxml.etree instead of xml.etree, tostring would be able to pretty-print XML for you.
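For example, a minimal sketch of the lxml variant (assuming lxml is installed; the element-building code stays the same, only the import and the tostring call change):

from lxml import etree as ET

root = ET.Element('collection')
work = ET.SubElement(root, 'work')
ET.SubElement(work, 'title').text = 'Interval'

# lxml's tostring accepts pretty_print, which xml.etree's does not.
print(ET.tostring(root, pretty_print=True).decode('utf-8'))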
I have a method that writes JSON data to a file. The title is based on books and the data is the book's publisher, date, author, etc. The method works fine if I only want to add one book.
Code
import json

def createJson(title, firstName, lastName, date, pageCount, publisher):
    print "\n*** Inside createJson method for " + title + "***\n"

    data = {}
    data[title] = []
    data[title].append({
        'firstName:', firstName,
        'lastName:', lastName,
        'date:', date,
        'pageCount:', pageCount,
        'publisher:', publisher
    })

    with open('data.json', 'a') as outfile:
        json.dump(data, outfile, default=set_default)

def set_default(obj):
    if isinstance(obj, set):
        return list(obj)

if __name__ == '__main__':
    createJson("stephen-king-it", "stephen", "king", "1971", "233", "Viking Press")
JSON File with one book/one method call
{
    "stephen-king-it": [
        ["pageCount:233", "publisher:Viking Press", "firstName:stephen", "date:1971", "lastName:king"]
    ]
}
However, if I call the method multiple times, adding more book data to the JSON file, the format is all wrong. For instance, if I simply call the method twice with a main method of
if __name__ == '__main__':
    createJson("stephen-king-it", "stephen", "king", "1971", "233", "Viking Press")
    createJson("william-golding-lord of the flies", "william", "golding", "1944", "134", "Penguin Books")
My JSON file looks like
{
    "stephen-king-it": [
        ["pageCount:233", "publisher:Viking Press", "firstName:stephen", "date:1971", "lastName:king"]
    ]
} {
    "william-golding-lord of the flies": [
        ["pageCount:134", "publisher:Penguin Books", "firstName:william", "lastName:golding", "date:1944"]
    ]
}
This is obviously wrong. Is there a simple fix to my method so it produces correct JSON? I have looked at many simple examples online of writing JSON data in Python, but all of them gave me format errors when I checked them on JSONLint.com. I have been racking my brain trying to fix this problem and editing the file to make it correct, but all my efforts were to no avail. Any help is appreciated. Thank you very much.
Simply appending new objects to your file doesn't create valid JSON. You need to add your new data inside the top-level object, then rewrite the entire file.
This should work:
def createJson(title, firstName, lastName, date, pageCount, publisher):
    print "\n*** Inside createJson method for " + title + "***\n"

    # Load any existing json data, or start with an empty object
    # if the file is missing, empty, or not yet valid JSON.
    # (IOError/ValueError rather than FileNotFoundError, so this also
    # works under the Python 2 used in the question.)
    try:
        with open('data.json') as infile:
            data = json.load(infile)
    except (IOError, ValueError):
        data = {}
    if not data:
        data = {}

    data[title] = []
    data[title].append({
        'firstName:', firstName,
        'lastName:', lastName,
        'date:', date,
        'pageCount:', pageCount,
        'publisher:', publisher
    })

    with open('data.json', 'w') as outfile:
        json.dump(data, outfile, default=set_default)
A JSON document can either be an array or an object (dictionary). In your case the file contains two top-level objects, one with the key stephen-king-it and another with william-golding-lord of the flies. Either of these on its own would be okay, but the way you combine them is invalid.
Using an array you could do this:
[
    { "stephen-king-it": [] },
    { "william-golding-lord of the flies": [] }
]
Or a dictionary style format (I would recommend this):
{
    "stephen-king-it": [],
    "william-golding-lord of the flies": []
}
Also, the data you are appending is currently a set of strings; it should be formatted as key-value pairs in a dictionary (which would be ideal). You need to change it to this:
data[title].append({
    'firstName': firstName,
    'lastName': lastName,
    'date': date,
    'pageCount': pageCount,
    'publisher': publisher
})
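Putting the two fixes together (rewrite the whole file each time, and store the fields as a dictionary), a rough, untested sketch of the whole method might look like this; it keeps the question's file name and call style:

import json

def createJson(title, firstName, lastName, date, pageCount, publisher):
    # Load whatever is already in the file, or start from an empty object.
    try:
        with open('data.json') as infile:
            data = json.load(infile)
    except (IOError, ValueError):
        data = {}

    # Store the fields as a dictionary instead of a set of strings.
    data.setdefault(title, []).append({
        'firstName': firstName,
        'lastName': lastName,
        'date': date,
        'pageCount': pageCount,
        'publisher': publisher,
    })

    # Rewrite the whole file so it stays one valid JSON object.
    with open('data.json', 'w') as outfile:
        json.dump(data, outfile)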
def replace_acronym():  # function not yet implemented
    #FIND
    for abbr, text in acronyms.items():
        if abbr == acronym_edit.get():
            textadd.insert(0, text)

    #DELETE
    name = acronym_edit.get().upper()
    name.upper()
    r = dict(acronyms)
    del r[name]
    with open('acronym_dict.py', 'w') as outfile:
        outfile.write(str(r))
    outfile.close()  # unnecessary explicit closure since used with...
    message = '{0} {1} {2} \n '.format('Removed', name, 'with its text from the database.')
    display.insert('0.0', message)

    #ADD
    abbr_in = acronym_edit.get()
    text_in = add_expansion.get()
    acronyms[abbr_in] = text_in
    # write amended dictionary
    with open('acronym_dict.py', 'w') as outfile:
        outfile.write(str(acronyms))
    outfile.close()
    message = '{0} {1}:{2}{3}\n '.format('Modified entry', abbr_in, text_in, 'added')
    display.insert('0.0', message)
I am trying to add the functionality of editing my dictionary entries in my tkinter widget. The dictionary is in the format {ACRONYM: text, ACRONYM2: text2...}
What I thought the function would do is find the entry in the dictionary, delete both the acronym and its associated text, and then add whatever the acronym and text have been changed to. What actually happens is, for example, if I have an entry TEST: test and I want to modify it to TEXT: abc, the function returns TEXT: testabc, appending the changed text even though I have (I thought) overwritten the file.
What am I doing wrong?
That's a pretty messy lookin' function. The acronym replacement itself can be done pretty simply:
acronyms = {'SONAR': 'SOund Navigation And Ranging',
            'HTML': 'HyperText Markup Language',
            'CSS': 'Cascading Style Sheets',
            'TEST': 'test',
            'SCUBA': 'Self Contained Underwater Breathing Apparatus',
            'RADAR': 'RAdio Detection And Ranging',
            }

def replace_acronym(a_dict, check_for, replacement_key, replacement_text):
    c = a_dict.get(check_for)
    if c is not None:
        del a_dict[check_for]
        a_dict[replacement_key] = replacement_text
    return a_dict

new_acronyms = replace_acronym(acronyms, 'TEST', 'TEXT', 'abc')
That works perfectly for me (in Python 3). You could just call this from another function that writes the new_acronyms dict to the file, or do whatever else you want with it, since it's no longer tied to being written straight to the file.
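For instance, a small wrapper along the lines of the question's own file-writing code might look like this (the name replace_and_save and the reuse of acronym_dict.py are just illustrative):

def replace_and_save(a_dict, check_for, replacement_key, replacement_text):
    # Swap the entry in memory, then overwrite the whole file,
    # mirroring how the question's code persists the dictionary.
    new_acronyms = replace_acronym(a_dict, check_for, replacement_key, replacement_text)
    with open('acronym_dict.py', 'w') as outfile:
        outfile.write(str(new_acronyms))
    return new_acronyms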