How to extract data from text files?

I have a set of files that I need to extract data from and write to a new txt file, and I am not sure how to do this with Python. Below is a sample record. I am trying to extract the NSF Org, File, and Abstract fields.
Title : CRB: Genetic Diversity of Endangered Populations of Mysticete Whales:
Mitochondrial DNA and Historical Demography
Type : Award
NSF Org : DEB
Latest
Amendment
Date : August 1, 1991
File : a9000006
Award Number: 9000006
Award Instr.: Continuing grant
Prgm Manager: Scott Collins
DEB DIVISION OF ENVIRONMENTAL BIOLOGY
BIO DIRECT FOR BIOLOGICAL SCIENCES
Start Date : June 1, 1990
Expires : November 30, 1992 (Estimated)
Expected
Total Amt. : $179720 (Estimated)
Investigator: Stephen R. Palumbi (Principal Investigator current)
Sponsor : U of Hawaii Manoa
2530 Dole Street
Honolulu, HI 968222225 808/956-7800
NSF Program : 1127 SYSTEMATIC & POPULATION BIOLO
Fld Applictn: 0000099 Other Applications NEC
61 Life Science Biological
Program Ref : 9285,
Abstract :
Commercial exploitation over the past two hundred years drove the great
Mysticete whales to near extinction. Variation in the sizes of populations
prior to exploitation, minimal population size during exploitation and
current population sizes permit analyses of the effects of differing levels
of exploitation on species with different biogeographical distributions and
life-history characteristics.

You're not giving me much to go on, but this is what I do to read input from a txt file. This is in Java; hopefully you'll know how to store it in an array of some sort.
import java.util.Scanner;
import java.io.*;

public class ClockAngles {
    public static void main(String[] args) throws IOException {
        Scanner reader = null;
        String input = "";
        try {
            // Replace "FilePath" with the path to the input file
            reader = new Scanner(new BufferedReader(new FileReader("FilePath")));
            // Read the file one whitespace-separated token at a time
            while (reader.hasNext()) {
                input = reader.next();
                System.out.print(input);
            }
        }
        finally {
            // Always release the file handle, even if reading fails
            if (reader != null) {
                reader.close();
            }
        }
    }
}
Python code:
#!/usr/bin/env python2.7
# Change this to the file with the time input
filename = "filetext"
storeData = []

class Whatever:
    def __init__(self, time_str):
        # Split one input line on whatever delimiter your data uses
        times_list = time_str.split('however you want input to be read')
        self.a = int(times_list[0])
        self.b = int(times_list[1])
        self.c = int(times_list[2])

    # prints the data
    def __str__(self):
        return str(self.a) + " " + str(self.b) + " " + str(self.c)
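For the fields the question actually asks about, a minimal Python sketch could look like the following. It assumes every record has the layout shown in the sample: one "Name : value" header per line, with the abstract running from the "Abstract :" line to the end of the record. The function name `extract_fields` and the shortened `sample` record are made up for illustration.

```python
import re


def extract_fields(text):
    """Return the NSF Org, File, and Abstract fields of one award record."""
    fields = {}
    for name in ("NSF Org", "File"):
        # Header lines look like "NSF Org : DEB"; capture the rest of the line.
        pattern = r"^[ \t]*" + re.escape(name) + r"[ \t]*:[ \t]*(.*)$"
        match = re.search(pattern, text, re.MULTILINE)
        fields[name] = match.group(1).strip() if match else None

    # Assumption: the abstract is everything after the "Abstract :" line.
    match = re.search(r"^[ \t]*Abstract[ \t]*:(.*)", text, re.MULTILINE | re.DOTALL)
    # Collapse the hard line breaks into single spaces.
    fields["Abstract"] = " ".join(match.group(1).split()) if match else None
    return fields


# Shortened, hypothetical record in the same layout as the sample data.
sample = """Title : CRB: Genetic Diversity of Endangered Populations
Type : Award
NSF Org : DEB
File : a9000006
Abstract :
Commercial exploitation over the past two hundred years drove the great
Mysticete whales to near extinction.
"""

result = extract_fields(sample)
for key, value in result.items():
    print(key + ": " + str(value))
```

To write the output to a new txt file, you could loop over your input files, call `extract_fields` on the contents of each one, and write the three values to a file opened with `open(..., "w")`.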

Related

Number of format specifications in 'msgid' and 'msgstr' does not match msgfmt: found 1 fatal error

Django-Admin generates the following error: Execution of msgfmt failed: /home/djuka/project_app/locale/eng/LC_MESSAGES/django.po:49: number of format specifications in 'msgid' and 'msgstr' does not match
msgfmt: found 1 fatal error
CommandError: compilemessages generated one or more errors.
I have the same number of strings in msgid and msgstr and I still get the error. Here is the code.
#: reBankMini/templates/reBankMiniApp/index.html:135
#, python-format
msgid ""
"Leta 2020 je po svetu nastalo rekordnih 54 milijonov ton elektronskih "
"odpadkov, kar je 21 odstotkov več kot v zadnjih petih letih. E-odpadki so "
"najhitreje rastoči gospodinjski odpadki na svetu, ki jih povzročajo predvsem "
"višje stopnje porabe električne in elektronske opreme, kratki življenjski "
"cikli in malo možnosti za popravila. Le 17,4%% zavrženih e-odpadkov leta "
"2020 je bilo recikliranih. Pri procesu recikliranju pride do številnih "
"okolju nevarnih reakcij. Mi recikliramo drugače. Prizadevamo si za varne, "
"trajnostne in popravljive izdelke, s katerimi bo vsak uporabnik aktivno "
"vključen v reševanje okoljskih problemov."
msgstr ""
"In 2020, a record 54 million tons of electronic "
"waste was generated worldwide, which is 21 percent more than in the last five years. E-waste is "
"the fastest growing household waste in the world, caused mainly by "
"higher levels of consumption of electrical and electronic equipment, short life "
"cycles and little opportunity for repair. Only 17.4% of e-waste discarded in "
"2020 was recycled. Many environmentally hazardous reactions occur "
"during the recycling process. We recycle differently. We strive for safe, "
"sustainable and repairable products that will actively involve every user "
"in solving environmental problems."
Here is the index.html file
<div class="rebank-desc">
<h3>
{% trans 'Leta 2020 je po svetu nastalo rekordnih 54 milijonov ton elektronskih odpadkov, kar je 21 odstotkov več kot v zadnjih petih letih. E-odpadki so najhitreje rastoči gospodinjski odpadki na svetu, ki jih povzročajo predvsem višje stopnje porabe električne in elektronske opreme, kratki življenjski cikli in malo možnosti za popravila. Le 17,4% zavrženih e-odpadkov leta 2020 je bilo recikliranih. Pri procesu recikliranju pride do številnih okolju nevarnih reakcij. Mi recikliramo drugače. Prizadevamo si za varne, trajnostne in popravljive izdelke, s katerimi bo vsak uporabnik aktivno vključen v reševanje okoljskih problemov.' %}
</h3>
<h3>
{% trans 'Predstavljamo vam RebankMini, prvi power bank (prenosna baterija) na svetu, ki ga poganjajo reciklirane baterijske celice. rEbankMini je v celoti izdelan v Sloveniji in s tem prvi power bank izdelan v Evropi. Sestavljen je iz prestižnega slovenskega lesa, hrasta in oreha. Z nakupom enega rEbankMini-ja nam pomagate zmanjšati izpust toplogrednih plinov za 14,6 kg (CO2eq), ki bi nastali pri proizvodnji izdelka, podobnega našemu.' %}
</h3>
</div>

I'm not exactly sure what you are trying to do, but I would make a Msg class in your app's models.py file. Also, are you trying to make the msgid a translated version of the English text? If so, a Msg class in models.py would give you a pk or id that you can use in place of the translated version. The code would look something like this:
from django.contrib.auth.models import User  # assumes the default User model
from django.db import models
from django.utils import timezone

class Msg(models.Model):
    msgContents = models.TextField()
    sender = models.ForeignKey(User, on_delete=models.CASCADE)
    date_sent = models.DateTimeField(default=timezone.now)

    def __str__(self):
        return '<What you want the Django admin page to use as the title>' # Example: return f'{self.sender} - {self.pk}'
I'm not exactly sure if this will help, but it's commonly used code for a message / posting system.
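For what it's worth, the msgfmt error is probably about percent signs rather than about the number of strings. The entry is flagged python-format, so the `%%` in the msgid is an escaped literal percent sign, while the single `%` in "17.4% of" in the msgstr is read as the start of a format directive (`% o`). That gives zero format specifications on one side and one on the other, and writing "17.4%%" in the msgstr should make them match. The sketch below shows the same rule with Python's own %-formatting; the strings are shortened stand-ins for the real entry and `formats_cleanly` is a made-up helper.

```python
def formats_cleanly(template):
    """Return True if the %-style template can be formatted with no arguments."""
    try:
        template % ()
        return True
    except (TypeError, ValueError):
        return False


msgid_like = "Le 17,4%% zavrženih e-odpadkov"  # %% is an escaped percent sign
msgstr_bad = "Only 17.4% of e-waste"           # "% o" parses as a format directive
msgstr_ok = "Only 17.4%% of e-waste"           # escaped, like the msgid

for text in (msgid_like, msgstr_bad, msgstr_ok):
    print(repr(text), formats_cleanly(text))
```

Only the unescaped string fails, because Python expects an argument for the `% o` directive.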

bibtexparser - pyparsing.ParseException: Expected end of text

I'm using bibtexparser to parse a bibtex file.
import bibtexparser
with open('MetaGJK12842.bib','r') as bibfile:
    bibdata = bibtexparser.load(bibfile)
While parsing I get the error message:
Could not parse properly, starting at
#article{Frenn:EvidenceBasedNursing:1999,
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/pyparsing.py", line 3183, in parseImpl
raise ParseException(instring, loc, self.errmsg, self)
pyparsing.ParseException: Expected end of text (at char 5773750),
(line:47478, col:1)
The line refers to the following bibtex entry:
#article{Frenn:EvidenceBasedNursing:1999,
author = {Frenn, M.},
title = {A Mediterranean type diet reduced all cause and cardiac mortality after a first myocardial infarction [commentary on de Lorgeril M, Salen P, Martin JL, et al. Mediterranean dietary pattern in a randomized trial: prolonged survival and possible reduced cancer rate. ARCH INTERN MED 1998;158:1181-7]},
journal = {Evidence Based Nursing},
uuid = {15A66A61-0343-475A-8700-F311B08BB2BC},
volume = {2},
number = {2},
pages = {48-48},
address = {College of Nursing, Marquette University, Milwaukee, WI},
year = {1999},
ISSN = {1367-6539},
url = {},
keywords = {Treatment Outcomes;Mediterranean Diet;Mortality;France;Neoplasms -- Prevention and Control;Phase One Excluded - No Assessment of Vegetable as DV;Female;Phase One - Reviewed by Hao;Myocardial Infarction -- Diet Therapy;Diet, Fat-Restricted;Phase One Excluded - No Fruit or Vegetable Study;Phase One Excluded - No Assessment of Fruit as DV;Male;Clinical Trials},
tags = {Phase One Excluded - No Assessment of Vegetable as DV;Phase One Excluded - No Fruit or Vegetable Study;Phase One - Reviewed by Hao;Phase One Excluded - No Assessment of Fruit as DV},
accession_num = {2000008864. Language: English. Entry Date: 20000201. Revision Date: 20130524. Publication Type: journal article},
remote_database_name = {rzh},
source_app = {EndNote},
EndNote_reference_number = {4413},
Secondary_title = {Evidence Based Nursing},
Citation_identifier = {Frenn 1999a},
remote_database_provider = {EBSCOhost},
publicationStatus = {Unknown},
abstract = {Question: text.},
notes = {(0) abstract; commentary. Journal Subset: Core Nursing; Europe; Nursing; Peer Reviewed; UK \& Ireland. No. of Refs: 1 ref. NLM UID: 9815947.}
}
What is wrong with this entry?

It seems that the issue has been addressed and resolved in the project repository (see Issue 147).
Until the next release, installing the library from the git repository can serve as a temporary fix.
pip install --upgrade git+https://github.com/sciunto-org/python-bibtexparser.git#master

I had this same error and found an entry near the line mentioned in the error that had a line like this:
...
year = {1959},
month =
}
When I removed the null month item it parsed for me.
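In a file with tens of thousands of lines, hunting for such an entry by hand is tedious. A rough sketch for locating fields with nothing after the "=" is shown below. It assumes one field per line, as in the entry above, and the names `EMPTY_FIELD_RE` and `find_empty_fields` are made up for illustration.

```python
import re

# Matches a field line with no value, e.g. "month =" or "month = ,"
EMPTY_FIELD_RE = re.compile(r"^\s*([A-Za-z_][\w-]*)\s*=\s*,?\s*$")


def find_empty_fields(lines):
    """Return (line_number, field_name) pairs for fields that have no value."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = EMPTY_FIELD_RE.match(line)
        if match:
            hits.append((number, match.group(1)))
    return hits


entry = [
    "@article{key,\n",
    "year = {1959},\n",
    "month =\n",
    "}\n",
]
print(find_empty_fields(entry))
```

An open file object can be passed in place of the list, since iterating over a file also yields lines.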

Suggesting similar sentences

I am trying to create a sentence auto-complete model that will suggest similar sentences.
Problem: I have a corpus of more than 20000 sentences. I want to create a program that suggests similar sentences to a user as the user types.
For example:
user: wh
suggestions: [{'what is your name?'},{'what is your profession?'},{'what do you want?'}, {'where are you?'}]
user: what is your
suggestions: [{'what is your name?'},{'what is your profession?'}]
Note:
The ordering of words is important, i.e. the prefix of the suggested sentence and the user input should be the same.
The suggested sentences must come from the available text corpus.
My approach:
So far I have only come up with a solution that uses a trie data structure to store every sentence in the corpus.
I want to know if there are any machine learning techniques for sentence suggestion that also take the sentence prefix into account.
I would really appreciate it if anyone could point me in the right direction.
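For reference, the trie approach described above can be sketched in a few lines of Python. This is only an illustration: it uses a character-level trie, lowercases the text for matching, and the class name `SentenceTrie` and its methods are made up.

```python
class SentenceTrie:
    """Character-level trie that stores whole sentences."""

    def __init__(self):
        self.children = {}   # next character -> child node
        self.sentences = []  # sentences that end exactly at this node

    def insert(self, sentence):
        node = self
        for ch in sentence.lower():
            node = node.children.setdefault(ch, SentenceTrie())
        node.sentences.append(sentence)

    def suggest(self, prefix, limit=10):
        """Return up to `limit` stored sentences that start with `prefix`."""
        node = self
        for ch in prefix.lower():
            node = node.children.get(ch)
            if node is None:
                return []
        # Walk the subtree below the prefix and collect complete sentences.
        results = []
        stack = [node]
        while stack and len(results) < limit:
            current = stack.pop()
            results.extend(current.sentences)
            stack.extend(current.children.values())
        return results[:limit]


trie = SentenceTrie()
for s in ["what is your name?", "what is your profession?",
          "what do you want?", "where are you?"]:
    trie.insert(s)

print(sorted(trie.suggest("what is your")))
```

The results come back in traversal order, so sort or rank them (for example by how often each sentence was chosen) before showing them to the user.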

Text generation is a common application of RNNs. Given a sentence prefix the neural network can be trained to predict the most probable next words.
A very interesting article written by Andrej Karpathy can be found here along with the corresponding github repo.
Another popular method utilizes Markov Chains for text generation (for example, see here).
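To give a feel for the Markov chain idea, here is a minimal word-level sketch. It is not taken from either linked resource, and the function names `build_chain` and `generate` are made up for illustration.

```python
import random
from collections import defaultdict


def build_chain(sentences):
    """Map each word to the list of words that follow it in the corpus."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain


def generate(chain, start, max_words=10, rng=random):
    """Extend `start` by repeatedly sampling a recorded follower word."""
    words = [start]
    while len(words) < max_words and words[-1] in chain:
        words.append(rng.choice(chain[words[-1]]))
    return " ".join(words)


corpus = ["what is your name", "what is your profession"]
chain = build_chain(corpus)
print(generate(chain, "what"))
```

Note that a generator like this can produce word sequences that never occur in the corpus, so it does not by itself satisfy the requirement that suggestions come from the available sentences.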

If you want to use Lucene relevance scoring, MoreLikeThis can find similar sentences. Alternatively, you can apply cosine similarity for the same purpose. Hope this helps.
public static void main(String[] args) throws IOException {
    Main m = new Main();
    m.init();
    m.writerEntries();
    m.findSilimar("doduck prototype");
}

private Directory indexDir;
private StandardAnalyzer analyzer;
private IndexWriterConfig config;

public void init() throws IOException{
    analyzer = new StandardAnalyzer(Version.LUCENE_42);
    config = new IndexWriterConfig(Version.LUCENE_42, analyzer);
    config.setOpenMode(OpenMode.CREATE_OR_APPEND);
    indexDir = new RAMDirectory(); //do not write on disk
}

public void writerEntries() throws IOException{
    IndexWriter indexWriter = new IndexWriter(indexDir, config);
    indexWriter.commit();
    Document doc1 = createDocument("1","doduck","prototype your idea");
    Document doc2 = createDocument("2","doduck","love programming");
    Document doc3 = createDocument("3","We do", "prototype");
    Document doc4 = createDocument("4","We love", "challange");
    indexWriter.addDocument(doc1);
    indexWriter.addDocument(doc2);
    indexWriter.addDocument(doc3);
    indexWriter.addDocument(doc4);
    indexWriter.commit();
    indexWriter.forceMerge(100, true);
    indexWriter.close();
}

private Document createDocument(String id, String title, String content) {
    FieldType type = new FieldType();
    type.setIndexed(true);
    type.setStored(true);
    type.setStoreTermVectors(true); //TermVectors are needed for MoreLikeThis
    Document doc = new Document();
    doc.add(new StringField("id", id, Store.YES));
    doc.add(new Field("title", title, type));
    doc.add(new Field("content", content, type));
    return doc;
}

private void findSilimar(String searchForSimilar) throws IOException {
    IndexReader reader = DirectoryReader.open(indexDir);
    IndexSearcher indexSearcher = new IndexSearcher(reader);
    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setMinTermFreq(0);
    mlt.setMinDocFreq(0);
    mlt.setFieldNames(new String[]{"title", "content"});
    mlt.setAnalyzer(analyzer);
    Reader sReader = new StringReader(searchForSimilar);
    Query query = mlt.like(sReader, null);
    TopDocs topDocs = indexSearcher.search(query,10);
    for ( ScoreDoc scoreDoc : topDocs.scoreDocs ) {
        Document aSimilar = indexSearcher.doc( scoreDoc.doc );
        String similarTitle = aSimilar.get("title");
        String similarContent = aSimilar.get("content");
        System.out.println("====similar finded====");
        System.out.println("title: "+ similarTitle);
        System.out.println("content: "+ similarContent);
    }
}

Python convert C header file to dict

I have a C header file which contains a series of classes, and I'm trying to write a function which will take those classes, and convert them to a python dict. A sample of the file is down the bottom.
Format would be something like
class CFGFunctions {
class ABC {
class AA {
file = "abc/aa/functions"
class myFuncName{ recompile = 1; };
};
class BB
{
file = "abc/bb/functions"
class funcName{
recompile=1;
}
}
};
};
I'm hoping to turn it into something like
{CFGFunctions:{ABC:{AA:"myFuncName"}, BB:...}}
# Or
{CFGFunctions:{ABC:{AA:{myFuncName:"string or list or something"}, BB:...}}}
In the end, I'm aiming to get the filepath string (which is actually a path to a folder... but anyway), and the class names in the same class as the file/folder path.
I've had a look on SO, and Google and so on, but most things I've found have been about splitting lines into dicts, rather than n-deep 'blocks'.
I know I'll have to loop through the file, however, I'm not sure the most efficient way to convert it to the dict.
I'm thinking I'd need to grab the outside class and its relevant brackets, then do the same for the text remaining inside.
If none of that makes sense, it's cause I haven't quite made sense of the process myself haha
If any more info is needed, I'm happy to provide.
The following code is a quick mockup of what I'm sort of thinking...
It is most likely BROKEN and probably does NOT WORK, but it's sort of the process that I'm thinking of.
import re

def get_data():
    fh = open('CFGFunctions.h', 'r')
    data = {} # will contain final data model
    # would probably refactor some of this into a function to allow better looping
    start = "" # starting class name
    brackets = 0 # number of brackets
    text = "" # temp storage for lines inside block while looping
    for line in fh:
        # find the class (start)
        mt = re.match(r'class ([\w_]+) {', line)
        if mt:
            if start == "":
                start = mt.group(1)
            else:
                # once we have the first class, find all other open brackets
                mt = re.match(r'{', line)
                if mt:
                    # and inc our counter
                    brackets += 1
                mt2 = re.match(r'}', line)
                if mt2:
                    # find the close, and decrement
                    brackets -= 1
                    # if we are back to the initial block, break out of the loop
                    if brackets == 0:
                        break
                text += line
    data[start] = {'tempText': text}
    return data
====
Sample file
class CfgFunctions {
class ABC {
class Control {
file = "abc\abc_sys_1\Modules\functions";
class assignTracker {
description = "";
recompile = 1;
};
class modulePlaceMarker {
description = "";
recompile = 1;
};
};
class Devices
{
file = "abc\abc_sys_1\devices\functions";
class registerDevice { recompile = 1; };
class getDeviceSettings { recompile = 1; };
class openDevice { recompile = 1; };
};
};
};
EDIT:
If possible, if I have to use a package, I'd like to have it in the programs directory, not the general python libs directory.

As you detected, parsing is necessary to do the conversion. Have a look at the package PyParsing, which is a fairly easy-to-use library to implement parsing in your Python program.
Edit: This is a minimal sketch of what it takes to recognize a very small grammar, somewhat like the example at the top of the question. It covers only that subset (nested class blocks and simple assignments, with optional semicolons), but it might put you in the right direction:
from pyparsing import (Forward, Group, Keyword, Optional, ParseException,
                       Suppress, Word, ZeroOrMore, alphanums, alphas,
                       quotedString)

test_code = """
class CFGFunctions {
    class ABC {
        class AA {
            file = "abc/aa/functions"
            class myFuncName{ recompile = 1; };
        };
        class BB
        {
            file = "abc/bb/functions"
            class funcName{
                recompile=1;
            }
        }
    };
};
"""

class_tkn = Keyword('class')
identifier = Word(alphas + '_', alphanums + '_')
lbrace_tkn = Suppress('{')
rbrace_tkn = Suppress('}')
semicolon_tkn = Suppress(';')
assign_tkn = Suppress('=')
value = quotedString | Word(alphanums)
# name = value, with an optional trailing semicolon
assignment = Group(identifier + assign_tkn + value + Optional(semicolon_tkn))
# Forward declaration, because a class block can contain class blocks
class_block = Forward()
class_block <<= Group(
    class_tkn + identifier + lbrace_tkn
    + ZeroOrMore(class_block | assignment)
    + rbrace_tkn + Optional(semicolon_tkn)
)

def test_parser(test):
    try:
        results = class_block.parseString(test)
        print(test, ' -> ', results.asList())
    except ParseException as s:
        print("Syntax error:", s)

def main():
    test_parser(test_code)
    return 0

if __name__ == '__main__':
    main()
Also, this code is only the parser: it does not build the dict you are after. As you can see in the PyParsing docs, you can later add the actions you want. But the first step would be to recognize what you want to translate.
And a last note: Do not underestimate the complexities of parsing code... Even with a library like PyParsing, which takes care of much of the work, there are many ways to get mired in infinite loops and other amenities of parsing. Implement things step-by-step!
EDIT: A few sources for information on PyParsing are:
http://werc.engr.uaf.edu/~ken/doc/python-pyparsing/HowToUsePyparsing.html
http://pyparsing.wikispaces.com/
(Particularly interesting is http://pyparsing.wikispaces.com/Publications, with a long list of articles - several of them introductory - on PyParsing)
http://pypi.python.org/pypi/pyparsing_helper is a GUI for debugging parsers
There is also a 'pyparsing' tag here on Stack Overflow, where Paul McGuire (the PyParsing author) seems to be a frequent guest.
* NOTE: *
From PaulMcG in the comments below: Pyparsing is no longer hosted on wikispaces.com. Go to github.com/pyparsing/pyparsing
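Since the question's edit asks to avoid installing a package if possible, here is a dependency-free sketch of the same idea using only the standard library. It is a rough illustration rather than a full parser for this file format: it assumes the input contains nothing but nested class blocks and simple `name = value` assignments, it keeps all values as strings, and the names `TOKEN_RE` and `parse_classes` are made up.

```python
import re

# Tokens: quoted strings, identifiers, integers, and the punctuation { } = ;
TOKEN_RE = re.compile(r'"[^"]*"|[A-Za-z_]\w*|\d+|[{}=;]')


def parse_classes(text):
    """Convert nested 'class Name { ... }' blocks into nested dicts."""
    tokens = TOKEN_RE.findall(text)
    pos = 0

    def parse_body():
        nonlocal pos
        body = {}
        while pos < len(tokens) and tokens[pos] != '}':
            if tokens[pos] == 'class':
                name = tokens[pos + 1]
                if tokens[pos + 2] != '{':
                    raise ValueError("expected '{' after class " + name)
                pos += 3
                body[name] = parse_body()  # recurse into the nested block
                pos += 1                   # skip the closing '}'
            elif tokens[pos] == ';':
                pos += 1                   # semicolons are optional separators
            else:
                key, equals, value = tokens[pos:pos + 3]
                if equals != '=':
                    raise ValueError("expected '=' after " + key)
                body[key] = value.strip('"')
                pos += 3
        return body

    return parse_body()


sample = '''
class CFGFunctions {
    class ABC {
        class AA {
            file = "abc/aa/functions"
            class myFuncName{ recompile = 1; };
        };
    };
};
'''
parsed = parse_classes(sample)
print(parsed)
```

In the resulting dict, the `file` key sits next to the function-name keys of the same class, which is the grouping the question asks for.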

Script like google suggest in python

I am writing a script that works like Google Suggest. The problem is that I am trying to get suggestions for the two most likely next words.
The example uses a txt file, working_bee.txt. When I type "mis" I should get suggestions like "Miss Mary, Miss Taylor, ...", but I only get "Miss, ...". I suspect the Ajax responseText property gives only a single word?
Any ideas what is wrong?
# Something that looks like Google suggest
def count_words(xFile):
    frequency = {}
    words = []
    for l in open(xFile, "rt"):
        l = l.strip().lower()
        for r in [',', '.', "'", '"', "!", "?", ":", ";"]:
            l = l.replace(r, " ")
        words += l.split()
    # count every pair of adjacent words
    for i in range(len(words)-1):
        frequency[words[i]+" "+words[i+1]] = frequency.get(words[i]+" "+words[i+1], 0) + 1
    return frequency

# read valid words from file
ws = count_words("c:/mod_python/working_bee.txt").keys()

def index(req):
    req.content_type = "text/html"
    return '''
<script>
function complete(q) {
    var xhr, ws, e
    e = document.getElementById("suggestions")
    if (q.length == 0) {
        e.innerHTML = ''
        return
    }
    xhr = new XMLHttpRequest()
    xhr.open('GET', 'suggest_from_file.py/complete?q=' + q, true)
    xhr.onreadystatechange = function() {
        if (xhr.readyState == 4) {
            ws = eval(xhr.responseText)
            e.innerHTML = ""
            for (i = 0; i < ws.length; i++)
                e.innerHTML += ws[i] + "<br>"
        }
    }
    xhr.send(null)
}
</script>
<input type="text" onkeyup="complete(this.value)">
<div id="suggestions"></div>
'''

def complete(req, q):
    req.content_type = "text"
    return [w for w in ws if w.startswith(q)]
txt file:
IV. Miss Taylor's Working Bee
"So you must. Well, then, here goes!" Mr. Dyce swung her up to his shoulder and went, two steps at a time, in through the crowd of girls, so that he arrived there first when the door was opened. There in the hall stood Miss Mary Taylor, as pretty as a pink.
"I heard there was to be a bee here this afternoon, and I've brought Phronsie; that's my welcome," he announced.
"See, I've got a bag," announced Phronsie from her perch, and holding it forth.
So the bag was admired, and the girls trooped in, going up into Miss Mary's pretty room to take off their things. And presently the big library, with the music-room adjoining, was filled with the gay young people, and the bustle and chatter began at once.
"I should think you'd be driven wild by them all wanting you at the same minute." Mr. Dyce, having that desire at this identical time, naturally felt a bit impatient, as Miss Mary went about inspecting the work, helping to pick out a stitch here and to set a new one there, admiring everyone's special bit of prettiness, and tossing a smile and a gay word in every chance moment between.
"Oh, no," said Miss Mary, with a little laugh, "they're most of them my Sunday- school scholars, you know."

Looking at your code, I believe you are not sending the correct thing to Apache. You are returning a list, and Apache expects a string. I would suggest serializing the return value as JSON:
import json

def complete(req, q):
    req.content_type = "text"
    return json.dumps([w for w in ws if w.startswith(q)])
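To try the idea outside of mod_python, here is a self-contained sketch of the same bigram lookup that returns a JSON string. It is an illustration under a few assumptions: the text is passed in as a string instead of being read from working_bee.txt, suggestions are ranked by frequency (which the original code does not do), and the function signatures are made up.

```python
import json
import re
from collections import Counter


def count_bigrams(text):
    """Count every pair of adjacent lowercase words in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(" ".join(pair) for pair in zip(words, words[1:]))


def complete(bigrams, q, limit=5):
    """Return a JSON array of the most frequent bigrams that start with q."""
    q = q.lower()
    matches = [(count, phrase) for phrase, count in bigrams.items()
               if phrase.startswith(q)]
    # Most frequent first; ties are broken alphabetically.
    matches.sort(key=lambda item: (-item[0], item[1]))
    return json.dumps([phrase for _, phrase in matches[:limit]])


text = "Miss Mary Taylor. Miss Mary went. Miss Taylor's Working Bee"
bigrams = count_bigrams(text)
print(complete(bigrams, "mis"))
```

On the browser side, a JSON string like this can be decoded with `JSON.parse` instead of `eval`.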
