I am new to Prodigy and haven't fully figured out the paradigm.
For a project, I would like to manually annotate names from texts. My team has developed our own model to recognize the names, so I only want to use the annotated texts (produced with Prodigy) as a golden standard for our model.
To do so, I have a csv file texts.csv with the text in one of the columns. Do I need to convert this file into a json, or can I also run Prodigy on the csv file?
Also, what is the code that I need to run to start the ner_manual with this dataset?
I suppose, I have to start with:
!python -m prodigy ner.manual
However, it is unclear to me how I should run the rest. Can someone help me with this?
File Format
I believe for the recipes that say "Text Source" you can use jsonl, json, csv, or txt (reference the section that says "Text Source": https://prodi.gy/docs/api-loaders). Ner.manual says "Text Source" so I think it should work. (reference: https://prodi.gy/docs/recipes#ner-manual)
ner.manual
In regards to running ner.manual try taking a look at this documentation https://prodi.gy/docs/
The documentation contains a good example:
python -m prodigy ner.manual ner_news_headlines blank:en ./news_headlines.jsonl --label PERSON,ORG,PRODUCT,LOCATION
ner_news_headlines is the name of the dataset (it could be named anything)
blank:en is a blank english model
./news_headlines.jsonl is the name of the jsonl file that you will be annotating (use whatever file name your file is)
PERSON,ORG,PRODUCT,LOCATION are the labels that you will annotate your data with (change these to whatever labels you want to use, be sure to separate with commas not spaces)
I'm also pretty new to prodigy so someone else may have a better answer.
Related
I'm Trying to convert .txt data file to STDF (ATE Standard Test Data Format, commonly used in semiconductor tests) file.
Is there any way to do that?
Are there any libraries in Python which would help in cases like this?
Thanks!
You can try Semi-ATE STDF library:
It supports only ver. 4. You can use conda-forge or pypi to install it.
It is of course possible since Python is Turing complete. However, you should use one of the available open source or commercial libraries to handle the STDF writing if you are not familiar with STDF. Even one mis-placed byte in the binary output will wreck your file.
It is impossible to say whether an existing tool can do this for you because a text file can have anything in it. Your text file will need to adhere to the tool's expectations of where the necessary header data (lot id, program name, etc.), test names and numbers, part identifiers, test results and so on will be in the text file.
I'm looking to generate PDF's from a Python application.
They start relatively simple but some may become more complex (Essentially letter like documents but will include watermarks for example later)
I've worked in raw postscript before and providing I can generate the correct headers etc and file at the end of it I want to avoid use of complex libs that may not do entirely what I want. Some seem to have got bitrot and no longer supported (pypdf and pypdf2) Especially when I know PDF/Postscript can do exactly what I need. PDF content really isn't that complex.
I can generate EPS (Encapsulated postscript) fine by just writing the appropriate text headers to file and my postscript code. But Inspecting PDF's there is a lil binary header I'm not sure how to generate.
I could generate an EPS and convert it. I'm not overly happy with this as the production environment is a Windows 2008 server (Dev is Ubuntu 12.04) and making something and converting it seems very silly.
Has anyone done this before?
Am I being pedantic by not wanting to use a library?
borrowed from ask.yahoo
A PDF file starts with "%PDF-1.1" if it is a version 1.1 type of PDF file. You can read PDF files ok when they don't have binary data objects stored in them, and you could even make one using Notepad if you didn't need to store a binary object like a Paint bitmap in it.
But after seeing the "%PDF-1.1" you ignore what's after that (Adobe Reader does, too) and go straight to the end of the file to where there is a line that says "%%EOF". That's always the last thing in the file; and if that's there you know that just a few characters before that place in the file there's the word "startxref" followed by a number. This number tells a reader program where to look in the file to find the start of the list of items describing the structure of the file. These items in the list can be page objects, dictionary objects, or stream objects (like the binary data of a bitmap), and each one has "obj" and "endobj" marking out where its description starts and ends.
For fairly simple PDF files, you might be able to type the text in just like you did with Notepad to make a working PDF file that Adobe Reader and other PDF viewer programs could read and display correctly.
Doing something like this is a challenge, even for a simple file, and you'd really have to know what you're doing to get any binary data into the file where it's supposed to go; but for character data, you'd just be able to type it in. And all of the commands used in the PDF are in the form of strings that you could type in. The hardest part is calculating those numbers that give the file offsets for items in the file (such as the number following "startxref").
If the way the file format is laid out intrigues you, go ahead and read the PDF manual, which tells the whole story.
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
but really you should probably just use a library
Thanks to #LukasGraf for providing this link http://www.gnupdf.org/Introduction_to_PDF that shows how to create a simple hello world pdf from scratch
As long as you're working in Python 2.7, Reportlab seems to be the best solution out there at the moment. It's quite full-featured, and can be a little complex to work with, depending on exactly what you're doing with it, but since you seem to be familiar with PDF internals in general hopefully the learning curve won't be too steep.
I recommend you to use a library. I spent a lot of time creating pdfme and learned a lot of things along the way, but it's not something you would do for a single project. If you want to use my library check the docs here.
I am interested in gleaning information from an ESRI .shp file.
Specifically the .shp file of a polyline feature class.
When I open the .dbf of a feature class, I get what I would expect: a table that can open in excel and contains the information from the feature class' table.
However, when I try to open a .shp file in any program (excel, textpad, etc...) all I get is a bunch of gibberish and unusual ASCII characters.
I would like to use Python (2.x) to interpret this file and get information out of it (in this case the vertices of the polyline).
I do not want to use any modules or non built-in tools, as I am genuinely interested in how this process would work and I don't want any dependencies.
Thank you for any hints or points in the right direction you can give!
Your question, basically, is "I have a file full of data stored in an arbitrary binary format. How can I use python to read such a file?"
The answer is, this link contains a description of the format of the file. Write a dissector based on the technical specification.
If you don't want to go to all the trouble of writing a parser, you should take look at pyshp, a pure Python shapefile library. I've been using it for a couple of months now, and have found it quite easy to use.
There's also a python binding to shapelib, if you search the web. But I found the pure Python solution easier to hack around with.
might be a long shot, but you should check out ctypes, and maybe use the .dll file that came with a program (if it even exists lol) that can read that type of file. in my experience, things get weird when u start digging around .dlls
How can I extract the tables, text and the pictures in an ODT(OpenDocumentText) file to output them to another ODT file using Python on Ubuntu?
OOoPy seems to be a good fit. I've never used it, but it comes with documentation and code examples, and it can read and write ODT files.
An easy way is to just rename the foo.odt to foo.zip and then extract it. the extracted directory contains many files including Pictures.
However I think it's better to change it's type to docx and then do the process on docx (extract it). Because it extract images with better name (image1, image2, ...).
I am working on bio project.
I have .pdb (protein data bank) file which contains information about the molecule.
I want to find out the following of a molecule in the .pdb file:
Molecular Mass.
H bond donor.
H bond acceptor.
LogP.
Refractivity.
Is there any module in python which can deal with .pdb file in finding this?
If not then can anyone please let me know how can I do the same?
I found some modules like sequtils and protienparam but they don't do such things.
I have researched first and then posted, so, please don't down-vote.
Please comment, if you still down-vote as to why you did so.
Thanks in advance.
I don't know if it fits your needs, but Biopython looks like it might help.
PDB file also outputs an XML file PDBML that can be easily parsed using an xml parsing library
http://pdbml.pdb.org/
A pdb file can contain pretty much anything.
A lot of projects allows you to parse them. Some specific to biology and pdb files, other less specific but that will allow you to do more (setup calculations, measure distances, angles, etc.).
I think you got downvoted because these projects are numerous: you are not the only one wanting to do that so the chances that something perfectly fitting your needs exists are really high.
That said, if you just want to parse pdb files for this specific need, just do it yourself:
Open the files with a text editor.
Identify where the relevant data are (keywords, etc.).
Make a Python function that opens the file and look for the keywords.
Extract the figures from the file.
Done.
This can be done with a short script written in less than 10 minutes (other reason why downvoting).