I am trying to load an ARFF file using Python's loadarff function from scipy.io.arff. The file has string attributes, and loading it gives the following error.
>>> data,meta = arff.loadarff(fpath)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data/home/eex608/conda3_envs/PyT3/lib/python3.6/site-packages/scipy/io/arff/arffread.py", line 805, in loadarff
return _loadarff(ofile)
File "/data/home/eex608/conda3_envs/PyT3/lib/python3.6/site-packages/scipy/io/arff/arffread.py", line 838, in _loadarff
raise NotImplementedError("String attributes not supported yet, sorry")
NotImplementedError: String attributes not supported yet, sorry
How can I read the ARFF file successfully?
Since SciPy's loadarff converts the contents of an ARFF file into a NumPy array, it does not support string attributes.
In 2020, you can use the liac-arff package instead.
import arff

# liac-arff handles string attributes; load() takes an open file object
with open('your_document.arff', 'r') as f:
    data = arff.load(f)
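The object liac-arff returns is a plain dict (key names per the liac-arff documentation), so you can inspect the pieces directly:
print(data['relation'])        # the @RELATION name
print(data['attributes'][:3])  # (name, type) pairs from the @ATTRIBUTE lines
print(data['data'][0])         # first data row; string attributes stay Python strings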
However, make sure your ARFF document does not contain inline comments after meaningful text.
That is, avoid input like this:
@ATTRIBUTE class {F,A,L,LF,MN,O,PE,SC,SE,US,FT,PO} %Check and make sure that FT and PO should be there
Delete the comment or move it to its own line.
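For example, with the comment moved onto its own line it is harmless (standalone ARFF comment lines start with %):
@ATTRIBUTE class {F,A,L,LF,MN,O,PE,SC,SE,US,FT,PO}
% Check and make sure that FT and PO should be there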
I had this mistake in one document, and it took some time to figure out what was wrong.
I am trying to read some Canadian census data from Statistics Canada
(the XML option for the "Canada, provinces and territories" geographic level). I see that the XML file is in the SDMX format and that there is a structure file provided, but I cannot figure out how to read the data from the XML file.
It seems there are two options in Python, pandasdmx and sdmx1, both of which say they can read local files. When I try
import sdmx
datafile = '~/Documents/Python/Generic_98-401-X2016059.xml'
canada = sdmx.read_sdmx(datafile)
It appears to read the first 903 lines and then produces the following:
Traceback (most recent call last):
File "/home/username/.local/lib/python3.10/site-packages/sdmx/reader/xml.py", line 238, in read_message
raise NotImplementedError(element.tag, event) from None
NotImplementedError: ('{http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message}GenericData', 'start')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/username/.local/lib/python3.10/site-packages/sdmx/reader/__init__.py", line 126, in read_sdmx
return reader().read_message(obj, **kwargs)
File "/home/username/.local/lib/python3.10/site-packages/sdmx/reader/xml.py", line 259, in read_message
raise XMLParseError from exc
sdmx.exceptions.XMLParseError: NotImplementedError: ('{http://www.SDMX.org/resources/SDMXML/schemas/v2_0/message}GenericData', 'start')
Is this happening because I've not loaded the structure of the sdmx file (Structure_98-401-X2016059.xml in the zip file from the StatsCan link above)? If so, how do I go about loading that and telling sdmx to use that when reading datafile?
The documentation for sdmx and pandasdmx only shows examples of loading files from online providers, not from local files, so I'm stuck. I have limited familiarity with Python, so any help is much appreciated.
For reference, I can read the file in R using the instructions from the rsdmx github. I would like to be able to do the same/similar in Python.
Thanks in advance.
From a cursory inspection of the documentation, it seems that Statistics Canada is not one of the sources included by default. There is, however, an sdmx.add_source function. I suggest you try that (before loading the data).
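A minimal sketch of that approach; the source id and endpoint URL below are placeholders I made up, so you would need the real ones (assuming Statistics Canada exposes an SDMX web service at all):
import sdmx

sdmx.add_source({
    "id": "STATCAN",                    # hypothetical source id
    "name": "Statistics Canada",
    "url": "https://example.org/sdmx",  # placeholder; replace with the real endpoint
})

canada = sdmx.Client("STATCAN")  # then query it like the built-in sources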
As per the sdmx1 developer, StatsCan is using an older, unsupported version of SDMX (v2.0). The current version is 2.1, and sdmx1 only supports that (support is also being worked on for the upcoming v3.0).
I am a beginner in Python and am currently trying to use NLTK to analyze German text (extracting German nouns and their frequencies) by following this tutorial: https://datascience.blog.wzb.eu/2016/07/13/accurate-part-of-speech-tagging-of-german-texts-with-nltk/
I faced several issues during the process and have not been able to solve them.
When I follow the website and execute the code below:
import random
tagged_sents = list(corp.tagged_sents())
random.shuffle(tagged_sents)
split_perc = 0.1
split_size = int(len(tagged_sents) * split_perc)
train_sents, test_sents = tagged_sents[split_size:], tagged_sents[:split_size]
it fails with this:
Traceback (most recent call last):
File "test2.py", line 7, in <module>
tagged_sents = list(corp.tagged_sents())
File "C:\Users\User\anaconda3\lib\site-packages\nltk\corpus\reader\conll.py", line 130, in tagged_sents
return LazyMap(get_tagged_words, self._grids(fileids))
File "C:\Users\User\anaconda3\lib\site-packages\nltk\corpus\reader\conll.py", line 215, in _grids
return concat(
File "C:\Users\User\anaconda3\lib\site-packages\nltk\corpus\reader\util.py", line 433, in concat
raise ValueError("concat() expects at least one object!")
ValueError: concat() expects at least one object!
Then I tried to fix it by following this solution https://teamtreehouse.com/community/randomshuffle-crashes-when-passed-a-range-somenums-randomshufflerange5250
and altered
tagged_sents = list(corp.tagged_sents())
to
tagged_sents = list(range(5, 250))
The ValueError went away, but I don't know what range(5, 250) means, even though I have read the explanation.
Then I continued with the following step:
from ClassifierBasedGermanTagger.ClassifierBasedGermanTagger import ClassifierBasedGermanTagger
tagger = ClassifierBasedGermanTagger(train=train_sents)
And it shows
Traceback (most recent call last):
File "test1.py", line 90, in <module>
from ClassifierBasedGermanTagger.ClassifierBasedGermanTagger import ClassifierBasedGermanTagger
ModuleNotFoundError: No module named 'ClassifierBasedGermanTagger'
I have already downloaded ClassifierBasedGermanTagger.py and __init__.py and put them in the folder that is open in VS Code. I don't know if this is correct; the passage says:
'Using his Python class ClassifierBasedGermanTagger (which you can download from the github page) we can create a tagger and train it with the data from the TIGER corpus:'
Please help me to fix these problems, thanks!
First of all, welcome to Stack Overflow! Before posting a question, please make sure that you have done your own research; most of the time, that alone solves the problem.
Secondly, range(start, end) is a very basic Python function that produces a sequence of numbers, and I don't think using it the way you did is going to solve the problem; it only hides the error. I would suggest you use print to see what kind of data is populated in corp and start debugging from there. Maybe corp is just empty, and that's why you don't get any tagged_sents.
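For example, a minimal sanity check (assuming corp was built with nltk's ConllCorpusReader pointed at the TIGER corpus file, as in the tutorial); the concat() error usually means the reader found no data at all:
import os

print(corp.root)       # the directory the reader searches
print(corp.fileids())  # the file name(s) it expects to find there
for fid in corp.fileids():
    # check the corpus file actually exists where corp is looking
    print(fid, os.path.exists(os.path.join(str(corp.root), fid)))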
As for the import part, it is not clear to me where you put ClassifierBasedGermanTagger.py, but wherever it is, it is not visible to your code. You can try putting your script (test2.py) and ClassifierBasedGermanTagger.py in the same directory. Read the link below for more details on how to properly import modules in Python.
https://docs.python.org/3/reference/import.html
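Given the import statement in the tutorial, a layout like this (the project folder name is just an example) should make the module visible:
project/
    test2.py                             # your script, run from inside project/
    ClassifierBasedGermanTagger/
        __init__.py
        ClassifierBasedGermanTagger.py   # the class downloaded from GitHub
With that layout, from ClassifierBasedGermanTagger.ClassifierBasedGermanTagger import ClassifierBasedGermanTagger should work.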
I have an ARFF file and I need to remove the first 5 attributes from it (without deleting them manually). I tried to use python-weka-wrapper3, as explained here, since it exposes Weka's filtering options; however, I get an error when using the following code:
import weka.filters as Filter
remove = Filter(classname="weka.filters.unsupervised.attribute.Remove", options=["-R", "1,2,3,4,5"])
The error that I receive is the following:
Traceback (most recent call last):
File "/home/user/Desktop/file_loading.py", line 16, in <module>
removing = Filter(classname="weka.filters.unsupervised.attribute.Remove", options=["-R", "last"])
TypeError: 'module' object is not callable
What could be the reason for this error? I would also appreciate it if anyone knows an alternative way to remove attributes from an ARFF file using Python.
You are attempting to call the module object instead of the class object.
Try using:
from weka.filters import Filter
remove = Filter(classname="weka.filters.unsupervised.attribute.Remove", options=["-R", "1,2,3,4,5"])
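A fuller sketch, assuming the file names input.arff and output.arff; note that python-weka-wrapper3 needs a running JVM:
import weka.core.jvm as jvm
from weka.core.converters import Loader, Saver
from weka.filters import Filter

jvm.start()

# load the original ARFF file
loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_file("input.arff")

# remove the first five attributes ("1-5" is equivalent to "1,2,3,4,5")
remove = Filter(classname="weka.filters.unsupervised.attribute.Remove",
                options=["-R", "1-5"])
remove.inputformat(data)        # let the filter determine the output format
filtered = remove.filter(data)

# save the filtered data to a new ARFF file
saver = Saver(classname="weka.core.converters.ArffSaver")
saver.save_file(filtered, "output.arff")

jvm.stop()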
New to SciPy but not to Python. I am trying to import a .sav file into SciPy so I can do some basic work on it. But each time I try to import the file using scipy.io.readsav(), Python throws an error:
Traceback (most recent call last):
File "<ipython-input-7-743be643d8a1>", line 1, in <module>
dataset = io.readsav("c:/users/me/desktop/survey.sav")
File "C:\Users\me\Anaconda3\lib\site-packages\scipy\io\idl.py", line 726, in readsav
raise Exception("Invalid SIGNATURE: %s" % signature)
Exception: Invalid SIGNATURE: b'$F'
Any idea what's happening? I can open the file in R and manipulate the data, but I'd like to do it in Python. Running Anaconda on Windows.
scipy.io.readsav() reads IDL SAVE files. You have tagged this question spss, so I assume you are trying to read an SPSS file. The format of an SPSS .sav file is not the same as the format of an IDL SAVE file.
Look on PyPI for savReaderWriter for Python code to read and write .sav files.
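A minimal sketch with savReaderWriter (pip install savReaderWriter), reusing the path from your traceback:
import savReaderWriter

# SavReader iterates over the records; pass returnHeader=True to get the
# variable names as the first row.
with savReaderWriter.SavReader("c:/users/me/desktop/survey.sav") as reader:
    for row in reader:
        print(row)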
I am trying to load a .net file using python igraph library. Here is the sample code:
import igraph
g = igraph.read("s.net",format="pajek")
But when I tried to run this script, I got the following error:
Traceback (most recent call last):
File "demo.py", line 2, in <module>
g = igraph.read('s.net',format="pajek")
File "C:\Python27\lib\site-packages\igraph\__init__.py", line 3703, in read
return Graph.Read(filename, *args, **kwds)
File "C:\Python27\lib\site-packages\igraph\__init__.py", line 2062, in Read
return reader(f, *args, **kwds)
igraph._igraph.InternalError: Error at .\src\foreign.c:574: Parse error in Pajek
file, line 1 (syntax error, unexpected ARCSLINE, expecting VERTICESLINE), Parse error
Kindly provide some hints.
Either your file is not a regular Pajek file, or igraph's Pajek parser is not able to read this particular Pajek file. (Writing a Pajek parser is a bit hit-and-miss since the Pajek file format has no formal specification.) If you send me your Pajek file via email, I'll take a look at it.
Update: you were missing the *Vertices section of the Pajek file. Adding a line like *Vertices N (where N is the number of vertices in the graph) resolves your problem. I cannot state that this line is mandatory in Pajek files because of the lack of a formal specification for the file format, but all the Pajek files I have seen so far have included it, so I guess it's pretty standard.
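For illustration, a minimal Pajek file that parses cleanly (the vertex labels are made up; *Arcs holds directed edges, *Edges would hold undirected ones):
*Vertices 3
1 "a"
2 "b"
3 "c"
*Arcs
1 2
2 3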