Creating an ARFF file from python output

gardai-plan-crackdown-on-troublemakers-at-protest-2438316.html': {'dail': 1, 'focus': 1, 'actions': 1, 'trade': 2, 'protest': 1, 'identify': 1, 'previous': 1, 'detectives': 1, 'republican': 1, 'group': 1, 'monitor': 1, 'clashes': 1, 'civil': 1, 'charge': 1, 'breaches': 1, 'travelling': 1, 'main': 1, 'disrupt': 1, 'real': 1, 'policing': 3, 'march': 6, 'finance': 1, 'drawn': 1, 'assistant': 1, 'protesters': 1, 'emphasised': 1, 'department': 1, 'traffic': 2, 'outbreak': 1, 'culprits': 1, 'proportionate': 1, 'instructions': 1, 'warned': 2, 'commanders': 1, 'michael': 2, 'exploit': 1, 'culminating': 1, 'large': 2, 'continue': 1, 'team': 1, 'hijack': 1, 'disorder': 1, 'square': 1, 'leaders': 1, 'deal': 2, 'people': 3, 'streets': 1, 'demonstrations': 2, 'observed': 1, 'street': 2, 'college': 1, 'organised': 1, 'operation': 1, 'special': 1, 'shown': 1, 'attendance': 1, 'normal': 1, 'unions': 2, 'individuals': 1, 'safety': 2, 'prosecuted': 1, 'ira': 1, 'ground': 1, 'public': 2, 'told': 1, 'body': 1, 'stewards': 2, 'obey': 1, 'business': 1, 'gathered': 1, 'assemble': 1, 'garda': 5, 'sinn': 1, 'broken': 1, 'fachtna': 1, 'management': 2, 'possibility': 1, 'groups': 3, 'put': 1, 'affiliated': 1, 'strong': 2, 'security': 1, 'stage': 1, 'behaviour': 1, 'involved': 1, 'route': 2, 'violence': 1, 'dublin': 3, 'fein': 1, 'ensure': 2, 'stand': 1, 'act': 2, 'contingency': 1, 'troublemakers': 2, 'facilitate': 2, 'road': 1, 'members': 1, 'prepared': 1, 'presence': 1, 'sullivan': 2, 'reassure': 1, 'number': 3, 'community': 1, 'strategic': 1, 'visible': 2, 'addressed': 1, 'notify': 1, 'trained': 1, 'eirigi': 1, 'city': 4, 'gpo': 1, 'from': 3, 'crowd': 1, 'visit': 1, 'wood': 1, 'editor': 1, 'peaceful': 4, 'expected': 2, 'today': 1, 'commissioner': 4, 'quay': 1, 'ictu': 1, 'advance': 1, 'murphy': 2, 'gardai': 6, 'aware': 1, 'closures': 1, 'courts': 1, 'branch': 1, 'deployed': 1, 'made': 1, 'thousands': 1, 'socialist': 1, 'work': 1, 'supt': 2, 'feehan': 1, 'mr': 1, 'briefing': 1, 'visited': 
1, 'manner': 1, 'irish': 2, 'metropolitan': 1, 'spotters': 1, 'organisers': 1, 'in': 13, 'dissident': 1, 'evidence': 1, 'tom': 1, 'arrangements': 3, 'experience': 1, 'allowed': 1, 'sought': 1, 'rally': 1, 'connell': 1, 'officers': 3, 'potential': 1, 'holding': 1, 'units': 1, 'place': 2, 'events': 1, 'dignified': 1, 'planned': 1, 'independent': 1, 'added': 2, 'plans': 1, 'congress': 1, 'centre': 3, 'comprehensive': 1, 'measures': 1, 'yesterday': 2, 'alert': 1, 'important': 1, 'moving': 1, 'plan': 2, 'highly': 1, 'law': 2, 'senior': 2, 'fair': 1, 'recent': 1, 'refuse': 1, 'attempt': 1, 'brady': 1, 'liaising': 1, 'conscious': 1, 'light': 1, 'clear': 1, 'headquarters': 1, 'wing': 1, 'chief': 2, 'maintain': 1, 'harcourt': 1, 'order': 2, 'left': 1}}
I have a Python script that extracts words from text files and counts how many times each word occurs in each file.
I want to add the counts to an ".arff" file to use for Weka classification.
Above is an example of my script's output.
How do I go about inserting them into an ARFF file while keeping each text file separate? Each file's words are enclosed in their own {...} block.

I know it's pretty easy to generate an ARFF file on your own, but I still wanted to make it simpler, so I wrote a Python package:
https://github.com/ubershmekel/arff
It's also on PyPI, so you can install it with easy_install arff

There are details on the ARFF file format here and it's very simple to generate. For example, using a cut-down version of your Python dictionary, the following script:
import re

d = {'gardai-plan-crackdown-on-troublemakers-at-protest-2438316.html':
         {'dail': 1,
          'focus': 1,
          'actions': 1,
          'trade': 2,
          'protest': 1,
          'identify': 1}}

# Write one ARFF file per HTML file name in the dictionary.
for original_filename in d:
    m = re.search(r'^(.*)\.html$', original_filename)
    if not m:
        print("Ignoring the file:", original_filename)
        continue
    output_filename = m.group(1) + '.arff'
    with open(output_filename, "w") as fp:
        # ARFF header lines start with '@'; lines starting with '%' are comments.
        fp.write('''@RELATION wordcounts
@ATTRIBUTE word string
@ATTRIBUTE count numeric
@DATA
''')
        for word_and_count in d[original_filename].items():
            fp.write("%s,%d\n" % word_and_count)
This generates output of the form:
@RELATION wordcounts
@ATTRIBUTE word string
@ATTRIBUTE count numeric
@DATA
dail,1
focus,1
actions,1
trade,2
protest,1
identify,1
... in a file called gardai-plan-crackdown-on-troublemakers-at-protest-2438316.arff. If that's not exactly what you want, I'm sure you can easily alter it. (For example, if the "words" might have spaces or other punctuation in them, you probably want to quote them.)
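For Weka classification you may want a single ARFF file with one row per text file instead of one file per document. Here is a minimal sketch of that idea, assuming a dense layout with one numeric attribute per word. The names build_arff and quote are hypothetical helpers, not part of any package mentioned here, and the quoting rule (single quotes with backslash escapes) is an assumption about what Weka accepts:

```python
def quote(name):
    """Quote an ARFF name or string value (assumed escaping rule)."""
    return "'" + name.replace("\\", "\\\\").replace("'", "\\'") + "'"


def build_arff(docs, relation="wordcounts"):
    """Return ARFF text with one numeric attribute per word, one row per file."""
    # The vocabulary is the sorted union of all words seen in any file.
    vocabulary = sorted({word for counts in docs.values() for word in counts})
    lines = ["@RELATION " + relation, ""]
    lines.append("@ATTRIBUTE filename string")
    lines += ["@ATTRIBUTE %s numeric" % quote(word) for word in vocabulary]
    lines += ["", "@DATA"]
    for filename in sorted(docs):
        counts = docs[filename]
        # Words missing from a file get a count of 0.
        row = [quote(filename)] + [str(counts.get(word, 0)) for word in vocabulary]
        lines.append(",".join(row))
    return "\n".join(lines) + "\n"


# Made-up miniature input in the same shape as the script output above.
docs = {
    "a.html": {"march": 6, "garda": 5},
    "b.html": {"garda": 1, "city": 4},
}
arff_text = build_arff(docs)
print(arff_text)
```

A classifier would also need a class attribute, which this sketch leaves out because the example data carries no labels. A word that happens to be "filename" would clash with the first attribute, so rename one of them if that can occur.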

This project seems to be a bit more up to date. You can install it via
pip:
$ pip install liac-arff
or easy_install:
$ easy_install liac-arff

Related

Sort the keys of a dictionary by key using a list and for loop [duplicate]

I need to sort this dictionary that counts the times that some words appear in a song:
word_freq = {'love': 25, 'conversation': 1, 'every': 6, "we're": 1, 'plate': 1, 'sour': 1, 'jukebox': 1, 'now': 11, 'taxi': 1, 'fast': 1, 'bag': 1, 'man': 1, 'push': 3, 'baby': 14, 'going': 1, 'you': 16, "don't": 2, 'one': 1, 'mind': 2, 'backseat': 1, 'friends': 1, 'then': 3, 'know': 2, 'take': 1, 'play': 1, 'okay': 1, 'so': 2, 'begin': 1, 'start': 2, 'over': 1, 'body': 17, 'boy': 2, 'just': 1, 'we': 7, 'are': 1, 'girl': 2, 'tell': 1, 'singing': 2, 'drinking': 1, 'put': 3, 'our': 1, 'where': 1, "i'll": 1, 'all': 1, "isn't": 1, 'make': 1, 'lover': 1, 'get': 1, 'radio': 1, 'give': 1, "i'm": 23, 'like': 10, 'can': 1, 'doing': 2, 'with': 22, 'club': 1, 'come': 37, 'it': 1, 'somebody': 2, 'handmade': 2, 'out': 1, 'new': 6, 'room': 3, 'chance': 1, 'follow': 6, 'in': 27, 'may': 2, 'brand': 6, 'that': 2, 'magnet': 3, 'up': 3, 'first': 1, 'and': 23, 'pull': 3, 'of': 6, 'table': 1, 'much': 2, 'last': 3, 'i': 6, 'thrifty': 1, 'grab': 2, 'was': 2, 'driver': 1, 'slow': 1, 'dance': 1, 'the': 18, 'say': 2, 'trust': 1, 'family': 1, 'week': 1, 'date': 1, 'me': 10, 'do': 3, 'waist': 2, 'smell': 3, 'day': 6, 'although': 3, 'your': 21, 'leave': 1, 'want': 2, "let's": 2, 'lead': 6, 'at': 1, 'hand': 1, 'how': 1, 'talk': 4, 'not': 2, 'eat': 1, 'falling': 3, 'about': 1, 'story': 1, 'sweet': 1, 'best': 1, 'crazy': 2, 'let': 1, 'too': 5, 'van': 1, 'shots': 1, 'go': 2, 'to': 2, 'a': 8, 'my': 33, 'is': 5, 'place': 1, 'find': 1, 'shape': 6, 'on': 40, 'kiss': 1, 'were': 3, 'night': 3, 'heart': 3, 'for': 3, 'discovering': 6, 'something': 6, 'be': 16, 'bedsheets': 3, 'fill': 2, 'hours': 2, 'stop': 1, 'bar': 1}
In order to do it I need:
To create a new list just with the keys of the dictionary.
keys = list(word_freq.keys())
Sort the key list.
keys.sort()
Create an empty dictionary.
word_freq2 = {}
Use a for loop to iterate over each key in the list. For each key, find the corresponding value in the first dictionary and insert the key-value pair into the new empty dictionary.
This is my best solution so far:
for key in keys:
    if key in word_freq:
        word_freq2.update({key: value})
print(word_freq2)
The problem is that I don't know how to add the correct value, because right now I receive just 1 as every value, as shown here:
{'a': 1, 'about': 1, 'all': 1, 'although': 1, 'and': 1, 'are': 1, 'at': 1, 'baby': 1, 'backseat': 1, 'bag': 1, 'bar': 1, 'be': 1, 'bedsheets': 1, 'begin': 1, 'best': 1, 'body': 1, 'boy': 1, 'brand': 1, 'can': 1, 'chance': 1, 'club': 1, 'come': 1, 'conversation': 1, 'crazy': 1, 'dance': 1, 'date': 1, 'day': 1, 'discovering': 1, 'do': 1, 'doing': 1, "don't": 1, 'drinking': 1, 'driver': 1, 'eat': 1, 'every': 1, 'falling': 1, 'family': 1, 'fast': 1, 'fill': 1, 'find': 1, 'first': 1, 'follow': 1, 'for': 1, 'friends': 1, 'get': 1, 'girl': 1, 'give': 1, 'go': 1, 'going': 1, 'grab': 1, 'hand': 1, 'handmade': 1, 'heart': 1, 'hours': 1, 'how': 1, 'i': 1, "i'll": 1, "i'm": 1, 'in': 1, 'is': 1, "isn't": 1, 'it': 1, 'jukebox': 1, 'just': 1, 'kiss': 1, 'know': 1, 'last': 1, 'lead': 1, 'leave': 1, 'let': 1, "let's": 1, 'like': 1, 'love': 1, 'lover': 1, 'magnet': 1, 'make': 1, 'man': 1, 'may': 1, 'me': 1, 'mind': 1, 'much': 1, 'my': 1, 'new': 1, 'night': 1, 'not': 1, 'now': 1, 'of': 1, 'okay': 1, 'on': 1, 'one': 1, 'our': 1, 'out': 1, 'over': 1, 'place': 1, 'plate': 1, 'play': 1, 'pull': 1, 'push': 1, 'put': 1, 'radio': 1, 'room': 1, 'say': 1, 'shape': 1, 'shots': 1, 'singing': 1, 'slow': 1, 'smell': 1, 'so': 1, 'somebody': 1, 'something': 1, 'sour': 1, 'start': 1, 'stop': 1, 'story': 1, 'sweet': 1, 'table': 1, 'take': 1, 'talk': 1, 'taxi': 1, 'tell': 1, 'that': 1, 'the': 1, 'then': 1, 'thrifty': 1, 'to': 1, 'too': 1, 'trust': 1, 'up': 1, 'van': 1, 'waist': 1, 'want': 1, 'was': 1, 'we': 1, "we're": 1, 'week': 1, 'were': 1, 'where': 1, 'with': 1, 'you': 1, 'your': 1}

This code seems to work just fine:
word_freq = {'love': 25, 'conversation': 1, 'every': 6, "we're": 1, 'plate': 1, 'sour': 1, 'jukebox': 1, 'now': 11, 'taxi': 1, 'fast': 1, 'bag': 1, 'man': 1, 'push': 3, 'baby': 14, 'going': 1, 'you': 16, "don't": 2, 'one': 1, 'mind': 2, 'backseat': 1, 'friends': 1, 'then': 3, 'know': 2, 'take': 1, 'play': 1, 'okay': 1, 'so': 2, 'begin': 1, 'start': 2, 'over': 1, 'body': 17, 'boy': 2, 'just': 1, 'we': 7, 'are': 1, 'girl': 2, 'tell': 1, 'singing': 2, 'drinking': 1, 'put': 3, 'our': 1, 'where': 1, "i'll": 1, 'all': 1, "isn't": 1, 'make': 1, 'lover': 1, 'get': 1, 'radio': 1, 'give': 1, "i'm": 23, 'like': 10, 'can': 1, 'doing': 2, 'with': 22, 'club': 1, 'come': 37, 'it': 1, 'somebody': 2, 'handmade': 2, 'out': 1, 'new': 6, 'room': 3, 'chance': 1, 'follow': 6, 'in': 27, 'may': 2, 'brand': 6, 'that': 2, 'magnet': 3, 'up': 3, 'first': 1, 'and': 23, 'pull': 3, 'of': 6, 'table': 1, 'much': 2, 'last': 3, 'i': 6, 'thrifty': 1, 'grab': 2, 'was': 2, 'driver': 1, 'slow': 1, 'dance': 1, 'the': 18, 'say': 2, 'trust': 1, 'family': 1, 'week': 1, 'date': 1, 'me': 10, 'do': 3, 'waist': 2, 'smell': 3, 'day': 6, 'although': 3, 'your': 21, 'leave': 1, 'want': 2, "let's": 2, 'lead': 6, 'at': 1, 'hand': 1, 'how': 1, 'talk': 4, 'not': 2, 'eat': 1, 'falling': 3, 'about': 1, 'story': 1, 'sweet': 1, 'best': 1, 'crazy': 2, 'let': 1, 'too': 5, 'van': 1, 'shots': 1, 'go': 2, 'to': 2, 'a': 8, 'my': 33, 'is': 5, 'place': 1, 'find': 1, 'shape': 6, 'on': 40, 'kiss': 1, 'were': 3, 'night': 3, 'heart': 3, 'for': 3, 'discovering': 6, 'something': 6, 'be': 16, 'bedsheets': 3, 'fill': 2, 'hours': 2, 'stop': 1, 'bar': 1}
keys = list(word_freq.keys())
keys.sort()
word_freq2 = {}
for key in keys:
    word_freq2[key] = word_freq[key]
print(word_freq2)
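If the exercise did not require the explicit list and for loop, the same result can be had in one step. This is a short sketch on a cut-down dictionary, relying on the fact that dicts keep insertion order in Python 3.7 and later; word_freq_sorted is just an illustrative name:

```python
# Cut-down version of the dictionary from the question.
word_freq = {'love': 25, 'conversation': 1, 'every': 6, "we're": 1, 'baby': 14}

# sorted() orders the (key, value) pairs by key; dict() keeps that order.
word_freq_sorted = dict(sorted(word_freq.items()))
print(word_freq_sorted)
```

Each value travels with its key here, which is the same thing the loop achieves by looking up word_freq[key].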

Best model for variable selection with big data?

I posted a question earlier about some code, but now I realize I should ask more broadly about the general idea. Basically, I'm trying to build a statistical model with about 1000 observations and 2000 variables. I would like to determine which variables have the most influence on my dependent variable, with high significance. I don't plan to use the model for prediction, just for variable selection. My independent variables are binary and my dependent variable is continuous. I've tried multiple linear regression and fixed models with tools such as statsmodels and scikit-learn. However, I have run into issues such as having more variables than observations. I would prefer to solve the problem in Python since I have basic knowledge of it. However, statistics is very new to me, so I don't know the best direction. Any help is appreciated.
Tree method
import pandas as pd
from sklearn import tree
from sklearn import preprocessing
data=pd.read_excel('data_file.xlsx')
y=data.iloc[:, -1]
X=data.iloc[:, :-1]
le=preprocessing.LabelEncoder()
y=le.fit_transform(y)
clf=tree.DecisionTreeClassifier()
clf=clf.fit(X,y)
tree.export_graphviz(clf, out_file='tree.dot')
Or if I output to text file, the first few lines are:
digraph Tree {
node [shape=box] ;
0 [label="X[685] <= 0.5\ngini = 0.995\nsamples = 1097\nvalue = [2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1\n1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1\n1, 1, 1, 8, 1, 1, 3, 1, 2, 1, 1, 1, 2, 1\n1, 1, 1, 2, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1\n1, 1, 1, 1, 1, 1, 1, 1, 3, 2, 4, 1, 1, 1\n1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1\n1, 3, 2, 2, 1, 2, 1, 1, 2, 1, 1, 1, 2, 2\n1, 1, 1, 1, 1, 1, 30, 3, 1, 3, 1, 1, 2, 1\n1, 5, 1, 2, 1, 4, 2, 1, 1, 1, 1, 1, 1, 1\n1, 1, 2, 1, 1, 1, 3, 1, 1, 3, 1, 2, 1, 1\n1, 7, 3, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1\n6, 2, 1, 2, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1\n1, 1, 1, 1, 1, 1, 1, 1, 3, 7, 6, 1, 1, 1\n1, 1, 3, 4, 1, 1, 1, 1, 1, 4, 1, 2, 1, 1\n1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1\n1, 4, 1, 1, 4, 2, 1, 1, 1, 2, 1, 1, 2, 2\n11, 1, 1, 2, 1, 3, 1, 1, 1, 1, 1, 1, 12, 1\n1, 1, 3, 1, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1\n6, 1, 1, 1, 1, 1, 4, 2, 1, 2, 1, 1, 1, 1\n1, 1, 1, 1, 3, 1, 1, 3, 1, 1, 1, 1, 1, 1\n1, 1, 1, 1, 1, 11, 1, 2, 1, 2, 1, 1, 1, 1\n4, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1\n1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2\n1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3\n1, 7, 1, 1, 2, 1, 2, 7, 1, 1, 1, 3, 1, 11\n1, 1, 2, 2, 2, 1, 1, 10, 1, 1, 5, 21, 1, 1\n11, 1, 2, 1, 1, 1, 1, 1, 5, 15, 3, 1, 1, 1\n1, 1, 1, 3, 1, 1, 2, 1, 3, 1, 1, 1, 1, 1\n1, 1, 6, 1, 1, 1, 1, 1, 1, 14, 1, 1, 1, 1\n17, 1, 1, 1, 1, 1, 1, 1, 2, 3, 1, 1, 1, 4\n1, 1, 1, 6, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1\n1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 14, 1\n3, 1, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 3, 1\n1, 2, 1, 12, 1, 1, 1, 1, 8, 2, 1, 1, 1, 2\n1, 1, 3, 1, 1, 6, 1, 1, 1, 3, 1, 1, 2, 1\n1, 1, 1, 1, 4, 1, 1, 2, 1, 3, 2, 4, 1, 3\n1, 1, 1, 1, 1, 7, 1, 1, 2, 1, 1, 2, 13, 2\n1, 1, 1, 1, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1\n9, 1, 2, 5, 7, 1, 1, 1, 2, 9, 2, 2, 13, 1\n1, 1, 1, 2, 1, 3, 1, 1, 6, 1, 3, 1, 1, 3\n1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 4, 1, 5, 1\n4, 1, 2, 3, 3]"] ;
1 [label="X[990] <= 0.5\ngini = 0.995\nsamples = 1040\nvalue = [2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1\n1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1\n1, 1, 1, 8, 1, 1, 3, 1, 2, 1, 1, 1, 2, 1\n1, 1, 1, 2, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1\n1, 1, 1, 1, 1, 1, 1, 1, 3, 2, 4, 1, 1, 1\n1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1\n1, 3, 2, 2, 1, 2, 1, 1, 2, 1, 1, 1, 2, 2\n1, 1, 1, 1, 1, 1, 30, 3, 1, 3, 1, 1, 2, 1\n1, 5, 1, 2, 1, 4, 2, 1, 1, 1, 1, 1, 1, 1\n1, 1, 2, 1, 1, 1, 3, 1, 1, 3, 1, 2, 1, 1\n1, 7, 3, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1\n6, 2, 1, 2, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1\n1, 1, 1, 1, 1, 1, 1, 1, 3, 7, 6, 1, 1, 1\n1, 1, 3, 4, 1, 1, 1, 1, 1, 4, 1, 2, 1, 1\n1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1\n1, 4, 1, 1, 4, 2, 1, 1, 1, 2, 1, 1, 2, 2\n11, 1, 1, 2, 1, 3, 1, 1, 1, 1, 1, 1, 12, 1\n1, 1, 3, 1, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1\n6, 1, 0, 1, 1, 1, 4, 2, 1, 2, 1, 1, 1, 1\n1, 1, 1, 1, 3, 1, 1, 3, 1, 1, 1, 0, 1, 1\n1, 1, 1, 1, 1, 9, 1, 2, 1, 2, 1, 1, 1, 1\n4, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1\n1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2\n1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3\n1, 7, 1, 1, 2, 1, 2, 7, 1, 1, 1, 1, 1, 11\n1, 1, 2, 2, 2, 1, 1, 10, 1, 1, 5, 21, 1, 1\n1, 1, 2, 1, 1, 1, 1, 1, 5, 15, 3, 1, 1, 1\n1, 1, 1, 3, 1, 1, 2, 1, 3, 1, 1, 0, 1, 1\n1, 1, 6, 1, 1, 1, 1, 1, 1, 14, 1, 1, 1, 1\n16, 1, 1, 1, 1, 1, 1, 1, 2, 3, 1, 1, 1, 4\n1, 1, 1, 6, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1\n1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 0, 1\n3, 1, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 3, 1\n1, 2, 1, 12, 1, 1, 1, 1, 8, 2, 0, 1, 1, 2\n1, 1, 3, 1, 1, 6, 1, 1, 1, 3, 1, 1, 2, 0\n1, 1, 1, 1, 4, 1, 1, 2, 1, 3, 2, 4, 1, 3\n1, 1, 1, 1, 1, 7, 1, 1, 2, 1, 0, 1, 3, 2\n1, 1, 1, 0, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1\n9, 1, 2, 5, 6, 1, 1, 1, 2, 9, 2, 2, 13, 1\n1, 1, 1, 2, 1, 3, 1, 1, 6, 1, 3, 1, 0, 3\n1, 0, 1, 1, 2, 0, 1, 2, 1, 1, 0, 1, 5, 1\n4, 1, 0, 3, 3]"] ;

I would recommend taking a closer look at the variance of your variables, keeping those with the largest variance (pandas.DataFrame.var()) and eliminating those that correlate most strongly with others (pandas.DataFrame.corr()). As further steps, I'd suggest trying any of the methods mentioned earlier.
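To make the idea concrete, here is a rough sketch of such a filter using only the standard library (with pandas you would use DataFrame.var() and DataFrame.corr() instead). The function name filter_columns and both thresholds are assumptions chosen for illustration, not recommended values:

```python
from math import sqrt
from statistics import mean, pvariance


def pearson(xs, ys):
    """Pearson correlation of two equally long, non-constant sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def filter_columns(columns, min_variance=0.05, max_abs_corr=0.95):
    """Drop near-constant columns, then columns that duplicate a kept one."""
    # Step 1: keep only columns whose variance reaches the threshold.
    varied = {name: vals for name, vals in columns.items()
              if pvariance(vals) >= min_variance}
    # Step 2: greedily keep a column unless it is highly correlated
    # with a column that was already selected.
    selected = []
    for name, vals in varied.items():
        if all(abs(pearson(vals, varied[other])) <= max_abs_corr
               for other in selected):
            selected.append(name)
    return selected


# Made-up binary data: x1 copies x0 and x2 is constant.
columns = {
    "x0": [0, 1, 0, 1, 1, 0],
    "x1": [0, 1, 0, 1, 1, 0],
    "x2": [0, 0, 0, 0, 0, 0],
    "x3": [1, 1, 0, 0, 1, 0],
}
selected = filter_columns(columns)
print(selected)
```

The greedy second step depends on column order: of two highly correlated columns, whichever comes first is the one that survives.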

1. Option A: Feature selection with scikit-learn
For feature selection, scikit-learn offers a lot of different approaches:
https://scikit-learn.org/stable/modules/feature_selection.html
That page sums up the suggestions from the comments above.
2. Option B: Feature selection with linear regression
You can also read off feature importance by running a linear regression on your data: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html. The fitted attribute reg.coef_ gives you the coefficients for your features. When the features are on a comparable scale (as binary features are), the higher the absolute value, the more important the feature: for example, 0.8 would be a really important feature, whereas 0.00001 would not be important.
3. Option C: PCA (not for the binary case)
Why do you want to discard your variables? I would recommend PCA (principal component analysis): https://en.wikipedia.org/wiki/Principal_component_analysis.
The basic concept is to transform your 2000 features into a smaller space (maybe 1000 or whatever) while still being mathematically useful.
scikit-learn has a good implementation of it: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

Keras GridSearch model prediction

I'm battling with a weird issue that I can't seem to figure out.
So, I used KerasClassifier and GridSearch to build and search for the best parameters for my model. This part worked fine.
After this, I tried predicting on my test data which is where the weird thing happened.
Assuming my grid search object is grid and my test data is X_test, I noticed that the result of grid.best_estimator_.predict(X_test) is completely different from the result of grid.best_estimator_.model.predict(X_test).
For more context, here's a sample of the result from grid.best_estimator_.predict(X_test):
1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 3, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 3,
1, 1, 1, 1, 3, 3, 1, 3, 1, 3, 1, 1, 1, 1, 0, 1, 1, 1, 3, 1, 3, 3,
1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 3, 1, 3, 1, 1, 3, 1, 1, 3,
1, 1, 1, 1, 1, 1, 1, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 3, 1, 1,
1, 3, 1, 1, 3, 1, 3, 1, 1, 0, 1, 1, 3, 1, 1, 3, 3, 1, 1, 1, 3, 1,
1, 3, 1, 3, 1, 3, 1, 1, 3, 1, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, 1, 1,
1, 1, 3, 1, 1, 3, 3, 3, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 3, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1,
1, 3])
and here's the result from grid.best_estimator_.model.predict(X_test):
[[4.16748554e-01 3.44473362e-01 1.22328281e-01 1.16449825e-01]
[3.47690374e-01 4.35497969e-01 9.62351710e-02 1.20576508e-01]
[4.16748554e-01 3.44473362e-01 1.22328281e-01 1.16449825e-01]
[4.16748554e-01 3.44473362e-01 1.22328281e-01 1.16449825e-01]
[4.16748554e-01 3.44473362e-01 1.22328281e-01 1.16449825e-01]
[4.48489130e-01 3.48928362e-01 1.13302141e-01 8.92804191e-02]
[4.16748554e-01 3.44473362e-01 1.22328281e-01 1.16449825e-01]
[2.65852152e-03 2.72439304e-03 5.55709645e-04 9.94061410e-01]
[4.16748554e-01 3.44473362e-01 1.22328281e-01 1.16449825e-01]
[4.16748554e-01 3.44473362e-01 1.22328281e-01 1.16449825e-01]
[4.16748554e-01 3.44473362e-01 1.22328281e-01 1.16449825e-01]
[1.14751011e-01 2.33341262e-01 3.13971192e-02 6.20510638e-01]
[8.30730610e-03 1.07289189e-02 1.87594432e-03 9.79087830e-01]
In an attempt to debug this, I tried calling np.argmax() on the output of grid.best_estimator_.model(X_test). I then tried (result_of_best_estimator == result_of_model).all(), which returns False.
So, am I missing something? Or do I misunderstand how this is supposed to work?
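One thing worth checking, offered as a sketch and not a confirmed diagnosis: the first output above is a list of class labels, while the second has one row of four probabilities per sample. To compare them you need the index of the largest value in each row, which in NumPy is np.argmax(probs, axis=1); without axis, np.argmax returns a single index into the flattened array. The pure-Python version below uses three rows copied from the output above, and argmax_per_row is a hypothetical helper name:

```python
# Three probability rows copied from the model.predict output above.
probs = [
    [4.16748554e-01, 3.44473362e-01, 1.22328281e-01, 1.16449825e-01],
    [3.47690374e-01, 4.35497969e-01, 9.62351710e-02, 1.20576508e-01],
    [2.65852152e-03, 2.72439304e-03, 5.55709645e-04, 9.94061410e-01],
]


def argmax_per_row(rows):
    """Return, for each row, the column index holding the largest value."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


class_indices = argmax_per_row(probs)
print(class_indices)  # [0, 1, 3]
```

If the row-wise indices still differ from the wrapper's labels, it may be that the wrapper maps indices to the original labels through its own encoding, so the two need not be numerically equal. That part is an assumption to verify against the wrapper you are using.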

how to make the values of dictionary appear in alphabetical order

Let's say I have this dictionary:
{'song': 1, 'like': 1, 'most': 1, 'neer': 1, 'hides': 1, 'live': 1, 'yours': 1, 'come': 2, 'not': 1, 'rage': 1, 'deserts': 1, 'be': 2, 'graces': 1, 'metre': 1, 'rights': 1, 'tomb': 1, 'stretched': 1, 'verse': 1, 'write': 1, 'the': 2, 'beauty': 1, 'all': 1, 'should': 2, 'it': 3, 'rhyme': 1, 'is': 1, 'this': 1, 'in': 4, 'earthly': 1, 'numbers': 1, 'to': 2, 'if': 2, 'my': 3, 'yet': 1, 'less': 1, 'would': 1, 'life': 1, 'an': 1, 'alive': 1, 'number': 1, 'a': 2, 'child': 1, 'say': 1, 'tongue': 1, 'heavenly': 1, 'knows': 1, 'men': 1, 'could': 1, 'half': 1, 'so': 1, 'parts': 1, 'their': 1, 'high': 1, 'with': 2, 'believe': 1, 'such': 1, 'that': 1, 'papers': 1, 'eyes': 1, 'antique': 1, 'age': 2, 'were': 2, 'fresh': 1, 'lies': 1, 'than': 1, 'poet': 1, 'termed': 1, 'old': 1, 'touches': 1, 'and': 5, 'but': 2, 'some': 1, 'of': 4, 'time': 2, 'touched': 1, 'twice': 1, 'will': 1, 'yellowed': 1, 'you': 1, 'though': 1, 'heaven': 1, 'poets': 1, 'truth': 1, 'who': 1, 'i': 1, 'faces': 1, 'which': 1, 'scorned': 1, 'shows': 1, 'filled': 1, 'your': 6, 'true': 1, 'as': 1}
How do I go about ordering the key-value pairs alphabetically by key?
I tried doing:
for key, value in sorted(freqs.items()):
    freqs[key] = value
but that doesn't do anything. I want it to look like this:
ab 5
and 8
...
yours 2

Dicts are not sorted data structures, but you can traverse them in sorted key order using:
for key in sorted(freqs.keys()):
    print(key, freqs[key])

collections.OrderedDict is for this purpose. Example:
>>> from collections import OrderedDict
>>> # regular unsorted dictionary
>>> d = {'banana': 3, 'apple': 4, 'pear': 1, 'orange': 2}
>>> # dictionary sorted by key
>>> OrderedDict(sorted(d.items(), key=lambda t: t[0]))
OrderedDict([('apple', 4), ('banana', 3), ('orange', 2), ('pear', 1)])

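Dictionaries in Python are not sorted by key. Before Python 3.7 the iteration order was not guaranteed at all, so printing the same dictionary in two different places could show different orders. Since Python 3.7 a dict preserves insertion order, but that is still not alphabetical order unless you insert the keys that way.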

Try one of these:
https://pypi.python.org/pypi/treap
https://pypi.python.org/pypi/red-black-tree-mod
The treap is a hybrid of a tree and a heap. It works like a sorted (by key) dictionary.
The red-black tree is a tree. It also works like a sorted, by key, dictionary.
Some say treaps are faster than red-black trees on average, but that treaps have a greater standard deviation in operation times.
Both of them do almost everything in O(log n) time, except sorting. They both keep everything sorted by key at all times.
Sometimes it's better to sort the keys of a standard dictionary, but it's rarely a good idea to sort inside a loop.
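For small data you can get a similar "always sorted by key" behaviour with just the standard library. The sketch below is neither a treap nor a red-black tree: it keeps a plain dict plus a sorted key list maintained with bisect, so inserting a new key costs O(n) instead of O(log n). SortedKeyDict is a hypothetical class name used only for illustration:

```python
import bisect


class SortedKeyDict:
    """A dict whose items() always come back ordered by key."""

    def __init__(self):
        self._data = {}
        self._keys = []  # kept sorted at all times

    def __setitem__(self, key, value):
        if key not in self._data:
            # Insert the new key at its sorted position (O(n) list insert).
            bisect.insort(self._keys, key)
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]

    def items(self):
        return [(key, self._data[key]) for key in self._keys]


freqs = SortedKeyDict()
for word, count in [("your", 6), ("and", 5), ("in", 4), ("and", 8)]:
    freqs[word] = count
print(freqs.items())
```

Updating an existing key only overwrites the value, so the key list never contains duplicates.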

how to write a report into a file?

I'm having problems with my code. I want to be able to write a report (which is a dictionary) into a file. I have this as my report:
{longs: ['stretched']
avglen: 4.419354838709677
freqs: {'could': 1, 'number': 1, 'half': 1, 'scorned': 1, 'come': 2, 'numbers': 1, 'rage': 1, 'metre': 1, 'termed': 1, 'heavenly': 1, 'touches': 1, 'i': 1, 'their': 1, 'poets': 1, 'a': 2, 'lies': 1, 'verse': 1, 'an': 1, 'as': 1, 'eyes': 1, 'touched': 1, 'knows': 1, 'tongue': 1, 'not': 1, 'yet': 1, 'filled': 1, 'heaven': 1, 'of': 4, 'earthly': 1, 'hides': 1, 'to': 2, 'stretched': 1, 'deserts': 1, 'this': 1, 'tomb': 1, 'write': 1, 'yellowed': 1, 'that': 1, 'alive': 1, 'some': 1, 'so': 1, 'such': 1, 'should': 2, 'like': 1, 'than': 1, 'antique': 1, 'yours': 1, 'but': 2, 'age': 2, 'less': 1, 'fresh': 1, 'time': 2, 'rhyme': 1, 'true': 1, 'neer': 1, 'all': 1, 'in': 4, 'live': 1, 'be': 2, 'your': 6, 'who': 1, 'truth': 1, 'child': 1, 'twice': 1, 'shows': 1, 'poet': 1, 'most': 1, 'life': 1, 'song': 1, 'will': 1, 'my': 3, 'if': 2, 'parts': 1, 'were': 2, 'you': 1, 'is': 1, 'papers': 1, 'it': 3, 'which': 1, 'rights': 1, 'with': 2, 'say': 1, 'old': 1, 'beauty': 1, 'high': 1, 'and': 5, 'would': 1, 'believe': 1, 'faces': 1, 'though': 1, 'men': 1, 'graces': 1, 'the': 2}
shorts: ['i', 'a']
count: 93
mosts: your}
and my code is:
def write_report(r, filename):
    input_file = open(filename, "w")
    for k, v in r.items():
        line = '{}, {}'.format(k, v)
        print(line, file=input_file)
    input_file.close()
    return input_file
but if I name r as the report, it gives me syntax error.
I changed it to this code now:
def write_report(r, filename):
    with open(filename, "w") as f:
        for k, v in r.items():
            f.write('{}, {}'.format(k, v))
    return f
but I get this error:
<_io.TextIOWrapper name='sonnet_017.txt' mode='w' encoding='UTF-8'>

This
<_io.TextIOWrapper name='sonnet_017.txt' mode='w' encoding='UTF-8'>
is not an error message. It is the repr of the return value of the function. You have return f in your function, so it is returning the file object.
E.g.:
>>> f = open('junk.txt', 'w')
>>> f
<_io.TextIOWrapper name='junk.txt' mode='w' encoding='UTF-8'>
Here's your function in action:
>>> r
{'bar': 12.345, 'foo': 'abc'}
>>> write_report(r, "junk.txt")
<_io.TextIOWrapper name='junk.txt' mode='w' encoding='UTF-8'>
Now read the file back and see what we get:
>>> with open("junk.txt", "r") as f:
...     contents = f.read()
...
>>> contents
'bar, 12.345foo, abc'
At a minimum, you might want to modify the write_report function to include a newline after writing each key/value pair:
f.write('{}, {}\n'.format(k, v) )
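Putting that together, here is one possible version of the whole function, as a sketch. It drops the return statement, since there is nothing useful to return once the file is closed, and it uses a temporary directory and a made-up miniature report so it can be run anywhere:

```python
import os
import tempfile


def write_report(report, filename):
    """Write each key/value pair of the report on its own line."""
    with open(filename, "w", encoding="utf-8") as f:
        for key, value in report.items():
            f.write("{}, {}\n".format(key, value))
    # No return value: the file is already closed when the with block ends.


# Miniature report in the same shape as the one in the question.
report = {"longs": ["stretched"], "count": 93, "mosts": "your"}
path = os.path.join(tempfile.mkdtemp(), "report.txt")
write_report(report, path)

with open(path, encoding="utf-8") as f:
    contents = f.read()
print(contents)
```

Because the function now returns None, calling it in the interactive interpreter no longer echoes the file object's repr.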
