Problem re-creating a forloop from a Kaggle competition - python

Recently, for a course at university, our teacher asked us to re-create one of Kaggle's competitions. I chose to do this one.
I was able to follow the tutorial relatively well until I reached the for loop they wrote to clean the text in the data frame. Here it is:
# Get the number of reviews based on the dataframe column size
num_reviews = train["review"].size

# Initialize an empty list to hold the clean reviews
clean_train_reviews = []

# Loop over each review; create an index i that goes from 0 to the length
# of the movie review list
for i in xrange(0, num_reviews):
    # Call our function for each one, and add the result to the list of
    # clean reviews
    clean_train_reviews.append(review_to_words(train["review"][i]))
My problem is with their use of xrange in the loop, since that was never assigned to anything during the tutorial and therefore raises an error when I try to run the code. I also checked the code provided by the author on GitHub, and they did the exact same thing. So my question is: is this simply a mistake on their end, or am I missing something? If I am not missing anything, what should go in the place of xrange?
I've tried assigning the relevant dataframe column to a variable I can then use here, but I then get a TypeError stating 'Series' object is not callable. My knowledge of Python is still a bit elementary, so I apologize if I am simply missing something obvious. I appreciate any help!

range and xrange are both Python functions; the only difference is that xrange exists only in Python 2.
They are implemented in different ways and have different characteristics associated with them. The points of comparison are:
Return type: range() returns an object of type 'list', while xrange() returns an object of type 'xrange'.
Memory: the list built by range() takes up a large amount of memory; the xrange object is small.
Operation usage: since range() returns a list, all operations that can be applied to a list can be used on its result. xrange() returns an xrange object, to which list operations cannot be applied, which is a disadvantage.
Speed: because xrange() evaluates lazily, producing only the values that are actually requested, it is faster than range().
More info:
source
You are probably having this issue because you are using Python 3, where xrange no longer exists.
Just use the range function instead.
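For example, the loop from the question runs unchanged on Python 3 once xrange is replaced with range (a minimal sketch; train and review_to_words are assumed to exist as in the Kaggle tutorial):
num_reviews = train["review"].size
clean_train_reviews = []
for i in range(0, num_reviews):
    # review_to_words comes from the tutorial; it cleans a single review
    clean_train_reviews.append(review_to_words(train["review"][i]))
An equivalent list comprehension avoids the explicit index entirely: clean_train_reviews = [review_to_words(r) for r in train["review"]].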

Related

str.split() in the for-loop instantiation, does it cause slower execution?

I'm a sucker for reducing code to its bare minimum and love keeping it short and slim, but occasionally I get into the dilemma of whether I'm doing more harm than good. Below is an example of a situation I frequently encounter, where I start pondering whether I am minifying at the expense of speed.
str = "my name is john"
##Alternative 1
for el in str.split(" "):
print(el)
##Alternative 2
splittedStr = str.split(" ")
for el in splittedStr:
print(el)
Which is faster? I'd assume the second one, because we don't split the string on every iteration (though I'm not even sure that happens)?
str.split(" ") does the exact same thing in both cases. It creates an anonymous list of the split strings. In the second case you have the minor overhead of assigning it to a variable and then fetching the value of the variable. Its wasted time if you don't need to keep the object for other reasons. But this is a trivial amount of time compared to other object referencing taking place in the same loop. Alternative 2 also leaves the data in memory which is another small performance issue.
The real reason Alternative 1 is better than 2, IMHO, is that it doesn't leave the hint that splittedStr is going to be needed later.
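If you'd rather measure than guess, the timeit module makes the comparison easy. A quick sketch (absolute numbers will vary by machine; expect the two alternatives to be nearly identical):
from timeit import timeit

setup = 's = "my name is john"'
alt1 = 'for el in s.split(" "): pass'
alt2 = (
    'parts = s.split(" ")\n'
    'for el in parts: pass'
)

# Each statement is run one million times by default
print(timeit(alt1, setup=setup))
print(timeit(alt2, setup=setup))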
Look, my friend, if you want to actually reduce the running time in general, loop over a tuple instead of a list (see the comparison below). Assigning the result to a variable and then using the variable is not the best approach, since you've just reserved a memory location to store the value. But sometimes you do that for the sake of clean code, for example when you have more than one operation on a single line, like:
min(str.split(" ")[3:10])
In this case, it is better to have a variable called min_value, for example, just to make things cleaner.
Returning to the performance issue: you can actually notice the difference in performance if you loop through a tuple rather than a list.
This is looping through a tuple:
for i in (1, 2, 3):
    print(i)
And this is looping through a list:
for i in [1, 2, 3]:
    print(i)
You will find that using the tuple is faster!
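If you want to check the tuple-versus-list claim on your own machine rather than take it on faith, timeit works here too (a sketch; on most CPython builds any difference is small):
from timeit import timeit

# Compare iterating over a tuple literal vs. a list literal
print(timeit("for i in (1, 2, 3): pass"))
print(timeit("for i in [1, 2, 3]: pass"))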

Python: Easy way to loop through dictionary parameters from a list of evaluated strings?

I have a dictionary created from a json file. This dictionary has a nested structure and every few weeks additional parameters are added.
I use a script to generate additional copies of the existing parameters when I want multiple "legs" added. First I add the additional legs: say I start with one leg as my template and I want 10 legs, I just clone that leg nine more times and add it to the list.
Then I loop through each of the parameters (called attributes) and have to clone certain elements for each leg that was added so that there is a 1:1 match. I don't care about the content, so cloning the first leg's value is fine.
So I do the following:
while len(data['attributes']['groupA']['params']['weights']) < legCount:
    data['attributes']['groupA']['params']['weights'].append(data['attributes']['groupA']['params']['weights'][0])

while len(data['attributes']['groupB']['paramsGroup']['factors']) < legCount:
    data['attributes']['groupB']['paramsGroup']['factors'].append(data['attributes']['groupB']['paramsGroup']['factors'][0])

while len(data['attributes']['groupC']['items']['delta']) < legCount:
    data['attributes']['groupC']['items']['delta'].append(data['attributes']['groupC']['items']['delta'][0])
What I'd like to do is make these attributes all strings and just loop through them dynamically so that when I need to add additional ones, I can just paste one string into my list and it works without having another while loop.
So I converted it to this:
attribs = [
    "data['attributes']['groupA']['params']['weights']",
    "data['attributes']['groupB']['paramsGroup']['factors']",
    "data['attributes']['groupC']['items']['delta']",
    "data['attributes']['groupD']['xxxx']['yyyy']"
]

for attrib in attribs:
    while len(eval(attrib)) < legCount:
        eval(attrib).append(eval(attrib)[0])
In this case eval is safe because there is no user input, just a defined list of entries. That said, I wouldn't mind finding an alternative to eval either.
It works up until the last line: the .append doesn't seem to take effect on the eval() result. It's not throwing an error... it's just not appending to the element.
Any ideas on the best way to handle this?
Not 100% sure this will fix it, but I do notice one thing. In your first version, the while condition accesses:
data['attributes']['groupA']['params']['weights']
but then you append to:
data['attributes']['groupA']['params']['legs']
In the second version it looks like you are appending to 'weights' on the first iteration. This doesn't explain the other attributes you are evaluating... just one red flag I noticed.
Actually my code was working. I was just checking the wrong variable. Thanks Me! :)
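For anyone who lands here wanting to avoid eval altogether: one common alternative is to store each attribute as a list of keys rather than a code string, and walk the nested dicts with a small helper. A minimal sketch, assuming the same data and legCount as above (the names key_paths and resolve are made up for illustration):
# Each entry is a path of keys into the nested dict, not a code string
key_paths = [
    ['attributes', 'groupA', 'params', 'weights'],
    ['attributes', 'groupB', 'paramsGroup', 'factors'],
    ['attributes', 'groupC', 'items', 'delta'],
]

def resolve(d, path):
    # Walk down the nested dicts and return the innermost value (a list here)
    for key in path:
        d = d[key]
    return d

for path in key_paths:
    target = resolve(data, path)
    while len(target) < legCount:
        target.append(target[0])
Adding a new attribute is then just a matter of pasting one more key path into the list, with no extra while loop and no eval.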

How do I convert a GP to a string and back again using DEAP in Python?

I'm doing a project in genetic programming and I need to be able to convert a genetic program (of class deap.creator.Individual) to a string, change some things (while keeping the program 100% syntactically aligned with DEAP), and then put it back into a population of individuals for further evolution.
However, I've only been able to convert my string back to class gp.PrimitiveTree using the from_string method.
The only constructors for creator.Individual I can see either generate entire populations blindly or construct an Individual from existing Individuals; there is no method to create a single individual from an existing gp.PrimitiveTree.
So, does anybody have any idea how I go about that?
Note: Individual is self-defined, but it is standard across all DEAP examples and is created using
creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", gp.PrimitiveTree, fitness=creator.FitnessMax)
After many, many hours I believe I've figured this out.
I'd become confused between two of the DEAP modules: 'creator' and 'toolbox'.
In order to create an individual with a given PrimitiveTree, I simply needed to do:
creator.Individual(myPrimitiveTree)
What you do not do is:
toolbox.individual(myPrimitiveTree)
as that usually gets set up as the initialiser itself, and thus doesn't take arguments.
I hope this saves somebody a decent chunk of time at some point in the future.
Individual to string: str(individual)
To create an Individual from a string: PrimitiveTree has the class method from_string:
https://deap.readthedocs.io/en/master/api/gp.html#deap.gp.PrimitiveTree.from_string
In your DEAP evolution, to create an individual from a string, you can try something like (note the use of creator vs toolbox):
creator.Individual.from_string("add(IN0, IN1)", pset)
But the individual's expression, as a string, needs to look exactly as it would if you did str(individual), i.e. stick to your pset when creating your string. For the example string above (argument names are numbered from 0), you would need a pset similar to:
pset = gp.PrimitiveSetTyped("MAIN", [float]*2, float, "IN")
pset.addPrimitive(operator.add, [float, float], float)
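Putting the two answers together, a minimal end-to-end round trip might look like this (a sketch; the pset is the hypothetical two-input one from the answer above):
import operator
from deap import base, creator, gp

# Hypothetical primitive set matching the example string
pset = gp.PrimitiveSetTyped("MAIN", [float] * 2, float, "IN")
pset.addPrimitive(operator.add, [float, float], float)

creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", gp.PrimitiveTree, fitness=creator.FitnessMax)

# String -> Individual (from_string is inherited from gp.PrimitiveTree)
ind = creator.Individual.from_string("add(IN0, IN1)", pset)

# Individual -> string, edit if needed, then back again
expr = str(ind)  # "add(IN0, IN1)"
clone = creator.Individual.from_string(expr, pset)
print(type(clone), expr)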

Python conciseness confuses me

I have been looking at Pandas: run length of NaN holes, and this code fragment from the comments in particular:
Series([len(list(g)) for k, g in groupby(a.isnull()) if k])
As a python newbie, I am very impressed by the conciseness but not sure how to read this. Is it short for something along the lines of
myList = []
for k, g in groupby(a.isnull()):
    if k:
        myList.append(len(list(g)))
Series(myList)
In order to understand what is going on, I was trying to play around with it, but I get an error:
'list' object is not callable
so not much luck there.
It would be lovely if someone could shed some light on this.
Thanks,
Anne
You've got the translation correct. However, the code you give cannot be run, because a is a free variable.
My guess is that you are getting the error because you have assigned a list object to the name list. Don't do that: list is the global name for the list type.
Also, in future please always provide the full stack trace, not just one part of it, and please provide enough code that at least there are no free variables.
If that is all of your code, then you have only a few possibilities:
myList.append is really a list
len is really a list
list is really a list
isnull is really a list
groupby is really a list
Series is really a list
The error exists somewhere behind groupby.
I'm going to go ahead and strike out myList.append (because that is impossible) and Series: unless you are importing Series from somewhere strange, or you are re-assigning the variable, we know Series can't be a list. A similar argument can be made for a.isnull.
So that leaves two real possibilities: either you have re-assigned something somewhere in your script to be a list where it shouldn't be, or the error is behind groupby (for example, if you are using your own groupby function for some reason).
I think you're using the wrong groupby. itertools.groupby takes an array or list as an argument, while groupby in pandas may evaluate its first argument as a function. I suspect this especially because isnull() returns an array-like object.
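To see the one-liner in action, here is a minimal self-contained version, assuming a is a pandas Series and groupby comes from itertools (the example data is made up):
from itertools import groupby

import numpy as np
from pandas import Series

a = Series([1, np.nan, np.nan, 2, np.nan, 3])

# groupby(a.isnull()) groups consecutive equal booleans;
# k is True exactly for the runs of NaN values
runs = Series([len(list(g)) for k, g in groupby(a.isnull()) if k])
print(runs.tolist())  # [2, 1]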

array vs hash key search

So I'm a longtime perl scripter who's been getting used to python since I changed jobs a few months back. Often in perl, if I had a list of values that I needed to check a variable against (simply to see if there is a match in the list), I found it easier to generate hashes to check against, instead of putting the values into an array, like so:
$checklist{'val1'} = undef;
$checklist{'val2'} = undef;
...
if (exists $checklist{$value_to_check}) { ... }
Obviously this wastes some memory because of the need for a useless right-hand value, but IMO it is more efficient and easier to code than looping through an array.
Now in Python, the code for this is exactly the same whether you're searching a list or a dictionary:
if value_to_check in checklist_which_can_be_list_or_dict:
    <code>
So my real question is: in Perl, the hash method was preferred for speed over iterating through an array, but is this also true in Python? Given that the code is the same, I'm wondering whether Python handles list iteration better. Should I still use the dictionary method for larger lists?
Dictionaries are hashes. An in test on a list has to walk through every element to check it against, while an in test on a dictionary uses hashing to see if the key exists. Python just doesn't make you explicitly loop through the list.
Python also has a set datatype. It's basically a hash/dictionary without the right-hand values. If what you want is to be able to build up a collection of things, then test whether something is already in that collection, and you don't care about the order of the things or whether a thing is in the collection multiple times, then a set is exactly what you want!
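A direct Python translation of the Perl idiom, using a set instead of a dictionary with dummy values (a minimal sketch):
# Build the collection once...
checklist = {"val1", "val2"}

value_to_check = "val1"

# ...then membership testing is an average O(1) hash lookup,
# just like Perl's exists on a hash
if value_to_check in checklist:
    print("found it")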
