Python conciseness confuses me - python

I have been looking at Pandas: run length of NaN holes, and this code fragment from the comments in particular:
Series([len(list(g)) for k, g in groupby(a.isnull()) if k])
As a python newbie, I am very impressed by the conciseness but not sure how to read this. Is it short for something along the lines of
myList = []
for k, g in groupby(a.isnull()):
    if k:
        myList.append(len(list(g)))
Series(myList)
In order to understand what is going on I was trying to play around with it but get an error:
list object is not callable
so not much luck there.
It would be lovely if someone could shed some light on this.
Thanks,
Anne

You've got the translation correct. However, the code you give cannot be run because a is a free variable.
My guess is that you are getting the error because you have assigned a list object to the name list. Don't do that, because list is the built-in name for the list type, and rebinding it hides the type.
Also, in future please always provide a full stack trace, not just one part of it. Please also provide sufficient code that at least there are no free variables.
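As a minimal sketch of both points, the snippet below runs the comprehension on made-up data (the Series a here is a hypothetical stand-in, since the question does not show it) and then reproduces the error by rebinding the name list inside a helper function.

```python
from itertools import groupby

import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the asker's `a`.
a = pd.Series([1.0, np.nan, np.nan, 3.0, np.nan, 5.0])

# groupby yields one (key, group) pair per run of equal consecutive values.
# The key is True for a run of NaNs; len(list(g)) is the length of that run.
run_lengths = pd.Series([len(list(g)) for k, g in groupby(a.isnull()) if k])
print(run_lengths.tolist())


def shadowed_call():
    """Rebind `list` locally, then try to call it like the built-in type."""
    list = [1, 2, 3]  # this local name now hides the built-in list type
    try:
        list("abc")
    except TypeError as exc:
        return str(exc)


error_message = shadowed_call()
print(error_message)
```

If the name list was rebound in an interactive session, del list (or restarting the interpreter) makes the built-in visible again.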

If that is all of your code, then you have only a few possibilities:
myList.append is really a list
len is really a list
list is really a list
isnull is really a list
groupby is really a list
Series is really a list
The error exists somewhere behind groupby.
I'm going to go ahead and strike out myList.append (because that is impossible unless you are using your own groupby function for some reason) and Series. Unless you are importing Series from somewhere strange, or you are re-assigning the variable, we know Series can't be a list. A similar argument can be made for a.isnull.
So that leaves us with two real possibilities. Either you have re-assigned something somewhere in your script to be a list where it shouldn't be, or the error is behind groupby.
I think you're using the wrong groupby: itertools.groupby takes an iterable such as an array or list as its argument, whereas groupby in pandas may evaluate the first argument as a function. I especially think this because isnull() returns an array-like object.

Related

Problem re-creating a forloop from a Kaggle competition

Recently for a course at University, our teacher asked us to re-create one of Kaggle's competitions. I chose to do this one.
I was able to follow the tutorial relatively well, until I reached the for loop they wrote to clean the text in the data frame. Here it is:
# Get the number of reviews based on the dataframe column size
num_reviews = train["review"].size
# Initialize an empty list to hold the clean reviews
clean_train_reviews = []
# Loop over each review; create an index i that goes from 0 to the length
# of the movie review list
for i in xrange( 0, num_reviews ):
    # Call our function for each one, and add the result to the list of
    # clean reviews
    clean_train_reviews.append( review_to_words( train["review"][i] ) )
My problem is when they use 'xrange' in the loop, since that was never assigned to anything during the tutorial and is, therefore, returning an error when I try to run the code. I also checked the code provided by the author on GitHub and they did the exact same thing. So, my question is: is this simply a mistake on their end or am I missing something? If I am not missing anything, what should be in the place of 'xrange'?
I've tried assigning the relevant dataframe column to a variable I can then use here, but I then get a TypeError, stating 'Series' object is not callable. My knowledge in Python is a bit elementary still, so I apologize if I am simply missing something obvious. Appreciate any help!
range and xrange are both built-in Python functions, but xrange exists only in Python 2. In Python 3 it was removed, and range itself returns a lazy range object.
In Python 2 the two are implemented differently and have different characteristics. The points of comparison are:
Return type:
range() returns a list; xrange() returns an xrange object.
Memory:
range() builds the whole list in memory, so it is large for big ranges; xrange() stores only start, stop and step, so it stays small.
Operation usage:
As range() returns a list, all list operations can be used on it. xrange() returns an xrange object, so list operations such as slicing cannot be applied to it, which is a disadvantage.
Speed:
Because xrange() produces values lazily, only as the loop asks for them, it avoids building the list up front and is usually faster to create than range().
You are probably having this issue because you are using Python 3.
Just use the range function instead.
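Under Python 3, the loop from the question can be written with range. The sketch below is only illustrative: train and review_to_words are hypothetical stand-ins for the tutorial's data frame and cleaning function, defined here so that the snippet runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for the tutorial's training data.
train = pd.DataFrame({"review": ["Great movie!", "Not  good."]})


def review_to_words(raw_review):
    """Placeholder cleaner (an assumption): lowercase and collapse whitespace."""
    return " ".join(raw_review.lower().split())


num_reviews = train["review"].size

clean_train_reviews = []
for i in range(0, num_reviews):  # range replaces Python 2's xrange
    clean_train_reviews.append(review_to_words(train["review"][i]))

# The same result without an explicit index: iterate over the column directly.
clean_direct = [review_to_words(review) for review in train["review"]]
print(clean_train_reviews)
```

Iterating over the column directly also avoids relying on the index labels running from 0 to num_reviews - 1.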

not able to change object to float in pandas dataframe

I just started learning Python. I am trying to change a column's data type from object to float so that I can take the mean. I have tried changing [] to () and even the quotation marks; I don't know whether that makes a difference or not. Please help me figure out what the issue is. Thanks!
My code:
df["normalized-losses"]=df["normalized-losses"].astype(float)
The error I see was attached as an image.
Use:
df['normalized-losses'] = df['normalized-losses'][~(df['normalized-losses'] == '?' )].astype(float)
Using df.normalized-losses leads to the interpreter evaluating df.normalized, which doesn't exist: the statement is parsed as (df.normalized) - (losses.astype(float)). There also appears to be a question mark in your data, which can't be converted to float. The statement above converts to float only those rows which don't contain a question mark; the remaining rows end up as NaN, because the assignment realigns on the index. If you would rather not have missing values, you can replace the question marks with 0 using:
df['normalized-losses'] = df['normalized-losses'].replace('?', 0.0)
df['normalized-losses'] = df['normalized-losses'].astype(float)
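An alternative sketch, assuming the only problem values are placeholder strings such as '?': pd.to_numeric with errors="coerce" converts what it can and turns everything else into NaN, which mean() skips by default. The sample values below are made up for illustration.

```python
import pandas as pd

# Hypothetical sample resembling the column in the question.
df = pd.DataFrame({"normalized-losses": ["164", "?", "158", "?", "192"]})

# errors="coerce" replaces any value that cannot be parsed with NaN.
df["normalized-losses"] = pd.to_numeric(df["normalized-losses"], errors="coerce")

# NaN entries are ignored by mean() by default (skipna=True).
mean_loss = df["normalized-losses"].mean()
print(df["normalized-losses"].dtype, mean_loss)
```

Unlike replacing '?' with 0, this keeps the placeholder rows from pulling the mean down.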
Welcome to Stack Overflow, and good luck on your Python journey! An important part of coding is learning how to interpret error messages. In this case, the traceback is quite helpful: it is telling you that you cannot access normalized on df, since a DataFrame does not have an attribute of this name.
Of course you weren't trying to call something called normalized, but rather the normalized-losses column. The way to do this is as you already did once - df["normalized-losses"].
As to your main problem: if even one of your values can't be converted to a float, the column-wide operation will fail. This is very common. You need to first eliminate all of the non-numerical items in the column. One way to find them is with df[~df['normalized-losses'].str.isnumeric()].
The expression df.normalized-losses does not signify anything to Python in this case. You can replace it with df["normalized-losses"]. Usually, if you try
df["normalized-losses"]=df["normalized-losses"].astype(float)
this should work. It takes the normalized-losses column from the dataframe, converts it to float, and reassigns it to the normalized-losses column in the same dataframe. Sometimes, though, the data needs some cleaning before the statement above will run.
You can't use - in an attribute or variable name. Perhaps you mean normalized_losses?

Pandas -- String of boolean conditions -- SettingWithCopyWarning

I am wondering if anyone can assist me with this warning I get in my code. The code DOES score items correctly, but this warning is bugging me and I can't seem to find a good fix, given that I need to string a few boolean conditions together.
Background: Imagine that I have a magical fruit identifier and I have a csv file that lists what fruit was identified and in which area (1, 2, etc.). I read in the csv file with columns of "FruitID" and "Area." An identification of "APPLE" or "apple" in Zone 1 is scored as correct/true (other identified fruits are incorrect/false). I apply similar logic for other areas, but I won't get into that.
Any ideas for how to correct this? Should I use .loc? I'm not sure that it will work with multiple booleans. Thanks!
My code snippet that initiates the CopyWarning:
Area1_ID_df['Area 1, Score']=(Area1_ID_df['FruitID']=='APPLE')|(Area1_ID_df['FruitID']=='apple')
Stacktrace:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Pandas finds it ambiguous what you are trying to do. Certain operations return a view of the dataset, whereas other operations make a copy of the dataset. The confusion is whether you want to modify a copy of the dataset or whether you want to modify the original dataset or are trying to create something new.
https://www.dataquest.io/blog/settingwithcopywarning/ is a great link to learn more about the problem you are having.
If the line that's causing this error is truly: s = t | u, where t and u are Boolean series indexed consistently, you should not worry about SettingWithCopyWarning.
This is a warning rather than an error. The latter indicates there is a problem, the former indicates there may be a problem. In this case, Pandas guesses you may be working with a copy rather than a view.
If the result is as you expect, you can safely ignore the warning.
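One common way to make the intent explicit is sketched below. It assumes, since the question does not show it, that Area1_ID_df was produced by filtering a larger frame; the frame df and its values are hypothetical. Taking an explicit .copy() tells pandas that the filtered frame is independent, and isin stands in for the chained == comparisons.

```python
import pandas as pd

# Hypothetical data with the columns described in the question.
df = pd.DataFrame({
    "FruitID": ["APPLE", "apple", "pear", "Apple"],
    "Area": [1, 1, 1, 2],
})

# .copy() creates an independent frame, so adding a column to it is unambiguous.
Area1_ID_df = df[df["Area"] == 1].copy()

# isin() is equivalent to (col == 'APPLE') | (col == 'apple').
Area1_ID_df["Area 1, Score"] = Area1_ID_df["FruitID"].isin(["APPLE", "apple"])
print(Area1_ID_df)
```

Boolean conditions combined with | or & also work inside .loc, so stringing several of them together is not an obstacle to using it.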

Could someone explain how this piece of Python code works? (sorted list with lambda function)

I was trying to sort a list of lists on the second item in each of the lists in my "unsorted list", and found this piece of code. It works, but even though I have read about lambda functions I'm having some problems wrapping my head around how it works. Could someone explain how it works, and maybe give me some input on whether this is a good way of sorting a list of lists or whether I should use a different approach? Thanks in advance!
sorted_list = sorted(unsorted_list,key=lambda l:l[1])
I don't recognize the language, but it seems that the code is sorting an array of arrays, using the element at index 1 as the sorting key.
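To make the role of the lambda concrete, here is a small sketch with made-up data. sorted calls the key function once per element and orders the elements by the returned values; the lambda is simply an unnamed function that returns the item at index 1. The named function second_item is a hypothetical equivalent, and operator.itemgetter is the standard-library alternative.

```python
from operator import itemgetter

# Hypothetical sample data: each inner list is [name, rank].
unsorted_list = [["pear", 3], ["apple", 1], ["fig", 2]]

# The form from the question: sort by the second item of each inner list.
sorted_list = sorted(unsorted_list, key=lambda l: l[1])


def second_item(inner):
    """Named equivalent of the lambda: return the item at index 1."""
    return inner[1]


sorted_named = sorted(unsorted_list, key=second_item)

# itemgetter(1) builds the same kind of key function.
sorted_getter = sorted(unsorted_list, key=itemgetter(1))
print(sorted_list)
```

sorted returns a new list and leaves unsorted_list untouched; unsorted_list.sort(key=...) would sort in place instead.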

How to access an attribute of an arbitrary element of a set

I have a non-empty set S and every s in S has an attribute s.x which I know is independent of the choice of s. I'd like to extract this common value a=s.x from S. There is surely something better than
s=S.pop()
a=s.x
S.add(s)
-- maybe that code is fast but surely I shouldn't be changing S?
Clarification: some answers and comments suggest iterating over all of S. The reason I want to avoid this is that S might be huge; my method above will I think run quickly however large S is; my only issue with it is that S changes, and I see no reason that I need to change S.
This is almost but not quite the same as this question on getting access to an element of a set when there's only one-- there are solutions which apply there which won't work here, and others which work but are inefficient. But the general trick of using next(iter(something_iterable)) to nondestructively get an element still applies:
>>> S = {1+2j, 2+2j, 3+2j}
>>> next(iter(S))
(2+2j) # Note: could have been any element
>>> next(iter(S)).imag
2.0
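Applied to the situation in the question, the same idiom might look like the sketch below. The Item class and its values are hypothetical; the only assumption carried over from the question is that every element has the same x.

```python
class Item:
    """Hypothetical element type with a shared attribute x."""

    def __init__(self, name, x):
        self.name = name
        self.x = x


S = {Item("a", 42), Item("b", 42), Item("c", 42)}
size_before = len(S)

# iter(S) creates an iterator without copying the set, and next() takes
# only its first element, so the cost does not grow with the size of S.
a = next(iter(S)).x
print(a, len(S) == size_before)
```

Nothing is popped or re-added, so S is never modified.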
