So I've been following a guide I got off Reddit about understanding Python's bracket requirements:
Is it a list? Then use brackets.
Is it a dict? Then use braces.
Otherwise, you probably want parentheses.
However, I've come across something that the above can't explain:
df.groupby('Age')['Salary'].mean()
In this case, both Age & Salary are lists (they are both columns from a df), so why do we use parentheses for Age and brackets for Salary?
Additionally, why is there a dot before mean, but not in-between ('Age') and ['Salary']?
I realise the questions I'm asking may be fairly basic. I'm working my way through Python Essential Reference (4th ed) Developer's Library. If anyone has any sources dealing with my kind of questions it would be great to see them.
Thanks
If you'll forgive me for answering the important question rather than the one you asked...
That's a very compact chain. Break it into separate lines and then use the Debugging view of an IDE to step through it to understand the datatypes involved.
query_method = df.groupby                    # a bound method of the DataFrame
query_string = 'Age'
query_return = query_method(query_string)    # a DataFrameGroupBy object
data = query_return['Salary']                # a SeriesGroupBy object
data_mean = data.mean()                      # a Series: mean Salary per Age
Step through in the PyCharm Debugger and you can see type for every variable.
There is a lot of context here that can be found in the pandas dataframe documentation.
To start off, df is an object of class pandas.DataFrame.
pandas.DataFrame has a function called groupby that takes some input. In your example, the input is 'Age'. When you pass arguments to a function it looks like this:
my_function(input)
When you have more than one input, the common way to pass them is as multiple arguments, like this:
my_function(input1, input2, etc, ...)
pandas.DataFrame.groupby(...) returns an object that is subscriptable or sliceable. Using slice notation is like accessing an element in a list or a dict, like this:
my_list = [1,2,3]
print(my_list[0]) # --> 1
my_dict = {
"a": "apple",
"b": "banana",
"c": "cucumber"
}
print(my_dict["b"]) # --> banana
Coming back to your specific question:
df.groupby('Age')['Salary'].mean()
df # df, the name of your DataFrame variable
.groupby('Age') # call the function groupby to get the frame grouped by the column 'Age'
['Salary'] # access the 'Salary' element from that groupby
.mean() # and apply the mean() function to the 'Salary' element
So it appears that you are getting all the mean salaries grouped by employee age.
I hope this helps to explain it.
both Age & Salary are lists (they are both columns from a df),
They're Series / columns, not lists. The groupby function of a DataFrame returns an indexed object. Calling methods requires parentheses, like print(). You can use square brackets to access indexed data (ref. dict() objects).
The period and parentheses afterwards are another function call.
why is there a dot before mean, but not in-between ('Age') and ['Salary']
The short answer is that foo.['bar'] is not valid syntax.
But df.groupby("Age").some_func() certainly could be done, depending on the available functions on that object.
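For example, a quick sketch of the three forms side by side (assuming a numeric Salary column exists):
df.groupby("Age").mean()              # dot + parentheses: a method call on the grouped object (may need numeric_only=True on newer pandas)
df.groupby("Age")["Salary"]           # brackets, no dot: subscripting to pick a column
df.groupby("Age")["Salary"].mean()    # both combined, as in the original chain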
I have created a datatable frame as,
import datatable as dt
from datatable import f

DT_EX = dt.Frame({'sales': [103.07, 47.28, 162.15, 84.47, 44.97, 46.97, 34.99, 9.99, 29.99, 64.98],
                  'quantity': [6, 2, 8, 3, 3, 3, 1, 1, 1, 2],
                  'customer_lifecycle_status': ['Lead', 'First time buyer', 'Active customer', 'Defecting customer',
                                                'Lead', 'First time buyer', 'Lead', 'Lead', 'Lead', 'Lead']})
Now I'm trying to select only 2 fields from the datatable as,
DT_EX[:, f.sales, f.quantity]
In this case, it displays the columns in the order quantity, sales, whereas it should display them in the specified order (sales, quantity). Another observation from this output is that the quantity column gets sorted in ascending order.
Keeping this case aside, I have now tried passing the required fields in parentheses:
DT_EX[:, (f.sales,f.quantity)]
Here it produces the correct output, without any sorting or jumbling of the fields.
So it seems it's always recommended to pass the fields to be selected in parentheses.
Finally, I would be interested to know what happened in the first case. Would you please explain it?
The primary syntax of datatable is
DT[i, j, by, ...]
That is, when you write a sequence of expressions in the square brackets, the first one is interpreted as i (row filter), the second as j (column selector), the third as by (group-by variable).
Normally, you would use the by() function to express a group-by condition, but the old syntax allowed specifying a bare column name as the third item in DT[], and it was interpreted as a group-by variable. Such use is considered deprecated nowadays, and may eventually get removed, but at least for now it is what it is.
Thus, when you write DT_EX[:, f.sales, f.quantity], the quantity column is interpreted as a group by condition (and since j does not have any reduction operations, it works essentially as a sort). Another effect of using a grouping variable is that it is moved to the front of the resulting frame, which essentially means you'll see columns (quantity, sales) in the "opposite" order of how they were listed.
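A minimal sketch of that equivalence, using datatable's by() helper (assuming it is imported):
from datatable import by

DT_EX[:, f.sales, f.quantity]        # deprecated: bare third item is treated as a grouping variable
DT_EX[:, f.sales, by(f.quantity)]    # the explicit, equivalent spelling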
If all you need is to select 2 columns from a frame, however, then you need to make sure those 2 columns are both within the j position in the list of arguments to DT[...]. This can be done via a list, or a tuple, or a dictionary:
DT_EX[:, [f.sales, f.quantity]]
DT_EX[:, (f.sales, f.quantity)]
DT_EX[:, {"SALES": f.sales, "QUANT": f.quantity}]
I'm trying to write something that answers "what are the possible values in every column?"
I created a dictionary called all_col_vals and iterate from 1 to however many columns my dataframe has. However, when reading about this online, someone stated this looked too much like Java and the more pythonic way would be to use zip. I can't see how I could use zip here.
all_col_vals = {}
for index in range(RCSRdf.shape[1]):
    all_col_vals[RCSRdf.iloc[:, index].name] = set(RCSRdf.iloc[:, index])
The output looks like 'CFN Network': {nan, 'N521', 'N536', 'N401', 'N612', 'N204'}, 'Exam': {'EXRC', 'MXRN', 'HXRT', 'MXRC'} and shows all the possible values for that specific column. The key is the column name.
I think @piRSquared's comment is the best option, so I'm going to steal it as an answer and add some explanation.
Answer
Assuming you don't have duplicate columns, use the following:
{k : {*df[k]} for k in df}
Explanation
k represents a column name in df. You don't have to use the .columns attribute to access them, because iterating over a pandas.DataFrame yields its column names, much like iterating over a Python dict yields its keys.
df[k] represents the series k
{*df[k]} unpacks the values from the series and places them in a set ({}) which only keeps distinct elements by definition (see definition of a set).
Lastly, using a dict comprehension to create the dict is faster than defining an empty dict and adding new keys to it via a for-loop.
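A minimal sketch on a toy DataFrame (the column names and values here are made up purely for illustration):
import pandas as pd

df = pd.DataFrame({"Exam": ["EXRC", "MXRN", "EXRC"], "Site": ["A", "B", "A"]})
all_col_vals = {k: {*df[k]} for k in df}
print(all_col_vals)  # e.g. {'Exam': {'EXRC', 'MXRN'}, 'Site': {'A', 'B'}} (set order may vary)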
I am new to pandas and I am creating new columns based on conditions from other existing columns using the following code:
df.loc[(df.item1_existing=='NO') & (df.item1_sold=='YES'),'unit_item1']=1
df.loc[(df.item2_existing=='NO') & (df.item2_sold=='YES'),'unit_item2']=1
df.loc[(df.item3_existing=='NO') & (df.item3_sold=='YES'),'unit_item3']=1
Basically, what this means is that if item is NOT existing ('NO') and the item IS sold ('YES') then give me a 1. This works to create 3 new columns but I am thinking there is a better way. As you can see, there is a repeated string in the name of the columns: '_existing' and '_sold'. I am trying to create a for loop that will look for the name of the column that ends with that specific word and concatenate the beginning, something like this:
unit_cols = ['item1','item2','item3']
for i in unit_cols:
    df.loc[('df.'+i+'_existing'=='NO') & ('df'+i+'_sold'=='YES'),'unit_'+i]=1
but of course, it doesn't work. As I said, I am able to make it work with the initial example, but I would like to have fewer lines of code instead of repeating the same code, because I need to create several columns this way, not just three. Is there a way to make this easier? Is the for loop the best option? Thank you.
You can use Boolean series, i.e. True / False depending on whether your condition is met. Coupled with pd.Series.eq and f-strings (PEP498, Python 3.6+), and using __getitem__ (or its syntactic sugar []) to allow string inputs, you can write your logic more readably:
unit_cols = ['item1','item2','item3']
for i in unit_cols:
    df[f'unit_{i}'] = df[f'{i}_existing'].eq('NO') & df[f'{i}_sold'].eq('YES')
If you need integers (1 / 0) instead of Boolean values, you can convert via astype:
df[f'unit_{i}'] = df[f'unit_{i}'].astype(int)
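For instance, a small sketch of what the loop produces on a hypothetical frame (column names borrowed from the question, data invented):
import pandas as pd

df = pd.DataFrame({
    "item1_existing": ["NO", "YES", "NO"],
    "item1_sold":     ["YES", "YES", "NO"],
})

for i in ["item1"]:
    df[f"unit_{i}"] = (df[f"{i}_existing"].eq("NO") & df[f"{i}_sold"].eq("YES")).astype(int)

print(df["unit_item1"].tolist())  # [1, 0, 0]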
I would like to compare a column from several pairs of pandas dataframes and write the shared values to an empty list. I have written a function that can do this with a single pair of dataframes, but I cannot seem to scale it up.
def parser(dataframe1, dataframe2, emptylist):
    for i1 in dataframe1['POS']:
        for i2 in dataframe2['POS']:
            if i1 == i2:
                emptylist.append(i1)
Where 'POS' is a column header in the two pandas dataframes.
I have made a list of variable names for each input value of this function, eg.
dataframe1_names=['name1','name2',etc...]
dataframe2_names=['name1','name2',etc...]
emptylist_names=['name1','name2',etc...]
Where each element of the list is a string containing the name of a variable (either a pandas dataframe in the case of the first two, or an empty list in the case of the last).
I have tried to iterate through these lists using the following code:
import itertools
for a, b, c in zip(range(len(dataframe1_names)), range(len(dataframe2_names)), range(len(emptylist_names))):
    parser(dataframe1_names[a], dataframe2_names[b], emptylist_names[c])
But this returns TypeError: string indices must be integers.
I believe that this error is coming from passing the function a string containing the variable name instead of the variable name itself. Is there another way to pass multiple variables to a function in an automated way?
Thanks for your help!
Do you have to use strings of object names, instead of just the objects themselves? If you do
dataframes1=[name1,name2,...]
dataframes2=[name1,name2,...]
emptylists=[name1,name2,...]
Then you can just do
for a, b, c in zip(dataframes1, dataframes2, emptylists):
    parser(a, b, c)
The way you do this is really circuitous and unpythonic, by the way, so I've changed it a bit. Rather than getting lists of indexes for the for statement, I just iterate through the lists (and thus the objects) themselves. This is much more compact and easier to understand. For that matter, do you have a need to pass the empty list in as an argument (e.g., perhaps they aren't always empty)?

Your code for the parser, while correct, doesn't take advantage of pandas at all and will be very slow: to compare columns, you can simply do dataframe1['COL'] == dataframe2['COL'], which will give you a boolean series of where values are equal. Then you can use this for indexing a dataframe, to get the shared values. It comes out as a dataframe or series, but it's easy enough to convert to a list. Thus, your parser function can be reduced to the following, if you don't need to create the "empty list" elsewhere first:
def parser(df1, df2):
    return list(df1['COL'][df1['COL'] == df2['COL']])
This will be much, much faster, though as it returns the list, you'll have to do something with it, so in your case, you'd do something like:
sharedlists = [ parser(a,b) for a,b in zip( dataframes1, dataframes2 ) ]
If you must use variable names, the following very unsafe sort of code will convert your lists of names into lists of objects (you'll need to do this for each list):
dataframes1 = [ eval(name) for name in dataframe1_names ]
If this is just for numerical work you're doing in an interpreter, eval is alright, but for any code you're releasing, it's very insecure: it will evaluate whatever code is in the string passed into it, thus allowing arbitrary code execution.
This sounds like a use case of .query()
A use case for query() is when you have a collection of DataFrame
objects that have a subset of column names (or index levels/names) in
common. You can pass the same query to both frames without having to
specify which frame you’re interested in querying
map(lambda frame: frame.query(expr), [df, df2])
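As a sketch (the expression and frame names below are assumed, not from the question; note that in Python 3 map returns an iterator, so wrap it in list() to materialize the results):
expr = "POS > 100"   # hypothetical filter expression
filtered = list(map(lambda frame: frame.query(expr), [df1, df2]))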
What kind of output are you looking for in the case where you have more than two DataFrame objects? In the case of just two, the following line would accomplish what your parser function does:
common = df1[df1["fieldname"] == df2["fieldname"]]["fieldname"]
except that common would be a pandas Series rather than a list, but you can easily get a list from it by doing list(common).
If you're looking for a function that takes any number of DataFrames and returns a list of common values in some field for each pair, you could do something like this:
from itertools import combinations
def common_lists(field, *dfs):
    return [df1[df1[field] == df2[field]][field] for df1, df2 in combinations(dfs, 2)]
The same deal about getting a list from a Series applies here, since you'll be getting a list of Series objects (one per pair of frames).
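Hypothetical usage (the frame and column names here are assumed for illustration):
shared = common_lists("POS", df_a, df_b, df_c)   # one Series per pair: (a,b), (a,c), (b,c)
shared_as_lists = [list(s) for s in shared]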
As far as this bit:
import itertools
for a, b, c in zip(range(len(dataframe1_names)), range(len(dataframe2_names)), range(len(emptylist_names))):
    parser(dataframe1_names[a], dataframe2_names[b], emptylist_names[c])
What you're doing is creating a list that looks something like this:
[(0,0,0), (1,1,1), ... (n,n,n)]
where n is the length of the shortest of dataframe1_names, dataframe2_names, and emptylist_names. So on the first iteration of the loop, you have a == b == c == 0, and you're using these values to index into your arrays of data frame variable names, so you're calling parser("name1", "name1", "name1"), passing it strings instead of pandas DataFrame objects. Your parser function is expecting DataFrame objects so it barfs when you try to call dataframe1["POS"] where dataframe1 is the string "name1".
I have a list of dictionaries, which looks something like:
abc = [{"name":"bob",
"age": 33},
{"name":"fred",
"age": 18},
{"name":"mary",
"age": 64}]
Let's say I want to look up Bob's age. I know I can run a for loop through it, etc. However, my question is: are there any quicker ways of doing this?
One thought is to use a loop but break out of the loop once the lookup (in this case the age for bob) has been completed.
The reason for this question is that my datasets are thousands of lines long, so I'm looking for any performance gains I can get.
Edit: I can see you can use the following via a generator expression; however, I'm not too sure whether this would still iterate over all items of the list or just iterate until the first dict containing the name "bob" is found?
next(item for item in abc if item["name"] == "bob")
Thanks,
Depending on how many times you want to perform this operation, it might be worth defining a dictionary mapping names to the corresponding age (or to the list of corresponding ages if more than one person can share the same name).
A dictionary comprehension can help you:
abc_dict = {x["name"]:x["age"] for x in abc}
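The lookup itself then becomes a plain dict access (reusing the abc list from the question):
abc_dict["bob"]          # 33
abc_dict.get("alice")    # None if the name is missing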
I'd consider making another dictionary and then using that for multiple age lookups:
age_by_name = {}
for person in abc:
    age_by_name[person['name']] = person['age']
age_by_name['bob']
# this is a quick lookup!
Edit: This is equivalent to the dict comprehension listed in Josay's answer
Try indexing it first (once), and then using the index (many times).
You can index it, e.g., by using a dict (the keys would be what you are searching by, while the values would be what you are searching for), or by putting the data in a database. That should cover the case where you really have a lot more lookups and rarely need to modify the data.
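A minimal sketch of such an index over the abc list from the question:
index = {person["name"]: person for person in abc}   # build once
index["bob"]["age"]                                   # 33, reused many times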
Define a dictionary of dictionaries like this:
peoples = {"bob":{"name":"bob","age": 33},
"fred":{"name":"fred","age": 18},
"mary": {"name":",mary","age": 64}}
person = peoples["bob"]
persons_age = person["age"]
look up "bob" then look up like "age"
this is correct no ?
You might write a helper function. Here's a take.
import itertools
# First returns the first element encountered in an iterable which
# matches the predicate.
#
# If the element is never found, StopIteration is raised.
# Args:
#   pred  The predicate which determines a matching element.
#   seq   The iterable to search.
#
first = lambda pred, seq: next(itertools.dropwhile(lambda x: not pred(x), seq))
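Hypothetical usage with the abc list from the question:
bob = first(lambda d: d["name"] == "bob", abc)
bob["age"]   # 33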