So I've been following a guide I got off reddit about understanding Python's bracket requirements:
Is it a list? Then use brackets.
Is it a dict? Then use braces.
Otherwise, you probably want parentheses.
However, I've come across something that the above can't explain:
df.groupby('Age')['Salary'].mean()
In this case, both Age & Salary are lists (they are both columns from a df), so why do we use parentheses for Age and brackets for Salary?
Additionally, why is there a dot before mean, but not in-between ('Age') and ['Salary']?
I realise the questions I'm asking may be fairly basic. I'm working my way through Python Essential Reference (4th ed) Developer's Library. If anyone has any sources dealing with my kind of questions it would be great to see them.
Thanks
If you'll forgive me for answering the important question rather than the one you asked...
That's a very compact chain. Break it into separate lines and then use the Debugging view of an IDE to step through it to understand the datatypes involved.
query_method = df.groupby
query_string = 'Age'
query_return = query_method(query_string)
data = query_return['Salary']
data_mean = data.mean()
Step through in the PyCharm Debugger and you can see type for every variable.
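If you don't have a debugger handy, printing the types does much the same job. This is only a sketch: the small DataFrame below is made up for illustration, and the variable names are my own rather than anything from the original post.

```python
import pandas as pd

# Made-up data, just to have something to group.
df = pd.DataFrame({
    "Age": [25, 25, 30, 30],
    "Salary": [100.0, 200.0, 300.0, 500.0],
})

grouped = df.groupby("Age")        # method call, hence the parentheses
salary_groups = grouped["Salary"]  # subscript, hence the square brackets
mean_salary = salary_groups.mean() # another method call on the result

# Show the type produced at each step of the chain.
for name, value in [
    ("df", df),
    ("grouped", grouped),
    ("salary_groups", salary_groups),
    ("mean_salary", mean_salary),
]:
    print(name, "->", type(value).__name__)
```

Each step hands a different kind of object to the next one, which is why the punctuation changes along the chain.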
There is a lot of context here that can be found in the pandas dataframe documentation.
To start off, df is an object of class pandas.DataFrame.
pandas.DataFrame has a function called groupby that takes some input. In your example, the input is 'Age'. When you pass arguments to a function it looks like this:
my_function(input)
when you have more than one input, the common way to pass them is as multiple variables, like this
my_function(input1, input2, etc, ...)
pandas.DataFrame.groupby(...) returns an object that is subscriptable. Using subscript (square-bracket) notation is like accessing an element in a list or a dict, like this
my_list = [1,2,3]
print(my_list[0]) # --> 1
my_dict = {
    "a": "apple",
    "b": "banana",
    "c": "cucumber"
}
print(my_dict["b"]) # --> banana
coming back to your specific question:
df.groupby('Age')['Salary'].mean()
df # df, the name of your DataFrame variable
.groupby('Age') # call the function groupby to get the frame grouped by the column 'Age'
['Salary'] # access the 'Salary' element from that groupby
.mean() # and apply the mean() function to the 'Salary' element
So it appears that you are getting the mean salary for each age of employee, returned as a pandas Series indexed by Age.
I hope this helps to explain
both Age & Salary are lists (they are both columns from a df),
They're columns (pandas Series), not lists, and 'Age' and 'Salary' themselves are just strings naming those columns. The groupby function of a DataFrame returns an indexed object. Calling methods requires parentheses, like print(). You can use square brackets to access indexed data (ref. dict objects).
The period and parentheses afterwards are another method call
why is there a dot before mean, but not in-between ('Age') and ['Salary']
Short answer is that foo.['bar'] is not valid syntax
But df.groupby("Age").some_func() certainly could be done, depending on the available functions on that object
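To see why brackets need no dot, it may help to know that obj[key] is shorthand for a call to the object's __getitem__ method. Here is a minimal sketch using a hypothetical Table class (my own stand-in, not a pandas class) that supports both styles:

```python
class Table:
    """Hypothetical stand-in for an object that supports both [] and ()."""

    def __init__(self, columns):
        self._columns = columns

    def __getitem__(self, key):
        # Python calls this method whenever you write table[key].
        return self._columns[key]

    def column_names(self):
        # An ordinary method: reached with a dot, called with parentheses.
        return list(self._columns)


table = Table({"Age": [25, 30], "Salary": [100, 200]})

salary = table["Salary"]            # bracket syntax, no dot
same = table.__getitem__("Salary")  # the equivalent spelled-out method call
names = table.column_names()        # dot to look up the method, () to call it
```

The dot looks up a name on an object, while brackets and parentheses attach directly to whatever expression comes before them, so foo['bar'] and foo.bar() are both fine but foo.['bar'] is not.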
Is it possible to use list comprehension for a dataframe if I want to change one column's value based on the condition of another column's value.
The code I'm hoping to make work would be something like this:
return ['lower_level' for x in usage_time_df['anomaly'] if [y < lower_outlier for y in usage_time_df['device_years']]
Thanks!
I don't think what you want to do can be done in a list comprehension, and if it can, it will definitely not be efficient.
Assuming a dataframe usage_time_df with two columns, anomaly and device_years, if I understand correctly, you want to set the value in anomaly to lower_level when the value in device_years does not reach lower_outlier (which I guess is a float). The natural way to do that is:
usage_time_df.loc[usage_time_df['device_years'] < lower_outlier, 'anomaly'] = 'lower_level'
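Here is that line run on a tiny example. The data, the starting value "none", and the threshold of 1.0 are all assumptions I made up so the snippet is runnable.

```python
import pandas as pd

# Made-up data; only the column names come from the question.
usage_time_df = pd.DataFrame({
    "device_years": [0.5, 3.0, 0.2, 7.5],
    "anomaly": ["none", "none", "none", "none"],
})
lower_outlier = 1.0  # assumed threshold

# Boolean Series: True on rows where device_years is below the threshold.
mask = usage_time_df["device_years"] < lower_outlier

# Assign only on the rows where the mask is True; other rows are untouched.
usage_time_df.loc[mask, "anomaly"] = "lower_level"

print(usage_time_df)
```

The first argument to .loc picks the rows and the second picks the column, so the assignment touches only the cells you selected.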
Can anyone kindly help, please?
I'm trying to remove the first three characters of a string using the statement:
Data['COUNTRY_CODE'] = Data['COUNTRY1'].str[3:]
This will create a new column after removing the first three values of the string. However, I do not want this to be applied to all of the values within the same column so was hoping there would be a way to use a conditional statement such as 'Where' in order to only change the desired strings?
I assume you are using pandas so your condition check can be like:
condition_mask = Data['COL_YOU_WANT_TO_CHECK'] == 'SOME CONDITION'
Your new column can be created as:
# Assuming you want the first 3 chars as COUNTRY_CODE
Data.loc[condition_mask, 'COUNTRY_CODE'] = Data['COUNTRY1'].str[:3]
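For the case described in the question (dropping the first three characters with .str[3:], but only on some rows), a sketch along the following lines might work. The sample values and the "XX-" prefix test are assumptions of mine; substitute whatever condition identifies the rows you want to change.

```python
import pandas as pd

# Made-up sample data; only the column names come from the question.
Data = pd.DataFrame({"COUNTRY1": ["XX-France", "Germany", "XX-Spain"]})

# Hypothetical condition: only rows that start with the prefix "XX-".
condition_mask = Data["COUNTRY1"].str.startswith("XX-")

# Start from a copy of the original column so unmatched rows keep their value
# (otherwise they would be left as NaN in the new column).
Data["COUNTRY_CODE"] = Data["COUNTRY1"]

# Strip the first three characters on the matching rows only.
Data.loc[condition_mask, "COUNTRY_CODE"] = Data.loc[condition_mask, "COUNTRY1"].str[3:]

print(Data)
```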
Please, is there any way to replace "x-y" by "x,x+1,x+2,...,y" in every row in a data frame? (Where x, y are integers).
For example, I want to replace every row like this:
"1-3,7" by "1,2,3,7"
"1,4,6-9,11-13,5" by "1,4,6,7,8,9,11,12,13,5"
etc
I know that by looping through lines and using regular expression we can do that. But the table is quite big and it takes quite some time. So I think using pandas might be faster.
Thanks a lot
In pandas you can use apply to apply any function to either rows or columns in a DataFrame. The function can be passed with a lambda, or defined separately.
(side-remark: your example does not entirely make clear if you actually have a 2-D DataFrame or just a 1-D Series. Either way, apply can be used)
The next step is to find the right function. Here's a rough version (without regular expressions):
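def make_list(text):
    # Expand a string such as "1-3,7" into the list [1, 2, 3, 7].
    parts = text.split(',')
    newlst = []
    for part in parts:
        if "-" in part:
            start, end = (int(j) for j in part.split("-"))
            # range() excludes its stop value, so add 1 to include the end.
            newlst.extend(range(start, end + 1))
        else:
            newlst.append(int(part))
    return newlst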
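To tie this back to apply, here is a sketch that expands every row of a Series and joins the result back into a comma-separated string, as in the question. The name expand_ranges is my own, and I'm assuming non-negative integers (a leading minus sign would be mistaken for a range).

```python
import pandas as pd


def expand_ranges(text):
    """Turn a string such as "1-3,7" into "1,2,3,7" (non-negative integers assumed)."""
    values = []
    for part in text.split(","):
        if "-" in part:
            start, end = (int(n) for n in part.split("-"))
            # range() stops before its second argument, hence the + 1.
            values.extend(range(start, end + 1))
        else:
            values.append(int(part))
    return ",".join(str(v) for v in values)


rows = pd.Series(["1-3,7", "1,4,6-9,11-13,5"])

# apply calls expand_ranges once per element of the Series.
expanded = rows.apply(expand_ranges)
print(expanded.tolist())
```

Note that apply still runs the Python function once per row, so it is tidier than a hand-written loop but not necessarily much faster.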
I would like to compare a column from several pairs of pandas dataframes and write the shared values to an empty list. I have written a function that can do this with a single pair of dataframes, but I cannot seem to scale it up.
def parser(dataframe1,dataframe2,emptylist):
    for i1 in dataframe1['POS']:
        for i2 in dataframe2['POS']:
            if i1 == i2:
                emptylist.append(i1)
Where 'POS' is a column header in the two pandas dataframes.
I have made a list of variable names for each input value of this function, eg.
dataframe1_names=['name1','name2',etc...]
dataframe2_names=['name1','name2',etc...]
emptylist_names=['name1','name2',etc...]
Where each element of the list is a string containing the name of a variable (either a pandas dataframe in the case of the first two, or an empty list in the case of the last).
I have tried to iterate through these lists using the following code:
import itertools
for a, b, c in zip(range(len(dataframe1_names)), range(len(dataframe2_names)), range(len(emptylist_names))):
    parser(dataframe1_names[a],dataframe2_names[b],emptylist_names[c])
But this returns TypeError: string indices must be integers.
I believe that this error is coming from passing the function a string containing the variable name instead of the variable name itself. Is there another way to pass multiple variables to a function in an automated way?
Thanks for your help!
Do you have to use strings of object names, instead of just the objects themselves? If you do
dataframes1=[name1,name2,...]
dataframes2=[name1,name2,...]
emptylists=[name1,name2,...]
Then you can just do
for a,b,c in zip( dataframes1, dataframes2, emptylists ):
    parser(a,b,c)
The way you do this is really circuitous and unpythonic, by the way, so I've changed it a bit. Rather than getting lists of indexes for the for statement, I just iterate through the lists (and thus the objects) themselves. This is much more compact, and easier to understand. For that matter, do you have a need to input the empty list as an argument (eg, perhaps they aren't always empty)?
Your code for the parser, while correct, doesn't take advantage of pandas at all, and will be very slow. To compare columns row by row, you can simply do dataframe1['COL'] == dataframe2['COL'], which gives a boolean Series marking where the values are equal, but that only works for identically indexed frames and only finds matches on the same row. Your nested loops match values at any position, so dataframe1['COL'].isin(dataframe2['COL']) is the closer equivalent (it lists each matching row of dataframe1 once). Either boolean Series can be used to index the column and get the shared values. The result comes out as a Series, but it's easy enough to convert to a list. Thus, your parser function can be reduced to the following, if you don't need to create the "empty list" elsewhere first:
def parser( df1, df2 ):
    return list( df1['COL'][ df1['COL'].isin(df2['COL']) ] )
This will be much, much faster, though as it returns the list, you'll have to do something with it, so in your case, you'd do something like:
sharedlists = [ parser(a,b) for a,b in zip( dataframes1, dataframes2 ) ]
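It's worth checking on a small example how a row-by-row comparison with == differs from a membership test with isin, since they answer different questions. The two frames below are made up, and I've used the 'POS' column name from the question.

```python
import pandas as pd

# Made-up frames with the same length and the same default index.
df1 = pd.DataFrame({"POS": [10, 20, 30, 40]})
df2 = pd.DataFrame({"POS": [40, 20, 50, 10]})

# == compares row 0 with row 0, row 1 with row 1, and so on.
same_row = df1["POS"][df1["POS"] == df2["POS"]].tolist()

# isin asks whether each df1 value appears anywhere in df2's column.
any_row = df1["POS"][df1["POS"].isin(df2["POS"])].tolist()

print(same_row)  # [20]
print(any_row)   # [10, 20, 40]
```

If the frames differ in length or index labels, the == comparison raises an error instead, while isin does not care about alignment.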
If you must use variable names, the following very unsafe sort of code will convert your lists of names into lists of objects (you'll need to do this for each list):
dataframes1 = [ eval(name) for name in dataframe1_names ]
If this is just for numerical work you're doing in an interpreter, eval is alright, but for any code you're releasing, it's very insecure: it will evaluate whatever code is in the string passed into it, thus allowing arbitrary code execution.
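If you really do need to get from names to objects, a safer pattern than eval is to keep the frames in a dict keyed by name from the start. This is only a sketch: the frames dict and its contents are my own invention, reusing the placeholder names from the question.

```python
import pandas as pd

# Hypothetical registry mapping each name to its DataFrame.
frames = {
    "name1": pd.DataFrame({"POS": [1, 2, 3]}),
    "name2": pd.DataFrame({"POS": [2, 3, 4]}),
}

dataframe1_names = ["name1", "name2"]

# A plain dict lookup replaces eval: no code from the string is ever executed.
dataframes1 = [frames[name] for name in dataframe1_names]
```

A name that isn't in the dict raises a KeyError, which is usually easier to diagnose than whatever eval would have done with it.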
This sounds like a use case of .query()
A use case for query() is when you have a collection of DataFrame objects that have a subset of column names (or index levels/names) in common. You can pass the same query to both frames without having to specify which frame you’re interested in querying
map(lambda frame: frame.query(expr), [df, df2])
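A runnable version of that snippet might look like the following. The two frames, their extra columns, and the expression "POS > 4" are assumptions chosen purely for illustration; in Python 3, map returns a lazy iterator, so it is wrapped in list() here.

```python
import pandas as pd

# Made-up frames that share only the 'POS' column.
df = pd.DataFrame({"POS": [1, 5, 9], "depth": [10, 20, 30]})
df2 = pd.DataFrame({"POS": [2, 6, 8], "quality": [0.1, 0.2, 0.3]})

expr = "POS > 4"  # assumed query; it only refers to the shared column

# Run the same query against each frame and collect the filtered frames.
results = list(map(lambda frame: frame.query(expr), [df, df2]))

for result in results:
    print(result)
```

Note that this filters each frame by the same condition independently; it does not by itself compare values between the two frames.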
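What kind of output are you looking for in the case where you have more than two DataFrame objects? In the case of just two frames with identical indexes, the following line picks out the values that match row for row (your parser function also matches values sitting on different rows; Series.isin covers that case):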
common = df1[df1["fieldname"] == df2["fieldname"]]["fieldname"]
except that common would be a pandas Series rather than a list, but you can easily get a list from it by doing list(common).
If you're looking for a function that takes any number of DataFrames and returns a list of common values in some field for each pair, you could do something like this:
from itertools import combinations
def common_lists(field, *dfs):
    return [df1[df1[field] == df2[field]][field] for df1, df2 in combinations(dfs, 2)]
The same deal about getting a list from the result applies here, since you'll be getting a list of Series.
As far as this bit:
import itertools
for a, b, c in zip(range(len(dataframe1_names)), range(len(dataframe2_names)), range(len(emptylist_names))):
    parser(dataframe1_names[a],dataframe2_names[b],emptylist_names[c])
What you're doing is creating a sequence of tuples that looks something like this:
[(0,0,0), (1,1,1), ... (n-1,n-1,n-1)]
where n is the length of the shortest of dataframe1_names, dataframe2_names, and emptylist_names. So on the first iteration of the loop, you have a == b == c == 0, and you're using these values to index into your lists of data frame variable names, so you're calling parser("name1", "name1", "name1"), passing it strings instead of pandas DataFrame objects. Your parser function is expecting DataFrame objects so it barfs when it tries to evaluate dataframe1["POS"] where dataframe1 is the string "name1".
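You can reproduce both halves of this without pandas at all. The snippet below is a small sketch using the placeholder names from the question: it shows what the zip of ranges produces, then triggers the same TypeError by subscripting a string with a string.

```python
dataframe1_names = ["name1", "name2", "name3"]

# zip over three ranges just yields matching index triples.
index_triples = list(zip(range(3), range(3), range(3)))
print(index_triples)  # [(0, 0, 0), (1, 1, 1), (2, 2, 2)]

# Indexing the list of names gives back a string, not a DataFrame.
dataframe1 = dataframe1_names[0]

try:
    dataframe1["POS"]  # a str can only be indexed with integers or slices
except TypeError as error:
    message = str(error)
    print("TypeError:", message)
```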