When calling a function using groupby + apply, I want to go from a DataFrame to a Series groupby object, apply a function to each group that takes a Series as input and returns a Series as output, and then assign the output from the groupby + apply call as a field in the DataFrame.
The default behavior is to have the output from groupby + apply indexed by the grouping fields, which prevents me from assigning it back to the DataFrame cleanly. I'd prefer to have the function I call with apply take a Series as input and return a Series as output; I think it's a bit cleaner than DataFrame to DataFrame. (This isn't the best way of getting to the result for this example; the real application is pretty different.)
import pandas as pd
df = pd.DataFrame({
    'A': [999, 999, 111, 111],
    'B': [1, 2, 3, 4],
    'C': [1, 3, 1, 3]
})
def less_than_two(series):
    # Intended for series of length 1 in this case,
    # but not intended for many-to-one generally
    return series.iloc[0] < 2
output = df.groupby(['A', 'B'])['C'].apply(less_than_two)
I want the index on output to be the same as df, otherwise I can't assign it back to df (cleanly):
df['Less_Than_Two'] = output
Something like output.index = df.index seems too ugly, and using the group_keys argument doesn't seem to work:
output = df.groupby(['A', 'B'], group_keys=False)['C'].apply(less_than_two)
df['Less_Than_Two'] = output
transform returns the results with the original index, just as you've asked for. It will broadcast the same result across all elements of a group. One caveat: the dtype may be inferred as something other than what you expect, so you may have to cast it yourself.
In this case, in order to add another column, I'd use assign
df.assign(
    Less_Than_Two=df.groupby(['A', 'B'])['C'].transform(less_than_two).astype(bool))
     A  B  C  Less_Than_Two
0  999  1  1           True
1  999  2  3          False
2  111  3  1           True
3  111  4  3          False
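If you prefer plain column assignment over assign, the same transform call can be assigned directly (a small sketch restating the code above, with the explicit bool cast):
df['Less_Than_Two'] = (
    df.groupby(['A', 'B'])['C']
      .transform(less_than_two)
      .astype(bool)   # cast explicitly, since transform may infer a different dtype
)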
Assuming your groupby is necessary (and the resulting groupby object will have fewer rows than your DataFrame -- this isn't the case with the example data), then assigning the Series to the 'Is.Even' column will result in NaN values (since the index of output will be shorter than the index of df).
Instead, based on the example data, the simplest approach will be to merge output -- as a DataFrame -- with df, like so:
# is_even here is this answer's own helper, analogous to less_than_two above
output = df.groupby(['A', 'B'])['C'].agg(is_even).reset_index()  # reset_index restores 'A' and 'B' from the index to columns
output.columns = ['A', 'B', 'Is_Even']  # rename the aggregated column prior to merging
df = df.merge(output, how='left', on=['A', 'B'])  # a left merge supports a many-to-one relationship between
                                                  # combinations of 'A' & 'B' and 'Is_Even', and so maps the
                                                  # aggregated values back onto the unaggregated rows
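For a self-contained sketch of this approach on the question's example data (is_even is a hypothetical stand-in for whatever per-group flag you actually compute):
import pandas as pd

df = pd.DataFrame({'A': [999, 999, 111, 111],
                   'B': [1, 2, 3, 4],
                   'C': [1, 3, 1, 3]})

def is_even(series):
    # hypothetical per-group flag, computed from the first element of each group
    return series.iloc[0] % 2 == 0

output = df.groupby(['A', 'B'])['C'].agg(is_even).reset_index()
output.columns = ['A', 'B', 'Is_Even']
df = df.merge(output, how='left', on=['A', 'B'])
print(df)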
Also, I should note that you're better off using underscores than dots in names; unlike in R, dots in Python act as operators for accessing object attributes, so using them in column names blocks attribute access and creates confusion.
Related
I have the following sample dataframe with columns A and B:
df:
A B
123 555
456 123
789 666
I want to know which method can be used to print out 123 (i.e., to print the values of A that also exist in column B). I tried the following:
for i, row in df.iterrows():
    if row.A in row.B:
        print(row.A, row.B)
but I got the error: argument of type 'float' is not iterable.
If you are trying to print any row where row.A exists anywhere in column B, then your code should be:
for i, row in df.iterrows():
    if row.A in df.B.values:  # note: `in df.B` would check the index, not the values
        print(row.A, row.B)
col_B = df['B'].unique()
val_B_in_A = [i for i in df['A'].unique() if i in col_B]
print(val_B_in_A)
Be careful with "dot" notation in dataframes, since columns can contain spaces and it starts to be a pain dealing with those. With that said,
Depending on how many rows you are iterating over, and the proportion of rows that contain unique values, it may be computationally less expensive to iterate over the unique values in 'A', and check if each one is in 'B':
import pandas as pd

tmp = []
for value in df['A'].unique():
    tmp.append(df.loc[df['B'] == value])

df_results = pd.concat(tmp)
print(df_results)
You could also use the built-in method .isin(). In fact, much of the power of pandas lies in its array-wise operations, which are significantly quicker than most approaches involving loops:
df.loc[df['B'].isin(df['A'].unique())]
And to show only one column with the .loc accessor, just add the column label:
df.loc[df['B'].isin(df['A'].unique()), 'A']
And to return just the values as a NumPy array:
df.loc[df['B'].isin(df['A'].unique()), 'A'].values
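As a side note, newer pandas versions recommend .to_numpy() over .values for this:
df.loc[df['B'].isin(df['A'].unique()), 'A'].to_numpy()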
If you are concerned with an exact match, try (where 'col1' stands in for whichever column you are matching on):
df['match'] = pd.Series([(df['col1'] == item).sum() for item in df['col1']])
I have two dataframes. One has two label columns, each with some associated value columns. The second one has the same labels and more useful data for those same labels. I'm trying to replace the values in the first with the values of the second for each matching label. For example:
df = {'a':['x','y','z','t'], 'b':['t','x','y','z'], 'a_1':[1,2,3,4], 'a_2':[4,2,4,1], 'b_1':[1,2,3,4], 'b_2':[4,2,4,2]}
df_2 = {'n':['x','y','z','t'], 'n_1':[1,2,3,4], 'n_2':[1,2,3,4]}
I want to replace the values of a_1 and a_2 (and likewise b_1 and b_2) with n_1 and n_2 wherever a or b matches n. So far I tried using the replace and map functions, and they work when I use them like this:
df.iloc[0] = df.iloc[0].replace({'a_1': df['a_1']}, df_2['n_1'].loc[df['a'].iloc[0]])
I can make the substitution for one specific line, but if I try to put that in a for loop and change the numbers I get the error Cannot index by location index with a non-integer key. If I take the ilocs out I get the original df unchanged and without any error messages. I get the same behavior when I use the map function. The way I tried to do the for loop and the map:
for i in df:
    df.iloc[i] = df.iloc[i].replace({'a_1': df['a_1']}, df_2['n_1'].loc[df['a'].iloc[i]])
    df.iloc[i] = df.iloc[i].replace({'b_1': df['b_1']}, df_2['n_1'].loc[df['b'].iloc[i]])
And so on. And for the map function:
for i in df:
    df = df.map({df['b_1']: df_2['n_1'].loc[df['b'].iloc[i]]})
    df = df.map({df['a_1']: df_2['n_1'].loc[df['a'].iloc[i]]})
I would like the resulting dataframe to have the same format as the first but with the values of the second, something like this:
df = {'a':['x','y','z','t'], 'b':['t','x','y','z'], 'an_1':[1,2,3,4], 'an_2':[1,2,3,4], 'bn_1':[1,2,3,4], 'bn_2':[1,2,3,4]}
where an and bn are the values for a and b when n is equal to a or b in the second dataframe.
Hope this is comprehensible.
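A minimal sketch of one possible way to do this lookup with map, assuming the dicts above are wrapped in DataFrames (this is not the asker's code, just one approach):
import pandas as pd

df = pd.DataFrame({'a': ['x', 'y', 'z', 't'], 'b': ['t', 'x', 'y', 'z'],
                   'a_1': [1, 2, 3, 4], 'a_2': [4, 2, 4, 1],
                   'b_1': [1, 2, 3, 4], 'b_2': [4, 2, 4, 2]})
df_2 = pd.DataFrame({'n': ['x', 'y', 'z', 't'], 'n_1': [1, 2, 3, 4], 'n_2': [1, 2, 3, 4]})

lookup = df_2.set_index('n')        # index df_2 by the shared label column
for label in ['a', 'b']:            # the label columns in df
    for k in ['1', '2']:            # the associated value columns
        df[f'{label}_{k}'] = df[label].map(lookup[f'n_{k}'])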
I need a smart and fast way to delete all the rows of a Pandas DataFrame (shape [10000, 37]) for which a twin dictionary holds the Boolean value False in at least one of the columns for that row (the dictionary has keys equal to the names of the DataFrame's columns, and the value for each key is a list of 9999 Boolean values).
I would like this operation to stay simple even as the program grows and changes, so I want to avoid handling each series of values separately.
I am not a professional programmer. Can anyone recommend an appropriate approach?
I will assume here that the dictionary and the dataframe have different values but share same indices. Said differently, I assume that the index of the dataframe is a RangeIndex(start=0, stop=10000, step=1).
In that case, I would build a dataframe from the twin dictionary and use np.all to flag the rows that are True in every column; the remaining rows (those with at least one False) are the ones to drop.
Let us call df the dataframe and twin the twin dictionary, code could be:
import numpy as np

df_twin = pd.DataFrame(twin)
df_twin['keep'] = np.all(df_twin, axis=1)                 # True only where every column is True for that row
df_clean = df.drop(df_twin.loc[~df_twin['keep']].index)   # drop the rows that have at least one False
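For example, with a toy DataFrame and twin dictionary (hypothetical data, just to show the mechanics):
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30], 'y': [1, 2, 3]})
twin = {'x': [True, False, True], 'y': [True, True, True]}

df_twin = pd.DataFrame(twin)
df_twin['keep'] = np.all(df_twin, axis=1)
df_clean = df.drop(df_twin.loc[~df_twin['keep']].index)
# row 1 is dropped because twin['x'][1] is False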
Using this as an example dataframe:
test_df = pd.DataFrame({'A': [True, True, True], 'B': [False, True, True], 'C': [True, False, True], 'D': [True, True, True]})
We only want the 3rd row which has True in each column:
mask = test_df.all(axis=1)
keep_df = test_df[mask]
If you only want to check the columns that are keys in your dictionary:
d = {'A': [1, 2, 3], 'C': [4, 5, 6]}
mask = test_df[list(d)].all(axis=1)   # select only the columns named by the dict's keys
keep_df = test_df[mask]
I have N DataFrames, named data1, data2, ... etc.
Each dataframe has two columns, 'X' and 'Y'. The length of each dataframe is not the same.
I need a new dataframe consisting of the sum of the 'X' columns.
I just tried something like:
dataframesum = pd.DataFrame(0, index=np.arange(Some_number), columns=['X'])
for i in range(N):
    dataframesum.add(globals()['Data%s' % i]['X'], fill_value=0)
but it doesn't work (I'm not sure what the value of Some_number should be) and I get the following error:
NotImplementedError: fill_value 0 not supported
You should use a dictionary to store an arbitrary number of variables.
So let's assume you have dataframes stored in dfs = {1: df1, 2: df2, 3: df3...}.
You can then concatenate them via pd.concat:
df_concat = pd.concat(list(dfs.values()))
Finally, you can sum columns via pd.DataFrame.sum:
sums = df_concat.sum()
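Putting the pieces together on two small hypothetical frames of different lengths:
import pandas as pd

df1 = pd.DataFrame({'X': [1, 2, 3], 'Y': [10, 20, 30]})
df2 = pd.DataFrame({'X': [4, 5], 'Y': [40, 50]})
dfs = {1: df1, 2: df2}

df_concat = pd.concat(list(dfs.values()))
sums = df_concat.sum()    # one total per column
print(sums['X'])          # 15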
To take advantage of vectorised operations, you should avoid a manual for loop. In addition, using globals() is poor practice and can be avoided by storing your dataframes in a dict or list.
I'm struggling to understand the concept behind column naming conventions, given that one of the following attempts to create a new column appears to fail:
from numpy.random import randn
import pandas as pd
df = pd.DataFrame({'a': range(0, 10, 2), 'c': range(0, 1000, 200)},
                  columns=list('ac'))
df['b'] = 10*df.a
df
gives the following result:
   a    c   b
0  0    0   0
1  2  200  20
2  4  400  40
3  6  600  60
4  8  800  80
However, if I try to create column b by substituting the following line, there is no error message, yet the dataframe df still has only the columns a and c.
df.b = 10*df.a ### rather than the previous df['b'] = 10*df.a ###
What has pandas done and why is my command incorrect?
What you did was add an attribute b to your df:
In [70]:
df.b = 10*df.a
df.b
Out[70]:
0 0
1 20
2 40
3 60
4 80
Name: a, dtype: int32
but we see that no new column has been added:
In [73]:
df.columns
Out[73]:
Index(['a', 'c'], dtype='object')
which means we would get a KeyError if we tried df['b']. To avoid this ambiguity, you should always use square brackets when assigning.
For instance, if you had a column named index, sum or max, then df.index would return the index rather than the 'index' column, and similarly df.sum and df.max would return those DataFrame methods rather than your columns.
I strongly advise always using square brackets: it avoids any ambiguity, and recent IPython versions can tab-complete column names inside square brackets. It's also useful to think of a dataframe as a dict of Series, in which case it makes sense to use square brackets for assigning and returning a column.
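For example (a small sketch with a hypothetical column named 'sum'):
import pandas as pd

df = pd.DataFrame({'sum': [1, 2, 3], 'b': [4, 5, 6]})

print(df.sum)      # the bound DataFrame.sum method, not the column
print(df['sum'])   # the actual 'sum' column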
Always use square brackets for assigning columns
Dot notation is a convenience for accessing columns in a dataframe. If a column name conflicts with an existing attribute or method (e.g. a column named 'max'), then you need to use square brackets to access that column, e.g. df['max']. You also need square brackets when the column name contains spaces, e.g. df['max value'].
A DataFrame is just an object with the usual properties and methods. If you use dot notation for assignment, you are creating an attribute on the dataframe object. So df.val = 2 attaches an attribute val with a value of two to df. This is very different from df['val'] = 2, which creates a new column in the dataframe and assigns each element in that column the value two.
To be safe, using square bracket notation will always provide the correct result.
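A quick sketch of the difference (val is a hypothetical name; depending on your pandas version, the attribute assignment may also emit a UserWarning):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

df.val = df['a'] * 10        # attaches a plain Python attribute; no column is created
print(df.columns.tolist())   # ['a']

df['val'] = df['a'] * 10     # creates an actual column
print(df.columns.tolist())   # ['a', 'val']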
As an aside, the columns=list('ac') argument in your constructor is redundant: the columns are already determined by the dict keys, so passing columns here only pins down their order. (That mattered on older Python versions, where dictionaries were unordered and pd.DataFrame({'a': [...], 'b': [...]}) could come back with its columns as ['b', 'a']; since Python 3.7, dicts preserve insertion order.)
The issue has to do with how attributes are handled in Python. There is no restriction in Python on setting new attributes on an object, so, for example, you could do something like
df.myspecialstuff = ["dog", "cat", 5]
So when you do assignment like
df.b = 10*df.a
It is ambiguous whether you want to add an attribute or a new column, and an attribute is what gets set. The easiest way to actually see what is going on is to use pdb and step through the code:
import pdb
x = df.a
pdb.run("df.a1 = x")
This will step into __setattr__(), whereas pdb.run("df['a2'] = x") will step into __setitem__().