I have a large DataFrame with the following contents:
obj0 type obj1 counts
car P wheel 2
tree P branch 5
box I ball 23
wheel P bus 3
grass A cow 10
...
I want to see how many obj0 and type pairs I have, so I do:
obj0_sums = pdf.groupby(['obj0', 'type'])['counts'].sum()
obj0 type
car P 5
tree P 15
box I 50
...
Then I need to do other operations on obj0_sums later, but it is a Series in which the obj0 and type columns have become the index. How can I make the result stay a DataFrame and preserve the columns after the groupby operation?
try this:
In [262]: df.groupby(['obj0', 'type'], as_index=False)['counts'].sum()
Out[262]:
obj0 type counts
0 box I 23
1 car P 2
2 grass A 10
3 tree P 5
4 wheel P 3
from docs:
as_index : boolean, default True
For aggregated output, return object with group labels as the index. Only
relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output
I think you might be looking for
pdf.groupby(['obj0', 'type'], as_index=False)['counts'].sum()
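For comparison, here is a minimal sketch of two ways to end up with a flat DataFrame, using the sample rows from the question. The variable names flat and also_flat are only for illustration.

```python
import pandas as pd

# Sample rows copied from the question.
pdf = pd.DataFrame({
    "obj0": ["car", "tree", "box", "wheel", "grass"],
    "type": ["P", "P", "I", "P", "A"],
    "obj1": ["wheel", "branch", "ball", "bus", "cow"],
    "counts": [2, 5, 23, 3, 10],
})

# Option 1: keep the group keys as ordinary columns from the start.
flat = pdf.groupby(["obj0", "type"], as_index=False)["counts"].sum()

# Option 2: aggregate to a Series with a MultiIndex, then move the
# index levels back into columns.
also_flat = pdf.groupby(["obj0", "type"])["counts"].sum().reset_index()

print(flat)
print(also_flat)
```

Both results have obj0, type and counts as regular columns, so later operations can refer to them by name.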
I have this dataframe and my goal is to remove any columns that have less than 1000 entries.
Prior to pivoting the df I know I have 880 unique well_id's with entries ranging from 4 to 60k+. I know I should end up with 102 well_id's.
I tried to accomplish this in a very naïve way by collecting the wells that I am trying to remove in an array and using a loop but I keep getting a 'TypeError: Level type mismatch' but when I just use del without a for loop it works.
#this works
del df[164301.0]
del df['TB-0071']
# this doesn't work
for id in unwanted_id:
del df[id]
Any help is appreciated, Thanks.
You can use the dropna method:
df.dropna(axis=1, thresh=1000) # thresh is the minimum number of non-NA values a column needs in order to be kept
The advantage of this method is that you don't need to create a list.
Also don't forget to add the usual inplace=True if you want the changes to be made in place.
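Since the question is about dropping columns, here is a small sketch of the column-wise version of this idea. The frame, the label "W-3" and the threshold of 2 (standing in for 1000) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical pivoted frame: one column per well_id, NaN where a well has no entry.
df = pd.DataFrame({
    "TB-0071": [1.0, np.nan, np.nan, np.nan],
    164301.0: [1.0, 2.0, np.nan, np.nan],
    "W-3": [1.0, 2.0, 3.0, 4.0],
})

min_entries = 2  # stand-in for the 1000 in the question

# count() ignores NaN, so this is the number of entries per column.
keep = df.count() >= min_entries
filtered = df.loc[:, keep]

# dropna along axis=1 with the same threshold keeps the same columns.
same = df.dropna(axis=1, thresh=min_entries)

print(list(filtered.columns))
print(list(same.columns))
```

Neither form needs a loop or a separate list of unwanted ids, and mixed label types (floats and strings) are not a problem.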
You can use pandas drop method:
df.drop(columns=['colName'], inplace=True)
You can also pass a list of column names:
unwanted_ids = [164301.0, 'TB-0071']
df.drop(columns=unwanted_ids, inplace=True)
Sample:
df[:5]
from to freq
0 A X 20
1 B Z 9
2 A Y 2
3 A Z 5
4 A X 8
df.drop(columns=['from', 'to'])
freq
0 20
1 9
2 2
3 5
4 8
And to get those column names with more than 1000 unique values, you can use something like this:
counts = df.nunique()[df.nunique()>1000].to_frame('uCounts').reset_index().rename(columns={'index':'colName'})
counts
colName uCounts
0 to 1001
1 freq 1050
I want to add an aggregate, grouped, nunique column to my pandas dataframe but not aggregate the entire dataframe. I'm trying to do this in one line and avoid creating a new aggregated object and merging that, etc.
my df has track, type, and id. I want the number of unique ids for each track/type combination as a new column in the table (but not collapse track/type combos in the resulting df). Same number of rows, 1 more column.
something like this isn't working:
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].nunique()
nor is
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(nunique)
this last one works with some aggregating functions but not others. the following works (but is meaningless on my dataset):
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(sum)
in R this is easily done in data.table with
df[, n_unique_id := uniqueN(id), by = c('track', 'type')]
thanks!
df.groupby(['track', 'type'])['id'].transform(nunique)
Implies that there is a name nunique in the name space that performs some function. transform will take a function or a string that it knows a function for. nunique is definitely one of those strings.
As pointed out by @root, the methods pandas uses to perform a transformation named by one of these strings are often optimized and should generally be preferred to passing your own functions. This is true even for passing numpy functions in some cases.
For example transform('sum') should be preferred over transform(sum).
Try this instead
df.groupby(['track', 'type'])['id'].transform('nunique')
demo
df = pd.DataFrame(dict(
track=list('11112222'), type=list('AAAABBBB'), id=list('XXYZWWWW')))
print(df)
id track type
0 X 1 A
1 X 1 A
2 Y 1 A
3 Z 1 A
4 W 2 B
5 W 2 B
6 W 2 B
7 W 2 B
df.groupby(['track', 'type'])['id'].transform('nunique')
0 3
1 3
2 3
3 3
4 1
5 1
6 1
7 1
Name: id, dtype: int64
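To get the "same number of rows, one more column" result the question asks for, the transformed Series can be assigned straight back. A minimal sketch using the demo data (per_group is only there for contrast):

```python
import pandas as pd

df = pd.DataFrame({
    "track": list("11112222"),
    "type": list("AAAABBBB"),
    "id": list("XXYZWWWW"),
})

# transform broadcasts each group's result back onto the original rows,
# so the result shares df's index and can be assigned as a new column.
df["n_unique_id"] = df.groupby(["track", "type"])["id"].transform("nunique")

# The aggregated form has one row per (track, type) group instead,
# which is why assigning it directly does not line up with df's rows.
per_group = df.groupby(["track", "type"])["id"].nunique()

print(df)
print(per_group)
```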
I have a DataFrame with a multiindex in the columns and would like to use dictionaries to append new rows.
Let's say that each row in the DataFrame is a city. The columns contain "distance" and "vehicle", and each cell is the percentage of the population that chooses this vehicle for this distance.
I'm constructing an index like this:
index_tuples = []
for distance in ["near", "far"]:
    for vehicle in ["bike", "car"]:
        index_tuples.append([distance, vehicle])
index = pd.MultiIndex.from_tuples(index_tuples, names=["distance", "vehicle"])
Then I'm creating a dataframe:
dataframe = pd.DataFrame(index=["city"], columns = index)
The structure of the dataframe looks good, although pandas has added NaNs as default values.
Now I would like to set up a dictionary for the new city and add it:
my_home_city = {"near":{"bike":1, "car":0},"far":{"bike":0, "car":1}}
dataframe["my_home_city"] = my_home_city
But this fails:
ValueError: Length of values does not match length of index
Here is the complete error message (pastebin)
UPDATE:
Thank you for all the good answers. I'm afraid I've oversimplified the problem in my example. Actually my index is nested with 3 levels (and it could become more).
So I've accepted the universal answer of converting my dictionary into a list of tuples. This might not be as clean as the other approaches but works for any multiindex setup.
A MultiIndex is a list of tuples, so we just need to flatten your dict into tuple keys; then we can assign the values directly:
d = {(x,y):my_home_city[x][y] for x in my_home_city for y in my_home_city[x]}
df.loc['my_home_city',:]=d
df
Out[994]:
distance near far
vehicle bike car bike car
city NaN NaN NaN NaN
my_home_city 1 0 0 1
More Info
d
Out[995]:
{('far', 'bike'): 0,
('far', 'car'): 1,
('near', 'bike'): 1,
('near', 'car'): 0}
df.columns.values
Out[996]: array([('near', 'bike'), ('near', 'car'), ('far', 'bike'), ('far', 'car')], dtype=object)
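The update to the question mentions three or more levels. The dict comprehension above is written for exactly two, so here is a sketch of a recursive variant for any depth. The flatten helper and the third level ("season") are hypothetical and only for illustration.

```python
import pandas as pd

def flatten(nested, prefix=()):
    """Turn {"a": {"b": 1}} into {("a", "b"): 1} for any nesting depth."""
    flat = {}
    for key, value in nested.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

# Hypothetical three-level columns: distance / vehicle / season.
columns = pd.MultiIndex.from_product(
    [["near", "far"], ["bike", "car"], ["summer", "winter"]],
    names=["distance", "vehicle", "season"],
)

my_home_city = {
    "near": {"bike": {"summer": 1, "winter": 1}, "car": {"summer": 0, "winter": 0}},
    "far": {"bike": {"summer": 0, "winter": 0}, "car": {"summer": 1, "winter": 1}},
}

# A Series built from tuple keys gets a MultiIndex; reindexing puts the
# values in the same order as the target columns.
row = pd.Series(flatten(my_home_city), name="my_home_city").reindex(columns)

# One-row frame with the city name as the index label.
frame = row.to_frame().T
print(frame)
```

The resulting one-row frame has the same column layout as the target, so it can be combined with an existing frame that uses those columns.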
You can append to your dataframe like this:
my_home_city = {"near":{"bike":1, "car":0},"far":{"bike":0, "car":1}}
dataframe.append(pd.DataFrame.from_dict(my_home_city).unstack().rename('my_home_city'))
Output:
distance near far
vehicle bike car bike car
city NaN NaN NaN NaN
my_home_city 1 0 0 1
The trick is to create the dataframe row with from_dict, then unstack to get the structure of your original dataframe with multiindex columns, then rename to set the index label, and append.
Or if you don't want to create the empty dataframe first you can use this method to create the dataframe with the new data.
pd.DataFrame.from_dict(my_home_city).unstack().rename('my_home_city').to_frame().T
Output:
far near
bike car bike car
my_home_city 0 1 1 0
Explained:
pd.DataFrame.from_dict(my_home_city)
far near
bike 0 1
car 1 0
Now, let's unstack to create a MultiIndex and get that new dataframe into the structure of the original dataframe.
pd.DataFrame.from_dict(my_home_city).unstack()
far bike 0
car 1
near bike 1
car 0
dtype: int64
We use rename to give that series a name which becomes the index label of that dataframe row when appended to the original dataframe.
far bike 0
car 1
near bike 1
car 0
Name: my_home_city, dtype: int64
If you converted that Series to a frame and transposed it, it would look very much like a new row. There is no need to do this, however, because pandas aligns data automatically: appending this Series to the dataframe aligns it with the columns and adds the new record.
dataframe.append(pd.DataFrame.from_dict(my_home_city).unstack().rename('my_home_city'))
distance near far
vehicle bike car bike car
city NaN NaN NaN NaN
my_home_city 1 0 0 1
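Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the same row has to be added another way. A minimal sketch with pd.concat, assuming the frame and dictionary from the question:

```python
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["near", "far"], ["bike", "car"]], names=["distance", "vehicle"]
)
dataframe = pd.DataFrame(index=["city"], columns=columns)

my_home_city = {"near": {"bike": 1, "car": 0}, "far": {"bike": 0, "car": 1}}

# Same construction as in the answer: a Series indexed by (distance, vehicle)
# whose name will become the row label.
new_row = pd.DataFrame.from_dict(my_home_city).unstack().rename("my_home_city")

# concat stacks frames, so here the Series is first turned into a one-row frame.
result = pd.concat([dataframe, new_row.to_frame().T])
print(result)
```

Unlike append with a Series, concat does need the to_frame().T step, because a bare Series would be treated as a column-like object rather than a row.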
I don't think you even need to initialise an empty dataframe. With your d, I can get your desired output with unstack and a transpose:
pd.DataFrame(d).unstack().to_frame().T
far near
bike car bike car
0 0 1 1 0
Initialize your empty dataframe using MultiIndex.from_product.
distances = ['near', 'far']
vehicles = ['bike', 'car']
df = pd.DataFrame([], columns=pd.MultiIndex.from_product([distances, vehicles]),
index=pd.Index([], name='city'))
Your dictionary results in a square matrix (distance by vehicle), so unstack it (which will result in a Series), then convert it into a dataframe row by calling (to_frame) using the relevant city name and transposing the column into a row.
>>> df.append(pd.DataFrame(my_home_city).unstack().to_frame('my_home_city').T)
far near
bike car bike car
city
my_home_city 0 1 1 0
Just to add to all of the answers, this is another (maybe not too different) simple example, presented in a more reproducible way:
import itertools as it
from IPython.display import display # this is just for displaying output purpose
import numpy as np
import pandas as pd
col_1, col_2 = ['A', 'B'], ['C', 'D']
arr_size = len(col_2)
col = pd.MultiIndex.from_product([col_1, col_2])
tmp_df = pd.DataFrame(columns=col)
display(tmp_df)
for s in range(3):  # number of rows to add to tmp_df
    tmp_dict = {x : [np.random.random_sample(1)[0] for i in range(arr_size)] for x in range(arr_size)}
    tmp_ser = pd.Series(it.chain.from_iterable([tmp_dict[x] for x in tmp_dict]), index=col)
    # display(tmp_dict, tmp_ser)
    tmp_df = tmp_df.append(tmp_ser[tmp_df.columns], ignore_index=True)
display(tmp_df)
Some things to note about above:
The number of items to add should always match len(col_1)*len(col_2), that is the product of element lengths your multi-index is made from.
list(it.chain.from_iterable([[2, 3], [4, 5]])) simply does this [2,3,4,5]
try this workaround
append to dict
then convert to pandas data frame
at the very last step select desired columns to create multi-index with set_index()
d = dict()
for g in predictor_types:
    for col in predictor_types[g]:
        tot = len(ames) - ames[col].count()
        if tot:
            d.setdefault('type',[]).append(g)
            d.setdefault('predictor',[]).append(col)
            d.setdefault('missing',[]).append(tot)
pd.DataFrame(d).set_index(['type','predictor']).style.bar(color='DodgerBlue')
If I have a pandas DataFrame such as:
timestamp label value new
etc. a 1 3.5
b 2 5
a 5 ...
b 6 ...
a 2 ...
b 4 ...
I want the new column to be the average of the last two a's and the last two b's... so for the first it would be the average of 5 and 2 to get 3.5. It will be sorted by the timestamp. I know I could use a groupby to get the average of all the a's or all the b's but I'm not sure how to get an average of just the last two. I'm kinda new to python and coding so this might not be possible idk.
Edit: I should also mention this is not for a class or anything this is just for something I'm doing on my own and that this will be on a very large dataset. I'm just using this as an example. Also I would want each A and each B to have its own value for the last 2 average so the dimension of the new column will be the same as the others. So for the third line it would be the average of 2 and whatever the next a would be in the data set.
IIUC one way (among many) to do that:
In [139]: df.groupby('label').tail(2).groupby('label').mean().reset_index()
Out[139]:
label value
0 a 3.5
1 b 5.0
Edited to reflect a change in the question specifying the last two, not the ones following the first, and that you wanted the same dimensionality with values repeated.
import pandas as pd
data = {'label': ['a','b','a','b','a','b'], 'value':[1,2,5,6,2,4]}
df = pd.DataFrame(data)
grouped = df.groupby('label')
results = {'label':[], 'tail_mean':[]}
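for item, grp in grouped:
    # mean of the last two values in this group (select 'value' so the string label column is not averaged)
    subset_mean = grp['value'].tail(2).mean()
    results['label'].append(item)
    results['tail_mean'].append(subset_mean)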
res_df = pd.DataFrame(results)
df = df.merge(res_df, on='label', how='left')
Outputs:
>> res_df
label tail_mean
0 a 3.5
1 b 5.0
>> df
label value tail_mean
0 a 1 3.5
1 b 2 5.0
2 a 5 3.5
3 b 6 5.0
4 a 2 3.5
5 b 4 5.0
Now you have a dataframe of your results only, if you need them, plus a column with it merged back into the main dataframe. Someone else posted a more succinct way to get to the results dataframe; probably no reason to do it the longer way I showed here unless you also need to perform more operations like this that you could do inside the same loop.
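If only the merged column is needed, the loop and merge can probably be collapsed into a single groupby/transform. A sketch on the same sample data (the column name tail_mean is kept from the answer above):

```python
import pandas as pd

df = pd.DataFrame({
    "label": ["a", "b", "a", "b", "a", "b"],
    "value": [1, 2, 5, 6, 2, 4],
})

# For each label, take the mean of its last two values and repeat that
# number on every row of the label, keeping the original row count.
df["tail_mean"] = df.groupby("label")["value"].transform(lambda s: s.tail(2).mean())

print(df)
```

Passing a lambda is slower than the built-in string aggregations, so on a very large dataset the tail(2) then groupby mean approach followed by a merge may still be worth timing against this.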
I'm looking for solutions to speed up a function I have written to loop through a pandas dataframe and compare column values between the current row and the previous row.
As an example, this is a simplified version of my problem:
User Time Col1 newcol1 newcol2 newcol3 newcol4
0 1 6 [cat, dog, goat] 0 0 0 0
1 1 6 [cat, sheep] 0 0 0 0
2 1 12 [sheep, goat] 0 0 0 0
3 2 3 [cat, lion] 0 0 0 0
4 2 5 [fish, goat, lemur] 0 0 0 0
5 3 9 [cat, dog] 0 0 0 0
6 4 4 [dog, goat] 0 0 0 0
7 4 11 [cat] 0 0 0 0
At the moment I have a function which loops through and calculates values for 'newcol1' and 'newcol2' based on whether the 'User' has changed since the previous row and also whether the difference in the 'Time' values is greater than 1. It also looks at the first value in the arrays stored in 'Col1' and 'Col2' and updates 'newcol3' and 'newcol4' if these values have changed since the previous row.
Here's the pseudo-code for what I'm doing currently (since I've simplified the problem I haven't tested this but it's pretty similar to what I'm actually doing in ipython notebook):
def myJFunc(df):
    # initialize jnum counter
    jnum = 0
    # loop through each row of dataframe (not including the first/zeroeth)
    for i in range(1, len(df)):
        # same user as the previous row?
        if df.User.loc[i] == df.User.loc[i-1]:
            # has time increased by more than 1 (hour)?
            if abs(df.Time.loc[i] - df.Time.loc[i-1]) > 1:
                # update new columns
                df['newcol2'].loc[i-1] = 1
                df['newcol1'].loc[i] = 1
                # increase jnum
                jnum += 1
            # has content changed?
            if df.Col1.loc[i][0] != df.Col1.loc[i-1][0]:
                # record this change
                df['newcol4'].loc[i-1] = [df.Col1.loc[i-1][0], df.Col2.loc[i][0]]
        # different user?
        elif df.User.loc[i] != df.User.loc[i-1]:
            # update new columns
            df['newcol1'].loc[i] = 1
            df['newcol2'].loc[i-1] = 1
            # store jnum elsewhere (code not included here) and reset jnum
            jnum = 1
I now need to apply this function to several million rows and it's impossibly slow so I'm trying to figure out the best way to speed it up. I've heard that Cython can increase the speed of functions but I have no experience with it (and I'm new to both pandas and python). Is it possible to pass two rows of a dataframe as arguments to the function and then use Cython to speed it up or would it be necessary to create new columns with "diff" values in them so that the function only reads from and writes to one row of the dataframe at a time, in order to benefit from using Cython? Any other speed tricks would be greatly appreciated!
(As regards using .loc, I compared .loc, .iloc and .ix and this one was marginally faster so that's the only reason I'm using that currently)
(Also, my User column in reality is unicode not int, which could be problematic for speedy comparisons)
I was thinking along the same lines as Andy, just with groupby added, and I think this is complementary to Andy's answer. Adding groupby is just going to have the effect of putting a NaN in the first row whenever you do a diff or shift. (Note that this is not an attempt at an exact answer, just to sketch out some basic techniques.)
df['time_diff'] = df.groupby('User')['Time'].diff()
df['Col1_0'] = df['Col1'].apply( lambda x: x[0] )
df['Col1_0_prev'] = df.groupby('User')['Col1_0'].shift()
User Time Col1 time_diff Col1_0 Col1_0_prev
0 1 6 [cat, dog, goat] NaN cat NaN
1 1 6 [cat, sheep] 0 cat cat
2 1 12 [sheep, goat] 6 sheep cat
3 2 3 [cat, lion] NaN cat NaN
4 2 5 [fish, goat, lemur] 2 fish cat
5 3 9 [cat, dog] NaN cat NaN
6 4 4 [dog, goat] NaN dog NaN
7 4 11 [cat] 7 cat dog
As a followup to Andy's point about storing objects, note that what I did here was to extract the first element of the list column (and add a shifted version also). Doing it like this you only have to do an expensive extraction once and after that can stick to standard pandas methods.
Use pandas (constructs) and vectorize your code i.e. don't use for loops, instead use pandas/numpy functions.
'newcol1' and 'newcol2' based on whether the 'User' has changed since the previous row and also whether the difference in the 'Time' values is greater than 1.
Calculate these separately:
df['newcol1'] = df['User'].shift() == df['User']
df.loc[0, 'newcol1'] = True # possibly tweak the first row??
df['newcol2'] = (df['Time'].shift() - df['Time']).abs() > 1
The purpose of Col1 is unclear to me, but general Python objects in columns don't scale well (you can't use the fast path and the contents are scattered in memory). Most of the time you can get away with using something else...
Cython is the very last option, and not needed in 99% of use-cases, but see enhancing performance section of the docs for tips.
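As a rough sketch of how the shift-based comparisons could reproduce the newcol1/newcol2 flags from the loop in the question: this is my reading of that loop, not a drop-in replacement. It only covers the User and Time columns, and the helper names (user_changed, time_jump, starts_break) are made up.

```python
import pandas as pd

# User and Time columns from the question's sample.
df = pd.DataFrame({
    "User": [1, 1, 1, 2, 2, 3, 4, 4],
    "Time": [6, 6, 12, 3, 5, 9, 4, 11],
})

# Compare every row with the one before it, all at once.
user_changed = df["User"] != df["User"].shift()
time_jump = (df["Time"] - df["Time"].shift()).abs() > 1

# A "break" starts at row i when the user differs from row i-1,
# or when the time moved by more than 1.
starts_break = user_changed | time_jump
starts_break.iloc[0] = False  # the loop starts at row 1, so row 0 is never flagged

# Row i gets newcol1, and the row just before it gets newcol2.
df["newcol1"] = starts_break.astype(int)
df["newcol2"] = starts_break.shift(-1, fill_value=False).astype(int)

print(df)
```

The jnum bookkeeping and the Col1 comparison are left out; a grouped cumulative sum over starts_break might be one way to approach the counter.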
In your problem, it seems like you want to iterate through the rows pairwise. The first thing you could do is something like this:
from itertools import tee
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)  # zip is lazy in Python 3; Python 2 needed itertools.izip
for (idx1, row1), (idx2, row2) in pairwise(df.iterrows()):
    pass  # your stuff
However, you cannot modify row1 and row2 directly; you will still need to use .loc or .iloc with the indexes.
If iterrows is still too slow, I suggest doing something like this:
Create a user_id column from your unicode names, using pd.unique(User) and mapping each name to an integer id with a dictionary.
Create a delta dataframe: take a shifted dataframe with the user_id and time columns and subtract the original dataframe from it.
df[[col1, ..]].shift() - df[[col1, ..]]
If user_id != 0 in the delta, the user changed between two consecutive rows. The time column can be filtered directly with delta[delta['time'].abs() > 1]
With this delta dataframe you record the changes row-wise. You can use it as a mask to update the columns you need in your original dataframe.
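The steps above can be sketched roughly as follows. This version swaps in pd.factorize for the unique-plus-dictionary mapping and DataFrame.diff for the explicit shift and subtract (diff is current minus previous, so the sign is flipped, which does not matter for a != 0 or abs() test). The user names and the newcol1 update are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "User": ["ann", "ann", "bob", "bob", "cy"],  # hypothetical unicode user names
    "Time": [6, 6, 3, 5, 9],
})

# factorize assigns integer ids 0, 1, 2, ... in order of first appearance.
df["user_id"], _ = pd.factorize(df["User"])

# Row-wise difference from the previous row; the first row is NaN.
delta = df[["user_id", "Time"]].diff()

user_changed = delta["user_id"].fillna(0) != 0
time_jump = delta["Time"].abs() > 1

# Use the boolean masks to update a column of the original frame.
df["newcol1"] = 0
df.loc[user_changed | time_jump, "newcol1"] = 1

print(df)
```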