I have a pandas DataFrame:
   USER_ID   USER_NAME  USER_REPUTATION  NUMBER_OF_ANSWERS  NUMBER_OF_QUESTIONS  BADGE_NAME BADGE_CAT
0        1  Ahmad Anis              123                  2                    3      Topper      HTML
1        1  Ahmad Anis              123                  2                    3  programmer    Random
I want to convert it to the following, where the values shared between the rows are shown only once:
   USER_ID   USER_NAME  USER_REPUTATION  NUMBER_OF_ANSWERS  NUMBER_OF_QUESTIONS  BADGE_NAME BADGE_CAT
0        1  Ahmad Anis              123                  2                    3      Topper      HTML
1                                                                             programmer    Random
I cannot display it correctly here, but I do not want similar values repeated on every row; instead, the common values should be merged into one big cell, with only the uncommon values displayed. I tried using multi-indexing but could not get it to work.
I want something similar to this
But there it is only done with a single column; I want it done with my USER_ID, USER_NAME, USER_REPUTATION, NUMBER_OF_ANSWERS, NUMBER_OF_QUESTIONS columns.
I think you're looking for set_index:
cols = ["USER_ID", "USER_NAME", "USER_REPUTATION", "NUMBER_OF_ANSWERS", "NUMBER_OF_QUESTIONS"]
ndf = df.set_index(cols)
with some sample data:
>>> df
A B C D E
0 one A foo 0.945847 -0.561259
1 one A foo 0.579520 0.130518
2 one A foo -0.683629 -1.084639
3 one A bar -0.168223 -0.311991
4 one B bar 0.007965 1.108121
5 one B bar -1.877323 -0.258055
6 one B bar 0.992160 0.192339
7 one B foo -0.421557 -0.805156
8 two C bar -0.346622 1.335197
9 two C foo -0.979483 -1.382465
10 two C bar -0.815332 -1.491385
11 two C foo -2.112730 -0.331574
>>> cols = ["A", "B", "C"]
>>> ndf = df.set_index(cols)
>>> ndf
D E
A B C
one A foo 0.945847 -0.561259
foo 0.579520 0.130518
foo -0.683629 -1.084639
bar -0.168223 -0.311991
B bar 0.007965 1.108121
bar -1.877323 -0.258055
bar 0.992160 0.192339
foo -0.421557 -0.805156
two C bar -0.346622 1.335197
foo -0.979483 -1.382465
bar -0.815332 -1.491385
foo -2.112730 -0.331574
ndf is now a multi-indexed frame.
To put D and E at the same level as A, B and C, we can set the index to all of them for display purposes:
the_df = df.set_index(["A", "B", "C", "D", "E"])
to get the merged display (in an IPython notebook, for example).
Note that if you were to look at this in console:
>>> the_df
Empty DataFrame
Columns: []
Index: [(one, A, foo, 0.945847, -0.561259), (one, A, foo, 0.57952, 0.130518), ...]
because we set everything to the index and nothing remained in the values! But if you'd like to see it in the console as well, one trick is to use a "ghost" column, i.e., with name and values being the empty string "":
>>> the_df[""] = ""
>>> the_df
A B C D E
one A foo 0.945847 -0.561259
0.579520 0.130518
-0.683629 -1.084639
bar -0.168223 -0.311991
B bar 0.007965 1.108121
-1.877323 -0.258055
0.992160 0.192339
foo -0.421557 -0.805156
two C bar -0.346622 1.335197
foo -0.979483 -1.382465
bar -0.815332 -1.491385
foo -2.112730 -0.331574
To remove the extra first row from the HTML:
from bs4 import BeautifulSoup

# parse the rendered table
soup = BeautifulSoup(the_df.to_html(), "html.parser")
# find the first row and remove it
soup.find("tr").extract()
# get the HTML back
html = str(soup)
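To render the cleaned HTML inline, a minimal sketch, assuming you are in a Jupyter/IPython notebook:

from IPython.display import HTML

HTML(html)  # display the modified table inline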
When trying to read some columns by their indices from a tabular file with pandas read_csv, it seems usecols and names get out of sync with each other.
For example, having the file test.csv:
FOO A -46450.494736 0.0728830817231
FOO A -46339.7126846 0.0695018062805
FOO A -46322.4942905 0.0866205763556
FOO B -46473.3117983 0.0481618121947
FOO B -46537.6827055 0.0436893868921
FOO B -46467.2102205 0.0485001911304
BAR C -33424.1224914 6.7981041851
BAR C -33461.4101485 7.40607068177
BAR C -33404.6396495 4.72117502707
and trying to read 3 columns by index without preserving the original order:
cols = [1, 2, 0]
names = ['X', 'Y', 'Z']
df = pd.read_csv(
    'test.csv', sep='\t',
    header=None,
    index_col=None,
    usecols=cols, names=names)
I'm getting the following dataframe:
X Y Z
0 FOO A -46450.494736
1 FOO A -46339.712685
2 FOO A -46322.494290
3 FOO B -46473.311798
4 FOO B -46537.682706
5 FOO B -46467.210220
6 BAR C -33424.122491
7 BAR C -33461.410148
8 BAR C -33404.639650
whereas I would expect column Z to have the FOO and BAR, like this:
Z X Y
0 FOO A -46450.494736
1 FOO A -46339.712685
2 FOO A -46322.494290
3 FOO B -46473.311798
4 FOO B -46537.682706
5 FOO B -46467.210220
6 BAR C -33424.122491
7 BAR C -33461.410148
8 BAR C -33404.639650
I know pandas stores a DataFrame's columns in a dict-like structure, so the column order may differ from the order requested with usecols, but the problem here is that using usecols indices together with names doesn't line up the way I expect.
I really need to read the columns by their indices and then assign names to them. Is there any workaround for this?
The documentation could be clearer on this (feel free to open an issue, or even better submit a pull request!), but usecols is set-like: it does not define an order of columns, it is simply tested against for membership.
from io import StringIO
pd.read_csv(StringIO("""a,b,c
1,2,3
4,5,6"""), usecols=[0, 1, 2])
Out[31]:
a b c
0 1 2 3
1 4 5 6
pd.read_csv(StringIO("""a,b,c
1,2,3
4,5,6"""), usecols=[2, 1, 0])
Out[32]:
a b c
0 1 2 3
1 4 5 6
names on the other hand is ordered. So in this case, the answer is to specify the names in the order you want them.
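Applied to the question's file, that means listing the names in the file order of the selected columns, so that Z labels column 0. A sketch, assuming the same tab-separated test.csv:

import pandas as pd

# names are assigned to the used columns in their file order (0, 1, 2),
# so 'Z' ends up on column 0, which holds FOO/BAR
df = pd.read_csv('test.csv', sep='\t', header=None,
                 usecols=[0, 1, 2], names=['Z', 'X', 'Y'])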
Consider the following dataframe:
columns = ['A', 'B', 'C', 'D']
records = [
    ['foo', 'one', 0.162003, 0.087469],
    ['bar', 'one', -1.156319, -1.5262719999999999],
    ['foo', 'two', 0.833892, -1.666304],
    ['bar', 'three', -2.026673, -0.32205700000000004],
    ['foo', 'two', 0.41145200000000004, -0.9543709999999999],
    ['bar', 'two', 0.765878, -0.095968],
    ['foo', 'one', -0.65489, 0.678091],
    ['foo', 'three', -1.789842, -1.130922]
]
df = pd.DataFrame.from_records(records, columns=columns)
"""
A B C D
0 foo one 0.162003 0.087469
1 bar one -1.156319 -1.526272
2 foo two 0.833892 -1.666304
3 bar three -2.026673 -0.322057
4 foo two 0.411452 -0.954371
5 bar two 0.765878 -0.095968
6 foo one -0.654890 0.678091
7 foo three -1.789842 -1.130922
"""
The following commands work:
df.groupby('A').apply(lambda x: (x['C'] - x['D']))
df.groupby('A').apply(lambda x: (x['C'] - x['D']).mean())
but none of the following work:
df.groupby('A').transform(lambda x: (x['C'] - x['D']))
# KeyError or ValueError: could not broadcast input array from shape (5) into shape (5,3)
df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
# KeyError or TypeError: cannot concatenate a non-NDFrame object
Why? The example in the documentation seems to suggest that calling transform on a group allows one to do row-wise operation processing:
# Note that the following suggests a row-wise operation (x.mean() is the column mean)
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ts.groupby(key).transform(zscore)
In other words, I thought that transform is essentially a specific type of apply (the one that does not aggregate). Where am I wrong?
For reference, below is the construction of the original dataframe above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8), 'D': np.random.randn(8)})
Two major differences between apply and transform
There are two major differences between the transform and apply groupby methods.
Input:
apply implicitly passes all the columns for each group as a DataFrame to the custom function, while transform passes each column for each group individually as a Series to the custom function.
Output:
The custom function passed to apply can return a scalar, a Series or a DataFrame (or a numpy array, or even a list).
The custom function passed to transform must return a sequence (a one-dimensional Series, array or list) the same length as the group.
So, transform works on just one Series at a time and apply works on the entire DataFrame at once.
Inspecting the custom function
It can help quite a bit to inspect the input to your custom function passed to apply or transform.
Examples
Let's create some sample data and inspect the groups so that you can see what I am talking about:
import pandas as pd
import numpy as np
df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})
State a b
0 Texas 4 6
1 Texas 5 10
2 Florida 1 3
3 Florida 3 11
Let's create a simple custom function that prints out the type of the implicitly passed object and then raises an exception so that execution can be stopped.
def inspect(x):
    print(type(x))
    raise
Now let's pass this function to both the groupby apply and transform methods to see what object is passed to it:
df.groupby('State').apply(inspect)
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
RuntimeError
As you can see, a DataFrame is passed into the inspect function. You might be wondering why the type, DataFrame, got printed out twice. Pandas runs the first group twice. It does this to determine if there is a fast way to complete the computation or not. This is a minor detail that you shouldn't worry about.
Now, let's do the same thing with transform
df.groupby('State').transform(inspect)
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
RuntimeError
It is passed a Series - a totally different Pandas object.
So, transform is only allowed to work with a single Series at a time. It is impossible for it to act on two columns at the same time. So, if we try to subtract column b from a inside our custom function, we get an error with transform. See below:
def subtract_two(x):
    return x['a'] - x['b']
df.groupby('State').transform(subtract_two)
KeyError: ('a', 'occurred at index a')
We get a KeyError as pandas is attempting to find the Series index a which does not exist. You can complete this operation with apply as it has the entire DataFrame:
df.groupby('State').apply(subtract_two)
State
Florida 2 -2
3 -8
Texas 0 -2
1 -5
dtype: int64
The output is a Series and a little confusing as the original index is kept, but we have access to all columns.
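If you want that per-row result back on the original frame, one approach (a sketch, using the frame above) is to drop the group level so the result realigns on df's index:

result = df.groupby('State').apply(subtract_two)
result = result.reset_index(level=0, drop=True)  # drop the 'State' level
df['a_minus_b'] = result                         # aligns on the original index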
Displaying the passed pandas object
It can help even more to display the entire pandas object within the custom function, so you can see exactly what you are operating with. You can use print statements, but I like to use the display function from the IPython.display module so that the DataFrames get nicely rendered as HTML in a Jupyter notebook:
from IPython.display import display

def subtract_two(x):
    display(x)
    return x['a'] - x['b']
Transform must return a single dimensional sequence the same size as the group
The other difference is that transform must return a single dimensional sequence the same size as the group. In this particular instance, each group has two rows, so transform must return a sequence of two rows. If it does not then an error is raised:
def return_three(x):
    return np.array([1, 2, 3])
df.groupby('State').transform(return_three)
ValueError: transform must return a scalar value for each group
The error message is not really descriptive of the problem. You must return a sequence the same length as the group. So, a function like this would work:
def rand_group_len(x):
    return np.random.rand(len(x))
df.groupby('State').transform(rand_group_len)
a b
0 0.962070 0.151440
1 0.440956 0.782176
2 0.642218 0.483257
3 0.056047 0.238208
Returning a single scalar object also works for transform
If you return just a single scalar from your custom function, then transform will use it for each of the rows in the group:
def group_sum(x):
    return x.sum()
df.groupby('State').transform(group_sum)
a b
0 9 16
1 9 16
2 4 14
3 4 14
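As an aside, the same broadcast is available by passing the aggregation by name, which is standard pandas usage:

df.groupby('State').transform('sum')  # same broadcast result as group_sum above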
As I felt similarly confused about the .transform operation vs. .apply, I found a few answers shedding some light on the issue. This answer, for example, was very helpful.
My takeaway so far is that .transform works on (or deals with) Series (columns) in isolation from each other. What this means is that in your last two calls:
df.groupby('A').transform(lambda x: (x['C'] - x['D']))
df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
You asked .transform to take values from two columns, but it does not actually 'see' both of them at the same time (so to speak). transform looks at the dataframe columns one by one and returns a Series (or group of Series) made of scalars repeated len(input_column) times.
So this scalar, which .transform uses to build the Series, is the result of some reduction function applied to an input Series (and only ONE series/column at a time).
Consider this example (on your dataframe):
zscore = lambda x: (x - x.mean()) / x.std() # Note that it does not reference anything outside of 'x' and for transform 'x' is one column.
df.groupby('A').transform(zscore)
will yield:
C D
0 0.989 0.128
1 -0.478 0.489
2 0.889 -0.589
3 -0.671 -1.150
4 0.034 -0.285
5 1.149 0.662
6 -1.404 -0.907
7 -0.509 1.653
Which is exactly the same result as using it on only one column at a time:
df.groupby('A')['C'].transform(zscore)
yielding:
0 0.989
1 -0.478
2 0.889
3 -0.671
4 0.034
5 1.149
6 -1.404
7 -0.509
Note that .apply in the last example (df.groupby('A')['C'].apply(zscore)) would work in exactly the same way, but it would fail if you tried using it on a dataframe:
df.groupby('A').apply(zscore)
gives error:
ValueError: operands could not be broadcast together with shapes (6,) (2,)
So where else is .transform useful? The simplest case is trying to assign the results of a reduction function back to the original dataframe.
df['sum_C'] = df.groupby('A')['C'].transform(sum)
df.sort_values('A')  # to clearly see that the scalar ('sum') applies to the whole column of the group
yielding:
A B C D sum_C
1 bar one 1.998 0.593 3.973
3 bar three 1.287 -0.639 3.973
5 bar two 0.687 -1.027 3.973
4 foo two 0.205 1.274 4.373
2 foo two 0.128 0.924 4.373
6 foo one 2.113 -0.516 4.373
7 foo three 0.657 -1.179 4.373
0 foo one 1.270 0.201 4.373
Trying the same with .apply would give NaNs in sum_C, because .apply returns a reduced Series, which pandas does not know how to broadcast back:
df.groupby('A')['C'].apply(sum)
giving:
A
bar 3.973
foo 4.373
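If you did want to use such a reduced result without transform, a hedged alternative is to map it back through the grouping column:

df['sum_C'] = df['A'].map(df.groupby('A')['C'].sum())  # same values as the transform above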
There are also cases when .transform is used to filter the data:
df[df.groupby(['B'])['D'].transform(sum) < -1]
A B C D
3 bar three 1.287 -0.639
7 foo three 0.657 -1.179
I hope this adds a bit more clarity.
I am going to use a very simple snippet to illustrate the difference:
test = pd.DataFrame({'id':[1,2,3,1,2,3,1,2,3], 'price':[1,2,3,2,3,1,3,1,2]})
grouping = test.groupby('id')['price']
The DataFrame looks like this:
id price
0 1 1
1 2 2
2 3 3
3 1 2
4 2 3
5 3 1
6 1 3
7 2 1
8 3 2
There are 3 customer IDs in this table; each customer made three transactions, paying 1, 2 and 3 dollars across them.
Now, I want to find the minimum payment made by each customer. There are two ways of doing it:
Using apply:
grouping.apply(min)  # equivalent to the direct aggregation grouping.min()
The return looks like this:
id
1 1
2 1
3 1
Name: price, dtype: int64
pandas.core.series.Series  # return type
Int64Index([1, 2, 3], dtype='int64', name='id')  # the returned Series' index
# length is 3
Using transform:
grouping.transform(min)
The return looks like this:
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
Name: price, dtype: int64
pandas.core.series.Series # return type
RangeIndex(start=0, stop=9, step=1) # The returned Series' index
# length is 9
Both methods return a Series object, but the length of the first one is 3 and the length of the second one is 9.
If you want to answer What is the minimum price paid by each customer, then apply is the more suitable method to choose.
If you want to answer What is the difference between the amount paid for each transaction and the minimum payment, then you want to use transform, because:
test['minimum'] = grouping.transform(min)  # creates an extra column filled with the minimum payment
test.price - test.minimum                  # returns the difference for each row
apply does not work here simply because it returns a Series of size 3, while the original df has length 9; you cannot easily integrate it back into the original df.
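That said, if you only had the length-3 apply result, a hedged workaround is to map it back through the id column:

test['minimum'] = test['id'].map(grouping.apply(min))  # broadcasts each id's minimum to all 9 rows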
A transform on a grouped column is equivalent to an aggregation followed by a map back onto the grouping key:
tmp = df.groupby(['A'])['c'].transform('mean')
is like
tmp1 = df.groupby(['A']).agg({'c': 'mean'})
tmp = df['A'].map(tmp1['c'])
or
tmp1 = df.groupby(['A'])['c'].mean()
tmp = df['A'].map(tmp1)
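A minimal check of that equivalence, on a hypothetical toy frame:

import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y'], 'c': [1, 3, 5]})
tmp_transform = df.groupby('A')['c'].transform('mean')
tmp_map = df['A'].map(df.groupby('A')['c'].mean())
assert (tmp_transform == tmp_map).all()  # both give 2.0, 2.0, 5.0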
You can use zscore to analyze the data in columns C and D for outliers, where zscore is (series - series.mean()) / series.std(). Use apply to create a user-defined function for the difference between C and D, producing a new resulting DataFrame. apply works with the whole group result set.
from scipy.stats import zscore
import pandas as pd

columns = ['A', 'B', 'C', 'D']
records = [
    ['foo', 'one', 0.162003, 0.087469],
    ['bar', 'one', -1.156319, -1.5262719999999999],
    ['foo', 'two', 0.833892, -1.666304],
    ['bar', 'three', -2.026673, -0.32205700000000004],
    ['foo', 'two', 0.41145200000000004, -0.9543709999999999],
    ['bar', 'two', 0.765878, -0.095968],
    ['foo', 'one', -0.65489, 0.678091],
    ['foo', 'three', -1.789842, -1.130922]
]
df = pd.DataFrame.from_records(records, columns=columns)
print(df)
standardize = df.groupby('A')[['C', 'D']].transform(zscore)
print(standardize)
outliersC = (standardize['C'] < -1.1) | (standardize['C'] > 1.1)
outliersD = (standardize['D'] < -1.1) | (standardize['D'] > 1.1)
results = df[outliersC | outliersD]
print(results)
#Dataframe results
A B C D
0 foo one 0.162003 0.087469
1 bar one -1.156319 -1.526272
2 foo two 0.833892 -1.666304
3 bar three -2.026673 -0.322057
4 foo two 0.411452 -0.954371
5 bar two 0.765878 -0.095968
6 foo one -0.654890 0.678091
7 foo three -1.789842 -1.130922
#C and D transformed Z score
C D
0 0.398046 0.801292
1 -0.300518 -1.398845
2 1.121882 -1.251188
3 -1.046514 0.519353
4 0.666781 -0.417997
5 1.347032 0.879491
6 -0.482004 1.492511
7 -1.704704 -0.624618
#filtering using arbitrary cutoffs -1.1 and 1.1 for the z-score
A B C D
1 bar one -1.156319 -1.526272
2 foo two 0.833892 -1.666304
5 bar two 0.765878 -0.095968
6 foo one -0.654890 0.678091
7 foo three -1.789842 -1.130922
Part 2: using apply for the column difference
splitting = df.groupby('A')

# look at how the data is grouped
for group_name, group in splitting:
    print(group_name)
    print(group)

def column_difference(gr):
    return gr['C'] - gr['D']

grouped = splitting.apply(column_difference)
print(grouped)
A
bar 1 0.369953
3 -1.704616
5 0.861846
foo 0 0.074534
2 2.500196
4 1.365823
6 -1.332981
7 -0.658920
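Worth noting: because this particular difference is purely row-wise, it does not actually need groupby at all; a plain vectorized sketch gives the same values on the original index:

df['C_minus_D'] = df['C'] - df['D']  # same per-row values, without the group level in the index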
What's the difference between:
pandas df.loc[:,('col_a','col_b')]
and
df.loc[:,['col_a','col_b']]
The link below doesn't mention the latter, though it works. Do both pull a view? Does the first pull a view and the second pull a copy? Love learning Pandas.
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Thanks
If your DataFrame has a simple column index, then there is no difference.
For example,
In [8]: df = pd.DataFrame(np.arange(12).reshape(4,3), columns=list('ABC'))
In [9]: df.loc[:, ['A','B']]
Out[9]:
A B
0 0 1
1 3 4
2 6 7
3 9 10
In [10]: df.loc[:, ('A','B')]
Out[10]:
A B
0 0 1
1 3 4
2 6 7
3 9 10
But if the DataFrame has a MultiIndex, there can be a big difference:
df = pd.DataFrame(np.random.randint(10, size=(5, 4)),
                  columns=pd.MultiIndex.from_arrays([['foo']*2 + ['bar']*2,
                                                     list('ABAB')]),
                  index=pd.MultiIndex.from_arrays([['baz']*2 + ['qux']*3,
                                                   list('CDCDC')]))
# foo bar
# A B A B
# baz C 7 9 9 9
# D 7 5 5 4
# qux C 5 0 5 1
# D 1 7 7 4
# C 6 4 3 5
In [27]: df.loc[:, ('foo','B')]
Out[27]:
baz C 9
D 5
qux C 0
D 7
C 4
Name: (foo, B), dtype: int64
In [28]: df.loc[:, ['foo','B']]
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (1), lexsort depth (0)'
The KeyError is saying that the MultiIndex has to be lexsorted. If we sort it (with sort_index(axis=1); the older sortlevel is deprecated), we still get a different result:
In [29]: df.sort_index(axis=1).loc[:, ('foo','B')]
Out[29]:
baz C 9
D 5
qux C 0
D 7
C 4
Name: (foo, B), dtype: int64
In [30]: df.sort_index(axis=1).loc[:, ['foo','B']]
Out[30]:
foo
A B
baz C 7 9
D 7 5
qux C 5 0
D 1 7
C 6 4
Why is that? df.sort_index(axis=1).loc[:, ('foo','B')] selects the single column whose first column level equals foo and whose second column level is B.
In contrast, df.sort_index(axis=1).loc[:, ['foo','B']] selects the columns whose first column level is either foo or B. With respect to the first column level, there are no B columns, but there are two foo columns.
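If you want several specific (level0, level1) columns instead, a list of tuples does that; a sketch against the frame above:

df.sort_index(axis=1).loc[:, [('foo', 'A'), ('foo', 'B')]]  # exactly those two columns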
I think the operating principle with Pandas is that if you use df.loc[...] as
an expression, you should assume df.loc may be returning a copy or a view. The Pandas docs do not specify any rules about which you should expect.
However, if you make an assignment of the form
df.loc[...] = value
then you can trust Pandas to alter df itself.
The reason why the documentation warns about the distinction between views and copies is so that you are aware of the pitfall of using chain assignments of the form
df.loc[...][...] = value
Here, Pandas evaluates df.loc[...] first, which may be a view or a copy. Now if it is a copy, then
df.loc[...][...] = value
is altering a copy of some portion of df, and thus has no effect on df itself. To add insult to injury, the effect on the copy is lost as well since there are no references to the copy and thus there is no way to access the copy after the assignment statement completes, and (at least in CPython) it is therefore soon-to-be garbage collected.
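A minimal sketch of the two patterns, with hypothetical column names:

# chained indexing: may silently modify a temporary copy and leave df untouched
# df[df['A'] > 0]['B'] = 0

# single .loc call: guaranteed to modify df itself
df.loc[df['A'] > 0, 'B'] = 0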
I do not know of a practical fool-proof a priori way to determine if df.loc[...] is going to return a view or a copy.
However, there are some rules of thumb which may help guide your intuition (but note that we are talking about implementation details here, so there is no guarantee that Pandas needs to behave this way in the future):
If the resultant NDFrame cannot be expressed as a basic slice of the underlying NumPy array, then it will probably be a copy. Thus, a selection of arbitrary rows or columns will lead to a copy. A selection of sequential rows and/or sequential columns (which may be expressed as a slice) may return a view.
If the resultant NDFrame has columns of different dtypes, then df.loc will again probably return a copy.
However, there is an easy way to determine a posteriori whether x = df.loc[...] is a view: simply see if changing a value in x affects df. If it does, it is a view; if not, x is a copy.
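A sketch of that check (hypothetical numeric selection; it restores the original value afterwards):

x = df.iloc[:, :2]                        # hypothetical selection to test
original = df.iat[0, 0]
x.iat[0, 0] = original + 1                # mutate through the selection
is_view = (df.iat[0, 0] == original + 1)  # True -> view, False -> copy
df.iat[0, 0] = original                   # restore df either way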