I have a dataframe as follows:
A B C
0 foo 1.496337 -0.604264
1 bar -0.025106 0.257354
2 foo 0.958001 0.933328
3 foo -1.126581 0.570908
4 bar -0.428304 0.881995
5 foo -0.955252 1.408930
6 bar 0.504582 0.455287
7 bar -1.076096 0.536741
8 bar 0.351544 -1.146554
9 foo 0.430260 -0.348472
I would like to get the max of column B for each group (when grouped by A) and add it to column C. So here is what I tried:
Group by A:
df = df.groupby(by='A')
Get the maximum of column B and then tried to apply it to column 'C':
for name in ['foo','bar']:
    maxi = df.get_group(name)['B'].max()
    df.get_group(name)['C'] = df.get_group(name)['C']+maxi
At this point pandas suggests: Try using .loc[row_indexer,col_indexer] = value instead. Does this mean I have to loop over the rows with an if on the column A value and modify the C data one by one? That does not seem pandas-ish, and I feel that I am missing something. How could I better work with this grouped dataframe?
Such operations are done using transforms or aggregations.
In your case you need transform
# groupby 'A'
grouped = df.groupby('A')
# transform B so every row becomes the maximum along the group:
max_B = grouped['B'].transform('max')
# add the new column to the old df
df['D'] = df['C'] + max_B
Or in one line:
In [2]: df['D'] = df.groupby('A')['B'].transform('max') + df['C']
In [3]: df
Out[3]:
A B C D
0 foo 1.496337 -0.604264 0.892073
1 bar -0.025106 0.257354 0.761936
2 foo 0.958001 0.933328 2.429665
3 foo -1.126581 0.570908 2.067245
4 bar -0.428304 0.881995 1.386577
5 foo -0.955252 1.408930 2.905267
6 bar 0.504582 0.455287 0.959869
7 bar -1.076096 0.536741 1.041323
8 bar 0.351544 -1.146554 -0.641972
9 foo 0.430260 -0.348472 1.147865
For more info, see
http://pandas.pydata.org/pandas-docs/stable/groupby.html
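To make the difference between the two kinds of operation concrete, here is a minimal sketch on a small made-up frame (the values are illustrative, not the data from the question). It assumes you want to overwrite C in place rather than create a new column D.

```python
import pandas as pd

# Small illustrative frame (values are made up for this sketch)
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": [1.0, 5.0, 3.0, 2.0],
    "C": [10.0, 20.0, 30.0, 40.0],
})

# transform returns one value per original row, aligned with df's index
group_max = df.groupby("A")["B"].transform("max")

# agg returns one value per group instead, indexed by the group key
per_group = df.groupby("A")["B"].agg("max")

# Because group_max lines up with df row by row, plain addition works
df["C"] = df["C"] + group_max
print(df)
```

Since the transformed Series shares the index of the original frame, no loop over groups and no .loc assignment is needed.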
I was doing something very basic like this -
data = np.arange(1,13).reshape(4,3)
table = pd.DataFrame(data, index = list('abcd'), columns =['foo','bar','baz'])
table
foo bar baz
a 1 2 3
b 4 5 6
c 7 8 9
d 10 11 12
And then I ran this -
table['bar':'foo']
#output
foo bar baz
c 7 8 9
d 10 11 12
I don't get why I am getting this result. Note that I am not asking for any other solution or workaround. I am just looking for explanation/rules behind this behavior.
I'm not entirely sure, but it looks like you can't slice column names with plain brackets; the slicing only works on the rows, so only c and d are (lexicographically) between bar and foo.
You can instead use loc:
table.loc[:, 'foo':'bar']
Note that I changed the order of foo and bar. This is because columns are sliced in the order you defined them (foo -> bar -> baz), not lexicographically. table.loc[:, 'bar':'foo'] will return a dataframe with no columns.
It's basically outputting row slices by comparing bar and foo lexicographically with the existing row labels. The output includes rows c and d as they're the only two labels that fall between bar and foo: a < b < bar < c < d < ... < foo
First, you have to know that the notation df[x:y] tries to slice your dataframe by index labels (rows), not by columns. This is different from the notation df[x], which tries to select a column. A generic way to filter your dataframe is to use .loc (or .iloc). You should read the documentation about Indexing and selecting data.
>>> table['bar':'foo']
foo bar baz
c 7 8 9 # 'c' >= 'bar' (a b bar c d)
d 10 11 12 # 'd' <= 'foo' (a b c d foo)
Your code happens to work because the index holds strings. With the default RangeIndex (integer labels), which is what we usually have, the same slice raises an exception:
>>> table.reset_index(drop=True)['bar':'foo']
...
TypeError: cannot do slice indexing on RangeIndex with these indexers [bar] of type str
The error makes the problem clear: the slice was applied to the rows. To slice columns, use .loc:
>>> table.loc[:, 'foo':'bar']
foo bar
a 1 2
b 4 5
c 7 8
d 10 11
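The rules discussed above can be checked side by side. This is a small sketch that rebuilds the table from the question; the variable names rows, same_rows, column and cols are only illustrative.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame(np.arange(1, 13).reshape(4, 3),
                     index=list("abcd"), columns=["foo", "bar", "baz"])

# A slice inside plain brackets is applied to the ROW labels
rows = table["bar":"foo"]

# The explicit spelling of the same row slice
same_rows = table.loc["bar":"foo"]

# A single label inside plain brackets selects a COLUMN
column = table["bar"]

# Column slices need .loc and follow the column order, not alphabetical order
cols = table.loc[:, "foo":"bar"]

print(rows)
print(cols)
```

Spelling the row slice as table.loc["bar":"foo"] makes it obvious which axis is being sliced, which is why .loc is usually the safer choice.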
I have a pandas DataFrame.
   USER_ID   USER_NAME  USER_REPUTATION  NUMBER_OF_ANSWERS  NUMBER_OF_QUESTIONS  BADGE_NAME  BADGE_CAT
0        1  Ahmad Anis              123                  2                    3      Topper       HTML
1        1  Ahmad Anis              123                  2                    3  programmer     Random
I want to convert it to following
   USER_ID   USER_NAME  USER_REPUTATION  NUMBER_OF_ANSWERS  NUMBER_OF_QUESTIONS  BADGE_NAME  BADGE_CAT
0        1  Ahmad Anis              123                  2                    3      Topper       HTML
1                                                                                programmer     Random
I can not display it correctly, but I do not want it to repeat similar things again, instead make them a big box, and uncommon things are displayed. I tried using multi-indexing but it was not working.
I want something similar to this
But there it is only done with a single column; I want to do it with my USER_ID, USER_NAME, USER_REPUTATION, NUMBER_OF_ANSWERS and NUMBER_OF_QUESTIONS columns.
I think you're looking for set_index:
cols = ["USER_ID", "USER_NAME", "USER_REPUTATION", "NUMBER_OF_ANSWERS", "NUMBER_OF_QUESTIONS"]
ndf = df.set_index(cols)
with some sample data:
>>> df
A B C D E
0 one A foo 0.945847 -0.561259
1 one A foo 0.579520 0.130518
2 one A foo -0.683629 -1.084639
3 one A bar -0.168223 -0.311991
4 one B bar 0.007965 1.108121
5 one B bar -1.877323 -0.258055
6 one B bar 0.992160 0.192339
7 one B foo -0.421557 -0.805156
8 two C bar -0.346622 1.335197
9 two C foo -0.979483 -1.382465
10 two C bar -0.815332 -1.491385
11 two C foo -2.112730 -0.331574
>>> cols = ["A", "B", "C"]
>>> ndf = df.set_index(cols)
>>> ndf
D E
A B C
one A foo 0.945847 -0.561259
foo 0.579520 0.130518
foo -0.683629 -1.084639
bar -0.168223 -0.311991
B bar 0.007965 1.108121
bar -1.877323 -0.258055
bar 0.992160 0.192339
foo -0.421557 -0.805156
two C bar -0.346622 1.335197
foo -0.979483 -1.382465
bar -0.815332 -1.491385
foo -2.112730 -0.331574
ndf is now a multi index frame.
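Applied to the frame from the question, the same idea might look like the sketch below. The DataFrame is rebuilt by hand from the two rows shown in the question, so treat the construction as an assumption about how the real data is laid out.

```python
import pandas as pd

# Rebuilt by hand from the two rows shown in the question
df = pd.DataFrame({
    "USER_ID": [1, 1],
    "USER_NAME": ["Ahmad Anis", "Ahmad Anis"],
    "USER_REPUTATION": [123, 123],
    "NUMBER_OF_ANSWERS": [2, 2],
    "NUMBER_OF_QUESTIONS": [3, 3],
    "BADGE_NAME": ["Topper", "programmer"],
    "BADGE_CAT": ["HTML", "Random"],
})

cols = ["USER_ID", "USER_NAME", "USER_REPUTATION",
        "NUMBER_OF_ANSWERS", "NUMBER_OF_QUESTIONS"]
ndf = df.set_index(cols)

# Repeated index values are shown only once in the text display
print(ndf)

# In the HTML rendering the repeated index cells are merged with rowspan
html = ndf.to_html()
```

Only the display changes here: the underlying index still stores every value for every row.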
To make the D and E at the same level as the A, B and C, we can set the index to all of them for the display purposes:
the_df = df.set_index(["A", "B", "C", "D", "E"])
to get (in an IPython notebook, for example)
Note that if you were to look at this in console:
>>> the_df
Empty DataFrame
Columns: []
Index: [(one, A, foo, 0.945847, -0.561259), (one, A, foo, 0.57952, 0.130518), ...]
because we set everything to the index and nothing remained in the values! But if you'd like to see it in the console as well, one trick is to use a "ghost" column, i.e., with name and values being the empty string "":
>>> the_df[""] = ""
>>> the_df
A B C D E
one A foo 0.945847 -0.561259
0.579520 0.130518
-0.683629 -1.084639
bar -0.168223 -0.311991
B bar 0.007965 1.108121
-1.877323 -0.258055
0.992160 0.192339
foo -0.421557 -0.805156
two C bar -0.346622 1.335197
foo -0.979483 -1.382465
bar -0.815332 -1.491385
foo -2.112730 -0.331574
removing the extra first row in HTML:
from bs4 import BeautifulSoup
# form the soup
soup = BeautifulSoup(the_df.to_html(), "html.parser")
# find the first row and remove it
soup.find("tr").extract()
# get HTML back
html = str(soup)
When trying to read some columns using their indices from a tabular file with pandas read_csv it seems the usecols and names get out of sync with each other.
For example, having the file test.csv:
FOO A -46450.494736 0.0728830817231
FOO A -46339.7126846 0.0695018062805
FOO A -46322.4942905 0.0866205763556
FOO B -46473.3117983 0.0481618121947
FOO B -46537.6827055 0.0436893868921
FOO B -46467.2102205 0.0485001911304
BAR C -33424.1224914 6.7981041851
BAR C -33461.4101485 7.40607068177
BAR C -33404.6396495 4.72117502707
and trying to read 3 columns by index without preserving the original order:
cols = [1, 2, 0]
names = ['X', 'Y', 'Z']
df = pd.read_csv(
'test.csv', sep='\t',
header=None,
index_col=None,
usecols=cols, names=names)
I'm getting the following dataframe:
X Y Z
0 FOO A -46450.494736
1 FOO A -46339.712685
2 FOO A -46322.494290
3 FOO B -46473.311798
4 FOO B -46537.682706
5 FOO B -46467.210220
6 BAR C -33424.122491
7 BAR C -33461.410148
8 BAR C -33404.639650
whereas I would expect column Z to have the FOO and BAR, like this:
Z X Y
0 FOO A -46450.494736
1 FOO A -46339.712685
2 FOO A -46322.494290
3 FOO B -46473.311798
4 FOO B -46537.682706
5 FOO B -46467.210220
6 BAR C -33424.122491
7 BAR C -33461.410148
8 BAR C -33404.639650
I know pandas stores dataframes like a dictionary, so the order of the columns may differ from the order requested with usecols, but the problem here is that combining usecols given as indices with names gives a result that doesn't make sense.
I really need to read the columns by their indices and then assign names to them. Is there any workaround for this?
The documentation could be clearer on this (feel free to make an issue, or even better submit a pull request!) but usecols is set-like - it does not define an order of columns, it simply is tested against for membership.
from io import StringIO
pd.read_csv(StringIO("""a,b,c
1,2,3
4,5,6"""), usecols=[0, 1, 2])
Out[31]:
a b c
0 1 2 3
1 4 5 6
pd.read_csv(StringIO("""a,b,c
1,2,3
4,5,6"""), usecols=[2, 1, 0])
Out[32]:
a b c
0 1 2 3
1 4 5 6
names on the other hand is ordered. So in this case, the answer is to specify the names in the order you want them.
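For the file in the question this could look like the following sketch. It uses an in-memory string with two of the original rows instead of test.csv, and the final reordering step is optional.

```python
from io import StringIO

import pandas as pd

# Two rows from the question, tab separated, held in memory for the sketch
raw = ("FOO\tA\t-46450.494736\t0.0728830817231\n"
       "BAR\tC\t-33424.1224914\t6.7981041851\n")

# usecols only selects columns; they come back in file order (0, 1, 2),
# so names has to be listed in file order as well: Z, X, Y
df = pd.read_csv(StringIO(raw), sep="\t", header=None,
                 usecols=[1, 2, 0], names=["Z", "X", "Y"])

# Reorder afterwards if a particular column order is wanted
df = df[["X", "Y", "Z"]]
print(df)
```

Selecting with usecols and ordering with an explicit column list afterwards keeps the two concerns separate.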
I have a dataframe with a number of columns, two of which are grouping variables.
>>> df2
Groupvar1 Groupvar2 x y z
0 A 1 0.726317 0.574514 0.700475
1 A 2 0.422089 0.798931 0.191157
2 A 3 0.888318 0.658061 0.686496
....
13 B 2 0.978920 0.764266 0.673941
14 B 3 0.759589 0.162488 0.698958
and I want to make a new dataframe which holds the difference between each datapoint in the original df and the mean corresponding to its subgroup.
So to begin with I make a new df with the grouped averages:
>>> grp_vars = ['Groupvar1','Groupvar2']
>>> df2_grp = df2.groupby(grp_vars)
>>> df2_grp_avg = df2_grp.mean()
>>> df2_grp_avg
x y z
Groupvar1 Groupvar2
A 1 0.364533 0.645237 0.886286
2 0.325533 0.500077 0.246287
3 0.796326 0.496950 0.510085
4 0.774854 0.688732 0.487547
B 1 0.743783 0.452482 0.612006
2 0.575687 0.396902 0.446126
3 0.473152 0.476379 0.508060
4 0.434320 0.406458 0.382187
and in the new dataframe I want to keep the deltas, defined as:
delta = individual value - average value of the subgroup this individual is a member of
Now, it's clear to me how to do this the hard way (for loop) but I suppose there must be a more elegant solution. I'd appreciate any advice on finding that more elegant solution. TIA.
Use the .groupby(...).transform function:
>>> demean = lambda df: df - df.mean()
>>> df2.groupby(['Groupvar1', 'Groupvar2']).transform(demean)
and then pd.concat the result with the grouping columns of the original dataframe.
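A complete version of this, including the concat step, might look like the sketch below. The data is made up, and it uses the built-in "mean" transform and subtracts it from the value columns, which is a variation on the lambda above rather than the exact same call.

```python
import pandas as pd

# Made-up data with the same shape as the question's frame
df2 = pd.DataFrame({
    "Groupvar1": ["A", "A", "A", "B", "B"],
    "Groupvar2": [1, 1, 2, 1, 1],
    "x": [1.0, 3.0, 5.0, 10.0, 20.0],
    "y": [2.0, 4.0, 6.0, 0.0, 8.0],
})
grp_vars = ["Groupvar1", "Groupvar2"]
value_cols = ["x", "y"]

# transform("mean") broadcasts each subgroup mean back onto the original rows
group_means = df2.groupby(grp_vars)[value_cols].transform("mean")
deltas = df2[value_cols] - group_means

# Put the grouping columns back next to the deltas
result = pd.concat([df2[grp_vars], deltas], axis=1)
print(result)
```

The result keeps the original row order and index, so it can be compared with df2 row by row.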
I have a dataframe that looks like this:
baz qux
one A
one B
two C
three A
one B
one C
I'm trying to reshape it to look like this:
one  two  three
A    C    A
B
B
C
I'm pretty confused about whether this is possible, and if so, how you would do it. I've tried using the pivot_table method as pd.pivot_table(cols='baz', rows='qux') but that threw a TypeError. I think I'm being an idiot and missing something really basic here. Any ideas?
I'm not sure if it's the most optimal way of doing it but it does the job:
import io
import pandas as pd
data = u'baz,qux\none,A\none,B\ntwo,C\nthree,A\none,B\none,C'
df = pd.read_csv(io.StringIO(data))
new = pd.DataFrame()
# sort=False keeps the groups in order of first appearance: one, two, three
for key, group in df.groupby('baz', sort=False):
    # renumber each group's rows from 0 so the columns line up side by side
    new = pd.concat([new, group['qux'].reset_index(drop=True).rename(key)],
                    axis=1)
print(new.fillna(''))
Which gives back:
  one two three
0   A   C     A
1   B
2   B
3   C
With pivot table you can get a matrix showing which baz corresponds to which qux:
>>> df['foo'] = 1 # Add aggregation column
>>> df.pivot_table('foo', index=['qux'], columns='baz') # older pandas versions called these rows= and cols=
one three two
A 1 1 NaN
B 1 NaN NaN
C 1 NaN 1
This is not quite what you asked for, but perhaps it suffices:
import numpy as np
import pandas as pd
df = pd.DataFrame({'baz':'one one two three one one'.split(),
'qux': list('ABCABC')})
grouped = df.groupby(['baz', 'qux'])
df2 = grouped.apply(pd.DataFrame.reset_index, drop=True)['qux'].unstack(level=0)
df2.reset_index(drop=True, inplace=True)
df2 = df2.reindex(columns='one two three'.split())
df2 = df2.replace(np.nan, '')
print(df2)
yields
  one two three
0   A         A
1   B
2   B
3   C   C
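If the exact layout from the question is wanted, one alternative (not the approach of the answers above) is to number the rows inside each group with cumcount and then pivot on that counter. This is only a sketch; the helper column name n is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({"baz": "one one two three one one".split(),
                   "qux": list("ABCABC")})

# Number the rows within each baz group: 0, 1, 2, ...
df["n"] = df.groupby("baz").cumcount()

# One column per baz value, one row per position within the group
wide = (df.pivot(index="n", columns="baz", values="qux")
          .reindex(columns=["one", "two", "three"])
          .fillna(""))
print(wide)
```

The counter gives every (n, baz) pair a unique cell, which is what pivot needs; without it the repeated "one" rows would collide.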