When trying to read some columns using their indices from a tabular file with pandas read_csv it seems the usecols and names get out of sync with each other.
For example, having the file test.csv:
FOO A -46450.494736 0.0728830817231
FOO A -46339.7126846 0.0695018062805
FOO A -46322.4942905 0.0866205763556
FOO B -46473.3117983 0.0481618121947
FOO B -46537.6827055 0.0436893868921
FOO B -46467.2102205 0.0485001911304
BAR C -33424.1224914 6.7981041851
BAR C -33461.4101485 7.40607068177
BAR C -33404.6396495 4.72117502707
and trying to read 3 columns by index without preserving the original order:
cols = [1, 2, 0]
names = ['X', 'Y', 'Z']
df = pd.read_csv('test.csv', sep='\t',
                 header=None,
                 index_col=None,
                 usecols=cols, names=names)
I'm getting the following dataframe:
X Y Z
0 FOO A -46450.494736
1 FOO A -46339.712685
2 FOO A -46322.494290
3 FOO B -46473.311798
4 FOO B -46537.682706
5 FOO B -46467.210220
6 BAR C -33424.122491
7 BAR C -33461.410148
8 BAR C -33404.639650
whereas I would expect column Z to have the FOO and BAR, like this:
Z X Y
0 FOO A -46450.494736
1 FOO A -46339.712685
2 FOO A -46322.494290
3 FOO B -46473.311798
4 FOO B -46537.682706
5 FOO B -46467.210220
6 BAR C -33424.122491
7 BAR C -33461.410148
8 BAR C -33404.639650
I know pandas stores DataFrame columns in a dict-like structure, so the column order may differ from the order requested with usecols, but the problem here is that combining usecols indices with names doesn't behave consistently.
I really need to read the columns by their indices and then assign names to them. Is there any workaround for this?
The documentation could be clearer on this (feel free to open an issue, or even better submit a pull request!), but usecols is set-like: it does not define an order of columns, it is simply tested for membership.
from io import StringIO
pd.read_csv(StringIO("""a,b,c
1,2,3
4,5,6"""), usecols=[0, 1, 2])
Out[31]:
a b c
0 1 2 3
1 4 5 6
pd.read_csv(StringIO("""a,b,c
1,2,3
4,5,6"""), usecols=[2, 1, 0])
Out[32]:
a b c
0 1 2 3
1 4 5 6
names on the other hand is ordered. So in this case, the answer is to specify the names in the order you want them.
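A minimal sketch of that workaround, using an in-memory copy of the sample data. The key point: names are assigned to the selected columns in their file order, so listing 'Z' first names column 0 'Z':

```python
import io

import pandas as pd

data = "FOO\tA\t-46450.494736\t0.0728830817231\nBAR\tC\t-33424.1224914\t6.7981041851\n"

# usecols only filters; the selected columns keep their file order,
# and names are then assigned to them in that same file order.
df = pd.read_csv(io.StringIO(data), sep="\t", header=None,
                 usecols=[0, 1, 2], names=["Z", "X", "Y"])
# df.columns is now ['Z', 'X', 'Y'], with 'Z' holding FOO/BAR
```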
I was doing something very basic like this -
data = np.arange(1,13).reshape(4,3)
table = pd.DataFrame(data, index = list('abcd'), columns =['foo','bar','baz'])
table
foo bar baz
a 1 2 3
b 4 5 6
c 7 8 9
d 10 11 12
And then I ran this -
table['bar':'foo']
#output
foo bar baz
c 7 8 9
d 10 11 12
I don't get why I am getting this result. Note that I am not asking for any other solution or workaround. I am just looking for explanation/rules behind this behavior.
I'm not entirely sure, but it looks like you can't use this slicing syntax for column names; the slicing only works on the rows, so only c and d are (lexicographically) between bar and foo.
You can instead use loc:
table.loc[:, 'foo':'bar']
Note that I changed the order of foo and bar; this is because the columns are ordered as you defined them, foo -> bar -> baz, and not lexicographically. 'bar':'foo' would therefore return an empty dataframe.
It's basically outputting row slices by comparing bar and foo lexicographically with the existing index labels. The output includes rows c and d, as they're the only two labels that fall between bar and foo: a < b < bar < c < d < ... < foo
First, you have to know that the notation df[x:y] tries to slice your dataframe by index labels, not by columns. This is different from the notation df[x], which tries to select a column. A generic way to filter your dataframe is to use .loc (or .iloc). You should read the documentation about Indexing and selecting data.
>>> table['bar':'foo']
foo bar baz
c 7 8 9 # 'c' >= 'bar' (a b bar c d)
d 10 11 12 # 'd' <= 'foo' (a b c d foo)
If the index were a more typical RangeIndex (integer labels), your code would raise an exception instead:
>>> table.reset_index(drop=True)['bar':'foo']
...
TypeError: cannot do slice indexing on RangeIndex with these indexers [bar] of type str
Immediately, we understand the problem and use .loc:
>>> table.loc[:, 'foo':'bar']
foo bar
a 1 2
b 4 5
c 7 8
d 10 11
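The two behaviors can be reproduced side by side; a small self-contained sketch of the answers above:

```python
import numpy as np
import pandas as pd

table = pd.DataFrame(np.arange(1, 13).reshape(4, 3),
                     index=list("abcd"), columns=["foo", "bar", "baz"])

# df[x:y] slices by *index label*: on the sorted index a..d,
# 'bar' falls between 'b' and 'c', and 'foo' falls after 'd'
rows = table["bar":"foo"]          # rows c and d

# .loc with an explicit column slice works on the stored column order
cols = table.loc[:, "foo":"bar"]   # columns foo and bar
```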
I have a pandas DataFrame.
   USER_ID   USER_NAME  USER_REPUTATION  NUMBER_OF_ANSWERS  NUMBER_OF_QUESTIONS  BADGE_NAME  BADGE_CAT
0        1  Ahmad Anis              123                  2                    3      Topper       HTML
1        1  Ahmad Anis              123                  2                    3  programmer     Random
I want to convert it to following
   USER_ID   USER_NAME  USER_REPUTATION  NUMBER_OF_ANSWERS  NUMBER_OF_QUESTIONS  BADGE_NAME  BADGE_CAT
0        1  Ahmad Anis              123                  2                    3      Topper       HTML
1                                                                              programmer     Random
I cannot display it correctly here, but I do not want the repeated values shown again; instead the common values should be merged into one big block, with only the differing values displayed. I tried using multi-indexing but it was not working.
I want something similar to this
But there it is only done with a single column; I want it done with my USER_ID, USER_NAME, USER_REPUTATION, NUMBER_OF_ANSWERS, NUMBER_OF_QUESTIONS columns.
I think you're looking for set_index:
cols = ["USER_ID", "USER_NAME", "USER_REPUTATION", "NUMBER_OF_ANSWERS", "NUMBER_OF_QUESTIONS"]
ndf = df.set_index(cols)
with some sample data:
>>> df
A B C D E
0 one A foo 0.945847 -0.561259
1 one A foo 0.579520 0.130518
2 one A foo -0.683629 -1.084639
3 one A bar -0.168223 -0.311991
4 one B bar 0.007965 1.108121
5 one B bar -1.877323 -0.258055
6 one B bar 0.992160 0.192339
7 one B foo -0.421557 -0.805156
8 two C bar -0.346622 1.335197
9 two C foo -0.979483 -1.382465
10 two C bar -0.815332 -1.491385
11 two C foo -2.112730 -0.331574
>>> cols = ["A", "B", "C"]
>>> ndf = df.set_index(cols)
>>> ndf
D E
A B C
one A foo 0.945847 -0.561259
foo 0.579520 0.130518
foo -0.683629 -1.084639
bar -0.168223 -0.311991
B bar 0.007965 1.108121
bar -1.877323 -0.258055
bar 0.992160 0.192339
foo -0.421557 -0.805156
two C bar -0.346622 1.335197
foo -0.979483 -1.382465
bar -0.815332 -1.491385
foo -2.112730 -0.331574
ndf is now a multi index frame.
To make the D and E at the same level as the A, B and C, we can set the index to all of them for the display purposes:
the_df = df.set_index(["A", "B", "C", "D", "E"])
to get (in an IPython notebook, for example)
Note that if you were to look at this in console:
>>> the_df
Empty DataFrame
Columns: []
Index: [(one, A, foo, 0.945847, -0.561259), (one, A, foo, 0.57952, 0.130518), ...]
because we set everything to the index and nothing remained in the values! But if you'd like to see it in the console as well, one trick is to use a "ghost" column, i.e., with name and values being the empty string "":
>>> the_df[""] = ""
>>> the_df
A B C D E
one A foo 0.945847 -0.561259
0.579520 0.130518
-0.683629 -1.084639
bar -0.168223 -0.311991
B bar 0.007965 1.108121
-1.877323 -0.258055
0.992160 0.192339
foo -0.421557 -0.805156
two C bar -0.346622 1.335197
foo -0.979483 -1.382465
bar -0.815332 -1.491385
foo -2.112730 -0.331574
To remove the extra first row from the HTML output:
from bs4 import BeautifulSoup
# form the soup
soup = BeautifulSoup(the_df.to_html())
# find the first row and remove it
soup.find("tr").extract()
# get HTML back
html = str(soup)
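Putting it together for the badge data, a minimal sketch (column names taken from the question, values abbreviated):

```python
import pandas as pd

df = pd.DataFrame({
    "USER_ID": [1, 1],
    "USER_NAME": ["Ahmad Anis", "Ahmad Anis"],
    "BADGE_NAME": ["Topper", "programmer"],
    "BADGE_CAT": ["HTML", "Random"],
})

# the repeated user columns move into the index; in notebook/HTML display
# the duplicated index labels are rendered once, as a merged block
ndf = df.set_index(["USER_ID", "USER_NAME"])
```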
I have a dataframe as follows:
A B C
0 foo 1.496337 -0.604264
1 bar -0.025106 0.257354
2 foo 0.958001 0.933328
3 foo -1.126581 0.570908
4 bar -0.428304 0.881995
5 foo -0.955252 1.408930
6 bar 0.504582 0.455287
7 bar -1.076096 0.536741
8 bar 0.351544 -1.146554
9 foo 0.430260 -0.348472
I would like to get the max of column B of each group (when grouped by A) and add it to column C. So here is what I tried:
Group by A:
df = df.groupby(by='A')
Got the maximum of column B and then tried to add it to column 'C':
for name in ['foo','bar']:
    maxi = df.get_group(name)['B'].max()
    df.get_group(name)['C'] = df.get_group(name)['C'] + maxi
At this point pandas warns: "Try using .loc[row_indexer,col_indexer] = value instead". Does this mean I have to loop over the rows with an if on the column A value and modify the C data one by one? That does not seem pandas-ish, and I feel I am missing something. How can I better work with this grouped dataframe?
Such operations are done using transforms or aggregations.
In your case you need transform
# groupby 'A'
grouped = df.groupby('A')
# transform B so every row becomes the maximum along the group:
max_B = grouped['B'].transform('max')
# add the new column to the old df
df['D'] = df['C'] + max_B
Or in one line:
In [2]: df['D'] = df.groupby('A')['B'].transform('max') + df['C']
In [3]: df
Out[3]:
A B C D
0 foo 1.496337 -0.604264 0.892073
1 bar -0.025106 0.257354 0.761936
2 foo 0.958001 0.933328 2.429665
3 foo -1.126581 0.570908 2.067245
4 bar -0.428304 0.881995 1.386577
5 foo -0.955252 1.408930 2.905267
6 bar 0.504582 0.455287 0.959869
7 bar -1.076096 0.536741 1.041323
8 bar 0.351544 -1.146554 -0.641972
9 foo 0.430260 -0.348472 1.147865
For more info, see
http://pandas.pydata.org/pandas-docs/stable/groupby.html
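A self-contained version of the transform approach with a tiny made-up frame (values chosen so the result is easy to check by hand):

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo"],
                   "B": [1.0, 2.0, 3.0],
                   "C": [10.0, 20.0, 30.0]})

# transform('max') returns a Series aligned with the original rows,
# each row holding its group's maximum of B
df["D"] = df.groupby("A")["B"].transform("max") + df["C"]
# foo's max B is 3.0, bar's is 2.0, so D is [13.0, 22.0, 33.0]
```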
What's the difference between:
pandas df.loc[:,('col_a','col_b')]
and
df.loc[:,['col_a','col_b']]
The link below doesn't mention the latter, though it works. Do both pull a view? Does the first pull a view and the second pull a copy? Love learning Pandas.
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Thanks
If your DataFrame has a simple column index, then there is no difference.
For example,
In [8]: df = pd.DataFrame(np.arange(12).reshape(4,3), columns=list('ABC'))
In [9]: df.loc[:, ['A','B']]
Out[9]:
A B
0 0 1
1 3 4
2 6 7
3 9 10
In [10]: df.loc[:, ('A','B')]
Out[10]:
A B
0 0 1
1 3 4
2 6 7
3 9 10
But if the DataFrame has a MultiIndex, there can be a big difference:
df = pd.DataFrame(np.random.randint(10, size=(5,4)),
columns=pd.MultiIndex.from_arrays([['foo']*2+['bar']*2,
list('ABAB')]),
index=pd.MultiIndex.from_arrays([['baz']*2+['qux']*3,
list('CDCDC')]))
# foo bar
# A B A B
# baz C 7 9 9 9
# D 7 5 5 4
# qux C 5 0 5 1
# D 1 7 7 4
# C 6 4 3 5
In [27]: df.loc[:, ('foo','B')]
Out[27]:
baz C 9
D 5
qux C 0
D 7
C 4
Name: (foo, B), dtype: int64
In [28]: df.loc[:, ['foo','B']]
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (1), lexsort depth (0)'
The KeyError is saying that the MultiIndex has to be lexsorted. If we sort it with sort_index(axis=1) (the older sortlevel method has since been removed from pandas), we still get a different result:
In [29]: df.sort_index(axis=1).loc[:, ('foo','B')]
Out[29]:
baz C 9
D 5
qux C 0
D 7
C 4
Name: (foo, B), dtype: int64
In [30]: df.sort_index(axis=1).loc[:, ['foo','B']]
Out[30]:
foo
A B
baz C 7 9
D 7 5
qux C 5 0
D 1 7
C 6 4
Why is that? df.sort_index(axis=1).loc[:, ('foo','B')] is selecting the single column where the first column level equals foo and the second column level is B.
In contrast, df.sort_index(axis=1).loc[:, ['foo','B']] is selecting the columns where the first column level is either foo or B. With respect to the first column level, there are no B columns, but there are two foo columns.
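A compact sketch of the tuple-vs-list distinction, using a small lexsorted MultiIndex so both selections succeed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(2, 4),
                  columns=pd.MultiIndex.from_product([["foo", "bar"],
                                                      ["A", "B"]]))
df = df.sort_index(axis=1)  # lexsort the columns: bar A, bar B, foo A, foo B

tup = df.loc[:, ("foo", "B")]  # one column: level-0 == foo AND level-1 == B
lst = df.loc[:, ["foo"]]       # every column whose level-0 is foo
```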
I think the operating principle with Pandas is that if you use df.loc[...] as
an expression, you should assume df.loc may be returning a copy or a view. The Pandas docs do not specify any rules about which you should expect.
However, if you make an assignment of the form
df.loc[...] = value
then you can trust Pandas to alter df itself.
The reason why the documentation warns about the distinction between views and copies is so that you are aware of the pitfall of using chain assignments of the form
df.loc[...][...] = value
Here, Pandas evaluates df.loc[...] first, which may be a view or a copy. Now if it is a copy, then
df.loc[...][...] = value
is altering a copy of some portion of df, and thus has no effect on df itself. To add insult to injury, the effect on the copy is lost as well since there are no references to the copy and thus there is no way to access the copy after the assignment statement completes, and (at least in CPython) it is therefore soon-to-be garbage collected.
I do not know of a practical fool-proof a priori way to determine if df.loc[...] is going to return a view or a copy.
However, there are some rules of thumb which may help guide your intuition (but note that we are talking about implementation details here, so there is no guarantee that Pandas needs to behave this way in the future):
If the resultant NDFrame can not be expressed as a basic slice of the
underlying NumPy array, then it probably will be a copy. Thus, a selection of arbitrary rows or columns will lead to a copy. A selection of sequential rows and/or sequential columns (which may be expressed as a slice) may return a view.
If the resultant NDFrame has columns of different dtypes, then df.loc
will again probably return a copy.
However, there is an easy way to determine a posteriori whether x = df.loc[...] is a view: simply see if changing a value in x affects df. If it does, it is a view; if not, x is a copy.
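A sketch of that a-posteriori check. Note the outcome is version-dependent: under pandas copy-on-write semantics (the default in recent versions), results of .loc always behave as copies when written to:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("ABC"))

x = df.loc[:, "A":"B"]   # may be a view or a copy, depending on version
before = df.iloc[0, 0]
x.iloc[0, 0] = 999                 # mutate the result...
is_view = df.iloc[0, 0] != before  # ...and see whether df changed
```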
I have a dataframe that looks like this:
baz qux
one A
one B
two C
three A
one B
one C
I'm trying to reshape it to look like this:
one two three
A C A
B
B
C
I'm pretty confused about whether this is possible, and if so, how you would do it. I've tried using the pivot_table method as pd.pivot_table(cols='baz', rows='qux') but that threw a TypeError. I think I'm being an idiot and missing something really basic here. Any ideas?
I'm not sure if it's the most optimal way of doing it, but it does the job:
import io

import numpy as np
import pandas as pd

data = u'baz,qux\none,A\none,B\ntwo,C\nthree,A\none,B\none,C'
df = pd.read_csv(io.StringIO(data))

new = pd.DataFrame()
for key, group in df.groupby('baz'):
    # take the group's qux values with a fresh 0..n index, named after the key
    new = pd.concat([new, group.reset_index(drop=True)['qux'].rename(key)],
                    axis=1)
new = new[['one', 'two', 'three']]  # groupby iterates keys in sorted order
print(new.replace(np.nan, ''))
Which gives back:
one two three
0 A C A
1 B
2 B
3 C
With pivot table you can get a matrix showing which baz corresponds to which qux:
>>> df['foo'] = 1 # Add aggregation column
>>> df.pivot_table('foo', columns='baz', index='qux')
one three two
A 1 1 NaN
B 1 NaN NaN
C 1 NaN 1
This is not quite what you asked for, but perhaps it suffices:
import numpy as np
import pandas as pd
df = pd.DataFrame({'baz':'one one two three one one'.split(),
'qux': list('ABCABC')})
grouped = df.groupby(['baz', 'qux'])
df2 = grouped.apply(pd.DataFrame.reset_index, drop=True)['qux'].unstack(level=0)
df2.reset_index(drop=True, inplace=True)
df2 = df2.reindex(columns='one two three'.split())
df2 = df2.replace(np.nan, '')
print(df2)
yields
one two three
0 A A
1 B
2 B
3 C C
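On modern pandas, the same reshape can be done without a loop by numbering rows within each group and pivoting (a sketch using the current cumcount/pivot API):

```python
import pandas as pd

df = pd.DataFrame({"baz": "one one two three one one".split(),
                   "qux": list("ABCABC")})

out = (df.assign(row=df.groupby("baz").cumcount())  # 0,1,2,... within each baz
         .pivot(index="row", columns="baz", values="qux")
         .reindex(columns=["one", "two", "three"])  # restore the desired order
         .fillna(""))
```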