Assign multiple values to single row Pandas - python

I have a pandas DataFrame.
   USER_ID   USER_NAME  USER_REPUTATION  NUMBER_OF_ANSWERS  NUMBER_OF_QUESTIONS  BADGE_NAME  BADGE_CAT
0        1  Ahmad Anis              123                  2                    3      Topper       HTML
1        1  Ahmad Anis              123                  2                    3  programmer     Random
I want to convert it to the following:
   USER_ID   USER_NAME  USER_REPUTATION  NUMBER_OF_ANSWERS  NUMBER_OF_QUESTIONS  BADGE_NAME  BADGE_CAT
0        1  Ahmad Anis              123                  2                    3      Topper       HTML
1                                                                              programmer     Random
I cannot display it correctly here, but the idea is that repeated values should not be shown again: the common columns should be merged into one big cell, with only the differing values displayed. I tried using multi-indexing but it was not working.
I want something similar to this
But there it is only done with a single column; I want to do it with my USER_ID, USER_NAME, USER_REPUTATION, NUMBER_OF_ANSWERS, and NUMBER_OF_QUESTIONS columns.

I think you're looking for set_index:
cols = ["USER_ID", "USER_NAME", "USER_REPUTATION", "NUMBER_OF_ANSWERS", "NUMBER_OF_QUESTIONS"]
ndf = df.set_index(cols)
with some sample data:
>>> df
A B C D E
0 one A foo 0.945847 -0.561259
1 one A foo 0.579520 0.130518
2 one A foo -0.683629 -1.084639
3 one A bar -0.168223 -0.311991
4 one B bar 0.007965 1.108121
5 one B bar -1.877323 -0.258055
6 one B bar 0.992160 0.192339
7 one B foo -0.421557 -0.805156
8 two C bar -0.346622 1.335197
9 two C foo -0.979483 -1.382465
10 two C bar -0.815332 -1.491385
11 two C foo -2.112730 -0.331574
>>> cols = ["A", "B", "C"]
>>> ndf = df.set_index(cols)
>>> ndf
D E
A B C
one A foo 0.945847 -0.561259
foo 0.579520 0.130518
foo -0.683629 -1.084639
bar -0.168223 -0.311991
B bar 0.007965 1.108121
bar -1.877323 -0.258055
bar 0.992160 0.192339
foo -0.421557 -0.805156
two C bar -0.346622 1.335197
foo -0.979483 -1.382465
bar -0.815332 -1.491385
foo -2.112730 -0.331574
ndf is now a multi index frame.
To bring D and E to the same level as A, B and C, we can set the index to all of them for display purposes:
the_df = df.set_index(["A", "B", "C", "D", "E"])
to get (in an IPython notebook, for example)
Note that if you were to look at this in console:
>>> the_df
Empty DataFrame
Columns: []
Index: [(one, A, foo, 0.945847, -0.561259), (one, A, foo, 0.57952, 0.130518), ...]
because we set everything to the index and nothing remained in the values! But if you'd like to see it in the console as well, one trick is to use a "ghost" column, i.e., with name and values being the empty string "":
>>> the_df[""] = ""
>>> the_df
A B C D E
one A foo 0.945847 -0.561259
0.579520 0.130518
-0.683629 -1.084639
bar -0.168223 -0.311991
B bar 0.007965 1.108121
-1.877323 -0.258055
0.992160 0.192339
foo -0.421557 -0.805156
two C bar -0.346622 1.335197
foo -0.979483 -1.382465
bar -0.815332 -1.491385
foo -2.112730 -0.331574
To remove the extra first row from the HTML output:
from bs4 import BeautifulSoup

# parse the table HTML (passing an explicit parser avoids a warning)
soup = BeautifulSoup(the_df.to_html(), "html.parser")
# find the first row and remove it
soup.find("tr").extract()
# get the HTML back
html = str(soup)
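If you want to render the cleaned-up HTML inline, a small sketch (assuming an IPython/Jupyter environment):

from IPython.display import HTML, display

# show the table without the extra header row
display(HTML(html))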

Related

Pandas df['col1':'col2'] giving the output I don't understand

I was doing something very basic like this -
data = np.arange(1,13).reshape(4,3)
table = pd.DataFrame(data, index = list('abcd'), columns =['foo','bar','baz'])
table
foo bar baz
a 1 2 3
b 4 5 6
c 7 8 9
d 10 11 12
And then I ran this -
table['bar':'foo']
#output
foo bar baz
c 7 8 9
d 10 11 12
I don't get why I am getting this result. Note that I am not asking for any other solution or workaround. I am just looking for explanation/rules behind this behavior.
I'm not entirely sure, but it looks like you can't use slicing with column names; this kind of slicing only works on the rows, and only c and d fall (lexicographically) between bar and foo.
You can instead use loc:
table.loc[:, 'foo':'bar']
Note that I changed the order of foo and bar: with .loc, columns are sliced in the order you defined them (foo -> bar -> baz), not lexicographically, so 'bar':'foo' would return an empty dataframe.
It's basically producing a row slice by comparing bar and foo lexicographically with the existing index labels. The output includes rows c and d, as they're the only two labels that fall between bar and foo: a < b < bar < c < d < ... < foo
First, you have to know that the notation df[x:y] tries to slice your dataframe by index labels, not columns. This is different from the notation df[x], which tries to select a column. A generic way to filter your dataframe is to use .loc (or .iloc). You should read the documentation about Indexing and selecting data.
>>> table['bar':'foo']
   foo  bar  baz
c    7    8    9   # 'c' >= 'bar' (a b bar c d)
d   10   11   12   # 'd' <= 'foo' (a b c d foo)
With the more common RangeIndex (or integer index labels), the same slice raises an exception:
>>> table.reset_index(drop=True)['bar':'foo']
...
TypeError: cannot do slice indexing on RangeIndex with these indexers [bar] of type str
Immediately, we understand the problem and use .loc:
>>> table.loc[:, 'foo':'bar']
foo bar
a 1 2
b 4 5
c 7 8
d 10 11
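As a quick sanity check of that rule (using the same table as above), slicing with labels that do exist in the index selects rows, with both endpoints included:

>>> table['b':'c']  # row slice by label, endpoints inclusive
   foo  bar  baz
b    4    5    6
c    7    8    9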

pandas read_csv usecols and names out of sync

When trying to read some columns using their indices from a tabular file with pandas read_csv it seems the usecols and names get out of sync with each other.
For example, having the file test.csv:
FOO A -46450.494736 0.0728830817231
FOO A -46339.7126846 0.0695018062805
FOO A -46322.4942905 0.0866205763556
FOO B -46473.3117983 0.0481618121947
FOO B -46537.6827055 0.0436893868921
FOO B -46467.2102205 0.0485001911304
BAR C -33424.1224914 6.7981041851
BAR C -33461.4101485 7.40607068177
BAR C -33404.6396495 4.72117502707
and trying to read 3 columns by index without preserving the original order:
cols = [1, 2, 0]
names = ['X', 'Y', 'Z']
df = pd.read_csv(
    'test.csv', sep='\t',
    header=None,
    index_col=None,
    usecols=cols, names=names)
I'm getting the following dataframe:
X Y Z
0 FOO A -46450.494736
1 FOO A -46339.712685
2 FOO A -46322.494290
3 FOO B -46473.311798
4 FOO B -46537.682706
5 FOO B -46467.210220
6 BAR C -33424.122491
7 BAR C -33461.410148
8 BAR C -33404.639650
whereas I would expect column Z to have the FOO and BAR, like this:
Z X Y
0 FOO A -46450.494736
1 FOO A -46339.712685
2 FOO A -46322.494290
3 FOO B -46473.311798
4 FOO B -46537.682706
5 FOO B -46467.210220
6 BAR C -33424.122491
7 BAR C -33461.410148
8 BAR C -33404.639650
I know pandas stores a dataframe's columns in a dict-like structure, so the column order may differ from the one requested with usecols, but the problem here is that using usecols indices together with names becomes meaningless if the pairing between them is not preserved.
I really need to read the columns by their indices and then assign names to them. Is there any workaround for this?
The documentation could be clearer on this (feel free to make an issue, or even better submit a pull request!), but usecols is set-like: it does not define an order of columns; it is simply tested for membership.
from io import StringIO
pd.read_csv(StringIO("""a,b,c
1,2,3
4,5,6"""), usecols=[0, 1, 2])
Out[31]:
a b c
0 1 2 3
1 4 5 6
pd.read_csv(StringIO("""a,b,c
1,2,3
4,5,6"""), usecols=[2, 1, 0])
Out[32]:
a b c
0 1 2 3
1 4 5 6
names on the other hand is ordered. So in this case, the answer is to specify the names in the order you want them.
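One workaround sketch (the variable names follow the question): since usecols is set-like, sort the indices and pair the names with them so that each name lands on the column it was meant for, then reorder the result:

import pandas as pd

cols = [1, 2, 0]
names = ['X', 'Y', 'Z']

# pair each requested index with its name, sorted into file order,
# because read_csv assigns names in file order regardless of usecols order
order = sorted(zip(cols, names))
df = pd.read_csv('test.csv', sep='\t', header=None,
                 usecols=[c for c, n in order],
                 names=[n for c, n in order])
df = df[['Z', 'X', 'Y']]  # reorder to the layout expected in the question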

Remove columns from pandas DataFrame that are not integers and outside specified numerical range

I have a DataFrame that has imported data. However, the imported data can be incorrect and so I am trying to get rid of it. An example DataFrame:
user test1 test2 other
0 foo 1 7 bar
1 foo 2 9 bar
2 foo 3;as 5 bar
3 foo 3 5 bar
I want to clean up columns test1 and test2. I want to get rid of values that are not within a specified range, and of values that contain a string due to some error (such as the entry 3;as shown above). I am doing this by defining a dict of acceptable values:
values_dict = {
    'test1': [1, 2, 3],
    'test2': [5, 6, 7],
}
and the list of column names I wish to clean:
headers = ['test1', 'test2']
My code as it stands right now:
# Remove string entries
for i in headers:
    df[i] = pd.to_numeric(df[i], errors='coerce')
    df[i] = df[i].fillna(0).astype(int)

# Remove unwanted values
for i in values_dict:
    df[i] = df[df[i].isin(values_dict[i])]
But the erroneous values are not removed, and I don't get the desired dataframe:
user test1 test2 other
0 foo 1 7 bar
1 foo 3 5 bar
Thanks for the help!
You could do something like this: use np.logical_and to combine the conditions from multiple columns and use the resulting mask to subset the data frame:
import numpy as np

headers = ['test1', 'test2']
df[np.logical_and(*(pd.to_numeric(df[col], errors='coerce').isin(values_dict[col]) for col in headers))]
#   user  test1  test2 other
# 0  foo      1      7   bar
# 3  foo      3      5   bar
Breakdown:
[pd.to_numeric(df[col], errors='coerce').isin(values_dict[col]) for col in headers]
first converts the columns of interest to a numeric type and then checks whether each value is in the allowed range, producing a boolean series for each column:
#[0 True
# 1 True
# 2 False
# 3 True
# Name: test1, dtype: bool,
# 0 True
# 1 False
# 2 True
# 3 True
# Name: test2, dtype: bool]
To require that the conditions hold for all columns simultaneously, we need an and operation, which is what numpy.logical_and provides; the * unpacks each column's condition as a separate argument.
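For more than two columns, np.logical_and would need chaining (it takes exactly two array arguments), so an alternative sketch with the same df, headers and values_dict is to concatenate the per-column conditions and require all of them:

import pandas as pd

# build one boolean column per header, then keep rows where all are True
mask = pd.concat(
    [pd.to_numeric(df[col], errors='coerce').isin(values_dict[col])
     for col in headers],
    axis=1,
).all(axis=1)
df[mask]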

Python pandas - apply function to grouped dataframe

I have a dataframe as follows:
A B C
0 foo 1.496337 -0.604264
1 bar -0.025106 0.257354
2 foo 0.958001 0.933328
3 foo -1.126581 0.570908
4 bar -0.428304 0.881995
5 foo -0.955252 1.408930
6 bar 0.504582 0.455287
7 bar -1.076096 0.536741
8 bar 0.351544 -1.146554
9 foo 0.430260 -0.348472
I would like to get the max of column B for each group (when grouped by A) and add it to the column C. So here is what I tried:
Group by A:
df = df.groupby(by='A')
Get the maximum of column B and then tried to apply it to column 'C':
for name in ['foo', 'bar']:
    maxi = df.get_group(name)['B'].max()
    df.get_group(name)['C'] = df.get_group(name)['C'] + maxi
At this point pandas warns: "Try using .loc[row_indexer,col_indexer] = value instead". Does this mean I have to loop over the rows with an if on the column A value and modify the C data one by one? That does not seem very pandas-ish, and I feel that I am missing something. How can I better work with this grouped dataframe?
Such operations are done using transforms or aggregations.
In your case you need transform:
# group by 'A'
grouped = df.groupby('A')
# transform B so every row holds the maximum of B within its group
max_B = grouped['B'].transform('max')
# add the new column to the original df
df['D'] = df['C'] + max_B
Or in one line:
In [2]: df['D'] = df.groupby('A')['B'].transform('max') + df['C']
In [3]: df
Out[3]:
A B C D
0 foo 1.496337 -0.604264 0.892073
1 bar -0.025106 0.257354 0.761936
2 foo 0.958001 0.933328 2.429665
3 foo -1.126581 0.570908 2.067245
4 bar -0.428304 0.881995 1.386577
5 foo -0.955252 1.408930 2.905267
6 bar 0.504582 0.455287 0.959869
7 bar -1.076096 0.536741 1.041323
8 bar 0.351544 -1.146554 -0.641972
9 foo 0.430260 -0.348472 1.147865
For more info, see
http://pandas.pydata.org/pandas-docs/stable/groupby.html
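If the goal is to modify C itself rather than create a new column D, the same transform can be applied in place (using the df from the question):

df['C'] = df['C'] + df.groupby('A')['B'].transform('max')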

reshape data frame in pandas with pivot table

I have a dataframe that looks like this:
baz qux
one A
one B
two C
three A
one B
one C
I'm trying to reshape it to look like this:
one  two  three
A    C    A
B
B
C
I'm pretty confused about whether this is possible, and if so, how you would do it. I've tried using the pivot_table method as pd.pivot_table(cols='baz', rows='qux') but that threw a TypeError. I think I'm being an idiot and missing something really basic here. Any ideas?
I'm not sure if it's the most optimal way of doing it but it does the job:
import io

import numpy as np
import pandas as pd

data = u'baz,qux\none,A\none,B\ntwo,C\nthree,A\none,B\none,C'
df = pd.read_csv(io.StringIO(data))

new = pd.DataFrame()
for key, group in df.groupby('baz'):
    # take the group's qux values, renamed after the group key,
    # and append them as a new column
    new = pd.concat([new, group['qux'].reset_index(drop=True).rename(key)],
                    axis=1)
print(new.replace(np.nan, ''))
Which gives back:
  one two three
0   A   C     A
1   B
2   B
3   C
With pivot_table you can get a matrix showing which baz corresponds to which qux (in modern pandas the keywords are index=/columns= rather than the old rows=/cols=):
>>> df['foo'] = 1  # Add aggregation column
>>> df.pivot_table('foo', columns='baz', index='qux')
   one  three  two
A    1      1  NaN
B    1    NaN  NaN
C    1    NaN    1
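A similar matrix can be had without the helper column via pd.crosstab, which counts co-occurrences (so the pair one/B, which appears twice in the data, shows a 2):

>>> pd.crosstab(df['qux'], df['baz'])
baz  one  three  two
qux
A      1      1    0
B      2      0    0
C      1      0    1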
This is not quite what you asked for, but perhaps it suffices:
import numpy as np
import pandas as pd

df = pd.DataFrame({'baz': 'one one two three one one'.split(),
                   'qux': list('ABCABC')})
grouped = df.groupby(['baz', 'qux'])
df2 = grouped.apply(pd.DataFrame.reset_index, drop=True)['qux'].unstack(level=0)
df2.reset_index(drop=True, inplace=True)
df2 = df2.reindex(columns='one two three'.split())
df2 = df2.replace(np.nan, '')
print(df2)
yields
  one two three
0   A         A
1   B
2   B
3   C   C
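For reference, in current pandas the requested layout can be produced more directly by numbering the rows within each baz group with cumcount and pivoting (a sketch using the df defined above):

# number rows within each baz group, then pivot those numbers into the index
df['n'] = df.groupby('baz').cumcount()
df2 = (df.pivot(index='n', columns='baz', values='qux')
         .reindex(columns=['one', 'two', 'three'])
         .fillna(''))
print(df2)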