Retrieve dataframe row based on list from a cell value - python

I am trying to retrieve a row from a pandas dataframe where the cell value is a list. I have tried isin, but it looks like it performs an OR operation, not an AND operation.
>>> import pandas as pd
>>> df = pd.DataFrame([['100', 'RB','stacked'], [['101','102'], 'CC','tagged'], ['102', 'S+C','tagged']],
columns=['vlan_id', 'mode' , 'tag_mode'],index=['dinesh','vj','mani'])
>>> df
vlan_id mode tag_mode
dinesh 100 RB stacked
vj [101, 102] CC tagged
mani 102 S+C tagged
>>> df.loc[df['vlan_id'] == '102']; # Fetching string value match
vlan_id mode tag_mode
mani 102 S+C tagged
>>> df.loc[df['vlan_id'].isin(['100','102'])]; # Fetching if contains either 100 or 102
vlan_id mode tag_mode
dinesh 100 RB stacked
mani 102 S+C tagged
>>> df.loc[df['vlan_id'] == ['101','102']]; # Fails ?
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\pandas\core\ops.py", line 1283, in wrapper
res = na_op(values, other)
File "C:\Python27\lib\site-packages\pandas\core\ops.py", line 1143, in na_op
result = _comp_method_OBJECT_ARRAY(op, x, y)
File "C:\Python27\lib\site-packages\pandas\core\ops.py", line 1120, in _comp_method_OBJECT_ARRAY
result = libops.vec_compare(x, y, op)
File "pandas\_libs\ops.pyx", line 128, in pandas._libs.ops.vec_compare
ValueError: Arrays were different lengths: 3 vs 2
I could extract the values into a list and compare them myself. Instead, is there a way to check against a list value using the .loc method itself?

To find a list you can iterate over the values of vlan_id and compare each value using np.array_equal:
import numpy as np
df.loc[[np.array_equal(x, ['101','102']) for x in df.vlan_id.values]]
vlan_id mode tag_mode
vj [101, 102] CC tagged
That said, it is advisable to avoid using lists as cell values in a dataframe.
DataFrame.loc can use a list of labels or a Boolean array to access rows and columns. The list comprehension above constructs a Boolean array.
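As a minimal sketch of the same idea without NumPy, the Boolean mask can also be built with plain Python equality, one cell at a time. The frame mirrors the one in the question; the names target, mask and result are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame(
    [['100', 'RB', 'stacked'], [['101', '102'], 'CC', 'tagged'], ['102', 'S+C', 'tagged']],
    columns=['vlan_id', 'mode', 'tag_mode'],
    index=['dinesh', 'vj', 'mani'],
)

target = ['101', '102']  # hypothetical name for the list being searched for

# Compare each cell to the target individually. A str never equals a list,
# so only a cell that is itself the list ['101', '102'] produces True.
mask = [cell == target for cell in df['vlan_id']]

# .loc accepts a Boolean list with one entry per row.
result = df.loc[mask]
print(result)
```

Comparing cell by cell sidesteps the element-wise broadcasting that makes df['vlan_id'] == ['101','102'] fail with a length mismatch.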

I am not sure if this is the best way to do this, or if there is a good way to do this, since as far as I know pandas doesn't really support storing lists in Series. Still:
l = ['101', '102']
df.loc[pd.concat([df['vlan_id'].str[i] == l[i] for i in range(len(l))], axis=1).all(axis=1)]
Output:
vlan_id mode tag_mode
vj [101, 102] CC tagged

Another workaround would be to transform your vlan_id column so that it can be queried as a string. You can do that by joining the list values in vlan_id into comma-separated strings.
df['proxy'] = df['vlan_id'].apply(lambda x: ','.join(x) if isinstance(x, list) else x)
l = ','.join(['101', '102'])
print(df.loc[df['proxy'] == l])


Python DataFrame TypeError: only integer scalar arrays can be converted to a scalar index

I know there are several questions about this error already. But in this particular case I'm not sure whether there is already a solution for my problem.
I have this part of code and I want to print the column "y" of the DataFrame df.
The following error occurs:
TypeError: only integer scalar arrays can be converted to a scalar index
labels=[]
xvectors=[]
for i in data:
    labels.append(i[0])
    xvectors.append(i[1])
X = np.array(xvectors)
y = np.array(labels)
feat_cols = [ 'xvec'+str(i) for i in range(X.shape[1]) ]
print(feat_cols)
df = pd.DataFrame(X,columns=[feat_cols])
df['y']= y
#df['label'] = df['y'].apply(lambda i: str(i))
print(df['y'])
X, y = None, None
Printing the whole DataFrame is possible. This looks like:
xvec0 xvec1 xvec2 xvec3 xvec4 ... xvec508 xvec509 xvec510 xvec511 y
0 3.397163 -1.112423 0.414708 0.563083 1.371336 ... 1.201095 -0.076261 -0.620443 -1.231465 DA01_03
1 0.159473 1.884818 -1.511547 -0.153500 -0.635701 ... -1.217205 -1.922081 0.878613 0.087912 DA01_06
2 1.089404 0.331919 -1.027480 0.594129 -2.473234 ... -3.505570 -3.509632 -0.553128 -0.453307 DA01_10
3 0.183993 -1.741467 -0.142570 -3.158320 4.355789 ... 3.857311 3.142393 0.991663 -2.842322 DA01_14
This is the whole error message:
print(df['y'])
File "/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py", line 2958, in __getitem__
return self._get_item_cache(key)
File "/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py", line 3270, in _get_item_cache
values = self._data.get(item)
File "/usr/local/lib/python3.7/dist-packages/pandas/core/internals/managers.py", line 960, in get
return self.iget(loc)
File "/usr/local/lib/python3.7/dist-packages/pandas/core/internals/managers.py", line 977, in iget
block = self.blocks[self._blknos[i]]
TypeError: only integer scalar arrays can be converted to a scalar index
I think it has something to do with the numpy array.
Thank you in advance!
Ah, you pass your columns argument as a list inside a list (feat_cols is already a list). This makes your column headers 2-dimensional (a MultiIndex): you can see that df.info() says the columns range from (xvec0,) to ... instead of from xvec0.
Passing columns=feat_cols should do the trick :-)
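To make the difference visible, here is a small sketch. The 2x3 array X and the two labels are made-up stand-ins for the asker's data, and the names nested and flat are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

X = np.arange(6, dtype=float).reshape(2, 3)  # stand-in for the x-vectors
feat_cols = ['xvec' + str(i) for i in range(X.shape[1])]

# A list wrapped in another list is treated as the levels of a MultiIndex,
# so every column label becomes a 1-tuple such as ('xvec0',).
nested = pd.DataFrame(X, columns=[feat_cols])
print(type(nested.columns).__name__)
print(list(nested.columns))

# Passing the flat list gives ordinary string labels, and selecting
# the added 'y' column then works as expected.
flat = pd.DataFrame(X, columns=feat_cols)
flat['y'] = ['DA01_03', 'DA01_06']
print(flat['y'])
```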

Get Pandas DataFrame first column

This question is odd, since I know HOW to do something, but I don't know WHY I can't do it another way.
Suppose a simple data frame:
import pandas as pd
a = pd.DataFrame([[0,1], [2,3]])
I can slice this data frame very easily: the first column is a[[0]], the second is a[[1]]. Simple, isn't it?
Now, let's have a more complex data frame. This is part of my code:
var_vec = [i for i in range(100)]
num_of_sites = 100
row_names = ["_".join(["loc", str(i)]) for i in
range(1,num_of_sites + 1)]
frame = pd.DataFrame(var_vec, columns = ["Variable"], index = row_names)
spec_ab = [i**3 for i in range(100)]
frame[1] = spec_ab
frame is also a pandas DataFrame, just like a. I can get the second column very easily as frame[[1]]. But when I try frame[[0]] I get an error:
Traceback (most recent call last):
File "<ipython-input-55-0c56ffb47d0d>", line 1, in <module>
frame[[0]]
File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python- 3.5.2.amd64\lib\site-packages\pandas\core\frame.py", line 1991, in __getitem__
return self._getitem_array(key)
File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python- 3.5.2.amd64\lib\site-packages\pandas\core\frame.py", line 2035, in _getitem_array
indexer = self.ix._convert_to_indexer(key, axis=1)
File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python- 3.5.2.amd64\lib\site-packages\pandas\core\indexing.py", line 1184, in _convert_to_indexer
indexer = labels._convert_list_indexer(objarr, kind=self.name)
File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python- 3.5.2.amd64\lib\site-packages\pandas\indexes\base.py", line 1112, in _convert_list_indexer
return maybe_convert_indices(indexer, len(self))
File "C:\Users\Robert\Desktop\Záloha\WinPython-64bit-3.5.2.2\python- 3.5.2.amd64\lib\site-packages\pandas\core\indexing.py", line 1856, in maybe_convert_indices
raise IndexError("indices are out-of-bounds")
IndexError: indices are out-of-bounds
I can still use frame.iloc[:,0], but the problem is that I don't understand why I can't use simple slicing by [[]]. I use WinPython with Spyder 3, if that helps.
Using your code:
import pandas as pd
var_vec = [i for i in range(100)]
num_of_sites = 100
row_names = ["_".join(["loc", str(i)]) for i in
range(1,num_of_sites + 1)]
frame = pd.DataFrame(var_vec, columns = ["Variable"], index = row_names)
spec_ab = [i**3 for i in range(100)]
frame[1] = spec_ab
If you print out frame you get:
Variable 1
loc_1 0 0
loc_2 1 1
loc_3 2 8
loc_4 3 27
loc_5 4 64
loc_6 5 125
......
So the cause of your problem becomes obvious: you have no column labeled 0.
At line one you specify a list called var_vec.
At line 4 you make a dataframe out of that list, but you specify the index values and the column name (which is usually good practice).
The numerical column labels 0, 1, ... as in the first example are only assigned when you don't specify the column names; they are labels, not positional indexers.
If you want to access columns by their position, you can:
df[df.columns[0]]
What happens then is that you get the list of columns of the df, pick the entry at position 0, and pass that label to the df as a reference.
Hope that helps you understand.
Edit:
Another (better) way would be:
df.iloc[:,0]
where ":" stands for all rows (which are also indexed by position, starting from 0).
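A short sketch of the label-versus-position distinction, using a three-row version of the asker's frame (the shortened data and the names by_label and by_position are only for illustration):

```python
import pandas as pd

frame = pd.DataFrame({'Variable': [0, 1, 2]}, index=['loc_1', 'loc_2', 'loc_3'])
frame[1] = [0, 1, 8]

# The column labels are 'Variable' and the integer 1; no column is labeled 0.
print(list(frame.columns))
print(0 in frame.columns)

by_label = frame[[1]]             # [[...]] looks up columns by label
by_position = frame.iloc[:, [0]]  # .iloc looks up columns by position
print(by_position)
```

frame[[1]] only works here because a column happens to carry the label 1, not because it is the second column.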

Group by in pandas dataframe and unioning a numpy array column

I have a CSV file where one of the columns looks like a numpy array. The first few lines look like the following
first,second,third
170.0,2,[19 234 376]
170.0,3,[19 23 23]
162.0,4,[1 2 3]
162.0,5,[1 3 4]
I load this CSV into a pandas data frame using the following code:
data = pd.read_csv('myfile.csv', converters = {'first': np.float64, 'second': np.int64, 'third': np.array})
Now, I want to group by the 'first' column and take the union of the 'third' column. After doing this, my dataframe should look like
170.0, [19 23 234 376]
162.0, [1 2 3 4]
How do I achieve this? I tried multiple ways like the following and nothing seems to help achieve this goal.
group_data = data.groupby('first')
group_data['third'].apply(lambda x: np.unique(np.concatenate(x)))
With your current csv file the 'third' column comes in as a string, instead of a list.
There might be nicer ways to convert to a list, but here goes...
from ast import literal_eval
import numpy as np
import pandas as pd
data = pd.read_csv('test_groupby.csv')
# Convert to a string representation of a list...
data['third'] = data['third'].str.replace(' ', ',')
# Convert string to list...
data['third'] = data['third'].apply(literal_eval)
group_data=data.groupby('first')
# Two secrets here revealed
# x.values instead of x since x is a Series
# list(...) to return an aggregated value
# (np.array should work here, but...?)
ans = group_data.aggregate(
{'third': lambda x: list(np.unique(
np.concatenate(x.values)))})
print(ans)
third
first
162 [1, 2, 3, 4]
170 [19, 23, 234, 376]
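An alternative sketch, assuming the bracketed values are always whitespace-separated integers: parse the column while reading via a converter, then union each group with Python sets. The CSV text is inlined so the example is self-contained, and parse_int_list and union_sorted are hypothetical helper names, not pandas functions.

```python
import io
import pandas as pd

csv_text = """first,second,third
170.0,2,[19 234 376]
170.0,3,[19 23 23]
162.0,4,[1 2 3]
162.0,5,[1 3 4]
"""

def parse_int_list(text):
    # Turn a string like "[19 234 376]" into the list [19, 234, 376].
    return [int(tok) for tok in text.strip('[]').split()]

data = pd.read_csv(io.StringIO(csv_text), converters={'third': parse_int_list})

def union_sorted(series):
    # Union all lists in the group and return the values in ascending order.
    return sorted(set().union(*series))

result = data.groupby('first')['third'].apply(union_sorted)
print(result)
```

Parsing in the converter avoids the string replacement and literal_eval steps, at the cost of assuming a fixed input format.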

replacing pandas dataframe variable values with a numpy array

I am doing a transformation on a variable from a pandas dataframe and then I would like to replace the column with my new values. The problem seems to be that after the transformation, the length of the array is not the same as the length of my dataframe's index. I don't think that is true though.
>>> df['variable'] = stats.boxcox(df.variable)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\eMachine\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\pandas\core\frame.py", line 2119, in __setitem__
self._set_item(key, value)
File "C:\Users\eMachine\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\pandas\core\frame.py", line 2165, in _set_item
value = self._sanitize_column(key, value)
File "C:\Users\eMachine\WinPython-64bit-2.7.5.3\python-2.7.5.amd64\lib\site-packages\pandas\core\frame.py", line 2205, in _sanitize_column
raise AssertionError('Length of values does not match '
AssertionError: Length of values does not match length of index
When I check, the lengths do seem to disagree: len() of the result is 2, but the output of stats.boxcox shows a length of 50000. What is going on here?
>>> len(df)
50000
>>> len(stats.boxcox(df.variable))
2
>>> stats.boxcox(df.variable)
(0 -0.079496
1 -0.117982
2 -0.104637
...
49985 -0.041300
49986 0.651771
49987 -0.115660
49988 -0.118034
49998 -0.118014
49999 -0.034076
Name: feat9, Length: 50000, dtype: float64, 8.4721358117221772)
>>>
You can see in your example that the result of boxcox is a tuple. This is consistent with the documentation, which indicates that boxcox returns a tuple of the transformed data and a lambda value. Notice in the example on that page that it does:
xt, _ = stats.boxcox(x)
. . . showing again that boxcox returns a 2-tuple.
You should be doing df['variable'] = stats.boxcox(df.variable)[0].
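A small sketch of the unpacking, assuming SciPy is installed. The five-value column is made-up data (Box-Cox requires strictly positive input), and the names result, transformed and fitted_lambda are only for illustration.

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({'variable': [1.0, 2.0, 4.0, 8.0, 16.0]})  # made-up data

# With no lmbda argument, boxcox returns a pair:
# (transformed values, fitted lambda).
result = stats.boxcox(df['variable'].to_numpy())
print(type(result).__name__, len(result))

# Unpack the pair and assign only the transformed values to the column.
transformed, fitted_lambda = result
df['variable'] = transformed
print(df)
```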

how to get the desired vertical list in python?

How do I read a file vertically? For instance, a file would contain the following:
1234
4567
7890
to obtain [147, 258, 369, 470]
This was used:
rows = [line.split() for line in f]
columns=zip(*rows)
print(columns)
and the following was obtained:
zip object
What should I do to fix it so that I get the desired result?
In Python 3 zip returns a zip object, useful for iterating over but not so much for printing. The fix is easy though:
print(list(columns))
Here is your code, adapted to work on a string:
lines = "1234 4567 7890"
rows_iter = [iter(s) for s in lines.split()]
cols_as_list = zip(*rows_iter)
cols = [''.join(c) for c in cols_as_list]
Given your sample data in a file called 'foop.txt', here you go:
z = zip(*(l.strip() for l in open('foop.txt')))  # strip the trailing newline from each line
columns = [''.join(x) for x in z]
print(columns)
results in:
['147', '258', '369', '470']
If you want to leave 'columns' as a generator one would just change that line:
columns = (''.join(x) for x in z)
It looks like you are using Python 3, where zip returns an iterator. To see the values, you need to consume the iterator, e.g. using the list constructor:
columns = list(zip(*rows))
To obtain individual columns in individual variables, you can unpack them:
col1, col2, ... = zip(*rows)
If the file really has only a single column, there is no reason to call split in the first place. Simply read the lines into a list:
col = [int(line) for line in f] # or float(line)...
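Putting the pieces together, a minimal sketch of the vertical read. The string text stands in for the file contents so the example runs without a file on disk; with a real file, iterate over the open file object instead of text.splitlines().

```python
text = "1234\n4567\n7890\n"  # stands in for the contents of the file

# One string per line, with surrounding whitespace removed.
rows = [line.strip() for line in text.splitlines()]

# zip(*rows) pairs up the characters at the same position in every row,
# i.e. it walks the file column by column.
columns = [''.join(chars) for chars in zip(*rows)]
print(columns)

# Convert to integers if numbers rather than strings are wanted.
numbers = [int(c) for c in columns]
print(numbers)
```

Note that line.strip() is used rather than line.split(): splitting on whitespace would keep each line as a single token, so zip would return whole rows instead of columns.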
