Python Pandas Dataframe Constructor Converts List to String - python

Is there a simple solution to the following problem?
I'm using a pandas DataFrame:
dataframe = pandas.DataFrame(data, index=[categorieid], columns=['title', 'categorieid'])
where categorieid is a list of integer values (e.g. [1,2,3,1]) and title is a list of strings (['a','b','c','d']).
I then try to access the titles at a specific index with
dataframe.ix[i]['title'].values.tolist()
The problem is that I get an exception if only one title exists for a given index, because pandas then returns the title as a string and not as a list.
Is there a way to tell the DataFrame constructor to always create a list() at each index, even if it contains only one item?
Thank you for any help.
Edit:
My printed dataframe looks like this
categorieid title
0 0 a
0 0 c
1 1 b
1 1 d
0 0 e
2 2 f
Calling my values.tolist() results in
for _title in dataframe.ix[i]['title'].values.tolist():
AttributeError: 'unicode' object has no attribute 'values'

I think you're making it more difficult than it has to be by the way you're building the DataFrame. Also, accessing the 'values' attribute is not needed.
Since you only have one dimension, you're probably better off using a Series. Then you can select the entries using the index and convert to a list.
In [12]: s = pd.Series(list('acbdef'), index=[0, 0, 1, 1, 0, 2], name='title')
In [13]: s
Out[13]:
0 a
0 c
1 b
1 d
0 e
2 f
Name: title, dtype: object
In [14]: s[1].tolist()
Out[14]: ['b', 'd']
If you really need a DataFrame for some reason not mentioned, it will work similarly:
In [15]: df = pd.DataFrame(s)
In [16]: df
Out[16]:
title
0 a
0 c
1 b
1 d
0 e
2 f
In [17]: df['title'][1].tolist()
Out[17]: ['b', 'd']
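The original exception comes from labels that occur only once: selecting such a label with a scalar key returns the bare value instead of a Series. A minimal sketch of one way around this, assuming the same sample data as above, is to pass the label inside a list to .loc, which returns a Series for both repeated and unique labels. The helper name titles_for is made up for this illustration.

```python
import pandas as pd

# Same sample data as in the answer above.
s = pd.Series(list("acbdef"), index=[0, 0, 1, 1, 0, 2], name="title")


def titles_for(series, label):
    """Return all values stored under `label` as a plain list.

    Passing the label inside a list makes .loc return a Series even
    when the label occurs only once, so .tolist() is always available.
    """
    return series.loc[[label]].tolist()


print(titles_for(s, 1))  # label 1 occurs twice
print(titles_for(s, 2))  # label 2 occurs once, still a list
```

With this approach the caller does not need to special-case indexes that hold a single title.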

Related

What is the recommended method for accessing pandas data? [duplicate]

In both of the cases below:
import pandas
d = {'col1': 2, 'col2': 2.5}
df = pandas.DataFrame(data=d, index=[0])
print(df['col2'])
print(df.col2)
Both methods can be used to index on a column and yield the same result, so is there any difference between them?
The "dot notation", i.e. df.col2 is the attribute access that's exposed as a convenience.
You may access an index on a Series, a column on a DataFrame, and an item on a Panel directly as an attribute.
df['col2'] does the same: it returns a pd.Series of the column.
A few caveats about attribute access:
you cannot add a column (df.new_col = x won't work; worse, it creates a new attribute rather than a column - think monkey-patching here)
it won't work if the column name contains spaces or is an integer.
They are the same as long as you're accessing a single column with a simple name, but you can do more with the bracket notation. You can only use df.col if the column name is a valid Python identifier (e.g., it does not contain spaces and other such stuff). Also, you may encounter surprises if your column name clashes with a pandas method name (like sum). With brackets you can select multiple columns (e.g., df[['col1', 'col2']]) or add a new column (df['newcol'] = ...), which can't be done with dot access.
The other question you linked to applies, but that is a much more general question. Python objects get to define how the . and [] operators apply to them. Pandas DataFrames have chosen to make them the same for this limited case of accessing single columns, with the caveats described above.
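To illustrate how an object can define both operators, here is a toy class (MiniFrame is a hypothetical name, and this is not how pandas is actually implemented) that maps [] to a dictionary lookup and falls back to the same lookup for attribute access. It also shows why a real method wins over a column with the same name.

```python
class MiniFrame:
    """Toy container: bracket access and dot access reach the same data."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getitem__(self, key):
        # [] always looks in the column store.
        return self._columns[key]

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so real
        # attributes and methods (like `sum` below) take precedence.
        columns = self.__dict__.get("_columns", {})
        try:
            return columns[name]
        except KeyError:
            raise AttributeError(name)

    def sum(self):
        return "I am a method, not a column"


mf = MiniFrame({"col2": [2.5], "sum": [3], "space bar": [1]})
print(mf.col2 == mf["col2"])   # both spellings reach the same column
print(mf["sum"])               # the column
print(mf.sum())                # the method shadows the column for dot access
print(mf["space bar"])         # only reachable with brackets
```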
Short answer for differences:
[] indexing (square bracket access) has the full functionality to operate on DataFrame column data.
Attribute access (dot access) is mainly a convenience for accessing existing DataFrame column data, and it has limitations (e.g. special column names, creating a new column).
More explanation
Series and DataFrame are core classes and data structures in pandas, and of course they are Python classes too, so there are some minor distinctions in attribute access between a pandas DataFrame and normal Python objects. But it's well documented and can be easily understood. Just a few points to note:
In Python, users may dynamically add data attributes of their own to an instance object using attribute access.
>>> class Dog(object):
...     pass
>>> dog = Dog()
>>> vars(dog)
{}
>>> superdog = Dog()
>>> vars(superdog)
{}
>>> dog.legs = 'I can run.'
>>> superdog.wings = 'I can fly.'
>>> vars(dog)
{'legs': 'I can run.'}
>>> vars(superdog)
{'wings': 'I can fly.'}
In pandas, the index and columns are closely tied to the data structure: you may access an index label on a Series, or a column on a DataFrame, as an attribute.
>>> import pandas as pd
>>> import numpy as np
>>> data = np.random.randint(low=0, high=10, size=(2,2))
>>> df = pd.DataFrame(data, columns=['a', 'b'])
>>> df
a b
0 7 6
1 5 8
>>> vars(df)
{'_is_copy': None,
'_data': BlockManager
Items: Index(['a', 'b'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
IntBlock: slice(0, 2, 1), 2 x 2, dtype: int64,
'_item_cache': {}}
But pandas attribute access is mainly a convenience for reading from and modifying an existing element of a Series or column of a DataFrame.
>>> df.a
0 7
1 5
Name: a, dtype: int64
>>> df.b = [1, 1]
>>> df
a b
0 7 1
1 5 1
And, the convenience is a tradeoff for full functionality. E.g. you can create a DataFrame object with column names ['space bar', '1', 'loc', 'min', 'index'], but you can't access them as an attribute, because they are either not a valid Python identifier (1, space bar) or conflict with an existing attribute name.
>>> data = np.random.randint(0, 10, size=(2, 5))
>>> df_special_col_names = pd.DataFrame(data, columns=['space bar', '1', 'loc', 'min', 'index'])
>>> df_special_col_names
space bar 1 loc min index
0 4 4 4 8 9
1 3 0 1 2 3
In these cases, .loc, .iloc and [] indexing are the defined way to fully access and operate on the index and columns of Series and DataFrame objects.
>>> df_special_col_names['space bar']
0 4
1 3
Name: space bar, dtype: int64
>>> df_special_col_names.loc[:, 'min']
0 8
1 2
Name: min, dtype: int64
>>> df_special_col_names.iloc[:, 1]
0 4
1 0
Name: 1, dtype: int64
Another important difference arises when trying to create a new column for a DataFrame. As you can see, df.c = df.a + df.b just creates a new attribute alongside the core data structure, so starting from version 0.21.0 this behavior raises a UserWarning (silent no more).
>>> df
a b
0 7 1
1 5 1
>>> df.c = df.a + df.b
__main__:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
>>> df['d'] = df.a + df.b
>>> df
a b d
0 7 1 8
1 5 1 6
>>> df.c
0 8
1 6
dtype: int64
>>> vars(df)
{'_is_copy': None,
'_data':
BlockManager
Items: Index(['a', 'b', 'd'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
IntBlock: slice(0, 2, 1), 2 x 2, dtype: int64
IntBlock: slice(2, 3, 1), 1 x 2, dtype: int64,
'_item_cache': {},
'c': 0 8
1 6
dtype: int64}
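The same behavior can be checked programmatically. The sketch below, which assumes a reasonably recent pandas (0.21.0 or later), records any warnings raised by the attribute assignment and then confirms that only the bracket assignment produced a real column. The fixed values 7, 5 and 1, 1 are taken from the session above.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"a": [7, 5], "b": [1, 1]})

# Record warnings instead of printing them, so they can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df.c = df.a + df.b  # attribute assignment: no column is created

print([w.category.__name__ for w in caught])
print("c" in df.columns)  # the attribute is not a column

df["d"] = df["a"] + df["b"]  # bracket assignment: a real column
print(list(df.columns))
```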
Finally, to create a new column for DataFrame, never use attribute access. The correct way is to use either [] or .loc indexing:
>>> df
a b
0 7 6
1 5 8
>>> df['c'] = df.a + df.b
>>> # OR
>>> df.loc[:, 'c'] = df.a + df.b
>>> df # c is a newly added column
a b c
0 7 6 13
1 5 8 13
. notation is very useful when working interactively and for exploration. However, for code clarity and to avoid surprising behavior, you should definitely use [] notation. Here is an example of why you should use [] when creating a new column.
df = pd.DataFrame(data={'A':[1, 2, 3],
'B':[4,5,6]})
# this has no effect on the columns (it only sets an attribute on the object)
df.D = 11
df
A B
0 1 4
1 2 5
2 3 6
# but this works
df['D'] = 11
df
Out[19]:
A B D
0 1 4 11
1 2 5 11
2 3 6 11
If you had a dataframe like this (I am not recommending these column names)...
df = pd.DataFrame({'min':[1,2], 'max': ['a','a'], 'class': [1975, 1981], 'sum': [3,4]})
print(df)
min max class sum
0 1 a 1975 3
1 2 a 1981 4
It all looks OK and there are no errors. You can even access the columns via df['min'] etc...
print(df['min'])
0 1
1 2
Name: min, dtype: int64
However, if you tried with df.<column_name> you would get problems:
print(df.min)
<bound method NDFrame._add_numeric_operations.<locals>.min of min max class sum
0 1 a 1975 3
1 2 a 1981 4>
print(df.max)
<bound method NDFrame._add_numeric_operations.<locals>.max of min max class sum
0 1 a 1975 3
1 2 a 1981 4>
print(df.class)
File "<ipython-input-31-3472b02a328e>", line 1
print(df.class)
^
SyntaxError: invalid syntax
print(df.sum)
<bound method NDFrame._add_numeric_operations.<locals>.sum of min max class sum
0 1 a 1975 3
1 2 a 1981 4>
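One way to spot such names up front is to check each column label against the three failure modes described above: not a valid identifier, a Python keyword, or an existing DataFrame attribute. The helper below is a sketch written for this illustration (unsafe_for_dot_access is not a pandas function).

```python
import keyword

import pandas as pd


def unsafe_for_dot_access(df):
    """Return the column labels that cannot be reached reliably as df.<name>."""
    unsafe = []
    for col in df.columns:
        if (
            not isinstance(col, str)          # e.g. integer column labels
            or not col.isidentifier()         # spaces, leading digits, ...
            or keyword.iskeyword(col)         # e.g. 'class'
            or hasattr(pd.DataFrame, col)     # clashes with a method/attribute
        ):
            unsafe.append(col)
    return unsafe


df = pd.DataFrame({
    "min": [1, 2],
    "max": ["a", "a"],
    "class": [1975, 1981],
    "sum": [3, 4],
    "ok_name": [0, 1],
})
print(unsafe_for_dot_access(df))
```

Columns reported by this check are still fully accessible with df['name'].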


UserWarning: Pandas doesn't allow columns to be created via a new attribute name

I am stuck with my pandas script.
I am working with two CSV files (one input file and one output file).
I want to take all the rows of two columns, do a calculation on them, and then copy the result to another dataframe (the output file).
The columns are as follows:
'lat', 'long','PHCount', 'latOffset_1', 'longOffset_1','PH_Lat_1', 'PH_Long_1', 'latOffset_2', 'longOffset_2', 'PH_Lat_2', 'PH_Long_2', 'latOffset_3', 'longOffset_3','PH_Lat_3', 'PH_Long_3', 'latOffset_4', 'longOffset_4','PH_Lat_4', 'PH_Long_4'.
I want to take the columns 'lat' and 'latOffset_1', do some calculation, and put the result in another column ('PH_Lat_1') which I have already created.
My function is :
def calculate_latoffset(latoffset):  # Calculating Lat offset.
    a = (df2['lat'] - (2 * latoffset))
    return a
The main code :
for i in range(1,5):
    print(i)
    a='PH_lat_%d' % i
    print (a)
    b='latOffset_%d' % i
    print (b)
    df2.a = df2.apply(lambda x: calculate_latoffset(x[b]), axis=1)
Since the column names differ only by their suffix (1, 2, 3, 4), I want to call the function calculate_latoffset and calculate all the rows of all the columns (PH_Lat_1, PH_Lat_2, PH_Lat_3, PH_Lat_4) in one go.
When using the above code I get this warning:
basic_conversion.py:46: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
df2.a = df2.apply(lambda x: calculate_latoffset(x[b]), axis=1)
Is this possible?
Please help.
Simply use df2[a] instead of df2.a (a is a variable holding the column name, so it must not be quoted).
This is a warning, not an error, so your code will still run, but it will probably not do what you intended.
Short answer: To create a new column for a DataFrame, never use attribute access. The correct way is to use either [] or .loc indexing:
>>> df
a b
0 7 6
1 5 8
>>> df['c'] = df.a + df.b
>>> # OR
>>> df.loc[:, 'c'] = df.a + df.b
>>> df # c is a newly added column
a b c
0 7 6 13
1 5 8 13
More explanation: Series and DataFrame are core classes and data structures in pandas, and of course they are Python classes too, so there are some minor distinctions in attribute access between a pandas DataFrame and normal Python objects. But it's well documented and can be easily understood. Just a few points to note:
In Python, users may dynamically add data attributes of their own to an instance object using attribute access.
>>> class Dog(object):
...     pass
>>> dog = Dog()
>>> vars(dog)
{}
>>> superdog = Dog()
>>> vars(superdog)
{}
>>> dog.legs = 'I can run.'
>>> superdog.wings = 'I can fly.'
>>> vars(dog)
{'legs': 'I can run.'}
>>> vars(superdog)
{'wings': 'I can fly.'}
In pandas, the index and columns are closely tied to the data structure: you may access an index label on a Series, or a column on a DataFrame, as an attribute.
>>> import pandas as pd
>>> import numpy as np
>>> data = np.random.randint(low=0, high=10, size=(2,2))
>>> df = pd.DataFrame(data, columns=['a', 'b'])
>>> df
a b
0 7 6
1 5 8
>>> vars(df)
{'_is_copy': None,
'_data': BlockManager
Items: Index(['a', 'b'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
IntBlock: slice(0, 2, 1), 2 x 2, dtype: int64,
'_item_cache': {}}
But pandas attribute access is mainly a convenience for reading from and modifying an existing element of a Series or column of a DataFrame.
>>> df.a
0 7
1 5
Name: a, dtype: int64
>>> df.b = [1, 1]
>>> df
a b
0 7 1
1 5 1
And the convenience is a tradeoff for full functionality. E.g. you can create a DataFrame object with column names ['space bar', '1', 'loc', 'min', 'index'], but you can't access them as attributes, because they are either not valid Python identifiers (1, space bar) or conflict with an existing method name.
>>> data = np.random.randint(0, 10, size=(2, 5))
>>> df_special_col_names = pd.DataFrame(data, columns=['space bar', '1', 'loc', 'min', 'index'])
>>> df_special_col_names
space bar 1 loc min index
0 4 4 4 8 9
1 3 0 1 2 3
In these cases, .loc, .iloc and [] indexing are the defined way to fully access and operate on the index and columns of Series and DataFrame objects.
>>> df_special_col_names['space bar']
0 4
1 3
Name: space bar, dtype: int64
>>> df_special_col_names.loc[:, 'min']
0 8
1 2
Name: min, dtype: int64
>>> df_special_col_names.iloc[:, 1]
0 4
1 0
Name: 1, dtype: int64
As to the topic of creating a new column for a DataFrame: as you can see, df.c = df.a + df.b just creates a new attribute alongside the core data structure, so starting from version 0.21.0 this behavior raises a UserWarning (silent no more).
>>> df
a b
0 7 1
1 5 1
>>> df.c = df.a + df.b
__main__:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
>>> df['d'] = df.a + df.b
>>> df
a b d
0 7 1 8
1 5 1 6
>>> df.c
0 8
1 6
dtype: int64
>>> vars(df)
{'_is_copy': None,
'_data':
BlockManager
Items: Index(['a', 'b', 'd'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
IntBlock: slice(0, 2, 1), 2 x 2, dtype: int64
IntBlock: slice(2, 3, 1), 1 x 2, dtype: int64,
'_item_cache': {},
'c': 0 8
1 6
dtype: int64}
Finally, back to the Short answer.
The solution I can think of is to use .loc to get the column. You can try df.loc[:,a] instead of df.a.
Pandas dataframe columns cannot be created using the dot method, to avoid potential conflicts with the dataframe's attributes. Hope this helps.
Although all the other answers are likely a much better solution, I figured that it does no harm to just ignore the warning and move on.
import warnings
warnings.filterwarnings("ignore","Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access", UserWarning)
Using the code above, the script will just disregard the warning and move on.
In df2.apply(lambda x: calculate_latoffset(x[b]), axis=1) you are creating a 5 column dataframe and you were trying to assign it to a single field. Using df2[a] = calculate_latoffset(df2[b]) instead should deliver the desired output.
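Putting these suggestions together, a minimal sketch of the loop might look as follows. The small df2 here is invented stand-in data containing only the columns the loop needs, and calculate_latoffset is changed to take the 'lat' column as an argument instead of reading the global df2. Note that the target name is spelled 'PH_Lat_%d' to match the column list in the question; the original loop used 'PH_lat_%d', which would create differently named columns.

```python
import pandas as pd

# Hypothetical stand-in for the asker's df2 (only the columns used here).
df2 = pd.DataFrame({
    "lat": [10.0, 20.0],
    "latOffset_1": [1.0, 2.0],
    "latOffset_2": [0.5, 0.25],
    "latOffset_3": [0.0, 1.0],
    "latOffset_4": [2.0, 3.0],
})


def calculate_latoffset(lat, latoffset):
    """Vectorized: works on whole columns, so no apply() is needed."""
    return lat - 2 * latoffset


for i in range(1, 5):
    target = "PH_Lat_%d" % i      # column to write
    source = "latOffset_%d" % i   # column to read
    # Bracket assignment with a variable name creates/overwrites the column.
    df2[target] = calculate_latoffset(df2["lat"], df2[source])

print(df2[["PH_Lat_1", "PH_Lat_2", "PH_Lat_3", "PH_Lat_4"]])
```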

Converting strings in a dataframe column to an index value

Using Python and Pandas, I have a
dataframe, df
with a column entitled 'letters'
a list, letlist = ['a','b','c','d']
I would like to create a new column, 'letters_index', where the generated value is the index of the string in column 'letters' within the list letlist.
I tried
df['letters_index'] = letlist.index(df['letters'])
However, this didn't work. Do you have any suggestions?
As far as I understand, you need:
letlist = ['a', 'b', 'c', 'd']
print(df)
Output:
letters
0 a
1 b
2 b
3 d
4 c
And then
df['new_col'] = df['letters'].apply(lambda x: letlist.index(x))
Output:
0 0
1 1
2 1
3 3
4 2
Name: letters, dtype: int64
Beware that if a value in the column is not present in the list, this raises a ValueError.
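If missing values should not raise, one alternative (a sketch, assuming the entries of letlist are unique) is to build a dictionary from each letter to its position and use Series.map. Values that are not in the dictionary become NaN instead of raising an error.

```python
import pandas as pd

letlist = ["a", "b", "c", "d"]
df = pd.DataFrame({"letters": ["a", "b", "b", "d", "c"]})

# Build the lookup once: letter -> position in letlist.
position = {letter: i for i, letter in enumerate(letlist)}

df["letters_index"] = df["letters"].map(position)
print(df)

# A value that is not in letlist maps to NaN rather than raising.
unknown = pd.Series(["a", "z"]).map(position)
print(unknown)
```

The dictionary lookup also avoids scanning the list once per row, which list.index does.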

Filtering rows from pandas dataframe using concatenated strings

I have a pandas dataframe plus a pandas series of identifiers, and would like to filter the rows from the dataframe that correspond to the identifiers in the series. To get the identifiers from the dataframe, I need to concatenate its first two columns. I have tried various things to filter, but none seem to work so far. Here is what I have tried:
1) I tried adding a column of booleans to the data frame, being true if that row corresponds to one of the identifiers, and false otherwise (hoping to be able to do filtering afterwards using the new column):
df["isInAcids"] = (df["AcNo"] + df["Sortcode"]) in acids
where
acids
is the series containing the identifiers.
However, this gives me a
TypeError: unhashable type
2) I tried filtering using the apply function:
df[df.apply(lambda x: x["AcNo"] + x["Sortcode"] in acids, axis = 1)]
This doesn't give me an error, but the length of the data frame remains unchanged, so it doesn't appear to filter anything.
3) I have added a new column, containing the concatenated strings/identifiers, and then try to filter afterwards (see Filter dataframe rows if value in column is in a set list of values):
df["ACIDS"] = df["AcNo"] + df["Sortcode"]
df[df["ACIDS"].isin(acids)]
But again, the dataframe doesn't change.
I hope this makes sense...
Any suggestions where I might be going wrong?
Thanks,
Anne
I think you're asking for something like the following:
In [1]: other_ids = pd.Series(['a', 'b', 'c', 'c'])
In [2]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'c', 'f']})
In [3]: df
Out[3]:
ids vals
0 a 1
1 b 2
2 c 3
3 f 4
In [4]: other_ids
Out[4]:
0 a
1 b
2 c
3 c
dtype: object
In this case, the series other_ids would be like your series acids. We want to select just those rows of df whose id is in the series other_ids. To do that we'll use the Series method .isin().
In [5]: df.ids.isin(other_ids)
Out[5]:
0 True
1 True
2 True
3 False
Name: ids, dtype: bool
This gives a column of bools that we can index into:
In [6]: df[df.ids.isin(other_ids)]
Out[6]:
ids vals
0 a 1
1 b 2
2 c 3
This is close to what you're doing with your 3rd attempt. Once you post a sample of your dataframe I can edit this answer, if it doesn't work already.
Reading a bit more, you may be having trouble because you have two columns in df that hold your ids? Rather than relying on a DataFrame-level isin (which older pandas versions lacked), we can call isin on each column:
In [26]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'f'],
'ids2': ['e', 'f', 'c', 'f']})
In [27]: df
Out[27]:
ids ids2 vals
0 a e 1
1 b f 2
2 f c 3
3 f f 4
In [28]: df.ids.isin(other_ids) | df.ids2.isin(other_ids)
Out[28]:
0 True
1 True
2 True
3 False
dtype: bool
Combining the two boolean Series from the two isin() calls with | gives an element-wise OR. Then, like before, we can index with this boolean Series:
In [29]: new = df.loc[df.ids.isin(other_ids) | df.ids2.isin(other_ids)]
In [30]: new
Out[30]:
ids ids2 vals
0 a e 1
1 b f 2
2 f c 3
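For the case described in the question, where the identifier is the concatenation of two columns, a sketch along the lines of the third attempt is shown below. The sample values are invented, and both columns are cast to strings before concatenating on the assumption that the identifiers in acids are strings; a type mismatch (for example integers in the frame and strings in acids) is one possible reason isin finds no matches. Note also that Python's in operator applied to a Series tests the index labels, not the values, so isin is the appropriate tool here.

```python
import pandas as pd

# Invented sample data using the column names from the question.
df = pd.DataFrame({
    "AcNo": ["123", "456", "789"],
    "Sortcode": ["01", "02", "03"],
    "vals": [1, 2, 3],
})
acids = pd.Series(["12301", "78903"])

# Build the identifier as a string key, then test membership by value.
key = df["AcNo"].astype(str) + df["Sortcode"].astype(str)
filtered = df[key.isin(acids)]
print(filtered)

# `in` on a Series looks at the index (0, 1 here), not at the values.
print("12301" in acids)          # False
print("12301" in acids.values)   # True
```

Indexing returns a new DataFrame and leaves df itself unchanged, so the result has to be assigned to a name, as done with filtered here.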
