Pythonic way to have multiple ORs when filtering a DataFrame - python

Basically, instead of writing
data[(data['pos'] == "QB") | (data['pos'] == "DST") | ...]
(each comparison needs its own parentheses, since | binds more tightly than ==) where there are many cases I want to check, I was trying to do something similar to the question "What's the pythonic method of doing multiple ors?". However, this
data[data['pos'] in ("QB", "DST", ...)]
doesn't work.
I read the documentation at http://pandas.pydata.org/pandas-docs/stable/gotchas.html but I'm still having issues.

What you are looking for is Series.isin. Example:
data[data['pos'].isin(("QB", "DST", ...))]
This checks whether each value in pos is in the given collection of values - ("QB", "DST", ...) - which is equivalent to what your chain of | comparisons would do.
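For instance, a minimal sketch with made-up data:
import pandas as pd

data = pd.DataFrame({'pos': ['QB', 'RB', 'DST', 'WR'],
                     'points': [20, 12, 8, 15]})
# isin builds one Boolean mask covering all the values at once
print(data[data['pos'].isin(('QB', 'DST'))])  # keeps the QB and DST rows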

Related

getting certain values of the datasets under certain conditions

I want to know the number of rows where the condition 'treatment' in the 'group' column and 'new_page' in the 'landing_page' column don't match. How can I get it?
You could also do this if the purpose is just to get the counts:
data.loc[data.group == 'treatment', :].groupby(['group', 'landing_page'], dropna=False).count()
This is not the most beautiful way, but it is, for me at least, the most straightforward and easy-to-understand solution:
selection = df.loc[(df['group'] != 'treatment') & (df['landing_page'] != 'new_page')]
With df.loc you simply chain your conditions together with &, as in a normal Python if condition; just remember to wrap each condition in parentheses.
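Note that if "don't match" is meant as the two conditions disagreeing (one holds but not the other), the & of two != conditions won't capture that; a minimal sketch of an element-wise exclusive-or, assuming that reading of the question:
# != on two Boolean Series is an element-wise XOR
mismatch = df[(df['group'] == 'treatment') != (df['landing_page'] == 'new_page')]
print(len(mismatch))  # number of rows where the two conditions disagree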

Pythonic way to create a dictionary by iterating

I'm trying to write something that answers "what are the possible values in every column?"
I created a dictionary called all_col_vals and iterate from 1 to however many columns my dataframe has. However, when reading about this online, someone stated this looked too much like Java and the more pythonic way would be to use zip. I can't see how I could use zip here.
all_col_vals = {}
for index in range(RCSRdf.shape[1]):
    all_col_vals[RCSRdf.iloc[:, index].name] = set(RCSRdf.iloc[:, index])
The output looks like 'CFN Network': {nan, 'N521', 'N536', 'N401', 'N612', 'N204'}, 'Exam': {'EXRC', 'MXRN', 'HXRT', 'MXRC'} and shows all the possible values for that specific column. The key is the column name.
I think @piRSquared's comment is the best option, so I'm going to steal it as an answer and add some explanation.
Answer
Assuming you don't have duplicate columns, use the following:
{k : {*df[k]} for k in df}
Explanation
k represents a column name in df. You don't have to use the .columns attribute to access them, because iterating over a pandas.DataFrame yields its column labels, much like iterating over a python dict yields its keys.
df[k] represents the column k as a Series.
{*df[k]} unpacks the values from the Series and places them in a set ({}), which by definition keeps only distinct elements.
Lastly, building the dict with a dict comprehension is faster than defining an empty dict and adding new keys to it via a for-loop.
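As a quick illustration, a minimal sketch with a made-up DataFrame:
import pandas as pd

df = pd.DataFrame({'Exam': ['EXRC', 'MXRN', 'EXRC'],
                   'Grade': ['A', 'B', 'A']})
# one set of distinct values per column, keyed by column name
print({k: {*df[k]} for k in df})  # {'Exam': {'EXRC', 'MXRN'}, 'Grade': {'A', 'B'}}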

Define variable number of columns in for loop

I am new to pandas and I am creating new columns based on conditions from other existing columns using the following code:
df.loc[(df.item1_existing=='NO') & (df.item1_sold=='YES'),'unit_item1']=1
df.loc[(df.item2_existing=='NO') & (df.item2_sold=='YES'),'unit_item2']=1
df.loc[(df.item3_existing=='NO') & (df.item3_sold=='YES'),'unit_item3']=1
Basically, what this means is that if the item is NOT existing ('NO') and the item IS sold ('YES'), then give me a 1. This works to create 3 new columns, but I am thinking there is a better way. As you can see, there is a repeated string in the names of the columns: '_existing' and '_sold'. I am trying to create a for loop that will look for the name of the column that ends with that specific word and concatenate the beginning, something like this:
unit_cols = ['item1','item2','item3']
for i in unit_cols:
    df.loc[('df.'+i+'_existing'=='NO') & ('df'+i+'_sold'=='YES'),'unit_'+i]=1
but of course, it doesn't work. As I said, I am able to make it work with the initial example, but I would like to have fewer lines of code instead of repeating the same code because I need to create several columns this way, not just three. Is there a way to make this easier? is the for loop the best option? Thank you.
You can use Boolean series, i.e. True / False depending on whether your condition is met. Coupled with pd.Series.eq and f-strings (PEP 498, Python 3.6+), and using __getitem__ (or its syntactic sugar []) to select columns by name, you can write your logic more readably:
unit_cols = ['item1', 'item2', 'item3']
for i in unit_cols:
    df[f'unit_{i}'] = df[f'{i}_existing'].eq('NO') & df[f'{i}_sold'].eq('YES')
If you need integers (1 / 0) instead of Boolean values, you can convert via astype:
df[f'unit_{i}'] = df[f'unit_{i}'].astype(int)
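Putting it together, a minimal sketch with made-up data:
import pandas as pd

df = pd.DataFrame({'item1_existing': ['NO', 'YES'],
                   'item1_sold': ['YES', 'YES']})
for i in ['item1']:
    df[f'unit_{i}'] = (df[f'{i}_existing'].eq('NO')
                       & df[f'{i}_sold'].eq('YES')).astype(int)
print(df)  # unit_item1 is 1 in the first row, 0 in the second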

Remove Rows that Contain a specific Value anywhere in the row (Pandas, Python 3)

I am trying to remove all rows in a pandas DataFrame that contain the symbol "+" anywhere in the row. So ideally this:
Keyword
+John
Mary+Jim
David
would become
Keyword
David
I've tried doing something like this in my code but it doesn't seem to be working.
excluded = ('+')
removal2 = removal[~removal['Keyword'].isin(excluded)]
The problem is that sometimes the + is contained within a word, at the beginning of a word, or at the end. Any ideas how to help? Do I need to use an index function? Thank you!
Use the vectorised str method contains and pass the '+' pattern (escaped, since + is a regex metacharacter), negating the boolean condition with ~:
In [29]:
df[~df.Keyword.str.contains(r'\+')]
Out[29]:
Keyword
2 David
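Alternatively, since + only matters here as a literal character, you can skip regex matching entirely with regex=False; a minimal sketch with made-up data:
import pandas as pd

df = pd.DataFrame({'Keyword': ['+John', 'Mary+Jim', 'David']})
# regex=False treats '+' literally, so no escaping is needed
print(df[~df['Keyword'].str.contains('+', regex=False)])  # keeps only David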

Pythonic way of saying "if all of the elements in list 1 also exist in list 2"

I want to return true from the if statement only if all of the elements from list 1 also exist in list 2 (list 2 is a superset of list 1). What is the most pythonic way of writing this?
You can use set operations:
if set(list1) <= set(list2):
    # ...
Note that the comparison itself is fast, but converting the lists to sets might not be (it depends on the size of the lists).
Converting to a set also removes any duplicates. So if you have duplicate elements and want to ensure that they are also duplicated in the other list, using sets will not work.
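If duplicate counts do matter, one option is collections.Counter, which behaves like a multiset; a minimal sketch (is_sub_multiset is a made-up helper name):
from collections import Counter

def is_sub_multiset(list1, list2):
    # Counter subtraction keeps only positive counts, so the result is
    # empty exactly when list2 has every element of list1 at least as often
    return not (Counter(list1) - Counter(list2))

print(is_sub_multiset([1, 1, 2], [1, 1, 2, 3]))  # True
print(is_sub_multiset([1, 1, 2], [1, 2, 3]))     # False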
You can use built-in all() function:
if all(x in sLVals for x in fLVals):
    # do something
If you're working with sets, you can also take a look at the difference method; as far as I know, it's quite fast:
if set(fLVals).difference(sLVals):
    # there is a difference
else:
    # no difference
Either set.issuperset or all(x in L2 for x in L1).
This one came straight from the good folks at MIT:
from functools import reduce  # reduce moved to functools in Python 3
from operator import and_
reduce(and_, [x in b for x in a])
I tried to find the "readings.pdf" they had posted for the 6.01 class about a year ago, but I can't find it anymore.
Head to my profile and send me an email, and I'll send you the .pdf where I got this example. It's a very good book, but it doesn't seem to be a part of the class anymore.
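For a quick comparison, here is a usage sketch of the two main approaches on made-up lists:
list1 = ['a', 'b']
list2 = ['a', 'b', 'c']
# set-based check: is every element of list1 in list2?
print(set(list2).issuperset(list1))    # True
# generator-based check: same question, no set conversion
print(all(x in list2 for x in list1))  # True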
