Getting certain values of a dataset under certain conditions - python

I want to know the number of rows where the conditions 'treatment' in the 'group' column and 'new_page' in the 'landing_page' column don't match. How can I get it?

You could also do this if the purpose is just to get the counts:
data.loc[data.group == 'treatment',:].groupby(['group','landing_page'], dropna=False).count()

This is not the most elegant way, but it is, for me at least, the most straightforward and easy-to-understand solution:
selection = df.loc[(df['group'] != 'treatment') & (df['landing_page'] != 'new_page')]
With df.loc you simply chain your conditions together with & as in normal Python if conditions.
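If the goal of the question is specifically to count the rows where the two labels disagree (one condition holds but not the other), here is a minimal sketch, not taken from either answer above, that compares the two boolean masks directly; the != between the masks acts as an element-wise XOR, so summing it counts the disagreeing rows:
import pandas as pd

# Hypothetical toy frame with the two columns from the question
data = pd.DataFrame({
    'group': ['treatment', 'treatment', 'control', 'control'],
    'landing_page': ['new_page', 'old_page', 'new_page', 'old_page'],
})

# True where exactly one of the two conditions holds, i.e. the labels disagree
mismatch = (data['group'] == 'treatment') != (data['landing_page'] == 'new_page')
print(mismatch.sum())  # 2 mismatched rows in this toy frame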

using testing.assert_series_equal when series are not in the same order

I have two series that are equal but in different order.
data1 = np.array(['1', '2', '3', '4', '5', '6'])
data2 = np.array(['6', '2', '4', '3', '1', '5'])
sr1 = pd.Series(data1)
sr2 = pd.Series(data2)
The two series are the outputs of different functions, and I'm testing whether they are equal:
pd.testing.assert_series_equal(sr1, sr2, check_names=False)
This fails, of course, because the two series are not in the same order.
I checked the online documentation; it mentions check_like, but that does not work for me (I guess because I don't have the same version of pandas).
Is there a quick way, for a unit test, to check that these two series are equal even if they are not in the same order, without updating any packages?
Assuming you consider the Series equal if they have the same items, I would use:
sr1.value_counts().eq(sr2.value_counts()).all()
Or, without sorting, which should be more efficient (sorting is O(n log n)):
sr1.value_counts(sort=False).eq(sr2.value_counts(sort=False)).all()
Output: True
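As a self-contained illustration with made-up data (not from the answer above): counting occurrences, rather than just comparing the sets of values, also catches series whose values repeat a different number of times:
import pandas as pd

a = pd.Series(['1', '1', '2'])
b = pd.Series(['1', '2', '2'])
# Same set of values, but different repetition counts, so the check fails
print(a.value_counts(sort=False).eq(b.value_counts(sort=False)).all())  # False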
You can check whether the sorted versions are the same to eliminate the ordering:
(np.sort(sr1) == np.sort(sr2)).all()
If there are missing values, you need to handle them separately: first check that both series have the same number of missing values, then compare the rest:
((sr1.isna().sum() == sr2.isna().sum())
and (np.sort(sr1.dropna()) == np.sort(sr2.dropna())).all())
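If a positional comparison after sorting is acceptable for the unit test, another sketch (an assumption on my part, not taken from the answers above) keeps assert_series_equal itself by sorting both series and resetting their indexes:
import numpy as np
import pandas as pd

sr1 = pd.Series(np.array(['1', '2', '3', '4', '5', '6']))
sr2 = pd.Series(np.array(['6', '2', '4', '3', '1', '5']))

# Sort the values and drop the old index so the comparison is by value,
# not by original position; raises AssertionError on mismatch
pd.testing.assert_series_equal(
    sr1.sort_values().reset_index(drop=True),
    sr2.sort_values().reset_index(drop=True),
    check_names=False,
)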

How to slice a DataFrame using logical "or" and a for loop in Python?

I am a complete beginner in Python.
The goal is to slice the DataFrame ('incen') to the rows whose device_name column contains specific device codes...
But the device codes keep changing, so I put them into a list ('device_code').
I tried to use a for loop like this:
for a in device_code:
    incen[(incen.device_name.str.contains(a) == True)]
However, the results couldn't be merged into one DataFrame.
So I solved it this way, which is super inefficient:
incen[(incen.device_name.str.contains(device_code[0]) == True)]|incen[(incen.device_name.str.contains(device_code[1]) == True)]|incen[(incen.device_name.str.contains(device_code[2]) == True)]|incen[(incen.device_name.str.contains(device_code[3]) == True)]
...and so on...
Please help me understand how to use a logical 'or' and a for loop at the same time. Thanks.
Use a list comprehension with np.logical_or.reduce; comparing to True is not necessary:
import numpy as np

L = [incen.device_name.str.contains(a, na=False) for a in device_code]
incen[np.logical_or.reduce(L)]
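An alternative sketch, assuming the device codes should be matched literally: join them into a single regex with re.escape so str.contains is called only once:
import re

# One pattern that matches any of the device codes from the question
pattern = '|'.join(map(re.escape, device_code))
incen[incen.device_name.str.contains(pattern, na=False)]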

Using pandas value_counts() under defined condition

After a lot of errors, exceptions and high blood pressure, I finally came up with this solution that works for what I needed: basically, I need to count all the column values that satisfy a specific condition.
So, let's say I have a list of strings like
vehicle = ['car', 'boat', 'car', 'car', 'bike', 'tank', 'DeLorean', 'tank']
I want to count which values appear more than 2 times.
Consider that the column name of the dataframe based upon the list is 'veh'.
So, this piece of code works:
df['veh'].value_counts()[df['veh'].value_counts() > 2]
The question is: why does the [df['veh'].value_counts() > 2] part come right after the () of value_counts()? There is no . or any other linking sign that could mean something.
If I use the code
df['classi'].value_counts() > 1
(which is the logical syntax my limited brain can come up with), it returns boolean values.
Can someone please help me understand the logic behind pandas?
I am pretty sure that pandas is awesome and the problem lies on my side, but I really want to understand it. I've read a lot of material (documentation included) but could not fill this gap.
Thank you in advance!
The following line of code
df['veh'].value_counts()
returns a pandas Series with the distinct values as the index and the number of occurrences as the values.
Everything between square brackets [] is a filter on the keys of a pandas Series. So
df['veh'].value_counts()['car']
should return the number of occurrences of the word 'car' in column 'veh', which is equivalent to the value stored under the key 'car' in the series df['veh'].value_counts().
A pandas Series also accepts a list of keys as an index, so
df['veh'].value_counts()[['car','boat']]
should return the number of occurrences of the words 'car' and 'boat', respectively.
Furthermore, the Series accepts a list of booleans as a key if it is of the same length as the Series. That is, it accepts a boolean mask.
When you write
df['veh'].value_counts() > 2
you compare each value in df['veh'].value_counts() with the number 2. This returns a boolean for each value, that is, a boolean mask.
So you can use the boolean mask as a filter on the series you created. Thus
df['veh'].value_counts()[df['veh'].value_counts() > 2]
returns all the counts for the keys whose count is greater than 2.
The logic is that you can slice a series with a boolean series of the same size:
s[bool_series]
or equivalently
s.loc[bool_series]
This is also referred to as boolean indexing.
Now, your code is equivalent to:
s = df['veh'].value_counts()
bool_series = s > 2
followed by either of the two forms above, e.g. s[s > 2] or s.loc[s > 2].
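Putting both answers together on the question's own data, a small runnable sketch of the boolean-indexing steps might look like this:
import pandas as pd

vehicle = ['car', 'boat', 'car', 'car', 'bike', 'tank', 'DeLorean', 'tank']
df = pd.DataFrame({'veh': vehicle})

counts = df['veh'].value_counts()   # value -> number of occurrences
mask = counts > 2                   # boolean Series aligned on the same index
print(counts[mask])                 # only 'car' (3 occurrences) survives the mask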

python list comprehension with if condition looping

Is it possible to use a list comprehension on a dataframe if I want to change one column's value based on a condition on another column's value?
The code I'm hoping to make work would be something like this:
return ['lower_level' for x in usage_time_df['anomaly'] if [y < lower_outlier for y in usage_time_df['device_years']]
Thanks!
I don't think what you want to do can be done in a list comprehension, and if it can, it will definitely not be efficient.
Assuming a dataframe usage_time_df with two columns, anomaly and device_years, if I understand correctly, you want to set the value in anomaly to lower_level when the value in device_years does not reach lower_outlier (which I guess is a float). The natural way to do that is:
usage_time_df.loc[usage_time_df['device_years'] < lower_outlier, 'anomaly'] = 'lower_level'
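For completeness, an equivalent sketch using np.where, assuming lower_outlier is a numeric threshold and the remaining rows should keep their current anomaly value:
import numpy as np

usage_time_df['anomaly'] = np.where(
    usage_time_df['device_years'] < lower_outlier,  # condition
    'lower_level',                                  # value where the condition holds
    usage_time_df['anomaly'],                       # keep the existing value otherwise
)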

Define variable number of columns in for loop

I am new to pandas and I am creating new columns based on conditions from other existing columns using the following code:
df.loc[(df.item1_existing=='NO') & (df.item1_sold=='YES'),'unit_item1']=1
df.loc[(df.item2_existing=='NO') & (df.item2_sold=='YES'),'unit_item2']=1
df.loc[(df.item3_existing=='NO') & (df.item3_sold=='YES'),'unit_item3']=1
Basically, what this means is that if the item is NOT existing ('NO') and the item IS sold ('YES'), then give me a 1. This works to create 3 new columns, but I am thinking there is a better way. As you can see, there is a repeated string in the names of the columns: '_existing' and '_sold'. I am trying to create a for loop that looks for the column names ending with those specific words and concatenates the beginning, something like this:
unit_cols = ['item1','item2','item3']
for i in unit_cols:
    df.loc[('df.'+i+'_existing'=='NO') & ('df'+i+'_sold'=='YES'), 'unit_'+i] = 1
but of course, it doesn't work. As I said, I am able to make it work with the initial example, but I would like to have fewer lines of code instead of repeating the same code, because I need to create several columns this way, not just three. Is there a way to make this easier? Is the for loop the best option? Thank you.
You can use Boolean series, i.e. True / False depending on whether your condition is met. Coupled with pd.Series.eq and f-strings (PEP498, Python 3.6+), and using __getitem__ (or its syntactic sugar []) to allow string inputs, you can write your logic more readably:
unit_cols = ['item1','item2','item3']
for i in unit_cols:
    df[f'unit_{i}'] = df[f'{i}_existing'].eq('NO') & df[f'{i}_sold'].eq('YES')
If you need integers (1 / 0) instead of Boolean values, you can convert via astype, inside the same loop:
    df[f'unit_{i}'] = df[f'unit_{i}'].astype(int)
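As a hypothetical extension (not part of the original answer), if the item columns are not known in advance, the list could be derived from the existing column names instead of being written by hand:
# Collect 'item1', 'item2', ... from columns ending in '_existing' or '_sold'
unit_cols = sorted({c.rsplit('_', 1)[0] for c in df.columns
                    if c.endswith(('_existing', '_sold'))})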
