Using pandas value_counts() under defined condition - python

After a lot of errors, exceptions, and high blood pressure, I finally came up with a solution that does what I need: basically, I need to count all the column values that satisfy a specific condition.
So, let's say I have a list of strings like
vehicle = ['car', 'boat', 'car', 'car', 'bike', 'tank', 'DeLorean', 'tank']
I want to count which values appear more than 2 times.
Consider that the column name of the dataframe based upon the list is 'veh'.
So, this piece of code works:
df['veh'].value_counts()[df['veh'].value_counts() > 2]
The question is: why does the [df['veh'].value_counts() > 2] part come right after the "()" of value_counts()? There is no "." or any other linking operator that could mean something.
If I use the code
df['classi'].value_counts() > 1
(which is the logical syntax my limited brain can come up with), it returns boolean values.
Can someone, please, help me understand the logic behind pandas?
I am pretty sure that pandas is awesome and the problem lies on my side, but I really want to understand it. I've read a lot of material (documentation included), but could not find anything that fills this gap of mine.
Thank you in advance!

The following line of code
df['veh'].value_counts()
Returns a pandas Series with the unique values as keys (the index) and their number of occurrences as values.
Everything between square brackets [] is a filter on the keys of a pandas Series. So
df['veh'].value_counts()['car']
Should return the number of occurrences of the word 'car' in column 'veh', which is equivalent to looking up the value for the key 'car' in the series df['veh'].value_counts().
A pandas Series also accepts a list of keys as an index, so
df['veh'].value_counts()[['car','boat']]
Should return the number of occurrences for the words 'car' and 'boat' respectively
Furthermore, the series accepts a list of booleans as a key, provided it is the same length as the series. That is, it accepts a boolean mask.
When you write
df['veh'].value_counts() > 2
You make a comparison between each value in df['veh'].value_counts() and the number 2. This returns a boolean for each value, that is, a boolean mask.
So you can use the boolean mask as a filter on the series you created. Thus
df['veh'].value_counts()[df['veh'].value_counts() > 2]
Returns all the entries of the series whose occurrence count is greater than 2.
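Putting it all together with the list from the question:
import pandas as pd

vehicle = ['car', 'boat', 'car', 'car', 'bike', 'tank', 'DeLorean', 'tank']
df = pd.DataFrame({'veh': vehicle})

counts = df['veh'].value_counts()  # car: 3, tank: 2, boat/bike/DeLorean: 1 each
print(counts[counts > 2])          # only 'car' passes the mask: car    3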

The logic is that you can slice a series with a boolean series of the same size:
s[bool_series]
or equivalently
s.loc[bool_series]
This is also referred to as boolean indexing.
Now, your code is equivalent to:
s = df['veh'].value_counts()
bool_series = s > 2
And then filter with either of the two forms above, e.g. s[s > 2].
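Spelled out, with df the frame built from the question's vehicle list:
s = df['veh'].value_counts()
bool_series = s > 2

print(s[bool_series])      # boolean indexing with []
print(s.loc[bool_series])  # identical result via .loc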

Related

remove the common elements in array based on index value

I have an array and want to remove the duplicate values based on index.
Eg: a= [1,3,4,3]
Expected array : a = [1,4,3]
I want to remove the common elements with the lower index value.
It's not optimal, but since you don't seem to be looking for a fast algorithm, this should be enough (especially with small arrays):
[1,3,4,3].reverse.uniq.reverse
This code is for Ruby only.
You will need to loop through the list in reverse order:
for index in range(len(my_list)-1,-1,-1):
    if my_list[index] in my_list[index+1:]:
        del my_list[index]
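Run against the list from the question, the loop keeps the last occurrence of each duplicate:
my_list = [1, 3, 4, 3]
for index in range(len(my_list)-1,-1,-1):
    if my_list[index] in my_list[index+1:]:
        del my_list[index]
print(my_list)  # [1, 4, 3]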

python list comprehension with if condition looping

Is it possible to use a list comprehension for a dataframe if I want to change one column's value based on a condition on another column's value?
The code I'm hoping to make work would be something like this:
return ['lower_level' for x in usage_time_df['anomaly'] if [y < lower_outlier for y in usage_time_df['device_years']]]
Thanks!
I don't think what you want to do can be done in a list comprehension, and if it can, it will definitely not be efficient.
Assuming a dataframe usage_time_df with two columns, anomaly and device_years, if I understand correctly, you want to set the value in anomaly to lower_level when the value in device_years does not reach lower_outlier (which I guess is a float). The natural way to do that is:
usage_time_df.loc[usage_time_df['device_years'] < lower_outlier, 'anomaly'] = 'lower_level'
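For illustration, a minimal sketch with an invented frame and threshold (the data and the value of lower_outlier are placeholders):
import pandas as pd

usage_time_df = pd.DataFrame({'device_years': [0.5, 3.0, 1.5],
                              'anomaly': ['none', 'none', 'none']})
lower_outlier = 1.0

usage_time_df.loc[usage_time_df['device_years'] < lower_outlier, 'anomaly'] = 'lower_level'
print(usage_time_df)  # only the 0.5 row is set to 'lower_level'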

Define variable number of columns in for loop

I am new to pandas and I am creating new columns based on conditions from other existing columns using the following code:
df.loc[(df.item1_existing=='NO') & (df.item1_sold=='YES'),'unit_item1']=1
df.loc[(df.item2_existing=='NO') & (df.item2_sold=='YES'),'unit_item2']=1
df.loc[(df.item3_existing=='NO') & (df.item3_sold=='YES'),'unit_item3']=1
Basically, what this means is that if item is NOT existing ('NO') and the item IS sold ('YES') then give me a 1. This works to create 3 new columns but I am thinking there is a better way. As you can see, there is a repeated string in the name of the columns: '_existing' and '_sold'. I am trying to create a for loop that will look for the name of the column that ends with that specific word and concatenate the beginning, something like this:
unit_cols = ['item1','item2','item3']
for i in unit_cols:
    df.loc[('df.'+i+'_existing'=='NO') & ('df'+i+'_sold'=='YES'),'unit_'+i]=1
but of course, it doesn't work. As I said, I am able to make it work with the initial example, but I would like to have fewer lines of code instead of repeating the same code because I need to create several columns this way, not just three. Is there a way to make this easier? is the for loop the best option? Thank you.
You can use Boolean series, i.e. True / False depending on whether your condition is met. Coupled with pd.Series.eq and f-strings (PEP498, Python 3.6+), and using __getitem__ (or its syntactic sugar []) to allow string inputs, you can write your logic more readably:
unit_cols = ['item1','item2','item3']
for i in unit_cols:
    df[f'unit_{i}'] = df[f'{i}_existing'].eq('NO') & df[f'{i}_sold'].eq('YES')
If you need integers (1 / 0) instead of Boolean values, you can convert via astype:
df[f'unit_{i}'] = df[f'unit_{i}'].astype(int)
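Put together with a toy frame (the data is invented; only the column layout follows the question):
import pandas as pd

df = pd.DataFrame({'item1_existing': ['NO', 'YES'], 'item1_sold': ['YES', 'YES'],
                   'item2_existing': ['NO', 'NO'], 'item2_sold': ['NO', 'YES']})

for i in ['item1', 'item2']:
    df[f'unit_{i}'] = (df[f'{i}_existing'].eq('NO') & df[f'{i}_sold'].eq('YES')).astype(int)

print(df[['unit_item1', 'unit_item2']])  # row 0: [1, 0]; row 1: [0, 1]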

Checking dataframe cells to see if they contain a value

Let's say I have a fairly simple code such as
import pandas
df_import=pandas.read_excel("dataframe.xlsx")
df_import['Company'].str.contains('value',na=False,case=False)
So this obviously imports pandas, creates a dataframe from an Excel document, and then searches the column titled Company for some value, returning a boolean Series saying whether each cell contains that value (True or False).
However, I want to test 3 cases: case A, no results were found (all False); case B, only 1 result was found (only 1 True); and case C, more than 1 result was found (# of True > 1).
My thought is that I could set up a for loop iterating through the column, and if the value of a cell is True, add 1 to a variable (let's call it count). Then at the end, I have an if/elif/elif statement based on the value of count, whether it is 0, 1, or >1.
Now, maybe there is a better way to check this but if not, I figured the for loop would look something like
for i in range(len(df_import.index)):
    if df_import.iloc[i,0].str.contains('value',na=False,case=False):
        count += 1
First of all, I'm not sure if I should use .iloc or .iat but both give me the error
AttributeError: 'str' object has no attribute 'str'
and I wasn't able to find a correction for this.
Your current code is not going to work because iloc[i, 0] returns a scalar value, and of course, those don't have str accessor methods associated with them.
A quick and easy fix would be to just call sum on the series level str.contains call.
count = df_import['Company'].str.contains('value', na=False, case=False).sum()
Now, count contains the number of matches in that column.
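With count in hand, you can branch exactly the way you described, no loop needed (a minimal sketch; the printed messages are placeholders):
count = df_import['Company'].str.contains('value', na=False, case=False).sum()

if count == 0:
    print('no matches')          # case A: all False
elif count == 1:
    print('exactly one match')   # case B: a single True
else:
    print('multiple matches')    # case C: more than one True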

How do I pass multiple variables to a function in python?

I would like to compare a column from several pairs of pandas dataframes and write the shared values to an empty list. I have written a function that can do this with a single pair of dataframes, but I cannot seem to scale it up.
def parser(dataframe1,dataframe2,emptylist):
    for i1 in dataframe1['POS']:
        for i2 in dataframe2['POS']:
            if i1 == i2:
                emptylist.append(i1)
Where 'POS' is a column header in the two pandas dataframes.
I have made a list of variable names for each input value of this function, eg.
dataframe1_names=['name1','name2',etc...]
dataframe2_names=['name1','name2',etc...]
emptylist_names=['name1','name2',etc...]
Where each element of the list is a string containing the name of a variable (either a pandas dataframe in the case of the first two, or an empty list in the case of the last).
I have tried to iterate through these lists using the following code:
import itertools
for a, b, c in zip(range(len(dataframe1_names)), range(len(dataframe2_names)), range(len(emptylist_names))):
    parser(dataframe1_names[a],dataframe2_names[b],emptylist_names[c])
But this returns TypeError: string indices must be integers.
I believe that this error is coming from passing the function a string containing the variable name instead of the variable name itself. Is there another way to pass multiple variables to a function in an automated way?
Thanks for your help!
Do you have to use strings of object names, instead of just the objects themselves? If you do
dataframes1=[name1,name2,...]
dataframes2=[name1,name2,...]
emptylists=[name1,name2,...]
Then you can just do
for a,b,c in zip( dataframes1, dataframes2, emptylists ):
    parser(a,b,c)
The way you do this is really circuitous and unpythonic, by the way, so I've changed it a bit. Rather than getting lists of indexes for the for statement, I just iterate through the lists (and thus the objects) themselves. This is much more compact and easier to understand. For that matter, do you have a need to input the empty list as an argument (e.g., perhaps they aren't always empty)?

Also, your code for the parser, while correct, doesn't take advantage of pandas at all and will be very slow: to compare columns, you can simply do dataframe1['COL'] == dataframe2['COL'], which will give you a boolean series of where the values are equal. Then you can use this for indexing a dataframe, to get the shared values. It comes out as a dataframe or series, but it's easy enough to convert to a list. Thus, your parser function can be reduced to the following, if you don't need to create the "empty list" elsewhere first:
def parser( df1, df2 ):
    return list( df1['COL'][ df1['COL']==df2['COL'] ] )
This will be much, much faster, though as it returns the list, you'll have to do something with it, so in your case, you'd do something like:
sharedlists = [ parser(a,b) for a,b in zip( dataframes1, dataframes2 ) ]
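For instance, with two aligned toy frames (the column name 'COL' and the data are invented; the row-wise comparison assumes each pair of frames shares the same length and index), reusing the parser above:
import pandas as pd

dataframes1 = [ pd.DataFrame({'COL': [1, 2, 3]}) ]
dataframes2 = [ pd.DataFrame({'COL': [1, 9, 3]}) ]

sharedlists = [ parser(a,b) for a,b in zip( dataframes1, dataframes2 ) ]
print(sharedlists)  # [[1, 3]]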
If you must use variable names, the following very unsafe sort of code will convert your lists of names into lists of objects (you'll need to do this for each list):
dataframes1 = [ eval(name) for name in dataframe1_names ]
If this is just for numerical work you're doing in an interpreter, eval is alright, but for any code you're releasing, it's very insecure: it will evaluate whatever code is in the string passed into it, thus allowing arbitrary code execution.
This sounds like a use case for .query()
A use case for query() is when you have a collection of DataFrame objects that have a subset of column names (or index levels/names) in common. You can pass the same query to both frames without having to specify which frame you're interested in querying.
map(lambda frame: frame.query(expr), [df, df2])
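One caveat: in Python 3, map returns a lazy iterator, so wrap it in list() (or use a comprehension) to materialize the frames. A minimal sketch with invented frames and expression:
import pandas as pd

df = pd.DataFrame({'POS': [1, 2, 3, 4]})
df2 = pd.DataFrame({'POS': [3, 4, 5, 6]})

expr = 'POS > 2'  # placeholder query shared by both frames
filtered = list(map(lambda frame: frame.query(expr), [df, df2]))
print(filtered[0])  # rows of df where POS > 2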
What kind of output are you looking for in the case where you have more than two DataFrame objects? In the case of just two, the following line would accomplish what your parser function does:
common = df1[df1["fieldname"] == df2["fieldname"]]["fieldname"]
except that common would be a DataFrame object itself, rather than a list, but you can easily get a list from it by doing list(common).
If you're looking for a function that takes any number of DataFrames and returns a list of common values in some field for each pair, you could do something like this:
from itertools import combinations
def common_lists(field, *dfs):
    return [df1[df1[field] == df2[field]][field] for df1, df2 in combinations(dfs, 2)]
The same deal about getting a list from a DataFrame applies here, since you'll be getting a list of DataFrames.
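A quick usage sketch with invented frames, reusing common_lists from above:
import pandas as pd

dfA = pd.DataFrame({'POS': [1, 2, 3]})
dfB = pd.DataFrame({'POS': [1, 5, 3]})
dfC = pd.DataFrame({'POS': [1, 2, 9]})

print([list(s) for s in common_lists('POS', dfA, dfB, dfC)])
# [[1, 3], [1, 2], [1]]  (one result per pair: (A,B), (A,C), (B,C))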
As far as this bit:
import itertools
for a, b, c in zip(range(len(dataframe1_names)), range(len(dataframe2_names)), range(len(emptylist_names))):
    parser(dataframe1_names[a],dataframe2_names[b],emptylist_names[c])
What you're doing is creating a list that looks something like this:
[(0,0,0), (1,1,1), ... (n,n,n)]
where n is the length of the shortest of dataframe1_names, dataframe2_names, and emptylist_names. So on the first iteration of the loop, you have a == b == c == 0, and you're using these values to index into your arrays of data frame variable names, so you're calling parser("name1", "name1", "name1"), passing it strings instead of pandas DataFrame objects. Your parser function is expecting DataFrame objects so it barfs when you try to call dataframe1["POS"] where dataframe1 is the string "name1".
