Is it possible to use a list comprehension on a dataframe if I want to change one column's value based on a condition on another column's value?
The code I'm hoping to make work would be something like this:
return ['lower_level' for x in usage_time_df['anomaly'] if [y < lower_outlier for y in usage_time_df['device_years']]]
Thanks!
I don't think what you want to do can be done in a list comprehension, and if it can, it will definitely not be efficient.
Assuming a dataframe usage_time_df with two columns, anomaly and device_years, if I understand correctly, you want to set the value in anomaly to lower_level when the value in device_years does not reach lower_outlier (which I guess is a float). The natural way to do that is:
usage_time_df.loc[usage_time_df['device_years'] < lower_outlier, 'anomaly'] = 'lower_level'
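For illustration, here's a minimal runnable sketch (the sample values and lower_outlier below are made up):

import pandas as pd

# Hypothetical sample data for illustration
lower_outlier = 1.0
usage_time_df = pd.DataFrame({
    'device_years': [0.5, 2.3, 0.8],
    'anomaly': ['none', 'none', 'none'],
})

# Rows where device_years is below the threshold get 'lower_level'
usage_time_df.loc[usage_time_df['device_years'] < lower_outlier, 'anomaly'] = 'lower_level'
print(usage_time_df)
#    device_years      anomaly
# 0           0.5  lower_level
# 1           2.3         none
# 2           0.8  lower_level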
Trying to optimize my code, I want to convert this for loop into a list comprehension. Any help, please?
fecha_añadir=pd.Timestamp('2200-01-01T12')
for x in range(0, len(df_vigencias)):
    maximo = df_vigencias['max_vigencia'][x]
    if df_vigencias['FECHA_FINAL_V' + str(int(maximo))][x] is pd.NaT:
        df_vigencias['FECHA_FINAL_V' + str(int(maximo))][x] = fecha_añadir
I tried
[df_vigencias['FECHA_FINAL_V'+str(int(df_vigencias['max_vigencia']))]=fecha_añadir if df_vigencias['FECHA_FINAL_V'+str(int(df_vigencias['max_vigencia']))] is pd.Nat else df_vigencias['FECHA_FINAL_V'+str(int(df_vigencias['max_vigencia']))] for x in range(0,len(df_vigencias))]
This is the data frame.
First I want to find the number in the last column; then I use that number to look up the column name where I need to insert the value.
I thought a list comprehension would make my code faster, but any other solution could work.
Please have a look at this tutorial:
How to Convert Loops to List Comprehensions in Python
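For reference, the row loop above can also be rewritten with a boolean mask rather than a list comprehension. A rough sketch, assuming the column layout described in the question (columns FECHA_FINAL_V1, FECHA_FINAL_V2, ... selected by max_vigencia):

import pandas as pd

fecha_añadir = pd.Timestamp('2200-01-01T12')

# Iterate over the few distinct max_vigencia values rather than every row;
# each iteration fills one FECHA_FINAL_V<n> column using a vectorized mask.
for n in df_vigencias['max_vigencia'].dropna().astype(int).unique():
    col = 'FECHA_FINAL_V' + str(n)
    mask = df_vigencias['max_vigencia'].eq(n) & df_vigencias[col].isna()
    df_vigencias.loc[mask, col] = fecha_añadir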
After a lot of errors, exceptions and high blood pressure, I finally came up with this solution that does what I need: basically, I need to find all the column values that satisfy a specific condition.
So, let's say I got a list of strings just like
vehicle = ['car', 'boat', 'car', 'car', 'bike', 'tank', 'DeLorean', 'tank']
I want to count which values appear more than 2 times.
Consider that the column name of the dataframe based upon the list is 'veh'.
So, this piece of code works:
df['veh'].value_counts()[df['veh'].value_counts() > 2]
The question is: why does the [df['veh'].value_counts() > 2] part come right after the () of value_counts()? There is no . or any other linking sign between them that could mean something.
If I use the code
df['classi'].value_counts() > 1
(which would be the logical syntax that my limited brain can abstract), it returns boolean values.
Can someone please help me understand the logic behind pandas?
I am pretty sure that pandas is awesome and the problem lies on my side, but I really want to understand it. I've read a lot of material (documentation included), but could not fill this gap.
Thank you in advance!
The following line of code
df['veh'].value_counts()
Returns a pandas Series with the distinct values as keys (the index) and their numbers of occurrences as values.
Everything between square brackets [] is a filter on the keys of a pandas Series. So
df['veh'].value_counts()['car']
Should return the number of occurrences of the word 'car' in column 'veh', which is equivalent to the corresponding value for the key 'car' in the series df['veh'].value_counts().
A pandas Series also accepts a list of keys as an index, so
df['veh'].value_counts()[['car','boat']]
Should return the number of occurrences of the words 'car' and 'boat', respectively.
Furthermore, a Series accepts a list of booleans as a key if it is the same length as the series. That is, it accepts a boolean mask.
When you write
df['veh'].value_counts() > 2
You make a comparison between each value in df['veh'].value_counts() and the number 2. This returns a boolean for each value, that is, a boolean mask.
So you can use the boolean mask as a filter on the series you created. Thus
df['veh'].value_counts()[df['veh'].value_counts() > 2]
Returns the counts for the keys whose number of occurrences is greater than 2.
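Putting it together with the example list from the question:

import pandas as pd

vehicle = ['car', 'boat', 'car', 'car', 'bike', 'tank', 'DeLorean', 'tank']
df = pd.DataFrame({'veh': vehicle})

counts = df['veh'].value_counts()   # car: 3, tank: 2, boat/bike/DeLorean: 1
mask = counts > 2                   # boolean Series over the same keys
print(counts[mask])
# car    3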
The logic is that you can slice a series with a boolean series of the same size:
s[bool_series]
or equivalently
s.loc[bool_series]
This is also referred to as boolean indexing.
Now, your code is equivalent to:
s = df['veh'].value_counts()
bool_series = s > 2
And then s[bool_series] or s.loc[bool_series]; inlining the first two lines gives s[s > 2].
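A quick check that the two spellings agree, on a small hand-built series:

import pandas as pd

s = pd.Series({'car': 3, 'tank': 2, 'boat': 1})
bool_series = s > 2

print(s[bool_series])      # car    3
print(s.loc[bool_series])  # same result; .loc makes the label-based intent explicit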
I am new to pandas and I am creating new columns based on conditions from other existing columns using the following code:
df.loc[(df.item1_existing=='NO') & (df.item1_sold=='YES'),'unit_item1']=1
df.loc[(df.item2_existing=='NO') & (df.item2_sold=='YES'),'unit_item2']=1
df.loc[(df.item3_existing=='NO') & (df.item3_sold=='YES'),'unit_item3']=1
Basically, what this means is that if item is NOT existing ('NO') and the item IS sold ('YES') then give me a 1. This works to create 3 new columns but I am thinking there is a better way. As you can see, there is a repeated string in the name of the columns: '_existing' and '_sold'. I am trying to create a for loop that will look for the name of the column that ends with that specific word and concatenate the beginning, something like this:
unit_cols = ['item1','item2','item3']
for i in unit_cols:
    df.loc[('df.'+i+'_existing'=='NO') & ('df'+i+'_sold'=='YES'),'unit_'+i]=1
but of course, it doesn't work. As I said, I am able to make it work with the initial example, but I would like to have fewer lines of code instead of repeating the same code, because I need to create several columns this way, not just three. Is there a way to make this easier? Is the for loop the best option? Thank you.
You can use Boolean series, i.e. True / False depending on whether your condition is met. Coupled with pd.Series.eq and f-strings (PEP498, Python 3.6+), and using __getitem__ (or its syntactic sugar []) to allow string inputs, you can write your logic more readably:
unit_cols = ['item1','item2','item3']
for i in unit_cols:
    df[f'unit_{i}'] = df[f'{i}_existing'].eq('NO') & df[f'{i}_sold'].eq('YES')
If you need integers (1 / 0) instead of Boolean values, you can convert via astype:
df[f'unit_{i}'] = df[f'unit_{i}'].astype(int)
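As a minimal sketch with made-up data (only item1 shown):

import pandas as pd

df = pd.DataFrame({
    'item1_existing': ['NO', 'YES', 'NO'],
    'item1_sold':     ['YES', 'YES', 'NO'],
})

for i in ['item1']:
    df[f'unit_{i}'] = (df[f'{i}_existing'].eq('NO')
                       & df[f'{i}_sold'].eq('YES')).astype(int)
print(df)
#   item1_existing item1_sold  unit_item1
# 0             NO        YES           1
# 1            YES        YES           0
# 2             NO         NO           0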
Please, is there any way to replace "x-y" with "x,x+1,x+2,...,y" in every row of a data frame? (Here x and y are integers.)
For example, I want to replace every row like this:
"1-3,7" by "1,2,3,7"
"1,4,6-9,11-13,5" by "1,4,6,7,8,9,11,12,13,5"
etc
I know that we can do that by looping through the lines and using regular expressions, but the table is quite big and it takes quite some time, so I think using pandas might be faster.
Thanks a lot
In pandas you can use apply to apply any function to either rows or columns in a DataFrame. The function can be passed with a lambda, or defined separately.
(side-remark: your example does not entirely make clear if you actually have a 2-D DataFrame or just a 1-D Series. Either way, apply can be used)
The next step is to find the right function. Here's a rough version (without regular expressions):
def make_list(s):
    lst = s.split(',')
    newlst = []
    for i in lst:
        if "-" in i:
            # "x-y" expands to x, x+1, ..., y; range() excludes the end, hence +1
            start, end = (int(j) for j in i.split("-"))
            newlst.extend(range(start, end + 1))
        else:
            newlst.append(int(i))
    return newlst
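You can then apply it with pd.Series.apply; for example, using the strings from the question:

import pandas as pd

s = pd.Series(["1-3,7", "1,4,6-9,11-13,5"])
print(s.apply(make_list).tolist())
# [[1, 2, 3, 7], [1, 4, 6, 7, 8, 9, 11, 12, 13, 5]]

# If the result should stay a comma-separated string:
print(s.apply(lambda v: ",".join(map(str, make_list(v)))).tolist())
# ['1,2,3,7', '1,4,6,7,8,9,11,12,13,5']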
I wasn't sure if there was any good way of doing this. But I thought I'd give stackoverflow a try :)
I have a list/array with integers, and a second array also with integers. I want to find the max value from the first list, but the value can not be in the second array.
Is there any "fancy" way in python to put this down to one expression?
max_value = max(firstArray) that is not in secondArray
Use sets to get the values in firstArray that are not in secondArray:
max_value = max(set(firstArray) - set(secondArray))
Here's one way:
max_value = [x for x in sorted(first, reverse=True) if x not in second][0]
It's less efficient than sorting and then using a for loop (which could stop at the first element not in the second array), but it fits on one line nicely!
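Both one-liners on some made-up values:

firstArray = [3, 9, 7, 5]
secondArray = [9, 4]

print(max(set(firstArray) - set(secondArray)))  # 7

print([x for x in sorted(firstArray, reverse=True)
       if x not in secondArray][0])             # 7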