Compare column value to an int in dataframe with python - python

I have a dataframe dframe and I want to iterate over the rows of my column number. If number < 5, then my column STATE takes the value 'vv'.
If number is in the range [5..17], STATE takes 'xx'.
Otherwise STATE takes 'yy'.
I wrote this code but it doesn't work.
Any help, please?
Thank you.
for it in dframe['number']:
    if (it < 5):
        dframe['STATE'] = 'vv'
    elif (it >= 5 & it < 17):
        dframe['STATE'] = 'xx'
    else:
        dframe['STATE'] = 'yy'

You can do this in one line with a list comprehension:
dframe['STATE'] = ['vv' if (i < 5) else 'xx' if (i >= 5) & (i < 17) else 'yy' for i in dframe['number']]

The code below should work for you.
dframe['STATE'] = ''
dframe.loc[dframe['number'] < 5, 'STATE'] = 'vv'
dframe.loc[(dframe['number'] >= 5) & (dframe['number'] < 17), 'STATE'] = 'xx'
dframe.loc[dframe['STATE'] == '', 'STATE'] = 'yy'
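For reference, the loop in the question fails for two reasons: dframe['STATE'] = 'vv' assigns to the whole column on every iteration, and & binds more tightly than the comparison operators, so it >= 5 & it < 17 is parsed as it >= (5 & it) < 17. A vectorized alternative is numpy.select. The sketch below is a minimal one: the sample numbers are made up, and it assumes 17 itself falls into the 'yy' group, as in the answers above.

```python
import numpy as np
import pandas as pd

# Sample data is made up for illustration.
dframe = pd.DataFrame({"number": [2, 5, 10, 17, 30]})

# Conditions are checked in order; the first match wins.
conditions = [
    dframe["number"] < 5,
    (dframe["number"] >= 5) & (dframe["number"] < 17),
]
choices = ["vv", "xx"]

# Rows matching no condition get the default value.
dframe["STATE"] = np.select(conditions, choices, default="yy")
print(dframe)
```

Each comparison is wrapped in parentheses before combining with &, which avoids the precedence problem from the original loop.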

Related

Problem with for-if loop statement operation on pandas dataframe

I have a dataset in which I want to create a new column based on the division of two other columns, using a for-loop with if-conditions.
This is the dataset, with the empty 'solo_fare' column created beforehand.
The task is to loop through each row and divide 'Fare' by 'relatives' to get the per-passenger fare. However, there are certain if-conditions to follow (passengers in this category should see per-passenger prices of between 3 and 8)
The code I have tried here doesn't seem to fill in the 'solo_fare' rows at all. It returns an empty column (same as above df).
for i in range(0, len(fare_result)):
    p = fare_result.iloc[i]['Fare']/fare_result.iloc[i]['relatives']
    q = fare_result.iloc[i]['Fare']
    r = fare_result.iloc[i]['relatives']
    # if relatives == 0, return original Fare amount
    if (r == 0):
        fare_result.iloc[i]['solo_fare'] = q
    # if the divided fare is below 3 or more than 8, return original Fare amount again
    elif (p < 3) or (p > 8):
        fare_result.iloc[i]['solo_fare'] = q
    # else, return the divided fare to get solo_fare
    else:
        fare_result.iloc[i]['solo_fare'] = p
How can I get this to work?
You should probably not use a loop for this, but instead just use loc.
If you first create the 'solo_fare' column and give every row the default value from Fare, you can then change the value for the rows that meet the conditions you have set out:
fare_result['solo_fare'] = fare_result['Fare']
fare_result.loc[
    ((fare_result.Fare / fare_result.relatives) >= 3)
    & ((fare_result.Fare / fare_result.relatives) <= 8),
    'solo_fare'] = fare_result.Fare / fare_result.relatives
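The same idea can be written with a single mask. This is a minimal sketch with made-up sample rows; it relies on pandas returning inf (not raising) when a float column is divided by zero, so rows with relatives == 0 simply fail the range check and keep the original Fare.

```python
import pandas as pd

# Sample rows are made up for illustration.
fare_result = pd.DataFrame({
    "Fare": [10.0, 20.0, 7.0, 30.0],
    "relatives": [2, 0, 7, 5],
})

# Per-passenger fare; division by zero gives inf here rather than an error.
ratio = fare_result["Fare"] / fare_result["relatives"]

# True where the per-passenger fare lies in [3, 8] (both ends inclusive).
in_range = ratio.between(3, 8)

# Keep Fare where the ratio is out of range, otherwise use the ratio.
fare_result["solo_fare"] = fare_result["Fare"].where(~in_range, ratio)
print(fare_result)
```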
Did you try to initialize those new columns first?
By that I mean that the statement fare_result.iloc[i]['solo_fare'] = q
only means that you are assigning the value q to the field solo_fare of the line i.
The issue there is that, at this moment, the line i does not have any solo_fare key. Hence, you are only filling the last value of your table here.
To solve this issue, try declaring the solo_fare column before the for loop (this needs import numpy as np), like:
fare_result['solo_fare'] = np.nan
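A further point, offered as a likely cause rather than a confirmed diagnosis: fare_result.iloc[i]['solo_fare'] = q is chained indexing. iloc[i] can return a temporary copy of the row, so the assignment may never reach the DataFrame. If a loop is really required, writing through a single indexer such as .at avoids this. The sketch below uses made-up sample rows.

```python
import pandas as pd

# Sample rows are made up for illustration.
fare_result = pd.DataFrame({
    "Fare": [10.0, 20.0, 7.0, 30.0],
    "relatives": [2, 0, 7, 5],
})
fare_result["solo_fare"] = float("nan")

for idx in fare_result.index:
    q = fare_result.at[idx, "Fare"]
    r = fare_result.at[idx, "relatives"]
    if r == 0:
        # No relatives: keep the original fare.
        fare_result.at[idx, "solo_fare"] = q
    else:
        p = q / r
        # Use the divided fare only when it lies between 3 and 8.
        fare_result.at[idx, "solo_fare"] = p if 3 <= p <= 8 else q

print(fare_result)
```

Checking r == 0 before dividing also avoids computing a division by zero in the first place.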
One way to do it is to define a row-wise function and apply it to the dataframe:
# row-wise function (mockup)
def foo(fare, relative):
    # your logic here. Mine just serves as example
    if relative > 100:
        res = fare/relative
    elif (relative < 10):
        res = fare
    else:
        res = 10
    return res
Then apply it to the dataframe (row-wise):
fare_result['solo_fare'] = fare_result.apply(lambda row: foo(row['Fare'], row['relatives']), axis=1)

Finding the smallest number not smaller than x for a rolling data of a dataframe in python

Suppose I have data in rows for a column (O): 1,2,3,4,5,6,7,8,9,10.
Its average is 5.5.
I need to find the smallest number which is larger than the average 5.5, i.e. 6.
Here is what I have tried so far.
method 1:
df["test1"] = df["O"].shift().rolling(min_periods=1, window=10).apply(lambda x: pd.Series(x).nlargest(5).iloc[-1])
Discarded as that number may not always be the 6th number.
method 2:
great = []
df['test1'] = ''
df["avg"] = df["O"].shift().rolling(min_periods=1, window=10).mean()
for i in range(1, len(df)):
    for j in range(0, 10):
        if(df.loc[i-j, 'O'] > df.loc[i, 'avg']):
            great.append(df.loc[i-j, 'O'])
    df.loc[i, 'test1'] = min(great)
This throws an error:
KeyError: -1
Please help to find the small error in the code as soon as possible.
Thanks,
D.
Mask the Series when it is greater than the mean, then sort, then take the first row.
import pandas as pd
df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10], columns=("vals",))
df[df.vals > df.vals.mean()].sort_values("vals").head(1)
# >    vals
# 5       6
For the rolling case, try:
n = 10
output = df.vals.rolling(n).apply(lambda x: x[x > x.mean()].min())
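The KeyError: -1 in method 2 most likely comes from df.loc[i-j, 'O'] once i - j drops below 0, since loc looks up labels rather than positions (this assumes the default integer index). A rolling apply sidesteps the manual indexing. The sketch below uses the column name O from the question; the helper name smallest_above_mean is made up, and the shift() from the question is left out to keep the example short.

```python
import pandas as pd

df = pd.DataFrame({"O": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

def smallest_above_mean(window):
    # window is a pandas Series because raw=False is passed below.
    above = window[window > window.mean()]
    # min() of an empty Series is NaN, e.g. for a one-element window.
    return above.min()

df["test1"] = (
    df["O"]
    .rolling(window=10, min_periods=1)
    .apply(smallest_above_mean, raw=False)
)
print(df)
```

With min_periods=1 the early rows are computed on shorter windows; the last row sees all ten values, whose mean is 5.5.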

Sum using loop in python

I have a dataframe and specifically need to find the min, max, and average of the age column using loops. However, try as I might, it just did not work. I hope somebody can help me see where the problem is. Thank you.
Here is my code
total_age = 0
i=0
max_amount = adult_data["age"][i]
min_amount = adult_data["age"][i]
for i in range(len(adult_data["age"])):
    total_age = adult_data["age"][i] + total_age,
    i = i + 1,
    if adult_data["age"][i] > max_amount:
        max_amount = adult_data["age"][i],
    if adult_data["age"][i] < min_amount:
        min_amount = adult_data["age"][i],
print(total_age)
The error I am currently getting is
ValueError: key of type tuple not found and not a MultiIndex
The commas at the end of statements indicate tuples. For example, i = i + 1, is the same as i = (i + 1,), where (i + 1,) is a tuple with one element.
So, your code is essentially the same as:
for i in range(len(adult_data["age"])):
    total_age = (adult_data["age"][i] + total_age,)
    i = (i + 1,)
    if adult_data["age"][i] > max_amount:
        max_amount = (adult_data["age"][i],)
    if adult_data["age"][i] < min_amount:
        min_amount = (adult_data["age"][i],)
That's a lot of tuples!
In other words, you don't need the commas. You also don't need i = i + 1 because the range automatically increments i.
Try:
adult_data["age"].agg(['min','max','mean'])
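If the loop itself is a requirement, a corrected version could look like the sketch below. The sample ages are made up; the variable names follow the question, and average_age is an added name.

```python
import pandas as pd

# Sample ages are made up for illustration.
adult_data = pd.DataFrame({"age": [39, 50, 38, 53, 28]})
ages = adult_data["age"]

total_age = 0
max_amount = ages.iloc[0]
min_amount = ages.iloc[0]

# Iterate over the values directly; no manual counter and no trailing commas.
for age in ages:
    total_age += age
    if age > max_amount:
        max_amount = age
    if age < min_amount:
        min_amount = age

average_age = total_age / len(ages)
print(min_amount, max_amount, average_age)
```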

Python Pandas: Find a pattern in a DataFrame

I have the following DataFrame (1.2 million rows):
df_test_2 = pd.DataFrame({"A":["end","beginn","end","end","beginn","beginn","end","end","end","beginn","end"],"B":[1,10,50,60,70,80,90,100,110,111,112]})
Now I am trying to find sequences. Each "beginn" should match the first "end" where the distance based on column B is at least 40.
For the provided Dataframe that would mean:
The sould problem is that
Your help is highly appreciated.
I will assume that, as your output, you want a list of sequences with the starting and ending value. The second sequence that you identify in your picture has a distance lower than 40, so I also assumed that that was an error.
import pandas as pd
from collections import namedtuple
df_test_2 = pd.DataFrame({"A":["end","beginn","end","end","beginn","beginn","end","end","end","beginn","end"],"B":[1,10,50,60,70,80,90,100,110,111,112]})
sequence_list = []
Sequence = namedtuple('Sequence', ['beginn', 'end'])
beginn_flag = False
beginn_value = 0
for i, row in df_test_2.iterrows():
    state = row['A']
    value = row['B']
    if not beginn_flag and state == 'beginn':
        # remember the first open "beginn"
        beginn_flag = True
        beginn_value = value
    elif beginn_flag and state == 'end':
        # close the sequence only when the distance is at least 40
        if value >= beginn_value + 40:
            new_seq = Sequence(beginn_value, value)
            sequence_list.append(new_seq)
            beginn_flag = False
print(sequence_list)
This code outputs the following:
[Sequence(beginn=10, end=50), Sequence(beginn=70, end=110)]
Two sequences, one starting at 10 and ending at 50 and the other one starting at 70 and ending at 110.
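With 1.2 million rows, iterrows may be slow because it builds a Series for every row. The sketch below keeps the same matching logic but iterates over the two columns with zip; the function name find_sequences and the min_distance parameter are made up for illustration, and plain tuples stand in for the namedtuple.

```python
import pandas as pd

def find_sequences(df, min_distance=40):
    """Pair each open 'beginn' with the first 'end' at least min_distance away."""
    sequences = []
    beginn_value = None  # None means no "beginn" is currently open
    for state, value in zip(df["A"], df["B"]):
        if beginn_value is None and state == "beginn":
            beginn_value = value
        elif beginn_value is not None and state == "end":
            if value >= beginn_value + min_distance:
                sequences.append((beginn_value, value))
                beginn_value = None
    return sequences

df_test_2 = pd.DataFrame({
    "A": ["end", "beginn", "end", "end", "beginn", "beginn",
          "end", "end", "end", "beginn", "end"],
    "B": [1, 10, 50, 60, 70, 80, 90, 100, 110, 111, 112],
})
print(find_sequences(df_test_2))
```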

python, operation on big pandas Dataframe

I have a pandas DataFrame named joined with 5 fields:
product | price | percentil_25 | percentil_50 | percentil_75
For each row I want to classify the price like this:
if the price is below percentil_25, I give this product class 1, and so on.
So what I did is:
classe_final = OrderedDict()
classe_final['sku'] = []
classe_final['class'] = []
for index in range(len(joined)):
    classe_final['sku'].append(joined.values[index][0])
    if(float(joined.values[index][1]) <= float(joined.values[index][2])):
        classe_final['class'].append(1)
    elif(float(joined.values[index][2]) < float(joined.values[index][1]) and float(joined.values[index][1]) <= float(joined.values[index][3])):
        classe_final['class'].append(2)
    elif(float(joined.values[index][3]) < float(joined.values[index][1]) and float(joined.values[index][1]) <= float(joined.values[index][4])):
        classe_final['class'].append(3)
    else:
        classe_final['class'].append(4)
But as my DataFrame is quite big it's taking forever.
Do you have any idea how I could do this quicker?
# build an empty df
df = pd.DataFrame()
# get a list of the unique products, could skip this perhaps
df['Product'] = other_df['Sku'].unique()
There are 2 ways. The first is to define a function and call apply:
# class is a reserved word in Python, so the function needs another name
def price_class(x):
    if x.price < x.percentil_25:
        return 1
    elif x.price >= x.percentil_25 and x.price < x.percentil_50:
        return 2
    elif x.price >= x.percentil_50 and x.price < x.percentil_75:
        return 3
    elif x.price >= x.percentil_75:
        return 4
df['class'] = other_df.apply(lambda row: price_class(row), axis=1)
Another way, which I think is better and will be much faster: add the 'class' column to your existing df using loc, and then just take a view of the 2 columns of interest:
joined.loc[joined['price'] < joined['percentil_25'], 'class'] = 1
joined.loc[(joined['price'] >= joined['percentil_25']) & (joined['price'] < joined['percentil_50']), 'class'] = 2
joined.loc[(joined['price'] >= joined['percentil_50']) & (joined['price'] < joined['percentil_75']), 'class'] = 3
joined.loc[joined['price'] >= joined['percentil_75'], 'class'] = 4
classe_final = joined[['product', 'class']]
Just for kicks, you could use a load of nested np.where conditions:
classe_final['class'] = np.where(
    joined['price'] > joined['percentil_75'], 4,
    np.where(joined['price'] > joined['percentil_50'], 3,
             np.where(joined['price'] > joined['percentil_25'], 2, 1)))
This evaluates whether the price is greater than percentil_75; if so, the class is 4, otherwise it evaluates the next condition, and so on. It may be worth timing this against loc, but it is a lot less readable.
Another solution; if someone asked me to bet which one is the fastest, I'd go for this:
joined.set_index("product").eval(
    "1 * (price >= percentil_25)"
    " + (price >= percentil_50)"
    " + (price >= percentil_75)"
)
Note that this expression yields values from 0 to 3, so add 1 to get the class labels 1 to 4.
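The same counting idea can be written with plain pandas operations, without eval. Each comparison contributes 1 once the price reaches that percentile, and starting from 1 gives labels 1 to 4. This is a minimal sketch with made-up sample rows; the column names follow the question.

```python
import pandas as pd

# Sample rows are made up for illustration.
joined = pd.DataFrame({
    "product": ["a", "b", "c", "d"],
    "price": [1.0, 3.0, 6.0, 10.0],
    "percentil_25": [2.0, 2.0, 2.0, 2.0],
    "percentil_50": [5.0, 5.0, 5.0, 5.0],
    "percentil_75": [8.0, 8.0, 8.0, 8.0],
})

# Count how many percentile thresholds each price has reached, then add 1.
joined["class"] = (
    1
    + (joined["price"] >= joined["percentil_25"]).astype(int)
    + (joined["price"] >= joined["percentil_50"]).astype(int)
    + (joined["price"] >= joined["percentil_75"]).astype(int)
)

classe_final = joined[["product", "class"]]
print(classe_final)
```

Note that this puts a price equal to percentil_25 in class 2, matching the loc answer rather than the <= comparisons in the question's loop.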
