How to create a pandas DataFrame for the conditions below - Python

import pandas as pd
temp = [79, 80, 81, 80, 80, 79, 76, 75, 76, 78, 80, 81]
for i in range(len(temp)):
    if temp[i] <= 80:
        level = 0
    elif temp[i] > 80 and temp[i] <= 100:
        level = 1
    elif temp[i] <= 75:
        n_level = 0
    elif temp[i] > 75 and temp[i] <= 95:
        n_level = 1
    df = pd.DataFrame([[temp[i], level]], columns=['temp1', 'level1', 'newlevel'])
    print(df)
Unable to get the output, which is expected to look like this:
temp  level  newlevel
  79      0         0
  80      0         0
  81      1         1

Please find the solution below. I have used list comprehensions to create the data required to build the DataFrame; change the conditions inside the list comprehensions to meet your needs. Code and output:
import pandas as pd
import numpy as np
temp=[79,80,81,80,80,79,76,75,76,78,80,81]
level=[0 if i<=80 else 1 for i in temp]
n_level=[0 if i<=75 else 1 for i in temp]
#print(temp)
#print(level)
#print(n_level)
df=pd.DataFrame({'temp':temp,'level':level,'newlevel':n_level})
df

You can generate your DataFrame using solely Pandas methods.
To generate both "level" columns, it is enough to use cut:
df = pd.DataFrame({'temp': temp,
                   'level1': pd.cut(temp, [0, 80, 1000], labels=[0, 1]),
                   'newlevel': pd.cut(temp, [0, 75, 1000], labels=[0, 1])})
Note: Both "level" columns of the output DataFrame have category type.
If you are unhappy about that, cast them to int:
df = pd.DataFrame({'temp': temp,
                   'level1': pd.cut(temp, [0, 80, 1000], labels=[0, 1]).astype(int),
                   'newlevel': pd.cut(temp, [0, 75, 1000], labels=[0, 1]).astype(int)})
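As a quick sanity check (a sketch reusing the temp list from the question), the cast columns agree with plain boolean thresholds:

```python
import pandas as pd

temp = [79, 80, 81, 80, 80, 79, 76, 75, 76, 78, 80, 81]

df = pd.DataFrame({'temp': temp,
                   'level1': pd.cut(temp, [0, 80, 1000], labels=[0, 1]).astype(int),
                   'newlevel': pd.cut(temp, [0, 75, 1000], labels=[0, 1]).astype(int)})

# The same thresholds written as plain boolean comparisons agree with cut,
# because cut's intervals are right-closed: (0, 80] -> 0, (80, 1000] -> 1.
assert df['level1'].tolist() == [int(t > 80) for t in temp]
assert df['newlevel'].tolist() == [int(t > 75) for t in temp]
```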


Null values present but not detected

In the following code, the column df2['age_groups'] definitely has one null value.
I am trying to append this null value to a list, but the list turns out to be empty.
Why am I running into this problem?
import numpy as np
import pandas as pd

np.random.seed(100)
df1 = pd.DataFrame({"ages": np.random.randint(10, 80, 20)})
df2 = pd.DataFrame({"ages": np.random.randint(10, 80, 20)})
df1['age_groups'], bins = pd.cut(df1["ages"], 4, retbins=True)
df2['age_groups'] = pd.cut(df2["ages"], bins=bins)
null_list = []
for i in df2['age_groups']:
    if i == float('nan'):
        null_list.append(i)
print(null_list)  # empty list
print(df2['age_groups'].isna().sum())  # it shows that there is one null value
and
type(i) == float('nan')
generates the same outcome
To test for missing values (NaN and None, which pandas obviously processes the same way), use pd.isna, not is and not ==:
null_list = []
for i in df2['age_groups']:
    if pd.isna(i):
        null_list.append(i)
You only need to fix the if condition. See also this SO question.
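A short, standalone illustration of why the == comparison can never match: NaN compares unequal to everything, including itself, while pd.isna handles both NaN and None.

```python
import numpy as np
import pandas as pd

x = float('nan')
print(x == x)          # False: NaN is never equal to anything, even itself
print(x is np.nan)     # False: float('nan') is a distinct object from np.nan
print(pd.isna(x))      # True
print(pd.isna(None))   # True: pd.isna also covers None
```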
Try this:
np.random.seed(100)
df1 = pd.DataFrame({"ages": np.random.randint(10, 80, 20)})
df2 = pd.DataFrame({"ages": np.random.randint(10, 80, 20)})
df1['age_groups'], bins = pd.cut(df1["ages"], 4, retbins=True)
df2['age_groups'] = pd.cut(df2["ages"], bins=bins)
null_list = []
for i in df2['age_groups']:
    if i is np.nan:  # <- code changed here to np.nan
        null_list.append(i)
print(null_list)
print(df2['age_groups'].isna().sum())
Output:
[nan]
1

Slice DataFrame at specific points and plot each slice

I am new to programming and Python; could you help me?
I have a DataFrame which looks like this:
d = {'time': [4, 10, 15, 6, 0, 20, 40, 11, 9, 12, 11, 25],
     'value': [0, 0, 0, 50, 100, 0, 0, 70, 100, 0, 100, 20]}
df = pd.DataFrame(data=d)
I want to slice the data whenever value == 100 and then plot all slices in one figure.
So my questions are: how do I slice or cut the data as described, and what is the best structure for saving the slices in order to plot them?
Note 1: the value column has no frequency I can use; it varies from 0 to 100, and time is arbitrary.
Note 2: I already tried this solution but I get the same table
decreased_value = df[df['value'] <= 100][['time', 'value']].reset_index(drop=True)
How can I slice one column in a dataframe to several series based on a condition
Thanks in advance!
EDIT:
Here's a simpler way of handling my first answer (thanks to @aneroid for the suggestion).
Get the indices where value == 100 and add 1 so that these land at the bottom of each slice:
indices = df.index[df['value'] == 100] + 1
Then use numpy.split (thanks to this answer for that method) to make a list of dataframes:
df_list = np.split(df, indices)
Then do your plotting for each slice in a for loop:
for df in df_list:
    ...  # plot based on df here
VERBOSE / FROM SCRATCH METHOD:
You can get the indices for where value==100 like this:
indices = df.index[df.value==100]
Then add the smallest and largest indices in order to not leave out the beginning and end of the df:
indices = indices.insert(0, 0).to_list()
indices.append(df.index[-1] + 1)
Then cycle through a while loop to cut up the dataframe and put each slice into a list of dataframes:
i = 0
df_list = []
while i + 1 < len(indices):
    df_list.append(df.iloc[indices[i]:indices[i+1]])
    i += 1
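The same split can also be written without numpy.split, slicing with iloc between consecutive boundary indices (a sketch using the question's data; the s < e guard skips an empty trailing slice in case the last row itself is a 100):

```python
import pandas as pd

d = {'time': [4, 10, 15, 6, 0, 20, 40, 11, 9, 12, 11, 25],
     'value': [0, 0, 0, 50, 100, 0, 0, 70, 100, 0, 100, 20]}
df = pd.DataFrame(data=d)

# Cut one row after every value == 100
bounds = [0] + (df.index[df['value'] == 100] + 1).tolist() + [len(df)]
df_list = [df.iloc[s:e] for s, e in zip(bounds, bounds[1:]) if s < e]

print([len(part) for part in df_list])  # [5, 4, 2, 1]
```

Every slice except possibly the last ends on a row where value == 100, so each can be plotted in its own loop iteration.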
I already solved the problem using a for loop, which can slice and plot at the same time without using the np.split function, and also maintains the data structure.
Thanks to the previous answer by @k_n_c, which helped me improve it.
import matplotlib.pyplot as plt
import numpy as np

slices = df.index[df['value'] == 100]
slices = slices + 1
slices = np.insert(slices, 0, 0, axis=0)
slices = np.append(slices, df.index[-1] + 1)
prev_ind = 0
for ind in slices:
    temp = df.iloc[prev_ind:ind, :]
    plt.plot(temp.time, temp.value)
    prev_ind = ind
plt.show()

Conditional creation of a Dataframe column, where the calculation of the column values change based on row input

I have a very long and wide DataFrame. I'd like to create a new column in it, where the value depends on many other columns in the df. The calculation needed for the values in this new column ALSO changes, depending on a value in some other column.
The answers to this question and this question come close, but don't quite work out for me.
I'll eventually have about 30 different calculations that could be applied, so I'm not too keen on the np.where function, which is not that readable with too many conditions.
I've also been strongly advised against using a for loop over all rows in a DataFrame, because it's supposed to be awful for performance (please correct me if I'm wrong there).
What I've tried to do instead:
import pandas as pd
import numpy as np

df = pd.DataFrame()
# Information in my columns looks something like this:
df['text'] = ['dab', 'def', 'bla', 'zdag', 'etc']
df['values1'] = [3, 4, 2, 5, 2]
df['values2'] = [6, 3, 21, 44, 22]
df['values3'] = [103, 444, 33, 425, 200]

# lists to check against to decide which calculation is required
someList = ['dab', 'bla']
someOtherList = ['def', 'zdag']
someThirdList = ['etc']

conditions = [
    (df['text'] is None),
    (df['text'] in someList),
    (df['text'] in someOtherList),
    (df['text'] in someThirdList)]
choices = [0,
           round(df['values2'] * 0.5 * df['values3'], 2),
           df['values1'] + df['values2'] - df['values3'],
           df['values1'] + 249]
df['mynewvalue'] = np.select(conditions, choices, default=0)
print(df)
I expect that, based on the row values in df['text'], the right calculation is applied to the same row's value of df['mynewvalue'].
Instead, I get the error The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
How can I program this instead, so that I can use these kind of conditions to define the right calculation for this df['mynewvalue'] column?
The error comes from the conditions:
conditions = [
    ... ,
    (df['text'] in someList),
    (df['text'] in someOtherList),
    (df['text'] in someThirdList)]
You are asking whether a whole Series of elements is in a list, and the answer would be one boolean per element. As the error suggests, you have to decide whether the condition is verified when at least one element satisfies the property (any) or when all the elements satisfy it (all).
One solution is to use isin (doc) or all (doc) for pandas DataFrames.
Here, using isin:
import pandas as pd
import numpy as np

# Information in my columns looks something like this:
df = pd.DataFrame()
df['text'] = ['dab', 'def', 'bla', 'zdag', 'etc']
df['values1'] = [3, 4, 2, 5, 2]
df['values2'] = [6, 3, 21, 44, 22]
df['values3'] = [103, 444, 33, 425, 200]

# lists to check against to decide which calculation is required
someList = ['dab', 'bla']
someOtherList = ['def', 'zdag']
someThirdList = ['etc']

conditions = [
    (df['text'] is None),
    (df['text'].isin(someList)),
    (df['text'].isin(someOtherList)),
    (df['text'].isin(someThirdList))]
choices = [0,
           round(df['values2'] * 0.5 * df['values3'], 2),
           df['values1'] + df['values2'] - df['values3'],
           df['values1'] + 249]
df['mynewvalue'] = np.select(conditions, choices, default=0)
print(df)
#    text  values1  values2  values3  mynewvalue
# 0   dab        3        6      103       309.0
# 1   def        4        3      444      -437.0
# 2   bla        2       21       33       346.5
# 3  zdag        5       44      425      -376.0
# 4   etc        2       22      200       251.0

pandas reset cumsum when the previous value is negative

I need to perform a cumulative sum on a grouped data frame, but the sum needs to reset whenever the previous value is negative and the current value is positive.
In R I could apply a condition to the groupby with the ave() function, but I can't do that in Python, so I am having a bit of trouble thinking of a solution. Can anyone help me out?
Here is a sample:
import pandas as pd

df = pd.DataFrame({'PRODUCT': ['A'] * 40, 'GROUP': ['1'] * 40, 'FORECAST': [100, -40, -40, -40] * 10})
df['CS'] = df.groupby(['GROUP', 'PRODUCT']).FORECAST.cumsum()
# Reset cumsum if
# condition: (df.FORECAST > 0) & (df.groupby(['GROUP', 'PRODUCT']).FORECAST.shift(-1).fillna(0) <= 0)
This solution resets the sum whenever the values being summed change from negative to positive (regardless of whether the dataset is as nicely periodic as it is in your example):
import numpy as np
import pandas as pd

df = pd.DataFrame({'PRODUCT': ['A'] * 40, 'GROUP': ['1'] * 40, 'FORECAST': [100, -40, -40, -40] * 10})
cumsum = np.cumsum(df['FORECAST'])
# Array of indices where the sum should be reset
reset_ind = np.where(df['FORECAST'].diff() > 0)[0]
# Sums that need to be subtracted at resets
subs = cumsum[reset_ind - 1].values
# Repeat subtraction values for every entry BETWEEN resets, and for values after the final reset
rep_subs = np.repeat(subs, np.hstack([np.diff(reset_ind), df['FORECAST'].size - reset_ind[-1]]))
# Stack together the values before the first reset and the reset sums
df['CS'] = np.hstack([cumsum[:reset_ind[0]], cumsum[reset_ind[0]:] - rep_subs])
Alternatively, based on this solution to a similar question (and my realisation of the usefulness of groupby)
import pandas as pd
import numpy as np

df = pd.DataFrame({'PRODUCT': ['A'] * 40, 'GROUP': ['1'] * 40, 'FORECAST': [100, -40, -40, -40] * 10})
# Create indices to group sums together
df['cumsum'] = (df['FORECAST'].diff() > 0).cumsum()
# Perform the group-wise cumsum
df['CS'] = df.groupby(['cumsum'])['FORECAST'].cumsum()
# Remove the intermediary cumsum column
df = df.drop(['cumsum'], axis=1)
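For the sample data above, the reset happens every four rows, so the grouped cumsum is easy to verify. (Note that diff() > 0 flags any upward jump, which in this data coincides exactly with the negative-to-positive transition.)

```python
import pandas as pd

df = pd.DataFrame({'PRODUCT': ['A'] * 40, 'GROUP': ['1'] * 40,
                   'FORECAST': [100, -40, -40, -40] * 10})

# A new group id starts at every reset point (upward jump in FORECAST)
group_id = (df['FORECAST'].diff() > 0).cumsum()
df['CS'] = df.groupby(group_id)['FORECAST'].cumsum()

print(df['CS'].head(8).tolist())  # [100, 60, 20, -20, 100, 60, 20, -20]
```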

Transforming pandas Dataframe into dictionary via function taking column inputs

I have the following pandas DataFrame:
dict1 = {'file': ['filename2', 'filename2', 'filename3', 'filename4', 'filename4', 'filename3'],
         'amount': [3, 4, 5, 1, 2, 1],
         'front': [21889611, 36357723, 196312, 11, 42, 1992],
         'back': [21973805, 36403870, 277500, 19, 120, 3210]}
df1 = pd.DataFrame(dict1)
print(df1)
file amount front back
0 filename2 3 21889611 21973805
1 filename2 4 36357723 36403870
2 filename3 5 196312 277500
3 filename4 1 11 19
4 filename4 2 42 120
5 filename3 1 1992 3210
My task is to take N random draws between front and back, where N is equal to the value in amount, and parse the results into a dictionary.
Doing this on a row-by-row basis is easy for me to understand:
import numpy as np

# e.g. row 1
random_draws = np.random.choice(np.arange(21889611, 21973805 + 1), 3)
# e.g. row 2
random_draws = np.random.choice(np.arange(36357723, 36403870 + 1), 4)
Normally with pandas, users could define this as a function and use something like
def func(front, back, amount):
    return np.random.choice(np.arange(front, back + 1), amount)

df["new_column"].apply(func)
but the result of my function is an array of varying size.
My second problem is that I would like the output to be a dictionary, of the format
{file: [random_draw_results], file: [random_draw_results], file: [random_draw_results], ...}
For the above example df1, the function should output this dictionary (given the draws):
final_dict = {"filename2": [21927457, 21966814, 21898538, 36392840, 36375560, 36384078, 36366833],
              "filename3": [212143, 239725, 240959, 197359, 276948, 3199],
              "filename4": [100, 83, 15]}
We can pass axis=1 to apply so it operates over rows.
We then tell the function which columns to use and return a list.
Finally, we either perform some form of groupby, or we use a defaultdict as shown below:
import numpy as np
import pandas as pd

dict1 = {'file': ['filename2', 'filename2', 'filename3', 'filename4', 'filename4', 'filename3'],
         'amount': [3, 4, 5, 1, 2, 1],
         'front': [21889611, 36357723, 196312, 11, 42, 1992],
         'back': [21973805, 36403870, 277500, 19, 120, 3210]}
df1 = pd.DataFrame(dict1)

def func(x):
    return np.random.choice(np.arange(x.front, x.back + 1), x.amount).tolist()

df1["new_column"] = df1.apply(func, axis=1)
df1.groupby('file')['new_column'].apply(sum).to_dict()
Returns:
{'filename2': [21891765,
21904680,
21914414,
36398355,
36358161,
36387670,
36369443],
'filename3': [240766, 217580, 217581, 274396, 241413, 2488],
'filename4': [18, 96, 107]}
A second alternative is to use a defaultdict (and from some small timings I ran, it looks like it runs just as fast):
from collections import defaultdict

d = defaultdict(list)
for k, v in df1.set_index('file')['new_column'].items():
    d[k].extend(v)
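Putting the two pieces together, here is a self-contained sketch with a fixed seed (the seed value is an arbitrary choice, used only so the run is reproducible); each file ends up with as many draws as the sum of its amount rows:

```python
import numpy as np
import pandas as pd
from collections import defaultdict

np.random.seed(0)  # arbitrary seed, only for reproducibility

dict1 = {'file': ['filename2', 'filename2', 'filename3', 'filename4', 'filename4', 'filename3'],
         'amount': [3, 4, 5, 1, 2, 1],
         'front': [21889611, 36357723, 196312, 11, 42, 1992],
         'back': [21973805, 36403870, 277500, 19, 120, 3210]}
df1 = pd.DataFrame(dict1)

# Draw 'amount' values uniformly from [front, back] for each row
df1['new_column'] = df1.apply(
    lambda x: np.random.choice(np.arange(x.front, x.back + 1), x.amount).tolist(), axis=1)

# Collect draws per file across its rows
d = defaultdict(list)
for k, v in df1.set_index('file')['new_column'].items():
    d[k].extend(v)

# Each file gets as many draws as the sum of its 'amount' values
print({k: len(v) for k, v in d.items()})  # {'filename2': 7, 'filename3': 6, 'filename4': 3}
```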
