Python Pandas Get Values According to If/Else

My input dataframe:

Order  Need  WarehouseStock  StoreStock
1      3     74              5
0      4     44              44
0      0     44              44
6      12    44              44
0      6     644             44
6      6     44              44
I count the rows where the difference between the "Need" and "Order" values falls outside a tolerance with the code below:
difference = df['Need'] - df['Order']
mask = difference.between(-1,1)
print (f'Count: {(~mask).sum()}')
Now I want to do something like this:

if (WarehouseStock - StoreStock) >= Need:
    difference1 = df['Need'] - df['Order']
    mask1 = difference1.between(-1, 1)
    print(f'Count: {(~mask1).sum()}')
else:
    difference2 = df['Need'] - df['Order']
    mask2 = difference2.between(-5, 5)
    print(f'Count: {(~mask2).sum()}')
Desired outputs are:

Count: 3

Order  Need  WarehouseStock  StoreStock
1      3     74              5
6      12    44              44
0      6     644             44
Could you please help me with this?

Using numpy.where with pandas.Series.between:

import pandas as pd
import numpy as np

s = df['Need'] - df['Order']
ind = np.where((df['WarehouseStock'] - df['StoreStock']).ge(df['Need']),
               ~s.between(-1, 1),
               ~s.between(-5, 5))
Output:

ind.sum()
# 3

df[ind]

   Order  Need  WarehouseStock  StoreStock
0      1     3              74           5
3      6    12              44          44
4      0     6             644          44

Related

Is there a way to avoid while loops using pandas in order to speed up my code?

I'm writing code to merge several dataframes together using pandas.
Here is my first table:

Index  Values  Intensity
1      11      98
2      12      855
3      13      500
4      24      140
and here is the second one:

Index  Values  Intensity
1      21      1000
2      11      2000
3      24      0.55
4      25      500
With these two dataframes, I concatenate them and drop_duplicates on the Values column, which gives me the following df:
Index  Values  Intensity_df1  Intensity_df2
1      11      0              0
2      12      0              0
3      13      0              0
4      24      0              0
5      21      0              0
6      25      0              0
I would like to recover the intensity of each value in each DataFrame. For this purpose, I am iterating through each line of each df, which is very inefficient. Here is the code I use:
m = 0
while m < len(num_df):
    n = 0
    while n < len(df3):
        temp_intens_abs = df[m]['Intensity'][df3['Values'][n] == df[m]['Values']]
        if temp_intens_abs.empty:
            merged.at[n, "Intensity_df%s" % df[m]] = 0
        else:
            merged.at[n, "Intensity_df%s" % df[m]] = pandas.to_numeric(temp_intens_abs, errors='coerce')
        n = n + 1
    m = m + 1
The resulting df3 looks like this at the end:

Index  Values  Intensity_df1  Intensity_df2
1      11      98             2000
2      12      855            0
3      13      500            0
4      24      140            0.55
5      21      0              1000
6      25      0              500
My question is: is there a way to directly recover "present" values in a df by comparing two columns using pandas? I've tried several solutions using numpy but without success. Thanks in advance for your help.
You can try joining these dataframes: df3 = df1.merge(df2, on="Values")
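To reproduce the desired table, an outer merge plus fillna appears to be enough; a minimal sketch, assuming the two input tables shown in the question:

```python
import pandas as pd

df1 = pd.DataFrame({"Values": [11, 12, 13, 24],
                    "Intensity": [98, 855, 500, 140]})
df2 = pd.DataFrame({"Values": [21, 11, 24, 25],
                    "Intensity": [1000, 2000, 0.55, 500]})

# outer merge keeps values present in either table; suffixes name the sources
df3 = df1.merge(df2, on="Values", how="outer", suffixes=("_df1", "_df2")).fillna(0)

print(df3)
```

Note that an outer merge may sort the join keys, so the row order can differ from the question's expected output, but the per-value intensities match.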

Generate Column Value in Pandas based on previous rows

Let us assume I am taking a temperature measurement on a regular interval and recording the values in a Pandas Dataframe
day  temperature [F]
0    89
1    91
2    93
3    88
4    90
Now I want to create another column which is set to 1 if and only if the current and the previous value are above a certain level. In my scenario I want a column value of 1 if two consecutive values are above 90, thus yielding
day  temperature  Above limit?
0    89           0
1    91           0
2    93           1
3    88           0
4    91           0
5    91           1
6    93           1
Despite some SO and Google digging, it's not clear whether I should use iloc[x], loc[x], or something else in a for loop.
You are looking for the shift function in pandas.

import io
import pandas as pd

data = """
day temperature Expected
0 89 0
1 91 0
2 93 1
3 88 0
4 91 0
5 91 1
6 93 1
"""

df = pd.read_csv(io.StringIO(data), sep=r'\s+')
df['Result'] = ((df['temperature'].shift(1) > 90) & (df['temperature'] > 90)).astype(int)

# Validation
(df['Result'] == df['Expected']).all()
Try this:

df = pd.DataFrame({'temperature': [89, 91, 93, 88, 90, 91, 91, 93]})
limit = 90
df['Above'] = ((df['temperature'] > limit) & (df['temperature'].shift(1) > limit)).astype(int)
df

In the future, please include the code needed for testing (in this case, the df construction line).
df['limit'] = ""
df.iloc[0, 2] = 0
for i in range(1, len(df)):
    if df.iloc[i, 1] > 90 and df.iloc[i-1, 1] > 90:
        df.iloc[i, 2] = 1
    else:
        df.iloc[i, 2] = 0

Here iloc[i, 2] refers to the ith row index and column index 2 (the limit column). Hope this helps.
Solution using shift():

>>> threshold = 90
>>> df['Above limit?'] = 0
>>> df.loc[((df['temperature [F]'] > threshold) & (df['temperature [F]'].shift(1) > threshold)), 'Above limit?'] = 1
>>> df

   day  temperature [F]  Above limit?
0    0               89             0
1    1               91             0
2    2               93             1
3    3               88             0
4    4               90             0
Try using rolling(window=2) and then apply() as follows (the first row comes out as NaN, since there is no previous value):

df["limit"] = df['temperature'].rolling(2).apply(lambda x: int(x.iloc[0] > 90) & int(x.iloc[-1] > 90))

Python: Apply function to each row of a Pandas DataFrame and return **new data frame**

I am trying to apply a function to each row of a data frame. The tricky part is that the function returns a new data frame for each processed row. Assume the columns of this data frame can easily be derived from the processed row.
At the end, the result should be all these data frames (one for each processed row) concatenated. I intentionally do not provide sample code, because the simplest solution proposal will do, as long as the 'tricky' part is fulfilled.
I have spent hours digging through the docs and stackoverflow to find a solution. As usual, the pandas docs are so devoid of practical examples, aside from the simplest of operations, that I just couldn't figure it out. I also made sure not to miss any duplicate questions. Thanks a lot.
It is unclear what you are trying to achieve, but I doubt you need to create separate dataframes.
The example below shows how you can take a dataframe, subset it to your columns of interest, apply a function foo to one of the columns and then apply a second function bar that returns multiple values.
df = pd.DataFrame({
    'first_name': ['john', 'nancy', 'jolly'],
    'last_name': ['smith', 'drew', 'rogers'],
    'A': [1, 4, 7],
    'B': [2, 5, 8],
    'C': [3, 6, 9]
})

>>> df
  first_name last_name  A  B  C
0       john     smith  1  2  3
1      nancy      drew  4  5  6
2      jolly    rogers  7  8  9

def foo(first_name):
    return 2 if first_name.startswith('j') else 1

def bar(first_name):
    return (2, 0) if first_name.startswith('j') else (1, 3)

columns_of_interest = ['first_name', 'A']
df_new = pd.concat([
    df[columns_of_interest].assign(x=df.first_name.apply(foo)),
    df.first_name.apply(bar).apply(pd.Series)], axis=1)
>>> df_new
  first_name  A  x  0  1
0       john  1  2  2  0
1      nancy  4  1  1  3
2      jolly  7  2  2  0
Assuming the function you are applying to each row is called f:

pd.concat({i: f(row) for i, row in df.iterrows()})

Working example:

df = pd.DataFrame(np.arange(25).reshape(5, 5), columns=list('ABCDE'))

def f(row):
    return pd.concat([row] * 2, keys=['x', 'y']).unstack().drop('C', axis=1).assign(S=99)

pd.concat({i: f(row) for i, row in df.iterrows()})
      A   B   D   E   S
0 x   0   1   3   4  99
  y   0   1   3   4  99
1 x   5   6   8   9  99
  y   5   6   8   9  99
2 x  10  11  13  14  99
  y  10  11  13  14  99
3 x  15  16  18  19  99
  y  15  16  18  19  99
4 x  20  21  23  24  99
  y  20  21  23  24  99

Or:

df.groupby(level=0).apply(lambda x: f(x.squeeze()))

      A   B   D   E   S
0 x   0   1   3   4  99
  y   0   1   3   4  99
1 x   5   6   8   9  99
  y   5   6   8   9  99
2 x  10  11  13  14  99
  y  10  11  13  14  99
3 x  15  16  18  19  99
  y  15  16  18  19  99
4 x  20  21  23  24  99
  y  20  21  23  24  99
I would do it this way - although I note that .apply is possibly what you are looking for.

import pandas as pd
import numpy as np

np.random.seed(7)
orig = pd.DataFrame(np.random.rand(6, 3))
orig.columns = ['F1', 'F2', 'F3']

res = []
for i, r in orig.iterrows():
    tot = 0
    for col in r:
        tot = tot + col
    rv = {'res': tot}
    a = pd.DataFrame.from_dict(rv, orient='index', dtype=np.float64)
    res.append(a)

res[0].head()

This should return a one-row DataFrame indexed by 'res' and holding the first row's sum.
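The general pattern in these answers - build one small DataFrame per row, then concatenate them - can be sketched minimally as follows (the expand function and column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# hypothetical per-row function that returns a new DataFrame for each row
def expand(row):
    return pd.DataFrame({"value": [row["a"], row["b"]],
                         "source": ["a", "b"]})

# one DataFrame per row, concatenated into the final result
out = pd.concat([expand(row) for _, row in df.iterrows()], ignore_index=True)
print(out)
```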

how to add complementary intervals in pandas dataframe

Let's say that I have a signal of 100 samples (L = 100).
In this signal I found some intervals that I label as "OK". The intervals are stored in a Pandas DataFrame that looks like this:
c = pd.DataFrame(np.array([[10,26],[50,84]]),columns=['Start','End'])
c['Value']='OK'
How can I add the complementary intervals in another dataframe in order to have something like this
d = pd.DataFrame(np.array([[0,9],[10,26],[27,49],[50,84],[85,100]]),columns=['Start','End'])
d['Value']=['Check','OK','Check','OK','Check']
You can use the first DataFrame to create the second one and merge as suggested by @jezrael:
d = pd.DataFrame({"Start": [0] + sorted(pd.concat([c.Start, c.End + 1])),
                  "End": sorted(pd.concat([c.Start - 1, c.End])) + [100]})
d = pd.merge(d, c, how='left')
d['Value'] = d['Value'].fillna('Check')
d = d.reindex(columns=["Start", "End", "Value"])
Output:

   Start  End  Value
0      0    9  Check
1     10   26     OK
2     27   49  Check
3     50   84     OK
4     85  100  Check
I think you need:
d = pd.merge(d, c, how='left')
d['Value'] = d['Value'].fillna('Check')
print (d)
   Start  End  Value
0      0    9  Check
1     10   26     OK
2     27   49  Check
3     50   84     OK
4     85  100  Check
EDIT:
You can use numpy.concatenate with numpy.sort, numpy.column_stack and the DataFrame constructor for the new df. Last, merge with c and fillna with a dict on the Value column to replace the missing labels:
s = np.sort(np.concatenate([[0], c['Start'].values, c['End'].values + 1]))
e = np.sort(np.concatenate([c['Start'].values - 1, c['End'].values, [100]]))
d = pd.DataFrame(np.column_stack([s,e]), columns=['Start','End'])
d = pd.merge(d, c, how='left').fillna({'Value':'Check'})
print (d)

   Start  End  Value
0      0    9  Check
1     10   26     OK
2     27   49  Check
3     50   84     OK
4     85  100  Check
EDIT1:
A new boundary value is added to c by loc, the frame is reshaped to a Series by stack and shifted; last, the df is created back by unstack:
b = c.copy()
max_val = 100
min_val = 0
c.loc[-1, 'Start'] = max_val + 1
a = c[['Start','End']].stack(dropna=False).shift().fillna(min_val - 1).astype(int).unstack()
a['Start'] = a['Start'] + 1
a['End'] = a['End'] - 1
a['Value'] = 'Check'
print (a)

    Start  End  Value
 0      0    9  Check
 1     27   49  Check
-1     85  100  Check
d = pd.concat([b, a]).sort_values('Start').reset_index(drop=True)
print (d)

   Start  End  Value
0      0    9  Check
1     10   26     OK
2     27   49  Check
3     50   84     OK
4     85  100  Check
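Put together, the numpy-based variant runs end to end on the question's data; a minimal sketch:

```python
import numpy as np
import pandas as pd

c = pd.DataFrame(np.array([[10, 26], [50, 84]]), columns=['Start', 'End'])
c['Value'] = 'OK'

# interval starts: 0, each OK start, and the position after each OK end
s = np.sort(np.concatenate([[0], c['Start'].values, c['End'].values + 1]))
# interval ends: the position before each OK start, each OK end, and 100
e = np.sort(np.concatenate([c['Start'].values - 1, c['End'].values, [100]]))

d = pd.DataFrame(np.column_stack([s, e]), columns=['Start', 'End'])
d = pd.merge(d, c, how='left').fillna({'Value': 'Check'})
print(d)
```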

Fill pandas dataframe with values in between

I am new to using pandas but want to learn it better. I am currently facing a problem. I have a DataFrame looking like this:
        0   1   2
0   chr2L   1   4
1   chr2L   9  12
2   chr2L  17  20
3   chr2L  23  23
4   chr2L  26  27
5   chr2L  30  40
6   chr2L  45  47
7   chr2L  52  53
8   chr2L  56  56
9   chr2L  61  62
10  chr2L  66  80
I want to get something like this:
        0   1   2  3
0   chr2L   0   1  0
1   chr2L   1   2  1
2   chr2L   2   3  1
3   chr2L   3   4  1
4   chr2L   4   5  0
5   chr2L   5   6  0
6   chr2L   6   7  0
7   chr2L   7   8  0
8   chr2L   8   9  0
9   chr2L   9  10  1
10  chr2L  10  11  1
11  chr2L  11  12  1
12  chr2L  12  13  0
And so on...
So: fill in the missing intervals with zeros and mark the present intervals with ones, while splitting all the data into length-1 intervals. (If there is an easy way to also mark "boundary" positions - the borders of the intervals in the initial data - as 0.5 at the same time, that might be helpful too.)
In the data there are multiple string values in column 0, and this should be done for each of them separately. They require different lengths of final data (the last position that should get a 0 or a 1 differs). I would appreciate your help dealing with this in pandas.
This works for most of your first paragraph and some of the second. Left as an exercise: finish inserting insideness=0 rows (see end):

import pandas as pd

# dummied-up version of your data, but with column headers for readability:
df = pd.DataFrame({'n': ['a']*4 + ['b']*2,
                   'a': [1, 6, 8, 5, 1, 5],
                   'b': [4, 7, 10, 5, 3, 7]})

# splitting up a range, translated into df row terms:
def onebyone(dfrow):
    a = dfrow[1].a; b = dfrow[1].b; n = dfrow[1].n
    count = b - a
    if count >= 2:
        interior = [0.5] + [1]*(count - 2) + [0.5]
    elif count == 1:
        interior = [0.5]
    elif count == 0:
        interior = []
    return {'n': [n]*count, 'a': range(a, a + count),
            'b': range(a + 1, a + count + 1),
            'insideness': interior}
Edited to use pd.concat(), new in pandas 0.15, to combine the intermediate results:

# Into a new dataframe:
intermediate = []
for label in set(df.n):
    for row in df[df.n == label].iterrows():
        intermediate.append(pd.DataFrame(onebyone(row)))
df_onebyone = pd.concat(intermediate)
df_onebyone.index = range(len(df_onebyone))
And finally a sketch of identifying the missing rows, which you can edit to match the above for-loop in adding rows to a final dataframe:

# for times in the overall range describing 'a'
for i in range(int(newd[newd.n == 'a'].a.min()), int(newd[newd.n == 'a'].a.max())):
    # if a time isn't in an existing 0.5-1-0.5 range:
    if i not in newd[newd.n == 'a'].a.values:
        # these are the values to fill in a 0-row
        print('%d, %d, 0' % (i, i + 1))
Or, if you know the a column will be sorted for each n, you could keep track of the last end-value handled by onebyone() and insert some extra rows to catch up to the next start value you're going to pass to onebyone().
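For the plain 0/1 case, a more direct sketch is to pre-build all the length-1 intervals and then flag the ones covered by an input interval (the column names, the 13-sample length, and the flag column here are assumptions for illustration, matching the first rows of the desired output):

```python
import numpy as np
import pandas as pd

# intervals in the question's style, for a single chromosome
iv = pd.DataFrame({"chrom": ["chr2L", "chr2L"],
                   "start": [1, 9], "end": [4, 12]})

length = 13  # hypothetical signal length for this chromosome
out = pd.DataFrame({"chrom": "chr2L",
                    "start": np.arange(length),
                    "end": np.arange(1, length + 1),
                    "flag": 0})

# a length-1 interval is flagged when it lies inside some input interval
for _, r in iv.iterrows():
    out.loc[(out["start"] >= r["start"]) & (out["end"] <= r["end"]), "flag"] = 1

print(list(out["flag"]))  # [0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0]
```

Running this per chromosome (with the appropriate length for each) handles the multiple string values in column 0.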