Nested if in a list comprehension / dataframe with Python

I have a dataframe for which I have columns df['kVA'] and df['Phase']. I am trying to create a column df['Line'], but with the following criteria:
Define line by phase
df['Line']=['1PH' if x=='1PH' else '3PH' for x in df['Phase'] ]
Define line by phase & kVA - desired output
df['Line']=['1PH' if x=='1PH' else ['3PHSM' if y<=750 else '3PHLG' for y in df['kVA']] for x in df['Phase'] ]
The code defining the line by phase works, but if I try to integrate the nested if, the code stalls. I am trying to classify products to manufacturing lines by their Phase and kVA characteristics. Both kVA and Phase are columns in my dataframe (as attached).
How can I fix this?

Pandas is a great tool. I would do it this way:
# create some similar data
import pandas as pd
df = pd.DataFrame({'Phase': ['1PH', '3PH', '3PH', '1PH'], 'kVA': [50, 750, 300, 37.5]})
# add a new column (some elements will not change)
df['Line'] = df['Phase']
# modify rows that fit your criteria
df.loc[ (df.kVA < 750) & (df.Phase == '3PH'), 'Line'] += 'SM'
df.loc[ (df.kVA >= 750) & (df.Phase == '3PH'), 'Line'] += 'LG'
.loc and .iloc are great for filtering parts of your dataframe.
Note: I'm using Pandas v0.20.3 for this test.
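For what it's worth, the nested comprehension from the question can also be made to work by zipping the two columns so each row's Phase and kVA are evaluated together; a small sketch, assuming the same column names:
df['Line'] = ['1PH' if phase == '1PH' else ('3PHSM' if kva <= 750 else '3PHLG')
              for phase, kva in zip(df['Phase'], df['kVA'])]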

To do that in a more pandas-like fashion, you probably want something more like this:
one_phase = df['Phase'] == '1PH'
small = df['kVA'] <= 750
df.loc[one_phase, 'Line'] = '1PH'
df.loc[~one_phase & small, 'Line'] = '3PHSM'
df.loc[~one_phase & ~small, 'Line'] = '3PHLG'
Note: You did not leave any parsable sample data so this was not tested.
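Another hedged sketch, using numpy.select to pick one label per row from a list of conditions (column names, sample values and thresholds taken from the question and the first answer):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Phase': ['1PH', '3PH', '3PH', '1PH'], 'kVA': [50, 750, 300, 37.5]})

# the first matching condition wins; anything else falls through to the default
conditions = [df['Phase'] == '1PH',
              (df['Phase'] == '3PH') & (df['kVA'] <= 750)]
df['Line'] = np.select(conditions, ['1PH', '3PHSM'], default='3PHLG')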

How to optimize Pandas group by +- margin [duplicate]

I want to optimize my code that groups by a +- margin using Python. I want to group my DataFrame, composed of two columns ['1', '2'], based on a margin of +-1 for column '1' and +-10 for column '2'.
For example, a really simplified overview:
[[273, 10],[274, 14],[275, 15]]
Expected output:
[[273, 10],[274, 14]],[[274, 14],[275, 15]]
My real data is much more complex, with nearly 1 million data points whose values look like 652.125454455.
Code like the following, for example, takes forever and gives no results:
import time
import numpy as np
import pandas as pd

a = np.random.uniform(low=300, high=1800, size=(300000,))
b = np.random.uniform(low=0, high=7200, size=(300000,))
print("Random numbers were created")
df = pd.DataFrame({'1': a, '2': b})
df['id'] = df.index
MARGIN1 = 1   # identifiers cannot start with a digit, so 1_MARGIN is renamed
MARGIN2 = 10
tic = time.time()
group = []
for index, row in df.iterrows():
    filtered_df = df[(row['1'] - MARGIN1 < df['1']) & (df['1'] < row['1'] + MARGIN1) &
                     (row['2'] - MARGIN2 < df['2']) & (df['2'] < row['2'] + MARGIN2)]
    group.append(filtered_df[['id', '1']].values.tolist())
toc = time.time()
print(f"for loop: {str(1000*(toc-tic))} ms")
I also tried
data = df.groupby('1')['2'].apply(list).reset_index(name='irt')
but in this case there is no margin
I tried my best to understand what you wanted, and I arrived at a very slow solution, but at least it's a solution.
import time
import pandas as pd
import numpy as np

a = np.random.uniform(low=300, high=1800, size=(300000,))
b = np.random.uniform(low=0, high=7200, size=(300000,))
df = pd.DataFrame({'1': a, '2': b})

dfbl1 = np.sort(df['1'].apply(int).unique())
dfbl2 = np.sort(df['2'].apply(int).unique())

MARGIN1 = 1
MARGIN2 = 10

marg1array = np.array(range(dfbl1[0], dfbl1[-1], MARGIN1))
marg2array = np.array(range(dfbl2[0], dfbl2[-1], MARGIN2))

a = time.perf_counter()
groupmarg1 = []
for low, upper in zip(marg1array[:-1], marg1array[1:]):
    for low2, upper2 in zip(marg2array[:-1], marg2array[1:]):
        groupmarg1.append(df.loc[(df['1'] > low) & (df['1'] < upper) &
                                 (df['2'] > low2) & (df['2'] < upper2)].values.tolist())
print(time.perf_counter() - a)
I also tried to do each loop separately and intersect them, which should be faster, but since we're storing .values.tolist() I couldn't figure out a faster way than the below.
a = time.perf_counter()
groupmarg1 = []
for low, upper in zip(marg1array[:-1], marg1array[1:]):
    groupmarg1.append(df.loc[(df['1'] > low) & (df['1'] < upper)])

newgroup = []
for subgroup in groupmarg1:
    for low2, upper2 in zip(marg2array[:-1], marg2array[1:]):
        newgroup.append(subgroup.loc[(subgroup['2'] > low2) & (subgroup['2'] < upper2)].values.tolist())
print(time.perf_counter() - a)
which runs in ~9mins on my machine.
Oh, and you need to filter out the empty DataFrames; if you want them as values.tolist() you can do it while filtering, like this:
gr2=[grp.values.tolist() for grp in newgroup if not grp.empty]
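For reference, if fixed-width bins are acceptable instead of a sliding +- margin around every row (an assumption about what the grouping should mean), a much faster sketch assigns each row to a bucket with integer division and does a single groupby:
import numpy as np
import pandas as pd

a = np.random.uniform(low=300, high=1800, size=(300000,))
b = np.random.uniform(low=0, high=7200, size=(300000,))
df = pd.DataFrame({'1': a, '2': b})
df['id'] = df.index

MARGIN1 = 1
MARGIN2 = 10

# bucket index along each column, then one groupby instead of nested loops
df['bin1'] = (df['1'] // MARGIN1).astype(int)
df['bin2'] = (df['2'] // MARGIN2).astype(int)
groups = [g[['id', '1']].values.tolist() for _, g in df.groupby(['bin1', 'bin2'])]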

loop over columns in dataframes python

I want to loop over 2 columns in a specific dataframe and access the data by the name of the column, but it gives me a type error on line 3:
i = 0
for name, value in df.iteritems():
    q1 = df[name].quantile(0.25)
    q3 = df[name].quantile(0.75)
    IQR = q3 - q1
    min = q1 - 1.5*IQR
    max = q3 + 1.5*IQR
    minout = df[df[name] < min]
    maxout = df[df[name] > max]
    new_df = df[(df[name] < max) & (df[name] > min)]
    i += 1
    if i == 2:
        break
It looks like you want to exclude outliers based on the 1.5*IQR rule. Here is a simpler solution:
Input dummy data:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'col%s' % (i+1): np.random.normal(size=1000)
                   for i in range(4)})
Removing the outliers (keep data where Q1 - 1.5*IQR < data < Q3 + 1.5*IQR):
Q1 = df.iloc[:, :2].quantile(.25)
Q3 = df.iloc[:, :2].quantile(.75)
IQR = Q3-Q1
non_outliers = (df.iloc[:, :2] > Q1-1.5*IQR) & (df.iloc[:, :2] < Q3+1.5*IQR)
new_df = df[non_outliers.all(axis=1)]
The output new_df keeps only the rows where both of the first two columns fall within their whiskers.
A type error can happen for a lot of reasons, so it would be better if you added part of the DataFrame to help pin down the issue.
You can also use the iterrows() function, which iterates over rows and lets you access each column by name:
import pandas as pd
df = pd.read_csv('filename.csv')
for _, content in df.iterrows():
    print(content['columnname'])  # add the name of the columns you want to loop over
refer to the following link for more information
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html#pandas.DataFrame.iterrows
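Note that iterrows() walks rows, not columns; to loop over columns directly you can use items(), which yields (column name, Series) pairs. A small sketch, where the restriction to the first two columns is an assumption based on the question:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [10.0, 20.0, 30.0, 40.0], 'col3': ['a', 'b', 'c', 'd']})

# iterate over (column name, column values) pairs for the first two columns only
for name, col in df.iloc[:, :2].items():
    q1 = col.quantile(0.25)
    q3 = col.quantile(0.75)
    print(name, q1, q3)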

Randomization of a list with conditions using Pandas

I'm new to any kind of programming, as you can tell from this 'beautiful' piece of hard coding. With sweat and tears (not so bad, just a little) I've created very sequential code, and that's actually my problem. My goal is to create a somewhat automated script, probably including a for loop (which I've tried unsuccessfully).
The main aim is to create a randomization loop which takes the original dataset, which looks like this:
[dataset image]
From this dataset it should pick rows randomly, one by one, and save each pick to another Excel list. The point is that each selected row should never share a value in columns position01 or position02 with the previous pick. The result should be an Excel sheet of randomized rows in which each row contains none of the position01/position02 values of the row before it: row 2 should not include any of those values from row 1, row 3 should not contain the values of row 2, and so on. It should also iterate over the range of the list length, which is 0-11. The Excel output is also important since I need the rest of the columns; I just need to shuffle the order.
I hope my aim and description are clear enough; if not, I'm happy to answer any questions. I would appreciate any hint or help that gets me 'unstuck'. Thank you. Code below. (PS: I'm aware that there is probably a much neater solution than this.)
import pandas as pd
import random
dataset = pd.read_excel("C:\\Users\\ibm\\Documents\\Psychopy\\DataInput_Training01.xlsx")
# original data set use for comparisons
imageDataset = dataset.loc[0:11, :]
# creating empty df for storing rows from imageDataset
emptyExcel = pd.DataFrame()
randomPick = imageDataset.sample() # select randomly one row from imageDataset
emptyExcel = emptyExcel.append(randomPick) # append a row to empty df
randomPickIndex = randomPick.index.tolist() # get index of the row
imageDataset2 = imageDataset.drop(index=randomPickIndex) # delete the row with index selected before
# getting raw values from the row; 'position01'/'position02' are column headers
randomPickTemp1 = randomPick['position01'].values[0]
randomPickTemp2 = randomPick
randomPickTemp2 = randomPickTemp2['position02'].values[0]
# getting a dataset which not including row values from position01 and position02
isit = imageDataset2[(imageDataset2.position01 != randomPickTemp1) & (imageDataset2.position02 != randomPickTemp1) & (imageDataset2.position01 != randomPickTemp2) & (imageDataset2.position02 != randomPickTemp2)]
# pick another row from dataset not including row selected at the beginning - randomPick
randomPick2 = isit.sample()
# save it in empty df
emptyExcel = emptyExcel.append(randomPick2, sort=False)
# get index of this second row to delete it in next step
randomPick2Index = randomPick2.index.tolist()
# delete the another row
imageDataset3 = imageDataset2.drop(index=randomPick2Index)
# AND REPEAT the procedure of comparison of the raw values with dataset already not including the original row:
randomPickTemp1 = randomPick2['position01'].values[0]
randomPickTemp2 = randomPick2
randomPickTemp2 = randomPickTemp2['position02'].values[0]
isit2 = imageDataset3[(imageDataset3.position01 != randomPickTemp1) & (imageDataset3.position02 != randomPickTemp1) & (imageDataset3.position01 != randomPickTemp2) & (imageDataset3.position02 != randomPickTemp2)]
# AND REPEAT with another pick - save - matching - picking again.. until end of the length of the dataset (which is 0-11)
In the end I used a solution provided by David Bridges (post from Sep 19, 2019) on the PsychoPy forum. In case anyone is interested, here is the link: https://discourse.psychopy.org/t/how-do-i-make-selective-no-consecutive-trials/9186
I just adjusted the condition in the for loop to my case, like this:
remaining = [choices[x] for x in choices if last['position01'] != choices[x]['position01'] and last['position01'] != choices[x]['position02'] and last['position02'] != choices[x]['position01'] and last['position02'] != choices[x]['position02']]
Thank you very much for the helpful answer, and hopefully I did not spam too much over here.
import itertools as it
import random
import pandas as pd

# list of pairs of numbers
tmp1 = [x for x in it.permutations(list(range(6)), 2)]
df = pd.DataFrame(tmp1, columns=["position01", "position02"])
df1 = pd.DataFrame()

i = random.choice(df.index)
df1 = df1.append(df.loc[i], ignore_index=True)
df = df.drop(index=i)

while not df.empty:
    val = list(df1.iloc[-1])
    tmp = df[(df["position01"] != val[0]) & (df["position01"] != val[1]) &
             (df["position02"] != val[0]) & (df["position02"] != val[1])]
    if tmp.empty:  # looped 10000 times, was never empty
        print("here")
        break
    i = random.choice(tmp.index)
    df1 = df1.append(df.loc[i], ignore_index=True)
    df = df.drop(index=i)
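As a side note, DataFrame.append was deprecated and later removed in newer pandas; a sketch of the same selection loop that collects the picked rows in a list and concatenates once at the end (same column names assumed):
import itertools as it
import random
import pandas as pd

df = pd.DataFrame(list(it.permutations(range(6), 2)), columns=["position01", "position02"])

rows = [df.loc[random.choice(df.index)]]   # first pick
df = df.drop(index=rows[-1].name)

while not df.empty:
    last = rows[-1]
    banned = [last["position01"], last["position02"]]
    # keep only rows that share no value with the previous pick
    tmp = df[~df["position01"].isin(banned) & ~df["position02"].isin(banned)]
    if tmp.empty:
        break
    i = random.choice(tmp.index)
    rows.append(df.loc[i])
    df = df.drop(index=i)

df1 = pd.concat(rows, axis=1).T.reset_index(drop=True)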

How to improve performance on average calculations in python dataframe

I am trying to improve the performance of a current piece of code, whereby I loop through a dataframe (dataframe 'r') and find the average values from another dataframe (dataframe 'p') based on criteria.
I want to find the average of all values (column 'Val') from dataframe 'p' where (r.RefDate = p.RefDate) & (r.Item = p.Item) & (p.StartDate >= r.StartDate) & (p.EndDate <= r.EndDate)
Dummy data for this can be generated as below:
import pandas as pd
import numpy as np
from datetime import datetime

######### START CREATION OF DUMMY DATA ##########
rng = pd.date_range('2019-01-01', '2019-10-28')
daily_range = pd.date_range('2019-01-01', '2019-12-31')

p = pd.DataFrame(columns=['RefDate', 'Item', 'StartDate', 'EndDate', 'Val'])
for item in ['A', 'B', 'C', 'D']:
    for date in daily_range:
        daily_p = pd.DataFrame({'RefDate': rng,
                                'Item': item,
                                'StartDate': date,
                                'EndDate': date,
                                'Val': np.random.randint(0, 100, len(rng))})
        p = p.append(daily_p)

r = pd.DataFrame(columns=['RefDate', 'Item', 'PeriodStartDate', 'PeriodEndDate', 'AvgVal'])
for item in ['A', 'B', 'C', 'D']:
    r1 = pd.DataFrame({'RefDate': rng,
                       'Item': item,
                       'PeriodStartDate': '2019-10-25',
                       'PeriodEndDate': '2019-10-31',  # datetime(2019, 10, 31)
                       'AvgVal': 0})
    r = r.append(r1)
r.reset_index(drop=True, inplace=True)
######### END CREATION OF DUMMY DATA ##########
The piece of code I currently have (and would like to improve the performance of) is as follows:
for i in r.index:
    avg_price = p['Val'].loc[((p['StartDate'] >= r.loc[i]['PeriodStartDate']) &
                              (p['EndDate'] <= r.loc[i]['PeriodEndDate']) &
                              (p['RefDate'] == r.loc[i]['RefDate']) &
                              (p['Item'] == r.loc[i]['Item']))].mean()
    r['AvgVal'].loc[i] = avg_price
The first change is that when generating the r DataFrame, both PeriodStartDate and PeriodEndDate are created as datetime. See the following fragment of your initialization code, changed by me:
r1 = pd.DataFrame({'RefDate': rng, 'Item':item,
'PeriodStartDate': pd.to_datetime('2019-10-25'),
'PeriodEndDate': pd.to_datetime('2019-10-31'), 'AvgVal': 0})
To get better speed, I set the index in both DataFrames to RefDate and Item (the columns compared on equality) and sorted by index:
p.set_index(['RefDate', 'Item'], inplace=True)
p.sort_index(inplace=True)
r.set_index(['RefDate', 'Item'], inplace=True)
r.sort_index(inplace=True)
This way, access by index is significantly quicker.
Then I defined the following function computing the mean for rows
from p "related to" the current row from r:
def myMean(row):
    pp = p.loc[row.name]
    return pp[pp.StartDate.ge(row.PeriodStartDate) &
              pp.EndDate.le(row.PeriodEndDate)].Val.mean()
And the only thing to do is to apply this function (to each row in r) and
save the result in AvgVal:
r.AvgVal = r.apply(myMean, axis=1)
Using %timeit, I compared the execution time of the code proposed by EdH with mine and got a result almost 10 times shorter. Check it on your own.
By using iterrows I managed to improve the performance, although there may still be quicker ways.
for index, row in r.iterrows():
    avg_price = p['Val'].loc[((p['StartDate'] >= row.PeriodStartDate) &
                              (p['EndDate'] <= row.PeriodEndDate) &
                              (p['RefDate'] == row.RefDate) &
                              (p['Item'] == row.Item))].mean()
    r.loc[index, 'AvgVal'] = avg_price
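A further sketch that avoids the Python-level loop entirely, assuming the original un-indexed p and r from the dummy-data block: merge on the equality keys, filter the date window, then take one grouped mean.
# make sure all date columns are real datetimes (p and r were built by appending to empty frames)
for col in ['StartDate', 'EndDate']:
    p[col] = pd.to_datetime(p[col])
for col in ['PeriodStartDate', 'PeriodEndDate']:
    r[col] = pd.to_datetime(r[col])

# one merge on the equality keys, then keep only the rows inside the date window
merged = r.reset_index().merge(p, on=['RefDate', 'Item'])
merged['Val'] = pd.to_numeric(merged['Val'])  # Val can end up as object dtype after the appends
in_window = ((merged['StartDate'] >= merged['PeriodStartDate']) &
             (merged['EndDate'] <= merged['PeriodEndDate']))

# average Val per original row of r and write the result back
avg = merged[in_window].groupby('index')['Val'].mean()
r['AvgVal'] = r.index.map(avg)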

Writing to a csv using pandas with filters

I'm using the pandas library to load in a csv file using Python.
import pandas as pd
df = pd.read_csv("movies.csv")
I'm then checking the columns for specific values or statements, such as:
viewNum = df["views"] >= 1000
starringActorNum = df["starring"] > 3
df["title"] = df["title"].astype("str")
titleLen = df["title"].str.len() <= 10
I want to create a new csv file using the criteria above, but am unsure how to do that as well as how to combine all those attributes into one csv.
Anyone have any ideas?
Combine the boolean masks using & (bitwise-and):
mask = viewNum & starringActorNum & titleLen
Select the rows of df where mask is True:
df_filtered = df.loc[mask]
Write the DataFrame to a csv:
df_filtered.to_csv('movies-filtered.csv')
import pandas as pd
df = pd.read_csv("movies.csv")
viewNum = df["views"] >= 1000
starringActorNum = df["starring"] > 3
df["title"] = df["title"].astype("str")
titleLen = df["title"].str.len() <= 10
mask = viewNum & starringActorNum & titleLen
df_filtered = df.loc[mask]
df_filtered.to_csv('movies-filtered.csv')
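If you don't want the DataFrame index written out as an extra column, to_csv also accepts index=False:
df_filtered.to_csv('movies-filtered.csv', index=False)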
You can use the pandas.DataFrame.query() interface. It allows text-string queries and is very fast for large data sets.
Something like this should work:
import pandas as pd
df = pd.read_csv("movies.csv")
# the len() method is not available to query, so pre-calculate it
title_len = df["title"].str.len()
# build the filtered frame and send it to a csv file; @title_len refers to the local variable
df.query('views >= 1000 and starring > 3 and @title_len <= 10').to_csv(...)
