Incrementing in Pandas - python

I am looking to simplify my code a bit when I fill in a Pandas dataframe. For example, this works great to auto-increment a string:
results['Cluster'] = ['Cluster %s' %i for i in range(1, len(results) + 1)]
Result:
Cluster 1
Cluster 2
Cluster 3
etc
How can I write the same elegant code when I refer to values from other dataframes? All I need to do is increment cluster_X and total_cX, rather than refer to each one individually:
results['Item per Population'] = [cluster_1.shape[0]/total_c1,cluster_2.shape[0]/total_c2,cluster_3.shape[0]/total_c3]
Thank you.

OK, I got it myself.
First define:
clusters = [cluster_1, cluster_2, ...]  # and so on
totals_pop = [total_c1, total_c2, ...]  # and so on
Then:
appended_data = []
for i, x in zip(clusters, totals_pop):
    data = i.shape[0] / x
    appended_data.append(data)
appended_data
results['Car Wash per Population'] = np.array(appended_data)
There might be a more elegant way of doing this with results.loc[...] instead.
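The loop above can also be collapsed into a single list comprehension over zip. This is a minimal sketch, not the asker's actual data: the two stand-in cluster dataframes and the population totals are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the asker's cluster_X dataframes and total_cX values.
cluster_1 = pd.DataFrame({"x": [1, 2, 3]})
cluster_2 = pd.DataFrame({"x": [1, 2]})
total_c1, total_c2 = 6, 4

clusters = [cluster_1, cluster_2]
totals_pop = [total_c1, total_c2]

results = pd.DataFrame(
    {"Cluster": [f"Cluster {i}" for i in range(1, len(clusters) + 1)]}
)

# One ratio per (cluster, total) pair: number of rows divided by the total.
results["Item per Population"] = [
    c.shape[0] / t for c, t in zip(clusters, totals_pop)
]
```

Keeping the clusters in a list (or a dict keyed by name) from the start avoids having to refer to numbered variables at all.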

Related

Adding empty rows in Pandas dataframe

I'd like to append consistently empty rows in my dataframe.
I have the following code, which does what I want, but I'm struggling to adjust it to my needs:
s = pd.Series('', data_only_trades.columns)
f = lambda d: d.append(s, ignore_index=True)
set_rows = np.arange(len(data_only_trades)) // 4
empty_rows = data_only_trades.groupby(set_rows, group_keys=False).apply(f).reset_index(drop=True)
How can I adjust the code so that it adds two or more rows instead of one?
How can I set a starting point (e.g. it should start at row 5; do I have to use .loc in arange for that?)
I also tried this code, but I struggled to set the starting row and to set the values to blank (I got NaN):
df_new = pd.DataFrame()
for i, row in data_only_trades.iterrows():
    df_new = df_new.append(row)
    for _ in range(2):
        df_new = df_new.append(pd.Series(), ignore_index=True)
Thank you!
I think you can use NumPy:
import numpy as np
v = np.ndarray(shape=(numberOfRowsYouWant, df.values.shape[1]), dtype=object)
v[:] = ""
pd.DataFrame(np.vstack((df.values, v)))
But if you want to keep your approach, simply convert NaN to "":
df.fillna("")
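Neither snippet above addresses inserting several blank rows at regular intervals from a chosen starting row. A minimal sketch of one way to do that with pd.concat follows (DataFrame.append, used in the question, was removed in pandas 2.0). The function name insert_blank_rows and its parameters are hypothetical, and the sample frame is made up.

```python
import pandas as pd

def insert_blank_rows(df, every=4, n_blank=2, start=0):
    """Insert n_blank empty-string rows after each block of `every` rows.

    The first `start` rows are copied through untouched (hypothetical helper).
    """
    blank = pd.DataFrame("", index=range(n_blank), columns=df.columns)
    pieces = [df.iloc[:start]] if start else []
    for pos in range(start, len(df), every):
        pieces.append(df.iloc[pos:pos + every])  # next block of real rows
        pieces.append(blank)                     # followed by the blank rows
    return pd.concat(pieces, ignore_index=True)

# Made-up example data.
data_only_trades = pd.DataFrame({"a": range(6), "b": list("uvwxyz")})
out = insert_blank_rows(data_only_trades, every=4, n_blank=2, start=0)
```

Like the groupby version in the question, this also adds blank rows after the last (possibly shorter) block. Mixing numbers with empty strings turns numeric columns into object dtype.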

Efficient solution to create multiple columns with formula. pandas/python

I'm trying to create multiple columns (a couple of hundred) using values within the same df. Is there a more efficient way to create multiple columns in batches? Below is an example where I have to manually type the new column names jwrl2_rank.r1, jwrl2_rank.1r1, jwrl2_rank.2r1, etc. for each formula.
i0, i1, i2, etc. are the original column names,
and rn is the value within the column.
i0='jwrl2_rank'
i1='jwrl2_rank.1'
i2='jwrl2_rank.2'
i3='jwrl2_rank.3'
i4='jwrl2_rank.4'
i5='jwrl2_rank.5'
i6='jwrl2_rank.6'
i7='jwrl2_rank.7'
rn=1
df['jwrl2_rank.r1']=((df.loc[(df[i0]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i0]==rn),i0].count()))-1
df['jwrl2_rank.1r1']=((df.loc[(df[i1]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i1]==rn),i1].count()))-1
df['jwrl2_rank.2r1']=((df.loc[(df[i2]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i2]==rn),i2].count()))-1
df['jwrl2_rank.3r1']=((df.loc[(df[i3]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i3]==rn),i3].count()))-1
df['jwrl2_rank.4r1']=((df.loc[(df[i4]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i4]==rn),i4].count()))-1
df['jwrl2_rank.5r1']=((df.loc[(df[i5]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i5]==rn),i5].count()))-1
df['jwrl2_rank.6r1']=((df.loc[(df[i6]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i6]==rn),i6].count()))-1
df['jwrl2_rank.7r1']=((df.loc[(df[i7]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i7]==rn),i7].count()))-1
many thanks. regards
Using a for loop should work.
Incrementing string value
You could solve your problem by using string interpolation. I am using f-strings in the example below.
base_name='jwrl2_rank'
MAX_NUMBER = 3
for i in range(1, MAX_NUMBER + 1):
    new_name = f"{base_name}.{i}"
    print(new_name)
>>>
jwrl2_rank.1
jwrl2_rank.2
jwrl2_rank.3
Example of for loop
base_name='jwrl2_rank'
MAX_NUMBER = 3
for i in range(MAX_NUMBER + 1):
    current_iN = f"{base_name}.{i}"
    new_col_name = f"{base_name}.{i}r1"
    if i == 0: # compensate for missing zero in column name
        current_iN = base_name
        new_col_name = f"{base_name}.r1"
    # rn is the rank value defined in the question (rn=1)
    df[new_col_name]=((df.loc[(df[current_iN]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[current_iN]==rn),current_iN].count()))-1
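With a couple of hundred columns it may help to compute all values first and attach them in one step. The sketch below wraps the question's formula in a hypothetical helper, rank_return, loops over every rank column and rank value, and adds the results with a single df.assign call. The sample data and the extra rank value 2 are made up; only the column names come from the question.

```python
import pandas as pd

# Made-up sample data; only the column names come from the question.
df = pd.DataFrame({
    "jwrl2_rank":   [1, 1, 2],
    "jwrl2_rank.1": [2, 1, 1],
    "result":       [1, 0, 1],
    "timing":       [3.0, 5.0, 4.0],
})

def rank_return(frame, col, rn):
    """Question's formula for one column and one rank value (hypothetical helper)."""
    hit = frame[col] == rn
    wins = frame.loc[hit & (frame["result"] == 1), "timing"].sum()
    return wins / hit.sum() - 1

base_name = "jwrl2_rank"
rank_cols = [c for c in df.columns if c.startswith(base_name)]

new_cols = {}
for col in rank_cols:
    for rn in (1, 2):
        # "jwrl2_rank" -> "jwrl2_rank.r1", "jwrl2_rank.1" -> "jwrl2_rank.1r1"
        sep = "." if col == base_name else ""
        new_cols[f"{col}{sep}r{rn}"] = rank_return(df, col, rn)

# Attach all new columns at once instead of one assignment per column.
df = df.assign(**new_cols)
```

If a rank value never occurs in a column, the denominator is zero and the result is NaN or infinite, so that case may need separate handling.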

Iterate over loop with append data continuous

I need to iterate over the rows of a pandas dataframe. However, I am stuck with looping because it takes too long on millions of rows. I think my code is still not optimal.
new_columns = ['alt', 'alt_anomaly']
df_new = pd.DataFrame(columns=new_columns)
loop = 20
idx = 0
for i, row in df.iterrows():
    for alt in range(loop):
        alt_anomaly = df.iloc[i]['alt'] * (400.00)
        df_new.loc[idx] = row.values.tolist() + [alt_anomaly]
        idx += 1
print(df_new)
The altitude should change gradually in multiples of 400 ft: the first row by 400 ft, the second by 800 ft, and so on.
It's like:
row[1] = 27800+400
row[2] = 27775+800
etc....
Thanks for your help, I appreciate that.
You can do the following without looping:
df['alt_anomaly'] = df['alt'] + (df.index+1)*400
Or use the pandas .add method:
df['alt'].add((df.index+1)*400)
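Both one-liners rely on the index being the default 0, 1, 2, ... range. A minimal sketch of a position-based variant using np.arange, which also works when the index is something else, is shown below. The two altitude values come from the question; the index labels are made up.

```python
import numpy as np
import pandas as pd

# Altitudes from the question; the non-default index is made up to show
# that the offsets depend on row position, not on index labels.
df = pd.DataFrame({"alt": [27800, 27775]}, index=[10, 20])

step = 400  # feet added per row: 400 for the first row, 800 for the second, ...
df["alt_anomaly"] = df["alt"] + step * np.arange(1, len(df) + 1)
```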

Python loop fails to iterate over all the values of a html table

I'm trying to fetch the data from all the rows for one specific column of a table. The problem is that the loop fetches the first row multiple times and never continues to the next row. Here is the relevant code.
numRows = len(driver.find_elements_by_xpath('/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/tbody/tr'))
numColumns = len(driver.find_elements_by_xpath('/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/thead/tr[2]/th'))
print(numRows)
# Prints 139
print(numColumns)
# prints 21
for i in range(numRows + 1):
    df = []
    value = driver.find_element_by_xpath("/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/tbody/tr['{}']/td[16]".format(i))
    df.append(value.text)
    print(df)
As the print calls show, I do get the number of rows and columns of my table, so that part works. But when I try to iterate over all the rows for one specific column, I only get the first value. I have tried to solve this by using the format() method, but that doesn't seem to help. Any idea how I can solve this problem?
Please try iterating through all the results found instead of finding individual elements. I cannot test the code since I do not have access to the HTML file.
found_elements = driver.find_elements_by_xpath("/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/tbody/tr/td[16]")
df = []  # create the list once, outside the loop, so it is not reset on every row
for element in found_elements:
    df.append(element.text)
print(df)
I found the following code to work:
rader = len(driver.find_elements_by_xpath('/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/tbody/tr'))
#kolonner = len(driver.find_elements_by_xpath('/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/thead/tr[2]/th'))
kolonneFinish = []
kolonneBib = []
for i in range(1, rader + 1):
    valueFinish = driver.find_element_by_xpath("/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/tbody/tr["+str(i)+"]/td[16]")
    kolonneFinish.append(valueFinish.text)
I have no idea what the + + does. So if someone knows, please comment.
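On the "+ +" question: the plus signs are plain string concatenation, so "tr[" + str(i) + "]" builds the text tr[1], tr[2], and so on. The original attempt put quotes around the placeholder, which produced tr['1']. In XPath a non-empty string used as a predicate counts as true, so that expression matches every row and find_element returns the first one each time. The sketch below only builds the strings; it uses a shortened, hypothetical XPath and needs no browser.

```python
# Shortened, hypothetical XPath prefix standing in for the long one above.
base = "//table/tbody/tr"
i = 3

# Original attempt: the quotes end up inside the XPath predicate.
quoted = (base + "['{}']/td[16]").format(i)

# Working version from the answer: concatenate the number into the brackets.
concatenated = base + "[" + str(i) + "]/td[16]"

# Equivalent with format(), just without the quotes around the placeholder.
formatted = (base + "[{}]/td[16]").format(i)

print(quoted)        # //table/tbody/tr['3']/td[16]
print(concatenated)  # //table/tbody/tr[3]/td[16]
```

Starting the range at 1 also matters, because XPath positions are 1-based.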

Randomization of a list with conditions using Pandas

I'm new to any kind of programming as you can tell by this 'beautiful' piece of hard coding. With sweat and tears (not so bad, just a little), I've created a very sequential code and that's actually my problem. My goal is to create a somewhat-automated script - probably including for-loop (I've unsuccessfully tried).
The main aim is to create a randomization loop which takes an original dataset looking like this:
[image: dataset]
From this dataset I want to pick rows randomly, one by one, and save them one by one to another Excel sheet. The point is that each row should be selected so that its values in the columns position01 and position02 do not match either of those two values in the previous pick. That should eventually create an Excel sheet of randomized rows in which each row shares no values with the row before it: row 2 should not contain any of the position01 and position02 values of row 1, row 3 should not contain the values of row 2, etc. It should also iterate over the range of the list length, which is 0-11. The Excel output is also important since I need the rest of the columns; I just need to shuffle the order.
I hope my aim and description are clear enough, if not, happy to answer any questions. I would appreciate any hint or help, that helps me 'unstuck'. Thank you. Code below. (PS: I'm aware of the fact that there is probably much more neat solution to it than this)
import pandas as pd
import random
dataset = pd.read_excel("C:\\Users\\ibm\\Documents\\Psychopy\\DataInput_Training01.xlsx")
# original data set use for comparisons
imageDataset = dataset.loc[0:11, :]
# creating empty df for storing rows from imageDataset
emptyExcel = pd.DataFrame()
randomPick = imageDataset.sample() # select randomly one row from imageDataset
emptyExcel = emptyExcel.append(randomPick) # append a row to empty df
randomPickIndex = randomPick.index.tolist() # get index of the row
imageDataset2 = imageDataset.drop(index=randomPickIndex) # delete the row with index selected before
# getting raw values from the row 'position01'/02 are columns headers
randomPickTemp1 = randomPick['position01'].values[0]
randomPickTemp2 = randomPick
randomPickTemp2 = randomPickTemp2['position02'].values[0]
# getting a dataset which not including row values from position01 and position02
isit = imageDataset2[(imageDataset2.position01 != randomPickTemp1) & (imageDataset2.position02 != randomPickTemp1) & (imageDataset2.position01 != randomPickTemp2) & (imageDataset2.position02 != randomPickTemp2)]
# pick another row from dataset not including row selected at the beginning - randomPick
randomPick2 = isit.sample()
# save it in empty df
emptyExcel = emptyExcel.append(randomPick2, sort=False)
# get index of this second row to delete it in next step
randomPick2Index = randomPick2.index.tolist()
# delete the another row
imageDataset3 = imageDataset2.drop(index=randomPick2Index)
# AND REPEAT the procedure of comparison of the raw values with dataset already not including the original row:
randomPickTemp1 = randomPick2['position01'].values[0]
randomPickTemp2 = randomPick2
randomPickTemp2 = randomPickTemp2['position02'].values[0]
isit2 = imageDataset3[(imageDataset3.position01 != randomPickTemp1) & (imageDataset3.position02 != randomPickTemp1) & (imageDataset3.position01 != randomPickTemp2) & (imageDataset3.position02 != randomPickTemp2)]
# AND REPEAT with another pick - save - matching - picking again.. until end of the length of the dataset (which is 0-11)
In the end I used a solution provided by David Bridges (post from Sep 19 2019) on the PsychoPy forum. In case anyone is interested, here is a link: https://discourse.psychopy.org/t/how-do-i-make-selective-no-consecutive-trials/9186
I've just adjusted the condition in for loop to my case like this:
remaining = [choices[x] for x in choices if last['position01'] != choices[x]['position01'] and last['position01'] != choices[x]['position02'] and last['position02'] != choices[x]['position01'] and last['position02'] != choices[x]['position02']]
Thank you very much for the helpful answer! and hopefully I did not spam it over here too much.
import itertools as it
import random
import pandas as pd
# list of pair of numbers
tmp1 = [x for x in it.permutations(list(range(6)),2)]
df = pd.DataFrame(tmp1, columns=["position01","position02"])
df1 = pd.DataFrame()
i = random.choice(df.index)
df1 = df1.append(df.loc[i],ignore_index = True)
df = df.drop(index = i)
while not df.empty:
    val = list(df1.iloc[-1])
    tmp = df[(df["position01"]!=val[0])&(df["position01"]!=val[1])&(df["position02"]!=val[0])&(df["position02"]!=val[1])]
    if tmp.empty: #looped for 10000 times, was never empty
        print("here")
        break
    i = random.choice(tmp.index)
    df1 = df1.append(df.loc[i],ignore_index = True)
    df = df.drop(index=i)
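The code above relies on DataFrame.append, which was removed in pandas 2.0. A minimal sketch of the same greedy idea that collects index labels in a list and reorders the frame once at the end is shown below. The function name shuffle_no_repeat and the seed parameter are hypothetical, the sketch assumes a unique index, and like the loop above it simply stops if it reaches a dead end.

```python
import itertools as it
import random

import pandas as pd

def shuffle_no_repeat(df, cols=("position01", "position02"), seed=None):
    """Greedy shuffle: each row shares no `cols` value with the row before it."""
    rng = random.Random(seed)
    # Set of position values per row label (assumes the index is unique).
    values = dict(zip(df.index, map(set, df[list(cols)].to_numpy().tolist())))
    remaining = list(df.index)
    order, last = [], set()
    while remaining:
        candidates = [i for i in remaining if values[i].isdisjoint(last)]
        if not candidates:
            break  # dead end: every remaining row clashes with the last pick
        pick = rng.choice(candidates)
        order.append(pick)
        remaining.remove(pick)
        last = values[pick]
    # All other columns travel along with the reordered rows.
    return df.loc[order].reset_index(drop=True)

# Same toy data as above: all ordered pairs of the numbers 0-5.
pairs = pd.DataFrame(list(it.permutations(range(6), 2)),
                     columns=["position01", "position02"])
shuffled = shuffle_no_repeat(pairs, seed=0)
```

The result could then be written out with shuffled.to_excel(...), which requires an Excel writer package such as openpyxl to be installed.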
