Efficient solution to create multiple columns with a formula - pandas/python

I'm trying to create multiple columns (a couple of hundred) using values within the same df. Is there a more efficient way for me to create multiple columns in batches? Below is an example where I have to manually attach new column names (jwrl2_rank.r1, jwrl2_rank.1r1, jwrl2_rank.2r1, etc.) to the formula.
i0, i1, i2, ... are the original column names,
and rn is the value within the column.
i0='jwrl2_rank'
i1='jwrl2_rank.1'
i2='jwrl2_rank.2'
i3='jwrl2_rank.3'
i4='jwrl2_rank.4'
i5='jwrl2_rank.5'
i6='jwrl2_rank.6'
i7='jwrl2_rank.7'
rn=1
df['jwrl2_rank.r1']=((df.loc[(df[i0]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i0]==rn),i0].count()))-1
df['jwrl2_rank.1r1']=((df.loc[(df[i1]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i1]==rn),i1].count()))-1
df['jwrl2_rank.2r1']=((df.loc[(df[i2]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i2]==rn),i2].count()))-1
df['jwrl2_rank.3r1']=((df.loc[(df[i3]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i3]==rn),i3].count()))-1
df['jwrl2_rank.4r1']=((df.loc[(df[i4]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i4]==rn),i4].count()))-1
df['jwrl2_rank.5r1']=((df.loc[(df[i5]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i5]==rn),i5].count()))-1
df['jwrl2_rank.6r1']=((df.loc[(df[i6]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i6]==rn),i6].count()))-1
df['jwrl2_rank.7r1']=((df.loc[(df[i7]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[i7]==rn),i7].count()))-1
Many thanks. Regards.

Using a for loop should work.
Incrementing string value
By using string interpolation you could solve your problem. See here for a quick introduction. I am using f-strings in the example below.
base_name = 'jwrl2_rank'
MAX_NUMBER = 3

for i in range(1, MAX_NUMBER + 1):
    new_name = f"{base_name}.{i}"
    print(new_name)
>>>
jwrl2_rank.1
jwrl2_rank.2
jwrl2_rank.3
Example of for loop
base_name = 'jwrl2_rank'
MAX_NUMBER = 3

for i in range(MAX_NUMBER + 1):
    current_iN = f"{base_name}.{i}"
    new_col_name = f"{base_name}.{i}r1"
    if i == 0:  # compensate for the missing zero in the first column name
        current_iN = base_name
        new_col_name = f"{base_name}.r1"
    df[new_col_name] = ((df.loc[(df[current_iN]==rn)&(df['result']==1),'timing'].sum())/(df.loc[(df[current_iN]==rn),current_iN].count()))-1
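If you really are creating hundreds of such columns, it may also be worth collecting the scalar results in a dict and assigning them all at once, which avoids growing the frame column by column. A minimal sketch, assuming the same df layout and the base_name/MAX_NUMBER/rn variables from above:
rn = 1
new_cols = {}
for i in range(MAX_NUMBER + 1):
    col = base_name if i == 0 else f"{base_name}.{i}"
    mask = df[col] == rn
    # mask.sum() counts the rows where the column equals rn, matching
    # df.loc[df[col] == rn, col].count() in the original formula
    new_cols[f"{base_name}.{'' if i == 0 else i}r1"] = (
        df.loc[mask & (df['result'] == 1), 'timing'].sum() / mask.sum() - 1
    )
df = df.assign(**new_cols)  # each scalar is broadcast to a full column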

Related

Extract several values from a row when a certain value is found using a list

I have a .csv file with 29 columns and 1692 rows.
The columns D_INT_1 and D_INT_2 are just dates.
For these two columns, I want to check whether there are dates between "2022-03-01" and "2024-12-31" (inclusive).
And, if a value is found, I want to display the found date plus the value of the column "NAME" located on the same row as the found value.
This is what I have right now, but it only grabs the dates and not the adjacent value ('NAME').
# importing module
import pandas as pd
# reading CSV file
df = pd.read_csv("scratch_2.csv")
# converting column data to list
D_INT_1 = df['D_INT_1'].tolist()
D_INT_2 = df['D_INT_2'].tolist()
ext = []
ext = [i for i in D_INT_1 + D_INT_2 if i >= "2022-03-01" and i <= "2024-12-31"]
print(*ext, sep="\n")
This is what I would like to get:
Example of DF:
NAME, ADDRESS, D_INT_1, D_INT_2
Mark, H4N1V8, 2023-01-02, 2019-01-01
Expected output:
MARK, 2023-01-02
Often the compact [for ... in ...] comprehension syntax is an efficient choice for simple code, but in this case I wouldn't recommend it. I suggest you use a normal for loop. Here's an example:
# importing module
import pandas as pd
# reading CSV file
df = pd.read_csv("scratch_2.csv")
# converting column data to lists
D_INT_1 = df['D_INT_1'].tolist()
D_INT_2 = df['D_INT_2'].tolist()
NAMES = df['NAME'].tolist()
# loop over every row in the data
# (i starts at 0 and increases by 1 every iteration)
for i in range(0, len(D_INT_1)):
    if D_INT_1[i] >= "2022-03-01" and D_INT_1[i] <= "2024-12-31":
        print(NAMES[i], D_INT_1[i])
    if D_INT_2[i] >= "2022-03-01" and D_INT_2[i] <= "2024-12-31":
        print(NAMES[i], D_INT_2[i])
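A slightly tidier variant of the same loop iterates the three lists together with zip, so no index bookkeeping is needed (same logic, assuming the lists stay aligned):
for name, d1, d2 in zip(NAMES, D_INT_1, D_INT_2):
    # chained comparisons express the inclusive date-string range directly
    if "2022-03-01" <= d1 <= "2024-12-31":
        print(name, d1)
    if "2022-03-01" <= d2 <= "2024-12-31":
        print(name, d2)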
First, for performance, don't use loops, because vectorized alternatives exist: unpivot with DataFrame.melt and filter with Series.between and DataFrame.loc:
df = df.melt(id_vars='NAME', value_vars=['D_INT_1','D_INT_2'], value_name='Date')
df1 = df.loc[df['Date'].between("2022-03-01","2024-12-31"), ['NAME','Date']]
print (df1)
NAME Date
0 Mark 2023-01-02
Or filter the original DataFrame and then join the pieces with concat:
df1 = df.loc[df['D_INT_1'].between("2022-03-01","2024-12-31"), ['NAME','D_INT_1']]
df2 = df.loc[df['D_INT_2'].between("2022-03-01","2024-12-31"), ['NAME','D_INT_2']]
df = pd.concat([df1.rename(columns={'D_INT_1':'date'}),
                df2.rename(columns={'D_INT_2':'date'})])
print (df)
NAME date
0 Mark 2023-01-02
Finally, if you need a looped output with print:
for i in df.itertuples():
    print(i.NAME, i.Date)
Mark 2023-01-02 00:00:00
Mark 2019-01-01 00:00:00
So there are a few things to note here:
In this case, you are probably better off with a normal for loop, since the logic is a bit more complicated.
To do what you want, first:
Load the names:
D_INT_1 = df['D_INT_1'].tolist()
D_INT_2 = df['D_INT_2'].tolist()
NAMES = df['NAME'].tolist()
Use enumerate, since we know all the lists are aligned the same way. Keep in mind that enumerate yields both index and value, but I am getting the value by index manually just for cleaner (and clearer) code:
ext = []
for i, _ in enumerate(D_INT_1):
    if D_INT_1[i] >= "2022-03-01" and D_INT_1[i] <= "2024-12-31":
        ext.append((D_INT_1[i], NAMES[i]))
    if D_INT_2[i] >= "2022-03-01" and D_INT_2[i] <= "2024-12-31":
        ext.append((D_INT_2[i], NAMES[i]))
Of course, you can use a list comprehension (or in this case, two), but this form should be easier to understand for this answer.
To do so, you will still need to load the names as in the first step, then use enumerate in the list comprehension, adding the name after the date in a tuple. Because D_INT_1 + D_INT_2 is twice as long as NAMES, the index has to wrap around, perhaps something like this:
ext = [(d, NAMES[ind % len(NAMES)]) for ind, d in enumerate(D_INT_1 + D_INT_2) if "2022-03-01" <= d <= "2024-12-31"]
Keep in mind that I didn't test the above code, since I have no access to the original CSV, but it should be a good starting point.

Python loop fails to iterate over all the values of an HTML table

I'm trying to fetch all the data from all the rows of one specific column of a table. The problem is that the loop fetches the first row multiple times and never continues to the next row. Here is the relevant code.
numRows = len(driver.find_elements_by_xpath('/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/tbody/tr'))
numColumns = len(driver.find_elements_by_xpath('/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/thead/tr[2]/th'))
print(numRows)
# Prints 139
print(numColumns)
# prints 21
for i in range(numRows + 1):
    df = []
    value = driver.find_element_by_xpath("/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/tbody/tr['{}']/td[16]".format(i))
    df.append(value.text)
    print(df)
As is evident from the print calls, I have all the rows and columns of my table, so that part works. But when I try to iterate over all the rows for one specific column, I only get the first value. I tried to solve this with the format() method, but that doesn't seem to fix it. Any idea how I can solve this problem?
Please try iterating through all the results found instead of finding individual elements. I cannot test the code since I do not have access to the HTML file.
found_elements = driver.find_elements_by_xpath("/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/tbody/tr/td[16]")
df = []
for element in found_elements:  # one element per row, so no index arithmetic needed
    df.append(element.text)
print(df)
I found the following code to work:
rader = len(driver.find_elements_by_xpath('/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/tbody/tr'))
#kolonner = len(driver.find_elements_by_xpath('/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/thead/tr[2]/th'))
kolonneFinish = []
kolonneBib = []
for i in range(1, rader + 1):
    valueFinish = driver.find_element_by_xpath("/html/body/div[2]/div/form[3]/div[2]/div[1]/div/div/div/div[2]/div[4]/section[1]/div[2]/div/div/table/tbody/tr["+str(i)+"]/td[16]")
    kolonneFinish.append(valueFinish.text)
I have no idea what the + + does, so if someone knows, please comment.
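For what it's worth, the + signs are plain string concatenation: they splice the loop index into the XPath so each iteration addresses a different table row. A tiny illustration:
i = 3
xpath_row = "tr[" + str(i) + "]"  # concatenation yields "tr[3]"
print(xpath_row)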

Randomization of a list with conditions using Pandas

I'm new to any kind of programming, as you can tell by this 'beautiful' piece of hard coding. With sweat and tears (not so bad, just a little), I've created a very sequential piece of code, and that's actually my problem. My goal is a somewhat automated script, probably involving a for loop (which I've tried, unsuccessfully).
The main aim is to create a randomization loop which takes an original dataset looking like this:
[image: dataset]
From this dataset, the loop picks rows randomly and saves them one by one to another Excel list. The point is that each pick's values in the columns position01 and position02 must never match the previous pick in either of those two columns. That should eventually create an Excel sheet of randomized rows in which every row shares no position01/position02 values with the row before it: row 2 should not include any such values from row 1, row 3 should not contain the values of row 2, and so on. It should also iterate over the range of the list length, which is 0-11. The Excel output matters too, since I need the rest of the columns; I just need to shuffle the row order.
I hope my aim and description are clear enough; if not, I'm happy to answer any questions. I would appreciate any hint that helps me get unstuck. Thank you. Code below. (PS: I'm aware there is probably a much neater solution than this.)
import pandas as pd
import random
dataset = pd.read_excel("C:\\Users\\ibm\\Documents\\Psychopy\\DataInput_Training01.xlsx")
# original data set used for comparisons
imageDataset = dataset.loc[0:11, :]
# creating empty df for storing rows from imageDataset
emptyExcel = pd.DataFrame()
randomPick = imageDataset.sample() # select randomly one row from imageDataset
emptyExcel = emptyExcel.append(randomPick) # append a row to empty df
randomPickIndex = randomPick.index.tolist() # get index of the row
imageDataset2 = imageDataset.drop(index=randomPickIndex) # delete the row with index selected before
# getting raw values from the row; 'position01'/'position02' are column headers
randomPickTemp1 = randomPick['position01'].values[0]
randomPickTemp2 = randomPick
randomPickTemp2 = randomPickTemp2['position02'].values[0]
# getting a dataset which does not include the row's values from position01 and position02
isit = imageDataset2[(imageDataset2.position01 != randomPickTemp1) & (imageDataset2.position02 != randomPickTemp1) & (imageDataset2.position01 != randomPickTemp2) & (imageDataset2.position02 != randomPickTemp2)]
# pick another row from dataset not including row selected at the beginning - randomPick
randomPick2 = isit.sample()
# save it in empty df
emptyExcel = emptyExcel.append(randomPick2, sort=False)
# get index of this second row to delete it in next step
randomPick2Index = randomPick2.index.tolist()
# delete the another row
imageDataset3 = imageDataset2.drop(index=randomPick2Index)
# AND REPEAT the procedure: compare the row's values with the dataset that no longer includes the original row:
randomPickTemp1 = randomPick2['position01'].values[0]
randomPickTemp2 = randomPick2
randomPickTemp2 = randomPickTemp2['position02'].values[0]
isit2 = imageDataset3[(imageDataset3.position01 != randomPickTemp1) & (imageDataset3.position02 != randomPickTemp1) & (imageDataset3.position01 != randomPickTemp2) & (imageDataset3.position02 != randomPickTemp2)]
# AND REPEAT: pick - save - match - pick again... until the end of the dataset (indices 0-11)
So in the end I've used a solution provided by David Bridges (post from Sep 19, 2019) on the PsychoPy forums. In case anyone is interested, here is the link: https://discourse.psychopy.org/t/how-do-i-make-selective-no-consecutive-trials/9186
I've just adjusted the condition in the for loop to my case, like this:
remaining = [choices[x] for x in choices if last['position01'] != choices[x]['position01'] and last['position01'] != choices[x]['position02'] and last['position02'] != choices[x]['position01'] and last['position02'] != choices[x]['position02']]
Thank you very much for the helpful answer, and hopefully I did not spam it over here too much.
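For context, choices and last come from the loop in the linked post. A rough, untested sketch of how the adjusted condition might be wired up (the wiring and names here are hypothetical, not the linked post verbatim):
import random

# hypothetical wiring: `dataset` is the DataFrame loaded in the question
rows = dataset.to_dict('records')  # list of row dicts with 'position01'/'position02'
choices = {i: row for i, row in enumerate(rows)}
key = random.choice(list(choices))
last = choices.pop(key)
order = [last]
while choices:
    remaining = [x for x in choices
                 if last['position01'] not in (choices[x]['position01'], choices[x]['position02'])
                 and last['position02'] not in (choices[x]['position01'], choices[x]['position02'])]
    if not remaining:
        break  # dead end; in practice, restart the shuffle
    key = random.choice(remaining)
    last = choices.pop(key)
    order.append(last)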
import itertools as it
import random
import pandas as pd
# list of pairs of numbers
tmp1 = [x for x in it.permutations(list(range(6)),2)]
df = pd.DataFrame(tmp1, columns=["position01","position02"])
df1 = pd.DataFrame()
i = random.choice(df.index)
df1 = df1.append(df.loc[i],ignore_index = True)
df = df.drop(index = i)
while not df.empty:
    val = list(df1.iloc[-1])
    tmp = df[(df["position01"]!=val[0])&(df["position01"]!=val[1])&(df["position02"]!=val[0])&(df["position02"]!=val[1])]
    if tmp.empty:  # looped 10000 times; this was never empty
        print("here")
        break
    i = random.choice(tmp.index)
    df1 = df1.append(df.loc[i], ignore_index=True)
    df = df.drop(index=i)
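Note that DataFrame.append was removed in pandas 2.0. On recent versions, the same idea can be expressed by collecting the picked rows in a list and building the result once at the end; a minimal, untested sketch of the equivalent loop:
picked = []
i = random.choice(df.index)
picked.append(df.loc[i])
df = df.drop(index=i)
while not df.empty:
    last = picked[-1]
    tmp = df[(df["position01"] != last["position01"]) & (df["position01"] != last["position02"]) &
             (df["position02"] != last["position01"]) & (df["position02"] != last["position02"])]
    if tmp.empty:
        break  # dead end; in practice, restart the shuffle
    i = random.choice(tmp.index)
    picked.append(df.loc[i])
    df = df.drop(index=i)
df1 = pd.DataFrame(picked).reset_index(drop=True)  # build the frame in one go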

Output unique values from a pandas dataframe without reordering the output

I know that a few posts have been made regarding how to output the unique values of a dataframe without reordering the data.
I have tried many times to implement these methods; however, I believe the problem relates to how the dataframe in question has been defined.
Basically, I want to look into the dataframe named "C" and output its unique values into a new dataframe named "C1", without changing the order in which they are currently stored.
The line that I use currently is:
C1 = pd.DataFrame(np.unique(C))
However, this returns a list in ascending order, while I simply want the list order preserved, only with duplicates removed.
Once again, I apologise to the advanced users who will look at my code and shake their heads; I'm still learning! And, yes, I have tried numerous methods to solve this problem (redefining the C dataframe, converting the output to a list, etc.), to no avail unfortunately, so this is my cry for help to the Python gods. I defined both C and C1 as dataframes, as I understand these are pretty much the best data structures to house data in, so that it can be recalled and used later; plus it is quite useful to name the columns without affecting the data contained in the dataframe.
Once again, your help would be much appreciated.
F0 = ('08/02/2018','08/02/2018',50)
F1 = ('08/02/2018','09/02/2018',52)
F2 = ('10/02/2018','11/02/2018',46)
F3 = ('12/02/2018','16/02/2018',55)
F4 = ('09/02/2018','28/02/2018',48)
F_mat = [[F0,F1,F2,F3,F4]]
F = pd.DataFrame(np.array(F_mat).reshape(5,3), columns=('startdate','enddate','price'))
#convert the string dates into datetime type for columns startdate and enddate
F['startdate'] = pd.to_datetime(F['startdate'])
F['enddate'] = pd.to_datetime(F['enddate'])
#create contract duration column
F['duration'] = (F['enddate'] - F['startdate']).dt.days + 1
#re-order the F matrix by column 'duration', ensure that the bootstrapping
#prioritises the shorter term contracts
F = F.sort_values(by=['duration'], ascending=True)
# create prices P
P = pd.DataFrame()
for index, row in F.iterrows():
    new_P_row = pd.Series()
    for date in pd.date_range(row['startdate'], row['enddate']):
        new_P_row[date] = row['price']
    P = P.append(new_P_row, ignore_index=True)
P.fillna(0, inplace=True)
#create C matrix, which records the unique day prices across the observation interval
C = pd.DataFrame(np.zeros((1, intNbCalendarDays)))
C.columns = tempDateRange
#create the Repatriation matrix, which records the order in which contracts will be
#stored in the A matrix, which means that once results are generated
#from the linear solver, we know exactly which CalendarDays map to
#which columns in the results array
#this array contains numbers from 1 to NbContracts
R = pd.DataFrame(np.zeros((1, intNbCalendarDays)))
R.columns = tempDateRange
#define a zero filled matrix, P1, which will house the dominant daily prices
P1 = pd.DataFrame(np.zeros((intNbContracts, intNbCalendarDays)))
#rename columns of P1 to be the dates contained in matrix array D
P1.columns = tempDateRange
#create prices in correct rows in P
for i in list(range(0, intNbContracts)):
    for j in list(range(0, intNbCalendarDays)):
        if (P.iloc[i, j] != 0 and C.iloc[0,j] == 0):
            flUniqueCalendarMarker = P.iloc[i, j]
            C.iloc[0,j] = flUniqueCalendarMarker
            P1.iloc[i,j] = flUniqueCalendarMarker
            R.iloc[0,j] = i
            for k in list(range(j+1,intNbCalendarDays)):
                if (C.iloc[0,k] == 0 and P.iloc[i,k] != 0):
                    C.iloc[0,k] = flUniqueCalendarMarker
                    P1.iloc[i,k] = flUniqueCalendarMarker
                    R.iloc[0,k] = i
        elif (C.iloc[0,j] != 0 and P.iloc[i,j] != 0):
            P1.iloc[i,j] = C.iloc[0,j]
#convert C dataframe into C_list, in preparation for converting C_list
#into a unique, order-preserved list
C_list = C.values.tolist()
#create C1 matrix, which records the unique day prices across unique days in the observation period
C1 = pd.DataFrame(np.unique(C))
Use DataFrame.duplicated() to check whether your dataframe contains any duplicates.
If it does, you can try DataFrame.drop_duplicates().
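For the concrete goal here (unique values with their first-appearance order preserved), pd.unique and Series.drop_duplicates both keep the original order, unlike np.unique, which sorts. A minimal sketch on made-up data:
import pandas as pd

s = pd.Series([50, 46, 50, 55, 46, 48])
# pd.unique preserves first-appearance order (np.unique would sort)
C1 = pd.DataFrame(pd.unique(s))                      # rows: 50, 46, 55, 48
C1_alt = s.drop_duplicates().reset_index(drop=True)  # same values, same order
# for a single-row frame like C above, pd.unique(C.iloc[0]) does the same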

Python Import data dictionary and pattern

If I have data as:
Code, data_1, data_2, data_3, [....], data204700
a,1,1,0, ... , 1
b,1,0,0, ... , 1
a,1,1,0, ... , 1
c,0,1,0, ... , 1
b,1,0,0, ... , 1
etc.; the same code appears with different values (0, 1, or ? for unknown).
I need to create a big matrix that I can analyze.
How can I import the data into a dictionary?
I want to use a dictionary for the columns (204,700 + 1).
Is there a built-in function (or package) that returns a pattern to me?
(I expect a percentage pattern.) I mean something like: 90% of 1s in column 1, 80% of 1s in column 2.
Alright, so I am going to assume you want this in a dictionary for storage purposes, and I will tell you that you don't want that with this kind of data: use a pandas DataFrame.
This is how you get your data into a DataFrame:
import pandas as pd
my_file = 'file_name'
df = pd.read_csv(my_file)
Now, you don't need a package to find the pattern you are looking for; just write a simple algorithm for it!
def one_percentage(data):
    # get the total number of rows for calculating percentages
    size = len(data)
    # grab the dtype of a data column so we only process the numeric columns
    x = data.columns[1]
    x = data[x].dtype
    # list of tuples holding the column names and their counts of 1s
    ones = [(i, sum(data[i])) for i in data if data[i].dtype == x]
    my_dict = {}
    # build a dictionary mapping column names to fractions of 1s
    for x in ones:
        percent = x[1] / float(size)
        my_dict[x[0]] = percent
    return my_dict
Now, if you want to get the fraction of ones in any one column, this is what you do:
percentages = one_percentage(df)
column_name = 'any_column_name'
print(percentages[column_name])
And if you want it for every single column, you can grab all of the column names and loop through them:
columns = [name for name in percentages]
for name in columns:
    # scale the stored fraction up to a percentage for display
    print(str(percentages[name] * 100) + "% of 1s in column " + name)
Let me know if you need anything else!
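For what it's worth, pandas can also compute these percentages directly: for 0/1 columns the column mean is exactly the fraction of ones. A short sketch, assuming the data really is 0/1 with '?' marking unknowns (na_values turns those into NaN, which mean() then skips):
import pandas as pd

df = pd.read_csv('file_name', na_values=['?'])
# mean of a 0/1 column == fraction of ones among the known values
percentages = df.select_dtypes('number').mean()
print(percentages.mul(100).round(1))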
