Applying a custom function to a column in a dataframe - python

I have a custom function that takes an 8-character identifier (CUSIP) and, based on some logic, generates the 9th character (the check bit). I want to apply this function to a dataframe of 8-char identifiers and get back the dataframe with the full 9-char strings.
e.g. a list of 2 8-char cusips:
list1 = [['912810SE', '912810SF']]
pd1 = pd.DataFrame(list1)
print(pd1.apply(gen_cusip_checkbit))
I am expecting 9 and 6; however, I am getting 4 and 2 when applying the function to the dataframe. Also, the loop inside the function should run 8 times, but when the function is applied to the dataframe it runs 36 times.
This is the function:
def gen_cusip_checkbit(cusip):
    cusip = str(cusip).upper()
    sumnum = 0
    for i in range(len(cusip)):
        val = 0
        if cusip[i].isnumeric():
            val = int(cusip[i])
        else:
            val = int(cusip_alpha.find(cusip[i]) + 10)  # refers to alphabet string for mapping
        if i % 2 != 0:
            val *= 2
        val = (val % 10) + (val // 10)
        sumnum += val
    return str((10 - (sumnum % 10)) % 10)

So it looks like when you do:
pd1.apply(gen_cusip_checkbit)
the variable sent to the function is a whole column (a pandas Series), whose string representation is:
0    912810SE
Name: 0, dtype: object
Inside the function, str(cusip) turns that Series into a 36-character string, which answers why your loop has 36 iterations.
If you run apply against the column instead:
pd1[0].apply(gen_cusip_checkbit)
the value that gets sent is just:
912810SE
which should give you the right output.
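To get the full 9-character identifiers back, you can append the check digit to the original column. A minimal sketch, assuming the identifiers sit one per row in column 0, that cusip_alpha is simply the uppercase alphabet (A maps to 10, B to 11, and so on), and reusing gen_cusip_checkbit from the question; the 'full' column name is just illustrative:
import string
import pandas as pd

cusip_alpha = string.ascii_uppercase          # assumed mapping: A -> 10, ..., Z -> 35

pd1 = pd.DataFrame(['912810SE', '912810SF'])  # one 8-char CUSIP per row

# element-wise apply on the column, then concatenate to build the 9-char string
pd1['full'] = pd1[0] + pd1[0].apply(gen_cusip_checkbit)
print(pd1['full'])                            # expected 912810SE9 and 912810SF6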


Pandas Convert object values used in Passive Components to float

I have a dataframe of part numbers stored as object, where each string contains a 3-character value code in one of the following formats:
Either 1R2, where the R is the decimal separator,
Or digits only, where the first 2 are the significant digits and the 3rd is the number of trailing zeros:
101 = 100
010 = 1
223 = 22000
476 = 47000000
My dataframe (the important characters are at positions 5-7):
             MATNR
0  xx01B101KO3XYZC
1  xx03C010CA3GN5T
2  xx02L1R2CA3ANNR
The code below works fine for the 1R2 case and converts object to float64, but I am stuck on combining the 2 significant digits with the number of zeros.
value_pos1 = 5
value_pos2 = 6
value_pos3 = 7
df['Value'] = pd.to_numeric(np.where(df['MATNR'].str.get(value_pos2) == 'R',
                                     df['MATNR'].str.get(value_pos1) + '.' + df['MATNR'].str.get(value_pos3),
                                     df['MATNR'].str.slice(start=value_pos1, stop=value_pos3) + df['MATNR'].str.get(value_pos3)))
Result:
MATNR      object
Cap pF    float64
dtype: object
Index(['MATNR', 'Value'], dtype='object')
             MATNR  Value
0  xx01B101KO3XYZC  101.0
1  xx03C010CA3GN5T   10.0
2  xx02L1R2CA3ANNR    1.2
It should be:
             MATNR  Value
0  xx01B101KO3XYZC  100.0
1  xx03C010CA3GN5T    1.0
2  xx02L1R2CA3ANNR    1.2
The following is what I tried; it produces errors, and on top of that a 0 at pos3 gives a wrong value (1 instead of 0):
df['Value'] = pd.to_numeric(np.where(df['MATNR'].str.get(value_pos2) == 'R',
                                     df['MATNR'].str.get(Value_pos1) + '.' + df['MATNR'].str.get(value_pos3),
                                     df['MATNR'].str.slice(start=value_pos1, stop=value_pos3) + str(pow(10, pd.to_numeric(df['MATNR'].str.get(value_pos3))))))
Do you have an idea?
If I have understood your problem correctly, defining a method and applying it to all the values of the column seems most intuitive. The method takes a str input and returns a float number.
Here is a snippet of what the simple method will entail:
def get_number(strrep):
    if not strrep or len(strrep) < 8:
        return 0.0
    useful_str = strrep[5:8]
    if useful_str[1] == 'R':
        return float(useful_str[0] + '.' + useful_str[2])
    else:
        zeros = '0' * int(useful_str[2])
        return float(useful_str[0:2] + zeros)
Then you could simply create a new column with the numeric conversion of the strings. The easiest way possible is using list comprehension:
df['Value'] = [get_number(x) for x in df['MATNR']]
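As a quick sanity check (a sketch using the sample part numbers from the question), the helper returns the expected values:
for part in ['xx01B101KO3XYZC', 'xx03C010CA3GN5T', 'xx02L1R2CA3ANNR']:
    print(part, get_number(part))
# xx01B101KO3XYZC 100.0
# xx03C010CA3GN5T 1.0
# xx02L1R2CA3ANNR 1.2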
Not sure where the bug in your code is, but another option that I tend to use when creating a new column based on other columns is pandas' apply function:
def create_value_col(row):
    if row['MATNR'][value_pos2] == 'R':
        # convert to float so the column stays numeric
        val = float(row['MATNR'][value_pos1] + '.' + row['MATNR'][value_pos3])
    else:
        val = (int(row['MATNR'][value_pos1]) * 10 +
               int(row['MATNR'][value_pos2])) * 10 ** int(row['MATNR'][value_pos3])
    return val

df['Value'] = df.apply(lambda row: create_value_col(row), axis='columns')
This way, you can create a function that processes the data however you want and then apply it to every row and add the resulting series to your dataframe.
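For reference, the np.where attempt from the question can also be made to work by keeping everything as Series operations instead of calling str(pow(...)) on a whole Series. This is only a sketch of that idea (not the answerer's code), applying the exponent element-wise and using errors='coerce' so the 1R2 rows do not break the numeric branch:
import numpy as np
import pandas as pd

value_pos1, value_pos2, value_pos3 = 5, 6, 7

df['Value'] = np.where(
    df['MATNR'].str.get(value_pos2) == 'R',
    # 1R2 -> 1.2
    pd.to_numeric(df['MATNR'].str.get(value_pos1) + '.' + df['MATNR'].str.get(value_pos3),
                  errors='coerce'),
    # two significant digits times 10 ** third digit, e.g. 101 -> 10 * 10**1 = 100
    pd.to_numeric(df['MATNR'].str.slice(value_pos1, value_pos3), errors='coerce')
    * 10.0 ** pd.to_numeric(df['MATNR'].str.get(value_pos3), errors='coerce'))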

removing the middle member or members from list in python

I have this code. Everything seems okay, but it is not printing the desired values.
I think something is wrong in how the function is called, but I can't figure it out.
The code should remove the middle element if the list length is odd, or the middle two elements if the length is even.
This is the code:
One_Ten = [1,2,3,4,5,6,7,8,9,10]

def removeMiddle(data: list) -> list:
    index = 0
    size = len(data)
    index = size // 2
    if (size % 2 == 0):
        data = data[:index-1] + data[index+1:]
    if (size % 2 == 1):
        data.pop(index)
    return data

data = list(One_Ten)
removeMiddle(data)
print("After removing the middle element(s):", data)
so the desired output should look like
[1,2,3,4,7,8,9,10]
You just need to assign data its new value:
data = removeMiddle(data)
Alternatively, you can make the function work in place by editing the first condition:
if (size % 2 == 0):
    data.pop(index)
    data.pop(index-1)
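Putting the two branches together, a minimal sketch of the fully in-place variant (same behaviour, but mutating the caller's list so the original call pattern works unchanged):
One_Ten = [1,2,3,4,5,6,7,8,9,10]

def removeMiddle(data: list) -> list:
    size = len(data)
    index = size // 2
    if size % 2 == 0:
        # even length: drop the two middle elements
        data.pop(index)
        data.pop(index - 1)
    else:
        # odd length: drop the single middle element
        data.pop(index)
    return data

data = list(One_Ten)
removeMiddle(data)          # data is modified in place
print("After removing the middle element(s):", data)
# [1, 2, 3, 4, 7, 8, 9, 10]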

Add a value in a column as a function of the timestamp and another column

The title may not be very clear, but with an example I hope it would make some sense.
I would like to create an output column (called "outputTics"), and put a 1 in it 0.21 seconds after a 1 appears in the "inputTics" column.
As you can see, there is no value exactly 0.21 seconds after another value, so I'll put the 1 in the outputTics column two rows later: for example, at index 3 there is a 1 at 11.4 seconds, so I'm putting a 1 in the output column at 11.6 seconds.
If another 1 appears in the "inputTics" column within 0.21 seconds, do not put a 1 in the output column: an example would be the 1 at index 1 of the input column.
The "outputTics" column in the dataframe below shows the output I would like to create.
Here is the code to create the dataframe:
A = pd.DataFrame({"Timestamp":[11.1,11.2,11.3,11.4,11.5,11.6,11.7,11.8,11.9,12.0,12.1,12.2,12.3,12.4,12.5,12.6,12.7,12.8,12.9,13.0],
"inputTics":[0,1,0,1,0,0,0,1,0,0,0,1,1,0,0,0,0,1,1,1],
"outputTics":[0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0]})
You can use pd.Timedelta to avoid Python's float rounding, if you want.
Create the column with zeros.
df['outputTics'] = 0
Define a function set_output_tic in the following manner
def set_output_tic(row):
    if row['inputTics'] == 0:
        return 0
    index = df[df == row].dropna().index
    # check for a 1 in input within 0.11 seconds
    t = row['Timestamp'] + pd.Timedelta(seconds=0.11)
    indices = df[df.Timestamp <= t].index
    c = 0
    for i in indices:
        if df.loc[i, 'inputTics'] == 0:
            c = c + 1
        else:
            c = 0
            break
    if c > 0:
        df.loc[indices[-1] + 1, 'outputTics'] = 1
    return 0
Then call the above function using df.apply:
temp = df.apply(set_output_tic, axis = 1) # temp is practically useless
This was actually kinda tricky, but by playing with indices in numpy you can do it.
# Set timestamp as index for a moment
A = A.set_index(['Timestamp'])
# Find the timestamp indices of inputTics and add your 0.11
input_indices = A[A['inputTics'] == 1].index + 0.11
# Iterate through the indices and find the indices to update outputTics
output_indices = []
for ii in input_indices:
    # Compare indices to full dataframe's timestamps
    # and return index of nearest timestamp
    oi = np.argmax((A.index - ii) >= 0)
    output_indices.append(oi)
# Create column of output ticks with 1s in the right place
output_tics = np.zeros(len(A))
output_tics[output_indices] = 1
# Add it to dataframe
A['outputTics'] = output_tics
# Add condition that if inputTics is 1, outputTics is 0
A['outputTics'] = A['outputTics'] - A['inputTics']
# Clean up negative values
A.loc[A['outputTics'] < 0, 'outputTics'] = 0
# The first row becomes 1 because of indexing; change to 0
A = A.reset_index()
A.at[0, 'outputTics'] = 0
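The loop over input_indices can also be replaced with np.searchsorted, which performs the same "first timestamp at or after the target" lookup in one vectorized call. A sketch under the same assumptions (Timestamp is a plain float column again after the reset_index above, default RangeIndex, and the 0.11-second offset used in this answer); timestamps, targets and hits are just illustrative variable names:
import numpy as np

timestamps = A['Timestamp'].to_numpy()
targets = timestamps[A['inputTics'].to_numpy() == 1] + 0.11

# index of the first row whose timestamp is >= each target
hits = np.searchsorted(timestamps, targets)
hits = hits[hits < len(timestamps)]            # drop targets that fall past the last timestamp

A['outputTics'] = 0
A.loc[hits, 'outputTics'] = 1                  # RangeIndex, so labels match positions
A.loc[A['inputTics'] == 1, 'outputTics'] = 0   # same rule as above: no output where input is 1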

TypeError: zip argument #1 must support iteration in Python

I'm trying to write code to find the indices where a value changes from 0 to 1 and save them in a variable called idx. Then the two rows before and after each index should be extracted and processed. The code for extracting the rows is included below:
df1 = pd.DataFrame({'A': [1,3,4,7,8,11,1,15,20,15,16,87],
                    'B': [1,3,4,6,8,11,1,19,20,15,16,87],
                    'flag': [0,0,0,0,1,1,1,0,0,1,0,0]})
N = 2
s = [x for s, e in zip(idx-N, idx) for x in range(s, e+1)]
df_before_2rows = df1.loc[df1.index.intersection(s)]
This works. But if I run this in a for-loop that processes each index one by one, I get an error:
df1 = pd.DataFrame({'A': [1,3,4,7,8,11,1,15,20,15,16,87],
                    'B': [1,3,4,6,8,11,1,19,20,15,16,87],
                    'flag': [0,0,0,0,1,1,1,0,0,1,0,0]})
for item in idx:
    N = 2
    s = [x for s, e in zip(item-N, item) for x in range(s, e+1)]
    df_before_2rows = df1.loc[df1.index.intersection(s)]

TypeError: zip argument #1 must support iteration
The main goal is to get the two rows before each point where flag changes from 0 to 1, process them, then check for the next 0-to-1 change and do the same.
IIUC, you can choose a different approach using groupby with cumsum of diff:
df = pd.DataFrame({'A': [1,3,4,7,8,11,1,15,20,15,16,87],
                   'B': [1,3,4,6,8,11,1,19,20,15,16,87],
                   'flag': [0,0,0,0,1,1,1,0,0,1,0,0]})

for _, i in df.groupby(df["flag"].shift(1).diff().eq(1).cumsum()):
    if i["flag"].eq(1).any():  # this is done to skip the last group with no flag of 1
        print(i.tail(3))
        # do your thing with i.tail(3)...
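For the sample dataframe this keeps the first two groups (the last group has no flag of 1), so the tails printed are rows 2-4 and 7-9, i.e. the two rows before each 0-to-1 change plus the change row itself. If you would rather collect them than print them, a small sketch (parts is just a hypothetical list name):
parts = [i.tail(3) for _, i in df.groupby(df["flag"].shift(1).diff().eq(1).cumsum())
         if i["flag"].eq(1).any()]
# parts[0] holds rows 2-4 of df, parts[1] holds rows 7-9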
EDIT using your original method:
idx = [4, 8]  # I assume you retrieved the idx already
for item in idx:
    N = 2
    df_before_2rows = df.loc[range(item-N, item+1)]
    print(df_before_2rows)
item is an element of idx; item-N is also just a number, hence the error.
for item in idx:
    N = 2
    s = [x for s, e in zip(item-N, item) for x in range(s, e+1)]
simplifies to:
for item in idx:
    N = 2
    # s = [x for x in range(item-N, item+1)]
    s = list(range(item-N, item+1))
    # s = np.arange(item-N, item+1)
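For completeness, idx itself (which both snippets assume already exists) can be built by looking for rows where flag is 1 and the previous flag is 0. A small sketch on the question's df1:
idx = df1.index[(df1['flag'] == 1) & (df1['flag'].shift(fill_value=0) == 0)]
# an Index containing 4 and 9 for the sample data; idx - N still works because this is an Index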

Find value and index in a pandas Series where the value increased 5 times

It should go through a pandas Series and stop when a value has increased 5 times in a row. With a simple example it works so far:
list2 = pd.Series([2,3,3,4,5,1,4,6,7,8,9,10,2,3,2,3,2,3,4])

def cut(x):
    y = iter(x)
    for i in y:
        if x[i] < x[i+1] < x[i+2] < x[i+3] < x[i+4] < x[i+5]:
            return x[i]
            break
out = cut(list2)
index = list2[list2 == out].index[0]
So I get the correct output of 1 and index of 5.
But if I use a second Series with shape (23999,) instead of (19,), I get this error:
pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 3489660928
You can do something like this:
# compare list2 with the previous values
s = list2.gt(list2.shift())
# looking at last 5 values
s = s.rolling(5).sum()
# select those equal 5
list2[s.eq(5)]
Output:
10     9
11    10
dtype: int64
The first index where it happens is
s.eq(5).idxmax()
# output 10
Also, you can chain them together:
(list2.gt(list2.shift())
      .rolling(5).sum()
      .eq(5).idxmax()
)
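Note that idxmax marks the end of the 5-step run; if, like the original cut function, you want the value at the start of the run (index 5, value 1 for the sample), you can step back by 5. A small sketch, assuming the default RangeIndex:
end = (list2.gt(list2.shift())
            .rolling(5).sum()
            .eq(5).idxmax())
start = end - 5                 # 10 - 5 = 5 for the sample data
print(start, list2[start])      # 5 1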
