I have a dataframe of part numbers stored as object dtype, each containing a 3-character value in one of two formats:
Either 1R2, where the R is the decimal separator,
Or three digits, where the first two are significant and the third is the number of trailing zeros:
101 = 100
010 = 1
223 = 22000
476 = 47000000
My dataframe (the important characters are at positions 5-7):
MATNR
0 xx01B101KO3XYZC
1 xx03C010CA3GN5T
2 xx02L1R2CA3ANNR
The code below works fine for the 1R2 case and converts the object column to float64, but I am stuck on combining the two significant digits with the number of zeros.
value_pos1 = 5
value_pos2 = 6
value_pos3 = 7
df['Value'] = pd.to_numeric(np.where(df['MATNR'].str.get(value_pos2)=='R',
df['MATNR'].str.get(value_pos1) + '.' + df['MATNR'].str.get(value_pos3),
df['MATNR'].str.slice(start=value_pos1, stop=value_pos3) + df['MATNR'].str.get(value_pos3)))
Result
MATNR object
Cap pF float64
dtype: object
Index(['MATNR', 'Value'], dtype='object')
MATNR Value
0 xx01B101KO3XYZC 101.0
1 xx03C010CA3GN5T 10.0
2 xx02L1R2CA3ANNR 1.2
It should be
MATNR Value
0 xx01B101KO3XYZC 100.0
1 xx03C010CA3GN5T 1.0
2 xx02L1R2CA3ANNR 1.2
I tried the following, but it raises errors; on top of that, a 0 at pos3 would give a wrong value, since 10**0 is 1 rather than appending no zeros.
df['Value'] = pd.to_numeric(np.where(df['MATNR'].str.get(value_pos2)=='R',
df['MATNR'].str.get(Value_pos1) + '.' + df['MATNR'].str.get(value_pos3),
df['MATNR'].str.slice(start=value_pos1, stop=value_pos3) + str(pow(10, pd.to_numeric(df['MATNR'].str.get(value_pos3))))))
Do you have an idea?
If I have understood your problem correctly, defining a method and applying it to all the values of the column seems most intuitive. The method takes a str input and returns a float number.
Here is a snippet of what the simple method will entail.
def get_number(strrep):
    if not strrep or len(strrep) < 8:
        return 0.0
    useful_str = strrep[5:8]
    if useful_str[1] == 'R':
        return float(useful_str[0] + '.' + useful_str[2])
    else:
        zeros = '0' * int(useful_str[2])
        return float(useful_str[0:2] + zeros)
Then you could simply create a new column with the numeric conversion of the strings. The easiest way possible is using list comprehension:
df['Value'] = [get_number(x) for x in df['MATNR']]
Not sure where the bug in your code is, but another option that I tend to use when creating a new column based on other columns is pandas' apply function:
def create_value_col(row):
    if row['MATNR'][value_pos2] == 'R':
        val = float(row['MATNR'][value_pos1] + '.' + row['MATNR'][value_pos3])
    else:
        val = (int(row['MATNR'][value_pos1]) * 10 +
               int(row['MATNR'][value_pos2])) * 10 ** int(row['MATNR'][value_pos3])
    return val

df['Value'] = df.apply(create_value_col, axis='columns')
This way, you can create a function that processes the data however you want and then apply it to every row and add the resulting series to your dataframe.
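For completeness, the fully vectorized attempt from the question can also be made to work by doing the power-of-ten step numerically instead of via str(pow(...)). A sketch on the sample data (sig and exp are names I'm introducing here):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'MATNR': ['xx01B101KO3XYZC', 'xx03C010CA3GN5T', 'xx02L1R2CA3ANNR']})

# two significant digits and the zero count, parsed numerically;
# errors='coerce' turns the '1R' slice of the decimal rows into NaN,
# which is harmless because np.where picks the other branch there
sig = pd.to_numeric(df['MATNR'].str.slice(5, 7), errors='coerce')
exp = pd.to_numeric(df['MATNR'].str.get(7), errors='coerce')

df['Value'] = np.where(
    df['MATNR'].str.get(6) == 'R',
    # decimal case: digit R digit -> digit.digit
    pd.to_numeric(df['MATNR'].str.get(5) + '.' + df['MATNR'].str.get(7), errors='coerce'),
    sig * 10.0 ** exp,
)
print(df['Value'].tolist())  # [100.0, 1.0, 1.2]
```

Keeping the exponent numeric avoids the original error of calling str(pow(...)) on a whole Series.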
Related
I am working with an excel file which I read into python as a pandas dataframe.
One of the columns contains responses of how many hours a person slept.
A sample column is as follows:
df['Sleep'] = [1, 2, 3, 'Blank', 4, 'Blank', '5`1/2', '`3/4']
My objective is to clean this data and get it all into a single datatype with NaN for Blanks. The blanks were taken care of using:
df['Sleep'] = df['Sleep'].replace('Blank', np.nan)
My question is how can I convert something like 5`1/2 to 5.5? All fractions in the dataset start with the backtick symbol.
We have to build a boolean mask with str.contains (passing na=False because you have mixed types), then use loc and pd.eval:
m = df['Sleep'].str.contains('`', na=False)
df.loc[m, 'Sleep'] = df.loc[m, 'Sleep'].str.replace('`', '+').apply(pd.eval)
df['Sleep'] = pd.to_numeric(df['Sleep'], errors='coerce')
   Sleep
0   1.00
1   2.00
2   3.00
3    NaN
4   4.00
5    NaN
6   5.50
7   0.75
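The replace-then-eval trick works because swapping the backtick for + turns each fraction string into an ordinary arithmetic expression that pd.eval can evaluate:

```python
import pandas as pd

# '5`1/2' becomes '5+1/2', and '`3/4' becomes '+3/4'
print(pd.eval('5+1/2'))  # 5.5
print(pd.eval('+3/4'))   # 0.75
```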
Like this:
mask = df['Sleep'].str.contains('`', na=False)
df.loc[mask, 'Sleep'] = df.loc[mask, 'Sleep'].str.replace('`', '+').map(pd.eval)
def convert_to_float(frac_str):
    try:
        return float(frac_str)
    except ValueError:
        num, denom = frac_str.split('/')
        try:
            leading, num = num.split('`')
            whole = float(leading)
        except ValueError:
            whole = 0
        frac = float(num) / float(denom)
        return whole - frac if whole < 0 else whole + frac

df["Sleep"] = df["Sleep"].apply(convert_to_float)
The title may not be very clear, but with an example I hope it would make some sense.
I would like to create an output column (called "outputTics") and put a 1 in it 0.21 seconds after a 1 appears in the "inputTics" column.
As you can see, no row falls exactly 0.21 seconds after another, so I put the 1 in the outputTics column two rows later: for example, at index 3 there is a 1 at 11.4 seconds, so I put a 1 in the output column at 11.6 seconds.
If there is already another 1 in the "inputTics" column within 0.21 seconds, do not put a 1 in the output column: an example is index 1 in the input column.
Here is an example of the red column I would like to create.
Here is the code to create the dataframe :
A = pd.DataFrame({"Timestamp":[11.1,11.2,11.3,11.4,11.5,11.6,11.7,11.8,11.9,12.0,12.1,12.2,12.3,12.4,12.5,12.6,12.7,12.8,12.9,13.0],
"inputTics":[0,1,0,1,0,0,0,1,0,0,0,1,1,0,0,0,0,1,1,1],
"outputTics":[0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0]})
You can use pd.Timedelta if you want to avoid Python's floating-point rounding issues.
Create the column with zeros.
df['outputTics'] = 0
Define a function set_output_tic in the following manner
def set_output_tic(row):
    if row['inputTics'] == 0:
        return 0
    t = row['Timestamp']
    # skip if another input tic follows within 0.21 seconds
    near = df[(df['Timestamp'] > t) & (df['Timestamp'] <= t + 0.21)]
    if near['inputTics'].any():
        return 0
    # otherwise flag the first row more than 0.11 seconds later
    later = df.index[df['Timestamp'] > t + 0.11]
    if len(later) > 0:
        df.loc[later[0], 'outputTics'] = 1
    return 0
then call the above function using df.apply
temp = df.apply(set_output_tic, axis = 1) # temp is practically useless
This was actually kinda tricky, but by playing with indices in numpy you can do it.
# Set timestamp as index for a moment
A = A.set_index(['Timestamp'])
# Find the timestamp indices of inputTics and add your 0.11
input_indices = A[A['inputTics']==1].index + 0.11
# Iterate through the indices and find the indices to update outputTics
output_indices = []
for ii in input_indices:
    # Compare indices to full dataframe's timestamps
    # and return index of nearest timestamp
    oi = np.argmax((A.index - ii) >= 0)
    output_indices.append(oi)
# Create column of output ticks with 1s in the right place
output_tics = np.zeros(len(A))
output_tics[output_indices] = 1
# Add it to dataframe
A['outputTics'] = output_tics
# Add condition that if inputTics is 1, outputTics is 0
A['outputTics'] = A['outputTics'] - A['inputTics']
# Clean up negative values
A.loc[A['outputTics'] < 0, 'outputTics'] = 0
# The first row becomes 1 because of indexing; change to 0
A = A.reset_index()
A.at[0, 'outputTics'] = 0
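As an aside, the loop over input_indices can be collapsed into a single np.searchsorted call, which also sidesteps the argmax-returns-0 quirk for targets past the end of the frame. A sketch on the sample frame (ts, is_tic, and pos are names I'm introducing) that reproduces this answer's placement logic:

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({"Timestamp": [11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7, 11.8, 11.9, 12.0,
                                12.1, 12.2, 12.3, 12.4, 12.5, 12.6, 12.7, 12.8, 12.9, 13.0],
                  "inputTics": [0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1]})

ts = A['Timestamp'].to_numpy()
is_tic = A['inputTics'].to_numpy() == 1

# first row index whose timestamp is at or after each tic time + 0.11
pos = np.searchsorted(ts, ts[is_tic] + 0.11)
pos = pos[pos < len(ts)]          # drop targets beyond the last row

out = np.zeros(len(ts), dtype=int)
out[pos] = 1
out[is_tic] = 0                   # rows that are themselves input tics stay 0
A['outputTics'] = out
```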
I have a custom function that takes in a 8-character identifier (CUSIP), and based on some logic generates the 9th character (check bit). I want to apply this function to a dataframe consisting of 8-char identifiers and return back the dataframe with the full 9-char string.
e.g. a list of 2 8-char cusips:
list1 = [[ '912810SE',
'912810SF']]
pd1 = pd.DataFrame(list1)
print(pd1.apply(gen_cusip_checkbit))
I am expecting 9 and 6; however, I am getting 4 and 2 when applying the function to the df. Also, this should loop 8 times in the function, but when applied to the df it loops 36 times.
This is the function:
cusip_alpha = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'  # alphabet string for mapping (A=10, B=11, ...)

def gen_cusip_checkbit(cusip):
    cusip = str(cusip).upper()
    sumnum = 0
    for i in range(len(cusip)):
        if cusip[i].isnumeric():
            val = int(cusip[i])
        else:
            val = cusip_alpha.find(cusip[i]) + 10
        if i % 2 != 0:
            val *= 2
        val = (val % 10) + (val // 10)  # fold two-digit values in every case, not only doubled ones
        sumnum += val
    return str((10 - (sumnum % 10)) % 10)
So it looks like when you do:
pd1.apply(gen_cusip_checkbit)
the variable sent to the function is a whole column, whose string representation is:
0    912810SE
Name: 0, dtype: object
That string representation is 36 characters long, which explains why your loop has 36 iterations.
If you run apply against the column instead:
pd1[0].apply(gen_cusip_checkbit)
the variable that gets sent is just:
912810SE
which should give you the right output.
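Putting both fixes together (running apply on the column, and folding two-digit letter values even when they are not doubled), a self-contained sketch:

```python
import pandas as pd

cusip_alpha = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'  # A=10, B=11, ...

def gen_cusip_checkbit(cusip):
    cusip = str(cusip).upper()
    sumnum = 0
    for i, ch in enumerate(cusip):
        val = int(ch) if ch.isnumeric() else cusip_alpha.find(ch) + 10
        if i % 2 != 0:
            val *= 2
        val = (val % 10) + (val // 10)  # fold two-digit values every time
        sumnum += val
    return str((10 - (sumnum % 10)) % 10)

pd1 = pd.DataFrame(['912810SE', '912810SF'], columns=['cusip8'])
pd1['cusip9'] = pd1['cusip8'] + pd1['cusip8'].apply(gen_cusip_checkbit)
print(pd1['cusip9'].tolist())  # ['912810SE9', '912810SF6']
```

This returns the expected 9 and 6 for the two sample identifiers.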
I have a dataset that I am trying to split into 2 smaller dataframes called test and train. The original dataset has two columns "patient_nbr" and "encounter_id". These columns all have 6 digit values.
How can I go through this dataframe and add up all the digits in those two columns? For example, if the values in the first row are 123456 and 123456, I need to add 1+2+3+4+5+6+1+2+3+4+5+6. The sum is used to determine whether that row goes into test or train. If it is even, test. If it is odd, train.
Below is what I tried, but it is very slow. I turned the two columns into numpy arrays to break down and add up the digits, summed those arrays element-wise, and looped through the result to decide which dataframe each row should go into.
with ZipFile('dataset_diabetes.zip') as zf:
    with zf.open('dataset_diabetes/diabetic_data.csv', 'r') as f:
        df = pd.read_csv(f)

nums1 = []
nums2 = []
encounters = df["encounter_id"].values
for i in range(len(encounters)):
    result = 0
    while encounters[i] > 0:
        rem = encounters[i] % 10
        result = result + rem
        encounters[i] = int(encounters[i] / 10)
    nums1.append(result)
patients = df["patient_nbr"].values
for i in range(len(patients)):
    result = 0
    while patients[i] > 0:
        rem = patients[i] % 10
        result = result + rem
        patients[i] = int(patients[i] / 10)
    nums2.append(result)
nums = np.asarray(nums1) + np.asarray(nums2)
df["num"] = nums
# nums = df["num"].values
train = pd.DataFrame()
test = pd.DataFrame()
for i in range(len(nums)):
    if int(nums[i] % 2) == 0:
        # goes to train
        train = train.append(df.iloc[i])
    else:
        # goes to test
        test = test.append(df.iloc[i])
You can do it by playing with astype: cast both columns to str and sum over the rows (which concatenates the two strings), then str.split with expand=True to get one digit per column, select the digit columns, cast them to float, and sum again per row.
#dummy example
df = pd.DataFrame({'patient_nbr':[123456, 123457, 123458],
'encounter_id':[123456, 123456, 123457]})
#create num
df['num'] = df[['patient_nbr', 'encounter_id']].astype(str).sum(axis=1)\
.astype(str).str.split('', expand=True)\
.loc[:,1:12].astype(float).sum(axis=1)
print (df)
patient_nbr encounter_id num
0 123456 123456 42.0
1 123457 123456 43.0
2 123458 123457 45.0
then use this column to create a mask with even as False and odd as True
mask = (df['num']%2).astype(bool)
train = df.loc[~mask, :] #train is the even
test = df.loc[mask, :] #test is the odd
print (test)
patient_nbr encounter_id num
1 123457 123456 43.0
2 123458 123457 45.0
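If you want to skip the string expansion entirely, the digit sums can also be computed with pure integer arithmetic on whole Series; digit_sum below is a helper name I'm introducing, shown on the same dummy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'patient_nbr': [123456, 123457, 123458],
                   'encounter_id': [123456, 123456, 123457]})

def digit_sum(col):
    s = col.astype('int64')                 # astype returns a copy, df stays intact
    total = pd.Series(0, index=s.index)
    while (s > 0).any():
        total += s % 10                     # peel off the last digit of every value
        s //= 10
    return total

df['num'] = digit_sum(df['patient_nbr']) + digit_sum(df['encounter_id'])
mask = (df['num'] % 2).astype(bool)
train, test = df.loc[~mask], df.loc[mask]   # even -> train, odd -> test, as above
print(df['num'].tolist())  # [42, 43, 45]
```

The loop runs once per digit (about 7 iterations for 6-7 digit IDs) rather than once per row, so it stays fast on large frames.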
I've been asked to do the following:
Using a while loop, you will write a program which will produce the following mathematical sequence:
1 * 9 + 2 = 11 (you will compute this number)
12 * 9 + 3 = 111
123 * 9 + 4 = 1111
Then your program should run as long as the results contain only "1"s. You can build your numbers as strings, then convert them to ints before calculation. Then you can convert the result back to a string to see if it contains all "1"s.
Sample Output:
1 * 9 + 2 = 11
12 * 9 + 3 = 111
123 * 9 + 4 = 1111
1234 * 9 + 5 = 11111
Here is my code:
def main():
    Current = 1
    Next = 2
    Addition = 2
    output = funcCalculation(Current, Addition)
    while (verifyAllOnes(output) == True):
        print(output)
        # string concat to get new current number
        Current = int(str(Current) + str(Next))
        Addition += 1
        Next += 1
        output = funcCalculation(Current, Next)

def funcCalculation(a, b):
    return (a * 9 + b)

def verifyAllOnes(val):
    Num_str = str(val)
    for ch in Num_str:
        if (str(ch) != "1"):
            return False
    return True

main()
The bug is that the formula isn't printing next to the series of ones on each line. What am I doing wrong?
Pseudo-code:
a = 1
b = 2
result = a * 9 + b
while string representation of result contains only 1s:
    a = concat a with the old value of b, as a number
    b = b + 1
    result = a * 9 + b
This can be literally converted into Python code.
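A literal translation might look like the sketch below; it also prints the full formula on each line, which is what the original code was missing:

```python
a = 1
b = 2
result = a * 9 + b
lines = []
while set(str(result)) == {'1'}:       # result contains only 1s
    lines.append(f"{a} * 9 + {b} = {result}")
    a = int(str(a) + str(b))           # concat a with the old value of b
    b += 1
    result = a * 9 + b
print('\n'.join(lines))
```

Under the concatenation interpretation the loop stops after 123456789 * 9 + 10 = 1111111111, because the next step produces 12345678910 and the result is no longer all 1s.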
Testing all ones
Well, for starters, here is one easy way to check that the value is all ones:
def only_ones(n):
    n_str = str(n)
    return set(n_str) == set(['1'])
You could do something more "mathy", but I'm not sure that it would be any faster. It would, however, much more easily generalize to other bases (than 10) if that's something you were interested in:
def only_ones(n):
    return (n % 10 == 1) and (n == 1 or only_ones(n // 10))
Uncertainty about how to generate the specific recurrence relation...
As for actually solving the problem though, it's actually not clear what the sequence should be.
What comes next?
123456
1234567
12345678
123456789
?
Is it 1234567890? Or 12345678910? Or 1234567900?
Without answering this, it's not possible to solve the problem in any general way (unless in fact the 111..s
terminate before you get to this issue).
I'm going to go with the most mathematically appealing assumption, which is that the value in question is the
sum of all the 11111... values before it (note that 12 = 11 + 1, 123 = 111 + 11 + 1, 1234 = 1111 + 111 + 11 + 1, etc...).
A solution
In this case, you could do something along these lines:
def sequence_gen():
    a = 1
    b = 1
    i = 2
    while only_ones(b):
        yield b
        b = a * 9 + i
        a += b
        i += 1
Notice that I've put this in a generator to make it easier to grab only as many results from the sequence as you actually want. It's entirely possible that this is an infinite sequence, so running the while loop by itself might take a while ;-)
s = sequence_gen()
next(s)  #=> 1
next(s)  #=> 11
A generator gives you a lot of flexibility for things like this. For instance, you could grab the first 10 values of the sequence using itertools.islice:
import itertools as it
s = sequence_gen()
xs = list(it.islice(s, 10))
print(xs)