I am working with an excel file which I read into python as a pandas dataframe.
One of the columns contains responses of how many hours a person slept.
A sample column is as follows:
df['Sleep'] = [1, 2, 3, 'Blank', 4, 'Blank', '5`1/2', '`3/4']
My objective is to clean this data and get it all into a single datatype with NaN for Blanks. The blanks were taken care of using:
df['Sleep'] = df['Sleep'].replace('Blank', np.nan)
My question is how can I convert something like 5`1/2 to 5.5? All fractions in the dataset start with the backtick symbol.
Use loc with a boolean mask (passing na=False to str.contains, because the column has mixed types) together with pd.eval:
m = df['Sleep'].str.contains('`', na=False)
df.loc[m, 'Sleep'] = df.loc[m, 'Sleep'].str.replace('`', '+').apply(pd.eval)
df['Sleep'] = pd.to_numeric(df['Sleep'], errors='coerce')
Sleep
0 1
1 2
2 3
3 Blank
4 4
5 Blank
6 5.5
7 0.75
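Like this, applying eval element-wise to the rows that contain a backtick:
m = df['Sleep'].str.contains("`", na=False)
df.loc[m, 'Sleep'] = df.loc[m, 'Sleep'].str.replace("`", "+").map(eval)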
def convert_to_float(frac_str):
    try:
        return float(frac_str)
    except ValueError:
        num, denom = frac_str.split('/')
        try:
            leading, num = num.split('`')
            whole = float(leading)
        except ValueError:
            whole = 0
        frac = float(num) / float(denom)
        return whole - frac if whole < 0 else whole + frac
df["Sleep"] = df["Sleep"].apply(convert_to_float)
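As a variant of the function above, the following sketch parses the backtick notation with the standard library's fractions.Fraction instead of manual division or eval. The helper name parse_sleep and the sample Series are assumptions made for illustration, and the sketch treats the literal string 'Blank' as missing.

```python
from fractions import Fraction

import numpy as np
import pandas as pd


def parse_sleep(value):
    """Convert 1, '5`1/2', '`3/4' or 'Blank' to a float (NaN for blanks)."""
    if isinstance(value, str):
        if value == 'Blank':
            return np.nan
        # Text before the backtick is the whole part, text after it the fraction
        whole, _, frac = value.partition('`')
        total = Fraction(whole) if whole else Fraction(0)
        if frac:
            total += Fraction(frac)
        return float(total)
    return float(value)


sleep = pd.Series([1, 2, 3, 'Blank', 4, 'Blank', '5`1/2', '`3/4'])
cleaned = sleep.map(parse_sleep)
print(cleaned)
```

Because every element comes back as a float or NaN, the resulting Series should end up with a single numeric dtype without a separate pd.to_numeric step.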
I have a dataframe with a column of phone numbers. I wrote the following function to replace any NaN with an empty string, and then prepend a '+' and a '1' to any phone numbers that need them.
def fixCampaignerPhone(phone):
    if phone.isnull():
        phone = ''
    phone = str(phone)
    if len(phone) == 10:
        phone = ('1' + phone)
    if len(phone) > 1:
        phone = ('+' + phone)
    return phone
I tried to apply this function to a column of a dataframe as follows:
df['phone'] = df.apply(lambda row: fixCampaignerPhone(row['phone']), axis =1)
My function was not correctly identifying and replacing NaN values; it failed with the error "object of type 'float' has no len()". I worked around it with a .fillna() on a separate line, but I would like to understand why this didn't work. The function works if I manually pass a NaN value, so I assume it has to do with pandas passing the argument as some kind of float object rather than a regular float.
EDIT: full working code with sample data for debugging.
import pandas as pd
import numpy as np
def fixCampaignerPhone(phone):  # adds + and 1 to front of phone numbers if necessary
    if phone.isnull():
        phone = ''
    phone = str(phone)
    if len(phone) == 10:
        phone = ('1' + phone)
    if len(phone) > 1:
        phone = ('+' + phone)
    return phone
d = {0: float("NaN"), 1:"2025676789"}
sampledata = pd.Series(data = d, index = [0 , 1])
sampledata.apply(lambda row: fixCampaignerPhone(row))
EDIT 2:
Changing phone.isnull() to pd.isna(phone) works for my sample data, but not for my production data set, so there must be a quirk somewhere in my data. For context, the phone numbers in my production dataset must be either NaN, an 11-digit string starting with 1, or a 10-digit string. However, when I run my lambda function on my production dataset, I get the error "object of type 'float' has no len()", so somehow some floats/NaNs are slipping past my if statement.
From this imaginary DataFrame :
>>> import pandas as pd
>>> from io import StringIO
>>> df = pd.read_csv(StringIO("""
A,phone
L,3453454564
L,345345
R,345345
h,3
A,345345
L,345345
R,3453434543
R,345345
R,345345
R,345345
"""), sep=',')
>>> df
A phone
0 L 3453454564
1 L 345345
2 R 345345
3 h 3
4 A 345345
5 L 345345
6 R 3453434543
7 R 345345
8 R 345345
9 R 345345
We can use select from numpy to build our if segment and get the expected result :
import numpy as np
df['phone'] = df['phone'].astype(str)
condlist = [df['phone'].str.len() == 10,
df['phone'].str.len() > 1]
# a 10-digit number gets both the '1' and the '+', as in the question's function
choicelist = ['+1' + df['phone'],
              '+' + df['phone']]
df['phone'] = np.select(condlist, choicelist, default='')
Output :
A phone
0 L +13453454564
1 L +345345
2 R +345345
3 h
4 A +345345
5 L +345345
6 R +13453434543
7 R +345345
8 R +345345
9 R +345345
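One caveat with the approach above: astype(str) turns a missing value into the string 'nan', which has length 3 and would therefore receive a '+' prefix. The following is a minimal sketch that fills NaN first, assuming the phone numbers are already stored as strings. The sample Series and the names phone, digits and fixed are made up for illustration, and the '+1' prefix follows the function in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one missing value, one 10-digit and one 11-digit number
phone = pd.Series([np.nan, "2025676789", "12025676789"], dtype=object)

# Replace NaN with an empty string before measuring lengths
digits = phone.fillna("")

fixed = np.select(
    [digits.str.len() == 10, digits.str.len() > 1],
    ["+1" + digits, "+" + digits],
    default="",
)
print(fixed)
```

The first matching condition wins in np.select, so the 10-digit case has to be listed before the more general length check.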
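Here is a working piece of code. You have to use pd.isnull(phone) instead of phone.isnull(), because each value passed to the function is a scalar (a float NaN or a str), and scalars have no isnull method: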
import pandas as pd
import numpy as np
def fixCampaignerPhone(phone):  # adds + and 1 to front of phone numbers if necessary
    if pd.isnull(phone):
        phone = ''
    phone = str(phone)
    if len(phone) == 10:
        phone = ('1' + phone)
    if len(phone) > 1:
        phone = ('+' + phone)
    return phone
d = {0: float("NaN"), 1:"2025676789"}
sampledata = pd.Series(data = d, index = [0 , 1])
r=sampledata.apply(lambda row: fixCampaignerPhone(row))
print(r)
result is:
0
1 +12025676789
dtype: object
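For the production-data case described in EDIT 2, one way to look for the values that slip past the check is to count the Python types actually stored in the column. This is only a diagnostic sketch: the sample Series below is invented, and the real column may contain other surprises (for example numbers that were parsed as floats instead of strings).

```python
import pandas as pd

# Hypothetical column mixing NaN, a proper string, a float and None
phones = pd.Series(
    [float("nan"), "2025676789", 12025676789.0, None], dtype=object
)

# How many values of each Python type does the column hold?
type_counts = phones.map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# Non-missing entries that are not strings are the likely culprits
is_str = phones.map(lambda v: isinstance(v, str))
suspects = phones[phones.notna() & ~is_str]
print(suspects)
```

Any row listed in suspects would reach len() as a non-string if the function does not convert it with str() first.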
I have a dataframe of part numbers stored as object. Each part number contains a 3-character value code in one of the following formats:
Either 1R2, where the R is the decimal separator (1.2)
Or only digits, where the first 2 are significant and the 3rd is the number of zeros that follow:
101 = 100
010 = 1
223 = 22000
476 = 47000000
My dataframe (important are positions 5~7):
MATNR
0 xx01B101KO3XYZC
1 xx03C010CA3GN5T
2 xx02L1R2CA3ANNR
Below code works fine for the 1R2 case and converts object to float64.
But I am stuck with getting the 2 significant numbers together with the number of 0s.
value_pos1 = 5
value_pos2 = 6
value_pos3 = 7
df['Value'] = pd.to_numeric(np.where(df['MATNR'].str.get(value_pos2)=='R',
df['MATNR'].str.get(value_pos1) + '.' + df['MATNR'].str.get(value_pos3),
df['MATNR'].str.slice(start=value_pos1, stop=value_pos3) + df['MATNR'].str.get(value_pos3)))
Result
MATNR object
Cap pF float64
dtype: object
Index(['MATNR', 'Value'], dtype='object')
MATNR Value
0 xx01B101KO3XYZC 101.0
1 xx03C010CA3GN5T 10.0
2 xx02L1R2CA3ANNR 1.2
It should be
MATNR Value
0 xx01B101KO3XYZC 100.0
1 xx03C010CA3GN5T 1.0
2 xx02L1R2CA3ANNR 1.2
I then tried the following, which fails with errors. On top of that, it gives a wrong value when position 3 is 0, because pow(10, 0) is 1 instead of 0.
df['Value'] = pd.to_numeric(np.where(df['MATNR'].str.get(value_pos2)=='R',
df['MATNR'].str.get(Value_pos1) + '.' + df['MATNR'].str.get(value_pos3),
df['MATNR'].str.slice(start=value_pos1, stop=value_pos3) + str(pow(10, pd.to_numeric(df['MATNR'].str.get(value_pos3))))))
Do you have an idea?
If I have understood your problem correctly, defining a method and applying it to all the values of the column seems most intuitive. The method takes a str input and returns a float number.
Here is a snippet of what the simple method will entail.
def get_number(strrep):
    if not strrep or len(strrep) < 8:
        return 0.0
    useful_str = strrep[5:8]
    if useful_str[1] == 'R':
        return float(useful_str[0] + '.' + useful_str[2])
    else:
        zeros = '0' * int(useful_str[2])
        return float(useful_str[0:2] + zeros)
Then you could simply create a new column with the numeric conversion of the strings. The easiest way possible is using list comprehension:
df['Value'] = [get_number(x) for x in df['MATNR']]
Not sure where the bug in your code is, but another option that I tend to use when creating a new column based on other columns is pandas' apply function:
def create_value_col(row):
    if row['MATNR'][value_pos2] == 'R':
        # e.g. '1R2' -> 1.2; convert to float so the column has a single dtype
        val = float(row['MATNR'][value_pos1] + '.' + row['MATNR'][value_pos3])
    else:
        val = (int(row['MATNR'][value_pos1]) * 10 +
               int(row['MATNR'][value_pos2])) * 10 ** int(row['MATNR'][value_pos3])
    return val
df['Value'] = df.apply(create_value_col, axis='columns')
This way, you can create a function that processes the data however you want and then apply it to every row and add the resulting series to your dataframe.
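For larger frames, the same rule can also be written without a Python-level function. The following is a sketch under the assumption that the value code always sits at positions 5 to 7, as in the sample data; the intermediate names code, is_r, decimal, mantissa and exponent are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'MATNR': ['xx01B101KO3XYZC', 'xx03C010CA3GN5T', 'xx02L1R2CA3ANNR']}
)

# Three-character value code, e.g. '101', '010' or '1R2'
code = df['MATNR'].str[5:8]
is_r = code.str[1] == 'R'

# 'R' case: treat R as the decimal separator
decimal = pd.to_numeric(code.str.replace('R', '.', regex=False), errors='coerce')

# Digit case: two significant digits times ten to the power of the third digit
mantissa = pd.to_numeric(code.str[:2], errors='coerce')
exponent = pd.to_numeric(code.str[2], errors='coerce')

df['Value'] = np.where(is_r, decimal, mantissa * 10.0 ** exponent)
print(df)
```

Both branches are computed for every row and np.where picks the relevant one, so errors='coerce' is used to let the unused branch fall back to NaN instead of raising.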
The title may not be very clear, but with an example I hope it would make some sense.
I would like to create an output column (called "outputTics"), and put a 1 in it 0.21 seconds after a 1 appears in the "inputTics" column.
As you can see, there is no value exactly 0.21 seconds after another value, so I put the 1 in the outputTics column two rows later. For example, at index 3 there is a 1 at 11.4 seconds, so I put a 1 in the output column at 11.6 seconds.
If there is another 1 in the "inputTics" column within those 0.21 seconds, do not put a 1 in the output column; an example is index 1 in the input column.
Here is an example of the red column I would like to create.
Here is the code to create the dataframe :
A = pd.DataFrame({"Timestamp":[11.1,11.2,11.3,11.4,11.5,11.6,11.7,11.8,11.9,12.0,12.1,12.2,12.3,12.4,12.5,12.6,12.7,12.8,12.9,13.0],
"inputTics":[0,1,0,1,0,0,0,1,0,0,0,1,1,0,0,0,0,1,1,1],
"outputTics":[0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0]})
You can use pd.Timedelta if you want to avoid floating-point rounding problems with the timestamps.
Create the column with zeros.
df['outputTics'] = 0
Define a function set_output_tic in the following manner
def set_output_tic(row):
    if row['inputTics'] == 0:
        return 0
    # rows after this one, up to 0.21 seconds later (the delay from the question)
    t = row['Timestamp'] + 0.21
    window = df[(df['Timestamp'] > row['Timestamp']) & (df['Timestamp'] <= t)]
    # set the output on the last row of the window, unless another
    # input tic occurs inside the window
    if len(window) > 0 and (window['inputTics'] == 0).all():
        df.loc[window.index[-1], 'outputTics'] = 1
    return 0
then call the above function using df.apply
temp = df.apply(set_output_tic, axis = 1) # temp is practically useless
This was actually kinda tricky, but by playing with indices in numpy you can do it.
# Set timestamp as index for a moment
A = A.set_index(['Timestamp'])
# Find the timestamp indices of inputTics and add your 0.11
input_indices = A[A['inputTics']==1].index + 0.11
# Iterate through the indices and find the indices to update outputTics
output_indices = []
for ii in input_indices:
# Compare indices to full dataframe's timestamps
# and return index of nearest timestamp
oi = np.argmax((A.index - ii)>=0)
output_indices.append(oi)
# Create column of output ticks with 1s in the right place
output_tics = np.zeros(len(A))
output_tics[output_indices] = 1
# Add it to dataframe
A['outputTics'] = output_tics
# Add condition that if inputTics is 1, outputTics is 0
A['outputTics'] = A['outputTics'] - A['inputTics']
# Clean up negative values (only in the outputTics column)
A.loc[A['outputTics'] < 0, 'outputTics'] = 0
# The first row becomes 1 because of indexing; change to 0
A = A.reset_index()
A.at[0, 'outputTics'] = 0
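A loop-free variant is sketched below using np.searchsorted. It rests on one reading of the question, which is an assumption: the output tic goes on the last row within 0.21 seconds of an input tic, and it is dropped when another input tic falls in between. The column name computedTics and the intermediate names are made up so that the result can be compared with the expected outputTics column.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({
    "Timestamp": [11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7, 11.8, 11.9, 12.0,
                  12.1, 12.2, 12.3, 12.4, 12.5, 12.6, 12.7, 12.8, 12.9, 13.0],
    "inputTics": [0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1],
    "outputTics": [0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
})

delay = 0.21  # seconds, taken from the question
ts = A["Timestamp"].to_numpy()
tics = A["inputTics"].to_numpy()

# Positions of the input tics and of the last row within `delay` of each
src = np.flatnonzero(tics == 1)
dst = np.searchsorted(ts, ts[src] + delay, side="right") - 1

# Number of input tics strictly after src and up to dst, via a running count
running = np.cumsum(tics)
keep = (dst > src) & (running[dst] - running[src] == 0)

out = np.zeros(len(A), dtype=int)
out[dst[keep]] = 1
A["computedTics"] = out
print(A)
```

This relies on the timestamps being sorted in ascending order, which np.searchsorted requires.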
I have a dataset that I am trying to split into 2 smaller dataframes called test and train. The original dataset has two columns "patient_nbr" and "encounter_id". These columns all have 6 digit values.
How can I go through this dataframe and add up all the digits in those two columns? For example, if in the first row of the dataframe the values are 123456 and 123456, I need to add 1+2+3+4+5+6+1+2+3+4+5+6. The sum is used to determine if that row goes into test or train. If it is even, test. If it is odd, train.
Below is what I tried, but it is very slow. I turned the two columns I need into two numpy arrays in order to break down and add up the digits. I added those numpy arrays to get one, and looped through that to determine which dataframe each row should go in.
with ZipFile('dataset_diabetes.zip') as zf:
    with zf.open('dataset_diabetes/diabetic_data.csv','r') as f:
        df = pd.read_csv(f)
nums1 = []
nums2 = []
encounters = df["encounter_id"].values
for i in range(len(encounters)):
    result = 0
    while encounters[i] > 0:
        rem = encounters[i] % 10
        result = result + rem
        encounters[i] = int(encounters[i]/10)
    nums1.append(result)
patients = df["patient_nbr"].values
for i in range(len(patients)):
    result = 0
    while patients[i] > 0:
        rem = patients[i] % 10
        result = result + rem
        patients[i] = int(patients[i]/10)
    nums2.append(result)
nums = np.asarray(nums1) + np.asarray(nums2)
df["num"] = nums
# nums = df["num"].values
train = pd.DataFrame()
test = pd.DataFrame()
for i in range(len(nums)):
    if int(nums[i] % 2) == 0:
        # goes to train
        train.append(df.iloc[i])
    else:
        # goes to test
        test.append(df.iloc[i])
You can do it with astype: convert both columns from int to str and sum them across each row (which concatenates the two strings), then use str.split with expand=True to put each digit in its own column, select the digit columns, convert each digit to float, and sum again per row.
#dummy example
df = pd.DataFrame({'patient_nbr':[123456, 123457, 123458],
'encounter_id':[123456, 123456, 123457]})
#create num
df['num'] = df[['patient_nbr', 'encounter_id']].astype(str).sum(axis=1)\
.astype(str).str.split('', expand=True)\
.loc[:,1:12].astype(float).sum(axis=1)
print (df)
patient_nbr encounter_id num
0 123456 123456 42.0
1 123457 123456 43.0
2 123458 123457 45.0
then use this column to create a mask with even as False and odd as True
mask = (df['num']%2).astype(bool)
train = df.loc[~mask, :] #train is the even
test = df.loc[mask, :] #test is the odd
print (test)
patient_nbr encounter_id num
1 123457 123456 43.0
2 123458 123457 45.0
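If the string round trip turns out to be a bottleneck, the digit sum can also be computed arithmetically on whole arrays. The following is a sketch only: the helper name digit_sum is hypothetical, it assumes non-negative integer IDs, and the names even_rows and odd_rows are left neutral on purpose so they can be mapped to test and train as needed.

```python
import numpy as np
import pandas as pd


def digit_sum(values):
    """Sum the decimal digits of each non-negative integer in `values`."""
    remaining = np.asarray(values, dtype=np.int64).copy()
    total = np.zeros_like(remaining)
    # Peel off the last digit of every element until nothing is left
    while (remaining > 0).any():
        total += remaining % 10
        remaining //= 10
    return total


df = pd.DataFrame({'patient_nbr': [123456, 123457, 123458],
                   'encounter_id': [123456, 123456, 123457]})

df['num'] = digit_sum(df['patient_nbr']) + digit_sum(df['encounter_id'])
is_even = df['num'] % 2 == 0
even_rows = df[is_even]
odd_rows = df[~is_even]
print(df)
```

The loop runs once per digit position rather than once per row, so for 6-digit IDs it iterates six times regardless of how many rows the frame has.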
I have this long data. I would like to split it into groups of 30 and save each group separately.
The data prints like this:
A292340
A291630
A278240
A267770
A267490
A261250
A261110
A253150
A252400
A253250
A243890
A243880
A236350
A233740
A233160
A225800
A225060
A225050
A225040
A225130
A219900
A204450
A204480
A204420
A196030
A196220
A167860
A152500
A123320
A122630
.
This is a fairly simple question, but I need your help.
Thank you.
(Also, how can I make a list out of one of the printed results? By list addition?)
I believe you need to create a MultiIndex by taking np.arange over the length of the DataFrame modulo N and floor-divided by N, and then unstack.
If the length is not a multiple of N (e.g. 30 % 12 != 0), the last column is only partly filled and the missing values are shown as None:
N = 12
r = np.arange(len(df))
df.index = [r % N, r // N]
df = df['col'].unstack()
print (df)
0 1 2
0 A292340 A236350 A196030
1 A291630 A233740 A196220
2 A278240 A233160 A167860
3 A267770 A225800 A152500
4 A267490 A225060 A123320
5 A261250 A225050 A122630
6 A261110 A225040 None
7 A253150 A225130 None
8 A252400 A219900 None
9 A253250 A204450 None
10 A243890 A204480 None
11 A243880 A204420 None
Setup:
d = {'col': ['A292340', 'A291630', 'A278240', 'A267770', 'A267490', 'A261250', 'A261110', 'A253150', 'A252400', 'A253250', 'A243890', 'A243880', 'A236350', 'A233740', 'A233160', 'A225800', 'A225060', 'A225050', 'A225040', 'A225130', 'A219900', 'A204450', 'A204480', 'A204420', 'A196030', 'A196220', 'A167860', 'A152500', 'A123320', 'A122630']}
df = pd.DataFrame(d)
print (df.head())
col
0 A292340
1 A291630
2 A278240
3 A267770
4 A267490
If you don't have Pandas and Numpy modules you can use this:
Setup:
long_list = ['A292340', 'A291630', 'A278240', 'A267770', 'A267490', 'A261250', 'A261110', 'A253150', 'A252400',
'A253250', 'A243890', 'A243880', 'A236350', 'A233740', 'A233160', 'A225800', 'A225060', 'A225050',
'A225040', 'A225130', 'A219900', 'A204450', 'A204480', 'A204420', 'A196030', 'A196220', 'A167860',
'A152500', 'A123320', 'A122630', 'A292340', 'A291630', 'A278240', 'A267770', 'A267490', 'A261250',
'A261110', 'A253150', 'A252400', 'A253250', 'A243890', 'A243880', 'A236350', 'A233740', 'A233160',
'A225800', 'A225060', 'A225050', 'A225040', 'A225130', 'A219900', 'A204450', 'A204480', 'A204420',
'A196030', 'A196220', 'A167860', 'A152500', 'A123320', 'A122630']
Code:
number_elements_in_sublist = 30
sublists = []
sublists.append([])
sublist_index = 0
for index, element in enumerate(long_list):
    sublists[sublist_index].append(element)
    if index > 0:
        if (index+1) % number_elements_in_sublist == 0:
            if index == len(long_list)-1:
                break
            sublists.append([])
            sublist_index += 1
for index, sublist in enumerate(sublists):
    print("Sublist Nr." + str(index+1))
    for element in sublist:
        print(element)
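The same grouping can be sketched more compactly with list slicing. The helper name chunked and the generated stand-in data are assumptions for illustration; any list can be passed in place of codes.

```python
def chunked(items, size):
    """Split `items` into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Made-up stand-in data: 65 codes in the same 'A' + 6 digits format
codes = [f"A{n:06d}" for n in range(65)]

groups = chunked(codes, 30)
for number, group in enumerate(groups, start=1):
    print(f"Sublist Nr.{number}: {len(group)} elements")
```

Each element of groups is an ordinary list, so it can be written to its own file or concatenated with another group using +.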