Multi-part manipulation post str.split() Pandas - python

I have a subset of data (single column) we'll call ID:
ID
0 07-1401469
1 07-89556629
2 07-12187595
3 07-381962
4 07-99999085
The current format is (usually) YY-[up to 8-character ID].
The desired output format is a more uniformed YYYY-xxxxxxxx:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
Knowing that I've done padding in the past, the thought process was to combine
df['id'].str.split('-').str[0].apply(lambda x: '{0:20>4}'.format(x))
df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x))
However I ran into a few problems:
The '20' in '{0:20>4}' must be a singular value and not a string
Trying to do something like the below just results in df['id'] taking the properties of the last lambda & trying any other way to combine multiple apply/lambdas just didn't work. I started going down the pad left/right route but that seemed to be taking be backwards.
df['id'] = (df['id'].str.split('-').str[0].apply(lambda x: '{0:X>4}'.format(x)).str[1].apply(lambda x: '{0:0>8}'.format(x)))
The current solution I have (but HATE because its long, messy, and just not clean IMO) is:
df['idyear'] = df['id'].str.split('-').str[0].apply(lambda x: '{:X>4}'.format(x)) # Split on '-' and pad with X
df['idyear'] = df['idyear'].str.replace('XX', '20') # Replace XX with 20 to conform to YYYY
df['idnum'] = df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x)) # Pad 0s up to 8 digits
df['id'] = df['idyear'].map(str) + "-" + df['idnum'] # Merge idyear and idnum to remake id
del df['idnum'] # delete extra
del df['idyear'] # delete extra
Which does work
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
But my questions are
Is there a way to run multiple apply() functions in a single line so I'm not making temp variables
Is there a better way than replacing 'XX' for '20'
I feel like this entire code block can be compress to 1 or 2 lines I just don't know how. Everything I've seen on SO and Pandas documentation on highlights/relates to singular manipulation so far.

One option is to split; then use str.zfill to pad '0's. Also prepend '20's before splitting, since you seem to need it anyway:
tmp = df['ID'].radd('20').str.split('-')
df['ID'] = tmp.str[0] + '-'+ tmp.str[1].str.zfill(8)
Output:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085

I'd do it in two steps, using .str.replace:
df["ID"] = df["ID"].str.replace(r"^(\d{2})-", r"20\1-", regex=True)
df["ID"] = df["ID"].str.replace(r"-(\d+)", lambda g: f"-{g[1]:0>8}", regex=True)
print(df)
Prints:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085

Related

Pandas remove duplicates within the list of values and identifying id's that share the same values

I have a pandas dataframe :
I used to have duplicate test_no ; so I remove the duplicates by
df['test_no'] = df['test_no'].apply(lambda x: ','.join(set(x.split(','))))
but still as you can see the duplicates are still there ; I think it's due to extra spaces and I want to clean it
Part 1:
my_id test_no
0 10000000000055910 461511, 461511
1 10000000000064510 528422
2 10000000000064222 528422,528422 , 528421
3 10000000000161538 433091.0, 433091.0
4 10000000000231708 nan,nan
Expected Output
my_id test_no
0 10000000000055910 461511
1 10000000000064510 528422
2 10000000000064222 528422, 528421
3 10000000000161538 433091.0
4 10000000000231708 nan
Part 2:
I also want to check if any of the "my_id" share any of the test_no ;
for example :
my_id matched_myid
10000000000064222 10000000000064510
You can use a regex to split:
import re
df['test_no'] = df['test_no'].apply(lambda x: ','.join(set(re.split(',\s*', x))))
# or
df['test_no'] = [','.join(set(re.split(',\s*', x))) for x in df['test_no']]
If you want to keep the original order use dict.fromkeys in place of set.
If the duplicates are successive you can also use:
df['test_no'] = df['test_no'].str.replace(r'([^,\s]+),\s*\1', r'\1', regex=True)

Run functions over many dataframes, add results to another dataframe, and dynamically name the resulting column with the name of the original df

I have many different tables that all have different column names and each refer to an outcome, like glucose, insulin, leptin etc (except keep in mind that the tables are all gigantic and messy with tons of other columns in them as well).
I am trying to generate a report that starts empty but then adds columns based on functions applied to each of the glucose, insulin, and leptin tables.
I have included a very simple example - ignore that the function makes little sense. The below code works, but I would like to, instead of copy + pasting final_report["outcome"] = over and over again, just run the find_result function over each of glucose, insulin, and leptin and add the "glucose_result", "insulin_result" and "leptin_result" to the final_report in one or a few lines.
Thanks in advance.
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
glucose = pd.DataFrame({'id':ids,
'timepoint':timepoint,
'outcome':outcome})
insulin = pd.DataFrame({'id':ids,
'timepoint':timepoint,
'outcome':outcome})
leptin = pd.DataFrame({'id':ids,
'timepoint':timepoint,
'outcome':outcome})
ids = [1,2,3,4]
start = [1,1,1,1]
end = [6,6,6,6]
final_report = pd.DataFrame({'id':ids,
'start':start,
'end':end})
def find_result(subject, start, end, df):
df = df.loc[(df["id"] == subject) & (df["timepoint"] >= start) & (df["timepoint"] <= end)].sort_values(by = "timepoint")
return df["timepoint"].nunique()
final_report['glucose_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], glucose), axis=1)
final_report['insulin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], insulin), axis=1)
final_report['leptin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], leptin), axis=1)
If you have to use this code structure, you can create a simple dictionary with your dataframes and their names and loop through them, creating new columns with programmatically assigned names:
input_dfs = {"glucose": glucose, "insulin": insulin, "leptin": leptin}
for name, df in input_dfs.items():
final_report[f"{name}_result"] = final_report.apply(
lambda x: find_result(x['id'], x['start'], x['end'], df),
axis=1
)
Output:
id start end glucose_result insulin_result leptin_result
0 1 1 6 6 6 6
1 2 1 6 6 6 6
2 3 1 6 3 3 3
3 4 1 6 6 6 6

How to split a column when having multiple scenarios (pandas)

I have a column that has a combination of longitude and latitude. I'm trying to split them separately. But, I'm facing a problem. Here's what my data looks like:
print(df['location'])
location
0 -10.8544921875-49.8238090851324
1 2.021484375-59.478568831926
2 2.021484375 / 49.823809085
3 -10.8544921875/ 59.478568831926
4 9.61795 19.33163
As you can see some don't have any spacing but separated with a ' - '. Some have spacing a separated with ' / '. And other have spacing without any character between them.
I've tried to separate it one by one and firstly by doing:
df[['Long','Lat']] = df['location'].str.split(" ",1, expand=True)
Obviously, it didn't separate everything.
My problem is, what do I do next or is there a better approach using Regular Expression? which I'm not familiar with at all
Desired Output:
long Lat
0 -10.8544921875 -49.8238090851324
1 2.021484375 -59.478568831926
2 2.021484375 49.823809085
3 -10.8544921875 59.478568831926
4 9.61795 19.33163
Try:
df[['Long','Lat']] = df['location'].str.extractall(r'([-]?\d+(\.\d+)?)')[0].unstack(level=1)
Outputs:
>>> df[['Long','Lat']]
Long Lat
0 -10.8544921875 -49.8238090851324
1 2.021484375 -59.478568831926
2 2.021484375 49.823809085
3 -10.8544921875 59.478568831926
4 9.61795 19.33163

Using dictionary to add some columns to a dataframe with assign function

I was using python and pandas to do some statistical analysis on data and at some point I needed to add some new columns with assign function
df_res = (
df
.assign(col1 = lambda x: np.where(x['event'].str.contains('regex1'),1,0))
.assign(col2 = lambda x: np.where(x['event'].str.contains('regex2'),1,0))
.assign(mycol = lambda x: np.where(x['event'].str.contains('regex3'),1,0))
.assign(newcol = lambda x: np.where(x['event'].str.contains('regex4'),1,0))
)
I wanted to know if there is any way to add columns names and my regex to a dictionary and use a for loop or another lambda expression to assign these columns automatically:
Dic = {'col1':'regex1','col2':'regex2','mycol':'regex3','newcol':'regex4'}
df_res = (
df
.assign(...using Dic here...)
)
I need to add more columns later and I think it will make it easier to add new columns later.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html
Assigning multiple columns within the same assign is possible. For Python 3.6 and above, later items in ‘**kwargs’ may refer to newly created or modified columns in ‘df’; items are computed and assigned into ‘df’ in order. For Python 3.5 and below, the order of keyword arguments is not specified, you cannot refer to newly created or modified columns. All items are computed first, and then assigned in alphabetical order.
Changed in version 0.23.0: Keyword argument order is maintained for Python 3.6 and later.
If you map all your regex so that each dictionary value holds a lambda instead of just the regex, you can simply unpack the dic into assign:
lambda_dict = {
col:
lambda x, regex=regex: (
x['event'].
str.contains(regex)
.astype(int)
)
for col, regex in Dic.items()
}
res = df.assign(**lambda_dict)
EDIT
Here's an example:
import pandas as pd
import random
random.seed(0)
events = ['apple_one', 'chicken_one', 'chicken_two', 'apple_two']
data = [random.choice(events) for __ in range(10)]
df = pd.DataFrame(data, columns=['event'])
regex_dict = {
'apples': 'apple',
'chickens': 'chicken',
'ones': 'one',
'twos': 'two',
}
lambda_dict = {
col:
lambda x, regex=regex: (
x['event']
.str.contains(regex)
.astype(int)
)
for col, regex in regex_dict.items()
}
res = df.assign(**lambda_dict)
print(res)
# Output
event apples chickens ones twos
0 apple_two 1 0 0 1
1 apple_two 1 0 0 1
2 apple_one 1 0 1 0
3 chicken_two 0 1 0 1
4 apple_two 1 0 0 1
5 apple_two 1 0 0 1
6 chicken_two 0 1 0 1
7 apple_two 1 0 0 1
8 chicken_two 0 1 0 1
9 chicken_one 0 1 1 0
The problem with the prior code was that the regex was only evaluated during the last loop. Adding it as a default argument fixes this.
This can do what you want to do
pd.concat([df,pd.DataFrame({a:list(df["event"].str.contains(b)) for a,b in Dic.items()})],axis=1)
Actually using a for loop will do the same
If I understand you question correctly, you're trying to rename the columns, in which case I think you could just use Pandas rename function. This would look like
df_res = df_res.rename(mapper=Dic)
-Ben

python - Replace first five characters in a column with asterisks

I have a column called SSN in a CSV file with values like this
289-31-9165
I need to loop through the values in this column and replace the first five characters so it looks like this
***-**-9165
Here's the code I have so far:
emp_file = "Resources/employee_data1.csv"
emp_pd = pd.read_csv(emp_file)
new_ssn = emp_pd["SSN"].str.replace([:5], "*")
emp_pd["SSN"] = new_ssn
How do I loop through the value and replace just the first five numbers (only) with asterisks and keep the hiphens as is?
Similar to Mr. Me, this will instead remove everything before the first 6 characters and replace them with your new format.
emp_pd["SSN"] = emp_pd["SSN"].apply(lambda x: "***-**" + x[6:])
You can simply achieve this with replace() method:
Example dataframe :
borrows from #AkshayNevrekar..
>>> df
ssn
0 111-22-3333
1 121-22-1123
2 345-87-3425
Result:
>>> df.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
OR
>>> df.ssn.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
Name: ssn, dtype: object
OR:
df['ssn'] = df['ssn'].str.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
Put your asterisks in front, then grab the last 4 digits.
new_ssn = '***-**-' + emp_pd["SSN"][-4:]
You can use regex
df = pd.DataFrame({'ssn':['111-22-3333','121-22-1123','345-87-3425']})
def func(x):
return re.sub(r'\d{3}-\d{2}','***-**', x)
df['ssn'] = df['ssn'].apply(func)
print(df)
Output:
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425

Categories