How to combine multiple columns in a pandas DataFrame by using apply?

I want to read three columns from my pandas DataFrame and combine them with a separator character to form a new column. The iteration code below works fine.
def date_creation(a,b,c):
    date=str(a) +'/'+str(b)+'/'+str(c)
    return date
df.loc["Test_FL_DATE"]=df[:,["DAY_OF_MONTH","MONTH","AYEAR"]].apply(date_creation)
However, I want to do the same job by using apply or a lambda. I am trying, but it is not working. The code is below, and I believe it is not correct. Thanks in advance for helping me out.
def date_creation(a,b,c):
    date=str(a) +'/'+str(b)+'/'+str(c)
    return date
df.loc["Test_FL_DATE"]=df[:,["DAY_OF_MONTH","MONTH","AYEAR"]].apply(date_creation)

If you need a lambda function, apply it row-wise with axis=1:
cols = ["DAY_OF_MONTH","MONTH","AYEAR"]
df["Test_FL_DATE"] = df[cols].astype(str).apply(lambda x: '/'.join(x), axis=1)
Or:
df["Test_FL_DATE"] = df[cols].apply(lambda x: '/'.join(x.astype(str)), axis=1)
A nicer variant passes the join method directly:
df["Test_FL_DATE"] = df[["DAY_OF_MONTH","MONTH","AYEAR"]].astype(str).apply('/'.join, axis=1)
A faster solution is to simply concatenate with +:
df["Test_FL_DATE"] = (df["DAY_OF_MONTH"].astype(str) + '/' +
                      df["MONTH"].astype(str) + '/' +
                      df["AYEAR"].astype(str))
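As a quick check, here is a minimal, self-contained sketch comparing the row-wise apply with the + concatenation. The sample values are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Made-up sample data; the column names follow the question.
df = pd.DataFrame({
    "DAY_OF_MONTH": [1, 15],
    "MONTH": [1, 12],
    "AYEAR": [2015, 2016],
})
cols = ["DAY_OF_MONTH", "MONTH", "AYEAR"]

# Row-wise apply: axis=1 passes each row (as a Series) to the function.
via_apply = df[cols].astype(str).apply("/".join, axis=1)

# Vectorized string concatenation, column by column.
via_plus = (df["DAY_OF_MONTH"].astype(str) + "/"
            + df["MONTH"].astype(str) + "/"
            + df["AYEAR"].astype(str))

df["Test_FL_DATE"] = via_plus
print(df)
```

Both approaches should produce the same strings; the + version avoids calling a Python function once per row.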

It is probably easiest to use pd.Series.str.cat, which concatenates one string Series with other Series. The other columns must be converted to strings as well:
df['Test_FL_Date'] = (df['DAY_OF_MONTH']
                      .astype(str)
                      .str
                      .cat([df['MONTH'].astype(str), df['AYEAR'].astype(str)], sep='/'))

Related

Pandas apply multiple function with list

I have a df with a 'File_name' column which contains strings of a file name, which I would like to parse:
data = [['f1h3_13oct2021_gt1.csv', 2], ['p8-gfr-20dec2021-81.csv', 0.5]]
df= pd.DataFrame(data, columns = ['File_name', 'Result'])
df.head()
Now I would like to create a new column by splitting the file name on the '_' and '-' delimiters and then searching the resulting list for the chunk that can be converted to a datetime object. The naming convention is not always the same (the order differs, so I cannot rely on character positions), and the code should wrap the conversion in a "try", because the chunk that should be the date is often in the wrong format or missing.
I came up with the following, but it does not really look pythonic to me:
# Solution #1
import datetime as dt
for i, value in df['File_name'].items():
    chunks = value.split('-') + value.split('_')
    for chunk in chunks:
        try:
            df.loc[i,'Date_Sol#1'] = dt.datetime.strptime(chunk, '%d%b%Y')
        except ValueError:
            # chunk is not a date in the expected format; try the next one
            pass
df.head()
Alternatively, I tried to use the apply method with two functions, but I could not work out how to chain the two functions together with the try/pass logic, and I did not manage to get it working:
# Solution #2
import re
splitme = lambda x: re.split('_|-', x)
calcdate = lambda x : dt.datetime.strptime(x, '%d%b%Y')
df['t1'] = df['File_name'].apply(splitme)
df['Date_Sol#2'] =df['t1'].apply(lambda x: calcdate(x) for x in df['t1'] if isinstance(calcdate(x),dt.datetime) else Pass)
df.head()
I thought a list comprehension might help?
What might Solution #2 look like?
Thanks in advance
Assuming you want to extract and convert the possible chunks as date, you could split the string on delimiters, explode to multiple rows and attempt to convert to date with pandas.to_datetime:
df.join(pd
        .to_datetime(df['File_name']
                     .str.split(r'[_-]')
                     .explode(), errors='coerce')
        .dropna().rename('Date')
       )
output:
                 File_name  Result       Date
0   f1h3_13oct2021_gt1.csv     2.0 2021-10-13
1  p8-gfr-20dec2021-81.csv     0.5 2021-12-20
NB. if you have potentially many dates per string, you need to add a further step to select the one you want. Please give more details if this is the case.
A plain-Python version for older pandas:
import re
s = pd.Series([next(iter(pd.to_datetime(re.split(r'[._-]', name), errors='coerce')
                         .dropna()), float('nan'))
               for name in df['File_name']], index=df.index, name='date')
df.join(s)
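For reference, here is a minimal sketch of what the apply-based Solution #2 from the question could look like, with the try/except moved into a small helper. The helper name first_date and the NaT fallback are assumptions for illustration, not part of the original posts. Parsing month names with %b depends on the locale (English is assumed here).

```python
import datetime as dt
import re

import pandas as pd

data = [['f1h3_13oct2021_gt1.csv', 2], ['p8-gfr-20dec2021-81.csv', 0.5]]
df = pd.DataFrame(data, columns=['File_name', 'Result'])


def first_date(file_name, fmt='%d%b%Y'):
    """Return the first chunk that parses as a date, or NaT if none does.

    Hypothetical helper: splits on '_' and '-' as in the question.
    """
    for chunk in re.split(r'[_-]', file_name):
        try:
            return dt.datetime.strptime(chunk, fmt)
        except ValueError:
            continue  # not a date in the expected format; try the next chunk
    return pd.NaT


df['Date_Sol#2'] = df['File_name'].apply(first_date)
print(df)
```

Keeping the try/except inside a named function avoids cramming control flow into a lambda, and returning NaT keeps the column a datetime column even when a file name has no usable date.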

How to create conditional columns in Pandas with any?

I'm working with pandas. I need to create a new column in a DataFrame based on conditions in other columns. For each value in a Series, I check whether it contains a given substring (a condition to return some text). This works when the values match exactly, but not when the substring is only part of the value.
Sample data :
df = pd.DataFrame([["ores"], ["ores + more texts"], ["anything else"]], columns=['Symptom'])
def conditions(df5):
    if ("ores") in df5["Symptom"]: return "Things"
df["new_column"] = df.swifter.apply(conditions, axis=1)
It doesn't work because any("something") is always True.
So I tried:
df['new_column'] = np.where(df2["Symptom"].str.contains('ores'), 'yes', 'no') : return "Things"
It doesn't work because it's inside a loop.
I can't use np.select because it needs two separate lists, and my code has to be easily editable (and it can't come from a dict).
It also doesn't work with find_all, nor with:
df["new_column"] == "ores" is True: return "things"
I don't really understand why nothing works. What do I have to do?
Edit :
df5 = pd.DataFrame([["ores"], ["ores + more texts"], ["anything else"]], columns=['Symptom'])
def conditions(df5):
    (df5["Symptom"].str.contains('ores'), 'Things')
df5["Deversement Service"] = np.where(conditions)
df5
For the moment I have a problem with the length of values.
To add a new column with condition, use np.where:
df = pd.DataFrame([["ores"], ["ores + more texts"], ["anything else"]], columns=['Symptom'])
df['new'] = np.where(df["Symptom"].str.contains('ores'), 'Things', "")
print (df)
Symptom new
0 ores Things
1 ores + more texts Things
2 anything else
If you need a single boolean value, use pd.Series.any:
if df["Symptom"].str.contains('ores').any():
    print ("Things")
# Things
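If several conditions are needed and they should stay easy to edit, one possible approach is to keep the rules as a single list of (substring, label) pairs and derive the two lists that np.select expects from it. This is only a sketch: the rules list and its second entry are hypothetical and not from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([["ores"], ["ores + more texts"], ["anything else"]],
                  columns=["Symptom"])

# Hypothetical rule table: (substring to look for, label to assign).
# Edit this single list to add or change rules; the first match wins.
rules = [
    ("ores", "Things"),
    ("anything", "Other"),
]

# Build the two lists np.select needs from the one rule table.
conditions = [df["Symptom"].str.contains(pattern, regex=False)
              for pattern, _ in rules]
labels = [label for _, label in rules]

df["new_column"] = np.select(conditions, labels, default="")
print(df)
```

Rows that match no rule get the default empty string, as in the np.where example above.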

How to separate tuple into independent pandas columns?

I am working with matching two separate dataframes on first name using HMNI's fuzzymerge.
On output each row returns a key like: (May, 0.9905315373004635)
I am trying to separate the Name and Score into their own columns. I tried the below code but don't quite get the right output - every row ends up with the same exact name/score in the new columns.
for i, v in enumerate(matched.key):
    matched['MatchedNameFinal'] = (matched.key[i][0][0])
    matched['MatchedNameScore'] = (matched.key[i][0][1])
matched[['consumer_name_first', 'key','MatchedNameFinal', 'MatchedNameScore']]
First, when going over rows in pandas, it is better to use apply:
matched['MatchedNameFinal'] = matched.key.apply(lambda x: x[0][0])
matched['MatchedNameScore'] = matched.key.apply(lambda x: x[0][1])
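In your for loop, each iteration assigns a single value to the entire column, so every row ends up with the last match. If you keep the loop, assign per row instead (this assumes the default integer index):
for i, v in enumerate(matched.key):
    matched.loc[i, 'MatchedNameFinal'] = v[0][0]
    matched.loc[i, 'MatchedNameScore'] = v[0][1]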
Generally, you want to avoid looping with enumerate in pandas, because vectorized pandas operations are much faster.
So this solution doesn't iterate with enumerate.
First, turn each list into a single tuple per row:
tuples = matched.key.explode()
Then use zip with unpacking to split the tuples into two columns:
matched['col1'], matched['col2'] = zip(*tuples)
Or do it all in one line:
matched['MatchedNameFinal'], matched['MatchedNameScore'] = zip(*matched.key.explode())
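To make the explode-and-zip approach concrete, here is a small runnable sketch. The matched frame below is invented for illustration and assumes that each key holds a list with exactly one (name, score) tuple, which is what the indexing in the question suggests.

```python
import pandas as pd

# Invented sample data: each key is a list containing one (name, score) tuple.
matched = pd.DataFrame({
    "consumer_name_first": ["May", "Jon"],
    "key": [[("May", 0.99)], [("John", 0.87)]],
})

# explode() turns each one-element list into the tuple it contains.
pairs = matched["key"].explode()

# zip(*pairs) transposes the tuples into one sequence of names and one of scores.
names, scores = zip(*pairs)
matched["MatchedNameFinal"] = list(names)
matched["MatchedNameScore"] = list(scores)
print(matched[["consumer_name_first", "MatchedNameFinal", "MatchedNameScore"]])
```

If a key could hold more than one tuple, explode() would produce more rows than the frame has, so the assignment would need an extra step to pick one match per row.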

Pandas apply with swifter on axis 1 doesn't return

I try to apply the following code (minimal example) to my 2 Million rows DataFrame, but for some reason .apply returns more than one row to the function and breaks my code. I'm not sure what changed, but the code did run before.
def function(row):
    return [row[clm1], row[clm2]]
res = pd.DataFrame()
res[["clm1", "clm2"]] = df.swifter.apply(function,axis=1)
Does anyone have an idea, or has anyone had a similar issue?
Important: without swifter everything works fine, but it is too slow given the number of rows.
This should work:
def function(row_different_name):
    return [row_different_name[clm1], row_different_name[clm2]]
res = pd.DataFrame()
res[["clm1", "clm2"]] = df.swifter.apply(function,axis=1)
Try changing the name of the function parameter row to some other name.
Based on this previous answer, what you are trying to do should work if you change it like this:
def function(row):
    return [row[clm1], row[clm2]]
res = pd.DataFrame()
res[["clm1", "clm2"]] = df.apply(function, axis=1, result_type='expand')
This is because apply on a column (a Series) lacks the result_type argument, while apply on a DataFrame has it.
axis=1 means column, so it will insert it vertically. Is that what you want? Try removing axis=1
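The effect of result_type='expand' can be seen in a small sketch using plain pandas (no swifter, so it says nothing about how swifter handles the same arguments). The frame and the column names a and b are made up for illustration.

```python
import pandas as pd

# Made-up data standing in for the large frame from the question.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})


def function(row):
    # Return two values per row; the doubling only shows that real work happens.
    return [row["a"], row["b"] * 2]


# axis=1 calls function once per row; result_type="expand" spreads each
# returned list over separate columns instead of keeping one column of lists.
res = df.apply(function, axis=1, result_type="expand")
res.columns = ["clm1", "clm2"]
print(res)
```

Without result_type="expand", the same call returns a Series whose elements are lists, which cannot be assigned to two columns directly.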

How to lowercase an entire Data Frame?

I'm trying to build a function to do the job because my data frames are in a list. This is the function that I am working on:
def lower(x):
    '''
    This function lowercase the entire Data Frame.
    '''
    for x in clean_lst:
        for x.columns in x:
            x.columns['i'].map(lambda i: i.lower())
It's not working like that!
This is the list of data frames:
clean_lst = [pop_movies, trash_movies]
I am planning to access the list like this:
lower = [pd.DataFrame(lower(x)) for x in clean_list]
pop_movies = lower[0]
trash_movies = lower[1]
HELP!!!
You can use the apply method from pandas, which works on a DataFrame or a Series. Note that .str.lower() only works on string columns:
clean_lst = [i.apply(lambda x: x.str.lower()) for i in clean_lst]
You should use a vectorized method for every column in the DataFrame:
x["column_i"].str.lower()
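If the frames also contain non-string columns, a variant that only touches string columns may be safer. The sketch below is one way to do that; the helper name lower_frame and the sample movie data are assumptions for illustration.

```python
import pandas as pd
from pandas.api.types import is_string_dtype


def lower_frame(df):
    """Return a copy of df with its string columns lowercased.

    Hypothetical helper: non-string columns are passed through unchanged.
    """
    return df.apply(lambda col: col.str.lower() if is_string_dtype(col) else col)


# Made-up frames standing in for pop_movies and trash_movies.
pop_movies = pd.DataFrame({"Title": ["Alien", "UP"], "Year": [1979, 2009]})
trash_movies = pd.DataFrame({"Title": ["The ROOM"], "Year": [2003]})
clean_lst = [pop_movies, trash_movies]

lowered = [lower_frame(frame) for frame in clean_lst]
pop_movies, trash_movies = lowered
print(pop_movies)
print(trash_movies)
```

The list comprehension returns new frames rather than modifying the originals, so the names are rebound afterwards, much like the indexing in the question.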
