I have a dataframe with the following structure (3 columns):
DATE,QUOTE,SOURCE
2019-11-21,1ºTEST/2ºTEST DONE,KAGGLE
What I am trying to do is take a substring of the QUOTE column in order to generate a new column containing only the words after the last occurrence of the word 'TEST'.
My expected result:
DATE,QUOTE,STATUS,SOURCE
2019-11-21,1ºTEST/2ºTEST DONE,DONE,KAGGLE
For that I'm trying with the following code:
import pandas as pd
df = pd.read_excel(filename)
split = lambda x: len(x['QUOTE'].rsplit('TEST',1)[0])
df["STATUS"] = df.apply(split, axis=1)
print(df["STATUS"].unique())
However, I'm only getting numbers printed, not 'DONE'.
What am I doing wrong?
Thanks!
In the definition of split you are using len, which returns the length of a sequence (an integer):
len([1, 'Done'])  # returns 2
You need to access the last index, for example:
df['STATUS'] = df.QUOTE.str.rsplit('TEST').str[-1]
print(df)
Output
DATE QUOTE SOURCE STATUS
0 2019-11-21 1ºTEST/2ºTEST DONE KAGGLE DONE
Or if you want to use apply, just change the definition of split:
split = lambda x: x['QUOTE'].rsplit('TEST', 1)[-1]
df["STATUS"] = df.apply(split, axis=1)
print(df)
Output
DATE QUOTE SOURCE STATUS
0 2019-11-21 1ºTEST/2ºTEST DONE KAGGLE DONE
Note that using lambda to create named functions is considered bad practice.
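For instance, a minimal sketch reusing the sample row from the question: a named function is clearer and shows up by name in tracebacks:

```python
import pandas as pd

df = pd.DataFrame({"QUOTE": ["1ºTEST/2ºTEST DONE"]})

def status_after_last_test(row):
    # Text after the last occurrence of 'TEST', with surrounding spaces removed
    return row["QUOTE"].rsplit("TEST", 1)[-1].strip()

df["STATUS"] = df.apply(status_after_last_test, axis=1)
print(df["STATUS"].unique())  # ['DONE']
```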
I have a df with a 'File_name' column containing file-name strings that I would like to parse:
data = [['f1h3_13oct2021_gt1.csv', 2], ['p8-gfr-20dec2021-81.csv', 0.5]]
df = pd.DataFrame(data, columns=['File_name', 'Result'])
df.head()
Now I would like to create a new column where I split the file name on the '_' and '-' delimiters and then search the resulting list for the string that I can convert to a datetime object. The naming convention is not always the same (the order differs, so I cannot rely on character positions), and the code should include a "try" conversion to datetime, as the piece of string that should be the date is often in the wrong format or missing.
I came up with the following, but it does not really look pythonic to me:
# Solution #1
import datetime as dt

for i, value in df['File_name'].iteritems():
    chunks = value.split('-') + value.split('_')
    for chunk in chunks:
        try:
            df.loc[i, 'Date_Sol#1'] = dt.datetime.strptime(chunk, '%d%b%Y')
        except ValueError:
            pass
df.head()
Alternatively, I was trying to use the apply method with the two functions, but I cannot think of a way to chain the two functions together with the try/pass logic, and I did not manage to get it working:
# Solution #2
import re
splitme = lambda x: re.split('_|-', x)
calcdate = lambda x : dt.datetime.strptime(x, '%d%b%Y')
df['t1'] = df['File_name'].apply(splitme)
df['Date_Sol#2'] = df['t1'].apply(lambda x: calcdate(x) for x in df['t1'] if isinstance(calcdate(x), dt.datetime) else Pass)
df.head()
I thought a list comprehension might help?
Any ideas on how Solution #2 might look?
Thanks in advance
Assuming you want to extract and convert the possible chunks to dates, you could split the string on the delimiters, explode to multiple rows, and attempt the conversion with pandas.to_datetime:
df.join(pd
   .to_datetime(df['File_name']
                .str.split(r'[_-]')
                .explode(), errors='coerce')
   .dropna().rename('Date')
)
output:
File_name Result Date
0 f1h3_13oct2021_gt1.csv 2.0 2021-10-13
1 p8-gfr-20dec2021-81.csv 0.5 2021-12-20
NB. if you have potentially many dates per string, you need to add a further step to select the one you want. Please give more details if this is the case.
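As a sketch of that extra step (assuming you want the first date found in each file name, and using a made-up second file name that contains two dates), you can group the exploded series by the original index before joining:

```python
import pandas as pd

df = pd.DataFrame({'File_name': ['f1h3_13oct2021_gt1.csv',
                                 'a_01jan2021_02feb2021_x.csv'],
                   'Result': [2, 0.5]})

dates = (pd.to_datetime(df['File_name'].str.split(r'[_-]').explode(),
                        format='%d%b%Y', errors='coerce')
           .dropna()
           .groupby(level=0).first()   # keep the first date per original row
           .rename('Date'))
print(df.join(dates))
```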
Plain Python version for old pandas:
import re
s = pd.Series([next(iter(pd.to_datetime(re.split(r'[._-]', s), errors='coerce')
                         .dropna()), float('nan'))
               for s in df['File_name']], index=df.index, name='date')
df.join(s)
I am trying to clean lists within a column in my dataframe from all the terms that do not make sense.
For example
Col New_Col
VM ['#']
JS [ '/','/UTENTI/','//utilsit/promo', '/notifiche/']
www.facebook.com ['https://www.facebook.com/','https://twitter.com/']
FA ['/nordest/venezia/','/nordest/treviso/']
I would like to remove from each list (row) in the column all the words that:
- do not start with https, http or //
- contain Col as a substring (for example: www.facebook.com is included in https://www.facebook.com/, so I should remove it, regardless of whether it starts with https)
I tried to write this code:
prefixes = ['http', 'https', '//']
for word in df['New_Col']:
    if word.startswith(prefixes):
        list.remove(word)
print(df['New_Col'])
however it says that
'list' object has no attribute 'startswith'
(Attribute error).
I think my code above is handling a single list rather than a column of lists.
Can you please help me to understand how to do it?
Use DataFrame.apply on axis=1 along with the custom filter function fx:
import re
fx = lambda s: [w for w in s['New_Col'] if s['Col'] not in w and re.match(r'^https?|//', w)]
df['New_Col'] = df.apply(fx, axis=1)
# print(df)
Col New_Col
0 VM []
1 JS [//utilsit/promo]
2 www.facebook.com [https://twitter.com/]
3 FA []
Make a function that removes the unwanted words using a regular expression, then apply it to the dataframe column as below:
df['ColName'].apply(lambda x: func(x))
Here func is the function that takes each row of the ColName column and returns your required result.
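A minimal sketch of that idea, with the function name made up and the column names and sample data taken from the question above:

```python
import re
import pandas as pd

df = pd.DataFrame({
    'Col': ['VM', 'www.facebook.com'],
    'New_Col': [['#'], ['https://www.facebook.com/', 'https://twitter.com/']],
})

def clean_links(row):
    # Keep entries that start with http(s) or // and do not contain row['Col']
    return [w for w in row['New_Col']
            if re.match(r'https?|//', w) and row['Col'] not in w]

df['New_Col'] = df.apply(clean_links, axis=1)
print(df)
```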
I have a DataFrame containing int and str data which I have to process through.
I would like to separate the text and the numerical values in each cell into separate columns, so that I can compute on the numerical data.
My columns are similar to this:
I have read about doing something like this with the apply and applymap functions, but I can't design such a function as I am new to pandas. It should basically do:
def separator():
    if cell has str:
        add str part to another column (Check column), leave int in place
    else:
        add 'NA' to Check column
You can do this using extract followed by to_numeric:
import pandas as pd
df = pd.DataFrame({'a_mrk4': ['042FP', '077', '079', '1234A-BC D..EF']})
df[['a_mrk4', 'Check']] = df['a_mrk4'].str.extract(r'(\d+)(.*)')
df['a_mrk4'] = pd.to_numeric(df['a_mrk4'])
print(df)
Output:
a_mrk4 Check
0 42 FP
1 77
2 79
3 1234 A-BC D..EF
You can use regular expressions. Let's say that you have a column (target_col) and the data follow the pattern digits+text; then you can use the following:
df.target_col.str.extractall(r'(\d+)(\w+)')
You can tweak the regex to match your specific needs.
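For example, on a small made-up column following that digits+text pattern, extractall returns one row per match with one column per capture group:

```python
import pandas as pd

df = pd.DataFrame({'target_col': ['042FP', '1234ABC']})

# \d+ captures the leading digits, \w+ captures the text that follows
out = df['target_col'].str.extractall(r'(\d+)(\w+)')
print(out)
```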
I have a pandas dataframe column value as
"assdffjhjhjh(12tytyttyt)bhhh(AS7878788)"
I need to trim it from the back, i.e. my resulting value should be AS7878788.
I am doing the below:
newdf = pd.DataFrame(df.COLUMNNAME.str.split('(', 1).tolist(), columns=['col1', 'col2'])
df['newcol'] = newdf['col2'].str[:10]
The above gives the output "12tytyttyt" in the new column; however, my intended output is "AS7878788".
Can someone help please?
Let's try first with a regular string in pure Python:
x = "assdffjhjhjh(12tytyt)bhhh(AS7878788)"
res = x.rsplit('(', 1)[-1][:-1] # 'AS7878788'
Here we split from the right by open bracket (limiting the split count to one for efficiency), extract the last split, and extract every character except the last.
You can then apply this in Pandas via pd.Series.str methods:
df['col'] = df['col'].str.rsplit('(', n=1).str[-1].str[:-1]
Here's a demo:
df = pd.DataFrame({'col': ["assdffjhjhjh(12tytyt)bhhh(AS7878788)"]})
df['col'] = df['col'].str.rsplit('(', n=1).str[-1].str[:-1]
print(df)
col
0 AS7878788
Note the solution above is very specific to the string you have presented as an example. For a more flexible alternative, consider using regex.
You can use a regex to find all instances of "values between two brackets" and then pull out the final one. For example, if we have the following data:
df = pd.DataFrame({'col': ['assdffjhjhjh(12tytyt)bhhh(AS7878788)',
'asjhgdv(abjhsgf)(abjsdfvhg)afdsgf']})
and we do:
df['col'] = df['col'].str.findall(r'\(([^\(^\)]+)\)').str[-1]
this gets us:
col
0 AS7878788
1 abjsdfvhg
To explain what the regex is doing, it is trying to find all instances where we have:
\( # an open bracket
([^\(^\)]+) # anything that isn't an open bracket or a close bracket for one or more characters
\) # a close bracket
We can see how this is working if we take the .str[-1] from the end of our previous statement, as df['col'] = df['col'].str.findall(r'\(([^\(^\)]+)\)') gives us:
col
0 [12tytyt, AS7878788]
1 [abjhsgf, abjsdfvhg]
I have a pandas dataframe column of numbers which all have a dash in between, for example :
"123-045"
I am wondering is there anyway to delete the zero after the dash sign, to make the above example to
"123-45"
? And is it possible to apply the process condition to the entire column??
I have used a for loop to check each digit after the dash sign, using the python string function. But the number of rows is large, and the for loop takes forever.
Try the Series.str.replace method with the regex (?<=-)0+ to remove zeros after -:
df = pd.DataFrame({'a': ["123-045"]})
df
# a
#0 123-045
df.a.str.replace('(?<=-)0+', '', regex=True)
#0 123-45
#Name: a, dtype: object
If s is your string, it could be as simple as this (note that re must be imported):
import re
s = re.sub(r'-0+', '-', s)  # "123-045" -> "123-45"
Or with a pandas dataframe:
df = pd.DataFrame({'key': ["123-045"]})
print(df.key.str.replace(r'-0+', '-', regex=True))