I have a DataFrame with a 'File_name' column containing file names, which I would like to parse:
data = [['f1h3_13oct2021_gt1.csv', 2], ['p8-gfr-20dec2021-81.csv', 0.5]]
df= pd.DataFrame(data, columns = ['File_name', 'Result'])
df.head()
Now I would like to create a new column: split the file name on the '_' and '-' delimiters, then search the resulting list for the piece that can be converted to a datetime object. The naming convention is not always the same (the order differs, so I cannot rely on character positions), and the code should wrap the conversion in a "try", because the piece that should be the date is often in the wrong format or missing.
I came up with the following, but it does not really look pythonic to me:
# Solution #1
for i, value in df['File_name'].items():
    chunks = value.split('-') + value.split('_')
    for chunk in chunks:
        try:
            df.loc[i, 'Date_Sol#1'] = dt.datetime.strptime(chunk, '%d%b%Y')
        except ValueError:
            # this chunk is not a date in the expected format
            pass
df.head()
Alternatively, I tried to use the apply method with two chained functions, but I could not work out how to combine them with the try/pass logic, and I did not manage to get it working:
# Solution #2
import re
splitme = lambda x: re.split('_|-', x)
calcdate = lambda x : dt.datetime.strptime(x, '%d%b%Y')
df['t1'] = df['File_name'].apply(splitme)
df['Date_Sol#2'] =df['t1'].apply(lambda x: calcdate(x) for x in df['t1'] if isinstance(calcdate(x),dt.datetime) else Pass)
df.head()
I thought a list comprehension might help?
Any suggestions on what Solution #2 might look like?
Thanks in advance
Assuming you want to extract and convert the possible chunks as date, you could split the string on delimiters, explode to multiple rows and attempt to convert to date with pandas.to_datetime:
df.join(pd
        .to_datetime(df['File_name']
                     .str.split(r'[_-]')
                     .explode(), errors='coerce')
        .dropna().rename('Date')
        )
output:
                 File_name  Result       Date
0   f1h3_13oct2021_gt1.csv     2.0 2021-10-13
1  p8-gfr-20dec2021-81.csv     0.5 2021-12-20
NB. if you have potentially many dates per string, you need to add a further step to select the one you want. Please give more details if this is the case.
Python version for older pandas:
import re
s = pd.Series([next(iter(pd.to_datetime(re.split(r'[._-]', s), errors='coerce')
                         .dropna()), float('nan'))
               for s in df['File_name']], index=df.index, name='date')
df.join(s)
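For the apply-based form the question asks about ("Solution #2"), one possible sketch is to put the split and the try/except into a small named function and pass that to apply. The helper name first_date, the fixed format '%d%b%Y' and the third file name (added to show the missing-date case) are assumptions for illustration, not part of the original posts:

```python
import datetime as dt
import re

import pandas as pd


def first_date(file_name, fmt="%d%b%Y"):
    """Return the first chunk of file_name that parses with fmt, else NaT."""
    for chunk in re.split(r"[_-]", file_name):
        try:
            return dt.datetime.strptime(chunk, fmt)
        except ValueError:
            continue  # not a date in this format, try the next chunk
    return pd.NaT  # no chunk could be converted


data = [
    ["f1h3_13oct2021_gt1.csv", 2],
    ["p8-gfr-20dec2021-81.csv", 0.5],
    ["no_date_here.csv", 1],  # hypothetical row without a date
]
df = pd.DataFrame(data, columns=["File_name", "Result"])
df["Date"] = df["File_name"].apply(first_date)
print(df)
```

This keeps the try/except out of the lambda entirely, which is what makes the chained-lambda version hard to write. If a name can contain several valid dates, this returns the first one only.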
Related
I've a dataframe with the following structure (3 columns):
DATE,QUOTE,SOURCE
2019-11-21,1ºTEST/2ºTEST DONE, KAGGLE
What I am trying to do is take a substring of the QUOTE column to generate a new column containing only the words after the last occurrence of a marker (in this case the word 'TEST').
My expected result:
DATE,QUOTE,STATUS,SOURCE
2019-11-21,1ºTEST/2ºTEST DONE, DONE, KAGGLE
For that I'm trying with the following code:
import pandas as pd
df = pd.read_excel (filename)
split = lambda x: len(x['QUOTE'].rsplit('TEST',1)[0])
df["STATUS"] = df.apply(split, axis=1)
print(df["STATUS"].unique())
However, this just prints numbers, not 'DONE'.
What am I doing wrong?
Thanks!
In the definition of split you are using len, which returns the length of a sequence (an integer):
len([1, 'Done']) # returns 2
You need to access the last index, for example:
df['STATUS'] = df.QUOTE.str.rsplit('TEST').str[-1]
print(df)
Output
         DATE               QUOTE  SOURCE STATUS
0  2019-11-21  1ºTEST/2ºTEST DONE  KAGGLE   DONE
Or if you want to use apply, just change the definition of split:
split = lambda x: x['QUOTE'].rsplit('TEST', 1)[-1]
df["STATUS"] = df.apply(split, axis=1)
print(df)
Output
         DATE               QUOTE  SOURCE STATUS
0  2019-11-21  1ºTEST/2ºTEST DONE  KAGGLE   DONE
Note that using a lambda to create a named function is generally considered poor practice (PEP 8 recommends a def statement instead).
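Following the note about named lambdas, here is a rough sketch of the same logic written with def. The function name text_after_last and the extra strip() call (to drop the space left in front of 'DONE') are my own additions, and the one-row frame just mirrors the example from the question:

```python
import pandas as pd


def text_after_last(text, marker="TEST"):
    """Return whatever follows the last occurrence of marker, stripped."""
    return text.rsplit(marker, 1)[-1].strip()


df = pd.DataFrame(
    {"DATE": ["2019-11-21"], "QUOTE": ["1ºTEST/2ºTEST DONE"], "SOURCE": ["KAGGLE"]}
)
# Applying to the column directly avoids the slower row-wise apply(axis=1)
df["STATUS"] = df["QUOTE"].apply(text_after_last)
print(df)
```

Be aware that if the marker does not occur at all, rsplit returns the whole string, so the function hands back the (stripped) input unchanged.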
I have a pandas dataframe column value as
"assdffjhjhjh(12tytyttyt)bhhh(AS7878788)"
I need to trim it from the back, i.e. my resulting value should be AS7878788.
I am doing the below:
newdf=pd.DataFrame(df.COLUMNNAME.str.split('(',1).tolist(),columns = ['col1','col2'])
df['newcol'] = newdf['col2'].str[:10]
For the DataFrame column above this gives the output "12tytyttyt", but my intended output is "AS7878788".
Can someone help please?
Let's try first with a regular string in pure Python:
x = "assdffjhjhjh(12tytyt)bhhh(AS7878788)"
res = x.rsplit('(', 1)[-1][:-1] # 'AS7878788'
Here we split from the right by open bracket (limiting the split count to one for efficiency), extract the last split, and extract every character except the last.
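To make the three steps visible, here is the same expression broken into intermediate values. The variable names parts and last, and the rstrip alternative at the end, are only illustrative:

```python
x = "assdffjhjhjh(12tytyt)bhhh(AS7878788)"

parts = x.rsplit("(", 1)  # split once, starting from the right
last = parts[-1]          # the piece after the last open bracket
res = last[:-1]           # drop the trailing close bracket

print(parts)  # ['assdffjhjhjh(12tytyt)bhhh', 'AS7878788)']
print(res)    # AS7878788

# Alternative for the final step: only removes ')' if it is actually there
res_safe = last.rstrip(")")
```

The slice [:-1] removes the last character unconditionally, so it would eat a real character if the closing bracket were missing; rstrip(")") is a little more forgiving in that case.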
You can then apply this in Pandas via pd.Series.str methods:
df['col'] = df['col'].str.rsplit('(', n=1).str[-1].str[:-1]
Here's a demo:
df = pd.DataFrame({'col': ["assdffjhjhjh(12tytyt)bhhh(AS7878788)"]})
df['col'] = df['col'].str.rsplit('(', n=1).str[-1].str[:-1]
print(df)
         col
0  AS7878788
Note the solution above is very specific to the string you have presented as an example. For a more flexible alternative, consider using regex.
You can use a regex to find all instances of "values between two brackets" and then pull out the final one. For example, if we have the following data:
df = pd.DataFrame({'col': ['assdffjhjhjh(12tytyt)bhhh(AS7878788)',
'asjhgdv(abjhsgf)(abjsdfvhg)afdsgf']})
and we do:
df['col'] = df['col'].str.findall(r'\(([^\(^\)]+)\)').str[-1]
this gets us:
         col
0  AS7878788
1  abjsdfvhg
To explain what the regex is doing, it is trying to find all instances where we have:
\( # an open bracket
([^\(^\)]+) # anything that isn't an open bracket or a close bracket for one or more characters
\) # a close bracket
We can see how this works if we remove the .str[-1] from the end of our previous statement, as df['col'] = df['col'].str.findall(r'\(([^\(^\)]+)\)') gives us:
                    col
0  [12tytyt, AS7878788]
1  [abjhsgf, abjsdfvhg]
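The same idea can be tried outside pandas with the re module, which also makes it easy to decide what should happen when a string has no brackets at all. This is only a sketch: the helper name last_bracketed is made up, and the character class is written as [^()], which I am assuming is equivalent for this kind of data (inside a character class the brackets need no escaping):

```python
import re

# One or more characters that are neither '(' nor ')', wrapped in brackets
pattern = re.compile(r"\(([^()]+)\)")


def last_bracketed(text):
    """Return the content of the last (...) group, or None if there is none."""
    matches = pattern.findall(text)
    return matches[-1] if matches else None


for value in ["assdffjhjhjh(12tytyt)bhhh(AS7878788)",
              "asjhgdv(abjhsgf)(abjsdfvhg)afdsgf",
              "nobrackets"]:
    print(value, "->", last_bracketed(value))
```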
I want to read three columns from my pandas data frame and combine them with a separator character to form a new column. The iteration code below works fine.
def date_creation(a,b,c):
    date=str(a) +'/'+str(b)+'/'+str(c)
    return date
df.loc["Test_FL_DATE"]=df[:,["DAY_OF_MONTH","MONTH","AYEAR"]].apply(date_creation)
However, I want to do the same job using apply or a lambda. I have tried, but it is not working; my code is below, and I believe it is not correct. Thanks in advance for helping me out.
def date_creation(a,b,c):
    date=str(a) +'/'+str(b)+'/'+str(c)
    return date
df.loc["Test_FL_DATE"]=df[:,["DAY_OF_MONTH","MONTH","AYEAR"]].apply(date_creation)
Here is a possible solution if you need a lambda function (axis=1 makes apply work row by row):
cols = ["DAY_OF_MONTH","MONTH","AYEAR"]
df["Test_FL_DATE"] = df[cols].astype(str).apply(lambda x: '/'.join(x), axis=1)
Or:
df["Test_FL_DATE"] = df[cols].apply(lambda x: '/'.join(x.astype(str)), axis=1)
But nicer is:
df["Test_FL_DATE"] = df[["DAY_OF_MONTH","MONTH","AYEAR"]].astype(str).apply('/'.join, axis=1)
And a faster solution is simply to join with +:
df["Test_FL_DATE"] = (df["DAY_OF_MONTH"].astype(str) + '/' +
                      df["MONTH"].astype(str) + '/' +
                      df["AYEAR"].astype(str))
Probably easiest to use pd.Series.str.cat, which concatenates one string Series with other Series.
df['Test_FL_DATE'] = (df['DAY_OF_MONTH']
                      .astype(str)
                      .str
                      .cat([df['MONTH'].astype(str), df['AYEAR'].astype(str)], sep='/'))
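Since the sample input is not shown, here is a small self-contained check on made-up values (the two rows below are invented purely for illustration). It compares a row-wise join with the vectorized + concatenation; the row-wise variant needs axis=1 so that each row, rather than each column, is joined:

```python
import pandas as pd

# Made-up sample data, not taken from the question
df = pd.DataFrame(
    {"DAY_OF_MONTH": [1, 15], "MONTH": [1, 12], "AYEAR": [2019, 2020]}
)
cols = ["DAY_OF_MONTH", "MONTH", "AYEAR"]

# Row-wise: convert to str, then join the three values of each row
row_wise = df[cols].astype(str).apply("/".join, axis=1)

# Vectorized: plain string concatenation of whole columns
vectorized = (df["DAY_OF_MONTH"].astype(str) + "/" +
              df["MONTH"].astype(str) + "/" +
              df["AYEAR"].astype(str))

df["Test_FL_DATE"] = vectorized
print(df)
```

Both give the same strings; the vectorized form avoids calling a Python function once per row, which is presumably why it is the faster option on large frames.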
I'm using
df[colname].str.extract(regex)
to parse a column of strings into several columns. I'd like to be able to assign the column names at the same time, something like:
df[colname].str.extract(regex, columns=cnames)
where:
cnames = ['col1','col2','col3']
regex = r'(sometext\w)_(aa|bb)_(\d+-\d)'
It's possible with a clunky construction like:
df[colname].str.extract(regex).rename(columns = dict(zip(range(len(cnames)),cnames)))
Or else I could embed the column names in the regex as named groups, so the regex changes to:
regex = r'(?P<col1>sometext\w)_(?P<col2>aa|bb)_(?P<col3>\d+-\d)'
Am I missing something here? Is there a simpler way?
Thanks
Embedding the names in the regex as named groups, as you have done, is a correct way of doing this; the documentation for str.extract describes it.
Your first solution using .rename() would not be robust if you had some columns with the names 0, 1 and 2 already.
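As a quick illustration of the named-group approach, using the regex from the question on a few invented strings (the Series values are made up for the example):

```python
import pandas as pd

# Invented sample values; the last one deliberately does not match
s = pd.Series(["sometextA_aa_12-3", "sometextB_bb_4-5", "no match"])
regex = r"(?P<col1>sometext\w)_(?P<col2>aa|bb)_(?P<col3>\d+-\d)"

# Each named group becomes a column; non-matching rows are filled with NaN
out = s.str.extract(regex)
print(out)
```

The column names live next to the groups they describe, so reordering or adding a group cannot silently shift the names, which is the main risk with renaming by position afterwards.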
IMO the regex solution is the best but you could start to use something like .pipe() to implement a function in this way. However, as you will see, it starts to get messy when you do not want the same regex.
def extract_colnames(df, column, sep, cnames, drop_col=True):
    if drop_col:
        drop_col = [column]
    else:
        drop_col = []
    regex = '(?P<' + ('>.*)' + sep + '(?P<').join(cnames) + '>.*)'
    return df.join(df.loc[:, column].str.extract(regex, expand=True)).drop(drop_col, axis=1)

cnames = ['col1','col2','col3']
data = data.pipe(extract_colnames, column='colname',
                 sep='_', cnames=cnames, drop_col=True)
EDIT: here are the first lines:
df = pd.read_csv(os.path.join(path, file), dtype = str,delimiter = ';',error_bad_lines=False, nrows=50)
df["CALDAY"] = df["CALDAY"].apply(lambda x:dt.datetime.strptime(x,'%d/%m/%Y'))
df = df.fillna(0)
I have a csv file that has 1500 columns and 35000 rows. It contains numeric values, but in the form 1.700,35 for example, whereas in Python I need 1700.35. When I read the csv, all values are of type str.
To solve this I wrote this function :
def format_nombre(df):
    for i in range(length):
        for j in range(width):
            element = df.iloc[i,j]
            if (type(element) != type(df.iloc[1,0])):
                a = df.iloc[i,j].replace(".","")
                b = float(a.replace(",","."))
                df.iloc[i,j] = b
Basically, I select each intersection of all rows and columns, I replace the problematic characters, I turn the element into a float and I replace it in the dataframe. The if ensures that the function doesn't consider dates, which are in the first column of my dataframe.
The problem is that although the function does exactly what I want, it takes approximately 1 minute to cover 10 rows, so transforming my csv would take a little less than 60h.
I realize this is far from being optimized, but I struggled and failed to find a way that suited my needs and (scarce) skills.
How about:
def to_numeric(column):
    if np.issubdtype(column.dtype, np.datetime64):
        return column
    else:
        # regex=False: treat '.' as a literal dot, not as "any character"
        return column.str.replace('.', '', regex=False).str.replace(',', '.', regex=False).astype(float)
df = df.apply(to_numeric)
That's assuming all strings are valid. Otherwise use pd.to_numeric instead of astype(float).
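A small sketch of the pd.to_numeric variant on invented data, in case some cells are not valid numbers. The frame raw, the column names A and B and the function name to_number are all made up for the example, and errors='coerce' is my choice for turning bad cells into NaN rather than raising:

```python
import pandas as pd

# Invented sample: one date column and two string columns in 1.700,35 style
raw = pd.DataFrame({
    "CALDAY": pd.to_datetime(["01/02/2020", "02/02/2020"], format="%d/%m/%Y"),
    "A": ["1.700,35", "12,5"],
    "B": ["0,1", "not a number"],
})


def to_number(column):
    """Convert one column of '1.700,35'-style strings to floats."""
    if pd.api.types.is_datetime64_any_dtype(column):
        return column  # leave the date column untouched
    cleaned = (column.str.replace(".", "", regex=False)   # drop thousands separator
                     .str.replace(",", ".", regex=False))  # decimal comma -> dot
    return pd.to_numeric(cleaned, errors="coerce")  # invalid strings become NaN


converted = raw.apply(to_number)
print(converted)
print(converted.dtypes)
```

Because each column is processed in one go instead of cell by cell, this should scale to the 1500 x 35000 file far better than the nested loop, though I have not timed it on data of that size.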