Splitting a string at multiple delimiters - python

Is there a way to split a string in a pandas dataframe column like
coordinates(gDNA)
chr10:g.89711916T>A
into tab-separated fields
chr\tstart\tref\talt
chr10\t89711916\tT\tA
in pandas?
So far I've tried
df[['chr','others']] = df['coordinates(gDNA)'].str.split(':',expand=True)
and extracted the first part, but I'm not sure what to do for the rest.

Use:
df[['chr','start','alt']] = df['coordinates(gDNA)'].str.split(r':g\.|>', expand=True)
df[['start','ref']] = df['start'].str.extract(r'(\d+)(\D+)')
print(df)
     coordinates(gDNA)    chr     start alt ref
0  chr10:g.89711916T>A  chr10  89711916   A   T

Try this:
df[['chr','start','ref','alt']] = df['coordinates(gDNA)'].str.extract(r'(\w+).*?(\d+)(\w+).*?(\w+)')

df = pd.DataFrame(
    columns=['coordinates(gDNA)'],
    data=[['chr10:g.89711916T>A']]
)

def parser(x):
    ch, x = x.split(':g.')
    start = int(x[:-3])
    ref = x[-3]
    alt = x[-1]
    return dict(chr=ch, start=start, ref=ref, alt=alt)

pd.DataFrame([*map(parser, df['coordinates(gDNA)'])], df.index)
alt chr ref start
0 A chr10 T 89711916
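For completeness, the whole parse can also be done in one pass with str.extract and named capture groups. This is a sketch that assumes every value follows the chr:g.position ref>alt layout shown above:

```python
import pandas as pd

df = pd.DataFrame({'coordinates(gDNA)': ['chr10:g.89711916T>A']})

# Named groups become the output column names directly.
pattern = r'(?P<chr>[^:]+):g\.(?P<start>\d+)(?P<ref>\D+)>(?P<alt>\w+)'
out = df['coordinates(gDNA)'].str.extract(pattern)
print(out)
```

Note that str.extract returns strings for every group, so cast start with astype(int) if you need a numeric column.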


How to extract the date from string and aggregate it?

I have a dataframe which has two features item_listing_time and item_sale_time.
Those are strings and look like this:
item_listing_time                item_sale_time
2018-09-30T19:06:21.000-07:00    2018-09-30T23:06:21.000-07:00
I want to create a feature sold_in_24h which is True when the sale happens within 24 hours of listing.
At the moment my workflow looks like this:
# replace the "T" and "." characters in both columns with spaces
# (regex=False so that '.' is treated literally, not as a regex wildcard)
data2['item_listing_time'] = data2['item_listing_time'].str.replace('T', ' ', regex=False)
data2['item_listing_time'] = data2['item_listing_time'].str.replace('.', ' ', regex=False)
data2['item_sale_time'] = data2['item_sale_time'].str.replace('T', ' ', regex=False)
data2['item_sale_time'] = data2['item_sale_time'].str.replace('.', ' ', regex=False)
# storing datetime into datetime_listing column as datetime type
data2['date_listing'] = data2.item_listing_time.str.split(' ').str[0]
data2['time_listing'] = data2.item_listing_time.str.split(' ').str[1]
data2['datetime_listing'] = data2.date_listing + " "+data2.time_listing
data2['datetime_listing'] = pd.to_datetime(data2['datetime_listing'], format='%Y-%m-%d %H:%M:%S')
# same with saletime
data2['date_sale'] = data2.item_sale_time.str.split(' ').str[0]
data2['time_sale'] = data2.item_sale_time.str.split(' ').str[1]
data2['saletime'] = data2.date_sale + " "+data2.time_sale
data2['saletime'] = pd.to_datetime(data2['saletime'], format='%Y-%m-%d %H:%M:%S')
# creating column for sold_in_24h
data2["was_sold_in_24h"] = (data2["saletime"] - data2["datetime_listing"]) < pd.Timedelta(days=1)
This method works, but I'm not sure it's a neat way to solve the problem.
Any opinions on how to improve it, or should I leave it as is, since it produces the desired result?
Thanks!
You can convert the "item_listing_time" and "item_sale_time" columns directly to datetime:
df["item_listing_time"] = pd.to_datetime(df["item_listing_time"])
df["item_sale_time"] = pd.to_datetime(df["item_sale_time"])
one_day = pd.Timedelta(days=1)
df["sold_in_24h"] = df.apply(
    lambda x: x["item_listing_time"] + one_day > x["item_sale_time"], axis=1
)
print(df)
Prints:
item_listing_time item_sale_time sold_in_24h
0 2018-09-30 19:06:21-07:00 2018-09-30 23:06:21-07:00 True
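The row-wise apply works, but it isn't needed here: once both columns are datetimes, the subtraction and comparison are already vectorized. A minimal sketch with the same sample values:

```python
import pandas as pd

df = pd.DataFrame({
    'item_listing_time': ['2018-09-30T19:06:21.000-07:00'],
    'item_sale_time': ['2018-09-30T23:06:21.000-07:00'],
})
# to_datetime parses the ISO 8601 strings, timezone offset included.
df['item_listing_time'] = pd.to_datetime(df['item_listing_time'])
df['item_sale_time'] = pd.to_datetime(df['item_sale_time'])

# Timestamp subtraction yields a Timedelta Series; compare it directly.
df['sold_in_24h'] = (df['item_sale_time'] - df['item_listing_time']) < pd.Timedelta(days=1)
print(df['sold_in_24h'])
```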
For replacing the "T" and "." characters in item_listing_time and item_sale_time with spaces:
One option is to chain the replace calls for each column, like this:
data2['item_sale_time'] = data2['item_sale_time'].str.replace('.', ' ', regex=False).str.replace('T', ' ', regex=False)
For storing datetime into datetime_listing column as datetime type:
Store the split result and use it:
parts = data2.item_listing_time.str.split(' ')
data2['date_listing'] = parts.str[0]
data2['time_listing'] = parts.str[1]
But these are very basic tips; I'll also be watching for more in-depth suggestions here!

Converting DataFrame containing UTF-8 and Nulls to String Without Losing Data

Here's my code for reading in this dataframe:
from urllib.request import urlopen
import pandas as pd

html = 'https://www.agroindustria.gob.ar/sitio/areas/ss_mercados_agropecuarios/logistica/_archivos/000023_Posici%C3%B3n%20de%20Camiones%20y%20Vagones/000010_Entrada%20de%20camiones%20y%20vagones%20a%20puertos%20semanal%20y%20mensual.php'
url = urlopen(html)
df = pd.read_html(html, encoding='utf-8')
remove = []
for x in range(len(df)):
    if len(df[x]) < 10:
        remove.append(x)
for x in remove[::-1]:
    df.pop(x)
df = df[0]
The dataframe uses both ',' and '.' as thousands separators, and I want neither. So 5.103 should become 5103.
Using this code:
df = df.apply(lambda x: x.str.replace('.', '', regex=False))
df = df.apply(lambda x: x.str.replace(',', '', regex=False))
All of the data will get changed, but the values in the last four columns will all turn to NaN. I'm assuming this has something to do with trying to use str.replace on a float?
Trying any sort of df[column] = df[column].astype(str) also gives back errors, as does something convoluted like the following:
for x in df.columns.tolist():
    for k, v in df[x].iteritems():
        if pd.isnull(v) == False and type(v) == float:
            df.loc[k, x] = str(v)
What is the right way to approach this problem?
You can try this regex approach, applying the substitution element-wise to each column. I haven't tested it against your data, but it should work:
df = df.apply(lambda col: col.astype(str).str.replace(r'(\d+)[.,](\d+)', r'\1\2', regex=True))
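A minimal, self-contained sketch of that substitution on made-up sample values (standing in for the scraped table):

```python
import pandas as pd

# Made-up sample values standing in for the scraped columns.
df = pd.DataFrame({'val': ['5.103', '1,234', '12']})

# Cast each column to str, then drop a '.' or ',' sitting between digit groups.
df = df.apply(lambda col: col.astype(str).str.replace(r'(\d+)[.,](\d+)', r'\1\2', regex=True))
print(df)
```

Values with more than one separator (e.g. '1.234.567') would need a second pass or a plain character strip instead.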

Converting a part of a field of a data frame to a lower case [Pandas]

I have a data frame with the following:
df=pd.DataFrame(['DMA.CSV','NaN' , 'AEB.csv', 'Xy.PY'],columns=['File_Name'])
What is an efficient way to get all the File_Names with the extension converted to lower case (excluding NaN)? The output should look like this:
['DMA.csv', 'NaN', 'AEB.csv', 'Xy.py']
This one excludes 'NaN' from the output:
df = df.File_Name.iloc[df[~df.File_Name.str.contains('NaN')].index].str.split('.', expand=True)
df.iloc[:,1] = df.iloc[:,1].str.lower()
df = df[0] + '.' + df[1]
You can also try this:
def lower_suffix(mystr):
    if '.' in mystr:
        return mystr[:mystr.rfind('.')] + mystr[mystr.rfind('.'):].lower()
    else:
        return mystr

df['File_Name'] = df['File_Name'].apply(lower_suffix)
print(df)
The function finds the last '.' in the file name, if there is one, and lower-cases whatever comes after it.
Using os.path.splitext
Ex:
import pandas as pd
import os
df=pd.DataFrame(['Hello.world.txt', 'DMA.CSV','NaN' , 'AEB.csv', 'Xy.PY'],columns=['File_Name'])
df["File_Name"] = [filename + ext.lower() if ext else filename
                   for filename, ext in df["File_Name"].apply(os.path.splitext)]
print(df)
Output:
File_Name
0 Hello.world.txt
1 DMA.csv
2 NaN
3 AEB.csv
4 Xy.py
You can try this:
import pandas as pd
df=pd.DataFrame(['DMA.CSV','NaN' , 'AEB.csv', 'Xy.PY'],columns=['File_Name'])
for i, v in enumerate(df['File_Name'].str.split('.')):
    if len(v) == 2:
        df.iloc[i] = v[0] + '.' + v[1].lower()
    else:
        df.iloc[i] = v[0]
print(df)
File_Name
0 DMA.csv
1 NaN
2 AEB.csv
3 Xy.py
After a lot of research, I found the following method, which is quite simple, I feel:
df['File_Name'] = [x.rsplit('.', 1)[0] + '.' + x.rsplit('.', 1)[-1].lower()
                   if '.' in str(x) else x for x in df['File_Name']]
This leaves the 'NaN' values untouched and also takes care of multiple dots ('.') in file names (such as 'Hello.World.TXT').
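A vectorized variant is also possible: pandas' str.replace accepts a callable replacement when regex=True, so the final extension alone can be lower-cased in one call. A sketch:

```python
import pandas as pd

df = pd.DataFrame(['DMA.CSV', 'NaN', 'AEB.csv', 'Xy.PY', 'Hello.World.TXT'],
                  columns=['File_Name'])

# Match the last '.' plus everything after it, and lower-case just that part.
df['File_Name'] = df['File_Name'].str.replace(
    r'\.([^.]+)$', lambda m: '.' + m.group(1).lower(), regex=True)
print(df)
```

Names without a dot (like 'NaN') simply don't match and pass through unchanged.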

Formatting a specific row of integers to the ssn style

I want to format a specific column of integers to ssn format (xxx-xx-xxxx). I saw that openpyxl has builtin styles. I have been using pandas and wasn't sure if it could do this specific format.
I did see this -
df.iloc[:,:].str.replace(',', '')
but I want to replace the ',' with '-'.
import pandas as pd
df = pd.read_excel('C:/Python/Python37/Files/Original.xls')
df.drop(['StartDate', 'EndDate','EmployeeID'], axis = 1, inplace=True)
df.rename(columns={'CheckNumber': 'W/E Date', 'CheckBranch': 'Branch','DeductionAmount':'Amount'},inplace=True)
df = df[['Branch','Deduction','CheckDate','W/E Date','SSN','LastName','FirstName','Amount','Agency','CaseNumber']]
ssn = (df['SSN']  # the integer column
       .astype(str)  # cast integers to string
       .str.zfill(8)  # zero-padding
       .pipe(lambda s: s.str[:2] + '-' + s.str[2:4] + '-' + s.str[4:]))
writer = pd.ExcelWriter('C:/Python/Python37/Files/Deductions Report.xlsx')
df.to_excel(writer,'Sheet1')
writer.save()
Your question is a bit confusing; see if this helps. If you have a column of integers and want to create a new one made up of strings in SSN (Social Security Number) format, you can try something like:
df['SSN'] = (df['SSN']  # the "integer" column
             .astype(int)  # make sure it really is integer
             .astype(str)  # cast integers to string
             .str.zfill(9)  # zero-padding to nine digits
             .pipe(lambda s: s.str[:3] + '-' + s.str[3:5] + '-' + s.str[5:]))
Setup
Social Security numbers are nine-digit numbers using the form: AAA-GG-SSSS
s = pd.Series([111223333, 222334444])
0 111223333
1 222334444
dtype: int64
Option 1
Using zip and numpy.unravel_index:
import numpy as np

pd.Series([
    '{}-{}-{}'.format(*el)
    for el in zip(*np.unravel_index(s, (1000, 100, 10000)))
])
Option 2
Using f-strings:
pd.Series([f'{i[:3]}-{i[3:5]}-{i[5:]}' for i in s.astype(str)])
Both produce:
0 111-22-3333
1 222-33-4444
dtype: object
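A third, fully vectorized sketch along the same lines: zero-pad to nine digits, then insert the dashes with a single regex.

```python
import pandas as pd

s = pd.Series([111223333, 222334444, 1223333])

# Zero-pad to nine digits, then split into the 3-2-4 SSN groups.
ssn = s.astype(str).str.zfill(9).str.replace(
    r'(\d{3})(\d{2})(\d{4})', r'\1-\2-\3', regex=True)
print(ssn)
```

The zfill also handles SSNs whose leading zeros were lost when the column was read as integers.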
I prefer:
df["ssn"] = df["ssn"].astype(str)
df["ssn"] = df["ssn"].str.strip()
df["ssn"] = (
    df.ssn.str.replace("(", "", regex=False)
    .str.replace(")", "", regex=False)
    .str.replace("-", "", regex=False)
    .str.replace(" ", "", regex=False)
    .apply(lambda x: f"{x[:3]}-{x[3:5]}-{x[5:]}")
)
This takes into account rows that are partially formatted, fully formatted, or not formatted at all, and correctly formats them all.
For Example:
data = [111111111,123456789,"222-11-3333","433-3131234"]
df = pd.DataFrame(data, columns=['ssn'])
[before/after screenshots of the dataframe omitted]

Split and Join Series in Pandas

I have two series in the dataframe below. The first is a string which will appear in the second, which will be a url string. What I want to do is change the first series by concatenating on extra characters, and have that change applied onto the second string.
import pandas as pd
#import urlparse
d = {'OrigWord' : ['bunny', 'bear', 'bull'], 'WordinUrl' : ['http://www.animal.com/bunny/ear.html', 'http://www.animal.com/bear/ear.html', 'http://www.animal.com/bull/ear.html'] }
df = pd.DataFrame(d)
def trial(source_col, dest_col):
    splitter = dest_col.str.split(str(source_col))
    print(type(splitter))
    print(splitter)
    res = 'angry_' + str(source_col).join(splitter)
    return res

df['Final'] = df.applymap(trial(df.OrigWord, df.WordinUrl))
I'm trying to find the string from source_col, split on that string in dest_col, then apply the change to the string in dest_col. Here I have it as a new series called Final, but I would rather do it in place. I think the main issues are the splitter variable, which isn't working, and the way the function is applied.
Here's how result should look:
OrigWord WordinUrl
angry_bunny http://www.animal.com/angry_bunny/ear.html
angry_bear http://www.animal.com/angry_bear/ear.html
angry_bull http://www.animal.com/angry_bull/ear.html
apply isn't really designed to take multiple columns as separate arguments. What you can do is change your function so that it takes in a row Series instead, and then assign source_col and dest_col from the appropriate values in that Series. One way of doing it is as below:
def trial(x):
    source_col = x["OrigWord"]
    dest_col = x['WordinUrl']
    splitter = str(dest_col).split(str(source_col))
    res = splitter[0] + 'angry_' + source_col + splitter[1]
    return res

df['Final'] = df.apply(trial, axis=1)
Here is an alternative approach:
df['WordinUrl'] = (df.apply(lambda x: x.WordinUrl.replace(x.OrigWord,
                                                          'angry_' + x.OrigWord), axis=1))
In [25]: df
Out[25]:
OrigWord WordinUrl
0 bunny http://www.animal.com/angry_bunny/ear.html
1 bear http://www.animal.com/angry_bear/ear.html
2 bull http://www.animal.com/angry_bull/ear.html
Instead of using split, you can use the replace method to prepend the angry_ to the corresponding source:
def trial(row):
    row.WordinUrl = row.WordinUrl.replace(row.OrigWord, "angry_" + row.OrigWord)
    row.OrigWord = "angry_" + row.OrigWord
    return row

df.apply(trial, axis=1)
OrigWord WordinUrl
0 angry_bunny http://www.animal.com/angry_bunny/ear.html
1 angry_bear http://www.animal.com/angry_bear/ear.html
2 angry_bull http://www.animal.com/angry_bull/ear.html
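If the per-row apply overhead matters, the same replacement can be written as a plain list comprehension over the two columns. This is an equivalent sketch of the answers above, not a different method:

```python
import pandas as pd

d = {'OrigWord': ['bunny', 'bear', 'bull'],
     'WordinUrl': ['http://www.animal.com/bunny/ear.html',
                   'http://www.animal.com/bear/ear.html',
                   'http://www.animal.com/bull/ear.html']}
df = pd.DataFrame(d)

# zip the two columns and do a plain str.replace per row.
df['WordinUrl'] = [url.replace(word, 'angry_' + word)
                   for word, url in zip(df['OrigWord'], df['WordinUrl'])]
df['OrigWord'] = 'angry_' + df['OrigWord']
print(df)
```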
