Stripping 0's from the middle of a dataframe - python

Basically data is coming into my program in this format
0xxxx000xxxx where the x is unique to the data that I have in another system. I'm trying to remove those 0's as they're always in the same place.
I tried
df['item'] = df['item'].str.replace('0','')
but sometimes the x can be a 0 and will get rid of it. I'm not sure how to get rid of just the 0's in those specific positions.
EX:
Input: 099890000890
Output (Desired): 99890890

Use the str accessor for indexing:
df['item'] = df['item'].str[1:5] + df['item'].str[8:]
Or str.replace:
df['item'] = df['item'].str.replace(r'0(.{4})000(.{4})', r'\1\2', regex=True)
Output (as new column. Item2):
item item2
0 099890000890 99890890

Related

How to remove space in number (Pandas, Python) [duplicate]

I have the following dataframe that I am trying to remove the spaces between the numbers in the value column and then use pd.to_numeric to change the dtype. THe current dtype of value is an object.
periodFrom value
1 17.11.2020 28 621 240
2 18.11.2020 30 211 234
3 19.11.2020 33 065 243
4 20.11.2020 34 811 330
I have tried multiple variations of this but can't work it out:
df['value'] = df['value'].str.strip()
df['value'] = df['value'].str.replace(',', '').astype(int)
df['value'] = df['value'].astype(str).astype(int)
One option is to apply .str.split() first in order to split by whitespaces(even if the anyone of them has more than one character length), then concatenate (''.join()) while removing those whitespaces along with converting to integers(int()) such as
j=0
for i in df['value'].str.split():
df['value'][j]=int(''.join(i))
j+=1
You can do:
df['value'].replace({' ':''}, regex=True)
Or
df['value'].apply(lambda x: re.sub(' ', '', str(x)))
And add to both .astype(int).

Remove unwanted characters from Dataframe values in Pandas

I have the following Dataframe full of locus/gen names from a multiple genome alignment.
However, I am trying to get only a full list of the locus/name without the coordinates.
Tuberculosis_locus Smagmatis_locus H37RA_locus Bovis_locus
0 0:Rv0001:1-1524 1:MSMEG_RS33460:6986600-6988114 2:MRA_RS00005:1-1524 3:BQ2027_RS00005:1-1524
1 0:Rv0002:2052-3260 1:MSMEG_RS00005:499-1692 2:MRA_RS00010:2052-3260 3:BQ2027_RS00010:2052-3260
2 0:Rv0003:3280-4437 1:MSMEG_RS00015:2624-3778 2:MRA_RS00015:3280-4437 3:BQ2027_RS00015:3280-4437
To avoid issues with empty cells, I am filling cells with 'N/A' and then striping the unwanted characters. But it's giving the same exact result, nothing seems to be happening.
for value in orthologs['Tuberculosis_locus']:
orthologs['Tuberculosis_locus'] = orthologs['Tuberculosis_locus'].fillna("N/A")
orthologs['Tuberculosis_locus'] = orthologs['Tuberculosis_locus'].map(lambda x: x.lstrip('\d:').rstrip(':\d+'))
Any idea on what I am doing wrong? I'd like the following output:
Tuberculosis_locus Smagmatis_locus H37RA_locus Bovis_locus
0 Rv0001 MSMEG_RS33460 MRA_RS00005 BQ2027_RS00005
1 Rv0002 MSMEG_RS00005 MRA_RS00010 BQ2027_RS00010
2 Rv0003 MSMEG_RS00015 MRA_RS00015 BQ2027_RS00015
Split by : with a maximum split of two and then take the 2nd elements, eg:
df.applymap(lambda v: v.split(':', 2)[1])
def clean(x):
x = x.split(':')[1].strip()
return x
orthologs = orthologs.applymap(clean)
should work.
Explanation:
applymap is for the whole dataframe and apply is for a data column.
clean is a function you want to apply to every entry of the dataframe. Note that you don't need (x) anymore when you use it together with applymap or apply.

How to get an specific located text into a dataframe index?

I have a dataframe with some text indexes which contains a necessary information that I want to copy into a list.
I don't know how is the text info specifically (the word always changes), but I know where is located in the index:
'point.subclase.optimum.R31.done'. R31 is the value which I would like to write in a list, so I know that that text, that is always different, is between point.subclase.optimum. and .done.
I've tried with:
info_list = []
for col in df.columns:
if ('point.subclase.optimum.' in col) and ('.done' in col):
info_list.append(col)
But that script just provide me the entire index in the list.
Does anyone know how to solve it?
Use Series.str.extract with escape \. because special regex character, then remove possible missing values if no match by Series.dropna and last convert output to list:
df = pd.DataFrame({'a':range(3)}, index=['point.subclase.optimum.R31.done',
'point.subclase',
'point.subclase.optimum.R98.done'])
print (df)
a
point.subclase.optimum.R31.done 0
point.subclase 1
point.subclase.optimum.R98.done 2
L = (df.index.str.extract(r'point\.subclase\.optimum\.(.*)\.done', expand=False)
.dropna()
.tolist())
print (L)
['R31', 'R98']

How to trim string from reverse in Pandas column

I have a pandas dataframe column value as
"assdffjhjhjh(12tytyttyt)bhhh(AS7878788)"
I need to trim it from the back,i.e my resultant value should be AS7878788.
I am doing the below:
newdf=pd.DataFrame(df.COLUMNNAME.str.split('(',1).tolist(),columns = ['col1','col2'])
df['newcol'] = newdf['col2'].str[:10]
This in the above Dataframe column is giving the the output "12tytyttyt", however my intended output is "AS7878788"
Can someone help please?
Let's try first with a regular string in pure Python:
x = "assdffjhjhjh(12tytyt)bhhh(AS7878788)"
res = x.rsplit('(', 1)[-1][:-1] # 'AS7878788'
Here we split from the right by open bracket (limiting the split count to one for efficiency), extract the last split, and extract every character except the last.
You can then apply this in Pandas via pd.Series.str methods:
df['col'] = df['col'].str.rsplit('(', 1).str[-1].str[:-1]
Here's a demo:
df = pd.DataFrame({'col': ["assdffjhjhjh(12tytyt)bhhh(AS7878788)"]})
df['col'] = df['col'].str.rsplit('(', 1).str[-1].str[:-1]
print(df)
col
0 AS7878788
Note the solution above is very specific to the string you have presented as an example. For a more flexible alternative, consider using regex.
You can use a regex to find all instances of "values between two brackets" and then pull out the final one. For example, if we have the following data:
df = pd.DataFrame({'col': ['assdffjhjhjh(12tytyt)bhhh(AS7878788)',
'asjhgdv(abjhsgf)(abjsdfvhg)afdsgf']})
and we do:
df['col'] = df['col'].str.findall(r'\(([^\(^\)]+)\)').str[-1]
this gets us:
col
0 AS7878788
1 abjsdfvhg
To explain what the regex is doing, it is trying to find all instances where we have:
\( # an open bracket
([^\(^\)]+) # anything that isn't an open bracket or a close bracket for one or more characters
\) # a close bracket
We can see how this is working if we take the .str[-1] from the end of our previous statement, as df['col'] = df['col'].str.findall(r'\(([^\(^\)]+)\)') gives us:
col
0 [12tytyt, AS7878788]
1 [abjhsgf, abjsdfvhg]

Remove all characters except alphabet in column rows

Let's say i have a dataset, and in some columns of these dataset I have lists. Well first key problem is actually that there are many columns with such lists, where strings can be separated by (';') or (';;'), the string itself starts with whitelist or even (';).
For some cases of these problem i implemented this function:
g = [';','']
f = []
for index, row in data_a.iterrows():
for x in row['column_1']:
if (x in g):
norm = row['column_1'].split(x)
f.append(norm)
print(norm)
else:
Actually it worked, but the problem is that it returned duplicated rows, and wasn't able to solve tasks with other separators.
Another problem is using dummies after I changed the way column values are stored:
column_values = data_a['column_1']
data_a.insert(loc=0, column='new_column_8', value=column_values)
dummies_new_win = pd.get_dummies(data_a['column_1'].apply(pd.Series).stack()).sum(level=0)
Instead of getting 40 columns in my case, i get 50 or 60. Due to the fact, that i am not able to make a function that removes from lists everything except just alphabet. I would like to understand how to implement such function because same string meanings can be written in different ways:
name-Jack or name(Jack)
Desired output would look like this:
nameJack nameJack
Im not sure if i understood you well, but to remove all non alphanumeric, you can use simple regex.
Example:
import re
n = '-s;a-d'
re.sub(r'\W+', '', n)
Output: 'sad'
You can use str.replace for pandas Series.
df = pd.DataFrame({'names': ['name-Jack','name(Jack)']})
df
# names
# 0 name-Jack
# 1 name(Jack)
df['names'] = df['names'].str.replace('\W+','')
df
# names
# 0 nameJack
# 1 nameJack

Categories