How to get specifically located text from a dataframe index into a list? - python

I have a dataframe whose index labels contain a piece of information that I want to copy into a list.
I don't know what the text is in advance (the word always changes), but I know where it is located in the index:
'point.subclase.optimum.R31.done'. R31 is the value I would like to write to a list, so I know that the text, which is always different, sits between point.subclase.optimum. and .done.
I've tried with:
info_list = []
for col in df.columns:
    if ('point.subclase.optimum.' in col) and ('.done' in col):
        info_list.append(col)
But that script just gives me the entire index label in the list.
Does anyone know how to solve it?

Use Series.str.extract, escaping the dots as \. because . is a special regex character; then remove possible missing values (labels with no match) with Series.dropna, and finally convert the output to a list:
import pandas as pd

df = pd.DataFrame({'a': range(3)}, index=['point.subclase.optimum.R31.done',
                                          'point.subclase',
                                          'point.subclase.optimum.R98.done'])
print (df)
                                 a
point.subclase.optimum.R31.done  0
point.subclase                   1
point.subclase.optimum.R98.done  2
L = (df.index.str.extract(r'point\.subclase\.optimum\.(.*)\.done', expand=False)
       .dropna()
       .tolist())
print (L)
['R31', 'R98']
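If you prefer to stay closer to the loop you already wrote, here is a non-vectorized sketch with the re module, using the same df and pattern as above:

import re

pattern = re.compile(r'point\.subclase\.optimum\.(.*)\.done')

info_list = []
for idx in df.index:
    m = pattern.search(idx)
    if m:
        # group(1) is the text captured between the two fixed parts
        info_list.append(m.group(1))

print(info_list)  # ['R31', 'R98']

The vectorized str.extract version above is preferable for large indexes, but the loop makes the capture-group logic explicit.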

Related

Stripping 0's from the middle of a dataframe

Basically, data is coming into my program in the format
0xxxx000xxxx, where the x's are unique to the data I have in another system. I'm trying to remove those 0's, as they're always in the same positions.
I tried
df['item'] = df['item'].str.replace('0','')
but sometimes an x can itself be a 0, and that gets removed too. I'm not sure how to get rid of just the 0's in those specific positions.
EX:
Input: 099890000890
Output (Desired): 99890890
Use the str accessor for indexing:
df['item'] = df['item'].str[1:5] + df['item'].str[8:]
Or str.replace:
df['item'] = df['item'].str.replace(r'0(.{4})000(.{4})', r'\1\2', regex=True)
Output (shown here as a new column, item2):
           item     item2
0  099890000890  99890890
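For reference, a minimal runnable sketch of the slicing approach (the column name item comes from the question; the sample value from the example above):

import pandas as pd

df = pd.DataFrame({'item': ['099890000890']})

# positions 1-4 and 8-11 hold the data; positions 0 and 5-7 are the fixed zeros
df['item2'] = df['item'].str[1:5] + df['item'].str[8:]
print(df)
#            item     item2
# 0  099890000890  99890890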

Remove unwanted characters from Dataframe values in Pandas

I have the following dataframe full of locus/gene names from a multiple genome alignment.
However, I am trying to get only the list of locus names, without the numeric prefixes and coordinates.
Tuberculosis_locus Smagmatis_locus H37RA_locus Bovis_locus
0 0:Rv0001:1-1524 1:MSMEG_RS33460:6986600-6988114 2:MRA_RS00005:1-1524 3:BQ2027_RS00005:1-1524
1 0:Rv0002:2052-3260 1:MSMEG_RS00005:499-1692 2:MRA_RS00010:2052-3260 3:BQ2027_RS00010:2052-3260
2 0:Rv0003:3280-4437 1:MSMEG_RS00015:2624-3778 2:MRA_RS00015:3280-4437 3:BQ2027_RS00015:3280-4437
To avoid issues with empty cells, I am filling them with 'N/A' and then stripping the unwanted characters. But it gives back the exact same result; nothing seems to happen.
for value in orthologs['Tuberculosis_locus']:
    orthologs['Tuberculosis_locus'] = orthologs['Tuberculosis_locus'].fillna("N/A")
    orthologs['Tuberculosis_locus'] = orthologs['Tuberculosis_locus'].map(lambda x: x.lstrip('\d:').rstrip(':\d+'))
Any idea on what I am doing wrong? I'd like the following output:
Tuberculosis_locus Smagmatis_locus H37RA_locus Bovis_locus
0 Rv0001 MSMEG_RS33460 MRA_RS00005 BQ2027_RS00005
1 Rv0002 MSMEG_RS00005 MRA_RS00010 BQ2027_RS00010
2 Rv0003 MSMEG_RS00015 MRA_RS00015 BQ2027_RS00015
Split by ':' with a maximum of two splits and then take the second element, e.g.:
df.applymap(lambda v: v.split(':', 2)[1])
def clean(x):
    x = x.split(':')[1].strip()
    return x

orthologs = orthologs.applymap(clean)
should work.
Explanation:
applymap works on the whole dataframe, while apply works on a single column (a Series).
clean is the function you want to apply to every entry of the dataframe. Note that you pass clean itself, without the (x), when you use it with applymap or apply.
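To make the difference concrete, here is a small sketch (the frame and values are made up for illustration):

import pandas as pd

df = pd.DataFrame({'a': ['0:Rv0001:1-1524'], 'b': ['2:MRA_RS00005:1-1524']})

# applymap: clean runs on every cell of the whole dataframe
print(df.applymap(clean))    #         a            b
                             # 0  Rv0001  MRA_RS00005

# apply on a single column (a Series): clean runs on each value of that column
print(df['a'].apply(clean))  # 0    Rv0001

(In pandas 2.1+, DataFrame.applymap has been renamed to DataFrame.map.)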

How do I find if a string is in a list in a specific column of a dataframe?

I have 2 large dataframes I want to compare against each other.
I have .split(" ") one of the columns and placed the result in a new column of the dataframe.
I now want to check and see if a value exists in that new column, instead of using a .contains() in the original column, to avoid the value getting picked up within a word.
Here is what I've tried and why I'm frustrated.
row['company'][i] == 'nom'
L_df['Name split'][7126853] == "['nom', '[this', 'is', 'nom]']"
row['company'][i] in L_df['Name split'][7126853]    # True (this is the index where I know the specific value occurs)
row['company'][i] in L_df['Name split']             # WHAAT? False (my attempt to check the entire column); why is this False when I've shown the value exists?
L_df[L_df['Name split'].isin([row['company'][i]])]  # [empty]
Edit: I should add that I am trying to set up a process where I can iterate, checking entries in the smaller dataset against the larger one.
result = L_df[  # the [9] is a placeholder for our iterable 'i' that will go row by row
    L_df['Company name'].str.contains(row['company'][i], na=False)  # can be difficult with names like 'Nom'
    # (row['company'][i] in L_df['Name split'])
    & L_df['Industry'].str.contains('marketing', na=False)  # unreliable currently, need to get looser matches; min. reduction
    & L_df['Locality'].str.contains(row['city'][i], na=False)  # reliable, but not always great at reducing results
    & ((row['workers'][i] >= L_df['Emp Lower bound']) & (row['workers'][i] <= L_df['Emp Upper bound']))  # unreliable
]
The first line is what I am trying to replace with this new process, so I don't get matches when 'nom' appears in the middle of a word.
Here is a solution that first merges the two dataframes into one and then uses a lambda to process the columns of interest. The result is placed in a new column, found:
import pandas

df1 = pandas.DataFrame(data={'company': ['findme', 'asdf']})
df2 = pandas.DataFrame(data={'Name split': ["here is a string including findme and then some".split(" "),
                                            "something here".split(" ")]})
combined_df = pandas.concat([df1, df2], axis=1)
combined_df['found'] = combined_df.apply(lambda row: row['company'] in row['Name split'], axis=1)
Result:
company Name split found
0 findme [here, is, a, string, including, findme, and, ... True
1 asdf [something, here] False
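Note that pandas.concat with axis=1 aligns the two frames on their index, so this approach assumes the rows of df1 and df2 correspond one-to-one; if the indexes differ, reset them first or merge on an explicit key.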
EDIT:
In order to compare each value from the company column to every cell in the Name split column of the other dataframe, and to have access to the whole row of the latter, I would simply loop through the rows of both frames, see here:
import pandas as pd

df1 = pd.DataFrame(data={'company': ['findme', 'asdf']})
df2 = pd.DataFrame(data={'Name split': ["random text".split(" "),
                                        "here is a string including findme and then some".split(" "),
                                        "somethingasdfq here".split(" ")],
                         'another column': [3, 1, 2]})
for index1, row1 in df1.iterrows():
    for index2, row2 in df2.iterrows():
        if row1['company'] in row2['Name split']:
            # do something here with row2
            print(row2)
Probably not very efficient, but could be improved by breaking out of the inner loop as soon as a match is found if we only need one match.
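For instance, an early-exit sketch of the same loop (same frames as above):

for index1, row1 in df1.iterrows():
    for index2, row2 in df2.iterrows():
        if row1['company'] in row2['Name split']:
            print(row2)  # do something with the matching row
            break        # stop scanning df2 after the first match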

How to trim string from reverse in Pandas column

I have a pandas dataframe column value as
"assdffjhjhjh(12tytyttyt)bhhh(AS7878788)"
I need to trim it from the back, i.e. my resultant value should be AS7878788.
I am doing the below:
newdf = pd.DataFrame(df.COLUMNNAME.str.split('(', 1).tolist(), columns=['col1', 'col2'])
df['newcol'] = newdf['col2'].str[:10]
The above gives me the output "12tytyttyt" in the new column; however, my intended output is "AS7878788".
Can someone help please?
Let's try first with a regular string in pure Python:
x = "assdffjhjhjh(12tytyt)bhhh(AS7878788)"
res = x.rsplit('(', 1)[-1][:-1] # 'AS7878788'
Here we split from the right by open bracket (limiting the split count to one for efficiency), extract the last split, and extract every character except the last.
You can then apply this in Pandas via pd.Series.str methods:
df['col'] = df['col'].str.rsplit('(', n=1).str[-1].str[:-1]
Here's a demo:
df = pd.DataFrame({'col': ["assdffjhjhjh(12tytyt)bhhh(AS7878788)"]})
df['col'] = df['col'].str.rsplit('(', n=1).str[-1].str[:-1]
print(df)
col
0 AS7878788
Note the solution above is very specific to the string you have presented as an example. For a more flexible alternative, consider using regex.
You can use a regex to find all instances of "values between two brackets" and then pull out the final one. For example, if we have the following data:
df = pd.DataFrame({'col': ['assdffjhjhjh(12tytyt)bhhh(AS7878788)',
                           'asjhgdv(abjhsgf)(abjsdfvhg)afdsgf']})
and we do:
df['col'] = df['col'].str.findall(r'\(([^()]+)\)').str[-1]
this gets us:
col
0 AS7878788
1 abjsdfvhg
To explain what the regex is doing, it is trying to find all instances where we have:
\(        # an open bracket
([^()]+)  # one or more characters that are neither an open nor a close bracket
\)        # a close bracket
We can see how this is working if we drop the .str[-1] from the end of the previous statement, as df['col'] = df['col'].str.findall(r'\(([^()]+)\)') gives us:
col
0 [12tytyt, AS7878788]
1 [abjhsgf, abjsdfvhg]
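As a side note, if you only ever need the final bracketed group, you can grab it in one step by anchoring the pattern to the end of the string with str.extract (a sketch on the original column; the trailing [^()]*$ permits text after the last close bracket, as in the second row above):

df['col'] = df['col'].str.extract(r'\(([^()]+)\)[^()]*$', expand=False)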

Remove all characters except alphabet in column rows

Let's say I have a dataset, and in some columns of this dataset I have lists. The first key problem is that there are many such columns, where the strings can be separated by ';' or ';;', and the string itself can start with whitespace or even ';'.
For some cases of this problem I implemented this function:
g = [';', '']
f = []
for index, row in data_a.iterrows():
    for x in row['column_1']:
        if x in g:
            norm = row['column_1'].split(x)
            f.append(norm)
            print(norm)
        else:
            pass
Actually it worked, but the problem is that it returned duplicated rows and wasn't able to handle other separators.
Another problem is using dummies after I changed the way column values are stored:
column_values = data_a['column_1']
data_a.insert(loc=0, column='new_column_8', value=column_values)
dummies_new_win = pd.get_dummies(data_a['column_1'].apply(pd.Series).stack()).sum(level=0)
Instead of getting 40 columns in my case, I get 50 or 60, because I am not able to write a function that removes everything from the lists except alphabetic characters. I would like to understand how to implement such a function, because the same string meanings can be written in different ways:
name-Jack or name(Jack)
Desired output would look like this:
nameJack nameJack
I'm not sure if I understood you well, but to remove all non-alphanumeric characters you can use a simple regex.
Example:
import re
n = '-s;a-d'
re.sub(r'\W+', '', n)
Output: 'sad'
You can use str.replace for pandas Series.
df = pd.DataFrame({'names': ['name-Jack','name(Jack)']})
df
# names
# 0 name-Jack
# 1 name(Jack)
df['names'] = df['names'].str.replace(r'\W+', '', regex=True)
df
# names
# 0 nameJack
# 1 nameJack
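One caveat: \W matches non-word characters, so it keeps digits and underscores. If you truly want to keep nothing but letters, as the title asks, a stricter character class would be (a sketch on the same frame):

# drop everything that is not an ASCII letter
df['names'] = df['names'].str.replace(r'[^a-zA-Z]', '', regex=True)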
