I have the following DataFrame, full of locus/gene names from a multiple genome alignment.
However, I am trying to get a full list of just the locus names, without the coordinates.
Tuberculosis_locus Smagmatis_locus H37RA_locus Bovis_locus
0 0:Rv0001:1-1524 1:MSMEG_RS33460:6986600-6988114 2:MRA_RS00005:1-1524 3:BQ2027_RS00005:1-1524
1 0:Rv0002:2052-3260 1:MSMEG_RS00005:499-1692 2:MRA_RS00010:2052-3260 3:BQ2027_RS00010:2052-3260
2 0:Rv0003:3280-4437 1:MSMEG_RS00015:2624-3778 2:MRA_RS00015:3280-4437 3:BQ2027_RS00015:3280-4437
To avoid issues with empty cells, I am filling cells with 'N/A' and then stripping the unwanted characters. But it gives exactly the same result; nothing seems to happen.
for value in orthologs['Tuberculosis_locus']:
    orthologs['Tuberculosis_locus'] = orthologs['Tuberculosis_locus'].fillna("N/A")
    orthologs['Tuberculosis_locus'] = orthologs['Tuberculosis_locus'].map(lambda x: x.lstrip('\d:').rstrip(':\d+'))
Any idea what I am doing wrong? I'd like the following output:
Tuberculosis_locus Smagmatis_locus H37RA_locus Bovis_locus
0 Rv0001 MSMEG_RS33460 MRA_RS00005 BQ2027_RS00005
1 Rv0002 MSMEG_RS00005 MRA_RS00010 BQ2027_RS00010
2 Rv0003 MSMEG_RS00015 MRA_RS00015 BQ2027_RS00015
Split on : with a maximum of two splits and then take the second element, e.g.:
df.applymap(lambda v: v.split(':', 2)[1])
def clean(x):
    x = x.split(':')[1].strip()
    return x

orthologs = orthologs.applymap(clean)
should work.
Explanation:
applymap applies a function element-wise to the whole dataframe, while apply works on a single column (a Series).
clean is the function you want to apply to every entry of the dataframe. Note that you pass it without parentheses (clean rather than clean(x)) when you use it with applymap or apply.
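If you only need to fix a single column, the same idea can also be written with the vectorized .str accessor instead of applymap; a minimal sketch, assuming the column name and sample values from the question:

import pandas as pd

orthologs = pd.DataFrame({'Tuberculosis_locus': ['0:Rv0001:1-1524', '0:Rv0002:2052-3260']})
# split each cell on ':' and keep element 1, the locus name between the colons
orthologs['Tuberculosis_locus'] = orthologs['Tuberculosis_locus'].str.split(':').str[1]
print(orthologs)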
I have 2 large dataframes I want to compare against each other.
I have applied .split(" ") to one of the columns and placed the result in a new column of the dataframe.
I now want to check whether a value exists in that new column, instead of using .contains() on the original column, to avoid the value getting picked up inside a longer word.
Here is what I've tried and why I'm frustrated.
row['company'][i]                                    # == 'nom'
L_df['Name split'][7126853]                          # == ['nom', '[this', 'is', 'nom]']
row['company'][i] in L_df['Name split'][7126853]     # == True (this is the index where I know the specific value occurs)
row['company'][i] in L_df['Name split']              # == False (WHAAT? my attempt to check the entire column); why is this False when I've shown it exists?
L_df[L_df['Name split'].isin([row['company'][i]])]   # == [empty]
Edit: I should add that I am trying to set up a process where I can iterate, checking entries in the smaller dataset against the larger one.
result = L_df[  # The [9] is a placeholder for our iterable 'i' that will go row by row
    L_df['Company name'].str.contains(row['company'][i], na=False)  # Can be difficult with names like 'Nom'
    # (row['company'][i] in L_df['Name split'])
    & L_df['Industry'].str.contains('marketing', na=False)  # Unreliable currently, need to get looser matches; min. reduction
    & L_df['Locality'].str.contains(row['city'][i], na=False)  # Reliable, but not always great at reducing results
    & ((row['workers'][i] >= L_df['Emp Lower bound']) & (row['workers'][i] <= L_df['Emp Upper bound']))  # Unreliable
]
The first condition is what I am trying to replace with this new process, so that I don't get matches when 'nom' appears in the middle of a word.
Here is a solution that first merges the two dataframes into one and then uses a lambda to process the columns of interest. The result is placed in a new column, found:
import pandas

df1 = pandas.DataFrame(data={'company': ['findme', 'asdf']})
df2 = pandas.DataFrame(data={'Name split': ["here is a string including findme and then some".split(" "), "something here".split(" ")]})
combined_df = pandas.concat([df1, df2], axis=1)
combined_df['found'] = combined_df.apply(lambda row: row['company'] in row['Name split'], axis=1)
Result:
company Name split found
0 findme [here, is, a, string, including, findme, and, ... True
1 asdf [something, here] False
EDIT:
In order to compare each value from the company column to every cell in the Name split column of the other dataframe, and to have access to the whole row from the latter dataframe, I would simply loop through the rows of both frames, see here:
import pandas as pd

df1 = pd.DataFrame(data={'company': ['findme', 'asdf']})
df2 = pd.DataFrame(data={'Name split': ["random text".split(" "), "here is a string including findme and then some".split(" "), "somethingasdfq here".split(" ")], 'another column': [3, 1, 2]})
for index1, row1 in df1.iterrows():
    for index2, row2 in df2.iterrows():
        if row1['company'] in row2['Name split']:
            # do something here with row2
            print(row2)
Probably not very efficient, but could be improved by breaking out of the inner loop as soon as a match is found if we only need one match.
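For instance, a minimal sketch of that early-exit variant, reusing the hypothetical df1 and df2 from above and collecting the matching index pairs:

matches = []
for index1, row1 in df1.iterrows():
    for index2, row2 in df2.iterrows():
        if row1['company'] in row2['Name split']:
            matches.append((index1, index2))
            break  # stop scanning df2 once the first match for this company is found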
Having found the maximum value in a pandas DataFrame column, I am just trying to get the equivalent row name as a string.
Here's my code:
df[df['ColumnName'] == df['ColumnName'].max()].index
Which returns me an answer:
Index(['RowName'], dtype='object')
How do I just get RowName back?
(Stretch question: why does .idmax() fail in the formulation df['ColName'].idmax? And, yes, I have tried it as .idmax() and also appended it to df.loc[:, 'ColName'], etc.)
Just use integer indexing:
df[df['ColumnName'] == df['ColumnName'].max()].index[0]
Here [0] extracts the first element. Note your criterion may yield multiple indices.
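As an aside on the stretch question: the method is spelled idxmax, not idmax, which is why those attempts fail with an AttributeError. A minimal sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'ColumnName': [3, 7, 2]}, index=['RowA', 'RowName', 'RowC'])
# idxmax returns the index label of the first occurrence of the maximum
print(df['ColumnName'].idxmax())  # 'RowName'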
I have some DataFrames with information about some elements, for instance:
import pandas as pd

my_df1 = pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]], columns=['Group','Value'])
my_df2 = pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]], columns=['Group','Value'])
I have used something like dfGroups = df.groupby('group').apply(my_agg).reset_index(), so now I have DataFrames with information about groups of the previous elements, say
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group, there should only be the first group left, while both groups qualify to stay in my_df2_Group.
As I don't know how to get my_df1_Group and my_df2_Group from my_df1 and my_df2 in Python (I know other languages where it would simply be name + "_Group" with name looping over [my_df1, my_df2], but how do you do that in Python?), I build a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max=16
Bad=[]
for Sample in SampleList:
    for n in Sample[1]['Group']:
        # This is inelegant, but trying to work with Sample[1] in the for doesn't work
        df = Sample[0].loc[Sample[0]['Group'] == n]
        if df['Value'].max() > my_max:
            Bad.append(1)
        else:
            Bad.append(0)
    Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
    Sample[1] = Sample[1].query('Bad_Row == 0')
This runs without errors, but doesn't work. In particular, it doesn't add the column Bad_Row to my df, nor does it modify my DataFrame (but the query runs smoothly even though the Bad_Row column doesn't seem to exist...). On the other hand, if I run this technique manually on a df (i.e. not in a loop), it works.
How should I do this?
Based on your comment below, I think you want to check whether a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation using a criterion on the input data. To accomplish this, my_func accepts a row from the aggregated data frame and the input data as a pandas groupby object. For each group in your grouped data frame, it subsets your initial data and uses boolean logic to see if any of the 'Values' in your input data meet the specified criterion.
def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > 16).any():
        return 'Bad Row'
    else:
        return 'Good Row'
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x,grouped_df1), axis=1)
Returns:
Group Group_Value Bad_Row
0 1 57 Good Row
1 2 63 Bad Row
Based on dubbbdan's idea, here is code that works:
my_max = 16

def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > my_max).any():
        return 1
    else:
        return 0

SampleList = [[my_df1, my_df1_Group], [my_df2, my_df2_Group]]
for Sample in SampleList:
    grouped_df = Sample[0].groupby('Group')
    Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x, grouped_df), axis=1)
    Sample[1].drop(Sample[1][Sample[1]['Bad_Row'] != 0].index, inplace=True)
    Sample[1].drop(['Bad_Row'], axis=1, inplace=True)
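For reference, a more compact variant of the same filtering idea, computing each group's maximum element Value on the element-level frame and then masking the aggregated frame (a sketch, not part of the original answer):

my_max = 16

# True for groups whose maximum element Value exceeds the threshold
bad_groups = my_df1.groupby('Group')['Value'].max() > my_max
my_df1_Group = my_df1_Group[~my_df1_Group['Group'].map(bad_groups)]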
I have to implement the equivalent of pandas .apply(function, axis=1) (a row-wise function) in pyspark. As I am a novice, I am not sure whether it can be implemented through a map function or using UDFs. I am not able to find any similar implementation anywhere.
Basically, all I want is to pass a row to a function, do some operations to create new columns that depend on the values of the current and previous rows, and then return the modified rows to create a new dataframe.
One of the functions used with pandas is given below:
previous = 1

def row_operation(row):
    global previous
    if pd.isnull(row["PREV_COL_A"]) or row["COL_A"] != row["PREV_COL_A"]:
        current = 1
    elif row["COL_C"] > cutoff:
        current = previous + 1
    elif row["COL_C"] <= cutoff:
        current = previous
    else:
        current = Nan
    previous = current
    return current
Here PREV_COL_A is nothing but COL_A lagged by 1 row.
Please note that this function is the simplest one; it does not return rows, but the others do.
If anyone can guide me on how to implement row operations in pyspark it would be a great help.
TIA
You could use rdd.mapPartitions. It will give you an iterator over the rows, and you yield out the result rows you want to return. The iterator you are given won't let you move forward or backwards by index, only fetch the next row. However, you can save off rows as you process them to do whatever you need to do. For example:
def my_cool_function(rows):
    prev_rows = []
    for row in rows:
        # Do some processing with all the rows, and return a result
        yield my_new_row

        if len(prev_rows) >= 2:
            prev_rows = prev_rows[1:]
        prev_rows.append(row)

updated_rdd = rdd.mapPartitions(my_cool_function)
Note, I used a list to track the previous rows for the sake of the example, but Python lists are really arrays, which don't have efficient head push/pop methods, so you will probably want to use an actual queue.
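A minimal sketch of how the row_operation above might translate to mapPartitions; the column names COL_A and COL_C and the cutoff global are taken from the question, while the sample data, the DataFrame-to-RDD round trip, and the output field name COUNTER are assumptions:

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()
cutoff = 10  # assumed value; the question treats cutoff as a global

def partition_counter(rows):
    # state lives only within one partition, mirroring the global 'previous' in pandas
    previous = 1
    prev_col_a = None
    for row in rows:
        if prev_col_a is None or row["COL_A"] != prev_col_a:
            current = 1
        elif row["COL_C"] > cutoff:
            current = previous + 1
        else:
            current = previous
        previous = current
        prev_col_a = row["COL_A"]
        yield Row(**row.asDict(), COUNTER=current)  # hypothetical output column

df = spark.createDataFrame([("a", 5), ("a", 12), ("b", 3)], ["COL_A", "COL_C"])
result = spark.createDataFrame(df.rdd.mapPartitions(partition_counter))
result.show()

Since the counter state only carries within a partition, the data would need to be partitioned and sorted so that each COL_A group lands, in order, inside a single partition for this to match the pandas version.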