How to extract the rows that contain the desired strings? - python

I would like to extract rows based on a list of strings (words, phrases, etc.). My questions are as follows:
Do I need to write this code every single time I want to extract rows?
What code can I write to generate a new variable after this for loop?
Here is what I tried.
fruit=['apple','banana','orange']
b1=[]
b2=[]
b3=[]
b4=[]
for i in range(len(df)):
    SelectedWord='apple'
    if SelectedWord in df.loc[i,'text']:
        a1=df.loc[i,'title']
        a2=df.loc[i,'text']
        a3=df.loc[i,'label']
        a4=df.loc[i,'author']
        b1.append(a1)
        b2.append(a2)
        b3.append(a3)
        b4.append(a4)
new_df=pd.DataFrame(columns=['title','text','label','author'])
new_df['title']=b1
new_df['text']=b2
new_df['label']=b3
new_df['author']=b4
It's basically like an Excel filter function, but I want to automate the process.

You don't need the for loop to do that. Building on @Mike67's suggestion:
fruit=['apple','banana','orange']
new_df = df.loc[df['text'].str.contains('|'.join(fruit), regex=True)]
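To avoid rewriting the filter each time, the one-liner above could be wrapped in a small helper. This is only a sketch: the function name `filter_rows`, its `column` parameter, and the sample data are made up for illustration. It escapes each search string with `re.escape` so that characters such as parentheses are matched literally, and passes `na=False` so that missing text counts as "no match".

```python
import re

import pandas as pd


def filter_rows(df, words, column="text"):
    """Return the rows of df whose `column` contains any of `words`."""
    # Escape each word so regex metacharacters are treated as plain text.
    pattern = "|".join(re.escape(word) for word in words)
    # na=False treats missing values as non-matching instead of raising.
    mask = df[column].str.contains(pattern, regex=True, na=False)
    return df.loc[mask].reset_index(drop=True)


# Hypothetical sample data with the same columns as in the question.
df = pd.DataFrame({
    "title": ["t1", "t2", "t3", "t4"],
    "text": ["I like apple pie", "no fruit here", None, "banana (ripe)"],
    "label": [0, 1, 0, 1],
    "author": ["a", "b", "c", "d"],
})

fruit_df = filter_rows(df, ["apple", "banana (ripe)"])
print(fruit_df)
```

The result is an ordinary DataFrame, so it can be assigned to a new variable for every word list you want to filter on.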

Related

Optimization: Apply function to all values in a pandas dataframe

I have a data frame of words that looks like this:
I built a function called get_freq(word) that takes a string and returns a list with the word and its frequency in a certain corpus (iWeb Corpus). This corpus is in another data frame called df_freq:
def get_freq(word):
    word_freq=[]
    for i in range(len(df_freq)):
        if(df_freq.iloc[i, 0]==word):
            word_freq.append(word)
            word_freq.append(df_freq.iloc[i, 1])
            break
    return word_freq
This step works fine:
Now, I need to iterate through the whole data frame and apply the get_freq() function to every word in every cell. I would like the original words to be replaced by the list that the function returns.
I managed to do this with the following code:
for row in range(len(df2)):
    for col in range(len(df2.columns)):
        df2.values[row,col] = get_freq(df2.iat[row,col])
The problem is that this took over 5 minutes to complete. The reason is that I'm using a nested for loop, and get_freq(word) has another for loop inside it. I tried using a while loop in the function instead, without improvement.
How can I optimize the execution time of this task? Any suggestions are welcome.
This is what DataFrame.applymap is for (it was renamed to DataFrame.map in pandas 2.1):
df = df.applymap(get_freq)
However, because this operation probably can't be vectorized, it's going to take some time any way you go about it.
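Most of the time in the question's code likely goes into scanning df_freq row by row for every single word. One possible way to cut that down, sketched here on made-up data, is to build a dictionary from df_freq once and look each word up in it. The names `freq_by_word` and `get_freq_fast` are hypothetical, and the sample frames only stand in for the real ones.

```python
import pandas as pd

# Hypothetical stand-ins for the frames described in the question.
df_freq = pd.DataFrame({"word": ["the", "cat", "sat"], "freq": [100, 7, 3]})
df2 = pd.DataFrame({"a": ["the", "dog"], "b": ["sat", "cat"]})

# Keep the first occurrence of each word, like the `break` in get_freq does.
unique_freq = df_freq.drop_duplicates(subset=df_freq.columns[0])
freq_by_word = dict(zip(unique_freq.iloc[:, 0], unique_freq.iloc[:, 1]))


def get_freq_fast(word):
    """Same result as get_freq, but with a dictionary lookup instead of a scan."""
    if word in freq_by_word:
        return [word, freq_by_word[word]]
    return []  # get_freq also returns an empty list for unknown words


# Series.map is applied column by column, which works across pandas versions.
result = df2.apply(lambda col: col.map(get_freq_fast))
print(result)
```

The dictionary is built once, so each cell costs a single hash lookup rather than a pass over the whole corpus table.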

Python remove outer quotes in a list of lists made from a data frame column

I have a pandas data frame called positive_samples that has a column called Gene Class, which holds a pair of genes stored as a list. It looks like this:
The entire data frame looks like this.
So the Gene Class column is just the other two columns in the data frame combined. I made a list from the Gene Class column as shown below. This takes all the gene pair lists and makes them into a single list.
#convert the column to a list
positive_gene_pairs = positive_samples["Gene Class"].tolist()
This is the output.
Each pair is now wrapped in double quotes, which I don't want, because I loop through this list and use the .loc method to locate these pairs in another data frame called new_expression_df, which has them as an index, like this:
for positive_gene_pair in positive_gene_pairs:
    print(new_expression_df.loc[[positive_gene_pair],"GSM144819"])
This throws a keyerror.
It is definitely because of the extra quotes wrapped around each pair, because when I instantiate a list like the one below without quotes, it works just fine.
So my question is: how do I remove the extra quotes to make this work with .loc? How do I make a list just like the one below, but from a data frame column?
pairs = [['YAL013W','YBR103W'],['YAL011W','YMR263W']]
I tried many workarounds such as replace and strip, but none of them worked, since they are meant for strings and I was trying to use them on a list. Is there an easy solution? I just want a list of lists like this pairs list, without extra single or double quotes.
Convert list of strings to lists first:
import ast
positive_gene_pairs = positive_samples["Gene Class"].apply(ast.literal_eval).tolist()
And then remove the extra [] in the .loc call, changing:
for positive_gene_pair in positive_gene_pairs:
    print(new_expression_df.loc[[positive_gene_pair],"GSM144819"])
to:
for positive_gene_pair in positive_gene_pairs:
    print(new_expression_df.loc[positive_gene_pair,"GSM144819"])
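A small sketch of what the conversion does, assuming the Gene Class cells hold the text of a list (as typically happens after a round trip through CSV) rather than real list objects. The sample frame is made up to mirror the pairs in the question.

```python
import ast

import pandas as pd

# Assumed situation: each cell is a string that merely looks like a list.
positive_samples = pd.DataFrame({
    "Gene Class": ["['YAL013W', 'YBR103W']", "['YAL011W', 'YMR263W']"],
})

# Plain tolist() keeps the strings, which is where the outer quotes come from.
as_strings = positive_samples["Gene Class"].tolist()

# ast.literal_eval parses each string into an actual Python list.
as_lists = positive_samples["Gene Class"].apply(ast.literal_eval).tolist()

print(type(as_strings[0]).__name__, as_strings[0])
print(type(as_lists[0]).__name__, as_lists[0])
```

If the column already contains real lists, `ast.literal_eval` would raise an error, so it is worth checking the type of one cell first.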
Alternatively, define a function:
def listup(ini_list):
    # Convert the string representation to a list
    # Note: any quotes around the individual items are kept
    res = ini_list.strip('][').split(', ')
    return res
Then change from
positive_gene_pairs = positive_samples["Gene Class"].tolist()
to
positive_gene_pairs = positive_samples["Gene Class"].apply(listup).tolist()

Remove first 3 characters in string using a condition statement

Can anyone kindly help, please?
I'm trying to remove the first three characters of a string using the statement:
Data['COUNTRY_CODE'] = Data['COUNTRY1'].str[3:]
This will create a new column after removing the first three values of the string. However, I do not want this to be applied to all of the values within the same column so was hoping there would be a way to use a conditional statement such as 'Where' in order to only change the desired strings?
I assume you are using pandas so your condition check can be like:
condition_mask = Data['COL_YOU_WANT_TO_CHECK'] == 'SOME CONDITION'
Your new column can be created as:
# Assuming you want the first 3 chars as COUNTRY_CODE
Data.loc[condition_mask, 'COUNTRY_CODE'] = Data['COUNTRY1'].str[:3]
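If the goal is the one stated in the question (drop the first three characters, but only for some rows, and keep the other values unchanged), one option is `Series.mask`, which replaces values only where the condition is True. This is a sketch on invented data: the sample values and the condition (a two-digit prefix followed by a hyphen) are assumptions, so substitute your own rule.

```python
import pandas as pd

# Hypothetical data: some values carry a 3-character prefix, some do not.
Data = pd.DataFrame({"COUNTRY1": ["44-GB", "FR", "01-US"]})

# Assumed condition: the value starts with two digits and a hyphen.
condition_mask = Data["COUNTRY1"].str.match(r"\d{2}-")

# Where the mask is True take the sliced string, otherwise keep the original.
Data["COUNTRY_CODE"] = Data["COUNTRY1"].mask(
    condition_mask, Data["COUNTRY1"].str[3:]
)
print(Data)
```

Unlike assigning through `.loc` into a new column, this leaves no missing values in the rows where the condition is False.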

Define variable number of columns in for loop

I am new to pandas and I am creating new columns based on conditions from other existing columns using the following code:
df.loc[(df.item1_existing=='NO') & (df.item1_sold=='YES'),'unit_item1']=1
df.loc[(df.item2_existing=='NO') & (df.item2_sold=='YES'),'unit_item2']=1
df.loc[(df.item3_existing=='NO') & (df.item3_sold=='YES'),'unit_item3']=1
Basically, what this means is that if item is NOT existing ('NO') and the item IS sold ('YES') then give me a 1. This works to create 3 new columns but I am thinking there is a better way. As you can see, there is a repeated string in the name of the columns: '_existing' and '_sold'. I am trying to create a for loop that will look for the name of the column that ends with that specific word and concatenate the beginning, something like this:
unit_cols = ['item1','item2','item3']
for i in unit_cols:
    df.loc[('df.'+i+'_existing'=='NO') & ('df'+i+'_sold'=='YES'),'unit_'+i]=1
but of course, it doesn't work. As I said, I am able to make it work with the initial example, but I would like to have fewer lines of code instead of repeating the same code, because I need to create several columns this way, not just three. Is there a way to make this easier? Is the for loop the best option? Thank you.
You can use Boolean series, i.e. True / False depending on whether your condition is met. Combined with pd.Series.eq and f-strings (PEP 498, Python 3.6+), and using __getitem__ (or its syntactic sugar []) to allow string inputs, you can write your logic more readably:
unit_cols = ['item1','item2','item3']
for i in unit_cols:
    df[f'unit_{i}'] = df[f'{i}_existing'].eq('NO') & df[f'{i}_sold'].eq('YES')
If you need integers (1 / 0) instead of Boolean values, you can convert via astype:
df[f'unit_{i}'] = df[f'unit_{i}'].astype(int)
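Put together on a small made-up frame, the loop could look like the sketch below. Deriving `unit_cols` from the column names that end in `_existing` is an addition not present in the answer, included because the question asks about matching on the column suffix. Note that this variant fills non-matching rows with 0, whereas the original `.loc` assignments leave them as NaN.

```python
import pandas as pd

# Hypothetical sample data following the naming scheme in the question.
df = pd.DataFrame({
    "item1_existing": ["NO", "YES", "NO"],
    "item1_sold": ["YES", "YES", "NO"],
    "item2_existing": ["NO", "NO", "YES"],
    "item2_sold": ["YES", "NO", "YES"],
})

# Collect the item prefixes from every column that ends in "_existing".
suffix = "_existing"
unit_cols = [col[: -len(suffix)] for col in df.columns if col.endswith(suffix)]

for i in unit_cols:
    is_new_sale = df[f"{i}_existing"].eq("NO") & df[f"{i}_sold"].eq("YES")
    df[f"unit_{i}"] = is_new_sale.astype(int)  # 1 where both conditions hold

print(df[[f"unit_{i}" for i in unit_cols]])
```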

Passing a list to a new list with a dynamic name after each step of a loop

I have a data frame that contains several columns.
I would like to iterate over some columns of the data frame and convert each row to a list using a function called tweet_to_sentences, which takes a tokenizer.
columns = ['stemmed', 'lemmatized', 'lem_stop','stem_stop', 'lem_stop_nltk', 'stem_stop_nltk']
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = []
for i in columns:
    for tweet in df[i]:
        sentences += tweet_to_sentences(tweet, tokenizer)
However, I would like to create 6 different lists rather than one.
How can I change the name of the variable based on the variable i in each step of the first loop (i.e. for i in columns:)?
I was thinking of something like this:
for i in columns:
    for tweet in df[i]:
        sentences += tweet_to_sentences(tweet, tokenizer)
    i_list = sentences
where i_list would translate into stemmed_list, lemmatized_list, lem_stop_list and so on.
Any ideas?
Papa,
Perhaps I'm misunderstanding your question, but what I'm hearing is that each time you iterate over 'i' in your for loop, you want to dynamically create a list (with a new name)?
If I could suggest a different approach based on this understanding, it would be to create a list of lists: each time you go into the for loop, you create a new list and add it as an element to the list of lists. Perhaps a better solution would be to create a dictionary, with the list name (e.g. 'stemmed') as the key and the list you create in the for loop as the value.
Would this work for you?
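A minimal sketch of the dictionary idea. The `tweet_to_sentences` function here is only a hypothetical stand-in for the one in the question (it splits on ". " and whitespace and ignores the tokenizer), and the data frame is made up, so the example runs without nltk.

```python
import pandas as pd


def tweet_to_sentences(tweet, tokenizer=None):
    # Hypothetical stand-in: one list of words per sentence.
    return [sentence.split() for sentence in tweet.split(". ") if sentence]


# Made-up data with two of the columns named in the question.
df = pd.DataFrame({
    "stemmed": ["run fast. jump high", "eat food"],
    "lemmatized": ["running fast", "eating food. sleeping well"],
})
columns = ["stemmed", "lemmatized"]

sentences_by_column = {}
for name in columns:
    sentences = []  # start a fresh list for every column
    for tweet in df[name]:
        sentences += tweet_to_sentences(tweet)
    sentences_by_column[name] = sentences  # the column name is the key

print(sentences_by_column["stemmed"])
```

Each list is then available as `sentences_by_column["stemmed"]`, `sentences_by_column["lemmatized"]`, and so on, which replaces the dynamically named variables such as `stemmed_list`.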
