Optimization: Apply function to all values in a pandas dataframe - python

I have a data frame of words that looks like this:
I built a function called get_freq(word) that takes a string and returns a list with the word and its frequency in a certain corpus (the iWeb Corpus). The corpus is in another data frame called df_freq:
def get_freq(word):
    word_freq = []
    for i in range(len(df_freq)):
        if df_freq.iloc[i, 0] == word:
            word_freq.append(word)
            word_freq.append(df_freq.iloc[i, 1])
            break
    return word_freq
This step works fine:
Now, I need to iterate through the whole data frame and apply the get_freq() function to every word in every cell. I would like the original words to be replaced by the list that the function returns.
I managed to do this with the following code:
for row in range(len(df2)):
    for col in range(len(df2.columns)):
        df2.values[row, col] = get_freq(df2.iat[row, col])
The problem is that this took over 5 minutes to complete, because I'm using nested for loops and get_freq(word) contains another loop. I tried a while loop in the function instead, with no improvement.
How can I optimize the execution time of this task? Any suggestions are welcome.

This is what DataFrame.applymap is for:
df = df.applymap(get_freq)
However, because this operation probably can't be vectorized, it's going to take some time any way you go about it.
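The bigger win, though, is usually inside get_freq itself: each call scans df_freq linearly, so the total cost is quadratic. Building a dict once makes every lookup O(1). A minimal sketch under the assumption that df_freq has the word in column 0 and the frequency in column 1 (the small frames below are stand-ins for the question's data):

```python
import pandas as pd

# Toy stand-ins for the question's df_freq (corpus) and df2 (words).
df_freq = pd.DataFrame({'word': ['apple', 'banana', 'cherry'],
                        'freq': [10, 5, 2]})
df2 = pd.DataFrame({'a': ['apple', 'cherry'], 'b': ['banana', 'apple']})

# Build the word -> frequency lookup once, outside any loop.
freq = dict(zip(df_freq.iloc[:, 0], df_freq.iloc[:, 1]))

def get_freq_fast(word):
    # O(1) dict lookup instead of scanning df_freq row by row.
    return [word, freq[word]] if word in freq else []

df2 = df2.applymap(get_freq_fast)
```

(On pandas 2.1+, DataFrame.map is the preferred spelling of applymap; the per-cell cost is what matters here either way.)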

Related

running eval code on a dataframe for each row in python as fast as possible

I have a dataframe where I have defined some set of rules to be applied for request data and return value for each rule. The rules which are to be applied are defined outside of the main processing function. The rules dataframe looks like this:
rule_id  rule_name        rule_function
1        frequency_check  frequency_check(request_data)
2        age_check        age_check(request_data)
def process_rules(request_data, rules_df):
    rule_vals = []
    for i, row in rules_df.iterrows():
        rule_vals.append(eval(row['rule_function']))
    return rule_vals
def frequency_check(request_data):
    """ some code to return values"""

def age_check(request_data):
    """ some code to return values"""
The above approach is working, but it is taking ~1.1 seconds to evaluate and return the final result. Can anyone suggest a faster approach for this scenario?
So far I have tried using a dictionary comprehension instead of rules_df, and applying a lambda function, but neither gives a significant improvement.
If you don't need to look at row - 1 (or row - n, I suppose), then you are best off avoiding row-by-row operations. Instead, look at using pandas' vectorised capabilities by operating on the entire column in one go.
See here:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html
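Concretely, a large share of the overhead here is likely the eval call itself, which re-parses each rule string on every request. Storing the functions as callables removes eval entirely. A minimal sketch under that assumption, with stand-in rule bodies (the real frequency_check/age_check logic is not shown in the question):

```python
# Stand-in rule functions; the question's real bodies are not shown.
def frequency_check(request_data):
    return request_data['count'] < 10

def age_check(request_data):
    return request_data['age'] >= 18

# A plain list of callables replaces the rule_function string column.
rules = [frequency_check, age_check]

def process_rules(request_data, rules):
    # Direct calls: no string parsing, no eval.
    return [rule(request_data) for rule in rules]

result = process_rules({'count': 3, 'age': 21}, rules)
```

If the rules must stay in a dataframe, the same idea applies: keep function objects (or a name-to-function dict) in the column instead of source strings.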

How to extract the rows that contain the desired strings?

I would like to extract rows based on a list of strings (words, phrases, etc.). My questions are as follows:
Do I need to write this code every single time to extract?
What code can I write to generate a new variable after this for loop?
Here is what I tried.
fruit = ['apple', 'banana', 'orange']
b1 = []
b2 = []
b3 = []
b4 = []
for i in range(len(df)):
    SelectedWord = 'apple'
    if SelectedWord in df.loc[i, 'text']:
        a1 = df.loc[i, 'title']
        a2 = df.loc[i, 'text']
        a3 = df.loc[i, 'label']
        a4 = df.loc[i, 'author']
        b1.append(a1)
        b2.append(a2)
        b3.append(a3)
        b4.append(a4)
new_df = pd.DataFrame(columns=['title', 'text', 'label', 'author'])
new_df['title'] = b1
new_df['text'] = b2
new_df['label'] = b3
new_df['author'] = b4
It's basically like an Excel filter function, but I want to automate the process.
You don't need the for loop to do that. Complementing @Mike67's suggestion:
fruit=['apple','banana','orange']
new_df = df.loc[df['text'].str.contains('|'.join(fruit), regex=True)]
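As a runnable sketch with a toy dataframe (the columns mirror the question's; na=False is worth adding so rows with missing text count as non-matches instead of raising an error):

```python
import pandas as pd

# Toy stand-in for the question's dataframe.
df = pd.DataFrame({
    'title': ['t1', 't2', 't3'],
    'text': ['I like apple pie', 'no fruit here', 'banana bread'],
    'label': [0, 1, 0],
    'author': ['a', 'b', 'c'],
})

fruit = ['apple', 'banana', 'orange']
# '|'.join builds the regex 'apple|banana|orange'; na=False handles NaN text.
mask = df['text'].str.contains('|'.join(fruit), regex=True, na=False)
new_df = df.loc[mask]
```

Changing the fruit list is then the only edit needed to filter on a different set of words.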

How can I mix similar code using loops (for/while) in Python?

I have some repeated code, where only a few numbers change.
df_h0 = df.copy()
df_h0['hour']='00:00'
df_h0['totalCount']=df.post_time_data.str.split('"00:00","postCount":"').str[1].str.split('","topic').str[0]
df_h0 = df_h0.fillna(0)
df_h1 = df.copy()
df_h1['hour']='01:00'
df_h1['totalCount']=df.post_time_data.str.split('"01:00","postCount":"').str[1].str.split('","topic').str[0]
df_h1 = df_h1.fillna(0)
df_h2 = df.copy()
df_h2['hour']='02:00'
df_h2['totalCount']=df.post_time_data.str.split('"02:00","postCount":"').str[1].str.split('","topic').str[0]
df_h2 = df_h2.fillna(0)
I want to simplify this code with a loop but I'm not sure how to start with that since I'm new in Python.
I will try to show what the process looks like in general, so that you can figure these things out yourself in the future. However, it's not automatic - you will need to think about what you are doing every time, in order to write the best code you are capable of.
Step 1: Grab a single representative block of the code you want to repeat, and identify the parts that change:
df_h0 = df.copy()
# ^^^ the variable name changes
df_h0['hour']='00:00'
# ^^^^^ the hour string changes
df_h0['totalCount']=df.post_time_data.str.split('"00:00","postCount":"').str[1].str.split('","topic').str[0]
# the delimiter string changes ^^^^^^^^^^^^^^^^^^^^^^^
df_h0 = df_h0.fillna(0)
Step 2: Understand that our output will be a list of values, instead of multiple separate variables with related names.
This will be much easier to work with going forward :)
Step 3: Analyze the changes.
We have an hour string which varies, and a delimiter string which also varies; but the delimiter string always has the same general form, which is based upon the hour string. So if we have the hour string, we can create the delimiter string. There is really only one piece of varying information - the hour. We'll adjust the code to reflect that:
hour = '00:00' # give the variable information a name
delimiter = f'"{hour}","postCount":"' # compute the derived information
# and then use those values in the rest of the code
df_h0 = df.copy()
df_h0['hour'] = hour
df_h0['totalCount']=df.post_time_data.str.split(delimiter).str[1].str.split('","topic').str[0]
df_h0 = df_h0.fillna(0)
Step 4: To make the overall code easier to understand, we put this block into its own function.
That lets us give a name to the process of making a single table. We use the input to the function to provide the varying information that we described in step 3. There is one thing that changes, so there will be one parameter to represent that. However, we also need to provide the data context that we're working with here - the df dataframe - so that the function has access to it. So we have two parameters in total.
def hourly_data(df, hour):
    # since 'hour' was provided, we don't define it here
    delimiter = f'"{hour}","postCount":"'
    # now we use a generic name inside the function.
    result = df.copy()
    result['hour'] = hour
    result['totalCount'] = df.post_time_data.str.split(delimiter).str[1].str.split('","topic').str[0]
    # At the last step of the original process, we `return` the value
    # instead of simply assigning it.
    return result.fillna(0)
Now we have code that, given an 'hour' string, can produce a new dataframe, simply by calling it - for example: df_h0 = hourly_data(df, '00:00').
Step 5: A bit more analysis.
We would like to call this function with each possible hour value, presumably from '00:00' through '23:00' inclusive. However, these strings have an obvious pattern to them. It would be easier if we just supply the number for the hour to hourly_data, and have it produce the string.
def hourly_data(df, hour):
    # Locally replace the integer hour value with the hour string.
    # The `:02` here is used to zero-pad and right-align the hour value
    # as two digits.
    hour = f'{hour:02}:00'
    delimiter = f'"{hour}","postCount":"'
    # The rest as before.
    result = df.copy()
    result['hour'] = hour
    result['totalCount'] = df.post_time_data.str.split(delimiter).str[1].str.split('","topic').str[0]
    return result.fillna(0)
Step 6: Now we are ready to use this code in a loop.
In Python, the natural loop to "transform" one input list into another is the list comprehension. It looks like this:
hourly_dfs = [hourly_data(df, hour) for hour in range(24)]
Here, range is a built-in function that gives us the desired sequence of input values.
We can also build the list manually with a for loop:
hourly_dfs = []
for hour in range(24):
    hourly_dfs.append(hourly_data(df, hour))
We could also have done the work inside the body of the for loop (someone else will probably come along with another answer and show code like that). But by making the function first, we get code that is easier to understand, and which also allows us to use a list comprehension. The list comprehension approach is simpler, because we don't have to think about the process of starting from empty and .appending each element, we let Python build a list instead of telling it how to do so.
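Putting the steps above together as a self-contained sketch (the one-row frame below is a stand-in for the question's data, whose exact format isn't shown, and the pd.concat at the end is only needed if the goal is one combined long-format table):

```python
import pandas as pd

def hourly_data(df, hour):
    # Zero-pad the integer hour into the '00:00' string form.
    hour = f'{hour:02}:00'
    delimiter = f'"{hour}","postCount":"'
    result = df.copy()
    result['hour'] = hour
    result['totalCount'] = df.post_time_data.str.split(delimiter).str[1].str.split('","topic').str[0]
    return result.fillna(0)

# Toy stand-in for the question's data (format assumed from the split strings).
df = pd.DataFrame({'post_time_data': [
    '"00:00","postCount":"5","topic":"a""01:00","postCount":"7","topic":"b"',
]})

hourly_dfs = [hourly_data(df, hour) for hour in range(2)]
# One combined table, if that's the end goal:
combined = pd.concat(hourly_dfs, ignore_index=True)
```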
You could iterate over the hour numbers and use the string .format method, collecting each result in a list (note that simply rebinding a loop variable would not update df_h0 and friends, so the results have to be stored explicitly):
dfs = []
for x in range(3):
    var = df.copy()
    var['hour'] = '{0:02d}:00'.format(x)
    var['totalCount'] = df.post_time_data.str.split('"{0:02d}:00","postCount":"'.format(x)).str[1].str.split('","topic').str[0]
    dfs.append(var.fillna(0))
df_h0, df_h1, df_h2 = dfs
If you have Python 3.6+ you can use f-strings instead of .format() as well.
Hopefully I haven't missed anything, but if I have, you can extend the same logic by adjusting the range.

Python Pandas replace string based on format

Please, is there any way to replace "x-y" with "x,x+1,x+2,...,y" in every row of a data frame (where x and y are integers)?
For example, I want to replace every row like this:
"1-3,7" by "1,2,3,7"
"1,4,6-9,11-13,5" by "1,4,6,7,8,9,11,12,13,5"
etc.
I know that by looping through lines and using regular expressions we can do that, but the table is quite big and it takes quite some time, so I think using pandas might be faster.
Thanks a lot
In pandas you can use apply to apply any function to either rows or columns in a DataFrame. The function can be passed with a lambda, or defined separately.
(side-remark: your example does not entirely make clear if you actually have a 2-D DataFrame or just a 1-D Series. Either way, apply can be used)
The next step is to find the right function. Here's a rough version (without regular expressions):
def make_list(s):  # avoid shadowing the built-in `str`
    newlst = []
    for part in s.split(','):
        if "-" in part:
            start, end = (int(j) for j in part.split("-"))
            newlst.extend(range(start, end + 1))  # + 1: range excludes its end value
        else:
            newlst.append(int(part))
    return newlst
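A self-contained sketch of using this with apply, joining the expanded numbers back into a string to match the question's examples (note the + 1 on the range end, since Python's range excludes it):

```python
import pandas as pd

def make_list(s):
    newlst = []
    for part in s.split(','):
        if '-' in part:
            start, end = (int(j) for j in part.split('-'))
            newlst.extend(range(start, end + 1))  # inclusive of the end value
        else:
            newlst.append(int(part))
    return newlst

# Toy Series with the question's two example rows.
s = pd.Series(['1-3,7', '1,4,6-9,11-13,5'])
expanded = s.apply(lambda v: ','.join(str(n) for n in make_list(v)))
```

For a 2-D DataFrame, the same function can be applied per column, or to every cell with applymap.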

Accessing Data using df['foo'] missing data for pattern searching python

So I have this function which takes in one row from a dataframe, matches a pattern, and adds the result to the data. Since the pattern search needs a string input, I am forcing it with str(). However, if I do that, it cuts off my url after a certain point.
I figured out that if I force it using the ix function
str(data.ix[0, 'url'])
it does not cut anything off and gets me what I want. But if I use str(data.ix[:, 'url']), it also cuts off after some point.
The problem is I cannot specify the index position inside the ix function, as I plan to iterate by row using the apply function. Any suggestion?
import re

def foo(data):
    url = str(data['url'])
    m = re.search(r"model=(?P<model>\w+)&id=\d+&make=(?P<make>\w+)", url)
    if m:
        data['make'] = m.group("make")
        data['model'] = m.group("model")
    return data
Iterating row-by-row is a last resort. It's almost always slower, less readable, and less idiomatic.
Fortunately, there is an easy way to do what you want to do. Check out the Series.str.extract method (reached through the .str accessor on a column), added in version 0.13 of pandas.
Something like this...
pattern = r'model=(?P<model>\w+)&id=\d+&make=(?P<make>\w+)'
extracted_data = data['url'].str.extract(pattern)
The result, extracted_data will be a new DataFrame with columns named 'model' and 'make', inferred from the named groups in your regex pattern.
Join it to your original DataFrame, and you're done.
data = data.join(extracted_data)
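A minimal runnable sketch, with a one-row toy frame standing in for the question's data:

```python
import pandas as pd

# Toy stand-in for the question's dataframe.
data = pd.DataFrame({'url': ['page?model=corolla&id=42&make=toyota']})

pattern = r'model=(?P<model>\w+)&id=\d+&make=(?P<make>\w+)'
# str.extract returns one column per named group in the pattern.
extracted_data = data['url'].str.extract(pattern)
data = data.join(extracted_data)
```

Rows whose url does not match the pattern simply get NaN in the 'model' and 'make' columns.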
