Python Pandas replace string based on format

Please, is there any way to replace "x-y" with "x,x+1,x+2,...,y" in every row in a data frame? (Where x and y are integers.)
For example, I want to replace every row like this:
"1-3,7" by "1,2,3,7"
"1,4,6-9,11-13,5" by "1,4,6,7,8,9,11,12,13,5"
etc
I know that by looping through the rows and using regular expressions we can do this, but the table is quite big and it takes quite some time, so I think using pandas might be faster.
Thanks a lot

In pandas you can use apply to apply any function to either rows or columns in a DataFrame. The function can be passed with a lambda, or defined separately.
(side-remark: your example does not entirely make clear if you actually have a 2-D DataFrame or just a 1-D Series. Either way, apply can be used)
The next step is to find the right function. Here's a rough version (without regular expressions):
def make_list(s):
    # Split on commas, then expand each "x-y" piece into x, x+1, ..., y.
    newlst = []
    for part in s.split(','):
        if "-" in part:
            start, end = (int(j) for j in part.split("-"))
            newlst.extend(range(start, end + 1))  # +1 so the endpoint y is included
        else:
            newlst.append(int(part))
    return newlst
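For example, applied to a Series (a minimal sketch; if you need the result back as a comma-separated string rather than a list, wrap it with ','.join(map(str, ...))):

import pandas as pd

s = pd.Series(["1-3,7", "1,4,6-9,11-13,5"])
print(s.apply(make_list).tolist())
# [[1, 2, 3, 7], [1, 4, 6, 7, 8, 9, 11, 12, 13, 5]]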

Related

Optimization: Apply function to all values in a pandas dataframe

I have a data frame of words that looks like this:
I built a function called get_freq(word) that takes a string and returns a list with the word and its frequency in a certain corpus (iWeb Corpus). The corpus is in another data frame called df_freq.
def get_freq(word):
    word_freq = []
    for i in range(len(df_freq)):
        if df_freq.iloc[i, 0] == word:
            word_freq.append(word)
            word_freq.append(df_freq.iloc[i, 1])
            break
    return word_freq
This step works fine:
Now, I need to iterate through the whole data frame and apply the get_freq() function to every word in every cell. I would like the original words to be replaced by the list that the function returns.
I managed to do this with the following code:
for row in range(len(df2)):
    for col in range(len(df2.columns)):
        df2.values[row, col] = get_freq(df2.iat[row, col])
The problem is that this took over 5 minutes to complete. The reason is that I'm using a nested for loop, and the function get_freq(word) has another for loop inside it. I have tried using a while loop in the function instead, without improvement.
How can I optimize the execution time of this task? Any suggestions are welcome.
This is what DataFrame.applymap is for:
df = df.applymap(get_freq)
However, because this operation probably can't be vectorized, it's going to take some time any way you go about it.
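Most of that time is spent in the linear scan inside get_freq, though. A minimal sketch (assuming df_freq holds the word in its first column and the frequency in its second, as the indexing above suggests) that builds a dict once so each cell becomes an O(1) lookup:

# Build the lookup table once instead of scanning df_freq for every cell.
freq_map = dict(zip(df_freq.iloc[:, 0], df_freq.iloc[:, 1]))

def get_freq_fast(word):
    # Mirror the original's output: [word, frequency], or [] if the word is absent.
    return [word, freq_map[word]] if word in freq_map else []

df2 = df2.applymap(get_freq_fast)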

running eval code on a dataframe for each row in python as fast as possible

I have a dataframe where I have defined a set of rules to be applied to request data, with a return value for each rule. The rule functions are defined outside of the main processing function. The rules dataframe looks like this:
rule_id | rule_name       | rule_function
1       | frequency_check | frequency_check(request_data)
2       | age_check       | age_check(request_data)
def process_rules(request_data, rules_df):
    rule_vals = []
    for i, row in rules_df.iterrows():
        rule_vals.append(eval(row['rule_function']))
    return rule_vals

def frequency_check(request_data):
    """ some code to return values"""

def age_check(request_data):
    """ some code to return values"""
The above approach works, but it takes ~1.1 seconds to evaluate and return the final result. Can anyone suggest the approach that will work fastest in this scenario?
So far I have tried using a dictionary comprehension instead of rules_df, and apply with a lambda function, but neither gives a significant improvement.
If you don't need to look at row - 1 (or row - n, I suppose), then you are best off avoiding operations on a row-by-row basis. Instead, look at using pandas' vectorised capabilities by operating on the entire column in one go.
See here:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html
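As an illustration (a sketch only; the columns and rule logic below are hypothetical, not from the question), each rule can operate on an entire column at once, and storing callables in a dict instead of strings in a dataframe avoids eval entirely:

import pandas as pd

request_data = pd.DataFrame({"age": [17, 25, 70], "requests_today": [3, 50, 1]})

# Hypothetical vectorised rules: each returns a Boolean Series, one value per row.
def frequency_check(data):
    return data["requests_today"] <= 10

def age_check(data):
    return data["age"].between(18, 65)

# A dict of callables replaces both rules_df and eval.
rules = {"frequency_check": frequency_check, "age_check": age_check}
rule_vals = {name: fn(request_data) for name, fn in rules.items()}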

Define variable number of columns in for loop

I am new to pandas and I am creating new columns based on conditions from other existing columns using the following code:
df.loc[(df.item1_existing=='NO') & (df.item1_sold=='YES'),'unit_item1']=1
df.loc[(df.item2_existing=='NO') & (df.item2_sold=='YES'),'unit_item2']=1
df.loc[(df.item3_existing=='NO') & (df.item3_sold=='YES'),'unit_item3']=1
Basically, what this means is that if item is NOT existing ('NO') and the item IS sold ('YES') then give me a 1. This works to create 3 new columns but I am thinking there is a better way. As you can see, there is a repeated string in the name of the columns: '_existing' and '_sold'. I am trying to create a for loop that will look for the name of the column that ends with that specific word and concatenate the beginning, something like this:
unit_cols = ['item1','item2','item3']
for i in unit_cols:
    df.loc[('df.'+i+'_existing'=='NO') & ('df'+i+'_sold'=='YES'),'unit_'+i]=1
but of course, it doesn't work. As I said, I am able to make it work with the initial example, but I would like to have fewer lines of code instead of repeating the same code because I need to create several columns this way, not just three. Is there a way to make this easier? is the for loop the best option? Thank you.
You can use Boolean series, i.e. True / False depending on whether your condition is met. Coupled with pd.Series.eq and f-strings (PEP498, Python 3.6+), and using __getitem__ (or its syntactic sugar []) to allow string inputs, you can write your logic more readably:
unit_cols = ['item1','item2','item3']
for i in unit_cols:
    df[f'unit_{i}'] = df[f'{i}_existing'].eq('NO') & df[f'{i}_sold'].eq('YES')
If you need integers (1 / 0) instead of Boolean values, you can convert via astype:
df[f'unit_{i}'] = df[f'unit_{i}'].astype(int)
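A quick check on a toy frame (hypothetical data, just to show the result), folding the conversion into one line:

import pandas as pd

df = pd.DataFrame({'item1_existing': ['NO', 'YES'],
                   'item1_sold':     ['YES', 'YES']})
df['unit_item1'] = (df['item1_existing'].eq('NO') & df['item1_sold'].eq('YES')).astype(int)
print(df['unit_item1'].tolist())  # [1, 0]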

How do I pass multiple variables to a function in python?

I would like to compare a column from several pairs of pandas dataframes and write the shared values to an empty list. I have written a function that can do this with a single pair of dataframes, but I cannot seem to scale it up.
def parser(dataframe1, dataframe2, emptylist):
    for i1 in dataframe1['POS']:
        for i2 in dataframe2['POS']:
            if i1 == i2:
                emptylist.append(i1)
Where 'POS' is a column header in the two pandas dataframes.
I have made a list of variable names for each input value of this function, e.g.
dataframe1_names=['name1','name2',etc...]
dataframe2_names=['name1','name2',etc...]
emptylist_names=['name1','name2',etc...]
Where each element of the list is a string containing the name of a variable (either a pandas dataframe in the case of the first two, or an empty list in the case of the last).
I have tried to iterate through these lists using the following code:
import itertools
for a, b, c in zip(range(len(dataframe1_names)), range(len(dataframe2_names)), range(len(emptylist_names))):
    parser(dataframe1_names[a], dataframe2_names[b], emptylist_names[c])
But this returns TypeError: string indices must be integers.
I believe that this error is coming from passing the function a string containing the variable name instead of the variable name itself. Is there another way to pass multiple variables to a function in an automated way?
Thanks for your help!
Do you have to use strings of object names, instead of just the objects themselves? If you do
dataframes1=[name1,name2,...]
dataframes2=[name1,name2,...]
emptylists=[name1,name2,...]
Then you can just do
for a, b, c in zip(dataframes1, dataframes2, emptylists):
    parser(a, b, c)
The way you do this is really circuitous and unpythonic, by the way, so I've changed it a bit. Rather than getting lists of indexes for the for statement, I just iterate through the lists (and thus the objects) themselves. This is much more compact and easier to understand. For that matter, do you really need to pass the empty list in as an argument (e.g., perhaps it isn't always empty)?

Also, your parser code, while correct, doesn't take advantage of pandas at all and will be very slow. To find the values in one column that also appear in another, you can use dataframe1['COL'].isin(dataframe2['COL']), which gives you a Boolean series marking the shared values. You can then use that series to index into the column and pull out the shared values. The result comes out as a Series, but it's easy enough to convert to a list. Thus, your parser function can be reduced to the following, if you don't need to create the "empty list" elsewhere first:
def parser(df1, df2):
    return list(df1['COL'][df1['COL'].isin(df2['COL'])])
This will be much, much faster, though as it returns the list, you'll have to do something with it, so in your case, you'd do something like:
sharedlists = [ parser(a,b) for a,b in zip( dataframes1, dataframes2 ) ]
If you must use variable names, the following very unsafe sort of code will convert your lists of names into lists of objects (you'll need to do this for each list):
dataframes1 = [ eval(name) for name in dataframe1_names ]
If this is just for numerical work you're doing in an interpreter, eval is alright, but for any code you're releasing, it's very insecure: it will evaluate whatever code is in the string passed into it, thus allowing arbitrary code execution.
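A safer alternative to eval (a sketch; the frames dict and the stand-in DataFrames below are hypothetical) is to register the objects in a dict keyed by name, so resolving a name is a plain lookup rather than code execution:

import pandas as pd

# Stand-ins for your real DataFrames.
df_a = pd.DataFrame({'POS': [1, 2, 3]})
df_b = pd.DataFrame({'POS': [2, 3, 4]})

# Register the actual objects under their names once...
frames = {'name1': df_a, 'name2': df_b}

# ...then resolve name strings to objects without eval.
dataframe1_names = ['name1', 'name2']
dataframes1 = [frames[name] for name in dataframe1_names]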
This sounds like a use case of .query()
A use case for query() is when you have a collection of DataFrame objects that have a subset of column names (or index levels/names) in common. You can pass the same query to both frames without having to specify which frame you're interested in querying.
map(lambda frame: frame.query(expr), [df, df2])
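For instance (a sketch; the expression here is a hypothetical condition on a column both frames share):

expr = "POS > 100"
results = list(map(lambda frame: frame.query(expr), [df, df2]))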
What kind of output are you looking for in the case where you have more than two DataFrame objects? In the case of just two, the following line would accomplish what your parser function does:
common = df1[df1["fieldname"].isin(df2["fieldname"])]["fieldname"]
except that common would be a Series rather than a list, but you can easily get a list from it by doing list(common).
If you're looking for a function that takes any number of DataFrames and returns a list of common values in some field for each pair, you could do something like this:
from itertools import combinations

def common_lists(field, *dfs):
    return [df1[df1[field].isin(df2[field])][field]
            for df1, df2 in combinations(dfs, 2)]
The same deal about converting to a list applies here, since you'll be getting a list of Series (one per pair of DataFrames).
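Called on three frames, for example (df_a, df_b, df_c are hypothetical stand-in names, not from the question):

results = common_lists('POS', df_a, df_b, df_c)   # pairs: (a,b), (a,c), (b,c)
shared = [list(s) for s in results]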
As far as this bit:
import itertools
for a, b, c in zip(range(len(dataframe1_names)), range(len(dataframe2_names)), range(len(emptylist_names))):
    parser(dataframe1_names[a], dataframe2_names[b], emptylist_names[c])
What you're doing is creating a list that looks something like this:
[(0,0,0), (1,1,1), ... (n,n,n)]
where n is the length of the shortest of dataframe1_names, dataframe2_names, and emptylist_names. So on the first iteration of the loop, you have a == b == c == 0, and you're using these values to index into your arrays of data frame variable names, so you're calling parser("name1", "name1", "name1"), passing it strings instead of pandas DataFrame objects. Your parser function is expecting DataFrame objects so it barfs when you try to call dataframe1["POS"] where dataframe1 is the string "name1".

Accessing Data using df['foo'] missing data for pattern searching python

So I have this function which takes in one row from a dataframe, matches the pattern,
and adds the result to the data. Since the pattern search needs the input to be a string, I am forcing it with str(). However, if I do that it cuts off my URL after a certain point.
I figured out that if I force it using the ix function,
str(data.ix[0,'url'])
it does not cut anything off and gets me what I want. However, if I use str(data.ix[:,'url']),
it again cuts off after some point.
The problem is that I cannot specify the index position inside the ix function, as I plan to iterate row by row using the apply function. Any suggestions?
import re

def foo(data):
    url = str(data['url'])
    m = re.search(r"model=(?P<model>\w+)&id=\d+&make=(?P<make>\w+)", url)
    if m:
        data['make'] = m.group("make")
        data['model'] = m.group("model")
    return data
Iterating row-by-row is a last resort. It's almost always slower, less readable, and less idiomatic.
Fortunately, there is an easy way to do what you want to do. Check out the str.extract method, added in version 0.13 of pandas; it is available on string columns via the .str accessor.
Something like this...
pattern = r'model=(?P<model>\w+)&id=\d+&make=(?P<make>\w+)'
extracted_data = data['url'].str.extract(pattern)
The result, extracted_data, will be a new DataFrame with columns named 'model' and 'make', inferred from the named groups in your regex pattern.
Join it to your original DataFrame, and you're done.
data = data.join(extracted_data)
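Put together on a tiny frame (a hypothetical URL, just to show the shape of the result):

import pandas as pd

data = pd.DataFrame({'url': ['http://example.com/?model=corolla&id=42&make=toyota']})
pattern = r'model=(?P<model>\w+)&id=\d+&make=(?P<make>\w+)'
data = data.join(data['url'].str.extract(pattern))
print(data[['make', 'model']])
#      make    model
# 0  toyota  corolla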
