How to speed up this Python function?

I hope it is OK to ask questions of this type.
I have a get_lags function that takes a data frame, and for each column, shifts the column by each n in the list n_lags. So, if n_lags = [1, 2], the function shifts each column once by 1 and once by 2 positions, creating new lagged columns in this way.
def get_lags(df, n_lags):
    data = df.copy()
    data_with_lags = pd.DataFrame()
    for column in data.columns:
        for i in range(n_lags[0], n_lags[-1] + 1):
            new_column_name = str(column) + '_Lag' + str(i)
            data_with_lags[new_column_name] = data[column].shift(-i)
    data_with_lags.fillna(method='ffill', limit=max(n_lags), inplace=True)
    return data_with_lags
So, if:
df.columns
ColumnA
ColumnB
Then, get_lags(df, [1, 2]).columns will be:
ColumnA_Lag1
ColumnA_Lag2
ColumnB_Lag1
ColumnB_Lag2
Issue: working with data frames that have about 100,000 rows and 20,000 columns, this takes forever to run. On a 16-GB RAM, Core i7 Windows machine, I once waited 15 minutes for the code to run before stopping it. Is there any way I can tweak this function to make it faster?

You'll need shift + concat. Here's the concise version -
def get_lags(df, n_lags):
    return pd.concat(
        [df] + [df.shift(i).add_suffix('_Lag{}'.format(i)) for i in n_lags],
        axis=1
    )
And here's a more memory-friendly version, using a for loop -
def get_lags(df, n_lags):
    df_list = [df]
    for i in n_lags:
        v = df.shift(i)
        v.columns = v.columns + '_Lag{}'.format(i)
        df_list.append(v)
    return pd.concat(df_list, axis=1)
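For illustration, a quick usage sketch (the toy frame and values below are made up; both versions above produce the same layout):

import pandas as pd

df = pd.DataFrame({'ColumnA': range(5), 'ColumnB': range(5, 10)})
lagged = get_lags(df, [1, 2])
# lagged.columns:
# ['ColumnA', 'ColumnB', 'ColumnA_Lag1', 'ColumnB_Lag1', 'ColumnA_Lag2', 'ColumnB_Lag2']

Either way you do one shift per lag over the whole frame and a single concat, instead of one shift per column per lag, which is where the original loop spends its time with 20,000 columns.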

This may not apply to your case (I hope I understand what you're trying to do correctly), but you can speed it up massively by not doing it in the first place. Can you treat your columns like a ring buffer?
Instead of changing the columns afterwards, keep track of:
how many columns you can use (how many lag items for each entry)
what was the last lag column used
(optionally) how many times you "rotated"
So instead of moving the data, you do something like:
current_column = (current_column + 1) % total_columns
and write to that column next.
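For illustration, a minimal sketch of that bookkeeping (the LagRing name and its layout are made up; it only shows the pointer arithmetic, not a drop-in replacement for get_lags):

import numpy as np

class LagRing:
    """Keep the last `n_lags` snapshots of a series without ever moving data."""
    def __init__(self, n_rows, n_lags):
        self.buf = np.full((n_rows, n_lags), np.nan)  # one column per lag slot
        self.pushes = 0                               # how many snapshots written so far

    def push(self, values):
        # Write into the next slot instead of shifting the existing columns.
        current = self.pushes % self.buf.shape[1]
        self.buf[:, current] = values
        self.pushes += 1

    def rotations(self):
        # (optional) how many times the write pointer wrapped around
        return self.pushes // self.buf.shape[1]

    def lag(self, k):
        # The snapshot from k pushes ago sits k slots behind the latest write.
        latest = (self.pushes - 1) % self.buf.shape[1]
        return self.buf[:, (latest - k) % self.buf.shape[1]]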

Related

How to add multiple columns to a dataframe based on calculations

I have a csv dataset (with > 8m rows) that I load into a dataframe. The csv has columns like:
...,started_at,ended_at,...
2022-04-01 18:23:32,2022-04-01 22:18:15
2022-04-02 01:16:34,2022-04-02 02:18:32
...
I am able to load the dataset into my dataframe, but then I need to add multiple calculated columns to the dataframe for each row. In other words, unlike this SO question, I do not want the rows of the new columns to have the same initial value (col 1 all NAN, col 2, all "dogs", etc.).
Right now, I can add my columns by doing something like:
df['start_time'] = df.apply(lambda row: add_start_time(row['started_at']), axis = 1)
df['start_cat'] = df.apply(lambda row: add_start_cat(row['start_time']), axis = 1)
df['is_dark'] = df.apply(lambda row: add_is_dark(row['started_at']), axis = 1)
df['duration'] = df.apply(lambda row: calc_dur(row['started_at'], row['ended_at']), axis = 1)
But it seems inefficient since the entire dataset is processed N times (once for each call).
It seems that I should be able to calculate all of the new columns in a single go, but I am missing some conceptual approach.
Examples:
def calc_dur(started_at, ended_at):
    # started_at, ended_at are datetime64[ns]; converted at csv load
    diff = ended_at - started_at
    return diff.total_seconds() / 60

def add_start_time(started_at):
    # started_at is datetime64[ns]; converted at csv load
    return started_at.time()

def add_is_dark(started_at):
    # tz is pytz.timezone('US/Central')
    # chi_town is the astral lookup for Chicago
    st = started_at.replace(tzinfo=TZ)
    chk = sun(chi_town.observer, date=st, tzinfo=chi_town.timezone)
    return st >= chk['dusk'] or st <= chk['dawn']
Update 1
Following the information from MoRe, I was able to get the essentials working. I needed to augment the answer by adding the column names, and then specify the index for the merge.
data = pd.Series(df.apply(lambda x: [
    add_start_time(x['started_at']),
    add_is_dark(x['started_at']),
    yrmo(x['year'], x['month']),
    calc_duration_in_minutes(x['started_at'], x['ended_at']),
    add_start_cat(x['started_at'])
], axis = 1))

new_df = pd.DataFrame(data.tolist(),
                      data.index,
                      columns=['start_time', 'is_dark', 'yrmo',
                               'duration', 'start_cat'])
df = df.merge(new_df, left_index=True, right_index=True)
import pandas as pd
data = pd.Series(dataframe.apply(lambda x: [function1(x[column_name]),
                                            function2(x[column_name]),
                                            function3(x[column_name])], axis = 1))
pd.DataFrame(data.tolist(), data.index)
If I understood you correctly, this is your answer. But before anything else, please take a look at the swifter pip package :)
First create a Series of lists and then convert it to columns...
swifter is a simple library (at least I think it is simple) that has only one useful method: apply
import swifter
data.swifter.apply(lambda x: x+1)
It uses parallelism to improve speed on large datasets... on small ones it doesn't help and can even be slower.
https://pypi.org/project/swifter/
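For illustration, here is a minimal, self-contained sketch of the series-of-lists pattern from this answer (the toy frame, column names, and calculations are made up):

import pandas as pd

# Toy frame and toy "calculations", just to show the mechanics.
df = pd.DataFrame({'x': [1, 2, 3]})
data = pd.Series(df.apply(lambda row: [row['x'] * 2, row['x'] ** 2], axis=1))

# Each list element becomes its own column; data.index keeps the rows aligned
# so the new columns can be merged back onto the original frame.
new_cols = pd.DataFrame(data.tolist(), data.index, columns=['doubled', 'squared'])
df = df.merge(new_cols, left_index=True, right_index=True)

This way every row is traversed once, instead of once per new column as in the original four apply calls.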

Using pandas .shift on multiple columns with different shift lengths

I have created a function that iterates over each column of a dataframe, shifts the data in that column up to the first observation (shifting past '-'), and stores that column in a dictionary. I then convert the dictionary back to a dataframe to get the appropriately shifted columns. The function works and takes about 10 seconds on a 12x3000 dataframe. However, when applying it to 12x25000 it is extremely slow. I feel like there is a much better way to approach this to increase the speed - perhaps even an argument of the shift function that I am missing. Appreciate any help.
def create_seasoned_df(df_orig):
    """
    Creates a seasoned dataframe with only the first 12 periods of a loan
    """
    df_seasoned = df_orig.reset_index().copy()
    temp_dic = {}
    for col in cols:  # cols: the list of columns to shift, defined elsewhere
        to_shift = -len(df_seasoned[df_seasoned[col] == '-'])
        temp_dic[col] = df_seasoned[col].shift(periods=to_shift)
    df_seasoned = pd.DataFrame.from_dict(temp_dic, orient='index').T[:12]
    return df_seasoned
Try using this code with apply instead:
def create_seasoned_df(df_orig):
    df_seasoned = df_orig.reset_index().apply(
        lambda x: x.shift(-x.eq('-').sum()), axis=0)
    return df_seasoned
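For illustration, a minimal sketch of what that per-column lambda does on a toy frame (the toy data is made up):

import pandas as pd

df = pd.DataFrame({'a': ['-', '-', 1, 2], 'b': ['-', 3, 4, 5]})
seasoned = df.apply(lambda x: x.shift(-x.eq('-').sum()), axis=0)
# 'a' is shifted up by 2 and 'b' by 1, so each column now starts at its
# first real observation and the vacated positions at the bottom become NaN.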

Create function that loops through columns in a Data frame

I am a new coder using Jupyter Notebook. I have a dataframe that contains 23 columns with different numbers of values (at most 23 and at least 2). I have created a function that normalizes the contents of one column, below.
def normalize(column):
    y = DFref[column].values
    y = y.astype(int)
    KGF = list()
    for element in y:
        element_norm = element / y.sum()
        KGF.append(element_norm)
    return KGF
I am now trying to create a function that loops through all columns in the Data frame. Right now if I plug in the name of one column, it works as intended. What would I need to do in order to create a function that loops through each column and normalizes the values of each column, and then adds it to a new dataframe?
It's not clear if all 23 columns are numeric, but I will assume they are. Then there are a number of ways to solve this. The method below probably isn't the best, but it might be a quick fix for you...
colnames = DFref.columns.tolist()
normalised_data = {}
for colname in colnames:
    normalised_data[colname] = normalize(colname)
df2 = pd.DataFrame(normalised_data)
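As a follow-up, a fully vectorized sketch under the same all-numeric assumption, which skips the per-element loop entirely (DFref is the frame from the question):

numeric = DFref.astype(int)           # mirrors the astype(int) in normalize()
df2 = numeric / numeric.sum(axis=0)   # divide each column by its own sum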

Optimize a for loop applied on all elements of a df

EDIT: here are the first lines:
df = pd.read_csv(os.path.join(path, file), dtype = str,delimiter = ';',error_bad_lines=False, nrows=50)
df["CALDAY"] = df["CALDAY"].apply(lambda x:dt.datetime.strptime(x,'%d/%m/%Y'))
df = df.fillna(0)
I have a csv file that has 1500 columns and 35000 rows. It contains values, but in the form 1.700,35 for example, whereas in Python I need 1700.35. When I read the csv, all values come in as str.
To solve this I wrote this function:
def format_nombre(df):
    # length and width are the frame's dimensions, defined elsewhere
    for i in range(length):
        for j in range(width):
            element = df.iloc[i, j]
            if (type(element) != type(df.iloc[1, 0])):
                a = df.iloc[i, j].replace(".", "")
                b = float(a.replace(",", "."))
                df.iloc[i, j] = b
Basically, I select each intersection of all rows and columns, I replace the problematic characters, I turn the element into a float and I replace it in the dataframe. The if ensures that the function doesn't consider dates, which are in the first column of my dataframe.
The problem is that although the function does exactly what I want, it takes approximately 1 minute to cover 10 rows, so transforming my csv would take a little less than 60h.
I realize this is far from being optimized, but I struggled and failed to find a way that suited my needs and (scarce) skills.
How about:
def to_numeric(column):
    if np.issubdtype(column.dtype, np.datetime64):
        return column
    else:
        # regex=False makes the replacements literal, so '.' is not treated as a regex wildcard
        return (column.str.replace('.', '', regex=False)
                      .str.replace(',', '.', regex=False)
                      .astype(float))

df = df.apply(to_numeric)
That's assuming all strings are valid. Otherwise use pd.to_numeric instead of astype(float).
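For reference, a minimal sketch of that pd.to_numeric variant (the to_numeric_safe name is made up; errors='coerce' turns unparseable strings into NaN instead of raising):

import numpy as np
import pandas as pd

def to_numeric_safe(column):
    if np.issubdtype(column.dtype, np.datetime64):
        return column
    cleaned = column.str.replace('.', '', regex=False).str.replace(',', '.', regex=False)
    return pd.to_numeric(cleaned, errors='coerce')

df = df.apply(to_numeric_safe)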

Operating on Pandas groups

I am attempting to perform multiple operations on a large dataframe (~3 million rows).
Using a small test-set representative of my data, I've come up with a solution. However the script runs extremely slowly when using the large dataset as input.
Here is the main loop of the application:
def run():
    df = pd.DataFrame(columns=['CARD_NO', 'CUSTOMER_ID', 'MODIFIED_DATE', 'STATUS', 'LOYALTY_CARD_ENROLLED'])
    foo = input.groupby('CARD_NO', as_index=False, sort=False)
    for name, group in foo:
        if len(group) == 1:
            df = df.append(group)
        else:
            dates = group['MODIFIED_DATE'].values
            if all_same(dates):
                df = df.append(group[group.STATUS == '1'])
            else:
                df = df.append(group[group.MODIFIED_DATE == most_recent(dates)])
    path = ''
    df.to_csv(path, sep=',', index=False)
The logic is as follows:
For each CARD_NO
- if there is only 1 CARD_NO, add row to new dataframe
- if there are > 1 of the same CARD_NO, check MODIFIED_DATE,
- if MODIFIED_DATEs are different, take the row with most recent date
- if all MODIFIED_DATES are equal, take whichever row has STATUS = 1
The slow-down occurs at each iteration of the loop over
input.groupby('CARD_NO', as_index=False, sort=False)
I am currently trying to parallelize the loop by splitting the groups returned by the above statement, but I'm not sure if this is the correct approach...
Am I overlooking a core functionality of Pandas?
Is there a better, more Pandas-esque way of solving this problem?
Any help is greatly appreciated.
Thank you.
Two general tips:
For looping over a groupby object, you can try apply. For example,
grouped = input.groupby('CARD_NO', as_index=False, sort=False)
grouped.apply(example_function)
Here, example_function is called for each group in your groupby object. You could write example_function to append to a data structure yourself, or if it has a return value, pandas will try to concatenate the return values into one dataframe.
Appending rows to dataframes is slow. You might be better off building some other data structure with each iteration of the loop, and then building your dataframe at the end. For example, you could make a list of dicts.
data = []
grouped = input.groupby('CARD_NO', as_index=False, sort=False)

def example_function(row, data_list):
    row_dict = {}
    row_dict['length'] = len(row)
    row_dict['has_property_x'] = pandas.notnull(row['property_x'])
    data_list.append(row_dict)

grouped.apply(example_function, data_list=data)
pandas.DataFrame(data)
This produced a significant improvement in running time. Returning the selected rows from the applied function, rather than appending to a dataframe, brought the running time down to ~30 minutes for 3 million rows and avoided the need for secondary data structures.
def foo(df_of_grouped_data):
    group_length = len(df_of_grouped_data)
    if group_length == 1:
        return df_of_grouped_data
    else:
        dates = df_of_grouped_data['MODIFIED_DATE'].values
        if all_same(dates):
            return df_of_grouped_data[df_of_grouped_data.STATUS == '1']
        else:
            return df_of_grouped_data[df_of_grouped_data.MODIFIED_DATE == most_recent(dates)]

result = card_groups.apply(foo)
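For completeness, a fully vectorized sketch of the same selection rules using groupby transforms (this is an alternative I'm suggesting, not part of the original answer; it assumes MODIFIED_DATE is comparable with max and STATUS is a string column, and `input` is the frame from the question):

latest = input.groupby('CARD_NO')['MODIFIED_DATE'].transform('max')
n_dates = input.groupby('CARD_NO')['MODIFIED_DATE'].transform('nunique')
n_rows = input.groupby('CARD_NO')['CARD_NO'].transform('size')

keep = (
    (n_rows == 1)                                                  # singleton CARD_NOs
    | ((n_rows > 1) & (n_dates == 1) & (input['STATUS'] == '1'))   # all dates equal -> STATUS == '1'
    | ((n_dates > 1) & (input['MODIFIED_DATE'] == latest))         # otherwise keep the most recent date
)
result = input[keep]

This avoids both the Python-level loop over groups and the repeated append calls.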
