How to optimize dataframe iteration in pandas?

I need to iterate over a DataFrame and, for each row, create an ID based on two existing columns: name and sex. I then add this ID to the df as a new column.
import pandas as pd

df = pd.read_csv(file, sep='\t', dtype=str, na_values="", low_memory=False)
row_ids = []
for index, row in df.iterrows():
    # report progress every 1000 rows
    if (index % 1000) == 0:
        print("Row node index: {}".format(index))
    calculated_id = get_id(row['name'], row['sex'])
    row_ids.append(calculated_id)
df['id'] = row_ids
Is there a way to make it much faster without going row by row?
Add more info based on suggested solutions:

Use apply instead:
def func(x):
    # x is one row; x.name is its index label
    if (x.name % 1000) == 0:
        print("Row node index: {}".format(x.name))
    calculated_id = get_id(x['name'], x['sex'])
    return calculated_id

df['id'] = df.apply(func, axis=1)

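If you are working on a large dataset, np.vectorize() should help bypass some of the apply() overhead and be a bit faster. Note that np.vectorize is still essentially a Python-level loop, so the gain comes from avoiding the construction of a Series for every row, not from true vectorization.
import numpy as np
v = np.vectorize(lambda name, sex: get_id(name, sex))
df['id'] = v(df['name'], df['sex'])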
Edit:
For a further speed-up, pass the function get_id directly instead of wrapping it in a lambda, and pass df.*.values (the underlying NumPy arrays) instead of df.* (the Series).
v = np.vectorize(get_id)
df['id'] = v(df['name'].values, df['sex'].values)
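As a rough, self-contained sketch of the approach above: the get_id below is a hypothetical stand-in (the real function is not shown in the question), and the sample data is made up. It also shows that if the ID logic can be written with column-wise string methods, the per-row Python call can be dropped entirely.

```python
import numpy as np
import pandas as pd


def get_id(name, sex):
    # Hypothetical stand-in for the asker's get_id; the real logic is unknown.
    return f"{name.lower()}_{sex.upper()}"


df = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "sex": ["f", "m", "m"]})

# np.vectorize still calls get_id once per row, but it works on plain arrays
# instead of building a Series for every row. otypes is set explicitly so the
# output dtype is not inferred from a trial call.
v = np.vectorize(get_id, otypes=[object])
df["id"] = v(df["name"].to_numpy(), df["sex"].to_numpy())

# If the ID can be expressed with column-wise operations, no per-row call is needed.
df["id_vec"] = df["name"].str.lower() + "_" + df["sex"].str.upper()

print(df)
```

Whether the second form is possible depends entirely on what get_id actually does.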
Instead of printing progress updates, try tqdm, which shows a progress bar.
import numpy as np
from tqdm import tqdm

@np.vectorize
def get_id(name, sex):
    global pbar
    ...
    pbar.update(1)  # advance the progress bar by one row
    ...
    return

global pbar
with tqdm(total=len(df)) as pbar:
    df['id'] = get_id(df['name'].values, df['sex'].values)

Related

Adding empty rows in Pandas dataframe

I'd like to insert empty rows into my DataFrame at regular intervals.
The following code does roughly what I want, but I'm struggling to adjust it to my needs:
s = pd.Series('', data_only_trades.columns)
f = lambda d: d.append(s, ignore_index=True)
set_rows = np.arange(len(data_only_trades)) // 4
empty_rows = data_only_trades.groupby(set_rows, group_keys=False).apply(f).reset_index(drop=True)
How can I adjust the code so that it adds two or more rows instead of one?
How can I set a starting point (e.g. it should start at row 5; do I have to use .loc in arange for that)?
I also tried this code, but I struggled to set the starting row and to make the values blank (I got NaN):
df_new = pd.DataFrame()
for i, row in data_only_trades.iterrows():
    df_new = df_new.append(row)
    for _ in range(2):
        df_new = df_new.append(pd.Series(), ignore_index=True)
Thank you!

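I think you can use NumPy:
import numpy as np
import pandas as pd
# build a block of empty-string rows with the same number of columns as df
v = np.ndarray(shape=(numberOfRowsYouWant, df.values.shape[1]), dtype=object)
v[:] = ""
# stack the blank rows under the data; columns=df.columns keeps the column names
pd.DataFrame(np.vstack((df.values, v)), columns=df.columns)
But if you want to keep your own approach, simply convert the NaN values to "":
df.fillna("")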
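For the two follow-up questions (more than one blank row, and a starting row), one possible sketch uses pd.concat. The helper name insert_blank_rows, its parameters, and the sample data are all hypothetical; this is not the asker's code and only illustrates the idea.

```python
import pandas as pd


def insert_blank_rows(df, every=4, n_blank=2, start=0):
    """Insert n_blank empty-string rows after each block of `every` rows.

    The first `start` rows are copied unchanged before blocks begin.
    """
    blank = pd.DataFrame("", index=range(n_blank), columns=df.columns)
    pieces = [df.iloc[:start]] if start else []
    for pos in range(start, len(df), every):
        pieces.append(df.iloc[pos:pos + every])
        pieces.append(blank)
    return pd.concat(pieces, ignore_index=True)


# Made-up sample data standing in for data_only_trades.
trades = pd.DataFrame({"price": range(1, 9), "qty": range(10, 90, 10)})
result = insert_blank_rows(trades, every=4, n_blank=2)
print(result)
```

Because the blank rows hold empty strings, numeric columns become object dtype in the result, which is usually acceptable for display or export but not for further arithmetic.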

Apply multiple agg functions on groupby index

I currently have the following wikipedia scraper:
import wikipedia as wp
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Wikipedia scraper
wiki_page = 'Climate_of_Italy'
html = wp.page(wiki_page).html().replace(u'\u2212', '-')

def dataframe_cleaning(table_number: int):
    global html
    df = pd.read_html(html, encoding='utf-8')[table_number]
    df.drop(np.arange(5, len(df.index)), inplace=True)
    df.columns = df.columns.droplevel()
    df.drop('Year', axis=1, inplace=True)
    # capture the value in parentheses (raw string avoids invalid-escape warnings)
    find = r'\((.*?)\)'
    for i, column in enumerate(df.columns):
        if i > 0:
            # convert the captured Fahrenheit value to Celsius
            df[column] = (df[column]
                          .str.findall(find)
                          .map(lambda x: np.round((float(x[0]) - 32) * (5 / 9), 2)))
    return df

potenza_df = dataframe_cleaning(3)
milan_df = dataframe_cleaning(4)
florence_df = dataframe_cleaning(6)
italy_df = pd.concat((potenza_df, milan_df, florence_df))
Produces the following DataFrame:
As you can see, I have concatenated the DataFrames, which results in a number of repeated row labels. Using groupby, I want to collapse these into a single DataFrame, and using the .agg method I want to apply min, max, or mean depending on the row. The issue I am facing is that I cannot apply a different .agg function to each row. I know it is a very simple question, but I've been looking through the documentation and sadly cannot figure it out.
Thank you for your help in advance.
P.S. sorry if it is a repeated question post, but I was unable to find similar solution.
EDIT:
Added desired output (NOTE: was done on excel)

Just a quick update: I was able to achieve my desired goal, but I was not able to find a clean solution.
concat_df = pd.concat((potenza_df, milan_df, florence_df))
italy_df = pd.DataFrame()
for i, index in enumerate(list(set(concat_df['Month']))):
    if i == 0:
        temp_df = concat_df[concat_df['Month'] == index]
        temp_df = temp_df.groupby('Month').agg(np.max)
    if i in range(1, 4):
        temp_df = concat_df[concat_df['Month'] == index]
        temp_df = temp_df.groupby('Month').agg(np.mean)
    if i == 4:
        temp_df = concat_df[concat_df['Month'] == index]
        temp_df = temp_df.groupby('Month').agg(np.min)
    italy_df = italy_df.append(temp_df)
italy_df = italy_df.apply(lambda x: np.round(x, 2))
italy_df
The code above achieves the desired result; however, it is highly dependent on the user's manual configuration.
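One way to reduce the manual configuration might be to map each row label to the aggregation it should receive and look that up per group. The sketch below uses made-up labels and values in place of the scraped tables, and the agg_for_label mapping is an assumption about which function belongs to which row.

```python
import pandas as pd

# Toy stand-in for the concatenated city tables; labels and values are made up.
concat_df = pd.DataFrame({
    "Month": ["Record high", "Average high", "Record low"] * 2,
    "Jan": [20.0, 10.0, -5.0, 18.0, 8.0, -9.0],
    "Feb": [22.0, 12.0, -4.0, 25.0, 11.0, -6.0],
})

# Hypothetical mapping from row label to the aggregation it should receive.
agg_for_label = {
    "Record high": "max",
    "Average high": "mean",
    "Record low": "min",
}

value_cols = concat_df.columns.drop("Month")

# Aggregate each label's rows with its own function, then stack the results.
rows = {
    label: group[value_cols].agg(agg_for_label[label])
    for label, group in concat_df.groupby("Month", sort=False)
}
italy_df = pd.DataFrame(rows).T.round(2)
print(italy_df)
```

Keying on the label rather than on the position in a set avoids relying on iteration order, which is arbitrary for sets.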

vaex: shift column by n steps

I'm preparing a big multivariate time series data set for a supervised learning task and I would like to create time shifted versions of my input features so my model also infers from past values. In pandas there's the shift(n) command that lets you shift a column by n rows. Is there something similar in vaex?
I could not find anything comparable in the vaex documentation.

No, we do not support that yet (https://github.com/vaexio/vaex/issues/660). Because vaex is extensible (see http://docs.vaex.io/en/latest/tutorial.html#Adding-DataFrame-accessors), here is a solution in the form of a DataFrame accessor:
import vaex
import numpy as np

@vaex.register_dataframe_accessor('mytool', override=True)
class mytool:
    def __init__(self, df):
        self.df = df

    def shift(self, column, n, inplace=False):
        # make a copy without column
        df = self.df.copy().drop(column)
        # make a copy with just the column
        df_column = self.df[[column]]
        # slice off the head and tail
        df_head = df_column[-n:]
        df_tail = df_column[:-n]
        # stitch them together
        df_shifted = df_head.concat(df_tail)
        # and join (based on row number)
        return df.join(df_shifted, inplace=inplace)

x = np.arange(10)
y = x**2
df = vaex.from_arrays(x=x, y=y)
df['shifted_y'] = df.y
df2 = df.mytool.shift('shifted_y', 2)
df2
It generates a single-column dataframe, slices that up, concatenates the pieces, and joins the result back. All without a single memory copy.
I am assuming here a cyclic shift/rotate.
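To make the distinction concrete, here is a small comparison in pandas and NumPy (not vaex) between the non-cyclic shift that pandas' shift(n) performs and a cyclic shift, where the last n values wrap around to the front. The variable names are chosen for illustration only.

```python
import numpy as np
import pandas as pd

y = pd.Series(np.arange(5) ** 2)  # 0, 1, 4, 9, 16

# pandas shift: the first n positions are filled with NaN.
plain = y.shift(2)

# Cyclic shift (rotate): the last n values wrap around to the front.
cyclic = pd.Series(np.roll(y.to_numpy(), 2), index=y.index)

print(pd.DataFrame({"y": y, "shift(2)": plain, "cyclic": cyclic}))
```

For building lagged features in a supervised learning task, the non-cyclic form is usually the one wanted, since wrapped-around values come from the end of the series.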

The function needs to be modified slightly to work in the latest release (vaex 4.0.0ax), see this thread.
The code by Maarten should be updated as follows:
import vaex
import numpy as np

@vaex.register_dataframe_accessor('mytool', override=True)
class mytool:
    def __init__(self, df):
        self.df = df

    # mytool.shift is the analog of pandas.shift(), but it returns the shifted
    # column under the specified name so it can be joined onto the initial df
    def shift(self, column, new_column, n, cyclic=True):
        df = self.df.copy().drop(column)
        df_column = self.df[[column]]
        if cyclic:
            # wrap the last n values around to the front
            df_head = df_column[-n:]
        else:
            # pad the first n positions with zeros instead
            df_head = vaex.from_dict({column: np.ma.filled(np.ma.masked_all(n, dtype=float), 0)})
        df_tail = df_column[:-n]
        df_shifted = df_head.concat(df_tail)
        df_shifted.rename(column, new_column)
        return df_shifted

x = np.arange(10)
y = x**2
df = vaex.from_arrays(x=x, y=y)
df2 = df.join(df.mytool.shift('y', 'shifted_y', 2))
df2

Python Pandas: use row number in apply

I am only beginning with Pandas and I am stuck with the following problem:
I want to use the row number in df.apply() so that it calculates (1+0.05)^(row_number), ex:
(1+0.05)^0 in its first row, (1+0.05)^1 in its second, (1+0.05)^2 in its third etc....
I tried the following but get AttributeError: 'int' object has no attribute 'name'
import pandas as pd
considered_period_years = 60
start_year = 2019
TDE = 0.02
year = list(range(start_year,start_year+considered_period_years))
df = pd.DataFrame(year, columns = ['Year'])
df.insert(0, 'Year Number', range(0,60), allow_duplicates = False)
df.insert(2, 'Investition', 0, allow_duplicates = False)
df['Investition2'] = df['Investition'].apply(lambda x: x*(1+TDE)**x.name)
Any ideas?
Regards Johann

Welcome to pandas. Familiarize yourself with vectorized functions. The basic idea behind vectorized functions is that you apply the operation to every element in an array without an explicit loop. For example:
x + 1
means "add 1 to every element in x".
Similarly:
x * y
means "multiply every element in x by every element in y, pair-wise".
Deep down, vectorized functions are implemented using highly-optimized C loops so they are both fast and convenient.
In your case:
df['Investition2'] = (1+TDE)**df.index
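A minimal, self-contained version of the same idea, using a smaller made-up frame and a hypothetical column name Factor, might look like this:

```python
import pandas as pd

TDE = 0.02
df = pd.DataFrame({"Year": range(2019, 2024)})

# Raise (1 + TDE) to the power of each row number in one vectorized operation.
# This assumes the default 0..n-1 index, so the index doubles as the row number.
df["Factor"] = (1 + TDE) ** df.index.to_numpy()

print(df)
```

If the index is not the default range, np.arange(len(df)) can be used in place of the index to get the row number.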

Because you call apply on df['Investition'], which is a Series, apply iterates over its values, so x is a plain int and has no name attribute. Instead, call apply on the DataFrame with axis=1 and pass a custom function (foo here): it receives each row as a Series whose name is the row's index label.
import pandas as pd
considered_period_years = 60
start_year = 2019
TDE = 0.02
year = list(range(start_year,start_year+considered_period_years))
df = pd.DataFrame(year, columns = ['Year'])
df.insert(0, 'Year Number', range(0,60), allow_duplicates = False)
df.insert(2, 'Investition', 0, allow_duplicates = False)

def foo(row):
    # row.name is the index label of the current row
    return row['Investition']*(1+TDE)**row.name

df['Investition2'] = df.apply(foo, axis=1)
Another alternative is to use itertuples or iterrows.
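As a rough sketch of the itertuples alternative, with made-up data in place of the original frame: each tuple exposes the index label as Index and each column as an attribute.

```python
import pandas as pd

TDE = 0.02
df = pd.DataFrame({"Investition": [100, 100, 100]})

# row.Index is the index label; row.Investition is the column value.
df["Investition2"] = [
    row.Investition * (1 + TDE) ** row.Index for row in df.itertuples()
]

print(df)
```

This is still a Python-level loop, so for large frames the vectorized expression is likely to be faster.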

Python looping and Pandas rank/index quirk

This question pertains to one posted here:
Sort dataframe rows independently by values in another dataframe
In the linked question, I utilize a Pandas Dataframe to sort each row independently using values in another Pandas Dataframe. The function presented there works perfectly every single time it is directly called. For example:
import pandas as pd
import numpy as np
import os

## Generate example dataset
d1 = {}
d2 = {}
d3 = {}
d4 = {}
## generate data:
np.random.seed(5)
for col in list("ABCDEF"):
    d1[col] = np.random.randn(12)
    d2[col+'2'] = np.random.random_integers(0,100, 12)
    d3[col+'3'] = np.random.random_integers(0,100, 12)
    d4[col+'4'] = np.random.random_integers(0,100, 12)
t_index = pd.date_range(start = '2015-01-31', periods = 12, freq = "M")
# place data into dataframes
dat1 = pd.DataFrame(d1, index = t_index)
dat2 = pd.DataFrame(d2, index = t_index)
dat3 = pd.DataFrame(d3, index = t_index)
dat4 = pd.DataFrame(d4, index = t_index)

## Functions
def sortByAnthr(X,Y,Xindex, Reverse=False):
    # order the subset of X.index by Y
    ordrX = [x for (x,y) in sorted(zip(Xindex,Y), key=lambda pair: pair[1],reverse=Reverse)]
    return(ordrX)

def OrderRow(row,df):
    ordrd_row = df.ix[row.dropna().name,row.dropna().values].tolist()
    return(ordrd_row)

def r_selectr(dat2,dat1, n, Reverse=False):
    ordr_cols = dat1.apply(lambda x: sortByAnthr(x,dat2.loc[x.name,:],x.index,Reverse),axis=1).iloc[:,-n:]
    ordr_cols.columns = list(range(0,n)) # assign interpretable column names
    ordr_r = ordr_cols.apply(lambda x: OrderRow(x,dat1),axis=1)
    return([ordr_cols, ordr_r])

## Call functions
ordr_cols2,ordr_r = r_selectr(dat2,dat1,5)
## print output:
print("Ordering set:\n",dat2.iloc[-2:,:])
print("Original set:\n", dat1.iloc[-2:,:])
print("Column ordr:\n",ordr_cols2.iloc[-2:,:])
As can be checked, the columns of dat1 are correctly ordered according to the values in dat2.
However, when called from a loop over dataframes, it does not rank/index correctly and produces completely dubious results. Although I am not quite able to recreate the problem using the reduced version presented here, the idea should be the same.
## Loop test:
out_list = []
data_dicts = {'dat2':dat2, 'dat3': dat3, 'dat4':dat4}
for i in range(3):
    # this outer for loop supplies different parameter values to a wrapper
    # function that calls r_selectr.
    for key in data_dicts.keys():
        ordr_cols,_ = r_selectr(data_dicts[key], dat1,5)
        out_list.append(ordr_cols)
        # do stuff here
# print output:
print("Ordering set:\n",dat3.iloc[-2:,:])
print("Column ordr:\n",ordr_cols2.iloc[-2:,:])
In my code (almost completely analogous to the example given here), the ordr_cols are no longer ordered correctly for any of the sorting data frames.
I currently solve the issue by separating the ordering and indexing operations with r_selectr into two separate functions. That, for some reason, resolves the issue though I have no idea why.

We Keep Coding
