Pandas .loc super slow in large index dataset - python

I am new to pandas, so I assume I must be missing something obvious...
Summary:
I have a DataFrame with 300K+ rows. I retrieve a row of new data which may or may not be related to an existing subset of rows in the DF (identified by Group ID), either retrieve the existing Group ID or generate a new one, and finally insert the row with that Group ID.
Pandas seems very slow for this.
Please advise: what am I missing / should I be using something else?
Details:
Columns are (example):
columnList = ['groupID','timeStamp'] + list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
Each groupID can have many unique timeStamps.
groupID is internally generated, either:
Using an existing one (by matching the row to existing data, say by column 'D'), or
Generating a new groupID.
Thus (in my view at least) I cannot do updates/inserts in bulk; I have to do it row by row.
I used an SQL DB analogy and created an index as a concatenation of groupID and timeStamp (I have tried a MultiIndex, but it seems even slower).
Finally I insert/update using .loc[ind, columnName]
Code:
import pandas as pd
import numpy as np
import time
columnList = ['groupID','timeStamp'] + list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
columnTypeDict = {'groupID':'int64','timeStamp':'int64'}
startID = 1234567
df = pd.DataFrame(columns=columnList)
df = df.astype(columnTypeDict)
fID = list(range(startID,startID+300000))
df['groupID'] = fID
ts = [1000000000]*150000 + [10000000001]*150000
df['timeStamp'] = ts
indx = [str(i) + str(j) for i, j in zip(fID, ts)]
df['Index'] = indx
df['Index'] = df['Index'].astype('uint64')
df = df.set_index('Index')
startTime = time.time()
for groupID in range(startID+49000,startID+50000) :
    timeStamp = 1000000003
    # Obtain/generate an index
    ind = int(str(groupID) + str(timeStamp))
    #print(ind)
    df.loc[ind,'A'] = 1
print(df)
print(time.time()-startTime,"secs")
If the index label already exists, it's fast, but if it doesn't, 10,000 inserts take 140 secs

I think accessing a DataFrame row by row is a relatively expensive operation.
You can temporarily save these values and use them to create a DataFrame that is merged with the original one as follows:
startTime = time.time()
temporary_idx = []
temporary_values = []
for groupID in range(startID+49000,startID+50000) :
    timeStamp = 1000000003
    # Obtain/generate an index
    ind = int(str(groupID) + str(timeStamp))
    temporary_idx.append(ind)
    temporary_values.append(1)
# create a dataframe with new values and apply a join with the original dataframe
df = df.drop(columns=["A"])\
    .merge(
        pd.DataFrame({"A": temporary_values}, index=temporary_idx).rename_axis("Index", axis="index"),
        how="outer", right_index=True, left_index=True
    )
print(df)
print(time.time()-startTime,"secs")
When I benchmarked it, this takes less than 2 seconds to execute.
I don't know exactly what your real use case is, but this works for the case of inserting column A as stated in your example. If your use case is more complex than that, there might be a better solution.
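A related pattern that also avoids row-by-row .loc, reusing temporary_idx and temporary_values from the loop above: when every new index is known to be absent from df, you can build the new rows as their own frame and concatenate once (a rough sketch, not part of the answer above):
# rough sketch, assuming every value in temporary_idx is new to df
new_rows = pd.DataFrame(
    {"A": temporary_values},
    index=pd.Index(temporary_idx, dtype="uint64", name="Index"),
)
df = pd.concat([df, new_rows])  # columns missing from new_rows are filled with NaN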

Related

Find the difference between data frames based on specific columns and output the entire record

I want to compare 2 CSVs (A and B) and find the rows which are present in B but not in A, based only on specific columns.
I found a few answers to that, but they still do not give the result I expect.
Answer 1:
df = new[~new['column1', 'column2'].isin(old['column1', 'column2'].values)]
This doesn't work. It works for a single column but not for multiple columns.
Answer 2 :
df = pd.concat([old, new]) # concat dataframes
df = df.reset_index(drop=True) # reset the index
df_gpby = df.groupby(list(df.columns)) #group by
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1] #reindex
final = df.reindex(idx)
This takes specific columns as input and also outputs only those columns. I want to print the whole record, not only the specific columns of the record.
I tried this and it gave me the rows:
import pandas as pd
columns = [{Name of columns you want to use}]
new = pd.merge(A, B, how = 'right', on = columns)
col = new['{any column from the first DataFrame which is not in the columns list; you will probably have to add an _x suffix to the column name}']
col = col.dropna()
new = new[~new['{the same column as above}'].isin(col)]
This will give you the rows based on the columns list. Sorry for the bad naming. If you want to rename the columns a bit too, here's the code for that:
for column in new.columns:
    if '_x' in column:
        new = new.drop(column, axis = 1)
    elif '_y' in column:
        new = new.rename(columns = {column: column[:column.find('_y')]})
Tell me if it works.
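For concreteness, a rough sketch of the same idea with made-up column names: key1 and key2 are the comparison columns, and extra is a column that exists only in A.
import pandas as pd

A = pd.DataFrame({'key1': [1, 2, 3], 'key2': ['x', 'y', 'z'], 'extra': [10, 20, 30]})
B = pd.DataFrame({'key1': [2, 3, 4], 'key2': ['y', 'z', 'w']})

columns = ['key1', 'key2']
new = pd.merge(A, B, how='right', on=columns)
col = new['extra'].dropna()            # rows of B that matched a row in A
new = new[~new['extra'].isin(col)]     # keep rows present in B but not in A
print(new)                             # only the (4, 'w') row remains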

Creating an index after performing groupby on DateTime

I have the following data in the format below (see below)
I next perform recasting, groupby and averaging (see code) to reduce data dimensionality.
df_mod=pd.read_csv('wet_bulb_hr.csv')
#Mod Date
df_mod['wbt_date'] = pd.to_datetime(df_mod['wbt_date'])
#Mod Time
df_mod['wbt_time'] = df_mod['wbt_time'].astype('int')
df_mod['wbt_date'] = df_mod['wbt_date'] + \
pd.to_timedelta(df_mod['wbt_time']-1, unit='h')
df_mod['wet_bulb_temperature'] = \
df_mod['wet_bulb_temperature'].astype('float')
df = df_mod
df = df.drop(['wbt_time','_id'], axis = 1)
#df_novel = df.mean()
df = df.groupby([df.wbt_date.dt.year,df.wbt_date.dt.month]).mean()
After writing to an output file, I get an output that looks like this.
Investigating further, I can understand why. All my processing has resulted in a DataFrame where the group keys end up in the index and only one data column remains, but what I really need is for the two wbt_date group columns to be exported as well. This does not seem to happen because of the groupby.
My question: How do I generate an index and have the groupby wbt_date columns as a new single column such that the output is:
You can flatten the MultiIndex to an Index in YYYY-MM format with a list comprehension:
df = df.groupby([df.wbt_date.dt.year,df.wbt_date.dt.month]).mean()
df.index = [f'{y}-{m}' for y, m in df.index]
df = df.rename_axis('date').reset_index()
Or use a monthly period via Series.dt.to_period:
df = df.groupby(df.wbt_date.dt.to_period('m')).mean().reset_index()
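For context, a tiny self-contained demo of the flattening idea on made-up hourly data (column names borrowed from the question):
import pandas as pd

demo = pd.DataFrame({
    'wbt_date': pd.date_range('2019-01-01', periods=1500, freq='h'),
    'wet_bulb_temperature': range(1500),
})
out = demo.groupby([demo.wbt_date.dt.year, demo.wbt_date.dt.month])[['wet_bulb_temperature']].mean()
out.index = [f'{y}-{m}' for y, m in out.index]   # flatten the (year, month) MultiIndex
out = out.rename_axis('date').reset_index()
print(out)   # one row per year-month: 2019-1, 2019-2, 2019-3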
Try this,
# rename the existing index; on reset it will be added as a new column.
df.index.rename("wbt_year", inplace=True)
df.reset_index(inplace=True)
df['month'] = df['wbt_year'].astype(str) + "-" + df['wbt_date'].astype(str)
Output,
>>> df['month']
0 2019-0
1 2018-1
2 2017-2

Python, Pandas and for loop: Populate dataframe row based on a match with list values

I have a pandas dataframe with an "id" column. I also have a list called 'new_ids', which is a subset of the values found in the "id" column.
So I want to add a column to the pandas dataframe which indicates whether the ID is new or not. I first initialized this column to 0.
df['new_id'] = 0
Now I want to loop through the new_ids list, and whenever an ID is found in the dataframe's "id" column, change the 'new_id' value for the row belonging to that ID to 1. So later on, all the new IDs will have a 1 assigned to them in the "new_id" column, and all old IDs will remain at 0.
index = df.index.values
for x in index:
    if new_ids in df.id:
        df.new_id[x] = '1'
        x = x + 1
    else:
        x = x + 1
This somehow does not work; I am getting a lot of errors. Any idea what I am doing wrong? Many thanks!
Actually, you do not need to iterate over the DataFrame manually. Pandas will do the work for you; it is easy and straightforward to use a built-in method.
Here is some sample code.
import pandas as pd
sample = [['a','b','c'],[1,2,3],[4,5,6],['e','f','g']]
df = pd.DataFrame(sample, columns = ['name', 'ids', 'value'])
new_ids = ['b',5]
df['new_id'] = df['ids'].isin(new_ids)
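If you need 0/1 integers rather than True/False, as in the original question, one extra step converts the boolean mask (same df and new_ids as above):
df['new_id'] = df['ids'].isin(new_ids).astype(int)   # True/False -> 1/0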

multiconditional mapping in python pandas

I am looking for a way to do some conditional mapping using multiple comparisons.
I have millions and millions of rows that I am investigating using sample SQL extracts in pandas. Along with the SQL extracts read into pandas DataFrames I also have some rules tables, each with a few columns (these are also loaded into DataFrames).
This is what I want to do: where a row in my SQL extract matches the conditions expressed in any one row of my rules table, I would like to generate a 1, else a 0. In the end I would like to add a column to my SQL extract, called RULE_RESULT, with either 1s or 0s.
I have got a system that works using df.merge, but it produces many extra duplicate rows in the process that must then be removed afterwards. I am looking for a better, faster, more elegant solution and would be grateful for any suggestions.
Here is a working example of the problem and the current solution code:
import pandas as pd
import numpy as np
#Create a set of test data
test_df = pd.DataFrame()
test_df['A'] = [1,120,982,1568,29,455,None, None, None]
test_df['B'] = ['EU','US',None, 'GB','DE','EU','US', 'GB',None]
test_df['C'] = [1111,1121,1111,1821,1131,1111,1121,1821,1723]
test_df['C_C'] = test_df['C']
test_df
#Create a rules_table
rules_df = pd.DataFrame()
rules_df['A_MIN'] = [0,500,None,600,200]
rules_df['A_MAX'] = [10,1000,500,1200,800]
rules_df['B_B'] = ['GB','GB','US','EU','EU']
rules_df['C_C'] = [1111,1821,1111,1111,None]
rules_df
def map_rules_to_df(df,rules):
    #create a column that mimics the index to keep track of later duplication
    df['IND'] = df.index
    #merge the rules with the data on C values
    df = df.merge(rules,left_on='C_C',right_on='C_C',how='left')
    #create a rule_result column with a default value of zero
    df['RULE_RESULT']=0
    #create a mask identifying those test_df rows that fit with a rules_df row
    mask = df[
        ((df['A'] > df['A_MIN']) | (df['A_MIN'].isnull())) &
        ((df['A'] < df['A_MAX']) | (df['A_MAX'].isnull())) &
        ((df['B'] == df['B_B']) | (df['B_B'].isnull())) &
        ((df['C'] == df['C_C']) | (df['C_C'].isnull()))
    ]
    #use mask.index to replace 0's in the result column with a 1
    df.loc[mask.index.tolist(),'RULE_RESULT']=1
    #drop the redundant rules_df columns
    df = df.drop(['B_B','C_C','A_MIN','A_MAX'],axis=1)
    #drop duplicate rows
    df = df.drop_duplicates(keep='first')
    #drop rows where the original index is duplicated and the rule result is zero
    df = df[((df['IND'].duplicated(keep=False)) & (df['RULE_RESULT']==0)) == False]
    #reset the df index with the original index
    df.index = df['IND'].values
    #drop the now redundant second index column (IND)
    df = df.drop('IND', axis=1)
    print('df shape',df.shape)
    return df
#map the rules
result_df = map_rules_to_df(test_df,rules_df)
result_df
I hope I have made what I would like to do clear and thank you for your help.
PS: my rep is non-existent, so I was not allowed to post more than two supporting images.
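For comparison, a rough sketch of one possible alternative to the merge approach: since the rules table is small, loop over the rules (not the rows) and OR together vectorized masks, which never creates duplicate rows. Note that this sketch treats every null rule field as a wildcard, including C_C, which differs slightly from merging on C_C:
match = pd.Series(False, index=test_df.index)
for rule in rules_df.itertuples(index=False):
    cond = pd.Series(True, index=test_df.index)
    if pd.notnull(rule.A_MIN):
        cond &= test_df['A'] > rule.A_MIN
    if pd.notnull(rule.A_MAX):
        cond &= test_df['A'] < rule.A_MAX
    if pd.notnull(rule.B_B):
        cond &= test_df['B'] == rule.B_B
    if pd.notnull(rule.C_C):
        cond &= test_df['C'] == rule.C_C
    match |= cond
test_df['RULE_RESULT'] = match.astype(int)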

How to select last row and also how to access PySpark dataframe by index?

From a PySpark SQL dataframe like
name age city
abc 20 A
def 30 B
How do I get the last row? (With df.limit(1) I can get the first row of the dataframe into a new dataframe.)
And how can I access the dataframe rows by index, like row no. 12 or 200?
In pandas I can do
df.tail(1) # for last row
df.ix[rowno or index] # by index
df.loc[] or by df.iloc[]
I am just curious how to access pyspark dataframe in such ways or alternative ways.
Thanks
How to get the last row.
A long and ugly way, which assumes that all columns are orderable:
from pyspark.sql.functions import (
    col, max as max_, struct, monotonically_increasing_id
)
last_row = (df
    .withColumn("_id", monotonically_increasing_id())
    .select(max_(struct("_id", *df.columns)).alias("tmp"))
    .select(col("tmp.*"))
    .drop("_id"))
If not all columns can be ordered, you can try:
with_id = df.withColumn("_id", monotonically_increasing_id())
i = with_id.select(max_("_id")).first()[0]
with_id.where(col("_id") == i).drop("_id")
Note: there is a last function in pyspark.sql.functions / o.a.s.sql.functions, but considering the description of the corresponding expressions it is not a good choice here.
how can I access the dataframe rows by index, like row no. 12 or 200?
You cannot. A Spark DataFrame is not accessible by index. You can add indices using zipWithIndex and filter later. Just keep in mind that this is an O(N) operation.
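A rough sketch of that zipWithIndex route (assuming an active SparkSession; the column name idx is made up):
# add a positional index via the underlying RDD, then filter on it
indexed = (df.rdd
    .zipWithIndex()                          # -> (Row, position) pairs
    .map(lambda pair: pair[0] + (pair[1],))  # append the position to the row tuple
    .toDF(df.columns + ["idx"]))
indexed.where("idx = 12").show()             # e.g. row no. 12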
How to get the last row.
If you have a column that you can use to order the dataframe, for example "index", then one easy way to get the last record is using SQL:
1) order your table in descending order and
2) take the 1st value from that order
df.createOrReplaceTempView("table_df")
query_latest_rec = """SELECT * FROM table_df ORDER BY index DESC limit 1"""
latest_rec = self.sqlContext.sql(query_latest_rec)
latest_rec.show()
And how can I access the dataframe rows by index, like row no. 12 or 200?
In a similar way you can get the record at any line:
row_number = 12
df.createOrReplaceTempView("table_df")
query_latest_rec = """SELECT * FROM (select * from table_df ORDER BY index ASC limit {0}) ord_lim ORDER BY index DESC limit 1"""
latest_rec = self.sqlContext.sql(query_latest_rec.format(row_number))
latest_rec.show()
If you do not have an "index" column, you can create one using:
from pyspark.sql.functions import monotonically_increasing_id
df = df.withColumn("index", monotonically_increasing_id())
from pyspark.sql import functions as F
expr = [F.last(col).alias(col) for col in df.columns]
df.agg(*expr)
Just a tip: it looks like you still have the mindset of someone working with pandas or R. Spark is a different paradigm in the way we work with data. You don't access data inside individual cells anymore; now you work with whole chunks of it. If you keep collecting stuff and performing actions, like you just did, you lose the whole concept of parallelism that Spark provides. Take a look at the concept of transformations vs actions in Spark.
Use the following to get an index column that contains monotonically increasing, unique, and consecutive integers, which is not how monotonically_increasing_id() works. The indexes will be ascending in the same order as colName of your DataFrame.
import pyspark.sql.functions as F
from pyspark.sql.window import Window as W
window = W.orderBy('colName').rowsBetween(W.unboundedPreceding, W.currentRow)
df = df\
    .withColumn('int', F.lit(1))\
    .withColumn('index', F.sum('int').over(window))\
    .drop('int')
Use the following code to look at the tail, i.e. the last rownums rows, of the DataFrame.
rownums = 10
df.where(F.col('index')>df.count()-rownums).show()
Use the following code to look at the rows from start_row to end_row of the DataFrame.
start_row = 20
end_row = start_row + 10
df.where((F.col('index')>start_row) & (F.col('index')<end_row)).show()
zipWithIndex() is an RDD method that does return monotonically increasing, unique, and consecutive integers, but appears to be much slower to implement in a way where you can get back to your original DataFrame amended with an id column.
