Efficient way to add many rows to a DataFrame - python

I really want to speed up my code. My working code loops through a DataFrame, gets the start and end year, and appends them to lists. After the loop, I build a DataFrame from the lists.
import pandas as pd

rows = range(3560)
# initiate lists and dataframe
start_year = []
end_year = []
for i in rows:
    start_year.append(i)
    end_year.append(i)
df = pd.DataFrame({'Start date': start_year, 'End date': end_year})
I get what I expect, but very slowly:
   Start date  End date
0           1         1
1           2         2
2           3         3
3           4         4

Yes, it can be made faster. The trick is to avoid list.append (or, worse, pd.DataFrame.append) in a loop. You could use list(range(3560)), but you may find np.arange even more efficient. Here you can assign one array to multiple columns via dict.fromkeys:
import numpy as np
import pandas as pd

df = pd.DataFrame(dict.fromkeys(['Start date', 'End date'], np.arange(3560)))
print(df.shape)
# (3560, 2)
print(df.head())
#    Start date  End date
# 0           0         0
# 1           1         1
# 2           2         2
# 3           3         3
# 4           4         4
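
If you want to measure the gap yourself, a small timeit sketch along these lines works; the two wrapper functions are just for illustration and not part of either snippet above:

import timeit

import numpy as np
import pandas as pd

def with_loop(n=3560):
    # the original approach: grow two lists, then build the frame
    start_year, end_year = [], []
    for i in range(n):
        start_year.append(i)
        end_year.append(i)
    return pd.DataFrame({'Start date': start_year, 'End date': end_year})

def with_arange(n=3560):
    # the vectorized approach: one array assigned to both columns
    return pd.DataFrame(dict.fromkeys(['Start date', 'End date'], np.arange(n)))

print('loop:  ', timeit.timeit(with_loop, number=100))
print('arange:', timeit.timeit(with_arange, number=100))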

Related

Checking Dataframe for Offsetting Values

I have a list of transactions that lists the matter, the date, and the amount. People entering the data often make mistakes and have to reverse out costs by entering a new cost with a negative amount to offset the error. I'm trying to identify both reversal entries and the entry being reversed by grouping my data according to matter number and work date and then comparing Amounts.
The data looks something like this:
|MatterNum|WorkDate|Amount|
|---------|--------|------|
|1|1/02/2022|10|
|1|1/02/2022|15|
|1|1/02/2022|-10|
|2|1/04/2022|15|
|2|1/05/2022|-5|
|2|1/05/2022|5|
So my output table would look like this:
|MatterNum|WorkDate|Amount|Reversal?|
|---------|--------|------|---------|
|1|1/02/2022|10|yes|
|1|1/02/2022|15|no|
|1|1/02/2022|-10|yes|
|2|1/04/2022|15|no|
|2|1/05/2022|-5|yes|
|2|1/05/2022|5|yes|
Right now, I'm using the following code to check each row:
import pandas as pd

data = [
    [1, '1/2/2022', 10],
    [1, '1/2/2022', 15],
    [1, '1/2/2022', -10],
    [2, '1/4/2022', 12],
    [2, '1/5/2022', -5],
    [2, '1/5/2022', 5]
]
df = pd.DataFrame(data, columns=['MatterNum', 'WorkDate', 'Amount'])

def rev_check(MatterNum, workDate, WorkAmt, df):
    # restrict to the same matter and work date, then look for an
    # exactly offsetting amount in that group
    funcDF = df.loc[(df['MatterNum'] == MatterNum) & (df['WorkDate'] == workDate)]
    listCheck = funcDF['Amount'].tolist()
    if WorkAmt * -1 in listCheck:
        return 'yes'

df['reversal?'] = df.apply(lambda row: rev_check(row.MatterNum, row.WorkDate, row.Amount, df), axis=1)
This seems to work, but it is pretty slow. I need to check millions of rows of data. Is there a better way I can approach this that would be more efficient?
If I assume that a "reversal" is when this row's amount is less than the previous row's amount, then pandas can do this with diff:
import pandas as pd

data = [
    [1, '1/2/2022', 10],
    [1, '1/2/2022', 15],
    [1, '1/2/2022', -10],
    [1, '1/2/2022', 12]
]
df = pd.DataFrame(data, columns=['MatterNum', 'WorkDate', 'Amount'])
print(df)
df['Reversal'] = df['Amount'].diff() < 0
print(df)
Output:
   MatterNum  WorkDate  Amount
0          1  1/2/2022      10
1          1  1/2/2022      15
2          1  1/2/2022     -10
3          1  1/2/2022      12

   MatterNum  WorkDate  Amount  Reversal
0          1  1/2/2022      10     False
1          1  1/2/2022      15     False
2          1  1/2/2022     -10      True
3          1  1/2/2022      12     False
The first row has to be special-cased, since there's nothing to compare against.
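If you do need the original matching logic (flag a row whenever an exactly offsetting amount exists in the same MatterNum/WorkDate group), a vectorized sketch like the following avoids the per-row apply; it yields a boolean flag rather than the 'yes' strings, and the MultiIndex membership test is just one way to do the group-wise lookup:

import pandas as pd

data = [
    [1, '1/2/2022', 10],
    [1, '1/2/2022', 15],
    [1, '1/2/2022', -10],
    [2, '1/4/2022', 12],
    [2, '1/5/2022', -5],
    [2, '1/5/2022', 5]
]
df = pd.DataFrame(data, columns=['MatterNum', 'WorkDate', 'Amount'])

# keys actually present in each (matter, date) group
present = pd.MultiIndex.from_frame(df[['MatterNum', 'WorkDate', 'Amount']])
# the key each row would need to find to count as offset
wanted = pd.MultiIndex.from_arrays([df['MatterNum'], df['WorkDate'], -df['Amount']])

df['reversal?'] = wanted.isin(present)
print(df)

One caveat: a row whose amount is 0 offsets itself under this test, exactly as it does in the original rev_check.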

changing index of 1 row in pandas

I have the below df built from a pivot of a larger df. In this table 'week' is the index (dtype = object) and I need to show week 53 as the first row instead of the last.
Can someone advise please? I tried reindex and custom sorting but can't find the way.
Thanks!
Here is the table:
Since you can't directly insert the row and push the others back, a trick you can use is to create a new ordering:
# adds a new column, "new" with the original order
df['new'] = range(1, len(df) + 1)
# sets value that has index 53 with 0 on the new column
# note that this comparison requires you to match index type
# so if weeks are object, you should compare df.index == '53'
df.loc[df.index == 53, 'new'] = 0
# sorts values by the new column and drops it
df = df.sort_values("new").drop('new', axis=1)
Before:
numbers
weeks
1 181519.23
2 18507.58
3 11342.63
4 6064.06
53 4597.90
After:
numbers
weeks
53 4597.90
1 181519.23
2 18507.58
3 11342.63
4 6064.06
One way of doing this would be:
import pandas as pd
df = pd.DataFrame(range(10))
new_df = df.loc[[df.index[-1]]+list(df.index[:-1])].reset_index(drop=True)
output:
   0
0  9
1  0
2  1
3  2
4  3
5  4
6  5
7  6
8  7
9  8
Alternate method:
new_df = pd.concat([df[df["Year week"]==52], df[~(df["Year week"]==52)]])
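
Applied to the weeks table from the first answer, the same rotation idea might look like this; a sketch assuming week 53 is the last row of the pivot and the index holds strings (dtype object):

import pandas as pd

# toy stand-in for the pivot table in the question
df = pd.DataFrame(
    {'numbers': [181519.23, 18507.58, 11342.63, 6064.06, 4597.90]},
    index=pd.Index(['1', '2', '3', '4', '53'], name='weeks'),
)

# move the last positional row (week 53) to the front, keeping the labels
new_df = df.iloc[[-1] + list(range(len(df) - 1))]
print(new_df)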

python pandas new column with order of values

I would like to make a new column with the order of the numbers in a list. I get 3,1,0,4,2,5 (the indices of the lowest numbers), but I would like a new column with 2,1,4,0,3,5, so that when I look at a row I can see what position its number holds in the whole list. What am I doing wrong?
import pandas as pd

df = pd.DataFrame({'list': [4, 3, 6, 1, 5, 9]})
df['order'] = df.sort_values(by='list').index
print(df)
What you're looking for is the rank:
import pandas as pd
df = pd.DataFrame({'list': [4,3,6,1,5,9]})
df['order'] = df['list'].rank().sub(1).astype(int)
Result:
   list  order
0     4      2
1     3      1
2     6      4
3     1      0
4     5      3
5     9      5
You can use the method parameter to control how to resolve ties.
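For instance, with ties broken by order of appearance (method='first'), a duplicated value still gets a unique position; a small sketch with a duplicated 4:

import pandas as pd

df = pd.DataFrame({'list': [4, 3, 4, 1]})
# equal values are ranked in the order they appear
df['order'] = df['list'].rank(method='first').sub(1).astype(int)
print(df)
#    list  order
# 0     4      2
# 1     3      1
# 2     4      3
# 3     1      0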

How to apply a python function to pandas sub-dataframes split 'from the end' and get a new dataframe?

The problem
Starting from a pandas dataframe df made of dim_df rows, I need a new dataframe df_new obtained by applying a function to every sub-dataframe of dimension dim_blk, ideally split starting from the last row (so the first block, not the last, may or may not have the full dim_blk rows), in the most efficient way (maybe vectorized?).
Example
In the following example the dataframe has only a few rows, but the real dataframe will have millions of rows, which is why I need an efficient solution.
import numpy as np
import pandas as pd

dim_df = 7   # dimension of the starting dataframe
dim_blk = 3  # number of rows of each split block
df = pd.DataFrame(np.arange(1, dim_df + 1), columns=['TEST'])
print(df)
Output:
TEST
0 1
1 2
2 3
3 4
4 5
5 6
6 7
The split blocks I want:
1      # note: the first block may have fewer than dim_blk rows
2,3,4
5,6,7  # note: the last block has exactly dim_blk rows
I've done it like this (I don't know if this is the efficient way):
lst = np.arange(dim_df, 0, -dim_blk) # [7 4 1]
lst_mod = lst[1:] # [4 1] to cut off the last empty sub-dataframe
split_df = np.array_split(df, lst_mod[::-1]) # splitted by reversed list
print(split_df)
Output:
split_df: [
TEST
0 1,
TEST
1 2
2 3
3 4,
TEST
4 5
5 6
6 7]
For example:
print(split_df[1])
Output:
TEST
1 2
2 3
3 4
How can I get a new dataframe, df_new, where every row is made of two columns, min and max (just an example), calculated over every block?
I.e:
# df_new
Min Max
0 1 1
1 2 4
2 5 7
Thank you,
Gilberto
You can convert split_df into a single wide dataframe (one block per row) and then build a new dataframe using the min and max functions, i.e.:
split_df = pd.DataFrame(np.array_split(df['TEST'], lst_mod[::-1]))
df_new = pd.DataFrame({"MIN":split_df.min(axis=1),"MAX":split_df.max(axis=1)}).reset_index(drop=True)
Output:
MAX MIN
0 1.0 1.0
1 4.0 2.0
2 7.0 5.0
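
A sketch of another way to get the same result without the intermediate wide frame: label each row with a block id counted so that the remainder lands in the first block, then aggregate with groupby (the pad arithmetic here is my own illustration, not from the original answer):

import numpy as np
import pandas as pd

dim_df = 7
dim_blk = 3
df = pd.DataFrame(np.arange(1, dim_df + 1), columns=['TEST'])

# pad the row numbers as if the frame were extended at the front, so the
# first block absorbs the remainder: ids here are [0, 1, 1, 1, 2, 2, 2]
pad = (-dim_df) % dim_blk
block = (np.arange(dim_df) + pad) // dim_blk

df_new = (df.groupby(block)['TEST']
            .agg(Min='min', Max='max')
            .reset_index(drop=True))
print(df_new)
#    Min  Max
# 0    1    1
# 1    2    4
# 2    5    7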
Moved solution from question to answer:
The Solution
I've thought laterally and found a very speedy solution:
1. Apply a rolling function to the entire dataframe
2. Choose every dim_blk-th row starting from the end
The code (with different values):
import numpy as np
import pandas as pd
import time
dim_df = 500000
dim_blk = 240
df = pd.DataFrame(np.arange(1,dim_df+1), columns=['TEST'])
start_time = time.time()
df['MAX'] = df['TEST'].rolling(dim_blk).max()
df['MIN'] = df['TEST'].rolling(dim_blk).min()
df[['MAX', 'MIN']] = df[['MAX', 'MIN']].bfill()
df_split = pd.DataFrame(columns=['MIN', 'MAX'])
df_split['MAX'] = df['MAX'][-1::-dim_blk][::-1]
df_split['MIN'] = df['MIN'][-1::-dim_blk][::-1]
df_split.reset_index(drop=True, inplace=True)
print(df_split.tail())
print('\n\nEND\n\n')
print("--- %s seconds ---" % (time.time() - start_time))
Time Stats
The original code stops after 545 secs. The new code stops after 0.16 secs. Awesome!

Python Pandas dataframe

I have one dataframe (df1) like the following:
ATime ETime Difference
0 1444911017815 1588510 1444909429305
1 1444911144979 1715672 1444909429307
2 1444911285683 1856374 1444909429309
3 1444911432742 2003430 1444909429312
4 1444911677101 2247786 1444909429315
5 1444912444821 3015493 1444909429328
6 1444913394542 3965199 1444909429343
7 1444913844134 4414784 1444909429350
8 1444914948835 5519467 1444909429368
9 1444915840638 6411255 1444909429383
10 1444916566634 7137240 1444909429394
11 1444917379593 7950186 1444909429407
I have another very big dataframe (df2) which has a column named Absolute_Time. Absolute_Time has the same format as ATime in df1. What I want to do is, for example, for all Absolute_Time values that lie in the range from row 0 to row 1 of ETime of df1, subtract row 0 of Difference of df1, and so on.
Here's an attempt to accomplish what you might be looking for, starting with:
print(df1)
ATime ETime Difference
0 1444911017815 1588510 1444909429305
1 1444911144979 1715672 1444909429307
2 1444911285683 1856374 1444909429309
3 1444911432742 2003430 1444909429312
4 1444911677101 2247786 1444909429315
5 1444912444821 3015493 1444909429328
6 1444913394542 3965199 1444909429343
7 1444913844134 4414784 1444909429350
8 1444914948835 5519467 1444909429368
9 1444915840638 6411255 1444909429383
10 1444916566634 7137240 1444909429394
11 1444917379593 7950186 1444909429407
next, creating a new DataFrame with random times within the range of df1:
from random import randrange

df2 = pd.DataFrame({'Absolute Time': [randrange(start=df1.ATime.iloc[0], stop=df1.ATime.iloc[-1]) for i in range(100)]})
df2 = df2.sort_values('Absolute Time').reset_index(drop=True)
np.searchsorted provides you with the index positions where df2 should be inserted in df1 (for the columns in question):
df2.index = np.searchsorted(df1.ATime.values, df2.loc[:, 'Absolute Time'].values)
Assigning the new index and merging produces a new DataFrame. Filling the missing Difference values forward allows the subtraction in the next step:
df = pd.merge(df1, df2, left_index=True, right_index=True, how='left').ffill().dropna().astype(int)
df['Absolute Time Adjusted'] = df['Absolute Time'].sub(df.Difference)
print(df.head())
ATime ETime Difference Absolute Time \
1 1444911144979 1715672 1444909429307 1444911018916
1 1444911144979 1715672 1444909429307 1444911138087
2 1444911285683 1856374 1444909429309 1444911138087
3 1444911432742 2003430 1444909429312 1444911303233
3 1444911432742 2003430 1444909429312 1444911359690
Absolute Time Adjusted
1 1589609
1 1708780
2 1708778
3 1873921
3 1930378
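
For this kind of "align each time with the next reference row" step, pd.merge_asof with direction='forward' is an alternative worth knowing; a sketch that keeps one row per df2 entry rather than the merged frame above:

import pandas as pd

# match each Absolute Time to the first df1 row whose ATime is >= it,
# mirroring the np.searchsorted step
merged = pd.merge_asof(
    df2.sort_values('Absolute Time'),
    df1.sort_values('ATime'),
    left_on='Absolute Time', right_on='ATime',
    direction='forward',
)
merged['Absolute Time Adjusted'] = merged['Absolute Time'] - merged['Difference']
print(merged.head())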
