How can I divide up a pandas dataframe? - python

I have an enormous timeseries of functions stored in a pandas dataframe in an HDF5 store and I want to make plots of a certain transform of every function in the timeseries. Since the number of plots is so large, and plotting them takes so long, I've used fork() and numpy.array_split() to break the indices up and run several plots in parallel.
Doing things this way means that every process has a copy of the whole timeseries. Since the total amount of memory I use is what limits how many processes I can run, I would like each process to store only its own chunk of the dataframe.
How can I split up a pandas dataframe?

np.array_split works pretty well for this use case.
In [40]: df = pd.DataFrame(np.random.randn(5, 10))

In [41]: df
Out[41]:
          0         1         2         3         4         5         6         7         8         9
0 -1.998163 -1.973708  0.461369 -0.575661  0.862534 -1.326168  1.164199 -1.004121  1.236323 -0.339586
1 -0.591188 -0.162782  0.043923  0.101241  0.120330 -1.201497 -0.108959 -0.033221  0.145400 -0.324831
2  0.114842  0.200597  2.792904  0.769636 -0.698700 -0.544161  0.838117 -0.013527 -0.623317 -1.461193
3  1.309628 -0.444961  0.323008 -1.409978 -0.697961  0.132321 -2.851494  1.233421 -1.540319  1.107052
4  0.436368  0.627954 -0.942830  0.448113 -0.030464  0.764961 -0.241905 -0.620992  1.238171 -0.127617

array_split returns a list of 3 DataFrames here; the loop below just pretty-prints each piece.

In [43]: for dfs in np.array_split(df, 3, axis=1):
   ....:     print(dfs, "\n")
   ....:

          0         1         2         3
0 -1.998163 -1.973708  0.461369 -0.575661
1 -0.591188 -0.162782  0.043923  0.101241
2  0.114842  0.200597  2.792904  0.769636
3  1.309628 -0.444961  0.323008 -1.409978
4  0.436368  0.627954 -0.942830  0.448113

          4         5         6
0  0.862534 -1.326168  1.164199
1  0.120330 -1.201497 -0.108959
2 -0.698700 -0.544161  0.838117
3 -0.697961  0.132321 -2.851494
4 -0.030464  0.764961 -0.241905

          7         8         9
0 -1.004121  1.236323 -0.339586
1 -0.033221  0.145400 -0.324831
2 -0.013527 -0.623317 -1.461193
3  1.233421 -1.540319  1.107052
4 -0.620992  1.238171 -0.127617
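Coming back to the memory concern in the question: the split there is by rows, not columns. Below is a minimal sketch of that direction using multiprocessing instead of a raw fork(); plot_chunk is a hypothetical stand-in for the real transform-and-plot step, and np.array_split's default axis=0 splits by rows.

import numpy as np
import pandas as pd
from multiprocessing import Pool

def plot_chunk(chunk):
    # hypothetical per-row work; stands in for "transform, then plot"
    for _idx, row in chunk.iterrows():
        transformed = row.abs().cumsum()   # placeholder transform
    return len(chunk)

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randn(1000, 10))
    chunks = np.array_split(df, 4)          # default axis=0: split by rows
    with Pool(processes=4) as pool:
        # each worker receives a pickled copy of only its own chunk
        done = pool.map(plot_chunk, chunks)
    print(done)

Each worker then holds only its chunk, although the parent process that does the splitting still holds the full frame.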

Related

split Python DataFrame into k parts with index and iterate over them in a loop

I suppose that someone might have asked this already, but for the life of me I cannot find what I need after some looking, possibly my level of Py is too low.
I saw several questions with answers using globals() and exec() with comments that it's a bad idea, other answers suggest using dictionaries or lists. At this point I got a bit loopy about what to use here and any help would be very welcome.
What I need is roughly this:
I have a Python DataFrame, say called dftest
I'd like to split dftest into say 6 parts of similar size
then I'd like to iterate over them (or possibly parallelise?) and run some steps that call spatial functions with parameters (param0, param1, ... param5) on each row of each df to add more columns, preferably exporting each result to a csv (since one part takes a long time to complete, I wouldn't want to lose the result of each iteration)
And then I'd like to put them back together into one DataFrame, say dfresult (possibly with concat) and continue doing the next thing with it
To keep it simple, this is what a toy dftest looks like (the original df has more rows and columns):
print(dftest)
# rowid type lon year
# 1 1 Tomt NaN 2021
# 2 2 Lägenhet 12.72 2022
# 3 3 Lägenhet NaN 2017
# 4 4 Villa 17.95 2016
# 5 5 Radhus 17.95 2021
# 6 6 Villa 17.95 2016
# 7 7 Fritidshus 18.64 2020
# 8 8 Villa 18.64 2019
# 9 9 Villa 18.63 2021
# 10 10 Villa 18.63 2019
# 11 11 Villa 17.66 2017
# 12 12 Radhus 17.66 2022
So here is what I tried:
dfs = np.array_split(dftest, 6)
for j in range(0, 6):
    print(f'dfs[{j}] has', len(dfs[j].index), 'obs ', min(dfs[j].index), 'to ', max(dfs[j].index))
where I get output:
# dfs[0] has 2 obs 1 to 2
# dfs[1] has 2 obs 3 to 4
# dfs[2] has 2 obs 5 to 6
# dfs[3] has 2 obs 7 to 8
# dfs[4] has 2 obs 9 to 10
# dfs[5] has 2 obs 11 to 12
So now I'd like to iterate over each df and create more columns. I tried a hardcoded test, one by one something like this:
for row in tqdm(dfs[0].itertuples()):
    x = row.type
    y = foo.bar(x, param="param0")
    i = row[0]
    dfs[0].loc[i, 'anotherColumn'] = baz(y)
    # ... some more functions ...
dfs[0].to_csv("/projectPath/dfs0.csv")
I suppose this should be possible to automate or even run in parallel (how?)
And in the end I'll try putting them together (no clue if this would work), possibly something like this:
pd.concat([dfs[0],dfs[1],dfs[2],dfs[3],dfs[4],dfs[5]])
If I had 100 parts, perhaps something like dfs[0]:dfs[5] would work instead of listing them all... but I'm still stuck on the previous step.
PS. I'm using a Jupyter notebook on localhost with Python 3.
As far as I understand, you can use the chunk_apply function of the parallel-pandas library. It splits the dataframe into chunks, applies a custom function to each chunk, and then concatenates the results; everything runs in parallel. Toy example:
# pip install parallel-pandas
import pandas as pd
import numpy as np
from parallel_pandas import ParallelPandas

# initialize parallel-pandas; n_cpu is the number of cores (and of chunks)
ParallelPandas.initialize(n_cpu=8)

def foo(df):
    # do something with df
    df['new_col'] = df.sum(axis=1)
    return df

if __name__ == '__main__':
    ROW = 10000
    COL = 10
    df = pd.DataFrame(np.random.random((ROW, COL)))
    res = df.chunk_apply(foo, axis=0)
    print(res.head())
Out:
0 1 2 ... 8 9 new_col
0 0.735248 0.393912 0.966608 ... 0.261675 0.207216 6.276589
1 0.256962 0.461601 0.341175 ... 0.688134 0.607418 5.297881
2 0.335974 0.093897 0.622115 ... 0.442783 0.115127 3.102827
3 0.488585 0.709927 0.209429 ... 0.942065 0.126600 4.367873
4 0.619996 0.704085 0.685806 ... 0.626539 0.145320 4.901926
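If you would rather stay with the standard library, the same split, process, export, and concatenate cycle can be sketched with concurrent.futures; the new column computed in process_chunk below is just a placeholder for the real spatial functions and parameters.

import numpy as np
import pandas as pd
from concurrent.futures import ProcessPoolExecutor

def process_chunk(args):
    j, chunk = args
    chunk = chunk.copy()
    chunk['anotherColumn'] = chunk['year'] * 2   # placeholder for the real work
    chunk.to_csv(f'dfs{j}.csv')                  # keep each partial result
    return chunk

if __name__ == '__main__':
    # stand-in for dftest; the real frame comes from the question
    dftest = pd.DataFrame({'type': ['Villa'] * 12, 'year': list(range(2010, 2022))})
    jobs = list(enumerate(np.array_split(dftest, 6)))
    with ProcessPoolExecutor(max_workers=6) as ex:
        parts = list(ex.map(process_chunk, jobs))
    dfresult = pd.concat(parts).sort_index()

Note that in a Jupyter notebook the worker function may need to live in an importable module for ProcessPoolExecutor to pickle it; a plain for loop over jobs runs the same steps sequentially without that caveat.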

Take average of window in pandas

I have a large pandas dataframe and I want to average the first 12 rows, then the next 12 rows, and so on. I wrote a for loop for this task:
df_list = []
for i in range(0, len(df), 12):
    print(i, i + 12)
    df_list.append(df.iloc[i:i + 12].mean())
pd.concat(df_list, axis=1).T
Is there an efficient way to do this without a for loop?
You can floor-divide the index by N (12 in your case), then group the dataframe by the quotient, and finally call mean on these groups:
# Random dataframe of shape 120,4
>>> df=pd.DataFrame(np.random.randint(10,100,(120,4)), columns=list('ABCD'))
>>> df.groupby(df.index//12).mean()
A B C D
0 49.416667 52.583333 63.833333 47.833333
1 60.166667 61.666667 53.750000 34.583333
2 49.916667 54.500000 50.583333 64.750000
3 51.333333 51.333333 56.333333 60.916667
4 51.250000 51.166667 50.750000 50.333333
5 56.333333 50.916667 51.416667 59.750000
6 53.750000 57.000000 45.916667 59.250000
7 48.583333 59.750000 49.250000 50.750000
8 53.750000 48.750000 51.583333 68.000000
9 54.916667 48.916667 57.833333 43.333333
I believe you want to split your dataframe into separate chunks of 12 rows. Then you can use np.arange inside groupby to take the mean of each separate chunk:
df.groupby(np.arange(len(df)) // 12).mean()
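As a quick sanity check, here is a small sketch comparing the loop from the question with the groupby version on a random frame:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, 100, (120, 4)), columns=list('ABCD'))

# loop version from the question
loop_means = pd.concat(
    [df.iloc[i:i + 12].mean() for i in range(0, len(df), 12)], axis=1
).T

# vectorised version: floor-divide the positional index by the window size
group_means = df.groupby(np.arange(len(df)) // 12).mean()

assert np.allclose(loop_means.to_numpy(), group_means.to_numpy())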

Comparing elements between two dataframes and adding columns in case of equality

Considering two dataframes as follows:
import pandas as pd
df_rp = pd.DataFrame({'id':[1,2,3,4,5,6,7,8], 'res': ['a','b','c','d','e','f','g','h']})
df_cdr = pd.DataFrame({'id': [1, 2, 5, 6, 7, 1, 2, 3, 8, 9, 3, 4, 8],
                       'LATITUDE': [-22.98, -22.97, -22.92, -22.87, -22.89, -22.84, -22.98,
                                    -22.14, -22.28, -22.42, -22.56, -22.70, -22.13],
                       'LONGITUDE': [-43.19, -43.39, -43.24, -43.28, -43.67, -43.11, -43.22,
                                     -43.33, -43.44, -43.55, -43.66, -43.77, -43.88]})
What I have to do:
Compare each df_rp['id'] element with each df_cdr['id'] element;
If they are the same, I need to add to a data structure (list, series, etc.) the latitudes and longitudes that are in the same row as that id, without repeating the id.
Below is an example of how I need the data to be grouped:
1:[-22.98,-43.19],[-22.84,-43.11]
2:[-22.97,-43.39],[-22.98,-43.22]
3:[-22.14,-43.33],[-22.56,-43.66]
4:[-22.70,-43.77]
5:[-22.92,-43.24]
6:[-22.87,-43.28]
7:[-22.89,-43.67]
8:[-22.28,-43.44],[-22.13,-43.88]
I'm having a hard time choosing which data structure is best for the situation (the way I wrote the example looks like a dictionary, but there would be several dictionaries) and figuring out how to add the latitude and longitude pairs without repeating the id. I appreciate any help.
We need to agg the second df, then reindex and assign it back:
df_rp['L$L'] = (df_cdr.drop(columns='id')
                .apply(tuple, axis=1)
                .groupby(df_cdr.id).agg(list)
                .reindex(df_rp.id).to_numpy())
df_rp
Out[59]:
id res L$L
0 1 a [(-22.98, -43.19), (-22.84, -43.11)]
1 2 b [(-22.97, -43.39), (-22.98, -43.22)]
2 3 c [(-22.14, -43.33), (-22.56, -43.66)]
3 4 d [(-22.7, -43.77)]
4 5 e [(-22.92, -43.24)]
5 6 f [(-22.87, -43.28)]
6 7 g [(-22.89, -43.67)]
7 8 h [(-22.28, -43.44), (-22.13, -43.88)]
df_cdr['lat_long'] = df_cdr.apply(lambda x: [x['LATITUDE'], x['LONGITUDE']], axis=1)
df_cdr = df_cdr.drop(columns=['LATITUDE', 'LONGITUDE'])
df_cdr = df_cdr.groupby('id').agg(lambda x: x.tolist())
Output
lat_long
id
1 [[-22.98, -43.19], [-22.84, -43.11]]
2 [[-22.97, -43.39], [-22.98, -43.22]]
3 [[-22.14, -43.33], [-22.56, -43.66]]
4 [[-22.7, -43.77]]
5 [[-22.92, -43.24]]
6 [[-22.87, -43.28]]
7 [[-22.89, -43.67]]
8 [[-22.28, -43.44], [-22.13, -43.88]]
9 [[-22.42, -43.55]]
Assuming df_rp.id is unique and sorted as in your sample, I came up with a solution using set_index and loc to drop the ids that are in df_cdr but not in df_rp. Next, call groupby with a lambda that returns arrays:
s = (df_cdr.set_index('id').loc[df_rp.id]
     .groupby(level=0)
     .apply(lambda x: x.to_numpy()))
Out[709]:
id
1 [[-22.98, -43.19], [-22.84, -43.11]]
2 [[-22.97, -43.39], [-22.98, -43.22]]
3 [[-22.14, -43.33], [-22.56, -43.66]]
4 [[-22.7, -43.77]]
5 [[-22.92, -43.24]]
6 [[-22.87, -43.28]]
7 [[-22.89, -43.67]]
8 [[-22.28, -43.44], [-22.13, -43.88]]
dtype: object
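If a plain dictionary keyed by id is easier to work with downstream (which is what the desired output in the question suggests), here is a sketch of that variant on the same df_rp and df_cdr:

pairs = (
    df_cdr[df_cdr['id'].isin(df_rp['id'])]           # keep only ids present in df_rp
    .groupby('id')[['LATITUDE', 'LONGITUDE']]
    .apply(lambda g: g.to_numpy().tolist())
    .to_dict()
)
print(pairs[1])   # [[-22.98, -43.19], [-22.84, -43.11]]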

round pandas column with precision but no trailing 0

This is not a duplicate, because I'm asking about pandas round().
I have a dataframe with some columns with numbers. I run
df = df.round(decimals=6)
That successfully shortened the long decimals, so instead of 15.36785699998 it correctly writes 15.367857, but I still get values like 1.0 or 16754.0 with a trailing zero.
How do I get rid of the trailing zeros in all the columns after running pandas df.round()?
I want to save the dataframe as a csv, and need the data to show the way I wish.
df = df.round(decimals=6).astype(object)
Converting to object will allow mixed representations. But, keep in mind that this is not very useful from a performance standpoint.
df
A B
0 0.149724 -0.770352
1 0.606370 -1.194557
2 10.000000 10.000000
3 10.000000 10.000000
4 0.843729 -1.571638
5 -0.427478 -2.028506
6 -0.583209 1.114279
7 -0.437896 0.929367
8 -1.025460 1.156107
9 0.535074 1.085753
df.round(6).astype(object)
A B
0 0.149724 -0.770352
1 0.60637 -1.19456
2 10 10
3 10 10
4 0.843729 -1.57164
5 -0.427478 -2.02851
6 -0.583209 1.11428
7 -0.437896 0.929367
8 -1.02546 1.15611
9 0.535074 1.08575
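If the end goal is just the CSV, another option is to format the values yourself while writing. A sketch, assuming every column is numeric; trim is a hypothetical helper, not a pandas function:

import pandas as pd

df = pd.DataFrame({'A': [15.36785699998, 1.0, 16754.0]})

def trim(x):
    # round to 6 decimals, then strip trailing zeros and any dangling '.'
    return f'{x:.6f}'.rstrip('0').rstrip('.')

df.apply(lambda col: col.map(trim)).to_csv('out.csv', index=False)
# column A is written as: 15.367857, 1, 16754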

Finding the percent change of values in a Series

I have a DataFrame with 2 columns. I need to know at what point the number of questions has increased.
In [19]: status
Out[19]:
seconds questions
0 751479 9005591
1 751539 9207129
2 751599 9208994
3 751659 9210429
4 751719 9211944
5 751779 9213287
6 751839 9214916
7 751899 9215924
8 751959 9216676
9 752019 9217533
I need the percent change of the 'questions' column and then to sort on it. This does not work:
status.pct_change('questions').sort('questions').head()
Any suggestions?
Try this way instead:
>>> status['change'] = status.questions.pct_change()
>>> status.sort_values('change', ascending=False)
questions seconds change
0 9005591 751479 NaN
1 9207129 751539 0.022379
2 9208994 751599 0.000203
6 9214916 751839 0.000177
4 9211944 751719 0.000164
3 9210429 751659 0.000156
5 9213287 751779 0.000146
7 9215924 751899 0.000109
9 9217533 752019 0.000093
8 9216676 751959 0.000082
pct_change can be performed on Series as well as DataFrames and accepts an integer argument for the number of periods you want to calculate the change over (the default is 1).
I've also assumed that you want to sort on the 'change' column with the greatest percentage changes showing first...
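As a small follow-up, if you only need the single row with the biggest jump, idxmax on the percent change avoids sorting the whole frame; a sketch on the same status frame:

# index label of the largest relative increase in 'questions'
biggest = status['questions'].pct_change().idxmax()
print(status.loc[biggest])
# seconds       751539
# questions    9207129
# Name: 1, dtype: int64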
