shift columns of a dataframe without looping? - python

consider this toy example. i need to shift each column down by its position in the array (0 rows for the first column, 1 for the second, and so on), so a kind of diagonal shift:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(1,10,(5,5)),columns=list("ABCDE"))
for i,k in enumerate(df):
    df[k] = df[k].shift(i)
transforms:
A B C D E
0 6 1 6 3 1
1 2 7 5 9 7
2 6 6 6 9 8
3 7 8 8 2 8
4 5 2 9 9 2
into
A B C D E
0 6 NaN NaN NaN NaN
1 2 1 NaN NaN NaN
2 6 7 6 NaN NaN
3 7 6 5 3 NaN
4 5 8 6 9 1
which is what i want.
however for larger dataframes with hierarchical indexes, this looping method does not seem feasible. in fact, i've got an ipython notebook that has been running for almost an hour now with no end in sight.
this makes me think that there must be an easier, perhaps vectorized way... perhaps using some kind of "apply", however i'm not sure how to do that when each column needs to be shifted down as a function of its position in the array.

Unless you really have a lot of data (dozens of gigabytes), shifting it does not take hours. It seems that the time is spent rebuilding the indices; especially with hierarchical indexing it is possible that the complex indices are rebuilt after each shift, and if your tables are large this may take a lot of time.
One possible approach (at least to isolate the problem) is to extract the data into a plain numpy array (take the .values), shift it, and recreate the DataFrame. In numpy, shifting the data is rather trivial, e.g.:
for c in range(1, a.shape[1]):
    a[c:, c] = a[:-c, c]    # move column c down by c rows (in place)
    a[:c, c] = np.nan       # pad the vacated top rows with NaN
Shifting a float array with 500 columns and a million rows (4 GB array) with this code took my computer approximately 12 seconds, but the total time will depend heavily on your indexing and the cost of recreating it.
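For illustration, here is a minimal sketch of that round trip (pull out .values, shift in numpy, rebuild the frame); the example frame and names are just placeholders, and this version writes into a separate output array so no rows are overwritten before they are copied:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 10, (5, 5)), columns=list("ABCDE"))

a = df.values.astype(float)      # float so the padding NaNs can be stored
out = np.full_like(a, np.nan)    # start from an all-NaN array
for c in range(a.shape[1]):
    out[c:, c] = a[:a.shape[0] - c, c]   # column c moves down by c rows

shifted = pd.DataFrame(out, index=df.index, columns=df.columns)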

Related

How do I transpose columns into rows of a Pandas DataFrame?

My current data frame consists of 10 rows and thousands of columns. The setup currently looks similar to this:
A B A B
1 2 3 4
5 6 7 8
But I desire something more like below, where essentially I would transpose the columns into rows once the headers start repeating themselves.
A B
1 2
5 6
3 4
7 8
I've been trying df.reshape but perhaps can't get the syntax right. Any suggestions on how best to transpose the data like this?
I'd probably go for stacking, grouping and then building a new DataFrame from scratch, e.g.:
pd.DataFrame({col: vals for col, vals in df.stack().groupby(level=1).agg(list).items()})
That'll also give you:
A B
0 1 2
1 3 4
2 5 6
3 7 8
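For reference, a self-contained version of the above (the frame below just reproduces the repeated "A B A B" headers from the question):
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=list("ABAB"))

# stack the repeated columns into rows, group by the column label,
# collect each label's values into a list, and build a fresh frame
out = pd.DataFrame({col: vals for col, vals in
                    df.stack().groupby(level=1).agg(list).items()})
print(out)
#    A  B
# 0  1  2
# 1  3  4
# 2  5  6
# 3  7  8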
Try with stack, groupby and pivot:
stacked = df.T.stack().to_frame().assign(idx=df.T.stack().groupby(level=0).cumcount()).reset_index()
output = stacked.pivot(index="idx", columns="level_0", values=0).rename_axis(None, axis=1).rename_axis(None, axis=0)
>>> output
A B
0 1 2
1 5 6
2 3 4
3 7 8

Replacing multiple string values in a column with numbers in pandas

I am currently working on a data frame in pandas named df. One column contains
multiple labels (more than 100, to be exact).
I know how to replace values when there is a smaller number of them.
For instance, in the typical Titanic example:
titanic.Sex.replace({'male': 0,'female': 1}, inplace=True)
Of course, doing so for 100+ values would be extremely time-consuming. I have seen similar questions, but all the answers involve typing out the values by hand. Is there a faster way to do this?
I think you're looking for factorize:
df = pd.DataFrame({'col': list('ABCDEBJZACA')})
df['factor'] = df['col'].factorize()[0]
output:
col factor
0 A 0
1 B 1
2 C 2
3 D 3
4 E 4
5 B 1
6 J 5
7 Z 6
8 A 0
9 C 2
10 A 0
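As a small follow-up, continuing the example above: factorize also returns the array of uniques, so the code-to-label mapping can be kept or inverted later (a quick sketch):
codes, uniques = df['col'].factorize()
mapping = {label: code for code, label in enumerate(uniques)}   # e.g. {'A': 0, 'B': 1, ...}
restored = pd.Series(uniques[codes])   # maps the integer codes back to the labels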

Baffled by dataframe groupby.diff()

I have just read this question:
In a Pandas dataframe, how can I extract the difference between the values on separate rows within the same column, conditional on a second column?
and I am completely baffled by the answer. How does this work???
I mean, when I groupby('user') shouldn't the result be, well, grouped by user?
Whatever the function I use (mean, sum etc) I would expect a result like this:
aa=pd.DataFrame([{'user':'F','time':0},
{'user':'T','time':0},
{'user':'T','time':0},
{'user':'T','time':1},
{'user':'B','time':1},
{'user':'K','time':2},
{'user':'J','time':2},
{'user':'T','time':3},
{'user':'J','time':4},
{'user':'B','time':4}])
aa2=aa.groupby('user')['time'].sum()
print(aa2)
user
B 5
F 0
J 6
K 2
T 4
Name: time, dtype: int64
How does diff() instead return a diff of each row with the previous, within each group?
aa['diff']=aa.groupby('user')['time'].diff()
print(aa)
time user diff
0 0 F NaN
1 0 T NaN
2 0 T 0.0
3 1 T 1.0
4 1 B NaN
5 2 K NaN
6 2 J NaN
7 3 T 2.0
8 4 J 2.0
9 4 B 3.0
And more importantly, how is the result not a unique list of 'user' values?
I found many answers that use groupby.diff() but none of them explain it in detail. It would be extremely useful to me, and hopefully to others, to understand the mechanics behind it. Thanks.
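One way to see the mechanics (a minimal sketch using the aa frame above): diff() is a transformation rather than an aggregation, so pandas computes the row-to-row difference inside each user's group and writes the results back at the original row positions, which is why the output has one value per input row instead of one per user. The loop below reproduces it by hand:
pieces = []
for user, grp in aa.groupby('user'):
    pieces.append(grp['time'].diff())    # differences only within this user's rows
manual = pd.concat(pieces).sort_index()  # realign to the original row order

print(manual.equals(aa.groupby('user')['time'].diff()))   # True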

How to get elements of one pandas dataframe based on the index from another in Python?

I have some data that looks largely like the following:
y=numpy.random.uniform(0,1,10)
yx_index=[2,5,6,8]
yx=numpy.random.normal(0,1,4)
sety=pandas.DataFrame(y,columns=['set_y'])
subset_yx=pandas.DataFrame(yx,columns=['subset'],index=yx_index)
output:
set_y=
set
0 0.548554
1 0.436084
2 0.192882
3 0.468712
4 0.290172
5 0.462640
6 0.072014
7 0.273997
8 0.242552
9 0.289873
set_x=
set
2 0.943326
5 0.462640
6 2.433632
8 0.060528
set_x is always a subset of set_y. My question is, what is the easiest way to get the elements of set_y whose indexes match those of set_x?
So in the above case the desired output would be:
set_z=
set
2 0.192882
5 0.462640
6 0.072014
8 0.242552
you can use one of the available indexers; .loc with the other frame's index does the job here (the older .ix indexer that used to be recommended for this has since been removed from pandas):
In [86]: set_y.loc[set_x.index]
Out[86]:
set
2 0.192882
5 0.462640
6 0.072014
8 0.242552
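A related sketch (names as in the snippet above), in case set_x could ever contain labels that are missing from set_y: .reindex returns NaN for those instead of raising a KeyError the way .loc does:
set_z = set_y.reindex(set_x.index)   # NaN where a label is absent from set_y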

Efficient way to call previous row in python

I want to substitute the previous row value whenever a 0 value is found in the column of the dataframe in python. I used the following code,
if not a[j]:
    a[j] = a[j-1]
and also
if a[j]==0:
    a[j]=a[j-1]
Update:
Complete code updated:
for i in pd.unique(r.a):
    sub=r[r.vehicle_id==i]
    sub=DataFrame(sub,columns= ['a','b','c','d','e'])
    sub=sub.drop_duplicates(["a","b","c","d"])
    sub['c']=pd.to_datetime(sub['c'],unit='s')
    for j in range(1, len(sub[1:])):
        if not sub.d[j]:
            sub.d[j] = sub.d[j-1]
        if not sub.e[j]:
            sub.e[j]=sub.e[j-1]
    sub=sub.drop_duplicates(["lash_angle","lash_check_count"])
This is the start of my code; the sub.d[j] line is the one that is getting delayed.
Both of these seem to work well with integer values. However, one of the columns contains decimal values, and when using the code for that column it takes a huge amount of time (nearly 15-20 seconds) for the statement to complete. I am looping through nearly 10000 ids, and wasting 15 seconds at this step makes my entire code inefficient. Is there a better way to do this for the float (decimal) values so that it would be much faster?
Thanks
Assuming that by "column of the dataframe" you mean you're actually talking about a column (Series) of a pandas DataFrame, then one trick is to replace the 0 by nan and then forward-fill. For example:
>>> df = pd.DataFrame(np.random.randint(0,4, 10**6))
>>> df.head(10)
0
0 0
1 3
2 3
3 0
4 1
5 2
6 3
7 2
8 0
9 3
>>> df[0] = df[0].replace(0, np.nan).ffill()
>>> df.head(10)
0
0 NaN
1 3
2 3
3 3
4 1
5 2
6 3
7 2
8 2
9 3
where you can decide for yourself how you want to handle the case of a 0 at the start, where you have no value to fill. This assumes that there aren't already NaN values you want to leave alone, but if there are, you can just use a mask with .loc to select only the ones you want to change.
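For that last case, one way to read the mask suggestion is the sketch below: compute the forward-filled values but only write them back where the original value was exactly 0, so pre-existing NaNs stay untouched (the small series is just an illustration):
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 0.0, 4.0, 0.0])

mask = s == 0                                      # only the genuine zeros
s.loc[mask] = s.replace(0, np.nan).ffill()[mask]   # fill just those positions
# s is now [1.0, NaN, 1.0, 4.0, 4.0] -- the original NaN is left alone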
