I'm trying to rename a single row of a pandas dataframe by its index tuple.
For example:
import pandas as pd
df = pd.DataFrame(data={'i1': [0, 0, 0, 0, 1, 1, 1, 1],
                        'i2': [0, 1, 2, 3, 0, 1, 2, 3],
                        'x': [1., 2., 3., 4., 5., 6., 7., 8.],
                        'y': [9, 10, 11, 12, 13, 14, 15, 16]})
df.set_index(['i1','i2'], inplace=True)
Creates df:
x y
i1 i2
0 0 1.0 9
1 2.0 10
2 3.0 11
3 4.0 12
1 0 5.0 13
1 6.0 14
2 7.0 15
3 8.0 16
I'd like to be able to use something like: df.rename(index={(0,1):(0,9)},inplace=True) to get:
x y
i1 i2
0 0 1.0 9
9 2.0 10 <-- new key
2 3.0 11
3 4.0 12
1 0 5.0 13
1 6.0 14
2 7.0 15
3 8.0 16
The command executes without raising an error but leaves df unchanged.
This also leaves df unchanged: df.rename(index={pd.IndexSlice[0,1]: pd.IndexSlice[0,9]}, inplace=True)
This will have close to the desired effect:
df.loc[(0,9),:] = df.loc[(0,1),:]
df.drop(index=(0,1),inplace=True)
but if row ordering matters, it'll be a pain to get it into the right order, and possibly quite slow if the df gets big.
I'm using Pandas 1.0.1, python 3.7. Any suggestions? Thank you in advance.
Possible solution with list comprehension and MultiIndex.from_tuples:
L = [(0,9) if x == (0,1) else x for x in df.index]
df.index = pd.MultiIndex.from_tuples(L, names=df.index.names)
print (df)
x y
i1 i2
0 0 1.0 9
9 2.0 10
2 3.0 11
3 4.0 12
1 0 5.0 13
1 6.0 14
2 7.0 15
3 8.0 16
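The same idea can be wrapped in a small helper if several index tuples need renaming at once. A minimal sketch (the name rename_index_tuples is just illustrative, and pandas is assumed to be imported as pd):
def rename_index_tuples(df, mapping):
    # Replace every index tuple found in `mapping` with its new tuple;
    # all other rows keep their existing keys.
    new_tuples = [mapping.get(t, t) for t in df.index]
    df.index = pd.MultiIndex.from_tuples(new_tuples, names=df.index.names)
    return df

df = rename_index_tuples(df, {(0, 1): (0, 9)})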
Related
I have a dataframe like this:
import pandas as pd
import numpy as np
data={'trip':[1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3],
'timestamps':[1235471761, 1235471763, 1235471765, 1235471767, 1235471770, 1235471772, 1235471776, 1235471779, 1235471780, 1235471789,1235471792,1235471793,1235471829,1235471833,1235471835,1235471838,1235471844,1235471847,1235471848,1235471852,1235471855,1235471859,1235471900,1235471904,1235471911,1235471913]}
df = pd.DataFrame(data)
df['TimeDistance'] = df.groupby('trip')['timestamps'].diff(1)
df
What I am looking for is to start from the first row (consider it the origin) of the "TimeDistance" column and take a cumulative sum over its values; whenever this sum reaches 10, restart the cumsum and continue until the end of the trip (as you can see, this dataframe has 3 trips in the "trip" column).
I want all the cumulative sums in a new column, let's say a "cumu" column.
Another important point is that after the threshold is reached, the next row in the "cumu" column must be zero, and the summation must restart from that new origin.
I hope I've understood your question right. You can use a generator with .send():
def my_accumulate(maxval):
    val = 0
    yield
    while True:
        if val < maxval:
            val += yield val
        else:
            yield val
            val = 0

def fn(x):
    a = my_accumulate(10)
    next(a)
    x["cumu"] = [a.send(v) for v in x["TimeDistance"]]
    return x

df = df.groupby("trip").apply(fn)
print(df)
Prints:
trip timestamps TimeDistance cumu
0 1 1235471761 NaN 0.0
1 1 1235471763 2.0 2.0
2 1 1235471765 2.0 4.0
3 1 1235471767 2.0 6.0
4 1 1235471770 3.0 9.0
5 1 1235471772 2.0 11.0
6 1 1235471776 4.0 0.0
7 1 1235471779 3.0 3.0
8 1 1235471780 1.0 4.0
9 1 1235471789 9.0 13.0
10 1 1235471792 3.0 0.0
11 1 1235471793 1.0 1.0
12 2 1235471829 NaN 0.0
13 2 1235471833 4.0 4.0
14 2 1235471835 2.0 6.0
15 2 1235471838 3.0 9.0
16 2 1235471844 6.0 15.0
17 2 1235471847 3.0 0.0
18 2 1235471848 1.0 1.0
19 2 1235471852 4.0 5.0
20 2 1235471855 3.0 8.0
21 2 1235471859 4.0 12.0
22 3 1235471900 NaN 0.0
23 3 1235471904 4.0 4.0
24 3 1235471911 7.0 11.0
25 3 1235471913 2.0 0.0
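To see what the coroutine does in isolation, here is a minimal demo that feeds it the TimeDistance values of trip 3 from the dataframe above: the first value (NaN) only primes the running total to 0, and the value sent right after the threshold is exceeded is discarded while the total resets to 0.
acc = my_accumulate(10)
next(acc)  # advance the generator to its first bare yield
print([acc.send(v) for v in [float("nan"), 4.0, 7.0, 2.0]])
# [0, 4.0, 11.0, 0]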
Another solution:
df = df.groupby("trip").apply(
    lambda x: x.assign(
        cumu=(
            val := 0,
            *(
                val := val + v if val < 10 else (val := 0)
                for v in x["TimeDistance"][1:]
            ),
        )
    ),
)
print(df)
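Note that the assignment-expression (:=) version requires Python 3.8 or newer.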
Andrej's answer is better, as mine is probably not as efficient, and it depends on the df being ordered by trip and on TimeDistance being NaN in the first row of each trip.
cumulative_sum = 0
df['cumu'] = 0
for i in range(len(df)):
    if np.isnan(df.loc[i, 'TimeDistance']) or cumulative_sum >= 10:
        cumulative_sum = 0
        df.loc[i, 'cumu'] = 0
    else:
        cumulative_sum += df.loc[i, 'TimeDistance']
        df.loc[i, 'cumu'] = cumulative_sum
print(df) outputs:
trip timestamps TimeDistance cumu
0 1 1235471761 NaN 0
1 1 1235471763 2.0 2
2 1 1235471765 2.0 4
3 1 1235471767 2.0 6
4 1 1235471770 3.0 9
5 1 1235471772 2.0 11
6 1 1235471776 4.0 0
7 1 1235471779 3.0 3
8 1 1235471780 1.0 4
9 1 1235471789 9.0 13
10 1 1235471792 3.0 0
11 1 1235471793 1.0 1
12 2 1235471829 NaN 0
13 2 1235471833 4.0 4
14 2 1235471835 2.0 6
15 2 1235471838 3.0 9
16 2 1235471844 6.0 15
17 2 1235471847 3.0 0
18 2 1235471848 1.0 1
19 2 1235471852 4.0 5
20 2 1235471855 3.0 8
21 2 1235471859 4.0 12
22 3 1235471900 NaN 0
23 3 1235471904 4.0 4
24 3 1235471911 7.0 11
25 3 1235471913 2.0 0
I have this:
df['new'] = df[['col1', 'col2']].pct_change(axis=1)
I want the percent change from col1 to col2 for each row. However I am getting the error:
ValueError: Wrong number of items passed 2, placement implies 1
What am I doing wrong?
The percent change function is returning a pandas DataFrame object with two columns, while the assignment to df['new'] expects a single column. This is why you see the ValueError: 1 item is expected but 2 were passed.
import numpy as np
import pandas as pd

x = np.arange(1, 11)
y = x * 3
df = pd.DataFrame()
df['col1'] = x
df['col2'] = y
df
col1 col2
0 1 3
1 2 6
2 3 9
3 4 12
4 5 15
5 6 18
6 7 21
7 8 24
8 9 27
9 10 30
df.pct_change(axis=1)
col1 col2
0 NaN 2.0
1 NaN 2.0
2 NaN 2.0
3 NaN 2.0
4 NaN 2.0
5 NaN 2.0
6 NaN 2.0
7 NaN 2.0
8 NaN 2.0
9 NaN 2.0
The percent change across rows that you want is stored in the last column ('col2' in this case) so just choose that last column to populate the 'new' column. In this case we compute a 200% change for every row.
df['new'] = df.pct_change(axis=1)['col2']
col1 col2 new
0 1 3 2.0
1 2 6 2.0
2 3 9 2.0
3 4 12 2.0
4 5 15 2.0
5 6 18 2.0
6 7 21 2.0
7 8 24 2.0
8 9 27 2.0
9 10 30 2.0
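If only these two columns are involved, the same values can be computed directly without pct_change; a sketch using the example columns above:
df['new'] = df['col2'] / df['col1'] - 1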
I am trying to clean up a dataset. Only values smaller than the last value should be kept.
Right now it looks like this:
my_data
0 10
1 8
2 7
3 10
4 5
5 8
6 2
after the cleanup it should look like this:
my_data
0 10
1 8
2 7
3 7
4 5
5 5
6 2
I also have some working code but I am looking for a faster and more pythonic way of doing it.
import pandas as pd

df_results = pd.DataFrame()
df_results['my_data'] = [10, 8, 7, 10, 5, 8, 2]
data_idx = list(df_results['my_data'].index)
for i in range(1, len(df_results['my_data'])):
    current_value = df_results['my_data'][data_idx[i]]
    last_value = df_results['my_data'][data_idx[i - 1]]
    df_results['my_data'][data_idx[i]] = current_value if current_value < last_value else last_value
You can use:
In [53]: df[df.my_data.diff() > 0] = np.nan
In [54]: df
Out[54]:
my_data
0 10.0
1 8.0
2 7.0
3 NaN
4 5.0
5 NaN
6 2.0
In [55]: df.ffill()
Out[55]:
my_data
0 10.0
1 8.0
2 7.0
3 7.0
4 5.0
5 5.0
6 2.0
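The same two steps can be combined into a single expression with mask, which replaces the flagged values with NaN before forward-filling; a sketch equivalent to the diff/ffill approach above:
df['my_data'] = df['my_data'].mask(df['my_data'].diff() > 0).ffill()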
I am using shift with diff: s flags the rows where the value increased, and s.shift(-1) flags the row just before each of those, so every flagged value is overwritten with the value of the row preceding it.
s = df.my_data.diff().gt(0)
df.loc[s, 'my_data'] = df.loc[s.shift(-1).fillna(False), 'my_data'].values
Out[71]:
my_data
0 10.0
1 8.0
2 7.0
3 7.0
4 5.0
5 5.0
6 2.0
I have searched the forums for a cleaner way to create a new column in a dataframe that is the sum of a row with the previous row (the opposite of the .diff() function, which takes the difference).
This is how I'm currently solving the problem:
df = pd.DataFrame({'c': ['dd', 'ee', 'ff', 'gg', 'hh'], 'd': [1, 2, 3, 4, 5]})
df['e']= df['d'].shift(-1)
df['f'] = df['d'] + df['e']
Your ideas are appreciated.
You can use rolling with a window size of 2 and sum:
df['f'] = df['d'].rolling(2).sum().shift(-1)
c d f
0 dd 1 3.0
1 ee 2 5.0
2 ff 3 7.0
3 gg 4 9.0
4 hh 5 NaN
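The shift(-1) is needed because rolling labels each window sum with the last row of the window, so shifting moves each pairwise sum back up to its first row. A direct equivalent, close to the original shift-based approach but without the helper column:
df['f'] = df['d'] + df['d'].shift(-1)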
df.cumsum()
Example:
data = {'a':[1,6,3,9,5], 'b':[13,1,2,5,23]}
df = pd.DataFrame(data)
df =
a b
0 1 13
1 6 1
2 3 2
3 9 5
4 5 23
df.diff()
a b
0 NaN NaN
1 5.0 -12.0
2 -3.0 1.0
3 6.0 3.0
4 -4.0 18.0
df.cumsum()
a b
0 1 13
1 7 14
2 10 16
3 19 21
4 24 44
If you cannot use rolling, due to a MultiIndex or otherwise, you can try using .cumsum() and then .diff(2), which subtracts the .cumsum() value from two positions earlier.
data = {'a':[1,6,3,9,5,30, 101, 8]}
df = pd.DataFrame(data)
df['opp_diff'] = df['a'].cumsum().diff(2)
a opp_diff
0 1 NaN
1 6 NaN
2 3 9.0
3 9 12.0
4 5 14.0
5 30 35.0
6 101 131.0
7 8 109.0
Generally, to get an inverse of .diff(n) you should be able to do .cumsum().diff(n+1). The issue is that the first n+1 results will be NaNs.
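As a quick check of that identity, using the same data as above, the cumsum-based version matches a direct shift-based sum apart from one extra leading NaN:
import pandas as pd

s = pd.Series([1, 6, 3, 9, 5, 30, 101, 8])
opp_diff = s.cumsum().diff(2)  # NaN, NaN, 9, 12, 14, 35, 131, 109
direct = s + s.shift(1)        # NaN,   7, 9, 12, 14, 35, 131, 109
print(pd.concat({'opp_diff': opp_diff, 'direct': direct}, axis=1))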
I have a text file of the form :
data.txt
2
8
4

3
1
9

6
5
7
How can I read it into a pandas dataframe like this:
0 1 2
0 2 8 4
1 3 1 9
2 6 5 7
Try this:
with open(filename, 'r') as f:
    data = f.read().replace('\n', ',').replace(',,', '\n')
In [7]: pd.read_csv(pd.compat.StringIO(data), header=None)
Out[7]:
0 1 2
0 2 8 4
1 3 1 9
2 6 5 7
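Note that pd.compat.StringIO may not be available on newer pandas versions; the standard library's io.StringIO does the same job here:
import io
import pandas as pd

with open('data.txt', 'r') as f:
    data = f.read().replace('\n', ',').replace(',,', '\n')

pd.read_csv(io.StringIO(data), header=None)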
Option 1
Much easier, if you know there are always N elements in a group - just load your data and reshape -
pd.DataFrame(np.loadtxt('data.txt').reshape(3, -1))
0 1 2
0 2.0 8.0 4.0
1 3.0 1.0 9.0
2 6.0 5.0 7.0
To load integers, pass dtype to loadtxt -
pd.DataFrame(np.loadtxt('data.txt', dtype=int).reshape(3, -1))
0 1 2
0 2 8 4
1 3 1 9
2 6 5 7
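If the group size (three values per row) is known but the number of groups is not, the row count can be left for numpy to infer instead; a small variation on the same idea:
import numpy as np
import pandas as pd

pd.DataFrame(np.loadtxt('data.txt', dtype=int).reshape(-1, 3))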
Option 2
This is more general, will work when you cannot guarantee that there are always 3 numbers at a time. The idea here is to read in blank lines as NaN, and separate your data based on the presence of NaNs.
df = pd.read_csv('data.txt', header=None, skip_blank_lines=False)
df
0
0 2.0
1 8.0
2 4.0
3 NaN
4 3.0
5 1.0
6 9.0
7 NaN
8 6.0
9 5.0
10 7.0
df_list = []
for _, g in df.groupby(df.isnull().cumsum().values.ravel()):
    df_list.append(g.dropna().reset_index(drop=True))
df = pd.concat(df_list, axis=1, ignore_index=True)
df
0 1 2
0 2.0 8.0 4.0
1 3.0 1.0 9.0
2 6.0 5.0 7.0
Caveat - if your data also has NaNs, this will not separate properly.
Although this is definitely not the best way to handle it, we can do some processing ourselves. In case the values are integers, the following should work:
import pandas as pd
with open('data.txt') as f:
    data = [list(map(int, row.split())) for row in f.read().split('\n\n')]

dataframe = pd.DataFrame(data)
which produces:
>>> dataframe
0 1 2
0 2 8 4
1 3 1 9
2 6 5 7