I have a table in a pandas DataFrame df:
id count
0 10 3
1 20 4
2 30 5
3 40 NaN
4 50 NaN
5 60 NaN
6 70 NaN
I also have another pandas Series s:
0 1000
1 2000
2 3000
3 4000
What I want to do is replace the NaN values in my df with the respective values from the Series s.
My final output should be:
id count
0 10 3
1 20 4
2 30 5
3 40 1000
4 50 2000
5 60 3000
6 70 4000
Any ideas how to achieve this?
Thanks in advance.
There is a problem if the length of the Series differs from the number of NaN values in the count column. In that case you need to reindex the Series to the number of NaNs first:
import numpy as np
import pandas as pd

s = pd.Series({0: 1000, 1: 2000, 2: 3000, 3: 4000, 5: 5000})
print (s)
0 1000
1 2000
2 3000
3 4000
5 5000
dtype: int64
df.loc[df['count'].isnull(), 'count'] = s.reindex(
    np.arange(df['count'].isnull().sum())).values
print (df)
id count
0 10 3.0
1 20 4.0
2 30 5.0
3 40 1000.0
4 50 2000.0
5 60 3000.0
6 70 4000.0
If s has exactly one value per NaN, it's as simple as this:
df.loc[df['count'].isnull(), 'count'] = s.values
In this case, I prefer iterrows for its readability.
counter = 0
for index, row in df.iterrows():
    if pd.isnull(row['count']):
        df.at[index, 'count'] = s[counter]
        counter += 1
I might add that this 'merging' of a DataFrame and a Series is a bit odd and prone to subtle errors. If you can somehow get the Series into the same shape as the DataFrame (i.e. give it matching index/column labels), you may be better served by a merge/join, as sketched below.
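A minimal sketch of that merge/join idea, assuming the Series values correspond, in order, to the NaN rows: give s the row labels of the NaN positions, join it in as its own column, and fill the gaps from it.
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [10, 20, 30, 40, 50, 60, 70],
                   'count': [3, 4, 5, np.nan, np.nan, np.nan, np.nan]})
s = pd.Series([1000, 2000, 3000, 4000])

# Re-label s with the index of the NaN rows, then let pandas align it against df.
s.index = df.index[df['count'].isnull()]
merged = df.join(s.rename('count_fill'))
merged['count'] = merged['count'].fillna(merged['count_fill'])
merged = merged.drop(columns='count_fill')
print(merged)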
You can re-index your Series with the positions of the NaN values in the DataFrame and then fillna() with it:
import numpy as np

s.index = np.where(df['count'].isnull())[0]
df['count'] = df['count'].fillna(s)
print(df)
id count
0 10 3.0
1 20 4.0
2 30 5.0
3 40 1000.0
4 50 2000.0
5 60 3000.0
6 70 4000.0
date data
0 2012/1/1 100
1 2012/1/2 109
2 2012/1/3 108
3 2012/1/4 120
4 2012/1/5 80
5 2012/1/6 130
6 2012/1/7 100
7 2012/1/8 140
Given the dataframe above, I want to get, for each row, the number of rows whose data value is within ±10 of that row's data value, and append that count to the row, such that:
date data Count
0 2012/1/1 100.0 4.0
1 2012/1/2 109.0 4.0
2 2012/1/3 108.0 4.0
3 2012/1/4 120.0 2.0
4 2012/1/5 80.0 1.0
5 2012/1/6 130.0 3.0
6 2012/1/7 100.0 4.0
7 2012/1/8 140.0 2.0
Since each row's own value is the reference for the comparison, I used iterrows, although I know this is not elegant:
result = pd.DataFrame(index=df.index)
for i, r in df.iterrows():
    low = r['data'] - 10
    high = r['data'] + 10
    df2 = df.loc[(df['data'] <= high) & (df['data'] >= low)]
    result.loc[i, 'date'] = r['date']
    result.loc[i, 'data'] = r['data']
    result.loc[i, 'count'] = df2.shape[0]
result
Is there a more pandas-style way to do this?
Thank you for any help!
Use numpy broadcasting to build the boolean mask, then count the True values per row with sum:
arr = df['data'].to_numpy()
df['count'] = ((arr[:, None] <= arr+10)&(arr[:, None] >= arr-10)).sum(axis=1)
print (df)
date data count
0 2012/1/1 100 4
1 2012/1/2 109 4
2 2012/1/3 108 4
3 2012/1/4 120 2
4 2012/1/5 80 1
5 2012/1/6 130 3
6 2012/1/7 100 4
7 2012/1/8 140 2
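An equivalent formulation of the same mask (a sketch, rebuilding the example frame): compare the absolute pairwise differences against the tolerance. Note that the mask is n x n, so memory grows quadratically with the number of rows.
import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['2012/1/1', '2012/1/2', '2012/1/3', '2012/1/4',
                            '2012/1/5', '2012/1/6', '2012/1/7', '2012/1/8'],
                   'data': [100, 109, 108, 120, 80, 130, 100, 140]})

arr = df['data'].to_numpy()
# |data_i - data_j| <= 10 is the same condition as the two-sided comparison above.
df['count'] = (np.abs(arr[:, None] - arr) <= 10).sum(axis=1)
print(df)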
Today I'm a bit stuck on a problem that I'm not able to resolve efficiently. I have a DataFrame like this:
id Date Days Value
1 20130101 95 100
1 20130102 100 100
.
1 20140101 120 90
.
1 20150101 150 90
.
1 20180101 190 85
2 20130101 98 80
.
2 20140101 70 80
.
2 20180101 150 80
So, it's monthly data, and I want to create a column named Value_t5 that takes the Value from five years into the future for a given row, but only if Days was over 90 at each of the intervening 12-month gaps. So, for the first row, I have to check 20140101, 20150101, 20160101, 20170101 and 20180101. Because Days is over 90 in all of those rows, Value_t5 takes the value 85 for the 20130101 row (and NaN for the rest, because I didn't add more data). Then, for id number 2, the 20130101 row would take a NaN value, because on 20140101 Days was only 70, which is below 90. So, the expected output would be:
id Date Days Value Value_t5
1 20130101 95 100 85
1 20130102 100 100 np.nan
.
1 20140101 120 90 np.nan
.
1 20150101 150 90 np.nan
.
1 20180101 190 85 np.nan
2 20130101 98 80 np.nan
.
2 20140101 70 80 np.nan
.
2 20180101 150 80 np.nan
I'm guessing some combination of groupby, .all() and pd.DateOffset() might be involved in the answer, but I'm not able to find it without having to merge five offset dataframes.
Also, I've got 17 million rows of data, so apply is probably not the best idea.
My best bet would be to create an n x 5 matrix with all yearly Days values for each row and then process that. Is there any straightforward way to do this?
If your data is monthly, you can simply do rolling:
import numpy as np
import pandas as pd

# toy data:
reps = 100000
dates = np.tile(pd.date_range('2005-01-01', '2020-12-01', freq='MS'), reps)
ids = np.repeat(np.arange(reps), len(dates) // reps)

np.random.seed(1)
df = pd.DataFrame({'id': ids,
                   'Date': dates,
                   'Days': np.random.randint(0, 20, len(dates)),
                   'Values': np.arange(len(dates))})
# threshold, put 90 here
thresh = 5
# rolling months
roll = 5
df['valid'] = df['Days'].ge(thresh).astype(int)
groups = df.groupby('id')
df['5m'] = groups['valid'].rolling(roll).sum().values
df['5m'] = groups['5m'].shift(-roll).values
df['value_t5'] = np.where(df['5m']==roll, groups['Values'].shift(-roll*12), np.nan)
Output (head):
id Date Days Values valid 5m value_t5
0 1 2013-01-01 5 0 1 5.0 60.0
1 1 2013-02-01 11 1 1 5.0 61.0
2 1 2013-03-01 12 2 1 5.0 62.0
3 1 2013-04-01 8 3 1 4.0 NaN
4 1 2013-05-01 9 4 1 4.0 NaN
Performance: On my computer, that took about 40 seconds (for 19MM rows).
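A sketch of the "n x 5 matrix" idea from the question, reusing the df, groups and thresh built above: pull Days exactly 12, 24, ..., 60 months ahead within the same id via shift, require all five to meet the threshold, and take Values from 60 months ahead when they do.
# Days at +12, +24, ..., +60 months for each row, within the same id.
yearly_days = pd.concat([groups['Days'].shift(-12 * k) for k in range(1, 6)],
                        axis=1)
# All five yearly checkpoints must meet the threshold (NaNs compare as False).
ok = yearly_days.ge(thresh).all(axis=1)
df['value_t5_yearly'] = np.where(ok, groups['Values'].shift(-60), np.nan)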
I want to change the first index level to integer type,
e.g. 0.0 -> 0, 1.0 -> 1, 2.0 -> 2, and so on.
However, I can't figure out how to address that first column. As you can see, the frame has a MultiIndex. Please help me.
I succeeded in accessing a single value with pandas indexing, but I don't know how to change the whole first level.
sum count
timestamp(hour) goods price price
0.0 1 1000 40
2 200 29
3 129 11
4 76 5
1.0 1 1000 40
2 200 29
3 129 11
4 76 5
.
.
.
In [61]: pivot1.index[0][0]
Out[61]: 0.0
You can use DataFrame.rename with level=0:
import pandas as pd

df = pd.DataFrame({
    'col': [4, 5, 4, 5, 5, 4],
    'timestamp(hour)': [7, 8.0, 8, 8.0, 8, 3],
    'goods': list('aaabbb')
}).set_index(['timestamp(hour)', 'goods'])
print (df)
col
timestamp(hour) goods
7.0 a 4
8.0 a 5
a 4
b 5
b 5
3.0 b 4
df = df.rename(int, level=0)
print (df)
col
timestamp(hour) goods
7 a 4
8 a 5
a 4
b 5
b 5
3 b 4
You could:
df.index = df.index.set_levels([df.index.levels[0].astype(int), df.index.levels[1]])
But the answer of jezrael is better I guess.
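A slightly shorter variant of the same idea (a sketch, using the df from the answer above): let set_levels target only level 0, so the other levels don't have to be passed through.
# Rebuild just the first level as integers; the goods level is left untouched.
df.index = df.index.set_levels(df.index.levels[0].astype(int), level=0)
print(df)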
I have a pandas DataFrame of the form:
df
ID_col time_in_hours data_col
1 62.5 4
1 40 3
1 20 3
2 30 1
2 20 5
3 50 6
What I want to be able to do is find the rate of change of data_col using the time_in_hours column. Specifically,
rate_of_change = (data_col[i+1] - data_col[i]) / abs(time_in_hours[i+1] - time_in_hours[i])
where i is a given row, and the rate_of_change is calculated separately for different IDs.
Effectively, I want a new DataFrame of the form:
new_df
ID_col time_in_hours data_col rate_of_change
1 62.5 4 NaN
1 40 3 -0.044
1 20 3 0
2 30 1 NaN
2 20 5 0.4
3 50 6 NaN
How do I go about this?
You can use groupby:
s = df.groupby('ID_col').apply(lambda dft: dft['data_col'].diff() / dft['time_in_hours'].diff().abs())
s.index = s.index.droplevel()
s
returns
0 NaN
1 -0.044444
2 0.000000
3 NaN
4 0.400000
5 NaN
dtype: float64
You can actually get around the groupby + apply given how your DataFrame is sorted. In this case, you can just check if the ID_col is the same as the shifted row.
So calculate the rate of change for everything, and then only assign the values back if they are within a group.
import numpy as np
mask = df.ID_col == df.ID_col.shift(1)
roc = (df.data_col - df.data_col.shift(1))/np.abs(df.time_in_hours - df.time_in_hours.shift(1))
df.loc[mask, 'rate_of_change'] = roc[mask]
Output:
ID_col time_in_hours data_col rate_of_change
0 1 62.5 4 NaN
1 1 40.0 3 -0.044444
2 1 20.0 3 0.000000
3 2 30.0 1 NaN
4 2 20.0 5 0.400000
5 3 50.0 6 NaN
You can use Series.diff inside a groupby:
df.groupby('ID_col').apply(
    lambda x: x['data_col'].diff() / x['time_in_hours'].diff().abs())
ID_col
1 0 NaN
1 -0.044444
2 0.000000
2 3 NaN
4 0.400000
3 5 NaN
dtype: float64
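A variant without apply (a sketch, rebuilding the example frame): compute the grouped diffs column by column and divide; the result keeps the original row index, so it can be assigned straight back as a rate_of_change column.
import pandas as pd

df = pd.DataFrame({'ID_col': [1, 1, 1, 2, 2, 3],
                   'time_in_hours': [62.5, 40.0, 20.0, 30.0, 20.0, 50.0],
                   'data_col': [4, 3, 3, 1, 5, 6]})

g = df.groupby('ID_col')
# diff() is NaN at the first row of each group, giving the NaN boundaries shown above.
df['rate_of_change'] = g['data_col'].diff() / g['time_in_hours'].diff().abs()
print(df)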
I am trying to do something like this, where difference is the column I want to compute:
import numpy as np
import pandas as pd

ff = pd.DataFrame({'uid': [1, 1, 1, 20, 20, 20, 4, 4, 4],
                   'date': ['09/06', '10/06', '11/06',
                            '09/06', '10/06', '11/06',
                            '09/06', '10/06', '11/06'],
                   'balance': [150, 200, 230, 12, 15, 15, 700, 1000, 1500],
                   'difference': [np.NaN, 50, 30, np.NaN, 3, 0, np.NaN, 300, 500]})
I have tried rolling, but I cannot find a rolling method that subtracts; there are only sum, var and other statistics.
Is there a way?
I was thinking that I could create two DataFrames: one with the first row of every uid removed, and the other with the last row of every uid removed. But to be honest, I have no idea how to do that dynamically for every uid.
Use groupby with diff:
import pandas as pd

df = pd.DataFrame({'uid': [1, 1, 1, 20, 20, 20, 4, 4, 4],
                   'date': ['09/06', '10/06', '11/06',
                            '09/06', '10/06', '11/06',
                            '09/06', '10/06', '11/06'],
                   'balance': [150, 200, 230, 12, 15, 15, 700, 1000, 1500]})
df['difference'] = df.groupby('uid')['balance'].diff()
Output:
uid date balance difference
0 1 09/06 150 NaN
1 1 10/06 200 50.0
2 1 11/06 230 30.0
3 20 09/06 12 NaN
4 20 10/06 15 3.0
5 20 11/06 15 0.0
6 4 09/06 700 NaN
7 4 10/06 1000 300.0
8 4 11/06 1500 500.0
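If the rows are not already in date order within each uid, sort first so diff() subtracts consecutive balances in time order (a small follow-up using the same df and column names as above; note that sorting also reorders the rows shown in the output).
df = df.sort_values(['uid', 'date'])
df['difference'] = df.groupby('uid')['balance'].diff()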