I have the following two dataframes:
test=pd.DataFrame({"x":[1,2,3,4,5],"y":[6,7,8,9,0]})
test2=pd.DataFrame({"z":[1],"p":[6]})
which result respectively in:
x y
0 1 6
1 2 7
2 3 8
3 4 9
4 5 0
and
z p
0 1 6
What is the best way to create a column "s" in table test that is equal to:
test["s"]=test["x"]*test2["z"]+test2["p"]
When I try the above expression, I get the following output:
x y s
0 1 6 7.0
1 2 7 NaN
2 3 8 NaN
3 4 9 NaN
4 5 0 NaN
but I want the result in every row. I have read about the apply method and so-called vectorized operations, but I don't know how to apply them to this problem.
Expected output:
x y s
0 1 6 7.0
1 2 7 8.0
2 3 8 9.0
3 4 9 10.0
4 5 0 11.0
Thanks in advance
Here is my solution, based on Trenton_M's suggestion.
test=pd.DataFrame({"x":[1,2,3,4,5],"y":[6,7,8,9,0]})
test2=pd.DataFrame({"z":[1],"p":[6]})
Multiplication process:
test["s"] = test['x'] * test2.z.loc[0] + test2.p.loc[0]
test
Output:
x y s
0 1 6 7
1 2 7 8
2 3 8 9
3 4 9 10
4 5 0 11
Use scalar multiplication, like this:
test['s'] = test.x * test2.z[0] + test2.p[0]
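The NaN values in the question come from index alignment: pandas matches Series on their index labels before doing arithmetic, and test2 only has the label 0. Below is a minimal sketch of that behaviour and of the scalar fix, assuming test2 always holds exactly one row. The use of `.iloc[0]` (a positional lookup) and the names `aligned`, `z` and `p` are illustrative choices, not taken from the answers above.

```python
import pandas as pd

test = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [6, 7, 8, 9, 0]})
test2 = pd.DataFrame({"z": [1], "p": [6]})

# Series arithmetic aligns on index labels: test2 only has label 0,
# so labels 1..4 have no partner and the result is NaN there.
aligned = test["x"] * test2["z"] + test2["p"]
print(aligned)

# Extract the single values as scalars; scalars broadcast to every row.
z = test2["z"].iloc[0]
p = test2["p"].iloc[0]
test["s"] = test["x"] * z + p
print(test)
```

Because `z` and `p` are plain scalars, no alignment takes place and the column keeps its integer dtype.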
I am trying to shift data in a Pandas dataframe in the following manner, from this:
time  value
1     1
2     2
3     3
4     4
5     5
1     6
2     7
3     8
4     9
5     10
To this:
time  value
1
2
3     1
4     2
5     3
1
2
3     6
4     7
5     8
In short, I want to shift the values down so that they start at the third row each time a new cycle of time values begins (a shift of two rows, as in the example above).
I have not been able to find a solution for this, as my English is limited and I don't know how to describe the problem without an example.
Edit:
Both solutions work. Thank you.
IIUC, you can shift per group:
df['value_shift'] = df.groupby(df['time'].eq(1).cumsum())['value'].shift(2)
output:
time value value_shift
0 1 1 NaN
1 2 2 NaN
2 3 3 1.0
3 4 4 2.0
4 5 5 3.0
5 1 6 NaN
6 2 7 NaN
7 3 8 6.0
8 4 9 7.0
9 5 10 8.0
Try with groupby:
df["value"] = df.groupby(df["time"].diff().lt(0).cumsum())["value"].shift(2)
>>> df
time value
0 1 NaN
1 2 NaN
2 3 1.0
3 4 2.0
4 5 3.0
5 1 NaN
6 2 NaN
7 3 6.0
8 4 7.0
9 5 8.0
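The two answers differ only in how they label the cycles. Here is a small sketch that rebuilds the example frame from the question and checks that both grouping keys give the same shifted column. The variable names `by_start`, `by_drop`, `shifted_a` and `shifted_b` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"time": [1, 2, 3, 4, 5] * 2, "value": range(1, 11)})

# Key 1: a new cycle starts wherever time equals 1.
by_start = df["time"].eq(1).cumsum()
# Key 2: a new cycle starts wherever time is lower than in the previous row.
by_drop = df["time"].diff().lt(0).cumsum()

# shift(2) is applied inside each cycle, so values never leak across cycles.
shifted_a = df.groupby(by_start)["value"].shift(2)
shifted_b = df.groupby(by_drop)["value"].shift(2)

print(shifted_a.equals(shifted_b))
```

The first key assumes every cycle begins with time equal to 1. The second only assumes that time restarts at a lower value, so it may be the safer choice if cycles can begin at other values.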
I have a single column in a DataFrame containing only numbers, and I need to create a boolean column to indicate if the value will have an n-fold increase in x periods ahead.
I developed a solution using two for loops, but it doesn't seem pythonic enough for me.
Is there a better, more efficient way of doing it? Maybe something with map() or apply()?
Find below my code with an MRE.
import pandas as pd

df = pd.DataFrame([1,2,2,1,3,2,1,3,4,1,2,3,4,4,5,1], columns=['column'])
df['double_in_5_periods_ahead'] = 'n/a'
periods_ahead = 5
for i in range(0, len(df) - periods_ahead):
    for j in range(1, periods_ahead):
        if df['column'].iloc[i+j] / df['column'].iloc[i] >= 2:
            # .loc writes directly to df; chained df[col].iloc[i] = ... may not
            df.loc[i, 'double_in_5_periods_ahead'] = 1
            break
        else:
            df.loc[i, 'double_in_5_periods_ahead'] = 0
This is the output:
column double_in_5_periods_ahead
0 1 1
1 2 0
2 2 0
3 1 1
4 3 0
5 2 1
6 1 1
7 3 0
8 4 0
9 1 1
10 2 1
11 3 n/a
12 4 n/a
13 4 n/a
14 5 n/a
15 1 n/a
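Let us try rolling over the reversed series:
import numpy as np
n = 5
# Reversing makes each window cover the current row and the n-1 rows after it.
# ge(2) matches the >= 2 test in the question; float dtype leaves room for NaN.
df['new'] = (df['column'].iloc[::-1].rolling(n).max() / df['column']).ge(2).astype(float)
df.loc[df.index[-n:], 'new'] = np.nan
df
Out[146]:
column new
0 1 1.0
1 2 0.0
2 2 0.0
3 1 1.0
4 3 0.0
5 2 1.0
6 1 1.0
7 3 0.0
8 4 0.0
9 1 1.0
10 2 1.0
11 3 NaN
12 4 NaN
13 4 NaN
14 5 NaN
15 1 NaN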
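As a cross-check against the loop in the question, here is a sketch that looks ahead with explicit shifts instead of a reversed rolling window. The helper name `n_fold_ahead` and its parameters are hypothetical. It assumes the same window as the loop's `range(1, periods_ahead)`, that is rows 1 to periods-1 ahead, and the same `>=` comparison.

```python
import pandas as pd

def n_fold_ahead(s, factor=2, periods=5):
    """Flag rows whose value grows by `factor` within the next periods-1 rows."""
    # Largest of the values 1 .. periods-1 rows ahead of each row.
    ahead = pd.concat([s.shift(-j) for j in range(1, periods)], axis=1).max(axis=1)
    flag = (ahead / s).ge(factor).astype(float)
    # The last `periods` rows are left undefined, as in the loop.
    flag.iloc[-periods:] = float("nan")
    return flag

df = pd.DataFrame([1, 2, 2, 1, 3, 2, 1, 3, 4, 1, 2, 3, 4, 4, 5, 1], columns=["column"])
df["double_in_5_periods_ahead"] = n_fold_ahead(df["column"])
print(df)
```

Building one shifted copy per look-ahead step is fine for small windows. For long windows a rolling maximum would likely be cheaper.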
The image shows the test dataset I am using to verify that the right averages are being calculated.
I want to get the average of the values in the 'G' column that correspond to the filtered values in the 'T' column.
So I set the range of 'T' values for which I want to sum the values in the 'G' column, then divide the total by the count to get an average, which is appended to a list.
However, the average is not calculated correctly; see below.
(screenshot)
total=0
g_avg=[]
output=[]
counter=0
for i, row in df_new.iterrows():
    if (row['T'] > 2):
        counter+=1
        total+=row['G']
    if (counter != 0 and row['T']==10):
        g_avg.append(total/counter)
        counter = 0
        total = 0
print(g_avg)
Below is a better set of data: the 'T' values repeat, so I would need a counter in order to get my average of the G values whenever the T value is in a certain range, e.g. from 2 am to 10 am.
Sorry, it won't allow me to paste the dataset, so I've taken a screenshot of it.
If you want the average of column G values when T is between 2 and 7:
df_new.loc[(df_new['T']>2) & (df_new['T']<7), 'G'].mean()
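A runnable sketch of that one-liner follows. The data is hypothetical, since the original dataset was only shared as a screenshot. It also shows the same filter written with `Series.between`, assuming pandas 1.3 or later, where `inclusive` accepts the string "neither".

```python
import pandas as pd

# Hypothetical data standing in for the screenshot.
df_new = pd.DataFrame({"T": [0, 1, 2, 3, 4, 5, 6, 7, 8],
                       "G": [0, 0, 1, 3, 0, 4, 5, 0, 6]})

# Boolean mask: True where 2 < T < 7.
mask = (df_new["T"] > 2) & (df_new["T"] < 7)
avg = df_new.loc[mask, "G"].mean()

# Same filter with Series.between, exclusive on both ends.
avg_between = df_new.loc[df_new["T"].between(2, 7, inclusive="neither"), "G"].mean()

print(avg, avg_between)
```

Both expressions select the rows with T equal to 3, 4, 5 and 6, so no explicit loop or counter is needed.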
Update
It's difficult to know exactly what you want without any expected output. If you have some data that looks like this:
print(df)
T G
0 0 0
1 0 0
2 1 0
3 2 1
4 3 3
5 4 0
6 5 4
7 6 5
8 7 0
9 8 6
10 9 7
And you want something like this:
print(df)
T G
0 0 0
1 0 0
2 1 0
3 2 1
4 3 3
5 4 3
6 5 3
7 6 3
8 7 0
9 8 6
10 9 7
Then you could use boolean indexing and DataFrame.loc:
avg = df.loc[(df['T']>2) & (df['T']<7), 'G'].mean()
df.loc[(df['T']>2) & (df['T']<7), 'G'] = avg
print(df)
T G
0 0 0.0
1 0 0.0
2 1 0.0
3 2 1.0
4 3 3.0
5 4 3.0
6 5 3.0
7 6 3.0
8 7 0.0
9 8 6.0
10 9 7.0
Update 2
If you have some sample data:
print(df)
T G
0 0 1
1 2 2
2 3 3
3 3 1
4 3 2
5 10 4
6 2 5
7 2 5
8 2 5
9 10 5
Method 1: To simply get a list of those means, you could create groups for your interval and filter on m:
m = df['T'].between(0,5,inclusive='neither')
g = m.ne(m.shift()).cumsum()[m]
lst = df.groupby(g).mean()['G'].tolist()
print(lst)
[2.0, 5.0]
Method 2: If you want to include these means at their respective T values, then you could do this instead:
m = df['T'].between(0,5,inclusive='neither')
g = m.ne(m.shift()).cumsum()
df['G_new'] = df.groupby(g)['G'].transform('mean')
print(df)
T G G_new
0 0 1 1.0
1 2 2 2.0
2 3 3 2.0
3 3 1 2.0
4 3 2 2.0
5 10 4 4.0
6 2 5 5.0
7 2 5 5.0
8 2 5 5.0
9 10 5 5.0
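The key step in both methods is `m.ne(m.shift()).cumsum()`, which gives every run of consecutive equal values in `m` its own label. The sketch below prints the intermediate columns for the sample data above. The mask is written with explicit comparisons so that it does not depend on the pandas version, and the names `run_id`, `steps` and `means` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"T": [0, 2, 3, 3, 3, 10, 2, 2, 2, 10],
                   "G": [1, 2, 3, 1, 2, 4, 5, 5, 5, 5]})

m = (df["T"] > 0) & (df["T"] < 5)     # True inside the open interval (0, 5)
run_id = m.ne(m.shift()).cumsum()     # increases by 1 whenever m flips

steps = pd.DataFrame({"T": df["T"], "m": m, "run_id": run_id})
print(steps)

# Mean of G per run, keeping only the runs that lie inside the interval.
means = df.loc[m, "G"].groupby(run_id[m]).mean()
print(means.tolist())
```

Each block of rows inside the interval ends up with a distinct `run_id`, which is what lets `groupby` average the blocks separately even though the T values repeat.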
I have a dataframe
Id Seqno. Event
1 2 A
1 3 B
1 5 A
1 6 A
1 7 D
2 0 E
2 1 A
2 2 B
2 4 A
2 6 B
For each Id, I want to get all the events from the second most recent occurrence of event A onward, that is, from the row where the count of the most recent A's reaches 2. Seqno. is a sequence number within each Id.
The output will be
Id Seqno. Event
1 5 A
1 6 A
1 7 D
2 1 A
2 2 B
2 4 A
2 6 B
So far I tried:
y=x.groupby('Id').apply(lambda x: x.eventtype.eq('A').cumsum().tail(2)).reset_index()
p=y.groupby('Id').apply(lambda x: x.iloc[0]).reset_index(drop=True)
q= x.reset_index()
s= pd.merge(q,p,on='Id')
dd= s[s['index']>=s['level_1']]
I was wondering if there is a better way of doing it.
Use groupby with cumsum, subtract it from the count of A's per group, and filter:
g = df['Event'].eq('A').groupby(df['Id'])
df[(g.transform('sum') - g.cumsum()).le(1)]
Id Seqno. Event
2 1 5 A
3 1 6 A
4 1 7 D
6 2 1 A
7 2 2 B
8 2 4 A
9 2 6 B
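To see why the subtraction works: the per-group total minus the running count gives, for each row, the number of A's that occur strictly after it. Rows where that number is at most 1 lie at or after the second most recent A. Below is a sketch that wraps this in a hypothetical helper, `since_nth_last`, with the count `n` as a parameter. The cast to int is an extra precaution of mine, not part of the answer above.

```python
import pandas as pd

def since_nth_last(df, n=2, event="A"):
    """Rows from the n-th most recent `event` onward, per Id (illustrative helper)."""
    is_event = df["Event"].eq(event).astype(int)
    g = is_event.groupby(df["Id"])
    # Number of matching events that occur strictly after each row.
    remaining = g.transform("sum") - g.cumsum()
    return df[remaining.le(n - 1)]

df = pd.DataFrame({
    "Id": [1] * 5 + [2] * 5,
    "Seqno.": [2, 3, 5, 6, 7, 0, 1, 2, 4, 6],
    "Event": list("ABAADEABAB"),
})

print(since_nth_last(df, n=2))
```

With `n=1` the same helper returns the rows from the most recent A onward.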
Thanks to cold, ALollz and Vaishali for the explanation in the comments. Using groupby with cumcount, we count the A's in reverse order within each Id, then use reindex and ffill to carry that count over to the other rows:
s=df.loc[df.Event=='A'].groupby('Id').cumcount(ascending=False).add(1).reindex(df.index)
s.groupby(df['Id']).ffill()
Out[57]:
0 3.0
1 3.0
2 2.0
3 1.0
4 1.0
5 NaN
6 2.0
7 2.0
8 1.0
9 1.0
dtype: float64
yourdf=df[s.groupby(df['Id']).ffill()<=2]
yourdf
Out[58]:
Id Seqno. Event
2 1 5 A
3 1 6 A
4 1 7 D
6 2 1 A
7 2 2 B
8 2 4 A
9 2 6 B
I am trying to use the dropna function in pandas. I would like to use it for a specific column.
I can only figure out how to use it to drop rows in which ALL values are NaN.
I have a dataframe (see below) from which I would like to drop all rows from the first occurrence of a NaN in a specific column, column "A", onward.
My current code only works if all values in a row are NaN.
data.dropna(axis = 0, how = 'all')
data
Original Dataframe
import numpy as np
data = pd.DataFrame({"A": (1,2,3,4,5,6,7,np.nan,np.nan,np.nan),"B": (1,2,3,4,5,6,7,np.nan,"9","10"),"C": range(10)})
data
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
3 4.0 4 3
4 5.0 5 4
5 6.0 6 5
6 7.0 7 6
7 NaN NaN 7
8 NaN 9 8
9 NaN 10 9
What I would like the output to look like:
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
3 4.0 4 3
4 5.0 5 4
5 6.0 6 5
6 7.0 7 6
Any help on this is appreciated.
Obviously I would like to do it in the cleanest, most efficient way possible.
Thanks!
use iloc + argmax
data.iloc[:data.A.isnull().values.argmax()]
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
3 4.0 4 3
4 5.0 5 4
5 6.0 6 5
6 7.0 7 6
or with a different syntax
top_data = data[:data['A'].isnull().argmax()]
Re: accepted answer. If the column in question has no NaNs, argmax returns 0 and thus df[:argmax] will return an empty dataframe.
Here's my workaround:
mask = data.A.isnull()
# argmax is also 0 when the first row is NaN, so test any() instead of comparing with 0
max_ = mask.values.argmax() if mask.any() else len(data)
top_data = data[:max_]
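The edge cases can be checked with a small helper. This is a sketch only: the function name `head_before_first_nan` and the sample frame are hypothetical. It separates "no NaN at all" from "NaN in the first row", both of which make argmax return 0.

```python
import numpy as np
import pandas as pd

def head_before_first_nan(df, column):
    """Rows that come before the first NaN in `column` (illustrative helper)."""
    mask = df[column].isna().to_numpy()
    # argmax returns 0 both for "NaN in the first row" and "no NaN at all",
    # so any() is used to tell the two cases apart.
    stop = mask.argmax() if mask.any() else len(df)
    return df.iloc[:stop]

data = pd.DataFrame({"A": [1, 2, 3, np.nan, 5], "C": range(5)})
print(head_before_first_nan(data, "A"))   # rows 0..2
print(head_before_first_nan(data, "C"))   # no NaN: all rows kept
```

If the first row itself is NaN, the helper returns an empty frame, which is consistent with dropping everything from the first NaN onward.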