I am new to pandas in Python. I have a dataframe with 2 keys, 15 rows each, and 1 column, like below:
1
key1/1 0.5
key1/2 0.5
key1/3 0
key1/4 0
key1/5 0.6
key1/6 0.7
key1/7 0
key1/8 0
key1/9 0
key1/10 0.5
key1/11 0.5
key1/12 0.5
key1/13 0
key1/14 0.5
key1/15 0.5
key2/1 0.4
key2/2 0.2
key2/3 0
key2/4 0
key2/5 0.1
key2/6 0.2
key2/7 0
key2/8 0
key2/9 0.3
key2/10 0.2
key2/11 0
key2/12 0.5
key2/13 0
key2/14 0
key2/15 0.5
I want to iterate over the rows of the dataframe so that each time a zero is met, a new column is started, like below:
1 2 3 4
key1/1 0.5 0.6 0.5 0.5
key1/2 0.5 0.7 0.5 0.5
key1/3 nan nan 0.5 nan
key1/4 nan nan nan nan
1 2 3 4 5
key2/1 0.4 0.1 0.3 0.5 0.5
key2/2 0.2 0.2 0.2 nan nan
key2/3 nan nan nan nan nan
key2/4 nan nan nan nan nan
I have tried the following code, trying to iterate over 'key1' only:
df2 = pd.Dataframe[]
for row in df['key1'].index:
    new_df['key1'][row] == df['key1'][row]
    if df['key1'][row] == 0:
        new_df['key1'].append(df2, ignore_index=True)
Obviously it is not working; please send some help. Ideally I would like to modify the same dataframe instead of creating a new one. Thanks.
EDIT
Below is a drawing of what my data looks like
And below is what I am trying to achieve
You can mask the zeros and assign a group key. Based on that key you can group the values and transform them into columns.
All credit goes to this answer. You will find a great explanation there.
df2 = df.mask(df['1'] == 0)
df2['group'] = (df2['1'].shift(1).isnull() & df2['1'].notnull()).cumsum()
df2 = df2.dropna()
df2.pivot(columns='group')
1
group 1 2 3 4
key1/1 0.5 NaN NaN NaN
key1/10 NaN NaN 0.5 NaN
key1/11 NaN NaN 0.5 NaN
key1/12 NaN NaN 0.5 NaN
key1/14 NaN NaN NaN 0.5
key1/15 NaN NaN NaN 0.5
key1/2 0.5 NaN NaN NaN
key1/5 NaN 0.6 NaN NaN
key1/6 NaN 0.7 NaN NaN
Your group key will look like this:
1 group
key1/1 0.5 1
key1/2 0.5 1
key1/3 NaN 1
key1/4 NaN 1
key1/5 0.6 2
key1/6 0.7 2
key1/7 NaN 2
key1/8 NaN 2
key1/9 NaN 2
key1/10 0.5 3
key1/11 0.5 3
key1/12 0.5 3
key1/13 NaN 3
key1/14 0.5 4
key1/15 0.5 4
You can then translate this data into column format.
Complete solution:
df2 = df.mask(df['1'] == 0)
df2['group'] = (df2['1'].shift(1).isnull() & df2['1'].notnull()).cumsum()
df2 = df2.dropna()
x = df2.groupby('group')['1'].apply(list)
df3 = pd.DataFrame(x.values.tolist()).T
df3.index = [f"key1/{i}" for i in range(1,len(df3)+1)]
0 1 2 3
key1/1 0.5 0.6 0.5 0.5
key1/2 0.5 0.7 0.5 0.5
key1/3 NaN NaN 0.5 NaN
If you want something in that format you need to have data like this:
group
1 [0.5, 0.5]
2 [0.6, 0.7]
3 [0.5, 0.5, 0.5]
4 [0.5, 0.5]
Name: 1, dtype: object
Update 1:
Assuming the data is indexed by key (key1, key2, ...), the same logic can be applied per group:
def func(r):
df2 = r.mask(r['1'] == 0)
df2['group'] = (df2['1'].shift(1).isnull() & df2['1'].notnull()).cumsum()
df2 = df2.dropna()
x = df2.groupby('group')['1'].apply(list)
df3 = pd.DataFrame(x.values.tolist()).T
# df3.index = [r.name]*len(df3)
return (df3)
df.groupby(df.index).apply(func)
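As a runnable sketch of the whole pipeline, the sample frame below is reconstructed from the question's key1 data; treat the column name '1' as an assumption carried over from the answer above.

```python
import pandas as pd

# Reconstruct the key1 portion of the question's data
vals = [0.5, 0.5, 0, 0, 0.6, 0.7, 0, 0, 0, 0.5, 0.5, 0.5, 0, 0.5, 0.5]
df = pd.DataFrame({'1': vals},
                  index=[f"key1/{i}" for i in range(1, 16)])

# Mask zeros, then label each run of non-zero values with a group number:
# a new group starts wherever a valid value follows a NaN (or the start)
df2 = df.mask(df['1'] == 0)
df2['group'] = (df2['1'].shift(1).isnull() & df2['1'].notnull()).cumsum()
df2 = df2.dropna()

# One list of values per group, then transpose the lists into columns
x = df2.groupby('group')['1'].apply(list)
df3 = pd.DataFrame(x.values.tolist()).T
print(df3)
```

The shorter runs are padded with NaN when the lists are turned back into a frame, which gives exactly the ragged-column layout the question asks for.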
Related
I was looking at this question: How can I find 5 consecutive rows in pandas Dataframe where a value of a certain column is at least 0.5, which is similar to the one I have in mind. I would like to find, say, at least 3 consecutive rows where a value is less than 0.5 (but not negative nor nan), while considering the entire dataframe and not just one column as in the question linked above. Here is a facsimile dataframe:
import numpy as np
import pandas as pd
idx = pd.date_range("2018-01-01", periods=10, freq="M")
df = pd.DataFrame(
{
'A':[0, 0.4, 0.5, 0.3, 0,0,0,0,0,0],
'B':[0, 0.6, 0.8,0, 0.3, 0.3, 0.9, 0.7,0,0],
'C':[0,0,0.5, 0.4, 0.4, 0.2,0,0,0,0],
'D':[0.4,0, 0.6, 0.5, 0.7, 0.2,0, 0.9, 0.8,0],
'E':[0.4, 0.3, 0.2, 0.7, 0.7, 0.8,0,0,0,0],
'F':[0,0,0.6, 0.7,0.8, 0.3, 0.4, 0.1,0,0]
},
index=idx
)
df = df.replace({0:np.nan})
df
Hence, since columns B and D don't satisfy the criteria, they should be removed from the output.
I'd prefer not to use a for loop or the like, since it is a 2000-column df, so I tried the following:
def consecutive_values_in_range(s, min, max):
return s.between(left=min, right=max)
min, max = 0, 0.5
df.apply(lambda col: consecutive_values_in_range(col, min, max), axis=0)
print(df)
But I didn't obtain what I was looking for, that would be something like this:
A C E F
2018-01-31 NaN NaN 0.4 NaN
2018-02-28 0.4 NaN 0.3 NaN
2018-03-31 0.5 0.5 0.2 0.6
2018-04-30 0.3 0.4 0.7 0.7
2018-05-31 NaN 0.4 0.7 0.8
2018-06-30 NaN 0.2 0.8 0.3
2018-07-31 NaN NaN NaN 0.4
2018-08-31 NaN NaN NaN 0.1
2018-09-30 NaN NaN NaN NaN
2018-10-31 NaN NaN NaN NaN
Any suggestions? Thanks in advance.
lower, upper = 0, 0.5
n = 3
df.loc[:, ((df <= upper) & (df >= lower)).rolling(n).sum().eq(n).any()]
- get an is_between mask over df
- take the rolling sum of these masks per column, with window size 3
- since True == 1 and False == 0, a rolling sum of 3 at any point implies 3 consecutive True's, i.e., 3 consecutive values with 0 <= val <= 0.5 in that column
- so check equality against 3 and see if there's any match in a column
- lastly, index with the resulting per-column True/False mask
to get
A C E F
2018-01-31 NaN NaN 0.4 NaN
2018-02-28 0.4 NaN 0.3 NaN
2018-03-31 0.5 0.5 0.2 0.6
2018-04-30 0.3 0.4 0.7 0.7
2018-05-31 NaN 0.4 0.7 0.8
2018-06-30 NaN 0.2 0.8 0.3
2018-07-31 NaN NaN NaN 0.4
2018-08-31 NaN NaN NaN 0.1
2018-09-30 NaN NaN NaN NaN
2018-10-31 NaN NaN NaN NaN
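The steps above can be unpacked into named intermediates, as a sketch; the frame is rebuilt from the question's facsimile data, with the date index omitted since it doesn't affect the logic.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        'A': [0, 0.4, 0.5, 0.3, 0, 0, 0, 0, 0, 0],
        'B': [0, 0.6, 0.8, 0, 0.3, 0.3, 0.9, 0.7, 0, 0],
        'C': [0, 0, 0.5, 0.4, 0.4, 0.2, 0, 0, 0, 0],
        'D': [0.4, 0, 0.6, 0.5, 0.7, 0.2, 0, 0.9, 0.8, 0],
        'E': [0.4, 0.3, 0.2, 0.7, 0.7, 0.8, 0, 0, 0, 0],
        'F': [0, 0, 0.6, 0.7, 0.8, 0.3, 0.4, 0.1, 0, 0],
    }
).replace({0: np.nan})

lower, upper = 0, 0.5
n = 3

mask = (df >= lower) & (df <= upper)   # step 1: in-range mask (NaN compares False)
runs = mask.rolling(n).sum()           # step 2: count of Trues in each window of 3
has_run = runs.eq(n).any()             # steps 3-4: any fully-True window, per column
result = df.loc[:, has_run]            # step 5: keep only the qualifying columns
print(result.columns.tolist())         # -> ['A', 'C', 'E', 'F']
```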
I want to generate a column count that counts the pts values grouped by id. The condition is: if x and y both contain NaN, the corresponding pts is counted; otherwise it is ignored.
Sample Df:
id pts x y
0 1 0.1 NaN NaN
1 1 0.2 1.0 NaN
2 1 1.1 NaN NaN
3 2 0.1 NaN NaN
4 2 0.2 2.0 1.0
5 3 1.1 NaN NaN
6 3 1.2 NaN 5.0
7 3 3.1 NaN NaN
8 3 3.2 NaN NaN
9 4 0.1 NaN NaN
Expected df:
id pts x y count
0 1 0.1 NaN NaN 2
1 1 0.2 1.0 NaN 2
2 1 1.1 NaN NaN 2
3 2 0.1 NaN NaN 1
4 2 0.2 2.0 1.0 1
5 3 1.1 NaN NaN 3
6 3 1.2 NaN 5.0 3
7 3 3.1 NaN NaN 3
8 3 3.2 NaN NaN 3
9 4 0.1 NaN NaN 1
I tried:
df['count'] = df.groupby(['id'])['pts'].value_counts()
You can test whether values are missing in both columns with DataFrame.isna and DataFrame.all, then count the True values with sum via GroupBy.transform for the new column:
df['count'] = df[['x','y']].isna().all(axis=1).groupby(df['id']).transform('sum')
print (df)
id pts x y count
0 1 0.1 NaN NaN 2
1 1 0.2 1.0 NaN 2
2 1 1.1 NaN NaN 2
3 2 0.1 NaN NaN 1
4 2 0.2 2.0 1.0 1
5 3 1.1 NaN NaN 3
6 3 1.2 NaN 5.0 3
7 3 3.1 NaN NaN 3
8 3 3.2 NaN NaN 3
9 4 0.1 NaN NaN 1
Or chain both masks by & for bitwise AND:
df['count'] = (df['x'].isna() & df['y'].isna()).groupby(df['id']).transform('sum')
print (df)
id pts x y count
0 1 0.1 NaN NaN 2
1 1 0.2 1.0 NaN 2
2 1 1.1 NaN NaN 2
3 2 0.1 NaN NaN 1
4 2 0.2 2.0 1.0 1
5 3 1.1 NaN NaN 3
6 3 1.2 NaN 5.0 3
7 3 3.1 NaN NaN 3
8 3 3.2 NaN NaN 3
9 4 0.1 NaN NaN 1
I have a DataFrame similar to this:
MACD
0 -2.3
1 -0.3
2 0.8
3 0.1
4 0.6
5 -0.7
6 1.1
7 2.4
How can I add an extra column showing the number of rows since MACD was on the opposite side of the origin (positive/negative)?
Desired Outcome:
MACD RowsSince
0 -2.3 NaN
1 -0.3 NaN
2 0.8 1
3 0.1 2
4 0.6 3
5 -0.7 1
6 1.1 1
7 2.4 2
We can use np.sign with diff to create the subgroup key, then groupby + cumcount:
s = np.sign(df['MACD']).diff().ne(0).cumsum()
df['new'] = (df.groupby(s).cumcount()+1).mask(s.eq(1))
df
Out[80]:
MACD new
0 -2.3 NaN
1 -0.3 NaN
2 0.8 1.0
3 0.1 2.0
4 0.6 3.0
5 -0.7 1.0
6 1.1 1.0
7 2.4 2.0
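To see how the subgroup key works, here is a sketch that prints the intermediate series, assuming the MACD frame from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'MACD': [-2.3, -0.3, 0.8, 0.1, 0.6, -0.7, 1.1, 2.4]})

sign = np.sign(df['MACD'])       # -1 for negative values, 1 for positive
s = sign.diff().ne(0).cumsum()   # a new group starts each time the sign flips
print(s.tolist())                # -> [1, 1, 2, 2, 2, 3, 4, 4]

# Count within each run, then blank out the very first run (group 1),
# since there is no earlier opposite-sign value to count from
df['new'] = (df.groupby(s).cumcount() + 1).mask(s.eq(1))
print(df)
```

Masking `s.eq(1)` is what produces the leading NaNs: the first sign-run has no prior crossing, so its counts are undefined.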
I have a dataframe that looks like this:
userId movie1 movie2 movie3 movie4 score
0 4.1 2.1 1.0 NaN 2
1 3.1 1.1 3.4 1.4 1
2 2.8 NaN 1.7 NaN 3
3 NaN 5.0 NaN 2.3 4
4 NaN NaN NaN NaN 1
5 2.3 NaN 2.0 4.0 1
I want to subtract the movie scores from each movie so the output would look like this:
userId movie1 movie2 movie3 movie4 score
0 2.1 0.1 -1.0 NaN 2
1 2.1 0.1 2.4 0.4 1
2 -0.2 NaN -2.3 NaN 3
3 NaN 1.0 NaN -1.7 4
4 NaN NaN NaN NaN 1
5 1.3 NaN 1.0 3.0 1
The actual dataframe has thousands of movies, and the movies are referenced by name, so I'm trying to find a solution that works with that.
I should have also mentioned that the movies are not listed in order like ["movie1", "movie2", "movie3"]; they are listed by their titles instead, like ["Star Wars", "Harry Potter", "Lord of the Rings"]. The dataset could change, so I won't know what the last movie in the list is.
Use df.filter to identify the movie columns, then subtract the score array from those columns:
In [35]: x = df.filter(like='movie', axis=1).columns.tolist()
In [36]: df[x] = df.filter(like='movie', axis=1) - df.score.values[:, None]
In [37]: df
Out[37]:
userId movie1 movie2 movie3 movie4 score
0 0 2.1 0.1 -1.0 NaN 2
1 1 2.1 0.1 2.4 0.4 1
2 2 -0.2 NaN -1.3 NaN 3
3 3 NaN 1.0 NaN -1.7 4
4 4 NaN NaN NaN NaN 5
5 5 -3.7 NaN -4.0 -2.0 6
EDIT: When the movie column names are arbitrary, select all columns except 'userId' and 'score':
x = df.columns[~df.columns.isin(['userId', 'score'])]
df[x] = df[x] - df.score.values[:, None]
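A runnable sketch of that variant on a cut-down version of the question's data; the title columns here are stand-ins for arbitrary movie names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'userId': [0, 1, 2, 3, 4, 5],
    'Star Wars':    [4.1, 3.1, 2.8, np.nan, np.nan, 2.3],
    'Harry Potter': [2.1, 1.1, np.nan, 5.0, np.nan, np.nan],
    'score': [2, 1, 3, 4, 1, 1],
})

# Everything except the id and score columns is treated as a movie column
x = df.columns[~df.columns.isin(['userId', 'score'])]

# Broadcast score as a (n, 1) column vector against the movie block
df[x] = df[x] - df.score.values[:, None]
print(df)
```

NaN entries pass through the subtraction unchanged, so missing ratings stay missing.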
A possible solution
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['userId'] = [0 , 1 , 2 , 3 , 4 , 5 ]
df['movie1'] = [4.1 , 3.1, 2.8 , np.nan, np.nan, 2.3 ]
df['movie2'] = [2.1 , 1.1, np.nan, 5.0 , np.nan, np.nan]
df['movie3'] = [1.0 , 3.4, 1.7 , np.nan, np.nan, 2.0 ]
df['movie4'] = [np.nan, 1.4, np.nan, 2.3 , np.nan, 4.0 ]
df['score'] = [2, 1, 3, 4, 5, 6]
print('before = ', df)
df.iloc[:,1:-1] = df.iloc[:,1:-1].sub(df.iloc[:,-1].values, axis='rows')
print('after = ', df)
It should return
userId movie1 movie2 movie3 movie4 score
0 0 2.1 0.1 -1.0 NaN 2
1 1 2.1 0.1 2.4 0.4 1
2 2 -0.2 NaN -1.3 NaN 3
3 3 NaN 1.0 NaN -1.7 4
4 4 NaN NaN NaN NaN 5
5 5 -3.7 NaN -4.0 -2.0 6
You can use NumPy broadcasting to subtract here.
v = df.loc[:, 'movie1':'movie4'].to_numpy()
s = df['score'].to_numpy()
out = v - s[:, None]
df.loc[:, 'movie1':'movie4'] = out
df
userId movie1 movie2 movie3 movie4 score
0 0 2.1 0.1 -1.0 NaN 2
1 1 2.1 0.1 2.4 0.4 1
2 2 -0.2 NaN -1.3 NaN 3
3 3 NaN 1.0 NaN -1.7 4
4 4 NaN NaN NaN NaN 5
5 5 -3.7 NaN -4.0 -2.0 6
If you don't know column names use pd.Index.difference here.
cols = df.columns.difference(['userId', 'score'])
# Every column name is extracted except for 'userId' and 'score'
cols
# Index(['movie1', 'movie2', 'movie3', 'movie4'], dtype='object')
Now, just replace 'movie1':'movie4' with cols.
v = df.loc[:, cols].to_numpy()
s = df['score'].to_numpy()
out = v - s[:, None]
df.loc[:, cols] = out
Solution without using .apply():
df.iloc[:, 1:5] = (
df.iloc[:, 1:5]
- df['score'].values.reshape(-1, 1)
)
You can select the columns with iloc if the names of the columns are unknown and use the sub function from pandas to avoid converting to numpy or using apply. I'm assuming value [2,'movie3'] is a typo in your expected output.
df.iloc[:,1:-1] = df.iloc[:,1:-1].sub(df.score, axis=0)
df
Out:
userId movie1 movie2 movie3 movie4 score
0 0 2.1 0.1 -1.0 NaN 2
1 1 2.1 0.1 2.4 0.4 1
2 2 -0.2 NaN -1.3 NaN 3
3 3 NaN 1.0 NaN -1.7 4
4 4 NaN NaN NaN NaN 1
5 5 1.3 NaN 1.0 3.0 1
df.loc[:, "movie1":"movie4"] = df.loc[:, "movie1":"movie4"].apply(
lambda x: x - df["score"]
)
print(df)
Prints:
userId movie1 movie2 movie3 movie4 score
0 0 2.1 0.1 -1.0 NaN 2
1 1 2.1 0.1 2.4 0.4 1
2 2 -0.2 NaN -1.3 NaN 3
3 3 NaN 1.0 NaN -1.7 4
4 4 NaN NaN NaN NaN 5
5 5 -3.7 NaN -4.0 -2.0 6
I have a dataframe of the following structure which is simplified for this question.
A B C D E
0 2014/01/01 nan nan 0.2 nan
1 2014/01/01 0.1 nan nan nan
2 2014/01/01 nan 0.3 nan 0.7
3 2014/01/02 nan 0.4 nan nan
4 2014/01/02 0.5 nan 0.6 0.8
What I have here is a series of readings across several timestamps on single days. The columns B,C,D and E represent different locations. The data I am reading in is set up such that at a specified timestamp it takes data from certain locations and fills in nan values for the other locations.
What I wish to do is group the data by timestamp, which I can easily do with a .groupby() command. From there I wish to have the nan values in the grouped data overwritten with the valid values taken in later rows, such that the following result is obtained.
A B C D E
0 2014/01/01 0.1 0.3 0.2 0.7
1 2014/01/02 0.5 0.4 0.6 0.8
How do I go about achieving this?
Try df.groupby with DataFrameGroupBy.agg:
In [528]: df.groupby('A', as_index=False, sort=False).agg(np.nansum)
Out[528]:
A B C D E
0 2014/01/01 0.1 0.3 0.2 0.7
1 2014/01/02 0.5 0.4 0.6 0.8
A shorter version with DataFrameGroupBy.sum (thanks MaxU!):
In [537]: df.groupby('A', as_index=False, sort=False).sum()
Out[537]:
A B C D E
0 2014/01/01 0.1 0.3 0.2 0.7
1 2014/01/02 0.5 0.4 0.6 0.8
You can do this using pandas GroupBy.first:
df.groupby('A', as_index=False).first()
A B C D E
0 1/1/2014 0.1 0.3 0.2 0.7
1 1/2/2014 0.5 0.4 0.6 0.8
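As a runnable sketch of this approach, the frame below is rebuilt from the question's data; first() skips NaN within each group, which is what makes it pick up the one valid reading per location.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['2014/01/01'] * 3 + ['2014/01/02'] * 2,
    'B': [np.nan, 0.1, np.nan, np.nan, 0.5],
    'C': [np.nan, np.nan, 0.3, 0.4, np.nan],
    'D': [0.2, np.nan, np.nan, np.nan, 0.6],
    'E': [np.nan, np.nan, 0.7, np.nan, 0.8],
})

# first() returns the first non-NaN value per column within each group
out = df.groupby('A', as_index=False).first()
print(out)
```

Unlike the sum() variant, this stays correct even if a stray zero-valued reading appears, since it takes a value rather than adding them.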