So, i have some data in list form, such as:
Q=[2,3,4,5,6,7,8,9,10,11,12] #values
M=[11,0,1,2,3,4,5,6,7,8,9] #months
Y=[2010,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011] #years
And i want to get a dataframe, with one row per year, and one column per month, adding the data of Q on the positions given by M and Y.
so far i have tried a couple of things, my current code is as follows:
def save_data(data_list,year_info,month_info):
#how many datapoints
n_data=len(data_list)
#how many years
y0=year_info[0]
yf=year_info[n_data-1]
n_years=yf-y0+1
#creating the list i want to fill out
df_list=[[math.nan]*12]*n_years
ind=0
for y in range(n_years):
for m in range(12):
if ind<len(data_list):
if year_info[ind]-y0==y and month_info[ind]==m:
df_list[y][m]=data_list[ind]
ind+=1
df=pd.DataFrame(df_list)
return df
I get this output:
0
1
2
3
4
5
6
7
8
9
10
11
0
3
4
5
6
7
8
9
10
11
12
nan
2
1
3
4
5
6
7
8
9
10
11
12
nan
2
And i want to get:
0
1
2
3
4
5
6
7
8
9
10
11
0
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
2
1
3
4
5
6
7
8
9
10
11
12
nan
nan
I have tried doing a bunch of diferent things, but so far nothing has worked, I'm wondering if there's a more straightforward way of doing this, my code seems to be overwriting in a weird way, i do not know for instance why is there a 2 on the last value of second row, since that's the first value of my list.
Thanks in advance!
Try pivot:
(pd.DataFrame({'Y':Y,'M':M,'Q':Q})
.pivot(index='Y', columns='M', values='Q')
)
Output:
M 0 1 2 3 4 5 6 7 8 9 11
Y
2010 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0
2011 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 NaN
Related
so I am trying to add rows to data frame that should follow a numeric order 1 to 52
but my data is missing numbers, so I need to add these rows and fill these spots with NaN values or null.
df = pd.DataFrame("Weeks": [1,2,3,15,16,20,21,52],
"Values": [10,10,10,10,50,60,70,40])
Desired output:
Weeks Values
1 10
2 10
3 10
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
...
52 40
and so on until it reach Weeks = 52
My solution:
new_df = pd.DataFrame("Weeks": "" , "Values":"")
for x in range(1,53):
for i in df.Weeks:
if x == i:
new_df["Weeks"] = x
new_df["Values"] = df.Values[i]
The problem it is super inefficient, anyone know a way to do it in much efficient way?
You could use set_index to set the Weeks as index an reindex with a range up to the maximum week:
df.set_index('Weeks').reindex(range(1,df.Weeks.max()))
Or accounting for the minimum week too:
df.set_index('Weeks').reindex(range(*df.Weeks.agg(('min', 'max'))))
Values
Weeks
1 10.0
2 10.0
3 10.0
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 10.0
16 50.0
17 NaN
...
I have a pandas dataframe created from measured numbers. When something goes wrong with the measurement, the last value is repeated. I would like to do two things:
1. Change all repeating values either to nan or 0.
2. Keep the first repeating value and change all other values nan or 0.
I have found solutions using "shift" but they drop repeating values. I do not want to drop repeating values.My data frame looks like this:
df = pd.DataFrame(np.random.randn(15, 3))
df.iloc[4:8,0]=40
df.iloc[12:15,1]=22
df.iloc[10:12,2]=0.23
giving a dataframe like this:
0 1 2
0 1.239916 1.109434 0.305490
1 0.248682 1.472628 0.630074
2 -0.028584 -1.116208 0.074299
3 -0.784692 -0.774261 -1.117499
4 40.000000 0.283084 -1.495734
5 40.000000 -0.074763 -0.840403
6 40.000000 0.709794 -1.000048
7 40.000000 0.920943 0.681230
8 -0.701831 0.547689 -0.128996
9 -0.455691 0.610016 0.420240
10 -0.856768 -1.039719 0.230000
11 1.187208 0.964340 0.230000
12 0.116258 22.000000 1.119744
13 -0.501180 22.000000 0.558941
14 0.551586 22.000000 -0.993749
what I would like to be able to do is write some code that would filter the data and give me a data frame like this:
0 1 2
0 1.239916 1.109434 0.305490
1 0.248682 1.472628 0.630074
2 -0.028584 -1.116208 0.074299
3 -0.784692 -0.774261 -1.117499
4 NaN 0.283084 -1.495734
5 NaN -0.074763 -0.840403
6 NaN 0.709794 -1.000048
7 NaN 0.920943 0.681230
8 -0.701831 0.547689 -0.128996
9 -0.455691 0.610016 0.420240
10 -0.856768 -1.039719 NaN
11 1.187208 0.964340 NaN
12 0.116258 NaN 1.119744
13 -0.501180 NaN 0.558941
14 0.551586 NaN -0.993749
or even better keep the first value and change the rest to NaN. Like this:
0 1 2
0 1.239916 1.109434 0.305490
1 0.248682 1.472628 0.630074
2 -0.028584 -1.116208 0.074299
3 -0.784692 -0.774261 -1.117499
4 40.000000 0.283084 -1.495734
5 NaN -0.074763 -0.840403
6 NaN 0.709794 -1.000048
7 NaN 0.920943 0.681230
8 -0.701831 0.547689 -0.128996
9 -0.455691 0.610016 0.420240
10 -0.856768 -1.039719 0.230000
11 1.187208 0.964340 NaN
12 0.116258 22.000000 1.119744
13 -0.501180 NaN 0.558941
14 0.551586 NaN -0.993749
using shift & mask:
df.shift(1) == df compares the next row to the current for consecutive duplicates.
df.mask(df.shift(1) == df)
# outputs
0 1 2
0 0.365329 0.153527 0.143244
1 0.688364 0.495755 1.065965
2 0.354180 -0.023518 3.338483
3 -0.106851 0.296802 -0.594785
4 40.000000 0.149378 1.507316
5 NaN -1.312952 0.225137
6 NaN -0.242527 -1.731890
7 NaN 0.798908 0.654434
8 2.226980 -1.117809 -1.172430
9 -1.228234 -3.129854 -1.101965
10 0.393293 1.682098 0.230000
11 -0.029907 -0.502333 NaN
12 0.107994 22.000000 0.354902
13 -0.478481 NaN 0.531017
14 -1.517769 NaN 1.552974
if you want to remove all the consecutive duplicates, test that the previous row is also the same as the current row
df.mask((df.shift(1) == df) | (df.shift(-1) == df))
Option 1
Specialized solution using diff. Get's at the final desired output.
df.mask(df.diff().eq(0))
0 1 2
0 1.239916 1.109434 0.305490
1 0.248682 1.472628 0.630074
2 -0.028584 -1.116208 0.074299
3 -0.784692 -0.774261 -1.117499
4 40.000000 0.283084 -1.495734
5 NaN -0.074763 -0.840403
6 NaN 0.709794 -1.000048
7 NaN 0.920943 0.681230
8 -0.701831 0.547689 -0.128996
9 -0.455691 0.610016 0.420240
10 -0.856768 -1.039719 0.230000
11 1.187208 0.964340 NaN
12 0.116258 22.000000 1.119744
13 -0.501180 NaN 0.558941
14 0.551586 NaN -0.993749
In my data frame I want to create a column '5D_Peak' as a rolling max, and then another column with rolling count of historical data that's close to the peak. I wonder if there is an easier way to simply or ideally vectorise the calculation.
This is my codes in a plain but complicated way:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,4],[4,5,2],[3,5,8],[1,8,6],[5,2,8],[1,4,10],[3,5,9],[1,4,7],[1,4,6]], columns=list('ABC'))
df['5D_Peak']=df['C'].rolling(window=5,center=False).max()
for i in range(5,len(df.A)):
val=0
for j in range(i-5,i):
if df.loc[j,'C']>df.loc[i,'5D_Peak']-2 and df.loc[j,'C']<df.loc[i,'5D_Peak']+2:
val+=1
df.loc[i,'5D_Close_to_Peak_Count']=val
This is the output I want:
A B C 5D_Peak 5D_Close_to_Peak_Count
0 1 2 4 NaN NaN
1 4 5 2 NaN NaN
2 3 5 8 NaN NaN
3 1 8 6 NaN NaN
4 5 2 8 8.0 NaN
5 1 4 10 10.0 0.0
6 3 5 9 10.0 1.0
7 1 4 7 10.0 2.0
8 1 4 6 10.0 2.0
I believe this is what you want. You can set the two values below:
'''the window within which to search "close-to_peak" values'''
lkp_rng = 5
'''how close is close?'''
closeness_measure = 2
'''function to count the number of "close-to_peak" values in the lkp_rng'''
fc = lambda x: np.count_nonzero(np.where(x >= x.max()- closeness_measure))
'''apply fc to the coulmn you choose'''
df['5D_Close_to_Peak_Count'] = df['C'].rolling(window=lkp_range,center=False).apply(fc)
df.head(10)
A B C 5D_Peak 5D_Close_to_Peak_Count
0 1 2 4 NaN NaN
1 4 5 2 NaN NaN
2 3 5 8 NaN NaN
3 1 8 6 NaN NaN
4 5 2 8 8.0 3.0
5 1 4 10 10.0 3.0
6 3 5 9 10.0 3.0
7 1 4 7 10.0 3.0
8 1 4 6 10.0 2.0
I am guessing what you mean by "historical data".
i have two data frames predictor_df and solution_df like this :
predictor_df
1000 A B C
1001 1 2 3
1002 4 5 6
1003 7 8 9
1004 Nan Nan Nan
and a solution_df
0 D
1 10
2 11
3 12
the reason for the names is that the predictor_df is used to do some analysis on it's columns to arrive at analysis_df . My analysis leaves the rows with Nan values in predictor_df and hence the shorter solution_df
Now i want to know how to join these two dataframes to obtain my final dataframe as
A B C D
1 2 3 10
4 5 6 11
7 8 9 12
Nan Nan Nan
please guide me through it . thanks in advance.
Edit : i tried to merge the two dataframes but the result comes like this ,
A B C D
1 2 3 Nan
4 5 6 Nan
7 8 9 Nan
Nan Nan Nan
Edit 2 : also when i do pd.concat([predictor_df, solution_df], axis = 1)
it becomes like this
A B C D
Nan Nan Nan 10
Nan Nan Nan 11
Nan Nan Nan 12
Nan Nan Nan Nan
You could use reset_index with drop=True which resets the index to the default integer index.
pd.concat([df_1.reset_index(drop=True), df_2.reset_index(drop=True)], axis=1)
A B C D
0 1 2 3 10.0
1 4 5 6 11.0
2 7 8 9 12.0
3 Nan Nan Nan NaN
I have a Pandas dataframe as below:
incomplete_df = pd.DataFrame({'event1': [1, 2 ,np.NAN,5 ,6,np.NAN,np.NAN,11 ,np.NAN,15],
'event2': [np.NAN,1 ,np.NAN,3 ,4,7 ,np.NAN,12 ,np.NAN,17],
'event3': [np.NAN,np.NAN,np.NAN,np.NAN,6,4 ,9 ,np.NAN,3 ,np.NAN]})
incomplete_df
event1 event2 event3
0 1 NaN NaN
1 2 1 NaN
2 NaN NaN NaN
3 5 3 NaN
4 6 4 6
5 NaN 7 4
6 NaN NaN 9
7 11 12 NaN
8 NaN NaN 3
9 15 17 NaN
I want to append a reason column that gives a standard text + the column name of the minimum value of that row. In other words, the desired output is:
event1 event2 event3 reason
0 1 NaN NaN 'Reason is event1'
1 2 1 NaN 'Reason is event2'
2 NaN NaN NaN 'Reason is None'
3 5 3 NaN 'Reason is event2'
4 6 4 6 'Reason is event2'
5 NaN 7 4 'Reason is event3'
6 NaN NaN 9 'Reason is event3'
7 11 12 NaN 'Reason is event1'
8 NaN NaN 3 'Reason is event3'
9 15 17 NaN 'Reason is event1'
I can do incomplete_df.apply(lambda x: min(x),axis=1) but this does not ignore NAN's and more importantly returns the value rather than the name of the corresponding column.
EDIT:
Having found out about the idxmin() function from EMS's answer, I timed the the two solutions below:
timeit.repeat("incomplete_df.apply(lambda x: x.idxmin(), axis=1)", "from __main__ import incomplete_df", number=1000)
[0.35261858807214175, 0.32040155511039536, 0.3186818508661702]
timeit.repeat("incomplete_df.T.idxmin()", "from __main__ import incomplete_df", number=1000)
[0.17752145781657447, 0.1628651645393262, 0.15563708275042387]
It seems like the transpose approach is twice as fast.
incomplete_df['reason'] = "Reason is " + incomplete_df.T.idxmin()
ely's answer transposes the dataframe but this is not necessary.
Use the argument axis="columns" instead:
incomplete_df['reason'] = "Reason is " + incomplete_df.idxmin(axis="columns")
This is arguably easier to understand and faster (tested on Python 3.10.2):