Creating a dataframe from monthly values which don't start in January - python

So, I have some data in list form, such as:
Q=[2,3,4,5,6,7,8,9,10,11,12] #values
M=[11,0,1,2,3,4,5,6,7,8,9] #months
Y=[2010,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011] #years
And I want to get a dataframe with one row per year and one column per month, placing the values of Q at the positions given by M and Y.
So far I have tried a couple of things; my current code is as follows:
def save_data(data_list,year_info,month_info):
    #how many datapoints
    n_data=len(data_list)
    #how many years
    y0=year_info[0]
    yf=year_info[n_data-1]
    n_years=yf-y0+1
    #creating the list i want to fill out
    df_list=[[math.nan]*12]*n_years
    ind=0
    for y in range(n_years):
        for m in range(12):
            if ind<len(data_list):
                if year_info[ind]-y0==y and month_info[ind]==m:
                    df_list[y][m]=data_list[ind]
                    ind+=1
    df=pd.DataFrame(df_list)
    return df
I get this output:
      0    1    2    3    4    5    6    7    8    9   10   11
0     3    4    5    6    7    8    9   10   11   12  nan    2
1     3    4    5    6    7    8    9   10   11   12  nan    2
And I want to get:
      0    1    2    3    4    5    6    7    8    9   10   11
0   nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan    2
1     3    4    5    6    7    8    9   10   11   12  nan  nan
I have tried a bunch of different things, but so far nothing has worked. I'm wondering if there's a more straightforward way of doing this. My code seems to be overwriting values in a weird way: for instance, I do not know why there is a 2 in the last position of the second row, since that's the first value of my list.
Thanks in advance!

Try pivot:
(pd.DataFrame({'Y':Y,'M':M,'Q':Q})
.pivot(index='Y', columns='M', values='Q')
)
Output:
M 0 1 2 3 4 5 6 7 8 9 11
Y
2010 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0
2011 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 NaN
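As a side note on the "weird overwriting" in the question: a likely cause is that [[math.nan]*12]*n_years repeats a reference to one inner list, so every row is the same object. The sketch below illustrates that with two rows, and also shows one way to make the pivot result carry all twelve month columns (month 10 has no data, so pivot alone omits it). The variable names aliased, independent and wide are illustrative only.

```python
import math
import pandas as pd

Q = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]   # values
M = [11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]     # months (0-based)
Y = [2010] + [2011] * 10                   # years

# Multiplying the outer list copies references, so both rows are one list.
aliased = [[math.nan] * 12] * 2
aliased[0][11] = 2          # also changes aliased[1][11]

# A list comprehension builds a separate inner list per row.
independent = [[math.nan] * 12 for _ in range(2)]
independent[0][11] = 2      # leaves independent[1] untouched

# Pivot as in the answer, then force columns 0..11 so missing months appear as NaN.
wide = (
    pd.DataFrame({"Y": Y, "M": M, "Q": Q})
    .pivot(index="Y", columns="M", values="Q")
    .reindex(columns=range(12))
)
print(wide)
```

With the list comprehension the original loop would fill each year independently, though the pivot approach avoids the loop altogether.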

Related

How to fill missing numeric values in a df column

I am trying to add rows to a data frame so that the Weeks column follows the numeric order 1 to 52,
but my data is missing some numbers, so I need to add those rows and fill their values with NaN or null.
df = pd.DataFrame({"Weeks": [1,2,3,15,16,20,21,52],
                   "Values": [10,10,10,10,50,60,70,40]})
Desired output:
Weeks Values
1 10
2 10
3 10
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
...
52 40
and so on until it reaches Weeks = 52
My solution:
new_df = pd.DataFrame({"Weeks": [], "Values": []})
for x in range(1,53):
    for i in df.Weeks:
        if x == i:
            new_df["Weeks"] = x
            new_df["Values"] = df.Values[i]
The problem is that it is super inefficient. Does anyone know a much more efficient way to do it?
You could use set_index to set Weeks as the index and reindex with a range up to and including the maximum week (range excludes its stop value, hence the + 1):
df.set_index('Weeks').reindex(range(1, df.Weeks.max() + 1))
Or accounting for the minimum week too:
df.set_index('Weeks').reindex(range(df.Weeks.min(), df.Weeks.max() + 1))
Values
Weeks
1 10.0
2 10.0
3 10.0
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 10.0
16 50.0
17 NaN
...
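For reference, here is a self-contained sketch of the reindex approach, assuming the week range is fixed at 1 to 52 as in the question. It uses a named pd.RangeIndex so that reset_index restores a Weeks column; the name full is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Weeks": [1, 2, 3, 15, 16, 20, 21, 52],
    "Values": [10, 10, 10, 10, 50, 60, 70, 40],
})

# Target index: every week from 1 to 52 inclusive, keeping the name "Weeks".
all_weeks = pd.RangeIndex(1, 53, name="Weeks")

# Rows for weeks that were absent are created with NaN in "Values".
full = df.set_index("Weeks").reindex(all_weeks).reset_index()
print(full.head(8))
```

Because NaN is a float, the Values column becomes float64 after reindexing.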

How to change consecutive repeating values in pandas dataframe series to nan or 0?

I have a pandas dataframe created from measured numbers. When something goes wrong with the measurement, the last value is repeated. I would like to do two things:
1. Change all repeating values either to NaN or 0.
2. Keep the first repeating value and change all the others to NaN or 0.
I have found solutions using "shift", but they drop repeating values. I do not want to drop repeating values. My data frame looks like this:
df = pd.DataFrame(np.random.randn(15, 3))
df.iloc[4:8,0]=40
df.iloc[12:15,1]=22
df.iloc[10:12,2]=0.23
giving a dataframe like this:
0 1 2
0 1.239916 1.109434 0.305490
1 0.248682 1.472628 0.630074
2 -0.028584 -1.116208 0.074299
3 -0.784692 -0.774261 -1.117499
4 40.000000 0.283084 -1.495734
5 40.000000 -0.074763 -0.840403
6 40.000000 0.709794 -1.000048
7 40.000000 0.920943 0.681230
8 -0.701831 0.547689 -0.128996
9 -0.455691 0.610016 0.420240
10 -0.856768 -1.039719 0.230000
11 1.187208 0.964340 0.230000
12 0.116258 22.000000 1.119744
13 -0.501180 22.000000 0.558941
14 0.551586 22.000000 -0.993749
What I would like to do is write some code that filters the data and gives me a data frame like this:
0 1 2
0 1.239916 1.109434 0.305490
1 0.248682 1.472628 0.630074
2 -0.028584 -1.116208 0.074299
3 -0.784692 -0.774261 -1.117499
4 NaN 0.283084 -1.495734
5 NaN -0.074763 -0.840403
6 NaN 0.709794 -1.000048
7 NaN 0.920943 0.681230
8 -0.701831 0.547689 -0.128996
9 -0.455691 0.610016 0.420240
10 -0.856768 -1.039719 NaN
11 1.187208 0.964340 NaN
12 0.116258 NaN 1.119744
13 -0.501180 NaN 0.558941
14 0.551586 NaN -0.993749
Or, even better, keep the first value and change the rest to NaN, like this:
0 1 2
0 1.239916 1.109434 0.305490
1 0.248682 1.472628 0.630074
2 -0.028584 -1.116208 0.074299
3 -0.784692 -0.774261 -1.117499
4 40.000000 0.283084 -1.495734
5 NaN -0.074763 -0.840403
6 NaN 0.709794 -1.000048
7 NaN 0.920943 0.681230
8 -0.701831 0.547689 -0.128996
9 -0.455691 0.610016 0.420240
10 -0.856768 -1.039719 0.230000
11 1.187208 0.964340 NaN
12 0.116258 22.000000 1.119744
13 -0.501180 NaN 0.558941
14 0.551586 NaN -0.993749
Using shift & mask:
df.shift(1) == df compares the previous row to the current one to find consecutive duplicates.
df.mask(df.shift(1) == df)
# outputs
0 1 2
0 0.365329 0.153527 0.143244
1 0.688364 0.495755 1.065965
2 0.354180 -0.023518 3.338483
3 -0.106851 0.296802 -0.594785
4 40.000000 0.149378 1.507316
5 NaN -1.312952 0.225137
6 NaN -0.242527 -1.731890
7 NaN 0.798908 0.654434
8 2.226980 -1.117809 -1.172430
9 -1.228234 -3.129854 -1.101965
10 0.393293 1.682098 0.230000
11 -0.029907 -0.502333 NaN
12 0.107994 22.000000 0.354902
13 -0.478481 NaN 0.531017
14 -1.517769 NaN 1.552974
If you want to remove all the consecutive duplicates, including the first of each run, also test whether the next row is the same as the current row:
df.mask((df.shift(1) == df) | (df.shift(-1) == df))
Option 1
Specialized solution using diff. Gets at the final desired output.
df.mask(df.diff().eq(0))
0 1 2
0 1.239916 1.109434 0.305490
1 0.248682 1.472628 0.630074
2 -0.028584 -1.116208 0.074299
3 -0.784692 -0.774261 -1.117499
4 40.000000 0.283084 -1.495734
5 NaN -0.074763 -0.840403
6 NaN 0.709794 -1.000048
7 NaN 0.920943 0.681230
8 -0.701831 0.547689 -0.128996
9 -0.455691 0.610016 0.420240
10 -0.856768 -1.039719 0.230000
11 1.187208 0.964340 NaN
12 0.116258 22.000000 1.119744
13 -0.501180 NaN 0.558941
14 0.551586 NaN -0.993749
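To make the two variants easy to compare, here is a small sketch on a fixed frame rather than random data. The column names a and b and the variables keep_first and drop_all are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 40.0, 40.0, 40.0, 2.0],
    "b": [5.0, 6.0, 7.0, 22.0, 22.0],
})

# True where a cell equals the cell directly above / below it in the same column.
same_as_prev = df.eq(df.shift(1))
same_as_next = df.eq(df.shift(-1))

# Variant 2 from the question: keep the first value of each run, blank the rest.
keep_first = df.mask(same_as_prev)

# Variant 1 from the question: blank every value that belongs to a run.
drop_all = df.mask(same_as_prev | same_as_next)

print(keep_first)
print(drop_all)
```

Since NaN never compares equal to NaN, cells that are already missing are not treated as repeats by either mask.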

Pandas: rolling count if within a loop

In my data frame I want to create a column '5D_Peak' as a rolling max, and then another column with a rolling count of historical data that's close to the peak. I wonder if there is an easier way to simplify or, ideally, vectorise the calculation.
This is my code, written in a plain but complicated way:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,4],[4,5,2],[3,5,8],[1,8,6],[5,2,8],[1,4,10],[3,5,9],[1,4,7],[1,4,6]], columns=list('ABC'))
df['5D_Peak']=df['C'].rolling(window=5,center=False).max()
for i in range(5,len(df.A)):
    val=0
    for j in range(i-5,i):
        if df.loc[j,'C']>df.loc[i,'5D_Peak']-2 and df.loc[j,'C']<df.loc[i,'5D_Peak']+2:
            val+=1
    df.loc[i,'5D_Close_to_Peak_Count']=val
This is the output I want:
A B C 5D_Peak 5D_Close_to_Peak_Count
0 1 2 4 NaN NaN
1 4 5 2 NaN NaN
2 3 5 8 NaN NaN
3 1 8 6 NaN NaN
4 5 2 8 8.0 NaN
5 1 4 10 10.0 0.0
6 3 5 9 10.0 1.0
7 1 4 7 10.0 2.0
8 1 4 6 10.0 2.0
I believe this is what you want. You can set the two values below:
# the window within which to search for "close-to-peak" values
lkp_rng = 5
# how close is close?
closeness_measure = 2
# function to count the number of "close-to-peak" values in the window
fc = lambda x: np.count_nonzero(x >= x.max() - closeness_measure)
# apply fc to the column you choose
df['5D_Close_to_Peak_Count'] = df['C'].rolling(window=lkp_rng,center=False).apply(fc)
df.head(10)
A B C 5D_Peak 5D_Close_to_Peak_Count
0 1 2 4 NaN NaN
1 4 5 2 NaN NaN
2 3 5 8 NaN NaN
3 1 8 6 NaN NaN
4 5 2 8 8.0 3.0
5 1 4 10 10.0 3.0
6 3 5 9 10.0 4.0
7 1 4 7 10.0 3.0
8 1 4 6 10.0 3.0
I am guessing what you mean by "historical data".
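If "historical data" is taken literally as in the question's loop (the five rows before row i, compared against the peak at row i, strictly within 2 of it), the loop can be vectorised with NumPy. This is only a sketch under that reading; it assumes NumPy 1.20 or later for sliding_window_view, and the names window, tol, prev_windows and counts are illustrative.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

df = pd.DataFrame({"C": [4, 2, 8, 6, 8, 10, 9, 7, 6]})
window, tol = 5, 2

df["5D_Peak"] = df["C"].rolling(window=window).max()

c = df["C"].to_numpy()
peak = df["5D_Peak"].to_numpy()

# Row k of the view is c[k:k+window]; for row i of the frame we need c[i-window:i],
# which is view row i-window, so the last view row is not used.
prev_windows = sliding_window_view(c, window)[:-1]

# Compare each previous window with the peak of the row that follows it.
close = np.abs(prev_windows - peak[window:, None]) < tol
counts = close.sum(axis=1)

# The first `window` rows have no full history, so they stay NaN as in the loop.
result = np.full(len(df), np.nan)
result[window:] = counts
df["5D_Close_to_Peak_Count"] = result
print(df)
```

On the question's data this reproduces the desired column (0, 1, 2, 2 for rows 5 to 8).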

how to merge two dataframes if the index and length both do not match?

I have two data frames, predictor_df and solution_df, like this:
predictor_df
1000 A B C
1001 1 2 3
1002 4 5 6
1003 7 8 9
1004 Nan Nan Nan
and a solution_df
0 D
1 10
2 11
3 12
The reason for the names is that predictor_df is used to do some analysis on its columns to arrive at analysis_df. My analysis leaves out the rows with NaN values in predictor_df, hence the shorter solution_df.
Now I want to know how to join these two dataframes to obtain my final dataframe as
A B C D
1 2 3 10
4 5 6 11
7 8 9 12
Nan Nan Nan
Please guide me through it. Thanks in advance.
Edit: I tried to merge the two dataframes, but the result comes out like this:
A B C D
1 2 3 Nan
4 5 6 Nan
7 8 9 Nan
Nan Nan Nan
Edit 2: Also, when I do pd.concat([predictor_df, solution_df], axis = 1)
it becomes like this:
A B C D
Nan Nan Nan 10
Nan Nan Nan 11
Nan Nan Nan 12
Nan Nan Nan Nan
You could use reset_index with drop=True on both frames, which replaces each index with the default integer index so that the rows line up by position:
pd.concat([predictor_df.reset_index(drop=True), solution_df.reset_index(drop=True)], axis=1)
A B C D
0 1 2 3 10.0
1 4 5 6 11.0
2 7 8 9 12.0
3 Nan Nan Nan NaN
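A runnable sketch of this, using frames rebuilt from the question's tables (the index labels 1001 to 1004 and 1 to 3 are assumptions based on how they are displayed). It also shows a possible variant that keeps predictor_df's own index, assuming the rows of solution_df correspond, in order, to the complete rows of predictor_df.

```python
import numpy as np
import pandas as pd

predictor_df = pd.DataFrame(
    {"A": [1, 4, 7, np.nan], "B": [2, 5, 8, np.nan], "C": [3, 6, 9, np.nan]},
    index=[1001, 1002, 1003, 1004],
)
solution_df = pd.DataFrame({"D": [10, 11, 12]}, index=[1, 2, 3])

# Positional join: drop both indexes so rows 0, 1, 2 line up.
combined = pd.concat(
    [predictor_df.reset_index(drop=True), solution_df.reset_index(drop=True)],
    axis=1,
)

# Variant that keeps the original index: write D only into the complete rows.
aligned = predictor_df.copy()
complete = aligned.notna().all(axis=1)
aligned.loc[complete, "D"] = solution_df["D"].to_numpy()

print(combined)
print(aligned)
```

The second variant only works if the number of complete rows equals the length of solution_df.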

Pandas: getting the name of the minimum column

I have a Pandas dataframe as below:
incomplete_df = pd.DataFrame({'event1': [1, 2 ,np.nan,5 ,6,np.nan,np.nan,11 ,np.nan,15],
                              'event2': [np.nan,1 ,np.nan,3 ,4,7 ,np.nan,12 ,np.nan,17],
                              'event3': [np.nan,np.nan,np.nan,np.nan,6,4 ,9 ,np.nan,3 ,np.nan]})
incomplete_df
event1 event2 event3
0 1 NaN NaN
1 2 1 NaN
2 NaN NaN NaN
3 5 3 NaN
4 6 4 6
5 NaN 7 4
6 NaN NaN 9
7 11 12 NaN
8 NaN NaN 3
9 15 17 NaN
I want to append a reason column that gives a standard text + the column name of the minimum value of that row. In other words, the desired output is:
event1 event2 event3 reason
0 1 NaN NaN 'Reason is event1'
1 2 1 NaN 'Reason is event2'
2 NaN NaN NaN 'Reason is None'
3 5 3 NaN 'Reason is event2'
4 6 4 6 'Reason is event2'
5 NaN 7 4 'Reason is event3'
6 NaN NaN 9 'Reason is event3'
7 11 12 NaN 'Reason is event1'
8 NaN NaN 3 'Reason is event3'
9 15 17 NaN 'Reason is event1'
I can do incomplete_df.apply(lambda x: min(x),axis=1), but this does not ignore NaNs and, more importantly, returns the value rather than the name of the corresponding column.
EDIT:
Having found out about the idxmin() function from EMS's answer, I timed the two solutions below:
timeit.repeat("incomplete_df.apply(lambda x: x.idxmin(), axis=1)", "from __main__ import incomplete_df", number=1000)
[0.35261858807214175, 0.32040155511039536, 0.3186818508661702]
timeit.repeat("incomplete_df.T.idxmin()", "from __main__ import incomplete_df", number=1000)
[0.17752145781657447, 0.1628651645393262, 0.15563708275042387]
It seems like the transpose approach is twice as fast.
incomplete_df['reason'] = "Reason is " + incomplete_df.T.idxmin()
ely's answer transposes the dataframe but this is not necessary.
Use the argument axis="columns" instead:
incomplete_df['reason'] = "Reason is " + incomplete_df.idxmin(axis="columns")
This is arguably easier to understand and faster (tested on Python 3.10.2).
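One caveat with the question's data is row 2, where every value is NaN and the desired text is 'Reason is None'. How idxmin treats an all-NaN row has changed between pandas versions, so the sketch below sidesteps it by calling idxmin only on rows that have at least one value and filling in the rest afterwards. The names has_value and min_col are illustrative.

```python
import numpy as np
import pandas as pd

incomplete_df = pd.DataFrame({
    "event1": [1, 2, np.nan, 5, 6, np.nan, np.nan, 11, np.nan, 15],
    "event2": [np.nan, 1, np.nan, 3, 4, 7, np.nan, 12, np.nan, 17],
    "event3": [np.nan, np.nan, np.nan, np.nan, 6, 4, 9, np.nan, 3, np.nan],
})

# Rows with at least one non-missing value.
has_value = incomplete_df.notna().any(axis="columns")

# Column name of the row-wise minimum (NaNs are skipped by default).
min_col = incomplete_df[has_value].idxmin(axis="columns")

# Put the all-NaN rows back and label them "None".
reason = "Reason is " + min_col.reindex(incomplete_df.index).fillna("None")
incomplete_df["reason"] = reason
print(incomplete_df)
```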
