Wrangling shifted DataFrame with Pandas - python
In the following pandas DataFrame, the first two columns (Remessas_A and Remessas_A_1d) were given, and I had to find the third (previsao) following the pattern described below. Note that I'm not counting DataEntrega as a column: it is the datetime index.
DataEntrega,Remessas_A,Remessas_A_1d,previsao
2020-07-25,696.0,,
2020-07-26,0.0,,
2020-07-27,518.0,,
2020-07-28,629.0,,
2020-07-29,699.0,,
2020-07-30,660.0,,
2020-07-31,712.0,,
2020-08-01,2.0,-672.348684948797,23.651315051203028
2020-08-02,0.0,-504.2138715410994,-504.2138715410994
2020-08-03,4.0,-91.10009092298037,426.89990907701963
2020-08-04,327.0,194.46620611760167,823.4662061176017
2020-08-05,442.0,220.65451760630847,919.6545176063084
2020-08-06,474.0,-886.140302693952,-226.14030269395198
2020-08-07,506.0,-61.28132269808316,650.7186773019168
2020-08-08,11.0,207.12286256242962,230.77417761363265
2020-08-09,2.0,109.36137834671834,-394.85249319438105
2020-08-10,388.0,146.2428764085755,573.1427854855951
2020-08-11,523.0,-193.02046115081606,630.4457449667857
2020-08-12,509.0,-358.59415822684485,561.0603593794635
2020-08-13,624.0,966.9258406162757,740.7855379223237
2020-08-14,560.0,175.8273195122506,826.5459968141674
2020-08-15,70.0,19.337299248463978,250.11147686209662
2020-08-16,3.0,83.09413535361391,-311.75835784076713
2020-08-17,401.0,-84.67345026550751,488.4693352200876
2020-08-18,526.0,158.53310638454195,788.9788513513276
2020-08-19,580.0,285.99137337700336,847.0517327564669
2020-08-20,624.0,-480.93226226400344,259.85327565832023
2020-08-21,603.0,-194.68412031046182,631.8618765037056
2020-08-22,45.0,-39.23172496101115,210.87975190108546
2020-08-23,2.0,-115.26376570266325,-427.0221235434304
2020-08-24,463.0,10.04635376084557,498.5156889809332
2020-08-25,496.0,-32.44638720124206,756.5324641500856
2020-08-26,600.0,-198.6715680014182,648.3801647550487
2020-08-27,663.0,210.40991269713578,470.263188355456
2020-08-28,628.0,40.32391720053602,672.1857937042416
2020-08-29,380.0,-2.4418918145294626,208.437860086556
2020-08-30,0.0,152.66166068424076,-274.3604628591896
2020-08-31,407.0,18.499558564880928,517.0152475458141
The first 7 values of Remessas_A_1d and previsao are null and will be kept null.
To obtain the first 7 non-null values of previsao, from 2020-08-01 to 2020-08-07, I shifted Remessas_A 7 days ahead and added the shifted Remessas_A to the original Remessas_A_1d:
# res is the name of the dataframe
res['previsao'].loc['2020-08-01':'2020-08-07'] = res['Remessas_A'].shift(7).loc['2020-08-01':'2020-08-07'].add(res['Remessas_A_1d'].loc['2020-08-01':'2020-08-07'])
To find the next 7 values of previsao, from 2020-08-08 to 2020-08-14, I shifted the previsao column itself 7 days ahead and added the shifted previsao to the original Remessas_A_1d:
res['previsao'].loc['2020-08-08':'2020-08-14'] = res['previsao'].shift(7).loc['2020-08-08':'2020-08-14'].add(res['Remessas_A_1d'].loc['2020-08-08':'2020-08-14'])
To find the next values of previsao, I repeated the last step, moving 7 days ahead each time:
res['previsao'].loc['2020-08-15':'2020-08-21'] = res['previsao'].shift(7).loc['2020-08-15':'2020-08-21'].add(res['Remessas_A_1d'].loc['2020-08-15':'2020-08-21'])
res['previsao'].loc['2020-08-22':'2020-08-28'] = res['previsao'].shift(7).loc['2020-08-22':'2020-08-28'].add(res['Remessas_A_1d'].loc['2020-08-22':'2020-08-28'])
res['previsao'].loc['2020-08-29':'2020-08-31'] = res['previsao'].shift(7).loc['2020-08-29':'2020-08-31'].add(res['Remessas_A_1d'].loc['2020-08-29':'2020-08-31'])
# the last line only spanned 3 days because I reached the end of my dataframe
Instead of doing that by hand, how can I write a function that takes periods=7, Remessas_A and Remessas_A_1d as input and returns previsao as the output?
Not the most elegant code, but this should do the trick:
df["previsao"][df.index <= pd.to_datetime("2020-08-07")] = df["Remessas_A"].shift(7) + df["Remessas_A_1d"]
for d in pd.date_range("2020-08-08", "2020-08-31"):
df.loc[d, "previsao"] = df.loc[d - pd.Timedelta("7d"), "previsao"] + df.loc[d, "Remessas_A_1d"]
Edit: I've assumed you have DataEntrega as a datetime index. I can post the rest of the code if you need it.
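To get the single function the question asks for, here is a minimal sketch (the function name and the NaN-based seeding are my own; it assumes, as in the data above, that only the leading values of Remessas_A_1d are null and that they mark the warm-up window):

import pandas as pd

def build_previsao(remessas, remessas_1d, periods=7):
    # Hypothetical helper: rolls the forecast forward `periods` rows at a
    # time. Where the forecast from one window earlier is still NaN (the
    # very first window), it falls back to the shifted original series,
    # exactly as in the manual steps above.
    previsao = pd.Series(float("nan"), index=remessas.index)
    for i in range(periods, len(previsao)):
        if pd.isna(remessas_1d.iloc[i]):
            continue  # the leading nulls stay null
        base = previsao.iloc[i - periods]
        if pd.isna(base):
            base = remessas.iloc[i - periods]  # first window: seed from Remessas_A
        previsao.iloc[i] = base + remessas_1d.iloc[i]
    return previsao

# usage, assuming res is indexed by DataEntrega:
# res["previsao"] = build_previsao(res["Remessas_A"], res["Remessas_A_1d"], periods=7)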
Related
Apply if else condition in specific pandas column by location
I am trying to apply a condition to a pandas column by location and am not quite sure how. Here is some sample data:

data = {'Pop': [728375, 733355, 695395, 734658, 732811, 789396, 727761, 751967],
        'Pop2': [728375, 733355, 695395, 734658, 732811, 789396, 727761, 751967]}
PopDF = pd.DataFrame(data)
remainder = 6

# I would like to subtract 1 from PopDF['Pop2'] column cells 0-remainder.
# The remaining cells in the column I would like to stay as is (retain original pop values).
PopDF['Pop2'] = PopDF['Pop2'].iloc[:(remainder)] - 1
PopDF['Pop2'].iloc[(remainder):] = PopDF['Pop'].iloc[(remainder):]

The first line works to subtract 1 in the correct locations; however, the remaining cells become NaN. The second line of code does not work. The error is:

ValueError: Length of values (1) does not match length of index (8)
Instead of selecting the first N rows and subtracting from them, subtract 1 from the entire column and assign only the first values of the result:

df.loc[:remainder, 'Pop2'] = df['Pop2'] - 1

Output:

>>> df
      Pop    Pop2
0  728375  728374
1  733355  733354
2  695395  695394
3  734658  734657
4  732811  732810
5  789396  789395
6  727761  727760
7  751967  751967
Update main dataframe based on sub dataframes coming from groupby
I am pretty new to pandas and trying to learn it, so any advice would be appreciated. :) This is just a small part of my whole dataframe DF2:

  Chromosome_Name Sequence_Source Sequence_Feature   Start     End Strand            Gene_ID        Gene_Name
0               1  ensembl_havana             gene   14363   34806      -  "ENSG00000227232"         "WASH7P"
1               1          havana             gene   89295  138566      -  "ENSG00000238009"   "RP11-34P13.7"
2               1          havana             gene  141474  178862      -  "ENSG00000241860"  "RP11-34P13.13"
3               1          havana             gene  227615  272253      -  "ENSG00000228463"     "AP006222.2"
4               1  ensembl_havana             gene  312720  453948      +  "ENSG00000237094"  "RP4-669L17.10"

These are my conditions:

Condition 1: reference row's "Start" value <= other row's "End" value.
Condition 2: reference row's "End" value >= other row's "Start" value.

This is what I have done so far:

chromosome_list = ["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","X","Y"]
dataFrame = DF2.groupby(["Chromosome_Name"])
for chromosome in chromosome_list:
    CHR = dataFrame.get_group(chromosome)
    for i in range(0, len(CHR)-1):
        for j in range(i+1, len(CHR)):
            Overlap_index = DF2[(DF2.loc[i, ["Chromosome_Name"] == chromosome]) & (DF2.loc[i, ["Start"]] <= DF2.loc[j, ["End"]]) & (DF2.loc[i, ["End"]] >= DF2.loc[j, ["Start"]])].index
            DF2 = DF2.drop(Overlap_index)

The chromosome_list holds all the unique values of the column "Chromosome_Name". Mainly, I want to check for each row whether the "Start" and "End" column values satisfy the conditions above. I believe I need to iterate a single row (the reference row) over the relevant rows of the data frame. However, to achieve this I need to take the value of the first column, "Chromosome_Name", into account. More specifically, every row in DF2 should be checked according to the conditions stated above, but a row with Chromosome_Name = 5, for example, shouldn't be compared with a row with Chromosome_Name = 12. Therefore, I first thought I should split the dataframe with pd.groupby() according to Chromosome_Name and then, using the indexes of these sub-dataframes, drop the matching rows from DF2. However, it did not work. :)

P.S. After DF2 is split into sub-dataframes (according to the unique Chromosome_Name values), each sub-dataframe has a different size, e.g. there are 641 rows for Chromosome_Name = X but 19342 rows for Chromosome_Name = 1. If you know how to correct my code or can provide another solution, I would be glad. Thanks in advance.
I am new to pandas too, so I don't want to give you wrong insights or advice, but have you thought of converting the Start and End columns to lists? That way you can use plain if statements if you are not comfortable with pandas and your task is urgent. However, I am aware that converting a dataframe into lists is somewhat at odds with the point of using pandas in the first place.
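For what it's worth, a rough sketch of that list-based idea, grouped per chromosome so rows from different chromosomes are never compared (untested; DF2 and the column names are taken from the question, and the double loop stays O(n^2) per group):

drop_labels = set()
for _, chrom in DF2.groupby("Chromosome_Name"):
    starts = chrom["Start"].tolist()
    ends = chrom["End"].tolist()
    labels = chrom.index.tolist()
    for i in range(len(labels) - 1):
        for j in range(i + 1, len(labels)):
            # both overlap conditions from the question
            if starts[i] <= ends[j] and ends[i] >= starts[j]:
                drop_labels.add(labels[j])

DF2 = DF2.drop(index=list(drop_labels))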
Checking timestamps in pandas
I have a pandas dataset with every line timestamped (unix time; every line represents a day). Ex:

Index  Timestamp   Value
1      1544400000  2598
2      1544572800  2649
3      1544659200  2234
4      1544745600  2204
5      1544832000  1293

Is there a method I can use to subtract every row (of the first column) from the previous row? The purpose is to know whether the interval between lines is always the same, to make sure that the dataset isn't skipping a day. In the example above, the first day skips straight to the third day, giving a 48-hour interval, while the other rows are all at 24-hour intervals. I think I could do it using iterrows(), but that seems very costly for large datasets.

Not sure I was clear enough, so, for the Timestamp column in the example above:

Row 2 - row 1 = 172800 (48 hrs)
Row 3 - row 2 = 86400 (24 hrs)
Row 4 - row 3 = 86400 (24 hrs)
...
Pandas DataFrames have a diff method that does what you want. Note that the first row of the returned diff will contain NaN, so you'll want to ignore it in any comparison. An example would be:

import pandas as pd

df = pd.DataFrame({'timestamps': [100, 200, 300, 500]})
# get diff of the column (ignoring the first NaN value) and convert to a list
X = df['timestamps'].diff()[1:].tolist()
# check if all values are the same, e.g. https://stackoverflow.com/a/3844948/1862861
X.count(X[0]) == len(X)
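Applied to the data from the question, something along these lines would flag any row whose gap to the previous row isn't exactly one day (the 86400-second threshold is an assumption based on the daily data described above):

import pandas as pd

df = pd.DataFrame({'Timestamp': [1544400000, 1544572800, 1544659200, 1544745600, 1544832000],
                   'Value': [2598, 2649, 2234, 2204, 1293]})
gaps = df['Timestamp'].diff()
# rows that are not exactly 24 h (86400 s) after the previous one, i.e. skipped days
print(df[gaps.notna() & (gaps != 86400)])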
Pandas calculated column from datetime index groups loop
I have a Pandas df with a DatetimeIndex. I want to loop over the following code with different values of strike, based on the index date value (a different strike for each time period). Here is my code, which produces what I am after for one strike across the whole time series:

import pandas as pd
import numpy as np

index = pd.date_range('2017-10-1 00:00:00', '2018-12-31 23:50:00', freq='30min')
df = pd.DataFrame(np.random.randn(len(index), 2).cumsum(axis=0), columns=['A', 'B'], index=index)

strike = 40
payoffs = df[df > strike] - strike
mean_payoff = payoffs.fillna(0).mean()
dist = mean_payoff.describe(percentiles=[0.05, .5, .95])
print(dist)

I want to use different values of strike based on the time period (index value). So far I have tried to create a categorical calculated column with the intention of using map or apply row-wise on the df. I have also played around with creating a dictionary and mapping the dict across the df. Even if I get the calculated column with the correct strike value, I can't think how to subtract that column's value (strike) from all the other columns to get the payoffs from above. I feel like I need a for loop, potentially creating groups of date chunks that get appended together at the end of the loop, maybe with pd.concat. Thanks in advance.
I think you need to convert the DatetimeIndex to quarterly periods with to_period, then to strings, and finally map with a dict. For the comparison, use gt together with sub (both aligned on axis=0):

d = {'2017Q4': 30, '2018Q1': 40, '2018Q2': 50, '2018Q3': 60, '2018Q4': 70}

strike = df.index.to_series().dt.to_period('Q').astype(str).map(d)

payoffs = df[df.gt(strike, axis=0)].sub(strike, axis=0)
mean_payoff = payoffs.fillna(0).mean()
dist = mean_payoff.describe(percentiles=[0.05, .5, .95])
Mapping your dataframe index into a dictionary can be a starting point.

import random
import pandas as pd

a = dict()
a[2017] = 30
a[2018] = 40

ranint = random.choices([30, 35, 40, 45], k=21936)  # given your index used in the example
df = pd.DataFrame({'values': ranint}, index=index)

df['year'] = df.index.year
df['strike'] = df['year'].map(a)
df['returns'] = df['values'] - df['strike']

                     values  year  strike
2017-10-01 00:00:00      30  2017      30
2017-10-01 00:30:00      30  2017      30
2017-10-01 01:00:00      45  2017      30

Then you can extract the returns that are greater than 0:

df[df['returns'] > 0]
Add multiple columns to multiple data frames
I have a number of small dataframes, each with a date and the stock price for a given stock. Someone else showed me how to loop through them so they are contained in a list called all_dfs. So all_dfs[0] would be a dataframe with Date and IBM US Equity, all_dfs[1] would be Date and MMM US Equity, etc. (example shown below). The Date column in the dataframes is always the same, but the stock names are all different and the numbers associated with each stock column are always different. So when you call all_dfs[1] this is the dataframe you would see (i.e., all_dfs[1].head()):

IDX  Date      MMM US equity
0    1/3/2000  47.19
1    1/4/2000  45.31
2    1/5/2000  46.63
3    1/6/2000  50.38

I want to add the same additional columns to EVERY dataframe, so I was trying to loop through them and add the columns. The numbers in the stock-name columns are the basis for the calculations that produce the other columns. There are more columns to add, but I think they will all loop through the same way, so this is a sample of the columns I want to add:

Column 1 to add >>> df['P_CHG1D'] = df['Stock name #1'].pct_change(1) * 100
Column 2 to add >>> df['PCHG_SIG'] = P_CHG1D > 3
Column 3 to add >>> df['PCHG_SIG'] = df['PCHG_SIG'].map({True: 1, False: 0})

This is the code that I have so far, but it is returning syntax errors for the all_dfs[i]:

for i in range(len(df.columns)):
    for all_dfs[i]:
        df['P_CHG1D'] = df.loc[:, 0].pct_change(1) * 100

So I also have 2 problems that I can not figure out:

1. I don't know how to add columns to every dataframe in the loop. I would have to do something like all_dfs[i]['ADD COLUMN NAME'] = df['Stock Name 1'].pct_change(1) * 100.
2. The second part after the =, the df['Stock Name 1'], keeps changing (in this example it is called MMM US Equity, but the next time it would be the column header of the second dataframe, so it could be IBM US Equity), as each dataframe has a different name, so I don't know how to reference that properly in the loop.

I am new to python/pandas, so if I am thinking about this the wrong way, let me know if there is a better solution.
Consider iterating through the length of all_dfs to reference each element in the loop by its index. For the first new column, use positional indexing to select the stock column by its column position of 2 (third column):

for i in range(len(all_dfs)):
    all_dfs[i] = all_dfs[i].copy()  # avoids SettingWithCopyWarning on slices
    all_dfs[i]['P_CHG1D'] = all_dfs[i].iloc[:, 2].pct_change(1) * 100
    all_dfs[i]['PCHG_SIG'] = all_dfs[i]['P_CHG1D'] > 3
    all_dfs[i]['PCHG_SIG_VAL'] = all_dfs[i]['PCHG_SIG'].map({True: 1, False: 0})