How to apply the "diff()" method to multiple columns? - python

I'm trying to apply the diff() method to multiple columns to make the data stationary for a time series.
x1 = frc_data['004L004T10'].diff(periods=8)
x1
Date
2013-10-01 NaN
2013-11-01 NaN
2013-12-01 NaN
2014-01-01 NaN
2014-02-01 NaN
So diff works for a single column.
However, diff is not working for all the columns:
for x in frc_data.columns:
    frc_data[x].diff(periods=1)
No errors, but the data remains unchanged.

In order to change the DataFrame, you need to assign the diff result back to the column, i.e.
for x in frc_data.columns:
    frc_data[x] = frc_data[x].diff(periods=1)

The loop is not necessary; just remove the [x] to difference all columns at once:
frc_data = frc_data.diff(periods=1)
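A minimal runnable sketch with made-up numbers (the second column name is hypothetical, added only for illustration):

import pandas as pd

# Hypothetical two-column frame of monthly values.
frc_data = pd.DataFrame(
    {'004L004T10': [10.0, 12.0, 15.0, 11.0],
     '004L005T10': [5.0, 6.0, 9.0, 8.0]},
    index=pd.to_datetime(['2013-10-01', '2013-11-01',
                          '2013-12-01', '2014-01-01']),
)

# diff() on the whole frame differences every column at once.
frc_data = frc_data.diff(periods=1)
print(frc_data)
#             004L004T10  004L005T10
# 2013-10-01         NaN         NaN
# 2013-11-01         2.0         1.0
# 2013-12-01         3.0         3.0
# 2014-01-01        -4.0        -1.0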

Related

Compare and find Duplicated values (not entire columns) across data frame with python

I have a large data frame of schedules, and I need to count the number of experiments run. The challenge is that usage for an experiment is repeated across rows (which is OK), but is duplicated in some, but not all, columns. I want to remove the second entry (if duplicated), but I can't delete the entire second column because it will contain some new values too. How can I compare individual entries for two columns side by side and delete the second if it is a duplicate?
An experiment runs for a maximum of two days, so three days in a row means a new event with the same name starts on the third day.
The actual text for the experiment names is complicated and the data frame is 120 columns wide, so typing this in as a list or dictionary isn't possible. I'm hoping for a python or numpy function, but could use a loop.
(The original post attached pictures of the starting data frame and the desired de-duplicated output.)
This is a hack and similar to @Params' answer, but it should be faster because you aren't calling .iloc repeatedly. The basic idea is to transpose the data frame and repeat an operation as many times as needed to compare all of the columns, then transpose back to get the result in the OP's format.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Monday': ['exp_A', 'exp_A', 'exp_A', 'exp_A', 'exp_B', np.nan, np.nan, np.nan, 'exp_D', 'exp_D'],
    'Tuesday': ['exp_A', 'exp_A', 'exp_A', 'exp_A', 'exp_B', 'exp_B', 'exp_C', 'exp_C', 'exp_D', 'exp_D'],
    'Wednesday': ['exp_A', 'exp_D', np.nan, np.nan, np.nan, 'exp_B', 'exp_C', 'exp_C', 'exp_C', np.nan],
    'Thursday': ['exp_A', 'exp_D', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'exp_C', np.nan]
})

# Transpose so each day becomes a row, then blank out any entry that
# matches the previous day but not the day before that (i.e. the
# second day of a two-day run). Repeating the pass ceil(n_days / 2)
# times lets the rule propagate across all days.
df = df.T
for i in range(int(np.ceil(df.shape[0] / 2))):
    df[(df == df.shift(1)) & (df != df.shift(2))] = np.nan
df = df.T
Output:
  Monday Tuesday Wednesday Thursday
0 exp_A NaN exp_A NaN
1 exp_A NaN exp_D NaN
2 exp_A NaN NaN NaN
3 exp_A NaN NaN NaN
4 exp_B NaN NaN NaN
5 NaN exp_B NaN NaN
6 NaN exp_C NaN NaN
7 NaN exp_C NaN NaN
8 exp_D NaN exp_C NaN
9 exp_D NaN NaN NaN

(Pandas) I fail to assign a value when I use loc and a column name together

A sample of my data is as follows.
date traffic_amount factor
-----------------------------------------------
2012-01-01 613.586982 1.000000
2012-02-01 598.591841 1.107143
2012-03-01 653.042743 1.000000
2012-04-01 666.692536 1.033333
I tried to change the value at index 2012-02-01 using loc and the column name, but I failed: the value remained the same.
data_per_member.loc['2012-02-01']['factor'] = 31/29
print(data_per_member.loc['2012-02-01']['factor'])
1.1071428571428572
31/29 is 1.0689655172413792, so my line data_per_member.loc['2012-02-01']['factor'] = 31/29 clearly did not take effect. Why didn't my code work?
Do not chain .loc with []. Chained indexing first extracts the row, possibly as a copy, so the assignment may modify a temporary object rather than the original DataFrame. Pass both labels to a single .loc call instead:
data_per_member.loc['2012-02-01','factor'] = 31/29
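A minimal sketch of the failure mode and the fix, using a small made-up frame:

import pandas as pd

df = pd.DataFrame(
    {'factor': [1.000000, 1.107143]},
    index=['2012-01-01', '2012-02-01'],
)

# Chained indexing: df.loc['2012-02-01'] may return a temporary copy,
# so this assignment can silently update the copy, not df itself.
df.loc['2012-02-01']['factor'] = 31 / 29

# A single .loc call with both labels writes directly into df.
df.loc['2012-02-01', 'factor'] = 31 / 29
print(df.loc['2012-02-01', 'factor'])  # 1.0689655172413792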

Function on columns inside dataframe

I have the following data frame
Attributes Adj Close
Symbols OMC PUB WPP ^IXIC ^DJI
Date
2015-06-30 NaN NaN NaN NaN NaN
2015-07-01 0.005900 0.001178 0.012686 0.005264 0.007855
2015-07-02 -0.001825 -0.004116 0.001648 0.004484 0.006289
2015-07-06 -0.003267 -0.032502 -0.010842 0.001036 0.003665
2015-07-07 0.015070 -0.037371 -0.017480 0.002142 0.008943
I want to find the dates on which the value in 'PUB' is greater than the value in '^IXIC'.
I read through what other people have posted online and was thinking about adding another column and populating it with np.where, but I wondered whether there is a way to get the dates without that whole process.
(Something like a for loop with an if condition that walks the dataframe and returns the dates when the condition is met.)
You do not need a for loop here; index the DatetimeIndex with a boolean mask:
dates = df.index[df['PUB'] > df['^IXIC']]
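For example, with a small frame of made-up returns (note: if your columns are a MultiIndex, as the printout above suggests, you may need df[('Adj Close', 'PUB')] instead of df['PUB']):

import pandas as pd

# Hypothetical values, invented for illustration.
df = pd.DataFrame(
    {'PUB': [0.010, -0.004, 0.030],
     '^IXIC': [0.005, 0.004, 0.001]},
    index=pd.to_datetime(['2015-07-01', '2015-07-02', '2015-07-06']),
)

dates = df.index[df['PUB'] > df['^IXIC']]
print(dates)
# DatetimeIndex(['2015-07-01', '2015-07-06'], dtype='datetime64[ns]', freq=None)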

How to join two columns in a dataframe by overwriting only NaN values in pandas?

I am trying to join two columns by overwriting only the NaN values in the second column.
New.Market.Cap New.Market.Cap2 Expected.Output
Date Symbol
2017-01-01 BTC 4.467053e+09 NaN 4.467053e+09
ETH 7.148243e+08 6.059076e+08 6.059076e+08
XRP 3.633730e+08 NaN 3.633730e+08
2017-01-02 BTC 4.575871e+09 NaN 4.575871e+09
ETH 7.334621e+08 6.249679e+08 6.249679e+08
XRP 3.633730e+08 NaN 3.633730e+08
I've tried multiple things, but could not make it work.
Use Series.combine_first or Series.fillna:
df['Expected.Output'] = df['New.Market.Cap2'].combine_first(df['New.Market.Cap'])
Or:
df['Expected.Output'] = df['New.Market.Cap2'].fillna(df['New.Market.Cap'])
If you also need to remove the original columns, DataFrame.pop is your friend:
df['Expected.Output'] = df.pop('New.Market.Cap2').combine_first(df.pop('New.Market.Cap'))
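A runnable sketch that rebuilds a frame like the question's (values copied from the sample above) and applies combine_first:

import pandas as pd
import numpy as np

idx = pd.MultiIndex.from_product(
    [['2017-01-01', '2017-01-02'], ['BTC', 'ETH', 'XRP']],
    names=['Date', 'Symbol'],
)
df = pd.DataFrame({
    'New.Market.Cap':  [4.467053e9, 7.148243e8, 3.633730e8,
                        4.575871e9, 7.334621e8, 3.633730e8],
    'New.Market.Cap2': [np.nan, 6.059076e8, np.nan,
                        np.nan, 6.249679e8, np.nan],
}, index=idx)

# Take New.Market.Cap2 where it is not NaN, otherwise New.Market.Cap.
df['Expected.Output'] = df['New.Market.Cap2'].combine_first(df['New.Market.Cap'])
print(df)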

Subsetting a Pandas series

I have a Pandas series. Basically one specific row of a pandas data frame.
NY.GDP.DEFL.ZS_logdiff 0.341671
NY.GDP.DISC.CN 0.078261
NY.GDP.DISC.KN 0.083890
NY.GDP.FRST.RT.ZS 0.296574
NY.GDP.MINR.RT.ZS 0.264811
NY.GDP.MKTP.CD_logdiff 0.522725
NY.GDP.MKTP.CN_logdiff 0.884601
NY.GDP.MKTP.KD_logdiff 0.990679
NY.GDP.MKTP.KD.ZG 0.992603
NY.GDP.MKTP.KN_logdiff -0.077253
NY.GDP.MKTP.PP.CD_logDiff 0.856861
NY.GDP.MKTP.PP.KD_logdiff 0.990679
NY.GDP.NGAS.RT.ZS -0.018126
NY.GDP.PCAP.CD_logdiff 0.523433
NY.GDP.PCAP.KD.ZG 1.000000
NY.GDP.PCAP.KN_logdiff 0.999456
NY.GDP.PCAP.PP.CD_logdff 0.857321
NY.GDP.PCAP.PP.KD_logdiff 0.999456
Name: NY.GDP.PCAP.KD.ZG, dtype: float64
The first column is the index, as you would find in a series. I want to get all these index names in a list, keeping only those whose absolute value in the right-hand column is less than 0.5. For context, this series is the row corresponding to the variable NY.GDP.PCAP.KD.ZG in a correlation matrix, and I want to retain that variable along with the variables whose correlation with it is less than 0.5; the rest I will drop from the dataframe.
Currently I do something like this, but it also keeps the NaN values:
print(tourism[columns].corr().ix[14].where(np.absolute(tourism[columns].corr().ix[14]<0.5)))
where tourism is the data frame, columns is the set of columns on which I did the correlation analysis, and 14 is the row in the correlation matrix corresponding to the column mentioned above. This gives:
NY.GDP.DEFL.ZS_logdiff 0.341671
NY.GDP.DISC.CN 0.078261
NY.GDP.DISC.KN 0.083890
NY.GDP.FRST.RT.ZS 0.296574
NY.GDP.MINR.RT.ZS 0.264811
NY.GDP.MKTP.CD_logdiff NaN
NY.GDP.MKTP.CN_logdiff NaN
NY.GDP.MKTP.KD_logdiff NaN
NY.GDP.MKTP.KD.ZG NaN
NY.GDP.MKTP.KN_logdiff -0.077253
NY.GDP.MKTP.PP.CD_logDiff NaN
NY.GDP.MKTP.PP.KD_logdiff NaN
NY.GDP.NGAS.RT.ZS -0.018126
NY.GDP.PCAP.CD_logdiff NaN
NY.GDP.PCAP.KD.ZG NaN
NY.GDP.PCAP.KN_logdiff NaN
NY.GDP.PCAP.PP.CD_logdff NaN
NY.GDP.PCAP.PP.KD_logdiff NaN
Name: NY.GDP.PCAP.KD.ZG, dtype: float64
If x is your series, then:
x[x.abs() < 0.5].index
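Unlike .where, boolean indexing drops the non-matching entries instead of leaving them as NaN. A minimal sketch with a few of the question's entries:

import pandas as pd

x = pd.Series({
    'NY.GDP.DEFL.ZS_logdiff': 0.341671,
    'NY.GDP.MKTP.CD_logdiff': 0.522725,
    'NY.GDP.NGAS.RT.ZS': -0.018126,
    'NY.GDP.PCAP.KD.ZG': 1.000000,
})

keep = x[x.abs() < 0.5].index.tolist()
print(keep)  # ['NY.GDP.DEFL.ZS_logdiff', 'NY.GDP.NGAS.RT.ZS']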
