Function on columns inside dataframe - python

I have the following data frame:
Attributes Adj Close
Symbols OMC PUB WPP ^IXIC ^DJI
Date
2015-06-30 NaN NaN NaN NaN NaN
2015-07-01 0.005900 0.001178 0.012686 0.005264 0.007855
2015-07-02 -0.001825 -0.004116 0.001648 0.004484 0.006289
2015-07-06 -0.003267 -0.032502 -0.010842 0.001036 0.003665
2015-07-07 0.015070 -0.037371 -0.017480 0.002142 0.008943
I want to iterate through the columns and find the dates on which the value in 'PUB' is greater than the value in '^IXIC'.
I was reading through some of the things other people have posted online; I was thinking about making another column and using np.where to populate it, but I was wondering whether there is a way to get the dates without going through that entire process
(something like a for loop with an if condition that goes through the dataframe and returns the dates when the condition is met).

You do not need a for loop here:
dates = df.index[df['PUB'] > df['^IXIC']]
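Note that if the columns are a MultiIndex, as the printed frame above suggests (an 'Attributes' level over a 'Symbols' level, the layout pandas-datareader returns), each series needs a tuple key. A minimal sketch under that assumption:
import numpy as np
import pandas as pd

# small stand-in frame with the same two-level column layout as the question
cols = pd.MultiIndex.from_product(
    [['Adj Close'], ['OMC', 'PUB', 'WPP', '^IXIC', '^DJI']],
    names=['Attributes', 'Symbols'])
idx = pd.to_datetime(['2015-07-01', '2015-07-02', '2015-07-06'])
df = pd.DataFrame(np.random.randn(3, 5), index=idx, columns=cols)

# index the two series with ('Attributes', 'Symbols') tuples, then keep the dates
dates = df.index[df[('Adj Close', 'PUB')] > df[('Adj Close', '^IXIC')]]
print(dates)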

Related

Search long series for non-NaN entries

I am looking through a DataFrame with different kinds of data whose usefulness I'm trying to evaluate. So I am looking at each column and checking the kind of data it contains, e.g.
print(extract_df['Auslagenersatz'])
For some I get responses like this:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
..
263 NaN
264 NaN
265 NaN
266 NaN
267 NaN
I would like to check whether that column contains any information at all, so what I am looking for is something like
s = extract_df['Auslagenersatz']
print(s.loc[s == True])
where I am assuming that NaN is interpreted as False in the same way an empty set is. I would like it to return only those elements of the series that satisfy this condition (being not empty). The code above does not work, however: I get an empty result even for columns that I know have non-NaN entries.
I oriented myself with this post: How to select rows from a DataFrame based on column values,
but I can't figure out where I'm going wrong or how to do this instead. The problem comes up often, so any help is well appreciated.
import pandas as pd
df = pd.DataFrame({'A':[2,3,None, 4,None], 'B':[2,13,None, None,None], 'C':[None,3,None, 4,None]})
If you want to see the non-NA values of column A, then:
df[~df['A'].isna()]
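If the goal is only to test whether the column holds any data at all, Series.notna answers that directly. A minimal sketch, with a note on why the s == True attempt returns nothing:
import pandas as pd

df = pd.DataFrame({'A': [2, 3, None, 4, None]})
s = df['A']

# s == True compares every value to the boolean True, which is False for
# numbers like 2.0 and also False for NaN, so the mask selects nothing.
print(s.notna().any())  # True -> the column has at least one real value
print(s.dropna())       # only the non-NaN elements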

Compare and find duplicated values (not entire columns) across a data frame with Python

I have a large data frame of schedules, and I need to count the number of experiments run. The challenge is that usage is repeated across rows (which is OK) but duplicated in some, though not all, columns. I want to remove the second entry (if duplicated), but I can't delete the entire second column because it will contain some new values too. How can I compare individual entries for two columns side by side and delete the second if there is a duplicate?
An experiment runs for a maximum of two days, so three days in a row means a new event with the same name starts on the third day.
The actual text for the experiment names is complicated and the data frame is 120 columns wide, so typing this in as a list or dictionary isn't possible. I'm hoping for a python or numpy function, but could use a loop.
Here are pictures for an example of the starting data frame and the desired output: [starting data frame example], [de-duplicated data frame example].
This is a hack, similar to @Params' answer, but it would be faster because you aren't calling .iloc a lot. The basic idea is to transpose the data frame and repeat a shift-and-compare operation as many times as needed to cover all of the columns, then transpose it back to get the result the OP wants.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Monday': ['exp_A', 'exp_A', 'exp_A', 'exp_A', 'exp_B', np.nan, np.nan, np.nan, 'exp_D', 'exp_D'],
    'Tuesday': ['exp_A', 'exp_A', 'exp_A', 'exp_A', 'exp_B', 'exp_B', 'exp_C', 'exp_C', 'exp_D', 'exp_D'],
    'Wednesday': ['exp_A', 'exp_D', np.nan, np.nan, np.nan, 'exp_B', 'exp_C', 'exp_C', 'exp_C', np.nan],
    'Thursday': ['exp_A', 'exp_D', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'exp_C', np.nan]
})

df = df.T  # transpose so consecutive days become consecutive rows
for i in range(int(np.ceil(df.shape[0] / 2))):
    # NaN out a cell that repeats the previous day but differs from two days
    # before: a value on three consecutive days is a new event on day three
    df[(df == df.shift(1)) & (df != df.shift(2))] = np.nan
df = df.T  # back to the original orientation
Monday Tuesday Wednesday Thursday
0 exp_A NaN exp_A NaN
1 exp_A NaN exp_D NaN
2 exp_A NaN NaN NaN
3 exp_A NaN NaN NaN
4 exp_B NaN NaN NaN
5 NaN exp_B NaN NaN
6 NaN exp_C NaN NaN
7 NaN exp_C NaN NaN
8 exp_D NaN exp_C NaN
9 exp_D NaN NaN NaN
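For what it's worth, the same masking works without the two transposes by shifting along the columns instead of the rows; a sketch of that variant on a cut-down frame (it assumes, like the answer, that column order reflects day order):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Monday':    ['exp_A', 'exp_D'],
    'Tuesday':   ['exp_A', 'exp_D'],
    'Wednesday': ['exp_A', np.nan],
    'Thursday':  ['exp_A', np.nan]})

# shift(axis=1) compares each cell with the one in the previous column
for i in range(int(np.ceil(df.shape[1] / 2))):
    df[(df == df.shift(1, axis=1)) & (df != df.shift(2, axis=1))] = np.nan
print(df)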

How to apply the "diff()" method to multiple columns?

I'm trying to apply the diff() method to multiple columns to make the data stationary for a time series.
x1 = frc_data['004L004T10'].diff(periods=8)
x1
Date
2013-10-01 NaN
2013-11-01 NaN
2013-12-01 NaN
2014-01-01 NaN
2014-02-01 NaN
So diff is working for a single column.
However, diff is not working for all the columns:
for x in frc_data.columns:
    frc_data[x].diff(periods=1)
No errors, although the data remains unchanged.
In order to change the DataFrame, you need to assign the diff back to each column, i.e.
for x in frc_data.columns:
    frc_data[x] = frc_data[x].diff(periods=1)
A loop is not necessary, though; just remove the [x] to difference all columns at once:
frc_data = frc_data.diff(periods=1)
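If only some of the columns should be made stationary, a column subset works the same way. A minimal sketch; the second column name and the label column are invented for illustration:
import pandas as pd

frc_data = pd.DataFrame({
    '004L004T10': [1.0, 3.0, 6.0, 10.0],
    '004L005T10': [2.0, 2.5, 4.0, 7.0],  # hypothetical second series
    'region': ['a', 'b', 'c', 'd']})     # hypothetical label column to keep

cols = ['004L004T10', '004L005T10']
frc_data[cols] = frc_data[cols].diff(periods=1)
print(frc_data)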

How to join two columns in a dataframe by overwriting only NaN values in pandas?

I am trying to join two columns by overwriting only NaN values in the second column. I have tried multiple things, but nothing is working.
New.Market.Cap New.Market.Cap2 Expected.Output
Date Symbol
2017-01-01 BTC 4.467053e+09 NaN 4.467053e+09
ETH 7.148243e+08 6.059076e+08 6.059076e+08
XRP 3.633730e+08 NaN 3.633730e+08
2017-01-02 BTC 4.575871e+09 NaN 4.575871e+09
ETH 7.334621e+08 6.249679e+08 6.249679e+08
XRP 3.633730e+08 NaN 3.633730e+08
Use Series.combine_first or Series.fillna:
df['Expected.Output'] = df['New.Market.Cap2'].combine_first(df['New.Market.Cap'])
Or:
df['Expected.Output'] = df['New.Market.Cap2'].fillna(df['New.Market.Cap'])
If you also need to remove the original columns, DataFrame.pop is your friend:
df['Expected.Output'] = df.pop('New.Market.Cap2').combine_first(df.pop('New.Market.Cap'))
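A minimal end-to-end sketch on data shaped like the question's (values taken from its first rows):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'New.Market.Cap':  [4.467053e+09, 7.148243e+08, 3.633730e+08],
    'New.Market.Cap2': [np.nan, 6.059076e+08, np.nan]})

# combine_first keeps Cap2 where present and falls back to Cap for the NaNs
df['Expected.Output'] = df['New.Market.Cap2'].combine_first(df['New.Market.Cap'])
print(df)
Note that pop removes each column from the frame as a side effect, which is why the pop-based one-liner above leaves only Expected.Output behind.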

Plotting counts of a dataframe grouped by timestamp

So I have a pandas dataframe which has a large number of columns, and one of the columns is a timestamp in datetime format. Each row in the dataframe represents a single "event". What I'm trying to do is graph the frequency of these events over time. Basically a simple bar graph showing how many events per month.
Started with this code:
data.groupby([(data.Timestamp.dt.year),(data.Timestamp.dt.month)]).count().plot(kind = 'bar')
plt.show()
This "kind of" works. But there are 2 problems:
1) The graph comes with a legend which includes all the columns in the original data (like 30+ columns). And each bar on the graph has a tiny sub-bar for each of the columns (all of which are the same value since I'm just counting events).
2) There are some months where there are zero events. And these months don't show up on the graph at all.
I finally came up with code to get the graph looking the way I wanted, but it seems to me that I'm not doing this the "correct" way, since this must be a fairly common use case.
Basically I created a new dataframe with one column "count" and an index that's a string representation of month/year. I populated it with zeroes over the time range I care about and then copied over the data from the first frame into the new one. Here is the code:
import pandas as pd
import matplotlib.pyplot as plt

cnt = data.groupby([(data.Timestamp.dt.year), (data.Timestamp.dt.month)]).count()

index = []
for year in [2015, 2016, 2017, 2018]:
    for month in range(1, 13):
        index.append('%04d-%02d' % (year, month))

cnt_new = pd.DataFrame(index=index, columns=['count'])
cnt_new = cnt_new.fillna(0)
for i, row in cnt.iterrows():
    cnt_new.at['%04d-%02d' % i, 'count'] = row[0]

cnt_new.plot(kind='bar')
plt.show()
Anyone know an easier way to go about this?
EDIT --> Per request, here's an idea of the type of dataframe. It's the result of an SQL query. The actual data is my company's, so...
Timestamp FirstName LastName HairColor \
0 2018-11-30 02:16:11 Fred Schwartz brown
1 2018-11-29 16:25:55 Sam Smith black
2 2018-11-19 21:12:29 Helen Hunt red
OK, so I think I got it. Thanks to Yuca for the resample command. I just need to run that on the Timestamp series (rather than on the whole dataframe) and it gives me exactly what I was looking for.
data.index = data.Timestamp
data.Timestamp.resample('M').count()
Timestamp
2017-11-30 0
2017-12-31 0
2018-01-31 1
2018-02-28 2
2018-03-31 7
2018-04-30 9
2018-05-31 2
2018-06-30 6
2018-07-31 5
2018-08-31 4
2018-09-30 1
2018-10-31 0
2018-11-30 5
So the OP's request is: "Basically a simple bar graph showing how many events per month".
Using resample with a monthly frequency yields the desired result:
df[['FirstName']].resample('M').count()
Output:
FirstName
Timestamp
2018-11-30 3
To include non-observed months, we need to create a baseline calendar:
df_a = pd.DataFrame(index = pd.date_range(df.index[0].date(), periods=12, freq='M'))
and then assign the result of our resample to it:
df_a['count'] = df[['FirstName']].resample('M').count()
Output:
count
2018-11-30 3.0
2018-12-31 NaN
2019-01-31 NaN
2019-02-28 NaN
2019-03-31 NaN
2019-04-30 NaN
2019-05-31 NaN
2019-06-30 NaN
2019-07-31 NaN
2019-08-31 NaN
2019-09-30 NaN
2019-10-31 NaN
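Putting the two steps together, with the NaNs turned into zeroes before plotting; a sketch on stand-in data (rows mirror the sample the OP posted):
import pandas as pd
import matplotlib.pyplot as plt

# one row per event, indexed by its timestamp
df = pd.DataFrame(
    {'FirstName': ['Fred', 'Sam', 'Helen']},
    index=pd.to_datetime(['2018-11-30 02:16:11',
                          '2018-11-29 16:25:55',
                          '2018-11-19 21:12:29']))

# full monthly calendar so that months without events still appear as 0
calendar = pd.date_range('2018-01-31', periods=12, freq='M')
counts = df['FirstName'].resample('M').count().reindex(calendar, fill_value=0)

counts.plot(kind='bar')
plt.show()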
