Sorting column entries in Pandas by name and week - python

I have a CSV file which I am trying to manipulate and plot. It is tabular data of an entire year of statistics per company. I would like to plot the earnings of (say) Google for each week over a decade, so I know I have to splice together several years of data in the form of arrays. However, I am not sure how to organize this data in terms of weeks.
(1) How do I search the columns to find only 'Google', and (2) how can I plot this by week? I think I would have to sum over days 1-7.
fname = "file.csv"
import pandas as pd
data = pd.read_csv(fname)
data.columns
#OUTPUT ..., 'Date', 'DayWeek',..., 'Companies', ..., 'Earnings'
01/05/2008 7 Yahoo 5678.89
01/06/2008 1 Google 3486.84
01/07/2008 2 Google 2379.23
01/08/2008 3 Ask 3578.22
01/09/2008 4 Google 2341.10
01/10/2008 5 DuckDuckGo 8410.00

Something like this:
data['week'] = pd.DatetimeIndex(data['Date']).to_period('W-SAT').to_timestamp(how='end')
data[data['Companies']=='Google'].groupby('week')['Earnings'].sum()
I suspect there's a more elegant way to get the week variable than what I'm doing.
This should also work, but we have to make the index the date and filter for Google first:
goog = data[data['Companies'] == 'Google'].copy()
goog.index = pd.DatetimeIndex(goog['Date'])
goog_totals = goog['Earnings'].resample('W').sum().dropna()
At that point, you can just plot with goog_totals.plot().
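Putting both pieces together, here is a minimal end-to-end sketch. It assumes only the column names shown above ('Date', 'Companies', 'Earnings'), that 'Date' holds US-style dates, and uses 'W-SAT' to match the Sunday-Saturday weeks implied by the DayWeek column:
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("file.csv", parse_dates=['Date'])
goog = data[data['Companies'] == 'Google']                            # (1) keep only Google rows
weekly = goog.set_index('Date')['Earnings'].resample('W-SAT').sum()   # (2) one summed value per week
weekly.plot()
plt.show()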

Related

How to display aggregated and non-aggregated values at the same time?

I've got an hourly time series over the stretch of a year. I'd like to display daily and/or monthly aggregated values along with the source data in a plot. The most solid way would presumably be to add those aggregated values to the source dataframe and take it from there. I know how to take an hourly series like this:
And show it aggregated day by day for the whole year like this:
But what I'm looking for is to display the whole thing like below, where the aggregated data are shown together with the source data. Mock example:
And I'd like to do it for various time aggregations like day, week, month, quarter and year.
I know this question is a bit broad, but I've been banging my head against this problem for longer than I'd like to admit. Thank you for any suggestions!
Code sample:
import pandas as pd
import numpy as np
np.random.seed(1)
time = pd.date_range(start='01.01.2020', end='31.12.2020', freq='1H')
A = np.random.uniform(low=-1, high=1, size=len(time)).tolist()
df1 = pd.DataFrame({'time':time, 'A':np.cumsum(A)})
df1.set_index('time', inplace=True)
df1.plot()
times = pd.DatetimeIndex(df1.index)
df2 = df1.groupby([times.month, times.day]).mean()
df2.plot()
You are looking for the step function, and also a different way to group by:
import matplotlib.pyplot as plt
# replace '7D' with '1D' to match your code
# but 1 day might be too small to see the steps
df2 = df1.groupby(df1.index.floor('7D')).mean()
plt.step(df2.index, df2.A, c='r')
plt.plot(df1.index, df1.A)
Output:
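For the calendar-based aggregations the question also asks about (month, quarter, year), floor() does not apply, since those periods have varying lengths. Here is a sketch along the same lines using resample instead, assuming df1 from the question:
import matplotlib.pyplot as plt

# one aggregated value per calendar month, labelled at the month start;
# swap 'MS' for 'QS' (quarter) or 'YS' (year) for coarser aggregations
monthly = df1.resample('MS').mean()
plt.plot(df1.index, df1.A)                                # hourly source data
plt.step(monthly.index, monthly.A, where='post', c='r')   # aggregated values as steps
plt.show()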

Plot a DataFrame over a range of time indices

I am new to Python and pandas.
I have a dataset loaded into Python as a DataFrame. The index of the DataFrame consists of times in the format "2018-01-01 00:00:00". My dataset ranges from "2018-01-01 00:00:00" to "2018-12-31 23:59:59". The data column is named "X".
I can plot the entire dataset using matplotlib:
plt.plot(data.index, data["X"])
However, I want to plot different segments of the time series: 1 month, 6 months, 2 days, 3 seconds, etc.
What is the best way to do this?
Thanks
If you want to plot a month, you could do
data.loc['2018-02',"X"].plot()
6 months
data.loc['2018-02':'2018-08',"X"].plot()
and the same logic applies for other ranges.
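For example, the finer windows mentioned in the question (a couple of days, a few seconds) can be sliced the same way, as long as the index is a DatetimeIndex; the dates below are only placeholders:
data.loc['2018-03-05':'2018-03-06', "X"].plot()                     # 2 days
data.loc['2018-06-01 12:00:00':'2018-06-01 12:00:03', "X"].plot()   # 3 seconds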
You might need to do one more processing step on your index to ensure you're dealing with datetime objects rather than strings.
import pandas as pd
new_data = (
    data
    .assign(datetime=lambda df: pd.to_datetime(df.index))
    .set_index('datetime')
)
new_data["X"].plot()
This should get us really close to what you want, but I haven't tested it on data with your date format.

python resample/group by OHLC data

I have hourly OHLC data that I am trying to regroup so that I see only the 9pm to 5am window in one row, and then the same for every day.
I've tried several ways suggested here, but without success.
index_21_09 = eur.index.indexer_between_time('21:00','05:00')
df = eur.iloc[index_21_09]
With this I filter the data from 21:00 to 05:00, but it ends up in several rows; I need it in one row.
Then, I tried this:
df_day_max = df.groupby(pd.Grouper(freq='8h')).max()
df_day_min = df.groupby(pd.Grouper(freq = '8h')).min()
df_group = (pd.concat([df_day_max['High'], df_day_min['Low ']], axis=1).dropna())
But I get data grouped from 16:00 to 00:00! How, when I previously filtered them to 21:00-05:00?
df_resample = df.resample('8H').ohlc()
With this, I get the same result, only with NaN values.
Any help with this? Thanks.
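A likely reason for the 16:00 labels is that resample/Grouper anchor 8-hour bins at midnight by default (00:00, 08:00, 16:00). Below is a possible sketch, not tested on the real data, that labels each 21:00-05:00 session by the day it starts on and aggregates once per session; the column names Open/High/Low/Close are assumed:
import pandas as pd

night = eur.between_time('21:00', '05:00')                    # keep only the 21:00-05:00 hours
session = (night.index - pd.Timedelta(hours=21)).floor('D')   # 21:00 day D .. 05:00 day D+1 -> label day D
ohlc = night.groupby(session).agg(
    Open=('Open', 'first'),
    High=('High', 'max'),
    Low=('Low', 'min'),
    Close=('Close', 'last'),
)
Newer pandas versions also accept an offset argument, e.g. df.resample('8H', offset='21h').ohlc(), to shift the bin anchor to 21:00, though such a bin ends just before 05:00, so a row stamped exactly 05:00 would fall into the next bin.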

Lag values and differences in pandas dataframe with missing quarterly data

Though Pandas has time series functionality, I am still struggling with dataframes that have incomplete time series data.
See the pictures below: the lower picture has complete data, the upper has gaps. Both show correct values. In red are the columns that I want to calculate using the data in black. Column Cumm_Issd shows the accumulated issued shares during the year; MV is market value.
I want to calculate the issued shares per quarter (IssdQtr), the quarterly change in Market Value (D_MV_Q) and the MV of last year (L_MV_Y).
For the underlying CSV data, see this link for the full data and this link for the gapped data. There are two firms, 1020180 and 1020201.
However, when I try Pandas' shift method, it fails when there are gaps; try it yourself using the CSV files and the code below. All computed columns (DiffEq, Dif1MV, Lag4MV) differ, for some quarters, from IssdQtr, D_MV_Q and L_MV_Y, respectively.
Are there ways to deal with gaps in data using Pandas?
import pandas as pd
import numpy as np
import os
dfg = pd.read_csv('example_soverflow_gaps.csv',low_memory=False)
dfg['date'] = pd.to_datetime(dfg['Period'], format='%Y%m%d')
dfg['Q'] = pd.DatetimeIndex(dfg['date']).to_period('Q')
dfg['year'] = dfg['date'].dt.year
dfg['DiffEq'] = dfg.sort_values(['Q']).groupby(['Firm','year'])['Cumm_Issd'].diff()
dfg['Dif1MV'] = dfg.groupby(['Firm'])['MV'].diff(1)
dfg['Lag4MV'] = dfg.groupby(['Firm'])['MV'].shift(4)
Gapped data:
Full data:
Solved the basic problem by using a merge. First, create a variable that shows the lagged date or quarter. Here we want last year's MV (4 quarters back):
from pandas.tseries.offsets import QuarterEnd
dfg['lagQ'] = dfg['date'] + QuarterEnd(-4)
Then create a data-frame with the keys (Firm and date) and the relevant variable (here MV).
lagset=dfg[['Firm','date', 'MV']].copy()
lagset.rename(columns={'MV':'Lag_MV', 'date':'lagQ'}, inplace=True)
Lastly, merge the new frame into the existing one:
dfg=pd.merge(dfg, lagset, on=['Firm', 'lagQ'], how='left')
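For reference, a sketch of an alternative to the merge: reindex each firm onto a complete quarterly grid first, so that diff()/shift() step over the gaps as missing values instead of skipping to the next available quarter. It assumes one row per firm and quarter, and reuses the column names from the question:
import pandas as pd

firms = dfg['Firm'].unique()
quarters = pd.period_range(dfg['Q'].min(), dfg['Q'].max(), freq='Q')
full_idx = pd.MultiIndex.from_product([firms, quarters], names=['Firm', 'Q'])

dfq = dfg.set_index(['Firm', 'Q']).reindex(full_idx).sort_index()
dfq['D_MV_Q'] = dfq.groupby(level='Firm')['MV'].diff(1)    # quarter-on-quarter change in MV
dfq['L_MV_Y'] = dfq.groupby(level='Firm')['MV'].shift(4)   # MV from the same quarter one year earlier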

Average of daily count of records per month in a Pandas DataFrame

I have a pandas DataFrame with a TIMESTAMP column, which is of the datetime64 data type. Please keep in mind, initially this column is not set as the index; the index is just regular integers, and the first few rows look like this:
TIMESTAMP TYPE
0 2014-07-25 11:50:30.640 2
1 2014-07-25 11:50:46.160 3
2 2014-07-25 11:50:57.370 2
There is an arbitrary number of records for each day, and there may be days with no data. What I am trying to get is the average number of daily records per month, then plot it as a bar chart with months on the x-axis (April 2014, May 2014, etc.). I managed to calculate these values using the code below
dfWIM.index = dfWIM.TIMESTAMP
for i in range(dfWIM.TIMESTAMP.dt.year.min(), dfWIM.TIMESTAMP.dt.year.max() + 1):
    for j in range(1, 13):
        print(dfWIM[(dfWIM.TIMESTAMP.dt.year == i) & (dfWIM.TIMESTAMP.dt.month == j)].resample('D').count().TIMESTAMP.mean())
which gives the following output:
nan
nan
3100.14285714
6746.7037037
9716.42857143
10318.5806452
9395.56666667
9883.64516129
8766.03225806
9297.78571429
10039.6774194
nan
nan
nan
This is ok as it is, and with some more work I can map the results to the correct month names and then plot the bar chart. However, I am not sure if this is the correct/best way, and I suspect there might be an easier way to get the results using Pandas.
I would be glad to hear what you think. Thanks!
NOTE: If I do not set the TIMESTAMP column as the index, I get a "reduction operation 'mean' not allowed for this dtype" error.
I think you'll want to do two rounds of groupby: first group by day and count the instances, then group by month and compute the mean of the daily counts. You could do something like this.
First I'll generate some fake data that looks like yours:
import pandas as pd
import numpy as np

# make 1000 random times throughout the year
N = 1000
times = pd.date_range('2014', '2015', freq='min')
ind = np.random.permutation(np.arange(len(times)))[:N]
data = pd.DataFrame({'TIMESTAMP': times[ind],
                     'TYPE': np.random.randint(0, 10, N)})
data.head()
Now I'll do the two groupbys using pd.Grouper (pd.TimeGrouper in older pandas versions) and plot the monthly average counts:
import seaborn as sns  # for nice plot styles (optional)
daily = data.set_index('TIMESTAMP').groupby(pd.Grouper(freq='D'))['TYPE'].count()
monthly = daily.groupby(pd.Grouper(freq='M')).mean()
ax = monthly.plot(kind='bar')
The formatting along the x axis leaves something to be desired, but you can tweak that if necessary.
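For reference, the two groupbys can also be written with resample, and the month labels can be made readable before plotting; this sketch reuses the fake data frame built above:
# records per calendar day, then the average daily count per month
monthly = (data.set_index('TIMESTAMP')
               .resample('D')['TYPE'].count()
               .resample('M').mean())
monthly.index = monthly.index.strftime('%B %Y')   # e.g. 'April 2014'
ax = monthly.plot(kind='bar')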
