get element in column that preceded new column - python

I'm trying to collect the items from column 'data' that immediately precede the values I collected in column 'min', and put them in a new column.
Here is the data (imported with pd.read_csv):
time,data
12/15/18 01:10 AM,130352.146180556
12/16/18 01:45 AM,130355.219097222
12/17/18 01:47 AM,130358.223263889
12/18/18 02:15 AM,130361.281701389
12/19/18 03:15 AM,130364.406597222
12/20/18 03:25 AM,130352.427430556
12/21/18 03:27 AM,130355.431597222
12/22/18 05:18 AM,130358.663541667
12/23/18 06:44 AM,130361.842430556
12/24/18 07:19 AM,130364.915243056
12/25/18 07:33 AM,130352.944409722
12/26/18 07:50 AM,130355.979826389
12/27/18 09:13 AM,130359.153472222
12/28/18 11:53 AM,130362.4871875
12/29/18 01:23 PM,130365.673263889
12/30/18 02:17 PM,130353.785763889
12/31/18 02:23 PM,130356.798263889
01/01/19 04:41 PM,130360.085763889
01/02/19 05:01 PM,130363.128125
and my code:
import pandas as pd
import numpy as np
from scipy import signal
from scipy.signal import argrelextrema
import datetime

diff = pd.DataFrame()
df = pd.read_csv('saw_data2.csv')
df['time'] = pd.to_datetime(df['time'])
print(df.head())

n = 2  # number of points to be checked before and after
# Find local minima (np.less_equal flags troughs, not peaks)
df['min'] = df.iloc[argrelextrema(df.data.values, np.less_equal, order=n)[0]]['data']
If you plot the data, you'll see it is similar to a sawtooth. The element of 'data' just before each value captured in 'min' is the element I want to put in a new column, df['new_col'].
I've tried many things, like
df['new_col']=df.index.get_loc(df['min'].df['data'])
and,
df['new_col']=df['min'].shift() #obviously wrong

IIUC, you can do the shift before selecting the rows with a value in min:
df['new_col'] = df.shift().loc[df['min'].notna(), 'data']
print (df)
time data min new_col
0 12/15/18 01:10 AM 130352.146181 130352.146181 NaN
1 12/16/18 01:45 AM 130355.219097 NaN NaN
2 12/17/18 01:47 AM 130358.223264 NaN NaN
3 12/18/18 02:15 AM 130361.281701 NaN NaN
4 12/19/18 03:15 AM 130364.406597 NaN NaN
5 12/20/18 03:25 AM 130352.427431 130352.427431 130364.406597
6 12/21/18 03:27 AM 130355.431597 NaN NaN
7 12/22/18 05:18 AM 130358.663542 NaN NaN
8 12/23/18 06:44 AM 130361.842431 NaN NaN
9 12/24/18 07:19 AM 130364.915243 NaN NaN
10 12/25/18 07:33 AM 130352.944410 130352.944410 130364.915243
11 12/26/18 07:50 AM 130355.979826 NaN NaN
12 12/27/18 09:13 AM 130359.153472 NaN NaN
13 12/28/18 11:53 AM 130362.487187 NaN NaN
14 12/29/18 01:23 PM 130365.673264 NaN NaN
15 12/30/18 02:17 PM 130353.785764 130353.785764 130365.673264
16 12/31/18 02:23 PM 130356.798264 NaN NaN
17 01/01/19 04:41 PM 130360.085764 NaN NaN
18 01/02/19 05:01 PM 130363.128125 NaN NaN
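Equivalently, you can shift the whole 'data' column and mask it where 'min' is missing. A minimal sketch of the same idea (column names taken from the question):
# Shift 'data' down one row, then keep the value only on rows where a
# local minimum was found; every other row becomes NaN.
df['new_col'] = df['data'].shift().where(df['min'].notna())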

Related

Not able to scrape entire table using pd.read_html

I tried using pd.read_html to scrape a table, but the last 3 columns come back as NaN. Here is the code I used:
import pandas as pd
url = 'https://www.actionnetwork.com/mlb/public-betting'
todays_games = pd.read_html(url)[0]
There are 7 columns in total, and it grabs all of the headers, but not the data in the last 3 columns. I also tried parsing this using BeautifulSoup, but got the same result.
print(todays_games)
Scheduled Open ... Diff Bets
0 5:05 PM 951MarlinsMIA952NationalsWSH -118+100 ... NaN NaN
1 5:10 PM 979BrewersMIL980TigersDET -227+188 ... NaN NaN
2 7:07 PM 965RaysTB966Blue JaysTOR +150-175 ... NaN NaN
3 8:10 PM 967Red SoxBOS968MarinersSEA -125+105 ... NaN NaN
4 10:35 PM 953RedsCIN954PiratesPIT -154+135 ... NaN NaN
5 11:05 PM 955CubsCHC956PhilliesPHI +170-200 ... NaN NaN
6 11:05 PM 969YankeesNYY970OriolesBAL -227+188 ... NaN NaN
7 11:10 PM 957CardinalsSTL958MetsNYM +135-154 ... NaN NaN
8 11:20 PM 959RockiesCOL960BravesATL +170-200 ... NaN NaN
9 11:40 PM 971IndiansCLE972TwinsMIN +100-118 ... NaN NaN
10 Thu 9/16, 12:05 AM 973AstrosHOU974RangersTEX -213+175 ... NaN NaN
11 Thu 9/16, 12:10 AM 975AngelsLAA976White SoxCWS +160-189 ... NaN NaN
12 Thu 9/16, 12:10 AM 977AthleticsOAK978RoyalsKC -149+125 ... NaN NaN
13 Thu 9/16, 1:45 AM 961PadresSD962GiantsSF +103-120 ... NaN NaN
14 Thu 9/16, 2:10 AM 963DiamondbacksARI964DodgersLAD -185+155 ... NaN NaN
I'm assuming the problem has something to do with the HTML code. Can anyone help me solve this?
The last columns are most likely filled in by JavaScript after the page loads, so they never appear in the static HTML that pd.read_html (or BeautifulSoup) sees. Instead, send an HTTP GET to https://api.actionnetwork.com/web/v1/scoreboard/mlb?bookIds=15,30,68,75,69,76,71,79,247,123,263&date=20210915
and get the data you are looking for as JSON.
import requests

r = requests.get(
    'https://api.actionnetwork.com/web/v1/scoreboard/mlb?bookIds=15,30,68,75,69,76,71,79,247,123,263&date=20210915')
print(r.json())
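To turn the JSON into a DataFrame, something like the sketch below should work. Note that the 'games' key is an assumption about the response shape; inspect r.json() first and adjust the key accordingly:
import pandas as pd

payload = r.json()
# 'games' is a guess at the top-level key holding one record per game;
# verify it against the actual response before relying on it.
games = pd.json_normalize(payload['games'])
print(games.head())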

Dataframe shift moving data into random columns?

I'm using code to shift time series data that looks somewhat similar to this:
Year Player PTSN AVGN
2018 Aaron Donald 280.60 17.538
2018 J.J. Watt 259.80 16.238
2018 Danielle Hunter 237.60 14.850
2017 Aaron Donald 181.0 12.929
2016 Danielle Hunter 204.6 12.788
with the intent of getting it into something like this:
AVGN PTSN AVGN_prev PTSN_prev
Player Year
Aaron Donald 2016 NaN NaN NaN NaN
2017 12.929 181.0 NaN NaN
2018 17.538 280.6 12.929 181.0
Danielle Hunter 2016 12.788 204.6 NaN NaN
2017 8.325 133.2 12.788 204.6
2018 14.850 237.6 8.325 133.2
J.J. Watt 2016 NaN NaN NaN NaN
2017 NaN NaN NaN NaN
2018 16.238 259.8 NaN NaN
I'm using this code to make that happen:
res = df.set_index(['player', 'Year'])
idx = pd.MultiIndex.from_product([df['player'].unique(),
                                  df['Year'].unique()],
                                 names=['Player', 'Year'])
res = res.groupby(['player', 'Year']).apply(sum)
res = res.reindex(idx).sort_index()
res[columns] = res.groupby('Player')[list(res.columns)].shift(1)
with the addition of a groupby.sum() because some players in the dataframe moved from one team to another within the same season and I want to combine those numbers. However, the data I have is actually coming out extremely wrong. The data has too many columns to post, but it seems like the data from the previous year (_prev) is placed into random columns. It doesn't change and will always place it into the same wrong columns. Is this an issue caused by the groupby.sum()? Is it because I'm using a columns variable (containing all the same names as res.columns with the string '_prev' attached to them) and a list(res.columns)? And regardless of which it is, how do I solve this?
Here are the outputs of columns and res.columns:
columns:
['player_id_prev', 'position_prev', 'player_game_count_prev', 'team_name_prev', 'snap_counts_total_prev', 'snap_counts_pass_rush_prev', 'snap_counts_run_defense_prev', 'snap_counts_coverage_prev', 'grades_defense_prev', 'grades_run_defense_prev', 'grades_tackle_prev', 'grades_pass_rush_defense_prev', 'grades_coverage_defense_prev', 'total_pressures_prev', 'sacks_prev', 'hits_prev', 'hurries_prev', 'batted_passes_prev', 'tackles_prev', 'assists_prev', 'missed_tackles_prev', 'stops_prev', 'forced_fumbles_prev', 'targets_prev', 'receptions_prev', 'yards_prev', 'yards_per_reception_prev', 'yards_after_catch_prev', 'longest_prev', 'touchdowns_prev', 'interceptions_prev', 'pass_break_ups_prev', 'qb_rating_against_prev', 'penalties_prev', 'declined_penalties_prev']
res_columns:
['player_id', 'position', 'player_game_count', 'team_name',
'snap_counts_total', 'snap_counts_pass_rush', 'snap_counts_run_defense',
'snap_counts_coverage', 'grades_defense', 'grades_run_defense',
'grades_tackle', 'grades_pass_rush_defense', 'grades_coverage_defense',
'total_pressures', 'sacks', 'hits', 'hurries', 'batted_passes',
'tackles', 'assists', 'missed_tackles', 'stops', 'forced_fumbles',
'targets', 'receptions', 'yards', 'yards_per_reception',
'yards_after_catch', 'longest', 'touchdowns', 'interceptions',
'pass_break_ups', 'qb_rating_against', 'penalties',
'declined_penalties']
both are length 35 when tested.
I suggest using:
#first aggregate for unique MultiIndex
res = df.groupby(['Player', 'Year']).sum()

#MultiIndex of all Player/Year combinations
idx = pd.MultiIndex.from_product(res.index.levels,
                                 names=['Player', 'Year'])

#add the missing years
res = res.reindex(idx).sort_index()

#shift all columns, add suffix and join to original
res = res.join(res.groupby('Player').shift().add_suffix('_prev'))
print (res)
PTSN AVGN PTSN_prev AVGN_prev
Player Year
Aaron Donald 2016 NaN NaN NaN NaN
2017 181.0 12.929 NaN NaN
2018 280.6 17.538 181.0 12.929
Danielle Hunter 2016 204.6 12.788 NaN NaN
2017 NaN NaN 204.6 12.788
2018 237.6 14.850 NaN NaN
J.J. Watt 2016 NaN NaN NaN NaN
2017 NaN NaN NaN NaN
2018 259.8 16.238 NaN NaN
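As for why the original code scrambles the columns: assigning a DataFrame to a list of new column labels is done positionally, so if the order of names in columns does not match the order of res.columns, values land under the wrong labels. The add_suffix approach above sidesteps this entirely. A minimal sketch of the pitfall (toy names, not from the question):
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})
cols = ['b_prev', 'a_prev']        # order differs from df.columns
df[cols] = df[['a', 'b']].shift()  # assigned positionally: 'b_prev' gets shifted 'a'
print(df)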

Pandas not filtering on the basis of a column value

I have had success filtering on column values before, but for this dataframe the filter returns a dataframe full of NaN. I am not sure where I am wrong. I am posting the code with the results.
import pandas as pd
df = pd.read_csv("http://portal.amfiindia.com/DownloadNAVHistoryReport_Po.aspx?mf=17&tp=1&frmdt=04-Nov-2017&todt=02-Dec-2018",sep=";",parse_dates=['Date'])
df=df.drop(['Repurchase Price','Sale Price'],axis=1)
df = df.dropna()
df['Net Asset Value'] = df['Net Asset Value'].apply(pd.to_numeric,errors='coerce')
df.columns = [['scheme_code','scheme','nav','date']]
df[df['scheme_code'] == '123690']
The result of the filter is
scheme_code scheme nav date
2 123690 NaN nan NaT
3 123690 NaN nan NaT
4 123690 NaN nan NaT
5 123690 NaN nan NaT
6 123690 NaN nan NaT
7 123690 NaN nan NaT
8 123690 NaN nan NaT
9 123690 NaN nan NaT
10 123690 NaN nan NaT
11 123690 NaN nan NaT
12 123690 NaN nan NaT
13 123690 NaN nan NaT
14 123690 NaN nan NaT
15 123690 NaN nan NaT
16 123690 NaN nan NaT
17 123690 NaN nan NaT
18 123690 NaN nan NaT
19 123690 NaN nan NaT
20 123690 NaN nan NaT
21 123690 NaN nan NaT
22 123690 NaN nan NaT
23 123690 NaN nan NaT
24 123690 NaN nan NaT
25 123690 NaN nan NaT
26 123690 NaN nan NaT
27 123690 NaN nan NaT
28 123690 NaN nan NaT
29 123690 NaN nan NaT
30 123690 NaN nan NaT
31 123690 NaN nan NaT
However, if I look at the dataframe with the head method, I can see the actual data with values:
scheme_code scheme nav \
2 123690 Kotak Banking and PSU Debt - Growth 38.60
3 123690 Kotak Banking and PSU Debt - Growth 38.58
4 123690 Kotak Banking and PSU Debt - Growth 38.58
5 123690 Kotak Banking and PSU Debt - Growth 38.59
6 123690 Kotak Banking and PSU Debt - Growth 38.59
date
2 2017-11-06
3 2017-11-07
4 2017-11-08
5 2017-11-09
I also tried converting to numeric and still see the same result. I would appreciate it if someone could help me figure out the error.
The problem is that you specify the columns as a list of a list (note the double brackets), so pandas builds a MultiIndex for the columns and the filter condition is not met. Just change it to a simple list:
df.columns = ['scheme_code','scheme','nav','date']
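For illustration, a minimal sketch of what the double brackets do (toy data, not from the question):
import pandas as pd

df = pd.DataFrame([['123690', 38.60]])
df.columns = [['scheme_code', 'nav']]  # list of a list -> MultiIndex columns
print(df.columns)                      # a MultiIndex, not a plain Index

df.columns = ['scheme_code', 'nav']    # simple list -> plain Index
print(df.columns)                      # Index(['scheme_code', 'nav'], dtype='object')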

How to read a table with extra information as a dataframe and add new columns from that information

I have a file-like object generated from StringIO, which is a table with several lines of information ahead of the table (see below, starting from #TIMESTAMP).
I want to add extra columns to the existing table using the information "Date" and "UTCOffset - Time (subtraction)" from #TIMESTAMP, and "ZenAngle" from #GLOBAL_SUMMARY.
I used the pd.read_csv command to read it, but it only worked when I skipped the first 8 rows, which include the information I need. Also, the error "TypeError: data argument can't be an iterator" was reported when I tried to import the object below as a dataframe.
#TIMESTAMP
UTCOffset,Date,Time
+00:30:32,2011-09-05,08:32:21
#GLOBAL_SUMMARY
Time,IntACGIH,IntCIE,ZenAngle,MuValue,AzimAngle,Flag,TempC,O3,Err_O3,SO2,Err_SO2,F324
08:32:21,7.3576,52.758,59.109,1.929,114.427,000000,24,291,1,,,91.9
#GLOBAL
Wavelength,S-Irradiance,Time
290.0,0.000e+00
290.5,0.000e+00
291.0,4.380e-06
291.5,2.234e-05
292.0,2.102e-05
292.5,2.204e-05
293.0,2.453e-05
293.5,2.256e-05
294.0,3.088e-05
294.5,4.676e-05
295.0,3.384e-05
295.5,3.582e-05
296.0,4.298e-05
296.5,3.774e-05
297.0,4.779e-05
297.5,7.399e-05
298.0,9.214e-05
298.5,1.080e-04
299.0,2.143e-04
299.5,3.180e-04
300.0,3.337e-04
300.5,4.990e-04
301.0,8.688e-04
301.5,1.210e-03
302.0,1.133e-03
I think you can first use read_csv to create 3 DataFrames:
import pandas as pd
import io
temp=u"""#TIMESTAMP
UTCOffset,Date,Time
+00:30:32,2011-09-05,08:32:21
#GLOBAL_SUMMARY
Time,IntACGIH,IntCIE,ZenAngle,MuValue,AzimAngle,Flag,TempC,O3,Err_O3,SO2,Err_SO2,F324
08:32:21,7.3576,52.758,59.109,1.929,114.427,000000,24,291,1,,,91.9
#GLOBAL
Wavelength,S-Irradiance,Time
290.0,0.000e+00
290.5,0.000e+00
291.0,4.380e-06
291.5,2.234e-05
292.0,2.102e-05
292.5,2.204e-05
293.0,2.453e-05
293.5,2.256e-05
294.0,3.088e-05
294.5,4.676e-05
295.0,3.384e-05
295.5,3.582e-05
296.0,4.298e-05
296.5,3.774e-05
297.0,4.779e-05
297.5,7.399e-05
298.0,9.214e-05
298.5,1.080e-04
299.0,2.143e-04
299.5,3.180e-04
300.0,3.337e-04
300.5,4.990e-04
301.0,8.688e-04
301.5,1.210e-03
302.0,1.133e-03
"""
df1 = pd.read_csv(io.StringIO(temp), skiprows=9)
print (df1)
Wavelength S-Irradiance Time
0 290.0 0.000000 NaN
1 290.5 0.000000 NaN
2 291.0 0.000004 NaN
3 291.5 0.000022 NaN
4 292.0 0.000021 NaN
5 292.5 0.000022 NaN
6 293.0 0.000025 NaN
7 293.5 0.000023 NaN
8 294.0 0.000031 NaN
9 294.5 0.000047 NaN
10 295.0 0.000034 NaN
11 295.5 0.000036 NaN
12 296.0 0.000043 NaN
13 296.5 0.000038 NaN
14 297.0 0.000048 NaN
15 297.5 0.000074 NaN
16 298.0 0.000092 NaN
17 298.5 0.000108 NaN
18 299.0 0.000214 NaN
19 299.5 0.000318 NaN
20 300.0 0.000334 NaN
21 300.5 0.000499 NaN
22 301.0 0.000869 NaN
23 301.5 0.001210 NaN
24 302.0 0.001133 NaN
df2 = pd.read_csv(io.StringIO(temp), skiprows=1, nrows=1)
print (df2)
UTCOffset Date Time
0 +00:30:32 2011-09-05 08:32:21
df3 = pd.read_csv(io.StringIO(temp), skiprows=5, nrows=1)
print (df3)
Time IntACGIH IntCIE ZenAngle MuValue AzimAngle Flag TempC O3 \
0 08:32:21 7.3576 52.758 59.109 1.929 114.427 0 24 291
Err_O3 SO2 Err_SO2 F324
0 1 NaN NaN 91.9
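From there, a small sketch of how the three frames could be stitched together to produce the extra columns the question asks for (the one-row metadata values broadcast to every row; column names taken from the sample data):
# Attach the metadata as constant columns on the spectral table.
df1['Date'] = df2.loc[0, 'Date']
df1['UTCOffset'] = df2.loc[0, 'UTCOffset']
df1['ZenAngle'] = df3.loc[0, 'ZenAngle']
print(df1.head())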

Reindexing and filling on one level of a hierarchical index in pandas

I have a pandas dataframe with a two level hierarchical index ('item_id' and 'date'). Each row has columns for a variety of metrics for a particular item in a particular month. Here's a sample:
total_annotations unique_tags
date item_id
2007-04-01 2 30 14
2007-05-01 2 32 16
2007-06-01 2 36 19
2008-07-01 2 81 33
2008-11-01 2 82 34
2009-04-01 2 84 35
2010-03-01 2 90 35
2010-04-01 2 100 36
2010-11-01 2 105 40
2011-05-01 2 106 40
2011-07-01 2 108 42
2005-08-01 3 479 200
2005-09-01 3 707 269
2005-10-01 3 980 327
2005-11-01 3 1176 373
2005-12-01 3 1536 438
2006-01-01 3 1854 497
2006-02-01 3 2206 560
2006-03-01 3 2558 632
2007-02-01 3 5650 1019
As you can see, there are no observations for all consecutive months for each item. What I want to do is reindex the dataframe such that each item has rows for each month in a specified range. Now, this is easy to accomplish for any given item. So, for item_id 99, for example:
baseDateRange = pd.date_range('2005-07-01','2013-01-01',freq='MS')
data.xs(99,level='item_id').reindex(baseDateRange,method='ffill')
But with this method, I'd have to iterate through all the item_ids, then merge everything together, which seems woefully over-complicated.
So how can I apply this to the full dataframe, ffill-ing the observations (but also the item_id index) such that each item_id has properly filled rows for all the dates in baseDateRange?
Essentially for each group you want to reindex and ffill. The apply gets passed a data frame that has the item_id and date still in the index, so reset, then set and reindex with filling.
idx is your baseDateRange from above.
In [33]: df.groupby(level='item_id').apply(
             lambda x: x.reset_index().set_index('date').reindex(idx, method='ffill')).head(30)
Out[33]:
item_id annotations tags
item_id
2 2005-07-01 NaN NaN NaN
2005-08-01 NaN NaN NaN
2005-09-01 NaN NaN NaN
2005-10-01 NaN NaN NaN
2005-11-01 NaN NaN NaN
2005-12-01 NaN NaN NaN
2006-01-01 NaN NaN NaN
2006-02-01 NaN NaN NaN
2006-03-01 NaN NaN NaN
2006-04-01 NaN NaN NaN
2006-05-01 NaN NaN NaN
2006-06-01 NaN NaN NaN
2006-07-01 NaN NaN NaN
2006-08-01 NaN NaN NaN
2006-09-01 NaN NaN NaN
2006-10-01 NaN NaN NaN
2006-11-01 NaN NaN NaN
2006-12-01 NaN NaN NaN
2007-01-01 NaN NaN NaN
2007-02-01 NaN NaN NaN
2007-03-01 NaN NaN NaN
2007-04-01 2 30 14
2007-05-01 2 32 16
2007-06-01 2 36 19
2007-07-01 2 36 19
2007-08-01 2 36 19
2007-09-01 2 36 19
2007-10-01 2 36 19
2007-11-01 2 36 19
2007-12-01 2 36 19
Building on Jeff's answer, I consider this somewhat more readable. It is also considerably more efficient, since only the droplevel and reindex methods are used.
df = df.set_index(['item_id', 'date'])

def fill_missing_dates(x, idx=all_dates):
    # all_dates is the baseDateRange from the question
    x.index = x.index.droplevel('item_id')
    return x.reindex(idx, method='ffill')

filled_df = (df.groupby('item_id')
               .apply(fill_missing_dates))
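One follow-up worth noting: after the groupby/apply, the date level of the resulting MultiIndex loses its name (idx carries none). A one-liner to restore the names, assuming that is wanted:
# Restore the index level names on the result (assumed to be desired).
filled_df.index.names = ['item_id', 'date']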
