How to read a webpage dataset in pandas?

I am trying to read this table
on the webpage: https://datahub.io/sports-data/german-bundesliga
I am using this code:
import pandas as pd
url="https://datahub.io/sports-data/german-bundesliga"
pd.read_html(url)[2]
It reads other tables on the page, but not this type of table.
Also there is a link to this specific table:
https://datahub.io/sports-data/german-bundesliga/r/0.html
I also tried this:
import pandas as pd
url="https://datahub.io/sports-data/german-bundesliga/r/0.html"
pd.read_html(url)
But it says that there are no tables to read

There is no need to use the HTML form of the table, because the table is also available in CSV format:
pd.read_csv('https://datahub.io/sports-data/german-bundesliga/r/season-1819.csv').head()
output:
Div Date HomeTeam AwayTeam FTHG FTAG FTR HTHG HTAG HTR ... BbAv<2.5 BbAH BbAHh BbMxAHH BbAvAHH BbMxAHA BbAvAHA PSCH PSCD PSCA
0 D1 24/08/2018 Bayern Munich Hoffenheim 3 1 H 1 0 H ... 3.55 22 -2.00 1.92 1.87 2.05 1.99 1.23 7.15 14.10
1 D1 25/08/2018 Fortuna Dusseldorf Augsburg 1 2 A 1 0 H ... 1.76 20 0.00 1.80 1.76 2.17 2.11 2.74 3.33 2.78
2 D1 25/08/2018 Freiburg Ein Frankfurt 0 2 A 0 1 A ... 1.69 20 -0.25 2.02 1.99 1.92 1.88 2.52 3.30 3.07
3 D1 25/08/2018 Hertha Nurnberg 1 0 H 1 0 H ... 1.76 20 -0.25 1.78 1.74 2.21 2.14 1.79 3.61 5.21
4 D1 25/08/2018 M'gladbach Leverkusen 2 0 H 0 0 D ... 2.32 20 0.00 2.13 2.07 1.84 1.78 2.63 3.70 2.69
5 rows × 61 columns
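If you also want the Date column parsed as real datetimes while reading, pd.read_csv accepts parse_dates and dayfirst (a minimal sketch, assuming the same CSV URL as above; the dates are day-first, e.g. 24/08/2018):
import pandas as pd
url = 'https://datahub.io/sports-data/german-bundesliga/r/season-1819.csv'
# parse the day-first date strings (24/08/2018) into datetime64 values
df = pd.read_csv(url, parse_dates=['Date'], dayfirst=True)
print(df['Date'].dtype)  # datetime64[ns]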


Index must be DatetimeIndex when filtering dataframe

I then have a function which looks for a specific date (in this case, 2022-01-26):
def get_days(data, date):
    df = pd.read_csv(data)
    df = df[(df['date'] >= date) & (df['date'] <= date)]
    get_trading_session_times(df)
Which returns:
v vw o c h l n date time
0 134730.0 3.6805 3.60 3.61 3.90 3.58 494 2022-01-26 09:00:00
1 72594.0 3.6324 3.60 3.62 3.70 3.57 376 2022-01-26 09:01:00
2 51828.0 3.6151 3.62 3.63 3.65 3.57 278 2022-01-26 09:02:00
3 40245.0 3.6343 3.63 3.65 3.65 3.62 191 2022-01-26 09:03:00
4 76428.0 3.6094 3.64 3.62 3.66 3.57 298 2022-01-26 09:04:00
.. ... ... ... ... ... ... ... ... ...
868 176.0 3.1300 3.13 3.13 3.13 3.13 2 2022-01-26 23:53:00
869 550.0 3.1200 3.12 3.12 3.12 3.12 3 2022-01-26 23:56:00
870 460.0 3.1211 3.12 3.12 3.12 3.12 3 2022-01-26 23:57:00
871 1175.0 3.1201 3.12 3.12 3.12 3.12 6 2022-01-26 23:58:00
872 559.0 3.1102 3.11 3.11 3.11 3.11 5 2022-01-26 23:59:00
[873 rows x 9 columns]
When I then try to look for only times between 09:00 and 09:30, like so:
def get_trading_session_times(df):
    df = df['time'].between_time('09:00', '09:30')
    print(df)
I get the following error:
Index must be DatetimeIndex when filtering dataframe
Full code:
import pandas as pd
data = 'data\BBIG.csv'
date = '2022-01-26'
def get_days(data, date):
    df = pd.read_csv(data)
    df = df[(df['date'] >= date) & (df['date'] <= date)]
    get_trading_session_times(df)
def get_trading_session_times(df):
    df = df['time'].between_time('09:00', '09:30')
    print(df)
get_days(data, date)
What am I doing wrong?
between_time is only valid if your index is a DatetimeIndex.
As your time strings are well formatted, you can use between to compare them, because such values sort correctly in lexicographical order.
>>> df[df['time'].between('09:00', '09:30')]
v vw o c h l n date time
0 134730.0 3.6805 3.60 3.61 3.90 3.58 494 2022-01-26 09:00:00
1 72594.0 3.6324 3.60 3.62 3.70 3.57 376 2022-01-26 09:01:00
2 51828.0 3.6151 3.62 3.63 3.65 3.57 278 2022-01-26 09:02:00
3 40245.0 3.6343 3.63 3.65 3.65 3.62 191 2022-01-26 09:03:00
4 76428.0 3.6094 3.64 3.62 3.66 3.57 298 2022-01-26 09:04:00
Update
If your time column contains a time object:
from datetime import time
df['time'] = pd.to_datetime(df['time']).dt.time
out = df[df['time'].between(time(9, 0), time(9, 30))]
print(out)
# Output
v vw o c h l n date time
0 134730.0 3.6805 3.60 3.61 3.90 3.58 494 2022-01-26 09:00:00
1 72594.0 3.6324 3.60 3.62 3.70 3.57 376 2022-01-26 09:01:00
2 51828.0 3.6151 3.62 3.63 3.65 3.57 278 2022-01-26 09:02:00
3 40245.0 3.6343 3.63 3.65 3.65 3.62 191 2022-01-26 09:03:00
4 76428.0 3.6094 3.64 3.62 3.66 3.57 298 2022-01-26 09:04:00
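Alternatively, if you do want between_time, give the frame a DatetimeIndex first. A minimal sketch, assuming date and time are still the original string columns:
import pandas as pd
# combine the date and time strings into a single DatetimeIndex;
# between_time then works as intended
df.index = pd.to_datetime(df['date'] + ' ' + df['time'])
print(df.between_time('09:00', '09:30'))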

export dataframe to csv stacking the columns with header and date index

I have a dataframe that I'd like to export to a CSV file with each column stacked on top of one another. I want to use each header as a label, combined with the year, in this format: Allu_1_2013.
date Allu_1 Allu_2 Alluv_3 year
2013-01-01 2.00 1.45 3.54 2013
2014-01-01 3.09 2.35 9.01 2014
2015-01-01 4.79 4.89 10.04 2015
The final CSV text file should look like:
Allu_1_2013 2.00
Allu_1_2014 3.09
Allu_1_2015 4.79
Allu_2_2013 1.45
Allu_2_2014 2.35
Allu_2_2015 4.89
Allu_3_2013 3.54
Allu_3_2014 9.01
Allu_3_2015 10.04
You can use melt:
new_df = df.melt(id_vars=["date", "year"],
                 var_name="Date",
                 value_name="Value").drop(columns=['date'])
new_df['idx'] = new_df['Date'] + '_' + new_df['year'].astype(str)
new_df = new_df.drop(columns=['year', 'Date'])
   Value           idx
0   2.00   Allu_1_2013
1   3.09   Allu_1_2014
2   4.79   Allu_1_2015
3   1.45   Allu_2_2013
4   2.35   Allu_2_2014
5   4.89   Allu_2_2015
6   3.54  Alluv_3_2013
7   9.01  Alluv_3_2014
8  10.04  Alluv_3_2015
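To write this in the two-column layout shown in the question, to_csv can be pointed at the columns in the desired order. A sketch, with 'stacked.csv' as an assumed output name:
# write idx first, then Value, with no header row and no index column;
# pass sep=' ' if a space-separated layout is wanted instead of commas
new_df.to_csv('stacked.csv', columns=['idx', 'Value'], index=False, header=False)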

Add zero in between values in column (dataframe)

I have a dataframe:
ID Date Volume Sales
1 20191 3.33 1.33
1 20192 3.33 1.33
1 20193 3.33 1.33
1 20194 2.66 2
1 20195 2.66 2
1 20196 2.66 2
1 20197 2 2.66
1 20198 2 2.66
1 20199 2 2.66
1 201910 1.33 3.33
1 201911 1.33 3.33
1 201912 1.33 3.33
I would like to add a 0 right after the year 2019 in this case, so that the date looks like 201901 etc., while 201910 and above remain the same.
My initial thought process is to use something like:
np.where(df['Date'].str.len() == 5, ...)
where, if the string length equals 5, we add the zero; otherwise, the data stays the same.
Expected output:
ID Date Volume Sales
1 201901 3.33 1.33
1 201902 3.33 1.33
1 201903 3.33 1.33
1 201904 2.66 2
1 201905 2.66 2
1 201906 2.66 2
1 201907 2 2.66
1 201908 2 2.66
1 201909 2 2.66
1 201910 1.33 3.33
1 201911 1.33 3.33
1 201912 1.33 3.33
Assuming these are strings:
df.Date = df.Date.apply(lambda x: x if len(x) == 6 else f"{x[:4]}0{x[-1]}")
But I concur that you should convert to proper dates, as suggested by @0 0 in the comments.
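A vectorized alternative (a sketch, assuming Date holds strings): split off the four-digit year and zero-pad the remainder.
# '20191' -> '2019' + '01'; '201910' and longer values are unchanged
df['Date'] = df['Date'].str[:4] + df['Date'].str[4:].str.zfill(2)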

Unable to print dataframe from the extracted Boolean values

import pandas as pd
diamonds = pd.read_csv('diam.csv')
print(diamonds.head())
Unnamed: 0 carat cut color clarity depth table price x y z quality?color
0 0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43 Ideal,E
1 1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 Premium,E
2 2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31 Good,E
3 3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63 Premium,I
4 4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75 Good,J
I want to print only the columns with object data types:
x = diamonds.dtypes == 'object'
diamonds.where(diamonds[x] == True)
But I get this error:
unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
where works along the row axis. Use diamonds.loc[:, diamonds.dtypes == object], or the built-in select_dtypes.
From your post (badly formatted) I recreated the diamonds DataFrame,
getting a result like the one below:
Unnamed: 0 carat cut color clarity depth table price x y z quality?color
0 0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43 Ideal,E
1 1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 Premium,E
2 2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31 Good,E
3 3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63 Premium,I
4 4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75 Good,J
When you run x = diamonds.dtypes == 'object' and print x, the result is:
Unnamed: 0 False
carat False
cut True
color True
clarity True
depth False
table False
price False
x False
y False
z False
quality?color True
dtype: bool
It is a Boolean vector, containing, for each column, the answer to the question: is this column of object type?
Note that diamonds.columns.where(x).dropna().tolist() prints:
['cut', 'color', 'clarity', 'quality?color']
i.e. names of columns with object type.
So to print all columns of object type you should run:
diamonds[diamonds.columns.where(x).dropna()]
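Equivalently, the built-in select_dtypes mentioned in the first answer does this in one step:
# keep only the object-typed columns
print(diamonds.select_dtypes(include='object').head())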

Is there an option in pandas to see if value in column was less than another column in one row and then it changed over time?

I need to find cases where "price of y" was less than 3.5 until time 30:00,
and after that "price of x" jumped above 3.5.
I made a "Demical time" column to make it easier for me (a time of less than 30:00 is less than 1800 seconds in decimal).
I found all the cases where price of y was under 3.5 (and above 0), but I failed to write code that gives the cases where price of y was under 3.5 AND price of x was greater than 3.5 after 30:00.
df1 = df[(df['price_of_Y']<3.5) & (df['price_of_Y']>0) & (df['Demical time']<1800)]
# the cases for price of y under 3.5 before time 30:00 (Demical time = 1800)
df2 = df[(df['price_of_X']>3.5) & (df['Demical time']>1800)]
# the cases for price of x above 3.5 after time 30:00 (Demical time = 1800)
# the question is: how do I combine them into one line?
price_of_X time price_of_Y Demical time
0 3.30 0 4.28 0
1 3.30 0:00 4.28 0
2 3.30 0:00 4.28 0
3 3.30 0:00 4.28 0
4 3.30 0:00 4.28 0
5 3.30 0:00 4.28 0
6 3.30 0:00 4.28 0
7 3.30 0:00 4.28 0
8 3.30 0:00 4.28 0
9 3.30 0:00 4.28 0
10 3.30 0:00 4.28 0
11 3.25 0:26 4.28 26
12 3.40 1:43 4.28 103
13 3.25 3:00 4.28 180
14 3.25 4:16 4.28 256
15 3.40 5:34 4.28 334
16 3.40 6:52 4.28 412
17 3.40 8:09 4.28 489
18 3.40 9:31 4.28 571
19 5.00 10:58 8.57 658
20 5.00 12:13 8.57 733
21 5.00 13:31 7.38 811
22 5.00 14:47 7.82 887
23 5.00 16:01 7.82 961
24 5.00 17:18 7.38 1038
25 5.00 18:33 7.38 1113
26 5.00 19:50 7.38 1190
27 5.00 21:09 7.38 1269
28 5.00 22:22 7.38 1342
29 5.00 23:37 8.13 1417
... ... ... ... ...
18138 7.50 59:03:00 28.61 3543
18139 7.50 60:19:00 28.61 3619
18140 7.50 61:35:00 34.46 3695
18141 8.00 62:48:00 30.16 3768
18142 7.50 64:03:00 34.46 3843
18143 8.00 65:20:00 30.16 3920
18144 7.50 66:34:00 28.61 3994
18145 7.50 67:53:00 30.16 4073
18146 8.00 69:08:00 26.19 4148
18147 7.00 70:23:00 23.10 4223
18148 7.00 71:38:00 23.10 4298
18149 8.00 72:50:00 30.16 4370
18150 7.50 74:09:00 26.19 4449
18151 7.50 75:23:00 25.58 4523
18152 7.00 76:40:00 19.07 4600
18153 7.00 77:53:00 19.07 4673
18154 9.00 79:11:00 31.44 4751
18155 9.00 80:27:00 27.11 4827
18156 10.00 81:41:00 34.52 4901
18157 10.00 82:56:00 34.52 4976
18158 11.00 84:16:00 43.05 5056
18159 10.00 85:35:00 29.42 5135
18160 10.00 86:49:00 29.42 5209
18161 11.00 88:04:00 35.70 5284
18162 13.00 89:19:00 70.38 5359
18163 15.00 90:35:00 70.42 5435
18164 19.00 91:48:00 137.70 5508
18165 23.00 93:01:00 511.06 5581
18166 NaN NaN NaN 0
18167 NaN NaN NaN 0
[18168 rows x 4 columns]
This should solve it.
I have used slightly different data and condition values, but you should get the idea of what I am doing.
import pandas as pd
df = pd.DataFrame({'price_of_X': [3.30, 3.25, 3.40, 3.25, 3.25, 3.40],
                   'price_of_Y': [2.28, 1.28, 4.28, 4.28, 1.18, 3.28],
                   'Decimal_time': [0, 26, 103, 180, 256, 334]})
print(df)
df1 = df.loc[(df['price_of_Y'] < 3.5) & (df['price_of_X'] > 3.3) & (df['Decimal_time'] > 103), :]
print(df1)
output:
df
price_of_X price_of_Y Decimal_time
0 3.30 2.28 0
1 3.25 1.28 26
2 3.40 4.28 103
3 3.25 4.28 180
4 3.25 1.18 256
5 3.40 3.28 334
df1
price_of_X price_of_Y Decimal_time
5 3.4 3.28 334
Similar to what @IMCoins suggested in a comment, use two boolean masks to achieve the selection that you require.
mask1 = (df['price_of_Y'] < 3.5) & (df['price_of_Y'] > 0) & (df['Demical time'] < 1800)
mask2 = (df['price_of_X'] > 3.5) & (df['Demical time'] > 1800)
df[mask1 | mask2]
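Note that mask1 | mask2 keeps rows matching either condition. If what you actually need is a single yes/no answer (did price_of_Y stay under 3.5 before 1800 seconds, and did price_of_X jump above 3.5 after that?), you can test whether each mask matches at least one row, a sketch reusing the masks above:
# True only if at least one row satisfies each condition
both_happened = mask1.any() and mask2.any()
print(both_happened)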
