I have this dataframe, talismen_players:
Team TotalShots TotalCreated Goals ... WeightBetweenness %ShotInv Player Color
0 Aston Villa 55.0 68.0 7.0 ... 0.45 0.36 Jack Grealish #770038
1 Manchester City 76.0 96.0 8.0 ... 0.44 0.32 Kevin De Bruyne #97C1E7
2 Watford 62.0 43.0 4.0 ... 0.37 0.34 Gerard Deulofeu #FBEE23
3 Leicester City 60.0 67.0 6.0 ... 0.32 0.34 James Maddison #0053A0
4 Norwich City 29.0 69.0 0.0 ... 0.31 0.32 Emiliano Buendia #00A14E
5 Chelsea 63.0 40.0 5.0 ... 0.28 0.23 Mason Mount #034694
6 Tottenham Hotspur 64.0 30.0 9.0 ... 0.28 0.29 Son Heung-Min #132257
7 Everton 66.0 30.0 10.0 ... 0.22 0.26 Richarlison #274488
8 Arsenal 64.0 18.0 17.0 ... 0.21 0.27 Pierre-Emerick Aubameyang #EF0107
9 Bournemouth 25.0 40.0 1.0 ... 0.21 0.23 Ryan Fraser #D3151B
10 Crystal Palace 42.0 20.0 6.0 ... 0.20 0.24 Jordan Ayew #C4122E
11 Burnley 33.0 40.0 2.0 ... 0.20 0.27 Dwight McNeil #630F33
12 Newcastle United 41.0 24.0 1.0 ... 0.20 0.24 Joelinton #231F20
13 Liverpool 89.0 41.0 14.0 ... 0.18 0.31 Mohamed Salah #CE1317
14 Brighton and Hove Albion 27.0 52.0 2.0 ... 0.16 0.23 Pascal Groß #005DAA
15 Wolverhampton Wanderers 86.0 38.0 11.0 ... 0.16 0.38 Raul Jimenez #FDB913
16 Sheffield United 31.0 35.0 5.0 ... 0.15 0.24 John Fleck #E52126
17 West Ham United 48.0 18.0 6.0 ... 0.15 0.25 Sebastien Haller #7C2C3B
18 Southampton 64.0 21.0 15.0 ... 0.11 0.24 Danny Ings #ED1A3B
19 Manchester United 37.0 31.0 1.0 ... 0.10 0.17 Andreas Pereira #E80909
And I have singled out one element in a series being plotted by altair, like so:
target_team = 'Liverpool'
# the following prints 'Mohamed Salah'
target_player = talismen_players.loc[talismen_players['Team']==target_team, 'Player'].item()
# all elements
talisman_chart = alt.Chart(talismen_players).mark_bar().encode(
    alt.Y('Player:N', title="", sort='-x'),
    alt.X('WeightBetweenness:Q', title="Interconectividade do craque com o resto do time"),
    alt.Color('Color', legend=None, scale=None),
    tooltip=[alt.Tooltip('Player:N'),
             alt.Tooltip('Team:N'),
             alt.Tooltip('TotalShots:Q'),
             alt.Tooltip('%ShotInv:Q')],
).properties(
    width=800
).configure_axis(
    grid=False
).configure_view(
    strokeWidth=0
)
This plots all elements, but I want to highlight one of them. The Altair gallery's "Multiline highlight" example achieves this for lines, and I've tried adapting it:
highlight = alt.selection(type='single', on='mouseover',
                          fields=['Player'], nearest=True)

base = alt.Chart(talismen_players).encode(
    alt.X('WeightBetweenness:Q'),
    alt.Y('Player:N', title=''),
    alt.Color('Color', legend=None, scale=None),
    tooltip=[alt.Tooltip('Player:N'),
             alt.Tooltip('Team:N'),
             alt.Tooltip('TotalShots:Q'),
             alt.Tooltip('%ShotInv:Q')],
)

points = base.mark_circle().encode(
    opacity=alt.value(0)
).add_selection(
    highlight
).properties(
    width=700
)

lines = base.mark_bar().encode(
    size=alt.condition(~highlight, alt.value(18), alt.value(20))
)
But the result is not ideal: the highlight is not working.
QUESTION:
Ideally, I'd like the target_player bar highlighted by default, and then be able to highlight the other bars on mouseover. But having the target_player bar statically highlighted would suffice.
The highlight should not change bar colors, but rather give the selected bar a focus quality while the other bars are out of focus.
You can do this using an initialized selection, along with an opacity for highlighting. For example:
import altair as alt
import pandas as pd
df = pd.DataFrame({
    'x': [1, 2, 3, 4],
    'y': ['A', 'B', 'C', 'D'],
    'color': ['W', 'X', 'Y', 'Z'],
})

hover = alt.selection_single(fields=['y'], on='mouseover', nearest=True, init={'y': 'C'})

alt.Chart(df).mark_bar().encode(
    x='x',
    y='y',
    color='color',
    opacity=alt.condition(hover, alt.value(1.0), alt.value(0.3))
).add_selection(hover)
I made a mock-up example to illustrate my problem; what I am actually working with is much more complex, and reading the example will make everything easier to understand. My goal is to use the row reference values of one dataframe to set the values of a new column of another dataframe. Taking my example, I want to create a new column in df1 named z1, formed by looking up the values of x1 against the x2/y2 pairs of df2.
import numpy as np
import pandas as pd
x1 = np.array([])
for i in np.arange(0, 15, 3):
    x1i = np.repeat(i, 3)
    x1 = np.append(x1, x1i)
y1 = np.linspace(0, 1, len(x1))
x2 = np.arange(0, 15, 3)
y2 = np.linspace(0, 1, len(x2))
df1 = pd.DataFrame([x1, y1]).T
df2 = pd.DataFrame([x2, y2]).T
df1.columns = ['x1', 'y1']
df2.columns = ['x2', 'y2']
So, we have that df1 is:
x1 y1
0 0.0 0.000000
1 0.0 0.071429
2 0.0 0.142857
3 3.0 0.214286
4 3.0 0.285714
5 3.0 0.357143
6 6.0 0.428571
7 6.0 0.500000
8 6.0 0.571429
9 9.0 0.642857
10 9.0 0.714286
11 9.0 0.785714
12 12.0 0.857143
13 12.0 0.928571
14 12.0 1.000000
and df2 is:
x2 y2
0 0.0 0.00
1 3.0 0.25
2 6.0 0.50
3 9.0 0.75
4 12.0 1.00
What I would like to achieve is:
x1 y1 z1
0 0.0 0.000000 0.00
1 0.0 0.071429 0.00
2 0.0 0.142857 0.00
3 3.0 0.214286 0.25
4 3.0 0.285714 0.25
5 3.0 0.357143 0.25
6 6.0 0.428571 0.50
7 6.0 0.500000 0.50
8 6.0 0.571429 0.50
9 9.0 0.642857 0.75
10 9.0 0.714286 0.75
11 9.0 0.785714 0.75
12 12.0 0.857143 1.00
13 12.0 0.928571 1.00
14 12.0 1.000000 1.00
You can use map for this (using z1, the name the question asks for):
df1['z1'] = df1['x1'].map(df2.set_index('x2')['y2'])
x1 y1 z1
0 0.0 0.000000 0.00
1 0.0 0.071429 0.00
2 0.0 0.142857 0.00
3 3.0 0.214286 0.25
4 3.0 0.285714 0.25
5 3.0 0.357143 0.25
6 6.0 0.428571 0.50
7 6.0 0.500000 0.50
8 6.0 0.571429 0.50
9 9.0 0.642857 0.75
10 9.0 0.714286 0.75
11 9.0 0.785714 0.75
12 12.0 0.857143 1.00
13 12.0 0.928571 1.00
14 12.0 1.000000 1.00
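An equivalent route is a left merge, which also works when you want to bring over several columns at once. A sketch rebuilding the same df1/df2 from the question, with the looked-up column renamed to z1:

```python
import numpy as np
import pandas as pd

# Rebuild the frames from the question
x1 = np.repeat(np.arange(0, 15, 3), 3).astype(float)
df1 = pd.DataFrame({'x1': x1, 'y1': np.linspace(0, 1, len(x1))})
df2 = pd.DataFrame({'x2': np.arange(0, 15, 3, dtype=float),
                    'y2': np.linspace(0, 1, 5)})

# Left-merge df2 onto df1 on the key column, then tidy up:
# drop the duplicate key and rename the looked-up column to z1
out = df1.merge(df2, left_on='x1', right_on='x2', how='left')
out = out.drop(columns='x2').rename(columns={'y2': 'z1'})
```

Unlike map, merge keeps all matching rows if the key in df2 were duplicated, so map is the safer choice when the lookup table must be one-to-one.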
This is the site I'm trying to retrieve information from: https://www.baseball-reference.com/boxes/CLE/CLE202108120.shtml I want to get the box score data, such as the Oakland A's total batting average in the game, at-bats in the game, etc. However, when I retrieve and print the HTML from the site, these box scores are missing completely from the HTML. Any suggestions? Thanks.
Here's my code:
from bs4 import BeautifulSoup
import requests
url = "https://www.baseball-reference.com/boxes/CLE/CLE202108120.shtml"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
Please help! Thanks! I tried selenium and had the same problem.
The page is loaded by JavaScript. Try using the requests_html package instead; see the sample below.
from bs4 import BeautifulSoup
from requests_html import HTMLSession
url = "https://www.baseball-reference.com/boxes/CLE/CLE202108120.shtml"
s = HTMLSession()
page = s.get(url, timeout=20)
page.html.render()
soup = BeautifulSoup(page.html.html, 'html.parser')
print(soup.prettify())
The other tables are there in the requested HTML, but inside comments. So you need to parse out the comments to get those additional tables:
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd
url = "https://www.baseball-reference.com/boxes/CLE/CLE202108120.shtml"
result = requests.get(url).text
data = BeautifulSoup(result, 'html.parser')
comments = data.find_all(string=lambda text: isinstance(text, Comment))
tables = pd.read_html(url)
for each in comments:
    if 'table' in str(each):
        try:
            tables.append(pd.read_html(str(each))[0])
        except ValueError:
            continue
Output:
Oakland
print(tables[2].to_string())
Batting AB R H RBI BB SO PA BA OBP SLG OPS Pit Str WPA aLI WPA+ WPA- cWPA acLI RE24 PO A Details
0 Mark Canha LF 6.0 1.0 1.0 3.0 0.0 0.0 6.0 0.247 0.379 0.415 0.793 23.0 19.0 0.011 0.58 0.040 -0.029% 0.01% 1.02 1.0 1.0 0.0 2B
1 Starling Marte CF 3.0 0.0 2.0 3.0 0.0 1.0 4.0 0.325 0.414 0.476 0.889 12.0 7.0 0.116 0.90 0.132 -0.016% 0.12% 1.57 2.8 1.0 0.0 2B,HBP
2 Stephen Piscotty PH-RF 1.0 0.0 1.0 2.0 0.0 0.0 2.0 0.211 0.272 0.349 0.622 7.0 3.0 0.000 0.00 0.000 0.000% 0% 0.00 2.0 1.0 0.0 HBP
3 Matt Olson 1B 6.0 0.0 1.0 2.0 0.0 0.0 6.0 0.283 0.376 0.566 0.941 21.0 13.0 -0.057 0.45 0.008 -0.065% -0.06% 0.78 -0.6 9.0 1.0 GDP
4 Mitch Moreland DH 5.0 3.0 2.0 2.0 0.0 0.0 6.0 0.230 0.290 0.415 0.705 23.0 16.0 0.049 0.28 0.064 -0.015% 0.05% 0.50 1.5 NaN NaN 2·HR,HBP
5 Josh Harrison 2B 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.294 0.366 0.435 0.801 7.0 3.0 0.057 1.50 0.057 0.000% 0.06% 2.63 0.6 0.0 0.0 NaN
6 Tony Kemp 2B 4.0 3.0 3.0 0.0 1.0 0.0 5.0 0.252 0.370 0.381 0.751 16.0 10.0 -0.001 0.14 0.009 -0.010% 0% 0.24 1.6 2.0 2.0 NaN
7 Sean Murphy C 4.0 3.0 2.0 2.0 2.0 1.0 6.0 0.224 0.318 0.419 0.737 25.0 15.0 0.143 0.38 0.151 -0.007% 0.15% 0.67 2.7 7.0 0.0 2B
8 Matt Chapman 3B 1.0 3.0 0.0 0.0 5.0 1.0 6.0 0.214 0.310 0.365 0.676 31.0 10.0 0.051 0.28 0.051 0.000% 0.05% 0.49 2.2 1.0 3.0 NaN
9 Seth Brown RF-CF 5.0 1.0 1.0 1.0 0.0 1.0 6.0 0.204 0.278 0.451 0.730 18.0 12.0 -0.067 0.40 0.000 -0.067% -0.07% 0.70 -1.7 4.0 0.0 SF
10 Elvis Andrus SS 5.0 2.0 1.0 2.0 1.0 0.0 6.0 0.233 0.283 0.310 0.593 20.0 15.0 0.015 0.42 0.050 -0.034% 0.02% 0.73 -0.1 0.0 4.0 NaN
11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
12 Chris Bassitt P NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 0.0 NaN
13 A.J. Puk P NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN
14 Deolis Guerra P NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN
15 Jake Diekman P NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 0.0 NaN
16 Team Totals 40.0 17.0 14.0 17.0 10.0 4.0 54.0 0.350 0.500 0.575 1.075 203.0 123.0 0.317 0.41 0.562 -0.243% 0.33% 0.72 12.2 27.0 10.0 NaN
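The key trick above is pulling tables out of HTML comments. A stdlib-only sketch of that idea, using a made-up HTML snippet (not actual baseball-reference markup) so it runs offline:

```python
from html.parser import HTMLParser

# Made-up page: one visible table, one table hidden inside an HTML comment,
# mimicking how baseball-reference ships its extra box-score tables.
page = """
<div><table id="visible"><tr><td>A</td></tr></table></div>
<!-- <table id="hidden"><tr><td>B</td></tr></table> -->
"""

class CommentCollector(HTMLParser):
    """Collect the text of every HTML comment on the page."""
    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)

collector = CommentCollector()
collector.feed(page)

# Keep only comments that contain a table; each of these strings could
# then be fed to pd.read_html(), exactly as in the answer above.
hidden_tables = [c for c in collector.comments if '<table' in c]
```

Beautiful Soup's `Comment` class (used in the answer) does the same comment-finding job, but with the convenience of the rest of the soup API around it.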
In the following df_goals, I see that every NaN could be replaced by the cell value either above or below it, eliminating the duplicate 'For Team' rows:
For Team Goals Home Goals Away Color
0 Arsenal NaN 1.17 #EF0107
1 Arsenal 1.70 NaN #EF0107
2 Aston Villa NaN 1.10 #770038
3 Aston Villa 1.45 NaN #770038
4 Bournemouth NaN 0.77 #D3151B
5 Bournemouth 1.17 NaN #D3151B
6 Brighton and Hove Albion NaN 1.00 #005DAA
7 Brighton and Hove Albion 1.45 NaN #005DAA
8 Burnley NaN 1.25 #630F33
9 Burnley 1.33 NaN #630F33
10 Chelsea NaN 1.82 #034694
11 Chelsea 1.11 NaN #034694
12 Crystal Palace NaN 0.89 #C4122E
13 Crystal Palace 0.79 NaN #C4122E
14 Everton NaN 1.30 #274488
15 Everton 1.40 NaN #274488
16 Leicester City NaN 2.25 #0053A0
17 Leicester City 2.00 NaN #0053A0
18 Liverpool NaN 2.00 #CE1317
19 Liverpool 2.62 NaN #CE1317
20 Manchester City NaN 2.25 #97C1E7
21 Manchester City 2.73 NaN #97C1E7
22 Manchester United NaN 0.92 #E80909
23 Manchester United 1.82 NaN #E80909
24 Newcastle United NaN 0.67 #231F20
25 Newcastle United 1.10 NaN #231F20
26 Norwich City NaN 0.56 #00A14E
27 Norwich City 1.36 NaN #00A14E
28 Sheffield United NaN 0.88 #E52126
29 Sheffield United 0.83 NaN #E52126
30 Southampton NaN 1.42 #ED1A3B
31 Southampton 1.15 NaN #ED1A3B
32 Tottenham Hotspur NaN 1.20 #132257
33 Tottenham Hotspur 1.90 NaN #132257
34 Watford NaN 0.83 #FBEE23
35 Watford 0.90 NaN #FBEE23
36 West Ham United NaN 1.09 #7C2C3B
37 West Ham United 1.83 NaN #7C2C3B
38 Wolverhampton Wanderers NaN 1.40 #FDB913
39 Wolverhampton Wanderers 1.33 NaN #FDB913
How do I do this?
Use groupby + first, which takes the first non-null value per column within each group:
df = df.groupby('For Team').first()
If you want to drop all the NaN rows:
df_goals = df_goals.dropna(axis=0)
You could instead set all NaN values to 0:
df_goals = df_goals.fillna(0)
df_goals.head()
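To see the groupby('For Team').first() collapse concretely, here is a tiny runnable sketch with two teams and complementary NaNs, mirroring the df_goals layout:

```python
import numpy as np
import pandas as pd

# Two rows per team with complementary NaNs, as in df_goals
df = pd.DataFrame({
    'For Team': ['Arsenal', 'Arsenal', 'Chelsea', 'Chelsea'],
    'Goals Home': [np.nan, 1.70, np.nan, 1.11],
    'Goals Away': [1.17, np.nan, 1.82, np.nan],
    'Color': ['#EF0107', '#EF0107', '#034694', '#034694'],
})

# first() takes the first non-null value per column within each group,
# so the paired NaNs collapse into one complete row per team.
# as_index=False keeps 'For Team' as a regular column.
collapsed = df.groupby('For Team', as_index=False).first()
```

The result has one row per team with both goal columns filled in and the color preserved.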
I want to automate the search process on a website and scrape the table of individual players (I'm getting the players' names from an Excel sheet). I want to add that scraped information to an existing Excel sheet with the list of players. For each year that player has been in the league, the player's name needs to be in the first column. So far, I was able to grab the information from the existing Excel sheet, but I'm not sure how to automate the search process using that. I'm not sure if Selenium can help. The website is https://basketball.realgm.com/.
import openpyxl
path = r"C:\Users\Name\Desktop\NBAPlayers.xlsx"
workbook = openpyxl.load_workbook(path)
sheet = workbook.active
rows = sheet.max_row
cols = sheet.max_column
print(rows)
print(cols)
for r in range(2, rows+1):
    for c in range(2, cols+1):
        print(sheet.cell(row=r, column=c).value, end=" ")
    print()
I presume you have already got the names from the Excel sheet, so I used a name list. Using the requests module I get the page text, then use Beautiful Soup to extract the table content, and then pandas to load it into a dataframe.
Code:
import requests
import pandas as pd
from bs4 import BeautifulSoup
playernames = ['Dominique Jones', 'Joe Young', 'Darius Adams',
               'Lester Hudson', 'Marcus Denmon', 'Courtney Fortson']

for name in playernames:
    fname = name.split(" ")[0]
    lname = name.split(" ")[1]
    url = "https://basketball.realgm.com/search?q={}+{}".format(fname, lname)
    print(url)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    table = soup.select_one(".tablesaw")
    dfs = pd.read_html(str(table))
    for df in dfs:
        print(df)
Output:
https://basketball.realgm.com/search?q=Dominique+Jones
Player Pos HT ... Draft Year College NBA
0 Dominique Jones G 6-4 ... 2010 South Florida Dallas Mavericks
1 Dominique Jones G 6-2 ... 2009 Liberty -
2 Dominique Jones PG 5-9 ... 2011 Fort Hays State -
[3 rows x 8 columns]
https://basketball.realgm.com/search?q=Joe+Young
Player Pos HT ... Draft Year College NBA
0 Joe Young F 6-6 ... 2007 Holy Cross -
1 Joe Young G 6-0 ... 2009 Canisius -
2 Joe Young G 6-2 ... 2015 Oregon Indiana Pacers
3 Joe Young G 6-2 ... 2009 Central Missouri -
[4 rows x 8 columns]
https://basketball.realgm.com/search?q=Darius+Adams
Player Pos HT ... Draft Year College NBA
0 Darius Adams PG 6-1 ... 2011 Indianapolis -
1 Darius Adams G 6-0 ... 2018 Coast Guard Academy -
[2 rows x 8 columns]
https://basketball.realgm.com/search?q=Lester+Hudson
Season Team GP GS MIN ... STL BLK PF TOV PTS
0 2009-10 * All Teams 25 0 5.3 ... 0.32 0.12 0.48 0.56 2.32
1 2009-10 * BOS 16 0 4.4 ... 0.19 0.12 0.44 0.56 1.38
2 2009-10 * MEM 9 0 6.8 ... 0.56 0.11 0.56 0.56 4.00
3 2010-11 WAS 11 0 6.7 ... 0.36 0.09 0.91 0.64 1.64
4 2011-12 * All Teams 16 0 20.9 ... 0.88 0.19 1.62 2.00 10.88
5 2011-12 * CLE 13 0 24.2 ... 1.08 0.23 2.00 2.31 12.69
6 2011-12 * MEM 3 0 6.5 ... 0.00 0.00 0.00 0.67 3.00
7 2014-15 LAC 5 0 11.1 ... 1.20 0.20 0.80 0.60 3.60
8 CAREER NaN 57 0 10.4 ... 0.56 0.14 0.91 0.98 4.70
[9 rows x 23 columns]
https://basketball.realgm.com/search?q=Marcus+Denmon
Season Team Location GP GS ... STL BLK PF TOV PTS
0 2012-13 SAN Las Vegas 5 0 ... 0.4 0.0 1.60 0.20 5.40
1 2013-14 SAN Las Vegas 5 1 ... 0.8 0.0 2.20 1.20 10.80
2 2014-15 SAN Las Vegas 6 2 ... 0.5 0.0 1.50 0.17 5.00
3 2015-16 SAN Salt Lake City 2 0 ... 0.0 0.0 0.00 0.00 0.00
4 CAREER NaN NaN 18 3 ... 0.5 0.0 1.56 0.44 6.17
[5 rows x 24 columns]
https://basketball.realgm.com/search?q=Courtney+Fortson
Season Team GP GS MIN FGM ... AST STL BLK PF TOV PTS
0 2011-12 * All Teams 10 0 9.5 1.10 ... 1.00 0.3 0.0 0.50 1.00 3.50
1 2011-12 * HOU 6 0 8.2 1.00 ... 0.83 0.5 0.0 0.33 0.83 3.00
2 2011-12 * LAC 4 0 11.5 1.25 ... 1.25 0.0 0.0 0.75 1.25 4.25
3 CAREER NaN 10 0 9.5 1.10 ... 1.00 0.3 0.0 0.50 1.00 3.50
[4 rows x 23 columns]
You have to build a URL list of the players and scrape the pages using Beautiful Soup. Note that urllib2 is Python 2 only; on Python 3 use urllib.request:
from urllib.request import urlopen
from bs4 import BeautifulSoup
soup = BeautifulSoup(urlopen('http://example.com').read(), 'html.parser')
I want to read the csv file as a pandas dataframe. CSV file is here: https://www.dropbox.com/s/o3xc74f8v4winaj/aaaa.csv?dl=0
In particular:
- I want to skip the first row.
- The column headers are in row 2. In this case, they are: 1, 1, 2 and TOT. I do not want to hardcode them, though. It is OK if the only column that gets extracted is TOT.
- I do not want to use a non-pandas approach if possible.
Here is what I am doing:
df = pandas.read_csv('https://www.dropbox.com/s/o3xc74f8v4winaj/aaaa.csv?dl=0', skiprows=1, skipinitialspace=True, sep=' ')
But this gives the error:
*** CParserError: Error tokenizing data. C error: Expected 5 fields in line 4, saw 6
The output should look something like this:
1 1 2 TOT
0 DEPTH(m) 0.01 1.24 1.52
1 BD 33kpa(t/m3) 1.6 1.6 1.6
2 SAND(%) 42.1 42.1 65.1
3 SILT(%) 37.9 37.9 16.9
4 CLAY(%) 20 20 18
5 ROCK(%) 12 12 12
6 WLS(kg/ha) 0 5 0.1 5.1
7 WLM(kg/ha) 0 5 0.1 5.1
8 WLSL(kg/ha) 0 4 0.1 4.1
9 WLSC(kg/ha) 0 2.1 0 2.1
10 WLMC(kg/ha) 0 2.1 0 2.1
11 WLSLC(kg/ha) 0 1.7 0 1.7
12 WLSLNC(kg/ha) 0 0.4 0 0.4
13 WBMC(kg/ha) 9 1102.1 250.9 1361.9
14 WHSC(kg/ha) 69 8432 1920 10420
15 WHPC(kg/ha) 146 18018 4102 22266
16 WOC(kg/ha) 224 27556 6272 34
17 WLSN(kg/ha) 0 0 0 0
18 WLMN(kg/ha) 0 0.2 0 0.2
19 WBMN(kg/ha) 0.9 110.2 25.1 136.2
20 WHSN(kg/ha) 7 843 192 1042
21 WHPN(kg/ha) 15 1802 410 2227
22 WON(kg/ha) 22 2755 627 3405
23 CFEM(kg/ha) 0
You can specify a regular expression to be used as your delimiter; in your case [\s,]{2,20} will work, i.e. a run of 2 or more spaces or commas. (Regex separators require pandas's python parser engine, which read_csv falls back to automatically for multi-character separators.)
In [180]: pd.read_csv('aaaa.csv',
                      skiprows=1,
                      sep=r'[\s,]{2,20}',
                      index_col=0)
Out[180]:
Unnamed: 1 1 1.1 2 TOT
0
1 DEPTH(m) 0.01 1.24 1.52 NaN
2 BD 33kpa(t/m3) 1.60 1.60 1.60 NaN
3 SAND(%) 42.10 42.10 65.10 NaN
4 SILT(%) 37.90 37.90 16.90 NaN
5 CLAY(%) 20.00 20.00 18.00 NaN
6 ROCK(%) 12.00 12.00 12.00 NaN
7 WLS(kg/ha) 0.00 5.00 0.10 5.1
8 WLM(kg/ha) 0.00 5.00 0.10 5.1
9 WLSL(kg/ha) 0.00 4.00 0.10 4.1
10 WLSC(kg/ha) 0.00 2.10 0.00 2.1
11 WLMC(kg/ha) 0.00 2.10 0.00 2.1
12 WLSLC(kg/ha) 0.00 1.70 0.00 1.7
13 WLSLNC(kg/ha) 0.00 0.40 0.00 0.4
14 WBMC(kg/ha) 9.00 1102.10 250.90 1361.9
15 WHSC(kg/ha) 69.00 8432.00 1920.00 10420.0
16 WHPC(kg/ha) 146.00 18018.00 4102.00 22266.0
17 WOC(kg/ha) 224.00 27556.00 6272.00 34.0
18 WLSN(kg/ha) 0.00 0.00 0.00 0.0
19 WLMN(kg/ha) 0.00 0.20 0.00 0.2
20 WBMN(kg/ha) 0.90 110.20 25.10 136.2
21 WHSN(kg/ha) 7.00 843.00 192.00 1042.0
22 WHPN(kg/ha) 15.00 1802.00 410.00 2227.0
23 WON(kg/ha) 22.00 2755.00 627.00 3405.0
24 CFEM(kg/ha) 0.00 NaN NaN NaN
25, None NaN NaN NaN NaN
26, None NaN NaN NaN NaN
You need to specify the names of the columns. Notice the trick I used to get two columns called 1 (one is an integer name and the other is text).
Given how badly the data is structured, this is not perfect (note row 2 where BD and 33kpa got split because of the space between them).
pd.read_csv('/Downloads/aaaa.csv',
            skiprows=2,
            skipinitialspace=True,
            sep=' ',
            names=['Index', 'Description', 1, "1", 2, 'TOT'],
            index_col=0)
Description 1 1 2 TOT
Index
1, DEPTH(m) 0.01 1.24 1.52 NaN
2, BD 33kpa(t/m3) 1.60 1.60 1.6
3, SAND(%) 42.1 42.10 65.10 NaN
4, SILT(%) 37.9 37.90 16.90 NaN
5, CLAY(%) 20.0 20.00 18.00 NaN
6, ROCK(%) 12.0 12.00 12.00 NaN
7, WLS(kg/ha) 0.0 5.00 0.10 5.1
8, WLM(kg/ha) 0.0 5.00 0.10 5.1
9, WLSL(kg/ha) 0.0 4.00 0.10 4.1
10, WLSC(kg/ha) 0.0 2.10 0.00 2.1
11, WLMC(kg/ha) 0.0 2.10 0.00 2.1
12, WLSLC(kg/ha) 0.0 1.70 0.00 1.7
13, WLSLNC(kg/ha) 0.0 0.40 0.00 0.4
14, WBMC(kg/ha) 9.0 1102.10 250.90 1361.9
15, WHSC(kg/ha) 69. 8432.00 1920.00 10420.0
16, WHPC(kg/ha) 146. 18018.00 4102.00 22266.0
17, WOC(kg/ha) 224. 27556.00 6272.00 34.0
18, WLSN(kg/ha) 0.0 0.00 0.00 0.0
19, WLMN(kg/ha) 0.0 0.20 0.00 0.2
20, WBMN(kg/ha) 0.9 110.20 25.10 136.2
21, WHSN(kg/ha) 7. 843.00 192.00 1042.0
22, WHPN(kg/ha) 15. 1802.00 410.00 2227.0
23, WON(kg/ha) 22. 2755.00 627.00 3405.0
24, CFEM(kg/ha) 0. NaN NaN NaN
25, NaN NaN NaN NaN NaN
26, NaN NaN NaN NaN NaN
Or you can reset the index.
>>> (pd.read_csv('/Downloads/aaaa.csv',
                 skiprows=2,
                 skipinitialspace=True,
                 sep=' ',
                 names=['Index', 'Description', 1, "1", 2, 'TOT'],
                 index_col=0)
     .reset_index(drop=True)
     .dropna(axis=0, how='all'))
Description 1 1 2 TOT
0 DEPTH(m) 0.01 1.24 1.52 NaN
1 BD 33kpa(t/m3) 1.60 1.60 1.6
2 SAND(%) 42.1 42.10 65.10 NaN
3 SILT(%) 37.9 37.90 16.90 NaN
4 CLAY(%) 20.0 20.00 18.00 NaN
5 ROCK(%) 12.0 12.00 12.00 NaN
6 WLS(kg/ha) 0.0 5.00 0.10 5.1
7 WLM(kg/ha) 0.0 5.00 0.10 5.1
8 WLSL(kg/ha) 0.0 4.00 0.10 4.1
9 WLSC(kg/ha) 0.0 2.10 0.00 2.1
10 WLMC(kg/ha) 0.0 2.10 0.00 2.1
11 WLSLC(kg/ha) 0.0 1.70 0.00 1.7
12 WLSLNC(kg/ha) 0.0 0.40 0.00 0.4
13 WBMC(kg/ha) 9.0 1102.10 250.90 1361.9
14 WHSC(kg/ha) 69. 8432.00 1920.00 10420.0
15 WHPC(kg/ha) 146. 18018.00 4102.00 22266.0
16 WOC(kg/ha) 224. 27556.00 6272.00 34.0
17 WLSN(kg/ha) 0.0 0.00 0.00 0.0
18 WLMN(kg/ha) 0.0 0.20 0.00 0.2
19 WBMN(kg/ha) 0.9 110.20 25.10 136.2
20 WHSN(kg/ha) 7. 843.00 192.00 1042.0
21 WHPN(kg/ha) 15. 1802.00 410.00 2227.0
22 WON(kg/ha) 22. 2755.00 627.00 3405.0
23 CFEM(kg/ha) 0. NaN NaN NaN
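The regex-separator approach is easy to try offline. A minimal sketch with made-up data (note the explicit engine='python', since the C parser cannot handle regex separators):

```python
import io
import pandas as pd

# Made-up text mixing runs of spaces and commas as delimiters,
# with a junk first line like the question's file
raw = """junk header line
name,,  1,,   2,,  TOT
A,,     0.1,, 0.2,, 0.3
B,,     1.0,, 2.0,, 3.0
"""

df = pd.read_csv(
    io.StringIO(raw),
    skiprows=1,              # drop the junk first line
    sep=r'[\s,]{2,20}',      # any run of 2+ spaces/commas acts as one delimiter
    engine='python',         # regex separators need the python parser engine
)
```

After parsing, the header row yields columns name, 1, 2, TOT and the numeric columns come out as floats, which is the behaviour the answer above relies on.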