Beautifulsoup: Scrape Table with Key Word Search

I'm trying to scrape tables from multiple websites by keyword. I want to extract the values from each table where the row header is "Cash and cash equivalent" and the column header is "2020", so that I can later write them to an Excel file. But I cannot get the code to work. Hope you can help me with this! Thank you!!
from bs4 import BeautifulSoup
import requests
import time
from pandas import DataFrame
import pandas as pd
#headers={"Content-Type":"text"}
headers = {'User-Agent': 'registr#jh.edu'}
urls={'https://www.sec.gov/Archives/edgar/data/1127993/0001091818-21-000003.txt',
'https://www.sec.gov/Archives/edgar/data/1058307/0001493152-21-003451.txt'}
Cash=[]
for url in urls:
    response = requests.get(url, headers = headers)
    response.raise_for_status()
    time.sleep(0.1)
    soup = BeautifulSoup(response.text,'lxml')
    for table in soup.find_all('table'):
        for tr in table.find_all('tr'):
            row = [td.get_text(strip=True) for td in tr.find_all('td')]
            headers = [header.get_text(strip=True).encode("utf-8") for header in tr[0].find_all("th")]
            try:
                if '2020' in headers[0]:
                    if row[0] == 'Cash and cash equivalent':
                        Cash_and_cash_equivalent = f'{url}'+ ' ' + headers+ str(row)
                        Cash.append(Cash_and_cash_equivalent)
                    if row[0] == 'Cash':
                        Cash_ = f'{url}'+ ' ' + headers+ str(row)
                        Cash.append(Cash_)
            except IndexError:
                continue
print(Cash)

You could do something along these lines:
import requests
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
headers = {'User-Agent': 'registr#jh.edu'}
r = requests.get('https://www.sec.gov/Archives/edgar/data/1127993/0001091818-21-000003.txt', headers=headers)
dfs = pd.read_html(str(r.text))
for x in range(len(dfs)):
    if dfs[x].apply(lambda row: row.astype(str).str.contains('Cash and Cash Equivalents').any(), axis=1).any():
        df = dfs[x]
        df.dropna(how='all') # no effect as written: dropna returns a new frame instead of modifying df
        new_header = df.iloc[2]
        df = df[3:]
        df.columns = new_header
        display(df) ## or print(df) if you're not in a jupyter notebook
This will return two dataframes, from tables #37 and #71. You may need to improve the table header detection, as only table #71 comes out with proper headers (years).
I tried to look at the second URL, but it hung for me (huge page).
The printout in the terminal will look something like this:
NaN NaN 2020 NaN 2019
3 Cash Flows from Operating Activities NaN NaN NaN NaN
4 Net loss NaN $(13,134,778) NaN $ (2,017,347)
5 Adjustments to reconcile net loss to net cash used in operating activities: NaN NaN NaN NaN
6 Depreciation and amortization NaN 84940 NaN 7832
7 Amortization of convertible debt discounts NaN 74775 NaN 60268
8 Accretion and settlement of financing instruments NaN NaN NaN NaN
9 and change in fair value of derivative liability NaN 1381363 NaN (1,346,797)
10 Stock compensation and stock issued for services NaN 2870472 NaN -
11 Stock issued under Put Purchase Agreement NaN 7865077 NaN -
12 NaN NaN NaN NaN NaN
13 Changes in assets and liabilities: NaN NaN NaN NaN
14 Accounts receivable NaN (696,710) NaN 82359
15 Inventories NaN (78,919) NaN 304970
16 Accounts payable NaN (1,462,072) NaN (22,995)
17 Accrued expenses NaN (158,601) NaN (346,095)
18 Deferred revenue NaN 431147 NaN (91,453)
19 Net cash used in operating activities NaN (2,823,306) NaN (3,369,258)
20 NaN NaN NaN NaN NaN
21 Cash Flows from Investing Activities NaN NaN NaN NaN
22 Acquisition of business, net of cash NaN - NaN 2967918
23 Purchases of property and equipment NaN - NaN (17,636)
24 Net cash provided by investing activities NaN - NaN 2950282
25 NaN NaN NaN NaN NaN
26 Cash Flows from Financing Activities NaN NaN NaN NaN
27 Principal payments on financing lease obligations NaN - NaN (1,649)
28 Principal payments on notes payable NaN (774) NaN -
29 Payments on advances from stockholder, net NaN (33,110) NaN -
30 Proceeds from convertible notes payable NaN 840000 NaN 667000
31 Payments on line of credit, net NaN (300,000) NaN -
32 Proceeds from sale of common stock under Purchase Agreement NaN 2316520 NaN -
33 Net cash provided by financing activities NaN 2822636 NaN 665351
34 NaN NaN NaN NaN NaN
35 Net Increase (Decrease) in Cash and Cash Equivalents NaN (670) NaN 246375
36 NaN NaN NaN NaN NaN
37 Cash, Beginning of Period NaN 412391 NaN 169430
38 NaN NaN NaN NaN NaN
39 Cash, End of Period NaN $ 411,721 NaN $ 415,805
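The header detection mentioned above could be sketched roughly as follows. This is only an illustration on a small hand-made frame that mimics the layout of the printout; the frame raw, the helper name find_value and the label pattern are assumptions, not part of the original answer. It looks for the column holding the year and the row whose first cell matches a label, then returns the cell where they meet.

```python
import pandas as pd

# Hand-made stand-in for one table returned by pd.read_html (assumed layout:
# first column holds the row labels, some row holds the year headers).
raw = pd.DataFrame([
    [None, None, "2020", None, "2019"],
    ["Net loss", None, "$(13,134,778)", None, "$ (2,017,347)"],
    ["Cash, Beginning of Period", None, "412391", None, "169430"],
    ["Cash, End of Period", None, "$ 411,721", None, "$ 415,805"],
])


def find_value(table, row_pattern, year):
    """Return the cell where a row label matching row_pattern meets the year column.

    Hypothetical helper: returns None when the label or the year is not found.
    """
    text = table.fillna("").astype(str)
    # Columns in which some cell is exactly the year string.
    year_cols = [c for c in text.columns if text[c].str.fullmatch(year).any()]
    # Rows whose first cell matches the label pattern (case-insensitive regex).
    hits = text[text.iloc[:, 0].str.contains(row_pattern, case=False, regex=True)]
    if not year_cols or hits.empty:
        return None
    return text.loc[hits.index[0], year_cols[0]]


cash_2020 = find_value(raw, r"^cash, end of period", "2020")
print(cash_2020)
```

Real filings vary a lot in layout, so the label pattern and the assumption that labels sit in the first column would likely need adjusting per table.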

Related

Use dataframe columns as arguments for function

I want to get arguments from a datafile (excel, .csv, whatever) and pass them as arguments to a Python function.
To get the arguments from the datafile I've converted it to a Pandas dataframe. I then create a list from the index of the df and iterate over this list, looking up the cell values and passing them as arguments.
I've got some working code (see below) but I feel like it's kinda clunky.
Is there a better way to do this?
import pandas as pd
import os
curdir = os.getcwd()
def pdFunc(Name, Module): #function that takes multiple arguments from the dataframe
    print(str(Name) + ',' + str(Module))
    #further code will be added here which will create new .csv files etc. This output is not suitable to be placed in a dataframe.
assetList = os.path.join(curdir, 'Lists', 'Assets_ShortTesting_v1.0.xlsx') # setting path for the excel file with the data
assetdf = pd.read_excel(assetList) #importing the data to a dataframe
indexList = assetdf.index.tolist() #creating list to iterate over
for i in indexList: #iterating over list
    pdFunc(assetdf.loc[i]['Name'], assetdf.loc[i]['Module']) #finding the cell values from the dataframe and setting them as arguments for the function
Here's the dataframe:
Name ISIN SymbolYF SymbolInvestpy Currency Country Exchange Type Module Constituent of
0 Adyen N.V. NaN ADYEN.AS NaN EUR Netherlands NaN Stock 1 AEX
1 Aegon N.V. NaN AGN.AS NaN EUR Netherlands NaN Stock 1 AEX
2 Aalberts N.V. NaN AALB.AS NaN EUR Netherlands NaN Stock 1 AMX
3 ABN AMRO Bank N.V. NaN ABN.AS NaN EUR Netherlands NaN Stock 1 AMX
4 Anheuser-Busch InBev SA/NV NaN ABI.BR NaN EUR Belgium NaN Stock 2 BEL20
5 Ackermans & Van Haaren NV NaN ACKB.BR NaN EUR Belgium NaN Stock 2 BEL20
6 L'Air Liquide S.A. NaN AI.PA NaN EUR France NaN Stock 2 CAC40
7 Airbus SE NaN AIR.PA NaN EUR France NaN Stock 2 CAC40
8 Vonovia SE NaN VNA.DE NaN EUR Germany NaN Stock 2 DAX
9 US Dollar NaN USD-EUR NaN EUR US NaN Forex 3 Forex
10 Shiba Inu NaN SHIB-EUR NaN EUR NaN NaN Crypto 3 Forex
11 FTSE 1000 NaN ^FTSE NaN EUR United Kingdom NaN Index 3 Index
12 Wheat NaN ZW=F NaN USD NaN NaN Commodity 3 Commodity
13 Apple Inc. NaN AAPL NaN USD US NaN Stock 4 US MegaCap
14 Sirius XM Holdings Inc. NaN SIRI NaN USD US NaN Stock 4 US High volume
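One possible, less clunky variant is sketched below. It assumes only the Name and Module columns are needed, and it uses a small made-up frame in place of the Excel file; asset_df and pd_func are illustrative names. Iterating over the zipped columns yields one pair of values per row, so no index list or repeated .loc lookups are required.

```python
import pandas as pd

# Small stand-in for the Excel data (only the two columns the function needs).
asset_df = pd.DataFrame({
    "Name": ["Adyen N.V.", "Aegon N.V.", "Wheat"],
    "Module": [1, 1, 3],
})


def pd_func(name, module):
    """Placeholder for the real per-row work; returns the formatted pair."""
    return f"{name},{module}"


# zip over the selected columns gives one (Name, Module) pair per row.
results = [pd_func(name, module)
           for name, module in zip(asset_df["Name"], asset_df["Module"])]
print(results)
```

If many columns are needed, DataFrame.itertuples(index=False) is a similar option that exposes each row as a named tuple.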

tabula.read_pdf in python, getting a list variable and can't read it

I am using tabula to extract some data from a PDF. When I read the file, it outputs a list, not a dataframe, and I'm having problems reading the values:
file = "example.pdf"
path = 'data/' + file
df = tabula.read_pdf(path, pages = '1', multiple_tables = False)
cliente_raw = tabula.read_pdf(path, pages=1,output_format="dataframe")
print(cliente_raw)
This is the output
[ Beneficiario: Nury García Unnamed: 1 NIT/Cédula:
0 Dirección: Calle 115 #53-74 Apto 307 NaN Ciudad:
1 Referencia Descripción NaN
2 Spectral + Porcelai Perfect Face Kit, -/- NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
39564525 Teléfono: 601 6299329 Unnamed: 5 Unnamed: 6
0 BOGOTA (C/MARCA) País: COLOMBIA NaN NaN
1 Cantidad IVA Valor Unitario NaN Valor Total
2 1 19% 125,210 NaN 125,210
3 NaN Subtotal NaN 125,210
4 NaN IVA NaN 23,790
5 NaN TOTAL NaN 149,000 ]
The len of this variable is 1, so I don't know how to extract the values. Any help?
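A length of 1 is consistent with read_pdf returning a list that holds one DataFrame per detected table, so indexing the list should give the DataFrame itself. The sketch below illustrates this with a made-up list standing in for the tabula result (no PDF is read here, and the column names are only borrowed from the printout above).

```python
import pandas as pd

# Stand-in for the return value of tabula.read_pdf(...): a list holding one
# DataFrame per detected table (here a single, made-up table).
tables = [pd.DataFrame({"Referencia": ["Spectral + Porcelai"],
                        "Valor Total": ["125,210"]})]

print(type(tables), len(tables))

# Indexing the list gives the actual DataFrame, which can be queried as usual.
cliente = tables[0]
valor_total = cliente.loc[0, "Valor Total"]
print(valor_total)
```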

pandas.read_html tables not found

I'm trying to get a list of the major world indices in Yahoo Finance at this URL: https://finance.yahoo.com/world-indices.
I tried first to get the indices in a table by just running
major_indices=pd.read_html("https://finance.yahoo.com/world-indices")[0]
In this case the error was:
ValueError: No tables found
So I read a solution using selenium at pandas read_html - no tables found
the solution they came up with is (with some adjustment):
from selenium import webdriver
import pandas as pd
from selenium.webdriver.common.keys import Keys
from webdrivermanager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().download_and_install())
driver.get("https://finance.yahoo.com/world-indices")
html = driver.page_source
tables = pd.read_html(html)
data = tables[1]
Again this code gave me another error:
ValueError: No tables found
I don't know whether to keep using selenium or whether pd.read_html alone is fine. Either way I'm trying to get this data and don't know how to proceed. Can anyone help me?

You don't need Selenium here, you just have to set the euConsentId cookie:
import pandas as pd
import requests
import uuid
url = 'https://finance.yahoo.com/world-indices'
cookies = {'euConsentId': str(uuid.uuid4())}
html = requests.get(url, cookies=cookies).content
df = pd.read_html(html)[0]
Output:
>>> df
Symbol Name Last Price Change % Change Volume Intraday High/Low 52 Week Range Day Chart
0 ^GSPC S&P 500 4023.89 93.81 +2.39% 2.545B NaN NaN NaN
1 ^DJI Dow 30 32196.66 466.36 +1.47% 388.524M NaN NaN NaN
2 ^IXIC Nasdaq 11805.00 434.04 +3.82% 5.15B NaN NaN NaN
3 ^NYA NYSE COMPOSITE (DJ) 15257.36 326.26 +2.19% 0 NaN NaN NaN
4 ^XAX NYSE AMEX COMPOSITE INDEX 4025.81 122.66 +3.14% 0 NaN NaN NaN
5 ^BUK100P Cboe UK 100 739.68 17.83 +2.47% 0 NaN NaN NaN
6 ^RUT Russell 2000 1792.67 53.28 +3.06% 0 NaN NaN NaN
7 ^VIX CBOE Volatility Index 28.87 -2.90 -9.13% 0 NaN NaN NaN
8 ^FTSE FTSE 100 7418.15 184.81 +2.55% 0 NaN NaN NaN
9 ^GDAXI DAX PERFORMANCE-INDEX 14027.93 288.29 +2.10% 0 NaN NaN NaN
10 ^FCHI CAC 40 6362.68 156.42 +2.52% 0 NaN NaN NaN
11 ^STOXX50E ESTX 50 PR.EUR 3703.42 89.99 +2.49% 0 NaN NaN NaN
12 ^N100 Euronext 100 Index 1211.74 28.89 +2.44% 0 NaN NaN NaN
13 ^BFX BEL 20 3944.56 14.35 +0.37% 0 NaN NaN NaN
14 IMOEX.ME MOEX Russia Index 2307.50 9.61 +0.42% 0 NaN NaN NaN
15 ^N225 Nikkei 225 26427.65 678.93 +2.64% 0 NaN NaN NaN
16 ^HSI HANG SENG INDEX 19898.77 518.43 +2.68% 0 NaN NaN NaN
17 000001.SS SSE Composite Index 3084.28 29.29 +0.96% 3.109B NaN NaN NaN
18 399001.SZ Shenzhen Component 11159.79 64.92 +0.59% 3.16B NaN NaN NaN
19 ^STI STI Index 3191.16 25.98 +0.82% 0 NaN NaN NaN
20 ^AXJO S&P/ASX 200 7075.10 134.10 +1.93% 0 NaN NaN NaN
21 ^AORD ALL ORDINARIES 7307.70 141.10 +1.97% 0 NaN NaN NaN
22 ^BSESN S&P BSE SENSEX 52793.62 -136.69 -0.26% 0 NaN NaN NaN
23 ^JKSE Jakarta Composite Index 6597.99 -1.85 -0.03% 0 NaN NaN NaN
24 ^KLSE FTSE Bursa Malaysia KLCI 1544.41 5.61 +0.36% 0 NaN NaN NaN
25 ^NZ50 S&P/NZX 50 INDEX GROSS 11168.18 -9.18 -0.08% 0 NaN NaN NaN
26 ^KS11 KOSPI Composite Index 2604.24 54.16 +2.12% 788539 NaN NaN NaN
27 ^TWII TSEC weighted index 15832.54 215.86 +1.38% 0 NaN NaN NaN
28 ^GSPTSE S&P/TSX Composite index 20099.81 400.76 +2.03% 294.637M NaN NaN NaN
29 ^BVSP IBOVESPA 106924.18 1236.54 +1.17% 0 NaN NaN NaN
30 ^MXX IPC MEXICO 49579.90 270.58 +0.55% 212.868M NaN NaN NaN
31 ^IPSA S&P/CLX IPSA 5058.88 0.00 0.00% 0 NaN NaN NaN
32 ^MERV MERVAL 38390.84 233.89 +0.61% 0 NaN NaN NaN
33 ^TA125.TA TA-125 1964.95 23.38 +1.20% 0 NaN NaN NaN
34 ^CASE30 EGX 30 Price Return Index 10642.40 -213.50 -1.97% 36.837M NaN NaN NaN
35 ^JN0U.JO Top 40 USD Net TRI Index 4118.19 65.63 +1.62% 0 NaN NaN NaN

Further table url link scraping

Using this awesome code I can scrape the majority of the data, however it seems that some of the URLs may not be scraped. I also need to be able to identify the data source in the future. Therefore I have two questions:
How can I program a check that all links are identified and scraped?
How can I insert the URL as a new column so each row is identified?
url = 'https://www.rootsandrain.com/organiser21/uci/events/filters/dh/'
response = requests.get(url)
soup = BeautifulSoup(response.content)
urlList = ['https://www.rootsandrain.com'+row.a['href'] for row in soup.select('#T1 tbody tr')]
data = []
for url in urlList:
    try:
        data.append(pd.read_html(url)[0])
    except:
        print(f'No tables found:{url}')
pd.concat(data)
Original post
Scraping URL links in a table

A small adjustment to the try/except block, that writes the url and also the status of scraping:
try:
    df = pd.read_html(url)[0]
    df['url'] = url
    df['scraped'] = True
    data.extend(df.to_dict(orient = 'records'))
except:
    print(f'No tables found:{url}')
    data.append(dict(url=url, scraped=False))
Example
Limited to for url in urlList[5:10] so as not to iterate over all of the links in this example; just remove the [5:10] to get all:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.rootsandrain.com/organiser21/uci/events/filters/dh/'
response = requests.get(url)
soup = BeautifulSoup(response.content)
urlList = ['https://www.rootsandrain.com'+row.a['href'] for row in soup.select('#T1 tbody tr')]
data = []
for url in urlList[5:10]:
    try:
        df = pd.read_html(url)[0]
        df['url'] = url
        df['scraped'] = True
        data.extend(df.to_dict(orient = 'records'))
    except:
        print(f'No tables found:{url}')
        data.append(dict(url=url, scraped=False))
pd.DataFrame(data)
#to get rid of all these NaN columns
#pd.DataFrame(data).dropna(axis=1, how='all')
Output
| url | scraped | Pos⇧ | Bib | Name | Licence | YoB | Sponsors | km/h | sector1 + | sector2 + | sector3 + | sector4 + | sector5 = | Qualifier | km/h.1 | sector1 +.1 | sector2 +.1 | sector3 +.1 | sector4 +.1 | sector5 =.1 | Run 1 | Diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| https://www.rootsandrain.com/event9599/2022-jul-9-mercedes-benz-uci-world-cup-dh-4-lenzerheide/ | False | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| https://www.rootsandrain.com/event9598/2022-jun-11-mercedes-benz-uci-world-cup-dh-3-leogang/ | False | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| https://www.rootsandrain.com/event9597/2022-may-22-mercedes-benz-uci-world-cup-dh-2-fort-william/ | False | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| https://www.rootsandrain.com/event9607/2022-apr-23-dhi-masters-world-championships-22-villa-la-angostura/ | False | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| https://www.rootsandrain.com/event9596/2022-mar-27-mercedes-benz-uci-world-cup-dh-1-lourdes/ | True | 1st | 6 | Amaury PIERRON | 1.00088e+10 | 1996 | COMMENCAL - MUC-OFF BY RIDING | 62.211 | 45.680s5 | 32.179s3 | 36.880s2 | 27.636s5 | 28.399s8 | 2:50.7743 | 60.3411 | 44.269s1 | 33.091s23 | 36.159s1 | 26.774s2 | 27.418s1 | 2:47.7111 | - |
| https://www.rootsandrain.com/event9596/2022-mar-27-mercedes-benz-uci-world-cup-dh-1-lourdes/ | True | 2nd | 14 | Finn ILES | 1.00909e+10 | 1999 | Specialized Gravity | 60.072 | 44.915s3 | 32.242s5 | 37.407s5 | 27.636s5 | 28.254s3 | 2:50.4542 | 59.0821 | 44.924s2 | 31.474s3 | 37.175s4 | 26.781s3 | 28.204s6 | 2:48.5582 | 0.847s |
| https://www.rootsandrain.com/event9596/2022-mar-27-mercedes-benz-uci-world-cup-dh-1-lourdes/ | True | 3rd | 1 | Loïc BRUNI | 1.00075e+10 | 1994 | Specialized Gravity | 59.6027 | 45.080s4 | 31.429s1 | 37.254s4 | 27.164s2 | 28.344s6 | 2:49.2711 | 58.8225 | 45.023s4 | 31.653s6 | 37.112s3 | 27.172s5 | 27.837s3 | 2:48.7973 | 1.086s |
| https://www.rootsandrain.com/event9596/2022-mar-27-mercedes-benz-uci-world-cup-dh-1-lourdes/ | True | 4th | 8 | Benoit COULANGES | 1.00082e+10 | 1994 | Dorval AM Commencal | 60.671 | 46.785s15 | 31.773s2 | 36.688s1 | 27.496s3 | 28.348s7 | 2:51.0905 | 59.0821 | 45.302s7 | 31.591s5 | 37.745s9 | 26.711s1 | 27.828s2 | 2:49.1774 | 1.466s |
| https://www.rootsandrain.com/event9596/2022-mar-27-mercedes-benz-uci-world-cup-dh-1-lourdes/ | True | 5th | 17 | Luca SHAW | 1.00088e+10 | 1996 | Canyon Collective Factory Team | 59.0836 | 46.267s7 | 32.952s14 | 38.067s10 | 28.504s17 | 28.453s9 | 2:54.24310 | 61.223 | 45.223s6 | 31.572s4 | 37.436s7 | 27.629s12 | 28.282s10 | 2:50.1425 | 2.431s |
| https://www.rootsandrain.com/event9596/2022-mar-27-mercedes-benz-uci-world-cup-dh-1-lourdes/ | True | 6th | 9 | Danny HART (elt) | 1.00055e+10 | 1991 | CUBE FACTORY RACING | 59.8722 | 47.041s22 | 32.715s8 | 37.246s3 | 28.000s11 | 28.301s4 | 2:53.3037 | 60.0714 | 46.511s21 | 31.868s7 | 36.600s2 | 27.391s10 | 28.036s4 | 2:50.4066 | 2.695s |
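For the first question (checking that every link was handled), the scraped flag collected by the loop above can be summarised per URL. The following is a minimal sketch on made-up records with placeholder example.com URLs; the names url_list, status, failed and missing are illustrative and do not come from the original answer.

```python
import pandas as pd

# Made-up records in the same shape the loop above collects: one dict per
# table row for scraped pages, one dict with scraped=False for failed pages.
data = [
    {"url": "https://example.com/event1/", "scraped": False},
    {"url": "https://example.com/event2/", "scraped": True, "Name": "Rider A"},
    {"url": "https://example.com/event2/", "scraped": True, "Name": "Rider B"},
]
url_list = ["https://example.com/event1/", "https://example.com/event2/"]

df = pd.DataFrame(data)

# One entry per URL with its scrape status.
status = df.groupby("url")["scraped"].first().astype(bool)

# URLs whose scrape failed, and URLs that never made it into the data at all.
failed = status[~status].index.tolist()
missing = sorted(set(url_list) - set(df["url"]))
print(failed, missing)
```

An empty missing list then indicates that every link in the list produced at least one record, whether scraped or not.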

Select a (non-indexed) column based on text content of a cell in a python/pandas dataframe

TL:DR - how do I create a dataframe/series from one or more columns in an existing non-indexed dataframe based on the column(s) containing a specific piece of text?
I'm relatively new to Python and data analysis, and this is my first time posting a question on Stack Overflow, but I've been hunting for an answer for a long time (and used to code regularly) without any success.
I have a dataframe imported from an Excel file that doesn't have named/indexed columns. I am trying to extract data from nearly 2000 of these files, which all have slightly different columns of data (of course - why make it simple... or follow a template... or simply use something other than poorly formatted Excel spreadsheets...).
The original dataframe (from a poorly structured XLS file) looks a bit like this:
0 NaN RIGHT NaN
1 Date UCVA Sph
2 2007-01-13 00:00:00 6/38 [-2.00]
3 2009-11-05 00:00:00 6/9 NaN
4 2009-11-18 00:00:00 6/12 NaN
5 2009-12-14 00:00:00 6/9 [-1.25]
6 2018-04-24 00:00:00 worn CL [-5.50]
3 4 5 6 7 8 9 \
0 NaN NaN NaN NaN NaN NaN NaN
1 Cyl Axis BSCVA Pentacam remarks K1 K2 K2 back
2 [-2.75] 65 6/9 NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN 6/5 Pentacam 46 43.9 -6.6
5 [-5.75] 60 6/6-1 NaN NaN NaN NaN
6 [+7.00} 170 6/7.5 NaN NaN NaN NaN
... 17 18 19 20 21 22 \
0 ... NaN NaN NaN NaN NaN NaN
1 ... BSCVA Pentacam remarks K1 K2 K2 back K max
2 ... 6/5 NaN NaN NaN NaN NaN
3 ... NaN NaN NaN NaN NaN NaN
4 ... NaN Pentacam 44.3 43.7 -6.2 45.5
5 ... 6/4-4 NaN NaN NaN NaN NaN
6 ... 6/5 NaN NaN NaN NaN NaN
I want to extract a set of dataframes/series that I can then combine back together to get a 'tidy' dataframe e.g.:
1 Date R-UCVA R-Sph
2 2007-01-13 00:00:00 6/38 [-2.00]
3 2009-11-05 00:00:00 6/9 NaN
4 2009-11-18 00:00:00 6/12 NaN
5 2009-12-14 00:00:00 6/9 [-1.25]
6 2018-04-24 00:00:00 worn CL [-5.50]
1 R-Cyl R-Axis R-BSCVA R-Penta R-K1 R-K2 R-K2 back
2 [-2.75] 65 6/9 NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN 6/5 Pentacam 46 43.9 -6.6
5 [-5.75] 60 6/6-1 NaN NaN NaN NaN
6 [+7.00} 170 6/7.5 NaN NaN NaN NaN
etc. etc. so I'm trying to write some code that will pull a series of columns that I define by looking for the words "Date" or "UCVA" etc. etc. Then I plan to stitch them back together into a single dataframe with patient identifier as an extra column. And then cycle through all the XLS files, appending the whole lot to a single CSV file that I can then do useful stuff on (like put into an Access Database - yes, I know, but it has to be easy to use and already installed on an NHS computer - and statistical analysis).
Any suggestions? I hope that's enough information.
Thanks very much in advance.
Kind regards
Vicky

Here is something that will hopefully get you started.
I have prepared a text.xlsx file, which I can read as follows:
import pandas as pd
path = 'text.xlsx'
df = pd.read_excel(path, header=[0,1])
# Deal with two levels of headers, here I just join them together crudely
df.columns = df.columns.map(lambda h: ' '.join(h))
# Slight hack because I messed with the column names
# I create two dataframes, one with the first column, one with the second column
df1 = df[[df.columns[0],df.columns[1]]]
df2 = df[[df.columns[0], df.columns[2]]]
# Stacking them on top of each other
result = pd.concat([df1, df2])
print(result)
#Merging them on the Date column
result = pd.merge(left=df1, right=df2, on=df1.columns[0])
print(result)
This gives the output
RIGHT Sph RIGHT UCVA Unnamed: 0_level_0 Date
0 NaN 6/38 2007-01-13 00:00:00
1 NaN 6/37 2009-11-05 00:00:00
2 NaN 9/56 2009-11-18 00:00:00
0 [-2.00] NaN 2007-01-13 00:00:00
1 NaN NaN 2009-11-05 00:00:00
2 NaN NaN 2009-11-18 00:00:00
and
Unnamed: 0_level_0 Date RIGHT UCVA RIGHT Sph
0 2007-01-13 00:00:00 6/38 [-2.00]
1 2009-11-05 00:00:00 6/37 NaN
2 2009-11-18 00:00:00 9/56 NaN
Some pointers:
How to merge two header rows? See this question and answer.
How to select pandas columns conditionally? See e.g. this or this
How to merge dataframes? There is a very good guide in the pandas doc
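As a rough sketch of the conditional column selection, the snippet below picks columns by the text found in the title row. It assumes the sheet was read without headers, that the titles sit in row 1 and that exact matches are wanted; the frame raw is a shortened, hand-made stand-in for the data in the question, and the names wanted, mask and tidy are illustrative.

```python
import pandas as pd

# Hand-made stand-in for a sheet read with header=None: row 1 holds the real
# column titles, rows 2+ hold the data (values shortened from the question).
raw = pd.DataFrame([
    [None, "RIGHT", None, None],
    ["Date", "UCVA", "Sph", "Cyl"],
    ["2007-01-13", "6/38", "[-2.00]", "[-2.75]"],
    ["2009-11-05", "6/9", None, None],
])

wanted = ["Date", "UCVA"]  # titles to look for (assumed exact matches)

# Boolean mask over columns: True where the title row holds a wanted title.
title_row = raw.iloc[1]
mask = title_row.isin(wanted)

# Keep the matching columns, drop the two header rows, and name the columns.
tidy = raw.loc[2:, mask.values].reset_index(drop=True)
tidy.columns = title_row[mask].tolist()
print(tidy)
```

For partial matches, title_row.astype(str).str.contains(...) could replace isin, at the cost of also matching titles such as BSCVA when searching for UCVA-like fragments.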
