Scraping SVG Chart using Selenium - Python

I am trying to scrape the SVG chart, which contains the previous months' prices of the house, from the following URL: https://www.zameen.com/Property/dha_defence_dha_defence_phase_2_1_kanal_neat_and_clean_upper_portion_for_rent-24195800-339-4.html
I am scraping the "Price Index" section, shown in the attached image.
The code snippet is below:
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains

elements = driver.find_elements(by=By.XPATH, value="//*[local-name() = 'svg' and @class='ct-chart-line']//*[name() = 'g' and @class='ct-labels']//span[contains(@class, 'ct-horizontal')]")
print("WebElements: ", len(elements))
actions = ActionChains(driver)
for el in elements:
    actions.move_to_element(el).perform()
    time.sleep(1)
    print(driver.find_element(by=By.CSS_SELECTOR, value="div.chartist-tooltip").text)
    print(driver.find_element(by=By.CSS_SELECTOR, value="div.ct-axis-tooltip-x").text)
Following is the output I have received:
WebElements: 7
32,388,054 ==> Jun 2021
33,828,816 ==> Aug 2021
36,064,647 ==> Oct 2021
38,336,196 ==> Dec 2021
39,535,707 ==> Feb 2022
40,257,851 ==> Apr 2022
40,733,506 ==> May 2022
I am using the X-axis month labels as my elements, moving Selenium's pointer over each one, and capturing the price for that month. Unfortunately, the labels only appear every two months, so I get a total of 7 data points rather than 12.
What would be a better solution so I can get the complete 12 months of prices?

import requests

cookies = {}
headers = {'x-requested-with': 'XMLHttpRequest'}
params = {
    'property_id': '24195800',
    'purpose': '2',
}
response = requests.get('https://www.zameen.com/nfpage/async/property/get_property_search_index', params=params, cookies=cookies, headers=headers)
This request returns all the data for the graph as a nested dict:
response.text
'{"status":"success","search_index_data":{"section_data":{"heading_txt":" Islamabad DHA Defence Phase 2, 1 Kanal Plots Price Index","heading_month":"May 2022"},"index_data":{"class":"inc","current_year":"2022","last_value":"235.60","ratio":135.6,"ratio_percent":135.6,"range":"100.00 - 235.60","weeks_range":"187.33 - 235.60","this_year_range":"225.89 - 235.60","start_date":"Jan 2018","end_date":"May 2022","end_date_formatted":"May 2022"},"price_data_per_unit":{"class":"inc","current_year":"2022","last_value":"9,052","ratio":"5,210","ratio_percent":135.6,"range":"3,842 - 9,052","weeks_range":"7,197 - 9,052","this_year_range":"8,679 - 9,052","start_date":"Jan 2018","end_date":"May 2022","end_date_formatted":"May 2022"},"price_data":{"class":"inc","current_year":"2022","last_value":"40,734,000","ratio":"23,445,000","ratio_percent":135.6,"range":"17,289,000 - 40,734,000","weeks_range":"32,386,500 - 40,734,000","this_year_range":"39,055,500 - 40,734,000","start_date":"Jan 2018","end_date":"May 2022","end_date_formatted":"May 2022"},"chart_data":{"0":{"moving_avg":"100.0000","org_moving_avg":"1.000000","period_end_date":"2018-01-31","slope":"1.000000"},"1":{"moving_avg":"101.5000","org_moving_avg":"1.015000","period_end_date":"2018-02-28","slope":"0.970000"},"2":{"moving_avg":"101.6667","org_moving_avg":"1.016667","period_end_date":"2018-03-31","slope":"0.980000"},"3":{"moving_avg":"102.3333","org_moving_avg":"1.023333","period_end_date":"2018-04-30","slope":"0.980000"},"4":{"moving_avg":"103.0000","org_moving_avg":"1.030000","period_end_date":"2018-05-31","slope":"0.950000"},"5":{"moving_avg":"104.3333","org_moving_avg":"1.043333","period_end_date":"2018-06-30","slope":"0.940000"},"6":{"moving_avg":"105.6667","org_moving_avg":"1.056667","period_end_date":"2018-07-31","slope":"0.940000"},"7":{"moving_avg":"106.3333","org_moving_avg":"1.063333","period_end_date":"2018-08-31","slope":"0.940000"},"8":{"moving_avg":"107.6667","org_moving_avg":"1.076667","period_end_date":"2018-09-30","slope":"0.910000"},"9":{"moving_avg":"108.3333","org_moving_avg":"1.083333","period_end_date":"2018-10-31","slope":"0.930000"},"10":{"moving_avg":"109.0000","org_moving_avg":"1.090000","period_end_date":"2018-11-30","slope":"0.920000"},"11":{"moving_avg":"108.3333","org_moving_avg":"1.083333","period_end_date":"2018-12-31","slope":"0.930000"},"12":{"moving_avg":"108.0000","org_moving_avg":"1.080000","period_end_date":"2019-01-31","slope":"0.930000"},"13":{"moving_avg":"107.6667","org_moving_avg":"1.076667","period_end_date":"2019-02-28","slope":"0.930000"},"14":{"moving_avg":"107.6667","org_moving_avg":"1.076667","period_end_date":"2019-03-31","slope":"0.930000"},"15":{"moving_avg":"108.6667","org_moving_avg":"1.086667","period_end_date":"2019-04-30","slope":"0.910000"},"16":{"moving_avg":"110.0000","org_moving_avg":"1.100000","period_end_date":"2019-05-31","slope":"0.890000"},"17":{"moving_avg":"112.3333","org_moving_avg":"1.123333","period_end_date":"2019-06-30","slope":"0.870000"},"18":{"moving_avg":"113.3333","org_moving_avg":"1.133333","period_end_date":"2019-07-31","slope":"0.880000"},"19":{"moving_avg":"114.3333","org_moving_avg":"1.143333","period_end_date":"2019-08-31","slope":"0.870000"},"20":{"moving_avg":"114.6667","org_moving_avg":"1.146667","period_end_date":"2019-09-30","slope":"0.860000"},"21":{"moving_avg":"116.3333","org_moving_avg":"1.163333","period_end_date":"2019-10-31","slope":"0.850000"},"22":{"moving_avg":"118.0000","org_moving_avg":"1.180000","period_end_date":"2019-11-30","slope":"0.830000"},"2
3":{"moving_avg":"120.0000","org_moving_avg":"1.200000","period_end_date":"2019-12-31","slope":"0.820000"},"24":{"moving_avg":"121.0000","org_moving_avg":"1.210000","period_end_date":"2020-01-31","slope":"0.830000"},"25":{"moving_avg":"120.6667","org_moving_avg":"1.206667","period_end_date":"2020-02-29","slope":"0.840000"},"26":{"moving_avg":"120.3333","org_moving_avg":"1.203333","period_end_date":"2020-03-31","slope":"0.830000"},"27":{"moving_avg":"120.6667","org_moving_avg":"1.206667","period_end_date":"2020-04-30","slope":"0.820000"},"28":{"moving_avg":"122.3333","org_moving_avg":"1.223333","period_end_date":"2020-05-31","slope":"0.810000"},"29":{"moving_avg":"124.0000","org_moving_avg":"1.240000","period_end_date":"2020-06-30","slope":"0.790000"},"30":{"moving_avg":"125.0000","org_moving_avg":"1.250000","period_end_date":"2020-07-31","slope":"0.800000"},"31":{"moving_avg":"127.3333","org_moving_avg":"1.273333","period_end_date":"2020-08-31","slope":"0.760000"},"32":{"moving_avg":"131.3333","org_moving_avg":"1.313333","period_end_date":"2020-09-30","slope":"0.730000"},"33":{"moving_avg":"137.0000","org_moving_avg":"1.370000","period_end_date":"2020-10-31","slope":"0.700000"},"34":{"moving_avg":"142.3333","org_moving_avg":"1.423333","period_end_date":"2020-11-30","slope":"0.680000"},"35":{"moving_avg":"147.0000","org_moving_avg":"1.470000","period_end_date":"2020-12-31","slope":"0.660000"},"36":{"moving_avg":"152.3333","org_moving_avg":"1.523333","period_end_date":"2021-01-31","slope":"0.630000"},"37":{"moving_avg":"159.6667","org_moving_avg":"1.596667","period_end_date":"2021-02-28","slope":"0.590000"},"38":{"moving_avg":"168.0000","org_moving_avg":"1.680000","period_end_date":"2021-03-31","slope":"0.560000"},"39":{"moving_avg":"176.6667","org_moving_avg":"1.766667","period_end_date":"2021-04-30","slope":"0.540000"},"40":{"moving_avg":"182.6667","org_moving_avg":"1.826667","period_end_date":"2021-05-31","slope":"0.540000"},"41":{"moving_avg":"187.3333","org_moving_avg":"1.873333","period_end_date":"2021-06-30","slope":"0.520000"},"42":{"moving_avg":"191.3333","org_moving_avg":"1.913333","period_end_date":"2021-07-31","slope":"0.510000"},"43":{"moving_avg":"195.6667","org_moving_avg":"1.956667","period_end_date":"2021-08-31","slope":"0.500000"},"44":{"moving_avg":"202.0000","org_moving_avg":"2.020000","period_end_date":"2021-09-30","slope":"0.480000"},"45":{"moving_avg":"208.5988","org_moving_avg":"2.085988","period_end_date":"2021-10-31","slope":"0.463400"},"46":{"moving_avg":"216.2373","org_moving_avg":"2.162373","period_end_date":"2021-11-30","slope":"0.448600"},"47":{"moving_avg":"221.7375","org_moving_avg":"2.217375","period_end_date":"2021-12-31","slope":"0.441500"},"48":{"moving_avg":"225.8916","org_moving_avg":"2.258916","period_end_date":"2022-01-31","slope":"0.438100"},"49":{"moving_avg":"228.6755","org_moving_avg":"2.286755","period_end_date":"2022-02-28","slope":"0.432400"},"50":{"moving_avg":"231.0205","org_moving_avg":"2.310205","period_end_date":"2022-03-31","slope":"0.428200"},"51":{"moving_avg":"232.8524","org_moving_avg":"2.328524","period_end_date":"2022-04-30","slope":"0.427800"},"52":{"moving_avg":"235.6036","org_moving_avg":"2.356036","period_end_date":"2022-05-31","slope":"0.417500"}},"base_avg_price":3842,"calculated_value":4500,"selectedMonthData":[{"date":"2021-12-31","price_sqft":"8,519","price":"38,335,500","index":"221.74"},{"date":"2021-06-30","price_sqft":"7,197","price":"32,386,500","index":"187.33"},{"date":"2020-06-30","price_sqft":"4,764","price":"21,438
,000","index":"124.00"}]},"index_type":false}'

Related

Scraping all entries of lazyloading page using python

See this page with ECB press releases. These go back to 1997, so it would be nice to automate getting all the links going back in time.
I found the tag that harbours the links ('//*[@id="lazyload-container"]'), but it only gets the most recent links.
How to get the rest?
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox(executable_path=r'/usr/local/bin/geckodriver')
driver.get(url)
element = driver.find_element_by_xpath('//*[@id="lazyload-container"]')
element = element.get_attribute('innerHTML')
The data is loaded via JavaScript from another URL. You can use this example to load the releases from different years:
import requests
from bs4 import BeautifulSoup

url = "https://www.ecb.europa.eu/press/pr/date/{}/html/index_include.en.html"

for year in range(1997, 2023):
    soup = BeautifulSoup(requests.get(url.format(year)).content, "html.parser")
    for a in soup.select(".title a")[::-1]:
        print(a.find_previous(class_="date").text, a.text)
Prints:
25 April 1997 "EUR" - the new currency code for the euro
1 July 1997 Change of presidency of the European Monetary Institute
2 July 1997 The security features of the euro banknotes
2 July 1997 The EMI's mandate with respect to banknotes
...
17 February 2022 Financial statements of the ECB for 2021
21 February 2022 Survey on credit terms and conditions in euro-denominated securities financing and over-the-counter derivatives markets (SESFOD) - December 2021
21 February 2022 Results of the December 2021 survey on credit terms and conditions in euro-denominated securities financing and over-the-counter derivatives markets (SESFOD)
EDIT: To print links:
import requests
from bs4 import BeautifulSoup

url = "https://www.ecb.europa.eu/press/pr/date/{}/html/index_include.en.html"

for year in range(1997, 2023):
    soup = BeautifulSoup(requests.get(url.format(year)).content, "html.parser")
    for a in soup.select(".title a")[::-1]:
        print(
            a.find_previous(class_="date").text,
            a.text,
            "https://www.ecb.europa.eu" + a["href"],
        )
Prints:
...
15 December 1999 Monetary policy decisions https://www.ecb.europa.eu/press/pr/date/1999/html/pr991215.en.html
20 December 1999 Visit by the Finnish Prime Minister https://www.ecb.europa.eu/press/pr/date/1999/html/pr991220.en.html
...
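If you would rather collect everything into one table than print it, a small variation (a sketch; the column names and output filename are my own) could gather the rows into a pandas DataFrame:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.ecb.europa.eu/press/pr/date/{}/html/index_include.en.html"

rows = []
for year in range(1997, 2023):
    soup = BeautifulSoup(requests.get(url.format(year)).content, "html.parser")
    for a in soup.select(".title a")[::-1]:
        # same selectors as above, just collected instead of printed
        rows.append({
            "date": a.find_previous(class_="date").text,
            "title": a.text,
            "link": "https://www.ecb.europa.eu" + a["href"],
        })

df = pd.DataFrame(rows)
df.to_csv("ecb_press_releases.csv", index=False)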

How can I scrape historical data when date range is in a filter?

I'm trying to scrape some historical data on the following url: https://markets.ft.com/data/funds/tearsheet/historical?s=LU0526609390:EUR
I would like to scrape all the available historical data; however, the website only lets me scrape the daily prices of the last 30 days. To go back further, I have to use the filters, and I can only filter one year at a time.
I can easily scrape the information available on the first table using the following code for a couple of funds:
import requests
import pandas as pd
import datetime
import csv

urls = ['https://markets.ft.com/data/funds/tearsheet/historical?s=LU0526609390:EUR',
        'https://markets.ft.com/data/funds/tearsheet/historical?s=IE00BHBX0Z19:EUR',
        'https://markets.ft.com/data/funds/tearsheet/historical?s=LU1076093779:EUR',
        'https://markets.ft.com/data/funds/tearsheet/historical?s=LU1116896363:EUR']

# Change date format as there appear to be two versions of the date on the FT website for different sized browsers
def format_date(date):
    date = date.split(',')[-2][1:] + date.split(',')[-1]
    return pd.Series({'Date': date})

# Create list to allow all scraping data to be saved in one .csv file
dfs = []

# Create scraping loop for all defined urls
for url in urls:
    ISIN = url.split('=')[-1].replace(':', '_')
    html = requests.get(url).content
    df_list = pd.read_html(html)
    df = df_list[-1]
    df['Date'] = df['Date'].apply(format_date)
    print(df)
    dfs.append(df)
However, I'm unable to make use of the filters on the webpage to obtain more historical data. I've tried many things but always get a different error message. How can I do this?
The data is loaded from an API that lets you set date ranges, e.g. https://markets.ft.com/data/equities/ajax/get-historical-prices?startDate=2020/10/01&endDate=2021/10/01&symbol=535700333. This makes it possible to bypass the filters entirely:
import requests
import pandas as pd
from datetime import datetime
import time

# create list of annual dates for the past 100 years starting from today
datelist = pd.date_range(end=datetime.now(), periods=100, freq=pd.DateOffset(years=1))[::-1].strftime('%Y/%m/%d')

# create empty df
df = pd.DataFrame(None, columns=['Date', 'Open', 'High', 'Low', 'Close', 'Volume'])

# not sure when the historical data starts, so let's wrap it in a while loop
while True:
    for end, start in zip(datelist, datelist[1:]):
        try:
            r = requests.get(f'https://markets.ft.com/data/equities/ajax/get-historical-prices?startDate={start}&endDate={end}&symbol=535700333').json()
            df_temp = pd.read_html('<table>' + r['html'] + '</table>')[0]
            df_temp.columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume']
            df = df.append(df_temp)
            time.sleep(2)
        except:
            break
    break
Output:
   Date                                          Open   High   Low    Close  Volume
0  Friday, October 15, 2021Fri, Oct 15, 2021     78.8   78.8   78.8   78.8   ----
1  Thursday, October 14, 2021Thu, Oct 14, 2021   78.89  78.89  78.89  78.89  ----
2  Wednesday, October 13, 2021Wed, Oct 13, 2021  78.7   78.7   78.7   78.7   ----
3  Tuesday, October 12, 2021Tue, Oct 12, 2021    78.58  78.58  78.58  78.58  ----
4  Monday, October 11, 2021Mon, Oct 11, 2021     78.58  78.58  78.58  78.58  ----
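The Date column comes back with the long and short date forms concatenated (the same two-version quirk the question's format_date works around). A small cleanup sketch, reusing the same slicing idea, keeps only the short form:

# 'Friday, October 15, 2021Fri, Oct 15, 2021' -> 'Oct 15 2021'
def clean_date(date):
    return date.split(',')[-2][1:] + date.split(',')[-1]

df['Date'] = df['Date'].apply(clean_date)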

Extraction of some date formats failed when using Dateutil in Python

I have gone through multiple links before posting this question, so please read through. Below are the two answers which have solved 90% of my problem:
parse multiple dates using dateutil
How to parse multiple dates from a block of text in Python (or another language)
Problem: I need to parse multiple dates in multiple formats in Python
Solution from the above links: I am able to parse most dates, but there are still certain formats I cannot handle.
Formats which still can't be parsed are:
text ='I want to visit from May 16-May 18'
text ='I want to visit from May 16-18'
text ='I want to visit from May 6 May 18'
I have also tried regex, but since dates can come in any format, I ruled out that option because the code was getting very complex. Please suggest modifications to the code presented in those links so that the above 3 formats can also be handled.
This kind of problem is always going to need tweaking with new edge cases, but the following approach is fairly robust:
from itertools import groupby, izip_longest
from datetime import datetime, timedelta
import calendar
import string
import re

def get_date_part(x):
    if x.lower() in month_list:
        return x
    day = re.match(r'(\d+)(\b|st|nd|rd|th)', x, re.I)
    if day:
        return day.group(1)
    return False

def month_full(month):
    try:
        return datetime.strptime(month, '%B').strftime('%b')
    except:
        return datetime.strptime(month, '%b').strftime('%b')

tests = [
    'I want to visit from May 16-May 18',
    'I want to visit from May 16-18',
    'I want to visit from May 6 May 18',
    'May 6,7,8,9,10',
    '8 May to 10 June',
    'July 10/20/30',
    'from June 1, july 5 to aug 5 please',
    '2nd March to the 3rd January',
    '15 march, 10 feb, 5 jan',
    '1 nov 2017',
    '27th Oct 2010 until 1st jan',
    '27th Oct 2010 until 1st jan 2012']

cur_year = 2017
month_list = [m.lower() for m in list(calendar.month_name) + list(calendar.month_abbr) if len(m)]
remove_punc = string.maketrans(string.punctuation, ' ' * len(string.punctuation))

for date in tests:
    date_parts = [get_date_part(part) for part in date.translate(remove_punc).split() if get_date_part(part)]
    days = []
    months = []
    years = []

    for k, g in groupby(sorted(date_parts, key=lambda x: x.isdigit()), lambda y: not y.isdigit()):
        values = list(g)

        if k:
            months = map(month_full, values)
        else:
            for v in values:
                if 1900 <= int(v) <= 2100:
                    years.append(int(v))
                else:
                    days.append(v)

    if days and months:
        if years:
            dates_raw = [datetime.strptime('{} {} {}'.format(m, d, y), '%b %d %Y') for m, d, y in izip_longest(months, days, years, fillvalue=years[0])]
        else:
            dates_raw = [datetime.strptime('{} {}'.format(m, d), '%b %d').replace(year=cur_year) for m, d in izip_longest(months, days, fillvalue=months[0])]
            years = [cur_year]

        # Fix for jumps in year
        dates = []
        start_date = datetime(years[0], 1, 1)
        next_year = years[0] + 1

        for d in dates_raw:
            if d < start_date:
                d = d.replace(year=next_year)
                next_year += 1
            start_date = d
            dates.append(d)

        print "{} -> {}".format(date, ', '.join(d.strftime("%d/%m/%Y") for d in dates))
This converts the test strings as follows:
I want to visit from May 16-May 18 -> 16/05/2017, 18/05/2017
I want to visit from May 16-18 -> 16/05/2017, 18/05/2017
I want to visit from May 6 May 18 -> 06/05/2017, 18/05/2017
May 6,7,8,9,10 -> 06/05/2017, 07/05/2017, 08/05/2017, 09/05/2017, 10/05/2017
8 May to 10 June -> 08/05/2017, 10/06/2017
July 10/20/30 -> 10/07/2017, 20/07/2017, 30/07/2017
from June 1, july 5 to aug 5 please -> 01/06/2017, 05/07/2017, 05/08/2017
2nd March to the 3rd January -> 02/03/2017, 03/01/2018
15 march, 10 feb, 5 jan -> 15/03/2017, 10/02/2018, 05/01/2019
1 nov 2017 -> 01/11/2017
27th Oct 2010 until 1st jan -> 27/10/2010, 01/01/2011
27th Oct 2010 until 1st jan 2012 -> 27/10/2010, 01/01/2012
This works as follows:
First create a list of valid months names, i.e. both full and abbreviated.
Make a translation table to make it easy to quickly remove any punctuation from the text.
Split the text, and extract only the date parts by using a function with a regular expression to spot days or months.
Sort the list based on whether or not the part is a digit; this groups months to the front and digits to the end.
Take the first and last part of each list. Convert months into full form e.g. Aug to August and convert each into datetime objects.
If a date appears to be before the previous one, add a whole year.
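Note that the code above is Python 2 (izip_longest, string.maketrans, and print statements). A sketch of the handful of changes a Python 3 port would need:

import string
from itertools import zip_longest  # Python 3 name for izip_longest

# str.maketrans replaces string.maketrans:
remove_punc = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

# map() is lazy in Python 3; materialize it so the months[0] fillvalue still works:
# months = list(map(month_full, values))

# and print is a function:
# print("{} -> {}".format(date, ', '.join(d.strftime("%d/%m/%Y") for d in dates)))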

Regex to use monthly.out for user accounting

I'd like to use Python to analyse /var/log/monthly.out on OS X to export user accounting totals. The log file looks like this:
Mon Feb 1 09:12:41 GMT 2016
Rotating fax log files:
Doing login accounting:
total 688.31
example 401.12
_mbsetupuser 287.10
root 0.05
admin 0.04
-- End of monthly output --
Tue Feb 16 14:27:21 GMT 2016
Rotating fax log files:
Doing login accounting:
total 0.00
-- End of monthly output --
Thu Mar 3 09:37:31 GMT 2016
Rotating fax log files:
Doing login accounting:
total 377.92
example 377.92
-- End of monthly output --
I was able to extract the username / totals pairs with this regex:
\t(\w*)\W*(\d*\.\d{2})
In Python:
>>> import re
>>> re.findall(r'\t(\w*)\W*(\d*\.\d{2})', open('/var/log/monthly.out', 'r').read())
[('total', '688.31'), ('example', '401.12'), ('_mbsetupuser', '287.10'), ('root', '0.05'), ('admin', '0.04'), ('total', '0.00'), ('total', '377.92'), ('example', '377.92')]
But I can't figure out how to extract the date line in such a way that it's attached to the username/totals pairs for that month.
Use str.split().
import re

re_user_amount = r'\s+(\w+)\s+(\d*\.\d{2})'
re_date = r'\w{3}\s+\w{3}\s+\d+\s+\d\d:\d\d:\d\d \w+ \d{4}'

with open('/var/log/monthly.out', 'r') as f:
    content = f.read()

sections = content.split('-- End of monthly output --')
for section in sections:
    date = re.findall(re_date, section)
    matches = re.findall(re_user_amount, section)
    print(date, matches)
If you want to turn the date string into an actual datetime, check out Converting string into datetime.
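For example, a minimal sketch (borrowing the double-space trick from the longer answer below, since monthly.out pads single-digit days with a space rather than a zero):

from datetime import datetime

# e.g. date[0] == 'Mon Feb  1 09:12:41 GMT 2016'; normalize the space-padded
# day to zero-padded before parsing
parsed = datetime.strptime(date[0].replace('  ', ' 0'), '%a %b %d %H:%M:%S %Z %Y')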
Well, there's rarely a magical regex-based cure for everything. Regexes are a great tool for simple string parsing, but they shouldn't replace good old programming! If you look at your data, you'll notice that it always starts with a date and ends with the -- End of monthly output -- line. So a nice way to handle it is to split your data at each monthly output.
Let's start with your data:
>>> s = """\
... Mon Feb 1 09:12:41 GMT 2016
...
... Rotating fax log files:
...
... Doing login accounting:
... total 688.31
... example 401.12
... _mbsetupuser 287.10
... root 0.05
... admin 0.04
...
... -- End of monthly output --
...
... Tue Feb 16 14:27:21 GMT 2016
...
... Rotating fax log files:
...
... Doing login accounting:
... total 0.00
...
... -- End of monthly output --
...
... Thu Mar 3 09:37:31 GMT 2016
...
... Rotating fax log files:
...
... Doing login accounting:
... total 377.92
... example 377.92
...
... -- End of monthly output --"""
And let's split it based on that end-of-month line:
>>> reports = s.split('-- End of monthly output --')
>>> reports
['Mon Feb 1 09:12:41 GMT 2016\n\nRotating fax log files:\n\nDoing login accounting:\n total 688.31\n example 401.12\n _mbsetupuser 287.10\n root 0.05\n admin 0.04\n\n', '\n\nTue Feb 16 14:27:21 GMT 2016\n\nRotating fax log files:\n\nDoing login accounting:\n total 0.00\n\n', '\n\nThu Mar 3 09:37:31 GMT 2016\n\nRotating fax log files:\n\nDoing login accounting:\n total 377.92\n example 377.92\n\n', '']
Then you can separate the accounting data from the rest of the log:
>>> report = reports[0]
>>> head, tail = report.split('Doing login accounting:')
Now let's extract the date line:
>>> date_line = head.strip().split('\n')[0]
And fill up a dict with those username/totals pairs:
>>> accounting = dict(zip(tail.split()[::2], tail.split()[1::2]))
The trick here is to use zip() to create pairs out of iterators on tail: the "left" side of each pair comes from an iterator starting at index 0 and stepping every 2 items, the "right" side from an iterator starting at index 1 and stepping every 2 items. Which makes:
{'admin': '0.04', 'root': '0.05', 'total': '688.31', '_mbsetupuser': '287.10', 'example': '401.12'}
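To make the interleaving concrete, continuing the session above, here are the two slices being zipped for the first report:

>>> tail.split()[::2]
['total', 'example', '_mbsetupuser', 'root', 'admin']
>>> tail.split()[1::2]
['688.31', '401.12', '287.10', '0.05', '0.04']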
So now that's done, you can do that in a for loop:
import datetime

def parse_monthly_log(log_path='/var/log/monthly.out'):
    with open(log_path, 'r') as log:
        reports = log.read().strip('\n ').split('-- End of monthly output --')
    for report in filter(lambda it: it, reports):
        head, tail = report.split('Doing login accounting:')
        date_line = head.strip().split('\n')[0]
        accounting = dict(zip(tail.split()[::2], tail.split()[1::2]))
        yield {
            'date': datetime.datetime.strptime(date_line.replace('  ', ' 0'), '%a %b %d %H:%M:%S %Z %Y'),
            'accounting': accounting
        }
>>> import pprint
>>> pprint.pprint(list(parse_monthly_log()), indent=2)
[ { 'accounting': { '_mbsetupuser': '287.10',
'admin': '0.04',
'example': '401.12',
'root': '0.05',
'total': '688.31'},
'date': datetime.datetime(2016, 2, 1, 9, 12, 41)},
{ 'accounting': { 'total': '0.00'},
'date': datetime.datetime(2016, 2, 16, 14, 27, 21)},
{ 'accounting': { 'example': '377.92', 'total': '377.92'},
'date': datetime.datetime(2016, 3, 3, 9, 37, 31)}]
And there you go with a pythonic solution without a single regex.
N.B.: I had to do a little trick with the datetime, because the log pads the day number with a space rather than a zero (as strptime expects), so I used str.replace() to change a double space into ' 0' within the date string.
N.B.: the filter() and the strip() used around the reports in the for report… loop are there to remove leading and trailing empty reports, depending on how the log file starts or ends.
Here's something shorter:
with open("/var/log/monthly.out") as f:
months = map(str.strip, f.read().split("-- End of monthly output --"))
for sec in filter(None, y):
date = sec.splitlines()[0]
accs = re.findall("\n\s+(\w+)\s+([\d\.]+)", sec)
print(date, accs)
This divides the file content into months, extracts the date of each month and searches for all accounts in each month.
You may want to try the following regex, which is not so elegant though:
import re
string = """
Mon Feb 1 09:12:41 GMT 2016
Rotating fax log files:
Doing login accounting:
total 688.31
example 401.12
_mbsetupuser 287.10
root 0.05
admin 0.04
-- End of monthly output --
Tue Feb 16 14:27:21 GMT 2016
Rotating fax log files:
Doing login accounting:
total 0.00
-- End of monthly output --
Thu Mar 3 09:37:31 GMT 2016
Rotating fax log files:
Doing login accounting:
total 377.92
example 377.92
-- End of monthly output --
"""
pattern = '(\w+\s+\w+\s+[\d:\s]+[A-Z]{3}\s+\d{4})[\s\S]+?((?:\w+)\s+(?:[0-9.]+))\s+(?:((?:\w+)\s*(?:[0-9.]+)))?\s+(?:((?:\w+)\s*(?:[0-9.]+)))?\s*(?:((?:\w+)\s+(?:[0-9.]+)))?\s*(?:((?:\w+)\s*(?:[0-9.]+)))?'
print re.findall(pattern, string)
Output:
[('Mon Feb 1 09:12:41 GMT 2016', 'total 688.31', 'example 401.12', '_mbsetupuser 287.10', 'root 0.05', 'admin 0.04'),
('Tue Feb 16 14:27:21 GMT 2016', 'total 0.00', '', '', '', ''),
('Thu Mar 3 09:37:31 GMT 2016', 'total 377.92', 'example 377.92', '', '', '')]

Beautiful soup and extracting values

I would be grateful if you could give me some guidance on how to grab the date of birth "16 June 1723" below using BeautifulSoup. Using my code I have managed to grab the values you see below under Result; however, all I need is the value 16 June 1723. Any advice?
My code:
birth = soup.find("table",{"class":"infobox"})
test = birth.find(text='Born')
next_cell = test.find_parent('th').find_next_sibling('td').get_text()
print next_cell
Result:
16 June 1723 NS (5 June 1723 OS)Kirkcaldy, Scotland,Great Britain
Instead of the last print statement, add this:
print ' '.join(str(next_cell).split()[:3])
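That slicing works for this particular string; a slightly more defensive variant (a sketch, assuming the cell text always begins with a day-month-year date) would match it explicitly:

import re

# grab a leading 'DD Month YYYY' date from the cell text
m = re.match(r'\d{1,2} \w+ \d{4}', next_cell)
if m:
    print m.group(0)  # 16 June 1723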
