Scraping SVG Chart using Selenium - Python
I am trying to scrape the SVG chart which contains the previous months' prices of the house from the following URL: https://www.zameen.com/Property/dha_defence_dha_defence_phase_2_1_kanal_neat_and_clean_upper_portion_for_rent-24195800-339-4.html
I am scraping the "Price Index" section (screenshot attached in the original post).
The code snippet is below:
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains

elements = driver.find_elements(by=By.XPATH, value="//*[local-name()='svg' and @class='ct-chart-line']//*[name()='g' and @class='ct-labels']//span[contains(@class, 'ct-horizontal')]")
print("WebElements:", len(elements))

actions = ActionChains(driver)
for el in elements:
    actions.move_to_element(el).perform()
    time.sleep(1)
    print(driver.find_element(by=By.CSS_SELECTOR, value="div.chartist-tooltip").text)
    print(driver.find_element(by=By.CSS_SELECTOR, value="div.ct-axis-tooltip-x").text)
Following is the output I have received:
WebElements: 7
32,388,054 ==> Jun 2021
33,828,816 ==> Aug 2021
36,064,647 ==> Oct 2021
38,336,196 ==> Dec 2021
39,535,707 ==> Feb 2022
40,257,851 ==> Apr 2022
40,733,506 ==> May 2022
I am using the x-axis month labels as my elements, moving the Selenium pointer onto each label and capturing the tooltip price for that month. Unfortunately, the chart only renders a label every second month, so I get 7 data points rather than 12.
What would be a better approach so that I can get the complete 12 months of prices?
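One hedged workaround on the Selenium side: Chartist (the library behind the ct-* classes here) draws one point element per data value, so hovering the data points rather than the sparse axis labels should pop a tooltip for every month. A minimal sketch, assuming the points are the usual Chartist line elements with class ct-point (untested against this page):

# Hover every Chartist data point (class "ct-point") instead of the x-axis
# labels, which this chart only renders every second month.
points = driver.find_elements(
    by=By.XPATH,
    value="//*[local-name()='svg' and @class='ct-chart-line']"
          "//*[name()='line' and contains(@class, 'ct-point')]",
)
print("Points:", len(points))

actions = ActionChains(driver)
for point in points:
    actions.move_to_element(point).perform()
    time.sleep(1)
    price = driver.find_element(by=By.CSS_SELECTOR, value="div.chartist-tooltip").text
    month = driver.find_element(by=By.CSS_SELECTOR, value="div.ct-axis-tooltip-x").text
    print(price, "==>", month)

A more robust route, though, is to skip the rendered chart entirely and query the endpoint that returns the underlying data, as the following answer does: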
import requests

cookies = {}
headers = {'x-requested-with': 'XMLHttpRequest'}
params = {
    'property_id': '24195800',
    'purpose': '2',
}
response = requests.get(
    'https://www.zameen.com/nfpage/async/property/get_property_search_index',
    params=params,
    cookies=cookies,
    headers=headers,
)
This returns all the data behind the graph as a nested dict:
response.text
'{"status":"success","search_index_data":{"section_data":{"heading_txt":" Islamabad DHA Defence Phase 2, 1 Kanal Plots Price Index","heading_month":"May 2022"},"index_data":{"class":"inc","current_year":"2022","last_value":"235.60","ratio":135.6,"ratio_percent":135.6,"range":"100.00 - 235.60","weeks_range":"187.33 - 235.60","this_year_range":"225.89 - 235.60","start_date":"Jan 2018","end_date":"May 2022","end_date_formatted":"May 2022"},"price_data_per_unit":{"class":"inc","current_year":"2022","last_value":"9,052","ratio":"5,210","ratio_percent":135.6,"range":"3,842 - 9,052","weeks_range":"7,197 - 9,052","this_year_range":"8,679 - 9,052","start_date":"Jan 2018","end_date":"May 2022","end_date_formatted":"May 2022"},"price_data":{"class":"inc","current_year":"2022","last_value":"40,734,000","ratio":"23,445,000","ratio_percent":135.6,"range":"17,289,000 - 40,734,000","weeks_range":"32,386,500 - 40,734,000","this_year_range":"39,055,500 - 40,734,000","start_date":"Jan 2018","end_date":"May 2022","end_date_formatted":"May 2022"},"chart_data":{"0":{"moving_avg":"100.0000","org_moving_avg":"1.000000","period_end_date":"2018-01-31","slope":"1.000000"},"1":{"moving_avg":"101.5000","org_moving_avg":"1.015000","period_end_date":"2018-02-28","slope":"0.970000"},"2":{"moving_avg":"101.6667","org_moving_avg":"1.016667","period_end_date":"2018-03-31","slope":"0.980000"},"3":{"moving_avg":"102.3333","org_moving_avg":"1.023333","period_end_date":"2018-04-30","slope":"0.980000"},"4":{"moving_avg":"103.0000","org_moving_avg":"1.030000","period_end_date":"2018-05-31","slope":"0.950000"},"5":{"moving_avg":"104.3333","org_moving_avg":"1.043333","period_end_date":"2018-06-30","slope":"0.940000"},"6":{"moving_avg":"105.6667","org_moving_avg":"1.056667","period_end_date":"2018-07-31","slope":"0.940000"},"7":{"moving_avg":"106.3333","org_moving_avg":"1.063333","period_end_date":"2018-08-31","slope":"0.940000"},"8":{"moving_avg":"107.6667","org_moving_avg":"1.076667","period_end_date":"2018-09-30","slope":"0.910000"},"9":{"moving_avg":"108.3333","org_moving_avg":"1.083333","period_end_date":"2018-10-31","slope":"0.930000"},"10":{"moving_avg":"109.0000","org_moving_avg":"1.090000","period_end_date":"2018-11-30","slope":"0.920000"},"11":{"moving_avg":"108.3333","org_moving_avg":"1.083333","period_end_date":"2018-12-31","slope":"0.930000"},"12":{"moving_avg":"108.0000","org_moving_avg":"1.080000","period_end_date":"2019-01-31","slope":"0.930000"},"13":{"moving_avg":"107.6667","org_moving_avg":"1.076667","period_end_date":"2019-02-28","slope":"0.930000"},"14":{"moving_avg":"107.6667","org_moving_avg":"1.076667","period_end_date":"2019-03-31","slope":"0.930000"},"15":{"moving_avg":"108.6667","org_moving_avg":"1.086667","period_end_date":"2019-04-30","slope":"0.910000"},"16":{"moving_avg":"110.0000","org_moving_avg":"1.100000","period_end_date":"2019-05-31","slope":"0.890000"},"17":{"moving_avg":"112.3333","org_moving_avg":"1.123333","period_end_date":"2019-06-30","slope":"0.870000"},"18":{"moving_avg":"113.3333","org_moving_avg":"1.133333","period_end_date":"2019-07-31","slope":"0.880000"},"19":{"moving_avg":"114.3333","org_moving_avg":"1.143333","period_end_date":"2019-08-31","slope":"0.870000"},"20":{"moving_avg":"114.6667","org_moving_avg":"1.146667","period_end_date":"2019-09-30","slope":"0.860000"},"21":{"moving_avg":"116.3333","org_moving_avg":"1.163333","period_end_date":"2019-10-31","slope":"0.850000"},"22":{"moving_avg":"118.0000","org_moving_avg":"1.180000","period_end_date":"2019-11-30","slope":"0.830000"},"2
3":{"moving_avg":"120.0000","org_moving_avg":"1.200000","period_end_date":"2019-12-31","slope":"0.820000"},"24":{"moving_avg":"121.0000","org_moving_avg":"1.210000","period_end_date":"2020-01-31","slope":"0.830000"},"25":{"moving_avg":"120.6667","org_moving_avg":"1.206667","period_end_date":"2020-02-29","slope":"0.840000"},"26":{"moving_avg":"120.3333","org_moving_avg":"1.203333","period_end_date":"2020-03-31","slope":"0.830000"},"27":{"moving_avg":"120.6667","org_moving_avg":"1.206667","period_end_date":"2020-04-30","slope":"0.820000"},"28":{"moving_avg":"122.3333","org_moving_avg":"1.223333","period_end_date":"2020-05-31","slope":"0.810000"},"29":{"moving_avg":"124.0000","org_moving_avg":"1.240000","period_end_date":"2020-06-30","slope":"0.790000"},"30":{"moving_avg":"125.0000","org_moving_avg":"1.250000","period_end_date":"2020-07-31","slope":"0.800000"},"31":{"moving_avg":"127.3333","org_moving_avg":"1.273333","period_end_date":"2020-08-31","slope":"0.760000"},"32":{"moving_avg":"131.3333","org_moving_avg":"1.313333","period_end_date":"2020-09-30","slope":"0.730000"},"33":{"moving_avg":"137.0000","org_moving_avg":"1.370000","period_end_date":"2020-10-31","slope":"0.700000"},"34":{"moving_avg":"142.3333","org_moving_avg":"1.423333","period_end_date":"2020-11-30","slope":"0.680000"},"35":{"moving_avg":"147.0000","org_moving_avg":"1.470000","period_end_date":"2020-12-31","slope":"0.660000"},"36":{"moving_avg":"152.3333","org_moving_avg":"1.523333","period_end_date":"2021-01-31","slope":"0.630000"},"37":{"moving_avg":"159.6667","org_moving_avg":"1.596667","period_end_date":"2021-02-28","slope":"0.590000"},"38":{"moving_avg":"168.0000","org_moving_avg":"1.680000","period_end_date":"2021-03-31","slope":"0.560000"},"39":{"moving_avg":"176.6667","org_moving_avg":"1.766667","period_end_date":"2021-04-30","slope":"0.540000"},"40":{"moving_avg":"182.6667","org_moving_avg":"1.826667","period_end_date":"2021-05-31","slope":"0.540000"},"41":{"moving_avg":"187.3333","org_moving_avg":"1.873333","period_end_date":"2021-06-30","slope":"0.520000"},"42":{"moving_avg":"191.3333","org_moving_avg":"1.913333","period_end_date":"2021-07-31","slope":"0.510000"},"43":{"moving_avg":"195.6667","org_moving_avg":"1.956667","period_end_date":"2021-08-31","slope":"0.500000"},"44":{"moving_avg":"202.0000","org_moving_avg":"2.020000","period_end_date":"2021-09-30","slope":"0.480000"},"45":{"moving_avg":"208.5988","org_moving_avg":"2.085988","period_end_date":"2021-10-31","slope":"0.463400"},"46":{"moving_avg":"216.2373","org_moving_avg":"2.162373","period_end_date":"2021-11-30","slope":"0.448600"},"47":{"moving_avg":"221.7375","org_moving_avg":"2.217375","period_end_date":"2021-12-31","slope":"0.441500"},"48":{"moving_avg":"225.8916","org_moving_avg":"2.258916","period_end_date":"2022-01-31","slope":"0.438100"},"49":{"moving_avg":"228.6755","org_moving_avg":"2.286755","period_end_date":"2022-02-28","slope":"0.432400"},"50":{"moving_avg":"231.0205","org_moving_avg":"2.310205","period_end_date":"2022-03-31","slope":"0.428200"},"51":{"moving_avg":"232.8524","org_moving_avg":"2.328524","period_end_date":"2022-04-30","slope":"0.427800"},"52":{"moving_avg":"235.6036","org_moving_avg":"2.356036","period_end_date":"2022-05-31","slope":"0.417500"}},"base_avg_price":3842,"calculated_value":4500,"selectedMonthData":[{"date":"2021-12-31","price_sqft":"8,519","price":"38,335,500","index":"221.74"},{"date":"2021-06-30","price_sqft":"7,197","price":"32,386,500","index":"187.33"},{"date":"2020-06-30","price_sqft":"4,764","price":"21,438
,000","index":"124.00"}]},"index_type":false}'
Related
Scraping all entries of lazyloading page using python
See this page with ECB press releases. These go back to 1997, so it would be nice to automate getting all the links going back in time. I found the tag that harbours the links ('//*[@id="lazyload-container"]'), but it only gets the most recent links. How do I get the rest?

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox(executable_path=r'/usr/local/bin/geckodriver')
driver.get(url)
element = driver.find_element_by_xpath('//*[@id="lazyload-container"]')
element = element.get_attribute('innerHTML')
The data is loaded via JavaScript from another URL. You can use this example to load the releases from different years:

import requests
from bs4 import BeautifulSoup

url = "https://www.ecb.europa.eu/press/pr/date/{}/html/index_include.en.html"

for year in range(1997, 2023):
    soup = BeautifulSoup(requests.get(url.format(year)).content, "html.parser")
    for a in soup.select(".title a")[::-1]:
        print(a.find_previous(class_="date").text, a.text)

Prints:

25 April 1997 "EUR" - the new currency code for the euro
1 July 1997 Change of presidency of the European Monetary Institute
2 July 1997 The security features of the euro banknotes
2 July 1997 The EMI's mandate with respect to banknotes
...
17 February 2022 Financial statements of the ECB for 2021
21 February 2022 Survey on credit terms and conditions in euro-denominated securities financing and over-the-counter derivatives markets (SESFOD) - December 2021
21 February 2022 Results of the December 2021 survey on credit terms and conditions in euro-denominated securities financing and over-the-counter derivatives markets (SESFOD)

EDIT: To print links:

import requests
from bs4 import BeautifulSoup

url = "https://www.ecb.europa.eu/press/pr/date/{}/html/index_include.en.html"

for year in range(1997, 2023):
    soup = BeautifulSoup(requests.get(url.format(year)).content, "html.parser")
    for a in soup.select(".title a")[::-1]:
        print(
            a.find_previous(class_="date").text,
            a.text,
            "https://www.ecb.europa.eu" + a["href"],
        )

Prints:

...
15 December 1999 Monetary policy decisions https://www.ecb.europa.eu/press/pr/date/1999/html/pr991215.en.html
20 December 1999 Visit by the Finnish Prime Minister https://www.ecb.europa.eu/press/pr/date/1999/html/pr991220.en.html
...
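If you want the links in a file rather than printed, the same loop can feed Python's csv module directly. A small sketch under the same assumptions as above (the selectors are the ones already used; the output filename is made up):

import csv
import requests
from bs4 import BeautifulSoup

url = "https://www.ecb.europa.eu/press/pr/date/{}/html/index_include.en.html"

# Collect every press release into one CSV (filename is arbitrary)
with open("ecb_press_releases.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "title", "link"])
    for year in range(1997, 2023):
        soup = BeautifulSoup(requests.get(url.format(year)).content, "html.parser")
        for a in soup.select(".title a")[::-1]:
            writer.writerow([
                a.find_previous(class_="date").text,
                a.text,
                "https://www.ecb.europa.eu" + a["href"],
            ])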
How can I scrape historical data when date range is in a filter?
I'm trying to scrape some historical data on the following url: https://markets.ft.com/data/funds/tearsheet/historical?s=LU0526609390:EUR

I would like to scrape all the available historical data, however, the website only allows me to scrape the daily prices of the last 30 days. In order to go back further, I have to make use of the filters and can only filter one year at a time. I can easily scrape the information available on the first table using the following code for a couple of funds:

import requests
import pandas as pd
import datetime
import csv

urls = ['https://markets.ft.com/data/funds/tearsheet/historical?s=LU0526609390:EUR',
        'https://markets.ft.com/data/funds/tearsheet/historical?s=IE00BHBX0Z19:EUR',
        'https://markets.ft.com/data/funds/tearsheet/historical?s=LU1076093779:EUR',
        'https://markets.ft.com/data/funds/tearsheet/historical?s=LU1116896363:EUR']

# Change date format as there appears to be two versions of the date on the FT website for different sized browsers
def format_date(date):
    date = date.split(',')[-2][1:] + date.split(',')[-1]
    return pd.Series({'Date': date})

# Create list to allow all scraping data to be saved in one .csv file
dfs = []

# Create scraping loop for all defined urls
for url in urls:
    ISIN = url.split('=')[-1].replace(':', '_')
    html = requests.get(url).content
    df_list = pd.read_html(html)
    df = df_list[-1]
    df['Date'] = df['Date'].apply(format_date)
    print(df)
    dfs.append(df)

However, I'm unable to make use of the filters on the webpage to obtain more historical data. I've tried many things but always get a different error message. How can I do this?
The data is loaded from an api that allows setting date ranges, e.g. https://markets.ft.com/data/equities/ajax/get-historical-prices?startDate=2020/10/01&endDate=2021/10/01&symbol=535700333. This makes it possible to skip the filters issue:

import requests
import pandas as pd
from datetime import datetime
import time

# create list of annual dates for the past 100 years starting from today
datelist = pd.date_range(end=datetime.now(), periods=100, freq=pd.DateOffset(years=1))[::-1].strftime('%Y/%m/%d')

# create empty df
df = pd.DataFrame(None, columns=['Date', 'Open', 'High', 'Low', 'Close', 'Volume'])

# not sure when the historical data starts, so let's wrap it in a while loop
while True:
    for end, start in zip(datelist, datelist[1:]):
        try:
            r = requests.get(f'https://markets.ft.com/data/equities/ajax/get-historical-prices?startDate={start}&endDate={end}&symbol=535700333').json()
            df_temp = pd.read_html('<table>' + r['html'] + '</table>')[0]
            df_temp.columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume']
            df = df.append(df_temp)
            time.sleep(2)
        except:
            break
    break

Output:

  Date                                        Open   High   Low    Close  Volume
0 Friday, October 15, 2021Fri, Oct 15, 2021   78.8   78.8   78.8   78.8   ----
1 Thursday, October 14, 2021Thu, Oct 14, 2021 78.89  78.89  78.89  78.89  ----
2 Wednesday, October 13, 2021Wed, Oct 13, 2021 78.7  78.7   78.7   78.7   ----
3 Tuesday, October 12, 2021Tue, Oct 12, 2021  78.58  78.58  78.58  78.58  ----
4 Monday, October 11, 2021Mon, Oct 11, 2021   78.58  78.58  78.58  78.58  ----
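One wrinkle in that output: the Date column concatenates the long and the short date form ("Friday, October 15, 2021Fri, Oct 15, 2021"), the same quirk the question's format_date helper works around. A small hedged sketch for cleaning the column after the loop, reusing that split logic:

# "Friday, October 15, 2021Fri, Oct 15, 2021" -> "Oct 15 2021"
df['Date'] = df['Date'].apply(lambda d: d.split(',')[-2][1:] + d.split(',')[-1])
df['Date'] = pd.to_datetime(df['Date'], format='%b %d %Y')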
Extraction of some date formats failed when using Dateutil in Python
I have gone through multiple links before posting this question, so please read through. The two answers below have solved 90% of my problem:

parse multiple dates using dateutil
How to parse multiple dates from a block of text in Python (or another language)

Problem: I need to parse multiple dates in multiple formats in Python.

Solution from the above links: I am able to do so, but there are still certain formats which I cannot parse. The formats which still can't be parsed are:

text = 'I want to visit from May 16-May 18'
text = 'I want to visit from May 16-18'
text = 'I want to visit from May 6 May 18'

I have also tried regex, but since dates can come in any format I ruled that option out, because the code was getting very complex. Please suggest modifications to the code presented in the links, so that the above 3 formats can also be handled.
This kind of problem is always going to need tweaking with new edge cases, but the following approach is fairly robust:

from itertools import groupby, izip_longest
from datetime import datetime, timedelta
import calendar
import string
import re

def get_date_part(x):
    if x.lower() in month_list:
        return x
    day = re.match(r'(\d+)(\b|st|nd|rd|th)', x, re.I)
    if day:
        return day.group(1)
    return False

def month_full(month):
    try:
        return datetime.strptime(month, '%B').strftime('%b')
    except:
        return datetime.strptime(month, '%b').strftime('%b')

tests = [
    'I want to visit from May 16-May 18',
    'I want to visit from May 16-18',
    'I want to visit from May 6 May 18',
    'May 6,7,8,9,10',
    '8 May to 10 June',
    'July 10/20/30',
    'from June 1, july 5 to aug 5 please',
    '2nd March to the 3rd January',
    '15 march, 10 feb, 5 jan',
    '1 nov 2017',
    '27th Oct 2010 until 1st jan',
    '27th Oct 2010 until 1st jan 2012'
]

cur_year = 2017
month_list = [m.lower() for m in list(calendar.month_name) + list(calendar.month_abbr) if len(m)]
remove_punc = string.maketrans(string.punctuation, ' ' * len(string.punctuation))

for date in tests:
    date_parts = [get_date_part(part) for part in date.translate(remove_punc).split() if get_date_part(part)]
    days = []
    months = []
    years = []

    for k, g in groupby(sorted(date_parts, key=lambda x: x.isdigit()), lambda y: not y.isdigit()):
        values = list(g)
        if k:
            months = map(month_full, values)
        else:
            for v in values:
                if 1900 <= int(v) <= 2100:
                    years.append(int(v))
                else:
                    days.append(v)

    if days and months:
        if years:
            dates_raw = [datetime.strptime('{} {} {}'.format(m, d, y), '%b %d %Y') for m, d, y in izip_longest(months, days, years, fillvalue=years[0])]
        else:
            dates_raw = [datetime.strptime('{} {}'.format(m, d), '%b %d').replace(year=cur_year) for m, d in izip_longest(months, days, fillvalue=months[0])]
            years = [cur_year]

        # Fix for jumps in year
        dates = []
        start_date = datetime(years[0], 1, 1)
        next_year = years[0] + 1

        for d in dates_raw:
            if d < start_date:
                d = d.replace(year=next_year)
                next_year += 1
            start_date = d
            dates.append(d)

        print "{} -> {}".format(date, ', '.join(d.strftime("%d/%m/%Y") for d in dates))

This converts the test strings as follows:

I want to visit from May 16-May 18 -> 16/05/2017, 18/05/2017
I want to visit from May 16-18 -> 16/05/2017, 18/05/2017
I want to visit from May 6 May 18 -> 06/05/2017, 18/05/2017
May 6,7,8,9,10 -> 06/05/2017, 07/05/2017, 08/05/2017, 09/05/2017, 10/05/2017
8 May to 10 June -> 08/05/2017, 10/06/2017
July 10/20/30 -> 10/07/2017, 20/07/2017, 30/07/2017
from June 1, july 5 to aug 5 please -> 01/06/2017, 05/07/2017, 05/08/2017
2nd March to the 3rd January -> 02/03/2017, 03/01/2018
15 march, 10 feb, 5 jan -> 15/03/2017, 10/02/2018, 05/01/2019
1 nov 2017 -> 01/11/2017
27th Oct 2010 until 1st jan -> 27/10/2010, 01/01/2011
27th Oct 2010 until 1st jan 2012 -> 27/10/2010, 01/01/2012

This works as follows: First create a list of valid month names, i.e. both full and abbreviated. Make a translation table to make it easy to quickly remove any punctuation from the text. Split the text, and extract only the date parts by using a function with a regular expression to spot days or months. Sort the list based on whether or not the part is a digit; this will group months to the front and digits to the end. Take the first and last part of each list. Normalise month names to their abbreviated form (e.g. August to Aug) and convert each into datetime objects. If a date appears to be before the previous one, add a whole year.
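Note that this answer targets Python 2: izip_longest, string.maketrans and the print statement no longer exist in Python 3. A small self-contained demo of the substitutions needed, with everything else in the answer staying the same:

import string
from itertools import zip_longest  # Python 3 name for izip_longest

# str.maketrans replaces the removed string.maketrans, same two-string form
remove_punc = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
print('May 6,7,8,9,10'.translate(remove_punc))  # -> May 6 7 8 9 10

# map() is lazy in Python 3; wrap it in list() when the result is reused
months = list(map(str.capitalize, ['may', 'jun']))
print(list(zip_longest(months, ['6'], fillvalue=months[0])))  # [('May', '6'), ('Jun', 'May')]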
Regex to use monthly.out for user accounting
I'd like to use Python to analyse /var/log/monthly.out on OS X to export user accounting totals. The log file looks like this:

Mon Feb  1 09:12:41 GMT 2016

Rotating fax log files:

Doing login accounting:
        total        688.31
        example      401.12
        _mbsetupuser 287.10
        root           0.05
        admin          0.04

-- End of monthly output --

Tue Feb 16 14:27:21 GMT 2016

Rotating fax log files:

Doing login accounting:
        total          0.00

-- End of monthly output --

Thu Mar  3 09:37:31 GMT 2016

Rotating fax log files:

Doing login accounting:
        total        377.92
        example      377.92

-- End of monthly output --

I was able to extract the username / totals pairs with this regex:

\t(\w*)\W*(\d*\.\d{2})

In Python:

>>> import re
>>> re.findall(r'\t(\w*)\W*(\d*\.\d{2})', open('/var/log/monthly.out', 'r').read())
[('total', '688.31'), ('example', '401.12'), ('_mbsetupuser', '287.10'), ('root', '0.05'), ('admin', '0.04'), ('total', '0.00'), ('total', '377.92'), ('example', '377.92')]

But I can't figure out how to extract the date line in such a way that it's attached to the username / totals pairs for that month.
Use str.split().

import re

re_user_amount = r'\s+(\w+)\s+(\d*\.\d{2})'
re_date = r'\w{3}\s+\w{3}\s+\d+\s+\d\d:\d\d:\d\d \w+ \d{4}'

with open('/var/log/monthly.out', 'r') as f:
    content = f.read()

sections = content.split('-- End of monthly output --')
for section in sections:
    date = re.findall(re_date, section)
    matches = re.findall(re_user_amount, section)
    print(date, matches)

If you want to turn the date string into an actual datetime, check out Converting string into datetime.
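To follow that last pointer through, a hedged sketch of the strptime call for these date lines (the next answer pads the single-digit day, "Feb  1", with a zero before parsing; recent Python versions tolerate the extra space, but the replace keeps it unambiguous):

from datetime import datetime

# 'Mon Feb  1 ...' pads the day with a space; normalise it to '01'-style first
date_str = 'Mon Feb  1 09:12:41 GMT 2016'.replace('  ', ' 0')
print(datetime.strptime(date_str, '%a %b %d %H:%M:%S %Z %Y'))  # 2016-02-01 09:12:41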
Well, there's rarely a magical cure for everything based on regex. Regexes are a great tool for simple string parsing, but they shall not replace good old programming!

So if you look at your data, you'll notice that it always starts with a date, and ends with the -- End of monthly output -- line. So a nice way to handle that would be to split your data by each monthly output.

Let's start with your data:

>>> s = """\
... Mon Feb  1 09:12:41 GMT 2016
...
... Rotating fax log files:
...
... Doing login accounting:
...     total        688.31
...     example      401.12
...     _mbsetupuser 287.10
...     root           0.05
...     admin          0.04
...
... -- End of monthly output --
...
... Tue Feb 16 14:27:21 GMT 2016
...
... Rotating fax log files:
...
... Doing login accounting:
...     total          0.00
...
... -- End of monthly output --
...
... Thu Mar  3 09:37:31 GMT 2016
...
... Rotating fax log files:
...
... Doing login accounting:
...     total        377.92
...     example      377.92
...
... -- End of monthly output --"""

And let's split it based on that end-of-month line:

>>> reports = s.split('-- End of monthly output --')
>>> reports
['Mon Feb  1 09:12:41 GMT 2016\n\nRotating fax log files:\n\nDoing login accounting:\n total 688.31\n example 401.12\n _mbsetupuser 287.10\n root 0.05\n admin 0.04\n\n', '\n\nTue Feb 16 14:27:21 GMT 2016\n\nRotating fax log files:\n\nDoing login accounting:\n total 0.00\n\n', '\n\nThu Mar  3 09:37:31 GMT 2016\n\nRotating fax log files:\n\nDoing login accounting:\n total 377.92\n example 377.92\n\n', '']

Then you can separate the accounting data from the rest of the log:

>>> report = reports[0]
>>> head, tail = report.split('Doing login accounting:')

Now let's extract the date line:

>>> date_line = head.strip().split('\n')[0]

And fill up a dict with those username/totals pairs:

>>> accounting = dict(zip(tail.split()[::2], tail.split()[1::2]))

The trick here is to use zip() to create pairs out of iterators on tail: the "left" side of the pair is an iterator starting at index 0, iterating every 2 items, and the "right" side of the pair is an iterator starting at index 1, iterating every 2 items. Which makes:

{'admin': '0.04', 'root': '0.05', 'total': '688.31', '_mbsetupuser': '287.10', 'example': '401.12'}

So now that's done, you can do that in a for loop:

import datetime

def parse_monthly_log(log_path='/var/log/monthly.out'):
    with open(log_path, 'r') as log:
        reports = log.read().strip('\n ').split('-- End of monthly output --')
        for report in filter(lambda it: it, reports):
            head, tail = report.split('Doing login accounting:')
            date_line = head.strip().split('\n')[0]
            accounting = dict(zip(tail.split()[::2], tail.split()[1::2]))
            yield {
                'date': datetime.datetime.strptime(date_line.replace('  ', ' 0'), '%a %b %d %H:%M:%S %Z %Y'),
                'accounting': accounting
            }

>>> import pprint
>>> pprint.pprint(list(parse_monthly_log()), indent=2)
[ { 'accounting': { '_mbsetupuser': '287.10',
                    'admin': '0.04',
                    'example': '401.12',
                    'root': '0.05',
                    'total': '688.31'},
    'date': datetime.datetime(2016, 2, 1, 9, 12, 41)},
  { 'accounting': { 'total': '0.00'},
    'date': datetime.datetime(2016, 2, 16, 14, 27, 21)},
  { 'accounting': { 'example': '377.92',
                    'total': '377.92'},
    'date': datetime.datetime(2016, 3, 3, 9, 37, 31)}]

And there you go with a pythonic solution without a single regex.
N.B.: I had to do a little trick with the datetime, because the log pads the day number with a space rather than a zero (which is what strptime expects); I used string .replace() to change a double space into ' 0' within the date string.

N.B.: the filter() and the split() used in the for report… loop are used to remove leading and trailing empty reports, depending on how the log file starts or ends.
Here's something shorter:

import re

with open("/var/log/monthly.out") as f:
    months = map(str.strip, f.read().split("-- End of monthly output --"))

for sec in filter(None, months):
    date = sec.splitlines()[0]
    accs = re.findall("\n\s+(\w+)\s+([\d\.]+)", sec)
    print(date, accs)

This divides the file content into months, extracts the date of each month, and searches for all accounts in each month.
You may want to try the following regex, which is not so elegant though:

import re

string = """
Mon Feb  1 09:12:41 GMT 2016

Rotating fax log files:

Doing login accounting:
    total        688.31
    example      401.12
    _mbsetupuser 287.10
    root           0.05
    admin          0.04

-- End of monthly output --

Tue Feb 16 14:27:21 GMT 2016

Rotating fax log files:

Doing login accounting:
    total          0.00

-- End of monthly output --

Thu Mar  3 09:37:31 GMT 2016

Rotating fax log files:

Doing login accounting:
    total        377.92
    example      377.92

-- End of monthly output --
"""

pattern = '(\w+\s+\w+\s+[\d:\s]+[A-Z]{3}\s+\d{4})[\s\S]+?((?:\w+)\s+(?:[0-9.]+))\s+(?:((?:\w+)\s*(?:[0-9.]+)))?\s+(?:((?:\w+)\s*(?:[0-9.]+)))?\s*(?:((?:\w+)\s+(?:[0-9.]+)))?\s*(?:((?:\w+)\s*(?:[0-9.]+)))?'

print re.findall(pattern, string)

Output:

[('Mon Feb  1 09:12:41 GMT 2016', 'total 688.31', 'example 401.12', '_mbsetupuser 287.10', 'root 0.05', 'admin 0.04'), ('Tue Feb 16 14:27:21 GMT 2016', 'total 0.00', '', '', '', ''), ('Thu Mar  3 09:37:31 GMT 2016', 'total 377.92', 'example 377.92', '', '', '')]

REGEX DEMO.
Beautiful soup and extracting values
I would be grateful if you could give me some guidance on how to grab the date of birth "16 June 1723" below while using BeautifulSoup. Using my code I have managed to grab the value which you see below under "Result", however all I need is the value 16 June 1723. Any advice?

My code:

birth = soup.find("table", {"class": "infobox"})
test = birth.find(text='Born')
next_cell = test.find_parent('th').find_next_sibling('td').get_text()
print next_cell

Result:

16 June 1723 NS (5 June 1723 OS)Kirkcaldy, Scotland,Great Britain
Instead of the last print statement, add this:

print ' '.join(str(next_cell).split()[:3])

This keeps only the first three whitespace-separated tokens, i.e. 16, June and 1723.
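A hedged alternative that avoids counting tokens: take only the first text node of the td rather than its full get_text(), on the assumption that the infobox markup puts "16 June 1723" first in the cell:

# Assumption: the date is the first bare string inside the <td>
td = soup.find("table", {"class": "infobox"}).find(text='Born').find_parent('th').find_next_sibling('td')
print(td.find(text=True).strip())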