I'm pretty new to the world of web scraping so am looking for some guidance to an issue I've been trying to resolve for a few hours.
I'm trying to loop through a table looking structure (it's not an actual table though) and have used findall to bring back all the details of a certain tag.
The challenge I have is that every element of the "table" has the same class name "final-leaderboard__content" so I'm left with a huge list so I want to iterate through and retrieve the details for so I can create a csv/excel with the details. This is the code below
from bs4 import BeautifulSoup
import requests
TournamentURL = "https://www.theopen.com/previous-opens/19th-open-st-andrews-1879/"
TournamentResponse = requests.get(TournamentURL)
TournamentData = TournamentResponse.text
TournamentSoup = BeautifulSoup(TournamentData, 'html.parser')
RowContents = TournamentSoup.findAll("div", {"class": "final-leaderboard__content"})
for RowContent in RowContents:
The result is something like this and I can't work out the best way without there being any explicit tag/id to know that item 0,8,16 etc is the Player Name, item 1,9,17 is the Finish etc etc
[0] - Name
[1] - Finish
[2] - R1
[3] - R2
[4] - R3
[5] - R4
[6] - Total
[7] - Par
[8] - Name (The second Name)
[9] - Finish (The second Finish)
etc
etc
I've tried splice, modulo and various other variants of the same but can't seem to work it out.
You can use the fact that this is indeed a kind of tabular data and grab all divs that represent a row, split it by a number of columns, and there's your data:
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate
url = "https://www.theopen.com/previous-opens/19th-open-st-andrews-1879/#leaderboard"
page = requests.get(url).content
leaderboard = BeautifulSoup(page, "html.parser").find_all("div", {"class": "final-leaderboard__content"})
column_count = 8
split_by_columns = [
leaderboard[i:i+column_count] for i in range(0, len(leaderboard), column_count)
]
table = [[i.getText(strip=True) for i in row] for row in split_by_columns]
print(tabulate(table[1:], headers=table[0]))
Output:
Name Finish R1 R2 R3 R4 Total Par
----------------------------- -------- ---- ---- ---- ---- ------- -----
Jamie ANDERSONChampion Golfer 1 84 85 - - 169 M/C
Andrew KIRKALDY 2 86 86 - - 172 M/C
Jamie ALLAN 2 88 84 - - 172 M/C
George PAXTON 4 89 85 - - 174 M/C
Tom KIDD 5 87 88 - - 175 M/C
Bob FERGUSON 6 89 87 - - 176 M/C
J.O.F. MORRIS 7 92 87 - - 179 M/C
Jack KIRKALDY 8 92 89 - - 181 M/C
James RENNIE 8 93 88 - - 181 M/C
Willie FERNIE 8 92 89 - - 181 M/C
David AYTON 11 95 89 - - 184 M/C
Henry LAMB 11 91 93 - - 184 M/C
Tom ARUNDEL 11 95 89 - - 184 M/C
Tom MORRIS SR 14 92 93 - - 185 M/C
William DOLEMAN 14 91 94 - - 185 M/C
Robert KINSMAN 14 88 97 - - 185 M/C
Bob MARTIN 17 93 93 - - 186 M/C
Ben SAYERS 18 92 95 - - 187 M/C
David ANDERSON SR 19 94 94 - - 188 M/C
David CORSTORPHINE 20 93 96 - - 189 M/C
Tom DUNN 20 90 99 - - 189 M/C
Peter PAXTON 20 99 90 - - 189 M/C
[A] SMITH 20 94 95 - - 189 M/C
D. GRANT 20 95 94 - - 189 M/C
Bob DOW 20 95 94 - - 189 M/C
Walter GOURLAY 20 92 97 - - 189 M/C
A.W. SMITH 27 91 99 - - 190 M/C
Douglas Argyll ROBERTSON 27 97 93 - - 190 M/C
Robert ARMIT 29 95 96 - - 191 M/C
George STRATH 29 97 94 - - 191 M/C
J.H. BLACKWELL 31 96 96 - - 192 M/C
Tom MANZIE 32 96 97 - - 193 M/C
George LOWE 33 94 100 - - 194 M/C
G. HONEYMAN 33 97 97 - - 194 M/C
James FENTON 35 99 97 - - 196 M/C
Robert TAIT 35 99 97 - - 196 M/C
Bob KIRK 37 99 98 - - 197 M/C
Rev. D. LUNDIE 37 98 99 - - 197 M/C
Fitz BOOTHBY 39 96 102 - - 198 M/C
J. Thomson WHITE 40 102 99 - - 201 M/C
James KIRK 41 105 97 - - 202 M/C
W.H. GOFF 42 105 99 - - 204 M/C
import requests
from bs4 import BeautifulSoup
def parse_row(row):
for div in row.find_all("div", {"class": "final-leaderboard__content"}):
yield div.text.strip().replace('\n', ' ')
url = "https://www.theopen.com/previous-opens/19th-open-st-andrews-1879/#leaderboard"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find("div", {"class": "final-leaderboard__table"})
rows = table.find_all('div', {'class':"final-leaderboard__row"})
header = list(parse_row(rows[0]))
for row in rows[1:]:
print(dict(zip(header, list(parse_row(row)))))
output
{'Name': 'Jamie ANDERSON Champion Golfer', 'Finish': '1', 'R1': '84', 'R2': '85', 'R3': '-', 'R4': '-', 'Total': '169', 'Par': 'M/C'}
{'Name': 'Andrew KIRKALDY', 'Finish': '2', 'R1': '86', 'R2': '86', 'R3': '-', 'R4': '-', 'Total': '172', 'Par': 'M/C'}
{'Name': 'Jamie ALLAN', 'Finish': '2', 'R1': '88', 'R2': '84', 'R3': '-', 'R4': '-', 'Total': '172', 'Par': 'M/C'}
{'Name': 'George PAXTON', 'Finish': '4', 'R1': '89', 'R2': '85', 'R3': '-', 'R4': '-', 'Total': '174', 'Par': 'M/C'}
{'Name': 'Tom KIDD', 'Finish': '5', 'R1': '87', 'R2': '88', 'R3': '-', 'R4': '-', 'Total': '175', 'Par': 'M/C'}
{'Name': 'Bob FERGUSON', 'Finish': '6', 'R1': '89', 'R2': '87', 'R3': '-', 'R4': '-', 'Total': '176', 'Par': 'M/C'}
{'Name': 'J.O.F. MORRIS', 'Finish': '7', 'R1': '92', 'R2': '87', 'R3': '-', 'R4': '-', 'Total': '179', 'Par': 'M/C'}
{'Name': 'Jack KIRKALDY', 'Finish': '8', 'R1': '92', 'R2': '89', 'R3': '-', 'R4': '-', 'Total': '181', 'Par': 'M/C'}
{'Name': 'James RENNIE', 'Finish': '8', 'R1': '93', 'R2': '88', 'R3': '-', 'R4': '-', 'Total': '181', 'Par': 'M/C'}
{'Name': 'Willie FERNIE', 'Finish': '8', 'R1': '92', 'R2': '89', 'R3': '-', 'R4': '-', 'Total': '181', 'Par': 'M/C'}
{'Name': 'David AYTON', 'Finish': '11', 'R1': '95', 'R2': '89', 'R3': '-', 'R4': '-', 'Total': '184', 'Par': 'M/C'}
{'Name': 'Henry LAMB', 'Finish': '11', 'R1': '91', 'R2': '93', 'R3': '-', 'R4': '-', 'Total': '184', 'Par': 'M/C'}
{'Name': 'Tom ARUNDEL', 'Finish': '11', 'R1': '95', 'R2': '89', 'R3': '-', 'R4': '-', 'Total': '184', 'Par': 'M/C'}
{'Name': 'Tom MORRIS SR', 'Finish': '14', 'R1': '92', 'R2': '93', 'R3': '-', 'R4': '-', 'Total': '185', 'Par': 'M/C'}
{'Name': 'William DOLEMAN', 'Finish': '14', 'R1': '91', 'R2': '94', 'R3': '-', 'R4': '-', 'Total': '185', 'Par': 'M/C'}
{'Name': 'Robert KINSMAN', 'Finish': '14', 'R1': '88', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '185', 'Par': 'M/C'}
{'Name': 'Bob MARTIN', 'Finish': '17', 'R1': '93', 'R2': '93', 'R3': '-', 'R4': '-', 'Total': '186', 'Par': 'M/C'}
{'Name': 'Ben SAYERS', 'Finish': '18', 'R1': '92', 'R2': '95', 'R3': '-', 'R4': '-', 'Total': '187', 'Par': 'M/C'}
{'Name': 'David ANDERSON SR', 'Finish': '19', 'R1': '94', 'R2': '94', 'R3': '-', 'R4': '-', 'Total': '188', 'Par': 'M/C'}
{'Name': 'David CORSTORPHINE', 'Finish': '20', 'R1': '93', 'R2': '96', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'Tom DUNN', 'Finish': '20', 'R1': '90', 'R2': '99', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'Peter PAXTON', 'Finish': '20', 'R1': '99', 'R2': '90', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': '[A] SMITH', 'Finish': '20', 'R1': '94', 'R2': '95', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'D. GRANT', 'Finish': '20', 'R1': '95', 'R2': '94', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'Bob DOW', 'Finish': '20', 'R1': '95', 'R2': '94', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'Walter GOURLAY', 'Finish': '20', 'R1': '92', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '189', 'Par': 'M/C'}
{'Name': 'A.W. SMITH', 'Finish': '27', 'R1': '91', 'R2': '99', 'R3': '-', 'R4': '-', 'Total': '190', 'Par': 'M/C'}
{'Name': 'Douglas Argyll ROBERTSON', 'Finish': '27', 'R1': '97', 'R2': '93', 'R3': '-', 'R4': '-', 'Total': '190', 'Par': 'M/C'}
{'Name': 'Robert ARMIT', 'Finish': '29', 'R1': '95', 'R2': '96', 'R3': '-', 'R4': '-', 'Total': '191', 'Par': 'M/C'}
{'Name': 'George STRATH', 'Finish': '29', 'R1': '97', 'R2': '94', 'R3': '-', 'R4': '-', 'Total': '191', 'Par': 'M/C'}
{'Name': 'J.H. BLACKWELL', 'Finish': '31', 'R1': '96', 'R2': '96', 'R3': '-', 'R4': '-', 'Total': '192', 'Par': 'M/C'}
{'Name': 'Tom MANZIE', 'Finish': '32', 'R1': '96', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '193', 'Par': 'M/C'}
{'Name': 'George LOWE', 'Finish': '33', 'R1': '94', 'R2': '100', 'R3': '-', 'R4': '-', 'Total': '194', 'Par': 'M/C'}
{'Name': 'G. HONEYMAN', 'Finish': '33', 'R1': '97', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '194', 'Par': 'M/C'}
{'Name': 'James FENTON', 'Finish': '35', 'R1': '99', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '196', 'Par': 'M/C'}
{'Name': 'Robert TAIT', 'Finish': '35', 'R1': '99', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '196', 'Par': 'M/C'}
{'Name': 'Bob KIRK', 'Finish': '37', 'R1': '99', 'R2': '98', 'R3': '-', 'R4': '-', 'Total': '197', 'Par': 'M/C'}
{'Name': 'Rev. D. LUNDIE', 'Finish': '37', 'R1': '98', 'R2': '99', 'R3': '-', 'R4': '-', 'Total': '197', 'Par': 'M/C'}
{'Name': 'Fitz BOOTHBY', 'Finish': '39', 'R1': '96', 'R2': '102', 'R3': '-', 'R4': '-', 'Total': '198', 'Par': 'M/C'}
{'Name': 'J. Thomson WHITE', 'Finish': '40', 'R1': '102', 'R2': '99', 'R3': '-', 'R4': '-', 'Total': '201', 'Par': 'M/C'}
{'Name': 'James KIRK', 'Finish': '41', 'R1': '105', 'R2': '97', 'R3': '-', 'R4': '-', 'Total': '202', 'Par': 'M/C'}
{'Name': 'W.H. GOFF', 'Finish': '42', 'R1': '105', 'R2': '99', 'R3': '-', 'R4': '-', 'Total': '204', 'Par': 'M/C'}
of course, instead of dict you may use other data structure like namedtuple
Another way is to create a dictionary , enumerate through your Rowcontents and update the dictionary with key as enumerated index(i) mod 8 (i%8) and value "the text"
RowContents = TournamentSoup.findAll("div", {"class": "final-leaderboard__content"})
d={}
for i, RowContent in enumerate(RowContents):
key = (i)%8
d.setdefault(key, []).append(' '.join(RowContent.text.strip().split()))
>>> d
{
0: ['Name','Jamie ANDERSON Champion Golfer','Andrew KIRKALDY','Jamie ALLAN',....]
1: ['Finish','1','2','2','4',....]
2: ['R1','84','86','88','89','87',....]
.......
7: ['Par','M/C','M/C','M/C','M/C','M/C',.....]
if you can use pandas
df = pd.DataFrame(d).rename(columns=df.iloc[0]).drop(df.index[0])
>>> print(df)
Name Finish R1 R2 R3 R4 Total Par
1 Jamie ANDERSON Champion Golfer 1 84 85 - - 169 M/C
2 Andrew KIRKALDY 2 86 86 - - 172 M/C
3 Jamie ALLAN 2 88 84 - - 172 M/C
4 George PAXTON 4 89 85 - - 174 M/C
5 Tom KIDD 5 87 88 - - 175 M/C
6 Bob FERGUSON 6 89 87 - - 176 M/C
7 J.O.F. MORRIS 7 92 87 - - 179 M/C
to save a dataframe to csv use pandas.to_csv()
df.to_csv('yourfile.csv', index=False)
Related
I'm trying to change a matrix of numbers from string to integer but it just doesn't work.
for element in list:
for i in element:
i = int(i)
What am I doing wrong?
Edit:
This is the whole code:
import numpy as np
t_list = []
t_list = np.array(t_list)
list_rains_per_months = [['63', '65', '50', '77', '66', '69'],
['65', '65', '67', '50', '54', '58'],
['77', '73', '80', '83', '89', '100'],
['90', '85', '90', '90', '84', '90'],
['129', '113', '120', '135', '117', '130'],
['99', '116', '114', '111', '119', '100'],
['105', '98', '112', '113', '102', '100'],
['131', '120', '111', '141', '130', '126'],
['85', '101', '88', '89', '94', '91'],
['122', '103', '119', '98', '101', '107'],
['121', '101', '104', '121', '115', '104'],
['67', '44', '58', '61', '64', '58']]
for element in t_list:
for i in element:
i = int(i)
I apologize for any mistakes, I'm new to python
What you're doing wrong, is that you're not changing the list or any list element: the 'i' inside the loop starts by pointing to each element of the list, then you make it point to something else, but that doesn't affect your list (also, avoid using 'list' as an identifier, it's an existing type, that's asking for trouble).
One way to do it is with list comprehensions. Assuming your matrix is a list of (inner) lists, for example:
a_list = [["3", "56", "78"], ["2", "39", "60"], ["87", "9", "71"]]
then two nested list comprehensions should do the trick:
a_list = [[int(i) for i in inner_list] for inner_list in a_list]
This builds a new list, formed by going over your initial list, applying the change you want, and saving it a another (or the same) list.
In numpy you do it that way.
import numpy as np
list_rains_per_months = [['63', '65', '50', '77', '66', '69'],
['65', '65', '67', '50', '54', '58'],
['77', '73', '80', '83', '89', '100'],
['90', '85', '90', '90', '84', '90'],
['129', '113', '120', '135', '117', '130'],
['99', '116', '114', '111', '119', '100'],
['105', '98', '112', '113', '102', '100'],
['131', '120', '111', '141', '130', '126'],
['85', '101', '88', '89', '94', '91'],
['122', '103', '119', '98', '101', '107'],
['121', '101', '104', '121', '115', '104'],
['67', '44', '58', '61', '64', '58']]
list_rains_per_months = np.array(list_rains_per_months)
myfunc = np.vectorize(lambda x: int(x))
list_rains_per_months = myfunc(list_rains_per_months)
print(list_rains_per_months)
Output
[[ 63 65 50 77 66 69]
[ 65 65 67 50 54 58]
[ 77 73 80 83 89 100]
[ 90 85 90 90 84 90]
[129 113 120 135 117 130]
[ 99 116 114 111 119 100]
[105 98 112 113 102 100]
[131 120 111 141 130 126]
[ 85 101 88 89 94 91]
[122 103 119 98 101 107]
[121 101 104 121 115 104]
[ 67 44 58 61 64 58]]
You could use enumerate object in loops:
list = [["12", "10", "0"],
["0", "33", "60"]]
for h, i in enumerate(list):
for j, k in enumerate(i):
list[h][j] = int(k)
print(list)
Could also just map each row's values to int:
for row in list_rains_per_months:
row[:] = map(int, row)
Note that I assign to row[:], i.e., into the row and thus into the matrix. If I assigned to row instead, I'd have the same problem as you with your i: I'd only assign to the variable, not into the row/matrix.
hello I'm new to coding and I have to Define a function named filterByMonth with two parameters. The first argument passed to the function should be a list of dictionaries (the data), the second a number (representing a month, 1 through 12). This function must return a list of all the dictionaries from the input list whose 'issued' date is in the indicated month. 'slice' a string to more easily extract the relevant portion of a date from a string. Sample function call: filterByMonth(data,9)
With a list of dictionaries I need to slice the month which is a part of the value for the key 'issued' which is the third key in the dictionaries.
data = [
{'apno': 'FEND19-9487084', 'aptype': 'FENDRIVE', 'issued': '2019-09-05T00:00:00.000', 'stname': '129 PARKSIDE CT', 'city': 'BUFFALO', 'state': 'NY', 'zip': '14214', 'applicant': 'PETER CICERO', 'fees': '150', 'value': '3500', 'plans': '0', 'sbl': '0795300004001000', 'landuse': '411', 'inspector': 'ANDREW BLERSCH', 'expdate': '2020-03-05T00:00:00.000', 'descofwork': 'REMOVE EXISTING DRIVEWAY AND REPLACE IN KIND WITH CONCRETE TO CODE ON SOUTH / RIGHT SIDE OF STRUCTURE TO CODE - SURVEY SCANNED', 'location_1': {'latitude': '42.95116080935555', 'longitude': '-78.83406536395538', 'human_address': '{"address": "", "city": "", "state": "", "zip": ""}'}, 'latitude': '42.95116080935555', 'longitude': '-78.83406536395538', 'council_district': 'UNIVERSITY', 'police_district': 'District D', 'census': '45', 'census_block_group': '1', 'census_block': '1010', 'neighborhood': 'UNKNOWN', ':#computed_region_fk4y_hpmh': '5', ':#computed_region_eziv_p4ck': '64', ':#computed_region_tmcg_v66k': '8', ':#computed_region_kwzn_pe6v': '18', ':#computed_region_uh5x_q5mi': '88', ':#computed_region_dwzh_dtk5': '1573', ':#computed_region_b3rm_3c8a': '37', ':#computed_region_xbxg_7ifr': '24', ':#computed_region_urdz_b6n8': '7'},
{'apno': 'SWIM19-9485898', 'aptype': 'SWIM POOL', 'issued': '2019-08-19T00:00:00.000', 'stname': '341 NORWALK', 'city': 'BUFFALO', 'state': 'NY', 'zip': '14216', 'applicant': 'MS CHRISTINE SALAMONE', 'fees': '75', 'value': '500', 'plans': '0', 'sbl': '0785000006033000', 'landuse': '210', 'inspector': 'ANDREW BLERSCH', 'expdate': '2020-02-19T00:00:00.000', 'descofwork': 'INSTALLATION OF AN ABOVE GROUND SWIMMING POOL SUBMITTED THROUGH IDT', 'location_1': {'latitude': '42.95333872723409', 'longitude': '-78.85429233887896', 'human_address': '{"address": "", "city": "", "state": "", "zip": ""}'}, 'latitude': '42.95333872723409', 'longitude': '-78.85429233887896', 'council_district': 'DELAWARE', 'police_district': 'District D', 'census': '49', 'census_block_group': '1', 'census_block': '1000', 'neighborhood': 'UNKNOWN', ':#computed_region_fk4y_hpmh': '5', ':#computed_region_eziv_p4ck': '51', ':#computed_region_tmcg_v66k': '7', ':#computed_region_kwzn_pe6v': '5', ':#computed_region_uh5x_q5mi': '190', ':#computed_region_dwzh_dtk5': '944', ':#computed_region_b3rm_3c8a': '28', ':#computed_region_xbxg_7ifr': '25', ':#computed_region_urdz_b6n8': '2'},
]
def filterByMonth(dta,month):
result=[]
for x in dta:
for y in x:
for x['issued'] in y
if month== x[6]:
result.append(x)
return result
print(filterByMonth(data,9))
You could do this more easily with the datetime module like so
from datetime import datetime
data = [
{'apno': 'FEND19-9487084', 'aptype': 'FENDRIVE', 'issued': '2019-09-05T00:00:00.000', 'stname': '129 PARKSIDE CT', 'city': 'BUFFALO', 'state': 'NY', 'zip': '14214', 'applicant': 'PETER CICERO', 'fees': '150', 'value': '3500', 'plans': '0', 'sbl': '0795300004001000', 'landuse': '411', 'inspector': 'ANDREW BLERSCH', 'expdate': '2020-03-05T00:00:00.000', 'descofwork': 'REMOVE EXISTING DRIVEWAY AND REPLACE IN KIND WITH CONCRETE TO CODE ON SOUTH / RIGHT SIDE OF STRUCTURE TO CODE - SURVEY SCANNED', 'location_1': {'latitude': '42.95116080935555', 'longitude': '-78.83406536395538', 'human_address': '{"address": "", "city": "", "state": "", "zip": ""}'}, 'latitude': '42.95116080935555', 'longitude': '-78.83406536395538', 'council_district': 'UNIVERSITY', 'police_district': 'District D', 'census': '45', 'census_block_group': '1', 'census_block': '1010', 'neighborhood': 'UNKNOWN', ':#computed_region_fk4y_hpmh': '5', ':#computed_region_eziv_p4ck': '64', ':#computed_region_tmcg_v66k': '8', ':#computed_region_kwzn_pe6v': '18', ':#computed_region_uh5x_q5mi': '88', ':#computed_region_dwzh_dtk5': '1573', ':#computed_region_b3rm_3c8a': '37', ':#computed_region_xbxg_7ifr': '24', ':#computed_region_urdz_b6n8': '7'},
{'apno': 'SWIM19-9485898', 'aptype': 'SWIM POOL', 'issued': '2019-08-19T00:00:00.000', 'stname': '341 NORWALK', 'city': 'BUFFALO', 'state': 'NY', 'zip': '14216', 'applicant': 'MS CHRISTINE SALAMONE', 'fees': '75', 'value': '500', 'plans': '0', 'sbl': '0785000006033000', 'landuse': '210', 'inspector': 'ANDREW BLERSCH', 'expdate': '2020-02-19T00:00:00.000', 'descofwork': 'INSTALLATION OF AN ABOVE GROUND SWIMMING POOL SUBMITTED THROUGH IDT', 'location_1': {'latitude': '42.95333872723409', 'longitude': '-78.85429233887896', 'human_address': '{"address": "", "city": "", "state": "", "zip": ""}'}, 'latitude': '42.95333872723409', 'longitude': '-78.85429233887896', 'council_district': 'DELAWARE', 'police_district': 'District D', 'census': '49', 'census_block_group': '1', 'census_block': '1000', 'neighborhood': 'UNKNOWN', ':#computed_region_fk4y_hpmh': '5', ':#computed_region_eziv_p4ck': '51', ':#computed_region_tmcg_v66k': '7', ':#computed_region_kwzn_pe6v': '5', ':#computed_region_uh5x_q5mi': '190', ':#computed_region_dwzh_dtk5': '944', ':#computed_region_b3rm_3c8a': '28', ':#computed_region_xbxg_7ifr': '25', ':#computed_region_urdz_b6n8': '2'}
]
def filterByMonth(data, month):
result = []
for item in data:
datestring = item['issued']
dt = datetime.strptime(datestring, '%Y-%m-%dT%H:%M:%S.%f')
if dt.month == month:
result.append(item)
return result
print(filterByMonth(data, 9))
and a more pythonic way would be this
def filterByMonth(data, month):
return [item for item in data if datetime.strptime(item['issued'], '%Y-%m-%dT%H:%M:%S.%f').month == month]
I have a dictionary like this:
{'https://github.com/project1': {'Batchfile': '91', 'Gradle': '110', 'INI': '25', 'Java': '1879', 'Markdown': '393', 'QMake': '52', 'Shell': '161', 'Text': '202', 'XML': '943'}}
{'https://github.com/project2': {'Batchfile': '91', 'Gradle': '123', 'INI': '25', 'Java': '1305', 'Markdown': '121', 'QMake': '52', 'Shell': '161', 'XML': '234'}}
{'https://github.com/project3': {'Batchfile': '91', 'Gradle': '360', 'INI': '27', 'Java': '805', 'Markdown': '27', 'QMake': '156', 'Shell': '161', 'XML': '380'}}
It is a structured in this way:
{'url': {'lang1': 'locs', 'lang2': 'locs', ...}}
{'url2': {'lang6': 'locs', 'lang5': 'locs', ...}}
where lang stay for languages and locs stay for line of codes (related to the previous language).
What i want to do is print this dictionary in a pretty way,so i can see the results before the export.
After that i want to export the dictionary into a csv file to make other operation. The problem is the languages are not sorted. That is what i mean:
{'https://github.com/Project4': {'HTML': '29', 'Java': '229', 'Markdown': '101', 'Maven POM': '88', 'XML': '62'}}
{'https://github.com/Project5': {'Batchfile': '85', 'Gradle': '84', 'INI': '22', 'Java': '2422', 'Markdown': '25', 'Prolog': '25', 'Shell': '173', 'XML': '3243', 'YAML': '43'}}
Any idea?
You could use pandas:
import pandas as pd
t = [{'https://github.com/project1': {'Batchfile': '91', 'Gradle': '110', 'INI': '25', 'Java': '1879', 'Markdown': '393', 'QMake': '52', 'Shell': '161', 'Text': '202', 'XML': '943'}},
{'https://github.com/project2': {'Batchfile': '91', 'Gradle': '123', 'INI': '25', 'Java': '1305', 'Markdown': '121', 'QMake': '52', 'Shell': '161', 'XML': '234'}},
{'https://github.com/project3': {'Batchfile': '91', 'Gradle': '360', 'INI': '27', 'Java': '805', 'Markdown': '27', 'QMake': '156', 'Shell': '161', 'XML': '380'}}]
columns = set([lang for x in t for l in x.values() for lang in l])
index = [p for x in t for p in x.keys()]
rows = [l for x in t for l in x.values() ]
df = pd.DataFrame(rows, columns=columns, index=index).fillna('N/A')
df.to_csv('projects.csv')
Which gives:
>>> df
Gradle INI Markdown ... Batchfile Java QMake
https://github.com/project1 110 25 393 ... 91 1879 52
https://github.com/project2 123 25 121 ... 91 1305 52
https://github.com/project3 360 27 27 ... 91 805 156
[3 rows x 9 columns]
And in the csv:
I have a text file with four columns:
time serial domain server
the contents of the text file are as follows:
15 14 google.com 8.8.8.8
19 45 google.com 8.8.4.4
98 76 google.com 208.67.222.222
20 23 intuit.com 8.8.8.8
45 89 intuit.com 8.8.4.4
43 21 intuit.com 208.67.222.222
78 14 google.com 8.8.8.8
92 76 google.com 8.8.4.4
64 54 google.com 208.67.222.222
91 18 intuit.com 8.8.8.8
93 74 intuit.com 8.8.4.4
65 59 intuit.com 208.67.222.222
What would be the best way to read this file and create a list of dict as follows:
[{"server":"8.8.8.8",
"domains":[{"google.com":[{"time":15, "serial":14}, {"time":78, "serial":14}]},
{"intuit.com":[{"time":20, "serial":23}, {"time":91, "serial":18}]}
]
},
{"server":"8.8.4.4",
"domains":[{"google.com":[{"time":19, "serial":45}, {"time":92, "serial":76}]},
{"intuit.com":[{"time":45, "serial":89}, {"time":93, "serial":74}]}
]
},
{"server":"206.67.222.222",
"domains":[{"google.com":[{"time":98, "serial":76}, {"time":64, "serial":54}]},
{"intuit.com":[{"time":43, "serial":21}, {"time":65, "serial":59}]}
]
}]
The order of the rows could change but the columns always remain the same.
Perhaps not the best way, but a way that is beneficial in some ways:
servers = {}
file_path = './test.file'
from pprint import pprint
with open(file_path,'rb') as f:
for line in f:
_time, serial, domain, ip = line.split()
current_domains = servers.get(ip, {})
times = current_domains.get(domain, [])
times.append({"time": _time, "serial": serial})
current_domains[domain] = times
servers[ip] = current_domains
pprint(servers)
pprint([{"server": ip, "domains": [{domain: _time} for domain, _time in domains.items()]} for ip, domains in servers.items()])
Output:
{'208.67.222.222': {'google.com': [{'serial': '76', 'time': '98'},
{'serial': '54', 'time': '64'}],
'intuit.com': [{'serial': '21', 'time': '43'},
{'serial': '59', 'time': '65'}]},
'8.8.4.4': {'google.com': [{'serial': '45', 'time': '19'},
{'serial': '76', 'time': '92'}],
'intuit.com': [{'serial': '89', 'time': '45'},
{'serial': '74', 'time': '93'}]},
'8.8.8.8': {'google.com': [{'serial': '14', 'time': '15'},
{'serial': '14', 'time': '78'}],
'intuit.com': [{'serial': '23', 'time': '20'},
{'serial': '18', 'time': '91'}]}}
[{'domains': [{'intuit.com': [{'serial': '21', 'time': '43'},
{'serial': '59', 'time': '65'}]},
{'google.com': [{'serial': '76', 'time': '98'},
{'serial': '54', 'time': '64'}]}],
'server': '208.67.222.222'},
{'domains': [{'intuit.com': [{'serial': '23', 'time': '20'},
{'serial': '18', 'time': '91'}]},
{'google.com': [{'serial': '14', 'time': '15'},
{'serial': '14', 'time': '78'}]}],
'server': '8.8.8.8'},
{'domains': [{'intuit.com': [{'serial': '89', 'time': '45'},
{'serial': '74', 'time': '93'}]},
{'google.com': [{'serial': '45', 'time': '19'},
{'serial': '76', 'time': '92'}]}],
'server': '8.8.4.4'}]
The benefits being, easily keying into the dictionaries, only looping over once to create the insertions.
The only downside to this is that it is not in the same format, and has to be looped over 1 more time to do so, this however still beats having to iterate over the list for every line being inserted.
Okay so I've been working on processing some annotated text output. What I have so far is a dictionary with annotation as key and relations an array of elements:
'Adenotonsillectomy': ['0', '18', '1869', '1716'],
'OSAS': ['57', '61'],
'apnea': ['41', '46'],
'can': ['94', '97', '1796', '1746'],
'deleterious': ['103', '114'],
'effects': ['122', '129', '1806', '1752'],
'for': ['19', '22'],
'gain': ['82', '86', '1776', '1734'],
'have': ['98', '102', ['1776 1786 1796 1806 1816'], '1702'],
'health': ['115', '121'],
'lead': ['67', '71', ['1869 1879 1889'], '1695'],
'leading': ['135', '142', ['1842 1852'], '1709'],
'may': ['63', '66', '1879', '1722'],
'obesity': ['146', '153'],
'obstructive': ['23', '34'],
'sleep': ['35', '40'],
'syndrome': ['47', '55'],
'to': ['143', '145', '1852', '1770'],
'weight': ['75', '81'],
'when': ['130', '134', '1842', '1758'],
'which': ['88', '93', '1786', '1740']}
What I want to do is sort this by the first element in the array and reorder the dict as:
'Adenotonsillectomy': ['0', '18', '1869', '1716']
'for': ['19', '22'],
'obstructive': ['23', '34'],
'sleep': ['35', '40'],
'apnea': ['41', '46'],
etc...
right now I've tried to use operator to sort by value:
sorted(dependency_dict.items(), key=lambda x: x[1][0])
However the output I'm getting is still incorrect:
[('Adenotonsillectomy', ['0', '18', '1869', '1716']),
('deleterious', ['103', '114']),
('health', ['115', '121']),
('effects', ['122', '129', '1806', '1752']),
('when', ['130', '134', '1842', '1758']),
('leading', ['135', '142', ['1842 1852'], '1709']),
('to', ['143', '145', '1852', '1770']),
('obesity', ['146', '153']),
('for', ['19', '22']),
('obstructive', ['23', '34']),
('sleep', ['35', '40']),
('apnea', ['41', '46']),
('syndrome', ['47', '55']),
('OSAS', ['57', '61']),
('may', ['63', '66', '1879', '1722']),
('lead', ['67', '71', ['1869 1879 1889'], '1695']),
('weight', ['75', '81']),
('gain', ['82', '86', '1776', '1734']),
('which', ['88', '93', '1786', '1740']),
('can', ['94', '97', '1796', '1746']),
('have', ['98', '102', ['1776 1786 1796 1806 1816'], '1702'])]
I'm not sure whats going wrong. Any help is appreciated.
The entries are sorted in alphabetical order. If you want to sort them on integer value, convert the value to int first:
sorted(dependency_dict.items(), key=lambda x: int(x[1][0]))