I was trying to convert the XML table into a data frame using Beautiful Soup.
import bs4 as bs
import urllib.request
import pandas as pd
source = urllib.request.urlopen("http://reports.ieso.ca/public/GenOutputCapability/PUB_GenOutputCapability.xml").read()
soup = bs.BeautifulSoup(source,'xml')
GName = soup.find_all('GeneratorName')
Ftype = soup.find_all('FuelType')
Hour = soup.find_all('Hour')
Mwatt = soup.find_all('EnergyMW')
data = []
for i in range(0, len(GName)):
    rows = [GName[i].get_text(), Ftype[i].get_text(),
            Hour[i].get_text(), Mwatt[i].get_text()]
    data.append(rows)
df = pd.DataFrame(data, columns=['Generator Name', 'Fuel Type', 'Hour', 'Energy MW'],
                  dtype=int)
display(df)
Generator Name Fuel Type Hour Energy MW
0 BRUCEA-G1 NUCLEAR 1 777
1 BRUCEA-G2 NUCLEAR 2 777
2 BRUCEA-G3 NUCLEAR 3 777
3 BRUCEA-G4 NUCLEAR 4 778
4 BRUCEB-G5 NUCLEAR 5 780
... ... ... ... ...
175 STONE MILLS SF SOLAR 8 0
176 WINDSOR AIRPORT SF SOLAR 9 0
177 ATIKOKAN-G1 BIOFUEL 10 0
178 CALSTOCKGS BIOFUEL 11 0
179 TBAYBOWATER CTS BIOFUEL 12 0
180 rows × 4 columns
The final data frame only gives the Energy MW of index 0; it should have the values for all 180 stations.
I am stuck.
Thanks
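For what it's worth, the likely cause is that the four find_all() lists are flat: soup.find_all('EnergyMW') returns every EnergyMW element in the whole document, so pairing them by index with the 180 generator names mixes values from different generators and hours. Below is a minimal sketch that instead walks each <Generator> element, assuming the Outputs/Output/Hour/EnergyMW nesting used in the answers further down:

import bs4 as bs
import urllib.request
import pandas as pd

source = urllib.request.urlopen(
    "http://reports.ieso.ca/public/GenOutputCapability/PUB_GenOutputCapability.xml"
).read()
soup = bs.BeautifulSoup(source, 'xml')

data = []
for gen in soup.find_all('Generator'):
    name = gen.find('GeneratorName').get_text()
    fuel = gen.find('FuelType').get_text()
    # assumption: each <Output> under <Outputs> carries an <Hour> and (usually) an <EnergyMW>
    for out in gen.find('Outputs').find_all('Output'):
        hour = out.find('Hour').get_text()
        mw = out.find('EnergyMW')
        data.append([name, fuel, int(hour), int(mw.get_text()) if mw else None])

df = pd.DataFrame(data, columns=['Generator Name', 'Fuel Type', 'Hour', 'Energy MW'])

Another route, using lxml: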
from lxml.html import parse
import pandas as pd
def main(url):
    data = parse(url).find('.//generators')
    allin = []
    for i in data:
        allin.append({
            'GeneratorName': i[0].text,
            'FuelType': i[1].text,
            'Outputs': [x.text for x in i[2].cssselect('EnergyMW')],
            'Capabilities': [x.text for x in i[3].cssselect('EnergyMW')],
            'capacities': [x.text for x in i[4].cssselect('EnergyMW')]
        })
    df = pd.DataFrame(allin)
    print(df)
main('http://reports.ieso.ca/public/GenOutputCapability/PUB_GenOutputCapability.xml')
Output:
GeneratorName ... capacities
0 BRUCEA-G1 ... [795, 795, 795, 795, 795, 795, 795, 795, 795, ...
1 BRUCEA-G2 ... [779, 779, 779, 779, 779, 779, 779, 779, 779, ...
2 BRUCEA-G3 ... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3 BRUCEA-G4 ... [760, 760, 760, 760, 760, 760, 760, 760, 760, ...
4 BRUCEB-G5 ... [817, 817, 817, 817, 817, 817, 817, 817, 817, ...
.. ... ... ...
175 STONE MILLS SF ... [54, 54, 54, 54, 54, 54, 54, 54, 54, 54, 54, 5...
176 WINDSOR AIRPORT SF ... [50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 5...
177 ATIKOKAN-G1 ... [215, 215, 215, 215, 215, 215, 215, 215, 215, ...
178 CALSTOCKGS ... [38, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
179 TBAYBOWATER CTS ... [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, ...
[180 rows x 5 columns]
One way might be to use the indicated transform document and etree.XSLT to apply the transform that generates the table. Select that table with pandas read_html, then do a little sprucing on the headers if desired.
from lxml import etree
from pandas import read_html as rh

# apply the IESO-provided XSLT stylesheet to the raw XML report
transform = etree.XSLT(etree.parse('http://reports.ieso.ca/docrefs/stylesheet/GenOutputCapability_HTML_t1-4.xsl'))
result_tree = transform(etree.parse('http://reports.ieso.ca/public/GenOutputCapability/PUB_GenOutputCapability.xml'))

# pull the generated HTML table, promote the second header row to column names, and drop the header rows
df = rh(str(result_tree), match='Hours')[0]
df.columns = df.iloc[1, :]
df = df.iloc[2:, ]
df
Read about the transform steps here: https://lxml.de/xpathxslt.html#xslt
That's a pretty gnarly XML you have there, and it takes some effort to convert it to a table that looks like the one on the page:
#first, some required imports
import itertools
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
url = 'http://reports.ieso.ca/public/GenOutputCapability/PUB_GenOutputCapability.xml'
req = requests.get(url)
soup = bs(req.text,'lxml')
gens = soup.select('Generator')
# We have to get the names of the 180 generators and double them so each generator has 2 rows:
gen_names = [gen.select_one('generatorname').text for gen in gens]
gen_names = list(itertools.chain.from_iterable(itertools.repeat(x, 2) for x in gen_names))
# we also need to create a list of 180 Capability and Output pairs and flatten it to 360 entries:
vars = list(itertools.chain.from_iterable(itertools.repeat(x, 180) for x in [["Capability","Output"]]))
vars = list(itertools.chain(*vars))
#all that in order to create a MultiIndex dataframe:
index = pd.MultiIndex.from_arrays([gen_names,vars], names=["Generator", "Hours"])
# create column names equal to the hours - note that, depending on the time of day the data is downloaded, there could be fewer or more columns (a tweak for that is sketched after the output below)
cols = list(range(1,20))
# now collect the data; there may be shorter ways to do that, but I used a longer method for easier readability
data = []
for gen in gens:
    row = []
    row.append([g.text for g in gen.select('Capability energymw')])
    row.append([g.text for g in gen.select('Output energymw')])
    data.extend(row)
pd.DataFrame(data,index=index,columns=cols)
Output (pardon the formatting):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Generator Hours
BRUCEA-G1 Capability 795 795 795 795 795 795 795 795 795 795 795 795 795 795 795 795 795 795 795
Output 776 775 775 774 775 775 774 774 774 775 776 777 777 775 774 773 773 774 774
BRUCEA-G2 Capability 779 779 779 779 779 779 779 779 779 779 779 779 779 779 779 779 779 779 779
etc.
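Regarding the note above about the column count depending on when the data is downloaded, a small tweak (a sketch using the data and index built above) derives the number of hour columns from the rows instead of hardcoding 19:

# derive the number of hour columns from the longest row instead of hardcoding it
n_hours = max(len(r) for r in data)
cols = list(range(1, n_hours + 1))
pd.DataFrame(data, index=index, columns=cols)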
Related
I want to remove any players who didn't have over 1000 MP (minutes played).
I could easily write:
league_stats= pd.read_csv("1996.csv")
league_stats = league_stats.drop("Player-additional", axis=1)
league_stats_1000 = league_stats[league_stats['MP'] > 1000]
However, because players sometimes play for multiple teams in a year...this code doesn't account for that.
For example, Sam Cassell has four entries and none are above 1000 MP, but in total his MP for the season was over 1000. By running the above code I remove him from the new dataframe.
I am wondering if there is a way to group the DataFrame by matching Rank (the Rk column gives players who played on different teams the same rank number for each team they played on) and then filter by whether the total of their MP is >= 1000.
This is the page I got the data from: 1996-1997 season.
Above the data table and to the left of the blue check box there is a dropdown menu called "Share and Export". From there I clicked on "Get table as CSV (for Excel)". After that I saved the CSV in a text editor and changed the file extension to .csv to upload it to Jupyter Notebook.
This is a solution I came up with:
url = 'https://www.basketball-reference.com/leagues/NBA_1997_totals.html'
df = pd.read_html(url)[0]
tot_df = df.loc[df['Tm'] == 'TOT']
mp_1000 = tot_df.loc[tot_df["MP"] < 1000]
# Create list of indexes with unnecessary entries to be removed. We have TOT and don't need these rows.
# *** For the record, I came up with this list by manually going through the data.
indexes_to_remove = [5,6,24, 25, 66, 67, 248, 249, 447, 448, 449, 275, 276, 277, 19, 20, 21, 377, 378, 477, 478, 479,
54, 55, 451, 452, 337, 338, 156, 157, 73, 74, 546, 547, 435, 436, 437, 142, 143, 421, 42, 43, 232,
233, 571, 572, 363, 364, 531, 532, 201, 202, 111, 112, 139, 140, 307, 308, 557, 558, 93, 94, 512,
513, 206, 207, 208, 250, 259, 286, 287, 367, 368, 271, 272, 102, 103, 34, 35, 457, 458, 190, 191,
372, 373, 165, 166
]
df_drop_tot = df.drop(labels=indexes_to_remove, axis=0)
df_drop_tot
First off, no need to manually download the csv and then read it into pandas. You can load in the table using pandas' .read_html().
And yes, you can simply get the list of ranks, player names, or whatever, that have greater than 1000 MP, then use that list to filter the dataframe.
import pandas as pd
url = 'https://www.basketball-reference.com/leagues/NBA_1997_totals.html'
df = pd.read_html(url)[0]
df = df[df['Rk'].ne('Rk')]
df['MP'] = df['MP'].astype(int)
players_1000_rk_list = list(df[df['MP'] >= 1000]['Rk'])  # <- converts the "Rk" column into a list; the next line keeps only the rows whose "Rk" appears in this list of players with >= 1000 MP
players_df = df[df['Rk'].isin(players_1000_rk_list)]
Output: filters down from 574 rows to 282 rows
print(players_df)
Rk Player Pos Age Tm G ... AST STL BLK TOV PF PTS
0 1 Mahmoud Abdul-Rauf PG 27 SAC 75 ... 189 56 6 119 174 1031
1 2 Shareef Abdur-Rahim PF 20 VAN 80 ... 175 79 79 225 199 1494
3 4 Cory Alexander PG 23 SAS 80 ... 254 82 16 146 148 577
7 6 Ray Allen* SG 21 MIL 82 ... 210 75 10 149 218 1102
10 9 Greg Anderson C 32 SAS 82 ... 34 63 67 73 225 322
.. ... ... .. .. ... .. ... ... ... .. ... ... ...
581 430 Walt Williams SF 26 TOR 73 ... 197 97 62 174 282 1199
582 431 Corliss Williamson SF 23 SAC 79 ... 124 60 49 157 263 915
583 432 Kevin Willis PF 34 HOU 75 ... 71 42 32 119 216 842
589 438 Lorenzen Wright C 21 LAC 77 ... 49 48 60 79 211 561
590 439 Sharone Wright C 24 TOR 60 ... 28 15 50 93 146 390
[282 rows x 30 columns]
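If you'd rather compute each player's season total explicitly instead of relying on per-row MP values, here is a sketch on the same frame (dropping the 'TOT' rows first so multi-team players aren't double-counted):

no_tot = df[df['Tm'] != 'TOT']                           # keep only per-team stint rows
season_mp = no_tot.groupby('Rk')['MP'].transform('sum')  # season total per player
players_df = no_tot[season_mp >= 1000]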
I have a 20 x 20 square matrix. I want to take the first 2 rows and columns out of every 5 rows and columns, which means the output should be an 8 x 8 square matrix. This can be done in 2 consecutive steps as follows:
import numpy as np
m = 5
n = 2
A = np.arange(400).reshape(20,-1)
B = np.asarray([row for i, row in enumerate(A) if i % m < n])
C = np.asarray([col for j, col in enumerate(B.T) if j % m < n]).T
However, I am looking for efficiency. Is there a more NumPy-idiomatic way to do this? I would prefer to do it in one step.
You can use np.ix_ to retain the elements whose row / column indices are less than 2 modulo 5:
import numpy as np
m = 5
n = 2
A = np.arange(400).reshape(20,-1)
mask = np.arange(20) % 5 < 2
result = A[np.ix_(mask, mask)]
print(result)
This outputs:
[[ 0 1 5 6 10 11 15 16]
[ 20 21 25 26 30 31 35 36]
[100 101 105 106 110 111 115 116]
[120 121 125 126 130 131 135 136]
[200 201 205 206 210 211 215 216]
[220 221 225 226 230 231 235 236]
[300 301 305 306 310 311 315 316]
[320 321 325 326 330 331 335 336]]
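For context on why np.ix_ is needed here: indexing with the two boolean masks directly pairs the kept row and column indices element-wise instead of forming their cross product, so it only returns the diagonal entries. A small demonstration:

import numpy as np

A = np.arange(400).reshape(20, -1)
mask = np.arange(20) % 5 < 2

print(A[mask, mask].shape)          # (8,)  - element-wise pairing, diagonal entries only
print(A[np.ix_(mask, mask)].shape)  # (8, 8) - open mesh of kept rows x kept columns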
Very similar to the accepted answer, but you can just reference the row/column indices directly. I would be interested to see whether a benchmark shows any difference from using np.ix_() in the accepted answer.
Return Specific Row/Column by Numeric Indices
import numpy as np
m = 5
n = 2
A = np.arange(400).reshape(20,-1)
B = np.asarray([row for i, row in enumerate(A) if i % m < n])
C = np.asarray([col for j, col in enumerate(B.T) if j % m < n]).T
rowAndColIds = list(filter(lambda x: x % m < n, range(20)))
# print(rowAndColIds)
result = A[:, rowAndColIds][rowAndColIds]
print(result)
You could use index broadcasting
i = (np.r_[:20:5][:, None] + np.r_[:2]).ravel()
A[i[:,None], i]
output:
array([[ 0, 1, 5, 6, 10, 11, 15, 16],
[ 20, 21, 25, 26, 30, 31, 35, 36],
[100, 101, 105, 106, 110, 111, 115, 116],
[120, 121, 125, 126, 130, 131, 135, 136],
[200, 201, 205, 206, 210, 211, 215, 216],
[220, 221, 225, 226, 230, 231, 235, 236],
[300, 301, 305, 306, 310, 311, 315, 316],
[320, 321, 325, 326, 330, 331, 335, 336]])
I have a CSV file (.txt) containing detections from a CNN:
Example of CSV file:
filename,type,confidence,xmin,ymin,xmax,ymax
27cz1_SLRM_0, barrow,87, 128, 176, 176, 224
27cz1_SLRM_101, barrow,80, 480, 400, 512, 432
27cz1_SLRM_103, celtic_field,85, 0, 112, 96, 256
27cz1_SLRM_103, celtic_field,80, 256, 384, 384, 544
27cz1_SLRM_103, celtic_field,80, 160, 96, 304, 272
27cz1_SLRM_103, barrow,85, 416, 160, 464, 208
27cz1_SLRM_107, celtic_field,84, 96, 448, 224, 576
27cz1_SLRM_107, barrow,94, 256, 432, 304, 480
27cz1_SLRM_107, barrow,87, 128, 368, 176, 416
27cz1_SLRM_107, barrow,84, 64, 304, 112, 352
27cz1_SLRM_107, barrow,80, 64, 208, 96, 240
Example of Coordinate file:
27cz1_SLRM_0, 179927.5, 475140.0
27cz1_SLRM_101, 183062.5, 476565.0
27cz1_SLRM_103, 183632.5, 476565.0
27cz1_SLRM_107, 184772.5, 476565.0
In order to reduce the number of false positives I want to take out all the single detections of the class celtic_field.
In the above example the celtic_field detections from 27cz1_SLRM_103 should remain, but the celtic_field detection from 27cz1_SLRM_107 should be removed.
As part of the further processing, the CSV is opened as a dictionary and turned into a GeoJSON entry (see below). This works fine, but I would like to include the above.
coords = {}
coords_file = csv.reader(open(coordinate_location))
for row in coords_file:
    coords[row[0]] = [float(row[1]), float(row[2])]

# open output file
output_file = csv.DictReader(open(output_location))

# turn detections into polygons
for row in output_file:
    img_name = row['filename']
    detection_class = row['type'].strip()
    confidence = row['confidence']
    #combo = row['filename'] + row['type']
    #detection_type = detection['tool_label']
    if detection_class == 'celtic_field':
        detectionDict = {
            "type": "Feature",
            "geometry": {
                "type": "Polygon",
                "coordinates": []
            },
            "properties": {
                "detection_type": detection_class,
                "confidence": confidence
            }
        }
        polyCoords = []
        coordinate_x_1 = coords[img_name][0] + float(row['xmin']) * 0.5
        coordinate_x_2 = coords[img_name][0] + float(row['xmin']) * 0.5
        coordinate_x_3 = coords[img_name][0] + float(row['xmax']) * 0.5
        coordinate_x_4 = coords[img_name][0] + float(row['xmax']) * 0.5
        coordinate_y_1 = coords[img_name][1] - float(row['ymin']) * 0.5
        coordinate_y_2 = coords[img_name][1] - float(row['ymax']) * 0.5
        coordinate_y_3 = coords[img_name][1] - float(row['ymax']) * 0.5
        coordinate_y_4 = coords[img_name][1] - float(row['ymin']) * 0.5
        polyCoords.append([coordinate_x_1, coordinate_y_1])
        polyCoords.append([coordinate_x_2, coordinate_y_2])
        polyCoords.append([coordinate_x_3, coordinate_y_3])
        polyCoords.append([coordinate_x_4, coordinate_y_4])
        polyCoords.append([coordinate_x_1, coordinate_y_1])
        detectionDict['geometry']['coordinates'].append(polyCoords)
        output['features'].append(detectionDict)
try this:
df['count'] = df.groupby(['filename', 'type']).transform('count')['confidence']
df = df[~((df['count'] == 1) & (df['type'] == 'celtic_field'))]
print(df)
filename type confidence xmin ymin xmax ymax count
0 27cz1_SLRM_0 barrow 87 128 176 176 224 1
1 27cz1_SLRM_101 barrow 80 480 400 512 432 1
2 27cz1_SLRM_103 celtic_field 85 0 112 96 256 3
3 27cz1_SLRM_103 celtic_field 80 256 384 384 544 3
4 27cz1_SLRM_103 celtic_field 80 160 96 304 272 3
5 27cz1_SLRM_103 barrow 85 416 160 464 208 1
7 27cz1_SLRM_107 barrow 94 256 432 304 480 4
8 27cz1_SLRM_107 barrow 87 128 368 176 416 4
9 27cz1_SLRM_107 barrow 84 64 304 112 352 4
10 27cz1_SLRM_107 barrow 80 64 208 96 240 4
Thanks to the help of many of you (especially Billy Bonaros), I have found a working solution:
# remove loose celtic_fields
df = pd.read_csv(output_location)
df['count'] = df.groupby(['filename', 'type']).transform('count')['confidence']
for i, row in df.iterrows():
    if row['count'] == 1 and row['type'] == ' celtic_field':
        df.drop(i, inplace=True)
df.to_csv('...\csv.txt')

# open output file
output_file = csv.DictReader(open('...\csv.txt'))
Many thanks!
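As a small follow-up (a sketch, not part of the original answers): the temporary CSV round-trip isn't strictly necessary, since the filtered frame can feed the polygon loop directly:

import pandas as pd

df = pd.read_csv(output_location)   # same output_location as above
counts = df.groupby(['filename', 'type'])['confidence'].transform('count')
df = df[~((counts == 1) & (df['type'].str.strip() == 'celtic_field'))]

# iterate the rows as dicts, the same shape csv.DictReader yields
for row in df.to_dict('records'):
    detection_class = row['type'].strip()
    # ... continue with the polygon-building code from the question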
I have the following sample data:
165 150 238 402 395 571 365 446 284 278 322 282 236
16 5 19 10 12 5 18 22 6 4 5
259 224 249 193 170 151 95 86 101 58 49
6013 7413 8976 10392 12678 9618 9054 8842 9387 11088 11393;
It is the equivalent of a two-dimensional array (except each row does not have an equal number of columns). At the end of each line there is a space and then a \n, except for the final entry, which is followed by no space and only a ;.
Would anyone know the pyparsing grammar to parse this? I've been trying something along the following lines, but it will not match.
data = Group(OneOrMore(Group(OneOrMore(Word(nums) + SPACE)) + LINE) + \
Group(OneOrMore(Word(nums) + SPACE)) + Word(nums) + Literal(";")
The desired output would ideally be as follows
[['165', '150', '238', '402', '395', '571', '365', '446', '284', '278',
'322', '282', '236'], ['16', '5', ... ], [...], ['6013', ..., '11393']]
Any assistance would be greatly appreciated.
You can use the stopOn argument to OneOrMore to make it stop matching. Then, since newlines are by default skippable whitespace, the next group can start matching, and it will just skip over the newline and start at the next integer.
import pyparsing as pp
data_line = pp.Group(pp.OneOrMore(pp.pyparsing_common.integer(), stopOn=pp.LineEnd()))
data_lines = pp.OneOrMore(data_line) + pp.Suppress(';')
Applying this to your sample data:
data = """\
165 150 238 402 395 571 365 446 284 278 322 282 236
16 5 19 10 12 5 18 22 6 4 5
259 224 249 193 170 151 95 86 101 58 49
6013 7413 8976 10392 12678 9618 9054 8842 9387 11088 11393;"""
parsed = data_lines.parseString(data)
from pprint import pprint
pprint(parsed.asList())
Prints:
[[165, 150, 238, 402, 395, 571, 365, 446, 284, 278, 322, 282, 236],
[16, 5, 19, 10, 12, 5, 18, 22, 6, 4, 5],
[259, 224, 249, 193, 170, 151, 95, 86, 101, 58, 49],
[6013, 7413, 8976, 10392, 12678, 9618, 9054, 8842, 9387, 11088, 11393]]
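If the tokens should stay strings, as in the desired output shown in the question, one option (a sketch with the same structure as the answer above) is to swap the integer parser for a plain Word of digits:

import pyparsing as pp

data_line = pp.Group(pp.OneOrMore(pp.Word(pp.nums), stopOn=pp.LineEnd()))
data_lines = pp.OneOrMore(data_line) + pp.Suppress(';')
# data_lines.parseString(data).asList() then yields lists of strings like ['165', '150', ...]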
I have a 3d array as follows:
ThreeD_Arrays = np.random.randint(0, 1000, (5, 4, 3))
array([[[715, 226, 632],
[305, 97, 534],
[ 88, 592, 902],
[172, 932, 263]],
[[895, 837, 431],
[649, 717, 39],
[363, 121, 274],
[334, 359, 816]],
[[520, 692, 230],
[452, 816, 887],
[688, 509, 770],
[290, 856, 584]],
[[286, 358, 462],
[831, 26, 332],
[424, 178, 642],
[955, 42, 938]],
[[ 44, 119, 757],
[908, 937, 728],
[809, 28, 442],
[832, 220, 348]]])
Now I would like to turn it into a DataFrame like this:
Add a Date column as indicated and use the column names A, B, C.
How to do this transformation? Thanks!
Based on the answer to this question, we can use a MultiIndex. First, create the MultiIndex and a flattened DataFrame.
A = np.random.randint(0, 1000, (5, 4, 3))
names = ['x', 'y', 'z']
index = pd.MultiIndex.from_product([range(s) for s in A.shape], names=names)
df = pd.DataFrame({'A': A.flatten()}, index=index)['A']
Now we can reshape it however we like:
df = df.unstack(level='x').swaplevel().sort_index()
df.columns = ['A', 'B', 'C']
df.index.names = ['DATE', 'i']
This is the result:
A B C
DATE i
0 0 715 226 632
1 895 837 431
2 520 692 230
3 286 358 462
4 44 119 757
1 0 305 97 534
1 649 717 39
2 452 816 887
3 831 26 332
4 908 937 728
2 0 88 592 902
1 363 121 274
2 688 509 770
3 424 178 642
4 809 28 442
3 0 172 932 263
1 334 359 816
2 290 856 584
3 955 42 938
4 832 220 348
You could convert your 3D array to a pandas Panel, then flatten it to a 2D DataFrame using .to_frame() (note that pd.Panel was deprecated and later removed, so this only works on older pandas versions):
import numpy as np
import pandas as pd
np.random.seed(2016)
arr = np.random.randint(0, 1000, (5, 4, 3))
pan = pd.Panel(arr)
df = pan.swapaxes(0, 2).to_frame()
df.index = df.index.droplevel('minor')
df.index.name = 'Date'
df.index = df.index+1
df.columns = list('ABC')
yields
A B C
Date
1 875 702 266
1 940 180 971
1 254 649 353
1 824 677 745
...
4 675 488 939
4 382 238 225
4 923 926 633
4 664 639 616
4 770 274 378
Alternatively, you could reshape the array to shape (20, 3), form the DataFrame as usual, and then fix the index:
import numpy as np
import pandas as pd
np.random.seed(2016)
arr = np.random.randint(0, 1000, (5, 4, 3))
df = pd.DataFrame(arr.reshape(-1, 3), columns=list('ABC'))
df.index = np.repeat(np.arange(arr.shape[0]), arr.shape[1]) + 1
df.index.name = 'Date'
print(df)
yields the same result.
ThreeD_Arrays = np.random.randint(0, 1000, (5, 4, 3))
# stack the second axis into the index, then expand each innermost length-3 array into columns
df = pd.DataFrame([list(l) for l in ThreeD_Arrays]).stack().apply(pd.Series).reset_index(1, drop=True)
df.index.name = 'Date'
df.columns = list('ABC')