python get next lines after matching

I need to grab info about the software version and build date (and maybe the file name to download too) from the download page of GIS "Panorama".
I am using Python and Grab. Right now I have a script like this:
from grab import Grab

g = Grab(log_file='out.html')
g.go('http://www.gisinfo.ru/download/download.htm')
for table in g.doc.select('//div[@id="article_header_rubric"]/following-sibling::table[2]'):
    #if test.text().startswith('Профессиона'):
    for tr in table.select('tr'):
        type(tr.select('td'))
        for td in tr.select('td'):
            #if td.text().startswith('Профессиональная ГИС'):
            print (td.text())
And the result is like this:
Драйвер электронного ключа x86 (версия 6.20, 32-разрядные операционные системы,
для Панорама 10 и выше)
15.08.2013
9,6 Mb
drivers.zip
Сервер Guardant Net (версия 5.5.0.10, для Панорама 10 и 11)
25.07.2013
4.1 Mb
gnserver.zip
Сервер Guardant Net (версия 6.3.1.713, для Панорама 12)
11.10.2016
4 Mb
netkey6.zip
Программа для диагностики ключей
13.07.2016
2,6 Mb
diagnostics.zip
Then I filter for what I want:
from grab import Grab

g = Grab(log_file='out.html')
g.go('http://www.gisinfo.ru/download/download.htm')
for table in g.doc.select('//div[@id="article_header_rubric"]/following-sibling::table[2]'):
    #if test.text().startswith('Профессиона'):
    for tr in table.select('tr'):
        type(tr.select('td'))
        for td in tr.select('td'):
            if td.text().startswith('Профессиональная ГИС'):
                print (td.text())
And the result is now:
Профессиональная ГИС
Профессиональная ГИС "Панорама" (версия 12.4.0, для платформы "x64")
Профессиональная ГИС "Панорама" (версия 12.3.2, для платформы "x64", на английском языке)
Профессиональная ГИС "Карта 2011" (версия 11.13.5.7)
But I want the result to look like this:
Профессиональная ГИС "Панорама" (версия 12.4.0, для платформы "x64")
29.12.2016
347 Mb
panorama12x64.zip
Профессиональная ГИС "Панорама" (версия 12.3.2, для платформы "x64", на английском языке)
24.11.2016
376 Mb
panorama12x64en.zip
Профессиональная ГИС "Карта 2011" (версия 11.13.5.7)
11.01.2017
263 Mb
panorama11.zip
Any ideas?

tr.select('td') returns an iterable. In your current code, you test the string on each iteration, so you only print the cell that matches (the first one in the row).
You should store an iterator, take its first value and test it, and if it matches, print all values from the iterable:
for tr in table.select('tr'):
    type(tr.select('td'))
    it = iter(tr.select('td'))
    td = next(it)  # process first field in row
    if td.text().startswith('Профессиональная ГИС'):
        print (td.text())
        for td in it:  # print remaining fields
            print(td.text())
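An equivalent sketch without the explicit iterator, if you prefer to materialise the row first (this assumes, as in the code above, that select('td') yields nodes exposing .text()):

for tr in table.select('tr'):
    cells = [td.text() for td in tr.select('td')]   # all cell texts in this row
    if cells and cells[0].startswith('Профессиональная ГИС'):
        for text in cells:                           # name, date, size, file name
            print(text)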

It works for me like this:
from grab import Grab

g = Grab(log_file='out.html')
g.go('http://www.gisinfo.ru/download/download.htm')
list = []
for table in g.doc.select('//div[@id="article_header_rubric"]/following-sibling::table[2]'):
    for tr in table.select('tr'):
        type(tr.select('td'))
        for td in tr.select('td'):
            list.append(td.text())
for i in list:
    indexi = list.index(i)
    if (i.startswith('Профессиональная ГИС "Панорама" (версия 12') and i.endswith('x64")')):
        panorama12onsite = list[indexi]
        panorama12onsitedata = list[indexi+1]
    elif i.startswith('Профессиональная ГИС "Карта 2011"'):
        panorama11onsite = list[indexi]
        panorama11onsitedata = list[indexi+1]
    elif (i.startswith('Профессиональный векторизатор "Панорама-редактор" (версия 12') and i.endswith('x64")')):
        panedit12onsite = list[indexi]
        panedit12onsitedata = list[indexi+1]
    elif i.startswith('Профессиональный векторизатор "Панорама-редактор" (версия 11'):
        panedit11onsite = list[indexi]
        panedit11onsitedata = list[indexi+1]
print (panorama12onsite)
print (panorama12onsitedata)
print (panorama11onsite)
print (panorama11onsitedata)
print (panedit12onsite)
print (panedit12onsitedata)
print (panedit11onsite)
print (panedit11onsitedata)
Now I can use these variables in another part of the script (compare against the already downloaded version and download a new one if it exists). Thanks all for the help.
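As a side note, a hedged refactor sketch of the same filtering that avoids list.index (which only ever returns the first match) and the many separate variables. Here cells stands for the flat cell list built above (renamed so it does not shadow the built-in list), and the key names are purely illustrative; the endswith('x64"') checks from the original would slot in the same way:

prefixes = {
    'panorama12': 'Профессиональная ГИС "Панорама" (версия 12',
    'panorama11': 'Профессиональная ГИС "Карта 2011"',
    'panedit12': 'Профессиональный векторизатор "Панорама-редактор" (версия 12',
    'panedit11': 'Профессиональный векторизатор "Панорама-редактор" (версия 11',
}
found = {}
for name, date in zip(cells, cells[1:]):    # pair every cell with the one that follows it
    for key, prefix in prefixes.items():
        if name.startswith(prefix):
            found[key] = (name, date)       # (title, publication date)
print(found)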

Related

Trying to create a streamlit app that uses user-provided URLs to scrape and return a downloadable df

I'm trying to use this create_df() function in Streamlit to gather a list of user-provided URLs called "recipes" and loop through each URL to return a df I've labeled "res" towards the end of the function. I've tried several approaches with the Streamlit syntax but I just cannot get this to work as I'm getting this error message:
recipe_scrapers._exceptions.WebsiteNotImplementedError: recipe-scrapers exception: Website (h) not supported.
Have a look at my entire repo here. The main.py script works just fine once you've installed all requirements locally, but when I try running the same script with Streamlit syntax in the streamlit.py script I get the above error. Once you run streamlit run streamlit.py in your terminal and have a look at the UI I've created, it should be quite clear what I'm aiming at: providing the user with a CSV of all ingredients in the recipe URLs they provided, for a convenient grocery shopping list.
Any help would be greatly appreciated!
def create_df(recipes):
    """
    Description:
        Creates one df with all recipes and their ingredients
    Arguments:
        * recipes: list of recipe URLs provided by user
    Comments:
        Note that ingredients with qualitative amounts e.g., "scheutje melk", "snufje zout" have been omitted from the ingredient list
    """
    df_list = []
    for recipe in recipes:
        scraper = scrape_me(recipe)
        recipe_details = replace_measurement_symbols(scraper.ingredients())
        recipe_name = recipe.split("https://www.hellofresh.nl/recipes/", 1)[1]
        recipe_name = recipe_name.rsplit('-', 1)[0]
        print("Processing data for " + recipe_name + " recipe.")
        for ingredient in recipe_details:
            try:
                df_temp = pd.DataFrame(columns=['Ingredients', 'Measurement'])
                df_temp[str(recipe_name)] = recipe_name
                ing_1 = ingredient.split("2 * ", 1)[1]
                ing_1 = ing_1.split(" ", 2)
                item = ing_1[2]
                measurement = ing_1[1]
                quantity = float(ing_1[0]) * 2
                df_temp.loc[len(df_temp)] = [item, measurement, quantity]
                df_list.append(df_temp)
            except (ValueError, IndexError) as e:
                pass
    df = pd.concat(df_list)
    print("Renaming duplicate ingredients e.g., Kruimige aardappelen, Voorgekookte halve kriel met schil -> Aardappelen")
    ingredient_dict = {
        'Aardappelen': ('Dunne frieten', 'Half kruimige aardappelen', 'Voorgekookte halve kriel met schil',
                        'Kruimige aardappelen', 'Roodschillige aardappelen', 'Opperdoezer Ronde aardappelen'),
        'Ui': ('Rode ui'),
        'Kipfilet': ('Kipfilet met tuinkruiden en knoflook'),
        'Kipworst': ('Gekruide kipworst'),
        'Kipgehakt': ('Gemengd gekruid gehakt', 'Kipgehakt met Mexicaanse kruiden', 'Half-om-halfgehakt met Italiaanse kruiden',
                      'Kipgehakt met tuinkruiden'),
        'Kipshoarma': ('Kalkoenshoarma')
    }
    reverse_label_ing = {x:k for k,v in ingredient_dict.items() for x in v}
    df["Ingredients"].replace(reverse_label_ing, inplace=True)
print("Assigning ingredient categories")
category_dict = {
'brood': ('Biologisch wit rozenbroodje', 'Bladerdeeg', 'Briochebroodje', 'Wit platbrood'),
'granen': ('Basmatirijst', 'Bulgur', 'Casarecce', 'Cashewstukjes',
'Gesneden snijbonen', 'Jasmijnrijst', 'Linzen', 'Maïs in blik',
'Parelcouscous', 'Penne', 'Rigatoni', 'Rode kidneybonen',
'Spaghetti', 'Witte tortilla'),
'groenten': ('Aardappelen', 'Aubergine', 'Bosui', 'Broccoli',
'Champignons', 'Citroen', 'Gele wortel', 'Gesneden rodekool',
'Groene paprika', 'Groentemix van paprika, prei, gele wortel en courgette',
'IJsbergsla', 'Kumato tomaat', 'Limoen', 'Little gem',
'Paprika', 'Portobello', 'Prei', 'Pruimtomaat',
'Radicchio en ijsbergsla', 'Rode cherrytomaten', 'Rode paprika', 'Rode peper',
'Rode puntpaprika', 'Rode ui', 'Rucola', 'Rucola en veldsla', 'Rucolamelange',
'Semi-gedroogde tomatenmix', 'Sjalot', 'Sperziebonen', 'Spinazie', 'Tomaat',
'Turkse groene peper', 'Veldsla', 'Vers basilicum', 'Verse bieslook',
'Verse bladpeterselie', 'Verse koriander', 'Verse krulpeterselie', 'Wortel', 'Zoete aardappel'),
'kruiden': ('Aïoli', 'Bloem', 'Bruine suiker', 'Cranberrychutney', 'Extra vierge olijfolie',
'Extra vierge olijfolie met truffelaroma', 'Fles olijfolie', 'Gedroogde laos',
'Gedroogde oregano', 'Gemalen kaneel', 'Gemalen komijnzaad', 'Gemalen korianderzaad',
'Gemalen kurkuma', 'Gerookt paprikapoeder', 'Groene currykruiden', 'Groentebouillon',
'Groentebouillonblokje', 'Honing', 'Italiaanse kruiden', 'Kippenbouillonblokje', 'Knoflookteen',
'Kokosmelk', 'Koreaanse kruidenmix', 'Mayonaise', 'Mexicaanse kruiden', 'Midden-Oosterse kruidenmix',
'Mosterd', 'Nootmuskaat', 'Olijfolie', 'Panko paneermeel', 'Paprikapoeder', 'Passata',
'Pikante uienchutney', 'Runderbouillonblokje', 'Sambal', 'Sesamzaad', 'Siciliaanse kruidenmix',
'Sojasaus', 'Suiker', 'Sumak', 'Surinaamse kruiden', 'Tomatenblokjes', 'Tomatenblokjes met ui',
'Truffeltapenade', 'Ui', 'Verse gember', 'Visbouillon', 'Witte balsamicoazijn', 'Wittewijnazijn',
'Zonnebloemolie', 'Zwarte balsamicoazijn'),
'vlees': ('Gekruide runderburger', 'Half-om-half gehaktballetjes met Spaanse kruiden', 'Kipfilethaasjes', 'Kipfiletstukjes',
'Kipgehaktballetjes met Italiaanse kruiden', 'Kippendijreepjes', 'Kipshoarma', 'Kipworst', 'Spekblokjes',
'Vegetarische döner kebab', 'Vegetarische kaasschnitzel', 'Vegetarische schnitzel'),
'zuivel': ('Ei', 'Geraspte belegen kaas', 'Geraspte cheddar', 'Geraspte grana padano', 'Geraspte oude kaas',
'Geraspte pecorino', 'Karnemelk', 'Kruidenroomkaas', 'Labne', 'Melk', 'Mozzarella',
'Parmigiano reggiano', 'Roomboter', 'Slagroom', 'Volle yoghurt')
}
reverse_label_cat = {x:k for k,v in category_dict.items() for x in v}
df["Category"] = df["Ingredients"].map(reverse_label_cat)
col = "Category"
first_col = df.pop(col)
df.insert(0, col, first_col)
    df = df.sort_values(['Category', 'Ingredients'], ascending=[True, True])
    print("Merging ingredients by row across all recipe columns using justify()")
    gp_cols = ['Ingredients', 'Measurement']
    oth_cols = df.columns.difference(gp_cols)
    arr = np.vstack(df.groupby(gp_cols, sort=False, dropna=False).apply(lambda gp: justify(gp.to_numpy(), invalid_val=np.NaN, axis=0, side='up')))
    # Reconstruct DataFrame
    # Remove entirely NaN rows based on the non-grouping columns
    res = (pd.DataFrame(arr, columns=df.columns)
           .dropna(how='all', subset=oth_cols, axis=0))
    res = res.fillna(0)
    res['Total'] = res.drop(['Ingredients', 'Measurement'], axis=1).sum(axis=1)
    res = res[res['Total'] != 0]  # To drop rows that are being duplicated with 0 for some reason; will check later
    print("Processing complete!")
    return res
Your function create_df needs a list as its argument, but st.text_input always returns a string.
In your streamlit.py, replace df_download = create_df(recs) with df_download = create_df([recs]). But if you need to handle multiple URLs, you should use str.split, like this:
def create_df(recipes):
    recipes = recipes.split(",")  # <--- add this line to make a list from the user input
    ### rest of the code ###

if download:
    df_download = create_df(recs)
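For context, a minimal sketch of the Streamlit side under the same assumption that recs comes from st.text_input as one comma-separated string; the widget labels and file name are made up, and st.download_button is just one way to hand the CSV back to the user:

import streamlit as st

recs = st.text_input("Recipe URLs (comma-separated)")
download = st.button("Create shopping list")

if download:
    df_download = create_df(recs)   # create_df now splits the string into a list itself
    st.download_button(
        "Download ingredients as CSV",
        df_download.to_csv(index=False).encode("utf-8"),
        file_name="groceries.csv",
    )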

Storing key text as headers and value text as rows in a data frame in Python using Beautiful Soup

for imo in imos:
    ...
    ...
    keys_div = soup.find_all("div", {"class", "col-4 keytext"})
    values_div = soup.find_all("div", {"class", "col-8 valuetext"})
    for key, value in zip(keys_div, values_div):
        print(key.text + ": " + value.text)
Output:
Ship Name: MAERSK ADRIATIC
Shiptype: Chemical/Products Tanker
IMO/LR No.: 9636632
Gross: 23,297
Call Sign: 9V3388
Deadweight: 37,538
MMSI No.: 566429000
Year of Build: 2012
Flag: Singapore
Status: In Service/Commission
Operator: Handytankers K/S
Shipbuilder: Hyundai Mipo Dockyard Co Ltd
ShipType: Chemical/Products Tanker
Built: 2012
GT: 23,297
Deadweight: 37,538
Length Overall: 184.000
Length (BP): 176.000
Length (Reg): 177.460
Bulbous Bow: Yes
Breadth Extreme: 27.430
Breadth Moulded: 27.400
Draught: 11.500
Depth: 17.200
Keel To Mast Height: 46.900
Displacement: 46565
T/CM: 45.0
This is the output for one imo. I want to store this output in a dataframe and write it to CSV, with the key text as headers and the value text as rows for all the IMOs. Please help me on how to do it.
All you have to do is add the results to a list and then output that list to a dataframe.
import pandas as pd

filepath = r"C:\users\test\test_file.csv"
output_data = []
for imo in imos:
    keys_div = [i.text for i in soup.find_all("div", {"class": "col-4 keytext"})]
    values_div = [i.text for i in soup.find_all("div", {"class": "col-8 valuetext"})]
    dict1 = dict(zip(keys_div, values_div))
    output_data.append(dict1)
df = pd.DataFrame(output_data)
df.to_csv(filepath, index=False)
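One caveat: soup has to be rebuilt inside the loop for each imo (the question elides that part with "..."), otherwise every iteration appends the same page's data. A hedged sketch, with the URL pattern left purely hypothetical:

import requests
import pandas as pd
from bs4 import BeautifulSoup

filepath = r"C:\users\test\test_file.csv"
output_data = []
for imo in imos:
    # hypothetical URL pattern; substitute the page you are actually scraping
    page = requests.get("https://example.com/ships/{}".format(imo))
    soup = BeautifulSoup(page.text, "html.parser")
    keys_div = [i.text for i in soup.find_all("div", {"class": "col-4 keytext"})]
    values_div = [i.text for i in soup.find_all("div", {"class": "col-8 valuetext"})]
    output_data.append(dict(zip(keys_div, values_div)))

pd.DataFrame(output_data).to_csv(filepath, index=False)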

Using beautiful soup inside a loop to search for urls

I am trying to write a python script that goes onto the website https://www.premierleague.com/players, takes a list of football player names from a spreadsheet I have (400+ footballer names), and inside a loop, iteratively searches for the link to each football player's page. For example : https://www.premierleague.com/players/4040/Benik-Afobe/overview.
The final bit of the script is commented out as I haven't finalised it yet, but for context of what I'm eventually going to get to: it will take the list of URLs that I will have obtained, iteratively search each player's page for the link to the player image, and append it to a list.
I managed to get it to work for an individual player (Benik Afobe), but since adding the 'players_list' and trying a loop, I get the following error:
Traceback (most recent call last):
File "C:/Users/Liam/Documents/GitHub/Football_Scraping/fantast_pl_images.py", line 33, in <module>
player_link = soup.find('a', href=re.compile('%s'))['href'] %player
TypeError: 'NoneType' object is not subscriptable
Does anyone know what I'm doing wrong and how to get my loop working?
The Repo of my project can be found here: https://github.com/leej11/Football_Scraping
# Import the Libraries that I need
import urllib3
import certifi
from bs4 import BeautifulSoup
import re
import pandas as pd

# Specify the URL
url = 'https://www.premierleague.com/players'
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
response = http.request('GET', url, headers={'User-Agent': 'Mozilla/5.0'})

# Parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(response.data, "html.parser")

# Importing the list of players I want to scrape the image of
players_list = pd.read_csv('epl_players_anki_clean.csv')

# Test that it's pulling all of the players names correctly
print(players_list.iloc[:,0])
print(type(players_list))

# Convert the pandas dataframe to a list of strings, with each item being the string of a player name
list_of_players = players_list['name'].values.tolist()
print(list_of_players)

# Setup an empty list to append the player links to
player_link_list = []

# Loop over the list of player names, and search for the player url and append it to the player_link_list
for player in players_list:
    player_link = soup.find('a', href=re.compile('%s'))['href'] %player
    print(player_link)
    player_url = 'https://www.premierleague.com' + '%s' %player_link
    print(player_url)
    player_link_list.append(player_url)

##### The final step
##### To be worked on in a bit, basically take the list of links and loop over it pulling out the player image links and appending them to a list
#####
# url2 = player_url
# http2 = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
# response2 = http2.request('GET', url2, headers={'User-Agent': 'Mozilla/5.0'})
# soup2 = BeautifulSoup(response2.data, "html.parser")
# player_img = str(soup2.find("img", {'alt':'Benik Afobe'})['data-player'])
# print(player_img)
#
# photo_link = 'http://platform-static-files.s3.amazonaws.com/premierleague/photos/players/250x250/' + '%s' %player_img + '.png'
# print(photo_link)
It appears that the Premier League's player listing is dynamic, meaning that a browser script loads additional players as the user scrolls down. Thus, using requests or urllib to find all the players will not work. Therefore, you will have to use a browser automation tool called selenium:
Install:
pip install selenium
Then, install the proper binding for the web browser you are using:
http://selenium-python.readthedocs.io/installation.html#drivers
import re
import time
import csv
from selenium import webdriver

driver = webdriver.Chrome('/path/to/driver')  # substitute Chrome with the browser you are using
driver.get('https://www.premierleague.com/players')

# keep scrolling until the page height stops growing, i.e. no more players load
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(0.5)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

time.sleep(10)
players = re.findall('www\.premierleague\.com/players/(.*?)/(.*?)/overview', driver.page_source)
csv_filedata = list(csv.reader(open('epl_players_anki_clean.csv')))
player_dict = {re.sub('-', ' ', b): (a, b) for a, b in players}
new_rows = [[csv_filedata[0]]+['url']]+[a+['https://www.premierleague.com/players/{}/{}/overview'.format(*player_dict[a[0]])] for a in csv_filedata]
with open('players.csv', 'a') as f:
    write = csv.writer(f)
    write.writerows(new_rows)
player_dict stores the following (truncated data):
Output:
{u'Asmir Begovic': [u'2537', u'Asmir-Begovic'], u'Ragnar Klavan': [u'15608', u'Ragnar-Klavan'], u'Eddie Nketiah': [u'14451', u'Eddie-Nketiah'], u'Mohamed Salah': [u'5178', u'Mohamed-Salah'], u'Jos\xe9 Holebas': [u'5713', u'Jos\xe9-Holebas'], u'Christian Eriksen': [u'4845', u'Christian-Eriksen'], u'Kurt Zouma': [u'5175', u'Kurt-Zouma'], u'Gareth Barry': [u'1308', u'Gareth-Barry'], u'Diego Costa': [u'4941', u'Diego-Costa'], u'Sam McQueen': [u'9649', u'Sam-McQueen'], u'Roque Mesa': [u'22575', u'Roque-Mesa'], u'Siem de Jong': [u'4885', u'Siem-de-Jong'], u'Lazar Markovic': [u'5078', u'Lazar-Markovic'], u'Adam Federici': [u'3182', u'Adam-Federici'], u'Dean Marney': [u'2359', u'Dean-Marney'], u'Nathan Broadhead': [u'14636', u'Nathan-Broadhead'], u'Alex Pritchard': [u'4433', u'Alex-Pritchard'], u'Matthew Pennington': [u'9895', u'Matthew-Pennington'], u'Tomas Kalas': [u'4101', u'Tomas-Kalas'], u'Nathan Ak\xe9': [u'4499', u'Nathan-Ak\xe9'], u'Mathias Normann': [u'23556', u'Mathias-Normann'], u'Grzegorz Krychowiak': [u'12735', u'Grzegorz-Krychowiak'], u'Wojciech Szczesny': [u'3543', u'Wojciech-Szczesny'], u'Charlie Adam': [u'4081', u'Charlie-Adam'], u'Marko Grujic': [u'13985', u'Marko-Grujic'], u'Harry Maguire': [u'9566', u'Harry-Maguire'], u'Isaiah Brown': [u'4674', u'Isaiah-Brown'], u'Matz Sels': [u'16723', u'Matz-Sels'], u'Leighton Baines': [u'3030', u'Leighton-Baines'], u'Marouane Fellaini': [u'3604', u'Marouane-Fellaini'], u'Jairo Riedewald': [u'4878', u'Jairo-Riedewald'], u'Glenn Murray': [u'4772', u'Glenn-Murray'], u'Tom Cadman': [u'14547', u'Tom-Cadman'], u'Ryan Shawcross': [u'3158', u'Ryan-Shawcross'], u"N'Golo Kant\xe9": [u'13492', u"N'Golo-Kant\xe9"], u'Aaron Ramsey': [u'3548', u'Aaron-Ramsey'], u'Stephen Kingsley': [u'10517', u'Stephen-Kingsley'], u'Eliaquim Mangala': [u'5334', u'Eliaquim-Mangala'], u'Josh Tymon': [u'13477', u'Josh-Tymon'], u'Mohamed Diam\xe9': [u'3982', u'Mohamed-Diam\xe9'], u'Sofiane Boufal': [u'12584', u'Sofiane-Boufal'], u'Nya Kirby': [u'15134', u'Nya-Kirby'], u'Max Melbourne': [u'15160', u'Max-Melbourne'], u'Marcin Bulka': [u'23695', u'Marcin-Bulka'], u'Rub\xe9n Sobrino': [u'16608', u'Rub\xe9n-Sobrino'], u'Tareiq Holmes Dennis': [u'8340', u'Tareiq-Holmes-Dennis'], u'Martin Cranie': [u'2559', u'Martin-Cranie'], u'Connor Mahoney': [u'7985', u'Connor-Mahoney'], u'Jamaal Lascelles': [u'9257', u'Jamaal-Lascelles'], u'Phil Foden': [u'14805', u'Phil-Foden'], u'Arijanet Muric': [u'19911', u'Arijanet-Muric'], u'Sadio Man\xe9': [u'6519', u'Sadio-Man\xe9'], u"Aiden O'Neill": [u'20095', u"Aiden-O'Neill"], u'Steve Cook': [u'8045', u'Steve-Cook'], u'Samuel Shashoua': [u'15142', u'Samuel-Shashoua'], u'Kyle Bartley': [u'3312', u'Kyle-Bartley'], u'Bojan': [u'4898', u'Bojan'], u'Jason Puncheon': [u'4084', u'Jason-Puncheon'], u'Damien Delaney': [u'1911', u'Damien-Delaney'], u'Steven Defour': [u'5345', u'Steven-Defour'], u'Christian Walton': [u'8159', u'Christian-Walton'], u'Timothy Fosu Mensah': [u'13561', u'Timothy-Fosu-Mensah'], u'Michael Keane': [u'4333', u'Michael-Keane'], u'Levi Lumeka': [u'14170', u'Levi-Lumeka'], u'Chancel Mbemba': [u'5850', u'Chancel-Mbemba'], u'Brice Dja Dj\xe9dj\xe9': [u'5577', u'Brice-Dja-Dj\xe9dj\xe9'], u'Vurnon Anita': [u'4550', u'Vurnon-Anita'], u'Jefferson Montero': [u'10518', u'Jefferson-Montero'], u'Toby Alderweireld': [u'4916', u'Toby-Alderweireld'], u'Dominic Calvert Lewin': [u'9576', u'Dominic-Calvert-Lewin'], u'Brad Jackson': [u'13246', u'Brad-Jackson'], u'James McArthur': [u'4224', u'James-McArthur'], u'Mat Ryan': [u'12192', u'Mat-Ryan'], 
u'Bartosz Kapustka': [u'19679', u'Bartosz-Kapustka'], u'Robert Snodgrass': [u'4558', u'Robert-Snodgrass'], u'Jonathan Leko': [u'13866', u'Jonathan-Leko'], u'Harry Arter': [u'8050', u'Harry-Arter'], u'Connor Goldson': [u'9634', u'Connor-Goldson'], u'Shaun Hobson': [u'21691', u'Shaun-Hobson'], u'Ayoze P\xe9rez': [u'10487', u'Ayoze-P\xe9rez'], u'Marc Pugh': [u'8049', u'Marc-Pugh'], u'Luciano Narsingh': [u'7122', u'Luciano-Narsingh'], u'Michael Folivi': [u'14298', u'Michael-Folivi'], u'Adri\xe1n': [u'4852', u'Adri\xe1n'], u'Mason Holgate': [u'10564', u'Mason-Holgate'], u'Joy Mukena': [u'15118', u'Joy-Mukena'], u"Lewis O'Brien": [u'24353', u"Lewis-O'Brien"], u'Javier Manquillo': [u'4918', u'Javier-Manquillo'], u"Dara O'Shea": [u'15154', u"Dara-O'Shea"], u"Clinton N'Jie": [u'6903', u"Clinton-N'Jie"], u'Yoan Gouffran': [u'4554', u'Yoan-Gouffran'], u'Michael Carrick': [u'1634', u'Michael-Carrick'], u'Moha': [u'13778', u'Moha'], u'Michy Batshuayi': [u'7450', u'Michy-Batshuayi'], u'Nathaniel Chalobah': [u'4105', u'Nathaniel-Chalobah'], u'Ryan Inniss': [u'4760', u'Ryan-Inniss'], u'Etienne Capoue': [u'4843', u'Etienne-Capoue'], u'Badou Ndiaye': [u'20538', u'Badou-Ndiaye'], u'Alexandre Lacazette': [u'6899', u'Alexandre-Lacazette'], u'Charlie Rowan': [u'14285', u'Charlie-Rowan'], u'Nathan Ferguson': [u'23976', u'Nathan-Ferguson'], u'Anthony Georgiou': [u'15146', u'Anthony-Georgiou'], u'Dujon Sterling': [u'14572', u'Dujon-Sterling'], u'Axel Tuanzebe': [u'13559', u'Axel-Tuanzebe'], u'Emre Can': [u'5001', u'Emre-Can'], u'Sam Surridge': [u'13195', u'Sam-Surridge'], u'Ryan Kent': [u'13509', u'Ryan-Kent'], u'Marc Albrighton': [u'3564', u'Marc-Albrighton'], u'Joe Williams': [u'10454', u'Joe-Williams'], u'Tom Heaton': [u'2933', u'Tom-Heaton'], u'Danny Rose': [u'3507', u'Danny-Rose'], u'Nathan Redmond': [u'3811', u'Nathan-Redmond'], u'Chicharito': [u'4161', u'Chicharito'], u'Dean Whitehead': [u'2980', u'Dean-Whitehead'], u'M.J. 
Williams': [u'10464', u'M.J.-Williams'], u'Harry Winks': [u'7488', u'Harry-Winks'], u'Josh Sims': [u'15374', u'Josh-Sims'], u'Charlie Gilmour': [u'14453', u'Charlie-Gilmour'], u'Aaron Wan Bissaka': [u'14164', u'Aaron-Wan-Bissaka'], u'Marc Muniesa': [u'4822', u'Marc-Muniesa'], u'Beni Baningime': [u'14623', u'Beni-Baningime'], u'Demarai Gray': [u'7946', u'Demarai-Gray'], u'Junior Stanislas': [u'3766', u'Junior-Stanislas'], u'Liam Rosenior': [u'2464', u'Liam-Rosenior'], u'Nathaniel Clyne': [u'4604', u'Nathaniel-Clyne'], u'Kamil Grabara': [u'19909', u'Kamil-Grabara'], u'Anthony Martial': [u'11272', u'Anthony-Martial'], u'Ben Foster': [u'2932', u'Ben-Foster'], u'Laurent Depoitre': [u'16747', u'Laurent-Depoitre'], u'Mike van der Hoorn': [u'4877', u'Mike-van-der-Hoorn'], u'Didier Ndong': [u'20708', u'Didier-Ndong'], u'Jordon Mutch': [u'3333', u'Jordon-Mutch'], u'Harry Kane': [u'3960', u'Harry-Kane'], u'Fernandinho': [u'4804', u'Fernandinho'], u'Riyad Mahrez': [u'8983', u'Riyad-Mahrez'], u'Kleton Perntreou': [u'14144', u'Kleton-Perntreou'], u'Dion Henry': [u'10855', u'Dion-Henry'], u'Kelechi Iheanacho': [u'13554', u'Kelechi-Iheanacho'], u'Salom\xf3n Rond\xf3n': [u'6030', u'Salom\xf3n-Rond\xf3n'], u'Ryan Allsop': [u'3732', u'Ryan-Allsop'], u'Erik Pieters': [u'4821', u'Erik-Pieters'], u'Willy Caballero': [u'10466', u'Willy-Caballero'], u'Claudio Yacob': [u'4673', u'Claudio-Yacob'], u'Craig Dawson': [u'4198', u'Craig-Dawson'], u'Jayson Molumby': [u'15293', u'Jayson-Molumby'], u'Lucas Leiva': [u'3137', u'Lucas-Leiva'], u'Martin Dubravka': [u'6451', u'Martin-Dubravka'], u'Bruno': [u'8162', u'Bruno'], u'Sam Johnstone': [u'4331', u'Sam-Johnstone'], u'Jes\xfas G\xe1mez': [u'11070', u'Jes\xfas-G\xe1mez'], u'Tomer Hemed': [u'13234', u'Tomer-Hemed'], u'Victor Moses': [u'3983', u'Victor-Moses'], u'Vincent Janssen': [u'15481', u'Vincent-Janssen'], u'Lo\xefc Remy': [u'4572', u'Lo\xefc-Remy'], u'Craig Cathcart': [u'3160', u'Craig-Cathcart'], u'Leroy Fer': [u'4810', u'Leroy-Fer'], u"Kieran O'Hara": [u'13584', u"Kieran-O'Hara"], u'Ola Aina': [u'10439', u'Ola-Aina'], u'Winston Reid': [u'4209', u'Winston-Reid'], u'Jose Baxter': [u'3608', u'Jose-Baxter'], u'Michael Obafemi': [u'21532', u'Michael-Obafemi'], u'Bruno Martins Indi': [u'11177', u'Bruno-Martins-Indi'], u'Laurent Koscielny': [u'4030', u'Laurent-Koscielny'], u'Borja Bast\xf3n': [u'16622', u'Borja-Bast\xf3n'], u'Daryl Janmaat': [u'10480', u'Daryl-Janmaat'], u'Freddy Woodman': [u'10479', u'Freddy-Woodman'], u'Jordy Hiwula Mayifuila': [u'10949', u'Jordy-Hiwula-Mayifuila'], u'Raphael Spiegel': [u'4679', u'Raphael-Spiegel'], u'Anthony Knockaert': [u'8982', u'Anthony-Knockaert'], u'Harry Lewis': [u'14982', u'Harry-Lewis'], u'Henrikh Mkhitaryan': [u'5102', u'Henrikh-Mkhitaryan'], u'Santiago Cazorla': [u'4477', u'Santiago-Cazorla'], u'Sean Scannell': [u'8887', u'Sean-Scannell'], u'Christian Atsu': [u'4859', u'Christian-Atsu'], u'Pascal Gro\xdf': [u'22542', u'Pascal-Gro\xdf'], u'Charlie Austin': [u'9468', u'Charlie-Austin'], u'Sam Byram': [u'8945', u'Sam-Byram'], u'Daniel Sturridge': [u'3154', u'Daniel-Sturridge'], u'Ga\xebtan Bong': [u'5721', u'Ga\xebtan-Bong'], u'Martin Kelly': [u'3644', u'Martin-Kelly'], u'Jack Payne': [u'9664', u'Jack-Payne'], u'Michel Vorm': [u'4398', u'Michel-Vorm'], u'Oriol Romeu': [u'4286', u'Oriol-Romeu'], u'Philip Billing': [u'8882', u'Philip-Billing'], u'Matthew Lowton': [u'4487', u'Matthew-Lowton'], u'Wayne Hennessey': [u'2569', u'Wayne-Hennessey'], u'Geoff Cameron': [u'4636', u'Geoff-Cameron'], u'Tammy Abraham': [u'13286', 
u'Tammy-Abraham'], u'Elvis Manu': [u'12374', u'Elvis-Manu'], u'Marvin Zeegelaar': [u'10123', u'Marvin-Zeegelaar'], u'Jordy Clasie': [u'12365', u'Jordy-Clasie'], u'Wayne Routledge': [u'2681', u'Wayne-Routledge'], u'Tom Anderson': [u'8234', u'Tom-Anderson'], u'Stephen Duke McKenna': [u'23738', u'Stephen-Duke-McKenna'], u'Harry Charsley': [u'14632', u'Harry-Charsley'], u'Erik Lamela': [u'4842', u'Erik-Lamela'], u'Elias Kachunga': [u'19611', u'Elias-Kachunga'], u'Molla Wagu\xe9': [u'21730', u'Molla-Wagu\xe9'], u'Ilkay G\xfcndogan': [u'5101', u'Ilkay-G\xfcndogan'], u'Ashley Williams': [u'4403', u'Ashley-Williams'], u'Lewis Grabban': [u'8055', u'Lewis-Grabban'], u'Seamus Coleman': [u'3600', u'Seamus-Coleman'], u'Jason Denayer': [u'11002', u'Jason-Denayer'], u'Jack Wilshere': [u'3547', u'Jack-Wilshere'], u'Calum Chambers': [u'4620', u'Calum-Chambers'], u'Samir Nasri': [u'3546', u'Samir-Nasri'], u'Alexis S\xe1nchez': [u'4973', u'Alexis-S\xe1nchez'], u'Kyle Walker': [u'3955', u'Kyle-Walker'], u'Martin Olsson': [u'2867', u'Martin-Olsson'], u'Modou Barrow': [u'10520', u'Modou-Barrow'], u'Robbie Brady': [u'4158', u'Robbie-Brady'], u'Tom Davies': [u'13389', u'Tom-Davies'], u'Fraser Forster': [u'3170', u'Fraser-Forster'], u'Francis Coquelin': [u'3549', u'Francis-Coquelin'], u'Matt Targett': [u'4815', u'Matt-Targett'], u'Davy Klaassen': [u'4886', u'Davy-Klaassen'], u"Stefan O'Connor": [u'10425', u"Stefan-O'Connor"], u'Fraser Hornby': [u'23744', u'Fraser-Hornby'], u'Tim Krul': [u'3169', u'Tim-Krul'], u'Ryan Hill': [u'21858', u'Ryan-Hill'], u'J\xfcrgen Locadia': [u'7124', u'J\xfcrgen-Locadia'], u'Ki Sung yueng': [u'4656', u'Ki-Sung-yueng'], u'Leon Britton': [u'2152', u'Leon-Britton'], u'Mesut \xd6zil': [u'4714', u'Mesut-\xd6zil'], u'Alex Denny': [u'14643', u'Alex-Denny'], u'Nemanja Matic': [u'3861', u'Nemanja-Matic'], u'Ryan Fraser': [u'8052', u'Ryan-Fraser'], u'Julian Speroni': [u'2664', u'Julian-Speroni'], u'Joel Campbell': [u'4254', u'Joel-Campbell'], u'Robert Elliot': [u'2214', u'Robert-Elliot'], u'Tosin Adarabioyo': [u'13549', u'Tosin-Adarabioyo'], u'Jack Colback': [u'3713', u'Jack-Colback'], u'Soufyan Ahannach': [u'24695', u'Soufyan-Ahannach'], u'Aaron Connolly': [u'21653', u'Aaron-Connolly'], u'Yasin Ben El Mhanni': [u'20879', u'Yasin-Ben-El-Mhanni'], u'Kazenga Lua Lua': [u'3173', u'Kazenga-Lua-Lua'], u'Ben Chilwell': [u'13491', u'Ben-Chilwell'], u'Aaron Ramsdale': [u'13703', u'Aaron-Ramsdale']}
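As a follow-up for the commented-out final step, a hedged sketch of how player_dict could be used to build each overview URL and then read the img tag's data-player attribute, exactly as the original snippet does for Benik Afobe. It reuses the http PoolManager and BeautifulSoup already set up in the question; whether the photo id equals the numeric URL id is not guaranteed, so it is read from the page each time:

for name, (pid, slug) in player_dict.items():
    overview_url = 'https://www.premierleague.com/players/{}/{}/overview'.format(pid, slug)
    response2 = http.request('GET', overview_url, headers={'User-Agent': 'Mozilla/5.0'})
    soup2 = BeautifulSoup(response2.data, "html.parser")
    img = soup2.find("img", {'alt': name})
    if img is not None and img.has_attr('data-player'):
        photo_link = ('http://platform-static-files.s3.amazonaws.com/premierleague/'
                      'photos/players/250x250/{}.png'.format(img['data-player']))
        print(name, photo_link)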

XML to CSV in PYTHON: Extract series of subnodes for every node

My goal is to convert an .XML file into a .CSV file.
This part of the code is already functional.
However, I also want to extract the sub-sub-nodes of one of the "parent" nodes.
Maybe an example would be more self-explanatory:
Here is the structure of my XML:
<nedisCatalogue>
    <headerInfo>
        <feedVersion>1-0</feedVersion>
        <dateCreated>2018-01-22T23:37:01+0100</dateCreated>
        <supplier>Nedis_BENED</supplier>
        <locale>nl_BE</locale>
    </headerInfo>
    <productList>
        <product>
            <nedisPartnr><![CDATA[VS-150/63BA]]></nedisPartnr>
            <nedisArtlid>17005</nedisArtlid>
            <vendorPartnr><![CDATA[TONFREQ-ELKOS / BIPOL 150, 5390]]></vendorPartnr>
            <brand><![CDATA[Visaton]]></brand>
            <EAN>4007540053905</EAN>
            <intrastatCode>8532220000</intrastatCode>
            <UNSPSC>52161514</UNSPSC>
            <headerText><![CDATA[Crossover Foil capacitor]]></headerText>
            <internetText><![CDATA[Bipolaire elco met een ruwe folie en een zeer goede prijs/kwaliteits-verhouding voor de bouw van cross-overs. 63 Vdc, 10% tolerantie.]]></internetText>
            <generalText><![CDATA[Dimensions 16 x 35 mm
            ]]></generalText>
            <images>
                <image type="2" order="15">767736.JPG</image>
            </images>
            <attachments>
            </attachments>
            <categories>
                <tree name="Internet_Tree_ISHP">
                    <entry depth="001" id="1067858"><![CDATA[Audio]]></entry>
                    <entry depth="002" id="1067945"><![CDATA[Speakers]]></entry>
                    <entry depth="003" id="1068470"><![CDATA[Accessoires]]></entry>
                </tree>
            </categories>
            <properties>
                <property id="360" multiplierID="" unitID="" valueID=""><![CDATA[...]]></property>
            </properties>
            <status>
                <code status="NORMAL"></code>
            </status>
            <packaging quantity="1" weight="8"></packaging>
            <introductionDate>2015-10-26</introductionDate>
            <serialnumberKeeping>N</serialnumberKeeping>
            <priceLevels>
                <normalPricing from="2017-02-13" to="2018-01-23">
                    <price level="1" moq="1" currency="EUR">2.48</price>
                </normalPricing>
                <specialOfferPricing></specialOfferPricing>
                <goingPriceInclVAT currency="EUR" quantity="1">3.99</goingPriceInclVAT>
            </priceLevels>
            <tax>
            </tax>
            <stock>
                <inStockLocal>25</inStockLocal>
                <inStockCentral>25</inStockCentral>
                <ATP>
                    <nextExpectedStockDateLocal></nextExpectedStockDateLocal>
                    <nextExpectedStockDateCentral></nextExpectedStockDateCentral>
                </ATP>
            </stock>
        </product>
        ....
    </productList>
</nedisCatalogue>
And here is the code that I have now:
import xml.etree.ElementTree as ET
import csv

tree = ET.parse("/Users/BE07861/Documents/nedis_catalog_2018-01-23_nl_BE_32191_v1-0_xml")
root = tree.getroot()
f = open('/Users/BE07861/Documents/test2.csv', 'w')
csvwriter = csv.writer(f, delimiter='ç')
count = 0
head = ['Nedis Part Number', 'Nedis Article ID', 'Vendor Part Number', 'Brand', 'EAN', 'Header text', 'Internet Text', 'General Text', 'categories']
prdlist = root[1]
prdct = prdlist[5]
cat = prdct[12]
tree1 = cat[0]
csvwriter.writerow(head)
for time in prdlist.findall('product'):
    row = []
    nedis_number = time.find('nedisPartnr').text
    row.append(nedis_number)
    nedis_art_id = time.find('nedisArtlid').text
    row.append(nedis_art_id)
    vendor_part_nbr = time.find('vendorPartnr').text
    row.append(vendor_part_nbr)
    Brand = time.find('brand').text
    row.append(Brand)
    ean = time.find('EAN').text
    row.append(ean)
    header_text = time.find('headerText').text
    row.append(header_text)
    internet_text = time.find('internetText').text
    row.append(internet_text)
    general_text = time.find('generalText').text
    row.append(general_text)
    categ = time.find('categories').find('tree').find('entry').text
    row.append(categ)
    csvwriter.writerow(row)
f.close()
If you run the code, you'll see that I only retrieve the first "entry" of the categories/tree, which is normal. However, I don't know how to create a loop that, for every "categories" node, creates new columns such as Category1, Category2 & Category3 holding the "entry" values.
My result should look like this (one CSV row, shown here as field: value pairs):
Nedis Part Number: VS-150/63BA
Nedis Article ID: 17005
Vendor Part Number: TONFREQ-ELKOS / BIPOL 150, 5390
Brand: Visaton
EAN: 4,00754E+12
Header text: Crossover Foil capacitor
Internet Text: Bipolaire elco …
General Text: Dimensions 16 x 35 mm
Category1: Audio
Category2: Speakers
Category3: Accessoires
I've really tried my best but didn't manage to find the solution.
Any help would be very much appreciated!!! :)
Thanks a lot,
Allan
I think this is what you're looking for:
for child in time.find('categories').find('tree'):
    categ = child.text
    row.append(categ)
Here's a solution that loops through the xml once to figure out how many headers to add, adds the headers, and then loops through each product's category list:
Updated to iterate through images in addition to categories. This is the biggest difference:
for child in time.find('categories').find('tree'):
    categ = child.text
    row.append(categ)
    curcat += 1
while curcat < maxcat:
    row.append('')
    curcat += 1
It's going to figure out the maximum number of categories on a single record and then add that many columns. If a particular record has fewer categories, this code will stick blank values in as placeholders so the column headers always line up with the data.
For instance:
Cat1 Cat2 Cat3 Img1 Img2 Img3
A B C 1 2 3
D E <blank> 4 <blank> <blank>
Here's the full solution:
import xml.etree.ElementTree as ET
import csv

tree = ET.parse("c:\\python\\xml.xml")
root = tree.getroot()
f = open('c:\\python\\xml.csv', 'w')
csvwriter = csv.writer(f, delimiter=',')
count = 0
head = ['Nedis Part Number', 'Nedis Article ID', 'Vendor Part Number', 'Brand', 'EAN', 'Header text', 'Internet Text', 'General Text']
prdlist = root[1]

maxcat = 0
for time in prdlist.findall('product'):
    cur = 0
    for child in time.find('categories').find('tree'):
        cur += 1
    if cur > maxcat:
        maxcat = cur
for cnt in range(0, maxcat):
    head.append('Category ' + str(cnt + 1))

maximg = 0
for time in prdlist.findall('product'):
    cur = 0
    for child in time.find('images'):
        cur += 1
    if cur > maximg:
        maximg = cur
for cnt in range(0, maximg):
    head.append('Image ' + str(cnt + 1))

csvwriter.writerow(head)

for time in prdlist.findall('product'):
    row = []
    nedis_number = time.find('nedisPartnr').text
    row.append(nedis_number)
    nedis_art_id = time.find('nedisArtlid').text
    row.append(nedis_art_id)
    vendor_part_nbr = time.find('vendorPartnr').text
    row.append(vendor_part_nbr)
    Brand = time.find('brand').text
    row.append(Brand)
    ean = time.find('EAN').text
    row.append(ean)
    header_text = time.find('headerText').text
    row.append(header_text)
    internet_text = time.find('internetText').text
    row.append(internet_text)
    general_text = time.find('generalText').text
    row.append(general_text)
    curcat = 0
    for child in time.find('categories').find('tree'):
        categ = child.text
        row.append(categ)
        curcat += 1
    while curcat < maxcat:
        row.append('')
        curcat += 1
    curimg = 0
    for img in time.find('images'):
        image = img.text
        row.append(image)
        curimg += 1
    while curimg < maximg:
        row.append('')
        curimg += 1
    csvwriter.writerow(row)
f.close()
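If you would rather not repeat the while-loop padding for categories and images, a small helper along these lines (the name pad is illustrative) does the same thing:

def pad(values, width):
    """Return values extended with '' placeholders up to width columns."""
    return list(values) + [''] * (width - len(values))

# the category and image parts of each row could then be written as, e.g.:
# row.extend(pad([child.text for child in time.find('categories').find('tree')], maxcat))
# row.extend(pad([img.text for img in time.find('images')], maximg))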

Write a simple list into a new CSV file

I'm new to Python and I would like to know how I could write a new CSV file which contains a simple list. Then I will use this file in an Excel worksheet.
My ENTIRE code:
import csv
import urllib
from bs4 import BeautifulSoup

sock = urllib.urlopen("http://www.fatm.com.es/Datos_Equipo.asp?Cod=03CA0007")
htmlSource = sock.read()
sock.close()
soup = BeautifulSoup(htmlSource)
form = soup.find("form", {'id': "FORM1"})
table = form.find("table")
entidad = [item.text.strip() for item in table.find_all('td')]
valores = [item.get('value') for item in form.find_all('input')]
lista = entidad
i = 0
x = 1
while i <= 10:
    lista.insert(i+x, valores[i])
    i += 1
    x += 1
print lista
w = csv.writer(file(r'C:\Python27\yo.csv','wb'), dialect='excel')
w.writerows(lista)
lista = [u'Club',
u'CLUB TENIS DE MESA PORTUENSE',
u'Nombre Equipo',
u'C.T.M. PORTUENSE',
u'Telefono fijo',
u'630970055',
u'Telefono Movil',
u'630970055',
u'E_Mail',
u'M.LOPEZ_70#HOTMAIL.COM',
u'Local de Juego',
u'INSTITUTO MAR DE CADIZ',
u'Categoria',
u'DIVISION HONOR ANDALUZA',
u'Grupo',
u'GRUPO 9',
u'Delegado',
u'SANCHEZ SANTANO, JOSE MARIA',
u'*Dias de Juego',
u'SABADO',
u'*Hora de Juego',
u'17:00']
My results: an empty CSV file. :(
Thanks in advance!!!
Here you go:
import csv

lista = [u'Club',
         u'CLUB TENIS DE MESA PORTUENSE',
         u'Nombre Equipo',
         u'C.T.M. PORTUENSE',
         u'Telefono fijo',
         u'630970055',
         u'Telefono Movil',
         u'630970055', u'E_Mail', u'M.LOPEZ_70#HOTMAIL.COM', u'Local de Juego', u'INSTITUTO MAR DE CADIZ', u'Categoria', u'DIVISION HONOR ANDALUZA', u'Grupo', u'GRUPO 9', u'Delegado', u'SANCHEZ SANTANO, JOSE MARIA', u'*Dias de Juego', u'SABADO', u'*Hora de Juego', u'17:00']

header = []
row = []
for i, val in enumerate(lista):
    if i % 2 == 0:
        header.append(val)
    else:
        row.append(val)

out = open('file.csv', 'w')
w = csv.writer(out, dialect='excel')
w.writerow(header)
w.writerow(row)
out.close()
As a follow-up side note, what I would do in your place is to create one list for your column names, like:
header = ['col_name1', 'col_name2', ... ]
and a list of lists for the values like:
values = [
[row1_val1, row1_val2, ...],
[row2_val1, row2_val2, ...],
...
]
Then you can do:
w.writerow(header)
for row in values:
    w.writerow(row)
Check the doc of the csv module, there might be a way to write all rows in one go. I've never used it myself.
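For the record, csv.writer does have a writerows method that writes an iterable of rows in one call, so with the header/values layout above the loop can be replaced like this (on Python 3 you would also open the file with newline=''):

import csv

header = ['col_name1', 'col_name2']
values = [
    ['row1_val1', 'row1_val2'],
    ['row2_val1', 'row2_val2'],
]

out = open('file.csv', 'w')
w = csv.writer(out, dialect='excel')
w.writerow(header)      # header row
w.writerows(values)     # all data rows in one go
out.close()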
