Unfortunately I'm a beginner in XPath and not completely sure how it works. For a project of mine I'm looking for a way to parse 5 columns of a 9-column table. Here is what I got working so far:
url="".join(["http://www.basketball-reference.com/leagues/NBA_2011_games.html"])
#getting the columns 4-7
page=requests.get(url)
tree=html.fromstring(page.content)
# the //text() is because some of the entries are inside <a></a>s
data = tree.xpath('//table[#id="games"]/tbody/tr/td[position()>3 and position()<8]//text()')
So my workaround idea is to just get another list containing only the first column and then combine the two in an extra step. However, that seems inelegant and unnecessary.
Here is the XPath I tried so far:
//table[#id="games"]/tbody/tr/td[position() = 1]/text() | //table[#id="games"]/tbody/tr/td[position()>3 and position()<8]//text()
Somehow that still doesn't include the first column (the date). According to w3schools, | is the operator for combining two XPath expressions.
So here is my complete code right now. The data is put into two lists as of now. In hopes that I didn't do anything too stupid, thank you for your help.
from lxml import html
import requests

url = "http://www.basketball-reference.com/leagues/NBA_1952_games.html"
page = requests.get(url)
tree = html.fromstring(page.content)
reg_data = tree.xpath('//table[@id="games"]/tbody/tr/td[position() = 1]/text() | //table[@id="games"]/tbody/tr/td[position()>3 and position()<8]//text()')
po_data = tree.xpath('//table[@id="games_playoffs"]/tbody/tr/td[position() = 1]/text() | //table[@id="games_playoffs"]/tbody/tr/td[position()>3 and position()<8]//text()')
n = int(len(reg_data) / 5)
if int(year) == 2016:  # year is set earlier in the full script
    for i in range(0, len(reg_data)):
        if len(reg_data[i]) > 3 and len(reg_data[i+1]) > 3:
            n = int(i / 5)
            break
games = []
for i in range(0, n):
    games.append([])
    for j in range(0, 5):
        games[i].append(reg_data[5*i + j])
po_games = []
m = int(len(po_data) / 5)
if year != 2016:
    for i in range(0, m):
        po_games.append([])
        for j in range(0, 5):
            po_games[i].append(po_data[5*i + j])
print(games)
print(po_games)
It looks like a lot of the data is wrapped in link (a) tags, so when you ask for text node children you don't find any; you need to go one level deeper.
Instead of
/text()
do
//text()
The two slashes mean to select text() nodes which are descendants at ANY level.
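To see the difference on a small fragment (a quick sketch using lxml directly; the markup here is made up to mirror the table cells):

from lxml import html

# a cell-like element with its text inside an <a>, as on the games page
el = html.fromstring('<div><a href="/boxscores/x.html">Box Score</a></div>')
print(el.xpath('text()'))     # [] - no text nodes are direct children
print(el.xpath('.//text()'))  # ['Box Score'] - descendant text at any depth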
You can also combine the entire expression into
//table[#id="games"]/tbody/tr/td[position() = 1 or (position()>3 and position()<8)]//text()
instead of having two expressions.
We can even shorten further to
//table[#id="games"]//td[position() = 1 or (position()>3 and position()<8)]//text()
but there is a risk to this expression, as it will pick up td elements that occur anywhere in the table (provided they are in the 1st, 4th, 5th, 6th, or 7th column), not just in rows in the body. For your target this will work, however.
Note also that an expression like [position()=1] is not necessary; you can shorten it to [1]. You only need the position() function when you need the position of a node other than the context node, or when you need a more complex selection like ours, which targets more than just one specific index.
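Putting it all together, the two-expression query in the original script can collapse into a single call; a minimal sketch for the regular-season table, assuming the same page structure as in the question:

from lxml import html
import requests

url = "http://www.basketball-reference.com/leagues/NBA_2011_games.html"
tree = html.fromstring(requests.get(url).content)

# column 1 (date) plus columns 4-7, in document order, in one expression
reg_data = tree.xpath('//table[@id="games"]/tbody/tr/td'
                      '[position() = 1 or (position() > 3 and position() < 8)]//text()')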
I'm working on a text file that contains multiple pieces of information. I converted it into a list in Python, and right now I'm trying to separate the different data into different lists. The data is presented as follows:
CODE/ DESCRIPTION/ Unity/ Value1/ Value2/ Value3/ Value4 and then repeat, an example would be:
P03133 Auxiliar helper un 203.02 417.54 437.22 675.80
My approach to it until now has been:
Creating lists to store each piece of information:
codes = []
description = []
unity = []
cost = []
Through loops, finding a code (based on the code's structure) and using the code's index as a base to find the remaining values.
Finding a code is easy; it's a distinct type of information amongst the other data.
For the remaining values I made a loop to find the next value that is numeric after a code. That way I can delimit the rest of the indexes:
The unity is at the code's index + (index until isnumeric) - 1, since it's the last piece of information prior to the first numeric value in each line.
The cost is at the code's index + (index until isnumeric) + 2; the third value is the only one I need to store.
The description is a little harder: the number of elements that compose it varies across the list, so I used slicing starting at the code's index + 1 and ending at (index until isnumeric) - 2.
for i, carc in enumerate(txtl):
    if carc[0] == "P" and carc[1].isnumeric():
        codes.append(carc)
        j = 0
        while not txtl[i+j].isnumeric():
            j = j + 1
        description.append(" ".join(txtl[i+1:i+j-2]))
        unity.append(txtl[i+j-1])
        cost.append(txtl[i+j])
I'm facing some problems with this approach. Although there will always be more elements in the list after a code, I'm getting the error on this line:
while not txtl[i+j].isnumeric():
IndexError: list index out of range
I'll accept any solution for debugging my code, or even a new approach to the problem.
OBS: I'm also going to have to do this for a really similar data source, but there the code will be just a sequence of 7 numbers, and thus harder to find amongst the other data. Any solution that covers this case as well is also appreciated!
A slight addition to your code should resolve this:
while i+j < len(txtl) and not txtl[i+j].isnumeric():
j += 1
When i+j is out of bounds, the first condition is False, so short-circuit evaluation means the second condition never gets checked.
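A tiny illustration of that short-circuit behaviour, using made-up values:

txtl = ['P03133', 'Auxiliar']
i, j = 1, 5
# i+j is out of range, but the right-hand side is never evaluated,
# so no IndexError is raised
print(i + j < len(txtl) and not txtl[i + j].isnumeric())  # False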
Also, please use a list of dict items instead of 4 different lists, e.g.:
thelist = []
thelist.append({'codes': 69, 'description': 'random text', 'unity': 'whatever', 'cost': 'your life'})
In this way you always have the correct values together in the list, and you don't need to keep track of where you are with indexes or other black magic...
EDIT after comment interactions:
Ok, so in this case you split the line you are processing on the space character, and then process the words in the line.
from pprint import pprint  # just for pretty printing

textl = 'P03133 Auxiliar helper un 203.02 417.54 437.22 675.80'

the_list = []


def is_number(word: str) -> bool:
    # str.isnumeric() doesn't work with floats, only integers,
    # so strip the separators first. See https://stackoverflow.com/a/23639915/9267296
    return word.replace(',', '').replace('.', '').isnumeric()


def handle_line(textl: str):
    # str.split() splits on whitespace by default;
    # the first item in the list is always the code
    words = textl.split()
    desc_words = []
    values = []
    for word in words[1:]:
        if is_number(word):
            values.append(word)
        else:
            desc_words.append(word)
    # the unity is the last non-numeric word, right before the values
    unity = desc_words[-1] if desc_words else None
    description = ' '.join(desc_words[:-1])
    return {'code': words[0], 'description': description, 'unity': unity, 'values': values}


the_list.append(handle_line(textl))
pprint(the_list)
I have two separate data-sets that I am working with:
The primary database, containing retail data (POS), has an item-code column, e.g. (12831903, E03JD920). (I imported this as a data frame, as importing it as a tibble threw a parsing error.)
The secondary database has the item-codes and the corresponding item description. I put this into an R list:
itemlist <- setNames(as.list(itemcodes$l1desc),itemcodes$itemcode)
The main dataset is stored as main. I am attempting to write a function / loop that checks every single row of the column itemcode in main and sees if there is a match in itemlist. If there is a match, the function/loop pastes the corresponding value from the matching key into a new vector. This process loops until a match has been found for every single row.
How can I write this loop?
(The original post included a screenshot of the list and a snippet of the main dataframe.)
So to summarise again: I am looking to write a function that parses through the list to match values with the itemcodes in the main df. My eventual goal is to take the new vector with the matching item descriptions and merge that with the main dataframe once again.
for (i in 1:nrow(transactions)) {
  if (transactions[i, 4] == itemlist[i]) {
    # if the item code in the POS log equals the item code in the list,
    # paste the corresponding value from the key-value pair
    mapped_items[i] <- paste0(names(itemlist))
  }
}
View(mapped_items)
This is what my pseudo-code looks like. I tried running a few initial variants of it, but kept getting an error that NAs were being parsed, though this may have been because of the tibble. Any ideas as to how I can elegantly implement this function/loop?
I am also willing to try this with a dict in Python if necessary.
My python pseudocode is as follows:
def search(itemcode, code):
    match = ""
    for i in itemcode:
        if code == i:
            match = code
    return match

def mapitems(transactions, itemcode):
    mappedlist = []
    for code in transactions:
        mappedlist.append(search(itemcode, code))
    return mappedlist
Actually, you should be using neither R nor Python to do what would be a very basic operation on your actual SQL database:
SELECT *
FROM item_codes ic
INNER JOIN retail_data rd
ON rd.item_code = ic.item_code
Just execute the above query, and then access the result set as some kind of collection in your R or Python script.
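From Python, for instance, executing that query and picking up the results takes only a few lines; a minimal sketch assuming an SQLite file with those two tables (swap in your own driver and connection details):

import sqlite3

conn = sqlite3.connect('retail.db')  # hypothetical database file
rows = conn.execute("""
    SELECT rd.*, ic.l1desc
    FROM item_codes ic
    INNER JOIN retail_data rd
        ON rd.item_code = ic.item_code
""").fetchall()
# each row of the POS data now carries its matching item description
conn.close()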
I need to calculate the total amount of time each group uses a meeting space. But the data set has double and triple booking, so I think I need to fix the data first. Disclosure: My coding experience consists solely of working through a few Dataquest courses, and this is my first stackoverflow posting, so I apologize for errors and transgressions.
Each line of the data set contains the group ID and a start and end time. It also includes the booking type, i.e. reserved, meeting, etc. Generally, the staff reserve a space for the entire period, which creates a single line, and then add multiple lines for each individual function when the details are known. They should segment the original reserved line so it only holds the space in between functions, but instead they double-book the space, so I need to add multiple lines for these interim RES holds, based on the actual holds.
Here's what the data basically looks like:
Existing data:
functions = [['Function', 'Group', 'FunctionType', 'StartTime', 'EndTime'],
             [1, 1, 'RES', '2019/10/04 07:00', '2019/10/06 17:00'],
             [2, 1, 'MTG', '2019/10/05 09:00', '2019/10/05 12:00'],
             [3, 1, 'LUN', '2019/10/05 12:30', '2019/10/05 13:30'],
             [4, 1, 'MTG', '2019/10/05 14:00', '2019/10/05 17:00'],
             [5, 1, 'MTG', '2019/10/06 09:00', '2019/10/06 12:00']]
I've tried to iterate using a for loop:
for index, row in enumerate(functions):
    last_row_index = len(functions) - 1
    if index == last_row_index:
        pass
    else:
        current_index = index
        next_index = index + 1
        if row[3] <= functions[next_index][2]:
            continue
        elif row[4] == 'RES' or row[6] < functions[next_index][6]:
            copied_current_row = row.copy()
            row[3] = functions[next_index][2]
            copied_current_row[2] = functions[next_index][3]
            functions.append(copied_current_row)
There seems to be a logical problem in here, because that last append line seems to put the program into some kind of loop and I have to manually interrupt it. So I'm sure it's obvious to someone experienced, but I'm pretty new.
The reason I've done the comparison to see if a function is RES is that reserved should be subordinate to actual functions. But sometimes there are overlaps between actual functions, so I'll need to create another comparison to decide which one takes precedence, but this is where I'm starting.
How I (think) I want it to end up:
[['Function', 'Group', 'FunctionType', 'StartTime', 'EndTime'],
 [1, 1, 'RES', '2019/10/04 07:00', '2019/10/05 09:00'],
 [2, 1, 'MTG', '2019/10/05 09:00', '2019/10/05 12:00'],
 [1, 1, 'RES', '2019/10/05 12:00', '2019/10/05 12:30'],
 [3, 1, 'LUN', '2019/10/05 12:30', '2019/10/05 13:30'],
 [1, 1, 'RES', '2019/10/05 13:30', '2019/10/05 14:00'],
 [4, 1, 'MTG', '2019/10/05 14:00', '2019/10/05 17:00'],
 [1, 1, 'RES', '2019/10/05 17:00', '2019/10/06 09:00'],
 [5, 1, 'MTG', '2019/10/06 09:00', '2019/10/06 12:00'],
 [1, 1, 'RES', '2019/10/06 12:00', '2019/10/06 17:00']]
This way, I could do a simple calculation of elapsed time for each function line and add it up to see how much time they had the space booked for.
What I'm looking for here is just some direction I should pursue, and I'm definitely not expecting anyone to do the work for me. For example, am I on the right path here, or would it be better to use pandas and vectorized functions? If I can get the basic direction right, I think I can muddle through the specifics.
Thank you very much,
AF
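One possible direction for the splitting step, as a rough sketch rather than a full solution: it assumes a single group whose function rows are sorted by start time, don't overlap one another, and fall inside the RES window. The date strings can be compared directly because the format is fixed-width and zero-padded.

from datetime import datetime

FMT = '%Y/%m/%d %H:%M'

def segment_res(res_row, function_rows):
    # split one RES hold so it only covers the gaps between actual functions
    fn, grp = res_row[0], res_row[1]
    cursor = res_row[3]
    out = []
    for row in function_rows:
        if cursor < row[3]:  # gap before this function: keep a RES hold for it
            out.append([fn, grp, 'RES', cursor, row[3]])
        out.append(row)
        cursor = max(cursor, row[4])
    if cursor < res_row[4]:  # trailing hold after the last function
        out.append([fn, grp, 'RES', cursor, res_row[4]])
    return out

def minutes(row):
    start = datetime.strptime(row[3], FMT)
    end = datetime.strptime(row[4], FMT)
    return (end - start).total_seconds() / 60

segmented = segment_res(functions[1], functions[2:])
total_minutes = sum(minutes(r) for r in segmented)

pandas would also work here, but for a first pass a plain-list approach like this keeps the interval logic visible.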
How can I increment the XPath variable value in a loop in Python for a Selenium WebDriver script?
search_result1 = sel.find_element_by_xpath("//a[not((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[1])]|((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[1])").text
search_result2 = sel.find_element_by_xpath("//a[not((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[2])]|((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[2])").text
search_result3 = sel.find_element_by_xpath("//a[not((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[3])]|((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[3])").text
Why don't you create a list for storing the search results, similar to:
search_results = []
for i in range(1, 11):  # I am assuming 10 results on a page, so you can set your own range
    result = sel.find_element_by_xpath("//a[not((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[%s])]|((//div[contains(@class,'s')]//div[contains(@class,'kv')]//cite)[%s])" % (i, i)).text
    search_results.append(result)
This sample code will create a list of the 10 result values. You can use the idea from this code to write your own; it's just a matter of automating the task.
So:
search_results[0] will give you the first search result
search_results[1] will give you the second search result
...
...
search_results[9] will give you the 10th search result
@Alok Singh Mahor, I don't like hardcoding ranges. I guess the better approach is to iterate through the list of webelements:
search_results = []
result_elements = sel.find_elements_by_xpath("//not/indexed/xpath/for/any/search/result")
for element in result_elements:
    search_result = element.text
    search_results.append(search_result)
I have a list of tuples that contains a tool_id, a time, and a message. I want to select from this list all the elements where the message matches some string, and all the other elements where the time is within some diff of any matching message for that tool.
Here is how I am currently doing this:
# record time for each message matching the specified message for each tool
messageTimes = {}
for row in cdata:  # tool, time, message
    if self.message in row[2]:
        messageTimes[row[0], row[1]] = 1
# now pull out each message that is within the time diff for each matched message
# as well as the matched messages themselves
def determine(tup):
    if self.message in tup[2]:
        return True  # matched message
    for (tool, date_time) in messageTimes:
        if tool == tup[0]:
            if abs(date_time - tup[1]) <= tdiff:
                return True
    return False

cdata[:] = [tup for tup in cdata if determine(tup)]
This code works, but it takes way too long to run: when cdata has 600,000 elements (which is typical for my app), it takes 2 hours.
This data came from a database. Originally I was getting just the data I wanted using SQL, but that was taking too long also. I was selecting just the messages I wanted, then for each one of those doing another query to get the data within the time diff of each. That was resulting in tens of thousands of queries. So I changed it to pull all the potential matches at once and then process it in python, thinking that would be faster. Maybe I was wrong.
Can anyone give me some suggestions on speeding this up?
Updating my post to show what I did in SQL as was suggested.
What I did in SQL was pretty straightforward. The first query was something like:
SELECT tool, date_time, message
FROM event_log
WHERE message LIKE '%foo%'
AND other selection criteria
That was fast enough, but it may return 20 or 30 thousand rows. So then I looped through the result set, and for each row ran a query like this (where dt and t are the date_time and tool from a row from the above select):
SELECT date_time, message
FROM event_log
WHERE tool = t
AND ABS(TIMESTAMPDIFF(SECOND, date_time, dt)) <= timediff
That was taking about an hour.
I also tried doing in one nested query where the inner query selected the rows from my first query, and the outer query selected the time diff rows. That took even longer.
So now I am selecting without the message LIKE '%foo%' clause and I am getting back 600,000 rows and trying to pull out the rows I want from python.
The way to optimize the SQL is to do it all in one query, instead of iterating over 20K rows and doing another query for each one.
Usually this means you need to add a JOIN, or occasionally a sub-query. And yes, you can JOIN a table to itself, as long as you rename one or both copies. So, something like this:
SELECT el2.date_time, el2.message
FROM event_log AS el1
JOIN event_log AS el2
    ON el2.tool = el1.tool
WHERE el1.message LIKE '%foo%'
    AND other selection criteria
    AND ABS(TIMESTAMPDIFF(SECOND, el2.date_time, el1.date_time)) <= timediff
Now, this probably won't be fast enough out of the box, so there are two steps to improve it.
First, look for any columns that obviously need to be indexed. Clearly tool and date_time need simple indices. message may benefit from either a simple index or, if your database has something fancier, maybe something fancier, but given that the initial query was fast enough, you probably don't need to worry about it.
Occasionally, that's sufficient. But usually, you can't guess everything correctly. And there may also be a need to rearrange the order of the queries, etc. So you're going to want to EXPLAIN the query, and look through the steps the DB engine is taking, and see where it's doing a slow iterative lookup when it could be doing a fast index lookup, or where it's iterating over a large collection before a small collection.
For tabular data, you can't go past the Python pandas library, which contains highly optimised code for queries like this.
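For example, here is a minimal pandas sketch of the same join-and-filter idea (assuming cdata fits in memory, date_time is a real datetime, and tdiff is a number of seconds, as in the question; 'foo' stands in for the message filter):

import pandas as pd

df = pd.DataFrame(cdata, columns=['tool', 'date_time', 'message'])

# rows whose message matches, then join them back to all rows for the same tool
matches = df.loc[df['message'].str.contains('foo'), ['tool', 'date_time']]
merged = df.merge(matches, on='tool', suffixes=('', '_match'))

# keep rows within the time window of any matching message
window = (merged['date_time'] - merged['date_time_match']).abs() <= pd.Timedelta(seconds=tdiff)
result = merged.loc[window, ['tool', 'date_time', 'message']].drop_duplicates()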
I fixed this by changing my code as follows:
-first I made messageTimes a dict of lists keyed by the tool:
from collections import defaultdict

messageTimes = defaultdict(list)  # a dict of sorted lists
for row in cdata:  # tool, time, module, message
    if self.message in row[3]:
        messageTimes[row[0]].append(row[1])
-then in the determine function I used bisect:
import bisect

def determine(tup):
    if self.message in tup[3]:
        return True  # matched message
    times = messageTimes[tup[0]]
    le = bisect.bisect_right(times, tup[1])
    ge = bisect.bisect_left(times, tup[1])
    return (le and tup[1] - times[le-1] <= tdiff) or (ge != len(times) and times[ge] - tup[1] <= tdiff)
With these changes the code that was taking over 2 hours took under 20 minutes, and even better, a query that was taking 40 minutes took 8 seconds!
I made 2 more changes and now that 20 minute query is taking 3 minutes:
found = defaultdict(int)

def determine(tup):
    if self.message in tup[3]:
        return True  # matched message
    times = messageTimes[tup[0]]
    idx = found[tup[0]]
    le = bisect.bisect_right(times, tup[1], idx)
    found[tup[0]] = le  # remember the position for the next call for this tool
    return (le and tup[1] - times[le-1] <= tdiff) or (le != len(times) and times[le] - tup[1] <= tdiff)