I have a Solr index that contains a large number of quite long text files, indexed with the text_sv schema. I want to print every single snippet for each indexed document. However, I only get a few of them back, even though I have tried to manipulate the various settings as specified in the documentation.
Here is the code section:
results = solr.search(search_string, rows = result_limit, sort = order,
    **{
        'hl':'true',
        'hl.fragsize': 100,
        'hl.fl': 'fulltext',
        'hl.maxAnalyzedChars': -1,
        'hl.snippets': 100,
    })
resultcounter = 0
for result in results:
    resultcounter += 1
    fulltexturl = '<a href="http://localhost/source/\
' + result['filename'] + '">' + result['filename'][:-4] + '</a>'
    year = str(result['year'])
    number = str(result['number'])
    highlights = results.highlighting
    print("Saw {0} result(s).".format(len(results)))
    print('<p>' + str(resultcounter) + '. <b>År:</b> ' + year + ', <b>Nummer\
: </b>' + number +' ,<b>Fulltext:</b> ' + fulltexturl + '. <b>\
</b> träffar.<br></p>')
    inSOUresults = 1
    for idnumber, h in highlights.items():
        for key, value in h.items():
            for v in value:
                print('<p>' + str(inSOUresults) + ". " + v + "</p>")
                inSOUresults += 1
What am I doing wrong?
You probably want a very large (or 0) value for the hl.fragsize parameter (from the Highlighting wiki page):
With the original Highlighter, if you have a use case where you need to highlight the complete text of a field and need to highlight every instance of the search term(s) you can set hl.fragsize to a very high value (whatever it takes to include all the text for the largest value for that field), for example &hl.fragsize=50000.
However, if you want to change fragsize to a value greater than 51200 to return long document texts with highlighting, you will need to pass the same value to hl.maxAnalyzedChars parameter too. These two parameters go hand in hand and changing just the hl.fragsize would not be sufficient for highlighting in very large fields.
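As a rough sketch of the advice above, the two parameters can be raised together in the dictionary that is passed to the search call. The value of MAX_CHARS is only an illustrative assumption, and snippets_for is a hypothetical helper that assumes the highlighting section is a dictionary mapping each document id to a dictionary of field names and snippet lists. The sample data is made up so that the block runs without a Solr server.

```python
# Illustrative ceiling, assumed to exceed the longest 'fulltext' value in the index.
MAX_CHARS = 1_000_000

# Same keys as in the question; fragsize and maxAnalyzedChars are raised together.
highlight_params = {
    'hl': 'true',
    'hl.fl': 'fulltext',
    'hl.snippets': 100,
    'hl.fragsize': MAX_CHARS,
    'hl.maxAnalyzedChars': MAX_CHARS,
}


def snippets_for(highlighting, doc_id, field='fulltext'):
    """Return the highlighted snippets of one document (empty list if none)."""
    return highlighting.get(doc_id, {}).get(field, [])


# Made-up highlighting section: document id -> field name -> list of snippets.
sample_highlighting = {
    'doc-1': {'fulltext': ['first <em>hit</em>', 'second <em>hit</em>']},
    'doc-2': {},
}

for doc_id in sample_highlighting:
    snippets = snippets_for(sample_highlighting, doc_id)
    for number, snippet in enumerate(snippets, start=1):
        print(doc_id, number, snippet)
```

Looking up only the current document's entry, rather than looping over every entry of the highlighting section for each result, would keep each document's snippets under its own heading. That assumes the unique key field is available on each result, which the question does not show.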
There are 6 test cases for a string operations problem in Python 3. 5 of them pass, but 1 test case has been failing from the start.
Please help me out. The question is as follows:
8 Strings are given in a function.
Remove spaces from both end of strings: first, second, parent, city
Capitalize : first, second, parent
Print Strings with a space : first, second, parent, city
Check if string : 'phone' only contains digits
Check if phone number starts with value in string 'start' and print the result(True or False)
Print : total no. of times 'strfind' appears in the strings : first, second, parent, city
Print : list generated by using split function on 'string1'
Find position of 'strfind' in 'city'
My code is as follows. Let me know what I have done wrong. 5 of the 6 test cases pass; only 1 fails, for an unknown reason. :(
def resume(first, second, parent, city, phone, start, strfind, string1):
    first = first.strip()
    second = second.strip()
    parent = parent.strip()
    city = city.strip()
    first = first.capitalize()
    second = second.capitalize()
    parent = parent.capitalize()
    print(first + " " + second + " " + parent + " " +city)
    print(phone.isdigit())
    print(phone[0]==start[0])
    res = first + second + parent + city
    res_count = res.count(strfind)
    print(res_count)
    print(string1.split())
    print(city.find(strfind))
Not too sure without being given details on the test case. However, number 5 may be incorrect as you are only checking if the first values of the strings are the same. This is not the same as checking if "phone number starts with value in string 'start'". I recommend using the following code instead:
print(phone.startswith(start))
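To see why the two checks differ, here is a small sketch with made-up values for phone and start. Comparing only the first characters reports a match as soon as the first digit agrees, while startswith compares the whole prefix.

```python
# Hypothetical inputs: 'start' has two characters, only the first one matches.
phone = "9876543210"
start = "97"

first_char_only = phone[0] == start[0]   # compares '9' with '9'
whole_prefix = phone.startswith(start)   # compares '98' with '97'

print(first_char_only, whole_prefix)
```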
In addition, number 6 could produce mismatches: counting in the concatenated string also counts occurrences that span the boundary between two strings. Instead I would suggest counting in each string separately:
print(first.count(strfind) + second.count(strfind) + parent.count(strfind) + city.count(strfind))
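A small sketch of the difference, using made-up names: when the strings are joined first, a match can start at the end of one string and finish at the start of the next, so the two ways of counting can disagree.

```python
def count_concatenated(parts, needle):
    """Count occurrences in the joined string (can match across boundaries)."""
    return "".join(parts).count(needle)


def count_separately(parts, needle):
    """Count occurrences inside each string on its own and add them up."""
    return sum(part.count(needle) for part in parts)


# Hypothetical values: "Anna" + "Nash" contains "aN" only across the boundary.
parts = ["Anna", "Nash"]
needle = "aN"

print(count_concatenated(parts, needle))
print(count_separately(parts, needle))
```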
first = first.strip()
second = second.strip()
parent = parent.strip()
city = city.strip()
first = first.capitalize()
second = second.capitalize()
parent = parent.capitalize()
print(first + " " + second + " " + parent + " " +city)
print(phone.isnumeric())
print(phone.startswith(start))
res = first + second + parent + city
res_count = res.count(strfind)
print(res_count)
print(string1.split())
print(city.find(strfind))
I started learning python two days ago. Today I built a web scraping script which pulls data from yahoo finance and puts it in a csv file. The problem I have is that some values are string because yahoo finance displays them as such.
For example: Revenue: 806.43M
When I copy them into the csv I can't use them for calculations, so I was wondering whether it is possible to separate the "806.43" and the "M", keeping both so the unit of the number stays visible, and put them in two different columns.
To write the csv file I use this command:
f.write(revenue + "," + revenue_value + "\n")
where:
print(revenue)
Revenue (ttm)
print(revenue_value)
806.43M
so in the end I should be able to use a command which looks something like this
f.write(revenue + "," + revenue_value + "," + revenue_unit + "\n")
where revenue_value is 806.43 and revenue_unit is M
Hope someone could help with the problem.
I believe the easiest way is to take the number as a string and convert it to a float based on the unit at the end of the string.
The following should do the trick:
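def parse_number(number_str) -> float:
    # Multipliers for the unit suffixes handled here.
    mapping = {
        "K": 1000,
        "M": 1000000,
        "B": 1000000000
    }
    unit = number_str[-1]
    if unit not in mapping:
        # No recognised suffix: treat the whole string as the number.
        return float(number_str)
    number_float = float(number_str[:-1])
    return number_float * mapping[unit]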
And here's an example:
my_number = "806.43M"
print(parse_number(my_number))
>>> 806430000.0
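If the goal is to keep the value and the unit in two separate columns, as the question asks, a minimal sketch could strip the trailing unit letters and write both parts with the csv module. The helper name split_value_unit and the set of suffixes ("K", "M", "B", "T") are assumptions, and the row is written to an in-memory buffer here instead of the real file f.

```python
import csv
import io


def split_value_unit(text, suffixes="KMBT"):
    """Split e.g. '806.43M' into ('806.43', 'M'); the unit is '' if absent."""
    text = text.strip()
    number = text.rstrip(suffixes)   # drop trailing unit letters
    unit = text[len(number):]        # whatever was dropped is the unit
    return number, unit


revenue = "Revenue (ttm)"
revenue_value, revenue_unit = split_value_unit("806.43M")

# In-memory stand-in for the csv file opened as f in the question.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow([revenue, revenue_value, revenue_unit])
csv_line = buffer.getvalue()
print(csv_line, end="")
```

Using csv.writer instead of joining with "," by hand means that a value containing a comma would be quoted automatically.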
You can always try regular expressions.
Here's a pretty good online tool to let you practice using Python-specific standards.
import re
sample = "Revenue (ttm): 806.43M"
# Note: the `(?P<name here>)` section is a named group. That way we can identify what we want to capture.
financials_pattern = r'''
(?P<category>.+?):?\s+? # Capture everything up until the colon
(?P<value>[\d\.]+) # Capture only numeric values and decimal points
(?P<unit>[\w]*)? # Capture a trailing unit type (M, MM, etc.)
'''
# Flags:
# re.I -> Ignore character case (upper vs lower)
# re.X -> Allows for 'verbose' pattern construction, as seen above
res = re.search(financials_pattern, sample, flags = re.I | re.X)
Print our dictionary of values:
res.groupdict()
Output:
{'category': 'Revenue (ttm)',
'value': '806.43',
'unit': 'M'}
We can also use .groups() to list results in a tuple.
res.groups()
Output:
('Revenue (ttm)', '806.43', 'M')
In this case, we'll immediately unpack those results into your variable names.
revenue = None # If this is None after trying to set it, don't print anything.
revenue, revenue_value, revenue_unit = res.groups()
We'll use fancy f-strings to print out both your f.write() call along with the results we've captured.
if revenue:
print(f'f.write(revenue + "," + revenue_value + "," + revenue_unit + "\\n")\n')
print(f'f.write("{revenue}" + "," + "{revenue_value}" + "," + "{revenue_unit}" + "\\n")')
Output:
f.write(revenue + "," + revenue_value + "," + revenue_unit + "\n")
f.write("Revenue (ttm)" + "," + "806.43" + "," + "M" + "\n")
I wrote code to append a json response into a list for some API work I am doing, but it stores the single quotes around the alphanumerical value I desire. I would like to get rid of the single quotes. Here is what I have so far:
i = 0
deviceID = []
while i < deviceCount:
    deviceID.append(devicesRanOn['resources'][i])
    deviceID[i] = re.sub('[\W_]', '', deviceID[i])
    i += 1
    if i >= deviceCount:
        break
if (deviceCount == 1):
    print ('Device ID: ', deviceID)
elif (deviceCount > 1):
    print ('Device IDs: ', deviceID)
The input device IDs look like this:
['14*************************00b29', '58*************************c3df4']
Output:
['14*************************00b29', '58*************************c3df4']
Desired Output:
[14*************************00b29, 58*************************c3df4]
As you can see, I am trying to use a regex to match non-alphanumeric characters and replace them with nothing. It is not giving me an error, nor is it performing the action I am looking for. Does anyone have a recommendation on how to fix this?
Thank you,
xOm3ga
You won't be able to use the default print, because printing a list shows the repr of each element, and the repr of a string includes the quotes. You'll need to build your own representation of the list, which is easy with string formatting.
'[' + ', '.join(f'{id!s}' for id in ids) + ']'
The f'{id!s}' is an f-string which formats the variable id using its __str__ method. If you're on a version before 3.6, which doesn't have f-strings, you can also use
'%s' % id
'{!s}'.format(id)
PS:
You can simplify your code significantly by using a list comprehension and custom formatting instead of regexes.
ids = [device for device in devicesRanOn['resources'][:deviceCount]]
# Singular label for exactly one device, plural otherwise.
label = 'Device ID:' if deviceCount == 1 else 'Device IDs:'
print(label, '[' + ', '.join(f'{id!s}' for id in ids) + ']')
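The following sketch shows that the quotes are never stored in the list; they only appear when the list itself is printed. The sample IDs and the helper name format_ids are made up for illustration.

```python
def format_ids(ids):
    """Build a list-like string without the quotes that repr() adds."""
    return '[' + ', '.join(str(device_id) for device_id in ids) + ']'


# Made-up IDs standing in for the values returned by the API.
device_ids = ['14abc00b29', '58defc3df4']

print(device_ids)              # default list printing shows quotes
print(format_ids(device_ids))  # custom formatting leaves them out
print(device_ids[0])           # a single string prints without quotes
```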
I'm trying to display values in HTML with a "$" at the beginning, but because of the way I justify the values when printing them, I can only get the "$" at the end of the previous value or at the end of the value itself.
I think I have to incorporate the "$" into the for loop somehow, but I'm not sure how to do that.
BODY['html'] += '<br>Total shipped this month:..............Orders........Qty...........Value<br>'
SQL5 = '''
select count(*) as CNT, sum(A.USER_SHIPPED_QTY) as QTY, sum(( A.USER_SHIPPED_QTY) * A.UNIT_PRICE) as VALUE
from SHIPPER_LINE A, SHIPPER B
where B.PACKLIST_ID = A.PACKLIST_ID
and A.CUST_ORDER_ID like ('CO%')
and B.SHIPPED_DATE between ('{}') and ('{}')
'''.format(RP.get_first_of_cur_month_ora(), RP.get_rep_date_ora())
## {} and .format get around the issue of using %s with CO%
print SQL5
curs.execute(SQL5)
for line in curs: ##used to print database lines in HTML
    print line
    i=0
    for c in line:
        if i==0:
            BODY['html'] += '<pre>' + str(c).rjust(60,' ')
        elif i == 1:
            BODY['html'] += str(c).rjust(15,' ')
        else:
            BODY['html'] += str(c).rjust(22,' ') + '</pre>'
        i+=1
The "pre" in HTML is used to keep the whitespace and the ' ' after rjust is used to space the numbers properly to go under the column headings. The values that are printed out are generated from the database using the SQL.
Here is what displays in HTML for this code:
Total shipped this month:..............Orders........Qty...........Value
3968 16996 1153525.96
This is what I want it to look like:
Total shipped this month:..............Orders........Qty...........Value
3968 16996 $1153525.96
You could apply the format in the DB by wrapping your sum with a to_char and a currency/numeric format model ...
select to_char(12345.67, 'FML999,999.99') FROM DUAL;
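If the formatting should stay on the Python side instead, one possible sketch is to prepend the "$" to the value before right-justifying it, so the padding ends up in front of the sign. The helper name format_row is hypothetical, the column widths are taken from the question, and the block is written for Python 3.

```python
def format_row(count, qty, value):
    """Right-justify the three columns; the '$' is joined to the value first."""
    return (str(count).rjust(60, ' ')
            + str(qty).rjust(15, ' ')
            + ('$' + str(value)).rjust(22, ' '))


# Values from the output shown in the question.
row = format_row(3968, 16996, 1153525.96)
print('<pre>' + row + '</pre>')
```

In the question's loop this would correspond to changing only the last branch to ('$' + str(c)).rjust(22,' ').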
I recently acquired a trial version of some source code to check MISRA compliance before purchasing. I have run pc-lint over the C code to verify compliance, and got a huge number of violations as output. I wanted to tidy up the generated html so that I can sort the violations by rule. I tried googling for an existing tool that does this, with little success, so instead I began writing a python script...
In short, the script iterates through every line of the html output multiple times in order to check for a particular string. Of course this takes a ridiculously long time to execute. I have been unable to find an elegant solution, but I'm hoping I'm missing something obvious that someone could point out... Otherwise, perhaps another language that executes faster would be more appropriate. Cheers!
#!/usr/bin/env python
import re
rule_search = re.compile("Required Rule (.*?),",re.DOTALL|re.M)
rule_search2 = re.compile("MISRA 2004 Rule (.*?)]",re.DOTALL|re.M)
line_search = re.compile("<br>(.*?)<br>",re.DOTALL|re.M)
data=open('lint-all.html').read()
unique_rules = list(set(rule_search.findall(data)))
unique_rules2 = list(set(rule_search2.findall(data)))
MISRA_Rules = unique_rules + unique_rules2
count = [0] * len(MISRA_Rules)
page_lines = {}
pages = {}
counts = open("pages/counts.html",'w')
counts.write("<h2>Violated Rules Count</h2><h3><ol>")
counts.close()
for i in range (len(MISRA_Rules)):
    pages[i] = open("pages/" + str(MISRA_Rules[i]).translate(None, '.') + ".html", 'w')
    pages[i].close()
    counts = open("pages/counts.html",'a+')
    counts.write("<a href=" + str(MISRA_Rules[i]).translate(None, '.') + ".html>" + str(MISRA_Rules[i]) + "</a>: <font size='3'> 0 </font> " )
    if i%4 == 0 and i != 0:
        counts.write("<br />")
counts.write("<br /><a href=sorted.html>Total:</a> " + "<font size='3'>" + str(count) + "</font>")
counts.write("</h3>")
for i in range (len(MISRA_Rules)):
    pages[i] = open("pages/" + str(MISRA_Rules[i]).translate(None, '.') + ".html", 'a+')
    pages[i].write("<h1>MISRA Rule " + str(MISRA_Rules[i]) + "</h1>")
    pages[i].write("""<link rel="import" href="counts.html">""")
    for j in range (len(line_search.findall(data))):
        if "Rule " + str(MISRA_Rules[i]) in line_search.findall(data)[j]:
            count[i] += 1
            pages[i].write("<br>")
            pages[i].write(line_search.findall(data)[j])
            pages[i].write("</br>")
print "out"
new_html = open('pages/sorted.html', 'w')
counts = """<h2>Violated Rules Count</h2><h3><ol>"""
for i in range (len(MISRA_Rules)):
    counts += """""" + str(MISRA_Rules[i]) + """: <font size="3">""" + str(count[i]) + """</font> """
    if i%4 == 0 and i != 0:
        counts += """<br />"""
counts += """<br /><a href=sorted.html>Total:</a> """ + """<font size="3">""" + str(count) + """</font>"""
counts += """</h3>"""
counts.close()
new_html.write(counts)
new_html.write(data)
new_html.close()
Several approaches are possible.
The first is to optimize the existing code. It's difficult to say what's wrong with it just by reading it. In this case one goes to the cProfile docs and sets up a profiler, which will show where the bottlenecks are.
The second approach (the preferable one, in my opinion) is to parse the data in Python but leave the HTML generation to specialized tools, such as the jinja2 template engine, which is used extensively in web development. A simpler alternative to jinja2 is mustache, which most likely won't require any installation.
The third approach is to do all of this in the browser. Add jQuery for DOM manipulation (to introduce new tags and classes) and a css stylesheet (to determine how the new tags and classes should look).
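For the first approach, one likely cost in the question's script is that line_search.findall(data) is called again on every pass of the inner loop, so the whole file is rescanned for each line and each rule. A minimal sketch, assuming the same "<br>...<br>" line pattern and the same "Rule " + rule substring test as the question, is to scan the file once and group the lines by rule in memory. The function name group_lines_by_rule and the sample data are made up, and the block is written for Python 3.

```python
import re
from collections import defaultdict

# Same line pattern as in the question's script.
LINE_PATTERN = re.compile(r"<br>(.*?)<br>", re.DOTALL)


def group_lines_by_rule(data, rules):
    """Scan the document once, then collect the lines that mention each rule."""
    lines = LINE_PATTERN.findall(data)   # the expensive scan happens only here
    grouped = defaultdict(list)
    for line in lines:
        for rule in rules:
            if "Rule " + rule in line:
                grouped[rule].append(line)
    return grouped


# Made-up lint output, shaped so that the pattern above matches each message.
sample = ("<br>a.c 3 Required Rule 10.1, implicit conversion<br>"
          "<br>b.c 7 [MISRA 2004 Rule 12.3] sizeof side effect<br>"
          "<br>c.c 9 Required Rule 10.1, implicit conversion<br>")

grouped = group_lines_by_rule(sample, ["10.1", "12.3"])
for rule, lines in sorted(grouped.items()):
    print(rule, len(lines))
```

Each per-rule page and the counts page can then be written in one go from the grouped dictionary, since len(grouped[rule]) is the violation count for that rule.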