I got a string output from a CSS selector while web crawling. The string has 7 lines, 6 of them are useless, and I only want the 4th line.
The string is as follows:
کارکرد:
۵۰,۰۰۰
رنگ:
سفید
وضعیت بدنه:
بدون رنگ
قیمت صفر : ۳۱۵,۰۰۰,۰۰۰ تومان
Is there a way to print only the 4th line?
The crawling code:
color = driver.find_elements_by_css_selector("div[class='col']")
for c in color:
    print(c.text)
Yes, of course! See the Python documentation on lists and indexing.
color = driver.find_elements_by_css_selector("div[class='col']")
print(color[3].text)
List items are indexed: the first item has index 0, the second has index 1, and so on.
I'm not sure if I understand the problem correctly, but assuming you have a string with multiple lines, a solution could be:
string = '''this string
exists
on multiple
lines
so lets pick
a line
'''
def select_line(string, line_index):
    return string.splitlines()[line_index]
result = select_line(string,3)
print(result)
This function selects the line number you want (index 0 being the first line).
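If the whole seven-line block comes from a single element (an assumption; adjust the element index to match your page), the two ideas can be combined like this:
# Pick the 4th line (index 3) of the first matched element's text
fourth_line = color[0].text.splitlines()[3]
print(fourth_line)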
If you want items one to four, try this:
for idx in range(4):
    print(color[idx].text)
And if you want only the 4th, try this (in Python, list indexes start from zero):
print(color[3].text)
The 4 URLs below contain the letter 's'. We need to remove this letter and print the 4 URLs, but the problem is that I got only the last website printed, not all 4 sites.
Note: the language used is Python.
file1 = ['https:/www.google.com\n', 'https:/www.yahoo.com\n', 'https:/www.stackoverflow.com\n',
'https:/www.pythonhow.com\n']
file1_remove_s = []
for line in file1:
    file1_remove_s = line.replace('s','',1)
print(file1_remove_s)
You are reassigning file1_remove_s from a list object to a modified list element. You want to use append instead:
file1 = ['https:/www.google.com\n', 'https:/www.yahoo.com\n', 'https:/www.stackoverflow.com\n',
'https:/www.pythonhow.com\n']
file1_remove_s = []
for line in file1:
    file1_remove_s.append(line.replace('s','',1))
print(file1_remove_s)
You are keeping only the last item of the list by using the = operator. This is actually a perfect place to use a list comprehension, so your code should look like:
file1 = [file1_remove_s.replace('s','',1) for file1_remove_s in file1]
This builds a new list of the formatted strings (each with the first "s" removed) and, by assigning it to the name of the initial list, overwrites the initial list with the new one containing the texts in the format you want.
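For reference, either approach should produce the same list; a quick check (output shown as a comment):
file1 = ['https:/www.google.com\n', 'https:/www.yahoo.com\n', 'https:/www.stackoverflow.com\n',
         'https:/www.pythonhow.com\n']
print([line.replace('s', '', 1) for line in file1])
# ['http:/www.google.com\n', 'http:/www.yahoo.com\n', 'http:/www.stackoverflow.com\n', 'http:/www.pythonhow.com\n']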
I have some files that have "TITLE..." then have "JOURNAL..." followed directly afterward. The specific lines are varied and are not static per file. I am trying to pull all of the information that exists between "...TITLE..." and "...JOURNAL...". So far, I am able to only pull the line that contains "TITLE", but for some files, that spills onto the next line.
I deduced that I must use a = line.find("TITLE") and b = line.find("JOURNAL"),
then set up a for loop, for i in range(a, b):, which gives all the character index values from 698 to 768, but it only displays the numbers instead of the string. How do I display the string? And how do I then clean it up so it doesn't show "TITLE", "JOURNAL", or the whitespace between those two and the text I need? Thanks!
This is the one that displays the single line that "TITLE" exists on:
def extract_title():
    f=open("GenBank1.gb","r")
    line=f.readline()
    while line:
        line=f.readline()
        if "TITLE" in line:
            line.strip("TITLE ")
            print(line)
    f.close()
extract_title()
This is the current block that displays all of those numbers in increasing order on separate lines:
def extract_title():
    f=open("GenBank1.gb","r")
    line=f.read()
    a=line.find("TITLE")
    b=line.find("JOURNAL")
    line.strip()
    f.close()
    if "TITLE" in line and "JOURNAL" in line:
        for i in range(a,b):
            print(i)
extract_title()
Currently, I have the numbers from 698 to 768 displayed like:
698
699
700
etc...
I want to first get them on one line, like 698 699 700, then convert them to their string values, and then I want to understand how to strip the whitespace and the "TITLE" and "JOURNAL" labels. Thanks!
I am not sure if I get what you want to achieve here, but if I understood correctly, you have a string similar to "TITLE 659 JOURNAL" and want to get the value in the middle? If so, you could use slicing notation like this:
line = f.read()
a = line.find("TITLE") + 5  # find() gives the start index, so we add the length of "TITLE"
b = line.find("JOURNAL")
value = line[a:b]
value = value.strip() # Strip whitespace
If we now return value or print it out, we get:
'659'
Similarly, if you want to get the value after JOURNAL, you could use slicing notation again:
idx = line.find("JOURNAL") + 7
value = line[idx:] # Start after JOURNAL till end of string
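Putting that back into the original function, a minimal sketch (assuming the file contains a single TITLE/JOURNAL pair; multiple records would need repeated find() calls):
def extract_title():
    with open("GenBank1.gb", "r") as f:
        text = f.read()
    start = text.find("TITLE") + len("TITLE")
    end = text.find("JOURNAL")
    # Slice out everything between the two keywords and trim the whitespace
    return text[start:end].strip()

print(extract_title())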
You don't need the loop, just use slicing:
line = 'fooTITLEspamJOURNAL'
start = line.find('TITLE') + 5 # 5 is len('TITLE')
end = line.find('JOURNAL')
print(line[start:end])
Output
spam
Another option is to split:
print(line.split('TITLE')[1].split('JOURNAL')[0])
str.split() returns a list. We use indexes to get the element we want.
In slow motion:
part2 = line.split('TITLE')[1]
title = part2.split('JOURNAL')[0]
print(title)
Assume I have A1 as the only cell in a workbook, and it's blank.
I want my code to add "1" "2" and "3" to it so it says "1 2 3"
As of now I have:
NUMBERS = [1, 2, 3, 4, 5]
ThisSheet.Cells(1,1).Value = NUMBERS
This just writes the first value to the cell. I tried
ThisSheet.Cells(1,1).Value = Numbers[0-2]
but that just puts the LAST value in there. Is there a way for me to just add all of the data in there? This information will always be in String format, and I need to use Win32Com.
update:
I did
stringVar = ', '.join(str(v) for v in LIST)
UPDATE: this .join works perfectly for the NUMBERS list. Now I tried applying it to another list that looks like this:
LIST = ['Description Good\nBad', 'Description Valid\nInvalid']
If I print LIST[0], the outcome is
Description Good
Bad
Which is what I want. But if I use .join on this one, it prints
('Description Good\nBad, Description Valid\nInvalid')
So for this one, I need it to print as though I did LIST[0] and LIST[1].
So if you want to put each number in a different cell, you would do something like:
it = 1
for num in NUMBERS:
    ThisSheet.Cells(1,it).Value = num
    it += 1
Or if you want the first 3 numbers in the same cell:
ThisSheet.Cells(1,1).Value = ' '.join([str(num) for num in NUMBERS[:3]])
Or all of the elements in NUMBERS:
ThisSheet.Cells(1,1).Value = ' '.join([str(num) for num in NUMBERS])
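As a quick check of the join itself, independent of Excel (a small sketch using the NUMBERS list from the question):
NUMBERS = [1, 2, 3, 4, 5]
print(' '.join([str(num) for num in NUMBERS[:3]]))  # 1 2 3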
EDIT
Based on your question edit, for string types containing \n and assuming every time you find a newline character, you want to jump to the next row:
# Split LIST[0] on the '\n' character
splitted_lst0 = LIST[0].split('\n')

# Iterate through LIST[0] split by newlines, writing one line per row
it = 1
for line in splitted_lst0:
    ThisSheet.Cells(it, 1).Value = line
    it += 1
If you want to do this for the whole LIST and not only LIST[0], first merge it with the join method (using '\n' as the separator so the items don't run together) and split it right after:
joined_list = '\n'.join(LIST).split('\n')
And then, iterate through it the same way as we did before.
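For completeness, a minimal end-to-end sketch (assuming Excel is available through win32com; the workbook here is created fresh, which may differ from your setup):
import win32com.client

excel = win32com.client.Dispatch("Excel.Application")
excel.Visible = True
wb = excel.Workbooks.Add()
ThisSheet = wb.Worksheets(1)

LIST = ['Description Good\nBad', 'Description Valid\nInvalid']

# Join with '\n' and split again so every line becomes one item
lines = '\n'.join(LIST).split('\n')

# Write each line into the next row of column A
for row, text in enumerate(lines, start=1):
    ThisSheet.Cells(row, 1).Value = text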
Hopefully there isn't a duplicate question that I've overlooked, because I've been scouring this forum for someone who has posted a question similar to the one below...
Basically, I've created a Python script that scrapes the callsigns of each ship from the URL shown below and appends them to a list. In short, it works; however, whenever I iterate through the list and display each element, there is a '[' and ']' around each callsign. I've shown the output of my script below:
Output
*********************** Contents of 'listOfCallSigns' List ***********************
0 ['311062900']
1 ['235056239']
2 ['305500000']
3 ['311063300']
4 ['236111791']
5 ['245639000']
6 ['235077805']
7 ['235011590']
As you can see, it shows the square brackets for each callsign. I have a feeling that this might be down to an encoding problem within the BeautifulSoup library.
Ideally, I want the output to be without any of the square brackets and just the callsign as a string.
*********************** Contents of 'listOfCallSigns' List ***********************
0 311062900
1 235056239
2 305500000
3 311063300
4 236111791
5 245639000
6 235077805
7 235011590
The script I'm currently using is shown below:
My script
# Importing the modules needed to run the script
from bs4 import BeautifulSoup
import urllib2
import re
import requests
import pprint
# Declaring the url for the port of hull
url = "http://www.fleetmon.com/en/ports/Port_of_Hull_5898"
# Opening and reading the contents of the URL using the module 'urllib2'
# Scanning the entire webpage, finding a <table> tag with the id 'vessels_in_port_table' and finding all <tr> tags
portOfHull = urllib2.urlopen(url).read()
soup = BeautifulSoup(portOfHull)
table = soup.find("table", {'id': 'vessels_in_port_table'}).find_all("tr")
# Declaring a list to hold the call signs of each ship in the table
listOfCallSigns = []
# For each row in the table, using a regular expression to extract the first 9 numbers from each ship call-sign
# Adding each extracted call-sign to the 'listOfCallSigns' list
for i, row in enumerate(table):
    if i:
        listOfCallSigns.append(re.findall(r"\d{9}", str(row.find_all('td')[4])))
print "\n\n*********************** Contents of 'listOfCallSigns' List ***********************\n"
# Printing each element of the 'listOfCallSigns' list
for i, row in enumerate(listOfCallSigns):
    print i, row
Does anyone know how to remove the square brackets surrounding each callsign and just display the string?
Thanks in advance! :)
Change the last lines to:
# Printing each element of the 'listOfCallSigns' list
for i, row in enumerate(listOfCallSigns):
    print i, row[0]  # <-- added a [0] here
Alternatively, you can also add the [0] here:
for i, row in enumerate(table):
    if i:
        listOfCallSigns.append(re.findall(r"\d{9}", str(row.find_all('td')[4]))[0])  # <-- added a [0] here
The explanation here is that re.findall(...) returns a list (in your case, with a single element in it). So, listOfCallSigns ends up being a "list of sublists each containing a single string":
>>> listOfCallSigns
>>> [ ['311062900'], ['235056239'], ['311063300'], ['236111791'],
['245639000'], ['305500000'], ['235077805'], ['235011590'] ]
When you enumerate your listOfCallSigns, the row variable is basically the re.findall(...) that you appended earlier in the code (that's why you can add the [0] after either of them).
So row and re.findall(...) are both of type "list of string(s)" and look like this:
>>> row
>>> ['311062900']
And to get the string inside the list, you need access its first element, i.e.:
>>> row[0]
>>> '311062900'
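If you would rather store plain strings in the first place, the append loop can also be written as a list comprehension (a sketch of the same idea, not part of the original script):
listOfCallSigns = [re.findall(r"\d{9}", str(row.find_all('td')[4]))[0]
                   for i, row in enumerate(table) if i]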
Hope this helps!
This can also be done by stripping the unwanted characters from the string like so:
a = "string with bad characters []'] in here"
a = a.translate(None, "[]'")
print a
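If you are on Python 3 (the rest of the script here is Python 2), str.translate no longer takes a deletechars argument; a rough equivalent would be:
a = "string with bad characters []'] in here"
print(a.translate(str.maketrans('', '', "[]'")))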
I am attempting to write a Python script that shows the URL flow on my installation of nginx. I currently have my script opening my 'rewrites' file, which contains a list of regexes and locations like so:
rewritei ^/ungrad/info.cfm$ /ungrad/info/ permanent;
So what I currently have Python doing is reading the file and trimming the first and last word off (rewritei and permanent;), which just leaves a list like so:
[
    ['^/ungrad/info.cfm$', '/ungrad/info'],
    ['^/admiss/testing.cfm$', '/admiss/testing'],
    ['^/ungrad/testing/$', '/ungrad/info.cfm']
]
This results in the first element being the URL watched and the second being the URL redirected to. What I would like to do now is take each of the first elements, run the regex over the entire list, and check whether it matches any of the second elements.
With the example above, [0][0] would match [2][1].
However I am having trouble thinking of a good and efficient way to do this.
import re

a = [
    ['^/ungrad/info.cfm$', '/ungrad/info'],
    ['^/admiss/testing.cfm$', '/admiss/testing'],
    ['^/ungrad/testing/$', '/ungrad/info.cfm']
]

def matchingfun(b):
    matchedurl = []
    for reglist in a:              # iterating the [pattern, target] pairs
        if b.match(reglist[1]):    # matching the regex against the target URL
            matchedurl.append(reglist[1])
    return matchedurl

result1 = []
for reglist in a:
    b = re.compile(reglist[0])     # compile each watched-URL pattern
    result1.extend(matchingfun(b))

bs = list(set(result1))
print "matched url is", bs
This is a bit inefficient, I guess, but it works to some extent. Hope this answers your query. The above snippet prints the URLs (the second items) that are matched by any of the regexes in the list.
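A more direct sketch of the same idea (assuming the list a from above), collecting (pattern, redirect target) pairs whenever a watched-URL regex matches one of the redirect targets:
import re

matches = []
for pattern, _target in a:
    regex = re.compile(pattern)
    for _other_pattern, destination in a:
        if regex.match(destination):
            matches.append((pattern, destination))

print(matches)  # [('^/ungrad/info.cfm$', '/ungrad/info.cfm')]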