How to split text with no fixed whitespace - python

I have the text below (a response from a Zebra printer):
30.0 DARKNESS
4 IPS PRINT SPEED
+000 TEAR OFF
TEAR OFF PRINT MODE
GAP/NOTCH MEDIA TYPE
WEB SENSOR TYPE
MANUAL SENSOR SELECT
THERMAL-TRANS. PRINT METHOD
480 PRINT WIDTH
0387 LABEL LENGTH
39.0IN 975MM MAXIMUM LENGTH
CONNECTED USB COMM.
BIDIRECTIONAL PARALLEL COMM.
9600 BAUD
8 BITS DATA BITS
NONE PARITY
DTR & XON/XOFF HOST HANDSHAKE
NONE PROTOCOL
AUTO SER COMM. MODE
<~> 7EH CONTROL CHAR
<^> 5EH COMMAND CHAR
<,> 2CH DELIM. CHAR
ZPL II ZPL MODE
NO MOTION MEDIA POWER UP
FEED
I want to get the value for each setting via Python.
I expect to get something like a dict: {'DARKNESS': 30, 'PRINT SPEED': '4 IPS', ...}
Normally, the code I would expect is:
for line in lines:
    x = line.split(' ')
    the_value = x[0]
    the_setting = x[1]
but the amount of whitespace between the columns is not fixed, and the values can contain spaces themselves, so a plain split() isn't a good choice here.
I'm stuck. Any ideas?

Using my suggestion in combination with yours, I got this to work (I made a .txt file with your example output in it):
import re

my_dict = {}
with open('untitled.txt', 'r') as file:
    for line in file:
        parts = re.split(r'\s{4,}', line.strip())
        if len(parts) == 2:  # guard against value-only lines such as 'FEED'
            value, setting = parts
            my_dict[setting] = value
This is the output of the dictionary I made with this code:
{'DARKNESS': '30.0', 'PRINT SPEED': '4 IPS', 'TEAR OFF': '+000', 'PRINT MODE': 'TEAR OFF', 'MEDIA TYPE': 'GAP/NOTCH', 'SENSOR TYPE': 'WEB', 'SENSOR SELECT': 'MANUAL', 'PRINT METHOD': 'THERMAL-TRANS.', 'PRINT WIDTH': '480', 'LABEL LENGTH': '0387', 'MAXIMUM LENGTH': '39.0IN 975MM', 'USB COMM.': 'CONNECTED', 'PARALLEL COMM.': 'BIDIRECTIONAL', 'BAUD': '9600', 'DATA BITS': '8 BITS', 'PARITY': 'NONE', 'HOST HANDSHAKE': 'DTR & XON/XOFF', 'PROTOCOL': 'NONE', 'SER COMM. MODE': 'AUTO', 'CONTROL CHAR': '<~> 7EH', 'COMMAND CHAR': '<^> 5EH', 'DELIM. CHAR': '<,> 2CH', 'ZPL MODE': 'ZPL II', 'MEDIA POWER UP': 'NO MOTION'}

Well, you can do the following:
file = open('yourfile', 'r').read().split('\n')
lines = [line.split('  ') for line in file]  # split on double spaces; longer gaps just produce empty strings
items = [[i.replace(' ', '') for i in item if i != ''] for item in lines]
output_dict = {i[0]: i[1] for i in items if len(i) >= 2}  # skip empty and value-only lines
I used three main features of Python here: the one-line loop (list comprehension)
loop = [dosomething(item) for item in array if item == 'somevalue']  # the if clause is optional
the replace() function
print('Hello You'.replace('You', 'world'))  # outputs "Hello world"
and the split() function
print('hello,world'.split(','))  # outputs ['hello', 'world']
You can find more documentation here: Python string methods

Thanks @TheDetective, your answer is useful.
It works better now.
(Comments have a character limit, so I have to post this as an answer.)
>>> for line in lines:
...     re.split(r'\s{4,}', line.strip())
...
['\x02 30.0', 'DARKNESS']
['4 IPS', 'PRINT SPEED']
['+000', 'TEAR OFF']
['TEAR OFF', 'PRINT MODE']
['GAP/NOTCH', 'MEDIA TYPE']
['WEB', 'SENSOR TYPE']
['MANUAL', 'SENSOR SELECT']
['THERMAL-TRANS.', 'PRINT METHOD']
['480', 'PRINT WIDTH']
['0387', 'LABEL LENGTH']
['39.0IN 975MM', 'MAXIMUM LENGTH']
['CONNECTED', 'USB COMM.']
['BIDIRECTIONAL', 'PARALLEL COMM.']
['9600', 'BAUD']
['8 BITS', 'DATA BITS']
['NONE', 'PARITY']
['DTR & XON/XOFF', 'HOST HANDSHAKE']
['NONE', 'PROTOCOL']
['AUTO', 'SER COMM. MODE']
['<~> 7EH', 'CONTROL CHAR']
['<^> 5EH', 'COMMAND CHAR']
['<,> 2CH', 'DELIM. CHAR']
['ZPL II', 'ZPL MODE']
['NO MOTION', 'MEDIA POWER UP']
['FEED']
>>>

Since I've already done it, you might as well have my answer too.
The two items of information occupy fixed places on each line, so string slicing can be used to pick them out. I omit the last line because it carries no field name.
>>> result = {}
>>> with open('temp.txt') as temp:
...     for line in temp.readlines():
...         if line.startswith('FEED'):
...             break
...         result[line[20:].strip()] = line[:20].strip()
...
>>> result
{'DARKNESS': '30.0', 'PARITY': 'NONE', 'PRINT WIDTH': '480', 'DATA BITS': '8 BITS', 'PROTOCOL': 'NONE', 'COMMAND CHAR': '<^> 5EH', 'USB COMM.': 'CONNECTED', 'BAUD': '9600', 'PRINT MODE': 'TEAR OFF', 'MEDIA POWER UP': 'NO MOTION', 'DELIM. CHAR': '<,> 2CH', 'MAXIMUM LENGTH': '39.0IN 975MM', 'SENSOR SELECT': 'MANUAL', 'SENSOR TYPE': 'WEB', 'LABEL LENGTH': '0387', 'PARALLEL COMM.': 'BIDIRECTIONAL', 'CONTROL CHAR': '<~> 7EH', 'TEAR OFF': '+000', 'PRINT SPEED': '4 IPS', 'PRINT METHOD': 'THERMAL-TRANS.', 'HOST HANDSHAKE': 'DTR & XON/XOFF', 'ZPL MODE': 'ZPL II', 'MEDIA TYPE': 'GAP/NOTCH', 'SER COMM. MODE': 'AUTO'}

Use the Python split function:
https://www.tutorialspoint.com/python/string_split.htm
You can iterate over the lines using split('\n'),
and then you can use a regex to split the rest.
The accepted answer only splits when the whitespace between the value and the setting name is 4 characters or more, which can cause bugs when the gap is smaller. My solution avoids this by relying on the fixed column width instead.
import re

settings = {}  # renamed from 'dict' so the built-in isn't shadowed
for line in input.split('\n'):
    # Split the line into the value (first 20 characters) and the setting name
    my_array = re.findall('(^.{20})(.*)', line.strip())
    # Check that you have found both key and value
    if len(my_array) > 0:
        my_tuple = my_array[0]
        settings[my_tuple[1].rstrip()] = my_tuple[0].rstrip()
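Putting the thread together, here is a minimal end-to-end sketch (not from any single answer above). It combines the whitespace-run split with a guard for value-only lines such as FEED; the sample string and its column widths are approximated from the question:
import re

# Approximation of the raw printer output from the question
zebra_response = (
    "  30.0                DARKNESS\n"
    "  4 IPS               PRINT SPEED\n"
    "  DTR & XON/XOFF      HOST HANDSHAKE\n"
    "  FEED\n"
)

settings = {}
for line in zebra_response.splitlines():
    parts = re.split(r'\s{2,}', line.strip())  # split on runs of 2+ spaces
    if len(parts) == 2:  # skip value-only lines such as 'FEED'
        value, name = parts
        settings[name] = value

print(settings)
# {'DARKNESS': '30.0', 'PRINT SPEED': '4 IPS', 'HOST HANDSHAKE': 'DTR & XON/XOFF'}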

Related

Making a Python executable for a specific task

I'm a neuroscientist and thus not very skilled in Python, but I have managed to come up with code that uses API access to download certain neurons from a specific website (neuromorpho.org). I want this to be publicly available so that other people who are not that familiar with Python can easily get what they need (I plan on posting it to GitHub and making other similar tools).
So I basically wanted to create an executable file in which people can select what they want and get a .csv file with neurons at the end. This works perfectly from inside the Jupyter Notebook. However, when I use Auto Py to EXE to create an executable, it doesn't work: it runs for a long time, creates thousands of files (more than 1 GB of data), and when you launch the executable nothing happens.
I presume it has something to do with the ipywidgets that I used to create the selections for the initial query.
Here is the first part of the code, where I try to query the neurons based on the widget selection:
widg1 = widget.Dropdown(options=['abdominal ganglion', 'accessory lobe', 'accessory olfactory bulb', 'adult subesophageal zone', 'amygdala',
                                 'antenna', 'antennal lobe', 'anterior olfactory nucleus', 'basal forbrain', 'basal ganglia',
                                 'brainstem', 'Central complex', 'Central nervous system', 'cerebellum', 'cerebral ganglion',
                                 'Cochlea', 'corpus callosum', 'cortex', 'electrosensory lobe', 'endocrine system', 'enthorinal cortex',
                                 'eye circuit', 'forebrain', 'fornix', 'ganglion', 'hippocampus', 'hypothalamus', 'lateral complex',
                                 'lateral horn', 'lateral line organ', 'left', 'Left Adult Central Complex', 'Left Mushroom Body', 'main olfactory bulb',
                                 'meninges', 'mesencephalon', 'myelencephalon', 'neocortex', 'nuchal organs', 'olfactory cortex', 'olfactory pit', 'optic lobe',
                                 'pallium', 'parasubiculum', ' peptidergic circuit', 'peripheral nervous system', 'pharyngeal nervous system', 'pons', 'Pro-subiculum',
                                 'protocerebrum', 'retina', 'retinorecipient mesencephalon and diencephalon', 'Right Adult Central Complex',
                                 'Right Mushroom Body', 'somatic nervous system', 'spinal cord', 'stomatogastric ganglion', 'subesophageal ganglion',
                                 'subesophageal zone-(SEZ)', 'subiculum', 'subpallium', 'Subventricular zone', 'thalamus', 'ventral nerve cord',
                                 'ventral striatum', 'ventral thalamus', 'ventrolateral neuropils', 'Not reported'],
                        value='cerebellum', description='Brain Region:')
display(widg1)
widg2 = widget.Dropdown(options=['African wild dog', 'agouti', 'Apis mellifera', 'Aplysia', 'Axolotl', 'Baboon',
                                 'Blind mole-rat', 'blowfly', 'Blue wildebeest', 'Bonobo', 'bottlenose dolphin', 'C. elegans',
                                 'Calango lizard', 'capuchin monkey', 'Caracal', 'cat', 'cheetah', 'chicken', 'chimpanzee', 'Clam worm', 'clouded leopard', 'Crab', 'cricket',
                                 'Crisia eburnea', 'Domestic dog', 'domestic pig', 'dragonfly', 'drosophila melanogaster', 'drosophila sechellia',
                                 'elephant', 'ferret', 'giraffe', 'goldfish', 'grasshopper', 'Greater kudu', 'guinea pig', 'Hamster', 'human', 'humpback whale',
                                 'Lemur', 'leopard', 'Lion', 'locust', 'manatee', 'minke whale', 'Mongoose', 'monkey', 'Mormyrid fish', 'moth',
                                 'mouse', 'pouched lamprey', 'Praying mantis (Hierodula membranacea)',
                                 'proechimys', 'rabbit', 'Rana esculenta', 'Ranitomeya imitator', 'rat', 'Rhinella arenarum', 'Ruddy turnstone', 'salamander',
                                 'Scinax granulatus', 'Sea lamprey', 'Semipalmated plover', 'Semipalmated sandpiper', 'sheep', 'Silkmoth', 'spiny lobster', 'Stellers Sculpin',
                                 'Tiger', 'Toadfish', 'Treeshrew', 'turtle', 'Wallaby', 'Xenopus laevis', 'Xenopus tropicalis', 'Zebra', 'zebra finch', 'zebrafish', 'Not reported'],
                        value='mouse', description='Animal:')
display(widg2)
widg3 = widget.Dropdown(options=['Glia', 'interneuron', 'principal cell', 'sensory', 'Not reported'],
                        value='principal cell', description='Cell Type:')
display(widg3)
str1 = widg1.value
str2 = widg2.value
str3 = widg3.value
query = ("http://neuromorpho.org/api/neuron/select?q=brain_region:%s&fq=species:%s&fq=cell_type:%s"
         % (str1, str2, str3))
print(query)
response = requests.get(query)
json_data = response.json()
rat_data = json_data
rat_data
url = 'http://neuromorpho.org/api/neuron/select'
params = {
    'page': 0,
    'q': 'brain_region:' + widg1.value,
    'fq': [
        'cell_type:' + widg3.value,
        'species:' + widg2.value,
    ]
}
first_page_response = requests.get(url, params)
if first_page_response.status_code == 404 or first_page_response.status_code == 500:
    exit(1)
print(first_page_response.json())
totalPages = first_page_response.json()['page']['totalPages']
df_dict = {
    'NeuronID': list(),
    'Neuron Name': list(),
    'Archive': list(),
    'Note': list(),
    'Age Scale': list(),
    'Gender': list(),
    'Age Classification': list(),
    'Brain Region': list(),
    'Cell Type': list(),
    'Species': list(),
    'Strain': list(),
    'Scientific Name': list(),
    'Stain': list(),
    'Experiment Condition': list(),
    'Protocol': list(),
    'Slicing Direction': list(),
    'Reconstruction Software': list(),
    'Objective Type': list(),
    'Original Format': list(),
    'Domain': list(),
    'Attributes': list(),
    'Magnification': list(),
    'Upload Date': list(),
    'Deposition Date': list(),
    'Shrinkage Reported': list(),
    'Shrinkage Corrected': list(),
    'Reported Value': list(),
    'Reported XY': list(),
    'Reported Z': list(),
    'Corrected Value': list(),
    'Corrected XY': list(),
    'Corrected Z': list(),
    'Slicing Thickness': list(),
    'Min Age': list(),
    'Max Age': list(),
    'Min Weight': list(),
    'Max Weight': list(),
    'Png URL': list(),
    'Reference PMID': list(),
    'Reference DOI': list(),
    'Physical Integrity': list()}
for pageNum in range(totalPages):
    params['page'] = pageNum
    response = requests.get(url, params)
    print('Querying page {} -> status code: {}'.format(
        pageNum, response.status_code))
    if response.status_code == 200:  # only parse successful requests
        data = response.json()
        for row in data['_embedded']['neuronResources']:
            df_dict['NeuronID'].append(str(row['neuron_id']))
            df_dict['Neuron Name'].append(str(row['neuron_name']))
            df_dict['Archive'].append(str(row['archive']))
            df_dict['Note'].append(str(row['note']))
            df_dict['Age Scale'].append(str(row['age_scale']))
            df_dict['Gender'].append(str(row['gender']))
            df_dict['Age Classification'].append(str(row['age_classification']))
            df_dict['Brain Region'].append(str(row['brain_region']))
            df_dict['Cell Type'].append(str(row['cell_type']))
            df_dict['Species'].append(str(row['species']))
            df_dict['Strain'].append(str(row['strain']))
            df_dict['Scientific Name'].append(str(row['scientific_name']))
            df_dict['Stain'].append(str(row['stain']))
            df_dict['Experiment Condition'].append(str(row['experiment_condition']))
            df_dict['Protocol'].append(str(row['protocol']))
            df_dict['Slicing Direction'].append(str(row['slicing_direction']))
            df_dict['Reconstruction Software'].append(str(row['reconstruction_software']))
            df_dict['Objective Type'].append(str(row['objective_type']))
            df_dict['Original Format'].append(str(row['original_format']))
            df_dict['Domain'].append(str(row['domain']))
            df_dict['Attributes'].append(str(row['attributes']))
            df_dict['Magnification'].append(str(row['magnification']))
            df_dict['Upload Date'].append(str(row['upload_date']))
            df_dict['Deposition Date'].append(str(row['deposition_date']))
            df_dict['Shrinkage Reported'].append(str(row['shrinkage_reported']))
            df_dict['Shrinkage Corrected'].append(str(row['shrinkage_corrected']))
            df_dict['Reported Value'].append(str(row['reported_value']))
            df_dict['Reported XY'].append(str(row['reported_xy']))
            df_dict['Reported Z'].append(str(row['reported_z']))
            df_dict['Corrected Value'].append(str(row['corrected_value']))
            df_dict['Corrected XY'].append(str(row['corrected_xy']))
            df_dict['Corrected Z'].append(str(row['corrected_z']))
            df_dict['Slicing Thickness'].append(str(row['slicing_thickness']))
            df_dict['Min Age'].append(str(row['min_age']))
            df_dict['Max Age'].append(str(row['max_age']))
            df_dict['Min Weight'].append(str(row['min_weight']))
            df_dict['Max Weight'].append(str(row['max_weight']))
            df_dict['Png URL'].append(str(row['png_url']))
            df_dict['Reference PMID'].append(str(row['reference_pmid']))
            df_dict['Reference DOI'].append(str(row['reference_doi']))
            df_dict['Physical Integrity'].append(str(row['physical_Integrity']))
neurons_df = pd.DataFrame(df_dict)
I know that this might be confusing to somebody not familiar with the field, but I have placed some markdown cells inside the notebook to explain the problem in detail.
I recommend taking a look at PyInstaller and Nuitka. They can produce standalone executables.
Example with Nuitka:
(linux) python -m nuitka --onefile --output-dir=./nuitka --standalone --follow-imports --plugin-enable=qt-plugins ./../updater.py
(windows) python -m nuitka --onefile --windows-uac-admin --windows-disable-console --windows-icon-from-ico=.\updater\resources\ico\au.ico --output-dir=.\nuitka --standalone --follow-imports --plugin-enable=qt-plugins --windows-company-name=Name --windows-product-name=Name --windows-product-version=1.0.0 --windows-file-description=Name .\..\updater.py
Example with PyInstaller:
pyinstaller --onefile --windowed --icon=./updater/resources/ico/au.ico updater.py
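One caveat that ties into the asker's own suspicion: ipywidgets only render inside a Jupyter front end, so a frozen console executable that calls display() will show nothing. A minimal sketch of a console fallback, with a hypothetical prompt_choice helper standing in for the dropdowns (option lists truncated for brevity):
def prompt_choice(label, options, default):
    # Hypothetical stand-in for widget.Dropdown: print numbered options
    # and read the user's pick from stdin.
    print(label)
    for i, opt in enumerate(options, 1):
        print('  %d) %s' % (i, opt))
    raw = input('Choose a number (Enter for %s): ' % default).strip()
    return options[int(raw) - 1] if raw else default

str1 = prompt_choice('Brain Region:', ['cerebellum', 'cortex', 'hippocampus'], 'cerebellum')
str2 = prompt_choice('Animal:', ['mouse', 'rat', 'human'], 'mouse')
str3 = prompt_choice('Cell Type:', ['principal cell', 'interneuron', 'Glia'], 'principal cell')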

My search input function works, but it only prints the last person's information in the list of dicts

I'm kind of new to Python, and I need some help. I'm making an employee list menu. My list of dictionaries is:
person_infos = [ {'name': 'John Doe', 'age': '46', 'job position': 'Chair Builder', 'pay per hour': '14.96','date hired': '2/26/19'},
{'name': 'Phillip Waltertower', 'age': '19', 'job position': 'Sign Holder', 'pay per hour': '10','date hired': '5/9/19'},
{'name': 'Karen Johnson', 'age': '40', 'job position': 'Manager', 'pay per hour': '100','date hired': '9/10/01'},
{'name': 'Linda Bledsoe', 'age': '60', 'job position': 'CEO', 'pay per hour': '700', 'date hired': '8/24/99'},
{'name': 'Beto Aretz', 'age': '22', 'job position': 'Social Media Manager', 'pay per hour': '49','date hired': '2/18/12'}]
and my "search the list of dicts input function" is how the program is supposed to print the correct dictionary based on the name the user inputs:
def search_query(person_infos):
    if answer == '3':
        search_query = input('Who would you like to find: ')
        they_are_found = False
        location = None
        for i, each_employee in enumerate(person_infos):
            if each_employee['name'] == search_query:
                they_are_found = True
            location = i
        if they_are_found:
            print('Found: ', person_infos[location]['name'], person_infos[location]['job position'], person_infos[location]['date hired'], person_infos[location]['pay per hour'])
        else:
            print('Sorry, your search query is non-existent.')
and I also have this:
elif answer == '3':
    person_infos = search_query(person_infos)
This seems like a step in the right direction, but when I answer
search_query = input('Who would you like to find: ')
with one of the names in person_infos, like "John Doe", it just prints the last dictionary's information instead of John Doe's: no matter which name I search for, the last entry in the list is always the one printed, in this case "Beto Aretz".
Can someone please help? It's something I've been struggling with for a while, and any help would be awesome.
I've researched a lot but could not find anything that I either knew how to apply or that covered this kind of input search.
Thanks,
LR
At first glance it looks like your location = i is not indented inside your if statement, so it gets set to the latest i on every iteration of the for loop. Let me know if this helps.
def search_query(person_infos):
    if answer == '3':
        search_query = input('Who would you like to find: ')
        they_are_found = False
        location = None
        for i, each_employee in enumerate(person_infos):
            if each_employee['name'] == search_query:
                they_are_found = True
                location = i
        if they_are_found:
            print('Found: ', person_infos[location]['name'], person_infos[location]['job position'], person_infos[location]['date hired'], person_infos[location]['pay per hour'])
        else:
            print('Sorry, your search query is non-existent.')
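As a side note (not part of the fix above): once the indentation is corrected, the same search is often written with next() and a generator expression, which stops at the first match instead of scanning the whole list:
def search_query(person_infos):
    name = input('Who would you like to find: ')
    # next() returns the first employee whose name matches, or None if there is no match
    match = next((emp for emp in person_infos if emp['name'] == name), None)
    if match:
        print('Found: ', match['name'], match['job position'], match['date hired'], match['pay per hour'])
    else:
        print('Sorry, your search query is non-existent.')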

Adding symbols to JSON in Python

I get a JSON list of tags from a website:
['adventure in the comments', 'artist:s.guri', 'clothes', 'comments locked down', 'dashie slippers', 'edit', 'fractal', 'no pony', 'recursion', 'safe', 'simple background', 'slippers', 'tanks for the memories', 'the ride never ends', 'transparent background', 'vector', 'wat', 'we need to go deeper']
and I want to print it more or less like this:
#adventureinthecomments #artist:s.guri #clothes #commentslockeddown #dashie #slippers #edit #fractal #nopony #recursion
Does somebody know what method I need to use to remove all the commas and add a hashtag before each tag?
P.S. I'm using Python 3.
One way is to join the list into a single string with '#', strip all whitespace, and then replace '#' with ' #' (with a space):
arr = ['adventure in the comments', 'artist:s.guri', 'clothes', 'comments locked down', 'dashie slippers', 'edit', 'fractal', 'no pony', 'recursion', 'safe', 'simple background', 'slippers', 'tanks for the memories', 'the ride never ends', 'transparent background', 'vector', 'wat', 'we need to go deeper']
s= "#"
res = '#' + s.join(arr)
newVal = res.replace(' ','')
newNew = newVal.replace('#', ' #')
print(newNew)
What's the rule for splitting the original tags? Because the first one looks like
'adventure in the comments' = '#adventureinthecomments'
but
'comments locked down' is split into #comments #locked #down
?
If there is no rule, this could work:
>>> jsontags = ['adventure in the comments', 'artist:s.guri', 'clothes', 'comments locked down', 'dashie slippers', 'edit', 'fractal', 'no pony', 'recursion', 'safe', 'simple background', 'slippers', 'tanks for the memories', 'the ride never ends', 'transparent background', 'vector', 'wat', 'we need to go deeper']
>>> '#'+' #'.join([tag.replace(' ','') for tag in jsontags])
This will be the result
'#adventureinthecomments #artist:s.guri #clothes #commentslockeddown #dashieslippers #edit #fractal #nopony #recursion #safe #simplebackground #slippers #tanksforthememories #therideneverends #transparentbackground #vector #wat #weneedtogodeeper'

Scraping HTML data from a page that adds new tables as you scroll

I'm trying to learn HTML scraping for a project; I'm using Python and lxml. I've been successful so far in getting the data I needed, but now I have another problem. On the site that I'm scraping (op.gg), when you scroll down, it adds new tables with more information. When I run my script (below) it only gets the first 50 entries and nothing more. My question is: how can I get at least the first 200 names on the page, if that is even possible?
from lxml import html
import requests

page = requests.get('https://na.op.gg/ranking/ladder/')
tree = html.fromstring(page.content)
names = tree.xpath('//td[@class="SummonerName Cell"]/a/text()')
print(names)
Borrowing the idea from Pedro: https://na.op.gg/ranking/ajax2/ladders/start=number gives you 50 records starting from number, for example:
https://na.op.gg/ranking/ajax2/ladders/start=0 gets (1-50),
https://na.op.gg/ranking/ajax2/ladders/start=50 gets (51-100),
https://na.op.gg/ranking/ajax2/ladders/start=100 gets (101-150),
https://na.op.gg/ranking/ajax2/ladders/start=150 gets (151-200),
etc.
After that you can change your scraping code, as this page is structured differently from your original one. Supposing you want the first 200 names, here is the amended code:
from lxml import html
import requests

start_url = 'https://na.op.gg/ranking/ajax2/ladders/start='
names_200 = list()
for i in [0, 50, 100, 150]:
    dest_url = start_url + str(i)
    page = requests.get(dest_url)
    tree = html.fromstring(page.content)
    names_50 = tree.xpath('//a[not(@target) and not(@onclick)]/text()')
    names_200.extend(names_50)
print names_200
print len(names_200)
Output:
[u'am\xc3\xa9liorer', 'pireaNn', 'C9 Ray', 'P1 Pirean', 'Pobelter', 'mulgokizary', 'consensual clown', 'Jue VioIe Grace', 'Deep Learning', 'Keegun', 'Free Papa Chau', 'C9 Gun', 'Dhokla', 'Arrowlol', 'FOX Brandini', 'Jurassiq', 'Win or Learn', 'Acoldblazeolive', u'R\xc3\xa9venge', u'M\xc3\xa9ru', 'Imaqtpie', 'Rohammers', 'blaberfish2', 'qldurtms', u'd\xc3\xa0wolfsclaw', 'TheOddOrange', 'PandaTv 656826', 'stuntopolis', 'Butler Delta', 'P1 Shady', 'Entranced', u'Linsan\xc3\xadty', 'Ablazeolive', 'BukZacH', 'Anivia Kid', 'Contractz', 'Eitori', 'MistyStumpey', 'Prodedgy', 'Splitting', u'S\xc4\x99b B\xc4\x99rnal', 'N For New York', 'Naeun', '5tunt', 'C9 Winter', 'Doubtfull', 'MikeYeung', 'Rikara', u'RAH\xc3\x9cLK', ' Sudzzi', 'joong ki song', 'xWeixin VinLeous', 'rhubarbs', u'Ch\xc3\xa0se', 'XueGao', 'Erry', 'C9 EonYoung', 'Yeonbee', 'M ckg', u'Ari\xc3\xa1na Lovato', 'OmarGod', 'Wiggily', 'lmpactful', 'Str1fe', 'LL Stylish', '2017', 'FlREFLY', 'God Fist Monk', 'rWeiXin VinLeous', 'Grigne', 'fantastic ad', 'bobqinX', 'grigne 1v10', 'Sora1', 'Juuichi san ', 'duoking2', 'SandPaperX', 'Xinthus', 'TwichTv CoMMa', 'xFSN Rin', 'UBC CJ', 'PotIuck', 'DarkWingsForSale', 'Get After lt', 'old chicken', u'\xc4\x86ris', 'VK Deemo', 'Pekin Woof', 'YIlIlIlIlI', 'RiceLegend', 'Chimonaa1', 'DJNDREE5', u'CloudNguy\xc3\xa9n', 'Diamond 1 Khazix', 'dawolfsfang', 'clg imaqtpie69', 'Pyrites', 'Lava', 'Rathma', 'PieCakeLord', 'feed l0rd', 'Eygon', 'Autolycus1', 'FateFalls 20xx', 'nIsHIlEzHIlA', 'C9 Sword', 'TET Fear', 'a very bad time', u'Jur\xc3\xa1ssiq', 'Ginormous Noob', 'Saskioo', 'S D 2 NA', 'C9 Smoothie', 'dufTlalgkqtlek', 'Pants are Dragon', u'H\xc3\xb3llywood', 'Serenitty', 'Waggily ', 'never lucky help', u'insan\xc3\xadty', 'Joyul', 'TheeBrandini', 'FoTheWin', 'RyuShoryu', 'avi is me', 'iKingVex', 'PrismaI', 'An Obese Panda', 'TdollasAKATmoney', 'feud999', 'Soligo', 'Steel I', 'SNH48 Ruri', 'BillyBoss1', 'Annie Bot', 'Descraton', 'Cris', 'GrayHoves', 'RegisZZ', 'lron Pyrite', 'Zaion', 'Allorim', 't d', u'Alex \xc3\xafch', 'godrjsdnd', 'DOUBLELIFTSUCKS', 'John Mcrae', u'Lobo Solitari\xc3\xb3', 'MikeYeunglol', 'i xo u', 'NoahMost', 'Vsionz', 'GladeGleamBright', 'Tuesdayy', 'RealDarkness', 'CC Dean', 'na mid xd LFT', 'Piggy Kitten', 'Abou222', 'TG Strompest', 'MooseHater', 'Day after Day', 'bat8man', 'AxAxAxAxA', 'Boyfriend', 'EvanRL', '63FYWJMbam', 'Fiftygbl', u'Br\xc4\xb1an', 'MlST', u'S\xc3\xb8ren Bjerg', 'FOX Akaadian', '5word', 'tchikou', 'Hakuho', 'Noobkiller291', 'woxiangwanAD', 'Doublelift', 'Jlaol', u'z\xc3\xa3ts', 'Cow Goes Mooooo', u'Be Like \xc3\x91e\xc3\xb8\xc3\xb8', 'Liquid Painless', 'Zergy', 'Huge Rooster', 'Shiphtur', 'Nikkone', 'wiggily1', 'Dylaran', u'C\xc3\xa0m', 'byulbit', 'dirtybirdy82', 'FreeXpHere', u'V\xc2\xb5lcan', 'KaNKl', 'LCS Actor 4', 'bie sha wo', 'Mookiez', 'BKSMOOTH', 'FatMiku']
200
BTW, you can expand it based on your requirements.
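For instance, to fetch the first N names without hard-coding the offsets, the loop can generate them; a small sketch assuming N (a hypothetical target count) is a multiple of the 50-record page size:
from lxml import html
import requests

start_url = 'https://na.op.gg/ranking/ajax2/ladders/start='
N = 500  # hypothetical target count, a multiple of 50
names = list()
for i in range(0, N, 50):  # 0, 50, 100, ..., 450
    page = requests.get(start_url + str(i))
    tree = html.fromstring(page.content)
    names.extend(tree.xpath('//a[not(@target) and not(@onclick)]/text()'))
print len(names)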

Python search loop slow

I am running a search over a list of ads (adscrape). Each ad is a dict within adscrape (e.g. the ad below). The search goes through a list of IDs (database_ids), which could be between 200,000 and 1,000,000 items long. I want to find any ads in adscrape whose ID is not already in database_ids.
My current code is below. It takes a very long time, multiple seconds per ad, to scan through database_ids. Is there a more efficient/faster way of running this (finding which items in one big list are in another big list)?
database_ids = ['id1','id2','id3'...]
ad = {'body': u'\xa0SUV', 'loc': u'SA', 'last scan': '06/02/16', 'eng': u'\xa06cyl 2.7L ', 'make': u'Hyundai', 'year': u'2006', 'id': u'OAG-AD-12371713', 'first scan': '06/02/16', 'odo': u'168911', 'active': 'Y', 'adtype': u'Dealer: Used Car', 'model': u'Tucson Auto 4x4 ', 'trans': u'\xa0Automatic', 'price': u'9990'}
for ad in adscrape:
    ad['last scan'] = date
    ad['active'] = 'Y'
    adscrape_ids.append(ad['id'])
    if ad['id'] not in database_ids:
        ad['first scan'] = date
        print 'new ad:', ad
        newads.append(ad)
You can use a list comprehension for this, using the existing database_ids list and adscrape dicts as given above:
new_adds_ids = [ad for ad in adscrape if ad['id'] not in database_ids]
You can build ids_map as a dict and check whether an id is present by looking up that key in ids_map, as in the code snippet below:
database_ids = ['id1','id2','id3']
ad = {'id': u'OAG-AD-12371713', 'body': u'\xa0SUV', 'loc': u'SA', 'last scan': '06/02/16', 'eng': u'\xa06cyl 2.7L ', 'make': u'Hyundai', 'year': u'2006', 'first scan': '06/02/16', 'odo': u'168911', 'active': 'Y', 'adtype': u'Dealer: Used Car', 'model': u'Tucson Auto 4x4 ', 'trans': u'\xa0Automatic', 'price': u'9990'}
# build ids map
ids_map = dict((k, v) for v, k in enumerate(database_ids))
for ad in adscrape:
    # some logic before checking whether the id is in database_ids
    try:
        ids_map[ad['id']]
    except KeyError:
        pass
    else:
        # no error thrown: perform logic for ids that already exist
        print 'id %s in list' % ad['id']
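The dict above works because dict lookups are constant time; the same effect is usually achieved more directly with a set. A minimal sketch of the question's original loop with a set-based membership test (variable names taken from the question):
database_ids_set = set(database_ids)  # build once; 'in' is then O(1) instead of a list scan

for ad in adscrape:
    ad['last scan'] = date
    ad['active'] = 'Y'
    adscrape_ids.append(ad['id'])
    if ad['id'] not in database_ids_set:  # fast hash lookup
        ad['first scan'] = date
        print 'new ad:', ad
        newads.append(ad)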
