I am trying to iterate over a pandas DataFrame with close to a million entries, using a for loop. Consider the following code as an example:
import pandas as pd
import os
from requests_html import HTMLSession
from tqdm import tqdm
import time
df = pd.read_csv(os.getcwd()+'/test-urls.csv')
df = df.drop('Unnamed: 0', axis=1 )
new_df = pd.DataFrame(columns = ['pid', 'orig_url', 'hosted_url'])
refused_df = pd.DataFrame(columns = ['pid', 'refused_url'])
tic = time.time()
for idx, row in df.iterrows():
    img_id = row['pid']
    url = row['image_url']

    # Let's do scraping
    session = HTMLSession()
    r = session.get(url)
    r.html.render(sleep=1, keep_page=True, scrolldown=1)
    count = 0
    link_vals = r.html.find('.zoomable')

    if len(link_vals) != 0:
        attrs = link_vals[0].attrs
        # print(attrs['src'])
        embed_link = attrs['src']
    else:
        while count <= 7:
            link_vals = r.html.find('.zoomable')
            count += 1
            if len(link_vals) != 0:
                # stop retrying as soon as a link is found
                embed_link = link_vals[0].attrs['src']
                break
        else:
            print('Link refused connection for 7 tries. Adding URL to Refused URLs Data Frame')
            ref_val = [img_id, url]
            len_ref = len(refused_df)
            refused_df.loc[len_ref] = ref_val
            print('Refused URL added')
            continue

    print('Got 1 link')

    # Append scraped data to new_df
    len_df = len(new_df)
    append_value = [img_id, url, embed_link]
    new_df.loc[len_df] = append_value
I wanted to know how I could use a progress bar to add a visual representation of how far along I am. I would appreciate any help. Please let me know if you need any clarification.
You could try out tqdm:
from tqdm import tqdm

for idx, row in tqdm(df.iterrows()):
    # do something
This is primarily for a command-line progress bar. There are other solutions if you're looking for more of a GUI. PySimpleGUI comes to mind, but is definitely a little more complicated.
Would comment, but: the reason you might want a progress bar is that the loop is taking a long time, and iterrows() is a slow way to do operations in pandas.
I would suggest you use apply and avoid iterrows().
If you want to continue using iterrows, just include a counter that counts up to the number of rows, df.shape[0].
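As a minimal sketch (assuming the df from the question), passing that row count to tqdm as total= lets it show a percentage and ETA rather than just a raw iteration count:

from tqdm import tqdm

for idx, row in tqdm(df.iterrows(), total=df.shape[0]):
    pass  # replace with the scraping logic for each row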
PySimpleGUI makes this about as simple a problem to solve as possible, assuming you know ahead of time how many items you have in your list. Indeterminate progress meters are possible, but a little more complicated.
There is no setup required before your loop. You don't need to make a special iterator. The only thing you have to do is add one line of code inside your loop.
Inside your loop, add a call to one_line_progress_meter. The name sums up what it is. Add this call to the top of your loop, the bottom, it doesn't matter... just add it somewhere that's looped.
The 4 parameters you pass are:
A title to put on the meter (any string will do)
Where you are now - current counter
What the max counter value is
A "key" - a unique string, number, anything you want.
Here's a loop that iterates through a list of integers to demonstrate.
import PySimpleGUI as sg

items = list(range(1000))
total_items = len(items)
for index, item in enumerate(items):
    sg.one_line_progress_meter('My meter', index+1, total_items, 'my meter')
The list iteration code will be whatever your loop code is. The line of code to focus on that you'll be adding is this one:
sg.one_line_progress_meter('My meter', index+1, total_items, 'my meter' )
This line of code will show you the window below. It contains statistical information like how long you've been running the loop and an estimation on how much longer you have to go.
How to do that in pandas apply?
I do this:
def some_func(a, b):
    global index
    # ... some function involving a and b ...
    index += 1
    sg.one_line_progress_meter('My meter', index, len(df), 'my meter')
    return c

index = 0
df['c'] = df[['a', 'b']].apply(lambda x: some_func(*x), axis=1)
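As an alternative to the global counter, tqdm's pandas integration wraps apply directly; a minimal sketch, assuming a some_func(a, b) that simply computes and returns the value:

from tqdm import tqdm

tqdm.pandas()  # registers progress_apply on DataFrames and Series

# same computation as above, with the progress bar handled by tqdm
df['c'] = df[['a', 'b']].progress_apply(lambda x: some_func(*x), axis=1)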
Related
I understand that vectorization or parallel programming is the way to go. But what if my program doesn't fit those use cases, for example if NumPy doesn't work for a particular problem?
For demonstration purposes, I have written a simple program here:
import pandas as pd
import requests
from bs4 import BeautifulSoup

def extract_data(location_name, date):
    ex_data = []
    url = f'http://www.example.com/search.php?locationid={location_name}&date={date}'
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "html.parser")
    for tr in soup.find('table', class_='classTable').find_all('tr', attrs={'class': None}):
        text = [td.text for td in tr.find_all('td')]
        ex_data.append(text)
    return ex_data

# list of all dates in a year
dates = pd.read_csv('dates.csv')
# list of a number of locations
location_list = pd.read_csv('locations.csv')

master_data = []
for indexLoc, rowLoc in location_list.iterrows():
    data = []
    for index, row in dates.iterrows():
        _date = row['Date']
        location = rowLoc['Location']
        row = extract_data(location, _date)
        data = data + row
    master_data = master_data + data

master_df = pd.DataFrame(master_data)
print(master_df)
The program basically puts a list of dates and a list of locations in separate dataframes, then loops through each location and, in a nested loop, through each date to execute a function. The function makes a request to a URL with those parameters and gets some information from a table using BeautifulSoup, which it returns. The program then stores those return values in a list and the loop continues.
Now, let's say there are 100 locations and 365 dates: the program will run through each location for 365 days, which means the loop executes 100*365 times. That enormous number, on top of the temporary storage required for the list the function returns on each iteration, is definitely not anywhere near an efficient solution.
Using NumPy may change the date into a Datetime variable rather than keeping it as a string (at least, that's what happened in my case). Using multiprocessing/multithreading might break the sequence in which the final master_data should be displayed, if a request in the function takes too long to fulfill. For instance, the data for Feb 17,...,20 will be added to the list before Feb 16, because requesting the URL for Feb 16 took too long. I understand that it can be sorted later on, but what if the data is such that it can't be sorted?
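As a side note on the ordering concern: concurrent.futures' Executor.map returns results in the order of its inputs regardless of which request finishes first, so a thread pool does not by itself scramble the output. A minimal sketch (not a definitive fix), assuming the extract_data function and the dates/location_list DataFrames from the code above:

from concurrent.futures import ThreadPoolExecutor
from itertools import product

# every (location, date) combination, in the order the results should come back
tasks = list(product(location_list['Location'], dates['Date']))

with ThreadPoolExecutor(max_workers=8) as pool:
    # map() yields results in input order, even if later requests finish earlier
    tables = pool.map(lambda args: extract_data(*args), tasks)

master_data = [row for table in tables for row in table]
master_df = pd.DataFrame(master_data)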
What would be a simple, lightweight solution for these nested loops that gives the best speed for the program's execution? I would also like to know why that would be the best option, with an example if you can provide one.
I am scraping data with Python. I get a CSV file and can split it into columns in Excel later. But I am encountering an issue I have not been able to solve. Sometimes the scraped items have two statuses and sometimes just one. The second status then shifts the other values in the columns to the right, and as a result the dates are not all in the same column, which would be useful for sorting the rows.
Do you have any idea how to make the columns merge if there are two statuses, for example, or any other solutions?
Maybe it is also an issue that I still need to separate the values into columns manually with Excel.
Here is my code
# call packages
import random
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import pandas as pd

# define driver etc.
service_obj = Service("C:\\Users\\joerg\\PycharmProjects\\dynamic2\\chromedriver.exe")
browser = webdriver.Chrome(service=service_obj)

# create loop
initiative_list = []
for i in range(0, 2):
    url = 'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page='+str(i)
    browser.get(url)
    time.sleep(random.randint(5, 10))
    initiative_item = browser.find_elements(By.CSS_SELECTOR, "initivative-item")
    initiatives = [item.text for item in initiative_item]
    initiative_list.extend(initiatives)

df = pd.DataFrame(initiative_list)

# create csv
print(df)
df.to_csv('Initiativen.csv')
df.columns = ['tosplit']
new_df = df['tosplit'].str.split('\n', expand=True)
print(new_df)
new_df.to_csv('Initiativennew.csv')
I tried to merge the columns if there are two statuses.
make the columns merge if there are two statuses for example or other solutions
[If by "statuses" you mean the yellow labels ending in OPEN/UPCOMING/etc, then] it should be taken care of by the following parts of the getDetails_iiaRow (below the dividing line):
labels = cssSelect(iiaEl, 'div.field span.label')
and then
'labels': ', '.join([l.text.strip() for l in labels])
So, multiple labels will be separated by commas (or any other separator you apply .join to).
initiative_item = browser.find_elements(By.CSS_SELECTOR, "initivative-item")
initiatives = [item.text for item in initiative_item]
Instead of doing it like this and then having to split and clean things, you should consider extracting each item in a more specific manner and have each "row" be represented as a dictionary (with the column-names as the keys, so nothing gets mis-aligned later). If you wrap it as a function:
def cssSelect(el, sel): return el.find_elements(By.CSS_SELECTOR, sel)
def getDetails_iiaRow(iiaEl):
    title = cssSelect(iiaEl, 'div.search-result-title')
    labels = cssSelect(iiaEl, 'div.field span.label')
    iiarDets = {
        'title': title[0].text.strip() if title else None,
        'labels': ', '.join([l.text.strip() for l in labels])
    }

    cvSel = 'div[translate]+div:last-child'
    for c in cssSelect(iiaEl, f'div:has(>{cvSel})'):
        colName = cssSelect(c, 'div[translate]')[0].text.strip()
        iiarDets[colName] = cssSelect(c, cvSel)[0].text.strip()

    link = iiaEl.get_attribute('href')
    if link[:1] == '/':
        link = f'https://ec.europa.eu/{link}'
    iiarDets['link'] = link

    return iiarDets
then you can simply loop through the pages like:
initiative_list = []
for i in range(0, 2):
    url = f'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page={i}'
    browser.get(url)
    time.sleep(random.randint(5, 10))
    initiative_list += [
        getDetails_iiaRow(iia) for iia in
        cssSelect(browser, 'initivative-item>article>a')
    ]
and then, since it's all cleaned already, you can directly save the data with
pd.DataFrame(initiative_list).to_csv('Initiativen.csv', index=False)
The output I got for the first 3 pages looks like:
I think it is worth working a little bit harder to get your data rationalised before putting it in the csv rather than trying to unpick the damage once ragged data has been exported.
A quick look at each record in the page suggests that there are five main items that you want to export and these correspond to the five top-level divs in the a element.
The complexity (as you note) comes because there are sometimes two statuses specified, and in that case there is sometimes a separate date range for each and sometimes a single date range.
I have therefore chosen to put the three ever present fields as the first three columns, followed next by the status + date range columns as pairs. Finally I have removed the field names (these should effectively become the column headings) to leave only the variable data in the rows.
initiatives = [processDiv(item) for item in initiative_item]
def processDiv(item):
    divs = item.find_elements(By.XPATH, "./article/a/div")
    if "\n" in divs[0].text:
        statuses = divs[0].text.split("\n")
        if len(divs) > 5:
            return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1],
                    statuses[0], divs[4].text.split("\n")[1],
                    statuses[1], divs[5].text.split("\n")[1]]
        else:
            return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1],
                    statuses[0], divs[4].text.split("\n")[1],
                    statuses[1], divs[4].text.split("\n")[1]]
    else:
        return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1],
                divs[0].text, divs[4].text.split("\n")[1]]
The above approach sticks as close to yours as I can. You will clearly need to rework the pandas code to reflect the slightly altered data structure.
Personally, I would invest even more time in clearly identifying the best definitions for the fields that represent each piece of data that you wish to retrieve (rather than as simply divs 0-5), and extract the text directly from them (rather than messing around with split). In this way you are far more likely to create robust code that can be maintained over time (perhaps not your goal).
I'm trying to subset and return a pandas df. The main obstacle is I'm executing the subset from a df that is being continually updated.
I'm appending data on a timer that imports the same dataset every minute. I then want to subset this updated data and return it for a separate function. Specifically, the subset df will be emailed. I'm hoping to continually repeat this process.
I'll lay out each intended step below. I'm falling down on step 3.
Import the dataset over a 24 hour period
Continually update same dataset every minute
Subset the data frame by condition
If a new row is appended to df, execute email notification
Using the code below, data is imported from Yahoo Finance, where the same data is pulled every minute.
I'm then aiming to subset specific rows from this updated dataframe and return the data to be emailed.
I only want to execute the email function when a new row of data has been appended.
The condition outlined below will return a new row at every minute (which is by design for testing purposes). My actual condition will return between 0-10 instances a day.
The example df outlined in df_out is an example that may be taken at a point throughout a day.
import pandas as pd
import yfinance as yf
import datetime
import pytz
from threading import Thread
from time import sleep
import numpy as np
import requests
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
import smtplib
def scheduled_update():
    # Step 1. import data for a 24 hour period
    my_date = datetime.datetime.now(pytz.timezone('Etc/GMT-5'))
    prev_24hrs = my_date - datetime.timedelta(hours = 25, minutes = 0)
    data = yf.download(tickers = 'EURUSD=X',
                       start = prev_24hrs,
                       end = my_date,
                       interval = '1m'
                       ).iloc[:-1]  #.reset_index()

    # Step 2. update data every minute in 5 min blocks
    while True:
        sleep(60)
        upd_data = yf.download(tickers = 'EURUSD=X',
                               start = my_date - datetime.timedelta(hours = 0, minutes = 5),
                               end = datetime.datetime.now(pytz.timezone('Etc/GMT-5')),
                               interval = '1m')
        print(upd_data)

    if len(upd_data) != 0:
        # Here's the check to see if the data meets the desired condition.
        # The last row is again removed for the same reason as noted earlier.
        upd_data = upd_data.loc[upd_data['High'].lt(0.98000)].iloc[:-1]
        # Merges the two sets of data.
        data = pd.concat([data, upd_data])
        # For the first few minutes after the function starts running, there will be
        # duplicate timestamps, so this gets rid of those. This way is fastest, according to:
        # https://stackoverflow.com/questions/13035764/remove-pandas-rows-with-duplicate-indices
        data = data[~data.index.duplicated(keep = 'first')]
        print(data)
    else:
        print('No new data')

    return data
thread = Thread(target = scheduled_update)
thread.start()
For argument's sake, let's say that during the day a new row has been appended, and we call the df df_out. When the new row has been appended, I want to execute the email notification.
# Step 3. return subset of data to be emailed
#df_out = scheduled_update()
# example df
df_out = pd.DataFrame({'Datetime' : ['2022-10-10 01:44:00+01:00','2022-10-10 01:45:00+01:00','2022-10-10 01:46:00+01:00','2022-10-10 01:47:00+01:00','2022-10-10 01:48:00+01:00'],
'Open' : [0.973899,0.973710,0.973615,0.973410,0.973799],
'High' : [0.973999,0.974110,0.973115,0.973210,0.973899],
'Low' : [0.973899,0.973710,0.973615,0.973710,0.973499],
'Close' : [0.973999,0.974110,0.973115,0.973410,0.973499],
'Adj Close' : [0.973999,0.974110,0.973115,0.973410,0.973499],
'Volume' : [0,0,0,0,0],
})
# Step 4. send notification containing df_out
def send_tradeNotification(send_to, subject, df):
    # google account and password
    send_from = 'xxxx1#gmail.com'
    password = 'password'

    # email message
    message = """\
    <p><strong>Notification </strong></p>
    <p>
    <br>
    </p>
    <p><strong>-
    </strong><br><strong>Regards </strong></p>
    """

    for receiver in send_to:
        multipart = MIMEMultipart()
        multipart['From'] = send_from
        multipart['To'] = receiver
        multipart['Subject'] = subject
        attachment = MIMEApplication(df.to_csv())
        attachment['Content-Disposition'] = 'attachment; filename=" {}"'.format(f'{subject}.csv')
        multipart.attach(attachment)
        multipart.attach(MIMEText(message, 'html'))
        server = smtplib.SMTP('smtp.gmail.com', 587)
        server.starttls()
        server.login(multipart['From'], password)
        server.sendmail(multipart['From'], multipart['To'], multipart.as_string())
        server.quit()

#send_tradeNotification(['xxxx2#gmail.com'], 'Trade Setup', df_out)
#send_tradeNotification(['xxxx2#gmail.com'], 'Trade Setup', df_out)
It's a little unclear exactly what you're trying to do, but my assumptions are:
You want to have a 'baseline' dataset from the previous 24 hours.
You want to pull new data every minute after that.
If the new data matches a specific criteria, you want to append that data to the existing data.
Unless there's some specific reason you have the initial data outside of the scheduled_update() function, I think it's likely easiest just to keep everything in there. So for the first few lines of your function, I'd do the following:
def scheduled_update():
    # Using pandas time functions as they're a bit cleaner than datetime, but either will work
    my_date = pd.Timestamp.now('Etc/GMT-5')
    prev_24hrs = my_date - pd.Timedelta(hours = 25)
    data = yf.download(tickers = 'EURUSD=X',
                       start = prev_24hrs,
                       end = my_date,
                       interval = '1m'
                       ).iloc[:-1]
In my testing, I noticed that the last value in the downloaded data sometimes included a seconds value in it. Adding the .iloc[:-1] to the end of the DataFrame ensures that only whole minutes are included.
What was the original purpose of including the while datetime.datetime.now().minute % 1 != 0 lines? .minute % 1 will always be equal to 0 so it wasn't clear what that was doing.
As @Josh Friedlander indicated, upd_data is defined as a list outside of the function, so trying to use pandas methods on it won't work.
Instead, upd_data should be a new DataFrame that contains the recent data that you're importing. If the data is being updated every minute, it doesn't make sense to download a full 24 hours every time, so I changed that to the following so it only pulls the last 5 minutes:
while True:
    sleep(60)
    upd_data = yf.download(tickers = 'EURUSD=X',
                           start = pd.Timestamp.now('Etc/GMT-5') - pd.Timedelta(minutes = 5),
                           end = pd.Timestamp.now('Etc/GMT-5'),
                           interval = '1m')
You could probably get away with only pulling 2 or 3 minutes instead, but this ensures that there's some overlap in case there are delays or issues with the download.
As I'm writing this, Yahoo isn't updating the exchange rate since trading is apparently closed for the day, so there's no new data in the upd_data DataFrame.
When that's the case, it doesn't make sense to do any assessment on it, so the following chunk of code checks to see if there were any updates. And if there are, then it updates the data DataFrame. If not, it just prints a statement that there are no updates (the whole else block can be deleted, otherwise it'll print that out all night... Ideally you could set things up so that it doesn't even run overnight, but that's not really critical at this point.)
if len(upd_data.index) > 0:
    # Here's the check to see if the data meets the desired condition.
    # The last row is again removed for the same reason as noted earlier.
    upd_data = upd_data.loc[upd_data['High'].gt(0.97000)].iloc[:-1]
    # Merges the two sets of data.
    data = pd.concat([data, upd_data])
    # For the first few minutes after the function starts running, there will be
    # duplicate timestamps, so this gets rid of those. This way is fastest, according to:
    # https://stackoverflow.com/questions/13035764/remove-pandas-rows-with-duplicate-indices
    data = data[~data.index.duplicated(keep = 'first')]
    print(data)
else:
    print('No new data')
After this, it's not clear how you want to proceed. In your original code, the return will never be executed, since the while loop will continue indefinitely. If you put the return inside the loop, though, it will break the loop when it reaches that line. It looks like there are ways in the threading module to pass values out, but that's not something I'm familiar with, so I can't comment on that. There are likely other options as well, depending on what your end goals are! One hedged possibility is sketched below.
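As a sketch of one such option (assuming scheduled_update is eventually given an exit condition so that it actually returns), concurrent.futures runs the function in a worker thread and hands the return value back through a Future:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(scheduled_update)   # runs in a background thread
    df_out = future.result()                     # blocks until scheduled_update() returns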
EDIT based on updated question
Based on your update, the line in my code:
data = data[~data.index.duplicated(keep = 'first')]
can be changed to:
data = data[~data.index.duplicated(keep = False)]
This will get rid of all duplicate rows, leaving only the rows with new information.
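A toy illustration of the difference (not from the original answer): keep='first' retains one copy of each duplicated timestamp, while keep=False drops every row whose timestamp appears more than once, leaving only genuinely new rows:

import pandas as pd

frame = pd.DataFrame({'val': [1, 2, 3]}, index=['09:00', '09:00', '09:01'])

print(frame[~frame.index.duplicated(keep='first')])  # keeps one 09:00 row and 09:01
print(frame[~frame.index.duplicated(keep=False)])    # keeps only 09:01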
You should just include the email in this section as well. Put in another if statement to check if the length of data is greater than 0, and send an email if it is.
You also need to increase the line indent of the if/else from my code in your while loop. It shouldn't be at the same indentation level as the start of the while loop.
SECOND EDIT based on comments
Based on what you've written, just add the following if/else loop to check for updates, and put your email function within that loop if the condition is met.
if len(upd_data.index) > 0:
    # Here's the check to see if the data meets the desired condition.
    # The last row is again removed for the same reason as noted earlier.
    upd_data = upd_data.loc[upd_data['High'].gt(0.97000)].iloc[:-1]
    # Merges the two sets of data.
    # Keeps all data values but drops duplicate entries.
    data = pd.concat([data, upd_data])
    data = data[~data.index.duplicated(keep = 'first')]
    # Note change here to False to delete all rows with duplicate timestamps
    # Also changed df name to send_data to keep it separate from full df.
    send_data = data[~data.index.duplicated(keep = False)]
    if len(send_data.index) != 0:  # If there is new data that meets the condition
        # **run your whole email code here**
        print(send_data)
    else:  # If there is new data, but it doesn't meet the condition
        print('New rows retrieved, none meet condition')
else:  # If there is no new data (markets are closed, for example)
    print('No new data')
OK, I've got this. So you want to append data every minute. You created a list and used its append function, but you want to use the pandas library, so you can actually do this for your function:
def scheduled_update():
    while datetime.datetime.now().minute % 1 != 0:
        sleep(1)

    x = pd.DataFrame(data)

    while True:
        sleep(60)
        df_out = x[x['High'] > 0.97000]
        print(df_out)
        return df_out

thread = Thread(target = scheduled_update)
thread.start()
Afterwards, if you want to append data to a pandas DataFrame, you can use:
DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=None)
Use this DataFrame.append function every minute, where you use the modulus operator inside the first while loop to check whether a minute has passed, and remove the upd_data list, which is now useless.
Logic:
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3, 4],
                    "b": [5, 6, 7, 8]})
df2 = pd.DataFrame({"a": [1, 2, 3],
                    "b": [5, 6, 7]})

print(df1, "\n")
print(df2)

df1.append(df2, ignore_index = True)  # we set ignore_index because we don't want the indexes to be the same
Output Picture
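Worth noting: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the equivalent is pd.concat:

# equivalent of df1.append(df2, ignore_index=True) on pandas >= 2.0
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)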
I'm new to any kind of programming, as you can tell by this 'beautiful' piece of hard coding. With sweat and tears (not so bad, just a little), I've created very sequential code, and that's actually my problem. My goal is to create a somewhat automated script, probably including a for loop (which I've tried, unsuccessfully).
The main aim is to create a randomization loop which takes original dataset looking like this:
dataset
From this dataset, rows are picked randomly one by one and saved to another Excel list. The point is that the row's values in the columns position01 and position02 should always be selected so they do not match the previous pick in either of those two columns. That should eventually create an Excel sheet with randomized rows, where each row never shares values with the previous one: row 2 should not include any of the position01/position02 values of row 1, row 3 should not contain values of row 2, etc. It should also iterate over the range of the list length, which is 0-11. The Excel output is also important, since I need the rest of the columns; I just need to shuffle the row order.
I hope my aim and description are clear enough, if not, happy to answer any questions. I would appreciate any hint or help, that helps me 'unstuck'. Thank you. Code below. (PS: I'm aware of the fact that there is probably much more neat solution to it than this)
import pandas as pd
import random
dataset = pd.read_excel("C:\\Users\\ibm\\Documents\\Psychopy\\DataInput_Training01.xlsx")
# original data set use for comparisons
imageDataset = dataset.loc[0:11, :]
# creating empty df for storing rows from imageDataset
emptyExcel = pd.DataFrame()
randomPick = imageDataset.sample() # select randomly one row from imageDataset
emptyExcel = emptyExcel.append(randomPick) # append a row to empty df
randomPickIndex = randomPick.index.tolist() # get index of the row
imageDataset2 = imageDataset.drop(index=randomPickIndex) # delete the row with index selected before
# getting raw values from the row; 'position01' and 'position02' are column headers
randomPickTemp1 = randomPick['position01'].values[0]
randomPickTemp2 = randomPick
randomPickTemp2 = randomPickTemp2['position02'].values[0]
# getting a dataset which not including row values from position01 and position02
isit = imageDataset2[(imageDataset2.position01 != randomPickTemp1) & (imageDataset2.position02 != randomPickTemp1) & (imageDataset2.position01 != randomPickTemp2) & (imageDataset2.position02 != randomPickTemp2)]
# pick another row from dataset not including row selected at the beginning - randomPick
randomPick2 = isit.sample()
# save it in empty df
emptyExcel = emptyExcel.append(randomPick2, sort=False)
# get index of this second row to delete it in next step
randomPick2Index = randomPick2.index.tolist()
# delete the another row
imageDataset3 = imageDataset2.drop(index=randomPick2Index)
# AND REPEAT the procedure of comparison of the raw values with dataset already not including the original row:
randomPickTemp1 = randomPick2['position01'].values[0]
randomPickTemp2 = randomPick2
randomPickTemp2 = randomPickTemp2['position02'].values[0]
isit2 = imageDataset3[(imageDataset3.position01 != randomPickTemp1) & (imageDataset3.position02 != randomPickTemp1) & (imageDataset3.position01 != randomPickTemp2) & (imageDataset3.position02 != randomPickTemp2)]
# AND REPEAT with another pick - save - matching - picking again.. until end of the length of the dataset (which is 0-11)
So in the end I've used a solution provided by David Bridges (post from Sep 19 2019) on the PsychoPy forums. In case anyone is interested, here is a link: https://discourse.psychopy.org/t/how-do-i-make-selective-no-consecutive-trials/9186
I've just adjusted the condition in for loop to my case like this:
remaining = [choices[x] for x in choices if last['position01'] != choices[x]['position01'] and last['position01'] != choices[x]['position02'] and last['position02'] != choices[x]['position01'] and last['position02'] != choices[x]['position02']]
Thank you very much for the helpful answer! and hopefully I did not spam it over here too much.
import itertools as it
import random
import pandas as pd

# list of pairs of numbers
tmp1 = [x for x in it.permutations(list(range(6)), 2)]
df = pd.DataFrame(tmp1, columns=["position01", "position02"])
df1 = pd.DataFrame()

i = random.choice(df.index)
df1 = df1.append(df.loc[i], ignore_index = True)
df = df.drop(index = i)

while not df.empty:
    val = list(df1.iloc[-1])
    tmp = df[(df["position01"]!=val[0])&(df["position01"]!=val[1])&(df["position02"]!=val[0])&(df["position02"]!=val[1])]
    if tmp.empty:  # looped for 10000 times, was never empty
        print("here")
        break
    i = random.choice(tmp.index)
    df1 = df1.append(df.loc[i], ignore_index = True)
    df = df.drop(index=i)
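Since DataFrame.append is gone in pandas 2.x, here is a hedged variant of the same loop that collects the chosen rows in a plain list and builds the shuffled frame once at the end (same logic, just a different way of accumulating rows):

import itertools as it
import random
import pandas as pd

pairs = list(it.permutations(range(6), 2))
df = pd.DataFrame(pairs, columns=["position01", "position02"])

rows = []                                # chosen rows, in order
i = random.choice(df.index)
rows.append(df.loc[i])
df = df.drop(index=i)

while not df.empty:
    last = rows[-1]
    tmp = df[(df["position01"] != last["position01"]) & (df["position01"] != last["position02"]) &
             (df["position02"] != last["position01"]) & (df["position02"] != last["position02"])]
    if tmp.empty:                        # no valid next pick; never happened in testing above
        break
    i = random.choice(tmp.index)
    rows.append(df.loc[i])
    df = df.drop(index=i)

df1 = pd.DataFrame(rows).reset_index(drop=True)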
I am trying to convert 33,000 zip codes into coordinates using the geopy package. I was hoping there was a way to parallelize this method because it is consuming quite a bit of resources.
from geopy.geocoders import ArcGIS
import pandas as pd
import time

geolocator = ArcGIS()

df1 = pd.DataFrame(0.0, index=list(range(0, len(df))), columns=list(['lat', 'lon']))
df = pd.concat([df, df1], axis=1)

for index in range(0, len(df)):
    row = df['zipcode'].loc[index]
    print(index)
    # time.sleep(1)
    # I put this function in just in case it would give me a timeout error.
    myzip = geolocator.geocode(row)
    try:
        df['lat'].loc[index] = myzip.latitude
        df['lon'].loc[index] = myzip.longitude
    except:
        continue
geopy.geocoders.ArcGIS.geocode queries a web server. Sending 33,000 queries alone will probably get you IP banned, so I wouldn't suggest sending them in parallel.
You're looking up almost every single ZIP code in the US. The US Census Bureau has a 1MB CSV file that contains this information for 33,144 ZIP codes: https://www.census.gov/geo/maps-data/data/gazetteer2017.html.
You can process it all in a fraction of a second:
zip_df = pd.read_csv('2017_Gaz_zcta_national.zip', sep='\t')
zip_df.rename(columns=str.strip, inplace=True)
One thing to watch out for is that the last column's name isn't properly parsed by Pandas and contains a lot of trailing whitespace. You have to strip the column names before use.
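To get from that file to lat/lon columns on your own DataFrame, a hedged sketch (assuming the gazetteer's GEOID, INTPTLAT and INTPTLONG columns, and a zipcode column in df; check the actual column names and dtypes in the file you download):

zip_df = pd.read_csv('2017_Gaz_zcta_national.zip', sep='\t')
zip_df.rename(columns=str.strip, inplace=True)

# GEOID holds the ZCTA (ZIP) code; INTPTLAT/INTPTLONG are its internal point coordinates
lookup = zip_df[['GEOID', 'INTPTLAT', 'INTPTLONG']].rename(
    columns={'GEOID': 'zipcode', 'INTPTLAT': 'lat', 'INTPTLONG': 'lon'})

# dtypes may need aligning first, e.g. df['zipcode'] = df['zipcode'].astype(int)
df = df.merge(lookup, on='zipcode', how='left')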
Here would be one way to do it, using multiprocessing.Pool
from multiprocessing import Pool
import time

def get_longlat(x):
    index, row = x
    print(index)
    time.sleep(1)
    myzip = geolocator.geocode(row['zipcode'])
    try:
        return myzip.latitude, myzip.longitude
    except:
        return None, None

p = Pool()
df[['lat', 'long']] = p.map(get_longlat, df.iterrows())
More generally, using DataFrame.iterrows (for which each item iterated over is an (index, row) tuple) is likely slightly more efficient than the index-based method you use above.
EDIT: after reading the other answer, you should be aware of rate limiting; you could use a fixed number of processes in the Pool along with a time.sleep delay to mitigate this to some extent, however.