Fetching NUTS2 region from geocoordinates and/or postal code - python

I have a df that contains data including geocoordinates and postcodes.
lat | lon | postcode
54.3077 | 12.7 | 18314
51.898 | 9.26 | 32676
I need a new column with the NUTS2 region, so in this case the resulting df should look something like this:
lat | lon | postcode | NUTS_ID
54.3077 | 12.7 | 18314 | DE80
51.898 | 9.26 | 32676 | DEA4
I found this package: https://github.com/vis4/pyshpgeocode which I managed to run. My first approach is the following two functions:
def get_gc(nuts='geoc\\shapes\\nuts2\\nuts2.shp'):
    """
    nuts -> path to nuts file
    """
    gc = geocoder(nuts, filter=lambda r: r['LEVL_CODE'] == 2)
    return gc
def add_nuts_to_df(df):
    """
    df must have lon/lat geocoordinates in it
    This function will add a column ['NUTS_ID'] with the corresponding
    NUTS region
    """
    start_time = time.time()
    for idx, row in df.iterrows():
        df.loc[idx, 'NUTS_ID'] = get_gc().geocode(row.lat,
                                                  row.lon,
                                                  filter=lambda r: r['NUTS_ID'][:2] == 'DE')['NUTS_ID']
        print('Done with index {}\nTime since start: {}s'.format(idx,
                                                                 round(time.time() - start_time, 0)))
    return df
And this does work! However, it takes ~0.6 s per entry, and some of my dataframes have more than a million entries. Since my original dataframes usually contain postcodes, I was thinking about aggregating them using a combination of groupby / apply / transform?
Or is there any other (more efficient) way of doing this?
I am very grateful for any help and look forward to receiving replies.

If I understand your code correctly you are re-creating the gc object for every single request from the same input file. I don't understand why.
One possibility therefore could be to do the following:
def add_nuts_to_df(df):
    """
    df must have lon/lat geocoordinates in it
    This function will add a column ['NUTS_ID'] with the corresponding
    NUTS region
    """
    nuts = 'geoc\\shapes\\nuts2\\nuts2.shp'
    gc = geocoder(nuts, filter=lambda r: r['LEVL_CODE'] == 2)
    start_time = time.time()
    for idx, row in df.iterrows():
        df.loc[idx, 'NUTS_ID'] = gc.geocode(row.lat,
                                            row.lon,
                                            filter=lambda r: r['NUTS_ID'][:2] == 'DE')['NUTS_ID']
        print('Done with index {}\nTime since start: {}s'.format(idx,
                                                                 round(time.time() - start_time, 0)))
    return df
Maybe it would speed up the process even more if you used the df.apply() method and passed your geocode logic to it as a function.
Something like:
nuts = 'geoc\\shapes\\nuts2\\nuts2.shp'
gc = geocoder(nuts, filter=lambda r: r['LEVL_CODE'] == 2)

def get_nuts_id(row):
    return gc.geocode(row.lat, row.lon,
                      filter=lambda r: r['NUTS_ID'][:2] == 'DE')['NUTS_ID']

df["NUTS_ID"] = df.apply(get_nuts_id, axis=1)
I didn't try this out though so beware of typos.
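Since the question mentions that many rows share coordinates and postcodes, a further speed-up could come from geocoding each unique lat/lon pair only once and merging the result back onto the full dataframe. This is just a sketch building on the get_nuts_id function above; it assumes each coordinate pair (or postcode) maps to a single NUTS2 region and was not tested against the actual shapefile:

# Geocode each unique coordinate pair once, then broadcast the result back.
unique_coords = df[['lat', 'lon']].drop_duplicates()
unique_coords['NUTS_ID'] = unique_coords.apply(get_nuts_id, axis=1)
df = df.merge(unique_coords, on=['lat', 'lon'], how='left')

The same idea should work with the postcode column instead of the coordinates.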

Related

Need to extract specific word from text

I am trying to run a data cleaning process in Python, and one of the columns, which has too many rows, is as follows:
|Website |
|:------------------|
|m.google.com |
|uk.search.yahoo |
|us.search.yahoo.com|
|google.co.in |
|m.youtube |
|youtube.com |
I want to extract the company name from the text.
The output should be as follows:
|Website |Company|
|:------------------|:------|
|m.google.com |google |
|uk.search.yahoo |yahoo |
|us.search.yahoo.com|yahoo |
|google.co.in |google |
|m.youtube |youtube|
|youtube.com |youtube|
The data is too big to do this manually, and being a beginner, I have tried everything I learned. Please help!
Not bullet-proof but maybe a feasible heuristic:
import pandas as pd
d = {'Website': {0: 'm.google.com', 1: 'uk.search.yahoo', 2: 'us.search.yahoo.com', 3: 'google.co.in', 4: 'm.youtube', 5: 'youtube.com'}}
df = pd.DataFrame(data=d)
df['Website'].str.split('.').map(lambda l: [e for e in l if len(e)>3][-1])
0 google
1 yahoo
2 yahoo
3 google
4 youtube
5 youtube
Name: Website, dtype: object
Explanation:
Split the string on ., drop the substrings with fewer than four characters, then take the rightmost element that wasn't filtered out.
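To store the result as the requested Company column, the same expression can simply be assigned back to the dataframe:

df['Company'] = df['Website'].str.split('.').map(lambda l: [e for e in l if len(e) > 3][-1])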
I applied this trick on a large Kaggle dataset and it works for me. It assumes that you already have a pandas DataFrame named df.
company = df['Website']
ext_list = ['www.', '.com', '.edu', '.net', '.org', '.gov', '.mil', 'm.', 'uk.search.', 'us.search.', '.co.in', '.af']
for extension in ext_list:
    # regex=False treats the extension as a literal string; otherwise '.' would match any character
    company = company.str.replace(extension, '', regex=False)
df['company'] = company
df['company'].head(15)
Now look at your data carefully, either at the head or the tail, and check whether any extension is missing from the list; if you find another one, add it to ext_list.
Now you can also verify it using
df['company'].unique()
Here is a way of checking its running time. Its complexity is O(n), so it also performs well on large datasets.
import time

def time_it(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(func.__name__ + " took " + str((end - start) * 1000) + " milliseconds")
        return result
    return wrapper

@time_it
def specific_word(col_name, ext_list):
    for extension in ext_list:
        col_name = col_name.str.replace(extension, '', regex=False)
    return col_name

if __name__ == '__main__':
    company = df['Website']
    extensions = ['www.', '.com', '.edu', '.net', '.org', '.gov', '.mil', 'm.', 'uk.search.', 'us.search.', '.co.in', '.af']
    result = specific_word(company, extensions)
    print(result.head())
This is what I applied on an estimated 10,000 values.

How to find the correlation between columns starting with Open_ and Close_?

I'm trying to find the correlation between the open and close prices of 150 cryptocurrencies using pandas.
Each cryptocurrency data is stored in its own CSV file and looks something like this:
|---------------------|------------------|------------------|
| Date | Open | Close |
|---------------------|------------------|------------------|
| 2019-02-01 00:00:00 | 0.00001115 | 0.00001119 |
|---------------------|------------------|------------------|
| 2019-02-01 00:05:00 | 0.00001116 | 0.00001119 |
|---------------------|------------------|------------------|
| . | . | . |
I would like to find the correlation between the Close and Open column of every cryptocurrency.
As of right now, my code looks like this:
temporary_dataframe = pandas.DataFrame()
for csv_path, coin in zip(all_csv_paths, coin_name):
    data_file = pandas.read_csv(csv_path)
    temporary_dataframe[f"Open_{coin}"] = data_file["Open"]
    temporary_dataframe[f"Close_{coin}"] = data_file["Close"]
# Create all_open based on temporary_dataframe data.
corr_file = all_open.corr()
print(corr_file.unstack().sort_values().drop_duplicates())
Here is a part of the output (the output has a shape of (43661,)):
Open_QKC_BTC Close_QKC_BTC 0.996229
Open_TNT_BTC Close_TNT_BTC 0.996312
Open_ETC_BTC Close_ETC_BTC 0.996423
The problem is that I don't want to see the following correlations:
between columns starting with Close_ and Close_ (e.g. Close_USD_BTC and Close_ETH_BTC)
between columns starting with Open_ and Open_ (e.g. Open_USD_BTC and Open_ETH_BTC)
between the same coin (e.g. Open_USD_BTC and Close_USD_BTC).
In short, the perfect output would look like this:
Open_TNT_BTC Close_QKC_BTC 0.996229
Open_ETH_BTC Close_TNT_BTC 0.996312
Open_ADA_BTC Close_ETC_BTC 0.996423
(PS: I'm pretty sure this is not the most elegant way to do what I'm doing. If anyone has any suggestions on how to make this script better I would be more than happy to hear them.)
Thank you very much in advance for your help!
This is quite messy but it at least shows you an option.
Here I am generating some random data and have made the suffixes (coin names) simpler than in your case:
import string
import numpy as np
import pandas as pd

# Generate random data
prefix = ['Open_', 'Close_']
suffix = string.ascii_uppercase  # All uppercase letters to simulate coin names
var1 = [None] * 100
var2 = [None] * 100
for i in range(len(var1)):
    var1[i] = prefix[np.random.randint(0, len(prefix))] + suffix[np.random.randint(0, len(suffix))]
    var2[i] = prefix[np.random.randint(0, len(prefix))] + suffix[np.random.randint(0, len(suffix))]
df = pd.DataFrame(data={'var1': var1, 'var2': var2})

df['DropScenario_1'] = False
df['DropScenario_2'] = False
df['DropScenario_3'] = False
df['DropScenario_Final'] = False
df['DropScenario_1'] = df.apply(lambda row: bool(prefix[0] in row.var1) and (prefix[0] in row.var2), axis=1)  # Both are Open_
df['DropScenario_2'] = df.apply(lambda row: bool(prefix[1] in row.var1) and (prefix[1] in row.var2), axis=1)  # Both are Close_
df['DropScenario_3'] = df.apply(lambda row: bool(row.var1[len(row.var1)-1] == row.var2[len(row.var2)-1]), axis=1)  # Both suffixes are the same

# Combine all scenarios
df['DropScenario_Final'] = df['DropScenario_1'] | df['DropScenario_2'] | df['DropScenario_3']

# Keep only the part of the df that we want
df = df[df['DropScenario_Final'] == False]

# Drop our helper columns
df = df.drop(['DropScenario_1', 'DropScenario_2', 'DropScenario_3', 'DropScenario_Final'], axis=1)
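Applied to the actual correlation output from the question, the same filtering idea might look like the rough sketch below (untested; it assumes corr_file is the correlation matrix built from all_open as in the question):

pairs = corr_file.unstack()
keep = [a.startswith('Open_') and b.startswith('Close_') and a[len('Open_'):] != b[len('Close_'):]
        for a, b in pairs.index]
print(pairs[keep].sort_values())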
Hope this helps
P.S. If you find the secret key to trading bitcoins without ending up on r/wallstreetbets, I'll take 5% ;)

python - extract values of a dataframe 1 based on a range of values (times) from another dataframe 2

I'm trying to do something that seems quite easy in Excel using VLOOKUP. All the times below are of timedelta datatype. I couldn't find a solution that fit by googling the errors.
DF1 (below) is my main DataFrame; one of its values is the Arrival time.
+--------+------+
|Arrival | idBin|
+--------+------+
|10:01:40| nan |
|10:03:12| nan |
|10:05:55| nan |
|10:05:10| nan |
+--------+------+
DF2 (below) is my parameters DataFrame with 1k+ time ranges (manually creating a dictionary seems impractical).
+--------+--------+------+
|start |end |idBin |
+--------+--------+------+
|10:00:00|10:00:30| 1 |
|10:00:31|10:01:00| 2 |
|10:01:01|10:01:30| 3 |
|10:01:31|10:02:00| 4 |
+--------+--------+------+
What I need is to get DF2.idBin into DF1.idBin where DF1.Arrival is between DF2.start and DF2.end.
What I tried so far:
.loc > returns ValueError: Can only compare identically-labeled Series objects
pd.DataFrame.loc[ (df1['arrival'] >= df2['start'])
& (df1['arrival'] <= df2['end']), 'idBin'] = df2['idBin']
date_range(), so I could transform it into a dictionary, but it returns TypeError: Cannot convert input [0 days 10:00:00] of type <class 'pandas._libs.tslibs.timedeltas.Timedelta'> to Timestamp
dt_range = pd.date_range(start=df2['start'].min(), end=df2['end'].max(), name=df2['idBin'])
IIUC
x = pd.Series(df2['idBin'], pd.IntervalIndex.from_arrays(df2['start'], df2['end']))
inds = np.array([np.flatnonzero(np.array([k in z for z in x.index])) for k in df.Arrival])
bools = [arr.size>0 for arr in inds]
df.loc[bools, 'idBin'] = df2.iloc[[ind[0] for ind in inds[bools]]].idBin.values
DF2_intervals = pd.Series(DF2['idBin'].values,  # .values avoids reindexing idBin against the new IntervalIndex
                          index=pd.IntervalIndex.from_arrays(DF2['start'], DF2['end']))
DF1['idBin'] = DF1['Arrival'].map(DF2_intervals)
You can also turn that into one line to be more efficient, should you wish to.
Let me know if that works.
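A minimal, self-contained sketch of that idea with made-up timedelta data (the column names are taken from the question, the values are invented) could look like this:

import pandas as pd

DF1 = pd.DataFrame({'Arrival': pd.to_timedelta(['10:01:40', '10:03:12'])})
DF2 = pd.DataFrame({'start': pd.to_timedelta(['10:01:01', '10:01:31', '10:02:01']),
                    'end':   pd.to_timedelta(['10:01:30', '10:02:00', '10:03:30']),
                    'idBin': [3, 4, 5]})

# Series indexed by intervals; .map() looks each Arrival up in the containing interval
intervals = pd.Series(DF2['idBin'].values,
                      index=pd.IntervalIndex.from_arrays(DF2['start'], DF2['end'], closed='both'))
DF1['idBin'] = DF1['Arrival'].map(intervals)
print(DF1)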
I'm not sure if there is a pre-built solution, but you can do something similar to what you tried in a UDF, then apply it to the column in df1 and have it output a new column.
def match_idbin(date, df2):
    # the matching range has to contain the date: start <= date <= end
    idbin = df2.loc[(df2['start'] <= date) &
                    (df2['end'] >= date), 'idBin']
    return idbin.iloc[0] if not idbin.empty else None

df1['idBin'] = df1['Arrival'].apply(lambda x: match_idbin(x, df2))
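Another option that may scale well to 1k+ ranges is pd.merge_asof, which matches each Arrival to the closest earlier start and then checks that it also falls before the corresponding end. This is only a sketch using the column names shown above and was not run against the real data; if merge_asof rejects the timedelta keys in your pandas version, they can be converted with .dt.total_seconds() first:

import pandas as pd

# merge_asof needs both frames sorted by the join keys;
# drop df1's empty idBin placeholder so the merged column keeps its name
df1_sorted = df1.drop(columns=['idBin']).sort_values('Arrival')
df2_sorted = df2.sort_values('start')

merged = pd.merge_asof(df1_sorted, df2_sorted,
                       left_on='Arrival', right_on='start',
                       direction='backward')

# discard matches where the arrival lies past the end of the matched range
merged.loc[merged['Arrival'] > merged['end'], 'idBin'] = None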

Pandas - DateTime within X amount of minutes from row

I am not entirely positive about the best way to ask or phrase this question, so I will lay out my problem, dataset, my thoughts on the method, and end goal, and hopefully it will be clear by the end.
My problem:
My company dispatches workers and will load up dispatches to a single employee even if they are on their current dispatch. This is due to limitation in the software we use. If an employee receives two dispatches within 30 minutes, we call this a double dispatch.
We are analyzing our dispatching efficiency and I am running into a bit of a head scratcher. I need to run through our 100k row database and add an additional column that will read as a dummy variable, 1 for double and 0 for normal. BUT since (a) we have multiple people we dispatch and (b) our records do not start out ordered by dispatch time, I need to determine how often a dispatch occurs to the same person within 30 minutes.
Dataset:
The dataset is incredibly massive due to poor organization in our data warehouse, but in terms of what I need, these are the columns I will use for my calculation.
Tech Name | Dispatch Time (PST)
John Smith | 1/1/2017 12:34
Jane Smith | 1/1/2017 12:46
John Smith | 1/1/2017 18:32
John Smith | 1/1/2017 18:50
My Thoughts:
The way I would do it is clunky, and it would only work in one direction (not backwards). I would more or less write my code as:
import pandas as pd

df = pd.read_excel('data.xlsx')
df.sort_values('Dispatch Time (PST)', inplace=True)

tech_name = None
dispatch_time = pd.to_datetime('1/1/1900 00:00:00')

for index, row in df.iterrows():
    if tech_name is None:
        tech_name = row['Tech Name']
    else:
        if dispatch_time + pd.Timedelta('0 days 00:30:00') > row['Tech Dispatch Time (PST)'] and row['Tech Name'] == tech_name:
            row['Double Dispatch'] = 1
            dispatch_time = row['Tech Dispatch Time (PST)']
        else:
            dispatch_time = row['Tech Dispatch Time (PST)']
            tech_name = row['Tech Name']
This has many problems: it is slow and only tracks dates going backwards, not forwards, so I will be missing many dispatches.
End Goal:
My goal is to have a dataset I can then plug back into Tableau for my report by adding on one column that reads as that dummy variable so I can filter and calculate on that.
I appreciate your time and help and let me know if any more details are necessary.
Thank you!
------------------ EDIT -------------
Added an edit to make the question clearer, as I failed to do so earlier.
Question: Is pandas the best tool to use to iterate over my dataframe and check, for each dispatch datetime, whether there is a record that matches the tech's name AND is less than 30 minutes away from this record?
If so, how could I improve my algorithm or approach? If not, what would the best tool be?
Desired output - an additional column that records, as a dummy variable (1 for True, 0 for False), whether a dispatch happened within a 30-minute window. I need to see when double dispatches are occurring and how many records are true double dispatches, not just a count that says there were 100 instances of double dispatch involving over 200 records. I need to be able to sort and see each record.
Hello, I think I found a solution. It is slow and only compares one index before or after, but cases with 3 dispatches within thirty minutes represent less than 0.5% for us.
import pandas as pd
import numpy as np
import datetime as dt

dispatch = 'Tech Dispatched Date-Time (PST)'
tech = 'CombinedTech'

df = pd.read_excel('combined_data.xlsx')
df.sort_values(dispatch, inplace=True)
df.reset_index(inplace=True)
df['Double Dispatch'] = np.NaN
writer = pd.ExcelWriter('final_output.xlsx', engine='xlsxwriter')

dispatch_count = 0
time = dt.timedelta(minutes=30)
for index, row in df.iterrows():
    try:
        tech_one = df[tech].loc[(index - 1)]
        dispatch_one = df[dispatch].loc[(index - 1)]
    except KeyError:
        tech_one = None
        dispatch_one = pd.to_datetime('1/1/1990 00:00:00')
    try:
        tech_two = df[tech].loc[(index + 1)]
        dispatch_two = df[dispatch].loc[(index + 1)]
    except KeyError:
        tech_two = None
        dispatch_two = pd.to_datetime('1/1/2020 00:00:00')
    first_time = dispatch_one + time
    second_time = pd.to_datetime(row[dispatch]) + time
    dispatch_pd = pd.to_datetime(row[dispatch])
    if tech_one == row[tech] or tech_two == row[tech]:
        if first_time > row[dispatch] or second_time > dispatch_two:
            df.at[index, 'Double Dispatch'] = 1  # df.set_value() was removed in recent pandas
            dispatch_count += 1
        else:
            df.at[index, 'Double Dispatch'] = 0
            dispatch_count += 1

print(dispatch_count)  # This was to monitor total # of records being pushed through
df.to_excel(writer, sheet_name='Sheet1')
writer.close()  # writer.save() is gone in recent pandas; close() writes the file
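For the 100k-row scale mentioned in the question, a vectorized alternative might be worth trying. The sketch below reuses the tech and dispatch column-name variables defined above; it sorts by tech and time, measures the gap to the previous and next dispatch of the same tech, and flags rows where either gap is within 30 minutes. It has not been tested against the real data:

import pandas as pd

df = pd.read_excel('combined_data.xlsx')
df[dispatch] = pd.to_datetime(df[dispatch])
df = df.sort_values([tech, dispatch])

# time gap to the previous / next dispatch of the same tech
gap_prev = df.groupby(tech)[dispatch].diff()
gap_next = df.groupby(tech)[dispatch].diff(-1).abs()

window = pd.Timedelta(minutes=30)
df['Double Dispatch'] = ((gap_prev <= window) | (gap_next <= window)).astype(int)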

Using regex to extract information from a large SFrame or dataframe without using a loop

I have the following code in which I use a loop to extract some information and use this information to create a new matrix. However, because I am using a loop, this code takes forever to finish.
I wonder if there is a better way of doing this by using GraphLab's SFrame or pandas dataframe. I appreciate any help!
# This is the regex pattern
pattern_topic_entry_read = r"\d{15}/discussion_topics/(?P<topic>\d{9})/entries/(?P<entry>\d{9})/read"

# Using the pattern, I filter my records
requests_topic_entry_read = requests[requests['url'].apply(lambda x: False if regex.match(pattern_topic_entry_read, x) == None else True)]

# Then for each record in the final set,
# I need to extract topic and entry info using match.group
for request in requests_topic_entry_read:
    for match in regex.finditer(pattern_topic_entry_read, request['url']):
        topic, entry = match.group('topic'), match.group('entry')

        # Then, I need to create a new SFrame (or dataframe, or anything suitable)
        newRow = gl.SFrame({'user_id': [request['user_id']],
                            'url': [request['url']],
                            'topic': [topic], 'entry': [entry]})

        # And, append it to my existing SFrame (or dataframe)
        entry_read_matrix = entry_read_matrix.append(newRow)
Some sample data:
user_id | url
1000 | /123456832960900/discussion_topics/770000832912345/read
1001 | /123456832960900/discussion_topics/770000832923456/view?per_page=832945307
1002 | /123456832960900/discussion_topics/770000834562343/entries/832350330/read
1003 | /123456832960900/discussion_topics/770000534344444/entries/832350367/read
I want to obtain this:
user_id | topic | entry
1002 | 770000834562343 | 832350330
1003 | 770000534344444 | 832350367
Pandas' Series has string functions for that. E.g., with your data in df:
import io, re
import pandas as pd

pattern = re.compile(r'.*/discussion_topics/(?P<topic>\d+)(?:/entries/(?P<entry>\d+))?')
df = pd.read_table(io.StringIO(data), sep=r'\s*\|\s*', index_col='user_id')  # `data` holds the sample table as a string
df.url.str.extract(pattern, expand=True)
yields
topic entry
user_id
1000 770000832912345 NaN
1001 770000832923456 NaN
1002 770000834562343 832350330
1003 770000534344444 832350367
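To match the desired output exactly (only rows that actually contain an entry), the NaN rows can be dropped afterwards, for example:

result = df.url.str.extract(pattern, expand=True).dropna(subset=['entry'])
print(result)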
Here, let me reproduce it:
>>> import pandas as pd
>>> df = pd.DataFrame(columns=["user_id","url"])
>>> df.user_id = [1000,1001,1002,1003]
>>> df.url = ['/123456832960900/discussion_topics/770000832912345/read', '/123456832960900/discussion_topics/770000832923456/view?per_page=832945307', '/123456832960900/discussion_topics/770000834562343/entries/832350330/read','/123456832960900/discussion_topics/770000534344444/entries/832350367/read']
>>> df["entry"] = df.url.apply(lambda x: x.split("/")[-2] if "entries" in x.split("/") else "---")
>>> df["topic"] = df.url.apply(lambda x: x.split("/")[-4] if "entries" in x.split("/") else "---")
>>> df[df.entry!="---"]
gives you the desired DataFrame.
