pandas dataframe apply function over column creating multiple columns - python

I have the pandas DataFrame below, with a few columns, one of which is ip_address
df.head()
my_id someother_id created_at ip_address state
308074 309115 2859690 2014-09-26 22:55:20 67.000.000.000 rejected
308757 309798 2859690 2014-09-30 04:16:56 173.000.000.000 approved
309576 310619 2859690 2014-10-02 20:13:12 173.000.000.000 approved
310347 311390 2859690 2014-10-05 04:16:01 173.000.000.000 approved
311784 312827 2859690 2014-10-10 06:38:39 69.000.000.000 approved
For each ip_address I'm trying to return the description, city, country
I wrote a function below and tried to apply it
from ipwhois import IPWhois

def returnIP(ip):
    obj = IPWhois(str(ip))
    result = obj.lookup_whois()
    description = result["nets"][len(result["nets"]) - 1]["description"]
    city = result["nets"][len(result["nets"]) - 1]["city"]
    country = result["nets"][len(result["nets"]) - 1]["country"]
    return [description, city, country]

# ---
suspect['ipwhois'] = suspect['ip_address'].apply(returnIP)
My problem is that this returns a list; I want three separate columns.
Any help is greatly appreciated. I'm new to Pandas/Python, so a better way to write the function and use pandas would be very helpful.

from ipwhois import IPWhois

def returnIP(ip):
    obj = IPWhois(str(ip))
    result = obj.lookup_whois()
    description = result["nets"][len(result["nets"]) - 1]["description"]
    city = result["nets"][len(result["nets"]) - 1]["city"]
    country = result["nets"][len(result["nets"]) - 1]["country"]
    return (description, city, country)

suspect['description'], suspect['city'], suspect['country'] = \
    suspect['ip_address'].apply(returnIP)

I was able to solve it with another Stack Overflow solution:
cols = ['description', 'city', 'country']
for n, col in enumerate(cols):
    suspect[col] = suspect['ipwhois'].apply(lambda ipwhois: ipwhois[n])
If there's a more elegant way to solve this, please share!
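One more compact alternative (just a sketch, assuming the ipwhois column already holds the [description, city, country] lists produced by returnIP) is to expand the lists into a DataFrame and assign all three columns in one step:

import pandas as pd

# expand the list-valued 'ipwhois' column into three separate columns
suspect[['description', 'city', 'country']] = pd.DataFrame(
    suspect['ipwhois'].tolist(), index=suspect.index)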

Related

Matching thousands of data takes too much time with Pandas

Every day I receive a report with some values, and I have to match postal codes from countries all over the world to get the right region. Then I upload the result to my Django app.
Here's a look at my report:
Order Number   Date         City     Postal code
930276         27/09/2022   Madrid   cp: 28033
929670         27/09/2022   Lisboa   cp: 1600-812
I have thousands of rows like this. The objective is to retrieve the region in ISO 3166-2 format. To help me, I went to the Geonames page and downloaded each country's information (for example "FR.txt", "ES.txt", ...).
Because these txt files are huge, I chose to store them on an S3 server.
Here is what I tried:
def access_scaleway(region_name, endpoint_url, access_key, secret_key):
    """ Accessing Scaleway Bucket """
    scaleway = boto3.client('s3', region_name=region_name, endpoint_url=endpoint_url,
                            aws_access_key_id=access_key, aws_secret_access_key=secret_key)
    return scaleway

def get_region_code_accessing_scaleway(countries, regions):
    ''' Retrieves the region code from the region name. '''
    list_countries = countries
    list_regions = regions
    list_regions_codes = []
    scaleway_session = access_scaleway(region_name=settings.SCALEWAY_S3_REGION_NAME,
                                       endpoint_url=settings.SCALEWAY_S3_ENDPOINT_URL,
                                       access_key=settings.SCALEWAY_ACCESS_KEY_ID,
                                       secret_key=settings.SCALEWAY_SECRET_ACCESS_KEY)
    for country, region in zip(list_countries, list_regions):
        try:
            obj = scaleway_session.get_object(Bucket=settings.SCALEWAY_STORAGE_BUCKET_NAME,
                                              Key=f'countries/{country}.txt')
            df = pd.read_csv(io.BytesIO(obj['Body'].read()), sep='\t', header=None)
            df.columns = ['country code', 'postal code', 'place name', 'admin name1', 'admin code1',
                          'admin name2', 'admin code2', 'admin name3', 'admin code3',
                          'latitude', 'longitude', 'accuracy']
            df['postal code'] = df['postal code'].astype(str)
            df['postal code'] = df['postal code'].str.zfill(5)
            # Removing all spaces and special characters
            postal_code = re.sub("[^0-9^-]", '', region).strip()
            region_code = country + "-" + df[df['postal code'] == postal_code]['admin code1'].values[0]
            list_regions_codes.append(region_code)
        except AttributeError:
            list_regions_codes.append(None)
        except ValueError:
            list_regions_codes.append(None)
    return list_regions_codes
But it is way too long. For a simple report of 1000 rows, it takes like 30 min.
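One likely reason this attempt is slow is that the country file is downloaded from the bucket and re-parsed once per row. A rough sketch of the same lookup that fetches and parses each country file only once (just an illustration, assuming the same imports as above, the Django settings referenced above, and the access_scaleway helper and bucket layout from the snippet) could look like this:

def get_region_codes_cached(countries, regions):
    ''' Same lookup as above, but each country file is downloaded and parsed only once. '''
    scaleway_session = access_scaleway(region_name=settings.SCALEWAY_S3_REGION_NAME,
                                       endpoint_url=settings.SCALEWAY_S3_ENDPOINT_URL,
                                       access_key=settings.SCALEWAY_ACCESS_KEY_ID,
                                       secret_key=settings.SCALEWAY_SECRET_ACCESS_KEY)
    country_frames = {}  # country code -> DataFrame of its postal codes
    list_regions_codes = []
    for country, region in zip(countries, regions):
        try:
            if country not in country_frames:
                obj = scaleway_session.get_object(Bucket=settings.SCALEWAY_STORAGE_BUCKET_NAME,
                                                  Key=f'countries/{country}.txt')
                df = pd.read_csv(io.BytesIO(obj['Body'].read()), sep='\t', header=None)
                df.columns = ['country code', 'postal code', 'place name', 'admin name1', 'admin code1',
                              'admin name2', 'admin code2', 'admin name3', 'admin code3',
                              'latitude', 'longitude', 'accuracy']
                df['postal code'] = df['postal code'].astype(str).str.zfill(5)
                country_frames[country] = df
            df = country_frames[country]
            postal_code = re.sub("[^0-9^-]", '', region).strip()
            matches = df.loc[df['postal code'] == postal_code, 'admin code1']
            if len(matches):
                list_regions_codes.append(country + "-" + str(matches.values[0]))
            else:
                list_regions_codes.append(None)
        except (AttributeError, ValueError):
            list_regions_codes.append(None)
    return list_regions_codes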
My second try was to go with the OpenDataSoft public API. Here is what I tried:
def fetch_data(url, params, headers=None):
    response = requests.get(url=url, params=params, headers=headers)
    return response

def get_region_code_accessing_scaleway(countries, regions):
    ''' Retrieves the region code from the region name. '''
    list_countries = countries
    list_regions = regions
    list_regions_codes = []
    for country, region in zip(list_countries, list_regions):
        try:
            # Get response from API
            postal_code = re.sub("[^0-9^-]", '', region).strip()
            response = fetch_data(
                url="https://data.opendatasoft.com/api/v2/catalog/datasets/geonames-postal-code%40public/records?",
                params="select=country_code%2C%20postal_code%2C%20admin_code1&where=country_code%3D%22" + country + "%22%20and%20postal_code%3D%22" + postal_code + "%22")
            if response.status_code == 200:
                data = response.json()
                if len(data['records']) > 0:
                    list_regions_codes.append(country + "-" + data['records'][0]['record']['fields']['admin_code1'])
                else:
                    list_regions_codes.append(None)
            else:
                print('Error: ' + str(response.status_code))
                list_regions_codes.append(None)
        except (KeyError, ValueError):
            list_regions_codes.append(None)
    return list_regions_codes
But once again, it takes forever to get matching values.
The last thing I tried was pgeocode, but it was also too slow.
I don't understand why it takes so long, because the desired output is just this:
Order Number   Date         City     Postal code    Region code
930276         27/09/2022   Madrid   cp: 28033      ES-MD
929670         27/09/2022   Lisboa   cp: 1600-812   PT-08
Do you have any idea to speed up the process?

Storing key text as headers and value text as rows using a DataFrame in Python with Beautiful Soup

for imo in imos:
    ...
    ...
    keys_div = soup.find_all("div", {"class": "col-4 keytext"})
    values_div = soup.find_all("div", {"class": "col-8 valuetext"})
    for key, value in zip(keys_div, values_div):
        print(key.text + ": " + value.text)
Output:
Ship Name: MAERSK ADRIATIC
Shiptype: Chemical/Products Tanker
IMO/LR No.: 9636632
Gross: 23,297
Call Sign: 9V3388
Deadweight: 37,538
MMSI No.: 566429000
Year of Build: 2012
Flag: Singapore
Status: In Service/Commission
Operator: Handytankers K/S
Shipbuilder: Hyundai Mipo Dockyard Co Ltd
ShipType: Chemical/Products Tanker
Built: 2012
GT: 23,297
Deadweight: 37,538
Length Overall: 184.000
Length (BP): 176.000
Length (Reg): 177.460
Bulbous Bow: Yes
Breadth Extreme: 27.430
Breadth Moulded: 27.400
Draught: 11.500
Depth: 17.200
Keel To Mast Height: 46.900
Displacement: 46565
T/CM: 45.0
This is the output for one IMO. I want to store this output in a DataFrame and write it to CSV, where the CSV has the key text as headers and the value text as rows for all the IMOs. Please help me with how to do it.
All you have to do is add the results to a list and then output that list to a dataframe.
import pandas as pd

filepath = r"C:\users\test\test_file.csv"
output_data = []
for imo in imos:
    # ... fetch and parse the page for this imo into soup, as in the question's loop ...
    keys_div = [i.text for i in soup.find_all("div", {"class": "col-4 keytext"})]
    values_div = [i.text for i in soup.find_all("div", {"class": "col-8 valuetext"})]
    dict1 = dict(zip(keys_div, values_div))
    output_data.append(dict1)

df = pd.DataFrame(output_data)
df.to_csv(filepath, index=False)

How to get street name from osm.pbf file in OpenStreetMap

You can download any dataset from here https://download.geofabrik.de/australia-oceania.html
Here's my code
import osmium as osm
import pandas as pd

class OSMHandler(osm.SimpleHandler):
    def __init__(self):
        osm.SimpleHandler.__init__(self)
        self.osm_data = []

    def tag_inventory(self, elem, elem_type):
        for tag in elem.tags:
            self.osm_data.append([elem_type,
                                  elem.id,
                                  elem.version,
                                  elem.visible,
                                  pd.Timestamp(elem.timestamp),
                                  elem.uid,
                                  elem.user,
                                  elem.changeset,
                                  len(elem.tags),
                                  tag.k,
                                  tag.v])

    def node(self, n):
        self.tag_inventory(n, "node")

    def way(self, w):
        self.tag_inventory(w, "way")

    def relation(self, r):
        self.tag_inventory(r, "relation")

osmhandler = OSMHandler()
# scan the input file and fills the handler list accordingly
osmhandler.apply_file("/DATA/user/nabih/pitcairn-islands-latest.osm.pbf")

# transform the list into a pandas DataFrame
data_colnames = ['type', 'id', 'version', 'visible', 'ts', 'uid',
                 'user', 'chgset', 'ntags', 'tagkey', 'tagvalue']
df_osm = pd.DataFrame(osmhandler.osm_data, columns=data_colnames)
Here's the df_osm
Street names are the values of the name key of highway elements (see https://wiki.openstreetmap.org/wiki/Map_features#Highway for all possible highway types; you may want to filter them further in the query). You can then self-join all highway rows with their name rows on id:
df_osm.loc[df_osm.tagkey=='highway', ['id', 'tagvalue']].merge(
    df_osm.loc[df_osm.tagkey=='name', ['id', 'tagvalue']],
    on='id', suffixes=['_kind', '_name'])
Result for pitcairn-islands-latest.osm.pbf:
id tagvalue_kind tagvalue_name
0 1034153953 residential Main Road
1 1034161481 residential Hill of Difficulty Road
If you want to also include national names you can replace df_osm.tagkey=='name' with df_osm.tagkey.str.startswith('name'). See https://wiki.openstreetmap.org/wiki/Key:name for details and other possible names.
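As an illustration of that filtering (just a sketch, reusing the df_osm built in the question; the list of highway kinds is only an example), you could restrict the join to a few kinds and pick up any name:* key:

# keep a few example highway kinds and join them with any name:* tag
kinds = df_osm.loc[df_osm.tagkey.eq('highway') &
                   df_osm.tagvalue.isin(['residential', 'primary', 'secondary']),
                   ['id', 'tagvalue']]
names = df_osm.loc[df_osm.tagkey.str.startswith('name'), ['id', 'tagvalue']]
streets = kinds.merge(names, on='id', suffixes=['_kind', '_name'])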

How to compare two CSV files using PySpark and validate whether records exist or not

So, I have an input.csv something like this:
First_Name Last_Name Birthdate Gender Email_ID Mobile
Smit Will 21-04-1974 M da1#gmail.com 5224521452
Bob Builder 14-03-1992 M ad4#gmail.com 2452586253
And a Database.csv with a few more records in it:
First_Name Last_Name Birthdate Gender Email_ID Mobile
Bob Micheles 10-04-1982 M ya4#gmail.com 7845214525
Will Smith 21-04-1974 M da1#gmail.com 9874521452
Emma Watson 21-08-1989 F emma#gmail.com 5748214563
Emma Smit 21-08-1999 F da1#gmail.com 9874521452
bob robison 14-03-1992 M za#gmail.com 2452586253
df_DataBase = spark.read.csv("DataBase.csv",inferSchema=True,header=True)
My expected output is:
Bob Builder is the same as Bob robison, as only his Last_Name and Email_ID are different.
Smit Will and Will Smith are the same, as only the names and the mobile number are different.
And finally, print whether or not they exist in the existing input file, like this:
NOTE: The person is not the same when the email, phone and birthdate don't match.
If we can achieve this using PySpark, that would be great.
You can try something like below:
from pyspark.sql.functions import col, lit

ip = spark.read.csv("input.csv", header=True)
db = spark.read.csv("database.csv", header=True)

# condition if person is the same
person_exists = [(col('a.Email_ID') == col('b.Email_ID')) |
                 (col('a.Mobile') == col('b.Mobile')) |
                 (col('a.Birthdate') == col('b.Birthdate'))]

# people existing in db
existing_persons = ip.alias('a').join(db.alias('b'), person_exists, "inner") \
                     .select([col('a.' + x) for x in ip.columns])

# people not existing in db
non_existing = ip.subtract(existing_persons)

# add a column to indicate if same person or not
existing_persons = existing_persons.withColumn('Same_Person', lit('Yes'))
non_existing = non_existing.withColumn('Same_Person', lit('No'))
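To see the final exists/not-exists view in one place, a small follow-up (just a sketch, assuming the two DataFrames built above) could union them and display the Same_Person flag:

result = existing_persons.unionByName(non_existing)
result.show(truncate=False)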

Pandas - Filter DataFrame with Logics

I have a Pandas DataFrame like this,
Employee ID ActionCode ActionReason ConcatenatedOutput
1 TER DEA TER_DEA
1 RET ABC RET_ABC
1 RET DEF RET_DEF
2 TER DEA TER_DEA
2 ABC ABC ABC_ABC
2 DEF DEF DEF_DEF
3 RET FGH RET_FGH
3 RET EFG RET_EFG
4 PLA ABC PLA_ABC
4 TER DEA TER_DEA
And I want to filter it with the logic below and change it to something like this:
Employee ID ConcatenatedOutput Context
1 RET_ABC RET or TER Found
2 TER_DEA RET or TER Found
3 RET_FGH RET or TER Found
4 PLA_ABC RET or TER Not Found
Logic:
1) If the first record of an employee is TER_DEA, then we go into that employee and see if that employee has any other records. If that employee has another RET record, then we pick up the first available RET record; otherwise we stick to the TER_DEA record.
2) If the first record of an employee is anything other than TER_DEA, then we stick with that record.
3) Context is conditional: if it has a RET or TER, then we say RET or TER Found; otherwise it is Not Found.
Note:- The final output will have only one record for an employee ID.
The data below,
employee_id = [1,1,1,2,2,2,3,3,4,4]
action_code = ['TER','RET','RET','TER','ABC','DEF','RET','RET','PLA','TER']
action_reason = ['DEA','ABC','DEF','DEA','ABC','DEF','FGH','EFG','ABC','DEA']
concatenated_output = ['TER_DEA', 'RET_ABC', 'RET_DEF', 'TER_DEA', 'ABC_ABC', 'DEF_DEF', 'RET_FGH', 'RET_EFG', 'PLA_ABC', 'TER_DEA']
df = pd.DataFrame({
    'Employee ID': employee_id,
    'ActionCode': action_code,
    'ActionReason': action_reason,
    'ConcatenatedOutput': concatenated_output,
})
I'd recommend you rather go with a Bool in that field. To get the test data I used this:
import pandas as pd
employee_id = [1,1,1,2,2,2,3,3,4,4]
action_code = ['TER','RET','RET','TER','ABC','DEF','RET','RET','PLA','TER']
action_reason = ['DEA','ABC','DEF','DEA','ABC','DEF','FGH','EFG','ABC','DEA']
concatenated_output = ['TER_DEA', 'RET_ABC', 'RET_DEF', 'TER_DEA', 'ABC_ABC', 'DEF_DEF', 'RET_FGH', 'RET_EFG', 'PLA_ABC', 'TER_DEA']
df = pd.DataFrame({
    'Employee ID': employee_id,
    'ActionCode': action_code,
    'ActionReason': action_reason,
    'ConcatenatedOutput': concatenated_output,
})
You can then do a group by on the employee ID and apply a function to perform your specific program logic in there.
def myfunc(data):
    if data.iloc[0]['ConcatenatedOutput'] == 'TER_DEA':
        if len(data.loc[data['ActionCode'] == 'RET']) > 0:
            located_record = data.loc[data['ActionCode'] == 'RET'].iloc[[0]]
        else:
            located_record = data.iloc[[0]]
    else:
        located_record = data.iloc[[0]]
    located_record['RET or TER Context'] = data['ActionCode'].str.contains('|'.join(['RET', 'TER']))
    return located_record

df.groupby(['Employee ID']).apply(myfunc)
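Called like that, the result comes back indexed by the group key; if you want a flat frame with one row per employee (just a usage note, reusing the df defined above), you can drop the group index:

result = df.groupby(['Employee ID']).apply(myfunc).reset_index(drop=True)
print(result)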
