Python noob here.
I am working with a large dataset that includes a column with unstructured strings. I need to develop a way to create a list that includes all of the suburb names in Australia (I can source this easily). I then need a program that parses through the string, and where a sequence matches an entry in the list, it saves the substring to a new column. The dataset was appended from multiple sources, so there is no consistent structure to the strings.
As an example, the rows look like this:
GIBSON AVE PADSTOW NSW 2211
SYDNEY ROAD COBURG VIC 3058
DUNLOP ST, ROSELANDS
FOREST RD HURSTVILLE NSW 2220
UNKNOWN
JOSEPHINE CRES CHERRYBROOK NSW 2126
I would be greatly appreciative if anyone has any example code that they can share with me, or if you can point me in the right direction for the most appropriate tool/method to use.
In this example, the expected output would look like:
'Padstow'
'Coburg'
'Roselands'
'Hurstville'
''
'Cherrybrook'
EDIT:
Would this code work?
import pandas as pd
import numpy as np

# Load the suburb names as strings, then lowercase them so they match address.lower()
suburb_list = np.genfromtxt('filepath/nsw.csv',
                            delimiter=',', dtype=str)
suburb_list = [s.lower() for s in suburb_list.tolist()]

dataset = pd.read_csv('filepath/dataset.csv')

def get_suburb(address):
    for s in suburb_list:
        if s in address.lower():
            return s
    return ''
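I'm assuming I would then apply it like this to fill the new column (with the unstructured column called 'address' here):

# Title-case the match so 'padstow' becomes 'Padstow'; non-matches stay ''
dataset['suburb'] = dataset['address'].astype(str).apply(get_suburb).str.title()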
So for a pretty simple approach, you could just use a big list with all the suburb names in lower case, and then do:
suburbs = ['padstow', 'coburg']  # ... many more
def get_suburb(unstructured_string):
    for s in suburbs:
        if s in unstructured_string.lower():
            return s
    return ''
This will give you the first match. If you want to get fancy and maybe try to get it right in the face of misspellings etc., you could try "fuzzy" string comparison methods like the Levenshtein distance (for which you'd have to separate the string into individual words first).
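For example, here is a rough sketch using the standard library's difflib (this is SequenceMatcher-based similarity rather than true Levenshtein distance, and the word pairing and cutoff below are only illustrative):

import difflib

def get_suburb_fuzzy(unstructured_string, cutoff=0.85):
    # Check single words and adjacent word pairs, since some suburb names are two words
    words = unstructured_string.lower().split()
    candidates = words + [' '.join(pair) for pair in zip(words, words[1:])]
    for w in candidates:
        match = difflib.get_close_matches(w, suburbs, n=1, cutoff=cutoff)
        if match:
            return match[0]
    return ''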
Related
I would like to ask for a little support. I used the Google Places API in order to get the formatted addresses of some suppliers.
My target is to check the addresses from the raw data and to compare with the ones proposed by Google.
So I managed to create a frame out of the data provided by Google with columns (Name, street_Google, ZIP_Google, Country_Google)
My raw file has also a similar structure (Name, street_raw)
In order to check if the addresses are similar, I run a fuzzy comparison, comparing the streets with each other and putting the result into a column. The image below shows my output Excel file with Street_X as the raw street, Street_Y as the Google-provided street and Similarity as the fuzzy match score.
Here is the step where I cannot get any further, and I would be happy for any kind of support.
How can I group by supplier and keep the row with the maximum Similarity in each supplier group, so that the frame keeps exactly the same column titles as the screenshot and only the following entries?
For example, the result should keep row[0], row[2], row[3] and row[4] with the corresponding columns.
I tried the groupby function, however it creates multiple indices.
Please find below my code
from pprint import pprint
from thefuzz import fuzz
import pandas as pd
import requests
import googlemaps
from urllib.parse import urlencode

API_KEY = "API"
map_client = googlemaps.Client(API_KEY)

vendor_data_pl = pd.read_excel(path)
supplier_name = vendor_data_pl["Supplier name"][:10]
# supplier_name = ["HELLMANN WORLDWIDE LOGISTICS POLSKA SP. Z O.O. SP. K.", "SIEMENS SP. Z O.O."]

supplier_list = []
address_list = []
phone_list = []

for name in supplier_name:
    response = map_client.places(query=name)
    results = response.get("results")
    MultipleLocation = len(results)
    if MultipleLocation >= 1:
        for i in range(MultipleLocation):
            PlaceID = results[i]["place_id"]
            url = "https://maps.googleapis.com/maps/api/place/details/json"
            params = {
                "key": API_KEY,
                "place_id": PlaceID,
                "inputtype": "textquery",
                "language": "en"
            }
            params_encoded = urlencode(params)
            places_endpoint = f"{url}?{params_encoded}"
            r = requests.get(places_endpoint)
            streetname = r.json()["result"]['formatted_address']
            address_list.append(streetname)
            supplier_list.append(name)
            try:
                phoneNumber = r.json()["result"]['international_phone_number']
                phone_list.append(phoneNumber)
            except KeyError:
                phoneNumberNA = "N.A"
                phone_list.append(phoneNumberNA)

# Column named "Supplier name" so that the merge below finds a common key
DataFrame = pd.DataFrame({"Supplier name": supplier_list, "address": address_list, "PhoneNumber": phone_list})
indexDrop = DataFrame[DataFrame["address"] == "No address found"].index
DataFrame = DataFrame.drop(indexDrop)

address = DataFrame["address"].str.split(",", expand=True)
expansion1 = address[0].str.split(r"(.*?)\s*(\d+(?:[/-]\d+)?)?$", expand=True)
ZipCity = address[1].str.split(" ", expand=True)
DataFrame["City"] = ZipCity[2]
DataFrame["Street"] = expansion1[1]
DataFrame["Streetnumber"] = expansion1[2]
DataFrame["Zip"] = ZipCity[1]
DataFrame["Country"] = address[2]

merged_frame = pd.merge(vendor_data_pl, DataFrame, on="Supplier name")
zipaddress = zip(merged_frame["Street_x"].values, merged_frame["Street_y"].values)
similarity = []
for x, y in zipaddress:
    similarity.append(fuzz.ratio(x, y))
merged_frame["Similarity"] = similarity
merged_frame = merged_frame[["Supplier name", "Street_x", "Street_y", "Similarity", "Streetnumber", "City postal code", "Zip", "Country", "PhoneNumber"]]
Here is an easy neat solution:
df.sort_values('Similarity', ascending=False).drop_duplicates('Supplier name')
If you had thousands of rows it wouldn't be too efficient since it requires sorting but it's still not too bad and rather neat.
The efficient way to do this is still quite neat so use whichever you like best:
df.loc[df.groupby('Supplier name')['Similarity'].idxmax()]
It only has to find the maximum within each group rather than sort the whole frame, so at scale it would be more efficient.
I have a question.
Is there a way to check whether there are typos in a specific column?
I have an Excel sheet which I read in using pandas.
First, I need to make a list of the unique values in that column;
second, I need to replace the wrong values with the new values.
Working in a Jupyter notebook and doing this semi-manually might be the best way. One option could be to start by creating a list of correct spellings:
correct = ['terms', 'that', 'are', 'spelt', 'correctly']
and create a subset of your data frame containing only the values that are not in that list:
df[~df['columnname'].isin(correct)]
You will then know how many rows are affected. You can then count the number of different variations:
df['columnname'].value_counts()
and if reasonable, you could look at the unique values, and make them into a list:
listoftypos = list(df['columnname'].unique())
print(listoftypos)
and then create a dictionary again in a semi-manual way as:
typodict = {'terma': 'term', 'thaaat': 'that', 'arree': 'are', 'speelt': 'spelt', 'cooorrectly': 'correctly'}
then iterate over your original data frame, and whenever the value in that column is one of the typos in your dictionary, replace it with the corrected spelling, something like this:
for index, row in df.iterrows():
    value = row['columnname']
    if value in typodict:
        df.at[index, 'columnname'] = typodict[value]
A strong caveat here though - of course, this would be something that would have to be done iteratively if the dataframe was extremely large.
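If iterating row by row becomes too slow on a large dataframe, the same mapping can be applied in one vectorized pass (using the same typodict as above):

# Replace every known typo with its corrected spelling in one pass
df['columnname'] = df['columnname'].replace(typodict)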
Keep in mind that a generic spell check is a fairly tall order, but I believe this solution will fit your need with the lowest chance of false matches:
Setup:
import difflib
import re
from itertools import permutations

import pandas as pd

cardinal_directions = ['north', 'south', 'east', 'west']
regions = ['coast', 'central', 'international', 'mid']
p_lst = list(permutations(cardinal_directions + regions, 2))
area = [''.join(i) for i in p_lst] + cardinal_directions + regions
df = pd.DataFrame({"ID": list(range(0, 9)), "region": ['Midwest', 'Northwest', 'West', 'Northeast', 'East coast', 'Central', 'South', 'International', 'Centrall']})
Initial DF:

ID  region
0   Midwest
1   Northwest
2   West
3   Northeast
4   East coast
5   Central
6   South
7   International
8   Centrall
Function:
def spell_check(my_str, name_bank):
    prcnt = []
    for y in name_bank:
        prcnt.append(difflib.SequenceMatcher(None, y, my_str.lower().strip()).ratio())
    return name_bank[prcnt.index(max(prcnt))]
Apply Function to DF:
df.region=df.region.apply(lambda x: spell_check(x, area))
Resultant DF:

ID  region
0   midwest
1   northwest
2   west
3   northeast
4   eastcoast
5   central
6   south
7   international
8   central
I hope this answers your question and good luck.
I want to explore the population data freely available online at https://www.nomisweb.co.uk/api/v01/dataset/NM_31_1.jsonstat.json . It contains population details of UK from 1981 to 2017. The code I used so far is below
import requests
import json
import pandas
json_url = 'https://www.nomisweb.co.uk/api/v01/dataset/NM_31_1.jsonstat.json'
# download the data
j = requests.get(url=json_url)
# load the json
content = json.loads(j.content)
list(content.keys())
The last line of code above gives me the below output:
['version',
'class',
'label',
'source',
'updated',
'value',
'id',
'size',
'role',
'dimension',
'extension']
I then tried to have a look at the lengths of 'value', 'size' and 'role':
print (len(content['value']))
print (len(content['size']))
print (len(content['role']))
And I got the results as below:
22200
5
3
As we can see, the lengths are very different, so I cannot convert this directly into a dataframe.
How can I change this into a meaningful format so that I can start exploring it? I am required to do analysis as below:
1. A table showing the male, female and total population in columns, per UK region in rows, as well as the UK total, for the most recent year
2. Exploratory data analysis to show how the population progressed by region and age group
You should first read the content of the JSON file apart from value, because the other fields explain what the value field is: a (flattened) multidimensional matrix whose dimensions are given by content['size'], that is 37x4x3x25x2, with the description of each dimension in content['dimension']. The first dimension is time, with 37 years from 1981 to 2017; then geography, with Wales, Scotland, Northern Ireland and England_and_Wales. Next comes sex, with Male, Female and Total, followed by age with 25 classes. At the very end you will find the measures, where the first is the total number of persons and the second is its percentage.
Long story short, only content['value'] will be used to feed the dataframe, but you first need to understand how.
But because of the 5 dimensions, it is probably better to first use a numpy matrix...
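A rough sketch of that idea (untested; it assumes the 2.0-style layout implied by content.keys(), with dimension ids in content['id'] and category labels under content['dimension'][...]['category'], so details may need adjusting for this particular file):

import numpy as np
import pandas as pd

dims = content['id']        # dimension ids, e.g. time, geography, sex, age, measures
sizes = content['size']     # e.g. [37, 4, 3, 25, 2]

# The flat 'value' list is stored in row-major order over those dimensions
values = np.array(content['value'], dtype=float).reshape(sizes)

# Collect human-readable category labels for each dimension
labels = []
for d in dims:
    cat = content['dimension'][d]['category']
    index = cat['index']
    ordered = sorted(index, key=index.get) if isinstance(index, dict) else list(index)
    labels.append([cat.get('label', {}).get(k, k) for k in ordered])

# Long-format dataframe: one row per combination of categories
idx = pd.MultiIndex.from_product(labels, names=dims)
df = pd.DataFrame({'value': values.ravel()}, index=idx).reset_index()
print(df.head())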
The data is a complex JSON file and, as you stated correctly, you need the data frame columns to be of equal length. What that really means is that you need to understand how the records are stored inside your dataset.
I would advise you to use some JSON Viewer/Prettifier to first research the file and understand its structure.
Only then you would be able to understand which data you need to load to the DataFrame. For example, obviously, there is no need to load the 'version' and 'class' values into the DataFrame as they are not part of any record, but are metadata about the dataset itself.
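For instance, you can get a quick overview from Python itself by pretty-printing the metadata fields (the slice below is only there to keep the output short):

import json
print(content['id'], content['size'])
print(json.dumps(content['dimension'], indent=2)[:2000])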
This is the JSON-stat format. See https://json-stat.org. You can use the Python libraries pyjstat or jsonstat.py to get the data into a pandas dataframe.
You can explore this dataset using the JSON-stat explorer
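For example, a minimal sketch with pyjstat (check the documentation of your installed version, as the API has changed between releases):

from pyjstat import pyjstat

dataset = pyjstat.Dataset.read('https://www.nomisweb.co.uk/api/v01/dataset/NM_31_1.jsonstat.json')
df = dataset.write('dataframe')
print(df.head())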
I have a dataframe of names and addresses that I need to dedupe. The catch is that some of these fields might have typos, even though they are still duplicates. For example, suppose I had this dataframe:
index name zipcode
------- ---------- ---------
0 john doe 12345
1 jane smith 54321
2 john dooe 12345
3 jane smtih 54321
The typos could occur in either name or zipcode, but let's just worry about the name one for this question. Obviously 0 and 2 are duplicates as are 1 and 3. But what is the computationally most efficient way to figure this out?
I have been using the Levenshtein distance from the fuzzywuzzy package to calculate the distance between two strings, which works great when the dataframe is small and I can iterate through it via:
from fuzzywuzzy import fuzz

for index, row in df.iterrows():
    for index2, row2 in df.iterrows():
        ratio = fuzz.partial_ratio(row['name'], row2['name'])
        if ratio > 90:  # A good threshold for single character typos on names
            pass  # Do something to declare a match and throw out the duplicate
Obviously this is not an approach that will scale well, and unfortunately I need to dedupe a dataframe that is about 7M rows long. And obviously this gets worse if I also need to dedupe potential typos in the zipcode too. Yes, I could do this with .itertuples(), which would give me a factor of ~100 speed improvement, but am I missing something more obvious than this clunky O(n^2) solution?
Are there more efficient ways I could go about deduping this noisy data? I have looked into the dedupe package, but that requires labeled data for supervised learning and I don't have any nor am I under the impression that this package will handle unsupervised learning. I could roll my own unsupervised text clustering algorithm, but I would rather not have to go that far if there is an existing, better approach.
The package pandas-dedupe can help you with your task.
pandas-dedupe works as follows: first it asks you to label a bunch of records it is most confused about. Afterwards, it uses this knowledge to resolve duplicate entities. And that is it :)
You can try the following:
import pandas as pd
from pandas_dedupe import dedupe_dataframe
df = pd.DataFrame.from_dict({'name':['john', 'mark', 'frank', 'jon', 'john'], 'zip':['11', '22', '33', '11', '11']})
dd = dedupe_dataframe(df, ['name', 'zip'], canonicalize=True, sample_size=1)
The console will then ask you to label some example.
If they are duplicates, press 'y', otherwise 'n'. Once done, press 'f' for finished.
It will then perform deduplication on the entire dataframe.
The string-grouper package is perfect for this. It uses TF-IDF with N-Grams underneath and is much faster than Levenshtein distance.
from typing import Dict, List

import pandas as pd
from string_grouper import group_similar_strings

def group_strings(strings: List[str]) -> Dict[str, str]:
    series = group_similar_strings(pd.Series(strings))
    name_to_canonical = {}
    for i, s in enumerate(strings):
        deduped = series[i]
        if s != deduped:
            name_to_canonical[s] = deduped
    return name_to_canonical
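Applied to the example from the question, the returned mapping could then be used to canonicalize the names (a sketch, assuming the 'name' column from the question):

canonical = group_strings(df['name'].tolist())
df['name'] = df['name'].replace(canonical)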
For zipcodes, I can fairly confidently state that you can't detect typos without some mechanism for field validation (two zipcodes could look very close and both be valid zipcodes).
If the data is sorted, and with some assumptions about where the typo is made (the first letter is highly unlikely to be wrong except in cases of common substitutions), you might be able to take advantage of that and search the names as distinct per-letter chunks. If you assume the same for the last name, you can divide them into 26^2 distinct subgroups and only search within each group.
You could also try an approach that just looks at the set of ORIGINAL first names and last names. If you're searching 7 million items and you have 60 thousand "Johns", you only need to compare them once against "Jhon" to find the error, then search for the "Jhon" and remove or fix it. But this assumes, once again, that you break the names up into first-name and last-name series within the frame (using pandas' str.extract() with "([\w]+) ([\w]+)" or some such regex, as the data demands), as in the sketch below.
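A rough sketch of that blocking idea (the column names and threshold are illustrative, and it assumes names split cleanly into two words):

import pandas as pd
from fuzzywuzzy import fuzz

# Split 'name' into first/last name columns with the regex suggested above
df[['first', 'last']] = df['name'].str.extract(r'(\w+)\s+(\w+)')

# Block on the first letter of each part so comparisons stay within small groups
df['block'] = df['first'].str[0] + df['last'].str[0]

for _, block in df.groupby('block'):
    names = block['name'].tolist()
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if fuzz.partial_ratio(names[i], names[j]) > 90:
                print('possible duplicates:', names[i], names[j])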
I have around 100 csv files. Each of them is read into its own pandas dataframe; the dataframes are then merged and finally written to a database.
Each csv file contains 1000 rows and 816 columns.
Here is the problem:
Each of the csv files contains the 816 columns, but not all of the columns contain data. As a result, some of the csv files are misaligned - the data has been shifted left, but the column has not been deleted.
Here's an made up example:
CSV file A (which is correct):
Name Age City
Joe 18 London
Kate 19 Berlin
Math 20 Paris
CSV file B (with misalignment):
Name Age City
Joe 18 London
Kate Berlin
Math 20 Paris
I would like to merge A and B, but my current solution results in a misalignment.
I'm not sure whether this is easier to deal with in SQL or Python, but I hoped some of you could come up with a good solution.
The current solution to merge the dataframes is as follows:
def merge_pandas(csvpaths):
    frames = []
    for path in csvpaths:
        frame = pd.read_csv(sMainPath + path, header=0, index_col=None)
        frames.append(frame)
    return pd.concat(frames)
Thanks in advance.
A generic solution for these types of problems is most likely overkill. We note that the only possible mistake is when a value is written into a column to the left of where it belongs.
If your problem is more complex than the two column example you gave, you should have an array that contains the expected column type for your convenience.
types = ['string', 'int']
Next, I would set up a marker to identify flaws:
df['error'] = 0
df.loc[df.City.isnull(), 'error'] = 1
The script can detect the error with certainty
In your simple scenario, whenever there is an error, we can simply check the value in the first column.
If it's a number, ignore and move on (keep NaN on second value)
If it's a string, move it to the right
In your trivial example, that would be
import numpy as np

def checkRow(row):
    try:
        row['Age'] = int(row['Age'])
    except ValueError:
        # The Age slot actually holds the city name, so shift it one column right
        row['City'] = row['Age']
        row['Age'] = np.NaN
    return row

df = df.apply(checkRow, axis=1)
In case you have more than two columns, use your types variable to do iterated checks to find out where the NaN belongs.
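A rough sketch of that iterated check (untested; the column list and types values below are illustrative and would need to cover your real 816 columns):

import numpy as np
import pandas as pd

def realign_row(row, columns, types):
    # Keep the non-null values in order, then redistribute them left to right,
    # skipping a column whenever the next value cannot belong to it
    values = [v for v in row[columns] if pd.notnull(v)]
    out = {c: np.nan for c in columns}
    i = 0
    for col, typ in zip(columns, types):
        if i >= len(values):
            break
        if typ == 'int':
            try:
                out[col] = int(values[i])
                i += 1
            except (TypeError, ValueError):
                continue  # value is not a number: leave NaN and try the next column
        else:
            out[col] = values[i]
            i += 1
    for col in columns:
        row[col] = out[col]
    return row

cols = ['Name', 'Age', 'City']
types = ['string', 'int', 'string']
df = df.apply(realign_row, axis=1, args=(cols, types))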
The script cannot know the error with certainty
For example, if two adjacent columns both hold string values, the script cannot tell which column a value belongs to, and you're stuck. Use a second marker to flag these rows and fix them manually. You could of course do advanced checks (e.g. if a column should hold a city name, check whether the value is a valid city name), but this is probably overkill and doing it manually is faster.