I have limited access to the MySQL database; I can only see a view called customer that contains id_customer, name, and location:
id_customer name location
1 Andy Detro.it
2 Ben CALiforNIA
3 Mark uk
4 Niels London123
5 Pierre Paris
And a table called location that contains the list of cities and countries a customer location can map to:
id_country country id_city city
1 US 1 Detroit
1 US 2 California
2 UK 3 London
2 UK 4 Manchester
I want to clean the customer data automatically whenever there is new data in the database. I mean: if the raw data contains punctuation, numbers, or typos, it should be cleaned automatically, and the cleaned location should then be looked up in the location table to find its id_city. If no city is similar/matches, it should fall back to id_country, and if neither matches, the id_city/country will be 0. The result should become a new table called customer_location:
id_customer id_city status
1 1 Match
2 2 Match
3 2 Country
4 3 Match
5 0 Unknown
The status is a label: if the location matched a city it will be Match, if it matched a country it will be Country, and if there is no similar name (id_city/country = 0) it will be Unknown. The location can be a city or a country, so the status tells whether it matched the city or the country.
Can someone suggest what I should do for this project? I am trying to do it with Python in a Jupyter notebook, but will that be effective for this case? I am really new to these things, sorry if I can't give enough information, and thanks in advance.
This is a question with lots of stacked steps needed to achieve what you want, so let's dive straight in!
First, we should read the data from your (uncleaned) customer view and your location table into DataFrames:
import pandas as pd

# db_connection is your existing database connection (e.g. a SQLAlchemy engine)
customer_df = pd.read_sql("SELECT * FROM customer", db_connection)
location_df = pd.read_sql("SELECT * FROM location", db_connection)
Now that we have the data in proper DataFrames, we can start cleaning the locations in your customer data. There are MANY ways to do so! Your requirement was as follows:
there is punctuation or number or typo
Let's tackle the first two issues. We can use a regex to strip out punctuation and numbers: pattern = r"[^a-zA-Z\s]". With that pattern at hand, we first clean the customer location data:
pattern = r"[^a-zA-Z\s]"
# regex=True is required in recent pandas versions; strip() removes leftover whitespace
customer_df["location"] = customer_df["location"].str.replace(pattern, "", regex=True).str.strip()
For your typo issues there is no one-size-fits-all solution. You could use a dictionary for frequent mismatches (see the sketch below), or review the database and add important ones manually. There are also a few libraries that can calculate the "distance" between the intended and the actual word.
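For illustration, a minimal sketch of the dictionary approach; the misspellings here are made up and you would maintain the mapping yourself:
# Hypothetical corrections for typos that show up often in your data;
# extend this mapping as you review new records
typo_fixes = {
    "Detriot": "Detroit",
    "Californa": "California",
}
customer_df["location"] = customer_df["location"].replace(typo_fixes)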
A good library (subjective opinion - no affiliation) is FuzzyWuzzy, as it lets you pick between different scorers built on the Levenshtein distance, such as fuzz.ratio or fuzz.token_set_ratio!
from fuzzywuzzy import fuzz, process

# For each customer location, find the best-matching city and its similarity score
matches = customer_df["location"].apply(
    lambda loc: process.extractOne(loc, location_df["city"].tolist(), scorer=fuzz.token_set_ratio)
)
customer_df["city_match"] = matches.str[0]
customer_df["city_ratio"] = matches.str[1]
Note that this is just an example. You may go ahead and read the docs or a good article I found on Medium!
You do need to do this twice: once against location_df["city"] and once against location_df["country"]. Use a scorer of your choice (depending on the data you typically get) - but as mentioned, with the sample you included I cannot conclusively decide for you what's best to use.
Now you can use a threshold value to determine whether a city/country is similar enough to be considered! A radical example: if you have lots of customers from Iran and Iraq (which differ by a single letter), you may need to adjust the threshold accordingly ;)
threshold = 80  # pick a cut-off that suits your data
city_ids = location_df.set_index("city")["id_city"]
good = customer_df["city_ratio"] > threshold
customer_df.loc[good, "id_city"] = customer_df.loc[good, "city_match"].map(city_ids)
Again, please do this for both the city and the country (store the country results in an id_country column, since the code below relies on it)!
Now, lastly, let's bring together the hard work we've done! We create a new table with three columns: id_customer (the IDs from the first table), id_city (either the city ID or the country ID, depending on the status), and status (Match = a city was matched, Country = only the country could be matched, Unknown = nothing found, in which case the default ID will be 0)!
Create the final dataframe: customer_location_df = customer_df[["id_customer"]].copy()
Now set the id_city as described (as mentioned above, the country matching itself is up to you; I sampled the city side, and the code below assumes the resulting id_country column exists):
# id_city / id_country were only filled in where the match passed the threshold,
# so a non-null value already implies a good enough match
customer_location_df.loc[
    customer_df["id_city"].notnull(), "id_city"
] = customer_df["id_city"]
customer_location_df.loc[
    customer_df["id_city"].isnull() & customer_df["id_country"].notnull(), "id_city"
] = customer_df["id_country"]
customer_location_df.loc[
    customer_df["id_city"].isnull() & customer_df["id_country"].isnull(), "id_city"
] = 0
Create the status column and set it:
customer_location_df.loc[
    customer_df["id_city"].notnull(), "status"
] = "Match"
customer_location_df.loc[
    customer_df["id_city"].isnull() & customer_df["id_country"].notnull(), "status"
] = "Country"
customer_location_df.loc[
    customer_df["id_city"].isnull() & customer_df["id_country"].isnull(), "status"
] = "Unknown"
Lastly, save customer_location_df as a new table in the database: customer_location_df.to_sql("customer_location", db_connection, if_exists="replace") (to_sql expects a SQLAlchemy engine/connection, and be careful: if_exists="replace" overwrites any existing table of that name)!
I don't know if I understood correctly here..
[...] I mean if in the raw data there is punctuation or number or typo, it will automatically clean [...]
You need some sort of validation method here; you cannot achieve that directly in the database, so you need to handle it in your logic before the rows are inserted.
In these cases, the best solution is to prepare a picklist (multiple choice) from which end users can choose the right values.
A free-text input will always be error-prone.
If multiple choice is not an applicable solution in your case, then you need to put in place a list of validation rules, and you need to think about how to prevent every possible issue.
For example, in your case:
You could use Regex to clean the input
import re

city = 'Detro.it'
cleanCity = re.sub(r'[^\w\s]', '', city)
print(cleanCity)  # Detroit
You need to play with the regex in this manner. Note that [^\w\s] keeps digits, so for an input like London123 you may want to extract only letters with [a-zA-Z]+ instead.
To normalize the casing of the input you can use str.title(). After that, every character except the first one of each word is converted to lowercase:
city = "caliFORnia"
cleanCity = city.title()
print(cleanCity) // California
The final resulting table can be obtained with a MySQL query.
You need to JOIN the tables (here the only common field is the name of the city, which is not ideal for the ON clause; an id would be better).
To build the derived column 'Status' you can leverage the MySQL CASE expression.
Example:
SELECT field1, field2, ...,
       CASE
           WHEN field1 = field2 THEN "Match"
           ELSE "Unmatch"
       END AS Derived_Col
FROM table;
Result:
field1 field2 Derived_Col
sometxt sometxt Match
another other Unmatch
I am fairly new to Python and have been using Apriori to analyse my baskets. In saying that, my team has requested that I identify the top 3 products sold with certain ranges and I am unsure how to go about this considering I only have access to Excel and Python.
My data is structured in the columns listed below.
DocumentNumber - This is the sales document number
DisplayName - product display name
MasterCategory - First hierarchy of the product
Category - second product hierarchy
SubCategory - third product hierarchy
Range - Collection name
Quantity - Number of units sold on that sales document
ProductCode - Product Internal ID
The task is to identify the top 3 Sofa (Category) ranges, the top 3 Occasional Chairs (Category), the top 3 Coffee Tables (SubCategory) and the top 3 Side Tables (SubCategory), and the products these are often sold with.
I cannot for the life of me figure out how to do this with apriori, and I have over 68,000 rows of transaction data with 33,059 unique transactions to scan for the data above.
Would one of you kind souls please be able to guide me in the right direction?
I have tried Apriori Algorithm in Python, but I am unsure that is the correct way to approach this problem.
You can try pandas. The code will look like this:
import pandas as pd

# Load your data into a pandas DataFrame
df = pd.read_csv("data.csv")
# Filter the data to the group you're interested in (Sofas is a Category in your data)
sofas = df[df['Category'] == 'Sofas']
# Top 3 ranges by units sold within that group
top_3_sofa_ranges = (sofas.groupby('Range')['Quantity']
                     .sum()
                     .sort_values(ascending=False)
                     .head(3))
Repeat for the other categories and subcategories; a sketch of that loop follows.
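If it helps, here is a minimal sketch of looping over all four groups; it assumes the column names from your post and that df is already loaded as above:
# Hypothetical list of (column, value) pairs taken from the question
targets = [("Category", "Sofas"), ("Category", "Occasional Chairs"),
           ("SubCategory", "Coffee Tables"), ("SubCategory", "Side Tables")]

for column, value in targets:
    top_3 = (df[df[column] == value]
             .groupby("Range")["Quantity"]
             .sum()
             .sort_values(ascending=False)
             .head(3))
    print(value, top_3, sep="\n")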
Looks like you are studying data science. You can check Kaggle for more problems with solutions and explanations.
I have a data export from a rack management software that I need to manipulate and load to a new system.
In the current export (df1) I have a column that includes the acronym of one of the data centers. That is: column "Region" with values in the format "DC_Rack_01".
I created an empty data frame (df2) that I will eventually import into the new system. This df2 will be populated with data from df1, but the fields do not match 1:1. The new data frame (df2) must have a column "site" which must contain the acronym "DC" (or whatever the names of the other data centers are). I am thinking of doing something like what's stated in the question title (I hope it's not too confusing):
if df1.region starts with("DC"), use "DC" to populate column df2.site
I tried the following code, but it returns a boolean True/False, whereas I want the actual string "DC".
blade_netbox['site'] = np.where(blade_original.Region.str.contains('Site_acronym'),
blade_netbox['site'], 'Site_acronym')
I am not quite sure how to add tables in the body of the question other than ASCII tables, which I find a little too complicated for this example. I hope that images of my data frames work.
Here is a solution using regex:
Code:
import re
import pandas as pd

df = pd.read_csv('mycsv.csv')
print(df)
site_acronym = 'DC'
df['site'] = df['Region'].apply(lambda x: site_acronym if re.match(f'{site_acronym}.+', x) else '')
print(df)
Input:
Region device
0 DC_Rack_01 blade server
1 DC_Rack_02 blade server
2 Rack_03 NaN
3 RackDC_03 NaN
Output:
Region device site
0 DC_Rack_01 blade server DC
1 DC_Rack_02 blade server DC
2 Rack_03 NaN
3 RackDC_03 NaN
Explanation:
re.match only matches at the beginning of the string, so the acronym has to appear at the start
. any character except a newline character
+ one or more occurrences of the one-character regular expression, here .+ means one or more occurrences of any character after the site acronym
If you want zero or more occurrences of any character after the site acronym you can use .* instead
If you want to get a prefix instead of a boolean, you could try grabbing it with the .str accessor (in case it follows a regular pattern)
e.g.
blade_netbox['site'] = blade_original['Region'].str[:2] # first two characters
or
blade_netbox['site'] = blade_original['Region'].str.split('_').str[0] # first fragment when splitting by underscore
Supposing the site acronym is always the first two letters of the column 'Region', you can try:
blade_netbox['site'] = blade_original['Region'].apply(lambda x: x[:2])
output:
name device site
0 device_1 blade server DC
1 device_2 blade server DC
EDIT:
This only works if both DataFrames have the same length and the rows are in the same order.
Going on two months in Python, and I am focusing hard on pandas right now. In my current position I use VBA on our data, so I am learning this to slowly replace it and further my career.
As of now I believe my true problem is a lack of understanding of a key concept (or concepts). Any help would be greatly appreciated.
That said, here is my problem:
Where could I go to learn more about how to do stuff like this for more precise filtering? I'm very close, but there is one key aspect I need.
Goal(s)
Main goal: I need to skip certain values in my ID column.
The code below takes out the dashes "-" and keeps only the first 9 digits. Yet I need to skip certain IDs because they are unique.
After that I'll start to work on comparing multiple sheets.
The main data frame's IDs are formatted as 000-000-000-000.
The other data frames that I will compare it to have them with no dashes "-", as 000000000, with three fewer digits, totaling nine.
The unique IDs that I need skipped are the same in both data frames, but are formatted completely differently, e.g. 000-000-000_#12, 000-000-000_35, or 000-000-000_z.
My code that I will use on each ID except the unique ones:
dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
but I want to use an if statement like (This does not work)
lst = ["000-000-000_#69B", "000-000-000_a", "etc.. random IDs", ]
if ~dfSS["ID"].isin(lst ).any()
dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
else:
pass
For more clarification my input DataFrame is this:
ID Street # Street Name
0 004-330-002-000 2272 Narnia
1 021-521-410-000_128 2311 Narnia
2 001-243-313-000 2235 Narnia
3 002-730-032-000 2149 Narnia
4 000-000-000_a 1234 Narnia
And I am looking to do this as the output:
ID Street # Street Name
0 004330002 2272 Narnia
1 021-521-410-000_128 2311 Narnia
2 001243313000 2235 Narnia
3 002730032000 2149 Narnia
4 000-000-000_a 1234 Narnia
Notes:
dfSS is my DataFrame variable name, i.e. the Excel file I am using. "ID" is my column heading; I will make it an index after the fact.
My data frame for this job is small, with (rows, columns) = (2500, 125).
I do not get an error message, so I am guessing maybe I need a loop of some kind. I'm starting to test for loops with this as well; no luck there... yet.
Here is where I have been to research this:
Comparison of a Dataframe column values with a list
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
if statement with ~isin() in pandas
recordlinkage module - I didn't think this was going to work
Regular expression operations - Having a hard time fully understanding this at the moment
There are a number of ways to do this. The first way here doesn't involve writing a function.
# Create a placeholder column with all transformed IDs
dfSS["ID_trans"] = dfSS["ID"].str.replace("-", "").str[:9]
dfSS.loc[~dfSS["ID"].isin(lst), "ID"] = dfSS.loc[~dfSS["ID"].isin(lst), "ID_trans"] # conditional indexing
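In case the conditional indexing line looks cryptic, here is a minimal, self-contained illustration (with made-up data) of how the boolean mask on the left of the comma selects rows:
import pandas as pd

lst = ["000-000-000_a"]  # hypothetical IDs to skip
df = pd.DataFrame({"ID": ["004-330-002-000", "000-000-000_a"]})

mask = ~df["ID"].isin(lst)  # boolean Series: True where the ID is NOT in lst
df.loc[mask, "ID"] = df.loc[mask, "ID"].str.replace("-", "").str[:9]
print(df)  # row 0 becomes 004330002, row 1 is left untouched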
The second way is to write a function that conditionally converts the IDs, and it's not as fast as the first method.
def transform_ID(ID_val):
    # lst is the list of unique IDs to skip, as defined in the question
    if ID_val not in lst:
        return ID_val.replace("-", "")[:9]
    return ID_val  # leave the unique IDs untouched

dfSS['ID_trans'] = dfSS['ID'].apply(transform_ID)
This is based on xyzxyzjayne's answer, but I have two issues I cannot figure out.
First issue
is that I get this warning (see the Edit below):
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Documentation for this warning
You'll see in the code below that I tried to put in .loc, but I can't seem to figure out how to eliminate this warning by using .loc correctly. Still learning it. No, I will not just ignore it even though it works; this is a learning opportunity, I say.
Second issue
is that I do not understand this part of the code. I know the left side is supposed to be rows, and the right side is columns. That said, why does this work? ID is a column, not a row, when this code is run:
df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]
The area I don't understand yet is the left side of the comma (,) in this part:
df.loc[~df["ID "].isin(uniqueID), "ID "]
That said, here is the final result. Basically, as I said, it's XYZ's help that got me here, but I'm adding more .locs and playing with the documentation until I can eliminate the warning.
# a whole list of 1000+ IDs I had to enter manually; these IDs get skipped by the code below
uniqueID = ["032-234-987_#4256", ...]
# gets the columns i need to make the DateFrame smaller
df = df[['ID ', 'Street #', 'Street Name', 'Debris Finish', 'Number of Vehicles',
'Number of Vehicles Removed', 'County']]
#Place holder will make our new column with this filter
df.loc[:, "Place Holder"] = df.loc[:,"ID "].str.replace("-", "").str[:9]
#the next code is the filter that goes through the list and skips them. Work in progress to fully understand.
df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]
#Makes the ID our index
df = df.set_index("ID ")
#just here to add the date to our file name. Must import time for this to work
todaysDate = time.strftime("%m-%d-%y")
#make it an excel file
df.to_excel("ID TEXT " + todaysDate + ".xlsx")
Will edit this once I get rid of the warning and figure out the left side, so I can explain it for everyone who needs/sees this post.
Edit: SettingWithCopyWarning
Fixed this chained-indexing problem by making a copy of the original data frame before filtering and making everything .loc, as XYZ helped me with. Before you start to filter, use DataFrame.copy(), where DataFrame is the name of your own dataframe.
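For anyone who wants to see it in one place, a minimal sketch of that fix (the file and column names are placeholders):
import pandas as pd

original_df = pd.read_excel("my_file.xlsx")  # hypothetical source file

# Take an explicit copy of the column subset before filtering, so the later
# .loc assignments modify this frame rather than a view of original_df
df = original_df[["ID ", "Street #", "Street Name"]].copy()

df.loc[:, "Place Holder"] = df.loc[:, "ID "].str.replace("-", "").str[:9]
# no SettingWithCopyWarning, because df owns its own data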
I use datasets that are often CSV flat files resulting from a variety of other source tables.
I import them using pandas read_csv.
For example, I get a table like this:
ID  Amount  Client   Company   Company Long Name  ID2
0   12      ClientA  CompanyA  The Company A      AA123
1   2       ClientA  CompanyA  The Company A      AA2339
2   32      ClientB  CompanyA  The Company A      AA3833
3   1       ClientB  CompanyB  The Company B      BB3933
Now, I suppose there is a "Company" table somewhere, and I would like to find a way to identify columns that are very likely to come from this company table.
So I want to check with Python whether, starting from the Company column, there are any "good candidates" for a potential Company table.
In my example, Company Long Name is a good candidate because if I group by Company and count how many unique values I have for the column Company Long Name, the answer is 1.
Also, what I wish to find is whether part of a column would be a good match as well. In my example, the first two characters of ID2 are a good fit.
Ideally, I would like to just provide a column and have the code check every other column (I have hundreds of them) and suggest good candidates, maybe with some idea of a matching score, like 99% meaning that there may be an occurrence where I have more than 1 distinct value for this company.
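A minimal sketch of the group-by idea described above (the key column name, the scoring rule, and the threshold are assumptions to adapt; it does not cover the partial-column case such as the first two characters of ID2):
import pandas as pd

def candidate_score(df: pd.DataFrame, key: str, col: str) -> float:
    # Fraction of `key` groups in which `col` has exactly one distinct value
    nunique_per_group = df.groupby(key)[col].nunique(dropna=False)
    return (nunique_per_group == 1).mean()

def suggest_candidates(df: pd.DataFrame, key: str = "Company", min_score: float = 0.95):
    # Score every other column and keep those that are (almost) functionally
    # dependent on `key`, i.e. likely to belong in a separate Company table
    scores = {col: candidate_score(df, key, col) for col in df.columns if col != key}
    return sorted(((c, s) for c, s in scores.items() if s >= min_score),
                  key=lambda x: -x[1])

# e.g. suggest_candidates(df, "Company") might return [("Company Long Name", 1.0), ...]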
I have an excel file that contains 1000+ company names in one column and about 20,000 company names in another column.
The goal is to match as many names as possible. The problem is that the names in column one (1000+) are poorly formatted, meaning that "Company Name" string can look something like "9Com(panynAm9e00". I'm trying to figure out the best way to solve this. (only 12 names match exactly)
After trying different methods, I've ended up with attempting to match 4-5 or more characters in each name, depending on the length of each string, using regex. But I'm just struggling to find the most efficient way to do this.
For instance:
Column 1
1. 9Com(panynAm9e00
2. NikE4
3. Mitrosof2
Column 2
1. Microsoft
2. Company Name
3. Nike
Take the first element in Column 1 and look for a match in Column 2. If there is no exact match, then look for a string with 4-5 of the same characters.
Any suggestions?
I would suggest reading your Excel file with pandas and pd.read_excel(), and then using fuzzywuzzy to perform your matching, for example:
import pandas as pd
from fuzzywuzzy import process, fuzz
df = pd.DataFrame([['9Com(panynAm9e00'],
['NikE4'],
['Mitrosof2']],
columns=['Name'])
known_list = ['Microsoft','Company Name','Nike']
def find_match(x):
match = process.extractOne(x, known_list, scorer=fuzz.partial_token_sort_ratio)[0]
return match
df['match found'] = [find_match(row) for row in df['Name']]
Yields:
Name match found
0 9Com(panynAm9e00 Company Name
1 NikE4 Nike
2 Mitrosof2 Microsoft
I imagine numbers are not very common in actual company names, so an initial filter step will help immensely going forward, but here is one approach that should work relatively well even without it. A bag-of-letters (like bag-of-words) approach, if you will:
Convert everything (columns 1 and 2) to lowercase
For each known company in column 2, store each unique letter, and how many times it appears (count) in a dictionary
Do the same (step 2) for each entry in column 1
For each entry in col 1, find the closest bag-of-letters (dictionary from step 2) from the list of real company names
The dictionary-distance implementation is up to you; one possible sketch follows below.
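For example, a minimal sketch using collections.Counter with a simple absolute-difference distance (the metric itself is just one possible choice):
from collections import Counter

def letter_bag(name: str) -> Counter:
    # Count each letter, ignoring case and anything that is not a letter
    return Counter(ch for ch in name.lower() if ch.isalpha())

def bag_distance(a: Counter, b: Counter) -> int:
    # Sum of absolute count differences over all letters seen in either name
    return sum(abs(a[ch] - b[ch]) for ch in set(a) | set(b))

def closest_company(dirty_name: str, known_names: list) -> str:
    dirty_bag = letter_bag(dirty_name)
    return min(known_names, key=lambda known: bag_distance(dirty_bag, letter_bag(known)))

known = ["Microsoft", "Company Name", "Nike"]
print(closest_company("9Com(panynAm9e00", known))  # Company Name
print(closest_company("Mitrosof2", known))         # Microsoft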