I work with datasets that are often flat CSV files built from a variety of other source tables.
I import them using pandas read_csv.
For example, I get a table like this:
ID  Amount  Client   Company   Company Long Name  ID2
0   12      ClientA  CompanyA  The Company A      AA123
1   2       ClientA  CompanyA  The Company A      AA2339
2   32      ClientB  CompanyA  The Company A      AA3833
3   1       ClientB  CompanyB  The Company B      BB3933
Now, I suppose there is a "Company" table somewhere, and I would like a way to find columns that are very likely to come from this Company table.
So I want to ask, in Python, whether, working from the Company column, there are any "good candidates" for a potential Company table.
In my example, Company Long Name is a good candidate because if I group by Company and count how many unique values I have for the column Company Long Name, the answer is 1.
Also, what I wish to find is whether part of a column would be a good match as well. In my example, the first two characters of ID2 are a good fit.
Ideally, I would just provide a column and the code would check every other column (I have hundreds of them) and suggest good candidates, maybe with some kind of matching score, like 99% meaning that there may be an occurrence where I have more than one distinct value for this company ID.
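As a rough sketch of that check (assuming the flat file is already loaded into a DataFrame df with the column names shown above), the score below is simply the fraction of key values that map to a single distinct value in the candidate column:

import pandas as pd

def candidate_scores(df, key='Company'):
    # For every other column, measure how often one value of `key` maps to
    # exactly one distinct value in that column (1.0 = perfect candidate).
    scores = {}
    for col in df.columns:
        if col == key:
            continue
        distinct_per_key = df.groupby(key)[col].nunique()
        scores[col] = (distinct_per_key <= 1).mean()
    return pd.Series(scores).sort_values(ascending=False)

candidate_scores(df)  # Company Long Name scores 1.0 in the example above

The partial-column case (such as the first two characters of ID2) would need an extra pass that derives prefix columns before scoring them the same way.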
I am fairly new to Python and have been using Apriori to analyse my baskets. That said, my team has requested that I identify the top 3 products sold with certain ranges, and I am unsure how to go about this considering I only have access to Excel and Python.
My data is structured in the columns listed below.
DocumentNumber - This is the sales document number
DisplayName - product display name
MasterCategory - First hierarchy of the product
Category - second product hierarchy
SubCategory - third product hierarchy
Range - Collection name
Quantity - Number of units sold on that sales document
ProductCode - Product Internal ID
The task is to identify the top 3 Ranges within Sofas (Category), and then the top 3 Occasional Chairs (Category), top 3 Coffee Tables (SubCategory) and top 3 Side Tables (SubCategory) that these are often sold with.
I cannot for the life of me figure out how to do this with apriori, and I have over 68,000 rows of transaction data with 33,059 unique transactions to scan for the data above.
Would one of you kind souls please be able to guide me in the right direction?
I have tried the Apriori algorithm in Python, but I am unsure it is the correct way to approach this problem.
You can try pandas. The code will look something like this:
import pandas as pd

# Load your data into a pandas DataFrame
df = pd.read_csv("data.csv")

# Filter to the category you're interested in (Sofas is a Category in your schema)
sofas = df[df['Category'] == 'Sofas']

# Top 3 Sofa ranges by units sold
top_3_sofa_ranges = (sofas.groupby('Range')['Quantity']
                          .sum()
                          .sort_values(ascending=False)
                          .head(3))
Repeat for the other categories.
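For the "often sold with" part, one rough, untested sketch (assuming the column names you listed): take the documents that contain a Sofa line, then rank what else appears on those same documents.

# Documents that contain at least one Sofa line
sofa_docs = df.loc[df['Category'] == 'Sofas', 'DocumentNumber'].unique()

# Everything else sold on those documents, ranked by units
sold_with = df[df['DocumentNumber'].isin(sofa_docs) & (df['Category'] != 'Sofas')]
top_3_sold_with = (sold_with.groupby('Range')['Quantity']
                            .sum()
                            .sort_values(ascending=False)
                            .head(3))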
It looks like you are studying data science. You can check Kaggle for more problems with solutions and explanations.
I have a SQLite table. Let's call it people.
Name   Age
Jane   50
John   80
Alice  46
Mark   25
Harry  33
I have another table work.
Name    work_id
Jane    1
Amanda  2
Filip   3
Alice   4
Jack    5
Harry   6
I'd like to get all rows whose name is in both people and work. But I do not only want the names. In fact I don't care about the names at all. I just need them to find the matching entries. I want the work.work_id and people.age columns of the matching rows. The result should look like this:
work_id  age
1        50
4        46
6        33
Both tables can have hundreds to thousands of entries.
I also need the difference of the two, i.e. the rows of work whose name isn't in people. But this should be solvable with the second solution I have outlined below.
I am doing this in Python3 using the builtin sqlite3 module. But this should be a purely SQL problem independent of the SQL client.
What I've tried
The obvious choice is to do an INTERSECT:
SELECT Name FROM people INTERSECT SELECT Name FROM work
As I said, I need the Name columns to find the intersection of the tables but I need the rows themselves to get the things I actually want, people.age and work.work_id, not the Names.
The internet led me to subqueries.
SELECT Name, Age FROM people where Name IN (SELECT Name FROM work)
This is a pretty powerful technique but I also need the work_id column of work so this isn't a solution.
Is this comparing each row in people with all rows of work? Is the number of comparisons SELECT Count(*) FROM people × SELECT Count(*) FROM work or is it somehow optimized?
You want to select columns from both tables and this means you need an INNER JOIN:
SELECT w.work_id,
p.Age
FROM work AS w INNER JOIN people AS p
ON p.Name = w.Name;
For the rows of work whose name isn't in people use NOT IN:
SELECT *
FROM work
WHERE Name NOT IN (SELECT Name FROM people);
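If you are running this from Python's builtin sqlite3 module, a minimal sketch would look like the following (the file name people_work.db is just a placeholder for your actual database):

import sqlite3

conn = sqlite3.connect("people_work.db")  # placeholder file name
rows = conn.execute(
    """
    SELECT w.work_id, p.Age
    FROM work AS w
    INNER JOIN people AS p ON p.Name = w.Name
    """
).fetchall()
print(rows)  # e.g. [(1, 50), (4, 46), (6, 33)]
conn.close()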
SELECT work_id, Age
FROM people AS p
JOIN work AS w ON p.Name = w.Name;
SELECT name, work_id
FROM work
WHERE name IN (
    SELECT name FROM work
    EXCEPT
    SELECT name FROM people
);
I have limited access to the MySQL database; I can only see a view called customer that contains id_customer, name, and their location:
id_customer name location
1 Andy Detro.it
2 Ben CALiforNIA
3 Mark uk
4 Niels London123
5 Pierre Paris
And there is a table called location that contains the list of cities and countries for customer locations:
id_country country id_city city
1 US 1 Detroit
1 US 2 California
2 UK 3 London
2 UK 4 Manchester
I want to clean the customer data automatically whenever there is new data in the database. I mean that if the raw data contains punctuation, numbers, or typos, it will automatically be cleaned; then the cleaned location will be used to look up its id_city in the location table. If no city is similar/matching, it will look up the id_country instead, and if there is no match at all the id_city/country will be 0. The result becomes a new table called customer_location:
id_customer id_city status
1 1 Match
2 2 Match
3 2 Country
4 3 Match
5 0 Unknown
The status is a label: if the location matched a city it will be Match, if it matched a country it will be Country, and if there is no similar name (id_city/country is 0) it will be Unknown. The location can be a city or a country, so the status tells whether it matched the city or the country.
Can someone suggest what I must do for this project? I am trying to do it with Python in a Jupyter notebook, but will that be effective for this case? I am really new to these things; sorry if I can't give enough information, and thanks in advance.
This is a very stacked question with lots of steps needed to achieve what you want. So let's dive straight in!
First, we should read the data frames from your (uncleaned) customer database and your location database:
import pandas as pd

# db_connection is your existing database connection (e.g. from SQLAlchemy or mysql.connector)
customer_df = pd.read_sql("SELECT * FROM customer", db_connection)
location_df = pd.read_sql("SELECT * FROM location", db_connection)
Now that we have the data stored in proper frames to handle them, we can start to clean the locations in your customer database. There are MANY ways to do so! Your requirements were as follows:
there is punctuation, numbers, or typos
Now let's tackle the first two issues. We can do this using a regex for cleaning out punctuation or numbers: pattern = r"[^a-zA-Z\s]"! With that pattern at hand we first clean the customer location data:
pattern = r"[^a-zA-Z\s]"
customer_df["location"] = customer_df["location"].str.replace(pattern, "", regex=True)
For your typo issues there is no one-solution-fits-all. You could use a dictionary for common mismatches, or review the database and add important ones manually. There are also a few libraries which can calculate the "distance" between the intended and actual word.
A good library (subjective opinion - though no affiliation) is FuzzyWuzzy as it allows you to use different metrics, such as the Levenshtein distance or the Jaccard similarity index!
from fuzzywuzzy import fuzz, process

# For each cleaned location, find the best-matching city name and its score
city_matches = customer_df["location"].apply(
    lambda loc: process.extractOne(loc, location_df["city"].tolist(), scorer=fuzz.token_set_ratio)
)
customer_df["city_match"] = city_matches.str[0]
customer_df["city_ratio"] = city_matches.str[1]
Note that this is just an example. You may go ahead and read the FuzzyWuzzy docs for more options!
You do need to do this twice, for both location_df["city"] and location_df["country"]. Use an algorithm of your choice (depending on the average data you're getting) - but as mentioned, with the data you included I cannot conclusively decide for you what's best to use.
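For example, a sketch of the same matching pass for the country column (mirroring the city pass above; country_match and country_ratio are just column names I chose):

country_matches = customer_df["location"].apply(
    lambda loc: process.extractOne(loc, location_df["country"].tolist(), scorer=fuzz.token_set_ratio)
)
customer_df["country_match"] = country_matches.str[0]
customer_df["country_ratio"] = country_matches.str[1]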
Now you can use a threshold value to determine whether a city / country is similar enough to be considered! A radical example: if you have lots of customers from Iran or Iraq (two very similar names), you may need to adjust the value accordingly ;)
threshold = 80  # example cutoff; tune it to your data

# Map the matched city name back to its id where the score passes the threshold
city_ids = location_df.set_index("city")["id_city"]
good_city = customer_df["city_ratio"] > threshold
customer_df.loc[good_city, "id_city"] = customer_df.loc[good_city, "city_match"].map(city_ids)
Again, please do this for both the country and city!
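A sketch of that country step, assuming the country_match / country_ratio columns from the pass above:

country_ids = location_df.drop_duplicates("country").set_index("country")["id_country"]
good_country = customer_df["country_ratio"] > threshold
customer_df.loc[good_country, "id_country"] = customer_df.loc[good_country, "country_match"].map(country_ids)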
Now, lastly, let's bring together the hard work we've done! I now create a new table with three columns: id_customer (the ids from the first table), id_city (either the city ID or the country ID, depending on the status) and status (Match = the exact city was found, Country = only the country could be matched, Unknown = no data found; in that case the default ID will be 0)!
Create the final dataframe: customer_location_df = customer_df[["id_customer"]].copy()
Now set the id_city as described, using the id_city and id_country columns filled in above:
# A city matched: take the city id
customer_location_df.loc[customer_df["id_city"].notnull(), "id_city"] = customer_df["id_city"]
# No city, but a country matched: fall back to the country id
customer_location_df.loc[
    customer_df["id_city"].isnull() & customer_df["id_country"].notnull(), "id_city"
] = customer_df["id_country"]
# Neither matched: default to 0
customer_location_df.loc[
    customer_df["id_city"].isnull() & customer_df["id_country"].isnull(), "id_city"
] = 0
Create the status column and set it:
customer_location_df.loc[customer_df["id_city"].notnull(), "status"] = "Match"
customer_location_df.loc[
    customer_df["id_city"].isnull() & customer_df["id_country"].notnull(), "status"
] = "Country"
customer_location_df.loc[
    customer_df["id_city"].isnull() & customer_df["id_country"].isnull(), "status"
] = "Unknown"
Lastly, save the customer_location_df as a new table in the database: customer_location_df.to_sql("customer_location", db_connection, if_exists="replace") (careful not to replace your main table if it's called customer_location)!
I don't know if I understood you correctly here...
[...] I mean that if the raw data contains punctuation, numbers, or typos, it will automatically be cleaned [...]
You need some sort of validation method here; you cannot achieve that directly in the database, you need to handle it in your logic before inserting the rows.
In these cases, the best solution is to prepare a picklist (multiple choice) from which end users can choose the right values.
A free text input will always be error prone.
If multiple choice is not an applicable solution in your case, then you need to put in place a list of validation rules, but you need to think about how to prevent every possible issue.
For example, in your case:
You could use a regex to clean the input:
import re

city = 'Detro.it'
cleanCity = re.sub(r'[^\w\s]', '', city)
print(cleanCity)  # Detroit
You need to play with the regex in this manner; for example, if you want to extract only letters, use [a-zA-Z]+.
In order to normalize the casing of the input you could use str.title().
After that, all the characters except the first one of each word are converted to lowercase:
city = "caliFORnia"
cleanCity = city.title()
print(cleanCity) // California
The final resulting table is obtainable via a MySQL query.
You need to JOIN the tables (here the only common field is the name of the city, which is not ideal for the ON clause; an id would be better).
In order to get the derived column 'Status' you can leverage the MySQL CASE expression.
Example:
SELECT field1,field2,..,
CASE
WHEN field1 = field2 THEN "Match"
ELSE "Unmatch"
END AS Derived_Col
FROM table;
Result:
field1 field2 Derived_col
sometxt sometxt Match
another other Unmatch
This code will generate a very simple dummy dataframe, where people filled a survey form:
import pandas as pd

df2 = pd.DataFrame({
    'name': ['John', 'John', 'John', 'Rachel', 'Rachel', 'Rachel'],
    'gender': ['Male', 'Male', 'Male', 'Female', 'Female', 'Female'],
    'age': [40, 40, 40, 39, 39, 39],
    'SurveyQuestion': ['Married?', 'HasKids?', 'Smokes?', 'Married?', 'HasKids?', 'Smokes?'],
    'answers': ['Yes', 'Yes', 'No', 'Yes', 'No', 'No']
})
The output looks like this:
     name  gender  age SurveyQuestion answers
0    John    Male   40       Married?     Yes
1    John    Male   40       HasKids?     Yes
2    John    Male   40        Smokes?      No
3  Rachel  Female   39       Married?     Yes
4  Rachel  Female   39       HasKids?      No
5  Rachel  Female   39        Smokes?      No
Because of the way the table is structured, with each question having its own row, we see that the first 3 columns always contain the same info, as it's just repeating the info based on the person that filled in the survey.
It would be better to visualize the dataframe as a pivot-table, similar to the following:
df2.pivot(index='name',columns='SurveyQuestion',values='answers')
However, doing it this way results in many of the previous columns being lost, since only 1 column can be used as the index.
I'm wondering what the most straightforward way of doing this would be that didn't involve an extra step of rejoining the columns.
You can use df.pivot_table:
In [27]: df2.pivot_table(values='answers', index=['name','gender','age'], columns='SurveyQuestion', aggfunc='first')
Out[27]:
SurveyQuestion HasKids? Married? Smokes?
name gender age
John Male 40 Yes Yes No
Rachel Female 39 No Yes No
OR, you can use df.pivot with df.set_index, like this:
In [30]: df = df2.set_index(['name', 'gender', 'age'])
In [32]: df.pivot(index=df.index, columns='SurveyQuestion')['answers']
Out[32]:
SurveyQuestion HasKids? Married? Smokes?
name gender age
John Male 40 Yes Yes No
Rachel Female 39 No Yes No
I'm not sure there are any existing algorithms to do this for you, but I've had a similar problem in my projects.
If you're trying to condense the rows in your table, first you need to make sure every person can have the same columns applied to them. For example, you can't reasonably do this if you didn't ask the 'HasKids?' question to Rachel unless you include an N/a option.
After this, order the table by some unique ID; that way any repeated people will definitely be next to each other in the table.
Then iterate through the table, and every time you hit a row that is the same as the last, take whatever unique information it has, add it to the original row for that person, and delete the repeat. If this is done for the whole table you should get your pivot, as in the sketch below.
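As a rough sketch of that row-merging idea, using the df2 example from the question (sort so repeats are adjacent, then fold each repeat's question/answer pair into one record per person):

import pandas as pd

records = {}
for _, row in df2.sort_values('name').iterrows():
    # First time we see a person, keep their shared columns;
    # afterwards just fold in the new question/answer pair.
    rec = records.setdefault(row['name'], {'gender': row['gender'], 'age': row['age']})
    rec[row['SurveyQuestion']] = row['answers']

condensed = pd.DataFrame.from_dict(records, orient='index')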
I am working on a project in which I scraped NBA data from ESPN and created a DataFrame to store it. One of the columns of my DataFrame is Team. Certain players that have been traded within a season have a value such as LAL/LAC under team, rather than just having one team name like LAL. With these rows of data, I would like to make 2 entries instead of one. Both entries would have the same, original data, except for 1 of the entries the team name would be LAL and for the other entry the team name would be LAC. Some team abbreviations are 2 letters while others are 3 letters.
I have already managed to create a separate DataFrame with just these rows of data that have values in the form team1/team2. I figured a good way of getting the data the way I want it would be to first copy this DataFrame with the multiple team entries, and then with one DataFrame, keep everything in the Team column up until the /, and with the other, keep everything in the Team column after the slash. I'm not quite sure what the code would be for this in the context of a DataFrame. I tried the following but it is invalid syntax:
first_team = first_team['TEAM'].str[:first_team[first_team['TEAM'].index("/")]]
where first_team is my DataFrame with just the entries with multiple teams. Perhaps this can give you a better idea of what I'm trying to accomplish!
Thanks in advance!
You're probably better off using split first to separate the teams into columns (also see Pandas DataFrame, how do i split a column into two), something like this:
import pandas as pd

d = pd.DataFrame({'player': ['jordan', 'johnson'], 'team': ['LAL/LAC', 'LAC']})
pd.concat([d, pd.DataFrame(d.team.str.split('/').tolist(), columns=['team1', 'team2'])], axis=1)
player team team1 team2
0 jordan LAL/LAC LAL LAC
1 johnson LAC LAC None
Then if you want separate rows, you can use append.
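On newer pandas versions, where DataFrame.append has been removed, a sketch of the separate-rows step using str.split plus explode would be:

d = pd.DataFrame({'player': ['jordan', 'johnson'], 'team': ['LAL/LAC', 'LAC']})
d = d.assign(team=d['team'].str.split('/')).explode('team', ignore_index=True)
print(d)
# one row per team: jordan/LAL, jordan/LAC, johnson/LAC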