Details about the goal: I'm learning basic ML and I'm tasked with finding the best match between some raw city names and some normalized city names.
Expected result: The idea is to find the items that are most similar according to the Levenshtein distance and output the best match in a new column on the right of the raw-data df.
What I did: Originally I wrote a nested loop that compares each raw row with the 36k normalized rows, takes the smallest distance and its index, and stores that in the rightmost column. I quickly concluded that this isn't best practice, since you're not supposed to loop over a pandas df, and 10,000 x 36,000 comparisons was just way too much. After some searching I found the following code, which is supposed to work properly:
rawdata["Best match"] = rawdata["city"].map(lambda x: process.extractOne(x, normadata["city"])[0])
Sadly this has now been running for an hour on my computer, so I don't think it does the job either. What would you recommend to make this quicker?
Thank you for any time you spend on this.
# import libraries
import pandas as pd
import sys

!pip install fuzzywuzzy
!pip install python-Levenshtein

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

def import_data(file):
    return pd.read_csv(file, header=0, dtype=str)

rawdata = import_data("raw_cities.csv")
rawdata['city'] = rawdata['city'].map(str)

normadata = import_data("normalized_cities.csv")
normadata['city'] = normadata['city'].map(str)

rawdata["Best match"] = rawdata["city"].map(lambda x: process.extractOne(x, normadata["city"])[0])
#my tables look as such
city
0 cleron
1 aveillans
2 paray-vieille-poste
3 issac
4 rians
9995 neuville les dieppe
9996 saint andre de vezines
9997 saint-germain-de-lusignan
9998 bergues-sur-sambre
9999 santa-maria-figaniella
[10000 rows x 1 columns]
city
0 abergement clemenciat
1 abergement de varey
2 amberieu en bugey
3 amberieux en dombes
4 ambleon
35352 m'tsangamouji
35353 ouangani
35354 pamandzi
35355 sada
35356 tsingoni
[35357 rows x 1 columns]
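One thing worth trying (a sketch only, not benchmarked on these exact files): swap fuzzywuzzy for rapidfuzz, which implements the same scorers in C++ and has a process.cdist call that scores every raw city against every normalized city in one vectorized, multi-core pass instead of one Python call per pair. This assumes pip install rapidfuzz and the rawdata/normadata frames loaded as above.
import numpy as np
from rapidfuzz import fuzz, process

queries = rawdata["city"].tolist()      # ~10k raw names
choices = normadata["city"].tolist()    # ~35k normalized names

# One call computes the full score matrix (len(queries) x len(choices)) using all CPU cores.
# The matrix is large (~10k x 35k scores); if memory is tight, run the queries in chunks.
scores = process.cdist(queries, choices, scorer=fuzz.ratio, workers=-1)

best_idx = np.argmax(scores, axis=1)    # column of the best-scoring normalized city per raw city
rawdata["Best match"] = [choices[i] for i in best_idx]
If you specifically want raw Levenshtein distance rather than the ratio score, rapidfuzz also exposes a Levenshtein distance scorer that can be passed to cdist the same way (then take argmin instead of argmax).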
Related
I am trying to scrape a table from a website using pandas. The code is shown below:
import pandas as pd
url = "http://mnregaweb4.nic.in/netnrega/state_html/empstatusnewall_scst.aspx?page=S&lflag=eng&state_name=KERALA&state_code=16&fin_year=2020-2021&source=national&Digest=s5wXOIOkT98cNVkcwF6NQA"
df1 = pd.read_html(url)[3]
df1.to_excel("combinedGP.xlsx", index=False)
In the resulting Excel file, the numbers are saved as text. Since I am planning to build a file with around 1000 rows, I cannot manually change the data type. So is there another way to store them as actual values and not text? TIA
The website can be very unresponsive...
There are unwanted header rows, and two rows of column headers.
A simple way to manage this is a round trip through to_csv() and read_csv() with appropriate parameters.
import pandas as pd
import io
url = "http://mnregaweb4.nic.in/netnrega/state_html/empstatusnewall_scst.aspx?page=S&lflag=eng&state_name=KERALA&state_code=16&fin_year=2020-2021&source=national&Digest=s5wXOIOkT98cNVkcwF6NQA"
df1 = pd.read_html(url)[3]
df1 = pd.read_csv(io.StringIO(df1.to_csv(index=False)), skiprows=3, header=[0,1])
# df1.to_excel("combinedGP.xlsx", index=False)
sample after cleaning up
S.No District HH issued jobcards No. of HH Provided Employment EMP. Provided No. of Persondays generated Families Completed 100 Days
S.No District SCs STs Others Total SCs STs Others Total No. of Women SCs STs Others Total Women SCs STs Others Total
0 1.0 ALAPPUZHA 32555 760 254085 287400 20237 565 132744 153546 157490 1104492 40209 6875586 8020287 7635748 1346 148 5840 7334
1 2.0 ERNAKULAM 36907 2529 212534 251970 15500 1517 68539 85556 82270 908035 104040 3788792 4800867 4467329 2848 301 11953 15102
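If some columns still come out as text after that round trip, one more step that may help (a sketch, not verified against this exact page) is to coerce them with pd.to_numeric before writing the Excel file:
import pandas as pd

# assumes df1 is the cleaned frame from the snippet above
numeric_cols = df1.columns[2:]   # skip the S.No/District text columns; adjust as needed
df1[numeric_cols] = df1[numeric_cols].apply(pd.to_numeric, errors="coerce")  # non-numeric cells become NaN
df1.to_excel("combinedGP.xlsx", index=False)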
First of all, I have no background in computer languages and I am learning Python.
I'm trying to group some data in a data frame.
[dataframe "cafe_df_merged"]
Actually, I want to create a new data frame that shows the 'city_number', 'city' (which is a name), and also the number of cafes in that city. So it should have 3 columns: 'city_number', 'city' and 'number_of_cafe'.
However, I have tried to use groupby, and the result did not come out as I expected.
city_directory = cafe_df_merged[['city_number', 'city']]
city_directory = city_directory.groupby('city').count()
city_directory
[the result]
How should I do this? Please help, thanks.
There are likely other ways of doing this as well, but something like this should work:
import pandas as pd
import numpy as np
# Create a reproducible example
places = [[['starbucks', 'new_york', '1234']] * 5,
          [['bean_dream', 'boston', '3456']] * 4,
          [['coffee_today', 'jersey', '7643']] * 3,
          [['coffee_today', 'DC', '8902']] * 3,
          [['starbucks', 'nowwhere', '2674']] * 2]
places = [p for sub in places for p in sub]
# a dataframe containing all information
city_directory = pd.DataFrame(places, columns=['shop','city', 'id'])
# make a new dataframe with just cities and ids
# drop duplicate rows
city_info = city_directory.loc[:, ['city','id']].drop_duplicates()
# get the cafe counts (number of cafes)
cafe_count = city_directory.groupby('city').count().iloc[:,0]
# add the cafe counts to the dataframe
city_info['cafe_count'] = cafe_count[city_info['city']].to_numpy()
# reset the index
city_info = city_info.reset_index(drop=True)
city_info now yields the following:
city id cafe_count
0 new_york 1234 5
1 boston 3456 4
2 jersey 7643 3
3 DC 8902 3
4 nowwhere 2674 2
And part of the example dataframe, city_directory.tail(), looks like this:
shop city id
12 coffee_today DC 8902
13 coffee_today DC 8902
14 coffee_today DC 8902
15 starbucks nowwhere 2674
16 starbucks nowwhere 2674
Opinion: As a side note, it might be easier to get comfortable with regular Python first before diving deep into the world of pandas and numpy. Otherwise, it might be a bit overwhelming.
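Following up on the "other ways of doing this" remark: if the goal is literally the three columns from the question, a single groupby/size chain may be enough. Here is a sketch with a hypothetical stand-in for cafe_df_merged (it assumes one row per cafe, with city_number and city columns):
import pandas as pd

# hypothetical stand-in for cafe_df_merged: one row per cafe
cafe_df_merged = pd.DataFrame({
    'city_number': [1, 1, 1, 2, 2, 3],
    'city': ['seoul', 'seoul', 'seoul', 'busan', 'busan', 'daegu'],
    'cafe_name': ['a', 'b', 'c', 'd', 'e', 'f'],
})

city_directory = (
    cafe_df_merged
    .groupby(['city_number', 'city'])    # keep both identifying columns
    .size()                              # count the rows (cafes) in each group
    .reset_index(name='number_of_cafe')  # back to a 3-column dataframe
)
print(city_directory)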
Background
I have 2 data frames which have no common key on which I can merge them. Both dfs have a column that contains an "entity name". One df contains 8000+ entities and the other close to 2000.
Sample Data:
vendor_df=
Name of Vendor City State ZIP
FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101
CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102
GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102
LACKEY SHEET METAL St. Louis MO 63102
regulator_df =
Name of Entity Committies
LACKEY SHEET METAL Private
PRIMUS STERILIZER COMPANY LLC Private
HELGET GAS PRODUCTS INC Autonomous
ORTHOQUEST LLC Governmant
Problem statement:
I have to fuzzy match the entities of these two columns (Name of Vendor & Name of Entity) and get a score, so I need to know whether the 1st value of dataframe 1 (vendor_df) matches any of the 2000 entities of dataframe 2 (regulator_df).
StackOverflow Links I checked:
fuzzy match between 2 columns (Python)
create new column in dataframe using fuzzywuzzy
Apply fuzzy matching across a dataframe column and save results in a new column
Code
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
vendor_df = pd.read_excel('C:\\Users\\40101584\\Desktop\\AUS CUB AML\\Vendors_Sheet.xlsx', sheet_name=0)
regulator_df = pd.read_excel('C:\\Users\\40101584\\Desktop\\AUS CUB AML\\Regulated_Vendors_Sheet.xlsx', sheet_name=0)

compare = pd.MultiIndex.from_product([vendor_df['Name of vendor'],
                                      regulator_df['Name of Entity']]).to_series()

def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])

# compare.apply(metrics) -- either this works or the line below
result = compare.apply(metrics).unstack().idxmax().unstack(0)
Problems with the above code:
The code works if the two dataframes are small, but it takes forever when I run it on the complete dataset. The code above is taken from the 3rd link.
Is there any way to make it faster, or to make it work on a large dataset?
UPDATE 1
Can the above code be made faster if we pass or hard-code a score, say 80, so that only pairs with a fuzzy score > 80 are kept in the series/dataframe?
The solution below is faster than what I posted, but if someone has an even faster approach please tell:
matched_vendors = []
for row in vendor_df.index:
    vendor_name = vendor_df.at[row, "Name of vendor"]  # .at replaces the removed .get_value()
    for col in regulator_df.index:
        regulated_vendor_name = regulator_df.at[col, "Name of Entity"]
        matched_token = fuzz.partial_ratio(vendor_name, regulated_vendor_name)
        if matched_token > 80:
            matched_vendors.append([vendor_name, regulated_vendor_name, matched_token])
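For what it's worth, the same keep-only-scores-above-80 idea can also be pushed into a library call. Here is a sketch using rapidfuzz, a faster reimplementation of the fuzzywuzzy scorers; it assumes pip install rapidfuzz and the two frames from the question:
from rapidfuzz import fuzz, process

choices = regulator_df["Name of Entity"].tolist()

matched_vendors = []
for vendor_name in vendor_df["Name of vendor"]:
    # extract() returns (match, score, index) tuples, already filtered by score_cutoff
    for match, score, _ in process.extract(vendor_name, choices,
                                           scorer=fuzz.partial_ratio,
                                           score_cutoff=80, limit=None):
        matched_vendors.append([vendor_name, match, score])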
I've implemented the code in Python with parallel processing, which is much faster than serial computation. Furthermore, only the computations where the fuzzy metric score exceeds a threshold are carried out in parallel. Please see the link below for the code:
https://github.com/ankitcoder123/Important-Python-Codes/blob/main/Faster%20Fuzzy%20Match%20between%20two%20columns/Fuzzy_match.py
Version compatibility:
pandas version :: 1.1.5,
numpy version :: 1.19.5,
fuzzywuzzy version :: 1.1.0,
joblib version :: 0.18.0
In my case I also need to look only above 80, so I modified your code for my use case. Hope it helps.
compare = compare.apply(metrics)
compare_80 = compare[(compare['ratio'] > 80) & (compare['token'] > 80)]
I'm new to Python and I'm trying to analyse this CSV file. It has a lot of different countries (example rows below).
country iso2 iso3 iso_numeric g_whoregion year e_pop_num e_inc_100k e_inc_100k_lo
Afghanistan AF AFG 4 EMR 2000 20093756 190 123
American Samoa AS ASM 16 WPR 2003 59117 5.8 5 6.7 3 3 4
Gambia GM GMB 270 AFR 2010 1692149 178 115 254 3000 1900 4300
I want to try to obtain only specific data, so only specific countries and only specific columns (like "e_pop_num"). How would I go about doing that?
The only basic code I have is:
import csv
import itertools
f = csv.reader(open('TB_burden_countries_2018-03-06.csv'))
for row in itertools.islice(f, 0, 10):
print (row)
Which just lets me choose specific rows I want, but not necessarily the country I want to look at, or the specific columns I want.
If you can help me or point me to a guide so I can do my own learning, I'd very much appreciate it! Thank you.
I recommend you use the pandas Python library. Please follow the article linked below; the snippet here should illustrate the idea.
import pandas as pd
df1 = pd.read_csv("https://pythonhow.com/wp-content/uploads/2016/01/Income_data.csv")
# .loc selects rows and columns by label; this assumes the state names have been
# set as the dataframe index (e.g. with set_index on the state-name column)
df1.loc["Alaska":"Arkansas", "2005":"2007"]
source of this information: https://pythonhow.com/accessing-dataframe-columns-rows-and-cells/
Pandas will probably be the easiest way. https://pandas.pydata.org/pandas-docs/stable/
To get it, run:
pip install pandas
Then read the csv into a dataframe and filter it
import pandas as pd
df = pd.read_csv('TB_burden_countries_2018-03-06.csv')
df = df[df['country'] == 'Gambia']
print(df)
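To keep only certain columns as well as certain countries, column selection can be chained onto the same kind of filter (a sketch; adjust the names to match your CSV):
import pandas as pd

df = pd.read_csv('TB_burden_countries_2018-03-06.csv')

countries = ['Gambia', 'Afghanistan']   # whichever countries you want
subset = df[df['country'].isin(countries)][['country', 'year', 'e_pop_num']]
print(subset)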
with open('file') as f:
    fields = f.readline().split("\t")
    print(fields)
If you supply more details about what you want to see, the answer would differ.
I have a series of space-delimited data files in x y format as below for a dummy data set, where y represents independent sample population means for value x.
File1.dat
1 15.99
2 17.34
3 16.50
4 18.12
File2.dat
1 10.11
2 12.76
3 14.10
4 19.46
File3.dat
1 13.13
2 12.14
3 14.99
4 17.42
I am trying to compute the standard error of the mean (SEM) line-by-line to get an idea of the spread of the data for each value of x. As an example using the first line of each file (x = 1), a solution would first compute the SEM of sample population means 15.99, 10.11, and 13.13 and print the solution in format:
x1 SEMx1
...and so on, iterating for every line across the three files.
At the moment, I envisage a solution to be something along the lines of:
Read in the data using something like numpy, perhaps specifying only the line of interest for the current iteration. e.g.
import numpy as np
data1 = np.loadtxt('File1.dat')
data2 = np.loadtxt('File2.dat')
data3 = np.loadtxt('File3.dat')
Use a tool such as SciPy stats to calculate the SEM from the three sample population means extracted in step 1
Print the result to stdout
Repeat for remaining lines
While I imagine other stats packages such as R are well-suited to this task, I'd like to try and keep the solution solely contained within Python. I'm fairly new to the language, and I'm trying to get some practical knowledge in using it.
I see this as being a problem ideally suited for Scipy from what I've seen here in the forums, but haven't the vaguest idea where to start based upon the documentation.
NB: These files contain an equal number of lines.
First let's try to get just the columns of data that we need:
import numpy as np
filenames = ['File{}.dat'.format(i) for i in range(1, 4)]  # ['File1.dat', 'File2.dat', 'File3.dat']
data = [np.loadtxt(f) for f in filenames]                  # 3 arrays, each 4x2
stacked = np.vstack([arr[:, 1] for arr in data])           # keep only the y column of each file
Now we have just the data we need in a single array:
array([[ 15.99, 17.34, 16.5 , 18.12],
[ 10.11, 12.76, 14.1 , 19.46],
[ 13.13, 12.14, 14.99, 17.42]])
Next:
import scipy.stats as ss
result = ss.sem(stacked)
This gives you:
array([ 1.69761925, 1.63979674, 0.70048396, 0.59847956])
You can now print it, write it to a file (np.savetxt), etc.
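For the "x1 SEMx1" output format asked for in the question, here is a minimal sketch of that last step (assuming the three .dat files above):
import numpy as np
import scipy.stats as ss

filenames = ['File1.dat', 'File2.dat', 'File3.dat']
data = [np.loadtxt(f) for f in filenames]

x = data[0][:, 0]                                 # the x values (same in every file)
stacked = np.vstack([arr[:, 1] for arr in data])  # one row of y values per file
sem = ss.sem(stacked)                             # SEM per x value (axis=0 by default)

out = np.column_stack((x, sem))
np.savetxt('sem_by_x.dat', out, fmt='%g')         # or simply print(out)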
Since you mentioned R, let's try that too!
filenames = c('File1.dat', 'File2.dat', 'File3.dat')
data = lapply(filenames, read.table)
stacked = cbind(data[[1]][2], data[[2]][2], data[[3]][2])
Now you have:
V2 V2 V2
1 15.99 10.11 13.13
2 17.34 12.76 12.14
3 16.50 14.10 14.99
4 18.12 19.46 17.42
Finally:
apply(stacked, 1, sd) / sqrt(length(stacked))
Gives you:
1.6976192 1.6397967 0.7004840 0.5984796
This R solution is actually quite a bit worse in terms of performance, because it uses apply over all the rows to get the standard deviation (and apply is slow, since it invokes the target function once per row). I had to do it that way because base R does not offer a row-wise (or column-wise) standard deviation, and I needed sd because base R does not offer SEM either. At least you can see it gives the same results.