find different string in one row pandas - python

I want to return every match from the following text for the keywords find = ['gold', 'mining', 'silver', 'steel'], joined into one string, but it turns out it just prints the first one that appears.
One of the rows in output.csv:
desc
"The **2014 Orkney earthquake** occurred at 12:22:33 SAST on 5 August, with the
epicentre near Orkney, a gold mining town in the Klerksdorp district in the
North West province of South Africa. The shock was assigned a magnitude of 5.5
on the Richter scale by the Council for Geoscience (CGS) in South Africa,
making it the biggest earthquake in South Africa since the 1969 Tulbagh
earthquake, which had a magnitude of 6.3 on the Richter scale. The United
States Geological Survey (USGS) estimated a focal depth of 5.0 km (3.1 mi).
The CGS reported 84 aftershocks on 5 August and 31 aftershocks on 6 August,
with a magnitude of 1.0 to 3.8 on the Richter scale. According to the CGS, the
earthquake is the biggest mining-related earthquake in South African history."
output: gold
expected output: gold, mining
Here is what I have done:
reader = pd.read_csv('output.csv', chunksize=1000)
find = ['gold','mining','silver','steel']
for chunk in reader:
    chunk.columns = ['desc']

    def process(x):
        for s in find:
            if s in x['desc']:
                print('', s)
                return s
        return ''

    chunk['place'] = chunk.apply(lambda x: (process(x)), axis=1)
    chunk = chunk.drop(chunk[chunk['place'] == ''].index).reset_index(drop=True)
    print(chunk)
How to join the result?
EDIT
def preprocess_patetnt(name):
    reader = pd.read_csv('output.csv', chunksize=1000)
    sname = sorted(name, key=len, reverse=True)
    for chunk in reader:
        chunk.columns = ['row', 'desc']
        chunk['place'] = chunk["desc"].str.findall("|".join(sname)).apply(set)
        chunk = chunk.drop(chunk[chunk['place'] == ''].index).reset_index(drop=True)
        print(chunk)

place = pd.read_csv('country.csv', chunksize=13000000, error_bad_lines=False)
for chunk in place:
    chunk.columns = ['name']
    preprocess_patetnt(chunk["name"])
country.csv is a list of country names like the following:
country.csv
and here is output.csv:
output.csv
But it gives me this error: re.error: bad character range á-K at position 77230

Your process function returns as soon as it gets the first hit. Instead, you should store all your hits and return them together. Use a list comprehension for this kind of loop,
and the str.join(iterable) method to join the list into a string (I'm guessing here that sname is actually find).
reader = pd.read_csv('output.csv', chunksize=1000)
find = ['gold','mining','silver','steel']
for chunk in reader:
    chunk.columns = ['desc']

    def process(x):
        return ','.join([s for s in find if s in x['desc']])

    chunk['place'] = chunk.apply(lambda x: (process(x)), axis=1)
    chunk = chunk.drop(chunk[chunk['place'] == ''].index).reset_index(drop=True)
    print(chunk)
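As for the re.error in your EDIT: joining thousands of raw country names into a single regex pattern leaves metacharacters (like a - inside a name) unescaped, which is what produces a "bad character range". A minimal sketch of a fix, assuming the same sname list from your edit, is to escape each name before joining:

import re

sname = sorted(find, key=len, reverse=True)
pattern = "|".join(re.escape(s) for s in sname)   # escape -, (, ), etc. in the names
chunk['place'] = chunk['desc'].str.findall(pattern).apply(set)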
happy coding!

Related

How to retrieve rows in dataframe as strings?

I have a probably stupid question for you guys, but I cannot find a working solution to my problem so far. I have a data frame provided via automatic input, and I am transforming this data as follows (not really relevant to my question):
import pandas as pd
import numpy as np

n = int(input())
matrix = []
for n in range(n+1):                    # loop as long as expected # rows to be input
    new_row = input()                   # get new row input
    new_row = list(new_row.split(","))  # make it a list
    matrix.append(new_row)              # update the matrix
mat = pd.DataFrame(data=matrix[1:], index=None, columns=matrix[0])
mat.iloc[:,1] = pd.to_numeric(mat.iloc[:,1])
mat.iloc[:,2] = pd.to_numeric(mat.iloc[:,2])
mat.iloc[:,1] = round(mat.iloc[:,1] / mat.iloc[:,2])
mat2 = mat[['state', 'population']].head(5)
mat2['population'] = mat2['population'].astype(int)
mat2 = mat2.sort_values(by=['population'], ascending=False)
mat2 = mat2.to_string(index=False, header=False)
print(mat2)
the answer I am getting is equal to:
  New York 354
   Florida 331
California 240
  Illinois 217
     Texas 109
Nicely formatted etc. However, I need to retrieve my data in string format, as:
New York 354
Florida 331
California 240
Illinois 217
Texas 109
I have already tried changing the ending of my code to:
#mat2 = mat2.to_string(index=False, header=False)
print(mat2.iloc[1,:])
to retrieve e.g. first row, but then console returns:
state Florida
population 331
Name: 2, dtype: object
How can I simply access the data from my cells and format it in string format?
Thanks!
After mat2 = mat2.to_string(index=False, header=False), mat2 becomes a string you can transform to your liking. For instance, you could do:
>>> lines = mat2.split('\n')
>>> without_format = [line.strip() for line in lines]
>>> without_format
['New York 354',
 'Florida 331',
 'California 240',
 'Illinois 217',
 'Texas 109']
Where .strip() will remove any whitespace before or after the string.
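Alternatively, a sketch assuming mat2 is still a DataFrame (i.e. before the to_string line): build the row strings straight from the columns, which avoids having to strip alignment whitespace afterwards:

>>> rows = ['{} {}'.format(s, p) for s, p in zip(mat2['state'], mat2['population'])]
>>> rows
['New York 354', 'Florida 331', 'California 240', 'Illinois 217', 'Texas 109']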

For loop: writing dictionary iteration output to excel using pandas?

I am trying to iterate over a dictionary whose values contain multiple row indexes, and then apply the pd.nsmallest function to generate the top 3 smallest values for each set of row indexes in the dictionary. However, there seems to be something wrong with my for loop: I am overwriting the top 3 values on every pass, so my final Excel output only shows the 3 rows from the last run of the loop.
When I use print statements this works as expected and I get output for all 16 entries in the dictionary, but when writing to the Excel file it only gives me the output of the last run of the loop.
import pandas as pd
from tabulate import tabulate

VA = pd.read_excel('Columnar BU P&L.xlsx', sheet_name='Variance by Co')
legcon = VA[['Expense', 'Consolidation', 'Exp Category']]
legcon['Variance Type'] = ['Unfavorable' if x < 0 else 'favorable' for x in legcon['Consolidation']]
d = {'Travel & Entertainment': [1,2,3,4,5,6,7,8,9,10,11], 'Office supplies & Expenses': [13,14,15,16,17],
     'Professional Fees': [19,20,21,22,23], 'Fees & Assessments': [25,26,27], 'IT Expenses': [29],
     'Bad Debt Expense': [31], 'Miscellaneous expenses': [33,34,35,36,37], 'Marketing Expenses': [40,41,42],
     'Payroll & Related Expenses': [45,46,47,48,49,50,51,52,53,54,55,56], 'Total Utilities': [59,60],
     'Total Equipment Maint, & Rental Expense': [63,64,65,66,67,68], 'Total Mill Expense': [70,71,72,73,74,75,76,77],
     'Total Taxes': [80,81], 'Total Insurance Expense': [83,84,85], 'Incentive Compensation': [88],
     'Strategic Initiative': [89]}
Printing output directly works fine when I do this:
for key, value in d.items():
    a = legcon.iloc[value][legcon.iloc[:,1] < 0].nsmallest(3, 'Consolidation')
    print(a)
Expense Consolidation Exp Category Variance Type
5 Transportation - AIR -19054 Travel & Entertainment Unfavorable
9 Meals -9617 Travel & Entertainment Unfavorable
7 Lodging -9439 Travel & Entertainment Unfavorable
Expense Consolidation Exp Category Variance Type
26 Bank Charges -4320 Fees & Assessments Unfavorable
27 Finance Charges -1389 Fees & Assessments Unfavorable
25 Payroll Fees -1145 Fees & Assessments Unfavorable
However when I use the below code to write to excel:
writer = pd.ExcelWriter('testt.xlsx', engine='xlsxwriter')
row = 0
for key, value in d.items():
    a = legcon.iloc[value][legcon.iloc[:,1] < 0].nsmallest(3, 'Consolidation')
    for i in range(0, 16):
        a.to_excel(writer, sheet_name='test', startrow=row+4, index=False)
writer.save()
my output does not show all the exp categories; only the last set written appears.
I would really appreciate any feedback on how to correct this. Thanks in advance!
With some help from a friend I just realized my silly mistake: there was no row iterator in my for loop, so each output was written over the previous one instead of on the next lines. Using the code below fixed the issue (initially I had placed the row iterator within my df.to_excel statement):
writer = pd.ExcelWriter('testt.xlsx', engine='xlsxwriter')
row = 0
for key, value in d.items():
    a = legcon.iloc[value][legcon.iloc[:,1] < 0].nsmallest(3, 'Consolidation')
    a.to_excel(writer, sheet_name='Testt', startrow=row, index=False)
    row = row + 4
writer.save()
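A hedged alternative (my own variation on the same legcon and d, not the poster's code): collect the per-category results first and write the sheet once, which sidesteps the startrow bookkeeping entirely:

frames = [legcon.iloc[value][legcon.iloc[:,1] < 0].nsmallest(3, 'Consolidation')
          for value in d.values()]
pd.concat(frames).to_excel('testt.xlsx', sheet_name='Testt', index=False)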

Pandas - Matching reference number to find earliest date

I'm hoping to pick your brains on optimization. I am still learning more and more about Python and using it in my day-to-day operations analyst position. One of my tasks is to sort through approx 60k unique record identifiers and search through another dataframe that has approx 120k records of interactions, the employee who authored the interaction, and the time it happened.
For Reference, the two dataframes at this point look like:
main_data = Unique Identifier Only
nok_data = Authored By Name, Unique Identifier (known as Case File Identifier), Note Text, Created On.
My current setup sorts through and matches the data at approximately 2500 rows per minute, so a full run takes about 25-30 minutes. What I am curious about is whether there are any steps I performed that are:
Redundant and inefficient, slowing my whole process down
A poor use of syntax to work around my lack of knowledge
Below is my code:
import csv
import pandas as pd

nok_data = pd.read_csv("raw nok data.csv")   # Data set from warehouse
main_data = pd.read_csv("exampledata.csv")   # Data set taken from iTx ids from referral view
row_count = 0
error_count = 0

print(nok_data.columns.values.tolist())
print(main_data.columns.values.tolist())     # Commented out, used to grab header titles if needed.

data_length = len(main_data)                 # used for counting how many records are left.
earliest_nok = {}
nok_data["Created On"] = pd.to_datetime(nok_data["Created On"])  # convert all dates to datetime at the beginning.

for row in main_data["iTx Case ID"]:
    list_data = []
    nok = nok_data["Case File Identifier"] == row
    matching_dates = nok_data[["Created On", "Authored By Name"]][nok == True]  # takes created-on date only if nok shows the row was true
    if len(matching_dates) > 0:
        try:
            min_dates = matching_dates.min(axis=0)
            earliest_nok[row] = [min_dates[0], min_dates[1]]
        except ValueError:
            error_count += 1
            earliest_nok[row] = None
    row_count += 1
    print("{} out of {} records".format(row_count, data_length))

with open('finaloutput.csv', 'w', newline='') as csv_file:  # 'w' with newline='' for Python 3's csv module
    writer = csv.writer(csv_file)
    for key, value in earliest_nok.items():
        writer.writerow([key, value])
Looking for any advice or expertise from those who have been writing code like this much longer than I have. I appreciate all of you who even just took the time to read this. Happy Tuesday,
Andy M.
**** EDIT REQUESTED TO SHOW DATA
Sorry for my novice move there of not including any example data.
main_data example
ITX Case ID
2017-023597
2017-023594
2017-023592
2017-023590
nok_data aka "raw nok data.csv"
Authored By: Case File Identifier: Note Text: Authored on
John Doe 2017-023594 Random Text 4/1/2017 13:24:35
John Doe 2017-023594 Random Text 4/1/2017 13:11:20
Jane Doe 2017-023590 Random Text 4/3/2017 09:32:00
Jane Doe 2017-023590 Random Text 4/3/2017 07:43:23
Jane Doe 2017-023590 Random Text 4/3/2017 7:41:00
John Doe 2017-023592 Random Text 4/5/2017 23:32:35
John Doe 2017-023592 Random Text 4/6/2017 00:00:35
It looks like you want to group on the Case File Identifier and get the minimum date and corresponding author.
# Sort the data by `Case File Identifier:` and `Authored on` date
# so that you can easily get the author corresponding to the min date using `first`.
nok_data.sort_values(['Case File Identifier:', 'Authored on'], inplace=True)

df = (
    nok_data[nok_data['Case File Identifier:'].isin(main_data['ITX Case ID'])]
    .groupby('Case File Identifier:')[['Authored on', 'Authored By:']].first()
)

d = {k: [v['Authored on'], v['Authored By:']] for k, v in df.to_dict('index').items()}
>>> d
{'2017-023590': ['4/3/17 7:41', 'Jane Doe'],
'2017-023592': ['4/5/17 23:32', 'John Doe'],
'2017-023594': ['4/1/17 13:11', 'John Doe']}
>>> df
Authored on Authored By:
Case File Identifier:
2017-023590 4/3/17 7:41 Jane Doe
2017-023592 4/5/17 23:32 John Doe
2017-023594 4/1/17 13:11 John Doe
It is probably easier to use df.to_csv(...).
The items from main_data['ITX Case ID'] where there is no matching record have been ignored but could be included if required.
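For completeness, a minimal sketch of that to_csv route (using the df built above; the filename is just an example):

df.to_csv('finaloutput.csv', header=False)   # writes one line per Case File Identifier: id, date, author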

Need to find a word in a CSV file based off of conditional

I am truly stuck. My task is to filter the dates of a 5000-record CSV to find a specific date range, order them ascending, and then take the fields of a different column, which together form a sentence. I have been able to successfully filter and order the dates, but my problem now is that I don't know how to get the words that correspond with those rows. Here is the code:
#!/usr/bin/python3
import csv
import time

def finder():
    with open('sample_data.csv', encoding="utf8") as csvfile:
        reader = csv.DictReader(csvfile)
        r = []   # This will hold our ID numbers for rows
        c = []   # This will hold our initial dates that are filtered out from the main csv
        l = []   # This will hold our sorted dates from c
        w = []   # This will hold our words
        sentence = ''  # This will be our sentence

        # Filter out created_at dates we don't care about
        def filterDates():
            for row in reader:
                createdOn = float(row['created_at'])
                d = time.strftime('%Y-%m-%d', time.localtime(createdOn))  # Converts dates
                if d < '2014-06-22':
                    pass
                else:
                    c.append(d)
        filterDates()

        def sort(c):
            for i in c:
                if i > '2014-06-22' and i < '2014-07-22':
                    l.append(i)
                    l.sort(reverse=False)
                else:
                    pass
        sort(c)

        def findWords(l):
            for row in reader:
                words = row['word']
                for x in range(l):
                    print(words[0])
        findWords(l)

finder()
I know this code is probably sloppy and all over the place. I saw it as a challenge for a job and thought I could do it easily, but apparently my Python isn't quite up to snuff. I haven't used Python CSV before. I will say right off the bat that I no longer plan to apply for this job, but this will drive me crazy if I can't figure it out. I've already spent hours trying different things, my issue lies in how to take the rows that have the correct dates and get the words.
All suggestions and help are appreciated! For my own sanity, I need to figure this out.
Thanks,
RDD
Data Sample:
id created_at first_name last_name email gender company currency word drug_brand drug_name drug_company pill_color frequency token keywords
1 1309380645 Stephanie Franklin sfranklin0#sakura.ne.jp Female Latz IDR transitional SUNLEYA Age minimizing sun care AVOBENZONE, OCTINOXATE, OCTISALATE, OXYBENZONE C.F.E.B. Sisley Maroon Yearly ______T______h__e________ _______N__e__z_____p______e_____________d______i______a_____n__ _____h__i__v__e___-_____m___i____n__d__ _____________f ________c_______h__a__________s_.__ _Z________a_____l_____g________o__._ est risus auctor sed tristique in
2 1237178109 Michelle Fowler mfowler1#oracle.com Female Skipstorm EUR flexibility Medulla Arnica Medulla Arnica Uriel Pharmacy Inc. Yellow Once _____ morbi vestibulum velit id
3 1303585711 Betty Barnes bbarnes2#howstuffworks.com Female Skibox IDR workforce Rash Relief Zinc Oxide Dimethicone Touchless Care Concepts LLC Purple Monthly ___ ac est lacinia
4 1231175716 Jerry Rogers jrogers3#canalblog.com Male Cogibox IDR content-based up and up acid controller complete Famotidine, Calcium Carbonate, Magnesium Hydroxide Target Corporation Maroon Daily NIL augue a suscipit nulla elit
5 1236709011 Harry Garrett hgarrett4#mlb.com Male Yotz RUB coherent Vistaril HYDROXYZINE PAMOATE Pfizer Laboratories Div Pfizer Inc Orange Never �_nb_l_ _u___ __olop __ __oq_l _n _unp_p__u_ _od___ po_sn__ op p_s '__l_ _u__s_d_p_ _n_____suo_ '____ __s _olop _nsd_ ___o_ morbi ut odio cras
6 1400030214 Lori Martin lmartin5#apache.org Female Aivee EUR software Fluorouracil Fluorouracil Taro Pharmaceutical Industries Ltd. Pink Daily _ dui vel sem
7 1368791435 Joe Turner jturner6#elpais.com Male Mycat IRR tangible Sulfacetamide Sodium Sulfacetamide Sodium Paddock Laboratories, LLC Aquamarine Often 1;DROP TABLE users nulla facilisi cras non velit
8 1394919241 Ruth Bryant rbryant7#dell.com Female Browsecat IDR incremental Pollens - Trees, Mesquite, Prosopis juliflora Mesquite, Prosopis juliflora Jubilant HollisterStier LLC Aquamarine Weekly ___________ et magnis dis
9 1352948920 Cynthia Lopez clopez8#gov.uk Female Twitterbeat USD Up-sized Ideal Flawless Octinoxate, Titanium Dioxide Avon Products, Inc Red Daily (_�_�___ ___) purus eu magna
10 1319910259 Phillip Ross pross9#ehow.com Male Buzzshare VEF data-warehouse Serotonin Serotonin BioActive Nutritional Orange Weekly __ vel sem
Okay, so after some tweaking and great help from Westley White, I was able to get this functioning! I have it condensed into one nested function that is doing what it is supposed to! Here is the code:
#!/usr/bin/python3
import csv
import time

def finder():
    with open('sample_data.csv', 'r', encoding='latin-1') as csvfile:
        reader = csv.DictReader(csvfile)

        def dates(reader):
            # Set up variables
            date_range = []
            sentence = []
            # Initiate iteration through CSV
            for row in reader:
                createdOn = float(row['created_at'])
                words = str(row['word'])
                d = time.strftime('%Y-%m-%d', time.localtime(createdOn))  # Converts dates
                if d >= '2014-06-22' and d <= '2014-07-22':
                    date_range.append(d)
                    date_range.sort()
                    for word in words:
                        if d in date_range:
                            sentence.append(word)
            print(sentence)
        dates(reader)

finder()
There is only one problem left. When sentence[] appends, it appends each character one at a time. I don't know how to go about combining the letters into the words from the CSV column without combining them all together. Any ideas?
Thanks!
I don't know how the data is formatted, but here is my attempt.
import csv
import time

def finder(start_date='2014-06-22', end_date='2014-07-22'):
    """
    :param start_date: Starting date
    :param end_date: Ending date
    """
    def filterDates(reader):
        datelist = []
        for row in reader:
            created_on = float(row['created_at'])
            d = time.strftime('%Y-%m-%d', time.localtime(created_on))  # Converts dates
            # Is between starting and ending dates
            if d >= start_date and d <= end_date:
                # Going to use the created_on value so we don't have to reformat it again
                datelist.append(created_on)
        return datelist

    def findWords(reader, datelist):
        for row in reader:
            if float(row['created_at']) in datelist:
                words = row['word']
                for word in words:
                    print(word)

    with open('sample_data.csv', encoding="utf8") as csvfile:
        datelist = filterDates(csv.DictReader(csvfile))
        datelist.sort()   # list.sort() sorts in place and returns None, so don't reassign
        csvfile.seek(0)   # rewind: the first pass exhausted the file
        findWords(csv.DictReader(csvfile), datelist)

finder('2014-06-22', '2014-07-22')
EDIT:
If you want to add each word to a list use this
Add this outside of the loop
sentence_list = []
change
words = row['word']
to
word = row['word']
then change
for word in words:
print(word)
to
sentence_list.append(word)
If you want to use a string add this outside of the loop
sentence = ""
Then when you print the word, just add it to the sentence
# adding a Word to the sentence
sentence = "{} {}".format(sentence, word)
and finally, add this at the bottom, outside of the loop
print(sentence)
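Putting the edit together, here is a minimal sketch of the corrected findWords loop (my consolidation of the steps above, not the answerer's verbatim code):

sentence_list = []
for row in reader:
    if float(row['created_at']) in datelist:
        word = row['word']            # take the whole field, not one character at a time
        sentence_list.append(word)

sentence = " ".join(sentence_list)    # join the words into a single string
print(sentence)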

How to smartly match two data frames using Python (using pandas or other means)?

I have one pandas dataframe composed of the names of the world's cities as well as the countries those cities belong to,
city.head(3)
city country
0 Qal eh-ye Now Afghanistan
1 Chaghcharan Afghanistan
2 Lashkar Gah Afghanistan
and another data frame consisting of addresses of the world's universities, which is shown below:
df.head(3)
university
0 Inst Huizhou, Huihzhou 516001, Guangdong, Peop...
1 Guangxi Acad Sci, Nanning 530004, Guangxi, Peo...
2 Shenzhen VisuCA Key Lab SIAT, Shenzhen, People...
The locations of cities' names are irregularly distributed across rows. I would like to match the city names with the addresses of world's universities. That is, I would like to know which city each university is located in. Hopefully, the city name matched is shown in the same row as the address of each university.
I've tried the following, and it doesn't work because the locations of cities are irregular across the rows.
df['university'].str.split(',').str[0]
I would suggest using apply:
city_list = city['city'].tolist()   # take the city column as a plain list

def match_city(row):
    for c in city_list:
        if c in row['university']:
            return c
    return 'None'

df['city'] = df.apply(match_city, axis=1)
I assume the university address data is clean enough. If you want to do more advanced matching, you can adjust the match_city function.
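For instance, a hedged variation of match_city (my own sketch, not the answerer's code) that matches on word boundaries, so a short city name cannot match inside a longer word:

import re

def match_city_strict(row):
    for c in city_list:
        if re.search(r'\b' + re.escape(c) + r'\b', row['university']):
            return c
    return 'None'

df['city'] = df.apply(match_city_strict, axis=1)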
In order to deal with the inconsistent structure of your strings, a good solution is to use regular expressions. I mocked up some data based on your description and created a function to capture the city from the strings.
In my solution I used numpy to output NaN values when there wasn't a match, but you could easily just make it a blank string. I also included a test case where the input was blank in order to display the NaN result.
import re
import numpy as np
import pandas as pd

data = ["Inst Huizhou, Huihzhou 516001, Guangdong, People's Republic of China",
        "Guangxi Acad Sci, Nanning 530004, Guangxi, People's Republic of China",
        "Shenzhen VisuCA Key Lab SIAT, Shenzhen, People's Republic of China",
        "New York University, New York, New York 10012, United States of America",
        ""]
df = pd.DataFrame(data, columns=['university'])

def extract_city(row):
    # capture the text between the first and second commas, then strip digits
    match = re.match('^[^,]*,([^,]*),', row)
    if match:
        city = re.sub(r'\d+', '', match.group(1)).strip()
    else:
        city = np.nan
    return city

df.university.apply(extract_city)
Here's the output:
0 Huihzhou
1 Nanning
2 Shenzhen
3 New York
4 NaN
Name: university, dtype: object
My suggestion is that, after some pre-processing that reduces each address to city-level information (we don't need to be exact, but try your best, like removing numbers etc.), you merge the dataframes based on text similarity.
You may consider text similarity measures like the Levenshtein distance or Jaro-Winkler, which are commonly used to match words.
Here is an example of text similarity:
class DLDistance:
    def __init__(self, s1):
        self.s1 = s1
        self.d = {}
        self.lenstr1 = len(self.s1)
        for i in range(-1, self.lenstr1 + 1):
            self.d[(i, -1)] = i + 1

    def distance(self, s2):
        lenstr2 = len(s2)
        for j in range(-1, lenstr2 + 1):
            self.d[(-1, j)] = j + 1
        for i in range(self.lenstr1):
            for j in range(lenstr2):
                if self.s1[i] == s2[j]:
                    cost = 0
                else:
                    cost = 1
                self.d[(i, j)] = min(
                    self.d[(i-1, j)] + 1,       # deletion
                    self.d[(i, j-1)] + 1,       # insertion
                    self.d[(i-1, j-1)] + cost,  # substitution
                )
                if i and j and self.s1[i] == s2[j-1] and self.s1[i-1] == s2[j]:
                    self.d[(i, j)] = min(self.d[(i, j)], self.d[i-2, j-2] + cost)  # transposition
        return self.d[self.lenstr1 - 1, lenstr2 - 1]

if __name__ == '__main__':
    base = u'abs'
    cmpstrs = [u'abs', u'sdfbasz', u'asdf', u'hfghfg']
    dl = DLDistance(base)
    for s in cmpstrs:
        print("damerau_levenshtein")
        print(dl.distance(s))
Even so, this has high computational complexity, since it computes the distance measure N*M times, where N is the number of rows in the first dataframe and M the number of rows in the second. (To reduce the cost, you can restrict comparisons to rows whose first characters match.)
levenshtein distance: https://en.wikipedia.org/wiki/Levenshtein_distance
jaro-winkler: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
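If you'd rather not hand-roll the distance, the standard library's difflib offers a similar ratio-based matcher; a small sketch (the sample names are just illustrations):

from difflib import get_close_matches

known_cities = ['Huizhou', 'Nanning', 'Shenzhen', 'New York']
print(get_close_matches('Huihzhou', known_cities, n=1, cutoff=0.8))   # ['Huizhou']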
I think one simple idea would be to create a mapping from any word (or sequence of words) of an address to the full address that word is part of, on the assumption that one of those address words is the city. In a second step we match this against the set of known cities that you have, and anything that is not a known city gets discarded.
A mapping from each single word to address is as simple as:
def address_to_dict(address):
    return {word: address for word in address.split(",")}
And we can easily extend this to include the set of bi-grams, tri-grams, ... so that cities spelled with several fields are also collected; a possible sketch follows. See a discussion here: Elegant N-gram Generation in Python
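A possible sketch of such an n-gram extension (my own illustration; the comma-separated fields are treated as the tokens):

def address_to_dict_ngrams(address, max_n=3):
    # map every run of 1..max_n comma-separated fields to the full address
    fields = [f.strip() for f in address.split(",")]
    return {", ".join(fields[i:i+n]): address
            for n in range(1, max_n + 1)
            for i in range(len(fields) - n + 1)}

The pipeline below keeps using the simple single-word address_to_dict.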
We can then apply this to every address we have to obtain one grand mapping from any word to the full address:
word_to_address_mapping = pd.DataFrame(df.university.apply(address_to_dict).tolist()).stack()
word_to_address_mapping = pd.DataFrame(word_to_address_mapping, columns=["address"])
word_to_address_mapping.index = word_to_address_mapping.index.droplevel(level=0)
word_to_address_mapping
This yields a long table mapping every address word to the full address it came from.
All you have to do then is join this with the actual city list you have: this will automatically discard any entry in word_to_address_mapping which is not a known city, and provide a mapping between university address and their city.
# the outer join here should ensure that several universities in the
# same city do not overwrite each other
pd.merge(left=word_to_address_mapping, right=city,
         left_index=True, right_on="city",
         how="outer")
Partial matches are prevented in the function below. Country information is also taken into account while matching cities. To use this function, the university dataframe needs to be split into lists, so that every address becomes a list of strings.
In [22]: def get_city(univ_name_split):
   ....:     # find country from university address
   ....:     for name in univ_name_split:
   ....:         if name in city['country'].values:
   ....:             country = name
   ....:         else:
   ....:             country = None
   ....:     if country:
   ....:         cities = city[city.country == country].city.values
   ....:     else:
   ....:         cities = city['city'].values
   ....:     # find city from university address
   ....:     for name in univ_name_split:
   ....:         if name in cities:
   ....:             return name
   ....:     else:
   ....:         return None
   ....:
In [1]: import pandas as pd
In [2]: city = pd.read_csv('city.csv')
In [3]: df = pd.read_csv('university.csv')
In [4]: # splitting university name and address
In [5]: df_split = df['university'].str.split(',')
In [6]: df_split = df_split.apply(lambda x:[i.strip() for i in x])
In [10]: df
Out[10]:
university
0 Kongu Engineering College, Perundurai, Erode, ...
1 Anna University - Guindy, Chennai, India
2 Birla Institute of Technology and Science, Pil...
In [11]: df_split
Out[11]:
0 [Kongu Engineering College, Perundurai, Erode,...
1 [Anna University - Guindy, Chennai, India]
2 [Birla Institute of Technology and Science, Pi...
Name: university, dtype: object
In [12]: city
Out[12]:
city country
0 Bangalore India
1 Chennai India
2 Coimbatore India
3 Delhi India
4 Erode India
# This function is a shorter version of the above function
In [14]: def get_city(univ_name_split):
   ....:     for name in univ_name_split:
   ....:         if name in city['city'].values:
   ....:             return name
   ....:     else:
   ....:         return None
   ....:
In [15]: df['city'] = df_split.apply(get_city)
In [16]: df
Out[16]:
university city
0 Kongu Engineering College, Perundurai, Erode, ... Erode
1 Anna University - Guindy, Chennai, India Chennai
2 Birla Institute of Technology and Science, Pil... None
I've created a tiny library for my projects, especially for fuzzy joins. It might not be the fastest solution, but it may help; feel free to use it.
Link to my GitHub repo
