I am working on merging a few datasets regarding over 200 countries in the world. In cleaning the data I need to convert some three-letter codes for each country into the countries' full names.
The three-letter codes and country full names come from a separate CSV file, which shows a slightly different set of countries.
My question is: Is there a better way to write this?
str.replace("USA", "United States of America")
str.replace("CAN", "Canada")
str.replace("BHM", "Bahamas")
str.replace("CUB", "Cuba")
str.replace("HAI", "Haiti")
str.replace("DOM", "Dominican Republic")
str.replace("JAM", "Jamaica")
and so on. It goes on for another 200 rows. Thank you!
Since the number of substitutions is high, I would instead iterate over the words in the string and replace each one with a dictionary lookup.
mapofcodes = {'USA': 'United States of America', ....}
# Look each word up; words that are not codes are kept unchanged.
finalstr = ' '.join(mapofcodes.get(word, word) for word in mystring.split())
Try reading the CSV file into a dictionary or a 2D array; you can then access whichever entry you want.
That is, if I understand your question correctly.
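A minimal sketch of the dictionary idea, assuming the CSV has header columns named `code` and `name` (both names are hypothetical; adjust them to your file). An in-memory string stands in for the real file so the example runs on its own:

```python
import csv
import io

# Stand-in for the real file; the column names "code" and "name" are assumptions.
csv_text = "code,name\nUSA,United States of America\nCAN,Canada\nBHM,Bahamas\n"

def load_country_map(handle):
    """Build a {code: full name} dictionary from a CSV file object."""
    return {row["code"]: row["name"] for row in csv.DictReader(handle)}

country_map = load_country_map(io.StringIO(csv_text))

# Codes missing from the map are left unchanged.
codes = ["USA", "CAN", "XYZ"]
names = [country_map.get(code, code) for code in codes]
print(names)
```

With a real file you would pass `open(path, newline="")` instead of the `io.StringIO` object.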
Here's a regular expressions solution:
import re
COUNTRIES = {'USA': 'United States of America', 'CAN': 'Canada'}
def repl(m):
    country_code = m.group(1)
    return COUNTRIES.get(country_code, country_code)
p = re.compile(r'([A-Z]{3})')
my_string = p.sub(repl, my_string)
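Note that `[A-Z]{3}` matches any three capitals, including the first three letters of a longer uppercase word. A possible variant, sketched here under the assumption that codes always appear as whole words, builds the pattern from the dictionary keys instead (the helper name `replace_codes` and the sample sentence are made up for illustration):

```python
import re

COUNTRIES = {"USA": "United States of America", "CAN": "Canada"}

# Alternation of the known codes, escaped and wrapped in word boundaries,
# so only exact, whole-word codes are matched.
pattern = re.compile(r"\b(" + "|".join(map(re.escape, COUNTRIES)) + r")\b")

def replace_codes(text):
    """Replace every known country code in text with its full name."""
    return pattern.sub(lambda m: COUNTRIES[m.group(1)], text)

result = replace_codes("USA and CAN signed; NATO and USAF did not.")
print(result)
```

Because every match is guaranteed to be a key, the replacement function can index the dictionary directly.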
I have a dataframe that includes a column ['locality_name'] with names of villages, towns, cities. Some names are written like "town of Hamilton", some like "Hamilton", some like "city of Hamilton" etc. As such, it's hard to count unique values etc. My goal is to leave the names only.
I want to write a function that removes the part of a string till the capital letter and then apply it to my dataframe.
That's what I tried:
import re
def my_slicer(row):
    """
    Returns a string with the name of locality
    """
    return re.sub('ABCDEFGHIKLMNOPQRSTVXYZ','', row['locality_name'])
raw_data['locality_name_only'] = raw_data.apply(my_slicer, axis=1)
I expected it to return a new column with the names of places. Instead, nothing changed - ['locality_name_only'] has the same values as in ['locality_name'].
You can use pandas.Series.str.extract. For example:
import pandas as pd
ser = pd.Series(["town of Hamilton", "Hamilton", "city of Hamilton"])
# Raw string so that \w is passed to the regex engine untouched
ser_2 = ser.str.extract(r"([A-Z][a-z]+-?\w+)")
In your case, use:
raw_data['locality_name_only'] = raw_data['locality_name'].str.extract(r"([A-Z][a-z]+-?\w+)")
# Output:
print(ser_2)
          0
0  Hamilton
1  Hamilton
2  Hamilton
I would use str.replace and phrase the problem as removing all words that do not start with an uppercase letter:
raw_data["locality_name_only"] = raw_data["locality_name"].str.replace(r'\s*\b[a-z]\w*\s*', ' ', regex=True).str.strip()
I just started to learn Python. I have a question about matching some of the words in my dataset in Excel.
words_list contains some of the words I would like to find in the dataset.
words_list = ('tried','mobile','abc')
df is a single column extracted from the Excel file.
df =
0 to make it possible or easier for someone to do ...
1 unable to acquire a buffer item very likely ...
2 The organization has tried to make...
3 Broadway tried a variety of mobile Phone for the..
I would like to get the result like this:
'None',
'None',
'tried',
'tried','mobile'
I tried this in Jupyter:
list = [ ]
for word in df:
    if any(aa in word for aa in words_list):
        list.append(word)
    else:
        list.append('None')
print(list)
But the result shows the whole sentence from df:
'None'
'None'
'The organization has tried to make...'
'Broadway tried a variety of mobile Phone for the..'
Can I show only the words from the words list in the result?
Sorry for my English and
thank you all
I'd suggest a manipulation on the DataFrame (that should always be your first thought, use the power of pandas)
import pandas as pd
words_list = {'tried', 'mobile', 'abc'}
df = pd.DataFrame({'col': ['to make it possible or easier for someone to do',
                           'unable to acquire a buffer item very likely',
                           'The organization has tried to make',
                           'Broadway tried a variety of mobile Phone for the']})
df['matches'] = df['col'].str.split().apply(lambda x: set(x) & words_list)
print(df)
                                                col          matches
0   to make it possible or easier for someone to do               {}
1       unable to acquire a buffer item very likely               {}
2                The organization has tried to make          {tried}
3  Broadway tried a variety of mobile Phone for the  {mobile, tried}
The reason it's printing the whole line has to do with your:
for word in df:
Your "word" variable is actually taking the whole line. Then it's checking the whole line to see if it contains your search word. If it does find it, then it basically says, "yes, I found ____ in this line, so append the line to your list."
What it sounds like you want to do is first split the line into words, and THEN check.
matches = []  # not named "list", which would shadow the built-in
for line in df:
    words = line.split(" ")
    found = False
    for word in words_list:
        if word in words:
            found = True
            matches.append(word)
    # this is just to append "None" if nothing was found in this line
    if not found:
        matches.append("None")
print(matches)
As a side note, you may want to use pprint instead of print when working with lists. It prints lists, dictionaries, etc. in easier-to-read layouts. pprint is part of the Python standard library, so there is nothing extra to install. Usage would be something like:
from pprint import pprint
dictionary = {'firstkey':'firstval','secondkey':'secondval','thirdkey':'thirdval'}
pprint(dictionary)
I'm looking for the best way to categorise items based on keywords that may be found in the title for a clothing website please.
The categories will be the gender of the clothing item, so womens, mens, boys, girls. However, depending on the item, the titles may contain different keywords such as 'female', 'woman', 'women', 'lady's' and so on.
My thoughts are to put the keywords into a list and then cycle through the list looking for a match and then categorise accordingly.
If I follow this method though, is it possible to do this with a list within a list and cycle through that, so we could have:
gender = ['woman', [#keywords for females clothes], 'men', [#keywords for men's clothes]]
Then cycle through this and if we find a match, tag it accordingly. Alternatively it may be better to use a dictionary, have the key be the category and then a list of corresponding keywords.
Or, there could be an altogether different solution that I've completely missed. I feel there is a pretty simple solution to this but for some reason I can't seem to get my head around it. Thanks in advance.
Try this:
import pandas as pd
d = {'men': ['men', 'boy'], 'women': ['women', 'girl', 'lady']}
def classify(text):
    gender = 'None of any'
    for i in d:
        if any(j in text for j in d[i]):
            gender = i
    return gender
df = pd.DataFrame({'text':['this is a boy', 'a girl']})
df['cat'] = df['text'].apply(lambda x: classify(x))
print(df)
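One caveat with substring checks such as `j in text` is that 'men' also occurs inside 'women'. A rough sketch that compares whole words instead is shown below; the keyword lists and category names are hypothetical placeholders, and the first matching category wins:

```python
import re

# Hypothetical keyword lists; extend them to suit the real titles.
KEYWORDS = {
    "womens": ["woman", "women", "womens", "female", "lady", "ladies"],
    "mens": ["man", "men", "mens", "male"],
}

def classify(title):
    """Return the first category with a keyword present as a whole word, else None."""
    # Split the lowercased title into runs of letters, so "Women's" gives "women" and "s".
    words = set(re.findall(r"[a-z]+", title.lower()))
    for category, keywords in KEYWORDS.items():
        if words.intersection(keywords):
            return category
    return None

labels = [classify(t) for t in ["Women's Jacket", "Men's running shoes", "Unisex scarf"]]
print(labels)
```

A dictionary keyed by category keeps each keyword list next to its label, which is the second option raised in the question.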
You can use flashtext to extract keywords from a given string:
from flashtext import KeywordProcessor
kp = KeywordProcessor()
dict_ = {'sport': ['cricket','football'],'movie' : ['horror', 'drama']} # here you can add lists of words for men and women
kp.add_keywords_from_dict(dict_)
# now you can extract keyword from a given string
kp.extract_keywords('I love playing football')
#op
['sport']
kp.extract_keywords("some people don't like to watch drama and horror movie, but love to watch cricket")
#op
['movie', 'movie', 'sport']
I have a dataset with city names and counts of crimes. The data is dirty such that a name of a city for example 'new york', is written as 'newyork', 'new york us', 'new york city', 'manhattan new york' etc. How can I group all these cities together and sum their crimes?
I tried the 'difflib' package in Python, which matches strings and gives you a score. It doesn't work well. I also tried the geocode package in Python. It has limits on the number of times you can access the API, and doesn't work well either. Any suggestions?
Maybe this might help:
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
Another way: if a string contains 'new' and 'york', then label it 'new york city'.
Another way: Create a dictionary of all the possible fuzzy words that occur and label each of them manually. And use that labelling to replace each of these fuzzy words with the label.
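As a rough sketch of the manual-dictionary idea, assuming the data can be read as (city, count) pairs; the mapping entries, function names, and sample counts below are made up for illustration:

```python
from collections import defaultdict

# Hypothetical hand-built mapping from each dirty spelling to its clean label.
CITY_LABELS = {
    "newyork": "new york city",
    "new york us": "new york city",
    "new york city": "new york city",
    "manhattan new york": "new york city",
}

def normalise_city(name):
    """Return the manual label for a known variant, else the cleaned input."""
    cleaned = " ".join(name.lower().split())  # lowercase, collapse whitespace
    return CITY_LABELS.get(cleaned, cleaned)

def total_crimes(records):
    """Sum counts per normalised city; records is an iterable of (city, count)."""
    totals = defaultdict(int)
    for city, count in records:
        totals[normalise_city(city)] += count
    return dict(totals)

records = [("newyork", 10), ("New York US", 5), ("manhattan new york", 2), ("boston", 7)]
totals = total_crimes(records)
print(totals)
```

Spellings that are not in the mapping pass through unchanged, so they remain visible in the totals and can be added to the dictionary later.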
Another approach is to go through each entry and strip the white space and see if they contain a base city name. For example 'newyork', 'new york us', 'new york city', 'manhattan new york' when stripped of white space would be 'newyork', 'newyorkus', 'newyorkcity', 'manhattannewyork', which all contain the word 'newyork'.
There are two ways to do this: you can go through and replace all the 'new york' strings with versions that have no white space and are just 'newyork', or you can just check them on the fly.
I wrote down an example below, but since I don't know how your data is formatted, I'm not sure how helpful it is.
crime_count = 0
for (key, val) in dataset:
    if 'newyork' in key.replace(" ", ""):
        crime_count = crime_count + val
I have a list of strings in which one or more subsets of the strings have a common starting string. I would like a function that takes as input the original list of strings and returns a list of all the common starting strings. In my particular case I also know that each common prefix must end in a given delimiter. Below is an example of the type of input data I am talking about (ignore any color highlighting):
Population of metro area / Portland
Population of city / Portland
Population of metro area / San Francisco
Population of city / San Francisco
Population of metro area / Seattle
Population of city / Seattle
Here the delimiter is / and the common starting strings are Population of metro area and Population of city. Perhaps the delimiter won't ultimately matter but I've put it in to emphasize that I don't want just one result coming back, namely the universal common starting string Population of; nor do I want the common substrings Population of metro area / S and Population of city / S.
The ultimate use for this algorithm will be to group the strings by their common prefixes. For instance, the list above can be restructured into a hierarchy that eliminates redundant information, like so:
Population of metro area
    Portland
    San Francisco
    Seattle
Population of city
    Portland
    San Francisco
    Seattle
I'm using Python but pseudo-code in any language would be fine.
EDIT
As noted by Tom Anderson, the original problem as given can easily be reduced to simply splitting the strings and using a hash to group by the prefix. I had originally thought the problem might be more complicated because sometimes in practice I encounter prefixes with embedded delimiters, but I realize this could also be solved by simply doing a right split that is limited to splitting only one time.
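A minimal sketch of that right-split idea, assuming every string contains the delimiter at least once (otherwise the unpacking raises ValueError). The function name and the third sample string, which has an embedded delimiter, are invented for illustration:

```python
from collections import defaultdict

def group_by_prefix(strings, delimiter="/"):
    """Group suffixes by prefix, splitting only at the last delimiter."""
    groups = defaultdict(list)
    for s in strings:
        # rsplit with maxsplit=1 keeps any earlier delimiters inside the prefix.
        prefix, suffix = s.rsplit(delimiter, 1)
        groups[prefix.strip()].append(suffix.strip())
    return dict(groups)

strings = [
    "Population of metro area / Portland",
    "Population of city / Portland",
    "Income / median / Seattle",  # hypothetical prefix with an embedded delimiter
]
groups = group_by_prefix(strings)
print(groups)
```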
Isn't this just looping over the strings, splitting them on the delimiter, then grouping the second halves by the first halves? Like so:
def groupByPrefix(strings):
    stringsByPrefix = {}
    for string in strings:
        prefix, suffix = map(str.strip, string.split("/", 1))
        group = stringsByPrefix.setdefault(prefix, [])
        group.append(suffix)
    return stringsByPrefix
In general, if you're looking for string prefixes, the solution would be to whop the strings into a trie. Any branch node with multiple children is a maximal common prefix. But your need is more restricted than that.
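For illustration only, here is one way the trie idea could be sketched with nested dictionaries. The function names are made up, and the recursion assumes the strings are short enough not to hit Python's recursion limit:

```python
def build_trie(strings):
    """Build a character trie as nested dicts; the key None marks the end of a string."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[None] = {}
    return root

def branching_prefixes(node, prefix=""):
    """Yield every prefix whose trie node has more than one child."""
    if len(node) > 1:
        yield prefix
    for ch, child in node.items():
        if ch is not None:
            yield from branching_prefixes(child, prefix + ch)

strings = [
    "Population of metro area / Portland",
    "Population of metro area / Seattle",
    "Population of city / Portland",
    "Population of city / Seattle",
]
prefixes = list(branching_prefixes(build_trie(strings)))
print(prefixes)
```

This also reports the universal prefix "Population of ", which is exactly the kind of result the question wants to exclude, so splitting on the delimiter remains the simpler fit here.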
import collections
d = collections.defaultdict(list)
for place, name in ((i.strip() for i in line.split('/'))
                    for line in text.splitlines()):
    d[place].append(name)
so d will be a dict like:
{'Population of city':
['Portland',
'San Francisco',
'Seattle'],
'Population of metro area':
['Portland',
'San Francisco',
'Seattle']}
You can replace (i.strip() for i in line.split('/')) by line.split(' / ') if you know there's no extra whitespace around your text.
Using csv.reader and itertools.groupby, treat the '/' as the delimiter and group by the first column:
from csv import reader
from itertools import groupby
# inp is the open input file (or any iterable of lines)
for key, group in groupby(sorted(reader(inp, delimiter='/')), key=lambda x: x[0]):
    print(key)
    for line in group:
        print("\t", line[1])
This isn't very general, but may do what you need:
def commons(strings):
    return set(s.split(' / ')[0] for s in strings)
And to avoid going back over the data for the grouping:
def group(strings):
    groups = {}
    for s in strings:
        prefix, remainder = s.split(' / ', 1)
        groups.setdefault(prefix, []).append(remainder)
    return groups