I intend to merge two data frames, Chicago crime data and Redfin real estate data, but the Redfin data was collected by Chicago neighborhood, while the crime data was collected by community area. To bridge the two, I found a neighborhood map of Chicago and roughly figured out how to assign each neighborhood to a community area. The structure of the two dataframes is a bit different, so I did a few manipulation steps on them. Here are the details of my attempt:
Example data snippet
Here is the public gist where you can view the example data snippet.
Here is the neighborhood mapping that I collected from an online source.
My solution
Here is my first mapping solution:
code_pairs_neighborhoods = [[p[0], p[1]] for p in [pair.strip().split('\t') for pair in neighborhood_Map.strip().split('\n')]]
neighborhood_name_dic = {k[0]:k[1] for k in code_pairs_neighborhoods} #neighborhood -> community area
chicago_crime['neighborhood'] = chicago_crime['community_name'].map(neighborhood_name_dic)
Redfin['neighborhood'] = Redfin['Region'].map(neighborhood_name_dic)
final_df = pd.merge(chicago_crime, Redfin, on='neighborhood')
but this solution didn't find the correct mapping: the neighborhood column comes out as NaN, which is wrong.
Second mapping attempt:
Without using the neighborhood mapping, I intuitively came up with this solution:
chicago_crime['community_name']=[[y.split() for y in x] for x in chicago_crime['community_name']]
Redfin['Region']= [[j.split() for j in i] for i in Redfin['Region']]
idx, datavalue = [], []
for i, dv in enumerate(chicago_crime['community_name']):
    for d in dv:
        if d in Redfin['Region'][i]:
            if i not in idx:
                idx.append(i)
                datavalue.append(d)
chicago_crime['merge_ref'] = datavalue
Redfin['merge_ref'] = datavalue
final_df= pd.merge(chicago_crime[['community_area','community_name','merge_ref']], Redfin, on='merge_ref')
but this solution gave me errors: ValueError: Length of values does not match length of index, and AttributeError: 'list' object has no attribute 'split'.
How can I make this work? Based on the neighborhood mapping, how can I get the correct mapping for both the Redfin data and the Chicago crime data, and end up with the right merged dataframe? Thanks in advance.
Update:
I put all of my solutions, including the datasets, in this GitHub repository: all solution and data on github
Ok, here's what I found:
there is a Unicode character (an en dash, \xe2\x80\x93 in UTF-8) in the first line of neighborhood_Map that you probably want to replace: 'Cabrini–Green' -> 'Cabrini Green'
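A minimal sketch of that cleanup (assuming neighborhood_Map is a Python 3 str, so the en dash is the single character u'\u2013'):
# normalize the en dash to a space before splitting into pairs
neighborhood_Map = neighborhood_Map.replace(u'\u2013', ' ')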
switch the key and value in neighborhood_name_dic around, since you want to map the existing 'Rogers Park' to the neighborhood 'East Rogers Park', like so:
neighborhood_name_dic = {k[1]:k[0] for k in code_pairs_neighborhoods}
We still don't know from your code how you're reading in your Redfin data, but I presume you'll have to remove the 'Chicago, IL - ' part in the Region column somewhere before you can merge.
Update: So I think I managed to understand your code (again, please try to clean up these things a bit before posting), and I think that Redfin is equal to house_df there. So instead of the line that says:
house_df=house_df.set_index('Region',drop=False)
I would suggest creating a neighborhood column instead (note that str.lstrip removes a set of characters, not a prefix, so it would mangle names like 'Lincoln Park'; use replace):
house_df['neighborhood'] = house_df['Region'].map(lambda x: x.replace('Chicago, IL - ', ''))
and then you can merge on:
crime_finalDF = pd.merge(chicago_crime, house_df, left_on='neighborhood', right_on='neighborhood')
To test it, try:
mask=(crime_finalDF['neighborhood']==u'Sheridan Park')
print(crime_finalDF[['robbery','neighborhood', u'2018-06-01 00:00:00']][mask])
which yields:
robbery neighborhood 2018-06-01 00:00:00
0 140.0 Sheridan Park 239.0
1 122.0 Sheridan Park 239.0
2 102.0 Sheridan Park 239.0
3 113.0 Sheridan Park 239.0
4 139.0 Sheridan Park 239.0
so a successful join of both datasets (I think).
Update 2, regarding the success of the merge().
This is how I read in and cleaned up your xlsx file:
house_df = pd.read_excel("./real_eastate_data_main.xlsx")
house_df = house_df.replace({'-': None})  # replace() returns a new frame, so assign the result back
house_df.columns=house_df.columns.astype(str)
house_df = house_df[house_df['Region'] != 'Chicago, IL']
house_df = house_df[house_df['Region'] != 'Chicago, IL metro area']
house_df['neighborhood'] = house_df['Region'].str.split(' - ')## note the surrounding spaces
house_df['neighborhood'] = house_df['neighborhood'].map(lambda x: list(x)[-1])
chicago_crime['neighborhood'] = chicago_crime['community_name'].map(neighborhood_name_dic)
## Lakeview and Humboldt park not defined in neighborhood_name_dic
# print( chicago_crime[['community_name','neighborhood']][pd.isnull(chicago_crime['neighborhood'])] )
chicago_crime = chicago_crime[~pd.isnull(chicago_crime['neighborhood'])] ## remove them
Now we turn to finding all unique neighborhoods in both df's
cc=sorted(chicago_crime['neighborhood'].unique())
ho=sorted(house_df['neighborhood'].unique())
print(30*u"-"+u" chicago_crime: "+30*u"-")
print(len(cc),cc)
print(30*u"-"+u" house_df: "+30*u"-")
print(len(ho),ho)
print(60*"-")
# print('\n'.join(cc))
set1 = set(cc)
set2 = set(ho)
missing = list(sorted(set1 - set2))
added = list(sorted(set2 - set1))
print('These {0} are missing in house_df: {1}'.format(len(missing),missing))
print(60*"-")
print('These {0} are only in house_df: {1}'.format(len(added),added))
Which reveals that 29 are missing in house_df (e.g. 'East Pilsen') and 132 are found only within house_df (e.g. 'Albany Park'), i.e. we can "inner join" only 46 entries.
Now you have to decide how you want to continue. Best if you first read this about the way merging works (e.g. understand the Venn diagrams posted there), and then you can improve your code yourself accordingly! Or clean up your data manually beforehand; sometimes there isn't a fully automatic solution!
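As a starting point, a sketch of my own (assuming the cleaned chicago_crime and house_df from above): an outer merge with indicator=True lets you inspect which rows matched before committing to an inner join:
checked = pd.merge(chicago_crime, house_df, on='neighborhood',
                   how='outer', indicator=True)
# _merge is 'both' for matched rows, 'left_only'/'right_only' for the rest
print(checked['_merge'].value_counts())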
I'm trying to filter out bogus locations from a column in a data frame. The column is filled with locations taken from tweets, and some of them aren't real; I am trying to separate those from the valid locations. Below is the code I have, but the output is not right: it only returns France. I'm hoping someone can identify what I'm doing wrong here, or suggest another way to try. Let me know if I didn't explain it well enough. Also, I assign variables both outside and inside the function for testing purposes.
import pandas as pd
cn_csv = pd.read_csv("~/Downloads/cntry_list.csv") #this is just a list of every country along with respective alpha 2 and alpha 3 codes, see the link below to download csv
country_names = cn_csv['country']
results = pd.read_csv("~/Downloads/results.csv") #this is a dataframe with multiple columns, one being "source location" See edit below that displays data in "Source Location" column
src_locs = results["Source Location"]
locs_to_list = list(src_locs)
new_list = [entry.split(', ') for entry in locs_to_list]
def country_name_check(input_country_list):
    cn_csv = pd.read_csv("~/Downloads/cntrylst.csv")
    country_names = cn_csv['country']
    results = pd.read_csv("~/Downloads/results.csv")
    src_locs = results["Source Location"]
    locs_to_list = list(src_locs)
    new_list = [entry.split(', ') for entry in locs_to_list]
    valid_names = []
    tobe_checked = []
    for i in new_list:
        if i in country_names.values:
            valid_names.append(i)
        else:
            tobe_checked.append(i)
    return valid_names, tobe_checked
print(country_name_check(src_locs))
EDIT 1: Adding the link for the cntry_list.csv file. I downloaded the csv of the table data. https://worldpopulationreview.com/country-rankings/country-codes
Since I am unable to share a file on here, here is the "Source Location" column data:
Source Location
She/her
South Carolina, USA
Torino
England, UK
trying to get by
Bemidiji, MN
St. Paul, MN
Stockport, England
Liverpool, England
EH7
DLR - LAX - PDX - SEA - GEG
Barcelona
Curitiba
kent
Paris, France
Moon
Denver, CO
France
If your goal is to find and list country names, both valid and not, you may filter the initial results DataFrame:
# make a list of the unique Source Location values that match values from country_names
valid_names = list(results[results['Source Location']
                           .isin(country_names)]['Source Location']
                           .unique())
# with ~, select the unique values that don't match country_names values
tobe_checked = list(results[~results['Source Location']
                            .isin(country_names)]['Source Location']
                            .unique())
Your unwanted result of only France being returned could be solved by trying that simpler approach. However, the problem in your code may also lie in how cntrylst is read outside of the function, as indicated by ScottC.
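As an aside (my own observation, not part of the original answer): entry.split(', ') produces lists, and comparing a list against the strings in country_names.values is what breaks the loop version; only single-element lists like ['France'] happen to broadcast to a match. A minimal string-based sketch of the same check:
valid_names, tobe_checked = [], []
for entry in locs_to_list:
    # compare each comma-separated token as a string, not the whole list
    if any(tok in country_names.values for tok in entry.split(', ')):
        valid_names.append(entry)
    else:
        tobe_checked.append(entry)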
I appreciate the help in advance!
The question may seem weird at first so let me illustrate what I am trying to accomplish:
I have this df of cities and abbreviations:
I need to add another column called 'Queries' and those queries are on a list as follows:
queries = ['Document Management','Document Imaging','Imaging Services']
The trick though is that I need to duplicate my df rows for each query in the list. For instance, for row 0 I have PHOENIX, AZ; I now need 3 rows saying PHOENIX, AZ, queries[n].
Something that would look like this:
Of course I created that manually but I need to scale it for a large number of cities and a large list of queries.
This sounds simple, but I've been trying for some hours now and I don't see how to engineer any code for it. Again, thanks for the help!
Here is one way, using .explode():
import pandas as pd
df = pd.DataFrame({'City_Name': ['Phoenix', 'Tucson', 'Mesa', 'Los Angeles'],
'State': ['AZ', 'AZ', 'AZ', 'CA']})
# 'Query' is a column of tuples
df['Query'] = [('Doc Mgmt', 'Imaging', 'Services')] * len(df.index)
# ... and explode 'unpacks' the tuples, putting one item on each line
df = df.explode('Query')
print(df)
City_Name State Query
0 Phoenix AZ Doc Mgmt
0 Phoenix AZ Imaging
0 Phoenix AZ Services
1 Tucson AZ Doc Mgmt
1 Tucson AZ Imaging
1 Tucson AZ Services
2 Mesa AZ Doc Mgmt
2 Mesa AZ Imaging
2 Mesa AZ Services
3 Los Angeles CA Doc Mgmt
3 Los Angeles CA Imaging
3 Los Angeles CA Services
You should definitely go with jsmart's answer, but posting this as an exercise.
This can also be achieved by exporting the original cities/towns dataframe (df) to a list of records, manually duplicating each one for each query, and then reconstructing the final dataframe.
The entire thing can fit in a single line, and is even relatively readable if you can follow what's going on ;)
pd.DataFrame([{**record, 'query': query}
for query in queries
for record in df.to_dict(orient='records')])
New to Python myself, but I would get around it by creating n identical data frames (n = number of unique query values) without "Query". Then, for each data frame, create a new column with one of the "Query" values. Finally, stack all the data frames together using append. A short example:
adf1 = pd.DataFrame([['city1', 'state1'], ['city2', 'state2']])
adf2 = adf1.copy()  # .copy() is needed; plain assignment would make adf2 the same object as adf1
adf1['query'] = 'doc management'
adf2['query'] = 'doc imaging'
df = adf1.append(adf2)
Another method, if there are many types of queries: create a dummy column, say 'key', in both the original data frame and the query data frame, and merge the two on 'key'.
adf = pd.DataFrame([['city1','state1'],['city2','state2']])
q = pd.DataFrame([['doc management'],['doc imaging']])
adf['key'] = 'key'
q['key'] = 'key'
df = pd.merge(adf, q, on='key', how='outer')
More advanced users should have better ways; this is a temporary solution if you are in a hurry.
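For completeness (an addition beyond the original answers): pandas 1.2+ supports this cross join directly via how='cross', so the dummy key is no longer needed:
queries_df = pd.DataFrame({'Query': queries})  # queries list from the question
df_out = adf.merge(queries_df, how='cross')    # every city row paired with every query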
I have a dictionary of dataframes called names_and_places in pandas that looks like the below.
names_and_places:
Alfred,,,
Date,F_1,F_2,Key
4/1/2020,1,4,NAN
4/2/2020,2,5,NAN
4/3/2020,3,6,"[USA,NY,NY, NY]"
Brett,,,
Date,F_1,F_2,Key
4/1/2020,202,404,NAN
4/2/2020,101,401,NAN
4/3/2020,102,403,"[USA,CT, Fairfield, Stamford] "
Claire,,,
Date,F_1,F_2,Key
4/1/2020,NAN,12,NAN
4/2/2020,NAN,45,NAN
4/3/2020,7,78,"[USA,CT, Fairfield, Darian] "
Dane,,,
Date,F_1,F_2,Key
4/1/2020,4,17,NAN
4/2/2020,5,18,NAN
4/3/2020,7,19,"[USA,CT, Bridgeport, New Haven] "
Edward,,,
Date,F_1,F_2,Key
4/1/2020,4,17,NAN
4/2/2020,5,18,NAN
4/3/2020,7,19,"[USA,CT, Bridgeport, Milford] "
The Key column is either NaN or of the form [Country, State, County, City], but can have 3 or 4 elements (sometimes County is absent). I need to find all the names whose key contains a given element. For instance, if the element = "CT", the script returns Edward, Brett, Dane and Claire (order is not important). If the element = "Stamford", then only Brett is returned. However, I am going about the identification process in a way that seems very inefficient. I basically have variables that iterate through each possible combination of State, County, City (all of which I currently input manually) to identify which names to extract, like below:
country = 'USA'  # this never needs to change
element = 'CT'
# These next two are actually in .txt files that I create once I am asked for
# a given breakdown, but I would like to not have to manually input these
middle_node = ['Fairfield', 'Bridgeport']
terminal_nodes = ['Stamford', 'Darian', 'New Haven', 'Milford']
names = []
for a in middle_node:
    for b in terminal_nodes:
        my_key = [country, element, a, b]
        for s in names_and_places:
            for z in names_and_places[s]['Key']:
                if my_key == z:
                    names.append(s)
# Note: having "if my_key in names_and_places[s]['Key']:" was causing sporadic
# failures for some reason
display(names)
Output:
Edward, Brett, Dane, Claire
What I would like is to be able to input only the variable element, which can be a level 2 (State), level 3 (County), or level 4 (City) node. However, short of adding additional for loops and digging into the Key column, I don't know how to do this. The one benefit (for a novice like myself) is that the double for loops keep the bucketing intact and make it easier for people to see where names are coming from when that is also needed.
But is there a better way? For bonus points: is there a way to handle the case when the element is 'NY' and values in the Key column can be like [USA, NY, NY, NY] or [USA, NY, NY, Queens]?
Edit: names_and_places is a dictionary with names as the index, so
display(names_and_places['Alfred'])
would be
Date,F_1,F_2,Key
4/1/2020,1,4,NAN
4/2/2020,2,5,NAN
4/3/2020,3,6,"[USA,NY,NY, NY]"
I do have the raw dataframe, which has the columns Date, Field Name, Value, Names, where Field Name is either F_1, F_2 or Key, and Value is the associated value of that field. I then pivot the data on Names with columns of Field Name to make my extraction easier.
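Roughly like this (a sketch; raw_df and the exact column labels are from memory):
# reshape the long frame: one row per (Names, Date), one column per Field Name
wide = (raw_df.set_index(['Names', 'Date', 'Field Name'])['Value']
              .unstack('Field Name')
              .reset_index())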
Here's a somewhat more effective way to do that. You start by building a single dataframe out of the dictionary, and then do the actual work on that dataframe.
import numpy as np
import pandas as pd

single_df = pd.concat([df.assign(name=k) for k, df in names_and_places.items()])
single_df["Key"] = single_df.Key.replace("NAN", np.nan)
single_df.dropna(inplace=True)
# Since the location is a string, we have to parse it.
location_df = single_df.Key.str.replace(r"[\[\]]", "", regex=True).str.split(",", expand=True)
location_df.columns = ["Country", "State", "County", "City"]
single_df = pd.concat([single_df, location_df], axis=1)
# this is where the actual query goes.
single_df[(single_df.Country == "USA") & (single_df.State == "CT")].name
The output is:
2 Brett
2 Claire
2 Dane
2 Edward
Name: name, dtype: object
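Regarding the bonus question (my own addition, not part of the original answer): once the key is parsed into columns, a single element can be matched at any level with .any(axis=1):
# strip stray whitespace from the parsed key parts, then match 'NY' at any level
parsed = single_df[['State', 'County', 'City']].apply(lambda c: c.str.strip())
print(single_df[parsed.eq('NY').any(axis=1)].name.unique())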
First of all, I have no background in computer languages and I am learning Python. I'm trying to group some data in a data frame.
[dataframe "cafe_df_merged"]
Actually, I want to create a new data frame that shows the 'city_number', the 'city' (which is a name), and also the number of cafes in each city. So it should have 3 columns: 'city_number', 'city' and 'number_of_cafe'.
However, I tried to use groupby, but the result did not come out as I expected:
city_directory = cafe_df_merged[['city_number', 'city']]
city_directory = city_directory.groupby('city').count()
city_directory
[the result]
How should I do this? Please help, thanks.
There are likely other ways of doing this as well, but something like this should work:
import pandas as pd
import numpy as np
# Create a reproducible example
places = [[['starbucks', 'new_york', '1234']]*5, [['bean_dream', 'boston', '3456']]*4, \
[['coffee_today', 'jersey', '7643']]*3, [['coffee_today', 'DC', '8902']]*3, \
[['starbucks', 'nowwhere', '2674']]*2]
places = [p for sub in places for p in sub]
# a dataframe containing all information
city_directory = pd.DataFrame(places, columns=['shop','city', 'id'])
# make a new dataframe with just cities and ids
# drop duplicate rows
city_info = city_directory.loc[:, ['city','id']].drop_duplicates()
# get the cafe counts (number of cafes)
cafe_count = city_directory.groupby('city').count().iloc[:,0]
# add the cafe counts to the dataframe
city_info['cafe_count'] = cafe_count[city_info['city']].to_numpy()
# reset the index
city_info = city_info.reset_index(drop=True)
city_info now yields the following:
city id cafe_count
0 new_york 1234 5
1 boston 3456 4
2 jersey 7643 3
3 DC 8902 3
4 nowwhere 2674 2
And part of the example dataframe, city_directory.tail(), looks like this:
shop city id
12 coffee_today DC 8902
13 coffee_today DC 8902
14 coffee_today DC 8902
15 starbucks nowwhere 2674
16 starbucks nowwhere 2674
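If all you need are the exact three columns asked for, a more direct sketch (assuming cafe_df_merged has one row per cafe) would be:
city_directory = (cafe_df_merged.groupby(['city_number', 'city'])
                                .size()
                                .reset_index(name='number_of_cafe'))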
Opinion: As a side note, it might be easier to get comfortable with regular Python first before diving deep into the world of pandas and numpy. Otherwise, it might be a bit overwhelming.
I am facing a problem in applying fuzzy logic for data cleansing in Python. My data looks something like this:
data=pd.DataFrame({'Employer':['Deloitte','Accenture','Accenture Solutions Ltd','Accenture USA', 'Ernst & young',' EY', 'Tata Consultancy Services','Deloitte Uk'], "Count":['140','120','50','45','30','20','10','5']})
data
I am using fuzzy logic to compare the values in the data frame. The final output should have a third column with a result like this:
data_out=pd.DataFrame({'Employer':['Deloitte','Accenture','Accenture Solutions Ltd','Accenture USA', 'Ernst & young',' EY', 'Tata Consultancy Services','Deloitte Uk'], "New_Column":["Deloitte",'Accenture','Accenture','Accenture','Ernst & young','Ernst & young','Tata Consultancy Services','Deloitte']})
data_out
So, as you can see, I want the less frequent values to get a new column entry containing the most frequent value of their type. That is where fuzzy logic is helpful.
Most of your duplicate companies can be detected using fuzzy string matching quite easily; however, the replacement Ernst & young <-> EY is not really similar at all, which is why I am going to ignore this replacement here. This solution is using my library RapidFuzz, but you could implement something similar using FuzzyWuzzy as well (with a little more code, since it does not have the extractIndices processor).
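A quick illustration of that point (my own addition, not part of the original answer):
from rapidfuzz import fuzz
print(fuzz.partial_ratio("accenture", "accenture solutions ltd"))  # 100: substring match
print(fuzz.partial_ratio("ey", "ernst & young"))                   # far below the 90 cutoff used below
The full solution then is: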
import pandas as pd
from rapidfuzz import process, utils

def add_deduped_employer_colum(data):
    values = data.values.tolist()
    employers = [employer for employer, _ in values]
    # preprocess strings beforehand (lowercase + remove punctuation),
    # so this is not done multiple times
    processed_employers = [utils.default_process(employer)
                           for employer in employers]
    deduped_employers = employers.copy()
    replaced = []
    for (i, (employer, processed_employer)) in enumerate(
            zip(employers, processed_employers)):
        # skip elements that already got replaced
        if i in replaced:
            continue
        duplicates = process.extractIndices(
            processed_employer, processed_employers[i+1:],
            processor=None, score_cutoff=90, limit=None)
        for (c, _) in duplicates:
            deduped_employers[i+c+1] = employer
            """
            by replacing the element with an empty string the index from
            extractIndices stays correct but it can be skipped a lot
            faster, since the compared strings will have very different
            lengths
            """
            processed_employers[i+c+1] = ""
            replaced.append(i+c+1)
    data['New_Column'] = deduped_employers

data = pd.DataFrame({
    'Employer': ['Deloitte', 'Accenture', 'Accenture Solutions Ltd', 'Accenture USA',
                 'Ernst & young', ' EY', 'Tata Consultancy Services', 'Deloitte Uk'],
    'Count': ['140', '120', '50', '45', '30', '20', '10', '5']})
add_deduped_employer_colum(data)
print(data)
which results in the following dataframe:
Employer Count New_Column
0 Deloitte 140 Deloitte
1 Accenture 120 Accenture
2 Accenture Solutions Ltd 50 Accenture
3 Accenture USA 45 Accenture
4 Ernst & young 30 Ernst & young
5 EY 20 EY
6 Tata Consultancy Services 10 Tata Consultancy Services
7 Deloitte Uk 5 Deloitte
I have not used fuzzy matching, but I can assist as follows.
Data
df=pd.DataFrame({'Employer':['Accenture','Accenture Solutions Ltd','Accenture USA', 'hjk USA', 'Tata Consultancy Services']})
df
You did not explain why Tata keeps its full name, so I assume it is special and mask it:
m=df.Employer.str.contains('Tata')
I then use np.where to strip everything after the first word for the rest (note: this needs import numpy as np, and recent pandas versions require regex=True for a pattern in str.replace):
df['New_Column'] = np.where(m, df['Employer'], df['Employer'].str.replace(r'(\s+\D+)', '', regex=True))
df
Output
                    Employer                 New_Column
0                  Accenture                  Accenture
1    Accenture Solutions Ltd                  Accenture
2              Accenture USA                  Accenture
3                    hjk USA                        hjk
4  Tata Consultancy Services  Tata Consultancy Services