I have two dataframes, df1 and df2.
df1 has 4 columns.
>df1
Neighborhood Street Begin Street End Street
8th Ave 6th St Church St Mlk blvd
.....
>df2
Intersection Roadway
Mlk blvd Hue St.
I want to add a new column Count to df2 such that, for every row in df2, if any string from the Intersection or Roadway columns appears anywhere in df1 at least once, the Count column gets a value of 1. For example, since Mlk blvd from this sample df2 is found in df1 under the End Street column, df2 will look like:
>df2
Intersection Roadway Count
Mlk blvd Hue St. 1
I also want to strip the strings and make the match case insensitive. However, I am not sure how I would set up this matching logic using .iloc. How could I solve this?
Flatten the values in df1 and map them to lower case, then convert the values in df2 to lower case and use isin + any to test for a match:
vals = list(map(str.lower, df1.values.ravel()))
df2['count'] = df2.applymap(str.lower).isin(vals).any(axis=1).astype(int)
Intersection Roadway count
0 Mlk blvd Hue St. 1
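If you also want to strip whitespace, as mentioned in the question, here is a minimal self-contained sketch of the same idea (the frames are rebuilt by hand from the question's sample; str() is used so non-string cells don't break the lowering):

import pandas as pd

df1 = pd.DataFrame({'Neighborhood': ['8th Ave'], 'Street': ['6th St'],
                    'Begin Street': ['Church St'], 'End Street': ['Mlk blvd']})
df2 = pd.DataFrame({'Intersection': ['Mlk blvd'], 'Roadway': ['Hue St.']})

# normalise every value in df1: cast to str, strip whitespace, lower case
vals = {str(v).strip().lower() for v in df1.values.ravel()}

# normalise df2 the same way, then flag rows with at least one match
df2['Count'] = (df2.applymap(lambda x: str(x).strip().lower())
                   .isin(vals)
                   .any(axis=1)
                   .astype(int))
print(df2)
#   Intersection  Roadway  Count
# 0     Mlk blvd  Hue St.      1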
I have a dataframe that contains companies with their sectors:
Symbol Sector
0 MCM Industrials
1 AFT Health Care
2 ABV Health Care
3 AMN Health Care
4 ACN Information Technology
I have another dataframe that contains companies with their positions
Symbol Position
0 ABC 1864817
1 AAP -3298989
2 ABV -1556626
3 AXC 2436387
4 ABT 878535
What I want is a dataframe that contains the aggregate position for each sector, i.e. the sum of the positions of all the companies in a given sector. I can do this for an individual sector with:
df2[df2.Symbol.isin(df1.groupby('Sector').get_group('Industrials')['Symbol'].to_list())]
I am looking for a more efficient pandas approach than looping over each sector in the groupby. The final dataframe should look like the following:
Sector Sum Position
0 Industrials 14567232
1 Health Care -329173249
2 Information Technology -65742234
3 Energy 6574352342
4 Pharma 6342387658
Any help is appreciated.
If I understood the question correctly, one way to do it is to join both data frames on Symbol and then group by sector and sum the Position column, like so:
df_agg = df1.set_index('Symbol').join(df2.set_index('Symbol'))
df_agg.groupby('Sector').sum()
Here, df1 is the df with Sectors and df2 is the df with Positions.
You can map the Symbol column to sector and use that Series to group.
df2.groupby(df2.Symbol.map(df1.set_index('Symbol').Sector)).Position.sum()
Let us just do merge:
df2.merge(df1,how='left').groupby('Sector').Position.sum()
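For reference, a minimal runnable sketch of the merge-then-group approach, using small made-up symbols and positions rather than the question's actual data:

import pandas as pd

# hypothetical sample data
df1 = pd.DataFrame({'Symbol': ['MCM', 'AFT', 'ABV'],
                    'Sector': ['Industrials', 'Health Care', 'Health Care']})
df2 = pd.DataFrame({'Symbol': ['MCM', 'AFT', 'ABV'],
                    'Position': [100, 200, -50]})

# left-merge the sector onto each position, then sum positions per sector
out = df2.merge(df1, how='left').groupby('Sector').Position.sum()
print(out)
# Sector
# Health Care    150
# Industrials    100
# Name: Position, dtype: int64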
I have these two data frames:
1st df
#df1 -----
location Ethnic Origins Percent(1)
0 Beaches-East York English 18.9
1 Davenport Portuguese 22.7
2 Eglinton-Lawrence Polish 12.0
2nd df
#df2 -----
location lat lng
0 Beaches—East York, Old Toronto, Toronto, Golde... 43.681470 -79.306021
1 Davenport, Old Toronto, Toronto, Golden Horses... 43.671561 -79.448293
2 Eglinton—Lawrence, North York, Toronto, Golden... 43.719265 -79.429765
Expected Output:
I want to use the location column of #df1 as it is cleaner and retain all other columns. I don't need the city, country info on the location column.
location Ethnic Origins Percent(1) lat lng
0 Beaches-East York English 18.9 43.681470 -79.306021
1 Davenport Portuguese 22.7 43.671561 -79.448293
2 Eglinton-Lawrence Polish 12.0 43.719265 -79.429765
I have tried several ways to merge them but to no avail.
This returns a NaN for all lat and long rows
df3 = pd.merge(df1, df2, on="location", how="left")
This returns a NaN for all Ethnic and Percent rows
df3 = pd.merge(df1, df2, on="location", how="right")
As others have noted, the problem is that the 'location' columns do not share any values. One solution to this is to use a regular expression to get rid of everything starting with the first comma and extending to the end of the string:
df2.location = df2.location.replace(r',.*', '', regex=True)
Using the exact data you provide, this still won't work because you have different kinds of dashes in the two data frames. You could solve this in a similar way (no regex needed this time, just a plain substring replacement):
df2.location = df2.location.str.replace('—', '-')
And then merge as you suggested
df3 = pd.merge(df1, df2, on="location", how="left")
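Put together, a minimal runnable sketch of these two cleanup steps plus the merge (the df2 locations here are shortened stand-ins for the truncated ones in the question):

import pandas as pd

df1 = pd.DataFrame({'location': ['Beaches-East York', 'Davenport'],
                    'Ethnic Origins': ['English', 'Portuguese'],
                    'Percent(1)': [18.9, 22.7]})
df2 = pd.DataFrame({'location': ['Beaches—East York, Old Toronto, Toronto',
                                 'Davenport, Old Toronto, Toronto'],
                    'lat': [43.681470, 43.671561],
                    'lng': [-79.306021, -79.448293]})

# drop everything from the first comma onwards
df2.location = df2.location.replace(r',.*', '', regex=True)
# normalise the em dash to a plain hyphen (substring replacement, hence .str.replace)
df2.location = df2.location.str.replace('—', '-')

df3 = pd.merge(df1, df2, on='location', how='left')
print(df3)
#             location Ethnic Origins  Percent(1)        lat        lng
# 0  Beaches-East York        English        18.9  43.681470 -79.306021
# 1          Davenport     Portuguese        22.7  43.671561 -79.448293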
We can use findall to create the key (note that with the sample data the em dashes in df2 still need to be normalised to hyphens first, as above, or the hyphenated df1 names won't be found):
df2['location']=df2.location.str.findall('|'.join(df1.location)).str[0]
df3 = pd.merge(df1, df2, on="location", how="left")
I'm guessing the problem you're having is that the column you're trying to merge on doesn't contain the same values, i.e. the values in df2.location don't match those in df1.location. Try changing those first and it should work:
df2["location"] = df2["location"].apply(lambda x: x.split(",")[0])
df3 = pd.merge(df1, df2, on="location", how="left")
I have produced some data which lists parks in proximity to different areas of East London, using the FourSquare API. It is in the dataframe df:
Location,Parks,Borough
Aldborough Hatch,Fairlop Waters Country Park,Redbridge
Ardleigh Green,Haynes Park,Havering
Bethnal Green,"Haggerston Park, Weavers Fields",Tower Hamlets
Bromley-by-Bow,"Rounton Park, Grove Hall Park",Tower Hamlets
Cambridge Heath,"Haggerston Park, London Fields",Tower Hamlets
Dalston,"Haggerston Park, London Fields",Hackney
Import data with df = pd.read_clipboard(sep=',')
What I would like to do is group by the Borough column and count the distinct parks in each borough, so that for example 'Tower Hamlets' = 5 and 'Hackney' = 2. I will create a new dataframe for this purpose which simply lists the total number of distinct parks for each borough present in the dataframe.
I know I can do:
df.groupby(['Borough', 'Parks']).size()
But I need to split parks by the delimiter ',' such that they are treated as unique, distinct entities for a borough.
What do you suggest?
Thanks!
The first rule of data science is to clean your data into a useful format.
Reformat the DataFrame to be usable:
df.Parks = df.Parks.str.split(r',\s*')  # per user piRSquared
df = df.explode('Parks') # pandas v 0.25
Now the DataFrame is in a proper format that can be more easily analyzed
df.groupby('Borough').Parks.nunique()
Borough
Hackney 2
Havering 1
Redbridge 1
Tower Hamlets 5
That's three lines of code, but now the DataFrame is in a useful format, upon which more insights can easily be extracted.
Plot
df.groupby(['Borough']).Parks.nunique().plot(kind='bar', title='Unique Parks Counts by Borough')
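For reference, here is the whole answer as a self-contained sketch, with the question's table typed in by hand instead of being read via read_clipboard:

import pandas as pd

df = pd.DataFrame({
    'Location': ['Aldborough Hatch', 'Ardleigh Green', 'Bethnal Green',
                 'Bromley-by-Bow', 'Cambridge Heath', 'Dalston'],
    'Parks': ['Fairlop Waters Country Park', 'Haynes Park',
              'Haggerston Park, Weavers Fields', 'Rounton Park, Grove Hall Park',
              'Haggerston Park, London Fields', 'Haggerston Park, London Fields'],
    'Borough': ['Redbridge', 'Havering', 'Tower Hamlets', 'Tower Hamlets',
                'Tower Hamlets', 'Hackney'],
})

df.Parks = df.Parks.str.split(r',\s*')   # one list of park names per row
df = df.explode('Parks')                 # one park per row (pandas >= 0.25)
print(df.groupby('Borough').Parks.nunique())
# Borough
# Hackney          2
# Havering         1
# Redbridge        1
# Tower Hamlets    5
# Name: Parks, dtype: int64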
If you are using Pandas 0.25 or greater, consider the answer from Trenton_M
His answer provides a good suggestion for creating a more useful data set.
IIUC:
df.groupby('Borough').Parks.apply(
    lambda s: len(set(', '.join(s).split(', ')))
)
Borough
Hackney 2
Havering 1
Redbridge 1
Tower Hamlets 5
Name: Parks, dtype: int64
Similar
df.Parks.str.split(', ').groupby(df.Borough).apply(lambda s: len(set().union(*s)))
Borough
Hackney 2
Havering 1
Redbridge 1
Tower Hamlets 5
Name: Parks, dtype: int64
I'm writing a simple code to have a two-way table of distances between various cities.
Basically, I have a list of cities (say just 3: Paris, Berlin, London), and I created the combinations between them with itertools (so I have Paris-Berlin, Paris-London, Berlin-London). I parsed the distances from a website and saved them in a dictionary (so I have: {Paris: {Berlin: 878.36, London: 343.67}, Berlin: {London: 932.14}}).
Now I want to create a two-way table so that I can look up a pair of cities in Excel (I need it in Excel unfortunately, otherwise with Python all of this would be unnecessary!) and get the distance back. The table has to be complete (i.e. not triangular), so that I can look up either London-Paris or Paris-London and the value is there for both row/column pairs. Is something like this easily possible? I was thinking I probably need to fill in my dictionary (i.e. create something like {Paris: {Berlin: 878.36, London: 343.67}, Berlin: {Paris: 878.36, London: 932.14}, London: {Paris: 343.67, Berlin: 932.14}}) and then feed it to Pandas, but I'm not sure that's the fastest way. Thank you!
I think this does something like what you need:
import pandas as pd
data = {'Paris': {'Berlin': 878.36, 'London': 343.67}, 'Berlin': {'London': 932.14}}
# Create data frame from dict
df = pd.DataFrame(data)
# Rename index
df.index.name = 'From'
# Make index into a column
df = df.reset_index()
# Turn destination columns into rows
df = df.melt(id_vars='From', var_name='To', value_name='Distance')
# Drop missing values (distance to oneself)
df = df.dropna()
# Concatenate with itself but swapping the order of cities
df = pd.concat([df, df.rename(columns={'From' : 'To', 'To': 'From'})], sort=False)
# Reset index
df = df.reset_index(drop=True)
print(df)
Output:
From To Distance
0 Berlin Paris 878.36
1 London Paris 343.67
2 London Berlin 932.14
3 Paris Berlin 878.36
4 Paris London 343.67
5 Berlin London 932.14
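If you want the square lookup table itself rather than the long format, one option is to pivot the result and write it to Excel. This is just a sketch: the file name is an example, and to_excel needs an Excel writer such as openpyxl installed.

# reshape the long table into a symmetric two-way matrix
matrix = df.pivot(index='From', columns='To', values='Distance')
print(matrix)
# To       Berlin  London   Paris
# From
# Berlin      NaN  932.14  878.36
# London   932.14     NaN  343.67
# Paris    878.36  343.67     NaN

# write it out for lookup in Excel (example file name)
matrix.to_excel('distances.xlsx')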
Basically I need to remove certain rows from a csv file where the value of the 'County' column does not contain the word 'county'. I'm trying to push that back into my dataframe, but I'm getting an index error.
chd = pd.read_csv('some_file.csv')
for index, row in chd.iterrows():
    if 'County' not in row['County']:
        chd = chd.drop(chd.index[[index, 3]])
I get the following error:
IndexError: index 2959 is out of bounds for axis 1 with size 2909
Given the following two rows, I would like to get rid of the first row.
STATECODE COUNTYCODE State County Some_Column
1 0 AL Alabama 9,508
1 0 AL Alabama County 9,508
I have since tried the following, which doesn't seem to remove any rows. If I print the data frame it remains the same.
chd = pd.read_csv('some_file.csv')
chd[chd['County'].str.contains('county', case=False)]
IIUC you can do chd = chd[chd['County'].str.contains('county', case=False)] to remove the rows that don't contain your value (note the assignment back to chd; filtering alone does not modify the original dataframe).
The reason you get the error is that you're iterating over the df while removing rows, so the index values you use become stale and invalid.
Example:
In [123]:
df = pd.DataFrame({'County':['Alaska', 'Big county', 'Country', 'No county', 'County']})
df[df['County'].str.contains('county', case=False)]
Out[123]:
County
1 Big county
3 No county
4 County
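Applied to the question's two sample rows, a minimal sketch (the frame is typed in by hand here rather than read from the CSV):

import pandas as pd

chd = pd.DataFrame({'STATECODE': [1, 1], 'COUNTYCODE': [0, 0],
                    'State': ['AL', 'AL'],
                    'County': ['Alabama', 'Alabama County'],
                    'Some_Column': ['9,508', '9,508']})

# keep only the rows whose County value contains the word 'county' (any case)
chd = chd[chd['County'].str.contains('county', case=False)]
print(chd)
#    STATECODE  COUNTYCODE State          County Some_Column
# 1          1           0    AL  Alabama County       9,508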