Having issues merging data from multiple sheets within the same Excel file.
2008: Data

UNI      DEP        ADDRESBR
6        24065037   225 Franklin Street
17       416952     100 North Gay Street
361391   3756       1717 South College
blank    81651      215 South 6th Street

2009: Data

UNI      DEP-2009   ADDRESBR
6        20624948   225 Franklin Street
17       471803     100 North Gay Street
361391   3891       1717 South College
180886   100277     215 South 6th Street
493224   1683       2315 Bentcreek Road
The goal is to combine the values from all sheets into the first sheet, appending each year's DEP as a new column. The issue I am having is the blank UNI values in sheet1, and trying to match rows by address and UNI across sheets.
The final result should look like this:
UNI      DEPSUMBR   ADDRESBR               DEP-2009   DEP-n
6        20624948   225 Franklin Street    20624948
17       471803     100 North Gay Street   471803
361391   3891       1717 South College     3891
180886   100277     215 South 6th Street   ...
493224   1683       2315 Bentcreek Road    ...
Can anyone help with how I would do this in Python? The goal is to have a final dataset that accounts for DEP per year, appended as a column. The trouble I am having is matching each year's DEP value with its respective UNI number.
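One way to approach it is sketched below; it assumes the workbook is named branches.xlsx (a placeholder), with one sheet per year, and that ADDRESBR is unique per branch. Joining on the address rather than UNI is what lets blank-UNI rows line up with their counterparts in other years:

import pandas as pd

# A sketch, not a drop-in solution: 'branches.xlsx' and per-year sheet names
# are assumptions. Join on ADDRESBR because UNI can be blank, then back-fill
# UNI from whichever sheet has it.
sheets = pd.read_excel('branches.xlsx', sheet_name=None)  # dict: sheet name -> DataFrame

merged = None
for year, df in sheets.items():
    # Rename the deposit column so each year keeps its own column, e.g. DEP-2009
    dep_col = next(c for c in df.columns if c.upper().startswith('DEP'))
    df = df.rename(columns={dep_col: f'DEP-{year}'})
    if merged is None:
        merged = df
    else:
        # An outer merge keeps branches that exist in only one year
        merged = merged.merge(df, on='ADDRESBR', how='outer',
                              suffixes=('', f'_{year}'))
        # Keep the UNI already present; fill blanks from the newer sheet
        merged['UNI'] = merged['UNI'].fillna(merged.pop(f'UNI_{year}'))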
I am calculating a new column in a dataframe using a regular expression with named capturing groups, as follows:
(df["Address Column"]
.str.extract("(?P<Address>.*\d+[\w+?|\s]\s?\w+\s+\w+),?\s(?P<Suburb>.*$)")
.apply(lambda x: x.str.title()))
However, I am getting a KeyError when calling the new column "Suburb":
KeyError: "['Suburb'] not in index"
Sample data:
Address column
4a Mcarthurs Road, Altona north
1 Neal court, Altona North
4 Vermilion Drive, Greenvale
Lot 307 Bonds Lane, Greenvale
430 Blackshaws rd, Altona North
159 Bonds lane, Greenvale
Desired output:
Address Suburb
4a Mcarthurs Road Altona North
1 Neal court Altona North
4 Vermilion Drive Greenvale
Lot 307 Bonds Lane Greenvale
430 Blackshaws rd Altona North
159 Bonds lane Greenvale
Not sure why I am getting this!
Any help on this will be highly appreciated.
Thank you in advance for the support!
I think your problem is that you don't assign the result of your regexp query back to the original df.
The following works for me:
r = r"(?P<Address>.*\d+[\w+?|\s]\s?\w+\s+\w+),?\s(?P<Suburb>.*$)"
ret = df["Address Column"].str.extract(r).apply(lambda x: x.str.title())
df = pd.concat([df, ret], axis=1)
df["Suburb"]
For completeness, this is how I initialized df.
import pandas as pd

s = pd.Series(["4a Mcarthurs Road, Altona north",
               "1 Neal court, Altona North",
               "4 Vermilion Drive, Greenvale",
               "Lot 307 Bonds Lane, Greenvale",
               "430 Blackshaws rd, Altona North",
               "159 Bonds lane, Greenvale"])
df = pd.DataFrame({"Address Column": s})
The above code adds the new columns Address and Suburb to df:
Address Column Address Suburb
4a Mcarthurs Road, Altona north 4A Mcarthurs Road Altona North
1 Neal court, Altona North 1 Neal Court Altona North
4 Vermilion Drive, Greenvale 4 Vermilion Drive Greenvale
Lot 307 Bonds Lane, Greenvale Lot 307 Bonds Lane Greenvale
430 Blackshaws rd, Altona North 430 Blackshaws Rd Altona North
159 Bonds lane, Greenvale 159 Bonds Lane Greenvale
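As an aside (an alternative, not what the asker used): since every sample row separates the address from the suburb with a single comma, a plain split on the df initialized above works too:

# Split on the first comma, then tidy whitespace and capitalisation
parts = (df["Address Column"]
         .str.split(",", n=1, expand=True)
         .apply(lambda x: x.str.strip().str.title()))
parts.columns = ["Address", "Suburb"]

This sidesteps the regex entirely; if a row has no comma, its Suburb simply comes back as NaN.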
I have a file that I am trying to read into a pandas dataframe. However, some of the cells come up as NaN even though there are values in them. The cells that do show up are float values; the ones that come up as NaN were copy-pasted into the spreadsheet. Not sure why that would make a difference. Can anyone help? I have included the file as a link at this location: https://www.dropbox.com/s/30rxw07eaza29df/manhattan_hs_gps.csv?dl=0
I tried this and it worked fine; both encoding='unicode-escape' and encoding='latin-1' work:
import pandas as pd

df = pd.read_csv('manhattan_hs_gps.csv', encoding='unicode-escape', header=None)
print(df)
0 1 2 3
0 0 A. Philip Randolph Campus High School 40.818500 -73.950000
1 1 Aaron School 40.744800 -73.983700
2 2 Abraham Joshua Heschel School 40.772300 -73.989700
3 3 Academy of Environmental Science Secondary Hig... 40.785200 -73.942200
4 4 Academy for Social Action: A College Board School 40.815400 -73.955300
.. ... ... ... ...
162 164 Xavier High School 40.737900 -73.994600
163 165 Yeshiva University High School for Boys 40.851749 -73.928695
164 166 York Preparatory School 40.774100 -73.979400
165 167 Young Women's Leadership School 40.792900 -73.947200
166 168 Washington Heights Expeditionary Learning School 40.774100 -73.979400
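If you would rather identify the file's real encoding than guess, the chardet package (assumed installed; it is not part of the standard library) can sniff it:

import chardet

# Read the raw bytes and let chardet guess the encoding
with open('manhattan_hs_gps.csv', 'rb') as f:
    guess = chardet.detect(f.read())
print(guess)  # a dict like {'encoding': ..., 'confidence': ...}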
I know this should be easy but it's driving me mad...
I am trying to turn a dataframe into a grouped dataframe.
df outputs:
Postcode Borough Neighbourhood
0 M3A North York Parkwoods
1 M4A North York Victoria Village
2 M5A Downtown Toronto Harbourfront
3 M5A Downtown Toronto Regent Park
4 M6A North York Lawrence Heights
5 M6A North York Lawrence Manor
6 M7A Queen's Park Not assigned
7 M9A Etobicoke Islington Avenue
8 M1B Scarborough Rouge
9 M1B Scarborough Malvern
10 M3B North York Don Mills North
...
I want to make a grouped dataframe where the rows are grouped by Postcode and all the Neighbourhood values for each Postcode become one concatenated string...
something like:
Postcode Borough Neighbourhood
0 M3A North York Parkwoods
1 M4A North York Victoria Village
2 M5A Downtown Toronto Harbourfront, Regent Park
...
I am trying to use:
df.groupby(['Postcode'])['Neighbourhood'].apply(lambda strs: ', '.join(strs))
But this does not return a new dataframe; df is still the same original dataframe when I inspect it after running.
If I use:
df = df.groupby(['Postcode'])['Neighbourhood'].apply(lambda strs: ', '.join(strs))
it turns df into an object?
Use this code:
new_df = df.groupby(['Postcode', 'Borough']).agg({'Neighbourhood': lambda x: ', '.join(x)}).reset_index()
reset_index() takes your group-by columns out of the index, returns them as regular columns, and creates a new integer index. As for df "turning into an object": selecting the single column Neighbourhood before applying the join makes groupby return a Series rather than a DataFrame, which is why assigning the result back to df changed its type.
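To see it work, here is the grouping applied to a handful of the sample rows from the question (a minimal reproduction, not the full data):

import pandas as pd

df = pd.DataFrame({
    'Postcode': ['M3A', 'M4A', 'M5A', 'M5A', 'M6A', 'M6A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto',
                'Downtown Toronto', 'North York', 'North York'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village', 'Harbourfront',
                      'Regent Park', 'Lawrence Heights', 'Lawrence Manor'],
})

new_df = df.groupby(['Postcode', 'Borough']).agg(
    {'Neighbourhood': lambda x: ', '.join(x)}).reset_index()
print(new_df)
#   Postcode           Borough                     Neighbourhood
# 0      M3A        North York                         Parkwoods
# 1      M4A        North York                  Victoria Village
# 2      M5A  Downtown Toronto         Harbourfront, Regent Park
# 3      M6A        North York  Lawrence Heights, Lawrence Manor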
I have the dataframe nbr2 below:
Postal_Code Borough Neighborhood
0 M1B Scarborough Rouge, Malvern
1 M4C East York Woodbine Heights
2 M4E East Toronto The Beaches
3 M4L East Toronto The Beaches West, India Bazaar
4 M4M East Toronto Studio District
5 M4N Central Toronto Lawrence Park
On applying the below code to filter out rows:
neighbor = nbr2.drop(nbr2[nbr2['Borough'].str.contains("Toronto")==False].index, axis=0, inplace=True)
the dataframe output gets split across lines like below:
Postal_Code Borough \
37 M4E East Toronto
41 M4K East Toronto
42 M4L East Toronto
43 M4M East Toronto
Neighborhood
37 The Beaches
41 The Danforth West\n, Riverdale
42 The Beaches West\n, India Bazaar
43 Studio District\n
The code below also results in a similar structure:
# define the dataframe columns
column_names = ['Postal_Code','Borough', 'Neighborhood']
# instantiate the dataframe
neighbor = pd.DataFrame(columns=column_names)
neighbor = nbr2.drop(nbr2[nbr2['Borough'].str.contains("Toronto")==False].index, axis=0, inplace=True)
Use:
pd.set_option('display.expand_frame_repr', False)
This widens the console representation so wide frames print on one line instead of being wrapped into the stacked blocks shown above.
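For completeness, here is a small sketch with the question's sample data. Note, as an aside, that drop(..., inplace=True) returns None, so the assignment neighbor = nbr2.drop(..., inplace=True) in the question leaves neighbor as None; a boolean mask does the filtering in one step without that pitfall:

import pandas as pd

nbr2 = pd.DataFrame({
    'Postal_Code': ['M1B', 'M4C', 'M4E', 'M4L', 'M4M', 'M4N'],
    'Borough': ['Scarborough', 'East York', 'East Toronto',
                'East Toronto', 'East Toronto', 'Central Toronto'],
    'Neighborhood': ['Rouge, Malvern', 'Woodbine Heights', 'The Beaches',
                     'The Beaches West, India Bazaar', 'Studio District',
                     'Lawrence Park'],
})

# Widen the console repr so wide frames print on one line instead of wrapping
pd.set_option('display.expand_frame_repr', False)

# Boolean mask: keep only rows whose Borough mentions Toronto
neighbor = nbr2[nbr2['Borough'].str.contains("Toronto")]
print(neighbor)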
I have many Excel files that are in different formats. Some of them look like this, which is the normal case: a single header row that can be read into pandas.
# First Column Second Column Address City State Zip
1 House The Clairs 4321 Main Street Chicago IL 54872
2 Restaurant The Monks 6323 East Wing Miluakee WI 45458
and some of them are in various formats with multiple headers,
Table 1
Comp ID Info
# First Column Second Column Address City State Zip
1 Office The Fairs 1234 Main Street Seattle WA 54872
2 College The Blanks 4523 West Street Madison WI 45875
3 Ground The Brewers 895 Toronto Street Madrid IA 56487
Table2
Comp ID Info
# First Column Second Column Address City State Zip
1 College The Banks 568 Old Street Cleveland OH 52125
2 Professional The Circuits 695 New Street Boston MA 36521
(An image in the original post shows how this actually looks in Excel.)
As you can see above, there are three different levels of headers. Crucially, every file has a row that starts with First Column.
For an individual file like this, I can read it as below, which works fine:
xls = pd.ExcelFile(r'mypath\myfile.xlsx')
df = pd.read_excel(xls, 'mysheet', header=2)
However, I need a final data frame like this (appended across all files, including those with only one header):
First Column Second Column Address City State Zip
0 House The Clair 4321 Main Street Chicago IL 54872
1 Restaurant The Monks 6323 East Wing Milwaukee WI 45458
2 Office The Fairs 1234 Main Street Seattle WA 54872
3 College The Blanks 4523 West Street Madison WI 45875
4 Ground The Brewers 895 Toronto Street Madrid IA 56487
5 College The Banks 568 Old Street Cleveland OH 52125
6 Professional The Circuits 695 New Street Boston MA 36521
Since I have many files, I want to read each file in my folder and clean it up by keeping only one header row. Had I known the index position of the row that I need as the header, I could simply do something like in this post.
However, as some of those files can have multiple headers (I showed 2 extra headers in the above example; some have 4) in different formats, I want to iterate through each file and set the row that starts with First Column as the header at the beginning of the file.
Additionally, I want to drop those rows in the middle of the file that contain First Column.
After I create cleaned files with a single header starting at First Column, I can append each data frame and create the output file I need. How can I achieve this in pandas? Any help or suggestions would be great.
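One way to do this (a sketch: 'myfolder' and the glob pattern are placeholders, and it assumes the literal string First Column only ever appears in header rows) is to read each file without a header, find the first row containing First Column, promote it to the header, and drop any repeated header rows further down before appending:

import pandas as pd
from pathlib import Path

frames = []
for path in Path('myfolder').glob('*.xlsx'):
    raw = pd.read_excel(path, header=None)
    # Locate the first row containing the literal 'First Column'
    header_mask = raw.eq('First Column').any(axis=1)
    header_pos = header_mask.idxmax()
    df = raw.iloc[header_pos + 1:].copy()
    df.columns = raw.iloc[header_pos]
    # Drop repeated header rows that appear in the middle of the file
    df = df[~df.eq('First Column').any(axis=1)]
    frames.append(df)

# Drop the '#' counter column from the header row to match the desired output
final = pd.concat(frames, ignore_index=True).drop(columns='#')

Files that already have a single clean header pass through unchanged, since their header row is simply found at position 0.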