Pandas exits early before I can read entire Excel file

I'm trying to read an Excel file into pandas in Python, access particular columns of each row, geocode each address to coordinates, and then write the results to a CSV file.
The geocoding part works well, and as far as I can tell my loop starts out fine and can read the address. However, it just stops at 22 rows. I have no clue why; I've been using pandas with this same Excel file for something else and it works fine there. The file has 27k rows in it: printing data.__len__() gives me 27395. Any help?
import csv
import pandas as pd

##### READ IN DATA
file = r'rollingsales_manhattan.xls'
# Read in the data from the Excel
data = pd.read_excel(file)
# g = geocoder.osm(str(data['ADDRESS'].iloc[0]) + " New York City, NY " + str(data['ZIP CODE'].iloc[0]))
with open("geotagged_manhattan.csv", 'wb') as result_file:
    wr = csv.writer(result_file)
    for index,d in enumerate(data):
        print(str(data['ADDRESS'].iloc[index]) + " New York City, NY " + str(data['ZIP CODE'].iloc[index]))
Then my output...
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
345 WEST 14TH STREET New York City, NY 10014
229 EAST 2ND STREET New York City, NY 10009
243 EAST 7TH STREET New York City, NY 10009
238 EAST 4TH STREET New York City, NY 10009
303 EAST 4TH STREET New York City, NY 10009
Process finished with exit code 0

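Iterating over a DataFrame directly, as enumerate(data) does, yields its column labels rather than its rows, so your loop runs once per column and then stops. You need to use the iteritems() method to iterate over a pandas Series. To iterate over both columns together, use map() like this (note that this is Python 2 code: map(None, ...) is not valid in Python 3, and iteritems() was removed in pandas 2.0 in favor of items()):
with open("geotagged_manhattan.csv", 'wb') as result_file:
    wr = csv.writer(result_file)
    for a, z in map(None, data['ADDRESS'].iteritems(), data['ZIP CODE'].iteritems()):
        print(str(a[1]) + " New York City, NY " + str(z[1]))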
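On Python 3 with a current pandas, the same pairing of the two columns can be done with zip(). The following is a minimal sketch rather than the original answer's code: the three-row DataFrame is a hypothetical stand-in for the Excel data, the BOROUGH column is made up to show what iterating the frame itself yields, and an in-memory buffer replaces the output file so the example runs on its own.

```python
import csv
import io

import pandas as pd

# Hypothetical stand-in for the frame returned by pd.read_excel(...)
data = pd.DataFrame({
    "ADDRESS": ["345 WEST 14TH STREET", "229 EAST 2ND STREET", "243 EAST 7TH STREET"],
    "ZIP CODE": [10014, 10009, 10009],
    "BOROUGH": [1, 1, 1],
})

# Iterating the DataFrame itself yields column labels, one per column.
column_labels = list(data)
print(column_labels)

# zip() over the two columns yields one (address, zip code) pair per row.
queries = [
    f"{address} New York City, NY {zip_code}"
    for address, zip_code in zip(data["ADDRESS"], data["ZIP CODE"])
]

# In Python 3, csv.writer expects a text-mode target opened with newline="".
# An in-memory buffer is used here instead of a file on disk.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
for query in queries:
    writer.writerow([query])

print(len(queries), "rows written")
```

With a real file, the buffer would be replaced by open("geotagged_manhattan.csv", "w", newline="").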

Related

Python 'DataFrame' object is not callable - error in for loop

Good morning. My DataFrame (df_part3) is below:
     Postal Code  Borough           Neighbourhood                                      Latitude   Longitude
0    M5A          Downtown Toronto  Regent Park, Harbourfront                          43.654260  -79.360636
1    M7A          Downtown Toronto  Queen's Park, Ontario Provincial Government        43.662301  -79.389494
2    M5B          Downtown Toronto  Garden District, Ryerson                           43.657162  -79.378937
3    M5C          Downtown Toronto  St. James Town                                     43.651494  -79.375418
4    M4E          East Toronto      The Beaches                                        43.676357  -79.293031
...  ...          ...               ...                                                ...        ...
34   M5W          Downtown Toronto  Stn A PO Boxes                                     43.646435  -79.374846
35   M4X          Downtown Toronto  St. James Town, Cabbagetown                        43.667967  -79.367675
36   M5X          Downtown Toronto  First Canadian Place, Underground city             43.648429  -79.382280
37   M4Y          Downtown Toronto  Church and Wellesley                               43.665860  -79.383160
38   M7Y          East Toronto      Business reply mail Processing Centre, South C...  43.662744  -79.321558
And my code is here:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)
# add markers to map
for lat, lng, label in zip(df_part3['Latitude'], df_part3['Longitude'], df_part3['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)
map_toronto
But when I run it I get:
TypeError: 'DataFrame' object is not callable
----> 5 for lat, lng, label in zip(df_part3['Latitude'], df_part3['Longitude'], df_part3['Neighbourhood']):
Does anyone know how to help me?
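The only call on the line the traceback points at is zip(...), so one plausible cause is that the name zip was assigned a DataFrame somewhere earlier in the session. This is an assumption, since the earlier code is not shown. The sketch below reproduces the message that way; the reproduce_error function and the two-row frame are hypothetical and exist only for illustration.

```python
import pandas as pd

# Two rows copied from the question's df_part3
df_part3 = pd.DataFrame({
    "Neighbourhood": ["Regent Park, Harbourfront", "The Beaches"],
    "Latitude": [43.654260, 43.676357],
    "Longitude": [-79.360636, -79.293031],
})


def reproduce_error():
    # Hypothetical earlier assignment: binding the name `zip` hides the builtin.
    zip = pd.DataFrame()
    try:
        return list(zip(df_part3["Latitude"], df_part3["Longitude"]))
    except TypeError as exc:
        return str(exc)


message = reproduce_error()
print(message)

# Out here the name `zip` still refers to the builtin, so the loop header works.
rows = list(zip(df_part3["Latitude"], df_part3["Longitude"], df_part3["Neighbourhood"]))
print(rows[0])
```

If that is what happened in a notebook, running del zip (or restarting the kernel) makes the builtin visible again.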

Split column in DataFrame based on item in list

I have the following table and would like to split each row into three columns: state, postcode and city. State and postcode are easy, but I'm unable to extract the city. I thought about splitting each string after the street synonyms and before the state, but I seem to be getting the loop wrong as it will only use the last item in my list.
Input data:
Address Text
0 11 North Warren Circle Lisbon Falls ME 04252
1 227 Cony Street Augusta ME 04330
2 70 Buckner Drive Battle Creek MI
3 718 Perry Street Big Rapids MI
4 14857 Martinsville Road Van Buren MI
5 823 Woodlawn Ave Dallas TX 75208
6 2525 Washington Avenue Waco TX 76710
7 123 South Main St Dallas TX 75201
The output I'm trying to achieve (for all rows, but I only wrote out the first two to save time)
City State Postcode
0 Lisbon Falls ME 04252
1 Augusta ME 04330
My code:
# Extract postcode and state
df["Zip"] = df["Address Text"].str.extract(r'(\d{5})', expand = True)
df["State"] = df["Address Text"].str.extract(r'([A-Z]{2})', expand = True)
# Split after these substrings
street_synonyms = ["Circle", "Street", "Drive", "Road", "Ave", "Avenue", "St"]
# This is where I got stuck
df["Syn"] = df["Address Text"].apply(lambda x: x.split(syn))
df

Here's a way to do that:
import pandas as pd
# data
df = pd.DataFrame(
    ['11 North Warren Circle Lisbon Falls ME 04252',
     '227 Cony Street Augusta ME 04330',
     '70 Buckner Drive Battle Creek MI',
     '718 Perry Street Big Rapids MI',
     '14857 Martinsville Road Van Buren MI',
     '823 Woodlawn Ave Dallas TX 75208',
     '2525 Washington Avenue Waco TX 76710',
     '123 South Main St Dallas TX 75201'],
    columns=['Address Text'])
# Extract postcode and state
df["Zip"] = df["Address Text"].str.extract(r'(\d{5})', expand=True)
df["State"] = df["Address Text"].str.extract(r'([A-Z]{2})', expand=True)
# Split after these substrings
street_synonyms = ["Circle", "Street", "Drive", "Road", "Ave", "Avenue", "St"]
def find_city(address, state, street_synonyms):
    for syn in street_synonyms:
        if syn in address:
            # remove street
            city = address.split(syn)[-1]
            # remove State and postcode
            city = city.split(state)[0]
            return city
df['City'] = df.apply(lambda x: find_city(x['Address Text'], x['State'], street_synonyms), axis=1)
print(df[['City', 'State', 'Zip']])
"""
   City           State  Zip
0  Lisbon Falls   ME     04252
1  Augusta        ME     04330
2  Battle Creek   MI     NaN
3  Big Rapids     MI     NaN
4  Van Buren      MI     14857
5  Dallas         TX     75208
6  nue Waco       TX     76710
7  Dallas         TX     75201
"""

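In the output above, row 4 reports the house number 14857 as the Zip, and row 6 gives "nue Waco" as the City because "Ave" is found inside "Avenue". A possible variant, offered as a sketch and not as the answer's method, is a single regular expression that anchors the postcode to the end of the string and tries longer street words first. The pattern and the reordered synonym list are assumptions that fit these eight sample rows; real addresses would need more cases.

```python
import pandas as pd

df = pd.DataFrame({"Address Text": [
    "11 North Warren Circle Lisbon Falls ME 04252",
    "227 Cony Street Augusta ME 04330",
    "70 Buckner Drive Battle Creek MI",
    "718 Perry Street Big Rapids MI",
    "14857 Martinsville Road Van Buren MI",
    "823 Woodlawn Ave Dallas TX 75208",
    "2525 Washington Avenue Waco TX 76710",
    "123 South Main St Dallas TX 75201",
]})

# Longer words come first so "Avenue" is tried before "Ave" and "Street" before "St".
street_synonyms = ["Circle", "Street", "Drive", "Road", "Avenue", "Ave", "St"]

pattern = (
    r"\b(?:" + "|".join(street_synonyms) + r")\s+"  # street word followed by whitespace
    r"(?P<City>.+?)\s+"                             # city: shortest text before the state
    r"(?P<State>[A-Z]{2})"                          # two-letter state code
    r"(?:\s+(?P<Zip>\d{5}))?$"                      # optional five digits at the very end
)

# Named groups become the column names of the extracted frame.
parsed = df["Address Text"].str.extract(pattern)
print(parsed)
```

Because the postcode group is optional and anchored with $, rows without a postcode get NaN instead of picking up the house number.
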
Having issues merging time series data

I'm having issues merging data from multiple sheets within the same Excel file.
2008 data:
UNI     DEP       ADDRESBR
6       24065037  225 Franklin Street
17      416952    100 North Gay Street
361391  3756      1717 South College
blank   81651     215 South 6th Street
2009 data:
UNI     DEP-2009  ADDRESBR
6       20624948  225 Franklin Street
17      471803    100 North Gay Street
361391  3891      1717 South College
180886  100277    215 South 6th Street
493224  1683      2315 Bentcreek Road
The goal is to combine the values from all sheets into the first sheet, appending each year's DEP as a new column. The issue I am having is the blank UNI value in the first sheet, and matching the address and UNI columns across sheets.
The final result should look like this:
UNI     DEPSUMBR  ADDRESBR              DEP-2009  DEP-n
6       20624948  225 Franklin Street   20624948
17      471803    100 North Gay Street  471803
361391  3891      1717 South College    3891
180886  100277    215 South 6th Street  ...
493224  1683      2315 Bentcreek Road   ...
Can anyone help with how I would do this in Python? The goal is a final dataset with DEP per year appended as a column. The trouble I am having is matching each year's DEP value with its respective UNI number.
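One way to approach this, sketched under assumptions, is to merge the sheets on the branch address, since UNI is blank for one 2008 row. The sketch assumes the address text is spelled identically in every sheet and is unique within a sheet; the column name DEP-2008 and the frame names are made up for illustration, and the frames are typed in by hand instead of being read from the workbook.

```python
import pandas as pd

# Hand-typed copies of the two sheets shown in the question
dep_2008 = pd.DataFrame({
    "UNI": [6, 17, 361391, None],  # None stands for the blank UNI
    "DEP": [24065037, 416952, 3756, 81651],
    "ADDRESBR": ["225 Franklin Street", "100 North Gay Street",
                 "1717 South College", "215 South 6th Street"],
})
dep_2009 = pd.DataFrame({
    "UNI": [6, 17, 361391, 180886, 493224],
    "DEP-2009": [20624948, 471803, 3891, 100277, 1683],
    "ADDRESBR": ["225 Franklin Street", "100 North Gay Street",
                 "1717 South College", "215 South 6th Street",
                 "2315 Bentcreek Road"],
})

# Keep only the key and the yearly value from 2008, under a year-specific name
# (DEP-2008 is an assumed name, chosen to mirror DEP-2009).
dep_2008_by_address = dep_2008.rename(columns={"DEP": "DEP-2008"})[["ADDRESBR", "DEP-2008"]]

# An outer merge keeps branches that appear in only one of the years.
merged = dep_2009.merge(dep_2008_by_address, on="ADDRESBR", how="outer")
print(merged)
```

Further years could be attached the same way, one merge per sheet. Passing sheet_name=None to pd.read_excel returns a dict of DataFrames keyed by sheet name, which is convenient for looping over the sheets.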

Group a dataframe by a column and concatenate strings in another

I know this should be easy but it's driving me mad...
I am trying to turn a dataframe into a grouped dataframe.
df outputs:
    Postcode  Borough           Neighbourhood
0   M3A       North York        Parkwoods
1   M4A       North York        Victoria Village
2   M5A       Downtown Toronto  Harbourfront
3   M5A       Downtown Toronto  Regent Park
4   M6A       North York        Lawrence Heights
5   M6A       North York        Lawrence Manor
6   M7A       Queen's Park      Not assigned
7   M9A       Etobicoke         Islington Avenue
8   M1B       Scarborough       Rouge
9   M1B       Scarborough       Malvern
10  M3B       North York        Don Mills North
...
I want to group the dataframe by Postcode so that all Neighbourhood values sharing a Postcode are concatenated into a single string,
something like:
    Postcode  Borough           Neighbourhood
0   M3A       North York        Parkwoods
1   M4A       North York        Victoria Village
2   M5A       Downtown Toronto  Harbourfront, Regent Park
...
I am trying to use:
df.groupby(['Postcode'])['Neighbourhood'].apply(lambda strs: ', '.join(strs))
But this does not give me a new dataframe: when I print df after running it, I still get the original dataframe.
If I use:
df = df.groupby(['Postcode'])['Neighbourhood'].apply(lambda strs: ', '.join(strs))
it turns df into an object?

groupby does not modify df in place, so assign the result to a new name. Use this code:
new_df = df.groupby(['Postcode', 'Borough']).agg({'Neighbourhood':lambda x:', '.join(x)}).reset_index()
reset_index() moves the group-by columns out of the index and back into regular columns of the dataframe, and creates a new integer index.
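As a small check of the difference between the two approaches, here is a sketch using the first six rows from the question. Selecting a single column before aggregating gives a Series indexed by Postcode, while grouping on both key columns and calling reset_index() gives a DataFrame. The names as_series and new_df are only for illustration.

```python
import pandas as pd

# First six rows of the question's data
df = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M5A", "M5A", "M6A", "M6A"],
    "Borough": ["North York", "North York", "Downtown Toronto",
                "Downtown Toronto", "North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Harbourfront",
                      "Regent Park", "Lawrence Heights", "Lawrence Manor"],
})

# The question's attempt: one column selected, so the result is a Series.
as_series = df.groupby(["Postcode"])["Neighbourhood"].apply(lambda strs: ", ".join(strs))
print(type(as_series).__name__)

# The answer's approach: both keys in the groupby, then back to ordinary columns.
new_df = (
    df.groupby(["Postcode", "Borough"])
      .agg({"Neighbourhood": lambda x: ", ".join(x)})
      .reset_index()
)
print(new_df)
```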

AWK reformat portion of results (names) within larger string

My goal is to reformat names from Last First Middle (LFM) to First Middle Last (FML), which are part of a larger string. Here's some sample data:
Name, Address1, Address2
Smith Joe M, 123 Apple Rd, Paris TX
Adams Keith Randall, 543 1st Street, Salinas CA
Price Tiffany, 11232 32nd Street, New York NY
Walker Karen E F, 98 West Ave, Denver CO
What I would like is:
Name, Address1, Address2
Joe M Smith, 123 Apple Rd, Paris TX
Keith Randall Adams, 543 1st Street, Salinas CA
Tiffany Price, 11232 32nd Street, New York NY
Karen E F Walker, 98 West Ave, Denver CO
I know how to reorder the first column, but I end up dropping the rest of the row data:
# Return the first column via comma separation (name), then separate by spaces
# If there are two strings but not three (only a last and first name),
# then change the order to first last.
awk -F, '{print $1}'| awk -F" " '$2!="" && $3=="" {print $2,$1}' >> names.txt
awk -F, '{print $1}'| awk -F" " '$3!="" && $4=="" {print $3,$1,$2}' >> names.txt
...# Continue to iterate column numbers
If there's an easier way to put the last string found and move it to the front I'd like to hear about it, but here's my real interest...
My problem is that I want to reorder the space separated fields of the 1st comma separated field (what I did above), but then also print the rest of the comma separated data.
Is there a way I can store the address info in a variable and append it after the space-separated names?
Alternatively, could I do some kind of nested split?
I'm currently doing this with awk in bash, but am willing to use python/pandas or any other efficient methods.
Thanks for the help!

Using sed, looks terrible but works:
sed -E '2,$s/^([^ ,]*) ([^ ,]*)( [^,]*)?/\2\3 \1/' in
and POSIX version:
sed '2,$s/^\([^ ,]*\) \([^ ,]*\)\( [^,]*\)*/\2\3 \1/' in
output:
Name, Address1, Address2
Joe M Smith, 123 Apple Rd, Paris TX
Keith Randall Adams, 543 1st Street, Salinas CA
Tiffany Price, 11232 32nd Street, New York NY
Karen E F Walker, 98 West Ave, Denver CO

The following AWK script, as ugly as it is, works for your inputs (run with awk -F, -f script.awk):
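{
    # split() returns the number of pieces, which is more portable than length(array)
    n = split($1, names, " ");
    for (i=2; i<=n; i++)
        printf("%s ", names[i]);
    # under -F, the remaining fields keep their leading space, so none is added here
    printf("%s,", names[1]);
    for(i=2; i<NF; i++)
        printf("%s,", $i);
    print($NF)
}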
Input:
Smith Joe M, 123 Apple Rd, Paris TX
Adams Keith Randall, 543 1st Street, Salinas CA
Price Tiffany, 11232 32nd Street, New York NY
Walker Karen E F, 98 West Ave, Denver CO
Output:
Joe M Smith, 123 Apple Rd, Paris TX
Keith Randall Adams, 543 1st Street, Salinas CA
Tiffany Price, 11232 32nd Street, New York NY
Karen E F Walker, 98 West Ave, Denver CO
The same solution in Python:
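import re
import sys

for line in sys.stdin:
    # Split on commas, dropping surrounding whitespace and the trailing newline
    parts = re.split(r'\s*,\s*', line.strip())
    # Rotate the name so the first word (the last name) moves to the end
    names = parts[0].split()
    print(", ".join([" ".join(names[1:] + names[:1])] + parts[1:]))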

Another awk. This one works with the header line and Madonna (ie. single word fields):
$ awk '                        # using awk
BEGIN{FS=OFS=","}              # csv
{
    n=split($1,a," ")          # split the first field to a
    for(i=n;i>1;i--)           # iterate back from the last element of a
        a[1]=a[i] " " a[1]     # prepending to the first element of a
    $1=a[1]                    # replace the first field with the first element of a
}1' file                       # output
Output:
Name, Address1, Address2
Joe M Smith, 123 Apple Rd, Paris TX
Keith Randall Adams, 543 1st Street, Salinas CA
Tiffany Price, 11232 32nd Street, New York NY
Karen E F Walker, 98 West Ave, Denver CO
Madonna, ...

$ awk '
BEGIN { FS=OFS=", " }
$1 ~ / / {
    last = rest = $1
    sub(/ .*/,"",last)
    sub(/[^ ]+ /,"",rest)
    $1 = rest " " last
}
{ print }
' file
Name, Address1, Address2
Joe M Smith, 123 Apple Rd, Paris TX
Keith Randall Adams, 543 1st Street, Salinas CA
Tiffany Price, 11232 32nd Street, New York NY
Karen E F Walker, 98 West Ave, Denver CO
