Pandas DataFrame from txt - python

I have a .txt that goes like this:
USA
Arizona - New Mexico
Interstate 40
Interstate 10
South Dakota - Minneapolis
Interstate 90
South Carolina - Washington
Arizona - California
Interstate 40
Interstate 10
Interstate 8
ANOTHER COUNTRY
State A - State B
Highway 1
Highway 2
Highway 3
...
...
I want to create a DataFrame and a CSV in pandas, where the first column contains the States, and the second column the Highway.
States HW_Number
Arizona - New Mexico Interstate 40
Arizona - New Mexico Interstate 10
South Dakota - Minneapolis Interstate 90
Arizona - California Interstate 40
Arizona - California Interstate 10
Arizona - California Interstate 9
State A - State B Highway 1
State A - State B Highway 2
State A - State B Highway 3
How can I manage to do that? Not all the states have the same amount of Highways, and can have 0 Highways, and those that have 0, I do not want to be integrated in the DataFrame.
A column with the Country could be integrated as well.
Thank you

As I said, a pretty easy file to parse:
import pandas as pd
rows = []
state = None
for line in open('x.txt'):
if line[0] == ' ':
continue
line = line.strip()
if not line:
continue
if '-' in line:
state = line
else:
rows.append( (state,line) )
df = pd.DataFrame(rows, columns=['state','road'])
print(df)
Output:
----------
state road
0 Arizona - New Mexico Interstate 40
1 Arizona - New Mexico Interstate 10
2 South Dakota - Minneapolis Interstate 90
3 Arizona - California Interstate 40
4 Arizona - California Interstate 10
5 Arizona - California Interstate 8
6 State A - State B Highway 1
7 State A - State B Highway 2
8 State A - State B Highway 3

You can iterate through the rows and use characteristics of your structured data to create lists. These lists can be used to make a dataframe or series.
read the lines from the file into a list (f.readlines())
remove empty rows
keep track of current state (doesn't end with a number)
append the states and highways to lists
use lists to make a dataframe or series
import pandas as pd
import io
f = io.StringIO(
"""
USA
Arizona - New Mexico
Interstate 40
Interstate 10
South Dakota - Minneapolis
Interstate 90
South Carolina - Washington
Arizona - California
Interstate 40
Interstate 10
Interstate 8
ANOTHER COUNTRY
State A - State B
Highway 1
Highway 2
Highway 3
"""
)
lines = f.readlines()
states = []
hw_numbers = []
current_state = None
for line in lines:
line = line.strip() #removes \n
if line == '': #remove empty rows
continue
elif line[-1].isdigit() == False: #if not a digit, then it's a state
current_state = line
else: #if it is a digit, then it's a highway
states.append(current_state)
hw_numbers.append(line)
pd.DataFrame({
'States':states,
'HW_number':hw_numbers
})

Related

Assign values from a dictionary to a new column based on condition

This my data frame
City
sales
San Diego
500
Texas
400
Nebraska
300
Macau
200
Rome
100
London
50
Manchester
70
I want to add the country at the end which will look like this
City
sales
Country
San Diego
500
US
Texas
400
US
Nebraska
300
US
Macau
200
Hong Kong
Rome
100
Italy
London
50
England
Manchester
200
England
The countries are stored in below dictionary
country={'US':['San Diego','Texas','Nebraska'], 'Hong Kong':'Macau', 'England':['London','Manchester'],'Italy':'Rome'}
It's a little complicated because you have lists and strings as the values and strings are technically iterable, so distinguishing is more annoying. But here's a function that can flatten your dict:
def flatten_dict(d):
nd = {}
for k,v in d.items():
# Check if it's a list, if so then iterate through
if ((hasattr(v, '__iter__') and not isinstance(v, str))):
for item in v:
nd[item] = k
else:
nd[v] = k
return nd
d = flatten_dict(country)
#{'San Diego': 'US',
# 'Texas': 'US',
# 'Nebraska': 'US',
# 'Macau': 'Hong Kong',
# 'London': 'England',
# 'Manchester': 'England',
# 'Rome': 'Italy'}
df['Country'] = df['City'].map(d)
You can implement this using geopy
You can install geopy by pip install geopy
Here is the documentation : https://pypi.org/project/geopy/
# import libraries
from geopy.geocoders import Nominatim
# you need to mention a name for the app
geolocator = Nominatim(user_agent="some_random_app_name")
# get country name
df['Country'] = df['City'].apply(lambda x : geolocator.geocode(x).address.split(', ')[-1])
print(df)
City sales Country
0 San Diego 500 United States
1 Texas 400 United States
2 Nebraska 300 United States
3 Macau 200 中国
4 Rome 100 Italia
5 London 50 United Kingdom
6 Manchester 70 United Kingdom
# to get country name in english
df['Country'] = df['City'].apply(lambda x : geolocator.reverse(geolocator.geocode(x).point, language='en').address.split(', ')[-1])
print(df)
City sales Country
0 San Diego 500 United States
1 Texas 400 United States
2 Nebraska 300 United States
3 Macau 200 China
4 Rome 100 Italy
5 London 50 United Kingdom
6 Manchester 70 United Kingdom

Split column in DataFrame based on item in list

I have the following table and would like to split each row into three columns: state, postcode and city. State and postcode are easy, but I'm unable to extract the city. I thought about splitting each string after the street synonyms and before the state, but I seem to be getting the loop wrong as it will only use the last item in my list.
Input data:
Address Text
0 11 North Warren Circle Lisbon Falls ME 04252
1 227 Cony Street Augusta ME 04330
2 70 Buckner Drive Battle Creek MI
3 718 Perry Street Big Rapids MI
4 14857 Martinsville Road Van Buren MI
5 823 Woodlawn Ave Dallas TX 75208
6 2525 Washington Avenue Waco TX 76710
7 123 South Main St Dallas TX 75201
The output I'm trying to achieve (for all rows, but I only wrote out the first two to save time)
City State Postcode
0 Lisbon Falls ME 04252
1 Augusta ME 04330
My code:
# Extract postcode and state
df["Zip"] = df["Address Text"].str.extract(r'(\d{5})', expand = True)
df["State"] = df["Address Text"].str.extract(r'([A-Z]{2})', expand = True)
# Split after these substrings
street_synonyms = ["Circle", "Street", "Drive", "Road", "Ave", "Avenue", "St"]
# This is where I got stuck
df["Syn"] = df["Address Text"].apply(lambda x: x.split(syn))
df
Here's a way to do that:
import pandas as pd
# data
df = pd.DataFrame(
['11 North Warren Circle Lisbon Falls ME 04252',
'227 Cony Street Augusta ME 04330',
'70 Buckner Drive Battle Creek MI',
'718 Perry Street Big Rapids MI',
'14857 Martinsville Road Van Buren MI',
'823 Woodlawn Ave Dallas TX 75208',
'2525 Washington Avenue Waco TX 76710',
'123 South Main St Dallas TX 75201'],
columns=['Address Text'])
# Extract postcode and state
df["Zip"] = df["Address Text"].str.extract(r'(\d{5})', expand=True)
df["State"] = df["Address Text"].str.extract(r'([A-Z]{2})', expand=True)
# Split after these substrings
street_synonyms = ["Circle", "Street", "Drive", "Road", "Ave", "Avenue", "St"]
def find_city(address, state, street_synonyms):
for syn in street_synonyms:
if syn in address:
# remove street
city = address.split(syn)[-1]
# remove State and postcode
city = city.split(state)[0]
return city
df['City'] = df.apply(lambda x: find_city(x['Address Text'], x['State'], street_synonyms), axis=1)
print(df[['City', 'State', 'Zip']])
"""
City State Zip
0 Lisbon Falls ME 04252
1 Augusta ME 04330
2 Battle Creek MI NaN
3 Big Rapids MI NaN
4 Van Buren MI 14857
5 Dallas TX 75208
6 nue Waco TX 76710
7 Dallas TX 75201
"""

Group a dataframe by a column and concactenate strings in another

I know this should be easy but it's driving me mad...
I am trying to turn a dataframe into a grouped dataframe.
df outputs:
Postcode Borough Neighbourhood
0 M3A North York Parkwoods
1 M4A North York Victoria Village
2 M5A Downtown Toronto Harbourfront
3 M5A Downtown Toronto Regent Park
4 M6A North York Lawrence Heights
5 M6A North York Lawrence Manor
6 M7A Queen's Park Not assigned
7 M9A Etobicoke Islington Avenue
8 M1B Scarborough Rouge
9 M1B Scarborough Malvern
10 M3B North York Don Mills North
...
I want to make a grouped dataframe where the Neighbourhood is grouped by Postcode and all neighborhoods then become a concatenated string of Neighbourhoods as grouped by Postcode...
something like:
Postcode Borough Neighbourhood
0 M3A North York Parkwoods
1 M4A North York Victoria Village
2 M5A Downtown Toronto Harbourfront, Regent Park
...
I am trying to use:
df.groupby(['Postcode'])['Neighbourhood'].apply(lambda strs: ', '.join(strs))
But this does not return a new dataframe .. it outputs the same original dataframe when I use df after running.
if I use:
df = df.groupby(['Postcode'])['Neighbourhood'].apply(lambda strs: ', '.join(strs))
it turns df into an object?
Use this code
new_df = df.groupby(['Postcode', 'Borough']).agg({'Neighbourhood':lambda x:', '.join(x)}).reset_index()
reset_index() will take your group by columns out of the index and return it as a column to the dataframe and create a new integer index.

deleting row from Dataframe results in distribute dataframe in Python

I have below dataframe nbr2:
Postal_Code Borough Neighborhood
0 M1B Scarborough Rouge, Malvern
1 M4C East York Woodbine Heights
2 M4E East Toronto The Beaches
3 M4L East Toronto The Beaches West, India Bazaar
4 M4M East Toronto Studio District
5 M4N Central Toronto Lawrence Park
On applying below code to filter out rows:
neighbor = nbr2.drop(nbr2[nbr2['Borough'].str.contains("Toronto")==False].index, axis=0, inplace=True)
the dataframe gets distributes like below:
Postal_Code Borough \
37 M4E East Toronto
41 M4K East Toronto
42 M4L East Toronto
43 M4M East Toronto
Neighborhood
37 The Beaches
41 The Danforth West\n, Riverdale
42 The Beaches West\n, India Bazaar
43 Studio District\n
below code also results in similar structure:
# define the dataframe columns
column_names = ['Postal_Code','Borough', 'Neighborhood']
# instantiate the dataframe
neighbor = pd.DataFrame(columns=column_names)
neighbor = nbr2.drop(nbr2[nbr2['Borough'].str.contains("Toronto")==False].index, axis=0, inplace=True)
use
pd.set_option('display.expand_frame_repr', False)

how to preserve original indexes in the new dataframe

def answer_eight():
templist = list()
for county, region, p15, p14, ste, cty in zip(census_df.CTYNAME,
census_df.REGION,
census_df.POPESTIMATE2015,
census_df.POPESTIMATE2014,
census_df.STNAME,
census_df.CTYNAME):
# print(county)
if region == 1 or region == 2:
if county.startswith('Washington'):
if p15 > p14:
templist.append((ste, cty))
labels = ['STNAME', 'CTYNAME']
df = pd.DataFrame.from_records(templist, columns=labels)
return df
STNAME CTYNAME
0 Iowa Washington County
1 Minnesota Washington County
2 Pennsylvania Washington County
3 Rhode Island Washington County
4 Wisconsin Washington County
All these CTYNAME has different indexes in the original census_df. How could I transfer them over to the new DF so the answer looks like:
STNAME CTYNAME
12 Iowa Washington County
222 Minnesota Washington County
400 Pennsylvania Washington County
2900 Rhode Island Washington County
2999 Wisconsin Washington County
I'd include the index with the other things your are zipping
def answer_eight():
templist = list()
index = list()
zipped = zip(
census_df.CTYNAME,
census_df.REGION,
census_df.POPESTIMATE2015,
census_df.POPESTIMATE2014,
census_df.STNAME,
census_df.CTYNAME,
census_df.index
)
for county, region, p15, p14, ste, cty, idx in zipped:
# print(county)
if region == 1 or region == 2:
if county.startswith('Washington'):
if p15 > p14:
templist.append((ste, cty))
index.append(idx)
labels = ['STNAME', 'CTYNAME']
df = pd.DataFrame(templist, index, labels)
return df.rename_axis(census_df.index.name)
Before you start filtering, you can assign the original index to a column with:
census_df['original index'] = census_df.index
Then just treat it like one of the other columns you're selecting from.

Categories