I want to replace a missing country with the country that corresponds to the city, i.e. find another data point with the same city and copy its country; if there is no other record with the same city, then remove the row.
Here's the dataframe:
id city lat lng country
1036323110 Katherine -14.4667 132.2667 Australia
1840015979 South Pasadena 27.7526 -82.7394
1124755118 Beaconsfield 45.4333 -73.8667 Canada
1250921305 Ferney-Voltaire 46.2558 6.1081 France
1156346497 Jiangshan 28.7412 118.6225
1231393325 Dīla 6.4104 38.3100 Ethiopia
1840015979 South Pasadena 27.7526 -82.7394 United States
1192391794 Kigoma 21.1072 -76.1367
1840054954 Hampstead 42.8821 -71.1709 United States
1840005111 West Islip 40.7097 -73.2971 United States
1076327352 Paulínia -22.7611 -47.1542 Brazil
1250921305 Ferney-Voltaire 46.2558 6.1081
1250921305 Ferney-Voltaire 46.2558 6.1081 France
1156346497 Jiangshan 28.7412 118.6225 China
1231393325 Dīla 6.4104 38.3100 Ethiopia
1192391794 Gibara 21.1072 -76.1367 Cuba
1840054954 Dodoma 42.8821 -71.1709
1840005111 West Islip 40.7097 -73.2971 United States
Here's my code so far:
df[df.isin(['city'])].stack()
You can group by the city, lat and lng columns and fill missing values with the first non-NaN value in each group.
df['country'] = df['country'].fillna(
    df.groupby(['city', 'lat', 'lng'])['country'].transform(
        # first_valid_index() returns None for an all-NaN group, so test
        # against None explicitly -- index 0 is falsy but perfectly valid
        lambda x: x.loc[x.first_valid_index()] if x.first_valid_index() is not None else x
    )
)
print(df)
id city lat lng country
0 1036323110 Katherine -14.4667 132.2667 Australia
1 1840015979 South Pasadena 27.7526 -82.7394 United States
2 1124755118 Beaconsfield 45.4333 -73.8667 Canada
3 1250921305 Ferney-Voltaire 46.2558 6.1081 France
4 1156346497 Jiangshan 28.7412 118.6225 China
5 1231393325 Dīla 6.4104 38.3100 Ethiopia
6 1840015979 South Pasadena 27.7526 -82.7394 United States
7 1192391794 Kigoma 21.1072 -76.1367 NaN
8 1840054954 Hampstead 42.8821 -71.1709 United States
9 1840005111 West Islip 40.7097 -73.2971 United States
10 1076327352 Paulínia -22.7611 -47.1542 Brazil
11 1250921305 Ferney-Voltaire 46.2558 6.1081 France
12 1250921305 Ferney-Voltaire 46.2558 6.1081 France
13 1156346497 Jiangshan 28.7412 118.6225 China
14 1231393325 Dīla 6.4104 38.3100 Ethiopia
15 1192391794 Gibara 21.1072 -76.1367 Cuba
16 1840054954 Dodoma 42.8821 -71.1709 NaN
17 1840005111 West Islip 40.7097 -73.2971 United States
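The question also asked to remove rows whose city never appears with a country; those are exactly the rows that are still NaN after the fill, so a follow-up dropna handles them. A minimal sketch, using a small stand-in dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1192391794, 1840054954, 1840005111],
    'city': ['Kigoma', 'Dodoma', 'West Islip'],
    'country': [None, None, 'United States'],
})

# Kigoma and Dodoma have no other record with the same city, so their
# country stays NaN after the groupby fill; drop those rows
df = df.dropna(subset=['country']).reset_index(drop=True)
print(df)
```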
I've solved such a problem with the geopy package.
You can use the lat and lng, then filter the geopy output for the country. This way you avoid NaNs and always get an answer based on geo information.
Install the geopy package with:
pip3 install geopy
from geopy.geocoders import Nominatim

geo = Nominatim(user_agent="geoapiExercises")
for i in range(len(df)):
    lat = str(df.iloc[i, 2])  # 'lat' column
    lon = str(df.iloc[i, 3])  # 'lng' column
    # reverse-geocode the coordinates and keep only the country
    df.iloc[i, 4] = geo.reverse(lat + ',' + lon).raw['address']['country']
Please inform yourself about the user_agent API; for exercise purposes this key should work.
I'm a beginner at Python.
I'm having trouble printing the values from a .txt file.
NFL_teams = csv.DictReader(txt_file, delimiter = "\t")
print(type(NFL_teams))
columns = NFL_teams.fieldnames
print('Columns: {0}'.format(columns))
print(columns[2])
This gets me to the name of the 3rd column ("team_id") but I can't seem to print any of the data that would be in the corresponding column. I have no idea what I'm doing.
Thanks in advance.
EDIT: The .txt file looks like this:
team_name team_name_short team_id team_id_pfr team_conference team_division team_conference_pre2002 team_division_pre2002 first_year last_year
Arizona Cardinals Cardinals ARI CRD NFC NFC West NFC NFC West 1994 2021
Phoenix Cardinals Cardinals ARI CRD NFC - NFC NFC East 1988 1993
St. Louis Cardinals Cardinals ARI ARI NFC - NFC NFC East 1966 1987
Atlanta Falcons Falcons ATL ATL NFC NFC South NFC NFC West 1966 2021
Baltimore Ravens Ravens BAL RAV AFC AFC North AFC AFC Central 1996 2021
Buffalo Bills Bills BUF BUF AFC AFC East AFC AFC East 1966 2021
Carolina Panthers Panthers CAR CAR NFC NFC South NFC NFC West 1995 2021
There are tabs between each item.
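For what it's worth, a DictReader yields one dict per data row, so you can collect any column by name once you iterate it. A minimal self-contained sketch, with an inline tab-separated sample standing in for the actual file (only the first three columns shown):

```python
import csv
import io

# Inline stand-in for the tab-delimited .txt file
txt_file = io.StringIO(
    "team_name\tteam_name_short\tteam_id\n"
    "Arizona Cardinals\tCardinals\tARI\n"
    "Atlanta Falcons\tFalcons\tATL\n"
)

NFL_teams = csv.DictReader(txt_file, delimiter="\t")
print(NFL_teams.fieldnames)  # ['team_name', 'team_name_short', 'team_id']

# Iterating the reader yields one dict per data row
team_ids = [row["team_id"] for row in NFL_teams]
print(team_ids)  # ['ARI', 'ATL']
```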
I have a pandas dataframe with one column like this:
Merged_Cities
New York, Wisconsin, Atlanta
Tokyo, Kyoto, Suzuki
Paris, Bordeaux, Lyon
Mumbai, Delhi, Bangalore
London, Manchester, Bermingham
And I want a new dataframe with the output like this:
Merged_Cities Cities
New York, Wisconsin, Atlanta New York
New York, Wisconsin, Atlanta Wisconsin
New York, Wisconsin, Atlanta Atlanta
Tokyo, Kyoto, Suzuki Tokyo
Tokyo, Kyoto, Suzuki Kyoto
Tokyo, Kyoto, Suzuki Suzuki
Paris, Bordeaux, Lyon Paris
Paris, Bordeaux, Lyon Bordeaux
Paris, Bordeaux, Lyon Lyon
Mumbai, Delhi, Bangalore Mumbai
Mumbai, Delhi, Bangalore Delhi
Mumbai, Delhi, Bangalore Bangalore
London, Manchester, Bermingham London
London, Manchester, Bermingham Manchester
London, Manchester, Bermingham Bermingham
In short I want to split all the cities into different rows while maintaining the 'Merged_Cities' column.
Here's a replicable version of df:
df = pd.DataFrame({'Merged_Cities':['New York, Wisconsin, Atlanta',
'Tokyo, Kyoto, Suzuki',
'Paris, Bordeaux, Lyon',
'Mumbai, Delhi, Bangalore',
'London, Manchester, Bermingham']})
Use .str.split() and .explode():
df = df.assign(Cities=df["Merged_Cities"].str.split(", ")).explode("Cities")
print(df)
Prints:
Merged_Cities Cities
0 New York, Wisconsin, Atlanta New York
0 New York, Wisconsin, Atlanta Wisconsin
0 New York, Wisconsin, Atlanta Atlanta
1 Tokyo, Kyoto, Suzuki Tokyo
1 Tokyo, Kyoto, Suzuki Kyoto
1 Tokyo, Kyoto, Suzuki Suzuki
2 Paris, Bordeaux, Lyon Paris
2 Paris, Bordeaux, Lyon Bordeaux
2 Paris, Bordeaux, Lyon Lyon
3 Mumbai, Delhi, Bangalore Mumbai
3 Mumbai, Delhi, Bangalore Delhi
3 Mumbai, Delhi, Bangalore Bangalore
4 London, Manchester, Bermingham London
4 London, Manchester, Bermingham Manchester
4 London, Manchester, Bermingham Bermingham
This is really similar to @AndrejKesely's answer, except it merges df and the cities on their index.
# Create pandas.Series from splitting the column on ', '
s = df['Merged_Cities'].str.split(', ').explode().rename('Cities')
# Merge df with s on their index
df = df.merge(s, left_index=True, right_index=True)
# Result
print(df)
Merged_Cities Cities
0 New York, Wisconsin, Atlanta New York
0 New York, Wisconsin, Atlanta Wisconsin
0 New York, Wisconsin, Atlanta Atlanta
1 Tokyo, Kyoto, Suzuki Tokyo
1 Tokyo, Kyoto, Suzuki Kyoto
1 Tokyo, Kyoto, Suzuki Suzuki
I have the following code:
import pandas as pd
houses = houses.reset_index()
houses['difference'] = houses[start]-houses[bottom]
towns = pd.merge(houses, unitowns, how='inner', on=['State','RegionName'])
print(towns)
However, the output is a dataframe 'towns' with 0 rows.
I can't understand why, given that the dataframe 'houses' looks like this:
State RegionName
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
5 Alabama Troy
6 Alabama Tuscaloosa
7 Alabama Tuskegee
8 Alaska Fairbanks
And the dataframe 'unitowns' looks like this:
State RegionName 2000q1 2000q2 2000q3 2000q4 2001q1 2001q2 2001q3 2001q4 ... 2014q2 2014q3 2014q4 2015q1 2015q2 2015q3 2015q4 2016q1 2016q2 2016q3
0 New York New York NaN NaN NaN NaN NaN NaN NaN NaN ... 5.154667e+05 5.228000e+05 5.280667e+05 5.322667e+05 5.408000e+05 5.572000e+05 5.728333e+05 5.828667e+05 5.916333e+05 587200.0
1 California Los Angeles 2.070667e+05 2.144667e+05 2.209667e+05 2.261667e+05 2.330000e+05 2.391000e+05 2.450667e+05 2.530333e+05 ... 4.980333e+05 5.090667e+05 5.188667e+05 5.288000e+05 5.381667e+05 5.472667e+05 5.577333e+05 5.660333e+05 5.774667e+05 584050.0
2 Illinois Chicago 1.384000e+05 1.436333e+05 1.478667e+05 1.521333e+05 1.569333e+05 1.618000e+05 1.664000e+05 1.704333e+05 ... 1.926333e+05 1.957667e+05 2.012667e+05 2.010667e+05 2.060333e+05 2.083000e+05 2.079000e+05 2.060667e+05 2.082000e+05 212000.0
3 Pennsylvania Philadelphia 5.300000e+04 5.363333e+04 5.413333e+04 5.470000e+04 5.533333e+04 5.553333e+04 5.626667e+04 5.753333e+04 ... 1.137333e+05 1.153000e+05 1.156667e+05 1.162000e+05 1.179667e+05 1.212333e+05 1.222000e+05 1.234333e+05 1.269333e+05 128700.0
4 Arizona Phoenix 1.118333e+05 1.143667e+05 1.160000e+05 1.174000e+05 1.196000e+05 1.215667e+05 1.227000e+05 1.243000e+05 ... 1.642667e+05 1.653667e+05 1.685000e+05 1.715333e+05 1.741667e+05 1.790667e+05 1.838333e+05 1.879000e+05 1.914333e+05 195200.0
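When an inner merge unexpectedly returns 0 rows, the join keys usually don't match exactly (mismatched dtypes, trailing whitespace, etc.). One way to see what's going on (a generic sketch, not the asker's actual data) is an outer merge with indicator=True:

```python
import pandas as pd

left = pd.DataFrame({'State': ['Alabama', 'Alaska'],
                     'RegionName': ['Auburn', 'Fairbanks']})
# Note the trailing space in 'Alabama ' -- a common cause of empty merges
right = pd.DataFrame({'State': ['Alabama '], 'RegionName': ['Auburn']})

# The inner merge finds nothing because 'Alabama ' != 'Alabama'
inner = pd.merge(left, right, how='inner', on=['State', 'RegionName'])
print(len(inner))  # 0

# indicator=True adds a '_merge' column showing where each row came from,
# which makes key mismatches easy to spot
out = pd.merge(left, right, how='outer', on=['State', 'RegionName'], indicator=True)
print(out['_merge'].value_counts())
```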
I need to create a Pandas DataFrame based on a text file based on the following structure:
Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
Arkansas[edit]
The rows with "[edit]" are States and the rows with "[number]" are Regions. I need to split these and repeat the State name for each Region Name thereafter.
Index State Region Name
0 Alabama Auburn...
1 Alabama Florence...
2 Alabama Jacksonville...
...
8 Alaska Fairbanks...
9 Arizona Flagstaff...
10 Arizona Tempe...
I'm not sure how to split the text file based on "[edit]" and "[number]" or "(characters)" into the respective columns and repeat the State Name for each Region Name. Can anyone give me a starting point?
You can first use read_csv with the names parameter to create a DataFrame with a Region Name column; as separator, pass a value that does NOT occur in the data (like ;):
df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
Then insert a new State column by extracting from the rows containing [edit], and remove everything from ( to the end of each value in the Region Name column:
df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
df['Region Name'] = df['Region Name'].str.replace(r' \(.+$', '', regex=True)
Last, remove the rows containing [edit] by boolean indexing; the mask is created with str.contains:
df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print(df)
State Region Name
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
5 Alabama Troy
6 Alabama Tuscaloosa
7 Alabama Tuskegee
8 Alaska Fairbanks
9 Arizona Flagstaff
10 Arizona Tempe
11 Arizona Tucson
If you need all the values, the solution is easier:
df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print(df)
State Region Name
0 Alabama Auburn (Auburn University)[1]
1 Alabama Florence (University of North Alabama)
2 Alabama Jacksonville (Jacksonville State University)[2]
3 Alabama Livingston (University of West Alabama)[2]
4 Alabama Montevallo (University of Montevallo)[2]
5 Alabama Troy (Troy University)[2]
6 Alabama Tuscaloosa (University of Alabama, Stillman Co...
7 Alabama Tuskegee (Tuskegee University)[5]
8 Alaska Fairbanks (University of Alaska Fairbanks)[2]
9 Arizona Flagstaff (Northern Arizona University)[6]
10 Arizona Tempe (Arizona State University)
11 Arizona Tucson (University of Arizona)
You could parse the file into tuples first:
import pandas as pd
from collections import namedtuple

Item = namedtuple('Item', 'state area')
items = []

with open('unis.txt') as f:
    for line in f:
        l = line.rstrip('\n')
        if l.endswith('[edit]'):
            # slice off the literal '[edit]' suffix;
            # str.rstrip('[edit]') would strip characters, not the suffix
            state = l[:-len('[edit]')]
        else:
            i = l.index(' (')
            items.append(Item(state, l[:i]))

df = pd.DataFrame.from_records(items, columns=['State', 'Area'])
print(df)
output:
State Area
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
5 Alabama Troy
6 Alabama Tuscaloosa
7 Alabama Tuskegee
8 Alaska Fairbanks
9 Arizona Flagstaff
10 Arizona Tempe
11 Arizona Tucson
Assuming you have the following DF:
In [73]: df
Out[73]:
text
0 Alabama[edit]
1 Auburn (Auburn University)[1]
2 Florence (University of North Alabama)
3 Jacksonville (Jacksonville State University)[2]
4 Livingston (University of West Alabama)[2]
5 Montevallo (University of Montevallo)[2]
6 Troy (Troy University)[2]
7 Tuscaloosa (University of Alabama, Stillman Co...
8 Tuskegee (Tuskegee University)[5]
9 Alaska[edit]
10 Fairbanks (University of Alaska Fairbanks)[2]
11 Arizona[edit]
12 Flagstaff (Northern Arizona University)[6]
13 Tempe (Arizona State University)
14 Tucson (University of Arizona)
15 Arkansas[edit]
you can use Series.str.extract() method:
In [117]: df['State'] = df.loc[df.text.str.contains('[edit]', regex=False), 'text'].str.extract(r'(.*?)\[edit\]', expand=False)
In [118]: df['Region Name'] = df.loc[df.State.isnull(), 'text'].str.extract(r'(.*?)\s*[\(\[]+.*[\n]*', expand=False)
In [120]: df.State = df.State.ffill()
In [121]: df
Out[121]:
text State Region Name
0 Alabama[edit] Alabama NaN
1 Auburn (Auburn University)[1] Alabama Auburn
2 Florence (University of North Alabama) Alabama Florence
3 Jacksonville (Jacksonville State University)[2] Alabama Jacksonville
4 Livingston (University of West Alabama)[2] Alabama Livingston
5 Montevallo (University of Montevallo)[2] Alabama Montevallo
6 Troy (Troy University)[2] Alabama Troy
7 Tuscaloosa (University of Alabama, Stillman Co... Alabama Tuscaloosa
8 Tuskegee (Tuskegee University)[5] Alabama Tuskegee
9 Alaska[edit] Alaska NaN
10 Fairbanks (University of Alaska Fairbanks)[2] Alaska Fairbanks
11 Arizona[edit] Arizona NaN
12 Flagstaff (Northern Arizona University)[6] Arizona Flagstaff
13 Tempe (Arizona State University) Arizona Tempe
14 Tucson (University of Arizona) Arizona Tucson
15 Arkansas[edit] Arkansas NaN
In [122]: df = df.dropna()
In [123]: df
Out[123]:
text State Region Name
1 Auburn (Auburn University)[1] Alabama Auburn
2 Florence (University of North Alabama) Alabama Florence
3 Jacksonville (Jacksonville State University)[2] Alabama Jacksonville
4 Livingston (University of West Alabama)[2] Alabama Livingston
5 Montevallo (University of Montevallo)[2] Alabama Montevallo
6 Troy (Troy University)[2] Alabama Troy
7 Tuscaloosa (University of Alabama, Stillman Co... Alabama Tuscaloosa
8 Tuskegee (Tuskegee University)[5] Alabama Tuskegee
10 Fairbanks (University of Alaska Fairbanks)[2] Alaska Fairbanks
12 Flagstaff (Northern Arizona University)[6] Arizona Flagstaff
13 Tempe (Arizona State University) Arizona Tempe
14 Tucson (University of Arizona) Arizona Tucson
TL;DR
s.groupby(s.str.extract('(?P<State>.*?)\[edit\]', expand=False).ffill()).apply(pd.Series.tail, n=-1).reset_index(name='Region_Name').iloc[:, [0, 2]]
regex = '(?P<State>.*?)\[edit\]' # pattern to match
print(s.groupby(
# will get nulls where we don't have "[edit]"
# forward fill fills in the most recent line
# where we did have an "[edit]"
s.str.extract(regex, expand=False).ffill()
).apply(
# I still have all the original values
# If I group by the forward filled rows
# I'll want to drop the first one within each group
pd.Series.tail, n=-1
).reset_index(
# munge the dataframe to get columns sorted
name='Region_Name'
)[['State', 'Region_Name']])
State Region_Name
0 Alabama Auburn (Auburn University)[1]
1 Alabama Florence (University of North Alabama)
2 Alabama Jacksonville (Jacksonville State University)[2]
3 Alabama Livingston (University of West Alabama)[2]
4 Alabama Montevallo (University of Montevallo)[2]
5 Alabama Troy (Troy University)[2]
6 Alabama Tuscaloosa (University of Alabama, Stillman Co...
7 Alabama Tuskegee (Tuskegee University)[5]
8 Alaska Fairbanks (University of Alaska Fairbanks)[2]
9 Arizona Flagstaff (Northern Arizona University)[6]
10 Arizona Tempe (Arizona State University)
11 Arizona Tucson (University of Arizona)
setup
txt = """Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
Arkansas[edit]"""
from io import StringIO

s = pd.read_csv(StringIO(txt), sep='|', header=None).squeeze('columns')
You will probably need to perform some additional manipulation on the file before getting it into a dataframe.
A starting point would be to split the file into lines, search each line for the string [edit], and use the state name as a dictionary key when it is present...
I do not think that Pandas has any built in methods that would handle a file in this format.
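The dictionary approach described above can be sketched roughly like this (a minimal sketch; the inline sample stands in for the actual file):

```python
import io

# Inline stand-in for the text file described in the question
sample = io.StringIO(
    "Alabama[edit]\n"
    "Auburn (Auburn University)[1]\n"
    "Florence (University of North Alabama)\n"
    "Alaska[edit]\n"
    "Fairbanks (University of Alaska Fairbanks)[2]\n"
)

states = {}      # state name -> list of region names
current = None
for line in sample:
    line = line.rstrip('\n')
    if line.endswith('[edit]'):
        # A state header: drop the '[edit]' suffix and start a new key
        current = line[:-len('[edit]')]
        states[current] = []
    else:
        # A region line: keep only the text before ' ('
        states[current].append(line.split(' (')[0])

print(states)
# {'Alabama': ['Auburn', 'Florence'], 'Alaska': ['Fairbanks']}
```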
You seem to be taking Coursera's Introduction to Data Science course. I passed my test with this solution. I would advise not copying the whole solution but using it just for reference purposes :)
import pandas as pd

lines = open('university_towns.txt').readlines()
state = None
rows = []
for line in lines:
    if '[edit]' in line:
        # State header: drop the trailing '[edit]\n' (7 characters)
        state = line[:-7]
    elif '(' in line:
        # Region with a parenthesised university list: keep text before ' ('
        pos = line.find('(')
        rows.append([state, line[:pos - 1]])
    else:
        # Plain region line: just drop the trailing newline
        rows.append([state, line[:-1]])
df = pd.DataFrame(rows, columns=["State", "RegionName"])