I want to replace a missing country with the country that corresponds to the city, i.e. find another data point with the same city and copy its country; if there is no other record with the same city, then remove the row.
Here's the dataframe:
id city lat lng country
1036323110 Katherine -14.4667 132.2667 Australia
1840015979 South Pasadena 27.7526 -82.7394
1124755118 Beaconsfield 45.4333 -73.8667 Canada
1250921305 Ferney-Voltaire 46.2558 6.1081 France
1156346497 Jiangshan 28.7412 118.6225
1231393325 Dīla 6.4104 38.3100 Ethiopia
1840015979 South Pasadena 27.7526 -82.7394 United States
1192391794 Kigoma 21.1072 -76.1367
1840054954 Hampstead 42.8821 -71.1709 United States
1840005111 West Islip 40.7097 -73.2971 United States
1076327352 Paulínia -22.7611 -47.1542 Brazil
1250921305 Ferney-Voltaire 46.2558 6.1081
1250921305 Ferney-Voltaire 46.2558 6.1081 France
1156346497 Jiangshan 28.7412 118.6225 China
1231393325 Dīla 6.4104 38.3100 Ethiopia
1192391794 Gibara 21.1072 -76.1367 Cuba
1840054954 Dodoma 42.8821 -71.1709
1840005111 West Islip 40.7097 -73.2971 United States
Here's my code so far:
df[df.isin(['city'])].stack()
You can group by the city, lat and lng columns and fill missing values with the first non-NaN value in each group.
df['country'] = df['country'].fillna(
    df.groupby(['city', 'lat', 'lng'])['country'].transform(
        # first_valid_index() returns None for an all-NaN group, so test
        # against None explicitly -- index 0 is falsy but perfectly valid
        lambda x: x.loc[x.first_valid_index()] if x.first_valid_index() is not None else x
    )
)
print(df)
id city lat lng country
0 1036323110 Katherine -14.4667 132.2667 Australia
1 1840015979 South Pasadena 27.7526 -82.7394 United States
2 1124755118 Beaconsfield 45.4333 -73.8667 Canada
3 1250921305 Ferney-Voltaire 46.2558 6.1081 France
4 1156346497 Jiangshan 28.7412 118.6225 China
5 1231393325 Dīla 6.4104 38.3100 Ethiopia
6 1840015979 South Pasadena 27.7526 -82.7394 United States
7 1192391794 Kigoma 21.1072 -76.1367 NaN
8 1840054954 Hampstead 42.8821 -71.1709 United States
9 1840005111 West Islip 40.7097 -73.2971 United States
10 1076327352 Paulínia -22.7611 -47.1542 Brazil
11 1250921305 Ferney-Voltaire 46.2558 6.1081 France
12 1250921305 Ferney-Voltaire 46.2558 6.1081 France
13 1156346497 Jiangshan 28.7412 118.6225 China
14 1231393325 Dīla 6.4104 38.3100 Ethiopia
15 1192391794 Gibara 21.1072 -76.1367 Cuba
16 1840054954 Dodoma 42.8821 -71.1709 NaN
17 1840005111 West Islip 40.7097 -73.2971 United States
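The question also asked to remove rows whose city never appears with a country; those are exactly the rows that are still NaN after the fill, so a follow-up dropna handles them. A minimal sketch, using a small stand-in dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1192391794, 1840054954, 1840005111],
    'city': ['Kigoma', 'Dodoma', 'West Islip'],
    'country': [None, None, 'United States'],
})

# Kigoma and Dodoma have no other record with the same city, so their
# country stays NaN after the groupby fill; drop those rows
df = df.dropna(subset=['country']).reset_index(drop=True)
print(df)
```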
I've solved such a problem with the geopy package.
You can use the lat and lng, then filter the geopy output for the country. This way you avoid NaNs and always get an answer based on geo information.
Install the geopy package with:
pip3 install geopy
from geopy.geocoders import Nominatim

geo = Nominatim(user_agent="geoapiExercises")
for i in range(len(df)):
    lat = str(df.iloc[i, 2])  # 'lat' column
    lon = str(df.iloc[i, 3])  # 'lng' column
    # reverse-geocode the coordinates and keep only the country
    df.iloc[i, 4] = geo.reverse(lat + ',' + lon).raw['address']['country']
Please inform yourself about the user_agent API; for exercise purposes this key should work.
I'm a beginner at Python.
I'm having trouble printing the values from a .txt file.
NFL_teams = csv.DictReader(txt_file, delimiter = "\t")
print(type(NFL_teams))
columns = NFL_teams.fieldnames
print('Columns: {0}'.format(columns))
print(columns[2])
This gets me to the name of the 3rd column ("team_id") but I can't seem to print any of the data that would be in the corresponding column. I have no idea what I'm doing.
Thanks in advance.
EDIT: The .txt file looks like this:
team_name team_name_short team_id team_id_pfr team_conference team_division team_conference_pre2002 team_division_pre2002 first_year last_year
Arizona Cardinals Cardinals ARI CRD NFC NFC West NFC NFC West 1994 2021
Phoenix Cardinals Cardinals ARI CRD NFC - NFC NFC East 1988 1993
St. Louis Cardinals Cardinals ARI ARI NFC - NFC NFC East 1966 1987
Atlanta Falcons Falcons ATL ATL NFC NFC South NFC NFC West 1966 2021
Baltimore Ravens Ravens BAL RAV AFC AFC North AFC AFC Central 1996 2021
Buffalo Bills Bills BUF BUF AFC AFC East AFC AFC East 1966 2021
Carolina Panthers Panthers CAR CAR NFC NFC South NFC NFC West 1995 2021
There are tabs between each item.
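For what it's worth, a DictReader yields one dict per data row, so you can collect any column by name once you iterate it. A minimal self-contained sketch, with an inline tab-separated sample standing in for the actual file (only the first three columns shown):

```python
import csv
import io

# Inline stand-in for the tab-delimited .txt file
txt_file = io.StringIO(
    "team_name\tteam_name_short\tteam_id\n"
    "Arizona Cardinals\tCardinals\tARI\n"
    "Atlanta Falcons\tFalcons\tATL\n"
)

NFL_teams = csv.DictReader(txt_file, delimiter="\t")
print(NFL_teams.fieldnames)  # ['team_name', 'team_name_short', 'team_id']

# Iterating the reader yields one dict per data row
team_ids = [row["team_id"] for row in NFL_teams]
print(team_ids)  # ['ARI', 'ATL']
```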
I have a pandas dataframe with one column like this:
Merged_Cities
New York, Wisconsin, Atlanta
Tokyo, Kyoto, Suzuki
Paris, Bordeaux, Lyon
Mumbai, Delhi, Bangalore
London, Manchester, Bermingham
And I want a new dataframe with the output like this:
Merged_Cities Cities
New York, Wisconsin, Atlanta New York
New York, Wisconsin, Atlanta Wisconsin
New York, Wisconsin, Atlanta Atlanta
Tokyo, Kyoto, Suzuki Tokyo
Tokyo, Kyoto, Suzuki Kyoto
Tokyo, Kyoto, Suzuki Suzuki
Paris, Bordeaux, Lyon Paris
Paris, Bordeaux, Lyon Bordeaux
Paris, Bordeaux, Lyon Lyon
Mumbai, Delhi, Bangalore Mumbai
Mumbai, Delhi, Bangalore Delhi
Mumbai, Delhi, Bangalore Bangalore
London, Manchester, Bermingham London
London, Manchester, Bermingham Manchester
London, Manchester, Bermingham Bermingham
In short I want to split all the cities into different rows while maintaining the 'Merged_Cities' column.
Here's a replicable version of df:
df = pd.DataFrame({'Merged_Cities':['New York, Wisconsin, Atlanta',
'Tokyo, Kyoto, Suzuki',
'Paris, Bordeaux, Lyon',
'Mumbai, Delhi, Bangalore',
'London, Manchester, Bermingham']})
Use .str.split() and .explode():
df = df.assign(Cities=df["Merged_Cities"].str.split(", ")).explode("Cities")
print(df)
Prints:
Merged_Cities Cities
0 New York, Wisconsin, Atlanta New York
0 New York, Wisconsin, Atlanta Wisconsin
0 New York, Wisconsin, Atlanta Atlanta
1 Tokyo, Kyoto, Suzuki Tokyo
1 Tokyo, Kyoto, Suzuki Kyoto
1 Tokyo, Kyoto, Suzuki Suzuki
2 Paris, Bordeaux, Lyon Paris
2 Paris, Bordeaux, Lyon Bordeaux
2 Paris, Bordeaux, Lyon Lyon
3 Mumbai, Delhi, Bangalore Mumbai
3 Mumbai, Delhi, Bangalore Delhi
3 Mumbai, Delhi, Bangalore Bangalore
4 London, Manchester, Bermingham London
4 London, Manchester, Bermingham Manchester
4 London, Manchester, Bermingham Bermingham
This is really similar to @AndrejKesely's answer, except it merges df and the cities on their index.
# Create pandas.Series from splitting the column on ', '
s = df['Merged_Cities'].str.split(', ').explode().rename('Cities')
# Merge df with s on their index
df = df.merge(s, left_index=True, right_index=True)
# Result
print(df)
Merged_Cities Cities
0 New York, Wisconsin, Atlanta New York
0 New York, Wisconsin, Atlanta Wisconsin
0 New York, Wisconsin, Atlanta Atlanta
1 Tokyo, Kyoto, Suzuki Tokyo
1 Tokyo, Kyoto, Suzuki Kyoto
1 Tokyo, Kyoto, Suzuki Suzuki
I have the following code:
import pandas as pd
houses = houses.reset_index()
houses['difference'] = houses[start]-houses[bottom]
towns = pd.merge(houses, unitowns, how='inner', on=['State','RegionName'])
print(towns)
However, the output is a dataframe 'towns' with 0 rows.
I can't understand why, given that the dataframe 'houses' looks like this:
State RegionName
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
5 Alabama Troy
6 Alabama Tuscaloosa
7 Alabama Tuskegee
8 Alaska Fairbanks
And the dataframe 'unitowns' looks like this:
State RegionName 2000q1 2000q2 2000q3 2000q4 2001q1 2001q2 2001q3 2001q4 ... 2014q2 2014q3 2014q4 2015q1 2015q2 2015q3 2015q4 2016q1 2016q2 2016q3
0 New York New York NaN NaN NaN NaN NaN NaN NaN NaN ... 5.154667e+05 5.228000e+05 5.280667e+05 5.322667e+05 5.408000e+05 5.572000e+05 5.728333e+05 5.828667e+05 5.916333e+05 587200.0
1 California Los Angeles 2.070667e+05 2.144667e+05 2.209667e+05 2.261667e+05 2.330000e+05 2.391000e+05 2.450667e+05 2.530333e+05 ... 4.980333e+05 5.090667e+05 5.188667e+05 5.288000e+05 5.381667e+05 5.472667e+05 5.577333e+05 5.660333e+05 5.774667e+05 584050.0
2 Illinois Chicago 1.384000e+05 1.436333e+05 1.478667e+05 1.521333e+05 1.569333e+05 1.618000e+05 1.664000e+05 1.704333e+05 ... 1.926333e+05 1.957667e+05 2.012667e+05 2.010667e+05 2.060333e+05 2.083000e+05 2.079000e+05 2.060667e+05 2.082000e+05 212000.0
3 Pennsylvania Philadelphia 5.300000e+04 5.363333e+04 5.413333e+04 5.470000e+04 5.533333e+04 5.553333e+04 5.626667e+04 5.753333e+04 ... 1.137333e+05 1.153000e+05 1.156667e+05 1.162000e+05 1.179667e+05 1.212333e+05 1.222000e+05 1.234333e+05 1.269333e+05 128700.0
4 Arizona Phoenix 1.118333e+05 1.143667e+05 1.160000e+05 1.174000e+05 1.196000e+05 1.215667e+05 1.227000e+05 1.243000e+05 ... 1.642667e+05 1.653667e+05 1.685000e+05 1.715333e+05 1.741667e+05 1.790667e+05 1.838333e+05 1.879000e+05 1.914333e+05 195200.0
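When an inner merge unexpectedly returns 0 rows, the join keys usually don't match exactly (mismatched dtypes, trailing whitespace, etc.). One way to see what's going on (a generic sketch, not the asker's actual data) is an outer merge with indicator=True:

```python
import pandas as pd

left = pd.DataFrame({'State': ['Alabama', 'Alaska'],
                     'RegionName': ['Auburn', 'Fairbanks']})
# Note the trailing space in 'Alabama ' -- a common cause of empty merges
right = pd.DataFrame({'State': ['Alabama '], 'RegionName': ['Auburn']})

# The inner merge finds nothing because 'Alabama ' != 'Alabama'
inner = pd.merge(left, right, how='inner', on=['State', 'RegionName'])
print(len(inner))  # 0

# indicator=True adds a '_merge' column showing where each row came from,
# which makes key mismatches easy to spot
out = pd.merge(left, right, how='outer', on=['State', 'RegionName'], indicator=True)
print(out['_merge'].value_counts())
```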
I need to create a Pandas DataFrame based on a text file based on the following structure:
Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
Arkansas[edit]
The rows with "[edit]" are States and the rows with "[number]" are Regions. I need to split these and repeat the State name for each Region Name thereafter.
Index State Region Name
0 Alabama Auburn...
1 Alabama Florence...
2 Alabama Jacksonville...
...
8 Alaska Fairbanks...
9 Arizona Flagstaff...
10 Arizona Tempe...
I'm not sure how to split the text file based on "[edit]" and "[number]" or "(characters)" into the respective columns and repeat the State Name for each Region Name. Can anyone give me a starting point?
You can first use read_csv with the names parameter to create a DataFrame with a Region Name column; as separator, pass a value that does NOT occur in the data (like ;):
df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
Then insert a new State column by extracting from the rows containing [edit], and remove everything from ( to the end of each value in the Region Name column:
df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
df['Region Name'] = df['Region Name'].str.replace(r' \(.+$', '', regex=True)
Last, remove the rows containing [edit] by boolean indexing; the mask is created with str.contains:
df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print(df)
State Region Name
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
5 Alabama Troy
6 Alabama Tuscaloosa
7 Alabama Tuskegee
8 Alaska Fairbanks
9 Arizona Flagstaff
10 Arizona Tempe
11 Arizona Tucson
If you need all the values, the solution is easier:
df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print(df)
State Region Name
0 Alabama Auburn (Auburn University)[1]
1 Alabama Florence (University of North Alabama)
2 Alabama Jacksonville (Jacksonville State University)[2]
3 Alabama Livingston (University of West Alabama)[2]
4 Alabama Montevallo (University of Montevallo)[2]
5 Alabama Troy (Troy University)[2]
6 Alabama Tuscaloosa (University of Alabama, Stillman Co...
7 Alabama Tuskegee (Tuskegee University)[5]
8 Alaska Fairbanks (University of Alaska Fairbanks)[2]
9 Arizona Flagstaff (Northern Arizona University)[6]
10 Arizona Tempe (Arizona State University)
11 Arizona Tucson (University of Arizona)
You could parse the file into tuples first:
import pandas as pd
from collections import namedtuple

Item = namedtuple('Item', 'state area')
items = []

with open('unis.txt') as f:
    for line in f:
        l = line.rstrip('\n')
        if l.endswith('[edit]'):
            # slice off the literal '[edit]' suffix;
            # str.rstrip('[edit]') would strip characters, not the suffix
            state = l[:-len('[edit]')]
        else:
            i = l.index(' (')
            items.append(Item(state, l[:i]))

df = pd.DataFrame.from_records(items, columns=['State', 'Area'])
print(df)
output:
State Area
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
5 Alabama Troy
6 Alabama Tuscaloosa
7 Alabama Tuskegee
8 Alaska Fairbanks
9 Arizona Flagstaff
10 Arizona Tempe
11 Arizona Tucson
Assuming you have the following DF:
In [73]: df
Out[73]:
text
0 Alabama[edit]
1 Auburn (Auburn University)[1]
2 Florence (University of North Alabama)
3 Jacksonville (Jacksonville State University)[2]
4 Livingston (University of West Alabama)[2]
5 Montevallo (University of Montevallo)[2]
6 Troy (Troy University)[2]
7 Tuscaloosa (University of Alabama, Stillman Co...
8 Tuskegee (Tuskegee University)[5]
9 Alaska[edit]
10 Fairbanks (University of Alaska Fairbanks)[2]
11 Arizona[edit]
12 Flagstaff (Northern Arizona University)[6]
13 Tempe (Arizona State University)
14 Tucson (University of Arizona)
15 Arkansas[edit]
you can use Series.str.extract() method:
In [117]: df['State'] = df.loc[df.text.str.contains('[edit]', regex=False), 'text'].str.extract(r'(.*?)\[edit\]', expand=False)
In [118]: df['Region Name'] = df.loc[df.State.isnull(), 'text'].str.extract(r'(.*?)\s*[\(\[]+.*[\n]*', expand=False)
In [120]: df.State = df.State.ffill()
In [121]: df
Out[121]:
text State Region Name
0 Alabama[edit] Alabama NaN
1 Auburn (Auburn University)[1] Alabama Auburn
2 Florence (University of North Alabama) Alabama Florence
3 Jacksonville (Jacksonville State University)[2] Alabama Jacksonville
4 Livingston (University of West Alabama)[2] Alabama Livingston
5 Montevallo (University of Montevallo)[2] Alabama Montevallo
6 Troy (Troy University)[2] Alabama Troy
7 Tuscaloosa (University of Alabama, Stillman Co... Alabama Tuscaloosa
8 Tuskegee (Tuskegee University)[5] Alabama Tuskegee
9 Alaska[edit] Alaska NaN
10 Fairbanks (University of Alaska Fairbanks)[2] Alaska Fairbanks
11 Arizona[edit] Arizona NaN
12 Flagstaff (Northern Arizona University)[6] Arizona Flagstaff
13 Tempe (Arizona State University) Arizona Tempe
14 Tucson (University of Arizona) Arizona Tucson
15 Arkansas[edit] Arkansas NaN
In [122]: df = df.dropna()
In [123]: df
Out[123]:
text State Region Name
1 Auburn (Auburn University)[1] Alabama Auburn
2 Florence (University of North Alabama) Alabama Florence
3 Jacksonville (Jacksonville State University)[2] Alabama Jacksonville
4 Livingston (University of West Alabama)[2] Alabama Livingston
5 Montevallo (University of Montevallo)[2] Alabama Montevallo
6 Troy (Troy University)[2] Alabama Troy
7 Tuscaloosa (University of Alabama, Stillman Co... Alabama Tuscaloosa
8 Tuskegee (Tuskegee University)[5] Alabama Tuskegee
10 Fairbanks (University of Alaska Fairbanks)[2] Alaska Fairbanks
12 Flagstaff (Northern Arizona University)[6] Arizona Flagstaff
13 Tempe (Arizona State University) Arizona Tempe
14 Tucson (University of Arizona) Arizona Tucson
TL;DR
s.groupby(s.str.extract('(?P<State>.*?)\[edit\]', expand=False).ffill()).apply(pd.Series.tail, n=-1).reset_index(name='Region_Name').iloc[:, [0, 2]]
regex = '(?P<State>.*?)\[edit\]' # pattern to match
print(s.groupby(
# will get nulls where we don't have "[edit]"
# forward fill fills in the most recent line
# where we did have an "[edit]"
s.str.extract(regex, expand=False).ffill()
).apply(
# I still have all the original values
# If I group by the forward filled rows
# I'll want to drop the first one within each group
pd.Series.tail, n=-1
).reset_index(
# munge the dataframe to get columns sorted
name='Region_Name'
)[['State', 'Region_Name']])
State Region_Name
0 Alabama Auburn (Auburn University)[1]
1 Alabama Florence (University of North Alabama)
2 Alabama Jacksonville (Jacksonville State University)[2]
3 Alabama Livingston (University of West Alabama)[2]
4 Alabama Montevallo (University of Montevallo)[2]
5 Alabama Troy (Troy University)[2]
6 Alabama Tuscaloosa (University of Alabama, Stillman Co...
7 Alabama Tuskegee (Tuskegee University)[5]
8 Alaska Fairbanks (University of Alaska Fairbanks)[2]
9 Arizona Flagstaff (Northern Arizona University)[6]
10 Arizona Tempe (Arizona State University)
11 Arizona Tucson (University of Arizona)
setup
txt = """Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
Arkansas[edit]"""
from io import StringIO

s = pd.read_csv(StringIO(txt), sep='|', header=None).squeeze('columns')
You will probably need to perform some additional manipulation on the file before getting it into a dataframe.
A starting point would be to split the file into lines, search each line for the string [edit], and use the state name as a dictionary key when it is present...
I do not think that Pandas has any built in methods that would handle a file in this format.
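The dictionary approach described above can be sketched roughly like this (a minimal sketch; the inline sample stands in for the actual file):

```python
import io

# Inline stand-in for the text file described in the question
sample = io.StringIO(
    "Alabama[edit]\n"
    "Auburn (Auburn University)[1]\n"
    "Florence (University of North Alabama)\n"
    "Alaska[edit]\n"
    "Fairbanks (University of Alaska Fairbanks)[2]\n"
)

states = {}      # state name -> list of region names
current = None
for line in sample:
    line = line.rstrip('\n')
    if line.endswith('[edit]'):
        # A state header: drop the '[edit]' suffix and start a new key
        current = line[:-len('[edit]')]
        states[current] = []
    else:
        # A region line: keep only the text before ' ('
        states[current].append(line.split(' (')[0])

print(states)
# {'Alabama': ['Auburn', 'Florence'], 'Alaska': ['Fairbanks']}
```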
You seem to be taking Coursera's Introduction to Data Science course. I passed my test with this solution. I would advise not copying the whole solution but using it just for reference purposes :)
import pandas as pd

lines = open('university_towns.txt').readlines()
state = None
rows = []
for line in lines:
    if '[edit]' in line:
        # State header: drop the trailing '[edit]\n' (7 characters)
        state = line[:-7]
    elif '(' in line:
        # Region with a parenthesised university list: keep text before ' ('
        pos = line.find('(')
        rows.append([state, line[:pos - 1]])
    else:
        # Plain region line: just drop the trailing newline
        rows.append([state, line[:-1]])
df = pd.DataFrame(rows, columns=["State", "RegionName"])