Given conditions for list.append [duplicate] - python
I need to create a Pandas DataFrame from a text file with the following structure:
Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
Arkansas[edit]
The rows ending in "[edit]" are States, and the rows that follow them (often ending in a footnote marker such as "[1]") are Regions. I need to split these into two columns and repeat the State name for each Region Name that follows it.
Index State Region Name
0 Alabama Auburn...
1 Alabama Florence...
2 Alabama Jacksonville...
...
8 Alaska Fairbanks...
9 Arizona Flagstaff...
10 Arizona Tempe...
Pandas DataFrame
I am not sure how to split the text file on "[edit]" and "[number]" or "(characters)" into the respective columns and repeat the State name for each Region Name. Can anyone give me a starting point?
You can first call read_csv with the names parameter to create a DataFrame with a single column, Region Name. The separator should be a character that does NOT occur in the data (such as ;):
df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
Then insert a new column State by extracting the text before [edit] and forward-filling it, and strip everything from ( to the end of each value in Region Name:
df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
df['Region Name'] = df['Region Name'].str.replace(r' \(.+$', '', regex=True)
Finally, remove the rows that contain [edit] by boolean indexing; the mask is created with str.contains:
df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print (df)
State Region Name
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
5 Alabama Troy
6 Alabama Tuscaloosa
7 Alabama Tuskegee
8 Alaska Fairbanks
9 Arizona Flagstaff
10 Arizona Tempe
11 Arizona Tucson
If you need the full values, the solution is simpler:
df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print (df)
State Region Name
0 Alabama Auburn (Auburn University)[1]
1 Alabama Florence (University of North Alabama)
2 Alabama Jacksonville (Jacksonville State University)[2]
3 Alabama Livingston (University of West Alabama)[2]
4 Alabama Montevallo (University of Montevallo)[2]
5 Alabama Troy (Troy University)[2]
6 Alabama Tuscaloosa (University of Alabama, Stillman Co...
7 Alabama Tuskegee (Tuskegee University)[5]
8 Alaska Fairbanks (University of Alaska Fairbanks)[2]
9 Arizona Flagstaff (Northern Arizona University)[6]
10 Arizona Tempe (Arizona State University)
11 Arizona Tucson (University of Arizona)
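For anyone who wants to try the steps above without a file on disk, here is a minimal self-contained sketch. The `sample` string is a hypothetical stand-in for 'filename.txt', and `regex=True` / `regex=False` are passed explicitly because the default of `str.replace` differs between pandas versions. It follows the same idea as the answer rather than being a definitive implementation.

```python
from io import StringIO

import pandas as pd

# Hypothetical in-memory sample standing in for 'filename.txt'
sample = """Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Tempe (Arizona State University)
Arkansas[edit]
"""

# ';' never occurs in the data, so every line lands in a single column
df = pd.read_csv(StringIO(sample), sep=";", names=["Region Name"])

# Header rows match "<State>[edit]"; other rows give NaN, which ffill() fills in
state = df["Region Name"].str.extract(r"(.*)\[edit\]", expand=False).ffill()
df.insert(0, "State", state)

# Drop the header rows, then trim " (University...)[n]" from what is left
df = df[~df["Region Name"].str.contains("[edit]", regex=False)].reset_index(drop=True)
df["Region Name"] = df["Region Name"].str.replace(r"\s*\(.*$", "", regex=True)
print(df)
```

A state with no regions under it (Arkansas in this sample) simply disappears from the result, because only its header row existed.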
You could parse the file into tuples first:
import pandas as pd
from collections import namedtuple

Item = namedtuple('Item', 'state area')
items = []
with open('unis.txt') as f:
    for line in f:
        l = line.rstrip('\n')
        if l.endswith('[edit]'):
            # slice off the suffix; rstrip('[edit]') would strip a character set
            state = l[:-len('[edit]')]
        else:
            # keep the text before the first " ("
            i = l.index(' (')
            area = l[:i]
            items.append(Item(state, area))
df = pd.DataFrame.from_records(items, columns=['State', 'Area'])
print(df)
output:
State Area
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
5 Alabama Troy
6 Alabama Tuscaloosa
7 Alabama Tuskegee
8 Alaska Fairbanks
9 Arizona Flagstaff
10 Arizona Tempe
11 Arizona Tucson
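A variation on the parsing idea above, written as a generator so it can be tested on a plain list of lines. This is only a sketch: `iter_towns` and `sample_lines` are hypothetical names, and the sample deliberately includes a line without parentheses to show that `str.partition` does not raise where `str.index` would.

```python
import pandas as pd

def iter_towns(lines):
    """Yield (state, area) pairs from '[edit]'-delimited lines."""
    state = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.endswith("[edit]"):
            # Slice off the literal suffix; str.rstrip would strip a character set
            state = line[: -len("[edit]")]
        else:
            # partition() returns the whole line when " (" is absent
            area = line.partition(" (")[0]
            yield state, area

# Hypothetical sample lines standing in for the contents of 'unis.txt'
sample_lines = [
    "Tennessee[edit]\n",
    "Knoxville (University of Tennessee)[2]\n",
    "Sewanee\n",
    "Texas[edit]\n",
    "Austin (University of Texas at Austin)\n",
]
df = pd.DataFrame(list(iter_towns(sample_lines)), columns=["State", "Area"])
print(df)
```

The slice matters for states such as Tennessee: `"Tennessee[edit]".rstrip("[edit]")` removes any trailing characters from the set `[`, `e`, `d`, `i`, `t`, `]` and so eats into the name itself.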
Assuming you have the following DF:
In [73]: df
Out[73]:
text
0 Alabama[edit]
1 Auburn (Auburn University)[1]
2 Florence (University of North Alabama)
3 Jacksonville (Jacksonville State University)[2]
4 Livingston (University of West Alabama)[2]
5 Montevallo (University of Montevallo)[2]
6 Troy (Troy University)[2]
7 Tuscaloosa (University of Alabama, Stillman Co...
8 Tuskegee (Tuskegee University)[5]
9 Alaska[edit]
10 Fairbanks (University of Alaska Fairbanks)[2]
11 Arizona[edit]
12 Flagstaff (Northern Arizona University)[6]
13 Tempe (Arizona State University)
14 Tucson (University of Arizona)
15 Arkansas[edit]
You can use the Series.str.extract() method:
In [117]: df['State'] = df.loc[df.text.str.contains('[edit]', regex=False), 'text'].str.extract(r'(.*?)\[edit\]', expand=False)
In [118]: df['Region Name'] = df.loc[df.State.isnull(), 'text'].str.extract(r'(.*?)\s*[\(\[]+.*[\n]*', expand=False)
In [120]: df.State = df.State.ffill()
In [121]: df
Out[121]:
text State Region Name
0 Alabama[edit] Alabama NaN
1 Auburn (Auburn University)[1] Alabama Auburn
2 Florence (University of North Alabama) Alabama Florence
3 Jacksonville (Jacksonville State University)[2] Alabama Jacksonville
4 Livingston (University of West Alabama)[2] Alabama Livingston
5 Montevallo (University of Montevallo)[2] Alabama Montevallo
6 Troy (Troy University)[2] Alabama Troy
7 Tuscaloosa (University of Alabama, Stillman Co... Alabama Tuscaloosa
8 Tuskegee (Tuskegee University)[5] Alabama Tuskegee
9 Alaska[edit] Alaska NaN
10 Fairbanks (University of Alaska Fairbanks)[2] Alaska Fairbanks
11 Arizona[edit] Arizona NaN
12 Flagstaff (Northern Arizona University)[6] Arizona Flagstaff
13 Tempe (Arizona State University) Arizona Tempe
14 Tucson (University of Arizona) Arizona Tucson
15 Arkansas[edit] Arkansas NaN
In [122]: df = df.dropna()
In [123]: df
Out[123]:
text State Region Name
1 Auburn (Auburn University)[1] Alabama Auburn
2 Florence (University of North Alabama) Alabama Florence
3 Jacksonville (Jacksonville State University)[2] Alabama Jacksonville
4 Livingston (University of West Alabama)[2] Alabama Livingston
5 Montevallo (University of Montevallo)[2] Alabama Montevallo
6 Troy (Troy University)[2] Alabama Troy
7 Tuscaloosa (University of Alabama, Stillman Co... Alabama Tuscaloosa
8 Tuskegee (Tuskegee University)[5] Alabama Tuskegee
10 Fairbanks (University of Alaska Fairbanks)[2] Alaska Fairbanks
12 Flagstaff (Northern Arizona University)[6] Arizona Flagstaff
13 Tempe (Arizona State University) Arizona Tempe
14 Tucson (University of Arizona) Arizona Tucson
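The two extract calls above can also be folded into one pattern with two named groups, one per row type. This is a sketch under the assumption that every row is either "<State>[edit]" or a region followed by optional parentheses and footnotes; the pattern and the variable names `parts` and `result` are mine, not part of the answer.

```python
import pandas as pd

# Same single-column layout as the frame above; only a few rows for brevity
df = pd.DataFrame({"text": [
    "Alabama[edit]",
    "Auburn (Auburn University)[1]",
    "Florence (University of North Alabama)",
    "Alaska[edit]",
    "Fairbanks (University of Alaska Fairbanks)[2]",
    "Arkansas[edit]",
]})

# One alternative per row type: "<State>[edit]" or "<Region> (...)[n]"
pattern = r"^(?:(?P<State>.*?)\[edit\]|(?P<Region>[^(\[]+?)\s*(?:[(\[].*)?)$"
parts = df["text"].str.extract(pattern)

# State is only set on header rows, so carry it down to the rows below
parts["State"] = parts["State"].ffill()
# Header rows have no Region, so dropping them leaves one row per town
result = parts.dropna(subset=["Region"]).reset_index(drop=True)
print(result)
```

With named groups, `str.extract` returns a DataFrame whose columns are the group names, so no separate column assignment is needed.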
TL;DR
s.groupby(s.str.extract(r'(?P<State>.*?)\[edit\]', expand=False).ffill()).apply(pd.Series.tail, n=-1).reset_index(name='Region_Name').iloc[:, [0, 2]]
regex = r'(?P<State>.*?)\[edit\]'  # pattern to match
print(s.groupby(
    # will get nulls where we don't have "[edit]"
    # forward fill fills in the most recent line
    # where we did have an "[edit]"
    s.str.extract(regex, expand=False).ffill()
).apply(
    # I still have all the original values
    # If I group by the forward filled rows
    # I'll want to drop the first one within each group
    pd.Series.tail, n=-1
).reset_index(
    # munge the dataframe to get columns sorted
    name='Region_Name'
)[['State', 'Region_Name']])
State Region_Name
0 Alabama Auburn (Auburn University)[1]
1 Alabama Florence (University of North Alabama)
2 Alabama Jacksonville (Jacksonville State University)[2]
3 Alabama Livingston (University of West Alabama)[2]
4 Alabama Montevallo (University of Montevallo)[2]
5 Alabama Troy (Troy University)[2]
6 Alabama Tuscaloosa (University of Alabama, Stillman Co...
7 Alabama Tuskegee (Tuskegee University)[5]
8 Alaska Fairbanks (University of Alaska Fairbanks)[2]
9 Arizona Flagstaff (Northern Arizona University)[6]
10 Arizona Tempe (Arizona State University)
11 Arizona Tucson (University of Arizona)
setup
import pandas as pd
from io import StringIO

txt = """Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
Arkansas[edit]"""
# the squeeze=True keyword was removed in pandas 2.0; squeeze the single column instead
s = pd.read_csv(StringIO(txt), sep='|', header=None).squeeze('columns')
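If `groupby(...).apply(pd.Series.tail, n=-1)` feels opaque, the same "drop the first row of each group" step can be sketched with `cumcount`, which numbers rows within each group starting at 0. This is an alternative I am suggesting, not the method used above, and the short Series here is a hypothetical stand-in for `s`.

```python
import pandas as pd

# Hypothetical short stand-in for the Series s built in the setup
s = pd.Series(["Alabama[edit]", "Auburn", "Florence", "Alaska[edit]", "Fairbanks"])

# NaN on non-header rows, then forward-filled so every row carries its state
keys = s.str.extract(r"(?P<State>.*?)\[edit\]", expand=False).ffill()

# Row 0 of each group is the "[edit]" header itself; keep everything after it
regions = s[keys.groupby(keys).cumcount() > 0]

out = pd.DataFrame(
    {"State": keys[regions.index], "Region_Name": regions}
).reset_index(drop=True)
print(out)
```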
You will probably need to perform some additional manipulation on the file before getting it into a dataframe.
A starting point would be to split the file into lines, search for the string [edit] in each line, put the string name as the key of a dictionary when it is there...
I do not think that Pandas has any built-in method that handles a file in this format directly.
You seem to be taking Coursera's Introduction to Data Science course. I passed my test with this solution. I would advise against copying the whole solution; use it for reference only :)
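The line-by-line idea described above can be sketched roughly as follows: each state becomes a dictionary key and the lines under it become that key's list, which is then flattened into rows. The function name `towns_by_state` and the sample lines are hypothetical.

```python
import pandas as pd

def towns_by_state(lines):
    """Map each state name to the list of raw region lines that follow it."""
    towns = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if "[edit]" in line:
            # A header line starts a new key
            current = line.replace("[edit]", "")
            towns[current] = []
        elif current is not None:
            towns[current].append(line)
    return towns

# Hypothetical sample; in practice this would be open(path).readlines()
sample_lines = [
    "Alabama[edit]\n",
    "Auburn (Auburn University)[1]\n",
    "Florence (University of North Alabama)\n",
    "Alaska[edit]\n",
    "Fairbanks (University of Alaska Fairbanks)[2]\n",
    "Arkansas[edit]\n",
]
mapping = towns_by_state(sample_lines)

# Flatten {state: [regions]} into one (state, region) row per region
rows = [(state, region) for state, regions in mapping.items() for region in regions]
df = pd.DataFrame(rows, columns=["State", "Region Name"])
print(df)
```

Keeping the dictionary as an intermediate step also makes it easy to spot states that have no regions listed (Arkansas here maps to an empty list).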
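import numpy as np
import pandas as pd

# Read every line of the source file up front
with open('university_towns.txt') as f:
    lines = f.readlines()

l = []
lofl = []
flag = False
for line in lines:
    l = []
    if '[edit]' in line:
        # State header: drop the trailing "[edit]\n" (7 characters)
        index = line[:-7]
    elif '(' in line:
        # Region with a parenthesised university list: keep the text before " ("
        pos = line.find('(')
        line = line[:pos-1]
        l.append(index)
        l.append(line)
        flag = True
    else:
        # Region without parentheses: only strip the newline
        line = line[:-1]
        l.append(index)
        l.append(line)
        flag = True
    # Header rows leave l empty, so they are skipped here
    if flag and np.array(l).size != 0:
        lofl.append(l)
df = pd.DataFrame(lofl, columns=["State", "RegionName"])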
Related
Printing a column from a .txt file
I'm a beginner at Python. I'm having trouble printing the values from a .txt file.
NFL_teams = csv.DictReader(txt_file, delimiter = "\t")
print(type(NFL_teams))
columns = NFL_teams.fieldnames
print('Columns: {0}'.format(columns))
print(columns[2])
This gets me to the name of the 3rd column ("team_id") but I can't seem to print any of the data that would be in the corresponding column. I have no idea what I'm doing. Thanks in advance.
EDIT: The .txt file looks like this:
team_name team_name_short team_id team_id_pfr team_conference team_division team_conference_pre2002 team_division_pre2002 first_year last_year
Arizona Cardinals Cardinals ARI CRD NFC NFC West NFC NFC West 1994 2021
Phoenix Cardinals Cardinals ARI CRD NFC - NFC NFC East 1988 1993
St. Louis Cardinals Cardinals ARI ARI NFC - NFC NFC East 1966 1987
Atlanta Falcons Falcons ATL ATL NFC NFC South NFC NFC West 1966 2021
Baltimore Ravens Ravens BAL RAV AFC AFC North AFC AFC Central 1996 2021
Buffalo Bills Bills BUF BUF AFC AFC East AFC AFC East 1966 2021
Carolina Panthers Panthers CAR CAR NFC NFC South NFC NFC West 1995 2021
There are tabs between each item.
Performing Split on Pandas dataframe and create a new frame
I have a pandas dataframe with one column like this:
Merged_Cities
New York, Wisconsin, Atlanta
Tokyo, Kyoto, Suzuki
Paris, Bordeaux, Lyon
Mumbai, Delhi, Bangalore
London, Manchester, Bermingham
And I want a new dataframe with the output like this:
Merged_Cities Cities
New York, Wisconsin, Atlanta New York
New York, Wisconsin, Atlanta Wisconsin
New York, Wisconsin, Atlanta Atlanta
Tokyo, Kyoto, Suzuki Tokyo
Tokyo, Kyoto, Suzuki Kyoto
Tokyo, Kyoto, Suzuki Suzuki
Paris, Bordeaux, Lyon Paris
Paris, Bordeaux, Lyon Bordeaux
Paris, Bordeaux, Lyon Lyon
Mumbai, Delhi, Bangalore Mumbai
Mumbai, Delhi, Bangalore Delhi
Mumbai, Delhi, Bangalore Bangalore
London, Manchester, Bermingham London
London, Manchester, Bermingham Manchester
London, Manchester, Bermingham Bermingham
In short I want to split all the cities into different rows while maintaining the 'Merged_Cities' column. Here's a replicable version of df:
df = pd.DataFrame({'Merged_Cities':['New York, Wisconsin, Atlanta', 'Tokyo, Kyoto, Suzuki', 'Paris, Bordeaux, Lyon', 'Mumbai, Delhi, Bangalore', 'London, Manchester, Bermingham']})
Use .str.split() and .explode():
df = df.assign(Cities=df["Merged_Cities"].str.split(", ")).explode("Cities")
print(df)
Prints:
Merged_Cities Cities
0 New York, Wisconsin, Atlanta New York
0 New York, Wisconsin, Atlanta Wisconsin
0 New York, Wisconsin, Atlanta Atlanta
1 Tokyo, Kyoto, Suzuki Tokyo
1 Tokyo, Kyoto, Suzuki Kyoto
1 Tokyo, Kyoto, Suzuki Suzuki
2 Paris, Bordeaux, Lyon Paris
2 Paris, Bordeaux, Lyon Bordeaux
2 Paris, Bordeaux, Lyon Lyon
3 Mumbai, Delhi, Bangalore Mumbai
3 Mumbai, Delhi, Bangalore Delhi
3 Mumbai, Delhi, Bangalore Bangalore
4 London, Manchester, Bermingham London
4 London, Manchester, Bermingham Manchester
4 London, Manchester, Bermingham Bermingham
This is really similar to @AndrejKesely's answer, except it merges df and the cities on their index.
# Create pandas.Series from splitting the column on ', '
s = df['Merged_Cities'].str.split(', ').explode().rename('Cities')
# Merge df with s on their index
df = df.merge(s, left_index=True, right_index=True)
# Result
print(df)
Merged_Cities Cities
0 New York, Wisconsin, Atlanta New York
0 New York, Wisconsin, Atlanta Wisconsin
0 New York, Wisconsin, Atlanta Atlanta
1 Tokyo, Kyoto, Suzuki Tokyo
1 Tokyo, Kyoto, Suzuki Kyoto
1 Tokyo, Kyoto, Suzuki Suzuki
Problem with joining dataframes using pd.merge
I have the following code:
import pandas as pd
houses = houses.reset_index()
houses['difference'] = houses[start]-houses[bottom]
towns = pd.merge(houses, unitowns, how='inner', on=['State','RegionName'])
print(towns)
However, the output is a dataframe 'towns' with 0 rows. I can't understand why, given that the dataframe 'houses' looks like this:
State RegionName
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
5 Alabama Troy
6 Alabama Tuscaloosa
7 Alabama Tuskegee
8 Alaska Fairbanks
And the dataframe 'unitowns' looks like this:
State RegionName 2000q1 2000q2 2000q3 2000q4 2001q1 2001q2 2001q3 2001q4 ... 2014q2 2014q3 2014q4 2015q1 2015q2 2015q3 2015q4 2016q1 2016q2 2016q3
0 New York New York NaN NaN NaN NaN NaN NaN NaN NaN ... 5.154667e+05 5.228000e+05 5.280667e+05 5.322667e+05 5.408000e+05 5.572000e+05 5.728333e+05 5.828667e+05 5.916333e+05 587200.0
1 California Los Angeles 2.070667e+05 2.144667e+05 2.209667e+05 2.261667e+05 2.330000e+05 2.391000e+05 2.450667e+05 2.530333e+05 ... 4.980333e+05 5.090667e+05 5.188667e+05 5.288000e+05 5.381667e+05 5.472667e+05 5.577333e+05 5.660333e+05 5.774667e+05 584050.0
2 Illinois Chicago 1.384000e+05 1.436333e+05 1.478667e+05 1.521333e+05 1.569333e+05 1.618000e+05 1.664000e+05 1.704333e+05 ... 1.926333e+05 1.957667e+05 2.012667e+05 2.010667e+05 2.060333e+05 2.083000e+05 2.079000e+05 2.060667e+05 2.082000e+05 212000.0
3 Pennsylvania Philadelphia 5.300000e+04 5.363333e+04 5.413333e+04 5.470000e+04 5.533333e+04 5.553333e+04 5.626667e+04 5.753333e+04 ... 1.137333e+05 1.153000e+05 1.156667e+05 1.162000e+05 1.179667e+05 1.212333e+05 1.222000e+05 1.234333e+05 1.269333e+05 128700.0
4 Arizona Phoenix 1.118333e+05 1.143667e+05 1.160000e+05 1.174000e+05 1.196000e+05 1.215667e+05 1.227000e+05 1.243000e+05 ... 1.642667e+05 1.653667e+05 1.685000e+05 1.715333e+05 1.741667e+05 1.790667e+05 1.838333e+05 1.879000e+05 1.914333e+05 195200.0
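One way to investigate an inner merge that returns 0 rows is to run an outer merge with `indicator=True` and look at which keys only exist on one side. The sketch below uses hypothetical miniature frames with a deliberate trailing space in one key; whether whitespace is the actual cause in the question above is an assumption, not something the post confirms.

```python
import pandas as pd

# Hypothetical miniature frames; the trailing space in "Fairbanks " is deliberate
houses = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "RegionName": ["Auburn", "Fairbanks "],
})
unitowns = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "RegionName": ["Auburn", "Fairbanks"],
    "2000q1": [1.0, 2.0],  # placeholder values, not real prices
})

keys = ["State", "RegionName"]
# An outer merge keeps unmatched rows; _merge says which side each came from
check = houses.merge(unitowns[keys], on=keys, how="outer", indicator=True)
print(check["_merge"].value_counts())

# Normalising whitespace on the key columns restores the matches in this toy case
for col in keys:
    houses[col] = houses[col].str.strip()
towns = houses.merge(unitowns, on=keys, how="inner")
print(len(towns))
```

Rows marked `left_only` or `right_only` are the ones whose key values differ between the two frames, which narrows down where to look.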
creating new column by merging on column name and other column value
I am trying to create a new column in DF1 that lists the home team's number of all-stars for that year.
DF1
Date Visitor V_PTS Home H_PTS \
0 2012-10-30 19:00:00 Washington Wizards 84 Cleveland Cavaliers 94
1 2012-10-30 19:30:00 Dallas Mavericks 99 Los Angeles Lakers 91
2 2012-10-30 20:00:00 Boston Celtics 107 Miami Heat 120
3 2012-10-31 19:00:00 Dallas Mavericks 94 Utah Jazz 113
4 2012-10-31 19:00:00 San Antonio Spurs 99 New Orleans Pelicans 95
Attendance Arena Location Capacity \
0 20562 Quicken Loans Arena Cleveland, Ohio 20562
1 18997 Staples Center Los Angeles, California 18997
2 20296 American Airlines Arena Miami, Florida 19600
3 17634 Vivint Smart Home Arena Salt Lake City, Utah 18303
4 15358 Smoothie King Center New Orleans, Louisiana 16867
Yr Arena Opened Season
0 1994 2012-13
1 1992 2012-13
2 1999 2012-13
3 1991 2012-13
4 1999 2012-13
DF2
2012-13 2013-14 2014-15 2015-16 2016-17
Cleveland Cavaliers 1 1 2 1 3
Los Angeles Lakers 2 1 1 1 0
Miami Heat 3 3 2 2 1
Chicago Bulls 2 1 2 2 1
Detroit Pistons 0 0 0 1 1
Los Angeles Clippers 2 2 2 1 1
New Orleans Pelicans 0 1 1 1 1
Philadelphia 76ers 1 0 0 0 0
Phoenix Suns 0 0 0 0 0
Portland Trail Blazers 1 2 2 0 0
Toronto Raptors 0 1 1 2 2
DF1['H_Allstars']=DF2[DF1['Season'],DF1['Home']])
results in
TypeError: 'Series' objects are mutable, thus they cannot be hashed
I understand the error; I am just not sure how else to do it.
I've removed the extra columns and just focused on the necessary ones for demonstration.
Input:
df1
Home 2012-13 2013-14 2014-15 2015-16 2016-17
0 Cleveland Cavaliers 1 1 2 1 3
1 Los Angeles Lakers 2 1 1 1 0
2 Miami Heat 3 3 2 2 1
3 Chicago Bulls 2 1 2 2 1
4 Detroit Pistons 0 0 0 1 1
5 Los Angeles Clippers 2 2 2 1 1
6 New Orleans Pelicans 0 1 1 1 1
7 Philadelphia 76ers 1 0 0 0 0
8 Phoenix Suns 0 0 0 0 0
9 Portland Trail Blazers 1 2 2 0 0
10 Toronto Raptors 0 1 1 2 2
df2
Visitor Home Season
0 Washington Wizards Cleveland Cavaliers 2012-13
1 Dallas Mavericks Los Angeles Lakers 2012-13
2 Boston Celtics Miami Heat 2012-13
3 Dallas Mavericks Utah Jazz 2012-13
4 San Antonio Spurs New Orleans Pelicans 2012-13
Step 1: Melt df1 to get the allstars column
df3 = pd.melt(df1, id_vars='Home', value_vars = df1.columns[df1.columns.str.contains('20')], var_name = 'Season', value_name='H_Allstars')
Output:
Home Season H_Allstars
0 Cleveland Cavaliers 2012-13 1
1 Los Angeles Lakers 2012-13 2
2 Miami Heat 2012-13 3
3 Chicago Bulls 2012-13 2
4 Detroit Pistons 2012-13 0
5 Los Angeles Clippers 2012-13 2
6 New Orleans Pelicans 2012-13 0
7 Philadelphia 76ers 2012-13 1
8 Phoenix Suns 2012-13 0
...
Step 2: Merge this new dataframe with df2 to get the H_Allstars column
df4 = pd.merge(df2, df3, how='left', on=['Home', 'Season'])
Output:
Visitor Home Season H_Allstars
0 Washington Wizards Cleveland Cavaliers 2012-13 1.0
1 Dallas Mavericks Los Angeles Lakers 2012-13 2.0
2 Boston Celtics Miami Heat 2012-13 3.0
3 Dallas Mavericks Utah Jazz 2012-13 NaN
4 San Antonio Spurs New Orleans Pelicans 2012-13 0.0
Step 3: Add the V_Allstars column
# renaming column as required
df3.rename(columns={'Home': 'Visitor', 'H_Allstars': 'V_Allstars'}, inplace=True)
df5 = pd.merge(df4, df3, how='left', on=['Visitor', 'Season'])
Output:
Visitor Home Season H_Allstars V_Allstars
0 Washington Wizards Cleveland Cavaliers 2012-13 1.0 NaN
1 Dallas Mavericks Los Angeles Lakers 2012-13 2.0 NaN
2 Boston Celtics Miami Heat 2012-13 3.0 NaN
3 Dallas Mavericks Utah Jazz 2012-13 NaN NaN
4 San Antonio Spurs New Orleans Pelicans 2012-13 0.0 NaN
You can use pandas.melt. Bring your data df2 to long format, i.e. Home and Season as columns and Allstars as values, and then merge it into df1 on 'Home' and 'Season'.
import pandas as pd
df2['Home'] = df2.index
df2 = pd.melt(df2, id_vars = 'Home', value_vars = ['2012-13', '2013-14', '2014-15', '2015-16', '2016-17'], var_name = 'Season', value_name='H_Allstars')
df = df1.merge(df2, on=['Home','Season'], how='left')
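A small runnable sketch of the melt-then-merge idea, using a few values copied from the question. The frame names `allstars`, `games`, `long_form` and `merged` are mine, and only two seasons and three teams are included to keep it short.

```python
import pandas as pd

# Wide table indexed by team, one column per season (values from the question)
allstars = pd.DataFrame(
    {"2012-13": [1, 2, 3], "2013-14": [1, 1, 3]},
    index=["Cleveland Cavaliers", "Los Angeles Lakers", "Miami Heat"],
)
# Games table reduced to the two columns needed for the lookup
games = pd.DataFrame({
    "Home": ["Cleveland Cavaliers", "Los Angeles Lakers", "Miami Heat", "Utah Jazz"],
    "Season": ["2012-13"] * 4,
})

# Move the team names out of the index, then unpivot seasons into rows
long_form = (
    allstars.rename_axis("Home")
    .reset_index()
    .melt(id_vars="Home", var_name="Season", value_name="H_Allstars")
)

# A left merge keeps every game; teams missing from allstars get NaN
merged = games.merge(long_form, on=["Home", "Season"], how="left")
print(merged)
```

Because Utah Jazz has no row in the all-star table, its `H_Allstars` value comes back as NaN and the column becomes float.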
Python regex named groups for multiline match
I have a text like this:
Alabama[STATE]
Auburn (Auburn University)[14]
Florence (University of North Alabama)
Huntsville (University of Alabama, Huntsville)
Jacksonville (Jacksonville State University)[15]
Livingston (University of West Alabama)[15]
Montevallo (University of Montevallo)[15]
Troy (Troy University)[15]
Tuskegee (Tuskegee University)[18]
Alaska[STATE]
Fairbanks (University of Alaska Fairbanks)[15]
Arizona[STATE]
Flagstaff (Northern Arizona University)[19]
Prescott (Embry–Riddle Aeronautical University)
Tempe (Arizona State University)
I am trying to read the state and university list into two named groups using Python regex. My code is:
UNIV_LIST = r"(?P<state>(\w)+)\[.*\n(?P<region>(.*?).*)"
RE_COMMIT = re.compile(UNIV_LIST)
text = open(UFILE).read()
each_group = RE_COMMIT.finditer(text)
for rc in each_group:
    state = rc.groups()[0]
    regions = rc.groups()[1]
    print ('State is %s' %(state))
    print ('regions are %s' %(regions))
Expected output is:
State is : Alabama
Regions are :
Auburn (Auburn University)[14]
Florence (University of North Alabama)
Huntsville (University of Alabama, Huntsville)
Jacksonville (Jacksonville State University)[15]
Troy (Troy University)[15]
Tuskegee (Tuskegee University)[18]
State is : Alaska
Regions are :
Fairbanks (University of Alaska Fairbanks)[15]
State is : Arizona
Regions are :
Flagstaff (Northern Arizona University)[19]
Prescott (Embry–Riddle Aeronautical University)
Tempe (Arizona State University)
But the current output is:
UNIV_LIST = r"(?P<state>(\w+))\[edit\]\n(?P<region>(.*))\n+")
State is Alabama
regions are Auburn (Auburn University)[1]
State is Alaska
regions are Fairbanks (University of Alaska Fairbanks)[2]
State is Arizona
regions are Flagstaff (Northern Arizona University)[6]
Any suggestions on how to get the region named group correctly?
[EDIT] The actual text is:
Alabama[STATE]
Auburn (Auburn University)[14]
Florence (University of North Alabama)
Huntsville (University of Alabama, Huntsville)
Jacksonville (Jacksonville State University)[15]
Livingston (University of West Alabama)[15]
Montevallo (University of Montevallo)[15]
Montgomery (Alabama State University, Huntingdon College, Auburn University at Montgomery, H. Councill Trenholm State Technical College, Faulkner University)
Troy (Troy University)[15]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[6][17]
Tuskegee (Tuskegee University)[18]
Alaska[STATE]
Fairbanks (University of Alaska Fairbanks)[15]
Arizona[STATE]
Flagstaff (Northern Arizona University)[19]
Prescott (Embry–Riddle Aeronautical University)
Tempe (Arizona State University)
Tucson (University of Arizona)
Arkansas
Arkadelphia (Henderson State University, Ouachita Baptist University)[15]
Conway (Central Baptist College, Hendrix College, University of Central Arkansas)[15]
Fayetteville (University of Arkansas)[20]
Jonesboro (Arkansas State University)[21]
Magnolia (Southern Arkansas University)[15]
Monticello (University of Arkansas at Monticello)[15]
Russellville (Arkansas Tech University)[15]
Searcy (Harding University)[18]
California[STATE]
The regex below:
UNIV_LIST = r"(?P<state>^(\w+\[STATE\]))\r?\n?(?P<region>((^[^[]+)(\[\d+\])?(?!\[STATE\])$\r?\n?)+)"
provides most of the expected result but is missing some regions:
State is : Alabama
Regions are :
Auburn (Auburn University)[14]
Florence (University of North Alabama)
Huntsville (University of Alabama, Huntsville)
Jacksonville (Jacksonville State University)[15]
Livingston (University of West Alabama)[15]
Montevallo (University of Montevallo)[15]
Montgomery (Alabama State University, Huntingdon College, Auburn University at Montgomery, H. Councill Trenholm State Technical College, Faulkner University)
Troy (Troy University)[15]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[6][17]
Tuskegee (Tuskegee University)[18]
State is : Alaska
Regions are :
Fairbanks (University of Alaska Fairbanks)[15]
State is : Arizona
Regions are :
Flagstaff (Northern Arizona University)[19]
Prescott (Embry–Riddle Aeronautical University)
Tempe (Arizona State University)
I get the result, but
Montgomery (Alabama State University, Huntingdon College, Auburn University at Montgomery, H. Councill Trenholm State Technical College, Faulkner University)
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[6][17]
Tuskegee (Tuskegee University)[18]
are missing. Any suggestion on what is wrong?
[EDIT]
UNIV_LIST = r"(?P<state>^(\w+\s*\w*\[edit\]))\r?\n?(?P<region>((^[^[]+)(\[\d+\]){0,}?(?!\[edit\])$\r?\n?)+)"
This handles states with two words like New Mexico. But there is one case which still fails:
Pomona (Cal Poly Pomona, WesternU)[9][10][11] and formerly Pomona College
The following regex works.
UNIV_LIST = r"(?P<state>^(\w+\[STATE\]))\r?\n?(?P<region>((^[^[]+)(\[\d+\]){0,}?(?!\[STATE\])$\r?\n?)+)"
RE_COMMIT = re.compile(UNIV_LIST,re.IGNORECASE | re.MULTILINE)
each_group = RE_COMMIT.finditer(text)
for rc in each_group:
    print('State is : %s' %(rc.group('state')))
    print('Region are : %s' %rc.group('region'))
    print('-'*40)
Output:
State is : Alabama[STATE]
Region are : Auburn (Auburn University)[14]
Florence (University of North Alabama)
Huntsville (University of Alabama, Huntsville)
Jacksonville (Jacksonville State University)[15]
Livingston (University of West Alabama)[15]
Montevallo (University of Montevallo)[15]
Troy (Troy University)[15]
Tuskegee (Tuskegee University)[18]
----------------------------------------
State is : Alaska[STATE]
Region are : Fairbanks (University of Alaska Fairbanks)[15]
----------------------------------------
State is : Arizona[STATE]
Region are : Flagstaff (Northern Arizona University)[19]
Prescott (Embry–Riddle Aeronautical University)
Tempe (Arizona State University)
----------------------------------------
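Since the remaining failures come from trying to describe every possible region line inside one pattern, a different approach (my suggestion, not part of the answer above) is to match only the header lines and take whatever text lies between two headers as the regions. The sketch assumes every state line ends in the literal marker `[STATE]`; `HEADER`, `split_states` and the sample text are hypothetical names and data.

```python
import re

# Only header lines are matched; region lines are never described by a pattern
HEADER = re.compile(r"^(?P<state>.+?)\[STATE\][ \t]*$", re.MULTILINE)

def split_states(text):
    """Return (state, [region lines]) pairs, one per '[STATE]' header."""
    headers = list(HEADER.finditer(text))
    result = []
    for i, match in enumerate(headers):
        # A state's block runs until the next header (or the end of the text)
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        block = text[match.end():end]
        regions = [line.strip() for line in block.splitlines() if line.strip()]
        result.append((match.group("state"), regions))
    return result

# Illustrative sample, including a two-word state and the Pomona line
text = """Alabama[STATE]
Auburn (Auburn University)[14]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[6][17]
New Mexico[STATE]
Las Cruces (New Mexico State University)
California[STATE]
Pomona (Cal Poly Pomona, WesternU)[9][10][11] and formerly Pomona College
"""

for state, regions in split_states(text):
    print("State is : %s" % state)
    print("Regions are : %s" % regions)
```

A state line that lacks the marker would be treated as a region of the previous state, so the input needs consistent markers for this to work.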