I have a list of college towns with their corresponding U.S. states. I want to create a dataframe with two columns, 'State' and 'RegionName'. The dataframe should look like this:
DataFrame( [ ["Alabama", "Auburn"], ["Alabama", "Troy"],
["Alabama", "Tuscaloosa"], ["Alabama", "Tuskegee"], ["Alaska",
"Fairbanks"], ["Arizona", "Flagstaff"], ["Arizona", "Tempe"], ["Arizona",
"Tucson"] ],
columns=["State", "RegionName"] )
The problem is that my list mixes the states and region names together: each state name is followed by its region names, like this:
['Alabama',
'Auburn','Troy','Tuscaloosa','Tuskegee',
'Alaska','Fairbanks',
'Arizona','Flagstaff','Tempe','Tucson']
I have been looking at examples and I am currently stuck on this. Any help would be greatly appreciated!
You may need to create the list of states first, then use where (a mask) together with ffill to split the original single-column dataframe:
df['RegionName']=df.State
df.State=df.State.where(df.State.isin(States)).ffill()
df=df.loc[df.State!=df.RegionName]
df
Out[80]:
State RegionName
1 Alabama Auburn
2 Alabama Troy
3 Alabama Tuscaloosa
4 Alabama Tuskegee
6 Alaska Fairbanks
8 Arizona Flagstaff
9 Arizona Tempe
10 Arizona Tucson
Data Input
States=['Alabama','Alaska','Arizona']
l=['Alabama',
'Auburn','Troy','Tuscaloosa','Tuskegee',
'Alaska','Fairbanks',
'Arizona','Flagstaff','Tempe','Tucson']
df=pd.DataFrame(l,columns=['State'])
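The answer's steps can also be put together in one self-contained script. This is only a sketch under the same assumption the answer makes, namely that a list of state names is available; the names `items`, `is_state`, and `result` are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question
states = ['Alabama', 'Alaska', 'Arizona']
items = ['Alabama',
         'Auburn', 'Troy', 'Tuscaloosa', 'Tuskegee',
         'Alaska', 'Fairbanks',
         'Arizona', 'Flagstaff', 'Tempe', 'Tucson']

s = pd.Series(items)
is_state = s.isin(states)  # True on the rows that hold a state name

result = pd.DataFrame({
    # Keep only the state names, then carry each one down to its regions
    'State': s.where(is_state).ffill(),
    'RegionName': s,
})

# Drop the state header rows and renumber the index from 0
result = result.loc[~is_state].reset_index(drop=True)
print(result)
```

One caveat: a town that shares its name with a state in the list would be treated as a state header by `isin`, so this approach assumes no such collisions in the data.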
I have a dataframe like this:
STNAME CTYNAME POPESTIMATE
Alabama Autauga County 54660
Alabama Baldwin County 183193
Alabama Barbour County 27341
Alabama Bibb County 22861
Alabama Blount County 57373
....... ............... .....
Wyoming Sweetwater County 43593
Wyoming Teton County 21297
Wyoming Uinta County 21102
(and so on)
Here I have to find the three most populous cities (CTYNAME) for each state and sum their populations (POPESTIMATE); we can call that sum the population of the state. Then, from those sums (which count only the three most populous cities per state), I have to find the three most populous states and return them as a list.
I have tried this using several methods in the pandas library, but nothing has worked for me.
Can someone please help me with this?
Splitting df:
df = df.groupby('STNAME',as_index=True)
print(df.apply(lambda s: pd.Series(s.nlargest(3).index)))
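One way the goal described above might be sketched is to group only the POPESTIMATE column, sum the three largest values per state, and then take the three largest sums. The sample frame below is invented for illustration (the state names, city names, and numbers are not from the real dataset), and `top3_sum` and `top3_states` are hypothetical names.

```python
import pandas as pd

# Tiny made-up sample in the question's column layout (illustrative values only)
df = pd.DataFrame({
    'STNAME': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'D'],
    'CTYNAME': ['a1', 'a2', 'a3', 'a4', 'b1', 'b2', 'c1', 'c2', 'c3', 'd1'],
    'POPESTIMATE': [10, 40, 30, 20, 5, 6, 100, 1, 1, 50],
})

# For each state, add up its three largest POPESTIMATE values
top3_sum = (
    df.groupby('STNAME')['POPESTIMATE']
      .apply(lambda s: s.nlargest(3).sum())
)

# Names of the three states with the largest sums, largest first
top3_states = top3_sum.nlargest(3).index.tolist()
print(top3_states)
```

Selecting the column before calling `apply` matters: `nlargest` on a whole DataFrame group needs to be told which column to rank by, whereas on a Series it ranks the values directly.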
I have two datasets:
-population: shows the population of U.S. states, ordered alphabetically.
-data: has more than 200,000 rows.
population.head()
state population
0 Alabama 4887871
1 Alaska 737438
2 Arizona 7171646
3 Arkansas 3013825
4 California 39557045
I'm trying to add a new column called "incidents", computed from the other dataset.
I tried: population['incidents'] = data.state.value_counts().sort_index()
but I'm getting the following result:
state population incidents
0 Alabama 4887871 NaN
1 Alaska 737438 NaN
2 Arizona 7171646 NaN
3 Arkansas 3013825 NaN
4 California 39557045 NaN
What can I do to fix this?
EDIT:
data.state.value_counts().sort_index()
Alabama 5373
Alaska 1292
Arizona 2268
Arkansas 2753
California 15975
Colorado 3069
Connecticut 2984
Delaware 1643
District of Columbia 3091
Florida 14610
Georgia 8717
If you want to add a specific column from one dataset to the other, you can do it like this:
population['incidents'] = data[['columntoappend']]
The right-hand side (RHS) must be a single column, which in your case it is not.
https://www.google.com/amp/s/www.geeksforgeeks.org/adding-new-column-to-existing-dataframe-in-pandas/amp/
The way to do this is as follows, provided that both have the same length and the same order:
population['incidents'] = [x for x in data.state.value_counts().sort_index()]
Your approach results in NaN because pandas aligns an assigned Series on its index: the value_counts() result is indexed by state name, while population is indexed by 0, 1, 2, ..., so no labels match. The list comprehension discards the index and assigns one value to each row by position, which is only correct if the counts and the population rows are in the same order.
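An alternative that does not depend on row order is to look up each state's count by name with `Series.map`. The sketch below uses a small made-up `data` frame as a stand-in for the real 200,000-row one; only the column names come from the question.

```python
import pandas as pd

population = pd.DataFrame({
    'state': ['Alabama', 'Alaska', 'Arizona'],
    'population': [4887871, 737438, 7171646],
})

# Hypothetical stand-in for the large `data` frame: one row per incident
data = pd.DataFrame({
    'state': ['Alaska', 'Alabama', 'Alabama', 'Arizona', 'Alabama'],
})

# value_counts() is indexed by state name, so map() can look each name up
counts = data['state'].value_counts()
population['incidents'] = population['state'].map(counts)
print(population)
```

A state that appears in `population` but never in `data` would get NaN here; chaining `.fillna(0)` is one way to handle that if a zero count is the intended meaning.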
import pandas as pd

def get_list_of_university_towns():
    with open('university_towns.txt', 'r') as f:
        # Read the non-empty lines, stripped of trailing whitespace
        data = (line.rstrip() for line in f)
        lines = list(line for line in data if line)
    thing = [lines]
    # Positions of the state header lines
    indexx = [lines.index(line) for line in lines if '[edit]' in line]
    numlist = [indexx]
    wow = pd.DataFrame(thing)
    tr = wow.T
    tr.columns = ['Region']
    return tr
When I run the code it returns:
""" Region
0 Alabama[edit]
1 Auburn (Auburn University)[1]
2 Florence (University of North Alabama)
3 Jacksonville (Jacksonville State University)[2]
4 Livingston (University of West Alabama)[2]
5 Montevallo (University of Montevallo)[2]
6 Troy (Troy University)[2]
7 Tuscaloosa (University of Alabama, Stillman Co...
8 Tuskegee (Tuskegee University)[5]
9 Alaska[edit]
10 Fairbanks (University of Alaska Fairbanks)[2]
11 Arizona[edit]
12 Flagstaff (Northern Arizona University)[6]
13 Tempe (Arizona State University)
14 Tucson (University of Arizona)
15 Arkansas[edit]
16 Arkadelphia (Henderson State University, Ouach...
How can I add a new column, named State, that holds the corresponding state name for each region row? The index part works: it returns the index of every state line.
I essentially want it to return:
Region State
1 Auburn Alabama
2 Florence Alabama etc..
You should be able to iterate over the lines, and use an if-else to determine whether the line is a state or a region. The states appear all to have the [edit] tag in them, so any line with that must be a state, otherwise it's a region.
To create the dataframe itself, we can create a list of tuples, with the first element being the state, and the second being the region (after appropriately cleaning the text). Then pass the list to pandas, which will elegantly convert it into a dataframe.
A potential solution (though I'm not sure exactly what your text file looks like):
data = []
for line in lines:
    if '[edit]' in line:
        state = line.replace('[edit]', '')
    else:
        region = line.split(' (')[0]
        data.append((state, region))
df = pd.DataFrame(data, columns=['state', 'region'])
I'm trying to use the apply function on my dataframe ('homes'), which has a MultiIndex ('State' and 'RegionName'). The function I use tries to check whether each combination of state and region name appears in my other dataframe ('ut').
When applying this function:
homes['UT']=homes.apply(lambda row: 1 if
ut[(ut['State']==states[homes.iloc[row].name[0]]) &
(ut['RegionName']==homes.iloc[row].name[1])] else 0, axis=1)
I get an error saying, basically, that my index is out of bounds.
I tried a few things, like converting the other dataframe to two lists and checking whether the rows of my dataframe are in those lists, but I still get the same error.
My ut dataframe head:
State RegionName
1 Alabama Auburn
2 Alabama Florence
3 Alabama Jacksonville
4 Alabama Livingston
5 Alabama Montevallo
My homes dataframe head:
2000q1 2000q2 2000q3 2000q4
State RegionName
New York New York NaN NaN NaN NaN
California Los Angeles 207066.666667 214466.666667 220966.666667 226166.666667
Illinois Chicago 138400.000000 143633.333333 147866.666667 152133.333333
Pennsylvania Philadelphia 53000.000000 53633.333333 54133.333333 54700.000000
Arizona Phoenix 111833.333333 114366.666667 116000.000000 117400.000000
Any suggestions?
Found the answer thanks to #user8505495.
The code should be like this:
homes['UT']=homes.apply(lambda row: 1 if (row.name[0]+', '+row.name[1] in ut['full'].values) else 0, axis=1)
I have no idea why it works, but it does. Thanks for all the help!
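A likely explanation: with axis=1, apply passes each row as a Series, and row.name is that row's index label, which here is the (State, RegionName) tuple. The first attempt passed the row itself to iloc, which expects integer positions. The same membership check can also be sketched without apply by testing the index against the pairs in ut. The frames below are small made-up samples, and `pairs` is a hypothetical name.

```python
import pandas as pd

# Small made-up samples shaped like the frames in the question
ut = pd.DataFrame({
    'State': ['Alabama', 'Alabama'],
    'RegionName': ['Auburn', 'Florence'],
})

homes = pd.DataFrame(
    {'2000q1': [1.0, 2.0, 3.0]},
    index=pd.MultiIndex.from_tuples(
        [('Alabama', 'Auburn'), ('New York', 'New York'), ('Alabama', 'Troy')],
        names=['State', 'RegionName'],
    ),
)

# (State, RegionName) pairs that count as university towns
pairs = list(zip(ut['State'], ut['RegionName']))

# 1 where the row's index tuple is one of those pairs, else 0
homes['UT'] = homes.index.isin(pairs).astype(int)
print(homes)
```

This assumes both frames spell state names the same way; if one uses abbreviations and the other full names, they would need to be mapped to a common form first.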
Description: in this txt file, each state name ends with [edit], and the items between two [edit] lines are the names of university towns. I need to read the whole txt file into a pd.DataFrame with two columns named 'State' and 'RegionName'. The difficulty is that the txt file has one item per row, so if I use df=pd.read_table('university_towns.txt') directly, the DataFrame has only one column, which contains both the state names and the region names. How can I deal with this?
Thanks in advance!
Part of my txt
"Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
Arkansas[edit]"
with open('Universities.txt', 'r') as f:
    # Read non-empty lines:
    data = (line.rstrip() for line in f)
    lines = list(line for line in data if line)
# Get the index of states:
r_idx = [lines.index(line) for line in lines if '[edit]' in line]
# Separating states and university names using wrapping indexes:
university = []
region = [lines[i].replace('[edit]', '') for i in r_idx]
for i in range(len(r_idx)):
    if i != len(r_idx)-1:
        sub = lines[r_idx[i]+1:r_idx[i+1]]
        university.append(sub)
    else:
        sub = lines[r_idx[i]+1:]
        university.append(sub)
# Create dict:
uni = dict(zip(region, university))
Here I tried to extract the info into separate lists and then map them together. Some data formatting is also done on the fly.
Does this help?
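To get from a dict of that shape to the two-column DataFrame the question asks for, one option is to flatten it into (state, town) pairs. The sketch below uses a small hand-written dict in place of the one built from the file, and trimming the town name at the first ' (' is an assumption about how the lines are formatted.

```python
import pandas as pd

# Hypothetical dict in the shape produced above: state -> list of raw town lines
uni = {
    'Alabama': ['Auburn (Auburn University)[1]', 'Troy (Troy University)[2]'],
    'Alaska': ['Fairbanks (University of Alaska Fairbanks)[2]'],
    'Arkansas': [],
}

# One (state, town) pair per town, keeping only the text before ' ('
rows = [
    (state, town.split(' (')[0])
    for state, towns in uni.items()
    for town in towns
]

df = pd.DataFrame(rows, columns=['State', 'RegionName'])
print(df)
```

A state with an empty list, such as the last entry in the sample, contributes no rows to the result.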