def answer_eight():
    templist = list()
    for county, region, p15, p14, ste, cty in zip(census_df.CTYNAME,
                                                  census_df.REGION,
                                                  census_df.POPESTIMATE2015,
                                                  census_df.POPESTIMATE2014,
                                                  census_df.STNAME,
                                                  census_df.CTYNAME):
        # print(county)
        if region == 1 or region == 2:
            if county.startswith('Washington'):
                if p15 > p14:
                    templist.append((ste, cty))
    labels = ['STNAME', 'CTYNAME']
    df = pd.DataFrame.from_records(templist, columns=labels)
    return df
STNAME CTYNAME
0 Iowa Washington County
1 Minnesota Washington County
2 Pennsylvania Washington County
3 Rhode Island Washington County
4 Wisconsin Washington County
All these CTYNAME rows have different indexes in the original census_df. How can I carry them over to the new DataFrame so the answer looks like this?
STNAME CTYNAME
12 Iowa Washington County
222 Minnesota Washington County
400 Pennsylvania Washington County
2900 Rhode Island Washington County
2999 Wisconsin Washington County
I'd include the index with the other things you're zipping:
def answer_eight():
    templist = list()
    index = list()
    zipped = zip(
        census_df.CTYNAME,
        census_df.REGION,
        census_df.POPESTIMATE2015,
        census_df.POPESTIMATE2014,
        census_df.STNAME,
        census_df.CTYNAME,
        census_df.index
    )
    for county, region, p15, p14, ste, cty, idx in zipped:
        # print(county)
        if region == 1 or region == 2:
            if county.startswith('Washington'):
                if p15 > p14:
                    templist.append((ste, cty))
                    index.append(idx)
    labels = ['STNAME', 'CTYNAME']
    df = pd.DataFrame(templist, index=index, columns=labels)
    return df.rename_axis(census_df.index.name)
Before you start filtering, you can assign the original index to a column with:
census_df['original index'] = census_df.index
Then just treat it like one of the other columns you're selecting from.
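For completeness, the filtering can also be done without a Python-level loop; a boolean mask selects the rows and keeps their original index automatically. A minimal sketch, assuming the same census_df columns used above:

# Each condition is a boolean Series aligned on census_df's index,
# so the selected rows keep their original index labels for free.
mask = (census_df.REGION.isin([1, 2])
        & census_df.CTYNAME.str.startswith('Washington')
        & (census_df.POPESTIMATE2015 > census_df.POPESTIMATE2014))
df = census_df.loc[mask, ['STNAME', 'CTYNAME']]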
Related
I have a .txt that goes like this:
USA
Arizona - New Mexico
Interstate 40
Interstate 10
South Dakota - Minneapolis
Interstate 90
South Carolina - Washington
Arizona - California
Interstate 40
Interstate 10
Interstate 8
ANOTHER COUNTRY
State A - State B
Highway 1
Highway 2
Highway 3
...
...
I want to create a DataFrame and a CSV in pandas, where the first column contains the States and the second column the Highway number.
States HW_Number
Arizona - New Mexico Interstate 40
Arizona - New Mexico Interstate 10
South Dakota - Minneapolis Interstate 90
Arizona - California Interstate 40
Arizona - California Interstate 10
Arizona - California Interstate 8
State A - State B Highway 1
State A - State B Highway 2
State A - State B Highway 3
How can I manage to do that? Not all the state pairs have the same number of highways, and some have none; those with no highways should not appear in the DataFrame.
A column with the country could be added as well.
Thank you
As I said, a pretty easy file to parse:
import pandas as pd

rows = []
state = None
for line in open('x.txt'):
    if line[0] == ' ':        # skip indented/decorative lines, if any
        continue
    line = line.strip()
    if not line:
        continue
    if '-' in line:           # "Arizona - New Mexico" lines name a state pair
        state = line
    elif line[-1].isdigit():  # "Interstate 40" lines are highways
        rows.append((state, line))
    # anything else ("USA", "ANOTHER COUNTRY") is a country header; skip it

df = pd.DataFrame(rows, columns=['state', 'road'])
print(df)
Output:
----------
state road
0 Arizona - New Mexico Interstate 40
1 Arizona - New Mexico Interstate 10
2 South Dakota - Minneapolis Interstate 90
3 Arizona - California Interstate 40
4 Arizona - California Interstate 10
5 Arizona - California Interstate 8
6 State A - State B Highway 1
7 State A - State B Highway 2
8 State A - State B Highway 3
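Since the question also asks for a CSV, writing the frame out is one more line (the filename is just illustrative):

df.to_csv('highways.csv', index=False)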
You can iterate through the rows and use characteristics of your structured data to create lists. These lists can be used to make a dataframe or series.
read the lines from the file into a list (f.readlines())
remove empty rows
keep track of current state (doesn't end with a number)
append the states and highways to lists
use lists to make a dataframe or series
import pandas as pd
import io

f = io.StringIO(
"""
USA
Arizona - New Mexico
Interstate 40
Interstate 10
South Dakota - Minneapolis
Interstate 90
South Carolina - Washington
Arizona - California
Interstate 40
Interstate 10
Interstate 8
ANOTHER COUNTRY
State A - State B
Highway 1
Highway 2
Highway 3
"""
)
lines = f.readlines()

states = []
hw_numbers = []
current_state = None
for line in lines:
    line = line.strip()           # removes \n
    if line == '':                # skip empty rows
        continue
    elif not line[-1].isdigit():  # doesn't end in a digit: treat as the state
                                  # (a country header is harmlessly overwritten
                                  #  by the state line that follows it)
        current_state = line
    else:                         # ends in a digit: it's a highway
        states.append(current_state)
        hw_numbers.append(line)

pd.DataFrame({
    'States': states,
    'HW_number': hw_numbers
})
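Since a country column was requested as well, the same loop extends naturally. This sketch assumes country headers are exactly the lines that are neither "A - B" state pairs nor end in a digit:

states, hw_numbers, countries = [], [], []
current_state = None
current_country = None
for line in lines:
    line = line.strip()
    if line == '':
        continue
    elif '-' in line:             # state pair, e.g. "Arizona - New Mexico"
        current_state = line
    elif not line[-1].isdigit():  # country header, e.g. "USA"
        current_country = line
    else:                         # highway, e.g. "Interstate 40"
        states.append(current_state)
        hw_numbers.append(line)
        countries.append(current_country)

df = pd.DataFrame({'Country': countries, 'States': states, 'HW_number': hw_numbers})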
Suppose I have two dataframes
df_1
city state salary
New York NY 85000
Chicago IL 65000
Miami FL 75000
Dallas TX 78000
Seattle WA 96000
df_2
city state taxes
New York NY 15000
Chicago IL 5000
Miami FL 6500
Next, I join the two dataframes:
joined_df = df_1.merge(df_2, how='inner', left_on=['city'], right_on=['city'])
The Result:
joined_df
city state salary city state taxes
New York NY 85000 New York NY 15000
Chicago IL 65000 Chicago IL 5000
Miami FL 75000 Miami FL 6500
Is there any way I can stack the two dataframes on top of each other, joining on the city instead of extending each row horizontally, like below?
Requested:
joined_df
city state salary taxes
New York NY 85000
New York NY 15000
Chicago IL 65000
Chicago IL 5000
Miami FL 75000
Miami FL 6500
How can I do this in Pandas?
In this case we need merge to restrict to the relevant rows before concat, because both city and state have to match.
rel_df_1 = df_1.merge(df_2)[df_1.columns]
rel_df_2 = df_2.merge(df_1)[df_2.columns]
df = pd.concat([rel_df_1, rel_df_2]).sort_values(['city', 'state'])
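One detail worth hedging: sort_values defaults to quicksort, which is not stable, so within a city the relative order of the salary row and the taxes row is not guaranteed. Passing a stable algorithm keeps the rel_df_1 rows ahead of the rel_df_2 rows:

df = pd.concat([rel_df_1, rel_df_2]).sort_values(['city', 'state'], kind='mergesort')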
You can use append (a shortcut for concat) to achieve that:
result = df1.append(df2, sort=False)
If your dataframes have overlapping indexes, you can use:
df1.append(df2, ignore_index=True, sort=False)
UPDATE: After appending your dataframes, you can filter the result to keep only the rows whose city appears in both dataframes:
result = result.loc[result['city'].isin(df1['city'])
                    & result['city'].isin(df2['city'])]
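A note worth adding: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the equivalent call is concat:

result = pd.concat([df1, df2], ignore_index=True, sort=False)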
Try with stack():
stacked = df_1.merge(df_2, on=["city", "state"]).set_index(["city", "state"]).stack()
output = pd.concat([stacked.where(stacked.index.get_level_values(-1) == "salary"),
                    stacked.where(stacked.index.get_level_values(-1) == "taxes")],
                   axis=1,
                   keys=["salary", "taxes"]) \
           .droplevel(-1) \
           .reset_index()
>>> output
city state salary taxes
0 New York NY 85000.0 NaN
1 New York NY NaN 15000.0
2 Chicago IL 65000.0 NaN
3 Chicago IL NaN 5000.0
4 Miami FL 75000.0 NaN
5 Miami FL NaN 6500.0
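The same interleaved shape can also be reached without stack(); a sketch assuming the sample frames above, where a stable sort_index pairs each merged row's salary entry with its taxes entry:

merged = df_1.merge(df_2, on=["city", "state"])
out = (pd.concat([merged.drop(columns="taxes"), merged.drop(columns="salary")])
         .sort_index(kind="mergesort")   # stable sort: salary row stays ahead of its taxes row
         .reset_index(drop=True))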
This is my data frame:
City        sales
San Diego   500
Texas       400
Nebraska    300
Macau       200
Rome        100
London      50
Manchester  70
I want to add the country at the end, which will look like this:
City        sales  Country
San Diego   500    US
Texas       400    US
Nebraska    300    US
Macau       200    Hong Kong
Rome        100    Italy
London      50     England
Manchester  70     England
The countries are stored in the dictionary below:
country={'US':['San Diego','Texas','Nebraska'], 'Hong Kong':'Macau', 'England':['London','Manchester'],'Italy':'Rome'}
It's a little complicated because the values are a mix of lists and strings, and strings are technically iterable too, so distinguishing them is more annoying. But here's a function that can flatten your dict:
def flatten_dict(d):
    nd = {}
    for k, v in d.items():
        # If the value is list-like (but not a string), map each item to the key
        if hasattr(v, '__iter__') and not isinstance(v, str):
            for item in v:
                nd[item] = k
        else:
            nd[v] = k
    return nd
d = flatten_dict(country)
#{'San Diego': 'US',
# 'Texas': 'US',
# 'Nebraska': 'US',
# 'Macau': 'Hong Kong',
# 'London': 'England',
# 'Manchester': 'England',
# 'Rome': 'Italy'}
df['Country'] = df['City'].map(d)
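For what it's worth, the flattening can also be written as a single dict comprehension; a sketch over the same country dict:

d = {city: nation
     for nation, cities in country.items()
     for city in ([cities] if isinstance(cities, str) else cities)}
df['Country'] = df['City'].map(d)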
You can implement this using geopy.
You can install geopy with pip install geopy.
Here is the documentation: https://pypi.org/project/geopy/
# import libraries
from geopy.geocoders import Nominatim
# you need to mention a name for the app
geolocator = Nominatim(user_agent="some_random_app_name")
# get country name
df['Country'] = df['City'].apply(lambda x: geolocator.geocode(x).address.split(', ')[-1])
print(df)
City sales Country
0 San Diego 500 United States
1 Texas 400 United States
2 Nebraska 300 United States
3 Macau 200 中国
4 Rome 100 Italia
5 London 50 United Kingdom
6 Manchester 70 United Kingdom
# to get country name in english
df['Country'] = df['City'].apply(lambda x: geolocator.reverse(geolocator.geocode(x).point, language='en').address.split(', ')[-1])
print(df)
City sales Country
0 San Diego 500 United States
1 Texas 400 United States
2 Nebraska 300 United States
3 Macau 200 China
4 Rome 100 Italy
5 London 50 United Kingdom
6 Manchester 70 United Kingdom
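One caution: Nominatim is a free service with a strict usage policy (about one request per second), so for more than a handful of rows the calls should be throttled. geopy ships a RateLimiter helper for this; a minimal sketch:

from geopy.extra.rate_limiter import RateLimiter

geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)  # ~1 request/second
df['Country'] = df['City'].apply(lambda x: geocode(x).address.split(', ')[-1])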
Input df(example)
Country SubregionA SubregionB
BRA State of Acre Brasiléia
BRA State of Acre Cruzeiro do Sul
USA AL Bibb County
USA AL Blount County
USA AL Bullock County
Output df
Country SubregionA SubregionB
BRA State of Acre State of Acre - Brasiléia
BRA State of Acre State of Acre - Cruzeiro do Sul
USA AL AL Bibb County
USA AL AL Blount County
USA AL AL Bullock County
The code snippet is quite self-explanatory, but when executed it seems to run forever. What could be going wrong? (Also, the dataframe 'data' is quite large, around 250K+ rows.)
for row in data.itertuples():
    region = data['Country']
    if region == 'ARG':
        data['SubregionB'] = data[['SubregionA' 'SubregionB']].apply(lambda row: '-'.join(row.values.astype(str)), axis=1)
    elif region == 'BRA':
        data['SubregionB'] = data[['SubregionA', 'SubregionB']].apply(lambda row: '-'.join(row.values.astype(str)), axis=1)
    elif region == 'USA':
        data['SubregionB'] = data[['SubregionA', 'SubregionB']].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
    else:
        pass
Explanation: I'm trying to join columns SubregionA and SubregionB based on the value in the Country column. The separators differ per country, hence the multiple if-else branches. It takes too long to execute; how can I make this faster?
You can use numpy.select with Series.isin and join the columns with +. This is fully vectorized, so the per-row apply calls inside the Python loop are replaced by a few whole-column operations:
print (df)
Country SubregionA SubregionB
0 BRA State of Acre Brasilia
1 BRA State of Acre Cruzeiro do Sul
2 USA AL Bibb County
3 USA AL Blount County
4 USA AL Bullock County
5 JAP AAA BBBB
import numpy as np

reg1 = ['ARG', 'BRA']
reg2 = ['USA']

a = np.select([df['Country'].isin(reg1), df['Country'].isin(reg2)],
              [df['SubregionA'] + ' - ' + df['SubregionB'],
               df['SubregionA'] + ' ' + df['SubregionB']],
              default=df['SubregionB'])
df['SubregionB'] = a
print (df)
Country SubregionA SubregionB
0 BRA State of Acre State of Acre - Brasilia
1 BRA State of Acre State of Acre - Cruzeiro do Sul
2 USA AL AL Bibb County
3 USA AL AL Blount County
4 USA AL AL Bullock County
5 JAP AAA BBBB
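An equivalent mapping-based sketch, if numpy.select feels opaque: build a per-row separator with Series.map and concatenate once. The country-to-separator dict here is an assumption covering the three regions named in the question:

# Hypothetical separator table for the regions in the question
sep = df['Country'].map({'ARG': ' - ', 'BRA': ' - ', 'USA': ' '})
mask = sep.notna()   # countries without an entry keep SubregionB unchanged
df.loc[mask, 'SubregionB'] = df.loc[mask, 'SubregionA'] + sep[mask] + df.loc[mask, 'SubregionB']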
I know this should be easy but it's driving me mad...
I am trying to turn a dataframe into a grouped dataframe.
df outputs:
Postcode Borough Neighbourhood
0 M3A North York Parkwoods
1 M4A North York Victoria Village
2 M5A Downtown Toronto Harbourfront
3 M5A Downtown Toronto Regent Park
4 M6A North York Lawrence Heights
5 M6A North York Lawrence Manor
6 M7A Queen's Park Not assigned
7 M9A Etobicoke Islington Avenue
8 M1B Scarborough Rouge
9 M1B Scarborough Malvern
10 M3B North York Don Mills North
...
I want to make a grouped dataframe where the rows are grouped by Postcode and all the Neighbourhood values for each Postcode become one comma-separated string...
something like:
Postcode Borough Neighbourhood
0 M3A North York Parkwoods
1 M4A North York Victoria Village
2 M5A Downtown Toronto Harbourfront, Regent Park
...
I am trying to use:
df.groupby(['Postcode'])['Neighbourhood'].apply(lambda strs: ', '.join(strs))
But this does not return a new dataframe; df is still the same original dataframe when I use it after running.
if I use:
df = df.groupby(['Postcode'])['Neighbourhood'].apply(lambda strs: ', '.join(strs))
it turns df into an object?
Use this code:
new_df = df.groupby(['Postcode', 'Borough']).agg({'Neighbourhood': lambda x: ', '.join(x)}).reset_index()
reset_index() will take your group-by columns out of the index, return them as regular columns of the dataframe, and create a new integer index. (Your first attempt appeared to do nothing because groupby returns a new object rather than modifying df in place, and assigning the result back gave you a Series rather than a DataFrame because you had selected the single Neighbourhood column.)
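For what it's worth, the same result can be written more compactly; as_index=False keeps the group keys as regular columns, so no reset_index is needed (a sketch, same df assumed):

new_df = df.groupby(['Postcode', 'Borough'], as_index=False)['Neighbourhood'].agg(', '.join)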