I have a pandas dataframe in which I have the column "Bio Location", I would like to filter it so that I only have the locations of my list in which there are names of cities. I have made the following code which works except that I have a problem.
For example, if the location is "Paris France" and I have Paris in my list then it will return the result. However, if I had "France Paris", it would not return "Paris". Do you have a solution? Maybe use regex? Thank u a lot!!!
df = pd.read_csv(path_to_file, encoding='utf-8', sep=',')
cities = [Paris, Bruxelles, Madrid]
values = df[df['Bio Location'].isin(citiesfr)]
values.to_csv(r'results.csv', index = False)
What you want here is .str.contains():
1. The DF I used to test:
df = {
'col1':['Paris France','France Paris Test','France Paris','Madrid Spain','Spain Madrid Test','Spain Madrid'] #so tested with 1x at start, 1x in the middle and 1x at the end of a str
}
df = pd.DataFrame(df)
df
Result:
index
col1
0
Paris France
1
France Paris Test
2
France Paris
3
Madrid Spain
4
Spain Madrid Test
5
Spain Madrid
2. Then applying the code below:
Updated following comment
#so tested with 1x at start, 1x in the middle and 1x at the end of a str
reg = ('Paris|Madrid')
df = df[df.col1.str.contains(reg)]
df
Result:
index
col1
0
Paris France
1
France Paris Test
2
France Paris
3
Madrid Spain
4
Spain Madrid Test
5
Spain Madrid
I have the following dataframe where each row is a unique state-city pair:
State City
NY Albany
NY NYC
MA Boston
MA Cambridge
I want to a add a column of years ranging from 2000 to 2018:
State City. Year
NY Albany 2000
NY Albany 2001
NY Albany 2002
...
NY Albany 2018
NY NYC 2000
NY NYC 2018
...
MA Cambridge 2018
I know I can create a list of numbers using Year = list(range(2000,2019))
Does anyone know how to put this list as a column in the dataframe for each state-city?
You could try adding it as a list and then performing explode. I think it should work:
df['Year'] = [list(range(2000,2019))] * len(df)
df = df.explode('Year')
One way is to use the DataFrame.stack() method.
Here is sample of your current data:
data = [['NY', 'Albany'],
['NY', 'NYC'],
['MA', 'Boston'],
['MA', 'Cambridge']]
cities = pd.DataFrame(data, columns=['State', 'City'])
print(cities)
# State City
# 0 NY Albany
# 1 NY NYC
# 2 MA Boston
# 3 MA Cambridge
First, make this into a multi-level index (this will end up in the final dataframe):
cities_index = pd.MultiIndex.from_frame(cities)
print(cities_index)
# MultiIndex([('NY', 'Albany'),
# ('NY', 'NYC'),
# ('MA', 'Boston'),
# ('MA', 'Cambridge')],
# names=['State', 'City'])
Now, make a dataframe with all the years in it (I only use 3 years for brevity):
years = list(range(2000, 2003))
n_cities = len(cities)
years_data = np.repeat(years, n_cities).reshape(len(years), n_cities).T
years_data = pd.DataFrame(years_data, index=cities_index)
years_data.columns.name = 'Year index'
print(years_data)
# Year index 0 1 2
# State City
# NY Albany 2000 2001 2002
# NYC 2000 2001 2002
# MA Boston 2000 2001 2002
# Cambridge 2000 2001 2002
Finally, use stack to transform this dataframe into a vertically-stacked series which I think is what you want:
years_by_city = years_data.stack().rename('Year')
print(years_by_city.head())
# State City Year index
# NY Albany 0 2000
# 1 2001
# 2 2002
# NYC 0 2000
# 1 2001
# Name: Year, dtype: int64
If you want to remove the index and have all the values as a dataframe just do
cities_and_years = years_by_city.reset_index()
I want to split one column from my dataframe into multiple columns, then attach those columns back to my original dataframe and divide my original dataframe based on whether the split columns include a specific string.
I have a dataframe that has a column with values separated by semicolons like below.
import pandas as pd
data = {'ID':['1','2','3','4','5','6','7'],
'Residence':['USA;CA;Los Angeles;Los Angeles', 'USA;MA;Suffolk;Boston', 'Canada;ON','USA;FL;Charlotte', 'NA', 'Canada;QC', 'USA;AZ'],
'Name':['Ann','Betty','Carl','David','Emily','Frank', 'George'],
'Gender':['F','F','M','M','F','M','M']}
df = pd.DataFrame(data)
Then I split the column as below, and separated the split column into two based on whether it contains the string USA or not.
address = df['Residence'].str.split(';',expand=True)
country = address[0] != 'USA'
USA, nonUSA = address[~country], address[country]
Now if you run USA and nonUSA, you'll note that there are extra columns in nonUSA, and also a row with no country information. So I got rid of those NA values.
USA.columns = ['Country', 'State', 'County', 'City']
nonUSA.columns = ['Country', 'State']
nonUSA = nonUSA.dropna(axis=0, subset=[1])
nonUSA = nonUSA[nonUSA.columns[0:2]]
Now I want to attach USA and nonUSA to my original dataframe, so that I will get two dataframes that look like below:
USAdata = pd.DataFrame({'ID':['1','2','4','7'],
'Name':['Ann','Betty','David','George'],
'Gender':['F','F','M','M'],
'Country':['USA','USA','USA','USA'],
'State':['CA','MA','FL','AZ'],
'County':['Los Angeles','Suffolk','Charlotte','None'],
'City':['Los Angeles','Boston','None','None']})
nonUSAdata = pd.DataFrame({'ID':['3','6'],
'Name':['David','Frank'],
'Gender':['M','M'],
'Country':['Canada', 'Canada'],
'State':['ON','QC']})
I'm stuck here though. How can I split my original dataframe into people whose Residence include USA or not, and attach the split columns from Residence ( USA and nonUSA ) back to my original dataframe?
(Also, I just uploaded everything I had so far, but I'm curious if there's a cleaner/smarter way to do this.)
There is unique index in original data and is not changed in next code for both DataFrames, so you can use concat for join together and then add to original by DataFrame.join or concat with axis=1:
address = df['Residence'].str.split(';',expand=True)
country = address[0] != 'USA'
USA, nonUSA = address[~country], address[country]
USA.columns = ['Country', 'State', 'County', 'City']
nonUSA = nonUSA.dropna(axis=0, subset=[1])
nonUSA = nonUSA[nonUSA.columns[0:2]]
#changed order for avoid error
nonUSA.columns = ['Country', 'State']
df = pd.concat([df, pd.concat([USA, nonUSA])], axis=1)
Or:
df = df.join(pd.concat([USA, nonUSA]))
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NaN NaN
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 NaN NaN
3 Charlotte None
4 NaN NaN
5 NaN NaN
6 None None
But it seems it is possible simplify:
c = ['Country', 'State', 'County', 'City']
df[c] = df['Residence'].str.split(';',expand=True)
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NA None
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 None None
3 Charlotte None
4 None None
5 None None
6 None None
I have 2 dataframes I would like to analyze.
df1:
City Time Day
5866128 Los Angeles 3.5 01/09/2019
5172728 New York 14 09/09/2019
4787428 Boston 9 10/09/2019
df2:
City Time Day
5866128 Los Angeles 3.5 01/09/2019
2478987 Denver 10 07/09/2019
5172728 New York 24 09/09/2019
4787428 Boston 4 10/09/2019
1478712 Austin 7 10/09/2019
I would like to create a third dataframe that would contain only the rows where df2['Time']-df1['Time'] != 0 based on the index and the rows that are only available in df2.
Desired output :
City Time Day
2478987 Denver 10 07/09/2019
5172728 New York 10 09/09/2019
4787428 Boston -5 10/09/2019
1478712 Austin 7 10/09/2019
I tried to play with numpy.where(), but I can't make it work to compare only the same index.
Thanks
df2['Time'] = df2['Time'].sub(df1['Time'], fill_value=0)
df2[df2.Time.ne(0)]
or
df2.assign(Time = df2['Time'].sub(df1['Time'], fill_value= 0)).loc[lambda x:x.Time.ne(0)]
Try this one:
df3 = (df1.join(df2, rsuffix="_2")
df3 = df3.loc[df["Time"]!=df["Time_2"]]
df3["Time"]=df3["Time_2"]-df3["Time"]
df3=df3.drop(["Time_2", "Day_2", "City_2"], axis=1)
I know this should be easy but it's driving me mad...
I am trying to turn a dataframe into a grouped dataframe.
df outputs:
Postcode Borough Neighbourhood
0 M3A North York Parkwoods
1 M4A North York Victoria Village
2 M5A Downtown Toronto Harbourfront
3 M5A Downtown Toronto Regent Park
4 M6A North York Lawrence Heights
5 M6A North York Lawrence Manor
6 M7A Queen's Park Not assigned
7 M9A Etobicoke Islington Avenue
8 M1B Scarborough Rouge
9 M1B Scarborough Malvern
10 M3B North York Don Mills North
...
I want to make a grouped dataframe where the Neighbourhood is grouped by Postcode and all neighborhoods then become a concatenated string of Neighbourhoods as grouped by Postcode...
something like:
Postcode Borough Neighbourhood
0 M3A North York Parkwoods
1 M4A North York Victoria Village
2 M5A Downtown Toronto Harbourfront, Regent Park
...
I am trying to use:
df.groupby(['Postcode'])['Neighbourhood'].apply(lambda strs: ', '.join(strs))
But this does not return a new dataframe .. it outputs the same original dataframe when I use df after running.
if I use:
df = df.groupby(['Postcode'])['Neighbourhood'].apply(lambda strs: ', '.join(strs))
it turns df into an object?
Use this code
new_df = df.groupby(['Postcode', 'Borough']).agg({'Neighbourhood':lambda x:', '.join(x)}).reset_index()
reset_index() will take your group by columns out of the index and return it as a column to the dataframe and create a new integer index.