I have the following dataframe:
import pandas as pd
fertilityRates = pd.read_csv('fertility_rate.csv')
fertilityRatesRowCount = len(fertilityRates.axes[0])
fertilityRates.head(fertilityRatesRowCount)
I have found a way to find the mean for each row over columns 1960-1969, but would like to do so without removing the column called "Country".
This is the output after I run the following commands:
Mean1960To1970 = fertilityRates.iloc[:, 1:11].mean(axis=1)
Mean1960To1970
You can use pandas.DataFrame.loc to select a range of years by column label (e.g. "1960":"1968" selects the columns from 1960 through 1968, both endpoints included).
Try this:
Mean1960To1968 = (
fertilityRates[["Country"]]
.assign(Mean= fertilityRates.loc[:, "1960":"1968"].mean(axis=1))
)
# Output :
print(Mean1960To1968)
Country Mean
0 _World 5.004444
1 Afghanistan 7.450000
2 Albania 5.913333
3 Algeria 7.635556
4 Angola 7.030000
5 Antigua and Barbuda 4.223333
6 Arab World 7.023333
7 Argentina 3.073333
8 Armenia 4.133333
9 Aruba 4.044444
10 Australia 3.167778
11 Austria 2.715556
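As a small illustration of the label slice used above, here is a minimal sketch on made-up data. The frame `fertility` and its values are hypothetical stand-ins for fertility_rate.csv, and it assumes the year columns are named with strings, as they would be after `pd.read_csv`. To cover 1960-1969 as in the question, the slice would presumably end at "1969" instead.

```python
import pandas as pd

# Hypothetical stand-in for fertility_rate.csv (values are invented).
fertility = pd.DataFrame({
    "Country": ["A", "B"],
    "1960": [6.0, 2.0],
    "1961": [5.0, 3.0],
    "1962": [4.0, 4.0],
    "1963": [9.0, 9.0],
})

# Label slices with .loc include both endpoints, so "1960":"1962"
# selects three columns and leaves "1963" out.
mean_1960_1962 = fertility[["Country"]].assign(
    Mean=fertility.loc[:, "1960":"1962"].mean(axis=1)
)
print(mean_1960_1962)
```

Because the mean is attached with `assign` to a frame that already holds "Country", nothing has to be dropped from the original data.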
I have a simple DataFrame like the following:
I want to select all values from the 'First Season' column and replace those that are over 1990 by 1. In this example, only Baltimore Ravens would have the 1996 replaced by 1 (keeping the rest of the data intact).
I have used the following:
df.loc[(df['First Season'] > 1990)] = 1
But, it replaces all the values in that row by 1, and not just the values in the 'First Season' column.
How can I replace just the values from that column?
You need to select that column:
In [41]:
df.loc[df['First Season'] > 1990, 'First Season'] = 1
df
Out[41]:
Team First Season Total Games
0 Dallas Cowboys 1960 894
1 Chicago Bears 1920 1357
2 Green Bay Packers 1921 1339
3 Miami Dolphins 1966 792
4 Baltimore Ravens 1 326
5 San Franciso 49ers 1950 1003
So the syntax here is:
df.loc[<mask>(here mask is generating the labels to index) , <optional column(s)> ]
You can check the docs, and also "10 minutes to pandas", which shows the semantics.
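To make the difference between the two forms concrete, here is a minimal sketch on a three-row subset of the table above. The variable name `mask` is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["Dallas Cowboys", "Chicago Bears", "Baltimore Ravens"],
    "First Season": [1960, 1920, 1996],
    "Total Games": [894, 1357, 326],
})

# Boolean Series: True only for the Baltimore Ravens row.
mask = df["First Season"] > 1990

# Row selector plus column label: only 'First Season' changes in matching rows.
# Without the column label, every column in those rows would be set to 1.
df.loc[mask, "First Season"] = 1
print(df)
```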
EDIT
If you want a boolean indicator instead, use the condition to produce a boolean Series and cast its dtype to int; this converts True and False to 1 and 0 respectively:
In [43]:
df['First Season'] = (df['First Season'] > 1990).astype(int)
df
Out[43]:
Team First Season Total Games
0 Dallas Cowboys 0 894
1 Chicago Bears 0 1357
2 Green Bay Packers 0 1339
3 Miami Dolphins 0 792
4 Baltimore Ravens 1 326
5 San Franciso 49ers 0 1003
A bit late to the party but still - I prefer using numpy where:
import numpy as np
df['First Season'] = np.where(df['First Season'] > 1990, 1, df['First Season'])
df.loc[df['First Season'] > 1990, 'First Season'] = 1
Explanation:
df.loc takes two arguments: a row indexer and a column indexer. Here the row indexer is a boolean mask that is True where the value in the 'First Season' column is greater than 1990, and those values are then replaced with 1.
df['First Season'].loc[(df['First Season'] > 1990)] = 1
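Strange that nobody has posted this answer. The only part missing from your code is the ['First Season'] right after df; the parentheses around the condition are optional. Note that this is chained indexing, which pandas may flag with a SettingWithCopyWarning, so df.loc[mask, 'First Season'] is the more reliable form.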
For a single condition, i.e. (df['employrate'] > 70):
country employrate alcconsumption
0 Afghanistan 55.7000007629394 .03
1 Albania 51.4000015258789 7.29
2 Algeria 50.5 .69
3 Andorra 10.17
4 Angola 75.6999969482422 5.57
use this:
df.loc[df['employrate'] > 70, 'employrate'] = 7
country employrate alcconsumption
0 Afghanistan 55.700001 .03
1 Albania 51.400002 7.29
2 Algeria 50.500000 .69
3 Andorra nan 10.17
4 Angola 7.000000 5.57
therefore syntax here is:
df.loc[<mask>(here mask is generating the labels to index) , <optional column(s)> ]
For multiple conditions, i.e. (df['employrate'] <= 55) & (df['employrate'] > 50):
use this:
df['employrate'] = np.where(
(df['employrate'] <=55) & (df['employrate'] > 50) , 11, df['employrate']
)
out[108]:
country employrate alcconsumption
0 Afghanistan 55.700001 .03
1 Albania 11.000000 7.29
2 Algeria 11.000000 .69
3 Andorra nan 10.17
4 Angola 75.699997 5.57
therefore syntax here is:
df['<column_name>'] = np.where((<filter 1> ) & (<filter 2>) , <new value>, df['column_name'])
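The two-condition pattern above can be sketched end to end as follows. The employment rates are rounded versions of the ones shown earlier, and the name `in_range` is only for illustration. Each comparison needs its own parentheses because `&` binds more tightly than `<=` and `>`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
    "employrate": [55.7, 51.4, 50.5, np.nan, 75.7],
})

# Comparisons against NaN are False, so the Andorra row is never selected.
in_range = (df["employrate"] > 50) & (df["employrate"] <= 55)

# Where the mask is True take 11, otherwise keep the existing value.
df["employrate"] = np.where(in_range, 11, df["employrate"])
print(df)
```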
Another option is to use a list comprehension:
df['First Season'] = [1 if year > 1990 else year for year in df['First Season']]
You can also use mask, which replaces the values where the condition is met. It returns a new Series, so assign the result back:
df['First Season'] = df['First Season'].mask(lambda col: col > 1990, 1)
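A short sketch of how `mask` behaves, using a made-up three-value Series named `seasons`. It also shows `where`, which is the mirror image: it keeps values where the condition holds and replaces the rest.

```python
import pandas as pd

seasons = pd.Series([1960, 1920, 1996], name="First Season")

# mask replaces values where the condition is True and returns a new Series.
replaced = seasons.mask(seasons > 1990, 1)

# where keeps values where the condition is True, so the condition is inverted.
kept = seasons.where(seasons <= 1990, 1)

print(replaced.tolist())
print(seasons.tolist())  # the original Series is unchanged
```

Since neither call modifies `seasons` in place, the result has to be assigned back to the column for the change to persist.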
We can update the First Season column in df with the following syntax:
df['First Season'] = expression_for_new_values
To map the values in First Season we can use the pandas .map() method with the syntax below:
data_frame['column'].map({'initial_value_1':'updated_value_1','initial_value_2':'updated_value_2'})
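One caveat with a dictionary passed to `.map()` is that values missing from the dictionary become NaN. The sketch below, on a made-up Series named `seasons`, shows that behaviour next to `Series.replace`, which is a different method that leaves unmatched values alone.

```python
import pandas as pd

seasons = pd.Series([1960, 1920, 1996], name="First Season")

# map: keys not present in the dict (1960, 1920) turn into NaN.
mapped = seasons.map({1996: 1})

# replace: only the listed value is changed, everything else is kept.
replaced = seasons.replace({1996: 1})

print(mapped.tolist())
print(replaced.tolist())
```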
I have a big dataset. It's about news reading. I'm trying to clean it. I created a checklist of cities that I want to keep (the set has all the cities). How can I drop the rows based on that checklist? For example, I have a checklist (as a list) that contains all the french cities. How can I drop other cities?
To picture the data frame (I have 1.5m rows btw):
City Age
0 Paris 25-34
1 Lyon 45-54
2 Kiev 35-44
3 Berlin 25-34
4 New York 25-34
5 Paris 65+
6 Toulouse 35-44
7 Nice 55-64
8 Hannover 45-54
9 Lille 35-44
10 Edinburgh 65+
11 Moscow 25-34
You can do this using pandas.Series.isin. It returns a boolean Series indicating whether each element of the column is in the list x. You can then use those boolean values to keep only the rows that return True by doing df[df['City'].isin(x)]. Here is my solution:
import pandas as pd
x = ['Paris' , 'Marseille']
df = pd.DataFrame(data={'City':['Paris', 'London', 'New York', 'Marseille'],
'Age':[1, 2, 3, 4]})
print(df)
df = df[df['City'].isin(x)]
print(df)
Output:
>>> City Age
0 Paris 1
1 London 2
2 New York 3
3 Marseille 4
City Age
0 Paris 1
3 Marseille 4
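Applied to a few rows of the data from the question, the same idea might look like the sketch below. The set `french_cities` is a hypothetical checklist, and `~` inverts the mask when the other rows are wanted instead.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Paris", "Lyon", "Kiev", "Berlin", "Toulouse"],
    "Age": ["25-34", "45-54", "35-44", "25-34", "35-44"],
})

# Hypothetical checklist of cities to keep; isin accepts a set or a list.
french_cities = {"Paris", "Lyon", "Toulouse", "Nice", "Lille"}

is_french = df["City"].isin(french_cities)
kept = df[is_french].reset_index(drop=True)  # rows whose city is on the checklist
dropped = df[~is_french]                     # everything else
print(kept)
```

The whole column is tested in one vectorized call, so no Python-level loop over the 1.5m rows is needed.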
How can I count, for each country that appears in several rows, how many times it has passed or failed?
The data looks like this:
ID unique Countries Test
1 Spain, Netherlands Fail
2 Italy Pass
3 France, Netherlands Pass
4 Belgium, France, Bulgaria Fail
5 Belgium, United Kingdom Pass
6 Netherlands, France Pass
7 France, Netherlands, Belgiu Pass
and the result should be like this
Pass Fail
Spain 0 1
Italy 1 0
France 3 1
Netherlands 3 1
Belgium 2 1
United Kingdom 1 0
For example, Netherlands appears in 4 rows, with 3 passes and 1 fail.
Use Series.str.split with DataFrame.explode, then call pd.crosstab:
df1 = df.assign(Countries = df.Countries.str.split(', ')).explode('Countries')
df2 = pd.crosstab(df1['Countries'],df1['Test'])
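A runnable sketch of these two steps on the first three rows of the question's table follows. It assumes the column holding the comma-separated names is called Countries, as in the answer's code.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Countries": ["Spain, Netherlands", "Italy", "France, Netherlands"],
    "Test": ["Fail", "Pass", "Pass"],
})

# Split each string into a list, then give every list element its own row.
df1 = df.assign(Countries=df["Countries"].str.split(", ")).explode("Countries")

# Count how often each (country, result) pair occurs.
df2 = pd.crosstab(df1["Countries"], df1["Test"])
print(df2)
```

crosstab sorts the column labels, so Fail comes before Pass; `df2[["Pass", "Fail"]]` reorders them to match the layout requested in the question.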
I want to split one column from my dataframe into multiple columns, then attach those columns back to my original dataframe and divide my original dataframe based on whether the split columns include a specific string.
I have a dataframe that has a column with values separated by semicolons like below.
import pandas as pd
data = {'ID':['1','2','3','4','5','6','7'],
'Residence':['USA;CA;Los Angeles;Los Angeles', 'USA;MA;Suffolk;Boston', 'Canada;ON','USA;FL;Charlotte', 'NA', 'Canada;QC', 'USA;AZ'],
'Name':['Ann','Betty','Carl','David','Emily','Frank', 'George'],
'Gender':['F','F','M','M','F','M','M']}
df = pd.DataFrame(data)
Then I split the column as below, and separated the split column into two based on whether it contains the string USA or not.
address = df['Residence'].str.split(';',expand=True)
country = address[0] != 'USA'
USA, nonUSA = address[~country], address[country]
Now if you run USA and nonUSA, you'll note that there are extra columns in nonUSA, and also a row with no country information. So I got rid of those NA values.
USA.columns = ['Country', 'State', 'County', 'City']
nonUSA.columns = ['Country', 'State']
nonUSA = nonUSA.dropna(axis=0, subset=[1])
nonUSA = nonUSA[nonUSA.columns[0:2]]
Now I want to attach USA and nonUSA to my original dataframe, so that I will get two dataframes that look like below:
USAdata = pd.DataFrame({'ID':['1','2','4','7'],
'Name':['Ann','Betty','David','George'],
'Gender':['F','F','M','M'],
'Country':['USA','USA','USA','USA'],
'State':['CA','MA','FL','AZ'],
'County':['Los Angeles','Suffolk','Charlotte','None'],
'City':['Los Angeles','Boston','None','None']})
nonUSAdata = pd.DataFrame({'ID':['3','6'],
'Name':['Carl','Frank'],
'Gender':['M','M'],
'Country':['Canada', 'Canada'],
'State':['ON','QC']})
I'm stuck here though. How can I split my original dataframe into people whose Residence include USA or not, and attach the split columns from Residence ( USA and nonUSA ) back to my original dataframe?
(Also, I just uploaded everything I had so far, but I'm curious if there's a cleaner/smarter way to do this.)
The original data has a unique index, and the code below leaves that index unchanged in both DataFrames. You can therefore stack USA and nonUSA with concat and attach the result to the original with DataFrame.join or with concat and axis=1:
address = df['Residence'].str.split(';',expand=True)
country = address[0] != 'USA'
USA, nonUSA = address[~country], address[country]
USA.columns = ['Country', 'State', 'County', 'City']
nonUSA = nonUSA.dropna(axis=0, subset=[1])
nonUSA = nonUSA[nonUSA.columns[0:2]]
# assign the column names only after trimming to two columns, to avoid an error
nonUSA.columns = ['Country', 'State']
df = pd.concat([df, pd.concat([USA, nonUSA])], axis=1)
Or:
df = df.join(pd.concat([USA, nonUSA]))
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NaN NaN
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 NaN NaN
3 Charlotte None
4 NaN NaN
5 NaN NaN
6 None None
But it seems this can be simplified:
c = ['Country', 'State', 'County', 'City']
df[c] = df['Residence'].str.split(';',expand=True)
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NA None
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 None None
3 Charlotte None
4 None None
5 None None
6 None None
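The question also asks for two separate frames. One possible way to get them from the split columns is sketched below on a four-row subset of the data. The names `parts`, `is_usa`, `usa` and `non_usa` are illustrative, and the split result is attached with join rather than by direct column assignment.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["1", "2", "3", "5"],
    "Residence": ["USA;CA;Los Angeles;Los Angeles", "USA;MA;Suffolk;Boston",
                  "Canada;ON", "NA"],
    "Name": ["Ann", "Betty", "Carl", "Emily"],
    "Gender": ["F", "F", "M", "F"],
})

# The widest row has four parts, so the split yields exactly four columns.
parts = df["Residence"].str.split(";", expand=True)
parts.columns = ["Country", "State", "County", "City"]
df = df.join(parts)  # aligned on the shared, unchanged index

is_usa = df["Country"] == "USA"
usa = df[is_usa].drop(columns="Residence")

# Non-USA rows without a state (the 'NA' residence) are left out.
non_usa = df[~is_usa & df["State"].notna()].drop(
    columns=["Residence", "County", "City"]
)
print(usa)
print(non_usa)
```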
I am trying to create a new column whose values are the sum of another column, but only for rows where two columns contain specific values.
origin_data_frame (df_o)
month state count
2015-12 Alabama 31359
2015-12 Alaska 245
2015-12 Arizona 2940
2015-12 Arkansas 4076
2015-12 California 119166
2015-12 Colorado 3265
2015-12 Connecticut 12190
2015-12 Delaware 297
2015-12 DC 16
....... ... ..
target_data_frame (df_t) ('counts' is not there):
level_0 level_1 Veterans, 2011-2015 counts
0 h_pct_vet California 1777410 <?>
1 h_pct_vet Texas 1539655 <?>
2 h_pct_vet Florida 1507738 <?>
3 h_pct_vet Pennsylvania 870770 <?>
4 h_pct_vet New York 828586 <?>
5 l_pct_vet Vermont 44708 <?>
6 l_pct_vet Wyoming 48505 <?>
the problem:
counts should include a value that is the sum of count if month is between '2011-01' and '2015-12' and state equals "level_1".
I can get a sum for all count in the time frame:
counts_2011_2015 = df_o['count'][(df_o['month'] >= '2011-01-01') & (df_o['month'] <= '2015-12-31')].sum()
What I tried so far but without success:
df_t['counts'] = df_o['count'][(df_o['month'] >= '2011-01-01') & (df_o['month'] <= '2015-12-31') & (df_o['state'] == df_t['level_1'])].sum()
It raises a ValueError: "ValueError: Can only compare identically-labeled Series objects".
What I found so far (dropping indexes) is not helpful so I would be thankful if someone has an idea
Try grouping them by state first and then merging them with df_t:
# untested code
counts = (
df_o[df_o.month.between("2011-01", "2015-12")]
.groupby("state")["count"].sum()
.reset_index(name="counts")
)
df_t = df_t.merge(counts, left_on="level_1", right_on="state", how="left")
An alternative to @pomber's solution, if you wish to avoid an explicit merge, is to align indices, assign a series from your groupby, then reset the index.
df_t = df_t.set_index('level_1')
df_t['counts'] = df_o.loc[df_o.month.between('2011-01', '2015-12')]\
.groupby('state')['count'].sum()
df_t = df_t.reset_index()
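For a self-contained check of the filter-then-group idea, here is a sketch on invented numbers. It swaps in `Series.map` for the lookup, which is a third variant and not the code from either answer, and it assumes the month column holds 'YYYY-MM' strings so that string comparison orders them correctly.

```python
import pandas as pd

# Invented data in the same shape as df_o and df_t.
df_o = pd.DataFrame({
    "month": ["2010-12", "2011-01", "2015-12", "2015-12", "2016-01"],
    "state": ["Texas", "Texas", "Texas", "California", "California"],
    "count": [5, 10, 20, 7, 100],
})
df_t = pd.DataFrame({"level_1": ["California", "Texas", "Vermont"]})

# between is inclusive on both ends by default.
in_window = df_o["month"].between("2011-01", "2015-12")

# One total per state, indexed by state name.
totals = df_o[in_window].groupby("state")["count"].sum()

# Look up each state's total; states without rows in the window get NaN.
df_t["counts"] = df_t["level_1"].map(totals)
print(df_t)
```

Grouping first avoids comparing `df_o['state']` with `df_t['level_1']` row by row, which is what raised the ValueError in the question.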