Conditionals with NaN in python [duplicate]

I have a simple DataFrame like the following:
I want to select all values from the 'First Season' column and replace those that are over 1990 with 1. In this example, only the Baltimore Ravens row would have its 1996 replaced with 1 (keeping the rest of the data intact).
I have used the following:
df.loc[(df['First Season'] > 1990)] = 1
But it replaces all the values in that row with 1, not just the value in the 'First Season' column.
How can I replace just the values from that column?
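(The question's sample data is not shown above; a DataFrame matching it can be reconstructed from the output in the answers below:)
import pandas as pd

df = pd.DataFrame({
    'Team': ['Dallas Cowboys', 'Chicago Bears', 'Green Bay Packers',
             'Miami Dolphins', 'Baltimore Ravens', 'San Franciso 49ers'],
    'First Season': [1960, 1920, 1921, 1966, 1996, 1950],
    'Total Games': [894, 1357, 1339, 792, 326, 1003],
})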

You need to select that column:
In [41]:
df.loc[df['First Season'] > 1990, 'First Season'] = 1
df
Out[41]:
Team First Season Total Games
0 Dallas Cowboys 1960 894
1 Chicago Bears 1920 1357
2 Green Bay Packers 1921 1339
3 Miami Dolphins 1966 792
4 Baltimore Ravens 1 326
5 San Franciso 49ers 1950 1003
So the syntax here is:
df.loc[<mask>, <optional column(s)>]
where the mask generates the row labels to index.
You can check the docs and also 10 Minutes to pandas, which shows the semantics.
EDIT
If you want to generate a boolean indicator, you can use the boolean condition to generate a boolean Series and cast the dtype to int; this will convert True and False to 1 and 0 respectively:
In [43]:
df['First Season'] = (df['First Season'] > 1990).astype(int)
df
Out[43]:
Team First Season Total Games
0 Dallas Cowboys 0 894
1 Chicago Bears 0 1357
2 Green Bay Packers 0 1339
3 Miami Dolphins 0 792
4 Baltimore Ravens 1 326
5 San Franciso 49ers 0 1003

A bit late to the party but still - I prefer using numpy where:
import numpy as np
df['First Season'] = np.where(df['First Season'] > 1990, 1, df['First Season'])

df.loc[df['First Season'] > 1990, 'First Season'] = 1
Explanation:
df.loc takes two arguments, a row indexer and a column indexer. We check whether each value in the 'First Season' column is greater than 1990 and, where it is, replace it with 1.

df['First Season'].loc[(df['First Season'] > 1990)] = 1
Strange that nobody has this answer; the only part missing from your code is the ['First Season'] right after df (the extra parentheses inside loc are optional). Note that this chained form can raise a SettingWithCopyWarning; the single df.loc[mask, 'First Season'] assignment above is the safer idiom.

For a single condition, i.e. (df['employrate'] > 70):
country employrate alcconsumption
0 Afghanistan 55.7000007629394 .03
1 Albania 51.4000015258789 7.29
2 Algeria 50.5 .69
3 Andorra 10.17
4 Angola 75.6999969482422 5.57
use this:
df.loc[df['employrate'] > 70, 'employrate'] = 7
country employrate alcconsumption
0 Afghanistan 55.700001 .03
1 Albania 51.400002 7.29
2 Algeria 50.500000 .69
3 Andorra nan 10.17
4 Angola 7.000000 5.57
The syntax here is therefore:
df.loc[<mask>, <optional column(s)>]
where the mask generates the row labels to index.
For multiple conditions, i.e. (df['employrate'] <= 55) & (df['employrate'] > 50),
use this:
df['employrate'] = np.where(
    (df['employrate'] <= 55) & (df['employrate'] > 50), 11, df['employrate']
)
Out[108]:
country employrate alcconsumption
0 Afghanistan 55.700001 .03
1 Albania 11.000000 7.29
2 Algeria 11.000000 .69
3 Andorra nan 10.17
4 Angola 75.699997 5.57
The syntax here is therefore:
df['<column_name>'] = np.where((<filter 1>) & (<filter 2>), <new value>, df['<column_name>'])

Another option is to use a list comprehension:
df['First Season'] = [1 if year > 1990 else year for year in df['First Season']]

You can also use mask, which replaces the values where the condition is met (mask returns a new Series, so assign the result back):
df['First Season'] = df['First Season'].mask(lambda col: col > 1990, 1)

We can update the First Season column in df with the following syntax:
df['First Season'] = expression_for_new_values
To map the values in First Season we can use pandas' .map() method with the below syntax:
data_frame['column'].map({'initial_value_1': 'updated_value_1', 'initial_value_2': 'updated_value_2'})
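A minimal sketch of the .map() approach applied to this question's data (the 1996 key is taken from the sample; note that .map() with a dict returns NaN for values missing from it, so the unmapped years have to be filled back in):
df['First Season'] = df['First Season'].map({1996: 1}).fillna(df['First Season']).astype(int)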

Related

drop rows in Groupby python

I want to get this result:
df3=df.groupby(['Region']).apply(lambda x: x[x['Region'].isin(["North", "East"])]['Sales'].sum()).reset_index(name='sum')
Region sum
0 East 455.0
1 North 665.0
2 South 0.0
3 West 0.0
I want to drop rows with value = 0, or by other conditions.
Region sum
0 East 455.0
1 North 665.0
You can use df.loc:
df[1] != 0          # -> True/False filter
df.loc[df[1] != 0]  # apply the filter
df = pd.DataFrame([['East', 455.0],
                   ['North', 665.0],
                   ['South', 0.0],
                   ['West', 0.0]])
df
Out[11]:
0 1
0 East 455.0
1 North 665.0
2 South 0.0
3 West 0.0
df.loc[df[1]!=0]
Out[12]:
0 1
0 East 455.0
1 North 665.0
Answer to the comment:
df.rename(columns={0: 'region', 1: 'sum'}).assign(**{'sum': lambda p: [q if q != 0 else pd.NA for q in p['sum']]}).dropna()
(I am not sure if I understood it; do you mean that?)
Using df.loc is the easiest method that comes to mind:
filtered_df = df3.loc[df3["sum"] != 0]

Calling mean() Function Without Removing Non-Numeric Columns In Dataframe

I have the following dataframe:
import pandas as pd
fertilityRates = pd.read_csv('fertility_rate.csv')
fertilityRatesRowCount = len(fertilityRates.axes[0])
fertilityRates.head(fertilityRatesRowCount)
I have found a way to find the mean for each row over columns 1960-1969, but would like to do so without removing the column called "Country".
The following is the output after I execute these commands:
Mean1960To1970 = fertilityRates.iloc[:, 1:11].mean(axis=1)
Mean1960To1970
You can use pandas.DataFrame.loc to select a range of years (e.g "1960":"1968" means from 1960 to 1968).
Try this :
Mean1960To1968 = (
    fertilityRates[["Country"]]
    .assign(Mean=fertilityRates.loc[:, "1960":"1968"].mean(axis=1))
)
# Output :
print(Mean1960To1968)
Country Mean
0 _World 5.004444
1 Afghanistan 7.450000
2 Albania 5.913333
3 Algeria 7.635556
4 Angola 7.030000
5 Antigua and Barbuda 4.223333
6 Arab World 7.023333
7 Argentina 3.073333
8 Armenia 4.133333
9 Aruba 4.044444
10 Australia 3.167778
11 Austria 2.715556

Replace Matrix elements with 1

I have an empty matrix and I want to replace the matrix elements with 1 if the country (index) belongs to the region (column).
I tried to create a double loop, but I got stuck when I needed to write the conditional ([152 rows x 6 columns]). Thanks so much.
west europe east europe latin america
Norway 0 0 0
Denmark 0 0 0
Iceland 0 0 0
Switzerland 0 0 0
Finland 0 0 0
Netherlands 0 0 0
Sweden 0 0 0
Austria 0 0 0
Ireland 0 0 0
Germany 0 0 0
Belgium 0 0 0
I was thinking of something like:
matrix = pd.DataFrame(np.zeros((152, 6), dtype=int),
                      index=[...],  # enumerate all the countries here
                      columns=['west europe', 'east europe', 'latin america',
                               'north america', 'africa', 'asia'])
print(matrix)
for i in range(len(matrix)):
    for j in range(len(matrix.columns)):
        if ...:  # e.g. country i is in the list of African countries and j is the 'africa' column
            matrix.iloc[i, j] = 1
        else:
            matrix.iloc[i, j] = 0
print(matrix)
Sample data frame with countries and region:
Country Happiness Rank Happiness Score Economy Family Health Freedom Generosity Corruption Dystopia Job Satisfaction Region
0 Norway 1 7.537 1.616463 1.533524 0.796667 0.635423 0.362012 0.315964 2.277027 94.6 Western Europe
1 Denmark 2 7.522 1.482383 1.551122 0.792566 0.626007 0.355280 0.400770 2.313707 93.5 Western Europe
2 Iceland 3 7.504 1.480633 1.610574 0.833552 0.627163 0.475540 0.153527 2.322715 94.5 Western Europe
3 Switzerland 4 7.494 1.564980 1.516912 0.858131 0.620071 0.290549 0.367007 2.276716 93.7 Western Europe
4 Finland 5 7.469 1.443572 1.540247 0.809158 0.617951 0.245483 0.382612 2.430182 91.2 Western Europe
5 Netherlands 6 7.377 1.503945 1.428939 0.810696 0.585384 0.470490 0.282662 2.294804 93.8 Western Europe
If your input variable data is a DataFrame, then as @Alollz mentioned, you can use the pandas pd.get_dummies function.
Something like this: pd.get_dummies(data, columns=['Region'])
And the output would look like:
Country HappinessRank HappinessScore Economy Family Health Freedom Generosity Corruption Dystopia JobSatisfaction Region_WesternEurope
0 Norway 1 7.537 1.616463 1.533524 0.796667 0.635423 0.362012 0.315964 2.277027 94.6 1
1 Denmark 2 7.522 1.482383 1.551122 0.792566 0.626007 0.355280 0.400770 2.313707 93.5 1
2 Iceland 3 7.504 1.480633 1.610574 0.833552 0.627163 0.475540 0.153527 2.322715 94.5 1
3 Switzerland 4 7.494 1.564980 1.516912 0.858131 0.620071 0.290549 0.367007 2.276716 93.7 1
4 Finland 5 7.469 1.443572 1.540247 0.809158 0.617951 0.245483 0.382612 2.430182 91.2 1
5 Netherlands 6 7.377 1.503945 1.428939 0.810696 0.585384 0.470490 0.282662 2.294804 93.8 1
It will take the Region category column and make it into indicator columns. In this case it uses the column name as the prefix but you can play around with that.
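As a sketch of how this applies to the original question, assuming data holds Country and Region columns as in the sample above, the 0/1 membership matrix can be built without any loops (the Mexico row is invented here so that more than one region column appears):
import pandas as pd

data = pd.DataFrame({
    'Country': ['Norway', 'Denmark', 'Mexico'],   # Mexico: invented example row
    'Region': ['Western Europe', 'Western Europe', 'Latin America'],
})

# One indicator column per region, indexed by country
matrix = pd.get_dummies(data.set_index('Country')['Region']).astype(int)
print(matrix)
#          Latin America  Western Europe
# Country
# Norway               0               1
# Denmark              0               1
# Mexico               1               0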

Pandas column is the sum if three criteria are met (similar to sumproduct)

I am trying to create a new column whose values are the sum of another column, but only if two columns contain specific values.
origin_data_frame (df_o)
month state count
2015-12 Alabama 31359
2015-12 Alaska 245
2015-12 Arizona 2940
2015-12 Arkansas 4076
2015-12 California 119166
2015-12 Colorado 3265
2015-12 Connecticut 12190
2015-12 Delaware 297
2015-12 DC 16
....... ... ..
target_data_frame (df_t) (the 'counts' column does not exist yet):
level_0 level_1 Veterans, 2011-2015 counts
0 h_pct_vet California 1777410 <?>
1 h_pct_vet Texas 1539655 <?>
2 h_pct_vet Florida 1507738 <?>
3 h_pct_vet Pennsylvania 870770 <?>
4 h_pct_vet New York 828586 <?>
5 l_pct_vet Vermont 44708 <?>
6 l_pct_vet Wyoming 48505 <?>
the problem:
counts should contain, for each row, the sum of count where month is between '2011-01' and '2015-12' and state equals that row's level_1.
I can get a sum of all count values in the time frame:
counts_2011_2015 = df_o['count'][(df_o['month'] >= '2011-01-01') & (df_o['month'] <= '2015-12-31')].sum()
What I tried so far but without success:
df_t['counts'] = df_o['count'][(df_o['month'] >= '2011-01-01') & (df_o['month'] <= '2015-12-31') & (df_o['state'] == df_t['level_1'])].sum()
It raises a ValueError: "ValueError: Can only compare identically-labeled Series objects".
What I found so far (dropping indexes) was not helpful, so I would be thankful if someone has an idea.
Try grouping them by state first and then merging them with df_t:
# untested code
counts = (
    df_o[df_o.month.between("2011-01", "2015-12")]
    .groupby("state")["count"].sum()
    .reset_index(name="counts")
)
df_t.merge(counts, left_on="level_1", right_on="state", how="left")
An alternative to @pomber's solution, if you wish to avoid an explicit merge, is to align the indices, assign a Series from your groupby, then reset the index. This works because pandas aligns the groupby result's state index with df_t's level_1 index during the assignment.
df_t = df_t.set_index('level_1')
df_t['counts'] = df_o.loc[df_o.month.between('2011-01', '2015-12')]\
.groupby('state')['count'].sum()
df_t = df_t.reset_index()

Selecting rows by last 3 characters in a column with strings

I have this dataframe
name year ...
0 Carlos - xyz 2019
1 Marcos - yws 2031
3 Fran - xxz 2431
4 Matt - yre 1985
...
I want to create a new column, called type.
If the name of the person ends with "xyz" or "xxz", I want type to be "big".
So, it should look like this:
name year type
0 Carlos - xyz 2019 big
1 Marcos - yws 2031
3 Fran - xxz 2431 big
4 Matt - yre 1985
...
Any suggestions?
Option 1
Use str.contains to generate a mask:
m = df.name.str.contains(r'x[yx]z$')
Or,
sub_str = ['xyz', 'xxz']
m = df.name.str.contains(r'{}$'.format('|'.join(sub_str)))
Now, you may either create your column with np.where,
df['type'] = np.where(m, 'big', '')
Or, use loc in place of np.where:
df['type'] = ''
df.loc[m, 'type'] = 'big'
df
name year type
0 Carlos - xyz 2019 big
1 Marcos - yws 2031
3 Fran - xxz 2431 big
4 Matt - yre 1985
Option 2
As an alternative, consider str.endswith + np.logical_or.reduce
sub_str = ['xyz', 'xxz']
m = np.logical_or.reduce([df.name.str.endswith(s) for s in sub_str])
df['type'] = ''
df.loc[m, 'type'] = 'big'
df
name year type
0 Carlos - xyz 2019 big
1 Marcos - yws 2031
3 Fran - xxz 2431 big
4 Matt - yre 1985
Here is one way using pandas.Series.str:
import numpy as np
import pandas as pd

df = pd.DataFrame([['Carlos - xyz', 2019], ['Marcos - yws', 2031],
                   ['Fran - xxz', 2431], ['Matt - yre', 1985]],
                  columns=['name', 'year'])
df['type'] = np.where(df['name'].str[-3:].isin({'xyz', 'xxz'}), 'big', '')
Alternatively, you can use .loc accessor instead of numpy.where:
df['type'] = ''
df.loc[df['name'].str[-3:].isin({'xyz', 'xxz'}), 'type'] = 'big'
Result
name year type
0 Carlos - xyz 2019 big
1 Marcos - yws 2031
2 Fran - xxz 2431 big
3 Matt - yre 1985
Explanation
Extract the last 3 letters using pd.Series.str.
Compare them against a specified set of values for O(1) membership lookup.
Use numpy.where to perform conditional assignment for the new series.
