Pandas select rows by multiple conditions on columns - python

I would like to reduce my code: instead of 2 lines, I would like to select rows using 3 conditions across 2 columns.
My DataFrame contains each country's population between 2000 and 2018, broken down by granularity (Total, Female, Male, Urban, Rural):
Zone Granularity Year Value
0 Afghanistan Total 2000 20779.953
1 Afghanistan Male 2000 10689.508
2 Afghanistan Female 2000 10090.449
3 Afghanistan Rural 2000 15657.474
4 Afghanistan Urban 2000 4436.282
... ... ... ... ...
20909 Zimbabwe Total 2018 14438.802
20910 Zimbabwe Male 2018 6879.119
20911 Zimbabwe Female 2018 7559.693
20912 Zimbabwe Rural 2018 11465.748
20913 Zimbabwe Urban 2018 5447.513
I would like all rows of the Year 2017 with granularity Total AND Urban.
I tried something like the code below, but it isn't working, even though each condition works well on its own:
df.loc[(df['Granularity'].isin(['Total', 'Urban'])) & (df['Year'] == '2017')]
Thanks for any tips.

Very likely, you're using the wrong type for the year. I imagine these are integers.
You should try:
df.loc[(df['Granularity'].isin(['Total', 'Urban'])) & df['Year'].eq(2017)]
output (for the Year 2018 as 2017 is missing from the provided data):
Zone Granularity Year Value
20909 Zimbabwe Total 2018 14438.802
20913 Zimbabwe Urban 2018 5447.513
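As a quick sanity check (not part of the original answer), you can inspect the column's dtype first to decide whether to compare against an integer or a string:
print(df['Year'].dtype)
# if this prints int64, compare with the integer 2017, not the string '2017'
df.loc[df['Granularity'].isin(['Total', 'Urban']) & df['Year'].eq(2017)]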

Related

How to sum the values of one column with respect to a value of another column in python

Suppose I have data like this
Year Population
2016 1000
2016 1200
2017 1400
2017 1500
2018 1600
2018 1600
Now I need to aggregate the data by year, like this:
Year Population
2016 2200
2017 2900
Here I don't need the values for 2018; I only need the sums for 2016 and 2017. How can I achieve this?
There are just so many ways to achieve this.
You could do:
df.groupby('Year').sum().drop(2018).reset_index()
or:
df.query('Year != 2018').groupby('Year', as_index=False).sum()
output:
Year Population
0 2016 2200
1 2017 2900
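For completeness, here is a self-contained sketch that reproduces the sample data and filters with isin, equivalent in spirit to the query variant above:
import pandas as pd

df = pd.DataFrame({'Year': [2016, 2016, 2017, 2017, 2018, 2018],
                   'Population': [1000, 1200, 1400, 1500, 1600, 1600]})
# filter to the years of interest first, then sum per year
print(df[df['Year'].isin([2016, 2017])].groupby('Year', as_index=False)['Population'].sum())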

Python duplicates date values for week number conversion in leap year

I have a dataframe where the date column is in the format '%Y-W%W-%w'. I am converting from 2018-W01 etc. to an actual date using pd.to_datetime(urldict[key]['date']+'-1', format='%Y-W%W-%w'), but the data appears to be shifted incorrectly for 2020/2021, I'm guessing because of the leap year.
Subsequently, it creates two entries for 2021-01-04, with the first entry being what should be 2020-W53. The earlier data is also misaligned.
I'm not sure how to fix this as I assumed that the datetime library would account for it.
Pre-conversion:
date region total
2020-W51 africa 1
2020-W52 africa 2
2020-W53 africa 3
2021-W01 africa 4
Post-conversion:
date region total
12/21/2020 africa 1
12/28/2020 africa 2
1/4/2021 africa 3
1/4/2021 africa 4
It seems you need the ISO 8601 year/week/weekday directives, so the correct format would be '%G-W%V-%u' (see the strftime docs, end of that section). For
df
date region total
0 2020-W51 africa 1
1 2020-W52 africa 2
2 2020-W53 africa 3
3 2021-W01 africa 4
that would look like
pd.to_datetime(df['date']+'-1', format='%G-W%V-%u')
0 2020-12-14
1 2020-12-21
2 2020-12-28
3 2021-01-04
Name: date, dtype: datetime64[ns]
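For reference, the same fix as a self-contained sketch built from the sample data above:
import pandas as pd

df = pd.DataFrame({'date': ['2020-W51', '2020-W52', '2020-W53', '2021-W01'],
                   'region': ['africa'] * 4,
                   'total': [1, 2, 3, 4]})
# append '-1' (Monday) and parse with the ISO year/week/weekday directives,
# so 2020-W53 resolves to 2020-12-28 instead of colliding with 2021-W01
df['date'] = pd.to_datetime(df['date'] + '-1', format='%G-W%V-%u')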
Related: Python - Get date from day of week, year, and week number

Create a column that divides the other 2 columns using Pandas Apply()

Given a dataset -
country year cases population
Afghanistan 1999 745 19987071
Brazil 1999 37737 172006362
China 1999 212258 1272915272
Afghanistan 2000 2666 20595360
Brazil 2000 80488 174504898
China 2000 213766 1280428583
The task is to get the ratio of cases to population using the pandas apply function, in a new column called "prevalence"
This is what I have written:
def calc_prevalence(G):
    assert 'cases' in G.columns and 'population' in G.columns
    G_copy = G.copy()
    G_copy['prevalence'] = G_copy['cases','population'].apply(lambda x: (x['cases']/x['population']))
    display(G_copy)
but I am getting a
KeyError: ('cases', 'population')
Here is a solution that applies a named function to the dataframe without using lambda:
def calculate_ratio(row):
    return row['cases']/row['population']

df['prevalence'] = df.apply(calculate_ratio, axis=1)
print(df)
#output:
country year cases population prevalence
0 Afghanistan 1999 745 19987071 0.000037
1 Brazil 1999 37737 172006362 0.000219
2 China 1999 212258 1272915272 0.000167
3 Afghanistan 2000 2666 20595360 0.000129
4 Brazil 2000 80488 174504898 0.000461
5 China 2000 213766 1280428583 0.000167
First, unless you've been explicitly told to use an apply function here for some reason, you can perform the operation on the columns themselves, which results in a much faster vectorized operation, i.e.:
G_copy['prevalence']=G_copy['cases']/G_copy['population']
Finally, if you must use apply for some reason, apply it on the DataFrame instead of on the two Series:
G_copy['prevalence']=G_copy.apply(lambda row: row['cases']/row['population'],axis=1)
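As for the original error: G_copy['cases','population'] is read as a single tuple-valued column key, which is why the KeyError names the tuple. Selecting both columns needs double brackets, and the row-wise apply needs axis=1. A sketch of the corrected original function (returning the result, since display only exists in IPython):
def calc_prevalence(G):
    assert 'cases' in G.columns and 'population' in G.columns
    G_copy = G.copy()
    # double brackets select a two-column DataFrame; axis=1 applies row-wise
    G_copy['prevalence'] = G_copy[['cases', 'population']].apply(
        lambda x: x['cases'] / x['population'], axis=1)
    return G_copy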

pandas python get table with the first date of event in every year, each country, alternative groupby

Who can help? I'm trying to group a table of earthquakes by country and year to get the date of the first earthquake in each year for each country. I was able to do it with groupby, but that view does not suit me; I need the same result in this form:
China 2002 06-28
China 2005 07-25
China 2009 05-10
China 2010 03-10
China 2011 05-10
... ... ... ...
the Kuril Islands 2017 04-07
the Kuril Islands 2018 01-06
the Volcano Islands 2010 10-24
the Volcano Islands 2013 08-24
the Volcano Islands 2015 04-02
06-28 = month-day
How can I do it?
Thanks
Once you have your groupby result, use df = df.reset_index().
This will turn the index levels you grouped by back into regular columns and give you the result you want, as in the sketch below.
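A minimal sketch, with column names assumed from the question; since the dates are zero-padded month-day strings, min() picks the earliest one per group:
import pandas as pd

df = pd.DataFrame({'country': ['China', 'China', 'China', 'China'],
                   'year': [2002, 2002, 2005, 2005],
                   'date': ['06-28', '09-14', '07-25', '11-02']})
# earliest earthquake per (country, year); reset_index turns the grouped
# index levels back into regular columns
print(df.groupby(['country', 'year'])['date'].min().reset_index())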

Discrepancy in data values while opening .csv file manually and by using python query

Data Source: https://www.kaggle.com/worldbank/world-development-indicators
Folder: 'world-development-indicators'
When I manually check the database by opening the csv file in MS-Excel, I find the number of years to be from 1960 to 1980 (min year 1960 and max year 1980).
However, when I run the commands below in Python, I get years ranging from 1960 to 2015, with a maximum of 2015 (the minimum is still 1960):
data = pd.read_csv('./world-development-indicators/Indicators.csv')
years = data['Year'].unique().tolist()
len(years)
o/p: 56
min(years)
o/p: 1960
max(years)
o/p: 2015
If the maximum year in the .csv file when opened manually is 1980, why am I getting 2015 as the maximum value of the Year column when querying it in Python?
Has anyone faced such an issue? Can anyone please help?
The file you mention contains 5.65 million records. I have tested this in MS Excel as well as LibreOffice on Linux; both give an error message that not all rows could be loaded. Hence, you only see records up to 1980.
I did a:
data.describe()
And found the min and max to be 1960 and 2015. Also, Year increases through the file. If you do data.head(5) and data.tail(5), you will notice the following:
data.tail(5)
Out[109]:
CountryName CountryCode ... Year Value
5656453 Zimbabwe ZWE ... 2015 36.0
5656454 Zimbabwe ZWE ... 2015 90.0
5656455 Zimbabwe ZWE ... 2015 242.0
5656456 Zimbabwe ZWE ... 2015 3.3
5656457 Zimbabwe ZWE ... 2015 32.8
[5 rows x 6 columns]
data.head(5)
Out[110]:
CountryName CountryCode ... Year Value
0 Arab World ARB ... 1960 1.335609e+02
1 Arab World ARB ... 1960 8.779760e+01
2 Arab World ARB ... 1960 6.634579e+00
3 Arab World ARB ... 1960 8.102333e+01
4 Arab World ARB ... 1960 3.000000e+06
PS: If you use Spyder, you can open the Variable Explorer section, double click on data, and you should see all the records. I prefer this over opening in Excel because Excel usually truncates the records at the bottom if the file is large.
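For a quick check of the full range without opening the file in a spreadsheet, you can ask pandas directly (same CSV as in the question):
import pandas as pd

data = pd.read_csv('./world-development-indicators/Indicators.csv')
print(data['Year'].min(), data['Year'].max())  # 1960 2015
print(data['Year'].nunique())                  # 56 distinct years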