Getting unique rows conditioned on year pandas python dataframe - python

I have a dataframe of this form. However, in my final dataframe, I'd like to keep only the rows with unique names per year.
Name Org Year
4 New York University doclist[1] 2004
5 Babson College doclist[2] 2008
6 Babson College doclist[5] 2008
So ideally, my dataframe will look like this instead
4 New York University doclist[1] 2004
5 Babson College doclist[2] 2008
What I've done so far: I've used groupby on year, and I seem to be able to get the unique names by year. However, I am stuck because I lose all the other information, such as the "Org" column. Advice appreciated!
#how to get unique rows per year?
q = z.groupby(['Year'])
#print q.head()
#q.reset_index(level=0, drop=True)
q.Name.apply(lambda x: np.unique(x))
For this I get the following output. How do I include the other column information, as well as remove the secondary index (e.g. 6, 68, 66, 72)?
Year
2008 6 Babson College
68 European Economic And Social Committee
66 European Union
72 Ewing Marion Kauffman Foundation

If all you want to do is keep the first entry for each name, you can use drop_duplicates. Note that this will keep the first entry based on however your data is sorted, so you may want to sort first if you want to keep a specific entry.
In [98]: z.drop_duplicates(subset='Name')
Out[98]:
                  Name         Org  Year
4  New York University  doclist[1]  2004
5       Babson College  doclist[2]  2008
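Since the goal is uniqueness per year rather than overall, a minimal sketch (assuming the dataframe is called z, as in the question) that de-duplicates on both columns:
# Keep the first row for each (Name, Year) pair, so the same name can
# still appear in different years; sort first if a specific row should win.
unique_per_year = z.drop_duplicates(subset=['Name', 'Year'])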

Related

How to convert "event" data into country-year data by summating information in columns? Using python/pandas

I am trying to convert a dataframe where each row is a specific event and each column has information about the event. I want to turn this into data in which each row is a country and year, with information about the number and characteristics of the events in the given year. In this data set, each event is an occurrence of terrorism, and I want to sum the columns nkill, nhostage, and nwounded per year. This data set has 16 countries in West Africa, covers the years 2000-2020, and has roughly 8000 events recorded in total. The data comes from the Global Terrorism Database, and this is for a thesis/independent research project (i.e. not a graded class assignment).
Right now my data looks like this (there are a ton of other columns but they aren't important for this):
eventID     iyear   country_txt   nkill   nwounded   nhostages
10000102    2000    Nigeria           3         10           0
10000103    2000    Mali              1          3          15
10000103    2000    Nigeria          15          0           0
10000103    2001    Benin             1          0           0
10000103    2001    Nigeria           1          3          15
...
And I would like it to look like this:
country_txt   iyear   total_nkill   total_nwounded   total_nhostages
Nigeria       2000            200              300               300
Nigeria       2001            250              450                15
So basically, I want to add up nkill, nwounded, and nhostages for each country-year group, so that I end up with a list of all the countries and years along with the total number of deaths, injuries, and hostages taken per year. The countries also have an associated number; if it is easier to write the code with a number instead of country_txt, the column with the country's number is just "country".
For a solution, I've been looking at the pandas "groupby" function, but I'm really new to coding so I'm having trouble understanding the documentation. It also seems like melt or pivot functions could be helpful.
This simplified example shows how you could use groupby -
import pandas as pd

df = pd.DataFrame({'country': ['Nigeria', 'Nigeria', 'Nigeria', 'Mali'],
                   'year': [2000, 2000, 2001, 2000],
                   'events1': [3, 4, 5, 2],
                   'events2': [1, 6, 3, 4]})
df2 = df.groupby(['country', 'year'])[['events1', 'events2']].sum()
print(df2)
which gives the total of each type of event by country and by year
              events1  events2
country year
Mali    2000        2        4
Nigeria 2000        7        7
        2001        5        3
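Applied to the question's own column names, a similar sketch (assuming the full event dataframe is called df and has the columns shown above):
# Sum the casualty columns for each country-year pair and rename them
# to match the desired output.
totals = (df.groupby(['country_txt', 'iyear'])[['nkill', 'nwounded', 'nhostages']]
            .sum()
            .reset_index()
            .rename(columns={'nkill': 'total_nkill',
                             'nwounded': 'total_nwounded',
                             'nhostages': 'total_nhostages'}))
print(totals)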

In Python, how do I prevent a name from repeating in a dataset and sum or average the information given?

I am analyzing COVID-19 (coronavirus) data published by Our World in Data on GitHub. The data will be presented in an organized table format.
Link to data
https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv
Project Goals:
List every country in alphabetical order
List the total number of tests conducted by each country
List the total number of vaccinated people in each country
List the total number of deaths of each country
Show the average age of death from COVID-19 in each country
Here is an example of the data
iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_icu_admissions,weekly_icu_admissions_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million,total_tests,new_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,positive_rate,tests_per_case,tests_units,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,new_vaccinations,new_vaccinations_smoothed,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,new_vaccinations_smoothed_per_million,new_people_vaccinated_smoothed,new_people_vaccinated_smoothed_per_hundred,stringency_index,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
AFG,Asia,Afghanistan,2020-02-24,5.0,5.0,,,,,0.122,0.122,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
AFG,Asia,Afghanistan,2020-02-25,5.0,0.0,,,,,0.122,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
AFG,Asia,Afghanistan,2020-02-26,5.0,0.0,,,,,0.122,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
AFG,Asia,Afghanistan,2020-02-27,5.0,0.0,,,,,0.122,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
Overall Question:
How do I group the rows for each country so the information does not repeat? I just want each country to display once, with the sum of its vaccination numbers from the data next to it. Example of the way I want the data to display:
Country Vaccinations
Afghanistan 30235
Albania 15032
Andorra 2352
I have tried to import the data with pandas and sum it, but I'm not quite sure how to get the total for one specific country. I want to write it so I can display a single country by itself, but I run into the issue of creating a new table and confusing myself. I am a beginner here and this data set is very large.
df = pd.read_csv(r'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv')
df[['location', 'total_vaccinations']].groupby('location').sum().reset_index()
location total_vaccinations
0 Afghanistan 4.013568e+08
1 Africa 1.853572e+11
2 Albania 3.587615e+08
3 Algeria 2.983973e+08
4 Andorra 4.593697e+06
.. ... ...
243 Western Sahara 0.000000e+00
244 World 4.857404e+12
245 Yemen 2.126927e+07
246 Zambia 5.435039e+08
247 Zimbabwe 3.211422e+09
[248 rows x 2 columns]
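One caveat worth flagging: total_vaccinations in the OWID file is a cumulative running total per day, so summing it adds every daily snapshot together. A sketch that takes the latest reported value per location instead (assuming the same df as above):
# total_vaccinations is cumulative, so the max per location is the most
# recent total rather than a sum of daily snapshots.
latest = (df.groupby('location')['total_vaccinations']
            .max()
            .reset_index())
print(latest.head())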

Using groupby calculations in Pandas data frames

I am working on a geospatial project where I need to do some calculations between groups of data within a data frame. The data I am using spans several years and is specific to the Local Authority District (LAD) code; each year has a numerical ID.
I need to be able to calculate the mean average of a group of years within that data set relative to the LAD code.
LAC LAN JAN FEB MAR APR MAY JUN ID
K04000001 ENGLAND AND WALES 56597 43555 49641 88049 52315 42577 5
E92000001 ENGLAND 53045 40806 46508 83504 49413 39885 5
I can use groupby to calculate the mean based on a LAC, but what I can't do is calculate the mean grouped by LAC for IDs 1:3, for example.
What is more efficient: separating into separate dataframes stored in a dict, for example, or keeping everything in one dataframe and using an ID?
df.groupby('LAC').mean()
I come from a MATLAB background, so I'm just getting the hang of the best way to do things.
Secondly, once these operations are complete, I would like to do the following:
(mean of IDs 1:5 - mean of ID 6) using LAC as the key.
Sorry if I haven't explained this very well!
Edit: Expected output.
To be able to average a group of rows by specific ID for a given value of LAC.
For example:
Average monthly values for E92000001 rows with ID 3
LAC JAN FEB MAR APR MAY JUN ID
K04000001, 56706 43653 49723 88153 52374 42624 5
K04000001 56597 43555 49641 88049 52315 42577 5
E92000001 49186 36947 42649 79645 45554 36026 5
E92000001 53045 40806 46508 83504 49413 39885 3
E92000001 68715 56476 62178 99174 65083 55555 4
E92000001 41075 28836 34538 71534 37443 27915 3
E92000001 54595 42356 48058 85054 50963 41435 1
Rows to be averaged:
E92000001 53045 40806 46508 83504 49413 39885 3
E92000001 41075 28836 34538 71534 37443 27915 3
Result
E92000001 47060 34821 40523 77519 43428 33900 3
edit: corrected error.
To match the update in your question: this will give you a dataframe with only one row for each ID-LAC combination, containing the average of all the rows that had that index.
df.groupby(['ID', 'LAC']).mean()
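For the follow-up of averaging only certain IDs for a given LAC, a minimal sketch, assuming the month columns are numeric and IDs 1 through 3 are the ones of interest:
# Mean of the monthly columns per LAC, restricted to IDs 1-3
# (adjust the bounds to whichever IDs you need).
month_cols = ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN']
subset_mean = (df[df['ID'].between(1, 3)]
                 .groupby('LAC')[month_cols]
                 .mean())
print(subset_mean)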
I would start by setting the ID (i.e. the year) and LAC as the index
df = df.set_index(['ID', 'LAC']).sort_index()
Now you can group by the LAC level of the index and get a running mean for every month, i.e. each row's average since the first year.
grouped = df.groupby(level='LAC')[['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN']]
expanding_mean = grouped.cumsum().div(grouped.cumcount() + 1, axis=0)
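And for the "(mean of IDs 1:5 - mean of ID 6)" part, a hedged sketch that builds on the sorted ID/LAC index set above:
# With the sorted (ID, LAC) index, .loc[1:5] slices IDs 1 through 5.
month_cols = ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN']
mean_1_to_5 = df.loc[1:5, month_cols].groupby(level='LAC').mean()
mean_6 = df.loc[[6], month_cols].groupby(level='LAC').mean()
diff = mean_1_to_5 - mean_6  # the subtraction aligns on LAC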

Merging two dataframes with different structure using pandas

I need to merge data from one dataframe onto another.
The main dataframe consists of survey answers with a year, month, and region variable.
The data I need to merge onto this is the weather data for that specific month. This data is stored in my second data frame for weather stations with a year variable, a temperature average variable for each month (eg. value1, value2, ... value12), and a region variable.
I've tried to merge the two dataframes on region and year; my plan was then to select the average temperature variable which coincides with the survey month.
df1
---------------------------
year month regions
2002 january Pais Vasco
2002 february Pais Vasco
2003 march Pais Vasco
2002 november Florida
2003 december Florida
... ... ...
---------------------------
df2
-----------------------------------------------
year value1 value2 ... value12 regions
2002 10 11 ... 9 Pais Vasco
2003 11 11 ... 10 Pais Vasco
2004 12 11 ... 10 Pais Vasco
2002 11 11 ... 9 Florida
2003 10 11 ... 9 Florida
-----------------------------------------------
So in this example I need for my first survey observation to get the corresponding temperature (value1) data from the region Pais Vasco and year 2002.
When I tried to merge with
df_merged = pd.merge(df1, df2, how = "left", on =["regions", "year"])
I just get a dataframe with way more observations than my original survey dataframe.
I would convert this data to tidy format. Assuming value1, value2, etc. correspond to one value per month, use pd.wide_to_long to turn the weather data into long (tidy) format, then merge.
tidy = pd.wide_to_long(df, stubnames=['value'], i=['year', 'region'], j='month', sep='') \
         .reset_index()
You need to normalise your months so that both dataframes use the same representation, e.g. convert the month names in df1 to the numbers 1-12. How you do this is outside the scope of this answer.
Then,
df1.merge(tidy, on=['year', 'month', 'region'], how='left', validate='1:1')
If this raises an error, then you have multiple observations for the same ['year', 'month', 'region'] key. Fix that by dropping duplicates. How you do so is almost certainly based heavily on your data.
sobek noticed that you have a typo, saying 'regions' rather than 'region' in your merge command. Make sure you're referring to columns that actually exist.
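Putting the pieces together, a minimal sketch under the assumptions above (the wide weather frame is df2, its region column is named regions as in the sample data, and df1's month column has already been converted to the numbers 1-12):
# Reshape value1..value12 into one row per year/region/month; the numeric
# suffix of each value column becomes the 'month' column.
tidy = (pd.wide_to_long(df2, stubnames=['value'],
                        i=['year', 'regions'], j='month', sep='')
          .reset_index())

# Several survey rows can share a year/month/region, so validate that the
# weather side is unique for each key.
merged = df1.merge(tidy, on=['year', 'month', 'regions'],
                   how='left', validate='m:1')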

Dropping a column in a dataframe based on another column

I have a dataframe called jobs
position software salary degree location industry
architect autoCAD 400 masters london AEC
data analyst python 500 bachelors New York Telecommunications
personal assistant excel 200 bachelors London Media
.....
I have another dataframe called 'preference'
name value
position 2
software 4
salary 3
degree 1
location 3
industry 1
I'd like to drop columns from the 'jobs' dataframe whose preference value is less than 2 so that I have
position software salary location
architect autoCAD 400 london
data analyst python 500 New York
personal assistant excel 200 London
.....
This is what I have
jobs.drop(list(jobs.filter(preference['value'] < 2), axis = 1, inplace = True)
but it doesn't seem to drop the degree and industry columns. Any help would be appreciated.
Your attempt is almost there I think. Here's what I have:
>>> jobs.drop(preference.loc[preference['value'] < 2, 'name'], axis=1, inplace=True)
>>> jobs
             position software  salary  location
0           architect  autoCAD     400    london
1        data analyst   python     500  New York
2  personal assistant    excel     200    London
This should work for you:
jobs.drop(preference.loc[preference.value < 2, 'name'], axis=1, inplace=True)
This is why your line of code did not work:
first of all, there is a closing parenthesis missing (but I guess that's just a typo)
the selection of column names should be based on preference, not on jobs
filter is not really what you want to use here: preference.loc[preference.value < 2, 'name'] returns a Series of all the names with value < 2
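As an alternative sketch (not from the answers above), you can also keep the wanted columns instead of dropping the unwanted ones, assuming every name listed in preference is a column of jobs:
# Select only the columns whose preference value is at least 2.
keep = preference.loc[preference['value'] >= 2, 'name']
jobs_filtered = jobs[list(keep)]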
