Getting unique rows conditioned on year in a pandas Python dataframe
I have a dataframe of this form. In my final dataframe, however, I'd like to keep only the rows whose Name is unique within each year.
Name Org Year
4 New York University doclist[1] 2004
5 Babson College doclist[2] 2008
6 Babson College doclist[5] 2008
So ideally, my dataframe will look like this instead
4 New York University doclist[1] 2004
5 Babson College doclist[2] 2008
What I've done so far: I've used groupby on Year, and I seem to be able to get the unique names per year. However, I am stuck because I lose all the other information, such as the "Org" column. Advice appreciated!
#how to get unique rows per year?
import numpy as np

q = z.groupby(['Year'])
#print q.head()
#q.reset_index(level=0, drop=True)
q.Name.apply(lambda x: np.unique(x))
For this I get the following output. How do I include the other column information, as well as remove the secondary index (e.g. 6, 68, 66, 72)?
Year
2008 6 Babson College
68 European Economic And Social Committee
66 European Union
72 Ewing Marion Kauffman Foundation
If all you want to do is keep the first entry for each name, you can use drop_duplicates. Note that this will keep the first entry based on however your data is sorted, so you may want to sort first if you want to keep a specific entry.
In [98]: q.drop_duplicates(subset='Name')
Out[98]:
Name Org Year
0 New York University doclist[1] 2004
1 Babson College doclist[2] 2008
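If the goal is one row per Name within each Year (rather than one per Name overall), a minimal sketch, assuming the original dataframe is named z as in the question, would pass both columns to subset:

import pandas as pd

z = pd.DataFrame({
    'Name': ['New York University', 'Babson College', 'Babson College'],
    'Org':  ['doclist[1]', 'doclist[2]', 'doclist[5]'],
    'Year': [2004, 2008, 2008],
}, index=[4, 5, 6])

# Keep the first row for each (Name, Year) pair; sorting first controls
# which row survives (here the lowest doclist entry).
unique_per_year = (
    z.sort_values('Org')
     .drop_duplicates(subset=['Name', 'Year'])
     .sort_index()
)
print(unique_per_year)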
Related
How to convert "event" data into country-year data by summating information in columns? Using python/pandas
I am trying to convert a dataframe where each row is a specific event and each column has information about that event. I want to turn this into data in which each row is a country and year, with information about the number and characteristics of the events in that year. In this data set, each event is an occurrence of terrorism, and I want to sum the columns nkill, nwounded, and nhostages per country and year. The data set has 16 countries in West Africa, covers the years 2000-2020, and has a total of roughly 8000 events recorded. The data comes from the Global Terrorism Database, and this is for a thesis/independent research project (i.e. not a graded class assignment).

Right now my data looks like this (there are a ton of other columns but they aren't important for this):

eventID    iyear  country_txt  nkill  nwounded  nhostages
10000102   2000   Nigeria      3      10        0
10000103   2000   Mali         1      3         15
10000103   2000   Nigeria      15     0         0
10000103   2001   Benin        1      0         0
10000103   2001   Nigeria      1      3         15
...

And I would like it to look like this:

country_txt  iyear  total_nkill  total_nwounded  total_nhostages
Nigeria      2000   200          300             300
Nigeria      2001   250          450             15

So basically, I want to add up nkill, nwounded, and nhostages for each country-year group, so that I have a list of all the countries and years with the total number of deaths, injuries, and hostages taken per year. The countries also have an associated number if it is easier to write the code with a number instead of country_txt; the column with the country's number is just "country". For a solution, I've been looking at the pandas groupby function, but I'm really new to coding, so I'm having trouble understanding the documentation. It also seems like the melt or pivot functions could be helpful.
This simplified example shows how you could use groupby:

import pandas as pd

df = pd.DataFrame({'country': ['Nigeria', 'Nigeria', 'Nigeria', 'Mali'],
                   'year': [2000, 2000, 2001, 2000],
                   'events1': [3, 4, 5, 2],
                   'events2': [1, 6, 3, 4]})

df2 = df.groupby(['country', 'year'])[['events1', 'events2']].sum()
print(df2)

which gives the total of each type of event by country and by year:

              events1  events2
country year
Mali    2000        2        4
Nigeria 2000        7        7
        2001        5        3
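Applied to the question's own columns, a sketch along the same lines (assuming the dataframe is named df and contains country_txt, iyear, nkill, nwounded, and nhostages as shown above; the total_* names are just illustrative):

# Sum casualties per country-year group and turn the group keys back into columns
totals = (df.groupby(['country_txt', 'iyear'])[['nkill', 'nwounded', 'nhostages']]
            .sum()
            .reset_index()
            .rename(columns={'nkill': 'total_nkill',
                             'nwounded': 'total_nwounded',
                             'nhostages': 'total_nhostages'}))
print(totals)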
In Python, how do I prevent a name from repeating in a dataset and sum or average the information given?
I am analyzing data found on GitHub reflecting COVID-19 (coronavirus) by Our World in Data. The data will be presented in an organized table format.

Link to data: https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv

Project goals:
- List every country in alphabetical order
- List the total number of tests conducted by each country
- List the total number of vaccinated people in each country
- List the total number of deaths in each country
- Show the average age of death to Covid-19 in each country

Here is an example of the data:

iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,new_deaths_per_million,new_deaths_smoothed_per_million,reproduction_rate,icu_patients,icu_patients_per_million,hosp_patients,hosp_patients_per_million,weekly_icu_admissions,weekly_icu_admissions_per_million,weekly_hosp_admissions,weekly_hosp_admissions_per_million,total_tests,new_tests,total_tests_per_thousand,new_tests_per_thousand,new_tests_smoothed,new_tests_smoothed_per_thousand,positive_rate,tests_per_case,tests_units,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,new_vaccinations,new_vaccinations_smoothed,total_vaccinations_per_hundred,people_vaccinated_per_hundred,people_fully_vaccinated_per_hundred,total_boosters_per_hundred,new_vaccinations_smoothed_per_million,new_people_vaccinated_smoothed,new_people_vaccinated_smoothed_per_hundred,stringency_index,population_density,median_age,aged_65_older,aged_70_older,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
AFG,Asia,Afghanistan,2020-02-24,5.0,5.0,,,,,0.122,0.122,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
AFG,Asia,Afghanistan,2020-02-25,5.0,0.0,,,,,0.122,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
AFG,Asia,Afghanistan,2020-02-26,5.0,0.0,,,,,0.122,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,
AFG,Asia,Afghanistan,2020-02-27,5.0,0.0,,,,,0.122,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,8.33,54.422,18.6,2.581,1.337,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.511,41128772.0,,,,

Overall question: How do I sum the data by country so the country names do not repeat? I just want each country to display once, with the sum of its vaccination numbers next to it. Example of how I want the data to display:

Country       Vaccinations
Afghanistan   30235
Albania       15032
Andorra       2352

I have tried to import the data with pandas and sum it, but I'm not quite sure how to get the sum for one specific country. I want to be able to display a single country by itself, but I run into the issue of creating a new table and confusing myself. I am a beginner here and this data set is very large.
import pandas as pd

df = pd.read_csv(r'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv')
df[['location', 'total_vaccinations']].groupby('location').sum().reset_index()

           location  total_vaccinations
0       Afghanistan        4.013568e+08
1            Africa        1.853572e+11
2           Albania        3.587615e+08
3           Algeria        2.983973e+08
4           Andorra        4.593697e+06
..              ...                 ...
243  Western Sahara        0.000000e+00
244           World        4.857404e+12
245           Yemen        2.126927e+07
246          Zambia        5.435039e+08
247        Zimbabwe        3.211422e+09

[248 rows x 2 columns]
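One caveat, offered as an aside: in this dataset total_vaccinations is a daily running (cumulative) total, so summing it across dates adds up the running totals rather than giving each country's final count. If the latest reported total per country is what's wanted, a minimal sketch using the same df could take the last non-null value per location instead:

# total_vaccinations is cumulative, so take each location's most recent
# non-null value rather than summing the daily running totals
latest = (df.sort_values('date')
            .groupby('location')['total_vaccinations']
            .last()                      # GroupBy.last() skips NaN values
            .reset_index()
            .sort_values('location'))    # countries in alphabetical order
print(latest)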
Using groupby calculations in Pandas data frames
I am working on a geospatial project where I need to do some calculations between groups of data within a data frame. The data I am using spans several different years and is specific to the Local Authority District code; each year has a numerical ID. I need to be able to calculate the mean average of a group of years within that data set relative to the LAC code.

LAC        LAN                JAN    FEB    MAR    APR    MAY    JUN    ID
K04000001  ENGLAND AND WALES  56597  43555  49641  88049  52315  42577  5
E92000001  ENGLAND            53045  40806  46508  83504  49413  39885  5

I can use groupby to calculate the mean based on a LAC, but what I can't do is calculate the mean grouped by LAC for ID 1:3, for example. Which is more efficient: separating into separate dataframes stored in a dict, or keeping everything in one dataframe and using an ID?

df.groupby('LAC').mean()

I come from a MATLAB background, so I'm just getting the hang of the best way to do things. Secondly, once these operations are complete, I would like to do the following: (mean of IDs 1:5 - mean of ID 6), using LAC as the key. Sorry if I haven't explained this very well!

Edit: Expected output. To be able to average a group of rows by specific ID for a given value of LAC. For example, average monthly values for E92000001 rows with ID 3:

LAC        JAN    FEB    MAR    APR    MAY    JUN    ID
K04000001  56706  43653  49723  88153  52374  42624  5
K04000001  56597  43555  49641  88049  52315  42577  5
E92000001  49186  36947  42649  79645  45554  36026  5
E92000001  53045  40806  46508  83504  49413  39885  3
E92000001  68715  56476  62178  99174  65083  55555  4
E92000001  41075  28836  34538  71534  37443  27915  3
E92000001  54595  42356  48058  85054  50963  41435  1

Rows to be averaged:

E92000001  53045  40806  46508  83504  49413  39885  3
E92000001  41075  28836  34538  71534  37443  27915  3

Result:

E92000001  47060  34821  40523  77519  43428  33900  3

edit: corrected error.
To match the update in your question: this will give you a dataframe with only one row for each ID-LAC combination, containing the average of all the rows that had that index.

df.groupby(['ID', 'LAC']).mean()

I would start by setting the ID and LAC as the index (note that inplace calls cannot be chained, so assign the result instead):

df = df.set_index(['ID', 'LAC']).sort_index()

Now you can group by the index and get the mean for every month, or even each row's running average since the first year:

counts = df.groupby(level=['ID', 'LAC']).cumcount() + 1
expanding_mean = df.groupby(level=['ID', 'LAC']).cumsum().div(counts, axis=0)
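For the second part of the question (the mean over IDs 1:5 minus the mean for ID 6, keyed by LAC), a minimal sketch, assuming df still has LAC and ID as ordinary columns and the month columns are JAN through JUN:

month_cols = ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN']

# Group-wise monthly means for IDs 1-5 and for ID 6, both keyed by LAC
mean_ids_1_to_5 = df[df['ID'].between(1, 5)].groupby('LAC')[month_cols].mean()
mean_id_6 = df[df['ID'] == 6].groupby('LAC')[month_cols].mean()

# Subtract; rows align automatically on the LAC index
diff = mean_ids_1_to_5 - mean_id_6
print(diff)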
Merging two dataframes with different structure using pandas
I need to merge data from one dataframe onto another. The main dataframe consists of survey answers with a year, month, and region variable. The data I need to merge onto this is the weather data for that specific month. This data is stored in my second data frame for weather stations, with a year variable, a temperature average variable for each month (e.g. value1, value2, ... value12), and a region variable. I've tried to merge the two dataframes on region and year, and my plan was then afterwards to select the average temperature variable which coincides with the survey.

df1
---------------------------
year   month      regions
2002   january    Pais Vasco
2002   february   Pais Vasco
2003   march      Pais Vasco
2002   november   Florida
2003   december   Florida
...    ...        ...
---------------------------

df2
-----------------------------------------------
year   value1  value2  ...  value12  regions
2002   10      11      ...  9        Pais Vasco
2003   11      11      ...  10       Pais Vasco
2004   12      11      ...  10       Pais Vasco
2002   11      11      ...  9        Florida
2003   10      11      ...  9        Florida
-----------------------------------------------

So in this example, I need my first survey observation to get the corresponding temperature (value1) from the region Pais Vasco and the year 2002. When I tried to merge with

df_merged = pd.merge(df1, df2, how = "left", on =["regions", "year"])

I just get a dataframe with way more observations than my original survey dataframe.
I would convert this data to tidy format. Assuming value1, value2 etc. correspond to value and month, use pd.wide_to_long to turn it into long tidy format, then merge:

tidy = pd.wide_to_long(df, stubnames=['value'], i=['year', 'region'], j='month', sep='') \
         .reset_index()

You need to normalise your months so that they are all in the same format (either all names or all numbers). How you do this is outside the scope of this answer. Then:

df1.merge(tidy, on=['year', 'month', 'region'], how='left', validate='1:1')

If this raises an error, then you have multiple observations for the same ['year', 'month', 'region'] key. Fix that by dropping duplicates; how you do so is almost certainly based heavily on your data.

sobek noticed that you have a typo, saying 'regions' rather than 'region' in your merge command. Make sure you're referring to columns that actually exist.
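As an illustration of that normalisation step (a sketch assuming df1's month column holds lowercase English month names, as in the example, and that wide_to_long has produced integer months 1-12 from the value1...value12 suffixes):

# Map month names to the 1-12 numbers that wide_to_long produces
month_numbers = {name: i for i, name in enumerate(
    ['january', 'february', 'march', 'april', 'may', 'june',
     'july', 'august', 'september', 'october', 'november', 'december'], start=1)}

df1['month'] = df1['month'].str.lower().map(month_numbers)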
Dropping a column in a dataframe based on another column
I have a dataframe called jobs:

position            software  salary  degree     location  industry
architect           autoCAD   400     masters    london    AEC
data analyst        python    500     bachelors  New York  Telecommunications
personal assistant  excel     200     bachelors  London    Media
.....

I have another dataframe called 'preference':

name      value
position  2
software  4
salary    3
degree    1
location  3
industry  1

I'd like to drop columns from the 'jobs' dataframe whose preference value is less than 2, so that I have:

position            software  salary  location
architect           autoCAD   400     london
data analyst        python    500     New York
personal assistant  excel     200     London
.....

This is what I have:

jobs.drop(list(jobs.filter(preference['value'] < 2), axis = 1, inplace = True)

but it doesn't seem to drop the (degree and industry) columns. Any help would be appreciated.
Your attempt is almost there, I think. Here's what I have:

>>> jobs.drop(preference.loc[preference['value'] < 2, 'name'], axis=1, inplace=True)
>>> jobs
             position software  salary  location
0           architect  autoCAD     400    london
1        data analyst   python     500  New York
2  personal assistant    excel     200    London
This should work for you:

jobs.drop(preferences.loc[preferences.value < 2, 'name'], axis=1, inplace=True)

This is why your line of code did not work:
- first of all, there is a closing parenthesis missing (but I guess that's just a typo)
- the filter method should be applied to preferences instead of jobs
- filter is not really what you want to use here to get a list of names: preferences.loc[preferences.value < 2, 'name'] returns a list of all names with value < 2
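An equivalent, non-destructive variant (a sketch assuming the dataframes are named jobs and preference as in the question) is to select the columns to keep rather than dropping in place:

# Keep only the columns whose preference value is at least 2
keep = preference.loc[preference['value'] >= 2, 'name']
jobs_filtered = jobs[keep.tolist()]
print(jobs_filtered)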