Consolidating data based on two conditions

I have four columns of data that I am trying to consolidate based on two conditions. The data are formatted as follows:
CountyName Year Oil Gas
ANDERSON 2010 1358 0
ANDERSON 2010 621746 4996766
ANDERSON 2011 1587 0
ANDERSON 2011 633120 5020877
ANDERSON 2012 55992 387685
ANDERSON 2012 1342 0
ANDERSON 2013 635572 3036578
ANDERSON 2013 4873 0
ANDERSON 2014 656440 2690333
ANDERSON 2014 12332 0
ANDERSON 2015 608454 2836272
ANDERSON 2015 23339 0
ANDERSON 2016 551728 2682261
ANDERSON 2016 12716 0
ANDERSON 2017 132466 567874
ANDERSON 2017 1709 0
ANDREWS 2010 25701725 1860063
ANDREWS 2010 106351 0
ANDREWS 2011 97772 0
ANDREWS 2011 28818329 1377865
ANDREWS 2012 105062 0
...
I'm interested in combining the respective oil and then gas values for entries that are repeated. For example, I'd like to add all the oil entries for Anderson County for the year 2010 and have that value replace the existing entries in just one row. The code I am using now is summing all the values in the respective county regardless of year, giving me a condensed output like this:
CountyName Year Oil Gas
ANDERSON 3954774
ANDREWS 206472698
...
Here's the code I am using:
import csv
with open('Texas.csv', 'r') as Texas: #opening Texas csv file
    TexasReader = csv.reader(Texas)
    counties = {}
    years = {}
    index = 0 and 1
    for row in TexasReader:
        if index == 0 and 1:
            header = row
        else:
            county = row[0]
            year = row[1]
            oil = row[2]
            gas = row[3]
            if county in counties:
                counties[county] += int(oil)
            else:
                counties[county] = int(oil)
        index += 1
with open('TexasConsolidated.csv', 'w') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=header, delimiter=',', lineterminator='\n')
    writer.writeheader()
    for k, v in counties.items():
        writer.writerow({header[0]: k, header[2]: v})

These are the lines causing the behavior you describe:
if county in counties:
    counties[county] += int(oil)
Because the dict is keyed on the county alone, every year's value for that county lands in the same total. If you want a dict that stores sums over two keys, then both values need to be in the dict key.
Add the line
counties_years = {}
then sum like this, using the tuple (county, year) as the key:
if (county, year) in counties_years:
    counties_years[(county, year)] += int(oil)
else:
    counties_years[(county, year)] = int(oil)
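To make the tuple-key idea concrete, here is a minimal sketch that sums both Oil and Gas per (county, year) and writes one row per pair. It assumes the file is comma-separated with the header shown in the question; the in-memory io.StringIO objects stand in for Texas.csv and TexasConsolidated.csv, and the name totals is only illustrative.

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the question; io.StringIO stands in for Texas.csv
# (assumption: the real file is comma-separated with this header).
sample = io.StringIO(
    "CountyName,Year,Oil,Gas\n"
    "ANDERSON,2010,1358,0\n"
    "ANDERSON,2010,621746,4996766\n"
    "ANDERSON,2011,1587,0\n"
    "ANDERSON,2011,633120,5020877\n"
)

# Map (county, year) -> [oil_total, gas_total].
totals = defaultdict(lambda: [0, 0])
reader = csv.DictReader(sample)
for row in reader:
    key = (row["CountyName"], row["Year"])
    totals[key][0] += int(row["Oil"])
    totals[key][1] += int(row["Gas"])

# Write one consolidated row per (county, year) pair.
out = io.StringIO()  # stands in for TexasConsolidated.csv
writer = csv.writer(out, lineterminator="\n")
writer.writerow(reader.fieldnames)
for (county, year), (oil, gas) in totals.items():
    writer.writerow([county, year, oil, gas])

print(out.getvalue())
```

Using csv.DictReader also removes the need for the manual index counter that skips the header row.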

Related

Add a column of repeating numbers to existing dataframe

I have the following dataframe where each row is a unique state-city pair:
State City
NY Albany
NY NYC
MA Boston
MA Cambridge
I want to add a column of years ranging from 2000 to 2018:
State City Year
NY Albany 2000
NY Albany 2001
NY Albany 2002
...
NY Albany 2018
NY NYC 2000
NY NYC 2018
...
MA Cambridge 2018
I know I can create a list of numbers using Year = list(range(2000,2019))
Does anyone know how to put this list as a column in the dataframe for each state-city?

You could try adding it as a list and then performing explode. I think it should work:
df['Year'] = [list(range(2000,2019))] * len(df)
df = df.explode('Year')
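For reference, a self-contained sketch of the explode approach on the four sample rows from the question. The index reset and the cast back to int are additions of mine, since explode repeats the original row labels and leaves an object column.

```python
import pandas as pd

df = pd.DataFrame({
    "State": ["NY", "NY", "MA", "MA"],
    "City": ["Albany", "NYC", "Boston", "Cambridge"],
})

years = list(range(2000, 2019))

# Give every row the full list of years, then expand to one row per year.
df["Year"] = [years] * len(df)
df = df.explode("Year").reset_index(drop=True)

# explode leaves an object-dtype column; convert it back to integers.
df["Year"] = df["Year"].astype(int)

print(df.head())
```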

One way is to use the DataFrame.stack() method.
Here is a sample of your current data:
import numpy as np
import pandas as pd
data = [['NY', 'Albany'],
        ['NY', 'NYC'],
        ['MA', 'Boston'],
        ['MA', 'Cambridge']]
cities = pd.DataFrame(data, columns=['State', 'City'])
print(cities)
# State City
# 0 NY Albany
# 1 NY NYC
# 2 MA Boston
# 3 MA Cambridge
First, make this into a multi-level index (this will end up in the final dataframe):
cities_index = pd.MultiIndex.from_frame(cities)
print(cities_index)
# MultiIndex([('NY', 'Albany'),
# ('NY', 'NYC'),
# ('MA', 'Boston'),
# ('MA', 'Cambridge')],
# names=['State', 'City'])
Now, make a dataframe with all the years in it (I only use 3 years for brevity):
years = list(range(2000, 2003))
n_cities = len(cities)
years_data = np.repeat(years, n_cities).reshape(len(years), n_cities).T
years_data = pd.DataFrame(years_data, index=cities_index)
years_data.columns.name = 'Year index'
print(years_data)
# Year index 0 1 2
# State City
# NY Albany 2000 2001 2002
# NYC 2000 2001 2002
# MA Boston 2000 2001 2002
# Cambridge 2000 2001 2002
Finally, use stack to transform this dataframe into a vertically-stacked series which I think is what you want:
years_by_city = years_data.stack().rename('Year')
print(years_by_city.head())
# State City Year index
# NY Albany 0 2000
# 1 2001
# 2 2002
# NYC 0 2000
# 1 2001
# Name: Year, dtype: int64
If you want to remove the index and have all the values as a dataframe just do
cities_and_years = years_by_city.reset_index()
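As a shorter alternative to the stack route, a cross join produces the same long table directly. This is only a sketch and assumes pandas 1.2 or newer, where merge accepts how="cross"; the years frame is built here purely for illustration.

```python
import pandas as pd

cities = pd.DataFrame({
    "State": ["NY", "NY", "MA", "MA"],
    "City": ["Albany", "NYC", "Boston", "Cambridge"],
})

# One-column frame holding the years to attach (3 years for brevity).
years = pd.DataFrame({"Year": range(2000, 2003)})

# A cross join pairs every city row with every year row.
cities_and_years = cities.merge(years, how="cross")

print(cities_and_years)
```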

I have country, start and end year for all baseball players. I need to know how many players per country played each year

I have a dataset with 20,000 players. Columns are birthCountry, debut_year and final_year.
birthCountry debut_year final_year
0 USA 2004 2015
1 USA 1954 1976
2 USA 1962 1971
3 USA 1977 1990
4 USA 2001 2006
I need to get a table as follows:
1980 1981 1982
USA 50 49 48
CANADA XX XX XX
MEXICO XX XX XX
...
Where each cell represents the number of players that were born in a particular country, that played during that year.
I created a nested list, containing all years that each player played. The length of this list is the same as the length of the df. In the df, I created one additional column per year and I tried to add 1 for each player/year combination.
The idea was to use this to create a groupby or pivot_table
# create a list of years
years = list(range(min(df['debut_year'].values),max(df['final_year'].values)+1))
# create a list of countries
countries = df.birthCountry.unique()
# add columns for years
for n in range(1841,2019): #years are from 1841 to 2018
    df[n] = ''
# now I have one additional column for every year. A lot of new empty columns
# temporary lists
templist = list(range(0,len(df)))
# every element of the following list contains all the years each player played
templist2 = []
for i in templist:
    templist2.append(list(range(int(df.iloc[i,1]),int(df.iloc[i,2]))))
# add 1 if the player played that year
for i in range(len(df)):
    for j in templist2[i]:
        df.iloc[i][j] = 1
It runs for some time, but nothing changes in the original dataframe.
There is probably a better, more elegant solution.

To limit the size of the example, I created the following source DataFrame:
df = pd.DataFrame(data=[[ 1, 'USA', 1974, 1978 ], [ 2, 'USA', 1976, 1981 ],
[ 3, 'USA', 1975, 1979 ], [ 4, 'USA', 1977, 1980 ],
[ 5, 'Mex', 1976, 1979 ], [ 6, 'Mex', 1978, 1980 ]],
columns=['Id', 'birthCountry', 'debut_year', 'final_year'])
The first step of the actual computation is to create a Series containing the
years in which each player was active:
years = df.apply(lambda row: pd.Series(range(row.debut_year,
row.final_year + 1)), axis=1).stack().astype(int).rename('year')
The second step is to create an auxiliary DataFrame - a join of
df.birthCountry and years:
df2 = df[['birthCountry']].join(years.reset_index(level=1, drop=True))
And the last step is to produce the actual result:
df2.groupby(['birthCountry', 'year']).size().rename('Count')\
    .unstack(fill_value=0)
For the above test data, the result is:
year 1974 1975 1976 1977 1978 1979 1980 1981
birthCountry
Mex 0 0 1 1 2 2 1 0
USA 1 2 3 4 4 3 2 1
I think my solution is more "pandasonic" than the one proposed earlier
by Remy.
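The same table can also be sketched with explode and pd.crosstab, which avoids the row-wise apply. This is an alternative formulation rather than the code above; the names active and counts are illustrative, and the test data is the same six players.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, "USA", 1974, 1978], [2, "USA", 1976, 1981],
          [3, "USA", 1975, 1979], [4, "USA", 1977, 1980],
          [5, "Mex", 1976, 1979], [6, "Mex", 1978, 1980]],
    columns=["Id", "birthCountry", "debut_year", "final_year"],
)

# One list of active years per player (final year inclusive),
# expanded to one row per player and year.
active = df.assign(
    year=[list(range(d, f + 1))
          for d, f in zip(df["debut_year"], df["final_year"])]
).explode("year")
active["year"] = active["year"].astype(int)

# Count players per country and year; absent combinations become 0.
counts = pd.crosstab(active["birthCountry"], active["year"])

print(counts)
```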

I was able to come up with the following solution if I understand the structure of your df variable correctly. I made a dictionary list (using a smaller range of years) with the same structure for my example:
df = [{'birthCountry': 'USA', 'debut_year': 2012, 'final_year': 2016},
      {'birthCountry': 'CANADA', 'debut_year': 2010, 'final_year': 2016},
      {'birthCountry': 'USA', 'debut_year': 2012, 'final_year': 2017},
      {'birthCountry': 'CANADA', 'debut_year': 2012, 'final_year': 2017},
      {'birthCountry': 'MEXICO', 'debut_year': 2012, 'final_year': 2016}]
countries = {}
for field in df:
    # start every country with a zero count for each year
    if field['birthCountry'] not in countries:
        countries[field['birthCountry']] = {year: 0 for year in range(2010, 2019)}
    # + 1 so that the final year itself is counted
    for year in range(field['debut_year'], field['final_year'] + 1):
        countries[field['birthCountry']][year] += 1

pandas groupby include a column in final result

cast year revenue title
id
135397 Chris Pratt 2015 1.392446e+09 Jurassic World
135397 Bryce Dallas Howard 2015 1.392446e+09 Jurassic World
135397 Irrfan Khan 2015 1.392446e+09 Jurassic World
135397 Nick Robinson 2015 1.392446e+09 Jurassic World
Given the above DataFrame, I would like to find the highest earning actors per year (based on the combined revenue of movies they acted in that year). This is what I have so far :
#get the total revenue associated with each cast for each year
f ={'revenue':sum}
#revenue by year for each cast
df_actor_yr = df_actor_yr.groupby(['year', 'cast']).agg(f)
df_actor_yr
year cast
1960 Anthony Perkins 2.359350e+08
Charles Laughton 4.423780e+08
Fred MacMurray 1.843242e+08
Jack Kruschen 1.843242e+08
Jean Simmons 4.423780e+08
John Gavin 2.359350e+08
Kirk Douglas 4.423780e+08
Vera Miles 2.359350e+08
1961 Anthony Quayle 2.108215e+08
Anthony Quinn 2.108215e+08
Ben Wright 1.574815e+09
Betty Lou Gerson 1.574815e+09
...
Next to get the highest earning cast member for each year I did the following
df_actor_yr.reset_index(inplace=True)
g ={"revenue" : max }
df_actor_yr = df_actor_yr.groupby('year').agg(g)
df_actor_yr
revenue
year
1960 4.423780e+08
1961 1.574815e+09
1962 5.045914e+08
1963 5.617734e+08
1964 8.780804e+08
1965 1.129535e+09
1967 1.345551e+09
1968 4.187094e+08
1969 6.081511e+08
...
This only gives me the year and the maximum revenue for that year. I would also like to get the name of the cast member associated with that revenue. How do I go about doing this?

You can split your logic into 2 steps. First sum by cast and year using GroupBy + sum. Then find the maximum revenue by year using GroupBy + idxmax:
# sum by cast and year
df_summed = df.groupby(['cast', 'year'])['revenue'].sum().reset_index()
# maximums by year
res = df_summed.loc[df_summed.groupby('year')['revenue'].idxmax()]
print(res)
cast year revenue
3 NickRobinson 2012 3.401340e+09
0 BryceDallasHoward 2015 1.568978e+09
For the above output, I've used more interesting data:
id cast year revenue title
135397 ChrisPratt 2015 1.392446e+09 JurassicWorld
135397 BryceDallasHoward 2015 1.568978e+09 SomeMovie
135397 IrrfanKhan 2012 1.392446e+09 JurassicWorld
135397 NickRobinson 2012 1.046987e+09 JurassicWorld
135398 NickRobinson 2012 2.354353e+09 SomeOtherMovie
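Putting the sample rows and the two steps together gives a runnable sketch. It also shows a sort-and-deduplicate variant for comparison; that variant is my addition, and the two may pick different rows when several actors tie for the maximum in a year.

```python
import pandas as pd

# The "more interesting" sample rows from above (id and title omitted).
df = pd.DataFrame({
    "cast": ["ChrisPratt", "BryceDallasHoward", "IrrfanKhan",
             "NickRobinson", "NickRobinson"],
    "year": [2015, 2015, 2012, 2012, 2012],
    "revenue": [1.392446e9, 1.568978e9, 1.392446e9, 1.046987e9, 2.354353e9],
})

# Step 1: total revenue per cast member and year.
df_summed = df.groupby(["cast", "year"], as_index=False)["revenue"].sum()

# Step 2 (as in the answer): row label of the maximum within each year.
by_idxmax = df_summed.loc[df_summed.groupby("year")["revenue"].idxmax()]

# Variant: sort by revenue, then keep the first (largest) row per year.
by_sort = (df_summed.sort_values("revenue", ascending=False)
                    .drop_duplicates("year")
                    .sort_values("year"))

print(by_idxmax)
```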

Transpose subset of pandas dataframe into multi-indexed data frame

I have the following dataframe:
df.head(14)
I'd like to transpose just the yr and the ['WA_','BA_','IA_','AA_','NA_','TOM_']
variables by Label. The resulting dataframe should then be a Multi-indexed frame with Label and the WA_, BA_, etc. and the columns names will be 2010, 2011, etc. I've tried,
transpose(), groupby(), pivot_table(), long_to_wide(),
and before I roll my own nested loop going line by line through this df I thought I'd ping the community. Something like this by every Label group:
I feel like the answer is in one of those functions but I'm just missing it. Thanks for your help!

From what I can tell by your illustrated screenshots, you want WA_, BA_ etc as rows and yr as columns, with Label remaining as a row index. If so, consider stack() and unstack():
# sample data
import numpy as np
import pandas as pd
labels = ["Albany County", "Big Horn County"]
n_per_label = 7
n_rows = n_per_label * len(labels)
years = np.arange(2010, 2017)
min_val = 10000
max_val = 40000
data = {"Label": sorted(np.array(labels * n_per_label)),
        "WA_": np.random.randint(min_val, max_val, n_rows),
        "BA_": np.random.randint(min_val, max_val, n_rows),
        "IA_": np.random.randint(min_val, max_val, n_rows),
        "AA_": np.random.randint(min_val, max_val, n_rows),
        "NA_": np.random.randint(min_val, max_val, n_rows),
        "TOM_": np.random.randint(min_val, max_val, n_rows),
        "yr": np.append(years, years)
        }
df = pd.DataFrame(data)
AA_ BA_ IA_ NA_ TOM_ WA_ Label yr
0 27757 23138 10476 20047 34015 12457 Albany County 2010
1 37135 30525 12296 22809 27235 29045 Albany County 2011
2 11017 16448 17955 33310 11956 19070 Albany County 2012
3 24406 21758 15538 32746 38139 39553 Albany County 2013
4 29874 33105 23106 30216 30176 13380 Albany County 2014
5 24409 27454 14510 34497 10326 29278 Albany County 2015
6 31787 11301 39259 12081 31513 13820 Albany County 2016
7 17119 20961 21526 37450 14937 11516 Big Horn County 2010
8 13663 33901 12420 27700 30409 26235 Big Horn County 2011
9 37861 39864 29512 24270 15853 29813 Big Horn County 2012
10 29095 27760 12304 29987 31481 39632 Big Horn County 2013
11 26966 39095 39031 26582 22851 18194 Big Horn County 2014
12 28216 33354 35498 23514 23879 17983 Big Horn County 2015
13 25440 28405 23847 26475 20780 29692 Big Horn County 2016
Now set Label and yr as indices.
df.set_index(["Label","yr"], inplace=True)
From here, unstack() will pivot the inner-most index to columns. Then, stack() can swing our value columns down into rows.
df.unstack().stack(level=0)
yr 2010 2011 2012 2013 2014 2015 2016
Label
Albany County AA_ 27757 37135 11017 24406 29874 24409 31787
BA_ 23138 30525 16448 21758 33105 27454 11301
IA_ 10476 12296 17955 15538 23106 14510 39259
NA_ 20047 22809 33310 32746 30216 34497 12081
TOM_ 34015 27235 11956 38139 30176 10326 31513
WA_ 12457 29045 19070 39553 13380 29278 13820
Big Horn County AA_ 17119 13663 37861 29095 26966 28216 25440
BA_ 20961 33901 39864 27760 39095 33354 28405
IA_ 21526 12420 29512 12304 39031 35498 23847
NA_ 37450 27700 24270 29987 26582 23514 26475
TOM_ 14937 30409 15853 31481 22851 23879 20780
WA_ 11516 26235 29813 39632 18194 17983 29692
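Because the sample above is random, the reshaping can be hard to trace by hand. Here is the same unstack/stack sequence on a tiny fixed frame; the values 1 to 8 are made up for illustration, and only two of the value columns are kept.

```python
import pandas as pd

# Hypothetical toy data: two labels, two years, two value columns.
df = pd.DataFrame({
    "Label": ["Albany County", "Albany County",
              "Big Horn County", "Big Horn County"],
    "yr": [2010, 2011, 2010, 2011],
    "WA_": [1, 2, 3, 4],
    "BA_": [5, 6, 7, 8],
})

# unstack() moves yr into the columns; stack(level=0) then moves the
# value-column names (WA_, BA_) down into the row index.
wide = df.set_index(["Label", "yr"]).unstack().stack(level=0)

print(wide)
```

Each (Label, variable) pair becomes one row, so wide.loc[("Albany County", "WA_"), 2011] looks up a single cell.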

Find key from value for Pandas Series

I have a dictionary whose values are in a pandas series. I want to make a new series that will look up a value in a series and return a new series with associated key. Example:
import pandas as pd
df = pd.DataFrame({'season' : ['Nor 2014', 'Nor 2013', 'Nor 2013', 'Norv 2013',
'Swe 2014', 'Swe 2014', 'Swe 2013',
'Swe 2013', 'Sven 2013', 'Sven 2013', 'Norv 2014']})
nmdict = {'Norway' : [s for s in list(set(df.season)) if 'No' in s],
'Sweden' : [s for s in list(set(df.season)) if 'S' in s]}
Desired result with df['country'] as the new column name:
season country
0 Nor 2014 Norway
1 Nor 2013 Norway
2 Nor 2013 Norway
3 Norv 2013 Norway
4 Swe 2014 Sweden
5 Swe 2014 Sweden
6 Swe 2013 Sweden
7 Swe 2013 Sweden
8 Sven 2013 Sweden
9 Sven 2013 Sweden
10 Norv 2014 Norway
Due to the nature of my data I must manually make the nmdict as shown. I've tried this but couldn't reverse my nmdict, as the arrays are not the same length.
More importantly, I think my approach may be wrong. I'm coming from Excel and thinking of a vlookup solution, but according to this answer, I shouldn't be using the dictionary in this way.
Any answers appreciated.

I've done it in a verbose manner to allow you to follow through.
First, let's define a function that determines the value 'country'
In [4]: def get_country(s):
   ...:     if 'Nor' in s:
   ...:         return 'Norway'
   ...:     if 'S' in s:
   ...:         return 'Sweden'
   ...:     # return 'Default Country' # if you get unmatched values
In [5]: get_country('Sven')
Out[5]: 'Sweden'
In [6]: get_country('Norv')
Out[6]: 'Norway'
We can use map to run get_country on every row (in Python 3, wrap it in list(), because map returns a lazy iterator). A pandas Series also has an apply() method, which works similarly*.
In [7]: list(map(get_country, df['season']))
Out[7]:
['Norway',
'Norway',
'Norway',
'Norway',
'Sweden',
'Sweden',
'Sweden',
'Sweden',
'Sweden',
'Sweden',
'Norway']
Now we assign that result to the column called 'country'
In [8]: df['country'] = list(map(get_country, df['season']))
Let's view the final result:
In [9]: df
Out[9]:
season country
0 Nor 2014 Norway
1 Nor 2013 Norway
2 Nor 2013 Norway
3 Norv 2013 Norway
4 Swe 2014 Sweden
5 Swe 2014 Sweden
6 Swe 2013 Sweden
7 Swe 2013 Sweden
8 Sven 2013 Sweden
9 Sven 2013 Sweden
10 Norv 2014 Norway
*With apply() here's how it would look:
In [16]: df['country'] = df['season'].apply(get_country)
In [17]: df
Out[17]:
season country
0 Nor 2014 Norway
1 Nor 2013 Norway
2 Nor 2013 Norway
3 Norv 2013 Norway
4 Swe 2014 Sweden
5 Swe 2014 Sweden
6 Swe 2013 Sweden
7 Swe 2013 Sweden
8 Sven 2013 Sweden
9 Sven 2013 Sweden
10 Norv 2014 Norway
A more scalable country matcher
pseudo-code only :)
# Modify this as needed
country_matchers = {
    'Norway': ['Nor', 'Norv'],
    'Sweden': ['S', 'Swed'],
}
def get_country(s):
    """
    Run the passed string s against "matchers" for each country
    Return the first matched country
    """
    for country, matchers in country_matchers.items():
        for matcher in matchers:
            if matcher in s:
                return country

IIUC, I would do the following:
df['country'] = df['season'].apply(lambda x: 'Norway' if 'No' in x else 'Sweden' if 'S' in x else x)

You could create the country dictionary using a dictionary comprehension:
country_id = df.season.str.split().str.get(0).drop_duplicates()
country_dict = {c: ('Norway' if c.startswith('N') else 'Sweden') for c in country_id.values}
to get:
{'Nor': 'Norway', 'Swe': 'Sweden', 'Sven': 'Sweden', 'Norv': 'Norway'}
This works fine for two countries; otherwise you can apply a self-defined function in a similar way:
def country_dict(country_id):
    if country_id.startswith('S'):
        return 'Sweden'
    elif country_id.startswith('N'):
        return 'Norway'
    elif country_id.startswith('XX'):
        return ...
    else:
        return 'default'
Either way, map the dictionary to the country_id part of the season column, extracted using pandas string methods:
df['country'] = df.season.str.split().str.get(0).map(country_dict)
season country
0 Nor 2014 Norway
1 Nor 2013 Norway
2 Nor 2013 Norway
3 Norv 2013 Norway
4 Swe 2014 Sweden
5 Swe 2014 Sweden
6 Swe 2013 Sweden
7 Swe 2013 Sweden
8 Sven 2013 Sweden
9 Sven 2013 Sweden
10 Norv 2014 Norway
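If you would rather keep the hand-built nmdict from the question, it can be inverted even though its lists differ in length: build a flat mapping from each season string back to its country key, then pass that to map. A minimal sketch, where season_to_country is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({"season": ["Nor 2014", "Nor 2013", "Nor 2013", "Norv 2013",
                              "Swe 2014", "Swe 2014", "Swe 2013",
                              "Swe 2013", "Sven 2013", "Sven 2013", "Norv 2014"]})

# The manually built dictionary from the question: country -> list of seasons.
nmdict = {"Norway": [s for s in set(df.season) if "No" in s],
          "Sweden": [s for s in set(df.season) if "S" in s]}

# Invert it: every season string points back to its country key.
season_to_country = {season: country
                     for country, seasons in nmdict.items()
                     for season in seasons}

# Series.map looks each season up in the inverted dictionary.
df["country"] = df["season"].map(season_to_country)

print(df)
```

Seasons missing from nmdict come out as NaN, which makes unmatched values easy to spot.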
