I know the question title is a little ambiguous.
My goal is to assign a global key column based on 2 columns + a unique value in my data frame.
For example:
CountryCode  Accident
AFG          Car
AFG          Bike
AFG          Car
AFG          Plane
USA          Car
USA          Bike
UK           Car
Let Car = 01, Bike = 02, Plane = 03
My desired global key format is [Accident][CountryCode][UniqueValue].
The unique value is a running count of rows with the same [Accident][CountryCode].
So if Accident = Car and CountryCode = AFG and it is the first occurrence, the global key would be 01AFG01.
The desired dataframe would look like this:
CountryCode  Accident  GlobalKey
AFG          Car       01AFG01
AFG          Bike      02AFG01
AFG          Car       01AFG02
AFG          Plane     01AFG03
USA          Car       01USA01
USA          Bike      01USA02
UK           Car       01UK01
I have tried running a for loop to append Accident Number and CountryCode together
for example:
globalKey = []
for x in range(len(df)):
    string = df.iloc[x, 1]   # CountryCode column
    string2 = df.iloc[x, 2]  # Accident column
    if string2 == 'Car':
        number = '01'
    elif string2 == 'Bike':
        number = '02'
    elif string2 == 'Plane':
        number = '03'
    # Concatenate the accident number and the CountryCode
    subKey = number + string
    # Append to the list
    globalKey.append(subKey)
This code gives me something like 01AFG, 02AFG based on the values I assign, but I also want to append a unique value that counts the occurrences of each CountryCode and Accident combination.
I am stuck with the code above. I think there should be a better way to do it using the map function in pandas.
Thanks for helping, guys! Much appreciated!
You can use cumcount to achieve this in a number of steps, like this:
In [1]: df = pd.DataFrame({'Country':['AFG','AFG','AFG','AFG','USA','USA','UK'], 'Accident':['Car','Bike','Car','Plane','Car','Bike','Car']})
In [2]: df
Out[2]:
Accident Country
0 Car AFG
1 Bike AFG
2 Car AFG
3 Plane AFG
4 Car USA
5 Bike USA
6 Car UK
## Create a column to keep incremental values for `Country`
In [3]: df['cumcount'] = df.groupby('Country').cumcount()
In [4]: df
Out[4]:
Accident Country cumcount
0 Car AFG 0
1 Bike AFG 1
2 Car AFG 2
3 Plane AFG 3
4 Car USA 0
5 Bike USA 1
6 Car UK 0
## Create a column to keep incremental values for combination of `Country`,`Accident`
In [5]: df['cumcount_type'] = df.groupby(['Country','Accident']).cumcount()
In [6]: df
Out[6]:
Accident Country cumcount cumcount_type
0 Car AFG 0 0
1 Bike AFG 1 0
2 Car AFG 2 1
3 Plane AFG 3 0
4 Car USA 0 0
5 Bike USA 1 0
6 Car UK 0 0
And from that point on you can concatenate the values of cumcount, cumcount_type and Country to achieve what you're after.
Maybe you want to add 1 to each of the values you have under the different counts, depending on whether you want to start counting from 0 or 1.
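For example, here is a minimal sketch of that final concatenation, assuming the Car/Bike/Plane codes from the question and starting the count from 1:

accident_code = df['Accident'].map({'Car': '01', 'Bike': '02', 'Plane': '03'})
# Start the per-(Country, Accident) count at 1 and zero-pad it to two digits
unique_value = (df['cumcount_type'] + 1).astype(str).str.zfill(2)
df['GlobalKey'] = accident_code + df['Country'] + unique_value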
I hope this helps.
First of all, don't use for loops if you can help it. For example, you can do your Accident-to-code mapping with:
df['AccidentCode'] = df['Accident'].map({'Car': '01', 'Bike': '02', 'Plane': '03'})
To get the unique code, Thanos has shown how to do that using GroupBy.cumcount:
df['CA_ID'] = df.groupby(['CountryCode', 'Accident']).cumcount() + 1
And then to put them all together into a unique key:
df['NewKey'] = df['AccidentCode'] + df['CountryCode'] + df['CA_ID'].map('{:0>2}'.format)
which gives:
CountryCode Accident GlobalKey AccidentCode CA_ID NewKey
0 AFG Car 01AFG01 01 1 01AFG01
1 AFG Bike 02AFG01 02 1 02AFG01
2 AFG Car 01AFG02 01 2 01AFG02
3 AFG Plane 01AFG03 03 1 03AFG01
4 USA Car 01USA01 01 1 01USA01
5 USA Bike 01USA02 02 1 02USA01
6 UK Car 01UK01 01 1 01UK01
After you create your subKey, we can sort the dataframe and count the occurrences of the pairs. First let's reset the index (to store the original order):
df = df.reset_index()
Then sort by the subKey and count:
df = df.sort_values(by='subKey').reset_index(drop=True)  # reset labels so ind and ind - 1 refer to adjacent sorted rows
df['newnumber'] = 1
for ind in range(1, len(df)):  # start at 1 because the first row is always 1
    if df.loc[ind, 'subKey'] == df.loc[ind - 1, 'subKey']:
        df.loc[ind, 'newnumber'] = df.loc[ind - 1, 'newnumber'] + 1
Finally create the GlobalKey with the help of the zfill function, then reorder by the original index:
df['GlobalKey'] = df.apply(lambda x: x['subKey'] + str(x['newnumber']).zfill(2), axis=1)
df = df.sort_values(by='index').drop('index', axis=1).reset_index(drop=True)
I don't have any experience with pandas, so this answer may not be what you are looking for. That being said, if your data really is that simple (few countries, few accident types), have you considered keeping a counter for each country|accident combination?
So as you traverse your input, just increment the counter for that country|accident combination, and then read through those counters at the end to produce the GlobalKeys.
If you have other data to store besides the Global Key, then store the country|accident combinations as lists, and read through them at the end one-at-a-time to produce GlobalKeys.
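A rough plain-Python sketch of that counter idea (not pandas; it reuses the Car/Bike/Plane codes from the question and hard-codes the sample rows):

from collections import defaultdict

accident_codes = {'Car': '01', 'Bike': '02', 'Plane': '03'}
counters = defaultdict(int)  # one counter per (accident, country) combination
global_keys = []

rows = [('AFG', 'Car'), ('AFG', 'Bike'), ('AFG', 'Car'), ('AFG', 'Plane'),
        ('USA', 'Car'), ('USA', 'Bike'), ('UK', 'Car')]

for country, accident in rows:
    counters[(accident, country)] += 1
    code = accident_codes[accident]
    # [Accident][CountryCode][UniqueValue], with the count zero-padded to two digits
    global_keys.append(f"{code}{country}{counters[(accident, country)]:02d}")

print(global_keys)  # ['01AFG01', '02AFG01', '01AFG02', '03AFG01', '01USA01', '02USA01', '01UK01']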
Related
In a pandas DataFrame, I want to count how many times the value 1 appears in the Stroke column for each value in the Residence_type column. To count the 1s, I convert the Stroke column to a list, which I think is easier.
So, for example, the value Rural in Residence_type has the value 1 in the Stroke column 300 times, and so on.
The data is something like this:
Residence_type Stroke
0 Rural 1
1 Urban 1
2 Urban 0
3 Rural 1
4 Rural 0
5 Urban 0
6 Urban 0
7 Urban 1
8 Rural 0
9 Rural 1
The code:
grpby_variable = data.groupby('stroke')
grpby_variable['Residence_type'].tolist().count(1)
The final goal is to find the difference between the number of times the value 1 appears for each value in the Residence_type column (Rural or Urban).
Am I doing it right? And what is this error?
I'm not sure I got what you need done. Please try filtering for Stroke==1, then groupby and count:
df.query("Stroke==1").groupby('Residence_type')['Stroke'].agg('count').to_frame('Stroke_Count')
Stroke_Count
Residence_type
Rural 3
Urban 2
You could try the following if you need the difference between the categories:
df1 =df.query("Stroke==1").groupby('Residence_type')['Stroke'].agg('count').to_frame('Stroke_Count')
df1.loc['Diff'] = abs(df1.loc['Rural']-df1.loc['Urban'])
print(df1)
Stroke_Count
Residence_type
Rural 3
Urban 2
Diff 1
Assuming that Stroke only contains 1 or 0, you can do:
result_df = df.groupby('Residence_type').sum()
>>> result_df
Stroke
Residence_type
Rural 3
Urban 2
>>> result_df.Stroke['Rural'] - result_df.Stroke['Urban']
1
I have created a dataframe from a dictionary as follows:
my_dict = {'VehicleType':['Truck','Car','Truck','Car','Car'],'Colour':['Green','Green','Black','Yellow','Green'],'Year':[2002,2014,1975,1987,1987],'Frequency': [0,0,0,0,0]}
df = pd.DataFrame(my_dict)
So my dataframe df currently looks like this:
VehicleType Colour Year Frequency
0 Truck Green 2002 0
1 Car Green 2014 0
2 Truck Black 1975 0
3 Car Yellow 1987 0
4 Car Green 1987 0
I'd like it to look like this:
VehicleType Colour Year Frequency
0 Truck Green 2002 1
1 Car Green 2014 2
2 Truck Black 1975 1
3 Car Yellow 1987 1
4 Car Green 1987 2
i.e., the Frequency column should represent the totals of VehicleType AND Colour combinations (but leaving out the Year column). So in row 4 for example, the 2 in the Frequency column tells you that there are a total of 2 rows with the combination of 'Car' and 'Green'.
This is essentially a 'Count' with 'Group By' calculation, and Pandas provides a way to do the calculation as follows:
grp_by_series = df.groupby(['VehicleType', 'Colour']).size()
grp_by_series
VehicleType Colour
Car Green 2
Yellow 1
Truck Black 1
Green 1
dtype: int64
What I'd like to do next is to extract the calculated group-by values from the pandas Series and put them into the Frequency column of the pandas DataFrame. I've tried various approaches, but without success.
The example I've given is hugely simplified - the dataframes I'm using are derived from genomic data and have hundreds of millions of rows, and will have several frequency columns based on various combinations of other columns, so ideally I need a solution which is fast and scales well.
Thanks for any help!
You are on a good path. You can continue like this:
grp_by_series = grp_by_series.reset_index()
res = df[['VehicleType', 'Colour']].merge(grp_by_series, how='left')
df['Frequency'] = res[0]  # the size() values end up in a column named 0 after reset_index
print(df)
Output:
VehicleType Colour Year Frequency
0 Truck Green 2002 1
1 Car Green 2014 2
2 Truck Black 1975 1
3 Car Yellow 1987 1
4 Car Green 1987 2
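An equivalent variant (just a sketch) names the count column up front, so the final assignment does not depend on the default 0 column name; it drops the placeholder Frequency column before merging to avoid duplicate columns:

counts = df.groupby(['VehicleType', 'Colour']).size().reset_index(name='Frequency')
df = df.drop(columns='Frequency').merge(counts, on=['VehicleType', 'Colour'], how='left')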
I think a .transform() does what you want:
df['Frequency'] = df.groupby(['VehicleType', 'Colour'])['Year'].transform('count')
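If you later need several frequency columns based on different column combinations (as the question mentions), the same pattern just repeats; a small sketch, where the second combination and the new column names are only illustrative:

# Each transform('count') returns the per-group count aligned with the original rows,
# so the results line up with df and no merging is needed.
df['Freq_VehicleColour'] = df.groupby(['VehicleType', 'Colour'])['Year'].transform('count')
df['Freq_VehicleYear'] = df.groupby(['VehicleType', 'Year'])['Colour'].transform('count')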
I have a pandas DataFrame which looks like this:
home_team away_team home_score away_score
Spain Albania 0 5
Albania Spain 4 1
Albania Portugal 1 2
Albania US 0 2
From the first two lines we see that Spain and Albania played 2 times in total: Spain scored 1 goal and Albania scored 9 goals.
Then Albania has 1 game each against the US and Portugal, with their scores. I am trying to answer: how many goals has Albania scored against each country, and how many goals has that country scored against Albania?
So that I would get a DataFrame like this:
Albania Spain 9 1
Albania Portugal 1 2
Albania US 0 2
When I use print(df.groupby(['away_team']).sum() + df.groupby(['home_team']).sum()) I do not get what I want: for some reason some lines are filled with NaNs, and the sums do not appear to be summing correctly.
You can sort both team columns and assign them back, then swap the score values wherever the sorted home_team no longer matches the original, and finally aggregate with sum:
orig = df['home_team'].copy()
df[['home_team','away_team']] = np.sort(df[['home_team','away_team']], axis=1)
m = orig.ne(df['home_team'])
df.loc[m, ['home_score','away_score']] = df.loc[m, ['away_score','home_score']].values
print (df)
home_team away_team home_score away_score
0 Albania Spain 5 0
1 Albania Spain 4 1
2 Albania Portugal 1 2
3 Albania US 0 2
df1 = df.groupby(['home_team', 'away_team'], as_index=False).sum()
print (df1)
home_team away_team home_score away_score
0 Albania Portugal 1 2
1 Albania Spain 9 1
2 Albania US 0 2
1. Sort the home & away team columns alphabetically to generate two new team columns.
2. Add two more columns for score_for & score_against.
3. Group by the two new team columns and sum the two new score columns.
df[['team1', 'team2']] = df[['home_team', 'away_team']].apply(np.sort, axis=1, result_type='expand')
df[['score_for', 'score_against']] = df.apply(
    lambda x: [x.home_score, x.away_score] if x.team1 == x.home_team else [x.away_score, x.home_score],
    axis=1,
    result_type='expand')
df.groupby(['team1', 'team2'])[['score_for', 'score_against']].sum()
score_for score_against
team1 team2
Albania Portugal 1 2
Spain 9 1
US 0 2
Swap home team and away team, as well as the scores. Concatenate the two dataframes together and groupby:
df1 = df.T.reset_index(drop=True)
df2 = df1.rename({0:1, 1:0, 2:3, 3:2}).sort_index()
pd.concat([df1.T, df2.T]).groupby([0,1]).sum().loc[['Albania']]
home = df[df.home_team == "Albania"]
home.columns = ["country","opponent","win","lost"]
away = df[df.away_team == "Albania"]
away.columns = ["opponent","country","lost","win"]
pd.concat([home,away],ignore_index=True).groupby(["country","opponent"]).sum()
I have a dataset structures as below:
index country city Data
0 AU Sydney 23
1 AU Sydney 45
2 AU Unknown 2
3 CA Toronto 56
4 CA Toronto 2
5 CA Ottawa 1
6 CA Unknown 2
I want to replace 'Unknown' in the city column with the most frequent city (the mode) for that country. The result would be:
...
2 AU Sydney 2
...
6 CA Toronto 2
I can get the city modes with:
city_modes = df.groupby('country')['city'].apply(lambda x: x.mode().iloc[0])
And I can replace values with:
df['column']=df.column.replace('Unknown', 'something')
But I can't work out how to combine these so that 'Unknown' is only replaced with the mode city of the corresponding country.
Any ideas?
Use transform to get a Series the same size as the original DataFrame, then set the new values with numpy.where:
city_modes = df.groupby('country')['city'].transform(lambda x: x.mode().iloc[0])
df['column'] = np.where(df['column'] == 'Unknown', city_modes, df['column'])
Or:
df.loc[df['column'] == 'Unknown', 'column'] = city_modes
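A quick end-to-end sketch on the sample data, using the actual column name city rather than the column placeholder:

import numpy as np
import pandas as pd

df = pd.DataFrame({'country': ['AU', 'AU', 'AU', 'CA', 'CA', 'CA', 'CA'],
                   'city': ['Sydney', 'Sydney', 'Unknown', 'Toronto', 'Toronto', 'Ottawa', 'Unknown'],
                   'Data': [23, 45, 2, 56, 2, 1, 2]})

# Mode city per country, broadcast back to every row
city_modes = df.groupby('country')['city'].transform(lambda x: x.mode().iloc[0])

# Only rows where city is 'Unknown' are overwritten
df['city'] = np.where(df['city'] == 'Unknown', city_modes, df['city'])
print(df)  # row 2 becomes Sydney, row 6 becomes Toronto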
I have a dataframe with 2 columns as below:
Index Year Country
0 2015 US
1 2015 US
2 2015 UK
3 2015 Indonesia
4 2015 US
5 2016 India
6 2016 India
7 2016 UK
I want to create a new dataframe containing, for every year, the country with the maximum count.
The new dataframe will contain 3 columns as below:
Index Year Country Count
0 2015 US 3
1 2016 India 2
Is there any function in pandas where this can be done quickly?
One way is to use groupby along with size to count each category, then sort the values and slice by the number of distinct years. You can try the following:
num_year = df['Year'].nunique()
new_df = df.groupby(['Year', 'Country']).size().rename('Count').sort_values(ascending=False).reset_index()[:num_year]
Result:
Year Country Count
0 2015 US 3
1 2016 India 2
Use:
1.
First get the count of each Year and Country pair with groupby and size.
Then get the index of the maximum value with idxmax and select the rows with loc:
df = df.groupby(['Year','Country']).size()
df = df.loc[df.groupby(level=0).idxmax()].reset_index(name='Count')
print (df)
Year Country Count
0 2015 US 3
1 2016 India 2
2.
Use a custom function with value_counts and head:
df = (df.groupby('Year')['Country']
        .apply(lambda x: x.value_counts().head(1))
        .rename_axis(('Year', 'Country'))
        .reset_index(name='Count'))
print (df)
Year Country Count
0 2015 US 3
1 2016 India 2
Just providing a method without groupby (note that head(2) is hard-coded here to the number of distinct years):
Count = (pd.Series(list(zip(df.Year, df.Country))).value_counts()
           .head(2).reset_index(name='Count'))
Count[['Year', 'Country']] = Count['index'].apply(pd.Series)
Count.drop('index', axis=1)
Out[266]:
Count Year Country
0 3 2015 US
1 2 2016 India