In a pandas DataFrame, I want to count how many times the value 1 appears in the stroke column for each value in the Residence_type column. To count the 1s, I convert the stroke column to a list, which I thought would be easier.
So, for example, the value Rural in Residence_type has 300 occurrences of 1 in the stroke column, and so on.
The data is something like this:
Residence_type Stroke
0 Rural 1
1 Urban 1
2 Urban 0
3 Rural 1
4 Rural 0
5 Urban 0
6 Urban 0
7 Urban 1
8 Rural 0
9 Rural 1
The code:
grpby_variable = data.groupby('stroke')
grpby_variable['Residence_type'].tolist().count(1)
The final goal is to find the difference between the number of times the value 1 appears for each value in the Residence_type column (Rural or Urban).
Am I doing it right? What is this error?
I'm not sure I got exactly what you need, but note that a grouped column has no .tolist() method, so your approach won't work as written. Try filtering on Stroke == 1, then groupby and count:
df.query("Stroke==1").groupby('Residence_type')['Stroke'].agg('count').to_frame('Stroke_Count')
Stroke_Count
Residence_type
Rural 3
Urban 2
You could try the following if you need the difference between the two categories:
df1 = df.query("Stroke==1").groupby('Residence_type')['Stroke'].agg('count').to_frame('Stroke_Count')
df1.loc['Diff'] = abs(df1.loc['Rural']-df1.loc['Urban'])
print(df1)
Stroke_Count
Residence_type
Rural 3
Urban 2
Diff 1
Assuming that Stroke only contains 1 or 0, you can do:
result_df = df.groupby('Residence_type').sum()
>>> result_df
Stroke
Residence_type
Rural 3
Urban 2
>>> result_df.Stroke['Rural'] - result_df.Stroke['Urban']
1
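For reference, a minimal self-contained sketch of the whole task on the sample data above, assuming (as in the answer above) that Stroke only contains 0 and 1:

import pandas as pd

df = pd.DataFrame({
    'Residence_type': ['Rural', 'Urban', 'Urban', 'Rural', 'Rural',
                       'Urban', 'Urban', 'Urban', 'Rural', 'Rural'],
    'Stroke': [1, 1, 0, 1, 0, 0, 0, 1, 0, 1],
})

# Summing a 0/1 column per group counts the 1s in each group
counts = df.groupby('Residence_type')['Stroke'].sum()
print(counts)                              # Rural 3, Urban 2
print(counts['Rural'] - counts['Urban'])   # difference: 1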
I have created a dataframe from a dictionary as follows:
my_dict = {'VehicleType':['Truck','Car','Truck','Car','Car'],'Colour':['Green','Green','Black','Yellow','Green'],'Year':[2002,2014,1975,1987,1987],'Frequency': [0,0,0,0,0]}
df = pd.DataFrame(my_dict)
So my dataframe df currently looks like this:
VehicleType Colour Year Frequency
0 Truck Green 2002 0
1 Car Green 2014 0
2 Truck Black 1975 0
3 Car Yellow 1987 0
4 Car Green 1987 0
I'd like it to look like this:
VehicleType Colour Year Frequency
0 Truck Green 2002 1
1 Car Green 2014 2
2 Truck Black 1975 1
3 Car Yellow 1987 1
4 Car Green 1987 2
i.e., the Frequency column should represent the totals of VehicleType AND Colour combinations (but leaving out the Year column). So in row 4 for example, the 2 in the Frequency column tells you that there are a total of 2 rows with the combination of 'Car' and 'Green'.
This is essentially a 'Count' with 'Group By' calculation, and Pandas provides a way to do the calculation as follows:
grp_by_series = df.groupby(['VehicleType', 'Colour']).size()
grp_by_series
VehicleType Colour
Car Green 2
Yellow 1
Truck Black 1
Green 1
dtype: int64
What I'd like to do next is to extract the calculated groupby values from the pandas Series and put them into the Frequency column of the DataFrame. I've tried various approaches but without success.
The example I've given is hugely simplified - the dataframes I'm using are derived from genomic data and have hundreds of millions of rows, and will have several frequency columns based on various combinations of other columns, so ideally I need a solution which is fast and scales well.
Thanks for any help!
You are on a good path. You can continue like this:
grp_by_series = grp_by_series.reset_index()
# After reset_index() the counts end up in a column named 0
res = df[['VehicleType', 'Colour']].merge(grp_by_series, how='left')
df['Frequency'] = res[0]
print(df)
Output:
VehicleType Colour Year Frequency
0 Truck Green 2002 1
1 Car Green 2014 2
2 Truck Black 1975 1
3 Car Yellow 1987 1
4 Car Green 1987 2
I think a .transform() does what you want:
df['Frequency'] = df.groupby(['VehicleType', 'Colour'])['Year'].transform('count')
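For what it's worth, a minimal self-contained sketch of the transform approach on the sample data above; it uses 'size' instead of 'count' (an alternative aggregation that counts rows regardless of NaNs):

import pandas as pd

df = pd.DataFrame({'VehicleType': ['Truck', 'Car', 'Truck', 'Car', 'Car'],
                   'Colour': ['Green', 'Green', 'Black', 'Yellow', 'Green'],
                   'Year': [2002, 2014, 1975, 1987, 1987],
                   'Frequency': [0, 0, 0, 0, 0]})

# 'size' counts the rows in each (VehicleType, Colour) group,
# so the choice of reference column ('Year') does not matter here
df['Frequency'] = df.groupby(['VehicleType', 'Colour'])['Year'].transform('size')
print(df)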
I have the following dataframe (dummy data):
score GDP
country
Bangladesh 6 12
Bolivia 4 10
Nigeria 3 9
Pakistan 2 3
Ghana 1 3
India 1 3
Algeria 1 3
And I want to split it into two groups based on GDP (using the condition GDP less than 9) and sum the score of each group:
sum_score
country
rich 13
poor 5
You can use np.where to make your rich and poor categories, then groupby that category and get the sum:
df['country_cat'] = np.where(df.GDP < 9, 'poor', 'rich')
df.groupby('country_cat')['score'].sum()
country_cat
poor 5
rich 13
You can also do the same in one step, by not creating the extra column for the category (but IMO the code becomes less readable):
df.groupby(np.where(df.GDP < 9, 'poor', 'rich'))['score'].sum()
You can aggregate by the boolean mask and then just rename the index:
a = df.groupby(df.GDP < 9)['score'].sum().rename({True:'poor', False:'rich'})
print (a)
GDP
rich 13
poor 5
Name: score, dtype: int64
Finally, for a one-column DataFrame, add Series.to_frame:
df = a.to_frame('sum_score')
print (df)
sum_score
GDP
rich 13
poor 5
I have a dataframe with 2 columns as below:
Index Year Country
0 2015 US
1 2015 US
2 2015 UK
3 2015 Indonesia
4 2015 US
5 2016 India
6 2016 India
7 2016 UK
I want to create a new dataframe containing, for every year, the country with the maximum count.
The new dataframe will contain 3 columns as below:
Index Year Country Count
0 2015 US 3
1 2016 India 2
Is there any function in pandas where this can be done quickly?
One way is to use groupby along with size to count each (Year, Country) pair, then sort the values and slice by the number of distinct years. You can try the following:
num_year = df['Year'].nunique()
new_df = df.groupby(['Year', 'Country']).size().rename('Count').sort_values(ascending=False).reset_index()[:num_year]
Result:
Year Country Count
0 2015 US 3
1 2016 India 2
Use:
1.
First get the count of each Year and Country pair with groupby and size.
Then get the index of each year's max value with idxmax and select those rows with loc:
df = df.groupby(['Year','Country']).size()
df = df.loc[df.groupby(level=0).idxmax()].reset_index(name='Count')
print (df)
Year Country Count
0 2015 US 3
1 2016 India 2
2.
Use a custom function with value_counts and head:
df = (df.groupby('Year')['Country']
        .apply(lambda x: x.value_counts().head(1))
        .rename_axis(('Year','Country'))
        .reset_index(name='Count'))
print (df)
Year Country Count
0 2015 US 3
1 2016 India 2
Just providing a method without groupby:
Count = (pd.Series(list(zip(df2.Year, df2.Country))).value_counts()
           .head(2).reset_index(name='Count'))
Count[['Year','Country']] = Count['index'].apply(pd.Series)
Count.drop('index', axis=1)
Out[266]:
Count Year Country
0 3 2015 US
1 2 2016 India
I am basically trying to look through a column: if a value is unique, enter 1, but if it isn't, it just becomes NaN. My dataframe looks like this:
Street Number
0 1312 Oak Avenue 1
1 14212 central Ave 2
2 981 franklin way 1
The code I am using to put the number 1 next to unique values is as follows:
df.loc[(df['Street'].unique()), 'Unique'] = '1'
However, when I run this I get the error KeyError: "not in index" and I don't know why. I tried running this on the Number column and I got my desired result, which is:
Street Number Unique
0 1312 Oak Avenue 1 NaN
1 14212 central Ave 2 1
2 981 franklin way 1 1
So the column that specifies which rows are unique is called Unique: it puts a 1 next to rows that are unique and NaN next to ones that are duplicates. In this case there are two 1s in Number, so it notices that and makes the first NaN and gives the second a 1, and since there is only one 2, it gets a 1 as well because it is unique. I just don't know why I am getting that error for the Street column.
That's not really producing your desired result. The output of df['Number'].unique(), array([1, 2], dtype=int64), just happened to be in the index. You'd encounter the same issue on that column if Number instead was [3, 4, 3], say.
For what you're looking for, selecting where values are not duplicated, or what is left after dropping duplicates, might work better than unique:
df.loc[~(df['Number'].duplicated()), 'Unique'] = 1
df
Out[51]:
Street Number Unique
0 1312 Oak Avenue 1 1.0
1 14212 central Ave 2 1.0
2 981 franklin way 1 NaN
df.loc[df['Number'].drop_duplicates(), 'Unique'] = 1
df
Out[63]:
Street Number Unique
0 1312 Oak Avenue 1 NaN
1 14212 central Ave 2 1.0
2 981 franklin way 1 1.0
I know the question name is a little ambiguous.
My goal is to assign a global key column based on 2 columns plus a unique value in my data frame.
For example:
CountryCode | Accident
AFG Car
AFG Bike
AFG Car
AFG Plane
USA Car
USA Bike
UK Car
Let Car = 01, Bike = 02, Plane = 03
My desire global key format is [Accident][CountryCode][UniqueValue]
Unique value is a count of similar [Accident][CountryCode]
so if Accident = Car and CountryCode = AFG and it is the first occurrence, the global key would be 01AFG01
The desired dataframe would look like this:
CountryCode | Accident | GlobalKey
AFG Car 01AFG01
AFG Bike 02AFG01
AFG Car 01AFG02
AFG Plane 01AFG03
USA Car 01USA01
USA Bike 01USA02
UK Car 01UK01
I have tried running a for loop to concatenate the accident number and CountryCode together,
for example:
globalKey = []
for x in range(0, 6):
    string = df.iloc[x, 1]
    string2 = df.iloc[x, 2]
    if string2 == 'Car':
        number = '01'
    elif string2 == 'Bike':
        number = '02'
    elif string2 == 'Plane':
        number = '03'
    # Concat the number of accident and Country Code
    subKey = number + string
    # Append to the list
    globalKey.append(subKey)
This code gives me something like 01AFG, 02AFG based on the value I assign, but I also want to append a unique value by counting the occurrences of each CountryCode and Accident combination.
I am stuck with the code above. I think there should be a better way to do it using the map function in pandas.
Thanks for helping, guys! Much appreciated!
You can try with cumcount to achieve this in a number of steps, like this:
In [1]: df = pd.DataFrame({'Country':['AFG','AFG','AFG','AFG','USA','USA','UK'], 'Accident':['Car','Bike','Car','Plane','Car','Bike','Car']})
In [2]: df
Out[2]:
Accident Country
0 Car AFG
1 Bike AFG
2 Car AFG
3 Plane AFG
4 Car USA
5 Bike USA
6 Car UK
## Create a column to keep incremental values for `Country`
In [3]: df['cumcount'] = df.groupby('Country').cumcount()
In [4]: df
Out[4]:
Accident Country cumcount
0 Car AFG 0
1 Bike AFG 1
2 Car AFG 2
3 Plane AFG 3
4 Car USA 0
5 Bike USA 1
6 Car UK 0
## Create a column to keep incremental values for combination of `Country`,`Accident`
In [5]: df['cumcount_type'] = df.groupby(['Country','Accident']).cumcount()
In [6]: df
Out[6]:
Accident Country cumcount cumcount_type
0 Car AFG 0 0
1 Bike AFG 1 0
2 Car AFG 2 1
3 Plane AFG 3 0
4 Car USA 0 0
5 Bike USA 1 0
6 Car UK 0 0
And from that point on you can concatenate the values of cumcount, cumcount_type and Country to achieve what you're after.
Maybe you want to add 1 to each of the values you have under the different counts, depending on whether you want to start counting from 0 or 1.
I hope this helps.
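For completeness, a minimal sketch of that last concatenation step, building on the cumcount_type column from above (the two-digit accident codes are an assumption taken from the question):

# Two-digit accident codes as stated in the question (assumed mapping)
acc_code = {'Car': '01', 'Bike': '02', 'Plane': '03'}

df['GlobalKey'] = (
    df['Accident'].map(acc_code)                           # e.g. 'Car'  -> '01'
    + df['Country']                                        # e.g. 'AFG'
    + (df['cumcount_type'] + 1).astype(str).str.zfill(2)   # 0-based count -> '01'
)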
First of all, don't use for loops if you can help it. For example, you can do your Accident to code mapping with:
df['AccidentCode'] = df['Accident'].map({'Car': '01', 'Bike': '02', 'Plane': '03'})
To get the unique code, Thanos has shown how to do that using GroupBy.cumcount:
df['CA_ID'] = df.groupby(['CountryCode', 'Accident']).cumcount() + 1
And then to put them all together into a unique key:
df['NewKey'] = df['AccidentCode'] + df['CountryCode'] + df['CA_ID'].map('{:0>2}'.format)
which gives:
CountryCode Accident GlobalKey AccidentCode CA_ID NewKey
0 AFG Car 01AFG01 01 1 01AFG01
1 AFG Bike 02AFG01 02 1 02AFG01
2 AFG Car 01AFG02 01 2 01AFG02
3 AFG Plane 01AFG03 03 1 03AFG01
4 USA Car 01USA01 01 1 01USA01
5 USA Bike 01USA02 02 1 02USA01
6 UK Car 01UK01 01 1 01UK01
After you create your subKey we can sort the dataframe and count the occurrences of the pairs. First let's reset the index (to store the original order):
df = df.reset_index()
Then sort by the subKey and count:
df = df.sort_values(by='subKey').reset_index(drop=True)
df['newnumber'] = 1
for ind in range(1, len(df)):  # start at 1 because the first row is always 1
    if df.loc[ind, 'subKey'] == df.loc[ind - 1, 'subKey']:
        df.loc[ind, 'newnumber'] = df.loc[ind - 1, 'newnumber'] + 1
Finally, create the GlobalKey with the help of the zfill function, then reorder by the original index:
df['GlobalKey'] = df.apply(lambda x: x['subKey'] + str(x['newnumber']).zfill(2), axis=1)
df = df.sort_values(by='index').drop('index', axis=1).reset_index(drop=True)
I don't have any experience with pandas, so this answer may not be what you are looking for. That being said, if your data really is that simple (few countries, few accident types), have you considered keeping a counter for each country|accident combination?
So as you traverse your input, just increment the counter for that country|accident combination, and then read through those counters at the end to produce the GlobalKeys.
If you have other data to store besides the Global Key, then store the country|accident combinations as lists, and read through them at the end one-at-a-time to produce GlobalKeys.
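As a rough sketch of that counter-based idea in plain Python (the column names and accident codes are assumed from the question; the loop itself is only illustrative):

from collections import defaultdict

# Assumed two-digit accident codes from the question
acc_code = {'Car': '01', 'Bike': '02', 'Plane': '03'}

counters = defaultdict(int)   # one running counter per (CountryCode, Accident) pair
global_keys = []

for country, accident in zip(df['CountryCode'], df['Accident']):
    counters[(country, accident)] += 1
    global_keys.append(f"{acc_code[accident]}{country}{counters[(country, accident)]:02d}")

df['GlobalKey'] = global_keys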