I currently have a massive dataset with a large number of rows, and I wanted to create a smaller dataframe that pulls only 2 columns from the larger one, plus how many times each name occurred in that chapter (here called 'Occurrence').
The code below is what I am using:
df1 = (Dec16.groupby(["BNF Chapter", "Name"]).size().reset_index(name="Occurrence"))
df1
It outputs this:
BNF Chapter Name Occurrence
1 Aluminium hydroxide 2
1 Aluminium hydroxide + Magnesium trisilicate 2
1 Alverine 702
.......
21 Polihexanide 2
21 Potassium hydroxide 32
21 Sesame oil 22
21 Sodium chloride 222
What I would like to get is the top 10 most frequently occurring names for a given chapter, as the dataset is so large.
For example, a dataframe that pulls only the top 10 most common names in chapter 1.
How would I go about doing this?
Many thanks!!!
You can use pandas.DataFrame.count for this.
This question, Count Values In Pandas Dataframe, can hopefully help you out.
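If it helps, here is a minimal sketch of one way to do the last step; it assumes df1 has been built exactly as in the question, and chapter 1 is just an example:

# Top 10 most common names in chapter 1, using the df1 built above
top10_ch1 = (df1[df1["BNF Chapter"] == 1]
             .nlargest(10, "Occurrence")
             .reset_index(drop=True))

# Or the top 10 for every chapter at once
top10_per_chapter = (df1.sort_values("Occurrence", ascending=False)
                        .groupby("BNF Chapter")
                        .head(10))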
I am trying to convert a dataframe where each row is a specific event and each column has information about that event. I want to turn this into data in which each row is a country and year, with information about the number and characteristics of the events in that year.
In this data set, each event is an occurrence of terrorism, and I want to count the number of events where the "target" is a government building. One of the columns is called "targettype" or "targettype_txt", and there are 5 different entries in this column that I want to count (government building, military, police, diplomatic building, etc.). The target type is also coded as a number if that is easier (i.e. there is another column where a government building is 2, a military installation is 4, etc.).
FYI: this data set has 16 countries in West Africa and looks at the years 2000-2020, with a total of roughly 8000 events recorded. The data comes from the Global Terrorism Database, and this is for a thesis/independent research project (i.e. not a graded class assignment).
Right now my data looks like this (there are a ton of other columns but they aren't important for this):
eventID     iyear  country_txt  nkill  nwounded  nhostages  targettype_txt
10000102    2000   Nigeria      3      10        0          government building
10000103    2000   Mali         1      3         15         military installation
10000103    2000   Nigeria      15     0         0          government building
10000103    2001   Benin        1      0         0          police
10000103    2001   Nigeria      1      3         15         private business
...
And I would like it to look like this:
country_txt  iyear  total_nkill  total_nwounded  total_nhostages  total_public_target
Nigeria      2000   200          300             300              15
Nigeria      2001   250          450             15               17
I was able to get the totals for nkill, nwounded, and nhostages using this super simple line:
df2 = cdf.groupby(['country', 'country_txt', 'iyear'])[['nkill', 'nwound', 'nhostkid']].sum()
But this is a little different because I want to only count certain entries and sum up the total number of times they occur. Any thoughts or suggestions are really appreciated!
Try:
cdf['CountCondition'] = ((cdf['targettype_txt'] == 'government building') |
                         (cdf['targettype_txt'] == 'military installation') |
                         (cdf['targettype_txt'] == 'police'))
df2 = cdf[cdf['CountCondition']].groupby(['country','country_txt', 'iyear', 'CountCondition']).count()
You create a new column 'CountCondition' which just marks True or False depending on whether the condition in the statement holds. Then you just count the number of times CountCondition is True. Hope this makes sense.
It is possible to combine all of this into one statement and NOT create an additional column, but the statement gets quite convoluted and more difficult to understand:
df2 = cdf[(cdf['targettype_txt'] == 'government building') |
          (cdf['targettype_txt'] == 'military installation') |
          (cdf['targettype_txt'] == 'police')].groupby(['country', 'country_txt', 'iyear']).count()
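If it helps, here is a minimal sketch of one way to get the casualty sums and the conditional target count in a single aggregation; the column names follow the question, and public_target is just a hypothetical name for the indicator column:

# Mark the target types of interest, then sum casualties and count those targets together
public_types = ['government building', 'military installation', 'police']
cdf['public_target'] = cdf['targettype_txt'].isin(public_types)

df2 = (cdf.groupby(['country', 'country_txt', 'iyear'])
          .agg(total_nkill=('nkill', 'sum'),
               total_nwounded=('nwound', 'sum'),
               total_nhostages=('nhostkid', 'sum'),
               total_public_target=('public_target', 'sum'))  # True counts as 1
          .reset_index())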
I have data like the below in a pandas dataframe, but there are over 500 columns, and for columns 2-500+ I need to divide by 100 only the rows where the value in column 0 is 'dog'.
0 1 2 3
cat 2019 19.80 96.28
cat 2022 19.50 66.80
dog 2022 21.10 57.70
dog 2021 21.50 42.85
The expected output is below:
0 1 2 3
cat 2019 19.80 96.28
cat 2022 19.50 66.80
dog 2022 0.211 0.577
dog 2021 0.215 0.4285
I have the following code, which works to divide those specific rows and columns by 100, but it removes columns 0 and 1 and any rows that aren't 'dog'.
df = df[df[0].str.contains("dog")].loc[:, 2:len(df.columns) - 1].div(100)
How do I keep the full dataframe and apply this division on those specific rows and columns?
Hate it when people post the correct answer in the comments, so I'm just posting the same thing as a real answer. Small credit to #mozway, because otherwise I would've needed to rummage in my brain for how to do it.
df.loc[df[0].str.contains("dog"), 2:] /= 100
Have a good day.
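For completeness, here is a small self-contained sketch of how that one-liner behaves on the sample data from the question (the integer column labels 0-3 are assumed, as in the post):

import pandas as pd

# Reproduce the sample data with integer column labels, as in the question
df = pd.DataFrame(
    [["cat", 2019, 19.80, 96.28],
     ["cat", 2022, 19.50, 66.80],
     ["dog", 2022, 21.10, 57.70],
     ["dog", 2021, 21.50, 42.85]],
    columns=[0, 1, 2, 3],
)

# Divide columns 2 onward by 100, but only on the 'dog' rows
df.loc[df[0].str.contains("dog"), 2:] /= 100
print(df)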
I have a Pandas DataFrame with some categorical data in one of the columns. On doing value_counts on that particular column, I get something similar to:
HR 176
Coding 81
Reject 74
Database Administration 21
Finance 17
Project Management 16
Sales 15
DevOps 13
Core Electronics 10
Networking 10
Medical Science 9
Core Mechanical 8
Web Development 4
Puzzles 3
behavioural 3
not a question 2
civil engineering 1
Mathematics 1
Finance, Medical Science 1
Sales, HR 1
What I'd like to do is keep only the categories with a count >= some threshold (e.g. 10). All the smaller categories should get clubbed into a separate "Other" category, i.e. the result should look like:
HR 176
Coding 81
Reject 74
*Other* 33
Database Administration 21
Finance 17
Project Management 16
Sales 15
DevOps 13
Core Electronics 10
Networking 10
I've done this in the past by hacking together a defaultdict(int) and only taking the instances where count >= threshold. I want to know if there is a Pandas canonical way of achieving the same.
I would use a mask to perform boolean indexing and concat:
m = s >= 10
out = (pd.concat([s[m], pd.Series(s[~m].sum(), index=['Others'])])
         .sort_values(ascending=False)
      )
output:
HR 176
Coding 81
Reject 74
Others 33
Database Administration 21
Finance 17
Project Management 16
Sales 15
DevOps 13
Core Electronics 10
Networking 10
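For reference, s here is assumed to be the value_counts Series from the question, e.g.:

# 'category' is a hypothetical column name for the categorical data
s = df['category'].value_counts()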
Is this the answer you're looking for:
Pandas: Selecting rows based on value counts of a particular column
Otherwise, maybe this is what you want:
data = pd.DataFrame([["researcher", 150],
                     ["politician", 15],
                     ["builder", 1],
                     ["teacher", 5]])
data.columns = ["category", "count"]
filter_value = 10
d1 = data[data['count'] >= filter_value].copy()
d2 = data[data['count'] < filter_value].copy()
d1["tag"] = "filter_passed"
d2["tag"] = "Others"
data = pd.concat([d1, d2])
>>> data
category count tag
0 researcher 150 filter_passed
1 politician 15 filter_passed
2 builder 1 Others
3 teacher 5 Others
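If the goal from the question is a single clubbed count rather than a tag per row, one possible follow-up step (a sketch, reusing d1 and d2 from above) is:

# Collapse everything tagged "Others" into one summed row
others_total = d2['count'].sum()
clubbed = pd.concat([d1[['category', 'count']],
                     pd.DataFrame([{'category': 'Others', 'count': others_total}])])
print(clubbed)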
I have the following pandas dataframe with 50,000 unique rows and 20 columns (included is a snippet of the relevant columns):
df1:
PRODUCT_ID PRODUCT_DESCRIPTION
0 165985858958 "Fish Burger with Lettuce"
1 185965653252 "Chicken Salad with Dressing"
2 165958565556 "Pork and Honey Rissoles"
3 655262522233 "Cheese, Ham and Tomato Sandwich"
4 857485966653 "Coleslaw with Yoghurt Dressing"
5 524156285551 "Lemon and Raspberry Cheesecake"
I also have the following dataframe (which I also have saved in dictionary form) which has 2 columns and 20,000 unique rows:
df2 (also saved as dict_2)
PROD_ID PROD_DESCRIPTION
0 548576 "Fish Burger"
1 156956 "Chckn Salad w/Ranch Dressing"
2 257848 "Rissoles - Lamb & Rosemary"
3 298770 "Lemn C-cake"
4 651452 "Potato Salad with Bacon"
5 100256 "Cheese Cake - Lemon Raspberry Coulis"
What I want to do is compare the "PRODUCT_DESCRIPTION" field in df1 to the "PROD_DESCRIPTION" field in df2 and find the closest match/matches to help with the heavy-lifting part. I would then need to manually check the matches, but it would be a lot quicker. The ideal outcome would look like this, e.g. with one or more part matches noted:
PRODUCT_ID PRODUCT_DESCRIPTION PROD_ID PROD_DESCRIPTION
0 165985858958 "Fish Burger with Lettuce" 548576 "Fish Burger"
1 185965653252 "Chicken Salad with Dressing" 156956 "Chckn Salad w/Ranch Dressing"
2 165958565556 "Pork and Honey Rissoles" 257848 "Rissoles - Lamb & Rosemary"
3 655262522233 "Cheese, Ham and Tomato Sandwich" NaN NaN
4 857485966653 "Coleslaw with Yoghurt Dressing" NaN NaN
5 524156285551 "Lemon and Raspberry Cheesecake" 298770 "Lemn C-cake"
6 524156285551 "Lemon and Raspberry Cheesecake" 100256 "Cheese Cake - Lemon Raspberry Coulis"
I have already completed a join which has identified the exact matches. It's not important that the index is retained, as the Product IDs in each df are unique. The results can also be saved into a new dataframe, as this will then be applied to a third dataframe that has around 14 million rows.
I've used the following questions and answers (amongst others):
Is it possible to do fuzzy match merge with python pandas
Fuzzy merge match with duplicates including trying jellyfish module as suggested in one of the answers
Python fuzzy matching fuzzywuzzy keep only the best match
Fuzzy match items in a column of an array
and also various loops/functions/mapping etc. but have had no success, either getting the first "fuzzy match" which has a low score or no matches being detected.
I like the idea of a matching/distance score column being generated as per here as it would then allow me to speed up the manual checking process.
I'm using Python 2.7, pandas and have fuzzywuzzy installed.
Using fuzz.ratio as my distance metric, I calculate my distance matrix like this:
import numpy as np                      # used below for np.where
from fuzzywuzzy import fuzz

df3 = pd.DataFrame(index=df.index, columns=df2.index)

for i in df3.index:
    for j in df3.columns:
        vi = df.get_value(i, 'PRODUCT_DESCRIPTION')
        vj = df2.get_value(j, 'PROD_DESCRIPTION')
        df3.set_value(i, j, fuzz.ratio(vi, vj))

print(df3)
0 1 2 3 4 5
0 63 15 24 23 34 27
1 26 84 19 21 52 32
2 18 31 33 12 35 34
3 10 31 35 10 41 42
4 29 52 32 10 42 12
5 15 28 21 49 8 55
Set a threshold for an acceptable distance. I set it at 50.
Find the index value (for df2) that has the maximum value for every row.
threshold = df3.max(1) > 50
idxmax = df3.idxmax(1)
Make assignments
df['PROD_ID'] = np.where(threshold, df2.loc[idxmax, 'PROD_ID'].values, np.nan)
df['PROD_DESCRIPTION'] = np.where(threshold, df2.loc[idxmax, 'PROD_DESCRIPTION'].values, np.nan)
df
You should be able to iterate over both dataframes and populate either a dict or a 3rd dataframe with your desired information:
d = {
    'df1_id': [],
    'df1_prod_desc': [],
    'df2_id': [],
    'df2_prod_desc': [],
    'fuzzywuzzy_sim': []
}

for _, df1_row in df1.iterrows():
    for _, df2_row in df2.iterrows():
        d['df1_id'].append(df1_row['PRODUCT_ID'])
        ...

df3 = pd.DataFrame.from_dict(d)
I don't have enough reputation to be able to comment on the answer from #piRSquared, hence this answer.
The definitions of 'vi' and 'vj' didn't go through, raising an error (AttributeError: 'DataFrame' object has no attribute 'get_value'). It worked when I inserted an underscore, e.g. vi = df._get_value(i, 'PRODUCT_DESCRIPTION').
A similar issue occurred for 'set_value', and the same solution worked there too, e.g. df3._set_value(i, j, fuzz.ratio(vi, vj)).
Generating idxmax posed another error (TypeError: reduction operation 'argmax' not allowed for this dtype), which was because the contents of df3 (the fuzzy ratios) were of type 'object'. I converted all of them to numeric just before defining the threshold and it worked, e.g. df3 = df3.apply(pd.to_numeric).
A million thanks to #piRSquared for the solution. For a Python novice like me, it worked like a charm. I am posting this answer to make it easy for other newbies like me.
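Putting those three fixes together, the relevant part of the earlier answer would look roughly like this on newer pandas (a sketch only; df.at/df2.at are the public equivalents of the private _get_value/_set_value methods):

df3 = pd.DataFrame(index=df.index, columns=df2.index)

for i in df3.index:
    for j in df3.columns:
        vi = df.at[i, 'PRODUCT_DESCRIPTION']    # public equivalent of _get_value
        vj = df2.at[j, 'PROD_DESCRIPTION']
        df3.at[i, j] = fuzz.ratio(vi, vj)       # public equivalent of _set_value

df3 = df3.apply(pd.to_numeric)                  # ratios are stored as object otherwise

threshold = df3.max(1) > 50
idxmax = df3.idxmax(1)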
DataFrame1:
Device MedDescription Quantity
RWCLD Acetaminophen (TYLENOL) 325 mg Tab 54
RWCLD Ampicillin Inj (AMPICILLIN) 2 g Each 13
RWCLD Betamethasone Inj *5mL* (CELESTONE SOLUSPAN) 30 mg (5 mL) Each 2
RWCLD Calcium Carbonate Chew (500mg) (TUMS) 200 mg Tab 17
RWCLD Carboprost Inj *1mL* (HEMABATE) 250 mcg (1 mL) Each 5
RWCLD Chlorhexidine Gluc Liq *UD* (PERIDEX/PERIOGARD) 0.12 % (15 mL) Each 5
DataFrame2:
Device DrwSubDrwPkt MedDescription BrandName MedID PISAlternateID CurrentQuantity Min Max StandardStock ActiveOrders DaysUnused
RWC-LD RWC-LD_MAIN Drw 1-Pkt 12 Mag/AlOH/Smc 200-200-20/5 *UD* (MYLANTA/MAALOX) (30 mL) Each MYLANTA/MAALOX A03518 27593 7 4 10 N Y 3
RWC-LD RWC-LD_MAIN Drw 1-Pkt 20 ceFAZolin in Dextrose(ISO-OS) (ANCEF/KEFZOL) 1 g (50 mL) Each ANCEF/KEFZOL A00984 17124 6 5 8 N N 2
RWC-LD RWC-LD_MAIN Drw 1-Pkt 22 Clindamycin Phosphate/D5W (CLEOCIN) 900 mg (50 mL) IV Premix CLEOCIN A02419 19050 7 6 8 N N 2
What I want to do is append DataFrame2 values to DataFrame1 ONLY if the 'MedDescription' matches. When it finds a match, I would like to add only certain columns from DataFrame2 [Min, Max, Days Unused], which are all integers.
I had an iterative solution where I access the DataFrame1 object one row at a time and then check for a match with DataFrame2; once found, I append those columns to the original dataframe.
Is there a better way? It is making my computer slow to a crawl as I have thousands upon thousands of rows.
It sounds like you want to merge the target columns ('MedDescription', 'Min', 'Max', 'Days Unused') to df1 based on a matching 'MedDescription'.
I believe the best way to do this is as follows:
target_cols = ['MedDescription', 'Min', 'Max', 'Days Unused']
df1.merge(df2[target_cols], on='MedDescription', how='left')
how='left' ensures that all the data in df1 is returned, and only the target columns in df2 are appended if MedDescription matches.
Note: It is easier for others if you copy the results of df1/df2.to_dict(). The data above is difficult to parse.
This sounds like an opportunity to use Pandas' built-in functions for joining datasets - you should be able to join on MedDescription with the desired columns from DataFrame2. The join function in Pandas is very efficient, and should far outperform your method of looping through.
Pandas has documentation on merging datasets that includes some good examples, and you can find ample literature on the concepts of joins in SQL tutorials.
pd.merge(ld,ldAc,on='MedDescription',how='outer')
This is the way I used to join the 2 DataFrames; it seems to work, although it dropped one of the indexes that contained the devices.
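On that last point: pd.merge returns a result with a fresh default index, so any information stored only in a dataframe's index is lost. A common workaround (a sketch, assuming the device identifier lives in the index of ld) is to move the index into a regular column before merging:

# Preserve the index by turning it into a column before the merge
merged = pd.merge(ld.reset_index(), ldAc, on='MedDescription', how='outer')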