Fill empty Pandas column based on condition on substring - python

I have a dataset with a Job_Title column, and I added a Categories column that I want to use to categorize the job titles. For example, all job titles that contain the word 'Analytics' should be categorized as Data, and this label Data should appear in the Categories column.
I have created a dictionary whose keys are the words I want to identify in the Job_Title column and whose values are the labels I want to add to the Categories column.
# Creating a new dictionary with the new categories
cat_type_dic = {}
with open("categories.txt") as cat_type_file:
    for line in cat_type_file:
        key, value = line.strip().split(";")  # strip the newline so the values stay clean
        cat_type_dic[key] = value
print(cat_type_dic)
Then, I tried to write a loop based on a condition: if a key of the dictionary is a substring of the Job_Title column, fill the Categories column with the corresponding value. This is what I tried:
for i in range(len(df)):
    if df.loc["Job_Title"].str.contains(cat_type_dic[i]):
        df["Categories"] = df["Categories"].str.replace(cat_type_dic[i], cat_type_dic.get(i))
Of course, it's not working. I think I am not accessing the keys and values correctly. Any clue?
This is the error message I am getting:
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
      1 for i in range(len(df)):
----> 2     if df.iloc["Job_Title"].str.contains(cat_type_dic[i]):
      3         df["Categories"] = df["Categories"].str.replace(cat_type_dic[i], cat_type_dic.get(i))

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
    929
    930         maybe_callable = com.apply_if_callable(key, self.obj)
--> 931         return self._getitem_axis(maybe_callable, axis=axis)
    932
    933     def _is_scalar_access(self, key: tuple):

C:\Anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1561         key = item_from_zerodim(key)
   1562         if not is_integer(key):
-> 1563             raise TypeError("Cannot index by location index with a non-integer key")
   1564
   1565         # validate the location

TypeError: Cannot index by location index with a non-integer key
Thanks a lot!

Does the following code give you what you need?
import pandas as pd

df = pd.DataFrame()
df['Job_Title'] = ['Business Analyst', 'Data Scientist', 'Server Analyst']
cat_type_dic = {'Business': ['CatB1', 'CatB2'], 'Scientist': ['CatS1', 'CatS2', 'CatS3']}
list_keys = list(cat_type_dic.keys())

def label_extracter(x):
    # collect the dictionary keys that appear in this row's Job_Title
    list_matched_keys = list(filter(lambda y: y in x['Job_Title'], list_keys))
    # join every category list belonging to a matched key into one string
    category_label = ' '.join([' '.join(cat_type_dic[key]) for key in list_matched_keys])
    return category_label

df['Categories'] = df.apply(lambda x: label_extracter(x), axis=1)
print(df)
Job_Title Categories
0 Business Analyst CatB1 CatB2
1 Data Scientist CatS1 CatS2 CatS3
2 Server Analyst
EDIT: Explanation added. @SofyPond
apply helps when a loop is necessary.
I defined a function which checks whether Job_Title contains any key of the dictionary assigned earlier. I converted the keys to a list to make the checking step easier.
(list_label was renamed to category_label, since it is not a list anymore.) category_label in label_extracter collects the values assigned to each matched key, which are lists, and joins each one into a string with ' ' (whitespace) between values. When list_matched_keys is non-empty, the inner ' '.join produces one string per matched key, and the outer ' '.join concatenates those into the final string.
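For the single-label case in the original question (e.g. any title containing 'Analytics' gets the one label 'Data'), a vectorized alternative is to build a regex from the dictionary keys, extract the first matching keyword, and map it back to its category. A minimal sketch, assuming a flat {keyword: label} dictionary like the asker's (the sample data here is hypothetical):

import pandas as pd

df = pd.DataFrame({'Job_Title': ['Analytics Manager', 'Sales Lead', 'Janitor']})
cat_type_dic = {'Analytics': 'Data', 'Sales': 'Business'}  # hypothetical mapping

# one capture group matching any keyword; rows with no match get NaN
pattern = '(' + '|'.join(cat_type_dic.keys()) + ')'
df['Categories'] = df['Job_Title'].str.extract(pattern, expand=False).map(cat_type_dic)
print(df)

Note that keywords containing regex metacharacters would need re.escape() before joining.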

Related

Pandas Sort File and group up values

I'm learning pandas, but having some trouble.
I import the data as a DataFrame and want to bin the 2017 population values into four equal-size groups, then count the number of rows in group 4.
However, the system prints out:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-52-05d9f2e7ffc8> in <module>
2
3 df=pd.read_excel('C:/Users/Sam/Desktop/商業分析/Python_Jabbia1e/Chapter 2/jaggia_ba_1e_ch02_Data_Files.xlsx',sheet_name='Population')
----> 4 df=df.sort_values('2017',ascending=True)
5 df['Group'] = pd.qcut(df['2017'], q = 4, labels = range(1, 5))
6 splitData = [group for _, group in df.groupby('Group')]
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in sort_values(self, by, axis, ascending, inplace, kind, na_position, ignore_index, key)
5453
5454 by = by[0]
-> 5455 k = self._get_label_or_level_values(by, axis=axis)
5456
5457 # need to rewrap column in Series to apply key function
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in _get_label_or_level_values(self, key, axis)
1682 values = self.axes[axis].get_level_values(key)._values
1683 else:
-> 1684 raise KeyError(key)
1685
1686 # Check for duplicates
KeyError: '2017'
What's wrong with it?
Thanks~
Here's the dataframe:
And I tried:
df=pd.read_excel('C:/Users/Sam/Desktop/商業分析/Python_Jabbia1e/Chapter 2/jaggia_ba_1e_ch02_Data_Files.xlsx',sheet_name='Population')
df=df.sort_values('2017',ascending=True)
df['Group'] = pd.qcut(df['2017'], q = 4, labels = range(1, 5))
splitData = [group for _, group in df.groupby('Group')]
print('The number of group4 is :',splitData[3].shape[0])
You are passing the key to df.sort_values() as a str. You can give it either as an element in a list or on its own:
df = df.sort_values(by=['2017'], ascending=True)
or
df = df.sort_values(by='2017', ascending=True)
This only works if the column label exactly matches the string you pass. If the label is not a string, or if it contains stray whitespace, it won't match. You can remove trailing whitespace before sorting with
df.columns = df.columns.str.strip()
and if the label is not a string you should use
df = df.sort_values(by=[2017], ascending=True)
Firstly, you have a problem on line 4 with the sort: you tell the sort function to look for the string '2017', but the column label is the integer 2017. Try this, then move on with your code:
df=df.sort_values([2017],ascending=True)
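Either way, it helps to inspect the actual column labels first, since read_excel can deliver the year header as an int or a str depending on the file. A quick check, assuming the df loaded above:

print(df.columns.tolist())            # e.g. [..., 2017] vs [..., '2017']
print([type(c) for c in df.columns])  # int labels need by=2017, str labels need by='2017'

# sort with whichever type the label actually is
col = 2017 if 2017 in df.columns else '2017'
df = df.sort_values(by=col, ascending=True)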

Counting combinations in Dataframe create new Dataframe

So I have a dataframe called reactions_drugs, and I want to create a table called new_r_d where I keep track of how often I see a symptom for a given medication.
Here is the code I have, but I am running into errors such as "Unable to coerce to Series, length must be 3 given 0":
new_r_d = pd.DataFrame(columns = ['drugname', 'reaction', 'count'])
for i in range(len(reactions_drugs)):
    name = reactions_drugs.drugname[i]
    drug_rec_act = reactions_drugs.drug_rec_act[i]
    for rec in drug_rec_act:
        row = new_r_d.loc[(new_r_d['drugname'] == name) & (new_r_d['reaction'] == rec)]
        if row == []:
            # create new row
            new_r_d.append({'drugname': name, 'reaction': rec, 'count': 1})
        else:
            new_r_d.at[row, 'count'] += 1
Assuming each row of your current reactions column (drug_rec_act) contains one string enclosed in a list, you can convert the values in that column to lists of strings (by splitting each string on the comma delimiter) and then use explode() and value_counts() to get your desired result:
df['drug_rec_act'] = df['drug_rec_act'].apply(lambda x: x[0].split(','))
df_long = df.explode('drug_rec_act')
result = df_long.groupby('drugname')['drug_rec_act'].value_counts().reset_index(name='count')
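For instance, on a small hypothetical frame shaped the way the question describes (one list-wrapped, comma-separated string per drug), those three lines behave like this:

import pandas as pd

df = pd.DataFrame({'drugname': ['DrugA', 'DrugB'],
                   'drug_rec_act': [['headache,nausea,headache'], ['rash']]})
df['drug_rec_act'] = df['drug_rec_act'].apply(lambda x: x[0].split(','))
df_long = df.explode('drug_rec_act')
result = df_long.groupby('drugname')['drug_rec_act'].value_counts().reset_index(name='count')
print(result)
#   drugname drug_rec_act  count
# 0    DrugA     headache      2
# 1    DrugA       nausea      1
# 2    DrugB         rash      1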

how to return max value of each column from a dataframe using for loop in python

I have a dataframe on which I perform a crosstab, which returns a dataframe object. I want to return the max() value of each column, but when I try to iterate over the dataframe using a for loop the program crashes and displays the error below:
for m in range(len(maxvalue)):
TypeError: object of type 'numpy.int64' has no len()
the type of maxvalue is numpy.int64
code:
import pandas as pd

df = pd.DataFrame({'event_type': ['watch movie ', 'stay at home', 'swimming', 'camping', 'meeting'],
                   'year_month': ['2020-08', '2020-05', '2020-02', '2020-06', '2020-01'],
                   'event_mohafaza': ['loc1', 'loc3', 'loc2', 'loc5', 'loc4'],
                   ' number_person ': [24, 39, 20, 10, 33]})
grouped_df = pd.crosstab(df['year_month'], df['event_type'])
print(type(grouped_df))
for x in grouped_df.columns:
    mx = []
    maxvalue = grouped_df[x].max()
    str(maxvalue)
    print(type(maxvalue))
    print(maxvalue)
    for m in range(len(maxvalue)):
        mx.append(m)
    print(mx)
the expected result is:
1
509
92
332
14
where each value is the max value in its column
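For what it's worth, the loop crashes because grouped_df[x].max() is already a single number (a numpy.int64), which has no len(). A minimal sketch of two ways to collect the per-column maxima from the crosstab above:

# max() over the whole crosstab gives one value per column directly
mx = grouped_df.max()    # a Series indexed by column name
print(mx.tolist())

# or keep the explicit loop, without calling len() on a scalar
mx = [grouped_df[x].max() for x in grouped_df.columns]
print(mx)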

pandas replace under condition issue

I want to replace values under the Severity column with levels (1-4) when the string contains the word DTD, DNP, out indefinitely, or out for season.
I tried to replace using a dictionary, but it doesn't change the elements under the column:
df['Severity'] = ""
df['Severity'] = df['Notes']
replace_dict = {'DTD':1,'DNP':2,'out indefinitely':3,'out for season':4}
df['Severity'] = df['Severity'].replace(replace_dict)
I am cleaning NBA injury data from the 2018-19 season; the frame looks like this:
I would build a custom function, then apply it to the string column:
import re

def replace_severity(s):
    """matches string `s` for keys in `replace_dict`, returning values from `replace_dict`"""
    # define the keys + matches
    replace_dict = {'DTD': 1, 'DNP': 2, 'out indefinitely': 3, 'out for season': 4}
    for key, value in replace_dict.items():
        if re.search(key, s, re.I):
            return value
    # if no results
    return None

# Apply this function to your column
df['Severity'] = df['Notes'].apply(replace_severity)
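As a quick sanity check, the function can be exercised on a few hypothetical Notes strings before running it on the real frame:

import pandas as pd

df = pd.DataFrame({'Notes': ['placed on IL (DTD)',
                             'DNP - rest',
                             'torn ACL, out for season',
                             'activated from IL']})
df['Severity'] = df['Notes'].apply(replace_severity)
print(df['Severity'].tolist())   # [1, 2, 4, None]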

Exception during groupby pandas

I am just beginning to learn analytics with Python for network analysis, using the Python for Data Analysis book, and I'm getting confused by an exception I get while doing some groupbys. Here's my situation.
I have a CSV of NetFlow data that I've imported to pandas. The data looks something like:
dt, srcIP, srcPort, dstIP, dstPort, bytes
2013-06-06 00:00:01.123, 123.123.1.1, 12345, 234.234.1.1, 80, 75
I've imported and indexed the data as follows:
df = pd.read_csv('mycsv.csv')
df.index = pd.to_datetime(df.pop('dt'))
What I want is a count of unique srcIPs which visit my servers per time period (I have data over several days and I'd like time period by date,hour). I can obtain an overall traffic graph by grouping and plotting as follows:
df.groupby([lambda t: t.date(), lambda t: t.hour]).srcIP.nunique().plot()
However, I want to know how that overall traffic is split amongst my servers. My intuition was to additionally group by the 'dstIP' column (which only has 5 unique values), but I get errors when I try to aggregate on srcIP.
grouped = df.groupby([lambda t: t.date(), lambda t: t.hour, 'dstIP'])
grouped.sip.nunique()
...
Exception: Reindexing only valid with uniquely valued Index objects
So, my specific question is: How can I avoid this exception in order to create a plot where traffic is aggregated over 1 hour blocks and there is a different series for each server.
More generally, please let me know what newb errors I'm making.
Also, the data does not have regular frequency timestamps and I don't want sampled data in case that makes any difference in your answer.
EDIT 1
This is my IPython session exactly as input. Output omitted except for the deepest few calls in the error.
EDIT 2
Upgrading pandas from 0.8.0 to 0.12.0 yielded a more descriptive exception, shown below:
import numpy as np
import pandas as pd
import time
import datetime
full_set = pd.read_csv('june.csv', parse_dates=True, index_col=0)
full_set.sort_index(inplace=True)
gp = full_set.groupby(lambda t: (t.date(), t.hour, full_set['dip'][t]))
gp['sip'].nunique()
...
/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.pyc in _make_labels(self)
   1239             raise Exception('Should not call this method grouping by level')
   1240         else:
-> 1241             labs, uniques = algos.factorize(self.grouper, sort=self.sort)
   1242             uniques = Index(uniques, name=self.name)
   1243             self._labels = labs

/usr/local/lib/python2.7/dist-packages/pandas/core/algorithms.pyc in factorize(values, sort, order, na_sentinel)
    123     table = hash_klass(len(vals))
    124     uniques = vec_klass()
--> 125     labels = table.get_labels(vals, uniques, 0, na_sentinel)
    126
    127     labels = com._ensure_platform_int(labels)

/usr/local/lib/python2.7/dist-packages/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_labels (pandas/hashtable.c:12229)()

/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in __hash__(self)
     52     def __hash__(self):
     53         raise TypeError('{0!r} objects are mutable, thus they cannot be'
---> 54                         ' hashed'.format(self.__class__.__name__))
     55
     56     def __unicode__(self):

TypeError: 'TimeSeries' objects are mutable, thus they cannot be hashed
So I'm not 100 percent sure why that exception was raised (most likely the duplicate timestamps in the index make full_set['dip'][t] return a Series rather than a scalar, and a Series cannot be hashed into a group key), but a few suggestions:
You can read in your data and parse the datetime and index by the datetime all at once with read_csv:
df = pd.read_csv('mycsv.csv', parse_dates=True, index_col=0)
Then you can form your groups by using a lambda function that returns a tuple of values:
gp = df.groupby( lambda t: ( t.date(), t.hour, df['dstIP'][t] ) )
The input to this lambda function is the index; we can use this index to go into the dataframe in the outer scope and retrieve the dstIP value at that index, and thus factor it into the grouping.
Now that we have the grouping, we can apply the aggregator:
gp['srcIP'].nunique()
I ended up solving my problem by adding a new column of hour-truncated datetimes to the original dataframe as follows:
f = lambda i: i.strftime('%Y-%m-%d %H:00:00')
full_set['hours'] = full_set.index.map(f)
Then I can group by 'dip' and loop through each destination IP, creating an hourly grouped plot as I go:
dipgroup = full_set.groupby('dip')
for d, g in dipgroup:
    g.groupby('hours').sip.nunique().plot()
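On a recent pandas the helper column isn't needed: grouping with pd.Grouper and unstacking gives one unique-source-count series per destination IP directly. A sketch, assuming the datetime index and the dip/sip column names from the session above:

import pandas as pd

# hourly buckets from the datetime index, one column per destination IP
hourly = (full_set.groupby([pd.Grouper(freq='H'), 'dip'])['sip']
                  .nunique()
                  .unstack('dip'))
hourly.plot()   # one line per server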
