I created the following function to retrieve data from an internal incident management system:
def get_issues(session, query):
    block_size = 50
    block_num = 0
    start = 0
    all_issues = []
    while True:
        issues = session.search_issues(query, start, block_size, expand='changelog')
        if len(issues) == 0:  # no more issues
            break
        start += len(issues)
        for issue in issues:
            all_issues.append(issue)
    issues = pd.DataFrame(issues)
    for issue in all_issues:
        changelog = issue.changelog
        for history in changelog.histories:
            for item in history.items:
                if item.field == 'status' and item.toString == 'Pending':
                    groups = issue.fields.customfield_02219
                    d = {
                        'key': issue.key,
                        'issue_type': issue.fields.issuetype,
                        'creator': issue.fields.creator,
                        'business': issue.fields.customfield_082011,
                        'groups': groups
                    }
                    fields = issue.fields
                    issues = issues.append(d, ignore_index=True)
    return issues
I use this function to create a dataframe df using:
df = get_issues(the_session, the_query)
The resulting dataset looks similar to the following:
key issue_type creator business groups
0 MED-184 incident Smith, J Mercedes [Finance, Accounting, Billing]
1 MED-186 incident Jones, M Mercedes [Finance, Accounting]
2 MED-187 incident Williams, P Mercedes [Accounting, Sales, Executive, Tax]
3 MED-188 incident Smith, J BMW [Sales, Executive, Tax, Finance]
When I call dtypes on df, I get:
key object
issue_type object
creator object
business object
groups object
I would like to get only the last element of the groups column, such that the dataframe looks like:
key issue_type creator business groups
0 MED-184 incident Smith, J Mercedes Billing
1 MED-186 incident Jones, M Mercedes Accounting
2 MED-187 incident Williams, P Mercedes Tax
3 MED-188 incident Smith, J BMW Finance
I tried to amend the function above, as follows:
groups = issue.fields.customfield_02219[-1]
But, I get an error that it's not possible to index into that field:
TypeError: 'NoneType' object is not subscriptable
I also tried to create another column using:
df['groups_new'] = df['groups']:[-1]
But, this returns the original groups column with all elements.
Does anyone have any ideas as to how to accomplish this?
Thanks!
########################################################
UPDATE
print(df.info()) results in the following:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ ------------- -----
0 activity 7 non-null object
1 approvals 8 non-null object
2 business 13 non-null object
3 created 13 non-null object
4 creator 13 non-null object
5 region_a 5 non-null object
6 issue_type 13 non-null object
7 key 13 non-null object
8 materiality 13 non-null object
9 region_b 5 non-null object
10 resolution 2 non-null object
11 resolution_time 1 non-null object
12 target 13 non-null object
13 region_b 5 non-null object
dtypes: object(14)
memory usage: 1.5+ KB
None
Here it is:
df['new_group'] = df.apply(lambda x: x['groups'][-1], axis = 1)
UPDATE: If you get an IndexError with this, it means that at least one of your lists is empty. You can try this:
df['new_group'] = df.apply(lambda x: x['groups'][-1] if x['groups'] else None, axis = 1)
EXAMPLE:
df = pd.DataFrame({'key':[121,234,147], 'groups':[[111,222,333],[34,32],[]]})
print(f'ORIGINAL DATAFRAME:\n{df}\n')
df['new_group'] = df.apply(lambda x: x['groups'][-1] if x['groups'] else None, axis = 1)
print(f'FINAL DATAFRAME:\n{df}')
Output:
ORIGINAL DATAFRAME:
key groups
0 121 [111, 222, 333]
1 234 [34, 32]
2 147 []
FINAL DATAFRAME:
key groups new_group
0 121 [111, 222, 333] 333.0
1 234 [34, 32] 32.0
2 147 [] NaN
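A shorter alternative worth noting (a sketch; pandas' .str indexer also works element-wise on list values and yields NaN for empty lists or None):
df['new_group'] = df['groups'].str[-1]
This avoids the row-wise apply and handles the empty-list case shown above without an explicit condition.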
UPDATE: a demonstration of handling empty values
To get only the last element of each value (a Python list) in the 'groups' column, you can apply the following lambda to modify the 'groups' column in place:
df['groups'] = df['groups'].apply(lambda x: x.pop() if x else None)
Working demonstration:
import pandas as pd
# Code for mocking the dataframe
data = {
'key': ["MED-184", "MED-186", "MED-187"],
'issue_type': ['incident', 'incident', 'incident'],
'creator': ['Smith, J', 'Jones, M', 'Williams, P'],
'business': ['Mercedes', 'Mercedes', 'Mercedes'],
'groups': [['Finance', 'Accounting', 'Billing'], ['Finance', 'Accounting'], None]
}
df = pd.DataFrame.from_dict(data)
# print old dataframe:
print(df)
# Execute the line below to transform the dataframe
# into one with only the last values in the group column.
df['groups'] = df['groups'].apply(lambda x: x.pop() if x else None)
# print new transformed dataframe:
print(df)
I hope this answer helps you.
I am learning how to create heatmaps from CSV datasets using Pandas, Seaborn and Numpy.
# Canada Cases Year overview - Heatmap
# Read file and separate needed data subset
canada_df = pd.read_csv('https://raw.githubusercontent.com/datasets/covid-19/main/data/countries-aggregated.csv', usecols = [0, 1, 2], index_col = 0, parse_dates=[0])
canada_df.info()
canada_df.head()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 110370 entries, 2020-01-22 to 2021-08-09
Data columns (total 2 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   Country    110370 non-null  object
 1   Confirmed  110370 non-null  int64
dtypes: int64(1), object(1)

                Country  Confirmed
Date
2020-01-22  Afghanistan          0
2020-01-23  Afghanistan          0
2020-01-24  Afghanistan          0
2020-01-25  Afghanistan          0
2020-01-26  Afghanistan          0
#Filtering data for Canadian values only
canada_df.loc[canada_df['Country']=='Canada']
#Isolating needed subset
canada_cases = canada_df['Confirmed']
canada_cases.head()
# create a copy of the dataframe, and add columns for month and year
canada_heatmap = canada_cases.copy()
canada_heatmap['month'] = [i.month for i in canada_heatmap.index]
canada_heatmap['year'] = [i.year for i in canada_heatmap.index]
# group by month and year, get the average
canada_heatmap = canada_heatmap.groupby(['month', 'year']).mean()
At this point I get this error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-54-787f01af1859> in <module>
2 canada_heatmap = canada_cases.copy()
3 canada_heatmap['month'] = [i.month for i in canada_heatmap.index]
----> 4 canada_heatmap['year'] = [i.year for i in canada_heatmap.index]
5 # group by month and year, get the average
6 canada_heatmap = canada_heatmap.groupby(['month', 'year']).mean()
<ipython-input-54-787f01af1859> in <listcomp>(.0)
2 canada_heatmap = canada_cases.copy()
3 canada_heatmap['month'] = [i.month for i in canada_heatmap.index]
----> 4 canada_heatmap['year'] = [i.year for i in canada_heatmap.index]
5 # group by month and year, get the average
6 canada_heatmap = canada_heatmap.groupby(['month', 'year']).mean()
AttributeError: 'str' object has no attribute 'year'
I'm stuck on how to solve this, as the line above is pretty much the same but doesn't raise the same issue. Does anyone know what's going on here?
Some of your index values are not in a date format (2 of the elements are strings, namely the last two).
# check the type of the elements in index
count = pd.Series(canada_heatmap.index).apply(type).value_counts()
print(count)
<class 'pandas._libs.tslibs.timestamps.Timestamp'> 110370
<class 'str'> 2
Name: Date, dtype: int64
# remove them
canada_heatmap = canada_heatmap.iloc[:-2]
I reproduced your error.
Here
canada_cases = canada_df['Confirmed']
you're extracting a single column of the dataset, so it becomes a Series object rather than a DataFrame, and that carries over to canada_heatmap.
type(canada_heatmap)
>>> pandas.core.series.Series
As such, using an assignment with
canada_heatmap['month'] = ANYTHING
creates a new record in the series with the index value "month", not a new column.
Thus, on the first pass canada_heatmap.index is still a DatetimeIndex whose elements have .month and .year attributes, but the assignment adds the string label 'month' to the index, so the next line breaks: the comprehension now hits a plain string, and strings don't have a .year attribute.
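A tiny illustration of that pitfall (a sketch, not the OP's data):
import pandas as pd

s = pd.Series([1, 2], index=pd.to_datetime(['2020-01-01', '2020-01-02']))
s['month'] = 99        # on a Series this adds a new entry labelled 'month', not a column
print(list(s.index))   # [Timestamp('2020-01-01 00:00:00'), Timestamp('2020-01-02 00:00:00'), 'month']
# A later [i.year for i in s.index] now fails on the string 'month'.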
Instead do:
import pandas as pd
covid_all_countries = pd.read_csv('https://raw.githubusercontent.com/datasets/covid-19/main/data/countries-aggregated.csv', usecols = [0, 1, 2], index_col = 0, parse_dates=[0])
covid_canada_confirmed = covid_all_countries.loc[covid_all_countries['Country']=='Canada']
canada_heatmap = covid_canada_confirmed.copy()
canada_heatmap.drop(columns='Country', inplace=True)
canada_heatmap['month'] = canada_heatmap.index.month
canada_heatmap['year'] = canada_heatmap.index.year
Note that the last two statements are equivalent to what you were trying to achieve, but without looping through all the values (even via a list comprehension). This is clearer, more concise and considerably faster.
A couple of comments:
This line does nothing:
#Filtering data for Canadian values only
canada_df.loc[canada_df['Country']=='Canada']
You need to assign the filtering to a value like this:
#Filtering data for Canadian values only
canada_df_filt = canada_df.loc[canada_df['Country']=='Canada'].copy()
Next, try to set the month/year columns on the filtered DataFrame before anything is reduced to a Series, like this:
canada_heatmap = canada_df_filt.copy()
canada_heatmap['month'] = [i.month for i in canada_heatmap.index]
canada_heatmap['year'] = [i.year for i in canada_heatmap.index]
This works on my machine.
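To go from the month/year columns to the heatmap itself, a follow-on sketch (assuming the canada_heatmap frame built above, with its Confirmed column; the colour map is an arbitrary choice) could look like this:
import seaborn as sns
import matplotlib.pyplot as plt

# Average confirmed cases per (month, year) cell, pivoted into a grid for plotting.
grid = canada_heatmap.groupby(['month', 'year'])['Confirmed'].mean().unstack()

sns.heatmap(grid, cmap='viridis')
plt.title('Canada: average confirmed cases by month and year')
plt.show()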
I'm trying to clean up a Pandas dataframe called names2. It consists of 599,864 rows, of which 549,317 are non-null in the column in question, 'primary_profession'. Each row in that column holds either a single profession, a comma-separated string of several professions, or NaN.
Here is a look at how I loaded the dataframe:
name_basics_imdb = pd.read_csv('imdb.name.basics.csv.gz')
names = name_basics_imdb
names2 = names.copy(deep=True)
(Note: I dropped some columns and rows and renamed a column, if you need more details they'll gladly be supplied)
Here is a view of names2.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 599865 entries, 0 to 599864
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 nconst 599865 non-null object
1 primary_name 599865 non-null object
2 primary_profession 549317 non-null object
3 known_for_titles 569766 non-null object
dtypes: object(4)
memory usage: 18.3+ MB
names2.head()
nconst primary_name primary_profession known_for_titles
0 nm0061671 Mary Ellen Bauder miscellaneous,production_manager,producer tt0837562,tt2398241,tt0844471,tt0118553
1 nm0061865 Joseph Bauer composer,music_department,sound_department tt0896534,tt6791238,tt0287072,tt1682940
2 nm0062070 Bruce Baum miscellaneous,actor,writer tt0363631
3 nm0062195 Axel Baumann camera_department,cinematographer,art_department tt0114371,tt2004304,tt1618448,tt1224387
4 nm0062798 Pete Baxter production_designer,art_department,set_decorator tt0452644,tt0452692,tt34580
The goal is to iterate through each row's string, strings, or NaN and keep only the rows containing writer, director, or both. Any other profession can be thrown out. For instance, in row 2 the primary_profession column holds miscellaneous, actor, writer; miscellaneous and actor can be eliminated, leaving only writer in that row.
Any row where there is no writer or director, or which contains NaN, can be dropped.
Here are a few of the attempts I made:
#inverse filtering
value_list = ['miscellaneous', 'production_manager', 'composer', 'music_department', 'sound_department',
'miscellaneous', 'actor', '...', 'costume_department', 'costume_designer', 'actress', 'art_director', 'music_department' ]
#have to split the arrays first
inverse_bool_series = ~names2.primary_profession.isin(value_list)
names2_filtered = names2[inverse_bool_series]
names2_filtered
I also tried
names2['primary_profession'] = names2['primary_profession'].str.split(",").str[:3]
names2['primary_profession']
(names2['primary_profession'][0])
type(names2['primary_profession'][0][0])
And then there was this
for index, row in names2.iterrows():
    idx = list(len(range(names2.primary_profession)))
    for i in idx:
        print(row['primary_profession'][i])
To summarize, the goal is for the dataframe names2 to contain only rows whose primary_profession includes writer, director, or both.
In [96]: names2
Out[96]:
nconst primary_name primary_profession known_for_titles
0 nm0061671 Mary Ellen Bauder miscellaneous,production_manager,producer,director,writer tt0837562,tt2398241,tt0844471,tt0118553
1 nm0061865 Joseph Bauer composer,music_department,sound_department tt0896534,tt6791238,tt0287072,tt1682940
2 nm0062070 Bruce Baum miscellaneous,actor,writer tt0363631
3 nm0062195 Axel Baumann camera_department,cinematographer,art_department tt0114371,tt2004304,tt1618448,tt1224387
4 nm0062798 Pete Baxter production_designer,art_department,set_decorator tt0452644,tt0452692,tt34580
In [97]: profs = names2['primary_profession'].str.split(',').explode()
In [98]: profs
Out[98]:
0 miscellaneous
0 production_manager
0 producer
0 director
0 writer
1 composer
1 music_department
1 sound_department
2 miscellaneous
2 actor
2 writer
3 camera_department
3 cinematographer
3 art_department
4 production_designer
4 art_department
4 set_decorator
Name: primary_profession, dtype: object
In [99]: filtered_profs = profs[profs.isin(['writer', 'writer director', 'director'])]
In [100]: filtered_profs.groupby(filtered_profs.index).agg(','.join)
Out[100]:
0 director,writer
2 writer
Name: primary_profession, dtype: object
In [101]: names2.drop('primary_profession', axis=1).join(filtered_profs.groupby(filtered_profs.index).agg(','.join), how='inner')
Out[101]:
nconst primary_name known_for_titles primary_profession
0 nm0061671 Mary Ellen Bauder tt0837562,tt2398241,tt0844471,tt0118553 director,writer
2 nm0062070 Bruce Baum tt0363631 writer
# set the words you want to match.
matched_words = ['writer', 'writer_director', 'director']
#drop rows which has nan in column 'primary_profession'
names2 = names.dropna(axis='index', subset=['primary_profession'])
#extract all matched words
names2_extractall = names2['primary_profession'].str.extractall(rf'({"|".join(matched_words)})')
#groupby index and join those matches result by ','
mod_prof = names2_extractall.groupby(level=0).apply(lambda x: ",".join(x.iloc[:, 0]))
#assign to column 'primary_profession'
names2 = names2.assign(primary_profession=mod_prof)
#drop no matched rows
names2 = names2.dropna(axis='index', subset=['primary_profession'])
unwanted_row_indices = []
for index, row in names2.iterrows():
    if 'writer' in row['primary_profession'].lower():
        pass
    else:
        unwanted_row_indices.append(index)
names2 = names2.drop(unwanted_row_indices, axis=0)
I have a list of column names for a csv file, like: [email, null, password, ip_address, user_name, phone_no]. Consider a csv with the following data:
03-Sep-14,foo2#yahoo.co.jp,,
20-Jan-13,foo3#gmail.com,,
20-Feb-15,foo4#yahoo.co.jp,,
12-May-16,foo5#hotmail.co.jp,,
25-May-16,foo6#hotmail.co.jp,,
Now I want to identify the column names of this csv file on the basis of the data, e.g. col_1 is a date and col_2 is an email.
I tried to use pandas, for example getting all the values from col_1 and then checking whether each one is an email or something else, but I couldn't get far.
I tried something like this:
df = pd.read_csv('demo.csv', header=None)
df[df[1].str.contains("#")]
but it's not helping me.
Thank you.
Have you tried using Pandas dataframe.infer_objects()?
# importing pandas as pd
import pandas as pd
# Creating the dataframe
df = pd.DataFrame({"A":["alpha", 15, 81, 1, 100],
"B":[2, 78, 7, 4, 12],
"C":["beta", 21, 14, 61, 5]})
# data frame info and data
df.info()
print(df)
# slice all rows except first into a new frame
df_temp = df[1:]
# print it
print(df_temp)
df_temp.info()
# infer the object types
df_inferred = df_temp.infer_objects()
# print inferred
print(df_inferred)
df_inferred.info()
Here's the output from the above py script.
Initially df is inferred as object, int64 and object for A, B and C respectively.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 5 non-null object
1 B 5 non-null int64
2 C 5 non-null object
dtypes: int64(1), object(2)
memory usage: 248.0+ bytes
A B C
0 alpha 2 beta
1 15 78 21
2 81 7 14
3 1 4 61
4 100 12 5
A B C
1 15 78 21
2 81 7 14
3 1 4 61
4 100 12 5
After removing the first row, the exception that contains the strings, the data frame still shows the same dtypes.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 1 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 4 non-null object
1 B 4 non-null int64
2 C 4 non-null object
dtypes: int64(1), object(2)
memory usage: 228.0+ bytes
A B C
1 15 78 21
2 81 7 14
3 1 4 61
4 100 12 5
After infer_objects(), the types have been correctly inferred as int64.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 1 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 4 non-null int64
1 B 4 non-null int64
2 C 4 non-null int64
dtypes: int64(3)
memory usage: 228.0 bytes
Is this what you need?
The OP has clarified that s/he needs to determine if the column contains one of the following:
email
password
ip_address
user_name
phone_no
null
There are a couple of approaches we could use:
Approach #1: Take a random sample of rows and analyze their column contents using heuristics
We could use the following heuristic rules to identify column content type.
email: Use a regex to check for presence of a valid email.
[Stackoverflow - How to validate an email address]
https://www.regular-expressions.info/email.html
https://emailregex.com/
ip_address: Use a regex to match an ip_address.
Stackoverflow - Validating IPv4 addresses with Regex
Stackoverflow - Regular expression that matches valid IPv6 addresses
user_name: Use a table of common first names or last names and search for them within the value
phone_no: Strip +, SPACE, -, (, ) -- alternatively, all special characters. If you are left with all digits, we have a potential phone number
null: All column contents in sample are null
password: If it doesn't satisfy rules 1 through 5, we identify it as password
We should do the analysis independently on each column and keep track of how many sample items in the column matched each heuristic. Then we could pick the classification with the maximum number of matches.
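A minimal sketch of Approach #1 (the helper names are hypothetical, the user_name lookup is omitted, and the regexes are deliberately simplified; see the linked references for hardened patterns):
import re
import pandas as pd

# Simplified patterns; production code should use the more robust regexes referenced above.
EMAIL_RE = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')
IPV4_RE = re.compile(r'^(\d{1,3}\.){3}\d{1,3}$')
DATE_RE = re.compile(r'^\d{1,2}-[A-Za-z]{3}-\d{2,4}$')   # matches e.g. 03-Sep-14

def classify_value(value):
    # Order matters: the more specific checks come first, 'password' is the fallback.
    if pd.isna(value) or value == '':
        return 'null'
    s = str(value)
    if DATE_RE.match(s):
        return 'date'
    if EMAIL_RE.match(s):
        return 'email'
    if IPV4_RE.match(s):
        return 'ip_address'
    if re.sub(r'[+\s\-()]', '', s).isdigit():
        return 'phone_no'
    return 'password'

def classify_column(series, sample_size=100):
    # Sample some values, classify each one, and pick the most common label.
    values = series.dropna()
    if values.empty:
        return 'null'
    sample = values.sample(min(sample_size, len(values)), random_state=0)
    return sample.map(classify_value).value_counts().idxmax()

df = pd.read_csv('demo.csv', header=None)
print({col: classify_column(df[col]) for col in df.columns})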
Approach #2: Train a classifier using training data (obtained from real system) and use it to determine column content type
This is a machine learning classification task. A naive approach would be to take each column's data mapped to the content type as the training input.
Using the OP's sample set:
03-Sep-14,foo2#yahoo.co.jp,,
20-Jan-13,foo3#gmail.com,,
20-Feb-15,foo4#yahoo.co.jp,,
12-May-16,foo5#hotmail.co.jp,,
25-May-16,foo6#hotmail.co.jp,,
We would have:
data_content, content_type
03-Sep-14, date
20-Jan-13, date
20-Feb-15, date
12-May-16, date
25-May-16, date
foo2#yahoo.co.jp, email
foo3#gmail.com, email
foo4#yahoo.co.jp, email
foo5#hotmail.co.jp, email
foo6#hotmail.co.jp, email
We can then use machine learning to build a text-to-class multi-class classifier. Some references are given below:
Multi-Class Text Classification from Start to Finish
Multi-Class Text Classification with Scikit-Learn
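For illustration, a minimal scikit-learn sketch of such a classifier (the training rows below simply mirror the table above; a real system would need far more labelled data):
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled training data in the (data_content, content_type) shape shown above.
train = pd.DataFrame({
    'data_content': ['03-Sep-14', '20-Jan-13', '20-Feb-15',
                     'foo2#yahoo.co.jp', 'foo3#gmail.com', 'foo4#yahoo.co.jp'],
    'content_type': ['date', 'date', 'date', 'email', 'email', 'email'],
})

# Character n-grams capture the "shape" of short structured strings better than word tokens.
clf = make_pipeline(
    TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(train['data_content'], train['content_type'])

print(clf.predict(['12-May-16', 'foo5#hotmail.co.jp']))   # expected: ['date' 'email']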
I'm trying to analyze a data set in colab, and it looks a bit like this:
import pandas as pd
df = pd.read_csv('gdrive/My Drive/python_for_data_analysts/Agora Data.csv')
df.info()
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Vendor 109689 non-null object
1 Category 109689 non-null object
2 Item 109687 non-null object
3 Item Description 109660 non-null object
4 Price 109684 non-null object
5 Origin 99807 non-null object
6 Destination 60528 non-null object
7 Rating 109674 non-null object
8 Remarks 12616 non-null object
There's a Category column and an Origin column, and what I'm trying to do is get a value count of the categories for rows with an origin of, say, China or the USA only. Something that looks like:
df[' Origin'].value_counts().head(30)
USA 33729
UK 10336
Australia 8767
Germany 7876
Netherlands 7707
Canada 5126
EU 4356
China 4185
I've filtered out everything other than rows with an origin of China, but when I try to get a value count of the different categories within China, it doesn't output a proper list like the one above.
china_transactions = (df[' Origin'] == 'China') & (df[' Category']).value_counts()
china_transactions.head(50)
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
Create a Boolean Series where 'Origin' == 'China' and subset the DataFrame to only those rows. Then take the value_counts of the Category column. You can use DataFrame.loc to combine row and column selections at once.
df.loc[df[' Origin'].eq('China'), ' Category'].value_counts()
#      -------------------------  -----------  ---------------
#      only these rows            take this     apply this
#                                 column        method
I'm a newbie to Pandas and I'm trying to apply it to a script that I have already written.
I have a csv file from which I extract the data, and use the columns 'candidate', 'final track' and 'status' for my data frame.
My problem is, I would like to filter the data, using perhaps the method shown in Wes Mckinney's 10min tutorial ('http://nbviewer.ipython.org/urls/gist.github.com/wesm/4757075/raw/a72d3450ad4924d0e74fb57c9f62d1d895ea4574/PandasTour.ipynb'). In the section In [80]: he uses aapl_bars.close_price['2009-10-15'].
I would like to use a similar method to select all the rows which have * as the status; rows without a * in that column should be dropped entirely, along with their data in the other columns.
My code at the moment:
def establish_current_tacks(filename):
    df = pd.read_csv(filename)
    cols = [df.iloc[:, 0], df.iloc[:, 10], df.iloc[:, 11]]
    current_tracks = pd.concat(cols, axis=1)
    return current_tracks
My DataFrame:
>>> current_tracks
<class 'pandas.core.frame.DataFrame'>
Int64Index: 707 entries, 0 to 706
Data columns (total 3 columns):
candidate 695 non-null values
final track 670 non-null values
status 670 non-null values
dtypes: float64(1), object(2)
I would like to use something such as current_tracks.status['*'], but this does not work
Apologies if this is obvious, struggling a little to get my head around it.
Since the data you want to filter based on is not part of the data frame's index, but instead is a regular column, you need to do something like this:
current_tracks[current_tracks.status == '*']
Full example:
import pandas as pd
current_tracks = pd.DataFrame({'candidate': ['Bob', 'Jim', 'Alice'],
'final_track': [10, 15, 13], 'status': ['*', '.', '*']})
current_tracks
Out[3]:
candidate final_track status
0 Bob 10 *
1 Jim 15 .
2 Alice 13 *
current_tracks[current_tracks.status == '*']
Out[4]:
candidate final_track status
0 Bob 10 *
2 Alice 13 *
If status was part of your dataframe's index, your original syntax would have worked:
current_tracks = current_tracks.set_index('status')
current_tracks.candidate['*']
Out[8]:
status
* Bob
* Alice
Name: candidate, dtype: object