I am learning how to create heatmaps from CSV datasets using Pandas, Seaborn and Numpy.
# Canada Cases Year overview - Heatmap
# Read file and separate needed data subset
canada_df = pd.read_csv('https://raw.githubusercontent.com/datasets/covid-19/main/data/countries-aggregated.csv', usecols = [0, 1, 2], index_col = 0, parse_dates=[0])
canada_df.info()
canada_df.head()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 110370 entries, 2020-01-22 to 2021-08-09
Data columns (total 2 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   Country    110370 non-null  object
 1   Confirmed  110370 non-null  int64
dtypes: int64(1), object(1)
                Country  Confirmed
Date
2020-01-22  Afghanistan          0
2020-01-23  Afghanistan          0
2020-01-24  Afghanistan          0
2020-01-25  Afghanistan          0
2020-01-26  Afghanistan          0
#Filtering data for Canadian values only
canada_df.loc[canada_df['Country']=='Canada']
#Isolating needed subset
canada_cases = canada_df['Confirmed']
canada_cases.head()
# create a copy of the dataframe, and add columns for month and year
canada_heatmap = canada_cases.copy()
canada_heatmap['month'] = [i.month for i in canada_heatmap.index]
canada_heatmap['year'] = [i.year for i in canada_heatmap.index]
# group by month and year, get the average
canada_heatmap = canada_heatmap.groupby(['month', 'year']).mean()
At this point I get this error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-54-787f01af1859> in <module>
2 canada_heatmap = canada_cases.copy()
3 canada_heatmap['month'] = [i.month for i in canada_heatmap.index]
----> 4 canada_heatmap['year'] = [i.year for i in canada_heatmap.index]
5 # group by month and year, get the average
6 canada_heatmap = canada_heatmap.groupby(['month', 'year']).mean()
<ipython-input-54-787f01af1859> in <listcomp>(.0)
2 canada_heatmap = canada_cases.copy()
3 canada_heatmap['month'] = [i.month for i in canada_heatmap.index]
----> 4 canada_heatmap['year'] = [i.year for i in canada_heatmap.index]
5 # group by month and year, get the average
6 canada_heatmap = canada_heatmap.groupby(['month', 'year']).mean()
AttributeError: 'str' object has no attribute 'year'
I'm stuck on how to solve this, as the line above is pretty much the same but doesn't raise the same issue. Does anyone know what's going on here?
Some of your index values are not in a date format: 2 elements are strings, namely the last two elements.
# check the type of the elements in index
count = pd.Series(canada_heatmap.index).apply(type).value_counts()
print(count)
<class 'pandas._libs.tslibs.timestamps.Timestamp'> 110370
<class 'str'> 2
Name: Date, dtype: int64
# remove them
canada_heatmap = canada_heatmap.iloc[:-2]
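A more defensive variant (just a sketch, assuming the stray labels are unparseable strings) is to coerce the whole index to datetimes and drop whatever fails to parse, instead of hard-coding how many rows to cut:
# Coerce every index label to a datetime; unparseable labels become NaT and are dropped
canada_heatmap.index = pd.to_datetime(canada_heatmap.index, errors='coerce')
canada_heatmap = canada_heatmap[canada_heatmap.index.notna()]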
I reproduced your error.
Here
canada_cases = canada_df['Confirmed']
you're extracting one column of the dataset, so it becomes a Series object, not a DataFrame, and that then carries over to canada_heatmap.
type(canada_heatmap)
>>> pandas.core.series.Series
As such, using an assignment with
canada_heatmap['month'] = ANYTHING
creates a new record in the series with the index value "month", not a new column.
Thus, on the first pass canada_heatmap.index is still a DatetimeIndex, so its elements have .month and .year attributes, but the next line breaks because the index now also contains the plain string 'month', and strings don't have a .year attribute.
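A tiny illustration of that pitfall (made-up data, not from the question):
import pandas as pd

# Label assignment on a Series appends a new entry instead of adding a column,
# so the index silently becomes a mix of Timestamps and the string 'month'
s = pd.Series([1, 2], index=pd.to_datetime(['2020-01-01', '2020-01-02']))
s['month'] = 99
print(s.index)
# Index([2020-01-01 00:00:00, 2020-01-02 00:00:00, 'month'], dtype='object')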
Instead do:
import pandas as pd
covid_all_countries = pd.read_csv('https://raw.githubusercontent.com/datasets/covid-19/main/data/countries-aggregated.csv', usecols = [0, 1, 2], index_col = 0, parse_dates=[0])
covid_canada_confirmed = covid_all_countries.loc[covid_all_countries['Country']=='Canada']
canada_heatmap = covid_canada_confirmed.copy()
canada_heatmap.drop(columns='Country', inplace=True)
canada_heatmap['month'] = canada_heatmap.index.month
canada_heatmap['year'] = canada_heatmap.index.year
Note that the last two statements are equivalent to what you were trying to achieve, but without looping through all the values (even via a list comprehension). This is clearer, more concise and considerably faster.
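To finish the heatmap the question set out to build, one possible continuation of the snippet above (a sketch, not part of the original answer; the reshaping and styling choices are assumptions) is to average per month/year, pivot the years into columns and hand the result to seaborn:
import seaborn as sns
import matplotlib.pyplot as plt

# Average confirmed cases per (month, year), then pivot years into columns
monthly = canada_heatmap.groupby(['month', 'year'])['Confirmed'].mean().unstack()

sns.heatmap(monthly, cmap='Reds')
plt.title('Canada - average confirmed cases per month')
plt.show()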
A couple comments:
This line does nothing:
#Filtering data for Canadian values only
canada_df.loc[canada_df['Country']=='Canada']
You need to assign the filtering to a value like this:
#Filtering data for Canadian values only
canada_df_filt = canada_df.loc[canada_df['Country']=='Canada'].copy()
Next, add the month/year columns while the data is still a DataFrame (before reducing it to a Series), like this:
canada_heatmap = canada_df_filt.copy()
canada_heatmap['month'] = [i.month for i in canada_heatmap.index]
canada_heatmap['year'] = [i.year for i in canada_heatmap.index]
This works on my machine.
Related
I have a data frame of rental data and would like to annualise the rent whenever a frequency column states that the rent is monthly, i.e. price * 12.
The frequency column contains the following values - 'Yearly', 'Monthly', nan
I have tried - np.where(df['frequency'] == "Monthly", df['price'].apply(lambda x: x*12), 0)
However, where there is monthly data, the figure seems to be copied 12 times rather than multiplied by 12.
I need to have the price multiplied by 12 but can't figure out how to do this.
The problem is your price column contains string and not numeric values.
If you load your dataframe from a file (csv, xlsx), use thousands=',' as a parameter of pd.read_csv or pd.read_excel to interpret a string like '4,500' as the number 4500.
Demo:
import pandas as pd
import io
csvdata = """\
frequency;price
Monthly;4,500
Yearly;30,200
"""
df1 = pd.read_csv(io.StringIO(csvdata), sep=';')
df2 = pd.read_csv(io.StringIO(csvdata), sep=';', thousands=',')
For df1:
>>> df1
frequency price
0 Monthly 4,500
1 Yearly 30,200
>>> df1.dtypes
frequency object
price object # not numeric
dtype: object
>>> df1['price'] * 2
0 4,5004,500
1 30,20030,200
Name: price, dtype: object
For df2:
>>> df2
frequency price
0 Monthly 4500
1 Yearly 30200
>>> df2.dtypes
frequency object
price int64 # numeric
dtype: object
>>> df2['price'] * 2
0 9000
1 60400
Name: price, dtype: int64
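Tying this back to the original goal, a short sketch using the df2 from the demo above (the annual_price column name is an assumption): once the prices are numeric, the np.where expression multiplies instead of repeating the string.
import numpy as np

# With numeric prices, monthly rows are multiplied by 12 as intended
df2['annual_price'] = np.where(df2['frequency'] == 'Monthly', df2['price'] * 12, 0)
print(df2)
#   frequency  price  annual_price
# 0   Monthly   4500         54000
# 1    Yearly  30200             0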
It seems there are strings instead of numeric floats in the price column, so first replace , with . and then convert to floats, and finally multiply by 12:
np.where(df['frequency'] == "Monthly", df['price'].str.replace(',','.').astype(float)*12, 0)
If the values use , as a thousands separator, replace it with an empty string instead:
np.where(df['frequency'] == "Monthly", df['price'].str.replace(',','').astype(float)*12, 0)
I created the following function to retrieve data from an internal incident management system:
def get_issues(session, query):
    block_size = 50
    block_num = 0
    start = 0
    all_issues = []
    while True:
        issues = session.search_issues(query, start, block_size, expand='changelog')
        if len(issues) == 0:  # no more issues
            break
        start += len(issues)
        for issue in issues:
            all_issues.append(issue)
    issues = pd.DataFrame(issues)
    for issue in all_issues:
        changelog = issue.changelog
        for history in changelog.histories:
            for item in history.items:
                if item.field == 'status' and item.toString == 'Pending':
                    groups = issue.fields.customfield_02219
                    d = {
                        'key': issue.key,
                        'issue_type': issue.fields.issuetype,
                        'creator': issue.fields.creator,
                        'business': issue.fields.customfield_082011,
                        'groups': groups
                    }
                    fields = issue.fields
                    issues = issues.append(d, ignore_index=True)
    return issues
I use this function to create a dataframe df using:
df = get_issues(the_session, the_query)
The resulting dataset looks similar to the following:
key issue_type creator business groups
0 MED-184 incident Smith, J Mercedes [Finance, Accounting, Billing]
1 MED-186 incident Jones, M Mercedes [Finance, Accounting]
2 MED-187 incident Williams, P Mercedes [Accounting, Sales, Executive, Tax]
3 MED-188 incident Smith, J BMW [Sales, Executive, Tax, Finance]
When I call dtypes on df, I get:
key object
issue_type object
creator object
business object
groups object
I would like to get only the last element of the groups column, such that the dataframe looks like:
key issue_type creator business groups
0 MED-184 incident Smith, J Mercedes Billing
1 MED-186 incident Jones, M Mercedes Accounting
2 MED-187 incident Williams, P Mercedes Tax
3 MED-188 incident Smith, J BMW Finance
I tried to amend the function above, as follows:
groups = issue.fields.customfield_02219[-1]
But, I get an error that it's not possible to index into that field:
TypeError: 'NoneType' object is not subscriptable
I also tried to create another column using:
df['groups_new'] = df['groups']:[-1]
But, this returns the original groups column with all elements.
Does anyone have any ideas as to how to accomplish this?
Thanks!
########################################################
UPDATE
print(df.info()) results in the following:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ ------------- -----
0 activity 7 non-null object
1 approvals 8 non-null object
2 business 13 non-null object
3 created 13 non-null object
4 creator 13 non-null object
5 region_a 5 non-null object
6 issue_type 13 non-null object
7 key 13 non-null object
8 materiality 13 non-null object
9 region_b 5 non-null object
10 resolution 2 non-null object
11 resolution_time 1 non-null object
12 target 13 non-null object
13 region_b 5 non-null object
dtypes: object(14)
memory usage: 1.5+ KB
None
Here it is:
df['new_group'] = df.apply(lambda x: x['groups'][-1], axis = 1)
UPDATE: If you get an IndexError with this, it means that at least one of your lists is empty. You can try this:
df['new_group'] = df.apply(lambda x: x['groups'][-1] if x['groups'] else None, axis = 1)
EXAMPLE:
df = pd.DataFrame({'key':[121,234,147], 'groups':[[111,222,333],[34,32],[]]})
print(f'ORIGINAL DATAFRAME:\n{df}\n')
df['new_group'] = df.apply(lambda x: x['groups'][-1] if x['groups'] else None, axis = 1)
print(f'FINAL DATAFRAME:\n{df}')
#
ORIGINAL DATAFRAME:
key groups
0 121 [111, 222, 333]
1 234 [34, 32]
2 147 []
FINAL DATAFRAME:
key groups new_group
0 121 [111, 222, 333] 333.0
1 234 [34, 32] 32.0
2 147 [] NaN
UPDATE: demonstration of handling empty values
To get only the last element of each value (a Python list) in the 'groups' column, you can apply the following lambda to modify the 'groups' column inplace:
df['groups'] = df['groups'].apply(lambda x: x.pop() if x else None)
Working demonstration:
import pandas as pd
# Code for mocking the dataframe
data = {
'key': ["MED-184", "MED-186", "MED-187"],
'issue_type': ['incident', 'incident', 'incident'],
'creator': ['Smith, J', 'Jones, M', 'Williams, P'],
'business': ['Mercedes', 'Mercedes', 'Mercedes'],
'groups': [['Finance', 'Accounting', 'Billing'], ['Finance', 'Accounting'], None]
}
df = pd.DataFrame.from_dict(data)
# print old dataframe:
print(df)
# Execute the line below to transform the dataframe
# into one with only the last values in the group column.
df['groups'] = df['groups'].apply(lambda x: x.pop() if x else None)
# print new transformed dataframe:
print(df)
I hope this answer helps you.
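One small caveat about the snippet above: x.pop() also removes the last element from the underlying list objects, so the original lists get mutated. If that matters, a non-mutating variant (same idea, just indexing instead of popping) is:
# Read the last element without altering the original lists
df['groups'] = df['groups'].apply(lambda x: x[-1] if x else None)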
I have the following datasets:
1) Dataset with Month as TimeStamp
df = pd.DataFrame(residuals, columns = ['Passengers'])
Passengers
Month
1949-01-01 -0.082329
1949-02-01 -0.040724
1949-03-01 0.060813
1949-04-01 0.027243
1949-05-01 -0.047359
1949-06-01 0.051545
1949-07-01 0.132902
1949-08-01 0.122322
2) Dataset with Month as Int
dz = pd.DataFrame(estacionalitat, columns = ['Passengers'])
Passengers
Month
1 -0.075844
2 -0.089111
3 0.042705
4 0.002147
5 -0.010528
6 0.109443
7 0.198334
8 0.209830
A set of transformations has been carried out on both datasets, but the data originally comes from the following dataset:
data = pd.read_csv('AirPassengers.csv', parse_dates=['Month'], index_col='Month', header=0)
I would like to subtract one dataset from the other as follows:
df-dz
However, when I try to do the above I get the following message:
Cannot compare type 'Timestamp' with type 'int'
I guess this is because 'Month' is of type int in one dataset while in the other it is of type Date. Furthermore, I don't know how to access 'Month' because it is not understood as a column.
If you want to convert the DatetimeIndex to months, use:
df.index = df.index.month
Then both indexes contain integers and the column names are the same, so subtraction is possible:
df = df-dz
print (df)
Passengers
Month
1 -0.006485
2 0.048387
3 0.018108
4 0.025096
5 -0.036831
6 -0.057898
7 -0.065432
8 -0.087508
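A self-contained sketch of that alignment (synthetic numbers, just to illustrate the mechanics):
import pandas as pd

# Frame indexed by dates
df = pd.DataFrame({'Passengers': [-0.082329, -0.040724]},
                  index=pd.to_datetime(['1949-01-01', '1949-02-01']))
df.index.name = 'Month'

# Frame indexed by month numbers
dz = pd.DataFrame({'Passengers': [-0.075844, -0.089111]}, index=[1, 2])
dz.index.name = 'Month'

# Replace the DatetimeIndex with plain month integers, then subtract;
# pandas aligns on the now-matching integer index and the shared column name
df.index = df.index.month
print(df - dz)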
I am reading some .csv files from a folder and trying to create a list of data frames, one from each file.
In some files the column values, i.e. Quantity, are in str and float64 data types. Therefore, I am trying to convert that Quantity column into int.
I am accessing my columns using their position/index (for automation purposes).
Out of all the data frames in the list, this is one of them:
CustName ProductID Quantity
0 56MED 110 '1215.0'
1 56MED 112 5003.0
2 56MED 114 '6822.0'
3 WillSup 2285 5645.0
4 WillSup 5622 6523.0
5 HammSup 9522 1254.0
6 HammSup 6954 5642.0
Therefore, my code looks like this:
df.columns[2] = pd.to_numeric(df.columns[2], errors='coerce').astype(str).astype(np.int64)
I am getting,
TypeError: Index does not support mutable operations
Prior to this, I tried,
df.columns[2] = pd.to_numeric(df.columns[2], errors='coerce').fillna(0).astype(str).astype(np.int64)
However, I got this error,
AttributeError: 'numpy.float64' object has no attribute 'fillna'
There are posts that use column names directly, but not the column position. How can I convert my column into int using the column position/index in pandas?
My pandas version
print(pd.__version__)
>> 0.23.3
df.columns[2] returns a scalar, in this case a string.
To access a series use either df['Quantity'] or df.iloc[:, 2], or even df[df.columns[2]]. Instead of the repeated transformations, if you are sure you have data which should be integers, use downcast='integer'.
All these are equivalent:
df['Quantity'] = pd.to_numeric(df['Quantity'], errors='coerce', downcast='integer')
df.iloc[:, 2] = pd.to_numeric(df.iloc[:, 2], errors='coerce', downcast='integer')
df[df.columns[2]] = pd.to_numeric(df[df.columns[2]], errors='coerce', downcast='integer')
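One thing to double-check with the sample data shown (an observation, not part of the answer): values such as '1215.0' contain literal quote characters, so errors='coerce' will turn them into NaN instead of numbers, which is why the next answer strips the quotes first. A quick sketch:
import pandas as pd

# The embedded quote characters make the first value unparseable,
# so it becomes NaN under errors='coerce'
s = pd.Series(["'1215.0'", "5003.0"])
print(pd.to_numeric(s, errors='coerce'))
# 0       NaN
# 1    5003.0
# dtype: float64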
Try this, you need to remove those quotes from your strings first, then use pd.to_numeric:
df.iloc[:, 2] = pd.to_numeric(df.iloc[:, 2].str.strip('\'')).astype(int)
OR from #jpp:
df['Quantity'] = pd.to_numeric(df['Quantity'].str.strip('\''), errors='coerce', downcast='integer')
Output, df.info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7 entries, 0 to 6
Data columns (total 3 columns):
CustName 7 non-null object
ProductID 7 non-null int64
Quantity 7 non-null int32
dtypes: int32(1), int64(1), object(1)
memory usage: 196.0+ bytes
Output:
CustName ProductID Quantity
0 56MED 110 1215
1 56MED 112 5003
2 56MED 114 6822
3 WillSup 2285 5645
4 WillSup 5622 6523
5 HammSup 9522 1254
6 HammSup 6954 5642
I have some data with the information provided below.
df.info() is below:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6662 entries, 0 to 6661
Data columns (total 2 columns):
value 6662 non-null float64
country 6478 non-null object
dtypes: float64(1), object(1)
memory usage: 156.1+ KB
None
List of the columns:
[u'value' 'country']
The df is below:
value country
0 550.00 USA
1 118.65 CHINA
2 120.82 CHINA
3 86.82 CHINA
4 112.14 CHINA
5 113.59 CHINA
6 114.31 CHINA
7 111.42 CHINA
8 117.21 CHINA
9 111.42 CHINA
--------------------
--------------------
6655 500.00 USA
6656 500.00 USA
6657 390.00 USA
6658 450.00 USA
6659 420.00 USA
6660 420.00 USA
6661 450.00 USA
I need to add another column named outlier and put 1 if the data point is an outlier for that respective country; otherwise, I need to put 0. I emphasize that the outliers need to be computed per respective country and NOT across all countries together.
I found some formulas for calculating the outliers which may be of help, for example:
# keep only the ones that are within +3 to -3 standard deviations
def exclude_the_outliers(df):
df = df[np.abs(df.col - df.col.mean())<=(3*df.col.std())]
return df
def exclude_the_outliers_extra(df):
LOWER_LIMIT = .35
HIGHER_LIMIT = .70
filt_df = df.loc[:, df.columns == 'value']
# Then, computing percentiles.
quant_df = filt_df.quantile([LOWER_LIMIT, HIGHER_LIMIT])
# Next filtering values based on computed percentiles. To do that I use
# an apply by columns and that's it !
filt_df = filt_df.apply(lambda x: x[(x>quant_df.loc[LOWER_LIMIT,x.name]) &
(x < quant_df.loc[HIGHER_LIMIT,x.name])], axis=0)
filt_df = pd.concat([df.loc[:, df.columns != 'value'], filt_df], axis=1)
filt_df.dropna(inplace=True)
return df
I was not able to use those formulas properly for this purpose, but I provide them as a suggestion.
Finally, I will need to count the percentage of outliers for the USA and CHINA present in the data. How can I achieve that?
Note: creating the outlier column with all zeros is easy in pandas and would look like this:
df['outlier'] = 0
However, the issue remains of finding the outliers and overwriting the zeros with 1 for the respective country.
You can slice the dataframe by each country, calculate the quantiles for the slice, and set the value of outlier at the index of the country.
There might be a way to do it without iteration, but it is beyond me.
# using True/False for the outlier, it is the same as 1/0
df['outlier'] = False
# set the quantile limits
low_q = 0.35
high_q = 0.7
# iterate over each country
for c in df.country.unique():
# subset the dataframe where the country = c, get the quantiles
q = df.value[df.country==c].quantile([low_q, high_q])
# at the row index where the country column equals `c` and the column is `outlier`
# set the value to be true or false based on if the `value` column is within
# the quantiles
df.loc[df.index[df.country==c], 'outlier'] = (df.value[df.country==c]
.apply(lambda x: x<q[low_q] or x>q[high_q]))
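For what it's worth, a loop-free variant (an alternative sketch, not part of the answer above) computes the per-country quantile bounds with groupby().transform and compares against them directly:
# Per-country quantile bounds, broadcast back to the original row order
low = df.groupby('country')['value'].transform(lambda s: s.quantile(low_q))
high = df.groupby('country')['value'].transform(lambda s: s.quantile(high_q))

# 1 for outliers, 0 otherwise, computed within each country
df['outlier'] = ((df['value'] < low) | (df['value'] > high)).astype(int)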
Edit: To get the percentage of outliers per country, you can groupby the country column and aggregate using the mean.
gb = df[['country','outlier']].groupby('country').mean()
for row in gb.itertuples():
print('Percentage of outliers for {: <12}: {:.1f}%'.format(row[0], 100*row[1]))
# output:
# Percentage of outliers for China : 54.0%
# Percentage of outliers for USA : 56.0%