I am trying to write a Python function that returns the mean or mode of a pandas DataFrame column, depending on the column's data type. If the column contains strings it should return the mode; if it contains numbers it should return the mean.
This is my code:
def calc_mean_mode(df, column_name):
    mean = round(df[column_name].mean(), 2)
    mode = df[column_name].mode()
    if df[column_name].dtypes == 'O':
        return mode
    else:
        return mean
However, I keep getting a TypeError:
TypeError: can only concatenate str (not "int") to str
Your function computes both values before checking the dtype, so .mean() runs on string columns too, and that is what raises the TypeError. Compute only the value you need after checking the dtype:
def calc_mean_mode(df, column_name):
    if df[column_name].dtypes == 'O':
        output = df[column_name].mode()
    else:
        output = round(df[column_name].mean(), 2)
    return output
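For a quick sanity check, here's the fixed function run against a small made-up DataFrame (the column names and values are hypothetical, just to exercise both branches):

```python
import pandas as pd

def calc_mean_mode(df, column_name):
    # Object dtype ('O') means the column holds strings
    if df[column_name].dtypes == 'O':
        output = df[column_name].mode()
    else:
        output = round(df[column_name].mean(), 2)
    return output

# Toy frame: one string column, one numeric column
df = pd.DataFrame({'city': ['NY', 'NY', 'LA'],
                   'price': [1.0, 2.0, 4.0]})

calc_mean_mode(df, 'price')  # 2.33
calc_mean_mode(df, 'city')   # Series containing 'NY'
```

Note that .mode() returns a Series rather than a scalar, because a column can have more than one mode.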
I have two data frames, ds and dk, and want to merge them on a common column result using the merge command:
result = pd.merge(ds,dk,on='result')
but the result column actually contains dictionaries, resulting in the error:
TypeError: unhashable type: 'dict'
What is a possible solution for merging these frames? Can the result column be changed to a string type and then merged on that column?
I tried using dk['result'] = str(dk['result']) and ds['result'] = str(ds['result']) to convert them and merge, but it did not work (str() on a Series stringifies the whole Series at once, not each element).
Thanks
import pandas as pd

ds = pd.DataFrame({'result': [{'a': 1, 'b': 1}, {'a': 3, 'b': 9}, {'a': 7, 'b': 5}],
                   'a': [0, 1, 0]})
dk = pd.DataFrame({'result': [{'a': 1, 'b': 1}, {'a': 3, 'b': 9}, {'a': 7, 'b': 5}],
                   'b': [2, 2, 2]})

ds['result'] = ds['result'].astype(str)  # transform dict to str
dk['result'] = dk['result'].astype(str)  # transform dict to str
result = pd.merge(ds, dk, on='result')
result['result'] = result['result'].apply(eval)  # convert back to dict
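If you'd rather not call eval on strings, ast.literal_eval is a drop-in replacement here that only parses Python literals and cannot execute arbitrary code. A sketch of the same round-trip with two of the rows:

```python
import pandas as pd
from ast import literal_eval

ds = pd.DataFrame({'result': [{'a': 1, 'b': 1}, {'a': 3, 'b': 9}],
                   'a': [0, 1]})
dk = pd.DataFrame({'result': [{'a': 1, 'b': 1}, {'a': 3, 'b': 9}],
                   'b': [2, 2]})

# Hashable string keys so the merge can proceed
ds['result'] = ds['result'].astype(str)
dk['result'] = dk['result'].astype(str)
result = pd.merge(ds, dk, on='result')

# literal_eval only parses literals, unlike eval
result['result'] = result['result'].apply(literal_eval)
```

Note this matches on the dicts' string form, so the keys must appear in the same insertion order in both frames for the rows to line up.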
Given a list of values or strings, how can I detect whether these are either dates, date and times, or neither?
I have used the pandas api to infer data types but it doesn't work well with dates. See example:
import pandas as pd
def get_redshift_dtype(values):
    dtype = pd.api.types.infer_dtype(values)
    return dtype
This is the result that I'm looking for. Any suggestions on better methods?
# Should return "date"
values_1 = ['2018-10-01', '2018-02-14', '2017-08-01']
# Should return "date"
values_2 = ['2018-10-01 00:00:00', '2018-02-14 00:00:00', '2017-08-01 00:00:00']
# Should return "datetime"
values_3 = ['2018-10-01 02:13:00', '2018-02-14 11:45:00', '2017-08-01 00:00:00']
# Should return "None"
values_4 = ['123098', '213408', '801231']
You can write a function that returns a value depending on the conditions you specify:
def return_date_type(s):
    s_dt = pd.to_datetime(s, errors='coerce')
    if s_dt.isnull().any():
        return 'None'
    elif s_dt.normalize().equals(s_dt):
        return 'date'
    return 'datetime'
return_date_type(values_1) # 'date'
return_date_type(values_2) # 'date'
return_date_type(values_3) # 'datetime'
return_date_type(values_4) # 'None'
You should be aware that pandas datetime series always include a time component: internally they are stored as integers, and when no time is specified it is set to 00:00:00.
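To see that midnight-filling and the normalize() comparison in action on two of the example lists:

```python
import pandas as pd

# Date-only strings still get a time of midnight attached
s = pd.to_datetime(['2018-10-01', '2018-02-14'])
print(s[0])                     # 2018-10-01 00:00:00
print(s.normalize().equals(s))  # True: every time is already midnight

# A real time survives, so normalize() changes the values
t = pd.to_datetime(['2018-10-01 02:13:00'])
print(t.normalize().equals(t))  # False: normalize() reset it to midnight
```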
Here's something that'll give you exactly what you asked for, using re:
import re

classify_dict = {
    'date': r'^\d{4}(-\d{2}){2}$',
    'date_again': r'^\d{4}(-\d{2}){2} 00:00:00$',
    'datetime': r'^\d{4}(-\d{2}){2} \d{2}(:\d{2}){2}$',
}

def classify(mylist):
    key = 'None'
    for k, v in classify_dict.items():
        if all(re.match(v, e) for e in mylist):
            key = k
            break
    if key == 'date_again':
        key = 'date'
    return key
>>> classify(values_2)
'date'
The check is done iteratively with regex, trying to match every item of the list; a key is returned only if all items match. This works for all of the example lists you've given.
For now, the regex does not check that numbers fall within a valid range (e.g. 25:00:00 would still match), but that would be relatively straightforward to implement.
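Repeating the classifier here so the snippet stands alone, this shows that range limitation in action: a shape-valid but impossible time still classifies as 'datetime':

```python
import re

classify_dict = {
    'date': r'^\d{4}(-\d{2}){2}$',
    'date_again': r'^\d{4}(-\d{2}){2} 00:00:00$',
    'datetime': r'^\d{4}(-\d{2}){2} \d{2}(:\d{2}){2}$',
}

def classify(mylist):
    key = 'None'
    for k, v in classify_dict.items():
        if all(re.match(v, e) for e in mylist):
            key = k
            break
    if key == 'date_again':
        key = 'date'
    return key

# The pattern only checks the shape, not the values
print(classify(['2018-10-01 25:99:00']))  # 'datetime' (impossible time)
print(classify(['123098']))               # 'None'
```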
I am new to Python and I would like to extract a string from my data frame. Here is my data frame:
Which state has the most counties in it?
Unfortunately I could not extract a string! Here is my code:
import pandas as pd
census_df = pd.read_csv('census.csv')

def answer_five():
    return census_df[census_df['COUNTY'] == census_df['COUNTY'].max()]['STATE']

answer_five()
How about this:
import pandas as pd
census_df = pd.read_csv('census.csv')

def answer_five():
    """
    Returns the 'STATE' corresponding to the max 'COUNTY' value
    """
    max_county = census_df['COUNTY'].max()
    s = census_df.loc[census_df['COUNTY'] == max_county, 'STATE']
    return s

answer_five()
This should output a pd.Series object featuring the 'STATE' value(s) where 'COUNTY' is maxed. If you only want the value and not the Series (as your question stated, and since in your image there's only one max value for COUNTY), then return s.iloc[0] (instead of return s) should do; note that plain s[0] looks up the index label 0, which may not exist after filtering.
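For example, on a made-up stand-in frame (the real census.csv isn't shown in the question), .iloc[0] pulls the scalar out of the filtered Series regardless of what its index label happens to be:

```python
import pandas as pd

# Hypothetical stand-in for the real census.csv
demo_df = pd.DataFrame({'STATE': ['AL', 'TX', 'CA'],
                        'COUNTY': [67, 254, 58]})

max_county = demo_df['COUNTY'].max()
s = demo_df.loc[demo_df['COUNTY'] == max_county, 'STATE']
print(s.iloc[0])  # TX -- positional access, even though its label is 1
```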
def answer_five():
    return census_df.groupby('STNAME')['COUNTY'].nunique().idxmax()
You can group the data by state name, count the unique counties in each group, and return the index (the state name) of the maximum count.
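Sketched on a tiny hypothetical frame with one row per county:

```python
import pandas as pd

# Made-up rows: each row is one county, keyed by its state name
demo_df = pd.DataFrame({
    'STNAME': ['Texas', 'Texas', 'Texas', 'Alabama', 'Alabama'],
    'COUNTY': [1, 3, 5, 1, 3],
})

# Distinct county codes per state, then the state with the most
print(demo_df.groupby('STNAME')['COUNTY'].nunique().idxmax())  # Texas
```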
I had the same issue; for some reason, using .item() managed to extract the exact value I needed.
In your case it would look like:
return census_df[census_df['COUNTY'] == census_df['COUNTY'].max()]['STATE'].item()
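One caveat worth a small sketch: .item() only works when the selection holds exactly one value, so if two states tied for the max it would raise a ValueError:

```python
import pandas as pd

s = pd.Series(['TX'])
print(s.item())  # TX

tie = pd.Series(['TX', 'CA'])
try:
    tie.item()  # more than one element
except ValueError:
    print('item() fails on more than one element')
```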
I'm working with a csv file that has the following format:
"Id","Sequence"
3,"1,3,13,87,1053,28576,2141733,508147108,402135275365,1073376057490373,9700385489355970183,298434346895322960005291,31479360095907908092817694945,11474377948948020660089085281068730"
7,"1,2,1,5,5,1,11,16,7,1,23,44,30,9,1,47,112,104,48,11,1,95,272,320,200,70,13,1,191,640,912,720,340,96,15,1,383,1472,2464,2352,1400,532,126,17,1,767,3328,6400,7168,5152,2464,784,160,19,1,1535,7424"
8,"1,2,4,5,8,10,16,20,32,40,64,80,128,160,256,320,512,640,1024,1280,2048,2560,4096,5120,8192,10240,16384,20480,32768,40960,65536,81920,131072,163840,262144,327680,524288,655360,1048576,1310720,2097152"
11,"1,8,25,83,274,2275,132224,1060067,3312425,10997342,36304451,301432950,17519415551,140456757358,438889687625,1457125820233,4810267148324,39939263006825,2321287521544174,18610239435360217"
I'd like to read this into a data frame with the type of df['Id'] to be integer-like and the type of df['Sequence'] to be list-like.
I currently have the following kludgy code:
import pandas as pd

def clean(seq_string):
    return list(map(int, seq_string.split(',')))

# Read data
training_data_file = "data/train.csv"
train = pd.read_csv(training_data_file)
train['Sequence'] = list(map(clean, train['Sequence'].values))
This appears to work, but I feel like the same could be achieved natively using pandas and numpy.
Does anyone have a recommendation?
You can specify a converter for the Sequence column. From the read_csv docs:

    converters : dict, default None
        Dict of functions for converting values in certain columns.
        Keys can either be integers or column labels.

train = pd.read_csv(training_data_file, converters={'Sequence': clean})
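A quick self-contained check, substituting an in-memory CSV (via io.StringIO) for the real train.csv:

```python
import io
import pandas as pd

def clean(seq_string):
    return list(map(int, seq_string.split(',')))

# Shortened stand-in for data/train.csv
csv_data = '"Id","Sequence"\n3,"1,3,13"\n7,"1,2,1"\n'
train = pd.read_csv(io.StringIO(csv_data), converters={'Sequence': clean})

print(train['Sequence'].iloc[0])  # [1, 3, 13] -- a real list of ints
print(train['Id'].dtype)          # int64
```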
This also works, except that Sequence becomes a list of strings instead of a list of ints:
df = pd.read_csv(training_data_file)
df['Sequence'] = df['Sequence'].str.split(',')
To convert each element to int:
df = pd.read_csv(training_data_file)
df['Sequence'] = df['Sequence'].str.split(',').apply(lambda s: list(map(int, s)))
An alternative solution is to use literal_eval from the ast module. literal_eval parses the string as a Python literal, so it gives you back the values without the security risk of eval:

from ast import literal_eval

def clean(x):
    return literal_eval(x)

train = pd.read_csv(training_data_file, converters={'Sequence': clean})
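One subtlety worth knowing: literal_eval on a bare comma-separated string yields a tuple, not a list, so wrap the result in list() if you need list semantics:

```python
from ast import literal_eval

# A bare comma-separated literal parses as a tuple
print(literal_eval('1,3,13'))        # (1, 3, 13)
print(list(literal_eval('1,3,13')))  # [1, 3, 13]
```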