I would like to extract the mode value of a given column from a CSV file.
The code I've tried:
def mode_LVL(self):
    data = pd.read_csv('highscore.csv', sep=',')
    mode_lvl = data["LVL"].mode()
    return mode_lvl
Results in:
The mode value of LVL: 0    6
dtype: int64
I would like the mode value only, without the index 0 and the dtype line.
I have attempted to resolve it with this, but failed:
mode_lvl = data.mode(axis = 'LVL', numeric_only=True )
Sorry, I know this issue may be simple to solve, but I've had trouble searching for the right solution.
Here it is necessary to select the first value of the mode, because mode can return multiple values if several categories share the same top count:
mode_lvl = data["LVL"].mode().iat[0]
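For illustration, a quick hypothetical example of mode() returning more than one value:
import pandas as pd

s = pd.Series([1, 1, 2, 2, 3])
print(s.mode())         # both 1 and 2 appear twice, so mode() returns a two-element Series
print(s.mode().iat[0])  # 1 -> .iat[0] selects the first mode by position, as a plain scalar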
Some background: I'm taking a machine learning class on customer segmentation. My environment is Python with pandas and sklearn. I have two datasets, a general population dataset and a customer demographics dataset, with 85 identical columns.
I'm calling a function I created to run preprocessing steps on the 'customers' data, steps that were previously run outside this function on the general population dataset. Within the function is a loop that replaces missing values with np.nan. Here is the loop:
# replacing missing data with NaNs
# feat_sum is a dataframe (feature_summary) of coded values
for i in range(len(feat_sum)):
    mi_unk = feat_sum.iloc[i]['missing_or_unknown']  # locate column and values
    mi_unk = mi_unk.strip('[').strip(']').split(',')  # strip the brackets, then split
    mi_unk = [int(val) if (val != '' and val != 'X' and val != 'XX') else val for val in mi_unk]
    if mi_unk != ['']:
        featsum_attrib = feat_sum.iloc[i]['attribute']
        df = df.replace({featsum_attrib: mi_unk}, np.nan)
Toward the end of the function I'm engineering new variables:
# Investigate "CAMEO_INTL_2015" and engineer two new variables.
df['WEALTH'] = df['CAMEO_INTL_2015']
df['LIFE_STAGE'] = df['CAMEO_INTL_2015']
mf_wealth_dict = {'11': 1, '12': 1, '13': 1, '14': 1, '15': 1,
                  '21': 2, '22': 2, '23': 2, '24': 2, '25': 2,
                  '31': 3, '32': 3, '33': 3, '34': 3, '35': 3,
                  '41': 4, '42': 4, '43': 4, '44': 4, '45': 4,
                  '51': 5, '52': 5, '53': 5, '54': 5, '55': 5}
mf_lifestage_dict = {'11': 1, '12': 2, '13': 3, '14': 4, '15': 5,
                     '21': 1, '22': 2, '23': 3, '24': 4, '25': 5,
                     '31': 1, '32': 2, '33': 3, '34': 4, '35': 5,
                     '41': 1, '42': 2, '43': 3, '44': 4, '45': 5,
                     '51': 1, '52': 2, '53': 3, '54': 4, '55': 5}
# replacing the 'WEALTH' and 'LIFE_STAGE' columns with values from the dictionaries
df['WEALTH'].replace(mf_wealth_dict, inplace=True)
df['LIFE_STAGE'].replace(mf_lifestage_dict, inplace=True)
Near the end of the project code, I'm running an imputer to replace the np.nan values; it ran successfully on the general population dataset (azdias):
az_imp = Imputer(strategy="most_frequent")
azdias_cleaned_imp = pd.DataFrame(az_imp.fit_transform(azdias_cleaned_encoded))
So when I call the clean_data function on the 'customers' dataframe, clean_data(customers), it gives me the ValueError: could not convert str to float: 'XX' on the second of these lines:
customers_imp = Imputer(strategy="most_frequent")
customers_cleaned_imputed = pd.DataFrame(customers_imp.fit_transform(customers_cleaned_encoded))
In the data dictionary for the CAMEO_INTL_2015 column, the very last category is 'XX': unknown. When I run a value count on the WEALTH and LIFE_STAGE columns, there are 124 occurrences of 'XX' in those two columns; no other columns in the dataset have the 'XX' value. Again, with the other dataset I did not run into this problem. I know this is wordy, but any help is appreciated, and I can provide the project code as well.
A mentor and I tried troubleshooting by looking at all the steps performed on both datasets, to no avail. I was expecting the 'XX' values to be dealt with by the loop I mentioned earlier.
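One thing worth checking (a hedged guess, not a confirmed fix): neither mf_wealth_dict nor mf_lifestage_dict contains an 'XX' key, so the replace() calls above pass those 124 string values through untouched, and they reach the imputer as-is. A minimal sketch of mapping them to NaN first, using the column names from the code above:
import numpy as np

# 'XX' is not a key in either mapping dict, so it survives the replace;
# converting it to NaN lets the most_frequent imputer deal with it
df['WEALTH'] = df['WEALTH'].replace('XX', np.nan)
df['LIFE_STAGE'] = df['LIFE_STAGE'].replace('XX', np.nan)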
I'd like to use .ftr files to quickly analyze hundreds of tables. Unfortunately I have some problems with the decimal and thousands separators, similar to that post, except that read_feather does not allow decimal=',', thousands='.' options. I've tried the following approaches:
df['numberofx'] = (
    df['numberofx']
    .apply(lambda x: x.str.replace(".", "", regex=True)
                      .str.replace(",", ".", regex=True))
)
resulting in
AttributeError: 'str' object has no attribute 'str'
when I change it to
df['numberofx'] = (
    df['numberofx']
    .apply(lambda x: x.replace(".", "").replace(",", "."))
)
I receive some strange (rounding) mistakes in the results, like 22359999999999998 instead of 2236 for some numbers higher than 1k. Everything below 1k comes out ten times the real result, probably because deleting the "." of the float creates an int of that number.
Trying
df['numberofx'] = df['numberofx'].str.replace('.', '', regex=True)
also leads to some strange behavior in the results, as some numbers end up around 10^12 while others remain at 10^3 as they should.
Here is how I create my .ftr files from multiple Excel files (code below). I know I could simply create DataFrames from the Excel files, but that would slow down my daily calculations too much.
How can I solve that issue?
EDIT: The issue seems to come from reading in an Excel file with non-US decimal and thousands separators and then saving it as feather. Using pd.read_excel(f, encoding='utf-8', decimal=',', thousands='.') to read in the Excel file solved my issue. That leads to the next question:
Why does saving floats in a feather file lead to strange rounding errors, like changing 2.236 to 2.2359999999999998?
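One plausible explanation (a sketch, not feather-specific behavior): 2.236 has no exact binary IEEE 754 double representation, and repr() usually prints the shortest string that round-trips, which hides the stored value:
# forcing 17 significant digits reveals the double actually stored for
# the literal 2.236 - presumably the value the feather reader displayed
print(format(2.236, '.17g'))  # 2.2359999999999998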
The problem in your code is that the column's dtype is object. If you check the column type in the DataFrame (pandas):
df.dtypes['numberofx']
the result is object. So the suggested solution is to try:
df['numberofx'] = df['numberofx'].apply(pd.to_numeric, errors='coerce')
Another way to fix this problem is to coerce your values to float:
def coerce_to_float(val):
    try:
        return float(val)
    except ValueError:
        return val

# applymap only exists on DataFrames; use apply for a single Series column
df['numberofx'] = df['numberofx'].apply(coerce_to_float)
To avoid scientific-notation floats like 4.806105e+12, here is a sample:
df = pd.DataFrame({'numberofx': ['4806105017087', '4806105017087', 'CN414149']})
print (df)
       numberofx
0  4806105017087
1  4806105017087
2       CN414149

print (pd.to_numeric(df['numberofx'], errors='coerce'))
0    4.806105e+12
1    4.806105e+12
2             NaN
Name: numberofx, dtype: float64

df['numberofx'] = pd.to_numeric(df['numberofx'], errors='coerce').fillna(0).astype(np.int64)
print (df['numberofx'])
0    4806105017087
1    4806105017087
2                0
Name: numberofx, dtype: int64
As mentioned in my edit, here is what solved my initial problem:
import glob
import pandas as pd

path = r"pathname\*_somename*.xlsx"
file_list = glob.glob(path)
for f in file_list:
    df = pd.read_excel(f, encoding='utf-8', decimal=',', thousands='.')
    # cast mixed-type or list-valued columns to str so feather can serialize them
    for col in df.columns:
        w = (df[[col]].applymap(type) != df[[col]].iloc[0].apply(type)).any(axis=1)
        if len(df[w]) > 0:
            df[col] = df[col].astype(str)
        if df[col].dtype == list:
            df[col] = df[col].astype(str)
    pathname = f[:-4] + "ftr"
    df.to_feather(pathname)
df.head()
I had to add the decimal=',', thousands='.' options when reading in an Excel file, which I later saved as feather. So the problem did not arise when working with the .ftr files, but before. The rounding problems seem to come from saving numbers that were read with different decimal and thousands separators to .ftr files.
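For completeness, reading one of the saved tables back is then a one-liner (hypothetical filename); feather returns exactly the dtypes fixed at write time, so the separator handling has to happen before to_feather():
import pandas as pd

df = pd.read_feather('somename.ftr')  # columns come back with the dtypes they were saved with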
I have a dataframe in Python using pandas. It has 2 columns called 'dropoff_latitude' and 'pickup_latitude'. I want to make a function that will create a 3rd column based on these 2 variables (runs them through an API).
So I wrote a function:
def dropoff_info(row):
    dropoff_latitude = row['dropoff_latitude']
    dropoff_longitude = row['dropoff_longitude']
    dropoff_url2 = "http://data.fcc.gov/api/block/find?format=json&latitude=%s&longitude=%s&showall=true" % (dropoff_latitude, dropoff_longitude)
    dropoff_resp2 = requests.get(dropoff_url2)
    dropoff_results2 = json.loads(dropoff_resp2.text)
    dropoffinfo = dropoff_results2["Block"]["FIPS"][2:11]
    return dropoffinfo
then I would run it as
df['newcolumn'] = dropoff_info(df)
However it doesn't work.
Upon troubleshooting I find that when I print dropoff_latitude it looks like this:
0    40.773345947265625
1    40.762149810791016
2    40.770393371582031
...
And so I think that the URL can't get generated. I want dropoff_latitude to look like this when printed:
40.773345947265625
40.762149810791016
40.770393371582031
...
And I don't know how to specify that I want just the values themselves, without the index.
When I tried
dropoff_latitude = row['dropoff_latitude'][1]
dropoff_longitude = row['dropoff_longitude'][1]
It just gave me the values from the 1st row so that obviously didn't work.
Ideas please? I am very new to working with dataframes... Thank you!
Alex - with pandas we typically like to avoid loops, but in your particular case, the need to ping a remote server for data pretty much requires it. So I'd do something like the following:
import json
import requests

l = []
for i in df.index:
    dropoff_latitude = df.loc[i, 'dropoff_latitude']
    dropoff_longitude = df.loc[i, 'dropoff_longitude']
    dropoff_url2 = "http://data.fcc.gov/api/block/find?format=json&latitude=%s&longitude=%s&showall=true" % (dropoff_latitude, dropoff_longitude)
    dropoff_resp2 = requests.get(dropoff_url2)
    dropoff_results2 = json.loads(dropoff_resp2.text)
    l.append(dropoff_results2["Block"]["FIPS"][2:11])
df['new'] = l
The key here is the .loc[i, ...] bit that gives you the ability to go through each row one by one, and call out the associated column to create the variables to send to your API.
Regarding your question about a drain on your memory - that's a little above my pay-grade, but I really don't think you have any other options in this case (unless your API has some kind of batch request that allows you to pull a larger data set in one call).
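Worth adding: since dropoff_info already takes a row, the function from the question could also be applied row-wise without the explicit loop (same caveat of one HTTP request per row):
# axis=1 passes each row to the function as a Series, so the
# row['dropoff_latitude'] lookups inside dropoff_info work unchanged
df['newcolumn'] = df.apply(dropoff_info, axis=1)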
I am learning Python's pandas library using Kaggle's Titanic tutorial. I am trying to create a function which will calculate the percentage of nulls in a column.
My attempt below appears to print the entire dataframe, instead of null values in the specified column:
def null_percentage_calculator(df, nullcolumn):
    df_column_null = df[nullcolumn].isnull().sum()
    df_column_null_percentage = np.ceil((df_column_null / testtotal) * 100)
    print("{} percent of {} {} are NaN values".format(df_column_null_percentage, df, nullcolumn))

null_percentage_calculator(train, "Age")
My previous (and very first Stack Overflow) question was a similar problem, and it was explained to me that the .index method in pandas is undesirable and that I should use other methods like [] and .loc to explicitly refer to the column.
So I have tried this:
df_column_null=[df[nullcolumn]].isnull().sum()
I have also tried
df_column_null=df[nullcolumn]df[nullcolumn].isnull().sum()
I am struggling to understand this aspect of pandas. My non-function method works fine:
Train_Age_Nulls = train["Age"].isnull().sum()
Train_Age_Nulls_percentage = (Train_Age_Nulls/traintotal)*100
Train_Age_Nulls_percentage_rounded = np.ceil(Train_Age_Nulls_percentage)
print("{} percent of Train's Age are NaN values".format(Train_Age_Nulls_percentage_rounded))
Could anyone let me know where I am going wrong?
def null_percentage_calculator(df, nullcolumn):
    df_column_null = df[nullcolumn].isnull().sum()
    df_column_null_percentage = np.ceil((df_column_null / testtotal) * 100)
    # what is testtotal?
    print("{} percent of {} {} are NaN values".format(df_column_null_percentage, df, nullcolumn))
Note that passing df itself into format() is what prints the entire dataframe. I would do this with:
def null_percentage_calculator(df, nullcolumn):
    nulls = df[nullcolumn].isnull().sum()
    pct = float(nulls) / len(df[nullcolumn])  # need float because of Python 2 division
    # multiply by 100 if you must have a percentage
    print("{} percent of column {} are null".format(pct * 100, nullcolumn))
Beware of Python 2 integer division, where 63/180 = 0: if you want a float out, you have to put a float in.
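A two-line illustration with the numbers above:
print(63 / 180)         # 0 under Python 2 integer division
print(float(63) / 180)  # 0.35 once either operand is a float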
I am using pandas to read in a csv column where every row has the following format:
IP: XXX:XX:XX:XXX
To get rid of the IP: prefix, I am editing the column after the fact:
logs['ip'] = logs['ip'].str[4:]
Is there a way to perform this operation within read_csv, maybe with regex, to avoid the post-processing?
Update: consider this scenario where there are multiple columns that have these prefixes – is there a better way?
logs['mac'] = logs['mac'].str[5:]
logs['id'] = logs['id'].str[4:]
logs['lan'] = logs['lan'].str[5:]
logs['ip'] = logs['ip'].str[4:]
The converters option for read_csv might provide a useful way. Let's say the file looks like this:
id address
1 IP:123.1.1.1
2 IP:456.1.1.1
3 IP:789.1.1.1
Then you could specify that 'IP:' should be converted to '' (blank) like this:
dct = { 'address': lambda x: x.replace('IP:','') }
df = pd.read_csv( 'foo.txt', delimiter=' *', converters=dct )
id address
0 1 123.1.1.1
1 2 456.1.1.1
2 3 789.1.1.1
I'm ignoring the slight complication that if there is a space after IP: then you might be reading IP: in as its own column, but you ought to be able to adapt this fairly easily to handle that.
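For the multiple-column scenario in the update, the same converters idea extends naturally; a hedged sketch, assuming each field looks like 'PREFIX: value' and a hypothetical file name:
strip_prefix = lambda x: x.split(': ', 1)[-1]  # keep everything after the first ': '
cols = ['mac', 'id', 'lan', 'ip']
logs = pd.read_csv('logs.csv', converters={c: strip_prefix for c in cols})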
You could just convert the CSV column to a string, then use .split("IP: ")[1] on the string, which will contain everything except "IP: ". I'm not sure if this is the best approach, but it's what came to mind. In pandas terms that would look roughly like:
logs['ip'] = logs['ip'].astype(str).str.split("IP: ").str[1]