I'm trying to get the minimum of the values in a column of times. If I take a subset of the data, I'm able to do it:
print(df7.loc[3, 'START_TIME'].min())
type(df7.loc[3, 'START_TIME'].min())
output is returned correctly:
09:17:09
str
But if I try on the entire column this error is returned:
print(df7['START_TIME'].min())
output:
TypeError: unorderable types: str() <= float()
So there is some bad data that is tripping up the min method. Is there any way to call the method and skip the bad data?
It seems to me that you have both floats and strings in that one column.
See if this works:
print(df7['START_TIME'].astype(str).min())
If it does, then you also have floats in that column. You want to find them and deal with them.
my_floats_indices = [i for i, v in df7['START_TIME'].items() if isinstance(v, float)]
Then look at them with
df7.loc[my_floats_indices, 'START_TIME']
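If those floats turn out to be NaN placeholders, here's a minimal sketch of skipping them while taking the min (assuming every valid entry is an 'HH:MM:SS' string, as in the question):
# Keep only the string entries, then take the min; any float NaNs are skipped
valid_times = df7['START_TIME'][df7['START_TIME'].apply(lambda v: isinstance(v, str))]
print(valid_times.min())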
See if you can fix your problem. Hope that helps.
So I am reading in data from a CSV and saving it to a dataframe so I can use the columns. Here is my code:
filename = open(r"C:\Users\avalcarcel\Downloads\Data INSTR 9 8_16_2022 11_02_42.csv")
columns = ["date","time","ch104","alarm104","ch114","alarm114","ch115","alarm115","ch116","alarm116","ch117","alarm117","ch118","alarm118"]
df = pd.read_csv(filename,sep='[, ]',encoding='UTF-16 LE',names=columns,header=15,on_bad_lines='skip',engine='python')
length_ = len(df.date)
scan = list(range(1,length_+1))
plt.plot(scan,df.ch104)
plt.show()
When I try to plot scan vs. df.ch104, I get the following exception thrown:
'value' must be an instance of str or bytes, not a None
So what I thought to do was make each column in my df a list:
ch104 = df.ch104.tolist()
But it is changing my data from this:
[screenshot: data before .tolist()]
To this:
[screenshot: data after .tolist()]
This also happens when I use df.ch104.values.tolist()
Can anyone help me? I haven't used python/pandas in a while and I am just trying to get the data read in first. Thanks!
So, the df.ch104.values.tolist() call basically turns your column into a 2D 1×N array, but what you want is a 1D array of size N. Take [0] to pull the inner list of N values out:
df.ch104.values.tolist()[0]
Might I also suggest you include dropna() to avoid the 'value' must be an instance of str or bytes, not a None error:
df.dropna(subset=['ch104']).ch104.values.tolist()[0]
The error clearly says there are None or NaN values in your dataframe. You need to check for them and deal with them: replace them with a suitable value or delete those rows.
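As a minimal sketch of that cleanup, continuing the question's df (pd.to_numeric and dropna are standard pandas; treating ch104 as numeric is an assumption):
df['ch104'] = pd.to_numeric(df['ch104'], errors='coerce')  # unparseable values become NaN
df = df.dropna(subset=['ch104'])                           # drop the rows that had them
plt.plot(range(1, len(df) + 1), df['ch104'])
plt.show()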
[screenshot: dataset]
I'm trying to check for a win from the WINorLOSS column, but I'm getting the following error:
[screenshot: code and error message]
The variable combined.WINorLOSS seems to be a Series object, and you can't compare an iterable (like a list, dict, Series, etc.) with a string value. I think you meant to do:
for i in combined.WINorLOSS:
    if i == 'W':
        hteamw += 1
    else:
        ateamw += 1
You can't compare a Series of values (like your WINorLOSS dataframe column) to a single string value. However, you can use the following to count the 'W' and 'L' values in your column:
hteamw = combined['WINorLOSS'].value_counts()['W']
hteaml = combined['WINorLOSS'].value_counts()['L']
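To avoid computing value_counts twice, and to survive a column where one label never appears, a small variation (the .get fallback is my addition, not part of the original answer):
counts = combined['WINorLOSS'].value_counts()
hteamw = counts.get('W', 0)  # .get returns 0 instead of raising KeyError for a missing label
hteaml = counts.get('L', 0)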
I'm working with some CSV files and using pandas to turn them into a dataframe. After that, I use an input to find values to delete.
I'm hung up on one small issue: for some columns it's adding '.0' to the values in the column. It only does this in columns with numbers, so I'm guessing it's reading the column as a float. How do I prevent this from happening?
The part that really confuses me is that it only happens in a few columns, so I can't quite figure out a pattern. I need to chop off the ".0" so I can re-import it, and I feel like it would be easiest to prevent it from happening in the first place.
Thanks!
Here's a sample of my code:
clientid = int(input('What client ID needs to be deleted?'))
df1 = pd.read_csv('Client.csv')
clientclean = df1.loc[df1['PersonalID'] != clientid]
clientclean.to_csv('Client.csv', index=None)
Ideally, I'd like all of the values to be the same as the original csv file, but without the rows with the clientid from the user input.
If PersonalID is the header of the problematic column, try this:
import numpy as np

df1 = pd.read_csv('Client.csv', dtype={'PersonalID': np.int32})
Edit:
Since an integer column cannot hold NaN values, you can try this on each problematic column:
df1[col] = df1[col].fillna(-9999) # or 0 or any value you want here
df1[col] = df1[col].astype(int)
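Alternatively, pandas 0.24+ has a nullable integer dtype (capital-I 'Int64') that can hold missing values directly, so no sentinel fill is needed; a sketch:
df1 = pd.read_csv('Client.csv', dtype={'PersonalID': 'Int64'})  # nullable integer, tolerates NaN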
You could go through each value and, if it is a number x, subtract int(x) from it; if the difference is 0.0, convert the number x to int(x). Or, if you're not dealing with any non-integers at all, you could just convert all values that are numbers to ints.
For an example of the latter (when your original data does not contain any non-integer numbers):
for r, (index, row) in enumerate(df1.iterrows()):
    for c, x in enumerate(row):
        if isinstance(x, float):
            df1.iat[r, c] = int(x)  # iat takes positional indices
For an example of the former (if you want to keep non-integer numbers as non-integer numbers, but want to guarantee that integer numbers stay as integers):
import sys

for c, col in enumerate(df1.columns):
    foundNonInt = False
    for r, index in enumerate(df1.index):
        x = df1.iat[r, c]  # fetch the cell value (the original never assigned x)
        if isinstance(x, float):
            if abs(x - int(x)) > sys.float_info.epsilon:
                foundNonInt = True
                break
    if not foundNonInt:
        df1.iloc[:, c] = df1.iloc[:, c].astype(int)  # int() on a whole column would fail
Note, the above method is not fool-proof: if, by chance, a genuinely non-integer column in the original data happens to contain only values that are exactly x.0000000 to the last decimal place, it will be converted to ints anyway.
It was a datatype issue.
ALollz's comment led me in the right direction. Pandas was assuming a data type of float, which added the decimal points.
I specified the datatype as object (from Akarius's comment) when using read_csv, which resolved the issue.
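For reference, a minimal sketch of that fix; dtype=object makes pandas keep every cell as the raw string from the file:
df1 = pd.read_csv('Client.csv', dtype=object)  # no numeric inference, so no trailing '.0'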
I have gone through all the posts on the website and am not able to find a solution to my problem.
I have a dataframe with 15 columns. Some of them come with None or NaN values. I need help writing the if-else condition.
If the value in the column is neither None nor NaN, I need to format the datetime column. The current code is below:
for index, row in df_with_job_name.iterrows():
    start_time = df_with_job_name.loc[index, 'startTime']
    if not df_with_job_name.isna(df_with_job_name.loc[index, 'startTime']):
        start_time_formatted = datetime(*map(int, re.split(r'[^\d]', start_time)[:-1]))
The error that I am getting is
if not df_with_job_name.isna(df_with_job_name.loc[index,'startTime']) :
TypeError: isna() takes exactly 1 argument (2 given)
A direct way to take care of missing/invalid values is probably:
def is_valid(val):
    if val is None:
        return False
    try:
        return not math.isnan(val)
    except TypeError:
        return True
and of course you'll have to import math.
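A sketch of how the helper might slot into the question's loop (the formatting line is copied from the question):
for index, row in df_with_job_name.iterrows():
    start_time = df_with_job_name.loc[index, 'startTime']
    if is_valid(start_time):
        start_time_formatted = datetime(*map(int, re.split(r'[^\d]', start_time)[:-1]))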
Also, isna is invoked without any argument and returns a dataframe of Boolean values (see the documentation). You can iterate through both dataframes to determine whether a value is valid.
isna takes your entire data frame as the instance argument (that's self, if you're already familiar with classes) and returns a data frame of Boolean values, True where that value is invalid. You tried to specify the individual value you're checking as a second input argument. isna doesn't work that way; it takes empty parentheses in the call.
You have a couple of options. One is to follow the individual checking tactics here. The other is to make the map of the entire data frame and use that:
null_map_df = df_with_job_name.isna()

for index, row in df_with_job_name.iterrows():
    if not null_map_df.loc[index, 'startTime']:
        start_time = df_with_job_name.loc[index, 'startTime']
        start_time_formatted = datetime(*map(int, re.split(r'[^\d]', start_time)[:-1]))
Please check my use of row & column indices; the index, row handling doesn't look right. Also, you should be able to apply an any operation to the entire row at once.
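For instance, a sketch of that vectorized route, assuming only the startTime column matters here:
valid = df_with_job_name['startTime'].notna()  # one Boolean per row, no per-cell checks
for start_time in df_with_job_name.loc[valid, 'startTime']:
    start_time_formatted = datetime(*map(int, re.split(r'[^\d]', start_time)[:-1]))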
I have a Dataframe that includes binary variables about respondents' behavior and the weight associated with each respondent. I'd like to multiply the scores by each respondents' weight so I can easily get a weighted average for the total behavior.
The easiest thing would be to multiply the weight column against another column in a loop, as in df.columns[761]*df.columns[i]. However, when I try to do so, it throws an error of:
'can't multiply sequence by non-int of type 'str'.'
I shouldn't have any strings, but on the off chance there are, I tried to convert the df to numeric, like so: df.apply(pd.to_numeric, errors='coerce').
But the problem still remains. I'm at my wits' end. Is there a workaround? Should I go row by row (and if so, do I need to loop through every column, or is there a nice clean way?).
You could always break apart your dataframe.
for col in df.columns:
    for index, k in enumerate(df[col]):
        try:
            float(k)
        except (ValueError, TypeError):
            # Print out the row number, col and row value that's failing
            print(index, col, k)
It's entirely possible you've got strings or None values that are causing your multiplication to fail.
There's also df[col].apply(float), but it won't show you which rows are the errant ones.
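One more hedged note: pd.to_numeric returns a new object rather than modifying the dataframe in place, so the coercion attempt from the question has to be assigned back before multiplying (the column position 761 echoes the question; treating the remaining columns as scores is an assumption):
df = df.apply(pd.to_numeric, errors='coerce')             # assign the coerced result back
weighted = df.iloc[:, :761].mul(df.iloc[:, 761], axis=0)  # multiply each score column by the weight column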