Pandas adding decimal points when using read_csv - python

I'm working with some csv files and using pandas to turn them into a dataframe. After that, I use an input to find values to delete.
I'm hung up on one small issue: for some columns it's adding ".0" to the values in the column. It only does this in columns with numbers, so I'm guessing it's reading the column as a float. How do I prevent this from happening?
The part that really confuses me is that it only happens in a few columns, so I can't quite figure out a pattern. I need to chop off the ".0" so I can re-import it, and I feel like it would be easiest to prevent it from happening in the first place.
Thanks!
Here's a sample of my code:
clientid = int(input('What client ID needs to be deleted?'))
df1 = pd.read_csv('Client.csv')
clientclean = df1.loc[df1['PersonalID'] != clientid]
clientclean.to_csv('Client.csv', index=None)
Ideally, I'd like all of the values to be the same as the original csv file, but without the rows with the clientid from the user input.

If PersonalID is the header of the problematic column, try this:
import numpy as np

df1 = pd.read_csv('Client.csv', dtype={'PersonalID': np.int32})
Edit:
Since an integer column cannot hold NaN, fill missing values first. You can try this on each problematic column:
df1[col] = df1[col].fillna(-9999) # or 0 or any value you want here
df1[col] = df1[col].astype(int)
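Alternatively, a sketch using pandas' nullable integer dtype (available since pandas 0.24), which can hold missing values without falling back to float, so no sentinel like -9999 is needed (this assumes the column holds only whole numbers and gaps):
# 'Int64' (capital I) is the nullable integer dtype: missing values become
# pd.NA and whole numbers stay integers, so no ".0" appears on write.
df1[col] = df1[col].astype('Int64')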

You could go through each value and, if it is a number x, subtract int(x) from it; if the difference is 0.0, convert x to int(x). Or, if you're not dealing with any non-integers at all, you could just convert all values that are numbers to ints.
For an example of the latter (when your original data does not contain any non-integer numbers):
for index, row in df1.iterrows():
    for c, x in enumerate(row):
        if isinstance(x, float):
            df1.iloc[index, c] = int(x)
For an example of the former (if you want to keep non-integer numbers as non-integer numbers, but want to guarantee that integer numbers stay as integers):
import sys

for c, col in enumerate(df1.columns):
    foundNonInt = False
    for r, index in enumerate(df1.index):
        x = df1.iloc[r, c]
        if isinstance(x, float):
            if x - int(x) > sys.float_info.epsilon:
                foundNonInt = True
                break
    if not foundNonInt:
        df1.iloc[:, c] = df1.iloc[:, c].astype(int)
Note, the above method is not fool-proof: if, by chance, a column that genuinely holds non-integer numbers happens to contain only values of the form x.000... (integral all the way to the last decimal place), it will be wrongly converted to int.
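A more robust, vectorized sketch of the same check, assuming you only care about float columns (the 'Int64' cast keeps any NaN):
# A float column is "really" integer if every non-null value
# has no fractional remainder.
for col in df1.select_dtypes(include='float').columns:
    vals = df1[col].dropna()
    if (vals % 1 == 0).all():
        df1[col] = df1[col].astype('Int64')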

It was a datatype issue.
ALollz's comment led me in the right direction: pandas was inferring a float dtype, which added the decimal points.
I specified the dtype as object (from Akarius's comment) when using read_csv, which resolved the issue.
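For reference, a minimal sketch of that fix, assuming every column should be read verbatim as text:
# Reading everything as object (i.e., strings) stops pandas from
# inferring float and appending ".0" to whole numbers.
df1 = pd.read_csv('Client.csv', dtype=object)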

Related

How to fix "ValueError: could not convert string to float"

I am going to train an SVM based on an X dataframe and a y series I have.
X dataframe is shown below:
x:
Timestamp    Location of sensors    Pressure or Flow values
0.00000      138.22, 1549.64        28.92
0.08333      138.22, 1549.64        28.94
0.16667      138.22, 1549.64        28.96
In the X dataframe, sensor locations are represented as node coordinates.
Y series is shown below:
y:
0
0
0
But when I fit the SVM to the training set, it returned a ValueError: could not convert string to float: '361.51,1100.77', where (361.51, 1100.77) are the coordinates of a node.
Could you please give me some ideas for fixing this problem?
I'd appreciate any advice.
'361.51,1100.77' is actually two numbers, right? A latitude (361.51) and a longitude (1100.77). You would first need to split it into two strings. Here is one way to do it:
data = pd.DataFrame(data=[[0, "138.22,1549.64", 28.92]], columns=["Timestamp", "coordinate", "flow"])
data["latitude"] = data["coordinate"].apply(lambda x: float(x.split(",")[0]))
data["longitude"] = data["coordinate"].apply(lambda x: float(x.split(",")[1]))
This will give you two new columns in your dataframe each with a float of the values in the string.
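A more vectorized sketch of the same idea, assuming every cell contains exactly one comma:
# Split the coordinate string on the comma into two columns
# and convert both to float in one shot.
data[["latitude", "longitude"]] = (
    data["coordinate"].str.split(",", expand=True).astype(float)
)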
I'm assuming that you are trying to convert the entire string "361.51,1100.77" into a float, and you can see why that's a problem: Python sees two decimal points with a comma in between, so it has no idea what to do.
Assuming you want the numbers to be separate, you could do something like this:
myStr = "361.51,1100.77"
x = float(myStr[0:myStr.index(",")])
y = float(myStr[myStr.index(",")+1:])
print(x)
print(y)
Which would get you an output of
361.51
1100.77
Assigning x to be myStr[0:myStr.index(",")] takes a substring of the original string from the start up to the first occurrence of a comma, getting you the first number.
Assigning y to be myStr[myStr.index(",")+1:] takes a substring of the original string starting after the first comma and to the end of the string, getting you the second number.
Both can easily be converted to floats from here using float(), getting you two separate floats.
Here is a helpful link to understand string slicing: https://www.geeksforgeeks.org/string-slicing-in-python/

Remove scientific notation floats in a dataframe

I am receiving different series from a source. Some of those series have values that are big numbers (in the billions). I then combine all the series into a dataframe with an individual column for each series.
Now, when I print the dataframe, the big numbers in the series are shown in scientific notation. Even printing a series individually shows the numbers in scientific notation.
Dataframe df (multiindex) output is:
                Values
Item Sub
A    1    1.396567e+12
B    1    2.868929e+12
I have tried this:
pd.set_option('display.float_format', lambda x: '%,.2f' % x)
This doesn't work because:
1. it converts the display everywhere, while I only need the conversion in that specific dataframe;
2. it tries to convert all kinds of floats, not just those in scientific notation. So even if the float is 89.142, it will try to apply the format and, since there is no digit group to put a ',' into, it shows an error.
Then I tried these:
df.round(2)
This only rounded ordinary floats from 3 decimals down to 2; it didn't do anything to the values in scientific notation.
Then I tried:
df.astype(float)
It doesn't do anything visible; the output stayed the same.
How else can I change the scientific notation to normal float digits inside the dataframe? I do not want to create a new list with the converted values; the dataframe itself should show the values in normal terms.
Can you guys please help me find a solution for this?
Thank you.
Try df['column'] = df['column'].astype(str). If that does not work, you should change the numbers to strings before creating the pandas dataframe from your data.
I would suggest keeping everything as floats and adjusting the display setting.
For example, I have generated a df with some random numbers.
df = pd.DataFrame({"Item": ["A", "B"], "Sub": [1, 1],
                   "Value": [float(31132314122123.1), float(324231235232315.1)]})
# Item Sub Value
#0 A 1 3.113231e+13
#1 B 1 3.242312e+14
If we check df.dtypes, we can see that the Sub values are ints and the Value values are floats:
Item object
Sub int64
Value float64
dtype: object
You can then set pd.options.display.float_format = '{:.1f}'.format to suppress the scientific notation of the floats, while retaining the float dtype.
# Item Sub Value
#0 A 1 31132314122123.1
#1 B 1 324231235232315.1
Item object
Sub int64
Value float64
dtype: object
If you want the scientific notation back, you can call pd.reset_option('display.float_format').
Okay, I found something called option_context for pandas that allows changing the display options just for a particular case / action, using a with statement:
with pd.option_context('display.float_format', '{:.2f}'.format):
    print(df)
This way we do not have to reset the options afterwards, and the options stay at their defaults for all other data in the file.
Sadly, though, I could find no way to show different columns in different float formats (for example, one column as currency, comma-separated with 2 decimals, while the next column is a percentage, without commas and with 2 decimals).
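For display purposes, one hedged sketch of per-column formatting uses DataFrame.to_string, which accepts a dict of formatter callables (the column names here are hypothetical):
# Each column gets its own formatter in the printed output;
# the underlying dtypes are untouched.
print(df.to_string(formatters={
    'Revenue': '{:,.2f}'.format,    # comma-separated, 2 decimals
    'Growth': '{:.2f}%'.format,     # percentage style, 2 decimals
}))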

Need to convert an object column to int/float in a DataFrame so that I can do some operations on it later

The data set contains a column Performance_UG with values like '90.00/100.00', '4.0/5.0', or '3.50/4.00', stored as object.
Now I have to extract the 90 and the 100, divide them, and save the output, i.e. 0.9, to a new column of the data frame as a float.
How do I do that?
I'm assuming we are talking about pandas here. Each column then has a map method; you can call it on a column and assign the result to a new one. Perhaps something like:
def parseStuff(x):
    x = x.strip()
    if x == '0':
        return 0
    a, b = [float(i) for i in x.split('/')]
    return a / b

df['Performance_UG_parsed'] = df['Performance_UG'].map(parseStuff)
If you want to create new columns or just replace the old one, iterate over the columns or assign back to the same column (although, for the same column, you can just use apply, I think).
Note: you may want to cast it to something other than float, maybe a numpy type, if you are working with that stuff.
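A vectorized sketch of the same idea, assuming every cell matches the 'a/b' pattern (the '0' special case from parseStuff would need separate handling):
# Split on '/' into two float columns, then divide element-wise.
parts = df['Performance_UG'].str.split('/', expand=True).astype(float)
df['Performance_UG_parsed'] = parts[0] / parts[1]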

Multiplying a dataframe by a column... but values are strings?

I have a dataframe that includes binary variables about respondents' behavior and the weight associated with each respondent. I'd like to multiply the scores by each respondent's weight so I can easily get a weighted average for the total behavior.
The easiest thing would be to multiply the weight column against another column in a loop, as in df.columns[761]*df.columns[i]. However, when I try to do so, it throws an error:
'can't multiply sequence by non-int of type 'str''
I shouldn't have any strings, but on the off chance there are, I tried to convert the df to numeric, like so: df.apply(pd.to_numeric, errors='coerce').
But the problem still remains. I'm at my wits' end. Is there a workaround? Should I go row by row (and if so, do I need to loop through every column, or is there a nice clean way?).
You could always break apart your dataframe.
for col in df.columns:
    for index, k in enumerate(df[col]):
        try:
            float(k)
        except (TypeError, ValueError):
            # Print out the row number, column and value that fail to convert
            print(index, col, k)
It's entirely possible you've got strings or None values that are breaking your multiplication.
There's also df[col].apply(float), but it won't show you which rows are the errant ones.
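As a follow-up sketch: df.apply(pd.to_numeric, ...) returns a new frame rather than modifying df in place, which may be why the question's attempt appeared not to work. Assuming column 761 (the index from the question) is the weight column, a vectorized weighting might look like:
# Assign the coerced result back; unparseable cells become NaN.
df = df.apply(pd.to_numeric, errors='coerce')

# Multiply every column by the weight column in one vectorized operation.
weights = df[df.columns[761]]
weighted = df.multiply(weights, axis=0)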

Optimize a for loop applied on all elements of a df

EDIT: here are the first lines:
df = pd.read_csv(os.path.join(path, file), dtype = str,delimiter = ';',error_bad_lines=False, nrows=50)
df["CALDAY"] = df["CALDAY"].apply(lambda x:dt.datetime.strptime(x,'%d/%m/%Y'))
df = df.fillna(0)
I have a csv file that has 1500 columns and 35000 rows. It contains values, but in the form 1.700,35 for example, whereas in Python I need 1700.35. When I read the csv, all values come in as str.
To solve this I wrote this function :
def format_nombre(df):
    length, width = df.shape  # rows, columns
    for i in range(length):
        for j in range(width):
            element = df.iloc[i, j]
            if type(element) != type(df.iloc[1, 0]):
                a = df.iloc[i, j].replace(".", "")
                b = float(a.replace(",", "."))
                df.iloc[i, j] = b
Basically, I select each intersection of rows and columns, replace the problematic characters, turn the element into a float, and put it back in the dataframe. The if ensures that the function doesn't touch the dates, which are in the first column of my dataframe.
The problem is that although the function does exactly what I want, it takes approximately 1 minute to cover 10 rows, so transforming my csv would take a little less than 60 hours.
I realize this is far from being optimized, but I struggled and failed to find a way that suited my needs and (scarce) skills.
How about:
import numpy as np

def to_numeric(column):
    if np.issubdtype(column.dtype, np.datetime64):
        return column
    else:
        # regex=False makes '.' a literal dot rather than a regex wildcard
        return (column.str.replace('.', '', regex=False)
                      .str.replace(',', '.', regex=False)
                      .astype(float))

df = df.apply(to_numeric)
That's assuming all strings are valid. Otherwise use pd.to_numeric instead of astype(float).
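A sketch of that safer variant, where values that still fail to parse become NaN instead of raising:
def to_numeric_safe(column):
    if np.issubdtype(column.dtype, np.datetime64):
        return column
    cleaned = (column.str.replace('.', '', regex=False)
                     .str.replace(',', '.', regex=False))
    # errors='coerce' turns unparseable values into NaN instead of raising
    return pd.to_numeric(cleaned, errors='coerce')

df = df.apply(to_numeric_safe)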
