I have a csv file that contains 6 columns in such format:
2021-04-13 11:03:13+02:00,3.0,3.0,3.0,12.0,12.0
Now I want to remove the decimal points from every column except the first one.
I already tried df.style.format as well as df.astype(int) and pd.set_option('precision', 0), but the latter two always get stuck on the first column, since that one doesn't quite fit ( ValueError: invalid literal for int() with base 10: '2021-04-13 11:03:13+02:00' ).
Any help would be much appreciated!
With your shown samples, please try the following, using the iloc functionality of pandas:
df.iloc[:,1:] = df.iloc[:,1:].astype(int)
df will be as follows after running the above code:
0,2021-04-13 11:03:13+02:00,3,3,3,12,12
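For a self-contained sketch (assuming the file is named data.csv, a made-up name, and has no header row, as in the sample), the same idea can be written with astype and a per-column mapping, which reliably changes the stored dtypes:

import pandas as pd

# read the sample CSV; header=None because the file has no header row
df = pd.read_csv('data.csv', header=None)

# cast every column except the first (the timestamp) to int
df = df.astype({col: int for col in df.columns[1:]})

# write it back without the index and without a header row
df.to_csv('data.csv', header=False, index=False)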
In my dataframe I have a column called Competencia where only dates in the format YYYY-MM-DD should appear. But I just found out that I have a couple of zeros and, out of nowhere, a number like 11, which shouldn't be in it. The column Competencia currently has the dtype object. I normally create a new column called dt_COMP and convert all the dates into datetime, which is the format I need. Now my question is: what do I do with the unwanted 0's and the 11, and how do I do it?
If I try to run my code ignoring the 0's and the 11, I get this error, which I don't understand:
ParserError: Unknown string format: COMPETENCIA
This is the code that I use to convert COMPETENCIA into datetime in a new column called dt_COMP:
#Converting 'COMPETENCIA' into date -> new Column: 'dt_COMP'
df30_new['dt_COMP'] = pd.to_datetime(df30_new['COMPETENCIA'], yearfirst=True)
df30_new['dt_COMP'] = pd.to_datetime(df30_new['dt_COMP']).dt.date
df30_new["dt_COMP"] = pd.to_datetime(df30_new["COMPETENCIA"], format="%Y/%m/%d")
In the screenshot you can see the zeros and the 11.
Something along the lines of this, if I remember rightly (doing this from memory):
from datetime import datetime
df30_new['COMPETENCIA_TRIMMED'] = df30_new['COMPETENCIA'].apply(lambda x: x if isinstance(x, datetime) else None)
Maybe check the datatype of one of the correct values for this to work as expected.
Check out apply in the pandas docs.
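As an alternative sketch (not what the answer above does, just another common approach): pd.to_datetime with errors='coerce' turns every value that cannot be parsed as a date into NaT, so the 0's and the 11 get flagged instead of raising a ParserError:

import pandas as pd

# unparseable entries (the 0's and the 11) become NaT instead of raising
df30_new['dt_COMP'] = pd.to_datetime(df30_new['COMPETENCIA'], yearfirst=True, errors='coerce')

# inspect the rows that failed to parse, then drop or fix them
bad_rows = df30_new[df30_new['dt_COMP'].isna()]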
I am trying to round the numbers of a data frame and put them into a table, then save it as a jpeg so I can text it out daily as a leaderboard. I am able to accomplish everything, but when I create my table with style.background_gradient() it adds a lot of 0's.
I usually use the round(table, 0) function, but it doesn't work on this particular table type. Any suggestions would be appreciated! This is the data frame below, pre-style.
Once I add the following code, it turns into this:
styled = merged.round(2).style.background_gradient()
I would love to get rid of the zeros if possible.
This worked for me:
merged.style.set_precision(2).background_gradient(cmap = 'Blues')
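Note that Styler.set_precision was deprecated in pandas 1.3; on newer versions the equivalent is Styler.format with a precision argument:

merged.style.format(precision=2).background_gradient(cmap='Blues')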
If there are NaN in a column, its dtype is float and the notebook will display the data with a decimal point (even when the decimal part is zero). The solution I suggest is to convert the dtype of these columns to 'Int32' or 'Int64' (plain int raises an error, since it cannot hold NaN):
for col in data.columns:
    if data[col].dtype == 'float64':
        data[col] = data[col].astype('Int32')
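A quick demonstration of the idea on made-up data:

import pandas as pd
import numpy as np

s = pd.Series([3.0, np.nan, 12.0])
# s.astype(int) would raise; the nullable Int32 keeps the missing value as <NA>
print(s.astype('Int32'))  # 3, <NA>, 12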
All that I tried was:
df['buyer_zip']=df['buyer_zip'].replace('-', 0)
df['buyer_zip']=df['buyer_zip'].replace('', 0)
df['buyer_zip']=df['buyer_zip'].str[:5]
df["buyer_zip"].fillna( method ='ffill', inplace = True)
df["buyer_zip"].apply(int)
I have two columns in a pandas dataframe called Buyer_zip and Item_zip, which are the zip codes of buyers and items respectively. These zip codes come in 4 formats: a 5-digit zip code (e.g. 12345), a 5+4-digit zip code (12345-1234), a 9-digit zip code (123456789), and the last one is 'EC180', which is alphanumeric. There are 15 million records in total. I am stuck at the point where I have to convert all those alphanumeric values to numeric. When trying to do so, I encountered the error: invalid literal for int() with base 10: 'EC180'. Could someone help me find all the words in my data column and replace them with 00000? None of the attempts above answered how to find the words in that column and replace them with numbers. Appreciate any help.
Sample data:
buyer_zip
97219
11415-3528
EC180
907031234
Expected output:
buyer_zip
0 97219
1 114153528
2 0
3 907031234
Pandas has several different "replace" methods. On a DataFrame or a Series, replace is meant to match and replace entire values. For instance, df['buyer_zip'].replace('-', 0) looks for a column value that is literally the single character "-" and replaces it with the integer 0. That's not what you want. The series also has a .str attribute which holds functions for strings, and its replace is closer to what you want.
But that is exactly what you want when you have a string that starts with a non-digit character: you want that value to be completely replaced with "00000".
Finally, astype is a faster way to convert the column to an int.
import pandas as pd
df = pd.DataFrame({"buyer_zip":["12345", "123451234", "123456789", "EC180"]})
df["buyer_zip"] = df["buyer_zip"].str.replace("-", "")
df["buyer_zip"] = df["buyer_zip"].replace(r"[^\d].*$", "00000", regex=True)
df["buyer_zip"] = df["buyer_zip"].astype(int)
The operations can also be chained: apply the second operation to the result of the first, and so on, condensing the conversion into one line:
df["buyer_zip"] = df["buyer_zip"].str.replace("-", "").replace(r"[^\d].*$", "00000", regex=True).astype(int)
Good day. I am working on 2 dataframes that I will later be comparing, playersData & allStar. playersData['Year'] is type int64, while allStar's is type object. I tried to convert playersData['Year'] using the following code:
playersData['Year'] = playersData['Year'].astype(str).astype(int)
but it shows an error saying:
ValueError: invalid literal for int() with base 10: 'nan'
the code I used is from the link: https://www.kite.com/python/answers/how-to-convert-a-pandas-dataframe-column-from-object-to-int-in-python
Here are reference pics showing the dtypes of my dataframes:
Try dropping all the NaN values from the dataset:
playersData.dropna(inplace=True)
You can either drop rows containing NaN values or replace them with a constant (if other columns contain valuable info, dropping whole rows might not be a good option).
If you want to drop:
playersData.dropna(inplace=True) or playersData = playersData.dropna()
Replacing with a constant (e.g. 0):
playersData['Year'].fillna(0, inplace=True) or playersData['Year'] = playersData['Year'].fillna(0)
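Either way, once the missing values are handled, the cast should go through. A combined sketch (assuming the NaN values are real missing values rather than the literal string 'nan'):

# replace missing years with 0, then cast; the intermediate astype(str) step is unnecessary
playersData['Year'] = playersData['Year'].fillna(0).astype(int)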
I'm working with some csv files and using pandas to turn them into a dataframe. After that, I use an input to find values to delete
I'm hung up on one small issue: for some columns it's adding ".0" to the values in the column. It only does this in columns with numbers, so I'm guessing it's reading the column as a float. How do I prevent this from happening?
The part that really confuses me is that it only happens in a few columns, so I can't quite figure out a pattern. I need to chop off the ".0" so I can re-import it, and I feel like it would be easiest to prevent it from happening in the first place.
Thanks!
Here's a sample of my code:
clientid = int(input('What client ID needs to be deleted?'))
df1 = pd.read_csv('Client.csv')
clientclean = df1.loc[df1['PersonalID'] != clientid]
clientclean.to_csv('Client.csv', index=None)
Ideally, I'd like all of the values to be the same as the original csv file, but without the rows with the clientid from the user input.
If PersonalID is the header of the problematic column, try this:
import numpy as np
df1 = pd.read_csv('Client.csv', dtype={'PersonalID': np.int32})
Edit: since an integer column cannot hold NaN values, you can try this on each problematic column:
df1[col] = df1[col].fillna(-9999) # or 0 or any value you want here
df1[col] = df1[col].astype(int)
You could go through each value and, if it is a number x, subtract int(x) from it; if this difference is 0.0 for every value in a column, convert that column to int. Or, if you're not dealing with any non-integer numbers at all, you could just convert all values that are numbers to ints.
For an example of the latter (when your original data does not contain any non-integer numbers):
for index, row in df1.iterrows():
    for c, x in enumerate(row):
        if isinstance(x, float):
            df1.iloc[index, c] = int(x)
For an example of the former (if you want to keep non-integer numbers as non-integer numbers, but want to guarantee that integer numbers stay as integers):
import sys

for col in df1.columns:
    foundNonInt = False
    for x in df1[col]:
        # a float with a fractional part marks the whole column as non-integer
        if isinstance(x, float) and abs(x - int(x)) > sys.float_info.epsilon:
            foundNonInt = True
            break
    # convert float columns in which every value turned out to be whole
    if not foundNonInt and df1[col].dtype == float:
        df1[col] = df1[col].astype(int)
Note, the above method is not fool-proof: if by chance, a non-integer number column from the original data set contains non-integers that are all x.0000000, all the way to the last decimal place, this will fail.
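As a side note (not part of the answer above), a vectorized sketch of the same whole-number check, which avoids the Python-level loops:

for col in df1.columns:
    # convert a float column only when every value is a whole number
    # (a column containing NaN fails the check and is left alone)
    if df1[col].dtype == float and (df1[col] % 1 == 0).all():
        df1[col] = df1[col].astype(int)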
It was a datatype issue.
ALollz's comment led me in the right direction. Pandas was assuming a data type of float, which added the decimal points.
I specified the datatype as object (from Akarius's comment) when using read_csv, which resolved the issue.
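For reference, a minimal sketch of that fix (dtype=str has the same effect as object here, keeping every value exactly as it appears in the file):

import pandas as pd

clientid = input('What client ID needs to be deleted?')

# read every column as a string so pandas never infers float and appends ".0"
df1 = pd.read_csv('Client.csv', dtype=str)

# compare as strings, since every column is now str
clientclean = df1.loc[df1['PersonalID'] != clientid]
clientclean.to_csv('Client.csv', index=None)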