I have the following data in pandas dataframe:
state 1st 2nd 3rd
0 California $11,593,820 $109,264,246 $8,496,273
1 New York $10,861,680 $45,336,041 $6,317,300
2 Florida $7,942,848 $69,369,589 $4,697,244
3 Texas $7,536,817 $61,830,712 $5,736,941
I want to perform some simple analysis (e.g., sum, groupby) with three columns (1st, 2nd, 3rd), but the data type of those three columns is object (or string).
So I used the following code for data conversion:
data = data.convert_objects(convert_numeric=True)
But the conversion does not work, perhaps due to the dollar sign. Any suggestions?
@EdChum's answer is clever and works well. But since there's more than one way to bake a cake, why not use a regex? For example:
df[df.columns[1:]] = df[df.columns[1:]].replace(r'[\$,]', '', regex=True).astype(float)
To me, that is a little bit more readable.
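For a self-contained illustration, here is a minimal sketch that rebuilds the sample frame from the question and confirms a sum works afterwards:
import pandas as pd

df = pd.DataFrame({
    'state': ['California', 'New York', 'Florida', 'Texas'],
    '1st': ['$11,593,820', '$10,861,680', '$7,942,848', '$7,536,817'],
    '2nd': ['$109,264,246', '$45,336,041', '$69,369,589', '$61,830,712'],
    '3rd': ['$8,496,273', '$6,317,300', '$4,697,244', '$5,736,941'],
})

# Strip the dollar sign and thousands separators, then cast to float
df[df.columns[1:]] = df[df.columns[1:]].replace(r'[\$,]', '', regex=True).astype(float)

print(df['1st'].sum())  # 37935165.0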
You can use the vectorised str methods to replace the unwanted characters and then cast the type to int:
In [81]:
df[df.columns[1:]] = df[df.columns[1:]].apply(lambda x: x.str.replace('$', '', regex=False).str.replace(',', '', regex=False)).astype(np.int64)
df
Out[81]:
state 1st 2nd 3rd
index
0 California 11593820 109264246 8496273
1 New York 10861680 45336041 6317300
2 Florida 7942848 69369589 4697244
3 Texas 7536817 61830712 5736941
dtype change is now confirmed:
In [82]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 4 columns):
state 4 non-null object
1st 4 non-null int64
2nd 4 non-null int64
3rd 4 non-null int64
dtypes: int64(3), object(1)
memory usage: 160.0+ bytes
Another way:
In [108]:
df[df.columns[1:]] = df[df.columns[1:]].apply(lambda x: x.str[1:].str.split(',').str.join('')).astype(np.int64)
df
Out[108]:
state 1st 2nd 3rd
index
0 California 11593820 109264246 8496273
1 New York 10861680 45336041 6317300
2 Florida 7942848 69369589 4697244
3 Texas 7536817 61830712 5736941
You can also use locale, as follows:
import locale
import pandas as pd
locale.setlocale(locale.LC_ALL, '')
df['1st'] = df['1st'].map(lambda x: locale.atof(x.strip('$')))
Note: the above code was tested with Python 3 on Windows.
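To apply the same idea to all three money columns at once, a sketch (the explicit locale name is an assumption and is platform-dependent; '' picks the system default as above):
import locale
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')  # assumed locale name; varies by platform

for col in ['1st', '2nd', '3rd']:
    df[col] = df[col].map(lambda x: locale.atof(x.strip('$')))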
To convert to integers, use:
carSales["Price"] = carSales["Price"].replace("[$,]", "", regex=True).astype(int)
You can use the str.replace method with the regex '\D' to remove all non-digit characters, or '[^-.0-9]' to keep minus signs, decimal points and digits:
for col in df.columns[1:]:
    df[col] = pd.to_numeric(df[col].str.replace('[^-.0-9]', '', regex=True))
Related
I have a pandas dataframe of 1.3 million rows with columns such as Phone1 (phone numbers), Sale_date (2015 to 2020), Product_description (185 unique product descriptions), and so on.
Now I want to filter or extract the list of phone numbers that have not bought any one of the products (any one product from product_description; let's say tables) in the year 2020.
>>> data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1392125 entries, 0 to 1398844
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sale_dt 1392125 non-null datetime64[ns]
1 Phone1 1392125 non-null object
2 prod_desc 1392125 non-null object
dtypes: datetime64[ns](1), object(2)
memory usage: 276.1+ MB
One more problem is that some of my phone numbers are in scientific notation (9.96266e+09) and some have special characters, like 044-4578930***.
How can I convert all of them to a plain phone-number format?
When I tried to convert to int, it threw an error:
data['Phone1'].astype(int)
OverflowError: Python int too large to convert to C long
When I tried:
data['Phone1'].astype('int64')
ValueError: invalid literal for int() with base 10: '22651435,9'
When I tried to remove the special characters from the phone numbers:
data.Phone1 = data.Phone1.str.replace('[^\d]+', '')
data['Phone1']
Out[52]:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
1398840 NaN
1398841 NaN
1398842 NaN
1398843 NaN
1398844 NaN
Name: Phone1, Length: 1392125, dtype: object
So, I want to group or extract or filter the phone numbers that have not bought tables (one of the products in the prod_desc column) in 2020; they could have bought other products in previous years, and that doesn't matter.
Please help me solve this problem.
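One hedged approach, as a sketch (the column names come from the question; the product name 'table' and the assumption that the scientific-notation entries are stored as floats are mine): cast the mixed column to string first so the .str accessor works on every row, then clean and filter.
# Cast to string first: the .str accessor yields NaN for non-string (float)
# rows, which is why the replace above produced a column of NaN
phones = data['Phone1'].astype(str)

# Floats render as e.g. '9962660000.0'; drop the trailing '.0' before
# stripping the remaining non-digit characters
data['Phone1'] = (phones.str.replace(r'\.0$', '', regex=True)
                        .str.replace(r'\D+', '', regex=True))

# Phone numbers that did buy a table in 2020
bought_2020 = data.loc[
    (data['Sale_dt'].dt.year == 2020) & data['prod_desc'].eq('table'),
    'Phone1'
].unique()

# Everyone else: no table purchase in 2020; earlier purchases don't matter
result = data.loc[~data['Phone1'].isin(bought_2020), 'Phone1'].unique()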
I am reading some .csv files from a folder and am trying to create a list of data frames, one from each file.
In some files the column values, i.e. Quantity, come in as both str and float64 data types. Therefore, I am trying to convert that Quantity column into int.
I am accessing my columns by position/index (for automation purposes).
Out of all data frames from a list, this is one of them,
CustName ProductID Quantity
0 56MED 110 '1215.0'
1 56MED 112 5003.0
2 56MED 114 '6822.0'
3 WillSup 2285 5645.0
4 WillSup 5622 6523.0
5 HammSup 9522 1254.0
6 HammSup 6954 5642.0
Therefore, my code looks like this:
df.columns[2] = pd.to_numeric(df.columns[2], errors='coerce').astype(str).astype(np.int64)
I am getting:
TypeError: Index does not support mutable operations
Prior to this, I tried:
df.columns[2] = pd.to_numeric(df.columns[2], errors='coerce').fillna(0).astype(str).astype(np.int64)
However, I got this error,
AttributeError: 'numpy.float64' object has no attribute 'fillna'
There are posts that use column names directly, but not the column position. How can I convert my column into int using the column position/index in pandas?
My pandas version:
print(pd.__version__)
>> 0.23.3
df.columns[2] returns a scalar, in this case a string.
To access a Series use either df['Quantity'] or df.iloc[:, 2], or even df[df.columns[2]]. Instead of the repeated transformations, if you are sure the data should be integers, use downcast='integer'.
All these are equivalent:
df['Quantity'] = pd.to_numeric(df['Quantity'], errors='coerce', downcast='integer')
df.iloc[:, 2] = pd.to_numeric(df.iloc[:, 2], errors='coerce', downcast='integer')
df[df.columns[2]] = pd.to_numeric(df[df.columns[2]], errors='coerce', downcast='integer')
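To make the distinction concrete, a small sketch against the sample frame:
print(df.columns[2])   # 'Quantity' -- just the label, a plain string
print(df.iloc[:, 2])   # the actual column, a Series you can transform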
Try this: you need to remove those quotes from your strings first, then use pd.to_numeric:
df.iloc[:, 2] = pd.to_numeric(df.iloc[:, 2].str.strip('\'')).astype(int)
Or, from @jpp:
df['Quantity'] = pd.to_numeric(df['Quantity'].str.strip('\''), errors='coerce', downcast='integer')
Output, df.info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7 entries, 0 to 6
Data columns (total 3 columns):
CustName 7 non-null object
ProductID 7 non-null int64
Quantity 7 non-null int32
dtypes: int32(1), int64(1), object(1)
memory usage: 196.0+ bytes
Output:
CustName ProductID Quantity
0 56MED 110 1215
1 56MED 112 5003
2 56MED 114 6822
3 WillSup 2285 5645
4 WillSup 5622 6523
5 HammSup 9522 1254
6 HammSup 6954 5642
I have imported a csv file using pandas.
My dataframe has multiple columns titled "Farm", "Total Apples" and "Good Apples".
The numerical data imported for "Total Apples" and "Good Apples" contains commas to indicate thousands e.g. 1,200 etc.
I want to remove the comma so the data looks like 1200 etc.
The variable type for the "Total Apples" and "Good Apples" columns comes up as object.
I tried using df.str.replace and df.strip but have not been successful.
I also tried to change the variable type from object to string and from object to integer, but couldn't make it work.
Any help would be greatly appreciated.
EDIT:
Excerpt of data from csv file imported using pd.read_csv:
Farm_Name Total Apples Good Apples
EM 18,327 14,176
EE 18,785 14,146
IW 635 486
L 33,929 24,586
NE 12,497 9,609
NW 30,756 23,765
SC 8,515 6,438
SE 22,896 17,914
SW 11,972 9,114
WM 27,251 20,931
Y 21,495 16,662
I think you can add the parameter thousands to read_csv; then the values in the columns Total Apples and Good Apples are converted to integers:
Maybe your separator is different; don't forget to change it. If the separator is whitespace, use sep='\s+'.
import pandas as pd
import io
temp=u"""Farm_Name;Total Apples;Good Apples
EM;18,327;14,176
EE;18,785;14,146
IW;635;486
L;33,929;24,586
NE;12,497;9,609
NW;30,756;23,765
SC;8,515;6,438
SE;22,896;17,914
SW;11,972;9,114
WM;27,251;20,931
Y;21,495;16,662"""
#after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp), sep=";",thousands=',')
print(df)
Farm_Name Total Apples Good Apples
0 EM 18327 14176
1 EE 18785 14146
2 IW 635 486
3 L 33929 24586
4 NE 12497 9609
5 NW 30756 23765
6 SC 8515 6438
7 SE 22896 17914
8 SW 11972 9114
9 WM 27251 20931
10 Y 21495 16662
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 3 columns):
Farm_Name 11 non-null object
Total Apples 11 non-null int64
Good Apples 11 non-null int64
dtypes: int64(2), object(1)
memory usage: 336.0+ bytes
None
Try this:
import locale
locale.setlocale(locale.LC_NUMERIC, '')
df = df[['Farm_Name']].join(df[['Total Apples', 'Good Apples']].applymap(locale.atof))
I have a CSV file with the following data:
Time Pressure
0 2.9852.988
10 2.9882.988
20 2.9902.990
30 2.9882.988
40 2.9852.985
50 2.9842.984
60 2.9852.985.....
For some reason the values in the second column contain two decimal points. I'm trying to create a DataFrame with pandas but cannot proceed without removing the second decimal point. I cannot do this manually as there are thousands of data points in my file. Any ideas?
You can call the vectorised str methods to split the string on the decimal point and join the result of the split while discarding the last element; this produces, for example, the list ['2', '9852'], which you then join with a decimal point:
In [28]:
df['Pressure'].str.split('.').str[:-1].str.join('.')
Out[28]:
0 2.9852
1 2.9882
2 2.9902
3 2.9882
4 2.9852
5 2.9842
6 2.9852
Name: Pressure, dtype: object
If you want to convert the string to a float then call astype:
In [29]:
df['Pressure'].str.split('.').str[:-1].str.join('.').astype(np.float64)
Out[29]:
0 2.9852
1 2.9882
2 2.9902
3 2.9882
4 2.9852
5 2.9842
6 2.9852
Name: Pressure, dtype: float64
Just remember to assign the conversion back to the original df:
df['Pressure'] = df['Pressure'].str.split('.').str[:-1].str.join('.').astype(np.float64)
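An equivalent one-step alternative is str.extract, as a sketch: the greedy pattern \d+\.\d+ captures everything up to, but not including, the second decimal point:
df['Pressure'] = df['Pressure'].str.extract(r'(\d+\.\d+)', expand=False).astype(np.float64)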