This question already has answers here:
Convert number strings with commas in pandas DataFrame to float
(4 answers)
Closed 9 months ago.
I have imported a csv file using pandas.
My dataframe has multiple columns titled "Farm", "Total Apples" and "Good Apples".
The numerical data imported for "Total Apples" and "Good Apples" contains commas to indicate thousands e.g. 1,200 etc.
I want to remove the comma so the data looks like 1200 etc.
The variable type for the "Total Apples" and "Good Apples" columns comes up as object.
I tried using df.str.replace and df.strip but have not been successful.
Also tried to change the variable type from object to string and object to integer but couldn't make it work.
Any help would be greatly appreciated.
****EDIT****
Excerpt of data from csv file imported using pd.read_csv:
Farm_Name Total Apples Good Apples
EM 18,327 14,176
EE 18,785 14,146
IW 635 486
L 33,929 24,586
NE 12,497 9,609
NW 30,756 23,765
SC 8,515 6,438
SE 22,896 17,914
SW 11,972 9,114
WM 27,251 20,931
Y 21,495 16,662
You can pass the thousands parameter to read_csv; the values in the Total Apples and Good Apples columns are then parsed as integers:
Your separator may be different, so don't forget to change it. If the separator is whitespace, use sep=r'\s+'.
import pandas as pd
import io
temp = """Farm_Name;Total Apples;Good Apples
EM;18,327;14,176
EE;18,785;14,146
IW;635;486
L;33,929;24,586
NE;12,497;9,609
NW;30,756;23,765
SC;8,515;6,438
SE;22,896;17,914
SW;11,972;9,114
WM;27,251;20,931
Y;21,495;16,662"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=';', thousands=',')
print(df)
Farm_Name Total Apples Good Apples
0 EM 18327 14176
1 EE 18785 14146
2 IW 635 486
3 L 33929 24586
4 NE 12497 9609
5 NW 30756 23765
6 SC 8515 6438
7 SE 22896 17914
8 SW 11972 9114
9 WM 27251 20931
10 Y 21495 16662
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 3 columns):
Farm_Name 11 non-null object
Total Apples 11 non-null int64
Good Apples 11 non-null int64
dtypes: int64(2), object(1)
memory usage: 336.0+ bytes
None
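If the file has already been read and the columns are strings, a post-hoc fix is also possible (a sketch assuming the columns contain only digits and commas):
# Strip the thousands separators, then cast to integers.
for col in ['Total Apples', 'Good Apples']:
    df[col] = df[col].str.replace(',', '', regex=False).astype(int)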
Try this:
import locale
locale.setlocale(locale.LC_NUMERIC, '')
df = df[['Farm_Name']].join(df[['Total Apples', 'Good Apples']].applymap(locale.atof))
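Note that locale.atof depends on the active locale: with the default 'C' locale, commas are not treated as grouping characters, so you may need to request a comma-grouping locale explicitly (an assumption; locale names vary by platform):
# 'en_US.UTF-8' is one example name; on Windows it may be 'English_United States'.
locale.setlocale(locale.LC_NUMERIC, 'en_US.UTF-8')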
Related
I have the following data in pandas dataframe:
state 1st 2nd 3rd
0 California $11,593,820 $109,264,246 $8,496,273
1 New York $10,861,680 $45,336,041 $6,317,300
2 Florida $7,942,848 $69,369,589 $4,697,244
3 Texas $7,536,817 $61,830,712 $5,736,941
I want to perform some simple analysis (e.g., sum, groupby) with three columns (1st, 2nd, 3rd), but the data type of those three columns is object (or string).
So I used the following code for data conversion:
data = data.convert_objects(convert_numeric=True)
But, conversion does not work, perhaps, due to the dollar sign. Any suggestion?
@EdChum's answer is clever and works well. But since there's more than one way to bake a cake.... why not use regex? For example:
df[df.columns[1:]] = df[df.columns[1:]].replace(r'[\$,]', '', regex=True).astype(float)
To me, that is a little bit more readable.
You can use the vectorised str methods to replace the unwanted characters and then cast the type to int:
In [81]:
df[df.columns[1:]] = df[df.columns[1:]].apply(lambda x: x.str.replace('$', '', regex=False)).apply(lambda x: x.str.replace(',', '', regex=False)).astype(np.int64)
df
Out[81]:
state 1st 2nd 3rd
index
0 California 11593820 109264246 8496273
1 New York 10861680 45336041 6317300
2 Florida 7942848 69369589 4697244
3 Texas 7536817 61830712 5736941
dtype change is now confirmed:
In [82]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 4 columns):
state 4 non-null object
1st 4 non-null int64
2nd 4 non-null int64
3rd 4 non-null int64
dtypes: int64(3), object(1)
memory usage: 160.0+ bytes
Another way:
In [108]:
df[df.columns[1:]] = df[df.columns[1:]].apply(lambda x: x.str[1:].str.split(',').str.join('')).astype(np.int64)
df
Out[108]:
state 1st 2nd 3rd
index
0 California 11593820 109264246 8496273
1 New York 10861680 45336041 6317300
2 Florida 7942848 69369589 4697244
3 Texas 7536817 61830712 5736941
You can also use locale as follows:
import locale
import pandas as pd
locale.setlocale(locale.LC_ALL, '')
df['1st'] = df['1st'].map(lambda x: locale.atof(x.strip('$')))
Note: the above code was tested in Python 3 on Windows.
To convert to integers, use:
carSales["Price"] = carSales["Price"].replace("[$,]", "", regex=True).astype(int)
You can use the method str.replace with the regex '\D' to remove all non-digit characters, or '[^-.0-9]' to keep minus signs, decimal points and digits:
for col in df.columns[1:]:
    df[col] = pd.to_numeric(df[col].str.replace('[^-.0-9]', '', regex=True))
I have a list of column names from a csv file, like: [email, null, password, ip_address, user_name, phone_no]. Consider a csv with this data:
03-Sep-14,foo2@yahoo.co.jp,,
20-Jan-13,foo3@gmail.com,,
20-Feb-15,foo4@yahoo.co.jp,,
12-May-16,foo5@hotmail.co.jp,,
25-May-16,foo6@hotmail.co.jp,,
Now I want to identify the column names of this csv file on the basis of the data, e.g. col_1 is a date and col_2 is an email.
I tried to use pandas, like getting all the values from col_1 and then identifying whether they are emails or something else, but couldn't get far.
I tried something like this:
df = pd.read_csv('demo.csv', header=None)
df[df[1].str.contains('@')]
but it's not helping me.
Thank you.
Have you tried using pandas DataFrame.infer_objects()?
# importing pandas as pd
import pandas as pd
# Creating the dataframe
df = pd.DataFrame({"A": ["alpha", 15, 81, 1, 100],
                   "B": [2, 78, 7, 4, 12],
                   "C": ["beta", 21, 14, 61, 5]})
# data frame info and data
df.info()
print(df)
# slice all rows except first into a new frame
df_temp = df[1:]
# print it
print(df_temp)
df_temp.info()
# infer the object types
df_inferred = df_temp.infer_objects()
# print inferred
print(df_inferred)
df_inferred.info()
Here's the output from the above py script.
Initially df is inferred as object, int64 and object for A, B and C respectively.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 5 non-null object
1 B 5 non-null int64
2 C 5 non-null object
dtypes: int64(1), object(2)
memory usage: 248.0+ bytes
A B C
0 alpha 2 beta
1 15 78 21
2 81 7 14
3 1 4 61
4 100 12 5
A B C
1 15 78 21
2 81 7 14
3 1 4 61
4 100 12 5
After removing the first row, which contains the strings, the dataframe still shows the same dtypes.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 1 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 4 non-null object
1 B 4 non-null int64
2 C 4 non-null object
dtypes: int64(1), object(2)
memory usage: 228.0+ bytes
A B C
1 15 78 21
2 81 7 14
3 1 4 61
4 100 12 5
After infer_objects(), the types have been correctly inferred as int64.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 1 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 4 non-null int64
1 B 4 non-null int64
2 C 4 non-null int64
dtypes: int64(3)
memory usage: 228.0 bytes
Is this what you need?
The OP has clarified that they need to determine whether the column contains one of the following:
email
password
ip_address
user_name
phone_no
null
There are a couple of approaches we could use:
Approach #1: Take a random sample of rows and analyze their column contents using heuristics
We could use the following heuristic rules to identify column content type.
email: Use a regex to check for presence of a valid email.
Stackoverflow - How to validate an email address
https://www.regular-expressions.info/email.html
https://emailregex.com/
ip_address: Use a regex to match an ip_address.
Stackoverflow - Validating IPv4 addresses with Regex
Stackoverflow - Regular expression that matches valid IPv6 addresses
username: Use a table of common first names or last names and search for them within the username
phone_no: Strip +, SPACE, -, (, ) -- alternatively, all special characters. If you are left with all digits, we have a potential phone number
null: All column contents in sample are null
password: If it doesn't match any of the preceding rules, we identify it as a password
We should run the analysis independently on each column, keep track of how many sample items in the column matched each heuristic, and then pick the classification with the most matches, as in the sketch below.
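A minimal sketch of this heuristic approach (the regexes here are illustrative assumptions, not production-grade validators; see the references above for robust patterns):
import re
from collections import Counter

# Loose, illustrative patterns -- not full RFC-compliant validation.
EMAIL_RE = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')
IPV4_RE = re.compile(r'^(\d{1,3}\.){3}\d{1,3}$')

def classify_value(value):
    # Apply the heuristic rules in order; 'password' is the fallback.
    if value is None or value == '':
        return 'null'
    if EMAIL_RE.match(value):
        return 'email'
    if IPV4_RE.match(value):
        return 'ip_address'
    if re.sub(r'[+\s\-()]', '', value).isdigit():
        return 'phone_no'
    # user_name detection would need a table of common names (omitted here).
    return 'password'

def classify_column(values, sample_size=100):
    # Tally the rule matches over a sample and pick the majority class.
    counts = Counter(classify_value(v) for v in values[:sample_size])
    return counts.most_common(1)[0][0]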
Approach #2: Train a classifier using training data (obtained from real system) and use it to determine column content type
This is a machine learning classification task. A naive approach would be to take each column's data mapped to the content type as the training input.
Using the OP's sample set:
03-Sep-14,foo2@yahoo.co.jp,,
20-Jan-13,foo3@gmail.com,,
20-Feb-15,foo4@yahoo.co.jp,,
12-May-16,foo5@hotmail.co.jp,,
25-May-16,foo6@hotmail.co.jp,,
We would have:
data_content, content_type
03-Sep-14, date
20-Jan-13, date
20-Feb-15, date
12-May-16, date
25-May-16, date
foo2#yahoo.co.jp, email
foo3#gmail.com, email
foo4#yahoo.co.jp, email
foo5#hotmail.co.jp, email
foo6#hotmail.co.jp, email
We can then use machine learning to build a text-to-class multi-class classifier. Some references are given below:
Multi-Class Text Classification from Start to Finish
Multi-Class Text Classification with Scikit-Learn
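As a minimal sketch of this idea (assuming scikit-learn is acceptable; character n-grams pick up patterns like '@' and date separators without hand-written rules):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Labeled training data, built from the table above.
samples = ['03-Sep-14', '20-Jan-13', '20-Feb-15', '12-May-16', '25-May-16',
           'foo2@yahoo.co.jp', 'foo3@gmail.com', 'foo4@yahoo.co.jp',
           'foo5@hotmail.co.jp', 'foo6@hotmail.co.jp']
labels = ['date'] * 5 + ['email'] * 5

# Character n-gram features feed a simple multi-class classifier.
clf = make_pipeline(TfidfVectorizer(analyzer='char', ngram_range=(1, 3)),
                    LogisticRegression())
clf.fit(samples, labels)
print(clf.predict(['14-Oct-17', 'bar@example.org']))  # expect ['date' 'email']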
Right now I am trying to read a table which has a variable-whitespace delimiter and also has missing/blank values. I would like to read the table in Python and produce a CSV file. I have tried the NumPy, pandas and csv libraries, but unfortunately the variable spacing and missing data together are making it near impossible for me to read the table. The file I am trying to read is attached here:
goo.gl/z7S2Mo
I would really appreciate it if anyone could help me with a solution in Python.
You need your delimiter to be two spaces or more (instead of one space or more). Here's a solution:
import pandas as pd
df = pd.read_csv('infotable.txt', sep=r'\s{2,}', header=None, engine='python', thousands=',')
Result:
>>> print(df.head())
0 1 2 3 4 5 \
0 ISHARES MORNINGSTAR MID GROWTH ETP 464288307 3892 41700 SH
1 ISHARES S&P MIDCAP 400 GROWTH ETP 464287606 4700 47600 SH
2 BED BATH & BEYOND Common Stock 075896100 870 15000 SH
3 CARBO CERAMICS INC Common Stock 140781105 950 7700 SH
4 CATALYST HEALTH SOLUTIONS IN Common Stock 14888B103 1313 25250 SH
6 7 8 9
0 Sole 41700 0 0
1 Sole 47600 0 0
2 Sole 15000 0 0
3 Sole 7700 0 0
4 Sole 25250 0 0
>>> print(df.dtypes)
0 object
1 object
2 object
3 int64
4 int64
5 object
6 object
7 int64
8 int64
9 int64
dtype: object
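Since the goal is a CSV file, you can then write the parsed frame back out (assuming the default comma-separated output is acceptable):
# Write the cleaned table to CSV without the integer row index.
df.to_csv('infotable.csv', index=False)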
The numpy module has a function to do just that (see last line):
import numpy as np
path = "<insert file path here>/infotable.txt"
# read off column locations from a text editor.
# I used Notepad++ to do that.
column_locations = np.array([1, 38, 52, 61, 70, 78, 98, 111, 120, 127, 132])
# My text editor starts counting at 1, while numpy starts at 0. Fixing that:
column_locations = column_locations - 1
# Get column widths
widths = column_locations[1:] - column_locations[:-1]
data = np.genfromtxt(path, dtype=None, delimiter=widths, autostrip=True)
Depending on your exact use case, you may use a different method to get the column widths but you get the idea. dtype=None ensures that numpy determines the data types for you; this is very different from leaving out the dtype argument. Finally, autostrip=True strips leading and trailing whitespace.
The output (data) is a structured array.
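If you also need the CSV output the question asks for, one option (an assumption; any CSV writer would do) is to hand the structured array to pandas:
import pandas as pd
# pandas accepts numpy structured arrays directly.
pd.DataFrame(data).to_csv('infotable.csv', index=False)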
I have a CSV file with the following data:
Time Pressure
0 2.9852.988
10 2.9882.988
20 2.9902.990
30 2.9882.988
40 2.9852.985
50 2.9842.984
60 2.9852.985.....
For some reason the second column has two decimal points. I'm trying to create a DataFrame with pandas but cannot proceed without removing the second decimal point. I cannot do this manually, as there are thousands of data points in my file. Any ideas?
You can call the vectorised str methods to split the string on the decimal point and discard the last element of the split; for example, '2.9852.988' splits to ['2', '9852', '988'] and keeping all but the last gives ['2', '9852'], which you then join back with a decimal point:
In [28]:
df['Pressure'].str.split('.').str[:-1].str.join('.')
Out[28]:
0 2.9852
1 2.9882
2 2.9902
3 2.9882
4 2.9852
5 2.9842
6 2.9852
Name: Pressure, dtype: object
If you want to convert the string to a float then call astype:
In [29]:
df['Pressure'].str.split('.').str[:-1].str.join('.').astype(np.float64)
Out[29]:
0 2.9852
1 2.9882
2 2.9902
3 2.9882
4 2.9852
5 2.9842
6 2.9852
Name: Pressure, dtype: float64
Just remember to assign the conversion back to the original df:
df['Pressure'] = df['Pressure'].str.split('.').str[:-1].str.join('.').astype(np.float64)
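An equivalent one-step alternative (a regex sketch that strips the final '.digits' group) would be:
# Drop the trailing '.digits' group, then cast to float in one pass.
df['Pressure'] = df['Pressure'].str.replace(r'\.\d+$', '', regex=True).astype(float)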