Editing data in CSV files using Pandas

I have a CSV file with the following data:
Time Pressure
0 2.9852.988
10 2.9882.988
20 2.9902.990
30 2.9882.988
40 2.9852.985
50 2.9842.984
60 2.9852.985.....
For some reason each value in the second column contains two decimal points. I'm trying to create a DataFrame with pandas but cannot proceed without removing the second decimal point. I cannot do this manually, as there are thousands of data points in my file. Any ideas?

You can call the vectorised str methods: split the string on the decimal point, discard the last element of the result, and join what remains with a decimal point. For '2.9852.988', for example, the split gives ['2', '9852', '988'], dropping the last element leaves ['2', '9852'], and joining gives '2.9852':
In [28]:
df['Pressure'].str.split('.').str[:-1].str.join('.')
Out[28]:
0 2.9852
1 2.9882
2 2.9902
3 2.9882
4 2.9852
5 2.9842
6 2.9852
Name: Pressure, dtype: object
If you want to convert the string to a float then call astype:
In [29]:
df['Pressure'].str.split('.').str[:-1].str.join('.').astype(np.float64)
Out[29]:
0 2.9852
1 2.9882
2 2.9902
3 2.9882
4 2.9852
5 2.9842
6 2.9852
Name: Pressure, dtype: float64
Just remember to assign the conversion back to the original df:
df['Pressure'] = df['Pressure'].str.split('.').str[:-1].str.join('.').astype(np.float64)
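For anyone who wants to try this end to end, here is a minimal sketch. It assumes the file is whitespace-delimited like the sample in the question; the inline text stands in for the real CSV, and the `dtype` argument is only there to make sure the malformed values stay strings on load.

```python
import io

import pandas as pd

# Inline stand-in for the CSV file from the question (first three rows only).
raw = """Time Pressure
0 2.9852.988
10 2.9882.988
20 2.9902.990
"""

# Keep Pressure as text so the vectorised str methods can be applied.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", dtype={"Pressure": str})

# Split on '.', drop the trailing piece, re-join, then convert to float.
df["Pressure"] = (
    df["Pressure"].str.split(".").str[:-1].str.join(".").astype(float)
)

print(df)
```

After the assignment the Pressure column holds ordinary floats, so the usual numeric operations work on it.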

Related

How to convert a column's dtype from object to float? [duplicate]

I have the following data in pandas dataframe:
state 1st 2nd 3rd
0 California $11,593,820 $109,264,246 $8,496,273
1 New York $10,861,680 $45,336,041 $6,317,300
2 Florida $7,942,848 $69,369,589 $4,697,244
3 Texas $7,536,817 $61,830,712 $5,736,941
I want to perform some simple analysis (e.g., sum, groupby) with three columns (1st, 2nd, 3rd), but the data type of those three columns is object (or string).
So I used the following code for data conversion:
data = data.convert_objects(convert_numeric=True)
But, conversion does not work, perhaps, due to the dollar sign. Any suggestion?

@EdChum's answer is clever and works well. But since there's more than one way to bake a cake... why not use regex? For example:
df[df.columns[1:]] = df[df.columns[1:]].replace(r'[\$,]', '', regex=True).astype(float)
To me, that is a little bit more readable.
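A self-contained version of the regex approach might look like the sketch below. The two rows are copied from the question; the variable name `money_cols` is just an illustrative choice.

```python
import pandas as pd

# Two rows from the question's table, with currency stored as strings.
df = pd.DataFrame(
    {
        "state": ["California", "New York"],
        "1st": ["$11,593,820", "$10,861,680"],
        "2nd": ["$109,264,246", "$45,336,041"],
    }
)

# Every column except the first holds currency strings.
money_cols = df.columns[1:]

# Strip '$' and ',' with a single regex, then cast the result to float.
df[money_cols] = df[money_cols].replace(r"[\$,]", "", regex=True).astype(float)

print(df.dtypes)
print(df["2nd"].sum())
```

Once the columns are numeric, sum and groupby behave as expected.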

You can use the vectorised str methods to replace the unwanted characters and then cast the type to int:
In [81]:
df[df.columns[1:]] = df[df.columns[1:]].apply(lambda x: x.str.replace('$','', regex=False)).apply(lambda x: x.str.replace(',','', regex=False)).astype(np.int64)
df
Out[81]:
state 1st 2nd 3rd
index
0 California 11593820 109264246 8496273
1 New York 10861680 45336041 6317300
2 Florida 7942848 69369589 4697244
3 Texas 7536817 61830712 5736941
dtype change is now confirmed:
In [82]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 4 columns):
state 4 non-null object
1st 4 non-null int64
2nd 4 non-null int64
3rd 4 non-null int64
dtypes: int64(3), object(1)
memory usage: 160.0+ bytes
Another way:
In [108]:
df[df.columns[1:]] = df[df.columns[1:]].apply(lambda x: x.str[1:].str.split(',').str.join('')).astype(np.int64)
df
Out[108]:
state 1st 2nd 3rd
index
0 California 11593820 109264246 8496273
1 New York 10861680 45336041 6317300
2 Florida 7942848 69369589 4697244
3 Texas 7536817 61830712 5736941

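You can also use locale as follows:
import locale
import pandas as pd
locale.setlocale(locale.LC_ALL,'')
df['1st']=df['1st'].map(lambda x: locale.atof(x.strip('$')))
Note: the above code was tested with Python 3 on Windows. The result of locale.atof depends on the active locale, so the comma is only treated as a thousands separator where the locale defines it that way.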

To convert into integer, use:
carSales["Price"] = carSales["Price"].replace("[$,]", "", regex=True).astype(int)

You can use the method str.replace with the regex '\D' to remove all non-digit characters, or '[^-.0-9]' to keep minus signs, decimal points and digits (pass regex=True so the pattern is treated as a regular expression):
for col in df.columns[1:]:
    df[col] = pd.to_numeric(df[col].str.replace(r'[^-.0-9]', '', regex=True))
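To show how the two patterns differ, here is a small sketch on made-up values (they are not from the question) that include a decimal part and a negative amount.

```python
import pandas as pd

# Made-up sample values with decimals and a minus sign.
prices = pd.Series(["$1,234.50", "-$20.00"])

# '\D' removes every non-digit, so the sign and the decimal point are lost.
digits_only = prices.str.replace(r"\D", "", regex=True)

# '[^-.0-9]' keeps minus signs, decimal points and digits.
cleaned = pd.to_numeric(prices.str.replace(r"[^-.0-9]", "", regex=True))

print(digits_only.tolist())
print(cleaned.tolist())
```

For whole-dollar amounts like those in the question either pattern gives the same numbers; the second one is the safer choice when decimals or negative values can appear.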

Using zfill to pad a column of numbers based on a string in a column of a dataframe

I'm trying to zfill based on two columns in my DataFrame. The first column is called Unit and it only contains two strings, 'Metric' and 'Imperial'.
The second column is called closing and holds lots of numbers. If column one is Metric I need the numbers zero-filled to 5 digits, and if it's Imperial I need them zero-filled to 4. Example:
Metric: 23 needs to become 00023.
Imperial: 23 needs to become 0023.
Basically,
if column one (Unit) = Metric, then I want to look at column two (closing) and zfill to 5
if column one (Unit) = Imperial, then I want to look at column two (closing) and zfill to 4
This is the current code, however I'm getting the error: 'function' object has no attribute 'astype'
df['Unit'] == 'Metric', df['closing'].replace.astype(str).zfill(5)
df['Unit'] == 'Imperial', df['closing'].replace.astype(str).zfill(4)

Considering you have only two choices (Metric, Imperial) to choose from, you could use Numpy where.
Input sample.csv
Unit closing
0 Imperial 20
1 Metric 284
2 Imperial 1451
3 Imperial 45
4 Metric 8491
import pandas as pd
import numpy as np
df = pd.read_csv('sample.csv')
df['closing'] = np.where(df['Unit'] == 'Metric',
                         df['closing'].astype(str).str.zfill(5),
                         df['closing'].astype(str).str.zfill(4),
                         )
print(df)
Output from df
Unit closing
0 Imperial 0020
1 Metric 00284
2 Imperial 1451
3 Imperial 0045
4 Metric 08491
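If more than two units ever show up, np.where gets awkward. One possible alternative, sketched here, is a lookup from unit to width. The `widths` dict is a hypothetical helper that is not part of the answer above, and the frame is built inline from the first rows of the sample.

```python
import pandas as pd

# First three rows of the sample input.
df = pd.DataFrame(
    {"Unit": ["Imperial", "Metric", "Imperial"], "closing": [20, 284, 1451]}
)

# Hypothetical lookup table: target width per unit.
widths = {"Metric": 5, "Imperial": 4}

# Pad each value to the width that belongs to its row's unit.
df["closing"] = [
    str(value).zfill(widths[unit])
    for unit, value in zip(df["Unit"], df["closing"])
]

print(df)
```

A unit missing from `widths` raises a KeyError here, which may or may not be what you want.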

Reading a variable white space delimited table in python

Right now I am trying to read a table that has a variable whitespace delimiter and also contains missing/blank values. I would like to read the table in Python and produce a CSV file. I have tried the NumPy, Pandas and CSV libraries, but variable spacing and missing data together make it nearly impossible for me to read the table. The file I am trying to read is attached here:
goo.gl/z7S2Mo
Would really appreciate if anyone can help me with a solution in python

You need your delimiter to be two spaces or more (instead of one space or more). Here's a solution:
import pandas as pd
df = pd.read_csv('infotable.txt',sep=r'\s{2,}',header=None,engine='python',thousands=',')
Result:
>>> print(df.head())
                                0             1          2     3      4   5  \
0  ISHARES MORNINGSTAR MID GROWTH           ETP  464288307  3892  41700  SH
1   ISHARES S&P MIDCAP 400 GROWTH           ETP  464287606  4700  47600  SH
2               BED BATH & BEYOND  Common Stock  075896100   870  15000  SH
3              CARBO CERAMICS INC  Common Stock  140781105   950   7700  SH
4    CATALYST HEALTH SOLUTIONS IN  Common Stock  14888B103  1313  25250  SH

      6      7  8  9
0  Sole  41700  0  0
1  Sole  47600  0  0
2  Sole  15000  0  0
3  Sole   7700  0  0
4  Sole  25250  0  0
>>> print(df.dtypes)
0    object
1    object
2    object
3     int64
4     int64
5    object
6    object
7     int64
8     int64
9     int64
dtype: object
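Here is a small sketch of the same call that runs without the original file. The two inline rows only imitate the layout of infotable.txt; the comma in "1,313" is made up to show what thousands=',' does.

```python
import io

import pandas as pd

# Inline stand-in for infotable.txt: fields are separated by two or more
# spaces, while names may contain single spaces.
raw = (
    "CATALYST HEALTH SOLUTIONS IN  Common Stock  14888B103  1,313  25250  SH\n"
    "BED BATH & BEYOND             Common Stock  075896100    870  15000  SH\n"
)

# Split only on runs of two or more whitespace characters.
df = pd.read_csv(
    io.StringIO(raw),
    sep=r"\s{2,}",
    header=None,
    engine="python",
    thousands=",",
)

print(df)

# Writing the CSV the question asks for would then be, for example:
# df.to_csv("infotable.csv", index=False, header=False)
```

Note that this only works while every real column gap is at least two spaces wide and no name contains a double space.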

The numpy module has a function to do just that (see last line):
import numpy as np
path = "<insert file path here>/infotable.txt"
# read off column locations from a text editor.
# I used Notepad++ to do that.
column_locations = np.array([1, 38, 52, 61, 70, 78, 98, 111, 120, 127, 132])
# My text editor starts counting at 1, while numpy starts at 0. Fixing that:
column_locations = column_locations - 1
# Get column widths
widths = column_locations[1:] - column_locations[:-1]
data = np.genfromtxt(path, dtype=None, delimiter=widths, autostrip=True)
Depending on your exact use case, you may use a different method to get the column widths but you get the idea. dtype=None ensures that numpy determines the data types for you; this is very different from leaving out the dtype argument. Finally, autostrip=True strips leading and trailing whitespace.
The output (data) is a structured array.
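A runnable sketch of the fixed-width idea on two made-up rows is shown below. The column positions and sample text are assumptions for illustration, and encoding="utf-8" is passed so that text fields come back as str rather than bytes.

```python
import numpy as np

# Made-up fixed-width rows: columns start at positions 1, 13 and 25
# (counting from 1), and each line is 28 characters long.
lines = [
    "CARBO INC   140781105    950",
    "ISHARES ETF 464287606   4700",
]

# Start positions plus one past the end of the line, converted to 0-based.
column_locations = np.array([1, 13, 25, 29]) - 1

# Column widths are the differences between consecutive start positions.
widths = column_locations[1:] - column_locations[:-1]

data = np.genfromtxt(
    lines, dtype=None, delimiter=widths, autostrip=True, encoding="utf-8"
)

# With dtype=None the fields get default names f0, f1, f2.
print(data["f0"], data["f2"])
```

Fields of the structured array are accessed by name, as in the final line.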

Add values in column of Panda Dataframe

I want to add up the values for a particular column.
I have a dataframe loaded from CSV that contains the following data:
Date Item Count Price per Unit Sales
0 1/21/16 Unit A 40 $1.50 $60.00
1 1/22/16 Unit A 20 $1.50 $30.00
2 1/23/16 Unit A 100 $1.50 $150.00
I want to add up all the sales. I've tried:
print sales_df.groupby(["Sales"]).sum()
But it's not adding up the sales. What can I do to make this work?

If I understand correctly, you need to sum the values in your Sales column. First remove the $ with str.replace, then convert to numeric with pd.to_numeric, and then call sum. As a one-liner:
pd.to_numeric(df.Sales.str.replace("$", "")).sum()
And step by step:
In [35]: df.Sales
Out[35]:
0 $60.00
1 $30.00
2 $150.00
Name: Sales, dtype: object
In [36]: df.Sales.str.replace("$", "")
Out[36]:
0 60.00
1 30.00
2 150.00
Name: Sales, dtype: object
In [37]: pd.to_numeric(df.Sales.str.replace("$", ""))
Out[37]:
0 60
1 30
2 150
Name: Sales, dtype: float64
In [38]: pd.to_numeric(df.Sales.str.replace("$", "")).sum()
Out[38]: 240.0
Note: pd.to_numeric is only available in pandas >= 0.17.0. If you are using an older version, take a look at convert_objects(convert_numeric=True).
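Put together as a standalone script, the steps could look like this sketch. The frame is rebuilt inline from the Sales values in the question, and regex=False is passed so that "$" is always treated as a literal character.

```python
import pandas as pd

# Sales values from the question, stored as strings.
df = pd.DataFrame({"Sales": ["$60.00", "$30.00", "$150.00"]})

# Remove the literal '$', convert to numbers, then add them up.
sales = pd.to_numeric(df["Sales"].str.replace("$", "", regex=False))
total = sales.sum()

print(total)
```

No groupby is needed for a plain column total; groupby only comes into play if you want one sum per Item or per Date.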
