Add values in column of Pandas DataFrame - Python

I want to add up the values for a particular column.
I have a dataframe loaded from CSV that contains the following data:
      Date    Item  Count Price per Unit    Sales
0  1/21/16  Unit A     40          $1.50   $60.00
1  1/22/16  Unit A     20          $1.50   $30.00
2  1/23/16  Unit A    100          $1.50  $150.00
I want to add up all the sales. I've tried:
print sales_df.groupby(["Sales"]).sum()
But it's not adding up the sales. What can I do to make this work?

IIUC you need to sum the values in your Sales column. First remove the $ with str.replace, then convert to numeric with pd.to_numeric, and finally call sum. As a one-liner:
pd.to_numeric(df.Sales.str.replace("$", "")).sum()
And step by step:
In [35]: df.Sales
Out[35]:
0     $60.00
1     $30.00
2    $150.00
Name: Sales, dtype: object

In [36]: df.Sales.str.replace("$", "")
Out[36]:
0     60.00
1     30.00
2    150.00
Name: Sales, dtype: object

In [37]: pd.to_numeric(df.Sales.str.replace("$", ""))
Out[37]:
0     60.0
1     30.0
2    150.0
Name: Sales, dtype: float64

In [38]: pd.to_numeric(df.Sales.str.replace("$", "")).sum()
Out[38]: 240.0
Note: pd.to_numeric is available only in pandas >= 0.17.0. If you are using an older version, take a look at convert_objects(convert_numeric=True).
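
For reference, here is a minimal self-contained sketch of the whole answer, assuming a DataFrame reconstructed from the question's sample. regex=False is passed explicitly because some pandas versions treat the pattern as a regular expression, in which "$" is an end-of-string anchor rather than a literal dollar sign:

import pandas as pd

# Reconstruction of the question's Sales column (assumed sample data).
sales_df = pd.DataFrame({"Sales": ["$60.00", "$30.00", "$150.00"]})

# Strip the literal "$", convert to numbers, then sum.
total = pd.to_numeric(sales_df["Sales"].str.replace("$", "", regex=False)).sum()
print(total)  # 240.0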

Related

Dividing rows for specific columns by date+n in Pandas

I want to divide rows in my dataframe via specific columns.
That is, I have a column named 'ticker', along with columns 'date' and 'price'.
I want to divide the price at date[i+2] by the price at date[i], where i and i+2 just mean the DAY and the DAY + 2 for that ticker. The date is also in a proper datetime format for operations using Pandas.
The data looks like:
date         ticker   price
2002-01-30   A        20
2002-01-31   A        21
2002-02-01   A        21.4
2002-02-02   A        21.3
.
.
That means I want to select the price based on the ticker, for the DAY and the DAY + 2 of each ticker, to calculate the ratio price[i+2]/price[i].
I've considered using iloc, but I'm not sure how to select only the rows for a specific ticker to do the math on.
Use groupby:
df.groupby('ticker')['price'].transform(lambda x: x / x.shift(2))
0         NaN
1         NaN
2    1.070000
3    1.014286
Name: price, dtype: float64
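
Here is a fuller sketch of the same idea, assuming the sample rows from the question. The data is sorted first, because shift(2) only means "two days earlier" when each ticker's rows are in date order:

import pandas as pd

# Reconstruction of the question's sample data (assumed).
df = pd.DataFrame({
    "date": pd.to_datetime(["2002-01-30", "2002-01-31",
                            "2002-02-01", "2002-02-02"]),
    "ticker": ["A", "A", "A", "A"],
    "price": [20, 21, 21.4, 21.3],
})

# shift(2) looks two rows back within each ticker's group,
# so rows must be in date order per ticker before shifting.
df = df.sort_values(["ticker", "date"])
df["ratio"] = df.groupby("ticker")["price"].transform(lambda x: x / x.shift(2))
print(df["ratio"])  # NaN, NaN, 1.07, 1.014286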

converting currency with $ to numbers in Python pandas

I have the following data in a pandas DataFrame:
        state          1st           2nd         3rd
0  California  $11,593,820  $109,264,246  $8,496,273
1    New York  $10,861,680   $45,336,041  $6,317,300
2     Florida   $7,942,848   $69,369,589  $4,697,244
3       Texas   $7,536,817   $61,830,712  $5,736,941
I want to perform some simple analysis (e.g., sum, groupby) with three columns (1st, 2nd, 3rd), but the data type of those three columns is object (or string).
So I used the following code for data conversion:
data = data.convert_objects(convert_numeric=True)
But the conversion does not work, perhaps due to the dollar sign. Any suggestions?
@EdChum's answer is clever and works well. But since there's more than one way to bake a cake... why not use regex? For example:
df[df.columns[1:]] = df[df.columns[1:]].replace(r'[\$,]', '', regex=True).astype(float)
To me, that is a little bit more readable.
You can use the vectorised str methods to replace the unwanted characters and then cast the type to int:
In [81]:
df[df.columns[1:]] = df[df.columns[1:]].apply(lambda x: x.str.replace('$','')).apply(lambda x: x.str.replace(',','')).astype(np.int64)
df
Out[81]:
            state       1st        2nd      3rd
index
0      California  11593820  109264246  8496273
1        New York  10861680   45336041  6317300
2         Florida   7942848   69369589  4697244
3           Texas   7536817   61830712  5736941
dtype change is now confirmed:
In [82]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 4 columns):
state 4 non-null object
1st 4 non-null int64
2nd 4 non-null int64
3rd 4 non-null int64
dtypes: int64(3), object(1)
memory usage: 160.0+ bytes
Another way:
In [108]:
df[df.columns[1:]] = df[df.columns[1:]].apply(lambda x: x.str[1:].str.split(',').str.join('')).astype(np.int64)
df
Out[108]:
            state       1st        2nd      3rd
index
0      California  11593820  109264246  8496273
1        New York  10861680   45336041  6317300
2         Florida   7942848   69369589  4697244
3           Texas   7536817   61830712  5736941
You can also use locale as follows:
import locale
import pandas as pd

locale.setlocale(locale.LC_ALL, '')
df['1st'] = df['1st'].map(lambda x: locale.atof(x.strip('$')))
Note: the above code was tested with Python 3 on Windows.
To convert into integer, use:
carSales["Price"] = carSales["Price"].replace("[$,]", "", regex=True).astype(int)
You can use the method str.replace with the regex r'\D' to remove all non-digit characters, or r'[^-.0-9]' to keep minus signs, decimal points and digits:
for col in df.columns[1:]:
    df[col] = pd.to_numeric(df[col].str.replace(r'[^-.0-9]', '', regex=True))
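
Putting the pieces together, here is a minimal runnable sketch of the regex approach on a reconstruction of the question's data (two money columns shown for brevity):

import pandas as pd

# Reconstruction of the question's data (assumed sample).
df = pd.DataFrame({
    "state": ["California", "New York", "Florida", "Texas"],
    "1st": ["$11,593,820", "$10,861,680", "$7,942,848", "$7,536,817"],
    "2nd": ["$109,264,246", "$45,336,041", "$69,369,589", "$61,830,712"],
})

# Strip "$" and "," in one regex replace, then cast the money columns.
money_cols = df.columns[1:]
df[money_cols] = df[money_cols].replace(r"[\$,]", "", regex=True).astype("int64")
print(df.dtypes)  # state: object; 1st and 2nd: int64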

How to access column in groupby? - Pandas

I have the following pandas DataFrame dt:
   auftragskennung     sku artikel_bezeichnung  summen_netto       system_created
0               14  200182           Product 1        -16.64  2015-05-12 19:55:16
1               14  730293           Product 2         -4.16  2015-05-12 19:55:16
2                3  720933           Product 3          0.00  2014-03-25 12:12:44
3                3  192042           Product 4         19.95  2014-03-25 12:12:45
4                3  423902           Product 5         23.88  2014-03-25 12:12:45
I then execute this command to get the best selling products ordered by sku:
topseller = dt.groupby("sku").agg({"summen_netto": np.sum}).sort("summen_netto", ascending=False)
Which returns something like:
        summen_netto
sku
730293      55622.24
720933      35603.99
192042      27698.99
423902      26726.28
734630      25730.21
740353      22798.14
This is what I want, but how can I now access the sku column? topseller["sku"] does not work. It always gives me a KeyError.
I would like to be able to do this:
topseller["sku"]["730293"]
Which would then return 55622.24
sku is now the index, so you need to use loc to perform label selection:
In [7]:
topseller.loc[730293]
Out[7]:
summen_netto -4.16
Name: 730293, dtype: float64
You can confirm this here:
In [8]:
topseller.index
Out[8]:
Int64Index([423902, 192042, 720933, 730293, 200182], dtype='int64', name='sku')
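
For completeness, here is a self-contained sketch assuming the question's five sample rows. Note that DataFrame.sort was removed in later pandas versions, so sort_values is used here instead:

import pandas as pd

# Reconstruction of the question's sample rows (assumed).
dt = pd.DataFrame({
    "sku": [200182, 730293, 720933, 192042, 423902],
    "summen_netto": [-16.64, -4.16, 0.00, 19.95, 23.88],
})

topseller = (dt.groupby("sku")
               .agg({"summen_netto": "sum"})
               .sort_values("summen_netto", ascending=False))

# sku is the index of the result, so select rows by label with loc.
print(topseller.loc[730293, "summen_netto"])  # -4.16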

Editing data in CSV files using Pandas

I have a CSV file with the following data:
Time  Pressure
0     2.9852.988
10    2.9882.988
20    2.9902.990
30    2.9882.988
40    2.9852.985
50    2.9842.984
60    2.9852.985
...
For some reason the values in the second column contain two decimal points. I'm trying to create a DataFrame with pandas but cannot proceed without removing the second decimal point. I cannot do this manually as there are thousands of data points in my file. Any ideas?
You can call the vectorised str methods to split the string on the decimal point and discard the last element of the split; this produces, for example, the list ['2', '9852'], which you then join back together with a decimal point:
In [28]:
df['Pressure'].str.split('.').str[:-1].str.join('.')
Out[28]:
0    2.9852
1    2.9882
2    2.9902
3    2.9882
4    2.9852
5    2.9842
6    2.9852
Name: Pressure, dtype: object
If you want to convert the string to a float then call astype:
In [29]:
df['Pressure'].str.split('.').str[:-1].str.join('.').astype(np.float64)
Out[29]:
0    2.9852
1    2.9882
2    2.9902
3    2.9882
4    2.9852
5    2.9842
6    2.9852
Name: Pressure, dtype: float64
Just remember to assign the conversion back to the original df:
df['Pressure'] = df['Pressure'].str.split('.').str[:-1].str.join('.').astype(np.float64)
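
As a self-contained sketch, assuming a reconstruction of the question's first few rows:

import numpy as np
import pandas as pd

# Reconstruction of the question's sample data (assumed).
df = pd.DataFrame({
    "Time": [0, 10, 20],
    "Pressure": ["2.9852.988", "2.9882.988", "2.9902.990"],
})

# Split on ".", drop the piece after the second decimal point,
# rejoin the first two pieces, and cast to float.
df["Pressure"] = (df["Pressure"].str.split(".")
                                .str[:-1]
                                .str.join(".")
                                .astype(np.float64))
print(df["Pressure"].tolist())  # [2.9852, 2.9882, 2.9902]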
