I have a DataFrame where all values are integers:
Millage UsedMonth PowerPS
1 261500 269 101
3 320000 211 125
8 230000 253 101
9 227990 255 125
13 256000 240 125
14 153000 242 150
17 142500 215 101
19 220000 268 125
21 202704 260 101
22 350000 246 101
25 331000 230 125
26 250000 226 125
And I would like to calculate log(Millage). So I used this code:
import copy
import math

x_trans = copy.deepcopy(x)
x_trans = x_trans.reset_index(drop=True)
x_trans.astype(float)
for n in range(len(x_trans.Millage)):
    x_trans.Millage[n] = math.log(x_trans.Millage[n])
    x_trans.UsedMonth[n] = math.log(x_trans.UsedMonth[n])
But I got all integer values:
Millage UsedMonth PowerPS
0 12 5 101
1 12 5 125
2 12 5 101
3 12 5 125
4 12 5 125
5 11 5 150
It's Python 3, in a Jupyter notebook.
I tried math.log(100) and got 4.605170185988092, so the function itself returns floats.
I think the reason could be the DataFrame's data type.
How can I get the log() results as floats?
Thanks
One solution would be to simply do:
import numpy as np

x_trans['Millage'] = np.log(x_trans['Millage'])
Conversion with astype(float) is not an in-place operation; it returns a new object. Assign the result back to your dataframe and you will find your log series will be of type float:
x_trans = x_trans.astype(float)
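As a quick illustration that astype returns a copy (a minimal sketch on a toy frame, not the question's data):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
df.astype(float)       # returns a new DataFrame; the result is discarded here
print(df.dtypes)       # a    int64  -- unchanged
df = df.astype(float)  # assign back to keep the conversion
print(df.dtypes)       # a    float64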
But, in this case, math.log is inefficient. Instead, you can use vectorised functionality via NumPy:
x_trans['Millage'] = np.log(x_trans['Millage'])
x_trans['UsedMonth'] = np.log(x_trans['UsedMonth'])
With this solution, you do not need to explicitly convert your dataframe to float.
In addition, note that deep copying is native in Pandas, e.g. x_trans = x.copy(deep=True).
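Putting the pieces together, a minimal sketch on two rows shaped like the question's sample:
import numpy as np
import pandas as pd

x = pd.DataFrame({'Millage': [261500, 320000],
                  'UsedMonth': [269, 211],
                  'PowerPS': [101, 125]})
x_trans = x.copy(deep=True)
x_trans['Millage'] = np.log(x_trans['Millage'])
x_trans['UsedMonth'] = np.log(x_trans['UsedMonth'])
print(x_trans.dtypes)  # Millage and UsedMonth are now float64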
First of all, I strongly recommend using the NumPy library for these kinds of mathematical operations; it is faster and its output is easier to work with, since pandas is built on top of NumPy.
Now, given how you created your dataframe, it automatically inferred your data type as integer. Try defining it as float when you create the dataframe by adding the parameter dtype=float, or better, if you are using the NumPy package (import numpy as np), dtype=np.float64.
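For example, a small sketch of construction with an explicit dtype (values taken from the question's sample):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Millage': [261500, 320000, 230000]}, dtype=np.float64)
print(df.dtypes)  # Millage    float64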
I was trying to merge two dataframes with float-type columns in Dask (due to memory issues I can't use pure pandas). From the post, I found that there can be issues when merging on float-type columns, so I followed the answer in that post: multiply the x/y/z values by 100 and convert them to int.
x y z R G B
39020.470001199995750 33884.200004600003012 36.445701600000000 25 39 26
39132.740005500003463 33896.049995399996988 30.405698800000000 19 24 18
39221.059997600001225 33787.050003099997411 26.605699500000000 115 145 145
39237.370010400001775 33773.019996599992737 30.205699900000003 28 33 37
39211.370010400001775 33848.270004300000437 32.535697900000002 19 28 25
What I did:
import numpy as np

N = 100
df2.x = np.round(df2.x * N).astype(int)
df2.head()
But since this dataframe has no index, it results in an error message:
local variable 'index' referenced before assignment
Expected answer
x y z R G B
3902047 3388420 3644 25 39 26
I was having the same problem and got it to work this way:
df2.x = (df2.x*N).round().astype(int)
If you need to round to a specific number of decimal places:
(df2.x*N).round(2)
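Since the question is about Dask, here is a minimal sketch of the same operation on a dask dataframe; the sample values are made up, and df2 and N stand in for the objects above:
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({'x': [39020.47, 39132.74, 39221.06]})
df2 = dd.from_pandas(pdf, npartitions=1)

N = 100
df2['x'] = (df2['x'] * N).round().astype(int)
print(df2.compute())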
Is there an easy way to sum the values of all the rows above the current row into an adjacent column?
I'm trying to create a chart where column B is either the running sum or the percent of total of all the rows in A above it. That way I can quickly visualize where the quartiles, thirds, etc. fall in the dataframe. I'm familiar with the percentile function
How to calculate 1st and 3rd quartiles?
but I'm not sure I can get it to do exactly what I want.
Text Version
1--1%
1--2%
4--6%
4--10%
2--12%
...
and so on to 100 percent.
Do I need to write a for loop to do this?
You can use cumsum for this:
import numpy as np
import pandas as pd
df = pd.DataFrame(data=dict(x=[13,22,34,21,33,41,87,24,41,22,18,12,13]))
df["percent"] = (100*df.x.cumsum()/df.x.sum()).round(1)
output:
x percent
0 13 3.4
1 22 9.2
2 34 18.1
3 21 23.6
4 33 32.3
5 41 43.0
6 87 65.9
7 24 72.2
8 41 82.9
9 22 88.7
10 18 93.4
11 12 96.6
12 13 100.0
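Since the goal was to see where the quartiles fall, here is a hypothetical follow-up building on the df above; idxmax on a boolean series returns the first index where the condition holds:
for q in (25, 50, 75):
    row = (df["percent"] >= q).idxmax()
    print(f"{q}% of the total is first reached at row {row}")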
This is my DataFrame:
Order Time Profit
0 1 106 NaN
1 1 111 -296.0
2 2 14 NaN
3 2 16 -296.0
4 3 62 NaN
.. ... ... ...
335 106 32 -297.6
336 107 44 NaN
337 107 44 138.0
338 108 58 NaN
339 108 63 -303.4
I want to plot a chart where X is the time and Y is the profit value (positive or negative), so we need two bars. However, the time should not come from the same row as the profit, but from the first row with the same order number.
For example, the -296.0 would be plotted at time 106, not 111, because 106 is the first time under order no. 1. How would we do something like that?
This is my code so far:
data = pd.read_csv(filename)
df = pd.DataFrame(data, columns = ['Order','Time','Profit']).astype(str)
#turns time column into hours of week
df['Time'] = df['Time'].apply(lambda x: findHourOfWeek(x))
df['Profit'] = df['Profit'].astype(float)
Assuming the structure we see in the sample of your data holds over the entire data set, i.e. there is only one Profit value per Order, you can do it like this: Group the DataFrame by Order, and aggregate by taking the minimum:
df_grouped = df.groupby(by='Order').min()
resulting in this DataFrame:
Time Profit
Order
1 106 -296.0
2 14 -296.0
3 62 NaN
...
106 32 -297.6
107 44 138.0
108 58 -303.4
Then you can sort by Time and do the plot:
import matplotlib.pyplot as plt
df_grouped.sort_values(by='Time', inplace=True)
plt.plot(df_grouped['Time'], df_grouped['Profit'])
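If you want bars rather than a line, as the question suggests, a minimal variation (the colour scheme is just an illustrative choice):
# draw profits as bars, coloured by sign; NaN profits are simply skipped
colors = ['green' if p >= 0 else 'red' for p in df_grouped['Profit']]
plt.bar(df_grouped['Time'], df_grouped['Profit'], color=colors)
plt.show()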
If you would rather rely on position in the data table, you can also do this:
plot_df = pd.DataFrame()
plot_df["Order"] = df.Order.unique()
plot_df["Profit"] = list(df.groupby("Order").nth(-1)["Profit"])
plot_df["Time"] = list(df.groupby("Order").nth(0)["Time"])
However, if you want the minimum value for time, you're better off using the solution provided by Arne, since it is safer and more correct (provided that you only have one profit value for each order number).
What's the name of the operation below in pandas?
import numpy as np
import pandas as pd
x=np.linspace(10,15,64)
y=np.random.permutation(64)
z=x[y]
ndarray "x" is (I assume) shuffled using ndarray "y" and then result ndarray is assigned to "z".
What is the name of this operation? I can't find it in Pandas documentation.
Thanks,
Pawel
This is called indexing (specifically, integer array or "fancy" indexing), both in pandas and NumPy.
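A minimal illustration in plain NumPy (toy values, not the question's arrays):
import numpy as np

a = np.array([10, 20, 30, 40])
idx = np.array([3, 0, 2, 1])
print(a[idx])  # [40 10 30 20] -- elements gathered in the order given by idx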
This code is basically shuffling an array using an array of indices. Using pandas, you could shuffle a Series containing x with Series.sample, specifying frac=1 so the whole Series is shuffled:
s = pd.Series(x)
s.sample(frac=1)
52 14.126984
1 10.079365
41 13.253968
16 11.269841
29 12.301587
9 10.714286
37 12.936508
19 11.507937
15 11.190476
56 14.444444
0 10.000000
45 13.571429
34 12.698413
12 10.952381
....
If you want to use the existing y, you could index the Series using the iloc indexer:
s.iloc[y]
8 10.634921
53 14.206349
48 13.809524
43 13.412698
51 14.047619
21 11.666667
9 10.714286
29 12.301587
5 10.396825
61 14.841270
56 14.444444
39 13.095238
30 12.380952
...
Here are the docs on indexing with pandas.
import pandas as pd

path1 = "/home/supertramp/Desktop/100&life_180_data.csv"
mydf = pd.read_csv(path1)
numcigar = {"Never": 0, "1-5 Cigarettes/day": 1, "10-20 Cigarettes/day": 4}
print(mydf['Cigarettes'])
mydf['CigarNum'] = mydf['Cigarettes'].apply(numcigar.get).astype(float)
print(mydf['CigarNum'])
mydf.to_csv('/home/supertramp/Desktop/powerRangers.csv')
The csv file "100&life_180_data.csv" contains columns like Age, BMI, Cigarettes, Alcohol, etc.:
No int64
Age int64
BMI float64
Alcohol object
Cigarettes object
dtype: object
The Cigarettes column contains "Never", "1-5 Cigarettes/day" and "10-20 Cigarettes/day".
I want to assign weights to these values (Never, 1-5 Cigarettes/day, ...).
The expected output is a new appended column CigarNum consisting only of the numbers 0, 1 and 4 (per the dict above).
CigarNum is as expected for the first few rows, and then shows NaN for most of the remaining rows:
0 Never
1 Never
2 1-5 Cigarettes/day
3 Never
4 Never
5 Never
6 Never
7 Never
8 Never
9 Never
10 Never
11 Never
12 10-20 Cigarettes/day
13 1-5 Cigarettes/day
14 Never
...
167 Never
168 Never
169 10-20 Cigarettes/day
170 Never
171 Never
172 Never
173 Never
174 Never
175 Never
176 Never
177 Never
178 Never
179 Never
180 Never
181 Never
Name: Cigarettes, Length: 182, dtype: object
The output I get shouldn't have NaN after the first few rows:
0 0
1 0
2 1
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 NaN
11 NaN
12 NaN
13 NaN
14 0
...
167 NaN
168 NaN
169 NaN
170 NaN
171 NaN
172 NaN
173 NaN
174 NaN
175 NaN
176 NaN
177 NaN
178 NaN
179 NaN
180 NaN
181 NaN
Name: CigarNum, Length: 182, dtype: float64
OK, the first problem is that you have stray whitespace in some of the values, causing the dictionary lookup (numcigar.get) to return None.
Fix this by normalising the whitespace with the vectorised str methods, so the values match the dict keys:
mydf['Cigarettes'] = mydf['Cigarettes'].str.replace(r'\s+', ' ', regex=True).str.strip()
Now creating your new column should just work:
mydf['CigarNum'] = mydf['Cigarettes'].apply(numcigar.get).astype(float)
UPDATE
Thanks to @Jeff, as always, for pointing out superior ways to do things.
So you can call replace instead of calling apply:
mydf['CigarNum'] = mydf['Cigarettes'].replace(numcigar)
# now convert the types
mydf['CigarNum'] = mydf['CigarNum'].convert_objects(convert_numeric=True)
You can also use the factorize method.
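For example, a small sketch on a toy Series; note that factorize assigns codes in order of first appearance, unlike the explicit dict:
import pandas as pd

s = pd.Series(["Never", "1-5 Cigarettes/day", "Never", "10-20 Cigarettes/day"])
codes, uniques = pd.factorize(s)
print(codes)    # [0 1 0 2]
print(uniques)  # Index(['Never', '1-5 Cigarettes/day', '10-20 Cigarettes/day'], dtype='object')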
Thinking about it, why not just set the dict values to floats in the first place? Then you avoid the type conversion altogether. So:
numcigar = {"Never":0.0 ,"1-5 Cigarettes/day" :1.0,"10-20 Cigarettes/day":4.0}
Version 0.17.0 or newer
convert_objects has been deprecated since version 0.17.0 and replaced with to_numeric:
mydf['CigarNum'] = pd.to_numeric(mydf['CigarNum'], errors='coerce')
Here errors='coerce' will return NaN where values cannot be converted to a numeric type; without it, an exception is raised.
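A quick illustration of the coercion behaviour on a toy Series:
import pandas as pd

print(pd.to_numeric(pd.Series(['1', '2', 'bad']), errors='coerce'))
# 0    1.0
# 1    2.0
# 2    NaN
# dtype: float64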
Try using this function for all problems of this kind:
import numpy as np

def get_series_ids(x):
    '''Returns a pandas Series of ids corresponding
    to the objects in the input pandas Series x.
    Example:
    get_series_ids(pd.Series(['a','a','b','b','c']))
    returns Series([0,0,1,1,2], dtype=int)'''
    values = np.unique(x)
    values2nums = dict(zip(values, range(len(values))))
    return x.replace(values2nums)
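Example usage on values from the question; note that np.unique sorts, so the ids follow the alphabetical order of the values:
import pandas as pd

s = pd.Series(['Never', '1-5 Cigarettes/day', 'Never', '10-20 Cigarettes/day'])
print(get_series_ids(s))
# 0    2
# 1    0
# 2    2
# 3    1
# dtype: int64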