I have a pandas DataFrame with a float column. The values in that column have many decimal places, but I only need 2. I don't want to round, but truncate the value after the second digit.
This is what I have so far, however with this operation I always get NaNs:
t['latitude']=[18.398, 18.4439, 18.346, 37.5079, 38.11, 38.2927]
sub = "."
t['latitude'].astype(str).str.slice(start=t['latitude'].astype(str).str.find(sub), stop=t['latitude'].astype(str).str.find(sub)+2)
Output:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
Name: latitude, dtype: float64
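The NaNs are likely because Series.str.slice expects plain integer start/stop values, not Series, so the per-element slicing fails and pandas falls back to NaN. A minimal per-element fix (a sketch that keeps the string-based approach; note the stop needs find('.') + 3 to keep two decimals):
# apply slices each string individually, avoiding Series-valued bounds
t['latitude'].astype(str).apply(lambda s: s[:s.find('.') + 3])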
The simplest way to truncate:
import pandas as pd

t = pd.DataFrame()
t['latitude']=[18.398, 18.4439, 18.346, 37.5079, 38.11, 38.2927]
t['latitude'] = (t['latitude'] * 100).astype(int) / 100
print(t)
>>
latitude
0 18.39
1 18.44
2 18.34
3 37.50
4 38.11
5 38.29
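One caveat with the multiply-and-cast trick (a hedged note): binary floats can undershoot, so the int cast occasionally truncates one unit too far, and it raises on NaN values, since NaN cannot be cast to int:
import pandas as pd

s = pd.Series([0.29])
print((s * 100).astype(int) / 100)  # 0.28, because 0.29 * 100 == 28.999999999999996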
Use np.round -
import numpy as np
import pandas as pd

s = pd.Series([18.3988, 18.4439, 18.3467, 37.5079, 38.1102, 38.2927])
s_rounded = np.round(s, 2)
Output
0 18.40
1 18.44
2 18.35
3 37.51
4 38.11
5 38.29
dtype: float64
If you don't want to round, but just truncate -
s.astype(str).str.split('.').apply(lambda x: str(x[0]) + '.' + str(x[1])[:2])
Output
0 18.39
1 18.44
2 18.34
3 37.50
4 38.11
5 38.29
dtype: object
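Note the result has dtype object (strings). If you need floats back, cast at the end; a small follow-up using the same Series s:
(s.astype(str)
  .str.split('.')
  .apply(lambda x: x[0] + '.' + x[1][:2])
  .astype(float))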
Use numpy.trunc for a vectorized operation:
import numpy as np

n = 2  # number of decimals to keep
np.trunc(df['latitude'].mul(10**n)).div(10**n)
# to assign
# df['latitude'] = np.trunc(df['latitude'].mul(10**n)).div(10**n)
output:
0 18.39
1 18.44
2 18.34
3 37.50
4 38.11
5 38.29
Name: latitude, dtype: float64
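Unlike a floor-based approach, np.trunc truncates toward zero, so negative coordinates keep the expected digits; a quick check with a hypothetical negative value:
import numpy as np
import pandas as pd

s = pd.Series([-18.398, 18.398])
print(np.trunc(s * 100) / 100)  # -18.39 and 18.39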
x = 12.3614
y = round(x, 2)
print(y)  # 12.36 (note: round rounds, it does not truncate)
Easiest is Series.round, but you can also try .str.extract:
t['latitude'] = (t['latitude'].astype(str)
                 .str.extract(r'(.*\.\d{0,2})')
                 .astype(float))
print(t)
latitude
0 18.39
1 18.44
2 18.34
3 37.50
4 38.11
5 38.29
import re

t = [18.398, 18.4439, 18.346, 37.5079, 38.11, 38.2927]
truncated_lat = []
for lat in t:
    truncated_lat.append(float(re.findall(r'[0-9]+\.[0-9]{2}', str(lat))[0]))
print(truncated_lat)
Output:
[18.39, 18.44, 18.34, 37.5, 38.11, 38.29]
Try math.trunc, scaling by 100 first so two decimals survive the truncation:
import math

t['latitude'] = [math.trunc(i * 100) / 100 for i in t['latitude']]
Related
I have two DataFrames that look like this:
dfH
TICKER Qty PPC Date PxQ PPerc
0 C 6 4185.0 2021-11-13 25110.0 0.097416
1 AAPL 20 3058.0 2021-11-13 61160.0 0.237274
2 JPM 3 5915.0 2021-11-13 17745.0 0.068843
3 KO 15 2481.0 2021-11-13 37215.0 0.144378
4 MSFT 10 5825.6 2021-11-13 58256.0 0.226008
5 PG 5 6280.0 2021-11-13 31400.0 0.121818
6 WMT 5 5375.0 2021-11-13 26875.0 0.104263
dfMerged
Date,C,AAPL,JPM,KO,MSFT,PG,WMT
2020-11-10,2380.000,1759.000,3480.000,1601.000,3189.500,4269.000,3665.000
2020-11-11,2475.000,1798.000,3500.000,1626.000,3286.000,4352.000,3780.000
2020-11-12,2409.000,1765.000,3392.000,1590.000,3208.000,4305.000,3687.000
2020-11-13,2425.000,1770.000,3400.000,1590.000,3245.000,4322.500,3780.000
2020-11-16,2472.000,1792.000,3460.000,1600.000,3215.000,4240.000,3805.000
2020-11-17,2535.000,1810.000,3489.000,1610.000,3220.000,4300.000,3793.000
Like VLOOKUP in Excel, I'm trying to pick the PPerc value from dfH and multiply it by the corresponding column in dfMerged, accumulate the row values, and append the result to dfMerged as a new column. With the expression below I manage to do the math, but I'm having trouble accumulating over the iterations in dfMerged["Ind"]; I'm only getting the last iteration's values.
for i in list(dfMerged.columns):
    if i != 'Date':
        index = (dfH[dfH["TICKER"] == i]["PPerc"].values[0] * dfMerged[i])
        dfMerged["Ind"] = index
Date C AAPL JPM ... MSFT PG WMT Ind
0 2020-11-10 2380.0 1759.0 3480.0 ... 3189.5 4269.0 3665.0 382.124817
1 2020-11-11 2475.0 1798.0 3500.0 ... 3286.0 4352.0 3780.0 394.115091
2 2020-11-12 2409.0 1765.0 3392.0 ... 3208.0 4305.0 3687.0 384.418609
3 2020-11-13 2425.0 1770.0 3400.0 ... 3245.0 4322.5 3780.0 394.115091
4 2020-11-16 2472.0 1792.0 3460.0 ... 3215.0 4240.0 3805.0 396.721672
If I understand correctly, this works:
new_col = sum([dfMerged[symbol] * dfH['PPerc'][i] for i, symbol in enumerate(dfH['TICKER'])])
Output:
>>> new_col
0 76190.0
1 77730.0
2 75660.0
3 75950.0
4 77240.0
5 78340.0
dtype: float64
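A vectorized alternative (a sketch, assuming every ticker in dfH appears as a column of dfMerged): align PPerc on the column labels and take a row-wise sum.
pperc = dfH.set_index('TICKER')['PPerc']
dfMerged['Ind'] = dfMerged.drop(columns='Date').mul(pperc, axis=1).sum(axis=1)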
Solved using join and then sum:
tmp = pd.DataFrame(index=dfMerged.index)  # assumed initialization of the collector frame
for i in list(dfMerged.columns):
    if i != 'Date':
        index = (dfH[dfH["TICKER"] == i]["PPerc"].values[0] * dfMerged[i])
        tmp = tmp.join(index, how="right")
tmp["index"] = tmp.sum(axis=1)
dfMerged["Ind"] = tmp['index']
Imagine there is a dataframe:
id date balance_total transaction_total
0 1 01/01/2019 102.0 -1.0
1 1 01/02/2019 100.0 -2.0
2 1 01/03/2019 100.0 NaN
3 1 01/04/2019 100.0 NaN
4 1 01/05/2019 96.0 -4.0
5 2 01/01/2019 200.0 -2.0
6 2 01/02/2019 100.0 -2.0
7 2 01/04/2019 100.0 NaN
8 2 01/05/2019 96.0 -4.0
Here is the command to create the dataframe:
import pandas as pd
import numpy as np
users = pd.DataFrame(
    [
        {'id': 1, 'date': '01/01/2019', 'transaction_total': -1, 'balance_total': 102},
        {'id': 1, 'date': '01/02/2019', 'transaction_total': -2, 'balance_total': 100},
        {'id': 1, 'date': '01/03/2019', 'transaction_total': np.nan, 'balance_total': 100},
        {'id': 1, 'date': '01/04/2019', 'transaction_total': np.nan, 'balance_total': 100},
        {'id': 1, 'date': '01/05/2019', 'transaction_total': -4, 'balance_total': np.nan},
        {'id': 2, 'date': '01/01/2019', 'transaction_total': -2, 'balance_total': 200},
        {'id': 2, 'date': '01/02/2019', 'transaction_total': -2, 'balance_total': 100},
        {'id': 2, 'date': '01/04/2019', 'transaction_total': np.nan, 'balance_total': 100},
        {'id': 2, 'date': '01/05/2019', 'transaction_total': -4, 'balance_total': 96}
    ]
)
How could I check whether each id has consecutive dates or not? I use the "shift" idea from here, but it doesn't seem to work:
Calculating time difference between two rows
df['index_col'] = df.index
for id in df['id'].unique():
    # create an empty QA dataframe
    column_names = ["Delta"]
    df_qa = pd.DataFrame(columns=column_names)
    df_qa['Delta'] = (df['index_col'] - df['index_col'].shift(1))
    if (df_qa['Delta'].iloc[1:] != 1).any() is True:
        print('id ' + id + ' might have non-consecutive dates')
        # doesn't print any account => Each Customer's Daily Balance has Consecutive Dates
        break
Ideal output:
it should print id 2 might have non-consecutive dates
Thank you!
Use groupby and diff:
df["date"] = pd.to_datetime(df["date"],format="%m/%d/%Y")
df["difference"] = df.groupby("id")["date"].diff()
print (df.loc[df["difference"]>pd.Timedelta(1, unit="d")])
#
id date transaction_total balance_total difference
7 2 2019-01-04 NaN 100.0 2 days
Use DataFrameGroupBy.diff with Series.dt.days, compare for values greater than 1, and filter only the id column with DataFrame.loc:
users['date'] = pd.to_datetime(users['date'])
i = users.loc[users.groupby('id')['date'].diff().dt.days.gt(1), 'id'].tolist()
print (i)
[2]
for val in i:
print( f'id {val} might have non-consecutive dates')
id 2 might have non-consecutive dates
The first step is to parse the date:
users['date'] = pd.to_datetime(users.date)
Then add a shifted column on the id and date columns:
users['id_shifted'] = users.id.shift(1)
users['date_shifted'] = users.date.shift(1)
The difference between date and date_shifted columns is of interest:
>>> users.date - users.date_shifted
0 NaT
1 1 days
2 1 days
3 1 days
4 1 days
5 -4 days
6 1 days
7 2 days
8 1 days
dtype: timedelta64[ns]
You can now query the DataFrame for what you want:
users[(users.id_shifted == users.id) & (users.date - users.date_shifted != np.timedelta64(1, 'D'))]
That is, consecutive lines of the same user with a date difference != 1 day.
This solution does assume the data is sorted by (id, date).
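To get the same message as the ideal output, you can reuse that mask (a small follow-up sketch):
mask = (users.id_shifted == users.id) & (users.date - users.date_shifted != np.timedelta64(1, 'D'))
for val in users.loc[mask, 'id'].unique():
    print(f'id {val} might have non-consecutive dates')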
I have created two series and I want to create a third series by doing element-wise multiplication of the first two. My code is given below:
new_samples = 10 # Number of samples in series
a = pd.Series([list(map(lambda x:x,np.linspace(2,2,new_samples)))],index=['Current'])
b = pd.Series([list(map(lambda x:x,np.linspace(10,0,new_samples)))],index=['Voltage'])
c = pd.Series([x*y for x,y in zip(a.tolist(),b.tolist())],index=['Power'])
My output is:
TypeError: can't multiply sequence by non-int of type 'list'
To keep things clear, I am pasting my actual for-loop code below. My data frame already has three columns: Current, Voltage, Power. For my requirement, I have to add new lists of values to the existing Voltage and Current columns, but the Power values are created by multiplying the values just created. My code is given below:
for i, j in zip(IV_start_index, IV_start_index[1:]):
    isc_act = module_allData_df['Current'].iloc[i:j-1].max()
    isc_indx = module_allData_df['Current'].iloc[i:j-1].idxmax()
    sample_count = int((j-i)/(module_allData_df['Voltage'].iloc[i]-module_allData_df['Voltage'].iloc[j-1]))
    new_samples = int(sample_count * (module_allData_df['Voltage'].iloc[isc_indx]))
    missing_current = pd.Series([list(map(lambda x:x,np.linspace(isc_act,isc_act,new_samples)))],index=['Current'])
    missing_voltage = pd.Series([list(map(lambda x:x,np.linspace(module_allData_df['Voltage'].iloc[isc_indx],0,new_samples)))],index=['Voltage'])
    print(missing_current.tolist()*missing_voltage.tolist())
Sample data: module_allData_df.head()
Voltage Current Power
0 33.009998 -0.004 -0.13204
1 33.009998 0.005 0.16505
2 32.970001 0.046 1.51662
3 32.950001 0.087 2.86665
4 32.919998 0.128 4.21376
sample data: module_allData_df.iloc[120:126] (you will need this as well)
Voltage Current Power
120 0.980000 5.449 5.34002
121 0.920000 5.449 5.01308
122 0.860000 5.449 4.68614
123 0.790000 5.449 4.30471
124 33.110001 -0.004 -0.13244
125 33.110001 0.005 0.16555
sample data: IV_start_index[:5]
[0, 124, 251, 381, 512]
Based on @jezrael's answer, I have successfully created three separate series. How do I append them to the main dataframe? My requirement is explained in the following plot.
The problem is that each Series is a single element containing a list, so vectorized operations are not possible.
a = pd.Series([list(map(lambda x:x,np.linspace(2,2,new_samples)))],index=['Current'])
b = pd.Series([list(map(lambda x:x,np.linspace(10,0,new_samples)))],index=['Voltage'])
print (a)
Current [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, ...
dtype: object
print (b)
Voltage [10.0, 8.88888888888889, 7.777777777777778, 6....
dtype: object
So I believe you need to remove the [] and, if necessary, add the name parameter:
a = pd.Series(list(map(lambda x:x,np.linspace(2,2,new_samples))), name='Current')
b = pd.Series(list(map(lambda x:x,np.linspace(10,0,new_samples))),name='Voltage')
print (a)
0 2.0
1 2.0
2 2.0
3 2.0
4 2.0
5 2.0
6 2.0
7 2.0
8 2.0
9 2.0
Name: Current, dtype: float64
print (b)
0 10.000000
1 8.888889
2 7.777778
3 6.666667
4 5.555556
5 4.444444
6 3.333333
7 2.222222
8 1.111111
9 0.000000
Name: Voltage, dtype: float64
c = a * b
print (c)
0 20.000000
1 17.777778
2 15.555556
3 13.333333
4 11.111111
5 8.888889
6 6.666667
7 4.444444
8 2.222222
9 0.000000
dtype: float64
EDIT:
If you want to output the multiplied Series, you need to change the last 2 rows:
missing_current = pd.Series(list(map(lambda x:x,np.linspace(isc_act,isc_act,new_samples))))
missing_voltage = pd.Series(list(map(lambda x:x,np.linspace(module_allData_df['Voltage'].iloc[isc_indx],0,new_samples))))
print(missing_current * missing_voltage)
It's easier using numpy.
import numpy as np

new_samples = 10  # number of samples in series
a = np.linspace(2, 2, new_samples)  # np.linspace already returns an ndarray
b = np.linspace(10, 0, new_samples)
c = a * b
print(c)
Output:
array([20.        , 17.77777778, 15.55555556, 13.33333333, 11.11111111,
        8.88888889,  6.66666667,  4.44444444,  2.22222222,  0.        ])
As you are doing everything with pandas DataFrames, use the code below.
import numpy as np
import pandas as pd

new_samples = 10  # number of samples in series
df = pd.DataFrame({'Current': np.linspace(2, 2, new_samples),
                   'Voltage': np.linspace(10, 0, new_samples)})
df['Power'] = df['Current'] * df['Voltage']
print(df.to_string(index=False))
Output:
Current Voltage Power
2.0 10.000000 20.000000
2.0 8.888889 17.777778
2.0 7.777778 15.555556
2.0 6.666667 13.333333
2.0 5.555556 11.111111
2.0 4.444444 8.888889
2.0 3.333333 6.666667
2.0 2.222222 4.444444
2.0 1.111111 2.222222
2.0 0.000000 0.000000
Because they are Series, you should be able to just multiply them: c = a * b.
You could add a and b to a data frame, and then c becomes the third column.
I have a dataframe which looks like this
Date |index_numer
26/08/17|200
27/08/17|300
28/08/17|400
29/08/17|100
30/08/17|150
01/09/17|160
02/09/17|170
03/09/17|280
I am trying to do a division where each row is divided by the row after it.
Date |index_numer| Division by next row
26/08/17|200 | 0.666
27/08/17|300 | 0.75
28/08/17|400 | 4
29/08/17|100 |..
I did this in a for loop, extracted the division number, and merged it back into the DF. However, I am not sure if it can be done directly in pandas/numpy.
Does anyone have any idea?
Use shift:
df['division'] = df['index_numer'] / df['index_numer'].shift(-1)
Output:
Date  index_numer  division
0 26/08/17 200 0.666667
1 27/08/17 300 0.750000
2 28/08/17 400 4.000000
3 29/08/17 100 0.666667
4 30/08/17 150 0.937500
5 01/09/17 160 0.941176
6 02/09/17 170 0.607143
7 03/09/17 280 NaN
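The last row is NaN because shift(-1) pairs each row with its successor and the final row has none. If that matters, drop or fill it; a sketch:
df = df.dropna(subset=['division'])  # or fill with a sentinel via df['division'].fillna(...)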
I have a dataset; after reading in the data, df.dir.value_counts() returns
169 23042
170 22934
168 22873
316 22872
315 22809
171 22731
317 22586
323 22561
318 22530
...
0.069 1
0.167 1
0557 1
0.093 1
1455 1
0.130 1
0.683 1
2211 1
3.714 1
1.093 1
0819 1
0.183 1
0.110 1
2241 1
0.34 1
0.330 1
0.563 1
60+9 1
0.910 1
0.232 1
1410 1
0.490 1
0.107 1
1.257 1
1704 1
0.491 1
1.180 1
5-230 1
1735 1
1.384 1
The dir column is about direction, and the data should be integers in the range (0, 361). As you can see, there is a lot of erroneous data at the end of the value_counts() list.
I want to know: how can I drop the non-integer data?
There are some possible ways
1. read_csv as integer and throw away all non-integer data
df = pd.read_csv("/data.dat", names=['time', 'dir'], dtype={'dir': int})
However, there is some string-like bad data, such as 60+9, which would cause an error. I don't know how to handle it.
2. Select by isdigit(), and then do a downcast
df = df[df['dir'].apply(lambda x: str(x).isdigit())]
df['dir']=pd.to_numeric(df['dir'], downcast='integer', errors='coerce')
This is from Drop rows if value in a specific column is not an integer in pandas dataframe, and it works fine for me, but it feels like a bit too much. I'm wondering if there are better approaches?
I like
df.dir[df.dir == df.dir // 1]
How It Works
Consider the dataframe df
df = pd.DataFrame(dict(dir=[1, 1.5, 2, 2.5]))
print(df)
dir
0 1.0
1 1.5
2 2.0
3 2.5
Anything that is an integer should be equal to itself floor divided by one.
df.assign(floor_div=df.dir // 1)
dir floor_div
0 1.0 1.0
1 1.5 1.0
2 2.0 2.0
3 2.5 2.0
So we can test for when they are equal
df.assign(
floor_div=df.dir // 1,
is_int=df.dir // 1 == df.dir
)
dir floor_div is_int
0 1.0 1.0 True
1 1.5 1.0 False
2 2.0 2.0 True
3 2.5 2.0 False
So to filter, we can use the boolean mask in the demo column 'is_int'
df.dir[df.dir == df.dir // 1]
0 1.0
2 2.0
Name: dir, dtype: float64
If there are strings in this column, then you can incorporate pd.to_numeric:
df.dir = pd.to_numeric(df.dir, errors='coerce')
df.dir[df.dir == df.dir // 1]
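Putting it together with the stated valid range (a sketch; the (0, 361) bound comes from the question, the rest is assumed):
s = pd.to_numeric(df.dir, errors='coerce')   # strings like '60+9' become NaN
mask = (s == s // 1) & s.between(0, 360)     # whole numbers inside the direction range
clean = s[mask].astype(int)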