Pandas: drop non-integer data - python

I have a dataset; after reading it in, df.dir.value_counts() returns
169 23042
170 22934
168 22873
316 22872
315 22809
171 22731
317 22586
323 22561
318 22530
...
0.069 1
0.167 1
0557 1
0.093 1
1455 1
0.130 1
0.683 1
2211 1
3.714 1
1.093 1
0819 1
0.183 1
0.110 1
2241 1
0.34 1
0.330 1
0.563 1
60+9 1
0.910 1
0.232 1
1410 1
0.490 1
0.107 1
1.257 1
1704 1
0.491 1
1.180 1
5-230 1
1735 1
1.384 1
The dir column holds a direction, so every value should be an integer in range(0, 361). As you can see, there is a lot of erroneous data at the end of the value_counts() list.
How can I drop the non-integer data?
There are some possible ways:
1. Read the column as integer in read_csv and throw away all non-integer data
df = pd.read_csv("/data.dat", names=['time', 'dir'], dtype={'dir': int})
However, there are some string-like bad values, such as 60+9, which cause an error here. I don't know how to handle them.
2. Select by isdigit(), and then do a downcast
df = df[df['dir'].apply(lambda x: str(x).isdigit())]
df['dir']=pd.to_numeric(df['dir'], downcast='integer', errors='coerce')
This is from "Drop rows if value in a specific column is not an integer in pandas dataframe", and it works fine for me, but it feels a little bit too much. I'm wondering if there are better approaches?

I like
df.dir[df.dir == df.dir // 1]
How It Works
Consider the dataframe df
df = pd.DataFrame(dict(dir=[1, 1.5, 2, 2.5]))
print(df)
dir
0 1.0
1 1.5
2 2.0
3 2.5
Anything that is an integer should be equal to itself floor divided by one.
df.assign(floor_div=df.dir // 1)
dir floor_div
0 1.0 1.0
1 1.5 1.0
2 2.0 2.0
3 2.5 2.0
So we can test for when they are equal
df.assign(
floor_div=df.dir // 1,
is_int=df.dir // 1 == df.dir
)
dir floor_div is_int
0 1.0 1.0 True
1 1.5 1.0 False
2 2.0 2.0 True
3 2.5 2.0 False
So to filter, we can use the boolean mask in the demo column 'is_int'
df.dir[df.dir == df.dir // 1]
0 1.0
2 2.0
Name: dir, dtype: float64
If there are strings in this column, then you can incorporate pd.to_numeric
df.dir = pd.to_numeric(df.dir, errors='coerce')
df.dir[df.dir == df.dir // 1]
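Putting the pieces together, here is a minimal sketch of the whole cleanup. The sample values are made up to mimic the bad entries in the question, and the 0 to 360 range check is an assumption about what counts as a valid direction, so adjust both to your data.

```python
import pandas as pd

# Hypothetical sample mixing valid directions with the kinds of bad values in the question.
raw = pd.DataFrame({"dir": ["169", "170", "0.069", "60+9", "5-230", "1455", "316"]})

# Strings that are not numbers (such as "60+9") become NaN instead of raising.
numeric = pd.to_numeric(raw["dir"], errors="coerce")

# Whole numbers equal themselves floor-divided by one; NaN fails the comparison.
is_whole = numeric == numeric // 1
# Assumed valid range for a direction in degrees.
in_range = numeric.between(0, 360)

mask = is_whole & in_range
# Keep the good rows and store the column as real integers.
clean = raw.loc[mask].assign(dir=numeric[mask].astype(int))
print(clean)
```

The range check also drops values like 1455, which pass the integer test but cannot be a direction.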

Related

Take average of range entities and replace it in pandas column

I have dataframe where one column looks like
Average Weight (Kg)
0.647
0.88
0
0.73
1.7 - 2.1
1.2 - 1.5
2.5
NaN
1.5 - 1.9
1.3 - 1.5
0.4
1.7 - 2.9
Reproducible data
df = pd.DataFrame([0.647,0.88,0,0.73,'1.7 - 2.1','1.2 - 1.5',2.5 ,np.nan,'1.5 - 1.9','1.3 - 1.5',0.4,'1.7 - 2.9'],columns=['Average Weight (Kg)'])
I would like to take the average of each range entry and replace it in the dataframe, e.g. 1.7 - 2.1 should be replaced by 1.9. The following code doesn't work; it raises TypeError: 'float' object is not iterable
np.where(df['Average Weight (Kg)'].str.contains('-'), df['Average Weight (Kg)'].str.split('-')
.apply(lambda x: statistics.mean((list(map(float, x)) ))), df['Average Weight (Kg)'])
Another possible solution, which is based on the following ideas:
Convert column to string.
Split each cell by \s-\s.
Explode column.
Convert back to float.
Group by and mean.
df['Average Weight (Kg)'] = df['Average Weight (Kg)'].astype(
str).str.split(r'\s-\s').explode().astype(float).groupby(level=0).mean()
Output:
Average Weight (Kg)
0 0.647
1 0.880
2 0.000
3 0.730
4 1.900
5 1.350
6 2.500
7 NaN
8 1.700
9 1.400
10 0.400
11 2.300
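To see what each step of that chain does, here is the same idea split into intermediate variables on a shortened copy of the data. The variable names are only for illustration.

```python
import numpy as np
import pandas as pd

col = "Average Weight (Kg)"
df = pd.DataFrame({col: [0.647, "1.7 - 2.1", np.nan, "1.2 - 1.5", 2.5]})

as_text = df[col].astype(str)               # floats become "0.647", NaN becomes "nan"
parts = as_text.str.split(r"\s-\s")         # "1.7 - 2.1" -> ["1.7", "2.1"]
exploded = parts.explode()                  # one row per number, original index repeated
values = exploded.astype(float)             # "nan" parses back to NaN
df[col] = values.groupby(level=0).mean()    # average the rows sharing an index label
print(df)
```

Because explode() repeats the original index for every piece of a range, groupby(level=0) can put the pieces back together without any helper column.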
edit: slight change to avoid creating a new column
You could go for something like this (renamed your column name to avg, cause it was long to type :-) ):
new_average =(df.avg.str.split('-').str[1].astype(float) + df.avg.str.split('-').str[0].astype(float) ) / 2
df["avg"] = new_average.fillna(df.avg)
yields for avg:
0 0.647
1 0.880
2 0.000
3 0.730
4 1.900
5 1.350
6 2.500
7 NaN
8 1.700
9 1.400
10 0.400
11 2.300
Name: avg, dtype: float64

How to cut float value to 2 decimal points

I have a pandas DataFrame with a float column. The values in that column have many decimal places, but I only need 2. I don't want to round; I want to truncate the value after the second decimal digit.
This is what I have so far, but with this operation I always get NaNs:
t['latitude']=[18.398, 18.4439, 18.346, 37.5079, 38.11, 38.2927]
sub = "."
t['latitude'].astype(str).str.slice(start=t['latitude'].astype(str).str.find(sub), stop=t['latitude'].astype(str).str.find(sub)+2)
Output:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
Name: latitude, dtype: float64
The simplest way to truncate:
t = pd.DataFrame()
t['latitude']=[18.398, 18.4439, 18.346, 37.5079, 38.11, 38.2927]
t['latitude'] = (t['latitude'] * 100).astype(int) / 100
print(t)
>>
latitude
0 18.39
1 18.44
2 18.34
3 37.50
4 38.11
5 38.29
Use np.round -
s = pd.Series([18.3988, 18.4439, 18.3467, 37.5079, 38.1102, 38.2927])
s_rounded = np.round(s, 2)
Output
0 18.40
1 18.44
2 18.35
3 37.51
4 38.11
5 38.29
dtype: float64
If you don't want to round, but just truncate -
s.astype(str).str.split('.').apply(lambda x: str(x[0]) + '.' + str(x[1])[:2])
Output
0 18.39
1 18.44
2 18.34
3 37.50
4 38.11
5 38.29
dtype: object
Use numpy.trunc for a vectorized operation:
n = 2 # number of decimals to keep
np.trunc(df['latitude'].mul(10**n)).div(10**n)
# to assign
# df['latitude'] = np.trunc(df['latitude'].mul(10**n)).div(10**n)
output:
0 18.39
1 18.44
2 18.34
3 37.50
4 38.11
5 38.29
Name: latitude, dtype: float64
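As a rough sketch, the same scaling trick can be wrapped in a small helper. The function name truncate and the negative sample value are my own additions, included to show how truncation differs from flooring.

```python
import numpy as np
import pandas as pd

def truncate(series: pd.Series, decimals: int = 2) -> pd.Series:
    """Drop digits after `decimals` places without rounding (hypothetical helper)."""
    factor = 10 ** decimals
    # np.trunc cuts toward zero, so it behaves the same for positive and negative values.
    return np.trunc(series * factor) / factor

lat = pd.Series([18.398, 18.4439, 18.346, -37.5079])
truncated = truncate(lat)
# np.floor always goes down, which moves negative values away from zero.
floored = np.floor(lat * 100) / 100
print(pd.DataFrame({"lat": lat, "truncated": truncated, "floored": floored}))
```

Keep in mind that multiplying by a power of ten happens in binary floating point, so a product can occasionally land just below the value you expect and lose one unit in the last kept digit. If that matters, working on the string representation or with decimal.Decimal is safer.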
x = 12.3614
y = round(x,2)
print(y)  # 12.36
The easiest is Series.round, but you can also try .str.extract:
t['latitude'] = (t['latitude'].astype(str)
                 .str.extract(r'(.*\.\d{0,2})', expand=False)
                 .astype(float))
print(t)
latitude
0 18.39
1 18.44
2 18.34
3 37.50
4 38.11
5 38.29
import re
t = [18.398, 18.4439, 18.346, 37.5079, 38.11, 38.2927]
truncated_lat = []
for lat in t:
    # keep the integer part plus exactly two decimal digits
    truncated_lat.append(float(re.findall(r'[0-9]+\.[0-9]{2}', str(lat))[0]))
print(truncated_lat)
Output:
[18.39, 18.44, 18.34, 37.5, 38.11, 38.29]
Try
import math
for i in t['latitude']:
    # math.trunc alone drops every decimal, so scale by 100 first to keep two
    print(math.trunc(i * 100) / 100)

Getting the row count until the highest date from pandas

I have a df like the one below and I want to create a dayshigh column. For each row, this column should show the number of rows back to the previous higher value.
date high
05-06-20 1.85
08-06-20 1.88
09-06-20 2
10-06-20 2.11
11-06-20 2.21
12-06-20 2.17
15-06-20 1.99
16-06-20 2.15
17-06-20 16
18-06-20 9
19-06-20 14.67
should be like:
date high dayshigh
05-06-20 1.85 nan
08-06-20 1.88 1
09-06-20 2 2
10-06-20 2.11 3
11-06-20 2.21 4
12-06-20 2.17 0
15-06-20 1.99 0
16-06-20 2.15 1
17-06-20 16 8
18-06-20 9 0
19-06-20 14.67 1
I am using the code below, but it raises an error:
df["DaysHigh"] = np.repeat(0, len(df))
for i in range(0, len(df)):
    for j in range(df["DaysHigh"][i].index, len(df)):
        if df["high"][i] > df["high"][i-1]:
            df["DaysHigh"][i] = df["DaysHigh"][i-1] + 1
        else:
            df["DaysHigh"][i] = 0
Where am I going wrong? Thank you
Is the dayshigh number for 17-06-20 supposed to be 2 instead of 8? If so, you can basically use the code you had already written here. There are three changes I'm making below:
starting i from 1 instead of 0 to avoid trying to access the -1th element
removing the loop over j (doesn't seem to be necessary)
using loc to set the values instead of df["DaysHigh"][i] -- you'll see this should resolve the warnings about copies and slices.
Keeping first line same as before,
for i in range(1, len(df)):
    if df["high"][i] > df["high"][i-1]:
        df.loc[i,"DaysHigh"] = df["DaysHigh"][i-1] + 1
    else:
        df.loc[i,"DaysHigh"] = 0
Procedure:
Use Series.shift() to compare each row with the previous row and store the result in a temporary column.
Calculate the cumulative sum of that column within each run of rising values.
Delete the temporary column if it is not needed.
df['tmp'] = np.where(df['high'] >= df['high'].shift(), 1, np.nan)
df['dayshigh'] = df['tmp'].groupby(df['tmp'].isna().cumsum()).cumsum()
df.drop('tmp', axis=1, inplace=True)
df
date high dayshigh
0 05-06-20 1.85 NaN
1 08-06-20 1.88 1.0
2 09-06-20 2.00 2.0
3 10-06-20 2.11 3.0
4 11-06-20 2.21 4.0
5 12-06-20 2.17 NaN
6 15-06-20 1.99 NaN
7 16-06-20 2.15 1.0
8 17-06-20 16.00 2.0
9 18-06-20 9.00 NaN
10 19-06-20 14.67 1.0
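The grouping key is the part that is easy to miss, so here is a small variation that names each intermediate step. It is a sketch rather than the answer's exact code: it uses a boolean column instead of 1/NaN, so reset rows come out as 0 instead of NaN.

```python
import pandas as pd

high = pd.Series([1.85, 1.88, 2.0, 2.11, 2.21, 2.17, 1.99, 2.15, 16.0, 9.0, 14.67])

# True where the value did not fall; the first row compares against NaN and is False.
rose = high >= high.shift()
# Every non-rising row starts a new run, so the running count of them labels the runs.
run_id = (~rose).cumsum()
# Counting the True values inside each run gives the length of the current streak.
streak = rose.astype(int).groupby(run_id).cumsum()
print(streak.tolist())
```

Like the answer above, this counts consecutive rises only, so 17-06-20 gets 2 rather than the 8 shown in the question's expected output.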
Well, I think I did, here is my solution:
df["DaysHigh"] = np.repeat(0, len(df))
for i in range(0, len(df)):
    #for i in range(len(df)-1000, len(df)):
    for j in reversed(range(i)):
        if df["high"][i] > df["high"][j]:
            df["DaysHigh"][i] = df["DaysHigh"][i] + 1
        else:
            break
print(df)
date high dayshigh
05-06-20 1.85 nan
08-06-20 1.88 1
09-06-20 2.00 2
10-06-20 2.11 3
11-06-20 2.21 4
12-06-20 2.17 0
15-06-20 1.99 0
16-06-20 2.15 1
17-06-20 16.00 8
18-06-20 9.00 0
19-06-20 14.67 1
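The same look-back logic can be sketched as a function that builds the counts in a plain list and assigns them once, which avoids the chained df["DaysHigh"][i] = ... assignments. The function name days_high is hypothetical, and the first row gets 0 here rather than nan.

```python
import pandas as pd

def days_high(high: pd.Series) -> pd.Series:
    """For each row, count the consecutive earlier rows whose value is lower."""
    values = high.tolist()
    counts = []
    for i, current in enumerate(values):
        count = 0
        # Walk backwards from the previous row and stop at the first higher-or-equal value.
        for previous in reversed(values[:i]):
            if current > previous:
                count += 1
            else:
                break
        counts.append(count)
    return pd.Series(counts, index=high.index, name="dayshigh")

df = pd.DataFrame({"high": [1.85, 1.88, 2.0, 2.11, 2.21, 2.17, 1.99, 2.15, 16.0, 9.0, 14.67]})
df["dayshigh"] = days_high(df["high"])
print(df)
```

This is still quadratic in the worst case, which should be fine for a few thousand rows but may be slow for very long series.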

Division by following row

I have a dataframe which looks like this
Date |index_numer
26/08/17|200
27/08/17|300
28/08/17|400
29/08/17|100
30/08/17|150
01/09/17|160
02/09/17|170
03/09/17|280
I am trying to do a division where the first row divides by the second row.
Date |index_numer| Divison by next row
26/08/17|200 | 0.666
27/08/17|300 | 0.75
28/08/17|400 | 4
29/08/17|100 |..
I did this in a for loop, then extracted the division result and merged it back into the DataFrame. However, I am not sure whether it can be done directly in pandas/numpy.
Does anyone have any idea?
Use shift:
df['divison'] = df['index_numer'] / df['index_numer'].shift(-1)
Output:
Date index_numer divison
0 26/08/17 200 0.666667
1 27/08/17 300 0.750000
2 28/08/17 400 4.000000
3 29/08/17 100 0.666667
4 30/08/17 150 0.937500
5 01/09/17 160 0.941176
6 02/09/17 170 0.607143
7 03/09/17 280 NaN
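A short sketch of what shift(-1) is doing here, using the first four rows of the question's data: it moves every value up by one position, so each row lines up with the value that follows it, and the last row has nothing to divide by.

```python
import pandas as pd

df = pd.DataFrame({"index_numer": [200, 300, 400, 100]})

# Each row now holds the value of the row below it; the last row becomes NaN.
next_row = df["index_numer"].shift(-1)
df["divison"] = df["index_numer"] / next_row
print(df)
```

Using shift(1) instead would divide each row by the row above it.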

How to merge the two columns from two dataframe into one column of a new dataframe (pandas)?

I want to merge the values of two different columns of pandas dataframe into one column of new dataframe.
pandas df1 =
hapX
pos 0.0
1 721 0.2
2 735 0.5
3 739 1.0
pandas df2 =
hapY
pos 0.1
1 721 0.0
2 735 0.6
3 739 1.5
I want to generate a new dataframe like:
df_joined['hapX|Y'] = df1.astype(str).add('|').add(df2.astype(str))
with expected output:
hapX|Y
pos 0.0|0.1
1 721 0.2|0.0
2 735 0.5|0.6
3 739 1.0|1.5
But, this is outputting bunch of NaN
hapX hapY
pos NaN NaN
1 721 NaN NaN
2 735 NaN NaN
3 739 NaN NaN
Is the problem that the values are floats? (I don't think so.) What is wrong with my approach?
Also, is there a way to automate the process if the columns are named hapX1 hapX2 hapX3 in one dataframe and hapY1 hapY2 hapY3 in the other?
Thanks,
You can merge the two dataframes and then concat the hapX and hapY.
Say your first column name is no.
df_joined = df1.merge(df2, on = 'no')
df_joined['hapX|Y'] = (df_joined['hapX'].astype(str))+'|'+(df_joined['hapY'].astype(str))
df_joined = df_joined.drop(['hapX', 'hapY'], axis = 1)
This gives you
no hapX|Y
0 pos 0.0|0.1
1 721 0.2|0.0
2 735 0.5|0.6
3 739 1.0|1.5
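As for why the original attempt produced NaN: arithmetic between two DataFrames aligns on column labels as well as on the index, and hapX never matches hapY. The sketch below reproduces that and shows that adding the two columns as Series avoids it. The frame layout, with pos as the index, is my guess at the question's data.

```python
import pandas as pd

# Assumed layout: pos is the index and each frame has a single value column.
pos = pd.Index([721, 735, 739], name="pos")
df1 = pd.DataFrame({"hapX": [0.2, 0.5, 1.0]}, index=pos)
df2 = pd.DataFrame({"hapY": [0.0, 0.6, 1.5]}, index=pos)

# DataFrame + DataFrame aligns on column labels; hapX and hapY never pair up,
# so the result has both columns and every cell is NaN.
misaligned = df1.astype(str).add("|").add(df2.astype(str))

# Series + Series aligns on the index only, so the values pair up row by row.
joined = (df1["hapX"].astype(str) + "|" + df2["hapY"].astype(str)).to_frame("hapX|Y")
print(joined)
```

This keeps pos as the index, so no merge key or column cleanup is needed afterwards.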
Just to add onto the previous answer, for the general case of N DataFrames,
Suppose you have a number of DataFrames as follows:
import random
dfs = [pd.DataFrame({'hapY'+str(j): [random.random() for i in range(10)]}) for j in range(5)]
such that
>>> dfs[0]
hapY0
0 0.175683
1 0.353729
2 0.949848
3 0.346088
4 0.435292
5 0.837879
6 0.277274
7 0.623121
8 0.325119
9 0.709252
Then,
>>> list(map(lambda m: '|'.join(m), zip(*[dfs[j]['hapY'+str(j)].astype(str) for j in range(5)])))
['0.0845464936138|0.193336164837|0.551717121013|0.113566029656|0.479590342798',
'0.275851474238|0.694161791339|0.151607726092|0.615367668451|0.498997567849',
'0.116891472119|0.258406028668|0.315137581816|0.819992354178|0.864412473301',
'0.729581942312|0.614902776003|0.443986436146|0.227782256619|0.0149481683863',
'0.745583477173|0.441456815889|0.428691631831|0.307480112319|0.136790112739',
'0.981337451224|0.0117895017035|0.415140979617|0.650957722911|0.968082350568',
'0.725618728314|0.0546057041356|0.715910454674|0.0828229441557|0.220878025678',
'0.704047455894|0.303403129266|0.0499082759635|0.49727194707|0.251623048104',
'0.453595354131|0.146042134766|0.346665276655|0.911092176243|0.291405609407',
'0.140523603089|0.117930249858|0.902071673051|0.0804933425857|0.876006332635']
which you can later put into a DataFrame.
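One possible way to get there without leaving pandas is to concatenate the frames side by side and join each row. This is a sketch of an alternative, not the code above; the small fixed values stand in for the random ones so the result is reproducible.

```python
import pandas as pd

# Small deterministic frames standing in for the random ones (values are made up).
dfs = [
    pd.DataFrame({"hapY0": [0.1, 0.2]}),
    pd.DataFrame({"hapY1": [1.1, 1.2]}),
    pd.DataFrame({"hapY2": [2.1, 2.2]}),
]

# Put the columns next to each other, aligned on the shared index.
wide = pd.concat(dfs, axis=1)
# Convert to strings and join each row's values with "|".
combined = wide.astype(str).agg("|".join, axis=1)
result = combined.to_frame("hapY")
print(result)
```

Because concat aligns on the index, this also works when the frames have the same index in a different order, which zip would not handle.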
I think the simplest is to rename the columns with a dict, which can be created by a dict comprehension, and finally call add_suffix:
print (df1)
hapX1 hapX2 hapX3 hapX4
pos
23 1.0 0.0 1.0 1.0
24 1.0 1.0 1.5 1.0
28 1.0 0.0 0.5 0.0
print (df2)
hapY1 hapY2 hapY3 hapY4
pos
23 0.0 1.0 0.5 0.0
24 1.0 1.0 1.5 1.0
28 0.0 1.0 1.0 1.0
d = {'hapY' + str(x):'hapX' + str(x) for x in range(1,5)}
print (d)
{'hapY1': 'hapX1', 'hapY3': 'hapX3', 'hapY2': 'hapX2', 'hapY4': 'hapX4'}
df_joined = df1.astype(str).add('|').add(df2.rename(columns=d).astype(str)).add_suffix('|Y')
print (df_joined)
hapX1|Y hapX2|Y hapX3|Y hapX4|Y
pos
23 1.0|0.0 0.0|1.0 1.0|0.5 1.0|0.0
24 1.0|1.0 1.0|1.0 1.5|1.5 1.0|1.0
28 1.0|0.0 0.0|1.0 0.5|1.0 0.0|1.0
