Converting string objects to int/float using pandas - python

import pandas as pd

path1 = "/home/supertramp/Desktop/100&life_180_data.csv"
mydf = pd.read_csv(path1)
numcigar = {"Never": 0, "1-5 Cigarettes/day": 1, "10-20 Cigarettes/day": 4}
print(mydf['Cigarettes'])
mydf['CigarNum'] = mydf['Cigarettes'].apply(numcigar.get).astype(float)
print(mydf['CigarNum'])
mydf.to_csv('/home/supertramp/Desktop/powerRangers.csv')
The csv file "100&life_180_data.csv" contains columns like age, bmi,Cigarettes,Alocohol etc.
No int64
Age int64
BMI float64
Alcohol object
Cigarettes object
dtype: object
The Cigarettes column contains "Never", "1-5 Cigarettes/day" and "10-20 Cigarettes/day".
I want to assign weights to these objects (Never, 1-5 Cigarettes/day, ...).
The expected output is a new appended column CigarNum which consists only of the numbers 0, 1, 2.
CigarNum is as expected for the first few rows and then shows NaN until the last row of the CigarNum column.
0 Never
1 Never
2 1-5 Cigarettes/day
3 Never
4 Never
5 Never
6 Never
7 Never
8 Never
9 Never
10 Never
11 Never
12 10-20 Cigarettes/day
13 1-5 Cigarettes/day
14 Never
...
167 Never
168 Never
169 10-20 Cigarettes/day
170 Never
171 Never
172 Never
173 Never
174 Never
175 Never
176 Never
177 Never
178 Never
179 Never
180 Never
181 Never
Name: Cigarettes, Length: 182, dtype: object
The output I get shouldn't give NaN after the first few rows.
0 0
1 0
2 1
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 NaN
11 NaN
12 NaN
13 NaN
14 0
...
167 NaN
168 NaN
169 NaN
170 NaN
171 NaN
172 NaN
173 NaN
174 NaN
175 NaN
176 NaN
177 NaN
178 NaN
179 NaN
180 NaN
181 NaN
Name: CigarNum, Length: 182, dtype: float64

OK, the first problem is that you have stray spaces in the values, causing the dict lookups to fail (numcigar.get returns None for them). Fix this using the vectorised str accessor:
mydf['Cigarettes'] = mydf['Cigarettes'].str.strip()
Now creating your new column should just work:
mydf['CigarNum'] = mydf['Cigarettes'].apply(numcigar.get).astype(float)
UPDATE
Thanks to @Jeff as always for pointing out superior ways to do things:
So you can call replace instead of calling apply:
mydf['CigarNum'] = mydf['Cigarettes'].replace(numcigar)
# now convert the types
mydf['CigarNum'] = mydf['CigarNum'].convert_objects(convert_numeric=True)
You can also use the factorize method.
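For reference, a minimal sketch of the factorize route; note that it assigns integer codes in order of first appearance, so the codes will not match a hand-picked weighting:
codes, uniques = pd.factorize(mydf['Cigarettes'].str.strip())
mydf['CigarNum'] = codes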
Thinking about it, why not just set the dict values to be floats anyway? Then you avoid the type conversion.
So:
numcigar = {"Never":0.0 ,"1-5 Cigarettes/day" :1.0,"10-20 Cigarettes/day":4.0}
Version 0.17.0 or newer
convert_objects is deprecated since 0.17.0; it has been replaced with to_numeric:
mydf['CigarNum'] = pd.to_numeric(mydf['CigarNum'], errors='coerce')
Here errors='coerce' will return NaN where the values cannot be converted to a numeric value; without it, an exception is raised.
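Putting it together on a current pandas version, a minimal end-to-end sketch, assuming the same numcigar mapping and whitespace-padded values as above:
import pandas as pd
numcigar = {"Never": 0, "1-5 Cigarettes/day": 1, "10-20 Cigarettes/day": 4}
mydf['CigarNum'] = pd.to_numeric(mydf['Cigarettes'].str.strip().replace(numcigar), errors='coerce')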

Try using this function for all problems of this kind:
import numpy as np
import pandas as pd

def get_series_ids(x):
    '''Return a pandas Series of integer ids corresponding to
    the objects in the input pandas Series x.
    Example:
        get_series_ids(pd.Series(['a', 'a', 'b', 'b', 'c']))
        returns Series([0, 0, 1, 1, 2], dtype=int)'''
    values = np.unique(x)
    values2nums = dict(zip(values, range(len(values))))
    return x.replace(values2nums)
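Hypothetical usage on the smoking column from the question; note the ids follow the sorted unique values, not a custom weighting:
mydf['CigarNum'] = get_series_ids(mydf['Cigarettes'])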

Related

Drop interval of rows in a dataframe based on the value of an object

I am trying to drop intervals of rows in my DataFrames from the maximal value (exclusive) to the rest (end) of the column. Here is an example of one of the columns of my df (dflist['time']):
0 0.000000
1 0.021528
2 0.042135
3 0.062925
4 0.083498
...
88 1.796302
89 1.816918
90 1.837118
91 1.857405
92 1.878976
Name: time, Length: 93, dtype: float64
I have tried to use the .iloc and the .drop functions in conjunction with the .index to achieve this result, but without any success so far:
for nested_dict in dict_all_raw.values():
    for dflist in nested_dict.values():
        v_max = dflist['velocity'].max()
        v_max_idx = dflist['velocity'].index[dflist['velocity'] == v_max]
        dflist['time'] = dflist['time'].iloc[0:[v_max_idx]]
I have also tried several variations, like converting v_max_idx to a list or an int to change the type passed to .iloc, as the type seems to be the problem:
TypeError: cannot do positional indexing on RangeIndex with these indexers [[Int64Index([15], dtype='int64')]] of type list
I don't know why I am not able to do this and it is quite frustrating, as it seems to be a pretty basic operation.
Any help would therefore be greatly appreciated!
EDIT regarding the dropna() problem
I tried with .notna():
for nested_dict in dict_all_raw.values():
    for dflist in nested_dict.values():
        v_max = dflist['velocity'].max()
        v_max_idx = dflist['velocity'].index[dflist['velocity'] == v_max]
        dflist['velocity'] = dflist['velocity'].iloc[0:list(v_max_idx)[0]]
        dflist['velocity'] = dflist['velocity'][dflist['velocity'].notna()]
        dflist['time'] = dflist['time'].iloc[0:list(v_max_idx)[0]]
        dflist['time'] = dflist['time'][dflist['time'].notna()]
and with dropna():
for nested_dict in dict_all_raw.values():
    for dflist in nested_dict.values():
        v_max = dflist['velocity'].max()
        v_max_idx = dflist['velocity'].index[dflist['velocity'] == v_max]
        dflist['velocity'] = dflist['velocity'].iloc[0:list(v_max_idx)[0]].dropna()
        dflist['time'] = dflist['time'].iloc[0:list(v_max_idx)[0]].dropna()
No error messages, it just doesn't do anything:
19 0.385243 1.272031
20 0.405416 1.329072
21 0.425477 1.352059
22 0.445642 1.349657
23 0.465755 1.378407
24 NaN NaN
25 NaN NaN
26 NaN NaN
27 NaN NaN
28 NaN NaN
29 NaN NaN
30 NaN NaN
31 NaN NaN
32 NaN NaN
33 NaN NaN
34 NaN NaN
35 NaN NaN
36 NaN NaN
The return value of the .index[...] lookup in your example is a pandas.Int64Index.
pandas.DataFrame.iloc() allows inputs like a slice object with ints, e.g. 1:7.
In your code, neither v_max_idx (a pandas.Index object) nor [v_max_idx] (a list wrapping that Index) meets the requirements of iloc's argument types.
You can use list(v_max_idx) to convert the pandas.Index object to a list and then use [0] etc. to access the data, like
dflist['time'] = dflist['time'].iloc[0:list(v_max_idx)[0]]
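A shorter sketch of the same fix, assuming the default RangeIndex so that index labels coincide with positions: idxmax returns the label of the first maximum directly, with no Index-to-list conversion needed.
for nested_dict in dict_all_raw.values():
    for dflist in nested_dict.values():
        # label of the max; on a RangeIndex the label equals the position
        v_max_pos = dflist['velocity'].idxmax()
        dflist['time'] = dflist['time'].iloc[0:v_max_pos]  # same slice as above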

iteritems() in dataframe column

I have a dataset from the U.S. Education Datasets: Unification Project. I want to find out
the number of rows where enrolment in grades 9 to 12 (column: GRADES_9_12_G) is less than 5000, and
the number of rows where enrolment in grades 9 to 12 (column: GRADES_9_12_G) is between 10,000 and 20,000.
I am having a problem updating the counts whenever the condition in the if statement is true.
import pandas as pd
import numpy as np

df = pd.read_csv("C:/Users/akash/Downloads/states_all.csv")
df.shape
df = df.iloc[:, -6]
for key, value in df.iteritems():
    count = 0
    count1 = 0
    if value < 5000:
        count += 1
    elif value < 20000 and value > 10000:
        count1 += 1
print(str(count) + str(count1))
df looks like this
0 196386.0
1 30847.0
2 175210.0
3 123113.0
4 1372011.0
5 160299.0
6 126917.0
7 28338.0
8 18173.0
9 511557.0
10 315539.0
11 43882.0
12 66541.0
13 495562.0
14 278161.0
15 138907.0
16 120960.0
17 181786.0
18 196891.0
19 59289.0
20 189795.0
21 230299.0
22 419351.0
23 224426.0
24 129554.0
25 235437.0
26 44449.0
27 79975.0
28 57605.0
29 47999.0
...
1462 NaN
1463 NaN
1464 NaN
1465 NaN
1466 NaN
1467 NaN
1468 NaN
1469 NaN
1470 NaN
1471 NaN
1472 NaN
1473 NaN
1474 NaN
1475 NaN
1476 NaN
1477 NaN
1478 NaN
1479 NaN
1480 NaN
1481 NaN
1482 NaN
1483 NaN
1484 NaN
1485 NaN
1486 NaN
1487 NaN
1488 NaN
1489 NaN
1490 NaN
1491 NaN
Name: GRADES_9_12_G, Length: 1492, dtype: float64
In the output I got
00
With Pandas, using loops is almost always the wrong way to go. (As an aside, your loop prints 00 because count and count1 are reset to 0 on every iteration; initialise them once, before the loop.) You probably want something like this instead:
print(len(df.loc[df['GRADES_9_12_G'] < 5000]))
print(len(df.loc[(10000 < df['GRADES_9_12_G']) & (df['GRADES_9_12_G'] < 20000)]))
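For completeness, a corrected sketch of the original loop, assuming df is the full dataframe as read above (initialise the counters outside the loop; on modern pandas use items() instead of the removed iteritems()):
count = 0
count1 = 0
for key, value in df['GRADES_9_12_G'].items():
    if value < 5000:
        count += 1
    elif 10000 < value < 20000:
        count1 += 1
print(count, count1)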
I downloaded your data set, and there are multiple ways to go about this. First of all, you do not need to subset your data if you do not want to. Your problem can be solved like this:
import pandas as pd
df = pd.read_csv('states_all.csv')
df.fillna(0, inplace=True) # fill NA with 0, not required but nice looking
print(len(df.loc[df['GRADES_9_12_G'] < 5000])) # 184
print(len(df.loc[(df['GRADES_9_12_G'] > 10000) & (df['GRADES_9_12_G'] < 20000)])) # 52
The line df.loc[df['GRADES_9_12_G'] < 5000] tells pandas to query the dataframe for all rows where the column df['GRADES_9_12_G'] is less than 5000. I then call python's builtin len function on the returned DataFrame, which outputs 184. This is essentially a boolean masking process which returns all True values for your df that meet the conditions you give it.
The second query df.loc[(df['GRADES_9_12_G'] > 10000) & (df['GRADES_9_12_G'] < 20000)]
uses the & operator, a bitwise operator that requires both conditions to be met for a row to be returned. We then call the len function on that as well to get an integer count of the rows, which outputs 52.
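A hedged shorthand for the same two queries: a boolean mask sums as its count of True values, and Series.between expresses the range condition (inclusive='neither' requires pandas 1.3 or later):
print((df['GRADES_9_12_G'] < 5000).sum())
print(df['GRADES_9_12_G'].between(10000, 20000, inclusive='neither').sum())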
To go off your method:
import pandas as pd
df = pd.read_csv('states_all.csv')
df.fillna(0, inplace=True) # fill NA with 0, not required but nice looking
df = df.iloc[:, -6] # select the sixth column from the end (GRADES_9_12_G) as a Series
print(len(df[df < 5000])) # query your "df" for all values less than 5k and print len
print(len(df[(df > 10000) & (df < 20000)])) # same as above, just for vals in between range
Why did I change the code in my answer instead of using yours?
Simply put, it is more pandonic. Where we can, it is cleaner to use pandas built-ins than to iterate over dataframes with for loops, as this is what pandas was designed for.

Replacing a pandas DataFrame value with np.nan when the value ends with '_h'

Below is the code that I have been working with to replace some values with np.NaN. My issue is how to replace '47614750_h' at index 111 with np.NaN. I can do this directly with drop_list; however, I need to iterate this with different values ending in '_h' over many files and would like to do it automatically.
I have tried some searches on regex, as it seems the way to go, but could not find what I needed.
drop_list = ['dash_code', 'SONIC WELD']
df_clean.replace(drop_list, np.NaN).tail(10)
DASH_CODE Name Quantity
107 1011567 .156 MALE BULLET TERM INSUL 1.0
108 102066901 .032 X .187 FEMALE Q.D. TERM. 1.0
109 105137901 TERM,RING,10-12AWG,INSULATED 1.0
110 101919701 1/4 RING TERM INSUL 2.0
111 47614750001_h HARNESS, MAIN, AC, LIO 1.0
112 NaN NaN 19.0
113 7685 5/16 RING TERM INSUL. 1.0
114 102521601 CLIP,HARNESS 2.0
115 47614808001 CAP, RESISTOR, TERMINATION 1.0
116 103749801 RECPT, DEUTSCH, DTM04-4P 1.0
You can use pd.Series.apply for this with a lambda:
df['DASH_CODE'] = df['DASH_CODE'].apply(lambda x: np.NaN if x.endswith('_h') else x)
From the documentation:
Invoke function on values of Series. Can be ufunc (a NumPy function that applies to the entire Series) or a Python function that only works on single values.
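A hedged vectorised alternative to the lambda, assuming DASH_CODE holds strings (na=False keeps any existing NaNs from raising):
mask = df['DASH_CODE'].str.endswith('_h', na=False)
df.loc[mask, 'DASH_CODE'] = np.nan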
It may be faster to try to convert all the rows to float using pd.to_numeric:
In [11]: pd.to_numeric(df.DASH_CODE, errors='coerce')
Out[11]:
0 1.011567e+06
1 1.020669e+08
2 1.051379e+08
3 1.019197e+08
4 NaN
5 NaN
6 7.685000e+03
7 1.025216e+08
8 4.761481e+10
9 1.037498e+08
Name: DASH_CODE, dtype: float64
In [12]: df["DASH_CODE"] = pd.to_numeric(df["DASH_CODE"], errors='coerce')

Weighted Means for columns in Pandas DataFrame including NaN

I am trying to get the weighted mean for each column (A-F) of a Pandas.DataFrame with "Value" as the weight. I can only find solutions for problems with categories, which is not what I need.
The comparable call for ordinary (unweighted) means would be
df.mean()
Notice the df has NaN in the columns and in "Value".
A B C D E F Value
0 17656 61496 83 80 117 99 2902804
1 75078 61179 14 3 6 14 3761964
2 21316 60648 86 Nan 107 93 127963
3 6422 48468 28855 26838 27319 27011 131354
4 12378 42973 47153 46062 46634 42689 3303909572
5 54292 35896 59 6 3 18 27666367
6 21272 Nan 126 12 3 5 9618047
7 26434 35787 113 17 4 8 309943
8 10508 34314 34197 7100 10 10 NaN
I can use this for a single column.
df1 = df[['A','Value']]
df1 = df1.dropna()
np.average(df1['A'], weights=df1['Value'])
There must be a simple method. It's driving me nuts that I don't see it.
I would appreciate any help.
You could use masked arrays. We can drop the rows where the Value column has NaN values.
In [353]: dff = df.dropna(subset=['Value'])
In [354]: dff.apply(lambda x: np.ma.average(
    np.ma.MaskedArray(x, mask=np.isnan(x)), weights=dff.Value))
Out[354]:
A 1.282629e+04
B 4.295120e+04
C 4.652817e+04
D 4.545254e+04
E 4.601520e+04
F 4.212276e+04
Value 3.260246e+09
dtype: float64
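An equivalent sketch without masked arrays, assuming the columns are numeric with proper NaN values; the weighted mean per column is sum(w*x) divided by the sum of w over rows where x is present:
dff = df.dropna(subset=['Value'])
cols = dff.columns.drop('Value')
w = dff['Value']
# numerator skips NaN terms; denominator sums weights only where x is present
result = dff[cols].mul(w, axis=0).sum() / dff[cols].notna().mul(w, axis=0).sum()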

Conditional update on two columns on Pandas Dataframe

I have a pandas dataframe where I'm trying to append two column values if the value of the second column is not NaN. Importantly, after appending the two values I need the value from the second column set to NaN. I have managed to concatenate the values but cannot update the second column to NaN.
This is what I start with for ldc_df[['ad_StreetNo', 'ad_StreetNo2']].head(5):
ad_StreetNo ad_StreetNo2
0 284 NaN
1 51 NaN
2 136 NaN
3 196 198
4 227 NaN
This is what I currently have after appending:
ad_StreetNo ad_StreetNo2
0 284 NaN
1 51 NaN
2 136 NaN
3 196-198 198
4 227 NaN
But here is what I am trying to obtain:
ad_StreetNo ad_StreetNo2
0 284 NaN
1 51 NaN
2 136 NaN
3 196-198 NaN
4 227 NaN
Where the value for ldc_df['ad_StreetNo2'].loc[3] should be changed to NaN.
This is the code I am using currently:
def street_check(street_number_one, street_number_two):
    if pd.notnull(street_number_one) and pd.notnull(street_number_two):
        return str(street_number_one) + '-' + str(street_number_two)
    else:
        return street_number_one

ldc_df['ad_StreetNo'] = ldc_df[['ad_StreetNo', 'ad_StreetNo2']].apply(lambda x: street_check(*x), axis=1)
Does anyone have any advice as to how I can obtain my expected output?
Sam
# Convert the street numbers to strings so that you can append the '-' character.
ldc_df['ad_StreetNo'] = ldc_df['ad_StreetNo'].astype(str)
# Create a mask of those addresses having an additional street number.
mask = ldc_df['ad_StreetNo2'].notnull()
# Use the mask to append the additional street number.
ldc_df.loc[mask, 'ad_StreetNo'] += '-' + ldc_df.loc[mask, 'ad_StreetNo2'].astype(str)
# Set the additional street number to NaN.
ldc_df.loc[mask, 'ad_StreetNo2'] = np.nan
Alternative Solution
ldc_df['ad_StreetNo'] = (
    ldc_df['ad_StreetNo'].astype(str)
    + ['' if np.isnan(n) else '-{}'.format(int(n))
       for n in ldc_df['ad_StreetNo2']]
)
ldc_df['ad_StreetNo2'] = np.nan
pd.DataFrame.stack folds a dataframe with a single-level column index into a series object. Along the way, it drops any null values by default. We can then group by the original row index (level=0) and join with '-'.
df.stack().astype(str).groupby(level=0).apply('-'.join)
0 284
1 51
2 136
3 196-198
4 227
dtype: object
I then use assign to create a copy of df while overwriting the two columns.
df.assign(
    ad_StreetNo=df.stack().astype(str).groupby(level=0).apply('-'.join),
    ad_StreetNo2=np.NaN
)
ad_StreetNo ad_StreetNo2
0 284 NaN
1 51 NaN
2 136 NaN
3 196-198 NaN
4 227 NaN
