I have a DataFrame whose data are strings. These strings need to be evaluated and converted to numeric values.
Let my df be:
var_pct var_num
-76*2 14*1000000
-76*2 12*1000000
111*2 29*1000000
47*2 33*1000000
nan 60*1000000
for column in df:
    df[column] = df.eval(df[column], inplace=True)
I ran into a problem with the column containing 'nan', where the result of eval is shorter than the original. How do I make sure the 'nan' becomes '' after eval?
You should avoid eval. In this case, I recommend you split into numeric series first:
df = df.fillna('NaN*NaN')
for col in df.columns:
    df = df.join(df.pop(col).str.split('*', expand=True)
                   .apply(pd.to_numeric, errors='coerce')
                   .add_prefix(f'{col}_'))
print(df)
var_pct_0 var_pct_1 var_num_0 var_num_1
0 -76.0 2.0 14 1000000
1 -76.0 2.0 12 1000000
2 111.0 2.0 29 1000000
3 47.0 2.0 33 1000000
4 NaN NaN 60 1000000
Then perform your calculations using vectorised operations:
for col in ['var_pct', 'var_num']:
    df[col] = df[f'{col}_0'] * df[f'{col}_1']
For academic purposes, the approach you are attempting is possible via the top-level function pd.eval together with applymap. But beware: this is just an inefficient Python-level loop.
import numpy as np
import pandas as pd

nan = np.nan  # pd.eval resolves the name 'nan' from the calling scope
df = df.fillna('nan*nan')
df = df.applymap(lambda x: pd.eval(x))
print(df)
var_pct var_num
0 -152.0 14000000
1 -152.0 12000000
2 222.0 29000000
3 94.0 33000000
4 NaN 60000000
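If you then need the NaN cells shown as empty strings, as the question asks, a small follow-up (a sketch; note it converts the columns to object dtype):
df = df.fillna('')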
Assuming that you can live with a copied dataframe:
def ff(val):
    if 'nan' not in val:
        return eval(val)

df4 = df3.applymap(ff)
print(df4)
var_pct var_num
0 -152.0 14000000
1 -152.0 12000000
2 222.0 29000000
3 94.0 33000000
4 NaN 60000000
Of course, ff can be expressed as a lambda too (note this variant returns the original string rather than None for the 'nan' rows):
lambda val: eval(val) if 'nan' not in val else val
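Applied the same way, this would be (assuming df3 again holds the original string frame):
df4 = df3.applymap(lambda val: eval(val) if 'nan' not in val else val)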
Related
I have a data frame with three columns. I want to check whether they follow a logical sequence.
code:
df = pd.DataFrame({'low': [10, 15, np.nan], 'medium': [12, 18, 29], 'high': [16, 19, np.nan]})
df =
low medium high
0 10.0 12 16.0
1 15.0 18 19.0
2 NaN 29 NaN
# check if low < medium < high
df['check'] = (df['low'] < df['medium']) & (df['medium'] < df['high'])
print("Condition failed: %s" % (df['check'].all()))
Present output:
df['check'] =
True   # correct
True   # correct
False  # wrong output here; it should not consider this row
Basically, I want to avoid comparing against the NaN values, which produces a false output. How can I do it?
You can mask it. Also, instead of a chained condition, you can use between:
df['check'] = (df['medium'].between(df['low'], df['high'], inclusive='neither')
                           .mask(df[['low', 'high']].isna().any(axis=1)))
Output:
low medium high check
0 10.0 12 16.0 True
1 15.0 18 19.0 True
2 NaN 29 NaN NaN
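For reference, a self-contained sketch of this masking approach (assuming pandas >= 1.3, where between accepts inclusive='neither'):
import numpy as np
import pandas as pd

df = pd.DataFrame({'low': [10, 15, np.nan],
                   'medium': [12, 18, 29],
                   'high': [16, 19, np.nan]})

# strict low < medium < high, masked to NaN wherever low or high is missing
check = df['medium'].between(df['low'], df['high'], inclusive='neither')
df['check'] = check.mask(df[['low', 'high']].isna().any(axis=1))
print(df)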
I have the U.S. Education Datasets: Unification Project dataset. I want to find out:
the number of rows where enrolment in grades 9 to 12 (column: GRADES_9_12_G) is less than 5000;
the number of rows where enrolment in grades 9 to 12 (column: GRADES_9_12_G) is between 10,000 and 20,000.
I am having a problem updating the counts whenever the condition in the if statement is met.
import pandas as pd
import numpy as np

df = pd.read_csv("C:/Users/akash/Downloads/states_all.csv")
df.shape
df = df.iloc[:, -6]
for key, value in df.iteritems():
    count = 0
    count1 = 0
    if value < 5000:
        count += 1
    elif value < 20000 and value > 10000:
        count1 += 1
print(str(count) + str(count1))
df looks like this
0 196386.0
1 30847.0
2 175210.0
3 123113.0
4 1372011.0
5 160299.0
6 126917.0
7 28338.0
8 18173.0
9 511557.0
10 315539.0
11 43882.0
12 66541.0
13 495562.0
14 278161.0
15 138907.0
16 120960.0
17 181786.0
18 196891.0
19 59289.0
20 189795.0
21 230299.0
22 419351.0
23 224426.0
24 129554.0
25 235437.0
26 44449.0
27 79975.0
28 57605.0
29 47999.0
...
1462 NaN
1463 NaN
1464 NaN
1465 NaN
1466 NaN
1467 NaN
1468 NaN
1469 NaN
1470 NaN
1471 NaN
1472 NaN
1473 NaN
1474 NaN
1475 NaN
1476 NaN
1477 NaN
1478 NaN
1479 NaN
1480 NaN
1481 NaN
1482 NaN
1483 NaN
1484 NaN
1485 NaN
1486 NaN
1487 NaN
1488 NaN
1489 NaN
1490 NaN
1491 NaN
Name: GRADES_9_12_G, Length: 1492, dtype: float64
In the output I got
00
With Pandas, using loops is almost always the wrong way to go. (Incidentally, the immediate bug in your loop is that count and count1 are reset to 0 on every iteration, so only the final row is ever counted.) You probably want something like this instead:
print(len(df.loc[df['GRADES_9_12_G'] < 5000]))
print(len(df.loc[(10000 < df['GRADES_9_12_G']) & (df['GRADES_9_12_G'] < 20000)]))
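Note that NaN rows drop out of these counts automatically, since any comparison with NaN evaluates to False; a quick check (a sketch):
import numpy as np
import pandas as pd

s = pd.Series([4000.0, np.nan, 15000.0])
print((s < 5000).sum())  # 1 -- the NaN row is not counted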
I downloaded your data set, and there are multiple ways to go about this. First of all, you do not need to subset your data if you do not want to. Your problem can be solved like this:
import pandas as pd
df = pd.read_csv('states_all.csv')
df.fillna(0, inplace=True) # fill NA with 0, not required but nice looking
print(len(df.loc[df['GRADES_9_12_G'] < 5000])) # 184
print(len(df.loc[(df['GRADES_9_12_G'] > 10000) & (df['GRADES_9_12_G'] < 20000)])) # 52
The line df.loc[df['GRADES_9_12_G'] < 5000] tells pandas to query the dataframe for all rows where the column df['GRADES_9_12_G'] is less than 5000. I am then calling Python's built-in len function to return the length of the returned DataFrame, which outputs 184. This is essentially a boolean masking process which returns all True values for your df that meet the conditions you give it.
The second query, df.loc[(df['GRADES_9_12_G'] > 10000) & (df['GRADES_9_12_G'] < 20000)], uses the & operator, a bitwise operator that requires both conditions to be met for a row to be returned. We then call len on the result as well to get the number of rows, which outputs 52.
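Equivalently, since True counts as 1, you can sum the boolean masks directly (a minor variant, not from the original answer):
mask_low = df['GRADES_9_12_G'] < 5000
mask_between = (df['GRADES_9_12_G'] > 10000) & (df['GRADES_9_12_G'] < 20000)
print(mask_low.sum(), mask_between.sum())  # 184 52, matching the counts above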
To go off your method:
import pandas as pd
df = pd.read_csv('states_all.csv')
df.fillna(0, inplace=True) # fill NA with 0, not required but nice looking
df = df.iloc[:, -6] # select all rows for your column -6
print(len(df[df < 5000])) # query your "df" for all values less than 5k and print len
print(len(df[(df > 10000) & (df < 20000)])) # same as above, just for vals in between range
Why did I change the code in my answer instead of using yours? Simply put, it is more idiomatic pandas. Where we can, it is cleaner to use pandas built-ins than to iterate over dataframes with for loops, since this is what pandas was designed for.
I have a dataframe with two numeric columns. I want to add a third column that calculates the difference. But the condition is: if the value in the first column is blank or NaN, the difference should be the value in the second column...
Can anyone help me with this problem?
Any suggestions and clues will be appreciated!
Thank you.
You should use vectorised operations where possible. Here you can use numpy.where:
df['Difference'] = np.where(df['July Sales'].isnull(), df['August Sales'],
                            df['August Sales'] - df['July Sales'])
However, note that this is precisely the same as treating NaN values in df['July Sales'] as zero. So you can use pd.Series.fillna:
df['Difference'] = df['August Sales'] - df['July Sales'].fillna(0)
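A minimal self-contained sketch of both variants, using the July Sales / August Sales column names from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'July Sales': [459.0, np.nan, 348.0],
                   'August Sales': [477, 563, 483]})

# conditional version
df['Difference'] = np.where(df['July Sales'].isnull(),
                            df['August Sales'],
                            df['August Sales'] - df['July Sales'])
# equivalent fillna version
df['Difference'] = df['August Sales'] - df['July Sales'].fillna(0)
print(df)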
This isn't really a situation requiring conditions; it is just a math operation. Consider your df using the .sub() method with a fill_value:
df['Diff'] = df['August Sales'].sub(df['July Sales'], fill_value=0)
which returns:
July Sales August Sales Diff
0 459.0 477 18.0
1 422.0 125 -297.0
2 348.0 483 135.0
3 397.0 271 -126.0
4 NaN 563 563.0
5 191.0 325 134.0
6 435.0 463 28.0
7 NaN 479 479.0
8 475.0 473 -2.0
9 284.0 496 212.0
I used a sample dataframe, but it shouldn't be hard to adapt:
df = pd.DataFrame({'A': [1, 2, np.nan, 3], 'B': [10, 20, 30, 40]})

def diff(row):
    return row['B'] if pd.isnull(row['A']) else row['B'] - row['A']

df['C'] = df.apply(diff, axis=1)
ORIGINAL DATAFRAME:
A B
0 1.0 10
1 2.0 20
2 NaN 30
3 3.0 40
AFTER apply:
A B C
0 1.0 10 9.0
1 2.0 20 18.0
2 NaN 30 30.0
3 3.0 40 37.0
try this:
def diff(row):
    # pd.isnull catches NaN; a plain truth test would not, since NaN is truthy
    if pd.isnull(row['col1']):
        return row['col2']
    else:
        return row['col2'] - row['col1']

df['col3'] = df.apply(diff, axis=1)
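The same logic can also be vectorised, which is faster than apply on large frames (assuming the col1/col2 names above):
df['col3'] = df['col2'] - df['col1'].fillna(0)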
Below is the code that I have been working with to replace some values with np.NaN. My issue is how to replace '47614750001_h' at index 111 with np.NaN. I could do this directly with drop_list; however, I need to repeat this for different values ending in '_h' across many files, and would like to do it automatically.
I have tried some searches on regex, as it seems the way to go, but could not find what I needed.
drop_list = ['dash_code', 'SONIC WELD']
df_clean.replace(drop_list, np.NaN).tail(10)
DASH_CODE Name Quantity
107 1011567 .156 MALE BULLET TERM INSUL 1.0
108 102066901 .032 X .187 FEMALE Q.D. TERM. 1.0
109 105137901 TERM,RING,10-12AWG,INSULATED 1.0
110 101919701 1/4 RING TERM INSUL 2.0
111 47614750001_h HARNESS, MAIN, AC, LIO 1.0
112 NaN NaN 19.0
113 7685 5/16 RING TERM INSUL. 1.0
114 102521601 CLIP,HARNESS 2.0
115 47614808001 CAP, RESISTOR, TERMINATION 1.0
116 103749801 RECPT, DEUTSCH, DTM04-4P 1.0
You can use pd.Series.apply for this with a lambda:
df['DASH_CODE'] = df['DASH_CODE'].apply(lambda x: np.NaN if isinstance(x, str) and x.endswith('_h') else x)  # isinstance guard skips the existing NaN at index 112
From the documentation:
Invoke function on values of Series. Can be ufunc (a NumPy function
that applies to the entire Series) or a Python function that only
works on single values
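Since the question mentions regex, a hedged alternative is Series.replace with a regular expression, or the vectorised str.endswith accessor:
# replace any value ending in '_h' with NaN (pattern anchored at the end of the string)
df['DASH_CODE'] = df['DASH_CODE'].replace(r'.*_h$', np.NaN, regex=True)

# or with a boolean mask; na=False treats existing NaNs as non-matches
df.loc[df['DASH_CODE'].str.endswith('_h', na=False), 'DASH_CODE'] = np.NaN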
It may be faster to try to convert all the rows to float using pd.to_numeric:
In [11]: pd.to_numeric(df.DASH_CODE, errors='coerce')
Out[11]:
0 1.011567e+06
1 1.020669e+08
2 1.051379e+08
3 1.019197e+08
4 NaN
5 NaN
6 7.685000e+03
7 1.025216e+08
8 4.761481e+10
9 1.037498e+08
Name: DASH_CODE, dtype: float64
In [12]: df["DASH_CODE"] = pd.to_numeric(df["DASH_CODE"], errors='coerce')
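If you would rather keep the column as strings and only null out the values that fail numeric conversion, one option (a sketch, not from the original answer) is to use the coercion result as a mask:
mask = pd.to_numeric(df['DASH_CODE'], errors='coerce').isna() & df['DASH_CODE'].notna()
df.loc[mask, 'DASH_CODE'] = np.NaN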
I want to merge the values of two different columns of a pandas dataframe into one column of a new dataframe.
pandas df1 =
        hapX
0  pos   0.0
1  721   0.2
2  735   0.5
3  739   1.0
pandas df2 =
        hapY
0  pos   0.1
1  721   0.0
2  735   0.6
3  739   1.5
I want to generate a new dataframe like:
df_joined['hapX|Y'] = df1.astype(str).add('|').add(df2.astype(str))
with expected output:
          hapX|Y
0  pos   0.0|0.1
1  721   0.2|0.0
2  735   0.5|0.6
3  739   1.0|1.5
But this outputs a bunch of NaNs:
        hapX  hapY
0  pos   NaN   NaN
1  721   NaN   NaN
2  735   NaN   NaN
3  739   NaN   NaN
Is the problem that the values are floats? (I don't think so.) What is the problem with my approach?
Also, is there a way to automate the process if the column names are like hapX1, hapX2, hapX3 in one dataframe and hapY1, hapY2, hapY3 in the other?
Thanks,
You can merge the two dataframes and then concat the hapX and hapY.
Say your first column name is no.
df_joined = df1.merge(df2, on='no')
df_joined['hapX|Y'] = df_joined['hapX'].astype(str) + '|' + df_joined['hapY'].astype(str)
df_joined = df_joined.drop(['hapX', 'hapY'], axis=1)
This gives you
    no   hapX|Y
0  pos  0.0|0.1
1  721  0.2|0.0
2  735  0.5|0.6
3  739  1.0|1.5
Just to add onto the previous answer, for the general case of N DataFrames: suppose you have a number of DataFrames as follows:
import random
import pandas as pd

dfs = [pd.DataFrame({'hapY' + str(j): [random.random() for i in range(10)]}) for j in range(5)]
such that
>>> dfs[0]
hapY0
0 0.175683
1 0.353729
2 0.949848
3 0.346088
4 0.435292
5 0.837879
6 0.277274
7 0.623121
8 0.325119
9 0.709252
Then,
>>> list(map(lambda m: '|'.join(m), zip(*[dfs[j]['hapY' + str(j)].astype(str) for j in range(5)])))
['0.0845464936138|0.193336164837|0.551717121013|0.113566029656|0.479590342798',
'0.275851474238|0.694161791339|0.151607726092|0.615367668451|0.498997567849',
'0.116891472119|0.258406028668|0.315137581816|0.819992354178|0.864412473301',
'0.729581942312|0.614902776003|0.443986436146|0.227782256619|0.0149481683863',
'0.745583477173|0.441456815889|0.428691631831|0.307480112319|0.136790112739',
'0.981337451224|0.0117895017035|0.415140979617|0.650957722911|0.968082350568',
'0.725618728314|0.0546057041356|0.715910454674|0.0828229441557|0.220878025678',
'0.704047455894|0.303403129266|0.0499082759635|0.49727194707|0.251623048104',
'0.453595354131|0.146042134766|0.346665276655|0.911092176243|0.291405609407',
'0.140523603089|0.117930249858|0.902071673051|0.0804933425857|0.876006332635']
which you can later put into a DataFrame.
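For instance, to collect the joined strings into a DataFrame (a sketch; the column name here is made up):
joined = list(map(lambda m: '|'.join(m),
                  zip(*[dfs[j]['hapY' + str(j)].astype(str) for j in range(5)])))
df_joined = pd.DataFrame({'hapY_joined': joined})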
I think the simplest approach is to rename the columns via a dict, which can be created with a dict comprehension, and finally call add_suffix:
print (df1)
hapX1 hapX2 hapX3 hapX4
pos
23 1.0 0.0 1.0 1.0
24 1.0 1.0 1.5 1.0
28 1.0 0.0 0.5 0.0
print (df2)
hapY1 hapY2 hapY3 hapY4
pos
23 0.0 1.0 0.5 0.0
24 1.0 1.0 1.5 1.0
28 0.0 1.0 1.0 1.0
d = {'hapY' + str(x):'hapX' + str(x) for x in range(1,5)}
print (d)
{'hapY1': 'hapX1', 'hapY3': 'hapX3', 'hapY2': 'hapX2', 'hapY4': 'hapX4'}
df_joined = df1.astype(str).add('|').add(df2.rename(columns=d).astype(str)).add_suffix('|Y')
print (df_joined)
hapX1|Y hapX2|Y hapX3|Y hapX4|Y
pos
23 1.0|0.0 0.0|1.0 1.0|0.5 1.0|0.0
24 1.0|1.0 1.0|1.0 1.5|1.5 1.0|1.0
28 1.0|0.0 0.0|1.0 0.5|1.0 0.0|1.0
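If you would rather not hardcode the range, the same rename dict can be built from df2's columns directly (a hedged variant, assuming the hapX/hapY names differ only in that letter):
d = {c: c.replace('hapY', 'hapX') for c in df2.columns}
df_joined = df1.astype(str).add('|').add(df2.rename(columns=d).astype(str)).add_suffix('|Y')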