Conditional update on two columns on Pandas Dataframe - python

I have a pandas dataframe where I'm trying to append two column values if the value of the second column is not NaN. Importantly, after appending the two values I need the value from the second column set to NaN. I have managed to concatenate the values but cannot update the second column to NaN.
This is what I start with for ldc_df[['ad_StreetNo', 'ad_StreetNo2']].head(5):
ad_StreetNo ad_StreetNo2
0 284 NaN
1 51 NaN
2 136 NaN
3 196 198
4 227 NaN
This is what I currently have after appending:
ad_StreetNo ad_StreetNo2
0 284 NaN
1 51 NaN
2 136 NaN
3 196-198 198
4 227 NaN
But here is what I am trying to obtain:
ad_StreetNo ad_StreetNo2
0 284 NaN
1 51 NaN
2 136 NaN
3 196-198 NaN
4 227 NaN
Where the value for ldc_df['ad_StreetNo2'].loc[3] should be changed to NaN.
This is the code I am using currently:
def street_check(street_number_one, street_number_two):
    if pd.notnull(street_number_one) and pd.notnull(street_number_two):
        return str(street_number_one) + '-' + str(street_number_two)
    else:
        return street_number_one

ldc_df['ad_StreetNo'] = ldc_df[['ad_StreetNo', 'ad_StreetNo2']].apply(lambda x: street_check(*x), axis=1)
Does anyone have any advice as to how I can obtain my expected output?
Sam

# Convert the Street numbers to a string so that you can append the '-' character.
ldc_df['ad_StreetNo'] = ldc_df['ad_StreetNo'].astype(str)
# Create a mask of those addresses having an additional street number.
mask = ldc_df['ad_StreetNo2'].notnull()
# Use the mask to append the additional street number.
ldc_df.loc[mask, 'ad_StreetNo'] += '-' + ldc_df.loc[mask, 'ad_StreetNo2'].astype(str)
# Set the additional street number to NaN.
ldc_df.loc[mask, 'ad_StreetNo2'] = np.nan
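For reference, a minimal self-contained run of this approach on the sample data from the question; the extra int cast is an added step to avoid a trailing '.0', since the NaNs force ad_StreetNo2 to float:
import numpy as np
import pandas as pd
ldc_df = pd.DataFrame({'ad_StreetNo': [284, 51, 136, 196, 227],
                       'ad_StreetNo2': [np.nan, np.nan, np.nan, 198, np.nan]})
ldc_df['ad_StreetNo'] = ldc_df['ad_StreetNo'].astype(str)
mask = ldc_df['ad_StreetNo2'].notnull()
ldc_df.loc[mask, 'ad_StreetNo'] += '-' + ldc_df.loc[mask, 'ad_StreetNo2'].astype(int).astype(str)
ldc_df.loc[mask, 'ad_StreetNo2'] = np.nan
print(ldc_df)  # row 3 reads '196-198' and ad_StreetNo2 is NaN everywhere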
Alternative Solution
ldc_df['ad_StreetNo'] = (
    ldc_df['ad_StreetNo'].astype(str)
    + ['' if np.isnan(n) else '-{}'.format(str(int(n)))
       for n in ldc_df['ad_StreetNo2']]
)
ldc_df['ad_StreetNo2'] = np.nan

pd.DataFrame.stack folds a dataframe with a single level column index into a series object. Along the way, it drops any null values by default. We can then group by the previous index levels and join with '-'.
df.stack().astype(str).groupby(level=0).apply('-'.join)
0 284
1 51
2 136
3 196-198
4 227
dtype: object
I then use assign to create a copy of df while overwriting the two columns.
df.assign(
    ad_StreetNo=df.stack().astype(str).groupby(level=0).apply('-'.join),
    ad_StreetNo2=np.NaN
)
ad_StreetNo ad_StreetNo2
0 284 NaN
1 51 NaN
2 136 NaN
3 196-198 NaN
4 227 NaN

Related

Pandas Dataframe - How to transpose one value for the row n to the row n-5 [duplicate]

I would like to shift a column in a Pandas DataFrame, but I haven't been able to find a method to do it from the documentation without rewriting the whole DF. Does anyone know how to do it?
DataFrame:
## x1 x2
##0 206 214
##1 226 234
##2 245 253
##3 265 272
##4 283 291
Desired output:
## x1 x2
##0 206 nan
##1 226 214
##2 245 234
##3 265 253
##4 283 272
##5 nan 291
In [18]: a
Out[18]:
x1 x2
0 0 5
1 1 6
2 2 7
3 3 8
4 4 9
In [19]: a['x2'] = a.x2.shift(1)
In [20]: a
Out[20]:
x1 x2
0 0 NaN
1 1 5
2 2 6
3 3 7
4 4 8
You need to use df.shift here.
df.shift(i) shifts the entire dataframe by i units down.
So, for i = 1:
Input:
x1 x2
0 206 214
1 226 234
2 245 253
3 265 272
4 283 291
Output:
x1 x2
0 NaN NaN
1 206 214
2 226 234
3 245 253
4 265 272
So, run this script to get the expected output:
import pandas as pd
df = pd.DataFrame({'x1': ['206', '226', '245', '265', '283'],
                   'x2': ['214', '234', '253', '272', '291']})
print(df)
df['x2'] = df['x2'].shift(1)
print(df)
Let's define the dataframe from your example:
>>> df = pd.DataFrame([[206, 214], [226, 234], [245, 253], [265, 272], [283, 291]],
...                   columns=[1, 2])
>>> df
1 2
0 206 214
1 226 234
2 245 253
3 265 272
4 283 291
Then you could manipulate the index of the second column by
>>> df[2].index = df[2].index+1
and finally re-combine the single columns
>>> pd.concat([df[1], df[2]], axis=1)
1 2
0 206.0 NaN
1 226.0 214.0
2 245.0 234.0
3 265.0 253.0
4 283.0 272.0
5 NaN 291.0
Perhaps not fast, but simple to read. Consider setting variables for the column names and the actual shift required.
Edit: Generally, shifting is possible with df[2].shift(1), as already posted; however, that would cut off the carried-over value at the end.
If you don't want to lose the values you shift past the end of your dataframe, simply append the required number of empty rows first:
offset = 5
DF = DF.append([np.nan for x in range(offset)])  # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
DF = DF.shift(periods=offset)
DF = DF.reset_index()  # only works with a sequential index
Assuming the imports:
import pandas as pd
import numpy as np
First, append a new row of NaNs at the end of the DataFrame (df):
s1 = df.iloc[0]   # copy the 1st row to a new Series s1
s1[:] = np.NaN    # set all of its values to NaN
df2 = df.append(s1, ignore_index=True)  # add s1 to the end of df
This creates a new DataFrame, df2. There may be a more elegant way, but this works.
Now you can shift it:
df2.x2 = df2.x2.shift(1) # shift what you want
While trying to solve a similar problem of my own, I found something in the pandas docs that I think answers this question:
DataFrame.shift(periods=1, freq=None, axis=0)
Shift index by desired number of periods with an optional time freq.
Notes
If freq is specified, then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data.
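For example, a small sketch with a made-up daily DatetimeIndex showing the difference: with freq, the labels move and no data is lost.
import pandas as pd
s = pd.Series([1, 2, 3], index=pd.date_range('2021-01-01', periods=3, freq='D'))
print(s.shift(1))            # values shift down one row; the first becomes NaN
print(s.shift(1, freq='D'))  # index labels move forward one day; values 1, 2, 3 are preserved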
I hope this helps with future questions on the matter.
df3
1 108.210 108.231
2 108.231 108.156
3 108.156 108.196
4 108.196 108.074
... ... ...
2495 108.351 108.279
2496 108.279 108.669
2497 108.669 108.687
2498 108.687 108.915
2499 108.915 108.852
df3['yo'] = df3['yo'].shift(-1)  # a negative period shifts the column up, leaving NaN at the bottom
yo price
0 108.231 108.210
1 108.156 108.231
2 108.196 108.156
3 108.074 108.196
4 108.104 108.074
... ... ...
2495 108.669 108.279
2496 108.687 108.669
2497 108.915 108.687
2498 108.852 108.915
2499 NaN 108.852
This is how I do it:
df_ext = pd.DataFrame(index=pd.date_range(df.index[-1], periods=8, closed='right'))  # 'closed' became 'inclusive' in pandas 1.4
df2 = pd.concat([df, df_ext], axis=0, sort=True)
df2["forecast"] = df2["some column"].shift(7)
Basically I am generating an empty dataframe with the desired index and then just concatenate them together. But I would really like to see this as a standard feature in pandas so I have proposed an enhancement to pandas.
I'm new to pandas, and I may not be understanding the question, but this solution worked for my problem:
# Shift contents of column 'x2' down 1 row
df['x2'] = df['x2'].shift()
Or, to create a new column with contents of 'x2' shifted down 1 row
# Create new column with contents of 'x2' shifted down 1 row
df['x3'] = df['x2'].shift()
I had a read of the official docs for shift() while trying to figure this out, but it doesn't make much sense to me, and has no examples referencing this specific behavior.
Note that the last row of column 'x2' is effectively pushed off the end of the Dataframe. I expected shift() to have a flag to change this behaviour, but I can't find anything.
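One workaround, a sketch rather than a built-in flag: extend the index by one row before shifting, so the last value has somewhere to land (this mirrors the desired output from the question).
import pandas as pd
df = pd.DataFrame({'x1': [206, 226, 245, 265, 283],
                   'x2': [214, 234, 253, 272, 291]})
df = df.reindex(range(len(df) + 1))  # add an all-NaN row 5
df['x2'] = df['x2'].shift()          # 291 now survives in row 5
print(df)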

I want to read each cell of a pandas df one after another and do some calculation on them

I want to read each cell of a pandas df one after another and do some calculation on them, but I am having trouble using dictionaries or lists. For example, I want to check for the i-th row whether the outdoor temperature is more than X and the humidity is more/less than Y, and then do a special calculation for that row.
here is the body of loaded df:
data=pd.read_csv('/content/drive/My Drive/Thesis/DS1.xlsx - Sheet1.csv')
data=data.drop(columns=["Date","time","real feel","Humidity","indoor temp"])
print(data)
and here is the data:
outdoor temp Unnamed: 6 Humidity Estimation: (poly3)
0 26 NaN 64.1560
1 25 NaN 68.6875
2 25 NaN 68.6875
3 24 NaN 72.4640
4 24 NaN 72.4640
.. ... ... ...
715 35 NaN 22.5625
716 33 NaN 28.1795
717 32 NaN 32.3680
718 31 NaN 37.2085
719 30 NaN 42.5000
[720 rows x 3 columns]
Create a function and then use .apply() to run it on each row. You can edit temp and humid to your desired thresholds (the humidity column name below is taken from the question's dataframe). If you want to reference a specific row, use data.iloc[row_index]. I am not sure what calculation you want to do, so I just add one to the value.
def calculation(row, temp, humid):
    if row["outdoor temp"] > temp:
        row["outdoor temp"] += 1
    if row["Humidity Estimation: (poly3)"] > humid:
        row["Humidity Estimation: (poly3)"] += 1
    return row  # apply(axis=1) needs the modified row back, otherwise every row becomes None

data = data.apply(lambda row: calculation(row, temp, humid), axis=1)
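A runnable sketch using the calculation function above, with made-up thresholds (temp=30, humid=40) and a three-row toy frame with the question's column names; the "+1" is only a placeholder calculation:
import pandas as pd
data = pd.DataFrame({'outdoor temp': [26, 35, 30],
                     'Humidity Estimation: (poly3)': [64.156, 22.5625, 42.5]})
data = data.apply(lambda row: calculation(row, temp=30, humid=40), axis=1)
print(data)  # row 1 gets the temperature bump; rows 0 and 2 get the humidity bump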

Find the duplicate values of first column present in second column and return the corresponding row number of the value in second column

I have two columns in a dataframe with overlapping values. How can I find the duplicate values of the first column that are present in the second column, and return the corresponding row number of the value in the second column in a new column?
import pandas as pd
import csv
from pandas.compat import StringIO
print(pd.__version__)
csvdata = StringIO("""a,b
111,122
122,3
111,9
254,395
265,245
111,395
220,111
395,305
395,8""")
df1 = pd.read_csv(csvdata, sep=",")
# find unique duplicate values in first column
col_a_dups = df1['a'][df1['a'].duplicated()].unique()
corresponding_value = df1['b'][df1['b'].isin(col_a_dups)]
print(df1.join(corresponding_value, lsuffix="_l", rsuffix="_r"))
#print(corresponding_value.index)
Produces
0.24.2
a b_l b_r
0 111 122 NaN
1 122 3 NaN
2 111 9 NaN
3 254 395 395.0
4 265 245 NaN
5 111 395 395.0
6 220 111 111.0
7 395 305 NaN
8 395 8 NaN
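If what you are after is the row numbers themselves, note that corresponding_value keeps the original index of column b, so a small follow-on sketch:
print(corresponding_value.index.tolist())  # [3, 5, 6] -- the rows of b holding a value duplicated in a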

iteritems() in dataframe column

I have a dataset from the U.S. Education Datasets: Unification Project. I want to find out:
Number of rows where enrolment in grade 9 to 12 (column: GRADES_9_12_G) is less than 5000
Number of rows where enrolment in grade 9 to 12 (column: GRADES_9_12_G) is between 10,000 and 20,000.
I am having a problem updating the counts whenever the condition in the if statement is true.
import pandas as pd
import numpy as np
df = pd.read_csv("C:/Users/akash/Downloads/states_all.csv")
df.shape
df = df.iloc[:, -6]
for key, value in df.iteritems():
    count = 0
    count1 = 0
    if value < 5000:
        count += 1
    elif value < 20000 and value > 10000:
        count1 += 1
print(str(count) + str(count1))
df looks like this
0 196386.0
1 30847.0
2 175210.0
3 123113.0
4 1372011.0
5 160299.0
6 126917.0
7 28338.0
8 18173.0
9 511557.0
10 315539.0
11 43882.0
12 66541.0
13 495562.0
14 278161.0
15 138907.0
16 120960.0
17 181786.0
18 196891.0
19 59289.0
20 189795.0
21 230299.0
22 419351.0
23 224426.0
24 129554.0
25 235437.0
26 44449.0
27 79975.0
28 57605.0
29 47999.0
...
1462 NaN
1463 NaN
1464 NaN
1465 NaN
1466 NaN
1467 NaN
1468 NaN
1469 NaN
1470 NaN
1471 NaN
1472 NaN
1473 NaN
1474 NaN
1475 NaN
1476 NaN
1477 NaN
1478 NaN
1479 NaN
1480 NaN
1481 NaN
1482 NaN
1483 NaN
1484 NaN
1485 NaN
1486 NaN
1487 NaN
1488 NaN
1489 NaN
1490 NaN
1491 NaN
Name: GRADES_9_12_G, Length: 1492, dtype: float64
In the output I got
00
With Pandas, using loops is almost always the wrong way to go. You probably want something like this instead:
print(len(df.loc[df['GRADES_9_12_G'] < 5000]))
print(len(df.loc[(10000 < df['GRADES_9_12_G']) & (df['GRADES_9_12_G'] < 20000)]))
I downloaded your data set, and there are multiple ways to go about this. First of all, you do not need to subset your data if you do not want to. Your problem can be solved like this:
import pandas as pd
df = pd.read_csv('states_all.csv')
df.fillna(0, inplace=True) # fill NA with 0, not required but nice looking
print(len(df.loc[df['GRADES_9_12_G'] < 5000])) # 184
print(len(df.loc[(df['GRADES_9_12_G'] > 10000) & (df['GRADES_9_12_G'] < 20000)])) # 52
The line df.loc[df['GRADES_9_12_G'] < 5000] tells pandas to query the dataframe for all rows where the column df['GRADES_9_12_G'] is less than 5000. I then call Python's built-in len function on the returned DataFrame to get its length, which outputs 184. This is essentially a boolean masking process which returns all the rows of your df that meet the conditions you give it.
The second query, df.loc[(df['GRADES_9_12_G'] > 10000) & (df['GRADES_9_12_G'] < 20000)],
uses the & operator, a bitwise operator that requires both conditions to be met for a row to be returned. We then call len on that as well to get an integer count of the rows, which outputs 52.
To go off your method:
import pandas as pd
df = pd.read_csv('states_all.csv')
df.fillna(0, inplace=True) # fill NA with 0, not required but nice looking
df = df.iloc[:, -6]  # select the sixth column from the end as a Series
print(len(df[df < 5000])) # query your "df" for all values less than 5k and print len
print(len(df[(df > 10000) & (df < 20000)])) # same as above, just for vals in between range
Why did I change the code in my answer instead of using yours?
Simply put, it is more 'pandonic': where we can, it is cleaner to use pandas built-ins than to iterate over dataframes with for loops, since this is what pandas was designed for.
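One more built-in worth knowing, as a hedged sketch on the same column: Series.between plus a sum over the boolean mask gives the counts directly (inclusive='neither' matches the strict inequalities above and needs pandas 1.3+).
import pandas as pd
df = pd.read_csv('states_all.csv')
col = df['GRADES_9_12_G'].fillna(0)  # match the fillna(0) step above
print((col < 5000).sum())                                    # 184, as above
print(col.between(10000, 20000, inclusive='neither').sum())  # 52, as above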

Converting string objects to int/float using pandas

import pandas as pd
path1 = "/home/supertramp/Desktop/100&life_180_data.csv"
mydf = pd.read_csv(path1)
numcigar = {"Never":0 ,"1-5 Cigarettes/day" :1,"10-20 Cigarettes/day":4}
print mydf['Cigarettes']
mydf['CigarNum'] = mydf['Cigarettes'].apply(numcigar.get).astype(float)
print mydf['CigarNum']
mydf.to_csv('/home/supertramp/Desktop/powerRangers.csv')
The csv file "100&life_180_data.csv" contains columns like age, bmi,Cigarettes,Alocohol etc.
No int64
Age int64
BMI float64
Alcohol object
Cigarettes object
dtype: object
Cigarettes column contains "Never" "1-5 Cigarettes/day","10-20 Cigarettes/day".
I want to assign weights to these categories (Never, 1-5 Cigarettes/day, ...).
The expected output is a new column, CigarNum, appended, consisting only of the numeric weights (0, 1, 4).
CigarNum is as expected for the first rows, but then shows NaN through to the last row of the column.
0 Never
1 Never
2 1-5 Cigarettes/day
3 Never
4 Never
5 Never
6 Never
7 Never
8 Never
9 Never
10 Never
11 Never
12 10-20 Cigarettes/day
13 1-5 Cigarettes/day
14 Never
...
167 Never
168 Never
169 10-20 Cigarettes/day
170 Never
171 Never
172 Never
173 Never
174 Never
175 Never
176 Never
177 Never
178 Never
179 Never
180 Never
181 Never
Name: Cigarettes, Length: 182, dtype: object
The output I get shouldn't have NaN after the first few rows:
0 0
1 0
2 1
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 NaN
11 NaN
12 NaN
13 NaN
14 0
...
167 NaN
168 NaN
169 NaN
170 NaN
171 NaN
172 NaN
173 NaN
174 NaN
175 NaN
176 NaN
177 NaN
178 NaN
179 NaN
180 NaN
181 NaN
Name: CigarNum, Length: 182, dtype: float64
OK, the first problem is that you have embedded spaces, causing the dictionary lookup to apply incorrectly.
Fix this using the vectorised str methods:
mydf['Cigarettes'] = mydf['Cigarettes'].str.replace(' ', '')
now creating your new column should just work:
mydf['CigarNum'] = mydf['Cigarettes'].apply(numcigar.get).astype(float)
UPDATE
Thanks to @Jeff, as always, for pointing out superior ways to do things.
So you can call replace instead of calling apply:
mydf['CigarNum'] = mydf['Cigarettes'].replace(numcigar)
# now convert the types
mydf['CigarNum'] = mydf['CigarNum'].convert_objects(convert_numeric=True)
You can also use the factorize method.
Thinking about it, why not just set the dict values to floats in the first place and avoid the type conversion altogether?
So:
numcigar = {"Never":0.0 ,"1-5 Cigarettes/day" :1.0,"10-20 Cigarettes/day":4.0}
Version 0.17.0 or newer
convert_objects is deprecated as of 0.17.0 and has been replaced with to_numeric:
mydf['CigarNum'] = pd.to_numeric(mydf['CigarNum'], errors='coerce')
Here errors='coerce' returns NaN wherever a value cannot be converted to a number; without it, an exception is raised.
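As for the factorize suggestion above, a brief sketch: pd.factorize encodes each distinct string as an integer, though the codes follow order of appearance rather than your chosen weights.
import pandas as pd
s = pd.Series(['Never', 'Never', '1-5 Cigarettes/day', '10-20 Cigarettes/day'])
codes, uniques = pd.factorize(s)
print(codes)    # [0 0 1 2]
print(uniques)  # Index(['Never', '1-5 Cigarettes/day', '10-20 Cigarettes/day'], dtype='object')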
Try using this function for all problems of this kind:
def get_series_ids(x):
    '''Function returns a pandas series consisting of ids,
    corresponding to objects in input pandas series x
    Example:
    get_series_ids(pd.Series(['a','a','b','b','c']))
    returns Series([0,0,1,1,2], dtype=int)'''
    values = np.unique(x)
    values2nums = dict(zip(values, range(len(values))))
    return x.replace(values2nums)
