Python DataFrame - Can't replace text with a number

I am working with a Bicycle dataset. I want to replace the text values in the 'weather' column with the numbers 1 to 4. The column has dtype object. I tried all of the following approaches, but none of them works.
There is another column called 'season'. If I apply the same code to 'season', it works fine. Please help.
Sample data:
datetime season holiday workingday weather temp atemp humidity windspeed
0 5/10/2012 11:00 Summer NaN 1 Clear + Few clouds 21.32 25.000 48 35.0008
1 6/9/2012 7:00 Summer NaN 0 Clear + Few clouds 23.78 27.275 64 7.0015
2 3/6/2011 20:00 Spring NaN 0 Light Snow, Light Rain 11.48 12.120 100 27.9993
3 10/13/2011 11:00 Winter NaN 1 Mist + Cloudy 25.42 28.790 83 0.0000
4 6/2/2012 12:00 Summer NaN 0 Clear + Few clouds 25.42 31.060 43 23.9994
I tried the following. None of it worked on 'weather', but when I use the same code on the 'season' column it works fine.
test["weather"] = np.where(test["weather"]=="Clear + Few clouds", 1,
(np.where(test["weather"]=="Mist + Cloudy",2,(np.where(test["weather"]=="Light Snow, Light
Rain",3,(np.where(test["weather"]=="Heavy Rain + Thunderstorm",4,0)))))))
PE_weather = [
(train['weather'] == ' Clear + Few clouds '),
(train['weather'] =='Mist + Cloudy') ,
(train['weather'] >= 'Light Snow, Light Rain'),
(train['weather'] >= 'Heavy Rain + Thunderstorm')]
PE_weather_value = ['1', '2', '3','4']
train['Weather'] = np.select(PE_weather, PE_weather_value)
test.loc[test.weather =='Clear + Few clouds', 'weather']='1'

I suggest you make a dictionary to look up the corresponding values and then apply a lookup to the weather column.
weather_lookup = {
    'Clear + Few clouds': 1,
    'Mist + Cloudy': 2,
    'Light Snow, Light Rain': 3,
    'Heavy Rain + Thunderstorm': 4
}

def lookup(w):
    return weather_lookup.get(w, 0)

test['weather'] = test['weather'].apply(lookup)
Output:
datetime season holiday workingday weather temp atemp humidity windspeed
0 5/10/2012 11:00 Summer NaN 1 1 21.32 25.000 48 35.0008
1 6/9/2012 7:00 Summer NaN 0 1 23.78 27.275 64 7.0015
2 3/6/2011 20:00 Spring NaN 0 3 11.48 12.120 100 27.9993
3 10/13/2011 11:00 Winter NaN 1 2 25.42 28.790 83 0.0000
4 6/2/2012 12:00 Summer NaN 0 1 25.42 31.060 43 23.9994
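A side note, not part of the original answer: Series.map does the same dictionary lookup in vectorized form. Keys missing from weather_lookup come back as NaN, so fill them with the same default 0 that lookup used:
# Vectorized equivalent of the apply/get version above
test['weather'] = test['weather'].map(weather_lookup).fillna(0).astype(int)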

Related

New column based on existing string column in Python

My dataframe looks like:
School  Term            Students
A       summer 2020     324
B       spring 21       101
A       summer/spring   201
F       wintersem       44
C       fall trimester  98
E                       23
I need to add a new column Termcode that takes one of six values: summer, spring, fall, winter, multiple, or none, based on the corresponding value in the Term column, viz:
School  Term            Students  Termcode
A       summer 2020     324       summer
B       spring 21       101       spring
A       summer/spring   201       multiple
F       wintersem       44        winter
C       fall trimester  98        fall
E                       23        none
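For reproducibility (my sketch; the question does not include construction code), the sample frame can be built as:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'School': ['A', 'B', 'A', 'F', 'C', 'E'],
    # NaN stands for the empty Term cell in the last row
    'Term': ['summer 2020', 'spring 21', 'summer/spring', 'wintersem', 'fall trimester', np.nan],
    'Students': [324, 101, 201, 44, 98, 23],
})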
You can use a regex with str.extractall, then fill the values depending on the number of matches:
terms = ['summer', 'spring', 'fall', 'winter']
regex = r'('+'|'.join(terms)+r')'
# '(summer|spring|fall|winter)'
# extract values and set up grouper for next step
g = df['Term'].str.extractall(regex)[0].groupby(level=0)
# get the first match, replace with "multiple" if more than one
df['Termcode'] = g.first().mask(g.nunique().gt(1), 'multiple')
# fill the missing data (i.e. no match) with "none"
df['Termcode'] = df['Termcode'].fillna('none')
output:
School Term Students Termcode
0 A summer 2020 324 summer
1 B spring 21 101 spring
2 A summer/spring 201 multiple
3 F wintersem 44 winter
4 C fall trimester 98 fall
5 E NaN 23 none
Alternatively, with Series.str.findall:
l = ['summer', 'spring', 'fall', 'winter']
s = df['Term'].str.findall(fr"{'|'.join(l)}")
df['Termcode'] = np.where(s.str.len() > 1, 'multiple', s.str[0])
School Term Students Termcode
0 A summer 2020 324 summer
1 B spring 21 101 spring
2 A summer/spring 201 multiple
3 F wintersem 44 winter
4 C fall trimester 98 fall
5 E NaN 23 NaN
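One caveat, my note rather than the original answer's: the findall variant leaves the unmatched row 5 as NaN instead of 'none'. The same fillna step from the first answer closes the gap:
df['Termcode'] = df['Termcode'].fillna('none')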

Python : Remodeling the presentation data from a pandas Dataframe / group duplicates

Let's say that I have this dataframe with three columns: "Name", "Account" and "Ccy".
import pandas as pd
Name = ['Dan', 'Mike', 'Dan', 'Dan', 'Sara', 'Charles', 'Mike', 'Karl']
Account = ['100', '30', '50', '200', '90', '20', '65', '230']
Ccy = ['EUR','EUR','USD','USD','','CHF', '','DKN']
df = pd.DataFrame({'Name':Name, 'Account' : Account, 'Ccy' : Ccy})
Name Account Ccy
0 Dan 100 EUR
1 Mike 30 EUR
2 Dan 50 USD
3 Dan 200 USD
4 Sara 90
5 Charles 20 CHF
6 Mike 65
7 Karl 230 DKN
I would like to represent this data differently. I would like to write a script that finds all the duplicates in the Name column and regroups them with their different accounts, and, if there is a currency "Ccy", adds a new column next to it with the associated currencies.
So something like this:
Dan Ccy1 Mike Ccy2 Sara Charles Ccy3 Karl Ccy4
0 100 EUR 30 EUR 90 20 CHF 230 DKN
1 50 USD 65
2 200 USD
I don't really know how to start, so I simplified the problem in order to proceed step by step. I tried to regroup the duplicates by name with a list; however, it did not identify the duplicates.
x_len, y_len = df.shape
new_data = []
for i in range(x_len):
    if df.iloc[i, 0] not in new_data:
        print(str(df.iloc[i, 0]) + '\t' + str(df.iloc[i, 1]) + '\t' + str(bool(df.iloc[i, 0] not in new_data)))
        new_data.append([df.iloc[i, 0], df.iloc[i, 1]])
    else:
        new_data[str(df.iloc[i, 0])].append(df.iloc[i, 1])
Then I thought it would be easier to use a dictionary. So I tried this loop, but there is an error, and maybe it is not the best route to the expected final result:
from collections import defaultdict

dico = defaultdict(list)
x_len, y_len = df.shape
for i in range(x_len):
    if df.iloc[i, 0] not in dico:
        print(str(df.iloc[i, 0]) + '\t' + str(df.iloc[i, 1]) + '\t' + str(bool(df.iloc[i, 0] not in dico)))
        dico[str(df.iloc[i, 0])] = df.iloc[i, 1]
        print(dico)
    else:
        dico[df.iloc[i, 0]].append(df.iloc[i, 1])
Does anyone have an idea how to start, or how to write the code if it is simple?
Thank you
Use GroupBy.cumcount as a counter, reshape with DataFrame.set_index and DataFrame.unstack, and finally flatten the column names:
g = df.groupby(['Name']).cumcount()
df = df.set_index([g,'Name']).unstack().sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df)
Account_Charles Ccy_Charles Account_Dan Ccy_Dan Account_Karl Ccy_Karl \
0 20 CHF 100 EUR 230 DKN
1 NaN NaN 50 USD NaN NaN
2 NaN NaN 200 USD NaN NaN
Account_Mike Ccy_Mike Account_Sara Ccy_Sara
0 30 EUR 90
1 65 NaN NaN
2 NaN NaN NaN NaN
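Why the counter matters (my note, not part of the original answer): cumcount gives each repeated Name a running index 0, 1, 2, ..., so the pair (counter, Name) forms a unique MultiIndex that unstack can pivot without a duplicate-index error. On the original df:
print(df.groupby('Name').cumcount().tolist())
# [0, 0, 1, 2, 0, 0, 1, 0] -> Dan appears three times, Mike twice, the rest once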
If you need custom column names, use if-else in a list comprehension:
g = df.groupby(['Name']).cumcount()
df = df.set_index([g,'Name']).unstack().sort_index(level=1, axis=1)
L = [b if a == 'Account' else f'{a}{i // 2}' for i, (a, b) in enumerate(df.columns)]
df.columns = L
print (df)
Charles Ccy0 Dan Ccy1 Karl Ccy2 Mike Ccy3 Sara Ccy4
0 20 CHF 100 EUR 230 DKN 30 EUR 90
1 NaN NaN 50 USD NaN NaN 65 NaN NaN
2 NaN NaN 200 USD NaN NaN NaN NaN NaN NaN

Assigning games in the NFL week values

I am trying to assign each game in the NFL a value for the week in which it occurs.
For example, in the 2008 season all games that fall in the range between the 4th and 10th of September occur in week 1.
i = 0
week = 1
start_date = df2008['date'].iloc[0]
end_date = df2008['date'].iloc[-1]
week_range = pd.interval_range(start=start_date, end=end_date, freq='7D', closed='left')
for row in df2008['date']:
    row = row.date()
    if row in week_range[i]:
        df2008['week'] = week
    else:
        week += 1
However, this is updating all of the games to week 1:
date week
1601 2008-09-04 1
1602 2008-09-07 1
1603 2008-09-07 1
1604 2008-09-07 1
1605 2008-09-07 1
... ... ...
1863 2009-01-11 1
1864 2009-01-11 1
1865 2009-01-18 1
1866 2009-01-18 1
1867 2009-02-01 1
I have tried using print statements to debug, and these are my results. "In Range" marks games that occur in week 1, which return as expected.
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
In Range
Not In Range
Not In Range
Not In Range
Not In Range
Not In Range
Not In Range
df_sample:
display(df2008[['date', 'home', 'away', 'week']])
date home away week
1601 2008-09-04 Giants Redskins 1
1602 2008-09-07 Falcons Lions 1
1603 2008-09-07 Bills Seahawks 1
1604 2008-09-07 Titans Jaguars 1
1605 2008-09-07 Dolphins Jets 1
... ... ... ... ...
1863 2009-01-11 Giants Eagles 1
1864 2009-01-11 Steelers Chargers 1
1865 2009-01-18 Cardinals Eagles 1
1866 2009-01-18 Steelers Ravens 1
1867 2009-02-01 Cardinals Steelers 1
Can anyone point out where I am going wrong?
OP's original question was: "Can anyone point out where I am going wrong?". So, although pandas.Series.dt.week is a fine pandas solution (as Parfait points out below), to answer that question I followed OP's original code logic, with some fixes. Two things go wrong in the original loop: df2008['week'] = week assigns the whole column rather than the current row, and i is never advanced, so only week_range[0] is ever tested. The version below writes per row with .loc and advances week in an inner loop:
import pandas as pd

df2008 = pd.DataFrame({"date": [pd.Timestamp("2008-09-04"), pd.Timestamp("2008-09-07"), pd.Timestamp("2008-09-07"), pd.Timestamp("2008-09-07"), pd.Timestamp("2008-09-07"), pd.Timestamp("2009-01-11"), pd.Timestamp("2009-01-11"), pd.Timestamp("2009-01-18"), pd.Timestamp("2009-01-18"), pd.Timestamp("2009-02-01")],
                       "home": ["Giants", "Falcon", "Bills", "Titans", "Dolphins", "Giants", "Steelers", "Cardinals", "Steelers", "Cardinals"],
                       "away": ["Falcon", "Bills", "Titans", "Dolphins", "Giants", "Steelers", "Cardinals", "Steelers", "Cardinals", "Ravens"]})

week = 1
start_date = df2008['date'].iloc[0]
# end_date = df2008['date'].iloc[-1]
end_date = pd.Timestamp("2009-03-01")
week_range = pd.interval_range(start=start_date, end=end_date, freq='7D', closed='left')

df2008['week'] = None
for i in range(len(df2008['date'])):
    rd = df2008.loc[i, 'date'].date()
    while True:
        if week == len(week_range):
            break
        if rd in week_range[week - 1]:
            df2008.loc[i, 'week'] = week
            break
        else:
            week += 1
print(df2008)
Out:
date home away week
0 2008-09-04 Giants Falcon 1
1 2008-09-07 Falcon Bills 1
2 2008-09-07 Bills Titans 1
3 2008-09-07 Titans Dolphins 1
4 2008-09-07 Dolphins Giants 1
5 2009-01-11 Giants Steelers 19
6 2009-01-11 Steelers Cardinals 19
7 2009-01-18 Cardinals Steelers 20
8 2009-01-18 Steelers Cardinals 20
9 2009-02-01 Cardinals Ravens 22
Consider avoiding any looping and use pandas.Series.dt.week on datetime fields, which returns the week of the year. Then subtract the first week. A wrinkle occurs when the season crosses into the new year, so handle that conditionally: add the difference up to the end of the old year, then the weeks elapsed in the new year. Fortunately, weeks start on Monday, so Thursday through Sunday keep the same week number.
import numpy as np

first_week = pd.Series(pd.to_datetime(['2008-09-04'])).dt.week.values
# FIND LAST SUNDAY OF YEAR (NOT NECESSARILY DEC 31)
end_year_week = pd.Series(pd.to_datetime(['2008-12-28'])).dt.week.values
new_year_week = pd.Series(pd.to_datetime(['2009-01-01'])).dt.week.values

# CONDITIONALLY ASSIGN
df2008['week'] = np.where(df2008['date'] < '2009-01-01',
                          (df2008['date'].dt.week - first_week) + 1,
                          (end_year_week - first_week) + ((df2008['date'].dt.week - new_year_week) + 1))
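A portability note of mine, not part of the original answer: Series.dt.week was deprecated in pandas 1.1 and removed in pandas 2.0. On current versions the equivalent is:
# pandas >= 1.1: isocalendar() returns a DataFrame with year/week/day columns
weeks = df2008['date'].dt.isocalendar().week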
To demonstrate, here is randomly seeded data (including new-year dates); OP can substitute their reproducible sample.
Data
import numpy as np
import pandas as pd

### DATA BUILD
np.random.seed(120619)
df2008 = pd.DataFrame({'group': np.random.choice(['sas', 'stata', 'spss', 'python', 'r', 'julia'], 500),
                       'int': np.random.randint(1, 10, 500),
                       'num': np.random.randn(500),
                       'char': [''.join(np.random.choice(list('ABC123'), 3)) for _ in range(500)],
                       'bool': np.random.choice([True, False], 500),
                       'date': np.random.choice(pd.date_range('2008-09-04', '2009-01-06'), 500)})
Calculation
first_week = pd.Series(pd.to_datetime(['2008-09-04'])).dt.week.values
end_year_week = pd.Series(pd.to_datetime(['2008-12-28'])).dt.week.values
new_year_week = pd.Series(pd.to_datetime(['2009-01-01'])).dt.week.values

df2008['week'] = np.where(df2008['date'] < '2008-12-28',
                          (df2008['date'].dt.week - first_week) + 1,
                          (end_year_week - first_week) + ((df2008['date'].dt.week - new_year_week) + 1))
df2008 = df2008.sort_values('date').reset_index(drop=True)
print(df2008.head(10))
# group int num char bool date week
# 0 sas 2 0.099927 A2C False 2008-09-04 1
# 1 python 3 0.241393 2CB False 2008-09-04 1
# 2 python 8 0.516716 ABC False 2008-09-04 1
# 3 spss 2 0.974715 3CB False 2008-09-04 1
# 4 stata 9 -1.582096 CAA True 2008-09-04 1
# 5 sas 3 0.070347 1BB False 2008-09-04 1
# 6 r 5 -0.419936 1CA True 2008-09-05 1
# 7 python 6 0.628749 1AB True 2008-09-05 1
# 8 python 3 0.713695 CA1 False 2008-09-05 1
# 9 python 1 -0.686137 3AA False 2008-09-05 1
print(df2008.tail(10))
# group int num char bool date week
# 490 spss 5 -0.548257 3CC True 2009-01-04 17
# 491 julia 8 -0.176858 AA2 False 2009-01-05 18
# 492 julia 5 -1.422237 A1B True 2009-01-05 18
# 493 stata 2 -1.710138 BB2 True 2009-01-05 18
# 494 python 4 -0.285249 1B1 True 2009-01-05 18
# 495 spss 3 0.918428 C23 True 2009-01-06 18
# 496 r 5 -1.347936 1AC False 2009-01-06 18
# 497 stata 3 0.883093 1C3 False 2009-01-06 18
# 498 python 9 0.448237 12A True 2009-01-06 18
# 499 spss 3 1.459097 2A1 False 2009-01-06 18
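A final aside, not from either answer: if a "week" here is just a 7-day block anchored at the first game, plain date arithmetic sidesteps both the loop and the calendar-week wrinkle. A minimal sketch, assuming df2008['date'] is a datetime column:
# Week = number of complete 7-day periods since the first game, plus 1
first_game = df2008['date'].min()
df2008['week'] = (df2008['date'] - first_game).dt.days // 7 + 1
On OP's sample this reproduces the loop answer's values (week 1 for the September games; 19, 20 and 22 for the January/February ones).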

Creating a dataframe where one of the arrays has a different length

I am learning to scrape data from websites with Python, extracting weather information about San Francisco from this page. I get stuck when combining the data into a pandas DataFrame. Is it possible to create a dataframe where the rows have different lengths?
I have already tried two ways based on answers here, but they are not exactly what I am looking for; both shift the values of the temps column up. Here is a screenshot of what I am trying to explain:
1st way: https://stackoverflow.com/a/40442094/10179259
2nd way: https://stackoverflow.com/a/19736406/10179259
import requests
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
periods=[pt.get_text() for pt in seven_day.select('.tombstone-container .period-name')]
short_descs=[sd.get_text() for sd in seven_day.select('.tombstone-container .short-desc')]
temps=[t.get_text() for t in seven_day.select('.tombstone-container .temp')]
descs = [d['alt'] for d in seven_day.select('.tombstone-container img')]
#print(len(periods), len(short_descs), len(temps), len(descs))
weather = pd.DataFrame({
    "period": periods,          # length is 9
    "short_desc": short_descs,  # length is 9
    "temp": temps,              # problem here: length is 8
    # "desc": descs             # length is 9
})
print(weather)
I expect the first row of the temp column to be NaN. Thank you.
You can loop over each value of forecast_items and use next with iter to select the first match; if none exists, NaN is assigned to the dictionary:
import numpy as np

page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")

out = []
for x in forecast_items:
    periods = next(iter([t.get_text() for t in x.select('.period-name')]), np.nan)
    short_descs = next(iter([t.get_text() for t in x.select('.short-desc')]), np.nan)
    temps = next(iter([t.get_text() for t in x.select('.temp')]), np.nan)
    descs = next(iter([d['alt'] for d in x.select('img')]), np.nan)
    out.append({'period': periods, 'short_desc': short_descs, 'temp': temps, 'descs': descs})

weather = pd.DataFrame(out)
print(weather)
print (weather)
descs period \
0 NOW until4:00pm Sat
1 Today: Showers, with thunderstorms also possib... Today
2 Tonight: Showers likely and possibly a thunder... Tonight
3 Sunday: A chance of showers before 11am, then ... Sunday
4 Sunday Night: Rain before 11pm, then a chance ... SundayNight
5 Monday: A 40 percent chance of showers. Cloud... Monday
6 Monday Night: A 30 percent chance of showers. ... MondayNight
7 Tuesday: A 50 percent chance of rain. Cloudy,... Tuesday
8 Tuesday Night: Rain. Cloudy, with a low aroun... TuesdayNight
short_desc temp
0 Wind Advisory NaN
1 Showers andBreezy High: 56 °F
2 ShowersLikely Low: 49 °F
3 Heavy Rainand Windy High: 56 °F
4 Heavy Rainand Breezythen ChanceShowers Low: 52 °F
5 ChanceShowers High: 58 °F
6 ChanceShowers Low: 53 °F
7 Chance Rain High: 59 °F
8 Rain Low: 53 °F
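For anyone unfamiliar with the next(iter(...), default) idiom above, a brief illustration of mine: it returns the first element of a sequence, or the default when the sequence is empty, which is how the missing .temp cell becomes NaN:
import numpy as np

print(next(iter(['High: 56 °F']), np.nan))  # 'High: 56 °F'
print(next(iter([]), np.nan))               # nan -> e.g. the Wind Advisory row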

Assigning (or tieing in) function results back to original data in pandas

I am struggling with extracting the regression coefficients once I complete the call to np.polyfit (actual code below). I am able to get a display of each coefficient but am unsure how to actually extract them for future use with the original data.
df=pd.read_csv('2_skews.csv')
Here is a head() of the data
date expiry symbol strike vol
0 6/10/2015 1/19/2016 IBM 50 42.0
1 6/10/2015 1/19/2016 IBM 55 41.5
2 6/10/2015 1/19/2016 IBM 60 40.0
3 6/10/2015 1/19/2016 IBM 65 38.0
4 6/10/2015 1/19/2016 IBM 70 36.0
There are many symbols with many strikes across many days and many expiry dates as well
I have grouped the data by date, symbol and expiry and then call the regression function with this:
df_reg=df.groupby(['date','symbol','expiry']).apply(regress)
I have this function that seems to work well (it gives proper coefficients); I just don't seem to be able to access them and tie them back to the original data.
def regress(df):
    y = df['vol']
    x = df['strike']
    z = P.polyfit(x, y, 4)
    return z
I am calling polyfit like this:
from numpy.polynomial import polynomial as P
The final results:
df_reg
date symbol expiry
5/19/2015 GS 1/19/2016 [-112.064833151, 6.76871521993, -0.11147562136...
3/21/2016 [-131.2914493, 7.16441276062, -0.1145534833, 0...
IBM 1/19/2016 [211.458028147, -5.01236287512, 0.044819313514...
3/21/2016 [-34.1027973807, 3.16990194634, -0.05676206572...
6/10/2015 GS 1/19/2016 [50.3916788503, 0.795484227762, -0.02701849495...
3/21/2016 [31.6090441114, 0.851878910113, -0.01972772270...
IBM 1/19/2016 [-13.6159660078, 3.23002791603, -0.06015739505...
3/21/2016 [-51.6709051223, 4.80288173687, -0.08600312989...
dtype: object
The top result has the functional form:
y = -0.000002x^4 + 0.000735x^3 - 0.111476x^2 + 6.768715x - 112.064833
I have tried to take the constructive criticism of previous commenters and make my question as clear as possible; please let me know if I still need to work on this :-)
John
Changing the output of regress to a Series rather than a numpy array will give you a data frame when you groupby. The index of the series will be the column names:
In [37]:
df = pd.DataFrame(
    [['6/10/2015', '1/19/2016', 'IBM', 50, 42.0],
     ['6/10/2015', '1/19/2016', 'IBM', 55, 41.5],
     ['6/10/2015', '1/19/2016', 'IBM', 60, 40.0],
     ['6/10/2015', '1/19/2016', 'IBM', 65, 38.0],
     ['6/10/2015', '1/19/2016', 'IBM', 70, 36.0]],
    columns=['date', 'expiry', 'symbol', 'strike', 'vol'])

def regress(df):
    y = df['vol']
    x = df['strike']
    z = np.polyfit(x, y, 4)
    return pd.Series(z, name='order', index=range(5)[::-1])
group_cols = ['date', 'expiry', 'symbol']
coeffs = df.groupby(group_cols).apply(regress)
coeffs
Out[40]:
order 4 3 2 1 0
date expiry symbol
6/10/2015 1/19/2016 IBM -5.388312e-18 0.000667 -0.13 8.033333 -118
To get the columns containing the coefficients for each combination of date, expiry and symbol you can then merge df and coeffs on these columns:
In [25]: df.merge(coeffs.reset_index(), on=group_cols)
Out[25]:
date expiry symbol strike vol 4 3 2 1 0
0 6/10/2015 1/19/2016 IBM 50 42.0 -6.644454e-18 0.000667 -0.13 8.033333 -118
1 6/10/2015 1/19/2016 IBM 55 41.5 -6.644454e-18 0.000667 -0.13 8.033333 -118
2 6/10/2015 1/19/2016 IBM 60 40.0 -6.644454e-18 0.000667 -0.13 8.033333 -118
3 6/10/2015 1/19/2016 IBM 65 38.0 -6.644454e-18 0.000667 -0.13 8.033333 -118
4 6/10/2015 1/19/2016 IBM 70 36.0 -6.644454e-18 0.000667 -0.13 8.033333 -118
You can then do something like
df = df.merge(coeffs.reset_index(), on=group_cols)
strike_powers = pd.DataFrame(dict((i, df.strike**i) for i in range(5)))
df['modelled_vol'] = (strike_powers * df[range(5)]).sum(axis=1)
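A closing aside, not part of the original answer: numpy can also evaluate the fitted polynomial directly from the merged coefficient columns, avoiding the manual powers table. A sketch assuming the integer columns 0 through 4 produced by the merge above:
from numpy.polynomial import polynomial as P

# Coefficients ordered low -> high as rows, one column per data row;
# tensor=False evaluates each strike against its own coefficient column.
coef = df[range(5)].to_numpy().T
df['modelled_vol'] = P.polyval(df['strike'].to_numpy(), coef, tensor=False)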
