I have this DataFrame, which has null values that were not populated correctly.
Unidad Precio Combustible Año_del_vehiculo Caballos \
49 1 1000 Gasolina 1998.0 50.0
63 1 800 Gasolina 1998.0 50.0
88 1 600 Gasolina 1999.0 54.0
107 1 3100 Diésel 2008.0 54.0
244 1 2000 Diésel 1995.0 60.0
... ... ... ... ... ...
46609 1 47795 Gasolina 2016.0 420.0
46770 1 26900 Gasolina 2011.0 450.0
46936 1 19900 Gasolina 2007.0 510.0
46941 1 24500 Gasolina 2006.0 514.0
47128 1 79600 Gasolina 2017.0 612.0
Comunidad_autonoma Marca_y_Modelo Año_Venta Año_Comunidad \
49 Islas Baleares CITROEN AX 2020 2020Islas Baleares
63 Islas Baleares SEAT Arosa 2021 2021Islas Baleares
88 Islas Baleares FIAT Seicento 2020 2020Islas Baleares
107 La Rioja TOYOTA Aygo 2020 2020La Rioja
244 Aragón PEUGEOT 205 2019 2019Aragón
... ... ... ... ...
46609 La Rioja PORSCHE Cayenne 2020 2020La Rioja
46770 Cataluña AUDI RS5 2020 2020Cataluña
46936 Islas Baleares MERCEDES-BENZ Clase M 2020 2020Islas Baleares
46941 La Rioja MERCEDES-BENZ Clase E 2020 2020La Rioja
47128 Islas Baleares MERCEDES-BENZ Clase E 2021 2021Islas Baleares
Fecha Año Super_95 Diesel Comunidad Salario en euros anuales
49 2020-12-01 NaN NaN NaN NaN NaN
63 2021-01-01 NaN NaN NaN NaN NaN
88 2020-12-01 NaN NaN NaN NaN NaN
107 2020-12-01 NaN NaN NaN NaN NaN
244 2019-03-01 NaN NaN NaN NaN NaN
... ... ... ... ... ... ...
46609 2020-12-01 NaN NaN NaN NaN NaN
46770 2020-07-01 NaN NaN NaN NaN NaN
46936 2020-10-01 NaN NaN NaN NaN NaN
46941 2020-11-01 NaN NaN NaN NaN NaN
47128 2021-01-01 NaN NaN NaN NaN NaN
I need to fill the gasoline, diesel and salary columns with the values from the following DataFrame:
Año Super_95 Diesel Comunidad Año_Comunidad Fecha \
0 2020 1.321750 1.246000 Navarra 2020Navarra 2020-01-01
1 2020 1.301000 1.207250 Navarra 2020Navarra 2020-02-01
2 2020 1.224800 1.126200 Navarra 2020Navarra 2020-03-01
3 2020 1.106667 1.020000 Navarra 2020Navarra 2020-04-01
4 2020 1.078750 0.986250 Navarra 2020Navarra 2020-05-01
.. ... ... ... ... ... ...
386 2021 1.416600 1.265000 La rioja 2021La rioja 2021-08-01
387 2021 1.431000 1.277000 La rioja 2021La rioja 2021-09-01
388 2021 1.474000 1.344000 La rioja 2021La rioja 2021-10-01
389 2021 1.510200 1.382000 La rioja 2021La rioja 2021-11-01
390 2021 1.481333 1.348667 La rioja 2021La rioja 2021-12-01
Salario en euros anuales
0 27.995,96
1 27.995,96
2 27.995,96
3 27.995,96
4 27.995,96
.. ...
386 21.535,29
387 21.535,29
388 21.535,29
389 21.535,29
390 21.535,29
The idea is to fill the columns of the first DataFrame with values from the second wherever Año_Comunidad matches. For example, for a row with NaN where 2020Islas Baleares appears, fill in the gasoline price from the row of the other table where 2020Islas Baleares also appears; if it is 2020Aragón, use the 2020Aragón row, and so on. I had thought of something like this:
analisis['Super_95'].fillna(analisis2['Super_95'].apply(lambda x: x if x=='2020Islas Baleares' else np.nan), inplace=True)
The second DataFrame is the result of doing a merge, and filling those null values has not worked.
You can merge the two DataFrames on the shared key:
df1.merge(df2, on='Año_Comunidad')
As a result you'll have one DataFrame where columns with the same names get a suffix: _x for the first DataFrame and _y for the second one.
Now to fill in the blanks you can do this for each column:
df1.loc[df1["Año_x"].isnull(),'Año_x'] = df1["Año_y"]
If a row in Año is empty, it will be filled with data from the second table that we merged earlier.
You can do it in a cycle for all the columns:
cols = ['Año', 'Super_95', 'Diesel', 'Comunidad', 'Salario en euros anuales']
for col in cols:
    df1.loc[df1[col+"_x"].isnull(), col+'_x'] = df1[col+'_y']
And finally you can drop the merged columns:
for col in cols:
    df1 = df1.drop(col+'_y', axis=1)
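Putting the steps together, here is a minimal runnable sketch; the frame names analisis and analisis2 and the sample prices are assumptions standing in for the question's data:

import numpy as np
import pandas as pd

# Hypothetical cut-down versions of the question's two DataFrames
analisis = pd.DataFrame({
    'Año_Comunidad': ['2020Islas Baleares', '2020La Rioja'],
    'Super_95': [np.nan, np.nan],
    'Diesel': [np.nan, np.nan],
})
analisis2 = pd.DataFrame({
    'Año_Comunidad': ['2020Islas Baleares', '2020La Rioja'],
    'Super_95': [1.25, 1.31],
    'Diesel': [1.14, 1.20],
})

merged = analisis.merge(analisis2, on='Año_Comunidad')
for col in ['Super_95', 'Diesel']:
    # fill each _x column from its _y twin, then drop the _y column
    merged.loc[merged[col + '_x'].isnull(), col + '_x'] = merged[col + '_y']
    merged = merged.drop(col + '_y', axis=1)
merged.columns = [c[:-2] if c.endswith('_x') else c for c in merged.columns]
print(merged)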
I would like to replace missing values based on the values of the column Submitted.
Find below what I have:
Year  Country  Submitted  Age12  Age14
2018  CHI      1          267    NaN
2019  CHI      NaN        NaN    NaN
2020  CHI      1          244    203
2018  ALB      1          163    165
2019  ALB      1          NaN    NaN
2020  ALB      1          161    NaN
2018  GER      1          451    381
2019  GER      NaN        NaN    NaN
2020  GER      1          361    321
And this is what I would like to have:
Year  Country  Submitted  Age12  Age14
2018  CHI      1          267    NaN
2019  CHI      NaN        267    NaN
2020  CHI      1          244    203
2018  ALB      1          163    165
2019  ALB      1          NaN    NaN
2020  ALB      1          161    NaN
2018  GER      1          451    381
2019  GER      NaN        451    381
2020  GER      1          361    321
I tried using the command df.fillna(axis=0, method='ffill'), but this replaces every NaN with the value from the previous row, which is not what I want: some values should be kept as NaN when the "Submitted" value in that row is 1.
I would like to fill values from the previous row only where the respective "Submitted" value is NaN.
Thank you
Try using where together with what you did:
df = df.where(~df.Submitted.isnull(), df.fillna(axis=0, method='ffill'))
This will replace the entries only when Submitted is null.
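For instance, a minimal sketch on just the CHI rows of the question's data:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Year': [2018, 2019, 2020],
    'Country': ['CHI', 'CHI', 'CHI'],
    'Submitted': [1, np.nan, 1],
    'Age12': [267, np.nan, 244],
    'Age14': [np.nan, np.nan, 203],
})
# keep rows where Submitted is present; elsewhere take the forward-filled values
df = df.where(~df.Submitted.isnull(), df.fillna(axis=0, method='ffill'))
print(df)

One caveat: the fill applies to every column of those rows, including Submitted itself (it becomes 1 after the ffill); if Submitted must stay NaN, restrict the fill to the Age columns, as the .mask() answer below does.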
You can do a conditional ffill() using np.where
import numpy as np
(
    df.assign(Age12=np.where(df.Submitted.isna(), df.Age12.ffill(), df.Age12))
      .assign(Age14=np.where(df.Submitted.isna(), df.Age14.ffill(), df.Age14))
)
You can use .filter() to select the related columns and put the columns in the list cols. Then, use .mask() to change the values of the selected columns by forward fill using ffill() when Submitted is NaN, as follows:
cols = df.filter(like='Age').columns
df[cols] = df[cols].mask(df['Submitted'].isna(), df[cols].ffill())
Result:
print(df)
Year Country Submitted Age12 Age14
0 2018 CHI 1.0 267.0 NaN
1 2019 CHI NaN 267.0 NaN
2 2020 CHI 1.0 244.0 203.0
3 2018 ALB 1.0 163.0 165.0
4 2019 ALB 1.0 NaN NaN
5 2020 ALB 1.0 161.0 NaN
6 2018 GER 1.0 451.0 381.0
7 2019 GER NaN 451.0 381.0
8 2020 GER 1.0 361.0 321.0
I just used a for loop to check and update the values in the dataframe
import pandas as pd

new_data = [[2018,'CHI',1,267,30], [2019,'CHI','NaN','NaN','NaN'], [2020,'CHI',1,244,203]]
df = pd.DataFrame(new_data, columns=['Year','Country','Submitted','Age12','Age14'])

prevValue12 = df.iloc[0]['Age12']
prevValue14 = df.iloc[0]['Age14']
for index, row in df.iterrows():
    if row['Submitted'] == 'NaN':
        # this sample marks missing data with the string 'NaN'
        df.at[index,'Age12'] = prevValue12
        df.at[index,'Age14'] = prevValue14
    # remember this row's original values for the next iteration
    prevValue12 = row['Age12']
    prevValue14 = row['Age14']
print(df)
output
Year Country Submitted Age12 Age14
0 2018 CHI 1 267 30
1 2019 CHI NaN 267 30
2 2020 CHI 1 244 203
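Note that this sample stores the missing entries as the literal string 'NaN'. With real missing values (np.nan), as in the question's frame, the string comparison won't match; a sketch of the same loop using pd.isna() instead:

import numpy as np
import pandas as pd

new_data = [[2018,'CHI',1,267,30], [2019,'CHI',np.nan,np.nan,np.nan], [2020,'CHI',1,244,203]]
df = pd.DataFrame(new_data, columns=['Year','Country','Submitted','Age12','Age14'])

prevValue12 = df.iloc[0]['Age12']
prevValue14 = df.iloc[0]['Age14']
for index, row in df.iterrows():
    if pd.isna(row['Submitted']):  # real NaN check instead of == 'NaN'
        df.at[index,'Age12'] = prevValue12
        df.at[index,'Age14'] = prevValue14
    prevValue12 = row['Age12']
    prevValue14 = row['Age14']
print(df)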
I want to add a new column with the number of times the points were over 700 and after the year 2014.
import pandas as pd
import numpy as np

ipl_data = {'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
            'Points': [876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Year')
df.loc[(df['Points'] > 700) & (df['Year'] > 2014), 'High_points'] = df['Points']
#df['Point_per_year_gr_700'] = df.groupby(by='Year')['Points'].transform('count')
df['Point_per_year_gr_700'] = grouped['Points'].agg(np.size)
The end DataFrame should look like this, but I can't get 'Point_per_year_gr_700' right:
    Year  Points  Point_per_year_gr_700  High_points
0   2014     876                    NaN          NaN
1   2015     789                      3        789.0
2   2014     863                    NaN          NaN
3   2015     673                    NaN          NaN
4   2014     741                    NaN          NaN
5   2015     812                      3        812.0
6   2016     756                      1        756.0
7   2017     788                      1        788.0
8   2016     694                    NaN          NaN
9   2014     701                    NaN          NaN
10  2015     804                      3        804.0
11  2017     690                    NaN          NaN
Use where to mask the DataFrame to NaN where your condition isn't met. You can use this to create the High_points column and also to exclude rows that shouldn't count when you groupby year and find how many rows satisfy High_points each year.
df['High_points'] = df['Points'].where(df['Year'].gt(2014) & df['Points'].gt(700))
df['ppy_gt700'] = (df.where(df['High_points'].notnull())
                     .groupby('Year')['Year'].transform('size'))
Year Points High_points ppy_gt700
0 2014 876 NaN NaN
1 2015 789 789.0 3.0
2 2014 863 NaN NaN
3 2015 673 NaN NaN
4 2014 741 NaN NaN
5 2015 812 812.0 3.0
6 2016 756 756.0 1.0
7 2017 788 788.0 1.0
8 2016 694 NaN NaN
9 2014 701 NaN NaN
10 2015 804 804.0 3.0
11 2017 690 NaN NaN
I have a dataframe that looks something like:
Component Date MTD YTD QTD FC
ABC Jan 2017 56 nan nan nan
DEF Jan 2017 453 nan nan nan
XYZ Jan 2017 657
PQR Jan 2017 123
ABC Feb 2017 56 nan nan nan
DEF Feb 2017 456 nan nan nan
XYZ Feb 2017 6234 57
PQR Feb 2017 123 346
ABC Dec 2017 56 nan nan nan
DEF Dec 2017 nan nan 345 324
XYZ Dec 2017 6234 57
PQR Dec 2017 nan 346 54654 546
And I would like to transpose this DataFrame in such a way that the component becomes the prefix of the existing MTD, QTD, etc. columns, so the expected output would be:
Date ABC_MTD DEF_MTD XYZ_MTD PQR_MTD ABC_YTD DEF_YTD XYZ_YTD PQR_YTD etcetc
Jan 2017 56 453 657 123 nan nan nan nan
Feb 2017 56 456 6234 123 nan nan 57 346
Dec 2017 56 nan 6234 nan nan nan 57 346
I am not sure whether a pivot or stack/unstack would be efficient here.
Thanks in advance.
You could try this:
newdf = df.pivot(values=df.columns[2:], index='Date', columns='Component')
newdf.columns = ['%s%s' % (b, '_%s' % a if b else '') for a, b in newdf.columns]  # join the MultiIndex column names
print(newdf)
Output:
df
Component Date MTD YTD QTD FC
0 ABC 2017-01-01 56.0 NaN NaN NaN
1 DEF 2017-01-01 453.0 NaN NaN NaN
2 XYZ 2017-01-01 657.0
3 PQR 2017-01-01 123.0
4 ABC 2017-02-01 56.0 NaN NaN NaN
5 DEF 2017-02-01 456.0 NaN NaN NaN
6 XYZ 2017-02-01 6234.0 57
7 PQR 2017-02-01 123.0 346
8 ABC 2017-12-01 56.0 NaN NaN NaN
9 DEF 2017-12-01 NaN NaN 345 324
10 XYZ 2017-12-01 6234.0 57
11 PQR 2017-12-01 NaN 346 54654 546
newdf
            ABC_MTD  DEF_MTD  PQR_MTD  XYZ_MTD  ABC_YTD  DEF_YTD  PQR_YTD  XYZ_YTD  ABC_QTD  DEF_QTD  PQR_QTD  XYZ_QTD  ABC_FC  DEF_FC  PQR_FC  XYZ_FC
Date
2017-01-01       56      453      123      657      NaN      NaN      NaN      NaN      NaN      NaN      NaN      NaN     NaN     NaN     NaN     NaN
2017-02-01       56      456      123     6234      NaN      NaN      346       57      NaN      NaN      NaN      NaN     NaN     NaN     NaN     NaN
2017-12-01       56      NaN      NaN     6234      NaN      NaN      346       57      NaN      345    54654      NaN     NaN     324     546     NaN
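For reference, a self-contained sketch of the same approach on a cut-down version of the sample; the values are taken from the ABC/DEF rows above, and the rest of the frame is assumed:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Component': ['ABC', 'DEF', 'ABC', 'DEF'],
    'Date': ['2017-01-01', '2017-01-01', '2017-02-01', '2017-02-01'],
    'MTD': [56, 453, 56, 456],
    'YTD': [np.nan, np.nan, np.nan, np.nan],
})

newdf = df.pivot(values=df.columns[2:], index='Date', columns='Component')
# join the (value, component) MultiIndex levels into Component_value names
newdf.columns = ['%s%s' % (b, '_%s' % a if b else '') for a, b in newdf.columns]
print(newdf)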
I am having difficulties using pd.read_csv() with this web page, since the "Download Data" button does not point to a typical .zip or .csv URL. What would be the correct URL to use to download the data directly with pd.read_csv()?
Link:
https://climate.weather.gc.ca/climate_data/daily_data_e.html?hlyRange=2008-12-22%7C2020-05-24&dlyRange=1999-05-01%7C2020-05-24&mlyRange=2000-06-01%7C2007-11-01&StationID=27211&Prov=AB&urlExtension=_e.html&searchType=stnProx&optLimit=yearRange&StartYear=2000&EndYear=2020&selRowPerPage=25&Line=5&txtRadius=25&optProxType=city&selCity=51%7C2%7C114%7C4%7CCalgary&selPark=&txtCentralLatDeg=&txtCentralLatMin=0&txtCentralLatSec=0&txtCentralLongDeg=&txtCentralLongMin=0&txtCentralLongSec=0&txtLatDecDeg=&txtLongDecDeg=&timeframe=2&Day=24&Year=2019&Month=5#
When you open Firefox developer tools -> Network tab, you will see the URL when you click the download button. (Chrome has something similar too)
import pandas as pd
url = 'https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID=27211&Year=2019&Month=5&Day=1&timeframe=2&submit=Download+Data'
df = pd.read_csv(url)
print(df)
Prints:
Longitude (x) Latitude (y) Station Name Climate ID Date/Time ... Snow on Grnd Flag Dir of Max Gust (10s deg) Dir of Max Gust Flag Spd of Max Gust (km/h) Spd of Max Gust Flag
0 -114.0 51.11 CALGARY INT'L CS 3031094 2019-01-01 ... NaN 29.0 NaN 44.0 NaN
1 -114.0 51.11 CALGARY INT'L CS 3031094 2019-01-02 ... NaN 27.0 NaN 70.0 NaN
2 -114.0 51.11 CALGARY INT'L CS 3031094 2019-01-03 ... NaN 27.0 NaN 62.0 NaN
3 -114.0 51.11 CALGARY INT'L CS 3031094 2019-01-04 ... NaN 23.0 NaN 66.0 NaN
4 -114.0 51.11 CALGARY INT'L CS 3031094 2019-01-05 ... NaN NaN NaN NaN NaN
.. ... ... ... ... ... ... ... ... ... ... ...
360 -114.0 51.11 CALGARY INT'L CS 3031094 2019-12-27 ... NaN 30.0 NaN 46.0 NaN
361 -114.0 51.11 CALGARY INT'L CS 3031094 2019-12-28 ... NaN NaN NaN NaN NaN
362 -114.0 51.11 CALGARY INT'L CS 3031094 2019-12-29 ... NaN NaN NaN NaN NaN
363 -114.0 51.11 CALGARY INT'L CS 3031094 2019-12-30 ... NaN 27.0 NaN 50.0 NaN
364 -114.0 51.11 CALGARY INT'L CS 3031094 2019-12-31 ... NaN 28.0 NaN 55.0 NaN
[365 rows x 31 columns]
I am trying to download data from a website. When I do this, some rows that are not part of the data are included, which is obvious because their first column is not a number.
So I'm getting something like
GM_Num Date Tm
1 Monday, Apr 3 LAA
2 Tuesday, Apr 4 LAA
... ... ...
Gm# May Tm
where the last row is one that I want to drop. In the actual table, there are multiple rows like this randomly throughout the table.
Here is the code that I have tried so far to drop those rows:
import requests
import pandas as pd
url = 'https://www.baseball-reference.com/teams/LAA/2017-schedule-scores.shtml'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
df.rename(columns={"Gm#": "GM_Num"}, inplace = True)
#Attempts that didn't work:
df[df['GM_Num'].str.isdigit().isnull()]
#df[df.GM_Num.apply(lambda x: x.isnumeric())].set_index('GM_Num', inplace = True)
#df.set_index('GM_Num', inplace = True)
df
Thank you in advance for any help!
Let's cast your 'Gm#' column and drop records in a couple of steps:
df['Gm#'] = pd.to_numeric(df['Gm#'], errors='coerce')
df = df.dropna(subset=['Gm#'])
df
Output:
Gm# Date Unnamed: 2 Tm Unnamed: 4 Opp W/L R RA \
0 1.0 Monday, Apr 3 boxscore LAA # OAK L 2 4
1 2.0 Tuesday, Apr 4 boxscore LAA # OAK W 7 6
2 3.0 Wednesday, Apr 5 boxscore LAA # OAK W 5 0
3 4.0 Thursday, Apr 6 boxscore LAA # OAK L 1 5
4 5.0 Friday, Apr 7 boxscore LAA NaN SEA W 5 1
.. ... ... ... ... ... ... ... .. ..
162 158.0 Wednesday, Sep 27 boxscore LAA # CHW L-wo 4 6
163 159.0 Thursday, Sep 28 boxscore LAA # CHW L 4 5
164 160.0 Friday, Sep 29 boxscore LAA NaN SEA W 6 5
165 161.0 Saturday, Sep 30 boxscore LAA NaN SEA L 4 6
167 162.0 Sunday, Oct 1 boxscore LAA NaN SEA W 6 2
Inn ... Rank GB Win Loss Save Time D/N \
0 NaN ... 3 1.0 Graveman Nolasco Casilla 2:56 N
1 NaN ... 2 1.0 Bailey Dull Bedrosian 3:17 N
2 NaN ... 2 1.0 Ramirez Cotton NaN 3:15 N
3 NaN ... 2 1.0 Triggs Skaggs NaN 2:44 D
4 NaN ... 1 Tied Chavez Gallardo NaN 2:56 N
.. ... ... ... ... ... ... ... ... ..
162 10 ... 2 20.0 Farquhar Parker NaN 3:58 N
163 NaN ... 2 21.0 Infante Chavez Minaya 3:04 N
164 NaN ... 2 21.0 Wood Rzepczynski Parker 3:01 N
165 NaN ... 2 21.0 Lawrence Bedrosian Diaz 3:32 N
167 NaN ... 2 21.0 Bridwell Simmons NaN 2:38 D
Attendance Streak Orig. Scheduled
0 36067 - NaN
1 11225 + NaN
2 13405 ++ NaN
3 13292 - NaN
4 43911 + NaN
.. ... ... ...
162 17012 - NaN
163 19596 -- NaN
164 35106 + NaN
165 38075 - NaN
167 34940 + NaN
[162 rows x 21 columns]