I have a dataframe that looks something like this:
import pandas as pd

df = pd.DataFrame({'year': [23, 23, 23, 23, 23, 23], 'month': [1, 1, 1, 2, 3, 3], 'utility': ['A', 'A', 'B', 'A', 'A', 'B'], 'state': ['NY', 'NJ', 'CA', 'NJ', 'NY', 'CA']})
year month utility state
0 23 1 A NY
1 23 1 A NJ
2 23 1 B CA
3 23 2 A NJ
4 23 3 A NY
5 23 3 B CA
And I would like to create new rows for the utility-state combinations with missing months. So the new dataframe would look something like this:
year month utility state
0 23 1 A NY
1 23 1 A NJ
2 23 1 B CA
3 23 2 A NY
4 23 2 A NJ
5 23 2 B CA
6 23 3 A NY
7 23 3 A NJ
8 23 3 B CA
I know that I can use a MultiIndex and then reindex, but using the from_product() method results in utility-state combinations that are not present in the original df (I do not want a utility A - CA combination, for example).
I thought about concatenating the utility and state columns and then getting the cartesian product from that, but I think there must be a simpler method.
One option is with DataFrame.complete from pyjanitor. For your data, you are basically doing a combination of (year, month) and (utility, state):
# pip install pyjanitor
import janitor
df.complete(('year', 'month'), ('utility', 'state'))
year month utility state
0 23 1 A NY
1 23 1 A NJ
2 23 1 B CA
3 23 2 A NY
4 23 2 A NJ
5 23 2 B CA
6 23 3 A NY
7 23 3 A NJ
8 23 3 B CA
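If you would rather stay in plain pandas, what complete does here amounts to a cross join of the two sets of unique pairs; a minimal sketch, assuming pandas 1.2+ for how='cross':

out = (df[['year', 'month']].drop_duplicates()
         .merge(df[['utility', 'state']].drop_duplicates(), how='cross'))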
A possible solution:
cols = ['utility', 'state']
d1 = df.drop_duplicates(cols)
d2 = df.drop_duplicates(['year', 'month'])
d2.assign(**{x: [d1[x].to_list()] * len(d2) for x in cols}).explode(cols)
Output:
year month utility state
0 23 1 A NY
0 23 1 A NJ
0 23 1 B CA
3 23 2 A NY
3 23 2 A NJ
3 23 2 B CA
4 23 3 A NY
4 23 3 A NJ
4 23 3 B CA
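Note the repeated index values in the output: explode keeps the index of the row it expands. If you want a fresh 0..n-1 index, append .reset_index(drop=True) to the chain:

(d2.assign(**{x: [d1[x].to_list()] * len(d2) for x in cols})
   .explode(cols)
   .reset_index(drop=True))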
I was wondering whether a solution using numpy broadcasting would be possible, and it is:
import numpy as np

cols1, cols2 = ['year', 'month'], ['utility', 'state']
pd.DataFrame(
    np.vstack(
        np.concatenate(
            np.broadcast_arrays(
                df[cols1].drop_duplicates(cols1).values[:, None],
                df[cols2].drop_duplicates(cols2).values),
            axis=2)),
    columns=df.columns)
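To see why the broadcasting lines up, here is the same computation unrolled, with the intermediate shapes for this example spelled out (just a sketch of the mechanics):

a = df[cols1].drop_duplicates().to_numpy()[:, None]  # (3, 1, 2): unique (year, month) pairs
b = df[cols2].drop_duplicates().to_numpy()           # (3, 2):    unique (utility, state) pairs
# broadcast_arrays expands both to (3, 3, 2); concatenating on the last axis
# gives (3, 3, 4), and vstack collapses that to the final (9, 4) table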
I have a dataframe with several columns and I need to re-sample from that data with more weight given to one category. I think np.random.choice should work, but I'm not sure how to implement it. Below is the example data from which I want to sample randomly, with a 70% probability of picking an expensive home (based on the Expensive_home column, value = 1) and a 30% probability for Expensive_home = 0. How can I create the re-sampled data file? Thank you!
ID Lot_Area Year_Built Full_Bath Bedroom Sale_Price Expensive_home
1 31770 1960 1 3 215000 0
2 11622 1961 1 2 105000 0
3 5389 1995 2 2 236500 0
4 8402 1998 2 3 180400 0
5 10176 1990 1 2 171500 0
6 6820 1985 1 1 212000 0
7 53504 2003 3 4 538000 1
8 12134 1988 2 4 164000 0
9 11394 2010 1 1 394432 1
10 19138 1951 1 2 141000 0
11 13175 1978 2 3 210000 0
12 11751 1977 2 3 190000 0
13 10625 1974 2 3 170000 0
14 7500 2000 2 3 216000 0
15 11241 1970 1 2 149000 0
16 2280 1978 2 3 146000 0
17 12858 2009 2 3 376162 1
18 12883 2009 2 3 290941 0
19 12182 2005 2 3 220000 0
20 11520 2005 2 3 275000 0
In short: a similar data file, but with more randomly picked 1s in the last column.
To create a dataframe of the same length, sampling with replacement and giving expensive homes a higher chance of being selected, use:
weights = df['Expensive_home'].replace({0: 30, 1: 70})
df1 = df.sample(len(df), replace=True, weights=weights)
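One caveat: sample weights apply per row, so with 17 cheap rows and 3 expensive ones the overall chance of drawing an expensive home here is 3*70 / (3*70 + 17*30), about 29%, not 70%. If you want the class totals pinned at 70/30 regardless of class sizes, normalize each class's weight by its row count (a sketch):

counts = df['Expensive_home'].value_counts()
weights = df['Expensive_home'].map({0: 0.30 / counts[0], 1: 0.70 / counts[1]})
df1 = df.sample(len(df), replace=True, weights=weights)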
To create a dataframe with all expensive and then 30% of non-expensive, you can do:
expensive = df['Expensive_home'].astype(bool)
df2 = pd.concat([df[expensive], df[~expensive].sample(frac=0.3)])
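Since concat puts all the expensive rows first, you may want to shuffle the combined frame:

df2 = df2.sample(frac=1).reset_index(drop=True)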
Hi, I've been trying to replace string values in a dataframe (the strings are abbreviations of NFL teams). I have something like this:
Index IDMatch Usr1 Usr2 Usr3 Usr4 Usr5
0 1 Phi Atl Phi Phi Phi
1 2 Bal Bal Bal Buf Bal
2 3 Ind Ind Cin Cin Ind
3 4 NE NE Hou NE NE
4 5 Jax Jax NYG NYG NYG
and a dataframe with the mapping, something like this:
Index TEAM_YH TeamID
0 ARI 1
1 ATL 2
2 BAL 3
...
31 WAS 32
I want to replace every string with its TeamID so I can compute basic statistics (frequencies). I've tried the following:
## Dataframe with strings and Team ID
dfDicTeams = dfTeams[['TEAM_YH','TeamID']].to_dict('dict')
## Dataframe with selections by users
dfW1.replace(dfDicTeams[['TEAM_YH']],dfDicTeams[['TeamID']]) ## Error: unhashable type: 'list'
dfW1.replace(dfDicTeams) ## Error: Replacement not allowed with overlapping keys and values
What am I doing wrong? Is it even possible to do this?
I'm using Python 3, and I want something like this:
Index IDMatch Usr1 Usr2 Usr3 Usr4 Usr5
0 1 26 2 26 26 26
1 2 3 3 3 4 3
2 3 14 14 7 7 14
3 4 21 21 13 21 21
4 5 15 15 23 23 23
to aggregate the options:
IDMatch ATeam Count HTeam Count
1 26 4 2 1
2 3 4 4 1
3 14 3 7 2
4 21 4 13 1
5 15 2 23 3
Given a main input dataframe df and a mapping dataframe df_map, you can create a series mapping, then use pd.DataFrame.applymap with a custom function:
s = df_map.set_index('TEAM_YH')['TeamID']
df.iloc[:, 2:] = df.iloc[:, 2:].applymap(lambda x: s.get(x.upper(), -1))
print(df)
Index IDMatch Usr1 Usr2 Usr3 Usr4 Usr5
0 0 1 7 2 7 7 7
1 1 2 3 3 3 4 3
2 2 3 5 5 -1 -1 5
3 3 4 -1 -1 -1 -1 -1
4 4 5 6 6 -1 -1 -1
The example df_map used to calculate the above result:
Index TEAM_YH TeamID
0 ARI 1
1 ATL 2
2 BAL 3
3 BUF 4
4 IND 5
5 JAX 6
6 PHI 7
32 WAS 32
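As an aside, a sketch of an alternative that skips applymap: starting from the original string dataframe, map each user column through the series s built above, marking unknown abbreviations with -1:

mapped = df.iloc[:, 2:].apply(lambda col: col.str.upper().map(s))
df.iloc[:, 2:] = mapped.fillna(-1).astype(int)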
I have a pandas MultiIndex with two levels, a date and a gender. It looks like this:
Division North South West East
Date Gender
2016-05-16 19:00:00 F 0 2 3 3
M 12 15 12 12
2016-05-16 20:00:00 F 12 9 11 11
M 10 13 8 9
2016-05-16 21:00:00 F 9 4 7 1
M 5 1 12 10
Now if I want to find the average values for each hour, I know I can do something like:
df.groupby(df.index.hour).mean()
but this does not seem to work when you have a MultiIndex. I found that I could reach the Date index like:
df.groupby(df.index.get_level_values('Date').hour).mean()
which sort of averages over the 24 hours in a day, but I lose track of the Gender index...
So my question is: how can I find the average hourly values for each Division by Gender?
You can group by the derived hour together with a level of the MultiIndex; this needs pandas 0.20.1+:
df1 = df.groupby([df.index.get_level_values('Date').hour,'Gender']).mean()
print (df1)
North South West East
Date Gender
19 F 0 2 3 3
M 12 15 12 12
20 F 12 9 11 11
M 10 13 8 9
21 F 9 4 7 1
M 5 1 12 10
Another solution:
df1 = df.groupby([df.index.get_level_values('Date').hour,
df.index.get_level_values('Gender')]).mean()
print (df1)
North South West East
Date Gender
19 F 0 2 3 3
M 12 15 12 12
20 F 12 9 11 11
M 10 13 8 9
21 F 9 4 7 1
M 5 1 12 10
Or simply create columns from the MultiIndex:
df = df.reset_index()
df1 = df.groupby([df['Date'].dt.hour, 'Gender']).mean()
print (df1)
North South West East
Date Gender
19 F 0 2 3 3
M 12 15 12 12
20 F 12 9 11 11
M 10 13 8 9
21 F 9 4 7 1
M 5 1 12 10
I am working on the Olympics dataset related to this.
This is what the dataframe looks like:
Unnamed: 0 # Summer 01 ! 02 ! 03 ! Total # Winter \
0 Afghanistan (AFG) 13 0 0 2 2 0
1 Algeria (ALG) 12 5 2 8 15 3
2 Argentina (ARG) 23 18 24 28 70 18
3 Armenia (ARM) 5 1 2 9 12 6
4 Australasia (ANZ) [ANZ] 2 3 4 5 12 0
I want to do the following things:
1. Split the country name and country code, and add the country name as the dataframe index.
2. Remove extra unnecessary characters from the country name.
For example, the updated column should be:
Unnamed: 0 # Summer 01 ! 02 ! 03 ! Total # Winter \
0 Afghanistan 13 0 0 2 2 0
1 Algeria 12 5 2 8 15 3
2 Argentina 23 18 24 28 70 18
3 Armenia 5 1 2 9 12 6
4 Australasia 2 3 4 5 12 0
Please show me a proper way to achieve this.
You can use regex with replace to do that, i.e.
df = (df.replace(r'\(.+?\)|\[.+?\]\s*', '', regex=True)
        .rename(columns={'Unnamed: 0': 'Country'})
        .set_index('Country'))
Output:
Summer 01 ! 02 ! 03 ! Total Winter
Country
Afghanistan 13 0 0 2 2 0
Algeria 12 5 2 8 15 3
Argentina 23 18 24 28 70 18
Armenia 5 1 2 9 12 6
Australasia 2 3 4 5 12 0
If you don't want to rename, then use .set_index('Unnamed: 0').
Or, thanks to @Scott, a much easier solution is to split on '(' and select the first element, i.e.
df['Unnamed: 0'] = df['Unnamed: 0'].str.split('(').str[0]
Splitting to get two columns, country name and country code, and setting the country as the index:
df2 = pd.DataFrame(df['Unnamed: 0'].str.split(' ', n=1).tolist(), columns=['Country', 'CountryCode']).set_index('Country')
You could also add country code as an additional info in your dataframe.
Removing the extra bits, such as [ANZ], using regex (as mentioned in the other answer):
df2 = df2.replace(r'\[.*?\]', '', regex=True)
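If you want to keep the code as its own column instead of discarding it, a sketch using str.extract (assuming the code always sits in parentheses):

df['Code'] = df['Unnamed: 0'].str.extract(r'\((\w+)\)', expand=False)
df['Country'] = df['Unnamed: 0'].str.replace(r'\s*[\(\[].*$', '', regex=True)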
I have something like the following DataFrame where I have data points at 2 locations in 4 seasons in 2 years.
>>> df=pd.DataFrame(index=pd.MultiIndex.from_product([[1,2,3,4],[2011,2012],['A','B']], names=['Season','Year','Location']))
>>> df['Value']=np.random.randint(1,100,len(df))
>>> df
Value
Season Year Location
1 2011 A 40
B 7
2012 A 81
B 84
2 2011 A 37
B 59
2012 A 30
B 6
3 2011 A 71
B 43
2012 A 3
B 65
4 2011 A 45
B 13
2012 A 38
B 70
>>>
I would like to create a new series that numbers the seasons sequentially across years. For example, the seasons in the first year would just be 1, 2, 3, 4, and then the seasons in the second year would be 5, 6, 7, 8. The series would look like this:
Season Year Location
1 2011 A 1
B 1
2012 A 5
B 5
2 2011 A 2
B 2
2012 A 6
B 6
3 2011 A 3
B 3
2012 A 7
B 7
4 2011 A 4
B 4
2012 A 8
B 8
Name: SeasonNum, dtype: int64
>>>
Any suggestions on the best way to do this?
You could do:
def seasons(row):
    # 2011 % 2011 == 0 and 2012 % 2011 == 1, so the second
    # year's seasons are offset by 4
    return row['Year'] % 2011 * 4 + row['Season']

df.reset_index(inplace=True)
df['SeasonNum'] = df.apply(seasons, axis=1)
df.set_index(['Season', 'Year', 'Location'], inplace=True)
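A vectorized alternative that avoids apply (a sketch; it assumes seasons run 1-4 and the years are consecutive):

years = df.index.get_level_values('Year')
df['SeasonNum'] = (years - years.min()) * 4 + df.index.get_level_values('Season')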