Pandas fillna with string values from 2 other columns - python

I have a df with 3 columns: City, State, and MSA. Some of the MSA values are NaN. I would like to fill the MSA NaN values with a concatenation of City and State. I can fill MSA with City using df.MSA.fillna(df.City, inplace=True), but some cities in different states have the same name.
City        State  MSA
Chicago     IL     Chicago MSA
Belleville  IL     NaN
Belleville  KS     NaN
Desired output:
City        State  MSA
Chicago     IL     Chicago MSA
Belleville  IL     Belleville IL
Belleville  KS     Belleville KS

Keep using the vectorized operation that you suggested. Notice that the fill value can be a combination built from the other columns:
df.MSA.fillna(df.City + "," + df.State, inplace=True)
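A self-contained sketch of the same idea (illustrative data; it uses assignment instead of inplace=True, and a space as the separator to match the desired output above):

import pandas as pd
import numpy as np

df = pd.DataFrame({"City": ["Chicago", "Belleville", "Belleville"],
                   "State": ["IL", "IL", "KS"],
                   "MSA": ["Chicago MSA", np.nan, np.nan]})

# fill only the missing MSA values with "City State" built from the other two columns
df["MSA"] = df["MSA"].fillna(df["City"] + " " + df["State"])
print(df)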

Related

Pandas Merge Result Output Next Row

Suppose I have two dataframes
df_1
city state salary
New York NY 85000
Chicago IL 65000
Miami FL 75000
Dallas TX 78000
Seattle WA 96000
df_2
city state taxes
New York NY 15000
Chicago IL 5000
Miami FL 6500
Next, I join the two dataframes
joined_df = df_1.merge(df_2, how='inner', left_on=['city'], right_on = ['city'])
The Result:
joined_df
city state salary city state taxes
New York NY 85000 New York NY 15000
Chicago IL 65000 Chicago IL 5000
Miami FL 75000 Miami FL 6500
Is there any way I can stack the two dataframes on top of each other, joining on the city instead of extending the row horizontally, like below:
Requested:
joined_df
city state salary taxes
New York NY 85000
New York NY 15000
Chicago IL 65000
Chicago IL 5000
Miami FL 75000
Miami FL 6500
How can I do this in Pandas?
Since we need to consider both city and state, we might use merge to restrict each dataframe to the relevant rows before concat:
rel_df_1 = df_1.merge(df_2)[df_1.columns]
rel_df_2 = df_2.merge(df_1)[df_2.columns]
df = pd.concat([rel_df_1, rel_df_2]).sort_values(['city', 'state'])
You can use append (a shortcut for concat) to achieve that:
result = df1.append(df2, sort=False)
If your dataframes have overlapping indexes, you can use:
df1.append(df2, ignore_index=True, sort=False)
Also, you can look for more information here
UPDATE: After appending your dataframes, you can filter your result to get only the rows that contain the city in both dataframes:
result = result.loc[result['city'].isin(df1['city'])
                    & result['city'].isin(df2['city'])]
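Note (beyond the original answer): DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the equivalent call is concat:

# equivalent of df1.append(df2, ignore_index=True, sort=False) on pandas >= 2.0
result = pd.concat([df1, df2], ignore_index=True, sort=False)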
Try with stack():
stacked = df_1.merge(df_2, on=["city", "state"]).set_index(["city", "state"]).stack()
output = pd.concat([stacked.where(stacked.index.get_level_values(-1) == "salary"),
                    stacked.where(stacked.index.get_level_values(-1) == "taxes")],
                   axis=1,
                   keys=["salary", "taxes"]) \
           .droplevel(-1) \
           .reset_index()
>>> output
       city state   salary    taxes
0  New York    NY  85000.0      NaN
1  New York    NY      NaN  15000.0
2   Chicago    IL  65000.0      NaN
3   Chicago    IL      NaN   5000.0
4     Miami    FL  75000.0      NaN
5     Miami    FL      NaN   6500.0

Iterating through a dataframe with info from another dataframe

I have a question that I think is more about logic than about coding. My goal is to calculate how many kilometers a truck runs loaded (i.e., carrying billed cargo).
I have two dataframes.
Let's call the first one trips:
Date        License   City            State  KM
01/05/2019  AAA-1111  Sao Paulo       SP     10
02/05/2019  AAA-1111  Santos          SP     10
03/05/2019  AAA-1111  Rio de Janeiro  RJ     20
04/05/2019  AAA-1111  Sao Paulo       SP     15
01/05/2019  AAA-2222  Curitiba        PR     20
02/05/2019  AAA-2222  Sao Paulo       SP     25
Let's call the second one invoice:
Code  Date        License   Origin     State  Destiny         UF  Value
A1    01/05/2019  AAA-1111  Sao Paulo  SP     Rio de Janeiro  RJ  10.000,00
A2    01/05/2019  AAA-2222  Curitiba   PR     Sao Paulo       SP  15.000,00
What I need to get is:
Date        License   City            State  KM  Code
01/05/2019  AAA-1111  Sao Paulo       SP     10  A1
02/05/2019  AAA-1111  Santos          SP     10  A1
03/05/2019  AAA-1111  Rio de Janeiro  RJ     20  A1
04/05/2019  AAA-1111  Sao Paulo       SP     15  NaN
01/05/2019  AAA-2222  Curitiba        PR     20  A2
02/05/2019  AAA-2222  Sao Paulo       SP     25  A2
As I said, this is more a question of logic. The truck picked up its cargo at the starting point, São Paulo. How can I iterate through the rows to know that it passed through Santos loaded and then went on to Rio de Janeiro, if I don't have the date when the cargo was delivered?
Thanks.
Assuming the rows in the first dataframe (df1) are sorted, here is what I would do:
Note: below I am using df1 for trips and df2 for invoice
Left join df1 (left) and df2 (right) using as much information as is valid for matching the two dataframes, so that we can find the rows in df1 that are the Origin of a trip. In my test I am using the fields ['Date', 'License', 'City', 'State'] and saving the result in a new dataframe df3:
df3 = df1.merge(df2[df2.columns[:6]].rename(columns={'Origin': 'City'}),
                on=['Date', 'License', 'City', 'State'],
                how='left')
Fill the NaN values in df3.Destiny with ffill():
df3['Destiny'] = df3.Destiny.ffill()
Set up the group labels with the following flag:
g = (~df3.Code.isnull() | (df3.shift().City == df3.Destiny)).cumsum()
Note: you can store these labels in a column (e.g. df3['g']) to inspect them.
Update df3.Code using ffill() within the above group labels:
df3['Code'] = df3.groupby(g).Code.ffill()
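Putting the steps together, a minimal sketch of the whole pipeline (assuming df1 is the trips dataframe and df2 is the invoice dataframe, with matching column spellings):

import pandas as pd

# 1. left-join the invoice onto the trip rows that match its Origin
df3 = df1.merge(df2[df2.columns[:6]].rename(columns={'Origin': 'City'}),
                on=['Date', 'License', 'City', 'State'],
                how='left')
# 2. carry the last known destination forward
df3['Destiny'] = df3.Destiny.ffill()
# 3. start a new group at each invoice row, or on the row after the destination was reached
g = (~df3.Code.isnull() | (df3.shift().City == df3.Destiny)).cumsum()
# 4. forward-fill the invoice code within each group only
df3['Code'] = df3.groupby(g).Code.ffill()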

Columns returning NaN with dictionary mapping

I have a data frame that looks like below:
City           State  Country
Chicago        IL     United States
Boston
San Diego      CA     United States
Los Angeles    CA     United States
San Francisco
Sacramento
Vancouver      BC     Canada
Toronto
I have 3 lists that contain all the missing values:
city_list = ['Boston', 'San Francisco', 'Sacramento', 'Toronto']
state_list = ['MA', 'CA', 'CA', 'ON']
country_list = ['United States', 'United States', 'United States', 'Canada']
And here's my ideal result:
City           State  Country
Chicago        IL     United States
Boston         MA     United States
San Diego      CA     United States
Los Angeles    CA     United States
San Francisco  CA     United States
Sacramento     CA     United States
Vancouver      BC     Canada
Toronto        ON     Canada
I tried a method suggested by a helpful person, but I've been scratching my head and can't figure out what went wrong. Here's the code:
state_dict = dict(zip(city_list, state_list))
country_dict = dict(zip(city_list, country_list))
df = df.set_index('City')
df['State'] = df['State'].map(state_dict)
df['Country'] = df['Country'].map(country_dict)
df.reset_index()
print(df.City, df.State, df.Country)
But every cell of the State and Country columns return NaN.
City State Country
Chicago NaN NaN
Boston NaN NaN
San Diego NaN NaN
Los Angeles NaN NaN
San Francisco NaN NaN
Sacramento NaN NaN
Vancouver NaN NaN
Toronto NaN NaN
What went wrong here? And how would you change the code? Thanks.
I think that map should be called on the 'City' rather than 'State' field, like so:
df['State'] = df['City'].map(state_dict)
However, this has the problem that it overwrites any original 'State' values for cities which are not in your dictionary - e.g. 'Chicago'. One solution that gets around this is the following syntactically clumsier (but I believe correct) code:
df['State'] = df.apply(lambda x: state_dict[x['City']] if x['City'] in state_dict else x['State'], axis=1)
And it'll be the same idea for the country field.
I should add that this only works if you do not first set 'City' as index as you have in your example.
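Another option, not from the original answer but a sketch of the same idea: assuming the blank State/Country cells are actually NaN (replace empty strings with NaN first if they are not), you can keep the existing values and only fill the gaps with fillna plus map:

# fill only the missing values, leaving existing State/Country entries untouched
df['State'] = df['State'].fillna(df['City'].map(state_dict))
df['Country'] = df['Country'].fillna(df['City'].map(country_dict))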

Make row operations faster in pandas

I am doing a course on Coursera and I have a dataset to perform some operations on. I have gotten the answer to the problem, but my solution takes a long time to compute. The dataset is City_Zhvi_AllHomes.csv (monthly home values per city); a sample of the relevant columns appears in the testing output further down.
The task is to convert the data from monthly values to quarterly values i.e. I need to sort of aggregate 2000-01, 2000-02, 2000-03 data to 2000-Q1 and so on. The new value for 2000-Q1 should be the mean of these three values.
Likewise 2000-04, 2000-05, 2000-06 would become 2000-Q2 and the new value should be the mean of 2000-04, 2000-05, 2000-06
Here is how I solved the problem.
First I defined a function quarter_rows(), which takes a row of data (as a Series), loops over every third element by column index, replaces values in place with the mean computed as explained above, and returns the row:
import pandas as pd
import numpy as np

housing = pd.read_csv('City_Zhvi_AllHomes.csv')

def quarter_rows(row):
    for i in range(0, len(row), 3):
        row.replace(row[i], np.mean(row[i:i+3]), inplace=True)
    return row
Now I do some subsetting and cleanup of the data to leave only what I need to work with
p = ~housing.columns.str.contains('199') # negation of columns starting with 199
housing = housing[housing.columns[p]]
housing3 = housing.set_index(["State","RegionName"]).ix[:, '2000-01' : ]
I then used apply to apply the function to all rows.
housing3 = housing3.apply(quarter_rows, axis=1)
I get the expected result, but the whole process takes more than a minute to complete. The original dataframe has about 10370 columns.
I don't know if there is a way to speed things up in the for loop and apply functions. The bulk of the time is taken up in the for loop inside my quarter_rows() function.
I've tried python lambdas but every way I tried threw an exception.
I would really be interested in finding a way to get the mean using three consecutive values without using the for loop.
Thanks
I think instead of apply you can use resample by quarter and aggregate with mean, but first convert the column names to month periods with to_period:
housing3.columns = pd.to_datetime(housing3.columns).to_period('M')
housing3 = housing3.resample('Q', axis=1).mean()
Testing:
housing = pd.read_csv('City_Zhvi_AllHomes.csv')
p = ~housing.columns.str.contains('199') # negation of columns starting with 199
housing = housing[housing.columns[p]]
#for testing select only the first 10 rows and columns from Jan 2000 to Jun 2000
housing3 = housing.set_index(["State","RegionName"]).ix[:10, '2000-01' : '2000-06']
print (housing3)
2000-01 2000-02 2000-03 2000-04 2000-05 2000-06
State RegionName
NY New York NaN NaN NaN NaN NaN NaN
CA Los Angeles 204400.0 207000.0 209800.0 212300.0 214500.0 216600.0
IL Chicago 136800.0 138300.0 140100.0 141900.0 143700.0 145300.0
PA Philadelphia 52700.0 53100.0 53200.0 53400.0 53700.0 53800.0
AZ Phoenix 111000.0 111700.0 112800.0 113700.0 114300.0 115100.0
NV Las Vegas 131700.0 132600.0 133500.0 134100.0 134400.0 134600.0
CA San Diego 219200.0 222900.0 226600.0 230200.0 234400.0 238500.0
TX Dallas 85100.0 84500.0 83800.0 83600.0 83800.0 84200.0
CA San Jose 364100.0 374000.0 384700.0 395700.0 407100.0 416900.0
FL Jacksonville 88000.0 88800.0 89000.0 88900.0 89600.0 90600.0
housing3.columns = pd.to_datetime(housing3.columns).to_period('M')
housing3 = housing3.resample('Q', axis=1).mean()
print (housing3)
2000Q1 2000Q2
State RegionName
NY New York NaN NaN
CA Los Angeles 207066.666667 214466.666667
IL Chicago 138400.000000 143633.333333
PA Philadelphia 53000.000000 53633.333333
AZ Phoenix 111833.333333 114366.666667
NV Las Vegas 132600.000000 134366.666667
CA San Diego 222900.000000 234366.666667
TX Dallas 84466.666667 83866.666667
CA San Jose 374266.666667 406566.666667
FL Jacksonville 88600.000000 89700.000000
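A side note beyond the original answer: resample(..., axis=1) is deprecated in recent pandas versions, so the same quarterly mean can be computed by transposing first (a sketch using the same column conversion as above):

# convert column labels to monthly periods, resample down the rows of the transposed frame, transpose back
housing3.columns = pd.to_datetime(housing3.columns).to_period('M')
quarterly = housing3.T.resample('Q').mean().T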

Pandas dataframe, split data by last column in last position but keep other columns

Very new to pandas so any explanation with a solution is appreciated.
I have a dataframe such as
    Company                             Zip State City
1   *CBRE                               San Diego, CA 92101
4   1908 Brands                         Boulder, CO 80301
7   1st Infantry Division Headquarters  Fort Riley, KS
10  21st Century Healthcare, Inc.       Tempe 85282
15  AAA                                 Jefferson City, MO 65101-9564
I want to split the 'Zip State City' column in my data into 3 different columns. Using the answer from the post "Pandas DataFrame, how do i split a column into two" I could accomplish this task if I didn't have my first column. Writing a regex to capture all companies just leads to me capturing everything in my data.
I also tried
foo = lambda x: pandas.Series([i for i in reversed(x.split())])
data_pretty = data['Zip State City'].apply(foo)
but this causes me to lose the company column and splits the names of the cities that are more than one word into separate columns.
How can I split my last column while keeping the company column data?
You can use the .str.extract() method:
In [110]: df
Out[110]:
Company Zip State City
1 *CBRE San Diego, CA 92101
4 1908 Brands Boulder, CO 80301
7 1st Infantry Division Headquarters Fort Riley, KS
10 21st Century Healthcare, Inc. Tempe 85282
15 AAA Jefferson City, MO 65101-9564
In [112]: df[['City','State','ZIP']] = df['Zip State City'].str.extract(r'([^,\d]+)?[,]*\s*([A-Z]{2})?\s*([\d\-]{4,11})?', expand=True)
In [113]: df
Out[113]:
Company Zip State City City State ZIP
1 *CBRE San Diego, CA 92101 San Diego CA 92101
4 1908 Brands Boulder, CO 80301 Boulder CO 80301
7 1st Infantry Division Headquarters Fort Riley, KS Fort Riley KS NaN
10 21st Century Healthcare, Inc. Tempe 85282 Tempe NaN 85282
15 AAA Jefferson City, MO 65101-9564 Jefferson City MO 65101-9564
From docs:
Series.str.extract(pat, flags=0, expand=None)
For each subject string in the Series, extract groups from the first
match of regular expression pat.
New in version 0.13.0.
Parameters:
pat : string
    Regular expression pattern with capturing groups
flags : int, default 0 (no flags)
    re module flags, e.g. re.IGNORECASE
expand : bool, default False
    If True, return DataFrame.
    If False, return Series/Index/DataFrame.
    .. versionadded:: 0.18.0
Returns: DataFrame with one row for each subject string, and one
column for each group. Any capture group names in regular expression
pat will be used for column names; otherwise capture group numbers
will be used. The dtype of each result column is always object, even
when no match is found. If expand=True and pat has only one capture
group, then return a Series (if subject is a Series) or Index (if
subject is an Index).
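As a small follow-up (not from the original answer, just a sketch): with named capture groups, extract() uses the group names as column names, so the same pattern can also be written as:

pattern = r'(?P<City>[^,\d]+)?[,]*\s*(?P<State>[A-Z]{2})?\s*(?P<ZIP>[\d\-]{4,11})?'
df = df.join(df['Zip State City'].str.extract(pattern, expand=True))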
