Create a new column based on values in two others - Python

I'm trying to merge two columns. The merge isn't working out, and after two days of reading and asking questions I'm going to take a different route.
Since I have to rename the column after merging anyway, why not just create a new column and fill it based on the other two?
So I now have columns A, B, and C.
C is a blank column.
Column A has values for most rows, but not all. When column A doesn't have a value, I want to use column B's value instead, and put whichever of the two applies into column C.
Please keep in mind that when column A doesn't have a value, a "-" was put in its place (which is why I'm having a horrendous time trying to merge these columns).
I have converted the "-" to NaN, but then .fillna() doesn't work and I'm not sure why.
I'm thinking I'll have to write a for loop with an if statement to accomplish this, although I feel like there should be a function that can build a new column from the other two columns' values.
| A  | B  |
|----|----|
| 34 | 35 |
| 37 | -  |
| -  | 32 |
| 94 | 92 |
| -  | 91 |
| 47 | -  |
Desired Result
| C  |
|----|
| 34 |
| 37 |
| 32 |
| 94 |
| 91 |
| 47 |

Does this answer your question?

df['A'] = df.apply(lambda x: x['B'] if x['A'] == '-' else x['A'], axis=1)

Or, if the "-" has already been converted to NaN (note that x['A'] == np.NaN always evaluates to False, so the check needs pd.isna):

df['A'] = df.apply(lambda x: x['B'] if pd.isna(x['A']) else x['A'], axis=1)
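A vectorized alternative that fills C directly (a minimal sketch, assuming the placeholder in both columns is the literal string "-"): replace the placeholder with NaN, then fill column A's gaps from column B.

import numpy as np
import pandas as pd

# Hypothetical frame mirroring the question's sample data
df = pd.DataFrame({'A': ['34', '37', '-', '94', '-', '47'],
                   'B': ['35', '-', '32', '92', '91', '-']})

# Turn the "-" placeholder into real missing values, then fill A's gaps from B
df['C'] = df['A'].replace('-', np.nan).fillna(df['B'])
print(df['C'].tolist())  # ['34', '37', '32', '94', '91', '47']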

Related

How to list values from column A where column B is NaN?

I have a dataframe (let's call it df) that looks a bit like this:
Offer | Cancelled | Restriction
------|-----------|------------
1 | N | A
2 | Y | B
3 | N | NaN
4 | Y | NaN
I have the following bit of code, which creates a list of all offers that have been cancelled:
cancelled = list('000'+df.loc[(df['Cancelled']=='Y'),'Offer'].astype(str))
What I now want to do is to adapt this to create a list of all offers where the 'Restriction' column is not NaN. So my desired result would look like this:
['0001','0002']
Does anyone know how to do this please?
You were almost there. Just add the extra condition that the Restriction column may not be NaN.
list('000'+df.loc[(df['Restriction'].notna()) & (df['Cancelled'] == 'Y'), 'Offer'].astype(str))
If you just want to filter on Restriction not being NaN (without the Cancelled condition), see the comment by @Henry Ecker.
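A minimal sketch of that simpler filter: drop the Cancelled condition and keep only the rows where Restriction is not NaN, which yields the desired ['0001', '0002']:

list('000' + df.loc[df['Restriction'].notna(), 'Offer'].astype(str))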

Pandas DataFrame: Fill NA values based on group mean

I would like to update the NA values of a Pandas DataFrame column with the values in a groupby object.
Let's illustrate with an example:
We have the following DataFrame columns:
|--------|-------|-----|-------------|
| row_id | Month | Day | Temperature |
|--------|-------|-----|-------------|
| 1 | 1 | 1 | 14.3 |
| 2 | 1 | 1 | 14.8 |
| 3 | 1 | 2 | 13.1 |
|--------|-------|-----|-------------|
We're simply measuring temperature multiple times a day for many months. Now, let's assume that for some of our records, the temperature reading failed and we have a NA.
|--------|-------|-----|-------------|
| row_id | Month | Day | Temperature |
|--------|-------|-----|-------------|
| 1 | 1 | 1 | 14.3 |
| 2 | 1 | 1 | 14.8 |
| 3 | 1 | 2 | 13.1 |
| 4 | 1 | 2 | NA |
| 5 | 1 | 3 | 14.8 |
| 6 | 1 | 4 | NA |
|--------|-------|-----|-------------|
We could just use pandas' .fillna(), however we want to be a little more sophisticated. Since there are multiple readings per day (there could be hundreds per day), we'd like to take the daily average and use that as our fill value.
We can get the daily averages with a simple groupby:
avg_temp_by_month_day = df.groupby(['month', 'day'])['temperature'].mean()
This gives us the mean for each day of each month. The question is, how best to fill the NA values with these groupby values?
We could use an apply():
df['temperature'] = df.apply(
    lambda row: avg_temp_by_month_day.loc[row['month'], row['day']] if pd.isna(row['temperature']) else row['temperature'],
    axis=1
)
however this is really slow (1M+ records).
Is there a vectorized approach, perhaps using np.where(), or maybe creating another Series and merging?
What's a more efficient way to perform this operation?
Thank you!
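A fully vectorized option (a sketch; the answers below take a merge-based route instead) is to broadcast each group's mean back onto its rows with groupby().transform and fill only the missing entries, using the same lower-case column names as the code above:

# Broadcast each (month, day) mean back to its rows, then fill only the NaNs
daily_mean = df.groupby(['month', 'day'])['temperature'].transform('mean')
df['temperature'] = df['temperature'].fillna(daily_mean)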
I'm not sure if this is the fastest, but instead of the ~1 hour the apply took, it takes ~20 seconds for 1M+ records. The code below has been updated to work on one or many columns.
local_avg_cols = ['temperature'] # can work with multiple columns
# Create groupby's to get local averages
local_averages = df.groupby(['month', 'day'])[local_avg_cols].mean()
# Convert to DataFrame and prepare for merge
local_averages = pd.DataFrame(local_averages, columns=local_avg_cols).reset_index()
# Merge into original dataframe
df = df.merge(local_averages, on=['month', 'day'], how='left', suffixes=('', '_avg'))
# Now overwrite na values with values from new '_avg' col
for col in local_avg_cols:
    df[col] = df[col].mask(df[col].isna(), df[col + '_avg'])
# Drop new avg cols
df = df.drop(columns=[col+'_avg' for col in local_avg_cols])
If anyone finds a more efficient way to do this, (efficient in processing time, or in just readability), I'll unmark this answer and mark yours. Thank you!
I'm guessing two things are slowing down your process. First, you don't need to convert the groupby result to a DataFrame. Second, you don't need the for loop.
import pandas as pd
from numpy import nan

# Populate the dataset
data = {"Month": [1] * 6,
        "Day": [1, 1, 2, 2, 3, 4],
        "Temperature": [14.3, 14.8, 13.1, nan, 14.8, nan]}

# Create the dataframe
df = pd.DataFrame(data)
local_averages = df.groupby(['Month', 'Day'])['Temperature'].mean()
df = df.merge(local_averages, on=['Month', 'Day'], how='left', suffixes=('', '_avg'))
# Fill the missing values of the Temperature column from Temperature_avg
df['Temperature'] = df['Temperature'].fillna(df['Temperature_avg'])
df = df.drop(columns='Temperature_avg')
Groupby is a resource-heavy process, so make the most of it when you use it. Furthermore, as you already know, loops are not a good idea when it comes to dataframes. Additionally, if you have large data you may want to avoid creating extra variables from it; I might put the groupby straight into the merge if my data had 1M rows and many columns.
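For example, folding the groupby straight into the merge as suggested above (a sketch with the same column names):

df = df.merge(
    df.groupby(['Month', 'Day'])['Temperature'].mean(),
    on=['Month', 'Day'], how='left', suffixes=('', '_avg')
)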

Pandas groupby function returns NaN values

I have a list of people with fields unique_id, sex, born_at (birthday) and I’m trying to group by sex and age bins, and count the rows in each segment.
Can’t figure out why I keep getting NaN or 0 as the output for each segment.
Here’s the latest approach I've taken...
Data sample:
| unique_id | sex | born_at    |
|-----------|-----|------------|
| 1         | M   | 1963-08-04 |
| 2         | F   | 1972-03-22 |
| 3         | M   | 1982-02-10 |
| 4         | M   | 1989-05-02 |
| 5         | F   | 1974-01-09 |
Code:
df['num_people'] = 1
breakpoints = [18, 25, 35, 45, 55, 65]
df[['sex', 'born_at', 'num_people']].groupby(['sex', pd.cut(df.born_at.dt.year, bins=breakpoints)]).agg('count')
I’ve tried summing as the agg type, removing NaNs from the data series, pivot_table using the same pd.cut function but no luck. Guessing there’s also probably a better way to do this that doesn’t involve creating a column of 1s.
Desired output would be something like this...
The extra born_at column isn't necessary in the output and I'd also like the age bins to be 18 to 24, 25 to 34, etc. instead of 18 to 25, 25 to 35, etc. but I'm not sure how to specify that either.
I think you missed the calculation of the current age. The ranges you define for splitting the birth years only make sense when you use them to compute the current age (otherwise every grouped cell will be NaN or zero, because the lowest year in your sample is 1963 and the right-most bin edge is 65). So first of all you want to calculate the age:
from datetime import datetime
datetime.now().year - df.born_at.dt.year
This can then be used to group the data, alongside sex:
df.groupby(['sex', pd.cut(datetime.now().year - df.born_at.dt.year, bins=breakpoints)]).agg('count')
To get rid of the NaN cells, simply append a fillna(0), like this:
df.groupby(['sex', pd.cut(datetime.now().year - df.born_at.dt.year, bins=breakpoints)]).agg('count').fillna(0).rename(columns={'born_at': 'count'})
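To get bins labelled 18 to 24, 25 to 34, etc. (as asked above), a sketch using right-open intervals and explicit labels (the label strings are my own assumption about the desired formatting); .size() also avoids the helper column of 1s:

age = datetime.now().year - df.born_at.dt.year
bins = [18, 25, 35, 45, 55, 65]
labels = ['18-24', '25-34', '35-44', '45-54', '55-64']
# right=False makes each interval [lower, upper), i.e. 18-24, 25-34, ...
df.groupby(['sex', pd.cut(age, bins=bins, right=False, labels=labels)]).size()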

Python 3.x pandas how to compare duplicates and drop the rows with the higher values in csv?

Hi, I'm new to Python and currently using Python 3.x. I have a very large set of data in a CSV that needs to be filtered. I searched online and many recommended loading it into a pandas DataFrame (done).
My columns can be defined as: "ID", "Name", "Time", "Token", "Text"
I need to check under "Token" for any duplicates - which can be done via
df = df[df.Token.duplicated(keep=False)]
(Please correct me if I am wrong)
But the problem is, I need to keep the original row while dropping the other duplicates. For this, I was told to compare with "Time": the row with the smallest "Time" value is the original (keep it), while the rest of the duplicates are dropped.
For example:
| ID | Name | Time | Token | Text |
|----|------|------|-------|------|
| 1  | John | 333  | Hello | xxxx |
| 2  | Mary | 233  | Hiiii | xxxx |
| 3  | Jame | 222  | Hello | xxxx |
| 4  | Kenn | 555  | Hello | xxxx |
Desired output:
| 2  | Mary | 233  | Hiiii | xxxx |
| 3  | Jame | 222  | Hello | xxxx |
What I have done:
## compare and keep the smaller value
def dups(df):
    return df[df["Time"] < df["Time"]]

df = df[df.Token.duplicated()].apply(dups)
This is roughly where I am stuck! Can anyone help? It's my first time coding in Python; any help will be greatly appreciated.
Use sort_values + drop_duplicates:
df = df.sort_values('Time') \
       .drop_duplicates('Token', keep='first') \
       .sort_index()
df

   ID  Name  Time  Token  Text
1   2  Mary   233  Hiiii  xxxx
2   3  Jame   222  Hello  xxxx
The final sort_index call restores order to your original dataframe. If you want to retrieve a monotonically increasing index beyond this point, call reset_index.
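An alternative sketch (not from the original answer) that keeps, for each Token, the row with the smallest Time:

# For each Token, find the index of the row with the minimum Time,
# keep only those rows, and restore the original row order
df = df.loc[df.groupby('Token')['Time'].idxmin()].sort_index()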

Pivot or transpose in SQL or Pandas

I have a table of the form:
item_code | attribute | time_offset | mean | median | description | ...
The attribute column has one of 40 possible values and the time_offset column can be an integer from 0 to 20.
I want to transform this table to a wide one of the form:
item_code | <attribute1>_<time_offset1>_mean | <attribute1>_<time_offset1>_median | <attribute1>_<time_offset1>_description | <attribute1>_<time_offset1>_... | <attribute2>...
I can do this either in SQL or in Pandas but I'm having difficulty with the fact that some of the columns are not numeric, so it is hard to come up with an aggregation function for them.
I can guarantee that each combination of item_code, attribute and time_offset will have only one row, so I do not need an aggregation function. Is there something like a transpose operation that will allow me to do what I am looking for?
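Since every (item_code, attribute, time_offset) combination is unique, no aggregation is needed; in pandas one way to do this is set_index followed by unstack (a sketch, assuming the column names given above):

import pandas as pd

# Move attribute and time_offset into the column index, one column per combination
wide = (df.set_index(['item_code', 'attribute', 'time_offset'])
          .unstack(['attribute', 'time_offset']))

# Flatten the column MultiIndex into <attribute>_<time_offset>_<measure> names
wide.columns = [f'{attr}_{off}_{col}' for col, attr, off in wide.columns]
wide = wide.reset_index()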
