Table 1
df1 = pd.DataFrame({'df1_id':['1','2','3'],'col1':["a","b","c"],'col2':["d","e","f"]})
Table 2
df2 = pd.DataFrame({'df1_id':['1','2','1','1'],'date':['01-05-2021','03-05-2021','05-05-2021','03-05-2021'],'data':[12,13,16,9],'test':['g','h','j','i'],'test2':['k','l','m','n']})
Result Table
Brief Explanation on how the Result table needs to be created:
I have two data frames and I want to merge them on df1_id. But the date column from the second table should be transposed into columns of the resultant table.
The date columns in the result table will cover the range between the min and max dates in the second table.
The values under those date columns will come from the data column of the second table.
Also, the test column in the result table should take its value from the row with the latest date in the second table.
I hope this is clear. Any suggestion or help regarding this will be greatly appreciated.
I have tried using pivot on the second table and then merging the pivoted second table with df1, but it's not working. I also do not know how to get only one row containing the latest value of test.
Note: I am trying to solve this problem using vectorization and do not want to iterate through the rows one by one.
You need to pivot your df2 into two separate tables, since we need both the data and test values, and then merge both resulting pivot tables with df1:
import pandas as pd

df1 = pd.DataFrame({'df1_id':['1','2','3'],'col1':["a","b","c"],'col2':["d","e","f"]})
df2 = pd.DataFrame({'df1_id':['1','2','1','1'],'date':['01-05-2021','03-05-2021','03-05-2021','05-05-2021'],'data':[12,13,9,16],'test':['g','h','i','j']})
test_piv = df2.pivot(index=['df1_id'], columns=['date'], values=['test'])
data_piv = df2.pivot(index=['df1_id'], columns=['date'], values=['data'])
# Forward-fill across the date columns so the last column holds each id's latest test value.
max_test = test_piv['test'].ffill(axis=1).iloc[:, -1].rename('test')
final = df1.merge(data_piv['data'], left_on='df1_id', right_index=True, how='left')
final = final.merge(max_test, left_on='df1_id', right_index=True, how='left')
The resulting final dataframe is then as below:
| | df1_id | col1 | col2 | 01-05-2021 | 03-05-2021 | 05-05-2021 | test |
|---:|---------:|:-------|:-------|-------------:|-------------:|-------------:|:-------|
| 0 | 1 | a | d | 12 | 9 | 16 | j |
| 1 | 2 | b | e | nan | 13 | nan | h |
| 2 | 3 | c | f | nan | nan | nan | nan |
Here is a solution to the question:

1. First sort df2 by df1_id and date to ensure the table entries are in order.
2. Drop duplicates on df1_id, keeping the last row, to get the latest values of test and test2.
3. Pivot df2 to get the dates as columns and data as their values.
4. Merge the pivoted table with the de-duplicated table to attach the latest values of test and test2.
5. Merge with df1 to get the resultant table.
import pandas as pd

df1 = pd.DataFrame({'df1_id':['1','2','3'],'col1':["a","b","c"],'col2':["d","e","f"]})
df2 = pd.DataFrame({'df1_id':['1','2','1','1'],'date':['01-05-2021','03-05-2021','05-05-2021','03-05-2021'],'data':[12,13,16,9],'test':['g','h','j','i'],'test2':['k','l','m','n']})
# Sort so that rows within each df1_id are in date order (string sort is safe here
# because all dates share the same month and year in day-first format).
df2 = df2.sort_values(by=['df1_id','date'])
# Keep only the last (latest-dated) row per id for the test/test2 values.
df2_latest_vals = df2.drop_duplicates(subset=['df1_id'], keep='last')
# Pivot so each date becomes a column holding that day's data value.
df2_pivoted = df2.pivot_table(index=['df1_id'], columns=['date'], values=['data'])
df2_pivoted = df2_pivoted.droplevel(0, axis=1).reset_index()
# Attach the latest test/test2 values and drop the now-redundant date and data columns.
df2_pivoted = pd.merge(df2_pivoted, df2_latest_vals, on='df1_id')
df2_pivoted = df2_pivoted.drop(columns=['date','data'])
result = pd.merge(df1,df2_pivoted,on='df1_id',how='left')
result
Note: I have not been able to figure out how to cover the entire date range between 01-05-2021 and 05-05-2021 and show the missing dates as NaN columns. If anyone can help, please edit the answer.
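One possible way to cover the full date range (a sketch, assuming the dates are day-first strings as in the sample): parse the dates, build the complete daily range, and reindex the pivoted columns so the missing dates show up as NaN.

import pandas as pd

# Parse the day-first date strings and build the complete daily range.
dates = pd.to_datetime(df2['date'], format='%d-%m-%Y')
full_range = pd.date_range(dates.min(), dates.max()).strftime('%d-%m-%Y')
# Reindex the pivot's columns against the full range; absent dates appear as NaN columns.
df2_pivoted = df2.pivot_table(index='df1_id', columns='date', values='data')
df2_pivoted = df2_pivoted.reindex(columns=full_range).reset_index()

After this, the merge steps above proceed unchanged.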
I'm trying to merge two columns. Merging is not working out, and I've been reading and asking questions for two days trying to find a solution, so I'm going to go a different route.
Since I have to change the column name after I merge anyway, why not just create a new column and fill it based on the other two?
So I have columns A, B and C now.
C is a blank column.
Column A has values for most rows, but not all. In cases where column A doesn't have a value, I want to use column B's value instead. One of the two values should end up in column C.
Please keep in mind that when column A doesn't have a value, a "-" was put in its place (hence why I'm having a horrendous time trying to merge these columns).
I have converted the "-" to NaN, but then .fillna doesn't work and I'm not sure why.
I'm thinking I have to write a for loop and an if statement to accomplish this, although I feel like there is a function that can build a new column based on the other two columns' values.
| A  | B  |
|----|----|
| 34 | 35 |
| 37 | -  |
| -  | 32 |
| 94 | 92 |
| -  | 91 |
| 47 | -  |
Desired Result:

| C  |
|----|
| 34 |
| 37 |
| 32 |
| 94 |
| 91 |
| 47 |
Does this answer your question?

import pandas as pd

# If A holds the "-" placeholder, take B's value instead.
df['A'] = df.apply(lambda x: x['B'] if x['A'] == '-' else x['A'], axis=1)
# Note: x['A'] == np.NaN is always False (NaN never compares equal), so test with pd.isna.
df['A'] = df.apply(lambda x: x['B'] if pd.isna(x['A']) else x['A'], axis=1)
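A vectorized alternative (a minimal sketch, assuming df holds the A and B columns shown above) avoids the row-wise apply:

import numpy as np
import pandas as pd

# Treat "-" as missing, then fill the gaps in A from B and store the result in C.
df['C'] = df['A'].replace('-', np.nan).fillna(df['B'])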
I am trying to work on some time series data and am quite new to pandas DataFrames. I have a dataframe with two columns as below:
+---+---------------------+-------+
|   | 0                   | 1     |
+---+---------------------+-------+
| 1 | 2018-08-02 23:00:00 | 456.8 |
| 2 | 2018-08-02 23:01:00 | 457.9 |
+---+---------------------+-------+
I am trying to convert it into a series that keeps both columns as they are in the dataframe. How can this be done? pd.Series converts the dataframe into a series of only one column.
There is no such thing as a pandas Series with two columns. My guess is that you want to generate a Series with column 0 as the index and column 1 as the values. You can get that by setting the index and extracting the column of interest (assuming your DataFrame is in df):
df.set_index(0)[1]
As stated in the comments, using pd.Series(df.col1, df.col2) "produces a Series with NaNs". The reason is that the Series gets reindexed with the object passed as the index argument. The current dev docs clarify:
If data is dict-like and index is None, then the keys in the data are used as the index. If the index is not None, the resulting Series is reindexed with the index values.
To circumvent the reindexing, this can be done:

pd.Series(df[1].values, index=df[0])

Since df[1].values is a plain NumPy array rather than a dict-like pd.Series, nothing gets reindexed, and df[0] is set as the index as-is.
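A quick check on the sample data from the question (a minimal sketch; the 0/1 column labels follow the dataframe shown above):

import pandas as pd

df = pd.DataFrame({0: ['2018-08-02 23:00:00', '2018-08-02 23:01:00'],
                   1: [456.8, 457.9]})
s = df.set_index(0)[1]
# s is now a Series: the timestamps form the index, the readings are the values.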
I have a list of people with fields unique_id, sex, born_at (birthday) and I’m trying to group by sex and age bins, and count the rows in each segment.
Can’t figure out why I keep getting NaN or 0 as the output for each segment.
Here’s the latest approach I've taken...
Data sample:
| unique_id | sex | born_at    |
|-----------|-----|------------|
| 1         | M   | 1963-08-04 |
| 2         | F   | 1972-03-22 |
| 3         | M   | 1982-02-10 |
| 4         | M   | 1989-05-02 |
| 5         | F   | 1974-01-09 |
Code:
df['num_people'] = 1
breakpoints = [18, 25, 35, 45, 55, 65]
df[['sex','born_at','num_people']].groupby(['sex', pd.cut(df.born_at.dt.year, bins=breakpoints)]).agg('count')
I've tried sum as the agg type, removing NaNs from the data series, and a pivot_table with the same pd.cut, but no luck. Guessing there's probably also a better way to do this that doesn't involve creating a column of 1s.
Desired output would be something like this...
The extra born_at column isn't necessary in the output, and I'd also like the age bins to be 18 to 24, 25 to 34, etc. instead of 18 to 25, 25 to 35, etc., but I'm not sure how to specify that either.
I think you missed calculating the current age. The ranges you define only make sense when applied to ages rather than to raw birth years (otherwise all grouped cells are NaN or zero, because the lowest birth year in your sample is 1963 while the right-most bin edge is 65). So first of all you want to calculate the age (assuming born_at is already a datetime column):

from datetime import datetime

datetime.now().year - df.born_at.dt.year
This can then be used to group the data (together with the sex column):

df.groupby(['sex', pd.cut(datetime.now().year - df.born_at.dt.year, bins=breakpoints)]).agg('count')
In order to get rid of the NaN cells you simply append a fillna(0), like this:

df.groupby(['sex', pd.cut(datetime.now().year - df.born_at.dt.year, bins=breakpoints)]).agg('count').fillna(0).rename(columns={'born_at':'count'})
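To get left-closed bins like 18 to 24 and 25 to 34 instead, pd.cut accepts right=False, plus optional labels. A minimal sketch on the sample data (column names as in the question; ages computed from the current year):

import pandas as pd
from datetime import datetime

df = pd.DataFrame({'unique_id': [1, 2, 3, 4, 5],
                   'sex': ['M', 'F', 'M', 'M', 'F'],
                   'born_at': pd.to_datetime(['1963-08-04', '1972-03-22',
                                              '1982-02-10', '1989-05-02',
                                              '1974-01-09'])})
age = datetime.now().year - df['born_at'].dt.year
# right=False makes the bins left-closed: [18, 25) covers ages 18-24, and so on.
labels = ['18-24', '25-34', '35-44', '45-54', '55-64']
age_bin = pd.cut(age, bins=[18, 25, 35, 45, 55, 65], right=False, labels=labels)
df.groupby(['sex', age_bin]).size()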
I have a DataFrame at daily level:

| day        | type | rev  | impressions | yearmonth |
|------------|------|------|-------------|-----------|
| 2015-10-01 | a    | 1999 | 1000        | 201510    |
| 2015-10-02 | a    | 300  | 6777        | 201510    |
| 2015-11-07 | b    | 2000 | 4999        | 201511    |
yearmonth is a column I added to the DataFrame. The task is to group by yearmonth (and maybe type), then sum up the numeric columns (or select a value) and use the result as the new DataFrame.
On grouping the above DataFrame, we should get one row per month:
| yearmonth | type | rev  | impressions |
|-----------|------|------|-------------|
| 201510    | a    | 2299 | 7777        |
| 201511    | b    | 2000 | 4999        |
Let us say df is the DataFrame. I tried doing

test = df.groupby('yearmonth')

and checked the methods available on test (via tab completion), but I did not see anything that lets me select columns and also aggregate them (I guess we can use agg for the sum). Any inputs?
Add the as_index parameter, like this:

test = df.groupby('yearmonth', as_index=False)
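To actually produce the summed table you still need an aggregation after the groupby. A minimal sketch on the sample data above (as_index=False keeps the group keys as regular columns):

import pandas as pd

df = pd.DataFrame({'day': ['2015-10-01', '2015-10-02', '2015-11-07'],
                   'type': ['a', 'a', 'b'],
                   'rev': [1999, 300, 2000],
                   'impressions': [1000, 6777, 4999],
                   'yearmonth': [201510, 201510, 201511]})
# Group by yearmonth and type, then sum only the numeric measure columns.
result = df.groupby(['yearmonth', 'type'], as_index=False)[['rev', 'impressions']].sum()

result matches the expected table shown in the question.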