If I have a pandas DataFrame, is it possible to take values from a row and use them as labels for new columns?
I have something like this:
| Team | DateTime   | Score |
|------|------------|-------|
| Red  | 2021/03/19 | 5     |
| Red  | 2021/03/20 | 10    |
| Blue | 2022/04/10 | 20    |
I would like to write this data to a new dataframe that has:
a Team column
a SumScore column per Year/Month
So I would have one row per team, with a new column for each month of each year containing the sum of the scores for that specific month.
It should be like this:
| Team | 2021/03 | 2022/04 |
|------|---------|---------|
| Red  | 15      | 0       |
| Blue | 0       | 20      |
The DateTime format is YYYY/MM/DD.
I hope I was clear
You can use
df = (df.assign(YM=df['DateTime'].str.rsplit('/', n=1).str[0])   # keep the YYYY/MM part
        .pivot_table(index='Team', columns='YM', values='Score', aggfunc='sum', fill_value=0)
        .reset_index())
print(df)
YM Team 2021/03 2022/04
0 Blue 0 20
1 Red 15 0
We can use pd.crosstab, which allows us to "compute a simple cross tabulation of two (or more) factors".
Below I've changed df['DateTime'] to contain year/month only.
df['DateTime'] = pd.to_datetime(df['DateTime']).dt.strftime('%Y/%m')
pd.crosstab(
    df['Team'],
    df['DateTime'],
    values=df['Score'],
    aggfunc='sum'
).fillna(0)
If you don't want multiple levels in the index, just call reset_index on your crosstab and then drop the DateTime name from the columns axis, as sketched below.
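A minimal sketch of that flattening step, assuming the df and crosstab from above:
out = (pd.crosstab(df['Team'], df['DateTime'], values=df['Score'], aggfunc='sum')
         .fillna(0)
         .reset_index()                # Team becomes a regular column again
         .rename_axis(columns=None))   # drop the 'DateTime' name from the columns axis
print(out)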
Table 1
df1 = pd.DataFrame({'df1_id':['1','2','3'],'col1':["a","b","c"],'col2':["d","e","f"]})
Table 2
df2 = pd.DataFrame({'df1_id':['1','2','1','1'],'date':['01-05-2021','03-05-2021','05-05-2021','03-05-2021'],'data':[12,13,16,9],'test':['g','h','j','i'],'test2':['k','l','m','n']})
Result Table
Brief explanation of how the result table needs to be created:
I have two data frames and I want to merge them based on df1_id, but the date column from the second table should be transposed into the resultant table.
The date columns for the result table will be the range between the min date and max date from the second table.
The column values for those dates in the result table will come from the data column of the second table.
Also, the test column from the second table will only take its value from the latest date for the result table.
I hope this is clear. Any suggestion or help regarding this will be greatly appreciated.
I have tried using pivot on the second table and then merging the pivoted second table with df1, but it's not working. I also do not know how to get only one row with the latest value of test.
Note: I am trying to solve this problem using vectorization and do not want to iterate over each row serially.
You need to pivot df2 into two separate tables, since we need both the data and the test values, and then merge both resulting pivot tables with df1:
df1 = pd.DataFrame({'df1_id':['1','2','3'],'col1':["a","b","c"],'col2':["d","e","f"]})
df2 = pd.DataFrame({'df1_id':['1','2','1','1'],'date':['01-05-2021','03-05-2021','03-05-2021','05-05-2021'],'data':[12,13,9,16],'test':['g','h','i','j']})
# pivot the dates into columns, once for test and once for data
test_piv = df2.pivot(index=['df1_id'], columns=['date'], values=['test'])
data_piv = df2.pivot(index=['df1_id'], columns=['date'], values=['data'])
# forward-fill across the date columns and take the last one,
# i.e. the test value at the latest available date
max_test = test_piv['test'].ffill(axis=1).iloc[:, -1].rename('test')
# merge the pivoted data columns and the latest test value onto df1
final = df1.merge(data_piv['data'], left_on=df1.df1_id, right_index=True, how='left')
final = final.merge(max_test, left_on=df1.df1_id, right_index=True, how='left')
The resulting final dataframe is shown below:
| | df1_id | col1 | col2 | 01-05-2021 | 03-05-2021 | 05-05-2021 | test |
|---:|---------:|:-------|:-------|-------------:|-------------:|-------------:|:-------|
| 0 | 1 | a | d | 12 | 9 | 16 | j |
| 1 | 2 | b | e | nan | 13 | nan | h |
| 2 | 3 | c | f | nan | nan | nan | nan |
Here is the solution for the question:
I first sort df2 by df1_id and date to ensure the table entries are in order.
Then I drop duplicates based on df1_id, keeping the last row, to get the latest values of test and test2.
Then I pivot df2 to turn the dates into columns with data as their values.
Then I merge the pivoted table with df2_latest_vals to bring in the latest values of test and test2.
Then I merge with df1 to get the resultant table.
df1 = pd.DataFrame({'df1_id':['1','2','3'],'col1':["a","b","c"],'col2':["d","e","f"]})
df2 = pd.DataFrame({'df1_id':['1','2','1','1'],'date':['01-05-2021','03-05-2021','05-05-2021','03-05-2021'],'data':[12,13,16,9],'test':['g','h','j','i'],'test2':['k','l','m','n']})
# order rows so the last row per df1_id is the latest date
df2 = df2.sort_values(by=['df1_id', 'date'])
# keep only the latest row per df1_id (latest test / test2 values)
df2_latest_vals = df2.drop_duplicates(subset=['df1_id'], keep='last')
# pivot the dates into columns with data as the values
df2_pivoted = df2.pivot_table(index=['df1_id'], columns=['date'], values=['data'])
df2_pivoted = df2_pivoted.droplevel(0, axis=1).reset_index()
# attach the latest test / test2 values, then drop the helper columns
df2_pivoted = pd.merge(df2_pivoted, df2_latest_vals, on='df1_id')
df2_pivoted = df2_pivoted.drop(columns=['date', 'data'])
result = pd.merge(df1, df2_pivoted, on='df1_id', how='left')
result
Note: I have not been able to figure out how to get the entire date range between 01-05-2021 and 05-05-2021 and show the missing dates as NaN. If anyone can help, please edit the answer.
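A possible sketch for that missing piece (assuming the DD-MM-YYYY string format used above): reindex the pivoted date columns over the full pd.date_range before merging, so days with no data become NaN columns.
# Hypothetical sketch: build every day between the min and max date as
# DD-MM-YYYY strings and reindex the pivoted columns over that range.
dates = pd.to_datetime(df2['date'], dayfirst=True)
full_range = pd.date_range(dates.min(), dates.max(), freq='D').strftime('%d-%m-%Y')
df2_pivoted = df2.pivot_table(index=['df1_id'], columns=['date'], values=['data'])
df2_pivoted = df2_pivoted.droplevel(0, axis=1).reindex(columns=full_range).reset_index()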
I implemented my solution with a very inefficient loop, but there has to be a better approach that I can't think of.
My data:
| ticker | cusip |
|--------|-------|
| DIS    | 123   |
| DIS    | None  |
| None   | abc   |
| None   | xyz   |
If I sort on cusip with ascending=False and drop_duplicates, it works to remove the DIS row whose cusip is None. But at the same time it removes the bottom rows whose ticker is None, which I want to keep.
I found the duplicate rows and then looped over each duplicate group and applied drop_duplicates, which is very inefficient since I have to loop across thousands of rows.
Is there an option to ignore the None rows for the duplicate check?
Try duplicated with an or condition for the None rows:
out = df[~df.duplicated('ticker')|df.ticker.eq('None')]
Out[448]:
ticker cusip
0 DIS 123
2 None abc
3 None xyz
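If the missing tickers are real NaN values rather than the literal string 'None' (an assumption, since the post prints them as None), the same idea works with isna:
# Assumption: ticker holds actual NaN/None instead of the string 'None'
out = df[~df.duplicated('ticker') | df['ticker'].isna()]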
You can use Series.where to convert 'None' values to NaN, plus DataFrame.dropna with the subset param.
df['cusip'] = df['cusip'].where(~(df['cusip'] == 'None'))
df = df.dropna(subset=['cusip']).reset_index(drop=True)
print(df)
Output:
  ticker cusip
0    DIS   123
1   None   abc
2   None   xyz
I'm trying to filter a pandas pivot table, but am not sure of the correct syntax to filter the "group by" arguments. I've been trying the standard df['column_name'] but I receive a KeyError.
Here's the code for the table
pivot = pd.pivot_table(q5, values='ENTRIES', index=['DATE', 'STATION', 'ID'], aggfunc='sum')
Here is what my pivot table looks like:
                   ENTRIES
DATE   STATION ID
1/1/13 1 AVE   1        12
               2        60
               3         0
               4       111
               5       123
...
The desired result is to return Dates and Stations where at least one ID had < 10 Entries, but not all IDs had <10 Entries
Thanks
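A possible sketch of that filter (not from the original thread; it assumes the pivot built above and uses groupby filtering on the DATE and STATION index levels):
# Hypothetical sketch: keep (DATE, STATION) groups where some, but not all,
# IDs have fewer than 10 entries.
filtered = pivot.groupby(level=['DATE', 'STATION']).filter(
    lambda g: (g['ENTRIES'] < 10).any() and not (g['ENTRIES'] < 10).all()
)
print(filtered)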
I have the following data
id | value1 | value2
-----------------------
1 A red
1 B red
1 C blue
2 A blue
2 B blue
2 C green
The result I need is:
id | values
---------------------------------
1     [[A,red],[B,red],[C,blue]]
2     [[A,blue],[B,blue],[C,green]]
My approach so far is to groupby and aggregate value1 and value2 into separate arrays and then merge them together, as described in Combine PySpark DataFrame ArrayType fields into single ArrayType field.
df.groupBy(["id"]).agg(*[F.collect_list("value1"), F.collect_list("value2")])
However, since order is not guaranteed in collect_list() (see here), how can I make sure value1 and value2 are both matched to the correct values?
Collecting them separately could lead to two lists with different orders, and the subsequent merge would match the wrong values.
As commented by @Raphael, you can combine the value1 and value2 columns into a single struct-type column first, and then collect_list:
import pyspark.sql.functions as F
(df.withColumn('values', F.struct(df.value1, df.value2))
.groupBy('id')
.agg(F.collect_list('values').alias('values'))).show()
+---+--------------------+
| id| values|
+---+--------------------+
| 1|[[A,red], [B,red]...|
| 2|[[A,blue], [B,blu...|
+---+--------------------+
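Because each element of values is a struct, value1 and value2 travel together no matter how collect_list orders the list. A small sketch (assuming df is the Spark DataFrame from the question) to unpack the pairs and check them:
# Unpack the collected structs to confirm each value1 stays paired with its value2
grouped = (df.withColumn('values', F.struct(df.value1, df.value2))
             .groupBy('id')
             .agg(F.collect_list('values').alias('values')))

(grouped.select('id', F.explode('values').alias('pair'))
        .select('id', 'pair.value1', 'pair.value2')
        .show())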