How can I add a column with dates (starting from today)? - python

How can I add to column C the dates (starting from today)? Column B is an example of what I want to get.
import pandas as pd

df = pd.DataFrame({'N': ['1', '2', '3', '4'],
                   'B': ['16.11.2021', '17.11.2021', '18.11.2021', '19.11.2021'],
                   'C': ['nan', 'nan', 'nan', 'nan']})

If I understood your question correctly, you want something like this:
import datetime

# Start from today and add one day per row
base = datetime.datetime.today()
date_list = [(base + datetime.timedelta(days=x)).strftime('%d.%m.%Y') for x in range(len(df))]
df['C'] = date_list
This produces consecutive dates starting from today, in the same format as column B.
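An alternative sketch (assuming pandas is available) uses pd.date_range, which avoids the manual timedelta arithmetic entirely:

```python
import pandas as pd

df = pd.DataFrame({'N': ['1', '2', '3', '4']})

# date_range generates len(df) consecutive days starting today;
# strftime converts them to the same dd.mm.YYYY format as column B
df['C'] = pd.date_range(start=pd.Timestamp.today().normalize(),
                        periods=len(df)).strftime('%d.%m.%Y')
print(df)
```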

Related

How do I add each element of a list as an element of a new column of a dataframe?

I have a list of lists, like:
X = [[1, 1, 2], [3, 4, 5], [8, 1, 9]]
and I have a dataframe with some columns. I want to add a new column to my dataframe so that each row's value in the new column is one of the sublists of X. How can I do this?
To be more clear: for example, I have a dataframe like the one below and a list like X, and I want a new dataframe like:
Try this:
import pandas as pd

df = pd.DataFrame({'a': {0: '1', 1: '2', 2: '3'},
                   'b': {0: 's', 1: 'n', 2: 'k'}})
X = [[1, 1, 2], [3, 4, 5], [8, 1, 9]]
df['new_col'] = X
df
output:
   a  b    new_col
0  1  s  [1, 1, 2]
1  2  n  [3, 4, 5]
2  3  k  [8, 1, 9]
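One assumption worth making explicit: the list must contain exactly one entry per row, or pandas raises a ValueError on assignment. A minimal sketch with that check:

```python
import pandas as pd

df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['s', 'n', 'k']})
X = [[1, 1, 2], [3, 4, 5], [8, 1, 9]]

# The list length must match the number of rows before assignment
assert len(X) == len(df)
df['new_col'] = X
print(df['new_col'].tolist())  # [[1, 1, 2], [3, 4, 5], [8, 1, 9]]
```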

Merging dataframes based on a column's values

I have a dictionary of dataframes (df_t). Each dataframe has exactly the same two columns but a different number of rows. I merged all the dataframes into one dataframe.
However, I would like to use df_A's label_1 as a reference and drop df_B's label_2 and df_C's label_3, like in the example:
import pandas as pd

df_A = pd.DataFrame({'A': [1, 0.5, 1, 0.5], 'label_1': ['-1', '1', '-1', '1']})
df_B = pd.DataFrame({'A': [1, 1.5, 2.3], 'label_2': ['-1', '1', '-1']})
df_C = pd.DataFrame({'A': [2.1, 5.5], 'label_3': ['-1', '1']})
df_t = {'1': df_A, '2': df_B, '3': df_C}
#d = {k: v.set_index('label') for k, v in df_t.items()}
dfx = pd.concat(df_t, axis=1)
dfx.columns = dfx.columns.droplevel(0)
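If the goal is to keep only label_1 as the reference, one possible continuation (a sketch building on the concat above, not a confirmed solution) is to drop the other label columns afterwards:

```python
import pandas as pd

df_A = pd.DataFrame({'A': [1, 0.5, 1, 0.5], 'label_1': ['-1', '1', '-1', '1']})
df_B = pd.DataFrame({'A': [1, 1.5, 2.3], 'label_2': ['-1', '1', '-1']})
df_C = pd.DataFrame({'A': [2.1, 5.5], 'label_3': ['-1', '1']})
df_t = {'1': df_A, '2': df_B, '3': df_C}

# Concatenate side by side (shorter frames are padded with NaN),
# then flatten the column MultiIndex
dfx = pd.concat(df_t, axis=1)
dfx.columns = dfx.columns.droplevel(0)

# Keep label_1 as the reference label and drop the others
dfx = dfx.drop(columns=['label_2', 'label_3'])
print(dfx)
```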

Which Pandas function do I need? group_by or pivot

I'm still relatively new to Pandas and I can't tell which of the functions I'm best off using to get to my answer. I have looked at pivot, pivot_table, group_by and aggregate but I can't seem to get it to do what I require. Quite possibly user error, for which I apologise!
I have data like this:
Code to create df:
import pandas as pd

df = pd.DataFrame([
    ['1', '1', 'A', 3, 7],
    ['1', '1', 'B', 2, 9],
    ['1', '1', 'C', 2, 9],
    ['1', '2', 'A', 4, 10],
    ['1', '2', 'B', 4, 0],
    ['1', '2', 'C', 9, 8],
    ['2', '1', 'A', 3, 8],
    ['2', '1', 'B', 10, 4],
    ['2', '1', 'C', 0, 1],
    ['2', '2', 'A', 1, 6],
    ['2', '2', 'B', 10, 2],
    ['2', '2', 'C', 10, 3]
], columns=['Field1', 'Field2', 'Type', 'Price1', 'Price2'])
print(df)
I am trying to get data like this:
My end goal is to end up with one column for A, one for B and one for C, where A uses Price1 and B & C use Price2.
I don't necessarily want the max, min, average or sum of the Price, as theoretically (although unlikely) there could be two different Price1 values for the same Fields & Type.
What's the best function to use in Pandas to get to what I need?
Use DataFrame.set_index with DataFrame.unstack to reshape. The output has a MultiIndex in the columns, so sort the second level with DataFrame.sort_index, flatten the column names, and finally turn the Field levels back into columns:
df1 = (df.set_index(['Field1','Field2', 'Type'])
.unstack(fill_value=0)
.sort_index(axis=1, level=1))
df1.columns = [f'{b}-{a}' for a, b in df1.columns]
df1 = df1.reset_index()
print (df1)
  Field1 Field2  A-Price1  A-Price2  B-Price1  B-Price2  C-Price1  C-Price2
0      1      1         3         7         2         9         2         9
1      1      2         4        10         4         0         9         8
2      2      1         3         8        10         4         0         1
3      2      2         1         6        10         2        10         3
A solution with DataFrame.pivot_table is also possible, but note that it aggregates any duplicates of the first three columns, by default with the mean function:
df2 = (df.pivot_table(index=['Field1','Field2'],
columns='Type',
values=['Price1', 'Price2'],
aggfunc='mean')
.sort_index(axis=1, level=1))
df2.columns = [f'{b}-{a}' for a, b in df2.columns]
df2 = df2.reset_index()
print (df2)
Use pivot_table:
pd.pivot_table(df, values=['Price1', 'Price2'],
               index=['Field1', 'Field2'], columns='Type').reset_index()
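Since pivot_table silently averages duplicates, it can be worth checking for them beforehand, given the asker's concern about two different Price1 values for the same Fields & Type. A small sketch (the duplicated row is deliberate, for illustration only):

```python
import pandas as pd

df = pd.DataFrame([
    ['1', '1', 'A', 3, 7],
    ['1', '1', 'A', 4, 7],   # deliberate duplicate Field1/Field2/Type combo
    ['1', '2', 'B', 4, 0],
], columns=['Field1', 'Field2', 'Type', 'Price1', 'Price2'])

# True if any Field1/Field2/Type combination appears more than once,
# i.e. pivot_table would aggregate rather than merely reshape
has_dupes = df.duplicated(subset=['Field1', 'Field2', 'Type']).any()
print(has_dupes)  # True
```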

Rows to columns based on another column

I've merged two dataframes, but now there are duplicate rows. I want to move my rows to columns based on/grouped by a column value.
I have already merged the two dataframes:
df_merge = pd.merge(top_emails_df, keyword_df, on='kmed_idf')
The new dataframe looks like this:
import pandas as pd

df = pd.DataFrame({'kmed_idf': ['1', '1', '1', '2', '2'],
                   'n_docs': [796, 796, 796, 200, 200],
                   'email_from': ['foo', 'foo', 'foo', 'bar', 'bar']})
I tried to stack the dataframe:
newtest = df_merge.set_index(['kmed_idf']).stack(level=0)
newtest= newtest.to_frame()
But this only created a series. When converted to a dataframe it's still not very useful.
What I would like is a dataframe where each row is a unique value of 'kmed_idf', and the rows are now columns. Something like this:
import pandas as pd

df = pd.DataFrame({'kmed_idf': ['1', '2', '3'],
                   'n_docs': [796],
                   'n_docs2': [796],
                   'n_docs3': [796]})
This will make it easier to delete the duplicates. I've also tried using the drop duplicates pandas function, but to no avail.
If all you want is to remove duplicates, the .drop_duplicates function should be the way to go.
I don't know why it didn't work for you, but please try this:
import pandas as pd
df = pd.DataFrame({'kmed_idf': ['1', '1', '1', '2', '2'],
'n_docs': [796, 796, 796, 200, 200],
'email_from': ['foo', 'foo', 'foo', 'bar', 'bar']})
df.drop_duplicates(inplace=True)
print(df)
Output:
email_from kmed_idf n_docs
0 foo 1 796
3 bar 2 200
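If the goal really is to spread the duplicate rows into one column per occurrence, as in the desired output sketched in the question, one possibility is to number the duplicates within each group with groupby().cumcount() and pivot on that number (the n_docs1/n_docs2/... column names are illustrative, not from the original data):

```python
import pandas as pd

df = pd.DataFrame({'kmed_idf': ['1', '1', '1', '2', '2'],
                   'n_docs': [796, 796, 796, 200, 200],
                   'email_from': ['foo', 'foo', 'foo', 'bar', 'bar']})

# Number the duplicates within each kmed_idf group (1, 2, 3, ...)
df['occ'] = df.groupby('kmed_idf').cumcount() + 1

# Pivot so each occurrence becomes its own column; groups with fewer
# occurrences get NaN in the extra columns
wide = df.pivot(index='kmed_idf', columns='occ', values='n_docs')
wide.columns = [f'n_docs{c}' for c in wide.columns]
wide = wide.reset_index()
print(wide)
```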

Subtracting values between groups within a dataframe

I am attempting to calculate the differences between two groups that may have mismatched data in an efficient manner.
The following dataframe, df,
df = pd.DataFrame({'type': ['A', 'A', 'A', 'W', 'W', 'W'],
'code': ['1', '2', '3', '1', '2', '4'],
'values': [50, 25, 25, 50, 10, 40]})
has two types that have mismatched "codes" -- notably code 3 is not present for the 'W' type and code 4 is not present for the 'A' type. I have wrapped codes as strings as in my particular case they are sometimes strings.
I would like to subtract the values for matching codes between the two types so that we obtain:
result = pd.DataFrame({'code': ['1', '2', '3', '4'],
'diff': [0, 15, 25, -40]})
Where the sign would indicate which type had the greater value.
I have spent some time examining variations on groupby diff methods here, but have not seen anything that deals with subtracting between two potentially mismatched columns. Instead, most questions appear to concern the intended use of the diff() method.
The route I've tried most recently is using a list comprehension on df.groupby('type') to split into two dataframes, but then I remain with a similar problem regarding subtracting mismatched cases.
Group by code, then substitute missing values with 0:
import numpy as np
import pandas as pd

df = pd.DataFrame({'type': ['A', 'A', 'A', 'W', 'W', 'W'],
                   'code': ['1', '2', '3', '1', '2', '4'],
                   'values': [50, 25, 25, 50, 10, 40]})

def my_func(x):
    # What if there is more than one value for a type/code combo?
    a_value = x[x.type == 'A']['values'].max()
    w_value = x[x.type == 'W']['values'].max()
    a_value = 0 if np.isnan(a_value) else a_value
    w_value = 0 if np.isnan(w_value) else w_value
    return a_value - w_value

df_new = df.groupby('code').apply(my_func)
df_new = df_new.reset_index()
df_new = df_new.rename(columns={0: 'diff'})
print(df_new)
code diff
0 1 0
1 2 15
2 3 25
3 4 -40
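A vectorized alternative (a sketch of the same idea, not necessarily the only approach) pivots codes to rows and types to columns, fills the missing combinations with 0, and subtracts the two columns:

```python
import pandas as pd

df = pd.DataFrame({'type': ['A', 'A', 'A', 'W', 'W', 'W'],
                   'code': ['1', '2', '3', '1', '2', '4'],
                   'values': [50, 25, 25, 50, 10, 40]})

# Rows become codes, columns become types; fill_value=0 handles the
# mismatched codes (3 missing for W, 4 missing for A)
p = df.pivot_table(index='code', columns='type',
                   values='values', aggfunc='sum', fill_value=0)

# Positive diff means type A had the greater value, negative means W did
result = (p['A'] - p['W']).reset_index(name='diff')
print(result)
```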