I've merged two dataframes, but now there are duplicate rows. I want to move rows to columns, grouped by a column's value.
I have already merged the two dataframes:
df_merge = pd.merge(top_emails_df, keyword_df, on='kmed_idf')
The new dataframe looks like this:
import pandas as pd
df = pd.DataFrame({'kmed_idf': ['1', '1', '1', '2', '2'],
                   'n_docs': [796, 796, 796, 200, 200],
                   'email_from': ['foo', 'foo', 'foo', 'bar', 'bar']})
I tried to stack the dataframe:
newtest = df_merge.set_index(['kmed_idf']).stack(level=0)
newtest = newtest.to_frame()
But this only created a Series; even converted to a dataframe, it's still not very useful.
What I would like is a dataframe where each row is a unique value of 'kmed_idf' and the duplicated rows become columns. Something like this:
import pandas as pd
df = pd.DataFrame({'kmed_idf': ['1', '2', '3'],
                   'n_docs': [796],
                   'n_docs2': [796],
                   'n_docs3': [796]})
This would make it easier to delete the duplicates. I've also tried using the pandas drop_duplicates function, but to no avail.
If all you want is to remove duplicates, I think the .drop_duplicates() function should be the way to go. I don't know why it didn't work for you, but please try this:
import pandas as pd
df = pd.DataFrame({'kmed_idf': ['1', '1', '1', '2', '2'],
'n_docs': [796, 796, 796, 200, 200],
'email_from': ['foo', 'foo', 'foo', 'bar', 'bar']})
df.drop_duplicates(inplace=True)
print(df)
Output:
  email_from kmed_idf  n_docs
0        foo        1     796
3        bar        2     200
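If you also want the duplicated rows spread out into columns per kmed_idf (the n_docs, n_docs2, n_docs3 shape you sketched), here is a minimal sketch using a per-group counter and a pivot; the occurrence helper column and the n_docs1/n_docs2/... names are my own, not from your code:
import pandas as pd

df = pd.DataFrame({'kmed_idf': ['1', '1', '1', '2', '2'],
                   'n_docs': [796, 796, 796, 200, 200],
                   'email_from': ['foo', 'foo', 'foo', 'bar', 'bar']})

# Number the duplicates within each kmed_idf group: 0, 1, 2, ...
df['occurrence'] = df.groupby('kmed_idf').cumcount()

# Pivot so each occurrence becomes its own column
wide = df.pivot(index='kmed_idf', columns='occurrence', values='n_docs')
wide.columns = [f'n_docs{i + 1}' for i in wide.columns]
print(wide.reset_index())
Groups with fewer duplicates (kmed_idf '2' here) end up with NaN in the extra columns.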
Related
How can I add dates to column C, starting from today? Column B is an example of what I want to get.
import pandas as pd

df = pd.DataFrame({'N': ['1', '2', '3', '4'],
                   'B': ['16.11.2021', '17.11.2021', '18.11.2021', '19.11.2021'],
                   'C': ['nan', 'nan', 'nan', 'nan']})
If I understood your question correctly, you want something like this:
import datetime

base = datetime.datetime.today()
# Build the dates going back from today, oldest first; note that we order
# *before* formatting, since sorting 'dd.mm.YYYY' strings would break across months
date_list = [(base - datetime.timedelta(days=x)).strftime('%d.%m.%Y')
             for x in reversed(range(len(df)))]
df['C'] = date_list
This produces consecutive ascending dates ending today, which matches column B if today is 19.11.2021.
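A shorter variant of the same idea using pandas directly, which also ends the range at today like the loop above:
import pandas as pd

# len(df) consecutive dates ending today, formatted as dd.mm.YYYY;
# use start=pd.Timestamp.today() instead if the range should begin today
df['C'] = pd.date_range(end=pd.Timestamp.today(), periods=len(df)).strftime('%d.%m.%Y')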
I have a list of lists, like:
X = [[1, 1, 2], [3, 4, 5], [8, 1, 9]]
and I have a dataframe with some columns. I want to add a new column to my dataframe where each row's value is one of the sublists of X. How can I do this?
To be more clear: given a dataframe with columns 'a' and 'b' like the one below, and a list like X, I want a new dataframe with each sublist of X as the value of a new column.
Try this:
import pandas as pd

df = pd.DataFrame({'a': {0: '1', 1: '2', 2: '3'},
                   'b': {0: 's', 1: 'n', 2: 'k'}})
X = [[1, 1, 2], [3, 4, 5], [8, 1, 9]]
df['new_col'] = X
df
Output:
   a  b    new_col
0  1  s  [1, 1, 2]
1  2  n  [3, 4, 5]
2  3  k  [8, 1, 9]
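Note this relies on len(X) being equal to len(df). If you would rather not modify df in place, a small variant using assign (same assumption about the lengths):
df2 = df.assign(new_col=X)  # returns a new dataframe; df itself is unchanged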
I have a bunch of dataframes, like the ones below:
import pandas as pd
data1 = [['1', '2', 'mary', 123], ['1', '3', 'john', 234 ], ['2', '4', 'layla', 345 ]]
data2 = [['2', '6', 'josh', 345], ['1', '2', 'dolores', 987], ['1', '4', 'kate', 843]]
df1 = pd.DataFrame(data1, columns = ['state', 'city', 'name', 'number1'])
df2 = pd.DataFrame(data2, columns = ['state', 'city', 'name', 'number1'])
For some silly reason I need to transform each one into a list in this manner (one dict per row):
list(
df1.apply(
lambda x: {
"profile": {"state": x["state"], "city": x["city"], "name": x["name"]},
"number1": x["number1"],
},
axis=1,
)
)
which returns exactly what I need:
[{'profile': {'state': '1', 'city': '2', 'name': 'mary'}, 'number1': 123},
{'profile': {'state': '1', 'city': '3', 'name': 'john'}, 'number1': 234},
{'profile': {'state': '2', 'city': '4', 'name': 'layla'}, 'number1': 345}]
It works if I do it for each dataframe, but I need to write a function so I can use it later. Also, I need to be able to store both df1 and df2 separately after the operation.
I tried something like this:
df_list = [df1, df2]
for row in df_list:
    row = list(row.apply(lambda x: {'send': {'state': x['state'], 'city': x['city'], 'name': x['name']}, 'number1': x['number1']}, axis=1))
but this only leaves me with the value of the last df in the list (df2), since reassigning the loop variable just rebinds the local name and never changes df_list.
I also tried something like this (and a lot of other stuff):
new_values = []
for row in df_list:
    row = list(row.apply(lambda x: {'send': {'state': x['state'], 'city': x['city'], 'name': x['name']}, 'number1': x['number1']}, axis=1))
    new_values.append(df_list)
I know it might be about the row value not being saved locally. I've read a lot of posts here about similar problems, but I couldn't manage to fully apply them... Any help will be appreciated, I'm really stuck here.
Do you mean this?
def func(df):
    return list(df.apply(lambda x: {'profile': {'state': x['state'], 'city': x['city'], 'name': x['name']}, 'number1': x['number1']}, axis=1))
You can use it just like this:
df1 = func(df1)
And if you want to convert all of your dataframes at once:
df1, df2 = [func(df) for df in [df1, df2]]
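If you want to keep the converted records separate from the original dataframes, one more sketch (the records1/records2/records names here are my own, just for illustration):
# One variable per dataframe, instead of rebinding the loop variable
# (which never modifies df_list)
records1 = func(df1)
records2 = func(df2)

# Or keyed by name in a dict
records = {name: func(df) for name, df in {'df1': df1, 'df2': df2}.items()}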
I have a dataframe:
values
NaN
NaN
[1,2,5]
[2]
[5]
And a dictionary
{nan: nan,
 '1': '10',
 '2': '11',
 '5': '12'}
The dataframe contains keys from the dictionary.
How can I replace these keys with the corresponding values from the same dictionary?
Output:
values
NaN
NaN
[10,11,12]
[11]
[12]
I have tried
so_df['values'].replace(my_dictionary, inplace=True)
so_df.head()
Series.replace matches whole cell values, so it won't look inside the lists. You can use the apply() method of the DataFrame instead. Check the implementation below:
import pandas as pd
import numpy as np

df = pd.DataFrame({'values': [np.nan,
                              np.nan,
                              ['1', '2', '5'],
                              ['2'],
                              ['5']]})
my_dict = {np.nan: np.nan,
'1': '10',
'2': '11',
'5': '12'}
def update(row):
    # Translate lists element by element; scalars (including NaN) directly
    if isinstance(row['values'], list):
        row['values'] = [my_dict.get(val) for val in row['values']]
    else:
        row['values'] = my_dict.get(row['values'])
    return row

df = df.apply(lambda row: update(row), axis=1)
Simple implementation. Just make sure that if your dataframe contains strings, your dictionary keys are also strings.
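A slightly shorter sketch of the same idea, mapping over the column as a Series instead of applying row-wise over the whole frame (same assumptions about the data as above):
def translate(v):
    # Lists are translated element by element; scalars (including NaN) directly
    if isinstance(v, list):
        return [my_dict.get(x) for x in v]
    return my_dict.get(v)

df['values'] = df['values'].map(translate)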
Try:
df['values']=pd.to_numeric(df['values'].explode().astype(str).map(my_dict), errors='coerce').groupby(level=0).agg(list)
Setup

import pandas as pd
import numpy as np

df = pd.DataFrame({'values': [np.nan, np.nan, [1, 2, 5], [2], 5]})
my_dict = {np.nan: np.nan, '1': '10', '2': '11', '5': '12'}
Use Series.explode with Series.map
df['values'] = (df['values'].explode()
                            .astype(str)
                            .map(my_dict)
                            .dropna()
                            .astype(int)
                            .groupby(level=0)
                            .agg(list))
If there are other strings in your values column, you will need pd.to_numeric with errors='coerce'; to keep those values you should do:
df['values'] = (pd.to_numeric(df['values'].explode()
                                          .astype(str)
                                          .replace(my_dict),
                              errors='coerce')
                  .dropna()
                  .groupby(level=0)
                  .agg(list)
                  .fillna(df['values']))
Output

         values
0           NaN
1           NaN
2  [10, 11, 12]
3          [11]
4          [12]
UPDATE
A solution without explode:
df['values'] = (pd.to_numeric(df['values'].apply(pd.Series)
                                          .stack()
                                          .reset_index(level=1, drop=True)
                                          .astype(str)
                                          .replace(my_dict),
                              errors='coerce')
                  .dropna()
                  .groupby(level=0)
                  .agg(list)
                  .fillna(df['values']))
I am attempting to calculate the differences between two groups that may have mismatched data in an efficient manner.
The following dataframe, df,
df = pd.DataFrame({'type': ['A', 'A', 'A', 'W', 'W', 'W'],
'code': ['1', '2', '3', '1', '2', '4'],
'values': [50, 25, 25, 50, 10, 40]})
has two types with mismatched "codes" -- notably, code 3 is not present for the 'W' type and code 4 is not present for the 'A' type. I have represented the codes as strings because in my particular case they sometimes are strings.
I would like to subtract the values for matching codes between the two types, so that we obtain:
result = pd.DataFrame({'code': ['1', '2', '3', '4'],
'diff': [0, 15, 25, -40]})
Where the sign would indicate which type had the greater value.
I have spent some time examining variations on groupby diff methods here, but have not seen anything that deals with the particular issue of subtracting between two potentially mismatched groups. Instead, most questions appear to concern the intended use of the diff() method.
The route I've tried most recently is using a list comprehension on df.groupby('type') to split into two dataframes, but then I'm left with a similar problem of subtracting the mismatched cases.
Group by code, then substitute missing values with 0:
import numpy as np
import pandas as pd

df = pd.DataFrame({'type': ['A', 'A', 'A', 'W', 'W', 'W'],
                   'code': ['1', '2', '3', '1', '2', '4'],
                   'values': [50, 25, 25, 50, 10, 40]})

def my_func(x):
    # What if there is more than one value for a type/code combo?
    # max() returns NaN when a type has no row for this code
    a_value = x[x.type == 'A']['values'].max()
    w_value = x[x.type == 'W']['values'].max()
    a_value = 0 if np.isnan(a_value) else a_value
    w_value = 0 if np.isnan(w_value) else w_value
    return a_value - w_value

df_new = df.groupby('code').apply(my_func)
df_new = df_new.reset_index()
df_new = df_new.rename(columns={0: 'diff'})
print(df_new)
  code  diff
0    1     0
1    2    15
2    3    25
3    4   -40
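As a follow-up, a vectorized sketch of the same idea (assuming at most one value per type/code pair; with duplicates, pivot_table would aggregate them, by default with the mean):
# One column per type, with 0 filled in for the missing codes
wide = df.pivot_table(index='code', columns='type', values='values', fill_value=0)

# Positive means type A had the greater value, negative means type W did
result = (wide['A'] - wide['W']).reset_index(name='diff')
print(result)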