I am not sure of the best way to title this. I have a dataframe in which one of the columns, let's call it 'Tags', may or may not contain a list. If 'Tags' is a list, I want to replicate that row once for each unique item in 'Tags', replacing the list in that column with the corresponding single item in each new row.
Example:
import pandas as pd
# create dummy dataframe
df = {'Date': ['2020-10-28'],
'Item': 'My_fake_item',
'Tags': [['A', 'B']],
'Count': 3}
df = pd.DataFrame(df, columns=['Date', 'Item', 'Tags', 'Count'])
This results in:
         Date          Item    Tags  Count
0  2020-10-28  My_fake_item  [A, B]      3
And I need a function that will change the dataframe to this:
         Date          Item Tags  Count
0  2020-10-28  My_fake_item    A      3
1  2020-10-28  My_fake_item    B      3
Apply the explode method, for example
df_exploded = (
    df.set_index(["Date", "Item", "Count"])
    .apply(pd.Series.explode)
    .reset_index()
)
will result in
df_exploded
>>>
Date Item Count Tags
0 2020-10-28 My_fake_item 3 A
1 2020-10-28 My_fake_item 3 B
and there's no need to check whether each element in the column is a list or not. For example:
import pandas as pd
# create dummy dataframe
df = {'Date': ['2020-10-28', '2020-11-01'],
'Item': ['My_fake_item', 'My_other_item'],
'Tags': [['A', 'B'], 'C'],
'Count': [3, 5]}
df = pd.DataFrame(df, columns=['Date', 'Item', 'Tags', 'Count'])
will result in
Date Item Count Tags
0 2020-10-28 My_fake_item 3 A
1 2020-10-28 My_fake_item 3 B
2 2020-11-01 My_other_item 5 C
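As a side note (assuming a reasonably recent pandas, 0.25 or later, where DataFrame.explode is available), you can also explode the single column directly and skip the set_index round trip:
df_exploded = df.explode("Tags").reset_index(drop=True)
Scalar entries such as 'C' are left untouched by explode, so the mixed example above produces the same rows.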
I have the following data as a pandas DataFrame:
df = pd.DataFrame({
'id': [1,2,3,4, 5],
'first_name': ['Sheldon', 'Raj', 'Leonard', 'Howard', 'Amy'],
'last_name': ['Copper', 'Koothrappali', 'Hofstadter', 'Wolowitz', 'Fowler'],
'movie_ids': ['34,265,268,65',
'34,43,65,61',
'5,876,8',
'14,5,268',
'134,845,2']}).set_index(["id"], drop=False)
and a list of ids:
movie_ids = ['34','845']
I would like to get the indexes of the rows where any item of the movie_ids list appears in the movie_ids column.
I was trying to convert the column value to a list and then filter on that; with that I only get the matched values:
result = list(filter(lambda x: set(movie_ids).intersection(set(x.split(","))), df['movie_ids'].values))
then using the loc function to get only those rows:
df = df.loc[df['movie_ids'].isin(result)]
But I guess this is not the most efficient way, for example with millions of rows.
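You can build a single regular expression from the id list and filter with str.contains, which keeps the whole check vectorized: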
df[df.movie_ids.str.contains(rf"\b{'|'.join(movie_ids)}\b")]
id first_name last_name movie_ids
id
1 1 Sheldon Copper 34,265,268,65
2 2 Raj Koothrappali 34,43,65,61
5 5 Amy Fowler 134,845,2
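One caveat worth noting: without a group around the alternation, the \b anchors only apply to the first and last alternative, so with other inputs an id like '340' would still match '34'. Wrapping the ids in a non-capturing group keeps every id anchored on both sides (same idea, just slightly more defensive):
df[df.movie_ids.str.contains(rf"\b(?:{'|'.join(movie_ids)})\b")]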
I am working with two dataframes:
df - contains multiple rows with data about scientific articles, including a magazine_id which is connected to the ids in the second dataframe
magazines - contains only 2 columns: id and title
In the magazines dataframe there are duplicate titles.
I am unsure about how to change the ids referenced in the first dataframe to the ids that will be kept after the duplicates are removed.
df = pd.DataFrame({'id': [1003, 1009, 1010, 1034],
                   'title': ['Article1', 'Article2', 'Article3', 'Article4'],
                   'magazine_id': [1, 2, 3, 4]})
magazines = pd.DataFrame({'id': [1, 2, 3, 4],
                          'title': ['Mag1', 'Mag1', 'Mag3', 'Mag4']})
So from magazines, entry with id = 2 should be deleted because it has the same title as id = 1.
The output for df should be:
id title magazine_id
1003 'Article1' 1
1009 'Article2' 1
1010 'Article3' 3
1034 'Article4' 4
I have created two data frames that align with your problem statement. The approach is to de-dupe the titles, keeping a list of the IDs that were used, and then map (merge) to re-align the title_id of the second data frame.
import pandas as pd
import numpy as np
df = pd.DataFrame({"id": range(10), "title": np.random.choice(list("ABCDEFG"), 10)})
df2 = pd.DataFrame(
    {"content_id": range(100), "title_id": np.random.choice(range(10), 100)}
)
# drop duplicates, keep original ids as list
df = (
    df.groupby("title").agg(ids=("id", list)).reset_index().assign(id=lambda d: d.index)
)
# map title ids in second data frame to newly de-duped ids
df2 = (
    df.loc[:, ["id", "ids"]]
    .explode("ids")
    .astype({"ids": int})  # exploded entries are object dtype; match title_id's dtype for the merge
    .merge(df2, left_on="ids", right_on="title_id")
    .drop(columns=["ids", "title_id"])
    .rename(columns={"id": "title_id"})
)
Since your question is not entirely clear, I made up a scenario. If it matches your case, my suggestion is to first merge the dataframes and then use drop_duplicates. See the following:
main_df = pd.DataFrame({'title_id':[1,2,3,4,5], 'authors': ['w', 'e', 'w','e','w']})
titles = pd.DataFrame({'id': [1,2,3,4,5], 'title': ['a', 'b', 'a', 'b','a']})
main_df.merge(titles, left_on='title_id', right_on='id').drop_duplicates(['title', 'authors'])
main_df:
   title_id authors
0         1       w
1         2       e
2         3       w
3         4       e
4         5       w
titles df with duplicates:
   id title
0   1     a
1   2     b
2   3     a
3   4     b
4   5     a
The result:
   title_id authors  id title
0         1       w   1     a
1         2       e   2     b
Use drop_duplicates to get rid of duplicate titles, and use ffill:
import numpy as np

magazines = magazines.drop_duplicates(subset=['title'])
df.loc[~df['magazine_id'].isin(magazines['id']), 'magazine_id'] = np.nan
df['magazine_id'] = df['magazine_id'].ffill().astype(int)
Output:
>>> df
id title magazine_id
0 1003 Article1 1
1 1009 Article2 1
2 1010 Article3 3
3 1034 Article4 4
>>> magazines
id title
0 1 Mag1
2 3 Mag3
3 4 Mag4
In Python, I have a Pandas dataframe (df) that can be replicated with the code below.
import pandas as pd
data = [['2021-09-12', 'item1', 'IL', 5], ['2021-09-12', 'item2', 'CA', 7], ['2021-08-13', 'item2', 'CA', 8], ['2021-06-12', 'item3', 'NY', 10], ['2021-05-01', 'item1', 'IL', 11]]
df = pd.DataFrame(data, columns = ['date', 'product', 'state', 'sales'])
I also have two strings.
startdate = '2021-08-01'
enddate = '2021-09-12'
I am trying to group by product and state, and add a column df['sum_sales'] that sums up df['sales'] when df['date'] is between startdate and enddate.
I tried df.groupby(['product', 'state']) but I am not sure how to add the condition above.
You can use loc, between, and groupby.sum().
between returns a Boolean Series that is True where the condition is satisfied - your conditions here are the dates.
loc filters the DataFrame down using that Boolean Series.
groupby.sum() returns the sum of sales for each group.
startdate = '2021-08-01'
enddate = '2021-09-12'
>>> df.loc[df.date.between(startdate,enddate)].groupby(['product', 'state'])['sales'].sum()
product state
item1 IL 5
item2 CA 15
Note that your date column is of dtype object (plain strings) because of the way you define your inputs.
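If you prefer the comparison to be made on real dates rather than strings (a minimal sketch, assuming the dates stay in the ISO format shown above), convert the column first:
df['date'] = pd.to_datetime(df['date'])
df.loc[df.date.between(startdate, enddate)].groupby(['product', 'state'])['sales'].sum()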
I would like to group the stocks by the Type column and apply a function to each group that returns the first row where the Value column is not NaN, copying that row into a separate data frame.
I got the following so far:
dummy data:
df1 = {'Date': ['04.12.1998','05.12.1998','06.12.1998','04.12.1998','05.12.1998','06.12.1998'],
'Type': [1,1,1,2,2,2],
'Value': ['NaN', 100, 120, 'NaN', 'NaN', 20]}
df2 = pd.DataFrame(df1, columns = ['Date', 'Type', 'Value'])
print (df2)
Date Type Value
0 04.12.1998 1 NaN
1 05.12.1998 1 100
2 06.12.1998 1 120
3 04.12.1998 2 NaN
4 05.12.1998 2 NaN
5 06.12.1998 2 20
import pandas as pd
selectedStockDates = {'Date': [], 'Type': [], 'Values': []}
selectedStockDates = pd.DataFrame(selectedStockDates, columns = ['Date', 'Type', 'Values'])
first_valid_index = df2[['Value']].first_valid_index()
selectedStockDates.loc[df2.index[first_valid_index]] = df2.iloc[first_valid_index]
The code above should work for the first id, but I am struggling to apply this to all ids in the data frame. Does anyone know how to do this?
First convert Value to numeric so that the 'NaN' strings become real NaN, then mask the rows in the dataframe where Value is NaN, groupby the dataframe on Type and aggregate using first:
df2['Value'] = pd.to_numeric(df2['Value'], errors='coerce')
df2.mask(df2['Value'].isna()).groupby('Type', as_index=False).first()
Type Date Value
0 1.0 05.12.1998 100.0
1 2.0 06.12.1998 20.0
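If the integer dtype of Type matters (it becomes float here because masking introduces NaN into that column), it can be cast back after the aggregation, for example:
df2.mask(df2['Value'].isna()).groupby('Type', as_index=False).first().astype({'Type': int})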
Just use groupby and first, but you need to make sure that your null values are np.nan and not strings like they are in your sample data:
df2.groupby('Type')['Value'].first()
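For the sample data, one way to do that conversion (a small sketch, reusing pd.to_numeric as in the other answer) before calling first:
df2['Value'] = pd.to_numeric(df2['Value'], errors='coerce')
df2.groupby('Type')['Value'].first()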
I have the following dataframe:
id;name;parent_of
1;John;3
2;Rachel;3
3;Peter;
The column "parent_of" holds the id of the parent row. What I want is to have the parent's name instead of its id in the "parent_of" column.
Basically I want to get this:
id;name;parent_of
1;John;Peter
2;Rachel;Peter
3;Peter;
I already wrote a solution, but it is not the most efficient way:
import pandas as pd
d = {'id': [1, 2, 3], 'name': ['John', 'Rachel', 'Peter'], 'parent_of': [3,3,'']}
df = pd.DataFrame(data=d)
df_tmp = df[['id', 'name']]
df = pd.merge(df, df_tmp, left_on='parent_of', right_on='id', how='left').drop('parent_of', axis=1).drop('id_y', axis=1)
df=df.rename(columns={"name_x": "name", "name_y": "parent_of"})
print(df)
Do you have any better solution to achieve this?
Thanks!
Check with map
df['parent_of']=df.parent_of.map(df.set_index('id')['name'])
df
Out[514]:
id name parent_of
0 1 John Peter
1 2 Rachel Peter
2 3 Peter NaN
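If you would rather keep the empty string from the original data instead of NaN for rows without a parent (a small variation, assuming that is the desired output), add fillna:
df['parent_of'] = df.parent_of.map(df.set_index('id')['name']).fillna('')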