I have the following data as a pandas DataFrame:
df = pd.DataFrame({
'id': [1,2,3,4, 5],
'first_name': ['Sheldon', 'Raj', 'Leonard', 'Howard', 'Amy'],
'last_name': ['Copper', 'Koothrappali', 'Hofstadter', 'Wolowitz', 'Fowler'],
'movie_ids': ['34,265,268,65',
'34,43,65,61',
'5,876,8',
'14,5,268',
'134,845,2']}).set_index(["id"], drop=False)
and a list of ids:
movie_ids = ['34','845']
I would like to get the indexes of the rows where any item of the movie_ids list appears in the movie_ids column.
I was trying to convert the column value to a list and then filter on that, but this way I only get the matched values:
result = list(filter(lambda x: set(movie_ids).intersection(set(x.split(","))), df['movie_ids'].values))
then using loc to get only those rows:
df = df.loc[df['movie_ids'].isin(result)]
But I guess this is not the most efficient way, for example with millions of rows.
You can build a regex alternation from the list and use str.contains; a non-capturing group keeps the word boundaries applied to every id:
df[df.movie_ids.str.contains(rf"\b(?:{'|'.join(movie_ids)})\b")]
id first_name last_name movie_ids
id
1 1 Sheldon Copper 34,265,268,65
2 2 Raj Koothrappali 34,43,65,61
5 5 Amy Fowler 134,845,2
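If regex matching over millions of rows turns out to be slow, a possible alternative (a sketch, assuming the df and movie_ids defined above) is to split the column, explode it and filter with isin, which also avoids accidental partial matches:
# split the comma-separated strings into one id per row, keeping the index labels
exploded = df['movie_ids'].str.split(',').explode()
# index labels whose ids intersect the lookup list
matching_index = exploded[exploded.isin(movie_ids)].index.unique()
result = df.loc[matching_index]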
I am working with two dataframes:
df - contains multiple rows with data about scientific articles, including a magazine_id which is connected to the ids in the second dataframe
magazines - contains only 2 columns: id and title
In the magazines dataframe there are duplicate titles.
I am unsure about how to change the ids referenced in the first dataframe to the ids that will be kept after the duplicates are removed.
df = pd.DataFrame({'id': [1003, 1009, 1010, 1034],
                   'title': ['Article1', 'Article2', 'Article3', 'Article4'],
                   'magazine_id': [1, 2, 3, 4]})
magazines = pd.DataFrame({'id': [1, 2, 3, 4],
                          'title': ['Mag1','Mag1','Mag3','Mag4']})
So from magazines, entry with id = 2 should be deleted because it has the same title as id = 1.
The output for df should be:
id title magazine_id
1003 'Article1' 1
1009 'Article2' 1
1010 'Article3' 3
1034 'Article4' 4
I have created two data frames that align with your problem statement. The approach is to de-dupe titles, keeping a list of the IDs that were used, and then map (merge) to re-align the title_id of the second data frame.
import pandas as pd
import numpy as np
df = pd.DataFrame({"id": range(10), "title": np.random.choice(list("ABCDEFG"), 10)})
df2 = pd.DataFrame(
{"content_id": range(100), "title_id": np.random.choice(range(10), 100)}
)
# drop duplicates, keep original ids as list
df = (
df.groupby("title").agg(ids=("id", list)).reset_index().assign(id=lambda d: d.index)
)
# map title ids in second data frame to newly de-duped ids
df2 = (
df.loc[:, ["id", "ids"]]
.explode("ids")
.merge(df2, left_on="ids", right_on="title_id")
.drop(columns=["id", "title_id"])
.rename(columns={"ids": "title_id"})
)
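Applied to the magazines/df frames from the question, the same idea might look like this (a sketch, not tested beyond the sample data shown above):
# de-dupe magazines by title, keeping the first id as the canonical one and
# the full list of original ids for the mapping
magazines_dedup = (
    magazines.groupby("title")
    .agg(ids=("id", list), id=("id", "first"))
    .reset_index()
)
# map every original id to its canonical id and re-point df.magazine_id
id_map = magazines_dedup.explode("ids").set_index("ids")["id"]
df["magazine_id"] = df["magazine_id"].map(id_map)
magazines_dedup = magazines_dedup.drop(columns=["ids"])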
Since your question is not entirely clear, I made up a scenario. If it matches your case, my suggestion is to first merge the dataframes and then use drop_duplicates. See the following:
main_df = pd.DataFrame({'title_id':[1,2,3,4,5], 'authors': ['w', 'e', 'w','e','w']})
titles = pd.DataFrame({'id': [1,2,3,4,5], 'title': ['a', 'b', 'a', 'b','a']})
main_df.merge(titles, left_on='title_id', right_on='id').drop_duplicates(['title', 'authors'])
The result:
   title_id authors  id title
0         1       w   1     a
1         2       e   2     b
Use drop_duplicates to get rid of duplicate titles, and use ffill:
import numpy as np
magazines = magazines.drop_duplicates(subset=['title'])
df.loc[~df['magazine_id'].isin(magazines['id']), 'magazine_id'] = np.nan
df['magazine_id'] = df['magazine_id'].ffill().astype(int)
Output:
>>> df
id title magazine_id
0 1003 Article1 1
1 1009 Article2 1
2 1010 Article3 3
3 1034 Article4 4
>>> magazines
id title
0 1 Mag1
2 3 Mag3
3 4 Mag4
I am not sure of the best way to title this. I have a dataframe with a column, let's call it 'Tags', that may or may not contain a list. If 'Tags' is a list, I want to replicate that row as many times as there are unique items in 'Tags', replacing the value in that column with one unique item per row.
Example:
import pandas as pd
# create dummy dataframe
df = {'Date': ['2020-10-28'],
'Item': 'My_fake_item',
'Tags': [['A', 'B']],
'Count': 3}
df = pd.DataFrame(df, columns=['Date', 'Item', 'Tags', 'Count'])
Would result in:
         Date          Item    Tags  Count
0  2020-10-28  My_fake_item  [A, B]      3
And I need a function that will change the dataframe to this:
         Date          Item Tags  Count
0  2020-10-28  My_fake_item    A      3
1  2020-10-28  My_fake_item    B      3
Apply the explode method, for example
df_exploded = (
df.set_index(["Date", "Item", "Count"])
.apply(pd.Series.explode)
.reset_index()
)
will result in
df_exploded
>>>
Date Item Count Tags
0 2020-10-28 My_fake_item 3 A
1 2020-10-28 My_fake_item 3 B
and there's no need to check whether an element in the column is a list or not. For example, with a mixed column:
import pandas as pd
# create dummy dataframe
df = {'Date': ['2020-10-28', '2020-11-01'],
'Item': ['My_fake_item', 'My_other_item'],
'Tags': [['A', 'B'], 'C'],
'Count': [3, 5]}
df = pd.DataFrame(df, columns=['Date', 'Item', 'Tags', 'Count'])
applying the same explode will result in:
Date Item Count Tags
0 2020-10-28 My_fake_item 3 A
1 2020-10-28 My_fake_item 3 B
2 2020-11-01 My_other_item 5 C
Unpredictably formatted df:
First Name number last_name
0 Cthulhu 666 Smith
df = pd.DataFrame({'First Name': ['Cthulhu'], 'number': [666], 'last_name': ['Smith']})
The columns need to be put into the names and order given by TemplateColumns = ['First Name', 'other', 'number']. If columns don't exist, they can be created:
for col in TemplateColumns:
    if col not in df:
        df[col] = np.nan
Which gives:
First Name number last_name other
0 Cthulhu 666 Smith NaN
And initial columns need to be ordered the same as TemplateColumns, leaving the remaining columns at the end, to get desired_df:
First Name other number last_name
0 Cthulhu NaN 666 Smith
desired_df = pd.DataFrame({'First Name': ['Cthulhu'], 'other': [np.nan], 'number': [666], 'last_name': ['Smith']})
Reordering columns is well explained in other posts, but I don't know how to order the first n columns and keep the rest at the end. How can I do this?
Try this
cols = TemplateColumns + df.columns.difference(TemplateColumns, sort=False).tolist()
df_final = df.reindex(cols, axis=1)
Out[714]:
First Name other number last_name
0 Cthulhu NaN 666 Smith
You can write your own function to achieve this. Essentially, you can use .reindex() to reorder the dataframe while including empty columns if they don't exist. The only remaining part to figure out is how to add the remaining columns not in TemplateColumns to your dataframe. You can do this by taking the set difference of the column index from TemplateColumns, then updating the order before your call to .reindex().
Set up data & function
def reordered(df, new_order, include_remaining=True):
    cols_to_end = []
    if include_remaining:
        # gets the items in `df.columns` that are NOT in `new_order`
        cols_to_end = df.columns.difference(new_order, sort=False)
    # ensures that the new_order items come first
    final_order = new_order + list(cols_to_end)
    return df.reindex(columns=final_order)
df = pd.DataFrame({'First Name': ['Cthulhu'], 'number': [666], 'last_name': ['Smith']})
new_order = ['First Name', 'other', 'number']
with include_remaining:
out = reordered(df, new_order, include_remaining=True)
print(out)
First Name other number last_name
0 Cthulhu NaN 666 Smith
without include_remaining:
out = reordered(df, new_order, include_remaining=False)
print(out)
First Name other number
0 Cthulhu NaN 666
Use insert like this:
for col in TemplateColumns:
    if col not in df:
        # insert the missing column at its position in TemplateColumns
        df.insert(TemplateColumns.index(col), col, np.nan)
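For the sample df and TemplateColumns above, this should give the desired order directly:
  First Name  other  number last_name
0    Cthulhu    NaN     666     Smith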
I would like to group the rows by the Type column and apply a function to each grouped stock that returns the first row where the Value column is not NaN, copying it into a separate data frame.
I got the following so far:
dummy data:
df1 = {'Date': ['04.12.1998','05.12.1998','06.12.1998','04.12.1998','05.12.1998','06.12.1998'],
'Type': [1,1,1,2,2,2],
'Value': ['NaN', 100, 120, 'NaN', 'NaN', 20]}
df2 = pd.DataFrame(df1, columns = ['Date', 'Type', 'Value'])
print (df2)
Date Type Value
0 04.12.1998 1 NaN
1 05.12.1998 1 100
2 06.12.1998 1 120
3 04.12.1998 2 NaN
4 05.12.1998 2 NaN
5 06.12.1998 2 20
import pandas as pd
selectedStockDates = {'Date': [], 'Type': [], 'Values': []}
selectedStockDates = pd.DataFrame(selectedStockDates, columns = ['Date', 'Type', 'Values'])
first_valid_index = df2[['Values']].first_valid_index()
selectedStockDates.loc[df2.index[first_valid_index]] = df2.iloc[first_valid_index]
The code above should work for the first id, but I am struggling to apply this to all ids in the data frame. Does anyone know how to do this?
Let's mask the rows in the dataframe where the Value column is NaN, then group the dataframe by Type and aggregate using first:
df2['Value'] = pd.to_numeric(df2['Value'], errors='coerce')
df2.mask(df2['Value'].isna()).groupby('Type', as_index=False).first()
Type Date Value
0 1.0 05.12.1998 100.0
1 2.0 06.12.1998 20.0
Just use groupby and first but you need to make sure that your null values are np.nan and not strings like they are in your sample data:
df2.groupby('Type')['Value'].first()
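With the string 'NaN' values converted first (for example with pd.to_numeric, as in the previous answer), this should give something like:
df2['Value'] = pd.to_numeric(df2['Value'], errors='coerce')
df2.groupby('Type')['Value'].first()

Type
1    100.0
2     20.0
Name: Value, dtype: float64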
I have a simple dataframe:
df1 = pd.DataFrame([[19, '19-12','test1'], [20, '20-16','test2'], [21, '21-17','test3']], columns = ['de_id', 'sh_id', 'token'])
And this function:
def get_lo(token, sh_id, de_id):
    data_ws = dl_data(token, sh_id, de_id)
    return data_ws
It returns a list of dicts (the number of dicts can vary), always with the same structure and the same keys; only the data changes. Example:
[{'id':'9','public_id':'00009','name':'John'},{'id':'10','public_id':'00010','name':'Doe'}]
Is it possible to do a df1.apply(lambda x: get_lo(x['token'],x['sh_id'],x['de_id']), axis=1) and create a new dataframe, df2, that would have the apply results merged with df1['de_id']?
Like:
de_id id public_id name
19 9 00009 John
19 9 00010 Doe
20 56 00056 Bob
21 14 00014 Bill
I tried many things like pd.concat, pd.merge and combine, for example:
df2 = pd.DataFrame(columns=['id'])
df2 = df2.merge(df1.apply(lambda x: get_lo(x['token'],x['sh_id'],x['de_id']), axis=1), left_index=True, right_index=True)
or
df2 = pd.DataFrame(columns=['id'])
df2 = pd.merge(df2, df1.apply(lambda x: get_lo(x['token'],x['sh_id'],x['de_id']), axis=1), on='id', how='outer')
I can transform the list of dicts from get_lo() into a pandas Series or even a pd.DataFrame, but I can't figure out the best way to return data for each row applied on df1 so as to build df2.
EDIT - Details about dl_data() for @anky
Data from df1 is passed to get_lo() row by row with apply:
df1.apply(lambda x: get_lo(x['token'],x['sh_id'],x['de_id']),axis=1)
If I try to print the output:
print(df1.apply(lambda x: get_lo(x['token'],x['sh_id'],x['de_id']),axis=1))
I get:
0 [{'id': '9', 'public_id': '00009', 'name': 'Jo...
1 [{'id': '56', 'public_id': '00056', 'name': 'Bo...
2 [{'id': '14', 'public_id': '00014', 'name': 'Bi...
dtype: object
Type is <class 'pandas.core.series.Series'>
EDIT 2
I found a temporary solution to build df2:
def get_lo(df2, token, sh_id, de_id):
    data_ws = dl_data(token, sh_id, de_id)
    for o in data_ws:
        df2.loc[len(df2.index) + 1] = [de_id, o['id'], o['public_id'], o['name']]
df2 = pd.DataFrame(columns = ['de_id','id','public_id','name'])
df1.apply(lambda x: get_lo(df2,x['token'],x['sh_id'],x['de_id']),axis=1)
But I think it is pretty awful and won't perform well if df1 is big.
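A possibly cleaner sketch (untested, and assuming get_lo returns the list of dicts shown above) is to collect the lists into a Series, explode it, build a frame from the dicts and join de_id back on the index:
# one list of dicts per row of df1, keeping df1's index
lists = df1.apply(lambda x: get_lo(x['token'], x['sh_id'], x['de_id']), axis=1)
# one dict per row, with the originating df1 index repeated
exploded = lists.explode()
# turn the dicts into columns, then attach de_id via the shared index
df2 = df1[['de_id']].join(pd.DataFrame(exploded.tolist(), index=exploded.index))
df2 = df2.reset_index(drop=True)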