I have a dataframe with the user id in one column and, in the second column, a string of comma-separated item ids for the items that user possesses. I need to convert this into a dataframe with user ids as the index and the unique item ids as columns, with value 1 when the user has the item and 0 when they do not. Below is the gist of the problem and the approach I am currently using.
temp = pd.DataFrame([[100, '10, 20, 30'],[200, '20, 30, 40']], columns=['userid','listofitemids'])
print(temp)
temp.listofitemids = temp.listofitemids.apply(lambda x:set(x.split(', ')))
dat = temp.values
df = pd.DataFrame(data = [[1]*len(dat[0][1])], index = [dat[0][0]], columns=dat[0][1])
for i in range(1, len(dat)):
    t = pd.DataFrame(data=[[1]*len(dat[i][1])], index=[dat[i][0]], columns=dat[i][1])
    df = df.append(t, sort=False)
df.head()
However, this code is clearly inefficient, and I am looking for a faster solution to this problem.
Let us try str.split with explode, then crosstab:
s = temp.assign(listofitemids=temp['listofitemids'].str.split(', ')).explode('listofitemids')
s = pd.crosstab(s['userid'], s['listofitemids']).mask(lambda x : x.eq(0))
s
Out[266]:
listofitemids 10 20 30 40
userid
100 1.0 1 1 NaN
200 NaN 1 1 1.0
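As a side note, the question asks for literal 0/1 values rather than NaN. A minimal sketch (assuming temp['listofitemids'] still holds the raw comma-separated strings, i.e. before the set conversion in the question's code) could skip the mask step, or use str.get_dummies to build the indicator matrix directly:
e = temp.assign(listofitemids=temp['listofitemids'].str.split(', ')).explode('listofitemids')
pd.crosstab(e['userid'], e['listofitemids'])                          # keeps the 0/1 integers
temp.set_index('userid')['listofitemids'].str.get_dummies(sep=', ')  # indicator matrix in one step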
I am new to core Python. I have working code that I need to convert into a method.
I have around 50k rows of data with 30 columns. Out of the 30 columns, 3 are important for this requirement: Id, Code, and bill_id. I need to populate a new column "multiple_instance" with 0s and 1s, so the final dataframe will contain 50k rows with 31 columns. The 'Code' column contains any number of codes, so I filter for the codes of interest and apply the remaining logic to those.
I need to pass these 3 columns to a method that returns the 0s and 1s.
Note: multiple_instance_codes is a variable which can be changed later.
multiple_instance_codes = ['A','B','C','D']
filt = df['Code'].str.contains('|'.join(multiple_instance_codes), na=False, case=False)
df_mul = df[filt]
df_temp = df_mul.groupby(['Id'])[['Code']].size().reset_index(name='count')
df_mul = df_mul.merge(df_temp, on='Id', how='left')
df_mul['Cumulative_Sum'] = df_mul.groupby(['bill_id'])['count'].apply(lambda x: x.cumsum())
df_mul['multiple_instance'] = np.where(df_mul['Cumulative_Sum'] > 1, 1, 0)
Sample data:
bill_id Id Code Cumulative_Sum multiple_instance
10 1 B 1 0
10 2 A 2 1
10 3 C 3 1
10 4 A 4 1
Never mind, it is completed and working fine:
def multiple_instance(df):
    df_scored = df.copy()
    filt = df_scored['Code'].str.contains('|'.join(multiple_instance_codes), na=False, case=False)
    df1 = df_scored[filt]
    df_temp = df1.groupby(['Id'])[['Code']].size().reset_index(name='count')
    df1 = df1.merge(df_temp, on='Id', how='left')
    df1['Cumulative_Sum'] = df1.groupby(['bill_id'])['count'].apply(lambda x: x.cumsum())
    df_scored = df_scored.merge(df1)
    df_scored['multiple_instance'] = np.where(df_scored['Cumulative_Sum'] > 1, 1, 0)
    return df_scored
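A minimal usage sketch (the sample frame below is hypothetical; it assumes the multiple_instance() method above, pandas/numpy imported, and multiple_instance_codes defined at module level):
import numpy as np
import pandas as pd

multiple_instance_codes = ['A', 'B', 'C', 'D']  # as defined above; adjust as needed

# hypothetical sample frame with the three relevant columns
df = pd.DataFrame({'Id': [1, 2, 3, 4],
                   'bill_id': [10, 10, 10, 10],
                   'Code': ['B', 'A', 'C', 'A']})

scored = multiple_instance(df)
print(scored[['bill_id', 'Id', 'Code', 'Cumulative_Sum', 'multiple_instance']])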
I'm currently working with pandas in Python.
I've got a dataset of customers (user id in column 1) and the items they bought (column 2).
Example dataset:
ID_user  ID_item
0        1
0        2
0        3
1        2
2        1
3        3
...      ...
Now I want to focus only on customers who have bought more than 10 items. How can I create a new dataframe with pandas and drop all the other customers, who bought fewer items?
Thank you very much!
You could first group your dataframe by the column "ID_user" and use the .count() method. Afterwards, keep only those values that are bigger than 10 with a lambda function.
# Group by column ID_user and the method .count()
df = df.groupby('ID_user').count()
# Only show values for which the lambda function evaluates to True
df = df[lambda row: row["ID_item"] > 10]
Or just do it in one line:
df = df.groupby('ID_user').count()[lambda row: row["ID_item"] > 10]
You can try groupby with transform, then filter:
n = 10
cond = df.groupby('ID_user')['ID_item'].transform('count')
out = df[cond > n].copy()
A simple groupby + filter will do the job:
>>> df.groupby('ID_user').filter(lambda g: len(g) > 10)
Empty DataFrame
Columns: [ID_user, ID_item]
Index: []
Now, in your example, there aren't actually any groups that do have more than 10 items, so it's showing an empty dataframe here. But in your real data, this would work.
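To illustrate with hypothetical data where one group does exceed the threshold (user 0 bought 12 items, user 1 bought 3), the same filter keeps only user 0's rows:
import pandas as pd

# hypothetical data: user 0 bought 12 items, user 1 bought 3
df = pd.DataFrame({'ID_user': [0] * 12 + [1] * 3,
                   'ID_item': list(range(12)) + [1, 2, 3]})

kept = df.groupby('ID_user').filter(lambda g: len(g) > 10)
print(kept['ID_user'].unique())  # [0]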
I have a df that contains several IDs. I'm trying to run a regression on the data, and I need to be able to split it by ID to apply the regression to each ID.
Sample DF (this is only a sample; the real data is larger).
I tried to save the IDs in a list like this:
id_list = []
for data in df['id'].unique():
    id_list.append(data)
The list output is [1,2,3]
Then I was trying to use that to sort the DF:
def create_dataframe(df):
    for unique_id in id_list:
        df = df[df['Campaign ID'] == unique_id]
        return df
When I call the function, I only get the result for the first ID in the list; the other two IDs [2, 3] do not return any DF, which means that at some point the loop breaks.
Here is the entire code:
df = pd.read_csv('budget.csv')
id_list = []
for unique_id in df['id'].unique():
    id_list.append(unique_id)

def create_dataframe(df):
    for unique_id in id_list:
        df = df[df['Campaign ID'] == unique_id]
        return df
print(create_dataframe(df))
You can use the snippet df.loc[df['id'] == item] to extract sub-dataframes based on a particular value of a column in the dataframe.
Please refer to the full code below.
import pandas as pd
df_dict = {"id" : [1,1,1,2,2,2,3,3,3],
"value" : [12,13,14,22,23,24,32,33,34]
}
df = pd.DataFrame(df_dict)
print(df)
id_list = []
for data in df['id'].unique():
    id_list.append(data)
print(id_list)

for item in id_list:
    sub_df = df.loc[df['id'] == item]
    print(sub_df)
    print("****")
The following output is generated, giving the sub-dataframes for each distinct id:
id value
0 1 12
1 1 13
2 1 14
3 2 22
4 2 23
5 2 24
6 3 32
7 3 33
8 3 34
[1, 2, 3]
id value
0 1 12
1 1 13
2 1 14
****
id value
3 2 22
4 2 23
5 2 24
****
id value
6 3 32
7 3 33
8 3 34
****
Now, in your code snippet, the issue is that the function create_dataframe() is called only once, and inside the function, after fetching the sub-df for id = 1 on the first loop iteration, you hit the return statement and exit. Hence you only get the sub-df for id = 1.
You seem to be overwriting the df value in the for loop. I would recommend creating the result container outside of the for loop and then adding to it on each iteration instead of overwriting it, as sketched below.
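A minimal sketch of that idea with a hypothetical create_dataframes helper (assuming the column holding the id is named 'Campaign ID', as in the question's filter):
def create_dataframes(df):
    # hypothetical helper: collect one sub-dataframe per unique id instead of returning early
    sub_frames = {}
    for unique_id in df['Campaign ID'].unique():
        sub_frames[unique_id] = df[df['Campaign ID'] == unique_id]
    return sub_frames

# usage: each value is the slice of df for that id
# frames = create_dataframes(df)
# for uid, sub in frames.items():
#     print(uid, len(sub))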
You can use numpy.split:
df = df.sort_values('id').reset_index(drop=True)
np.split(df, df.index[df.id.diff().fillna(0).astype(bool)])
or pandas groupby:
grp = df.groupby('id')
[grp.get_group(g) for g in df.groupby('id').groups]
Although I think you can run the regression directly with pandas groupby, since it can apply any function you want, taking each group as a distinct dataframe; a sketch follows.
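For example, a minimal sketch of a per-id regression via groupby().apply(), using hypothetical columns x and y and a simple numpy least-squares line fit:
import numpy as np
import pandas as pd

# hypothetical data: columns 'id', 'x', 'y'
df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                   'x': [0, 1, 2, 0, 1, 2],
                   'y': [1.0, 3.1, 4.9, 0.5, 0.9, 1.6]})

def fit_line(group):
    # fit y = slope * x + intercept on one id's rows
    slope, intercept = np.polyfit(group['x'], group['y'], deg=1)
    return pd.Series({'slope': slope, 'intercept': intercept})

params = df.groupby('id').apply(fit_line)
print(params)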
I have a large (3MM record) file.
The file contains four columns: [id, startdate, enddate, status]. There will be multiple status changes for each id; my goal is to transpose this data and end up with a wide dataframe with the following columns:
[id, status1, status2, status3... statusN]
where each row is identified by id and the column values are the startdate of the corresponding status.
An example of a row would be:
["xyz", '2020-08-24 23:42:54', '(blank)', '2020-08-26 21:23:45'...(startdate value for status N)]
I have written a script that does the following: iterate through all the rows of the first dataframe and store each status in a set; that way there are no duplicates and I get the full list of statuses.
df = pd.read_csv('statusdata.csv')
columns = set()
columns.add('id')
for index, row in df.iterrows():
    columns.add(row['status'])
Then I create a new dataframe with an 'id' column followed by all the statuses taken from the set:
columnslist = list(columns)
newdf = pd.DataFrame(columns = columnslist)
newdf = newdf[['id']+[c for c in newdf if c not in ['id']]] #this will make 'id' the first column
Then I iterate through all the rows of the original dataframe, create a new record in the new dataframe if the id being read is not already there, and log the startdate of the status from the original df in its matching column in the new df.
for index, row in df.iterrows():
    if row['opportunityid'] not in newdf['id']:
        newdf.loc[len(newdf), 'id'] = row['opportunityid']
    newdf.loc[newdf['id'] == row['opportunityid'], row['status']] = row['startdate']
My concern is with the speed of the code. At this rate it will take 13+ hours to go through all the lines of the original dataframe and transpose it into this new dataframe with unique keys. Is there a way to make this more efficient? Is there a way to allocate more memory from my computer? Or is there a way to deploy this code on AWS or another cloud computing service to make it run faster? I'm currently running this on a 2020 13-inch MacBook Pro with 32 GB of RAM.
Thanks!
IIUC, you could do this without iterating. First, create sample data:
from io import StringIO
import pandas as pd
data = '''id, start, end, status
A, 1, 10, X
A, 2, 20, Y
A, 3, 30, Z
A, 9, 99, Z
B, 4, 40, W
B, 5, 50, X
B, 6, 60, Y
'''
df = pd.read_csv(StringIO(data), sep=', ', engine='python')
print(df)
id start end status
0 A 1 10 X
1 A 2 20 Y
2 A 3 30 Z
3 A 9 99 Z # <- same id + status as previous row
4 B 4 40 W
5 B 5 50 X
6 B 6 60 Y
Second, select the columns of interest (everything but end); set id and start to row labels; squeeze() to ensure the object is converted to a pandas Series; and finally put status as column labels:
t = (df[['id', 'start', 'status']]
.groupby(['id','status'], as_index=False)['start'].max() # <- new
.set_index(['id', 'status'], verify_integrity=True)
.sort_index()
.squeeze()
.unstack(level='status')
)
print(t)
status W X Y Z
id
A NaN 1.0 2.0 9.0
B 4.0 5.0 6.0 NaN
The NaN values show what happens when there is not 100 percent overlap in status.
UPDATE
I added a row of data to create a duplicate (id, status) pair, and added a groupby() step to pull out the latest start for each (id, status) pair.
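An equivalent one-step sketch (same assumptions about the column names, not part of the original answer) uses pivot_table, which handles the duplicate (id, status) pairs via its aggregation function:
# sketch: pivot_table aggregates duplicate (id, status) pairs with 'max' directly
t2 = df.pivot_table(index='id', columns='status', values='start', aggfunc='max')
print(t2)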
I have a problem with adding columns in pandas.
I have a DataFrame whose dimension is n x k. During processing I will need to add columns of dimension m x 1, where m is in [1, n], but I don't know m in advance.
When I try to do it:
df['Name column'] = data
# type(data) = list
result:
AssertionError: Length of values does not match length of index
Can I add columns of a different length?
If you use the accepted answer, you'll lose your column names, as shown in the accepted answer's example and as described in the documentation (emphasis added):
The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.
It looks like column names ('Name column') are meaningful to the Original Poster / Original Question.
To keep the column names, use pandas.concat but don't set ignore_index (the default value of ignore_index is False, so you can omit that argument altogether). Continue to use axis=1:
import pandas
# Note these columns have 3 rows of values:
original = pandas.DataFrame({
'Age':[10, 12, 13],
'Gender':['M','F','F']
})
# Note this column has 4 rows of values:
additional = pandas.DataFrame({
'Name': ['Nate A', 'Jessie A', 'Daniel H', 'John D']
})
new = pandas.concat([original, additional], axis=1)
# Identical:
# new = pandas.concat([original, additional], ignore_index=False, axis=1)
print(new.head())
# Age Gender Name
#0 10 M Nate A
#1 12 F Jessie A
#2 13 F Daniel H
#3 NaN NaN John D
Notice how John D does not have an Age or a Gender.
Use concat and pass axis=1 and ignore_index=True:
In [38]:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':np.arange(5)})
df1 = pd.DataFrame({'b':np.arange(4)})
print(df1)
df
b
0 0
1 1
2 2
3 3
Out[38]:
a
0 0
1 1
2 2
3 3
4 4
In [39]:
pd.concat([df,df1], ignore_index=True, axis=1)
Out[39]:
0 1
0 0 0
1 1 1
2 2 2
3 3 3
4 4 NaN
We can add lists of different sizes to a DataFrame.
Example
a = [0,1,2,3]
b = [0,1,2,3,4,5,6,7,8,9]
c = [0,1]
Find the length of each list:
la,lb,lc = len(a),len(b),len(c)
# now find the max
max_len = max(la,lb,lc)
Resize all lists to the determined max length (b is not resized in this example, since it already has the max length):
if not max_len == la:
    a.extend([''] * (max_len - la))
if not max_len == lb:
    b.extend([''] * (max_len - lb))
if not max_len == lc:
    c.extend([''] * (max_len - lc))
Now all the lists have the same length, so create the dataframe:
pd.DataFrame({'A':a,'B':b,'C':c})
The final output is:
   A  B  C
0  0  0  0
1  1  1  1
2  2  2
3  3  3
4     4
5     5
6     6
7     7
8     8
9     9
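For comparison, a hedged one-liner (assuming the original, unpadded lists) lets pandas do the alignment itself, filling missing positions with NaN instead of empty strings:
import pandas as pd

a, b, c = [0, 1, 2, 3], list(range(10)), [0, 1]   # the original, unpadded lists
# pandas aligns Series of different lengths on the index and fills the gaps with NaN
print(pd.DataFrame({'A': pd.Series(a), 'B': pd.Series(b), 'C': pd.Series(c)}))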
I had the same issue: two different dataframes without a common column. I just needed to put them beside each other in a CSV file.
Merge:
In this case, "merge" does not work; even adding a temporary column to both dfs and then dropping it. Because this method makes both dfs with the same length. Hence, it repeats the rows of the shorter dataframe to match the longer dataframe's length.
Concat:
The idea of The Red Pea didn't work for me. It just appended the shorter df to the longer one (row-wise) while leaving an empty column (NaNs) above the shorter df's column.
Solution: You need to do the following:
df1 = df1.reset_index()
df2 = df2.reset_index()
df = [df1, df2]
df_final = pd.concat(df, axis=1)
df_final.to_csv(filename, index=False)
This way, you'll see your dfs besides each other (column-wise), each of which with its own length.
In case somebody would like to replace a specific column with one of a different size, instead of adding it:
Based on this answer, I use a dict as an intermediate type.
Create Pandas Dataframe with different sized columns
If the column to be inserted is not a list but already a dict, the respective line can be omitted.
def fill_column(dataframe: pd.DataFrame, values: list, column: str):
    dict_from_list = dict(enumerate(values))    # create a dict from the list, keyed by position
    dataframe_as_dict = dataframe.to_dict()     # get the DataFrame as a dict
    dataframe_as_dict[column] = dict_from_list  # assign the specific column
    return pd.DataFrame.from_dict(dataframe_as_dict, orient='index').T  # build a new DataFrame from the dict and return it
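A minimal usage sketch (hypothetical data, assuming pandas is imported as pd and fill_column is defined as above):
import pandas as pd

# hypothetical frame with 3 rows
df = pd.DataFrame({'A': [10, 20, 30]})

# insert column 'B' with 5 values; rows 3 and 4 get NaN in column 'A'
df = fill_column(df, [1, 2, 3, 4, 5], 'B')
print(df)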