I have a large (3MM record) file.
The file contains four columns: [id, startdate, enddate, status]. There will be multiple status changes for each id; my goal is to transpose this data and end up with a wide dataframe with the following columns:
[id, status1, status2, status3... statusN]
where each row will hold the id and, under each status column, the startdate of that status.
An example of a row would be:
["xyz", '2020-08-24 23:42:54', '(blank)', '2020-08-26 21:23:45'...(startdate value for status N)]
I have written a script that does the following: it iterates through all the rows of the original dataframe and stores each status in a set; that way there are no duplicates and I can get a complete list of all the statuses.
df = pd.read_csv('statusdata.csv')
columns = set()
columns.add('id')
for index, row in df.iterrows():
    columns.add(row['status'])
Then I create a new dataframe with the column 'id' plus all the other statuses taken from the set:
columnslist = list(columns)
newdf = pd.DataFrame(columns = columnslist)
newdf = newdf[['id']+[c for c in newdf if c not in ['id']]] #this will make 'id' the first column
Then I iterate through all the rows of the original dataframe, create a new record in the new dataframe if the id being read is not already there, and log the startdate of the status indicated in the original df in its matching column of the new df.
for index, row in df.iterrows():
    if row['opportunityid'] not in newdf['id']:
        newdf.loc[len(newdf), 'id'] = row['opportunityid']
    newdf.loc[newdf['id'] == row['opportunityid'], row['status']] = row['startdate']
My concern is the speed of the code. At this rate it will take 13+ hours to go through all the lines of the original dataframe and transpose it into this new dataframe with unique keys. Is there a way to make this more efficient? Is there a way to allocate more memory on my computer? Or is there a way to deploy this code on AWS or another cloud computing service to make it run faster? I'm currently running this on a 2020 13-inch MacBook Pro with 32 GB of RAM.
Thanks!
IIUC, you could do this without iterating. First, create sample data:
from io import StringIO
import pandas as pd
data = '''id, start, end, status
A, 1, 10, X
A, 2, 20, Y
A, 3, 30, Z
A, 9, 99, Z
B, 4, 40, W
B, 5, 50, X
B, 6, 60, Y
'''
df = pd.read_csv(StringIO(data), sep=', ', engine='python')
print(df)
id start end status
0 A 1 10 X
1 A 2 20 Y
2 A 3 30 Z
3 A 9 99 Z # <- same id + status as previous row
4 B 4 40 W
5 B 5 50 X
6 B 6 60 Y
Second, select the columns of interest (everything but end); group by (id, status) and keep the latest start; set id and status as row labels; squeeze() to ensure the object is converted to a pandas Series; and finally put status as column labels:
t = (df[['id', 'start', 'status']]
     .groupby(['id', 'status'], as_index=False)['start'].max()  # <- new
     .set_index(['id', 'status'], verify_integrity=True)
     .sort_index()
     .squeeze()
     .unstack(level='status')
     )
print(t)
status W X Y Z
id
A NaN 1.0 2.0 9.0
B 4.0 5.0 6.0 NaN
The NaN values show what happens when there is not 100 percent overlap in status.
UPDATE
I added a row of data to create a duplicate (id, status) pair, and added the groupby() step to pull out the latest start for each (id, status) pair.
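Applied to the asker's actual column names (assuming the CSV really has columns called id, startdate and status, as written in the question), the same idea can also be expressed with pivot_table; this is only a sketch, not tested against the real 3MM-row file:
import pandas as pd
df = pd.read_csv('statusdata.csv')  # column names assumed from the question
df['startdate'] = pd.to_datetime(df['startdate'])
wide = (df.pivot_table(index='id', columns='status',
                       values='startdate', aggfunc='max')  # latest startdate per (id, status)
          .reset_index())
Either way, this avoids iterrows() entirely and should take seconds rather than hours on 3MM rows.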
Related
I am trying to compare 2 files: one is in xlsx format and the other is in csv format.
File1.xlsx (not actual data)
Title Flag Price total ...more columns
0 A Y 12 300
1 B N 15 700
2 C N 18 1000
..
..
more rows
File2.csv (not actual data)
Title Flag Price total ...more columns
0 E Y 7 234
1 B N 16 600
2 A Y 12 300
3 C N 17 1000
..
..
more rows
I used pandas and loaded those files into dataframes. There are no unique columns (to use as an id) in the files, and there are 700K records to compare. I need to compare File 1 with File 2 and show the differences. I have tried a few things but I am not getting the outliers as expected.
If I use the merge function as below, I get output with values only from File 1.
diff_df = df1.merge(df2, how = 'outer' ,indicator=True).query('_merge == "left_only"').drop(columns='_merge')
output I am getting
Title Attention_Needed Price total
1 B N 15 700
2 C N 18 1000
This output is not showing the correct diff, as the record with Title 'E' is missing.
I also tried using pandas merge:
diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')
and the output for the above was:
Title Flag Price total Exist
0 A Y 12 300 both
1 B N 15 700 left_only
2 C N 18 1000 left_only
3 E Y 7 234 right_only
4 B N 16 600 right_only
5 C N 17 1000 right_only
The problem with the above output is that it shows records from both dataframes, which will be very difficult to work with if there are 1000s of records in each dataframe.
The output I am looking for (for the differences) adds an extra column ("Comments") with a message such as matching, the exact difference, new record, etc., or something along those lines:
Title Flag Price total Comments
0 A Y 12 300 matching
1 B N 15 700 Price, total different
2 C N 18 1000 Price different
3 E Y 7 234 New record
If the above output is not possible, please suggest any other way to solve this.
PS: This is my first question here, so please let me know if you need more details.
Rows in DF1 Which Are Not Available in DF2
df = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='left_only']
Rows in DF2 Which Are Not Available in DF1
df = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='right_only']
If you're differentiating by row not column
pd.concat([df1,df2]).drop_duplicates(keep=False)
If each df has the same columns and each column should be compared individually
for col in df1.columns:
    print(col, set(df1[col]).symmetric_difference(df2[col]))
# WARNING: this way of getting column diffs likely won't keep row order
# new row order will be [unique_elements_from_df1_REVERSED] concat [unique_elements_from_df2_REVERSED]
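Building on the indicator approach above, if you do want the "Comments" column from the question, here is a rough sketch of one way to build it. It assumes (which the question says is not guaranteed) that Title can serve as the join key, and that the compared columns are Flag, Price and total:
cmp = df1.merge(df2, on='Title', how='outer', suffixes=('_1', '_2'), indicator=True)

def comment(row):
    if row['_merge'] == 'right_only':   # present only in File2
        return 'New record'
    if row['_merge'] == 'left_only':    # present only in File1
        return 'Missing from File2'
    diffs = [c for c in ('Flag', 'Price', 'total') if row[f'{c}_1'] != row[f'{c}_2']]
    return 'matching' if not diffs else ', '.join(diffs) + ' different'

cmp['Comments'] = cmp.apply(comment, axis=1)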
Let's assume df1 (left) is our "source of truth" for what's considered an original record.
after running
diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')
take the output and split it into 2 df's.
df1 = diff_df[diff_df["Exist"].isin(["both", "left_only"])]
df2 = diff_df[diff_df["Exist"] == "right_only"]
Right now, if you drop the "Exist" column from df1, you'll have the records whose comment would be "matching".
Let's assume you add the 'comments' column to df1.
You could say that everything in df2 is a new record, but that would disregard the "Price/total different" cases.
If you really want the difference comment, there is now a tricky bit: the 'how' really depends on which columns matter most (Title > Flag > ...) and how much they matter (a weighting system).
After you have a weighting system determined, you need a 'scoring' method that compares two rows to see how similar they are based on the column ranking you chose.
# distributes weight so first is heaviest, last is lightest, total weight ~= 100
# if i was good i'd do this with numpy not manually
def getWeights(l):
    weights = [0 for col in l]
    total = 100
    while total > 0:
        for i, e in enumerate(l):
            for j in range(i + 1):
                weights[j] += 1
                total -= 1
    return weights
def scoreRows(row1, row2):
    s = 0
    for i, colName in enumerate(colRank):
        if row1[colName] == row2[colName]:
            s += weights[i]
    return s
colRank = ['title', 'flag']
weights = getWeights(colRank)
Let's say only these 2 matter and the rest are considered 'modifications' to an original row
That is to say, if a row in df2 doesn't have a matching title OR flag for ANY row in df1, that row is a new record
What makes a row a new record is completely up to you.
Another way of thinking about it: you need to determine what makes a row in df2 a 'differing' version of one particular row in df1 rather than of a different row in df1.
if you have 2 rows in df1
row1: [1, 2, 3, 4]
row2: [1, 6, 3, 7]
and you want to compare this row against that df
[1, 6, 5, 4]
This row has the same first element as both, the same second element as row2, and the same fourth element as row1.
so which row does it differ from?
If this is a question you aren't sure how to answer, consider cutting your losses and just keeping df1 as "good" records and df2 as "new" records.
If you're sticking with the 'differs' comment, the next step is to separate truly new records from records that have slight differences by building a score table.
# to recap
# df1 has "both" and "left_only" records ("matching" comment)
# df2 has "right_only" records (new records and differing records)
rowScores = []
# one entry per row of df2:
# either an empty list (no match found) or a tuple whose
# first element is the actual row from df1 that matched best and whose
# second element is the score for that match (out of ~100)
for i, row1 in df2.iterrows():
    thisRowsScores = []
    # df2 first because its rows are what we are scoring
    for j, row2 in df1.iterrows():
        s = scoreRows(row1, row2)
        if s > 0:  # only save rows and scores that matter
            thisRowsScores.append((row2, s))
    # at this point, you can either keep the scoring as a table and have comments
    # refer back to how each difference relates to some row,
    # or just keep the best score like i'll be doing
    # sort by score, descending; list.sort() sorts in place
    thisRowsScores.sort(key=lambda x: x[1], reverse=True)
    # append the best match, or an empty list if no good match was found in df1
    rowScores.append(thisRowsScores[0] if thisRowsScores else [])
    # alternatively, remove 'reverse=True' above and index at -1
The reason we save the row itself is so that it can be looked up in df1 in order to add a "differs" comment.
At this point, let's just say that df1 already has the comment "matching" added to it.
Now that each row in df2 has a score and a reference to the row it matched best in df1, we can edit the comment on that row in df1 to list the columns with different values.
But at this point, I feel as though that df also needs a reference back to df2, so that the record and values those differences refer to are actually retrievable.
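To make that last step concrete, here is a rough sketch (continuing with the names used above; the list of compared columns is an assumption) of how rowScores could be turned into comments on df2:
compared_cols = ['Title', 'Flag', 'Price', 'total']  # assumed column names
comments = []
for pos, (idx, row) in enumerate(df2.iterrows()):
    match = rowScores[pos]
    if not match:  # empty list -> nothing in df1 scored above 0
        comments.append('new record')
        continue
    best_row, best_score = match
    diffs = [c for c in compared_cols if row[c] != best_row[c]]
    comments.append('matching' if not diffs else ', '.join(diffs) + ' different')
df2 = df2.assign(comments=comments)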
I'm currently working with pandas in Python.
I've got a dataset of customers (user_id in column 1) and of the items they bought (column 2).
Example dataset:
ID_user  ID_item
0        1
0        2
0        3
1        2
2        1
3        3
...      ...
Now I want to focus only on customers who have bought more than 10 items. How can I create a new dataframe with pandas and drop all the other customers with 10 or fewer items bought?
Thank you very much!
You could first group your dataframe by the column "ID_user" and count with the .count() method. Afterwards, keep only those values that are bigger than 10 using a lambda function.
# Group by column ID_user and the method .count()
df = df.groupby('ID_user').count()
# Only show values for which the lambda function evaluates to True
df = df[lambda row: row["ID_item"] > 10]
Or just do it in one line:
df = df.groupby('ID_user').count()[lambda row: row["ID_item"] > 10]
You can try groupby with transform, then filter on it:
n = 10
cond = df.groupby('ID_user')['ID_item'].transform('count')
out = df[cond > n].copy()
A simple groupby + filter will do the job:
>>> df.groupby('ID_user').filter(lambda g: len(g) > 10)
Empty DataFrame
Columns: [ID_user, ID_item]
Index: []
Now, in your example, there aren't actually any groups that do have more than 10 items, so it's showing an empty dataframe here. But in your real data, this would work.
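As a quick sanity check with made-up data in which one user does cross the threshold (user 0 buys 12 items, user 1 buys 2):
import pandas as pd
toy = pd.DataFrame({'ID_user': [0] * 12 + [1] * 2,
                    'ID_item': list(range(12)) + [100, 101]})
print(toy.groupby('ID_user').filter(lambda g: len(g) > 10))  # keeps only user 0's 12 rows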
I have a csv file which is continuously updated with new data. The data has 5 columns; however, lately the data has been changed to 4 columns. The column that is not present in the later data is the first one. When I try to read this csv file into a dataframe, half of the data ends up under the wrong columns. The data is around 50k entries.
df
################################
0 time C-3 C-4 C-5
______________________________
0 1 str 4 5 <- old entries
0 1 str 4 5
1 str 4 5 Nan <- new entries
1 str 4 5 Nan
1 str 4 5 Nan
#################################
The first column in the earlier entries (where value = 0) is not important.
My expected output is the dataframe printed with the right values in the right columns. I have no idea how to add the 0 before the str where the 0 is missing, or, conversely, how to remove the 0 (since the 0 looks to be an index counter with values starting at 1, then 2, etc.).
Here is how I load and process the csv file at the moment:
def process_csv(obj):
    data = 'some_data'
    path = 'data/'
    file = f'{obj}_{data}.csv'
    df = pd.read_csv(path + file)
    df = df.rename(columns=df.iloc[0]).drop(df.index[0])
    df = df[df.time != 'time']
    mask = df['time'].str.contains(' ')
    df['time'] = (pd.to_datetime(df.loc[mask, 'time'])
                  .reindex(df.index)
                  .fillna(pd.to_datetime(df.loc[~mask, 'time'], unit='ms')))
    df = df.drop_duplicates(subset=['time'])
    df = df.set_index("time")
    df = df.sort_index()
    return df
Since column time should be of type int, a missing first column causes str values that should end up in C-3 to end up in the time column, which causes this error:
ValueError: non convertible value str with the unit 'ms'
Question: How can I either remove the early values from column 0 or add some values to the later entries?
If one CSV file contains multiple formats, and these CSV files cannot be changed, then you could parse the files before creating data frames.
For example, the test data has both 3- and 4-field records. The function gen_read_lines() always returns 3 fields per record.
from io import StringIO
import csv
import pandas as pd
data = '''1,10,11
2,20,21
3,-1,30,31
4,-2,40,41
'''
def gen_read_lines(filename):
    lines = csv.reader(StringIO(filename))
    for record in lines:
        if len(record) == 3:    # return all fields
            yield record
        elif len(record) == 4:  # drop field 1
            yield [record[i] for i in [0, 2, 3]]
        else:
            raise ValueError(f'did not expect {len(record)} fields')
records = (record for record in gen_read_lines(data))
df = pd.DataFrame(records, columns=['A', 'B', 'C'])
print(df)
A B C
0 1 10 11
1 2 20 21
2 3 30 31
3 4 40 41
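For the asker's real file the same generator idea can read straight from disk. This sketch assumes (based on the question) that old rows have 5 fields whose leading counter should be dropped and that new rows have 4 fields; the path and column names are placeholders:
import csv
import pandas as pd

def gen_read_file(path):
    with open(path, newline='') as fh:
        for record in csv.reader(fh):
            if len(record) == 4:        # new-style row: keep as is
                yield record
            elif len(record) == 5:      # old-style row: drop the leading counter
                yield record[1:]
            else:
                raise ValueError(f'did not expect {len(record)} fields')

df = pd.DataFrame(gen_read_file('data/obj_some_data.csv'),
                  columns=['time', 'C-3', 'C-4', 'C-5'])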
I have a dataframe that has the user id in one column and a string consisting of comma-separated values of item ids for the items he possesses in the second column. I have to convert this into a resulting dataframe that has user ids as indices, and unique item ids as columns, with value 1 when that user has the item, and 0 when the user does not have the item. Attached below is the gist of the problem and the approach I am currently using to solve this problem.
temp = pd.DataFrame([[100, '10, 20, 30'],[200, '20, 30, 40']], columns=['userid','listofitemids'])
print(temp)
temp.listofitemids = temp.listofitemids.apply(lambda x:set(x.split(', ')))
dat = temp.values
df = pd.DataFrame(data = [[1]*len(dat[0][1])], index = [dat[0][0]], columns=dat[0][1])
for i in range(1, len(dat)):
    t = pd.DataFrame(data=[[1] * len(dat[i][1])], index=[dat[i][0]], columns=dat[i][1])
    df = df.append(t, sort=False)
df.head()
However, this code is clearly inefficient, and I am looking for a faster solution to this problem.
Let us try str.split with explode then crosstab
s = temp.assign(listofitemids=temp['listofitemids'].str.split(', ')).explode('listofitemids')
s = pd.crosstab(s['userid'], s['listofitemids']).mask(lambda x : x.eq(0))
s
Out[266]:
listofitemids 10 20 30 40
userid
100 1.0 1 1 NaN
200 NaN 1 1 1.0
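If the 0/1 matrix described in the question is preferred over NaN, a small variation is to skip the mask step and cap the counts at 1 (the cap only matters if a user could list the same item twice):
onehot = pd.crosstab(s['userid'], s['listofitemids']).clip(upper=1)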
I have two pandas dataframes. One contains the actual data and the second contains the row indices which I need to replace with some value.
Df1 : Input record
A B record_id record_type
0 12342345 10 011 H
1 65767454 20 012 I
2 78545343 30 013 I
3 43455467 40 014 I
Df2: information about which row indices need to change (e.g. here to #)
Column1 Column2 Column3 record_id
0 1 2 4 011
1 1 2 None 012
2 1 2 4 013
3 1 2 None 014
Output Result:
A B record_id record_type
0 # # 011 #
1 # # 012 I
2 # # 013 #
3 # # 014 I
So based on a record_id lookup, I want to change the corresponding row index values.
Here, (1 2 4 011) present in Df2 contains information saying we want to modify row index first, second and fourth for the particular record from Df1 whose id is 011.
So in the output result we replace the row values for record id 011 at row index 1, 2, 4 and populate the value as #.
Please suggest any approach to do the same in pandas.
First, you can do some preprocessing to make life easier. Set the index to be record_id and then rename Column3 from df2 to record_type. Now the dataframes have identical index and column names, which makes for easy automatic alignment.
df1 = df1.set_index('record_id')
df2 = df2.set_index('record_id')
df2 = df2.rename(columns={'Column3':'record_type'})
df2 = df2.replace('None', np.nan)
Then we can fill in the missing values of df2 with df1 and then make all the originally non-missing values '#'.
df2.fillna(df1).where(df2.isnull()).fillna('#')
Column1 Column2 record_type
record_id
11 # # #
12 # # I
13 # # #
14 # # I
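To line the result up with the expected output from the question (assuming Column1 and Column2 in Df2 correspond to A and B in Df1), the remaining columns can be renamed and the index reset:
result = (df2.fillna(df1)
             .where(df2.isnull())
             .fillna('#')
             .rename(columns={'Column1': 'A', 'Column2': 'B'})
             .reset_index())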
Here, (1 2 4 011) present in Df2 contains information saying we want to modify row index first, second and fourth for the particular record from Df1 whose id is 011.
This makes no sense to me -- the row with record_id = 011 does not itself have further rows (of which you seem to want to choose the first, second, fourth). Please complete the output values with the exact results you expect.
In any case, I came across the same problem as in the title, and solved it like this:
Assuming you have a DataFrame df and three equally long vectors rsel, csel (for row/column selectors) and val (say, of length N), and would like to do the equivalent of
df.lookup(rsel, csel) = val
Then, the following code will work (at least) for pandas v.0.23 and python 3.6, assuming that rsel does not contain duplicates!
Warning: this is not really suited for large datasets, because it initialises a full square matrix of the dimensions of shape (N, N)!
import pandas as pd
import numpy as np
from functools import reduce
def coalesce(df, ltr=True):
    if not ltr:
        df = df.iloc[:, ::-1]  # flip left to right
    # use iloc as safeguard against non-unique column names
    list_of_series = [df.iloc[:, i] for i in range(len(df.columns))]
    # this is like a SQL coalesce
    return reduce(lambda interm, x: interm.combine_first(x), list_of_series)
# column names generally not unique!
square = pd.DataFrame(np.diag(val), index=rsel, columns=csel)
# np.diag creates 0s everywhere off-diagonal; set them to nan
square = square.where(np.diag([True] * len(rsel)))
# assuming no duplicates in rsel; this is empty
upd = pd.DataFrame(index=rsel, columns=sorted(csel.unique()))
# collapse square into upd
upd = upd.apply(lambda col: coalesce(square[square.columns == col.name]))
# actually update values
df.update(upd)
PS. If you know that you only have strings as column names, then square.filter(regex=col.name) is much faster than square[square.columns == col.name].
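A hypothetical usage example (every name below is invented for illustration); define the inputs first, then run the snippet above:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 2)), index=['r1', 'r2', 'r3'], columns=['c1', 'c2'])
rsel = pd.Series(['r1', 'r3'])   # row selectors, no duplicates
csel = pd.Series(['c2', 'c1'])   # column selectors
val = pd.Series([10.0, 20.0])    # values to write
# ...now run the coalesce / square / upd / df.update lines from above...
# afterwards df holds 10.0 at ('r1', 'c2') and 20.0 at ('r3', 'c1'); everything else is unchanged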