I am trying to find the index at which the timedelta value in dataframe1 equals the timedelta value in dataframe2, and then trim the longer dataframe so that both start at the same time:
Dataset1:
TimeStamp Col1 ... Col2500
0 days 10:37:34 346 ... 635
0 days 10:38:34 124 ... 546
0 days 10:39:34 346 ... 745
Dataset2:
TimeStamp Col1 ... Col50
0 days 10:25:20 123 ... 789
0 days 10:25:45 183 ... 787
...
...
0 days 10:37:40 223 ... 789
for i in range(len(df2.index)):
    if str(df1.index[0])[7:12] == str(df2.index[i])[7:12]:
        index_value = i
        break
df2 = df2.drop(df2.index[0:index_value])
The expected output is Dataset2 trimmed so it starts at the same time (to the nearest minute) as Dataset1.
You can use searchsorted to get the position of the first value in df2.index that is not lower than the first value of df1.index, then select that part of df2 by position with iloc:
# both indices must be sorted
df1 = df1.sort_index()
df2 = df2.sort_index()
a = df2.index.searchsorted(df1.index[0])
print (a)
2
df2 = df2.iloc[a:]
print (df2)
Col1 ... Col50
TimeStamp
10:37:40 223 ... 789
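If you need the row nearest in time rather than the first one at or after the start of df1, Index.get_indexer with method='nearest' is an option. A minimal sketch, assuming both indices are sorted and unique:
# position of the df2 timestamp closest to df1's first timestamp
pos = df2.index.get_indexer([df1.index[0]], method='nearest')[0]
df2 = df2.iloc[pos:]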
df1 = pd.DataFrame({'call_sign': ['ASD','BSD','CDSF','GFDFD','FHHH'],'frn':['123','124','','656','']})
df2 = pd.DataFrame({'call_sign': ['ASD','CDSF','BSD','GFDFD','FHHH'],'frn':['1234','','124','','765']})
I need to get a new df like:
df2 = pd.DataFrame({'call_sign': ['ASD','BSD','CDSF','GFDFD','FHHH'],'frn':['123','','124','656','765']})
I need to take frn from df2 if it's missing in df1 and create a new df
Replace empty strings with missing values and use DataFrame.set_index with DataFrame.fillna; because the result should be ordered like df2.call_sign, add DataFrame.reindex:
import numpy as np

df = (df1.set_index('call_sign').replace('', np.nan)
         .fillna(df2.set_index('call_sign').replace('', np.nan))
         .reindex(df2['call_sign']).reset_index())
print(df)
call_sign frn
0 ASD 123
1 CDSF NaN
2 BSD 124
3 GFDFD 656
4 FHHH 765
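Note that set_index('call_sign') and reindex(df2['call_sign']) assume call_sign is unique in each frame; a quick sanity check before relying on that:
# both frames should have unique call_sign values for the alignment to be unambiguous
assert df1['call_sign'].is_unique and df2['call_sign'].is_unique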
If you want to update df2, you can use boolean indexing:
# is frn empty string?
m = df2['frn'].eq('')
# update those rows from the value in df1
df2.loc[m, 'frn'] = df2.loc[m, 'call_sign'].map(df1.set_index('call_sign')['frn'])
Updated df2:
call_sign frn
0 ASD 1234
1 CDSF
2 BSD 124
3 GFDFD 656
4 FHHH 765
Another option is a left merge on call_sign, then where to keep df1's frn unless it is empty:
temp = df1.merge(df2, how='left', on='call_sign')
df1['frn'] = temp.frn_x.where(temp.frn_x != '', temp.frn_y)
call_sign frn
0 ASD 123
1 BSD
2 CDSF 124
3 GFDFD 656
4 FHHH 765
I have a pandas dataframe (this is an example; the actual dataframe is a lot larger):
data = [['345', 1, '2022_Jan'], ['678', 1, '2022_Jan'], ['123', 1, '2022_Feb'], ['123', 1, '2022_Feb'], ['345', 0, '2022_Mar'], ['678', 1, '2022_Mar'], ['901', 0, '2022_Mar'], ['678', 1, '2022_Mar']]
df = pd.DataFrame(data, columns = ['ID', 'Error Count', 'Year_Month'])
The question I want answered is: how many IDs have errors?
I want to get an output that groups by 'Year_Month' and counts 1 for each ID that occurs in each month. In other words, I want to count only 1 for each ID in a single month.
When I group by 'Year_Month' & 'ID': df.groupby(['Year_Month', 'ID']).count()
it gives me the following output (see "My current output" below) with the total Error Count for each ID, but I only want to count each ID once. I also want Year_Month to be ordered chronologically; I'm not sure why it isn't, since my original dataframe is ordered by month in the Year_Month column.
My current output
Desired output
Are these actually duplicate records? Are you sure you don't want to record that user 123 had two errors in February?
If so, drop duplicates first, then groupby and sum over Error Count. The .count() method doesn't do what you think it does:
df.drop_duplicates(["ID", "Year_Month"]) \
.groupby(["Year_Month", "ID"])["Error Count"] \
.sum()
Output:
In [3]: counts = df.drop_duplicates(["ID", "Year_Month"]) \
...: .groupby(["Year_Month", "ID"])["Error Count"] \
...: .sum()
In [4]: counts
Out[4]:
Year_Month ID
2022_Feb 123 1
2022_Jan 345 1
678 1
2022_Mar 345 0
678 1
901 0
Name: Error Count, dtype: int64
As far as sorting, you'd want to convert "Year_Month" to a datetime object, because right now they're just being sorted as strings:
In [5]: "2022_Feb" < "2022_Jan"
Out[5]: True
Here's how you could do that:
In [6]: counts.sort_index(level=0, key=lambda ym: pd.to_datetime(ym, format="%Y_%b"))
Out[6]:
Year_Month ID
2022_Jan 345 1
678 1
2022_Feb 123 1
2022_Mar 345 0
678 1
901 0
Name: Error Count, dtype: int64
Here's one way:
(df
 .groupby(['Year_Month', 'ID'])['Error Count']  # group by the two columns, select Error Count
 .sum()                                         # aggregate the sum of Error Count
 .apply(lambda x: int(bool(x)))                 # convert to boolean and back to int
 .to_frame('Error Count')                       # restore the column name
)
Error Count
Year_Month ID
2022_Feb 123 1
2022_Jan 345 1
678 1
2022_Mar 345 0
678 1
901 0
Here is another way to do it: convert the sum to a boolean with astype(bool), which returns True or False depending on whether the value is non-zero, and then back to 0 or 1 with astype(int):
df.groupby(['Year_Month','ID'])['Error Count'].sum().astype(bool).astype(int)
Year_Month ID
2022_Feb 123 1
2022_Jan 345 1
678 1
2022_Mar 345 0
678 1
901 0
Name: Error Count, dtype: int32
To sort chronologically, assign the outcome to a variable and then apply ddejohn's solution from above:
counts = df.groupby(['Year_Month','ID'])['Error Count'].sum().astype(bool).astype(int)
counts.sort_index(level=0, key=lambda ym: pd.to_datetime(ym, format="%Y_%b")) # ddejohn: https://stackoverflow.com/a/71927886/3494754 answer above
Year_Month ID
2022_Jan 345 1
678 1
2022_Feb 123 1
2022_Mar 345 0
678 1
901 0
Name: Error Count, dtype: int32
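As an aside, if the goal is literally "how many IDs have errors in each month", a direct sketch (separate from the answers above, counting each ID at most once per month):
# distinct IDs with at least one error, per month
ids_with_errors = df[df['Error Count'] > 0].groupby('Year_Month')['ID'].nunique()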
I have a pandas dataframe named 'trdf' with the shape [1 row x 420 columns].
0 1 2 \
0 B0742F7GT8 Stone & Beam Modern Tripod Floor Lamp, 61"H, W... 2018-04-22
3 4 5 6 7 8 9 ... \
0 24-Apr-2018 100.00% 17.06% 0.00% 5 66.67% 8 ...
410 411 412 413 414 415 416 417 418 419
0 56 161 -8 -166.67% 0 1 0.00% 100.00% 8 Planned Replenishment
I want to slice every 20 columns, starting from the last, and append the column values as new rows. Here is my code:
for i in range(420, 20, -20):
    trdf.append(trdf.loc[:, i:i-20])
print(trdf)
However, the dataframe is still the same in shape and values. Where is the error?
I believe you should first create a MultiIndex in the columns and then stack. (Your loop has no effect because DataFrame.append returns a new DataFrame instead of modifying trdf in place, and its result is never assigned.)
df.columns = [df.columns % 20, df.columns // 20]
df = df.stack().reset_index(level=0, drop=True)
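To see what the modulo / floor-division trick does, here is a minimal sketch on a toy 1 x 6 frame with groups of 2 (toy data, not the question's 420 columns):
import pandas as pd

toy = pd.DataFrame([[10, 11, 12, 13, 14, 15]])      # 1 row x 6 columns, columns 0..5
toy.columns = [toy.columns % 2, toy.columns // 2]    # outer level: position in group, inner level: group number
print(toy.stack().reset_index(level=0, drop=True))
#     0   1
# 0  10  11
# 1  12  13
# 2  14  15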
Or use a numpy solution with reshape, but then all the data become strings:
df = pd.DataFrame(df.values.reshape(-1, 20))  # 21 rows of 20 columns
If you want to use your solution, create a list of one-row DataFrames and concat them together:
L = []
for i in range(420, 0, -20):
    # select by position so each slice has exactly 20 columns
    df2 = df.iloc[:, i-20:i]
    # rename so all slices share the same columns
    df2.columns = range(20)
    L.append(df2)
df1 = pd.concat(L)
Also, if you need the expected output ordered from the last columns to the first:
df.columns = [df.columns % 20, 20-df.columns // 20]
df = df.stack().reset_index(level=0, drop=True)
And:
df1 = pd.DataFrame(df.values.reshape(-1, 20)[::-1])
I'm quite new to pandas dataframes, and I'm having some trouble joining two tables.
The first df has just 3 columns:
DF1:
item_id position document_id
336 1 10
337 2 10
338 3 10
1001 1 11
1002 2 11
1003 3 11
38 10 146
And the second has exactly the same two columns (and plenty of others):
DF2:
item_id document_id col1 col2 col3 ...
337 10 ... ... ...
1002 11 ... ... ...
1003 11 ... ... ...
What I need is to perform an operation which, in SQL, would look as follows:
DF1 join DF2 on
DF1.document_id = DF2.document_id
and
DF1.item_id = DF2.item_id
And, as a result, I want to see DF2, complemented with column 'position':
item_id document_id position col1 col2 col3 ...
What is a good way to do this using pandas?
I think you need merge with the default inner join, but it requires that there are no duplicated combinations of values across both columns:
print (df2)
item_id document_id col1 col2 col3
0 337 10 s 4 7
1 1002 11 d 5 8
2 1003 11 f 7 0
df = pd.merge(df1, df2, on=['document_id','item_id'])
print (df)
item_id position document_id col1 col2 col3
0 337 2 10 s 4 7
1 1002 2 11 d 5 8
2 1003 3 11 f 7 0
But if you need the position column in the third position:
df = pd.merge(df2, df1, on=['document_id','item_id'])
cols = df.columns.tolist()
df = df[cols[:2] + cols[-1:] + cols[2:-1]]
print (df)
item_id document_id position col1 col2 col3
0 337 10 2 s 4 7
1 1002 11 2 d 5 8
2 1003 11 3 f 7 0
If you're merging on all common columns as in the OP, you don't even need to pass on=, simply calling merge() will do the job.
merged_df = df1.merge(df2)
The reason is that under the hood, if on= is not passed, pd.Index.intersection is called on the columns to determine the common columns and merge on all of them.
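Just to illustrate what those common columns are in this example (an illustration only, not a call into pandas internals):
common = df1.columns.intersection(df2.columns)
print(list(common))   # the shared keys, i.e. 'item_id' and 'document_id'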
A special thing about merging on common columns is that it doesn't matter which dataframe is on the right or the left; the rows selected are the same, because they are matched on the common columns. The only difference is where the columns end up: the columns of the right dataframe that are not in the left dataframe are added to the right of the left dataframe's columns. So unless the order of the columns matters (which can easily be fixed with column selection or reindex()), it doesn't really matter which dataframe is on the right and which is on the left. In other words,
df12 = df1.merge(df2, on=['document_id','item_id']).sort_index(axis=1)
df21 = df2.merge(df1, on=['document_id','item_id']).sort_index(axis=1)
# df12 and df21 are the same.
df12.equals(df21) # True
This is not true if the columns to be merged on don't have the same name and you have to pass left_on= and right_on= (see #1 in this answer).
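If the column order matters after such a merge, it can be fixed with column selection or reindex(), as mentioned above; a minimal sketch using the column names from this example:
# place 'position' right after the two key columns
order = ['item_id', 'document_id', 'position', 'col1', 'col2', 'col3']
df_ordered = df2.merge(df1).reindex(columns=order)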
I am joining different values from a dataset into one column in a pandas DataFrame; however, there is a lot of duplication. How can I get rid of it without deleting any rows?
example:
newCol
------
123,456,129,123,123
237,438,365,432,438
Using df.newCol.drop_duplicates() removes entire rows, but I want the result to be:
newCol
------
123,456,129
237,438,365,432
...
thank you
You need to first split, apply set, and then join:
df.newCol = df.newCol.apply(lambda x: ','.join(set(str(x).split(','))))
print (df)
newCol
0 129,123,456
1 432,365,438,237
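Note that set does not preserve the original order (hence 129,123,456 above). If the order of first appearance matters, a small variant using dict.fromkeys works, since it drops duplicates while keeping first-seen order:
df.newCol = df.newCol.apply(lambda x: ','.join(dict.fromkeys(str(x).split(','))))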
But you can also apply set earlier, before the values are joined into one column:
print (df)
0 1 2 3 4
0 123 456 129 123 123
1 237 438 365 432 438
df = df.apply(lambda x: ','.join(set(x.astype(str))), axis=1)
print (df)
0 129,123,456
1 432,365,438,237
dtype: object
Or use unique, which preserves the order of first appearance:
df = df.apply(lambda x: ','.join((x.astype(str)).unique()), axis=1)
print (df)
0 123,456,129
1 237,438,365,432
dtype: object