I am trying to inner join and dedupe two tables, using a more complicated rule for deciding which rows to keep after deduping than 'keep first' or 'keep last'.
Table A contains distinct IDs and Ages.
Table B contains multiple duplicated ID numbers, Ages, and data.
Only one row in Table B is correct so I want to keep only this row. The correct row is the one where the two Ages are most similar, but I also know that the correct Table B Ages are always lower than or equal to Table A Ages.
Table A
|ID |Age|
|----|---|
|1234| 45|
Table B
|ID |Age|data |
|----|---|-----|
|1234| 43|dataX|
|1234| 46|dataY|
|1234| 22|dataZ|
What I want is:
Joined Table
|ID |Age_A|Age_B|data |
|----|-----|-----|-----|
|1234| 45| 43|dataX|
How can I achieve this in Python Pandas?
We can use merge_asof combined with merge:
pd.merge_asof(df1, df2.sort_values(['Age']), on='Age', by='ID').merge(df2[['Age', 'data']], on='data')
Out[686]:
ID Age_x data Age_y
0 1234 45 dataX 43
We can also get rid of the second merge by copying the Age column in df2 beforehand:
df2['Age_B'] = df2.Age
pd.merge_asof(df1, df2.sort_values(['Age']), on='Age', by='ID')
Out[688]:
ID Age data Age_B
0 1234 45 dataX 43
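For reference, a minimal reproducible setup for the example above; df1 holds Table A and df2 holds Table B, matching the names used in the answer:
import pandas as pd

# Table A: one row per ID
df1 = pd.DataFrame({'ID': [1234], 'Age': [45]})

# Table B: duplicated IDs with several candidate rows
df2 = pd.DataFrame({'ID': [1234, 1234, 1234],
                    'Age': [43, 46, 22],
                    'data': ['dataX', 'dataY', 'dataZ']})

df2['Age_B'] = df2.Age
result = pd.merge_asof(df1, df2.sort_values('Age'), on='Age', by='ID')
print(result)
Note that merge_asof's default direction='backward' matches the closest Table B Age that is less than or equal to the Table A Age, which is exactly the "lower than or equal" constraint stated in the question.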
Related
I have the following scenario in a sales dataframe, each row being a distinct sale:
Category  Product  Purchase Value | new_column_A  new_column_B
A         C        30             |
B         B        50             |
C         A        100            |
I've been searching the qcut documentation but can't find anywhere how to add a series of columns based on the following logic:
df['new_column_A'] = when category = A and product = A then pd.qcut(df['Purchase_Value'], q=4)
df['new_column_B'] = when category = A and product = B then pd.qcut(df['Purchase_Value'], q=4)
Preferably, I would like this new column of percentile cuts to be created in the same original dataframe.
The first thing that comes to my mind is to split the dataframe into separate ones by doing the filtering I need, but I would like to keep all these columns in the original dataframe.
Does anyone know if this is possible and how I can do it?
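One way to express that logic without splitting the dataframe is boolean masks plus .loc assignment. A minimal sketch, assuming the columns are literally named Category, Product and Purchase_Value as in the pseudocode above:
import pandas as pd

# Small stand-in for the sales dataframe; the column names are assumptions from the pseudocode
df = pd.DataFrame({'Category':       ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B'],
                   'Product':        ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
                   'Purchase_Value': [30, 50, 70, 90, 20, 40, 60, 80, 100]})

mask_a = (df['Category'] == 'A') & (df['Product'] == 'A')
mask_b = (df['Category'] == 'A') & (df['Product'] == 'B')

# qcut is computed only over the filtered rows; rows outside the mask stay NaN
df.loc[mask_a, 'new_column_A'] = pd.qcut(df.loc[mask_a, 'Purchase_Value'], q=4)
df.loc[mask_b, 'new_column_B'] = pd.qcut(df.loc[mask_b, 'Purchase_Value'], q=4)
Rows where the condition does not hold are left as NaN in the new columns, so everything stays in the original dataframe.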
How to merge multiple rows into a single cell based on id using PySpark? I have a dataframe with ids and products. First I want to merge the products with the same id together into a list, then I want to count the number of occurrences of each unique list.
Input example 1:
id,product
1,HOME
1,mobile
2,watch
2,mobile
3,HOME
3,mobile
4,cd
4,music
4,video
Output:
product,count
HOME-mobile,2
mobile-watch,1
cd-music-video,1
Example 2 with sql code:
Input example:
cloths,brad
cloths,edu
cloths,keith
cloths,stef
enter,andr
enter,char
enter,danny
enter,lucas
Code:
SELECT
    SS.SEC_NAME,
    STUFF((SELECT '- ' + US.USR_NAME
           FROM USRS US
           WHERE US.SEC_ID = SS.SEC_ID
           ORDER BY USR_NAME
           FOR XML PATH('')), 1, 1, '') [SECTORS/USERS]
FROM SALES_SECTORS SS
GROUP BY SS.SEC_ID, SS.SEC_NAME
ORDER BY 1
Output:
cloths,brad-edu-keith-stef
enter,andr-char-danny-lucas
In this example the output does not have the count, but it should be included.
I would like to solve this in PySpark instead of sql/pig.
You can do this in PySpark by using groupby. First group on the id column and merge the products together into a single, sorted list. To get the count of the number of such lists, use groupby again and aggregate by count.
from pyspark.sql import functions as F
df2 = (df
       .groupby("id")
       .agg(F.concat_ws("-", F.sort_array(F.collect_list("product"))).alias("products"))
       .groupby("products")
       .agg(F.count("id").alias("count")))
This should give you a dataframe like this:
+--------------+-----+
| products|count|
+--------------+-----+
| HOME-mobile| 2|
| mobile-watch| 1|
|cd-music-video| 1|
+--------------+-----+
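For completeness, the example input can be built like this, assuming an active SparkSession named spark. Sorting with sort_array is what makes id 1 and id 3 produce the same "HOME-mobile" key regardless of the order in which the rows are collected:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Example input from the question
df = spark.createDataFrame(
    [(1, "HOME"), (1, "mobile"), (2, "watch"), (2, "mobile"),
     (3, "HOME"), (3, "mobile"), (4, "cd"), (4, "music"), (4, "video")],
    ["id", "product"])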
I have a dataset like this:
Policy | Customer | Employee | CoverageDate | LapseDate
123 | 1234 | 1234 | 2011-06-01 | 2015-12-31
124 | 1234 | 1234 | 2016-01-01 | ?
125 | 1234 | 1234 | 2011-06-01 | 2012-01-01
124 | 5678 | 5555 | 2014-01-01 | ?
I'm trying to iterate through each policy for each employee of each customer (a customer can have many employees, an employee can have multiple policies) and compare the covered date against the lapse date for a particular employee. If the covered date and lapse date are within 5 days, I'd like to add that policy to a results list.
So, expected output would be:
Policy | Customer | Employee
123 | 1234 | 1234
because policy 123's lapse date was within 5 days of policy 124's covered date.
I'm running into a problem while trying to iterate through each grouping of Customer/Employee numbers. I'm able to identify how many rows of data are in each EmployeeID/Customer number (EBCN below) group, but I need to reference specific data within those rows to assign variables for comparison.
So far, I've been able to write this code:
import pandas
import datetime
wd = pandas.read_csv(DATASOURCE)
l = 0
for row, i in wd.groupby(['EMPID', 'EBCN']).size().iteritems():
    Covdt = pandas.to_datetime(wd.loc[l, 'CoverageEffDate'])
    for each in range(i):
        LapseDt = wd.loc[l, 'LapseDate']
        if LapseDt != '?':
            LapseDt = pandas.to_datetime(LapseDt) + datetime.timedelta(days=5)
            if Covdt < LapseDt:
                print('got one!')
        l = l + 1
This code is not working because I'm trying to reference the coverage date/lapse dates on a particular row with the loc function, with my row number stored in the 'l' variable. I initially thought that Pandas would iterate through groups in the order they appear in my dataset, so that I could simply start with l=0 (i.e. the first row in the data), assign the coverage date and lapse date variables based on that, and then move on, but it appears that Pandas starts iterating through groups randomly. As a result, I do indeed get a comparison of lapse/coverage dates, but they're not associated with the groups that end up getting output by the code.
The best solution I can figure is to determine what the row number is for the first row of each group and then iterate forward by the number of rows in that group.
I've read through a question regarding finding the first row of a group, and am able to do so by using
wd.groupby(['EMPID','EBCN']).first()
but I haven't been able to figure out what row number the results are stored on in a way that I can reference with the loc function. Is there a way to store the row number for the first row of a group in a variable or something so I can iterate my coverage date and lapse date comparison forward from there?
Regarding my general method, I've read through the question here, which is very close to what I need:
pandas computation in each group
however, I need to compare each policy in the group against each other policy in the group - the question above just compares the last row in each group against the others.
Is there a way to do what I'm attempting in Pandas/Python?
For anyone needing this information in the future - I was able to implement Boud's suggestion to use the pandas.merge_asof() function to replace my code above. I had to do some data manipulation to get the desired result:
1. Splitting the dataframe into two separate frames - one with CoverageDate and one with LapseDate.
2. Replacing the '?' (null values) in my data with a numpy.nan datatype.
3. Sorting the left and right dataframes by the Date columns.
Once the data was in the correct format, I implemented the merge:
pandas.merge_asof(cov, term,
                  on='Date',
                  by='EMP|EBCN',
                  tolerance=pandas.Timedelta('5 days'))
Note 'cov' is my dataframe containing coverage dates, term is the dataframe with lapses. The 'EMP|EBCN' column is a concatenated column of the employee ID and Customer # fields, to allow easy use of the 'by' field.
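A rough end-to-end sketch of that preparation, starting from the wd frame read in the question's code (the column names EMPID, EBCN, Policy, CoverageEffDate and LapseDate are taken from the question; treat them, and DATASOURCE, as placeholders):
import numpy
import pandas

wd = pandas.read_csv(DATASOURCE)

# Replace the '?' placeholders with real missing values
wd['LapseDate'] = wd['LapseDate'].replace('?', numpy.nan)

# Concatenated key so merge_asof's `by` argument can use a single column
wd['EMP|EBCN'] = wd['EMPID'].astype(str) + '|' + wd['EBCN'].astype(str)

# Split into a coverage-date frame and a lapse-date frame, both keyed on a common 'Date' column
cov = (wd[['EMP|EBCN', 'Policy', 'CoverageEffDate']]
       .rename(columns={'CoverageEffDate': 'Date'})
       .assign(Date=lambda d: pandas.to_datetime(d['Date'])))
term = (wd[['EMP|EBCN', 'Policy', 'LapseDate']]
        .rename(columns={'LapseDate': 'Date'})
        .dropna(subset=['Date'])
        .assign(Date=lambda d: pandas.to_datetime(d['Date'])))

# Both sides must be sorted on the asof key
cov = cov.sort_values('Date')
term = term.sort_values('Date')

# Rows with a non-null Policy_y are lapsed policies within 5 days before a coverage date
matches = pandas.merge_asof(cov, term,
                            on='Date',
                            by='EMP|EBCN',
                            tolerance=pandas.Timedelta('5 days'))
Because merge_asof's default direction is 'backward', each coverage date is matched against the closest lapse date at or before it, and the tolerance limits that gap to 5 days, which is what the example output requires.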
Thanks so much to Boud for sending me down the correct path!
I have a MySQL query that is doing a groupby and returning data in the following form:
ID | Boolean | Count
Sometimes there isn't data in the table for one of the boolean states, so data for a single ID might be returned like this:
1234 | 0 | 10
However I need it in this form for downstream analysis:
1234 | 0 | 10
1234 | 1 | 0
with an index on [ID, Boolean].
From querying Google and SO, it seems like getting MySQL to do this transform is a bit of a pain. Is there a simple way to do this in Pandas? I haven't been able to find anything useful in the docs or the Pandas cookbook.
You can assume that I've already loaded the data into a Pandas dataframe with no indexes.
Thanks.
I would set the index of your dataframe to the ID and Boolean columns, and then construct a new index from the Cartesian product of the unique values.
That would look like this:
import pandas
indexcols = ['ID', 'Boolean']
data = pandas.read_sql_query(querytext, engine)  # query text first, then the connection/engine
full_index = pandas.MultiIndex.from_product(
[data['ID'].unique(), [0, 1]],
names=indexcols
)
data = (
data.set_index(indexcols)
.reindex(full_index)
.fillna(0)
.reset_index()
)
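As a quick check with the single-ID example from the question, building the frame by hand instead of reading it from MySQL:
import pandas

indexcols = ['ID', 'Boolean']
data = pandas.DataFrame({'ID': [1234], 'Boolean': [0], 'Count': [10]})

full_index = pandas.MultiIndex.from_product(
    [data['ID'].unique(), [0, 1]],
    names=indexcols
)

# The reindexed row for Boolean == 1 comes back with Count filled to 0
# (Count becomes float because reindexing introduces NaN before fillna)
data = (
    data.set_index(indexcols)
        .reindex(full_index)
        .fillna(0)
        .reset_index()
)
print(data)
If you need integer counts downstream, an .astype({'Count': int}) after the fillna restores the dtype.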
I would like to know how to efficiently plot groups of multiple columns in a pandas dataframe.
I have the following dataframe
| a | b | c |...|trial1.1|trial1.2|...|trial1.12|trial2.1|...|trial2.12|trial3.1|...|trial3.12|
GlobalID|
sd12f |...|...|...|...| 210.1 | 213.1 |...| 170.1 | 176.2 |...| 160.31 | 162.4 |...| 186.1 |
...
I would like to loop through the rows and for each row plot three waveforms: trial1.[1-12], trial2.[1-12], trial3.[1-12]. What is the most efficient way to do this? Right now I have:
t1 = df.iloc[0][df.columns[[colname.startswith('trial1') for colname in df]]]
t2 = df.iloc[0][df.columns[[colname.startswith('trial2') for colname in df]]]
t3 = df.iloc[0][df.columns[[colname.startswith('trial3') for colname in df]]]
t1.astype(float).plot()
t2.astype(float).plot()
t3.astype(float).plot()
I need the .astype(float) because the values are originally strings. Is there some more efficient way of doing this I am missing? I am new to python and pandas.
How about first transposing the dataframe, then splitting it by trial, then plotting?
# Transpose so each trial sample becomes a row
data = pd.read_csv("data.txt").T
# Insert your code to remove irrelevant rows, like a, b, c in your example,
# and convert the remaining values to float if they are read in as strings
#
# Group by the trial number (the first six characters of the row label) and plot
data.groupby(lambda x: x[:6], axis=0).plot()
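Alternatively, if you prefer to stay with one row at a time as in the question, the trial columns can be selected with filter(like=...) instead of building boolean lists by hand. A sketch under the same 'trialN.M' column-naming assumption, with a small stand-in dataframe:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Small stand-in for the dataframe in the question: 3 trials x 12 samples, stored as strings
cols = [f'trial{t}.{s}' for t in (1, 2, 3) for s in range(1, 13)]
df = pd.DataFrame([np.random.rand(36).astype(str)], index=['sd12f'], columns=cols)
df.index.name = 'GlobalID'

row = df.iloc[0]  # one GlobalID at a time
fig, ax = plt.subplots()
for trial in ('trial1', 'trial2', 'trial3'):
    # filter(like=...) keeps only the labels containing that prefix;
    # reset_index(drop=True) puts all three trials on the same 0-11 x-axis
    row.filter(like=trial).astype(float).reset_index(drop=True).plot(ax=ax, label=trial)
ax.legend()
plt.show()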