I need to create a categorical column indicating whether the client account code is occurring for the first time ("New") or has occurred before ("Existing").
Only the first occurrence should be considered "New"; all later occurrences, irrespective of the gap between them, should be considered "Existing".
I tried looping through the list of unique account codes, filtering the DataFrame for each account code, and finding the minimum date, which would be stored in a separate table. Then, by looking up this table, I would enter the New/Existing tag in the categorical column. I couldn't execute it properly, though.
Is there a simple way to accomplish it?
I have attached the sample file below:
Sample Data
Also, the data has some non-UTF-8 encoded characters which I couldn't handle.
Try:
df.assign(Occurence=np.where(~df['Account Code'].duplicated(),'New','Existing'))
Output:
Created Date Account Code Occurence
0 7-Sep-13 CL000247 New
1 7-Sep-13 CL000012 New
2 7-Sep-13 CL000875 New
3 7-Sep-13 CL000084 New
4 7-Sep-13 CL000186 New
5 7-Sep-13 CL000167 New
6 7-Sep-13 CL000167 Existing
7 7-Sep-13 CL000215 New
8 12-Sep-13 Wan2013001419 New
9 12-Sep-13 CL000097 New
...
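For reference, a small self-contained sketch of the same idea (the column names are assumed to match the sample file):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Created Date': ['7-Sep-13', '7-Sep-13', '12-Sep-13'],
                   'Account Code': ['CL000167', 'CL000215', 'CL000167']})
# duplicated() marks every occurrence after the first as True, regardless of
# the gap between occurrences, so ~duplicated() flags only the first one.
df['Occurence'] = np.where(~df['Account Code'].duplicated(), 'New', 'Existing')
print(df)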
I have a dataframe new_df which has a list of customer IDs, dates, and a customer segment for each day. The customer segment can take multiple values. I am looking to identify a list of customers whose segment has changed more than twice in the past 15 days.
Currently, I am using the following to check how many times each segment appears for each customer id.
segment_count = new_df.groupby(new_df['customer_id'].ne(new_df['customer_id'].shift()).cumsum())['segment'].value_counts()
My thinking is that if a customer has more than two segments with a count greater than 1, then they must have migrated from one segment to another at least twice. Two sample customers might look like this:
|customer_id|day|segment|
|-----------|---|-------|
|12345|'2021-01-01'|G|
|12345|'2021-01-02'|G|
|12345|'2021-01-03'|M|
|12345|'2021-01-04'|G|
|12345|'2021-01-05'|M|
|12345|'2021-01-06'|M|
|6789|'2021-01-01'|G|
|6789|'2021-01-02'|G|
|6789|'2021-01-03'|G|
|6789|'2021-01-04'|G|
|6789|'2021-01-05'|G|
|6789|'2021-01-06'|M|
As an output, I would want to return the following:
|customer_id|migration_count|
|-----------|---------------|
|12345|3|
|6789|1|
Does anyone have advice on the best way to tackle this, or are there any built-in functions I can use to simplify it? Thanks!
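One possible approach (a sketch using the sample columns above; filtering day to the past 15 days would come first) is to compare each row's segment with the customer's previous row:
import pandas as pd

new_df = pd.DataFrame({'customer_id': [12345] * 6 + [6789] * 6,
                       'day': list(pd.date_range('2021-01-01', periods=6)) * 2,
                       'segment': list('GGMGMM') + list('GGGGGM')})
new_df = new_df.sort_values(['customer_id', 'day'])
# A migration is any row whose segment differs from the same customer's previous row.
prev = new_df.groupby('customer_id')['segment'].shift()
changed = new_df['segment'].ne(prev) & prev.notna()
migration_count = (changed.groupby(new_df['customer_id']).sum()
                          .rename('migration_count').reset_index())
print(migration_count)                               # 12345 -> 3, 6789 -> 1
print(migration_count.query('migration_count > 2'))  # customers who migrated more than twice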
I am wondering if it is possible to have multiple index levels, similar to the picture, in which one of them (the second level in my case) counts automatically.
The problem I have is that my data needs to be updated repeatedly, and each entry belongs to either the category "Math" or "English". I would like to keep track of the first entry, second entry, and so on for each category.
The trick is that I would like the second-level index to count automatically within each category, so that every time I add a new entry to the category "Math", for example, the second-level index updates automatically.
Thanks for the help.
You can set_index() using a column and a computed series. In your case cumcount() does what you need.
df = pd.DataFrame({"category": np.random.choice(["English", "Math"], 15), "data": np.random.uniform(2, 5, 15)})
df_sorted = df.sort_values("category")
df2 = df_sorted.set_index(["category", df_sorted.groupby("category").cumcount() + 1])
df2
output
data
category
English 1 2.163213
2 4.292678
3 4.227062
4 3.255596
5 3.376833
6 2.477596
Math 1 3.436956
2 3.275532
3 2.720285
4 2.181704
5 3.667757
6 2.683818
7 2.069882
8 3.155550
9 4.155107
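When new rows arrive, the counter index is simply rebuilt rather than updated in place; a small sketch, reusing df from above:
new_rows = pd.DataFrame({"category": ["Math", "English"], "data": [4.2, 3.1]})
df = pd.concat([df, new_rows], ignore_index=True)
df_sorted = df.sort_values("category", kind="stable")
df2 = df_sorted.set_index(["category", df_sorted.groupby("category").cumcount() + 1])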
I have the dataframe named Tasks, containing a column named UserName. I want to count every occurrence of rows containing the same UserName, thereby finding out how many tasks a user has been assigned. For a better understanding, here's what my dataframe looks like:
In order to achieve this, I used the code below:
Most_Involved = Tasks['UserName'].value_counts()
But this got me a DataFrame like this:
Index Username
John 4
Paul 1
Radu 1
This is not exactly what I am looking for. How should I rewrite the code in order to achieve this:
Most_Involved
Index UserName Tasks
0 John 4
1 Paul 1
2 Radu 1
You can use transform to add a new column to the existing data frame:
df['Tasks'] = df.groupby('UserName')['UserName'].transform('size')
# finally select the columns needed
df = df[['Index','UserName','Tasks']]
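If you would rather have a stand-alone summary frame than an extra column, something like this (a sketch using the names from the question) should match the desired output:
Most_Involved = (Tasks['UserName'].value_counts()
                 .rename_axis('UserName')
                 .reset_index(name='Tasks'))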
You can find duplicate rows based on specific columns using pandas:
duplicateRowsDF = dataframe[dataframe.duplicated(['columnName'])]
I'm a beginner at Python and I have a school project where I need to analyze an Excel document with information. It has approximately 7 columns and more than 1000 rows.
There's a column named "Materials" that starts at B13. It contains a code that we use to identify some materials. The material code looks like this -> 3A8356. There are different material codes in the same column, and they repeat a lot. I want to identify them and make a list with only one code of each, no repeating. Is there a way I can analyze the column and extract the codes that repeat, so I can take them and make a new column with only one of each material code?
An example would be:
12 Materials
13 3A8356
14 3A8376
15 3A8356
16 3A8356
17 3A8346
18 3A8346
and transform it into something like this:
1 Materials
2 3A8346
3 3A8356
4 3A8376
Yes.
If df is your dataframe, you only have to do df = df.drop_duplicates(subset=['Materials'], keep='first'), which keeps the first occurrence of each material code and drops the repeats.
To load the dataframe from an excel file, just do:
import pandas as pd
df = pd.read_excel(path_to_file)
The subset argument indicates which column headings you want to look at.
Docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
Per the docs, a new data frame with the duplicates dropped is returned, so you can assign it to any variable you want. If you want to re-index the first column, take a look at:
new_data_frame = new_data_frame.reset_index(drop=True)
Or simply
new_data_frame.reset_index(drop=True, inplace=True)
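Putting it together (a sketch; the codes are sorted to match the example output, and read_excel arguments may need adjusting for your sheet):
import pandas as pd

df = pd.read_excel(path_to_file)  # you may need header=/usecols= depending on the sheet layout
unique_codes = (df.drop_duplicates(subset=['Materials'])
                  .sort_values('Materials')
                  .reset_index(drop=True))
print(unique_codes['Materials'])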
I read pickled data and put it into a dataframe for further processing, and I found an issue when it comes to choosing certain rows that contain a specific string.
Below are the first few lines of the dataframe.
parent_pid \
0 UXXY-C240-M4L
2 UXXZ-B200-M5-U
4 UXXZ-B200-M5-U
6 UXXZ-B200-M5-U
8 UXXZ-B200-M5-U
pid
0 UXXY-F-H19001,UXX-SD480G...
2 UXX-SD-32G-S,UXX-ML-X64G...
4 UXX-SD-32G-S,UXX-SD-32G-...
6 UXX-SD-32G-S,UXX-MR-X32G...
8 UXX-SD-32G-S,UXX-MR-X32G...
When it comes to searching for rows that contain "UXXZ-B200-M5-U", I used the code below.
df.query('parent_pid == "UXXZ-B200-M5-U"')
And below is what it returns.
Empty DataFrame
Columns: [parent_pid, pid]
Index: []
I tried many different ways to search for rows with this string, and they all return the same thing.
Whitespace in the columns doesn't seem to matter.
df[df["parent_pid"].isin(["UXXZ-B200-M5-U"])]
df.filter(like="UXXZ-B200-M5-U").columns
Does anyone know what the issue is here?
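One thing worth checking (a diagnostic sketch, not a confirmed cause): hidden whitespace or non-printing characters in the values can make an exact comparison fail, so it can help to normalise the column before filtering, or to match on a substring:
cleaned = df['parent_pid'].astype(str).str.strip()
print(df[cleaned.eq('UXXZ-B200-M5-U')])
print(df[df['parent_pid'].astype(str).str.contains('UXXZ-B200-M5-U', regex=False)])
Note also that df.filter(like=...) matches index and column labels, not cell values, so it will not find rows this way.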