MultiLevel indexing pandas trying to collate two different Series

MultiLevel indexing pandas trying to collate two different Series - python

So currently I am trying to write a function which will take a given document id, and produce a list of other document ids related to this. This is to be achieved by checking this document id against a table with document id's and user id's, each user id then gets checked for what document id's they have accessed.
I have a function which will return a pandas Series for either of these requests, but now I would like to put them together so I can run calculations.
I believe the best way to go about this is to utilise MultiLevel indexing, producing a DataFrame like this:
user_id document_id
user_a doc_a
doc_b
doc_c
user_b doc_d
doc_e
user_c doc_f
doc_g
I am not sure how to go about producing this though. What I can do currently is produce a Series of user_id, I then make a DataFrame with this Series as the first column. I can then produce a second column like so:
df['document_id'] = df['user_id'].apply(lambda x: return_documents(x))
However, all this is doing is producing a series in each cell of the document_id column.
Any help would be appreciated, thanks!

I think what u need is explode:
df['document_id'] = df['user_id'].apply(lambda x: return_documents(x))
df = df.explode('document_id')

Related

How to Index a dataframe based on an applied function? -Pandas

I have a dataframe that I created from a master table in SQL. That new dataframe is then grouped by type as I want to find the outliers for each group in the master table.
The function finds the outliers, showing where in the GroupDF they outliers occur. How do I see this outliers as a part of the original dataframe? Not just volume but also location, SKU, group etc.
dataframe: HOSIERY_df
Code:
##Sku Group Data Frames
grouped_skus = sku_volume.groupby('SKUGROUP')
HOSIERY_df = grouped_skus.get_group('HOSIERY')
hosiery_outliers = find_outliers_IQR(HOSIERY_df['VOLUME'])
hosiery_outliers
#.iloc[[hosiery_outliers]]
#hosiery_outliers
Picture to show code and output:
I know enough that I need to find the rows based on location of the index. Like Vlookup in Excel but i need to do it with in Python. Not sure how to pull only the 5, 6, 7...3888 and 4482nd place in the HOSIERY_df.

You can provide a list of index numbers as integers to iloc, which it looks like you have tried based on your commented-out code. So, you may want to make sure that find_outliers_IQR is returning a list of int so it will work properly with iloc, or convert it's output.
It looks like it's currently returning a DataFrame. You can get the index of that frame as a list like this:
hosiery_outliers.index.tolist()

How to expand groupby df Pandas python

I have filtered a pandas data frame by grouping and taking sum, now I want all the details and no longer need the sum
for example what I have looks like the image below
what i want is for each of the individual transactions to be shown, as currently the amount column is the sum of all transactions done by an individual on a specific date i want to see all the individual amounts, is this possible?
I dont know how to filter the larger df by the groupby one, have also tried using isin() with multiple &s but it does not work as for example "David" could be in my groupby df on sept 15, but in the larger df he has made transactions on other days aswell and those are slipping through when using isin()

Hello there and welcome,
first of all, as I've learned my self, always try:
to give some data (in text, or code form) as your input
share your expected output, to avoid more questions
have fun :-)
I'm new as well, and I did my best to cover as much possibilities as I could, at least people can use my code to get your df.
#From the picture
data={'Date': ['2014-06-30','2014-07-02','2014-07-02','2014-07-03','2014-07-09','2014-07-14','2014-07-17','2014-07-25','2014-07-29','2014-07-29','2014-08-06','2014-08-11','2014-08-22'],
'LastName':['Cow','Kind','Lion','Steel','Torn','White','Goth','Hin','Hin','Torn','Goth','Hin','Hin'],
'FirstName':['C','J','K','J','M','D','M','G','G','M','M','G','G'],
'Vendor':['Jail','Vet','TGI','Dept','Show','Still','Turf','Glass','Sup','Ref','Turf','Lock','Brenn'],
'Amount': [5015.70,6293.27,7043.00,7600,9887.08,5131.74,5037.55,5273.55,9455.48,5003.71,6675,7670.5,8698.18]
}
df=pd.DataFrame(data)
incoming=df.groupby(['Date','LastName','FirstName','Vendor','Amount']).count()
#what I believe you did to get Date grouped
incoming
Now here my answer:
Firstly I merged First and Lastname
df['CompleteName']=df[['FirstName','LastName']].agg('.'.join,axis=1) # getting Names for df
Then I did some statistics for the amount, for different groups:
#creating a new column with as much Statistics from group (Complete Name, Date, Vendor, etc.)
df['AmountSumName']=df['Amount'].groupby(df['CompleteName']).transform('sum')
df['AmountSumDate']=df['Amount'].groupby(df['Date']).transform('sum')
df['AmountSumVendor']=df['Amount'].groupby(df['Vendor']).transform('sum')
df
Now just groupby as you wish
Hope I could answer you question.

pyspark Drop rows in dataframe to only have X distinct values in one column

So I have a dataframe with a column "Category" and it has over 12k distinct values, for sampling purposes I would like to get a small sample where there are only 1000 different values of this category column.
Before I was doing:
small_distinct = df.select("category").distinct().limit(1000).rdd.flatMap(lambda x: x).collect()
df = df.where(col("category").isin(small_distinct))
I know this is extremely inefficient as I'm doing a distinct of the category column and then casting it into a normal python list so I can use isin() filter.
Is there any "spark" way of doing this? I thought maybe something with rollingoverwindows could do the job? But I cant get to solve it
Thanks!

You can improve your code using a left_semi join:
small_distinct = df.select("category").distinct().limit(1000)
df = df.join(small_distinct, "category", "left_semi")
Using left_semi is a good way to filter a table using another table, keeping the same schema, in a efficient way.

Restructuring Pandas Dataframe for large number of columns

I have a pandas dataframe which is a large number of answers given by users in response to a survey and I need to re-structure it. There are up to 105 questions asked each year, but I only need maybe 20 of them.
The current structure is as below.
What I want to do is re-structure it so that the row values become column names and the answer given by the user is then the value in that column. In a picture (from Excel), what I want is the below (I know I'll need to re-name my columns, but that's fine once I can create the structure in the first place):
Is it possible to re-structure my dataframe this way? The outcome of this is to use some predictive analytics to predict a target variable, so I need to re-strcture before I can use Random Forest, kNN, and so on.

You might want try pivoting your table:
df.pivot(index=['SurveyID', 'UserID'], columns=['QuestionID'], values=['AnswerText'])
df.columns = [x[0] if x[1] == "" else "Answer_{}".format(x[1]) for x in df.columns.to_flat_index()]

Create Loop to dynamically select rows from dataframe, then append selected rows to another dataframe: df.query()

I am currently working with dataframes in pandas. In sum, I have a dataframe called "Claims" filled with customer claims data, and I want to parse all the rows in the dataframe based on the unique values found in the field 'Part ID.' I would then like to take each set of rows and append it one at a time to an empty dataframe called "emptydf." This dataframe has the same column headings as the "Claims" dataframe. Since the values in the 'Part ID' column change from week to week, I would like to find some way to do this dynamically, rather than comb through the dataframe each week manually. I was thinking of somehow incorporating the df.where() expression and a For Loop, but am at a loss as to how to put it all together. Any insight into how to go about this, or even some better methods, would be great! The code I have thus far is divided into two steps as follows:
emptydf = Claims[0:0]
#Create empty dataframe
2.Parse_Claims = Claims.query('Part_ID == 1009')
emptydf = emptydf.append(Parse_Claims)
#Parse the dataframe by each unique Part ID number and append to empty dataframe. As you can see, I can only hard code one Part ID number at a time so far. This would take hours to complete manually, so I would love to figure out a way to iterate through the Part ID column and append the data dynamically.
Needless to say, I am super new to Python, so I definitely appreciate your patience in advance!

empty_df = list(Claims.groupby(Claims['Part_ID']))
this will create a list of tuples one for each part id. each tuple has 2 elements 1st is part id and 2nd is subset for that part id

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

MultiLevel indexing pandas trying to collate two different Series - python

I think what u need is explode: df['document_id'] = df['user_id'].apply(lambda x: return_documents(x)) df = df.explode('document_id')

Related

How to Index a dataframe based on an applied function? -Pandas

How to expand groupby df Pandas python

pyspark Drop rows in dataframe to only have X distinct values in one column

Restructuring Pandas Dataframe for large number of columns

Create Loop to dynamically select rows from dataframe, then append selected rows to another dataframe: df.query()

Categories

Resources