I have the following scenario in a sales dataframe, each row being a distinct sale:
Category  Product  Purchase Value | new_column_A  new_column_B
A         C        30             |
B         B        50             |
C         A        100            |
I've looked through the qcut documentation but can't find anywhere how to add a series of columns based on the following logic:
df['new_column_A'] = when Category = A and Product = A then pd.qcut(df['Purchase_Value'], q=4)
df['new_column_B'] = when Category = A and Product = B then pd.qcut(df['Purchase_Value'], q=4)
Preferably, I would like these new percentile-cut columns to be created in the same original dataframe.
The first thing that comes to mind is to split the dataframe into separate ones by doing the filtering I need, but I would like to keep all these columns in the original dataframe.
Does anyone know if this is possible and how I can do it?
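A sketch of the kind of thing I am after, using boolean masks with .loc so each qcut only sees the matching rows (I have not verified this is the idiomatic way):
import pandas as pd

# compute quartiles only over the rows matching each Category/Product combination
mask_a = (df['Category'] == 'A') & (df['Product'] == 'A')
df.loc[mask_a, 'new_column_A'] = pd.qcut(df.loc[mask_a, 'Purchase_Value'], q=4)

mask_b = (df['Category'] == 'A') & (df['Product'] == 'B')
df.loc[mask_b, 'new_column_B'] = pd.qcut(df.loc[mask_b, 'Purchase_Value'], q=4)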
I want to find all rows where a certain value is present inside the column's list value.
So imagine I have a dataframe set up like this:
| placeID | users                           |
-------------------------------------------
| 134986  | [U1030, U1017, U1123, U1044...] |
| 133986  | [U1034, U1011, U1133, U1044...] |
| 134886  | [U1031, U1015, U1133, U1044...] |
| 134976  | [U1130, U1016, U1133, U1044...] |
How can I get all rows where 'U1030' exists in the users column?
Or... is the real problem that I should not have my data arranged like this, and I should instead explode that column to have a row for each user?
What's the right way to approach this?
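For reference, if exploding is the better route, I imagine it would look something like this (a sketch; DataFrame.explode exists in pandas 0.25+):
exploded = df.explode('users')          # one row per (placeID, user) pair
exploded[exploded['users'] == 'U1030']  # rows for the user of interest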
The way you have stored the data looks fine to me; you do not need to change the format.
Try this:
df1 = df[df['users'].str.contains("U1030")]
print(df1)
This will give you all the rows containing the specified user, as a dataframe.
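One caveat (this is my assumption about how the column is stored): .str.contains works when 'users' holds plain strings; if the column holds actual Python lists, it returns NaN for every row, and a per-row membership test is needed instead, for example:
df1 = df[df['users'].apply(lambda users: 'U1030' in users)]
print(df1)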
When you want to check whether a value exists inside a column whose values are lists, the map function is helpful.
Implemented as below with an inline lambda, each list stored in the 'users' column is bound to u, and userID is compared against it.
Really the answer is pretty straightforward when you look at the code below:
# user_filter filters the dataframe to all the rows where
# 'userID' is NOT in the 'users' column (the value of which
# is a list type)
user_filter = df['users'].map(lambda u: userID not in u)
# cuisine_filter filters the dataframe to only the rows
# where 'cuisine' exists in the 'cuisines' column (the value
# of which is a list type)
cuisine_filter = df['cuisines'].map(lambda c: cuisine in c)
# Display the result, applying both filters
df[user_filter & cuisine_filter]
I have a Spark dataframe which looks a bit like this:
id country date action
1 A 2019-01-01 suppress
1 A 2019-01-02 suppress
2 A 2019-01-03 bid-up
2 A 2019-01-04 bid-down
3 C 2019-01-01 no-action
3 C 2019-01-02 bid-up
4 D 2019-01-01 suppress
I want to reduce this dataframe by grouping by id, country and collecting the unique values of the 'action' column into an array, but this array should be ordered by the date column.
E.g.
id country action_arr
1 A [suppress]
2 A [bid-up, bid-down]
3 C [no-action, bid-up]
4 D [suppress]
To explain this a little more concisely, I have some SQL (Presto) code that does exactly what I want... I'm just struggling to do this in PySpark or SparkSQL:
SELECT id, country, array_distinct(array_agg(action ORDER BY date ASC)) AS actions
FROM table
GROUP BY id, country
Now here's my attempt in PySpark:
from pyspark.sql import functions as F
from pyspark.sql import Window
w = Window.partitionBy('action').orderBy('date')
sorted_list_df = df.withColumn('sorted_list', F.collect_set('action').over(w))
Then I want to find out the number of occurrences of each set of actions by group:
df = sorted_list_df.select('country', 'sorted_list').groupBy('country', 'sorted_list').agg(F.count('sorted_list'))
The code runs, but in the output the sorted_list column is basically the same as action, without any array aggregation. Can someone help?
EDIT: I managed to pretty much get what I want... but the results don't fully match the Presto results. Can anyone explain why? Solution below:
from pyspark.sql import functions as F
from pyspark.sql import Window
w = Window.partitionBy("id").orderBy("date")
df_2 = df.withColumn("sorted_list", F.collect_set("action").over(w))
test = df_2.select('id', 'country', 'sorted_list')\
    .dropDuplicates()\
    .select('country', 'sorted_list')\
    .groupBy('country', 'sorted_list')\
    .agg(F.count('sorted_list'))
IMO, your window definition is wrong. You should partition by the columns you want to group by, and then collect a set of unique values per group.
IIUC, you just need to do:
w = Window.partitionBy(['id', 'country']).orderBy('date')
sorted_list_df = df.withColumn('sorted_list', F.collect_set('action').over(w))
df_new = sorted_list_df.select('id', 'country', 'sorted_list').withColumn("count_of_elems", F.size("sorted_list"))
DRAWBACK:
If you use a window, you will get a new set for every row, and your row count will be the same as in the old df. There is no aggregation per se, which I don't think is what you want either.
This next line aggregates the values as a set for every group. I'm hoping it gets you exactly what you want:
df_new = sorted_list_df.groupby('id', 'country').agg(F.max('sorted_list').alias('sorted_list')).withColumn("count_of_elems", F.size("sorted_list"))
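As a side note, collect_set does not guarantee any ordering, which is probably why the results don't exactly match the Presto query. If date order matters, one possible sketch without a window (assuming Spark 2.4+ for sort_array on structs, transform and array_distinct):
from pyspark.sql import functions as F

# collect (date, action) pairs per group, sort them by date, keep the actions,
# then drop duplicates while preserving first-occurrence order
result = df.groupBy('id', 'country').agg(
    F.expr("array_distinct(transform(sort_array(collect_list(struct(date, action))), x -> x.action))").alias('action_arr')
)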
I am new to Python, doing a data analysis project, and would appreciate some help.
I have a dataframe of 400,000 rows with the columns: ID, Type 1 Categories, Type 2 Categories, Type 3 Categories, Amount, Age, Fraud.
The Category columns are columns of lists. Each list contains terms, and I want to build a matrix that counts how many times each term occurs in that row (one column per term, holding the frequency).
So the goal is to create a dataframe of a sparse matrix, with each of those unique categories becoming a column. My dataset has over 2000 different categories, which may be why the count vectorizer is not a good fit for this.
I tried two methods, one using CountVectorizer and another using for loops, but CountVectorizer crashes every time it runs and the second method is far too slow. I was wondering if there is any way to improve these solutions.
I also split the dataframe into multiple chunks and it still causes problems.
Example:
+------+--------------------------------------------+---------+---------+
| ID | Type 1 Category | Amount | Fraud |
+------+--------------------------------------------+---------+---------+
| ID1 | [Lex1, Lex2, Lex1, Lex4, Lex2, Lex1] | 110.0 | 0 |
| ID2 | [Lex3, Lex6, Lex3, Lex6, Lex3, Lex1, Lex2] | 12.5 | 1 |
| ID3 | [Lex7, Lex3, Lex2, Lex3, Lex3] | 99.1 | 0 |
+------+--------------------------------------------+---------+---------+
col = 'Type 1 Category'
# prior to this, I combined rows of the old dataframe by ID;
# in the old dataframe each row was a separate occurrence of an ID
# with only one category per row
terms = df_old[col].unique()
countvec = CountVectorizer(vocabulary=terms)
# create bag of words
df = df.join(pd.DataFrame(countvec.fit_transform(df[col]).toarray(),
                          columns=countvec.get_feature_names(),
                          index=df.index))
# drop original column of lists
df = df.drop(col, axis = 1)
##### second method: split the dataframe into chunks using np.split
df_l3 = df_split[3]
output.index = df_l3.index
# Assign the ID and category columns.
output[['ID', col]] = df_l3[['ID', col]]
# this chunk's index starts at 114305
last = 114305 + int(df_l3.shape[0])
for i in range(114305, last):
    print(i)
    for word in words:
        output.ix[i, str(word)] = output[col][i].count(str(word))
The CountVectorizer approach runs out of memory, and the second one no longer counts the frequencies correctly: it worked for chunk 1, where the index starts from zero, but not for the other chunks.
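A possible direction, in case it helps frame answers (a sketch of an alternative I have not benchmarked, assuming the 'Type 1 Category' column holds Python lists): count terms per row with collections.Counter, build a frame of counts, fill missing terms with 0, and join it back.
from collections import Counter
import pandas as pd

col = 'Type 1 Category'
# one Counter per row -> one column per distinct term, NaN where a term is absent
counts = pd.DataFrame([Counter(lst) for lst in df[col]], index=df.index).fillna(0).astype(int)
df = df.drop(columns=[col]).join(counts)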
I am trying to aggregate a dataframe based on values found in two columns: rows that have some value X in either column A or column B should be aggregated together.
More concretely, I am trying to do something like this. Let's say I have a dataframe gameStats:
awayTeam   homeTeam  awayGoals  homeGoals
Chelsea    Barca     1          2
R. Madrid  Barca     2          5
Barca      Valencia  2          2
Barca      Sevilla   1          0
... and so on
I want to construct a dataframe such that among my rows I would have something like:
team goalsFor goalsAgainst
Barca 10 5
One obvious solution, since the set of unique elements is small, is something like this:
for team in teamList:
    aggregateDf = gameStats[(gameStats['homeTeam'] == team) | (gameStats['awayTeam'] == team)]
    # do other manipulations of the data, then append it to a final dataframe
However, going through a loop seems less elegant. And since I have had this problem before with many unique identifiers, I was wondering if there was a way to do this without using a loop as that seems very inefficient to me.
The solution is two-fold: first compute the goals for each team when they are away and when they are home, then combine them. Something like:
goals_when_away = gameStats.groupby(['awayTeam'])[['awayGoals', 'homeGoals']].agg('sum').reset_index().sort_values('awayTeam')
goals_when_home = gameStats.groupby(['homeTeam'])[['homeGoals', 'awayGoals']].agg('sum').reset_index().sort_values('homeTeam')
Then combine them:
np_result = goals_when_away.iloc[:, 1:].values + goals_when_home.iloc[:, 1:].values
pd_result = pd.DataFrame(np_result, columns=['goal_for', 'goal_against'])
result = pd.concat([goals_when_away.iloc[:, :1], pd_result], axis=1, ignore_index=True)
Note the use of .values when summing (to get the result as a NumPy array) and of ignore_index=True in the concat; both avoid the pandas trap of aligning on column and index names when summing.
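If positional alignment of the two groupby results feels fragile (it assumes every team appears both home and away, in the same sorted order), a sketch of an alternative is to reshape each game into one row per team and then aggregate; column names below follow the question:
import pandas as pd

home = gameStats.rename(columns={'homeTeam': 'team', 'homeGoals': 'goalsFor', 'awayGoals': 'goalsAgainst'})
away = gameStats.rename(columns={'awayTeam': 'team', 'awayGoals': 'goalsFor', 'homeGoals': 'goalsAgainst'})
long_form = pd.concat([home[['team', 'goalsFor', 'goalsAgainst']],
                       away[['team', 'goalsFor', 'goalsAgainst']]], ignore_index=True)
result = long_form.groupby('team', as_index=False)[['goalsFor', 'goalsAgainst']].sum()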
I have a dataset like this:
Policy | Customer | Employee | CoverageDate | LapseDate
123 | 1234 | 1234 | 2011-06-01 | 2015-12-31
124 | 1234 | 1234 | 2016-01-01 | ?
125 | 1234 | 1234 | 2011-06-01 | 2012-01-01
124 | 5678 | 5555 | 2014-01-01 | ?
I'm trying to iterate through each policy for each employee of each customer (a customer can have many employees, an employee can have multiple policies) and compare the covered date against the lapse date for a particular employee. If the covered date and lapse date are within 5 days, I'd like to add that policy to a results list.
So, expected output would be:
Policy | Customer | Employee
123 | 1234 | 1234
because policy 123's lapse date was within 5 days of policy 124's covered date.
I'm running into a problem while trying to iterate through each grouping of Customer/Employee numbers. I'm able to identify how many rows of data are in each EmployeeID/Customer number (EBCN below) group, but I need to reference specific data within those rows to assign variables for comparison.
So far, I've been able to write this code:
import pandas
import datetime

wd = pandas.read_csv(DATASOURCE)
l = 0
for row, i in wd.groupby(['EMPID', 'EBCN']).size().iteritems():
    Covdt = pandas.to_datetime(wd.loc[l, 'CoverageEffDate'])
    for each in range(i):
        LapseDt = wd.loc[l, 'LapseDate']
        if LapseDt != '?':
            LapseDt = pandas.to_datetime(LapseDt) + datetime.timedelta(days=5)
            if Covdt < LapseDt:
                print('got one!')
        l = l + 1
This code is not working because I'm trying to reference the coverage date/lapse dates on a particular row with the loc function, with my row number stored in the 'l' variable. I initially thought that Pandas would iterate through groups in the order they appear in my dataset, so that I could simply start with l=0 (i.e. the first row in the data), assign the coverage date and lapse date variables based on that, and then move on, but it appears that Pandas starts iterating through groups randomly. As a result, I do indeed get a comparison of lapse/coverage dates, but they're not associated with the groups that end up getting output by the code.
The best solution I can figure is to determine what the row number is for the first row of each group and then iterate forward by the number of rows in that group.
I've read through a question regarding finding the first row of a group, and am able to do so by using
wd.groupby(['EMPID','EBCN']).first()
but I haven't been able to figure out what row number the results are stored on in a way that I can reference with the loc function. Is there a way to store the row number for the first row of a group in a variable or something so I can iterate my coverage date and lapse date comparison forward from there?
Regarding my general method, I've read through the question here, which is very close to what I need:
pandas computation in each group
However, I need to compare each policy in the group against every other policy in the group; the question above just compares the last row in each group against the others.
Is there a way to do what I'm attempting in Pandas/Python?
For anyone needing this information in the future - I was able to implement Boud's suggestion to use the pandas.merge_asof() function to replace my code above. I had to do some data manipulation to get the desired result:
Splitting the dataframe into two separate frames - one with CoverageDate and one with LapseDate.
Replacing the '?' (null values) in my data with a numpy.nan datatype
Sorting the left and right dataframes by the Date columns
Once the data was in the correct format, I implemented the merge:
pandas.merge_asof(cov, term,
                  on='Date',
                  by='EMP|EBCN',
                  tolerance=pandas.Timedelta('5 days'))
Note 'cov' is my dataframe containing coverage dates, term is the dataframe with lapses. The 'EMP|EBCN' column is a concatenated column of the employee ID and Customer # fields, to allow easy use of the 'by' field.
Thanks so much to Boud for sending me down the correct path!
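For completeness, a sketch of the full preparation described above; the intermediate names (cov, term, the concatenated key) are illustrative, and the column names come from my earlier code in the question:
import numpy
import pandas

wd = pandas.read_csv(DATASOURCE)
wd['LapseDate'] = wd['LapseDate'].replace('?', numpy.nan)
wd['EMP|EBCN'] = wd['EMPID'].astype(str) + '|' + wd['EBCN'].astype(str)

# one frame keyed on coverage dates, one on lapse dates
cov = wd[['EMP|EBCN', 'Policy', 'CoverageEffDate']].rename(columns={'CoverageEffDate': 'Date'})
term = wd[['EMP|EBCN', 'Policy', 'LapseDate']].rename(columns={'LapseDate': 'Date'}).dropna(subset=['Date'])
cov['Date'] = pandas.to_datetime(cov['Date'])
term['Date'] = pandas.to_datetime(term['Date'])

# merge_asof requires both frames to be sorted by the 'on' key
cov = cov.sort_values('Date')
term = term.sort_values('Date')

matches = pandas.merge_asof(cov, term, on='Date', by='EMP|EBCN',
                            tolerance=pandas.Timedelta('5 days'))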