Creating a table with boolean condition between two columns - python

I'm currently looking at a Reddit data set which has comments and subreddit type as two of its columns. Since there are too many rows, my goal is to restrict the dataset to something smaller.
By looking at df['subreddit'].value_counts > 10000, I am trying to find subreddits with more than 10000 comments. How do I create a new dataframe that meets this condition? Would I use loc or set up some kind of if statement?

First, you are performing df['subreddit'].value_counts(). This returns a Series; what you might want to do is transform it into a DataFrame so you can filter it afterwards.
What I would do is:
aux_df = df['subreddit'].value_counts().reset_index()
# In pandas < 2.0 this gives the columns ['index', 'subreddit'] (name, count);
# in pandas >= 2.0 they are ['subreddit', 'count'], so adjust the names below accordingly.
filtered_df = aux_df[aux_df['subreddit'] > 10000].rename(columns={'index': 'subreddit', 'subreddit': 'amount'})
Optionally with loc:
filtered_df = aux_df.loc[aux_df['subreddit'].gt(10000)].rename(columns={'index':'subreddit','subreddit':'amount'})
Edit
Based on the comment, I would first build the list of all subreddits with more than 10000 comments (as done above), and then simply filter the original dataframe with those values:
df = df[df['subreddit'].isin(list(filtered_df['subreddit']))]
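If it helps, here is a compact end-to-end sketch of the same idea that skips the reset_index step by filtering the value_counts() Series directly (the column names and toy data below are only illustrative assumptions):

import pandas as pd

# Toy stand-in for the Reddit data; 'subreddit' and 'body' are assumed column names.
df = pd.DataFrame({
    'subreddit': ['askreddit'] * 4 + ['python'] * 2 + ['rare'],
    'body': ['some comment'] * 7,
})

counts = df['subreddit'].value_counts()   # comments per subreddit
popular = counts[counts > 3].index        # use > 10000 on the real data

# Keep only the rows belonging to those subreddits.
smaller = df[df['subreddit'].isin(popular)]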


Using groupby to speed up random number generation in part of a dataframe

I have a program that uses a mask, similar to the accepted answer shown here, to create multiple sets of random numbers in a dataframe, df.
Create random.randint with condition in a group by?
My code:
# 'state' is an iterable of city names
for city in state:
    mask = df['City'] == city
    df.loc[mask, 'Random'] = np.random.randint(1, 200, mask.sum())
This takes quite some time as the dataframe df gets bigger. Is there a way to speed this up with groupby?
You can try:
# Create a placeholder 'Random' column, then fill each city's group with its own batch of random ints.
df['Random'] = df.assign(Random=0).groupby(df['City'])['Random'] \
                 .transform(lambda x: np.random.randint(1, 200, len(x)))
I've figured out a much quicker way to do this. I'll keep it general, since the application may differ depending on what you want to achieve, and I'll leave Corralien's answer as the accepted one.
Instead of creating a mask or group and using .loc to update the dataframe in place, I sorted the dataframe by 'City' and then created a list of the unique values in the 'City' column.
Looping over that unique list (i.e. the grouping), I generated the random numbers for each group, collecting them in a new list with .extend(). I then added the 'Random' column from this list and sorted the dataframe back using the index, as sketched below.
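A minimal sketch of that approach, assuming a 'City' column like the one in the question (the data itself is illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({'City': ['B', 'A', 'B', 'C', 'A', 'B']})

df = df.sort_values('City')            # group rows of the same city together
randoms = []
for city in df['City'].unique():       # loop over the "grouping"
    n = (df['City'] == city).sum()     # number of rows in this group
    randoms.extend(np.random.randint(1, 200, n))

df['Random'] = randoms                 # positional assignment, so order matters
df = df.sort_index()                   # restore the original row order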

How to get rows from one dataframe based on another dataframe

I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT).
The column 'A' in dataframe DT is a subset of the column 'A' in dataframe MR; the two frames are only similar (not equal) in this ID column, and the rest of the columns, as well as the number of rows, are different.
How can I get the rows of MR whose 'ID' values also appear in DT['ID']? Note that values in 'ID' can appear several times in the same column.
(DT has 1538 rows and MR has 2060 rows.)
I tried some of the lines proposed here https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results, as I don't fully understand the methods they propose (and the goal is a little different).
Thanks!
Take a look at the pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or, if you want to get a new dataframe of combined records for the same ID, you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
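To see the difference between the two on toy data (these frames are only illustrative stand-ins for MR and DT):

import pandas as pd

MR = pd.DataFrame({'ID': [1, 1, 2, 3], 'mr_col': ['a', 'b', 'c', 'd']})
DT = pd.DataFrame({'ID': [1, 4], 'dt_col': ['x', 'y']})

# isin keeps MR's rows and columns unchanged, just filtered (two rows with ID == 1).
filtered = MR.loc[MR.ID.isin(DT.ID)]

# merge keeps the matching rows but also pulls in DT's columns ('dt_col' here).
combined = pd.merge(MR, DT, on='ID')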

pyspark Drop rows in dataframe to only have X distinct values in one column

So I have a dataframe with a column "Category" that has over 12k distinct values. For sampling purposes I would like to get a small sample where there are only 1000 different values of this category column.
Before, I was doing:
small_distinct = df.select("category").distinct().limit(1000).rdd.flatMap(lambda x: x).collect()
df = df.where(col("category").isin(small_distinct))
I know this is extremely inefficient, as I'm doing a distinct on the category column and then collecting it into a normal Python list so I can use the isin() filter.
Is there any "Spark" way of doing this? I thought maybe something with window functions could do the job, but I can't get it to work.
Thanks!
You can improve your code using a left_semi join:
small_distinct = df.select("category").distinct().limit(1000)
df = df.join(small_distinct, "category", "left_semi")
Using left_semi is a good way to filter one table using another while keeping the same schema, in an efficient way.
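For reference, a self-contained sketch of the left_semi approach (the data and the limit of 2 are illustrative; use 1000 in the real case):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative frame; in practice df is the DataFrame with 12k+ categories.
df = spark.createDataFrame(
    [("a", 1), ("a", 2), ("b", 3), ("c", 4)],
    ["category", "value"],
)

# Pick up to N distinct categories ...
small_distinct = df.select("category").distinct().limit(2)

# ... and keep only df's rows whose category is in that set.
# left_semi returns columns of the left table only, so df's schema is unchanged.
sampled = df.join(small_distinct, "category", "left_semi")
sampled.show()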

list of booleans to slice DataFrame

I'm trying to find a less manual, more convenient way to slice a Pandas DataFrame based on multiple boolean conditions. To illustrate what I'm after, here is a simplified example:
df = pd.DataFrame({'col1':[True,False,True,False,False,True],'col2':[False,False,True,True,False,False]})
Suppose I am interested in the subset of the DataFrame where both 'col1' and 'col2' are True. I can find this by running:
df[(df['col1']==True) & (df['col2']==True)]
This is manageable enough in a small example like this one, but the real DataFrame can have up to a hundred columns, so rather than string together a long boolean expression like the one above, I would rather read the columns of interest into a list, e.g.
['col1','col2']
and select the rows where all the listed columns are True.
If you need all columns to be True:
df[df.all(axis=1)]
If you have a list of columns, e.g. COLS = ['col1', 'col2']:
df[df[COLS].all(axis=1)]
For the opposite (rows where not every column is True):
df[~df.all(axis=1)]
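Applied to the example frame from the question, the list-of-columns variant keeps only the row where both columns are True (index 2):

import pandas as pd

df = pd.DataFrame({'col1': [True, False, True, False, False, True],
                   'col2': [False, False, True, True, False, False]})

COLS = ['col1', 'col2']            # the columns of interest
subset = df[df[COLS].all(axis=1)]  # only the row at index 2 survives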

Comparing Pandas Dataframe Rows & Dropping rows with overlapping dates

I have a dataframe filled with trades taken from a trading strategy. The logic in the trading strategy needs to be updated to ensure that a trade isn't taken if the strategy is already in a trade - but that's a different problem. The trade data for many previous trades is read into a dataframe from a csv file.
Here's my problem for the data I have:
I need to do a row-by-row comparison of the dataframe to determine whether the EntryDate of row X is less than the ExitDate of row X-1.
A sample of my data:
       EntryDate   ExitDate
Row 1  2012-07-25  2012-07-27
Row 2  2012-07-26  2012-07-29
Row 2 needs to be deleted because it is a trade that should not have occurred: its EntryDate (2012-07-26) falls before Row 1's ExitDate (2012-07-27).
I'm having trouble identifying which rows are duplicates and then dropping them. I tried the approach in answer 3 of this question with some luck but it isn't ideal because I have to manually iterate through the dataframe and read each row's data. My current approach is below and is ugly as can be. I check the dates, and then add them to a new dataframe. Additionally, this approach gives me multiple duplicates in the final dataframe.
for i in range(0, len(df)+1):
    if i+1 == len(df): break  # to keep from going past the last row
    EntryDate = df['EntryDate'].irow(i)
    ExitDate = df['ExitDate'].irow(i)
    EntryNextTrade = df['EntryDate'].irow(i+1)
    if EntryNextTrade > ExitDate:
        line = {'EntryDate': EntryDate, 'ExitDate': ExitDate}
        df_trades = df_trades.append(line, ignore_index=True)
Any thoughts or ideas on how to more efficiently accomplish this?
You can click here to see a sampling of my data if you want to try to reproduce my actual dataframe.
You should use some kind of boolean mask to do this kind of operation.
One way is to create a dummy column for the next trade:
df['EntryNextTrade'] = df['EntryDate'].shift(-1)  # shift(-1) pulls the next row's EntryDate up to the current row
Use this to create the mask:
msk = df['EntryNextTrade'] > df['ExitDate']
And use loc to look at the sub-DataFrame where msk is True, and only the specified columns:
df.loc[msk, ['EntryDate', 'ExitDate']]
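Putting it together on the two-row sample from the question (a sketch only; it assumes the date columns are parsed as datetimes and the trades are already sorted by EntryDate). It flips the comparison to flag the overlapping trade directly instead of rebuilding df_trades row by row:

import pandas as pd

df = pd.DataFrame({
    'EntryDate': pd.to_datetime(['2012-07-25', '2012-07-26']),
    'ExitDate':  pd.to_datetime(['2012-07-27', '2012-07-29']),
})

# A trade overlaps the previous one if it starts before the previous trade ends.
overlaps = df['EntryDate'] < df['ExitDate'].shift()

df_trades = df[~overlaps]   # keeps Row 1, drops the overlapping Row 2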
