Pandas Dataframe selecting groups with minimal cardinality - python

I have a problem where I need to take groups of rows from a data frame where the number of items in a group exceeds a certain number (cutoff). For those groups, I need to take some head rows and the tail row.
I am using the code below
train = train[train.groupby('id').id.transform(len) > headRows]
groups = pd.concat([train.groupby('id').head(headRows),train.groupby('id').tail(1)]).sort_index()
This works, but the first line is very slow: 30 minutes or more.
Is there any way to make the first line faster? If I leave it out, the second line produces duplicate indices (for small groups the same row appears in both the head and the tail), which messes things up.
Thanks in advance
Regards
Note:
My train data frame has around 70,000 groups of varying group size over around 700,000 rows. It actually follows from my other question, as can be seen here: Data processing with adding columns dynamically in Python Pandas Dataframe.
Jeff gave a great answer there, but it fails if the group size is less than or equal to the parameter I pass to head(parameter) when concatenating my rows as in Jeff's answer: In [31]: groups = concat.....

Use groupby/filter:
>>> df.groupby('id').filter(lambda x: len(x) > cutoff)
This will just return the rows of your dataframe where the size of the group is greater than your cutoff. Also, it should perform quite a bit better. I timed filter here with a dataframe with 30,039 'id' groups and a little over 4 million observations:
In [9]: %timeit df.groupby('id').filter(lambda x: len(x) > 12)
1 loops, best of 3: 12.6 s per loop
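If it helps, here is a rough sketch of how the filter step could feed into the head/tail concatenation from the original post; the toy frame and the headRows value are only placeholders:
import pandas as pd

# toy stand-in for the train DataFrame
train = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 3], 'val': range(7)})
headRows = 2

# keep only groups with more than headRows rows
train = train.groupby('id').filter(lambda x: len(x) > headRows)

# first headRows rows plus the last row of each remaining group
groups = pd.concat([train.groupby('id').head(headRows),
                    train.groupby('id').tail(1)]).sort_index()
print(groups)
Because filtering keeps only groups with more than headRows rows, the head and tail pieces can no longer overlap, so no duplicate indices appear.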

Related

How to limit rows in pandas dataframe?

How do I limit the number of rows in a pandas dataframe in Python? I need only the last 1000 rows; the rest should be deleted.
For example: 1000 rows in the pandas dataframe -> 1000 rows in the csv.
I tried df.iloc[:1000]
I need to automatically clean the dataframe and save only the last 1000 rows.
If you want the first 1000 records you can use:
df = df.head(1000)
With df.iloc[:1000] you get the first 1000 rows.
Since you want to get the last 1000 rows, you have to change this line a bit to df_last_1000 = df.iloc[-1000:]
To save it as a csv file you can use pandas' to_csv() method: df_last_1000.to_csv("last_1000.csv")
Are you trying to limit the number of rows when importing a csv, or when exporting a dataframe to a new csv file?
Importing first 1000 rows of csv:
df_limited = pd.read_csv(file, nrows=1000)
Get first 1000 rows of a dataframe (for export):
df_limited = df.head(1000)
Get last 1000 rows of a dataframe (for export):
df_limited = df.tail(1000)
Edit 1
As you are exporting a csv:
You can make a range selection with [n:m] where n is the starting point of your selection and m is the end point.
It works like this:
If the number is positive, it's counting from the top of the list, beginning of the string, top of the dataframe etc.
If the number is negative, it counts from the back.
[5:] selects everything from index 5 to the end (as there is no end point given)
[3:8] selects everything from index 3 up to (but not including) index 8
[5:-2] selects everything from index 5 up to the 2nd element from the back (the end point is excluded)
[-1000:] the start point is 1000 elements from the back and the end point is the last element (this is what you wanted, I think)
[:1000] selects the first 1000 lines (the start point is the beginning, as there is no number given, and the end point is 1000 elements from the front)
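A quick sketch putting this together (the toy dataframe and the file name are just examples):
import pandas as pd

df = pd.DataFrame({'value': range(5000)})   # toy dataframe

df_last_1000 = df.iloc[-1000:]    # last 1000 rows
df_first_1000 = df.iloc[:1000]    # first 1000 rows

df_last_1000.to_csv("last_1000.csv", index=False)   # export to csv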
Edit 2
After a quick check (and a very simple benchmark) it looks like df.tail(1000) is significantly faster than df.iloc[-1000:]

Pyspark random split changes distribution of data

I found a very strange behavior with pyspark when I use randomSplit. I have a column is_clicked that takes values 0 or 1, and there are way more zeros than ones. After a random split I would expect the data to be uniformly distributed. But instead, I see that the first rows in the splits are all is_clicked=1, followed by rows that are all is_clicked=0. You can see that the number of clicks in the original dataframe df is 9 out of 1000 (which is what I expect). But after the random split the number of clicks is 1000 out of 1000. If I take more rows I will see that they are all is_clicked=1 until there are no more such rows, and then they are followed by rows with is_clicked=0.
Anyone knows why there is distribution change after random split? How can I make is_clicked be uniformly distributed after split?
So indeed pyspark does sort the data when it performs randomSplit. Here is a quote from the source code:
It is possible that the underlying dataframe doesn't guarantee the ordering of rows in its constituent partitions each time a split is materialized which could result in overlapping splits. To prevent this, we explicitly sort each input partition to make the ordering deterministic. Note that MapTypes cannot be sorted and are explicitly pruned out from the sort order.
The solution to this is to either reshuffle the data after the split or just use filter instead of randomSplit:
Solution 1:
import pyspark.sql.functions as sf

df = df.withColumn('rand', sf.rand(seed=42)).orderBy('rand')
df_train, df_test = df.randomSplit([0.5, 0.5])
df_train = df_train.orderBy('rand')   # reshuffle the training split after randomSplit
Solution 2:
# assumes the 'rand' column added above: df = df.withColumn('rand', sf.rand(seed=42))
df_train = df.filter(df.rand < 0.5)
df_test = df.filter(df.rand >= 0.5)
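For completeness, here is a minimal self-contained sketch of the filter-based split; the toy data, the 1% click rate and the 0.5 threshold are made up for illustration:
from pyspark.sql import SparkSession
import pyspark.sql.functions as sf

spark = SparkSession.builder.getOrCreate()

# toy frame: roughly 1% of rows have is_clicked=1
df = spark.range(1000).withColumn('is_clicked', (sf.rand(seed=1) < 0.01).cast('int'))

# one random column drives the split; filter does not re-sort the partitions
df = df.withColumn('rand', sf.rand(seed=42))
df_train = df.filter(df.rand < 0.5)
df_test = df.filter(df.rand >= 0.5)

df_train.show(10)   # rows keep their original order; clicks are not bunched at the top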
Here is a blog post with more details.

More efficient way to classify

normal = []
nine_plus = []
tw_plus = []
for i in df['SubjectID'].unique():
    x = df.loc[df['SubjectID'] == i]
    if len(x['Year Term ID'].unique()) <= 8:
        normal.append(i)
    elif 9 <= len(x['Year Term ID'].unique()) < 13:
        nine_plus.append(i)
    elif len(x['Year Term ID'].unique()) >= 13:
        tw_plus.append(i)
Hello, I am dealing with a dataset that has 10 million rows. The dataset is about student records, and I am trying to classify the students into three groups according to how many semesters they have attended. I feel like I am using a very crude method right now, and there could be a more efficient way of categorizing. Any suggestions?
You go through a lot of repeated iterations that are likely to make your data frame slower than a simple Python list. Use the data frame organization in your favor.
Group your rows by Subject_ID, then Year_Term_ID.
Extract the count of rows in each sub-group -- which you currently have as len(x(...
Make a function, lambda, or extra column that represents the classification; call that len expression load:
0 if load <= 8 else 1 if load <= 12 else 2
Use that expression to re-group your students into the three desired classifications.
Do not iterate through the rows of the data frame: this is a "code smell" that you're missing a vectorized capability.
Does that get you moving?
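If it helps, here is a rough sketch of those steps, assuming df is your frame with the SubjectID and Year Term ID columns (the bin edges mirror your cutoffs; the group labels are just examples):
import pandas as pd

# distinct terms attended per student
terms = df.groupby('SubjectID')['Year Term ID'].nunique()

# classify each student by term count: <=8, 9-12, >=13
labels = pd.cut(terms, bins=[0, 8, 12, float('inf')],
                labels=['normal', 'nine_plus', 'tw_plus'])

# lists of SubjectIDs per group, as in the original code
normal = labels[labels == 'normal'].index.tolist()
nine_plus = labels[labels == 'nine_plus'].index.tolist()
tw_plus = labels[labels == 'tw_plus'].index.tolist()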

Comparing data in 2 separate DataFrames and producing a result in Python/Pandas

I am new to Python and I'm trying to produce a result similar to Excel's INDEX/MATCH function with Python & Pandas, though I'm struggling to get it working.
Basically, I have 2 separate DataFrames:
The first DataFrame ('market') has 7 columns, though I only need 3 of those columns for this exercise ('symbol', 'date', 'close'). This df has 13,948,340 rows.
The second DataFrame ('transactions') has 14 columns, though I only need 2 of those columns ('i_symbol', 'acceptance_date'). This df has 1,428,026 rows.
My logic is: If i_symbol is equal to symbol and acceptance_date is equal to date: print symbol, date & close. This should be easy.
I have achieved it with iterrows() but because of the size of the dataset, it returns a single result every 3 minutes - which means I would have to run the script for 1,190 hours to get the final result.
Based on what I have read online, itertuples should be a faster approach, but I am currently getting an error:
ValueError: too many values to unpack (expected 2)
This is the code I have written (which currently produces the above ValueError):
for i_symbol, acceptance_date in transactions.itertuples(index=False):
    for symbol, date in market.itertuples(index=False):
        if i_symbol == symbol and acceptance_date == date:
            print(market.symbol + market.date + market.close)
2 questions:
Is itertuples() the best/fastest approach? If so, how can I get the above working?
Does anyone know a better way? Would indexing work? Should I use an external db (e.g. mysql) instead?
Thanks, Matt
Regarding question 1: pandas.itertuples() yields one namedtuple for each row. You can either unpack these like standard tuples or access the tuple elements by name:
for t in transactions.itertuples(index=False):
    for m in market.itertuples(index=False):
        if t.i_symbol == m.symbol and t.acceptance_date == m.date:
            print(m.symbol + m.date + m.close)
(I did not test this with data frames of your size but I'm pretty sure it's still painfully slow)
Regarding question 2: You can simply merge both data frames on symbol and date.
Rename your "transactions" DataFrame so that it also has columns named "symbol" and "date":
transactions = transactions[['i_symbol', 'acceptance_date']]
transactions.columns = ['symbol','date']
Then merge both DataFrames on symbol and date:
result = pd.merge(market, transactions, on=['symbol','date'])
The result DataFrame consists of one row for each symbol/date combination which exists in both DataFrames. The operation only takes a few seconds on my machine with DataFrames of your size.
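For reference, a minimal self-contained sketch of that rename-and-merge approach (the tiny frames and prices below are made up, standing in for your data):
import pandas as pd

market = pd.DataFrame({'symbol': ['AAPL', 'MSFT', 'AAPL'],
                       'date': ['2021-01-04', '2021-01-04', '2021-01-05'],
                       'close': [129.4, 217.7, 131.0]})
transactions = pd.DataFrame({'i_symbol': ['AAPL', 'GOOG'],
                             'acceptance_date': ['2021-01-05', '2021-01-04']})

transactions = transactions[['i_symbol', 'acceptance_date']]
transactions.columns = ['symbol', 'date']

result = pd.merge(market, transactions, on=['symbol', 'date'])
print(result)   # one matching row: AAPL on 2021-01-05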
#Parfait provided the best answer below as a comment. Very clean, worked incredibly fast - thank you.
pd.merge(market[['symbol', 'date', 'close']],
         transactions[['i_symbol', 'acceptance_date']],
         left_on=['symbol', 'date'],
         right_on=['i_symbol', 'acceptance_date'])
No need for looping.

I need to apply multiple equations on a pandas column based on certain criteria

I have a dataframe that requires multiple equations based on certain criteria. I need to take the first 3 letters of an identifier and, if they match, divide the value associated with that row by a certain amount.
The dataframe is as follows:
ID Value
US123 10000
US121 10000
MX122 10000
MX125 10000
BR123 10000
BR127 10000
I need to divide the value by 100 if the ID starts with 'MX', and divide the value by 1000 if the ID starts with 'BR'. All other values need to remain the same.
I also do not want to create a new filtered dataframe. I have had success filtering by ID then doing the logic checks, but I need to apply it over a much larger frame.
This is the code I am using for the filtered frame.
filtered['Value'] = np.where(filtered.ID.apply(lambda x: x[:3]).isin(['MX']) == True, filtered.Value/100, filtered.Value/1000)
I've also tried df.loc but I cannot figure out how to apply the changes to the dataframe, it seems to only show me a series of data but not apply it to the DF.
That code is here:
df.loc[(df['ID'].str.contains("MX") == True), 'Value']/100
df.loc[(df['ID'].str.contains("BR") == True), 'Value']/1000
Is there any better way to do this? How can I apply the changes using df.loc to the main dataframe rather than have it appear in a series?
The desired output should be:
ID Value
US123 10000
US121 10000
MX122 100
MX125 100
BR123 10
BR127 10
Thanks!
After computing the divided values using .loc, the result must be re-assigned back to the DataFrame used to make the selection, as the operation is not in-place by itself.
Use str.startswith to check for string starting with a given pattern.
df.loc[df['ID'].str.startswith('MX'), 'Value'] /= 100
df.loc[df['ID'].str.startswith('BR'), 'Value'] /= 1000
df['Value'] = df['Value'].astype(int)
df
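If you prefer a single vectorized expression in the spirit of your np.where attempt, one possible alternative is np.select to pick a divisor per row (a sketch, not the answer above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['US123', 'US121', 'MX122', 'MX125', 'BR123', 'BR127'],
                   'Value': [10000] * 6})

# choose the divisor based on the ID prefix; everything else keeps divisor 1
divisor = np.select([df['ID'].str.startswith('MX'),
                     df['ID'].str.startswith('BR')],
                    [100, 1000],
                    default=1)
df['Value'] = (df['Value'] / divisor).astype(int)
np.select takes the first matching condition per row and falls back to the default of 1, so rows with other prefixes are left untouched.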
