How to limit rows in pandas dataframe? - python

How do I limit the number of rows in a pandas dataframe in Python? I need only the last 1000 rows; the rest should be deleted.
For example: 1000 rows in the pandas dataframe -> 1000 rows in the csv.
I tried df.iloc[:1000]
I need to automatically clean the pandas dataframe and save only the last 1000 rows.

If you want the first 1000 records, you can use:
df = df.head(1000)

With df.iloc[:1000] you get the first 1000 rows.
Since you want the last 1000 rows, you have to change this line slightly to df_last_1000 = df.iloc[-1000:]
To save it as a csv file you can use pandas' to_csv() method: df_last_1000.to_csv("last_1000.csv")

Are you trying to limit the number of rows when importing a csv, or when exporting a dataframe to a new csv file?
Importing first 1000 rows of csv:
df_limited = pd.read_csv(file, nrows=1000)
Get first 1000 rows of a dataframe (for export):
df_limited = df.head(1000)
Get last 1000 rows of a dataframe (for export):
df_limited = df.tail(1000)
Edit 1
As you are exporting a csv:
You can make a range selection with [n:m], where n is the starting point of your selection and m is the end point (which is not included).
It works like this:
If the number is positive, it's counting from the top of the list, beginning of the string, top of the dataframe etc.
If the number is negative, it counts from the back.
[5:] selects everything from the 5th element to the end (as there is no end point given)
[3:8] selects everything from the 3rd element up to, but not including, the 8th
[5:-2] selects everything from the 5th element up to, but not including, the 2nd element from the back
[-1000:] the start point is 1000 elements from the back, the end point is the last element (this is what you wanted, I think)
[:1000] selects the first 1000 lines (the start point is the beginning, as there is no number given, and the end point is 1000 elements from the front)
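To make that concrete, here is a toy sketch (the column name and sizes are made up for illustration):
import pandas as pd

df = pd.DataFrame({"value": range(10)})  # 10 rows, just for illustration

print(df.iloc[5:])     # from the row at position 5 to the end
print(df.iloc[3:8])    # positions 3 through 7 (the end point is excluded)
print(df.iloc[-3:])    # the last 3 rows (same idea as df.iloc[-1000:])
print(df.iloc[:3])     # the first 3 rows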
Edit 2
After a quick check (and a very simple benchmark) it looks like df.tail(1000) is significantly faster than df.iloc[-1000:]
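If you want to reproduce such a comparison yourself, a rough sketch with the standard timeit module could look like this (the frame size and repeat count are arbitrary, and the numbers will vary by machine and pandas version):
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.random.randn(1_000_000)})

print(timeit.timeit(lambda: df.tail(1000), number=10_000))
print(timeit.timeit(lambda: df.iloc[-1000:], number=10_000))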

Related

generate output files with random samples from pandas dataframe

I have a dataframe with 500K rows. I need to distribute sets of 100 randomly selected rows to volunteers for labeling.
for example:
df = pd.DataFrame(np.random.randint(0,450,size=(450,1)),columns=list('a'))
I can remove a random sample of 100 rows and output a file with a time stamp:
df_subset=df.sample(100)
df_subset.to_csv(time.strftime('%Y%m%d_%H%M%S') + 'dfsample.csv')
df=df.drop(df_subset.index)
The above works, but if I try to apply it to the entire example dataframe:
while len(df) > 0:
    df_subset = df.sample(100)
    df_subset.to_csv(time.strftime('%Y%m%d_%H%M%S') + 'dfsample.csv')
    df = df.drop(df_subset.index)
it runs continuously. My expected output is 5 timestamped dfsample.csv files, 4 of which have 100 rows and the fifth 50 rows, all randomly selected from df. However, df.drop(df_subset.index) doesn't seem to update df, so the condition is always true and it keeps generating csv files forever. I'm having trouble solving this problem.
Any guidance would be appreciated.
UPDATE
This gets me almost there:
for i in range(4):
    df_subset = df.sample(100)
    df = df.drop(df_subset.index)
    time.sleep(1)  # added because it runs too fast for unique naming
    df_subset.to_csv(time.strftime('%Y%m%d_%H%M%S') + 'dfsample.csv')
It requires me to specify the number of files, though. If I say 5 for the example df, I get an error on the 5th iteration. I hoped for 5 files, with the 5th having 50 rows, but I'm not sure how to do that.
After running your code, I think the problem is not with df.drop
but with the line containing time.strftime('%Y%m%d_%H%M%S') + 'dfsample.csv': pandas creates multiple CSV files within the same second, so later files overwrite earlier ones.
If you want to label files using a timestamp, going down to the sub-second (microsecond) level is more useful and prevents the possibility of an overwrite. In your case:
from datetime import datetime

while len(df) > 0:
    df_subset = df.sample(100)
    df_subset.to_csv(datetime.now().strftime("%Y%m%d_%H%M%S.%f") + 'dfsample.csv')
    df = df.drop(df_subset.index)
Another way is to shuffle your rows and get rid of that awful loop.
df.sample(frac=1)
and save slices of the shuffled dataframe.
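For example, a rough sketch of that idea (the dfsample_<i>.csv file names are just illustrative, used instead of timestamps to sidestep the overwriting issue; the last chunk ends up with the 50 leftover rows):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 450, size=(450, 1)), columns=list('a'))

shuffled = df.sample(frac=1)                      # shuffle all rows once
for i, start in enumerate(range(0, len(shuffled), 100)):
    chunk = shuffled.iloc[start:start + 100]      # 100 rows each, 50 in the final chunk
    chunk.to_csv(f'dfsample_{i}.csv')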

Manipulate string based on characters in Python

I'm working on some data manipulations with time intervals, and have two time formats in the pandas dataframe. Every first occurrence of the time interval is duplicated (1:221:22 in the example below), and the second occurrence is in quotations and preceded by two commas. How can I manipulate the data as effectively as possible?
From example data:
obs1, 1:221:22,
obs2, ",,1:22"
To:
obs1, 1:22,
obs2, 1:22
First you need a filter to separate how to treat the rows.
filter_commas = (df[column_name].str.startswith(",,"))
Then you treat each group based on your data.
# First remove all the commas at the start
df.loc[filter_commas, column_name] = df.loc[filter_commas, column_name].str.replace(",", "")
Then you have to split the data for the rows that aren't prefixed with commas (the ones with the duplicated interval):
# Split the rest of the rows in half
df.loc[~filter_commas, column_name] = df.loc[~filter_commas, column_name].apply(lambda row_val: row_val[:len(row_val) // 2])
The code may need tweaking, but this should put you on the right track.
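For reference, a small self-contained version of that approach using the example data (the column names are made up):
import pandas as pd

df = pd.DataFrame({
    "obs": ["obs1", "obs2"],
    "interval": ["1:221:22", ",,1:22"],
})

filter_commas = df["interval"].str.startswith(",,")

# rows like ",,1:22": just strip the commas
df.loc[filter_commas, "interval"] = df.loc[filter_commas, "interval"].str.replace(",", "", regex=False)

# rows like "1:221:22": keep the first half of the duplicated string
df.loc[~filter_commas, "interval"] = df.loc[~filter_commas, "interval"].apply(lambda s: s[:len(s) // 2])

print(df)  # both rows now contain "1:22"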

Insert a series of values into pd.dataframe randomly

I have a large dataframe, and what I want to do is overwrite X entries of that dataframe with a new value I set. The new entries have to be at a random position, but they have to be consecutive. Like I have a column with random numbers, and want to overwrite 20 of them in a row with the new value x.
I tried df.sample(x) and then updating the dataframe, but I only get individual entries. I need the X new entries to be in a row (consecutive).
Does somebody have a solution? I'm quite new to Python and have to get into it for my master's thesis.
CLARIFICATION:
My dataframe has 5 columns with almost 60,000 rows, each row for 10 minutes of the year.
One Column is 'output' with electricity production values for that 10 minutes.
For 2 consecutive hours (120 consecutive minutes, hence 12 consecutive rows) of the year I want to lower that production to 60%. I want it to happen at a random time of the year.
Another column is 'status', with information about if the production is reduced or not.
I tried:
df_update = df.sample(12)
df_update.status = 'reduced'
df.update(df_update)
df.loc[df['status'] == 'reduced', ['production']] *= 0.6
which does the trick for the total amount of time (12*10 minutes), but I want the 120 minutes to be consecutive, not scattered.
I decided to get a random value and just index the next 12 entries to be 0.6. I think this is what you want.
import numpy as np
import pandas as pd

df = pd.DataFrame({'output': np.random.randn(20), 'status': [0] * 20})
idx = df.sample(1).index.values[0]        # pick a random starting row
df.loc[idx:idx + 11, "output"] = 0.6      # .loc slicing is inclusive, so this covers 12 rows
df.loc[idx:idx + 11, "status"] = 1
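If you want to stay closer to the question's wording (scale the existing values to 60% instead of overwriting them, and keep the whole 12-row block inside the frame), a variant could look like this; the column names follow the question and the data values are made up:
import numpy as np
import pandas as pd

df = pd.DataFrame({'production': np.random.rand(60_000) * 100, 'status': 'normal'})

start = np.random.randint(0, len(df) - 12 + 1)   # make sure all 12 rows fit
block = df.index[start:start + 12]               # 12 consecutive rows = 120 minutes

df.loc[block, 'production'] *= 0.6               # lower production to 60%
df.loc[block, 'status'] = 'reduced'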

Dataframe - Sum Arrays in Cell - Memory Issue

I have a very large dataframe (close to 1 million rows), which has a couple of meta data columns and one single column that contains a long string of triples. One string could look like this:
0,0,123.63;10,360,2736.11;30,270,98.08;...
That is, three values separated by commas, with the triples separated by semicolons. Let us refer to the three values as IN, OUT, MEASURE. Effectively I want to group my data by the original columns + the IN & OUT columns and then sum over the MEASURE column. Since each long string contains roughly 30 triples, my dataframe would grow to ~30 million rows if I simply unstacked the data. Obviously this is not feasible.
So given a set of columns (which may in- or exclude the IN & OUT columns) over which I want to group and then sum my MEASURE data, how would I efficiently strip out the relevant data and sum everything up without blowing up my memory?
My current solution simply loops over each row and then over each triple and keeps a running total of each group I specified. This is very slow, so I am looking for something faster, perhaps vectorised. Any help would be appreciated.
Edit: Sample data below (columns separated by pipe)
DATE|REGION|PRIORITY|PARAMETERS
10-Oct-2016|UK|High|0,0,77.82;30,90,7373.70;
10-Oct-2016|US|Low|0,30,7.82;30,90,733.70;
11-Oct-2016|UK|High|0,0,383.82;40,90,713.75;
12-Oct-2016|NA|Low|40,90,937.11;30,180,98.23;
where PARAMETERS has the form 'IN,OUT,MEASURE;IN,OUT,MEASURE;...'
I basically want to (as an example) create a pivot table where
values=MEASURE
index=DATE, IN
columns=PRIORITY
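A rough sketch of one chunked approach (untested against the real data; the data.psv file name and chunk size are assumptions, and the pipe separator follows the sample above):
import pandas as pd

def grouped_measure(csv_path, group_cols=('DATE', 'IN', 'PRIORITY'), chunksize=100_000):
    # Sum MEASURE per group without materialising the full ~30-million-row frame.
    partials = []
    for chunk in pd.read_csv(csv_path, sep='|', chunksize=chunksize):
        # "0,0,77.82;30,90,7373.70;" -> ["0,0,77.82", "30,90,7373.70"], one row per triple
        chunk['PARAMETERS'] = chunk['PARAMETERS'].str.strip(';').str.split(';')
        exploded = chunk.explode('PARAMETERS')
        parts = exploded['PARAMETERS'].str.split(',', expand=True)
        exploded['IN'] = parts[0].astype(int)
        exploded['OUT'] = parts[1].astype(int)
        exploded['MEASURE'] = parts[2].astype(float)
        partials.append(exploded.groupby(list(group_cols))['MEASURE'].sum())
    # combine the per-chunk sums, then sum again across chunks
    return pd.concat(partials).groupby(level=list(group_cols)).sum()

result = grouped_measure('data.psv')
pivot = result.reset_index().pivot_table(values='MEASURE', index=['DATE', 'IN'],
                                         columns='PRIORITY', aggfunc='sum')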

Pandas Dataframe selecting groups with minimal cardinality

I have a problem where I need to take groups of rows from a data frame where the number of items in a group exceeds a certain number (cutoff). For those groups, I need to take some head rows and the tail row.
I am using the code below
train = train[train.groupby('id').id.transform(len) > headRows]
groups = pd.concat([train.groupby('id').head(headRows),train.groupby('id').tail(1)]).sort_index()
This works, but the first line is very slow :( 30 minutes or more.
Is there any way to make the first line faster? If I do not use the first line, there are duplicate indices in the result of the second line, which messes things up.
Thanks in advance
Regards
Note:
My train data frame has around 70,000 groups of varying size over around 700,000 rows. It actually follows from my other question, as can be seen here: Data processing with adding columns dynamically in Python Pandas Dataframe.
Jeff gave a great answer there, but it fails if the group size is less than or equal to the parameter I pass in head(parameter) when concatenating my rows as in Jeff's answer: In [31]: groups = concat.....
Use groupby/filter:
>>> df.groupby('id').filter(lambda x: len(x) > cutoff)
This will just return the rows of your dataframe where the size of the group is greater than your cutoff. Also, it should perform quite a bit better. I timed filter here with a dataframe with 30,039 'id' groups and a little over 4 million observations:
In [9]: %timeit df.groupby('id').filter(lambda x: len(x) > 12)
1 loops, best of 3: 12.6 s per loop
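To get the question's full result (some head rows plus the tail row of each qualifying group), you could feed the filtered frame into the original head/tail logic; a rough sketch reusing the question's train, headRows and cutoff names, assuming cutoff >= headRows so the tail row never duplicates a head row:
import pandas as pd

filtered = train.groupby('id').filter(lambda x: len(x) > cutoff)
groups = pd.concat([
    filtered.groupby('id').head(headRows),   # first headRows rows of each group
    filtered.groupby('id').tail(1),          # plus the last row of each group
]).sort_index()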
