generate output files with random samples from pandas dataframe - python

I have a dataframe with 500K rows. I need to distribute sets of 100 randomly selected rows to volunteers for labeling.
For example:
df = pd.DataFrame(np.random.randint(0,450,size=(450,1)),columns=list('a'))
I can remove a random sample of 100 rows and output a file with a timestamp:
df_subset=df.sample(100)
df_subset.to_csv(time.strftime('%Y%m%d_%H%M%S') + 'dfsample.csv')
df=df.drop(df_subset.index)
The above works, but if I try to apply it to the entire example dataframe:
while len(df) > 0:
    df_subset = df.sample(100)
    df_subset.to_csv(time.strftime('%Y%m%d_%H%M%S') + 'dfsample.csv')
    df = df.drop(df_subset.index)
it runs continuously. My expected output is 5 timestamped dfsample.csv files, 4 with 100 rows each and the fifth with 50 rows, all randomly selected from df. However, df.drop(df_subset.index) doesn't seem to update df, so the condition is always true and it runs forever, generating csv files. I'm having trouble solving this problem.
Any guidance would be appreciated.
UPDATE
This gets me almost there:
for i in range(4):
    df_subset = df.sample(100)
    df = df.drop(df_subset.index)
    time.sleep(1)  # added because it runs too fast for unique naming
    df_subset.to_csv(time.strftime('%Y%m%d_%H%M%S') + 'dfsample.csv')
It requires me to specify the number of files. If I say 5 for the example df, I get an error on the 5th iteration (only 50 rows remain, so sample(100) fails). I hoped for 5 files with the 5th having 50 rows, but I'm not sure how to do that.

After running your code, I think the problem is not with df.drop but with the line containing time.strftime('%Y%m%d_%H%M%S') + 'dfsample.csv': the loop writes several CSV files within the same second, so they share a filename and overwrite each other.
If you want to label files using a timestamp, going down to the sub-second level (%f gives microseconds) prevents the overwriting. In your case:
from datetime import datetime

while len(df) > 0:
    # sample at most 100 rows; the final file simply gets whatever is left (50 here)
    df_subset = df.sample(min(100, len(df)))
    df_subset.to_csv(datetime.now().strftime("%Y%m%d_%H%M%S.%f") + 'dfsample.csv')
    df = df.drop(df_subset.index)

Another way is to shuffle your rows and get rid of that awful loop.
df.sample(frac=1)
and save slices of the shuffled dataframe.
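A minimal sketch of that shuffle-and-slice approach, keeping the question's chunk size of 100; numbering the files here is just an illustration in place of the timestamp naming:
shuffled = df.sample(frac=1)  # shuffle all rows once

# write consecutive 100-row slices; the last file gets the remainder (50 rows here)
for i, start in enumerate(range(0, len(shuffled), 100)):
    shuffled.iloc[start:start + 100].to_csv('dfsample_{}.csv'.format(i))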

Related

How to limit rows in pandas dataframe?

How do I limit the number of rows in a pandas dataframe in Python code? I need the last 1000 rows; the rest should be deleted.
For example: 1000 rows in the pandas dataframe -> 1000 rows in the csv.
I tried df.iloc[:1000]
I need to automatically clean the pandas dataframe and save the last 1000 rows.
If you want the first 1000 records you can use:
df = df.head(1000)
With df.iloc[:1000] you get the first 1000 rows.
Since you want to get the last 1000 rows, you have to change this line a bit to df_last_1000 = df.iloc[-1000:]
To save it as a csv file you can use pandas' to_csv() method: df_last_1000.to_csv("last_1000.csv")
Are you trying to limit the number of rows when importing a csv, or when exporting a dataframe to a new csv file?
Importing first 1000 rows of csv:
df_limited = pd.read_csv(file, nrows=1000)
Get first 1000 rows of a dataframe (for export):
df_limited = df.head(1000)
Get last 1000 rows of a dataframe (for export):
df_limited = df.tail(1000)
Edit 1
As you are exporting a csv:
You can make a range selection with [n:m] where n is the starting point of your selection and m is the end point.
It works like this:
If the number is positive, it's counting from the top of the list, beginning of the string, top of the dataframe etc.
If the number is negative, it counts from the back.
[5:] selects everything from index 5 to the end (as there is no end point given)
[3:8] selects everything from index 3 up to, but not including, index 8
[5:-2] selects everything from index 5 up to, but not including, the 2nd element from the back
[-1000:] the start point is 1000 elements from the back, the end point is the last element (this is what you wanted, I think)
[:1000] selects the first 1000 lines (the start point is the beginning, as there is no number given; the end point is 1000 elements from the front)
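Applied to the export case, a small sketch (the example frame and file name are placeholders, not from the question):
import pandas as pd

# placeholder frame; replace with your real data
df = pd.DataFrame({"value": range(5000)})

df_last_1000 = df.iloc[-1000:]  # equivalent to df.tail(1000)
df_last_1000.to_csv("last_1000.csv", index=False)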
Edit 2
After a quick check (and a very simple benchmark) it looks like df.tail(1000) is significantly faster than df.iloc[-1000:]

pandas inserting rows in a monotonically increasing dataframe using itertuples

I've been searching for a solution to this for a while, and I'm really stuck! I have a very large text file, imported as a pandas dataframe containing just two columns but with hundreds of thousands to millions of rows. The columns contain packet dumps: one is the packet data, formatted as ASCII representations of monotonically increasing integers, and the second is the packet time.
I want to go through this dataframe and make sure that it is monotonically increasing, and if there are missing data, to insert new rows in order to make the list monotonically increasing, i.e. the 'data' column should be filled in with the appropriate value, but the time should be set to 'NaN' or 'NULL', etc.
The following is a sample of the data:
data frame_time_epoch
303030303030303000 1527986052.485855896
303030303030303100 1527986052.491020305
303030303030303200 1527986052.496127062
303030303030303300 1527986052.501301944
303030303030303400 1527986052.506439335
So I have two questions:
1) I've been trying to loop through the dataframe using itertuples to compare the next row with the current row and, if the difference is more than 100, to add a new row. Unfortunately I've struggled with this, since there doesn't seem to be a good way to retrieve the row after the one being processed.
2) Is there a better (faster) way to do this than the one I've proposed?
This may be trivial, though I've really struggled with it. Thank you in advance for your help.
One problem at a time. You can check this directly with df.data.is_monotonic_increasing.
Inserting new indices: it is better to go the other way around. You already know the index you want. It is given by range(min_val, max_val+1, 100). You can create a blank DataFrame with this index and update it using your data.
This may be memory intensive, so you may need to go over your data in chunks. In that case, you may need to provide the index range ahead of time.
import io

import pandas as pd

# test data
df = pd.read_csv(
    io.StringIO(
        """data frame_time_epoch
303030303030303000 1527986052.485855896
303030303030303100 1527986052.491020305
303030303030303200 1527986052.496127062
303030303030303300 1527986052.501301944
303030303030303500 1527986052.506439335"""
    ),
    sep=r" +",
    engine="python",
)
# check if the data is increasing
assert df.data.is_monotonic_increasing
# desired index range
rng = range(df.data.iloc[0], df.data.iloc[-1] + 1, 100)
# blank frame with full index
df2 = pd.DataFrame(index=rng, columns=["frame_time_epoch"])
# update with existing data
df2.update(df.set_index("data"))
# result
# frame_time_epoch
# 303030303030303000 1.52799e+09
# 303030303030303100 1.52799e+09
# 303030303030303200 1.52799e+09
# 303030303030303300 1.52799e+09
# 303030303030303400 NaN
# 303030303030303500 1.52799e+09
Just for examination: did you try something like
delta = df['data'].diff()
delta[delta > 0]    # keeps only the positive steps
delta[delta < 100]  # keeps the steps smaller than the expected spacing of 100

Find average in large csv file using pandas

I have 60 HUGE csv files (around 2.5 GB each). Each covers data for a month and has a 'distance' column I am interested in. Each has around 14 million rows.
I need to find the average distance for each month.
This is what I have so far:
import pandas as pd

for x in range(1, 60):
    df = pd.read_csv(r'x.csv', error_bad_lines=False, chunksize=100000)
    for chunk in df:
        print chunk["distance"].mean()
First, I know 'print' is not a good idea; I need to assign the mean to a variable, I guess. Second, what I need is the average for the whole dataframe and not just for each chunk.
But I don't know how to do that. I was thinking of getting the average of each chunk and taking the simple average of all the chunks. That should give me the average for the dataframe as long as the chunksize is equal for all chunks.
Third, I need to do this for all 60 csv files. Is my looping for that correct in the code above? My files are named 1.csv to 60.csv.
A few things I would fix, based on how your files are named. I presume your files are named like "1.csv", "2.csv", and so on. Also remember that range excludes the end point, so you would need to go up to 61 in the range.
distance_array = []
for x in range(1, 61):
    # read each monthly file in chunks and collect every distance value
    reader = pd.read_csv(str(x) + ".csv", error_bad_lines=False, chunksize=100000)
    for chunk in reader:
        distance_array.extend(chunk['distance'])
print(sum(distance_array) / len(distance_array))
I am presuming that the datasets are too large to load into memory as a pandas dataframe. If that is the case, consider using a generator on each csv file, something similar to: Where to use yield in Python best?
As the overall result that you are after is an average, you can accumulate the total sum over the rows and keep an incremental count of how many rows you have seen.
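A minimal sketch of that running-total idea, assuming the files are named 1.csv to 60.csv and each has a 'distance' column (one average per monthly file):
import pandas as pd

for x in range(1, 61):
    total = 0.0
    count = 0
    # stream the file in chunks; only the running totals stay in memory
    for chunk in pd.read_csv(str(x) + ".csv", usecols=["distance"], chunksize=100000):
        total += chunk["distance"].sum()
        count += len(chunk)
    print(str(x) + ".csv: average distance =", total / count)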

How to write a pandas DataFrame to a CSV by fixed size chunks

I need to output data from pandas into CSV files to interact with a 3rd party developed process.
The process requires that I pass it no more than 100,000 records in a file, or it will cause issues (slowness, perhaps a crash).
That said, how can I write something that takes a pandas dataframe and splits it into frames of 100,000 records each? Nothing would be different other than that the exported dataframes would be subsets of the parent dataframe.
I assume I could do a loop with something like this, but it would probably be remarkably inefficient:
First, taking recordcount = len(df.index) to get the number of records and then looping until I get there, using something like
df1 = df[currentrecord:currentrecord+100000]
And then exporting that to a CSV file
There has to be an easier way.
You can try something like this:
def save_df(df, chunk_size=100000):
    df_size = len(df)
    for i, start in enumerate(range(0, df_size, chunk_size)):
        df[start:start + chunk_size].to_csv('df_name_{}.csv'.format(i))
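Called on the dataframe from the question, this would write one file per 100,000-row slice (df_name_0.csv, df_name_1.csv, and so on, following the name pattern hard-coded above):
save_df(df)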
You could add a column with a group, and then use the function groupby:
df1['Dummy'] = [a for b in zip(*[range(N)] * 100000) for a in b][:len(df1)]
Where N is set to a value large enough, the minimum being:
N = int(np.ceil(len(df1) / 100000))
Then group by that column and apply a function that writes each group with to_csv():
def save_group(df):
    df.drop('Dummy', axis=1).to_csv("Dataframe_" + str(df['Dummy'].iloc[0]) + ".csv")

df1.groupby('Dummy').apply(save_group)

Fit Data in Pandas DataFrame

I am querying a database for a few variables from an experiment, one at a time, and storing the data in a Pandas DataFrame. I can get the data that I need; it looks like this, for instance:
file time variableid data
0 1 1503657 1 11
1 1 1503757 1 22
There is data for several variables that I will be grabbing like this, and then I will be combining them into a single DataFrame to be output to a csv. Each variable's data column will be added as a new column with the corresponding name of the variable (as the file_id should always be the same). The time column values might be different (one DF could be longer than the other, the data wasn't sampled at all of the same times, etc.), but if I merge the tables on the time (and file) column, then any discrepancies are filled in with NaN (and I will fill them in with DF.fillna(0)) and the DF can be re-sorted by the time.
What I need though is a way to filter the data so that it fits a certain rate, such as every 100 milliseconds (1503700,1503800,...). The datapoint itself doesn't have to fit that rate exactly (and in fact the data rarely falls on a time that ends in 00 for instance), but it should be the closest matching data for that time (it could be the closest before or after that time actually, as long as it is consistent throughout).
I thought about iterating over all the values in the time column and adding the row with the closest time one by one (I would first create a blank DF with the desired times), but there are sometimes 50,000+ rows in a sample table. I found an answer about interpolating (link below), but I don't really want to add or modify any of the data itself, just pull the rows that most closely match the rate at which I want to sample the data (one reason is that some of the data is binary, and I wouldn't want to end up with something like 0.5 because the values before and after the desired time were 0 and 1). Any help is greatly appreciated, thanks.
combining pandas dataframes of different sampling rates
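For what it's worth, one way to express that closest-match idea without modifying any values is pd.merge_asof with direction='nearest'; below is a minimal sketch with made-up numbers laid out like the question's columns (not the actual data):
import pandas as pd

# made-up data at irregular times, in the question's column layout
data = pd.DataFrame({
    "time": [1503657, 1503757, 1503861, 1503949],
    "variableid": [1, 1, 1, 1],
    "data": [11, 22, 33, 44],
})

# the desired regular grid, every 100 ms
grid = pd.DataFrame({"time": range(1503700, 1504000, 100)})

# for each grid time, pull the existing row whose time is closest
# (before or after); both frames must be sorted on the merge key
nearest = pd.merge_asof(grid, data.sort_values("time"), on="time", direction="nearest")
print(nearest)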
