I'm sure most of you will find this basic, but I'm somehow finding it very confusing to understand how to access a particular chunk in pandas and append it later. I know how to append the set, but I don't know how to identify the data belonging to a given chunk.
For example, imagine my table has 36000 records and I chunk it by 1200. Now I want to access just the 3rd chunk. How do I achieve that in pandas? I googled extensively but found no good results.
for df in pd.read_sql_query('select id from table;', conn, chunksize=1200):
    print(df)
Pandas - Slice Large Dataframe in Chunks
Thank you for pointing out this link. The fix is pretty simple!
df = pd.read_sql_query('select * from x', conn)
chunksize = 100
new_df = [df[i:i + chunksize] for i in range(0, df.shape[0], chunksize)]
Now, if you index into the resulting list and print it, you can see each chunk:
new_df[0]  # prints the data in the 1st chunk
new_df[1]  # prints the data in the 2nd chunk
new_df[2]  # prints the data in the 3rd chunk
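If the table is too big to pull into memory first, you can also take a single chunk straight from the chunked reader; a minimal sketch, reusing conn from the example above and assuming the chunks arrive in query order:
import pandas as pd
from itertools import islice

# read_sql_query with chunksize returns an iterator of DataFrames;
# islice skips the first two chunks and yields only the third
chunks = pd.read_sql_query('select id from table;', conn, chunksize=1200)
third_chunk = next(islice(chunks, 2, 3))
print(third_chunk)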
I've been searching for a solution to this for a while, and I'm really stuck! I have a very large text file, imported as a pandas dataframe, containing just two columns but with hundreds of thousands to millions of rows. The columns contain packet dumps: one is the packet data, formatted as ASCII representations of monotonically increasing integers, and the second is the packet time.
I want to go through this dataframe and make sure that it is monotonically increasing, and if there are missing data, insert new rows to make it so, i.e. the 'data' column should be filled in with the appropriate value but the time should be set to 'NaN' or 'NULL', etc.
The following is a sample of the data:
data frame_time_epoch
303030303030303000 1527986052.485855896
303030303030303100 1527986052.491020305
303030303030303200 1527986052.496127062
303030303030303300 1527986052.501301944
303030303030303400 1527986052.506439335
So I have two questions:
1) I've been trying to loop through the dataframe using itertuples to compare each row with the next one and, if the difference is more than 100, add a new row. Unfortunately I've struggled with this, since there doesn't seem to be a good way to retrieve the row after the current one.
2) Is there a better (faster) way to do this than the one I've proposed?
This may be trivial, though I've really struggled with it. Thank you in advance for your help.
One problem at a time. For the monotonicity check, you can simply test df.data.is_monotonic_increasing.
Inserting new indices: it is better to go the other way around. You already know the index you want; it is given by range(min_val, max_val + 1, 100). You can create a blank DataFrame with this index and update it using your data.
This may be memory intensive, so you may need to go over your data in chunks; in that case you need to know the index range ahead of time (see the chunked sketch after the example below).
import io
import pandas as pd

# test data
df = pd.read_csv(
    io.StringIO(
        """data frame_time_epoch
303030303030303000 1527986052.485855896
303030303030303100 1527986052.491020305
303030303030303200 1527986052.496127062
303030303030303300 1527986052.501301944
303030303030303500 1527986052.506439335"""
    ),
    sep=r" +",
    engine="python",
)
# check if the data is increasing
assert df.data.is_monotonic_increasing
# desired index range
rng = range(df.data.iloc[0], df.data.iloc[-1] + 1, 100)
# blank frame with full index
df2 = pd.DataFrame(index=rng, columns=["frame_time_epoch"])
# update with existing data
df2.update(df.set_index("data"))
# result
# frame_time_epoch
# 303030303030303000 1.52799e+09
# 303030303030303100 1.52799e+09
# 303030303030303200 1.52799e+09
# 303030303030303300 1.52799e+09
# 303030303030303400 NaN
# 303030303030303500 1.52799e+09
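If the file is too large to read in one go, the same blank-frame-and-update idea can be applied chunk by chunk; a rough sketch, assuming the min/max of data and the step of 100 are known ahead of time (the file name packets.txt and the chunk size are placeholders):
import pandas as pd

# assumed to be known up front (taken here from the sample data)
min_val, max_val, step = 303030303030303000, 303030303030303500, 100

# blank frame covering the full expected index
df2 = pd.DataFrame(index=range(min_val, max_val + 1, step),
                   columns=["frame_time_epoch"])

# fill it in one chunk at a time
for chunk in pd.read_csv("packets.txt", sep=r"\s+", chunksize=100_000):
    df2.update(chunk.set_index("data"))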
Just for examination: did you try something like
delta = df['data'].diff()
delta[delta > 0]     # the steps between consecutive rows (all positive if monotonic)
delta[delta < 100]   # steps smaller than the expected 100
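As a small follow-up sketch, the same diff can point at the gaps directly, assuming consecutive packets should differ by exactly 100:
delta = df['data'].diff()
gaps = delta[delta > 100]  # rows preceded by at least one missing packet
print(gaps)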
I am going to start off by stating that I am very much new to working in Python. I have a very rudimentary knowledge of SQL, but this is my first go-round with Python. I have a csv file of customer-related data and I need to output the records of customers who have spent more than $1000. I was also given this starter code:
import csv
import re

data = []
with open('customerData.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        data.append(row)

print(data[0])
print(data[1]["Name"])
print(data[2]["Spent Past 30 Days"])
I am not looking for anyone to give me the answer, but maybe a nudge in the right direction. I know that it opens the file for reading, builds a list (data), and then prints values from the first few rows. I am stuck trying to figure out how to reference a column value without limiting it to a specific row number. Do I need to make another list for the columns? Do I need to create a loop to output each record that meets the > 1000 criteria? Any advice would be much appreciated.
To get a particular column you could use a for loop. I'm not sure exactly what you're wanting to do with it, but this might be a good place to start.
for i in range(0, len(data)):
    print(data[i]['Name'])
len(data) should equal the number of rows, thus iterating through the entire column
The sample code does not reveal the underlying data structure. It looks like it may be a list of dicts, which does not make much sense here, so I'll guess at how data is organized. Assuming data is a list of lists, you can get at a column with a list comprehension:
data = [['Name','Spent Past 30 Days'],['Ernie',890],['Bert',1200]]
spent_column = [row[1] for row in data]
print(spent_column) # prints: ['Spent Past 30 Days', 890, 1200]
But you will probably want to know who is a big spender so maybe you should return the names:
data = [['Name','Spent Past 30 Days'],['Ernie',890],['Bert',1200]]
spent_names = [row[0] for row in data[1:] if int(row[1])>1000]
print(spent_names) # prints: ['Bert']
If the examples are unclear I suggest you read up on list comprehensions; they are awesome :)
You can do all of the above with regular for-loops as well.
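For example, a rough for-loop equivalent of the comprehension above:
spent_names = []
for row in data[1:]:              # skip the header row
    if int(row[1]) > 1000:
        spent_names.append(row[0])
print(spent_names)                # prints: ['Bert']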
What is the best/easiest way to split a very large data frame (50GB) into multiple outputs (horizontally)?
I thought about doing something like:
stepsize = int(1e8)
for id, i in enumerate(range(0, len(df), stepsize)):
    start = i
    end = i + stepsize  # iloc slicing excludes the end row
    df.iloc[start:end].to_csv('/data/bs_' + str(id) + '.csv.out')
But I bet there is a smarter solution out there?
As noted by jakevdp, HDF5 is a better way to store huge amounts of numerical data; however, it doesn't meet my business requirements.
This answer brought me to a satisfying solution using:
numpy.array_split(object, number_of_chunks)
for idx, chunk in enumerate(np.array_split(df, number_of_chunks)):
chunk.to_csv(f'/data/bs_{idx}.csv')
Use the id in the filename, else it will not work: if you leave the id out, every chunk ends up being written to the same file.
for id, df_i in enumerate(np.array_split(df, number_of_chunks)):
    df_i.to_csv('/data/bs_{id}.csv'.format(id=id))
I have 60 HUGE csv files (around 2.5 GB each). Each covers data for a month and has a 'distance' column I am interested in. Each has around 14 million rows.
I need to find the average distance for each month.
This is what I have so far:
import pandas as pd

for x in range(1, 60):
    df = pd.read_csv(r'x.csv', error_bad_lines=False, chunksize=100000)
    for chunk in df:
        print(chunk["distance"].mean())
First, I know 'print' is not a good idea; I need to assign the mean to a variable, I guess. Second, what I need is the average for the whole dataframe, not just for each chunk.
But I don't know how to do that. I was thinking of getting the average of each chunk and then taking the simple average of all the chunk averages. That should give me the average for the dataframe, as long as the chunk size is equal for all chunks.
Third, I need to do this for all 60 csv files. Is my loop for that correct in the code above? My files are named 1.csv to 60.csv .
A few things I would fix, based on how your files are named. I presume your files are named like "1.csv", "2.csv". Also remember that range is exclusive of the end value, so you need to go up to 61 in the range.
distance_array = []
for x in range(1, 61):
    df = pd.read_csv(str(x) + ".csv", error_bad_lines=False, chunksize=100000)
    for chunk in df:
        distance_array.extend(chunk['distance'])
print(sum(distance_array) / len(distance_array))
I am presuming that the datasets are too large to load into memory as a pandas dataframe. If that is the case, consider using a generator on each csv file, something similar to: Where to use yield in Python best?
As the overall result you are after is an average, you can accumulate the total sum and a row count incrementally as you go over the data.
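A minimal sketch of that running-total idea, one average per file (file names 1.csv to 60.csv, as in the question):
import pandas as pd

for x in range(1, 61):
    total = 0.0
    count = 0
    for chunk in pd.read_csv(str(x) + ".csv", chunksize=100000):
        total += chunk["distance"].sum()
        count += len(chunk)
    print(x, total / count)  # mean distance for this month's file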
I need to output data from pandas into CSV files to interact with a 3rd party developed process.
The process requires that I pass it no more than 100,000 records in a file, or it will cause issues (slowness, perhaps a crash).
That said, how can I write something that takes a dataframe in pandas and splits it into 100,000-record frames? Nothing would be different other than that the exported dataframes would be subsets of the parent dataframe.
I assume I could do a loop with something like this, but it seems remarkably inefficient...
First, take recordcount = len(df.index) to get the number of records, and then loop until I get there using something like
df1 = df[currentrecord:currentrecord + 100000]
And then exporting that to a CSV file
There has to be an easier way.
You can try something like this:
def save_df(df, chunk_size=100000):
    df_size = len(df)
    for i, start in enumerate(range(0, df_size, chunk_size)):
        df[start:start + chunk_size].to_csv('df_name_{}.csv'.format(i))
You could add a column with a group number and then use groupby:
df1['Dummy'] = [a for b in zip(*[range(N)] * 100000) for a in b][:len(df1)]
Where N is set to a value large enough, the minimum being:
N = int(np.ceil(len(df1) / 100000))
Then group by that column and apply the to_csv() function:
def save_group(df):
    df.drop('Dummy', axis=1).to_csv("Dataframe_" + str(df['Dummy'].iloc[0]) + ".csv")

df1.groupby('Dummy').apply(save_group)
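As a side note, a shorter way to build the same group column, sketched with integer division of the row position by the chunk size:
import numpy as np

# row 0-99999 get group 0, rows 100000-199999 get group 1, and so on
df1['Dummy'] = np.arange(len(df1)) // 100000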