resample pandas dataframe to an arbitrary number of rows - python

I have a loop in which a new data frame is populated with values during each step. The number of rows in the new dataframe is different for each step in the loop. At the end of the loop, I want to compare the dataframes, and in order to do so they all need to be the same length. Is there a way I can resample the dataframe at each step to an arbitrary number of rows (e.g. 5618)?

If your dataframe is too small by N rows, then you can randomly sample N rows with replacement and add them to the end of your original dataframe. If your dataframe is too big, then sample the desired number of rows from the original dataframe.
if len(df) < 5618:
    df1 = df.sample(n=5618 - len(df), replace=True)
    df = pd.concat([df, df1])
if len(df) > 5618:
    df = df.sample(n=5618)
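If this happens at every step of the loop, the same logic can be wrapped in a small helper. This is a minimal sketch; resample_to_n and its arguments are just illustrative names, not part of pandas:
import pandas as pd

def resample_to_n(df, n=5618, random_state=None):
    # pad with rows sampled with replacement if too short
    if len(df) < n:
        extra = df.sample(n=n - len(df), replace=True, random_state=random_state)
        df = pd.concat([df, extra])
    # downsample without replacement if too long
    elif len(df) > n:
        df = df.sample(n=n, random_state=random_state)
    return df.reset_index(drop=True)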

Related

Split Data Frame into New Dataframe for each consecutive Column

Looking to split the columns of this data frame into multiple data frames, each with the date column and one consecutive column. How do I write a function that can automate this? We would end up with n data frames, n being the number of columns in the original data frame minus 1 (the date column).
The first thing to do is to set the date column as the index:
df = df.set_index('Date')
Then, when you select a single column from the data frame, you will get a Series object with the date index and your column of interest:
e.g. df.P19245Y8E will give a Series for the second column.
I think this will do what you need, but if you really want to create separate dataframes for each column then you just iterate through the columns:
new_dfs = []
for col in df.columns:
    new_dfs.append(df[[col]])    # double brackets keep each piece as a DataFrame
or with a list comprehension:
new_dfs = [df[[col]] for col in df.columns]
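If each piece should keep Date as an explicit column rather than as the index, a dict comprehension keyed by column name is another option. This is a sketch that assumes the frame still has its 'Date' column (i.e. run before set_index):
# build one two-column DataFrame per data column, keyed by column name
new_dfs = {col: df[['Date', col]] for col in df.columns if col != 'Date'}
new_dfs['P19245Y8E']    # DataFrame with just the Date and P19245Y8E columns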

Aggregate Function to dataframe while retaining rows in Pandas

I want to aggregate my data based on a field known as COLLISION_ID and a count of each COLLISION_ID.
I want to remove repeating COLLISION_IDs, since they have the same Coordinates, but retain a count of occurrences in the original data-set.
My code is below
df2 = df1.groupby(['COLLISION_ID'])[['COLLISION_ID']].count()
This returns a frame with only the COLLISION_ID counts:
I would like my data returned as the COLLISION_ID numbers, the count, and the remaining columns of my data which are not shown here (~40 additional columns that will be filtered later).
If you are talking about filtering afterwards, we should use transform:
df1['count_col']=df1.groupby(['COLLISION_ID'])['COLLISION_ID'].transform('count')
Then you can filter df1 using the count_col column.
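To then drop the repeated COLLISION_IDs while keeping the count and the other ~40 columns, one option (a sketch, assuming keeping the first occurrence of each ID is acceptable) is:
# keep one row per COLLISION_ID; count_col and all other columns are retained
df2 = df1.drop_duplicates(subset='COLLISION_ID', keep='first')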

python - append only select columns as rows

The original file has multiple columns but there are lots of blanks, and I want to rearrange it so that there is one nice column with the info. Starting with 910 rows, 51 cols (newFile df) -> want 910+x rows, 3 cols (final df); the final df currently has 910 rows.
newFile sample
for i in range(0, len(newFile)):
    for j in range(0, 48):
        if pd.notnull(newFile.iloc[i, 3+j]):
            final = final.append(newFile.iloc[[i], [0, 1, 3+j]], ignore_index=True)
I have this piece of code to go through newFile and, if column 3+j is not null, copy columns 0, 1 and 3+j into a new row. I tried append(), but it adds not only rows but also a bunch of columns with NaNs again (like the original file).
Any suggestions?!
Your problem is that you are appending to a DataFrame while keeping the column names, so adding a row with a value for a new column fills that column with NaN for the rest of the dataframe.
Plus, your code is really inefficient given the double for loop.
Here is my solution using melt()
import numpy as np
import pandas as pd

# creating an example df with 51 columns
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 51)), columns=list('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXY'))
# reshaping df to a long version, keeping the first two columns as identifiers
df = df.melt(id_vars=df.columns[0:2])
# dropping the rows whose value is null
df.dropna(subset=['value'], inplace=True)
# if you want to keep the information about which column each value came from, stop here; otherwise drop it
df.drop(columns=['variable'], inplace=True)
print(df)
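Applied to the question's newFile layout, a sketch (assuming columns 0 and 1 are the two identifier columns and the data columns start at index 3, as in the original loop):
# melt the data columns into rows, keeping columns 0 and 1 as identifiers
final = newFile.melt(id_vars=newFile.columns[[0, 1]], value_vars=newFile.columns[3:])
# drop empty cells; drop 'variable' if you don't need to know which column each value came from
final = final.dropna(subset=['value']).drop(columns=['variable'])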

Random Sampling of Pandas data frame (both rows and columns)

I know how to randomly sample a few rows from a pandas data frame. Let's say I have a data frame df; to get a fraction of the rows, I can do:
df_sample = df.sample(frac=0.007)
However, what I need is random rows as above AND also random columns from the above data frame.
The df is currently 56K x 8.5K. If I want, say, 500 x 1000, where both the 500 rows and the 1000 columns are randomly sampled, how do I do this?
I think one approach would be to do something like
df.columns to get a list of the column names,
then do some random sampling of the indices of this list and use those random indices to filter out the remaining columns?
Just call sample twice, with corresponding axis parameters:
df.sample(n=500).sample(n=1000, axis=1)
For the first call, axis=0 is the default. The first sampling samples rows, while the second samples columns.
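A quick usage sketch with a fixed random_state for reproducibility (the seed value here is arbitrary):
# 500 random rows, then 1000 random columns from those rows
df_sample = df.sample(n=500, random_state=42).sample(n=1000, axis=1, random_state=42)
df_sample.shape    # (500, 1000)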

Optimal filling of pandas DataFrame column by matching values in another DataFrame

Basically I have two DataFrames and want to re-populate a column of the second by matching three row elements of the second with the first. To give an example, I have columns "Period" and "Hub" in both DataFrames. For each row in the second DataFrame, I want to take the value of Index (which is a date) and "Product"/"Hub" (which are strings) and find the row in the first DataFrame that has these same values (in the corresponding columns) and return the value of "Period" from that row. I can then populate my row in the second DataFrame with this value.
I have a working solution, but it's really slow. Perhaps this is just due to the size of the DataFrames (approx. 100k rows) but it's taking over an hour to process!
Anyway, this is my working solution - any tips on how to speed it up would be really appreciated!
def selectData(hub, product):
    qry = "Hub=='"+hub+"' and Product=='"+product+"'"
    return data_1.query(qry)
data_2["Period"] = data_2.apply(lambda row: selectData(row["Hub"], row["Product"]).ix[row.index, "Period"], axis=1)
EDIT: I should note that the first DataFrame is guaranteed to have a unique result to my query but contains a larger set of data than that required to populate data_2
EDIT2: I just realised this is not in fact a working solution...
If I understand your problem correctly, you want to merge these 2 dataframes on the index (date), Product and Hub, and obtain Period from data_1.
I don't have your data, but I tested it on random ints. It should be very fast with 100k rows in data_1:
import numpy as np
import pandas as pd

# data_1 is the larger DataFrame
n = 100000
data_1 = pd.DataFrame(np.random.randint(1, 100, (n, 3)),
                      index=pd.date_range('2012-01-01', periods=n, freq='1Min').date,
                      columns=['Product', 'Hub', 'Period']).drop_duplicates()
data_1.index.name = 'Date'

# data_2 is a random subset, without the Period column
data_2 = data_1.iloc[np.random.randint(0, len(data_1), 1000)][['Product', 'Hub']]
To join on index + some columns, you can do this:
data_3 = data_2.reset_index().merge(data_1.reset_index(), on=['Date','Product','Hub'], how='left')
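As a follow-up, if you want Date back as the index and a quick check that every row found a match (a sketch using the names from the example above):
# restore Date as the index and verify that every row of data_2 picked up a Period
data_3 = data_3.set_index('Date')
data_3['Period'].isna().sum()    # 0 means every row of data_2 matched a row in data_1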
