If I have a DataFrame that is composed of 100 rows w/ 4 columns for the sake of example, how can I create 5 new DataFrames that are each composed of 20 rows w/ 4 columns?
That is, if an arbitrary column of the original DataFrame holds the list [0, 1, 2, 3, ..., 98, 99], how would I create 5 new DataFrames such that the first DataFrame's arbitrary column holds the list [0, 1, 2, ..., 19], the second DataFrame's arbitrary column holds the list [20, 21, 22, ..., 39], and so on?
I tried the following on a DataFrame consisting of a single column A that holds the list [0, 1, 2, 3, ..., 98, 99], but it gave me 100 CSV files each w/ a single row rather than the desired 5 CSV files each w/ 20 rows:
import pandas as pd
import numpy as np
list = []
for i in range (0, 100):
    list.append(i)
df = pd.DataFrame(data=list, columns=['A'])
groups = df['A'].groupby(np.arange(len(df['A']/10)))
for (frameno, frame) in groups:
    frame.to_csv("/Users/ephemeralhappiness/Desktop/Cycle Test/" + "%s.csv" % frameno)
Just change your groupby to:
# to get 5 groups
nrows = 20
groups = df.groupby(df.index // nrows)
print(groups.ngroups)
5
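If you then want one CSV per 20-row chunk, a minimal sketch reusing the output path from your original loop (each file ends up with 20 rows):
for frameno, frame in groups:
    frame.to_csv("/Users/ephemeralhappiness/Desktop/Cycle Test/" + "%s.csv" % frameno)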
I have a dataframe that looks like
   RMSE       SELECTED DATA        information
0   100    [12, 15, 19, 13]  (arr1, str1, fl1)
1   200          [7, 12, 3]  (arr2, str2, fl2)
2   300  [5, 9, 3, 3, 3, 3]  (arr3, str3, fl3)
Here, I want to break up the information column into three distinct columns: the first column containing the arrays, the second column containing the strings, and the last column containing the floats. Thus the new dataframe would look like:
   RMSE       SELECTED DATA  ARRAYS  STRING  FLOAT
0   100    [12, 15, 19, 13]    arr1    str1    fl1
1   200          [7, 12, 3]    arr2    str2    fl2
2   300  [5, 9, 3, 3, 3, 3]    arr3    str3    fl3
I thought one way would be to isolate the information column and then slice it using .apply like so:
df['arrays'] = df['information'].apply(lambda row : row[0])
and do this for each entry. But I was curious whether there is a better way to do this, since with many more entries it may become tedious or slow with a for loop.
Let us recreate the dataframe
tojoin = pd.DataFrame(df.pop('information').to_numpy().tolist(),
                      index=df.index,
                      columns=['ARRAYS', 'STRING', 'FLOAT'])
df = df.join(tojoin)
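A quick way to sanity-check this on a toy frame (the tuple contents below are just stand-ins for your arr/str/float values):
import pandas as pd
# hypothetical toy frame mirroring the question's layout
df = pd.DataFrame({'RMSE': [100, 200],
                   'SELECTED DATA': [[12, 15, 19, 13], [7, 12, 3]],
                   'information': [('arr1', 'str1', 1.0), ('arr2', 'str2', 2.0)]})
tojoin = pd.DataFrame(df.pop('information').to_numpy().tolist(),
                      index=df.index,
                      columns=['ARRAYS', 'STRING', 'FLOAT'])
df = df.join(tojoin)
print(df.columns.tolist())
# ['RMSE', 'SELECTED DATA', 'ARRAYS', 'STRING', 'FLOAT']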
I have a large df with coordinates in multiple dimensions. I am trying to create classes (objects) based on a threshold difference between the coordinates. An example df is below:
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [10, 14, 5, 14, 3, 12],'z': [7, 6, 2, 43, 1, 40]})
Based on this df, I want to assign each row to a class based on a +/- 2 threshold across all coordinates, so the df will have a unique group name added to each row. The output for this threshold function is:
x   y   z  group
1  10   7      -
2  14   6      -
3   5   2     G1
4  14  43      -
5   3   1     G1
6  12  40      -
It is similar to clustering, but I want to work with my own threshold functions. How can this be done in Python?
EDIT
To clarify: the threshold is based on similar coordinates. All rows within the +/- threshold across all coordinates will be grouped as a single object. It can also be seen as grouping rows based on a threshold across all columns and assigning a unique label to each group.
As far as I understood, what you need is the apply function. It was not very clear from your statement whether you need all the pairwise differences between the coordinates, or just the neighbouring differences (x-y and y-z). Row 5 has a difference of 4 between the x and z coordinates, but is still assigned to class G1.
That's why I wrote it for both possibilities, and you can just choose the one you need:
import pandas as pd
import numpy as np
def your_specific_function(row):
    '''
    For all differences use this:
    diffs = np.array([abs(row.x-row.y), abs(row.y-row.z), abs(row.x-row.z)])
    '''
    # for only x - y, y - z use this:
    diffs = np.diff(row)
    statement = all(diffs <= 2)
    if statement:
        return 'G1'
    else:
        return '-'
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6], 'y': [10, 14, 5, 14, 3, 12],'z': [7, 6, 2, 43, 1, 40]})
df['group'] = df.apply(your_specific_function, axis = 1)
print(df.head())
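For the example frame this should print something like:
   x   y   z  group
0  1  10   7      -
1  2  14   6      -
2  3   5   2     G1
3  4  14  43      -
4  5   3   1     G1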
I am looking to quickly combine columns that are genetic complements of each other. I have a large data frame with counts and want to combine columns where the column names are complements. I currently have a system that
Gets the complement of a column name
Checks the column names for the complement
Adds together the columns if there is a match
Then deletes the complement column
However, this is slow (checking every column name) and gives different column names based on the ordering of the columns (i.e. it deletes different complement columns between runs). I was wondering if there was a way to incorporate a dictionary key:value pair to speed up the process and keep the output consistent. I have an example dataframe below with the desired result (ATTG|TAAC & CGGG|GCCC are complements).
df = pd.DataFrame({"ATTG": [3, 6, 0, 1],"CGGG" : [0, 2, 1, 4],
"TAAC": [0, 1, 0, 1], "GCCC" : [4, 2, 0, 0], "TTTT": [2, 1, 0, 1]})
## Current Pseudocode
for item in df.columns:
    if compliment(item) in df.columns:
        df[item] = df[item] + df[compliment(item)]
        del df[compliment(item)]
## Desired Result
df_result = pd.DataFrame({"ATTG": [3, 7, 0, 2],"CGGG" : [4, 4, 1, 4], "TTTT": [2, 1, 0, 1]})
Translate the column names, then assign each column whichever of the translation and the original sorts first. This allows you to group complements.
import numpy as np
mytrans = str.maketrans('ATCG', 'TAGC')
df.columns = np.sort([df.columns, [x.translate(mytrans) for x in df.columns]], axis=0)[0, :]
df.groupby(level=0, axis=1).sum()
#   AAAA  ATTG  CGGG
#0     2     3     4
#1     1     7     4
#2     0     0     1
#3     1     2     4
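Since the question mentions using a dictionary, an equivalent sketch (applied to the original df, before its columns are reassigned) builds a key:value map from each column name to the lexicographically smaller of itself and its complement, then sums within those groups; the result is the same and stays independent of column order:
# canonical name per column: the smaller of (name, complement); assumes the same mytrans table
canon = {c: min(c, c.translate(mytrans)) for c in df.columns}
df_result = df.T.groupby(canon).sum().T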
Suppose that we have a data-frame (df) with a high number of rows (1600000 x 4). Also, we have a list of lists such as this one:
inx = [[1,2],[4,5], [8,9,10], [15,16]]
We need to calculate the average of the first and third columns of this data-frame and the median of the second and fourth columns for every list in inx. For example, for the first list in inx, we should do this for the rows with index 1 and 2 and replace those rows with a single new row that contains the output of these calculations. What is the fastest way to do this?
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2, 3, 3], [4, 5, 6, 1], [7, 8, 9, 3], [1, 1, 1, 1]]), columns=['a', 'b', 'c', 'd'])
   a  b  c  d
0  1  2  3  3
1  4  5  6  1
2  7  8  9  3
3  1  1  1  1
The output for just the first list inside of inx ([1,2]) will be something like this:
     a    b    c  d
0    1    2    3  3
1  5.5  6.5  7.5  2
3    1    1    1  1
As you can see, we don't change the first row (0), because it's not in any of the lists. After that, we do the same for [4,5]. We don't change anything in row 3 either, because it's not in any list. inx is a large list of lists (more than 100000 elements).
EDIT: NEW APPROACH AVOIDING LOOPS
Below you find an approach relying on pandas and avoiding loops.
After generating some fake data with the same size as yours, I basically create a list of group labels matching your inx list of row indexes; i.e., with your inx being:
[[2,3], [5,6,7], [10,11], ...]
the created list is:
[[1,1], [2,2,2], [3,3],...]
After that, this list is flattened and added to the original dataframe to mark various groups of rows to operate on.
After proper calculations, the resulting dataframe is joined back with original rows which don't need calculations (in my example above, rows: [0, 1, 4, 8, 9, ...]).
You find more comments in the code.
At the end of the answer I also leave my previous approach for the record.
On my box, the old algo involving a loop takes more than 18 minutes... unbearable!
Using pandas only, it takes less than half a second!! Pandas is great!
import pandas as pd
import numpy as np
import random
# Prepare some fake data to test
data = np.random.randint(0, 9, size=(160000, 4))
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])
inxl = random.sample(range(1, 160000), 140000)
inxl.sort()
inx=[]
while len(inxl) > 3:
    i = random.randint(2,3)
    l = inxl[0:i]
    inx.append(l)
    inxl = inxl[i:]
inx.append(inxl)
# flatten inx (used below)
flat_inx = [item for sublist in inx for item in sublist]
# for each element (list) in inx create equivalent list (same length)
# of increasing ints. They'll be used to group corresponding rows
gr=[len(sublist) for sublist in inx]
t = list(zip(gr, range(1, len(inx)+1)))
group_list = [a*[b] for (a,b) in t]
# the group labels are flattened as well
flat_group_list = [item for sublist in group_list for item in sublist]
# create a new dataframe to mark rows to group retaining
# original index for each row
df_groups = pd.DataFrame({'groups': flat_group_list}, index=flat_inx)
# and join the group dataframe to the original df
df['groups'] = df_groups
# rows not belonging to a group are marked with 0
df['groups']=df['groups'].fillna(0)
# save rows not belonging to a group for later
df_untouched = df[df['groups'] == 0]
df_untouched = df_untouched.drop('groups', axis=1)
# new dataframe containing only rows belonging to a group
df_to_operate = df[df['groups']>0]
df_to_operate = df_to_operate.assign(ind=df_to_operate.index)
# at last, we group the rows according to original inx
df_grouped = df_to_operate.groupby('groups')
# calculate mean and median
# for each group we retain the index of first row of group
df_operated = df_grouped.agg({'a': 'mean',
                              'b': 'median',
                              'c': 'mean',
                              'd': 'median',
                              'ind': 'first'})
# set correct index on dataframe
df_operated=df_operated.set_index('ind')
# finally, join the previous dataframe with the saved
# dataframe of rows which don't need calculations
df_final = df_operated.combine_first(df_untouched)
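As a quick sanity check on the fake data above (a sketch, not part of the timing), the final frame should contain one row for every untouched index plus one row per group in inx:
# every inx group collapses into a single row, indexed by its first member
assert len(df_final) == len(df_untouched) + len(inx)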
OLD ALGO, TOO SLOW FOR SO MUCH DATA
This algo, involving a loop, gives a correct result but takes too long for such a large amount of data:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2, 3, 3], [4, 5, 6, 1], [7, 8, 9, 3], [1, 1, 1, 1]]), columns=['a', 'b', 'c', 'd'])
inx = [[1,2]]
for l in inx:
    means = df.iloc[l][['a', 'c']].mean()
    medians = df.iloc[l][['b', 'd']].median()
    df.iloc[l[0]] = pd.DataFrame([means, medians]).fillna(method='bfill').iloc[0]
    df.drop(index=l[1:], inplace=True)
I have thousands of data frames like the following, though much larger (1000000 rows, 100 columns).
data = pd.DataFrame({'cols1': [4, 5, 5, 4, 321, 32, 5],
                     'count': [45, 66, 6, 6, 1, 432, 3],
                     'Value': ['Apple', 'Boy', 'Car', 'Corn', 'Anne', 'Barnes', 'Bayesian']})
I want to randomly sample from this data frame and make a new data frame such that the sum of count equals N. That is, I want to randomly sample using the count value as a weight, and make a new data frame from this resampled data such that the sum of count is N.
The relative proportions should stay approximately the same, and no resampled count should exceed the original count value. The values in cols1 (or any other column except Value and count) should remain the same.
For example, if N was 50, it might look like:
pd.DataFrame({'cols1': [4, 5, 5, 4, 321, 32, 5],
              'count': [4, 7, 1, 1, 0, 37, 0],
              'Value': ['Apple', 'Boy', 'Car', 'Corn', 'Anne', 'Barnes', 'Bayesian']})
How can this be done?
Efficiency is key, otherwise I could expand the data frame based on count and randomly sample without replacement, then merge it back together.
Thanks,
Jack
Using multinomial sampling, this is relatively easy.
import numpy as np
from itertools import chain
def downsample(df, N):
    # sampling probability for each row, proportional to its count
    prob = df['count']/sum(df['count'])
    # draw N items across the rows in a single multinomial draw
    df['count'] = list(chain.from_iterable(np.random.multinomial(n = N, pvals = prob, size = 1)))
    # drop rows that received no samples
    df = df[df['count'] != 0]
    return df
For OP's example:
downsample(data, 50)
returns:
    Value  cols1  count
1     Boy      5      1
3    Corn      4     16
5  Barnes     32     33
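For comparison, the expand-and-merge idea mentioned in the question can be sketched as below; it is slower, but it guarantees that no resampled count exceeds the original (the variable names are just for illustration):
# repeat each row 'count' times, sample N rows without replacement,
# then count how many times each original row was picked
expanded = data.loc[data.index.repeat(data['count'])]
picked = expanded.sample(n=50).groupby(level=0).size()
result = data.assign(count=picked.reindex(data.index, fill_value=0))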