How to stack column values in a single Dataframe cell - python

I have a dataframe with one column containing many values. I want to stack all those values into a single cell of the same or another dataframe.
column_df =
index voltage
0 5.143590
1 5.175285
2 5.231214
3 6.040188
4 7.776510
5 9.540277
6 11.476937
7 13.277916
8 15.088566
9 16.895921
10 18.701332
I want to stack the column values into a single dataframe cell. Finally, I want to achieve something like the following.
Expected output:
cell_df =
index voltage
0 [ 5.14359 , 5.175285, 5.231214, 6.040188, 7.77651 , 9.540277, 11.476937, 13.277916, 15.088566, 16.895921, 18.701332]
My code is:
cell_df = pd.DataFrame()
cell_df['voltage'][0] = np.array([column_df['voltage']])
Present output:
ValueError: setting an array element with a sequence.

You can cast the "voltage" series as a list and use it in your cell_df constructor:
cell_df = pd.DataFrame({"voltage": [column_df["voltage"].tolist()]})

Related

Appending New Row to Existing Dataframe using Pandas

I'm trying to use pandas to append a blank row based on the values in the first column. When the first six characters in the first column don't match, I want an empty row between them (effectively creating groups). Here is an example of what the output could look like:
002446
002447-01
002447-02
002448
This is what I was able to put together thus far.
readie = pd.read_csv('title.csv')
i = 0
for row in readie:
    readie.append(row)
    i += 1
    if readie['column title'][i][0:5] != readie['column title'][i+1][0:5]:
        readie.append([])
When running this code, I get the following error message:
TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid
I believe there are other ways to do this, but I would like to use pandas if at all possible.
I'm using the approach from this answer.
Assuming that strings like '123456' and '123' are not considered a match:
df_orig = pd.DataFrame(
    {'col': ['002446', '002447-01', '002447-02', '002448', '00244', '002448']}
)
df = df_orig.reset_index(drop=True)  # reset your index
first_6 = df['col'].str.slice(stop=6)  # first six characters of each value
mask = first_6 != first_6.shift(fill_value=first_6[0])  # True where a new group starts
df.index = df.index + mask.cumsum()  # shift the index to leave a gap before each new group
df = df.reindex(range(df.index[-1] + 1))  # the gaps become NaN rows
print(df)
print(df)
col
0 002446
1 NaN
2 002447-01
3 002447-02
4 NaN
5 002448
6 NaN
7 00244
8 NaN
9 002448
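If you want the separator rows to be truly empty instead of NaN (for example before writing back out to a CSV), you can fill them afterwards. A small sketch; the output filename is just a placeholder:
df = df.fillna('')  # turn the NaN gap rows into empty strings
df.to_csv('output.csv', index=False)  # 'output.csv' is a hypothetical filename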

Manipulate row values based on lists

I actually have a problem and I do not know how to solve it.
I have two lists, which always have the same length:
max_values = [333,30,10]
min_values = [30,10,0]
Every index of the lists represents the cluster number of a range between the min and the max values, so:
Index/Cluster 0: 0-10
Index/Cluster 1: 10-30
Index/Cluster 2: 30-333
Furthermore I have one dataframe as follows:
[dataframe screenshot from the original post]
Within the df, I have a column called "AVG_MPH_AREA".
For each value it should be checked which cluster range it falls into. Afterwards, the "Cluster" column should be set to the correct index of the list, and the old values should be dropped.
In this case it's a list of 3 clusters, but it could also be more or fewer...
Any idea how to solve this, or which functions to use?
I came up with a small function that could do the task.
max_values = [333,30,10]
min_values = [30,10,0]
Make a dictionary that contains the cluster number as key and the (min_value, max_value) pair as value:
def temp_func(x):
    # construct the dict inside so this func can be applied to the AVG_MPH_AREA column of the dataframe
    dt = {}
    cluster_list = list(zip(min_values, max_values))
    for i in range(len(cluster_list)):
        dt[i] = cluster_list[i]
    for key, value in dt.items():
        x = int(round(x))
        if x in list(range(value[0], value[1])):
            return key
        else:
            continue
Now apply the function to the AVG_MPH_AREA column
df["Cluster"] = df["AVG_MPH_AREA"].apply(temp_func)
Output:
AVG_MPH_AREA cluster
0 10.770 1
1 10.770 1
2 10.780 1
3 5.780 2
4 24.960 1
5 267.865 0
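As an alternative, here is a vectorized sketch using pd.cut. It assumes the same min/max lists as above and the same cluster numbering (cluster 0 being the highest range); boundary handling at the exact bin edges may differ slightly from the loop above:
import pandas as pd

max_values = [333, 30, 10]
min_values = [30, 10, 0]

# ascending bin edges built from the two lists: [0, 10, 30, 333]
edges = sorted(set(min_values + max_values))

# pd.cut labels bins from the lowest range upwards, so flip the label
# to match the numbering above, where cluster 0 is the highest range
df["Cluster"] = len(edges) - 2 - pd.cut(df["AVG_MPH_AREA"], bins=edges, labels=False)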

How to retrieve cells from a dataframe based on condition from another dataframe

We have two dataframes; the first one contains float values (which represent average speed).
0 1 2
1 15.610826 19.182879 6.678087
2 13.740250 15.666897 17.640749
3 2.379010 2.889702 2.955097
4 20.540628 9.661226 9.479921
And another dataframe with the geographical coordinates where each average speed was measured.
0 1 2
1 [52.2399255, 21.0654495] [52.23893150000001, 21.06087] [52.23800850000001,21.056779]
2 [52.2449705, 21.0755175] [52.2452905, 21.075118000000003] [52.245557500000004, 21.0748175]
3 [52.2401885, 21.012981500000002] [52.239134, 21.009432] [52.238420500000004, 21.007080000000002]
4 [52.221506500000004, 20.9665085] [52.222458, 20.968952] [52.224409, 20.969248999999998]
Now I want to create a list with the coordinates where the average speed is above 18; in this case this would be
list_above_18=[[52.23893150000001, 21.06087] , [52.221506500000004, 20.9665085]]
How can I select values from a dataframe based on values in another dataframe?
You can use enumerate to zip the dataframes and work on the elements separately. See below (A, B are your dataframes, in the same order you provided them):
list_above_18 = []
p = list(enumerate(zip(A.values, B.values)))
for i in p:
    for k in range(3):
        if i[1][0][k] > 18:  # speed value from A
            list_above_18.append(i[1][1][k])  # matching coordinates from B
Output:
>>>print(list_above_18)
[[52.23893150000001, 21.06087] , [52.221506500000004, 20.9665085]]
Assuming the shape of the average-speed dataframe stays the same as that of the coordinates dataframe, you can try the below:
coord_df[data_df.iloc[:,:] > 18].T.stack().values
Here,
coord_df = DataFrame with coordinate values
data_df = Average Speed values
This would return a numpy array with just the coordinate values where the Average speed is greater than 18
How this works:
data_df.iloc[:,:] > 18
This creates a boolean mask dataframe in which all values not greater than 18 are marked as False and the rest as True.
coord_df[data_df.iloc[:,:] > 18]
Passing the mask to the target (coordinate) dataframe results in a dataframe that shows coordinate values only for those cells where the mask is True, i.e. where the average speed was above 18.
.T.stack().values
This then retrieves only the non-null values from the resulting dataframe and returns a numpy array.
References:
Get non-null elements in a pandas DataFrame --- used to get only the non-null values from a dataframe (.T.stack().values)
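A minimal end-to-end sketch of this approach, rebuilding small two-row versions of the dataframes from the question (the order of the result follows the column-major traversal caused by .T, so it may differ from the listing in the question):
import pandas as pd

data_df = pd.DataFrame([[15.610826, 19.182879, 6.678087],
                        [20.540628, 9.661226, 9.479921]])

coord_df = pd.DataFrame([[[52.2399255, 21.0654495], [52.23893150000001, 21.06087], [52.23800850000001, 21.056779]],
                         [[52.221506500000004, 20.9665085], [52.222458, 20.968952], [52.224409, 20.969248999999998]]])

# keep coordinates only where the corresponding speed exceeds 18, then drop the NaNs
# note: on newer pandas versions stack() may keep NaN, in which case add .dropna() before .values
list_above_18 = list(coord_df[data_df.iloc[:, :] > 18].T.stack().values)
print(list_above_18)
# [[52.221506500000004, 20.9665085], [52.23893150000001, 21.06087]]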
Let the first df be df1 and the second df be df2:
output_array = df2[df1>18].values.flatten() # df1>18 would create the mask
output_array = [val for val in output_array if type(val) == list] # removing the nan values. We can't use np.isnan as it would not work for list
Sample input: df1 and df2 were shown as images in the original post (not reproduced here).
output_array
[[15.1, 20.5], [91.5, 95.8]]

Return value from row based on column value pandas

I have a data frame that has a column of lists of strings. I want to find the value of one column in a row based on the value of another column, i.e.
samples subject trial_num
0 ['aa','bb'] 1 1
1 ['bb','cc'] 1 2
I have ['bb','cc'] and I want to get the value from the trial_num column where this list equals the samples column, in this case 2.
Given that the search column (samples) contains lists, things are a tiny bit more complicated.
In this case, the apply() function can be used to test the values and return a boolean mask, which can then be applied to the DataFrame to obtain the required value.
Example code:
df.loc[df['samples'].apply(lambda x: x == ['bb', 'cc']), 'trial_num']
Output:
1 2
Name: trial_num, dtype: int64
To only return the required value (2), simply append .iloc[0] to the end of the statement, as:
df.loc[df['samples'].apply(lambda x: x == ['bb', 'cc']), 'trial_num'].iloc[0]
>>> 2
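For completeness, a minimal runnable sketch that rebuilds the example frame from the question and applies the statement above:
import pandas as pd

df = pd.DataFrame({
    'samples': [['aa', 'bb'], ['bb', 'cc']],
    'subject': [1, 1],
    'trial_num': [1, 2],
})

# boolean mask: True where the list in 'samples' equals ['bb', 'cc']
mask = df['samples'].apply(lambda x: x == ['bb', 'cc'])
print(df.loc[mask, 'trial_num'].iloc[0])  # 2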

Saving dataframe as another value in python

I am having some issues with copying a dataframe. Basically, I want to replicate a dataframe under another variable, but with the row labels being numerical instead of categorical. Below I have a function that returns the dataframe mean_df; when I print it out I see that the row labels are categorical. I then create a new dataframe (mean_df_num) which is set equal to mean_df, and then I convert the rows of mean_df_num to numerical index values instead of the categorical letters. However, when I print mean_df afterwards, I see that its indices have also changed to be numerical. Why does this happen, and is there a way around it?
mean_df = mean_funct(train_df_cat)
print(mean_df)
mean_df_num = mean_df
mean_df_num.index = range(len(mean_df_num)) #Convert df to numerical indices
print(mean_df)
Output:
m00 mu02 mu11
a 1.00162 0.357137 -0.245608
c 0.766659 0.354217 0.244405
e 0.929145 0.422447 0.0602329
m 1.61799 2.85194 -1.80078
n 1.03976 0.700674 -1.0011
o 0.97873 0.754065 0.172753
r 0.623244 0.11065 1.52705
s 0.789545 0.177259 -0.154744
x 1.0039 0.404982 -1.51634
z 0.919228 0.3578 0.42973
m00 mu02 mu11
0 1.00162 0.357137 -0.245608
1 0.766659 0.354217 0.244405
2 0.929145 0.422447 0.0602329
3 1.61799 2.85194 -1.80078
4 1.03976 0.700674 -1.0011
5 0.97873 0.754065 0.172753
6 0.623244 0.11065 1.52705
7 0.789545 0.177259 -0.154744
8 1.0039 0.404982 -1.51634
9 0.919228 0.3578 0.42973
Assigning a pandas DataFrame to another variable does not copy it. That means when you do mean_df_num = mean_df, mean_df_num and mean_df point to the same object: if you change one, you change the other. The way around this is .copy(), i.e. mean_df_num = mean_df.copy().
Actually, for your purpose, it's better to just do mean_df_num = mean_df.reset_index(drop=True). It does both at the same time: it copies the data and sets the index to a range index.
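A small sketch of the difference, using a toy two-row stand-in for mean_df:
import pandas as pd

mean_df = pd.DataFrame({'m00': [1.00162, 0.766659]}, index=['a', 'c'])  # toy stand-in

alias = mean_df                   # no copy: alias and mean_df are the same object
alias.index = range(len(alias))   # renumbering the alias also renumbers mean_df
print(mean_df.index.tolist())     # [0, 1]

mean_df.index = ['a', 'c']        # restore the labels for the second demo
mean_df_num = mean_df.reset_index(drop=True)  # copies the data and renumbers in one step
print(mean_df.index.tolist())     # ['a', 'c'] -- the original is untouched
print(mean_df_num.index.tolist()) # [0, 1]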
