How would I combine multiple columns into a single column in an Excel file using pandas in Python?
a=[5,4,3,2,5,4,6,9,8,4,3,2,6]
b=[11,12,1,2,11,9,11,11,4,12,0,2,11]
c=[9,5,4,6,10,5,12,13,14,10,3,6.1,5]
from pandas import DataFrame
df = DataFrame({'Stimulus Time': a, 'Reaction Time': b})
df.to_excel('case2.xlsx', sheet_name='sheet1', index=False)
This gives me the following output:
Reaction Time Stimulus Time
0 11 5
1 12 4
2 1 3
3 2 2
4 11 5
5 9 4
6 11 6
7 11 9
8 4 8
9 12 4
10 0 3
11 2 2
12 11 6
However, I need the output in the following format:
Reaction Time Stimulus Time
From 0 to 11 5
From 1 to 12 4
From 2 to 1 3
From 3 to 2 2
.......
.......
........
Thanks,
D
I'd suggest using an intermediate list that converts the start and end times to strings, e.g.:
d = ["From "+str(i)+" to "+ str(j) for i,j in zip(range(0,len(b)),b)]
Let's say I have the following df -
data={'Location':[1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4]}
df = pd.DataFrame(data=data)
df
Location
0 1
1 1
2 1
3 2
4 2
5 2
6 2
7 2
8 2
9 3
10 3
11 3
12 3
13 3
14 3
15 4
16 4
17 4
In addition, I have the following dict:
Unlock={
1:"A",
2:"B",
3:"C",
4:"D",
5:"E",
6:"F",
7:"G",
8:"H",
9:"I",
10:"J"
}
I'd like to create another column that will randomly select a string from the Unlock dict, on the condition that the dict key is <= Location. So, for example, rows with Location 2 will randomly get either 'A' or 'B'.
I've tried to do the following but with no luck (I'm getting an error) -
df['Name']=np.select(df['Location']<=Unlock,np.random.choice(Unlock,size=len(df))
Thanks in advance for your help!
You can convert your dictionary values to a list, and randomly select the values of a subset of this list: only up to Location number of elements.
With Python versions >= 3.7, dicts maintain insertion order. For lower versions, see below.
lst = list(Unlock.values())
df['Name'] = df['Location'].transform(lambda loc: np.random.choice(lst[:loc]))
Example output:
Location Name
0 1 A
1 1 A
2 1 A
3 2 B
4 2 B
5 2 B
6 2 B
7 2 A
8 2 B
9 3 B
10 3 B
11 3 C
12 3 C
13 3 C
14 3 B
15 4 A
16 4 C
17 4 D
If you are using a lower version of Python, you can build a list of dictionary values, sorted by key:
lst = [value for key, value in sorted(Unlock.items())]
For a vectorized method, multiply by a random value in (0, 1] and take the ceiling, then map with your dictionary.
This gives an equiprobable value between 1 and the current value (inclusive):
import numpy as np
df['random'] = (np.ceil(df['Location'].mul(1 - np.random.random(size=len(df))))
                  .astype(int)
                  .map(Unlock)
               )
output (reproducible with np.random.seed(0)):
Location random
0 1 A
1 1 A
2 1 A
3 2 B
4 2 A
5 2 B
6 2 A
7 2 B
8 2 B
9 3 B
10 3 C
11 3 B
12 3 B
13 3 C
14 3 A
15 4 A
16 4 A
17 4 D
I want to select residual data that pass the threshold for at least 3 consecutive rows, where my threshold is 3. I attach the CSV data at the link below; what I currently do for the filter is shown here. I also need a time criterion: consecutive data are rows that pass the threshold and are sequential in time.
df[df.residual_value >= 3]
Data csv
IIUC, you want to filter the rows that are greater than or equal to 3, but only if 3 consecutive rows match the criterion. You can use rolling + min.
Processing:
df[df['col'].rolling(window=3).min().shift(-2).ge(3)]
example dataset:
np.random.seed(0)
df = pd.DataFrame({'col': np.random.randint(0,10,100)})
>>> df.head(15)
col
0 5
1 0
2 3
3 3
4 7
5 9
6 3
7 5
8 2
9 4
10 7
11 6
12 8
13 8
14 1
output:
col
2 3
3 3
4 7
5 9
9 4
10 7
11 6
...
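Note this keeps a row only when it starts a window of 3 passing rows. If instead you want every row that belongs to a run of at least 3 consecutive passing rows, a sketch using run-length grouping (a common pattern, not part of the answer above):
m = df['col'].ge(3)                            # rows passing the threshold
run_id = (m != m.shift()).cumsum()             # label consecutive runs of equal values
run_len = m.groupby(run_id).transform('size')  # length of the run each row belongs to
df[m & (run_len >= 3)]                         # rows in passing runs of length >= 3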
Suppose I have a data frame like this:
df=
p1 v1 p2 v2 p3 v3 p4 v4 p5 v5 p6 v6
0 3 6 5 8 4 4 8 4 9 6 0 0
1 5 0 5 9 0 8 8 5 5 2 2 9
2 6 9 8 6 9 9 9 2 8 4 2 6
3 4 1 8 0 5 9 0 2 1 2 4 8
4 1 4 8 1 3 1 4 9 6 2 6 7
5 5 4 6 5 5 2 3 0 5 5 6 4
6 4 4 9 0 2 1 7 0 1 0 8 8
7 9 1 7 3 5 4 4 4 8 9 3 8
8 1 5 0 5 4 3 6 5 2 3 1 4
9 9 1 7 6 5 3 6 8 8 4 7 5
10 1 6 5 8 2 5 1 5 3 4 5 8
11 8 7 6 6 9 3 5 5 9 7 6 7
p and v are certain parameters measured for different samples (e.g. 1, 2, 3, etc.). Now I want to multiply all the "p" columns by a number and use diff() on all the "v" columns to subtract subsequent rows of each column.
I want to save the results in the same DataFrame, naming each new column with the sample name and the first letter of the mathematical operation, like Dv1, Dv2 for the output of df['v1'].diff(), df['v2'].diff(), etc. Similarly, for the p columns, it would be like Mp1.
I can do the operation manually for each column and save the results, but it's tedious (as the number of samples is very high), so I want to automate it, e.g. with a for loop.
Any suggestion for doing mathematical operations (subtracting, multiplying, or dividing) on multiple columns of a pandas DataFrame and saving the result in the same DataFrame as a new column, using a combination of the column name and the operation name, would be appreciated.
The new DataFrame should look like this: p1 Mp1 v1 Dv1 p2 Mp2 v2 Dv2 p3 Mp3 v3 Dv3 ... etc.
Try this:
# Find all columns that starts with p and followed by a number
p = df.columns[df.columns.str.match(r'p\d')]
# Find all columns that starts with v and followed by a number
v = df.columns[df.columns.str.match(r'v\d')]
# Multiply the p columns by 2
mp = df[p].mul(2).add_prefix('M')
# Take a diff of the v columns
dv = df[v].diff().add_prefix('D')
# The display order of the columns
cols = [f'{j}{i}' for i in range(1,7) for j in ['p', 'Mp', 'v', 'Dv']]
# The final result
final = pd.concat([df, mp, dv], axis=1)[cols]
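If you'd rather not hardcode range(1,7), the sample numbers can also be derived from the matched columns; a small sketch (assuming every pN column has a matching vN):
import re
nums = sorted(int(re.search(r'\d+', c).group()) for c in p)
cols = [f'{j}{i}' for i in nums for j in ['p', 'Mp', 'v', 'Dv']]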
Something like:
import pandas as pd
import io
str_data = """
p1,v1,p2,v2,p3,v3,p4,v4,p5,v5,p6,v6
3,6,5,8,4,4,8,4,9,6,0,0
5,0,5,9,0,8,8,5,5,2,2,9
6,9,8,6,9,9,9,2,8,4,2,6
4,1,8,0,5,9,0,2,1,2,4,8
1,4,8,1,3,1,4,9,6,2,6,7
5,4,6,5,5,2,3,0,5,5,6,4
4,4,9,0,2,1,7,0,1,0,8,8
9,1,7,3,5,4,4,4,8,9,3,8
1,5,0,5,4,3,6,5,2,3,1,4
9,1,7,6,5,3,6,8,8,4,7,5
1,6,5,8,2,5,1,5,3,4,5,8
8,7,6,6,9,3,5,5,9,7,6,7
"""
df = pd.read_csv(io.StringIO(str_data))
# Doing this in case you have a pN but not a vN, or vice versa, to avoid errors
p_samples = [int(c[1:]) for c in df.columns if c.startswith('p')]
v_samples = [int(c[1:]) for c in df.columns if c.startswith('v')]
samples = set(p_samples).intersection(v_samples)
samples = sorted(list(samples))
data = {}
mult_num = 7  # not sure what you want to multiply by
for sample in samples:
    p_col = 'p{}'.format(sample)
    v_col = 'v{}'.format(sample)
    Mp_col = 'Mp{}'.format(sample)
    Dv_col = 'Dv{}'.format(sample)
    data[p_col] = df[p_col]
    data[Mp_col] = mult_num * df[p_col]
    data[v_col] = df[v_col]
    data[Dv_col] = df[v_col].diff()
new_df = pd.DataFrame(data)
print(new_df)
I'm pretty new to python.
I am trying to concat two DataFrames (df1, df2) where, if a row already exists in df1, it is not added; if not, it is added to df1.
I don't want to use .concat().drop_duplicates() because I don't want duplicates within the same DataFrame to be removed.
Backstory:
I have multiple CSV files that are exported from a piece of software in different locations. Once in a while I want to merge these into one file. The problem is that each exported file will contain the same data as before, along with the new records made within that period of time; therefore I need to check whether each record is already there, as I will be executing the same code each time I export the data.
For the sake of example:
import pandas as pd
main_df = pd.DataFrame([[1,2,3,4],[1,2,3,4],[4,2,5,1],[2,4,1,5],[2,5,4,5],[9,8,7,6],[8,5,6,7]])
df1 = pd.DataFrame([[1,2,3,4],[1,2,3,4],[4,2,5,1],[2,4,1,5],[1,5,4,8],[7,3,5,7],[4,3,8,5],[4,3,8,5]])
main_df
0 1 2 3
0 1 2 3 4 --duplicates I want to include--
1 1 2 3 4 --duplicates I want to include--
2 4 2 5 1
3 2 4 1 5
4 2 5 4 5
5 9 8 7 6
6 8 5 6 7
df1
0 1 2 3
0 1 2 3 4 --duplicates I want to exclude--
1 1 2 3 4 --duplicates I want to exclude--
2 4 2 5 1 --duplicates I want to exclude--
3 2 4 1 5 --duplicates I want to exclude--
4 1 5 4 8
5 7 3 5 7
6 4 3 8 5 --duplicates I want to include--
7 4 3 8 5 --duplicates I want to include--
I need the end result to be
main_df (after code execution)
0 1 2 3
0 1 2 3 4
1 1 2 3 4
2 4 2 5 1
3 2 4 1 5
4 2 5 4 5
5 9 8 7 6
6 8 5 6 7
7 1 5 4 8
8 7 3 5 7
9 4 3 8 5
10 4 3 8 5
I hope I have explained my issue in a clear way. Thank you
Check for every row in df1 whether it exists in main_df using pandas apply, and turn that into a mask by negating it with the ~ operator. I like using functools partial to make explicit that we are comparing to main_df.
import pandas as pd
from functools import partial
main_df = pd.DataFrame([
[1,2,3,4],
[1,2,3,4],
[4,2,5,1],
[2,4,1,5],
[2,5,4,5],
[9,8,7,6],
[8,5,6,7]
])
df1 = pd.DataFrame([
[1,2,3,4],
[1,2,3,4],
[4,2,5,1],
[2,4,1,5],
[1,5,4,8],
[7,3,5,7],
[4,3,8,5],
[4,3,8,5]
])
def has_row(df, row):
    return (df == row).all(axis=1).any()

main_df_has_row = partial(has_row, main_df)
duplicate_rows = df1.apply(main_df_has_row, axis=1)
df1_add = df1.loc[~duplicate_rows]
# ignore_index=True renumbers the result 0..n-1, matching your expected output
pd.concat([main_df, df1_add], ignore_index=True)
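A vectorized alternative (not part of the answer above, and assuming you want to match on all columns): a left merge with indicator=True flags which df1 rows already exist in main_df, while keeping duplicates within df1 itself:
# '_merge' is 'left_only' for df1 rows that are absent from main_df
mask = df1.merge(main_df.drop_duplicates(), how='left', indicator=True)['_merge'].eq('left_only')
result = pd.concat([main_df, df1[mask.values]], ignore_index=True)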
I have a pandas DataFrame say this:
user value
0 a 1
1 a 2
2 a 3
3 a 4
4 a 5
5 b 6
6 b 7
7 b 8
8 b 9
9 b 10
10 c 11
11 c 12
12 c 13
13 c 14
14 c 15
Now I want to group by user, and create two mutually exclusive random samples out of it, e.g.
Set1 with 1 samples per group:
user value
3 a 4
9 b 10
13 c 14
Set2 with 2 samples per group:
user value
0 a 1
1 a 2
5 b 6
6 b 7
10 c 11
11 c 12
So far I've tried this:
u = np.array(['a','b','c'])
u = np.repeat(u,5)
df = pd.DataFrame({'user':u,'value':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]})
set1 = df.groupby(['user']).tail(1)
set2 = df.groupby(['user']).head(2)
But these are not random samples, and I would like them to be mutually exclusive. Any ideas?
PS: Each group always has at least 3 elements.
You can randomly select 3 records for each user:
a = df.groupby("user")["value"].apply(lambda x: x.sample(3))
a
Out[27]:
user
a 3 4
0 1
2 3
b 5 6
7 8
6 7
c 14 15
10 11
13 14
dtype: int64
And assign the first one to the first set, and the remaining two to the second set:
a.groupby(level=0).head(1)
Out[28]:
user
a 3 4
b 5 6
c 14 15
dtype: int64
a.groupby(level=0).tail(2)
Out[29]:
user
a 0 1
2 3
b 7 8
6 7
c 10 11
13 14
dtype: int64
This may be a bit naive, but all I did was reindex the DataFrame with a random permutation of its length and reset the index. After that I take the head and tail as you did in your original code, and it seems to work. This could probably be made into a function:
a = np.arange(len(df))
np.random.shuffle(a)
df = df.reindex(a).reset_index()
set1 = df.groupby(['user']).tail(1)
>>>
index user value
12 9 b 10
13 10 c 11
14 1 a 2
set2 = df.groupby(['user']).head(2)
>>>
index user value
0 6 b 7
1 2 a 3
2 5 b 6
3 13 c 14
4 3 a 4
6 12 c 13
Hope this helps.
There is likely a better solution, but what about just randomizing your data before grouping and then taking the tail and head per group? You could take a set of your indices, take a random permutation of it, and use that to create a new scrambled DataFrame, then do your current procedure, as in the sketch below.
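A minimal sketch of that idea (df.sample(frac=1) is a convenient way to take a random permutation of the rows; dropping set1's index first makes the two sets mutually exclusive by construction):
shuffled = df.sample(frac=1, random_state=0)              # random permutation of the rows
set1 = shuffled.groupby('user').tail(1)                   # 1 random row per group
set2 = shuffled.drop(set1.index).groupby('user').head(2)  # 2 more rows, disjoint from set1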