I have a dataframe which contains three columns, like this:
id no. name
1A 32 ABC
4D 34 CFD
3B 32 DGA
and I want to shift the third column onto the next consecutive row, like this:
1A 32
ABC
4D 34
CFD
3B 32
DGA
How is this possible in Python?
I tried creating two dataframes, one containing "id" and "no." and the other containing "name", and then merged them, but the output was not clean.
You can split out the name series into a new dataframe and manipulate the indices of the two resulting dataframes. Finally, use concat to combine them.
import pandas as pd

df = pd.DataFrame({'id': ['1A', '4D', '3B'],
                   'no.': [32, 34, 32],
                   'name': ['ABC', 'CFD', 'DGA']})

# split out a dataframe of names; to_frame('id') names the column 'id'
# so it lines up with the id column on concat
df_name = df.pop('name').to_frame('id')
df_name['no.'] = ''
# manipulate indices so they are non-overlapping and interleave
df_name.index = df_name.index * 2 + 1
df.index = df.index * 2
# concatenate the two dataframes and restore row order
res = pd.concat([df, df_name]).sort_index()
Result:
print(res)
id no.
0 1A 32
1 ABC
2 4D 34
3 CFD
4 3B 32
5 DGA
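An alternative sketch that builds the interleaved rows directly with a plain loop (rebuilding df, since the snippet above popped 'name'):

import pandas as pd

df = pd.DataFrame({'id': ['1A', '4D', '3B'],
                   'no.': [32, 34, 32],
                   'name': ['ABC', 'CFD', 'DGA']})

# emit one (id, no.) row, then one name row, per original row
rows = []
for _, r in df.iterrows():
    rows.append([r['id'], r['no.']])
    rows.append([r['name'], ''])
res = pd.DataFrame(rows, columns=['id', 'no.'])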
I have a dataframe with one categorical column and two numerical columns. The categorical column has no missing values, but for some rows the first numerical column is NA. I would like to fill the NAs in the first numerical column with the value of the second numerical column in the same row, but only for specific categories (here, PP and XX) whose rows contain the NAs. I want to do this without changing the shape of the original dataframe. Example dataset df below:
dataframe example to fill NA
Cat_col num_col1 num_col2
SS 22 54
PP NA 89
CC 128 34
XX NA 56
SS 67 56
XX NA 90
CC 47 10
BB NA 29
From the above table, I want to fill the NA values of num_col1 with the corresponding row values of num_col2, but only for the PP and XX categories in Cat_col, and without changing the shape of the dataframe.
First of all, you should provide a piece of your code showing your effort to solve the problem.
If I understand your question correctly, a solution could look as follows:
import pandas as pd

data = '''Cat_col num_col1 num_col2 SS 22 54 PP NA 89 CC 128 34 XX NA 56 SS 67 56 XX NA 90 CC 47 10 BB NA 29'''.split(' ')
Preparing the data into column/row format:
n=3
result = [data[i:i+n] for i in range(0, len(data), n)]
Create a dataframe and filter for categories:
df = pd.DataFrame(result[1:],columns=result[0])
cat_filter = ['PP', 'XX']
na_filter = df['num_col1'] == 'NA'  # filter for NA values; if the missing values were np.nan instead of the string 'NA', df['num_col1'].isna() could be used
row_mask = df['Cat_col'].isin(cat_filter) & na_filter  # mask selecting the target rows
Assign values from num_col2 to num_col1:
df.loc[row_mask,'num_col1'] = df.loc[row_mask, 'num_col2']
Output:
Cat_col num_col1 num_col2
0 SS 22 54
1 PP 89 89
2 CC 128 34
3 XX 56 56
4 SS 67 56
5 XX 90 90
6 CC 47 10
7 BB NA 29
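As the comment in the code above notes, if the missing values were real np.nan rather than the string 'NA', the same fill could be sketched more directly with isna() (a sketch, assuming the column and category names from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Cat_col':  ['SS', 'PP', 'CC', 'XX', 'SS', 'XX', 'CC', 'BB'],
    'num_col1': [22, np.nan, 128, np.nan, 67, np.nan, 47, np.nan],
    'num_col2': [54, 89, 34, 56, 56, 90, 10, 29],
})

row_mask = df['Cat_col'].isin(['PP', 'XX']) & df['num_col1'].isna()
df.loc[row_mask, 'num_col1'] = df.loc[row_mask, 'num_col2']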
I have three Pandas data frames, each indexed by patient id and holding time-series values in order of measurement time. All patients have the same number of measurements (two values per patient). I want to create a data frame that concatenates these frames. The catch: not all patients appear in all data frames, and the final data frame should contain only the patients present in ALL three. An example of the data frames (please note there are three in total):
A
id  value1
1   80
1   78
2   76
2   79

B
id  value2
2   65
2   67
3   74
3   65
# to reproduce the data frames
df1 =pd.DataFrame({"stay_id":[2,2,3,3], "value":[65,67,74,65]})
df2 =pd.DataFrame({"stay_id":[1,1,2,2],"value":[80,78,76,79]})
What I'm trying to create:
id  value1  value2
2   76      65
2   79      67
I tried:
data = pd.merge(A, B, on="stay_id")
But the result is:
id  value1  value2
2   76      65
2   76      67
2   79      65
2   79      67
So the first value gets repeated along the axis. I also tried:
complete = A.copy()
complete["B" = B["value2"]
Does this ensure the values being matched for the id?
If I understand correctly, first make the dataframes share the same column names using pandas.DataFrame.set_axis, then concatenate them with pandas.concat. Finally, use a boolean mask to keep only the rows whose id appears in all the dataframes.
Considering there is a third dataframe (called dfC) with the following content, you can try the code below:
id  value3
2   72
2   83
4   78
4   76
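To make this runnable, the three frames can be rebuilt from the tables above (the names dfA, dfB and dfC follow the answer; dfC is assumed as shown):

import pandas as pd

dfA = pd.DataFrame({"id": [1, 1, 2, 2], "value1": [80, 78, 76, 79]})
dfB = pd.DataFrame({"id": [2, 2, 3, 3], "value2": [65, 67, 74, 65]})
dfC = pd.DataFrame({"id": [2, 2, 4, 4], "value3": [72, 83, 78, 76]})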
list_df = [dfA, dfB, dfC]
out = pd.concat([df.set_axis(['id', 'value'], axis=1) for df in list_df], ignore_index=True)
out = out[out.id.isin(list(set.intersection(*(set(df["id"]) for df in list_df))))]
>>> print(out)
id value
2 2 76
3 2 79
4 2 65
5 2 67
8 2 72
9 2 83
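If you want one value column per source frame, as in the desired output, a hedged sketch (reusing list_df from above) is to number the measurements per id and join on (id, measurement number):

# number each patient's measurements, then inner-join the frames on
# (id, measurement number) so only ids present in all frames survive
frames = []
for i, df in enumerate(list_df, start=1):
    d = df.set_axis(['id', f'value{i}'], axis=1)
    d = d.assign(n=d.groupby('id').cumcount()).set_index(['id', 'n'])
    frames.append(d)
wide = frames[0].join(frames[1:], how='inner').reset_index(level='n', drop=True)

This yields one row per measurement, with columns value1, value2, value3, and only the ids present in every frame.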
After hours of trying I finally found a way using some of @Lucas M. Uriarte's logic, thanks a lot for that!
df1 =pd.DataFrame({"stay_id":[2,2,3,3], "value":[65,67,74,65]})
df2 =pd.DataFrame({"stay_id":[1,1,2,2],"value":[80,78,76,79]})
df1_patients = set(df1.index.values)
df2_patients = set(df2.index.values)
patients = set.intersection(df1, df2)
patients = list(patients)
reduced_df1 = df1.loc[patients]
reduced_df2 = df2.loc[patients]
reduced_df1.sort_index(inplace=True)
reduced_df2.sort_index(inplace=True)
data = reduced_df1.copy()
data["value2"] = reduced_df2["value2"]
As far as I can see, this ensures keeping only the entries that are in both data frames and matches the values row by row in this scenario.
I modified the answer according to the comments exchanged under the question: you need a unique identifier to merge on. Since you know the number of each measurement, you can merge on "number_measurement" together with "stay_id"; consider, for example, the following modification of the dataframes:
df1 = pd.DataFrame({"stay_id": [2, 2, 3, 3], "value": [65, 67, 74, 65],
                    "number_measurement": ["measurement_1", "measurement_2", "measurement_1", "measurement_2"]})
df2 = pd.DataFrame({"stay_id": [1, 1, 2, 2], "value": [80, 78, 76, 79],
                    "number_measurement": ["measurement_1", "measurement_2", "measurement_1", "measurement_2"]})
output = pd.merge(df1, df2, on=["stay_id", "number_measurement"])
print(output)
Output:
stay_id value_x number_measurement value_y
0 2 65 measurement_1 76
1 2 67 measurement_2 79
now just drop the number_measurement column (drop returns a new frame, so assign the result):
output = output.drop("number_measurement", axis=1)
print(output)
stay_id value_x value_y
0 2 65 76
1 2 67 79
I have a data frame of 20 columns. All of the column names share a common text part plus a serial number. I want to trim the text part to make the names shorter. Below is an example:
xdf = pd.DataFrame({'Column1':[10,20],'Column2':[80,90]})
Column1 Column2
0 10 80
1 20 90
Expected output:
C1 C2
0 10 80
1 20 90
Solution1:
oldcols = ['Column1','Column2']
newcols = ['C1','C2']
xdf.rename(columns=dict(zip(oldcols,newcols)),inplace=True)
C1 C2
0 10 80
1 20 90
Solution2:
for i in range(len(oldcols)):
    xdf.rename(columns={'%s' % (xdf[i]): '%s' % (xdf[i].replace('Column', 'C'))}, inplace=True)
This raises KeyError: 0, because xdf[i] looks up the integer i as a column label instead of taking the i-th column name.
Solution1 works fine, but I have to prepare lists of old and new column names. Instead, I want to iterate through each column name and replace the text part. However, solution2 is not working.
You could use str.findall on the columns to split into text and number; then use a list comprehension to take only the first letter and join it with the numbers for each column name:
xdf.columns = [x[0]+y for li in xdf.columns.str.findall(r'([A-Za-z]+)(\d+)') for x,y in li]
Output:
C1 C2
0 10 80
1 20 90
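If the goal is always "first letter plus the trailing digits", a regex replace over the columns is an alternative sketch:

# keep only the first letter and the trailing digits of each column name
xdf.columns = xdf.columns.str.replace(r'^([A-Za-z])[A-Za-z]*(\d+)$', r'\1\2', regex=True)
print(xdf.columns)  # Index(['C1', 'C2'], dtype='object')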
I have a dataframe df_in, which contains column names that start with pi and pm.
df_in = pd.DataFrame([[1,2,3,4,"",6,7,8,9],["",1,32,43,59,65,"",83,97],["",51,62,47,58,64,74,86,99],[73,51,42,67,54,65,"",85,92]], columns=["piabc","pmed","pmrde","pmret","pirtc","pmere","piuyt","pmfgf","pmthg"])
If a row is blank in a column whose name starts with pi, make the same row blank in the pm columns that follow it, up until the next column starting with pi. Repeat the same process for the other pi columns.
Expected Output:
df_out = pd.DataFrame([[1,2,3,4,"","",7,8,9],["","","","",59,65,"","",""],["","","","",58,64,74,86,99],[73,51,42,67,54,65,"","",""]], columns=["piabc","pmed","pmrde","pmret","pirtc","pmere","piuyt","pmfgf","pmthg"])
How to do it?
You can create groups by testing the column names with str.startswith('pi') and taking a cumulative sum; then, within each group of columns, compare the values to the empty string and broadcast the first (pi) column's mask with groupby, using the result to blank cells via DataFrame.mask:
g = df_in.columns.str.startswith('pi').cumsum()
df = df_in.mask(df_in.eq('').groupby(g, axis=1).transform(lambda x: x.iat[0]), '')
#first for me failed in pandas 1.2.3
#df = df_in.mask(df_in.eq('').groupby(g, axis=1).transform('first'), '')
print (df)
  piabc pmed pmrde pmret pirtc pmere piuyt pmfgf pmthg
0     1    2     3     4                 7     8     9
1                           59    65
2                           58    64    74    86    99
3    73   51    42    67    54    65
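On recent pandas versions, groupby(..., axis=1) is deprecated; a sketch of the same idea that avoids it (assuming pandas >= 1.5, with g and df_in as above) is to transpose, group the rows, and transpose back:

# transpose so the column groups become row groups, broadcast each group's
# first row (the pi column's emptiness mask), then transpose back
m = df_in.eq('').T.groupby(g).transform('first').T
df = df_in.mask(m, '')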
i have a "1000 rows * 4 columns" DataFrame:
a b c d
1 aa 93 4
2 bb 32 3
...
1000 nn 78 2
[1283 rows x 4 columns]
and I use groupby to group them based on 3 of the columns:
df = df.groupby(['a','b','c']).sum()
print(df)
a b c d
1 aa 93 12
2 bb 32 53
...
1000 nn 78 38
[1283 rows x 1 columns]
However, the result gives me a "1000 rows * 1 column" Dataframe. So my question is: does groupby concatenate the grouping columns into one column? If yes, how can I prevent that? I want to plot my data after grouping it, but I can't, since it only sees one column instead of all 4.
Edit: when I inspect the columns I only get the last column; it seems 'a', 'b', 'c' are no longer read as columns. Why is that, and how can I mark them as columns again?
df.columns
Index([u'd'], dtype='object')
The grouping columns are not lost; they become the (Multi)Index after groupby. You can keep them as regular columns this way:
df.groupby(['a','b','c'], as_index=False).sum()
or:
df.groupby(['a','b','c']).sum().reset_index()
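A minimal usage sketch (toy data, column names from the question):

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': ['aa', 'aa', 'bb'],
                   'c': [93, 93, 32], 'd': [4, 3, 2]})

out = df.groupby(['a', 'b', 'c'], as_index=False).sum()
print(out.columns)  # Index(['a', 'b', 'c', 'd'], dtype='object')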