Create multiple DataFrames based on given column values [duplicate]

This question already has answers here:
Split pandas dataframe based on groupby
(4 answers)
Closed 4 years ago.
There's probably a simple solution to this that I just couldn't find...
With the given DataFrame, how can I separate it into multiple DataFrames and go from something like:
>>> import pandas as pd
>>> d = {'LOT': [102, 104, 162, 102, 104, 102], 'VAL': [22, 424, 65, 4, 34, 6]}
>>> df = pd.DataFrame(data=d)
>>> df
LOT VAL
0 102 22
1 104 424
2 162 65
3 102 4
4 104 34
5 102 6
to:
>>> df[0]
LOT VAL
0 102 22
1 102 4
2 102 6
>>> df[1]
LOT VAL
0 104 424
1 104 34
>>> df[2]
LOT VAL
0 162 65
With 3 distinct DataFrames
Please let me know if you need more information.

This is a simple groupby. Let me see if I can find a dupe:
import pandas as pd

df = pd.DataFrame({
    'LOT': [102, 104, 162, 102, 104, 102],
    'VAL': [22, 424, 65, 4, 34, 6]
})
# groupby('LOT') yields (key, sub-DataFrame) pairs; collect the sub-frames.
dfs = [x for _, x in df.groupby('LOT')]
OK, I found something. However, the answer there seems overcomplicated, so I'm going to leave this here.
Looks a lot like: Split pandas dataframe based on groupby
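If you want the numbered access from the question (df[0], df[1], ...) with each sub-frame reindexed from 0, a minimal sketch building on the groupby above (dfs is the list from the answer):

# Reset each sub-frame's index so its rows are numbered from 0 again.
dfs = [g.reset_index(drop=True) for _, g in df.groupby('LOT')]
print(dfs[0])
#    LOT  VAL
# 0  102   22
# 1  102    4
# 2  102    6

A dict keyed by the LOT value, {lot: g.reset_index(drop=True) for lot, g in df.groupby('LOT')}, also works if you'd rather look groups up by label.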

Concatenating Two Columns in Pandas Using the Index

I have three Pandas data frames, each with patient ids as the index and time series values for different patients in order of their time of measurement. All patients have the same number of measurements (two values per patient). I want to create a combined data frame that just concatenates these data frames. The catch: not all patients are represented in all data frames, and the final data frame should contain only the patients represented in ALL three. An example of the data frames (note there are three in total):
A
id  value1
1   80
1   78
2   76
2   79
B
id  value2
2   65
2   67
3   74
3   65
# to reproduce the data frames ("stay_id" corresponds to "id" above)
df1 = pd.DataFrame({"stay_id": [2, 2, 3, 3], "value": [65, 67, 74, 65]})
df2 = pd.DataFrame({"stay_id": [1, 1, 2, 2], "value": [80, 78, 76, 79]})
What I'm trying to create:
id  value1  value2
2   76      65
2   79      67
I tried:
data = pd.merge(A, B, on="stay_id")
But the result is:
id  value1  value2
2   76      65
2   76      67
2   79      65
2   79      67
So each value from A gets paired with every value from B for the same id (a cartesian product). I also tried:
complete = A.copy()
complete["B"] = B["value2"]
Does this ensure that the values are matched by id?
If I understand correctly, first make the dataframes share the same column names using pandas.DataFrame.set_axis, then concatenate them with pandas.concat. Finally, use a boolean mask to keep only the rows whose id appears in all the dataframes.
Considering there is a third dataframe (called dfC), you can try the code below:
dfC
id  value3
2   72
2   83
4   78
4   76
list_df = [dfA, dfB, dfC]
# Give every frame the same column names, then stack them vertically.
out = pd.concat([df.set_axis(['id', 'value'], axis=1) for df in list_df], ignore_index=True)
# Keep only the rows whose id occurs in every frame.
out = out[out.id.isin(list(set.intersection(*(set(df["id"]) for df in list_df))))]
>>> print(out)
id value
2 2 76
3 2 79
4 2 65
5 2 67
8 2 72
9 2 83
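If the one-liner mask is hard to read, the same id filter can be written step by step (a sketch assuming the dfA, dfB, dfC frames above, each with an id column):

# Intersect the id sets of all frames, one frame at a time.
common_ids = set(list_df[0]["id"])
for frame in list_df[1:]:
    common_ids &= set(frame["id"])
out = out[out["id"].isin(common_ids)]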
After hours of trying I finally found a way using some of @Lucas M. Uriarte's logic, thanks a lot for that!
df1 = pd.DataFrame({"stay_id": [2, 2, 3, 3], "value": [65, 67, 74, 65]}).set_index("stay_id")
df2 = pd.DataFrame({"stay_id": [1, 1, 2, 2], "value": [80, 78, 76, 79]}).set_index("stay_id")
# Collect the patient ids present in each frame and keep only the common ones.
df1_patients = set(df1.index.values)
df2_patients = set(df2.index.values)
patients = list(set.intersection(df1_patients, df2_patients))
reduced_df1 = df1.loc[patients]
reduced_df2 = df2.loc[patients]
reduced_df1.sort_index(inplace=True)
reduced_df2.sort_index(inplace=True)
# Both frames now share an identical index, so the values align row by row.
data = reduced_df1.copy()
data["value2"] = reduced_df2["value"]
As far as I can see, this ensures that only the entries present in both data frames are kept, and it matches the values row by row in this scenario.
I modified the answer according to the comments exchanged under the question: you need a unique identifier to merge on. Since every patient has the same number of measurements, the combination of "number_measurement" and "stay_id" works. Consider, for example, the following modification of the dataframes:
df1 = pd.DataFrame({"stay_id": [2, 2, 3, 3], "value": [65, 67, 74, 65],
                    "number_measurement": ["measurement_1", "measurement_2", "measurement_1", "measurement_2"]})
df2 = pd.DataFrame({"stay_id": [1, 1, 2, 2], "value": [80, 78, 76, 79],
                    "number_measurement": ["measurement_1", "measurement_2", "measurement_1", "measurement_2"]})
output = pd.merge(df1, df2, on=["stay_id", "number_measurement"])
print(output)
Output:
stay_id value_x number_measurement value_y
0 2 65 measurement_1 76
1 2 67 measurement_2 79
Now just drop the number_measurement column:
output.drop("number_measurement", axis=1)
stay_id value_x value_y
0 2 65 76
1 2 67 79
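Rather than typing the measurement labels by hand, the per-patient measurement number can also be generated with groupby().cumcount(). A minimal sketch of that variation, with the value columns renamed for clarity:

import pandas as pd

df1 = pd.DataFrame({"stay_id": [2, 2, 3, 3], "value1": [65, 67, 74, 65]})
df2 = pd.DataFrame({"stay_id": [1, 1, 2, 2], "value2": [80, 78, 76, 79]})

# Number the rows within each stay_id (0, 1, ...) so measurements pair up.
df1["n"] = df1.groupby("stay_id").cumcount()
df2["n"] = df2.groupby("stay_id").cumcount()

# An inner merge keeps only the stay_ids present in both frames.
result = df1.merge(df2, on=["stay_id", "n"]).drop(columns="n")
print(result)
#    stay_id  value1  value2
# 0        2      65      76
# 1        2      67      79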

How to merge every two columns, with pandas, substituting only if the left column value is nan or 0 [duplicate]

This question already has answers here:
Efficiently replace values from a column to another column Pandas DataFrame
(5 answers)
Python Pandas replace NaN in one column with value from corresponding row of second column
(7 answers)
Closed 10 months ago.
I have 2n columns and each pair looks like this:
1 0
2 0
45 1
44 10
43 22
0 55
0 46
0 75
I want to turn each pair of columns into a single column, where a 0 or NaN in the left column is replaced by the value from the right column.
In this example the result would be:
1
2
45
44
43
55
46
75
And it is important that this is done for every pair of columns in the dataframe.
Try this:
import pandas as pd
import numpy as np

d = {'col1': [1, 2, 45, 44, 43, 0, 0, 0, 2],
     'col2': [0, 0, 1, 10, 22, 55, 46, 75, np.nan]}
df = pd.DataFrame(data=d)
df = df.replace(np.nan, 0)
# Where col1 is 0, take col2; otherwise keep col1.
df['col2'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
First, a dataframe is created from the dictionary. Then every 0 is replaced by np.nan. This has the advantage that you can afterwards use fillna() to replace each np.nan value with the corresponding value from col2.
import pandas as pd
import numpy as np

d = {'col1': [0, 0, 1, 10, 22, 55, 46, 34, np.nan],
     'col2': [1, 2, 45, 44, 43, 0, 0, 0, 2]}
df = pd.DataFrame(d)
df.replace(0, np.nan, inplace=True)
# Fill the gaps in col1 from col2.
df['col1'].fillna(df['col2']).to_frame()
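Both answers handle a single pair of columns, but the question asks for this over every pair in a 2n-column frame. One way to loop over the pairs, as a sketch with made-up column names:

import numpy as np
import pandas as pd

# Hypothetical frame with two pairs: (a, a_alt) and (b, b_alt).
df = pd.DataFrame({
    "a":     [1, 2, 0, np.nan],
    "a_alt": [9, 9, 5, 7],
    "b":     [0, 3, np.nan, 4],
    "b_alt": [8, 1, 6, 2],
})

merged = {}
# Walk the columns two at a time: each left column with its right partner.
for left, right in zip(df.columns[::2], df.columns[1::2]):
    merged[left] = df[left].replace(0, np.nan).fillna(df[right])
result = pd.DataFrame(merged)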

how to delete rows based on multiple columns and condition with pandas and python? [duplicate]

This question already has answers here:
How to filter this dataframe?
(3 answers)
Closed 11 months ago.
The data:
Consider this sample dataset:
https://docs.google.com/spreadsheets/d/17Xjc81jkjS-64B4FGZ06SzYDRnc6J27m/edit#gid=1176233701
How do I delete rows based on a condition over multiple columns?
I am filtering the data based on a thread I asked earlier: How to filter this dataframe?
The solution in that thread ended up with errors, and I want to filter the data based on the Edit section in that thread.
You can combine filters using the & operator, like so:
import numpy as np
import pandas as pd

# Dataframe with random values in range [0, 100) with shape (100, 4)
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))

# Example filters
filter1 = df['A'] > 10
filter2 = df['B'] > 90
filter3 = df['C'] > df['D']

# Keep only the rows matching all three filters
df[filter1 & filter2 & filter3]
OUTPUT:
A B C D
0 51 92 73 36
17 73 95 77 20
91 88 95 79 54
95 68 99 68 40
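Note that the expression above keeps the matching rows. Since the question asks about deleting rows, invert the combined mask with ~; a short sketch using the same filters:

# Drop the rows where all three conditions hold; everything else stays.
mask = filter1 & filter2 & filter3
df_cleaned = df[~mask]
# Equivalent: df.drop(df[mask].index)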

Merging two dataframes while considering overlaps and missing indexes [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 1 year ago.
I have multiple dataframes, each with an ID and a value, and I am trying to merge them such that each ID has all the values in its row.
ID  Value
1   10
3   21
4   12
5   43
7   11
And then I have another dataframe:
ID  Value2
1   12
2   14
4   55
6   23
7   90
I want to merge these two so that an ID already in the first dataframe gets its Value2 filled in, and an ID that is only in the second dataframe is added as a new row with Value2 set and Value left empty. This is what my result would look like:
ID  Value  Value2
1   10     12
3   21     -
4   12     55
5   43     -
7   11     90
2   -      14
6   -      23
Hope this makes sense. I don't really care about the order of the ID numbers; they can be sorted or not. My goal is to be able to create a dictionary for each ID with "Value", "Value2", "Value3", ... as keys and the corresponding value numbers as the values. Please let me know if any clarification is needed.
You can use pandas' merge method (see here for the help page):
import pandas as pd
df1.merge(df2, how='outer', on='ID')
Specifying how='outer' uses the union of keys from both dataframes.
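For the stated goal of building one dictionary per ID, the merged frame converts directly with to_dict; a minimal sketch (assuming IDs are unique after the merge; missing values come through as NaN):

import pandas as pd

df1 = pd.DataFrame({"ID": [1, 3, 4, 5, 7], "Value": [10, 21, 12, 43, 11]})
df2 = pd.DataFrame({"ID": [1, 2, 4, 6, 7], "Value2": [12, 14, 55, 23, 90]})
merged = df1.merge(df2, how='outer', on='ID')

# {1: {'Value': 10.0, 'Value2': 12.0}, 2: {'Value': nan, 'Value2': 14.0}, ...}
per_id = merged.set_index('ID').to_dict(orient='index')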

groupby with multiple columns with addition and frequency counts in pandas [duplicate]

This question already has answers here:
Multiple aggregations of the same column using pandas GroupBy.agg()
(4 answers)
Closed 4 years ago.
I have a table that looks as follows:
name type val
A online 12
B online 24
A offline 45
B online 32
A offline 43
B offline 44
I want to group the dataframe by the two columns name and type, adding a column with the count of records in each group while val is summed within the group. It should look as follows:
name type count val
A online 1 12
offline 2 88
B online 2 56
offline 1 44
I have tried df.groupby(['name', 'type'])['val'].sum(), which gives the sum, but I am unable to add the count of records.
Add the parameter sort=False to groupby to avoid the default sorting, aggregate with agg using tuples of (new column name, aggregation function), and finally use reset_index to convert the MultiIndex back to columns:
df1 = (df.groupby(['name', 'type'], sort=False)['val']
.agg([('count', 'count'),('val', 'sum')])
.reset_index())
print(df1)
name type count val
0 A online 1 12
1 B online 2 56
2 A offline 2 88
3 B offline 1 44
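On pandas 0.25 or newer, the same result can be written with named aggregation, which avoids the tuple syntax; a quick sketch:

df1 = (df.groupby(['name', 'type'], sort=False)
         .agg(count=('val', 'count'), val=('val', 'sum'))
         .reset_index())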
You can try pivoting, i.e.
df.pivot_table(index=['name','type'],aggfunc=['count','sum'],values='val')
count sum
val val
name type
A offline 2 88
online 1 12
B offline 1 44
online 2 56
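The pivot_table result has MultiIndex columns (('count', 'val') and ('sum', 'val')). To get the flat shape shown in the question, one option (a sketch) is to rename and reset:

pt = df.pivot_table(index=['name', 'type'], aggfunc=['count', 'sum'], values='val')
pt.columns = ['count', 'val']  # flatten the MultiIndex columns
pt = pt.reset_index()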
