I've been trying to turn this
| row_id | col_id |
|--------|--------|
| 1 | 23 |
| 4 | 45 |
| ... | ... |
| 1 | 23 |
| ... | ... |
| 4 | 45 |
| ... | ... |
| 4 | 45 |
| ... | ... |
Into this
| row_id | col_id | count |
|--------|--------|---------|
| 1 | 23 | 2 |
| 4 | 45 | 3 |
| ... | ... | ... |
So all (row_i, col_j) occurrences get summed up in the 'count' column. Note that row_id and col_id won't be unique in either case.
No success so far, at least not if I want to stay efficient. I could iterate over each pair and add up the occurrences, but there has to be a simpler way in pandas (or numpy, for that matter).
Thanks!
EDIT 1:
As #j-bradley suggested, I tried the following
# I use django-pandas
rdf = Record.objects.to_dataframe(['row_id', 'column_id'])
_ = rdf.groupby(['row_id', 'column_id'])['row_id'].count().head(20)
_.head(10)
And that outputs
row_id  column_id
1       108          1
        168          1
        218          1
        398          2
        422          1
10      35           2
        355          1
        489          1
100     352          1
        366          1
Name: row_id, dtype: int64
This seems ok. But it's a Series object and I'm not sure how to turn this into a dataframe with the required three columns. Pandas noob, as it seems. Any tips?
Thanks again.
You can group by columns A and B and call count on the groupby object:
df = pd.DataFrame({'A':[1,4,1,4,4], 'B':[23,45,23,45,45]})
df.groupby(['A','B'])['A'].count()
returns:
A B
1 23 2
4 45 3
Edited to make the answer more explicit
To turn the Series back into a DataFrame with a column named count:
_ = df.groupby(['A','B'])['A'].count()
The name of the Series becomes the column name:
_.name = 'count'
Resetting the index promotes the MultiIndex levels to columns and turns the Series into a DataFrame:
df = _.reset_index()
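Putting the pieces together on the sample data, a minimal sketch (assuming pandas is imported as pd): groupby(...).size() counts rows per group and reset_index(name='count') names the column in one step, which for this data gives the same result as counting column 'A'.
import pandas as pd

# Sample pairs mirroring the question's (row_id, col_id) data
df = pd.DataFrame({'A': [1, 4, 1, 4, 4], 'B': [23, 45, 23, 45, 45]})

# Count occurrences of each (A, B) pair and promote the result to a DataFrame
counts = df.groupby(['A', 'B']).size().reset_index(name='count')
print(counts)
#    A   B  count
# 0  1  23      2
# 1  4  45      3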
I'm trying to count the unique values per group in a dataset and add them as a new column to another table. The first approach below works; the second produces wrong values.
When I use the following code
unique_id_per_column = source_table.groupby("disease").some_id.nunique()
I'll get
| | disease | some_id |
|---:|:------------------------|--------:|
| 0 | disease1 | 121 |
| 1 | disease2 | 1 |
| 2 | disease3 | 5 |
| 3 | disease4 | 9 |
| 4 | disease5 | 77 |
These numbers seem to check out, but I want to add them to another table where I have already a column with all values per group.
So I used the following code
table["unique_ids"] = source_table.groupby("disease").uniqe_id.transform("nunique")
and I get the following table, with wrong numbers for every row except the first.
| | disease |some_id | unique_ids |
|---:|:------------------------|-------:|------------------:|
| 0 | disease1 | 151 | 121 |
| 1 | disease2 | 1 | 121 |
| 2 | disease3 | 5 | 121 |
| 3 | disease4 | 9 | 121 |
| 4 | disease5 | 91 | 121 |
I expected to get the same results as in the first table. Does anyone know why the number from the first row is repeated instead of the correct numbers?
Use Series.map if you need to create the column in another DataFrame:
s = source_table.groupby("disease").some_id.nunique()
table["unique_ids"] = table["disease"].map(s)
I'm looking for a more elegant way of doing this, other than a for-loop and unpacking manually...
Imagine I have a dataframe that looks like this
| id | value | date | name |
| -- | ----- | ---------- | ---- |
| 1 | 5 | 2021-04-05 | foo |
| 1 | 6 | 2021-04-06 | foo |
| 5 | 7 | 2021-04-05 | bar |
| 5 | 9 | 2021-04-06 | bar |
If I wanted to dimensionalize this, I could split it up into two different tables. One, perhaps, would contain "meta" information about the person, and the other would serve as "records" that all relate back to one person... a pretty simple idea as far as SQL-ian ideas go...
The resulting tables would look like this...
Meta
| id | name |
| -- | ---- |
| 1 | foo |
| 5 | bar |
Records
| id | value | date |
| -- | ----- | ---------- |
| 1 | 5 | 2021-04-05 |
| 1 | 6 | 2021-04-06 |
| 5 | 7 | 2021-04-05 |
| 5 | 9 | 2021-04-06 |
My question is, how can I achieve this "dimensionalizing" of a dataframe with pandas, without having to write a for loop on the unique id key field and unpacking manually?
Think about this not as "splitting" the existing dataframe, but as creating two new dataframes from the original. You can do this in a couple of lines:
meta = df[['id','name']].drop_duplicates() #Select the relevant columns and remove duplicates
records = df.drop("name", axis=1) #Replicate the original dataframe but drop the name column
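For example, applied to the sample frame above (a quick sketch, assuming pandas is imported as pd):
import pandas as pd

df = pd.DataFrame({
    'id':    [1, 1, 5, 5],
    'value': [5, 6, 7, 9],
    'date':  ['2021-04-05', '2021-04-06', '2021-04-05', '2021-04-06'],
    'name':  ['foo', 'foo', 'bar', 'bar'],
})

meta = df[['id', 'name']].drop_duplicates()   # one row per person
records = df.drop('name', axis=1)             # all rows, minus the meta column
print(meta)
#    id name
# 0   1  foo
# 2   5  bar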
You could drop_duplicates based on a subset of columns and then keep only those columns. For the second dataframe, you can drop the name column:
df1 = df.drop_duplicates(['id', 'name']).loc[:,['id', 'name']] # perigon's answer is simpler with df[['id','name']].drop_duplicates()
df2 = df.drop('name', axis=1)
df1, df2
Output:
( id name
0 1 foo
2 5 bar,
id value date
0 1 5 2021-04-05
1 1 6 2021-04-06
2 5 7 2021-04-05
3 5 9 2021-04-06)
Assume 2 data frames (df_a and df_b). I want to traverse them row-wise and check for an exact match in the Value column (for the same Name). If a match is found, I want the index of the matched row in df_b to be added to df_a.
df_a
| Index | Name | Value |
|-------|------|-------|
| 1 | Bon | 124 |
| 2 | Bon | 412 |
| 3 | Jaz | 634 |
| 4 | Cal | 977 |
| 5 | Cal | 412 |
| 6 | Bon | 412 |
df_b
| Index | Name | Value |
|-------|------|-------|
| 1 | Cal | 977 |
| 2 | Jaz | 634 |
| 3 | Lan | 650 |
| 4 | Bon | 412 |
Expected Output df
| Index | Name | Value | Index_in_df_b |
|-------|------|-------|---------------|
| 1 | Bon | 124 | Unmatched |
| 2 | Bon | 412 | 4 |
| 3 | Jaz | 634 | 2 |
| 4 | Cal | 977 | 1 |
| 5 | Cal | 412 | Unmatched |
| 6 | Bon | 412 | Unmatched |
Existing Solution:
Create a column --> df_a['Index_in_df_b'] = 'Unmatched'
Then I tried 3 approaches:
I started with iterrows, which took a lot of time to process, so we shifted to .loc. That took about 20 minutes to process data frames with over 7 columns and around 15,000 rows each. We then moved to .at, which seems to be by far the best way to process it: ~3 minutes on the same data frames. This is the current solution.
for index_a in df_a.index:
    for index_b in df_b.index:
        if df_a.at[index_a, 'Name'] == df_b.at[index_b, 'Name']:
            # Processing logic to check for value
I'm not sure whether apply can be used, since the row-wise details of both data frames are needed, and I'm also unsure about vectorization methods. Is there a faster way to approach this problem, or is the current solution appropriate?
If you need to match on both the Name and Value columns, use DataFrame.merge with a left join and convert the index of df_b to a column Index_in_df_b:
df2 = df_b.rename_axis('Index_in_df_b').reset_index()
df = df_a.merge(df2, on=['Name','Value'], how='left').fillna({'Index_in_df_b':'Unmatched'})
print (df)
Name Value Index_in_df_b
0 Bon 124 Unmatched
1 Bon 412 4
2 Jaz 634 2
3 Cal 977 1
4 Cal 412 Unmatched
If you need to match only on the Value column, the output is different for the sample data:
df2 = df_b.rename_axis('Index_in_df_b').reset_index()[['Index_in_df_b','Value']]
df = df_a.merge(df2, on='Value', how='left').fillna({'Index_in_df_b':'Unmatched'})
print (df)
Name Value Index_in_df_b
0 Bon 124 Unmatched
1 Bon 412 4
2 Jaz 634 2
3 Cal 977 1
4 Cal 412 4
If you need to match only on the Name column, the output is different for the sample data:
df2 = df_b.rename_axis('Index_in_df_b').reset_index()[['Index_in_df_b','Name']]
df = df_a.merge(df2, on='Name', how='left').fillna({'Index_in_df_b':'Unmatched'})
print (df)
Name Value Index_in_df_b
0 Bon 124 4
1 Bon 412 4
2 Jaz 634 2
3 Cal 977 1
4 Cal 412 1
I have two Spark DataFrames, with values that I would like to add, and then multiply, and keep the lowest pair of values only. I have written a function that will do this:
def math_func(aValOne, aValTwo, bValOne, bValTwo):
    tmpOne = aValOne + bValOne
    tmpTwo = aValTwo + bValTwo
    final = tmpOne * tmpTwo
    return final
I would like to iterate through two Spark DataFrames, "A" and "B", row by row, and keep the lowest-value results. So if I have two DataFrames:
DataFrameA:
ID | ValOne | ValTwo
0 | 2 | 4
1 | 3 | 6
DataFrameB:
ID | ValOne | ValTwo
0 | 4 | 5
1 | 7 | 9
I would like to first take row 0 from DataFrameA, compare it to rows 0 and 1 of DataFrameB, and then keep the lowest-value result. I have tried this:
results = DataFrameA.select('ID')(lambda i: DataFrameA.select('ID')(math_func(DataFrameA.ValOne, DataFrameA.ValTwo, DataFrameB.ValOne, DataFrameB.ValOne))
but I get errors about iterating through a DataFrame column. I know that in Pandas I would essentially make a nested "for loop" and then append the results to another DataFrame. The results I would expect are:
Initial Results:
DataFrameA_ID | Value | DataFrameB_ID
0 | 54 | 0
0 | 117 | 1
1 | 77 | 0
1 | 150 | 1
Final Results:
DataFrameA_ID | Value | DataFrameB_ID
0 | 54 | 0
1 | 77 | 0
I am quite new to Spark, but I know enough to know I'm not approaching this the right way.
Any thoughts on how to go about this?
You will need multiple steps to achieve this.
Suppose you have data
DFA:
ID | ValOne | ValTwo
0 | 2 | 4
1 | 3 | 6
DFB:
ID | ValOne | ValTwo
0 | 4 | 5
1 | 7 | 9
Step 1.
Do a cartesian join on your 2 dataframes. That will give you:
Cartesian:
DFA.ID | DFA.ValOne | DFA.ValTwo | DFB.ID | DFB.ValOne | DFB.ValTwo
0 | 2 | 4 | 0 | 4 | 5
1 | 3 | 6 | 0 | 4 | 5
0 | 2 | 4 | 1 | 7 | 9
1 | 3 | 6 | 1 | 7 | 9
Step 2.
Multiply columns:
Multiplied:
DFA.ID | DFA.Mul | DFB.ID | DFB.Mul
0 | 8 | 0 | 20
1 | 18 | 0 | 20
0 | 8 | 1 | 63
1 | 18 | 1 | 63
Step 3.
Group by DFA.ID and select min from DFA.Mul and DFB.Mul
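If the goal is the Initial/Final Results shown in the question (applying math_func to every A-B pair and keeping the minimum per DataFrameA ID), a rough PySpark sketch of the three steps might look like the following; the names and the window trick used to retain DFB.ID are illustrative, not taken from the original answer.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

dfa = spark.createDataFrame([(0, 2, 4), (1, 3, 6)], ["ID", "ValOne", "ValTwo"])
dfb = spark.createDataFrame([(0, 4, 5), (1, 7, 9)], ["ID", "ValOne", "ValTwo"])

# Step 1: cartesian join - every row of A paired with every row of B
paired = dfa.alias("a").crossJoin(dfb.alias("b"))

# Step 2: compute the question's math_func column-wise
scored = paired.select(
    F.col("a.ID").alias("DataFrameA_ID"),
    F.col("b.ID").alias("DataFrameB_ID"),
    ((F.col("a.ValOne") + F.col("b.ValOne")) *
     (F.col("a.ValTwo") + F.col("b.ValTwo"))).alias("Value"),
)

# Step 3: keep only the row with the lowest Value per DataFrameA_ID
w = Window.partitionBy("DataFrameA_ID").orderBy("Value")
result = (scored.withColumn("rn", F.row_number().over(w))
                .filter(F.col("rn") == 1)
                .drop("rn"))
result.show()
# Expected to keep (0, 0, 54) and (1, 0, 77), matching the Final Results above.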
Original DataFrame:
+----+----------+----------+----------+----------+
| ID | var1hrs | var2hrs | ind1var | ind2var |
+----+----------+----------+----------+----------+
| 1 | 55 | 45 | 123 | 456 |
| 2 | 48 | 60 | 331 | 222 |
+----+----------+----------+----------+----------+
Target DataFrame:
+----+------------+------+------+
| ID | type | hrs | ind |
+----+------------+------+------+
| 1 | primary | 55 | 123 |
| 1 | secondary | 45 | 456 |
| 2 | primary | 48 | 331 |
| 2 | secondary | 60 | 222 |
+----+------------+------+------+
How would I go about melting multiple groups of variables into a single label column? The "1" in the variable names indicates type = "primary" and the "2" indicates type = "secondary".
After modifying the column names, we can use wide_to_long:
df.columns = df.columns.str[:4]
s = pd.wide_to_long(df, ['var', 'ind'], i='ID', j='type').reset_index()
s = s.assign(type=s.type.map({'1': 'primary', '2': 'secondary'})).sort_values('ID')
s
ID type var ind
0 1 primary 55 123
2 1 secondary 45 456
1 2 primary 48 331
3 2 secondary 60 222
(Comments inlined)
# set ID as the index and sort columns
df = df.set_index('ID').sort_index(axis=1)
# extract primary columns
prim = df.filter(like='1')
prim.columns = ['ind', 'vars']
# extract secondary columns
sec = df.filter(like='2')
sec.columns = ['ind', 'vars']
# concatenation + housekeeping
v = (pd.concat([prim, sec], keys=['primary', 'secondary'])
.swaplevel(0, 1)
.rename_axis(['ID', 'type'])
.reset_index()
)
print(v)
ID type ind vars
0 1 primary 123 55
1 2 primary 331 48
2 1 secondary 456 45
3 2 secondary 222 60
This is more or less one efficient way of doing it, even if the steps are a bit involved.