Altering the Shape of a Pandas Dataframe Without Losing Any Datapoints - python

I have a pandas dataframe (df) of the following format:
+------+-------+-------+
| Zone | Group | Count |
+------+-------+-------+
|  897 |     1 |    78 |
|  897 |     2 |    49 |
|  897 |     3 |    23 |
|  482 |     1 |   157 |
|  482 |     2 |    57 |
|  482 |     3 |    28 |
+------+-------+-------+
I would like to alter the dataframe so that there exists only one row per Zone. The output would be...
+------+----------+----------+----------+
| Zone | Count_G1 | Count_G2 | Count_G3 |
+------+----------+----------+----------+
|  897 |       78 |       49 |       23 |
|  482 |      157 |       57 |       28 |
+------+----------+----------+----------+
In terms of generating the new column names, I think the best method would be to use some automated counter-based method. I have provided sample data, but the actual problem I am working on has hundreds of rows of data to be transformed in this manner.
The following post addresses one approach to naming new columns based on dictionaries, which would be a less than ideal approach in this case.
Renaming columns of a pandas dataframe without column names
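One common way to do this reshaping is DataFrame.pivot. Below is a minimal sketch on the sample data above, with the Count_G&lt;n&gt; names generated automatically from the Group values (the variable names are my own, not from the question):

import pandas as pd

df = pd.DataFrame({
    'Zone': [897, 897, 897, 482, 482, 482],
    'Group': [1, 2, 3, 1, 2, 3],
    'Count': [78, 49, 23, 157, 57, 28],
})

# One row per Zone, one column per Group
wide = df.pivot(index='Zone', columns='Group', values='Count')

# Generate the new column names from the Group values
wide.columns = [f'Count_G{g}' for g in wide.columns]
wide = wide.reset_index()
print(wide)

Note that pivot raises an error if a Zone/Group combination appears more than once; pivot_table with an aggregation function handles duplicates.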

Related

Getting different Values when using groupby(column)["id"].nunique and trying to add a column using transform

I'm trying to count the unique values per group in a dataset and add them as a new column to another table. The first approach below works, but the second produces wrong values.
When I use the following code
unique_id_per_column = source_table.groupby("disease").some_id.nunique()
I'll get
| | disease | some_id |
|---:|:------------------------|--------:|
| 0 | disease1 | 121 |
| 1 | disease2 | 1 |
| 2 | disease3 | 5 |
| 3 | disease4 | 9 |
| 4 | disease5 | 77 |
These numbers seem to check out, but I want to add them to another table, where I already have a column with all values per group.
So I used the following code
table["unique_ids"] = source_table.groupby("disease").uniqe_id.transform("nunique")
and I get the following table, with wrong numbers for every row except the first.
| | disease |some_id | unique_ids |
|---:|:------------------------|-------:|------------------:|
| 0 | disease1 | 151 | 121 |
| 1 | disease2 | 1 | 121 |
| 2 | disease3 | 5 | 121 |
| 3 | disease4 | 9 | 121 |
| 4 | disease5 | 91 | 121 |
I expected to get the same results as in the first table. Does anyone know why I get the number from the first row repeated instead of the correct numbers?
Use Series.map if you need to create the column in another DataFrame:
s = source_table.groupby("disease").some_id.nunique()
table["unique_ids"] = table["disease"].map(s)

What is the efficient way to perform row wise match in pandas?

Assume two data frames (df_a and df_b). I want to traverse them row-wise and check for an exact match in the Value column. If a match is found, I want the index of the matched row to be added to df_a.
df_a
| Index | Name | Value |
|-------|------|-------|
| 1 | Bon | 124 |
| 2 | Bon | 412 |
| 3 | Jaz | 634 |
| 4 | Cal | 977 |
| 5 | Cal | 412 |
| 6 | Bon | 412 |
df_b
| Index | Name | Value |
|-------|------|-------|
| 1 | Cal | 977 |
| 2 | Jaz | 634 |
| 3 | Lan | 650 |
| 4 | Bon | 412 |
Expected Output df
| Index | Name | Value | Index_in_df_b |
|-------|------|-------|---------------|
| 1 | Bon | 124 | Unmatched |
| 2 | Bon | 412 | 4 |
| 3 | Jaz | 634 | 2 |
| 4 | Cal | 977 | 1 |
| 5 | Cal | 412 | Unmatched |
| 6 | Bon | 412 | Unmatched |
Existing Solution:
Create a column --> df_a['Index_in_df_b'] = 'Unmatched'
Then I had 3 solutions:
1. iterrows: this took a lot of time to process, so we shifted to using .loc.
2. .loc: this took about 20 minutes to process data frames with over 7 columns and around 15000 rows in each of them.
3. .at: this seems to be by far the best way, taking ~3 minutes to process the same data frames. This is the current solution.
for index_a in df_a.index:
    for index_b in df_b.index:
        if df_a.at[index_a, 'Name'] == df_b.at[index_b, 'Name']:
            # Processing logic to check for value
            ...
I'm not sure whether apply can be used, since two data frames and their row-wise details are needed, and I'm also not sure about vectorization methods. Is there a faster way to approach this problem, or is the current solution appropriate?
If you need to match on both the Name and Value columns, use DataFrame.merge with a left join and convert the index to a column Index_in_df_b:
df2 = df_b.rename_axis('Index_in_df_b').reset_index()
df = df_a.merge(df2, on=['Name','Value'], how='left').fillna({'Index_in_df_b':'Unmatched'})
print(df)
  Name  Value Index_in_df_b
0  Bon    124     Unmatched
1  Bon    412             4
2  Jaz    634             2
3  Cal    977             1
4  Cal    412     Unmatched
5  Bon    412             4
If you need to match only on the Value column, the output differs for the sample data:
df2 = df_b.rename_axis('Index_in_df_b').reset_index()[['Index_in_df_b','Value']]
df = df_a.merge(df2, on='Value', how='left').fillna({'Index_in_df_b':'Unmatched'})
print(df)
  Name  Value Index_in_df_b
0  Bon    124     Unmatched
1  Bon    412             4
2  Jaz    634             2
3  Cal    977             1
4  Cal    412             4
5  Bon    412             4
If you need to match only on the Name column, the output again differs for the sample data:
df2 = df_b.rename_axis('Index_in_df_b').reset_index()[['Index_in_df_b','Name']]
df = df_a.merge(df2, on='Name', how='left').fillna({'Index_in_df_b':'Unmatched'})
print(df)
  Name  Value Index_in_df_b
0  Bon    124             4
1  Bon    412             4
2  Jaz    634             2
3  Cal    977             1
4  Cal    412             1
5  Bon    412             4
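For reference, here is a minimal setup under which the snippets above run as shown, assuming the Index column from the question is each frame's index rather than a regular column. The merge replaces the nested .at loop with a single vectorized join, which is why it scales far better on thousands of rows:

import pandas as pd

df_a = pd.DataFrame(
    {'Name': ['Bon', 'Bon', 'Jaz', 'Cal', 'Cal', 'Bon'],
     'Value': [124, 412, 634, 977, 412, 412]},
    index=[1, 2, 3, 4, 5, 6],
)
df_b = pd.DataFrame(
    {'Name': ['Cal', 'Jaz', 'Lan', 'Bon'],
     'Value': [977, 634, 650, 412]},
    index=[1, 2, 3, 4],
)

# Turn df_b's index into a column so it survives the merge
df2 = df_b.rename_axis('Index_in_df_b').reset_index()
df = df_a.merge(df2, on=['Name', 'Value'], how='left').fillna({'Index_in_df_b': 'Unmatched'})
print(df)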

plotting the data from csv file in python

| 5.30-420462 | 100 | SAT-Synergy-gen2 |
| 5.30-42     |  92 | Scale            |
| 5.30-423    |  90 | Scale            |
| 5.30-420    |  76 | Scale            |
| 5.30-420462 |  85 | Scale            |
| 5.30-4205   |  88 | Scale            |
| 5.30-420664 |  88 | Scale            |
| 5.30-421187 |  90 | Scale            |
| 5.30-421040 |  93 | Scale            |
| 5.30-421225 | 100 | Scale-DCS-VET    |
| 5.30-421069 | 100 | UPT_C7000        |
| 5.30-420664 |   0 | UPT_C7000        |
| 5.30-421040 | 100 | UPT_C7000        |
| 5.30-420693 | 100 | UPT_C7000        |
| 5.30-420543 |  88 | UPT_C7000        |
| 5.30-421225 |  76 | UPT_C7000        |
| 5.30-420462 |  96 | UPT_C7000        |
The above is the data from the database, stored in a CSV file. I want to use the first and second columns for plotting the graph, with the third column serving as the reference (series label) for the first and second columns. Can someone help me plot this data using pandas or any other module?
Try using matplotlib for the plotting and pandas for reading the data in. Then you can set the labels and other options.
Reading in the data from your file -> https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Plotting the data -> https://datatofish.com/plot-dataframe-pandas/
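A minimal sketch along those lines; the file name, column names, and delimiter are assumptions, since the data above has no header row:

import pandas as pd
import matplotlib.pyplot as plt

# Read the three unnamed columns; adjust sep/names to match the real file
df = pd.read_csv('data.csv', header=None, names=['build', 'score', 'suite'])

# One line per suite (third column), build on the x-axis, score on the y-axis
fig, ax = plt.subplots()
for suite, grp in df.groupby('suite'):
    ax.plot(grp['build'], grp['score'], marker='o', label=suite)

ax.set_xlabel('build')
ax.set_ylabel('score')
ax.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()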

How to create duplicate rows based on columns?

Consider this data frame
| order number | Item               | column 0 | column 1 | column 2 |
|--------------|--------------------|----------|----------|----------|
| 12           | [abcd][efgh]       | [abcd]   | [efgh]   |          |
| 34           | [mnop]             |          | [mnop]   |          |
| 56           | [xyzz][zzyx][mnoq] | [xyzz]   | [zzyx]   | [mnoq]   |
How do I turn it into?
| order number | Item               | column 0 |
|--------------|--------------------|----------|
| 12           | [abcd][efgh]       | [abcd]   |
| 12           | [abcd][efgh]       | [efgh]   |
| 34           | [mnop]             | [mnop]   |
| 56           | [xyzz][zzyx][mnoq] | [xyzz]   |
| 56           | [xyzz][zzyx][mnoq] | [zzyx]   |
| 56           | [xyzz][zzyx][mnoq] | [mnoq]   |
This is my first time posting on Stack Overflow, so apologies for any mistakes. I've tried searching but have not had any luck with this kind of problem. Any help is really appreciated.
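One possible sketch: assuming the column 0/1/2 cells already hold the split-out values as shown, melting those columns into rows and dropping the empty cells produces the desired shape. The frame below is reconstructed from the question's tables, so the names and values are only illustrative:

import pandas as pd

df = pd.DataFrame({
    'order number': [12, 34, 56],
    'Item': ['[abcd][efgh]', '[mnop]', '[xyzz][zzyx][mnoq]'],
    'column 0': ['[abcd]', None, '[xyzz]'],
    'column 1': ['[efgh]', '[mnop]', '[zzyx]'],
    'column 2': [None, None, '[mnoq]'],
})

# Melt the value columns into rows, drop empty cells, and keep one value per row
out = (
    df.melt(id_vars=['order number', 'Item'], value_name='value')
      .dropna(subset=['value'])
      .drop(columns='variable')
      .rename(columns={'value': 'column 0'})
      .sort_values('order number')
      .reset_index(drop=True)
)
print(out)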

Need to aggregate count(rowid, colid) on dataframe in pandas

I've been trying to turn this
| row_id | col_id |
|--------|--------|
| 1 | 23 |
| 4 | 45 |
| ... | ... |
| 1 | 23 |
| ... | ... |
| 4 | 45 |
| ... | ... |
| 4 | 45 |
| ... | ... |
Into this
| row_id | col_id | count |
|--------|--------|---------|
| 1 | 23 | 2 |
| 4 | 45 | 3 |
| ... | ... | ... |
So the number of occurrences of each (row_id, col_id) pair ends up in the 'count' column. Note that row_id and col_id won't be unique in either case.
No success until now, at least not efficiently. I can iterate over each pair and add up occurrences, but there has to be a simpler way in pandas, or numpy for that matter.
Thanks!
EDIT 1:
As @j-bradley suggested, I tried the following
# I use django-pandas
rdf = Record.objects.to_dataframe(['row_id', 'column_id'])
_ = rdf.groupby(['row_id', 'column_id'])['row_id'].count().head(20)
_.head(10)
And that outputs
row_id  column_id
1       108          1
        168          1
        218          1
        398          2
        422          1
10      35           2
        355          1
        489          1
100     352          1
        366          1
Name: row_id, dtype: int64
This seems ok. But it's a Series object and I'm not sure how to turn this into a dataframe with the required three columns. Pandas noob, as it seems. Any tips?
Thanks again.
You can group by columns A and B and call count on the groupby object:
df = pd.DataFrame({'A':[1,4,1,4,4], 'B':[23,45,23,45,45]})
df.groupby(['A','B'])['A'].count()
returns:
A  B
1  23    2
4  45    3
Edited to make the answer more explicit
To turn the series back into a dataframe with a column named Count:
_ = df.groupby(['A','B'])['A'].count()
the name of the series becomes the column name:
_.name = 'Count'
Resetting the index promotes the multi-index levels to columns and turns the series into a dataframe:
df = _.reset_index()
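For reference, the steps above combined into one runnable snippet, using the answer's sample column names:

import pandas as pd

df = pd.DataFrame({'A': [1, 4, 1, 4, 4], 'B': [23, 45, 23, 45, 45]})

counts = df.groupby(['A', 'B'])['A'].count()
counts.name = 'Count'          # the Series name becomes the new column name
result = counts.reset_index()  # the multi-index levels become the A and B columns
print(result)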
