How to dimensionalize a pandas dataframe - python

I'm looking for a more elegant way of doing this, other than a for-loop and unpacking manually...
Imagine I have a dataframe that looks like this
| id | value | date | name |
| -- | ----- | ---------- | ---- |
| 1 | 5 | 2021-04-05 | foo |
| 1 | 6 | 2021-04-06 | foo |
| 5 | 7 | 2021-04-05 | bar |
| 5 | 9 | 2021-04-06 | bar |
If I wanted to dimensionalize this, I could split it up into two different tables. One, perhaps, would contain "meta" information about the person, and the other would serve as "records" that all relate back to one person... a pretty simple idea as far as SQL-ian ideas go...
The resulting tables would look like this...
Meta
| id | name |
| -- | ---- |
| 1 | foo |
| 5 | bar |
Records
| id | value | date |
| -- | ----- | ---------- |
| 1 | 5 | 2021-04-05 |
| 1 | 6 | 2021-04-06 |
| 5 | 7 | 2021-04-05 |
| 5 | 9 | 2021-04-06 |
My question is, how can I achieve this "dimensionalizing" of a dataframe with pandas, without having to write a for loop on the unique id key field and unpacking manually?

Think about this not as "splitting" the existing dataframe, but as creating two new dataframes from the original. You can do this in a couple of lines:
meta = df[['id','name']].drop_duplicates() #Select the relevant columns and remove duplicates
records = df.drop("name", axis=1) #Replicate the original dataframe but drop the name column
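For a self-contained example you can run, here is a sketch that rebuilds the sample dataframe from the question and applies those two lines:
import pandas as pd

# Rebuild the sample dataframe from the question
df = pd.DataFrame({
    "id": [1, 1, 5, 5],
    "value": [5, 6, 7, 9],
    "date": ["2021-04-05", "2021-04-06", "2021-04-05", "2021-04-06"],
    "name": ["foo", "foo", "bar", "bar"],
})

meta = df[['id', 'name']].drop_duplicates()  # one row per person
records = df.drop("name", axis=1)            # everything except the person-level attributes
print(meta)
print(records)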

You could use drop_duplicates with a subset of the columns you want to keep. For the second dataframe, drop the name column:
df1 = df.drop_duplicates(['id', 'name']).loc[:,['id', 'name']] # perigon's answer is simpler with df[['id','name']].drop_duplicates()
df2 = df.drop('name', axis=1)
df1, df2
Output:
( id name
0 1 foo
2 5 bar,
id value date
0 1 5 2021-04-05
1 1 6 2021-04-06
2 5 7 2021-04-05
3 5 9 2021-04-06)
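As a quick sanity check (not part of either answer, and assuming the df/df1/df2 names from above plus a default RangeIndex on the original frame), you can join the two pieces back together and confirm nothing was lost:
# Re-attach the name column to the records table
restored = df2.merge(df1, on='id', how='left')
# Compare against the original, ignoring column order
print(restored.sort_index(axis=1).equals(df.sort_index(axis=1)))  # True if nothing was lost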

Related

Getting different Values when using groupby(column)["id"].nunique and trying to add a column using transform

I'm trying to count the distinct values per group in a dataset and add them as a new column to a table. The first approach below works, but the second produces wrong values.
When I use the following code
unique_id_per_column = source_table.groupby("disease").some_id.nunique()
I'll get
| | disease | some_id |
|---:|:------------------------|--------:|
| 0 | disease1 | 121 |
| 1 | disease2 | 1 |
| 2 | disease3 | 5 |
| 3 | disease4 | 9 |
| 4 | disease5 | 77 |
These numbers seem to check out, but I want to add them to another table where I already have a column with all values per group.
So I used the following code:
table["unique_ids"] = source_table.groupby("disease").uniqe_id.transform("nunique")
and I get the following table, with wrong numbers for every row except the first.
| | disease |some_id | unique_ids |
|---:|:------------------------|-------:|------------------:|
| 0 | disease1 | 151 | 121 |
| 1 | disease2 | 1 | 121 |
| 2 | disease3 | 5 | 121 |
| 3 | disease4 | 9 | 121 |
| 4 | disease5 | 91 | 121 |
I expected to get the same results as in the first table. Does anyone know why I get the number for the first row repeated instead of the correct numbers?
Use Series.map if you need to create the column in another DataFrame:
s = source_table.groupby("disease").some_id.nunique()
table["unique_ids"] = table["disease"].map(s)

Python, Pandas: Keep only the newest and unique data inside dataframe

Good evening,
the objects inside my dataframe can pop up as many times as they want, always with additional (changing) extra data and at least a unique timestamp (the date column is not unique), something like this...
id | object | additional_data | date | timestamp
1 | item_a | ... | 2014-04-15 | 10:16:22
2 | item_a | ... | 2014-04-10 | 18:19:01
3 | item_a | ... | 2014-04-10 | 17:59:43
4 | item_b | ... | 2014-04-13 | 10:16:22
5 | item_c | ... | 2014-04-15 | 00:01:59
6 | item_c | ... | 2014-04-14 | 08:46:00
7 | item_d | ... | 2014-04-15 | 10:12:47
Is it possible to filter the dataframe to keep only the unique and newest data? For example, like this:
id | object | additional_data | date | timestamp
1 | item_a | ... | 2014-04-15 | 10:16:22
4 | item_b | ... | 2014-04-13 | 10:16:22
5 | item_c | ... | 2014-04-15 | 00:01:59
7 | item_d | ... | 2014-04-15 | 10:12:47
Thanks for all your help and have a great day!
First, sort your dataframe by the 'date' and 'timestamp' columns using sort_values():
df = df.sort_values(by=['date','timestamp'], ascending=[False,False])
Now use the drop_duplicates() method:
df = df.drop_duplicates(subset=['object'], ignore_index=True)
OR
you can also do this by sort_values() and groupby():
df.sort_values(by=['date','timestamp'],ascending=[False,False]).groupby('object',as_index=False).first()
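A runnable sketch of the sort_values + drop_duplicates variant, rebuilding the sample data from the question (additional_data omitted for brevity):
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "object": ["item_a", "item_a", "item_a", "item_b", "item_c", "item_c", "item_d"],
    "date": ["2014-04-15", "2014-04-10", "2014-04-10", "2014-04-13",
             "2014-04-15", "2014-04-14", "2014-04-15"],
    "timestamp": ["10:16:22", "18:19:01", "17:59:43", "10:16:22",
                  "00:01:59", "08:46:00", "10:12:47"],
})

# Sort newest first, then keep the first (newest) row per object.
# Lexicographic sorting works here because the dates/times are zero-padded ISO strings.
newest = (df.sort_values(by=['date', 'timestamp'], ascending=[False, False])
            .drop_duplicates(subset=['object'])
            .sort_values('id', ignore_index=True))
print(newest)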

Fetch values corresponding to id of each row python

Is it possible to fetch a column containing values corresponding to an ID column?
Example:
df1
| ID | Value | Salary |
|:--:|:-----:|:------:|
| 1 | amr | 34 |
| 1 | ith | 67 |
| 2 | oaa | 45 |
| 1 | eea | 78 |
| 3 | anik | 56 |
| 4 | mmkk | 99 |
| 5 | sh_s | 98 |
| 5 | ahhi | 77 |
df2
| ID | Dept |
|:--:|:----:|
| 1 | hrs |
| 1 | cse |
| 2 | me |
| 1 | ece |
| 3 | eee |
Expected Output
| ID | Dept | Value |
|:--:|:----:|:-----:|
| 1 | hrs | amr |
| 1 | cse | ith |
| 2 | me | oaa |
| 1 | ece | eea |
| 3 | eee | anik |
I want to fetch each value in the 'Value' column corresponding to the values in df2's ID column, and create a column containing those values in df2. The number of rows in the two dfs is not the same. I have tried this, but it did not work.
IIUC, you can try df.merge after assigning a helper column created with groupby + cumcount on ID:
out = (df1.assign(k=df1.groupby("ID").cumcount())
          .merge(df2.assign(k=df2.groupby("ID").cumcount()), on=['ID', 'k'])
          .drop(columns='k'))
print(out)
ID Value Dept
0 1 amr hrs
1 1 ith cse
2 2 oaa me
3 1 eea ece
4 3 anik eee
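A self-contained version of that approach, rebuilding the two sample frames from the question so it can be run directly:
import pandas as pd

df1 = pd.DataFrame({
    "ID": [1, 1, 2, 1, 3, 4, 5, 5],
    "Value": ["amr", "ith", "oaa", "eea", "anik", "mmkk", "sh_s", "ahhi"],
    "Salary": [34, 67, 45, 78, 56, 99, 98, 77],
})
df2 = pd.DataFrame({
    "ID": [1, 1, 2, 1, 3],
    "Dept": ["hrs", "cse", "me", "ece", "eee"],
})

# Number the repeats of each ID on both sides, then merge on (ID, repeat number)
out = (df1.assign(k=df1.groupby("ID").cumcount())
          .merge(df2.assign(k=df2.groupby("ID").cumcount()), on=['ID', 'k'])
          .drop(columns='k'))
print(out[['ID', 'Dept', 'Value']])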
Is this what you want to do?
df1.merge(df2, how='inner', on='ID')
Since you have duplicated IDs in both dfs, but the rows are in the same order, you can align on the index instead:
df1 = df1.drop(columns="ID")
df3 = df2.merge(df1, left_index=True, right_index=True)

ValueError when merging 2 dataframes with identical number of rows

I have a dataframe like this:
+-----+-------+---------+
| id | Time | Name |
+-----+-------+---------+
| 1 | 1 | John |
+-----+-------+---------+
| 2 | 2 | David |
+-----+-------+---------+
| 3 | 4 | Rebecca |
+-----+-------+---------+
| 4 | later | Taylor |
+-----+-------+---------+
| 5 | later | Li |
+-----+-------+---------+
| 6 | 8 | Maria |
+-----+-------+---------+
I want to merge with another table based on 'id' and time:
data1=pd.merge(data1, data2,left_on=['id', 'time'],right_on=['id', 'time'], how='left')
The other table data
+-----+-------+--------------+
| id | Time | Job |
+-----+-------+--------------+
| 2 | 2 | Doctor |
+-----+-------+--------------+
| 1 | 1 | Engineer |
+-----+-------+--------------+
| 4 | later | Receptionist |
+-----+-------+--------------+
| 3 | 4 | Professor |
+-----+-------+--------------+
| 5 | later | Lawyer |
+-----+-------+--------------+
| 6 | 8 | Trainer |
+-----+-------+--------------+
It raised an error:
ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat
What I tried:
data1['time']=data1['time'].astype(str)
data2['time']=data2['time'].astype(str)
Did not work. What can I do?
PS: in this example the IDs are different, but in my data IDs can be the same, so I need to merge on both Time and Id.
Have you tried also casting the 'id' column to either str or int?
Sorry, but I don't have enough reputation to just comment on your question.
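Building on that suggestion, here is a sketch of casting both merge keys to a common dtype on both frames before merging. The frames below are toy stand-ins for data1/data2 (the question's tables show 'Time' capitalized while the code uses 'time', so adjust the key names to whatever your frames actually have):
import pandas as pd

# Toy stand-ins: 'id' is int on one side and str on the other, which reproduces the dtype clash
data1 = pd.DataFrame({'id': [1, 2, 4], 'time': ['1', '2', 'later'], 'Name': ['John', 'David', 'Taylor']})
data2 = pd.DataFrame({'id': ['2', '1', '4'], 'time': ['2', '1', 'later'], 'Job': ['Doctor', 'Engineer', 'Receptionist']})

# Cast both merge keys to str on both frames so each key has matching dtypes
for col in ['id', 'time']:
    data1[col] = data1[col].astype(str)
    data2[col] = data2[col].astype(str)

data1 = pd.merge(data1, data2, on=['id', 'time'], how='left')
print(data1)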

Need to aggregate count(rowid, colid) on dataframe in pandas

I've been trying to turn this
| row_id | col_id |
|--------|--------|
| 1 | 23 |
| 4 | 45 |
| ... | ... |
| 1 | 23 |
| ... | ... |
| 4 | 45 |
| ... | ... |
| 4 | 45 |
| ... | ... |
Into this
| row_id | col_id | count |
|--------|--------|---------|
| 1 | 23 | 2 |
| 4 | 45 | 3 |
| ... | ... | ... |
So all (row_i, col_j) occurrences are added up into the 'count' column. Note that row_id and column_id won't be unique in either case.
No success until now, at least if I want to keep it efficient. I can iterate over each pair and add up occurrences, but there has to be a simpler way in pandas (or numpy, for that matter).
Thanks!
EDIT 1:
As @j-bradley suggested, I tried the following:
# I use django-pandas
rdf = Record.objects.to_dataframe(['row_id', 'column_id'])
_ = rdf.groupby(['row_id', 'column_id'])['row_id'].count().head(20)
_.head(10)
And that outputs
row_id column_id
1 108 1
168 1
218 1
398 2
422 1
10 35 2
355 1
489 1
100 352 1
366 1
Name: row_id, dtype: int64
This seems ok. But it's a Series object and I'm not sure how to turn this into a dataframe with the required three columns. Pandas noob, as it seems. Any tips?
Thanks again.
You can group by columns A and B and call count on the groupby object:
df = pd.DataFrame({'A': [1, 4, 1, 4, 4], 'B': [23, 45, 23, 45, 45]})
df.groupby(['A','B'])['A'].count()
returns:
A B
1 23 2
4 45 3
Edited to make the answer more explicit
To turn the series back into a dataframe with a column named count:
_ = df.groupby(['A','B'])['A'].count()
The name of the series becomes the column name:
_.name = 'count'
Resetting the index promotes the multi-index to columns and turns the series into a dataframe:
df = _.reset_index()
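For reference, the same result can be produced in one step with size() and reset_index(name=...), which skips the intermediate renaming (just an alternative, not what the answer above used; the toy frame is repeated here for completeness):
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 1, 4, 4], 'B': [23, 45, 23, 45, 45]})

# size() counts rows per (A, B) pair; reset_index names the count column directly
counts = df.groupby(['A', 'B']).size().reset_index(name='count')
print(counts)
#    A   B  count
# 0  1  23      2
# 1  4  45      3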
