Python - Grouping Rows into List [duplicate]

This question already has answers here:
Data loading using arrays in Python
(3 answers)
Closed 4 years ago.
I have a pandas data frame like this:
TransactionID ProductID
1 132
1 256
1 985
2 321
3 451
3 219
I want to group by the 'TransactionID' and assign the 'ProductID' to a list, like this:
list = [[132, 256, 985], [321], [451, 219]]
What is the proper way to perform this task?
Thanks in advance!

Something like this might help.
Group the rows by TransactionID, then collect each group's ProductID values and convert them to a list:
grouped_list = list(df.groupby('TransactionID')['ProductID'].apply(list))
As mentioned in the comments, it is not a good idea to use list as a variable name: assigning the grouped result to it shadows the built-in list type, so later calls to list() in the same scope would fail.
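For reference, a minimal runnable sketch of the whole workflow, rebuilding the sample frame from the question (.tolist() here is equivalent to wrapping the result in list()):

import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame({
    'TransactionID': [1, 1, 1, 2, 3, 3],
    'ProductID': [132, 256, 985, 321, 451, 219],
})

# Collect each transaction's ProductID values into a list, then
# turn the resulting Series of lists into a plain list of lists
grouped_list = df.groupby('TransactionID')['ProductID'].apply(list).tolist()
print(grouped_list)  # [[132, 256, 985], [321], [451, 219]]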

The following is less readable, but it also works:
result = [list(i.ProductID) for i in dict(list(df.groupby("TransactionID"))).values()]


How do I json_normalize() a specific field within a df and keep the other columns? [duplicate]

This question already has answers here:
Split / Explode a column of dictionaries into separate columns with pandas
(13 answers)
Closed 9 months ago.
Here's a simple example (the JSON field in my actual dataset is deeply nested, so I'm unpacking things one level at a time). I need to keep certain columns of the dataset after json_normalize().
https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html
(The original post included screenshots here: the starting frame, an Excel mockup of the expected result, and the actual output.)
import json
import pandas as pd
d = {'report_id': [100, 101, 102], 'start_date': ["2021-03-12", "2021-04-22", "2021-05-02"],
'report_json': ['{"name":"John", "age":30, "disease":"A-Pox"}', '{"name":"Mary", "age":22, "disease":"B-Pox"}', '{"name":"Karen", "age":42, "disease":"C-Pox"}']}
df = pd.DataFrame(data=d)
display(df)
df = pd.json_normalize(df['report_json'].apply(json.loads), max_level=0, meta=['report_id', 'start_date'])
display(df)
Looking at the documentation for json_normalize(), I think the meta parameter is what I need to keep report_id and start_date, but it doesn't seem to be working: the fields I expect to keep do not appear in the final dataset.
Does anyone have advice? Thank you.
As you're dealing with fairly simple JSON and a structured index, you can just normalize the column and then use .join to attach the result back along the row axis:
from ast import literal_eval
df.join(
    pd.json_normalize(df['report_json'].map(literal_eval))
).drop('report_json', axis=1)
report_id start_date name age disease
0 100 2021-03-12 John 30 A-Pox
1 101 2021-04-22 Mary 22 B-Pox
2 102 2021-05-02 Karen 42 C-Pox
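Since the report_json strings in this example are valid JSON, the standard library's json.loads should work equally well in place of literal_eval (a sketch reusing the df built in the question):

import json
import pandas as pd

# Parse each JSON string into a dict, normalize the dicts into columns,
# and join the result back onto the original rows
result = df.join(
    pd.json_normalize(df['report_json'].map(json.loads))
).drop('report_json', axis=1)
print(result)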

How to filter pandas dataframe based on length of a list in a column? [duplicate]

This question already has answers here:
How to filter a pandas dataframe based on the length of a entry
(2 answers)
Closed 1 year ago.
I have a pandas DataFrame like this:
id subjects
1 [math, history]
2 [English, Dutch, Physics]
3 [Music]
How can I filter this dataframe based on the length of the lists in the subjects column?
For example, if I only want to keep rows where len(subjects) >= 2?
I tried using
df[len(df["subjects"]) >= 2]
But this gives
KeyError: True
Using loc does not help either; it gives me the same error.
Thanks in advance!
Use the string accessor; .str.len() also works element-wise on lists:
df[df['subjects'].str.len() >= 2]
Output:
id subjects
0 1 [math, history]
1 2 [English, Dutch, Physics]
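An equivalent, arguably more explicit spelling is to map Python's built-in len over the column (a sketch rebuilding the sample frame from the question):

import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3],
    'subjects': [['math', 'history'], ['English', 'Dutch', 'Physics'], ['Music']],
})

# .map(len) computes each list's length element-wise, producing
# the same boolean mask as .str.len()
print(df[df['subjects'].map(len) >= 2])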

Columns in Pandas Dataframe [duplicate]

This question already has answers here:
Binning a column with pandas
(4 answers)
Closed 3 years ago.
I have a dataframe of cars with a price column, and I want to create a new column carsrange that holds values like 'High' and 'Low' according to the car's price. For example:
if the price is between 0 and 9000, carsrange should have 'Low' for those cars; similarly, if the price is between 9000 and 30,000, carsrange should have 'Medium', and so on. I tried doing it, but my code replaces one value with the other. Any help please?
I ran a for loop over the price column and used if-else branches to define my column values.
for i in cars_data['price']:
    if (i > 0 and i < 9000): cars_data['carsrange'] = 'Low'
    elif (i < 9000 and i < 18000): cars_data['carsrange'] = 'Medium-Low'
    elif (i < 18000 and i > 27000): cars_data['carsrange'] = 'Medium'
    elif (i > 27000 and i < 36000): cars_data['carsrange'] = 'High-Medium'
    else: cars_data['carsrange'] = 'High'
Now, when I run the unique function on carsrange, it shows only 'High'.
cars_data['carsrange'].unique()
This is the Output:
In[74]:cars_data['carsrange'].unique()
Out[74]: array(['High'], dtype=object)
I believe I have applied the wrong concept here. Any ideas as to what I should do now?
You can build a list first and assign it as the column in one go:
resultList = []
for i in cars_data['price']:
    if (i > 0 and i < 9000):
        resultList.append("Low")
    else:
        resultList.append("High")
    # write the other elif conditions here
cars_data["carsrange"] = resultList
Then find the unique values from cars_data["carsrange"].
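A vectorized alternative, in the spirit of the binning duplicate linked above, is pd.cut, which assigns every price to a labelled bin in one call. A sketch; the sample prices are hypothetical, and the bin edges follow the ranges in the question's code:

import pandas as pd

# Hypothetical prices; in practice use the existing cars_data frame
cars_data = pd.DataFrame({'price': [5000, 12000, 21000, 30000, 45000]})

bins = [0, 9000, 18000, 27000, 36000, float('inf')]
labels = ['Low', 'Medium-Low', 'Medium', 'High-Medium', 'High']

# Each price lands in exactly one half-open bin (lower, upper]
cars_data['carsrange'] = pd.cut(cars_data['price'], bins=bins, labels=labels)
print(cars_data['carsrange'].unique())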

How to import specific values from a data frame into a variable in pandas [duplicate]

This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 3 years ago.
I have a folder with around 1000 .txt files, and I need to run the same code on each of them. One particular thing I need to keep track of is the count of a particular haplotype. I used
hap = df['hap'].value_counts().to_frame() to create a new data frame with the counts of each haplotype. It looks something like this.
hap count
1 347
5 171
3 168
7 140
6 56
11 51
9 33
2 31
10 3
I was wondering if there was a way for me to extract the count of just haplotype 7 and store its value in a variable.
One method I have used is the df['haplotype'].tolist() command, followed by a for loop with a basic if-else clause that keeps track of the count of haplotype 7 according to the number of times it occurs in the list. But I'm curious to know if I can access it in the manner I've described above.
You can get the row where hap is 7 using the [] operator.
cnt = df[df['hap']==7]['count']
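Note that boolean indexing returns a Series of matches rather than a scalar. To store the count in a plain variable, take the first match (a sketch; this assumes 'hap' is a regular column, as in the answer above):

# First matching value as a scalar
cnt = df.loc[df['hap'] == 7, 'count'].iloc[0]

# If 'hap' instead ended up as the index after value_counts().to_frame(),
# a plain label lookup does the same job:
# cnt = df.loc[7, 'count']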

Merging dataframes together in a for loop [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 4 years ago.
I have a dictionary of pandas dataframes; each frame contains timestamps and the market caps corresponding to those timestamps. The keys are:
coins = ['dashcoin','litecoin','dogecoin','nxt']
I would like to create a new key 'merged' in the dictionary and, using the pd.merge method, merge the 4 existing dataframes on their timestamp (I want complete rows, so an 'inner' join is appropriate).
Sample of one of the data frames:
data2['nxt'].head()
Out[214]:
timestamp nxt_cap
0 2013-12-04 15091900
1 2013-12-05 14936300
2 2013-12-06 11237100
3 2013-12-07 7031430
4 2013-12-08 6292640
I'm currently getting a result using this code:
data2['merged'] = data2['dogecoin']
for coin in coins:
    data2['merged'] = pd.merge(left=data2['merged'], right=data2[coin],
                               left_on='timestamp', right_on='timestamp')
but this repeats 'dogecoin' in 'merged'. However, if data2['merged'] is not seeded with data2['dogecoin'] (or similar data), the merge won't work, because 'merged' doesn't exist yet.
EDIT: my desired result is to create one merged dataframe stored as a new element in the dictionary (data2['merged']), containing the merged data frames from the other elements in data2.
Try seeding 'merged' with the first named dataframe and then merging the rest onto it, skipping that first one in the loop:
data2['merged'] = data2['dashcoin']
# LEAVE OUT FIRST ELEMENT
for coin in coins[1:]:
    data2['merged'] = data2['merged'].merge(data2[coin], on='timestamp')
Since you've already made coins a list, why not just something like:
data2['merged'] = data2[coins[0]]
for coin in coins[1:]:
    data2['merged'] = pd.merge(....
Unless I'm misunderstanding, this question isn't specific to dataframes; it's just about how to write a loop when the first element has to be treated differently from the rest.
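For what it's worth, this fold-a-list-of-frames pattern is often written with functools.reduce, which avoids seeding 'merged' by hand (a sketch assuming the same data2 dict and coins list as above):

from functools import reduce
import pandas as pd

frames = [data2[coin] for coin in coins]

# Merge the frames pairwise, left to right, keeping only timestamps
# present in every frame (inner join)
data2['merged'] = reduce(
    lambda left, right: pd.merge(left, right, on='timestamp', how='inner'),
    frames,
)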
