Hello my dear coders,
I'm new to coding and I've stumbled upon a problem. I want to split a column of a CSV file that I have imported via pandas in Python. The column is named CATEGORY and contains 1, 2 or 3 values separated by commas (e.g. 2343, 3432, 4959). Now I want to split these values into separate columns named CATEGORY, SUBCATEGORY and SUBSUBCATEGORY.
I have tried this line of code:
products_combined[['CATEGORY','SUBCATEGORY', 'SUBSUBCATEGORY']] = products_combined.pop('CATEGORY').str.split(expand=True)
But I get this error: ValueError: Columns must be same length as key
Would love to hear your feedback <3
You need:
pd.DataFrame(df.CATEGORY.str.split(',').tolist(), columns=['CATEGORY','SUBCATEGORY', 'SUBSUBCATEGORY'])
Output:
CATEGORY SUBCATEGORY SUBSUBCATEGORY
0 2343 3432 4959
1 2343 3432 4959
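For context on the error itself: str.split() with no separator splits on whitespace, and expand=True returns as many columns as the longest row produces, so assigning the result to exactly three names fails whenever that count isn't 3. A minimal fix of the original line, assuming at least one row actually has all three values:
# Split on ',' at most twice; expand=True pads shorter rows with None.
products_combined[['CATEGORY', 'SUBCATEGORY', 'SUBSUBCATEGORY']] = (
    products_combined.pop('CATEGORY').str.split(',', n=2, expand=True)
)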
I think this could be accomplished by first splitting the column into lists and then creating the new columns from the pieces, overwriting the original CATEGORY column last (after the other two have been read from it). Like so:
products_combined['CATEGORY'] = products_combined['CATEGORY'].str.split(',')  # each cell becomes a list of parts
products_combined['SUBCATEGORY'] = products_combined['CATEGORY'].apply(lambda original: original[1] if len(original) > 1 else None)
products_combined['SUBSUBCATEGORY'] = products_combined['CATEGORY'].apply(lambda original: original[2] if len(original) > 2 else None)
products_combined['CATEGORY'] = products_combined['CATEGORY'].apply(lambda original: original[0])
The apply() method called on a Series returns a new Series containing the result of running the passed function (in this case, the lambda) on each element of the original Series.
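One wrinkle, assuming the raw values keep a space after each comma as in the question (2343, 3432, 4959): the split pieces retain that space, so a quick strip pass is worth adding:
# Strip leftover whitespace from the split pieces (None/NaN values pass through).
for col in ['CATEGORY', 'SUBCATEGORY', 'SUBSUBCATEGORY']:
    products_combined[col] = products_combined[col].str.strip()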
IIUC, use split and then Series:
(
    df[0].apply(lambda x: pd.Series(x.split(",")))
         .rename(columns={0: "CATEGORY", 1: "SUBCATEGORY", 2: "SUBSUBCATEGORY"})
)
CATEGORY SUBCATEGORY SUBSUBCATEGORY
0 2343 3432 4959
1 1 NaN NaN
2 44 55 NaN
Data:
d = [["2343,3432,4959"],["1"],["44,55"]]
df = pd.DataFrame(d)
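A variant of the same idea, assuming the column holds plain strings: str.split with expand=True avoids the row-wise apply and is usually faster on large frames:
# expand=True pads shorter rows with None; the slice guards against fewer than 3 columns.
out = df[0].str.split(',', expand=True)
out.columns = ['CATEGORY', 'SUBCATEGORY', 'SUBSUBCATEGORY'][:out.shape[1]]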
I currently have 2 CSV files and am reading them both in. I need to take the IDs from one CSV, find them in the other, and get the matching rows of data. I have the following code that I believe loops through the first dataframe, but it only ends up with the last match in the new dataframe. I need it to keep all of the matching rows instead.
Here is my code:
patientSet = pd.read_csv("794_chips_RMA.csv")
# probeset is loaded earlier (not shown in the question)
affSet = probeset[probeset['Analysis']==1].reset_index(drop=True)
houseGenes = probeset[probeset['Analysis']==0].reset_index(drop=True)
for x in affSet['Probeset']:
    #patients = patientSet[patientSet['ID']=='1557366_at'].reset_index(drop=True)
    #patients = patientSet[patientSet['ID']=='224851_at'].reset_index(drop=True)
    patients = patientSet[patientSet['ID']==x].reset_index(drop=True)  # overwrites patients on every iteration
print(affSet['Probeset'])
print(patientSet['ID'])
print(patients)
The following is the output:
0 1557366_at
1 224851_at
2 1554784_at
3 231578_at
4 1566643_a_at
5 210747_at
6 231124_x_at
7 211737_x_at
Name: Probeset, dtype: object
0 1007_s_at
1 1053_at
2 117_at
3 121_at
4 1255_g_at
...
54670 AFFX-ThrX-5_at
54671 AFFX-ThrX-M_at
54672 AFFX-TrpnX-3_at
54673 AFFX-TrpnX-5_at
54674 AFFX-TrpnX-M_at
Name: ID, Length: 54675, dtype: object
ID phchp003v1 phchp003v2 phchp003v3 ... phchp367v1 phchp367v2 phchp368v1 phchp368v2
0 211737_x_at 12.223453 11.747159 9.941889 ... 14.828389 9.322779 10.609053 10.771162
As you can see, it is only matching the very last ID from the first dataframe, not all of them. How can I get all of them to match and end up in patients? Thank you.
You probably want to use the merge function:
df_inner = pd.merge(df1, df2, on='id', how='inner')
Check here: https://www.datacamp.com/community/tutorials/joining-dataframes-pandas and search for "inner join".
--edit--
You can specify different key columns on each side (using left_on and right_on); look here: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging
@Rui Lima already posted the correct approach, but with the dataframes from the question you'll need the following to make it work (the key columns are named differently on each side):
df = pd.merge(patientSet, affSet, left_on='ID', right_on='Probeset', how='inner')
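If you only need the matching rows of patientSet (no columns from affSet), an isin filter is a simpler sketch, reusing the names from the question:
# Keep every patientSet row whose ID appears among the probesets of interest.
patients = patientSet[patientSet['ID'].isin(affSet['Probeset'])].reset_index(drop=True)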
I am new to Python. I have a dataframe with different groups and titles, and now I want to add a column with the predicted group (pred_grp) based on the median for each group, but I am not sure how I can accomplish this.
This is what my df looks like:
df
title M18-34 V18-34 18-34 25-54 V25-54 M25-54 18-54 V18-54 M18-54
HEPEN 0.102488 0.200995 0.312438 0.667662 0.334328 0.321393 0.739303 0.380100 0.344279
MATED 0.151090 0.208723 0.361371 0.733645 0.428349 0.280374 0.880062 0.503115 0.352025
PEERT 0.098296 0.157929 0.262779 0.624509 0.325033 0.283093 0.717562 0.384010 0.316514
RZOEK 0.143695 0.336882 0.503607 0.657216 0.414844 0.214674 0.838560 0.548663 0.255410
ERKEN 0.204918 0.409836 0.631148 0.467213 0.286885 0.163934 0.877049 0.557377 0.303279
median_dict = {
    '18-34': 0.395992275,
    '18-54': 0.79392129200000006,
    '25-54': 0.64958055850000007,
    'M18-34': 0.1171878905,
    'M18-54': 0.27340067349999997,
    'M25-54': 0.23422200100000001,
    'V18-34': 0.2283782815,
    'V18-54': 0.4497918595,
    'V25-54': 0.37749252799999999}
Required output:
So basically I want to compare the median values stored in the dictionary across each title and then assign a certain group if the value is equal to that specific median, e.g. if the median is 0.395992275 then pred_grp is 18-34, and so forth.
df_out
title M18-34 V18-34 18-34 25-54 V25-54 M25-54 18-54 V18-54 M18-54 pred_grp
HEPEN 0.102488 0.200995 0.312438 0.667662 0.334328 0.321393 0.739303 0.380100 0.344279 18-54
MATED 0.151090 0.208723 0.361371 0.733645 0.428349 0.280374 0.880062 0.503115 0.352025
PEERT 0.098296 0.157929 0.262779 0.624509 0.325033 0.283093 0.717562 0.384010 0.316514
RZOEK 0.143695 0.336882 0.503607 0.657216 0.414844 0.214674 0.838560 0.548663 0.255410
ERKEN 0.204918 0.409836 0.631148 0.467213 0.286885 0.163934 0.877049 0.557377 0.303279
I would appreciate your help!!
Thanks in advance.
Based on what I understood from the comments, you can try building a dataframe from the dictionary with the same structure as the input dataframe, and then take the column which has the least difference:
u = df.set_index("title")
# Build a one-row frame from the dict and order its columns to match u.
v = pd.DataFrame.from_dict(median_dict, orient='index').T.reindex(u.columns, axis=1)
# For each row, take the column where (value - median) is smallest.
df['pred_group'] = (u - v.to_numpy()).idxmin(axis=1).to_numpy()
print(df)
title M18-34 V18-34 18-34 25-54 V25-54 M25-54 \
0 HEPEN 0.102488 0.200995 0.312438 0.667662 0.334328 0.321393
1 MATED 0.151090 0.208723 0.361371 0.733645 0.428349 0.280374
2 PEERT 0.098296 0.157929 0.262779 0.624509 0.325033 0.283093
3 RZOEK 0.143695 0.336882 0.503607 0.657216 0.414844 0.214674
4 ERKEN 0.204918 0.409836 0.631148 0.467213 0.286885 0.163934
18-54 V18-54 M18-54 pred_group
0 0.739303 0.380100 0.344279 18-34
1 0.880062 0.503115 0.352025 18-34
2 0.717562 0.384010 0.316514 18-34
3 0.838560 0.548663 0.255410 M25-54
4 0.877049 0.557377 0.303279 25-54
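One caveat: idxmin on the raw difference picks the most negative difference, not necessarily the closest value. If "closest to the median" is what's intended, a variant sketch takes the absolute difference first:
# Pick, per row, the column whose value is nearest its median in absolute terms.
df['pred_group'] = (u - v.to_numpy()).abs().idxmin(axis=1).to_numpy()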
I want to group by one column (tag) and sum up the corresponding quantities (qty). The related reference numbers should be collected into a single comma-separated column.
import pandas as pd
tag = ['PO_001045M100960','PO_001045M100960','PO_001045MSP2526','PO_001045M870191', 'PO_001045M870191', 'PO_001045M870191']
reference= ['PA_000003', 'PA_000005', 'PA_000001', 'PA_000002', 'PA_000004', 'PA_000009']
qty=[4,2,2,1,1,1]
df = pd.DataFrame({'tag' : tag, 'reference':reference, 'qty':qty})
tag reference qty
PO_001045M100960 PA_000003 4
PO_001045M100960 PA_000005 2
PO_001045MSP2526 PA_000001 2
PO_001045M870191 PA_000002 1
PO_001045M870191 PA_000004 1
PO_001045M870191 PA_000009 1
If I use df.groupby('tag')['qty'].sum().reset_index(), I am getting the following result.
tag qty
ASL_PO_000001045M100960 6
ASL_PO_000001045M870191 3
ASL_PO_000001045MSP2526 2
I need an additional column where the reference numbers are gathered under their respective tags, like:
tag qty reference
ASL_PO_000001045M100960 6 PA_000003, PA_000005
ASL_PO_000001045M870191 3 PA_000002, PA_000004, PA_000009
ASL_PO_000001045MSP2526 2 PA_000001
How can I achieve this?
Thanks.
Use pandas.DataFrame.groupby.agg:
df.groupby('tag').agg({'qty': 'sum', 'reference': ', '.join})
Output:
reference qty
tag
PO_001045M100960 PA_000003, PA_000005 6
PO_001045M870191 PA_000002, PA_000004, PA_000009 3
PO_001045MSP2526 PA_000001 2
Note: if the reference column is numeric, ', '.join will not work. In that case, use lambda x: ', '.join(str(i) for i in x)
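A minimal sketch of that numeric variant, with reset_index() added so tag comes back as a regular column:
out = df.groupby('tag').agg({'qty': 'sum',
                             'reference': lambda x: ', '.join(str(i) for i in x)}).reset_index()
print(out)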
I have a dataframe whose first column contains IDs. How do I sort the first column when it contains alphanumeric data, such as:
id = ["6LDFTLL9", "N9RFERBG", "6RHSDD46", "6UVSCF4H", "7SKDEZWE", "5566FT6N","6VPZ4T5P", "EHYXE34N", "6P4EF7BB", "TT56GTN2", "6YYPH399" ]
Expected result is
id = ["5566FT6N", "6LDFTLL9", "6P4EF7BB", "6RHSDD46", "6UVSCF4H", "6VPZ4T5P", "6YYPH399", "7SKDEZWE", "EHYXE34N", "N9RFERBG", "TT56GTN2" ]
You can utilize the .sort() method:
>>> id.sort()
>>> id
['5566FT6N', '6LDFTLL9', '6P4EF7BB', '6RHSDD46', '6UVSCF4H', '6VPZ4T5P', '6YYPH399', '7SKDEZWE', 'EHYXE34N', 'N9RFERBG', 'TT56GTN2']
This will sort the list in place; note that .sort() itself returns None, which is why id has to be echoed separately above. If you don't want to change the original id list, you can use the built-in sorted() function instead (shown here starting from a fresh, unsorted list):
>>> sorted(id)
['5566FT6N', '6LDFTLL9', '6P4EF7BB', '6RHSDD46', '6UVSCF4H', '6VPZ4T5P', '6YYPH399', '7SKDEZWE', 'EHYXE34N', 'N9RFERBG', 'TT56GTN2']
>>> id
['6LDFTLL9', 'N9RFERBG', '6RHSDD46', '6UVSCF4H', '7SKDEZWE', '5566FT6N', '6VPZ4T5P', 'EHYXE34N', '6P4EF7BB', 'TT56GTN2', '6YYPH399']
Notice, with this one, that id is unchanged.
For a DataFrame, you want to use sort_values().
df.sort_values(0, inplace=True)
0 here is the column label (the default label pandas assigns when no column names are given); you can instead pass an actual column name (e.g. 'id').
0
5 5566FT6N
0 6LDFTLL9
8 6P4EF7BB
2 6RHSDD46
3 6UVSCF4H
6 6VPZ4T5P
10 6YYPH399
4 7SKDEZWE
7 EHYXE34N
1 N9RFERBG
9 TT56GTN2
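For completeness, a small sketch that builds the frame from the list and sorts it, assuming the column is named 'id' rather than the default 0:
import pandas as pd

df = pd.DataFrame({'id': id})                 # 'id' shadows the builtin here, fine for a demo
df = df.sort_values('id', ignore_index=True)  # lexicographic sort; ignore_index renumbers the rows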
Given I have the following csv data.csv:
id,category,price,source_id
1,food,1.00,4
2,drink,1.00,4
3,food,5.00,10
4,food,6.00,10
5,other,2.00,7
6,other,1.00,4
I want to group the data by (price, source_id) and I am doing it with the following code
import pandas as pd
df = pd.read_csv('data.csv')  # the file already has a header row; passing names= would turn that row into data
grouped = df.groupby(['price', 'source_id'])
valid_categories = ['food', 'drink']
for price_source, group in grouped:
    if group.category.size < 2:
        continue
    categories = group.category.tolist()
    if 'other' in categories and len(set(categories).intersection(valid_categories)) > 0:
        pass
        """
        Valid data in this case is:
        1,food,1.00,4
        2,drink,1.00,4
        6,other,1.00,4
        I will need all of the above data including the id for other purposes
        """
Is there an alternate way to perform the above filtering in pandas before the for loop, and if so, would it be any faster?
The criteria for filtering is:
size of the group is greater than 1
the group should contain category 'other' and at least one of either 'food' or 'drink'
You could directly apply a custom filter to the GroupBy object, something like:
crit = lambda x: all((len(x) > 1,                            # group has more than one row
                      'other' in x.category.values,          # group contains category 'other'
                      set(x.category) & {'food', 'drink'}))  # plus at least one of food/drink
df.groupby(['price', 'source_id']).filter(crit)
Outputs
category id price source_id
0 food 1 1.0 4
1 drink 2 1.0 4
5 other 6 1.0 4
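One detail to watch: inside filter() each group arrives as a DataFrame and the callable must return a single boolean, which is why len(x) (row count) is the right test for group size; x.size on a DataFrame counts rows times columns. The surviving rows keep their original index (0, 1 and 5 above).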