I would like to convert columns to subcolumns.
Suppose the data looks like this:
Q1 Q2:Q21 Q2:Q22 Q2:Q23 Q3:Q31 Q3:Q32
0 yes green blue green bus car
1 no red orange blue car bike
2 yes green yellow black car walk
3 yes yellow green brown bus walk
4 no green green red car bus
After reshaping the columns, I would like to have:
Q1 Q2 Q3
Q1 Q21 Q22 Q23 Q31 Q32
0 yes green blue green bus car
1 no red orange blue car bike
2 yes green yellow black car walk
3 yes yellow green brown bus walk
4 no green green red car bus
Here is what I tried:
import pandas as pd
survey = pd.read_csv('survey.csv')
# first column names
survey_cols = [col.split(':')[0] for col in survey.columns]
# unique column names
survey_ucols = []
for e in survey_cols:
    if e not in survey_ucols:
        survey_ucols.append(e)
# second column names, subcolumns
survey_subcols = []
for col in survey_ucols:
survey_subcols.append([subcol.split(':')[-1] for subcol in survey.columns if col in subcol])
# create new df
tuples = list(zip(survey_ucols,survey_subcols))
cols = pd.MultiIndex.from_tuples(tuples, names=['mainQ', 'subQ'])
survey_new = pd.DataFrame(survey, columns=cols)
Thanks in advance
You can create a helper DataFrame with Index.to_series and Series.str.split, forward fill the missing values along each row with ffill, and finally assign the result back with MultiIndex.from_arrays:
df = survey.columns.to_series().str.split(':', expand=True).ffill(axis=1)
survey.columns = pd.MultiIndex.from_arrays([df[0].tolist(), df[1].tolist()])
#simplified
#survey.columns = [df[0].tolist(), df[1].tolist()]
print (survey)
Q1 Q2 Q3
Q1 Q21 Q22 Q23 Q31 Q32
0 yes green blue green bus car
1 no red orange blue car bike
2 yes green yellow black car walk
3 yes yellow green brown bus walk
4 no green green red car bus
Detail:
print (df)
0 1
Q1 Q1 Q1
Q2:Q21 Q2 Q21
Q2:Q22 Q2 Q22
Q2:Q23 Q2 Q23
Q3:Q31 Q3 Q31
Q3:Q32 Q3 Q32
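As a hedged alternative sketch, the same two-level columns can also be built directly from tuples, which is close to what the question attempted. The small `survey` frame below is made-up sample data standing in for survey.csv, and the level names `mainQ` and `subQ` are taken from the question's own attempt:

```python
import pandas as pd

# Made-up stand-in for survey.csv (assumption: same header style as in the question).
survey = pd.DataFrame({
    "Q1": ["yes", "no"],
    "Q2:Q21": ["green", "red"],
    "Q2:Q22": ["blue", "orange"],
    "Q3:Q31": ["bus", "car"],
})

# "Q2:Q21" -> ("Q2", "Q21"); a name without ":" such as "Q1" -> ("Q1", "Q1").
tuples = [(col.split(":")[0], col.split(":")[-1]) for col in survey.columns]
survey.columns = pd.MultiIndex.from_tuples(tuples, names=["mainQ", "subQ"])

# Selecting a main question returns all of its subcolumns.
q2 = survey["Q2"]
print(survey)
```

Unlike the zip-based attempt in the question, this produces one (main, sub) pair per original column rather than one pair per unique main question.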
I am new to this, and I need to split a column that contains two strings into 2 columns, like this:
Initial dataframe:
Full String
0 Orange Juice
1 Pink Bird
2 Blue Ball
3 Green Tea
4 Yellow Sun
Final dataframe:
First String Second String
0 Orange Juice
1 Pink Bird
2 Blue Ball
3 Green Tea
4 Yellow Sun
I tried this, but it doesn't work:
df['First String'] , df['Second String'] = df['Full String'].str.split()
and this:
df['First String', 'Second String'] = df['Full String'].str.split()
How to make it work? Thank you!!!
The key here is to include the parameter expand=True in your str.split() to expand the split strings into separate columns.
Type it like this:
df[['First String','Second String']] = df['Full String'].str.split(expand=True)
Output:
Full String First String Second String
0 Orange Juice Orange Juice
1 Pink Bird Pink Bird
2 Blue Ball Blue Ball
3 Green Tea Green Tea
4 Yellow Sun Yellow Sun
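A minimal sketch of the same idea on the sample data. The `n=1` argument is an addition that is not in the answer above: it limits the split to the first space, so a row with more than two words would still yield exactly two columns.

```python
import pandas as pd

df = pd.DataFrame({
    "Full String": ["Orange Juice", "Pink Bird", "Blue Ball", "Green Tea", "Yellow Sun"]
})

# expand=True returns a DataFrame with one column per piece;
# n=1 caps the number of splits at one (assumption: at most two parts are wanted).
df[["First String", "Second String"]] = df["Full String"].str.split(" ", n=1, expand=True)
print(df)
```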
Have you tried this solution?
https://stackoverflow.com/a/14745484/15320403
df = pd.DataFrame(df['Full String'].str.split(' ', n=1).tolist(), columns = ['First String', 'Second String'])
I am a beginner in Python and trying to find a solution for the following problem.
I have a csv file:
name, mark
Anna,24
John,19
Mike,22
Monica,20
Alex, 17
Daniel, 26
And xls file:
name, group
John, red
Anna, blue
Monica, blue
Mike, yellow
Alex, red
I am trying to get the result:
group, mark
Red, 36
Blue, 44
Yellow, 22
The number in the result shows the total mark for the whole group.
I looked for similar problems without success, and I do not have enough experience to know what exactly I have to do or which commands to use.
Use pd.read_csv with df.merge and Groupby.sum:
In [89]: df1 = pd.read_csv('file1.csv')
In [89]: df1
Out[89]:
name mark
0 Anna 24
1 John 19
2 Mike 22
3 Monica 20
4 Alex 17
5 Daniel 26
In [90]: df2 = pd.read_csv('file2.csv')
In [90]: df2
Out[90]:
name group
0 John red
1 Anna blue
2 Monica blue
3 Mike yellow
4 Alex red
In [94]: df = df1.merge(df2).groupby('group').sum().reset_index()
In [95]: df
Out[95]:
group mark
0 blue 44
1 red 36
2 yellow 22
EDIT: If you have other columns that you don't want to sum, do this:
In [284]: df1.merge(df2).groupby('group').agg({'mark': 'sum'}).reset_index()
Out[284]:
group mark
0 blue 44
1 red 36
2 yellow 22
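The merge-then-aggregate steps above can be sketched end to end as follows. To keep it self-contained, both files are simulated with in-memory strings (an assumption: the real data would come from read_csv on a path, or read_excel for the xls file). `skipinitialspace=True` is added here because the sample data has spaces after some commas, and only the `mark` column is summed so that the `name` column is left out of the aggregation.

```python
import io
import pandas as pd

# In-memory stand-ins for the two files (assumption: same content as in the question).
marks_file = io.StringIO("name, mark\nAnna,24\nJohn,19\nMike,22\nMonica,20\nAlex, 17\nDaniel, 26\n")
groups_file = io.StringIO("name, group\nJohn, red\nAnna, blue\nMonica, blue\nMike, yellow\nAlex, red\n")

# skipinitialspace=True drops the blank after each comma in headers and values.
marks = pd.read_csv(marks_file, skipinitialspace=True)
groups = pd.read_csv(groups_file, skipinitialspace=True)

# An inner merge keeps only names present in both tables (Daniel has no group).
totals = (
    marks.merge(groups, on="name", how="inner")
         .groupby("group", as_index=False)["mark"]
         .sum()
)
print(totals)
```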
df1
ITEM CATEGORY COLOR LOCATION PRICE
23661 BIKE BLUE A 30000
23661 BIKE BLUE B 43563
23661 BIKE BLUE C 45124
23661 BIKE BLUE D 28000
48684 CAR RED B 45145
48684 CAR RED D 35613
48684 CAR RED A 82312
48684 CAR RED C 24536
48684 CAR RED E 45613
54519 BIKE BLACK A 21345
54519 BIKE BLACK B 62623
54519 BIKE BLACK C 14613
54519 BIKE BLACK E 14365
54519 BIKE BLACK D 67353
The expected outcome is the location that has the highest price for each vehicle:
ITEM CATEGORY COLOR LOCATION PRICE
23661 BIKE BLUE C 45124
48684 CAR RED A 82312
54519 BIKE BLACK D 67353
df.sort_values(df['PRICE'], ascending=False, kind='quicksort')
Using this code, we can do it manually, one item at a time. How can this be done for the whole df?
sort_values + drop_duplicates:
df.sort_values(['ITEM','PRICE'],ascending=[True,False]).drop_duplicates('ITEM')
ITEM CATEGORY COLOR LOCATION PRICE
2 23661 BIKE BLUE C 45124
6 48684 CAR RED A 82312
13 54519 BIKE BLACK D 67353
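As a hedged alternative to sorting, the row with the highest price per item can also be picked with groupby and idxmax. This is a sketch on the question's data, not the method used in the answer above; it assumes the index labels are unique.

```python
import pandas as pd

rows = [
    (23661, "BIKE", "BLUE", "A", 30000), (23661, "BIKE", "BLUE", "B", 43563),
    (23661, "BIKE", "BLUE", "C", 45124), (23661, "BIKE", "BLUE", "D", 28000),
    (48684, "CAR", "RED", "B", 45145), (48684, "CAR", "RED", "D", 35613),
    (48684, "CAR", "RED", "A", 82312), (48684, "CAR", "RED", "C", 24536),
    (48684, "CAR", "RED", "E", 45613), (54519, "BIKE", "BLACK", "A", 21345),
    (54519, "BIKE", "BLACK", "B", 62623), (54519, "BIKE", "BLACK", "C", 14613),
    (54519, "BIKE", "BLACK", "E", 14365), (54519, "BIKE", "BLACK", "D", 67353),
]
df = pd.DataFrame(rows, columns=["ITEM", "CATEGORY", "COLOR", "LOCATION", "PRICE"])

# idxmax gives, for each ITEM, the index label of the row with the largest PRICE;
# df.loc then pulls those full rows out of the original frame.
best = df.loc[df.groupby("ITEM")["PRICE"].idxmax()]
print(best)
```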
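You can group by the COLOR column and then sort each group by PRICE, something like this:
df.groupby(["COLOR"]).apply(lambda x: x.sort_values(["PRICE"], ascending = False))
This only sorts the rows within each group; the first row of each group is then the one with the highest price.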
I have a dataframe that is:
A
1 king, crab, 2008
2 green, 2010
3 blue
4 green no. 4
5 green, house
I want to split the dates out into:
A B
1 king, crab 2008
2 green 2010
3 blue
4 green no. 4
5 green, house
I can't split on the first instance of ", " because that would make:
A B
1 king crab, 2008
2 green 2010
3 blue
4 green no. 4
5 green house
I can't split after the last instance of ", " because that would make:
A B
1 king crab 2008
2 green 2010
3 blue
4 green no. 4
5 green house
I also can't separate it by numbers because that would make:
A B
1 king, crab 2008
2 green 2010
3 blue
4 green no. 4
5 green, house
Is there some way to split on ", " followed by a 4-digit number that lies between two values? The two-value condition would be extra safety to filter out accidental 4-digit numbers that are clearly not years. For example:
Split by:
", " + (four-digit number between 1000 and 2021)
Also appreciated are answers that split by:
", " + four-digit number
Even better would be an answer that takes into account that the number is ALWAYS at the end of the string.
Or you can just use Series.str.extract and Series.str.replace:
df = pd.DataFrame({"A":["king, crab, 2008","green, 2010","blue","green no. 4","green, house"]})
df["year"] = df["A"].str.extract(r"(\d{4})")
df["A"] = df["A"].str.replace(r",\s\d{4}", "", regex=True)
print (df)
A year
0 king, crab 2008
1 green 2010
2 blue NaN
3 green no. 4 NaN
4 green, house NaN
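The question also asks for the number to be at the end of the string and within a range. A minimal sketch of that stricter variant follows; the pattern, the group names `name` and `year`, and the extra row "widget, 9999" are illustrative assumptions, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({"A": ["king, crab, 2008", "green, 2010", "blue",
                         "green no. 4", "green, house", "widget, 9999"]})

# Match ", " followed by exactly four digits at the very end of the string.
parts = df["A"].str.extract(r"^(?P<name>.*), (?P<year>\d{4})$")

# Treat the number as a year only when it lies in the range given in the question.
year = pd.to_numeric(parts["year"], errors="coerce")
is_year = year.between(1000, 2021)

df["B"] = parts["year"].where(is_year)           # year as text, NaN otherwise
df["A"] = parts["name"].where(is_year, df["A"])  # strip the year only when accepted
print(df)
```

Rows without a trailing ", dddd", and rows whose number falls outside the range, keep their original text in A and get NaN in B.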
import pandas as pd
list_dict_Input = [{'A': 'king, crab, 2008'},
{'A': 'green, 2010'},
{'A': 'green no. 4'},
{'A': 'green no. 4'},]
df = pd.DataFrame(list_dict_Input)
for row_Index in range(len(df)):
    text = (df.iloc[row_Index]['A']).strip()
    last_4_Char = (text[-4:])
    if last_4_Char.isdigit() and int(last_4_Char) >= 1000 and int(last_4_Char) <= 2021:
        df.at[row_Index, 'B'] = last_4_Char
print(df)
I am trying to change a column if certain strings appear in another column of the same row. I am new to Pandas.
I need to change the price of some oranges to 200 but not the price of 'Red Orange'. I cannot change the name of the "fruits". It is a much longer string and I just made it shorter for convenience here.
fruits price
Green apple from us 10
Orange Apple from US 11
Mango from Canada 15
Blue Orange from Mexico 16
Red Orange from Costa 15
Pink Orange from Brazil 19
Yellow Pear from Guatemala 32
Black Melon from Guatemala 4
Purple orange from Honduras 5
so that the final result would be
fruits price
Green apple from us 10
Orange Apple from US 11
Mango from Canada 15
Blue Orange from Mexico 200
Red Orange from Costa 15
Pink Orange from Brazil 200
Yellow Pear from Guatemala 32
Black Melon from Guatemala 4
Purple orange from Honduras 5
I tried
df.loc[df['fruits'].str.lower().str.contains('orange'), 'price'] = 200
But this changes the price of 4 items instead of only 2.
I also tried a for loop once, and that changed the price in the entire column.
You can use regex:
import re
df.loc[df['fruits'].str.lower().str.contains(r'(?<!red) orange', regex = True), 'price'] = 200
(?<!red) is a negative lookbehind, so " orange" is not matched when it is immediately preceded by "red". The mandatory space before the word orange also ensures that it is not the first word, so you won't have to worry about it being the color that describes something else.
df.loc[((df['fruits'].str.contains('orange')) & (~df['fruits'].str.contains('Red'))),'price'] = 200
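We check that the string contains 'orange' and use ~ to confirm that 'Red' is not present. If both conditions are true, the price is changed to 200. Note that str.contains is case-sensitive by default, so 'orange' as written here does not match 'Orange'.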
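A hedged sketch that reproduces the expected output from the question on the sample data. It assumes that capitalization is significant, as the expected output suggests ("Purple orange" and "Orange Apple" keep their prices), so the lookbehind pattern is applied without lowercasing; if case should be ignored instead, a different pattern would be needed.

```python
import pandas as pd

df = pd.DataFrame({
    "fruits": ["Green apple from us", "Orange Apple from US", "Mango from Canada",
               "Blue Orange from Mexico", "Red Orange from Costa",
               "Pink Orange from Brazil", "Yellow Pear from Guatemala",
               "Black Melon from Guatemala", "Purple orange from Honduras"],
    "price": [10, 11, 15, 16, 15, 19, 32, 4, 5],
})

# " Orange" must follow another word (leading space) that is not "Red".
# Case-sensitive on purpose (assumption based on the expected output).
mask = df["fruits"].str.contains(r"(?<!Red) Orange", regex=True)
df.loc[mask, "price"] = 200
print(df)
```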