I have this example CSV file:
Name,Dimensions,Color
Chair,!12:88:33!!9:10:50!!40:23:11!,Red
Table,!9:10:50!!40:23:11!,Brown
Couch,!40:23:11!!12:88:33!,Blue
I read it into a dataframe, then split Dimensions on ! and take the first value of each !..:..:..! section. I append these as new columns to the dataframe and delete Dimensions (code below).
import pandas as pd
df = pd.read_csv("./data.csv")
df[["first","second","third"]] = (df['Dimensions']
.str.strip('!')
.str.split('!{1,}', expand=True)
.apply(lambda x: x.str.split(':').str[0]))
df = df.drop("Dimensions", axis=1)
And I get this:
Name Color first second third
0 Chair Red 12 9 40
1 Table Brown 9 40 None
2 Couch Blue 40 12 None
I named them ["first","second","third"] manually here.
But what if there are more than 3 in the future, or only 2, or I don't know how many there will be, and I want them to be named using a string + an enumerating number?
Like this:
Name Color data_0 data_1 data_2
0 Chair Red 12 9 40
1 Table Brown 9 40 None
2 Couch Blue 40 12 None
Question:
How do I make the naming automatic, based on the string "data_" so it gives each column the name "data_" + the number of the column? (So I don't have to type in names manually)
Use DataFrame.pop to select and drop the column Dimensions, add a prefix to the default column names with DataFrame.add_prefix, and append the result to the original DataFrame with DataFrame.join:
df = (df.join(df.pop('Dimensions')
.str.strip('!')
.str.split('!{1,}', expand=True)
.apply(lambda x: x.str.split(':').str[0]).add_prefix('data_')))
print (df)
Name Color data_0 data_1 data_2
0 Chair Red 12 9 40
1 Table Brown 9 40 None
2 Couch Blue 40 12 None
Nevermind, hahah, I solved it.
import pandas as pd
df = pd.read_csv("./data.csv")
df2 = (df['Dimensions']
.str.strip('!')
.str.split('!{1,}', expand=True)
.apply(lambda x: x.str.split(':').str[0]))
df[[ ("data_"+str(i)) for i in range(len(df2.columns)) ]] = df2
df = df.drop("Dimensions", axis=1)
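For completeness, a minimal sketch (with made-up data) showing that the enumerated naming adapts to however many sections appear; note that newer pandas versions want an explicit regex=True for a pattern like '!{1,}':

```python
import pandas as pd

# Hypothetical data: rows with a varying number of !..:..:..! sections.
df = pd.DataFrame({
    "Name": ["Chair", "Table"],
    "Dimensions": ["!12:88:33!!9:10:50!!40:23:11!!7:7:7!", "!9:10:50!!40:23:11!"],
})

parts = (df["Dimensions"]
         .str.strip("!")
         .str.split("!+", regex=True, expand=True)   # regex=True is explicit on pandas >= 1.4
         .apply(lambda s: s.str.split(":").str[0]))

# Name the columns data_0 .. data_n, however many there turn out to be.
parts.columns = [f"data_{i}" for i in range(parts.shape[1])]
df = df.drop(columns="Dimensions").join(parts)
```

Rows with fewer sections get None/NaN in the trailing columns, exactly as in the 3-column version.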
I have a dataframe of the following form:
   Year 1 Grade  Year 2 Grade  Year 3 Grade  Year 4 Grade  Year 1 Students  Year 2 Students  Year 3 Students  Year 4 Students
0            60            70            80           100               20               32               18               25
I would like to somehow transpose this table to the following format:
Year  Grade  Students
   1     60        20
   2     70        32
   3     80        18
   4    100        25
I created a list of years and initiated a new dataframe with the "year" column. I was thinking of matching the year integer to the column name containing it in the original DF, match and assign the correct value, but got stuck there.
You need a manual reshaping using a split of the Index into a MultiIndex:
out = (df
.set_axis(df.columns.str.split(expand=True), axis=1) # make MultiIndex
.iloc[0] # select row as Series
.unstack() # unstack Grade/Students
.droplevel(0) # remove literal "Year"
.rename_axis('Year') # set index name
.reset_index() # index to column
)
output:
Year Grade Students
0 1 60 20
1 2 70 32
2 3 80 18
3 4 100 25
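The same chain, runnable end to end against a reconstruction of the one-row frame (column layout assumed from the question):

```python
import pandas as pd

# Reconstruct the one-row input frame described in the question.
cols = ["Year 1 Grade", "Year 2 Grade", "Year 3 Grade", "Year 4 Grade",
        "Year 1 Students", "Year 2 Students", "Year 3 Students", "Year 4 Students"]
df = pd.DataFrame([[60, 70, 80, 100, 20, 32, 18, 25]], columns=cols)

out = (df
       .set_axis(df.columns.str.split(expand=True), axis=1)  # ('Year','1','Grade') MultiIndex
       .iloc[0]             # the single row, as a Series with that MultiIndex
       .unstack()           # Grade/Students level becomes columns
       .droplevel(0)        # drop the constant 'Year' level
       .rename_axis('Year')
       .reset_index())
```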
Or using pivot_longer from janitor:
# pip install pyjanitor
import janitor
out = (df.pivot_longer(
names_to = ('ignore', 'Year', '.value'),
names_sep = ' ')
.drop(columns='ignore')
)
out
Year Grade Students
0 1 60 20
1 2 70 32
2 3 80 18
3 4 100 25
The .value determines which parts of the sub-labels in the columns are retained; the labels are split apart by names_sep, which can be a string or a regex. Another option is to use a regex with names_pattern to split and reshape the columns:
df.pivot_longer(names_to = ('Year', '.value'),
names_pattern = r'.+(\d)\s(.+)')
Year Grade Students
0 1 60 20
1 2 70 32
2 3 80 18
3 4 100 25
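If installing pyjanitor is not an option, the same names_pattern idea can be sketched in plain pandas with str.extract and a MultiIndex (column layout assumed from the question):

```python
import pandas as pd

cols = ["Year 1 Grade", "Year 2 Grade", "Year 3 Grade", "Year 4 Grade",
        "Year 1 Students", "Year 2 Students", "Year 3 Students", "Year 4 Students"]
df = pd.DataFrame([[60, 70, 80, 100, 20, 32, 18, 25]], columns=cols)

# Same idea as names_pattern: pull (Year, measure) out of each label with a regex,
# then unstack the measure level back into columns.
extracted = df.columns.str.extract(r'.+(\d)\s(.+)')
long = df.set_axis(pd.MultiIndex.from_frame(extracted, names=['Year', 'measure']), axis=1)
out = long.iloc[0].unstack().rename_axis('Year').reset_index()
```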
Here's one way to do it. Feel free to ask questions about how it works.
import pandas as pd
cols = ["Year 1 Grade", "Year 2 Grade", "Year 3 Grade" , "Year 4 Grade",
"Year 1 Students", "Year 2 Students", "Year 3 Students", "Year 4 Students"]
vals = [60,70,80,100,20,32,18,25]
vals = [[v] for v in vals]
df = pd.DataFrame({k:v for k,v in zip(cols,vals)})
grades = df.filter(like="Grade").T.reset_index(drop=True).rename(columns={0:"Grades"})
students = df.filter(like="Student").T.reset_index(drop=True).rename(columns={0:"Students"})
pd.concat([grades,students], axis=1)
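The concat above yields the Grades and Students columns but no Year column; one hedged way to add it, assuming the rows come out in year order:

```python
import pandas as pd

cols = ["Year 1 Grade", "Year 2 Grade", "Year 3 Grade", "Year 4 Grade",
        "Year 1 Students", "Year 2 Students", "Year 3 Students", "Year 4 Students"]
vals = [[60], [70], [80], [100], [20], [32], [18], [25]]
df = pd.DataFrame(dict(zip(cols, vals)))

grades = df.filter(like="Grade").T.reset_index(drop=True).rename(columns={0: "Grades"})
students = df.filter(like="Student").T.reset_index(drop=True).rename(columns={0: "Students"})
out = pd.concat([grades, students], axis=1)
out.insert(0, "Year", range(1, len(out) + 1))  # assumed: rows are already in year order
```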
I came up with this. grades here is your first row of values.
import numpy as np
import pandas as pd
df = pd.DataFrame([grades])                    # your one-row dataframe
vals = np.array(df.iloc[0]).reshape(2, 4).T    # first 4 values are grades, last 4 are students
new_df = pd.DataFrame(vals).reset_index()
new_df['index'] += 1                           # index 0..3 -> years 1..4
new_df.columns = ['Year', 'Grade', 'Students'] # rename the columns
I have large CSVs with the following sample dataframes:
df1 =
Index Fruit Vegetable
0 Mango Spinach
1 Berry Carrot
2 Banana Cabbage
df2 =
Index Unit Price
0 Mango_123 30
1 234_Artichoke_CE 45
2 23_Banana 12
3 Berry___LE 10
4 Cabbage___12LW 25
5 Rice_ww_12 40
6 Spinach_KJ 34
7 234_Carrot_23 08
8 10000_Lentil 12
9 Pot________12 32
I would like to use the matching names in df2 to replace the names in df1, creating the following dataframe:
df3=
Index Fruit Vegetable
0 Mango_123 Spinach_KJ
1 Berry___LE 234_Carrot_23
2 23_Banana Cabbage___12LW
What would be a generic way to do this? Thank you.
You can use fuzzy matching with thefuzz.process.extractOne, which will compute the closest match using Levenshtein distance:
# pip install thefuzz
from thefuzz import process
cols = ['Fruit', 'Vegetable']
df1[cols] = df1[cols].applymap(lambda x: process.extractOne(x, df2['Unit'])[0])
output:
Index Fruit Vegetable
0 0 Mango_123 Spinach_KJ
1 1 Berry___LE 234_Carrot_23
2 2 23_Banana Cabbage___12LW
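If installing thefuzz is not an option, the standard library's difflib can serve as a rough substitute (a sketch against the sample data; difflib's similarity ratio is not Levenshtein distance, so results can differ on harder inputs):

```python
import difflib
import pandas as pd

df1 = pd.DataFrame({'Fruit': ['Mango', 'Berry', 'Banana'],
                    'Vegetable': ['Spinach', 'Carrot', 'Cabbage']})
units = ['Mango_123', '234_Artichoke_CE', '23_Banana', 'Berry___LE', 'Cabbage___12LW',
         'Rice_ww_12', 'Spinach_KJ', '234_Carrot_23', '10000_Lentil', 'Pot________12']

def closest(name, candidates):
    # difflib ranks by similarity ratio; cutoff=0 forces a best-effort match.
    return difflib.get_close_matches(name, candidates, n=1, cutoff=0)[0]

for col in ['Fruit', 'Vegetable']:
    df1[col] = [closest(x, units) for x in df1[col]]
```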
Your problem can also be solved with a list comprehension (this relies on each df1 name appearing as a substring of a Unit in df2):
fruit_list = [df2.Unit[df2.Unit.str.contains(x)].values[0] for x in df1.Fruit.tolist()]
vegetable_list = [df2.Unit[df2.Unit.str.contains(x)].values[0] for x in df1.Vegetable.tolist()]
The code above creates two lists: one extracts the matching fruit names from df2, the other does the same for vegetables. Then create a new dataframe and do the following:
df3 = pd.DataFrame(columns=["Fruit", "Vegetable"])
df3["Fruit"] = fruit_list
df3["Vegetable"] = vegetable_list
Data aggregation as currently parsed from the file:
obj price1*red price1*blue price2*red price2*blue
a 5 7 10 12
b 15 17 20 22
desired outcome:
obj color price1 price2
a red 5 7
a blue 10 12
b red 15 17
b blue 20 22
this example is simplified. The real use case has 404 columns and around 10,000 rows. The data has roughly 99 colors and 4 kinds of price lists (there are always exactly 4 price lists).
I already tried an approach from another part I programmed before in Python:
df_pricelist = pd.melt(df_pricelist, id_vars=["object_nr"], var_name='color', value_name='prices')
but that approach was originally meant to pivot a single attribute into multiple rows; in other words, one cell per price list instead of multiple cells. There I also used assign to write the different parts of the string into different column cells.
To get all the relevant columns into the dataframe I use str.startswith, so I don't have to know in advance which colors exist.
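To illustrate that limitation: a plain melt flattens all four price cells into a single value column, and the price-list/color split has to be recovered afterwards (a sketch on the simplified example; pivot_table is just one way to re-widen):

```python
import pandas as pd

df = pd.DataFrame({"obj": ["a", "b"],
                   "price1*red": [5, 15], "price1*blue": [7, 17],
                   "price2*red": [10, 20], "price2*blue": [12, 22]})

# A plain melt puts every price in one column...
long = pd.melt(df, id_vars=["obj"], var_name="pricelist_color", value_name="price")
# ...so the price-list/color split must be recovered from the label afterwards.
long[["pricelist", "color"]] = long["pricelist_color"].str.split("*", regex=False, expand=True)
out = long.pivot_table(index=["obj", "color"], columns="pricelist",
                       values="price").reset_index()
```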
A solution that makes use of a MultiIndex as an intermediate step:
import pandas as pd
# Construct example dataframe
col_names = ["obj", "price1*red", "price1*blue", "price2*red", "price2*blue"]
data = [
["a", 5, 7, 10, 12],
["b", 15, 17, 20, 22],
]
df = pd.DataFrame(data, columns=col_names)
# Convert objects column into rows index
df2 = df.set_index("obj")
# Convert columns index into two-level multi-index by splitting name strings
color_price_pairs = [tuple(col_name.split("*")) for col_name in df2.columns]
df2.columns = pd.MultiIndex.from_tuples(color_price_pairs, names=("price", "color"))
# Stack colors-level of the columns index into a rows index level
df2 = df2.stack()
df2.columns.name = ""
# Optional: convert rows index (containing objects and colors) into columns
df2 = df2.reset_index()
This is a print-out that shows both the original dataframe df and the result dataframe df2:
In [1] df
Out[1]:
obj price1*red price1*blue price2*red price2*blue
0 a 5 7 10 12
1 b 15 17 20 22
In [2]: df2
Out[2]:
obj color price1 price2
0 a blue 7 12
1 a red 5 10
2 b blue 17 22
3 b red 15 20
I have a dataframe where I need to drop if any of the combinations in my nested list are met. Here's the sample dataframe:
df = pd.DataFrame([['A','Green',10],['A','Red',20],['B','Blue',5],['B','Red',15],['C','Orange',25]],columns = ['Letter','Color','Value'])
print(df)
Letter Color Value
0 A Green 10
1 A Red 20
2 B Blue 5
3 B Red 15
4 C Orange 25
I have a list of letter/color combinations that I need to remove from the dataframe:
dropList = [['A','Green'],['B','Red']]
How can I drop from the dataframe where the letter/color combinations are in any of the nested lists?
Approaches I can do if necessary, but want to avoid:
Write a .apply function
Any form of brute force iteration
Convert the dropList to a df and merge
#df_out = code here to drop if letter/color combo appears in my droplist
print(df_out)
Letter Color Value
0 A Red 20
1 B Blue 5
2 C Orange 25
I imagine there is some simple one/two line solution that I just can't see...Thanks!
you can create a helper DF:
In [36]: drp = pd.DataFrame(dropList, columns=['Letter','Color'])
merge (left) your main DF with the helper DF and select only those rows that are missing in the right DF:
In [37]: df.merge(drp, how='left', indicator=True) \
.query("_merge=='left_only'") \
.drop(columns='_merge')
Out[37]:
Letter Color Value
1 A Red 20
2 B Blue 5
4 C Orange 25
You can use the difference between the Letter/Color index and dropList to reindex the DF.
result = (
df.set_index(['Letter','Color'])
.pipe(lambda x: x.reindex(x.index.difference(dropList)))
.reset_index()
)
result
Out[45]:
Letter Color Value
0 A Red 20
1 B Blue 5
2 C Orange 25
Here is a crazy use of isin(), though my first choice would be @MaxU's solution:
new_df = df[~df[['Letter', 'Color']].apply(','.join,axis = 1).isin([s[0]+','+s[1] for s in dropList])]
Letter Color Value
1 A Red 20
2 B Blue 5
4 C Orange 25
Multi-indexing on the columns you use in dropList should do what you're after. Subtract the elements to be dropped from the full set of multiindex elements, then slice the dataframe by that remainder.
Note that the elements of dropList need to be tuples for the lookup.
dropSet = {tuple(elem) for elem in dropList}
# Creates a multi-index on letter/colour.
temp = df.set_index(['Letter', 'Color'])
# Keep all elements of the index except those in droplist.
temp = temp.loc[list(set(temp.index) - dropSet)]
# Reset index to get the original column layout.
df_dropped = temp.reset_index()
This returns:
In [4]: df_dropped
Out[4]:
Letter Color Value
0 B Blue 5
1 A Red 20
2 C Orange 25
Transform the list of lists into a dictionary
mapper = dict(dropList)
Now filter by mapping the dictionary over the dataframe (note this assumes each Letter appears at most once in dropList):
df[df.Letter.map(mapper) != df.Color]
Yields
Letter Color Value
1 A Red 20
2 B Blue 5
4 C Orange 25
This post is inspired by @Wen's solution to a later problem; please upvote there.
df2 = pd.DataFrame(dropList, columns=['Letter', 'Color'])
df.loc[~df.index.isin(df.merge(df2.assign(a='key'), how='left').dropna().index)]
I have fields in a pandas dataframe like the sample data below. The values in one of the fields are fractions with the form something/count(something). I would like to split the values like the example output below, and create new records. Basically the numerator and the denominator. Some of the values even have multiple /, like count(something)/count(thing)/count(dog). So I'd want to split that value in to 3 records. Any tips on how to do this would be greatly appreciated.
Sample Data:
SampleDf=pd.DataFrame([['tom','sum(stuff)/count(things)'],['bob','count(things)/count(stuff)']],columns=['ReportField','OtherField'])
Example Output:
OutputDf=pd.DataFrame([['tom1','sum(stuff)'],['tom2','count(things)'],['bob1','count(things)'],['bob2','count(stuff)']],columns=['ReportField','OtherField'])
There might be a better way but try this,
df = df.set_index('ReportField')
df = pd.DataFrame(df.OtherField.str.split('/', expand = True).stack().reset_index(-1, drop = True)).reset_index()
You get
ReportField 0
0 tom sum(stuff)
1 tom count(things)
2 bob count(things)
3 bob count(stuff)
One possible way might be as following:
# split and stack
new_df = pd.DataFrame(SampleDf.OtherField.str.split('/').tolist(), index=SampleDf.ReportField).stack().reset_index()
print(new_df)
Output:
ReportField level_1 0
0 tom 0 sum(stuff)
1 tom 1 count(things)
2 bob 0 count(things)
3 bob 1 count(stuff)
Now, combine ReportField with level_1:
# combine strings for tom1, tom2 ,.....
new_df['ReportField'] = new_df.ReportField.str.cat((new_df.level_1+1).astype(str))
# remove level column
del new_df['level_1']
# rename columns
new_df.columns = ['ReportField', 'OtherField']
print (new_df)
Output:
ReportField OtherField
0 tom1 sum(stuff)
1 tom2 count(things)
2 bob1 count(things)
3 bob2 count(stuff)
You can use:
split with expand=True for a new DataFrame
reshape by stack and reset_index
add a counter to the ReportField column, converted to str by astype
remove the helper column level_1 by drop
OutputDf = (SampleDf.set_index('ReportField')['OtherField'].str.split('/', expand=True)
.stack().reset_index(name='OtherField'))
OutputDf['ReportField'] = OutputDf['ReportField'] + OutputDf['level_1'].add(1).astype(str)
OutputDf = OutputDf.drop('level_1', axis=1)
print (OutputDf)
ReportField OtherField
0 tom1 sum(stuff)
1 tom2 count(things)
2 bob1 count(things)
3 bob2 count(stuff)
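On newer pandas (0.25+), the same reshape can be sketched with explode, using groupby().cumcount() for the tom1/tom2 numbering:

```python
import pandas as pd

SampleDf = pd.DataFrame([['tom', 'sum(stuff)/count(things)'],
                         ['bob', 'count(things)/count(stuff)']],
                        columns=['ReportField', 'OtherField'])

# Split into lists, then explode each list element onto its own row.
out = (SampleDf.assign(OtherField=SampleDf['OtherField'].str.split('/'))
       .explode('OtherField')
       .reset_index(drop=True))
# Number repeated names: tom -> tom1, tom2, ...
out['ReportField'] += (out.groupby('ReportField').cumcount() + 1).astype(str)
```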