pandas change attribute values to row values of object - python

Data aggregation as parsed from the file at the moment:
obj  price1*red  price1*blue  price2*red  price2*blue
a    5           7            10          12
b    15          17           20          22
desired outcome:
obj  color  price1  price2
a    red    5       10
a    blue   7       12
b    red    15      20
b    blue   17      22
This example is simplified. The real use case consists of 404 columns and around 10,000 rows. The data has roughly 99 colors and 4 different kinds of price lists (there are always exactly 4 price lists).
I already tried a different approach, reused from another part I programmed before in Python:
df_pricelist = pd.melt(df_pricelist, id_vars=["object_nr"], var_name='color', value_name='prices')
but this approach was initially used to pivot data from a single attribute into multiple lines; in other words, there was only one cell for the different price lists instead of multiple cells. There I also used assign to write the different blocks of the split string into different column cells.
To find all the relevant columns in the dataframe I use str.startswith. This way I don't have to know all the different colors there could be.
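For illustration, a minimal sketch of that kind of prefix-based column selection (the file name and exact prefixes here are hypothetical):
import pandas as pd

df = pd.read_csv("pricelists.csv")  # hypothetical input file
# keep the object column plus every price list column, whatever colors exist
price_cols = [c for c in df.columns if c.startswith(("price1", "price2", "price3", "price4"))]
df_pricelist = df[["obj"] + price_cols]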

A solution that makes use of a MultiIndex as an intermediate step:
import pandas as pd
# Construct example dataframe
col_names = ["obj", "price1*red", "price1*blue", "price2*red", "price2*blue"]
data = [
["a", 5, 7, 10, 12],
["b", 15, 17, 20, 22],
]
df = pd.DataFrame(data, columns=col_names)
# Convert objects column into rows index
df2 = df.set_index("obj")
# Convert columns index into two-level multi-index by splitting name strings
color_price_pairs = [tuple(col_name.split("*")) for col_name in df2.columns]
df2.columns = pd.MultiIndex.from_tuples(color_price_pairs, names=("price", "color"))
# Stack the "color" level of the columns index into a rows index level
df2 = df2.stack(level="color")
df2.columns.name = ""
# Optional: convert rows index (containing objects and colors) into columns
df2 = df2.reset_index()
This is a print-out that shows both the original dataframe df and the result dataframe df2:
In [1]: df
Out[1]:
  obj  price1*red  price1*blue  price2*red  price2*blue
0   a           5            7          10           12
1   b          15           17          20           22
In [2]: df2
Out[2]:
  obj color  price1  price2
0   a  blue       7      12
1   a   red       5      10
2   b  blue      17      22
3   b   red      15      20
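For reference, pandas also ships pd.wide_to_long, which can express the same reshape in one call; a sketch under the same "price*color" naming assumption:
df_long = pd.wide_to_long(
    df, stubnames=["price1", "price2"], i="obj", j="color",
    sep="*", suffix=r"\w+",
).reset_index()
With the real data, stubnames would list all 4 price lists; the colors are picked up automatically by the suffix pattern.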

Related

Pandas: Give (string + numbered) name to unknown number of added columns

I have this example CSV file:
Name,Dimensions,Color
Chair,!12:88:33!!9:10:50!!40:23:11!,Red
Table,!9:10:50!!40:23:11!,Brown
Couch,!40:23:11!!12:88:33!,Blue
I read it into a dataframe, then split Dimensions by ! and take the first value of each !..:..:..!-section. I append these as new columns to the dataframe, and delete Dimensions. (code for this below)
import pandas as pd

df = pd.read_csv("./data.csv")
df[["first", "second", "third"]] = (df['Dimensions']
    .str.strip('!')
    .str.split('!{1,}', expand=True)
    .apply(lambda x: x.str.split(':').str[0]))
df = df.drop("Dimensions", axis=1)
And I get this:
    Name  Color first second third
0  Chair    Red    12      9    40
1  Table  Brown     9     40  None
2  Couch   Blue    40     12  None
I named them ["first","second","third"] manually here.
But what if there are more than 3 in the future, or only 2, or I don't know how many there will be, and I want them to be named using a string + an enumerating number?
Like this:
    Name  Color data_0 data_1 data_2
0  Chair    Red     12      9     40
1  Table  Brown      9     40   None
2  Couch   Blue     40     12   None
Question:
How do I make the naming automatic, based on the string "data_" so it gives each column the name "data_" + the number of the column? (So I don't have to type in names manually)
Use DataFrame.pop to select and drop the column Dimensions, add DataFrame.add_prefix to generate the default column names, and append to the original DataFrame with DataFrame.join:
df = (df.join(df.pop('Dimensions')
                .str.strip('!')
                .str.split('!{1,}', expand=True)
                .apply(lambda x: x.str.split(':').str[0])
                .add_prefix('data_')))
print(df)
    Name  Color data_0 data_1 data_2
0  Chair    Red     12      9     40
1  Table  Brown      9     40   None
2  Couch   Blue     40     12   None
Nevermind, hahah, I solved it.
import pandas as pd

df = pd.read_csv("./data.csv")
df2 = (df['Dimensions']
    .str.strip('!')
    .str.split('!{1,}', expand=True)
    .apply(lambda x: x.str.split(':').str[0]))
df[["data_" + str(i) for i in range(len(df2.columns))]] = df2
df = df.drop("Dimensions", axis=1)

Pandas vectorization for a multiple data frame operation

I am looking to increase the speed of an operation within pandas, and I have learned that it is generally best to do so via vectorization. The problem I am looking for help with is vectorizing the following operation.
Setup:
df1 = a table with a date-time column and a city column
df2 = another (considerably larger) table with a date-time column and a city column
The Operation:
for i, row in df2.iterrows():
    for x, row2 in df1.iterrows():
        if row['date-time'] - row2['date-time'] > pd.Timedelta('8 hours') and row['city'] == row2['city']:
            df2.at[i, 'result'] = True
            break
As you might imagine, this operation is insanely slow on any dataset of a decent size. I am also just beginning to learn pandas vector operations and would like some help figuring out a more optimal way to solve this problem.
I think what you need is merge() with numpy.where() to achieve the same result.
Since you don't have a reproducible sample in your question, kindly consider this:
>>> import numpy as np
>>> import pandas as pd
>>> df1 = pd.DataFrame({'time':[24,20,15,10,5], 'city':['A','B','C','D','E']})
>>> df2 = pd.DataFrame({'time':[2,4,6,8,10,12,14], 'city':['A','B','C','F','G','H','D']})
>>> df1
   time city
0    24    A
1    20    B
2    15    C
3    10    D
4     5    E
>>> df2
   time city
0     2    A
1     4    B
2     6    C
3     8    F
4    10    G
5    12    H
6    14    D
From what I understand, you only need the rows of your df2 whose city value also appears in df1, where the difference between the times is greater than 8 hours (with these integer examples, at least 9).
To do that, we need to merge on your city column:
>>> new_df = df2.merge(df1, how = 'inner', left_on = 'city', right_on = 'city')
>>> new_df
   time_x city  time_y
0       2    A      24
1       4    B      20
2       6    C      15
3      14    D      10
time_x is the time from your df2 dataframe, and time_y is from your df1.
Now we need to check the difference of those times and retain the rows where it is greater than 8, using numpy.where() to flag them for filtering afterwards:
>>> new_df['flag'] = np.where(new_df['time_y'] - new_df['time_x'] > 8, ['Retain'], ['Remove'])
>>> new_df
   time_x city  time_y    flag
0       2    A      24  Retain
1       4    B      20  Retain
2       6    C      15  Retain
3      14    D      10  Remove
Now that you have that, you can simply filter new_df by the flag column, dropping the helper column from the final output:
>>> final_df = new_df[new_df['flag'].isin(['Retain'])][['time_x', 'city', 'time_y']]
>>> final_df
   time_x city  time_y
0       2    A      24
1       4    B      20
2       6    C      15
And there you go, no looping needed. Hope this helps :D
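If you also need the boolean result column on df2 itself, as in the original loop, here is a sketch that maps the flag back (reusing the toy frames above and this answer's time_y - time_x convention):
import pandas as pd

# merge while remembering df2's original row labels
merged = df2.reset_index().merge(df1, on='city', how='left', suffixes=('_df2', '_df1'))
# df2 rows with at least one df1 partner whose time differs by more than 8
hit = merged.loc[merged['time_df1'] - merged['time_df2'] > 8, 'index'].unique()
df2['result'] = df2.index.isin(hit)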

how to append data from different data frame in python?

I have about 20 data frames, all of which have the same columns, and I would like to add the data into one empty data frame. But when I use my code
interested_freq
     UPC   CPC  freq
0  136.0  B64G     2
1  136.0  H01L     1
2  136.0  H02S     1
3  244.0  B64G     1
4  244.0  H02S     1
5  257.0  B64G     1
6  257.0  H01L     1
7  312.0  B64G     1
8  312.0  H02S     1
list_of_lists = []
max_freq = df_interested_freq[df_interested_freq['freq'] == df_interested_freq['freq'].max()]
for row, cols in max_freq.iterrows():
    interested_freq = df_interested_freq[df_interested_freq['freq'] != 1]
    interested_freq
    list_of_lists.append(interested_freq)
list_of_lists
to append the first data frame, and then I change the names in that code, hoping that it will append more data:
list_of_lists = []
for row, cols in max_freq.iterrows():
    interested_freq_1 = df_interested_freq_1[df_interested_freq_1['freq'] != 1]
    interested_freq_1
    list_of_lists.append(interested_freq_1)
list_of_lists
but the first data frame disappears and only the most recently appended data shows. Have I done something wrong?
One way to create a new DataFrame from an existing DataFrame is to use df.copy() (see the detailed documentation for DataFrame.copy).
df.copy() is very much relevant here because a plain assignment like df2 = df1 only creates a second name for the same object, so changing a subset of data through the new name also changes the initial DataFrame; without the copy you risk clobbering your actual DataFrame.
Suppose the example DataFrame is df1:
>>> df1
   col1  col2
1    11    12
2    21    22
Solution: you can use the df.copy() method as follows, which carries the data along:
>>> df2 = df1.copy()
>>> df2
   col1  col2
1    11    12
2    21    22
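For contrast, a minimal sketch of what goes wrong without the copy (a plain assignment only aliases the same object):
>>> df2 = df1              # no copy: both names point at the same DataFrame
>>> df2.loc[1, 'col1'] = 99
>>> df1.loc[1, 'col1']     # df1 changed too
99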
In case you need the new dataframe (df2) to be created like df1 but don't want the values carried across, you have the option of using the reindex_like() method.
>>> df2 = pd.DataFrame().reindex_like(df1)
# df2 = pd.DataFrame(data=np.nan,columns=df1.columns, index=df1.index)
>>> df2
   col1  col2
1   NaN   NaN
2   NaN   NaN
Why do you use append here? It's not a list. Once you have the first dataframe (called df1, for example), try:
new_df = df1
new_df = pd.concat([new_df, df2])
You can do the same thing for all 20 dataframes.
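With all 20 frames in hand, a single pd.concat over a list is simpler still; a minimal sketch (the list is illustrative):
import pandas as pd

dfs = [df1, df2]  # extend with all 20 frames that share the same columns
combined = pd.concat(dfs, ignore_index=True)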

pandas drop where two columns match nested list values

I have a dataframe from which I need to drop rows if any of the combinations in my nested list are met. Here's the sample dataframe:
df = pd.DataFrame([['A', 'Green', 10], ['A', 'Red', 20], ['B', 'Blue', 5],
                   ['B', 'Red', 15], ['C', 'Orange', 25]],
                  columns=['Letter', 'Color', 'Value'])
print(df)
  Letter   Color  Value
0      A   Green     10
1      A     Red     20
2      B    Blue      5
3      B     Red     15
4      C  Orange     25
I have a list of letter/color combinations that I need to remove from the dataframe:
dropList = [['A','Green'],['B','Red']]
How can I drop rows from the dataframe where the letter/color combination appears in any of the nested lists?
Approaches I could use if necessary, but want to avoid:
Write a .apply function
Any form of brute force iteration
Convert the dropList to a df and merge
# df_out = code here to drop rows whose letter/color combo appears in dropList
print(df_out)
  Letter   Color  Value
0      A     Red     20
1      B    Blue      5
2      C  Orange     25
I imagine there is some simple one/two line solution that I just can't see...Thanks!
You can create a helper DF:
In [36]: drp = pd.DataFrame(dropList, columns=['Letter','Color'])
Merge (left) your main DF with the helper DF and keep only those rows that are missing from the right DF:
In [37]: df.merge(drp, how='left', indicator=True) \
            .query("_merge == 'left_only'") \
            .drop('_merge', axis=1)
Out[37]:
  Letter   Color  Value
1      A     Red     20
2      B    Blue      5
4      C  Orange     25
You can use the difference between the Letter/Color combinations and dropList to reindex the DF:
result = (
    df.set_index(['Letter', 'Color'])
      .pipe(lambda x: x.reindex(x.index.difference(dropList)))
      .reset_index()
)
result
Out[45]:
  Letter   Color  Value
0      A     Red     20
1      B    Blue      5
2      C  Orange     25
Here is a crazy use of isin(), though my first choice would be @MaxU's solution:
new_df = df[~df[['Letter', 'Color']].apply(','.join, axis=1)
            .isin([s[0] + ',' + s[1] for s in dropList])]
  Letter   Color  Value
1      A     Red     20
2      B    Blue      5
4      C  Orange     25
Multi-indexing on the columns you use in dropList should do what you're after. Subtract the elements to be dropped from the full set of multiindex elements, then slice the dataframe by that remainder.
Note that the elements of dropList need to be tuples for the lookup.
dropSet = {tuple(elem) for elem in dropList}
# Creates a multi-index on letter/colour.
temp = df.set_index(['Letter', 'Color'])
# Keep all elements of the index except those in droplist.
temp = temp.loc[list(set(temp.index) - dropSet)]
# Reset index to get the original column layout.
df_dropped = temp.reset_index()
This returns:
In [4]: df_dropped
Out[4]:
  Letter   Color  Value
0      B    Blue      5
1      A     Red     20
2      C  Orange     25
Transform the list of lists into a dictionary (this relies on each letter mapping to a single color in dropList):
mapper = dict(dropList)
Now filter, by mapping the dictionary over the dataframe:
df[df.Letter.map(mapper) != df.Color]
Yields
  Letter   Color  Value
1      A     Red     20
2      B    Blue      5
4      C  Orange     25
This post is inspired by @Wen's solution to a later problem, please upvote there.
df2 = pd.DataFrame(dropList, columns=['Letter', 'Color'])
# the left merge tags dropList rows via the helper column 'a'; dropna keeps only those
matched = df.merge(df2.assign(a='key'), how='left').dropna()
df.loc[~df.index.isin(matched.index)]
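For completeness, a sketch of a newer idiom that avoids the helper frame: build a MultiIndex from the two columns and test membership with isin (tuples required, as noted above):
mi = pd.MultiIndex.from_frame(df[['Letter', 'Color']])
df_out = df[~mi.isin([tuple(x) for x in dropList])]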

Nesting columns under a new header column in a DataFrame

I have a time series DataFrame df1 with prices in a ticker column, from which a new DataFrame df2 is created by concatenating df1 with 3 other columns sharing the same DateTimeIndex (screenshot omitted).
Now I need to make the ticker name "Equity(42950 [FB])" the new header, nest the 3 other columns under it, and have the ticker's prices replaced by the values in the "closePrice" column.
How to achieve this in Python?
Use pd.MultiIndex:
import numpy as np
import pandas as pd

d = pd.DataFrame(np.arange(20).reshape(5, 4), columns=['Equity', 'closePrice', 'mMb', 'mMv'])
arrays = [['Equity', 'Equity', 'Equity'], ['closePrice', 'mMb', 'mMv']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples)
# drop the original 'Equity' column and relabel the rest under the two-level index
df = pd.DataFrame(d.values[:, 1:], columns=index)
df
      Equity
  closePrice  mMb  mMv
0          1    2    3
1          5    6    7
2          9   10   11
3         13   14   15
4         17   18   19
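A hedged alternative for the concrete ticker header from the question: pd.concat with a dict (or its keys argument) nests existing columns under a new top level in one step:
nested = pd.concat({"Equity(42950 [FB])": d[['closePrice', 'mMb', 'mMv']]}, axis=1)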
