Dataframe reverse for drop(column = ) - python

I'm trying to manipulate a dataframe using a cumsum function.
My data looks like this:
To perform my cumsum, I use
df = pd.read_excel(excel_sheet, sheet_name='Sheet1').drop(columns=['Material']) # Dropping material column
I run the rest of my code, and get my expected outcome of a dataframe cumsum without the material listed:
df2 = df.to_numpy()  # Convert to a NumPy array (as_matrix() was removed in pandas 1.0)
new = df2.cumsum(axis=1)
print(new)
However, at the end, I need to put the Material column back. I'm unsure how to add it back at the beginning of the dataframe.

IIUC, then you can just set the material column to the index, then do your cumsum, and put it back in at the end:
df2 = df.set_index('Material').cumsum(1).reset_index()
An alternative would be to do your cumsum on all but the first column:
df.iloc[:,1:] = df.iloc[:,1:].cumsum(1)
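To illustrate both approaches, here is a minimal sketch. The question's actual data is not shown, so the month columns (Jan, Feb, Mar) and all values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the question's actual spreadsheet is not shown.
df = pd.DataFrame({
    "Material": ["steel", "wood"],
    "Jan": [1, 2],
    "Feb": [3, 4],
    "Mar": [5, 6],
})

# Approach 1: park Material in the index, cumsum across columns, restore it.
via_index = df.set_index("Material").cumsum(axis=1).reset_index()

# Approach 2: cumsum every column except the first, working on a copy.
via_iloc = df.copy()
via_iloc.iloc[:, 1:] = via_iloc.iloc[:, 1:].cumsum(axis=1)

print(via_index)
```

Either way, Material stays as the first column, so nothing needs to be re-inserted afterwards.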

Related

How to delete multiple rows in a data frame in pandas in Python?

I am using pandas to build a dataframe, and I want to delete the first 12 rows with the drop function. Every resource I have found says to use drop to delete rows, but it doesn't work and I don't know why. The error says that 'list' object has no attribute 'drop'. What should I do?
url='Exp01.html'
url=str(url)
df = pd.read_html(url)
df = df.drop(index=['1','12'],axis=0,inplace=True)
print(df)
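You can slice the rows out:
df = df.loc[12:]
df
loc slicing in general works this way:
df.loc[x:y]
where x is the starting index label and y is the ending index label (both are included).
[12:] starts at label 12 and has no ending label, so with the default integer index the first 12 rows (labels 0 to 11) are left out.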
Pandas read_html returns a list of dataframes.
So df is a list in your example. First, take a look at what the list holds.
If it's just one table (dataframe), you can change it to:
df = pd.read_html(url)[0]
Full code:
import pandas as pd

url = 'Exp01.html'
df = pd.read_html(url)[0]  # read_html returns a list; take the first table
df.drop(index=df.index[:12], inplace=True)  # drop the first 12 rows in place
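The list-versus-dataframe distinction can be checked without an HTML file. The sketch below uses a hand-built list as a stand-in for what pd.read_html returns; the names tables and trimmed and the value column are hypothetical.

```python
import pandas as pd

# Stand-in for pd.read_html(url): a plain Python list holding one dataframe.
tables = [pd.DataFrame({"value": range(20)})]

print(type(tables))            # a list, which has no .drop method
print(hasattr(tables, "drop"))

df = tables[0]                 # pick the table you actually want

# Drop the first 12 rows by position and renumber the remaining rows.
trimmed = df.iloc[12:].reset_index(drop=True)
print(trimmed)
```

Note that drop(..., inplace=True) returns None, so assigning its result back to df, as in the question, would lose the dataframe even after indexing into the list.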

Trying to replace the values of a dataframe with the values of another dataframe

I have two dataframes. One has two important labels that have some associated columns for each label. The second one has the same labels and more useful data for those same labels. I'm trying to replace the values in the first with the values of the second for each appropriate label. For example:
df = pd.DataFrame({'a':['x','y','z','t'], 'b':['t','x','y','z'], 'a_1':[1,2,3,4], 'a_2':[4,2,4,1], 'b_1':[1,2,3,4], 'b_2':[4,2,4,2]})
df_2 = pd.DataFrame({'n':['x','y','z','t'], 'n_1':[1,2,3,4], 'n_2':[1,2,3,4]})
I want to replace the values in a_1 and a_2 (and b_1 and b_2) with the values of n_1 and n_2 wherever a or b is the same as n. So far I have tried using the replace and map functions, and they work when I use them like this:
df.iloc[0] = df.iloc[0].replace({'a_1':df['a_1']}, df_2['n_1'].loc(df['a'].iloc[0]))
I can make the substitution for one specific line, but if I try to put that in a for loop and change the numbers, I get the error Cannot index by location index with a non-integer key. If I take the ilocs out, I get the original df unchanged and without any error messages. I get the same behavior when I use the map function. This is how I tried to write the for loop and the map:
for i in df:
    df.iloc[i] = df.iloc[i].replace({'a_1':df['a_1']}, df_2['n_1'].loc(df['a'].iloc[i]))
    df.iloc[i] = df.iloc[i].replace({'b_1':df['b_1']}, df_2['n_1'].loc(df['b'].iloc[i]))
And so on. And for the map function:
for i in df:
    df = df.map({df['b_1']: df_2['n_1'].loc(df['b'].iloc[i])})
    df = df.map({df['a_1']: df_2['n_1'].loc(df['a'].iloc[i])})
I would like the resulting dataframe to have the same format as the first but with the values of the second, something like this:
df = {'a':['x','y','z','t'], 'b':['t','x','y','z'], 'an_1':[1,2,3,4], 'an_2':[1,2,3,4], 'bn_1':[1,2,3,4], 'bn_2':[1,2,3,4]}
where an and bn are the values for a and b when n is equal to a or b in the second dataframe.
Hope this is comprehensible.
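One possible way to do this without a loop over rows is a label lookup with Series.map. This is only a sketch under one reading of the question: each a_k and b_k column is overwritten with the n_k value whose n label matches the entry in a or b. The helper name lookup is hypothetical. Note also that iterating with for i in df yields column names rather than row numbers, which would explain the non-integer key error from iloc.

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'y', 'z', 't'], 'b': ['t', 'x', 'y', 'z'],
                   'a_1': [1, 2, 3, 4], 'a_2': [4, 2, 4, 1],
                   'b_1': [1, 2, 3, 4], 'b_2': [4, 2, 4, 2]})
df_2 = pd.DataFrame({'n': ['x', 'y', 'z', 't'],
                     'n_1': [1, 2, 3, 4], 'n_2': [1, 2, 3, 4]})

# Index the second frame by label so each n_k column acts as a lookup table.
lookup = df_2.set_index('n')

for label in ['a', 'b']:
    for suffix in ['1', '2']:
        # Replace e.g. a_1 with the n_1 value found for each label in column a.
        df[f'{label}_{suffix}'] = df[label].map(lookup[f'n_{suffix}'])

print(df)
```

Labels that do not appear in df_2 would come out as NaN, so a fillna with the original values may be needed if the two frames do not cover the same labels.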

Pandas - How to insert a new column with the count when there are multiple clauses

I have the following excel sheet, which I've imported into pandas using read_csv
df
<table><tbody><tr><th>Order ID</th><th>Platform</th><th>Media Source</th><th>Campaign</th><th>1st order</th><th>Order fulfilled</th><th>Date</th></tr><tr><td>1</td><td>Web</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>TRUE</td><td>1/1/2019</td></tr><tr><td>2</td><td>Web</td><td>Facebook</td><td>FBCmp</td><td>FALSE</td><td>TRUE</td><td>2/1/2019</td></tr><tr><td>3</td><td>Web</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>FALSE</td><td>1/1/2019</td></tr><tr><td>4</td><td>Web</td><td>Facebook</td><td>FBCmp</td><td>TRUE</td><td>FALSE</td><td>1/1/2019</td></tr><tr><td>5</td><td>Mobile</td><td>Google</td><td>Cmp1</td><td>FALSE</td><td>TRUE</td><td>2/1/2019</td></tr><tr><td>6</td><td>Web</td><td>Google</td><td>Cmp2</td><td>TRUE</td><td>FALSE</td><td>1/1/2019</td></tr><tr><td>7</td><td>Mobile</td><td>Facebook</td><td>FBCmp</td><td>TRUE</td><td>TRUE</td><td>1/1/2019</td></tr><tr><td>8</td><td>Web</td><td>Google</td><td>Cmp2</td><td>FALSE</td><td>FALSE</td><td>2/1/2019</td></tr><tr><td>9</td><td>Mobile</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>TRUE</td><td>1/1/2019</td></tr><tr><td>10</td><td>Mobile</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>TRUE</td><td>1/1/2019</td></tr></tbody></table>
I want to add a new column NewOrderForDate which gives me a count of all the orders for that campaign for that date AND 1st Order = TRUE
Here's how the dataframe should look after adding this column
<table><tbody><tr><th>Order ID</th><th>Platform</th><th>Media Source</th><th>Campaign</th><th>1st order</th><th>Order fulfilled</th><th>Date</th><th>NewOrderForDate </th></tr><tr><td>1</td><td>Web</td><td>Google</td><td>Cmp1</td><td>FALSE</td><td>TRUE</td><td>1/1/2019</td><td>5</td></tr><tr><td>2</td><td>Web</td><td>Facebook</td><td>FBCmp</td><td>FALSE</td><td>TRUE</td><td>2/1/2019</td><td>2</td></tr><tr><td>3</td><td>Web</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>FALSE</td><td>1/1/2019</td><td>5</td></tr><tr><td>4</td><td>Web</td><td>Facebook</td><td>FBCmp</td><td>TRUE</td><td>FALSE</td><td>1/1/2019</td><td>5</td></tr><tr><td>5</td><td>Mobile</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>TRUE</td><td>2/1/2019</td><td>2</td></tr><tr><td>6</td><td>Web</td><td>Google</td><td>Cmp2</td><td>TRUE</td><td>FALSE</td><td>1/1/2019</td><td>5</td></tr><tr><td>7</td><td>Mobile</td><td>Facebook</td><td>FBCmp</td><td>TRUE</td><td>TRUE</td><td>1/1/2019</td><td>5</td></tr><tr><td>8</td><td>Web</td><td>Google</td><td>Cmp2</td><td>TRUE</td><td>FALSE</td><td>2/1/2019</td><td>2</td></tr><tr><td>9</td><td>Mobile</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>TRUE</td><td>1/1/2019</td><td>5</td></tr><tr><td>10</td><td>Mobile</td><td>Google</td><td>Cmp1</td><td>FALSE</td><td>TRUE</td><td>1/1/2019</td><td>5</td></tr></tbody></table>
If I had to do this in Excel, I'd probably use
=COUNTIFS(G$2:G$11,G2,E$2:E$11,"TRUE")
Basically, I want to group by column and date and get a count of all the orders where 1st order = TRUE and write these values to a new column
GroupBy 'Campaign', count the '1st order' and add 'NewOrderForDate' column for each group.
def udf(grp_df):
    # Count the rows in this group where '1st order' is True
    grp_df['NewOrderForDate'] = len(grp_df[grp_df['1st order']==True])
    return grp_df
result = df.groupby('Campaign', as_index=False, group_keys=False).apply(udf)
Use transform to keep the index shape, and sum the bool value of 1st Order:
df['NewOrderForDate'] = df.groupby(['Date', 'Campaign'])['1st order'].transform(lambda x: x.sum())
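As a small check of the transform approach, here is a sketch on a reduced version of the table (only the columns involved, six rows, stored in a hypothetical frame named orders). It also includes a date-only grouping, which is closer to what the COUNTIFS formula in the question computes, since that formula matches on the date column only; the column name NewOrderForDateOnly is made up here.

```python
import pandas as pd

# Reduced, hand-typed sample: only the columns needed for the count.
orders = pd.DataFrame({
    "Campaign": ["Cmp1", "FBCmp", "Cmp1", "FBCmp", "Cmp1", "Cmp2"],
    "1st order": [True, False, True, True, False, True],
    "Date": ["1/1/2019", "2/1/2019", "1/1/2019",
             "1/1/2019", "2/1/2019", "1/1/2019"],
})

# Summing booleans counts the True values; transform broadcasts the
# per-group result back onto every row of that group.
orders["NewOrderForDate"] = (
    orders.groupby(["Date", "Campaign"])["1st order"].transform("sum")
)

# Variant grouped by date only (hypothetical column name).
orders["NewOrderForDateOnly"] = (
    orders.groupby("Date")["1st order"].transform("sum")
)

print(orders)
```

Passing the string "sum" does the same job as the lambda and lets pandas use its built-in aggregation.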

Using pd.Dataframe.replace with an apply function as the replace value

I have several dataframes that have mixed in some columns with dates in this ASP.NET format "/Date(1239018869048)/". I've figured out how to parse this into python's datetime format for a given column. However I would like to put this logic into a function so that I can pass it any dataframe and have it replace all the dates that it finds that match a regex using pd.Dataframe.replace.
something like:
def pretty_dates(df):
    # Messy logic here
    pass
df.replace(to_replace=r'/Date\((\d+)\)/', value=pretty_dates(df), regex=True)
Problem with this is that the df that is being passed to pretty_dates is the whole dataframe not just the cell that is needed to be replaced.
So the concept I'm trying to figure out is if there is a way that the value that should be replaced when using df.replace can be a function instead of a static value.
Thank you so much in advance
EDIT
To try to add some clarity: I have many columns in a dataframe, over a hundred, that contain this date format. I would like not to list out every single column that has a date. Is there a way to apply the function to clean my dates across all the columns in my dataset? So I do not want to clean one column but all the hundreds of columns of my dataframe.
I'm sure you can use regex to do this in one step, but here is how to apply it to the whole column at once:
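df = pd.Series(['/Date(1239018869048)/',
                '/Date(1239018869048)/'], dtype=str)
# regex=True is required in recent pandas, where str.replace is literal by default
df = df.str.replace(r'/Date\(', '', regex=True)
df = df.str.replace(r'\)/', '', regex=True)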
print(df)
0 1239018869048
1 1239018869048
dtype: object
As far as I understand, you need to apply a custom function to selected cells in a specified column. I hope the following example helps:
import pandas as pd
df = pd.DataFrame({'x': ['one', 'two', 'three']})
selection = df.x.str.contains('t', regex=True) # put your regexp here
df.loc[selection, 'x'] = df.loc[selection, 'x'].map(lambda x: x+x) # do some logic instead
You can apply this procedure to all columns of the df in a loop:
for col in df.columns:
    selection = df.loc[:, col].str.contains('t', regex=True) # put your regexp here
    df.loc[selection, col] = df.loc[selection, col].map(lambda x: x+x) # do some logic instead
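Putting the pieces together for the ASP.NET case, one option is to scan every column, extract the millisecond count with a regex, and convert only the columns where every value matches. This is a sketch, not a definitive implementation: the pattern name ASPNET_DATE, the sample frame, and the choice to skip columns with any non-matching value are all assumptions.

```python
import pandas as pd

# Matches strings such as "/Date(1239018869048)/" and captures the digits.
ASPNET_DATE = r"^/Date\((-?\d+)\)/$"

def pretty_dates(frame):
    """Return a copy with all-ASP.NET-date columns converted to datetimes."""
    out = frame.copy()
    for col in out.columns:
        # Non-matching cells (including numbers and other text) become NaN.
        millis = out[col].astype(str).str.extract(ASPNET_DATE, expand=False)
        if millis.notna().all():
            # The captured number is milliseconds since the Unix epoch.
            out[col] = pd.to_datetime(millis.astype("int64"), unit="ms")
    return out

# Hypothetical sample frame mixing a date column with ordinary columns.
frame = pd.DataFrame({
    "created": ["/Date(1239018869048)/", "/Date(1239018869048)/"],
    "name": ["a", "b"],
    "qty": [1, 2],
})
cleaned = pretty_dates(frame)
print(cleaned.dtypes)
```

Columns that mix dates with missing or other values are left untouched in this version; handling them would need a per-cell rule for what to do with the non-matching entries.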

how can I use reset_index with the multi grouped values(Hierarchical format) in Pandas Python

This is my data format. I want to reset the index and turn it into a single flat table, so that I can take the count of all the IDs (which are in the 2nd row) and plot them as a histogram by date and count.
Any simple idea?
If reset_index() is not working, you can also convert the table manually.
Assume df1 is your existing data frame, we'll create df2 (new one) that you want.
df2 = pd.DataFrame()
df2['DateTime'] = df1.index.get_level_values(0).tolist()
df2['ID1'] = df1.index.get_level_values(1).tolist()
df2['ID2'] = df1['ID2'].values.tolist()
df2['Count'] = df1['Count'].values.tolist()
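For comparison, here is a sketch of what reset_index() does when the index levels are named. The question's data is not shown, so the frame below is hypothetical; it reuses the DateTime, ID1 and Count names from the answer above and invents the values.

```python
import pandas as pd

# Hypothetical two-level index: (DateTime, ID1), with a Count column.
index = pd.MultiIndex.from_tuples(
    [("2019-01-01", 1), ("2019-01-01", 2), ("2019-01-02", 1)],
    names=["DateTime", "ID1"],
)
df1 = pd.DataFrame({"Count": [3, 5, 2]}, index=index)

# reset_index moves every index level into an ordinary column.
flat = df1.reset_index()

# Total count per date, ready for plotting (e.g. per_date.plot(kind="bar")).
per_date = flat.groupby("DateTime")["Count"].sum()

print(flat)
print(per_date)
```

If the hierarchy is in the columns rather than in the index, reset_index() alone will not flatten it, which may be why it appears not to work on some grouped outputs.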
