I have a DataFrame with the following structure:
# Import pandas library
import pandas as pd
# initialize list of lists
data = [['R.04T', 1, 2013, 23456, 22, 1],
        ['R.04T', 15, 2014, 23456, 22, 1],
        ['F.04T', 9, 2010, 75920, 0, 3],
        ['F.04T', 4, 2012, 75920, 0, 3],
        ['R.04T', 7, 2013, 20054, 13, 1],
        ['R.04T', 12, 2014, 20058, 13, 1]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['product_code', 'sold', 'year', 'city_number',
'district_number', 'number_of_the_department'])
print(df)
I want to know whether the locations ('city_number' + 'district_number' + 'number_of_the_department') have increased or decreased their sales per year, per article. I'd thought about joining the columns into one location column, like the following:
# join the locations
df['location'] = (df['city_number'].astype(str) + ',' +
                  df['district_number'].astype(str) + ',' +
                  df['number_of_the_department'].astype(str))
But I'm not sure how to group the DataFrame to answer my question. I want to know whether the sales have increased or decreased (per year and item) and by what percentage per year (e.g. from 2013 to 2014, decreased by x%).
Maybe someone can help? :)
Try this:
df = df.assign(
    pct_change_sold=df.sort_values(by="year")
    .groupby(by=["city_number", "district_number", "number_of_the_department"])["sold"]
    .pct_change()
    .fillna(0)
)
product_code sold year city_number district_number number_of_the_department pct_change_sold
0 R.04T 1 2013 23456 22 1 0.000000
1 R.04T 15 2014 23456 22 1 14.000000
2 F.04T 9 2010 75920 0 3 0.000000
3 F.04T 4 2012 75920 0 3 -0.555556
4 R.04T 7 2013 20054 13 1 0.000000
5 R.04T 12 2014 20058 13 1 0.000000
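For reference, a minimal self-contained run of the same idea on the question's data (the grouping keys are assumed to be the three location columns; product_code could be added to the keys if sales should also be tracked per article):

```python
import pandas as pd

df = pd.DataFrame({
    'product_code': ['R.04T', 'R.04T', 'F.04T', 'F.04T'],
    'sold': [1, 15, 9, 4],
    'year': [2013, 2014, 2010, 2012],
    'city_number': [23456, 23456, 75920, 75920],
    'district_number': [22, 22, 0, 0],
    'number_of_the_department': [1, 1, 3, 3],
})

# year-over-year percentage change in units sold, per location
df['pct_change_sold'] = (
    df.sort_values('year')
      .groupby(['city_number', 'district_number', 'number_of_the_department'])['sold']
      .pct_change()
      .fillna(0)
)
print(df)
```

The assignment aligns on the index, so the original row order is preserved even though the computation sorts by year internally.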
I have a data set like this:
dfdict = {
'year' : [2021, 2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022, 2022],
'value' : [1,2,3,4,5,6,7,8,9,10]
}
df = pd.DataFrame(dfdict)
I also have a dictionary whose keys are years and whose values are the limit values for each year that I want to apply as a condition:
limitdict = {
'2021' : [2, 4],
'2022' : [7, 8]
}
How can I show the rows of df whose values for each year are either smaller than the lower limit or larger than the upper limit in limitdict? The result should look like:
year value
0 2021 1
4 2021 5
5 2022 6
8 2022 9
9 2022 10
Another possible solution:
# astype is needed because your dictionary keys are strings
year = df['year'].astype('str')
df[(
    df['value'].lt([limitdict[x][0] for x in year]) |
    df['value'].gt([limitdict[x][1] for x in year])
)]
Or:
year = df['year'].astype('str')
z1, z2 = zip(*[limitdict[x] for x in year])
df[(df['value'].lt(z1) | df['value'].gt(z2))]
Output:
year value
0 2021 1
4 2021 5
5 2022 6
8 2022 9
9 2022 10
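If you prefer to avoid the per-row list comprehensions, the same filter can be written with Series.map, which produces the per-row bounds as Series — a sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2021, 2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022, 2022],
    'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})
limitdict = {'2021': [2, 4], '2022': [7, 8]}

year = df['year'].astype(str)
lower = year.map(lambda y: limitdict[y][0])   # per-row lower bound
upper = year.map(lambda y: limitdict[y][1])   # per-row upper bound
out = df[df['value'].lt(lower) | df['value'].gt(upper)]
print(out['value'].tolist())  # [1, 5, 6, 9, 10]
```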
I suggest splitting the dataframe by year and then using between to filter out values in the range specified in limitdict. Note the ~ operator, which inverts the mask so that values inside the range are excluded: df_year[~df_year.value.between(limitdict[str(year)][0], limitdict[str(year)][1])].
list_of_dataframes = []
for year in df.year.unique():
df_year = df[df.year == year]
list_of_dataframes.append(df_year[~df_year.value.between(limitdict[str(year)][0],limitdict[str(year)][1])])
output_df = pd.concat(list_of_dataframes)
This returns:
year value
0 2021 1
4 2021 5
5 2022 6
8 2022 9
9 2022 10
I have a DataFrame with 4 fields: Location, Year, Week and Sales. I would like to know the difference in Sales between two years, preserving the granularity of the dataset. I mean, I would like to know, for each Location, Year and Week, the difference to the same week of another year.
The following will generate a Dataframe with a similar structure:
import numpy as np
raw_data = {'Location': ['A']*30 + ['B']*30 + ['C']*30,
            'Year': 3*([2018]*10 + [2019]*10 + [2020]*10),
            'Week': 3*(3*list(range(1, 11))),
            'Sales': np.random.randint(100, size=90)
            }
df = pd.DataFrame(raw_data)
Location Year Week Sales
A 2018 1 67
A 2018 2 93
A 2018 … 67
A 2019 1 49
A 2019 2 38
A 2019 … 40
B 2018 1 18
… … … …
Could you please show me what would be the best approach?
Thank you very much
You can do it using groupby and shift (note that shift(-1) assumes the rows are sorted by Year within each Location/Week group):
df["Next_Years_Sales"] = df.groupby(["Location", "Week"])["Sales"].shift(-1)
df["YoY_Sales_Difference"] = df["Next_Years_Sales"] - df["Sales"]
Spot checking it:
df[(df["Location"] == "A") & (df["Week"] == 1)]
Out[37]:
Location Year Week Sales Next_Years_Sales YoY_Sales_Difference
0 A 2018 1 99 10.0 -89.0
10 A 2019 1 10 3.0 -7.0
20 A 2020 1 3 NaN NaN
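If you would rather have the difference to the previous year on each row (instead of the next), groupby plus diff does the same job — a sketch on smaller toy data, assuming rows are sorted by Year within each group:

```python
import pandas as pd

df = pd.DataFrame({
    "Location": ["A"] * 6,
    "Year": [2018, 2019, 2020] * 2,
    "Week": [1, 1, 1, 2, 2, 2],
    "Sales": [99, 10, 3, 50, 60, 45],
})

# difference to the same Week of the previous Year, per Location
df = df.sort_values(["Location", "Week", "Year"])
df["YoY_Sales_Difference"] = df.groupby(["Location", "Week"])["Sales"].diff()
print(df)
```

The first year of each Location/Week group gets NaN, since there is no earlier year to compare against.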
I have already looked for this type of question but none of them really answers my question.
Suppose I have two dataframes and the indices of these are NOT consistent. df2 is a subset of df1 and I want to remove all the rows in df1 that are present in df2.
I already tried the following but it's not giving me the result I'm looking for:
df1[~df1.index.isin(df2.index)]
Unfortunately, I can't share the original data with you however, the number of columns in the two dataframes are 14.
Here's an example of what I'm looking for:
df1 =
month year sale
0 1 2012 55
1 4 2014 40
2 7 2013 84
3 10 2014 31
df2 =
month year sale
0 1 2012 55
1 10 2014 31
and I'm looking for:
df =
month year sale
0 4 2014 40
1 7 2013 84
You could create a multi-index with all the columns in each dataframe. From that point you just have to drop the indices of the second from the first one:
df1.set_index(list(df1.columns)).drop(df2.set_index(list(df2.columns)).index).reset_index()
Result with your example data:
month year sale
0 4 2014 40
1 7 2013 84
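For reference, a self-contained run of this approach on the example frames:

```python
import pandas as pd

df1 = pd.DataFrame({'month': [1, 4, 7, 10],
                    'year': [2012, 2014, 2013, 2014],
                    'sale': [55, 40, 84, 31]})
df2 = pd.DataFrame({'month': [1, 10], 'year': [2012, 2014], 'sale': [55, 31]})

# index both frames by all their columns, drop df2's rows, restore the columns
out = (df1.set_index(list(df1.columns))
          .drop(df2.set_index(list(df2.columns)).index)
          .reset_index())
print(out['sale'].tolist())  # [40, 84]
```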
Use a left join with DataFrame.merge and its indicator parameter, then compare the new column with Series.eq (==) and filter by boolean indexing:
df = df1[df1.merge(df2, indicator=True, how='left')['_merge'].eq('left_only')]
print (df)
month year sale
1 4 2014 40
2 7 2013 84
So what you want is to remove by values, not by index.
Use concatenate and drop:
comp = pd.concat([df1, df2]).drop_duplicates(keep=False)
Example:
df1 = pd.DataFrame({'month': [1, 4, 7, 10], 'year': [2012, 2014, 2013, 2014], 'sale': [55, 40, 84, 31]})
df2 = pd.DataFrame({'month': [1, 10], 'year': [2012, 2014], 'sale': [55, 31]})
pd.concat([df1, df2]).drop_duplicates(keep=False)
Result:
month sale year
1 4 40 2014
2 7 84 2013
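One caveat worth checking: drop_duplicates(keep=False) also removes rows that are duplicated within df1 itself, so this trick assumes df1 has no internal duplicates. A small sketch showing the failure mode:

```python
import pandas as pd

df1 = pd.DataFrame({'month': [1, 4, 4], 'year': [2012, 2014, 2014], 'sale': [55, 40, 40]})
df2 = pd.DataFrame({'month': [1], 'year': [2012], 'sale': [55]})

# (4, 2014, 40) appears twice in df1 and not at all in df2,
# yet keep=False drops both copies as well
out = pd.concat([df1, df2]).drop_duplicates(keep=False)
print(len(out))  # 0
```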
Can you try the below? Note that a plain df1.isin(df2) aligns element-wise on index and columns, so compare whole rows as tuples instead:
df1[~df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))]
I have created a dataframe after importing weather data, now called "weather".
The end goal is to be able to view data for specific month and year.
It started like this:
Then I ran weather = weather.T to transpose the DataFrame, making it look like:
Then I ran weather.columns = weather.iloc[0] to make it look like:
But the "year" column and the "month" column are now located in the index (I think?). How would I get it so it looks like:
Thanks for looking! Will appreciate any help :)
Please note that I will remove the first row with the years in it, so don't worry about that part.
This just means the pd.Index object underlying your pd.DataFrame object, unbeknownst to you, has a name:
df = pd.DataFrame({'YEAR': [2016, 2017, 2018],
'JAN': [1, 2, 3],
'FEB': [4, 5, 6],
'MAR': [7, 8, 9]})
df.columns.name = 'month'
df = df.T
df.columns = df.iloc[0]
print(df)
YEAR 2016 2017 2018
month
YEAR 2016 2017 2018
JAN 1 2 3
FEB 4 5 6
MAR 7 8 9
If this really bothers you, you can use reset_index to elevate your index to a series and then drop the extra heading row. You can, at the same time, remove the column name:
df = df.reset_index().drop(0)
df.columns.name = ''
print(df)
month 2016 2017 2018
1 JAN 1 2 3
2 FEB 4 5 6
3 MAR 7 8 9
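If the only goal is to clear that leftover axis name, rename_axis does it without reshaping — a sketch on similar toy data, dropping the redundant YEAR row directly:

```python
import pandas as pd

df = pd.DataFrame({'YEAR': [2016, 2017, 2018],
                   'JAN': [1, 2, 3],
                   'FEB': [4, 5, 6]}).T
df.columns = df.iloc[0]            # promote the YEAR row to column labels
df = df.drop('YEAR')               # drop the now-redundant first row
df = df.rename_axis(None, axis=1)  # clear the leftover columns name
print(df)
```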
Is there a better solution than the following code? On a big data set with lots of columns it takes too much time.
import pandas as pd
df = pd.DataFrame({'Jan': [10, 20], 'Feb': [3, 5], 'Mar': [30, 4],
                   'Month': [3, 2], 'Year': [2016, 2016]})
# Jan Feb Mar Month Year
# 0 10 3 30 3 2016
# 1 20 5 4 2 2016
import numpy as np

df['Antal_1'] = np.nan
df['Antal_2'] = np.nan
for i in range(len(df)):
    if df['Year'][i] == 2016:
        df['Antal_1'][i] = df.iloc[i, df['Month'][i]-1]
        df['Antal_2'][i] = df.iloc[i, df['Month'][i]-2]
    else:
        df['Antal_1'][i] = df.iloc[i, -1]
        df['Antal_2'][i] = df.iloc[i, -2]
df
# Jan Feb Mar Month Year Antal_1 Antal_2
# 0 10 3 30 3 2016 30 3
# 1 20 5 4 2 2016 5 20
You should see a marginal speed-up by using df.apply instead of iterating rows:
import pandas as pd
df = pd.DataFrame({'Jan': [10, 20], 'Feb': [3, 5], 'Mar': [30, 4],
'Month': [3, 2],'Year': [2016, 2016]})
df = df[['Jan', 'Feb', 'Mar', 'Month', 'Year']]
def calculator(row):
    m1 = row['Month']                # 1-based month number
    m2 = row.index.get_loc('Month')  # position of the 'Month' column
    return (row.iloc[int(m1-1)], row.iloc[int(m1-2)]) if row['Year'] == 2016 \
        else (row.iloc[m2-1], row.iloc[m2-2])
df['Antal_1'], df['Antal_2'] = list(zip(*df.apply(calculator, axis=1)))
# Jan Feb Mar Month Year Antal_1 Antal_2
# 0 10 3 30 3 2016 30 3
# 1 20 5 4 2 2016 5 20
It's not clear to me what you want to do in the case of the year not being 2016, so I've made the value 100. Show an example and I can finish it. If it's just NaNs, then you can remove the first two lines from below.
df['Antal_1'] = 100
df['Antal_2'] = 100
df.loc[df['Year']==2016, 'Antal_1'] = df[df.columns[df.columns.get_loc("Month")-1]]
df.loc[df['Year']==2016, 'Antal_2'] = df[df.columns[df.columns.get_loc("Month")-2]]
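For the 2016 rows, a fully vectorized variant uses NumPy fancy indexing to pick, per row, the column at position Month - 1 (this assumes, as in the question, that the month columns come first and Month is 1-based):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Jan': [10, 20], 'Feb': [3, 5], 'Mar': [30, 4],
                   'Month': [3, 2], 'Year': [2016, 2016]})

values = df[['Jan', 'Feb', 'Mar']].to_numpy()
rows = np.arange(len(df))
df['Antal_1'] = values[rows, df['Month'] - 1]  # the month itself
df['Antal_2'] = values[rows, df['Month'] - 2]  # the month before
print(df[['Antal_1', 'Antal_2']].to_numpy().tolist())  # [[30, 3], [5, 20]]
```

This avoids both the Python-level loop and apply, so it should scale much better on a large frame.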