I have an Excel file whose first two rows are:
Weekly Report
December 1-7, 2014
And after that comes the relevant table.
When I use
filename = r'excel.xlsx'
df = pd.read_excel(filename)
print(df)
I get
        Weekly Report Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5
0  December 1-7, 2014        NaN        NaN        NaN        NaN        NaN
1                 NaN        NaN        NaN        NaN        NaN        NaN
2                Date        App   Campaign    Country       Cost   Installs
The column names come out as "Unnamed" because pandas takes them from the first, irrelevant row.
If pandas read only the table, my columns would be Installs, Cost, etc., which is what I want.
How can I tell pandas to start reading from row 3?
Use skiprows to your advantage -
df = pd.read_excel(filename, skiprows=[0,1])
This should do it. pandas ignores the first two rows in this case -
skiprows : list-like
Rows to skip at the beginning (0-indexed)
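To try the effect without an Excel file at hand, here is a minimal sketch using pd.read_csv, which accepts the same skiprows argument. The inline CSV text and its values are made up for illustration; only the two title rows mirror the question.

```python
import io

import pandas as pd

# Made-up stand-in for the spreadsheet: two title rows, then the real table.
raw = (
    'Weekly Report\n'
    '"December 1-7, 2014"\n'
    'Date,App,Installs\n'
    '2014-12-01,Foo,10\n'
    '2014-12-02,Bar,20\n'
)

# Skip the two title rows (0-indexed) so the third row becomes the header.
df = pd.read_csv(io.StringIO(raw), skiprows=[0, 1])
print(df.columns.tolist())
```

With the title rows skipped, the header row supplies the column names instead of "Unnamed: n".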
I have two dataframes, which I am calling df1 and df2.
df1 has the columns KPI and Context, and it looks like this:
KPI Context
0 Does the company have a policy in place to man... Anti-Bribery Policy\nBroadridge does not toler...
1 Does the company have a supplier code of conduct? Vendor Code of Conduct Our vendors play an imp...
2 Does the company have a grievance/complaint ha... If you ever have a question or wish to report ...
3 Does the company have a human rights policy ? Human Rights Statement of Commitment Broadridg...
4 Does the company have a policies consistent wi... Anti-Bribery Policy\nBroadridge does not toler...
df2 has a single column, 'Keyword':
df2:
Keyword
0 1.5 degree
1 1.5°
2 2 degree
3 2°
4 accident
I want to create another dataframe from these two: if a value from the 'Keyword' column of df2 is present in the 'Context' of df1, write the count of its occurrences.
For this I used pd.crosstab(), but I suspect it is not giving me the expected output.
Here's what I have tried so far:
new_df = df1.explode('Context')
new_df1 = df2.explode('Keyword')
new_df = pd.crosstab(new_df['KPI'], new_df1['Keyword'], values=new_df['Context'], aggfunc='count').reset_index().rename_axis(columns=None)
print(new_df.head())
the new_df looks like this.
KPI 1.5 degree 1.5° \
0 Does the Supplier code of conduct cover one or... NaN NaN
1 Does the companies have sites/operations locat... NaN NaN
2 Does the company have a due diligence process ... NaN NaN
3 Does the company have a grievance/complaint ha... NaN NaN
4 Does the company have a grievance/complaint ha... NaN NaN
2 degree 2° accident
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 1.0 NaN NaN
4 NaN NaN NaN
The expected output which I want is something like this.
0 KPI 1.5 degree 1.5° 2 degree 2° accident
1 Does the company have a policy in place to man 44 2 3 5 9
What exactly am I missing? Please let me know, thanks!
There are multiple problems. First, explode works on list-like values, not on plain strings. Second, to extract each Keyword from Context you need Series.str.findall. Third, crosstab should be given two columns of the same DataFrame, not columns from two different ones:
import re
pat = '|'.join(r"\b{}\b".format(re.escape(x)) for x in df2['Keyword'])
df1['new'] = df1['Context'].str.findall(pat, flags=re.I)
new_df = df1.explode('new')
out = pd.crosstab(new_df['KPI'], new_df['new'])
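One thing to watch with the pattern above: \b only matches between a word character and a non-word character, so a keyword ending in a symbol such as "1.5°" is not found when it is followed by a space. The sketch below is a variant, not the answerer's code: it swaps \b for the lookarounds (?<!\w) and (?!\w), and lower-cases Context so that differently cased matches land in one column. The two sample rows are invented for illustration.

```python
import re

import pandas as pd

# Invented sample data in the shape described in the question.
df1 = pd.DataFrame({
    "KPI": ["Climate target", "Safety policy"],
    "Context": [
        "We back the 1.5° goal and the 2 degree limit. 2 degree is key.",
        "Every accident is logged. No accident is ignored.",
    ],
})
df2 = pd.DataFrame({"Keyword": ["1.5 degree", "1.5°", "2 degree", "2°", "accident"]})

# Lookarounds instead of \b: require "no word character" on both sides,
# which also works for keywords that end in a symbol like °.
pat = "|".join(r"(?<!\w){}(?!\w)".format(re.escape(k)) for k in df2["Keyword"])

# Lower-case first so every match is reported in the keyword's own casing.
df1["new"] = df1["Context"].str.lower().str.findall(pat)

# One row per match, then count matches per KPI and keyword.
new_df = df1.explode("new")
out = pd.crosstab(new_df["KPI"], new_df["new"])
print(out)
```

Keywords that never occur do not get a column; if all keywords should appear, out.reindex(columns=df2["Keyword"], fill_value=0) adds the missing ones with a count of zero.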
I have the data frame below, listing items with their expiry dates:
Item Expiry Date Stock
Voucher 1 1-Mar-2022 3
Voucher 2 31-Apr-2022 2
Voucher 3 1-Feb-2022 1
And I want to create an aging dashboard and map out my number of stock there:
           Jan  Feb  Mar  Apr
Voucher 1            3
Voucher 2                 2
Voucher 3       1
Any ideas or guides on how to do something like the above? I searched many resources but could not find anything. I'm very new to building dashboards. Thanks.
You can extract the month name and pivot the table (NB: your dates are invalid, since 31 April does not exist). If needed, reindex with a list of month names:
from calendar import month_abbr

cols = month_abbr[1:]  # first item is an empty string

# raw string, so that \D is not treated as a string escape
(df.assign(month=df['Expiry Date'].str.extract(r'-(\D+)-'))
   .pivot(index='Item', columns='month', values='Stock')
   .reindex(columns=cols)
)
If you expect duplicated Items, use pivot_table with sum as the aggregation function instead.
Output:
month Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Item
Voucher 1 NaN NaN 3.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Voucher 2 NaN NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN
Voucher 3 NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
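For the duplicated-Items case mentioned above, a minimal sketch with pivot_table could look as follows. The sample rows, including the second "Voucher 1" line, are invented for illustration, and fill_value=0 is an optional choice to show zeros instead of NaN.

```python
from calendar import month_abbr

import pandas as pd

# Invented data: "Voucher 1" appears twice in March, so plain pivot would fail.
df = pd.DataFrame({
    "Item": ["Voucher 1", "Voucher 1", "Voucher 2"],
    "Expiry Date": ["1-Mar-2022", "15-Mar-2022", "30-Apr-2022"],
    "Stock": [3, 4, 2],
})

cols = month_abbr[1:]  # 'Jan' .. 'Dec'; the first item is an empty string

out = (
    df.assign(month=df["Expiry Date"].str.extract(r"-(\D+)-", expand=False))
      # sum the stock of rows that share the same Item and month
      .pivot_table(index="Item", columns="month", values="Stock",
                   aggfunc="sum", fill_value=0)
      # add the months that have no data and put them in calendar order
      .reindex(columns=cols, fill_value=0)
)
print(out)
```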
You could try something like this:
import pandas as pd
# Item Expiry Date Stock
# Voucher 1 1-Mar-2022 3
# Voucher 2 31-Apr-2022 2
# Voucher 3 1-Feb-2022 1
data = {'Item': ['Voucher 1', 'Voucher 2', 'Voucher 3'],
'Expiry Date': ['1-Mar-2022', '31-Apr-2022', '1-Feb-2022'],
'Stock': [3, 2, 1]}
df = pd.DataFrame(data)
# Using pandas apply method, get the month from each row using axis=1 and store it in new column 'Month'
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
df['Month'] = df.apply(lambda x: x['Expiry Date'].split('-')[1], axis=1)
# Using pandas pivot method, set 'Item' column as index,
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot.html
# set unique values in 'Month' column as separate columns
# set values in 'Stock' column as values for respective month columns
# and using 'rename_axis' method, remove the row name 'Month'
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename_axis.html
new_df = df.pivot(index='Item', columns='Month', values='Stock').rename_axis(None, axis=1)
# Sort the month column names by first converting them to pandas Timestamp objects,
# then using it as a key in a sorted function on all columns
new_df = new_df[sorted(new_df.columns, key=lambda x: pd.to_datetime(x, format='%b'))]
print(new_df)
And this is the output I am getting:
Feb Mar Apr
Item
Voucher 1 NaN 3.0 NaN
Voucher 2 NaN NaN 2.0
Voucher 3 1.0 NaN NaN
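Both answers read the month straight from the text, so the impossible date 31-Apr-2022 goes unnoticed. If you would rather detect such rows first, one option (a sketch, assuming the dates always follow the day-month-year pattern shown and that month abbreviations are in English) is to parse the column with errors='coerce', which turns unparseable dates into NaT:

```python
import pandas as pd

df = pd.DataFrame({
    "Item": ["Voucher 1", "Voucher 2", "Voucher 3"],
    "Expiry Date": ["1-Mar-2022", "31-Apr-2022", "1-Feb-2022"],
    "Stock": [3, 2, 1],
})

# Dates that do not exist (such as 31 April) become NaT instead of raising.
parsed = pd.to_datetime(df["Expiry Date"], format="%d-%b-%Y", errors="coerce")

# Rows whose expiry date could not be parsed.
invalid = df[parsed.isna()]
print(invalid)
```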
This is not my actual data, just a representation of a larger set.
I have a dataframe (df) that looks like this:
id text_field text_value
1 date 2021-07-01
1 hour 07:04
2 available yes
2 sold no
Due to project requirements I need to reshape this data. The main part is this:
df.set_index(['id','text_field'], append=True).unstack().droplevel(0,1).droplevel(0)
Leaving me with something like this:
text_field date hour available sold
id
1 2021-07-01 NaN NaN NaN
1 NaN 07:04 NaN NaN
2 NaN NaN yes NaN
2 NaN NaN NaN no
That is very close to what I need, but I'm failing to achieve the next step. I need to group this data by id, leaving only one row per id.
Something like this:
text_field date hour available sold
id
1 2021-07-01 07:04 NaN NaN
2 NaN NaN yes no
Can somebody help me?
As mentioned by @Nk03 in the comments, you could use pandas' pivot method:
import pandas as pd
# Creating example dataframe
data = {
'id': [1, 1, 2, 2],
'text_field': ['date', 'hour', 'available', 'sold'],
'text_value': ['2021-07-01', '07:04', 'yes', 'no']
}
df = pd.DataFrame(data)
# Pivoting on dataframe
df_pivot = df.pivot(index='id', columns='text_field')
print(df_pivot)
Console output:
text_value
text_field available date hour sold
id
1 NaN 2021-07-01 07:04 NaN
2 yes NaN NaN no
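The output above carries an extra text_value level in the column header. A small variant, sketched here on the same example data, passes values='text_value' to get flat columns and then selects them in the order the question asked for. Note that pivot raises a ValueError if the same (id, text_field) pair occurs more than once.

```python
import pandas as pd

data = {
    'id': [1, 1, 2, 2],
    'text_field': ['date', 'hour', 'available', 'sold'],
    'text_value': ['2021-07-01', '07:04', 'yes', 'no'],
}
df = pd.DataFrame(data)

# Naming the values column avoids the extra 'text_value' header level.
df_flat = df.pivot(index='id', columns='text_field', values='text_value')

# Put the columns in the order shown in the question.
df_flat = df_flat[['date', 'hour', 'available', 'sold']]
print(df_flat)
```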
In my code I've generated a range of dates using pd.date_range in order to compare it to a column of dates read in from Excel using pandas. The generated range of dates is referred to as "all_dates".
all_dates = pd.date_range(start='1998-12-31', end='2020-06-23')
for i, date in enumerate(period):  # where 'period' is the column of Excel dates
    if date == all_dates[i]:  # loop until date from Excel doesn't match date from generated dates
        continue
    else:
        missing_dates_stock.append(i)  # keep list of locations where dates are missing
        stock_data.insert(i, "NaN")  # insert 'NaN' where missing date is found
This results in TypeError: argument of type 'Timestamp' is not iterable. How can I make the data types match such that I can iterate and compare them? Apologies as I am not very fluent in Python.
I think you are trying to create a NaN row for every date that does not exist in the Excel file.
Here's a way to do it, using df.merge.
I am creating df1 to simulate the Excel file. It has two columns, sale_dt and sale_amt. If a sale_dt does not exist, we want to create a separate row for it with NaN in the other column. To simulate this, I create a date range from 1998-12-31 through 2020-06-23 with a step of 5 days, so the dataframe has 4 missing dates between every two consecutive rows. The solution should create 4 dummy rows in each gap, with the correct dates in ascending order.
import pandas as pd
import random
#create the sales dataframe with missing dates
df1 = pd.DataFrame({'sale_dt':pd.date_range(start='1998-12-31', end='2020-06-23', freq='5D'),
'sale_amt':random.sample(range(1, 2000), 1570)
})
print (df1)
#now create a dataframe with all the dates between '1998-12-31' and '2020-06-23'
df2 = pd.DataFrame({'date':pd.date_range(start='1998-12-31', end='2020-06-23', freq='D')})
print (df2)
#now merge both dataframes with outer join so you get all the rows.
#i am also sorting the data in ascending order so you can see the dates
#also dropping the original sale_dt column and renaming the date column as sale_dt
#then resetting index
df1 = (df1.merge(df2,left_on='sale_dt',right_on='date',how='outer')
.drop(columns=['sale_dt'])
.rename(columns={'date':'sale_dt'})
.sort_values(by='sale_dt')
.reset_index(drop=True))
print (df1.head(20))
The original dataframe was:
sale_dt sale_amt
0 1998-12-31 1988
1 1999-01-05 1746
2 1999-01-10 1395
3 1999-01-15 538
4 1999-01-20 1186
... ... ...
1565 2020-06-03 560
1566 2020-06-08 615
1567 2020-06-13 858
1568 2020-06-18 298
1569 2020-06-23 1427
The output of this will be (first 20 rows):
sale_amt sale_dt
0 1988.0 1998-12-31
1 NaN 1999-01-01
2 NaN 1999-01-02
3 NaN 1999-01-03
4 NaN 1999-01-04
5 1746.0 1999-01-05
6 NaN 1999-01-06
7 NaN 1999-01-07
8 NaN 1999-01-08
9 NaN 1999-01-09
10 1395.0 1999-01-10
11 NaN 1999-01-11
12 NaN 1999-01-12
13 NaN 1999-01-13
14 NaN 1999-01-14
15 538.0 1999-01-15
16 NaN 1999-01-16
17 NaN 1999-01-17
18 NaN 1999-01-18
19 NaN 1999-01-19
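As an alternative to the merge, the same result can be sketched with reindex on a date index, and the list of missing dates the question asked for can be taken with Index.difference. This is a shortened illustration with invented values and a much smaller date range, not the answerer's code.

```python
import pandas as pd

# Invented, shortened stand-in for the Excel data (some dates are missing).
sales = pd.DataFrame({
    "sale_dt": pd.to_datetime(["1998-12-31", "1999-01-02", "1999-01-05"]),
    "sale_amt": [10, 20, 30],
})

all_dates = pd.date_range(start="1998-12-31", end="1999-01-05", freq="D")

# Dates that are in the full range but not in the data.
missing = all_dates.difference(pd.DatetimeIndex(sales["sale_dt"]))

# Reindexing on the full range inserts a NaN row for every missing date.
filled = (
    sales.set_index("sale_dt")
         .reindex(all_dates)
         .rename_axis("sale_dt")
         .reset_index()
)
print(missing)
print(filled)
```

Because the comparison is done on whole indexes, no explicit loop over Timestamp values is needed.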
Here is a sample of a df I am working with. I am particularly interested in these two columns rusher and receiver.
rusher receiver
0 A.Ekeler NaN
1 NaN S.Barkley
2 C.Carson NaN
3 J.Jacobs NaN
4 NaN K.Drake
I want to run a groupby that considers all of these names in both columns (because the same name can show up in both columns).
My idea is to create a new column player, and then I can just groupby player, if that makes sense. Here is what I want my output to look like
rusher receiver player
0 A.Ekeler NaN A.Ekeler
1 NaN S.Barkley S.Barkley
2 C.Carson NaN C.Carson
3 J.Jacobs NaN J.Jacobs
4 NaN K.Drake K.Drake
I would like to take the name from whichever column it is listed under in that particular row and place it into the player column, so I can then run a groupby.
I have tried various string methods but I don't know how to work around the NaNs
Use fillna, which keeps rusher where it is present and falls back to receiver where rusher is NaN:
df['player'] = df['rusher'].fillna(df['receiver'])
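Put together with the groupby the question was aiming for, a minimal sketch might look like this. The yards column and its values are hypothetical, added only so there is something to aggregate.

```python
import pandas as pd

# 'yards' is a hypothetical column; A.Ekeler appears in both name columns.
df = pd.DataFrame({
    "rusher": ["A.Ekeler", None, None, "J.Jacobs"],
    "receiver": [None, "S.Barkley", "A.Ekeler", None],
    "yards": [12, 8, 5, 3],
})

# Take the rusher name, or the receiver name where rusher is missing.
df["player"] = df["rusher"].fillna(df["receiver"])

# Aggregate per player regardless of which column the name came from.
totals = df.groupby("player")["yards"].sum()
print(totals)
```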