Stackplot with strings - python

I have a dataframe that looks something like this
import pandas as pd
sectors = [['Industrials', 'Health Care', 'Information Technology', 'Industrials'], ['Health Care', 'Health Care', 'Information Technology'], ['Industrials', 'Information Technology', 'Health Care', 'Information Technology', 'Information Technology'], ['Information Technology', 'Health Care']]
some_date = ['2015-12-01', '2016-01-05', '2016-02-01', '2016-03-01']
somelist = []
for i in range(len(some_date)):
    somelist.append((some_date[i], sectors[i]))
df = pd.DataFrame(somelist, columns = ['date', 'sectors'])
I would like to create a plt.stackplot where the X-axis is the date and the Y-axis is the number of times each sector is mentioned.
The problem is that the values are strings, not integers. One approach would be to iterate through each row of the DataFrame and count how many times each sector is mentioned for each date, but I don't always know the sector names in advance, so I'm wondering whether there is a more efficient way to solve this.
I also tried to plot a plt.pie using df['sectors'].sum() to check how many times each sector is mentioned across the complete date range, but for this I would also need to convert the strings somehow.

I'm not sure how efficient this is, but I reshaped the data as follows:
import numpy as np
import matplotlib.pyplot as plt
# Collect the unique sector names, then count the mentions per row (date)
plot_sectors = list(set(df['sectors'].sum()))
plot_sectors = {key: [0]*df.shape[0] for key in plot_sectors}
for i in range(df.shape[0]):
    for sector in df.iloc[i]['sectors']:
        plot_sectors[sector][i] += 1
For the stacked plot, I used:
y = list(plot_sectors.values())
x = np.arange(df.shape[0])
plt.stackplot(x, y, labels=list(plot_sectors.keys()))
And for the pie plot, I used:
plt.pie([sum(values) for key, values in plot_sectors.items()], autopct='%1.1f%%',
        labels=list(plot_sectors.keys()))
plt.axis('equal')
plt.show()
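As an alternative to the manual counting loop above, the counts can probably be built with pandas alone. The following is a minimal sketch, assuming pandas 0.25 or later (for DataFrame.explode); the names exploded, counts and totals are introduced here for illustration and are not part of the original answer.

```python
import pandas as pd

sectors = [
    ['Industrials', 'Health Care', 'Information Technology', 'Industrials'],
    ['Health Care', 'Health Care', 'Information Technology'],
    ['Industrials', 'Information Technology', 'Health Care',
     'Information Technology', 'Information Technology'],
    ['Information Technology', 'Health Care'],
]
some_date = ['2015-12-01', '2016-01-05', '2016-02-01', '2016-03-01']
df = pd.DataFrame({'date': pd.to_datetime(some_date), 'sectors': sectors})

# One row per (date, sector) mention instead of one list per date.
exploded = df.explode('sectors')

# Rows are dates, columns are sector names, values are mention counts.
# Sector names do not need to be known in advance.
counts = pd.crosstab(exploded['date'], exploded['sectors'])

# Totals over the whole date range, e.g. for a pie chart.
totals = counts.sum()
print(counts)
```

With these in hand, plt.stackplot(counts.index, counts.T.values, labels=counts.columns) should draw the stacked plot with real dates on the X-axis, and plt.pie(totals, labels=totals.index, autopct='%1.1f%%') the pie.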

Related

Write a function that filters a dataset for rows that contains all of the words in a list of words

I want to get a sub-dataframe that contains all elements in a list.
Take the following DataFrame as an example.
my_dict = {
    'Job': ['Painting', 'Capentry', 'Teacher', 'Farming'],
    'Job_Detail': ['all sort of painting',
                   'kitchen utensils, all types of roofing etc.',
                   'skill and practical oriented teaching',
                   'all agricultural practices']
}
df = pd.DataFrame(my_dict)
The output looks like this:
        Job                                   Job_Detail
0  Painting                         all sort of painting
1  Capentry  kitchen utensils, all types of roofing etc.
2   Teacher        skill and practical oriented teaching
3   Farming                   all agricultural practices
my_lst = ['of', 'all']
I want to filter df with my_lst to get a sub-DataFrame that looks like this:
        Job                                   Job_Detail
0  Painting                         all sort of painting
1  Capentry  kitchen utensils, all types of roofing etc.
I've tried df[df.Job_Detail.isin(['of', 'all'])] but it returns an empty DataFrame.
I'm no pandas expert, but the best function to use here seems to be str.contains
From the docs:
Series.str.contains(pat, case=True, flags=0, na=None, regex=True)
Test if pattern or regex is contained within a string of a Series or Index.
Return boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.
Edit: the following mask matches rows that contain any of the words (or), not all of them (and):
import pandas as pd
my_dict = {
    'Job': ['Painting', 'Capentry', 'Teacher', 'Farming'],
    'Job_Detail': ['all sort of painting',
                   'kitchen utensils, all types of roofing etc.',
                   'skill and practical oriented teaching',
                   'all agricultural practices']
}
my_lst = ['of', 'all']
df = pd.DataFrame(my_dict)
print(df)
# Join the words into one regex alternation: 'of|all'
mask = df.Job_Detail.str.contains('|'.join(my_lst), regex=True)
print(df[mask])
Here's a solution that masks using and:
import pandas as pd
my_dict = {
    'Job': ['Painting', 'Capentry', 'Teacher', 'Farming'],
    'Job_Detail': ['all sort of painting',
                   'kitchen utensils, all types of roofing etc.',
                   'skill and practical oriented teaching',
                   'all agricultural practices']
}
my_lst = ['of', 'all']
df = pd.DataFrame(my_dict)
print(df)
print("------")
masks = [df.Job_Detail.str.contains(word) for word in my_lst]
mask = pd.concat(masks, axis=1).all(axis=1)
print(df[mask])
@Lone Your code answered a different question, but it helped me arrive at the answer. Thank you, much appreciated.
Here's the closest to what I needed:
df[(df.Job_Detail.str.contains('of')) & (df.Job_Detail.str.contains('all'))]
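The chained & expression above can be generalized to a word list of any length. The following is a minimal sketch; filter_all_words is a hypothetical helper name, and it assumes plain substring matching (regex=False), so 'of' would also match inside a word such as 'roofing'.

```python
import pandas as pd

def filter_all_words(frame, column, words):
    """Keep rows whose text in `column` contains every entry of `words`."""
    # Start with an all-True mask and narrow it down word by word.
    mask = pd.Series(True, index=frame.index)
    for word in words:
        # regex=False treats the word literally; na=False drops missing text.
        mask &= frame[column].str.contains(word, regex=False, na=False)
    return frame[mask]

my_dict = {
    'Job': ['Painting', 'Capentry', 'Teacher', 'Farming'],
    'Job_Detail': ['all sort of painting',
                   'kitchen utensils, all types of roofing etc.',
                   'skill and practical oriented teaching',
                   'all agricultural practices'],
}
df = pd.DataFrame(my_dict)
result = filter_all_words(df, 'Job_Detail', ['of', 'all'])
print(result)
```

If whole-word matching is needed, one option would be to switch to regex=True and wrap each escaped word in \b word boundaries.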

Python: Looping over list of countries for holidays

I am fairly new to Python. I am using the holidays package, which provides public holidays by country. To get a country's holidays, you can run something like:
sorted(holidays.US(years=np.arange(2014,2030,1)).items())
This gives the date and the holiday name. Now I want the data for a number of countries. How do I loop over a list of countries instead of replacing the country name in the code above every time?
The countries under consideration are:
[FRA, Norway, Finland, US, Germany, UnitedKingdom, Sweden]
I tried a for loop like this:
countrylistLoop = ['FRA', 'Norway', 'Finland', 'US', 'Germany', 'UnitedKingdom', 'Sweden']
for i in countrylistLoop:
    print(sorted(holidays.i(years=np.arange(2014,2030,1)).items()),columns=['Date','Holiday'])
This throws an AttributeError:
AttributeError: module 'holidays' has no attribute 'i'.
This makes sense but I am not sure how to proceed!
Ideally, I would like to loop over and store the results in a dataframe. Any help is highly appreciated! Thank you!
You can get the items the following way:
import holidays
import numpy as np
countrylistLoop = ['FRA', 'Norway', 'Finland', 'US', 'Germany', 'UnitedKingdom', 'Sweden']
for country in countrylistLoop:
    hd = sorted(holidays.CountryHoliday(country, years=np.arange(2014,2030,1)).items())
Note that sorted does not accept a columns argument.
Alternatively, you can sort the items by the element at index 1 of each item, which is the holiday name:
hd = sorted(list(holidays.CountryHoliday(country,
                                         years=np.arange(2014,2030,1)).items()),
            key=lambda holiday: holiday[1])
To add a country identifier, you can do this:
import pandas as pd
all_holidays = []
country_list = ['UnitedStates', 'India', 'Germany']
for country in country_list:
    for holiday in holidays.CountryHoliday(country, years=np.arange(2018,2021,1)).items():
        all_holidays.append({'date': holiday[0], 'holiday': holiday[1], 'country': country})
all_holidays = pd.DataFrame(all_holidays)
all_holidays
The result is a DataFrame with one row per holiday and the columns date, holiday and country.
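for i in countrylistLoop:
    # Look up the country class by name, e.g. holidays.US
    holiday = getattr(holidays, i)(years=np.arange(2014,2030,1)).items()
    holiday = sorted(holiday)
    print(holiday)
To get an attribute dynamically, use getattr.
I also split the sorted call out onto its own line. Note that sorted returns a new sorted list and leaves its argument unchanged, so its result has to be assigned; it is the in-place list.sort() method that returns None.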
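To show the getattr lookup in isolation, here is a small sketch that runs without the holidays package installed. The fake_holidays namespace, the _make_country helper and the three sample entries are stand-ins made up for this illustration; in real code, getattr would be applied to the holidays module itself.

```python
from types import SimpleNamespace

def _make_country(entries):
    """Build a stand-in for a country class: call it with years, get a dict back."""
    def factory(years):
        return {date: name for date, name in entries.items()
                if int(date[:4]) in years}
    return factory

# Hypothetical stand-in for the holidays module, with illustrative sample data.
fake_holidays = SimpleNamespace(
    US=_make_country({'2014-07-04': 'Independence Day',
                      '2014-01-01': "New Year's Day"}),
    Norway=_make_country({'2014-05-17': 'Constitution Day'}),
)

rows = []
for country in ['US', 'Norway']:
    # getattr(module, 'US') is the dynamic equivalent of module.US
    factory = getattr(fake_holidays, country)
    # sorted returns a new list, so iterate over (or assign) its result.
    for date, name in sorted(factory(years=range(2014, 2015)).items()):
        rows.append((country, date, name))

print(rows)
```

The rows list of tuples can then be passed to pd.DataFrame(rows, columns=['Country', 'Date', 'Holiday']) to get the DataFrame the question asks for.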

Add same value to multiple sets of rows. The value changes based on condition

I have a dataframe that is dynamically created.
I create my first set of rows as:
df['tourist_spots'] = pd.Series(<A list of tourist spots in a city>)
To this df I add:
df['city'] = <City Name>
So far so good. A bunch of rows are created with the same city name for multiple tourist spots.
I want to add a new city. So I do:
df['tourist_spots'].append(pd.Series(<new data>))
Now, when I append a new city with:
df['city'].append('new city')
the previously updated city data is gone. It is as if every time the rows are replaced and not appended.
Here's an example of what I want:
Step 1:
df['tourist_spot'] = pd.Series('Golden State Bridge' + a bunch of other spots)
For all the rows created by the above data I want:
df['city'] = 'San Francisco'
Step 2:
df['tourist_spot'].append(pd.Series('Times Square' + a bunch of other spots))
For all the rows created by the above data, I want:
df['city'] = 'New York'
How can I achieve this?
Use dictionaries to add rows to your DataFrame; it is a faster method.
Here is an example.
STEP 1
Create a list of dictionaries, one per row:
dict_df = [{'tourist_spots': 'Jones LLC', 'City': 'Boston'},
           {'tourist_spots': 'Alpha Co', 'City': 'Boston'},
           {'tourist_spots': 'Blue Inc', 'City': 'Singapore'}]
STEP 2
Convert the list of dictionaries to a DataFrame:
df = pd.DataFrame(dict_df)
STEP 3
Add new entries to the DataFrame in dictionary format. DataFrame.append was removed in pandas 2.0, so pd.concat is used here:
df = pd.concat([df, pd.DataFrame([{'tourist_spots': 'New_Blue', 'City': 'Singapore'}])], ignore_index=True)
References:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_dict.html
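Applied to the question's own scenario, one possible approach is to build one small DataFrame per city and concatenate them, so each block of rows keeps its own city value. This is a sketch only; spots_by_city and the spot names in it are illustrative assumptions.

```python
import pandas as pd

# Illustrative input: each city maps to its list of tourist spots.
spots_by_city = {
    'San Francisco': ['Golden Gate Bridge', 'Alcatraz Island'],
    'New York': ['Times Square', 'Central Park'],
}

# The scalar city name is broadcast to every row of that city's frame.
frames = [
    pd.DataFrame({'tourist_spot': spots, 'city': city})
    for city, spots in spots_by_city.items()
]

# Stack the per-city frames and renumber the index from 0.
df = pd.concat(frames, ignore_index=True)
print(df)
```

A city that arrives later can be added the same way: build its frame and call pd.concat([df, new_frame], ignore_index=True), assigning the result back to df.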

How to plot a list using matplotlib.pyplot?

Why am I getting a TypeError: unhashable type: numpy.ndarray error? Also, I don't recall importing numpy into my code, so what is numpy.ndarray doing there? The error occurs in the last line of the code.
import pandas as pd
import matplotlib.pyplot as plt
entries_csv = "C:\\Users\\Asus\\Desktop\\Entries.csv"
listofaccounts_csv = "C:\\Users\\Asus\\Desktop\\List of Accounts.csv"
data_entries = pd.read_csv(entries_csv)
data_listofaccounts = pd.read_csv(listofaccounts_csv)
i = 0
summary_name = [0]*len(data_listofaccounts)
summary = [0]*1*len(data_listofaccounts)
for account_name in data_listofaccounts['Account Name']:
    summary_name[i] = account_name
    for debit_account in data_entries['DEBIT ACCOUNT']:
        if account_name == debit_account:
            summary[i] += data_entries['DEBIT AMOUNT']
    i += 1
plt.bar(list(summary_name), list(summary))
These are the data:
1.) Entries:
2.) List of Accounts:
Basically, for each item in the list of accounts, I want a summary that sums all the debit amounts for that account.
I think in this case you really want to utilize the pd.merge functionality between your two dataframes. See here: https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.merge.html . Once you have joined the two tables you want to groupby according to the Account Name and perform your aggregations. So for example:
list_of_accounts_df = pd.DataFrame({
    'Account Name': ['ACCOUNT PAYABLE', 'OUTSIDE SERVICE'],
    'Type': ['CURRENT LIABILITY', 'EXPENSE']
})
entries_df = pd.DataFrame({
    'DEBIT ACCOUNT': ['OUTSIDE SERVICE', 'OUTSIDE SERVICE'],
    'DEBIT AMOUNT': [46375.8, 42091.42],
    'CREDIT ACCOUNT': ['CASH IN BANK', 'CASH ON HAND'],
    'CREDIT AMOUNT': [46375.8, 42091.42]
})
# Left join keeps accounts without entries; then sum the debits per account
x = pd.merge(list_of_accounts_df, entries_df, left_on='Account Name', right_on='DEBIT ACCOUNT', how='left').fillna(0).groupby('Account Name')['DEBIT AMOUNT'].sum()
The output, x, is a Series whose index is the Account Name and whose values are the sums of all the debit amounts for each account. So in this case:
Account Name
ACCOUNT PAYABLE        0.00
OUTSIDE SERVICE    88467.22
As for how to plot it: older versions of matplotlib did not accept string values directly for the x- or y-axis of a bar plot (versions from 2.1 onwards treat strings as categories), so the approach below positions the bars numerically and labels the ticks.
Following this example: https://pythonspot.com/matplotlib-bar-chart/, in our case you can just do:
# x is the Series of per-account sums from the merge and groupby
objects = x.index.values
y_pos = range(len(objects))
vals = x.values
plt.bar(y_pos, vals, align='center')
plt.xticks(y_pos, objects)
plt.ylabel('Sum of Debits')
plt.title('Total Debits Per Account')
plt.show()
For our simple example, this gives a bar chart with one bar per account.
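The same per-account totals can probably also be computed without a merge, by grouping the entries first and then reindexing on the account list. The following is a sketch under that assumption, reusing the sample data from the answer; totals is a name introduced here.

```python
import pandas as pd

list_of_accounts_df = pd.DataFrame({
    'Account Name': ['ACCOUNT PAYABLE', 'OUTSIDE SERVICE'],
    'Type': ['CURRENT LIABILITY', 'EXPENSE'],
})
entries_df = pd.DataFrame({
    'DEBIT ACCOUNT': ['OUTSIDE SERVICE', 'OUTSIDE SERVICE'],
    'DEBIT AMOUNT': [46375.8, 42091.42],
})

# Sum the debit amounts per debit account.
per_account = entries_df.groupby('DEBIT ACCOUNT')['DEBIT AMOUNT'].sum()

# Align to the full account list; accounts without entries get 0.
totals = per_account.reindex(list_of_accounts_df['Account Name'].tolist(),
                             fill_value=0.0)
print(totals)
```

Because totals is a Series indexed by account name, totals.index and totals.values can be used for plotting in the same way as x above.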

How can I use a variable value to call a dataframe?

Say that I'm given a dataframe that summarizes different companies:
summary=pandas.DataFrame(columns=['Company Name', 'Formation Date', 'Revenue', 'Profit', 'Loss'])
And then say each company in that dataframe has its own corresponding dataframe, named after the company, giving a more in-depth picture of the company's history and stats. Something like:
exampleco=pandas.DataFrame(columns=['Date', 'Daily Profit', 'Daily Loss', 'Daily Revenue'])
I have a script that processes each row of the summary dataframe, but I would like to grab the name from row['Company Name'] and use it to access the company's own dataframe.
In other words I'd love it if there was something that worked like this:
.
.
>>> company=row['Company Name']
>>> pandas.get_dataframe_from_variable(company)
Empty DataFrame
Columns: ['Date', 'Daily Profit', 'Daily Loss', 'Daily Revenue']
Index: []
[0 rows x 2 columns]
.
.
Any ideas of how I might get this to work would be much appreciated.
Thanks in advance!
You can use a dictionary to contain your DataFrames and use strings as the keys.
companies = {'company1': pandas.DataFrame(columns=['Date', 'Daily Profit',
                                                   'Daily Loss', 'Daily Revenue']),
             'company2': pandas.DataFrame(columns=['Date', 'Daily Profit',
                                                   'Daily Loss', 'Daily Revenue'])}
company = row['Company Name']  # Get your company name as a string from your summary.
company_details = companies[company]  # Returns a DataFrame.
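Put together with the loop over the summary rows, the dictionary lookup might look like the following sketch. The company names, the figures and the total_profit dictionary are made-up illustrations, not data from the question.

```python
import pandas as pd

summary = pd.DataFrame({
    'Company Name': ['exampleco', 'otherco', 'missingco'],
    'Revenue': [100, 250, 75],
})

# One detail DataFrame per company, keyed by the company name string.
companies = {
    'exampleco': pd.DataFrame({'Date': ['2020-01-01', '2020-01-02'],
                               'Daily Profit': [5, 7]}),
    'otherco': pd.DataFrame({'Date': ['2020-01-01'],
                             'Daily Profit': [11]}),
}

total_profit = {}
for _, row in summary.iterrows():
    name = row['Company Name']
    # .get returns None instead of raising KeyError for unknown companies.
    details = companies.get(name)
    if details is None:
        continue
    total_profit[name] = int(details['Daily Profit'].sum())

print(total_profit)
```

Using companies.get rather than companies[name] is a design choice for this sketch: it lets the loop skip summary rows that have no detail DataFrame instead of stopping with a KeyError.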
