Plotting graph from csv file - python

I'm new to Python. I have to plot graphs from a CSV file that I've created:
a) Monthly Sales vs Product Price
b) Geographic Region vs No. of Customers
The code that I've implemented is:
import pandas as pd
import matplotlib.pyplot as plot
import csv
data = pd.read_csv('dataset_books.csv')
data.hist(bins=90)
plot.xlim([0,115054])
plot.title("Data")
x = plot.xlabel("Monthly Sales")
y = plot.ylabel("Product Price")
plot.show()
The output I'm getting is not what I expected, and it was not approved. I need a horizontal histogram with a line plot.
Book ID Product Name Product Price Monthly Sales Shipping Type Geographic Region No Of Customer Who Bought the Product Customer Type
1 The Path to Power 486 2566.08 Free Gatton 4 Old
2 Touching Darkness (Midnighters, #2) 479 1264.56 Paid Hooker Creek 2 New
3 Star Wars: Lost Stars 456 1203.84 Paid Gladstone 2 New
4 Winter in Madrid 454 599.28 Paid Warruwi 1 New
5 Hairy Maclary from Donaldson's Dairy 442 2333.76 Free Mount Gambier 4 Old
6 Stand on Zanzibar 413 3816.12 Free Cessnock 7 Old
7 Marlfox 411 3797.64 Free Edinburgh 7 Old
8 The Matlock Paper 373 3446.52 Free Gladstone 7 Old
9 Tears of a Tiger 361 1906.08 Free Melbourne 4 Old
10 Star Wars: Vision of the Future 355 937.2 Paid Wagga Wagga 2 New
11 Nefes Nefese 344 454.08 Paid Gatton 1 New
This is my CSV file. Can anyone help me?

Try this to check your column names:
df.columns
>> Index(['Book ID', 'Product Name', 'Product Price', 'Monthly Sales',
'Shipping Type', 'Geographic Region',
'No Of Customer Who Bought the Product', 'Customer Type'],
dtype='object')
Next, to draw a horizontal bar plot you need kind='barh':
df['Product Price'].plot(kind='barh')
Another option for choosing the column is iloc:
df.iloc[:, 2].plot(kind='barh')
It will generate the same output.
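To get closer to the two charts described in the question, here is a minimal sketch, assuming the dataset_books.csv file and the column names shown above; barh draws the horizontal bars, and a regular plot call layers a line on the same axes:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('dataset_books.csv')

# a) Monthly Sales vs Product Price: horizontal bars with a line overlaid
fig, ax = plt.subplots()
ax.barh(df['Product Name'], df['Monthly Sales'], label='Monthly Sales')
ax.plot(df['Product Price'], df['Product Name'], color='red', marker='o',
        label='Product Price')
ax.set_xlabel('Monthly Sales / Product Price')
ax.legend()
plt.tight_layout()
plt.show()

# b) Geographic Region vs number of customers
region_counts = df.groupby('Geographic Region')['No Of Customer Who Bought the Product'].sum()
region_counts.plot(kind='barh')
plt.xlabel('No Of Customer Who Bought the Product')
plt.tight_layout()
plt.show()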

Related

isin only returning first line from csv

I'm reading from a sqlite3 db into a df:
id symbol name
0 1 QCLR Global X Funds Global X NASDAQ 100 Collar 95-1...
1 2 LCW Learn CW Investment Corporation
2 3 BUG Global X Funds Global X Cybersecurity ETF
3 4 LDOS Leidos Holdings, Inc.
4 5 LDP COHEN & STEERS LIMITED DURATION PREFERRED AND ...
... ... ... ...
10999 11000 ERIC Ericsson American Depositary Shares
11000 11001 EDI Virtus Stone Harbor Emerging Markets Total Inc...
11001 11002 EVX VanEck Environmental Services ETF
11002 11003 QCLN First Trust NASDAQ Clean Edge Green Energy Ind...
11003 11004 DTB DTE Energy Company 2020 Series G 4.375% Junior...
[11004 rows x 3 columns]
Then I have a symbols.csv file which I want to use to filter the above df:
AKAM
AKRO
Here's how I've tried to do it:
origin_symbols = pd.read_sql_query("SELECT id, symbol, name from stock", conn)
mikey_symbols = pd.read_csv("symbols.csv")
df = origin_symbols[origin_symbols['symbol'].isin(mikey_symbols)]
But for some reason I only get the first line returned from the csv:
id symbol name
6475 6476 AKAM Akamai Technologies, Inc. Common Stock
Where am I going wrong here?
You need to convert the CSV file to a Series: add a column name when reading it, then select that column (e.g. by position):
mikey_symbols = pd.read_csv("symbols.csv", names=['tmp']).iloc[:, 0]
#or by column name
#mikey_symbols = pd.read_csv("symbols.csv", names=['tmp'])['tmp']
Then remove possible trailing spaces in both with Series.str.strip:
df = origin_symbols[origin_symbols['symbol'].str.strip().isin(mikey_symbols.str.strip())]
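For the record, the reason only one row came back: without names=, read_csv promotes the first symbol ("AKAM") to the header row, and iterating over a DataFrame yields its column names, so isin effectively tested membership against ['AKAM'] only. A quick demonstration, using an inline string in place of symbols.csv:
import pandas as pd
from io import StringIO

raw = "AKAM\nAKRO\n"

# without names=, the first symbol becomes the header...
bad = pd.read_csv(StringIO(raw))
print(list(bad))      # ['AKAM'] - iterating a DataFrame yields column names

# with names=, both symbols stay in the data
good = pd.read_csv(StringIO(raw), names=['tmp'])['tmp']
print(good.tolist())  # ['AKAM', 'AKRO']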

Data Matching using pandas cumulative columns

I am trying to solve this problem. I have two data tables, for example:
names age salary vehicle
jeff 20 100 missing
shinji 24 120 missing
rodger 18 150 missing
eric 25 160 missing
romeo 30 170 missing
and this other data table
names age salary vehicle industry
jeff 20 100 car video games
jeff 20 100 car cell phone
jeff 20 100 motorcycle soft drink
jeff 20 100 boat pharmaceuticals
shinji 24 120 car robots
shinji 24 120 car animation
rodger 18 150 car cars
rodger 18 150 motorcycle glasses
eric 25 160 boat video games
eric 25 160 car arms
romeo 30 70 boat vaccines
So for my first row, instead of "missing" in the vehicle column I want "CMB", because jeff has all three: car, motorcycle and boat. For shinji I only want "C" because he has a car. For rodger I want "CM" because he has a car and motorcycle. For eric I want "CB" because he has a car and boat. For romeo, "B" because he only has a boat.
So I want to go down the vehicle column of my second table and find all the vehicles each person has. But I am not sure of the logic for how to do this. I know I can match the rows by name, age and salary.
Try this:
tmp = (
    # Find the unique vehicles for each person
    df2[['names', 'vehicle']].drop_duplicates()
    # Get the first letter of each vehicle in capital form
    .assign(acronym=lambda x: x['vehicle'].str[0].str.upper())
    # For each person, join the acronyms of all vehicles
    .groupby('names')['acronym'].apply(''.join)
)
result = df1.merge(tmp, left_on='names', right_index=True)
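As a quick check, a self-contained run of the snippet above on the sample rows from the question (note that the acronyms follow each vehicle's order of first appearance per person, so eric comes out as "BC" rather than "CB"):
import pandas as pd

df1 = pd.DataFrame({'names': ['jeff', 'shinji', 'rodger', 'eric', 'romeo'],
                    'age': [20, 24, 18, 25, 30],
                    'salary': [100, 120, 150, 160, 170],
                    'vehicle': ['missing']*5})
df2 = pd.DataFrame({'names': ['jeff', 'jeff', 'jeff', 'jeff', 'shinji', 'shinji',
                              'rodger', 'rodger', 'eric', 'eric', 'romeo'],
                    'vehicle': ['car', 'car', 'motorcycle', 'boat', 'car', 'car',
                                'car', 'motorcycle', 'boat', 'car', 'boat']})

tmp = (
    df2[['names', 'vehicle']].drop_duplicates()
    .assign(acronym=lambda x: x['vehicle'].str[0].str.upper())
    .groupby('names')['acronym'].apply(''.join)
)
result = df1.merge(tmp, left_on='names', right_index=True)
print(result[['names', 'acronym']])
# jeff -> CMB, shinji -> C, rodger -> CM, eric -> BC, romeo -> B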

How to combine common rows in DataFrame

I'm running some analysis on bank statements (CSVs). Some items, like McDonalds, end up spread across several rows (due to having different addresses).
I'm trying to combine these rows by a common phrase. In this example the obvious phrase, or string, would be "McDonalds". I think it'll be an if statement.
Also, the column has a dtype of "object". Will I have to convert it to string format?
Here is an example output from printing totali = df.Item.value_counts() in my code.
Ideally I'd want that line to output McDonalds as just a single row.
In the csv they are 2 separate rows.
foo 14
Restaurant Boulder CO 8
McDonalds Boulder CO 5
McDonalds Denver CO 5
Here's what the column data consists of
'Sukiya Greenwood Vil CO' 'Sei 34179 Denver CO' 'Chambers Place Liquors 303-3731100 CO' "Mcdonald's F26593 Fort Collins CO" 'Suh Sushi Korean Bbq Fort Collins CO' 'Conoco - Sei 26927 Fort Collins CO'
OK. I think I ginned up something that can be helpful. Realize that the task of inferring categories or names from text strings can be huge, depending on how detailed you want to get. You can dive into regex or other learning models. People make careers of it! Obviously, your bank is doing some of this as they categorize things when you get a year-end summary.
Anyhow, here is a simple way to generate some categories and use them as a basis for the grouping that you want to do.
import pandas as pd

item = ['McDonalds Denver', 'Sonoco', 'ATM Fee', 'Sonoco, Ft. Collins',
        'McDonalds, Boulder', 'Arco Boulder']
txn = [12.44, 4.00, 3.00, 14.99, 19.10, 52.99]
df = pd.DataFrame([item, txn]).T
df.columns = ['item_orig', 'charge']
print(df)

# let's add an extra column to catch the conversions...
df['item'] = pd.Series(dtype=str)

# we'll use the "contains" function in pandas as a simple converter... quick demo
temp = df.loc[df['item_orig'].str.contains('McDonalds')]
print('\nitems that contain the string "McDonalds"')
print(temp)

# let's build a simple conversion table in a dictionary
conversions = {'McDonalds': 'McDonalds - any',
               'Sonoco': 'gas',
               'Arco': 'gas'}

# let's loop over the orig items and put conversions into the new column
# (there is probably a faster way to do this, but for data with < 100K rows, who cares.)
for key in conversions:
    df.loc[df['item_orig'].str.contains(key), 'item'] = conversions[key]

# see how we did...
print('converted...')
print(df)

# now move over anything that was NOT converted
# in this example, this is just the ATM Fee item...
df.loc[df['item'].isnull(), 'item'] = df['item_orig']

# now we have decent labels to support grouping!
print('\n\n *** sum of charges by group ***')
print(df.groupby('item')['charge'].sum())
Yields:
item_orig charge
0 McDonalds Denver 12.44
1 Sonoco 4
2 ATM Fee 3
3 Sonoco, Ft. Collins 14.99
4 McDonalds, Boulder 19.1
5 Arco Boulder 52.99
items that contain the string "McDonalds"
item_orig charge item
0 McDonalds Denver 12.44 NaN
4 McDonalds, Boulder 19.1 NaN
converted...
item_orig charge item
0 McDonalds Denver 12.44 McDonalds - any
1 Sonoco 4 gas
2 ATM Fee 3 NaN
3 Sonoco, Ft. Collins 14.99 gas
4 McDonalds, Boulder 19.1 McDonalds - any
5 Arco Boulder 52.99 gas
*** sum of charges by group ***
item
ATM Fee 3.00
McDonalds - any 31.54
gas 71.98
Name: charge, dtype: float64
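As the comment in the loop hints, there is a faster, vectorized route: build one alternation pattern from the dictionary keys, pull the matching key out with Series.str.extract, and map it to its category. A sketch, assuming the same conversions dict as above:
import re

# one pass instead of a loop: extract whichever key matches, map it to its
# category, and fall back to the original label when nothing matched
pattern = '(' + '|'.join(re.escape(k) for k in conversions) + ')'
df['item'] = (df['item_orig'].str.extract(pattern, expand=False)
                             .map(conversions)
                             .fillna(df['item_orig']))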

Plot multiple attributes from rows/columns in Pandas

See my data:
df = pd.DataFrame({'house_number': ['House 1']*6 + ['House 2']*6,
                   'room_type': ['Master Bedroom', 'Bedroom 1', 'Bedroom 2',
                                 'Kitchen', 'Bathroom 1', 'Bathroom 2']*2,
                   'square_feet': [250, 180, 150, 200, 25, 30,
                                   300, 170, 175, 210, 30, 20]})
house_number room_type square_feet
0 House 1 Master Bedroom 250
1 House 1 Bedroom 1 180
2 House 1 Bedroom 2 150
3 House 1 Kitchen 200
4 House 1 Bathroom 1 25
5 House 1 Bathroom 2 30
6 House 2 Master Bedroom 300
7 House 2 Bedroom 1 170
8 House 2 Bedroom 2 175
9 House 2 Kitchen 210
10 House 2 Bathroom 1 30
11 House 2 Bathroom 2 20
I'm very new to programming. I'm using Jupyter Notebook and pandas/matplotlib to plot some data. How would I make a bar chart from this table where the x-axis is room_type and the y-axis is square_feet, plotting only the data for House 1? I haven't been able to find anything online about selecting only the rows of one column that match a particular value in another column. Does that make sense?
Thanks for any help you can provide!
IIUC, you can do it by filtering the dataframe first then calling plot:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({'house_number': ['House 1']*6 + ['House 2']*6,
                   'room_type': ['Master Bedroom', 'Bedroom 1', 'Bedroom 2',
                                 'Kitchen', 'Bathroom 1', 'Bathroom 2']*2,
                   'square_feet': [250, 180, 150, 200, 25, 30,
                                   300, 170, 175, 210, 30, 20]})
ax = df.query('house_number == "House 1"').plot.bar(x='room_type', y='square_feet')
ax.set_title('House 1')
ax.set_ylabel('square ft')
Output: a bar chart titled "House 1" with room_type on the x-axis and square footage on the y-axis.
Or, you can filter the dataframe using boolean indexing:
df[df['house_number'] == 'House 1'].plot.bar(x='room_type', y='square_feet')
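Both lines build the same filtered frame; query is just a string-based shorthand for the boolean mask, so use whichever reads better to you.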

How to data clean in groups

I have an extremely long dataframe with a lot of data which I have to clean before I can proceed with data visualization. There are several things that need to be done, and I can do each of them to a certain extent, but I don't know how to, or whether it's even possible to, do them together.
This is what I have to do:
Find the highest arrival count every year and see if the mode of transport is by air, sea or land.
period arv_count Mode of arrival
0 2013-01 984350 Air
1 2013-01 129074 Sea
2 2013-01 178294 Land
3 2013-02 916372 Air
4 2013-02 125634 Sea
5 2013-02 179359 Land
6 2013-03 1026312 Air
7 2013-03 143194 Sea
8 2013-03 199385 Land
... ... ... ...
78 2015-03 940077 Air
79 2015-03 133632 Sea
80 2015-03 127939 Land
81 2015-04 939370 Air
82 2015-04 118120 Sea
83 2015-04 151134 Land
84 2015-05 945080 Air
85 2015-05 123136 Sea
86 2015-05 154620 Land
87 2015-06 930642 Air
88 2015-06 115631 Sea
89 2015-06 138474 Land
This is an example of what the data looks like. I don't know if it's necessary but I have created another column just for year like so:
def year_extract(year):
    return year.split('-')[0].strip()

df1 = pd.DataFrame(df['period'])
df1 = df1.rename(columns={'period': 'Year'})
df1 = df1['Year'].apply(year_extract)
df1 = pd.DataFrame(df1)
df = pd.merge(df, df1, left_index=True, right_index=True)
I know how to use groupby and I know how to find a maximum, but I don't know if it is possible to find the maximum within a group, like finding the highest arrival count in 2013, 2014, 2015, etc.
The data above is the total arrival count for all countries, based on the mode of transport and period, but the original data also had hundreds of additional rows in which region and country are stated, which I dropped because I don't know how to use or clean them. It looks like this:
period region country moa arv_count
2013-01 Total Total Air 984350
2013-01 Total Total Sea 129074
2013-01 Total Total Land 178294
2013-02 Total Total Air 916372
... ... ... ... ...
2015-12 AMERICAS USA Land 2698
2015-12 AMERICAS Canada Land 924
2013-01 ASIA China Air 136643
2013-01 ASIA India Air 55369
2013-01 ASIA Japan Air 51178
I would also like to make use of the region data if possible. I'm hoping to create a clustered column chart with the 7 regions on the x-axis and arrival count on the y-axis, each region showing the arrival counts via land, sea and air, but I feel there is too much excess data that I don't know how to deal with right now.
For example, I don't know how to deal with the period and the country, because all I need is the total arrival count by land, sea and air for each region and year, regardless of country and month.
I used this dataframe to test the code (the one in your question):
df = pd.DataFrame([['2013-01', 'Total', 'Total', 'Air', 984350],
                   ['2013-01', 'Total', 'Total', 'Sea', 129074],
                   ['2013-01', 'Total', 'Total', 'Land', 178294],
                   ['2013-02', 'Total', 'Total', 'Air', 916372],
                   ['2015-12', 'AMERICAS', 'USA', 'Land', 2698],
                   ['2015-12', 'AMERICAS', 'Canada', 'Land', 924],
                   ['2013-01', 'ASIA', 'China', 'Air', 136643],
                   ['2013-01', 'ASIA', 'India', 'Air', 55369],
                   ['2013-01', 'ASIA', 'Japan', 'Air', 51178]],
                  columns=['period', 'region', 'country', 'moa', 'arv_count'])
Here is the code to get the sum of arrival counts, by year, region and type (sea, land, air):
First add a 'year' column:
df['year'] = pd.to_datetime(df['period']).dt.year
Then group by (year, region, moa) and sum arv_count in each group:
df.groupby(['region', 'year', 'moa']).arv_count.sum()
Here is the output:
region    year  moa
AMERICAS  2015  Land       3622
ASIA      2013  Air      243190
Total     2013  Air     1900722
                Land     178294
                Sea      129074
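And for the first part of your question (the highest arrival count each year, together with its mode of transport), a sketch along the same lines, using idxmax to pick the row with the largest count per year:
# row with the highest arv_count in each year, keeping its mode of arrival
top = df.loc[df.groupby('year')['arv_count'].idxmax(), ['year', 'moa', 'arv_count']]
print(top)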
I hope this is what you were looking for!
