I'm quite new to pandas. I have loaded a file into a dataframe and I am trying to group by one attribute with count(), as in the code below, and then show the frequency of the grouped items in a bar plot (frequency on the y-axis, item on the x-axis):
red_whine=pd.read_csv('winequality-red.csv',header=1,sep=';',names=['fixed_acidity','volatile_acidity',...])
frequency=red_whine.groupby('quality')['quality'].count()
pdf=pd.DataFrame(frequency)
print(pdf[pdf.columns[0]])
But when I do this, the code prints the result below as if it were a single column:
quality
3 10
4 53
5 680
6 638
7 199
8 18
How can I keep the two columns separated?
import pandas as pd
# pandas can read a URL directly, so urllib2 (Python 2 only) is not needed;
# in Python 3 the equivalent call would be urllib.request.urlopen.
target_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine = pd.read_csv(target_url, sep=';')
vc = wine.quality.value_counts()
>>> vc
5 681
6 638
7 199
4 53
8 18
3 10
Name: quality, dtype: int64
>>> vc.index
Int64Index([5, 6, 7, 4, 8, 3], dtype='int64')
>>> vc.values
array([681, 638, 199, 53, 18, 10])
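If the goal is two separate columns rather than a Series with an index, one option is to turn the group labels into a regular column. The following is a minimal sketch; the tiny `wine` frame is a made-up stand-in for the real data, and `count` is just an illustrative column name.

```python
import pandas as pd

# Hypothetical stand-in for the wine data: only the 'quality' column matters here.
wine = pd.DataFrame({"quality": [5, 6, 5, 7, 6, 5, 3]})

# size() counts the rows in each group; reset_index() moves the group labels
# out of the index into an ordinary column, giving two separate columns.
counts = wine.groupby("quality").size().reset_index(name="count")
print(counts)
```

The resulting frame has a `quality` column and a `count` column, which can be passed to plotting functions by name.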
For plotting, please refer to this:
Plotting categorical data with pandas and matplotlib
(This is a slightly different version of a question I asked earlier)
I have a dataframe that looks like this:
INC_KEY PTOCCUPATIONALINDUSTRY
96 170000016620 Other Services
127 170000016651 Other Services
170 170000016694 Manufacturing
181 170000016706 Construction
268 170000016793 Other Services
I also have a CSV file, which I plan to turn into a dataframe, that looks like this (assume it is a dataframe):
My task is to convert the values in PTOCCUPATIONALINDUSTRY to the numbers that you see in the dictionary. So the output should look like this:
INC_KEY PTOCCUPATIONALINDUSTRY
96 170000016620 14
127 170000016651 14
170 170000016694 2
181 170000016706 8
268 170000016793 14
What is the easiest way to do this without manually doing if statements for each value? (I'm a newbie btw).
Once you have read your CSV file into a dataframe, you can build a dictionary whose keys are the Label column and whose values are the Start column:
j = dict(zip(df_csv['Label'],df_csv['Start']))
>>> j
{'Finance, Insurance and Real Estate': 1,
'Natural Resources and Mining': 10,
'Other Services': 14,
'Construction': 2,
'Government': 3}
Then you can overwrite your PTOCCUPATIONALINDUSTRY column with the values from the dictionary using map:
df['PTOCCUPATIONALINDUSTRY']=df['PTOCCUPATIONALINDUSTRY'].map(j)
>>> df
INC_KEY PTOCCUPATIONALINDUSTRY
0 1.700000e+11 14.0
1 1.700000e+11 14.0
2 1.700000e+11 NaN
3 1.700000e+11 2.0
4 1.700000e+11 14.0
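Note that map returns NaN for any label that has no key in the dictionary, which is what happens in row 2 above ("Manufacturing" is not among the dictionary keys shown). A small sketch for listing such unmapped labels follows; the two-row frame and the deliberately incomplete dictionary are illustrative assumptions, not the real data.

```python
import pandas as pd

# Illustrative data: two rows, and a dictionary that is deliberately incomplete.
df = pd.DataFrame({"INC_KEY": [170000016620, 170000016694],
                   "PTOCCUPATIONALINDUSTRY": ["Other Services", "Manufacturing"]})
j = {"Other Services": 14}

# map() looks each label up in the dictionary; missing keys become NaN.
mapped = df["PTOCCUPATIONALINDUSTRY"].map(j)

# Labels whose lookup failed, so they can be added to the dictionary.
unmapped = df.loc[mapped.isna(), "PTOCCUPATIONALINDUSTRY"].unique()
print(unmapped)
```

Checking for unmapped labels before overwriting the column avoids silently losing values.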
First, convert the Start and Label columns of your dataframe to a dictionary.
import pandas as pd
df = pd.DataFrame({'FmtName': ['PtOccupation', 'PtOccupation', 'PtOccupation'],
'Start': [1, 10, 11],
'Label': ['Finance...', 'Natural...', 'Info...']})
label_start_dict = pd.Series(df['Start'].values, index=df['Label']).to_dict()
print(label_start_dict)
Output:
{'Finance...': 1, 'Natural...': 10, 'Info...': 11}
After that, you can use map to replace values in the other dataframe, as in the answer to your previous question.
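Putting the two steps together might look like the sketch below. The frame names `lookup` and `records` are hypothetical, and the Start values are simply chosen to match the output requested in the question. Here a Series indexed by Label is passed to map directly, which works the same way as passing a dictionary.

```python
import pandas as pd

# Hypothetical lookup table with the Start/Label columns described above.
lookup = pd.DataFrame({"Start": [2, 8, 14],
                       "Label": ["Manufacturing", "Construction", "Other Services"]})

# Hypothetical data frame holding the text labels to convert.
records = pd.DataFrame({
    "INC_KEY": [170000016620, 170000016694, 170000016706],
    "PTOCCUPATIONALINDUSTRY": ["Other Services", "Manufacturing", "Construction"],
})

# A Series indexed by Label acts as the mapping from label to number.
label_to_start = lookup.set_index("Label")["Start"]
records["PTOCCUPATIONALINDUSTRY"] = records["PTOCCUPATIONALINDUSTRY"].map(label_to_start)
print(records)
```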
Consider my series as below: First column is article_id and the second column is frequency count.
article_id
1 39
2 49
3 187
4 159
5 158
...
16947 14
16948 7
16976 2
16977 1
16978 1
16980 1
Name: article_id, dtype: int64
I got this series from a dataframe with the following command:
logs.loc[logs['article_id'] <= 17029].groupby('article_id')['article_id'].count()
logs is the dataframe here and article_id is one of the columns in it.
How do I plot a bar chart (using Matplotlib) such that article_id is on the x-axis and the frequency count is on the y-axis?
My natural instinct was to convert it into a list using .tolist() but that doesn't preserve the article_id.
IIUC you need Series.plot.bar:
#pandas 0.17.0 and above
s.plot.bar()
#pandas below 0.17.0
s.plot('bar')
Sample:
import pandas as pd
import matplotlib.pyplot as plt
s = pd.Series({16976: 2, 1: 39, 2: 49, 3: 187, 4: 159,
5: 158, 16947: 14, 16977: 1, 16948: 7, 16978: 1, 16980: 1},
name='article_id')
print (s)
1 39
2 49
3 187
4 159
5 158
16947 14
16948 7
16976 2
16977 1
16978 1
16980 1
Name: article_id, dtype: int64
s.plot.bar()
plt.show()
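If you would rather call Matplotlib directly, the article_id values can be kept by passing the index and the values separately instead of using .tolist(). This is only a sketch with a shortened version of the sample series; the Agg backend line is an assumption so that it runs without a display, and can be dropped for interactive use.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; remove for interactive use
import matplotlib.pyplot as plt
import pandas as pd

s = pd.Series({1: 39, 2: 49, 3: 187, 16947: 14, 16980: 1}, name="article_id")

# String labels give one evenly spaced bar per article_id instead of placing
# each bar at its numeric position on the x-axis.
labels = [str(i) for i in s.index]

fig, ax = plt.subplots()
ax.bar(labels, s.values)
ax.set_xlabel("article_id")
ax.set_ylabel("frequency")
# Call plt.show() or fig.savefig(...) here to display or save the chart.
```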
The new pandas API suggests the following way:
import pandas as pd
s = pd.Series({16976: 2, 1: 39, 2: 49, 3: 187, 4: 159,
5: 158, 16947: 14, 16977: 1, 16948: 7, 16978: 1, 16980: 1},
name='article_id')
s.plot(kind="bar", figsize=(20,10))
If you are working in Jupyter, you don't need to import matplotlib yourself (it still has to be installed, because pandas uses it for plotting).
Just pass 'bar' as the kind parameter of plot.
Example
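from pandas import read_csv
# parser is a user-defined date-parsing function that is not shown here
series = read_csv('BwsCount.csv', header=0, parse_dates=[0], index_col=0, squeeze=True, date_parser=parser)
series.plot(kind='bar')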
The default value of kind is 'line', i.e. series.plot() draws a line graph.
For your reference:
kind : str
‘line’ : line plot (default)
‘bar’ : vertical bar plot
‘barh’ : horizontal bar plot
‘hist’ : histogram
‘box’ : boxplot
‘kde’ : Kernel Density Estimation plot
‘density’ : same as ‘kde’
‘area’ : area plot
‘pie’ : pie plot
I have a problem plotting multi-indexed data in a single bar chart. I started with a DataFrame with three columns (artist, genre and miscl_count) and 195 rows. I then grouped the data by two of the columns, which resulted in the table below. How can I create a bar plot from this so that each group in "miscl_count" is shown as a separate set of bars across all five genres (i.e. 3x5 bars in total)? I would also like the genre to determine the color assigned to a bar.
I know that there is unstacking, but I don't understand how I can get this to work with Matplotlib or Seaborn.
The head of the DataFrame, that I perform the groupby method on looks like this:
print(miscl_df.head())
artist miscl_count genre
0 band1 5 a
1 band2 6 b
2 band3 5 b
3 band4 4 b
4 band5 5 b
5 band6 5 c
miscl_df_group = miscl_df.groupby(['genre', 'miscl_count']).count()
print(miscl_df_group)
After group by, the output looks like this:
artist
miscl_count 4 5 6
genre
a 11 9 9
b 19 13 16
c 13 14 16
d 10 9 12
e 21 14 10
Just to make sure I made myself clear, the output should be shown as a single chart (and not as subplots)!
Working solution to be used on the grouped data:
miscl_df_group.unstack(level='genre').plot(kind='bar')
Alternatively, it can also be used this way:
miscl_df_group.unstack(level='miscl_count').plot(kind='bar')
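To see what unstack does before plotting, here is a small self-contained sketch. The eight-row `miscl_df` is made-up data with the same column names as in the question, and the Agg backend line is an assumption so that it runs without a display. Selecting the "artist" column first keeps the legend labels down to the plain genre names.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; remove for interactive use
import pandas as pd

# Made-up data with the same columns as in the question.
miscl_df = pd.DataFrame({
    "artist": [f"band{i}" for i in range(1, 9)],
    "miscl_count": [4, 5, 5, 6, 4, 4, 6, 5],
    "genre": ["a", "a", "a", "a", "b", "b", "b", "b"],
})

grouped = miscl_df.groupby(["genre", "miscl_count"]).count()

# Move 'genre' from the row index into the columns: one row per miscl_count,
# one column (and therefore one bar color) per genre.
wide = grouped["artist"].unstack(level="genre")
print(wide)

ax = wide.plot(kind="bar")  # all bars drawn in a single chart
```

Unstacking the other level instead swaps the roles: one bar group per genre, colored by miscl_count.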
With seaborn there is no need to group the data; this is done under the hood:
import seaborn as sns
sns.barplot(x="artist", y="miscl_count", hue="genre", data=miscl_df)
(change the column names at will, depending on what you want)
# full working example
import numpy as np
import pandas as pd
import seaborn as sns
df = pd.DataFrame()
df["artist"] = list(map(lambda i: f"band{i}", np.random.randint(1,4,size=(100,))))
df["genre"] = list(map(lambda i: f"genre{i}", np.random.randint(1,6,size=(100,))))
df["count"] = np.random.randint(50,100,size=(100,))
# df (illustrative output from an earlier run; values and label ranges will differ)
# count genre artist
# 0 97 genre9 band1
# 1 95 genre7 band1
# 2 65 genre3 band2
# 3 81 genre1 band1
# 4 58 genre10 band1
# .. ... ... ...
# 95 61 genre1 band2
# 96 53 genre9 band2
# 97 55 genre9 band1
# 98 94 genre1 band2
# 99 85 genre8 band1
# [100 rows x 3 columns]
sns.barplot(x="artist", y="count", hue="genre", data=df)
I have a dataframe with 2 columns and 3000 rows.
First column is representing time in time-steps. For example first row is 0, second is 1, ..., last one is 2999.
Second column is representing pressure. The pressure changes as we iterate over the rows, but shows a repetitive behaviour. So every few steps we see that it goes to its minimum value (which is 375), then goes up again, then again at 375 etc.
What I want to do in Python is iterate over the rows and find:
1) at which time-steps the pressure is at its minimum
2) the frequency with which the minimum value recurs.
import numpy as np
import pandas as pd
import numpy.random as rnd
import scipy.linalg as lin
from matplotlib.pylab import *
import re
from pylab import *
import datetime
df = pd.read_csv('test.csv')
row = next(df.iterrows())[0]
dataset = np.loadtxt(df, delimiter=";")
df.columns = ["Timestamp", "Pressure"]
print(df[[0, 1]])
You don't need to iterate row-wise. You can compare the entire column against its min value to build a boolean mask, then use the mask to find the differences between the matching time-steps:
Data setup:
In [44]:
df = pd.DataFrame({'timestep':np.arange(20), 'value':np.random.randint(375, 400, 20)})
df
Out[44]:
timestep value
0 0 395
1 1 377
2 2 392
3 3 396
4 4 377
5 5 379
6 6 384
7 7 396
8 8 380
9 9 392
10 10 395
11 11 393
12 12 390
13 13 393
14 14 397
15 15 396
16 16 393
17 17 379
18 18 396
19 19 390
Mask the df by comparing the column against the min value:
In [45]:
df[df['value']==df['value'].min()]
Out[45]:
timestep value
1 1 377
4 4 377
We can use the mask with loc to find the corresponding 'timestep' value and use diff to find the interval differences:
In [48]:
df.loc[df['value']==df['value'].min(),'timestep'].diff()
Out[48]:
1 NaN
4 3.0
Name: timestep, dtype: float64
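These differences are the periods between consecutive minima, measured in time-steps. The frequency is the reciprocal of the period, and you can rescale it to whatever unit you need (for example, per minute) once you know how long one time-step lasts.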
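As a rough sketch of that last step, the intervals from diff() can be averaged into a period and inverted to get a frequency. The ten-row frame below is made-up data with a minimum every third step, and the result is expressed per time-step because the real duration of a step is not given.

```python
import pandas as pd

# Made-up pressure data: the minimum (375) recurs every 3 time-steps.
df = pd.DataFrame({"timestep": range(10),
                   "value": [375, 380, 390, 375, 381, 392, 375, 383, 391, 375]})

# Time-steps at which the value equals the column minimum.
at_min = df["value"] == df["value"].min()
min_steps = df.loc[at_min, "timestep"]

# Intervals between consecutive minima; the first diff is NaN and is dropped.
periods = min_steps.diff().dropna()
mean_period = periods.mean()

# Frequency in minima per time-step; scale by steps-per-unit-time if known.
frequency = 1 / mean_period
print(mean_period, frequency)
```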
I have a very large dataframe (about 1.1M rows) and I am trying to sample it.
I have a list of indexes (about 70,000 indexes) that I want to select from the entire dataframe.
This is what I've tried so far, but all these methods take far too much time:
Method 1 - Using pandas :
sample = pandas.read_csv("data.csv", index_col = 0).reset_index()
sample = sample[sample['Id'].isin(sample_index_array)]
Method 2 :
I tried to write all the sampled lines to another csv.
f = open("data.csv",'r')
out = open("sampled_date.csv", 'w')
out.write(f.readline())
while 1:
    total += 1
    line = f.readline().strip()
    if line =='':
        break
    arr = line.split(",")
    if (int(arr[0]) in sample_index_array):
        out.write(",".join(e for e in (line)))
Can anyone please suggest a better method? Or how I can modify this to make it faster?
Thanks
We don't have your data, so here is an example with two options:
after reading: use a pandas Index object to select a subset via the .iloc selection method
while reading: a predicate with the skiprows parameter
Given
A collection of indices and a (large) sample DataFrame written to test.csv:
import pandas as pd
import numpy as np
indices = [1, 2, 3, 10, 20, 30, 67, 78, 900, 2176, 78776]
df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list("ABCD"))
df.to_csv("test.csv", header=False)
df.info()
Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 4 columns):
A 1000000 non-null int32
B 1000000 non-null int32
C 1000000 non-null int32
D 1000000 non-null int32
dtypes: int32(4)
memory usage: 15.3 MB
Code
Option 1 - after reading
Convert a sample list of indices to an Index object and slice the loaded DataFrame:
idxs = pd.Index(indices)
subset = df.iloc[idxs, :]
print(subset)
The .iat and .at methods are even faster, but require scalar indices.
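Keep in mind that .iloc selects by position. If the values you hold are labels stored in a column (such as the Id column in the question), a label-based lookup is needed instead. A minimal sketch with a made-up four-row frame:

```python
import pandas as pd

# Made-up frame: 'Id' holds labels that do not match the row positions.
df = pd.DataFrame({"Id": [10, 20, 30, 40], "A": [1, 2, 3, 4]})
wanted_ids = [20, 40]

# Option A: make 'Id' the index and select by label with .loc.
by_label = df.set_index("Id").loc[wanted_ids]

# Option B: keep 'Id' as a column and filter with a boolean mask.
by_mask = df[df["Id"].isin(wanted_ids)]

print(by_label)
print(by_mask)
```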
Option 2 - while reading (Recommended)
We can write a predicate that keeps selected indices as the file is being read (more efficient):
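# A set makes each membership test O(1), which matters with many indices.
index_set = set(indices)
pred = lambda x: x not in index_set
# names must be list-like; recent pandas versions reject a plain string.
data = pd.read_csv("test.csv", skiprows=pred, index_col=0, names=list("ABCD"))
print(data)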
See also the issue that led to extending skiprows.
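To try the predicate without writing a large file, the same idea can be checked against a small in-memory CSV. This is only a sketch: the ten-row text, the column names `idx` and `value`, and the `wanted` set are all made up for illustration.

```python
import io
import pandas as pd

# Made-up CSV without a header: rows "0,0", "1,2", ..., "9,18".
csv_text = "\n".join(f"{i},{i * 2}" for i in range(10))

wanted = {1, 3, 7}  # row numbers to keep; a set gives fast lookups

# skiprows receives each 0-based row number and returns True to skip that row.
subset = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["idx", "value"],
    index_col="idx",
    skiprows=lambda row: row not in wanted,
)
print(subset)
```

Because the file has no header line, the row numbers passed to the predicate coincide with the data rows; with a header, row 0 would be the header itself.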
Results
Both options produce the same output:
A B C D
1 74 95 28 4
2 87 3 49 94
3 53 54 34 97
10 58 41 48 15
20 86 20 92 11
30 36 59 22 5
67 49 23 86 63
78 98 63 60 75
900 26 11 71 85
2176 12 73 58 91
78776 42 30 97 96