Parsing Nested Row Text Document for Frequency Distribution Plot with Python - python

I have a document with the following structure:
CUSTOMERID1
conversation-id-123
conversation-id-123
conversation-id-123
CUSTOMERID2
conversation-id-456
conversation-id-789
I'd like to parse the document to get a frequency distribution plot with the number of conversations on the X axis and the # of customers on the Y axis. Does anyone know the easiest way to do this with Python?
I'm familiar with the frequency distribution plot piece but am struggling with how to parse the data into the right data structure to build the plot. Thank you for any help you can provide ahead of time!

You can try the following:
>>> import pandas as pd
>>> dict_ = {}
>>> with open('file.csv') as f:
...     for line in f:
...         if line.startswith('CUSTOMERID'):
...             dict_[line.strip('\n')] = list_ = []
...         else:
...             list_.append(line.strip().split('-'))
...
>>> df = pd.DataFrame.from_dict(dict_, orient='index').stack()
>>> df.transform(lambda x:x[-1]).groupby(level=0).count().plot(kind='bar')
Output: [bar chart: number of conversations per customer ID]
If you want only 1 and 2 on the X axis, change the line dict_[line.strip('\n')] = list_ = [] to dict_[line.strip('CUSTOMERID/\n')] = list_ = [].
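For the distribution the question actually asks for (number of conversations on the X axis, number of customers on the Y axis), a minimal standard-library sketch could look like the following. It assumes every customer line starts with 'CUSTOMERID' and that every conversation line counts, repeats included; the helper name conversations_per_customer is made up for illustration.

```python
from collections import Counter

def conversations_per_customer(lines):
    """Map each customer ID to the number of conversation lines under it."""
    counts = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith('CUSTOMERID'):
            # A new customer block starts here
            current = line
            counts[current] = 0
        elif current is not None:
            # Any other line is treated as one conversation (assumption)
            counts[current] += 1
    return counts

# The sample document from the question
sample = """CUSTOMERID1
conversation-id-123
conversation-id-123
conversation-id-123
CUSTOMERID2
conversation-id-456
conversation-id-789
""".splitlines()

per_customer = conversations_per_customer(sample)
# Frequency distribution: conversation count -> number of customers
distribution = Counter(per_customer.values())
print(sorted(distribution.items()))
```

The keys of distribution are the X values and the counts are the Y values, so sorted(distribution.items()) can be passed to any bar plot. If only distinct conversation IDs should count, collecting them in a set per customer instead of incrementing a counter would be one way to do it.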

Related

Comparing data from 2 nested dictionaries and producing a box-plot

I am trying to produce a box plot using matplotlib with data from nested dictionaries. Below is a rough outline of the structure of the dictionary in question.
m_data = {scenario: {variable: {'model_name': value, 'model_name': value, ...}}}
One issue is that I want to look at the change in the models' output between the two different scenarios (scenario 1 [VAR1] - scenario 2 [VAR2]) and then plot this difference in a box plot.
I have managed to do this, however, I want to be able to label the outliers with the model name. My current method separates the keys from the values, therefore the outlier data point has no name associated with it anymore.
import matplotlib.pyplot as plt
#BOXPLOT
#set up blank lists
future_rain = []
past_rain = []
future_temp = []
past_temp = []
#single out the values for each model from the nested dictionaries
for key,val in m_data[FUTURE_SCENARIO][VAR1].items():
    future_rain.append(val)
for key,val in m_data[FUTURE_SCENARIO][VAR2].items():
    future_temp.append(val)
for key,val in m_data['historical'][VAR1].items():
    past_rain.append(val)
for key,val in m_data['historical'][VAR2].items():
    past_temp.append(val)
#blanks for final data
bx_plt_rain = []
bx_plt_temp = []
#allow for the subtraction of two lists
zip_object = zip(future_temp, past_temp)
for future_temp_i, past_temp_i in zip_object:
    bx_plt_temp.append(future_temp_i - past_temp_i)
zip_object = zip(future_rain, past_rain)
for future_rain_i, past_rain_i in zip_object:
    bx_plt_rain.append(future_rain_i - past_rain_i)
#colour outliers red
c = 'red'
outlier_col = {'flierprops': dict(color =c, markeredgecolor=c)}
#plot
bp = plt.boxplot(bx_plt_rain, patch_artist=True, showmeans=True, vert= False, meanline=True, **outlier_col)
bp['boxes'][0].set(facecolor = 'lightgrey')
plt.show()
If anyone knows of a workaround for this I would be extremely grateful.
As a bit of a hack you could create a function that looks through the dict for the outlier value and returns the key.
def outlier_name(outlier_val, inner_dict):
    for key, value in inner_dict.items():
        if value == outlier_val:
            return key
This could be pretty intensive if your data sets are large.
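An alternative that avoids the reverse lookup is to keep the model name attached to each difference from the start. The sketch below is only an illustration: the model names and values are hypothetical, and the outlier test uses the common 1.5 x IQR fences (matplotlib's default whis is 1.5), computed here with the standard library, so the quartiles may not match matplotlib's in every edge case.

```python
import statistics

def differences_by_model(future, past):
    """Return {model_name: future - past} for models present in both dicts."""
    return {name: future[name] - past[name] for name in future if name in past}

def outlier_names(diffs, whis=1.5):
    """Return {model_name: value} for values outside the whis * IQR fences."""
    q1, _, q3 = statistics.quantiles(diffs.values(), n=4, method='inclusive')
    iqr = q3 - q1
    low, high = q1 - whis * iqr, q3 + whis * iqr
    return {name: v for name, v in diffs.items() if v < low or v > high}

# Hypothetical inner dicts, standing in for m_data[scenario][variable]
future = {'model_a': 1.0, 'model_b': 1.2, 'model_c': 1.1,
          'model_d': 0.9, 'model_e': 9.0}
past = {'model_a': 0.5, 'model_b': 0.5, 'model_c': 0.5,
        'model_d': 0.5, 'model_e': 0.5}

diffs = differences_by_model(future, past)
outliers = outlier_names(diffs)
print(outliers)
```

Because the differences are keyed by model name, the subtraction no longer depends on both dictionaries iterating in the same order, and list(diffs.values()) can still be handed to plt.boxplot as before.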

Reading a csv file and counting a row depending on another row

I have a csv file where I need to read different columns and sum their values depending on another column in the dataset.
The question is:
How do the flight phases (e.g. take off, cruise, landing) contribute to fatalities?
I have to sum up column number 23 for each distinct value in column 28.
I have a solution with masks and a lot of if statements:
import pandas as pd
database = pd.read_csv('Aviation.csv', quotechar='"', skipinitialspace=True, delimiter=',', encoding='latin1').fillna(0)
data = database.to_numpy()  # as_matrix() was removed in pandas 1.0
TOcounter = 0
for r in data:
    if r[28] == "TAKEOFF":
        TOcounter += r[23]
print(TOcounter)
This example shows the general idea of my solution, where I would have to add a lot of if statements and counters for every distinct value in column 28.
But I was wondering if there is a smarter solution to the issue.
The raw data can be found at: https://raw.githubusercontent.com/edipetres/Depressed_Year/master/Dataset_Assignment/AviationDataset.csv
It sounds like what you are trying to achieve is
df.groupby('Broad.Phase.of.Flight')['Total.Fatal.Injuries'].sum()
This is a quick solution that does not check for errors, such as whether a string can be converted to a float. You should also consider looking up the right column by its header text instead of relying on the column index (like 23 and 28),
but this should work:
# Python 2 code (urllib2, iteritems, print statement)
import csv
import urllib2
import collections
url = 'https://raw.githubusercontent.com/edipetres/Depressed_Year/master/Dataset_Assignment/AviationDataset.csv'
response = urllib2.urlopen(url)
df = csv.reader(response)
d = collections.defaultdict(list)
for i, row in enumerate(df):
    key = row[28]
    if key == "" or i == 0:
        continue
    val = 0 if row[23] == "" else float(row[23])
    d[key].append(val)
d2 = {}
for k, v in d.iteritems():
    d2[k] = sum(v)
for k, v in d2.iteritems():
    print "{}:{}".format(k, v)
Result:
TAXI:110.0
STANDING:193.0
MANEUVERING:6430.0
DESCENT:1225.0
UNKNOWN:919.0
TAKEOFF:5267.0
LANDING:592.0
OTHER:107.0
CRUISE:6737.0
GO-AROUND:783.0
CLIMB:1906.0
APPROACH:4493.0
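The same idea can be sketched in Python 3 with the columns looked up by header name rather than by index. The header names 'Broad.Phase.of.Flight' and 'Total.Fatal.Injuries' are the ones used in the groupby answer; the helper name sum_by_group and the inline rows are made up for illustration and are not taken from the real dataset.

```python
import csv
import io
from collections import defaultdict

def sum_by_group(rows, group_col, value_col):
    """Sum value_col per distinct value of group_col; blank values count as 0."""
    totals = defaultdict(float)
    for row in rows:
        key = row[group_col]
        if not key:
            # Skip rows without a group label, as the loop above does
            continue
        raw = row[value_col]
        totals[key] += float(raw) if raw else 0.0
    return dict(totals)

# Hypothetical rows for illustration only
sample = io.StringIO(
    "Broad.Phase.of.Flight,Total.Fatal.Injuries\n"
    "TAKEOFF,2\n"
    "CRUISE,\n"
    "TAKEOFF,1\n"
    ",4\n"
    "CRUISE,3\n"
)

totals = sum_by_group(csv.DictReader(sample),
                      'Broad.Phase.of.Flight', 'Total.Fatal.Injuries')
for phase, total in totals.items():
    print("{}:{}".format(phase, total))
```

For the real file, the io.StringIO object would be replaced by an open file handle, for example open('Aviation.csv', encoding='latin1', newline=''), using the file name and encoding from the question.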

for loop in scipy.stats.linregress

I am using the scipy stats module to calculate a linear regression, i.e.
slope, intercept, r_value, p_value, std_err = stats.linregress(data['cov_0.0075']['num'], data['cov_0.0075']['com'])
where data is a dictionary containing several 'cov_x' keys, each corresponding to a dataframe with columns 'num' and 'com'.
I want to be able to loop through this dictionary and do linear regression on each 'cov_x'. I am not sure how to do this. I tried:
for i in data:
    slope_+str(i), intercept+str(i), r_value+str(i),p_value+str(i),std_err+str(i)= stats.linregress(data[i]['num'],data[i]['com'])
Essentially I want len(x) slope_x values.
You could use a list comprehension to collect all the stats.linregress return values:
result = [stats.linregress(df['num'],df['com']) for key, df in data.items()]
result is a list of 5-tuples. To collect all the first, second, third, etc... elements from each 5-tuple into separate lists, use zip(*[...]):
slopes, intercepts, r_values, p_values, stderrs = zip(*result)
You should be able to do what you're trying to do, but there are a couple of things to watch out for.
First, you can't build a variable name by adding a string to it. The left side of an assignment has to be a name, an attribute, or a subscript such as slope[key], never an expression like slope_+str(i).
Use the dict data type if you want string indexing:
import scipy.stats as stats
import pandas as pd
import numpy as np
data = {}
l = ['cov_0.0075','cov_0.005']
for i in l:
    x = np.random.random(100)
    y = np.random.random(100)+15
    d = {'num':x,'com':y}
    df = pd.DataFrame(data=d)
    data[i] = df
slope = {}
intercept = {}
r_value = {}
p_value = {}
std_error = {}
for i in data:
    slope[str(i)], \
    intercept[str(i)], \
    r_value[str(i)], \
    p_value[str(i)], std_error[str(i)] = stats.linregress(data[i]['num'],data[i]['com'])
print(slope,intercept,r_value,p_value,std_error)
This should work just fine. Otherwise, you can store the individual values and put them in a list later.

Averaging values in a list of a list of a list in Python

I'm working on a method to average data from multiple files and put the results into a single file. Each line of the files looks like:
File #1
Test1,5,2,1,8
Test2,10,4,3,2
...
File #2
Test1,2,4,5,1
Test2,4,6,10,3
...
Here is the code I use to store the data:
totalData = []
for i in range(0, len(files)):
    data = []
    if ".csv" in files[i]:
        infile = open(files[i],"r")
        temp = infile.readline()
        while temp != "":
            data.append([c.strip() for c in temp.split(",")])
            temp = infile.readline()
        totalData.append(data)
So what I'm left with is totalData looking like the following:
totalData = [[
[Test1,5,2,1,8],
[Test2,10,4,3,2]],
[[Test1,2,4,5,1],
[Test2,4,6,10,3]]]
What I want to average is for all Test1, Test2, etc, average all the first values and then the second values and so forth. So testAverage would look like:
testAverage = [[Test1,3.5,3,3,4.5],
[Test2,7,5,6.5,2.5]]
I'm struggling to think of a concise/efficient way to do this. Any help is greatly appreciated! Also, if there are better ways to manage this type of data, please let me know.
It just needs a few nested loops:
totalData = [ [['Test1',5,2,1,8],['Test2',10,4,3,2]],
              [['Test1',2,4,5,1],['Test2',4,6,10,3]] ]
for t in range(len(totalData[0])):  # tests
    result = [totalData[0][t][0]]
    for i in range(1, len(totalData[0][0])):  # numbers
        total = 0.0  # avoid shadowing the built-in sum
        for j in range(len(totalData)):
            total += totalData[j][t][i]
        total /= len(totalData)
        result.append(total)
    print(result)
First flatten it out:
import itertools
results = list(itertools.chain.from_iterable(totalData))
Then sort it:
results.sort()
Then use groupby:
data = {}
for key, values in itertools.groupby(results, lambda x: x[0]):
    # skip the label in column 0 and transpose the rest into columns
    columns = zip(*(row[1:] for row in values))
    data[key] = [sum(c)*1.0/len(c) for c in columns]
And finally just print your data.
If your data structure is regular, the best option is probably to use numpy. You should be able to install it with pip from the terminal:
pip install numpy
Then in Python:
import numpy as np
totalData = np.array(totalData)
# drop the first column (the 'Test1', 'Test2' labels), since it's not a number
totalData = np.array(totalData[:, :, 1:], float)
# average over the files
np.mean(totalData, axis=0)
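The file-reading loop in the question stores every field as a string, so the values need converting before they can be averaged. A standard-library sketch that groups rows by their label and averages column by column might look like this; the function name average_rows is made up for illustration, and it assumes every row for a given label has the same number of numeric fields.

```python
from collections import defaultdict

def average_rows(total_data):
    """Average numeric columns position by position for rows sharing a label."""
    grouped = defaultdict(list)
    for file_rows in total_data:
        for label, *values in file_rows:
            # Fields read from a file are strings, so convert them first
            grouped[label].append([float(v) for v in values])
    averages = []
    for label, rows in grouped.items():
        # zip(*rows) turns the rows for one label into columns
        averages.append([label] + [sum(col) / len(col) for col in zip(*rows)])
    return averages

# The question's sample data, as strings the way the reading loop stores them
total_data = [
    [['Test1', '5', '2', '1', '8'], ['Test2', '10', '4', '3', '2']],
    [['Test1', '2', '4', '5', '1'], ['Test2', '4', '6', '10', '3']],
]

test_average = average_rows(total_data)
for row in test_average:
    print(row)
```

Grouping by label rather than by position means the files do not have to list the tests in the same order.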

Python iteration with array

I guess it is a simple question: I am doing a simple while iteration and want to save the data in an array so I can simply plot it.
tr = 25 #sec
fr = 50 #Hz
dt = 0.002 #2ms
df = fr*(dt/tr)
i = 0
f = 0
data = 0
while f < 50:
    i = i+1
    f = ramp(fr,f,df)
    data[i] = f
plot(data)
How do I correctly define the data array? How do I save the results in the array?
One possibility:
data = []
while f < 50:
    f = ramp(fr,f,df)
    data.append(f)
Here, i is no longer needed.
You could initialize a list like this:
data = []
Then you could add data like this:
data.append(f)
For plotting, matplotlib is a good choice and is easy to install and use.
import pylab
pylab.plot(data)
pylab.show()
The asker needs "i" because the indexing starts from 1 in the collection. For the original code to work, use:
data = {}  # this is a dictionary, not a list
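Putting the list-based suggestion together as something runnable: the question never shows ramp, so the version below is purely a hypothetical stand-in that steps towards the target without overshooting. The constants are the ones from the question.

```python
def ramp(target, current, step):
    """Hypothetical ramp: move current towards target by step, never past it."""
    return min(current + step, target)

tr = 25      # sec
fr = 50      # Hz
dt = 0.002   # 2 ms
df = fr * (dt / tr)

f = 0.0
data = []    # an empty list grows with append(); no index variable is needed
while f < fr:
    f = ramp(fr, f, df)
    data.append(f)

print(len(data), data[-1])
```

The resulting list can be passed straight to a plotting call such as pylab.plot(data).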
