How to average values between two files? - Python

I have two files of matrices that look like this:
File1:
{'key1',g,l,i,o,+: [0.0, 0.0, 0.92, 0.02, 0.01],'key2',g,l,i,o,+: [0.1, 0.2, 0.90,
0.26, 0.10].....'key100',g,l,i,o,+: [0.1, 0.1, 0.29, 0.19, 0.20]}
File2:
{'key1',g,l,i,o,+: [0.0, 0.0, 0.96, 0.06, 0.01],'key2',g,l,i,o,+: [0.0, 0.1, 0.95,
0.26, 0.11].....'key100',g,l,i,o,+: [0.2, 0.0, 0.23, 0.16, 0.21]}
Both files have the same 'keys'. I want to average the values between the two files, so the result file looks like this:
Desired output file:
{'key1',g,l,i,o,+: [0.0, 0.0, 0.94, 0.04, 0.01],'key2',g,l,i,o,+: [0.05, 0.15, 0.925,
0.26, 0.105].....'key100',g,l,i,o,+: [0.15, 0.1, 0.29, 0.175, 0.205]}
I have thought about the Python script I could write, but since I am quite new to this, any quick ideas would be welcome:
import gzip
import numpy as np

inFile1 = gzip.open('/home/file1')
inFile2 = gzip.open('/home/file2')
inFile1.next()
for line in inFile1:
    cols = line.strip().split('\t')
    data = cols[6:]
for line in inFile2:
    cols = line.strip().split('\t')
    data2 = cols[6:]
newdata = (data + data2)/2

You could use a regex to replace the strings and make the data JSON-compatible. Then you can easily convert it into a dict and use normal Python to analyse the data (compare the dicts):
import re
import json
s = '''{'key1',g,l,i,o,+: [0.0, 0.0, 0.92, 0.02, 0.01],'key2',g,l,i,o,+: [0.1, 0.2, 0.90,
0.26, 0.10],'key100',g,l,i,o,+: [0.1, 0.1, 0.29, 0.19, 0.20]}'''
s2 = re.sub(r"'(key\d+)',g,l,i,o,\+", r'"\1"', s)
print(s2)
d = json.loads(s2)
print(d)
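Once both files have been cleaned up this way, the averaging itself is simple. A minimal sketch, assuming s_file1 and s_file2 hold the raw text of the two files and that both contain the same keys (as stated in the question):
import re
import json

def clean(s):
    # strip the ,g,l,i,o,+ suffix so the text becomes valid JSON
    return re.sub(r"'(key\d+)',g,l,i,o,\+", r'"\1"', s)

d1 = json.loads(clean(s_file1))  # s_file1 / s_file2: raw text of the two files (assumed names)
d2 = json.loads(clean(s_file2))

# element-wise average of the two value lists for every shared key
averaged = {k: [(a + b) / 2 for a, b in zip(d1[k], d2[k])] for k in d1}
print(averaged)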

The problem is your data format, as Wodin commented:
what is this format? It looks a bit like a Python dict, but the
,g,l,i,o,+ doesn't make sense for a dict.
I tried it with your data; you can take hints from this code. I used these sample files:
File1.txt:
{'key1',g,l,i,o,+: [0.0, 0.0, 0.92, 0.02, 0.01],'key2',g,l,i,o,+: [0.1, 0.2, 0.90,0.26, 0.10]}
{'key3',g,l,i,o,+: [0.0, 0.0, 0.98, 0.02, 0.01],'key4',g,l,i,o,+: [0.1, 0.2, 0.90,0.268, 0.10]}
File2.txt:
{'key1',g,l,i,o,+: [0.0, 0.0, 0.96, 0.06, 0.01],'key2',g,l,i,o,+: [0.0, 0.1, 0.95,0.26, 0.11]}
{'key3',g,l,i,o,+: [0.0, 0.0, 0.98, 0.02, 0.01],'key4',g,l,i,o,+: [0.1, 0.2, 0.90,0.268, 0.10]}
Code:
import re

pattern = r"('key\w+',g,l,i,o,\+):\s(\[.+?\])"

with open('File1.txt', 'r') as f:
    for line in f:
        average = {}
        for find in re.finditer(pattern, line):
            with open('File2.txt', 'r') as ff:
                for line2 in ff:
                    for find1 in re.finditer(pattern, line2):
                        if find.group(1) == find1.group(1):
                            # element-wise mean of the two bracketed value lists
                            average_part = list(map(lambda x: sum(x) / len(x),
                                                    zip(eval(find.group(2)), eval(find1.group(2)))))
                            average[find.group(1)] = average_part
        print(average)
Output:
{"'key2',g,l,i,o,+": [0.05, 0.15000000000000002, 0.925, 0.26, 0.10500000000000001], "'key1',g,l,i,o,+": [0.0, 0.0, 0.94, 0.04, 0.01]}
{"'key3',g,l,i,o,+": [0.0, 0.0, 0.98, 0.02, 0.01], "'key4',g,l,i,o,+": [0.1, 0.2, 0.9, 0.268, 0.1]}

Related

Pandas .apply with conditional if in one column

I have a dataframe as below. I am trying to check whether each element in the vector column is 0 or 1; if it is, add 10 divided by (the element plus 2) to that element, otherwise keep the element as it is.
import pandas as pd

df = pd.DataFrame({'user': ['user 1', 'user 2', 'user 3'],
                   'vector': [[0.01, 0.07, 0.0, 0.14, 0.0, 0.55, 0.11],
                              [0.12, 0.27, 0.1, 0.14, 0.1, 0.09, 0.19],
                              [0.58, 0.07, 0.02, 0.14, 0.04, 0.06, 1]]})
df
Output:
user vector
0 user 1 [0.01, 0.07, 0.0, 0.14, 0.0, 0.55, 0.11]
1 user 2 [0.12, 0.27, 0.1, 0.14, 0.1, 0.09, 0.19]
2 user 3 [0.58, 0.07, 0.02, 0.14, 0.04, 0.06, 1]
I used the following code:
df['vector']=df.apply(lambda x: x['vector']+10/(x['vector']+2) if x['vector']==0|1 else x['vector'], axis=1)
But the Output:
user vector
0 user 1 [0.01, 0.07, 0.0, 0.14, 0.0, 0.55, 0.11]
1 user 2 [0.12, 0.27, 0.1, 0.14, 0.1, 0.09, 0.19]
2 user 3 [0.58, 0.07, 0.02, 0.14, 0.04, 0.06, 1]
The expected output is shown in the answer below.
Use a list comprehension (faster than apply). Note that your condition x['vector']==0|1 compares the whole list with 0|1, which is just 1, so it is never True and the else branch returns each vector unchanged. Check each element instead:
df['vector'] = [[x+10/(x+2) if x in [0,1] else x for x in v] for v in df['vector']]
Output:
user vector
0 user 1 [0.01, 0.07, 5.0, 0.14, 5.0, 0.55, 0.11]
1 user 2 [0.12, 0.27, 0.1, 0.14, 0.1, 0.09, 0.19]
2 user 3 [0.58, 0.07, 0.02, 0.14, 0.04, 0.06, 4.333333333333334]
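If you prefer to keep apply, the same per-element check works once it is applied inside each list; a sketch (equivalent to the comprehension above, just slower), assuming the df from the question:
df['vector'] = df['vector'].apply(lambda v: [x + 10/(x + 2) if x in (0, 1) else x for x in v])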

How to plot numpy arrays in pandas dataframe

I have the DataFrame:
df =
sample_type observed_data
A [0.2, 0.5, 0.17, 0.1]
A [0.9, 0.3, 0.24, 0.5]
A [0.9, 0.5, 0.6, 0.39]
B [0.01, 0.07, 0.15, 0.26]
B [0.08, 0.14, 0.32, 0.58]
B [0.01, 0.16, 0.42, 0.41]
where the data type in the observed_data column is np.array. What's the easiest and most efficient way of plotting each of the numpy arrays overlayed on the same plot using matplotlib and/or plotly and showing A and B as separate colors or line types (eg. dashed, dotted, etc.)?
You can use this...
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'sample_type': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'observed_data': [[0.2, 0.5, 0.17, 0.1], [0.9, 0.3, 0.24, 0.5], [0.9, 0.5, 0.6, 0.39],
                                     [0.01, 0.07, 0.15, 0.26], [0.08, 0.14, 0.32, 0.58], [0.01, 0.16, 0.42, 0.41]]})

# .items() replaces the deprecated .iteritems() on a Series
for ind, cell in df['observed_data'].items():
    if len(cell) > 0:
        if df.loc[ind, 'sample_type'] == 'A':
            plotted = plt.plot(np.linspace(0, 1, len(cell)), cell, color='blue', marker='o', linestyle='-.')
        else:
            plotted = plt.plot(np.linspace(0, 1, len(cell)), cell, color='red', marker='*', linestyle=':')
plt.show()
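If you also want a legend distinguishing A from B, one option (a sketch reusing the df and imports above) is to plot by group and label only the first line of each group:
styles = {'A': dict(color='blue', marker='o', linestyle='-.'),
          'B': dict(color='red', marker='*', linestyle=':')}
for sample_type, group in df.groupby('sample_type'):
    for i, cell in enumerate(group['observed_data']):
        plt.plot(np.linspace(0, 1, len(cell)), cell,
                 label=sample_type if i == 0 else None, **styles[sample_type])
plt.legend()
plt.show()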

Transform JSON file to Data Frame in Python

I have a text file which has a JSON structure and I want to transform it to a data frame.
The file includes several such JSON strings:
{'cap': {'english': 0.1000, 'universal': 0.225}, 'display_scores': {'english': {'astroturf': 0.5, 'fake_follower': 0.8, 'financial': 0.2, 'other': 1.8, 'overall': 1.8, 'self_declared': 0.0, 'spammer': 0.2}, 'universal': {'astroturf': 0.4, 'fake_follower': 0.2, 'financial': 0.2, 'other': 0.4, 'overall': 0.8, 'self_declared': 0.0, 'spammer': 0.0}}, 'raw_scores': {'english': {'astroturf': 0.1, 'fake_follower': 0.16, 'financial': 0.05, 'other': 0.35, 'overall': 0.35, 'self_declared': 0.0, 'spammer': 0.04}, 'universal': {'astroturf': 0.07, 'fake_follower': 0.03, 'financial': 0.05, 'other': 0.09, 'overall': 0.16, 'self_declared': 0.0, 'spammer': 0.01}}, 'user': {'majority_lang': 'de', 'user_data': {'id_str': '123456', 'screen_name': 'beispiel01'}}}
import json
import pandas as pd

tweets_data_path = "data.txt"
tweets_data = []
tweets_file = open(tweets_data_path, "r")
for line in tweets_file:
    try:
        tweet = json.loads(line)
        tweets_data.append(tweet)
    except:
        continue
tweets_data
df = pd.DataFrame.from_dict(pd.json_normalize(tweets_data), orient='columns')
df
However, apparently there is something wrong with either the json.loads or the append command, because the tweets_data is empty when I call it.
Do you have an idea?
This is how your code should look to append data to tweets_data:
import json

tweets_data_path = "data.txt"
tweets_data = []
with open(tweets_data_path, 'r') as f:
    for line in f.readlines():
        try:
            tweet = json.loads(json.dumps(line))
            tweets_data.append(tweet)
        except:
            continue
print(tweets_data)
["{'cap': {'english': 0.1000, 'universal': 0.225}, 'display_scores': {'english': {'astroturf': 0.5, 'fake_follower': 0.8, 'financial': 0.2, 'other': 1.8, 'overall': 1.8, 'self_declared': 0.0, 'spammer': 0.2}, 'universal': {'astroturf': 0.4, 'fake_follower': 0.2, 'financial': 0.2, 'other': 0.4, 'overall': 0.8, 'self_declared': 0.0, 'spammer': 0.0}}, 'raw_scores': {'english': {'astroturf': 0.1, 'fake_follower': 0.16, 'financial': 0.05, 'other': 0.35, 'overall': 0.35, 'self_declared': 0.0, 'spammer': 0.04}, 'universal': {'astroturf': 0.07, 'fake_follower': 0.03, 'financial': 0.05, 'other': 0.09, 'overall': 0.16, 'self_declared': 0.0, 'spammer': 0.01}}, 'user': {'majority_lang': 'de', 'user_data': {'id_str': '123456', 'screen_name': 'beispiel01'}}}\n", "{'cap': {'english': 0.1000, 'universal': 0.225}, 'display_scores': {'english': {'astroturf': 0.5, 'fake_follower': 0.8, 'financial': 0.2, 'other': 1.8, 'overall': 1.8, 'self_declared': 0.0, 'spammer': 0.2}, 'universal': {'astroturf': 0.4, 'fake_follower': 0.2, 'financial': 0.2, 'other': 0.4, 'overall': 0.8, 'self_declared': 0.0, 'spammer': 0.0}}, 'raw_scores': {'english': {'astroturf': 0.1, 'fake_follower': 0.16, 'financial': 0.05, 'other': 0.35, 'overall': 0.35, 'self_declared': 0.0, 'spammer': 0.04}, 'universal': {'astroturf': 0.07, 'fake_follower': 0.03, 'financial': 0.05, 'other': 0.09, 'overall': 0.16, 'self_declared': 0.0, 'spammer': 0.01}}, 'user': {'majority_lang': 'de', 'user_data': {'id_str': '123456', 'screen_name': 'beispiel01'}}}"]
Instead of loading the JSON into a dictionary and then converting that dictionary into a pandas dataframe, simply use the pandas built-in function to convert from JSON to a pandas dataframe:
df = pd.read_json(tweets_file)
Alternatively, if you wish to load the JSON into a dictionary first, then convert the dictionary to a dataframe:
tweets_data = json.loads(tweets_file.read())
df = pd.DataFrame.from_dict(tweets_data, orient='columns')
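Note that the sample line shown above uses single quotes, so it is a Python dict literal rather than valid JSON: json.loads raises an error on every line, the bare except swallows it, and tweets_data stays empty. If the file really does contain dict literals like the example, a sketch using ast.literal_eval (assuming one dict per line in data.txt) would parse them:
import ast
import pandas as pd

tweets_data = []
with open("data.txt", "r") as f:
    for line in f:
        line = line.strip()
        if line:
            # safely evaluate the single-quoted dict literal
            tweets_data.append(ast.literal_eval(line))

df = pd.json_normalize(tweets_data)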

How to loop over a dataframe and create list

So, I have the data below and I want to loop through the dataframe, perform some functions, and at the end save the results from the functions in a list. I am having trouble creating the list: I only get a single value instead of the two means I intend to get. Anybody with a more effective way to solve this problem, please share.
import pandas as pd
import numpy as np

dict = {'PassengerId' : [0.0, 0.001, 0.002, 0.003, 0.004, 0.006, 0.007, 0.008, 0.009, 0.01],
        'Survived' : [0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
        'Pclass' : [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.5],
        'Age' : [0.271, 0.472, 0.321, 0.435, 0.435, np.nan, 0.673, 0.02, 0.334, 0.171],
        'SibSp' : [0.125, 0.125, 0.0, 0.125, 0.0, 0.0, 0.0, 0.375, 0.0, 0.125],
        'Parch' : [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.167, 0.333, 0.0],
        'Fare' : [0.014, 0.139, 0.015, 0.104, 0.016, 0.017, 0.101, 0.041, 0.022, 0.059]}
dicts = pd.DataFrame(dict, columns = dict.keys())
def Mean(self):
    list_mean = []
    list_all = []
    for i, row in dicts.iterrows():
        if (row['Age'] > 0.2) & (row['Fare'] < 0.1):
            list_all.append(row['PassengerId'])
        elif (row['Age'] > 0.2) & (row['Fare'] > 0.1):
            list_all.clear()
            list_all.append(row['PassengerId'])
    return list_mean.append(np.mean(list_all))
Mean()
Help Please!!
Some changes need to be made in your solution to resolve this issue. For a vectorized answer, check out my Code section.
1. The return statement (return list_mean) should be placed at the function level, not inside the if block.
Change:
. . .
if (row['Age'] > self.age) & (row['Fare'] < self.fare):
    list_mean.append(row['PassengerId'])
    return list_mean
. . .
To:
. . .
list_mean = []
for i, row in dicts.iterrows():
    if (row['Age'] > self.age) & (row['Fare'] < self.fare):
        list_mean.append(row['PassengerId'])
return list_mean
. . .
CODE (vectorized solution): no need to define an explicit class to perform this action.
import numpy as np
import pandas as pd

dict_ = {
    'PassengerId': [0.0, 0.001, 0.002, 0.003, 0.004, 0.006, 0.007, 0.008, 0.009, 0.01],
    'Survived': [0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
    'Pclass': [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.5],
    'Age': [0.271, 0.472, 0.321, 0.435, 0.435, np.nan, 0.673, 0.02, 0.334, 0.171],
    'SibSp': [0.125, 0.125, 0.0, 0.125, 0.0, 0.0, 0.0, 0.375, 0.0, 0.125],
    'Parch': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.167, 0.333, 0.0],
    'Fare': [0.014, 0.139, 0.015, 0.104, 0.016, 0.017, 0.101, 0.041, 0.022, 0.059]
}
dicts = pd.DataFrame(dict_, columns=dict_.keys())

l1 = dicts['PassengerId'][np.logical_and(dicts['Age'] > 0.2, dicts['Fare'] < 0.1)]
l2 = dicts['PassengerId'][np.logical_and(dicts['Age'] > 0.2, dicts['Fare'] > 0.1)]
print((sum(list(l1)) / len(l1), sum(list(l2)) / len(l2)))
OUTPUT :
(0.00375, 0.0036666666666666666)
import pandas as pd
import numpy as np
dict = {'PassengerId' : [0.0, 0.001, 0.002, 0.003, 0.004, 0.006, 0.007, 0.008, 0.009, 0.01],
'Survived' : [0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
'Pclass' : [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.5],
'Age' : [0.271, 0.472, 0.321, 0.435, 0.435, np.nan, 0.673, 0.02, 0.334, 0.171],
'SibSp' : [0.125, 0.125, 0.0, 0.125, 0.0, 0.0, 0.0, 0.375, 0.0, 0.125],
'Parch' : [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.167, 0.333, 0.0],
'Fare' : [0.014, 0.139, 0.015, 0.104, 0.016, 0.017, 0.101, 0.041, 0.022, 0.059]}
df = pd.DataFrame(dict, columns = dict.keys())
def calculate_mean():
    l1, l2 = [], []
    for i, row in df.iterrows():
        if row['Age'] > 0.2 and row['Fare'] < 0.1:
            l1.append(row['PassengerId'])
        elif row['Age'] > 0.2 and row['Fare'] > 0.1:
            l2.append(row['PassengerId'])
    return np.mean(l1), np.mean(l2)

print(calculate_mean()) # (0.00375, 0.0036666666666666666)

Computation within a list of dictionaries

I have a list of dictionaries:
wt
Out[189]:
[defaultdict(int,
{'A01': 0.15,
'A02': 0.17,
'A03': 0.13,
'A04': 0.17,
'A05': 0.01,
'A06': 0.12,
'A07': 0.15,
'A08': 0.0,
'A09': 0.02,
'A10': 0.09}),
defaultdict(int,
{'A01': 0.02,
'A02': 0.02,
'A03': 0.06,
'A04': 0.08,
'A05': 0.08,
'A06': 0.04,
'A07': 0.02,
'A08': 0.24,
'A09': 0.34,
'A10': 0.1}),
defaultdict(int,
{'A01': 0.0,
'A02': 0.12,
'A03': 0.01,
'A04': 0.01,
'A05': 0.11,
'A06': 0.13,
'A07': 0.1,
'A08': 0.36,
'A09': 0.13,
'A10': 0.03})]
And I have another dictionary:
zz
Out[188]: defaultdict(int, {'S1': 0.44, 'S2': 0.44, 'S3': 0.12})
I need to run a loop to aggregate the following computation:
'S1':0.44 * 'A01':0.15 + 'S2':0.44 * 'A01':0.02 + 'S3':0.12 * 'A01':0.00 ----- to be stored in a dict with the key 'A01'
'S1':0.44 * 'A02':0.17 + 'S2':0.44 * 'A02':0.02 + 'S3':0.12 * 'A02':0.12 ----- to be stored in a dict with the key 'A02'
.
.
.and so on upto:
'S1':0.44 * 'A10':0.09 + 'S2':0.44 * 'A10':0.1 + 'S3':0.12 * 'A10':0.03 ----- to be stored in a dict with the key 'A10'
Can somebody please suggest a loop for this? The issue I'm facing is that:
wt[0]
Out[197]:
defaultdict(int,
{'A01': 0.15,
'A02': 0.17,
'A03': 0.13,
'A04': 0.17,
'A05': 0.01,
'A06': 0.12,
'A07': 0.15,
'A08': 0.0,
'A09': 0.02,
'A10': 0.09})
But:
wt[0][0]
Out[199]: 0
I'm not being able to access each value within the dict.
You can do your aggregation with a dict comprehension:
from collections import defaultdict

x = [defaultdict(int, {'A01': 0.15, 'A02': 0.17, 'A03': 0.13, 'A04': 0.17, 'A05': 0.01, 'A06': 0.12, 'A07': 0.15, 'A08': 0.0, 'A09': 0.02, 'A10': 0.09}),
     defaultdict(int, {'A01': 0.02, 'A02': 0.02, 'A03': 0.06, 'A04': 0.08, 'A05': 0.08, 'A06': 0.04, 'A07': 0.02, 'A08': 0.24, 'A09': 0.34, 'A10': 0.1}),
     defaultdict(int, {'A01': 0.0, 'A02': 0.12, 'A03': 0.01, 'A04': 0.01, 'A05': 0.11, 'A06': 0.13, 'A07': 0.1, 'A08': 0.36, 'A09': 0.13, 'A10': 0.03})]
mult = defaultdict(int, {'S1': 0.44, 'S2': 0.44, 'S3': 0.12})

d = {k: sum(d[k] * mult['S' + str(idx + 1)] for idx, d in enumerate(x))
     for k in x[0].keys()}
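As a quick sanity check with the data above, d['A01'] works out to 0.44*0.15 + 0.44*0.02 + 0.12*0.0 = 0.0748, which matches the first entry of the numpy result below:
print(d['A01'])  # 0.0748 (up to floating-point rounding)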
If you want to multiply your matrix with a vector, you should try numpy:
import numpy as np
# Transform data to matrix
x = np.array([[d['A'+str(i+1).zfill(2)] for i in range(len(d))] for d in x])
v = np.array([mult['S'+str(i+1)] for i in range(len(mult))]).reshape(1, 3)
print(np.matmul(v, x))
# [[0.0748 0.098 0.0848 0.1112 0.0528 0.086 0.0868 0.1488 0.174 0.0872]]
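Since the question wants the results stored in a dict keyed 'A01'..'A10', you can map the matmul result back to keys; a small sketch reusing the arrays above:
result = np.matmul(v, x)[0]
d_np = {'A' + str(i + 1).zfill(2): round(val, 4) for i, val in enumerate(result)}
print(d_np)  # {'A01': 0.0748, 'A02': 0.098, ...}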
