Transform JSON file to Data Frame in Python - python

I have a text file with a JSON-like structure, and I want to transform it into a data frame.
The file contains several strings like this one:
{'cap': {'english': 0.1000, 'universal': 0.225}, 'display_scores': {'english': {'astroturf': 0.5, 'fake_follower': 0.8, 'financial': 0.2, 'other': 1.8, 'overall': 1.8, 'self_declared': 0.0, 'spammer': 0.2}, 'universal': {'astroturf': 0.4, 'fake_follower': 0.2, 'financial': 0.2, 'other': 0.4, 'overall': 0.8, 'self_declared': 0.0, 'spammer': 0.0}}, 'raw_scores': {'english': {'astroturf': 0.1, 'fake_follower': 0.16, 'financial': 0.05, 'other': 0.35, 'overall': 0.35, 'self_declared': 0.0, 'spammer': 0.04}, 'universal': {'astroturf': 0.07, 'fake_follower': 0.03, 'financial': 0.05, 'other': 0.09, 'overall': 0.16, 'self_declared': 0.0, 'spammer': 0.01}}, 'user': {'majority_lang': 'de', 'user_data': {'id_str': '123456', 'screen_name': 'beispiel01'}}}
tweets_data_path = "data.txt"
tweets_data = []
tweets_file = open(tweets_data_path, "r")
for line in tweets_file:
    try:
        tweet = json.loads(line)
        tweets_data.append(tweet)
    except:
        continue
tweets_data
df = pd.DataFrame.from_dict(pd.json_normalize(tweets_data), orient='columns')
df
However, something is apparently wrong with either json.loads or the append call, because tweets_data is empty when I inspect it.
Do you have an idea?

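Your lines are Python dict literals (single-quoted keys and strings), not valid JSON, so json.loads raises an error on every line and the bare except silently skips it. That is why tweets_data stays empty. The code below appends every line to tweets_data, but note that json.loads(json.dumps(line)) only round-trips the text, so each element is still a string rather than a dict:
import json
tweets_data_path = "data.txt"
tweets_data = []
with open(tweets_data_path, 'r') as f:
    for line in f.readlines():
        try:
            tweet = json.loads(json.dumps(line))
            tweets_data.append(tweet)
        except:
            continue
print(tweets_data)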
["{'cap': {'english': 0.1000, 'universal': 0.225}, 'display_scores': {'english': {'astroturf': 0.5, 'fake_follower': 0.8, 'financial': 0.2, 'other': 1.8, 'overall': 1.8, 'self_declared': 0.0, 'spammer': 0.2}, 'universal': {'astroturf': 0.4, 'fake_follower': 0.2, 'financial': 0.2, 'other': 0.4, 'overall': 0.8, 'self_declared': 0.0, 'spammer': 0.0}}, 'raw_scores': {'english': {'astroturf': 0.1, 'fake_follower': 0.16, 'financial': 0.05, 'other': 0.35, 'overall': 0.35, 'self_declared': 0.0, 'spammer': 0.04}, 'universal': {'astroturf': 0.07, 'fake_follower': 0.03, 'financial': 0.05, 'other': 0.09, 'overall': 0.16, 'self_declared': 0.0, 'spammer': 0.01}}, 'user': {'majority_lang': 'de', 'user_data': {'id_str': '123456', 'screen_name': 'beispiel01'}}}\n", "{'cap': {'english': 0.1000, 'universal': 0.225}, 'display_scores': {'english': {'astroturf': 0.5, 'fake_follower': 0.8, 'financial': 0.2, 'other': 1.8, 'overall': 1.8, 'self_declared': 0.0, 'spammer': 0.2}, 'universal': {'astroturf': 0.4, 'fake_follower': 0.2, 'financial': 0.2, 'other': 0.4, 'overall': 0.8, 'self_declared': 0.0, 'spammer': 0.0}}, 'raw_scores': {'english': {'astroturf': 0.1, 'fake_follower': 0.16, 'financial': 0.05, 'other': 0.35, 'overall': 0.35, 'self_declared': 0.0, 'spammer': 0.04}, 'universal': {'astroturf': 0.07, 'fake_follower': 0.03, 'financial': 0.05, 'other': 0.09, 'overall': 0.16, 'self_declared': 0.0, 'spammer': 0.01}}, 'user': {'majority_lang': 'de', 'user_data': {'id_str': '123456', 'screen_name': 'beispiel01'}}}"]
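To get real dictionaries rather than strings, one option is ast.literal_eval from the standard library, which parses Python literals such as these single-quoted records. The following is only a minimal sketch: the two sample lines are shortened copies of the record in the question, and it assumes every line of the file is a single dict literal.

```python
import ast
import json

# Shortened copies of the record from the question (the real lines have more keys).
lines = [
    "{'cap': {'english': 0.1000, 'universal': 0.225}, 'user': {'majority_lang': 'de'}}\n",
    "{'cap': {'english': 0.1000, 'universal': 0.225}, 'user': {'majority_lang': 'de'}}",
]

tweets_data = []
for line in lines:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    try:
        # literal_eval only parses Python literals; it does not execute code
        tweets_data.append(ast.literal_eval(line))
    except (ValueError, SyntaxError) as exc:
        # report bad lines instead of hiding them behind a bare except
        print(f"skipping unparsable line: {exc}")

# Each element is now a dict and can be re-serialised as valid JSON if needed
as_json = [json.dumps(record) for record in tweets_data]
print(type(tweets_data[0]), tweets_data[0]["cap"]["english"])
```

The resulting list of dicts can then be passed to pd.json_normalize(tweets_data), as in the question's own code.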

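Instead of loading the JSON into a dictionary and then converting that dictionary into a pandas DataFrame, you can use pandas' built-in reader. This assumes the file holds valid JSON (double-quoted keys and strings); pass lines=True when there is one JSON object per line:
import pandas as pd
df = pd.read_json(tweets_data_path, lines=True)
Alternatively, if the file is a single JSON document and you want to load it into a dictionary first and then convert it to a DataFrame:
import json
with open(tweets_data_path) as tweets_file:
    tweets_data = json.loads(tweets_file.read())
df = pd.DataFrame.from_dict(tweets_data, orient='columns')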

Related

Pandas .apply with conditional if in one column

I have the dataframe below. I am trying to check whether the vector column contains a 0 or a 1; if so,
add 10 and divide by the value plus 2, otherwise keep the vector unchanged.
df = pd.DataFrame({'user': ['user 1', 'user 2', 'user 3'],
                   'vector': [[0.01, 0.07, 0.0, 0.14, 0.0, 0.55, 0.11],
                              [0.12, 0.27, 0.1, 0.14, 0.1, 0.09, 0.19],
                              [0.58, 0.07, 0.02, 0.14, 0.04, 0.06, 1]]})
df
Output:
     user  vector
0  user 1  [0.01, 0.07, 0.0, 0.14, 0.0, 0.55, 0.11]
1  user 2  [0.12, 0.27, 0.1, 0.14, 0.1, 0.09, 0.19]
2  user 3  [0.58, 0.07, 0.02, 0.14, 0.04, 0.06, 1]
I used the following code:
df['vector']=df.apply(lambda x: x['vector']+10/(x['vector']+2) if x['vector']==0|1 else x['vector'], axis=1)
But the output is unchanged:
     user  vector
0  user 1  [0.01, 0.07, 0.0, 0.14, 0.0, 0.55, 0.11]
1  user 2  [0.12, 0.27, 0.1, 0.14, 0.1, 0.09, 0.19]
2  user 3  [0.58, 0.07, 0.02, 0.14, 0.04, 0.06, 1]
The expected output:
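In your lambda, 0|1 is a bitwise OR that evaluates to 1, so x['vector']==0|1 compares the whole list with 1. That is always False, so the vector is never modified. Test each element instead, using a list comprehension (faster than apply):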
df['vector'] = [[x+10/(x+2) if x in [0,1] else x for x in v] for v in df['vector']]
Output:
     user  vector
0  user 1  [0.01, 0.07, 5.0, 0.14, 5.0, 0.55, 0.11]
1  user 2  [0.12, 0.27, 0.1, 0.14, 0.1, 0.09, 0.19]
2  user 3  [0.58, 0.07, 0.02, 0.14, 0.04, 0.06, 4.333333333333334]
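The element-wise logic can also be tried without pandas. The sketch below uses plain lists; the function names transform and transform_grouped are made up for illustration. It also shows how operator precedence affects the formula: x+10/(x+2) divides only the 10, while (x+10)/(x+2) would divide the sum. Which of the two was intended is not clear from the question, so the second function is only a hypothetical variant.

```python
def transform(vector):
    """Replace each element equal to 0 or 1 by x + 10 / (x + 2); keep the rest."""
    return [x + 10 / (x + 2) if x in (0, 1) else x for x in vector]


def transform_grouped(vector):
    """Hypothetical variant: same test, but computing (x + 10) / (x + 2)."""
    return [(x + 10) / (x + 2) if x in (0, 1) else x for x in vector]


# 0 | 1 is a bitwise OR, so the original condition compared the list with 1
print(0 | 1)                       # 1
print([0.01, 0.0, 1] == (0 | 1))   # False: a list never equals an int

sample = [0.01, 0.0, 1]
print(transform(sample))
print(transform_grouped(sample))
```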

Convert list to string in Python

I have the following code:
round_subset_list = subset.round(2).values.tolist()
print(round_subset_list)
The result is
[0.47, -0.36, -0.5, 0.2, 0.35, 1.82, -0.78, -0.91, 0.36, -1.74, 0.24, 0.76, 0.57, 2.32, 1.55, -1.31, -0.09, -0.02, -0.07, -0.19, -0.25, -1.09, 0.64, 1.22, -0.56, 1.76, 0.13, 1.33, -0.74, -1.15, 1.63, 1.04, -0.26, 0.02, -1.2, 0.37, 0.43, 0.04, 1.34, 0.57, 0.76, -1.25, -0.05, 0.12, 0.8, -0.99, -0.11, -0.54, -0.08, -0.04, -0.76, -0.8, 0.35, 1.54, -0.99, -0.35, -0.28, 0.45, -0.04, -0.06, 0.02, 0.58, -0.32, -0.1, 0.28, 0.3, -0.36, 0.81, 0.79, 0.21, 1.81, 0.19, 0.84, 0.2, -0.06, -0.11, -1.4, -2.08, 0.88, -0.14, -0.96, 1.3, 0.06, -0.37, 1.49, -0.91, 1.14, -1.05, 1.49, -0.79, 2.02, 0.38, 2.4, 1.25, 0.5, 1.11, -0.54, -0.1, 0.63, 1.01]
I want to convert them into something that looks like ['0.47', '-0.36', '-0.5', '0.2', ...]
subset.string = ''.join(str(e) for e in round_subset_list)
print(subset.string)
The above code doesn't work.
Try this:
round_subset_list = [str(i) for i in round_subset_list]
You can use the map function:
round_subset_list = [0.47, -0.36, -0.5, 0.2, 0.35, 1.82, -0.78, -0.91, 0.36, -1.74, 0.24, 0.76, 0.57, 2.32, 1.55, -1.31, -0.09, -0.02, -0.07, -0.19, -0.25, -1.09, 0.64, 1.22, -0.56, 1.76, 0.13, 1.33, -0.74, -1.15, 1.63, 1.04, -0.26, 0.02, -1.2, 0.37, 0.43, 0.04, 1.34, 0.57, 0.76, -1.25, -0.05, 0.12, 0.8, -0.99, -0.11, -0.54, -0.08, -0.04, -0.76, -0.8, 0.35, 1.54, -0.99, -0.35, -0.28, 0.45, -0.04, -0.06, 0.02, 0.58, -0.32, -0.1, 0.28, 0.3, -0.36, 0.81, 0.79, 0.21, 1.81, 0.19, 0.84, 0.2, -0.06, -0.11, -1.4, -2.08, 0.88, -0.14, -0.96, 1.3, 0.06, -0.37, 1.49, -0.91, 1.14, -1.05, 1.49, -0.79, 2.02, 0.38, 2.4, 1.25, 0.5, 1.11, -0.54, -0.1, 0.63, 1.01]
print(list(map(str, round_subset_list)))
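To see why the ''.join attempt does not give the desired result, here is a small sketch using only the first four values from the question. It contrasts a single concatenated string with a list of strings.

```python
values = [0.47, -0.36, -0.5, 0.2]  # first few values from the question

joined = ''.join(str(v) for v in values)   # one string, numbers run together
as_strings = [str(v) for v in values]      # list of strings (what was asked for)
as_csv = ', '.join(as_strings)             # one comma-separated string

print(joined)      # 0.47-0.36-0.50.2
print(as_strings)  # ['0.47', '-0.36', '-0.5', '0.2']
print(as_csv)      # 0.47, -0.36, -0.5, 0.2
```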

How to loop over a dataframe and create list

I have the data below and want to loop through the dataframe, apply some conditions, and save the results in a list at the end. I am having trouble creating the list: I only get a single value instead of the two means I intend to get. If anybody knows a more effective way to solve this problem, please share.
dict = {'PassengerId' : [0.0, 0.001, 0.002, 0.003, 0.004, 0.006, 0.007, 0.008, 0.009, 0.01],
        'Survived' : [0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
        'Pclass' : [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.5],
        'Age' : [0.271, 0.472, 0.321, 0.435, 0.435, np.nan, 0.673, 0.02, 0.334, 0.171],
        'SibSp' : [0.125, 0.125, 0.0, 0.125, 0.0, 0.0, 0.0, 0.375, 0.0, 0.125],
        'Parch' : [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.167, 0.333, 0.0],
        'Fare' : [0.014, 0.139, 0.015, 0.104, 0.016, 0.017, 0.101, 0.041, 0.022, 0.059]}
import pandas as pd
dicts = pd.DataFrame(dicts, columns = dicts.keys())
def Mean(self):
    list_mean = []
    list_all = []
    for i, row in dicts.iterrows():
        if (row['Age'] > 0.2) & (row['Fare'] < 0.1):
            list_all.append(row['PassengerId'])
        elif (row['Age'] > 0.2) & (row['Fare'] > 0.1):
            list_all.clear()
            list_all.append(row['PassengerId'])
            return list_mean.append(np.mean(list_all))
Mean()
Help Please!!
You have to make some changes to your solution to resolve this issue. For a vectorized answer, check out the code section below.
1. The statement return list_mean should be placed at function level, after the loop, not inside the if block.
Change:
. . .
        if (row['Age'] > self.age) & (row['Fare'] < self.fare):
            list_mean.append(row['PassengerId'])
            return list_mean
. . .
To:
. . .
    list_mean = []
    for i, row in dicts.iterrows():
        if (row['Age'] > self.age) & (row['Fare'] < self.fare):
            list_mean.append(row['PassengerId'])
    return list_mean
. . .
Code (vectorized solution). There is no need to define an explicit class for this:
import numpy as np
dict_ = {
    'PassengerId':
        [0.0, 0.001, 0.002, 0.003, 0.004, 0.006, 0.007, 0.008, 0.009, 0.01],
    'Survived': [0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
    'Pclass': [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.5],
    'Age':
        [0.271, 0.472, 0.321, 0.435, 0.435, np.nan, 0.673, 0.02, 0.334, 0.171],
    'SibSp': [0.125, 0.125, 0.0, 0.125, 0.0, 0.0, 0.0, 0.375, 0.0, 0.125],
    'Parch': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.167, 0.333, 0.0],
    'Fare':
        [0.014, 0.139, 0.015, 0.104, 0.016, 0.017, 0.101, 0.041, 0.022, 0.059]
}
import pandas as pd
dicts = pd.DataFrame(dict_, columns=dict_.keys())
# Boolean masks select the PassengerId values that satisfy each condition
l1 = dicts['PassengerId'][np.logical_and(dicts['Age'] > 0.2, dicts['Fare'] < 0.1)]
l2 = dicts['PassengerId'][np.logical_and(dicts['Age'] > 0.2, dicts['Fare'] > 0.1)]
print( (sum(list(l1))/len(l1), sum(list(l2))/len(l2)) )
Output:
(0.00375, 0.0036666666666666666)
import pandas as pd
import numpy as np
dict = {'PassengerId' : [0.0, 0.001, 0.002, 0.003, 0.004, 0.006, 0.007, 0.008, 0.009, 0.01],
        'Survived' : [0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
        'Pclass' : [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.5],
        'Age' : [0.271, 0.472, 0.321, 0.435, 0.435, np.nan, 0.673, 0.02, 0.334, 0.171],
        'SibSp' : [0.125, 0.125, 0.0, 0.125, 0.0, 0.0, 0.0, 0.375, 0.0, 0.125],
        'Parch' : [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.167, 0.333, 0.0],
        'Fare' : [0.014, 0.139, 0.015, 0.104, 0.016, 0.017, 0.101, 0.041, 0.022, 0.059]}
df = pd.DataFrame(dict, columns = dict.keys())
def calculate_mean():
    l1, l2 = [], []
    for i, row in df.iterrows():
        if row['Age'] > 0.2 and row['Fare'] < 0.1:
            l1.append(row['PassengerId'])
        elif row['Age'] > 0.2 and row['Fare'] > 0.1:
            l2.append(row['PassengerId'])
    return np.mean(l1), np.mean(l2)
print(calculate_mean()) # (0.00375, 0.0036666666666666666)

Computation within a list of dictionaries

I have a list of dictionaries:
wt
Out[189]:
[defaultdict(int,
{'A01': 0.15,
'A02': 0.17,
'A03': 0.13,
'A04': 0.17,
'A05': 0.01,
'A06': 0.12,
'A07': 0.15,
'A08': 0.0,
'A09': 0.02,
'A10': 0.09}),
defaultdict(int,
{'A01': 0.02,
'A02': 0.02,
'A03': 0.06,
'A04': 0.08,
'A05': 0.08,
'A06': 0.04,
'A07': 0.02,
'A08': 0.24,
'A09': 0.34,
'A10': 0.1}),
defaultdict(int,
{'A01': 0.0,
'A02': 0.12,
'A03': 0.01,
'A04': 0.01,
'A05': 0.11,
'A06': 0.13,
'A07': 0.1,
'A08': 0.36,
'A09': 0.13,
'A10': 0.03})]
And I have another dictionary:
zz
Out[188]: defaultdict(int, {'S1': 0.44, 'S2': 0.44, 'S3': 0.12})
I need to run a loop to aggregate the following computation:
'S1':0.44 * 'A01':0.15 + 'S2':0.44 * 'A01':0.02 + 'S3':0.12 * 'A01':0.00 ----- to be stored in a dict with the key 'A01'
'S1':0.44 * 'A02':0.17 + 'S2':0.44 * 'A02':0.02 + 'S3':0.12 * 'A02':0.12 ----- to be stored in a dict with the key 'A02'
.
.
and so on up to:
'S1':0.44 * 'A10':0.09 + 'S2':0.44 * 'A10':0.1 + 'S3':0.12 * 'A10':0.03 ----- to be stored in a dict with the key 'A10'
Can somebody please suggest a loop for this? The issue I'm facing is that:
wt[0]
Out[197]:
defaultdict(int,
{'A01': 0.15,
'A02': 0.17,
'A03': 0.13,
'A04': 0.17,
'A05': 0.01,
'A06': 0.12,
'A07': 0.15,
'A08': 0.0,
'A09': 0.02,
'A10': 0.09})
But:
wt[0][0]
Out[199]: 0
I'm not able to access each value within the dict.
The keys of each dict are strings such as 'A01', so wt[0][0] looks up the missing key 0, and a defaultdict(int) returns the default 0 for it; use wt[0]['A01'] instead. You can do your aggregation with a dict comprehension:
from collections import defaultdict
x = [defaultdict(int, {'A01': 0.15, 'A02': 0.17, 'A03': 0.13, 'A04': 0.17, 'A05': 0.01, 'A06': 0.12, 'A07': 0.15, 'A08': 0.0, 'A09': 0.02, 'A10': 0.09}),
     defaultdict(int, {'A01': 0.02, 'A02': 0.02, 'A03': 0.06, 'A04': 0.08, 'A05': 0.08, 'A06': 0.04, 'A07': 0.02, 'A08': 0.24, 'A09': 0.34, 'A10': 0.1}),
     defaultdict(int, {'A01': 0.0, 'A02': 0.12, 'A03': 0.01, 'A04': 0.01, 'A05': 0.11, 'A06': 0.13, 'A07': 0.1, 'A08': 0.36, 'A09': 0.13, 'A10': 0.03})]
mult = defaultdict(int, {'S1': 0.44, 'S2': 0.44, 'S3': 0.12})
result = {k: sum(d[k] * mult['S'+str(idx+1)]
                 for idx, d in enumerate(x)) for k in x[0].keys()}
If you want to multiply your matrix with a vector, you should try numpy:
import numpy as np
# Transform data to matrix
x = np.array([[d['A'+str(i+1).zfill(2)] for i in range(len(d))] for d in x])
v = np.array([mult['S'+str(i+1)] for i in range(len(mult))]).reshape(1, 3)
print(np.matmul(v, x))
# [[0.0748 0.098 0.0848 0.1112 0.0528 0.086 0.0868 0.1488 0.174 0.0872]]
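For readers who prefer an explicit loop, as the question asked, the following is a minimal sketch using only the standard library. It uses shortened copies of the question's data (keys A01 and A02 only) and assumes that weight S1 belongs to the first dict in the list, S2 to the second, and so on. It also illustrates the defaultdict behaviour behind the wt[0][0] result: indexing a missing key returns the default and inserts that key, whereas .get() does not.

```python
from collections import defaultdict

# Shortened copies of the question's data (only keys A01 and A02)
wt = [
    defaultdict(int, {'A01': 0.15, 'A02': 0.17}),
    defaultdict(int, {'A01': 0.02, 'A02': 0.02}),
    defaultdict(int, {'A01': 0.0, 'A02': 0.12}),
]
zz = defaultdict(int, {'S1': 0.44, 'S2': 0.44, 'S3': 0.12})

# .get() reads without inserting; wt[0][0] would return 0 and also add key 0
missing = wt[0].get(0)

# Pair the i-th weight (S1, S2, S3) with the i-th dict, assuming that order
weights = [zz[f'S{i + 1}'] for i in range(len(wt))]

result = {}
for key in wt[0]:
    # weighted sum of this key's value across all dicts
    result[key] = sum(w * d[key] for w, d in zip(weights, wt))

print({k: round(v, 4) for k, v in result.items()})
```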

How to average between values in two files?

I have two files containing matrices that look like this:
File1:
{'key1',g,l,i,o,+: [0.0, 0.0, 0.92, 0.02, 0.01],'key2',g,l,i,o,+: [0.1, 0.2, 0.90,
0.26, 0.10].....'key100',g,l,i,o,+: [0.1, 0.1, 0.29, 0.19, 0.20]}
File2:
{'key1',g,l,i,o,+: [0.0, 0.0, 0.96, 0.06, 0.01],'key2',g,l,i,o,+: [0.0, 0.1, 0.95,
0.26, 0.11].....'key100',g,l,i,o,+: [0.2, 0.0, 0.23, 0.16, 0.21]}
Both files have the same 'keys'. I want to average the values between the two files, so the result file looks like this:
Desired output file:
{'key1',g,l,i,o,+: [0.0, 0.0, 0.94, 0.04, 0.01],'key2',g,l,i,o,+: [0.05, 0.15, 0.925,
0.26, 0.105].....'key100',g,l,i,o,+: [0.15, 0.1, 0.29, 0.175, 0.205]}
I have thought about the python script I could write, but since I am quite new to this, any quick ideas would be welcome:
import gzip
import numpy as np
inFile1 = gzip.open('/home/file1')
inFile2 = gzip.open('/home/file2')
inFile.next()
for line in inFile:
    cols = line.strip().split('\t')
    data = cols[6:]
for line in inFile2:
    cols = line.strip().split('\t')
    data2 = cols[6:]
newdata = (data + data2)/2
You could use regex to replace the strings and make it JSON compatible. Then you can easily convert it into a dict and then just use normal python to analyse the data (compare the dicts):
import re
import json
s = '''{'key1',g,l,i,o,+: [0.0, 0.0, 0.92, 0.02, 0.01],'key2',g,l,i,o,+: [0.1, 0.2, 0.90,
0.26, 0.10],'key100',g,l,i,o,+: [0.1, 0.1, 0.29, 0.19, 0.20]}'''
s2 = re.sub(r"'(key\d+)',g,l,i,o,\+", r'"\1"', s)  # raw string avoids invalid-escape warnings
print(s2)
d = json.loads(s2)
print(d)
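Once both files are parsed into dicts, the averaging itself is a short comprehension. The sketch below is one possible way to do it, under two assumptions: the file contents have already been read into strings (file1_text and file2_text are shortened samples in the question's format), and the helper name parse is made up for illustration.

```python
import json
import re


def parse(text):
    """Turn the question's pseudo-dict text into a real dict via JSON."""
    cleaned = re.sub(r"'(key\d+)',g,l,i,o,\+", r'"\1"', text)
    return json.loads(cleaned)


# Shortened samples in the question's format
file1_text = "{'key1',g,l,i,o,+: [0.0, 0.0, 0.92, 0.02, 0.01],'key2',g,l,i,o,+: [0.1, 0.2, 0.90, 0.26, 0.10]}"
file2_text = "{'key1',g,l,i,o,+: [0.0, 0.0, 0.96, 0.06, 0.01],'key2',g,l,i,o,+: [0.0, 0.1, 0.95, 0.26, 0.11]}"

d1, d2 = parse(file1_text), parse(file2_text)

# Element-wise mean for every key present in both files
averaged = {
    key: [(a + b) / 2 for a, b in zip(d1[key], d2[key])]
    for key in d1
    if key in d2
}
print(averaged)
```

Note that the keys in the result are the plain names (key1, key2); the ,g,l,i,o,+ suffix would have to be added back when writing the output file.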
The problem is your data format, as Wodin commented:
what is this format? It looks a bit like a Python dict, but the
,g,l,i,o,+ doesn't make sense for a dict.
I tried it with your data; you can take some hints from the code below. My test files were:
File1.txt:
{'key1',g,l,i,o,+: [0.0, 0.0, 0.92, 0.02, 0.01],'key2',g,l,i,o,+: [0.1, 0.2, 0.90,0.26, 0.10]}
{'key3',g,l,i,o,+: [0.0, 0.0, 0.98, 0.02, 0.01],'key4',g,l,i,o,+: [0.1, 0.2, 0.90,0.268, 0.10]}
File2.txt:
{'key1',g,l,i,o,+: [0.0, 0.0, 0.96, 0.06, 0.01],'key2',g,l,i,o,+: [0.0, 0.1, 0.95,0.26, 0.11]}
{'key3',g,l,i,o,+: [0.0, 0.0, 0.98, 0.02, 0.01],'key4',g,l,i,o,+: [0.1, 0.2, 0.90,0.268, 0.10]}
Code:
import ast
import re
pattern = r"('key\w+',g,l,i,o,\+):\s(\[.+?\])"
with open('File1.txt', 'r') as f:
    for line in f:
        average = {}
        for find in re.finditer(pattern, line):
            # rescan File2.txt for the entry with the same key
            with open('File2.txt', 'r') as ff:
                for line2 in ff:
                    for find1 in re.finditer(pattern, line2):
                        if find.group(1) == find1.group(1):
                            # literal_eval parses the list text without executing code
                            pairs = zip(ast.literal_eval(find.group(2)), ast.literal_eval(find1.group(2)))
                            average[find.group(1)] = [sum(p) / len(p) for p in pairs]
        print(average)
Output:
{"'key2',g,l,i,o,+": [0.05, 0.15000000000000002, 0.925, 0.26, 0.10500000000000001], "'key1',g,l,i,o,+": [0.0, 0.0, 0.94, 0.04, 0.01]}
{"'key3',g,l,i,o,+": [0.0, 0.0, 0.98, 0.02, 0.01], "'key4',g,l,i,o,+": [0.1, 0.2, 0.9, 0.268, 0.1]}
