How to create a dictionary from a dictionary in Python?

Date Subject Maths Science English French Spanish German
16:00:00 Uploaded 100 95 65 32 23 45
17:00:00 Unknown 45 53 76 78 54 78
18:00:00 Captured 43 45 56 76 98 34
Date BoardType Maths Science English French Spanish German
16:00:00 CBSE 50 95 65 32 23 45
17:00:00 NTSE 45 53 76 78 54 78
18:00:00 YTTB 100 45 56 76 98 34
I have these two tables in a text file called dataVal.txt.
I want the output to look something like this:
'Subject': {'Uploaded': {'16:00:00': {'Maths': '100', 'Science': '95', ...}}}
Basically, 'Subject' is the main key for the first table and has 'Uploaded' as its value. 'Uploaded' then becomes a key with '16:00:00' as its value, and that in turn becomes a key whose values are Maths, Science, English and so on, which further have their own values (100, 95, 65 and so on).
dic = dict()
with open('C:\\Users\\aman.seth\\Documents\\dataVal.txt', 'r') as fh:
    for l in fh.readlines():
        try:
            lines = l.split('\t')
            date, sub, num = lines[0], lines[1], [str(x) for x in lines[2:]]
            dic.setdefault(sub, {})
            dic[sub][date] = num
        except Exception as er:
            print(er)
print(dic)
This is what I have done so far, but I don't think it is complete or accurate.

Please try this:
import re
import pprint

dic = dict()
with open('txt', 'r') as fh:
    memory = None    # key column of the table being read ('Subject' or 'BoardType')
    mark_index = []  # subject names taken from the current header row
    for line in fh:
        if not line.split():
            continue  # skip blank lines
        try:
            match = re.search('(BoardType|Subject)', line)
            if match:
                # Header row: start a new top-level entry
                memory = match.group(1)
                dic.setdefault(memory, {})
                mark_index = line.split()[2:]
            else:
                # Data row: split() without arguments also drops the trailing newline
                fields = line.split()
                date, sub, num = fields[0], fields[1], fields[2:]
                dic[memory].setdefault(sub, {})
                dic[memory][sub][date] = dict(zip(mark_index, num))
        except Exception as error:
            print('Error: ', error)

pprint.pprint(dic)
Output:
{'BoardType': {'CBSE': {'16:00:00': {'English': '65',
                                     'French': '32',
                                     'German': '45',
                                     'Maths': '50',
                                     'Science': '95',
                                     'Spanish': '23'}},
               'NTSE': {'17:00:00': {'English': '76',
                                     'French': '78',
                                     'German': '78',
                                     'Maths': '45',
                                     'Science': '53',
                                     'Spanish': '54'}},
               'YTTB': {'18:00:00': {'English': '56',
                                     'French': '76',
                                     'German': '34',
                                     'Maths': '100',
                                     'Science': '45',
                                     'Spanish': '98'}}},
 'Subject': {'Captured': {'18:00:00': {'English': '56',
                                       'French': '76',
                                       'German': '34',
                                       'Maths': '43',
                                       'Science': '45',
                                       'Spanish': '98'}},
             'Unknown': {'17:00:00': {'English': '76',
                                      'French': '78',
                                      'German': '78',
                                      'Maths': '45',
                                      'Science': '53',
                                      'Spanish': '54'}},
             'Uploaded': {'16:00:00': {'English': '65',
                                       'French': '32',
                                       'German': '45',
                                       'Maths': '100',
                                       'Science': '95',
                                       'Spanish': '23'}}}}
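
The same parsing idea can be sketched without touching the file system, which makes it easier to experiment with. This is only an illustration under assumptions: the function name parse_tables, the key_columns parameter and the shortened SAMPLE text are hypothetical, and the marks are converted to int here, which the answer above does not do.

```python
import io

# Shortened, hypothetical sample in the same layout as dataVal.txt
SAMPLE = """\
Date Subject Maths Science
16:00:00 Uploaded 100 95
17:00:00 Unknown 45 53
Date BoardType Maths Science
16:00:00 CBSE 50 95
"""


def parse_tables(fh, key_columns=("Subject", "BoardType")):
    """Build {key_column: {row_label: {date: {subject: mark}}}}."""
    result = {}
    current = None   # key column of the table currently being read
    subjects = []    # subject names from the latest header row
    for line in fh:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) > 1 and fields[1] in key_columns:
            # Header row: remember which table we are in and its subjects
            current = fields[1]
            subjects = fields[2:]
            result.setdefault(current, {})
        elif current is not None:
            # Data row: pair each subject with its mark (as int)
            date, label, marks = fields[0], fields[1], fields[2:]
            result[current].setdefault(label, {})[date] = dict(
                zip(subjects, map(int, marks))
            )
    return result


tables = parse_tables(io.StringIO(SAMPLE))
print(tables["Subject"]["Uploaded"]["16:00:00"])
```

To read the real file, the io.StringIO object could be replaced by an open file handle.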

Related

Pandas: Flatten Nested Dictionary vertically

I have a list of dictionaries as below:
[{'name': 'jack', 'tagList': [{'tagId': '10', 'tagName': 'AB'},{'tagId': '20',
'tagName': 'BC'}]},
{'name': 'mike', 'tagList': [{'tagId': '30', 'tagName': 'DE'},{'tagId': '40',
'tagName': 'FG'}]}]
I want to turn this into a dataframe like the one below:
Name  tagList_tagID  tagList_tagName
Jack  10             AB
Jack  20             BC
mike  30             DE
mike  40             FG
How can I convert this list of dictionaries to a pandas dataframe in an efficient way?
Try pd.json_normalize:
import pandas as pd

lst = [{'name': 'jack', 'tagList': [{'tagId': '10', 'tagName': 'AB'},
                                    {'tagId': '20', 'tagName': 'BC'}]},
       {'name': 'mike', 'tagList': [{'tagId': '30', 'tagName': 'DE'},
                                    {'tagId': '40', 'tagName': 'FG'}]}]
df = pd.json_normalize(lst, record_path="tagList", meta=["name"])
# formatting to match the expected output
df = df.set_index("name").add_prefix("tagList_")
>>> df
     tagList_tagId tagList_tagName
name
jack            10              AB
jack            20              BC
mike            30              DE
mike            40              FG
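
To see what the record_path and meta arguments are doing, the flattening can be sketched in plain Python. This is not how pandas implements it; the helper name flatten_tags is hypothetical and exists only for illustration.

```python
records = [
    {'name': 'jack', 'tagList': [{'tagId': '10', 'tagName': 'AB'},
                                 {'tagId': '20', 'tagName': 'BC'}]},
    {'name': 'mike', 'tagList': [{'tagId': '30', 'tagName': 'DE'},
                                 {'tagId': '40', 'tagName': 'FG'}]},
]


def flatten_tags(items, record_path='tagList', meta='name'):
    """Emit one flat row per nested record, repeating the parent's meta value."""
    rows = []
    for item in items:
        for child in item[record_path]:
            row = {meta: item[meta]}
            # Prefix the nested keys, e.g. tagId -> tagList_tagId
            row.update({f"{record_path}_{key}": value for key, value in child.items()})
            rows.append(row)
    return rows


rows = flatten_tags(records)
for row in rows:
    print(row)
```

A list of flat dictionaries like this is a shape that pd.DataFrame accepts directly.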

Python, Take Multiple Lists and Putting into pd.Dataframe

I have seen a variety of answers to this question (like this one), and have had no success in getting my lists into one dataframe. I have one header list (meant to be column headers), and then a variable that has multiple records in it:
list1 = ['Rank', 'Athlete', 'Distance', 'Runs', 'Longest', 'Avg. Pace', 'Elev. Gain']
list2 = (['1', 'Jack', '57.4 km', '4', '21.7 km', '5:57 /km', '994 m']
['2', 'Jill', '34.0 km', '2', '17.9 km', '5:27 /km', '152 m']
['3', 'Kelsey', '32.6 km', '2', '21.3 km', '5:46 /km', '141 m'])
When I try something like:
df = pd.DataFrame(list(zip(['1', 'Jack', '57.4 km', '4', '21.7 km', '5:57 /km', '994 m'],
                           ['2', 'Jill', '34.0 km', '2', '17.9 km', '5:27 /km', '152 m'])))
It lists all the attributes as their own rows, like so:
          0         1
0         1         2
1      Jack      Jill
2   57.4 km   34.0 km
3         4         2
4   21.7 km   17.9 km
5  5:57 /km  5:27 /km
6     994 m     152 m
How do I get this into a frame that has list1 as the headers, and the rest of the data neatly squared away?
Given
list1 = ['Rank', 'Athlete', 'Distance', 'Runs', 'Longest', 'Avg. Pace', 'Elev. Gain']
list2 = (['1', 'Jack', '57.4 km', '4', '21.7 km', '5:57 /km', '994 m'],
['2', 'Jill', '34.0 km', '2', '17.9 km', '5:27 /km', '152 m'],
['3', 'Kelsey', '32.6 km', '2', '21.3 km', '5:46 /km', '141 m'])
do
pd.DataFrame(list2, columns=list1)
which returns
  Rank Athlete Distance Runs  Longest Avg. Pace Elev. Gain
0    1    Jack  57.4 km    4  21.7 km  5:57 /km      994 m
1    2    Jill  34.0 km    2  17.9 km  5:27 /km      152 m
2    3  Kelsey  32.6 km    2  21.3 km  5:46 /km      141 m
Change your second list into a list of lists (with commas between the inner lists) and then:
df = pd.DataFrame(columns = list1, data = list2)
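
The transposed result in the question comes from zip itself, which pairs up the items of its arguments position by position. A small sketch in plain Python, using shortened versions of list1 and list2 (the shortening is only for readability), shows the difference between zipping rows together and zipping the headers with each row:

```python
list1 = ['Rank', 'Athlete', 'Distance']
list2 = [['1', 'Jack', '57.4 km'],
         ['2', 'Jill', '34.0 km']]

# Zipping the rows together pairs their i-th items, so each tuple is a column.
# This is why the attempt in the question came out transposed.
columns = list(zip(*list2))

# Zipping the headers with ONE row pairs each header with that row's value.
records = [dict(zip(list1, row)) for row in list2]

print(columns)
print(records)
```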

Pandas clean messy data [closed]

I am working on some real-world data about people, and the Age column of the DataFrame is really messy.
The expected output is an age_bins column with bins in the range [0,10,20,30,40,50,60,70,80,90,100].
What's the best way to clean messy data like this?
df = pd.DataFrame({'Age':['23', '64', '71', '53', '40', '45', '30-39', '50-59', '60-69',
'30', '65', '44', '8-68', '21-72', '26', '36', '43', '70', '52',
'66', '27', '17', '51', '68', '35', '28', '58', '33', '31', '50',
'24', '88', '29', '21', '78', '60', '63', '37', '32', '49',
'20-29', '47', '18-99', '41', '39', '42', '38', '7', '40-49', '82',
'61', '34-66', '62', '40-89', '80-89', '55', '0.25', '13-19', '69',
'16', '8', '10', '25', '34', '55-74', '75-', '70-79', '79',
'35-54', '55-', '95', '54', '40-50', '46', '48', '57', '56']})
You can remove a possible trailing - with Series.str.strip, split the values into two columns with Series.str.split, and then apply cut to each column:
df1 = df['Age'].str.strip('-').str.split('-', expand=True).astype(float)
bins = [0,10,20,30,40,50,60,70,80,90,100]
labels = ['{}-{}'.format(i, j-1) for i, j in zip(bins[:-1], bins[1:])]
g1 = pd.cut(df1[0], bins=bins, right=False, labels=labels)
g2 = pd.cut(df1[1], bins=bins, right=False, labels=labels)
Then compare the two results. Missing values in the second Series are first filled from the first one; where the bins match, the bin is kept in the new column, and where they differ, Series.mask replaces it with NaN:
df['age_bins'] = g1.mask(g1.ne(g2.fillna(g1)))
print(df)
      Age age_bins
0      23    20-29
1      64    60-69
2      71    70-79
3      53    50-59
4      40    40-49
..    ...      ...
72  40-50      NaN
73     46    40-49
74     48    40-49
75     57    50-59
76     56    50-59

[77 rows x 2 columns]
Values that did not match:
df1 = df[df['age_bins'].isna()]
print(df1)
      Age age_bins
12   8-68      NaN
13  21-72      NaN
42  18-99      NaN
51  34-66      NaN
53  40-89      NaN
64  55-74      NaN
68  35-54      NaN
72  40-50      NaN
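
The rule applied above (keep the bin only when both ends of a range fall into the same decade) can also be sketched for a single value in plain Python. The function name age_bin and its width and upper parameters are hypothetical; this illustrates the logic and is not a replacement for the vectorised pandas version.

```python
def age_bin(raw, width=10, upper=100):
    """Return a label such as '20-29', or None if the value spans several bins."""
    # '75-' -> ['75'], '30-39' -> ['30', '39'], '23' -> ['23']
    parts = [p for p in raw.strip('-').split('-') if p]
    try:
        values = [float(p) for p in parts]
    except ValueError:
        return None  # not a number at all
    if not values:
        return None
    # Lower edge of the bin each value falls into (left-closed, like right=False)
    starts = {int(v // width) * width for v in values}
    if len(starts) != 1:
        return None  # the two ends land in different bins
    start = starts.pop()
    if not 0 <= start < upper:
        return None  # outside the 0..100 range
    return f"{start}-{start + width - 1}"


for sample in ['23', '30-39', '75-', '0.25', '40-50', '8-68']:
    print(sample, '->', age_bin(sample))
```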

Attempting to grab certain Elements

I am new to the lxml module in Python.
I am trying to parse data from a website: https://weather.com/weather/tenday/l/USCA1037:1:US
I am trying to grab the text of:
<span classname="narrative" class="narrative">
Cloudy. Low 49F. Winds WNW at 10 to 20 mph.
</span>
However, I am getting my xpath all mixed up.
To be exact, the location of this line is
//*[@id="twc-scrollabe"]/table/tbody/tr[4]/td[2]/span
I've attempted the following:
import requests
import lxml.html
from lxml import etree

html = requests.get("https://weather.com/weather/tenday/l/USCA1037:1:US")
element_object = lxml.html.fromstring(html.content)  # html.content is bytes; fromstring returns an HtmlElement
# element_object has root of <html>
table = element_object.xpath('//div[@class="twc-table-scroller"]')[0]
day_of_week = table.xpath('.//span[@class="date-time"]/text()')  # returns list of items from "date-time"
dates = table.xpath('.//span[@class="day-detail clearfix"]/text()')
td = table.xpath('.//tbody/tr/td/span[contains(@class, "narrative")]')
print(td)
# print(td) displays an empty list.
I would like my program to also parse "Cloudy. Low 49F. Winds WNW at 10 to 20 mph."
Please help...
Some <td> elements have a title attribute that contains the description:
import requests
import lxml.html

html = requests.get("https://weather.com/weather/tenday/l/USCA1037:1:US")
element_object = lxml.html.fromstring(html.content)
table = element_object.xpath('//div[@class="twc-table-scroller"]')[0]
td = table.xpath('.//tr/td[@class="twc-sticky-col"]/@title')
print(td)
Result
['Mostly cloudy skies early, then partly cloudy after midnight. Low 48F. Winds SSW at 5 to 10 mph.',
'Mainly sunny. High 66F. Winds WNW at 5 to 10 mph.',
'Sunny. High 71F. Winds NW at 5 to 10 mph.',
'A mainly sunny sky. High 69F. Winds W at 5 to 10 mph.',
'Some clouds in the morning will give way to mainly sunny skies for the afternoon. High 67F. Winds WSW at 5 to 10 mph.',
'Considerable clouds early. Some decrease in clouds later in the day. High 67F. Winds WSW at 5 to 10 mph.',
'Partly cloudy. High near 65F. Winds WSW at 5 to 10 mph.',
'Cloudy skies early, then partly cloudy in the afternoon. High 61F. Winds WSW at 10 to 20 mph.',
'Sunny skies. High 62F. Winds WNW at 10 to 20 mph.',
'Mainly sunny. High 61F. Winds WNW at 10 to 20 mph.',
'Sunny along with a few clouds. High 64F. Winds WNW at 10 to 15 mph.',
'Mostly sunny skies. High around 65F. Winds WNW at 10 to 15 mph.',
'Mostly sunny skies. High 66F. Winds WNW at 10 to 20 mph.',
'Mainly sunny. High around 65F. Winds WNW at 10 to 20 mph.',
'A mainly sunny sky. High around 65F. Winds WNW at 10 to 20 mph.']
There is no <tbody> in the HTML itself, although a web browser may display one in DevTools, so don't use tbody in the XPath.
Some text is in <span></span>, but some is in <span><span></span></span>:
import requests
import lxml.html

html = requests.get("https://weather.com/weather/tenday/l/USCA1037:1:US")
element_object = lxml.html.fromstring(html.content)
table = element_object.xpath('//div[@class="twc-table-scroller"]')[0]
td = table.xpath('.//tr/td//span/text()')
print(td)
Result
['Tonight', 'APR 21', 'Partly Cloudy', '--', '48', '10', '%', 'SSW 7 mph ', '85', '%',
'Mon', 'APR 22', 'Sunny', '66', '51', '10', '%', 'WNW 9 mph ', '67', '%',
'Tue', 'APR 23', 'Sunny', '71', '53', '0', '%', 'NW 8 mph ', '59', '%',
'Wed', 'APR 24', 'Sunny', '69', '52', '10', '%', 'W 9 mph ', '71', '%',
'Thu', 'APR 25', 'Partly Cloudy', '67', '51', '10', '%', 'WSW 9 mph ', '71', '%',
'Fri', 'APR 26', 'Partly Cloudy', '67', '51', '10', '%', 'WSW 9 mph ', '69', '%',
'Sat', 'APR 27', 'Partly Cloudy', '65', '50', '10', '%', 'WSW 9 mph ', '71', '%',
'Sun', 'APR 28', 'AM Clouds/PM Sun', '61', '49', '20', '%', 'WSW 13 mph ', '75', '%',
'Mon', 'APR 29', 'Sunny', '62', '48', '10', '%', 'WNW 14 mph ', '63', '%',
'Tue', 'APR 30', 'Sunny', '61', '49', '0', '%', 'WNW 14 mph ', '61', '%',
'Wed', 'MAY 1', 'Mostly Sunny', '64', '50', '0', '%', 'WNW 12 mph ', '60', '%',
'Thu', 'MAY 2', 'Mostly Sunny', '65', '50', '0', '%', 'WNW 12 mph ', '61', '%',
'Fri', 'MAY 3', 'Mostly Sunny', '66', '51', '0', '%', 'WNW 13 mph ', '61', '%',
'Sat', 'MAY 4', 'Sunny', '65', '51', '0', '%', 'WNW 14 mph ', '62', '%',
'Sun', 'MAY 5', 'Sunny', '65', '51', '0', '%', 'WNW 14 mph ', '63', '%']
If you want to grab text like "Sunny. High 66F. Winds WNW at 5 to 10 mph.", you can get it from the title attributes of the <td> elements.
This should work:
td = table.xpath('.//tbody/tr/td[@class="description"]/@title')
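
The attribute predicate ([@class=...]) and the missing <tbody> can be tried offline on a small snippet. Note the assumptions: this sketch uses the standard library's xml.etree.ElementTree instead of lxml, so only a limited XPath subset is available (no /@title or text() steps), and the SNIPPET markup is a hypothetical, well-formed stand-in for the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the forecast table (no <tbody>)
SNIPPET = """
<div class="twc-table-scroller">
  <table>
    <tr>
      <td class="description" title="Cloudy. Low 49F. Winds WNW at 10 to 20 mph.">
        <span>Cloudy</span>
      </td>
    </tr>
    <tr>
      <td class="description" title="Mainly sunny. High 66F. Winds WNW at 5 to 10 mph.">
        <span><span>Sunny</span></span>
      </td>
    </tr>
  </table>
</div>
"""

root = ET.fromstring(SNIPPET.strip())

# Select cells by attribute value, then read their title attribute
titles = [td.get("title") for td in root.findall(".//td[@class='description']")]

# itertext() collects text from nested spans at any depth
labels = ["".join(td.itertext()).strip() for td in root.findall(".//td")]

# A path that goes through tbody matches nothing, because the markup has none
no_tbody = root.findall(".//tbody/tr")

print(titles)
print(labels)
print(no_tbody)
```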

Extract data of a column by matching data of other column in Python

I want to extract all MaxTemp for year 2010
Year Month Day MaxTemp MinTemp
2010 1 1 -19.9 -37.2
2010 1 2 -13.8 -20
2010 1 3 -13.1 -15.9
2010 1 4 -12 -22.3
2010 1 5 -11.8 -14.4
2010 1 6 -14.3 -32.5
2010 1 7 -28.2 -37.3
2011 1 8 -21.9 -31.3
2011 1 9 -7.4 -22.8
2011 1 10 -6.6 -15.3
2011 1 11 -0.7 -15.2
2011 1 12 4.3 -5.8
My current code is:
import csv

with open('data/' + file_name, newline='') as files:
    val = list(csv.reader(files))[1:]
specific output
[['2010', '01', '01', '9.6', '5.8'], ['2010', '01', '02', '8.6', '6.2'], ['2010', '01', '03', '8.8', '6.0'], ['2010', '01', '04', '6.8', '5.6'], ['2010', '01', '05', '9.0', '4.4'], ['2010', '01', '06', '8.1', '1.0'], ['2010', '01', '07', '6.3', '0.9'], ['2010', '01', '08', '7.8', '4.2'], ['2010', '01', '09', '10.4', '7.5'], ['2010', '01', '10', '11.5', '7.9'], ['2010', '01', '11', '11.9', '8.9']]
This extracts the whole CSV without the header. How can I extract all MaxTemp values for the year 2010 only? I would like to pass the year as an argument. Thanks very much.
As rofls said, but with a comprehension:
with open('data/' + file_name, newline='') as files:
    data = [x[3] for x in csv.reader(files) if str(x[0]) == '2010']
Here you go:
maxList = list()
for row in val:
    rowInStr = ''.join(row)
    rowSplitList = rowInStr.split(" ")
    if rowSplitList[0] == "2010":
        # Drop empty strings left by repeated spaces; a real list is needed for indexing
        rowSplitList = [a for a in rowSplitList if a != '']
        maxList.append(rowSplitList[-2])
Output:
['-19.9', '-13.8', '-13.1', '-12', '-11.8', '-14.3', '-28.2', '-21.9', '-7.4', '-6.6', '-0.7', '4.3']
Please use proper indentation. I prefer tab for indentation.
Hope this will help.
You can do it like this; data will then be a list of MaxTemps from 2010:
with open('data/' + file_name, newline='') as files:
    data = list()
    reader = csv.reader(files)
    next(reader)  # skip the header row, which int() cannot convert
    for line in reader:
        if int(line[0]) == 2010:
            data.append(line[3])
It would be better to use something like pandas and filter the data that way. If it's not a huge amount of data, though, this approach should be fine.
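
Since the question asks for the year to be passed as an argument, the loop above could be wrapped in a function. A minimal sketch, assuming the file is comma-separated and has the header row shown in the question; the function name max_temps_for_year and the shortened SAMPLE_CSV are hypothetical, and io.StringIO stands in for the real file.

```python
import csv
import io

# Shortened, hypothetical sample; the real data would come from open(...)
SAMPLE_CSV = """\
Year,Month,Day,MaxTemp,MinTemp
2010,1,1,-19.9,-37.2
2010,1,2,-13.8,-20
2011,1,8,-21.9,-31.3
"""


def max_temps_for_year(fh, year):
    """Return the MaxTemp values (as floats) for rows whose Year equals `year`."""
    reader = csv.DictReader(fh)  # uses the header row for column names
    return [float(row["MaxTemp"]) for row in reader if int(row["Year"]) == year]


temps = max_temps_for_year(io.StringIO(SAMPLE_CSV), 2010)
print(temps)
```

csv.DictReader looks columns up by name, so the function keeps working if the column order in the file changes.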
