Pandas clean messy data [closed] - python

I am working with some real-time data about people, and the Age column of the DataFrame is really messy.
I want the expected output to be an age_bins column with bin edges [0,10,20,30,40,50,60,70,80,90,100].
What's the best way to clean messy data like this?
df = pd.DataFrame({'Age':['23', '64', '71', '53', '40', '45', '30-39', '50-59', '60-69',
'30', '65', '44', '8-68', '21-72', '26', '36', '43', '70', '52',
'66', '27', '17', '51', '68', '35', '28', '58', '33', '31', '50',
'24', '88', '29', '21', '78', '60', '63', '37', '32', '49',
'20-29', '47', '18-99', '41', '39', '42', '38', '7', '40-49', '82',
'61', '34-66', '62', '40-89', '80-89', '55', '0.25', '13-19', '69',
'16', '8', '10', '25', '34', '55-74', '75-', '70-79', '79',
'35-54', '55-', '95', '54', '40-50', '46', '48', '57', '56']})

You can remove a possible trailing - with Series.str.strip, split the values into 2 columns with Series.str.split, and then apply cut to each column:
df1 = df['Age'].str.strip('-').str.split('-', expand=True).astype(float)
bins = [0,10,20,30,40,50,60,70,80,90,100]
labels = ['{}-{}'.format(i, j-1) for i, j in zip(bins[:-1], bins[1:])]
g1 = pd.cut(df1[0], bins=bins, right=False, labels=labels)
g2 = pd.cut(df1[1], bins=bins, right=False, labels=labels)
Then compare the two binned Series. Missing values in the second Series are first filled from the first one, so single ages always match. Where the bins differ, Series.mask sets the new column to NaN:
df['age_bins'] = g1.mask(g1.ne(g2.fillna(g1)))
print (df)
Age age_bins
0 23 20-29
1 64 60-69
2 71 70-79
3 53 50-59
4 40 40-49
.. ... ...
72 40-50 NaN
73 46 40-49
74 48 40-49
75 57 50-59
76 56 50-59
[77 rows x 2 columns]
Not matched values:
df1 = df[df['age_bins'].isna()]
print (df1)
Age age_bins
12 8-68 NaN
13 21-72 NaN
42 18-99 NaN
51 34-66 NaN
53 40-89 NaN
64 55-74 NaN
68 35-54 NaN
72 40-50 NaN
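The steps above can be put together in one runnable sketch. It uses a five-value sample instead of the full column, and the names parts, low and high are only illustrative, not part of the original answer:

```python
import pandas as pd

# Small sample covering the cases in the question: a plain age, a range
# inside one bin, an open-ended range, a range spanning bins, a fraction.
df = pd.DataFrame({'Age': ['23', '30-39', '75-', '8-68', '0.25']})

# Drop a trailing '-', then split "low-high" into two numeric columns.
parts = df['Age'].str.strip('-').str.split('-', expand=True).astype(float)

bins = list(range(0, 101, 10))
labels = ['{}-{}'.format(i, j - 1) for i, j in zip(bins[:-1], bins[1:])]

# Bin both ends of each range with left-closed intervals: [0, 10), [10, 20), ...
low = pd.cut(parts[0], bins=bins, right=False, labels=labels)
high = pd.cut(parts[1], bins=bins, right=False, labels=labels)

# Keep the bin only when both ends agree; single ages have no upper end,
# so the missing upper bin is filled from the lower one before comparing.
df['age_bins'] = low.mask(low.ne(high.fillna(low)))
print(df)
```

In this sample only '8-68' should end up as NaN, because its two ends fall into different bins.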


Pandas: Flatten Nested Dictionary vertically

I have a list of dictionaries as below:
[{'name': 'jack', 'tagList': [{'tagId': '10', 'tagName': 'AB'},{'tagId': '20',
'tagName': 'BC'}]},
{'name': 'mike', 'tagList': [{'tagId': '30', 'tagName': 'DE'},{'tagId': '40',
'tagName': 'FG'}]}]
I want to turn this into a dataframe like below:
Name tagList_tagID tagList_tagName
Jack 10 AB
Jack 20 BC
mike 30 DE
mike 40 FG
How can I convert this list of dictionaries to a pandas DataFrame in an efficient way?
Try pd.json_normalize:
lst = [{'name': 'jack', 'tagList': [{'tagId': '10', 'tagName': 'AB'},
{'tagId': '20', 'tagName': 'BC'}]},
{'name': 'mike', 'tagList': [{'tagId': '30', 'tagName': 'DE'},
{'tagId': '40', 'tagName': 'FG'}]}]
df = pd.json_normalize(lst, record_path="tagList", meta=["name"])
#formatting to match expected output
df = df.set_index("name").add_prefix("tagList_")
>>> df
tagList_tagId tagList_tagName
name
jack 10 AB
jack 20 BC
mike 30 DE
mike 40 FG
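If name should stay an ordinary column rather than the index, a variant along these lines may be closer to the table in the question. It is only a sketch: it relies on the record_prefix parameter of pd.json_normalize, and the variable names records and flat are illustrative:

```python
import pandas as pd

records = [
    {'name': 'jack', 'tagList': [{'tagId': '10', 'tagName': 'AB'},
                                 {'tagId': '20', 'tagName': 'BC'}]},
    {'name': 'mike', 'tagList': [{'tagId': '30', 'tagName': 'DE'},
                                 {'tagId': '40', 'tagName': 'FG'}]},
]

# One output row per element of "tagList"; "name" is repeated for each row.
# record_prefix adds the "tagList_" prefix to the nested fields only.
flat = pd.json_normalize(records, record_path='tagList', meta=['name'],
                         record_prefix='tagList_')

# Move "name" to the front to match the layout asked for.
flat = flat[['name', 'tagList_tagId', 'tagList_tagName']]
print(flat)
```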

How to store for loop iterations to a new data set

My code:
for video in most_disliked:
    df1 = video['id'],video['statistics']
    print(df1)
Output:
('bvyTxpY9qJM', {'viewCount': '145', 'likeCount': '3', 'dislikeCount': '0', 'favoriteCount': '0', 'commentCount': '0'})
('gShHA7BZNCw', {'viewCount': '36', 'likeCount': '3', 'dislikeCount': '0', 'favoriteCount': '0', 'commentCount': '0'})
('q7gxl8RJEv4', {'viewCount': '11', 'likeCount': '2', 'dislikeCount': '0', 'favoriteCount': '0', 'commentCount': '1'})
Expected Output:
Videoid viewcount likecount dislikecount favouritecount commentcount
bvyTxpY9qJM 145 3 0 0 0
gShHA7BZNCw 36 3 0 0 0
q7gxl8RJEv4 11 2 0 0 1
df1 = video['id'],video['statistics'] creates a tuple of two elements video['id'] and video['statistics'].
To create a dataframe from the most_disliked list, you can use this example:
df1 = pd.DataFrame([{'Videoid': video['id'], **video['statistics']} for video in most_disliked])
print(df1)
Prints:
Videoid viewCount likeCount dislikeCount favoriteCount commentCount
0 bvyTxpY9qJM 145 3 0 0 0
1 gShHA7BZNCw 36 3 0 0 0
2 q7gxl8RJEv4 11 2 0 0 1
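For readers who prefer an explicit loop, the same idea can be written by collecting one dict per iteration and building the DataFrame once at the end. This is a sketch: the most_disliked list below is a hypothetical stand-in shaped like the printed output in the question, and the numeric conversion at the end is an optional extra, since the counts arrive as strings:

```python
import pandas as pd

# Hypothetical stand-in for the real API result.
most_disliked = [
    {'id': 'bvyTxpY9qJM', 'statistics': {'viewCount': '145', 'likeCount': '3',
     'dislikeCount': '0', 'favoriteCount': '0', 'commentCount': '0'}},
    {'id': 'gShHA7BZNCw', 'statistics': {'viewCount': '36', 'likeCount': '3',
     'dislikeCount': '0', 'favoriteCount': '0', 'commentCount': '0'}},
    {'id': 'q7gxl8RJEv4', 'statistics': {'viewCount': '11', 'likeCount': '2',
     'dislikeCount': '0', 'favoriteCount': '0', 'commentCount': '1'}},
]

rows = []
for video in most_disliked:
    row = {'Videoid': video['id']}   # start each row with the video id
    row.update(video['statistics'])  # then add one key per statistic
    rows.append(row)                 # keep the row instead of overwriting it

df1 = pd.DataFrame(rows)

# Optional: turn the string counts into numbers.
stat_cols = df1.columns.drop('Videoid')
df1[stat_cols] = df1[stat_cols].apply(pd.to_numeric)
print(df1)
```

Appending to a list and constructing the frame once avoids growing a DataFrame row by row inside the loop.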
data = [('bvyTxpY9qJM', {'viewCount': '145', 'likeCount': '3',
'dislikeCount': '0', 'favoriteCount': '0', 'commentCount': '0'}),
('gShHA7BZNCw', {'viewCount': '36', 'likeCount': '3',
'dislikeCount': '0', 'favoriteCount': '0', 'commentCount': '0'}),
('q7gxl8RJEv4', {'viewCount': '11', 'likeCount': '2',
'dislikeCount': '0', 'favoriteCount': '0', 'commentCount': '1'}),
]
most_liked = pd.DataFrame(data, columns=['id', 'stat'])
df2 = pd.merge(most_liked['id'], most_liked['stat'].apply(pd.Series),
left_index=True, right_index=True)
Output
id viewCount likeCount dislikeCount favoriteCount commentCount
0 bvyTxpY9qJM 145 3 0 0 0
1 gShHA7BZNCw 36 3 0 0 0
2 q7gxl8RJEv4 11 2 0 0 1

Python, Take Multiple Lists and Putting into pd.Dataframe

I have seen a variety of answers to this question (like this one), and have had no success in getting my lists into one dataframe. I have one header list (meant to be column headers), and then a variable that has multiple records in it:
list1 = ['Rank', 'Athlete', 'Distance', 'Runs', 'Longest', 'Avg. Pace', 'Elev. Gain']
list2 = (['1', 'Jack', '57.4 km', '4', '21.7 km', '5:57 /km', '994 m']
['2', 'Jill', '34.0 km', '2', '17.9 km', '5:27 /km', '152 m']
['3', 'Kelsey', '32.6 km', '2', '21.3 km', '5:46 /km', '141 m'])
When I try something like:
df = pd.DataFrame(list(zip(['1', 'Jack', '57.4 km', '4', '21.7 km', '5:57 /km', '994 m'],
                           ['2', 'Jill', '34.0 km', '2', '17.9 km', '5:27 /km', '152 m'])))
It lists all the attributes as their own rows, like so:
0 1
0 1 2
1 Jack Jill
2 57.4 km 34.0 km
3 4 2
4 21.7 km 17.9 km
5 5:57 /km 5:27 /km
6 994 m 152 m
How do I get this into a frame that has list1 as the headers, and the rest of the data neatly squared away?
Given
list1 = ['Rank', 'Athlete', 'Distance', 'Runs', 'Longest', 'Avg. Pace', 'Elev. Gain']
list2 = (['1', 'Jack', '57.4 km', '4', '21.7 km', '5:57 /km', '994 m'],
['2', 'Jill', '34.0 km', '2', '17.9 km', '5:27 /km', '152 m'],
['3', 'Kelsey', '32.6 km', '2', '21.3 km', '5:46 /km', '141 m'])
do
pd.DataFrame(list2, columns=list1)
which returns
Rank Athlete Distance Runs Longest Avg. Pace Elev. Gain
0 1 Jack 57.4 km 4 21.7 km 5:57 /km 994 m
1 2 Jill 34.0 km 2 17.9 km 5:27 /km 152 m
2 3 Kelsey 32.6 km 2 21.3 km 5:46 /km 141 m
Change your second list into a list of lists and then
df = pd.DataFrame(columns = list1, data = list2)
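As a short illustration of why the zip attempt in the question produced one row per attribute: zip pairs up the i-th items of each record, which transposes the data, whereas passing the records directly keeps one row per record. The sketch below uses shortened lists, and the names headers, rows, transposed and by_row are illustrative:

```python
import pandas as pd

headers = ['Rank', 'Athlete', 'Distance']
rows = [['1', 'Jack', '57.4 km'],
        ['2', 'Jill', '34.0 km']]

# zip(*rows) groups the first items together, then the second items, and so
# on, so each attribute becomes a row and each record becomes a column.
transposed = pd.DataFrame(list(zip(*rows)))

# Passing the list of records directly gives one row per record.
by_row = pd.DataFrame(rows, columns=headers)

print(transposed.shape)
print(by_row)
```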

Extract data of a column by matching data of other column in Python

I want to extract all MaxTemp for year 2010
Year Month Day MaxTemp MinTemp
2010 1 1 -19.9 -37.2
2010 1 2 -13.8 -20
2010 1 3 -13.1 -15.9
2010 1 4 -12 -22.3
2010 1 5 -11.8 -14.4
2010 1 6 -14.3 -32.5
2010 1 7 -28.2 -37.3
2011 1 8 -21.9 -31.3
2011 1 9 -7.4 -22.8
2011 1 10 -6.6 -15.3
2011 1 11 -0.7 -15.2
2011 1 12 4.3 -5.8
My current code is:
with open('data/'+file_name, newline='') as files:
    val = list(csv.reader(files))[1:]
Sample output:
[['2010', '01', '01', '9.6', '5.8'], ['2010', '01', '02', '8.6', '6.2'], ['2010', '01', '03', '8.8', '6.0'], ['2010', '01', '04', '6.8', '5.6'], ['2010', '01', '05', '9.0', '4.4'], ['2010', '01', '06', '8.1', '1.0'], ['2010', '01', '07', '6.3', '0.9'], ['2010', '01', '08', '7.8', '4.2'], ['2010', '01', '09', '10.4', '7.5'], ['2010', '01', '10', '11.5', '7.9'], ['2010', '01', '11', '11.9', '8.9']]
This extracts the whole CSV without the header. How can I extract all MaxTemp values for the year 2010 only? I would like to pass the year as an argument. Thanks much.
As rofls said, but with a comprehension:
with open('data/'+file_name, newline='') as files:
    data = [x[3] for x in csv.reader(files) if str(x[0]) == '2010']
Here you go:
maxList = list()
for row in val:
    rowInStr = ''.join(row)
    rowSplitList = rowInStr.split(" ")
    if rowSplitList[0] == "2010":
        # build a list: filter() returns an iterator in Python 3 and cannot be indexed
        rowSplitList = [a for a in rowSplitList if a != '']
        maxList.append(rowSplitList[-2])
Output:
['-19.9', '-13.8', '-13.1', '-12', '-11.8', '-14.3', '-28.2']
Please use proper indentation. I prefer tab for indentation.
Hope this will help.
You can do it like this, then data will be a list of MaxTemps from 2010:
with open('data/'+file_name, newline='') as files:
    data = list()
    for line in csv.reader(files):
        # compare as strings so the header row ('Year') does not raise an error
        if line and line[0] == '2010':
            data.append(line[3])
Better would be to use something like pandas and filter the data that way. If it's not a huge amount of data you should be ok this way though.
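The pandas route mentioned above might look like the following sketch. It assumes a comma-separated file with the header row from the question; the helper name max_temps_for_year is hypothetical, and an in-memory string stands in for the real file so the example runs on its own:

```python
import io
import pandas as pd

# Stand-in for the real file; a path such as 'data/' + file_name could be
# passed to the function instead.
csv_text = """Year,Month,Day,MaxTemp,MinTemp
2010,1,1,-19.9,-37.2
2010,1,2,-13.8,-20
2011,1,8,-21.9,-31.3
"""

def max_temps_for_year(source, year):
    """Return the MaxTemp values for the given year as a list of floats."""
    df = pd.read_csv(source)
    # Boolean mask on the Year column, then pick the MaxTemp column.
    return df.loc[df['Year'] == year, 'MaxTemp'].tolist()

temps_2010 = max_temps_for_year(io.StringIO(csv_text), 2010)
print(temps_2010)
```

Because read_csv parses the columns as numbers, the year is compared as an integer and the temperatures come back as floats rather than strings.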

How to create a dictionary from a dictionary in python?

Date Subject Maths Science English French Spanish German
16:00:00 Uploaded 100 95 65 32 23 45
17:00:00 Unknown 45 53 76 78 54 78
18:00:00 Captured 43 45 56 76 98 34
Date BoardType Maths Science English French Spanish German
16:00:00 CBSE 50 95 65 32 23 45
17:00:00 NTSE 45 53 76 78 54 78
18:00:00 YTTB 100 45 56 76 98 34
I have these 2 tables in my text file called dataVal.txt.
I want the output to be something like:
'Subject':'Uploaded':'16:00:00':'Maths':'100', 'Science':'95' ...
Basically, 'Subject' is the main key for the first table and has 'Uploaded' as its value. 'Uploaded' then becomes a key with '16:00:00' as its value, which in turn becomes a key whose values are Maths, Science, English and so on, and these further have their own values: 100, 95, 65 and so on.
dic = dict()
with open('C:\\Users\\aman.seth\\Documents\\dataVal.txt','r') as fh:
    for l in fh.readlines():
        try:
            lines = l.split('\t')
            date, sub, num = lines[0], lines[1], [str(x) for x in lines[2:]]
            dic.setdefault(sub, {})
            dic[sub][date] = num
        except Exception as er:
            print(er)
print(dic)
This is what I have done so far, but I guess it is neither complete nor accurate.
Please try this:
import pprint
import re
dic = dict()
with open('txt', 'r') as fh:
    memory = None
    for line in fh:
        # split on any whitespace; this also drops the trailing newline
        fields = line.split()
        if not fields:
            continue  # skip blank lines between the tables
        try:
            match = re.search('(BoardType|Subject)', line)
            if match:
                # header row: remember the current table and its mark columns
                memory = match.group(1)
                dic.setdefault(memory, {})
                mark_index = fields[2:]
            else:
                date, sub, num = fields[0], fields[1], fields[2:]
                dic[memory].setdefault(sub, {})
                dic[memory][sub][date] = dict(zip(mark_index, num))
        except Exception as error:
            print('Error: ', error)
pprint.pprint(dic)
Output:
{'BoardType': {'CBSE': {'16:00:00': {'English': '65',
                                     'French': '32',
                                     'German': '45',
                                     'Maths': '50',
                                     'Science': '95',
                                     'Spanish': '23'}},
               'NTSE': {'17:00:00': {'English': '76',
                                     'French': '78',
                                     'German': '78',
                                     'Maths': '45',
                                     'Science': '53',
                                     'Spanish': '54'}},
               'YTTB': {'18:00:00': {'English': '56',
                                     'French': '76',
                                     'German': '34',
                                     'Maths': '100',
                                     'Science': '45',
                                     'Spanish': '98'}}},
 'Subject': {'Captured': {'18:00:00': {'English': '56',
                                       'French': '76',
                                       'German': '34',
                                       'Maths': '43',
                                       'Science': '45',
                                       'Spanish': '98'}},
             'Unknown': {'17:00:00': {'English': '76',
                                      'French': '78',
                                      'German': '78',
                                      'Maths': '45',
                                      'Science': '53',
                                      'Spanish': '54'}},
             'Uploaded': {'16:00:00': {'English': '65',
                                       'French': '32',
                                       'German': '45',
                                       'Maths': '100',
                                       'Science': '95',
                                       'Spanish': '23'}}}}
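To show how the nested result is read back, here is a small self-contained sketch. It is not the answer's exact code: the helper name parse_tables is hypothetical, the header row is detected by looking at its second field instead of using a regular expression, and a shortened in-memory sample replaces dataVal.txt:

```python
import io

# Shortened stand-in for the two tables in dataVal.txt.
sample = """Date Subject Maths Science
16:00:00 Uploaded 100 95
17:00:00 Unknown 45 53

Date BoardType Maths Science
16:00:00 CBSE 50 95
"""

def parse_tables(lines, table_keys=('Subject', 'BoardType')):
    """Build {table: {label: {date: {column: mark}}}} from whitespace-separated rows."""
    result = {}
    current = None   # name of the table being read
    columns = []     # mark column names taken from that table's header row
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # blank line between tables
        if len(fields) > 1 and fields[1] in table_keys:
            # Header row: its second field names the table.
            current = fields[1]
            columns = fields[2:]
            result.setdefault(current, {})
        elif current is not None:
            date, label, marks = fields[0], fields[1], fields[2:]
            result[current].setdefault(label, {})[date] = dict(zip(columns, marks))
    return result

tables = parse_tables(io.StringIO(sample))

# Walk down the levels: table -> label -> time -> subject.
maths = tables['Subject']['Uploaded']['16:00:00']['Maths']
print(maths)
```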
