I have a data frame
USER =
zipcode userCount
0 00601 5
1 00602 23
2 00603 53
3 00604 2
4 00605 6
5 00606 10
6 00610 8
7 00612 33
8 00613 2
9 00614 2
10 00616 1
11 00617 9
12 00622 6
13 00623 28
14 00624 10
15 00627 8
16 00631 1
17 00637 13
18 00638 9
19 00641 12
20 00646 13
When I save it
USER.to_csv('Total_user.csv',index = False)
The leading 0 in front of the zipcode goes missing: 00601 -> 601
zipcode userCount
601 5
602 23
603 53
604 2
605 6
606 10
610 8
612 33
613 2
614 2
616 1
617 9
622 6
623 28
624 10
627 8
631 1
637 13
638 9
641 12
646 13
Is there anything I missed in the to_csv line? I just want to keep the leading 0 in the csv, so that when I read_csv(low_memory=False) the zipcode has the normal format.
Assuming that the column df['zipcode'] of the first dataframe is already a column of strings, save it as usual (to_csv has no dtype parameter; string values are written with their leading zeros intact):
>>> df.to_csv('zipcodes.csv')
Then, when reading, set all data types to str, and afterwards convert back the columns that are not meant to be strings:
>>> pd.read_csv('zipcodes.csv',dtype='str',index_col=0)
zipcode userCount
0 00601 5
1 00602 23
2 00603 53
3 00604 2
4 00605 6
5 00606 10
6 00610 8
7 00612 33
8 00613 2
9 00614 2
10 00616 1
11 00617 9
12 00622 6
13 00623 28
14 00624 10
15 00627 8
16 00631 1
17 00637 13
18 00638 9
19 00641 12
20 00646 13
>>> df['userCount'] = df['userCount'].astype(int)
>>> df.dtypes
zipcode object
userCount int64
dtype: object
Your data is probably being stored as an object dtype in the data frame. You can confirm this by typing:
>>> df.dtypes
zipcode object
userCount object
dtype: object
Python doesn't keep leading zeros on integers, hence the object dtype. You'll need to quote your data when you save it. You can do this via the quoting parameter in to_csv():
import csv
df.to_csv('tmp.csv', quoting=csv.QUOTE_NONNUMERIC)
If you don't quote your data pandas will convert it to an integer when you re-read it and strip the leading zeros.
Pass dtype=str as a parameter to read_csv, i.e. read_csv(file, sep, dtype=str).
That will fix the issue.
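As a supplementary note (not part of the answers above): if the leading zeros have already been lost because the column was read in as integers, a minimal sketch to pad them back before saving could look like this. It assumes 5-digit US zipcodes and reuses the USER frame from the question.
import pandas as pd

# Pad the leading zeros back, write the file, then read it keeping the column as strings.
USER['zipcode'] = USER['zipcode'].astype(str).str.zfill(5)
USER.to_csv('Total_user.csv', index=False)
USER = pd.read_csv('Total_user.csv', dtype={'zipcode': str})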
Related
In my dataset, I have a few rows which contain characters.
I only need the rows which contain all integers. What is the best possible way to do this? The data set is below:
e.g. I want to remove the 2nd and 3rd rows, as they contain 051A, 04A, and 08B respectively.
1 2017 0 321 3 20 42 18
2 051A 0 321 3 5 69 04A
3 460 0 1633 16 38 17 08B
4 1811 0 822 8 13 65 18
Not sure if apply can be avoided here
df.apply(lambda x: pd.to_numeric(x, errors = 'coerce')).dropna()
0 1 2 3 4 5 6 7
0 1 2017.0 0 321 3 20 42 18.0
3 4 1811.0 0 822 8 13 65 18.0
This is very similar to @jpp's solution but differs in the technique used to check whether a value is a digit.
df[df.applymap(lambda x: str(x).isdecimal()).all(1)].astype(int)
0 1 2 3 4 5 6 7
0 1 2017 0 321 3 20 42 18
3 4 1811 0 822 8 13 65 18
Thanks to @jpp for suggesting isdecimal as opposed to isdigit.
For this task, as stated, try / except is a solution which should deal with all cases.
pd.DataFrame.applymap applies a function to each element in the dataframe.
def CheckInt(s):
    try:
        int(s)
        return True
    except ValueError:
        return False
res = df[df.applymap(CheckInt).all(axis=1)].astype(int)
# 0 1 2 3 4 5 6 7
# 0 1 2017 0 321 3 20 42 18
# 3 4 1811 0 822 8 13 65 18
As an alternative to the other good answers, this solution uses the stack + unstack paradigm to avoid a loopy solution.
v = df.stack().astype(str)
v.where(v.str.isdecimal()).unstack().dropna().astype(int)
0 1 2 3 4 5 6 7
0 1 2017 0 321 3 20 42 18
3 4 1811 0 822 8 13 65 18
In one line, I think you can use the convert_objects function from pandas. With this, we convert objects to numeric values (non-convertible entries become NaN), and we finally drop the NaNs. Note that convert_objects is deprecated in recent pandas versions.
df = df.convert_objects(convert_numeric=True).dropna()
You can check more information here on pandas documentation.
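If convert_objects is not available (it was eventually removed from pandas after being deprecated), a rough one-line equivalent using pd.to_numeric, in the same spirit as the apply-based answer above, might be:
df = df.apply(pd.to_numeric, errors='coerce').dropna()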
I have a dataframe df with two columns, X and y.
In df['y'] I have integers from 1 to 10 inclusive. However, they have different frequencies:
df['y'].value_counts()
10 6645
9 6213
8 5789
7 4643
6 2532
5 1839
4 1596
3 878
2 815
1 642
I want to cut down my dataframe so that there is an equal number of occurrences of each label. Since the minimum frequency is 642, I only want to keep 642 randomly sampled rows of each class label, so that my new dataframe has 642 rows per label.
I thought this might have helped; however, stratifying only keeps the same percentage of each label, and I want all my labels to have the same frequency.
As an example of a dataframe:
import random
import pandas as pd

df = pd.DataFrame()
df['y'] = sum([[10]*6645, [9]* 6213,[8]* 5789, [7]*4643,[6]* 2532, [5]*1839,[4]* 1596,[3]* 878, [2]*815, [1]* 642],[])
df['X'] = [random.choice(list('abcdef')) for i in range(len(df))]
Use sample with groupby:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 11, 100), columns=['y'])
val_cnt = df['y'].value_counts()
min_sample = val_cnt.min()
print(min_sample)  # Outputs 7, as an example
print(df.groupby('y').apply(lambda s: s.sample(min_sample)))
Output
y
y
1 68 1
8 1
82 1
17 1
99 1
31 1
6 1
2 55 2
15 2
81 2
22 2
46 2
13 2
58 2
3 2 3
30 3
84 3
61 3
78 3
24 3
98 3
4 51 4
86 4
52 4
10 4
42 4
80 4
53 4
5 16 5
87 5
... ..
6 26 6
18 6
7 56 7
4 7
60 7
65 7
85 7
37 7
70 7
8 93 8
41 8
28 8
20 8
33 8
64 8
62 8
9 73 9
79 9
9 9
40 9
29 9
57 9
7 9
10 96 10
67 10
47 10
54 10
97 10
71 10
94 10
[70 rows x 1 columns]
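As a side note, on pandas 1.1 or newer the same idea can be written without apply by sampling directly on the groupby object (a sketch, reusing min_sample from above):
balanced = df.groupby('y').sample(n=min_sample)
print(balanced['y'].value_counts())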
I have the following table in pandas
user_id idaggregate_info num_events num_lark_convo_events num_meals_logged num_breakfasts num_lunches num_dinners num_snacks total_activity sleep_duration num_activity_events num_weights num_notifs idusermission completed mission_delta
0 0 406 94 20 7 2 2 2 1 4456 47738 72 0 18 1426 0 NaT
1 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 7 days 10:04:28
2 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 NaT
3 2 2088 356 32 15 6 6 1 2 41598 184113 314 1 21 967 1 8 days 00:03:05
4 2 2088 356 32 15 6 6 1 2 41598 184113 314 1 21 967 1 NaT
Some user_ids have multiple lines that are identical except for their different mission_delta values. How do I transform this into one line for each id, with columns named "mission_delta_1", "mission_delta_2", etc. (the number of them varies; it could be from 1 per user_id to maybe 5 per user_id, so the naming has to be iterative), so the output would be:
user_id idaggregate_info num_events num_lark_convo_events num_meals_logged num_breakfasts num_lunches num_dinners num_snacks total_activity sleep_duration num_activity_events num_weights num_notifs idusermission completed mission_delta_1 mission_delta_2
0 0 406 94 20 7 2 2 2 1 4456 47738 72 0 18 1426 0 NaT
1 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 7 days 10:04:28 NaT
2 2 2088 356 32 15 6 6 1 2 41598 184113 314 1 21 967 1 8 days 00:03:05 NaT
Not a duplicate, as those address exploding all columns; here just one column needs to be unstacked. The solutions offered in the duplicate link fail:
df.groupby(level=0).apply(lambda x: pd.Series(x.values.flatten()))
produces the same df as the original with different labels
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 406 94 20 7 2 2 2 1 4456 47738 72 0 18 1426 0 NaT
1 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 7 days 10:04:28
2 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 NaT
3 2 2088 356 32 15 6 6 1 2 41598 184113 314 1 21 967 1 8 days 00:03:05
The next options:
result2.groupby(level=0).apply(lambda x: pd.Series(x.stack().values))
produces:
0 0 0
1 406
2 94
3 20
4 7
and
df.groupby(level=0).apply(lambda x: x.values.ravel()).apply(pd.Series)
produces the original dataframe:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 406 94 20 7 2 2 2 1 4456 47738 72 0 18 1426 0 NaT
1 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 7 days 10:04:28
2 1 1247 121 48 26 8 7 2 9 48695 37560 53 14 48 1379 1 NaT
3 2 2088 356 32 15 6 6 1 2 41598 184113 314 1 21 967 1 8 days 00:03:05
In essence, I want to turn a df:
id mission_delta
0 NaT
1 1 day
1 2 days
1 1 day
2 5 days
2 NaT
into
id mission_delta1 mission_delta_2 mission_delta_3
0 NaT NaT NaT
1 1 day 2 days 1 day
2 5 days NaT NaT
You might try this:
grp = df.groupby('id')
df_res = grp['mission_delta'].apply(lambda x: pd.Series(x.values)).unstack().fillna('NaT')
df_res = df_res.rename(columns={i: 'mission_delta_{}'.format(i + 1) for i in range(df_res.shape[1])})
print(df_res)
mission_delta_1 mission_delta_2 mission_delta_3
id
0 NaT NaT NaT
1 1 day 2 days 1 day
2 5 days NaT NaT
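An alternative sketch (not from the original answer) that numbers the deltas with cumcount and reshapes with pivot; the column names here are assumptions matching the small example above:
tmp = df.copy()
tmp['n'] = tmp.groupby('id').cumcount() + 1
wide = tmp.pivot(index='id', columns='n', values='mission_delta')
wide.columns = ['mission_delta_{}'.format(c) for c in wide.columns]
print(wide)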
I have golf data from the PGA Tour that I am trying to clean. Originally, I had white space in front of some of the hole scores. This is what a unique count looked like:
df["Hole Score"].value_counts()
Out[76]:
4 566072
5 272074
3 218873
6 48596
4 38306
5 19728
2 17339
3 15093
7 7750
4232
6 3011
8 1313
2 1080
7 389
9 369
10 66
8 61
11 38
1 27
9 20
Name: Hole Score, dtype: int64
I was able to run a whitespace remover function that got rid of leading white space. However, my value_counts call returns the same frequency counts:
df["Hole Score_2"].value_counts()
Out[74]:
4 566072
5 272074
3 218873
6 48596
4 38306
5 19728
2 17339
3 15093
7 7750
4232
6 3011
8 1313
2 1080
7 389
9 369
10 66
8 61
11 38
1 27
9 20
Name: Hole Score_2, dtype: int64
For reference, this is the helper function that I used:
def remove_whitespace(x):
    try:
        x = "".join(x.split())
    except:
        pass
    return x
df["Hole Score_2"] = df["Hole Score"].apply(remove_whitespace)
My question: How do I get the unique counts to list one Hole Score per count? What could be causing the double counting?
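A guess at the cause (not confirmed by the data shown): the column probably holds a mix of integers and whitespace-padded strings, so after stripping you still have the string '4' and the integer 4, which value_counts treats as different keys. A sketch that normalizes everything to one numeric type:
df["Hole Score_2"] = pd.to_numeric(df["Hole Score"].astype(str).str.strip(), errors="coerce")
df["Hole Score_2"].value_counts()
# Blank entries become NaN under errors="coerce" and are dropped by value_counts by default.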
This is quite simple, but I do not get why I can't merge two dataframes. I have the following dfs with different shapes (one is larger and wider than the other):
df1
A id
0 microsoft inc 1
1 apple computer. 2
2 Google Inc. 3
3 IBM 4
4 amazon, Inc. 5
df2
B C D E id
0 (01780-500-01) 237489 - 342 API True. 1
0 (409-6043-01) 234324 API Other 2
0 23423423 API NaN NaN 3
0 (001722-5e240-60) NaN NaN Other 4
1 (0012172-52411-60) 32423423. NaN Other 4
0 29849032-29482390 API Yes False 5
1 329482030-23490-1 API Yes False 5
I would like to merge df1 and df2 by the index column:
df3
A B C D E id
0 microsoft inc (01780-500-01) 237489 - 342 API True. 1
1 apple computer. (409-6043-01) 234324 API Other 2
2 Google Inc. 23423423 API NaN NaN 3
3 IBM (001722-5e240-60) NaN NaN Other 4
4 IBM (0012172-52411-60) 32423423. NaN Other 4
5 amazon, Inc. 29849032-29482390 API Yes False 5
6 amazon, Inc. 329482030-23490-1 API Yes False 5
I know that this could be done by using merge(). Also, I read this excellent tutorial and tried to:
In:
pd.merge(df1, df2, on=df1.id, how='outer')
Out:
IndexError: indices are out-of-bounds
Then I tried:
pd.merge(df2, df1, on='id', how='outer')
And apparently it repeats the merged rows several times, something like this:
A B C D E index
0 microsoft inc (01780-500-01) 237489 - 342 API True. 1
1 apple computer. (409-6043-01) 234324 API Other 2
2 apple computer. (409-6043-01) 234324 API Other 2
3 apple computer. (409-6043-01) 234324 API Other 2
4 apple computer. (409-6043-01) 234324 API Other 2
5 apple computer. (409-6043-01) 234324 API Other 2
6 apple computer. (409-6043-01) 234324 API Other 2
7 apple computer. (409-6043-01) 234324 API Other 2
8 apple computer. (409-6043-01) 234324 API Other 2
...
I think this is related to the fact that I created a temporary index df2['position'] = df2.index (since the indices look weird) and then removed it. So, my question is how to get df3?
UPDATE
I fixed the index of df2 like this:
df2.reset_index(drop=True, inplace=True)
And now looks like this:
B C D E id
0 (01780-500-01) 237489 - 342 API True. 1
1 (409-6043-01) 234324 API Other 2
2 23423423 API NaN NaN 3
3 (001722-5e240-60) NaN NaN Other 4
4 (0012172-52411-60) 32423423. NaN Other 4
5 29849032-29482390 API Yes False 5
6 329482030-23490-1 API Yes False 5
I am still having the same issue. The merged rows are repeating several times.
>>>print(df2.dtypes)
B object
C object
D object
E object
id int64
dtype: object
>>>print(df1.dtypes)
A object
id int64
dtype: object
Update2
>>>print(df2['id'])
0 1
1 2
2 3
3 4
4 4
5 5
6 5
7 6
8 6
9 7
10 8
11 8
12 8
13 8
14 9
15 10
16 11
17 11
18 12
19 12
20 13
21 13
22 14
23 15
24 16
25 16
26 17
27 17
28 18
29 18
...
476 132
477 132
478 132
479 132
480 132
481 132
482 132
483 132
484 133
485 133
486 133
487 133
488 134
489 134
490 134
491 134
492 135
493 135
494 136
495 136
496 137
497 137
498 137
499 137
500 137
501 137
502 137
503 138
504 138
505 138
Name: id, dtype: int64
And
>>>print(df1['id'])
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
11 8
12 12
13 6
14 7
15 8
16 6
17 11
18 13
19 14
20 15
21 11
22 2
23 16
24 17
25 18
26 9
27 19
28 11
29 20
..
108 57
109 43
110 22
111 2
112 58
113 49
114 22
115 59
116 2
117 6
118 22
119 2
120 37
121 2
122 9
123 60
124 61
125 62
126 63
127 42
128 64
129 4
130 29
131 11
132 2
133 25
134 4
135 65
136 66
137 4
Name: id, dtype: int64
You could try setting the index as id and then using join:
df1 = pd.DataFrame([('microsoft inc',1),
('apple computer.',2),
('Google Inc.',3),
('IBM',4),
('amazon, Inc.',5)],columns = ('A','id'))
df2 = pd.DataFrame([('(01780-500-01)','237489', '- 342','API', 1),
('(409-6043-01)','234324', ' API','Other ',2),
('23423423','API', 'NaN','NaN', 3),
('(001722-5e240-60)','NaN', 'NaN','Other', 4),
('(0012172-52411-60)','32423423',' NaN','Other', 4),
('29849032-29482390','API', ' Yes',' False', 5),
('329482030-23490-1','API', ' Yes',' False', 5)],
columns = ['B','C','D','E','id'])
df1 =df1.set_index('id')
df1.drop_duplicates(inplace=True)
df2 = df2.set_index('id')
df3 = df1.join(df2,how='outer')
Since you've set the index columns (aka join keys) for both dataframes, you wouldn't have to specify the on='id' param.
This is an alternate way to solve the problem. I don't see anything wrong with pd.merge(df1, df2, on='id', how='outer'). You might want to double check the id column in both dataframes, as mentioned by @JohnE.
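One quick way to check that, sketched on the frames from the question (it assumes the repeated rows come from duplicated id values in df1, which the printouts above suggest):
print(df1['id'].duplicated().sum())  # greater than 0 means df1 has repeated ids
print(df2['id'].duplicated().sum())
# If df1 should have one row per id, drop the duplicates before merging:
df3 = pd.merge(df1.drop_duplicates(subset='id'), df2, on='id', how='outer')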