Double unique counts in python

I have golf data from the PGA Tour that I am trying to clean. Originally, I had white space in front of some of the hole scores. This is what a unique count looked like:
df["Hole Score"].value_counts()
Out[76]:
 4    566072
 5    272074
 3    218873
 6     48596
 4     38306
 5     19728
 2     17339
 3     15093
 7      7750
         4232
 6      3011
 8      1313
 2      1080
 7       389
 9       369
10        66
 8        61
11        38
 1        27
 9        20
Name: Hole Score, dtype: int64
I was able to run a whitespace-removal function that got rid of the leading whitespace. However, value_counts still returns the same frequency counts:
df["Hole Score_2"].value_counts()
Out[74]:
 4    566072
 5    272074
 3    218873
 6     48596
 4     38306
 5     19728
 2     17339
 3     15093
 7      7750
         4232
 6      3011
 8      1313
 2      1080
 7       389
 9       369
10        66
 8        61
11        38
 1        27
 9        20
Name: Hole Score_2, dtype: int64
For reference, this is the helper function that I used:
def remove_whitespace(x):
    try:
        x = "".join(x.split())
    except:
        pass
    return x

df["Hole Score_2"] = df["Hole Score"].apply(remove_whitespace)
My question: How do I get the unique counts to list one Hole Score per count? What could be causing the double counting?
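One possible cause (a hedged guess, since the raw data isn't shown) is that the column mixes string and integer values, so the string "4" and the integer 4 are counted as separate labels even after the whitespace is stripped; the try/except in remove_whitespace silently leaves non-strings untouched. A minimal diagnostic and fix sketch, using the column names from the question:
import pandas as pd

# Inspect which Python types are actually present in the column;
# a mix of str and int would explain the duplicate-looking labels.
print(df["Hole Score"].map(type).value_counts())

# Coerce everything to a single numeric dtype before counting again.
df["Hole Score_2"] = pd.to_numeric(df["Hole Score"].astype(str).str.strip(), errors="coerce")
print(df["Hole Score_2"].value_counts())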

Related

Pandas groupby apply a random day to each group of years

I am trying to generate a different random day within each year group of a dataframe, so I need replace=False, otherwise it will fail.
You can't just add a column of random numbers, because I'm going to have more than 365 years in my list of years, and once you hit 365 you can't create any more random samples without replacement.
I have explored agg, aggregate, apply and transform. The closest I have got is with this:
years = pd.DataFrame({"year": [1,1,2,2,2,3,3,4,4,4,4]})
years["day"] = 0
grouped = years.groupby("year")["day"]
grouped.transform(lambda x: np.random.choice(366, replace=False))
Which gives this:
0 8
1 8
2 319
3 319
4 319
5 149
6 149
7 130
8 130
9 130
10 130
Name: day, dtype: int64
But I want this:
0 8
1 16
2 119
3 321
4 333
5 4
6 99
7 30
8 129
9 224
10 355
Name: day, dtype: int64
You can use your code with a minor modification. You have to specify the number of samples.
random_days = lambda x: np.random.choice(range(1, 366), len(x), replace=False)
years['day'] = years.groupby('year').transform(random_days)
Output:
>>> years
year day
0 1 18
1 1 300
2 2 154
3 2 355
4 2 311
5 3 18
6 3 14
7 4 160
8 4 304
9 4 67
10 4 6
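As a quick sanity check (a hedged sketch, not part of the answer above), you can confirm that no day repeats within any year group:
print(years.groupby('year')['day'].apply(lambda s: s.is_unique).all())  # expect True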
With numpy broadcasting:
years["day"] = np.random.choice(366, years.shape[0], False) % 366
years["day"] = years.groupby("year").transform(lambda x: np.random.permutation(x))
Output:
print(years)
year day
0 1 233
1 1 147
2 2 1
3 2 340
4 2 267
5 3 204
6 3 256
7 4 354
8 4 94
9 4 196
10 4 164

Conditional filling of column based on string

I have a dataset in which I need to either fill certain rows conditionally or drop them, but so far I have been unsuccessful.
Idx Fruits Days Name
0 60 20
1 15 85.5
2 10 62 Peter
3 40 90 Maria
4 5 10.2
5 92 66
6 65 87 John
7 50 1 Eric
8 50 0 Maria
9 80 87 John
Now, I have some empty cells. I could fill them with fillna or a regex, or drop the empty cells.
However, I only want to handle the leading cells, up until the first string appears, either dropping those rows or filling them with ".",
like below:
Idx Fruits Days Name
0 60 20 .
1 15 85.5 .
2 10 62 Peter
3 40 90 Maria
4 5 10.2
5 92 66
6 65 87 John
7 50 1 Eric
8 50 0 Maria
9 80 87 John
and
Idx Fruits Days Name
2 10 62 Peter
3 40 90 Maria
4 5 10.2
5 92 66
6 65 87 John
7 50 1 Eric
8 50 0 Maria
9 80 87 John
Is this possible with pandas, or with some kind of loop?
You can try this:
df['Name'] = df['Name'].replace('', np.nan)
df['Name'] = df['Name'].where(df['Name'].ffill().notna(), '.')
print(df)
Idx Fruits Days Name
0 0 60 20.0 .
1 1 15 85.5 .
2 2 10 62.0 Peter
3 3 40 90.0 Maria
4 4 5 10.2
5 5 92 66.0
6 6 65 87.0 John
7 7 50 1.0 Eric
8 8 50 0.0 Maria
9 9 80 87.0 John
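For the second desired output, where the leading rows are dropped instead of filled, a minimal sketch along the same lines (assuming the empty cells have been turned into NaN by the replace step above):
df2 = df[df['Name'].ffill().notna()]  # keep rows from the first non-empty Name onwards
print(df2)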

Remove non-integer and non-float values from a column in Pandas? [duplicate]

In my dataset, I have a few rows which contain characters.
I only need rows which contain all integers. What is the best way to do this? Example data set below:
e.g. I want to remove the 2nd and 3rd rows, as they contain 051A, 04A, and 08B.
1 2017 0 321 3 20 42 18
2 051A 0 321 3 5 69 04A
3 460 0 1633 16 38 17 08B
4 1811 0 822 8 13 65 18
Not sure if apply can be avoided here
df.apply(lambda x: pd.to_numeric(x, errors = 'coerce')).dropna()
0 1 2 3 4 5 6 7
0 1 2017.0 0 321 3 20 42 18.0
3 4 1811.0 0 822 8 13 65 18.0
This is very similar to @jpp's solution but differs in the technique used to check whether a value is a digit.
df[df.applymap(lambda x: str(x).isdecimal()).all(1)].astype(int)
0 1 2 3 4 5 6 7
0 1 2017 0 321 3 20 42 18
3 4 1811 0 822 8 13 65 18
Thanks to @jpp for suggesting isdecimal as opposed to isdigit.
For this task, as stated, try / except is a solution which should deal with all cases.
pd.DataFrame.applymap applies a function to each element in the dataframe.
def CheckInt(s):
    try:
        int(s)
        return True
    except ValueError:
        return False

res = df[df.applymap(CheckInt).all(axis=1)].astype(int)
# 0 1 2 3 4 5 6 7
# 0 1 2017 0 321 3 20 42 18
# 3 4 1811 0 822 8 13 65 18
As an alternative to the other good answers, this solution uses the stack + unstack paradigm to avoid a loopy solution.
v = df.stack().astype(str)
v.where(v.str.isdecimal()).unstack().dropna().astype(int)
0 1 2 3 4 5 6 7
0 1 2017 0 321 3 20 42 18
3 4 1811 0 822 8 13 65 18
In one line, I think you can use the convert_objects function from pandas. With it, we convert objects to numbers, which leaves NaN where the conversion fails, and we finally drop the NaN rows.
df = df.convert_objects(convert_numeric=True).dropna()
You can check more information here on pandas documentation.
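Note that convert_objects has since been deprecated and removed in newer pandas versions; a hedged modern equivalent, mirroring the pd.to_numeric approach from the first answer, would be:
df = df.apply(pd.to_numeric, errors='coerce').dropna()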

How to randomly drop rows in Pandas dataframe until there are equal number of values in a column?

I have a dataframe df with two columns, X and y.
In df['y'] I have integers from 1 to 10 inclusive. However, they have different frequencies:
df['y'].value_counts()
10 6645
9 6213
8 5789
7 4643
6 2532
5 1839
4 1596
3 878
2 815
1 642
I want to cut down my dataframe so that there are equal number of occurrences for each label. As I want an equal number of each label, the minimum frequency is 642. So I only want to keep 642 randomly sampled rows of each class label in my dataframe so that my new dataframe has 642 for each class label.
I thought this might have helped; however, stratifying only keeps the same percentage of each label, whereas I want all my labels to have the same frequency.
As an example of a dataframe:
import random
import pandas as pd

df = pd.DataFrame()
df['y'] = sum([[10]*6645, [9]*6213, [8]*5789, [7]*4643, [6]*2532, [5]*1839, [4]*1596, [3]*878, [2]*815, [1]*642], [])
df['X'] = [random.choice(list('abcdef')) for i in range(len(df))]
Use DataFrame.sample with groupby:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 11, 100), columns=['y'])
val_cnt = df['y'].value_counts()
min_sample = val_cnt.min()
print(min_sample)  # outputs 7, as an example
print(df.groupby('y').apply(lambda s: s.sample(min_sample)))
Output
y
y
1 68 1
8 1
82 1
17 1
99 1
31 1
6 1
2 55 2
15 2
81 2
22 2
46 2
13 2
58 2
3 2 3
30 3
84 3
61 3
78 3
24 3
98 3
4 51 4
86 4
52 4
10 4
42 4
80 4
53 4
5 16 5
87 5
... ..
6 26 6
18 6
7 56 7
4 7
60 7
65 7
85 7
37 7
70 7
8 93 8
41 8
28 8
20 8
33 8
64 8
62 8
9 73 9
79 9
9 9
40 9
29 9
57 9
7 9
10 96 10
67 10
47 10
54 10
97 10
71 10
94 10
[70 rows x 1 columns]
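On pandas 1.1 and newer, DataFrameGroupBy.sample gives a more direct route to the same balanced result; a minimal sketch, assuming the df from the question:
min_sample = df['y'].value_counts().min()
balanced = df.groupby('y').sample(n=min_sample)  # requires pandas >= 1.1
print(balanced['y'].value_counts())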

Python to_csv the missing 0 in front of zipcode

I have a data frame
USER =
zipcode userCount
0 00601 5
1 00602 23
2 00603 53
3 00604 2
4 00605 6
5 00606 10
6 00610 8
7 00612 33
8 00613 2
9 00614 2
10 00616 1
11 00617 9
12 00622 6
13 00623 28
14 00624 10
15 00627 8
16 00631 1
17 00637 13
18 00638 9
19 00641 12
20 00646 13
When I save it
USER.to_csv('Total_user.csv',index = False)
I got missing 0 in front of the zipcode. 00601 -> 601
zipcode userCount
601 5
602 23
603 53
604 2
605 6
606 10
610 8
612 33
613 2
614 2
616 1
617 9
622 6
623 28
624 10
627 8
631 1
637 13
638 9
641 12
646 13
Is there anything I missed in the to_csv line? I just want to keep the leading 0 in the csv, so that when I read_csv(low_memory=False) the zipcode has the normal format.
Assuming that the column df['zipcode'] of the first dataframe is already a column of strings, save it as usual (note that to_csv itself takes no dtype argument):
>>> df.to_csv('zipcodes.csv')
Then, when reading, set all data types to str, and afterwards convert back the columns that should not be strings:
>>> pd.read_csv('zipcodes.csv',dtype='str',index_col=0)
zipcode userCount
0 00601 5
1 00602 23
2 00603 53
3 00604 2
4 00605 6
5 00606 10
6 00610 8
7 00612 33
8 00613 2
9 00614 2
10 00616 1
11 00617 9
12 00622 6
13 00623 28
14 00624 10
15 00627 8
16 00631 1
17 00637 13
18 00638 9
19 00641 12
20 00646 13
>>> df['userCount'] = df['userCount'].astype(int)
>>> df.dtypes
zipcode object
userCount int64
dtype: object
Your data is probably being stored as an object type in the data frame. You can confirm this by typing:
df.dtypes
>>> zipCode object
userCount object
dtype: object
Python doesn't like zero-prefixed integers, hence the object dtype. You'll need to quote your data when you save it. You can do this via the quoting parameter of to_csv().
import csv
df.to_csv('tmp.csv', quoting=csv.QUOTE_NONNUMERIC)
If you don't quote your data pandas will convert it to an integer when you re-read it and strip the leading zeros.
Please pass dtype=str as a parameter to read_csv, i.e. read_csv(file, sep, dtype=str).
That will fix the issue.
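If the leading zeros have already been lost on a previous read, a hedged recovery sketch (assuming 5-digit US zip codes) is to pad the column back out with str.zfill:
df['zipcode'] = df['zipcode'].astype(str).str.zfill(5)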
