I have a pandas dataframe and I can select a column I want to look at with:
column_x = data_frame[4]
If I print column_x, I get:
0 AF1000g=0.09
1 AF1000g=0.00
2 AF1000g=0.14
3 AF1000g=0.02
4 AF1000g=0.02
5 AF1000g=0.00
6 AF1000g=0.54
7 AF1000g=0.01
8 AF1000g=0.00
9 AF1000g=0.04
10 AF1000g=0.00
11 AF1000g=0.03
12 AF1000g=0.00
13 AF1000g=0.02
14 AF1000g=0.00
...
I want to count how many rows contain values of AF1000g=0.05 or less, as well as how many rows contain values of AF1000g=0.06 or greater.
Less_than_0.05 = count of rows with AF1000g=0.05 and less
Greater_than_0.05 = count of rows with AF1000g=0.06 and greater
How can I count these values from this column when each value in the column is a string that mixes text and numeric content?
Thank you.
Rodrigo
You can use apply to extract the numerical values, and do the counting there:
vals = column_x.apply(lambda x: float(x.split('=')[1]))
print(sum(vals <= 0.05))  # number of rows with AF1000g=0.05 and less
print(sum(vals >= 0.06))  # number of rows with AF1000g=0.06 and greater
The comment above makes a good point. Generally you should focus on parsing before analyzing.
That said, this isn't too hard. Use pd.Series.str.extract with a regex, then force to a float, then do operations on that.
floats = column_x.str.extract(r"^AF1000g=(.*)$", expand=False).astype(float)
num_less = (floats <= 0.05).sum()
num_greater = (floats > 0.05).sum()
This takes advantage of the fact that the boolean Series returned by the comparisons on floats can be summed directly, because True and False are treated as 1s and 0s.
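For illustration, here is a self-contained version of that idea on a few of the sample values shown above (the toy Series is my own construction, not the asker's actual data):
import pandas as pd

column_x = pd.Series(["AF1000g=0.09", "AF1000g=0.00", "AF1000g=0.14",
                      "AF1000g=0.02", "AF1000g=0.54"])

# pull the number after '=', turn it into a float, then sum the boolean masks
floats = column_x.str.extract(r"^AF1000g=(.*)$", expand=False).astype(float)
num_less = (floats <= 0.05).sum()     # True counts as 1, False as 0
num_greater = (floats > 0.05).sum()
print(num_less, num_greater)          # 2 3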
I want to split all rows into two groups that have similar means.
I have a dataframe of about 50 rows, though it could grow to several thousand, with a column of interest called 'value'.
value total bucket
300048137 3.0741 3.0741 0
352969997 2.1024 5.1765 0
abc13.com 4.5237 9.7002 0
abc7.com 5.8202 15.5204 0
abcnews.go.com 6.7270 22.2474 0
........
www.legacy.com 12.6609 263.0797 1
www.math-aids.com 10.9832 274.0629 1
So far I have tried a cumulative sum, which created the total column, and then made the split at the mid-point of the total, based on this solution.
test['total'] = test['value'].cumsum()
df_sum = test['value'].sum()//2
test['bucket'] = np.where(test['total'] <= df_sum, 0,1)
However, if I group them and take the average of each group, the difference between the means is quite significant:
display(test.groupby('bucket')['value'].mean())
bucket
0 7.456262
1 10.773905
Is there a way I could achieve this partition based on means instead of sums? I was thinking about using expanding means from pandas but couldn't find a proper way to do it.
I am not sure I understand what you are trying to do, but possibly you want to group by quantiles of a column. If so:
test['bucket'] = pd.qcut(test['value'], q=2, labels=False)
which assigns bucket=0 to the half of the rows with the smaller 'value' values and 1 to the rest. By tweaking the q parameter you can have as many groups as you want (as long as it does not exceed the number of rows).
Edit:
A new attempt, now that I think I understand your aim better:
df = pd.DataFrame({'value': np.arange(100)})
df['group'] = df['value'].argsort().mod(2)
df.groupby('group')['value'].mean()
# group
# 0    49.0
# 1    50.0
# Name: value, dtype: float64
df['group'] = df['value'].argsort().mod(3)
df.groupby('group')['value'].mean()
#group
# 0 49.5
# 1 49.0
# 2 50.0
# Name: value, dtype: float64
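One caveat worth adding (my note, not from the original answer): argsort returns the positions that would sort the column, so the alternating trick above balances the groups only because 'value' is already in sorted order. For unsorted data, a rank-based variant along these lines should give the same effect:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'value': rng.permutation(100)})   # deliberately unsorted

# rank(method='first') gives each row its 1-based position in sorted order;
# alternating that rank interleaves small and large values across the groups
df['group'] = df['value'].rank(method='first').sub(1).mod(2).astype(int)
df.groupby('group')['value'].mean()
# group
# 0    49.0
# 1    50.0
# Name: value, dtype: float64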
I am trying to apply a condition to a pandas column by location and am not quite sure how. Here is some sample data:
data = {'Pop':  [728375, 733355, 695395, 734658, 732811, 789396, 727761, 751967],
        'Pop2': [728375, 733355, 695395, 734658, 732811, 789396, 727761, 751967]}
PopDF = pd.DataFrame(data)
remainder = 6
#I would like to subtract 1 from PopDF['Pop2'] column cells 0-remainder.
#The remaining cells in the column I would like to stay as is (retain original pop values).
PopDF['Pop2']= PopDF['Pop2'].iloc[:(remainder)]-1
PopDF['Pop2'].iloc[(remainder):] = PopDF['Pop'].iloc[(remainder):]
The first line works to subtract 1 in the correct locations; however, the remaining cells become NaN. The second line of code does not work; the error is:
ValueError: Length of values (1) does not match length of index (8)
Instead of selecting the first N rows and subtracting from them, subtract 1 from the entire column and assign the result only to the rows up through remainder (labels 0-6 here, since .loc includes the end label):
PopDF.loc[:remainder, 'Pop2'] = PopDF['Pop2'] - 1
Output:
>>> PopDF
Pop Pop2
0 728375 728374
1 733355 733354
2 695395 695394
3 734658 734657
4 732811 732810
5 789396 789395
6 727761 727760
7 751967 751967
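If you prefer to stay with positional indexing, as in the original attempt, a variant like this should also work (my sketch; note that iloc's end is exclusive, unlike loc):
# starting again from the original PopDF
col = PopDF.columns.get_loc('Pop2')
PopDF.iloc[:remainder + 1, col] -= 1   # positions 0 through remainder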
I am trying to select the highest value from this data, but I also need the month it comes from, i.e. printing the whole row. Currently I'm using df.max(), which just pulls the highest value. Does anyone know how to do this in pandas?
#current code
accidents["month"] = accidents.Date.apply(lambda s: int(s.split("/")[1]))
temp = accidents.groupby('month').size().rename('Accidents')
#selecting the highest value from the dataframe
temp.max()
answer given = 10937
the answer I need should look like this (month and number of accidents): 11 10937
The temp Series:
month
1 9371
2 8838
3 9427
4 8899
5 9758
6 9942
7 10325
8 9534
9 10222
10 10311
11 10937
12 9972
Name: Accidents, dtype: int64
It would also be good to rename the count column to 'Accidents', if anyone can help with that too. Thanks
If the maximum value is unique (in your case it is) you can simply take a subset of the Series:
temp[temp == temp.max()]
The comparison keeps only the entry equal to the maximum, so the month is preserved in the index and the accident count is the value (here: month 11, 10937).
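If you only need the single winning row, Series.idxmax is another option (my suggestion, not part of the answer above):
best_month = temp.idxmax()            # index label of the maximum: 11
print(best_month, temp[best_month])   # 11 10937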
The 'ratings' DataFrame has two columns of interest: User-ID and Book-Rating.
I'm trying to make a histogram showing the number of books read per user in this dataset. In other words, I'm looking to count Book-Ratings per User-ID. I'll include the dataset in case anyone wants to check it out.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
!wget https://raw.githubusercontent.com/porterjenkins/cs180-intro-data-science/master/data/ratings_train.csv
ratings = pd.read_csv('ratings_train.csv')
# Remove Values where Ratings are Zero
ratings2 = ratings.loc[(ratings != 0).all(axis=1)]
# Sort by User
ratings2 = ratings2.sort_values(by=['User-ID'])
usersList = []
booksRead = []
for i in range(2000):
    numBooksRead = ratings2.isin([i]).sum()['User-ID']
    if numBooksRead != 0:
        usersList.append(i)
        booksRead.append(numBooksRead)
new_dict = {'User_ID':usersList,'booksRated':booksRead}
usersBooks = pd.DataFrame(new_dict)
usersBooks
The code works as is, but it took almost 5 minutes to complete. And this is the problem: the dataset has 823,000 values. So if it took me 5 minutes to sort through only the first 2000 numbers, I don't think it's feasible to go through all of the data.
I also should admit, I'm sure there's a better way to make a DataFrame than creating two lists, turning them into a dict, and then making that a DataFrame.
Mostly I just want to know how to go through all this data in a way that won't take all day.
Thanks in advance!!
It seems you want a list of user IDs with a count of how often each ID appears in the dataframe. Use value_counts() for that:
ratings = pd.read_csv('ratings_train.csv')
# Remove Values where Ratings are Zero
ratings2 = ratings.loc[(ratings != 0).all(axis=1)]
In [74]: ratings2['User-ID'].value_counts()
Out[74]:
11676 6836
98391 4650
153662 1630
189835 1524
23902 1123
...
258717 1
242214 1
55947 1
256110 1
252621 1
Name: User-ID, Length: 21553, dtype: int64
The result is a Series, with the User-ID as index, and the value is number of books read (or rather, number of books rated by that user).
Note: be aware that the result is heavily skewed: there are a few very active readers, but most will have rated very few books. As a result, your histogram will likely just show one bin.
Taking the log (or plotting with the x-axis on a log scale) may show a clearer histogram; with the value_counts() result stored in a variable s:
np.log(s).hist()
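For completeness, a minimal plotting sketch along those lines (the bin count and labels are my own choices):
import numpy as np
import matplotlib.pyplot as plt

s = ratings2['User-ID'].value_counts()   # books rated per user
np.log10(s).hist(bins=30)                # a log scale tames the skew
plt.xlabel('log10(books rated per user)')
plt.ylabel('number of users')
plt.show()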
First filter the Book-Rating column to remove 0 values, then count values with Series.value_counts and convert to a DataFrame; the loop is not necessary:
ratings = pd.read_csv('ratings_train.csv')
ratings2 = ratings[ratings['Book-Rating'] != 0]
usersBooks = (ratings2['User-ID'].value_counts()
                                 .sort_index()
                                 .rename_axis('User_ID')
                                 .reset_index(name='booksRated'))
print(usersBooks)
User_ID booksRated
0 8 6
1 17 4
2 44 1
3 53 3
4 69 2
... ...
21548 278773 3
21549 278782 2
21550 278843 17
21551 278851 10
21552 278854 4
[21553 rows x 2 columns]
I have a dataframe with one column of UNIX timestamps:
UNIX Timestamp
1546357505
1546357518
1546357609412
1546357612
1546357761
I want to make all values 10-digit numbers, so here only the value "1546357609412" needs to be divided by 1000. The final output should be:
UNIX Timestamp
1546357505
1546357518
1546357609
1546357612
1546357761
I tried using the div function, but I'm not sure how to check whether a value has 13 digits or not. Also, this column has 5 million values, so I need an efficient way to make the change.
You could cast the column to str, slice the first 10 digits, then cast back to int:
df['Timestamp'] = df['Timestamp'].astype(str).str[:10].astype(int)
0 1546357505
1 1546357518
2 1546357609
3 1546357612
4 1546357761
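If you would rather stay numeric (my alternative, not from the answer above), you can floor-divide only the 13-digit rows; for 5 million values this avoids the string round-trip:
import numpy as np
import pandas as pd

df = pd.DataFrame({'UNIX Timestamp': [1546357505, 1546357518, 1546357609412,
                                      1546357612, 1546357761]})

# anything with 11 or more digits (>= 10**10) is assumed to be in milliseconds
is_ms = df['UNIX Timestamp'] >= 10**10
df['UNIX Timestamp'] = np.where(is_ms, df['UNIX Timestamp'] // 1000, df['UNIX Timestamp'])
print(df)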