How to select rows in dataframe based on a condition - python

I have an emails dataframe in which I have given this query:
williams = emails[emails["employee"] == "kean-s"]
This selects all the rows whose employee is kean-s. Then I count the folder frequencies and print the most frequent ones. This is how it's done:
williams["X-Folder"].value_counts()[:10]
This gives output like this:
attachments 2026
california 682
heat wave 244
ferc 188
pr-crisis management 92
federal legislation 88
rto 78
india 75
california - working group 72
environmental issues 71
Now I need to print all the rows from emails whose X-Folder column equals attachments, california, heat wave, etc. How do I go about it? When I print values[0] it simply returns the frequency and not the term corresponding to it (I tried printing it because, if I can loop through the terms, I'll just put a condition inside the dataframe).

Use Series.isin with boolean indexing, testing against the values of the value_counts index:
df = williams[williams["X-Folder"].isin(williams["X-Folder"].value_counts()[:10].index)]
Or:
df = williams[williams["X-Folder"].isin(williams["X-Folder"].value_counts().index[:10])]
If you need to filter all rows in the original DataFrame (including rows that do not match kean-s), then use:
df1 = emails[emails["X-Folder"].isin(williams["X-Folder"].value_counts().index[:10])]
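A minimal self-contained sketch with made-up data (the column names are taken from the question) showing how the index of value_counts feeds isin:
import pandas as pd

# hypothetical miniature version of the emails frame
emails = pd.DataFrame({
    "employee": ["kean-s", "kean-s", "kean-s", "other"],
    "X-Folder": ["attachments", "california", "attachments", "ferc"],
})

williams = emails[emails["employee"] == "kean-s"]

# the index of value_counts holds the folder names, the values hold the frequencies
top_folders = williams["X-Folder"].value_counts()[:10].index
print(williams[williams["X-Folder"].isin(top_folders)])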

Related

pyspark sum of rows to get one new row based on values in another column in groupby

I have a dataframe with user, location and values columns. Currently the US-related locations are in different rows for one user:
user location values
209 OH_US 45
O09 PA_US 30
O09 AQ 10
209 CA_US 50
209 UK 10
....
For each user I want to generate a new row that replaces the US-related locations with their sum under the location name 'US', and remove the rows for the individual US states.
Expected result looks like this:
user location values
209 US 200
209 UK 10
O09 US 300
O09 AQ 10
...
Currently I'm thinking of pulling all US-related rows into a separate dataframe, doing the sum in a groupby, then dropping all US-related rows from the original dataframe and joining it with the summed US dataframe.
Is there a more efficient way to do this?
There are multiple approaches to solve this in PySpark.
Using spark.sql:
df.createOrReplaceTempView("SAMPLE_TABLE")
df2 = spark.sql("""
    SELECT user,
           CASE WHEN location LIKE '%_US' THEN 'US' ELSE location END AS location,
           SUM(values) AS values
    FROM SAMPLE_TABLE
    GROUP BY user,
             CASE WHEN location LIKE '%_US' THEN 'US' ELSE location END
""")
df2.show()
Using the PySpark API:
import pyspark.sql.functions as F

# group by user and by the normalised location, matching the expected result
(df.groupBy("user",
            F.when(F.col("location").like("%_US"), "US")
             .otherwise(F.col("location"))
             .alias("location"))
   .agg(F.sum("values").alias("values"))
   .show())
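For reference, a small sketch of how the sample data from the question could be set up to try either snippet (this assumes an active SparkSession named spark):
# hypothetical reconstruction of the sample rows shown in the question
df = spark.createDataFrame(
    [("209", "OH_US", 45), ("O09", "PA_US", 30), ("O09", "AQ", 10),
     ("209", "CA_US", 50), ("209", "UK", 10)],
    ["user", "location", "values"],
)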

Python range problem using JPype1: cannot use a for loop over every column and row

I have some data and I would like to call a package; the package is a JAR file. Anyway, I have a CSV file:
Ürünler01 Ürünler02 Ürünler03 Ürünler04
0 trafik musavirligi na na
1 aruba 2930f 48g poe
2 minilink 6363 721l na
3 rendezvous point appliance na
4 in uzak oku sayaç na
... ... ... ... ...
79 inbpano kurulum kor panos
80 tn card değişim na
81 servis kapı kaynaklı panel
82 evrensel microwave outdoor unit
83 hp ekipman na na
As you can see, the column names are 'Ürünler01', 'Ürünler02', 'Ürünler03', 'Ürünler04'.
I would like to apply my clean_messages function, shown below:
new = []
for message in df['Ürünler01']:
    new.append(clean_messages(message))
After that code I turn the result back into a dataframe so I can publish it:
df = pd.DataFrame(new)
And the result is:
df
0
0 trafik
1 araba
2 minicik
3 rendezvous
4 in uzak
... ...
79 inbpano
80 en
81 servis
82 evrensel
83 hp
My question is that I cannot apply my clean_messages function over all of Ürünler01, Ürünler02, Ürünler03 and Ürünler04. I could not figure out how to use iloc or loc, or how to use for loops here in Python. In C I would use nested i and j loops and could apply my function to all rows and columns, but here I cannot apply my function to all columns.
Please help. I added pictures below: I can print out the "0" column, but I also need columns 1, 2 and 3, which are highlighted in the screenshots. I am waiting for your help.
The final shape of your dataframe isn't very clear from your question, but you could iterate over the column names (the default iteration over a dataframe) and then over the rows by indexing each column from the original dataframe by name:
import pandas as pd

# load dataframe
df = pd.read_csv("path_to_file.csv")

# collect the cleaned messages from every column into one list
cleaned = []
for colname in df:               # iterate over the column names
    for message in df[colname]:  # iterate over the rows in the column
        cleaned.append(clean_messages(message))

series = pd.Series(cleaned, dtype=str)
df_result = pd.DataFrame(series)  # optional, can use the series directly
However, you may be able to use df.apply directly to apply clean_messages to every value in your dataframe:
df_result = pd.DataFrame()
for colname in df:
    df_result[colname] = df[colname].apply(clean_messages)
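If clean_messages works on a single value, the per-column loop can also be collapsed into one call; a sketch assuming a pandas version that provides DataFrame.applymap (element-wise application over every cell):
df_result = df.applymap(clean_messages)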

Pandas: Split columns having multiple values

An example of my df:
Index time type pwa0 pwa1 pwa2
63 16:05:03 nonJ [20:733:845] [] [2750]
I would like to split the columns having no, one or multiple values (pwa0, pwa1 and pwa2) like this:
Index time type pwa0 pwa1 pwa2
63 16:05:03 nonJ 20 2750
63 16:05:03 nonJ 733
63 16:05:03 nonJ 845
In contrast to the suggested duplicate, the columns to be split are not correlated. The split order per column should simply be positional: first the first value, then the second, and so on. If no values are present in a column, the cell should remain empty. Any suggestions will be highly appreciated!
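One possible sketch, assuming the cells are stored as strings like "[20:733:845]" (this reconstruction of the example row is hypothetical): parse each cell into a list, then emit one output row per position, leaving missing positions empty:
import pandas as pd

# hypothetical reconstruction of the example row; cell format assumed to be "[a:b:c]"
df = pd.DataFrame({
    "time": ["16:05:03"],
    "type": ["nonJ"],
    "pwa0": ["[20:733:845]"],
    "pwa1": ["[]"],
    "pwa2": ["[2750]"],
})
value_cols = ["pwa0", "pwa1", "pwa2"]

def parse(cell):
    # turn "[20:733:845]" into ['20', '733', '845'] and "[]" into []
    inner = cell.strip("[]")
    return inner.split(":") if inner else []

rows = []
for idx, row in df.iterrows():
    lists = {c: parse(row[c]) for c in value_cols}
    longest = max(len(v) for v in lists.values())
    for i in range(longest):
        out = {"time": row["time"], "type": row["type"]}
        # take the i-th value from each column, or leave the cell empty
        out.update({c: lists[c][i] if i < len(lists[c]) else "" for c in value_cols})
        rows.append(out)

result = pd.DataFrame(rows)
print(result)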

Python: How to efficiently do operations using different rows of the same column?

My goal is, given a value in a row (let's say 3), to look up the value of a given column 3 rows below. Currently I am performing this using for loops, but it is tremendously inefficient.
I have read that vectorizing can help solve this problem, but I am not sure how.
My data is like this:
Date DaysToReception Quantity QuantityAtTheEnd
20/03 3 102
21/03 - 88
22/03 - 57
23/03 5 178
24/03
And I want to obtain:
Date DaysToReception Quantity QuantityAtReception
20/03 3 102 178
21/03 - 88
22/03 - 57
23/03 5 178
24/03
...
Thanks for your help!
If you have a unique date or DaysToReception, you can use a map/dict where the key is the date or DaysToReception and the value is the rest of the row's information, stored in a list or any other appropriate data structure. This will definitely improve the efficiency.
Since, as you pointed out, the number of rows to search below depends on the value of DaysToReception, I believe DaysToReception will not be unique. In that case, the key to your map should be the date.
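A rough sketch of that idea in plain Python (the rows are a hypothetical transcription of the table in the question):
# hypothetical rows transcribed from the question's table
rows = [
    {"date": "20/03", "days": 3, "quant": 102},
    {"date": "21/03", "days": None, "quant": 88},
    {"date": "22/03", "days": None, "quant": 57},
    {"date": "23/03", "days": 5, "quant": 178},
]

# map each date to its row so a single lookup replaces a linear scan
by_date = {row["date"]: row for row in rows}
print(by_date["20/03"])  # {'date': '20/03', 'days': 3, 'quant': 102}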
The easiest way I can think of to do this in pandas is the following:
import pandas as pd

# something like your dataframe
df = pd.DataFrame(dict(date=['20/03', '21/03', '22/03', '23/03'],
                       days=[3, None, None, 5],
                       quant=[102, 88, 57, 178]))

# get the indexes of all rows where days isn't missing
idxs = df.index[~pd.isnull(df.days)]

# get the number of rows to look ahead
values = df.days[idxs].values.astype(int)

# get the index of the row that many rows ahead
new_idxs = idxs + values

# keep only targets that actually exist in the frame
mask = new_idxs.isin(df.index)

# create a blank column, then fill it with the data we're after
df['quant_end'] = None
df.loc[idxs[mask], 'quant_end'] = df.loc[new_idxs[mask], 'quant'].values
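With the four-row sample above, only the 20/03 row gets quant_end filled in (178, the quantity three rows below); the 23/03 row points five rows past the end of the frame, so it stays empty, which matches the expected output in the question.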

Python conditional filtering in csv file

Please help! I have tried different approaches and packages while writing a program that takes in 4 inputs and returns the writing score statistics of a group based on that combination of inputs from a csv file. This is my first project, so I would appreciate any insights/hints/tips!
Here is the csv sample (has 200 rows total):
id gender ses schtyp prog write
70 male low public general 52
121 female middle public vocation 68
86 male high public general 33
141 male high public vocation 63
172 male middle public academic 47
113 male middle public academic 44
50 male middle public general 59
11 male middle public academic 34
84 male middle public general 57
48 male middle public academic 57
75 male middle public vocation 60
60 male middle public academic 57
Here is what I have so far:
import csv
import numpy

csv_file_object = csv.reader(open('scores.csv', 'rU'))  # reads file
header = csv_file_object.next()  # skips header (Python 2)
data = []  # loads data into array for processing
for row in csv_file_object:
    data.append(row)
data = numpy.array(data)

# asks for inputs
gender = raw_input('Enter gender [male/female]: ')
schtyp = raw_input('Enter school type [public/private]: ')
ses = raw_input('Enter socioeconomic status [low/middle/high]: ')
prog = raw_input('Enter program status [general/vocation/academic]: ')

# makes them lower-case strings
prog = str(prog.lower())
gender = str(gender.lower())
schtyp = str(schtyp.lower())
ses = str(ses.lower())
What I am missing is how to filter and get stats for only a specific group. For example, say I input male, public, middle, and academic -- I'd want to get the average writing score for that subset. I tried the groupby function from pandas, but that only gets stats for broad groups (such as public vs private). I also tried DataFrame from pandas, but that only let me filter on one input, and I'm not sure how to get the writing scores. Any hints would be greatly appreciated!
Agreeing with Ramon, Pandas is definitely the way to go, and has extraordinary filtering/sub-setting capability once you get used to it. But it can be tough to first wrap your head around (or at least it was for me!), so I dug up some examples of the sub-setting you need from some of my old code. The variable itu below is a Pandas DataFrame with data on various countries over time.
# Subsetting by using True/False:
subset = itu['CntryName'] == 'Albania' # returns True/False values
itu[subset] # returns 1x144 DataFrame of only data for Albania
itu[itu['CntryName'] == 'Albania'] # one-line command, equivalent to the above two lines
# Pandas has many built-in functions like .isin() to provide params to filter on
itu[itu.cntrycode.isin(['USA','FRA'])] # returns where itu['cntrycode'] is 'USA' or 'FRA'
itu[itu.year.isin([2000,2001,2002])] # Returns all of itu for only years 2000-2002
# Advanced subsetting can include logical operations:
itu[itu.cntrycode.isin(['USA','FRA']) & itu.year.isin([2000,2001,2002])] # Both of above at same time
# Use .loc with two elements to simultaneously select by row/index & column:
itu.loc['USA','CntryName']
itu.iloc[204,0]
itu.loc[['USA','BHS'], ['CntryName', 'Year']]
itu.iloc[[204, 13], [0, 1]]
# Can do many operations at once, but this reduces "readability" of the code
itu[itu.cntrycode.isin(['USA','FRA']) &
    itu.year.isin([2000,2001,2002])].loc[:, ['cntrycode','cntryname','year','mpen','fpen']]
# Finally, if you're comfortable with using map() and list comprehensions,
# you can do some advanced subsetting that includes evaluations & functions
# to determine what elements you want to select from the whole, such as all
# countries whose name begins with "United":
criterion = itu['CntryName'].map(lambda x: x.startswith('United'))
itu[criterion]['CntryName'] # gives us UAE, UK, & US
Look at pandas. I think it will shorten your csv parsing work and give the subset functionality you're asking for...
import pandas as pd
data = pd.read_csv('fileName.txt', delim_whitespace=True)
#get all of the male students
data[data['gender'] == 'male']
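Combining the two answers for the original question, a sketch (assuming the file and column names from the CSV sample, read as whitespace-delimited like the answer above) that filters on all four inputs at once and reports the writing score statistics:
import pandas as pd

data = pd.read_csv('scores.csv', delim_whitespace=True)

# example inputs; in the original script these come from raw_input and .lower()
gender, schtyp, ses, prog = 'male', 'public', 'middle', 'academic'

# boolean masks for each of the four inputs, combined with &
subset = data[(data['gender'] == gender) &
              (data['schtyp'] == schtyp) &
              (data['ses'] == ses) &
              (data['prog'] == prog)]

# average writing score (and a fuller summary) for that group
print(subset['write'].mean())
print(subset['write'].describe())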
