dataframe count frequency of a string in a column - python

I have a CSV with 1000 rows that I read in Python, and I am returning a new dataframe with three extra columns:
noOfPeople, Description, and Location.
My final df will be like this one:
id companyName noOfPeople Description Location
1 comp1 75 tech USA
2 comp2 22 fashion USA
3 comp3 70 tech USA
I want to write code that stops once I have 200 rows where noOfPeople is greater than or equal to 70, leaving all the remaining rows empty. So the code should count rows where noOfPeople >= 70; once 200 rows meet this condition, the code stops.
Can someone help?

df[df['noOfPeople'] >= 70].iloc[:200]

Use head or iloc to select the first 200 values and then get the max:
print(df1['noOfPeople'].iloc[:200].max())
Then add whatever filter you need.
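Putting the two pieces together, a minimal sketch (with hypothetical data standing in for the 1000-row CSV described in the question):

```python
import pandas as pd

# Hypothetical rows with the columns from the question
df = pd.DataFrame({
    'companyName': ['comp1', 'comp2', 'comp3', 'comp4'],
    'noOfPeople': [75, 22, 70, 80],
    'Description': ['tech', 'fashion', 'tech', 'retail'],
    'Location': ['USA', 'USA', 'USA', 'USA'],
})

# Keep only rows where noOfPeople >= 70, stopping after the first 200 matches
result = df[df['noOfPeople'] >= 70].head(200)
```

Because the filter is applied first, `head(200)` simply caps the matches at 200; with fewer matches you just get them all.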

Related

Combine Duplicate Rows in a Column in PySpark Dataframe

I have duplicate rows in a PySpark data frame and I want to combine and sum all of them into one row per column based on duplicate entries in one column.
Current Table
Deal_ID Title Customer In_Progress Deal_Total
30 Deal 1 Client A 350 900
30 Deal 1 Client A 360 850
50 Deal 2 Client B 30 50
30 Deal 1 Client A 125 200
30 Deal 1 Client A 90 100
10 Deal 3 Client C 32 121
Attempted PySpark Code
F.when(F.count(F.col('Deal_ID')) > 1, F.sum(F.col('In_Progress')) && F.sum(F.col('Deal_Total'))))
.otherwise(),
Expected Table
Deal_ID Title Customer In_Progress Deal_Total
30 Deal 1 Client A 925 2050
50 Deal 2 Client B 30 50
10 Deal 3 Client C 32 121
I think you need to group by the columns with duplicated rows and then aggregate the amounts. I think this solves your problem:
df = df.groupBy(['Deal_ID', 'Title', 'Customer']).agg({'In_Progress': 'sum', 'Deal_Total': 'sum'})
You also have a SQL tag, so this is how it would work there:
select
deal_id,
title,
customer,
sum(in_progress) as in_progress,
sum(deal_total) as deal_total
from <table_name>
group by 1,2,3
Otherwise you can use the equivalent groupby function in pandas and apply it to your dataframe:
pass in the columns you need to aggregate by as a list,
then specify the aggregation type and the column you want to add up:
df = df.groupby(['deal_id', 'title', 'customer']).agg({'in_progress': 'sum', 'deal_total': 'sum'})
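As a runnable sketch in pandas, using the sample rows from the question's "Current Table":

```python
import pandas as pd

# Sample data reproducing the question's table
df = pd.DataFrame({
    'Deal_ID':     [30, 30, 50, 30, 30, 10],
    'Title':       ['Deal 1', 'Deal 1', 'Deal 2', 'Deal 1', 'Deal 1', 'Deal 3'],
    'Customer':    ['Client A', 'Client A', 'Client B', 'Client A', 'Client A', 'Client C'],
    'In_Progress': [350, 360, 30, 125, 90, 32],
    'Deal_Total':  [900, 850, 50, 200, 100, 121],
})

# Group on the identifying columns and sum the amounts;
# as_index=False keeps the keys as regular columns
out = (df.groupby(['Deal_ID', 'Title', 'Customer'], as_index=False)
         .agg({'In_Progress': 'sum', 'Deal_Total': 'sum'}))
```

This collapses the four Deal_ID 30 rows into one, matching the expected table (In_Progress 925, Deal_Total 2050).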

Iterate over certain columns with unique values and generate plots python

New to pandas and much help would be appreciated. I'm currently analyzing some Airbnb data and have over 50 different columns. Some of these columns have tens of thousands of unique values while some have very few unique values (categorical).
How do I loop over the columns that have less than 10 unique values to generate plots for them?
Count of unique values in each column:
id 38185
last_scraped 3
name 36774
description 34061
neighborhood_overview 18479
picture_url 37010
host_since 4316
host_location 1740
host_about 14178
host_response_time 4
host_response_rate 78
host_acceptance_rate 101
host_is_superhost 2
host_neighbourhood 486
host_total_listings_count 92
host_verifications 525
host_has_profile_pic 2
host_identity_verified 2
neighbourhood_cleansed 222
neighbourhood_group_cleansed 5
property_type 80
room_type 4
The above is stored through unique_vals = df.nunique()
Apologies if this is a repeat question; the closest answer I could find was Iterate through columns to generate separate plots in python, but it pertained to the entire data set.
Thanks!
You can filter the columns using df.columns[unique_vals < 10]
You can also pass the df.nunique() call directly if you wish:
unique_columns = df.columns[df.nunique() < 10]
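Putting the loop together, a sketch with a small hypothetical frame standing in for the Airbnb data (a bar chart of value_counts is one reasonable default plot for a categorical column):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

# Hypothetical stand-in for the Airbnb dataframe
df = pd.DataFrame({
    'id': range(100),
    'room_type': ['Entire home', 'Private room', 'Shared room', 'Hotel room'] * 25,
    'host_is_superhost': ['t', 'f'] * 50,
})

# Columns with fewer than 10 unique values
unique_columns = df.columns[df.nunique() < 10]

# One plot per low-cardinality column
for col in unique_columns:
    df[col].value_counts().plot(kind='bar', title=col)
    plt.show()
    plt.close()
```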

How do I subset with .isin (seems like it doesn't work properly)?

I'm a student at Moscow State University doing a small research project about suburban railroads. I crawled information from Wikipedia about all stations in the Moscow region, and now I need to subset those that are Moscow Central Diameter 1 (railway line) stations. I have a list of Diameter 1 stations (d1_names), and what I'm trying to do is subset the whole dataframe (suburban_rail) with the isin pandas method. The problem is it returns only 2 stations (the first one and the last one), though I'm pretty sure there are more, because using str.contains with the supposedly absent stations returns what I was looking for (so they are in the dataframe). I've already checked the spelling and tried applying strip() to each element of both the dataframe and the stations' list. Attached are several screenshots of my code.
suburban_rail dataframe
stations' list I use to subset
what isin returns
checking manually for Bakovka station
checking manually for Nemchinovka station
Thanks in advance!
Next time provide a minimal reproducible example, such as the one below:
suburban_rail = pd.DataFrame({'station_name': ['a','b','c','d'], 'latitude': [1,2,3,4], 'longitude': [10,20,30,40]})
d1_names = pd.Series(['a','c','d'])
suburban_rail
station_name latitude longitude
0 a 1 10
1 b 2 20
2 c 3 30
3 d 4 40
Now, to answer your question: using .loc the problem is solved:
suburban_rail.loc[suburban_rail.station_name.isin(d1_names)]
station_name latitude longitude
0 a 1 10
2 c 3 30
3 d 4 40
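If isin still misses rows that are visibly present, a common culprit is invisible characters (non-breaking spaces, trailing whitespace) on one of the two sides; normalizing both sides right before the comparison is a cheap check. A sketch with hypothetical station names containing such characters:

```python
import pandas as pd

# Hypothetical names: one has a trailing non-breaking space, one a leading space
suburban_rail = pd.DataFrame({'station_name': ['Bakovka\xa0', ' Nemchinovka', 'Odintsovo']})
d1_names = pd.Series(['Bakovka', 'Nemchinovka'])

# Strip whitespace (including non-breaking spaces) from both sides before isin
clean = suburban_rail['station_name'].str.strip()
mask = clean.isin(d1_names.str.strip())
subset = suburban_rail[mask]
```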

Plot multiple rows of dataframe in pandas for specific columns

df
SKU Comp Brand Jan_Sales Feb_Sales Mar_sales Apr_sales Dec_sales..
A AC BA 122 100 50 200 300
B BC BB 100 50 80 90 250
C CC BC 40 30 100 10 11
and so on
Now I want a graph that plots Jan sales, Feb sales and so on through Dec as one line for SKU A, similarly one line on the same graph for SKU B, and the same for SKU C.
I read a few answers saying that I need to transpose my data, something like this:
df.T.plot()
However my first column is SKU, and I want to plot based on that; the rest of the columns are numeric. So the SKU name should label each line, and plotting should be row-wise.
EDIT (added after receiving some answers, as I am facing this issue in a few other datasets):
let's say I don't want the columns Comp, Brand, etc.; then what should I do?
Use DataFrame.set_index to convert SKU to the index and then transpose:
df.set_index('SKU').T.plot()
Use set_index then transpose:
df.set_index("SKU").T.plot()
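Addressing the edit: if there are extra non-sales columns such as Comp and Brand, select just the monthly columns before transposing. A sketch (column names assumed from the question's sample; the case-insensitive regex covers the mixed Jan_Sales / Mar_sales spelling):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend for plotting

# Hypothetical data matching the question's layout
df = pd.DataFrame({
    'SKU': ['A', 'B', 'C'],
    'Comp': ['AC', 'BC', 'CC'],
    'Brand': ['BA', 'BB', 'BC'],
    'Jan_Sales': [122, 100, 40],
    'Feb_Sales': [100, 50, 30],
    'Mar_sales': [50, 80, 100],
})

# Keep only the monthly sales columns, then transpose so months form the x-axis
monthly = df.set_index('SKU').filter(regex='(?i)_sales$').T
monthly.plot()  # one line per SKU, labelled by SKU name
```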

Trying to code a Python equivalent of SUMIFs feature in Excel

I am trying to rewrite a .xlsx file from scratch using Python. The excel sheet has 99 rows and 11 columns. I have generated 99 rows x 8 columns already and I am currently working on generating the 99 rows x 9th column.
This 9th column is calculated based on a SUM-IFS formula in excel. It takes into account columns 2, 4 and 7.
Col. 2 has numerical int values.
Col. 4 has three letter airport code values like NYC for New York City
Col. 7 also has three letter airport code values like DEL for Delhi.
The sum-if formula for column 9 cells
SUMIFS(B:B, D:D, D2, G:G, G2)
Hence it sums the numerical values in column 2 for corresponding cities in col. 4 and col. 7. If there is only one occurrence of a pair of cities in col. 4 and col. 7, there is nothing to sum and the cell in col. 9 equals the int value of the cell in col. 2.
However, if there are multiple occurrences of a pair of cities in col. 4 and col. 7, then the corresponding values in col. 2 are summed and that becomes the value of the cell in col. 9.
Example:
In this example, col. 2 is Sales, col.4 is Origin City, col. 7 is Destination City and col. 9 is the Result that utilizes =SUMIFS(B:B,C:C,C2,D:D,D2)
I am trying to calculate the column 9 using python on the large data set that I have. For now, I have been able to create a list of dictionaries, where I have made the key as origin_city-destination_city and the value as the integer value of col. 2. The list of dicts has 99 rows like the excel file, hence each row of the excel file is represented as a dict. On printing the dictionary, it is something like this:
{'YTO-YVR': 570}
{'YVR-YTO': 542}
{'YTO-YYC': 420}
{'YYT-YTO': 32}
{'YWG-YYC': 115}
I have been contemplating whether it is possible to loop over the list of dicts and create a SUMIFS version of it, resulting in 99 dicts in the list, each holding the summed value. After this I have to write all these values to the column in the excel file.
I hope someone here can help! Thank you very much in advance :)
You can use pandas' groupby with transform:
import pandas as pd

df = pd.DataFrame({'Sales': [100, 110, 200, 300, 150, 200, 100],
                   'Origin': ['YYZ', 'YEA', 'CDG', 'YYZ', 'YEA', 'YVR', 'YEA'],
                   'Dest': ['DEL', 'NYC', 'YUL', 'DEL', 'YTO', 'HKG', 'NYC']})
df['Result'] = df.groupby(['Origin', 'Dest']).Sales.transform('sum')
Result:
Sales Origin Dest Result
0 100 YYZ DEL 400
1 110 YEA NYC 210
2 200 CDG YUL 200
3 300 YYZ DEL 400
4 150 YEA YTO 150
5 200 YVR HKG 200
6 100 YEA NYC 210
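If you'd rather stay with the list of dicts you already built, the same SUMIFS logic works with a plain dictionary accumulating totals per origin-destination key. A sketch using hypothetical keys in the format shown above:

```python
from collections import defaultdict

# Each sheet row as a {'origin-dest': sales} dict, as described in the question
rows = [{'YYZ-DEL': 100}, {'YEA-NYC': 110}, {'CDG-YUL': 200},
        {'YYZ-DEL': 300}, {'YEA-YTO': 150}, {'YVR-HKG': 200}, {'YEA-NYC': 100}]

# First pass: total sales per city pair (the SUMIFS sum)
totals = defaultdict(int)
for row in rows:
    for key, sales in row.items():
        totals[key] += sales

# Second pass: column 9 value for each row is the total for its city pair
col9 = [totals[next(iter(row))] for row in rows]
```

col9 can then be written back as the ninth column, one value per original row.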
