Looping over blocks of data - python

I'm an amateur coder and very new to PySpark. I need to convert my Python code to PySpark so it can run over millions of rows, and I'm stumped on simple things. I need to iterate a few operations within blocks of data. I have done it with the code below in Python, but I cannot understand how to do the same thing in PySpark. Can someone please help?
Python code:
new_df = pd.DataFrame()
for blockingid, df in old_df.groupby(by=['blockingId']):
    nons = df[df.groupby("country")['country'].transform('size') > 1]
    new_df = pd.concat([new_df, nons], axis=0)
A sample block of input could be:
Name      blockingID    Country
name_1    block_1       country_1
name_2    block_1       country_2
name_3    block_2       country_2
name_4    block_2       country_2
The above code groups the data by block and removes the countries that appear only once within that block.
I know about window functions in PySpark, but I don't understand how to apply operations within the blocks (there are many more operations to do within each block) and append the output I get to another dataframe. Please help.
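There is no direct equivalent of appending inside a Python loop in PySpark, but the same filter can be expressed with a window partitioned by block and country. A minimal sketch, assuming the column names from the sample above (blockingId, Country):
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

old_df = spark.createDataFrame(
    [("name_1", "block_1", "country_1"),
     ("name_2", "block_1", "country_2"),
     ("name_3", "block_2", "country_2"),
     ("name_4", "block_2", "country_2")],
    ["Name", "blockingId", "Country"],
)

# Count how many times each country appears within its block,
# then keep only the rows whose country occurs more than once in that block.
w = Window.partitionBy("blockingId", "Country")
new_df = (old_df
          .withColumn("country_cnt", F.count("*").over(w))
          .filter(F.col("country_cnt") > 1)
          .drop("country_cnt"))

new_df.show()
Further per-block operations can be added the same way, as extra expressions over a window partitioned by blockingId, rather than by looping over blocks explicitly.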

Related

Need help to transform a column of a dataframe into multiple columns for timestamps

Hello everyone, I need some help with the question below.
My dataframe looks like the one below, and I am using PySpark.
The time column needs to be split into two columns, 'start time' and 'end time', like below.
I tried a couple of methods, like self-joining the df on m_id, but it looks very tedious and inefficient. I would appreciate it if someone could help me with this.
Thanks in advance.
Performing something based on row order in Spark is not a good idea. Row order is preserved while reading the file, but it may get shuffled between transformations, and then there is no way to know which row was the previous one (the start time). You would need to ensure that no shuffling happens in order to avoid this, but that leads to other complexities.
My suggestion is to work on the file at the source level and add a row-number column, like:
r_n m_id time
0 2 2022-01-01T12:12:12.789+000
1 2 2022-01-01T12:14:12.789+000
2 2 2022-01-01T12:16:12.789+000
Later in Spark you do a left self-join on r_n, like:
from pyspark.sql.functions import col, lit

df1 = df.withColumn("next_r", col("r_n") + lit(1))
df_final = df.join(df1, df1.next_r == df.r_n, "left").select(df["m_id"], df1["time"].alias("start_time"), df["time"].alias("end_time"))
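Once an ordering column like r_n exists, an alternative sketch (same assumed column names, not part of the answer above) is a window with lag(), which gives the previous row's time as the start time without a self-join:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Order rows within each m_id by the row-number column added at the source.
w = Window.partitionBy("m_id").orderBy("r_n")

df_final = (df
            .withColumn("start_time", F.lag("time").over(w))   # previous row's time
            .select("m_id", "start_time", F.col("time").alias("end_time")))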

Dropping Rows that Contain a Specific String wrapped in square brackets?

I'm trying to drop rows whose Comments column contains a string wrapped in square brackets. I want to drop all values that contain the strings '[removed]' or '[deleted]'.
My df looks like this:
Comments
1 The main thing is the price appreciation of the token (this determines the gains or losses more
than anything). Followed by the ecosystem for the liquid staking asset, the more opportunities
and protocols that accept the asset as collateral, the better. Finally, the yield for staking
comes into play.
2 [deleted]
3 [removed]
4 I could be totally wrong, but sounds like destroying an asset and claiming a loss, which I
believe is fraudulent. Like someone else said, get a tax guy - for this year anyway and then
you'll know for sure. Peace of mind has value too.
I have tried df[df["Comments"].str.contains("removed")==False], but when I save the dataframe the rows are still not removed.
EDIT:
My full code
import pandas as pd
sol2020 = pd.read_csv("Solana_2020_Comments_Time_Adjusted.csv")
sol2021 = pd.read_csv("Solana_2021_Comments_Time_Adjusted.csv")
df = pd.concat([sol2021, sol2020], ignore_index=True, sort=False)
df[df["Comments"].str.contains("deleted")==False]
df[df["Comments"].str.contains("removed")==False]
Try this. I have created a dataframe with a Comments column and used my own comments, but it should work for you:
import pandas as pd
sample_data = { 'Comments': ['first comment whatever','[deleted]','[removed]','last comments whatever']}
df = pd.DataFrame(sample_data)
data = df[df["Comments"].str.contains("deleted|removed")==False]
print(data)
The output I got:
Comments
0 first comment whatever
3 last comments whatever
You can do it like this:
new_df = df[~(df['Comments'].str.startswith('[') & df['Comments'].str.endswith(']'))].reset_index(drop=True)
Output:
>>> new_df
Comments
0 The main thing is the price appreciation of th...
3 I could be totally wrong, but sounds like dest...
That will remove all rows where the value of the Comments column for that row starts with [ and ends with ].
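One more thing worth noting: in the code from the question the filtered result is never assigned back to df, which is why the saved dataframe still contains those rows. A minimal sketch of the full flow (file names taken from the question):
import pandas as pd

sol2020 = pd.read_csv("Solana_2020_Comments_Time_Adjusted.csv")
sol2021 = pd.read_csv("Solana_2021_Comments_Time_Adjusted.csv")
df = pd.concat([sol2021, sol2020], ignore_index=True, sort=False)

# Keep only rows whose Comments are not exactly '[deleted]' or '[removed]',
# and assign the result back so the change actually sticks.
df = df[~df["Comments"].isin(["[deleted]", "[removed]"])].reset_index(drop=True)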

How to modify multiple values in one column, but skip others in pandas python

Going on two months in Python, and I am focusing hard on pandas right now. In my current position I use VBA on data frames, so I'm learning this to slowly replace it and further my career.
As of now I believe my real problem is a lack of understanding of a key concept (or concepts). Any help would be greatly appreciated.
That said, here is my problem:
Where could I go to learn more about how to do this kind of more precise filtering? I'm very close, but there is one key aspect missing.
Goal(s)
Main goal I need to skip certain values in my ID column.
The code below takes out the dashes "-" and keeps only the first 9 digits. However, I need to skip certain IDs because they are unique.
After that I'll start to work on comparing multiple sheets.
The main data frame's IDs are formatted as 000-000-000-000.
The other data frames that I will compare it to have the IDs with no dashes "-", as 000000000, with three fewer 000's, totaling nine digits.
The unique IDs that I need skipped are the same in both data frames, but are formatted completely differently, for example 000-000-000_#12, 000-000-000_35, or 000-000-000_z.
My code that I will use on each ID except the unique ones:
dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
but I want to use an if statement like (This does not work)
lst = ["000-000-000_#69B", "000-000-000_a", "etc.. random IDs", ]
if ~dfSS["ID"].isin(lst ).any()
dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
else:
pass
For more clarification my input DataFrame is this:
ID Street # Street Name
0 004-330-002-000 2272 Narnia
1 021-521-410-000_128 2311 Narnia
2 001-243-313-000 2235 Narnia
3 002-730-032-000 2149 Narnia
4 000-000-000_a 1234 Narnia
And I am looking to do this as the output:
ID Street # Street Name
0 004330002 2272 Narnia
1 021-521-410-000_128 2311 Narnia
2 001243313000 2235 Narnia
3 002730032000 2149 Narnia
4 000-000-000_a 1234 Narnia
Notes:
dfSS is my DataFrame variable name, aka the Excel file I am using. "ID" is my column heading; I will make it an index after the fact.
My data frame for this job is small, with (rows, columns) = (2500, 125).
I do not get an error message, so I am guessing maybe I need a loop of some kind. Starting to test for loops with this as well. No luck there... yet.
Here is where I have been to research this:
Comparison of a Dataframe column values with a list
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
if statement with ~isin() in pandas
recordlinkage module-I didn't think this was going to work
Regular expression operations - Having a hard time fully understanding this at the moment
There are a number of ways to do this. The first way here doesn't involve writing a function.
# Create a placeholder column with all transformed IDs
dfSS["ID_trans"] = dfSS["ID"].str.replace("-", "").str[:9]
dfSS.loc[~dfSS["ID"].isin(lst), "ID"] = dfSS.loc[~dfSS["ID"].isin(lst), "ID_trans"] # conditional indexing
The second way is to write a function that conditionally converts the IDs; it's not as fast as the first method.
def transform_ID(ID_val):
    if ID_val not in lst:
        return ID_val.replace("-", "")[:9]
    return ID_val  # leave the unique IDs untouched

dfSS['ID_trans'] = dfSS['ID'].apply(transform_ID)
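Another option, not from the answer above but a common pattern (a sketch, assuming numpy is installed), is np.where, which expresses the same condition in one vectorised step:
import numpy as np

dfSS["ID"] = np.where(dfSS["ID"].isin(lst),
                      dfSS["ID"],                                 # leave the unique IDs untouched
                      dfSS["ID"].str.replace("-", "").str[:9])    # transform everything else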
This is based on @xyzxyzjayne's answer, but I have two issues I cannot figure out.
First issue
I get this warning (see Edit):
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Documentation for this warning
You'll see in the code below that I tried to put in .loc, but I can't seem to figure out how to eliminate this warning by using .loc correctly. Still learning it. No, I will not just ignore it even though it works; this is a learning opportunity, I say.
Second issue
I do not understand this part of the code. I know the left side of the comma is supposed to select rows and the right side columns. That said, why does this work? ID is a column, not a row, when this code is run:
df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]
The part I don't understand yet is the left side of the comma (,) in this expression:
df.loc[~df["ID "].isin(uniqueID), "ID "]
That said, here is the final result. Basically, as I said, it's XYZ's help that got me here, but I'm adding more .locs and playing with the documentation until I can eliminate the warning.
import time

# uniqueID is a list of 1000+ IDs that I had to enter manually; these IDs get skipped.
# example entry: "032-234-987_#4256"
uniqueID = ["032-234-987_#4256", ...]
# get only the columns I need, to make the DataFrame smaller
df = df[['ID ', 'Street #', 'Street Name', 'Debris Finish', 'Number of Vehicles',
         'Number of Vehicles Removed', 'County']]
# "Place Holder" will be our new column with the transformed IDs
df.loc[:, "Place Holder"] = df.loc[:, "ID "].str.replace("-", "").str[:9]
# the next line is the filter that skips the IDs in the list (work in progress to fully understand)
df.loc[~df["ID "].isin(uniqueID), "ID "] = df.loc[~df["ID "].isin(uniqueID), "Place Holder"]
# make the ID our index
df = df.set_index("ID ")
# just here to add the date to our file name; requires "import time" (see above)
todaysDate = time.strftime("%m-%d-%y")
# write it out as an Excel file
df.to_excel("ID TEXT " + todaysDate + ".xlsx")
I will edit this once I get rid of the warning and figure out the left side, so I can explain it for everyone who needs/sees this post.
Edit: SettingWithCopyWarning:
I fixed this chained-indexing problem by making a copy of the original dataframe before filtering, and by making everything .loc as XYZ helped me with. Before you start to filter, use DataFrame.copy(), where DataFrame is the name of your own dataframe.
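A minimal sketch of that fix, reusing the same column selection as above: take an explicit copy of the subset so the later .loc assignments operate on an independent frame rather than a view of the original.
# Selecting columns and copying makes the subset its own DataFrame,
# so later .loc assignments no longer trigger SettingWithCopyWarning.
df = df[['ID ', 'Street #', 'Street Name', 'Debris Finish', 'Number of Vehicles',
         'Number of Vehicles Removed', 'County']].copy()

df.loc[:, "Place Holder"] = df.loc[:, "ID "].str.replace("-", "").str[:9]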

Pick a list of values from one CSV and get the count of the values of the list in a different CSV

I am working on Python code to calculate the occurrences of a few values in a column within a CSV.
Example - CSV1 is as below:
**Type Value**
simple test
complex problem
simple formula
complex theory
simple idea
simple task
I need to get the Value entries for each Type, simple and complex, i.e.:
**Type Value**
simple test
simple formula
simple idea
simple task
complex theory
complex problem
Then I need to query the other CSV, CSV2, for the total count of occurrences of the values in the simple list, i.e. [test, formula, idea, task], and the complex list, i.e. [theory, problem].
The other CSV, CSV2, is:
**Category**
test
test
test
formula
formula
formula
test
test
idea
task
task
idea
task
idea
task
problem
problem
theory
problem
problem
idea
task
problem
test
Both CSV1 and CSV2 are dynamic. Using CSV1, for example, for type "simple", get the list of the corresponding values, then refer to CSV2 to find the count for each value, i.e. the counts of test, idea, task and formula. The same goes for the complex type.
I tried multiple methods with pandas but am not getting the expected result. Any pointers, please?
Use:
df2['cat'] = df2['Category'].map(df1.set_index('Value')['Type'])
df2 = df2['cat'].value_counts().rename_axis('a').reset_index(name='b')
print (df2)
a b
0 simple 18
1 complex 6
Much like @jezrael's approach; however, I would first group the second CSV. This helps the merge if the second CSV is very large.
df2 = cv2.groupby('Category').agg(cnt=('Category', 'count')).reset_index()
This gives a dataframe with two columns, Category and cnt.
Now you can merge it with CSV1:
df1 = cv1.merge(df2, left_on='Value', right_on='Category', how='inner')
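From that merged frame, one more groupby and sum gives the per-Type totals (a sketch, using the column names above):
# Sum the per-value counts within each Type.
totals = df1.groupby('Type', as_index=False)['cnt'].sum()
print(totals)
#       Type  cnt
# 0  complex    6
# 1   simple   18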

Nested dictionary with dataframes to dataframe in pandas

I know that there are a few questions about converting nested dictionaries to a dataframe, but their solutions do not work for me. I have dataframes contained in a dictionary, which is itself contained in another dictionary, like this:
df1 = pd.DataFrame({'2019-01-01': [38], '2019-01-02': [43]}, index=[1])
df2 = pd.DataFrame({'2019-01-01': [108], '2019-01-02': [313]}, index=[1])
da = {}
da['ES']={}
da['ES']['TV']=df1
da['ES']['WEB']=df2
What I want to obtain is the following:
df_final = pd.DataFrame({'market': ['ES', 'ES', 'ES', 'ES'],
                         'device': ['TV', 'TV', 'WEB', 'WEB'],
                         'ds': ['2019-01-01', '2019-01-02', '2019-01-01', '2019-01-02'],
                         'yhat': [38, 43, 108, 313]})
Getting the code from another SO question I have tried this:
market_ids = []
frames = []
for market_id, d in da.items():
    market_ids.append(market_id)
    frames.append(pd.DataFrame.from_dict(da, orient='index'))
df = pd.concat(frames, keys=market_ids)
Which gives me a dataframe with multiple indexes and the devices as column names.
Thank you
The code below works well and gives the desired output:
t1=da['ES']['TV'].melt(var_name='ds', value_name='yhat')
t1['market']='ES'
t1['device']='TV'
t2=da['ES']['WEB'].melt(var_name='ds', value_name='yhat')
t2['market']='ES'
t2['device']='WEB'
m = pd.concat([t1,t2]).reset_index().drop(columns={'index'})
print(m)
And the output is:
ds yhat market device
0 2019-01-01 38 ES TV
1 2019-01-02 43 ES TV
2 2019-01-01 108 ES WEB
3 2019-01-02 313 ES WEB
The main takeaway here is the melt function, which, if you read up on it, isn't difficult to understand in this context. As I mentioned in the comment above, this can be done iteratively over the whole da dictionary, but to demonstrate that I'd need a replica of your actual data. The idea is to take this first t1 as the initial dataframe and then keep concatenating the others to it, which should be really easy. I don't know what your actual values look like, but I am sure you can figure out from the above how to put this in a loop.
The pseudo-code for that loop would be something like this:
real = t1
for a in da['ES'].keys():
    if a != 'TV':
        p = da['ES'][a].melt(var_name='ds', value_name='yhat')
        p['market'] = 'ES'
        p['device'] = a
        real = pd.concat([real, p], axis=0, sort=True)
real = real.reset_index().drop(columns={'index'})
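For completeness, a sketch of that loop generalised over every market and device in da (the same melt-and-concat idea, applied to both dictionary levels):
frames = []
for market, devices in da.items():
    for device, block in devices.items():
        t = block.melt(var_name='ds', value_name='yhat')
        t['market'] = market
        t['device'] = device
        frames.append(t)

df_final = pd.concat(frames, ignore_index=True)[['market', 'device', 'ds', 'yhat']]
print(df_final)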
