Dropping Rows that Contain a Specific String wrapped in square brackets? - python

I'm trying to drop rows whose value in a column is a string wrapped in square brackets. Specifically, I want to drop all rows whose Comments value is '[removed]' or '[deleted]'.
My df looks like this:
Comments
1 The main thing is the price appreciation of the token (this determines the gains or losses more
than anything). Followed by the ecosystem for the liquid staking asset, the more opportunities
and protocols that accept the asset as collateral, the better. Finally, the yield for staking
comes into play.
2 [deleted]
3 [removed]
4 I could be totally wrong, but sounds like destroying an asset and claiming a loss, which I
believe is fraudulent. Like someone else said, get a tax guy - for this year anyway and then
you'll know for sure. Peace of mind has value too.
I have tried df[df["Comments"].str.contains("removed")==False]
but when I save the dataframe, those rows are still there.
EDIT:
My full code
import pandas as pd
sol2020 = pd.read_csv("Solana_2020_Comments_Time_Adjusted.csv")
sol2021 = pd.read_csv("Solana_2021_Comments_Time_Adjusted.csv")
df = pd.concat([sol2021, sol2020], ignore_index=True, sort=False)
df[df["Comments"].str.contains("deleted")==False]
df[df["Comments"].str.contains("removed")==False]

Try this. I created a data frame with a Comments column using my own sample comments, but the same approach should work for you:
import pandas as pd
sample_data = { 'Comments': ['first comment whatever','[deleted]','[removed]','last comments whatever']}
df = pd.DataFrame(sample_data)
data = df[df["Comments"].str.contains("deleted|removed")==False]
print(data)
The output I got:
Comments
0 first comment whatever
3 last comments whatever
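Note that the boolean filter returns a new DataFrame rather than modifying df in place; the code in the question never assigns the result back, which is why the rows are still there when the dataframe is saved. A minimal sketch of the fix, using the question's own column name:
df = df[df["Comments"].str.contains("deleted|removed") == False]
# df is now the filtered frame, so saving it will no longer include those rows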

You can do it like this:
new_df = df[~(df['Comments'].str.startswith('[') & df['Comments'].str.endswith(']'))].reset_index(drop=True)
Output:
>>> new_df
Comments
0 The main thing is the price appreciation of th...
3 I could be totally wrong, but sounds like dest...
That will remove all rows where the value of the Comments column for that row starts with [ and ends with ].
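If you only want to drop those two exact values rather than anything wrapped in brackets, an isin filter is another option; a small sketch with the same column name:
df = df[~df["Comments"].isin(["[removed]", "[deleted]"])].reset_index(drop=True)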

Related

Need help to transform a column of a dataframe into multiple columns for timestamps

Hello everyone, I need some help with the question below.
My dataframe (I am using pyspark) looks like below,
and the time column needs to be split into two columns, 'start time' and 'end time', like below.
I tried a couple of methods, such as self-joining the df on m_id, but they look very tedious and inefficient. I would appreciate it if someone could help me with this.
Thanks in advance.
Performing something based on row order in Spark is not a good idea. Row order is preserved while reading the file, but it may get shuffled between transformations, and then there is no way to know which was the previous row (the start time). You would need to ensure that no shuffling happens in order to avoid this, but that leads to other complexities.
My suggestion is to work on the file at the source level and add a row-number column, like:
r_n m_id time
0 2 2022-01-01T12:12:12.789+000
1 2 2022-01-01T12:14:12.789+000
2 2 2022-01-01T12:16:12.789+000
Later, in Spark, you do a left self join on r_n, something like:
from pyspark.sql.functions import col, lit
a = df.alias("a")
b = df.withColumn("next_r", col("r_n") + lit(1)).alias("b")  # b plays the role of the previous row
df_final = a.join(b, col("b.next_r") == col("a.r_n"), "left") \
    .select(col("a.m_id"), col("b.time").alias("start_time"), col("a.time").alias("end_time"))
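A window-function alternative, sketched under the same assumption that the r_n column exists in the source data, avoids the self join entirely:
from pyspark.sql import Window
from pyspark.sql.functions import col, lag

w = Window.partitionBy("m_id").orderBy("r_n")   # order within each m_id by the explicit row number
df_final = df.withColumn("start_time", lag(col("time")).over(w)) \
    .withColumnRenamed("time", "end_time") \
    .select("m_id", "start_time", "end_time")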

Compare columns (per row) of two DataFrames in Python

First of all, I'm quite new to programming overall (< 2 months), so I'm sorry if this is a 'simple, no need to ask for help, try it yourself until you get it done' problem.
I have two data-frames with partially the same content: a general overview of mobile numbers, including their cost centers in the company, and monthly invoices with the affected mobile numbers and their invoice amounts.
I'd like to compare the content of the 'mobile-numbers' column of the monthly invoices DF to the content of the 'mobile-numbers' column of the general overview DF and, if they match, assign the respective cost center to the mobile number in the monthly invoices DF.
I'd love to share my code with you, but unfortunately I have absolutely zero clue how to solve that problem in any way.
Thanks
Edit: I'm from Germany and tried my best to explain the problem in English. If there is anything I messed up (so you don't get it), just tell me :)
Example of desired result
This program should meet your needs. In the second dataframe I put the value '40' to demonstrate that cells that are already filled will not be zeroed out; the replacement only happens when there is a matching value between the dataframes. If you want a better explanation of the program, comment below, and don't forget to vote and mark as solved. I also put in some prints for a better view, but in general they are not necessary.
import pandas as pd

general_df = pd.DataFrame({"mobile_number": [1234, 3456, 6545, 4534, 9874],
                           "cost_center": ['23F', '67F', '32W', '42W', '98W']})
invoice_df = pd.DataFrame({"mobile_number": [4534, 5567, 1234, 4871, 1298],
                           "invoice_amount": ['19,99E', '19,99E', '19,99E', '19,99E', '19,99E'],
                           "cost_center": ['', '', '', '', '40']})

print(f"""GENERAL OVERVIEW DF
{general_df}
________________________________________
INVOICE DF
{invoice_df}
_________________________________________
INVOICE RESULT
""")

def func(line):
    # look up this invoice's mobile number in the general overview
    match = general_df.loc[general_df['mobile_number'] == line['mobile_number']]
    if match.empty:
        return line['cost_center']            # no match: keep whatever is already there
    return match['cost_center'].values[0]     # match: take the cost center from the overview

invoice_df['cost_center'] = invoice_df.apply(func, axis=1)
print(invoice_df)
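A vectorized alternative to the row-wise apply, sketched against the same sample frames, is to map the cost centers directly and keep the existing value only where no match is found:
mapping = general_df.set_index('mobile_number')['cost_center']
looked_up = invoice_df['mobile_number'].map(mapping)            # NaN where there is no match
invoice_df['cost_center'] = looked_up.fillna(invoice_df['cost_center'])
print(invoice_df)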

How to modify multiple values in one column, but skip others in pandas python

I'm going on two months in Python and am focusing hard on pandas right now. In my current position I use VBA on data frames, so I'm learning this to slowly replace it and further my career.
As of now I believe my real problem is a lack of understanding of a key concept (or concepts). Any help would be greatly appreciated.
That said, here is my problem:
Where could I go to learn more about how to do this kind of more precise filtering? I'm very close, but there is one key aspect I'm missing.
Goal(s)
Main goal: I need to skip certain values in my ID column.
The code below takes out the dashes "-" and keeps only the first 9 digits. However, I need to skip certain IDs because they are unique.
After that I'll start to work on comparing multiple sheets.
The main data frame's IDs are formatted as 000-000-000-000.
The other data frames that I will compare it to have the IDs with no dashes "-", as 000000000, with one group of 000 fewer, totaling nine digits.
The unique IDs that I need skipped are the same in both data frames, but are formatted completely differently, ranging from 000-000-000_#12 and 000-000-000_35 to 000-000-000_z.
My code that I will use on each ID except the unique ones:
dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
but I want to use an if statement like (This does not work)
lst = ["000-000-000_#69B", "000-000-000_a", "etc.. random IDs", ]
if ~dfSS["ID"].isin(lst ).any()
dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
else:
pass
For more clarification my input DataFrame is this:
ID Street # Street Name
0 004-330-002-000 2272 Narnia
1 021-521-410-000_128 2311 Narnia
2 001-243-313-000 2235 Narnia
3 002-730-032-000 2149 Narnia
4 000-000-000_a 1234 Narnia
And I am looking to do this as the output:
ID Street # Street Name
0 004330002 2272 Narnia
1 021-521-410-000_128 2311 Narnia
2 001243313000 2235 Narnia
3 002730032000 2149 Narnia
4 000-000-000_a 1234 Narnia
Notes:
dfSS is my DataFrame variable name, aka the Excel file I am using. "ID" is
my column heading; I will make this an index after the fact.
My data frame for this job is small, with (rows, columns) of (2500, 125).
I do not get an error message, so I am guessing maybe I need a loop of some kind. Starting to test for loops with this as well. No luck there... yet.
Here is where I have been to research this:
Comparison of a Dataframe column values with a list
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
if statement with ~isin() in pandas
recordlinkage module-I didn't think this was going to work
Regular expression operations - Having a hard time fully understanding this at the moment
There are a number of ways to do this. The first way here doesn't involve writing a function.
# Create a placeholder column with all transformed IDs
dfSS["ID_trans"] = dfSS["ID"].str.replace("-", "").str[:9]
dfSS.loc[~dfSS["ID"].isin(lst), "ID"] = dfSS.loc[~dfSS["ID"].isin(lst), "ID_trans"] # conditional indexing
The second way is to write a function that conditionally converts the IDs, and it's not as fast as the first method.
def transform_ID(ID_val):
    if ID_val not in lst:
        return ID_val.replace("-", "")[:9]
    return ID_val  # IDs in the skip list are returned unchanged
dfSS['ID_trans'] = dfSS['ID'].apply(transform_ID)
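For reference, a self-contained run of the first method against the question's sample frame (the skip list here is my own assumption, listing the two IDs the asker wants left alone):
import pandas as pd

dfSS = pd.DataFrame({"ID": ["004-330-002-000", "021-521-410-000_128", "001-243-313-000",
                            "002-730-032-000", "000-000-000_a"],
                     "Street #": [2272, 2311, 2235, 2149, 1234],
                     "Street Name": ["Narnia"] * 5})
lst = ["021-521-410-000_128", "000-000-000_a"]   # hypothetical skip list for this sample

dfSS["ID_trans"] = dfSS["ID"].str.replace("-", "", regex=False).str[:9]
dfSS.loc[~dfSS["ID"].isin(lst), "ID"] = dfSS.loc[~dfSS["ID"].isin(lst), "ID_trans"]
print(dfSS.drop(columns="ID_trans"))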
This is based on xyzxyzjayne's answer, but I have two issues I cannot figure out.
First issue
is that I get this warning (see Edit):
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Documentation for this warning
You'll see in the code below that I tried to put in .loc, but I can't seem to figure out how to eliminate this warning by using .loc correctly. Still learning it. No, I will not just ignore the warning even though the code works; this is a learning opportunity, I say.
Second issue
is that I do not understand this part of the code. I know the left side is supposed to be rows and the right side columns. That said, why does this work? ID is a column, not a row, when this code is run (I make the ID an index later):
df.loc[~df["ID "].isin(uniqueID ), "ID "] = df.loc[~df["ID "].isin(uniqueID ), "Place Holder"]
The part I don't understand yet is the left side of the comma (,) in this expression:
df.loc[~df["ID "].isin(uniqueID), "ID "]
That said, here is the final result. As I said, it's xyzxyzjayne's help that got me here, but I'm adding more .locs and playing with the documentation until I can eliminate the warning.
# uniqueID is a whole list of IDs I had to enter manually, 1000+ entries that get skipped by the code below
uniqueID = ["032-234-987_#4256"]  # example entry; the real list is much longer
import time

# get only the columns I need, to make the DataFrame smaller
df = df[['ID ', 'Street #', 'Street Name', 'Debris Finish', 'Number of Vehicles',
         'Number of Vehicles Removed', 'County']]
# "Place Holder" will hold our new column with this filter
df.loc[:, "Place Holder"] = df.loc[:, "ID "].str.replace("-", "").str[:9]
# the next line is the filter that goes through the list and skips those IDs (work in progress to fully understand)
df.loc[~df["ID "].isin(uniqueID), "ID "] = df.loc[~df["ID "].isin(uniqueID), "Place Holder"]
# makes the ID our index
df = df.set_index("ID ")
# just here to add the date to our file name (this is why time is imported above)
todaysDate = time.strftime("%m-%d-%y")
# make it an Excel file
df.to_excel("ID TEXT " + todaysDate + ".xlsx")
I will edit this once I get rid of the warning and figure out the left side, so I can explain it for everyone who needs/sees this post.
Edit: SettingWithCopyWarning:
Fixed this chained-indexing problem by making a copy of the original dataframe before filtering and making everything .loc, as xyzxyzjayne helped me with. Before you start to filter, use DataFrame.copy(), where DataFrame is the name of your own dataframe.
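A minimal sketch of that fix with a shortened column list (the trailing space in 'ID ' is kept exactly as it appears in the post):
df = df[['ID ', 'Street #', 'Street Name']].copy()   # explicit copy: later .loc writes modify an independent frame
df.loc[:, "Place Holder"] = df.loc[:, "ID "].str.replace("-", "").str[:9]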

Divide data on the basis of specific column number using pandas

I am trying to load a .txt file using pandas read_csv function.
My data looks like this:
84-121123-0000 GO DO YOU HEAR
84-121123-0001 BUT IN LESS THAN FIVE MINUTES THE STAIRCASE GROANED BENEATH AN EXTRAORDINARY WEIGHT
84-121123-0002 AT THIS MOMENT THE WHOLE SOUL OF THE OLD MAN SEEMED CENTRED IN HIS EYES WHICH BECAME BLOODSHOT THE VEINS OF THE THROAT SWELLED HIS CHEEKS AND TEMPLES BECAME PURPLE AS THOUGH HE WAS STRUCK WITH EPILEPSY NOTHING WAS WANTING TO COMPLETE THIS BUT THE UTTERANCE OF A CRY
84-121123-0003 AND THE CRY ISSUED FROM HIS PORES IF WE MAY THUS SPEAK A CRY FRIGHTFUL IN ITS SILENCE
84-..
..
..
First column is ID
Second column is data
The problem I have when loading this data with a space separator is that it splits everything after the second column into new fields, which is not what I want.
I want ID as the first column, and the second column should hold all of the remaining space-separated data.
I can think of using pandas for this task, but if there is a better library please let me know.
Here's the code snippet I tried:
test = pd.read_csv('my_file.txt', sep=' ', names=['id', 'data'])
I get unexpected output. The output I want is:
id data
84-121123-0000 GO DO YOU HEAR
84-121123-0001 BUT IN LESS THAN FIVE MINUTES THE STAIRCASE GROANED BENEATH AN EXTRAORDINARY WEIGHT
....
If your id always has the format "xx-xxxxxx-xxxx", you can use it as part of the separator:
df = pd.read_csv(
    "your_file.txt",
    sep=r"(?<=\d{2}-\d{6}-\d{4})\s",
    engine="python",
    names=["id", "data"],
)
print(df)
Prints:
id data
0 84-121123-0000 GO DO YOU HEAR
1 84-121123-0001 BUT IN LESS THAN FIVE MINUTES THE STAIRCASE GR...
2 84-121123-0002 AT THIS MOMENT THE WHOLE SOUL OF THE OLD MAN S...
3 84-121123-0003 AND THE CRY ISSUED FROM HIS PORES IF WE MAY TH...
I'm not sure how much slower it is to first read the text file and then create a pandas DataFrame, but you could first get each line into a list, separate each element by the first space (using split(" ",1)) and then create a DataFrame.
import pandas as pd

with open(TXTFILE, "r") as f:
    data = [s.rstrip("\n").split(" ", 1) for s in f.readlines()]
df = pd.DataFrame(data, columns=['col1', 'col2'])
Note that f.readlines() only works once after opening the file so save it as a separate list if you are going to use it more than once.
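Another option, kept as a hedged sketch, is to let pandas read each whole line into a single column and split afterwards; this assumes the text itself never contains a tab character, since read_table splits on tabs:
import pandas as pd

raw = pd.read_table("my_file.txt", header=None, names=["line"])   # one whole line per row
df = raw["line"].str.split(" ", n=1, expand=True)
df.columns = ["id", "data"]
print(df)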

Find similarity between two dataframes, row by row

I have two dataframes, df1 and df2, with the same columns. I would like to find the similarity between these two datasets. I have been following one of two approaches.
The first one was to append one of the two dataframes to the other one and select the duplicates:
df=pd.concat([df1,df2],join='inner')
mask = df.Check.duplicated(keep=False)
df[mask] # it gives me duplicated rows
The second one is to consider a threshold value which, for each row from df1, finds a potential match in rows in df2.
Sample of data (please note that the datasets have different lengths):
For df1
Check
how to join to first row
large data work flows
I have two dataframes
fix grammatical or spelling errors
indent code by 4 spaces
why are you posting here?
add language identifier
my dad loves watching football
and for df2
Check
small data work flows
I have tried to puzze out an answer
mix grammatical or spelling errors
indent code by 2 spaces
indent code by 8 spaces
put returns between paragraphs
add curry on the chicken curry
mom!! mom!! mom!!
create code fences with backticks
are you crazy?
Trump did not win the last presidential election
In order to do this, I am using the following function:
def check(df1, thres, col):
    matches = df1.apply(lambda row: ((fuzz.ratio(row['Check'], col) / 100.0) >= thres), axis=1)
    return [df1.Check[i] for i, x in enumerate(matches) if x]
This should allow me to find rows which match.
The problem with the second approach (the one I am most interested in) is that it does not actually take both dataframes into account.
My expected output would be two dataframes, one for df1 and one for df2, each with an extra column that holds the matches found for each row when compared against the other dataframe; from the second piece of code, I would only assign a similarity value to them (I would end up with as many columns as there are rows).
Please let me know if you need any further information or if you need more code.
Maybe there is an easier way to determine this similarity, but unfortunately I have not found it yet.
Any suggestion is welcome.
Expected output:
(it is an example; since I am setting a threshold the output may change)
df1
Check sim
how to join to first row []
large data work flows [small data work flows]
I have two dataframes []
fix grammatical or spelling errors [mix grammatical or spelling errors]
indent code by 4 spaces [indent code by 2 spaces, indent code by 8 spaces]
why are you posting here? []
add language identifier []
my dad loves watching football []
df2
Check sim
small data work flows [large data work flows]
I have tried to puzze out an answer []
mix grammatical or spelling errors [fix grammatical or spelling errors]
indent code by 2 spaces [indent code by 4 spaces]
indent code by 8 spaces [indent code by 4 spaces]
put returns between paragraphs []
add curry on the chicken curry []
mom!! mom!! mom!! []
create code fences with backticks []
are you crazy? []
Trump did not win the last presidential election []
I think your fuzzywuzzy solution is pretty good. I've implemented what you're looking for below. Note that this grows as len(df1)*len(df2), so it is pretty memory intensive, but at least it should be reasonably clear. You may find the output of gen_scores useful as well.
import pandas as pd
from fuzzywuzzy import fuzz
from itertools import product

def gen_scores(df1, df2):
    # generates a score for all row combinations between dfs
    df_score = pd.DataFrame(product(df1.Check, df2.Check), columns=["c1", "c2"])
    df_score["score"] = df_score.apply(lambda row: (fuzz.ratio(row["c1"], row["c2"]) / 100.0), axis=1)
    return df_score

def get_matches(df1, df2, thresh=0.5):
    # get all matches above a threshold, appended as a list column to each df
    df = gen_scores(df1, df2)
    df = df[df.score > thresh]

    matches = df.groupby("c1").c2.apply(list)
    df1 = pd.merge(df1, matches, how="left", left_on="Check", right_on="c1")
    df1 = df1.rename(columns={"c2": "matches"})
    df1.loc[df1.matches.isnull(), "matches"] = df1.loc[df1.matches.isnull(), "matches"].apply(lambda x: [])

    matches = df.groupby("c2").c1.apply(list)
    df2 = pd.merge(df2, matches, how="left", left_on="Check", right_on="c2")
    df2 = df2.rename(columns={"c1": "matches"})
    df2.loc[df2.matches.isnull(), "matches"] = df2.loc[df2.matches.isnull(), "matches"].apply(lambda x: [])

    return (df1, df2)

# call code:
df1_match, df2_match = get_matches(df1, df2, thresh=0.5)
output:
Check matches
0 how to join to first row []
1 large data work flows [small data work flows]
2 I have two dataframes []
3 fix grammatical or spelling errors [mix grammatical or spelling errors]
4 indent code by 4 spaces [indent code by 2 spaces, indent code by 8 spa...
5 why are you posting here? [are you crazy?]
6 add language identifier []
7 my dad loves watching football []
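As a possible shortcut, fuzzywuzzy's process helpers can do the per-row lookup directly; a hedged sketch of the same idea (sim_list and the 0-100 score_cutoff conversion are my own naming, not from the post):
from fuzzywuzzy import fuzz, process

def sim_list(text, candidates, thresh=0.5):
    # all candidate strings whose fuzz.ratio against `text` clears the threshold
    best = process.extractBests(text, candidates, scorer=fuzz.ratio,
                                score_cutoff=int(thresh * 100), limit=len(candidates))
    return [match for match, score in best]

df1["sim"] = df1["Check"].apply(lambda s: sim_list(s, df2["Check"].tolist()))
df2["sim"] = df2["Check"].apply(lambda s: sim_list(s, df1["Check"].tolist()))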
