How do I modify a pd.DataFrame using a for loop - python

I have a problem with iterative modification of a DataFrame (h0). I need to replace strings in my DataFrame, and I have a list of strings to be replaced (two columns: the first holds the string to be replaced, the second holds the new value that should go into the DataFrame - this is the zv DataFrame).
I want to replace 584 different strings in my DataFrame, so I am trying to use a for loop.
I've tried this:
h0 = pd.DataFrame(pd.read_csv('C:/blablabla/vyberhp.csv', delimiter=';'))
zv = pd.DataFrame(pd.read_csv('C:/blablabla/vycuc2.csv', delimiter=';'))
zv_dlzka = len(zv.index)
for i in range(zv_dlzka):
    h1 = h0.replace(zv.at[i, 'stary_kod'], zv.at[i, 'kod'], regex=True)
print(h1)
The result is that I see only the last iteration (only the string from the last row of my replacement list gets replaced).
I know where the problem is. It's here:
h1 = h0.replace(zv.at[i, 'stary_kod'], zv.at[i,'kod'], regex=True)
because the for loop always starts from the original DataFrame (h0), but I have no idea how to fix it.
What are the ways to make the for loop work on the modified DataFrame (not h0) in each iteration?
Sorry if this is a basic question, I am quite new to coding.
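One possible fix, sketched below under the question's file paths and column names: reassign the result of each replace back to the same variable, so every iteration builds on the already-modified frame instead of starting over from h0.
import pandas as pd

# A minimal sketch (paths and column names taken from the question):
h0 = pd.read_csv('C:/blablabla/vyberhp.csv', delimiter=';')
zv = pd.read_csv('C:/blablabla/vycuc2.csv', delimiter=';')

h1 = h0.copy()                      # start from the original once
for i in range(len(zv.index)):
    # replace in h1, not h0, so each pass keeps the earlier replacements
    h1 = h1.replace(zv.at[i, 'stary_kod'], zv.at[i, 'kod'], regex=True)
print(h1)

# Equivalent without an explicit loop: build a mapping and replace once.
h1 = h0.replace(dict(zip(zv['stary_kod'], zv['kod'])), regex=True)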

Related

Python loop to search multiple sets of keywords in all columns of dataframe

I've used the code below to search across all columns of my dataframe to see if each row has the word "pool" and the words "slide" or "waterslide".
AR11_regex = r"""
(?=.*(?:slide|waterslide)).*pool
"""
f = lambda x: x.str.findall(AR11_regex, flags=re.VERBOSE | re.IGNORECASE)
d['AR']['AR11'] = d['AR'].astype(str).apply(f).any(axis=1).astype(int)
This has worked fine, but when I write a for loop to do this for more than one regex pattern (e.g., AR11, AR12, AR21) using the code below, the new columns are all zeros (i.e., the search is not finding any hits).
for i in AR_list:
    print(i)
    pat = i + "_regex"
    print(pat)
    f = lambda x: x.str.findall(i + "_regex", flags=re.VERBOSE | re.IGNORECASE)
    d['AR'][str(i)] = d['AR'].astype(str).apply(f).any(axis=1).astype(int)
Any advice on why this loop didn't work would be much appreciated!
A small sample DataFrame would help us understand your question. In any case, your code sample has several problems:
- i+"_regex" is just the string "AR11_regex". It won't evaluate to the value of the variable with the identifier AR11_regex. Put your regex patterns in a dict.
- d['AR'] is the values in the AR column. It seems like you expect it to be a row.
- d['AR'][str(i)] is adding a new row. It seems like you want to add a new column.
Lastly, this approach to setting a cell generally (always for me) yields the following warning:
/var/folders/zj/pnrcbb6n01z2qv1gmsk70b_m0000gn/T/ipykernel_13985/876572204.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
The suggested approach would be to use .at, as in d.at[str(i), 'AR'] or some such.
Add a sample data frame and refine your question for more suggestions.
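To make the dict advice concrete, here is a hedged sketch; d and the 'AR' key mirror the question, but the toy data and the single pattern shown are stand-ins, and .loc is just one way to write the column back:
import re
import pandas as pd

# Toy stand-in for the question's dict of DataFrames; only the AR11
# pattern comes from the question.
d = {'AR': pd.DataFrame({'desc': ['pool with a waterslide', 'no water here'],
                         'notes': ['big slide into pool', 'n/a']})}
patterns = {
    'AR11': r"(?=.*(?:slide|waterslide)).*pool",
}

for name, pat in patterns.items():
    # bind pat via a default argument so each lambda keeps its own pattern
    f = lambda x, p=pat: x.str.findall(p, flags=re.VERBOSE | re.IGNORECASE)
    hits = d['AR'].astype(str).apply(f).any(axis=1).astype(int)
    d['AR'].loc[:, name] = hits   # .loc assignment avoids the copy warning

print(d['AR'])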

Printing and counting unique values from an .xlsx file

I'm fairly new to Python and still learning the ropes, so I need help with a step-by-step program without using any functions. I understand how to count through an unknown column range and output the quantity. However, for this program, I'm trying to loop through a column, picking out unique numbers and counting their frequency.
So I have an excel file with random numbers down column A. I only put in 20 numbers but let's pretend the range is unknown. How would I go about extracting the unique numbers and inputting them into a separate column along with how many times they appeared in the list?
I'm not really sure how to go about this. :/
unique = 1
while xw.Range((unique, 1)).value != None:
    frequency = 0
    if unique != unique: break
    quantity += 1
"end"
I presume, as you can't use functions, this may be homework... so, at a high level:
You could first go through the column and put all the values in a list.
Secondly, take the first value and go through the rest of the list - is it in there? If so, then it is not unique. Remove the value wherever you find a duplicate; keep going, and if you find another, remove that too.
Take the second value, and so on.
You would just need list comprehensions, some loops, and perhaps .pop().
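A rough sketch of that scan-and-remove idea, assuming the column values have already been read into a plain list (the sample numbers are made up):
# Stand-in for the column read from Excel
values = [3, 7, 3, 1, 7, 7]

uniques = []
counts = []
while values:                  # keep going until every value is consumed
    current = values.pop(0)    # take the first remaining value
    n = 1
    # remove every later duplicate, counting as we go
    while current in values:
        values.remove(current)
        n += 1
    uniques.append(current)
    counts.append(n)

print(uniques)   # [3, 7, 1]
print(counts)    # [2, 3, 1]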
Using the pandas library would be the easiest way to do this. I created a sample Excel sheet with only one column, called "Random_num":
import pandas
data = pandas.read_excel("sample.xlsx", sheet_name = "Sheet1")
print(data.head()) # This would give you a sneak peek of your data
print(data['Random_num'].value_counts()) # This would solve the problem you asked for
# Make sure to pass your column name within the quotation marks
#eg: data['your_column'].value_counts()
Thanks
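If the goal is also to put the unique values and frequencies into a separate column or file (an assumption based on the question), the value_counts result can be turned into a frame and written back out; the output filename here is made up:
import pandas

data = pandas.read_excel("sample.xlsx", sheet_name="Sheet1")
counts = data['Random_num'].value_counts()
# index (the unique values) becomes one column, the counts another
result = counts.rename_axis('unique_value').reset_index(name='frequency')
result.to_excel('sample_counts.xlsx', index=False)   # hypothetical filename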

Unable to loop through Dataframe rows: Length of values does not match length of index

I'm not entirely sure why I am getting this error as I have a very simple dataframe that I am currently working with. Here is a sample of the dataframe (the date column is the index):
date         News
2021-02-01   This is a news headline. This is a news summary.
2021-02-02   This is another headline. This is another summary.
So basically, all I am trying to do is loop through the dataframe one row at a time, pull the News item, run the Sentiment Intensity Analyzer on it, and store the compound value in a separate list (appending to an initially empty list). However, when I run the loop, I get this error:
Length of values (5085) does not match the length of index (2675)
Here is a sample of the code that I have so far:
sia = SentimentIntensityAnalyzer()
news_sentiment_list = []
for i in range(0, (df_news.shape[0] - 1)):
    n = df_news.iloc[i][0]
    news_sentiment_list.append(sia.polarity_scores(n)['compound'])
df['News Sentiment'] = news_sentiment_list
I've tried the loop a number of different ways, and I always get that error. I am honestly lost at this point =(
Edit: The shape of the dataframe is (5087, 1).
The target DataFrame is df, whereas you loop over df_news; the indexes are probably not the same. You might need to merge the DataFrames first.
Moreover, there is an easier approach to your problem that avoids looping entirely. Assuming your DataFrame df_news holds the column News (as shown in your table), you can add a column to it simply by doing:
sia = SentimentIntensityAnalyzer()
df_news['News Sentiment'] = df_news['News'].apply(lambda x: sia.polarity_scores(x)['compound'])
A general rule when using pandas is to avoid for-loops as much as possible; except in very specific edge cases, pandas' built-in methods will be sufficient.
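For reference, a corrected version of the loop from the question would also work (assuming df_news is the frame that should receive the column): note the full range, since range(0, shape[0]-1) skips the last row, and the assignment onto df_news rather than df.
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
news_sentiment_list = []
for i in range(df_news.shape[0]):      # full range, including the last row
    n = df_news.iloc[i, 0]             # the News cell of row i
    news_sentiment_list.append(sia.polarity_scores(n)['compound'])
df_news['News Sentiment'] = news_sentiment_list   # same frame, matching length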

Extract values from array type of column in pandas

I am trying to extract the location codes / product codes from a SQL table using pandas. The field is an array type, i.e. it has multiple values as a list within each row. I have to extract the product/location codes from these strings.
Here is a sample of the table
df.head()
Target_Type Constraints
45 ti_8188,to_8188,r_8188,trad_8188_1,to_9258,ti_9258,r_9258,trad_9258_1
45 ti_8188,to_8188,r_8188,trad_8188_1,trad_22420_1
45 ti_8894,trad_8894_0.2
Now I want to extract the numeric values of the codes. I also want to ignore the trailing float values after the 2nd underscore in the entries, i.e., ignore the _1, _0.2, etc.
Here is a sample of the output I want to achieve. It should be a unique list/DataFrame column of all the extracted values:
Target_Type_45_df.head()
Constraints
8188
9258
22420
8894
I have never worked with nested/array type of column before. Any help would be appreciated.
You can use explode to bring each variable into a single cell, under one column:
df = df.explode('Constraints')
df['newConst'] = df['Constraints'].apply(lambda x: str(x).split('_')[1])
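Note that explode expects a list per cell; if Constraints actually holds comma-separated strings (as the sample suggests), split first, and the codes can then be de-duplicated. A sketch under that assumption:
import pandas as pd

# Assumption: Constraints holds comma-separated strings, not Python lists.
df = pd.DataFrame({'Constraints': ['ti_8188,to_8188,trad_8188_1',
                                   'ti_8894,trad_8894_0.2']})
df['Constraints'] = df['Constraints'].str.split(',')   # now a list per row
df = df.explode('Constraints')                         # one entry per row
# the token after the first underscore is the numeric code
codes = df['Constraints'].str.split('_').str[1].astype(int).drop_duplicates()
print(codes.tolist())   # [8188, 8894]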
I would think the following overall strategy would work well (you'll need to debug):
Define a function that takes a row as input (the idea being to broadcast this function with the pandas .apply method).
In this function, set my_list = row['Constraints'].
Then do my_list = my_list.split(','). Now you have a list, with no commas.
Next, split with the underscore, take the second element (index 1), and convert to int:
numbers = [int(element.split('_')[1]) for element in my_list]
Finally, convert to set: return set(numbers)
The output for each row will be a set - just union all these sets together to get the final result (a runnable sketch follows below).
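Here is one runnable version of that strategy, using the sample data from the question; the union step at the end is one possible choice:
import pandas as pd

df = pd.DataFrame({
    'Target_Type': [45, 45, 45],
    'Constraints': [
        'ti_8188,to_8188,r_8188,trad_8188_1,to_9258,ti_9258,r_9258,trad_9258_1',
        'ti_8188,to_8188,r_8188,trad_8188_1,trad_22420_1',
        'ti_8894,trad_8894_0.2',
    ],
})

def extract_codes(row):
    my_list = row['Constraints'].split(',')            # list, no commas
    return {int(element.split('_')[1]) for element in my_list}

sets = df.apply(extract_codes, axis=1)   # one set per row
all_codes = set().union(*sets)           # union of the per-row sets
print(sorted(all_codes))                 # [8188, 8894, 9258, 22420]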

Pandas: How to check if any of a list in a dataframe column is present in a range in another dataframe?

I'm trying to compare two bioinformatic DataFrames (one with transcription start and end genomic locations, and one with expression data). I need to check if any of a list of locations in one DataFrame is present within ranges defined by the start and end locations in the other DataFrame, returning rows/ids where they match.
I have tried a number of built-in methods (.isin, .where, .query), but usually get stuck because the lists are unhashable. I've also tried a nested for loop with iterrows and itertuples, which is exceedingly slow (my actual datasets are thousands of entries).
tss_df = pd.DataFrame(data={'id': ['gene1', 'gene2'],
                            'locs': [[21, 23], [34, 39]]})
exp_df = pd.DataFrame(data={'gene': ['geneA', 'geneB'],
                            'start': [15, 31], 'end': [25, 42]})
I'm looking to find that the row with id 'gene1' in tss_df has locations (locs) that match 'geneA' in exp_df.
The output would be something like:
output = pd.DataFrame(data={'id': ['gene1', 'gene2'],
                            'locs': [[21, 23], [34, 39]],
                            'match': ['geneA', 'geneB']})
Edit: Based on a comment below, I tried playing with merge_asof:
pd.merge_asof(tss_df,exp_df,left_on='locs',right_on='start')
This gave me an "incompatible merge keys" error, I suspect because I'm comparing a list to an integer; so I split out the first value in locs:
tss_df['loc1'] = tss_df['locs'][0]
pd.merge_asof(tss_df,exp_df,left_on='loc1',right_on='start')
This appears to have worked for my test data, but I'll need to try it with my actual data!
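For completeness, a runnable version of that merge_asof route with the sample data above; taking only the first location via .str[0] is a simplification, and both key columns must be sorted for merge_asof:
import pandas as pd

tss_df = pd.DataFrame({'id': ['gene1', 'gene2'],
                       'locs': [[21, 23], [34, 39]]})
exp_df = pd.DataFrame({'gene': ['geneA', 'geneB'],
                       'start': [15, 31], 'end': [25, 42]})

tss_df['loc1'] = tss_df['locs'].str[0]   # element-wise first location
result = pd.merge_asof(tss_df.sort_values('loc1'),
                       exp_df.sort_values('start'),
                       left_on='loc1', right_on='start')
# keep only matches where loc1 also falls before the interval's end
result = result[result['loc1'] <= result['end']]
print(result[['id', 'locs', 'gene']])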
