Pandas dataframe finding the mean - python

I have a dataframe that looks like the attached image. I want to find the mean for every finalAward_band 'value'. I'm not sure how to do this.

str.contains can perform either a literal substring search or a regex-based search. It defaults to regex unless you explicitly disable it.
When regex matching is not required, pass regex=False to disable it; a plain substring search is also faster.
# select all rows whose 'col' value contains "finalAward_band"
df1[df1['col'].str.contains('finalAward_band', regex=False)]
# same as df1[df1['col'].str.contains('finalAward_band')], but faster
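The question itself asks for a per-band mean, and in pandas that is typically a groupby. A minimal sketch, assuming columns named finalAward_band and value (names taken from the question; the data here is made up since the image isn't available):

```python
import pandas as pd

# hypothetical data mirroring the question's columns
df = pd.DataFrame({
    "finalAward_band": ["First", "First", "Upper Second", "Upper Second"],
    "value": [70.0, 74.0, 62.0, 66.0],
})

# mean of 'value' for every finalAward_band
means = df.groupby("finalAward_band")["value"].mean()
print(means)
```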

Related

Is there a Python pandas function for retrieving a specific value of a dataframe based on its content?

I've got multiple Excel files and I need a specific value, but in each file the cell with the value changes position slightly. However, this value is always preceded by a generic description of it, which remains constant across all the files.
I was wondering if there was a way to ask Python to grab the value to the right of the element containing the string "xxx".
Try iterating over the Excel files (I assume you loaded each as a separate pandas object?),
something like for df in [dataframe1, dataframe2, ..., dataframeN].
Then you could pick the column you need (if the column stays constant), e.g. df['columnX'], and find which index it has:
df.index[df['columnX']=="xxx"]. It may make sense to add .tolist() at the end, so that if "xxx" is a value that repeats more than once, you get all occurrences in a list.
The last step would be to take the index + 1 to get the value you want.
Hope that helps.
In general I would highly suggest being more specific in your questions and providing code/examples.
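The steps above can be sketched as follows. The frame, the column name columnX, and the marker "xxx" are all placeholders from the answer, and, as the answer suggests, it takes index + 1 (the next row) as the target value:

```python
import pandas as pd

# hypothetical frame standing in for one parsed Excel sheet
df = pd.DataFrame({"columnX": ["foo", "xxx", "42", "bar"]})

# integer positions where the marker appears (default RangeIndex assumed)
hits = df.index[df["columnX"] == "xxx"].tolist()

# take index + 1 to reach the value following each marker
values = [df["columnX"].iloc[i + 1] for i in hits if i + 1 < len(df)]
print(values)
```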

Creating new data frame by extracting cells (contains strings) with certain substring that matches with a list of words (have over 100)

I already found a roundabout solution to the problem, and I am sure there is a simple way.
I got two data frames, each with one column; DF1 and DF2 contain strings.
Now, I try to match using .str.contains in Python; limited by my knowledge, I am forced to manually enter the substrings I am looking for.
contain_values = df[df['month'].str.contains('Ju|Ma')]
This highlighted way is how I am able to match substrings within DF1 from DF2.
The current scenario pushes me to add 100 words using the vertical bar, as in str.contains('Ju|Ma').
Can anyone kindly share some wisdom on how to link the second data frame, which contains a single column of 100+ words?
Here is one way to do it. If you post an MRE, I would be able to test and share the result, but the below should work. Note that flags must be passed by keyword, since str.contains's second positional argument is case, not flags.
import re

# create a list of words to search for
w = ['ju', 'ma']
# join them into a single alternation pattern
s = '|'.join(w)
# case-insensitive regex search
df[df['month'].str.contains(s, flags=re.IGNORECASE)]
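A self-contained version of the same idea, with the pattern built from the second frame's column. The column names month and word are assumptions, and re.escape keeps any words containing regex metacharacters literal:

```python
import re

import pandas as pd

df1 = pd.DataFrame({"month": ["June", "March", "October", "May"]})
df2 = pd.DataFrame({"word": ["Ju", "Ma"]})  # 100+ words in practice

# build one alternation pattern from the second frame's column
pattern = "|".join(map(re.escape, df2["word"]))

# case-insensitive match of any word against df1
matches = df1[df1["month"].str.contains(pattern, flags=re.IGNORECASE)]
print(matches)
```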

Searching for indexes of multiple substrings in multiple files

I've got two dataframes which are as follows:
df1 : contains one variable ['search_term'] and 100000 rows
These are words/phrases I want to search for in my files
df2: contains parsed file contents in a column called file_text
There are 20000 rows in this dataframe and two columns ['file_name', 'file_text']
What I need is the index of each appearance of a search term in the file_text.
I cannot figure out an efficient way to perform this search.
I am using the str.find() function along with groupby, but it's taking around 0.25 s per file_text/search-term pair (which becomes really long with 20k files × 100k search terms).
Any ideas on ways to do this in a fast and efficient way would be lifesavers!
I remember having to do something similar in one of our projects. We had a very large set of keywords and we wanted to search for them in a large string and find all occurrences of those keywords. Let's call the string we want to search in content. After some benchmarking, the solution I adopted was a two-pass method: first check to see if a keyword exists in the content using the highly optimized in operator, and then use regular expressions to find all occurrences of it.
import re

keywords = [...list of your keywords ...]
found_keywords = []

# first pass: cheap membership test with the `in` operator
for keyword in keywords:
    if keyword in content:
        found_keywords.append(keyword)

# second pass: regex to locate every occurrence
# (re.escape keeps keywords containing regex metacharacters literal)
for keyword in found_keywords:
    for match in re.finditer(re.escape(keyword), content):
        print(match.start())
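As a concrete illustration of the two-pass method (the content string and keywords here are made up), collecting the start indexes per keyword:

```python
import re

content = "pandas makes data analysis in python easier; pandas is fast"
keywords = ["pandas", "numpy", "python"]

# first pass: cheap membership test with the `in` operator
found = [kw for kw in keywords if kw in content]

# second pass: regex to collect every start index; re.escape guards
# against keywords that contain regex metacharacters
positions = {kw: [m.start() for m in re.finditer(re.escape(kw), content)]
             for kw in found}
print(positions)
```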

dataframes: selecting data by [] and by . (attribute)

I found out that I have trouble understanding when I should access data from a dataframe (df) using df[data] versus df.data.
I mostly use the [] method to create new columns, and I can access data using both df[] and df.data, but what's the difference, and how can I better grasp these two ways of selecting data? When should one be used over the other?
If I understand the Docs correctly, they are pretty much equivalent, except in these cases:
You can use the . access only if the index element is a valid Python identifier, e.g. s.1 is not allowed.
The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed.
Similarly, the attribute will not be available if it conflicts with any of the following list: index, major_axis, minor_axis, items, labels.
In any of these cases, standard indexing will still work, e.g. s['1'], s['min'], and s['index'] will access the corresponding element or column.
However, while
indexing operators [] and attribute operator . provide quick and easy
access to pandas data structures across a wide range of use cases [...]
in production you should really use the optimized pandas data access methods such as .loc and .iloc (the older .ix has since been deprecated and removed), because
[...] since the type of the data to be accessed isn’t known
in advance, directly using standard operators has some optimization
limits. For production code, we recommended that you take advantage of
the optimized pandas data access methods.
Using [] will look up whatever the expression inside the brackets evaluates to:
a = "hello"
df[a]  # gives you the column named "hello"
Using .:
df.a  # gives you the column named "a"
The difference is that with the first one you can use a variable.
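A small sketch contrasting the two access styles (the frame and column names are made up). Note that df.min hits the built-in DataFrame method rather than the 'min' column, so bracket indexing is the only way to reach that column:

```python
import pandas as pd

df = pd.DataFrame({"hello": [1, 2], "min": [3, 4]})

col = "hello"
print(df[col].tolist())   # bracket access works with a variable
print(df.hello.tolist())  # attribute access: fixed, valid-identifier names only

# df.min is the aggregation method, not the 'min' column,
# so brackets are required here
print(df["min"].tolist())
```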

python .loc with some condition(string, regex etc)

I want to get a subset of the dataframe, where the value of a certain column starts with the string 'HOUS'. How should I do this?
df.loc[df.id.startswith('HOUS')]  # AttributeError: 'Series' object has no attribute 'startswith'
I should have searched more.
Here is the solution.
df[df.id.str.startswith('HOUS')]
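A runnable sketch of that fix (the data is made up); passing na=False makes rows with a missing id count as non-matches instead of producing NA values in the boolean mask:

```python
import pandas as pd

df = pd.DataFrame({"id": ["HOUS001", "APT002", None, "HOUS003"]})

# .str.startswith is vectorised; na=False treats missing values as no-match
subset = df[df["id"].str.startswith("HOUS", na=False)]
print(subset["id"].tolist())
```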
