Search column for multiple strings but show faults (Python Pandas)

I am searching a column in my data frame for a list of values contained in a CSV that I have converted to a list. Searching for those values is not the issue here.
import pandas as pd
df = pd.read_csv('output2.csv')
hos = pd.read_csv('houses.csv')
parcelid_lst = hos['Parcel ID'].tolist()
result = df.loc[df['PARID'].isin(parcelid_lst)]
result
Once the list has been searched and the data frame of "found" values is shown, I would also like to print or display a list of the values from my list that were "unfound", i.e. did not exist in the data frame column I was searching.
Is there a specific method to call to do this?
Thank you in advance!

Adding a tilde does the opposite: it selects all the rows whose 'PARID' is not part of parcelid_lst.
not_found = df.loc[~df['PARID'].isin(parcelid_lst)]
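Note that this selects the rows of df whose 'PARID' is not in the list. If you want the other direction, the values from the list that never appear in df, a minimal sketch (same names as the question):
# values in parcelid_lst that are missing from df['PARID']
missing = [pid for pid in parcelid_lst if pid not in set(df['PARID'])]
print(missing)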
Hope that helps.

After reconsidering my question and thinking about it a little differently, the solution I found is to turn all the values in the 'PARID' column of the data frame into a list, then compare parcelid_lst against it.
This results in a list of all the values that exist in parcelid_lst but not in the data frame.
df = pd.read_csv('output2.csv')
allparids = df['PARID'].tolist()
hos = pd.read_csv('houses.csv')
parcelid_lst = hos['Parcel ID'].tolist()
list(set(parcelid_lst) - set(allparids))

I would also like to print or display a list of the values from the list that were "unfound" or did not exist in the data frame column I was searching.
You don't need to subset your dataframe for this. You can filter the 'Parcel ID' series for items not found in your dataframe column and then use pd.Series.unique:
not_found = hos.loc[~hos['Parcel ID'].isin(df['PARID'].unique()), 'Parcel ID'].unique()
As above, it's a good idea to pass an array of unique values to isin if you expect duplicates; the trailing .unique() likewise deduplicates the missing parcel IDs.
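Equivalently, a sketch using pd.Index.difference (same frames as the question):
# parcel IDs from houses.csv that never appear in output2.csv
missing = pd.Index(parcelid_lst).difference(df['PARID'])
print(list(missing))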

Related

Convert Embedded JSON Dictionary to Pandas Dataframe

I have an embedded set of data given to me which needs to be converted to a pandas Dataframe
"{'rows':{'data':[[{'column_name':'column','row_value':value}]]}"
It's just a snippet of what it looks like at the start. Everything inside data repeats over and over. i.e.
{'column_name': 'name', 'row_value': value}
I want the values of column_name to be the column headings. And the values of row_value to be the values in each row.
I've tried a few different ways. I thought it would be something along the lines of
df = pd.DataFrame(data=[data_rows['row_value'] for data_rows in raw_data['rows']['data']], columns=['column_name'])
But I might be way off. I'm probably not stepping into the data right with raw_data['rows']['data'].
Any suggestions would be great.
You can try to add another loop in your list comprehension to get elements out:
df = pd.DataFrame(data=[data_row for data_rows in raw_data['rows']['data'] for data_row in data_rows])
print(df)
                  name value type
0  dynamic_tag_tracker  null  null
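If you then want the column_name values as headings and the row_value entries as a row, as the question asks, a rough sketch building on the flattened list (this assumes the column_name/row_value keys from the question's snippet, and that each column_name appears only once):
# flatten the nested lists, then pivot the name/value pairs into one row
flat = [d for rows in raw_data['rows']['data'] for d in rows]
df = pd.DataFrame({d['column_name']: [d['row_value']] for d in flat})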

Not able to insert a string to a position in a dataframe

I'm trying to iterate over two data frames with different lengths in order to look for some data in a string. If the data is found, I should be able to add info to a specific position in a data frame. Here is the code.
In the df data frame, I created an empty column, which is going to receive the data in the future.
I also have the df_userstory data frame. And here is where I'm looking for the data. So, I created the code below.
Both df['Issue key'][i] and df_userstory['parent_of'][i] contain strings.
df['parent'] = ""
for i in df_userstory.index:
if df['Issue key'][i] in df_userstory['parent_of'][i]:
item = df_userstory['Issue key'][i]
df['parent'].iloc[i] = item
df
For some reason, when I run this code, df['parent'] remains empty. I've tried different approaches, but everything failed.
I've tried to do the following in order to check what was happening:
df['parent'] = ""
for i in df_userstory.index:
if df['Issue key'][i] in df_userstory['parent_of'][i]:
print('True')
Nothing was printed out.
I appreciate your help here.
Cheers
Iterating over each index loses all the performance benefits of using a Pandas dataframe.
You should use a dataframe merge instead.
# select the two columns you need from df_userstory
df_us = df_userstory.loc[:, ['parent_of', 'Issue key']]
# rename 'Issue key' to 'parent', which will be merged into the df dataframe
# (note: rename with inplace=True returns None, so don't chain off it)
df_us = df_us.rename(columns={'Issue key': 'parent'}).drop_duplicates()
# merge
df = df.merge(df_us, left_on='Issue key', right_on='parent_of', how='left')
Ref: Pandas merge
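A minimal sketch of the merge with made-up values (column names from the question, data invented for illustration):
import pandas as pd

df = pd.DataFrame({'Issue key': ['A-1', 'A-2', 'A-3']})
df_userstory = pd.DataFrame({'Issue key': ['US-1', 'US-2'],
                             'parent_of': ['A-1', 'A-3']})

df_us = df_userstory.loc[:, ['parent_of', 'Issue key']]
df_us = df_us.rename(columns={'Issue key': 'parent'}).drop_duplicates()

df = df.merge(df_us, left_on='Issue key', right_on='parent_of', how='left')
print(df)
#   Issue key parent_of parent
# 0       A-1       A-1   US-1
# 1       A-2       NaN    NaN
# 2       A-3       A-3   US-2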

Splitting a Pandas DataFrame based on whether the index name appears in a list

This should hopefully be a straightforward question but I'm new to Pandas.
I've got a DataFrame called RawData, and a list of permissible indexes called AllowedIndexes.
What I want to do is split the DataFrame into two new ones:
one DataFrame only with the indexes that appear on the AllowedIndexes list.
one DataFrame only with the indexes that don't appear on the AllowedIndexes list, for data cleaning purposes.
I've provided a simplified version of the actual data I'm using which in reality contains several series.
import pandas as pd
RawData = pd.DataFrame({'Quality':['#000000', '#FF0000', '#FFFFFF', '#PURRRR','#123Z']}, index = ['Black','Red','White', 'Cat','Blcak'])
AllowedIndexes = ['Black','White','Yellow','Red']
Thanks!
.index gets the index label for each row of the RawData dataframe.
.isin() checks whether each label exists in the AllowedIndexes list.
allowed = RawData[(RawData.index.isin(AllowedIndexes))==True]
not_allowed = RawData[(RawData.index.isin(AllowedIndexes))==False]
Another way, without comparing against True or False:
allowed = RawData[RawData.index.isin(AllowedIndexes)]
not_allowed = RawData[~(RawData.index.isin(AllowedIndexes))]
~ means "not" in pandas.
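For the sample RawData above, the split comes out like this (spacing approximate):
print(allowed)
#        Quality
# Black  #000000
# Red    #FF0000
# White  #FFFFFF
print(not_allowed)
#        Quality
# Cat    #PURRRR
# Blcak    #123Z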

Why is the pandas isin - query - loc function not finding all matching items

I have a dataframe where I'd like to add a column "exists" based on whether the item exists in another dataframe.
Using the isin function only answers back with one match based on that other dataframe. The same happens for a loc filter when I set the column I want to filter as the index.
It just doesn't work as expected when I use a reference to a list or a column of another DF like this:
table.loc[table.index.isin(tableOther['column']), : ]
In this case it only returns 1 item.
import pandas as pd
import numpy as np
# Source that i like to enrich with additional column
table = pd.read_csv('keywordsDataSource.csv', encoding='utf-8', delimiter=';', index_col='Keyword')
# Source to compare keywords against
tableSubject = pd.read_csv('subjectDataSource.csv', encoding='utf-8', names=["subjects"])
### This column based check only returns 1 - seemingly random - match ###
table.loc[table.index.isin(tableSubject['subjects']), : ]
# -------- also tried it like this --------
# Source that i like to enrich with additional column
table = pd.read_csv('keywordsDataSource.csv', encoding='utf-8', delimiter=';')
# Source to compare keywords against
tableSubject = pd.read_csv('subjectDataSource.csv', encoding='utf-8', names=["subjects"])
mask = table['Keyword'].isin(tableSubject.subjects)
table[mask]
I've also tried using .query and turning the subjects column into a list, which ends with the same single-match result as above.
As the output is the same in all tries, I expect that it is something with the data source.
Thank you for your thoughts!
Found the answer to be as simple as capitalization. Neither data source was lowercased: one list had Capitalized Words Like This and the other was mixed.
Lesson learned: make sure the columns are formatted exactly the same, since isin looks for exact matches.
This can be done as follows:
table['Keyword'] = table['Keyword'].str.lower()
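To make the match work, the other side needs the same normalization; a sketch using the frames from the question:
# lowercase the subjects column too, so isin compares like with like
tableSubject['subjects'] = tableSubject['subjects'].str.lower()
mask = table['Keyword'].isin(tableSubject['subjects'])
print(table[mask])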
Also found a great answer here in case you don't need an exact match:
How to test if a string contains one of the substrings in a list, in pandas?

I have multiple columns in csv. How do I match row values to a reference column using python?

I have a csv file with 367 columns. The first column has 15 unique values, and each subsequent column has some subset of those 15 values. No unique value is ever found more than once in a column. Each column is sorted. How do I get the rows to line up? My end goal is to make a presence/absence heat map, but I need to get the data matrix in the right format first, which I am struggling with.
Here is a small example of the type of data I have:
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,
I need the rows to match the reference but stay in the same column like so:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
My thought was to use the pandas library, but I could not figure out how to approach this problem, as I am very new to using python. I am using python2.7.
So your problem is definitely solvable via pandas:
Code:
# Create the sample data into a data frame
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
    df[column] = pd.to_numeric(df[column], downcast='integer')

# set the first column as an index
df = df.set_index([0])

# create a frame which we will build up
results = pd.DataFrame(index=df.index)

# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
    results[col] = df.index.isin(df[col])

# output the dataframe in the desired format
for idx, row in results.iterrows():
    result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
                                      for x in row.values))
    print(result)
Results:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
How does it work?:
Pandas can be a little daunting when first approached, even for someone who knows python well, so I will try to walk through this. And I encourage you to do what you need to get over the learning curve, because pandas is ridiculously powerful for this sort of data manipulation.
Get the data into a frame:
This first bit of code does nothing but get your sample data into a pandas.DataFrame. Your data format was not specified, so I will assume you can get it into a frame; if you cannot, ask another question here on SO about getting the data into a frame.
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
    df[column] = pd.to_numeric(df[column], downcast='integer')
# set the first column as an index
df = df.set_index([0])
Build a result frame:
Start with a result frame that is just the index
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
For each column in the source data, see if the value is in the index
# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
    results[col] = df.index.isin(df[col])
That's it: with three lines of code, we have calculated our results.
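For the sample data, results now holds booleans (spacing approximate):
print(results)
#        1      2      3
# 0
# 1  False   True  False
# 2   True   True   True
# 3   True   True  False
# 4   True  False  False
# 5  False   True   True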
Output the results:
Now iterate through each row, which contains booleans, and output the values in the desired format (as ints)
# output the dataframe in the desired format
for idx, row in results.iterrows():
    result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
                                      for x in row.values))
    print(result)
This outputs the index value first, and then for each True value outputs the index again, and for False values outputs an empty string.
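If you'd rather write the matrix back out as a CSV than print it, a sketch along the same lines (the file name is made up):
# replace each boolean with the index value or an empty string, then save
out = results.copy()
for col in out.columns:
    out[col] = [idx if flag else '' for idx, flag in zip(out.index, out[col])]
out.to_csv('matched.csv', header=False)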
Postscript:
There are quite a few people here on SO who are way better at pandas than I am, but since you did not tag your question with the pandas keyword, they likely did not notice it. That allows me to take my cut at answering before they do. The pandas keyword is very well covered for well-formed questions, so I am pretty sure that if this answer is not optimal, someone else will come by and improve it. So in the future, be sure to tag your question with pandas to get the best response.
Also, you mentioned that you were new to python, so I will just put in a plug to make sure that you are using a good IDE. I use PyCharm, and it and other good IDEs can make working in python even more powerful, so I highly recommend them.
