Not able to insert a string to a position in a dataframe - python

I'm trying to iterate over two data frames with different lengths in order to look for some data in a string. If the data is found, I should be able to add info to a specific position in a data frame. Here is the code.
In the df data frame, I created an empty column, which will receive the data later.
I also have the df_userstory data frame, which is where I'm looking for the data. So I wrote the code below.
Both df['Issue key'][i] and df_userstory['parent_of'][i] contain strings.
df['parent'] = ""
for i in df_userstory.index:
    if df['Issue key'][i] in df_userstory['parent_of'][i]:
        item = df_userstory['Issue key'][i]
        df['parent'].iloc[i] = item
df
For some reason, when I run this code the df['parent'] remains empty. I've tried different approaches, but everything failed.
I've tried to do the following in order to check what was happening:
df['parent'] = ""
for i in df_userstory.index:
    if df['Issue key'][i] in df_userstory['parent_of'][i]:
        print('True')
Nothing was printed out.
I appreciate your help here.
Cheers

Iterating over each index loses the performance benefits of a pandas DataFrame.
Use a DataFrame merge instead.
# select the two columns you need from df_userstory
df_us = df_userstory.loc[:, ['parent_of', 'Issue key']]
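# rename 'Issue key' to 'parent' (the column that will be merged into df) and drop duplicates;
# rename(..., inplace=True) returns None, so reassign instead of chaining on it
df_us = df_us.rename(columns={'Issue key': 'parent'}).drop_duplicates()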
# merge
df = df.merge(df_us, left_on='Issue key', right_on='parent_of', how='left')
Ref: Pandas merge
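A minimal sketch of the merge approach on made-up data, assuming each parent_of value holds exactly one issue key (merge matches on equality, not on substrings). The sample keys such as 'T-1' and 'US-10' and the name merged are hypothetical.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's frames.
df = pd.DataFrame({'Issue key': ['T-1', 'T-2', 'T-3']})
df_userstory = pd.DataFrame({
    'Issue key': ['US-10', 'US-11'],
    'parent_of': ['T-1', 'T-3'],
})

# Keep only the lookup columns and rename the story key to 'parent'.
df_us = (
    df_userstory[['parent_of', 'Issue key']]
    .rename(columns={'Issue key': 'parent'})
    .drop_duplicates()
)

# A left merge keeps every row of df; rows without a match get NaN in 'parent'.
merged = df.merge(df_us, left_on='Issue key', right_on='parent_of', how='left')
print(merged[['Issue key', 'parent']])
```

If df already has an empty 'parent' column, drop it before merging; otherwise pandas adds suffixes and you end up with parent_x and parent_y. The loop in the question only compares row i of df with row i of df_userstory, which may be why nothing matched.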

Related

Convert Embedded JSON Dictionary to Pandas Dataframe

I have an embedded set of data given to me which needs to be converted to a pandas Dataframe
"{'rows':{'data':[[{'column_name':'column','row_value':value}]]}"
It's just a snippet of what it looks like at the start. Everything inside data repeats over and over. i.e.
{'column_name': 'name', 'row_value': value}
I want the values of column_name to be the column headings. And the values of row_value to be the values in each row.
I've tried a few different ways. I thought it would be something along the lines of
df = pd.DataFrame(data=[data_rows['row_value'] for data_rows in raw_data['rows']['data']], columns=['column_name'])
But I might be way off. I'm probably not stepping into the data correctly with raw_data['rows']['data'].
Any suggestions would be great.
You can try to add another loop in your list comprehension to get elements out:
df = pd.DataFrame(data=[data_row for data_rows in raw_data['rows']['data'] for data_row in data_rows])
print(df)
name value type
0 dynamic_tag_tracker null null
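If the goal is for each column_name to become a column heading, one possible sketch is to turn every inner list into a single dict first. This assumes each inner list under 'data' represents one row; the sample values in raw_data and the name records are made up for illustration.

```python
import pandas as pd

# Hypothetical data in the shape described in the question.
raw_data = {
    'rows': {
        'data': [
            [{'column_name': 'name', 'row_value': 'alpha'},
             {'column_name': 'type', 'row_value': 'x'}],
            [{'column_name': 'name', 'row_value': 'beta'},
             {'column_name': 'type', 'row_value': 'y'}],
        ]
    }
}

# One dict per row: column_name becomes the key, row_value the cell value.
records = [
    {cell['column_name']: cell['row_value'] for cell in row}
    for row in raw_data['rows']['data']
]

df = pd.DataFrame(records)
print(df)
```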

Merge the data with the help of python or tableau

I have two Excel sheets, one with 63,000 rows and the other with 67,000 rows, which contain careers and their eligibility. Both have the same Title column, so I merged on the title, but the output has 44,00,000 rows (4.4 million). Why is that? Please help me with this problem. Thank you.
import pandas as pd
df = pd.read_excel('c/downloads/knowledge.xlsx')
df1 = pd.read_excel('c/downloads/Abilities.xlsx')
df2 = pd.merge(df, df1, on='Title')
# Create a list of the files in the order you want to merge
all_df_list = [df, df1]
# Merge all the dataframes in all_df_list. Pandas will automatically append based on similar column names if that is what you meant by "same title".
appended_df = pd.concat(all_df_list)
# export as an excel file
appended_df.to_excel("data.xlsx", index=False)
Let me know if this helps. It works only if both files have the same column labels.
Make sure you're using the correct join type (left, right, inner, outer, and so on). It sounds like you need a left join, which matches data from the right table to the left one and returns values accordingly, similar to a VLOOKUP. An outer join includes all values from both tables and can dramatically increase your record count. Note that pd.merge defaults to an inner join, so in pandas a blow-up of this size usually means the Title values repeat in both sheets: every duplicate on one side is paired with every duplicate on the other.
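A small sketch of how repeated keys multiply rows in a merge, on made-up data. The frames left and right and their contents are hypothetical; validate is an existing pd.merge parameter that raises an error when the keys are not unique.

```python
import pandas as pd

# Hypothetical sheets where 'Title' repeats on both sides.
left = pd.DataFrame({'Title': ['Nurse', 'Nurse', 'Pilot'],
                     'knowledge': ['a', 'b', 'c']})
right = pd.DataFrame({'Title': ['Nurse', 'Nurse', 'Nurse', 'Pilot'],
                      'ability': ['w', 'x', 'y', 'z']})

# Each 'Nurse' row on the left pairs with each 'Nurse' row on the right.
merged = pd.merge(left, right, on='Title')
print(len(merged))  # 2 * 3 rows for 'Nurse' plus 1 row for 'Pilot'

# Count how often each key repeats to see where the growth comes from.
print(left['Title'].value_counts())
print(right['Title'].value_counts())

# Ask pandas to check uniqueness; it raises MergeError (a ValueError subclass).
keys_are_unique = True
try:
    pd.merge(left, right, on='Title', validate='one_to_one')
except ValueError as exc:
    keys_are_unique = False
    print('keys are not unique:', exc)
```

If the duplicates are not meaningful, dropping them on one side (for example with drop_duplicates(subset='Title')) before merging keeps the row count in check.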

Search column for multiple strings but show faults Python Pandas

I am searching a column in my data frame for a list of values contained in a CSV that I have converted to a list. Searching for those values is not the issue here.
import pandas as pd
df = pd.read_csv('output2.csv')
hos = pd.read_csv('houses.csv')
parcelid_lst = hos['Parcel ID'].tolist()
result = df.loc[df['PARID'].isin(parcelid_lst)]
result
What I would like to do is once the list has been searched and the data frame is shown with the "found" values I would also like to print or display a list of the values from the list that were "unfound" or did not exist in the data frame column I was searching.
Is there a specific method to call to do this?
Thank you in advance!
Adding the tilde inverts the mask. That gives you all the rows of df whose 'PARID' is not in parcelid_lst:
not_found = df.loc[~df['PARID'].isin(parcelid_lst)]
Hope that helps.
After reconsidering my question and thinking about it a little bit differently, the solution I found is to turn all the values in the data frame in the 'PARID' column into a list. Then compare the 'parcelid_lst' to it.
This resulted in a list of all the values that did not exist in the data frame but did exist in the 'parcelid_lst'
df = pd.read_csv('output2.csv')
allparids = df['PARID'].tolist()
hos = pd.read_csv('houses.csv')
parcelid_lst = hos['Parcel ID'].tolist()
list(set(parcelid_lst) - set(allparids))
"I would also like to print or display a list of the values from the list that were "unfound" or did not exist in the data frame column I was searching."
You don't need to subset your dataframe for this. You can filter your series for items not found in your specified list (or series) and then use pd.Series.unique:
not_found = df.loc[~df['PARID'].isin(hos['Parcel ID'].unique()), 'PARID'].unique()
As above, it's a good idea to make your hos['Parcel ID'] an array of unique values if you expect duplicates to exist in the series.
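The line above filters df['PARID']. For the other direction, listing the values from the parcel list that never appear in the frame, a sketch along the same lines could look like this. The sample IDs and the names wanted and found are made up.

```python
import pandas as pd

# Hypothetical stand-ins for output2.csv and houses.csv.
df = pd.DataFrame({'PARID': ['A1', 'B2', 'C3']})
hos = pd.DataFrame({'Parcel ID': ['A1', 'Z9', 'Z9', 'C3', 'Q7']})

# Unique search values, keeping their original order.
wanted = hos['Parcel ID'].drop_duplicates()

# Rows of df whose PARID is in the search list.
found = df.loc[df['PARID'].isin(wanted)]

# Search values that do not occur anywhere in df['PARID'].
not_found = wanted[~wanted.isin(df['PARID'])].tolist()
print(not_found)
```

Unlike a set difference, this keeps the order in which the values appear in the list.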

Cleaning Dataframe in Python 3

I've got a dataframe (haveleft) full of people who have left a service and their reason for leaving. The 'text' column is their reason, but some of them aren't strings. Not many, so I just want to remove those rows, either in place or to a new dataframe. Below code just gives me a dataframe populated with only NaN. Why doesn't it work?
cleanedleft = pd.DataFrame()
cleanedleft = haveleft[haveleft[haveleft['text'] == str]]
print(holder[0:10])
or if I remove one of the 'haveleft[ ]' I get an empty dataframe
cleanedleft = pd.DataFrame()
cleanedleft = haveleft[haveleft['text'] == str]
print(holder[0:10])
I've tried to add a type() but can't seem to figure out the way to do this.
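It doesn't work because haveleft['text'] == str compares every value with the type object str itself rather than checking each value's type, so the mask is False everywhere. A DataFrame column has a single dtype; a text column that also holds numbers is stored as object, so there is no per-row dtype to test either. You'll want to figure out how to characterize the unwanted data and drop those rows accordingly.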
For instance, to drop rows where 'text' consists only of digits as in the single-line example you give:
cleaned = df[~df['text'].str.match(r'^\d+$')]
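If the unwanted values are genuinely not Python strings (numbers, None, and so on), a sketch that tests each value's type directly could look like this. The sample rows are made up, and is_text is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical frame with a few non-string reasons mixed in.
haveleft = pd.DataFrame(
    {'text': ['too expensive', 42, None, 'moved away', 3.5]}
)

# True where the cell holds an actual str object.
is_text = haveleft['text'].map(lambda value: isinstance(value, str))

# Keep only those rows, as a new dataframe.
cleanedleft = haveleft[is_text].reset_index(drop=True)
print(cleanedleft)
```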

I have multiple columns in csv. How do I match row values to a reference column using python?

I have a csv file with 367 columns. The first column has 15 unique values, and each subsequent column has some subset of those 15 values. No unique value is ever found more than once in a column. Each column is sorted. How do I get the rows to line up? My end goal is to make a presence/absence heat map, but I need to get the data matrix in the right format first, which I am struggling with.
Here is a small example of the type of data I have:
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,
I need the rows to match the reference but stay in the same column like so:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
My thought was to use the pandas library, but I could not figure out how to approach this problem, as I am very new to using python. I am using python2.7.
So your problem is definitely solvable via pandas:
Code:
# Create the sample data into a data frame
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
    df[column] = pd.to_numeric(df[column], downcast='integer')
# set the first column as an index
df = df.set_index([0])
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
    results[col] = df.index.isin(df[col])
# output the dataframe in the desired format
for idx, row in results.iterrows():
    result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
                                      for x in row.values))
    print(result)
Results:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
How does it work?:
Pandas can be a little daunting when first approached, even for someone who knows Python well, so I will try to walk through this. I encourage you to do what you need to get over the learning curve, because pandas is ridiculously powerful for this sort of data manipulation.
Get the data into a frame:
This first bit of code does nothing but get your sample data into a pandas.DataFrame. Your data format was not specified, so I will assume that you can get it into a frame, or, if you cannot, that you will ask another question here on SO about getting the data into a frame.
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
    df[column] = pd.to_numeric(df[column], downcast='integer')
# set the first column as an index
df = df.set_index([0])
Build a result frame:
Start with a result frame that is just the index
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
For each column in the source data, see if the value is in the index
# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
    results[col] = df.index.isin(df[col])
That's it, with three lines of code, we have calculated our results.
Output the results:
Now iterate through each row, which contains booleans, and output the values in the desired format (as ints)
# output the dataframe in the desired format
for idx, row in results.iterrows():
    result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
                                      for x in row.values))
    print(result)
This outputs the index value first, and then for each True value outputs the index again, and for False values outputs an empty string.
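Since the stated end goal is a presence/absence heat map, a 0/1 matrix may be more convenient than the printed text. The following is a sketch of one way to build it from the same sample data, written for Python 3; the names raw, reference and presence are hypothetical.

```python
import pandas as pd
from io import StringIO

# Same sample data as above; empty cells become NaN.
raw = pd.read_csv(
    StringIO("1,2,1,2\n2,3,2,5\n3,4,3,\n4,,5,\n5,,,\n"), header=None
)

# The first column is the reference list of values.
reference = raw[0]

# For each remaining column, mark 1 where the reference value occurs in it.
presence = pd.DataFrame(
    {col: reference.isin(raw[col].dropna()).astype(int)
     for col in raw.columns[1:]}
)

# Label the rows with the reference values themselves.
presence.index = reference.values
print(presence)
```

A matrix like this can be handed to a plotting function that accepts a 2D array of numbers.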
Postscript:
There are quite a few people here on SO who are way better at pandas than I am, but since you did not tag your question with the pandas keyword, they likely did not notice it. That gave me the chance to take my cut at answering before they did. The pandas tag is very well covered for well-formed questions, so I am pretty sure that if this answer is not optimal, someone else will come by and improve it. In the future, be sure to tag your question with pandas to get the best response.
Also, you mentioned that you are new to Python, so I will just put in a plug to make sure that you are using a good IDE. I use PyCharm, and it and other good IDEs can make working in Python even more powerful, so I highly recommend them.
