Convert Embedded JSON Dictionary to Pandas Dataframe - python

I have an embedded set of data given to me which needs to be converted to a pandas Dataframe
{'rows': {'data': [[{'column_name': 'column', 'row_value': value}]]}}
It's just a snippet of what it looks like at the start. Everything inside data repeats over and over. i.e.
{'column_name': 'name', 'row_value': value}
I want the values of column_name to be the column headings. And the values of row_value to be the values in each row.
I've tried a few different ways. I thought it would be something along the lines of:
df = pd.DataFrame(data=[data_rows['row_value'] for data_rows in raw_data['rows']['data']], columns=['column_name'])
But I might be way off. I'm probably not stepping into the data correctly with raw_data['rows']['data'].
Any suggestions would be great.

You can try to add another loop in your list comprehension to get elements out:
df = pd.DataFrame(data=[data_row for data_rows in raw_data['rows']['data'] for data_row in data_rows])
print(df)
name value type
0 dynamic_tag_tracker null null
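If each inner list in data represents one row, the column headings can come straight from the column_name keys by collapsing every row into a {column_name: row_value} dict first. A sketch, with made-up sample values standing in for the real payload:

```python
import pandas as pd

# Hypothetical sample in the shape described above:
# each inner list is one row, each dict is one cell.
raw_data = {'rows': {'data': [
    [{'column_name': 'name', 'row_value': 'alpha'},
     {'column_name': 'count', 'row_value': 1}],
    [{'column_name': 'name', 'row_value': 'beta'},
     {'column_name': 'count', 'row_value': 2}],
]}}

# Collapse each inner list into one {column_name: row_value} dict per row
records = [{cell['column_name']: cell['row_value'] for cell in row}
           for row in raw_data['rows']['data']]
df = pd.DataFrame(records)
print(df)
```

This way pandas builds the columns from the dict keys, so column_name values become the headers and row_value values fill the rows.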


Not able to insert a string to a position in a dataframe

I'm trying to iterate over two data frames with different lengths in order to look for some data in a string. If the data is found, I should be able to add info to a specific position in a data frame. Here is the code.
In the df data frame, I created an empty column, which is going to receive the data in the future.
I also have the df_userstory data frame. And here is where I'm looking for the data. So, I created the code below.
Both df['Issue key'][i] and df_userstory['parent_of'][i] contain strings.
df['parent'] = ""
for i in df_userstory.index:
    if df['Issue key'][i] in df_userstory['parent_of'][i]:
        item = df_userstory['Issue key'][i]
        df['parent'].iloc[i] = item
df
For some reason, when I run this code the df['parent'] remains empty. I've tried different approaches, but everything failed.
I've tried to do the following in order to check what was happening:
df['parent'] = ""
for i in df_userstory.index:
    if df['Issue key'][i] in df_userstory['parent_of'][i]:
        print('True')
Nothing was printed out.
I appreciate your help here.
Cheers
Iterating over each index loses all the performance benefits of using a pandas DataFrame.
You should use a dataframe merge instead.
# select the two columns you need from df_userstory
df_us = df_userstory.loc[:, ['parent_of', 'Issue key']]
# rename 'Issue key' to 'parent', which will be merged into the df dataframe
df_us = df_us.rename(columns={'Issue key': 'parent'}).drop_duplicates()
# merge
df = df.merge(df_us, left_on='Issue key', right_on='parent_of', how='left')
Ref: Pandas merge
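A minimal, self-contained sketch of that merge (the frames below are invented stand-ins for df and df_userstory):

```python
import pandas as pd

# Hypothetical stand-ins for the question's two frames
df = pd.DataFrame({'Issue key': ['A-1', 'A-2', 'A-3']})
df_userstory = pd.DataFrame({'Issue key': ['US-1', 'US-2'],
                             'parent_of': ['A-1', 'A-3']})

# select, rename and de-duplicate, then left-merge onto df
df_us = (df_userstory.loc[:, ['parent_of', 'Issue key']]
         .rename(columns={'Issue key': 'parent'})
         .drop_duplicates())
merged = df.merge(df_us, left_on='Issue key', right_on='parent_of', how='left')
print(merged)
```

Rows in df with no matching parent_of get NaN in the parent column, which is often exactly what you want instead of an empty string.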

Python Conditional NaN Value Replacement of existing Values in Dataframe

I'm trying to transform my DataFrame, which I loaded from a CSV.
That CSV has columns with NaN / missing values. The goal is to replace them all!
For example, in column 'gh', row 45 (as shown in the picture: Input Dataframe) has a missing value. I'd like to replace it with the value from row 1, because 'latitude', 'longitude', 'time', 'step' and 'valid_time' are equal, i.e. a condition-based replacement keyed on those columns. And not just for 'gh', but also for meanSea, msl, t, u and v.
Input Dataframe
I tried something like this (just for 'gh'):
for i, row in df.iterrows():
    value = row["gh"]
    if pd.isnull(value):
        for j, rowx in df.iterrows():
            if row["latitude"] == rowx["latitude"] and row["longitude"] == rowx["longitude"] and row["time"] == rowx["time"] and row["step"] == rowx["step"] and row["valid_time"] == rowx["valid_time"]:
                valuex = rowx["gh"]
                row["gh"] = valuex
                break
My Try
This is very inefficient for big DataFrames, so I need a better solution.
Assuming all values can be found somewhere in the dataset, the easiest way is to sort your df by those columns ('latitude','longitude', 'time' ,'step','valid_time') and forward fill your NaN's:
df.sort_values(by=['latitude','longitude', 'time' ,'step','valid_time']).ffill()
However, this fails if there are rows which do not have a counterpart somewhere else in the dataset.
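One way to handle that case is a groupby-transform, which fills NaNs only within groups that share all five key columns and leaves rows without a counterpart untouched. A sketch with made-up data:

```python
import pandas as pd
import numpy as np

# Made-up data: rows 0 and 1 share all key columns; row 2 has no counterpart
df = pd.DataFrame({
    'latitude':   [50.0, 50.0, 51.0],
    'longitude':  [8.0,  8.0,  9.0],
    'time':       [1,    1,    1],
    'step':       [0,    0,    0],
    'valid_time': [1,    1,    1],
    'gh':         [120.5, np.nan, np.nan],
})

keys = ['latitude', 'longitude', 'time', 'step', 'valid_time']
for col in ['gh']:  # extend with 'meanSea', 'msl', 't', 'u', 'v' as needed
    df[col] = df.groupby(keys)[col].transform(lambda s: s.ffill().bfill())
print(df)
```

Row 1 picks up 120.5 from its matching row 0, while row 2 stays NaN because no other row shares its keys.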

How to get value in dataframe columns that contains multi-value item

I have a column like this:
[screenshot of the column]
How do I get each individual value out of the column? The desired output is a list, e.g. [42008598,26472654,42054590,42774221,42444463], so the values can be counted.
Let me give you some advice: when you have example code to show us, it's best to paste it with code formatting, as it's much easier to read. Now to your question. You can select a row in a pandas dataframe like this:
import pandas as pd
print(df.iloc[i])
where i is the row number (0, 1, 2, ...) and df is your dataframe. Here is the Documentation.
I am also new to Stack Overflow. I hope this helps.
What you need is to convert each row in the dataframe to an array and then perform whatever operation you want on that array. With pandas, you can declare a function that handles a single row, then use apply to run it on every row.
An example that counts how many elements each row contains:
def treat_array(row):
    row = row.replace("{", "")
    row = row.replace("}", "")
    row = row.split(",")
    return len(row)

df["Elements Count"] = df["Name of Column with the Arrays"].apply(treat_array)
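Assuming the cells are brace-delimited strings like the screenshot suggests (the column name and values below are made up), a quick end-to-end check of that approach:

```python
import pandas as pd

def treat_array(row):
    # strip the braces, then split on commas to get the individual values
    row = row.replace("{", "")
    row = row.replace("}", "")
    row = row.split(",")
    return len(row)

df = pd.DataFrame({"ids": ["{42008598,26472654,42054590}",
                           "{42774221,42444463}"]})
df["Elements Count"] = df["ids"].apply(treat_array)
print(df["Elements Count"].tolist())  # [3, 2]
```

Returning the split list itself instead of len(row) would give you the actual values per row rather than just their count.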

Search column for multiple strings but show faults Python Pandas

I am searching a column in my data frame for a list of values contained in a CSV that I have converted to a list. Searching for those values is not the issue here.
import pandas as pd
df = pd.read_csv('output2.csv')
hos = pd.read_csv('houses.csv')
parcelid_lst = hos['Parcel ID'].tolist()
result = df.loc[df['PARID'].isin(parcelid_lst)]
result
What I would like to do is once the list has been searched and the data frame is shown with the "found" values I would also like to print or display a list of the values from the list that were "unfound" or did not exist in the data frame column I was searching.
Is there a specific method to call to do this?
Thank you in advance!
Adding a tilde (~) negates the condition; that would get all the rows whose values are not part of parcelid_lst:
not_found = df.loc[~df['PARID'].isin(parcelid_lst)]
Hope that helps.
After reconsidering my question and thinking about it a little bit differently, the solution I found is to turn all the values in the data frame in the 'PARID' column into a list. Then compare the 'parcelid_lst' to it.
This resulted in a list of all the values that did not exist in the data frame but did exist in the 'parcelid_lst'
df = pd.read_csv('output2.csv')
allparids = df['PARID'].tolist()
hos = pd.read_csv('houses.csv')
parcelid_lst = hos['Parcel ID'].tolist()
list(set(parcelid_lst) - set(allparids))
I would also like to print or display a list of the values from the list that were "unfound" or did not exist in the data frame column I was searching.
You don't need to subset your dataframe for this. You can filter the series of search values for items not found in your dataframe column and then use pd.Series.unique:
not_found = hos.loc[~hos['Parcel ID'].isin(df['PARID'].unique()), 'Parcel ID'].unique()
As above, it's a good idea to reduce df['PARID'] to an array of unique values if you expect duplicates to exist in the series.
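Put together as a runnable sketch (the two frames below stand in for the CSVs):

```python
import pandas as pd

df = pd.DataFrame({'PARID': [1, 2, 3]})          # stand-in for output2.csv
hos = pd.DataFrame({'Parcel ID': [2, 3, 4, 4]})  # stand-in for houses.csv

# values from the search list that never appear in df['PARID']
not_found = hos.loc[~hos['Parcel ID'].isin(df['PARID']), 'Parcel ID'].unique()
print(list(not_found))  # [4]
```

This matches the set-difference result above, but stays in pandas and de-duplicates in one step.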

I have multiple columns in csv. How do I match row values to a reference column using python?

I have a csv file with 367 columns. The first column has 15 unique values, and each subsequent column has some subset of those 15 values. No unique value is ever found more than once in a column. Each column is sorted. How do I get the rows to line up? My end goal is to make a presence/absence heat map, but I need to get the data matrix in the right format first, which I am struggling with.
Here is a small example of the type of data I have:
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,
I need the rows to match the reference but stay in the same column like so:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
My thought was to use the pandas library, but I could not figure out how to approach this problem, as I am very new to Python. I am using Python 2.7.
So your problem is definitely solvable via pandas:
Code:
# Create the sample data in a data frame
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)

for column in df:
    df[column] = pd.to_numeric(df[column], downcast='integer')

# set the first column as an index
df = df.set_index([0])

# create a frame which we will build up
results = pd.DataFrame(index=df.index)

# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
    results[col] = df.index.isin(df[col])

# output the dataframe in the desired format
for idx, row in results.iterrows():
    result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
                                      for x in row.values))
    print(result)
Results:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
How does it work?:
Pandas can be a little daunting when first approached, even for someone who knows Python well, so I will try to walk through this. I encourage you to do what you need to get over the learning curve, because pandas is ridiculously powerful for this sort of data manipulation.
Get the data into a frame:
This first bit of code does nothing but get your sample data into a pandas.DataFrame. Your data format was not specified, so I will assume you can get it into a frame; if you can't, that is worth a separate question here on SO about loading the data.
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)

for column in df:
    df[column] = pd.to_numeric(df[column], downcast='integer')

# set the first column as an index
df = df.set_index([0])
Build a result frame:
Start with a result frame that is just the index
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
For each column in the source data, see if the value is in the index
# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
    results[col] = df.index.isin(df[col])
That's it, with three lines of code, we have calculated our results.
Output the results:
Now iterate through each row, which contains booleans, and output the values in the desired format (as ints)
# output the dataframe in the desired format
for idx, row in results.iterrows():
    result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
                                      for x in row.values))
    print(result)
This outputs the index value first, and then for each True value outputs the index again, and for False values outputs an empty string.
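As an aside, the row loop can also be replaced with a vectorized apply over the boolean frame; a sketch reusing the same sample data (np.where comes from NumPy):

```python
import pandas as pd
import numpy as np
from io import StringIO

# same sample data and presence calculation as above
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
    df[column] = pd.to_numeric(df[column], downcast='integer')
df = df.set_index([0])

results = pd.DataFrame(index=df.index)
for col in df.columns:
    results[col] = df.index.isin(df[col])

# replace True with the index value (as a string) and False with ''
out = results.apply(lambda c: np.where(c, results.index.astype(str), ''))
print(out)
```

This keeps everything in a DataFrame, which is handy if the next step is building the presence/absence heat map rather than printing CSV lines.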
Postscript:
There are quite a few people here on SO who are way better at pandas than I am, but since you did not tag your question with the pandas keyword, they likely did not notice it. That allows me to take my cut at answering before they do. The pandas tag is very well covered for well-formed questions, so I am pretty sure that if this answer is not optimal, someone else will come by and improve it. In the future, be sure to tag your question with pandas to get the best responses.
Also, you mentioned that you are new to Python, so I will just put in a plug for using a good IDE. I use PyCharm; it and other good IDEs can make working in Python even more productive, so I highly recommend them.
