My function outputs a list, for instance when I type:
My_function('TV', 'TV_Screen')
it outputs the following:
['TV', 1, 'TV_Screen', 0.04, 'True']
Now, my TV is made of several parts, such as a speaker, a transformer, etc. I can keep running my function for each part, for instance changing 'TV_Screen' to 'TV_Speaker', 'TV_transformer', etc.
The alternative is to create a list with all the parts, such as:
TV_parts = ['TV_Screen', 'TV_Speaker', 'TV_transformer']
What I am trying to get is a pandas DataFrame with 5 columns (because my function outputs 5 values, see "it outputs the following:" above) and, in this case, 3 rows (one each for 'TV_Screen', 'TV_Speaker', and 'TV_transformer'). Basically, I want the following to be in a data frame:
['TV', 1, 'TV_Screen', 0.04, 'True']
['TV', 9, 'TV_Speaker', 0.56, 'True']
['TV', 3, 'TV_transformer', 0.80, 'False']
I know I need a for loop somewhere, but I am not sure how to create this data frame. Could you please help? (I can change the output of my function to be a pd.Series or something else that would work better).
Thanks!
If you have many rows, you can collect them into a list (or a NumPy array) first and then convert that into a dataframe. Note that np.matrix is deprecated in recent NumPy versions, so use np.array instead:
import pandas as pd
import numpy as np
a = ['TV', 1, 'TV_Screen', 0.04, 'True']
b = ['TV', 9, 'TV_Speaker', 0.56, 'True']
c = ['TV', 3, 'TV_transformer', 0.80, 'False']
arr = np.array([a, b, c])
df = pd.DataFrame(data=arr)
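That said, a list of lists can be passed to pd.DataFrame directly, which also preserves the per-column dtypes (going through a NumPy array would coerce everything to strings):

```python
import pandas as pd

a = ['TV', 1, 'TV_Screen', 0.04, 'True']
b = ['TV', 9, 'TV_Speaker', 0.56, 'True']
c = ['TV', 3, 'TV_transformer', 0.80, 'False']

# Each inner list becomes one row of the DataFrame
df = pd.DataFrame([a, b, c])
```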
You can do it like this:
def My_function(part):
    # prepare the result; hard-coded values here, for testing
    result = ['TV', 1, part, 0.04, 'True']
    return result
TV_parts = ['TV_Screen', 'TV_Speaker', 'TV_transformer']
df = pd.DataFrame([My_function(part) for part in TV_parts])
>>> df
0 1 2 3 4
0 TV 1 TV_Screen 0.04 True
1 TV 1 TV_Speaker 0.04 True
2 TV 1 TV_transformer 0.04 True
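If you also want meaningful column names instead of 0..4, pass columns= to the DataFrame constructor. The names below are placeholders, since the question doesn't say what the five outputs mean; substitute your own:

```python
import pandas as pd

def My_function(part):
    # stand-in for the real function, hard-coded for testing
    return ['TV', 1, part, 0.04, 'True']

TV_parts = ['TV_Screen', 'TV_Speaker', 'TV_transformer']
df = pd.DataFrame([My_function(part) for part in TV_parts],
                  columns=['product', 'count', 'part', 'score', 'flag'])
```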
Hi I'm trying to encode a Genome, stored as a string inside a dataframe read from a CSV.
Right now I'm looking to split each string in the dataframe under the column 'Genome' into a list of its base pairs, i.e. from ('acgt...') to ('a','c','g','t',...), then convert each base pair into a float (0.25, 0.50, 0.75, 1.00) respectively.
I thought I was looking for a split function to split each string into characters, but none seem to work on the data in the dataframe, even when converted to a string using .tostring.
Here's my most recent code:
import re
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
def string_to_array(my_string):
    my_string = my_string.lower()
    my_string = re.sub('[^acgt]', 'z', my_string)
    my_array = np.array(list(my_string))
    return my_array
label_encoder = LabelEncoder()
label_encoder.fit(np.array(['a','g','c','t','z']))
def ordinal_encoder(my_array):
    integer_encoded = label_encoder.transform(my_array)
    float_encoded = integer_encoded.astype(float)
    float_encoded[float_encoded == 0] = 0.25  # A
    float_encoded[float_encoded == 1] = 0.50  # C
    float_encoded[float_encoded == 2] = 0.75  # G
    float_encoded[float_encoded == 3] = 1.00  # T
    float_encoded[float_encoded == 4] = 0.00  # anything else, z
    return float_encoded
dfpath = 'C:\\Users\\CAAVR\\Desktop\\Ison.csv'
dataframe = pd.read_csv(dfpath)
df = ordinal_encoder(string_to_array(dataframe[['Genome']].values.tostring()))
print(df)
I've tried making my own function, but I don't have any clue how these work. Everything I try points to not being able to process the data while it's in a numpy array, and nothing is working to transform the data to another type.
Thanks for the tips!
Edit: Here is the print of the dataframe-
Antibiotic ... Genome
0 isoniazid ... ccctgacacatcacggcgcctgaccgacgagcagaagatccagctc...
1 isoniazid ... gggggtgctggcggggccggcgccgataaccccaccggcatcggcg...
2 isoniazid ... aatcacaccccgcgcgattgctagcatcctcggacacactgcacgc...
3 isoniazid ... gttgttgttgccgagattcgcaatgcccaggttgttgttgccgaga...
4 isoniazid ... ttgaccgatgaccccggttcaggcttcaccacagtgtggaacgcgg...
There are 5 columns, 'Genome' being the 5th. I don't know why 1. .head() will not work and 2. print() doesn't give me all the columns...
I don't think LabelEncoder is what you want. This is a simple transformation, so I recommend doing it directly. Start with a lookup table for your base-pair mapping:
lookup = {
    'a': 0.25,
    'g': 0.50,
    'c': 0.75,
    't': 1.00
    # anything else ('z') maps to 0.00
}
Then apply the lookup to each value of the "Genome" column; dict.get with a default of 0.0 handles anything outside acgt. The values attribute returns the result as an ndarray.
dataframe['Genome'].apply(lambda bps: pd.Series([lookup.get(bp, 0.0) for bp in bps.lower()])).values
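Put together on a tiny made-up frame (two short genome strings, just for illustration), the whole transformation looks like this:

```python
import pandas as pd

lookup = {'a': 0.25, 'g': 0.50, 'c': 0.75, 't': 1.00}

dataframe = pd.DataFrame({'Genome': ['acgt', 'ttgz']})
encoded = dataframe['Genome'].apply(
    lambda bps: pd.Series([lookup.get(bp, 0.0) for bp in bps.lower()])
).values
# encoded is a 2-D float ndarray: one row per genome, one column per base
```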
I have some json data of phone accelerometer data. Which looks like this:
{u'timestamps': {u'1524771017235': [[u'x',
u'y',
u'z',
u'rotationX',
u'rotationY',
u'rotationZ'],
[-0.02, 0, 0.04, 102.65, 68.15, 108.61],
[-0.03, 0.02, 0.02, 102.63, 68.2, 108.5],
[-0.05, 0.01, 0.1, 102.6, 68.25, 108.4],
[-0.02, 0, 0.09, 102.6, 68.25, 108.4],
[-0.01, 0, 0.03, 102.6, 68.25, 108.4]]}}
What I want is a DataFrame whose columns are the names of the data (x, y, z, rotationX, rotationY, rotationZ), with one row per data entry. The timestamp information can be stored elsewhere.
When I used d = pd.read_json('data.json'), this is what I get:
timestamps
2018-04-26 19:30:17.235 [[x, y, z, rotationX, rotationY, rotationZ], [...
It seems it takes the timestamps as the index and puts all the rest in one cell.
I don't have much experience with JSON, so I couldn't really make much sense of the pandas.read_json API. Please help.
My current workaround is to manually skip through the first 2 dictionaries and create a df with the first sub-list as the headers. It works, but it is really not ideal...
dataDf = pd.DataFrame(data = d['timestamps']['1524771017235'][1:], columns = d['timestamps']['1524771017235'][0])
x y z rotationX rotationY rotationZ
0 -0.02 0.00 0.04 102.65 68.15 108.61
1 -0.03 0.02 0.02 102.63 68.20 108.50
2 -0.05 0.01 0.10 102.60 68.25 108.40
Thanks
What you need is access to the key of the inner dictionary {u'1524771017235': [[u'x', ..., which is the value associated with the key timestamps of the dictionary d loaded from your JSON file. In Python 3, keys() returns a view, so wrap it in list() before indexing:
list(d['timestamps'].keys())[0]
should return your '1524771017235' value. So to create your dataDf just do:
key = list(d['timestamps'].keys())[0]
dataDf = pd.DataFrame(data=d['timestamps'][key][1:],
                      columns=d['timestamps'][key][0])
and you get the same result.
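A self-contained version of that approach, using next(iter(...)) to grab the single timestamp key (the data below is a subset of the sample from the question):

```python
import pandas as pd

d = {'timestamps': {'1524771017235': [
    ['x', 'y', 'z', 'rotationX', 'rotationY', 'rotationZ'],
    [-0.02, 0, 0.04, 102.65, 68.15, 108.61],
    [-0.03, 0.02, 0.02, 102.63, 68.2, 108.5],
    [-0.05, 0.01, 0.1, 102.6, 68.25, 108.4],
]}}

ts = next(iter(d['timestamps']))  # the timestamp key
rows = d['timestamps'][ts]
# the first sub-list holds the column names, the rest are data rows
dataDf = pd.DataFrame(rows[1:], columns=rows[0])
```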
I am now getting used to Python lists, but I encountered a complicated list and I am having trouble parsing it.
prediction=[('__label__inflation_today', 0.8),('__label__economic_outlook', 0.2)]
I am trying to present this prediction in a better way, something like Excel:
predicted label probability
Inflation_today 0.8
Economic_outlook 0.2
You can try
for x in prediction:
    string = x[0].replace('__label__', '')
    print(string, ":", x[1])
inflation_today : 0.8
economic_outlook : 0.2
If you want to access it using those names, you can also create a dictionary
d = {}
for x in prediction:
    string = x[0].replace('__label__', '')
    d[string] = x[1]
d
{'economic_outlook': 0.2, 'inflation_today': 0.8}
d['economic_outlook']
0.2
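The same dictionary can be built in one line with a dict comprehension:

```python
prediction = [('__label__inflation_today', 0.8), ('__label__economic_outlook', 0.2)]
d = {label.replace('__label__', ''): prob for label, prob in prediction}
```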
One possible solution is pandas DataFrame,
then use Series.str.replace:
import pandas as pd
prediction=[('__label__inflation_today', 0.8), ('__label__economic_outlook', 0.2)]
df = pd.DataFrame(prediction, columns=['predicted label', 'probability'])
df['predicted label'] = df['predicted label'].str.replace('__label__', '')
print (df)
predicted label probability
0 inflation_today 0.8
1 economic_outlook 0.2
If need only data use DataFrame.to_string:
print (df.to_string(index=False, header=None))
inflation_today 0.8
economic_outlook 0.2
I'm trying to make a pandas dataframe from a .npy file which, when read in using np.load, returns a numpy array containing a dictionary. My initial instinct was to extract the dictionary and then create a dataframe using pd.DataFrame.from_dict, but this fails every time because I can't seem to get the dictionary out of the array returned from np.load. It looks like it's just np.array(dictionary, dtype=object), but I can't get the dictionary by indexing the array or anything like that. I've also tried using np.load('filename').item(), but the result still isn't recognized by pandas as a dictionary.
Alternatively, I tried pd.read_pickle and that didn't work either.
How can I get this .npy dictionary into my dataframe? Here's the code that keeps failing...
import pandas as pd
import numpy as np
import os
targetdir = '../test_dir/'
filenames = []
successful = []
unsuccessful = []
for dirs, subdirs, files in os.walk(targetdir):
    for name in files:
        filenames.append(name)
        path_to_use = os.path.join(dirs, name)
        if path_to_use.endswith('.npy'):
            try:
                file_dict = np.load(path_to_use).item()
                df = pd.from_dict(file_dict)
                #df = pd.read_pickle(path_to_use)
                successful.append(path_to_use)
            except:
                unsuccessful.append(path_to_use)
                continue

print str(len(successful)) + " files were loaded successfully!"
print "The following files were not loaded:"
for item in unsuccessful:
    print item + "\n"
print df
Let's assume that once you load the .npy, the item (np.load(path_to_use).item()) looks similar to this:
{'user_c': 'id_003', 'user_a': 'id_001', 'user_b': 'id_002'}
So, if you need to come up with a DataFrame like below using above dictionary;
user_name user_id
0 user_c id_003
1 user_a id_001
2 user_b id_002
You can use;
df = pd.DataFrame(list(file_dict.items()), columns=['user_name', 'user_id'])
(In Python 2, use iteritems() instead of items().)
If you have a list of dictionaries like below;
users = [{'u_name': 'user_a', 'u_id': 'id_001'}, {'u_name': 'user_b', 'u_id': 'id_002'}]
You can simply use
df = pd.DataFrame(users)
To come up with a DataFrame similar to;
u_id u_name
0 id_001 user_a
1 id_002 user_b
Seems like you have a dictionary similar to this;
data = {
    'Center': [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    'Vpeak': [1.1, 2.2],
    'ID': ['id_001', 'id_002']
}
In this case, you can simply use;
df = pd.DataFrame(data)  # df = pd.DataFrame(file_dict) in your case
To come up with a DataFrame similar to;
Center ID Vpeak
0 [0.1, 0.2, 0.3] id_001 1.1
1 [0.4, 0.5, 0.6] id_002 2.2
If you have ndarray within the dict, do some preprocessing similar to below; and use it to create the df;
for key in data:
if isinstance(data[key], np.ndarray):
data[key] = data[key].tolist()
df = pd.DataFrame(data)
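For reference, a full save/load round trip of a dict through .npy can be sketched like this (the file name is just an example). Note that recent NumPy versions require allow_pickle=True when loading object arrays:

```python
import numpy as np
import pandas as pd

data = {'ID': ['id_001', 'id_002'], 'Vpeak': [1.1, 2.2]}
np.save('example_dict.npy', data)  # the dict is pickled inside a 0-d object array

file_dict = np.load('example_dict.npy', allow_pickle=True).item()
df = pd.DataFrame(file_dict)
```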
I have a CSV file where one of the columns looks like a numpy array. The first few lines look like the following
first,second,third
170.0,2,[19 234 376]
170.0,3,[19 23 23]
162.0,4,[1 2 3]
162.0,5,[1 3 4]
When I load this CSV into a pandas data frame using the following code
data = pd.read_csv('myfile.csv', converters = {'first': np.float64, 'second': np.int64, 'third': np.array})
Now, I want to group by based on the 'first' column and union the 'third' column. So after doing this my dataframe should look like
170.0, [19 23 234 376]
162.0, [1 2 3 4]
How do I achieve this? I tried multiple ways like the following and nothing seems to help achieve this goal.
group_data = data.groupby('first')
group_data['third'].apply(lambda x: np.unique(np.concatenate(x)))
With your current csv file the 'third' column comes in as a string, instead of a list.
There might be nicer ways to convert to a list, but here goes...
from ast import literal_eval
data = pd.read_csv('test_groupby.csv')
# Convert to a string representation of a list...
data['third'] = data['third'].str.replace(' ', ',')
# Convert string to list...
data['third'] = data['third'].apply(literal_eval)
group_data=data.groupby('first')
# Two secrets here revealed
# x.values instead of x since x is a Series
# list(...) to return an aggregated value
# (np.array should work here, but...?)
ans = group_data.aggregate(
    {'third': lambda x: list(np.unique(np.concatenate(x.values)))})
print(ans)
third
first
162 [1, 2, 3, 4]
170 [19, 23, 234, 376]
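An alternative is to do the parsing at read time with a converter that turns each bracketed string into a real integer array; here the sample CSV is fed in through an in-memory io.StringIO instead of a file:

```python
import io
import numpy as np
import pandas as pd

csv = io.StringIO(
    "first,second,third\n"
    "170.0,2,[19 234 376]\n"
    "170.0,3,[19 23 23]\n"
    "162.0,4,[1 2 3]\n"
    "162.0,5,[1 3 4]\n"
)
data = pd.read_csv(
    csv,
    # strip the brackets, split on whitespace, build an int array
    converters={'third': lambda s: np.array(s.strip('[]').split(), dtype=int)},
)
ans = data.groupby('first')['third'].apply(
    lambda x: np.unique(np.concatenate(x.values)).tolist()
)
```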