I have a question in an assessment where I have to convert NumPy data into a pandas DataFrame. It should also use the data's dtype names as the column headers.
I cannot use the pd.DataFrame() constructor for this task, and there is a clue saying I am still supposed to use pandas methods.
This is the code I have so far:
def convert_to_df(data):
    "converting numpy array into dataframe"
    far = data.tolist()
    return pd.Series(far).to_frame()
which does convert it to a DataFrame, giving this with the test:
0
0 (2020-02-29 13:32:59, 1.23375E+18, 0.67, 0.293...
1 (2020-02-27 00:20:58, 1.23282E+18, 0.442, 0.38...
2 (2020-02-10 18:54:50, 1.22694E+18, 0.577, 0.42...
3 (2020-02-29 05:23:06, 1.23362E+18, 0.514, 0.41...
4 (2020-02-26 03:20:55, 1.23251E+18, 0.426, 0.37...
I'm just confused about how to then get the headers in order. When I run the test code, my output is meant to look like this:
created_at ... emotion_category
0 2020-02-29 13:32:59 ... joy
1 2020-02-27 00:20:58 ... fear
2 2020-02-10 18:54:50 ... joy
3 2020-02-29 05:23:06 ... no specific emotion
4 2020-02-26 03:20:55 ... fear
[5 rows x 9 columns]
I have attached a screenshot of the question so you can see the test code and the wording.
Hope someone can help!
The data I'm using looks like this (see the attached screenshot).
You can use pandas Series instead. What I would do is convert each column in the NumPy array to a Series. For example, say I have the following NumPy array:
data = np.array([[1, 2, 3], [4, 5, 6]])
I will use a for loop to create a Series for each column:
series = []
for i in range(data.shape[1]):
    series.append(pd.Series(data[:, i], name="Serie_" + str(i)))
Finally, concatenate these Series into one DataFrame:
pd.concat(series, axis=1)
The result:
Serie_0 Serie_1 Serie_2
0 1 2 3
1 4 5 6
I hope this helps.
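Applied to the original question, the same per-column idea works with the field names taken from the array itself via data.dtype.names, so the headers come from the data rather than being hardcoded. A minimal sketch, assuming data is a NumPy structured array (which the tuples in your tolist() output suggest):
import pandas as pd

def convert_to_df(data):
    "Build a DataFrame from a structured array without calling pd.DataFrame()."
    # One Series per field, named after the field so it becomes the column header.
    columns = [pd.Series(data[name], name=name) for name in data.dtype.names]
    # Concatenating along axis=1 lines the Series up side by side as columns.
    return pd.concat(columns, axis=1)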
Try this one:
def convert_to_df(data):
    '''convert numpy to pandas'''
    headers = ['created_at', 'tweet_ID', 'valence_intensity', 'anger_intensity', 'fear_intensity',
               'sadness_intensity', 'joy_intensity', 'sentiment_category', 'emotion_category']
    df = data[:][headers[0]].tolist()
    df = pd.Series(df).to_frame()
    df.columns = ['created_at']
    for i in range(1, len(headers)):
        newcol = data[:][headers[i]].tolist()
        newcol = pd.Series(newcol).to_frame()
        df[headers[i]] = newcol
    return df
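Incidentally, since the required headers are exactly the array's field names, the hardcoded list can be derived from the data itself (assuming data is a structured array), which keeps the function working if the fields ever change:
headers = list(data.dtype.names)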
Related
Say I have one dataframe
import pandas as pd
input_df = pd.DataFrame(dict(a=[1, 2], b=[2, 3]))
I also have a function f that maps each row to another DataFrame. Here's an example of such a function. Note that in general the function could take any form, so I'm not looking for answers that use agg to reimplement the f below.
def f(row):
    return pd.DataFrame(dict(x=[row['a'] * row['b'], row['a'] + row['b']],
                             y=[row['a']**2, row['b']**2]))
I want to create one dataframe that is the concatenation of the function applied to each of the first dataframe's rows. What is the idiomatic way to do this?
output_df = pd.concat([f(row) for _, row in input_df.iterrows()])
I thought I should be able to use apply or something similar for this purpose, but nothing seemed to work. The concat above gives the desired result:
x y
0 2 1
1 3 4
0 6 4
1 5 9
You can use DataFrame.agg to calculate prod and sum along each row, numpy.ndarray.reshape to flatten the result, and np.square (or df.pow(2)) to calculate the squares:
out = pd.DataFrame({'x': df.agg(['prod', 'sum'], axis=1).to_numpy().reshape(-1),
                    'y': np.square(df).to_numpy().reshape(-1)})
out
x y
0 2 1
1 3 4
2 6 4
3 5 9
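As for the apply route the asker mentions, one variant that does seem to work is letting apply(axis=1) collect the per-row DataFrames into a Series of objects and concatenating those. A sketch, assuming a reasonably recent pandas and the f from the question:
import pandas as pd

input_df = pd.DataFrame(dict(a=[1, 2], b=[2, 3]))

def f(row):
    return pd.DataFrame(dict(x=[row['a'] * row['b'], row['a'] + row['b']],
                             y=[row['a']**2, row['b']**2]))

# apply with a DataFrame-returning function yields a Series of DataFrames,
# which tolist() turns into something pd.concat can stitch together.
output_df = pd.concat(input_df.apply(f, axis=1).tolist())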
You should avoid iterating over rows (see How to iterate over rows in a DataFrame in Pandas).
Instead try:
df = df.assign(product=df.a * df.b, sum=df.sum(axis=1),
               asq=df.a**2, bsq=df.b**2)
Then:
# select just the four computed columns; df still has a and b as well
df = [[[p, s], [asq, bsq]]
      for p, s, asq, bsq in df[['product', 'sum', 'asq', 'bsq']].to_numpy()]
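To get from those assigned columns back to the asker's two-column x/y layout, one possible follow-up (my own sketch, not part of the original answer; the column names match the assign call above):
import pandas as pd

df = pd.DataFrame(dict(a=[1, 2], b=[2, 3]))
df = df.assign(product=df.a * df.b, sum=df.sum(axis=1),
               asq=df.a**2, bsq=df.b**2)
vals = df[['product', 'sum', 'asq', 'bsq']].to_numpy()
# Columns 0-1 hold the x values and columns 2-3 the y values, row by row,
# so a row-major reshape interleaves them in the desired order.
out = pd.DataFrame({'x': vals[:, :2].reshape(-1),
                    'y': vals[:, 2:].reshape(-1)})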
When storing data in a JSON object with to_json and reading it back with read_json, rows and columns are returned sorted alphabetically. Is there a way to keep the results ordered, or to reorder them upon retrieval?
You could use orient='split', which stores the index and column information in lists, and lists preserve order:
In [34]: df
Out[34]:
A C B
5 0 1 2
4 3 4 5
3 6 7 8
In [35]: df.to_json(orient='split')
Out[35]: '{"columns":["A","C","B"],"index":[5,4,3],"data":[[0,1,2],[3,4,5],[6,7,8]]}'
In [36]: pd.read_json(df.to_json(orient='split'), orient='split')
Out[36]:
A C B
5 0 1 2
4 3 4 5
3 6 7 8
Just remember to use orient='split' on reading as well, or you'll get
In [37]: pd.read_json(df.to_json(orient='split'))
Out[37]:
columns data index
0 A [0, 1, 2] 5
1 C [3, 4, 5] 4
2 B [6, 7, 8] 3
If you want to use orient='records' and keep the order of the columns, you can make a function like the one below. I don't think it is a wise approach and do not recommend it, because the order is not guaranteed.
import json

def df_to_json(df):
    res_arr = []
    ldf = df.copy()
    ldf = ldf.fillna('')
    # prepend the index name so the index is included in each record
    lcolumns = [ldf.index.name] + list(ldf.columns)
    for key, value in ldf.iterrows():
        lvalues = [key] + list(value)
        res_arr.append(dict(zip(lcolumns, lvalues)))
    return json.dumps(res_arr)
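For example, a quick check of the helper above (the DataFrame here is made up):
import pandas as pd

df = pd.DataFrame({'b': [1], 'a': [2]})
df.index.name = 'idx'
print(df_to_json(df))  # [{"idx": 0, "b": 1, "a": 2}]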
In addition, for reading it back without the columns being sorted, see Python json.loads changes the order of the object.
Good luck!
Let's say you have a pandas DataFrame that you read in:
import pandas as pd
df = pd.read_json('/abc.json')
df.head()
There are two ways to save it back to JSON using pandas to_json. With orient='split':
df.sample(200).to_json('abc_sample.json', orient='split')
the column names are written once, as a single ordered list. However, to preserve the order the way a CSV does, row by row, use orient='records':
df.sample(200).to_json('abc_sample_2nd.json', orient='records')
which writes one record per row, with the keys in column order.
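A small self-contained illustration of the two orients (the DataFrame here is a made-up example):
import pandas as pd

df = pd.DataFrame({'b': [1, 2], 'a': [3, 4]})
# orient='split' keeps the column list explicit and ordered
print(df.to_json(orient='split'))
# {"columns":["b","a"],"index":[0,1],"data":[[1,3],[2,4]]}
# orient='records' writes one dict per row, keys in column order
print(df.to_json(orient='records'))
# [{"b":1,"a":3},{"b":2,"a":4}]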
I want to drop specific rows from a pandas dataframe. Usually you can do that using something like
df[df['some_column'] != 1234]
What df['some_column'] != 1234 does is create a boolean mask that indexes the DataFrame, letting only the rows where the mask is True through.
But in some cases, like mine, I don't see how I can express the condition that way, and iterating over pandas rows is far too slow to be a viable option.
To be more specific, I want to drop all rows where the value of a column is also a key in a dictionary, in a similar manner to the example above.
In a perfect world I would consider something like
df[df['some_column'] not in my_dict.keys()]
which obviously does not work. Any suggestions?
What you're looking for is isin()
import pandas as pd
df = pd.DataFrame([[1, 2], [1, 3], [4, 6],[5,7],[8,9]], columns=['A', 'B'])
In [9]: df
Out[9]:
A B
0 1 2
1 1 3
2 4 6
3 5 7
4 8 9
mydict = {1:'A',8:'B'}
df[df['A'].isin(mydict.keys())]
Out[11]:
A B
0 1 2
1 1 3
4 8 9
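Since the goal is to drop those rows rather than keep them, negate the mask with ~:
df[~df['A'].isin(mydict.keys())]
which leaves only rows 2 and 3 from the example above.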
You could use query for this purpose:
keys = list(my_dict)
df.query('some_column not in @keys')
You can use the function isin() to select rows whose column value is in an iterable.
Using lists:
my_list = ['my', 'own', 'data']
df.loc[df['column'].isin(my_list)]
Using dicts:
my_dict = {'key1':'Some value'}
df.loc[df['column'].isin(my_dict.keys())]
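And again, since the question is about dropping rows rather than selecting them, invert the mask:
df.loc[~df['column'].isin(my_dict.keys())]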
Say I have two pandas Series in Python:
import pandas as pd
h = pd.Series(['g',4,2,1,1])
g = pd.Series([1,6,5,4,"abc"])
I can create a DataFrame with just h and then append g to it:
df = pd.DataFrame([h])
df1 = df.append(g, ignore_index=True)
I get:
>>> df1
0 1 2 3 4
0 g 4 2 1 1
1 1 6 5 4 abc
But now suppose that I have an empty DataFrame and I try to append h to it:
df2 = pd.DataFrame([])
df3 = df2.append(h, ignore_index=True)
This does not work. I think the problem is in the second-to-last line of code. I need to somehow define the blank DataFrame to have the proper number of columns.
By the way, the reason I am trying to do this is that I am scraping text from the internet using requests and BeautifulSoup, processing it, and writing it to a DataFrame one row at a time.
So if you don't pass an empty list to the DataFrame constructor, then it works:
In [16]:
df = pd.DataFrame()
h = pd.Series(['g',4,2,1,1])
df = df.append(h, ignore_index=True)
df
Out[16]:
0 1 2 3 4
0 g 4 2 1 1
[1 rows x 5 columns]
The difference between the two constructor approaches appears to be that the index dtypes are set differently: with an empty list it is int64, with no argument it is object:
In [21]:
df = pd.DataFrame()
print(df.index.dtype)
df = pd.DataFrame([])
print(df.index.dtype)
object
int64
It's unclear to me why the above should affect the behaviour (I'm guessing here).
UPDATE
After revisiting this, I can confirm that this looks to me like a bug in pandas version 0.12.0, as your original code works fine:
In [13]:
import pandas as pd
df = pd.DataFrame([])
h = pd.Series(['g',4,2,1,1])
df.append(h, ignore_index=True)
Out[13]:
0 1 2 3 4
0 g 4 2 1 1
[1 rows x 5 columns]
I am running pandas 0.13.1 and numpy 1.8.1 (64-bit) using Python 3.3.5, and I think the problem is pandas, but I would upgrade both pandas and numpy to be safe; I don't think this is a 32- versus 64-bit Python issue.
I have a DataFrame like:
df = np.array([[1,5,3,4,5,5,6,],[1,2,2,3,4,5,6],[1,2,3,4,5,6,6]])
df = pd.DataFrame(df)
and my expected output is the most common value in each row, like:
0 5
1 2
2 6
I'm new to pandas. Thank you for any help.
With pandas version 0.13.0 you can use df.mode(axis=1)
(check your version with pd.__version__):
df.mode(axis=1)
0
0 5
1 2
2 6
[3 rows x 1 columns]
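Note that df.mode(axis=1) returns a DataFrame rather than a Series, because ties can produce more than one mode per row. To get exactly the single-column Series shown in the question, take the first column:
df.mode(axis=1)[0]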
The concept you are looking for is the mode, which is the most commonly occurring value in a set. SciPy and pandas both have ways to handle modes, through scipy.stats.mode and pandas.DataFrame.mode (which works along an axis). So for this example you could say:
import scipy.stats

df = np.array([[1, 5, 3, 4, 5, 5, 6], [1, 2, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6, 6]])
results = np.zeros(len(df))  # allocate once, outside the loop
for i in np.arange(len(df)):
    # mode() returns (mode, count); np.ravel copes with both old and new SciPy shapes
    results[i] = np.ravel(scipy.stats.mode(df[i]).mode)[0]
This returns a NumPy array with the mode of each row. To do the same thing with pandas you can do:
df = np.array([[1,5,3,4,5,5,6,],[1,2,2,3,4,5,6],[1,2,3,4,5,6,6]])
df = pd.DataFrame(df)
df.mode(axis=1)
The documentation is here: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.mode.html