Pandas: create dataframe without auto ordering column names alphabetically - python

I am creating an initial pandas dataframe to store results generated from other codes: e.g.
result = pd.DataFrame({'date': datelist, 'total': [0]*len(datelist),
                       'TT': [0]*len(datelist)})
where datelist is a predefined list. Other code will then output numbers for total and TT for each date, which I will store in the result dataframe.
So I want the first column to be date, second total and third TT. However, pandas will automatically reorder it alphabetically to TT, date, total at creation. While I can manually reorder this again afterwards, I wonder if there is an easier way to achieve this in one step.
I figured I can also do
result = pd.DataFrame(np.transpose([datelist, [0]*l, [0]*l]),
                      columns=['date', 'total', 'TT'])
but that also looks tedious. Any other suggestions?

You can pass the (correctly ordered) list of columns as a parameter to the constructor, or use an OrderedDict:
# option 1:
result = pd.DataFrame({'date': datelist, 'total': [0]*len(datelist),
                       'TT': [0]*len(datelist)},
                      columns=['date', 'total', 'TT'])
# option 2:
import collections

od = collections.OrderedDict()
od['date'] = datelist
od['total'] = [0]*len(datelist)
od['TT'] = [0]*len(datelist)
result = pd.DataFrame(od)

result = pd.DataFrame({'date': [23, 24], 'total': 0,
                       'TT': 0}, columns=['date', 'total', 'TT'])

Use pandas >= 0.23 in combination with Python >= 3.6.
result = pd.DataFrame({'date': datelist,
                       'total': [0]*len(datelist),
                       'TT': [0]*len(datelist)})
retains the dict's insertion order when creating a DataFrame (or Series) from a dict, as long as you are on pandas >= 0.23 and Python >= 3.6.
See https://pandas.pydata.org/pandas-docs/version/0.23.0/whatsnew.html#whatsnew-0230-api-breaking-dict-insertion-order.
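On those versions no reordering happens at all, as a quick sketch (with a made-up three-element datelist) shows:

```python
import pandas as pd

datelist = ['2016-01-01', '2016-01-02', '2016-01-03']
result = pd.DataFrame({'date': datelist,
                       'total': [0] * len(datelist),
                       'TT': [0] * len(datelist)})

# columns come out in dict insertion order, no alphabetical sorting
print(result.columns.tolist())  # ['date', 'total', 'TT']
```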

Related

append successive dictionaries in Dataframe as rows

I have the following empty dataframe.
columns = ['image_path', 'label',
           'nose', 'neck',
           'r_sho', 'r_elb', 'r_wri',
           'l_sho', 'l_elb', 'l_wri',
           'r_hip', 'r_knee', 'r_ank',
           'l_hip', 'l_knee', 'l_ank',
           'r_eye', 'l_eye',
           'r_ear', 'l_ear']
df = pd.DataFrame(columns=columns)
And in every loop, I want to append a new dictionary to this dataframe, with the column names as keys. I tried df = df.append(dictionary, ignore_index=True). Yet only this warning appears and the script never finishes:
df = df.append(dictionary, ignore_index = True)
/home/karantai/opendr/projects/perception/lightweight_open_pose/demos/inference_demo.py:133: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
Can I append a new dictionary as new row in dataframe, in each loop?
As @Amer implied, it's not a good idea to append rows to a dataframe one by one, because each append has to create a whole new dataframe object, which is inefficient. Instead, you can merge all your dictionaries into one and then call pd.DataFrame() only once.
defaultdict is useful here, to automatically initialize the values of the full dictionary as lists. I would also use the get method for dictionaries, in case any one of your dictionaries is missing one of the keys. You can assign NumPy's nan as a placeholder for the missing values in that case.
In the end, pd.DataFrame() will pick up the column names from the dictionary keys. Here is an example with just three columns for simplicity:
from collections import defaultdict
import numpy as np
import pandas as pd

columns = ['image_path', 'label', 'nose']
dictionaries = [{'image_path': 'file1.jpg', 'label': 1, 'nose': 3},
                {'label': 2, 'nose': 2},
                {'image_path': 'file3.png', 'label': 3, 'nose': 1}]

full_dictionary = defaultdict(list)
for d in dictionaries:
    for key in columns:
        full_dictionary[key].append(d.get(key, np.nan))

df = pd.DataFrame(full_dictionary)
df
  image_path  label  nose
0  file1.jpg      1     3
1        NaN      2     2
2  file3.png      3     1
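For what it's worth, pd.DataFrame also accepts a list of dicts directly and fills missing keys with NaN by itself, so the defaultdict step can be skipped when the dicts are flat records. A minimal sketch with the same sample data:

```python
import pandas as pd

dictionaries = [{'image_path': 'file1.jpg', 'label': 1, 'nose': 3},
                {'label': 2, 'nose': 2},
                {'image_path': 'file3.png', 'label': 3, 'nose': 1}]

# one call; absent keys become NaN automatically
df = pd.DataFrame(dictionaries)
print(df)
```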

How to append several rows to an existing pandas dataframe with number of rows depending on a comprehension list

I am trying to fill an existing dataframe in pandas by adding several rows at a time; the number of rows depends on a comprehension list, so it is variable. The initial dataframe is filled as follows:
import pandas as pd
import portion as P

columns = ['chr', 'Start', 'End', 'type']
x = pd.DataFrame(columns=columns)
RANGE = [(212, 222), (866, 888), (152, 158)]
INTERVAL = P.Interval(*[P.closed(x, y) for x, y in RANGE])

def fill_df(df, junction, chr, type):
    df['Start'] = [x.lower for x in junction]
    df['End'] = [x.upper for x in junction]
    df['chr'] = chr
    df['type'] = type
    return df

z = fill_df(x, INTERVAL, 1, 'DUP')
The idea is to keep appending rows to the existing dataframe from different intervals (so a variable number of rows each time).
Here I have found different ways to add several rows, but none of them are easy to apply unless I write a function to convert my data into tuples or lists, which I am not sure would be efficient. I have also tried pandas append, but I was not able to make it work for a batch of lines.
Is there any simple way to do this?
Thanks a lot!
Have you tried wrapping the list comprehension in pd.Series?
df['Start.pos'] = pd.Series([x.lower for x in junction])
If you want to use append and append several elements at once, you can create a second DataFrame and simply append it to the first one. It looks like this:
import intvalpy as ip
import pandas as pd

inf = [1, 2, 3]
sup = [4, 5, 6]
intervals = ip.Interval(inf, sup)
add_intervals = ip.Interval([-10, -20], [10, 20])

df = pd.DataFrame(data={'start': intervals.a, 'end': intervals.b})
df2 = pd.DataFrame(data={'start': add_intervals.a, 'end': add_intervals.b})
# note: DataFrame.append is deprecated (removed in pandas 2.0)
df = df.append(df2, ignore_index=True)
print(df.head(10))
The intvalpy library specialized for classical and full interval arithmetic is used here. To set an interval or intervals, use the Interval function, where the first argument is the left end and the second is the right end of the intervals.
The ignore_index parameter makes the appended rows continue the indexing of the first table.
In case you want to add one line, you can do it as follows:
for k in range(len(intervals)):
    df = df.append({'start': intervals[k].a, 'end': intervals[k].b},
                   ignore_index=True)
print(df.head(10))
I purposely did it with a loop to show that you can do without creating a second table if you want to add a few rows.
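Since DataFrame.append is deprecated (and removed in pandas 2.0), the same batch extension can be written with pd.concat. A minimal sketch, using plain lists in place of the intvalpy interval objects:

```python
import pandas as pd

df = pd.DataFrame({'start': [1, 2, 3], 'end': [4, 5, 6]})
df2 = pd.DataFrame({'start': [-10, -20], 'end': [10, 20]})

# pd.concat is the non-deprecated replacement for df.append(df2, ...);
# ignore_index=True renumbers the combined index 0..n-1
df = pd.concat([df, df2], ignore_index=True)
print(df)
```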

Is there a way to rename multiple df columns in Python?

I'm trying to rename multiple columns in a dataframe to certain dates with Python.
Currently, the columns are as such: 2016-04, 2016-05, 2016-06....
I would like the columns to read: April2016, May2016, June2016...
There are around 40 columns. I am guessing a for loop would be the most efficient way to do this, but I'm relatively new to Python and not sure how to concatenate the column names correctly.
You can use loops or comprehensions along with a month dictionary to split, reorder, and replace the string column names:
# in your case this would be cols = df.columns
cols = ['2016-04', '2016-05', '2016-06']
rpl = {'04': 'April', '05': 'May', '06': 'June'}
cols = [''.join([rpl[i.split('-')[1]], i.split('-')[0]]) for i in cols]
cols
['April2016', 'May2016', 'June2016']
# then assign it back with df.columns = cols
You didn't share your dataframe, so I used a basic one to explain how to get the month from a given date. I assumed your dataframe looks like:
d = {'dates': ['2016-04', '2016-05', '2016-06']}  # just 3 of them
So the full code:
import datetime
import pandas as pd

d = {'dates': ['2016-04', '2016-05', '2016-06']}
df = pd.DataFrame(d)
for index, row in df.iterrows():
    get_date = row['dates'].split('-')
    get_month = get_date[1]
    month = datetime.date(1900, int(get_month), 1).strftime('%B')
    print(month + get_date[0])
OUTPUT:
April2016
May2016
June2016
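If the columns really are year-month strings, pd.to_datetime plus strftime avoids hand-writing the month dictionary entirely. A minimal sketch (assuming an English locale for the %B month names):

```python
import pandas as pd

cols = ['2016-04', '2016-05', '2016-06']  # stand-in for df.columns
# parse each year-month string and reformat as MonthYear
new_cols = [pd.to_datetime(c).strftime('%B%Y') for c in cols]
print(new_cols)  # ['April2016', 'May2016', 'June2016']
# then: df.columns = new_cols
```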

I have a dataframe with lists of dictionaries. Json_normalize and appending to a new df

This is what one row of themesdf looks like:
[{'code': '1', 'name': 'Economic management'},
{'code': '6', 'name': 'Social protection and risk management'}]
I want to normalize each row and add it to a newdf. This is what I have right now:
import pandas as pd

themesdf = json_df['mjtheme_namecode']
newdf = pd.DataFrame()
%timeit()
for row in themesdf:
    for item in row:
        newdf.append(json_normalize(item, 'name'))
newdf
After printing newdf, it comes out empty. My ultimate goal with this data is to get the top ten major project themes (column 'name').
I recently worked on something similar. My solution was to explode the list so each json has its own row. Then normalize with pd.json_normalize(). That creates a new df which needs to be rejoined with the original table.
Since you didn't provide any test data here's my best guess as to what works:
import pandas as pd
# explode list column
explodedDf = themesDf.explode('mjtheme_namecode')
# normalize that column into a new df
normalizedDf = pd.json_normalize(explodedDf['mjtheme_namecode'])
# (optional) you may want to drop the original column
themesDf = themesDf.drop('mjtheme_namecode', axis = 1)
# join on index (default) with original df
newDf = themesDf.join(normalizedDf)
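To then get the stated goal, the top ten theme names, a value_counts on the normalized 'name' column would finish the job. A sketch with made-up data standing in for that column:

```python
import pandas as pd

# stand-in for newDf['name'] after the join above
names = pd.Series(['Economic management',
                   'Social protection and risk management',
                   'Economic management'])

# count each theme and keep the ten most frequent
top_ten = names.value_counts().head(10)
print(top_ten)
```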

sort_values arranging values wrong

I have a dataframe with a few columns Client ID, Prd No, Prd Weight. I made Client ID an index column as part of the process of transforming the data from wide to long using the wide_to_long method.
When I apply the sort_values method to the Prd No column, the arrangement is weird. It arranges it as 1, 10, 100, 101, 2, 20, 200....etc. How I want the data arranged is 1, 2, 3, 4...
I've tried all sorts of things, including explicitly changing the Prd No to an integer type using the astype() method, but no luck.
What could I be doing wrong? Is it a setting or version of Pandas I am using? Help, anyone?
import pandas as pd

df = pd.read_csv("new_export.csv")
df1 = pd.wide_to_long(df, ['diameterbh', 'base_diam_1', 'length_1', 'top_diam_1',
                           'base_diam_2', 'length_2', 'top_diam_2',
                           'base_diam_3', 'length_3', 'top_diam_3', 'x_product'],
                      i='uniqueID', j='Tree Number', sep='_')
df3 = df2[df2['diameterbh'].notnull()].fillna(value=0).sort_values(by="Tree Number")
You should indeed change the index to a numeric type before sorting. Be sure to update your dataframe, because by default pandas returns a copy. This works for me:
import pandas as pd
# you want to convert column a to index, and sort numerically
x = pd.DataFrame({'a': ['10', '2', '12', '1'], 'b': [100, 20, 120, 10]})
x.set_index('a', inplace=True) # Set the index (inplace to overwrite x)
x.index = x.index.astype(int) # Make sure to change the index to a numeric type
x.sort_index(inplace=True) # Again, inplace to prevent returning a copy
This makes x into a dataframe with a properly sorted index.
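On pandas >= 1.1 there is also the key parameter of sort_values, which sorts a string column numerically without permanently converting its dtype. A sketch on toy data mimicking the Prd No column:

```python
import pandas as pd

df = pd.DataFrame({'Prd No': ['1', '10', '100', '2', '20'],
                   'Prd Weight': [5, 6, 7, 8, 9]})

# key is applied to the column before comparing, so '10' sorts after '2'
df = df.sort_values(by='Prd No', key=lambda s: s.astype(int))
print(df['Prd No'].tolist())  # ['1', '2', '10', '20', '100']
```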
