Dataframe Sorting - python

I am working with Python pandas, starting by sorting a DataFrame I created from a CSV file. Eventually I want to build a for loop that compares values. However, when I print the new values, they come from the original DataFrame instead of the sorted version. How do I do the below properly?
Original CSV data:
date fruit quantity
4/5/2014 13:34 Apples 73
4/5/2014 3:41 Cherries 85
4/6/2014 12:46 Pears 14
4/8/2014 8:59 Oranges 52
4/10/2014 2:07 Apples 152
4/10/2014 18:10 Bananas 23
4/10/2014 2:40 Strawberries 98
Code:
import pandas as pd
import numpy
df = pd.read_csv('example2.csv', header=0, dtype='unicode')
df_count = df['fruit'].value_counts()
x = 0 #starting my counter values or position in the column
df.sort_values(['fruit'], ascending=True, inplace=True) #sorting the column fruit
print(df)
old_fruit = df.fruit[x]
new_fruit = df.fruit[x+1]
print(old_fruit)
print(new_fruit)

I believe you are still accessing positions from the old index via x. After you sort, insert this to reindex:
df.reset_index(drop=True, inplace=True)
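For instance, a minimal sketch of the corrected flow (assuming the same example2.csv as in the question):
import pandas as pd

df = pd.read_csv('example2.csv', header=0, dtype='unicode')

# sort by fruit name, then rebuild the index so positions 0, 1, 2, ... follow the new order
df.sort_values(['fruit'], ascending=True, inplace=True)
df.reset_index(drop=True, inplace=True)

x = 0
old_fruit = df.fruit[x]      # first fruit in the sorted order
new_fruit = df.fruit[x + 1]  # second fruit in the sorted order
print(old_fruit)
print(new_fruit)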

Related

Rename the name of the columns of a Pandas Dataframe by the values of another Pandas Dataframe

I would like to rename the columns of my pandas DataFrame according to the mapping stored in another DataFrame.
For example, I have df1 (built from data1 below), and I would like to rename its columns according to the mapping held in df2 (data2 below).
My goal is to automatically obtain a DataFrame with the columns renamed accordingly.
PS: I would like to rename only the part of the column name after the word "field", keeping the format "field.CHANGE_NAME".
### data for the first dataframe
data1 = {
"field.customfield_1032": [48,30,28,38],
"field.customfield_1031": ["RSA", "RSB", "RSC", "RSD"],
"field.customfield_1030": ["Thornton", "MacGyver", "Dalton", "Murdoc "]
}
df1 = pd.DataFrame(data1)
print(df1)
### data for the second dataframe; this is the dictionary with the mapping
data2 = {"field.ID": ['field.customfield_1030', 'field.customfield_1031', 'field.customfield_1032'],
         "field.Name": ['Surname', 'ID', 'Age']}
df2 = pd.DataFrame(data2)
print(df2)
How could I achieve this?
Thank you in advance.
Let's try
out = df1.rename(columns=dict(df2.values)).add_prefix('field.')
print(out)
field.Age field.ID field.Surname
0 48 RSA Thornton
1 30 RSB MacGyver
2 28 RSC Dalton
3 38 RSD Murdoc
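The one-liner works because dict(df2.values) builds the old-name to new-name mapping straight from the two columns of df2; a quick sketch of that intermediate step, assuming df2 as defined above:
mapping = dict(df2.values)  # row-wise pairs from df2
print(mapping)
# {'field.customfield_1030': 'Surname', 'field.customfield_1031': 'ID', 'field.customfield_1032': 'Age'}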
Map the columns to a dict built from the combination of df2['field.ID'] and df2['field.Name'].
Here is the solution:
df1.columns = df1.columns.map({x: f"field.{y}" for x, y in zip(df2['field.ID'], df2['field.Name'])})
print(df1)
field.Age field.ID field.Surname
0 48 RSA Thornton
1 30 RSB MacGyver
2 28 RSC Dalton
3 38 RSD Murdoc

Extract numbers from string column from Pandas DF

I have the following DataFrame with a string column ("Info"):
df = pd.DataFrame({'Date': ["2014/02/02", "2014/02/03"],
                   'Info': ["Out of 78 shares traded during the session today, there were 54 increases, 9 without change and 15 decreases.",
                            "Out of 76 shares traded during the session today, there were 60 increases, 4 without change and 12 decreases."]})
I need to extract the numbers from "Info" into 4 new columns in the same df.
The first row will have the values [78, 54, 9, 15]
I have been trying with
df[["new1", "new2", "new3", "new4"]] = df.Info.str.extract(r'(\d+(?:\.\d+)?)', expand=True).astype(int)
but I think I am making it more complicated than it needs to be.
Regards,
Just so I understand, you're trying to avoid capturing decimal parts of numbers, right? (The (?:\.\d+)? part.)
First off, you need to use pd.Series.str.extractall if you want all the matches; extract stops after the first.
Using your df, try this code:
# Get a multiindexed dataframe using extractall
expanded = df.Info.str.extractall(r"(\d+(?:\.\d+)?)")
# Pivot the index labels
df_2 = expanded.unstack()
# Drop the multiindex
df_2.columns = df_2.columns.droplevel()
# Add the columns to the original dataframe (inplace or make a new df)
df_combined = pd.concat([df, df_2], axis=1)
Extractall might be better for this task
df[["new1","new2","new3","new4"]] = df['Info'].str.extractall(r'(\d+)')[0].unstack()
Date Info new1 new2 new3 new4
0 2014/02/02 Out of 78 shares traded during the session tod... 78 54 9 15
1 2014/02/03 Out of 76 shares traded during the session tod... 76 60 4 12
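Note that extractall returns strings, so if integer dtypes are wanted, a small follow-up cast (a sketch, using the column names from the answer above) would be:
df[["new1", "new2", "new3", "new4"]] = df[["new1", "new2", "new3", "new4"]].astype(int)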

How to Loop over Numeric Column in Pandas Dataframe and filter Values?

df:
Org_Name Emp_Name Age Salary
0 Axempl Rick 29 1000
1 Lastik John 34 2000
2 Xenon sidd 47 9000
3 Foxtrix Ammy thirty 2000
4 Hensaui giny 33 ten
5 menuia rony fifty 7000
6 lopex nick 23 Ninety
I want to loop over the numeric columns (Age, Salary) and check whether each value is numeric; if a string value is present in a numeric column, filter out that record and create a new data frame without those errors.
Expected output:
Org_Name Emp_Name Age Salary
0 Axempl Rick 29 1000
1 Lastik John 34 2000
2 Xenon sidd 47 9000
You could extend this answer to filter on multiple columns for numerical data types:
import pandas as pd
from io import StringIO
data = """
Org_Name,Emp_Name,Age,Salary
Axempl,Rick,29,1000
Lastik,John,34,2000
Xenon,sidd,47,9000
Foxtrix,Ammy,thirty,2000
Hensaui,giny,33,ten
menuia,rony,fifty,7000
lopex,nick,23,Ninety
"""
df = pd.read_csv(StringIO(data))
print('Original dataframe\n', df)
df = df[(df.Age.apply(lambda x: x.isnumeric())) &
        (df.Salary.apply(lambda x: x.isnumeric()))]
print('Filtered dataframe\n', df)
gives
Original dataframe
Org_Name Emp_Name Age Salary
0 Axempl Rick 29 1000
1 Lastik John 34 2000
2 Xenon sidd 47 9000
3 Foxtrix Ammy thirty 2000
4 Hensaui giny 33 ten
5 menuia rony fifty 7000
6 lopex nick 23 Ninety
Filtered dataframe
Org_Name Emp_Name Age Salary
0 Axempl Rick 29 1000
1 Lastik John 34 2000
2 Xenon sidd 47 9000
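Note that after this filter Age and Salary are still stored as strings (read_csv kept them as object dtype because of the mixed values); a short follow-up cast, if numeric dtypes are needed, could be:
df = df.astype({'Age': int, 'Salary': int})
print(df.dtypes)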
I believe this can be solved using Pandas' "to_numeric" function.
import pandas as pd
df['Column to Check'] = pd.to_numeric(df['Column to Check'], downcast='integer', errors='coerce')
df.dropna(axis=0, inplace=True)
Where 'Column to Check' is the column name that you are checking for values that cannot be cast as an integer (or any numeric type); in your question I believe you will want to apply this code to 'Age' and 'Salary'. "to_numeric" will convert any values in those columns to NaN if they cannot be cast as your selected type. The "dropna" method will remove all rows that have a NaN in any of your columns.
To loop over the columns like you ask, you could do the following:
for col in ['Age', 'Salary']:
    df[col] = pd.to_numeric(df[col], downcast='integer', errors='coerce')
df.dropna(axis=0, inplace=True)
EDIT:
In response to harry's comment: if there are preexisting NaNs in the data, something like the following should keep any valid row that had a preexisting NaN in one of the other columns.
for col in ['Age', 'Salary']:
    df[col] = pd.to_numeric(df[col], downcast='integer', errors='coerce')
    df = df[df[col].notnull()]
You can use a mask to indicate whether or not there is a string type among the Age and Salary columns:
mask_str = (df[['Age', 'Salary']]
            .applymap(lambda x: str(type(x)))
            .sum(axis=1)
            .str.contains("str"))
df[~mask_str]
This is assuming that the dataframe already contains the proper types. If not, you can convert them using the following:
def convert(val):
    try:
        return int(val)
    except ValueError:
        return val

df = (df.assign(Age=lambda f: f.Age.apply(convert),
                Salary=lambda f: f.Salary.apply(convert)))
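A short usage sketch putting the two pieces together on the sample data from the first answer above (convert first, so the mask only flags the rows that are genuinely non-numeric; assumes the StringIO import and data string defined there):
df = pd.read_csv(StringIO(data))

# convert cleanly numeric strings to int, leave the rest as str
df = df.assign(Age=lambda f: f.Age.apply(convert),
               Salary=lambda f: f.Salary.apply(convert))

# keep only the rows where neither column is still a string
mask_str = (df[['Age', 'Salary']]
            .applymap(lambda x: str(type(x)))
            .sum(axis=1)
            .str.contains("str"))
print(df[~mask_str])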

Manipulations with Lat-Lon and Time Series Pandas

I am trying to do some file merging with Latitude and Longitude.
Input File1.csv
Name,Lat,Lon,timeseries(n)
London,80.5234,121.0452,523
London,80.5234,121.0452,732
London,80.5234,121.0452,848
Paris,90.4414,130.0252,464
Paris,90.4414,130.0252,829
Paris,90.4414,130.0252,98
New York,110.5324,90.0023,572
New York,110.5324,90.0023,689
New York,110.5324,90.0023,794
File2.csv
Name,lat,lon,timeseries1
London,80.5234,121.0452,500
Paris,90.4414,130.0252,400
New York,110.5324,90.0023,700
Now Expected output is
File2.csv
Name,lat,lon,timeseries1,timeseries(n) #timeseries is 24 hrs format 17:45:00
London,80.5234,121.0452,500,2103 #Addition of all three values
Paris,90.4414,130.0252,400,1391
New York,110.5324,90.0023,700,2055
With Python, NumPy and dictionaries it would be straightforward (key = sum of values), but I want to use pandas.
Please suggest how to start, or point me to an example. I have not seen anything like dictionary types in pandas for latitude and longitude.
Perform a groupby aggregation on the first df, call sum and then merge this with the other df:
In [12]:
gp = df.groupby('Name')['timeseries(n)'].sum().reset_index()
df1.merge(gp, on='Name')
Out[14]:
Name Lat Lon timeseries1 timeseries(n)
0 London 80.5234 121.0452 500 2103
1 Paris 90.4414 130.0252 400 1391
2 New York 110.5324 90.0023 700 2055
the aggregation looks like this:
In [15]:
gp
Out[15]:
Name timeseries(n)
0 London 2103
1 New York 2055
2 Paris 1391
Your CSV files can be loaded using read_csv, so something like:
df = pd.read_csv('File1.csv')
df1 = pd.read_csv('File2.csv')
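Putting it together, a minimal end-to-end sketch (assuming the two CSV files shown in the question; the output filename is just a placeholder):
import pandas as pd

df = pd.read_csv('File1.csv')   # Name, Lat, Lon, timeseries(n)
df1 = pd.read_csv('File2.csv')  # Name, lat, lon, timeseries1

# sum the timeseries(n) values per city, then attach the totals to File2's rows
gp = df.groupby('Name')['timeseries(n)'].sum().reset_index()
out = df1.merge(gp, on='Name')
out.to_csv('File2_with_totals.csv', index=False)  # placeholder output filename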

Make all columns (dates) index of data frame

My data is organized like this:
The country code is the index of the data frame and the columns are the years of the data. First, is it possible to plot line graphs (using matplotlib.pyplot) over time for each country without transforming the data any further?
Second, if the above is not possible, how can I make the columns the index of the table so I can plot time series line graphs?
Trying df.t gives me this:
How can I make the dates the index now?
Transpose using df.T.
Plot as usual.
Sample:
import pandas as pd
df = pd.DataFrame({1990:[344,23,43], 1991:[234,64,23], 1992:[43,2,43]}, index = ['AFG', 'ALB', 'DZA'])
df = df.T
df
AFG ALB DZA
1990 344 23 43
1991 234 64 23
1992 43 2 43
# transform index to dates
import datetime as dt
df.index = [dt.date(year, 1, 1) for year in df.index]
import matplotlib.pyplot as plt
df.plot()
plt.savefig('test.png')
