Read in csv file using pandas having issues with close columns - python

I have the following csv file:
RUN YR AP15 PMTE
12008 4.53 0.04
12009 3.17 0.26
12010 6.20 1.38
12011 5.38 3.55
12012 7.32 6.13
12013 4.39 9.40
Here, the column 'YR' has the values 2008, 2009, ..., 2013. However, there is no space between the values for YR and the values for RUN. Because of this, when I try to read in the dataframe, it does not read the YR column correctly.
pandas.read_csv('file.csv', skipinitialspace=True, usecols=['YR','PMTE'], sep=' ')
The line above reads in the AP15 column instead of YR. How do I fix this?

It seems like your 'csv' is really a fixed-width format file. Sometimes these are accompanied by another file listing the size of each column, but maybe you aren't that lucky and have to count the column widths manually. You can read this file with pandas' fixed-width reader:
df = pd.read_fwf('fixed_width.txt', widths=[4, 4, 8, 8])
In [7]: df
Out[7]:
RUN YR AP15 PMTE
0 1 2008 4.53 0.04
1 1 2009 3.17 0.26
2 1 2010 6.20 1.38
3 1 2011 5.38 3.55
4 1 2012 7.32 6.13
5 1 2013 4.39 9.40
In [8]: df.columns
Out[8]: Index(['RUN', 'YR', 'AP15', 'PMTE'], dtype='object')
There is an option to find the widths automatically, but it probably requires at least a space between each column, as it doesn't seem to work here.
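For reference, a sketch of that automatic detection; colspecs='infer' is the default and guesses boundaries from whitespace in the first infer_nrows rows, so with RUN and YR run together it will most likely fuse them into a single column:
import pandas as pd

# colspecs='infer' (the default) guesses column boundaries from whitespace
# in the first infer_nrows rows; with no space between RUN and YR they
# will likely be read back as one fused column here
df = pd.read_fwf('fixed_width.txt', colspecs='infer', infer_nrows=100)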

One workaround would be to first merge the RUN and YR columns into one in your csv. Example -
RUNYR AP15 PMTE
12008 4.53 0.04
12009 3.17 0.26
12010 6.20 1.38
12011 5.38 3.55
12012 7.32 6.13
12013 4.39 9.40
Then read the csv into a dataframe with RUNYR as a string column, and slice RUNYR into two separate columns using the pandas.Series.str.slice method. Example -
df = pd.read_csv('file.csv', skipinitialspace=True, header=0, sep=' ', dtype={'RUNYR': str})
df['RUN'] = df['RUNYR'].str.slice(None, 1).astype(int)  # first character
df['YR'] = df['RUNYR'].str.slice(1).astype(int)         # remaining characters
df = df.drop('RUNYR', axis=1)
Demo -
In [21]: df = pd.read_csv('a.csv', skipinitialspace=True, header=0, sep=' ',dtype={'RUNYR':str})
In [22]: df['RUN'] = df['RUNYR'].str.slice(None,1).astype(int)
In [23]: df['YR'] = df['RUNYR'].str.slice(1).astype(int)
In [24]: df = df.drop('RUNYR',axis=1)
In [25]: df
Out[25]:
AP15 PMTE RUN YR
0 4.53 0.04 1 2008
1 3.17 0.26 1 2009
2 6.20 1.38 1 2010
3 5.38 3.55 1 2011
4 7.32 6.13 1 2012
5 4.39 9.40 1 2013
Then write this back to your csv using the .to_csv method to fix your csv permanently.
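For example, a minimal sketch of that write-back (the space separator and filename are assumptions):
df.to_csv('file.csv', sep=' ', index=False)  # overwrite with the repaired RUN/YR columns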

Sqlite3 Issue, Column title not accepted

   Type    Location  2019_perc  2020_perc  2021_perc  2022_perc
0  County  Crawford       1.55       1.85        1.1        1.1
1  County  Deck            0.8       1.76          3        2.5
2  City    Peoria         1.62       1.64       0.94        2.2
I have some data that's in a DataFrame with the above format. I'm accessing it using sqlite3 and using matplotlib to graph the data. I am trying to compare employee raises with the yearly CPI (one section of the bar chart with 2019 percentages for each location and the CPI that year, another for 2020, 2021, and 2022). To do so I'd like to create bins by year, so the table would look more like this:
   Year  Crawford  Deck  Peoria
0  2019      1.55   0.8    1.62
1  2020      1.85  1.76    1.64
2  2021       1.1     3    0.94
3  2022       1.1   2.5     2.2
Is there any easy way to do this using pandas queries/sqlite3?
Assuming df is your dataframe, here is one way to do it:
out = (
    df
    .drop("Type", axis=1)
    .set_index("Location")
    .pipe(lambda df_: df_.set_axis(df_.columns.str[:4], axis=1))
    .transpose()
    .reset_index(names="Year")
    .rename_axis(None, axis=1)
)
Output:
print(out)
Year Crawford Deck Peoria
0 2019 1.55 0.80 1.62
1 2020 1.85 1.76 1.64
2 2021 1.10 3.00 0.94
3 2022 1.10 2.50 2.20
Plot (with pandas.DataFrame.plot.bar):
out.set_index("Year").plot.bar();
Consider melt + pivot:
Data
from io import StringIO
import pandas as pd
txt = '''\
Type Location 2019_perc 2020_perc 2021_perc 2022_perc
0 County Crawford 1.55 1.85 1.1 1.1
1 County Deck 0.8 1.76 3 2.5
2 City Peoria 1.62 1.64 0.94
'''
with StringIO(txt) as f:
    cpi_raw_df = pd.read_csv(f, sep=r"\s+")
Reshape
cpi_df = (
    cpi_raw_df.melt(
        id_vars=["Type", "Location"],
        var_name="Year",
        value_name="perc"
    ).assign(
        Year=lambda df: df["Year"].str.replace("_perc", "")
    ).pivot(
        index="Year",
        columns="Location",
        values="perc"
    )
)
print(cpi_df)
# Location Crawford Deck Peoria
# Year
# 2019 1.55 0.80 1.62
# 2020 1.85 1.76 1.64
# 2021 1.10 3.00 0.94
# 2022 1.10 2.50 NaN
Plot
import matplotlib.pyplot as plt
import seaborn as sns
...
sns.set()
cpi_df.plot(kind="bar", rot=0)
plt.show()
plt.clf()
plt.close()

How to loop over unique dates in a pandas dataframe producing new dataframes in each iteration?

I have a dataframe like below and need to create (1) a new dataframe for each unique date and (2) create a new global variable with the date of the new dataframe as the value. This needs to be in a loop.
Using the dataframe below, I need to iterate through 3 new dataframes, one for each date value (202107, 202108, and 202109). This loop occurs within an existing function that then uses the new dataframe and its respective global variable of each iteration in further calculations. For example, the first iteration would yield a new dataframe consisting of the first two rows of the below dataframe and a value for the new global variable of "202107." What is the most straightforward way of doing this?
Date    Col1  Col2
202107  1.23  6.72
202107  1.56  2.54
202108  1.78  7.54
202108  1.53  7.43
202108  1.58  2.54
202109  1.09  2.43
202109  1.07  5.32
Loop over the results of .groupby:
for _, new_df in df.groupby("Date"):
    print(new_df)
    print("-" * 80)
Prints:
Date Col1 Col2
0 202107 1.23 6.72
1 202107 1.56 2.54
--------------------------------------------------------------------------------
Date Col1 Col2
2 202108 1.78 7.54
3 202108 1.53 7.43
4 202108 1.58 2.54
--------------------------------------------------------------------------------
Date Col1 Col2
5 202109 1.09 2.43
6 202109 1.07 5.32
--------------------------------------------------------------------------------
Then you can store each new_df in a list or a dictionary and use it afterwards.
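For instance, a minimal sketch of the dictionary variant (the names are illustrative):
# collect each per-date sub-dataframe into a dict keyed by its date value
dfs_by_date = {date: group for date, group in df.groupby("Date")}
print(dfs_by_date[202107])  # the two rows for 202107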
You can extract the unique date values with the .unique() method, and then store your new dataframes and dates in a dict to access them easily, like:
unique_dates = init_df.Date.unique()
df_by_date = {
    str(date): init_df[init_df['Date'] == date] for date in unique_dates
}
You can then use the dict like:
for date in unique_dates:
    print(date, ': \n', df_by_date[str(date)])
Output:
202107 :
Date Col1 Col2
0 202107 1.23 6.72
1 202107 1.56 2.54
202108 :
Date Col1 Col2
2 202108 1.78 7.54
3 202108 1.53 7.43
4 202108 1.58 2.54
202109 :
Date Col1 Col2
5 202109 1.09 2.43
6 202109 1.07 5.32

Derivative of a dataset using python or pandas

I have a tab-delimited csv file with a dataset of 2 columns (time and value) of data of type float. I have 100s of these files from lab equipment. An example set is shown below.
3.64 1.22e-11
4.14 2.44e-11
4.64 1.22e-11
5.13 2.44e-11
5.66 1.22e-11
6.17 1.22e-11
6.67 2.44e-11
7.17 2.44e-11
7.69 1.22e-11
8.20 2.44e-11
8.70 1.22e-11
9.20 2.44e-11
9.72 2.44e-11
10.22 1.22e-11
10.72 1.22e-11
11.22 1.22e-11
11.72 1.22e-11
12.22 1.22e-11
12.70 -1.95e-10
13.22 -1.57e-09
13.73 -3.04e-09
14.25 -4.39e-09
14.77 -5.73e-09
15.28 -7.02e-09
15.80 -8.26e-09
16.28 -8.61e-09
16.83 -8.70e-09
17.31 -8.76e-09
17.81 -8.80e-09
18.31 -8.83e-09
18.83 -8.91e-09
19.33 -8.98e-09
19.84 -9.02e-09
20.34 -9.05e-09
20.84 -9.06e-09
21.34 -9.07e-09
21.88 -9.08e-09
22.39 -9.08e-09
22.89 -9.09e-09
23.39 -9.09e-09
23.89 -9.09e-09
24.41 -9.09e-09
I want to trim the data so that time (x, 1st column) resets to 0 when the value (y, 2nd column) starts to change, and also trim the rows after the value plateaus.
For the 1st derivative, if I use numpy.gradient I can see where the data changes, but I couldn't find a similar function for pandas.
Any suggestions?
Added: The output (done manually in Excel) will look like below, where (in this case) the first 18 rows and the last 3 are removed, and both columns are rebased by subtracting the first remaining row's values from every row.
0.00 0.000000000000
0.52 -0.000000001375
1.03 -0.000000002845
1.55 -0.000000004195
2.07 -0.000000005535
2.58 -0.000000006825
3.10 -0.000000008065
3.58 -0.000000008415
4.13 -0.000000008505
4.61 -0.000000008565
5.11 -0.000000008605
5.61 -0.000000008635
6.13 -0.000000008715
6.63 -0.000000008785
7.14 -0.000000008825
7.64 -0.000000008855
8.14 -0.000000008865
8.64 -0.000000008875
9.18 -0.000000008885
9.69 -0.000000008885
10.19 -0.000000008895
What I have tried is using python and pandas to differentiate and then remove rows where the derivative is 0, but that removes data points within the output I want, too.
dfT = df1[df1.dB != 0]
dfT = dfT[df1.dB >= 0]
dfT = dfT.dropna()
dfT = dfT.reset_index(drop=True)
dfT
Why not use what is already working, i.e. np.gradient, and put the result back into your dataframe? I am not able to recreate your final desired output, however, since it looks like you rely on more than just filtering out gradient = 0. Open to fixing it once I get a bit clearer on the logic.
### imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
### data
time = [3.64,4.14,4.64,5.13,5.66,6.17,6.67,7.17,7.69,8.2,8.7,9.2,9.72,10.22,10.72,11.22,11.72,12.22,12.7,13.22,13.73,14.25,14.77,15.28,15.8,16.28,16.83,17.31,17.81,18.31,18.83,19.33,19.84,20.34,20.84,21.34,21.88,22.39,22.89,23.39,23.89,24.41]
value = [0.0000000000122,0.0000000000244,0.0000000000122,0.0000000000244,0.0000000000122,0.0000000000122,0.0000000000244,0.0000000000244,0.0000000000122,0.0000000000244,0.0000000000122,0.0000000000244,0.0000000000244,0.0000000000122,0.0000000000122,0.0000000000122,0.0000000000122,0.0000000000122,-0.000000000195,-0.00000000157,-0.00000000304,-0.00000000439,-0.00000000573,-0.00000000702,-0.00000000826,-0.00000000861,-0.0000000087,-0.00000000876,-0.0000000088,-0.00000000883,-0.00000000891,-0.00000000898,-0.00000000902,-0.00000000905,-0.00000000906,-0.00000000907,-0.00000000908,-0.00000000908,-0.00000000909,-0.00000000909,-0.00000000909,-0.00000000909]
### dataframe creation
# df = pd.read_csv('test.csv', sep='\t', names=["time", "value"])
df = pd.DataFrame({'time':time, 'value':value})
plt.plot(df.time,df.value)
Outputs: [plot of value vs. time]
Next you can differentiate, and as you can see, within the first 18 rows you mentioned there are multiple points where the gradient is not 0:
df['gradient'] = np.gradient(df.value.values)
df
plt.plot(df.time,df.gradient)
Outputs: [plot of gradient vs. time]
Next, filter out the rows where nothing changes and add a new time column:
### filter data where gradient is not 0 and add new time
df_filtered = df[df.gradient != 0].copy()  # .copy() avoids SettingWithCopyWarning
df_filtered['time_difference'] = df_filtered.time.diff().fillna(0)
df_filtered['new_time'] = df_filtered['time_difference'].cumsum()
df_filtered.reset_index(drop=True, inplace=True)
df_filtered
Outputs: [the filtered dataframe with time_difference and new_time columns]
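As an aside on the "similar function for pandas" part of the question: Series.diff is the closest pandas analogue of a discrete derivative, and a minimal sketch of the whole trim could look like this (the filename, tab separator, and noise threshold are assumptions):
import pandas as pd

df = pd.read_csv('test.csv', sep='\t', names=['time', 'value'])  # assumed tab-delimited file
dvalue = df['value'].diff()              # pandas analogue of a first derivative
first = (dvalue.abs() > 1e-10).idxmax()  # first change above the assumed flat-region noise
last = (dvalue != 0)[::-1].idxmax()      # last row where the value still changes at all
trimmed = df.loc[first:last].copy()      # drop the flat head and the plateau tail
trimmed['time'] -= trimmed['time'].iloc[0]    # reset time to start at 0
trimmed['value'] -= trimmed['value'].iloc[0]  # rebase value to the first kept row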

Pandas pivots index as columns using groupby-apply

I have come across some strange behavior in Pandas groupby-apply that I am trying to figure out.
Take the following example dataframe:
import pandas as pd
import numpy as np
index = range(1, 11)
groups = ["A", "B"]
idx = pd.MultiIndex.from_product([index, groups], names = ["index", "group"])
np.random.seed(12)
df = pd.DataFrame({"val": np.random.normal(size=len(idx))}, index=idx).reset_index()
print(df.tail().round(2))
index group val
15 8 B -0.12
16 9 A 1.01
17 9 B -0.91
18 10 A -1.03
19 10 B 1.21
And using this framework (which allows me to execute any arbitrary function within a groupby-apply):
def add_two(x):
    return x + 2

def pd_groupby_apply(df, out_name, in_name, group_name, index_name, function):
    def apply_func(df):
        if index_name is not None:
            df = df.set_index(index_name).sort_index()
        df[out_name] = function(df[in_name].values)
        return df[out_name]
    return df.groupby(group_name).apply(apply_func)
Whenever I call pd_groupby_apply with the following inputs, I get a pivoted DataFrame:
df_out1 = pd_groupby_apply(df=df,
                           out_name="test",
                           in_name="val",
                           group_name="group",
                           index_name="index",
                           function=add_two)
print(df_out1.head().round(2))
index 1 2 3 4 5 6 7 8 9 10
group
A 2.47 2.24 2.75 2.01 1.19 1.40 3.10 3.34 3.01 0.97
B 1.32 0.30 0.47 1.88 4.87 2.47 0.78 1.88 1.09 3.21
However, as soon as my dataframe does not contain full group-index pairs and I call my pd_groupby_apply function again, I do receive my dataframe back in the way that I want (i.e. not pivoted):
df_notfull = df.iloc[:-1]
df_out2 = pd_groupby_apply(df=df_notfull,
                           out_name="test",
                           in_name="val",
                           group_name="group",
                           index_name="index",
                           function=add_two)
print(df_out2.head().round(2))
group  index
A      1        2.47
       2        2.24
       3        2.75
       4        2.01
       5        1.19
Why is this? And more importantly, how can I prevent Pandas from pivoting my dataframe when I have full index-group pairs in my dataframe?
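A hedged note on the likely cause: when every group's apply_func returns a Series and all of those Series share an identical index, pandas unstacks them into a wide frame, whereas a missing group-index pair forces plain long-format concatenation. Assuming that diagnosis, one sketch of a workaround is to return a one-column DataFrame instead of a Series:
# sketch: returning a DataFrame from apply_func makes pandas concatenate the
# groups vertically (MultiIndex of group and index) instead of unstacking them
def pd_groupby_apply_long(df, out_name, in_name, group_name, index_name, function):
    def apply_func(df):
        if index_name is not None:
            df = df.set_index(index_name).sort_index()
        df[out_name] = function(df[in_name].values)
        return df[[out_name]]  # a DataFrame, not a Series, so pandas never unstacks
    return df.groupby(group_name).apply(apply_func)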

Convert dataframe to numpy matrix where indexes stored in dataframe

I have a dataframe that looks like so
                      time   usd  hour  day
0      2015-08-30 07:56:28  1.17     7    0
1      2015-08-30 08:56:28  1.27     8    0
2      2015-08-30 09:56:28  1.28     9    0
3      2015-08-30 10:56:28  1.29    10    0
4      2015-08-30 11:56:28  1.29    11    0
...
14591  2017-04-30 23:53:46  9.28    23  609
Given this, how would I go about building a numpy 2d matrix with hour on one axis, day on the other axis, and usd as the value stored in the matrix?
Consider the dataframe df
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    time=pd.date_range('2015-08-30', periods=14000, freq='H'),
    usd=(np.random.randn(14000) / 100 + 1.0005).cumprod()
))
Then we can set the index to the date and hour of the df.time column and unstack. We take the values of this result in order to access the numpy array.
a = df.set_index([df.time.dt.date, df.time.dt.hour]).usd.unstack().values
I would do a pivot_table and leave the data as a pandas DataFrame, but the conversion to a numpy array is trivial if you don't want labels.
import pandas as pd
data = <data>
data.pivot_table(values = 'usd', index = 'hour', columns = 'day').values
Edit: Thank you @pyRSquared for the "Value"able tip. (Changed np.array(data) to df...values.)
You can use the pivot functionality of pandas, as described here. You will get NaN values for usd when there is no value for the day or hour.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'usd': [1.17, 1.27, 1.28, 1.29, 1.29, 9.28], 'hour': [7, 8, 9, 10, 11, 23], 'day': [0, 0, 0, 0, 0, 609]})
In [3]: df
Out[3]:
day hour usd
0 0 7 1.17
1 0 8 1.27
2 0 9 1.28
3 0 10 1.29
4 0 11 1.29
5 609 23 9.28
In [4]: df.pivot(index='hour', columns='day', values='usd')
Out[4]:
day 0 609
hour
7 1.17 NaN
8 1.27 NaN
9 1.28 NaN
10 1.29 NaN
11 1.29 NaN
23 NaN 9.28
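If you then want the plain numpy matrix rather than the labeled frame, a one-line sketch:
mat = df.pivot(index='hour', columns='day', values='usd').to_numpy()  # NaN where no value exists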
