How to read data that has been split into multiple columns? - python

I have the following dataframe:
q
1 0.83 97 0.7 193 0.238782 289 0.129692 385 0.090692
2 0.75 98 0.7 194 0.238782 290 0.129692 386 0.090692
...
96 0.94693 192 0.299753 288 0.145046 384 0.0965338 480 0.0823061
This data comes from somewhere else and has been split into chunks. The values all belong to a single variable 'q', together with its indices. To clarify: even though there are many columns, they are all pieces of one 'q' column plus an index column (notice that the starting index of each chunk continues where the previous chunk ended).
How can I read the data with pandas? I believe I can do it by assigning names to each column and then merging them all together, but I was looking for a more elegant solution. Plus, the number of columns is not fixed.
This is the code that I am using at the moment:
q_param = pd.read_csv('Initial_solutions/initial_q_20y.dat', delim_whitespace=True)
Which does not do the trick. I would prefer to use pandas to solve this issue, but I can also work without it.
EDIT:
At the request of @user17242583, the following command:
print(q_param.head().to_dict())
Gives this output:
{'q': {(1, 0.83, 97, 0.7, 193, 0.238782, 289, 0.129692, 385): 0.090692, (2, 0.75, 98, 0.7, 194, 0.238782, 290, 0.129692, 386): 0.090692, (3, 0.64, 99, 0.64, 195, 0.238782, 291, 0.129692, 387): 0.090692, (4, 0.7, 100, 0.7, 196, 0.238782, 292, 0.129692, 388): 0.0884839, (5, 0.64, 101, 0.64, 197, 0.238782, 293, 0.129692, 389): 0.090692}}

It seems most of your data is in the index. Try:
df = pd.DataFrame({k:v for lst in [list(k)+[v] for k,v in q_param['q'].items()] for k,v in zip(lst[::2],lst[1::2])}, index=['q']).T.sort_index()
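For readability, the same reshaping can be written step by step; a sketch, assuming q_param is the oddly-parsed frame whose to_dict() output is shown above:
import pandas as pd

# Each row of q_param['q'] has an (index, value, index, value, ...) tuple as its key
# and the final value as its item; flatten each row and pair indices with values.
pairs = {}
for idx_tuple, last_value in q_param['q'].items():
    flat = list(idx_tuple) + [last_value]
    pairs.update(zip(flat[::2], flat[1::2]))   # even positions = indices, odd positions = values

q_series = pd.Series(pairs, name='q').sort_index()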

Try this:
# `q` here is the DataFrame read from the file (q_param in the question)
data = {
    0: pd.concat(q[c] for c in q.columns[0::2]).reset_index(drop=True),
    1: pd.concat(q[c] for c in q.columns[1::2]).reset_index(drop=True),
}
df = pd.DataFrame(data)
Output:
>>> df
0 1
0 1 0.830000
1 2 0.750000
2 3 0.640000
3 4 0.700000
4 5 0.640000
5 97 0.700000
6 98 0.700000
7 99 0.640000
8 100 0.700000
9 101 0.640000
10 193 0.238782
11 194 0.238782
12 195 0.238782
13 196 0.238782
14 197 0.238782
15 289 0.129692
16 290 0.129692
17 291 0.129692
18 292 0.129692
19 293 0.129692
20 385 0.090692
21 386 0.090692
22 387 0.090692
23 388 0.088484
24 389 0.090692
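If you control the read step, a more direct route is to load the file with no header and stack the index/value column pairs; a rough sketch, assuming the .dat file has exactly the whitespace-separated layout shown at the top of the question (with the lone q header line to skip):
import pandas as pd

# Read the raw layout: alternating index / value columns, no real header row.
raw = pd.read_csv('Initial_solutions/initial_q_20y.dat',
                  sep=r'\s+', header=None, skiprows=1)

# Even-positioned columns hold indices, odd-positioned columns hold values.
idx = pd.concat([raw[c] for c in raw.columns[0::2]], ignore_index=True)
val = pd.concat([raw[c] for c in raw.columns[1::2]], ignore_index=True)

q_series = pd.Series(val.values, index=idx.values, name='q').sort_index()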

Related

Filtering dataframes based on one column with a different type of other column

I have the following problem
import pandas as pd
data = {
    "ID": [420, 380, 390, 540, 520, 50, 22],
    "duration": [50, 40, 45, 33, 19, 1, 3],
    "next": ["390;50", "880;222", "520;50", "380;111", "810;111", "22;888", "11"]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
As you can see I have
ID duration next
0 420 50 390;50
1 380 40 880;222
2 390 45 520;50
3 540 33 380;111
4 520 19 810;111
5 50 1 22;888
6 22 3 11
Things to notice:
ID type is int
next type is a string; when there is more than one number, the numbers are separated by ;
I would like to keep only the rows whose next values do not appear in ID
For example in this case
420 has a follow up in both 390 and 50
380 has as next 880 and 222, both of which are not in ID, so this one is kept
540 has as next 380 and 111; while 111 is not in ID, 380 is, so this one is not kept
same with 50
In the end I want to get
1 380 40 880;222
4 520 19 810;111
6 22 3 11
When next held only one value I used print(df[~df.next.astype(int).isin(df.ID)]), but in this case isin cannot be applied directly.
How can I do this?
Let us try split, then explode, then an isin check:
s = df.next.str.split(';').explode().astype(int)
out = df[~s.isin(df['ID']).groupby(level=0).any()]
Out[420]:
ID duration next
1 380 40 880;222
4 520 19 810;111
6 22 3 11
Use a regex with word boundaries for efficiency:
pattern = '|'.join(df['ID'].astype(str))
out = df[~df['next'].str.contains(fr'\b(?:{pattern})\b')]
Output:
ID duration next
1 380 40 880;222
4 520 19 810;111
6 22 3 11
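For comparison, the same check can be done without a regex by splitting next and testing membership against a set of the IDs; a quick sketch, not from the original answers:
ids = set(df['ID'].astype(str))

# Keep rows where none of the ';'-separated next values appear in ID.
mask = df['next'].str.split(';').apply(lambda parts: not any(p in ids for p in parts))
out = df[mask]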

How can I specify a different decimal format on each column when using Pandas DataFrame to CSV?

I am parsing specific columns from a text file with data that looks like this:
n Elapsed time TimeUTC HeightMSL GpsHeightMSL P Temp RH Dewp Dir Speed Ecomp Ncomp Lat Lon
s hh:mm:ss m m hPa °C % °C ° m/s m/s m/s ° °
1 0 23:15:43 198 198 978.5 33.70 47 20.87 168.0 7.7 -1.6 7.6 32.835222 -97.297940
2 1 23:15:44 202 201 978.1 33.03 48 20.62 162.8 7.3 -2.2 7.0 32.835428 -97.298000
3 2 23:15:45 206 206 977.6 32.89 48 20.58 160.8 7.5 -2.4 7.0 32.835560 -97.298077
4 3 23:15:46 211 211 977.1 32.81 49 20.58 160.3 7.8 -2.6 7.4 32.835660 -97.298160
5 4 23:15:47 217 217 976.5 32.74 49 20.51 160.5 8.3 -2.7 7.8 32.835751 -97.298242
6 5 23:15:48 223 223 975.8 32.66 48 20.43 160.9 8.7 -2.8 8.2 32.835850 -97.298317
I perform one calculation on the first m/s column (converting m/s to kt) and write all data where hpa > 99.9 to an output file. That output looks like this:
978.5,198,33.7,20.87,168.0,14.967568
978.1,201,33.03,20.62,162.8,14.190032
977.6,206,32.89,20.58,160.8,14.5788
977.1,211,32.81,20.58,160.3,15.161952
976.5,217,32.74,20.51,160.5,16.133872
975.8,223,32.66,20.43,160.9,16.911407999999998
The code executes fine and the output file works for what I'm using it for, but is there a way to format the column output to a specific decimal place? As you can see in my code, I've tried df.round but it doesn't impact the output. I've also looked at float_format parameter, but that seems like it would apply the format to all columns. My intended output should look like this:
978.5, 198, 33.7, 20.9, 168, 15
978.1, 201, 33.0, 20.6, 163, 14
977.6, 206, 32.9, 20.6, 161, 15
977.1, 211, 32.8, 20.6, 160, 15
976.5, 217, 32.7, 20.5, 161, 16
975.8, 223, 32.7, 20.4, 161, 17
My code is below:
import pandas as pd
headers = ['n', 's', 'time', 'm1', 'm2', 'hpa', 't', 'rh', 'td', 'dir', 'spd', 'u', 'v', 'lat', 'lon']
df = pd.read_csv ('edt_20220520_2315.txt', encoding_errors = 'ignore', skiprows = 2, sep = '\s+', names = headers)
df['spdkt'] = df['spd'] * 1.94384
df['hpa'].round(decimals = 1)
df['spdkt'].round(decimals = 0)
df['t'].round(decimals = 1)
df['td'].round(decimals = 1)
df['dir'].round(decimals = 0)
extract = ['hpa', 'm2', 't', 'td', 'dir', 'spdkt']
with open('test_output.txt' , 'w') as fh:
    df_to_write = df[df['hpa'] > 99.9]
    df_to_write.to_csv(fh, header = None, index = None, columns = extract, sep = ',')
You can pass a dictionary to round and then, for the columns rounded to 0 decimals, cast to integers (note that df['hpa'].round(...) returns a new Series rather than modifying the column in place, which is why your round calls had no effect):
d = {'hpa':1, 'spdkt':0, 't':1, 'td':1, 'dir':0}
df = df.round(d).astype({k:'int' for k, v in d.items() if v == 0})
print (df)
n s time m1 m2 hpa t rh td dir spd u v \
0 1 0 23:15:43 198 198 978.5 33.7 47 20.9 168 7.7 -1.6 7.6
1 2 1 23:15:44 202 201 978.1 33.0 48 20.6 163 7.3 -2.2 7.0
2 3 2 23:15:45 206 206 977.6 32.9 48 20.6 161 7.5 -2.4 7.0
3 4 3 23:15:46 211 211 977.1 32.8 49 20.6 160 7.8 -2.6 7.4
4 5 4 23:15:47 217 217 976.5 32.7 49 20.5 160 8.3 -2.7 7.8
5 6 5 23:15:48 223 223 975.8 32.7 48 20.4 161 8.7 -2.8 8.2
lat lon spdkt
0 32.835222 -97.297940 15
1 32.835428 -97.298000 14
2 32.835560 -97.298077 15
3 32.835660 -97.298160 15
4 32.835751 -97.298242 16
5 32.835850 -97.298317 17
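If you prefer to leave the underlying data untouched and only control how it is written, another option is to format each column as a string just before to_csv; a sketch, assuming the same column names and extract list as in the question:
# Per-column output formats: 1 decimal place, or none for the integer-like columns.
fmt = {'hpa': '{:.1f}', 't': '{:.1f}', 'td': '{:.1f}',
       'dir': '{:.0f}', 'spdkt': '{:.0f}'}

out = df.loc[df['hpa'] > 99.9, extract].copy()
for col, spec in fmt.items():
    out[col] = out[col].map(spec.format)   # format numbers column by column

out.to_csv('test_output.txt', header=False, index=False)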

Dynamically differencing columns in a pandas dataframe using similar column names

The following is the first couple of columns of a data frame, and I want to calculate V1_x - V1_y, V2_x - V2_y, V3_x - V3_y, etc. The paired variable names differ only by the last character (either x or y).
import pandas as pd
data = {'Name': ['Tom', 'Joseph', 'Krish', 'John'], 'Address': ['xx', 'yy', 'zz','ww'], 'V1_x': [20, 21, 19, 18], 'V2_x': [233, 142, 643, 254], 'V3_x': [343, 543, 254, 543], 'V1_y': [20, 21, 19, 18], 'V2_y': [233, 142, 643, 254], 'V3_y': [343, 543, 254, 543]}
df = pd.DataFrame(data)
df
Name Address V1_x V2_x V3_x V1_y V2_y V3_y
0 Tom xx 20 233 343 20 233 343
1 Joseph yy 21 142 543 21 142 543
2 Krish zz 19 643 254 19 643 254
3 John ww 18 254 543 18 254 543
I currently do the calculation by manually defining the column names:
new_df = pd.DataFrame()
new_df['Name'] = df['Name']
new_df['Address'] = df['Address']
new_df['Col1'] = df['V1_x']-df['V1_y']
new_df['Col2'] = df['V2_x']-df['V2_y']
new_df['Col3'] = df['V3_x']-df['V3_y']
Is there an approach I can use to detect column names that differ only in the last character and difference them automatically?
Try creating a MultiIndex header using .str.split, then reshape the dataframe, use pd.DataFrame.eval for the calculation, and reshape back to the original form with the additional columns. Lastly, flatten the MultiIndex header with a list comprehension using f-string formatting:
dfi = df.set_index(['Name', 'Address'])
dfi.columns = dfi.columns.str.split('_', expand=True)
dfs = dfi.stack(0).eval('diff=x-y').unstack()
dfs.columns = [f'{j}_{i}' for i, j in dfs.columns]
dfs
Output:
V1_x V2_x V3_x V1_y V2_y V3_y V1_diff V2_diff V3_diff
Name Address
John ww 18 254 543 18 254 543 0 0 0
Joseph yy 21 142 543 21 142 543 0 0 0
Krish zz 19 643 254 19 643 254 0 0 0
Tom xx 20 233 343 20 233 343 0 0 0
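A flatter alternative, not from the original answer, if only the differences are needed: select the _x and _y blocks, strip the suffixes so the columns align, and subtract the two frames:
import pandas as pd

# Split into the _x and _y blocks and align them on the common stem (V1, V2, ...).
x = df.filter(regex='_x$').rename(columns=lambda c: c[:-2])
y = df.filter(regex='_y$').rename(columns=lambda c: c[:-2])

diff = (x - y).add_suffix('_diff')                          # V1_diff, V2_diff, V3_diff
new_df = pd.concat([df[['Name', 'Address']], diff], axis=1)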

How can I group multiple columns and sum the last one?

I have this problem which I've been trying to solve:
I want the code to take this DataFrame and group multiple columns based on the most frequent number and sum the values on the last column. For example:
df = pd.DataFrame({'A': [1000, 1000, 1000, 1000, 1000, 200, 200, 500, 500],
                   'B': [380, 380, 270, 270, 270, 45, 45, 45, 55],
                   'C': [380, 380, 270, 270, 270, 88, 88, 88, 88],
                   'D': [45, 32, 67, 89, 51, 90, 90, 90, 90]})
df
A B C D
0 1000 380 380 45
1 1000 380 380 32
2 1000 270 270 67
3 1000 270 270 89
4 1000 270 270 51
5 200 45 88 90
6 200 45 88 90
7 500 45 88 90
8 500 55 88 90
I would like the code to show the result below:
A B C D
0 1000 380 380 284
1 1000 380 380 284
2 1000 270 270 284
3 1000 270 270 284
4 1000 270 270 284
5 200 45 88 360
6 200 45 88 360
7 500 45 88 360
8 500 55 88 360
Notice that the most frequent value in the first rows is 1000, so I group by column 'A' and get the sum 284 in column 'D'. However, in the last rows the most frequent number, 88, is not in column 'A' but in column 'C'. There I am trying to sum the values in column 'D' by grouping on column 'C' to get 360. I hope I made myself clear.
I tried to use the function df['D'] = df.groupby(['A', 'B', 'C'])['D'].transform('sum'), but it does not show the desired result aforementioned.
Is there any pandas-style way of resolving this? Thanks in advance!
Code
import numpy as np

def get_count_sum(col, func):
    return df.groupby(col).D.transform(func)

ga = get_count_sum('A', 'count')
gb = get_count_sum('B', 'count')
gc = get_count_sum('C', 'count')

conditions = [
    ((ga > gb) & (ga > gc)),
    ((gb > ga) & (gb > gc)),
    ((gc > ga) & (gc > gb)),
]
choices = [get_count_sum('A', 'sum'),
           get_count_sum('B', 'sum'),
           get_count_sum('C', 'sum')]

df['D'] = np.select(conditions, choices)
df
df
Output
A B C D
0 1000 380 380 284
1 1000 380 380 284
2 1000 270 270 284
3 1000 270 270 284
4 1000 270 270 284
5 200 45 88 360
6 200 45 88 360
7 500 45 88 360
8 500 55 88 360
Explanation
Since we need to group by whichever of columns 'A', 'B' or 'C' has the most frequently repeated value, we first compute the per-group counts and store the groupby output in ga, gb and gc for columns A, B and C respectively.
In conditions we check which column has the most frequent value.
choices holds the corresponding grouped sums, matching the conditions one for one.
np.select works like if-elif-else: for each row it returns the choice whose condition is true.

Extra Bin with Pandas Resample

I've got a pandas data frame defined like this:
import datetime
import pandas

last_4_weeks_range = pandas.date_range(
    start=datetime.datetime(2001, 5, 4), periods=28)
last_4_weeks = pandas.DataFrame(
    [{'REST_KEY': 1, 'DLY_TRN_QT': 80, 'DLY_SLS_AMT': 90,
      'COOP_DLY_TRN_QT': 30, 'COOP_DLY_SLS_AMT': 20}] * 28 +
    [{'REST_KEY': 2, 'DLY_TRN_QT': 70, 'DLY_SLS_AMT': 10,
      'COOP_DLY_TRN_QT': 50, 'COOP_DLY_SLS_AMT': 20}] * 28,
    index=last_4_weeks_range.append(last_4_weeks_range))
last_4_weeks.sort(inplace=True)  # DataFrame.sort() was the pandas 0.11-era API
and when I go to resample it:
In [265]: last_4_weeks.resample('7D', how='sum')
Out[265]:
COOP_DLY_SLS_AMT COOP_DLY_TRN_QT DLY_SLS_AMT DLY_TRN_QT REST_KEY
2001-05-04 280 560 700 1050 21
2001-05-11 280 560 700 1050 21
2001-05-18 280 560 700 1050 21
2001-05-25 280 560 700 1050 21
2001-06-01 0 0 0 0 0
I end up with an extra empty bin I wouldn't expect to see -- 2001-06-01. I wouldn't expect that bin to be there, as my 28 days are evenly divisible into the 7 day resample I'm performing. I've tried messing around with the closed kwarg, but I can't escape that extra bin. Why is that extra bin showing up when I've got nothing to put into it and how can I avoid generating it?
What I'm ultimately trying to do is get 7 day averages per REST_KEY, so doing a
In [266]: last_4_weeks.groupby('REST_KEY').resample('7D', how='sum').mean(level=0)
Out[266]:
COOP_DLY_SLS_AMT COOP_DLY_TRN_QT DLY_SLS_AMT DLY_TRN_QT REST_KEY
REST_KEY
1 112 168 504 448 5.6
2 112 280 56 392 11.2
but that extra empty bin is throwing off my mean (e.g. for COOP_DLY_SLS_AMT I've got 112, which is (20 * 7 * 4) / 5 rather than the 140 I'd get from (20 * 7 * 4) / 4 if I didn't have that extra bin.) I also wouldn't expect REST_KEY to show up in the aggregation since it's part of the groupby, but that's really a smaller problem.
P.S. I'm using pandas 0.11.0
I think it's a bug:
The output with pandas 0.9.0dev on mac is:
In [3]: pandas.__version__
Out[3]: '0.9.0.dev-1e68fd9'
In [6]: last_4_weeks.resample('7D', how='sum')
Out[6]:
COOP_DLY_SLS_AMT COOP_DLY_TRN_QT DLY_SLS_AMT DLY_TRN_QT REST_KEY
2001-05-04 40 80 100 150 3
2001-05-11 280 560 700 1050 21
2001-05-18 280 560 700 1050 21
2001-05-25 280 560 700 1050 21
2001-06-01 240 480 600 900 18
In [4]: last_4_weeks.groupby('REST_KEY').resample('7D', how='sum').mean(level=0)
Out[4]:
COOP_DLY_SLS_AMT COOP_DLY_TRN_QT DLY_SLS_AMT DLY_TRN_QT REST_KEY
REST_KEY
1 112 168 504 448 5.6
2 112 280 56 392 11.2
I'm using these versions (via pip freeze):
numpy==1.8.0.dev-9597b1f-20120920
pandas==0.9.0.dev-1e68fd9-20120920
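A version-independent workaround, not part of the original answer, is to build the 7-day buckets yourself instead of relying on resample, so an empty trailing bin can never appear; a sketch, assuming a reasonably recent pandas:
import numpy as np

# Label each row with the 7-day window it falls into, counted from the first date.
window = np.asarray((last_4_weeks.index - last_4_weeks.index.min()).days) // 7

weekly_sums = last_4_weeks.groupby(['REST_KEY', window]).sum()

# Average the weekly sums per REST_KEY; REST_KEY is now an index level,
# so it no longer shows up as an aggregated column.
weekly_means = weekly_sums.groupby(level='REST_KEY').mean()
With the sample data this should give a COOP_DLY_SLS_AMT average of 140 for REST_KEY 1, matching the figure the question expects.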
