I struggle with making the move from Excel to Python since I'm so used to having everything visible. Below, I'm trying to convert the table up top into the table below it. I wanted to use pandas DataFrames, but if there's a different solution that's better, I'd love to hear it.
Also, as an added bonus, if someone can point me to some resources that are empathetic to visual Excel converts to Python, that would be awesome!
*Note: there are actually ~350 rows of this, and the columns could go as far as ID12 and Code12. Also, a state can repeat in my raw data source, just like VA does here.
State  ID   Code  ID2   Code2  ID3   Code3
VA     RIC  733   FFX   787    NULL  NULL
NC     WIL  798   GSB   698    WSS   444
VA     NPN  757   NULL  NULL   NULL  NULL
Required Output:
State  ID   Code
VA     RIC  733
VA     FFX  787
VA     NPN  757
NC     WIL  798
NC     GSB  698
NC     WSS  444
I think lreshape would be ideal for this situation.
pd.lreshape(df, {'Code': ['Code', 'Code2', 'Code3'], 'ID': ['ID', 'ID2', 'ID3']}) \
.sort_values('State', ascending=False)
State Code ID
0 VA 733.0 RIC
2 VA 757.0 NPN
3 VA 787.0 FFX
1 NC 798.0 WIL
4 NC 698.0 GSB
5 NC 444.0 WSS
A more generic solution, apart from @MaxU's, would be:
code_list = [col for col in list(df) if col.startswith('Code')]
id_list = [col for col in list(df) if col.startswith('ID')]
pd.lreshape(df, {'Code': code_list, 'ID': id_list}).sort_values('State', ascending=False)
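lreshape is a rather obscure corner of pandas, so if you'd prefer better-trodden primitives, here is a minimal sketch of the same reshape using plain concat/rename. It assumes df is the wide table above with the NULLs read in as NaN, and it scales to ID12/Code12:
import pandas as pd

# pair the ID*/Code* columns positionally, stack each pair under common
# names, then drop the all-NULL rows contributed by the shorter records
id_cols = [c for c in df.columns if c.startswith('ID')]
code_cols = [c for c in df.columns if c.startswith('Code')]
long_df = pd.concat(
    [df[['State', i, c]].rename(columns={i: 'ID', c: 'Code'})
     for i, c in zip(id_cols, code_cols)],
    ignore_index=True,
).dropna(subset=['ID'])
long_df.sort_values('State', ascending=False)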
I have two dataframes: one with some missing values, and another with the values that should replace them. The second dataframe is therefore shorter than the first.
The missing values in the first dataframe are marked by either "Height Info Not Found" or "Player Info Not Found".
Is there a way to replace the missing values in the first dataframe with the corresponding values from the second dataframe without looping?
I tried using .map(), but the values that aren't replaced come back as NaN.
filled_df['height']= filled_df['height'].astype(str) #dataframe with real values
main_df['height']= main_df['height'].astype(str) #dataframe with missing values
mapping = dict(filled_df[['name','height']].values)
main_df['height'] = main_df['url_names'].map(mapping,na_action='ignore')
print(main_df)
name url_names height
0 John Mcenroe John_Mcenroe Height Info Not Found
1 Jimmy Connors Jimmy_Connors Player Info Not Found
2 Ivan Lendl Ivan_Lendl 1.88 m (6 ft 2 in)
3 Mats Wilander Mats_Wilander 1.83 m (6 ft 0 in)
4 Andres Gomez Andres_Gomez 1.93 m (6 ft 4 in)
5 Anders Jarryd Anders_Jarryd 1.80 m (5 ft 11 in)
6 Henrik Sundstrom Henrik_Sundstrom 1.88 m (6 ft 2 in)
7 Pat Cash Pat_Cash Height Info Not Found
8 Eliot Teltscher Eliot_Teltscher 1.75 m (5 ft 9 in)
9 Yannick Noah Yannick_Noah 1.93 m (6 ft 4 in)
10 Joakim Nystrom Joakim_Nystrom 1.87 m (6 ft 2 in)
11 Aaron Krickstein Aaron_Krickstein 6 ft 2 in (1.88 m)
12 Johan Kriek Johan_Kriek 1.75 m (5 ft 9 in)
name height
0 John_Mcenroe 1.80
1 Jimmy_Connors 1.78
2 Pat_Cash 183
3 Jimmy_Arias 175
4 Juan_Aguilera 1.82
5 Henri_Leconte 1.84
6 Balazs_Taroczy 1.82
7 Sammy_Giammalva_Jr 1.78
8 Thierry_Tulasne 1.77
I think you need to replace only the missing values with the matched values from the dictionary. Since the placeholders here are strings rather than real NaN, convert them to NaN first:
main_df['height'] = main_df['height'].replace(['Height Info Not Found', 'Player Info Not Found'], np.nan)
main_df['height'] = main_df['height'].fillna(main_df['url_names'].map(mapping))
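A self-contained sketch of that idea, with toy frames assumed to mirror the question's columns:
import numpy as np
import pandas as pd

main_df = pd.DataFrame({
    'name': ['John Mcenroe', 'Ivan Lendl'],
    'url_names': ['John_Mcenroe', 'Ivan_Lendl'],
    'height': ['Height Info Not Found', '1.88 m (6 ft 2 in)'],
})
# as in the question, the lookup frame is keyed by url_names-style names
filled_df = pd.DataFrame({'name': ['John_Mcenroe'], 'height': ['1.80']})

mapping = dict(filled_df[['name', 'height']].values)
# turn the string placeholders into real NaN, then fill from the mapping
main_df['height'] = main_df['height'].replace(
    ['Height Info Not Found', 'Player Info Not Found'], np.nan)
main_df['height'] = main_df['height'].fillna(main_df['url_names'].map(mapping))
print(main_df)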
This code can also do the job:
import pandas as pd
d = {'url_names': ['John_Mcenroe', 'Jimmy_Connors', 'Ivan_Lendl'], 'height': ['Height Info Not Found', 'Player Info Not Found', '1.88 m (6 ft 2 in)']}
main_df = pd.DataFrame(d)
d = {'url_names': ['John_Mcenroe', 'Jimmy_Connors'], 'height': ['1.80', '1.78']}
filled_df = pd.DataFrame(d)
df1 = (main_df[(main_df.height == 'Height Info Not Found')
               | (main_df.height == 'Player Info Not Found')]
       .drop(['height'], axis=1)
       .merge(filled_df, on='url_names'))
df2 = main_df[(main_df.height != 'Height Info Not Found')
              & (main_df.height != 'Player Info Not Found')]
pd.concat([df1, df2])
I have a dataframe with the following structure:
      Cluster 1             Cluster 2             Cluster 3
  ID  Name   Revenue    ID  Name   Revenue    ID  Name   Revenue
1234  John   123      1235  Jane   761      1237  Mary   276
1376  Peter  254      1297  Paul   439      1425  David  532
However, I am unsure how to perform basic operations like .unique() or .value_counts() on the columns, since I don't know how to refer to them in the code.
For example, if I want to see the unique values in the Cluster 2 Name column, how would I code that?
Usually I would type df.Name.unique() or df['Name'].unique(), but neither of these works.
My original data looked like this:
  ID  Name   Revenue  Cluster
1234  John   123      1
1235  Jane   761      2
1237  Mary   276      3
1297  Paul   439      2
1376  Peter  254      1
1425  David  532      3
And I used this code to get me to my current point:
df = (df.set_index([df.groupby('Cluster').cumcount(), 'Cluster'])
        .unstack()
        .swaplevel(1, 0, axis=1)
        .sort_index(axis=1)
        .rename(columns=lambda x: f'Cluster {x}', level=0))
You just need to subset by each column level in sequence.
So your first step is to subset the Cluster 2 group, then take the unique names.
For example:
df["Cluster 2"]["Names"].unique()
I wrote some code to group the airlines data by AIRLINE and count the AIR_TIME records.
largest_airlines = flight_data.groupby(['AIRLINE'])['AIR_TIME'].count()
print (len(largest_airlines))
largest_airlines
The output is:
AIRLINE
AA 8720
AS 768
B6 540
DL 10539
EV 5697
F9 1305
HA 111
MQ 3314
NK 1486
OO 6425
UA 7680
US 1593
VX 986
WN 8310
Name: AIR_TIME, dtype: int64
I want to filter for the counts greater than 2500. Can anyone help me with the syntax for that?
It would depend on what method you want to use for it.
In pandas, for example, you could use something like this:
greater_than_2500 = largest_airlines.loc[largest_airlines > 2500]
But it would help to see something you've tried first.
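For instance, a minimal runnable sketch with invented counts:
import pandas as pd

largest_airlines = pd.Series({'AA': 8720, 'AS': 768, 'DL': 10539, 'HA': 111},
                             name='AIR_TIME')
# a boolean mask on the Series keeps only the counts above 2500
print(largest_airlines.loc[largest_airlines > 2500])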
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Country': ['AU', 'GB', 'KR', 'US', 'GB', 'US', 'KR', 'AU', 'US'],
    'Region Manager': ['TL', 'JS', 'HN', 'AL', 'JS', 'AL', 'HN', 'TL', 'AL'],
    'Curr_Sales': [453, 562, 236, 636, 893, 542, 125, 561, 371],
    'Curr_Revenue': [4530, 7668, 5975, 3568, 2349, 6776, 3046, 1111, 4852],
    'Prior_Sales': [235, 789, 132, 220, 569, 521, 131, 777, 898],
    'Prior_Revenue': [1530, 2668, 3975, 5668, 6349, 7776, 8046, 2111, 9852],
})
pd.pivot_table(df,
               values=['Curr_Sales', 'Curr_Revenue', 'Prior_Sales', 'Prior_Revenue'],
               index=['Country', 'Region Manager'],
               aggfunc=np.sum, margins=True)
Hi folks,
I have the following dataframe and I'd like to re-order the multi-index columns as
['Prior_Sales','Prior_Revenue','Curr_Sales', 'Curr_Revenue']
How can I do that in pandas?
The code is shown above
Thanks in advance for all the help!
Slice the resulting dataframe
pd.pivot_table(
    df,
    values=['Curr_Sales', 'Curr_Revenue', 'Prior_Sales', 'Prior_Revenue'],
    index=['Country', 'Region Manager'],
    aggfunc='sum',
    margins=True
)[['Prior_Sales', 'Prior_Revenue', 'Curr_Sales', 'Curr_Revenue']]
Prior_Sales Prior_Revenue Curr_Sales Curr_Revenue
Country Region Manager
AU TL 1012 3641 1014 5641
GB JS 1358 9017 1455 10017
KR HN 263 12021 361 9021
US AL 1639 23296 1549 15196
All 4272 47975 4379 39875
cols = ['Prior_Sales','Prior_Revenue','Curr_Sales', 'Curr_Revenue']
df = df[cols]
I am trying to convert the following data structure:
To the format below, in Python 3:
If your data looks like:
array = [['PIN: 123 COD: 222 \n', 'LOA: 124 LOC: Sea \n'],
['PIN:456 COD:555 \n', 'LOA:678 LOC:Chi \n']]
You can do this.
Step 1: use regular expressions to parse your data, since it comes in as strings (see the re module documentation for more on regular expressions).
raws = list()
for index in range(0, len(array)):
    raws.append(re.findall(r'(PIN|COD|LOA|LOC): ?(\w+)', str(array[index])))
Output:
[[('PIN', '123'), ('COD', '222'), ('LOA', '124'), ('LOC', 'Sea')], [('PIN', '456'), ('COD', '555'), ('LOA', '678'), ('LOC', 'Chi')]]
Step 2: extract the raw values and the column names.
columns = np.array(raws)[0,:,0]
raws = np.array(raws)[:,:,1]
Output:
raws -
[['123' '222' '124' 'Sea']
['456' '555' '678' 'Chi']]
columns -
['PIN' 'COD' 'LOA' 'LOC']
Step 3: now we can just create the DataFrame.
df = pd.DataFrame(raws, columns=columns)
Output:
PIN COD LOA LOC
0 123 222 124 Sea
1 456 555 678 Chi
Is this what you want?
I hope it helps; I'm not sure about your input format.
And don't forget to import the libraries! (I used pandas as pd, numpy as np, and re.)
UPD: here is another way. I created a log file like yours:
array = open('example.log').readlines()
Output:
['PIN: 123 COD: 222 \n',
'LOA: 124 LOC: Sea \n',
'PIN: 12 COD: 322 \n',
'LOA: 14 LOC: Se \n']
Then split on ' ', drop the '\n', and reshape:
raws = np.array([i.split(' ')[:-1] for i in array]).reshape(2, 4, 2)
In the reshape, the first number is the row count of your future dataframe, the second is the column count, and the last one you don't need to change. It won't work if there is no whitespace between the info and the '\n' in each row; if that's your case, I can change the example.
Output:
array([[['PIN:', '123'],
['COD:', '222'],
['LOA:', '124'],
['LOC:', 'Sea']],
[['PIN:', '12'],
['COD:', '322'],
['LOA:', '14'],
['LOC:', 'Se']]],
dtype='|S4')
And then take raws and columns:
columns = np.array(raws)[:,:,0][0]
raws = np.array(raws)[:,:,1]
Finally, create the dataframe (and cut the trailing ':' from the column names):
pd.DataFrame(raws, columns=[i[:-1] for i in columns])
Output:
PIN COD LOA LOC
0 123 222 124 Sea
1 12 322 14 Se
If you have many log files, you can do this for each one in a for-loop, collect the dataframes in a list (say, DF_array), and then use pd.concat to build one dataframe from that list of dataframes.
pd.concat(DF_array)
If you need, I can add an example.
UPD:
I created a directory with log files and then built a list of all the files from PATH:
PATH = "logs_data/"
files = [PATH + i for i in os.listdir(PATH)]
Then run a for-loop like in the last update:
dfs = list()
for f in files:
    array = open(f).readlines()
    # integer division (//) is needed in Python 3 to get an int for reshape
    raws = np.array([i.split(' ')[:-1] for i in array]).reshape(len(array) // 2, 4, 2)
    columns = np.array(raws)[:, :, 0][0]
    raws = np.array(raws)[:, :, 1]
    df = pd.DataFrame(raws, columns=[i[:-1] for i in columns])
    dfs.append(df)
result = pd.concat(dfs)
Output:
PIN COD LOA LOC
0 123 222 124 Sea
1 12 322 14 Se
2 1 32 4 Ses
0 15673 2324 13464 Sss
1 12452 3122 11234 Se
2 11 132 4 Ses
0 123 222 124 Sea
1 12 322 14 Se
2 1 32 4 Ses
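Note the repeating 0, 1, 2 index above: pd.concat keeps each file's original index by default. If you'd rather have one continuous RangeIndex, pass ignore_index=True:
result = pd.concat(dfs, ignore_index=True)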