I am having an issue deleting nulls. My input dataframe:
name no city tr1_0 tr2_0 tr3_0 tr1_1 tr2_1 tr3_1 tr1_2 tr2_2 tr3_2
John 11 edi boa 51 110 cof 52 220
Rick 12 new cof 61 100 dcu 61 750
Mat t1 nyc
My desired output:
name no city tr1 tr3 tr2
0 John 11 edi boa 110 51
1 John 11 edi cof 220 52
2 Rick 12 new cof 100 61
3 Rick 12 new dcu 750 61
4 Matt 13 wil nan nan nan
I used the code below:
df1 = pd.read_fwf(inputFileName, widths=widths, names=names, dtype=str, index_col=False )
feature_models = [col for col in df1.columns if re.match("tr[0-9]_[0-9]",col) is not None]
features = list(set([ re.sub("_[0-9]","",feature_model) for feature_model in feature_models]))
ub("_[0-9]","",feature_model) for feature_model in feature_models]))
df1 = pd.wide_to_long(df1,i=['name', 'no',
df1 = pd.wide_to_long(df1,i=['name', 'no', 'city',],j='ModelID',stubnames=features,sep="_")
My current output is below. Row 2 doesn't make any sense in my use case, so I don't want to generate that row at all. If there is no trailer I only want 1 row, which works (row 6). If there are 2 trailers I only want 2 rows, but it's giving me 3 rows (rows 2 and 5 are extra). I tried using dropna but it's not working. Also, in my case it prints nan, not NaN.
name no city tr1 tr3 tr2
0 John 11 edi boa 110 51
1 John 11 edi cof 220 52
2 John 11 edi nan nan nan
3 Rick 12 new cof 100 61
4 Rick 12 new dcu 750 61
5 Rick 12 new nan nan nan
6 Matt 13 wil nan nan nan
You can use this alternative solution with split and stack:
df1 = df1.set_index(['name', 'no', 'city'])
df1.columns = df1.columns.str.split('_', expand=True)
df1 = df1.stack(1, dropna=False).reset_index(level=3, drop=True)
mask = df1.index.duplicated() & df1.isnull().all(axis=1)
df1 = df1[~mask].reset_index()
print (df1)
name no city tr1 tr2 tr3
0 John 11 edi boa 51.0 110.0
1 John 11 edi cof 52.0 220.0
2 Rick 12 new cof 61.0 100.0
3 Rick 12 new dcu 61.0 750.0
4 Mat t1 nyc NaN NaN NaN
With your solution:
df1 = pd.wide_to_long(df1,i=['name', 'no', 'city'],j='ModelID',stubnames=features,sep="_")
To remove the NaN rows with duplicated MultiIndex values, you can filter by boolean indexing:
# remove the ModelID counting level
df1 = df1.reset_index(level=3, drop=True)
mask = df1.index.duplicated() & df1.isnull().all(axis=1)
df1 = df1[~mask].reset_index()
Details:
Check duplicated index values with Index.duplicated:
print (df1.index.duplicated())
[False True False True False True]
Then check for missing values with DataFrame.isnull, and use DataFrame.all to test whether every value in a row is missing:
print (df1.isnull().all(axis=1))
name no city
John 11 edi False
edi False
Rick 12 new False
new False
Mat t1 nyc True
nyc True
dtype: bool
Chain both masks with & for bitwise AND:
mask = df1.index.duplicated() & df1.isnull().all(axis=1)
print (mask)
name no city
John 11 edi False
edi False
Rick 12 new False
new False
Mat t1 nyc False
nyc True
dtype: bool
Invert the boolean mask with ~:
print (~mask)
name no city
John 11 edi True
edi True
Rick 12 new True
new True
Mat t1 nyc True
nyc False
dtype: bool
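For reference, a self-contained version of the stack approach might look like the sketch below; the hand-built sample frame is an assumption (in the question the data comes from read_fwf with dtype=str), with the missing trailer fields entered as NaN:

import numpy as np
import pandas as pd

# Sample input, mirroring the question's data (missing trailer fields as NaN).
df1 = pd.DataFrame({
    'name': ['John', 'Rick', 'Mat'],
    'no':   ['11', '12', 't1'],
    'city': ['edi', 'new', 'nyc'],
    'tr1_0': ['boa', 'cof', np.nan], 'tr2_0': ['51', '61', np.nan], 'tr3_0': ['110', '100', np.nan],
    'tr1_1': ['cof', 'dcu', np.nan], 'tr2_1': ['52', '61', np.nan], 'tr3_1': ['220', '750', np.nan],
    'tr1_2': [np.nan] * 3, 'tr2_2': [np.nan] * 3, 'tr3_2': [np.nan] * 3,
})

df1 = df1.set_index(['name', 'no', 'city'])
# Split 'tr1_0' into ('tr1', '0') so the numeric suffix becomes a column level.
df1.columns = df1.columns.str.split('_', expand=True)

# Stack the suffix level, keeping all-NaN rows for the moment.
df1 = df1.stack(1, dropna=False).reset_index(level=3, drop=True)

# Drop an all-NaN row only if its index already appeared, so a fully empty group keeps one row.
mask = df1.index.duplicated() & df1.isnull().all(axis=1)
print(df1[~mask].reset_index())

The key point is that index.duplicated() is False for the first occurrence of each name/no/city combination, so a group that is entirely NaN still keeps exactly one row.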
Related
I've been trying to crack this for a while, but I'm stuck now.
This is my code:
l = list()
column_name = [col for col in df.columns if 'SalesPerson' in col]
filtereddf = pd.DataFrame(columns=['Item', 'SerialNo', 'Location',
                                   'SalesPerson01', 'SalesPerson02', 'SalesPerson03',
                                   'SalesPerson04', 'SalesPerson05', 'SalesPerson06',
                                   'PredictedSales01', 'PredictedSales02', 'PredictedSales03',
                                   'PredictedSales04', 'PredictedSales05', 'PredictedSales06'])
for i, r in df.iterrows():
    if len(r['Name'].split(';')) > 1:
        for x in r['Name'].split(';'):
            for y in column_name:
                if x in r[y]:
                    number_is = y[-2:]
                    filtereddf.at[i, 'SerialNo'] = r['SerialNo']
                    filtereddf.at[i, 'Location'] = r['Location']
                    filtereddf.at[i, y] = r[y]
                    filtereddf.at[i, 'Item'] = r['Item']
                    filtereddf.at[i, f'PredictedSales{number_is}'] = r[f'PredictedSales{number_is}']
                    # The below statement prints the values correctly, but I want to
                    # filter the values and use them in a dataframe.
                    # print(r['SerialNo'], r['Location'], r[f'SalesPerson{number_is}'], r[f'PredictedSales{number_is}'], r['Definition'])
                    l.append(filtereddf)
    else:
        for y in column_name:
            if r['Name'] in r[y]:
                number_is = y[-2:]
                filtereddf.at[i, 'SerialNo'] = r['SerialNo']
                filtereddf.at[i, 'Location'] = r['Location']
                filtereddf.at[i, y] = r[y]
                filtereddf.at[i, 'Item'] = r['Item']
                filtereddf.at[i, f'PredictedSales{number_is}'] = r[f'PredictedSales{number_is}']
                # The below statement prints the values correctly, but I want to
                # filter the values and use them in a dataframe.
                # print(r['SerialNo'], r['Location'], r[f'SalesPerson{number_is}'], r[f'PredictedSales{number_is}'], r['Definition'])
                l.append(filtereddf)
finaldf = pd.concat(l, ignore_index=True)
It eventually throws an error:
MemoryError: Unable to allocate 9.18 GiB for an array with shape (1, 1231543895) and data type object
Basically, I want to extract SalesPersonNN and the corresponding PredictedSalesNN from the main dataframe df.
A sample dataset is below (the actual CSV file has almost 100k entries):
Item Name SerialNo Location SalesPerson01 SalesPerson02 SalesPerson03 SalesPerson04 SalesPerson05 SalesPerson06 PredictedSales01 PredictedSales02 PredictedSales03 PredictedSales04 PredictedSales05 PredictedSales06
0 TV Joe;Mary;Philip 11111 NY Tom Julie Joe Sara Mary Philip 90 80 30 98 99 100
1 WashingMachine Mike 22222 NJ Tom Julie Joe Mike Mary Philip 80 70 40 74 88 42
2 Dishwasher Tony;Sue 33333 NC Margaret Tony William Brian Sue Bert 58 49 39 59 78 89
3 Microwave Bill;Jeff;Mary 44444 PA Elmo Bill Jeff Mary Chris Kevin 80 70 90 56 92 59
4 Printer Keith;Joe 55555 DE Keith Clark Ed Matt Martha Joe 87 94 59 48 74 89
And I want the output dataframe to look like:
Item Name SerialNo Location SalesPerson01 SalesPerson02 SalesPerson03 SalesPerson04 SalesPerson05 SalesPerson06 PredictedSales01 PredictedSales02 PredictedSales03 PredictedSales04 PredictedSales05 PredictedSales06
0 TV Joe;Mary;Philip 11111 NY NaN NaN Joe NaN Mary Philip NaN NaN 30.0 NaN 99.0 100.0
1 WashingMachine Mike 22222 NJ NaN NaN NaN Mike NaN NaN NaN NaN NaN 74.0 NaN NaN
2 Dishwasher Tony;Sue 33333 NC NaN Tony NaN NaN Sue NaN NaN 49.0 NaN NaN 78.0 NaN
3 Microwave Bill;Jeff;Mary 44444 PA NaN Bill Jeff Mary NaN NaN NaN 70.0 90.0 56.0 NaN NaN
4 Printer Keith;Joe 55555 DE Keith NaN NaN NaN NaN Joe 87.0 NaN NaN NaN NaN 89.0
I am not sure if my approach using DataFrame.at is correct; any pointers on how to efficiently filter only those column values that match the value in the Name column would be appreciated.
I would recommend changing from a column-focused dataframe to a row-focused one. You can reshape your dataset using melt:
df_person = df.loc[:,'SalesPerson01':'SalesPerson06']
df_sales = df.loc[:,'PredictedSales01':'PredictedSales06']
df_person = df_person.melt(ignore_index=False, value_name='SalesPerson')[['SalesPerson']]
PredictedSales = df_sales.melt(ignore_index=False, value_name='PredictedSales')[['PredictedSales']]
df_person['PredictedSales'] = PredictedSales
index_cols = ['Item','SerialNo', 'Location', 'SalesPerson']
df_person = df_person.reset_index().sort_values(index_cols).set_index(index_cols)
df_person will look like this:
Item SerialNo Location SalesPerson PredictedSales
TV 11111 NY Joe 30
Julie 80
Mary 99
Philip 100
Sara 98
Tom 90
WashingMachine 22222 NJ Joe 40
Julie 70
Mary 88
Mike 74
Philip 42
Tom 80
... ... ... ... ...
Printer 55555 DE Clark 94
Ed 59
Joe 89
Keith 87
Martha 74
Matt 48
Now you only want the values for the names in your 'Name' column. Therefore we create a separate dataframe using explode:
df_names = df[['Name']].explode('Name').rename({'Name':'SalesPerson'}, axis=1)
df_names = df_names.reset_index().set_index(['Item','SerialNo', 'Location', 'SalesPerson'])
df_names will look something like this:
Item SerialNo Location SalesPerson
TV 11111 NY Joe
Mary
Philip
WashingMachine 22222 NJ Mike
Dishwasher 33333 NC Tony
Sue
Microwave 44444 PA Bill
Jeff
Mary
Printer 55555 DE Keith
Joe
Now you can simply merge your dataframes:
df_names.merge(df_person, left_index=True, right_index=True)
Now the PredictedSales are added to your df_names dataframe.
Hopefully this will run without errors. Please let me know 😀
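If it helps, here is a self-contained sketch of the melt/explode/merge idea above, using just the TV and Printer rows from the sample. The set_index on Item/SerialNo/Location and the split of Name on ';' are assumptions the snippets above rely on, and the positional copy of PredictedSales (rather than index alignment) is my own choice to keep the duplicated index out of the assignment:

import pandas as pd

df = pd.DataFrame({
    'Item': ['TV', 'Printer'],
    'Name': ['Joe;Mary;Philip', 'Keith;Joe'],
    'SerialNo': [11111, 55555],
    'Location': ['NY', 'DE'],
    'SalesPerson01': ['Tom', 'Keith'], 'SalesPerson02': ['Julie', 'Clark'],
    'SalesPerson03': ['Joe', 'Ed'], 'SalesPerson04': ['Sara', 'Matt'],
    'SalesPerson05': ['Mary', 'Martha'], 'SalesPerson06': ['Philip', 'Joe'],
    'PredictedSales01': [90, 87], 'PredictedSales02': [80, 94],
    'PredictedSales03': [30, 59], 'PredictedSales04': [98, 48],
    'PredictedSales05': [99, 74], 'PredictedSales06': [100, 89],
})

# Index by the identifying columns so melt/explode keep them attached.
df = df.set_index(['Item', 'SerialNo', 'Location'])

# Long format: one row per (item, salesperson slot).
df_person = df.loc[:, 'SalesPerson01':'SalesPerson06'].melt(
    ignore_index=False, value_name='SalesPerson')[['SalesPerson']]
sales_long = df.loc[:, 'PredictedSales01':'PredictedSales06'].melt(
    ignore_index=False, value_name='PredictedSales')
# Both melts emit rows in the same order (column by column), so a positional copy is safe.
df_person['PredictedSales'] = sales_long['PredictedSales'].to_numpy()

# One row per name listed in 'Name' (split the ';'-separated string before exploding).
df_names = df['Name'].str.split(';').explode().rename('SalesPerson').to_frame()

index_cols = ['Item', 'SerialNo', 'Location', 'SalesPerson']
df_person = df_person.reset_index().set_index(index_cols)
df_names = df_names.reset_index().set_index(index_cols)

# Keep only the slots whose salesperson appears in 'Name'.
print(df_names.merge(df_person, left_index=True, right_index=True))

The inner merge keeps exactly the slot/name pairs that occur in Name, which is the long-format equivalent of the NaN-masked wide output in the question.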
There are two files. If the ID number matches in both files, then I want to take only Value 1 and Value 2 from File2.txt. Please let me know if my question is unclear.
File1.txt
ID Number Value 1 Value 2 Country
0001 23 55 Spain
0231 15 23 USA
4213 10 11 Canada
7541 32 29 Italy
File2.txt
0001 5 6
0231 7 18
4213 54 87
5554 12 10
1111 31 13
6422 66 51
The output should look like this.
ID Number Value 1 Value 2 Country
0001 5 6 Spain
0231 7 18 USA
4213 54 87 Canada
7541 32 29 Italy
New example:
File3.txt
#ID CAT CHN LC SC LATITUDE LONGITUDE
20022 CX 21 -- 4 32.739000 -114.635700
01711 CX 21 -- 3 32.779700 -115.567500
08433 CX 21 -- 2 31.919930 -123.321000
File4.txt
20022,32.45,-114.88
01192,32.839,-115.487
01711,32.88,-115.45
01218,32.717,-115.637
output
#ID CAT CHN LC SC LATITUDE LONGITUDE
20022 CX 21 -- 4 32.45 -114.88
01711 CX 21 -- 3 32.88 -115.45
08433 CX 21 -- 2 31.919930 -123.321000
Code I have so far:
f = open("File3.txt", "r")
x= open("File4.txt","r")
df1 = pd.read_csv(f, sep=' ', engine='python')
df2 = pd.read_csv(x, sep=' ', header=None, engine='python')
df2 = df2.set_index(0).rename_axis("#ID")
df2 = df2.rename(columns={5:'LATITUDE', 6: 'LONGITUDE'})
df1 = df1.set_index('#ID')
df1.update(df2)
print(df1)
Something like this, possibly:
file1_data = []
file1_headers = []
with open("File1.txt") as file1:
    for line in file1:
        file1_data.append(line.strip().split("\t"))

file1_headers = file1_data[0]
del file1_data[0]

file2_data = []
with open("File2.txt") as file2:
    for line in file2:
        file2_data.append(line.strip().split("\t"))

file2_ids = [x[0] for x in file2_data]

final_data = [file1_headers] + file1_data
for i in range(1, len(final_data)):
    if final_data[i][0] in file2_ids:
        match = [x for x in file2_data if x[0] == final_data[i][0]]
        # take Value 1 and Value 2 from File2, keep the original Country (column 3)
        final_data[i] = match[0] + [final_data[i][3]]

with open("output.txt", "w") as output:
    output.writelines(["\t".join(x) + "\n" for x in final_data])
final_data starts as the header row plus the rows of file1_data, and the loop then selectively replaces the rows whose IDs appear in file2_data, while keeping the country.
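If File2 is large, scanning file2_data again for every row of File1 gets slow; building a dict keyed on ID does the lookup in constant time. A drop-in replacement for the matching loop above (same variables assumed):

# Build an ID -> row lookup once instead of rescanning file2_data per row.
file2_lookup = {row[0]: row for row in file2_data}

for i in range(1, len(final_data)):
    match = file2_lookup.get(final_data[i][0])
    if match is not None:
        # Take Value 1 and Value 2 from File2, keep the original Country (column 3).
        final_data[i] = match + [final_data[i][3]]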
Okay, what you need to do here is get the indexes to match in both dataframes after importing. This is important because pandas uses index-based data alignment.
Here is a complete example using your data:
from io import StringIO
import pandas as pd
File1txt=StringIO("""ID Number Value 1 Value 2 Country
0001 23 55 Spain
0231 15 23 USA
4213 10 11 Canada
7541 32 29 Italy""")
File2txt = StringIO("""0001 5 6
0231 7 18
4213 54 87
5554 12 10
1111 31 13
6422 66 51""")
df1 = pd.read_csv(File1txt, sep=r'\s\s+', engine='python')
df2 = pd.read_csv(File2txt, sep=r'\s\s+', header=None, engine='python')
print(df1)
# ID Number Value 1 Value 2 Country
# 0 1 23 55 Spain
# 1 231 15 23 USA
# 2 4213 10 11 Canada
# 3 7541 32 29 Italy
print(df2)
# 0 1 2
# 0 1 5 6
# 1 231 7 18
# 2 4213 54 87
# 3 5554 12 10
# 4 1111 31 13
# 5 6422 66 51
df2 = df2.set_index(0).rename_axis('ID Number')
df2 = df2.rename(columns={1:'Value 1', 2: 'Value 2'})
df1 = df1.set_index('ID Number')
df1.update(df2)
print(df1.reset_index())
Output:
ID Number Value 1 Value 2 Country
0 1 5.0 6.0 Spain
1 231 7.0 18.0 USA
2 4213 54.0 87.0 Canada
3 7541 32.0 29.0 Italy
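The same update pattern should carry over to the File3/File4 follow-up; a minimal sketch, assuming File4.txt is comma-separated as shown and reading the IDs as strings so the leading zeros still match:

df3 = pd.read_csv('File3.txt', sep=r'\s+', dtype={'#ID': str})
df4 = pd.read_csv('File4.txt', sep=',', header=None, dtype={0: str})

# Give df4 the same index name and column names as the fields it should overwrite.
df4 = df4.set_index(0).rename_axis('#ID')
df4 = df4.rename(columns={1: 'LATITUDE', 2: 'LONGITUDE'})

df3 = df3.set_index('#ID')
df3.update(df4)          # overwrites LATITUDE/LONGITUDE only where the #ID matches
print(df3.reset_index())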
I have a dataset in the following format:
Country Code Year Value
0 ABC 32 2000 NaN
1 ABC 32 2001 NaN
2 ABC 32 2002 NaN
3 ABC 32 2003 NaN
4 ABC 32 2004 1000000.0
5 ABC 32 2005 NaN
6 ABC 32 2006 NaN
7 ABC 32 2007 NaN
8 ABC 32 2008 NaN
9 ABC 32 2009 NaN
and I am trying to replace the NaN values in such a way that they show yearly growth of r% around the non-NaN value; in other words, for the example data, Value[i] should equal 1000000 * (1+r)^x, where x is the difference between the index of row i and the index of the non-NaN value.
For this small set, the following code does the job:
df['imputed'] = ''
gr = 0.05  # growth rate
for i in range(len(df)):
    nx = df.Value.first_valid_index()  # index of first non-NaN value
    nv = df.Value[nx]                  # first non-NaN value
    df.loc[i, 'imputed'] = nv * (1 + gr) ** (i - nx)
df
Country Code Year Value imputed
0 ABC 32 2000 NaN 822702
1 ABC 32 2001 NaN 863838
2 ABC 32 2002 NaN 907029
3 ABC 32 2003 NaN 952381
4 ABC 32 2004 1000000.0 1e+06
5 ABC 32 2005 NaN 1.05e+06
6 ABC 32 2006 NaN 1.1025e+06
7 ABC 32 2007 NaN 1.15763e+06
8 ABC 32 2008 NaN 1.21551e+06
9 ABC 32 2009 NaN 1.27628e+06
However, the real dataset has multiple combinations of 'Country' and 'Code' which require similar calculations (NOTE: each of these combinations has only one non-NaN value just like above).
If I make a new df (df2) with all of the required Country-Code combinations, how could I apply the above calculations to every matching combination in the main df? Please note that there are also many combinations which do not require such calculations.
df2
Country Code
0 ABC 32
1 DEF 27
2 GHI 19
You can process a filtered dataframe from the whole data, country by country (or by any other key), and then append or merge the pieces back together. I just present the method here; feel free to play around with the code below and tailor it into a more optimized solution.
Code:
import numpy as np
import pandas as pd

cols = ['Country', 'Code', 'Year', 'Value']
df2 = pd.DataFrame(columns=cols)
df2['Country'] = np.array([(c * 10).split() for c in ['ABC ', 'DEF ', 'GHI ']]).ravel()
df2['Code'] = np.array([(c * 10).split() for c in ['32 ', '27 ', '19 ']]).ravel()
df2['Year'] = np.arange(2000, 2010).tolist() * 3
df2['Value'] = np.nan
df2.loc[[4, 14, 24], 'Value'] = [1000000.0, 2000000.0, 3000000.0]
# print(df2)

df2['Value'] = pd.to_numeric(df2['Value'], errors='coerce')  # make sure Value is numeric
gr = 0.05  # growth rate
df2['imputed'] = 0

def process(df):
    for i in range(len(df)):
        nx = df.Value.first_valid_index()  # index of first non-NaN value
        nv = df.Value.loc[nx]              # first non-NaN value
        # print(nv, gr, i, nx)
        df.loc[i, 'imputed'] = nv * ((1 + gr) ** (i - nx))
    return df

new_df = pd.DataFrame()
for c in df2.Country.unique():
    cond = (df2.Country == c)
    p_df = df2[cond].copy()
    p_df.reset_index(drop=True, inplace=True)
    df_ = process(p_df)
    new_df = pd.concat([new_df, df_], ignore_index=True)

print(new_df)
Output:
Country Code Year Value imputed
0 ABC 32 2000 NaN 8.227025e+05
1 ABC 32 2001 NaN 8.638376e+05
2 ABC 32 2002 NaN 9.070295e+05
3 ABC 32 2003 NaN 9.523810e+05
4 ABC 32 2004 1000000.0 1.000000e+06
5 ABC 32 2005 NaN 1.050000e+06
6 ABC 32 2006 NaN 1.102500e+06
7 ABC 32 2007 NaN 1.157625e+06
8 ABC 32 2008 NaN 1.215506e+06
9 ABC 32 2009 NaN 1.276282e+06
10 DEF 27 2000 NaN 1.645405e+06
11 DEF 27 2001 NaN 1.727675e+06
12 DEF 27 2002 NaN 1.814059e+06
13 DEF 27 2003 NaN 1.904762e+06
14 DEF 27 2004 2000000.0 2.000000e+06
15 DEF 27 2005 NaN 2.100000e+06
16 DEF 27 2006 NaN 2.205000e+06
17 DEF 27 2007 NaN 2.315250e+06
18 DEF 27 2008 NaN 2.431013e+06
19 DEF 27 2009 NaN 2.552563e+06
20 GHI 19 2000 NaN 2.468107e+06
21 GHI 19 2001 NaN 2.591513e+06
22 GHI 19 2002 NaN 2.721088e+06
23 GHI 19 2003 NaN 2.857143e+06
24 GHI 19 2004 3000000.0 3.000000e+06
25 GHI 19 2005 NaN 3.150000e+06
26 GHI 19 2006 NaN 3.307500e+06
27 GHI 19 2007 NaN 3.472875e+06
28 GHI 19 2008 NaN 3.646519e+06
29 GHI 19 2009 NaN 3.828845e+06
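The per-country loop can also be expressed with groupby and transform, which applies the same rule to every Country/Code group without slicing and re-concatenating by hand. A sketch that reuses the df2 and gr defined above:

def impute(v):
    nx = v.first_valid_index()   # label of the single non-NaN value in the group
    pos = v.index.get_loc(nx)    # its position within the group
    return v.loc[nx] * (1 + gr) ** (np.arange(len(v)) - pos)

df2['imputed'] = df2.groupby(['Country', 'Code'])['Value'].transform(impute)
print(df2)

transform returns one value per original row, so the result lines up with df2 without any index resetting.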
Based on this dataframe:
df1
Name Age
Johny 15
Diana 35
Doris 97
Peter 25
Antony 55
I have this dataframe with the ranges that I want to use, for example:
df2
Header Init1 Final1 Init2 Final2 Init3 Final3
Names NaN NaN NaN NaN NaN NaN
Age 0 20 21 50 51 100
What I'm looking for is to get a result like this:
df3
Name Age
Johny 0-20
Diana 21-50
Doris 51-100
Peter 21-50
Antony 51-100
I don't know if a possible solution is with cut(), but I'm new to Python.
Using pd.cut:
# bin edges come from the 'Age' row of df2
l = df2.iloc[1, 1:].tolist()
# build a label for every pair of consecutive edges
labels = [str(a) + '-' + str(b) for a, b in zip(l, l[1:])]
df['Age'] = pd.cut(df['Age'], bins=l, labels=labels)
print(df)
Name Age
0 Johny 0-20
1 Diana 21-50
2 Doris 51-100
3 Peter 21-50
4 Antony 51-100
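For reference, this is what the helper lists look like for the sample df2 row. The pairwise zip creates narrow filler bins (20-21 and 50-51) so that the edges stay consecutive; pd.cut uses right-closed intervals, so an age of exactly 20 still falls in '0-20', while exactly 21 would fall in the filler '20-21' bin:

l = [0, 20, 21, 50, 51, 100]
labels = [str(a) + '-' + str(b) for a, b in zip(l, l[1:])]
print(labels)   # ['0-20', '20-21', '21-50', '50-51', '51-100']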
I have a dataset that looks like the following.
Region_Name Date Average
London 1990Q1 105
London 1990Q1 118
... ... ...
London 2018Q1 157
I converted the date into quarters and wish to create a new dataframe with the matching quarters and region names grouped together, along with the mean of Average.
What is the best way to accomplish such a task?
I have been looking at the groupby function but keep getting a traceback.
For example:
new_df = df.groupby(['Resion_Name','Date']).mean()
dict3={'Region_Name': ['London','Newyork','London','Newyork','London','London','Newyork','Newyork','Newyork','Newyork','London'],
'Date' : ['1990Q1','1990Q1','1990Q2','1990Q2','1991Q1','1991Q1','1991Q2','1992Q2','1993Q1','1993Q1','1994Q1'],
'Average': [34,56,45,67,23,89,12,45,67,34,67]}
df3=pd.DataFrame(dict3)
Now my df3 is as follows:
Region_Name Date Average
0 London 1990Q1 34
1 Newyork 1990Q1 56
2 London 1990Q2 45
3 Newyork 1990Q2 67
4 London 1991Q1 23
5 London 1991Q1 89
6 Newyork 1991Q2 12
7 Newyork 1992Q2 45
8 Newyork 1993Q1 67
9 Newyork 1993Q1 34
10 London 1994Q1 67
The code looks as follows:
new_df = df3.groupby(['Region_Name','Date'])
new1=new_df['Average'].transform('mean')
Result of dataframe new1:
print(new1)
0 34.0
1 56.0
2 45.0
3 67.0
4 56.0
5 56.0
6 12.0
7 45.0
8 50.5
9 50.5
10 67.0
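Note that transform returns one value for every original row. If the goal is a new dataframe with one row per Region_Name/Date pair holding the mean of Average, aggregate instead; a short sketch with the same df3:

new_df = df3.groupby(['Region_Name', 'Date'], as_index=False)['Average'].mean()
print(new_df)
#   Region_Name    Date  Average
# 0      London  1990Q1     34.0
# 1      London  1990Q2     45.0
# 2      London  1991Q1     56.0
# 3      London  1994Q1     67.0
# 4     Newyork  1990Q1     56.0
# 5     Newyork  1990Q2     67.0
# 6     Newyork  1991Q2     12.0
# 7     Newyork  1992Q2     45.0
# 8     Newyork  1993Q1     50.5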