Python pandas: show repeated values

I'm trying to get data from a txt file with pandas.read_csv, but it doesn't show repeated (duplicate) values in the file. For example, I have 2043 in several rows, but it is shown only once instead of in every row.
My file sample: (image)
Result set: (image)
All the circled cells should also contain 2043, but they are empty.
My code is:
import pandas as pd
df = pd.read_csv('samplefile.txt', sep='\t', header=None,
                 names=["234", "235", "236"])

You get a MultiIndex, so repeated values in the first index levels are simply not shown.
You can convert the MultiIndex to columns with reset_index:
df = df.reset_index()
Or specify every column in the names parameter to avoid the MultiIndex:
df = pd.read_csv('samplefile.txt', sep='\t',
                 names=["one", "two", "next", "234", "235", "236"])
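If it helps to see why this happens, here is a minimal, self-contained sketch (the inline data is hypothetical, standing in for the real file): when read_csv gets fewer names than there are columns, the surplus leading columns become a (Multi)Index, and pandas blanks repeated index labels when printing.

import io
import pandas as pd

# Hypothetical file contents: six tab-separated columns per row.
data = "a\tb\t2043\t1\t2\t3\na\tb\t2043\t4\t5\t6\n"

# Three names for six columns -> the first three columns become a MultiIndex.
df = pd.read_csv(io.StringIO(data), sep='\t', header=None,
                 names=["234", "235", "236"])
print(df)                # repeated index labels are shown only once
print(df.reset_index())  # reset_index restores them as ordinary columns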

A word of warning with MultiIndex, as I was bitten by this yesterday and wasted time troubleshooting a non-existent problem.
If one of your index levels is of type float64, you may find that the index values are not shown in full. I had a dataframe I was calling df.groupby().describe() on, and the variable I was grouping by was originally a long int; at some point it was converted to a float, and when printed this index was rounded. There were a number of values very close to each other, so on printing it appeared that the groupby() had found only one level of the first index with multiple levels of the second.
That's not very clear, so here is an illustrative example...
import numpy as np
import pandas as pd

index = np.random.uniform(low=89908893132829,
                          high=89908893132929,
                          size=(50,))
df = pd.DataFrame({'obs': np.arange(100)},
                  index=np.append(index, index)).sort_index()
df.index.name = 'index1'
df['index2'] = [1, 2] * 50
df.reset_index(inplace=True)
df.set_index(['index1', 'index2'], inplace=True)
Look at the dataframe and it appears that there is only one level of index1...
df.head(10)
                     obs
index1       index2
8.990889e+13 1         4
             2        54
             1        61
             2        11
             1        89
             2        39
             1        65
             2        15
             1        60
             2        10
Call groupby(['index1', 'index2']).describe() and it also looks like there is only one level of index1...
summary = df.groupby(['index1', 'index2']).describe()
summary.head()
                      obs
                    count  mean  std   min   25%   50%   75%   max
index1       index2
8.990889e+13 1        1.0   4.0  NaN   4.0   4.0   4.0   4.0   4.0
             2        1.0  54.0  NaN  54.0  54.0  54.0  54.0  54.0
             1        1.0  61.0  NaN  61.0  61.0  61.0  61.0  61.0
             2        1.0  11.0  NaN  11.0  11.0  11.0  11.0  11.0
             1        1.0  89.0  NaN  89.0  89.0  89.0  89.0  89.0
But if you look at the actual values of index1 in either one, you see that there are multiple unique values. In the original dataframe...
df.index.get_level_values('index1')
Float64Index([89908893132833.12, 89908893132833.12, 89908893132834.08,
89908893132834.08, 89908893132835.05, 89908893132835.05,
89908893132836.3, 89908893132836.3, 89908893132837.95,
89908893132837.95, 89908893132838.1, 89908893132838.1,
89908893132838.6, 89908893132838.6, 89908893132841.89,
89908893132841.89, 89908893132841.95, 89908893132841.95,
89908893132845.81, 89908893132845.81, 89908893132845.83,
89908893132845.83, 89908893132845.88, 89908893132845.88,
89908893132846.02, 89908893132846.02, 89908893132847.2,
89908893132847.2, 89908893132847.67, 89908893132847.67,
89908893132848.5, 89908893132848.5, 89908893132848.5,
89908893132848.5, 89908893132855.17, 89908893132855.17,
89908893132855.45, 89908893132855.45, 89908893132864.62,
89908893132864.62, 89908893132868.61, 89908893132868.61,
89908893132873.16, 89908893132873.16, 89908893132875.6,
89908893132875.6, 89908893132875.83, 89908893132875.83,
89908893132878.73, 89908893132878.73, 89908893132879.9,
89908893132879.9, 89908893132880.67, 89908893132880.67,
89908893132880.69, 89908893132880.69, 89908893132881.31,
89908893132881.31, 89908893132881.69, 89908893132881.69,
89908893132884.45, 89908893132884.45, 89908893132887.27,
89908893132887.27, 89908893132887.83, 89908893132887.83,
89908893132892.8, 89908893132892.8, 89908893132894.34,
89908893132894.34, 89908893132894.5, 89908893132894.5,
89908893132901.88, 89908893132901.88, 89908893132903.27,
89908893132903.27, 89908893132904.53, 89908893132904.53,
89908893132909.27, 89908893132909.27, 89908893132910.38,
89908893132910.38, 89908893132911.86, 89908893132911.86,
89908893132913.4, 89908893132913.4, 89908893132915.73,
89908893132915.73, 89908893132916.06, 89908893132916.06,
89908893132922.48, 89908893132922.48, 89908893132923.44,
89908893132923.44, 89908893132924.66, 89908893132924.66,
89908893132925.14, 89908893132925.14, 89908893132928.28,
89908893132928.28],
dtype='float64', name='index1')
...and in the summarised dataframe...
summary.index.get_level_values('index1')
Float64Index([89908893132833.12, 89908893132833.12, 89908893132834.08,
89908893132834.08, 89908893132835.05, 89908893132835.05,
89908893132836.3, 89908893132836.3, 89908893132837.95,
89908893132837.95, 89908893132838.1, 89908893132838.1,
89908893132838.6, 89908893132838.6, 89908893132841.89,
89908893132841.89, 89908893132841.95, 89908893132841.95,
89908893132845.81, 89908893132845.81, 89908893132845.83,
89908893132845.83, 89908893132845.88, 89908893132845.88,
89908893132846.02, 89908893132846.02, 89908893132847.2,
89908893132847.2, 89908893132847.67, 89908893132847.67,
89908893132848.5, 89908893132848.5, 89908893132855.17,
89908893132855.17, 89908893132855.45, 89908893132855.45,
89908893132864.62, 89908893132864.62, 89908893132868.61,
89908893132868.61, 89908893132873.16, 89908893132873.16,
89908893132875.6, 89908893132875.6, 89908893132875.83,
89908893132875.83, 89908893132878.73, 89908893132878.73,
89908893132879.9, 89908893132879.9, 89908893132880.67,
89908893132880.67, 89908893132880.69, 89908893132880.69,
89908893132881.31, 89908893132881.31, 89908893132881.69,
89908893132881.69, 89908893132884.45, 89908893132884.45,
89908893132887.27, 89908893132887.27, 89908893132887.83,
89908893132887.83, 89908893132892.8, 89908893132892.8,
89908893132894.34, 89908893132894.34, 89908893132894.5,
89908893132894.5, 89908893132901.88, 89908893132901.88,
89908893132903.27, 89908893132903.27, 89908893132904.53,
89908893132904.53, 89908893132909.27, 89908893132909.27,
89908893132910.38, 89908893132910.38, 89908893132911.86,
89908893132911.86, 89908893132913.4, 89908893132913.4,
89908893132915.73, 89908893132915.73, 89908893132916.06,
89908893132916.06, 89908893132922.48, 89908893132922.48,
89908893132923.44, 89908893132923.44, 89908893132924.66,
89908893132924.66, 89908893132925.14, 89908893132925.14,
89908893132928.28, 89908893132928.28],
dtype='float64', name='index1')
I wasted time scratching my head wondering why my groupby(['index1', 'index2']) had produced only one level of index1!
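A small, hedged tip that would have surfaced the problem sooner, assuming the goal is simply to print more digits: widen pandas' float display format so near-identical float index values no longer collapse to the same rounded string.

import pandas as pd

# Print floats with two decimals instead of rounded scientific notation,
# so 89908893132833.12 and 89908893132834.08 display as distinct values.
pd.set_option('display.float_format', '{:.2f}'.format)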

Related

How to create a new dataframe that contains the value changes from multiple columns between two existing dataframes

I am looking at football player development over a five year period.
I have two dataframes (DFs), one that contains all 20 year-old strikers from FIFA 17 and another that contains all 25 year-old strikers from FIFA 22. I want to create a third DF that contains the attribute changes for each player. There are about 30 columns denoting each attribute, e.g. tackling, shooting, passing etc. So I want the new DF to contain +3 for tackling, +2 for shooting, +6 for passing etc.
The best way of solving this that I can think of is by merging the two DFs and then applying a function to every column that gives the difference between the x and y values, which represent the FIFA 17 and FIFA 22 data respectively.
Any tips much appreciated. Thank you.
As stated, use the difference of the dataframes. I suspect they are not ALL NaN values, as you'll only get NaN for rows where the same player isn't in both FIFA 17 and FIFA 22.
When I do it, there are only 533 players in both 17 and 22 (that were 20 years old in FIFA 17 and 25 in FIFA 22).
Here's an example:
import pandas as pd

fifa17 = pd.read_csv('D:/test/fifa/players_17.csv')
fifa17 = fifa17[fifa17['age'] == 20]
fifa17 = fifa17.set_index('sofifa_id')

fifa22 = pd.read_csv('D:/test/fifa/players_22.csv')
fifa22 = fifa22[fifa22['age'] == 25]
fifa22 = fifa22.set_index('sofifa_id')

compareCols = ['pace', 'shooting', 'passing', 'dribbling', 'defending',
               'physic', 'attacking_crossing', 'attacking_finishing',
               'attacking_heading_accuracy', 'attacking_short_passing',
               'attacking_volleys', 'skill_dribbling', 'skill_curve',
               'skill_fk_accuracy', 'skill_long_passing',
               'skill_ball_control', 'movement_acceleration',
               'movement_sprint_speed', 'movement_agility',
               'movement_reactions', 'movement_balance', 'power_shot_power',
               'power_jumping', 'power_stamina', 'power_strength',
               'power_long_shots', 'mentality_aggression',
               'mentality_interceptions', 'mentality_positioning',
               'mentality_vision', 'mentality_penalties',
               'mentality_composure', 'defending_marking_awareness',
               'defending_standing_tackle', 'defending_sliding_tackle']

df = fifa22[compareCols] - fifa17[compareCols]
df = df.dropna(axis=0)
df = pd.merge(df, fifa22[['short_name']], how='left',
              left_index=True, right_index=True)
Output:
print(df)
pace shooting ... defending_sliding_tackle short_name
sofifa_id ...
205291 -1.0 0.0 ... 3.0 H. Stengel
205988 -7.0 3.0 ... -1.0 L. Shaw
206086 0.0 8.0 ... 5.0 H. Toffolo
206113 -2.0 21.0 ... -2.0 S. Gnabry
206463 -3.0 8.0 ... 3.0 J. Dudziak
... ... ... ... ...
236311 -2.0 -1.0 ... 18.0 M. Rog
236393 2.0 5.0 ... 0.0 Marc Cardona
236415 3.0 1.0 ... 9.0 R. Alfani
236441 10.0 31.0 ... 18.0 F. Bustos
236458 1.0 0.0 ... 5.0 A. Poungouras
[533 rows x 36 columns]
You might simply subtract the pandas DataFrames; consider the following simple example:
import pandas as pd
df1 = pd.DataFrame({'X':[1,2],'Y':[3,4]})
df2 = pd.DataFrame({'X':[10,20],'Y':[30,40]})
dfdiff = df2 - df1
print(dfdiff)
gives output
X Y
0 9 27
1 18 36
I have found a solution but it is very tedious as it requires a line of code for each and every attribute.
I'm simply assigning a new column for each attribute change. So for Passing, for instance, the code is:
mergedDF = mergedDF.assign(PassingChange = mergedDF.Passing_x - mergedDF.Passing_y)
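A hedged sketch of the same idea without one line per attribute, assuming the merge used pandas' default _x/_y suffixes (the attribute names below are illustrative, not the full list):

# Hypothetical subset; extend to all ~30 attribute columns.
attributes = ['Passing', 'Shooting', 'Tackling']

for attr in attributes:
    # Per the question, _x holds the FIFA 17 values and _y the FIFA 22
    # values; swap the operands if improvements should come out positive.
    mergedDF[f'{attr}Change'] = mergedDF[f'{attr}_x'] - mergedDF[f'{attr}_y']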

Pandas.pivot replaces some existing values with NaN

I have to read some tables from a SQL Server and merge their values into a single data structure for a machine learning project. I'm using pandas, in particular pd.read_sql_query to read the values and pd.merge to fuse them.
One table has at least 80 million rows, and reading it entirely exhausts my memory (I don't have much, only 20 GB and a small SSD), so I decided to read it in chunks of 100000 rows:
df_bilanci = pd.read_sql_query('SELECT [IdBilancioRiclassificato], [IdBilancio], [Importo] FROM dbo.Bilanci', conn, chunksize=100000)
A single chunk looks like this:
   IdBilancioRiclassificato  IdBilancio   Importo
0                      10.0      7001.0       0.0
1                      11.0      7001.0  502643.0
2                      12.0      7001.0   -4550.0
3                      10.0      7002.0  654654.0
4                      11.0      7002.0       0.0
5                      12.0      7002.0       0.0
I want the values of IdBilancioRiclassificato as columns (there are 97 unique values for this column in total, so there should be 97 columns), so I used pd.pivot on every chunk and then pd.concat plus merge to put all the data together:
for chunk in df_bilanci:
    chunk.reset_index()
    chunk_pivoted = pd.pivot(data=chunk,
                             index='IdBilancio',
                             columns='IdBilancioRiclassificato',
                             values='Importo')
    df_aziende_bil = pd.concat([df_aziende_bil,
                                pd.merge(left=df_aziende_anagrafe,
                                         right=chunk_pivoted,
                                         left_on='ID', right_index=True)])
At this point, however, the chunk_pivoted dataframe has some values replaced with NaN, even though the values exist in the source table.
The expected result is a table like this one:
                10.0      11.0     12.0
IdBilancio
7001.0           0.0  502643.0  -4550.0
7002.0      654654.0       0.0      0.0
but I've got something like this:
                10.0  11.0     12.0
IdBilancio
7001.0           0.0   NaN  -4550.0
7002.0      654654.0   NaN      NaN
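No answer was recorded here, but a likely cause, offered as an assumption: read_sql_query chunks can split the rows belonging to a single IdBilancio across a chunk boundary, so each per-chunk pivot is missing the columns that fell into the neighbouring chunk and fills them with NaN; merging chunk by chunk then preserves those NaNs. A minimal sketch of one workaround, recombining the partial pivots before the final merge:

import pandas as pd

# Assumption: df_bilanci is the chunked reader from the question.
pivoted_parts = []
for chunk in df_bilanci:
    pivoted_parts.append(chunk.pivot(index='IdBilancio',
                                     columns='IdBilancioRiclassificato',
                                     values='Importo'))

# Stack the partial pivots, then keep the first non-null value per
# (IdBilancio, column) pair so rows split across chunks are reunited.
bilanci_wide = pd.concat(pivoted_parts).groupby(level=0).first()

# Merge against the master table once, after the pivot is complete.
df_aziende_bil = pd.merge(left=df_aziende_anagrafe, right=bilanci_wide,
                          left_on='ID', right_index=True)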

Python - multiplying dataframes of different size

I have two dataframes:
df1 - is a pivot table that has totals for both columns and rows, both with default names "All"
df2 - a df I created manually by specifying values and using the same index and column names as are used in the pivot table above. This table does not have totals.
I need to multiply the first dataframe by the values in the second. I expect the totals to come back as NaNs, since totals don't exist in the second table.
When I perform multiplication, I get the following error:
ValueError: cannot join with no level specified and no overlapping names
When I try the same on dummy dataframes it works as expected:
import pandas as pd
import numpy as np
table1 = np.matrix([[10, 20, 30, 60],
                    [50, 60, 70, 180],
                    [90, 10, 10, 110],
                    [150, 90, 110, 350]])
df1 = pd.DataFrame(data=table1, index=['One', 'Two', 'Three', 'All'],
                   columns=['A', 'B', 'C', 'All'])
print(df1)

table2 = np.matrix([[1.0, 2.0, 3.0],
                    [5.0, 6.0, 7.0],
                    [2.0, 1.0, 5.0]])
df2 = pd.DataFrame(data=table2, index=['One', 'Two', 'Three'],
                   columns=['A', 'B', 'C'])
print(df2)

df3 = df1 * df2
print(df3)
This gives me the following output:
A B C All
One 10 20 30 60
Two 50 60 70 180
Three 90 10 10 110
All 150 90 110 350
A B C
One 1.00 2.00 3.00
Two 5.00 6.00 7.00
Three 2.00 1.00 5.00
A All B C
All nan nan nan nan
One 10.00 nan 40.00 90.00
Three 180.00 nan 10.00 50.00
Two 250.00 nan 360.00 490.00
So, visually, the only difference between df1 and df2 is the presence/absence of the column and row "All".
And I think the only difference between my dummy dataframes and the real ones is that the real df1 was created with pd.pivot_table method:
df1_real = pd.pivot_table(PY, values=['Annual Pay'], index=['PAR Rating'],
                          columns=['CR Range'], aggfunc=[np.sum], margins=True)
I do need to keep the totals as I'm using them in other calculations.
I'm sure there is a workaround, but I just really want to understand why the same code works on some dataframes of different sizes but not others. Or maybe the issue is something completely different.
Thank you for reading. I realize it's a very long post.
IIUC,
My Preferred Approach
you can use the mul method in order to pass the fill_value argument. In this case, you'll want a value of 1 (multiplicative identity) to preserve the value from the dataframe in which the value is not missing.
df1.mul(df2, fill_value=1)
A All B C
All 150.0 350.0 90.0 110.0
One 10.0 60.0 40.0 90.0
Three 180.0 110.0 10.0 50.0
Two 250.0 180.0 360.0 490.0
Alternate Approach
You can also embrace the np.nan and use a follow-up combine_first to fill back in the missing bits from df1
(df1 * df2).combine_first(df1)
A All B C
All 150.0 350.0 90.0 110.0
One 10.0 60.0 40.0 90.0
Three 180.0 110.0 10.0 50.0
Two 250.0 180.0 360.0 490.0
I really like piRSquared's approach, and here is mine :-)
df1.loc[df2.index,df2.columns]*=df2
df1
Out[293]:
A B C All
One 10.0 40.0 90.0 60
Two 250.0 360.0 490.0 180
Three 180.0 10.0 50.0 110
All 150.0 90.0 110.0 350
@Wen, @piRSquared, thank you for your help. This is what I ended up doing. There is probably a more elegant solution, but this worked for me.
Since I was able to multiply two dummy dataframes of different sizes, I reasoned the issue wasn't the size, but the fact that one of the dataframes was created as a pivot table. Somehow in this pivot table, the headers were not recognized, though visually they were there. So, I decided to convert the pivot table to a regular dataframe. Steps I took:
1. Converted the pivot table to records and then back to a dataframe using the solution from this thread: pandas pivot table to data frame.
2. Cleaned up the column headers using the solution from the same thread.
3. Set my first column as the index following the suggestion in this thread: How to remove index from a created Dataframe in Python?
This gave me a dataframe that was visually identical to what I had before but was no longer a pivot table.
I was then able to multiply the two dataframes with no issues. I used the approach suggested by @Wen because I like that it preserves the structure.
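A hedged sketch of that flattening step, assuming df1_real carries the three-level MultiIndex columns this pivot_table call produces (aggfunc, value name, CR Range):

# Keep only the innermost column level (the CR Range labels) so the
# columns match df2's plain Index and elementwise multiplication aligns.
df1_flat = df1_real.copy()
df1_flat.columns = df1_flat.columns.get_level_values(-1)

# Totals are preserved, as in the accepted fill_value approach.
result = df1_flat.mul(df2, fill_value=1)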

resample Pandas dataframe and merge strings in column

I want to resample a pandas dataframe and apply different functions to different columns. The problem is that I cannot properly process a column with strings. I would like to apply a function that merges the string with a delimiter such as " - ". This is a data example:
import pandas as pd
import numpy as np

idx = pd.date_range('2017-01-31', '2017-02-03')
data = list([[1, 10, "ok"], [2, 20, "merge"], [3, 30, "us"]])
dates = pd.DatetimeIndex(['2017-01-31', '2017-02-03', '2017-02-03'])
d = pd.DataFrame(data, index=dates, columns=list('ABC'))
A B C
2017-01-31 1 10 ok
2017-02-03 2 20 merge
2017-02-03 3 30 us
Resampling the numeric columns A and B with a sum and mean aggregator works. Column C, however, only kind of works with sum (and it gets placed in the second position, which might mean that something fails).
d.resample('D').agg({'A': sum, 'B': np.mean, 'C': sum})
              A        C     B
2017-01-31  1.0       ok  10.0
2017-02-01  NaN        0   NaN
2017-02-02  NaN        0   NaN
2017-02-03  5.0  mergeus  25.0
I would like to get this:
...
2017-02-03 5.0 merge - us 25.0
I tried using lambda in different ways but without success (not shown).
If I may ask a second related question: I can do some post-processing for this, but how do I fill the missing cells in different columns with zeros or ""?
Your agg function for column 'C' should be a join
d.resample('D').agg({'A': sum, 'B': np.mean, 'C': ' - '.join})
              A     B           C
2017-01-31  1.0  10.0          ok
2017-02-01  NaN   NaN
2017-02-02  NaN   NaN
2017-02-03  5.0  25.0  merge - us
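For the follow-up about filling the empty days, a minimal sketch assuming zeros are wanted in the numeric columns and an empty string in C:

out = d.resample('D').agg({'A': sum, 'B': np.mean, 'C': ' - '.join})

# Per-column fill values: 0 for the numeric columns, '' for the strings.
out = out.fillna({'A': 0, 'B': 0, 'C': ''})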

Aligning 2 python lists according to 2 other lists

I have two arrays, namely nlxTTL and ttlState. Both arrays consist of a repeating pattern of 0's and 1's indicating an input voltage that can be HIGH (1) or LOW (0), and both are recorded from the same source, which sends a TTL pulse (HIGH and LOW) with a 1-second pulse width.
But due to a logging mistake, some drops happen in the ttlState list, i.e. it doesn't log a perfectly repeating sequence of 0's and 1's and ends up dropping values.
The good part is that I also log a timestamp for each TTL input received, for both lists. The inter-TTL-event timestamp difference clearly shows when a pulse has been missed.
Here is an example of what data looks like:
nlxTTL, ttlState, nlxTime, ttlTime
0,0,1000,1000
1,1,2000,2000
0,1,3000,4000
1,1,4000,6000
0,0,5000,7000
1,1,6000,8000
0,0,7000,9000
1,1,8000,10000
As you can see, nlxTime and ttlTime clearly differ from each other. How can I then use these timestamps to align all 4 lists?
When dealing with tabular data such as a CSV file, it's a good idea to use a library to make the process easier. I like the pandas dataframe library.
Now for your question, one way to think about this problem is that you really have two datasets... An nlx dataset and a ttl dataset. You want to join those datasets together by timestamp. Pandas makes tasks like this very easy.
import pandas as pd
from io import StringIO
data = """\
nlxTTL, ttlState, nlxTime, ttlTime
0,0,1000,1000
1,1,2000,2000
0,1,3000,4000
1,1,4000,6000
0,0,5000,7000
1,1,6000,8000
0,0,7000,9000
1,1,8000,10000
"""
# Load data into dataframe.
df = pd.read_csv(StringIO(data))
# Remove spaces from column names.
df.columns = [x.strip() for x in df.columns]
# Split the data into an nlx dataframe and a ttl dataframe.
nlx = df[['nlxTTL', 'nlxTime']].reset_index()
ttl = df[['ttlState', 'ttlTime']].reset_index()
# Merge the dataframes back together based on their timestamps.
# Use an outer join so missing data gets filled with NaNs instead
# of just dropping the rows.
merged_df = nlx.merge(ttl, left_on='nlxTime', right_on='ttlTime', how='outer')
# Get back to the original set of columns
merged_df = merged_df[df.columns]
# Print out the results.
print(merged_df)
This produces the following output.
nlxTTL ttlState nlxTime ttlTime
0 0.0 0.0 1000.0 1000.0
1 1.0 1.0 2000.0 2000.0
2 0.0 NaN 3000.0 NaN
3 1.0 1.0 4000.0 4000.0
4 0.0 NaN 5000.0 NaN
5 1.0 1.0 6000.0 6000.0
6 0.0 0.0 7000.0 7000.0
7 1.0 1.0 8000.0 8000.0
8 NaN 0.0 NaN 9000.0
9 NaN 1.0 NaN 10000.0
You'll notice that it fills in the dropped values with NaN values because we are doing an outer join. If this is undesirable, change the how='outer' parameter to how='inner' to perform an inner join. This will only keep records for which you have both an nlx and ttl response at that timestamp.
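For reference, a sketch of that inner-join variant:

# Keeps only timestamps present in both datasets; dropped pulses disappear.
merged_df = nlx.merge(ttl, left_on='nlxTime', right_on='ttlTime', how='inner')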
