I have this DataFrame's data as a list of lists:
l = [["a",1,2,"","",""],["a",1,2,3,"",""], ["a","",2,"3","4",""],["a",1,"","",4,5]]
I would like to combine all those rows to obtain this final row: ["a", 1, 2, 3, 4, 5].
Ideally, I would flatten the list of lists, filling in the blank values where needed. What would be the most Pythonic way to do that?
Try:
import numpy as np
import pandas as pd

l = [["a",1,2,"","",""], ["a",1,2,3,"",""], ["a","",2,"3","4",""], ["a",1,"","",4,5]]
df = pd.DataFrame(l)

# Replace blanks with NaN, forward-fill each column downwards,
# then keep only the last (now fully filled) row
df_out = df.replace('', np.nan).ffill().tail(1)
print(df_out)
Output:
   0    1    2  3  4    5
3  a  1.0  2.0  3  4  5.0
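If you would rather stay with plain Python lists (matching the question's "flatten" wording), here is a minimal sketch without pandas that keeps, for each column, the last non-blank value:

l = [["a",1,2,"","",""], ["a",1,2,3,"",""], ["a","",2,"3","4",""], ["a",1,"","",4,5]]

# For each column, scan from the bottom up and take the first non-blank value
merged = [next((v for v in reversed(col) if v != ""), "") for col in zip(*l)]
print(merged)  # ['a', 1, 2, '3', 4, 5]

Note that values keep whatever type they had in the source rows, so '3' stays a string here.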
Related
I'm having trouble adding a new column to a pandas DataFrame when the new column has more values than the DataFrame's index is long.
The data looks like this:
import pandas as pd
df = pd.DataFrame({
    "bar": ["A", "B", "C"],
    "zoo": [1, 2, 3],
})
As you can see, the length of this df's index is 3.
Next, I want to add a new column; the code might take either of the two forms below:
df["new_col"] = [1,2,3,4]
It raises an error: ValueError: Length of values does not match length of index.
Or:
df["new_col"] = pd.Series([1,2,3,4])
I just get the values [1, 2, 3] in df: assignment aligns on the index, so the extra value is silently dropped (the new column cannot extend past the existing maximum index).
What I want is a four-row result where the original columns are padded with NaN:

   bar  zoo  new_col
0    A  1.0        1
1    B  2.0        2
2    C  3.0        3
3  NaN  NaN        4
Is there a better way?
Looking forward to your answer, thanks!
Use DataFrame.join, renaming the Series first and joining with how='right':
# if not default index
# df = df.reset_index(drop=True)

# The right join keeps every row of the (longer) Series' index
df = df.join(pd.Series([1,2,3,4]).rename('new_col'), how='right')
print(df)
   bar  zoo  new_col
0    A  1.0        1
1    B  2.0        2
2    C  3.0        3
3  NaN  NaN        4
Another idea is to reindex df by the new Series' index:
s = pd.Series([1,2,3,4])

# Reindexing first extends df with NaN rows so the lengths match
df = df.reindex(s.index)
df["new_col"] = s
print(df)
   bar  zoo  new_col
0    A  1.0        1
1    B  2.0        2
2    C  3.0        3
3  NaN  NaN        4
The same thing as a one-liner with assign:

s = pd.Series([1,2,3,4])
df = df.reindex(s.index).assign(new_col=s)
Another option is pd.concat along axis=1:

df = pd.DataFrame({
    "bar": ["A", "B", "C"],
    "zoo": [1, 2, 3],
})
new_col = pd.Series([1,2,3,4])

# concat aligns on the index and pads the shorter frame with NaN
df = pd.concat([df, new_col], axis=1)
print(df)
   bar  zoo  0
0    A  1.0  1
1    B  2.0  2
2    C  3.0  3
3  NaN  NaN  4
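Because the Series was created without a name, concat labels the new column 0. A small follow-up (the target name is of course your choice) to fix that:

df = df.rename(columns={0: 'new_col'})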
I have tried the following code:
import pandas as pd
list1 = {'Names':[1,2,3,4,5]}
df = pd.DataFrame(list1)
df_csv = pd.read_csv('try.csv')
df_csv['Names'] = list1
df_csv.to_csv('try.csv', index=False, mode='a')
The error is the following:
"ValueError: Length of values does not match length of index"
I understand that the size of the dataframe doesn't match, but how can I solve that?
This is what I want in the try.csv file after appending: the existing columns with the new Names column (1 through 5) alongside them.
Starting from your code, the corrected version would be:
import pandas as pd

list1 = {'Names': [1,2,3,4,5]}
df = pd.DataFrame(list1)
df_csv = pd.read_csv('try.csv')
df_csv['Names'] = df.Names  # changed here: assign the Series, not the dict
df_csv.to_csv('try.csv', index=False, mode='w')  # 'w' overwrites; 'a' would append a duplicate copy
You are trying to assign a dictionary here, whereas the column requires a list of values. You should try this:
import pandas as pd

list1 = {'Names': [1,2,3,4,5]}
df = pd.DataFrame(list1)
df_csv = pd.read_csv('try.csv')
df_csv['Names'] = list1['Names']  # pass the list, not the dict
df_csv.to_csv('try.csv', index=False, mode='w')  # as above, 'a' would append a second copy with headers
Also, it is not clear why you would need the additional df DataFrame here at all.
You can try using concat. Suppose your df_csv has more rows, as below; then you can create a new DataFrame from list1 and concatenate it on as a new column:
import pandas as pd

list1 = {'Names': [1,2,3,4,5]}

# creating a dataframe with initial values instead of pd.read_csv
df_csv = pd.DataFrame({'col_1': [100,200,300,400,500,600,700]})
print(df_csv)
Result:
   col_1
0    100
1    200
2    300
3    400
4    500
5    600
6    700
Now create a DataFrame from list1 and concatenate it to df_csv with axis=1:
list1_df = pd.DataFrame(list1)

# concatenate df_csv and list1_df side by side; the shorter one is padded with NaN
df_csv = pd.concat([df_csv, list1_df], axis=1)
print(df_csv)
Result:
   col_1  Names
0    100    1.0
1    200    2.0
2    300    3.0
3    400    4.0
4    500    5.0
5    600    NaN
6    700    NaN
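To finish the question's workflow, the combined frame can then be written back out. Note the default mode='w': appending with mode='a' would duplicate the header row, as discussed above.

df_csv.to_csv('try.csv', index=False)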
I have a valid json file with the following format that I am trying to load into pandas.
{
    "testvalues": [
        [1424754000000, 0.7413],
        [1424840400000, 0.7375],
        [1424926800000, 0.7344],
        [1425013200000, 0.7375],
        [1425272400000, 0.7422],
        [1425358800000, 0.7427]
    ]
}
Pandas has a read_json() function that takes JSON files/buffers and returns a DataFrame, but I have not been able to get it to load this correctly: I get a single column whose elements look like [1424754000000, 0.7413] rather than two columns. I have tried different 'orient' and 'typ' values to no avail. What options should I pass to get a two-column DataFrame corresponding to the timestamp and the value?
You can use a list comprehension with the DataFrame constructor:

import pandas as pd

df = pd.read_json('file.json')
print(df)
                testvalues
0  [1424754000000, 0.7413]
1  [1424840400000, 0.7375]
2  [1424926800000, 0.7344]
3  [1425013200000, 0.7375]
4  [1425272400000, 0.7422]
5  [1425358800000, 0.7427]
print(pd.DataFrame([x for x in df['testvalues']], columns=['a','b']))
               a       b
0  1424754000000  0.7413
1  1424840400000  0.7375
2  1424926800000  0.7344
3  1425013200000  0.7375
4  1425272400000  0.7422
5  1425358800000  0.7427
I'm not sure about pandas read_json, but IIUC you could do it with astype(str), str.strip and str.split:
import pandas as pd

d = {
    "testvalues": [
        [1424754000000, 0.7413],
        [1424840400000, 0.7375],
        [1424926800000, 0.7344],
        [1425013200000, 0.7375],
        [1425272400000, 0.7422],
        [1425358800000, 0.7427]
    ]
}
df = pd.DataFrame(d)

# Render each list as a string, strip the brackets, then split on ', '
res = df.testvalues.astype(str).str.strip('[]').str.split(', ', expand=True)
In [112]: df
Out[112]:
                testvalues
0  [1424754000000, 0.7413]
1  [1424840400000, 0.7375]
2  [1424926800000, 0.7344]
3  [1425013200000, 0.7375]
4  [1425272400000, 0.7422]
5  [1425358800000, 0.7427]

In [113]: res
Out[113]:
               0       1
0  1424754000000  0.7413
1  1424840400000  0.7375
2  1424926800000  0.7344
3  1425013200000  0.7375
4  1425272400000  0.7422
5  1425358800000  0.7427
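Since str.split produces string columns, a short follow-up (the column names are my choice) to get numeric types back:

res.columns = ['timestamp', 'value']
res = res.astype({'timestamp': 'int64', 'value': 'float64'})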
You can apply a function that splits it into a pd.Series.
Say you start with
df = pd.read_json(s)  # s is the JSON input (string, path or buffer)
Then just apply a splitting function:
>>> df.apply(
lambda r: pd.Series({'l': r[0][0], 'r': r[0][1]}),
axis=1)
              l       r
0  1.424754e+12  0.7413
1  1.424840e+12  0.7375
2  1.424927e+12  0.7344
3  1.425013e+12  0.7375
4  1.425272e+12  0.7422
5  1.425359e+12  0.7427
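As an alternative sketch (assuming the file holds exactly the structure shown above), you can skip read_json entirely and hand the nested list straight to the DataFrame constructor:

import json

import pandas as pd

with open('file.json') as f:
    data = json.load(f)

# Each inner [timestamp, value] pair becomes one row
df = pd.DataFrame(data['testvalues'], columns=['timestamp', 'value'])

# The timestamps look like epoch milliseconds; convert them if desired
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms')
print(df.head())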
I am using xlwings to replace my VB code with Python, but since I am not an experienced programmer I was wondering: which data structure should I use?
The data is in an .xls file in 2 columns and has the following form; in VB I lift it into a basic two-dimensional array arrCampaignsAmounts(i, j):
Col 1: 'market_channel_campaign_product'; Col 2: '2334.43 $'
Then I concatenate words from 4 columns on another sheet into a similar string, in another 2-dim array arrStrings(i, j):
'Austria_Facebook_Winter_Active vacation'; 'rowNumber'
Finally, I search for strings from the first array within strings from the second array; where found, I write the amount into the rowNumber from arrStrings(i, 2).
Would I use 4 lists for this task?
Two dictionaries?
Something else?
Definitely use pandas DataFrames. Here are references and some very simple DataFrame examples.
# reference: http://pandas.pydata.org/pandas-docs/stable/10min.html
# reference: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
import numpy as np
import pandas as pd

def df_dupes(df_in):
    '''
    Returns [object, count] pairs for each unique item in the dataframe.
    '''
    # Accept plain lists/tuples by converting them to a DataFrame first
    if isinstance(df_in, (list, tuple)):
        df_in = pd.DataFrame(df_in)
    return df_in.groupby(df_in.columns.tolist(), as_index=False).size()
def df_filter_example(df):
    '''
    In [96]: df
    Out[96]:
       A  B  C  D
    0  1  4  9  1
    1  4  5  0  2
    2  5  5  1  0
    3  1  3  9  6
    '''
    df = pd.DataFrame([[1,4,9,1],[4,5,0,2],[5,5,1,0],[1,3,9,6]], columns=['A','B','C','D'])
    # Boolean filtering: keep rows where A == 1 AND D == 6
    return df[(df.A == 1) & (df.D == 6)]
def df_compare(df1, df2, compare_col_list, join_type):
    '''
    df_compare compares 2 dataframes.
    Returns a left, right, inner or outer join.
    df1 is the first/left dataframe
    df2 is the second/right dataframe
    compare_col_list is a list of column names that must match between df1 and df2
    join_type = 'inner', 'left', 'right' or 'outer'
    '''
    return pd.merge(df1, df2, how=join_type, on=compare_col_list)
def df_compare_examples():
    df1 = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['c1','c2','c3'])
    '''    c1  c2  c3
        0   1   2   3
        1   4   5   6
        2   7   8   9 '''
    df2 = pd.DataFrame([[4,5,6],[7,8,9],[10,11,12]], columns=['c1','c2','c3'])
    '''    c1  c2  c3
        0   4   5   6
        1   7   8   9
        2  10  11  12 '''
    # One can see that df1 contains 1 row ([1,2,3]) not in df2 and
    # df2 contains 1 row ([10,11,12]) not in df1.
    # Assume c1 is not relevant to the comparison, so we merge on cols c2 and c3.
    df_merge = pd.merge(df1, df2, how='outer', on=['c2','c3'])
    print(df_merge)
    '''   c1_x  c2  c3  c1_y
        0    1   2   3   NaN
        1    4   5   6     4
        2    7   8   9     7
        3  NaN  11  12    10 '''
    ''' Columns c2 and c3 are returned, along with c1_x and c1_y, where c1_x
        is the value of column c1 in the first dataframe and c1_y is the value
        of c1 in the second dataframe. As such,
        any row where c1_y = NaN is a row from df1 not in df2, and
        any row where c1_x = NaN is a row from df2 not in df1. '''
    df1_unique = pd.merge(df1, df2, how='left', on=['c2','c3'])
    df1_unique = df1_unique[df1_unique['c1_y'].isnull()]
    print(df1_unique)
    df2_unique = pd.merge(df1, df2, how='right', on=['c2','c3'])
    print(df2_unique)
    df_common = pd.merge(df1, df2, how='inner', on=['c2','c3'])
    print(df_common)
def delete_column_example():
    print('create df')
    df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['a','b','c'])
    print('drop (delete/remove) column')
    col_name = 'b'
    df.drop(col_name, axis=1, inplace=True)  # or df = df.drop(col_name, axis=1)
def delete_rows_example():
    print('\n\ncreate df')
    df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['col_1','col_2','col_3'])
    print(df)
    print('\n\nappend rows')
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, pd.DataFrame([[11,22,33]], columns=['col_1','col_2','col_3'])],
                   ignore_index=True)
    print(df)
    print('\n\ndelete rows where (based on) column value')
    df = df[df.col_1 == 4]  # keeps only rows where col_1 == 4, deleting the rest
    print(df)
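Applied to the question's actual task, here is a minimal sketch of matching the concatenated strings and looking up amounts; the column names and sample values are assumptions, not the asker's real data:

import pandas as pd

# Hypothetical stand-ins for the two ranges described in the question
amounts = pd.DataFrame({
    'campaign_string': ['Austria_Facebook_Winter_Active vacation',
                        'Germany_Search_Summer_City break'],
    'amount': [2334.43, 1187.20],
})
targets = pd.DataFrame({
    'campaign_string': ['Austria_Facebook_Winter_Active vacation'],
    'row_number': [17],
})

# A single merge on the concatenated string replaces the VB nested-loop
# search: each target row picks up its amount where the strings match
matched = targets.merge(amounts, on='campaign_string', how='left')
print(matched)

With xlwings, matched (or just its amount column) can then be written back to the sheet in one assignment rather than cell by cell.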
UPDATE: This is no longer an issue since at least pandas version 0.18.1. Concatenating empty series doesn't drop them anymore, so this question is out of date.
I want to create a pandas DataFrame from a list of Series using pd.concat. The problem is that when one of the Series is empty, it doesn't get included in the resulting DataFrame, which then has the wrong dimensions when I try to rename its columns with a MultiIndex.
UPDATE: Here's an example...
import pandas as pd
sers1 = pd.Series()
sers2 = pd.Series(['a', 'b', 'c'])
df1 = pd.concat([sers1, sers2], axis=1)
This produces the following dataframe:
>>> df1
0    a
1    b
2    c
dtype: object
But I want it to produce something like this:
>>> df2
     0  1
0  NaN  a
1  NaN  b
2  NaN  c
It does this if I put a single NaN value anywhere in ser1, but it seems like this should happen automatically even when some of my series are totally empty.
Passing an argument for levels will do the trick. Here's an example. First, the wrong way:
import pandas as pd
ser1 = pd.Series()
ser2 = pd.Series([1, 2, 3])
list_of_series = [ser1, ser2, ser1]
df = pd.concat(list_of_series, axis=1)
Which produces this:
>>> df
   0
0  1
1  2
2  3
But if we add some labels to the levels argument, it will include all the empty series too:
import pandas as pd
ser1 = pd.Series()
ser2 = pd.Series([1, 2, 3])
list_of_series = [ser1, ser2, ser1]
labels = range(len(list_of_series))
df = pd.concat(list_of_series, levels=labels, axis=1)
Which produces the desired dataframe:
>>> df
    0  1   2
0 NaN  1 NaN
1 NaN  2 NaN
2 NaN  3 NaN
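For reference, on pandas 0.18.1 and later (per the update at the top of the question) the workaround is unnecessary; a quick sketch to confirm on a modern version:

import pandas as pd

ser1 = pd.Series(dtype=float)  # an explicit dtype avoids a warning on recent pandas
ser2 = pd.Series([1, 2, 3])

# Modern pandas keeps the empty series as all-NaN columns
df = pd.concat([ser1, ser2, ser1], axis=1)
print(df.shape)  # (3, 3)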