My text file
Name Surname Age Sex Grade X
Chris M. 14 M 4 10 05 2010
Adam A. 17 M 11 12 2011
Jack O. M 8 08 04 2009
...
I want to count years.
Example output:
{ '2010' : 1 , "2011" : 1 ...}
but I got "Key Error : Year".
import pandas as pd
df = pd.read_fwf("file.txt")
df.join(df['X'].str.split(' ', 2, expand = True).rename(columns={0: '1', 1: '2', 2: '3}))
df.columns=["1"]
df["1"].value_counts().dict()
What's wrong with my code?
join returns a new DataFrame and leaves the original df unchanged, so you have to assign the result back to df. After that, df has the Year column. Try this:
import pandas as pd
df = pd.read_fwf("file.txt")
df = df.join(df['Date'].str.split(' ', n=2, expand=True).rename(columns={1: 'Year', 0: 'Month'}))
df["Year"].value_counts().to_dict()
output:
{'2009': 1, '2010': 1, '2011': 1}
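As a variation on the answer above, here is a minimal sketch that counts the years without joining new columns at all. It assumes the year is always the last whitespace-separated token of the date column. The small in-memory frame is a hypothetical stand-in for file.txt.

```python
import pandas as pd

# Hypothetical in-memory stand-in for the date column read from file.txt
df = pd.DataFrame({"Date": ["10 05 2010", "11 12 2011", "08 04 2009"]})

# Split on whitespace and keep only the last token (assumed to be the year)
years = df["Date"].str.split().str[-1]

# value_counts() returns a Series; to_dict() turns it into {year: count}
year_counts = years.value_counts().to_dict()
print(year_counts)
```

Because nothing is joined back onto df, there is no result that has to be reassigned.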
Hello this is my csv data
Age Name
0 22 George
1 33 lucas
2 22 Nick
3 12 Leo
4 32 Adriano
5 53 Bram
6 11 David
7 32 Andrei
8 22 Sergio
I want to use an if/else statement: for example, if George is an adult, create a new row and insert +.
I mean:
Age Name Adul
22 George +
What is the best way?
This is the code I am using to read the data from the CSV:
import pandas as pd
produtos = pd.read_csv('User.csv', nrows=9)
print(produtos)
for i, produto in produtos.iterrows():
    print(i, produto['Age'], produto['Name'])
IIUC, you want to create a new column (not row) called "Adul". You can do this with numpy.where:
import numpy as np
produtos["Adul"] = np.where(produtos["Age"].ge(18), "+", np.nan)
Edit:
To only do this for a specific name, you could use:
name = input("Name")
if name in produtos["Name"].tolist():
    # The comparison returns a Series, so reduce it to a single bool with .all()
    if (produtos.loc[produtos["Name"] == name, "Age"] >= 18).all():
        produtos.loc[produtos["Name"] == name, "Adul"] = "+"
You can do this:
produtos["Adul"] = np.where(produtos["Age"] >= 18, "+", np.nan)
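For reference, a self-contained sketch of the numpy.where approach on the sample data from the question. One deliberate difference from the answers: the filler for non-adults here is an empty string rather than np.nan. That is an assumption made so both branches have the same type, and the threshold of 18 is taken from the answers.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
produtos = pd.DataFrame({
    "Age": [22, 33, 22, 12, 32, 53, 11, 32, 22],
    "Name": ["George", "lucas", "Nick", "Leo", "Adriano",
             "Bram", "David", "Andrei", "Sergio"],
})

# Vectorised if/else: "+" where Age >= 18, empty string otherwise (assumed filler)
produtos["Adul"] = np.where(produtos["Age"] >= 18, "+", "")
print(produtos)
```

This evaluates the condition for the whole column at once, so no iterrows loop is needed.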
I have given the following pandas dataframe:
d = {'ID': ['1169', '1234', '2456', '9567', '1234', '4321', '9567', '0169'], 'YEAR': ['2001', '2013', '2009', '1989', '2012', '2013', '2002', '2012'], 'VALUE': [8, 24, 50, 75, 3, 6, 150, 47]}
df = pd.DataFrame(data=d)
print(df)
ID YEAR VALUE
0 1169 2001 8
1 1234 2013 24
2 2456 2009 50
3 9567 1989 75
4 1234 2012 3
5 4321 2013 6
6 9567 2002 150
7 1169 2012 47
I now want to merge two rows of the DataFrame that have different IDs, so that ultimately only one of them remains. The merge should only take place if the values in the "YEAR" column match. The values in the "VALUE" column should be added together.
The output should look like this:
ID YEAR VALUE
0 1169 2001 8
1 1234 2013 30
2 2456 2009 50
3 9567 1989 75
4 1234 2012 3
5 9567 2002 150
6 1169 2012 47
Line 1 and line 5 have been merged. Line 5 is removed and line 1 remains with the previous ID, but the VALUEs of line 1 and line 5 have been added.
I would like to specify later which two lines or which two IDs should be merged. One of the two should always remain. The two IDs to be merged come from another function.
I experimented with the groupby() function, but I don't know how to merge two different IDs there. I managed it only with identical values of the "ID" column. This then looked like this:
df.groupby(['ID', 'YEAR'])['VALUE'].sum().reset_index(name ='VALUE')
Unfortunately, even after extensive searching, I have not found anything suitable. I would be very happy if someone can help me! I would like to apply the whole thing later to a much larger DataFrame with more rows. Thanks in advance and best regards!
Try this, just group on 'ID' and take the max YEAR and sum VALUE:
df.groupby('ID', as_index=False).agg({'YEAR':'max', 'VALUE':'sum'})
Output:
ID YEAR VALUE
0 1234 2013 27
1 4321 2013 6
Or group on year and take first ID:
df.groupby('YEAR', as_index=False).agg({'ID':'first', 'VALUE':'sum'})
Output:
YEAR ID VALUE
0 2012 1234 3
1 2013 1234 30
Based on the comments and the update to the question, it sounds like logic along these lines (maybe not this exact code) is required:
Try:
import pandas as pd
d = {'ID': ['1169', '1234', '2456', '9567', '1234', '4321', '9567', '0169'], 'YEAR': ['2001', '2013', '2009', '1989', '2012', '2013', '2002', '2012'], 'VALUE': [8, 24, 50, 75, 3, 6, 150, 47]}
df = pd.DataFrame(d)
df['ID'] = df['ID'].astype(int)
def correctRows(l, i):
    for x in l:
        if df.loc[x, 'YEAR'] == df.loc[i, 'YEAR']:
            row = x
            break
    return row

def mergeRows(a, b):
    rowa = list(df[df['ID'] == a].index)
    rowb = list(df[df['ID'] == b].index)
    if len(rowa) > 1:
        if type(rowb)==list:
            rowa = correctRows(rowa, rowb[0])
        else:
            rowa = correctRows(rowa, rowb)
    else:
        rowa = rowa[0]
    if len(rowb) > 1:
        if type(rowa)==list:
            rowb = correctRows(rowb, rowa[0])
        else:
            rowb = correctRows(rowb, rowa)
    else:
        rowb = rowb[0]
    print('Keeping:', df.loc[rowa].to_string().replace('\n', ', ').replace(' ', ' '))
    print('Dropping:', df.loc[rowb].to_string().replace('\n', ', ').replace(' ', ' '))
    df.loc[rowa, 'VALUE'] = df.loc[rowa, 'VALUE'] + df.loc[rowb, 'VALUE']
    df.drop(df.index[rowb], inplace=True)
    df.reset_index(drop = True, inplace=True)
    return None
# add two ids. First 'ID' is kept; the second dropped, but the 'Value'
# of the second is added to the 'Value' of the first.
# Note: the line near the start df['ID'].astype(int), hence integers required
# mergeRows(4321, 1234)
mergeRows(1234, 4321)
Outputs:
Keeping: ID 1234, YEAR 2013, VALUE 24
Dropping: ID 4321, YEAR 2013, VALUE 6
Frame now looks like:
ID YEAR VALUE
0 1169 2001 8
1 1234 2013 30 #<-- sum of 6 + 24
2 2456 2009 50
3 9567 1989 75
4 1234 2012 3
5 9567 2002 150
6 169 2012 47
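A shorter alternative sketch of the same idea, in case it scales better to a larger frame. It relabels the rows of the ID to be dropped (only where the YEAR also occurs for the ID to be kept) and then reuses the groupby/sum from the question. The function name merge_ids and its parameters are made up for this illustration. IDs stay strings here, and any rows that already share the same ID and YEAR would be summed as well.

```python
import pandas as pd

d = {'ID': ['1169', '1234', '2456', '9567', '1234', '4321', '9567', '0169'],
     'YEAR': ['2001', '2013', '2009', '1989', '2012', '2013', '2002', '2012'],
     'VALUE': [8, 24, 50, 75, 3, 6, 150, 47]}
df = pd.DataFrame(d)

def merge_ids(frame, keep, drop):
    """Fold rows with ID `drop` into rows with ID `keep` when YEAR matches."""
    out = frame.copy()
    # Years in which the ID to keep occurs
    keep_years = out.loc[out['ID'] == keep, 'YEAR']
    # Rows of the ID to drop that share one of those years
    mask = (out['ID'] == drop) & out['YEAR'].isin(keep_years)
    out.loc[mask, 'ID'] = keep
    # sort=False keeps groups in order of first appearance
    return out.groupby(['ID', 'YEAR'], sort=False, as_index=False)['VALUE'].sum()

merged = merge_ids(df, keep='1234', drop='4321')
print(merged)
```

The original df is left untouched because the function works on a copy, which makes it easier to call repeatedly with ID pairs coming from another function.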
I have a string which is -
str="Key=xxxx, age=11, key=yyyy , age=22,Key=zzzz, age=01, key=qqqq, age=21,Key=wwwww, age=91, key=pppp, age=22"
I want to convert this string to a pandas DataFrame with KEY and AGE as column names.
Each key and age form a pair.
How could I achieve this conversion?
You can try regex
import re
import pandas as pd
s = "Key=xxxx, age=11, key=yyyy , age=22,Key=zzzz, age=01, key=qqqq, age=21,Key=wwwww, age=91, key=pppp, age=22"
df = pd.DataFrame(zip(re.findall(r'Key=([^,\s]+)', s, re.IGNORECASE), re.findall(r'age=([^,\s]+)', s, re.IGNORECASE)),
columns=['key', 'age'])
df
key age
0 xxxx 11
1 yyyy 22
2 zzzz 01
3 qqqq 21
4 wwwww 91
5 pppp 22
Use a regex that finds all key/age pairs, "key=(\w+)\s*,\s*age=(\w+)", then use the matches to build the DataFrame:
import re
import pandas as pd
content = "Key=xxxx, age=11, key=yyyy , age=22,Key=zzzz, age=01, key=qqqq, age=21,Key=wwwww, age=91, key=pppp, age=22"
pat = re.compile(r"key=(\w+)\s*,\s*age=(\w+)", flags=re.IGNORECASE)
values = pat.findall(content)
df = pd.DataFrame(values, columns=['key', 'age'])
print(df)
# - - - - -
key age
0 xxxx 11
1 yyyy 22
2 zzzz 01
3 qqqq 21
4 wwwww 91
5 pppp 22
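The same pair-matching idea can also be sketched with pandas alone, using Series.str.extractall and named groups so the column names come straight from the pattern. Converting age to int is an extra step not in the answers above, and it drops the leading zero of "01".

```python
import pandas as pd

s = ("Key=xxxx, age=11, key=yyyy , age=22,Key=zzzz, age=01, "
     "key=qqqq, age=21,Key=wwwww, age=91, key=pppp, age=22")

# (?i) makes the match case-insensitive; named groups become column names
pattern = r"(?i)key=(?P<key>\w+)\s*,\s*age=(?P<age>\w+)"

# extractall returns one row per match, indexed by (row, match number)
df = pd.Series([s]).str.extractall(pattern).reset_index(drop=True)

# Optional: turn the age strings into integers ("01" becomes 1)
df["age"] = df["age"].astype(int)
print(df)
```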
I have written below function in python:
def proc_summ(df,var_names_in,var_names_group):
    df['Freq']=1
    df_summed=pd.pivot_table(df,index=(var_names_group),
                             values=(var_names_in),
                             aggfunc=[np.sum],fill_value=0,margins=True,margins_name='Total').reset_index()
    df_summed.columns = df_summed.columns.map(''.join)
    df_summed.columns = [x.strip().replace('sum', '') for x in df_summed.columns]
    string_repr = df_summed.to_string(index=False,justify='center').splitlines()
    string_repr.insert(1, "-" * len(string_repr[0]))
    string_repr.insert(len(df_summed.index)+1, "-" * len(string_repr[0]))
    out = '\n'.join(string_repr)
    print(out)
And below is the code I am using to call the function:
proc_summ (
df,
var_names_in=["Freq","sal"] ,
var_names_group=["name","age"])
and below is the output:
name   age Freq  sal
--------------------
Arik    32    1  100
David   44    2  260
John    33    1  200
John    34    1  300
Peter   33    1  100
--------------------
Total         6  960
Please let me know how I can print the data in the center of the screen, like this:
                              name   age Freq  sal
                              --------------------
                              Arik    32    1  100
                              David   44    2  260
                              John    33    1  200
                              John    34    1  300
                              Peter   33    1  100
                              --------------------
                              Total         6  960
If you are using Python3 you can try something like this
import shutil
columns = shutil.get_terminal_size().columns
print("hello world".center(columns))
Since you are using a DataFrame, you can try something like this:
import shutil
import pandas as pd
data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data)
# convert DataFrame to string
df_string = df.to_string()
df_split = df_string.split('\n')
columns = shutil.get_terminal_size().columns
# iterate over every line, including the header row
for line in df_split:
    print(line.center(columns))
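The centering step can also be wrapped in a small helper, sketched below. The function name centered_lines and the two-row sample frame are made up for illustration. Taking the width as a parameter rather than reading it inside the function makes the behaviour easy to check without a real terminal.

```python
import shutil
import pandas as pd

def centered_lines(frame, width):
    """Return the text lines of `frame`, each padded to be centered in `width`."""
    text = frame.to_string(index=False)
    return [line.center(width) for line in text.splitlines()]

# Hypothetical sample data
df = pd.DataFrame({"name": ["Arik", "David"], "sal": [100, 260]})

# Use the current terminal width when printing
width = shutil.get_terminal_size().columns
print("\n".join(centered_lines(df, width)))
```

The same helper could be applied to the string built in proc_summ by centering each element of string_repr before joining.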
I have a data frame like this, and I am adding some columns to it from a mapping and a calculation.
code month of entry name reports
0 JJ 20171002 Jason 14
1 MM 20171206 Molly 24
2 TT 20171208 Tina 31
3 JJ 20171018 Jake 22
4 AA 20090506 Amy 34
5 DD 20171128 Daisy 16
6 RR 20101216 River 47
7 KK 20171230 Kate 32
8 DD 20171115 David 14
9 JJ 20171030 Jack 10
10 NN 20171216 Nancy 28
The code below selects some rows, looks up their values in a dictionary, and inserts a further column from a simple calculation. It works fine:
import pandas as pd
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy', 'Daisy', 'River', 'Kate', 'David', 'Jack', 'Nancy'],
'code' : ['JJ', 'MM', 'TT', 'JJ', 'AA', 'DD', 'RR', 'KK', 'DD', 'JJ', 'NN'],
'month of entry': ["20171002", "20171206", "20171208", "20171018", "20090506", "20171128", "20101216", "20171230", "20171115", "20171030", "20171216"],
'reports': [14, 24, 31, 22, 34, 16, 47, 32, 14, 10, 28]}
df = pd.DataFrame(data)
dict_hour = {'JasonJJ' : 3, 'MollyMM' : 6, 'TinaTT' : 2, 'JakeJJ' : 3, 'AmyAA' : 8, 'DaisyDD' : 6, 'RiverRR' : 4, 'KateKK' : 8, 'DavidDD' : 5, 'JackJJ' : 5, 'NancyNN' : 2}
wanted = ['JasonJJ', 'TinaTT', 'AmyAA', 'DaisyDD', 'KateKK']
df['name_code'] = df['name'].astype(str) + df['code'].astype(str)
df1 = df[df['name_code'].isin(wanted)]
df1['hour'] = df1['name_code'].map(dict_hour).astype(float)
df1['coefficient'] = df1['reports'] / df1['hour'] - 1
But the last two lines both raise the same warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
How can the code be improved accordingly? Thank you.
You need to call copy():
df1 = df[df['name_code'].isin(wanted)].copy()
If you modify values in df1 later, you will find that the modifications do not propagate back to the original data (df), and that is what pandas is warning about. With copy(), df1 is explicitly an independent DataFrame, so the warning goes away.
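A reduced sketch of the fix, using a three-row subset of the question's data (the subset is chosen here only to keep the example short). It shows that the new columns land on df1 while df stays as it was.

```python
import pandas as pd

# Reduced version of the question's data
df = pd.DataFrame({
    "name_code": ["JasonJJ", "MollyMM", "KateKK"],
    "reports": [14, 24, 32],
})
dict_hour = {"JasonJJ": 3, "MollyMM": 6, "KateKK": 8}
wanted = ["JasonJJ", "KateKK"]

# .copy() makes df1 an independent DataFrame rather than a slice of df
df1 = df[df["name_code"].isin(wanted)].copy()

# These assignments now modify df1 only
df1["hour"] = df1["name_code"].map(dict_hour).astype(float)
df1["coefficient"] = df1["reports"] / df1["hour"] - 1

print(df1)
print(df.columns.tolist())  # the original frame has no new columns
```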