Pandas read_table error - python

I am trying to read a tab delimited text file into a dataframe.
This is how the file looks in Excel:
CALENDAR_DATE ORDER_NUMBER INVOICE_NUMBER TRANSACTION_TYPE CUSTOMER_NUMBER CUSTOMER_NAME
5/13/2016 0:00 13867666 6892372 S 2026 CUSTOMER 1
Import into a df:
import pandas as p
df = p.read_table("E:/FileLoc/ThisIsAFile.txt", encoding="iso-8859-1")
Now it doesn't see the first 3 columns as part of the data (df[0] comes back as TRANSACTION_TYPE), and all of the headers shift over to reflect this.
CALENDAR_DATE ORDER_NUMBER INVOICE_NUMBER
5/13/2016 0:00 13867666 6892372 S 2026 CUSTOMER 1
I am trying to manipulate the text file and then import it to a mysql database as an end result.

You can use read_csv with a separator of 2 or more whitespace characters:
import pandas as pd
import io
temp=u"""CALENDAR_DATE ORDER_NUMBER INVOICE_NUMBER TRANSACTION_TYPE CUSTOMER_NUMBER CUSTOMER_NAME
5/13/2016 0:00 13867666 6892372 S 2026 CUSTOMER 1"""
# after testing, replace io.StringIO(temp) with the file name
df = pd.read_csv(io.StringIO(temp), sep=r'\s{2,}', engine='python', encoding='iso-8859-1')
print (df)
CALENDAR_DATE ORDER_NUMBER INVOICE_NUMBER TRANSACTION_TYPE \
0 5/13/2016 0:00 13867666 6892372 S
CUSTOMER_NUMBER CUSTOMER_NAME
0 2026 CUSTOMER 1
If the separator is a tab, use sep='\t'.
EDIT:
I tested it with your data and it works:
import pandas as pd
df = pd.read_csv('test/AnonymizedData.txt', sep='\t')
print (df)
CUSTOMER_NUMBER CUSTOMER_NAME CUSTOMER_BRANCH_CODE CUSTOMER_BRANCH_NAME \
0 2026 CUSTOMER 1 83 SALES BRANCH 1
1 2359 CUSTOMER 2 76 SALES BRANCH 2
2 100662 CUSTOMER 3 28 SALES BRANCH 3
3 3245 CUSTOMER 4 84 SALES BRANCH 4
4 3179 CUSTOMER 5 28 SALES BRANCH 5
5 39881 CUSTOMER 6 67 SALES BRANCH 6
6 37020 CUSTOMER 7 58 SALES BRANCH 7
7 1239 CUSTOMER 8 50 SALES BRANCH 8
8 2379 CUSTOMER 9 76 SALES BRANCH 9
CUSTOMER_CITY CUSTOMER_STATE ... PRICING_PRODUCT_TYPE_CODE \
0 TOWN 1 CO ... 11
1 TOWN 2 OH ... 11
2 TOWN 3 ME ... 11
3 TOWN 4 IL ... 11
4 TOWN 5 NH ... 11
5 TOWN 6 TX ... 11
6 TOWN 7 NC ... 11
7 TOWN 8 NY ... 11
8 TOWN 9 OH ... 11
PRICING_PRODUCT_TYPE ORGANIZATION_ID ORGANIZATION_NAME PRODUCT_LINE_CODE \
0 DISPOSABLES 83 ORGANIZATIONNAME 891
1 DISPOSABLES 83 ORGANIZATIONNAME 891
2 DISPOSABLES 83 ORGANIZATIONNAME 891
3 DISPOSABLES 83 ORGANIZATIONNAME 891
4 DISPOSABLES 83 ORGANIZATIONNAME 891
5 DISPOSABLES 83 ORGANIZATIONNAME 891
6 DISPOSABLES 83 ORGANIZATIONNAME 891
7 DISPOSABLES 83 ORGANIZATIONNAME 891
8 DISPOSABLES 83 ORGANIZATIONNAME 891
PRODUCT_LINE ROBOTIC_FLAG Unnamed: 52 Unnamed: 53 Unnamed: 54
0 PRODUCTNAME N N NaN 3
1 PRODUCTNAME N N NaN 3
2 PRODUCTNAME N N NaN 2
3 PRODUCTNAME N N NaN 7
4 PRODUCTNAME N N NaN 1
5 PRODUCTNAME N N NaN 4
6 PRODUCTNAME N N NaN 3
7 PRODUCTNAME N N NaN 5
8 PRODUCTNAME N N NaN 3
[9 rows x 55 columns]
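Since the stated end goal is a MySQL database, here is a minimal sketch of the final load step with to_sql; the connection string, driver, and table name are placeholder assumptions, not from the question:
import pandas as pd
from sqlalchemy import create_engine

# Placeholder credentials and schema; adjust to your server.
engine = create_engine('mysql+pymysql://user:password@localhost/mydb')
df = pd.read_csv('E:/FileLoc/ThisIsAFile.txt', sep='\t', encoding='iso-8859-1')
df.to_sql('transactions', engine, if_exists='replace', index=False)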


How to join/merge and sum columns with the same name

How can I merge and sum the columns with the same name?
So the output should be one column named Canada containing the sum of the four columns named Canada.
Country/Region Brazil Canada Canada Canada Canada
Week 1 0 3 0 0 0
Week 2 0 17 0 0 0
Week 3 0 21 0 0 0
Week 4 0 21 0 0 0
Week 5 0 23 0 0 0
Week 6 0 80 0 5 0
Week 7 0 194 0 20 0
Week 8 12 702 3 199 20
Week 9 182 2679 16 2395 260
Week 10 737 8711 80 17928 892
Week 11 1674 25497 153 48195 1597
Week 12 2923 46392 175 85563 2003
Week 13 4516 76095 182 122431 2180
Week 14 6002 105386 183 163539 2431
Week 15 6751 127713 189 210409 2995
Week 16 7081 147716 189 258188 3845
From its current state, this should give the outcome you're looking for:
df = df.set_index('Country/Region') # optional
df.groupby(df.columns, axis=1).sum() # Stolen from Scott Boston as it's a superior method.
Output:
index Brazil Canada
Country/Region
Week 1 0 3
Week 2 0 17
Week 3 0 21
Week 4 0 21
Week 5 0 23
Week 6 0 85
Week 7 0 214
Week 8 12 924
Week 9 182 5350
Week 10 737 27611
Week 11 1674 75442
Week 12 2923 134133
Week 13 4516 200888
Week 14 6002 271539
Week 15 6751 341306
Week 16 7081 409938
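Note that groupby(..., axis=1) is deprecated in recent pandas releases; grouping on the transposed frame should give the same result:
# Same aggregation without axis=1 (deprecated in pandas >= 2.1):
df.T.groupby(level=0).sum().T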
I found your dataset interesting; here's how I would clean it up from step 1:
import numpy as np
import pandas as pd

df = pd.read_csv('file.csv')
df = df.set_index(['Province/State', 'Country/Region', 'Lat', 'Long']).stack().reset_index()
df.columns = ['Province/State', 'Country/Region', 'Lat', 'Long', 'date', 'value']
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
df = df.pivot_table(index=df.index, columns='Country/Region', values='value', aggfunc=np.sum)
print(df)
Output:
Country/Region Afghanistan Albania Algeria Andorra Angola ... West Bank and Gaza Western Sahara Yemen Zambia Zimbabwe
date ...
2020-01-22 0 0 0 0 0 ... 0 0 0 0 0
2020-01-23 0 0 0 0 0 ... 0 0 0 0 0
2020-01-24 0 0 0 0 0 ... 0 0 0 0 0
2020-01-25 0 0 0 0 0 ... 0 0 0 0 0
2020-01-26 0 0 0 0 0 ... 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ...
2020-07-30 36542 5197 29831 922 1109 ... 11548 10 1726 5555 3092
2020-07-31 36675 5276 30394 925 1148 ... 11837 10 1728 5963 3169
2020-08-01 36710 5396 30950 925 1164 ... 12160 10 1730 6228 3659
2020-08-02 36710 5519 31465 925 1199 ... 12297 10 1734 6347 3921
2020-08-03 36747 5620 31972 937 1280 ... 12541 10 1734 6580 4075
If you now want to do weekly aggregations, it's as simple as:
print(df.resample('w').sum())
Output:
Country/Region Afghanistan Albania Algeria Andorra Angola ... West Bank and Gaza Western Sahara Yemen Zambia Zimbabwe
date ...
2020-01-26 0 0 0 0 0 ... 0 0 0 0 0
2020-02-02 0 0 0 0 0 ... 0 0 0 0 0
2020-02-09 0 0 0 0 0 ... 0 0 0 0 0
2020-02-16 0 0 0 0 0 ... 0 0 0 0 0
2020-02-23 0 0 0 0 0 ... 0 0 0 0 0
2020-03-01 7 0 6 0 0 ... 0 0 0 0 0
2020-03-08 10 0 85 7 0 ... 43 0 0 0 0
2020-03-15 57 160 195 7 0 ... 209 0 0 0 0
2020-03-22 175 464 705 409 5 ... 309 0 0 11 7
2020-03-29 632 1142 2537 1618 29 ... 559 0 0 113 31
2020-04-05 1783 2000 6875 2970 62 ... 1178 4 0 262 59
2020-04-12 3401 2864 11629 4057 128 ... 1847 30 3 279 84
2020-04-19 5838 3603 16062 4764 143 ... 2081 42 7 356 154
2020-04-26 8918 4606 21211 5087 174 ... 2353 42 7 541 200
2020-05-03 15149 5391 27943 5214 208 ... 2432 42 41 738 244
2020-05-10 25286 5871 36315 5265 274 ... 2607 42 203 1260 241
2020-05-17 39634 6321 45122 5317 327 ... 2632 42 632 3894 274
2020-05-24 61342 6798 54185 5332 402 ... 2869 45 1321 5991 354
2020-05-31 91885 7517 62849 5344 536 ... 3073 63 1932 7125 894
2020-06-07 126442 8378 68842 5868 609 ... 3221 63 3060 7623 1694
2020-06-14 159822 9689 74147 5967 827 ... 3396 63 4236 8836 2335
2020-06-21 191378 12463 79737 5981 1142 ... 4466 63 6322 9905 3089
2020-06-28 210487 15349 87615 5985 1522 ... 10242 70 7360 10512 3813
2020-07-05 224560 18707 102918 5985 2186 ... 21897 70 8450 11322 4426
2020-07-12 237087 22399 124588 5985 2940 ... 36949 70 9489 13002 6200
2020-07-19 245264 26845 149611 6098 4279 ... 52323 70 10855 16350 9058
2020-07-26 250970 31255 178605 6237 5919 ... 68154 70 11571 26749 14933
2020-08-02 255739 36370 208457 6429 7648 ... 80685 70 12023 38896 22241
2020-08-09 36747 5620 31972 937 1280 ... 12541 10 1734 6580 4075
Try:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 100, (20, 5)), columns=[*'ZAABC'])
df.groupby(df.columns, axis=1, sort=False).sum()
Output:
Z A B C
0 44 111 67 67
1 9 104 36 87
2 70 176 12 58
3 65 126 46 88
4 81 62 77 72
5 9 100 69 79
6 47 146 99 88
7 49 48 19 14
8 39 97 9 57
9 32 105 23 35
10 75 83 34 0
11 0 89 5 38
12 17 83 42 58
13 31 66 41 57
14 35 57 82 91
15 0 113 53 12
16 42 159 68 6
17 68 50 76 52
18 78 35 99 58
19 23 92 85 48
You can try a transpose and groupby, e.g. something similar to the below.
df_T = df.transpose()
df_T.groupby(df_T.index).sum().loc['Canada']
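As a quick self-contained check of this approach (the miniature frame below is made-up illustrative data, not the full dataset from the question):
import pandas as pd

# Miniature frame with duplicated 'Canada' columns (illustrative values).
df = pd.DataFrame([[0, 3, 0, 0], [12, 702, 3, 199]],
                  index=['Week 1', 'Week 8'],
                  columns=['Brazil', 'Canada', 'Canada', 'Canada'])
df_T = df.transpose()
print(df_T.groupby(df_T.index).sum().loc['Canada'])
# Week 1      3
# Week 8    904
# Name: Canada, dtype: int64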
Here's a way to do it:
df.columns = [(col + str(i)) if col.startswith('Canada') else col for i, col in enumerate(df.columns)]
df = df.assign(Canada=df.filter(like='Canada').sum(axis=1)).drop(columns=[x for x in df.columns if x.startswith('Canada') and x != 'Canada'])
First we rename the columns starting with Canada by appending their integer position, which ensures they are no longer duplicates.
Then we use sum() to add across columns like Canada, put the result in a new column named Canada, and drop the columns that were originally named Canada.
Full test code is:
import pandas as pd
df = pd.DataFrame(
    columns=[x.strip() for x in 'Brazil Canada Canada Canada Canada'.split()],
    index=['Week ' + str(i) for i in range(1, 17)],
    data=[[i] * 5 for i in range(1, 17)])
df.columns.names = ['Country/Region']
print(df)
df.columns = [(col + str(i)) if col.startswith('Canada') else col for i, col in enumerate(df.columns)]
df = df.assign(Canada=df.filter(like='Canada').sum(axis=1)).drop(columns=[x for x in df.columns if x.startswith('Canada') and x != 'Canada'])
print(df)
Output:
Country/Region Brazil Canada Canada Canada Canada
Week 1 1 1 1 1 1
Week 2 2 2 2 2 2
Week 3 3 3 3 3 3
Week 4 4 4 4 4 4
Week 5 5 5 5 5 5
Week 6 6 6 6 6 6
Week 7 7 7 7 7 7
Week 8 8 8 8 8 8
Week 9 9 9 9 9 9
Week 10 10 10 10 10 10
Week 11 11 11 11 11 11
Week 12 12 12 12 12 12
Week 13 13 13 13 13 13
Week 14 14 14 14 14 14
Week 15 15 15 15 15 15
Week 16 16 16 16 16 16
Brazil Canada
Week 1 1 4
Week 2 2 8
Week 3 3 12
Week 4 4 16
Week 5 5 20
Week 6 6 24
Week 7 7 28
Week 8 8 32
Week 9 9 36
Week 10 10 40
Week 11 11 44
Week 12 12 48
Week 13 13 52
Week 14 14 56
Week 15 15 60
Week 16 16 64

Transform each group in a DataFrame

I have the following DataFrame:
id x y timestamp sensorTime
1 32 30 1031 2002
1 4 105 1035 2005
1 8 110 1050 2006
2 18 10 1500 3600
2 40 20 1550 3610
2 80 10 1450 3620
....
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1,1,1,2,2,2], [32,4,8,18,40,80], [30,105,110,10,20,10], [1031,1035,1050,1500,1550,1450], [2002, 2005, 2006, 3600, 3610, 3620]])).T
df.columns = ['id', 'x', 'y', 'timestamp', 'sensorTime']
For each id group, I would like to add the differences of sensorTime to the first value of timestamp. Something like the following:
start = df.iloc[0]['timestamp']
df['sensorTime'] -= df.iloc[0]['sensorTime']
df['sensorTime'] += start
But I would like to do this for each id group separately.
The resulting DataFrame should be:
id x y timestamp sensorTime
1 32 30 1031 1031
1 4 105 1035 1034
1 8 110 1050 1035
2 18 10 1500 1500
2 40 20 1550 1510
2 80 10 1450 1520
....
How can this operation be done per group?
df
id x y timestamp sensorTime
0 1 32 30 1031 2002
1 1 4 105 1035 2005
2 1 8 110 1050 2006
3 2 18 10 1500 3600
4 2 40 20 1550 3610
5 2 80 10 1450 3620
You can group by id and select both timestamp and sensorTime. Then you can use diff to get the difference of sensorTime. The first value will be NaN, so replace it with the first value of timestamp for that group. Then a simple cumsum gives the desired output.
def func(x):
    diff = x['sensorTime'].diff()
    diff.iloc[0] = x['timestamp'].iloc[0]
    return diff.cumsum().to_frame()
df['sensorTime'] = df.groupby('id')[['timestamp', 'sensorTime']].apply(func)
df
id x y timestamp sensorTime
0 1 32 30 1031 1031.0
1 1 4 105 1035 1034.0
2 1 8 110 1050 1035.0
3 2 18 10 1500 1500.0
4 2 40 20 1550 1510.0
5 2 80 10 1450 1520.0
You could run groupby twice: first to get the difference in sensorTime, then to do the cumulative sum:
import numpy as np

box = df.groupby("id").sensorTime.transform("diff")
df.assign(
    new_sensorTime=np.where(box.isna(), df.timestamp, box),
    new=lambda x: x.groupby("id")["new_sensorTime"].cumsum(),
).drop(columns="new_sensorTime")
id x y timestamp sensorTime new
0 1 32 30 1031 2002 1031.0
1 1 4 105 1035 2005 1034.0
2 1 8 110 1050 2006 1035.0
3 2 18 10 1500 3600 1500.0
4 2 40 20 1550 3610 1510.0
5 2 80 10 1450 3620 1520.0
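For completeness, the question's own three-line snippet also maps directly onto groupby().transform(); a sketch on the same data:
import numpy as np
import pandas as pd

# Rebuild the question's frame.
df = pd.DataFrame(np.array([[1, 1, 1, 2, 2, 2],
                            [32, 4, 8, 18, 40, 80],
                            [30, 105, 110, 10, 20, 10],
                            [1031, 1035, 1050, 1500, 1550, 1450],
                            [2002, 2005, 2006, 3600, 3610, 3620]])).T
df.columns = ['id', 'x', 'y', 'timestamp', 'sensorTime']

# Per group: sensorTime minus its first value, shifted to start at the
# group's first timestamp -- the same logic as the question's snippet.
first_ts = df.groupby('id')['timestamp'].transform('first')
first_st = df.groupby('id')['sensorTime'].transform('first')
df['sensorTime'] = df['sensorTime'] - first_st + first_ts
print(df)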

Compare two files if they both match first column then replace the values of column 2 and 3 (Python)

There are two files. If the ID number matches in both files, then I want to take only Value 1 and Value 2 from File2.txt. Please let me know if my question is unclear.
File1.txt
ID Number Value 1 Value 2 Country
0001 23 55 Spain
0231 15 23 USA
4213 10 11 Canada
7541 32 29 Italy
File2.txt
0001 5 6
0231 7 18
4213 54 87
5554 12 10
1111 31 13
6422 66 51
The output should look like this.
ID Number Value 1 Value 2 Country
0001 5 6 Spain
0231 7 18 USA
4213 54 87 Canada
7541 32 29 Italy
New example:
File3.txt
#ID CAT CHN LC SC LATITUDE LONGITUDE
20022 CX 21 -- 4 32.739000 -114.635700
01711 CX 21 -- 3 32.779700 -115.567500
08433 CX 21 -- 2 31.919930 -123.321000
File4.txt
20022,32.45,-114.88
01192,32.839,-115.487
01711,32.88,-115.45
01218,32.717,-115.637
output
#ID CAT CHN LC SC LATITUDE LONGITUDE
20022 CX 21 -- 4 32.45 -114.88
01711 CX 21 -- 3 32.88 -115.45
08433 CX 21 -- 2 31.919930 -123.321000
Code I got so far
f = open("File3.txt", "r")
x= open("File4.txt","r")
df1 = pd.read_csv(f, sep=' ', engine='python')
df2 = pd.read_csv(x, sep=' ', header=None, engine='python')
df2 = df2.set_index(0).rename_axis("#ID")
df2 = df2.rename(columns={5:'LATITUDE', 6: 'LONGITUDE'})
df1 = df1.set_index('#ID')
df1.update(df2)
print(df1)
Something like this, possibly:
file1_data = []
file1_headers = []
with open("File1.txt") as file1:
    for line in file1:
        file1_data.append(line.strip().split("\t"))
file1_headers = file1_data[0]
del file1_data[0]

file2_data = []
with open("File2.txt") as file2:
    for line in file2:
        file2_data.append(line.strip().split("\t"))
file2_ids = [x[0] for x in file2_data]

final_data = [file1_headers] + file1_data
for i in range(1, len(final_data)):
    if final_data[i][0] in file2_ids:
        match = [x for x in file2_data if x[0] == final_data[i][0]]
        final_data[i] = match[0] + [final_data[i][3]]  # ID, new values, original country

with open("output.txt", "w") as output:
    output.writelines(["\t".join(x) + "\n" for x in final_data])
final_data starts as the header row plus the rows of file1_data, and then rows whose IDs appear in file2_data are replaced with the matching file2 row, keeping the country column.
Okay, what you need to do here is get the indexes to match in both dataframes after importing. This is important because pandas aligns data based on indexes.
Here is a complete example using your data:
from io import StringIO
import pandas as pd
File1txt=StringIO("""ID Number Value 1 Value 2 Country
0001 23 55 Spain
0231 15 23 USA
4213 10 11 Canada
7541 32 29 Italy""")
File2txt = StringIO("""0001 5 6
0231 7 18
4213 54 87
5554 12 10
1111 31 13
6422 66 51""")
df1 = pd.read_csv(File1txt, sep=r'\s\s+', engine='python')
df2 = pd.read_csv(File2txt, sep=r'\s\s+', header=None, engine='python')
print(df1)
# ID Number Value 1 Value 2 Country
# 0 1 23 55 Spain
# 1 231 15 23 USA
# 2 4213 10 11 Canada
# 3 7541 32 29 Italy
print(df2)
# 0 1 2
# 0 1 5 6
# 1 231 7 18
# 2 4213 54 87
# 3 5554 12 10
# 4 1111 31 13
# 5 6422 66 51
df2 = df2.set_index(0).rename_axis('ID Number')
df2 = df2.rename(columns={1:'Value 1', 2: 'Value 2'})
df1 = df1.set_index('ID Number')
df1.update(df2)
print(df1.reset_index())
Output:
ID Number Value 1 Value 2 Country
0 1 5.0 6.0 Spain
1 231 7.0 18.0 USA
2 4213 54.0 87.0 Canada
3 7541 32.0 29.0 Italy
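The newer File3/File4 example in the question follows the same pattern; a sketch, assuming File4 is comma-separated and the zero-padded IDs should be kept as strings:
import pandas as pd

df1 = pd.read_csv('File3.txt', sep=r'\s+', dtype={'#ID': str})
df2 = pd.read_csv('File4.txt', header=None,
                  names=['#ID', 'LATITUDE', 'LONGITUDE'], dtype={'#ID': str})

# Align both frames on #ID so update() can match rows.
df1 = df1.set_index('#ID')
df2 = df2.set_index('#ID')
df1.update(df2)
print(df1.reset_index())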

Pandas Collapse and Stack Multi-level columns

I want to break down the multi-level columns and have them as column values.
Original data input (Excel): [screenshot not included]
As read in dataframe:
Company Name Company code 2017-01-01 00:00:00 Unnamed: 3 Unnamed: 4 Unnamed: 5 2017-02-01 00:00:00 Unnamed: 7 Unnamed: 8 Unnamed: 9 2017-03-01 00:00:00 Unnamed: 11 Unnamed: 12 Unnamed: 13
0 NaN NaN Product A Product B Product C Product D Product A Product B Product C Product D Product A Product B Product C Product D
1 Company A #123 1 5 3 5 0 2 3 4 0 1 2 3
2 Company B #124 600 208 30 20 600 213 30 15 600 232 30 12
3 Company C #125 520 112 47 15 520 110 47 10 520 111 47 15
4 Company D #126 420 165 120 31 420 195 120 30 420 182 120 58
Intended data frame: [screenshot not included]
I have tried stack() and unstack() and also swaplevel(), but I couldn't get the dates column to 'drop' into the rows. It looks like the merged cells in Excel produce NaN in the dataframe, and when it's the columns that are merged, I get unnamed columns. How do I work around it? Am I missing something really simple here?
Using stack
df.stack(level=0).reset_index(level=1)
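The one-liner assumes the columns already form a (date, product) MultiIndex. A sketch of building that from the frame exactly as shown above (the 'Unnamed: n' labels come from the merged Excel cells; df here is the frame from the question):
import pandas as pd

body = df.iloc[1:].set_index(['Company Name', 'Company code'])
products = df.iloc[0, 2:]            # row 0 holds the product names
dates = pd.Series(df.columns[2:])
# Forward-fill the date labels across the 'Unnamed: n' gaps.
dates = dates.mask(dates.astype(str).str.startswith('Unnamed')).ffill()
body.columns = pd.MultiIndex.from_arrays([dates, products])
out = body.stack(level=0).reset_index(level=2)   # dates drop into the rows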

Sorting Pandas Dataframe by Category

I have a .xlsx file with 9 variables and 898 observations. I read in the .xlsx file and parsed it into a pandas dataframe. I tried sorting the pandas dataframe column called product_id in ascending order, but got all the columns in the result.
I followed the advice from another link, but still got an error.
Question: How can I get the top 10 highest occurring values from the product_id category in ascending order?
import pandas as pd
import xlrd
#Import data
trans = pd.ExcelFile('file.xlsx')
#parse xlsx file into dataframe
transdata = trans.parse('Orders')
#view head of dataframe
print transdata.head()
site_id visitor_id transaction_id transaction_date product_id price \
0 3 10001 20001 2014-10-31 48165 150
1 3 10002 20002 2014-10-31 48162 128
2 3 10002 20003 2014-10-30 48165 150
3 3 10003 20004 2014-10-31 48815 98
4 3 10003 20005 2014-10-29 48165 150
units sales_tax total
0 1 12.38 162.38
1 1 10.56 138.56
2 1 12.38 162.38
3 1 8.09 106.09
4 1 12.38 162.38
grouped = transdata.groupby(['product_id']).size()
print grouped
product_id
36959 78
44524 12
45956 33
46814 11
48162 50
48165 100
48412 12
48478 23
48500 13
48528 14
48552 101
48587 106
48593 104
48628 4
48810 25
48814 16
48815 33
48823 20
49418 11
49444 12
49882 102
51184 2
51380 15
dtype: int64
EDIT: I tried sorting the pandas dataframe by the product_id category but got rankings of all the columns.
grouped = transdata.groupby(['product_id'])
counts = grouped.size().sort()
result = counts.head(10).index
print result
site_id visitor_id transaction_id transaction_date product_id price \
0 3 10001 20001 2014-10-31 48165 150
1 3 10002 20002 2014-10-31 48162 128
2 3 10002 20003 2014-10-30 48165 150
3 3 10003 20004 2014-10-31 48815 98
4 3 10003 20005 2014-10-29 48165 150
units sales_tax total
0 1 12.38 162.38
1 1 10.56 138.56
2 1 12.38 162.38
3 1 8.09 106.09
4 1 12.38 162.38
Traceback (most recent call last):
File "Trending.py", line 14, in <module>
result = counts.head(10).index
AttributeError: 'NoneType' object has no attribute 'head'
Desired output: a vector with the top highest-occurring values from the product_id category.
product_id
48587
48593
49882
48552
48165
36959
48162
45956
48815
48478
From this:
grouped = transdata.groupby(by=['product_id'])
you just need:
counts = grouped.size()
counts.sort_values(ascending=False).head(10)
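In recent pandas, Series.sort() has been removed (it sorted in place and returned None, which is exactly the AttributeError in the traceback above); value_counts() gives the same top-10 in one step, sorted in descending order of frequency by default:
transdata['product_id'].value_counts().head(10)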
