Data starting/ending with 0 (string) recognized as different data when using Pandas query - python

I'd like to read 2 CSV files and compare them.
I need to process the data, so I read them as strings.
To compare, I use Pandas query and output the rows that don't match.
When a row contains a value that starts with 0 and ends with 0, it is reported as not matched, although the two files hold exactly the same data.
Do you have any idea why it is recognized as different data?
data
csv_1
id, id_2, name, unit, dep, sales, code, others
1, 22, apple, 100, 243837463, 89.90, 0214008000000, 88899
2, 23, orange, 403, 111839281, 10.10, 7474836251038465, 80000
csv_2
id, id_2, name, unit, department, sales, special_code, others
1, 22, apple, 100, 243837463, 89.900, 0214008000000, 88899
2, 23, orange, 403, 111839281, 10.10, 7474836251038465, 80300
3, 24, banana, 909, 784635281, 30.17, 6241325251038465, 80000
code
df1 = pd.read_csv(csv_1, header=0, dtype=str, na_filter=False)
df2 = pd.read_csv(csv_2, header=0, dtype=str, na_filter=False)
df1.reset_index(drop=True, inplace=True)
df2.reset_index(drop=True, inplace=True)
df1 = df1.applymap(str.strip)  # trim
df2 = df2.applymap(str.strip)  # trim
....other data processing
dfx = pd.merge(df1, df2, left_on=['id', 'id_2'], right_on=['id', 'id_2'], suffixes=['_data1', '_data2'])
print(dfx.query('name_data1 != name_data2'))
print(dfx.query('unit_data1 != unit_data2'))
print(dfx.query('dep != department'))
print(dfx.query('master_jan != jan_dh'))
print(dfx.query('sales_data1 != sales_data2'))
print(dfx.query('code != special_code'))
print(dfx.query('others_data1 != others_data2'))
output
name_data1 unit_data1 dep... code...
1 apple, 100, 243837463...0214008000000
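Not part of the original post, but a small diagnostic sketch that may help narrow this down: printing the repr of the two columns makes invisible differences (stray whitespace, BOM characters, full-width digits) visible, which str.strip alone would not reveal.
# Hypothetical diagnostic: show the raw representation of every pair that
# query reports as different, so hidden characters become visible.
for a, b in zip(dfx['code'], dfx['special_code']):
    if a != b:
        print(repr(a), repr(b))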

Related

Python/Pandas formatting values of a column if its header contains "Price"

Using pandas, I have read an .xlsx file that contains 4 columns: ID, Product, Buy Price and Sell Price.
I would like to format values under the columns that contain "Price" in their headers in the following way:
1399 would become $1,399.00
1538.9 would become $1,538.90
I understand how to address the column headers and impose the desired condition, but I don't know how to format the values themselves. This is how far I got:
for col in df.columns:
    if "Price" in col:
        print("This header has 'Price' in it")
    else:
        print(col)
ID
Name
This header has 'Price' in it
This header has 'Price' in it
How can I do this?
Try:
for col in df.columns:
    if "Price" in col:
        print("This header has 'Price' in it")
        df[col] = df[col].map('${:,.2f}'.format)
    else:
        print(col)
Or, if it is possible to get all the relevant column names at once, use DataFrame.applymap:
cols = df.filter(like='Price').columns
df[cols] = df[cols].applymap('${:,.2f}'.format)
In the string formatting, the comma inserts a thousands separator, and .2f formats the floats to 2 decimal places, showing cents.
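As a quick check of the format string on one of the sample values:
print('${:,.2f}'.format(1538.9))  # prints $1,538.90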
I suggest you use py-moneyed; see below how to use it to transform a number into a string representing money:
import pandas as pd
from moneyed import Money, USD
res = pd.Series(data=[1399, 1538.9]).map(lambda x: str(Money(x, USD)))
print(res)
Output
0 $1,399.00
1 $1,538.90
dtype: object
Full code
import pandas as pd
from moneyed import Money, USD
# toy data
columns = ["ID", "Product", "Buy Price", "Sell Price"]
df = pd.DataFrame(data=[[0, 0, 1399, 1538.9]], columns=columns)
# find columns with Price in it
filtered = df.columns.str.contains("Price")
# transform the values of those columns
df.loc[:, filtered] = df.loc[:, filtered].applymap(lambda x: str(Money(x, USD)))
print(df)
Output
ID Product Buy Price Sell Price
0 0 0 $1,399.00 $1,538.90

Pyspark dataframe join based on key, group by and max

I have two parquet files, which I load with spark.read. These 2 dataframes have a common column named key, so I join them with:
df = df.join(df2, on=['key'], how='inner')
df columns are: ["key", "Duration", "Distance"] and df2: ["key", "department id"]. At the end I want to print Duration, max(Distance), department id, grouped by department id. What I have done so far is:
df.join(df.groupBy('departmentid').agg(F.max('Distance').alias('Distance')), on='Distance', how='leftsemi').show()
but I think it is too slow. Is there a faster way to achieve my goal?
Thanks in advance.
EDIT: sample (first 2 lines of each file)
df:
369367789289,2015-03-27 18:29:39,2015-03-27 19:08:28,-73.975051879882813,40.760562896728516,-73.847900390625,40.732685089111328,34.8
369367789290,2015-03-27 18:29:40,2015-03-27 18:38:35,-73.988876342773438,40.77423095703125,-73.985160827636719,40.763439178466797,11.16
df1:
369367789289,1
369367789290,2
Each column is separated by ","; the first column in both files is my key, then I have timestamps, longitudes and latitudes. In the second file I have only the key and the department id.
To create Distance I am using a function called formater. This is how I get my distance and duration:
df = df.filter("_c3!=0 and _c4!=0 and _c5!=0 and _c6!=0")
df = df.withColumn("_c0", df["_c0"].cast(LongType()))
df = df.withColumn("_c1", df["_c1"].cast(TimestampType()))
df = df.withColumn("_c2", df["_c2"].cast(TimestampType()))
df = df.withColumn("_c3", df["_c3"].cast(DoubleType()))
df = df.withColumn("_c4", df["_c4"].cast(DoubleType()))
df = df.withColumn("_c5", df["_c5"].cast(DoubleType()))
df = df.withColumn("_c6", df["_c6"].cast(DoubleType()))
df = df.withColumn('Distance', formater(df._c3,df._c5,df._c4,df._c6))
df = df.withColumn('Duration', F.unix_timestamp(df._c2) -F.unix_timestamp(df._c1))
And then, as I showed above:
df = df.join(vendors, on=['key'], how='inner')
df.registerTempTable("taxi")
df.join(df.groupBy('vendor').agg(F.max('Distance').alias('Distance')),on='Distance',how='leftsemi').show()
The output must be
Distance Duration department id
grouped by department id, getting only the row with max(Distance).
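No answer was given in the thread, but one commonly faster pattern for "row with the max per group" is a window function instead of a self-join. A minimal sketch, assuming the column names from the question's first attempt (departmentid, Distance, Duration):
from pyspark.sql import Window, functions as F

# Rank rows within each department by Distance and keep only the largest one.
w = Window.partitionBy('departmentid').orderBy(F.col('Distance').desc())
result = (df.withColumn('rn', F.row_number().over(w))
            .filter(F.col('rn') == 1)
            .select('Duration', 'Distance', 'departmentid'))
result.show()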

Pandas- How to save frequencies of different values in different columns line by line in a csv file (including 0 frequencies)

I have a CSV file with the following columns of interest
fields = ['column_0', 'column_1', 'column_2', 'column_3', 'column_4', 'column_5', 'column_6', 'column_7', 'column_8', 'column_9']
For each of these columns there are 153 lines of data, containing only two values: -1 or +1.
My problem is that, for each column, I would like to save the frequencies of the -1 and +1 values in comma-separated style, line by line, in a CSV file. I run into the following problems when I do this:
>>>df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
>>>print df['column_2'].value_counts()
1 148
-1 5
>>>df['column_2'].value_counts().to_csv('result.txt', index=False )
Then, when I open result.txt, here is what I find:
148
5
which is obviously not what I want; I want the values on the same line of the text file, separated by a comma (e.g., 148,5).
The second problem happens when one of the frequencies is zero:
>>> print df['column_9'].value_counts()
1 153
>>> df['column_9'].value_counts().to_csv('result.txt', index=False )
Then, when I open result.txt, here is what I find:
153
I also don't want that behavior; I would like to see 153,0.
So, in summary, I would like to know how to do the following with Pandas:
Given one column, save the frequencies of its different values on the same line of a CSV file, separated by commas. For example:
148,5
If there is a value with frequency 0, put that in the CSV as well. For example:
153,0
Append these frequency values on different lines of the same CSV file. For example:
148,5
153,0
Can I do that with pandas, or should I move to another Python library?
Example with some dummy data:
import pandas as pd
df = pd.DataFrame({'col1': [1, 1, 1, -1, -1, -1],
                   'col2': [1, 1, 1, 1, 1, 1],
                   'col3': [-1, 1, -1, 1, -1, -1]})
counts = df.apply(pd.Series.value_counts).fillna(0).T
print(counts)
Output:
-1 1
col1 3.0 3.0
col2 0.0 6.0
col3 4.0 2.0
You can then export this to csv.
See this answer for ref:
How to get value counts for multiple columns at once in Pandas DataFrame?
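A possible final step, not shown in the answer above (the file name is just a placeholder): reorder the columns so the +1 count comes first, cast to int, and write without header or index to get the 148,5-style lines asked for:
counts[[1, -1]].astype(int).to_csv('result.csv', header=False, index=False)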
I believe you could do what you want like this
import io
import pandas as pd

df = pd.DataFrame({'column_1': [1, -1, 1], 'column_2': [1, 1, 1]})
with io.StringIO() as stream:
    # it's easier to transpose a dataframe so that the number of rows become columns
    # .to_frame to DataFrame and .T to transpose
    df['column_1'].value_counts().to_frame().T.to_csv(stream, index=False)
    print(stream.getvalue())  # check the csv data
But I would suggest something like this, since otherwise you would have to specify by hand that one of the expected values is missing:
with io.StringIO() as stream:
    # it's easier to transpose a dataframe so that the number of rows become columns
    counts = df[['column_1', 'column_2']].apply(lambda column: column.value_counts())
    counts = counts.fillna(0)
    counts.T.to_csv(stream, index=False)
    print(stream.getvalue())  # check the csv data
Here is an example with three columns c1, c2, c3 and a data frame d, which is defined before the function is invoked.
import pandas as pd
import collections

def wcsv(d):
    dc = [dict(collections.Counter(d[i])) for i in d.columns]
    for i in dc:
        if -1 not in list(i.keys()):
            i[-1] = 0
        if 1 not in list(i.keys()):
            i[1] = 0
    # build each row in a fixed (1, -1) order so the values line up with the column labels
    w = pd.DataFrame([[j[1], j[-1]] for j in dc], columns=['1', '-1'], index=['c1', 'c2', 'c3'])
    w.to_csv("t.csv")

d = pd.DataFrame([[1, 1, -1], [-1, 1, 1], [1, 1, -1], [1, 1, -1]], columns=['c1', 'c2', 'c3'])
wcsv(d)
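With the example data above (and the fixed column ordering), t.csv would contain:
,1,-1
c1,3,1
c2,4,0
c3,1,3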

Creating new columns in a csv file using data from a different csv file

I have this Data Science problem where I need to create a test set using info provided in two csv files.
Problem
data1.csv
cat,In1,In2
aaa, 0, 1
aaa, 2, 1
aaa, 2, 0
aab, 3, 2
aab, 1, 2
data2.csv
cat,index,attribute1,attribute2
aaa, 0, 150, 450
aaa, 1, 250, 670
aaa, 2, 30, 250
aab, 0, 60, 650
aab, 1, 50, 30
aab, 2, 20, 680
aab, 3, 380, 250
From these two files, what I need is an updated data1.csv file where, in place of In1 and In2, I have the attributes of the specific indices (In1 and In2) under a specific category (cat).
Note: all the indices in a specific category (cat) have their own attributes.
Result should look like this,
updated_data1.csv
cat,In1a1,In1a2,In2a1,In2a2
aaa, 150, 450, 250, 670
aaa, 30, 250, 250, 670
aaa, 30, 250, 150, 450
aab, 380, 250, 20, 680
aab, 50, 30, 20, 680
I need an approach to tackle this problem using pandas in Python. So far I have loaded the csv files into my Jupyter notebook, and I have no clue where to start.
Please note this is my first week using Python for data manipulation and I have very little knowledge of Python. Also pardon me for the ugly formatting; I'm using a mobile phone to type this question.
As others have suggested, you can use pd.merge. In this case, you need to merge on multiple columns. Basically you need to define which columns from the left DataFrame (here data1) map to which columns from the right DataFrame (here data2). Also see pandas merging 101.
# Read the csvs
data1 = pd.read_csv('data1.csv')
data2 = pd.read_csv('data2.csv')
# DataFrame with the in1 columns
df1 = pd.merge(left=data1, right=data2, left_on = ['cat','In1'], right_on = ['cat', 'index'])
df1 = df1[['cat','attribute1','attribute2']].set_index('cat')
# DataFrame with the in2 columns
df2 = pd.merge(left=data1, right=data2, left_on = ['cat','In2'], right_on = ['cat', 'index'])
df2 = df2[['cat','attribute1','attribute2']].set_index('cat')
# Join the two dataframes together.
df = pd.concat([df1, df2], axis=1)
# Name the columns as desired
df.columns = ['in1a1', 'in1a2', 'in2a1', 'in2a2']
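If you then want the updated_data1.csv file from the question, a possible final step (not part of the original answer) is:
df.reset_index().to_csv('updated_data1.csv', index=False)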
One should generally try to avoid iterating through DataFrames, because it's not very efficient. But it's definitely a possible solution here.
# Read the csvs
data1 = pd.read_csv('data1.csv')
data2 = pd.read_csv('data2.csv')
# This list will be the data for the resulting DataFrame
rows = []
# Iterate through data1, unpacking values in each row to variables
for idx, cat, in1, in2 in data1.itertuples():
    # Create a dictionary for each row where the keys are the column headers of the future DataFrame
    row = {}
    row['cat'] = cat
    # Pick the correct row from data2
    in1 = (data2['index'] == in1) & (data2['cat'] == cat)
    in2 = (data2['index'] == in2) & (data2['cat'] == cat)
    # Assign the correct values to the keys in the dictionary
    row['in1a1'] = data2.loc[in1, 'attribute1'].values[0]
    row['in1a2'] = data2.loc[in1, 'attribute2'].values[0]
    row['in2a1'] = data2.loc[in2, 'attribute1'].values[0]
    row['in2a2'] = data2.loc[in2, 'attribute2'].values[0]
    # Append the dictionary to the list
    rows.append(row)
# Construct a DataFrame from the list of dictionaries
df = pd.DataFrame(rows)

How to convert csv to dictionary using pandas

How can I convert a csv into a dictionary using pandas? For example I have 2 columns, and would like column1 to be the key and column2 to be the value. My data looks like this:
"name","position"
"UCLA","73"
"SUNY","36"
cols = ['name', 'position']
df = pd.read_csv(filename, names = cols)
Since the 1st line of your sample csv-data is a "header",
you may read it as a pd.Series using the squeeze keyword of pandas.read_csv():
>>> pd.read_csv(filename, index_col=0, header=0, squeeze=True).to_dict()
{'UCLA': 73, 'SUNY': 36}
If you want to include the 1st line as data as well, set header to None.
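Note (not from the original answer): in recent pandas versions the squeeze keyword of read_csv has been removed; calling .squeeze("columns") on the result is the usual replacement:
pd.read_csv(filename, index_col=0).squeeze("columns").to_dict()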
Convert the columns to a list, then zip and convert to a dict:
In [37]:
df = pd.DataFrame({'col1':['first','second','third'], 'col2':np.random.rand(3)})
print(df)
dict(zip(list(df.col1), list(df.col2)))
col1 col2
0 first 0.278247
1 second 0.459753
2 third 0.151873
[3 rows x 2 columns]
Out[37]:
{'third': 0.15187291615699894,
'first': 0.27824681093923298,
'second': 0.4597530377539677}
ankostis' answer is in my opinion the most elegant solution when you have the file on disk.
However, if you do not want to, or cannot, take the detour of saving and loading from the file system, you can also do it like this:
df = pd.DataFrame({"name": [73, 36], "position" : ["UCLA", "SUNY"]})
series = df["position"]
series.index = df["name"]
series.to_dict()
Result:
{'UCLA': 73, 'SUNY': 36}
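For completeness, a shorter equivalent (not from the original answers) that avoids building the Series by hand:
df.set_index("name")["position"].to_dict()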
