BigQuery-SQL - joining 2 table for revenue calculation
up vote
0
down vote
favorite
I've the following 2 tables:
Table1 - raw data without revenue data:
datehour | geo | platform | dimension_4 | dimension_5
metric_1 | metric_2 | billable_actions (int)
Table2 - the revenue data, which is originated from another source
but eventually inserted into a table with the following schema
(please note that it has 2 missing dimensions):
datehour | geo | platform | revenue (float)
I need to merge Table2 into Table1 ,
so I can calculate the revenue only from table1.
THE QUERY
SELECT
datehour,
cc,
platform,
dimension_1,
dimension_2,
billable_actions,
CASE
WHEN revenue > 0 AND billable_actions > 0 THEN (revenue/total_billable_actions)*billable_actions
WHEN metric_1 IS NULL AND metric_2 IS NULL THEN revenue
ELSE 0
END calc_revenue
FROM (
SELECT
IFNULL(e.datehour,d.from_ts) datehour,
IFNULL(e.cc, d.cc) cc,
IFNULL(e.platform, d.platform) platform,
e.dimension1,
e.dimension2,
e.billable_actions,
SUM(e.billable_actions) OVER (PARTITION BY e.datehour, e.platform, e.geo) total_billable_actions,
d.revenue
FROM Table1 e
FULL JOIN Table2 d
ON
e.datehour = d.from_ts AND
e.appkey = d.appkey AND
e.cc = d.cc AND
e.platform = d.platform
)
EXPECTED RESULT & THE PROBLEM
Both tables revenues should be equal after the join.
I'm testing:
select sum(revenue) from Table2 ==
select sum(row_revenue) from JoinedTable
but row_revenue is always lower:
Table2 sum = 0.44449199771042913
Joined table sum = 0.4421989977126941
Can some one please point me for the reason why they are not equal?
I've a working example at the end of the question .
NOTES
note 1 - the revenue is calculated the following way:
1) Table1 FULL JOIN Table2 ON:
T1.datehour = T2.datehour AND
T1.geo = T2.geo AND
T1.platform = T2.platform
*please read note 2,3 regard why I use full join.
2) on Table1 , sum the billable actions, partition by datehour, geo and platform.
SUM(billable_actions) OVER (PARTITION BY datehour, geo, platform)
AS total_billable_actions
3) Now , we have a table , where each row contains the total_billable_actions per the group , the total revenue and the total actions for the row. wrapping it with another select and calculating the row revenue:
row_revenue = (revenue / total_billable_actions) * billable_actions
note 2
we may have rows from Table1 which do not match any row from table2,
but we still need the data found in those rows . those rows will not affect the revenue calculations.
note 3
we may have rows from Table2 which also do not fit the join.
we need those rows , stay untouched , with the same revenue.
note 4 regard floating point issues:
BigQuery uses 8byte double precision for float,
and I've multiplied all results with great numbers (1000, 1000000) to find out if the issue is decimal point precision.
It seems that multiplying the result gives the same exact result.
working example
for simplicity I've taken out one dimension and the other metrics.
Left table:
datehour geo platform dimension_1 billable_actions
15:00 US Android shmoo 10
15:00 US Android foo 5
15:00 US IOS shmoo 8
15:00 US IOS foo 4
15:00 US UNKNOWN shmoo 4
Right table:
datehour geo platform revenue
15:00 US Android 5$
15:00 US IOS 10$
15:00 UNKNOWN IOS 2$
Joined table:
15:00 US Android shmoo 10 15 (5$/15)*10 = 3.333..3
15:00 US Android foo 5 15 (5$/15)*5 = 1.6666..7
15:00 US IOS shmoo 8 12 (10$/12)*8 = 6.6666..7
15:00 US IOS foo 4 12 (10$/12)*4 = 3.3333..3
15:00 US UNKNOWN shmoo 4 4 0$
15:00 UNKNOWN IOS null 0 0 2$
sum(row_revenue) = 17$
Thanks all !
sql join google-bigquery
add a comment |
up vote
0
down vote
favorite
I've the following 2 tables:
Table1 - raw data without revenue data:
datehour | geo | platform | dimension_4 | dimension_5
metric_1 | metric_2 | billable_actions (int)
Table2 - the revenue data, which is originated from another source
but eventually inserted into a table with the following schema
(please note that it has 2 missing dimensions):
datehour | geo | platform | revenue (float)
I need to merge Table2 into Table1 ,
so I can calculate the revenue only from table1.
THE QUERY
SELECT
datehour,
cc,
platform,
dimension_1,
dimension_2,
billable_actions,
CASE
WHEN revenue > 0 AND billable_actions > 0 THEN (revenue/total_billable_actions)*billable_actions
WHEN metric_1 IS NULL AND metric_2 IS NULL THEN revenue
ELSE 0
END calc_revenue
FROM (
SELECT
IFNULL(e.datehour,d.from_ts) datehour,
IFNULL(e.cc, d.cc) cc,
IFNULL(e.platform, d.platform) platform,
e.dimension1,
e.dimension2,
e.billable_actions,
SUM(e.billable_actions) OVER (PARTITION BY e.datehour, e.platform, e.geo) total_billable_actions,
d.revenue
FROM Table1 e
FULL JOIN Table2 d
ON
e.datehour = d.from_ts AND
e.appkey = d.appkey AND
e.cc = d.cc AND
e.platform = d.platform
)
EXPECTED RESULT & THE PROBLEM
Both tables revenues should be equal after the join.
I'm testing:
select sum(revenue) from Table2 ==
select sum(row_revenue) from JoinedTable
but row_revenue is always lower:
Table2 sum = 0.44449199771042913
Joined table sum = 0.4421989977126941
Can some one please point me for the reason why they are not equal?
I've a working example at the end of the question .
NOTES
note 1 - the revenue is calculated the following way:
1) Table1 FULL JOIN Table2 ON:
T1.datehour = T2.datehour AND
T1.geo = T2.geo AND
T1.platform = T2.platform
*please read note 2,3 regard why I use full join.
2) on Table1 , sum the billable actions, partition by datehour, geo and platform.
SUM(billable_actions) OVER (PARTITION BY datehour, geo, platform)
AS total_billable_actions
3) Now , we have a table , where each row contains the total_billable_actions per the group , the total revenue and the total actions for the row. wrapping it with another select and calculating the row revenue:
row_revenue = (revenue / total_billable_actions) * billable_actions
note 2
we may have rows from Table1 which do not match any row from table2,
but we still need the data found in those rows . those rows will not affect the revenue calculations.
note 3
we may have rows from Table2 which also do not fit the join.
we need those rows , stay untouched , with the same revenue.
note 4 regard floating point issues:
BigQuery uses 8byte double precision for float,
and I've multiplied all results with great numbers (1000, 1000000) to find out if the issue is decimal point precision.
It seems that multiplying the result gives the same exact result.
working example
for simplicity I've taken out one dimension and the other metrics.
Left table:
datehour geo platform dimension_1 billable_actions
15:00 US Android shmoo 10
15:00 US Android foo 5
15:00 US IOS shmoo 8
15:00 US IOS foo 4
15:00 US UNKNOWN shmoo 4
Right table:
datehour geo platform revenue
15:00 US Android 5$
15:00 US IOS 10$
15:00 UNKNOWN IOS 2$
Joined table:
15:00 US Android shmoo 10 15 (5$/15)*10 = 3.333..3
15:00 US Android foo 5 15 (5$/15)*5 = 1.6666..7
15:00 US IOS shmoo 8 12 (10$/12)*8 = 6.6666..7
15:00 US IOS foo 4 12 (10$/12)*4 = 3.3333..3
15:00 US UNKNOWN shmoo 4 4 0$
15:00 UNKNOWN IOS null 0 0 2$
sum(row_revenue) = 17$
Thanks all !
sql join google-bigquery
You should use NUMERIC instead of FLOAT64 to rule out precision loss
– Elliott Brossard
Nov 10 at 22:57
A Minimal, Complete, and Verifiable example includes DBMS (with version), a clear specification, cut & paste & runnable code (with small representative input) (format table initializations as tables) & diffable (hence ordered) desired output. But absolute basics of debugging say: Show that your program does what you expect as it goes through (sub)expressions by saying what that is & showing that it actually does it via incremental output. On adding problem code that you can't fix, research (the manual & the web). Repeat, minimizing working & wrong code. Then ask re the (small) difference between working & non-working examples.
– philipxy
Nov 10 at 23:01
@ElliottBrossard . I can't believe I missed it . Thx.
– Amit Triffon
Nov 10 at 23:04
@ElliottBrossard , Hi ! Unfortunately it did not help , I'm getting a very very close result , i've also changed the full join into inner join for a sprcific period I know I don't have any leftovers (both left and right) but the revenue is still very close but not the same . 1) SOURCE 0.20103616 2) QUERY 0.200784255
– Amit Triffon
Nov 11 at 11:47
add a comment |
up vote
0
down vote
favorite
up vote
0
down vote
favorite
I've the following 2 tables:
Table1 - raw data without revenue data:
datehour | geo | platform | dimension_4 | dimension_5
metric_1 | metric_2 | billable_actions (int)
Table2 - the revenue data, which is originated from another source
but eventually inserted into a table with the following schema
(please note that it has 2 missing dimensions):
datehour | geo | platform | revenue (float)
I need to merge Table2 into Table1 ,
so I can calculate the revenue only from table1.
THE QUERY
SELECT
datehour,
cc,
platform,
dimension_1,
dimension_2,
billable_actions,
CASE
WHEN revenue > 0 AND billable_actions > 0 THEN (revenue/total_billable_actions)*billable_actions
WHEN metric_1 IS NULL AND metric_2 IS NULL THEN revenue
ELSE 0
END calc_revenue
FROM (
SELECT
IFNULL(e.datehour,d.from_ts) datehour,
IFNULL(e.cc, d.cc) cc,
IFNULL(e.platform, d.platform) platform,
e.dimension1,
e.dimension2,
e.billable_actions,
SUM(e.billable_actions) OVER (PARTITION BY e.datehour, e.platform, e.geo) total_billable_actions,
d.revenue
FROM Table1 e
FULL JOIN Table2 d
ON
e.datehour = d.from_ts AND
e.appkey = d.appkey AND
e.cc = d.cc AND
e.platform = d.platform
)
EXPECTED RESULT & THE PROBLEM
Both tables revenues should be equal after the join.
I'm testing:
select sum(revenue) from Table2 ==
select sum(row_revenue) from JoinedTable
but row_revenue is always lower:
Table2 sum = 0.44449199771042913
Joined table sum = 0.4421989977126941
Can some one please point me for the reason why they are not equal?
I've a working example at the end of the question .
NOTES
note 1 - the revenue is calculated the following way:
1) Table1 FULL JOIN Table2 ON:
T1.datehour = T2.datehour AND
T1.geo = T2.geo AND
T1.platform = T2.platform
*please read note 2,3 regard why I use full join.
2) on Table1 , sum the billable actions, partition by datehour, geo and platform.
SUM(billable_actions) OVER (PARTITION BY datehour, geo, platform)
AS total_billable_actions
3) Now , we have a table , where each row contains the total_billable_actions per the group , the total revenue and the total actions for the row. wrapping it with another select and calculating the row revenue:
row_revenue = (revenue / total_billable_actions) * billable_actions
note 2
we may have rows from Table1 which do not match any row from table2,
but we still need the data found in those rows . those rows will not affect the revenue calculations.
note 3
we may have rows from Table2 which also do not fit the join.
we need those rows , stay untouched , with the same revenue.
note 4 regard floating point issues:
BigQuery uses 8byte double precision for float,
and I've multiplied all results with great numbers (1000, 1000000) to find out if the issue is decimal point precision.
It seems that multiplying the result gives the same exact result.
working example
for simplicity I've taken out one dimension and the other metrics.
Left table:
datehour geo platform dimension_1 billable_actions
15:00 US Android shmoo 10
15:00 US Android foo 5
15:00 US IOS shmoo 8
15:00 US IOS foo 4
15:00 US UNKNOWN shmoo 4
Right table:
datehour geo platform revenue
15:00 US Android 5$
15:00 US IOS 10$
15:00 UNKNOWN IOS 2$
Joined table:
15:00 US Android shmoo 10 15 (5$/15)*10 = 3.333..3
15:00 US Android foo 5 15 (5$/15)*5 = 1.6666..7
15:00 US IOS shmoo 8 12 (10$/12)*8 = 6.6666..7
15:00 US IOS foo 4 12 (10$/12)*4 = 3.3333..3
15:00 US UNKNOWN shmoo 4 4 0$
15:00 UNKNOWN IOS null 0 0 2$
sum(row_revenue) = 17$
Thanks all !
sql join google-bigquery
I've the following 2 tables:
Table1 - raw data without revenue data:
datehour | geo | platform | dimension_4 | dimension_5
metric_1 | metric_2 | billable_actions (int)
Table2 - the revenue data, which is originated from another source
but eventually inserted into a table with the following schema
(please note that it has 2 missing dimensions):
datehour | geo | platform | revenue (float)
I need to merge Table2 into Table1 ,
so I can calculate the revenue only from table1.
THE QUERY
SELECT
datehour,
cc,
platform,
dimension_1,
dimension_2,
billable_actions,
CASE
WHEN revenue > 0 AND billable_actions > 0 THEN (revenue/total_billable_actions)*billable_actions
WHEN metric_1 IS NULL AND metric_2 IS NULL THEN revenue
ELSE 0
END calc_revenue
FROM (
SELECT
IFNULL(e.datehour,d.from_ts) datehour,
IFNULL(e.cc, d.cc) cc,
IFNULL(e.platform, d.platform) platform,
e.dimension1,
e.dimension2,
e.billable_actions,
SUM(e.billable_actions) OVER (PARTITION BY e.datehour, e.platform, e.geo) total_billable_actions,
d.revenue
FROM Table1 e
FULL JOIN Table2 d
ON
e.datehour = d.from_ts AND
e.appkey = d.appkey AND
e.cc = d.cc AND
e.platform = d.platform
)
EXPECTED RESULT & THE PROBLEM
Both tables revenues should be equal after the join.
I'm testing:
select sum(revenue) from Table2 ==
select sum(row_revenue) from JoinedTable
but row_revenue is always lower:
Table2 sum = 0.44449199771042913
Joined table sum = 0.4421989977126941
Can some one please point me for the reason why they are not equal?
I've a working example at the end of the question .
NOTES
note 1 - the revenue is calculated the following way:
1) Table1 FULL JOIN Table2 ON:
T1.datehour = T2.datehour AND
T1.geo = T2.geo AND
T1.platform = T2.platform
*please read note 2,3 regard why I use full join.
2) on Table1 , sum the billable actions, partition by datehour, geo and platform.
SUM(billable_actions) OVER (PARTITION BY datehour, geo, platform)
AS total_billable_actions
3) Now , we have a table , where each row contains the total_billable_actions per the group , the total revenue and the total actions for the row. wrapping it with another select and calculating the row revenue:
row_revenue = (revenue / total_billable_actions) * billable_actions
note 2
we may have rows from Table1 which do not match any row from table2,
but we still need the data found in those rows . those rows will not affect the revenue calculations.
note 3
we may have rows from Table2 which also do not fit the join.
we need those rows , stay untouched , with the same revenue.
note 4 regard floating point issues:
BigQuery uses 8byte double precision for float,
and I've multiplied all results with great numbers (1000, 1000000) to find out if the issue is decimal point precision.
It seems that multiplying the result gives the same exact result.
working example
for simplicity I've taken out one dimension and the other metrics.
Left table:
datehour geo platform dimension_1 billable_actions
15:00 US Android shmoo 10
15:00 US Android foo 5
15:00 US IOS shmoo 8
15:00 US IOS foo 4
15:00 US UNKNOWN shmoo 4
Right table:
datehour geo platform revenue
15:00 US Android 5$
15:00 US IOS 10$
15:00 UNKNOWN IOS 2$
Joined table:
15:00 US Android shmoo 10 15 (5$/15)*10 = 3.333..3
15:00 US Android foo 5 15 (5$/15)*5 = 1.6666..7
15:00 US IOS shmoo 8 12 (10$/12)*8 = 6.6666..7
15:00 US IOS foo 4 12 (10$/12)*4 = 3.3333..3
15:00 US UNKNOWN shmoo 4 4 0$
15:00 UNKNOWN IOS null 0 0 2$
sum(row_revenue) = 17$
Thanks all !
sql join google-bigquery
sql join google-bigquery
edited Nov 10 at 23:01
asked Nov 10 at 22:37
Amit Triffon
112
112
You should use NUMERIC instead of FLOAT64 to rule out precision loss
– Elliott Brossard
Nov 10 at 22:57
A Minimal, Complete, and Verifiable example includes DBMS (with version), a clear specification, cut & paste & runnable code (with small representative input) (format table initializations as tables) & diffable (hence ordered) desired output. But absolute basics of debugging say: Show that your program does what you expect as it goes through (sub)expressions by saying what that is & showing that it actually does it via incremental output. On adding problem code that you can't fix, research (the manual & the web). Repeat, minimizing working & wrong code. Then ask re the (small) difference between working & non-working examples.
– philipxy
Nov 10 at 23:01
@ElliottBrossard . I can't believe I missed it . Thx.
– Amit Triffon
Nov 10 at 23:04
@ElliottBrossard , Hi ! Unfortunately it did not help , I'm getting a very very close result , i've also changed the full join into inner join for a sprcific period I know I don't have any leftovers (both left and right) but the revenue is still very close but not the same . 1) SOURCE 0.20103616 2) QUERY 0.200784255
– Amit Triffon
Nov 11 at 11:47
add a comment |
You should use NUMERIC instead of FLOAT64 to rule out precision loss
– Elliott Brossard
Nov 10 at 22:57
A Minimal, Complete, and Verifiable example includes DBMS (with version), a clear specification, cut & paste & runnable code (with small representative input) (format table initializations as tables) & diffable (hence ordered) desired output. But absolute basics of debugging say: Show that your program does what you expect as it goes through (sub)expressions by saying what that is & showing that it actually does it via incremental output. On adding problem code that you can't fix, research (the manual & the web). Repeat, minimizing working & wrong code. Then ask re the (small) difference between working & non-working examples.
– philipxy
Nov 10 at 23:01
@ElliottBrossard . I can't believe I missed it . Thx.
– Amit Triffon
Nov 10 at 23:04
@ElliottBrossard , Hi ! Unfortunately it did not help , I'm getting a very very close result , i've also changed the full join into inner join for a sprcific period I know I don't have any leftovers (both left and right) but the revenue is still very close but not the same . 1) SOURCE 0.20103616 2) QUERY 0.200784255
– Amit Triffon
Nov 11 at 11:47
You should use NUMERIC instead of FLOAT64 to rule out precision loss
– Elliott Brossard
Nov 10 at 22:57
You should use NUMERIC instead of FLOAT64 to rule out precision loss
– Elliott Brossard
Nov 10 at 22:57
A Minimal, Complete, and Verifiable example includes DBMS (with version), a clear specification, cut & paste & runnable code (with small representative input) (format table initializations as tables) & diffable (hence ordered) desired output. But absolute basics of debugging say: Show that your program does what you expect as it goes through (sub)expressions by saying what that is & showing that it actually does it via incremental output. On adding problem code that you can't fix, research (the manual & the web). Repeat, minimizing working & wrong code. Then ask re the (small) difference between working & non-working examples.
– philipxy
Nov 10 at 23:01
A Minimal, Complete, and Verifiable example includes DBMS (with version), a clear specification, cut & paste & runnable code (with small representative input) (format table initializations as tables) & diffable (hence ordered) desired output. But absolute basics of debugging say: Show that your program does what you expect as it goes through (sub)expressions by saying what that is & showing that it actually does it via incremental output. On adding problem code that you can't fix, research (the manual & the web). Repeat, minimizing working & wrong code. Then ask re the (small) difference between working & non-working examples.
– philipxy
Nov 10 at 23:01
@ElliottBrossard . I can't believe I missed it . Thx.
– Amit Triffon
Nov 10 at 23:04
@ElliottBrossard . I can't believe I missed it . Thx.
– Amit Triffon
Nov 10 at 23:04
@ElliottBrossard , Hi ! Unfortunately it did not help , I'm getting a very very close result , i've also changed the full join into inner join for a sprcific period I know I don't have any leftovers (both left and right) but the revenue is still very close but not the same . 1) SOURCE 0.20103616 2) QUERY 0.200784255
– Amit Triffon
Nov 11 at 11:47
@ElliottBrossard , Hi ! Unfortunately it did not help , I'm getting a very very close result , i've also changed the full join into inner join for a sprcific period I know I don't have any leftovers (both left and right) but the revenue is still very close but not the same . 1) SOURCE 0.20103616 2) QUERY 0.200784255
– Amit Triffon
Nov 11 at 11:47
add a comment |
active
oldest
votes
active
oldest
votes
active
oldest
votes
active
oldest
votes
active
oldest
votes
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53244118%2fbigquery-sql-joining-2-table-for-revenue-calculation%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
You should use NUMERIC instead of FLOAT64 to rule out precision loss
– Elliott Brossard
Nov 10 at 22:57
A Minimal, Complete, and Verifiable example includes DBMS (with version), a clear specification, cut & paste & runnable code (with small representative input) (format table initializations as tables) & diffable (hence ordered) desired output. But absolute basics of debugging say: Show that your program does what you expect as it goes through (sub)expressions by saying what that is & showing that it actually does it via incremental output. On adding problem code that you can't fix, research (the manual & the web). Repeat, minimizing working & wrong code. Then ask re the (small) difference between working & non-working examples.
– philipxy
Nov 10 at 23:01
@ElliottBrossard . I can't believe I missed it . Thx.
– Amit Triffon
Nov 10 at 23:04
@ElliottBrossard , Hi ! Unfortunately it did not help , I'm getting a very very close result , i've also changed the full join into inner join for a sprcific period I know I don't have any leftovers (both left and right) but the revenue is still very close but not the same . 1) SOURCE 0.20103616 2) QUERY 0.200784255
– Amit Triffon
Nov 11 at 11:47