Compare two large CSV files and make a third one from the difference

I need to compare two large CSV files (approx. 5 to 8 GB) and create a third CSV file from their difference.
Any suggestions, pointers to suitable libraries, or references to get started would be appreciated.

e.g.

File 1.csv

+---+------+------+
|ID |value1|value2|
+---+------+------+
|  1|     a|  Ran1|
|  2|     b|  Ran2|
+---+------+------+

File 2.csv

+---+------+------+
|ID |value1|value2|
+---+------+------+
|  3|     c|  Ran3|
|  2|     b|  Ran2|
+---+------+------+

The schema of both files is the same.

Result - file 3.csv

+---+------+------+
|ID |value1|value2|
+---+------+------+
|  2|     b|  Ran2|
+---+------+------+

edited Nov 11 at 0:49

asked Nov 9 at 15:32

Shaitender Singh

98651333

For handling the CSV files you may take a look at cormorant. Additionally, given that the files will be large, I wouldn't use the standard Scala collections but instead some form of streaming (for example fs2). Now, I don't really understand what exactly you mean by "difference".
– Luis Miguel Mejía Suárez
Nov 9 at 15:45

For streaming large files, something like Akka Streams is pretty common.
– James Whiteley
Nov 9 at 16:08

By "difference", do you mean a textual difference, or content difference? "4.50" is textually different than "4.5", but is not different if the field is treated as a number.
– Bob Dalgleish
Nov 9 at 17:35

If there is a difference, how do you resynchronize the two sources? Do you look for the next non-differing line, or do you have key fields, such as a timestamp, that guarantee ordered access?
– Bob Dalgleish
Nov 9 at 17:36

@BobDalgleish - The schema of both CSV files is the same; there is a textual difference between the two files.
– Shaitender Singh
Nov 11 at 0:52
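
Going by the comments above (stream rather than load everything, and compare rows as text), here is a minimal sketch in plain Scala 2.13+ using only the standard library. It is not taken from the thread; the names `CsvCompare`, `compare` and `keepCommon` are made up for illustration. It assumes the real files are comma-separated text with one record per line and a single header line, that row order does not matter, and that the set of distinct rows of the first file fits in memory. For 5 to 8 GB inputs that last assumption may not hold, in which case storing a hash per row or sorting both files externally and merging them would be the next thing to try. The question's example keeps the row that appears in both files, so the sketch supports both readings of "difference" through a flag.

```scala
import java.io.PrintWriter
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.collection.mutable
import scala.io.Source
import scala.util.Using

object CsvCompare {

  /** Streams `file2` and writes its header plus selected data rows to `out`.
    *
    * keepCommon = true  keeps rows that also occur in `file1`
    *                    (this matches the example result in the question).
    * keepCommon = false keeps rows that occur only in `file2`.
    *
    * Rows are compared as raw text, so "4.50" and "4.5" count as different.
    */
  def compare(file1: Path, file2: Path, out: Path, keepCommon: Boolean): Unit = {
    // Pass 1: remember every data row of file1 (header skipped).
    // Assumption: this set fits in memory.
    val seen = mutable.HashSet.empty[String]
    Using.resource(Source.fromFile(file1.toFile, "UTF-8")) { src =>
      src.getLines().drop(1).foreach(row => seen += row)
    }

    // Pass 2: stream file2 line by line; nothing else is held in memory.
    Using.resource(Source.fromFile(file2.toFile, "UTF-8")) { src =>
      val writer = new PrintWriter(Files.newBufferedWriter(out, StandardCharsets.UTF_8))
      Using.resource(writer) { w =>
        val lines = src.getLines()
        if (lines.hasNext) w.println(lines.next()) // copy the header unchanged
        lines
          .filter(row => seen.contains(row) == keepCommon)
          .foreach(row => w.println(row))
      }
    }
  }
}
```

Calling `CsvCompare.compare(path1, path2, path3, keepCommon = true)` on the two sample files would write the header and the `2,b,Ran2` row. Because lines are compared verbatim, quoted fields containing line breaks or files that differ only in whitespace would need a real CSV parser (such as the cormorant library mentioned above) or a normalisation step before comparison.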

scala scala-collections
