Comparing two large CSV files and creating a third one from the difference












I need to compare two large CSV files (approximately 5 to 8 GB) and create a third CSV file from their difference.
Any suggestions or pointers to suitable libraries, or any reference to get started, would be appreciated.



For example:






File 1.csv

+---+------+------+
|ID |value1|value2|
+---+------+------+
|  1|     a|  Ran1|
|  2|     b|  Ran2|
+---+------+------+

File 2.csv

+---+------+------+
|ID |value1|value2|
+---+------+------+
|  3|     c|  Ran3|
|  2|     b|  Ran2|
+---+------+------+

The schema of both files is the same.

Result: File 3.csv

+---+------+------+
|ID |value1|value2|
+---+------+------+
|  2|     b|  Ran2|
+---+------+------+
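
One possible starting point, using only the Scala standard library (2.13 or later for `scala.util.Using`), is sketched below. It is not taken from the question: the object name `CsvLineCompare`, the method `compare` and the flag `keepCommon` are hypothetical. The sketch assumes that rows are compared textually as whole lines, that each file has a one-line header, and that the distinct lines of the first file fit in memory, which may not hold for files of 5 to 8 GB. Because the example result keeps the row that appears in both files, while "difference" often means rows found in only one file, the flag covers both readings.

```scala
import java.io.PrintWriter
import scala.collection.mutable
import scala.io.Source
import scala.util.Using

object CsvLineCompare {
  /** Streams `second` line by line and writes to `out` every data row whose
    * presence in `first` equals `keepCommon`. The header of `second` is copied.
    * Assumption: the distinct data rows of `first` fit in memory.
    */
  def compare(first: String, second: String, out: String, keepCommon: Boolean): Unit = {
    // Load the data rows of the first file into a set, skipping its header.
    val seen: mutable.HashSet[String] =
      Using.resource(Source.fromFile(first, "UTF-8")) { src =>
        mutable.HashSet.from(src.getLines().drop(1))
      }

    // Stream the second file; only one line of it is held in memory at a time.
    Using.resource(Source.fromFile(second, "UTF-8")) { src =>
      Using.resource(new PrintWriter(out, "UTF-8")) { writer =>
        val lines = src.getLines()
        if (lines.hasNext) writer.println(lines.next()) // copy the header
        lines
          .filter(line => seen.contains(line) == keepCommon)
          .foreach(line => writer.println(line))
      }
    }
  }
}
```

Calling `CsvLineCompare.compare("1.csv", "2.csv", "3.csv", keepCommon = true)` would keep the rows of the second file that also occur in the first; `keepCommon = false` would keep the rows that occur only in the second file. If the set does not fit in memory, a streaming or sort-based approach is likely needed instead.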













  • For handling the CSV files you may take a look at cormorant. Additionally, given that the files are large, I wouldn't use the standard Scala collections but rather some form of streaming (for example fs2). Also, I don't really understand what exactly you mean by "difference".
    – Luis Miguel Mejía Suárez
    Nov 9 at 15:45












  • For streaming large files, something like Akka Streams is pretty common.
    – James Whiteley
    Nov 9 at 16:08










  • By "difference", do you mean a textual difference, or content difference? "4.50" is textually different than "4.5", but is not different if the field is treated as a number.
    – Bob Dalgleish
    Nov 9 at 17:35










  • If there is a difference, how do you resynchronize the two sources? Do you look for the next non-differing line, or do you have key fields, such as a timestamp, that guarantee ordered access?
    – Bob Dalgleish
    Nov 9 at 17:36










  • @BobDalgleish - The schema of both CSV files is the same; there is a textual difference between the two files.
    – Shaitender Singh
    Nov 11 at 0:52
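
The key-field idea raised in the comments can be sketched as a merge over two sorted inputs. This is only an illustration under loudly stated assumptions: both inputs are already sorted ascending by a numeric ID in the first column (the example files are not, so an external sort would have to come first), IDs are unique within each file, and fields contain no quoted commas (otherwise a real CSV parser is needed). The names `SortedCsvMerge`, `merge`, `idOf` and the `Diff` cases are hypothetical.

```scala
object SortedCsvMerge {
  // Classification of one key: present only on the left, only on the right, or in both.
  sealed trait Diff
  final case class OnlyLeft(row: String) extends Diff
  final case class OnlyRight(row: String) extends Diff
  final case class Both(left: String, right: String) extends Diff

  /** Lazily walks two row iterators that are both sorted ascending by `key`.
    * Only the current row of each side is held in memory.
    */
  def merge(left: Iterator[String], right: Iterator[String])(key: String => Long): Iterator[Diff] = {
    val l = left.buffered
    val r = right.buffered
    new Iterator[Diff] {
      def hasNext: Boolean = l.hasNext || r.hasNext
      def next(): Diff =
        if (!r.hasNext) OnlyLeft(l.next())
        else if (!l.hasNext) OnlyRight(r.next())
        else {
          val kl = key(l.head)
          val kr = key(r.head)
          if (kl < kr) OnlyLeft(l.next())        // left is behind: key missing on the right
          else if (kl > kr) OnlyRight(r.next())  // right is behind: key missing on the left
          else Both(l.next(), r.next())          // same key: caller may compare the rows
        }
    }
  }

  // Hypothetical key extractor: the ID is the first comma-separated field.
  def idOf(row: String): Long = row.takeWhile(_ != ',').trim.toLong
}
```

With header lines dropped beforehand, the resulting iterator can be filtered for whichever notion of "difference" is wanted, for example `collect { case SortedCsvMerge.Both(a, b) if a == b => a }` for rows that are textually identical in both files, and written out row by row.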















scala scala-collections






edited Nov 11 at 0:49

























asked Nov 9 at 15:32









Shaitender Singh
