Compare two large CSV files and make a third one from the difference
I need to compare two large CSV files (approximately 5 to 8 GB) and create a third CSV file from their difference.
Any suggestions or pointers to libraries that support this, or any reference to get started, would be appreciated.
For example:
File 1.csv
+---+------+------+
|ID |value1|value2|
+---+------+------+
|  1|     a|  Ran1|
|  2|     b|  Ran2|
+---+------+------+
File 2.csv
+---+------+------+
|ID |value1|value2|
+---+------+------+
|  3|     c|  Ran3|
|  2|     b|  Ran2|
+---+------+------+
The schema of both files is the same.
Result - file 3.csv
+---+------+------+
|ID |value1|value2|
+---+------+------+
|  2|     b|  Ran2|
+---+------+------+
scala scala-collections
For handling the CSV files, you may take a look at cormorant. Additionally, given that the files will be large, I wouldn't use the standard Scala collections but instead some form of streaming (for example, fs2). Now, I don't really understand what exactly you mean by "difference".
– Luis Miguel Mejía Suárez
Nov 9 at 15:45
For streaming large files, something like Akka Streams is pretty common.
– James Whiteley
Nov 9 at 16:08
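To make the streaming suggestion concrete, here is a minimal sketch that uses only the standard library (`scala.io.Source` and `scala.util.Using`, so Scala 2.13 or later) rather than fs2 or Akka Streams. The names `CsvCompare`, `compare` and `keepCommon` are hypothetical. It assumes that the first line of each file is a header, that rows are compared as raw text, and, importantly, that the distinct rows of the first file fit in the JVM heap; only the second file is streamed. Because the meaning of "difference" is still open in this thread, the flag selects either the rows common to both files or the rows found only in the second one.

```scala
import java.io.PrintWriter
import scala.io.Source
import scala.util.Using

object CsvCompare {
  /** Streams `file2` and writes its rows to `out`:
    * keepCommon = true  keeps rows that also occur in `file1`,
    * keepCommon = false keeps rows that do not occur in `file1`.
    * Rows are compared as raw text; the first line is treated as the header.
    */
  def compare(file1: String, file2: String, out: String, keepCommon: Boolean): Unit = {
    // ASSUMPTION: the distinct data rows of file1 fit in memory.
    val seen: Set[String] =
      Using.resource(Source.fromFile(file1, "UTF-8")) { src =>
        src.getLines().drop(1).toSet
      }

    // file2 is read lazily, one line at a time, and never held in memory as a whole.
    Using.resources(Source.fromFile(file2, "UTF-8"), new PrintWriter(out, "UTF-8")) {
      (src, writer) =>
        val lines = src.getLines()
        if (lines.hasNext) writer.println(lines.next()) // copy the header
        lines
          .filter(row => seen.contains(row) == keepCommon)
          .foreach(row => writer.println(row))
    }
  }
}
```

With the sample files from the question, `CsvCompare.compare("1.csv", "2.csv", "3.csv", keepCommon = true)` would write the header followed by the row with ID 2. If the first file does not fit in memory, this sketch does not apply as written.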
By "difference", do you mean a textual difference, or content difference? "4.50" is textually different than "4.5", but is not different if the field is treated as a number.
– Bob Dalgleish
Nov 9 at 17:35
If there is a difference, how do you resynchronize the two sources? Do you look for the next non-differing line, or do you have key fields, such as a timestamp, that guarantee ordered access?
– Bob Dalgleish
Nov 9 at 17:36
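If a key field does guarantee ordered access, the resynchronization question above has a simple answer: walk both files in lockstep and always advance the side with the smaller key. The sketch below illustrates that idea under assumptions that the question does not confirm: both inputs are already sorted ascending by key (the sample File 2.csv is not, so it would need sorting first), keys are unique within each file, and the ID is the first comma-separated field. `SortedCsvMerge`, `common` and `idOf` are hypothetical names. Only the current row of each input is held in memory.

```scala
object SortedCsvMerge {
  /** Returns the rows of `right` whose key also appears in `left`.
    * ASSUMPTION: both iterators are sorted ascending by `key`, with unique keys.
    */
  def common[K](left: Iterator[String], right: Iterator[String])(key: String => K)(
      implicit ord: Ordering[K]): Iterator[String] = {
    val l = left.buffered
    val r = right.buffered
    new Iterator[String] {
      // Skip rows on whichever side has the smaller key until the heads match
      // or one side runs out.
      private def align(): Unit = {
        var matched = false
        while (!matched && l.hasNext && r.hasNext) {
          val c = ord.compare(key(l.head), key(r.head))
          if (c < 0) l.next()
          else if (c > 0) r.next()
          else matched = true
        }
      }

      def hasNext: Boolean = { align(); l.hasNext && r.hasNext }

      def next(): String = {
        if (!hasNext) throw new NoSuchElementException("no more common rows")
        l.next()  // consume the matching row on the left
        r.next()  // and return the matching row from the right
      }
    }
  }

  // Hypothetical key extractor: the ID is the first comma-separated field.
  def idOf(row: String): Long = row.takeWhile(_ != ',').trim.toLong
}
```

The inputs can be `Source.fromFile(...).getLines().drop(1)` for each file, so neither file is loaded fully. Note that this matches rows on the key only; to require the whole row to be textually identical, the two matching rows would also have to be compared as strings before being emitted.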
@BobDalgleish - The schema of both CSV files is the same; there is a textual difference between the two files.
– Shaitender Singh
Nov 11 at 0:52
edited Nov 11 at 0:49
asked Nov 9 at 15:32
Shaitender Singh