How can I get the first registry with a type A and the first registry with a type B and move it to a single...
up vote
1
down vote
favorite
I am trying to use python and pandas to get the first registry with the type A or B and the first registry with the type C or D and move that to a single row. I want this per Id.
Will be easy for me to demonstrate with this sample case
Id,type,Datetime
1, A, 2018-01-01
1, I, 2018-01-02
1, I, 2018-01-03
1, C, 2018-01-04
2, I, 2018-01-01
2, B, 2018-01-02
2, J, 2018-01-03
2, K, 2018-01-04
2, D, 2018-01-05
2, C, 2018-01-06
3, A, 2018-01-03
3, D, 2018-01-09
4, A, 2018-01-03
4, D, 2018-01-05
4, C, 2018-01-04
I wanted the result of the processing to be
Id, ProcessEvent, ProcessDate, DispatchedEvent, DispatchedDate
1, A, 2018-01-01, C, 2018-01-04
2, B, 2018-01-01, D, 2018-01-05
3, A, 2018-01-03, D, 2018-01-09
4, A, 2018-01-03, C, 2018-01-04
I am currently trying to do this using python and pandas in a Databricks cluster but I am opening to Other ideas I need to to that in files that come up to a total of few TB so it is a large dataset so speed must be in mind and I can't store the whole dataset in memory.
python pandas bigdata databricks
add a comment |
up vote
1
down vote
favorite
I am trying to use python and pandas to get the first registry with the type A or B and the first registry with the type C or D and move that to a single row. I want this per Id.
Will be easy for me to demonstrate with this sample case
Id,type,Datetime
1, A, 2018-01-01
1, I, 2018-01-02
1, I, 2018-01-03
1, C, 2018-01-04
2, I, 2018-01-01
2, B, 2018-01-02
2, J, 2018-01-03
2, K, 2018-01-04
2, D, 2018-01-05
2, C, 2018-01-06
3, A, 2018-01-03
3, D, 2018-01-09
4, A, 2018-01-03
4, D, 2018-01-05
4, C, 2018-01-04
I wanted the result of the processing to be
Id, ProcessEvent, ProcessDate, DispatchedEvent, DispatchedDate
1, A, 2018-01-01, C, 2018-01-04
2, B, 2018-01-01, D, 2018-01-05
3, A, 2018-01-03, D, 2018-01-09
4, A, 2018-01-03, C, 2018-01-04
I am currently trying to do this using python and pandas in a Databricks cluster but I am opening to Other ideas I need to to that in files that come up to a total of few TB so it is a large dataset so speed must be in mind and I can't store the whole dataset in memory.
python pandas bigdata databricks
What if theCrow is before theArow? Do you want this per id?
– Willem Van Onsem
Nov 10 at 12:01
The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:05
I want this per Id.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:07
1
Sorry, so how is possible distinguish whichIdhave typeA,Cand whichB,D? Maybe help add anotherIdwith output?
– jezrael
Nov 10 at 12:11
It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:14
add a comment |
up vote
1
down vote
favorite
up vote
1
down vote
favorite
I am trying to use python and pandas to get the first registry with the type A or B and the first registry with the type C or D and move that to a single row. I want this per Id.
Will be easy for me to demonstrate with this sample case
Id,type,Datetime
1, A, 2018-01-01
1, I, 2018-01-02
1, I, 2018-01-03
1, C, 2018-01-04
2, I, 2018-01-01
2, B, 2018-01-02
2, J, 2018-01-03
2, K, 2018-01-04
2, D, 2018-01-05
2, C, 2018-01-06
3, A, 2018-01-03
3, D, 2018-01-09
4, A, 2018-01-03
4, D, 2018-01-05
4, C, 2018-01-04
I wanted the result of the processing to be
Id, ProcessEvent, ProcessDate, DispatchedEvent, DispatchedDate
1, A, 2018-01-01, C, 2018-01-04
2, B, 2018-01-01, D, 2018-01-05
3, A, 2018-01-03, D, 2018-01-09
4, A, 2018-01-03, C, 2018-01-04
I am currently trying to do this using python and pandas in a Databricks cluster but I am opening to Other ideas I need to to that in files that come up to a total of few TB so it is a large dataset so speed must be in mind and I can't store the whole dataset in memory.
python pandas bigdata databricks
I am trying to use python and pandas to get the first registry with the type A or B and the first registry with the type C or D and move that to a single row. I want this per Id.
Will be easy for me to demonstrate with this sample case
Id,type,Datetime
1, A, 2018-01-01
1, I, 2018-01-02
1, I, 2018-01-03
1, C, 2018-01-04
2, I, 2018-01-01
2, B, 2018-01-02
2, J, 2018-01-03
2, K, 2018-01-04
2, D, 2018-01-05
2, C, 2018-01-06
3, A, 2018-01-03
3, D, 2018-01-09
4, A, 2018-01-03
4, D, 2018-01-05
4, C, 2018-01-04
I wanted the result of the processing to be
Id, ProcessEvent, ProcessDate, DispatchedEvent, DispatchedDate
1, A, 2018-01-01, C, 2018-01-04
2, B, 2018-01-01, D, 2018-01-05
3, A, 2018-01-03, D, 2018-01-09
4, A, 2018-01-03, C, 2018-01-04
I am currently trying to do this using python and pandas in a Databricks cluster but I am opening to Other ideas I need to to that in files that come up to a total of few TB so it is a large dataset so speed must be in mind and I can't store the whole dataset in memory.
python pandas bigdata databricks
python pandas bigdata databricks
edited Nov 10 at 13:12
Community♦
11
11
asked Nov 10 at 12:00
Gabriel Monteiro Nepomuceno
1,0811425
1,0811425
What if theCrow is before theArow? Do you want this per id?
– Willem Van Onsem
Nov 10 at 12:01
The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:05
I want this per Id.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:07
1
Sorry, so how is possible distinguish whichIdhave typeA,Cand whichB,D? Maybe help add anotherIdwith output?
– jezrael
Nov 10 at 12:11
It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:14
add a comment |
What if theCrow is before theArow? Do you want this per id?
– Willem Van Onsem
Nov 10 at 12:01
The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:05
I want this per Id.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:07
1
Sorry, so how is possible distinguish whichIdhave typeA,Cand whichB,D? Maybe help add anotherIdwith output?
– jezrael
Nov 10 at 12:11
It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:14
What if the
C row is before the A row? Do you want this per id?– Willem Van Onsem
Nov 10 at 12:01
What if the
C row is before the A row? Do you want this per id?– Willem Van Onsem
Nov 10 at 12:01
The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:05
The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:05
I want this per Id.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:07
I want this per Id.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:07
1
1
Sorry, so how is possible distinguish which
Id have type A,C and which B,D ? Maybe help add another Id with output?– jezrael
Nov 10 at 12:11
Sorry, so how is possible distinguish which
Id have type A,C and which B,D ? Maybe help add another Id with output?– jezrael
Nov 10 at 12:11
It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:14
It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:14
add a comment |
1 Answer
1
active
oldest
votes
up vote
1
down vote
accepted
First filter by isin and boolean indexing, remove duplicates by drop_duplicates per column Id, set to index and rename columns names:
df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.sort_values(['Id','Datetime'])
df1 = (df[df['type'].isin(['A','B'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns={'type':'ProcessEvent','Datetime':'ProcessDate'}))
df2 = (df[df['type'].isin(['C','D'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns={'type':'DispatchedEvent','Datetime':'DispatchedDate'}))
Last concat together:
df = pd.concat([df1, df2], axis=1).reset_index()
print (df)
Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate
0 1 A 2018-01-01 C 2018-01-04
1 2 B 2018-01-02 D 2018-01-05
2 3 A 2018-01-03 D 2018-01-09
3 4 A 2018-01-03 C 2018-01-04
Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34
@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39
Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47
1
@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
1
down vote
accepted
First filter by isin and boolean indexing, remove duplicates by drop_duplicates per column Id, set to index and rename columns names:
df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.sort_values(['Id','Datetime'])
df1 = (df[df['type'].isin(['A','B'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns={'type':'ProcessEvent','Datetime':'ProcessDate'}))
df2 = (df[df['type'].isin(['C','D'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns={'type':'DispatchedEvent','Datetime':'DispatchedDate'}))
Last concat together:
df = pd.concat([df1, df2], axis=1).reset_index()
print (df)
Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate
0 1 A 2018-01-01 C 2018-01-04
1 2 B 2018-01-02 D 2018-01-05
2 3 A 2018-01-03 D 2018-01-09
3 4 A 2018-01-03 C 2018-01-04
Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34
@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39
Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47
1
@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51
add a comment |
up vote
1
down vote
accepted
First filter by isin and boolean indexing, remove duplicates by drop_duplicates per column Id, set to index and rename columns names:
df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.sort_values(['Id','Datetime'])
df1 = (df[df['type'].isin(['A','B'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns={'type':'ProcessEvent','Datetime':'ProcessDate'}))
df2 = (df[df['type'].isin(['C','D'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns={'type':'DispatchedEvent','Datetime':'DispatchedDate'}))
Last concat together:
df = pd.concat([df1, df2], axis=1).reset_index()
print (df)
Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate
0 1 A 2018-01-01 C 2018-01-04
1 2 B 2018-01-02 D 2018-01-05
2 3 A 2018-01-03 D 2018-01-09
3 4 A 2018-01-03 C 2018-01-04
Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34
@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39
Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47
1
@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51
add a comment |
up vote
1
down vote
accepted
up vote
1
down vote
accepted
First filter by isin and boolean indexing, remove duplicates by drop_duplicates per column Id, set to index and rename columns names:
df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.sort_values(['Id','Datetime'])
df1 = (df[df['type'].isin(['A','B'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns={'type':'ProcessEvent','Datetime':'ProcessDate'}))
df2 = (df[df['type'].isin(['C','D'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns={'type':'DispatchedEvent','Datetime':'DispatchedDate'}))
Last concat together:
df = pd.concat([df1, df2], axis=1).reset_index()
print (df)
Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate
0 1 A 2018-01-01 C 2018-01-04
1 2 B 2018-01-02 D 2018-01-05
2 3 A 2018-01-03 D 2018-01-09
3 4 A 2018-01-03 C 2018-01-04
First filter by isin and boolean indexing, remove duplicates by drop_duplicates per column Id, set to index and rename columns names:
df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.sort_values(['Id','Datetime'])
df1 = (df[df['type'].isin(['A','B'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns={'type':'ProcessEvent','Datetime':'ProcessDate'}))
df2 = (df[df['type'].isin(['C','D'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns={'type':'DispatchedEvent','Datetime':'DispatchedDate'}))
Last concat together:
df = pd.concat([df1, df2], axis=1).reset_index()
print (df)
Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate
0 1 A 2018-01-01 C 2018-01-04
1 2 B 2018-01-02 D 2018-01-05
2 3 A 2018-01-03 D 2018-01-09
3 4 A 2018-01-03 C 2018-01-04
edited Nov 10 at 12:50
answered Nov 10 at 12:26
jezrael
305k20239314
305k20239314
Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34
@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39
Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47
1
@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51
add a comment |
Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34
@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39
Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47
1
@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51
Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34
Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34
@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39
@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39
Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47
Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47
1
1
@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51
@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51
add a comment |
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53238724%2fhow-can-i-get-the-first-registry-with-a-type-a-and-the-first-registry-with-a-typ%23new-answer', 'question_page');
}
);
Post as a guest
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
What if the
Crow is before theArow? Do you want this per id?– Willem Van Onsem
Nov 10 at 12:01
The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:05
I want this per Id.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:07
1
Sorry, so how is possible distinguish which
Idhave typeA,Cand whichB,D? Maybe help add anotherIdwith output?– jezrael
Nov 10 at 12:11
It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:14