-
Notifications
You must be signed in to change notification settings - Fork 4.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Read/Write Big/LittleEndian support for Guid #86798
Comments
Tagging subscribers to this area: @dotnet/area-system-runtime Issue DetailsRationale
This type represents a 128-bit value in the general format of .NET's Given this is largely an endianness difference and is otherwise just a minor difference in the bits used for Proposed APIsnamespace System
{
public partial struct Guid
{
// public Guid(byte[] value);
// public Guid(ReadOnlySpan<byte> value);
public Guid(ReadOnlySpan<byte> value, bool isBigEndian);
// public byte[] ToByteArray();
public byte[] ToByteArray(bool isBigEndian);
// public bool TryWriteBytes(Span<byte> destination);
// public bool TryWriteBytes(Span<byte> destination, out int bytesWritten); -- new in .NET 8
public bool TryWriteBytes(Span<byte> destination, bool isBigEndian, out int bytesWritten);
}
}
namespace System.Buffers.Binary
{
public static partial class BinaryPrimitives
{
public static Guid ReadGuidBigEndian(ReadOnlySpan<byte> source);
public static Guid ReadGuidLittleEndian(ReadOnlySpan<byte> source);
public static bool TryReadGuidBigEndian(ReadOnlySpan<byte> source, out Guid value);
public static bool TryReadGuidLittleEndian(ReadOnlySpan<byte> source, out Guid value);
public static bool TryWriteGuidBigEndian(ReadOnlySpan<byte> destination, Guid value);
public static bool TryWriteGuidLittleEndian(ReadOnlySpan<byte> destination, Guid value);
public static void WriteGuidBigEndian(ReadOnlySpan<byte> destination, Guid value);
public static void WriteGuidLittleEndian(ReadOnlySpan<byte> destination, Guid value);
} DrawbacksAs discussed on #86084, there is a general concern that users may not be aware that these other overloads exist -or- may not be aware that the difference between variant 1 and variant 2 is endianness and that .NET defaults to variant 2. However, the same general considerations exists from exposing a new type such as After discussion with a few other API review team members, the general consensus was that shipping a new type is undesirable and we should prefer fixing this via new APIs/overloads and potentially looking into additional ways to surface the difference to users (such as analyzers, API documentation, etc). Additional ConsiderationsGiven the above, we may want to consider how to help point users towards their desired APIs given the overloads on We can clearly update the documentation, but an analyzer seems like a desired addition that can help point devs towards specifying the endianness explicitly. Obsoleting the existing overloads was also proposed, but may be undesirable since the current behavior isn't "wrong", it just may be the undesired behavior in some scenarios. We may also want to consider whether a
|
I don't think I see a reasoning for why these are worth adding other than they are in use. When is this format needed? Is there any relationship with execution on big endian architecture, or not particularly? |
The current UUID spec has 2 commonly used formats The difference between the two, in terms of layout, is Thus, these functions are needed for users to be able to correctly interact with such systems and to support the full UUID spec.
The relationship is to how a sequence of raw bytes is interpreted. While machines may operate in big or little endian mode, endianness comes up in many contexts. Networking is a large one where bytes are almost exclusively sent in big endian format (so much so that a common description is "network order"). |
It seems that the history of the worst feature of the .NET BCL (DataTime.Kind) has not taught anyone anything |
This is not a The proposed You could have some This ultimately comes down to:
Edit: The spec does largely detail itself following variant 1 and describes it as "network order". With most of the callouts to variant 0/2 being noted as |
That isn't quite comparable, |
We likewise do not expose the |
Several nitpicks:
So, IMHO, separate Uuid type is preferrable over the proposed API. I don't think |
I'm not sure how this is relevant to byte order (endianness). Can you clarify? The proposal is not to change
That said, this part is still accurate.
As mentioned earlier, the problems with |
This has nothing to do with the internal structure, but rather has everything to do with the implementation of We could choose to implement Things like
Yes, but they have a strong semantic difference that appears as part of comparisons, addition, subtraction, multiplication, division, remainder, conversions, min, max, and almost every other operation you can think of.
This also comes about from perf advantages and significantly reduced complexity that frequently comes up in common usages of the types. What you've effectively asked is that the .NET BCL expose: public readonly struct Guid { }
public readonly struct Uuid { } You could give these any number of names: public readonly struct UuidLittleEndian { }
public readonly struct UuidBigEndian { }
public readonly struct UuidVariant1 { }
public readonly struct UuidVariant2 { }
public readonly struct UuidMachineOrder { }
public readonly struct UuidNetworkOrder { }
etc The This is not how .NET exposes types in the BCL today, and its not something that we want to do moving forward either. We want to grow and expand existing types to support new scenarios instead. If we were designing this today, without any prior concerns, we'd probably name it |
Do you have examples in our own libraries or in the code out there that would use these APIs? We do provide efficient allocation-free access to underlying Guid bytes. If needed, anybody can write their own binary serializer/deserialized that fits the given format in just a few lines. I do not think we need to be providing all possible variants of binary serializers/deserializers for BCL types. For example, we do not provide helpers for different variants of RLE-encoding of integers either. #85891 is very similar proposal, it is also asking for providing a specific binary serializers/deserializers helpers. |
This was opened because we got an issue (linked in the top proposal) that had a massive influx of upvotes/support. Our default behavior for These APIs cover the core need, efficiently, and help make it visible that the existing APIs may not do exactly what the user may be expecting. |
It certainly would not be an end of the world scenario if these weren't exposed. But given the number of upvotes on the original issue asking for a -- There is notably still a decent number of users from the |
The user implemented code would effectively be: public byte[] ToBigEndianByteArray(Guid guid)
{
byte[] tmp = guid.ToByteArray();
WriteInt32BigEndian(tmp, ReadInt32LittleEndian(tmp)):
Span<byte> tmp2 = tmp.Slice(4);
WriteInt16BigEndian(tmp2, ReadInt16LittleEndian(tmp2)):
tmp2 = tmp2.Slice(2);
WriteInt16BigEndian(tmp2, ReadInt16LittleEndian(tmp2)):
return tmp;
} and similar for the inverse direction |
It affects the implementation of sorting/serialization (see https://github.com/dotnet/runtime/pull/81650/files ) and corresponding performance cost.
Well, maybe author of #86084 just wants consistency between string and byte serialization, but I want more. That is, if
But the trouble is just the same. It takes just a single element in a large chain of interacting code fragments with multiple serialization boundaries to mishandle or loose the
But introduction of these new APIs won't solve the problem. (The aforementioned problem being inconsistency between variant-1 and variant-2 compatible treatments of binary and string serialization). You can always implement such API yourself (and people do implement similar API). But 'ordinary' enterprise programmer rarely interacts with serialization boundary directly - it's handled by 1st and 3rd party libraries for serialization/model binding/ object-relational mapping. The problem comes from inconsistencies in handling Guid/Uuid data within these libraries - which is not observable without digging through internals of these libraries. Well, maybe some libraries will replace their implementations with these API's, but even if they do - the problem will persist. Solving the problem will cause them either to break backward compatibility, which isn't what good libraries do often, or to pass BE/LE switches up to the API surface, which is, honestly, looks ugly and javaesque. Separate
There's not much of a surprise here. The issue touches on a major sour subject for modern enterprise programming. Back in the day you had full MSFT-stack, where |
I have experienced the following case using GUIDs.
instead of a special correct generation algorithm, then problem appears as sequence of bits produced by Generally, GUID serves its purpose fine and there is no need to touch or modify it. As I see new API methods will just confuse GUID's API. Also, the motivation of such API extension is not clear from first glance. Instead, the UUID type may be added to BCL along with MSDN documentation that explicitly states the problem UUID type solves. It is much simpler just to fetch out of box methods |
5 cents regarding my recent experience with UUIDs on other platforms. I needed UUID v7 support (sortable one). In JVM it just works with With Moreso, UUID spec for new UUID versions mentions network order
How UUID v7 can be implemented correctly with the existing |
The business case for UUIDv7 - cursor-based pagination which requires sorting. Why UUID in the first place? I need a random, unique, opaque identifier. Snowflake is dead, UUID v7 is the industry standard now. |
Endianness has a relatively minor impact on the implementation in that it determines which fields are compared first vs last and in a couple edge cases involving construction, it determines if any byte swapping is required. That is, the difference between a For sorting,
This has no impact on the actual vectorizability of the code and in fact is the least efficient way to set up the bytes. The "ideal" layout performance wise is
We already can do this on
This is a non-starter, we do not transparently convert between non-equivalent types like that as it breaks type safety, introduces a range of ambiguities and risks of breaking change, etc. It goes against many of the core Framework Design Guidelines.
There are no real quirks or inconsistencies in The one piece of functionality it is missing is the ability to easily serialize it as a big-endian sequence of bytes (that is, serialize it in a format equivalent to a As has been stated many times, some proposed There are some minor adjustments that could be made to field layout, but nothing that is actually meaningful to the implementation or to the user-observable semantics of the type in safe code and nothing that would meaningfully impact the performance or usability of the type.
This is really a non-argument. The same "concern" exists for serializing any primitive type to the network (
There is no data to lose here. The serialization process already includes all bytes, the only thing that a user can mess up is that they use This is something that is trivially caught and handled by a single test for the APIs that perform serialization. The
There is no inconsistency with regards to string serialization. String serialization is deterministic regardless of field layout, regardless of endianness, etc. There is currently today user error in the use of the binary serialization APIs because some users don't pick up that This is functionally no different from a developer using The fix is for the user to use the correct
Exposing If the system requires a Ultimately which is used doesn't really matter provided that both the producer and consumer agree on which is being used. In the majority case this isn't something to surface to the user. In the case you have a general purpose serialization library, then it is something to surface much as it would be for
Exposing a new
The formatted string should never differ between the two systems. If it does, you have a bug. The formatted string would never differ between equivalent The byte sequence of a The bugs occur because developers do not correctly account for the fact that they are reading a Part of this comes from us not having an API that allows them to trivially work with big-endian data. That is what this proposal covers exposing. The other part comes from the existing APIs not clearly surfacing that there is potentially an endianness concern users should be aware of. That endianness concern will not be addressed by A subtle but potential alternative would be to call these |
This is not how we typically view such things in API review. We do generally consider the concerns around additional overloads causing potential confusion. However, new overloads to existing APIs are most frequently considered significantly less confusing than a new but similar type. We do also factor in the chance for user confusion. In the case of In the case of On the other hand, exposing So while some users are surfacing the potential concern about ambiguity, it does not at all match the concrete experience we have from other types that have already done this exact thing to great success. And its worth noting according to API usage metrics (https://apisof.net/), such APIs are actually used by developers in a non-trivial number of projects, so its not like we exposed them and simply no one uses them.
That is not how any other type provided by the BCL works and is not how types work in most other languages/ecosystems either. Numeric strings are functionally always displayed in big endian format. On the other hand, most machines natively operate in little-endian format (there are relatively few exceptions such as IBM System z9) and thus the raw byte sequence in memory is swapped compared to the byte sequence displayed by string formatting for almost every type. |
You pass As I've repeatedly detailed above, Likewise, The behavior of We are not going to expose an entirely new type just to minorly differentiate a serialization/deserialization only concern. We would not ship a
All the different variations of Today, The new APIs make it easy to also serialize/deserialize big-endian ordered data (network order) and thus make it much simpler for developers to correctly handle UUIDv7/UUID variant 2/etc values at the serialization/deserialization boundary. They likewise make it trivial to continue working with the 20 years of existing types/APIs, many of which already use The new APIs completely remove the abiguity that a second type would introduce and the general interchange problems that would arise. They completely remove the consideration of whether existing APIs taking Code that is already working and already doing the right thing continues to work and do the right thing. They can potentially simplify their own existing wrappers that are fixing up the endianness to simply use these new APIs. New or existing code that finds the byte order is not what they expected can now trivially use the new APIs to do the right thing. |
First of all: I urge you not to compare a data type that is unambiguous in many languages with an exaggerated representation of a particular case. By translating such comparisons, you divert our conversation from the point. As for unambiguous behavior: "Except" is the opposite of "identical behavior." If you expose a new type, you can accurately describe its behavior. If you add flags, there will be additional cognitive load and ambiguity. And it's not just serialization. It's about compatibility with: data stores (not just RDBMS), other languages, interoperability with other languages, making language input easier for people with experience using the uuid type. I just can't explain to a colleague (python) what a GUID is and how it differs from a uuid. It seems to me that the current proposal can only distract us from creating applications with new versions of uuid, precisely because we will encounter type assignment ambiguities. Conversely, if we have two types guid and uuid, we can accurately separate them, understand the purpose of each separately and develop them in parallel, taking into account the needs of these particular types. Such an API does not make it easier for developers to process values, but only expands the field for decision-making, forcing each time to think about the ambiguity of design decisions 20 years ago. I'm an average developer and I have no idea why guid supports any other byte sequence. If we talk about serialization, then there are a lot of ambiguities for me too, the absence of an API at all creates a situation where I have to think about things that I cannot know in advance. And by the way, as an average developer, I believe that the new API adds a few methods that can be added to a regular nuget package, I don't see the point of adding them to bcl like this. If we talk about a new type as an alternative to this api (which is not correct), then creating a new type does not replace the old one in any way, but only adds new features. No one insists on marking the GUID as obsolete, it would be wrong, besides, there is a decision field in the GUID where it cannot be replaced. But the type is already overloaded with internal information, and the need to add new flags and methods only speaks to the need for a "new", unambiguous data type, which is in all other languages. I would like to emphasize that the current interfaces work with the GUID type and do not require a transition to a new type. Scenarios are defined where the "correct" byte sequence needs to be used (and it's good that they are different, not specific - that way we can see a whole group of needs, not just one) - serialization, integration and disambiguation with other languages, use of different types guid for enterprise database, uuid for small projects) for different databases. In general, in any design, new flags and methods only add to the ambiguity. A type whose constructor needs to be passed multiple flags does not become unambiguous with the addition of a new flag. In addition to the fact that we separate the types of numbers into different bit depths to determine their size, they can also be stored in completely different ways! P.S. |
Yes, sorry. This is a place I forgot to reiterate both sides of serialization/deserialization.
Yes. APIs that deal with raw byte sequences require you to think about endianness. Not thinking about endianness when dealing with raw byte sequences will only lead to bugs.
Yes, you must understand the endianness when dealing with raw byte sequences. The same would be true for
Yes, they must be aware that
It is explicitly something that From
The wording could be improved in some cases for the constructors. The new APIs then allow developers to specify which format their raw byte sequence is in.
There is no more difference between
There is a fundamental requirement for developers doing binary serialization to use the same interpretation on both sides of the serialization/deserialization boundary. Having different types does not solve this problem. |
With the ongoing conversation here, I've removed the |
If I am understanding correctly, this whole use case seems to rely on the developer getting a The use case of "writing it to a database" is certainly one that has been brought up much in this discussion, however there is nothing special about it. I have already explained in my previous comment how there is (in most cases) only one valid thing for your database layer to do once it encounters a GUID. Judging by the code links posted, all of them do exactly that. You would have the exact same issues with a use case as simple as converting a binary file format to a textual format. It just happens that databases sometimes have switches that allow you to make a right here with two wrongs. The bug here happened the moment the developer passed the wrong endianness to (This entire post can be inverted to go from reading a binary GUID to writing one) Footnotes
|
@tannergooding has done a good job of representing the viewpoints of those of us who maintain the .NET APIs. I understand not everyone is happy with those viewpoints, but at this point no additional arguments are being presented and the discussion has run its course. I'll mark this as api-ready-for-review again so that we can discuss the specifics of this proposal again at the next API review meeting opportunity and decide whether to add the proposed APIs or a subset of them or none at all. Thanks to all who shared their perspectives on the matter. |
Well, this discussion is now mostly an essay exchange, with timezone differences, job and sleep it takes a lot of time to produce another argument, so it mostly halted to a stop. Though
is probably correct. For single post here, there's one weighted and thought-out reply by @tannergooding here, but there are ten giggles and insults from the crowd on Discord as well. This is no longer maintainable. So, I'll address the latest arguments and will eagerly await this issue to be closed.
At the cost of readability. Consider the aforementioned BigInteger: var bytes = bigInt.ToByteArray(true, false); // is it signed little-endian? unsigned big-endian? It's no surprise in the fact top search results for "bool method parameter" are generally negative, e.g., "What is wrong with boolean parameters?", "Is it wrong to use a boolean parameter to determine behavior", "Clean code: The curse of a boolean parameter", "Do Boolean Parameters Make You a Bad Programmer?"
Proposed overloads are to be used outside
Even the original RFC-4122 document from 2005 claims to describe only variant 1, and even specifically requires big-endianness by stating in the same section:
But newer drafts define a different structure for v7 and v8, and even redefine the field structure for versions 2-5. Compare the v1 structure
with v4
or v7
Also, Console.WriteLine(Convert.ToString((Guid.NewGuid().ToByteArray()[8] & 0xF0) >> 4,2)) It produces the This issue notwithstanding, I hope Runtime and BCL will be prepared for eventual popularization of UUID v6/v7/v8 As for the "Why don't you write your own library" argument. It was addressed already, actually. And these libraries already exist. The problem is with integrating these libraries with other libraries. In .NET Community there's a general preconception against taking a hard dependency on another third-party library (fueled not only by recent drama with IdentityServer of ImageSharp, but the whole 20 years of history with open source in .NET circles). The simple, blittable (big-endian. Machines are generally little-endian, yes, but network is generally not. At least middle-endian is dead and forgotten) primitive type in BCL would help a lot in easing cross-integration and operation between these libraries. public struct Binary<int n>
{
byte[n] _buffer;
} with the support from EntityFrameworkCore in mapping these types (which would promote support for these types in other libraries) would, I hope, alleviate all the issues current UUID libraries have. |
Before everything is completed, I would like to highlight a few points.
Creating your own primitive to solve the issues described here is not difficult. Moreover, I have been maintaining one such package myself since 2019. |
Made quick search in Githib of
Also, some libs use own serialization mechanism for Guid npgsql. I would like to suggest do not change Guid api to not affect case 1, 2 and add only proposed |
namespace System
{
public partial struct Guid
{
// public Guid(byte[] value);
// public Guid(ReadOnlySpan<byte> value);
public Guid(ReadOnlySpan<byte> value, bool bigEndian);
// public byte[] ToByteArray();
public byte[] ToByteArray(bool bigEndian);
// public bool TryWriteBytes(Span<byte> destination);
// public bool TryWriteBytes(Span<byte> destination, out int bytesWritten); -- new in .NET 8
public bool TryWriteBytes(Span<byte> destination, bool bigEndian, out int bytesWritten);
}
} |
Can you assign this issue to me? |
|
Should this supersede #53354 ? |
#30940 issue can be closed also. |
Rationale
System.Guid
represents .NETs support forGlobally Unique Identifiers
orGUIDs
(sometimes also referred to asUniversally Unique Identifiers
orUUIDs
).This type represents a 128-bit value in the general format of
xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx
where eachx
represents a hexadecimal digit, where the 4 bits ofM
represent the version and the 4 bits ofN
represent the variant number. This is sometimes referred to as the8-4-4-4-12
format string. And while the string representation is "well-defined", the actual underlying order of these bytes has a few different representation and there are several variants of the general RFC 4122 that may require a specific ordering or even may limit specific bytes to have a particular meaning..NET's
System.Guid
follows a field layout best matchingvariant 2
which is identical tovariant 1
outside the endianness. In particular,variant 1
is "big endian" andvariant 2
is "little endian". Variant 1 and 2 are used by the current UUID specification and are by far the most prominent while variant 0 is largely considered deprecated. Outside of the endianness these variants are differented by minor bit pattern requirements.Given this is largely an endianness difference and is otherwise just a minor difference in the bits used for
M
andN
, we would prefer to not introduce a new type just to handle this and would instead prefer to introduce explicit APIs and overloads onGuid
that help identify and handle these differences.Proposed APIs
Drawbacks
As discussed on #86084, there is a general concern that users may not be aware that these other overloads exist -or- may not be aware that the difference between variant 1 and variant 2 is endianness and that .NET defaults to variant 2.
However, the same general considerations exists from exposing a new type such as
System.Uuid
. There are then additional considerations on top in that it further bifurcates the type system, doesn't easily allow polyfilling the support downlevel without shipping a new OOB package, and may further confuse users due to the frequent interchange of theGUID
andUUID
terminology.After discussion with a few other API review team members, the general consensus was that shipping a new type is undesirable and we should prefer fixing this via new APIs/overloads and potentially looking into additional ways to surface the difference to users (such as analyzers, API documentation, etc).
Additional Considerations
Given the above, we may want to consider how to help point users towards their desired APIs given the overloads on
Guid
that do not require specifying endianness.We can clearly update the documentation, but an analyzer seems like a desired addition that can help point devs towards specifying the endianness explicitly. Obsoleting the existing overloads was also proposed, but may be undesirable since the current behavior isn't "wrong", it just may be the undesired behavior in some scenarios.
We may also want to consider whether a
static Guid NewGuid()
overload that allows conforming toVersion 4, Variant 1
is desired. The docs only indicate it is version 4 and calls into the underlying System APIs. It does not indicate if it producesVariant 1
,Variant 2
, or truly random bits forN
.The text was updated successfully, but these errors were encountered: