path: root/absl/crc
Commit message (Author, Age, Files, Lines)
* PR #1662: Replace shift with addition in crc multiply (Pavel P, 2024-05-07, 2, -10/+12)
| | | | | | | | | | | | Imported from GitHub PR https://github.com/abseil/abseil-cpp/pull/1662 Merge 4b2c6c909b573d31a1cccba7cb72d4d8badeef8b into cba31a956209e68e4d4049e8a9bc03b1fd67320a Merging this change closes #1662 COPYBARA_INTEGRATE_REVIEW=https://github.com/abseil/abseil-cpp/pull/1662 from pps83:crc-add 4b2c6c909b573d31a1cccba7cb72d4d8badeef8b PiperOrigin-RevId: 631470883 Change-Id: I4a72be643ed341ddf0e0007418ab4a613a03db4b
* PR #1653: Remove unnecessary casts when calling CRC32_u64 (Pavel P, 2024-04-19, 1, -4/+4)
| | | | | | | | | | | | | | Imported from GitHub PR https://github.com/abseil/abseil-cpp/pull/1653 CRC32_u64 returns uint32_t, no need to cast returned result to uint32_t Merge 90e7b063f39c6b1559a21832d764e500e1cdd40c into 9a61b00dde4031f17ed4fa4bdc0e0e9ad8859846 Merging this change closes #1653 COPYBARA_INTEGRATE_REVIEW=https://github.com/abseil/abseil-cpp/pull/1653 from pps83:CRC32_u64-cast 90e7b063f39c6b1559a21832d764e500e1cdd40c PiperOrigin-RevId: 626462347 Change-Id: I748a2da5fcc66eb6aa07aaf0fbc7eca927fcbb16
* Optimize crc32 V128_From2x64 on Arm (Connal de Souza, 2024-04-04, 2, -12/+15)
| | | | | | | This removes redundant vector-vector moves and results in Extend being up to 3% faster. PiperOrigin-RevId: 621948170 Change-Id: Id82816aa6e294d34140ff591103cb20feac79d9a
* Adjust conditional compilation in non_temporal_memcpy.h (Abseil Team, 2024-03-27, 1, -17/+18)
| | | | | | | | | | This change will allow the AVX version of non-temporal memcpy to be compiled even if the compiler isn't run with AVX support. This allows runtime dispatch to select the AVX implementation for CPUs that are known to be compatible with AVX instructions. PiperOrigin-RevId: 619594422 Change-Id: Ia7d92404ef8d10d152030b29b71948ed954f28f5
* Replace //visibility:private with :__pkg__ for certain targets (Abseil Team, 2024-03-14, 1, -2/+6)
| | | | | | | | | This will allow us to give visibility to other Google-internal libraries. The change is necessary since //visibility:private cannot be combined with other specifications. PiperOrigin-RevId: 615779561 Change-Id: I82b1edfa4e1ca280e429cf2a5e4003a1cc316a60
* Add several missing includes in crc/internal (Abseil Team, 2024-03-13, 2, -2/+3)
| | | | | PiperOrigin-RevId: 615504707 Change-Id: Ia0e8211bd3c3d28fd0715c8f296ec50f6a700757
* Disable ubsan for benign unaligned access in crc_memcpy (Abseil Team, 2024-03-12, 1, -3/+8)
| | | | | PiperOrigin-RevId: 615160537 Change-Id: I29070c898104c55e6563eed0eef7397441bef1d7
* Delete a stray comment (Abseil Team, 2024-03-12, 1, -1/+0)
| | | | | PiperOrigin-RevId: 615017130 Change-Id: I73277de8ece31d6a35b47dbdb205b473324b74a2
* PR #1617: fix MSVC 32-bit build with -arch:AVX (Stanislaw Halik, 2024-02-15, 1, -3/+4)
| Imported from GitHub PR https://github.com/abseil/abseil-cpp/pull/1617
|
| The intrinsics used aren't available on `x86_64` processors while running in 32-bit mode. See:
|
| - list of 64-bit intrinsics (https://learn.microsoft.com/en-us/cpp/intrinsics/x64-amd64-intrinsics-list?view=msvc-170)
| - list of 32-bit intrinsics (https://learn.microsoft.com/en-us/cpp/intrinsics/x86-intrinsics-list?view=msvc-170)
| - list of predefined MSVC macros (https://learn.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-170)
|
| The error message in question:
|
| ```console
| F:\dev\opentrack-depends\onnxruntime-build\msvc\_deps\abseil_cpp-src\absl/crc/internal/crc32_x86_arm_combined_simd.h(145,32): error C3861: '_mm_crc32_u64': identifier not found
|   return static_cast<uint32_t>(_mm_crc32_u64(crc, v));
|                                ^
| F:\dev\opentrack-depends\onnxruntime-build\msvc\_deps\abseil_cpp-src\absl/crc/internal/crc32_x86_arm_combined_simd.h(193,50): error C3861: '_mm_cvtsi128_si64': identifier not found
|   inline int64_t V128_Low64(const V128 l) { return _mm_cvtsi128_si64(l); }
| ```
|
| Merge 06f5832108a2b01e0a900db51e1c870f7069a1f2 into 797501d12ea767dabdc8d36674e083869e62ee7d
|
| Merging this change closes #1617
|
| COPYBARA_INTEGRATE_REVIEW=https://github.com/abseil/abseil-cpp/pull/1617 from sthalik:pr/fix-msvc-32-bit-avx 06f5832108a2b01e0a900db51e1c870f7069a1f2
| PiperOrigin-RevId: 607483370
| Change-Id: Id2a6f6dd33c2707fe7ffe134e7335916f3fb9da3
* Replace `testonly = 1` with `testonly = True` in abseil BUILD files. (Shahriar Rouf, 2024-01-31, 1, -1/+1)
| | | | | | | https://bazel.build/build/style-guide#other-conventions PiperOrigin-RevId: 603084345 Change-Id: Ibd7c9573d820f88059d12c46ff82d7d322d002ae
* Migrate empty CrcCordState to absl::NoDestructor. (Abseil Team, 2024-01-18, 3, -4/+6)
| | | | | | | Note that this only changes how we allocate the empty state, and reference countings of `empty` stay the same. PiperOrigin-RevId: 599526339 Change-Id: I2c6aaf875c144c947e17fe8f69692b1195b55dd7
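The idiom behind this change can be sketched without Abseil. The wrapper below is a simplified stand-in for `absl::NoDestructor`, not the library's implementation, and `NoDestructorSketch`, `EmptyState`, and `SharedEmptyState` are hypothetical names chosen for illustration. The object is constructed once in static storage and its destructor is never run, so it stays usable during program shutdown.

```cpp
#include <cassert>
#include <new>
#include <utility>

// Simplified stand-in for absl::NoDestructor (assumption: illustrative only).
// The wrapped object is built with placement new and never destroyed.
template <typename T>
class NoDestructorSketch {
 public:
  template <typename... Args>
  explicit NoDestructorSketch(Args&&... args)
      : ptr_(new (storage_) T(std::forward<Args>(args)...)) {}

  NoDestructorSketch(const NoDestructorSketch&) = delete;
  NoDestructorSketch& operator=(const NoDestructorSketch&) = delete;

  T* get() { return ptr_; }

 private:
  alignas(T) unsigned char storage_[sizeof(T)];  // raw storage for T
  T* ptr_;                                       // points into storage_
};

// Hypothetical reference-counted "empty" state.
struct EmptyState {
  explicit EmptyState(int initial_refcount) : refcount(initial_refcount) {}
  int refcount;
};

// Returns the single shared empty state; only the way it is allocated
// differs from a heap-allocated, intentionally leaked singleton.
EmptyState* SharedEmptyState() {
  static NoDestructorSketch<EmptyState> empty(1);
  return empty.get();
}
```

Every call to `SharedEmptyState()` returns the same pointer, and the reference count starts at whatever value the caller-side protocol expects (1 in this sketch).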
* Avoid using the non-portable type __m128i_u. (Derek Mauro, 2023-10-26, 2, -7/+7)
| | | | | | | | | | | | | | | | According to https://stackoverflow.com/a/68939636 it is safe to use __m128i instead. https://learn.microsoft.com/en-us/cpp/intrinsics/x86-intrinsics-list?view=msvc-170 also uses this type instead. __m128i_u is just __m128i with a looser alignment requirement, but simply calling _mm_loadu_si128() instead of _mm_load_si128() is enough to tell the compiler when a pointer is unaligned. Fixes #1552 PiperOrigin-RevId: 576931936 Change-Id: I7c3530001149b360c12a1786c7e1832754d0e35c
* Bazel: Enable the header_modules feature (Derek Mauro, 2023-10-11, 1, -0/+1)
| | | | | PiperOrigin-RevId: 572575394 Change-Id: Ic1c5ac2423b1634e50c43bad6daa14e82a8f3e2c
* Bazel: Support layering_check and parse_headers (Derek Mauro, 2023-10-10, 1, -1/+11)
| | | | | | | | | | | | | The layering_check feature ensures that rules that include a header explicitly depend on a rule that exports that header. Compiler support is required, and currently only Clang 16+ supports diagnosing layering_check failures. The parse_headers feature ensures headers are self-contained by compiling them with -fsyntax-only on supported compilers. PiperOrigin-RevId: 572350144 Change-Id: I37297f761566d686d9dd58d318979d688b7e36d1
* Add entries for Neoverse N2, V1, and V2 into CRC dynamic dispatch table. (Connal de Souza, 2023-10-06, 3, -5/+28)
| | | | | PiperOrigin-RevId: 571430428 Change-Id: I4777c37c5287d26a75f37fe059324ac218878f0e
* Optimize CRC32 for Ampere Siryn (Connal de Souza, 2023-09-26, 1, -0/+3)
| | | | | | | Siryn's crc32 instruction seems to have latency 3 and throughput 1, which makes the optimal ratio of pmull and crc streams close to that of tested x86 machines. Up to +120% faster for large inputs. PiperOrigin-RevId: 568645559 Change-Id: I86b85b1b2a5d4fb3680c516c4c9044238b20fe61
* Optimize CRC32 Extend for large inputs on Arm (Connal de Souza, 2023-09-21, 1, -5/+9)
| This is a temporary workaround for an apparent compiler bug with pmull(2) instructions. The current hot loop looks like this:
|
|   mov w14, #0xef02
|   lsl x15, x15, #6
|   mov x13, xzr
|   movk w14, #0x740e, lsl #16
|   sub x15, x15, #0x40
|   ldr q4, [x16, #0x4e0]
| _LOOP_START:
|   add x16, x9, x13
|   add x17, x12, x13
|   fmov d19, x14        <--------- This is loop invariant and expensive
|   add x13, x13, #0x40
|   cmp x15, x13
|   prfm pldl1keep, [x16, #0x140]
|   prfm pldl1keep, [x17, #0x140]
|   ldp x18, x0, [x16, #0x40]
|   crc32cx w10, w10, x18
|   ldp x2, x18, [x16, #0x50]
|   crc32cx w10, w10, x0
|   crc32cx w10, w10, x2
|   ldp x0, x2, [x16, #0x60]
|   crc32cx w10, w10, x18
|   ldp x18, x16, [x16, #0x70]
|   pmull2 v5.1q, v1.2d, v4.2d
|   pmull2 v6.1q, v0.2d, v4.2d
|   pmull2 v7.1q, v2.2d, v4.2d
|   pmull2 v16.1q, v3.2d, v4.2d
|   ldp q17, q18, [x17, #0x40]
|   crc32cx w10, w10, x0
|   pmull v1.1q, v1.1d, v19.1d
|   crc32cx w10, w10, x2
|   pmull v0.1q, v0.1d, v19.1d
|   crc32cx w10, w10, x18
|   pmull v2.1q, v2.1d, v19.1d
|   crc32cx w10, w10, x16
|   pmull v3.1q, v3.1d, v19.1d
|   ldp q20, q21, [x17, #0x60]
|   eor v1.16b, v17.16b, v1.16b
|   eor v0.16b, v18.16b, v0.16b
|   eor v1.16b, v1.16b, v5.16b
|   eor v2.16b, v20.16b, v2.16b
|   eor v0.16b, v0.16b, v6.16b
|   eor v3.16b, v21.16b, v3.16b
|   eor v2.16b, v2.16b, v7.16b
|   eor v3.16b, v3.16b, v16.16b
|   b.ne _LOOP_START
|
| There is a redundant fmov that moves the same constant into a Neon register every loop iteration to be used in the PMULL instructions. The PMULL2 instructions already have this constant loaded into Neon registers. After this change, both the PMULL and PMULL2 instructions use the values in q4, and they are not reloaded every iteration.
|
| This fmov was expensive because it contends for execution units with crc32cx instructions. This is up to 20% faster for large inputs.
|
| PiperOrigin-RevId: 567391972
| Change-Id: I4c8e49750cfa5cc5730c3bb713bd9fd67657804a
* Remove implicit int64_t->uint64_t conversion in ARM version of V128_Extract64 (Abseil Team, 2023-09-15, 1, -1/+1)
| | | | | PiperOrigin-RevId: 565662176 Change-Id: I18d5d9eb444b0090e3f4ab8f66ad214a67344268
* Rename x86 crc_memcpy tests since they cover ARM as well (Abseil Team, 2023-09-07, 1, -11/+12)
| | | | | | | This is a rename only with no other changes. PiperOrigin-RevId: 563428969 Change-Id: Iefc184bf9a233cb72649bc20b8555f6b662cac6d
* Roll forward support for ARM intrinsics in crc_memcpy (Abseil Team, 2023-09-07, 6, -50/+78)
| | | | | | | | | | | | | | | | | | | This CL rolls forward a previous change which we rolled back temporarily due to compilation errors on x86 when PCLMUL intrinsics were unavailable. *** Original change description *** This change replaces inline x86 intrinsics with generic versions that compile for both x86 and ARM depending on the target arch. This change does not enable the accelerated crc memcpy engine on ARM. That will be done in a subsequent change after the optimal number of vector and integer regions for different CPUs is determined. *** PiperOrigin-RevId: 563416413 Change-Id: Iee630a15ed83c26659adb0e8a03d3f3d3a46d688
* Rollback adding support for ARM intrinsics (Abseil Team, 2023-09-05, 6, -75/+47)
| | | | | | In some configurations this change causes compilation errors. We will roll this forward again after those issues are addressed. PiperOrigin-RevId: 562810916 Change-Id: I45b2a8d456273e9eff188f36da8f11323c4dfe66
* Add support for ARM intrinsics in crc_memcpy (Abseil Team, 2023-09-05, 6, -47/+75)
| | | | | | | | | | | | This change replaces inline x86 intrinsics with generic versions that compile for both x86 and ARM depending on the target arch. This change does not enable the accelerated crc memcpy engine on ARM. That will be done in a subsequent change after the optimal number of vector and integer regions for different CPUs is determined. PiperOrigin-RevId: 562785420 Change-Id: I8ba4aa8de17587cedd92532f03767059a481f159
* Fix incorrect CRC returned by AcceleratedCrcMemcpyEngine when kRegions == 1 (Abseil Team, 2023-08-31, 2, -10/+25)
| | | | | | | | | | | | This bug does not currently affect any users, since AcceleratedCrcMemcpyEngine is never configured with a single region. Before this CL, if the number of regions for the AcceleratedCrcMemcpyEngine was set to one, the CRC for the sole region would be incorrectly concatenated onto itself and corrupted. PiperOrigin-RevId: 561663848 Change-Id: Ibfc596306ab07db906d2e3ecf6eea3f6cb9f1b2b
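To illustrate why the single-region case needs care, here is a minimal software sketch of combining per-region CRC32C values. It is not the engine's code: `AdvanceRaw`, `Concat`, `RegionCrc`, and `CombineRegions` are hypothetical names, and the bitwise CRC is a slow reference implementation. The point is that concatenation only applies from the second region onward, so with one region the result must be that region's CRC unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Reflected form of the CRC32C (Castagnoli) polynomial.
constexpr std::uint32_t kCrc32cPolyReflected = 0x82F63B78u;

// Feeds `data` through the raw CRC register, with no initial or final xor.
std::uint32_t AdvanceRaw(std::uint32_t state, const std::string& data) {
  for (unsigned char byte : data) {
    state ^= byte;
    for (int bit = 0; bit < 8; ++bit) {
      state = (state >> 1) ^ ((state & 1u) ? kCrc32cPolyReflected : 0u);
    }
  }
  return state;
}

// Conventional CRC32C: start at ~0 and xor the result with ~0.
std::uint32_t Crc32c(const std::string& data) {
  return AdvanceRaw(0xFFFFFFFFu, data) ^ 0xFFFFFFFFu;
}

// CRC of A||B from crc(A), crc(B) and len(B): shift crc(A) past len(B) zero
// bytes (a multiplication by x**(8*len_b) mod P), then xor in crc(B).
std::uint32_t Concat(std::uint32_t crc_a, std::uint32_t crc_b,
                     std::size_t len_b) {
  return AdvanceRaw(crc_a, std::string(len_b, '\0')) ^ crc_b;
}

struct RegionCrc {
  std::uint32_t crc;
  std::size_t length;
};

// Combines region CRCs in order. A single region is returned as-is; it is
// never concatenated onto itself.
std::uint32_t CombineRegions(const std::vector<RegionCrc>& regions) {
  if (regions.empty()) return 0;  // CRC32C of the empty string
  std::uint32_t crc = regions[0].crc;
  for (std::size_t i = 1; i < regions.size(); ++i) {
    crc = Concat(crc, regions[i].crc, regions[i].length);
  }
  return crc;
}
```

Splitting `"hello world"` into one region or into `"hello"` and `" world"` should give the same combined value as computing the CRC over the whole string.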
* Add CPU detection for Ampere Siryn (Abseil Team, 2023-08-30, 2, -0/+4)
| | | | | PiperOrigin-RevId: 561444259 Change-Id: I205ba9f11f4d41163ce74ae9cfa417fe500ccab3
* Enable non_temporal_store_memcpy for AMD Milan, Genoa, and Ryzen 3000 (Abseil Team, 2023-08-29, 1, -0/+6)
| | | | | PiperOrigin-RevId: 561119886 Change-Id: Ia1483fdb237f4b211068c7ad1f780ab3e6b81eca
* Add CPU detection for AMD Genoa and Ryzen 3000 (Abseil Team, 2023-08-29, 2, -0/+8)
| | | | | PiperOrigin-RevId: 561108037 Change-Id: Idff65e288384cb55ce69f789db2d9374ae781d3d
* Use fallback engine as the non-temporal engine for unknown CPU types (Abseil Team, 2023-08-29, 1, -1/+0)
| | | | | | | | | Using the non-temporal AVX engine for unknown CPU types looks like a mistake to me, and the default built into the switch case is to use the fallback engine. I don't think this is causing issues now, but it might once we add ARM support. PiperOrigin-RevId: 561097994 Change-Id: I7f0edd447017c09acd49e4ea11476e32740d630a
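The dispatch rule described here can be sketched as a switch whose default branch picks the conservative engine. The enum values and the function name below are hypothetical and do not mirror Abseil's CPU table; only the shape of the decision is taken from the message.

```cpp
#include <cassert>

// Hypothetical CPU classification (assumption: not Abseil's real enum).
enum class CpuType { kUnknown, kKnownAvxCpu };

// Hypothetical engine identifiers.
enum class Engine { kFallback, kNonTemporalAvx };

// Only CPU types that are explicitly known to benefit get the non-temporal
// AVX engine; anything unrecognized falls through to the fallback engine.
Engine SelectNonTemporalEngine(CpuType cpu) {
  switch (cpu) {
    case CpuType::kKnownAvxCpu:
      return Engine::kNonTemporalAvx;
    default:
      return Engine::kFallback;
  }
}
```

With this layout, adding a new architecture later cannot accidentally route it to an x86-only code path.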
* Include what you spell (Dmitri Gribenko, 2023-08-08, 2, -4/+5)
| | | | | PiperOrigin-RevId: 554936252 Change-Id: Idb2ffbbc11aa6c98414fdd1ec38873d4687ab5e7
* Implement AbslStringify for crc32c_t in order to support absl::StrFormat natively (Abseil Team, 2023-08-01, 4, -0/+21)
| | | | | | | PiperOrigin-RevId: 552940359 Change-Id: I925764757404c0c9f2a13ed729190d51f4ac46cf
* Changes absl::crc32c_t insertion operator (<<) to return value as 0-padded hex instead of dec (Abseil Team, 2023-08-01, 4, -1/+23)
| | | | | | | PiperOrigin-RevId: 552927211 Change-Id: I0375d60a9df4cdfc694fe8d3b3d790f80fc614a1
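A zero-padded hex insertion operator might look like the sketch below. `Crc32cValue` and `ToHexString` are hypothetical stand-ins rather than `absl::crc32c_t` itself, and the 8-digit lowercase width is an assumption based on the value being 32 bits; the message only says "0-padded hex".

```cpp
#include <cassert>
#include <cstdint>
#include <iomanip>
#include <ios>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical strong typedef around a 32-bit CRC value.
struct Crc32cValue {
  std::uint32_t value;
};

// Prints the CRC as 8 zero-padded hex digits and restores the stream's
// formatting state afterwards so later output is unaffected.
std::ostream& operator<<(std::ostream& os, Crc32cValue crc) {
  const std::ios_base::fmtflags saved_flags = os.flags();
  const char saved_fill = os.fill();
  os << std::hex << std::setw(8) << std::setfill('0') << crc.value;
  os.flags(saved_flags);
  os.fill(saved_fill);
  return os;
}

// Convenience helper for the example.
std::string ToHexString(Crc32cValue crc) {
  std::ostringstream out;
  out << crc;
  return out.str();
}
```

For example, `ToHexString(Crc32cValue{42u})` yields `"0000002a"`, where decimal output would have printed `42`.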
* Remove deprecated function. (Abseil Team, 2023-07-31, 1, -0/+1)
| | | | | PiperOrigin-RevId: 552638642 Change-Id: I6b43289ca10ee9aecd6b848e78471863b22b01d1
* Removes unused methods CRC::Empty() and CRC::Concat() from the internal implementation. (Abseil Team, 2023-06-12, 3, -41/+0)
| | | | | | | PiperOrigin-RevId: 539749773 Change-Id: Iec83431ffd360a077b153cea00427580ae287d1f
* Add a declaration for __cpuid for the IntelLLVM compiler. (niranjan-nilakantan, 2023-05-23, 1, -8/+14)
| | | | | | | | | | | | Imported from GitHub PR https://github.com/abseil/abseil-cpp/pull/1452 __cpuid is declared in intrin.h, but is excluded on non-Windows platforms. We add this declaration to compensate. Fixes #1358 PiperOrigin-RevId: 534449804 Change-Id: I91027f79d8d52c4da428d5c3a53e2cec00825c13
* Rollback of add a declaration for __cpuid for the IntelLLVM compiler. (Abseil Team, 2023-05-22, 1, -7/+3)
| | | | | PiperOrigin-RevId: 534213948 Change-Id: I56b897060b9afe9d3d338756c80e52f421653b55
* Merge pull request #1452 from niranjan-nilakantan:niranjan-nilakantan/issue1358 (Copybara-Service, 2023-05-22, 1, -3/+7)
| | | | | PiperOrigin-RevId: 534179290 Change-Id: I9ad24518cc6a336fbaf602269fb01319491c8b60
* Fix spelling mistakes (Vertexwahn, 2023-05-02, 4, -4/+4)
|
* Merge pull request #1434 from Vertexwahn:fix-spelling (Copybara-Service, 2023-04-25, 4, -6/+6)
|\ | | | | | | | | PiperOrigin-RevId: 527066823 Change-Id: Ifa1e9a43c7490b34f9f4dbfa12d3acbed6b49777
| * Fix some spelling mistakes (Vertexwahn, 2023-04-24, 4, -4/+4)
|/
* Workaround for MSVC warning that designated initializers are a C++20 feature (Derek Mauro, 2023-03-15, 1, -14/+14)
| | | | | | | | | | | | https://google.github.io/styleguide/cppguide.html#Designated_initializers recommends using designated initializers as does https://abseil.io/tips/172, but apparently they are a non-standard extension prior to C++20. For maximum compatibility, avoid using them here. Fixes #1413 PiperOrigin-RevId: 516892890 Change-Id: Id7b7857891e39eb52132c3edf70e5bf4973755af
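One portable alternative to designated initializers is positional aggregate initialization with argument comments, sketched below. The struct and its fields are hypothetical, and the commit message does not say that this exact form was used; it only says designated initializers were avoided.

```cpp
#include <cassert>

// Hypothetical aggregate; field names are for illustration only.
struct RegionCounts {
  int vector_regions;
  int integer_regions;
};

// C++20 (a compiler extension in earlier language modes):
//   RegionCounts{.vector_regions = 3, .integer_regions = 1}
//
// Portable before C++20: positional initialization, with comments naming
// the fields so the call site stays readable.
constexpr RegionCounts kExampleCounts = {/*vector_regions=*/3,
                                         /*integer_regions=*/1};
```

The trade-off is that positional initialization silently depends on field order, so the comments need to be kept in sync with the struct definition.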
* Use const and static for member functions (Rose, 2023-03-07, 2, -5/+5)
| | | | This shows that these are member functions that do not modify a class's data.
* Merge pull request #1394 from AtariDreams:constructors (Copybara-Service, 2023-02-21, 1, -3/+3)
|\ | | | | | | | | PiperOrigin-RevId: 511271203 Change-Id: I1ed352e06265b705b62d401a50b4699d01f7f1d7
| * Convert empty constructors to default ones (Rose, 2023-02-17, 1, -3/+3)
| | | | | | | | These make the changed constructors match closer to the other ones that are default.
* | Prefer emplace_back over push_back where emplace_back is more appropriate (Rose, 2023-02-16, 1, -1/+1)
|/ | | | This also helps a lot with dealing with conversions and data structure creation under the hood.
* Don't assume that AVX implies PCLMULQDQ when using LLVM on Windows. (Saran Tunyasuvunakool, 2023-02-07, 1, -1/+1)
| | | | | PiperOrigin-RevId: 507790741 Change-Id: I347357f9a2d698510f29b7d1b065ef73f9289292
* Replace absl::base_internal::Prefetch* calls with absl::Prefetch* calls (Martijn Vels, 2023-01-27, 3, -14/+12)
| | | | | PiperOrigin-RevId: 505184961 Change-Id: I64482558a76abda6896bec4b2d323833b6cd7edf
* Optimize RemoveCrc32cSuffix. (Abseil Team, 2023-01-17, 1, -6/+4)
| The implementation can be optimized to avoid performing an ExtendByZero operation.
|
| `RemoveCrc32cSuffix` can simply be implemented as
|
|   uint32_t result = static_cast<uint32_t>(full_string_crc) ^
|                     static_cast<uint32_t>(suffix_crc);
|   CrcEngine()->UnextendByZeroes(&result, suffix_len);
|   return crc32c_t{result};
|
| Math proof that this change is correct:
|
| `ComputeCrc32c` actually computes the following:
|
|   ConditionedCRC(data) = UnconditionedCRC(data) + StartValue(data) + ~0
|
| with:
|
|   StartValue(data) = ~0 * x**BitLength(data) mod P
|
| (with `+` being a carry-less add, i.e. an xor).
|
| `UnconditionedCRC` in the context of this description means: no initial or final xor with ~0 and a starting value of zero, i.e. the result that `CrcEngine()->Extend` would give you with a starting value of 0.
|
| Given `full_string_crc` and `suffix_crc` (both conditioned CRCs), xoring them together results in:
|
|   (1) full_string_crc + suffix_crc =
|         UnconditionedCRC(full_string) + StartValue(full_string) + ~0
|       + UnconditionedCRC(suffix) + StartValue(suffix) + ~0
|
| Since `+` is carry-less addition (i.e. an XOR), the two ~0 cancel each other out:
|
|   (2) full_string_crc + suffix_crc =
|         UnconditionedCRC(full_string) + StartValue(full_string)
|       + UnconditionedCRC(suffix) + StartValue(suffix)
|
| We can make use of the fact that:
|
|   (3) UnconditionedCRC(full_string) + UnconditionedCRC(suffix) =
|         UnconditionedCRC(full_string_with_suffix_replaced_by_zeros)
|
| I.e., UnconditionedCRC("AABBB") + UnconditionedCRC("BBB") = UnconditionedCRC("AA\0\0\0").
|
| Putting (3) into (2) yields:
|
|   (4) full_string_crc + suffix_crc =
|         UnconditionedCRC(full_string_with_suffix_replaced_by_zeros)
|       + StartValue(full_string) + StartValue(suffix)
|
| Using:
|
|   (5) UnconditionedCRC(full_string_with_suffix_replaced_by_zeros) =
|         UnconditionedCRC(full_string_without_suffix) * x**BitLength(suffix) mod P
|
| and putting (5) into (4):
|
|   (6) full_string_crc + suffix_crc =
|         UnconditionedCRC(full_string_without_suffix) * x**BitLength(suffix) mod P
|       + StartValue(full_string) + StartValue(suffix)
|
| Using
|
|   (7) StartValue(full_string) = ~0 * x**BitLength(full_string) mod P
|
| and
|
|   (8) StartValue(suffix) = ~0 * x**BitLength(suffix) mod P
|
| Putting (7) and (8) in (6):
|
|   (9) full_string_crc + suffix_crc =
|         UnconditionedCRC(full_string_without_suffix) * x**BitLength(suffix) mod P
|       + ~0 * x**BitLength(full_string) mod P
|       + ~0 * x**BitLength(suffix) mod P
|
| Using:
|
|   (10) BitLength(full_string) = BitLength(full_string_without_suffix) + BitLength(suffix)
|
| and putting (10) in (9):
|
|   (11) full_string_crc + suffix_crc =
|          UnconditionedCRC(full_string_without_suffix) * x**BitLength(suffix) mod P
|        + ~0 * x**(BitLength(full_string_without_suffix) + BitLength(suffix)) mod P
|        + ~0 * x**BitLength(suffix) mod P
|
| Using x**(A+B) = x**A * x**B results in:
|
|   (12) full_string_crc + suffix_crc =
|          UnconditionedCRC(full_string_without_suffix) * x**BitLength(suffix) mod P
|        + [ ~0 * x**BitLength(full_string_without_suffix) * x**BitLength(suffix) ] mod P
|        + ~0 * x**BitLength(suffix) mod P
|
| Using A mod P + B mod P + C mod P = (A + B + C) mod P (this works in carry-less arithmetic):
|
|   (13) full_string_crc + suffix_crc = [
|          UnconditionedCRC(full_string_without_suffix) * x**BitLength(suffix)
|        + ~0 * x**BitLength(full_string_without_suffix) * x**BitLength(suffix)
|        + ~0 * x**BitLength(suffix) ] mod P
|
| Factor out x**BitLength(suffix):
|
|   (14) full_string_crc + suffix_crc = [ x**BitLength(suffix) * [
|          UnconditionedCRC(full_string_without_suffix)
|        + ~0 * x**BitLength(full_string_without_suffix)
|        + ~0 ] ] mod P
|
| Using:
|
|   (15) ConditionedCRC(full_string_without_suffix) =
|          [ UnconditionedCRC(full_string_without_suffix)
|          + ~0 * x**BitLength(full_string_without_suffix) ] mod P + ~0
|        = [ UnconditionedCRC(full_string_without_suffix)
|          + ~0 * x**BitLength(full_string_without_suffix) + ~0 ] mod P
|
| (~0 is less than x**32, so ~0 mod P = ~0)
|
| Putting (15) in (14) results in:
|
|   full_string_crc + suffix_crc =
|     [ x**BitLength(suffix) * ConditionedCRC(full_string_without_suffix) ] mod P
|
| Or:
|
|   (16) ConditionedCRC(full_string_without_suffix) =
|          (full_string_crc + suffix_crc) * x**(-BitLength(suffix)) mod P
|
| A multiplication by x**(-8*bytelength) mod P is implemented by `CrcEngine()->UnextendByZeroes`.
|
| PiperOrigin-RevId: 502659140
| Change-Id: I66b0700d258f948be0885f691370b73d7fad56e3
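Identities (3) and (16) can be checked numerically with a slow bitwise CRC32C. The sketch below is a reference model, not Abseil's engine: `AdvanceRaw`, `UnextendByZeroBits`, and `RemoveSuffix` are hypothetical names, and the un-extension is done one bit at a time instead of with the table-driven method the real `UnextendByZeroes` would need for speed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Reflected form of the CRC32C (Castagnoli) polynomial P.
constexpr std::uint32_t kPolyReflected = 0x82F63B78u;

// UnconditionedCRC when called with state 0: no initial or final xor.
std::uint32_t AdvanceRaw(std::uint32_t state, const std::string& data) {
  for (unsigned char byte : data) {
    state ^= byte;
    for (int bit = 0; bit < 8; ++bit) {
      state = (state >> 1) ^ ((state & 1u) ? kPolyReflected : 0u);
    }
  }
  return state;
}

// ConditionedCRC: start at ~0 and xor the result with ~0.
std::uint32_t Crc32c(const std::string& data) {
  return AdvanceRaw(0xFFFFFFFFu, data) ^ 0xFFFFFFFFu;
}

// Multiplies by x**(-bits) mod P by undoing one zero-bit step at a time.
// A forward step sets the top bit exactly when it xored in the polynomial,
// so the top bit tells us which branch to invert.
std::uint32_t UnextendByZeroBits(std::uint32_t state, std::size_t bits) {
  for (std::size_t i = 0; i < bits; ++i) {
    if (state & 0x80000000u) {
      state = ((state ^ kPolyReflected) << 1) | 1u;
    } else {
      state <<= 1;
    }
  }
  return state;
}

// Equation (16): xor the two conditioned CRCs, then divide by
// x**BitLength(suffix).
std::uint32_t RemoveSuffix(std::uint32_t full_crc, std::uint32_t suffix_crc,
                           std::size_t suffix_len_bytes) {
  return UnextendByZeroBits(full_crc ^ suffix_crc, 8 * suffix_len_bytes);
}
```

For instance, `RemoveSuffix(Crc32c("hello world"), Crc32c(" world"), 6)` equals `Crc32c("hello")`, and `AdvanceRaw(0, "AABBB") ^ AdvanceRaw(0, "BBB")` equals `AdvanceRaw(0, std::string("AA\0\0\0", 5))`, matching identity (3).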
* Don't use Arm vector intrinsics when compiling with CUDA in device mode. (Abseil Team, 2023-01-11, 1, -1/+1)
| | | | | PiperOrigin-RevId: 501464530 Change-Id: I5a0929a2b88c1c158b1696634a65ffda9c4b8590
* Require 64-bit builds on x86 to use AcceleratedCrcMemcpyEngine (Derek Mauro, 2023-01-05, 3, -4/+11)
| | | | | | | | This also ensures that there is only one definition of GetArchSpecificEngines by moving the condition to a common place. PiperOrigin-RevId: 500038304 Change-Id: If0c55d701dfdc11a1a9c8c1b34eb220435529ffb
* Require 64-bit builds on x86 to use CRC32 hardware acceleration (Derek Mauro, 2023-01-04, 1, -2/+4)
| | | | | | | | 32-bit builds with SSE 4.2 do exist, and these builds do not work without this patch. PiperOrigin-RevId: 499498979 Change-Id: I0ade09068804655652c07d0f1ef13554464a1558
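A guard of this kind can be sketched as follows. This is not Abseil's actual condition: the macro `SKETCH_HAVE_CRC32_U64` and both function names are hypothetical, and the check uses the GCC/Clang predefined macros `__x86_64__` and `__SSE4_2__` as an assumption (MSVC spells these differently). The 64-bit `_mm_crc32_u64` intrinsic is only used when the target is 64-bit x86 with SSE 4.2 enabled; otherwise a bitwise software path is taken.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: GCC/Clang-style feature macros. SSE 4.2 alone is not enough,
// because _mm_crc32_u64 does not exist in 32-bit mode.
#if defined(__x86_64__) && defined(__SSE4_2__)
#include <nmmintrin.h>
#define SKETCH_HAVE_CRC32_U64 1
#else
#define SKETCH_HAVE_CRC32_U64 0
#endif

// Software model of one 64-bit CRC32C step: folds the 8 bytes of `v`,
// least significant byte first, into the raw CRC register.
std::uint32_t SoftwareCrc32cU64(std::uint32_t crc, std::uint64_t v) {
  for (int i = 0; i < 8; ++i) {
    crc ^= static_cast<std::uint32_t>(v & 0xFFu);
    v >>= 8;
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc >> 1) ^ ((crc & 1u) ? 0x82F63B78u : 0u);
    }
  }
  return crc;
}

// Uses the hardware instruction only on 64-bit x86 builds with SSE 4.2.
std::uint32_t Crc32cU64(std::uint32_t crc, std::uint64_t v) {
#if SKETCH_HAVE_CRC32_U64
  return static_cast<std::uint32_t>(_mm_crc32_u64(crc, v));
#else
  return SoftwareCrc32cU64(crc, v);
#endif
}
```

Both paths should return the same value for the same inputs, so callers do not need to know which one was compiled in.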
* Add prefetch to crc32 (Ilya Tokar, 2022-12-13, 4, -1/+31)
| We already prefetch for large inputs; do the same for medium-sized inputs as well. This is neutral for performance in most cases, so this change also adds a new benchmark with working-set size >> cache size to ensure that we are seeing the performance benefits of prefetching. The main benefits are on AMD with hardware prefetchers turned off.
|
| AMD prefetchers on:
|
|   name  old time/op  new time/op  delta
|   BM_Calculate/0  2.43ns ± 1%  2.43ns ± 1%  ~ (p=0.814 n=40+40)
|   BM_Calculate/1  2.50ns ± 2%  2.50ns ± 2%  ~ (p=0.745 n=39+39)
|   BM_Calculate/100  9.17ns ± 1%  9.17ns ± 2%  ~ (p=0.747 n=40+40)
|   BM_Calculate/10000  474ns ± 1%  474ns ± 2%  ~ (p=0.749 n=40+40)
|   BM_Calculate/500000  22.8µs ± 1%  22.9µs ± 2%  ~ (p=0.298 n=39+40)
|   BM_Extend/0  1.38ns ± 1%  1.38ns ± 1%  ~ (p=0.651 n=40+40)
|   BM_Extend/1  1.53ns ± 2%  1.53ns ± 1%  ~ (p=0.957 n=40+39)
|   BM_Extend/100  9.48ns ± 1%  9.48ns ± 2%  ~ (p=1.000 n=40+40)
|   BM_Extend/10000  474ns ± 2%  474ns ± 1%  ~ (p=0.928 n=40+40)
|   BM_Extend/500000  22.8µs ± 1%  22.9µs ± 2%  ~ (p=0.331 n=40+40)
|   BM_Extend/100000000  4.79ms ± 1%  4.79ms ± 1%  ~ (p=0.753 n=38+38)
|   BM_ExtendCacheMiss/10  25.5ms ± 2%  25.5ms ± 2%  ~ (p=0.988 n=38+40)
|   BM_ExtendCacheMiss/100  23.1ms ± 2%  23.1ms ± 2%  ~ (p=0.792 n=40+40)
|   BM_ExtendCacheMiss/1000  37.2ms ± 1%  28.6ms ± 2%  -23.00% (p=0.000 n=38+40)
|   BM_ExtendCacheMiss/100000  7.77ms ± 2%  7.74ms ± 2%  -0.45% (p=0.006 n=40+40)
|
| AMD prefetchers off:
|
|   name  old time/op  new time/op  delta
|   BM_Calculate/0  2.43ns ± 2%  2.43ns ± 2%  ~ (p=0.351 n=40+39)
|   BM_Calculate/1  2.51ns ± 2%  2.51ns ± 1%  ~ (p=0.535 n=40+40)
|   BM_Calculate/100  9.18ns ± 2%  9.15ns ± 2%  ~ (p=0.120 n=38+39)
|   BM_Calculate/10000  475ns ± 2%  475ns ± 2%  ~ (p=0.852 n=40+40)
|   BM_Calculate/500000  22.9µs ± 2%  22.8µs ± 2%  ~ (p=0.396 n=40+40)
|   BM_Extend/0  1.38ns ± 2%  1.38ns ± 2%  ~ (p=0.466 n=40+40)
|   BM_Extend/1  1.53ns ± 2%  1.53ns ± 2%  ~ (p=0.914 n=40+39)
|   BM_Extend/100  9.49ns ± 2%  9.49ns ± 2%  ~ (p=0.802 n=40+40)
|   BM_Extend/10000  475ns ± 2%  474ns ± 1%  ~ (p=0.589 n=40+40)
|   BM_Extend/500000  22.8µs ± 2%  22.8µs ± 2%  ~ (p=0.872 n=39+40)
|   BM_Extend/100000000  10.0ms ± 3%  10.0ms ± 4%  ~ (p=0.355 n=40+40)
|   BM_ExtendCacheMiss/10  196ms ± 2%  196ms ± 2%  ~ (p=0.698 n=40+40)
|   BM_ExtendCacheMiss/100  129ms ± 1%  129ms ± 1%  ~ (p=0.602 n=36+37)
|   BM_ExtendCacheMiss/1000  88.6ms ± 1%  57.2ms ± 1%  -35.49% (p=0.000 n=36+38)
|   BM_ExtendCacheMiss/100000  14.9ms ± 1%  14.9ms ± 1%  ~ (p=0.888 n=39+40)
|
| Intel Skylake:
|
|   name  old time/op  new time/op  delta
|   BM_Calculate/0  2.49ns ± 2%  2.44ns ± 4%  -2.15% (p=0.001 n=31+34)
|   BM_Calculate/1  3.04ns ± 2%  2.98ns ± 9%  -1.95% (p=0.003 n=31+35)
|   BM_Calculate/100  8.64ns ± 3%  8.53ns ± 5%  ~ (p=0.065 n=31+35)
|   BM_Calculate/10000  290ns ± 3%  285ns ± 7%  -1.80% (p=0.004 n=28+34)
|   BM_Calculate/500000  11.8µs ± 2%  11.6µs ± 8%  -1.59% (p=0.003 n=26+34)
|   BM_Extend/0  1.56ns ± 1%  1.52ns ± 3%  -2.44% (p=0.000 n=26+35)
|   BM_Extend/1  1.88ns ± 3%  1.83ns ± 6%  -2.17% (p=0.001 n=27+35)
|   BM_Extend/100  9.31ns ± 3%  9.13ns ± 7%  -1.92% (p=0.000 n=33+38)
|   BM_Extend/10000  290ns ± 3%  283ns ± 3%  -2.45% (p=0.000 n=32+38)
|   BM_Extend/500000  11.8µs ± 2%  11.5µs ± 8%  -1.80% (p=0.001 n=35+37)
|   BM_Extend/100000000  6.39ms ±10%  6.11ms ± 8%  -4.34% (p=0.000 n=40+40)
|   BM_ExtendCacheMiss/10  36.2ms ± 7%  35.8ms ±14%  ~ (p=0.281 n=33+37)
|   BM_ExtendCacheMiss/100  26.9ms ±15%  25.9ms ±12%  -3.93% (p=0.000 n=40+40)
|   BM_ExtendCacheMiss/1000  23.8ms ± 5%  23.4ms ± 5%  -1.68% (p=0.001 n=39+40)
|   BM_ExtendCacheMiss/100000  10.1ms ± 5%  10.0ms ± 4%  ~ (p=0.051 n=39+39)
|
| PiperOrigin-RevId: 495119444
| Change-Id: I67bcf3b0282b5e1c43122de2837a24c16b8aded7